
Voice Capabilities

Overview

IB-X Conversational Agents can operate as both chat agents and voice agents.

Voice capabilities allow users to interact with Conversational Agents using natural speech rather than text-based messages. The platform converts spoken audio into text, processes the request using the Conversational Agent, and converts the generated response back into speech.

This enables natural and interactive voice experiences for customer support, employee assistance, information retrieval, and business process automation.


Voice Interaction Flow

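A typical voice interaction follows the sequence below:

  1. The user speaks to the Conversational Agent.
  2. The platform converts the spoken audio into text.
  3. The Conversational Agent processes the request and generates a response.
  4. The platform converts the response into speech and delivers it back to the user.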
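The flow described in the Overview (speech to text, agent processing, text to speech) can be sketched as a three-stage pipeline. This is a minimal illustration, not the IB-X implementation: `handle_voice_turn`, `VoiceTurn`, and the `fake_*` stages are hypothetical names, and the stand-in stages only pass text through so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str      # text produced by the Speech-to-Text stage
    reply_text: str      # text generated by the agent stage
    reply_audio: bytes   # audio produced by the Text-to-Speech stage


def handle_voice_turn(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    agent: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> VoiceTurn:
    """Run one user utterance through the three stages in order."""
    transcript = speech_to_text(audio)
    reply_text = agent(transcript)
    reply_audio = text_to_speech(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Hypothetical stand-in stages. Real stages would call a speech provider
# and the Conversational Agent instead of echoing text.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = handle_voice_turn(b"hello", fake_stt, fake_agent, fake_tts)
print(turn.reply_text)
```

Passing the stages in as callables mirrors the idea that the speech provider can change without changing the agent logic.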


Enabling Voice

Voice capabilities are configured through the conversational trigger.

The trigger allows voice functionality to be enabled alongside traditional chat interactions.

Once enabled, users can interact with the Conversational Agent using either:

  • Text conversations
  • Voice conversations

The available options depend on the configured channel and client experience.


Voice Architecture

Voice-enabled conversations combine multiple platform services that work together to provide natural, low-latency voice interactions.

A voice conversation involves the following components:

Conversational Agent

Responsible for:

  • Understanding requests
  • Maintaining conversation context
  • Invoking tools
  • Retrieving knowledge
  • Generating responses

Voice Activity Detection (VAD)

Detects speech activity within incoming audio and determines which audio should be processed by Speech-to-Text services.
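As a rough illustration of the idea, the sketch below flags audio frames as speech or silence by comparing their energy to a threshold. This is an assumption for teaching purposes only: the document does not describe how IB-X performs VAD, production systems commonly use trained models, and the function names and the threshold of 500 are hypothetical.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(frames: List[Sequence[int]], threshold: float = 500.0) -> List[bool]:
    """Flag each frame as speech (True) or silence (False) by its energy."""
    return [rms(frame) >= threshold for frame in frames]


# Two hypothetical frames of 16-bit samples: background noise and speech.
quiet = [10, -12, 8, -9]
loud = [4000, -3500, 3800, -4200]

flags = speech_frames([quiet, loud, loud, quiet])
print(flags)
```

Only the frames flagged `True` would be forwarded for transcription in this simplified model.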

Speech-to-Text (STT)

Converts spoken audio into text that can be processed by the Conversational Agent.

Turn Detection

Determines when a user has completed their speaking turn and when the transcript should be submitted to the Conversational Agent.
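One common way to approximate turn detection is to end a turn after a fixed amount of trailing silence. The sketch below assumes per-frame speech flags are already available; `detect_turns`, the 20 ms frame size, and the 600 ms silence window are hypothetical choices, not documented IB-X settings.

```python
from typing import Iterable, List, Tuple


def detect_turns(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    end_silence_ms: int = 600,
) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) for each detected speaking turn.

    A turn ends once enough consecutive silent frames follow speech.
    """
    needed = max(1, end_silence_ms // frame_ms)
    turns: List[Tuple[int, int]] = []
    start = None        # index of the first speech frame in the current turn
    last_speech = None  # index of the most recent speech frame
    silent_run = 0      # consecutive silent frames since last_speech

    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= needed:
                turns.append((start, last_speech))
                start = None
                silent_run = 0

    if start is not None:
        # The audio ended while the user was still mid-turn.
        turns.append((start, last_speech))
    return turns


flags = [False, True, True, False, True, False, False, False, True]
turns = detect_turns(flags, frame_ms=20, end_silence_ms=60)
print(turns)
```

A shorter silence window makes the agent respond sooner but risks cutting off users who pause mid-sentence, which is the trade-off this kind of setting usually controls.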

Text-to-Speech (TTS)

Converts generated responses into spoken audio delivered back to the user.

Real-Time Transport

Provides low-latency communication between the client and the conversational platform during voice interactions.

IB-X currently supports:

  • WebSocket
  • Small WebRTC

Additional transport options may be introduced in future releases.

Barge-In

Allows users to interrupt the assistant while it is speaking, creating more natural conversational experiences.
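The behavior can be sketched as a race between response playback and a "user started speaking" signal: whichever finishes first wins, and playback is cancelled if the user speaks. This is an illustrative model using Python's `asyncio`, not the IB-X implementation; `speak`, `speak_with_barge_in`, and the timings are hypothetical.

```python
import asyncio
from typing import List, Tuple


async def speak(chunks: List[str], played: List[str], chunk_seconds: float = 0.05) -> None:
    """Simulate playback by 'playing' one response chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stands in for audio playback time
        played.append(chunk)


async def speak_with_barge_in(
    chunks: List[str], user_spoke: asyncio.Event, played: List[str]
) -> bool:
    """Play chunks until finished or interrupted. Returns True if interrupted."""
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(user_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and playback not in done
    # Stop whichever task is still running, then wait for the cancellation.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return interrupted


async def demo() -> Tuple[bool, List[str]]:
    played: List[str] = []
    user_spoke = asyncio.Event()
    # Simulate the user starting to talk partway through the response.
    asyncio.get_running_loop().call_later(0.12, user_spoke.set)
    interrupted = await speak_with_barge_in(
        ["a", "b", "c", "d", "e"], user_spoke, played
    )
    return interrupted, played


interrupted, played = asyncio.run(demo())
print(interrupted, played)
```

In a real voice agent the interrupt signal would typically come from speech detection on the incoming audio stream.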


Speech Providers

IB-X integrates with speech providers for speech recognition and speech synthesis.

Speech providers are responsible for:

  • Converting spoken audio into text (Speech-to-Text)
  • Converting generated responses into speech (Text-to-Speech)

Depending on the selected provider, different models, voices, languages, and capabilities may be available.

Supported Providers and Models

IB-X supports multiple speech providers for Speech-to-Text (STT) and Text-to-Speech (TTS).

Speech providers can be accessed using either:

  • Customer-managed provider credentials (Bring Your Own Key)
  • IB-X Integration Gateway using IB-X Currency

The available providers and models depend on the selected provider and deployment configuration.

Provider | Speech-to-Text (STT)        | Text-to-Speech (TTS) | Notes
Deepgram | Base, Nova, Nova-2, Polaris | Aura, Aura-2         | Currently supported provider.

Additional providers and models may be introduced in future releases.

Provider Access Models

Bring Your Own Key (BYOK)

Organizations can configure their own provider accounts and credentials. In this model, all provider usage is billed directly by the provider to the organization.

Integration Gateway

Organizations can access supported providers through the IB-X Integration Gateway without managing provider-specific accounts or credentials. Usage is billed using IB-X Currency.


Speech-to-Text

Speech-to-Text providers convert spoken user input into text that can be processed by the Conversational Agent.

Typical provider configuration includes:

  • Provider selection
  • Connection selection
  • Language selection
  • Use case selection
  • Model selection

These settings influence transcription quality, language support, latency, and recognition behavior.

Transcript generation, endpointing, and transcription tuning are covered in separate documentation.


Text-to-Speech

Text-to-Speech providers generate spoken responses for the Conversational Agent.

Typical provider configuration includes:

  • Provider selection
  • Connection selection
  • Model selection
  • Voice selection

Organizations can choose voices that align with their branding, audience, and conversational requirements.

Response segmentation, speech streaming, and speech generation behavior are covered in separate documentation.


Real-Time Communication

Voice interactions require low-latency communication between the client and the conversational platform.

To support different deployment scenarios and conversational requirements, IB-X provides multiple transport options for real-time voice communication.

These transports enable:

  • Real-time messaging
  • Audio streaming
  • Voice interactions
  • Conversation event delivery
  • Low-latency speech processing

The appropriate transport depends on factors such as network conditions, browser capabilities, deployment architecture, and scalability requirements.

IB-X currently supports the following transport options for real-time voice interactions. Additional transports may be introduced in future releases.

WebSocket

WebSocket provides a simple and widely supported transport mechanism for voice interactions.

In this mode, audio, transcripts, conversation events, and agent responses are exchanged over a persistent WebSocket connection between the client and the conversational platform.
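To illustrate how audio and events can share one persistent connection, the sketch below wraps each message in a typed JSON envelope. The wire format used by IB-X is not described in this document, so the message shape, the field names (`type`, `seq`, `data`), and the helper functions are all hypothetical.

```python
import base64
import json
from typing import Any, Dict


def encode_audio(seq: int, pcm: bytes) -> str:
    """Wrap a chunk of raw audio in a JSON envelope (bytes are base64-encoded)."""
    return json.dumps(
        {"type": "audio", "seq": seq, "data": base64.b64encode(pcm).decode("ascii")}
    )


def encode_event(kind: str, **fields: Any) -> str:
    """Wrap a non-audio message such as a transcript or an agent response."""
    return json.dumps({"type": kind, **fields})


def decode(raw: str) -> Dict[str, Any]:
    """Parse one message and restore the audio bytes when present."""
    message = json.loads(raw)
    if message["type"] == "audio":
        message["data"] = base64.b64decode(message["data"])
    return message


audio_message = decode(encode_audio(1, b"\x00\x01\x02"))
transcript_message = decode(encode_event("transcript", text="hello"))
print(audio_message["type"], transcript_message["text"])
```

The receiver dispatches on the `type` field, which is what lets a single connection carry audio, transcripts, conversation events, and agent responses side by side.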

WebSocket transport is suitable for:

  • Simpler deployments
  • Environments where WebRTC is not required
  • Browser-based and embedded chat experiences
  • Voice interactions that do not require peer-to-peer media capabilities

Small WebRTC

Small WebRTC uses WebRTC-based audio streaming to provide lower-latency voice communication.

This mode is optimized for conversational voice experiences where responsiveness and audio quality are important.

Small WebRTC is suitable for:

  • Real-time voice agents
  • Interactive conversational experiences
  • Scenarios that need reduced audio latency
  • Scenarios that prioritize speech quality and user experience

Future Transport Modes

Daily WebRTC (Work in Progress)

Daily WebRTC integration is currently under development and is not yet generally available.

Once released, it will provide an additional WebRTC-based transport option built on the Daily platform, enabling advanced real-time voice communication scenarios and expanded media capabilities.


The selected transport affects how audio, events, and conversation data are exchanged between the client and the Conversational Agent, but does not change the underlying conversational capabilities such as:

  • Speech-to-Text
  • Text-to-Speech
  • Knowledge Grounding
  • Tool Execution
  • Memory Management
  • Barge-In
  • Turn Detection

Voice Interaction Features

IB-X provides several capabilities that enable natural and responsive voice conversations.

These capabilities control how speech is detected, transcribed, segmented, processed, and how interruptions are handled during voice interactions.

Note: These advanced voice interaction settings are currently configured through the appsettings.json configuration file. Future releases will make these settings available through the user interface, simplifying configuration and administration.


Voice and Knowledge Grounding

Voice-enabled Conversational Agents can access the same enterprise knowledge available to chat-based agents.

The interaction method changes, but the underlying capabilities remain the same.

Voice agents can:

  • Retrieve enterprise knowledge
  • Invoke tools
  • Execute workflows
  • Access integrations
  • Maintain conversation memory

Voice and Tool Execution

Voice conversations fully support tool execution.

For example, a user may ask:

"What is the status of ticket 12345?"

The Conversational Agent can:

  1. Invoke a ticket lookup tool.
  2. Retrieve the information.
  3. Generate a response.
  4. Speak the response back to the user.

This allows voice experiences to participate in enterprise business processes in the same way as chat-based interactions.
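The ticket example can be sketched end to end as follows. This is a simplified stand-in: the ticket store, `lookup_ticket`, `answer`, and `speak` are hypothetical, and a regular expression replaces the agent's actual request understanding so the example stays runnable.

```python
import re
from typing import Dict, Optional

# Hypothetical in-memory ticket store standing in for a real ticketing system.
TICKETS: Dict[str, str] = {"12345": "In Progress"}


def lookup_ticket(ticket_id: str) -> Optional[str]:
    """Hypothetical ticket lookup tool: return the status, or None if unknown."""
    return TICKETS.get(ticket_id)


def answer(transcript: str) -> str:
    """Invoke the tool for a transcribed request and generate a response."""
    match = re.search(r"ticket\s+(\d+)", transcript, re.IGNORECASE)
    if not match:
        return "Which ticket would you like me to check?"
    ticket_id = match.group(1)
    status = lookup_ticket(ticket_id)       # steps 1 and 2: invoke and retrieve
    if status is None:
        return f"I could not find ticket {ticket_id}."
    return f"Ticket {ticket_id} is currently {status}."  # step 3: respond


def speak(text: str) -> bytes:
    """Stand-in for Text-to-Speech; a real system would return audio."""
    return text.encode("utf-8")             # step 4: speak the response


reply = answer("What is the status of ticket 12345?")
audio = speak(reply)
print(reply)
```

The same `answer` function would serve a chat request unchanged; only the final `speak` step is specific to voice.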


Best Practices

  • Select voices that align with organizational branding.
  • Use clear and focused agent instructions.
  • Validate speech recognition accuracy for supported languages.
  • Test voice interactions in realistic environments.
  • Keep spoken responses concise where appropriate.
  • Validate tool execution paths through voice conversations.
  • Continuously review and refine the conversational experience.