Voice Capabilities
Overview
IB-X Conversational Agents can operate as both chat agents and voice agents.
Voice capabilities allow users to interact with Conversational Agents using natural speech rather than text-based messages. The platform converts spoken audio into text, processes the request using the Conversational Agent, and converts the generated response back into speech.
This enables natural and interactive voice experiences for customer support, employee assistance, information retrieval, and business process automation.
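The round trip described above (speech to text, agent processing, text back to speech) can be sketched in a few lines. This is a minimal illustration only: `run_voice_turn`, `VoiceTurn`, and the three `fake_*` stand-ins are hypothetical names invented for this sketch and are not part of the IB-X API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """Result of one spoken request/response round trip (hypothetical type)."""
    transcript: str      # text produced by Speech-to-Text
    reply_text: str      # text produced by the Conversational Agent
    reply_audio: bytes   # audio produced by Text-to-Speech


def run_voice_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    agent: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> VoiceTurn:
    """Chain the three stages: audio -> text -> agent reply -> audio."""
    transcript = speech_to_text(audio_in)
    reply_text = agent(transcript)
    reply_audio = text_to_speech(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-ins so the sketch runs without any speech provider (assumptions).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_voice_turn(b"hello", fake_stt, fake_agent, fake_tts)
```

In a real deployment each callable would wrap a speech provider or the agent runtime; passing them in as parameters keeps the flow itself independent of any particular provider.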
Voice Interaction Flow
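A typical voice interaction follows the sequence below:
- The user speaks to the Conversational Agent.
- The platform converts the spoken audio into text.
- The Conversational Agent processes the request and generates a response.
- The platform converts the response into speech and plays it back to the user.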
Enabling Voice
Voice capabilities are configured through the conversational trigger.
The trigger allows voice functionality to be enabled alongside traditional chat interactions.
Once enabled, users can interact with the Conversational Agent using either:
- Text conversations
- Voice conversations
The available options depend on the configured channel and client experience.
Voice Architecture
Voice-enabled conversations combine multiple platform services that work together to provide natural, low-latency voice interactions.
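A typical voice conversation passes through the components described below in the following order:
- Real-Time Transport delivers the user's audio to the platform.
- Voice Activity Detection identifies the audio that contains speech.
- Speech-to-Text transcribes that audio.
- Turn Detection decides when the transcript is submitted to the Conversational Agent.
- The Conversational Agent generates a response.
- Text-to-Speech converts the response into audio, which is delivered back to the user.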
Conversational Agent
Responsible for:
- Understanding requests
- Maintaining conversation context
- Invoking tools
- Retrieving knowledge
- Generating responses
Voice Activity Detection (VAD)
Detects speech activity within incoming audio and determines which audio should be processed by Speech-to-Text services.
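To make the idea concrete, the sketch below filters audio frames with a simple energy threshold. This is one common, simplified approach and is an assumption for illustration; it is not a description of how IB-X implements VAD. The function names and the threshold value of 500 are hypothetical, and the frames are assumed to be 16-bit signed PCM.

```python
import array
import math


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit signed PCM samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy reaches the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return frame_rms(frame) >= threshold


def speech_frames(frames, threshold: float = 500.0):
    """Keep only the frames that should be forwarded to Speech-to-Text."""
    return [f for f in frames if is_speech(f, threshold)]


# 160 samples of silence and 160 samples of a loud square wave.
silence = array.array("h", [0] * 160).tobytes()
loud = array.array("h", [4000, -4000] * 80).tobytes()
kept = speech_frames([silence, loud, silence])
```

Production VAD is usually more robust than a fixed energy threshold, since background noise can be louder than quiet speech.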
Speech-to-Text (STT)
Converts spoken audio into text that can be processed by the Conversational Agent.
Turn Detection
Determines when a user has completed their speaking turn and when the transcript should be submitted to the Conversational Agent.
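A simple way to picture turn detection is a silence timeout: once the user has spoken, a long enough pause ends the turn. The sketch below assumes this approach purely for illustration. `SilenceTurnDetector` and its parameters are hypothetical and do not reflect actual IB-X settings or defaults.

```python
from dataclasses import dataclass


@dataclass
class SilenceTurnDetector:
    """Declares a turn complete after enough trailing silence (hypothetical)."""
    silence_ms_to_end: int = 700   # pause length that ends a turn (assumed value)
    frame_ms: int = 20             # duration of each audio frame (assumed value)
    _heard_speech: bool = False
    _silence_ms: int = 0

    def push(self, frame_is_speech: bool) -> bool:
        """Feed one per-frame speech decision; return True when the turn ends."""
        if frame_is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            return False  # leading silence never ends a turn
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_ms_to_end:
            # Reset so the detector is ready for the next turn.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False


detector = SilenceTurnDetector(silence_ms_to_end=60, frame_ms=20)
decisions = [False, True, True, False, False, False, True]
ended_at = [i for i, d in enumerate(decisions) if detector.push(d)]
```

A shorter timeout makes the agent respond faster but risks cutting users off mid-sentence; a longer one does the opposite.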
Text-to-Speech (TTS)
Converts generated responses into spoken audio delivered back to the user.
Real-Time Transport
Provides low-latency communication between the client and the conversational platform during voice interactions.
IB-X currently supports:
- WebSocket
- Small WebRTC
Additional transport options may be introduced in future releases.
Barge-In
Allows users to interrupt the assistant while it is speaking, creating more natural conversational experiences.
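Conceptually, barge-in means that playback of the assistant's speech is cancelled as soon as the user starts talking. The sketch below models this with `asyncio`, assuming playback happens chunk by chunk and that an event signals user speech. The function names, chunk timing, and the event are hypothetical and only illustrate the idea.

```python
import asyncio


async def speak(chunks, played):
    """Simulate speech playback, one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # stand-in for the time a chunk takes to play


async def speak_with_barge_in(chunks, user_started_speaking: asyncio.Event):
    """Play chunks until finished or until the user starts speaking."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(user_started_speaking.wait())
    await asyncio.wait({playback, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = not playback.done()
    for task in (playback, interrupt):
        task.cancel()  # has no effect on a task that already finished
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo():
    event = asyncio.Event()
    # Simulate the user starting to speak shortly after playback begins.
    asyncio.get_running_loop().call_later(0.12, event.set)
    chunks = [f"chunk-{i}" for i in range(10)]
    return await speak_with_barge_in(chunks, event)


played, interrupted = asyncio.run(demo())
```

Here the interruption arrives while playback is still in progress, so only the first few chunks are played and `interrupted` is `True`.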
Speech Providers
IB-X integrates with speech providers for speech recognition and speech synthesis.
Speech providers are responsible for:
- Converting spoken audio into text (Speech-to-Text)
- Converting generated responses into speech (Text-to-Speech)
Depending on the selected provider, different models, voices, languages, and capabilities may be available.
Supported Providers and Models
IB-X supports multiple speech providers for Speech-to-Text (STT) and Text-to-Speech (TTS).
Speech providers can be accessed using either:
- Customer-managed provider credentials (Bring Your Own Key)
- IB-X Integration Gateway using IB-X Currency
The models available depend on the selected provider and the deployment configuration.
| Provider | Speech-to-Text (STT) | Text-to-Speech (TTS) | Notes |
|---|---|---|---|
| Deepgram | Base, Nova, Nova-2, Polaris | Aura, Aura-2 | Currently supported provider. |
Additional providers and models may be introduced in future releases.
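The table above can be expressed as a small lookup that a client or configuration script might use to check a provider and model combination before applying it. The structure and the `is_supported` helper are hypothetical; only the provider and model names are taken from the table.

```python
# Provider and model names copied from the table above.
# The dictionary layout and helper function are assumptions for this sketch.
SUPPORTED_MODELS = {
    "Deepgram": {
        "stt": {"Base", "Nova", "Nova-2", "Polaris"},
        "tts": {"Aura", "Aura-2"},
    },
}


def is_supported(provider: str, kind: str, model: str) -> bool:
    """Return True when the model is listed for the provider and kind.

    kind is "stt" for Speech-to-Text or "tts" for Text-to-Speech.
    """
    return model in SUPPORTED_MODELS.get(provider, {}).get(kind, set())


stt_ok = is_supported("Deepgram", "stt", "Nova-2")
tts_mismatch = is_supported("Deepgram", "tts", "Nova")
```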
Provider Access Models
Bring Your Own Key (BYOK)
Organizations can configure their own provider accounts and credentials. In this model, all provider usage is billed directly by the provider to the organization.
Integration Gateway
Organizations can access supported providers through the IB-X Integration Gateway without managing provider-specific accounts or credentials. Usage is billed using IB-X Currency.
For more information, see:
Speech-to-Text
Speech-to-Text providers convert spoken user input into text that can be processed by the Conversational Agent.
Typical provider configuration includes:
- Provider selection
- Connection selection
- Language selection
- Use case selection
- Model selection
These settings influence transcription quality, language support, latency, and recognition behavior.
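The settings listed above can be pictured as a single configuration record. The sketch below groups them in a dataclass and reports which ones are still blank. The class name, field names, and example values (including the connection name) are hypothetical and do not reflect the actual IB-X configuration schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SpeechToTextSettings:
    """One field per setting named in the list above (hypothetical shape)."""
    provider: str
    connection: str
    language: str
    use_case: str
    model: str

    def missing(self) -> List[str]:
        """Names of settings that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


settings = SpeechToTextSettings(
    provider="Deepgram",
    connection="example-connection",  # hypothetical connection name
    language="en-US",
    use_case="",                      # left blank on purpose
    model="Nova-2",
)
unset = settings.missing()
```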
For information about transcript generation, endpointing, and transcription tuning, see:
Text-to-Speech
Text-to-Speech providers generate spoken responses for the Conversational Agent.
Typical provider configuration includes:
- Provider selection
- Connection selection
- Model selection
- Voice selection
Organizations can choose voices that align with their branding, audience, and conversational requirements.
For information about response segmentation, speech streaming, and speech generation behavior, see:
Real-Time Communication
Voice interactions require low-latency communication between the client and the conversational platform.
To support different deployment scenarios and conversational requirements, IB-X provides multiple transport options for real-time voice communication.
These transports enable:
- Real-time messaging
- Audio streaming
- Voice interactions
- Conversation event delivery
- Low-latency speech processing
The appropriate transport depends on factors such as network conditions, browser capabilities, deployment architecture, and scalability requirements.
IB-X currently supports the following transport options for real-time voice interactions. Additional transports may be introduced in future releases.
WebSocket
WebSocket provides a simple and widely supported transport mechanism for voice interactions.
In this mode, audio, transcripts, conversation events, and agent responses are exchanged over a persistent WebSocket connection between the client and the conversational platform.
WebSocket transport is suitable for:
- Simpler deployments
- Environments where WebRTC is not required
- Browser-based and embedded chat experiences
- Voice interactions that do not require peer-to-peer media capabilities
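Because a WebSocket connection can carry both text and binary frames, one plausible way to multiplex audio and events over a single connection is to send audio as binary frames and everything else as JSON text frames. The sketch below assumes that convention for illustration only. The message shape and the `type`, `text`, and `final` fields are hypothetical and are not the IB-X wire format.

```python
import json
from typing import Union


def encode_event(event_type: str, **payload) -> str:
    """Build a JSON text frame for a transcript, event, or agent response."""
    return json.dumps({"type": event_type, **payload})


def decode_frame(frame: Union[str, bytes]) -> dict:
    """Classify an incoming frame: binary is audio, text is a JSON event."""
    if isinstance(frame, bytes):
        return {"type": "audio", "data": frame}
    message = json.loads(frame)
    if "type" not in message:
        raise ValueError("event frame is missing 'type'")
    return message


# A binary audio frame followed by a transcript event (hypothetical fields).
frames = [b"\x00\x01", encode_event("transcript", text="hello", final=True)]
decoded = [decode_frame(f) for f in frames]
```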
Small WebRTC
Small WebRTC uses WebRTC-based audio streaming to provide lower-latency voice communication.
This mode is optimized for conversational voice experiences where responsiveness and audio quality are important.
Small WebRTC is suitable for:
- Real-time voice agents
- Interactive conversational experiences
- Reduced audio latency
- Enhanced speech quality and user experience
Future Transport Modes
Daily WebRTC (Work in Progress)
Daily WebRTC integration is currently under development and is not yet generally available.
Once released, it will provide an additional WebRTC-based transport option built on the Daily platform, enabling advanced real-time voice communication scenarios and expanded media capabilities.
The selected transport affects how audio, events, and conversation data are exchanged between the client and the Conversational Agent, but does not change the underlying conversational capabilities such as:
- Speech-to-Text
- Text-to-Speech
- Knowledge Grounding
- Tool Execution
- Memory Management
- Barge-In
- Turn Detection
Voice Interaction Features
IB-X provides several capabilities that enable natural and responsive voice conversations.
These capabilities control how speech is detected, transcribed, segmented, and processed, and how interruptions are handled during voice interactions.
Note: These advanced voice interaction settings are currently configured through the appsettings.json configuration file. Future releases will make these settings available through the user interface, simplifying configuration and administration.
For detailed information, see:
- Voice Activity Detection Configuration
- Speech-to-Text Configuration
- Turn Detection Configuration
- Text-to-Speech Configuration
- Barge-In Configuration
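As a rough illustration of file-based configuration, the sketch below reads one section from a JSON document and falls back to defaults for anything not set. The section name `VoiceExample` and every key inside it are placeholders invented for this sketch; they are not the real appsettings.json schema, which is described in the configuration pages listed above.

```python
import json

# Placeholder document: "VoiceExample" and its keys are invented for this
# sketch and do NOT correspond to actual IB-X setting names.
EXAMPLE_APPSETTINGS = """
{
  "VoiceExample": {
    "BargeInEnabled": true,
    "TurnSilenceMs": 700
  }
}
"""

# Hypothetical defaults used when a key is absent from the file.
DEFAULTS = {"BargeInEnabled": False, "TurnSilenceMs": 1000}


def load_section(raw_json: str, section: str, defaults: dict) -> dict:
    """Merge the named section over the defaults; file values win."""
    document = json.loads(raw_json)
    return {**defaults, **document.get(section, {})}


voice = load_section(EXAMPLE_APPSETTINGS, "VoiceExample", DEFAULTS)
```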
Voice and Knowledge Grounding
Voice-enabled Conversational Agents can access the same enterprise knowledge available to chat-based agents.
The interaction method changes, but the underlying capabilities remain the same.
Voice agents can:
- Retrieve enterprise knowledge
- Invoke tools
- Execute workflows
- Access integrations
- Maintain conversation memory
Voice and Tool Execution
Voice conversations fully support tool execution.
For example, a user may ask:
"What is the status of ticket 12345?"
The Conversational Agent can:
- Invoke a ticket lookup tool.
- Retrieve the information.
- Generate a response.
- Speak the response back to the user.
This allows voice experiences to participate in enterprise business processes in the same way as chat-based interactions.
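The ticket example can be sketched end to end, starting from the transcript that Speech-to-Text would produce. Everything here is hypothetical: the in-memory `TICKETS` data, the `ticket_lookup` tool name, and the regular expression, which stands in for the agent's own decision about which tool to call.

```python
import re
from typing import Callable, Dict

# Hypothetical ticket data standing in for a real ticketing system.
TICKETS = {"12345": "In progress"}


def lookup_ticket(ticket_id: str) -> str:
    """Hypothetical tool: return the status of a ticket."""
    return TICKETS.get(ticket_id, "not found")


# Tool registry keyed by a made-up tool name.
TOOLS: Dict[str, Callable[[str], str]] = {"ticket_lookup": lookup_ticket}


def answer(transcript: str) -> str:
    """Turn a transcript into reply text that would then be sent to TTS.

    A regex replaces the agent's reasoning here, purely to keep the
    sketch self-contained.
    """
    match = re.search(r"ticket (\d+)", transcript, re.IGNORECASE)
    if match is None:
        return "Sorry, I could not find a ticket number in your request."
    ticket_id = match.group(1)
    status = TOOLS["ticket_lookup"](ticket_id)
    return f"Ticket {ticket_id} is currently: {status}."


reply = answer("What is the status of ticket 12345?")
```

The same `answer` function would serve a chat message unchanged, which reflects the point above: only the input and output modality differ between voice and chat.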
Best Practices
- Select voices that align with organizational branding.
- Use clear and focused agent instructions.
- Validate speech recognition accuracy for supported languages.
- Test voice interactions in realistic environments.
- Keep spoken responses concise where appropriate.
- Validate tool execution paths through voice conversations.
- Continuously review and refine the conversational experience.
Related
- Building a Conversational Agent
- AI Persona
- Knowledge Grounding
- Tools and Actions
- When Chat Message Received Trigger