Voice Capabilities

Overview

IB-X Conversational Agents can operate as both chat agents and voice agents.

Voice capabilities allow users to interact with Conversational Agents using natural speech instead of text-based messages. During a voice conversation, spoken audio is converted into text, processed by the Conversational Agent, and the generated response is converted back into speech.

This enables natural and interactive voice experiences for customer support, employee assistance, information retrieval, and business process automation.

Voice Interaction Flow

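A typical voice interaction follows this sequence: the user's speech is captured and converted into text, the Conversational Agent processes the text, and the generated response is converted back into speech and played to the user.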
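As a rough illustration of that sequence, the sketch below chains three stand-in stages in Python. The names `run_voice_turn`, `fake_stt`, `fake_agent`, and `fake_tts` are hypothetical placeholders, not IB-X APIs; on the platform these steps are handled for you.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    transcript: str      # text produced by speech-to-text
    reply_text: str      # text generated by the agent
    reply_audio: bytes   # audio produced by text-to-speech


def run_voice_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> VoiceTurn:
    """Run one turn: audio -> text -> agent reply -> audio."""
    transcript = stt(audio)
    reply_text = agent(transcript)
    reply_audio = tts(reply_text)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any speech provider.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")   # pretend the audio is already text


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")    # pretend the text is audio


turn = run_voice_turn(b"hello", fake_stt, fake_agent, fake_tts)
print(turn.reply_text)  # You said: hello
```

Passing the stages in as callables keeps the turn logic independent of any particular speech provider.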

Enabling Voice

Voice capabilities are configured through the When Chat Message Received trigger activity.

Once voice is enabled, users can interact with the Conversational Agent using either:

Text conversations
Voice conversations

The available interaction modes depend on the configured channel and client experience.

For information about configuring voice options, see:

When Chat Message Received Trigger

Voice Architecture

Voice-enabled conversations combine multiple platform services to provide natural, low-latency voice interactions.

A typical voice conversation relies on the components described below.

Conversational Agent

Responsible for:

Understanding user requests
Maintaining conversation context
Invoking tools
Retrieving enterprise knowledge
Generating responses

Voice Activity Detection (VAD)

Detects when the user starts and stops speaking, allowing the platform to distinguish speech from background noise before audio is sent for transcription.
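The detection method IB-X uses is not described here. To make the idea concrete, the following is a minimal energy-threshold sketch, assuming frames of 16-bit little-endian mono PCM; the function names and the threshold value of 500 are illustrative assumptions, and production detectors are typically more sophisticated.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Treat a frame as speech when its energy reaches the threshold."""
    return frame_rms(frame) >= threshold
```

A frame of zeros is classified as silence, while a frame whose samples alternate between 1000 and -1000 has an RMS of 1000 and is classified as speech.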

Speech-to-Text (STT)

Converts spoken audio into text that can be processed by the Conversational Agent.

Turn Detection

Determines when a user has completed their speaking turn so that the captured transcript can be submitted for processing.
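One simple way to approximate turn detection is a silence timeout: the turn ends once the user has spoken and then stayed quiet for a set duration. The sketch below assumes a stream of per-frame speech flags (for example, from a voice activity detector); the function name, the 20 ms frame size, and the 600 ms timeout are illustrative assumptions rather than IB-X defaults.

```python
from typing import Iterable, Optional


def find_turn_end(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    silence_ms: int = 600,
) -> Optional[int]:
    """Return the index of the frame that completes the turn, or None.

    The turn completes after speech has been heard and is followed by
    at least `silence_ms` of consecutive non-speech frames.
    """
    needed = max(1, silence_ms // frame_ms)  # silent frames required
    heard_speech = False
    silent_run = 0
    for index, flag in enumerate(speech_flags):
        if flag:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return index
    return None


# 5 silent frames, 10 speech frames, then silence; a 100 ms timeout
# at 20 ms per frame needs 5 consecutive silent frames.
flags = [False] * 5 + [True] * 10 + [False] * 30
print(find_turn_end(flags, frame_ms=20, silence_ms=100))  # 19
```

Leading silence is ignored so that a pause before the user starts speaking does not end the turn prematurely.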

Text-to-Speech (TTS)

Converts the generated response into natural-sounding speech that is played back to the user.

Real-Time Transport

Provides low-latency communication between the client and the conversational platform during voice interactions.

IB-X currently supports:

WebSocket
Small WebRTC

Additional transport options may be introduced in future releases.

User Interruptions (Barge-In)

Allows users to naturally interrupt the assistant while it is speaking, creating a more conversational experience.
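Conceptually, barge-in means that playback of the assistant's speech stops as soon as user speech is detected. The sketch below shows that control flow with hypothetical names (`play_with_barge_in`, `user_is_speaking`, `play`); it is not the IB-X implementation.

```python
from typing import Callable, Iterable, Tuple


def play_with_barge_in(
    chunks: Iterable[bytes],
    user_is_speaking: Callable[[], bool],
    play: Callable[[bytes], None],
) -> Tuple[int, bool]:
    """Play audio chunks in order; stop as soon as the user starts speaking.

    Returns (number of chunks played, whether playback was interrupted).
    """
    played = 0
    for chunk in chunks:
        if user_is_speaking():
            return played, True
        play(chunk)
        played += 1
    return played, False


# Example: the user starts talking once two chunks have been played.
output = []
result = play_with_barge_in(
    [b"a", b"b", b"c", b"d"],
    user_is_speaking=lambda: len(output) >= 2,
    play=output.append,
)
print(result)  # (2, True)
```

Checking for user speech before each chunk keeps the interruption latency bounded by the duration of a single chunk.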

Speech Providers

IB-X integrates with speech providers for speech recognition and speech synthesis.

Speech providers are responsible for:

Converting spoken audio into text (Speech-to-Text)
Converting generated responses into speech (Text-to-Speech)

Depending on the selected provider, different models, voices, languages, and capabilities may be available.

Supported Providers

Speech providers can be accessed using either:

Customer-managed provider credentials (Bring Your Own Key)
IB-X Integration Gateway using IB-X Currency

The providers and models available to you depend on the access method and deployment configuration.

Provider	Speech-to-Text (STT)	Text-to-Speech (TTS)	Notes
Deepgram	Base, Nova, Nova-2, Polaris	Aura, Aura-2	Currently supported provider.

Additional providers and models may be introduced in future releases.

Provider Access Models

Bring Your Own Key (BYOK)

Organizations can configure their own provider accounts and credentials. In this model, all provider usage is billed directly by the speech provider.

Integration Gateway

Organizations can access supported speech providers through the IB-X Integration Gateway without managing provider-specific accounts or credentials. Usage is billed using IB-X Currency.

For more information, see:

Integration Gateway
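The difference between the two access models can be summarized in a small lookup. This is only a sketch of the rules stated above; `AccessModel`, `AccessTerms`, and `TERMS` are hypothetical names and not part of any IB-X API.

```python
from dataclasses import dataclass
from enum import Enum


class AccessModel(Enum):
    BYOK = "byok"          # Bring Your Own Key
    GATEWAY = "gateway"    # IB-X Integration Gateway


@dataclass(frozen=True)
class AccessTerms:
    needs_customer_credentials: bool  # must the organization manage provider credentials?
    billing: str                      # how usage is billed


TERMS = {
    AccessModel.BYOK: AccessTerms(True, "billed directly by the speech provider"),
    AccessModel.GATEWAY: AccessTerms(False, "billed using IB-X Currency"),
}

print(TERMS[AccessModel.GATEWAY].billing)  # billed using IB-X Currency
```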

Speech-to-Text

Speech-to-Text providers convert spoken user input into text that can be processed by the Conversational Agent.

Typical configuration includes:

Provider
Connection
Language
Use Case
Model

These settings determine the speech recognition service used during voice conversations.

Text-to-Speech

Text-to-Speech providers generate spoken responses for the Conversational Agent.

Typical configuration includes:

Provider
Connection
Model
Voice

Organizations can choose voices that best match their brand, audience, and conversational experience.
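To show how the Speech-to-Text and Text-to-Speech settings fit together, the sketch below models both groups as plain Python records. The class and field names mirror the settings listed in this document but are hypothetical, and the connection, language, use case, and voice values are placeholders; the real options are set in the trigger configuration.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class SpeechToTextSettings:
    provider: str
    connection: str
    language: str
    use_case: str
    model: str


@dataclass(frozen=True)
class TextToSpeechSettings:
    provider: str
    connection: str
    model: str
    voice: str


def missing_fields(settings) -> List[str]:
    """Return the names of settings that were left blank."""
    return [name for name, value in asdict(settings).items() if not value.strip()]


stt = SpeechToTextSettings(
    provider="Deepgram",
    connection="example-connection",  # placeholder value
    language="en",                    # placeholder value
    use_case="general",               # placeholder value
    model="Nova-2",
)
tts = TextToSpeechSettings(
    provider="Deepgram",
    connection="example-connection",  # placeholder value
    model="Aura-2",
    voice="",                         # left blank to show validation
)

print(missing_fields(stt))  # []
print(missing_fields(tts))  # ['voice']
```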

Real-Time Communication

Voice interactions require low-latency communication between the client and the conversational platform.

To support different deployment scenarios and conversational requirements, IB-X provides multiple transport options for real-time voice communication.

These transports enable:

Real-time messaging
Audio streaming
Conversation events
Low-latency speech processing

The appropriate transport depends on network conditions, browser capabilities, deployment architecture, and scalability requirements.

WebSocket

WebSocket provides a simple and widely supported transport for voice interactions.

Audio, transcripts, conversation events, and agent responses are exchanged over a persistent WebSocket connection between the client and the conversational platform.
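The IB-X wire format is not documented here. Purely as an illustration of how different kinds of traffic can share one connection, the sketch below wraps each message in a JSON envelope with a `type` field and base64-encodes audio bytes. The envelope layout and the names `encode_message` and `decode_message` are assumptions; a real implementation might send audio as binary frames instead.

```python
import base64
import json
from typing import Any, Tuple


def encode_message(kind: str, payload: Any) -> str:
    """Serialize one message; bytes payloads are base64-encoded for JSON."""
    if isinstance(payload, (bytes, bytearray)):
        data = base64.b64encode(bytes(payload)).decode("ascii")
        body = {"encoding": "base64", "data": data}
    else:
        body = {"encoding": "json", "data": payload}
    return json.dumps({"type": kind, **body})


def decode_message(raw: str) -> Tuple[str, Any]:
    """Parse one message and return (type, payload)."""
    message = json.loads(raw)
    data = message["data"]
    if message["encoding"] == "base64":
        data = base64.b64decode(data)
    return message["type"], data


# Round-trip an audio chunk and a transcript through the envelope.
print(decode_message(encode_message("audio", b"\x00\x01\x02")))
# ('audio', b'\x00\x01\x02')
print(decode_message(encode_message("transcript", {"text": "hello"})))
# ('transcript', {'text': 'hello'})
```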

WebSocket is suitable for:

Browser-based applications
Embedded chat experiences
Simple voice deployments
Environments where WebRTC is not required

Small WebRTC

Small WebRTC provides low-latency audio streaming using WebRTC.

This transport is optimized for conversational voice experiences where responsiveness and audio quality are important.

Small WebRTC is suitable for:

Real-time voice agents
Interactive conversations
Deployments that need reduced audio latency
Deployments that need improved speech quality

Future Transport Modes

Daily WebRTC (Work in Progress)

Daily WebRTC integration is currently under development and is not yet generally available.

Once released, it will provide an additional WebRTC-based transport option for advanced real-time voice communication scenarios.

Regardless of the selected transport, the conversational capabilities remain the same, including:

Speech recognition
Speech synthesis
Knowledge retrieval
Tool execution
Workflow orchestration
Conversation memory
User interruptions

Voice Interaction Features

IB-X includes several built-in capabilities that work together to provide natural and responsive voice conversations.

These capabilities automatically manage:

Speech detection
Speech recognition
Turn completion
Speech synthesis
User interruptions
Conversation responsiveness

The platform uses optimized default settings for these capabilities, eliminating the need for manual tuning in most deployments.

For scenarios that require fine-tuning of voice interaction behavior, administrators can configure the available options through the Advanced section of the When Chat Message Received trigger activity.

For more information, see:

Advanced Voice Settings

Voice and Knowledge Grounding

Voice-enabled Conversational Agents have access to the same enterprise knowledge as chat-based agents.

Only the interaction method changes; the underlying capabilities remain the same.

Voice agents can:

Retrieve enterprise knowledge
Invoke tools
Execute workflows
Access enterprise integrations
Maintain conversation memory

Voice and Tool Execution

Voice conversations fully support tool execution.

For example, a user may say:

What is the status of ticket 12345?

The Conversational Agent can:

Invoke a ticket lookup tool.
Retrieve the required information.
Generate an appropriate response.
Speak the response back to the user.

This enables voice agents to participate in enterprise business processes in the same way as chat-based conversational agents.
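The ticket example can be sketched end to end once the speech has been transcribed. In the sketch below, `TICKETS`, `extract_ticket_id`, `lookup_ticket_status`, and `answer` are hypothetical stand-ins: the in-memory dictionary replaces a real ticketing system, and the regular expression replaces the agent's own decision about which tool to invoke and with what arguments.

```python
import re
from typing import Optional

# Hypothetical in-memory ticket store standing in for a real ticketing system.
TICKETS = {"12345": "In Progress"}


def extract_ticket_id(transcript: str) -> Optional[str]:
    """Pull a ticket number out of the transcribed request, if present."""
    match = re.search(r"\bticket\s+#?(\d+)", transcript, re.IGNORECASE)
    return match.group(1) if match else None


def lookup_ticket_status(ticket_id: str) -> Optional[str]:
    """Stand-in for the ticket lookup tool."""
    return TICKETS.get(ticket_id)


def answer(transcript: str) -> str:
    """Produce the text that would be handed to text-to-speech."""
    ticket_id = extract_ticket_id(transcript)
    if ticket_id is None:
        return "Which ticket would you like me to check?"
    status = lookup_ticket_status(ticket_id)
    if status is None:
        return f"I could not find ticket {ticket_id}."
    return f"Ticket {ticket_id} is currently {status}."


print(answer("What is the status of ticket 12345?"))
# Ticket 12345 is currently In Progress.
```

The returned sentences are kept short because they are meant to be spoken rather than read.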

Best Practices

Select voices that align with your organization's branding.
Use clear and focused AI instructions.
Validate speech recognition accuracy for the languages you support.
Test voice interactions under realistic network and environmental conditions.
Keep spoken responses concise where appropriate.
Validate tool execution through voice conversations.
Use the default voice interaction settings unless a specific tuning requirement exists.
Adjust only the supported Advanced settings when necessary.
