Voice Capabilities
Overview
IB-X Conversational Agents can operate as both chat agents and voice agents.
Voice capabilities allow users to interact with Conversational Agents using natural speech rather than text-based messages. The platform converts spoken audio into text, processes the request using the Conversational Agent, and converts the generated response back into speech.
This enables natural and interactive voice experiences for customer support, employee assistance, information retrieval, and business process automation.
Voice Interaction Flow
A typical voice interaction follows the sequence below:
User Speech
│
▼
Speech-to-Text
│
▼
Conversational Agent
│
▼
Text Response
│
▼
Text-to-Speech
│
▼
Spoken Response
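The sequence above can be sketched in Python as three stages composed in order. This is a minimal illustration, not the IB-X API: `VoicePipeline`, `handle_turn`, and the three stage callables are hypothetical names, and the stand-in stages below only simulate real speech and agent services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical composition of the three stages in the flow above."""
    speech_to_text: Callable[[bytes], str]   # audio in, transcript out
    agent: Callable[[str], str]              # transcript in, text response out
    text_to_speech: Callable[[str], bytes]   # text response in, audio out

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.speech_to_text(audio)
        response_text = self.agent(transcript)
        return self.text_to_speech(response_text)


# Stand-in stages for illustration only; real stages would call the
# configured speech providers and the Conversational Agent.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
spoken = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable mirrors the idea that the speech services and the agent are separate, replaceable parts of one voice turn.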
Enabling Voice
Voice capabilities are configured through the conversational trigger.
The trigger allows voice functionality to be enabled alongside traditional chat interactions.
Once enabled, users can interact with the Conversational Agent using either:
- Text conversations
- Voice conversations
The available options depend on the configured channel and client experience.
Voice Architecture
Voice-enabled conversations combine multiple platform services:
Conversational Agent
Responsible for:
- Understanding requests
- Maintaining conversation context
- Invoking tools
- Retrieving knowledge
- Generating responses
Speech-to-Text (STT)
Converts spoken audio into text that can be processed by the Conversational Agent.
Text-to-Speech (TTS)
Converts generated responses into spoken audio delivered back to the user.
Real-Time Transport
Provides low-latency communication between the client and the conversational platform during voice interactions.
Speech Providers
IB-X supports configurable providers for speech services.
Speech providers are used for:
- Speech-to-Text (STT)
- Text-to-Speech (TTS)
The models, voices, languages, and capabilities available vary by provider and are determined by:
- Configured provider
- Selected model
- Connection settings
- Provider capabilities
Text-to-Speech
Text-to-Speech generates spoken responses for the Conversational Agent.
Typical configuration includes:
- Provider selection
- Connection selection
- Model selection
- Voice selection
Organizations can choose voices that align with their branding, audience, and conversational requirements.
Speech-to-Text
Speech-to-Text converts spoken user input into text.
Typical configuration includes:
- Provider selection
- Connection selection
- Language selection
- Use case selection
- Model selection
These settings influence transcription quality, language support, and recognition behavior.
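The settings listed for Text-to-Speech and Speech-to-Text can be pictured as two small configuration records. The sketch below is illustrative only: the class names, field names, the `missing_settings` helper, and every value (such as `example-provider`) are placeholders, not actual IB-X settings or provider identifiers.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TextToSpeechConfig:
    provider: str
    connection: str
    model: str
    voice: str


@dataclass(frozen=True)
class SpeechToTextConfig:
    provider: str
    connection: str
    language: str
    use_case: str
    model: str


def missing_settings(config) -> list[str]:
    """Return the names of settings that were left empty."""
    return [f.name for f in fields(config) if not getattr(config, f.name).strip()]


# Placeholder values only; real values come from the configured provider.
tts = TextToSpeechConfig(
    provider="example-provider",
    connection="example-connection",
    model="example-tts-model",
    voice="example-voice",
)
stt = SpeechToTextConfig(
    provider="example-provider",
    connection="example-connection",
    language="en-US",
    use_case="",  # deliberately left empty to show the check below
    model="example-stt-model",
)
```

Calling `missing_settings(stt)` on the record above reports the empty `use_case` field, which is one simple way to catch an incomplete configuration before testing.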
Real-Time Communication
Voice interactions require low-latency communication between the client and the conversational platform.
IB-X supports several transport mechanisms optimized for conversational experiences.
These transports enable:
- Real-time messaging
- Audio streaming
- Voice interactions
- Conversation event delivery
The appropriate transport depends on the deployment scenario and conversational requirements.
Voice and Knowledge Grounding
Voice-enabled Conversational Agents can access the same enterprise knowledge available to chat-based agents.
The interaction method changes, but the underlying capabilities remain the same.
Voice agents can:
- Retrieve enterprise knowledge
- Invoke tools
- Execute workflows
- Access integrations
- Maintain conversation memory
Voice and Tool Execution
Voice conversations fully support tool execution.
For example, a user may ask:
What is the status of ticket 12345?
The Conversational Agent can:
- Invoke a ticket lookup tool.
- Retrieve the information.
- Generate a response.
- Speak the response back to the user.
This allows voice experiences to participate in enterprise business processes in the same way as chat-based interactions.
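The ticket example can be sketched as a single voice turn. Everything here is hypothetical: `lookup_ticket`, the in-memory `TICKETS` table, and the regular expression that stands in for the agent's decision to call a tool. A real Conversational Agent would select the tool itself, and the returned text would then be passed to Text-to-Speech.

```python
import re
from typing import Optional

# Hypothetical ticket store standing in for a real ticketing integration.
TICKETS = {"12345": "In Progress"}


def lookup_ticket(ticket_id: str) -> Optional[str]:
    """Hypothetical ticket lookup tool."""
    return TICKETS.get(ticket_id)


def answer_from_transcript(transcript: str) -> str:
    """Turn a transcribed request into the text that would be spoken back."""
    match = re.search(r"\bticket\s+(\d+)", transcript, re.IGNORECASE)
    if match is None:
        return "Which ticket would you like me to check?"
    ticket_id = match.group(1)
    status = lookup_ticket(ticket_id)
    if status is None:
        return f"I could not find ticket {ticket_id}."
    return f"Ticket {ticket_id} is currently {status}."


reply = answer_from_transcript("What is the status of ticket 12345?")
```

The response text is kept to one short sentence because it is meant to be spoken rather than read.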
Testing Voice Interactions
Voice-enabled Agents can be tested directly from the Agent Designer. To run a test:
- Open the Agent Designer.
- Click Run.
- Start the conversational session.
- Select the voice interaction option.
- Speak naturally with the Agent.
Testing should validate:
- Speech recognition quality
- Response accuracy
- Voice quality
- Tool execution
- Knowledge retrieval
- Overall user experience
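Speech recognition quality can be quantified during testing. One widely used measure is word error rate (WER): the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below is a generic illustration, not an IB-X feature; `word_error_rate` is a hypothetical helper, and it normalizes only by lowercasing and splitting on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# What the tester said versus what the recognizer returned (made-up data).
wer = word_error_rate(
    "what is the status of ticket 12345",
    "what is the status of ticket 1235",
)
```

Here one of the seven reference words was misrecognized, so `wer` is 1/7. Lower values indicate better recognition, and comparing the value across languages and environments helps show where transcription settings need attention.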
Best Practices
- Select voices that align with organizational branding.
- Use clear and focused agent instructions.
- Validate speech recognition accuracy for supported languages.
- Test voice interactions in realistic environments.
- Keep spoken responses concise where appropriate.
- Validate tool execution paths through voice conversations.
- Continuously review and refine the conversational experience.
Related
- Building a Conversational Agent
- AI Persona
- Knowledge Grounding
- Tools and Actions
- When Chat Message Received Trigger