
Voice Capabilities

Overview

IB-X Conversational Agents can operate as both chat agents and voice agents.

Voice capabilities allow users to interact with Conversational Agents using natural speech rather than text-based messages. The platform converts spoken audio into text, processes the request using the Conversational Agent, and converts the generated response back into speech.

This enables natural and interactive voice experiences for customer support, employee assistance, information retrieval, and business process automation.


Voice Interaction Flow

A typical voice interaction follows the sequence below:

  1. User Speech
  2. Speech-to-Text
  3. Conversational Agent
  4. Text Response
  5. Text-to-Speech
  6. Spoken Response
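
The sequence above can be sketched as three chained stages. This is a minimal illustration, not the IB-X implementation: `VoicePipeline`, `handle_turn`, and the `fake_*` functions are hypothetical names, and the toy stages simply treat audio bytes as UTF-8 text so the example runs without a speech provider.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-in for the platform's voice flow (not an IB-X API).
@dataclass
class VoicePipeline:
    speech_to_text: Callable[[bytes], str]  # user audio in, transcript out
    agent: Callable[[str], str]             # transcript in, text response out
    text_to_speech: Callable[[str], bytes]  # text response in, audio out

    def handle_turn(self, audio: bytes) -> bytes:
        """Run one user turn through STT, the agent, and TTS, in that order."""
        transcript = self.speech_to_text(audio)
        reply_text = self.agent(transcript)
        return self.text_to_speech(reply_text)


# Toy stages: "audio" is just UTF-8 text here, purely for illustration.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_agent, fake_tts)
spoken = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable mirrors the idea that the speech services are swappable while the Conversational Agent in the middle stays the same.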

Enabling Voice

Voice capabilities are configured through the conversational trigger.

The trigger allows voice functionality to be enabled alongside traditional chat interactions.

Once enabled, users can interact with the Conversational Agent using either:

  • Text conversations
  • Voice conversations

The available options depend on the configured channel and client experience.


Voice Architecture

Voice-enabled conversations combine multiple platform services:

Conversational Agent

Responsible for:

  • Understanding requests
  • Maintaining conversation context
  • Invoking tools
  • Retrieving knowledge
  • Generating responses

Speech-to-Text (STT)

Converts spoken audio into text that can be processed by the Conversational Agent.

Text-to-Speech (TTS)

Converts generated responses into spoken audio delivered back to the user.

Real-Time Transport

Provides low-latency communication between the client and the conversational platform during voice interactions.


Speech Providers

IB-X supports configurable providers for speech services.

Speech providers are used for:

  • Speech-to-Text (STT)
  • Text-to-Speech (TTS)

Depending on the provider, different models, voices, languages, and capabilities may be available.

The available options are determined by:

  • Configured provider
  • Selected model
  • Connection settings
  • Provider capabilities
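
One way to picture how the provider and model determine the available options is a simple lookup. The catalog below is entirely invented for illustration: the provider names, model names, voices, and languages are assumptions, and `available_options` is a hypothetical helper rather than an IB-X function.

```python
# Hypothetical catalog: provider -> model -> capabilities (all values invented).
CATALOG = {
    "provider-a": {
        "tts-standard": {"voices": ["voice-1", "voice-2"], "languages": ["en-US"]},
    },
    "provider-b": {
        "tts-multilingual": {"voices": ["voice-3"], "languages": ["en-US", "de-DE"]},
    },
}


def available_options(provider: str, model: str) -> dict:
    """Return the voices and languages for a provider/model pair.

    Raises ValueError when the combination is not in the catalog.
    """
    try:
        return CATALOG[provider][model]
    except KeyError as exc:
        raise ValueError(f"unsupported provider/model: {provider}/{model}") from exc


options = available_options("provider-b", "tts-multilingual")
```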

Text-to-Speech

Text-to-Speech generates spoken responses for the Conversational Agent.

Typical configuration includes:

  • Provider selection
  • Connection selection
  • Model selection
  • Voice selection

Organizations can choose voices that align with their branding, audience, and conversational requirements.


Speech-to-Text

Speech-to-Text converts spoken user input into text.

Typical configuration includes:

  • Provider selection
  • Connection selection
  • Language selection
  • Use case selection
  • Model selection

These settings influence transcription quality, language support, and recognition behavior.
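
The settings listed for Text-to-Speech and Speech-to-Text can be modeled as two small records. This is a sketch under assumptions: the class names, field names, and the `missing_settings` helper are hypothetical and only mirror the configuration items named in this section; the actual configuration format may differ.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TextToSpeechConfig:
    # Fields mirror the TTS settings listed in this document.
    provider: str
    connection: str
    model: str
    voice: str


@dataclass(frozen=True)
class SpeechToTextConfig:
    # Fields mirror the STT settings listed in this document.
    provider: str
    connection: str
    language: str
    use_case: str
    model: str


def missing_settings(config) -> list[str]:
    """Return the names of settings that were left empty or blank."""
    return [f.name for f in fields(config) if not getattr(config, f.name).strip()]


# Example values are placeholders, not real provider or model names.
stt = SpeechToTextConfig("provider-a", "connection-1", "en-US", "", "model-x")
tts = TextToSpeechConfig("provider-a", "connection-1", "model-y", "voice-1")
```

Here `missing_settings(stt)` reports `use_case` as unset, which is the kind of check worth running before a voice session starts.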


Real-Time Communication

Voice interactions require low-latency communication between the client and the conversational platform.

Depending on the scenario, IB-X supports different transport mechanisms optimized for conversational experiences.

These transports enable:

  • Real-time messaging
  • Audio streaming
  • Voice interactions
  • Conversation event delivery

The appropriate transport depends on the deployment scenario and conversational requirements.


Voice and Knowledge Grounding

Voice-enabled Conversational Agents can access the same enterprise knowledge available to chat-based agents.

The interaction method changes, but the underlying capabilities remain the same.

Voice agents can:

  • Retrieve enterprise knowledge
  • Invoke tools
  • Execute workflows
  • Access integrations
  • Maintain conversation memory

Voice and Tool Execution

Voice conversations fully support tool execution.

For example, a user may ask:

"What is the status of ticket 12345?"

The Conversational Agent can:

  1. Invoke a ticket lookup tool.
  2. Retrieve the information.
  3. Generate a response.
  4. Speak the response back to the user.

This allows voice experiences to participate in enterprise business processes in the same way as chat-based interactions.
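
The ticket example can be sketched as follows. Everything here is an assumption made for illustration: `TICKETS`, `lookup_ticket`, `answer`, and `speak` are hypothetical names, the ticket status is invented, and the regular expression only stands in for the agent's own decision to call a tool.

```python
import re
from typing import Optional

# Hypothetical in-memory store standing in for a real ticketing integration.
TICKETS = {"12345": "In Progress"}


def lookup_ticket(ticket_id: str) -> Optional[str]:
    """Steps 1 and 2: invoke the lookup tool and retrieve the status."""
    return TICKETS.get(ticket_id)


def answer(transcript: str) -> str:
    """Step 3: generate a text response from the transcribed request."""
    match = re.search(r"ticket\s+(\d+)", transcript, re.IGNORECASE)
    if match is None:
        return "Which ticket would you like me to check?"
    ticket_id = match.group(1)
    status = lookup_ticket(ticket_id)
    if status is None:
        return f"I could not find ticket {ticket_id}."
    return f"Ticket {ticket_id} is {status}."


def speak(text: str) -> bytes:
    """Step 4: placeholder for Text-to-Speech; returns bytes instead of audio."""
    return text.encode("utf-8")


reply = answer("What is the status of ticket 12345?")
audio = speak(reply)
```

The same `answer` function would serve a chat message unchanged; only the final `speak` step is specific to voice.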


Testing Voice Interactions

Voice-enabled Agents can be tested directly from the Agent Designer.

  1. Open the Agent Designer.
  2. Click Run.
  3. Start the conversational session.
  4. Select the voice interaction option.
  5. Speak naturally with the Agent.

Testing should validate:

  • Speech recognition quality
  • Response accuracy
  • Voice quality
  • Tool execution
  • Knowledge retrieval
  • Overall user experience
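
Speech recognition quality is commonly quantified with word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words spoken. The sketch below is a generic implementation, not an IB-X feature, and assumes you have collected reference phrases and their transcripts yourself during testing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


spoken_text = "what is the status of ticket 12345"
transcribed = "what is status of ticket 1234"
wer = word_error_rate(spoken_text, transcribed)  # one dropped word + one substitution
```

Tracking this figure per language and per test environment gives a repeatable baseline when provider, model, or language settings change.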

Best Practices

  • Select voices that align with organizational branding.
  • Use clear and focused agent instructions.
  • Validate speech recognition accuracy for supported languages.
  • Test voice interactions in realistic environments.
  • Keep spoken responses concise where appropriate.
  • Validate tool execution paths through voice conversations.
  • Continuously review and refine the conversational experience.