
Speech-to-Text Configuration

Overview

Speech-to-Text (STT) converts spoken user input into text that can be processed by the Conversational Agent.

In voice-enabled conversations, Speech-to-Text is responsible for receiving audio, identifying when a user has finished speaking, generating transcripts, and delivering those transcripts to the Conversational Agent.

Proper Speech-to-Text configuration helps improve:

  • Transcription accuracy
  • Conversation responsiveness
  • Turn completion speed
  • User experience
  • Resource utilization

IB-X supports configurable speech providers and includes several advanced settings that control transcript generation, utterance completion, and connection management.


How Speech-to-Text Works

A typical Speech-to-Text flow works as follows.

During a conversation, the Speech-to-Text service continuously processes incoming audio and generates transcript updates.

Once the speech input is considered complete, a final transcript is produced and sent to the Conversational Agent for processing.
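The flow described above can be sketched as a small routing loop. This is only an illustration, not IB-X code: `TranscriptEvent`, `route_transcripts`, and the `send_to_agent` callback are hypothetical names, and the sketch assumes the provider marks each update as interim or final.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TranscriptEvent:
    """One transcript update from the STT provider (hypothetical shape)."""
    text: str
    is_final: bool


def route_transcripts(events: Iterable[TranscriptEvent],
                      send_to_agent: Callable[[str], None]) -> str:
    """Forward final transcripts to the agent; keep interim updates local.

    Returns the last interim text that was never finalized ("" if none).
    """
    latest_interim = ""
    for event in events:
        if event.is_final:
            # Speech input is considered complete: hand the text to the agent.
            send_to_agent(event.text)
            latest_interim = ""
        else:
            # Interim update: the utterance is still in progress.
            latest_interim = event.text
    return latest_interim


def demo_route() -> tuple:
    """Run a short example stream and report what reached the agent."""
    sent = []
    leftover = route_transcripts(
        [
            TranscriptEvent("book", False),
            TranscriptEvent("book a", False),
            TranscriptEvent("book a flight", True),
            TranscriptEvent("to", False),
        ],
        sent.append,
    )
    return sent, leftover
```

In this sketch only the finalized text reaches the agent, which mirrors the description above; interim updates could still be used for display or logging.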


Utterance Detection

These settings control how Speech-to-Text determines that a spoken phrase or sentence has completed.

| Option | Default Value | Description |
|---|---|---|
| Utterance End Timeout | 300 ms | Time to wait after speech activity stops before requesting a final transcript from the Speech-to-Text provider. Higher values reduce premature transcript completion but may increase response latency. |
| Endpointing Silence | 300 ms | Duration of silence that is treated as the end of a phrase or utterance. Lower values result in faster transcript completion, while higher values better tolerate thinking pauses and natural speech gaps. |
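To make the effect of Endpointing Silence concrete, here is a minimal sketch of silence-based endpointing over fixed-length audio frames. It is an assumption about how such detection can work in general, not a description of IB-X or of any provider's internals; `Frame`, `make_frames`, and `find_utterance_end` are hypothetical names, and the per-frame speech flag stands in for a voice-activity detector.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    start_ms: int      # timestamp at which the frame starts
    duration_ms: int   # length of the frame
    is_speech: bool    # result of a hypothetical voice-activity detector


def make_frames(pattern: List[bool], frame_ms: int = 100) -> List[Frame]:
    """Build consecutive frames from a list of speech/silence flags."""
    return [Frame(i * frame_ms, frame_ms, flag) for i, flag in enumerate(pattern)]


def find_utterance_end(frames: List[Frame],
                       endpointing_silence_ms: int = 300) -> Optional[int]:
    """Return the time (ms) at which the utterance is treated as complete.

    The utterance ends once continuous silence following speech reaches
    the endpointing threshold. Returns None if that never happens.
    """
    heard_speech = False
    silence_ms = 0
    for frame in frames:
        if frame.is_speech:
            heard_speech = True
            silence_ms = 0          # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame.duration_ms
            if silence_ms >= endpointing_silence_ms:
                return frame.start_ms + frame.duration_ms
    return None


# 300 ms of speech, a 200 ms thinking pause, 200 ms of speech, then silence.
PAUSED_SPEECH = make_frames([True] * 3 + [False] * 2 + [True] * 2 + [False] * 4)
```

With the 300 ms default, the 200 ms pause in `PAUSED_SPEECH` is tolerated and the utterance ends after the trailing silence. With a 200 ms threshold, the same pause ends the utterance early, which is the trade-off the table describes.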

Transcript Processing

These settings control how transcript updates are processed and delivered.

| Option | Default Value | Description |
|---|---|---|
| Event Poll Interval | 50 ms | Frequency at which transcript updates are checked and processed. Lower values may provide slightly faster transcript updates at the cost of additional processing activity. |
| Transcript Debounce | 350 ms | Small delay applied after receiving a final transcript before processing begins. This helps reduce duplicate processing and prevents multiple transcript completion events from triggering redundant work. |
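The interaction between the poll interval and the debounce delay can be illustrated with a small simulation. The sketch below makes two assumptions that the settings above do not state: each new final transcript restarts the debounce window, and the most recent text wins. `TranscriptDebouncer` and `simulate` are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```python
from typing import List, Optional, Tuple


class TranscriptDebouncer:
    """Hold a final transcript until no newer one arrives for debounce_ms."""

    def __init__(self, debounce_ms: int = 350) -> None:
        self.debounce_ms = debounce_ms
        self._pending: Optional[str] = None
        self._deadline_ms: Optional[int] = None

    def on_final(self, text: str, now_ms: int) -> None:
        # Assumption: a newer final transcript replaces the pending one
        # and restarts the debounce window.
        self._pending = text
        self._deadline_ms = now_ms + self.debounce_ms

    def poll(self, now_ms: int) -> Optional[str]:
        """Release the pending transcript once its window has elapsed."""
        if self._pending is not None and now_ms >= self._deadline_ms:
            text = self._pending
            self._pending = None
            self._deadline_ms = None
            return text
        return None


def simulate(events: List[Tuple[int, str]], debounce_ms: int = 350,
             poll_interval_ms: int = 50, end_ms: int = 1000) -> List[Tuple[int, str]]:
    """Replay (time_ms, text) final-transcript events through a polling loop."""
    debouncer = TranscriptDebouncer(debounce_ms)
    queue = sorted(events)
    processed = []
    for now in range(0, end_ms + 1, poll_interval_ms):
        # Events are only noticed on a poll tick.
        while queue and queue[0][0] <= now:
            _, text = queue.pop(0)
            debouncer.on_final(text, now)
        ready = debouncer.poll(now)
        if ready is not None:
            processed.append((now, ready))
    return processed
```

For example, two completion events for the same utterance arriving 20 ms apart result in a single item being processed, at the cost of the debounce delay before processing starts.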

Connection Pooling

Speech-to-Text services often require establishing network connections before transcription can begin.

Connection pooling helps reduce startup latency by keeping a number of ready-to-use connections available.

Connection Pool Settings

| Option | Default Value | Description |
|---|---|---|
| Pre-Warm Enabled | True | Determines whether idle Speech-to-Text connections are created and maintained in advance. Pre-warming can reduce the delay experienced when speech recognition starts. |
| Maximum Connections | 3 | Maximum number of idle Speech-to-Text connections maintained in the connection pool. |
| Idle Timeout | 300 seconds | Amount of time an unused connection remains in the pool before it is automatically closed. |
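The three pool settings can be pictured with a minimal in-memory pool. This is a sketch under stated assumptions, not the IB-X implementation: `SttConnectionPool` and `FakeConnection` are hypothetical, the `connect` factory stands in for whatever opens a provider connection, and the clock is injectable so the idle timeout can be demonstrated without waiting.

```python
import time
from typing import Callable, List, Tuple


class FakeConnection:
    """Stand-in for a real provider connection (hypothetical)."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class SttConnectionPool:
    def __init__(self, connect: Callable[[], FakeConnection],
                 prewarm_enabled: bool = True, max_connections: int = 3,
                 idle_timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._max = max_connections
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._idle: List[Tuple[FakeConnection, float]] = []  # (conn, idle since)
        if prewarm_enabled:
            self.prewarm()

    def prewarm(self) -> None:
        """Open connections ahead of time until the idle limit is reached."""
        while len(self._idle) < self._max:
            self._idle.append((self._connect(), self._clock()))

    def evict_expired(self) -> None:
        """Close idle connections that have exceeded the idle timeout."""
        now = self._clock()
        kept = []
        for conn, idle_since in self._idle:
            if now - idle_since >= self._idle_timeout_s:
                conn.close()
            else:
                kept.append((conn, idle_since))
        self._idle = kept

    def acquire(self) -> FakeConnection:
        """Reuse an idle connection if available, otherwise open a new one."""
        self.evict_expired()
        if self._idle:
            conn, _ = self._idle.pop()
            return conn
        return self._connect()

    def release(self, conn: FakeConnection) -> None:
        """Return a connection to the pool, or close it if the pool is full."""
        if len(self._idle) < self._max:
            self._idle.append((conn, self._clock()))
        else:
            conn.close()

    def idle_count(self) -> int:
        return len(self._idle)


# Demonstration with a controllable clock.
fake_now = [0.0]
pool = SttConnectionPool(FakeConnection, clock=lambda: fake_now[0])
warm_count = pool.idle_count()
conn = pool.acquire()
count_after_acquire = pool.idle_count()
pool.release(conn)
fake_now[0] = 301.0            # move past the 300-second idle timeout
pool.evict_expired()
count_after_timeout = pool.idle_count()
```

With pre-warming enabled, a session that calls `acquire()` gets a ready connection instead of paying connection setup time. With it disabled, the pool starts empty and the first session opens its own connection.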

Choosing Appropriate Settings

Faster Responses

For highly interactive conversations:

  • Reduce Utterance End Timeout
  • Reduce Endpointing Silence
  • Enable connection pre-warming

This allows transcripts to complete sooner and reduces the time before the agent begins responding.

Improved Transcript Stability

For users who frequently pause while speaking:

  • Increase Utterance End Timeout
  • Increase Endpointing Silence

This reduces the likelihood of transcripts being finalized prematurely.

High-Concurrency Environments

For deployments with many simultaneous voice sessions:

  • Increase Maximum Connections
  • Enable connection pre-warming
  • Monitor connection utilization

This can help reduce connection establishment delays during peak usage.

Resource Optimization

For environments where voice usage is infrequent:

  • Disable connection pre-warming
  • Reduce Maximum Connections
  • Lower Idle Timeout

This minimizes the resources held by idle connections between voice interactions.
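The four tuning directions above can be expressed as overrides applied to the documented defaults. In the sketch below, the defaults come from the option tables, but the field names, profile names, and every overridden number are illustrative assumptions rather than IB-X identifiers or recommended values. Appropriate values should come from testing with real conversations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SttSettings:
    """Documented defaults; field names are hypothetical."""
    utterance_end_timeout_ms: int = 300
    endpointing_silence_ms: int = 300
    event_poll_interval_ms: int = 50
    transcript_debounce_ms: int = 350
    prewarm_enabled: bool = True
    max_connections: int = 3
    idle_timeout_s: int = 300


DEFAULTS = SttSettings()

# Illustrative overrides only: the specific numbers are placeholders.
PROFILES = {
    "faster_responses": replace(
        DEFAULTS,
        utterance_end_timeout_ms=200,
        endpointing_silence_ms=200,
        prewarm_enabled=True,
    ),
    "transcript_stability": replace(
        DEFAULTS,
        utterance_end_timeout_ms=500,
        endpointing_silence_ms=500,
    ),
    "high_concurrency": replace(
        DEFAULTS,
        max_connections=10,
        prewarm_enabled=True,
    ),
    "resource_optimization": replace(
        DEFAULTS,
        prewarm_enabled=False,
        max_connections=1,
        idle_timeout_s=60,
    ),
}
```

Keeping each profile as a diff against the defaults makes it easy to see which settings a scenario actually changes and which are left alone.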

Best Practices

  • Use the default settings unless a specific tuning requirement exists.
  • Test with realistic conversation patterns and speaking styles.
  • Avoid excessively low endpointing values, which may cause incomplete transcripts.
  • Avoid excessively high endpointing values, which may make the agent feel slow to respond.
  • Enable connection pre-warming for frequently used voice agents.
  • Monitor connection pool usage in high-volume deployments.
  • Validate transcription behavior across different languages, accents, and speech providers.