Speech-to-Text Configuration
Overview
Speech-to-Text (STT) converts spoken user input into text that can be processed by the Conversational Agent.
In voice-enabled conversations, Speech-to-Text is responsible for receiving audio, identifying when a user has finished speaking, generating transcripts, and delivering those transcripts to the Conversational Agent.
Proper Speech-to-Text configuration helps improve:
- Transcription accuracy
- Conversation responsiveness
- Turn completion speed
- User experience
- Resource utilization
IB-X supports configurable speech providers and includes several advanced settings that control transcript generation, utterance completion, and connection management.
How Speech-to-Text Works
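A typical Speech-to-Text flow follows this sequence: audio is received, transcript updates are generated while the user speaks, the end of the utterance is detected, and a final transcript is delivered to the Conversational Agent.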
During a conversation, the Speech-to-Text service continuously processes incoming audio and generates transcript updates.
Once the speech input is considered complete, a final transcript is produced and sent to the Conversational Agent for processing.
Utterance Detection
These settings control how Speech-to-Text determines that a spoken phrase or sentence has completed.
| Option | Default Value | Description |
|---|---|---|
| Utterance End Timeout | 300 ms | Time to wait after speech activity stops before requesting a final transcript from the Speech-to-Text provider. Higher values reduce premature transcript completion but may increase response latency. |
| Endpointing Silence | 300 ms | Duration of silence that is treated as the end of a phrase or utterance. Lower values result in faster transcript completion, while higher values better tolerate thinking pauses and natural speech gaps. |
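
The effect of the silence threshold can be illustrated with a minimal sketch. It is not IB-X's implementation: the `UtteranceDetector` class, the `utterance_ended_at` helper, and the frame format (timestamp in milliseconds plus a speech flag) are assumptions made for illustration. The sketch only shows why a lower threshold ends an utterance sooner, while a higher one rides through a short pause.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class UtteranceDetector:
    """Hypothetical silence-based end-of-utterance detector."""

    endpointing_silence_ms: int = 300  # default value from the table above
    last_speech_ms: Optional[int] = None

    def on_audio_frame(self, now_ms: int, is_speech: bool) -> bool:
        """Return True once silence has lasted at least the configured threshold."""
        if is_speech:
            self.last_speech_ms = now_ms
            return False
        if self.last_speech_ms is None:
            return False  # nothing has been said yet, so there is nothing to end
        if now_ms - self.last_speech_ms >= self.endpointing_silence_ms:
            self.last_speech_ms = None  # reset for the next utterance
            return True
        return False


def utterance_ended_at(frames: Iterable[Tuple[int, bool]], silence_ms: int) -> Optional[int]:
    """Return the timestamp at which the first utterance is considered complete."""
    detector = UtteranceDetector(endpointing_silence_ms=silence_ms)
    for now_ms, is_speech in frames:
        if detector.on_audio_frame(now_ms, is_speech):
            return now_ms
    return None


# Speech, a pause of a few hundred milliseconds, more speech, then a long silence.
FRAMES = (
    [(0, True), (100, True), (200, False), (300, False), (400, False), (500, True)]
    + [(t, False) for t in range(600, 1300, 100)]
)

print(utterance_ended_at(FRAMES, 300))  # ends during the short pause: 400
print(utterance_ended_at(FRAMES, 600))  # tolerates the pause, ends later: 1100
```

With the 300 ms threshold the speaker is cut off mid-thought; with 600 ms the full phrase is captured, but the agent waits longer before it can respond.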
Transcript Processing
These settings control how transcript updates are processed and delivered.
| Option | Default Value | Description |
|---|---|---|
| Event Poll Interval | 50 ms | Frequency at which transcript updates are checked and processed. Lower values may provide slightly faster transcript updates at the cost of additional processing activity. |
| Transcript Debounce | 350 ms | Short delay applied after a final transcript is received and before processing begins. This prevents multiple transcript completion events from triggering duplicate processing. |
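
The interaction between polling and debouncing can be sketched as follows. This is an illustrative model only: the `TranscriptDebouncer` class and `simulate` helper are hypothetical, and the choice to let a newer final transcript replace the pending one and restart the delay is an assumption, not documented IB-X behavior.

```python
from typing import List, Optional, Tuple


class TranscriptDebouncer:
    """Hypothetical debouncer: holds a final transcript until the delay has elapsed."""

    def __init__(self, debounce_ms: int = 350):  # default value from the table above
        self.debounce_ms = debounce_ms
        self._pending: Optional[str] = None
        self._received_ms = 0

    def on_final_transcript(self, text: str, now_ms: int) -> None:
        # Assumption: a newer final transcript replaces the pending one
        # and restarts the debounce delay.
        self._pending = text
        self._received_ms = now_ms

    def poll(self, now_ms: int) -> Optional[str]:
        """Called once per poll interval; returns a transcript when it is ready."""
        if self._pending is not None and now_ms - self._received_ms >= self.debounce_ms:
            text, self._pending = self._pending, None
            return text
        return None


def simulate(
    events: List[Tuple[int, str]],
    poll_interval_ms: int = 50,
    debounce_ms: int = 350,
    end_ms: int = 1000,
) -> List[Tuple[int, str]]:
    """Replay final-transcript events and record when each one is handed on."""
    debouncer = TranscriptDebouncer(debounce_ms)
    delivered = []
    for now_ms in range(0, end_ms + 1, poll_interval_ms):
        for event_ms, text in events:
            if event_ms == now_ms:
                debouncer.on_final_transcript(text, now_ms)
        ready = debouncer.poll(now_ms)
        if ready is not None:
            delivered.append((now_ms, ready))
    return delivered


# Two completion events arrive 100 ms apart; only the later one is processed.
print(simulate([(0, "book a flight"), (100, "book a flight to Oslo")]))
# [(450, 'book a flight to Oslo')]
```

In this model the debounce delay adds directly to response latency, so it is a trade-off between avoiding redundant work and responding quickly.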
Connection Pooling
Speech-to-Text services often require establishing network connections before transcription can begin.
Connection pooling helps reduce startup latency by keeping a number of ready-to-use connections available.
Connection Pool Settings
| Option | Default Value | Description |
|---|---|---|
| Pre-Warm Enabled | True | Determines whether idle Speech-to-Text connections are created and maintained in advance. Pre-warming can reduce the delay experienced when speech recognition starts. |
| Maximum Connections | 3 | Maximum number of idle Speech-to-Text connections maintained in the connection pool. |
| Idle Timeout | 300 seconds | Amount of time an unused connection remains in the pool before it is automatically closed. |
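
To show how the three pool settings relate, here is a minimal sketch of a pre-warmed pool. It is an assumption-laden illustration rather than the IB-X implementation: `SttConnectionPool`, `FakeConnection`, and the injectable `clock` parameter are hypothetical names, and a real provider connection would open a network session instead of setting a flag.

```python
import time
from collections import deque


class FakeConnection:
    """Stand-in for a provider connection; a real one would open a network session."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SttConnectionPool:
    """Hypothetical pool of idle, ready-to-use Speech-to-Text connections."""

    def __init__(self, connect, pre_warm_enabled=True, max_connections=3,
                 idle_timeout_s=300.0, clock=time.monotonic):
        self._connect = connect
        self._max_connections = max_connections
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._idle = deque()  # (connection, time it became idle), oldest first
        if pre_warm_enabled:
            self.warm_up()

    @property
    def idle_count(self):
        return len(self._idle)

    def warm_up(self):
        """Open connections in advance until the idle limit is reached."""
        while len(self._idle) < self._max_connections:
            self._idle.append((self._connect(), self._clock()))

    def evict_expired(self):
        """Close connections idle for at least the timeout; return how many closed."""
        now = self._clock()
        closed = 0
        while self._idle and now - self._idle[0][1] >= self._idle_timeout_s:
            connection, _ = self._idle.popleft()
            connection.close()
            closed += 1
        return closed

    def acquire(self):
        """Return a warm connection if one is available, else open a new one."""
        self.evict_expired()
        if self._idle:
            return self._idle.popleft()[0]
        return self._connect()  # cold start: the caller pays the connection delay


pool = SttConnectionPool(FakeConnection)
print(pool.idle_count)  # 3 connections were opened before any session started
connection = pool.acquire()
print(pool.idle_count)  # 2 remain idle
```

With pre-warming disabled, the pool starts empty and every `acquire` call opens a connection on demand, which is the startup delay that pre-warming is meant to avoid.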
Choosing Appropriate Settings
Faster Responses
For highly interactive conversations:
- Reduce Utterance End Timeout
- Reduce Endpointing Silence
- Enable connection pre-warming
This allows transcripts to complete sooner and reduces the time before the agent begins responding.
Improved Transcript Stability
For users who frequently pause while speaking:
- Increase Utterance End Timeout
- Increase Endpointing Silence
This reduces the likelihood of transcripts being finalized prematurely.
High-Concurrency Environments
For deployments with many simultaneous voice sessions:
- Increase Maximum Connections
- Enable connection pre-warming
- Monitor connection utilization
This can help reduce connection establishment delays during peak usage.
Resource Optimization
For environments where voice usage is infrequent:
- Disable connection pre-warming
- Reduce Maximum Connections
- Reduce Idle Timeout
This minimizes the resources held by idle connections between voice sessions.
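
The four tuning scenarios in this section can be expressed as overrides on top of the documented defaults. The sketch below is illustrative only: the setting keys, the profile names, and every non-default number are assumptions chosen to show the direction of each change, not recommended values or actual IB-X configuration keys.

```python
# Defaults taken from the tables in this document; key names are hypothetical.
DEFAULTS = {
    "utterance_end_timeout_ms": 300,
    "endpointing_silence_ms": 300,
    "event_poll_interval_ms": 50,
    "transcript_debounce_ms": 350,
    "pre_warm_enabled": True,
    "max_connections": 3,
    "idle_timeout_s": 300,
}

# Illustrative overrides; the numbers only show the direction of each adjustment.
PROFILES = {
    "faster_responses": {
        "utterance_end_timeout_ms": 200,
        "endpointing_silence_ms": 200,
        "pre_warm_enabled": True,
    },
    "transcript_stability": {
        "utterance_end_timeout_ms": 600,
        "endpointing_silence_ms": 600,
    },
    "high_concurrency": {
        "max_connections": 10,
        "pre_warm_enabled": True,
    },
    "resource_optimization": {
        "pre_warm_enabled": False,
        "max_connections": 1,
        "idle_timeout_s": 60,
    },
}


def build_config(profile=None):
    """Return the defaults with the chosen profile's overrides applied."""
    config = dict(DEFAULTS)
    if profile is not None:
        config.update(PROFILES[profile])
    return config


config = build_config("resource_optimization")
print(config["pre_warm_enabled"], config["max_connections"], config["idle_timeout_s"])
```

Keeping the overrides separate from the defaults makes it easy to see exactly which settings a deployment has changed and to return to the defaults when a tuning experiment does not help.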
Best Practices
- Use the default settings unless a specific tuning requirement exists.
- Test with realistic conversation patterns and speaking styles.
- Avoid excessively low endpointing values, which may cause incomplete transcripts.
- Avoid excessively high endpointing values, which may make the agent feel slow to respond.
- Enable connection pre-warming for frequently used voice agents.
- Monitor connection pool usage in high-volume deployments.
- Validate transcription behavior across different languages, accents, and speech providers.