Version: Current

Speech-to-Text Configuration

Overview

Speech-to-Text (STT) converts spoken user input into text that can be processed by the Conversational Agent.

In voice-enabled conversations, Speech-to-Text is responsible for receiving audio, identifying when a user has finished speaking, generating transcripts, and delivering those transcripts to the Conversational Agent.

Proper Speech-to-Text configuration helps improve:

Transcription accuracy
Conversation responsiveness
Turn completion speed
User experience
Resource utilization

IB-X supports configurable speech providers and includes several advanced settings that control transcript generation, utterance completion, and connection management.

How Speech-to-Text Works

A typical Speech-to-Text flow has two stages.

During a conversation, the Speech-to-Text service continuously processes incoming audio and generates transcript updates.

Once the speech input is considered complete, a final transcript is produced and sent to the Conversational Agent for processing.
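The flow above can be sketched as a small routing loop. This is a minimal illustration, not IB-X's implementation: `TranscriptEvent`, its `is_final` flag, and the two callbacks are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TranscriptEvent:
    text: str
    is_final: bool  # hypothetical flag: True once the speech input is considered complete


def route_transcripts(
    events: Iterable[TranscriptEvent],
    on_interim: Callable[[str], None],
    send_to_agent: Callable[[str], None],
) -> None:
    """Pass interim updates to on_interim and final transcripts to the agent."""
    for event in events:
        if event.is_final:
            send_to_agent(event.text)
        else:
            on_interim(event.text)
```

In this sketch only final transcripts reach the Conversational Agent; interim updates are available for uses such as live captions.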

Utterance Detection

These settings control how Speech-to-Text determines that a spoken phrase or sentence has completed.

Option	Default Value	Description
Utterance End Timeout	300 ms	Time to wait after speech activity stops before requesting a final transcript from the Speech-to-Text provider. Higher values reduce premature transcript completion but may increase response latency.
Endpointing Silence	300 ms	Duration of silence that is treated as the end of a phrase or utterance. Lower values result in faster transcript completion, while higher values better tolerate thinking pauses and natural speech gaps.
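To make the latency trade-off concrete, here is a simplified model of silence-based endpointing. It assumes audio arrives as fixed-length frames already labeled as speech or silence; the function name, the frame representation, and the 20 ms frame length are assumptions for illustration. It models only Endpointing Silence, not Utterance End Timeout, and is not the provider's actual algorithm.

```python
from typing import Optional, Sequence


def utterance_end_ms(
    speech_frames: Sequence[bool],
    frame_ms: int = 20,
    endpointing_silence_ms: int = 300,
) -> Optional[int]:
    """Return the time (ms from start) at which the utterance is treated as complete.

    speech_frames[i] is True when frame i contains speech. The utterance ends
    once speech has been heard and is followed by enough continuous silence.
    Returns None if that never happens within the given frames.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(speech_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= endpointing_silence_ms:
                return (index + 1) * frame_ms
    return None
```

For a speaker who says a few words, pauses for 200 ms, and then continues, a 300 ms threshold waits through the pause, while a 150 ms threshold ends the utterance in the middle of it.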

Transcript Processing

These settings control how transcript updates are processed and delivered.

Option	Default Value	Description
Event Poll Interval	50 ms	Frequency at which transcript updates are checked and processed. Lower values may provide slightly faster transcript updates at the cost of additional processing activity.
Transcript Debounce	350 ms	Small delay applied after receiving a final transcript before processing begins. This helps reduce duplicate processing and prevents multiple transcript completion events from triggering redundant work.
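One plausible reading of Transcript Debounce is a trailing-edge debounce: when several final transcripts arrive in quick succession, only the last one in the burst is processed, after the debounce delay has passed. The sketch below works on a recorded list of events rather than live timers; `debounce_finals` and the event tuple shape are hypothetical, and the real service may behave differently.

```python
from typing import List, Tuple

FinalEvent = Tuple[int, str]  # (arrival time in ms, transcript text)


def debounce_finals(events: List[FinalEvent], debounce_ms: int = 350) -> List[FinalEvent]:
    """Collapse bursts of final transcripts into one processing step each.

    Events must be sorted by arrival time. A burst ends when no further final
    transcript arrives within debounce_ms; the latest transcript in the burst
    is processed at (its arrival time + debounce_ms).
    """
    processed: List[FinalEvent] = []
    for position, (arrived_ms, text) in enumerate(events):
        is_last = position == len(events) - 1
        if is_last or events[position + 1][0] - arrived_ms >= debounce_ms:
            processed.append((arrived_ms + debounce_ms, text))
    return processed
```

With the default of 350 ms, two final transcripts that arrive 100 ms apart trigger a single processing step instead of two.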

Connection Pooling

Speech-to-Text services often require establishing network connections before transcription can begin.

Connection pooling helps reduce startup latency by keeping a number of ready-to-use connections available.

Connection Pool Settings

Option	Default Value	Description
Pre-Warm Enabled	True	Determines whether idle Speech-to-Text connections are created and maintained in advance. Pre-warming can reduce the delay experienced when speech recognition starts.
Maximum Connections	3	Maximum number of idle Speech-to-Text connections maintained in the connection pool.
Idle Timeout	300 seconds	Amount of time an unused connection remains in the pool before it is automatically closed.
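The three settings above can be illustrated with a minimal idle-connection pool. This is a sketch under stated assumptions, not the IB-X pool: the class name, the `connect` factory, and the injectable `clock` are hypothetical, and the sketch only pre-warms at construction time rather than continuously refilling the pool.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class IdleConnectionPool:
    """Keeps up to max_connections idle connections ready for reuse."""

    def __init__(
        self,
        connect: Callable[[], object],
        pre_warm: bool = True,
        max_connections: int = 3,
        idle_timeout_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max = max_connections
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._idle: Deque[Tuple[float, object]] = deque()  # (idle since, connection)
        if pre_warm:
            for _ in range(max_connections):
                self._idle.append((self._clock(), self._connect()))

    def _evict_expired(self) -> None:
        now = self._clock()
        # Oldest entries sit on the left, so expired ones are always found there.
        while self._idle and now - self._idle[0][0] >= self._idle_timeout_s:
            self._idle.popleft()  # a real pool would also close the connection here

    def acquire(self) -> object:
        """Reuse an idle connection if one is available, else open a new one."""
        self._evict_expired()
        if self._idle:
            return self._idle.pop()[1]
        return self._connect()

    def release(self, connection: object) -> None:
        """Return a connection to the pool, or drop it if the pool is full."""
        self._evict_expired()
        if len(self._idle) < self._max:
            self._idle.append((self._clock(), connection))

    @property
    def idle_count(self) -> int:
        self._evict_expired()
        return len(self._idle)
```

With pre-warming enabled, the first `acquire()` call returns an already-open connection, which is where the startup latency saving comes from. With it disabled, the first call pays the connection cost.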

Choosing Appropriate Settings

Faster Responses

For highly interactive conversations:

Reduce Utterance End Timeout
Reduce Endpointing Silence
Enable connection pre-warming

This allows transcripts to complete sooner and reduces the time before the agent begins responding.

Improved Transcript Stability

For users who frequently pause while speaking:

Increase Utterance End Timeout
Increase Endpointing Silence

This reduces the likelihood of transcripts being finalized prematurely.

High-Concurrency Environments

For deployments with many simultaneous voice sessions:

Increase Maximum Connections
Enable connection pre-warming
Monitor connection utilization

This can help reduce connection establishment delays during peak usage.

Resource Optimization

For environments where voice usage is infrequent:

Disable connection pre-warming
Reduce Maximum Connections
Lower Idle Timeout

This minimizes the resources consumed by idle connections.
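The four tuning profiles above can be expressed as overrides on the documented defaults. In the sketch below the field names are hypothetical (they are not IB-X configuration keys), and the override values are illustrative numbers chosen only to show the direction of each change. Only the defaults come from the option tables.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SttSettings:
    # Defaults taken from the option tables in this document.
    utterance_end_timeout_ms: int = 300
    endpointing_silence_ms: int = 300
    event_poll_interval_ms: int = 50
    transcript_debounce_ms: int = 350
    pre_warm_enabled: bool = True
    max_connections: int = 3
    idle_timeout_s: int = 300


DEFAULTS = SttSettings()

# Illustrative values only; tune against real conversations.
FASTER_RESPONSES = replace(
    DEFAULTS, utterance_end_timeout_ms=200, endpointing_silence_ms=200
)
TRANSCRIPT_STABILITY = replace(
    DEFAULTS, utterance_end_timeout_ms=600, endpointing_silence_ms=600
)
HIGH_CONCURRENCY = replace(DEFAULTS, max_connections=10)
RESOURCE_OPTIMIZED = replace(
    DEFAULTS, pre_warm_enabled=False, max_connections=1, idle_timeout_s=60
)
```

Keeping each profile as a small delta from the defaults makes it easy to see which options were changed and to revert to the defaults when no specific tuning requirement exists.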

Best Practices

Use the default settings unless a specific tuning requirement exists.
Test with realistic conversation patterns and speaking styles.
Avoid excessively low endpointing values, which may cause incomplete transcripts.
Avoid excessively high endpointing values, which may make the agent feel slow to respond.
Enable connection pre-warming for frequently used voice agents.
Monitor connection pool usage in high-volume deployments.
Validate transcription behavior across different languages, accents, and speech providers.
