Text-to-Speech Configuration
Overview
Text-to-Speech (TTS) converts responses generated by the Conversational Agent into spoken audio.
Beyond the choice of speech provider, model, and voice, IB-X provides advanced controls that determine how AI-generated responses are prepared and streamed to the speech engine.
These settings influence:
- Speech responsiveness
- Playback smoothness
- Streaming behavior
- Natural speech flow
- Pronunciation quality
- Latency during voice conversations
Proper tuning helps balance fast response times with natural-sounding speech output.
How Text-to-Speech Works
A typical speech generation flow follows the sequence below:

As the Conversational Agent generates text, IB-X prepares the content for speech generation and determines how the text should be grouped into natural speaking segments.
Speech Generation
These settings control when speech generation begins and how text is prepared before being sent to the speech engine.
| Option | Default Value | Description |
|---|---|---|
| Enable Early Synthesis | False | Determines whether speech generation begins while the AI response is still being streamed. When enabled, spoken responses start sooner but may be generated in smaller segments. When disabled, larger portions of text are accumulated before speech generation begins, often resulting in smoother playback. |
| Enable TTS Text Normalization | False | Determines whether text normalization is performed before speech generation. Normalization can improve pronunciation of URLs, currencies, emojis, abbreviations, and other non-standard text. |
Response Segmentation
These settings control how long responses are divided into smaller speech-friendly segments.
Segmentation Settings
| Option | Default Value | Description |
|---|---|---|
| Minimum Characters Per Segment | 20 | Minimum number of characters required before a segment can be spoken after the initial segment. |
| Minimum Words Per Segment | 4 | Minimum number of words required before a segment can be spoken after the initial segment. |
| Soft Boundary Minimum Words | 12 | Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many words are available. |
| Soft Boundary Minimum Characters | 80 | Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many characters are available. |
| Target Cut Characters | 140 | Preferred segment size when dividing longer responses into speech chunks. |
| Minimum Cut Before Space | 50 | Prevents very small speech segments from being created when splitting text. |
| Split At Colon Boundary | False | Allows segmentation after colon-delimited phrases such as "Status:" or "Name:" which may improve list-style speech output. |
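The interaction between these thresholds can be easier to follow in code. The sketch below is illustrative only and is not IB-X's implementation: the function names `ready_to_emit` and `split_for_speech` are hypothetical, and it assumes that both soft-boundary thresholds must be met together and that Minimum Cut Before Space is measured in characters. Split At Colon Boundary is not modeled.

```python
import re

# A sentence end followed by whitespace (assumed definition of a "hard" boundary).
_SENTENCE_END = re.compile(r"[.!?](?=\s)")


def ready_to_emit(buffer, min_chars=20, min_words=4,
                  soft_min_words=12, soft_min_chars=80):
    """Decide whether buffered text (after the initial segment) may be spoken."""
    text = buffer.strip()
    chars, words = len(text), len(text.split())
    if text.endswith((".", "!", "?")):
        # Complete sentence: only the per-segment minimums apply.
        return chars >= min_chars and words >= min_words
    # No sentence boundary yet: fall back to the soft-boundary thresholds.
    return words >= soft_min_words and chars >= soft_min_chars


def split_for_speech(text, target_cut_chars=140, min_cut_before_space=50):
    """Divide a long response into segments of at most target_cut_chars."""
    segments = []
    rest = text.strip()
    while len(rest) > target_cut_chars:
        window = rest[:target_cut_chars + 1]
        # Prefer the last sentence end that is not too close to the start.
        ends = [m.end() for m in _SENTENCE_END.finditer(window)
                if m.end() >= min_cut_before_space]
        if ends:
            cut = ends[-1]
        else:
            # Otherwise cut at the last space, unless that leaves a tiny segment.
            space = window.rfind(" ")
            cut = space if space >= min_cut_before_space else target_cut_chars
        segments.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        segments.append(rest)
    return segments
```

With the default values, a short buffer such as "Okay." is held back until more text arrives, while a long response is cut near the 140-character target at a sentence end where one is available.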
Prosody Chunking
Prosody refers to the rhythm, pacing, and natural flow of spoken language.
These settings help create more natural speech by controlling how streamed text is grouped into spoken phrases.
Prosody Chunking Settings
| Option | Default Value | Description |
|---|---|---|
| Minimum Words After First Chunk | 4 | Minimum number of additional words required before creating another speech segment after the first chunk has been spoken. |
| Minimum Words Before First Split | 2 | Minimum number of words required before allowing the first segmentation point in a streamed response. |
| Maximum Words Before Force Emit | 35 | Maximum number of words that may be accumulated before speech generation is forced to continue. This prevents excessive delays while waiting for additional text. |
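As a rough illustration of how these three settings could interact on a streamed response, the following sketch groups incoming words into spoken phrases. It is an assumption-laden example, not the IB-X algorithm: `prosody_chunks` is a hypothetical name, the stream is assumed to yield whole words, and a "natural pause" is assumed to be any word ending in common punctuation.

```python
from typing import Iterable, Iterator


def prosody_chunks(words: Iterable[str],
                   min_words_before_first_split: int = 2,
                   min_words_after_first_chunk: int = 4,
                   max_words_before_force_emit: int = 35) -> Iterator[str]:
    """Group streamed words into phrases at punctuation, with a forced upper bound."""
    pending = []
    first_emitted = False
    for word in words:
        pending.append(word)
        # The first chunk may be shorter so that speech starts quickly.
        needed = (min_words_after_first_chunk if first_emitted
                  else min_words_before_first_split)
        at_pause = word[-1] in ".,;!?"  # assumed pause markers
        forced = len(pending) >= max_words_before_force_emit
        if (at_pause and len(pending) >= needed) or forced:
            yield " ".join(pending)
            pending = []
            first_emitted = True
    if pending:
        # Flush whatever remains when the stream ends.
        yield " ".join(pending)


stream = "Sure, I can help with that. Let me check your account now.".split()
for chunk in prosody_chunks(stream):
    print(chunk)
```

In this sketch, "Sure," alone is not emitted because it falls below the two-word minimum for the first split, and a stream with no punctuation at all is still emitted once 35 words have accumulated.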
Choosing Appropriate Settings
Faster Speech Start
For highly interactive conversations:
- Enable Early Synthesis
- Reduce the Soft Boundary thresholds (Soft Boundary Minimum Words and Soft Boundary Minimum Characters)
- Reduce Target Cut Characters
This allows the agent to begin speaking sooner after generating text.
Smoother Playback
For more natural and polished speech:
- Disable Early Synthesis
- Increase Target Cut Characters
- Increase the Soft Boundary thresholds (Soft Boundary Minimum Words and Soft Boundary Minimum Characters)
This produces larger speech segments and reduces abrupt pauses between chunks.
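The two tuning directions described under Faster Speech Start and Smoother Playback can be summarized as presets. The sketch below is hypothetical: the `TtsSettings` class and its field names are invented for illustration and do not correspond to a documented IB-X API. The defaults mirror the tables in this document, while the adjusted preset values are arbitrary examples rather than recommendations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TtsSettings:
    # Defaults taken from the option tables in this document.
    enable_early_synthesis: bool = False
    enable_tts_text_normalization: bool = False
    min_chars_per_segment: int = 20
    min_words_per_segment: int = 4
    soft_boundary_min_words: int = 12
    soft_boundary_min_chars: int = 80
    target_cut_chars: int = 140
    min_cut_before_space: int = 50
    split_at_colon_boundary: bool = False


DEFAULTS = TtsSettings()

# Faster speech start: begin synthesis early and emit smaller segments.
# The specific numbers are illustrative only.
FAST_START = replace(
    DEFAULTS,
    enable_early_synthesis=True,
    soft_boundary_min_words=8,
    soft_boundary_min_chars=50,
    target_cut_chars=100,
)

# Smoother playback: accumulate more text before each segment is spoken.
SMOOTH_PLAYBACK = replace(
    DEFAULTS,
    enable_early_synthesis=False,
    soft_boundary_min_words=16,
    soft_boundary_min_chars=120,
    target_cut_chars=200,
)
```

Deriving each preset from the defaults keeps the unchanged options visible in one place and makes it clear which values were deliberately tuned.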
Structured Responses
For agents that frequently read lists, reports, or status updates:
- Enable Split At Colon Boundary
This can make spoken lists and labeled fields easier to follow and sound more natural.
Improved Pronunciation
For agents that frequently mention:
- URLs
- Email addresses
- Currency values
- Emojis
- Technical abbreviations
Enable TTS Text Normalization to improve pronunciation quality.
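To show the kind of rewriting that normalization involves, here is a minimal sketch that expands dollar amounts, URLs, and a few abbreviations into speakable text. It does not reflect how IB-X or any speech provider performs normalization: the function `normalize_for_speech`, the abbreviation table, and the spoken forms are all assumptions, and emojis and email addresses are not handled.

```python
import re

# Illustrative abbreviation table; a real system would use a much larger one.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def _speak_dollars(match):
    dollars, cents = match.group(1), match.group(2)
    spoken = f"{dollars} dollar" + ("" if dollars == "1" else "s")
    if cents and int(cents) > 0:
        spoken += f" and {int(cents)} cent" + ("" if int(cents) == 1 else "s")
    return spoken


def _speak_url(match):
    # Drop the scheme and read separators aloud.
    return match.group(1).replace(".", " dot ").replace("/", " slash ")


def normalize_for_speech(text):
    """Rewrite hard-to-pronounce tokens before sending text to a speech engine."""
    # URLs first; trailing sentence punctuation is left outside the match.
    text = re.sub(r"https?://(\S*[^\s.,!?;:])", _speak_url, text)
    # Dollar amounts such as $12.50 or $5.
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?", _speak_dollars, text)
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    return text


print(normalize_for_speech("Pay $12.50 at https://example.com/help."))
# prints: Pay 12 dollars and 50 cents at example dot com slash help.
```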
Best Practices
- Use the default settings unless specific tuning is required.
- Enable Early Synthesis only when minimizing response latency is more important than perfectly smooth playback.
- Test speech quality with realistic conversation scenarios.
- Enable TTS Text Normalization when responses frequently contain URLs, currencies, abbreviations, or symbols.
- Avoid excessively small segment sizes, which may create choppy or unnatural speech.
- Validate behavior across different speech providers and voices.
- Test both short and long responses to ensure natural speech flow.