Text-to-Speech Configuration

Overview

Text-to-Speech (TTS) converts responses generated by the Conversational Agent into spoken audio.

In addition to selecting a speech provider, model, and voice, IB-X provides advanced controls that determine how AI-generated responses are prepared and streamed to the speech engine.

These settings influence:

Speech responsiveness
Playback smoothness
Streaming behavior
Natural speech flow
Pronunciation quality
Latency during voice conversations

Proper tuning helps balance fast response times with natural-sounding speech output.

How Text-to-Speech Works

A typical speech generation flow follows the sequence below:

As the Conversational Agent generates text, IB-X prepares the content for speech generation and determines how the text should be grouped into natural speaking segments.

Speech Generation

These settings control when speech generation begins and how text is prepared before being sent to the speech engine.

Option	Default Value	Description
Enable Early Synthesis	False	Determines whether speech generation begins while the AI response is still being streamed. When enabled, spoken responses start sooner but may be generated in smaller segments. When disabled, larger portions of text are accumulated before speech generation begins, often resulting in smoother playback.
Enable TTS Text Normalization	False	Determines whether text normalization is performed before speech generation. Normalization can improve pronunciation of URLs, currencies, emojis, abbreviations, and other non-standard text.
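
The trade-off behind Enable Early Synthesis can be pictured with a short Python sketch. The function name `speech_segments` and the sentence-boundary rule are assumptions made for illustration and are not IB-X's implementation. As a simplification, the disabled case here waits for the entire response, whereas the product only states that larger portions of text are accumulated.

```python
import re
from typing import Iterable, Iterator


def speech_segments(tokens: Iterable[str], early_synthesis: bool) -> Iterator[str]:
    """Hypothetical helper: group streamed text into segments for a speech engine."""
    buffer = ""
    for token in tokens:
        buffer += token
        if early_synthesis:
            # Emit each complete sentence as soon as it appears in the buffer.
            while (match := re.search(r"[.!?]\s+", buffer)):
                yield buffer[:match.end()].strip()
                buffer = buffer[match.end():]
    # Whatever is left (or everything, when early synthesis is off) goes out last.
    if buffer.strip():
        yield buffer.strip()


stream = ["Your order ", "has shipped. ", "It should arrive ", "on Friday."]
early = list(speech_segments(stream, early_synthesis=True))
late = list(speech_segments(stream, early_synthesis=False))
```

With early synthesis on, the first sentence is available to speak before the second has finished streaming. With it off, the sketch hands the engine a single, longer segment.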

Response Segmentation

These settings control how longer responses are divided into smaller, speech-friendly segments.

Segmentation Settings

Option	Default Value	Description
Minimum Characters Per Segment	20	Minimum number of characters required before a segment can be spoken after the initial segment.
Minimum Words Per Segment	4	Minimum number of words required before a segment can be spoken after the initial segment.
Soft Boundary Minimum Words	12	Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many words are available.
Soft Boundary Minimum Characters	80	Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many characters are available.
Target Cut Characters	140	Preferred segment size when dividing longer responses into speech chunks.
Minimum Cut Before Space	50	Prevents very small speech segments from being created when splitting text.
Split At Colon Boundary	False	Allows segmentation after colon-delimited phrases such as "Status:" or "Name:" which may improve list-style speech output.
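
One way to read the minimum and soft-boundary thresholds is as a gate on when buffered text may be sent for speech. The Python sketch below uses the default values from the table, but the field names, the `ready_to_speak` function, and the choice to require both the word and the character threshold are assumptions. The source does not state how IB-X combines them, and the sketch ignores the special handling of the initial segment.

```python
from dataclasses import dataclass


@dataclass
class SegmentationSettings:
    # Defaults copied from the table above; field names are hypothetical.
    min_chars_per_segment: int = 20
    min_words_per_segment: int = 4
    soft_boundary_min_words: int = 12
    soft_boundary_min_chars: int = 80


def ready_to_speak(buffer: str, settings: SegmentationSettings) -> bool:
    """Return True when the buffered text is large enough to be spoken."""
    text = buffer.strip()
    words = len(text.split())
    chars = len(text)
    at_sentence_end = text.endswith((".", "!", "?"))
    if at_sentence_end:
        # Assumption: a sentence boundary still requires both minimums.
        return (chars >= settings.min_chars_per_segment
                and words >= settings.min_words_per_segment)
    # Assumption: without a sentence boundary, both soft thresholds must be met.
    return (words >= settings.soft_boundary_min_words
            and chars >= settings.soft_boundary_min_chars)


settings = SegmentationSettings()
short_reply = ready_to_speak("Yes.", settings)
full_sentence = ready_to_speak("Your order has shipped today.", settings)
unfinished = ready_to_speak("Your order has shipped", settings)
long_unfinished = ready_to_speak(
    "We checked the warehouse records this morning and confirmed "
    "that your replacement package is currently",
    settings,
)
```

Under these assumptions a very short sentence is held back, a normal sentence is released at its boundary, and a long run of text without punctuation is released once it passes the soft-boundary thresholds.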

Prosody Chunking

Prosody refers to the rhythm, pacing, and natural flow of spoken language.

These settings help create more natural speech by controlling how streamed text is grouped into spoken phrases.

Prosody Chunking Settings

Option	Default Value	Description
Minimum Words After First Chunk	4	Minimum number of additional words required before creating another speech segment after the first chunk has been spoken.
Minimum Words Before First Split	2	Minimum number of words required before allowing the first segmentation point in a streamed response.
Maximum Words Before Force Emit	35	Maximum number of words that may be accumulated before speech generation is forced to continue. This prevents excessive delays while waiting for additional text.
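
The interaction between the three prosody settings can be sketched in Python as follows. The function `prosody_chunks`, its parameter names, and the use of punctuation as the pause signal are assumptions for illustration; the defaults mirror the table above, and the real chunking logic may differ.

```python
from typing import Iterable, Iterator, List


def prosody_chunks(
    words: Iterable[str],
    min_first: int = 2,        # Minimum Words Before First Split
    min_after_first: int = 4,  # Minimum Words After First Chunk
    max_words: int = 35,       # Maximum Words Before Force Emit
) -> Iterator[str]:
    """Hypothetical chunker: emit phrases at pauses, or when the buffer grows too long."""
    pending: List[str] = []
    first_emitted = False
    for word in words:
        pending.append(word)
        minimum = min_after_first if first_emitted else min_first
        at_pause = word.endswith((",", ".", "!", "?", ";"))
        # Emit at a natural pause once enough words are buffered,
        # or force an emit when the buffer reaches the maximum.
        if (at_pause and len(pending) >= minimum) or len(pending) >= max_words:
            yield " ".join(pending)
            pending = []
            first_emitted = True
    if pending:
        yield " ".join(pending)


reply = "Sure, I can help. Your balance is low, so please add funds soon."
chunks = list(prosody_chunks(reply.split()))
forced = list(prosody_chunks(["word"] * 7, max_words=3))
```

In this sketch the leading "Sure," is too short to be spoken alone, so it is held until the sentence ends. The second call shows the force-emit limit breaking up text that never reaches a pause.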

Choosing Appropriate Settings

Faster Speech Start

For highly interactive conversations:

Enable Early Synthesis
Reduce Soft Boundary thresholds
Reduce Target Cut Characters

This allows the agent to begin speaking sooner after generating text.

Smoother Playback

For more natural and polished speech:

Disable Early Synthesis
Increase Target Cut Characters
Increase Soft Boundary thresholds

This produces larger speech segments and reduces abrupt pauses between chunks.
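
The two tuning directions above can be written as overrides on the defaults. In this Python sketch the key names are hypothetical and the override values are illustrative only, not recommended or documented settings. Suitable values depend on the speech provider, voice, and conversation style.

```python
# Defaults taken from the tables in this document; key names are hypothetical.
DEFAULTS = {
    "enable_early_synthesis": False,
    "soft_boundary_min_words": 12,
    "soft_boundary_min_chars": 80,
    "target_cut_chars": 140,
}

# Illustrative values only: lower thresholds so speech starts sooner.
FASTER_START = {
    **DEFAULTS,
    "enable_early_synthesis": True,
    "soft_boundary_min_words": 8,
    "soft_boundary_min_chars": 50,
    "target_cut_chars": 100,
}

# Illustrative values only: higher thresholds so segments are larger.
SMOOTHER_PLAYBACK = {
    **DEFAULTS,
    "soft_boundary_min_words": 16,
    "soft_boundary_min_chars": 110,
    "target_cut_chars": 180,
}


def changed_settings(profile: dict) -> dict:
    """Return only the settings that differ from the defaults."""
    return {key: value for key, value in profile.items() if DEFAULTS[key] != value}


faster_diff = changed_settings(FASTER_START)
smoother_diff = changed_settings(SMOOTHER_PLAYBACK)
```

Listing only the differences from the defaults makes it easier to see what a profile changes and to revert a single setting during testing.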

Structured Responses

For agents that frequently read lists, reports, or status updates:

Enable Split At Colon Boundary

This can make list-style output easier to follow and more natural when spoken.
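
As a rough picture of colon-boundary splitting, the Python sketch below breaks text after a colon that is followed by whitespace. The function `split_at_colons` and its rule are assumptions for illustration; the exact rule IB-X applies is not described here.

```python
import re
from typing import List


def split_at_colons(text: str, split_at_colon: bool) -> List[str]:
    """Hypothetical splitter: break after label-style colons such as "Status:"."""
    if not split_at_colon:
        return [text]
    # Split on whitespace that directly follows a colon, keeping the colon
    # with its label. A colon inside a time such as 10:30 is left alone.
    return [part for part in re.split(r"(?<=:)\s+", text) if part]


label_parts = split_at_colons("Status: shipped", split_at_colon=True)
time_parts = split_at_colons("Arrives at 10:30 today", split_at_colon=True)
disabled = split_at_colons("Status: shipped", split_at_colon=False)
```

Keeping the label as its own segment gives the speech engine a natural place to pause between a field name and its value.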

Improved Pronunciation

For agents that frequently mention:

URLs
Email addresses
Currency values
Emojis
Technical abbreviations

Enable TTS Text Normalization to improve pronunciation quality.
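
To show the kind of rewriting that normalization performs, here is a minimal Python sketch. The rules, the `ABBREVIATIONS` table, and the function `normalize_for_speech` are hypothetical; IB-X and the underlying speech providers may normalize text quite differently, and this sketch does not handle details such as singular amounts or trailing punctuation on URLs.

```python
import re

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"approx.": "approximately", "vs.": "versus"}


def normalize_for_speech(text: str) -> str:
    """Rewrite a few non-standard tokens into words a voice can read aloud."""

    def speak_currency(match: re.Match) -> str:
        dollars, cents = match.group(1), match.group(2)
        spoken = f"{dollars} dollars"
        if cents:
            spoken += f" and {int(cents)} cents"
        return spoken

    # Amounts such as $5 or $12.50.
    text = re.sub(r"\$(\d+)(?:\.(\d{2}))?", speak_currency, text)
    # URLs: drop the scheme and read dots and slashes as words.
    text = re.sub(
        r"https?://(\S+)",
        lambda m: m.group(1).replace(".", " dot ").replace("/", " slash "),
        text,
    )
    # Expand known abbreviations.
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    return text


fee = normalize_for_speech("The fee is approx. $12.50")
link = normalize_for_speech("See https://example.com/help")
```

Without a step like this, a voice may read symbols literally or skip them, which is why normalization matters most for responses that are heavy in URLs, prices, and abbreviations.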

Best Practices

Use the default settings unless specific tuning is required.
Enable Early Synthesis only when minimizing response latency is more important than perfectly smooth playback.
Test speech quality with realistic conversation scenarios.
Enable TTS Text Normalization when responses frequently contain URLs, currencies, abbreviations, or symbols.
Avoid excessively small segment sizes, which may create choppy or unnatural speech.
Validate behavior across different speech providers and voices.
Test both short and long responses to ensure natural speech flow.
