
Text-to-Speech Configuration

Overview

Text-to-Speech (TTS) converts responses generated by the Conversational Agent into spoken audio.

In addition to selecting a speech provider, model, and voice, IB-X provides advanced controls that determine how AI-generated responses are prepared and streamed to the speech engine.

These settings influence:

  • Speech responsiveness
  • Playback smoothness
  • Streaming behavior
  • Natural speech flow
  • Pronunciation quality
  • Latency during voice conversations

Proper tuning helps balance fast response times with natural-sounding speech output.


How Text-to-Speech Works

In a typical speech generation flow, IB-X prepares content for speech generation as the Conversational Agent generates text, and determines how that text should be grouped into natural speaking segments.


Speech Generation

These settings control when speech generation begins and how text is prepared before being sent to the speech engine.

| Option | Default Value | Description |
| --- | --- | --- |
| Enable Early Synthesis | False | Determines whether speech generation begins while the AI response is still being streamed. When enabled, spoken responses start sooner but may be generated in smaller segments. When disabled, larger portions of text are accumulated before speech generation begins, often resulting in smoother playback. |
| Enable TTS Text Normalization | False | Determines whether text normalization is performed before speech generation. Normalization can improve pronunciation of URLs, currencies, emojis, abbreviations, and other non-standard |
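
The effect of Enable Early Synthesis can be pictured with a small sketch. This is not IB-X's implementation: the function name `feed_speech_engine` is hypothetical, and as a simplification the disabled case waits for the whole response, whereas the product only states that "larger portions of text" are accumulated.

```python
from typing import Iterable, Iterator


def feed_speech_engine(tokens: Iterable[str], early_synthesis: bool) -> Iterator[str]:
    """Yield the pieces of text that would be handed to a speech engine.

    Hypothetical helper for illustration only.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # With early synthesis, hand over each completed sentence right away
        # instead of waiting for the rest of the streamed response.
        if early_synthesis and buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    # Whatever is left (or everything, when early synthesis is off) goes last.
    if buffer.strip():
        yield buffer.strip()


streamed = ["Hello", " there.", " How", " are", " you?"]
early = list(feed_speech_engine(streamed, early_synthesis=True))
late = list(feed_speech_engine(streamed, early_synthesis=False))
```

With early synthesis on, the first sentence is available to speak before the second has finished streaming; with it off, the engine receives one larger piece of text.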

Response Segmentation

These settings control how long responses are divided into smaller speech-friendly segments.

Segmentation Settings

| Option | Default Value | Description |
| --- | --- | --- |
| Minimum Characters Per Segment | 20 | Minimum number of characters required before a segment can be spoken after the initial segment. |
| Minimum Words Per Segment | 4 | Minimum number of words required before a segment can be spoken after the initial segment. |
| Soft Boundary Minimum Words | 12 | Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many words are available. |
| Soft Boundary Minimum Characters | 80 | Allows speech generation to proceed even if a complete sentence boundary has not been reached, provided this many characters are available. |
| Target Cut Characters | 140 | Preferred segment size when dividing longer responses into speech chunks. |
| Minimum Cut Before Space | 50 | Prevents very small speech segments from being created when splitting text. |
| Split At Colon Boundary | False | Allows segmentation after colon-delimited phrases such as "Status:" or "Name:" which may improve list-style speech output. |
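
One way these thresholds could interact is sketched below. The field names, the `find_cut` function, and the exact rules are assumptions made for illustration: in particular, the sketch requires both soft-boundary minimums to be met and treats ".", "!", and "?" followed by whitespace as sentence boundaries. IB-X's actual segmentation logic may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentationSettings:
    # Defaults mirror the table above; the field names are hypothetical.
    min_chars_per_segment: int = 20
    min_words_per_segment: int = 4
    soft_boundary_min_words: int = 12
    soft_boundary_min_chars: int = 80
    target_cut_chars: int = 140
    min_cut_before_space: int = 50
    split_at_colon: bool = False


def find_cut(buffer: str, s: SegmentationSettings, is_first: bool) -> Optional[int]:
    """Return the index at which to cut `buffer`, or None to keep waiting."""
    boundary = r"[.!?:]" if s.split_at_colon else r"[.!?]"
    # Hard boundary: punctuation followed by whitespace.
    for match in re.finditer(boundary + r"(?=\s)", buffer):
        candidate = buffer[: match.end()]
        big_enough = (
            len(candidate) >= s.min_chars_per_segment
            and len(candidate.split()) >= s.min_words_per_segment
        )
        # The minimums apply only after the initial segment.
        if is_first or big_enough:
            return match.end()
    # Soft boundary (assumed to need both minimums): cut at the last space
    # at or before the target size, unless that would leave a tiny segment.
    if (
        len(buffer.split()) >= s.soft_boundary_min_words
        and len(buffer) >= s.soft_boundary_min_chars
    ):
        space = buffer.rfind(" ", 0, s.target_cut_chars + 1)
        if space >= s.min_cut_before_space:
            return space
    return None


defaults = SegmentationSettings()
first_cut = find_cut("Hi. How are you doing today", defaults, is_first=True)
later_cut = find_cut("Hi. How are you doing today", defaults, is_first=False)
```

In this sketch the short opening "Hi." is spoken immediately when it is the first segment, but the same text is held back later in a response because it falls below the per-segment minimums.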

Prosody Chunking

Prosody refers to the rhythm, pacing, and natural flow of spoken language.

These settings help create more natural speech by controlling how streamed text is grouped into spoken phrases.

Prosody Chunking Settings

| Option | Default Value | Description |
| --- | --- | --- |
| Minimum Words After First Chunk | 4 | Minimum number of additional words required before creating another speech segment after the first chunk has been spoken. |
| Minimum Words Before First Split | 2 | Minimum number of words required before allowing the first segmentation point in a streamed response. |
| Maximum Words Before Force Emit | 35 | Maximum number of words that may be accumulated before speech generation is forced to continue. This prevents excessive delays while waiting for additional text. |
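
The three prosody settings can be read as a small streaming rule, sketched here under stated assumptions: the generator name `prosody_chunks` is hypothetical, and a phrase is assumed to end at a word carrying trailing punctuation. How IB-X detects phrase boundaries is not specified in this page.

```python
from typing import Iterable, Iterator, List


def prosody_chunks(
    words: Iterable[str],
    min_words_before_first_split: int = 2,
    min_words_after_first_chunk: int = 4,
    max_words_before_force_emit: int = 35,
) -> Iterator[str]:
    """Group streamed words into spoken phrases (illustrative sketch)."""
    pending: List[str] = []
    emitted_first = False
    for word in words:
        pending.append(word)
        # A smaller minimum applies until the first chunk has been emitted.
        minimum = (
            min_words_after_first_chunk if emitted_first else min_words_before_first_split
        )
        # Assumption: trailing punctuation marks the end of a phrase.
        at_phrase_end = word.endswith((",", ".", "!", "?", ";"))
        forced = len(pending) >= max_words_before_force_emit
        if (at_phrase_end and len(pending) >= minimum) or forced:
            yield " ".join(pending)
            pending = []
            emitted_first = True
    # Flush whatever remains when the stream ends.
    if pending:
        yield " ".join(pending)


text = "Sure, I can help. Let me check that for you now, okay?"
chunks = list(prosody_chunks(text.split()))
```

Here "Sure," alone is too short to be the first chunk, so the first emitted phrase is "Sure, I can help."; a run of 35 words without punctuation would be emitted anyway rather than delaying speech further.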

Choosing Appropriate Settings

Faster Speech Start

For highly interactive conversations:

  • Enable Early Synthesis
  • Reduce Soft Boundary thresholds
  • Reduce Target Cut Characters

This allows the agent to begin speaking sooner after generating text.

Smoother Playback

For more natural and polished speech:

  • Disable Early Synthesis
  • Increase Target Cut Characters
  • Increase Soft Boundary thresholds

This produces larger speech segments and reduces abrupt pauses between chunks.
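
The two tuning directions above can be expressed as adjustments relative to the defaults. The sketch below is illustrative only: the `TtsTuning` field names are hypothetical, and the scaling factors (0.5 and 1.5) are arbitrary starting points for experimentation, not recommended values.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TtsTuning:
    # Defaults taken from the settings tables; field names are hypothetical.
    enable_early_synthesis: bool = False
    soft_boundary_min_words: int = 12
    soft_boundary_min_chars: int = 80
    target_cut_chars: int = 140


def scaled(base: TtsTuning, early: bool, factor: float) -> TtsTuning:
    """Return a copy of `base` with thresholds scaled by `factor`."""
    return replace(
        base,
        enable_early_synthesis=early,
        soft_boundary_min_words=max(1, round(base.soft_boundary_min_words * factor)),
        soft_boundary_min_chars=max(1, round(base.soft_boundary_min_chars * factor)),
        target_cut_chars=max(1, round(base.target_cut_chars * factor)),
    )


def faster_start(base: TtsTuning, factor: float = 0.5) -> TtsTuning:
    # Early synthesis on, smaller thresholds: speech begins sooner.
    return scaled(base, early=True, factor=factor)


def smoother_playback(base: TtsTuning, factor: float = 1.5) -> TtsTuning:
    # Early synthesis off, larger thresholds: bigger, smoother segments.
    return scaled(base, early=False, factor=factor)


DEFAULTS = TtsTuning()
fast = faster_start(DEFAULTS)
smooth = smoother_playback(DEFAULTS)
```

Keeping the presets as functions of the defaults makes it easy to try several factors and compare the resulting speech in realistic conversations.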

Structured Responses

For agents that frequently read lists, reports, or status updates:

  • Enable Split At Colon Boundary

This can make list-style output easier to follow and sound more natural when spoken.

Improved Pronunciation

For agents that frequently mention:

  • URLs
  • Email addresses
  • Currency values
  • Emojis
  • Technical abbreviations

Turn on Enable TTS Text Normalization to improve pronunciation quality.
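
To give a sense of what text normalization does, the sketch below rewrites a few patterns into speakable words. It is a toy example, not the normalizer used by IB-X or any speech provider: the function name, the abbreviation table, and the rules (whole-dollar amounts only, URL paths dropped) are all assumptions.

```python
import re

# Hypothetical abbreviation table; a real normalizer would be far larger.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def normalize_for_speech(text: str) -> str:
    """Rewrite a few hard-to-pronounce patterns into plain words."""

    def speak_url(match: re.Match) -> str:
        # Speak only the host name; any path after it is dropped.
        return match.group(1).replace(".", " dot ")

    # URLs: "https://example.com/docs" -> "example dot com"
    text = re.sub(r"https?://([\w-]+(?:\.[\w-]+)+)(/\S*)?", speak_url, text)
    # Whole-dollar amounts only: "$20" -> "20 dollars"
    text = re.sub(r"\$(\d+)\b", r"\1 dollars", text)
    # Abbreviations from the table above.
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    return text


spoken = normalize_for_speech(
    "See https://example.com for plans from $20, e.g. the basic tier."
)
```

Without a step like this, a speech engine may read the raw characters of a URL or currency symbol literally, which is the pronunciation problem the setting is meant to reduce.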


Best Practices

  • Use the default settings unless specific tuning is required.
  • Enable Early Synthesis only when minimizing response latency is more important than perfectly smooth playback.
  • Test speech quality with realistic conversation scenarios.
  • Turn on Enable TTS Text Normalization when responses frequently contain URLs, currencies, abbreviations, or symbols.
  • Avoid excessively small segment sizes, which may create choppy or unnatural speech.
  • Validate behavior across different speech providers and voices.
  • Test both short and long responses to ensure natural speech flow.