Text-to-speech API

jambonz provides native support for many TTS vendors. If you want to integrate with one we don’t yet support, you can do so by writing a small server that implements our API.

There are two APIs you can implement, depending on whether your vendor supports streaming:

  • Non-streaming HTTP API — jambonz posts a full text payload to your server and you return a single audio file. Use this when the vendor’s API is request/response.
  • Streaming WebSocket API — jambonz opens a persistent WebSocket and streams text tokens to your server as they arrive (typically from an LLM); your server streams synthesized audio back. Use this when the vendor supports incremental synthesis (ElevenLabs WebSocket, Cartesia, Rime, etc.) — it dramatically reduces time-to-first-audio for LLM-driven applications.

When you add a custom speech vendor in the jambonz portal you provide an HTTPS URL (for non-streaming) and/or a WebSocket URL (for streaming). You may provide either or both. The same API key is used for both.

Want working code? Check out these examples.

Non-streaming HTTP API

jambonz sends an HTTP POST containing the text to be synthesized and associated properties such as language and voice. Your server returns the synthesized audio as the response body.

Authentication

An Authorization header is sent on the HTTP request:

Authorization: Bearer <apiKey>

The API key is the value you provide when you create the custom speech vendor in the jambonz portal.
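
A minimal sketch of how a server might check this header. The function name `is_authorized` is hypothetical, and the constant-time comparison is a general precaution rather than something jambonz requires:

```python
import hmac


def is_authorized(authorization_header, api_key):
    """Return True if the header has the form 'Bearer <apiKey>' with the expected key."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(token.strip().encode(), api_key.encode())
```

A server would typically respond with an HTTP 401 when this returns False.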

Request body attributes

| Property | Type | Description |
| --- | --- | --- |
| language | String | ISO language code (e.g. en-US) |
| voice | String | Name of the voice to use |
| type | String | text or ssml |
| text | String | Text to be synthesized. If type=ssml, must be enclosed in <speak> tags. |
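
As an illustration, a server could validate the incoming body along these lines. This is a sketch, not the reference implementation: the function name `parse_synthesis_request` is hypothetical, and it assumes the body is JSON and that all four properties are always present as strings.

```python
import json

REQUIRED_FIELDS = ("language", "voice", "type", "text")


def parse_synthesis_request(raw_body):
    """Parse and validate the POST body; raise ValueError if it is malformed."""
    body = json.loads(raw_body)
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    for field in REQUIRED_FIELDS:
        # Assumption: every property is required and is a string.
        if not isinstance(body.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    if body["type"] not in ("text", "ssml"):
        raise ValueError("type must be 'text' or 'ssml'")
    if body["type"] == "ssml":
        text = body["text"].strip()
        if not (text.startswith("<speak") and text.endswith("</speak>")):
            raise ValueError("ssml text must be enclosed in <speak> tags")
    return {field: body[field] for field in REQUIRED_FIELDS}
```

A failed validation would map naturally to an HTTP 400 response.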

Response

Return a 200 OK containing the synthesized audio in the response body, or an HTTP error code on failure. The audio format is indicated via the Content-Type header. Allowed values:

| Content-Type | Format |
| --- | --- |
| audio/mpeg or audio/mp3 | MP3 (preferred) |
| audio/wav or audio/x-wav | Linear PCM with a WAVE header |
| audio/l16;rate=8000 | Linear16 PCM at 8 kHz |
| audio/l16;rate=16000 | Linear16 PCM at 16 kHz |
| audio/l16;rate=24000 | Linear16 PCM at 24 kHz |
| audio/l16;rate=32000 | Linear16 PCM at 32 kHz |
| audio/l16;rate=48000 | Linear16 PCM at 48 kHz |
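
If your vendor returns raw PCM, one way to produce the audio/wav variant is to prepend a WAVE header with Python's standard `wave` module. The sketch below assumes 16-bit little-endian mono samples (the channel count is an assumption, not something stated above); `wav_response` is a hypothetical helper name.

```python
import io
import wave


def wav_response(pcm, sample_rate):
    """Wrap raw 16-bit mono PCM in a WAVE container; return (headers, body)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # assumption: mono audio
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    body = buf.getvalue()
    headers = {"Content-Type": "audio/wav", "Content-Length": str(len(body))}
    return headers, body
```

The returned headers and body can be written out by whatever HTTP framework the server uses, together with a 200 status.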

Streaming WebSocket API

The streaming API lets jambonz push text to your server incrementally and receive audio back as it’s produced, eliminating the wait for a full HTTP response. This is what allows LLM text generation and TTS synthesis to overlap.

When jambonz needs to synthesize speech, it opens a WebSocket to your server, sends a sequence of text fragments and a flush signal, then keeps the connection open for further synthesis on the same call. Your server is responsible for translating that into whatever wire protocol your TTS vendor uses (ElevenLabs WebSocket, Cartesia, etc.) and streaming the resulting audio back.

Connection

jambonz connects to:

wss://<your-host>[:port]/<your-path>?voice=<voice>&language=<language>&sampleRate=<rate>

The host, port, and path come from the streaming URL you configured on the custom speech credential. voice and language are the values from the application that invoked the TTS (the say verb’s voice and language, or whatever the agent verb resolved). sampleRate is the call’s media sample rate, typically 8000 for PSTN or 16000 for WebRTC. Your server can treat it as the target output rate and resample if your vendor produces something different, or declare a different rate in the connect acknowledgement and let jambonz resample (see Audio sample rate handling).
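
A small sketch of reading these query parameters from the request target of the WebSocket upgrade, using only the standard library. The helper name `parse_connect_params` and the example path `/tts` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_connect_params(request_target):
    """Extract voice, language and sampleRate from the upgrade request's path and query."""
    query = parse_qs(urlsplit(request_target).query)

    def first(name):
        values = query.get(name)
        return values[0] if values else None

    rate = first("sampleRate")
    return {
        "voice": first("voice"),
        "language": first("language"),
        # None if jambonz did not supply a rate (defensive; not expected).
        "sample_rate": int(rate) if rate else None,
    }
```

For example, `parse_connect_params("/tts?voice=nova&language=en-US&sampleRate=8000")` yields the voice `nova`, the language `en-US`, and an integer sample rate of 8000.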

ws:// is also supported (set the credential URL with ws:// instead of wss://) for development against localhost.

Authentication

The WebSocket upgrade request includes:

Authorization: Bearer <apiKey>

<apiKey> is the same key configured on the custom speech credential.

Protocol

After the connection is established, your server must send a connect acknowledgement declaring the audio format it will produce. After that, jambonz streams text in and your server streams audio back until either side closes the connection.

Server → jambonz: connect acknowledgement

Send this once, right after the WebSocket handshake completes:

    {
      "type": "connect",
      "data": {
        "sample_rate": 16000,
        "base64_encoding": true
      }
    }

  • sample_rate — the sample rate of the PCM audio your server will produce. jambonz resamples to the call’s rate if needed.
  • base64_encoding — if true, your server will send audio as base64-encoded strings inside data messages (see below). If false (or omitted), your server will send audio as raw binary WebSocket frames. Binary frames are more efficient.
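
As a sketch, the acknowledgement can be built with a small helper and sent as a text frame as soon as the handshake completes. The name `connect_ack` is hypothetical; the default here picks binary frames, which the text above describes as more efficient.

```python
import json


def connect_ack(sample_rate, base64_encoding=False):
    """Build the connect acknowledgement as a JSON text frame."""
    return json.dumps({
        "type": "connect",
        "data": {
            "sample_rate": sample_rate,          # rate of the PCM this server will send
            "base64_encoding": base64_encoding,  # False means raw binary frames
        },
    })
```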

jambonz → server: text streaming

Once the connect ack is received, jambonz sends one or more stream messages with text tokens:

    {"type": "stream", "text": "Hello, "}
    {"type": "stream", "text": "how can I "}
    {"type": "stream", "text": "help you today?"}

Followed by a flush to signal the end of an utterance:

    {"type": "flush"}

flush is your cue to commit any buffered text to the vendor and finalize the current synthesis. Your server may receive more stream/flush cycles on the same connection during the call.

When jambonz is done with the session, it sends:

    {"type": "stop"}

…and may then close the WebSocket.
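
The stream/flush/stop cycle can be handled with a small state holder like the one sketched below. `UtteranceBuffer` is a hypothetical name, and buffering until flush is only one possible strategy: a server whose vendor accepts incremental text could forward each fragment immediately instead.

```python
import json


class UtteranceBuffer:
    """Accumulates text from stream messages and releases it on flush."""

    def __init__(self):
        self._parts = []
        self.stopped = False

    def handle(self, raw_message):
        """Process one JSON text frame; return the utterance text on flush, else None."""
        message = json.loads(raw_message)
        kind = message.get("type")
        if kind == "stream":
            self._parts.append(message.get("text", ""))
        elif kind == "flush":
            text = "".join(self._parts)
            self._parts.clear()  # ready for the next stream/flush cycle
            return text
        elif kind == "stop":
            self.stopped = True  # caller should tear down the vendor connection
        return None
```

Feeding it the three stream messages above followed by a flush returns "Hello, how can I help you today?", and the same instance can be reused for later utterances on the connection.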

Server → jambonz: audio data

For each flush, your server should stream the synthesized audio back as it arrives from the underlying vendor. There are two transport options, chosen by your base64_encoding setting in the connect ack:

Binary frames (preferred — base64_encoding: false): send raw L16 PCM audio at the rate you declared in connect, as WebSocket binary frames. Send chunks as they arrive — don’t wait for the full utterance.

Base64 JSON (base64_encoding: true): send each audio chunk as:

    {
      "type": "data",
      "data": {
        "audio": "<base64-encoded L16 PCM>"
      }
    }

Either way, send audio chunks incrementally as your vendor produces them — that’s the point of streaming.
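
For the base64 transport, each chunk received from the vendor can be wrapped as sketched here (`audio_data_message` is a hypothetical helper name). With binary frames no wrapping is needed; the raw bytes are sent as-is.

```python
import base64
import json


def audio_data_message(pcm_chunk):
    """Wrap one chunk of L16 PCM in a base64 'data' message (a JSON text frame)."""
    return json.dumps({
        "type": "data",
        "data": {"audio": base64.b64encode(pcm_chunk).decode("ascii")},
    })
```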

Server → jambonz: errors

To report an error mid-stream, send:

    {
      "type": "data",
      "data": {
        "error": "<error message>"
      }
    }

The error message is logged on the jambonz side. One specific value has special semantics: "error": "input_timeout_exceeded" tells jambonz it’s safe to reconnect later (most vendors close the socket after a period of inactivity). Use this value if your upstream vendor closes the connection for idleness so jambonz knows the close was expected and can open a fresh socket on the next utterance.
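
A sketch of building these messages, with the idle-close case singled out. The helper names and the `closed_for_idleness` flag are hypothetical; how a server detects an idle close depends on the upstream vendor.

```python
import json

IDLE_TIMEOUT_ERROR = "input_timeout_exceeded"  # special value recognized by jambonz


def error_message(reason):
    """Build an error frame; the reason string is logged on the jambonz side."""
    return json.dumps({"type": "data", "data": {"error": reason}})


def upstream_closed_message(closed_for_idleness, detail="vendor connection closed"):
    """Pick the error frame to send when the upstream vendor socket closes."""
    if closed_for_idleness:
        return error_message(IDLE_TIMEOUT_ERROR)
    return error_message(detail)
```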

Audio sample rate handling

jambonz resamples your audio if sample_rate in the connect ack differs from the call’s sampleRate query parameter. You can either:

  • Match sampleRate from the query string (one less resample), or
  • Always produce at your vendor’s natural rate (let jambonz resample).

Both are fine; pick whichever simplifies your server.
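
If you choose to match the call's rate yourself, the following is a deliberately naive sketch of resampling 16-bit mono PCM by linear interpolation. It assumes the PCM byte order matches the host's (little-endian on most hardware) and applies no anti-aliasing filter, so a production server would more likely rely on a proper DSP library or simply let jambonz resample. `resample_l16` is a hypothetical name.

```python
from array import array


def resample_l16(pcm, src_rate, dst_rate):
    """Resample 16-bit mono PCM using linear interpolation (no low-pass filtering)."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if pcm has an odd number of bytes
    if src_rate == dst_rate or len(samples) == 0:
        return pcm
    out_len = len(samples) * dst_rate // src_rate
    out = array("h")
    for i in range(out_len):
        pos = i * src_rate / dst_rate          # fractional position in the source
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[j + 1] if j + 1 < len(samples) else a
        out.append(int(round(a + (b - a) * frac)))
    return out.tobytes()
```

Because linear interpolation is stateless per sample pair, applying it chunk by chunk introduces small discontinuities at chunk boundaries; that is another reason to treat this only as an illustration.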

Reference flow

A typical session looks like:

jambonz                                  your server                      vendor
─── WS connect (with Authorization) ──>
                                         ─── auth/connect to vendor ──>
<── {"type":"connect","data":{...}} ───
─── {"type":"stream","text":"…"} ───>
─── {"type":"stream","text":"…"} ───>
                                         ─── tokens to vendor ──>
─── {"type":"flush"} ──────────────>
                                         ─── flush to vendor ──>
                                         <── audio chunk
<── binary PCM frame ──────────────────
                                         <── audio chunk
<── binary PCM frame ──────────────────
                                         <── end-of-utterance
─── {"type":"stream","text":"…"} ───>    (next utterance on same socket)
…
─── {"type":"stop"} ──────────────>
─── WS close ──────────────────────>

Tips

  • Keep one upstream connection per WebSocket session. Don’t open a new vendor connection on every flush — that defeats the latency win. Buffer text from stream messages, commit on flush, but reuse the underlying vendor connection across flush cycles.
  • Start sending audio bytes immediately. Don’t wait for the vendor to finish synthesizing the whole utterance before forwarding the first chunk. The first audio frame jambonz receives is what determines time-to-first-byte (logged as tts_time_to_first_byte_ms).
  • Handle the idle-close case. If your vendor closes the upstream socket after silence, send "error": "input_timeout_exceeded" so jambonz reconnects cleanly on the next utterance instead of failing.
  • Match the declared sample_rate. If your vendor’s output rate doesn’t match what you declared in connect, resample on your side or change the connect ack. Don’t send audio at any rate other than the one you declared.