Custom TTS providers

jambonz provides native support for many TTS vendors. To integrate with a vendor we don’t yet support, write a small server that implements our API.

There are two APIs you can implement, depending on whether your vendor supports streaming:

Non-streaming HTTP API — jambonz posts a full text payload to your server and you return a single audio file. Use this when the vendor’s API is request/response.
Streaming WebSocket API: jambonz opens a persistent WebSocket and streams text tokens to your server as they arrive (typically from an LLM); your server streams synthesized audio back. Use this when the vendor supports incremental synthesis (ElevenLabs WebSocket, Cartesia, Rime, etc.). Because synthesis can start before the full text is available, time-to-first-audio drops for LLM-driven applications.

When you add a custom speech vendor in the jambonz portal, you provide an HTTPS URL (for non-streaming), a WebSocket URL (for streaming), or both. The same API key is used for both.

Want working code? Check out these examples.

Non-streaming HTTP API

jambonz sends an HTTP POST containing the text to be synthesized and associated properties such as language and voice. Your server returns the synthesized audio as the response body.

Authentication

An Authorization header is sent on the HTTP request:

Authorization: Bearer <apiKey>

The API key is the value you provide when you create the custom speech vendor in the jambonz portal.

Request body attributes

Property	Type	Description
`language`	String	ISO language code (e.g. `en-US`)
`voice`	String	Name of the voice to use
`type`	String	`text` or `ssml`
`text`	String	Text to be synthesized. If `type=ssml`, must be enclosed in `<speak>` tags.

Response

Return a 200 OK containing the synthesized audio in the response body, or an HTTP error code on failure. The audio format is indicated via the Content-Type header. Allowed values:

Content-Type	Format
`audio/mpeg` or `audio/mp3`	MP3 (preferred)
`audio/wav` or `audio/x-wav`	Linear PCM with a WAVE header
`audio/l16;rate=8000`	Linear16 PCM at 8 kHz
`audio/l16;rate=16000`	Linear16 PCM at 16 kHz
`audio/l16;rate=24000`	Linear16 PCM at 24 kHz
`audio/l16;rate=32000`	Linear16 PCM at 32 kHz
`audio/l16;rate=48000`	Linear16 PCM at 48 kHz
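
The request and response described above can be sketched with the Python standard library. This is a minimal sketch, not a reference implementation: it assumes the request body is JSON-encoded, and `API_KEY`, `fake_synthesize`, `handle_request`, `serve`, and port 3000 are hypothetical names and values chosen for illustration. The placeholder synthesizer returns a short tone in a WAVE container where a real server would call the vendor.

```python
import io
import json
import math
import struct
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer

API_KEY = "my-secret-key"  # hypothetical; use the key you entered in the portal


def fake_synthesize(text, sample_rate=8000):
    """Placeholder for the vendor call: returns a WAV tone scaled to the text length."""
    n_samples = int(sample_rate * min(0.05 * max(len(text), 1), 2.0))
    pcm = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / sample_rate)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit linear PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def handle_request(auth_header, raw_body):
    """Return (status, content_type, body) for one synthesis request."""
    if auth_header != "Bearer " + API_KEY:
        return 401, "text/plain", b"unauthorized"
    try:
        req = json.loads(raw_body)  # assumption: the body is JSON
        text, kind = req["text"], req.get("type", "text")
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, "text/plain", b"bad request"
    if not isinstance(text, str):
        return 400, "text/plain", b"text must be a string"
    if kind == "ssml" and not text.strip().startswith("<speak"):
        return 400, "text/plain", b"ssml must be enclosed in <speak> tags"
    return 200, "audio/wav", fake_synthesize(text)


class TtsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, ctype, body = handle_request(
            self.headers.get("Authorization"), self.rfile.read(length)
        )
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=3000):
    """Run the sketch server until interrupted."""
    HTTPServer(("0.0.0.0", port), TtsHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the request logic in `handle_request` separate from the HTTP handler makes it easy to exercise without opening a socket.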

Streaming WebSocket API

The streaming API lets jambonz push text to your server incrementally and receive audio back as it’s produced, eliminating the wait for a full HTTP response. This allows LLM token generation and speech synthesis to overlap instead of running one after the other.

When jambonz needs to synthesize speech, it opens a WebSocket to your server, sends a sequence of text fragments and a flush signal, then keeps the connection open for further synthesis on the same call. Your server is responsible for translating that into whatever wire protocol your TTS vendor uses (ElevenLabs WebSocket, Cartesia, etc.) and streaming the resulting audio back.

Connection

jambonz connects to:

wss://<your-host>[:port]/<your-path>?voice=<voice>&language=<language>&sampleRate=<rate>

The host, port, and path come from the streaming URL you configured on the custom speech credential. voice and language are the values from the application that invoked the TTS (the say verb’s voice and language, or whatever the agent verb resolved). sampleRate is the call’s media sample rate, typically 8000 for PSTN or 16000 for WebRTC. Your server can use it as a target output rate, but it does not have to: jambonz resamples whenever the sample_rate you declare in the connect acknowledgement differs from the call’s rate.

ws:// is also supported (set the credential URL with ws:// instead of wss://) for development against localhost.
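
The query parameters on the connection URL can be extracted with the standard library. The sketch below is illustrative only: `parse_connection_params` is a hypothetical helper, and the fallback of 8000 when sampleRate is missing or malformed is an assumption rather than documented jambonz behavior.

```python
from urllib.parse import parse_qs, urlsplit


def parse_connection_params(request_target):
    """Extract voice, language and sampleRate from the upgrade request target."""
    query = parse_qs(urlsplit(request_target).query)
    try:
        # Assumption: fall back to 8000 if sampleRate is absent or not a number.
        sample_rate = int(query.get("sampleRate", ["8000"])[0])
    except ValueError:
        sample_rate = 8000
    return {
        "voice": query.get("voice", [""])[0],
        "language": query.get("language", [""])[0],
        "sample_rate": sample_rate,
    }
```

For example, `parse_connection_params("/tts?voice=aria&language=en-US&sampleRate=16000")` yields a dictionary with the voice, the language, and an integer sample rate of 16000.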

Authentication

The WebSocket upgrade request includes:

Authorization: Bearer <apiKey>

<apiKey> is the same key configured on the custom speech credential.

Protocol

After the connection is established, your server must send a connect acknowledgement declaring the audio format it will produce. After that, jambonz streams text in and your server streams audio back until either side closes the connection.

Server → jambonz: connect acknowledgement

Send this once, right after the WebSocket handshake completes:

{
  "type": "connect",
  "data": {
    "sample_rate": 16000,
    "base64_encoding": true
  }
}

sample_rate — the sample rate of the PCM audio your server will produce. jambonz resamples to the call’s rate if needed.
base64_encoding — if true, your server will send audio as base64-encoded strings inside data messages (see below). If false (or omitted), your server will send audio as raw binary WebSocket frames. Binary frames are more efficient.
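
Building the acknowledgement is a one-liner in most languages. A possible Python version follows; `connect_ack` is a hypothetical helper name, and its default of binary frames reflects the efficiency note above rather than any jambonz requirement.

```python
import json


def connect_ack(sample_rate, base64_encoding=False):
    """Serialize the connect acknowledgement sent once after the handshake."""
    return json.dumps({
        "type": "connect",
        "data": {
            "sample_rate": sample_rate,          # rate of the PCM this server will send
            "base64_encoding": base64_encoding,  # False means raw binary frames
        },
    })
```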

jambonz → server: text streaming

Once the connect ack is received, jambonz sends one or more stream messages with text tokens:

{"type": "stream", "text": "Hello, "}
{"type": "stream", "text": "how can I "}
{"type": "stream", "text": "help you today?"}

Followed by a flush to signal the end of an utterance:

{"type": "flush"}

flush is your cue to commit any buffered text to the vendor and finalize the current synthesis. Your server may receive more stream/flush cycles on the same connection during the call.

When jambonz is done with the session, it sends:

{"type": "stop"}

…and may then close the WebSocket.
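
The stream, flush, and stop messages map naturally onto a small state holder. The sketch below is transport-agnostic and makes no assumption about which WebSocket library delivers the frames; `TtsSession` and `on_message` are hypothetical names. It buffers text until flush, which suits a vendor that wants whole utterances. An adapter for an incremental vendor would more likely forward each fragment as it arrives.

```python
import json


class TtsSession:
    """Tracks one jambonz WebSocket session: buffers text, commits on flush."""

    def __init__(self):
        self._buffer = []
        self.stopped = False

    def on_message(self, raw):
        """Handle one JSON text frame; return the committed text on flush, else None."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "stream":
            self._buffer.append(msg.get("text", ""))
        elif kind == "flush":
            text = "".join(self._buffer)
            self._buffer.clear()  # ready for the next stream/flush cycle
            return text
        elif kind == "stop":
            self.stopped = True
        return None
```

Each non-None return value is the text to hand to the vendor for one utterance; the same `TtsSession` instance keeps working across later stream/flush cycles on the connection.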

Server → jambonz: audio data

For each flush, your server should stream the synthesized audio back as it arrives from the underlying vendor. There are two transport options, chosen by your base64_encoding setting in the connect ack:

Binary frames (preferred — base64_encoding: false): send raw L16 PCM audio at the rate you declared in connect, as WebSocket binary frames. Send chunks as they arrive — don’t wait for the full utterance.

Base64 JSON (base64_encoding: true): send each audio chunk as:

{
  "type": "data",
  "data": {
    "audio": "<base64-encoded L16 PCM>"
  }
}

Either way, send audio chunks incrementally as your vendor produces them — that’s the point of streaming.
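
The two transport options can be handled by one encoding function. In this sketch, `chunk_pcm` and `encode_audio` are hypothetical helpers, and the 640-byte default chunk size (20 ms of 16-bit mono audio at 16 kHz) is an arbitrary choice, not a jambonz requirement.

```python
import base64
import json


def chunk_pcm(pcm, chunk_bytes=640):
    """Yield slices of 16-bit PCM, keeping every chunk boundary sample-aligned."""
    chunk_bytes = max(2, chunk_bytes - chunk_bytes % 2)
    for i in range(0, len(pcm), chunk_bytes):
        yield pcm[i:i + chunk_bytes]


def encode_audio(chunk, base64_encoding):
    """Return bytes for a binary frame, or a JSON string for a base64 data message."""
    if not base64_encoding:
        return chunk
    return json.dumps({
        "type": "data",
        "data": {"audio": base64.b64encode(chunk).decode("ascii")},
    })
```

The return type tells the caller which kind of WebSocket frame to send: `bytes` as a binary frame, `str` as a text frame.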

Server → jambonz: errors

To report an error mid-stream, send:

{
  "type": "data",
  "data": {
    "error": "<error message>"
  }
}

The error message is logged on the jambonz side. One value has special semantics: "error": "input_timeout_exceeded" tells jambonz that the close was expected and that it can open a fresh socket on the next utterance. Send this value when your upstream vendor closes the connection for idleness, as most vendors do after a period of inactivity.
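
A server might wrap both cases in a pair of helpers. In this sketch `error_message` and `on_vendor_closed` are hypothetical names, and how a server detects that the vendor closed for idleness is vendor-specific and left to the caller.

```python
import json

IDLE_TIMEOUT = "input_timeout_exceeded"  # the value with special meaning to jambonz


def error_message(reason):
    """Serialize an error report as a data message."""
    return json.dumps({"type": "data", "data": {"error": reason}})


def on_vendor_closed(closed_for_idleness, detail="vendor connection closed"):
    """Pick the error to report when the upstream vendor socket goes away."""
    return error_message(IDLE_TIMEOUT if closed_for_idleness else detail)
```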

Audio sample rate handling

jambonz resamples your audio if sample_rate in the connect ack differs from the call’s sampleRate query parameter. You can either:

Match sampleRate from the query string (one less resample), or
Always produce at your vendor’s natural rate (let jambonz resample).

Both are fine; pick whichever simplifies your server.
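
One way to combine the two options is to match the call’s rate when the vendor can produce it natively and fall back to the vendor’s natural rate otherwise. The helper below is a hypothetical sketch; `vendor_rates` and `native_rate` describe your vendor and are not values supplied by jambonz.

```python
def choose_declared_rate(call_rate, vendor_rates, native_rate):
    """Return the sample_rate to declare in the connect acknowledgement."""
    # Matching the call's rate saves a resample; otherwise jambonz resamples.
    return call_rate if call_rate in vendor_rates else native_rate
```

For a vendor that supports 8000, 16000, and 24000 Hz with a natural rate of 24000, a PSTN call at 8000 would be declared at 8000, while a 48000 Hz call would be declared at 24000 and resampled by jambonz.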

Reference flow

A typical session looks like:

jambonz                                  your server                            vendor
  ─── WS connect (with Authorization) ──>
                                          ─── auth/connect to vendor ──>
  <── {"type":"connect","data":{...}} ───
  ─── {"type":"stream","text":"…"} ───>
  ─── {"type":"stream","text":"…"} ───>
                                          ─── tokens to vendor ──>
  ─── {"type":"flush"} ──────────────>
                                          ─── flush to vendor ──>
                                                                                <── audio chunk
  <── binary PCM frame ──────────────────
                                                                                <── audio chunk
  <── binary PCM frame ──────────────────
                                                                                <── end-of-utterance
  ─── {"type":"stream","text":"…"} ───>   (next utterance on same socket)
  …
  ─── {"type":"stop"} ──────────────>
  ─── WS close ──────────────────────>

Tips

Keep one upstream connection per WebSocket session. Don’t open a new vendor connection on every flush — that defeats the latency win. Buffer text from stream messages, commit on flush, but reuse the underlying vendor connection across flush cycles.
Start sending audio bytes immediately. Don’t wait for the vendor to finish synthesizing the whole utterance before forwarding the first chunk. The first audio frame jambonz receives is what determines time-to-first-byte (logged as tts_time_to_first_byte_ms).
Handle the idle-close case. If your vendor closes the upstream socket after silence, send "error": "input_timeout_exceeded" so jambonz reconnects cleanly on the next utterance instead of failing.
Stay within the declared sample_rate. If your vendor’s output rate doesn’t match what you declared in connect, resample on your side or change the connect ack — don’t send mismatched audio.