Text-to-speech API
jambonz provides native support for many TTS vendors. If you want to integrate with one we don’t yet support, you can do so by writing a small server that implements our API.
There are two APIs you can implement, depending on whether your vendor supports streaming: an HTTP API for non-streaming synthesis and a WebSocket API for streaming synthesis.
When you add a custom speech vendor in the jambonz portal you provide an HTTPS URL (for non-streaming) and/or a WebSocket URL (for streaming). You may provide either or both. The same API key is used for both.
Want working code? Check out these examples.
jambonz sends an HTTP POST containing the text to be synthesized and associated properties such as language and voice. Your server returns the synthesized audio as the response body.
An Authorization header is sent on the HTTP request:
The API key is the value you provide when you create the custom speech vendor in the jambonz portal.
Return a 200 OK containing the synthesized audio in the response body, or an HTTP error code on failure. The audio format is indicated via the Content-Type header. Allowed values:
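As a rough illustration of the request/response contract above, here is a minimal, framework-independent sketch of the handler logic. Several details are assumptions, not confirmed by this page: the `Bearer` scheme for the Authorization header, the JSON field names `text`, `language` and `voice`, and the hypothetical `synth` callable that wraps your vendor and returns a content type plus audio bytes.

```python
import hmac
import json
from typing import Callable, Dict, Tuple

# Hypothetical vendor wrapper: (text, language, voice) -> (content_type, audio_bytes)
Synth = Callable[[str, str, str], Tuple[str, bytes]]


def handle_synthesize(headers: Dict[str, str], body: bytes,
                      api_key: str, synth: Synth) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, response_headers, response_body) for one POST from jambonz."""
    # ASSUMPTION: the key arrives as "Authorization: Bearer <apiKey>".
    auth = {k.lower(): v for k, v in headers.items()}.get("authorization", "")
    expected = f"Bearer {api_key}"
    if not hmac.compare_digest(auth.encode(), expected.encode()):
        return 401, {}, b""

    # ASSUMPTION: the body is JSON with "text", "language" and "voice" fields.
    try:
        req = json.loads(body)
        text = req["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {}, b""

    try:
        content_type, audio = synth(text, req.get("language", ""), req.get("voice", ""))
    except Exception:
        # Any vendor failure is reported to jambonz as an HTTP error code.
        return 500, {}, b""

    # The audio format is signalled through the Content-Type header.
    return 200, {"Content-Type": content_type}, audio
```

Keeping the logic in a plain function like this makes it easy to mount behind whatever HTTP framework you prefer and to unit-test without a network.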
The streaming API lets jambonz push text to your server incrementally and receive audio back as it’s produced — eliminating the wait for a full HTTP response. This is what makes “the LLM and the TTS overlap” possible.
When jambonz needs to synthesize speech, it opens a WebSocket to your server, sends a sequence of text fragments and a flush signal, then keeps the connection open for further synthesis on the same call. Your server is responsible for translating that into whatever wire protocol your TTS vendor uses (ElevenLabs WebSocket, Cartesia, etc.) and streaming the resulting audio back.
jambonz connects to:
The host, port, and path come from the streaming URL you configured on the custom speech credential. voice and language are the values from the application that invoked the TTS (the say verb’s voice and language, or whatever the agent verb resolved). sampleRate is the call’s media sample rate — typically 8000 for PSTN or 16000 for WebRTC; your server should use this as a target rate (resample if your vendor produces something different).
ws:// is also supported (set the credential URL with ws:// instead of wss://) for development against localhost.
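To make the connection parameters concrete, the following sketch extracts `voice`, `language` and `sampleRate` from the request target of the WebSocket upgrade. The `ConnectParams` type and the fallback of 8000 when `sampleRate` is absent are assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass
class ConnectParams:
    voice: str
    language: str
    sample_rate: int


def parse_connect_params(request_target: str) -> ConnectParams:
    """Parse the query string of the URL (or path) that jambonz connected to."""
    query = parse_qs(urlsplit(request_target).query)

    def first(name: str, default: str = "") -> str:
        # parse_qs returns a list per key; take the first value.
        return query.get(name, [default])[0]

    return ConnectParams(
        voice=first("voice"),
        language=first("language"),
        # ASSUMPTION: fall back to 8000 (typical PSTN rate) if the parameter is missing.
        sample_rate=int(first("sampleRate", "8000")),
    )
```

For example, `parse_connect_params("/tts?voice=nova&language=en-US&sampleRate=16000")` yields a `ConnectParams` with `sample_rate` set to 16000, which your server can then use as its target output rate.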
The WebSocket upgrade request includes:
<apiKey> is the same key configured on the custom speech credential.
After the connection is established, your server must send a connect acknowledgement declaring the audio format it will produce. After that, jambonz streams text in and your server streams audio back until either side closes the connection.
Send this once, right after the WebSocket handshake completes:
sample_rate: the sample rate of the PCM audio your server will produce. jambonz resamples to the call’s rate if needed.
base64_encoding: if true, your server will send audio as base64-encoded strings inside data messages (see below). If false (or omitted), your server will send audio as raw binary WebSocket frames. Binary frames are more efficient.
Once the connect ack is received, jambonz sends one or more stream messages with text tokens:
Followed by a flush to signal the end of an utterance:
flush is your cue to commit any buffered text to the vendor and finalize the current synthesis. Your server may receive more stream/flush cycles on the same connection during the call.
When jambonz is done with the session, it sends:
…and may then close the WebSocket.
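The stream / flush / stop cycle described above can be sketched as a small state holder that buffers text and releases it on flush. The JSON envelope used here (a `type` field with values `stream`, `flush` and `stop`, and a `text` field on stream messages) is an assumption for illustration, as is the `TextSession` class name.

```python
import json
from typing import List, Optional


class TextSession:
    """Buffers text fragments for one WebSocket connection."""

    def __init__(self) -> None:
        self._pending: List[str] = []
        self.closed = False

    def on_message(self, raw: str) -> Optional[str]:
        """Return the utterance to synthesize when a flush arrives, else None."""
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "stream":
            # ASSUMPTION: fragments already carry their own whitespace,
            # so they are concatenated without a separator.
            self._pending.append(msg.get("text", ""))
        elif kind == "flush":
            text, self._pending = "".join(self._pending), []
            return text or None  # nothing buffered: nothing to commit
        elif kind == "stop":
            self._pending = []
            self.closed = True
        return None
```

The session object outlives each flush, which matches the note that more stream/flush cycles may arrive on the same connection during the call.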
For each flush, your server should stream the synthesized audio back as it arrives from the underlying vendor. There are two transport options, chosen by your base64_encoding setting in the connect ack:
Binary frames (preferred — base64_encoding: false): send raw L16 PCM audio at the rate you declared in connect, as WebSocket binary frames. Send chunks as they arrive — don’t wait for the full utterance.
Base64 JSON (base64_encoding: true): send each audio chunk as:
Either way, send audio chunks incrementally as your vendor produces them — that’s the point of streaming.
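The two transport options can be sketched as a pair of helpers: one builds the connect ack, the other frames each audio chunk according to the `base64_encoding` setting. The field names `sample_rate` and `base64_encoding` come from this page; the surrounding JSON envelopes (`type: "connect"` with a nested `data` object, and `type: "data"` with the base64 payload in `data`) are assumptions, so check them against the real message examples before relying on them.

```python
import base64
import json
from typing import Union


def build_connect_ack(sample_rate: int, base64_encoding: bool = False) -> str:
    """JSON text frame sent once, right after the WebSocket handshake."""
    # ASSUMPTION: envelope shape; only the two inner field names are documented here.
    return json.dumps({
        "type": "connect",
        "data": {"sample_rate": sample_rate, "base64_encoding": base64_encoding},
    })


def frame_audio(pcm: bytes, base64_encoding: bool) -> Union[bytes, str]:
    """Frame one chunk of L16 PCM for sending back to jambonz.

    Returns bytes (send as a binary frame) or str (send as a text frame).
    """
    if not base64_encoding:
        return pcm
    # ASSUMPTION: base64 audio travels in a "data" message under the "data" key.
    return json.dumps({"type": "data", "data": base64.b64encode(pcm).decode("ascii")})
```

Calling `frame_audio` once per vendor chunk, rather than once per utterance, keeps the audio flowing incrementally.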
To report an error mid-stream, send:
The error message is logged on the jambonz side. One specific value has special semantics: "error": "input_timeout_exceeded" tells jambonz it’s safe to reconnect later (most vendors close the socket after a period of inactivity). Use this value if your upstream vendor closes the connection for idleness so jambonz knows the close was expected and can open a fresh socket on the next utterance.
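A minimal sketch of building the error message, distinguishing an expected idle close from a genuine failure. The `"error"` key and the `input_timeout_exceeded` value are taken from the text above; the `type: "error"` field and the `build_error` helper are assumptions.

```python
import json

IDLE_TIMEOUT = "input_timeout_exceeded"


def build_error(reason: str, idle_close: bool = False) -> str:
    """Build the JSON error message to send to jambonz.

    When the upstream vendor closed the socket for inactivity, report the
    special value so jambonz knows the close was expected.
    """
    # ASSUMPTION: the message carries a "type": "error" discriminator.
    return json.dumps({"type": "error", "error": IDLE_TIMEOUT if idle_close else reason})
```

A server would typically call `build_error(str(exc))` for vendor failures and `build_error("", idle_close=True)` when it detects an inactivity close from the vendor.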
jambonz resamples your audio if sample_rate in the connect ack differs from the call’s sampleRate query parameter. You can either:
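Produce audio at the sampleRate from the query string (one less resample), or
Declare your vendor’s native rate in the connect ack and let jambonz resample.
Both are fine; pick whichever simplifies your server.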
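One way to decide which rate to declare in the connect ack is sketched below. The helper name and the tie-breaking heuristic (prefer the nearest rate above the call rate, so that jambonz downsamples rather than upsamples) are assumptions, not part of the jambonz API.

```python
from typing import Iterable


def choose_declared_rate(call_rate: int, vendor_rates: Iterable[int]) -> int:
    """Pick the sample_rate to declare, given the rates the vendor can produce."""
    rates = sorted(set(vendor_rates))
    if not rates:
        raise ValueError("vendor_rates must not be empty")
    if call_rate in rates:
        # Vendor can match the call's rate directly: no resampling anywhere.
        return call_rate
    # Otherwise declare a native vendor rate and let jambonz resample.
    higher = [r for r in rates if r > call_rate]
    return higher[0] if higher else rates[-1]
```

Whatever rate this returns is the rate your server must actually send; the declared value and the audio have to agree.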
A typical session looks like:
Don’t open a new vendor connection on every flush: that defeats the latency win. Buffer text from stream messages, commit on flush, but reuse the underlying vendor connection across flush cycles.
… tts_time_to_first_byte_ms).
If your upstream vendor closes the connection for idleness, send "error": "input_timeout_exceeded" so jambonz reconnects cleanly on the next utterance instead of failing.
If your vendor’s output rate doesn’t match the sample_rate you declared in connect, resample on your side or change the connect ack; don’t send mismatched audio.