Adding custom speech vendors

How to add support for speech vendors that jambonz doesn’t natively support

jambonz provides native support for many speech recognition vendors. If you want to integrate with a vendor we don’t yet support, you can do so by writing a server that implements our API.

The STT API is based on Websockets.

jambonz opens a Websocket connection towards a URL that you specify, and sends audio as well as JSON control text frames to your server. Your server is responsible for implementing the interface to your chosen speech vendor and returning results in JSON format back over the Websocket connection to jambonz.

Your server is responsible for closing the Websocket connection. Generally, this is done after receiving the stop control message from jambonz.

Want to look at some working code? Check out these examples.

Authentication

jambonz sends an Authorization header on the HTTP request that creates the Websocket connection. The Authorization header contains an API key, e.g.

Authorization: Bearer <apiKey>

When you create a custom speech vendor in the jambonz portal, you specify an API key, which is then provided in the Authorization header whenever that custom speech vendor is used in your application.

In the example below, we create a custom speech service for AssemblyAI and add an apiKey of ‘foobarbazzle’.

Note: this is not the API key that you may get from AssemblyAI to use their service.

Creating custom STT vendor
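As a rough illustration of the check your server might perform on the incoming upgrade request, here is a minimal Python sketch. The function names `extract_bearer_token` and `is_authorized` are hypothetical, and the sketch assumes your Websocket framework exposes the request headers as a simple mapping; adapt it to whatever your framework provides.

```python
import hmac
from typing import Mapping, Optional


def extract_bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <apiKey>' header, or None."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "authorization":
            scheme, _, token = value.strip().partition(" ")
            if scheme.lower() == "bearer" and token.strip():
                return token.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_api_key: str) -> bool:
    """Compare the presented key with the one configured in the jambonz portal."""
    token = extract_bearer_token(headers)
    if token is None:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(token.encode("utf-8"), expected_api_key.encode("utf-8"))
```

With the example key above, `is_authorized({"Authorization": "Bearer foobarbazzle"}, "foobarbazzle")` returns `True`; a server would typically reject the upgrade request when the check fails.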

Control messages sent by jambonz

Control messages are sent as JSON text frames. Audio is sent as binary frames containing linear16 PCM-encoded audio sampled at 8 kHz.

The first message that you will receive from jambonz after accepting and upgrading the HTTP request to a Websocket connection is a “start” control message, followed by binary audio frames.
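To make the audio format concrete: linear16 uses two bytes per sample, so the duration of a binary frame follows directly from its length. The sketch below is illustrative only; the helper name `frame_duration_ms` is hypothetical, and the default of one channel is an assumption, since the channel count is not stated here.

```python
BYTES_PER_SAMPLE = 2    # linear16 = 16-bit samples
SAMPLE_RATE_HZ = 8000   # 8 kHz, per the description above


def frame_duration_ms(frame: bytes, channels: int = 1) -> float:
    """Return the duration of one binary audio frame in milliseconds.

    Assumption: mono audio unless told otherwise (channels=1).
    """
    bytes_per_sample_frame = BYTES_PER_SAMPLE * channels
    if len(frame) % bytes_per_sample_frame != 0:
        raise ValueError("frame length is not a whole number of samples")
    samples = len(frame) // bytes_per_sample_frame
    return samples * 1000 / SAMPLE_RATE_HZ
```

For example, a 320-byte mono frame holds 160 samples, which is 20 ms of audio. This kind of calculation is useful if your vendor expects audio in chunks of a particular duration.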

Start control message

| property | type | description |
| --- | --- | --- |
| type | String | “start” |
| language | String | ISO language code (e.g. “en-US”) |
| format | String | Defines audio format. Currently will always be “raw” |
| encoding | String | Defines how the audio is encoded. Currently will always be “LINEAR16” |
| interimResults | Boolean | Whether or not interim (partial) results are being requested |
| sampleRateHz | Number | Sample rate of audio. Currently will always be 8000. |
| options | Object | This will contain any options that the application is passing on to the recognizer. This object may be empty. |
| options.hints | Array or Object | Any dynamic hints provided by the application. |
| options.hintsBoost | Number | A boost number to apply to the provided hints. |
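The following Python sketch shows one way a server might parse the start message into a typed structure. The `StartMessage` class and `parse_start` function are hypothetical names, and the fallback values used when a property is missing are assumptions based on the "currently will always be" notes in the table, not documented defaults.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StartMessage:
    language: str
    format: str
    encoding: str
    interim_results: bool
    sample_rate_hz: int
    options: Dict[str, Any] = field(default_factory=dict)


def parse_start(text_frame: str) -> StartMessage:
    """Parse the first JSON text frame received from jambonz."""
    msg = json.loads(text_frame)
    if msg.get("type") != "start":
        raise ValueError(f"expected a start message, got {msg.get('type')!r}")
    return StartMessage(
        language=msg["language"],
        # Assumed fallbacks, mirroring the values the table says are always sent.
        format=msg.get("format", "raw"),
        encoding=msg.get("encoding", "LINEAR16"),
        interim_results=bool(msg.get("interimResults", False)),
        sample_rate_hz=int(msg.get("sampleRateHz", 8000)),
        # options may be empty; normalize a missing or null value to {}.
        options=msg.get("options") or {},
    )
```

A typical use is to read `start.language` and `start.options.get("hints")` when configuring the request to your speech vendor, and `start.interim_results` to decide whether to forward partial transcripts.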

Stop control message

jambonz sends a “stop” message when it is time to stop speech recognition.

jambonz does not close the socket after sending this control message, which allows your speech recognizer to return a final transcript if necessary. When you receive the stop control message, close and clean up the speech recognition service you are using, return a final transcript if there is one, and then close the Websocket with a normal close.

| property | type | description |
| --- | --- | --- |
| type | String | “stop” |
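The stop sequence described above (flush the recognizer, send any final transcript, then close normally) might be sketched as follows. This is a transport-agnostic illustration, not jambonz code: the `SttSession` class and its four callbacks are hypothetical stand-ins for your Websocket framework and your vendor client. The only outside fact relied on is that 1000 is the Websocket status code for a normal closure.

```python
import json
from typing import Callable, Optional, Union

NORMAL_CLOSE = 1000  # Websocket "normal closure" status code (RFC 6455)


class SttSession:
    """Hypothetical per-connection handler; all callbacks are assumptions."""

    def __init__(
        self,
        send_text: Callable[[str], None],      # sends a text frame to jambonz
        close: Callable[[int], None],          # closes the socket with a status code
        feed_audio: Callable[[bytes], None],   # forwards audio to your speech vendor
        finish: Callable[[], Optional[str]],   # flushes the vendor; final text or None
    ) -> None:
        self.send_text = send_text
        self.close = close
        self.feed_audio = feed_audio
        self.finish = finish
        self.language: Optional[str] = None
        self.closed = False

    def on_frame(self, frame: Union[str, bytes]) -> None:
        if self.closed:
            return
        if isinstance(frame, (bytes, bytearray)):
            self.feed_audio(bytes(frame))      # binary frame = audio
            return
        msg = json.loads(frame)                # text frame = JSON control message
        kind = msg.get("type")
        if kind == "start":
            self.language = msg.get("language")
        elif kind == "stop":
            final_text = self.finish()         # clean up the vendor connection
            if final_text:
                self.send_text(json.dumps({
                    "type": "transcription",
                    "is_final": True,
                    "alternatives": [{"transcript": final_text}],
                    "language": self.language,
                }))
            self.close(NORMAL_CLOSE)           # your server, not jambonz, closes
            self.closed = True
```

Keeping the session independent of any particular Websocket library makes the ordering easy to test: the final transcript is always sent before the close, and frames arriving after the close are ignored.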

Control messages sent to jambonz

Your server is responsible for sending transcriptions, as well as any errors, to jambonz.

Transcription control message

| property | type | description |
| --- | --- | --- |
| type | String | “transcription” |
| is_final | Boolean | Indicates whether this is a final or interim transcription. |
| alternatives | Array | An ordered list of alternative transcriptions (must contain at least one). |
| alternatives[n].transcript | String | A transcript of the speaker’s utterance. |
| alternatives[n].confidence | Number | A confidence probability, between 0 and 1. |
| language | String | The language that was recognized. |
| channel | Number | The channel number (only relevant if diarization is being performed; defaults to 1). |
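As an illustration of the message shape, the Python sketch below builds a transcription message from a list of (transcript, confidence) pairs. The function name `build_transcription` is hypothetical, and the validation simply enforces the constraints stated in the table (at least one alternative, confidence between 0 and 1).

```python
import json
from typing import List, Tuple


def build_transcription(
    alternatives: List[Tuple[str, float]],
    is_final: bool,
    language: str,
    channel: int = 1,
) -> str:
    """Serialize a transcription control message as a JSON text frame."""
    if not alternatives:
        raise ValueError("alternatives must contain at least one entry")
    for _, confidence in alternatives:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
    return json.dumps({
        "type": "transcription",
        "is_final": is_final,
        # Order is preserved: pass the most likely alternative first.
        "alternatives": [
            {"transcript": text, "confidence": confidence}
            for text, confidence in alternatives
        ],
        "language": language,
        "channel": channel,
    })
```

For example, `build_transcription([("hello world", 0.92)], is_final=True, language="en-US")` produces a final transcription with a single alternative on channel 1.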

Error control message

| property | type | description |
| --- | --- | --- |
| type | String | “error” |
| error | String | Detailed error message. |
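A server might report a vendor failure along these lines. This is a minimal sketch; `build_error` is a hypothetical helper, and how much detail to include in the message is left to you.

```python
import json
from typing import Union


def build_error(problem: Union[str, BaseException]) -> str:
    """Serialize an error control message as a JSON text frame."""
    if isinstance(problem, BaseException):
        # Include the exception type so the message is useful on its own.
        detail = f"{type(problem).__name__}: {problem}"
    else:
        detail = str(problem)
    return json.dumps({"type": "error", "error": detail})
```

For instance, wrapping the call to your speech vendor in a `try`/`except` block and sending `build_error(exc)` over the Websocket lets jambonz see why recognition failed.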