Adding custom speech vendors
How to add support for speech vendors that jambonz doesn’t natively support
jambonz provides native support for many speech recognition vendors. If you want to integrate with a vendor we don’t yet support, you can do so by writing a server that implements our STT API.
The STT API is based on Websockets.
jambonz opens a Websocket connection towards a URL that you specify, and sends audio as well as JSON control text frames to your server. Your server is responsible for implementing the interface to your chosen speech vendor and returning results in JSON format back over the Websocket connection to jambonz.
Your server is responsible for closing the Websocket connection. Generally, this is done after receiving the stop control message from jambonz.
Want to look at some working code? Check out these examples.
Authentication
An Authorization header is sent by jambonz on the HTTP request that creates the Websocket connection. The Authorization header contains an API key.
When you create a custom speech vendor in the jambonz portal, you specify an API key, which is then provided in the Authorization header whenever that custom speech vendor is used in your application.
In the example below, we create a custom speech service for AssemblyAI and add an apiKey of ‘foobarbazzle’.
Note: this is not the API key that you may get from AssemblyAI to use their service.
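A server can reject unauthenticated upgrade requests before accepting the Websocket connection. The following is a minimal sketch, not a reference implementation. It assumes the key arrives as a Bearer token ("Authorization: Bearer <apiKey>"), which this section does not spell out, and the function name is_authorized is hypothetical.

```python
import hmac
from typing import Mapping

# Example value: the apiKey entered in the jambonz portal for this vendor.
EXPECTED_API_KEY = "foobarbazzle"


def is_authorized(headers: Mapping[str, str],
                  expected_key: str = EXPECTED_API_KEY) -> bool:
    """Return True if the upgrade request carries the expected API key.

    Assumption: the header has the form "Authorization: Bearer <apiKey>".
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get("authorization", "")
    scheme, _, presented_key = value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(presented_key.strip().encode(),
                               expected_key.encode())
```

If the check fails, the server would typically answer the upgrade request with an HTTP 401 instead of completing the Websocket handshake.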
Control messages sent by jambonz
Control messages are sent as JSON text frames. Audio is sent as binary frames containing linear16 PCM-encoded audio sampled at 8 kHz.
The first message that you will receive from jambonz after accepting and upgrading the HTTP request to a Websocket connection is a “start” control message, followed by binary audio frames.
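To make the audio format concrete, here is a small sketch that decodes a binary frame into samples and computes how much audio it holds. It assumes little-endian byte order and a single channel, neither of which is stated in this section. The helper names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 8000   # from the text: 8 kHz sampling
BYTES_PER_SAMPLE = 2    # linear16 means 16-bit signed samples


def decode_linear16(frame: bytes) -> array.array:
    """Decode a binary frame into signed 16-bit samples.

    Assumption: samples are little-endian and mono.
    """
    if len(frame) % BYTES_PER_SAMPLE:
        raise ValueError("frame is not a whole number of 16-bit samples")
    samples = array.array("h")
    samples.frombytes(frame)
    # array uses native byte order, so swap on big-endian hosts.
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


def frame_duration_ms(frame: bytes) -> float:
    """Return the playback duration of a frame in milliseconds."""
    return len(frame) * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)
```

Under these assumptions, a 320-byte frame holds 160 samples, which is 20 ms of audio.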
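The split between text and binary frames suggests a simple dispatch loop on the server. The sketch below is library-agnostic: it takes each received frame as either str (a JSON control message) or bytes (audio). It assumes control messages identify themselves through a "type" field with the values "start" and "stop"; that field name and the SttSession class are illustrative, not taken from this section.

```python
import json
from typing import Union


class SttSession:
    """Tracks the state of one Websocket connection from jambonz."""

    def __init__(self) -> None:
        self.started = False
        self.stopped = False
        self.start_message: dict = {}
        self.audio_bytes = 0

    def on_frame(self, frame: Union[str, bytes]) -> str:
        """Process one frame and return a label for what was done."""
        if isinstance(frame, (bytes, bytearray)):
            # Audio is only meaningful between "start" and "stop".
            if not self.started or self.stopped:
                return "ignored-audio"
            self.audio_bytes += len(frame)
            # A real server would forward the audio to the vendor here.
            return "audio"

        message = json.loads(frame)
        kind = message.get("type")  # assumed field name
        if kind == "start":
            self.started = True
            self.start_message = message  # keep for recognizer setup
            return "start"
        if kind == "stop":
            self.stopped = True
            return "stop"
        return "unknown-control"
```

Keeping the dispatch logic free of any particular Websocket library makes it straightforward to unit test and to reuse with whichever server framework you choose.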
Start control message
Stop control message
jambonz sends a “stop” message when it is time to stop speech recognition.
jambonz does not close the socket after sending this control message, which allows your speech recognizer to return a final transcript if necessary. When you receive the stop control message, close and clean up the speech recognition service you are using, return a final transcript if there is one, and then close the Websocket with a normal close.
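The shutdown order described here (finish the recognizer, send any final transcript, then close) can be sketched as a single coroutine. The Recognizer and Socket protocols and the encode_final callback are hypothetical stand-ins for your vendor client, your Websocket library, and your message serializer. Code 1000 is the standard Websocket status code for a normal closure.

```python
import asyncio
from typing import Callable, Optional, Protocol

NORMAL_CLOSURE = 1000  # Websocket status code for a normal close (RFC 6455)


class Recognizer(Protocol):
    """Hypothetical vendor client."""

    async def finish(self) -> Optional[str]:
        """Flush and shut down; return a final transcript if there is one."""
        ...


class Socket(Protocol):
    """Hypothetical wrapper around the Websocket connection to jambonz."""

    async def send(self, data: str) -> None: ...

    async def close(self, code: int) -> None: ...


async def handle_stop(recognizer: Recognizer, socket: Socket,
                      encode_final: Callable[[str], str]) -> None:
    """Run the stop sequence: clean up, send final result, close."""
    # 1. Close and clean up the speech recognition service.
    final_text = await recognizer.finish()
    # 2. Return a final transcript, if any, before closing.
    if final_text:
        await socket.send(encode_final(final_text))
    # 3. Close the Websocket with a normal close.
    await socket.close(NORMAL_CLOSURE)
```

Sending the final transcript before closing matters because nothing can be delivered to jambonz once the socket is closed.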
Control messages sent to jambonz
Your server is responsible for sending transcriptions, as well as any errors, to jambonz.
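As an illustration of what such messages might look like, the sketch below builds a transcription message and an error message as JSON text. The exact schema is not given in this section, so every field name here ("type", "is_final", "alternatives", "transcript", "confidence", "language", "error") is an assumption to be checked against the jambonz reference before use. The function names are hypothetical.

```python
import json
from typing import Optional


def transcription_message(transcript: str, is_final: bool,
                          confidence: Optional[float] = None,
                          language: Optional[str] = None) -> str:
    """Build a JSON transcription message (assumed schema)."""
    alternative: dict = {"transcript": transcript}
    if confidence is not None:
        alternative["confidence"] = confidence
    message: dict = {
        "type": "transcription",   # assumed field name and value
        "is_final": is_final,      # assumed field name
        "alternatives": [alternative],
    }
    if language:
        message["language"] = language
    return json.dumps(message)


def error_message(description: str) -> str:
    """Build a JSON error message (assumed schema)."""
    return json.dumps({"type": "error", "error": description})
```

Each returned string would be sent to jambonz as a single text frame over the open Websocket connection.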