Google Gemini Speech to Speech

Using jambonz to connect custom telephony to the Google Gemini Live API

The jambonz application referenced in this article can be found here.

This is an example jambonz application that connects to the Google Gemini Live API and illustrates how to build a Voice-AI application using jambonz and Google Gemini.

The example covers:

  • wiring up the llm verb to Gemini Live
  • a proactive greeting (“speak first”)
  • session resumption across reconnects
  • function calling, inline or via an MCP server
  • interruption handling

Prerequisites

  • a jambonz.cloud account (or a self-hosted jambonz deployment on 10.1.0 or later)
  • a Google Cloud Platform account with the Gemini API enabled
  • a carrier and virtual phone number of your choice

Running instructions

Set environment variables

$export GOOGLE_API_KEY="your-gemini-api-key"
$export PORT=3000
$
$# Optional: only required when testing the MCP integration
$export MCP_SERVER_URL=http://your-host:3001/sse
  • GOOGLE_API_KEY: a Google API key with access to the Gemini Live API. You can create one in Google AI Studio.
  • PORT: the port your Express server listens on. Defaults to 3000.
  • MCP_SERVER_URL: optional. When set, the agent’s tools are discovered from an MCP server instead of declared inline.
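A minimal sketch of how an app might read these variables into a config object (illustrative only; the example app's actual config handling may differ):

```javascript
// Derive app config from an environment map; names match the table above.
function loadConfig(env) {
  return {
    apiKey: env.GOOGLE_API_KEY,
    port: parseInt(env.PORT || '3000', 10),   // defaults to 3000
    mcpServerUrl: env.MCP_SERVER_URL || null  // optional MCP tool discovery
  };
}

const config = loadConfig(process.env);
```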

jambonz setup

  1. Create a carrier entity in the jambonz portal.
  2. Add your speech provider of choice. Gemini Live handles speech-to-speech end to end, but jambonz still needs a speech credential configured on the account.
  3. Create a new jambonz application under the Applications tab. Point both the Calling webhook and Call status webhook at your server:
    ws://your-example-domain.ngrok.io/google-s2s
  4. Provision a phone number on your carrier and associate it with the application.

Run the app

$npm install
$GOOGLE_API_KEY=<your key> npm start

To run with MCP tools, open two terminals:

$# Terminal 1 — MCP server
$MCP_SERVER_PORT=3001 npm run mcp-server
$
$# Terminal 2 — jambonz app
$GOOGLE_API_KEY=<your key> MCP_SERVER_URL='http://<your host>:3001/sse' node app.js

Call your virtual number and ask Barbara about the weather.

How the llm verb is wired up

The application calls session.llm({...}) with vendor: 'google' and a Gemini Live model. The llmOptions.setup object is forwarded verbatim to Google’s BidiGenerateContentSetup message:

session.llm({
  vendor: 'google',
  model: 'models/gemini-2.0-flash-live-001',
  auth: { apiKey: process.env.GOOGLE_API_KEY },
  actionHook: '/final',
  eventHook: '/event',
  toolHook: '/toolCall',
  llmOptions: {
    setup: {
      generationConfig: {
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Aoede' } }
        }
      },
      systemInstruction: {
        parts: [{ text: 'You are a helpful agent named Barbara that can only provide weather information.' }]
      },
      tools: [{
        functionDeclarations: [{
          name: 'get_weather',
          description: 'Get the weather for a location',
          parameters: {
            type: 'object',
            properties: {
              location: { type: 'string', description: 'The location to get the weather for' },
              scale: { type: 'string', enum: ['celsius', 'fahrenheit'] }
            },
            required: ['location']
          }
        }]
      }]
    }
  }
});

See the full route in lib/routes/weather-agent.js.

Proactive greeting (“speak first”)

For outbound calls — or any scenario where you want Gemini to speak first — add a greeting to llmOptions. jambonz sends it immediately after setup so the caller hears the agent within the first second:

llmOptions: {
  setup: { /* ... */ },
  greeting: 'Greet the caller warmly and ask how you can help.'
}

The value is an instruction to the model, not the literal greeting text. Use "Say exactly: Hello, thank you for calling Acme." if you need a scripted line.

This also works on models/gemini-3.1-flash-live-preview. On the 3.1 preview, Google restricted clientContent to seeding history only, so jambonz uses realtimeInput.text under the hood — the greeting field is the portable way to trigger a first turn across all Gemini Live models.

Session resumption

Gemini Live sessions can be resumed across websocket reconnects. Opt in by passing sessionResumption: {} in llmOptions. Each llm_event hook delivers a sessionResumptionUpdate containing a fresh newHandle — store the latest handle, then reconnect with sessionResumption: { handle: '<stored handle>' } to continue the conversation.
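A minimal sketch of that bookkeeping, assuming the event hook delivers Google's sessionResumptionUpdate message unchanged (resumable and newHandle are Google's field names; the surrounding function names are illustrative, not the example app's code):

```javascript
// Keep the most recent resumption handle delivered on the llm_event hook.
let resumptionHandle = null;

function onSessionResumptionUpdate(evt) {
  const update = evt.sessionResumptionUpdate;
  if (update && update.resumable && update.newHandle) {
    resumptionHandle = update.newHandle;
  }
}

// Build the sessionResumption value for llmOptions: an empty object opts in
// on a fresh connect; a stored handle resumes the prior conversation.
function resumptionOptions() {
  return resumptionHandle ? { handle: resumptionHandle } : {};
}
```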

Function calling

The toolHook fires when Gemini wants to call one of the declared functions. Respond with session.sendToolOutput:

session.sendToolOutput(tool_call_id, {
  toolResponse: {
    functionResponses: [
      { id, response: { output: { temperature: 20, unit: 'celsius' } } }
    ]
  }
});

Gemini’s native tool format uses functionCalls (inbound) and functionResponses (outbound) — jambonz passes them through without reshaping, so the payloads match the Gemini Live tool use docs exactly.
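Because the payloads pass through unreshaped, a toolHook handler can map functionCalls straight to functionResponses. A hedged sketch (the handler registry and the stubbed weather result are illustrative, not part of the example app):

```javascript
// Map Gemini functionCalls to the functionResponses payload expected by
// session.sendToolOutput.
function buildToolResponse(functionCalls, handlers) {
  return {
    toolResponse: {
      functionResponses: functionCalls.map(({ id, name, args }) => ({
        id,
        response: {
          output: handlers[name]
            ? handlers[name](args)
            : { error: `no handler for ${name}` }
        }
      }))
    }
  };
}

// Usage with the get_weather tool declared earlier (stubbed result);
// the payload would then be passed to session.sendToolOutput.
const payload = buildToolResponse(
  [{ id: 'call-1', name: 'get_weather', args: { location: 'Oslo' } }],
  { get_weather: ({ location }) => ({ temperature: 20, unit: 'celsius', location }) }
);
```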

Interruption handling

When the caller speaks over Gemini, the module emits output_audio.playback_stopped with completion_reason: "interrupted" on the event hook and discards any queued audio, so the caller is never talked over by stale agent speech. No application code is required; interruption handling is built in.
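Although nothing needs handling, the event is handy for metrics. A small predicate over the event-hook payload (field names taken from the event described above; the payload's exact shape should be verified against your jambonz version):

```javascript
// True when playback stopped because the caller barged in.
function wasBargeIn(evt) {
  return evt.type === 'output_audio.playback_stopped' &&
    evt.completion_reason === 'interrupted';
}
```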

A note on actionHook

Like every jambonz verb, the llm verb fires its actionHook when the session ends. The payload includes a completion_reason indicating why:

  • Normal conversation end
  • Connection failure
  • Disconnect from remote end
  • Server failure
  • Server error
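One way an actionHook handler might branch on the reason, sketched with deliberately loose matching because only the reason categories, not their exact strings, are listed above:

```javascript
// Classify the session outcome from the actionHook payload.
// Matching is by keyword, since exact completion_reason strings may vary.
function classifyCompletion(evt) {
  const reason = (evt.completion_reason || '').toLowerCase();
  if (reason.includes('failure') || reason.includes('error')) return 'failed';
  if (reason.includes('disconnect')) return 'remote-hangup';
  return 'completed';
}
```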

Resources