Google Gemini Speech to Speech

Using jambonz to connect custom telephony to the Google Gemini Live API

The jambonz application referenced in this article can be found here.

This is an example jambonz application that connects to the Google Gemini Live API and illustrates how to build a Voice-AI application using jambonz and Google Gemini.

The example covers:

  • wiring up the llm verb to Gemini Live
  • a proactive greeting (“speak first”)
  • session resumption across reconnects
  • function calling, inline or via an MCP server
  • interruption handling

Prerequisites

  • a jambonz.cloud account (or a self-hosted jambonz deployment on 10.1.0 or later)
  • a Google Cloud Platform account with the Gemini API enabled
  • a carrier and virtual phone number of your choice

Running instructions

Set environment variables

$export GOOGLE_API_KEY="your-gemini-api-key"
$export PORT=3000
$
$# Optional: only required when testing the MCP integration
$export MCP_SERVER_URL=http://your-host:3001/sse
  • GOOGLE_API_KEY: a Google API key with access to the Gemini Live API. You can create one in Google AI Studio.
  • PORT: the port your Express server listens on. Defaults to 3000.
  • MCP_SERVER_URL: optional. When set, the agent’s tools are discovered from an MCP server instead of declared inline.
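A minimal sketch of how an app might read these variables into a config object (illustrative only; the example app's actual config handling may differ):

```javascript
// Derive app config from an environment map; names match the table above.
function loadConfig(env) {
  return {
    apiKey: env.GOOGLE_API_KEY,
    port: parseInt(env.PORT || '3000', 10),   // defaults to 3000
    mcpServerUrl: env.MCP_SERVER_URL || null  // optional MCP tool discovery
  };
}

const config = loadConfig(process.env);
```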

jambonz setup

  1. Create a carrier entity in the jambonz portal.
  2. Add your speech provider of choice. Gemini Live handles speech-to-speech end to end, but jambonz still needs a speech credential configured on the account.
  3. Create a new jambonz application under the Applications tab. Point both the Calling webhook and Call status webhook at your server:
    ws://your-example-domain.ngrok.io/google-s2s
  4. Provision a phone number on your carrier and associate it with the application.

Run the app

$npm install
$GOOGLE_API_KEY=<your key> npm start

To run with MCP tools, open two terminals:

$# Terminal 1 — MCP server
$MCP_SERVER_PORT=3001 npm run mcp-server
$
$# Terminal 2 — jambonz app
$GOOGLE_API_KEY=<your key> MCP_SERVER_URL='http://<your host>:3001/sse' node app.js

Call your virtual number and ask Barbara about the weather.

How the llm verb is wired up

The application calls session.llm({...}) with vendor: 'google' and a Gemini Live model. The llmOptions.setup object is forwarded verbatim to Google’s BidiGenerateContentSetup message:

session.llm({
  vendor: 'google',
  model: 'models/gemini-2.0-flash-live-001',
  auth: { apiKey: process.env.GOOGLE_API_KEY },
  actionHook: '/final',
  eventHook: '/event',
  toolHook: '/toolCall',
  llmOptions: {
    setup: {
      generationConfig: {
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Aoede' } }
        }
      },
      systemInstruction: {
        parts: [{ text: 'You are a helpful agent named Barbara that can only provide weather information.' }]
      },
      tools: [{
        functionDeclarations: [{
          name: 'get_weather',
          description: 'Get the weather for a location',
          parameters: {
            type: 'object',
            properties: {
              location: { type: 'string', description: 'The location to get the weather for' },
              scale: { type: 'string', enum: ['celsius', 'fahrenheit'] }
            },
            required: ['location']
          }
        }]
      }]
    }
  }
});

See the full route in lib/routes/weather-agent.js.

Proactive greeting (“speak first”)

For outbound calls — or any scenario where you want Gemini to speak first — add a greeting to llmOptions. jambonz sends it immediately after setup so the caller hears the agent within the first second:

llmOptions: {
  setup: { /* ... */ },
  greeting: 'Greet the caller warmly and ask how you can help.'
}

The value is an instruction to the model, not the literal greeting text. Use "Say exactly: Hello, thank you for calling Acme." if you need a scripted line.

This also works on models/gemini-3.1-flash-live-preview. On the 3.1 preview, Google restricted clientContent to seeding history only, so jambonz uses realtimeInput.text under the hood — the greeting field is the portable way to trigger a first turn across all Gemini Live models.

Session resumption

Gemini Live sessions can be resumed across websocket reconnects. Opt in by passing sessionResumption: {} in llmOptions. Each llm_event hook delivers a sessionResumptionUpdate containing a fresh newHandle — store the latest handle, then reconnect with sessionResumption: { handle: '<stored handle>' } to continue the conversation.
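A minimal sketch of that bookkeeping, assuming the event hook delivers Google's sessionResumptionUpdate message unchanged (resumable and newHandle are Google's field names; the surrounding function names are illustrative, not the example app's code):

```javascript
// Keep the most recent resumption handle delivered on the llm_event hook.
let resumptionHandle = null;

function onSessionResumptionUpdate(evt) {
  const update = evt.sessionResumptionUpdate;
  if (update && update.resumable && update.newHandle) {
    resumptionHandle = update.newHandle;
  }
}

// Build the sessionResumption value for llmOptions: an empty object opts in
// on a fresh connect; a stored handle resumes the prior conversation.
function resumptionOptions() {
  return resumptionHandle ? { handle: resumptionHandle } : {};
}
```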

Function calling

The toolHook fires when Gemini wants to call one of the declared functions. Respond with session.sendToolOutput:

session.sendToolOutput(tool_call_id, {
  toolResponse: {
    functionResponses: [
      { id, response: { output: { temperature: 20, unit: 'celsius' } } }
    ]
  }
});

Gemini’s native tool format uses functionCalls (inbound) and functionResponses (outbound) — jambonz passes them through without reshaping, so the payloads match the Gemini Live tool use docs exactly.
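Because the payloads pass through unreshaped, a toolHook handler can map functionCalls straight to functionResponses. A hedged sketch (the handler registry and the stubbed weather result are illustrative, not part of the example app):

```javascript
// Map Gemini functionCalls to the functionResponses payload expected by
// session.sendToolOutput.
function buildToolResponse(functionCalls, handlers) {
  return {
    toolResponse: {
      functionResponses: functionCalls.map(({ id, name, args }) => ({
        id,
        response: {
          output: handlers[name]
            ? handlers[name](args)
            : { error: `no handler for ${name}` }
        }
      }))
    }
  };
}

// Usage with the get_weather tool declared earlier (stubbed result);
// the payload would then be passed to session.sendToolOutput.
const payload = buildToolResponse(
  [{ id: 'call-1', name: 'get_weather', args: { location: 'Oslo' } }],
  { get_weather: ({ location }) => ({ temperature: 20, unit: 'celsius', location }) }
);
```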

Interruption handling

When the caller speaks over Gemini, the module emits output_audio.playback_stopped with completion_reason: "interrupted" on the event hook and discards any queued audio, so the caller is never talked over by stale agent speech. No application code is required; interruption handling is built in.
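Although nothing needs handling, the event is handy for metrics. A small predicate over the event-hook payload (field names taken from the event described above; the payload's exact shape should be verified against your jambonz version):

```javascript
// True when playback stopped because the caller barged in.
function wasBargeIn(evt) {
  return evt.type === 'output_audio.playback_stopped' &&
    evt.completion_reason === 'interrupted';
}
```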

A note on actionHook

Like every jambonz verb, the llm verb fires its actionHook when the session ends. The payload includes a completion_reason indicating why:

  • Normal conversation end
  • Connection failure
  • Disconnect from remote end
  • Server failure
  • Server error
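One way an actionHook handler might branch on the reason, sketched with deliberately loose matching because only the reason categories, not their exact strings, are listed above:

```javascript
// Classify the session outcome from the actionHook payload.
// Matching is by keyword, since exact completion_reason strings may vary.
function classifyCompletion(evt) {
  const reason = (evt.completion_reason || '').toLowerCase();
  if (reason.includes('failure') || reason.includes('error')) return 'failed';
  if (reason.includes('disconnect')) return 'remote-hangup';
  return 'completed';
}
```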

Resources