Voice Agents

Build conversational AI agents using the agent verb.

The agent verb is an experimental feature and requires jambonz version 10.1.1 or above.

The agent verb orchestrates a complete voice AI agent by wiring together three separate components — STT, LLM, and TTS — with integrated turn detection. Unlike the llm verb (which connects to speech-to-speech APIs where a single vendor handles everything), the agent verb lets you mix and match: for example, Deepgram for STT, Anthropic for the LLM, and Cartesia for TTS.

The agent verb manages the full conversational turn cycle:

  1. User speaks → STT produces a transcript
  2. Turn detection decides the user is done speaking
  3. Transcript is sent to the LLM
  4. LLM response tokens stream to TTS
  5. TTS audio plays back to the caller
  6. If the user barges in, TTS stops and a new turn begins

Looking for runnable examples? The jambonz/v10-examples repository has working demos for every feature described in this guide — basic usage, tool calling, MCP servers, CRM injection, persona switching, supervisor overrides, and more. Clone it and run any example end-to-end in minutes.

Basic Setup

The llm property is the only required field. STT and TTS will use your application’s default speech credentials if not specified.

Below is a minimal voice agent using the Node.js SDK and the application defaults for STT and TTS.

const http = require('node:http');
const { createEndpoint } = require('@jambonz/sdk/websocket');

const envVars = {
  OPENAI_MODEL: {
    type: 'string',
    description: 'OpenAI model to use',
    default: 'gpt-4.1-mini',
  },
  SYSTEM_PROMPT: {
    type: 'string',
    description: 'System prompt for the voice agent',
    uiHint: 'textarea',
    default: [
      'You are a helpful voice AI assistant.',
      'The user is interacting with you via voice,',
      'even if you perceive the conversation as text.',
      'You eagerly assist users with their questions',
      'by providing information from your extensive knowledge.',
      'Your responses are concise, to the point,',
      'and use natural spoken English with proper punctuation.',
      'Never use markdown, bullet points, numbered lists,',
      'emojis, asterisks, or any special formatting.',
      'You are curious, friendly, and have a sense of humor.',
      'When the conversation begins,',
      'greet the user in a helpful and friendly manner.',
    ].join(' '),
  },
};

const port = parseInt(process.env.PORT || '3000', 10);
const server = http.createServer();
const makeService = createEndpoint({ server, port, envVars });
const svc = makeService({ path: '/' });

svc.on('session:new', (session) => {
  console.log('session:new received', JSON.stringify({
    call_sid: session.data.call_sid,
    direction: session.data.direction,
    from: session.data.from,
    to: session.data.to,
    env_vars: session.data.env_vars,
  }, null, 2));

  try {
    const model = session.data.env_vars?.OPENAI_MODEL || envVars.OPENAI_MODEL.default;
    const systemPrompt = session.data.env_vars?.SYSTEM_PROMPT || envVars.SYSTEM_PROMPT.default;
    console.log('using model:', model);

    session.on('/agent-event', (evt) => {
      console.log('agent-event received:', evt.type);
      if (evt.type === 'turn_end') {
        const { transcript, response, interrupted, latency } = evt;
        console.log('turn_end', JSON.stringify({ transcript, response, interrupted, latency }, null, 2));
      }
    });

    session.on('/agent-complete', () => {
      console.log('agent-complete received, sending hangup');
      session.hangup().reply();
    });

    console.log('sending agent verb...');
    session
      .agent({
        llm: {
          vendor: 'openai',
          model,
          llmOptions: {
            messages: [{ role: 'system', content: systemPrompt }],
          },
        },
        turnDetection: 'krisp',
        earlyGeneration: true,
        bargeIn: { enable: true },
        eventHook: '/agent-event',
        actionHook: '/agent-complete',
      })
      .send();
    console.log('agent verb sent');
  } catch (err) {
    console.error('Error in session:new handler:', err);
  }
});

svc.on('error', (err) => {
  console.error('service error:', err);
});

console.log(`voice agent listening on port ${port}`);

Supported LLM Vendors

The jambonz portal lets you “bring your own LLM,” much as you do with speech credentials: configure credentials for any supported LLM vendor, then select that vendor in the agent verb.

| Vendor | llm.vendor | Example models |
|---|---|---|
| OpenAI | openai | gpt-4.1-mini, gpt-4.1, gpt-4o |
| Anthropic | anthropic | claude-sonnet-4-6, claude-opus-4-6 |
| Google | google | gemini-2.5-flash-lite, gemini-2.5-pro |
| AWS Bedrock | aws | us.meta.llama4-scout-17b-instruct-v1:0 |

The agent verb normalizes message formats and tool schemas across vendors automatically. You write tools in OpenAI format and the agent verb adapts them for each vendor.

Authentication

By default, the agent verb uses speech credentials configured in the jambonz portal. You can also pass credentials directly:

llm: {
  vendor: 'openai',
  model: 'gpt-4.1-mini',
  auth: { apiKey: process.env.OPENAI_API_KEY },
  // ...
}

For AWS Bedrock, pass accessKeyId, secretAccessKey, and region in the auth object.
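For example, a Bedrock llm configuration might look like the sketch below. The auth field names come from the note above; the model ID is one of the examples from the vendor table, and the environment variable names are just conventional choices.

```javascript
// Sketch of an agent llm config for AWS Bedrock.
// accessKeyId, secretAccessKey, and region are the documented auth fields;
// the env var names used here are assumptions, not requirements.
const llm = {
  vendor: 'aws',
  model: 'us.meta.llama4-scout-17b-instruct-v1:0',
  auth: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
    region: process.env.AWS_REGION || 'us-east-1',
  },
  llmOptions: {
    messages: [{ role: 'system', content: 'You are a helpful voice assistant.' }],
  },
};
```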

Turn Detection

The turnDetection property controls how the agent verb decides the user has finished speaking. We currently support only two modes — STT-based detection and Krisp’s turn detection model.

STT-based detection (default)

{ "turnDetection": "stt" }

Uses the STT vendor’s native end-of-utterance signal. For most vendors this is silence-based. Some vendors have smarter built-in turn detection:

  • deepgram (Flux) — Acoustic + semantic turn detection using Deepgram’s Flux model
  • assemblyai — Native turn-taking with the u3-rt-pro model
  • speechmatics — Built-in turn detection

These vendors always use their native detection regardless of the turnDetection setting.

Krisp turn detection

{
  "turnDetection": {
    "mode": "krisp",
    "threshold": 0.5
  }
}

Uses the Krisp acoustic end-of-turn model, which analyzes speech patterns rather than just silence. Good for natural conversation where users pause mid-thought.

  • threshold — Confidence threshold from 0.0 to 1.0. Lower values trigger earlier turn transitions (more aggressive). Default: 0.5.
  • model — Optional Krisp model name override.

The shorthand "turnDetection": "krisp" uses default settings.

You must have a Krisp API key configured in order to use Krisp turn detection on a self-hosted jambonz system. Contact support@jambonz.org for details.

Early Generation (Speculative Preflight)

Early generation speculatively sends the transcript to the LLM before end-of-turn is confirmed. If the transcript matches when the turn does end, buffered tokens are released immediately — shaving off the LLM prompt time. If the user keeps talking and the transcript changes, the speculative response is discarded.

There are two ways early generation is triggered:

  • Krisp turn detection — Set earlyGeneration: true to opt in. Krisp emits an early signal that triggers the speculative LLM prompt before final end-of-turn confirmation.
  • Deepgram Flux — Early generation happens automatically. Flux emits a native EagerEndOfTurn event that triggers preflight regardless of the earlyGeneration setting.

For other STT vendors with native turn-taking (assemblyai, speechmatics), early generation is not available.

session.agent({
  turnDetection: 'krisp',
  earlyGeneration: true,
  // ...
}).send();

The turn_end event includes preflight metrics so you can track hit rates:

  • hit — speculative transcript matched final, tokens released immediately
  • miss — transcript changed, speculative response discarded
  • pending — preflight was still in progress when the turn ended
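To monitor hit rates in practice, you can tally the preflight results from your eventHook handler. A minimal sketch, assuming the `latency.preflight.result` field path shown in the turn_end payload documented in this guide:

```javascript
// Tally speculative-preflight outcomes across turn_end events.
// A persistently high miss rate may mean early generation fires too eagerly.
const preflightStats = { hit: 0, miss: 0, pending: 0 };

function recordTurnEnd(evt) {
  const result = evt?.latency?.preflight?.result;
  if (result in preflightStats) preflightStats[result] += 1;
  const total = preflightStats.hit + preflightStats.miss + preflightStats.pending;
  return total ? preflightStats.hit / total : 0; // running hit rate
}

// e.g. call from inside session.on('/agent-event', ...) when evt.type === 'turn_end'
recordTurnEnd({ type: 'turn_end', latency: { preflight: { result: 'hit' } } });
recordTurnEnd({ type: 'turn_end', latency: { preflight: { result: 'miss' } } });
console.log(recordTurnEnd({ type: 'turn_end', latency: { preflight: { result: 'hit' } } })); // → 0.6666666666666666
```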

Barge-in Configuration

By default, users can interrupt the assistant while it’s speaking. The bargeIn object controls this behavior:

{
  "bargeIn": {
    "enable": true,
    "minSpeechDuration": 0.5,
    "sticky": false
  }
}

  • enable — Allow interruptions. Default: true.
  • minSpeechDuration — Seconds of speech required to confirm an interruption. Prevents brief noises (coughs, background sounds) from cutting off the assistant. Default: 0.5.
  • sticky — If true, once the user interrupts, the assistant does not resume speaking the interrupted response. Default: false.

Tuning tips:

  • Lower minSpeechDuration (e.g., 0.2) for more responsive barge-in
  • Higher minSpeechDuration (e.g., 1.0) for noisy environments where false triggers are common
  • Set enable: false for scenarios where the assistant must complete its message (e.g., legal disclaimers)

No Response Timeout

The noResponseTimeout property handles the case where the user goes silent after the assistant finishes speaking.

{ "noResponseTimeout": 12 }

When the timeout fires, the LLM is prompted with a system cue: “The user has not responded. Briefly check if they are still there or ask if they need help.” This generates a natural follow-up rather than leaving dead air.

Defaults to 12 seconds. Set to 0 to disable. The timer is cancelled if the user starts speaking.

This also covers the “missed speech” case: when VAD detects speech but STT returns no transcript, the no-response timer handles the re-prompt.

Greeting

By default (greeting: true), the agent verb prompts the LLM to generate an initial greeting before the user speaks. Set greeting: false if you want the agent to wait silently for the user to speak first.
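For example, to have the agent wait silently:

```json
{ "greeting": false }
```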

Tool/Function Calling

The agent verb supports LLM tool/function calling, allowing your agent to perform actions like looking up data, calling APIs, or transferring calls. There are two ways to provide tools:

  1. Roll your own — define a JSON schema, list it in llmOptions.tools, and handle the tool call yourself in a toolHook handler. Use this for tools specific to your application (CRM lookups, business logic, proprietary APIs).
  2. Use pre-built tools from @jambonz/tools — drop in ready-made tools (web search, weather, Wikipedia, calculator, datetime) without writing schemas or handlers. Use this for common utility tools.

You can mix both approaches in the same agent — they share the same toolHook path.

Rolling your own tools

Use this approach for tools that are specific to your application. You supply the schema and handle the execution yourself.

Defining the tool schema

Define tools in llm.llmOptions.tools using the standard function-calling format:

const weatherTool = {
  name: 'get_weather',
  description: 'Get the current temperature and wind speed for a location.',
  parameters: {
    type: 'object',
    properties: {
      location: { type: 'string', description: 'City name, e.g. "Portland"' },
      scale: { type: 'string', enum: ['celsius', 'fahrenheit'] },
    },
    required: ['location'],
  },
};

session.agent({
  llm: {
    vendor: 'openai',
    model: 'gpt-4.1-mini',
    llmOptions: {
      messages: [{ role: 'system', content: 'You are a weather assistant.' }],
      tools: [weatherTool],
    },
  },
  toolHook: '/tool-call',
  // ...
}).send();

The agent verb normalizes tool schemas across LLM vendors. You always define tools in the same format regardless of whether you’re using OpenAI, Anthropic, Google, or Bedrock.

Handling tool calls (WebSocket)

Tool calls arrive as events on the toolHook path with tool_call_id, name, and arguments (already parsed as an object). Respond with session.sendToolOutput():

session.on('/tool-call', async (evt) => {
  const { tool_call_id, name, arguments: args } = evt;

  if (name === 'get_weather') {
    try {
      const geoRes = await fetch(
        `https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(args.location)}&count=1`
      );
      const geoData = await geoRes.json();
      const { latitude, longitude } = geoData.results[0];

      const wxRes = await fetch(
        `https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current=temperature_2m,wind_speed_10m`
      );
      const weather = await wxRes.json();
      session.sendToolOutput(tool_call_id,
        `Temperature: ${weather.current.temperature_2m}°C, Wind: ${weather.current.wind_speed_10m} km/h`
      );
    } catch (err) {
      session.sendToolOutput(tool_call_id, `Error: ${err.message}`);
    }
    return;
  }

  session.sendToolOutput(tool_call_id, `Unknown tool: ${name}`);
});

Handling tool calls (Webhook)

In webhook mode, the tool call arrives as an HTTP POST to the toolHook URL. Return the tool result as JSON in the response body.

Using pre-built tools from @jambonz/tools

For common utility tools — web search, weather, Wikipedia, calculator, datetime — the @jambonz/tools package lets you skip the schema definition and handler code entirely. Each tool bundles a JSON Schema (for the LLM) and an execute() function (for your application) that are wired into your session with a single call.

@jambonz/tools is open source (MIT-licensed) and we actively welcome community contributions. If you’ve built a useful tool — a CRM lookup, a scheduling integration, a knowledge-base query — please consider opening a PR so other jambonz developers can use it. See the contributing guidelines in the repo README.

$ npm install @jambonz/tools

Available tools:

| Tool | Factory | API key | Description |
|---|---|---|---|
| Web Search | createTavilySearch | Tavily | Search the web for current info |
| Weather | createWeather | none | Current weather for any location (Open-Meteo) |
| Wikipedia | createWikipedia | none | Factual summaries |
| Calculator | createCalculator | none | Safe math expression evaluator |
| Date & Time | createDateTime | none | Current date/time for any timezone |

registerTools() wires the tools into your session — it listens on the toolHook path, dispatches each incoming tool call to the matching execute() function, and sends the result back via sendToolOutput():

import { createTavilySearch, createWeather, createCalculator, registerTools } from '@jambonz/tools';

const search = createTavilySearch({ apiKey: process.env.TAVILY_API_KEY });
const weather = createWeather({ scale: 'fahrenheit' });
const calc = createCalculator();
const tools = [search, weather, calc];

svc.on('session:new', (session) => {
  registerTools(session, '/tool-call', tools);

  session.agent({
    stt: { vendor: 'deepgram', language: 'multi' },
    tts: { vendor: 'cartesia', voice: '9626c31c-bec5-4cca-baa8-f8ba9e84c8bc' },
    llm: {
      vendor: 'openai',
      model: 'gpt-4.1-mini',
      llmOptions: {
        messages: [{
          role: 'system',
          content: 'You are a helpful voice assistant with web search, weather, and math tools. ' +
            'Keep responses concise and conversational.',
        }],
        tools: tools.map((t) => t.schema),
      },
    },
    toolHook: '/tool-call',
    actionHook: '/agent-complete',
  }).send();
});

registerTools() also accepts a logger option and returns errors to the LLM if a tool throws or is called with an unknown name.

Combining both approaches

You can mix pre-built tools from @jambonz/tools with your own custom tools in the same agent. Include the schemas from both in llmOptions.tools, use registerTools() for the pre-built ones, and attach your own toolHook handler for the custom ones. The two dispatch paths run side by side — registerTools() only handles tool calls whose name matches one it was given, so custom calls fall through to your handler.

const myTool = {
  name: 'lookup_order',
  description: 'Look up an order by ID',
  parameters: {
    type: 'object',
    properties: { order_id: { type: 'string' } },
    required: ['order_id'],
  },
};

// pre-built tools
registerTools(session, '/tool-call', [search, weather]);

// custom tool handler — runs alongside registerTools
session.on('/tool-call', async (evt) => {
  if (evt.name === 'lookup_order') {
    const order = await db.orders.find(evt.arguments.order_id);
    session.sendToolOutput(evt.tool_call_id, JSON.stringify(order));
  }
});

session.agent({
  llm: {
    vendor: 'openai',
    model: 'gpt-4.1-mini',
    llmOptions: {
      tools: [search.schema, weather.schema, myTool],
    },
  },
  toolHook: '/tool-call',
  // ...
}).send();

You can also inject pre-built tools mid-conversation using updateAgent:

session.updateAgent({
  type: 'update_tools',
  tools: [search.schema, weather.schema],
});

MCP Server Integration

Instead of (or in addition to) defining tools inline, you can connect to external MCP servers. The agent verb connects to each server at startup via SSE or Streamable HTTP transport, discovers available tools, and makes them callable by the LLM.

session
  .agent({
    llm: {
      vendor: 'openai',
      model: 'gpt-4.1',
      llmOptions: {
        messages: [{
          role: 'system',
          content: 'You are a sports assistant. Use available tools to answer questions about live scores.',
        }],
      },
    },
    stt: { vendor: 'deepgram', language: 'en-US' },
    tts: { vendor: 'cartesia', voice: 'sonic-english' },
    mcpServers: [
      { url: 'https://livescoremcp.com/sse' },
    ],
    actionHook: '/agent-complete',
  })
  .send();

A caller can simply ask “what football matches are on right now?” and the LLM will use the tools discovered from the MCP server to fetch real-time data — no need to define tool schemas in llmOptions.tools.

If an MCP server requires authentication:

{
  "mcpServers": [
    {
      "url": "https://mcp.tavily.com/mcp/?tavilyApiKey=your-key",
      "auth": { "apiKey": "your-key" }
    }
  ]
}

Tool dispatch priority: When the LLM requests a tool call, MCP servers are checked first. If the tool name matches one discovered from an MCP server, the call is dispatched there. Otherwise, it falls through to the toolHook webhook. You can use both together.

Mid-conversation Updates

The agent verb supports asynchronous updates while a conversation is in progress, allowing you to change the agent’s behavior, inject context, modify tools, or trigger responses — without interrupting the verb stack.

Updates are sent via WebSocket (session.updateAgent(data)) or REST API.

update_instructions

Replace the LLM system prompt mid-conversation. Useful for persona switching or topic transitions.

// After identifying the caller's intent, switch to a specialist persona
session.updateAgent({
  type: 'update_instructions',
  instructions: 'You are now a billing support agent. Help the caller with invoice questions.',
});

inject_context

Append messages to the LLM conversation history. System messages are routed to the system prompt for vendors that don’t support inline system messages (Bedrock, Anthropic, Google).

// Inject CRM data after identifying the caller
session.updateAgent({
  type: 'inject_context',
  messages: [
    {
      role: 'user',
      content: 'CRM context: Customer name: Sarah Mitchell. Account tier: Gold. ' +
        'Open support ticket: delayed delivery on the smart home hub.',
    },
  ],
});

update_tools

Replace the tool set available to the LLM. The new tools take effect on the next turn.

// Add web search capability after the user requests it
session.updateAgent({
  type: 'update_tools',
  tools: [
    {
      name: 'web_search',
      description: 'Search the web for current information',
      parameters: {
        type: 'object',
        properties: { query: { type: 'string' } },
        required: ['query'],
      },
    },
  ],
});

generate_reply

Prompt the LLM to generate a new response. If the agent verb is idle, the prompt executes immediately. If busy, the request is queued.

Use interrupt: true to cancel the current response and generate immediately — useful for supervisor overrides or urgent notifications.

// Supervisor whisper — interrupt with urgent info
session.updateAgent({
  type: 'generate_reply',
  interrupt: true,
  user_input: 'URGENT: Tell the customer about the flash sale — 50% off all items for the next hour.',
});

// Gentle prompt with one-shot instructions
session.updateAgent({
  type: 'generate_reply',
  user_input: 'Customer is asking about refunds',
  instructions: 'Be empathetic and offer a 20% discount before processing a refund.',
});

Event Handling

The eventHook receives real-time events during the conversation. In WebSocket mode, listen with session.on():

session.on('/agent-event', (evt) => {
  switch (evt.type) {
    case 'user_transcript':
      console.log('User said:', evt.transcript);
      break;
    case 'agent_response':
      console.log('Agent replied:', evt.response);
      break;
    case 'user_interruption':
      console.log('User interrupted');
      break;
    case 'turn_end':
      console.log('Turn complete:', {
        transcript: evt.transcript,
        response: evt.response,
        latency: evt.latency,
      });
      break;
  }
});

Event Types

| Event | Description | Key fields |
|---|---|---|
| user_transcript | User speech recognized | transcript |
| agent_response | Assistant reply text | response |
| user_interruption | User barged in | — |
| turn_end | End-of-turn summary | transcript, confidence, response, interrupted, latency, tool_calls |
| history_summarized | Conversation summarized | turn, messages_dropped, messages_kept, summary |

turn_end Payload

The turn_end event is the most useful for observability. Example payload:

{
  "type": "turn_end",
  "transcript": "What's the weather in Portland?",
  "confidence": 0.998,
  "response": "The temperature in Portland is 52°F with wind at 12 km/h.",
  "interrupted": false,
  "latency": {
    "stt_ms": 320,
    "eot_ms": 180,
    "llm_ms": 890,
    "tool_ms": 420,
    "tts_ms": 210,
    "preflight": {
      "result": "hit",
      "tokens": 12
    }
  },
  "tool_calls": [
    { "name": "get_weather", "rtt_ms": 420 }
  ]
}

Latency Optimization

The turn_end latency breakdown helps you identify bottlenecks and optimize response time.

| Field | What it measures | How to optimize |
|---|---|---|
| stt_ms | STT processing time | Choose low-latency STT vendors (Deepgram). Use hints to improve accuracy. |
| eot_ms | End-of-turn detection wait | Tune Krisp threshold (lower = faster). Use vendors with native turn-taking. |
| llm_ms | Pure LLM thinking time (tool RTT subtracted) | Use faster models (e.g., gpt-4.1-mini). Keep system prompts concise. Enable earlyGeneration. |
| tool_ms | Total time in tool calls | Optimize tool endpoint latency. Use caching where appropriate. |
| tts_ms | TTS engine latency (text → first audio) | Choose streaming-capable TTS (Cartesia, ElevenLabs, Deepgram). |
| preflight | Speculative preflight result | Enable earlyGeneration with Krisp. Monitor hit rate — high miss rates may indicate the threshold is too aggressive. |
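Since llm_ms already has tool round-trip time subtracted, a rough end-to-end figure for a turn is the sum of the stage timings. A small sketch, treating the stages as strictly sequential (a simplification — some stages overlap in practice):

```javascript
// Approximate total turn latency from the turn_end breakdown.
// Assumes sequential stages; real pipelines overlap STT/TTS streaming.
function totalTurnLatency(latency) {
  const { stt_ms = 0, eot_ms = 0, llm_ms = 0, tool_ms = 0, tts_ms = 0 } = latency;
  return stt_ms + eot_ms + llm_ms + tool_ms + tts_ms;
}

// Using the example payload values from this guide:
console.log(totalTurnLatency({ stt_ms: 320, eot_ms: 180, llm_ms: 890, tool_ms: 420, tts_ms: 210 })); // → 2020
```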

Conversation History Summarization

For long conversations that might exceed the LLM’s context window, the agent verb can automatically summarize older turns.

Set the JAMBONES_PIPELINE_SUMMARIZE_TURNS environment variable to control how often summarization runs. Values 1–7 are clamped to 8. Set to 0 to disable (default).

When summarization triggers:

  1. The LLM generates a concise summary of the older conversation turns
  2. The summary is appended to the system prompt as a “Conversation context” section
  3. The summarized turns are dropped from conversation history
  4. Half the configured number of turns are kept in full fidelity
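For example, setting the variable to 10 summarizes every 10 turns, leaving the 5 most recent turns at full fidelity:

```shell
# Summarize every 10 turns; per the rules above, half the configured
# number (5 turns) is kept in full fidelity after each summarization.
export JAMBONES_PIPELINE_SUMMARIZE_TURNS=10
```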

A history_summarized event is sent to the eventHook:

{
  "type": "history_summarized",
  "turn": 8,
  "messages_dropped": 5,
  "messages_kept": 6,
  "summary": "The user is a software developer looking for a MacBook Pro..."
}

Noise Isolation

The noiseIsolation property enables server-side noise cancellation on call audio, improving STT accuracy in noisy environments.

Two vendors are available:

  • "krisp" — Krisp’s proprietary noise cancellation. Requires a Krisp API key on self-hosted systems.
  • "rnnoise" — Open-source RNNoise-based noise cancellation. No API key required.

Shorthand:

{ "noiseIsolation": "krisp" }

Detailed configuration:

{
  "noiseIsolation": {
    "mode": "krisp",
    "level": 80,
    "direction": "read"
  }
}

  • level — Suppression level 0–100. Higher values are more aggressive. Default: 100.
  • direction"read" filters caller audio (default), "write" filters outbound audio.

Error Recovery

The agent verb handles errors gracefully to keep the conversation going:

  • LLM errors with tools — If the LLM fails and tools were included, the agent verb retries the same prompt without tools. This handles models that don’t support tool use in certain configurations.
  • Speculative preflight errors — When a speculative prompt fails, the preflight is discarded and a fresh prompt is issued normally.
  • Recovery to idle — On unrecoverable LLM errors, the agent verb ends the turn and transitions to idle so the user can continue speaking. The no-response timer is not started after an error to avoid retry loops.
  • STT reconnection — The agent verb automatically reconnects the STT stream if the connection drops.

When the agent verb encounters an unrecoverable error, it invokes the actionHook with a completion_reason indicating the failure.
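In your actionHook handler you can branch on that field before deciding whether to hang up, retry, or fall back to another verb. A sketch, noting that the specific completion_reason string values matched here are illustrative assumptions, not a documented list:

```javascript
// Classify the actionHook payload; the reason strings tested here
// are hypothetical examples, not an exhaustive enumeration.
function classifyCompletion(evt) {
  const reason = evt.completion_reason || 'unknown';
  return { reason, isError: /error|fail/i.test(reason) };
}

// e.g. session.on('/agent-complete', (evt) => { ... })
console.log(classifyCompletion({ completion_reason: 'llm error' })); // → { reason: 'llm error', isError: true }
```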

Example Applications

The agent examples repository contains runnable demos for each feature:

| Example | What it demonstrates |
|---|---|
| deepgram-cartesia | Basic agent with Deepgram STT + Cartesia TTS |
| deepgramflux-elevenlabs | Deepgram Flux (native turn detection) + ElevenLabs TTS |
| speechmatics-rime | Speechmatics STT + Rime TTS |
| using-tools | Tool calling with weather lookup |
| web-search | Web search via Tavily tool |
| using-mcp-server | MCP server for live sports scores |
| tavily-mcp | Web search via Tavily MCP server |
| crm-injection | Live CRM context injection via inject_context |
| persona-switch | Mid-conversation persona change via update_instructions |
| supervisor-interrupt | Urgent message injection via generate_reply with interrupt |
| dynamic-tools | Mid-conversation tool injection via update_tools |