Voice Agents
Build conversational AI agents using the agent verb.
The agent verb is an experimental feature and requires jambonz version 10.1.1 or above.
The agent verb orchestrates a complete voice AI agent by wiring together three separate components — STT, LLM, and TTS — with integrated turn detection. Unlike the llm verb (which connects to speech-to-speech APIs where a single vendor handles everything), the agent verb lets you mix and match: for example, Deepgram for STT, Anthropic for the LLM, and Cartesia for TTS.
The agent verb manages the full conversational turn cycle:
- User speaks → STT produces a transcript
- Turn detection decides the user is done speaking
- Transcript is sent to the LLM
- LLM response tokens stream to TTS
- TTS audio plays back to the caller
- If the user barges in, TTS stops and a new turn begins
Looking for runnable examples? The jambonz/v10-examples repository has working demos for every feature described in this guide — basic usage, tool calling, MCP servers, CRM injection, persona switching, supervisor overrides, and more. Clone it and run any example end-to-end in minutes.
Basic Setup
The llm property is the only required field. STT and TTS will use your application’s default speech credentials if not specified.
Below is a minimal voice agent using the Node.js SDK and the application defaults for STT and TTS.
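Whether sent from the SDK or returned from a webhook, the verb itself is a JSON document along these lines (the vendor and model values, and the llmOptions.instructions field for the system prompt, are illustrative assumptions):

```json
[
  {
    "verb": "agent",
    "llm": {
      "vendor": "anthropic",
      "model": "claude-sonnet-4-20250514",
      "llmOptions": {
        "instructions": "You are a friendly voice assistant. Keep answers to one or two sentences."
      }
    }
  }
]
```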
Supported LLM Vendors
The jambonz portal now allows you to “bring your own LLM” in a similar fashion to speech credentials. You can configure credentials for any supported LLM vendor and then select that vendor in the agent verb.
The agent verb normalizes message formats and tool schemas across vendors automatically. You write tools in OpenAI format and the agent verb adapts them for each vendor.
Authentication
By default, the agent verb uses speech credentials configured in the jambonz portal. You can also pass credentials directly:
For AWS Bedrock, pass accessKeyId, secretAccessKey, and region in the auth object.
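For example, a sketch of inline Bedrock credentials using the auth keys named above (the placement under llm and the model id are assumptions):

```json
{
  "verb": "agent",
  "llm": {
    "vendor": "bedrock",
    "model": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "auth": {
      "accessKeyId": "AKIAEXAMPLE",
      "secretAccessKey": "YOUR_SECRET_KEY",
      "region": "us-east-1"
    }
  }
}
```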
Turn Detection
The turnDetection property controls how the agent verb decides the user has finished speaking.
We currently support only two modes — STT-based detection and Krisp’s turn detection model.
STT-based detection (default)
Uses the STT vendor’s native end-of-utterance signal. For most vendors this is silence-based. Some vendors have smarter built-in turn detection:
- `deepgramflux` — Acoustic + semantic turn detection (Deepgram’s “Flux” model)
- `assemblyai` — Native turn-taking with the `u3-rt-pro` model
- `speechmatics` — Built-in turn detection
These vendors always use their native detection regardless of the turnDetection setting.
Krisp turn detection
Uses the Krisp acoustic end-of-turn model, which analyzes speech patterns rather than just silence. Good for natural conversation where users pause mid-thought.
- `threshold` — Confidence threshold from 0.0 to 1.0. Lower values trigger earlier turn transitions (more aggressive). Default: 0.5.
- `model` — Optional Krisp model name override.
The shorthand "turnDetection": "krisp" uses default settings.
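The expanded object form might look like this (the `vendor` discriminator key is an assumption; `threshold` and `model` are the options described above):

```json
{
  "verb": "agent",
  "turnDetection": {
    "vendor": "krisp",
    "threshold": 0.4
  }
}
```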
You must have a Krisp API key configured in order to use Krisp turn detection on a self-hosted jambonz system. Contact support@jambonz.org for details.
Early Generation (Speculative Preflight)
Early generation speculatively sends the transcript to the LLM before end-of-turn is confirmed. If the transcript matches when the turn does end, buffered tokens are released immediately — shaving off the LLM prompt time. If the user keeps talking and the transcript changes, the speculative response is discarded.
There are two ways early generation is triggered:
- Krisp turn detection — Set `earlyGeneration: true` to opt in. Krisp emits an early signal that triggers the speculative LLM prompt before final end-of-turn confirmation.
- Deepgram Flux — Early generation happens automatically. Flux emits a native `EagerEndOfTurn` event that triggers preflight regardless of the `earlyGeneration` setting.
For other STT vendors with native turn-taking (assemblyai, speechmatics), early generation is not available.
The turn_end event includes preflight metrics so you can track hit rates:
- `hit` — speculative transcript matched the final transcript; tokens released immediately
- `miss` — transcript changed; speculative response discarded
- `pending` — preflight was still in progress when the turn ended
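As a sketch, a WebSocket application could keep a small tally over turn_end events to monitor preflight effectiveness. The event shape here (a `preflight` field holding `hit`/`miss`/`pending`) is an assumption based on the metric names above:

```javascript
// Tally speculative-preflight outcomes across turn_end events.
// Assumed event shape: { type: 'turn_end', preflight: 'hit' | 'miss' | 'pending' }
function makePreflightStats() {
  const counts = { hit: 0, miss: 0, pending: 0 };
  return {
    record(evt) {
      if (evt.preflight in counts) counts[evt.preflight]++;
    },
    hitRate() {
      const total = counts.hit + counts.miss + counts.pending;
      return total === 0 ? 0 : counts.hit / total;
    },
    counts: () => ({ ...counts })
  };
}

// Feed a few synthetic events for illustration
const stats = makePreflightStats();
for (const p of ['hit', 'hit', 'miss', 'pending']) {
  stats.record({ type: 'turn_end', preflight: p });
}
console.log(stats.counts(), stats.hitRate()); // { hit: 2, miss: 1, pending: 1 } 0.5
```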
Barge-in Configuration
By default, users can interrupt the assistant while it’s speaking. The bargeIn object controls this behavior:
- `enable` — Allow interruptions. Default: `true`.
- `minSpeechDuration` — Seconds of speech required to confirm an interruption. Prevents brief noises (coughs, background sounds) from cutting off the assistant. Default: 0.5.
- `sticky` — If `true`, once the user interrupts, the assistant does not resume speaking the interrupted response. Default: `false`.
Tuning tips:
- Lower `minSpeechDuration` (e.g., 0.2) for more responsive barge-in
- Higher `minSpeechDuration` (e.g., 1.0) for noisy environments where false triggers are common
- Set `enable: false` for scenarios where the assistant must complete its message (e.g., legal disclaimers)
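Putting the options together, a sketch of a responsive barge-in configuration:

```json
{
  "verb": "agent",
  "bargeIn": {
    "enable": true,
    "minSpeechDuration": 0.2,
    "sticky": true
  }
}
```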
No Response Timeout
The noResponseTimeout property handles the case where the user goes silent after the assistant finishes speaking.
When the timeout fires, the LLM is prompted with a system cue: “The user has not responded. Briefly check if they are still there or ask if they need help.” This generates a natural follow-up rather than leaving dead air.
Defaults to 12 seconds. Set to 0 to disable. The timer is cancelled if the user starts speaking.
This also covers the “missed speech” case: when VAD detects speech but STT returns no transcript, the no-response timer handles the re-prompt.
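For example, to wait 20 seconds before re-prompting:

```json
{
  "verb": "agent",
  "noResponseTimeout": 20
}
```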
Greeting
By default (greeting: true), the agent verb prompts the LLM to generate an initial greeting before the user speaks. Set greeting: false if you want the agent to wait silently for the user to speak first.
Tool/Function Calling
The agent verb supports LLM tool/function calling, allowing your agent to perform actions like looking up data, calling APIs, or transferring calls. There are two ways to provide tools:
- Roll your own — define a JSON schema, list it in `llmOptions.tools`, and handle the tool call yourself in a `toolHook` handler. Use this for tools specific to your application (CRM lookups, business logic, proprietary APIs).
- Use pre-built tools from `@jambonz/tools` — drop in ready-made tools (web search, weather, Wikipedia, calculator, datetime) without writing schemas or handlers. Use this for common utility tools.
You can mix both approaches in the same agent — they share the same toolHook path.
Rolling your own tools
Use this approach for tools that are specific to your application. You supply the schema and handle the execution yourself.
Defining the tool schema
Define tools in llm.llmOptions.tools using the standard function-calling format:
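For example, a hypothetical lookup_order tool in OpenAI function-calling format:

```json
{
  "llmOptions": {
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "lookup_order",
          "description": "Look up the status of an order by its order number",
          "parameters": {
            "type": "object",
            "properties": {
              "order_number": {
                "type": "string",
                "description": "The customer's order number"
              }
            },
            "required": ["order_number"]
          }
        }
      }
    ]
  }
}
```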
The agent verb normalizes tool schemas across LLM vendors. You always define tools in the same format regardless of whether you’re using OpenAI, Anthropic, Google, or Bedrock.
Handling tool calls (WebSocket)
Tool calls arrive as events on the toolHook path with tool_call_id, name, and arguments (already parsed as an object). Respond with session.sendToolOutput():
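A hedged sketch of a handler, assuming the event carries tool_call_id, name, and already-parsed arguments as described above, and that session.sendToolOutput takes (tool_call_id, result) in that order. The lookup logic is hypothetical:

```javascript
// Dispatch a tool call and return the result to the LLM.
// Assumed reply call: session.sendToolOutput(tool_call_id, result)
async function onToolCall(session, evt) {
  const { tool_call_id, name, arguments: args } = evt;
  let result;
  if (name === 'lookup_order') {
    // hypothetical business logic; replace with a real lookup
    result = { status: 'shipped', order_number: args.order_number };
  } else {
    result = { error: `unknown tool: ${name}` };
  }
  session.sendToolOutput(tool_call_id, result);
}

// Illustration with a stub session standing in for the SDK object
const sent = [];
const session = { sendToolOutput: (id, data) => sent.push({ id, data }) };
onToolCall(session, {
  tool_call_id: 'call_1',
  name: 'lookup_order',
  arguments: { order_number: 'A-1042' }
});
```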
Handling tool calls (Webhook)
In webhook mode, the tool call arrives as an HTTP POST to the toolHook URL. Return the tool result as JSON in the response body.
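The shapes below are a sketch: the request fields mirror the WebSocket event described above, and the result body is application-defined:

```json
{
  "tool_call_id": "call_1",
  "name": "lookup_order",
  "arguments": { "order_number": "A-1042" }
}
```

Your HTTP 200 response body becomes the tool result returned to the LLM:

```json
{ "status": "shipped", "eta": "2 days" }
```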
Using pre-built tools from @jambonz/tools
For common utility tools — web search, weather, Wikipedia, calculator, datetime — the @jambonz/tools package lets you skip the schema definition and handler code entirely. Each tool bundles a JSON Schema (for the LLM) and an execute() function (for your application) that are wired into your session with a single call.
@jambonz/tools is open source (MIT-licensed) and we actively welcome community contributions. If you’ve built a useful tool — a CRM lookup, a scheduling integration, a knowledge-base query — please consider opening a PR so other jambonz developers can use it. See the contributing guidelines in the repo README.
Available tools:
registerTools() wires the tools into your session — it listens on the toolHook path, dispatches each incoming tool call to the matching execute() function, and sends the result back via sendToolOutput():
registerTools() also accepts a logger option and returns errors to the LLM if a tool throws or is called with an unknown name.
Combining both approaches
You can mix pre-built tools from @jambonz/tools with your own custom tools in the same agent. Include the schemas from both in llmOptions.tools, use registerTools() for the pre-built ones, and attach your own toolHook handler for the custom ones. The two dispatch paths run side by side — registerTools() only handles tool calls whose name matches one it was given, so custom calls fall through to your handler.
You can also inject pre-built tools mid-conversation using updateAgent:
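A hedged sketch of such an update; the payload shape is an assumption that mirrors the update_tools operation described under Mid-conversation Updates, and a pre-built tool's schema would be included the same way as the hypothetical one shown here:

```json
{
  "type": "update_tools",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    }
  ]
}
```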
MCP Server Integration
Instead of (or in addition to) defining tools inline, you can connect to external MCP servers. The agent verb connects to each server at startup via SSE or Streamable HTTP transport, discovers available tools, and makes them callable by the LLM.
A caller can simply ask “what football matches are on right now?” and the LLM will use the tools discovered from the MCP server to fetch real-time data — no need to define tool schemas in llmOptions.tools.
If an MCP server requires authentication:
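A sketch of a server entry with bearer-token auth. All property names here are assumptions; only the SSE and Streamable HTTP transports come from the text above:

```json
{
  "verb": "agent",
  "mcpServers": [
    {
      "url": "https://mcp.example.com/sse",
      "transport": "sse",
      "headers": {
        "Authorization": "Bearer YOUR_TOKEN"
      }
    }
  ]
}
```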
Tool dispatch priority: When the LLM requests a tool call, MCP servers are checked first. If the tool name matches one discovered from an MCP server, the call is dispatched there. Otherwise, it falls through to the toolHook webhook. You can use both together.
Mid-conversation Updates
The agent verb supports asynchronous updates while a conversation is in progress, allowing you to change the agent’s behavior, inject context, modify tools, or trigger responses — without interrupting the verb stack.
Updates are sent via WebSocket (session.updateAgent(data)) or REST API.
update_instructions
Replace the LLM system prompt mid-conversation. Useful for persona switching or topic transitions.
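A sketch of the update payload (the type and instructions field names are assumptions):

```json
{
  "type": "update_instructions",
  "instructions": "You are now an escalation specialist. Apologize for the wait and collect the caller's ticket number."
}
```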
inject_context
Append messages to the LLM conversation history. System messages are routed to the system prompt for vendors that don’t support inline system messages (Bedrock, Anthropic, Google).
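A sketch of injecting a CRM note mid-call (payload field names are assumptions; the role/content message shape follows common chat-completion conventions):

```json
{
  "type": "inject_context",
  "messages": [
    {
      "role": "system",
      "content": "CRM note: the caller is a premium-tier customer; waive any change fees."
    }
  ]
}
```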
update_tools
Replace the tool set available to the LLM. The new tools take effect on the next turn.
generate_reply
Prompt the LLM to generate a new response. If the agent verb is idle, the prompt executes immediately. If busy, the request is queued.
Use interrupt: true to cancel the current response and generate immediately — useful for supervisor overrides or urgent notifications.
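A sketch of a supervisor-override payload; interrupt comes from the text above, while the type and prompt field names are assumptions:

```json
{
  "type": "generate_reply",
  "prompt": "Tell the caller that a supervisor has joined and will take over.",
  "interrupt": true
}
```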
Event Handling
The eventHook receives real-time events during the conversation. In WebSocket mode, listen with session.on():
Event Types
turn_end Payload
The turn_end event is the most useful for observability. Example payload:
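The sketch below is illustrative only; every field name is an assumption, shown to convey the kind of data (final transcript, preflight outcome, per-stage latencies in milliseconds) the event carries:

```json
{
  "type": "turn_end",
  "transcript": "what's my account balance",
  "preflight": "hit",
  "latency": {
    "stt_final": 240,
    "llm_first_token": 310,
    "tts_first_byte": 150,
    "total": 700
  }
}
```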
Latency Optimization
The turn_end latency breakdown helps you identify bottlenecks and optimize response time.
Conversation History Summarization
For long conversations that might exceed the LLM’s context window, the agent verb can automatically summarize older turns.
Set the JAMBONES_PIPELINE_SUMMARIZE_TURNS environment variable to control how often summarization runs. Values 1–7 are clamped to 8. Set to 0 to disable (default).
When summarization triggers:
- The LLM generates a concise summary of the older conversation turns
- The summary is appended to the system prompt as a “Conversation context” section
- The summarized turns are dropped from conversation history
- Half the configured number of turns are kept in full fidelity
A history_summarized event is sent to the eventHook:
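An illustrative sketch of the event; the field names are assumptions based on the behavior described above:

```json
{
  "type": "history_summarized",
  "turns_summarized": 8,
  "turns_retained": 4
}
```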
Noise Isolation
The noiseIsolation property enables server-side noise cancellation on call audio, improving STT accuracy in noisy environments.
Two vendors are available:
- `"krisp"` — Krisp’s proprietary noise cancellation. Requires a Krisp API key on self-hosted systems. Listen to audio samples to hear the model in action.
- `"rnnoise"` — Open-source RNNoise-based noise cancellation. No API key required.
Shorthand:
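For example, the string shorthand selects a vendor with default settings:

```json
{ "verb": "agent", "noiseIsolation": "rnnoise" }
```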
Detailed configuration:
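A sketch of the object form (the `vendor` key is an assumption; `level` and `direction` are the options described below):

```json
{
  "verb": "agent",
  "noiseIsolation": {
    "vendor": "krisp",
    "level": 80,
    "direction": "read"
  }
}
```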
- `level` — Suppression level 0–100. Higher values are more aggressive. Default: 100.
- `direction` — `"read"` filters caller audio (default); `"write"` filters outbound audio.
Error Recovery
The agent verb handles errors gracefully to keep the conversation going:
- LLM errors with tools — If the LLM fails and tools were included, the agent verb retries the same prompt without tools. This handles models that don’t support tool use in certain configurations.
- Speculative preflight errors — When a speculative prompt fails, the preflight is discarded and a fresh prompt is issued normally.
- Recovery to idle — On unrecoverable LLM errors, the agent verb ends the turn and transitions to idle so the user can continue speaking. The no-response timer is not started after an error to avoid retry loops.
- STT reconnection — The agent verb automatically reconnects the STT stream if the connection drops.
When the agent verb encounters an unrecoverable error, it invokes the actionHook with a completion_reason indicating the failure.
Example Applications
The agent examples repository contains runnable demos for each feature: