Voice Agents
Build conversational AI agents using the agent verb.
The agent verb is an experimental feature and requires jambonz version 10.1.1 or above.
The agent verb orchestrates a complete voice AI agent by wiring together three separate components — STT, LLM, and TTS — with integrated turn detection. Unlike the llm verb (which connects to speech-to-speech APIs where a single vendor handles everything), the agent verb lets you mix and match: for example, Deepgram for STT, Anthropic for the LLM, and Cartesia for TTS.
The agent verb manages the full conversational turn cycle:
Looking for runnable examples? The jambonz/v10-examples repository has working demos for every feature described in this guide — basic usage, tool calling, MCP servers, CRM injection, persona switching, supervisor overrides, and more. Clone it and run any example end-to-end in minutes.
The llm property is the only required field. STT and TTS will use your application’s default speech credentials if not specified.
Below is a minimal voice agent using the Node.js SDK and the application defaults for STT and TTS.
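As a rough sketch of the verb payload such an application sends, the plain object below sets only llm and leaves STT and TTS to the application defaults. The vendor id string, the model key, and the model name are placeholder assumptions, not values confirmed by this guide; check the agent verb reference for the real ones.

```javascript
// Sketch of a minimal agent verb payload.
// Only `llm` is provided; `stt` and `tts` are omitted on purpose so the
// application's default speech credentials are used.
// ASSUMPTION: the `model` key and both string values are placeholders.
function minimalAgentVerb(vendor, model) {
  return {
    verb: 'agent',
    llm: { vendor, model },
  };
}

const verb = minimalAgentVerb('anthropic', 'example-model-name');

// A jambonz application responds with an array of verbs.
console.log(JSON.stringify([verb], null, 2));
```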
The jambonz portal lets you “bring your own LLM” in a similar fashion to speech credentials. Configure credentials in Account → LLM Services, then reference the vendor in the agent verb.
11 LLM vendors are supported: Anthropic, AWS Bedrock, Azure OpenAI, Baseten, DeepSeek, Google AI Studio, Groq, HuggingFace, OpenAI, and Vertex AI (Gemini and Partner Models). See Bring Your Own LLM for per-vendor setup, model recommendations, and known issues, or the agent verb reference for the full vendor id + example-models table.
The agent verb normalizes message formats and tool schemas across vendors automatically. You write tools in OpenAI format and the agent verb adapts them for each vendor.
By default, the agent verb uses speech credentials configured in the jambonz portal. You can also pass credentials directly:
For AWS Bedrock, pass accessKeyId, secretAccessKey, and region in the auth object.
The turnDetection property controls how the agent verb decides the user has finished speaking.
We currently support only two modes — STT-based detection and Krisp’s turn detection model.
Uses the STT vendor’s native end-of-utterance signal. For most vendors this is silence-based. Some vendors have smarter built-in turn detection:
u3-rt-pro model
These vendors always use their native detection regardless of the turnDetection setting.
Uses the Krisp acoustic end-of-turn model, which analyzes speech patterns rather than just silence. Good for natural conversation where users pause mid-thought.
- threshold: Confidence threshold from 0.0 to 1.0. Lower values trigger earlier turn transitions (more aggressive). Default: 0.5.
- model: Optional Krisp model name override.
The shorthand "turnDetection": "krisp" uses default settings.
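To make the threshold semantics concrete, here is a small sketch that expands the "krisp" shorthand into its default settings and shows why a lower threshold ends turns sooner. The helper names are hypothetical, and the comparison rule (a turn ends once the model's confidence reaches the threshold) is an assumption about how the threshold is applied, not a statement of the Krisp implementation.

```javascript
// Expand the "krisp" shorthand (or a detailed object) into explicit settings,
// applying the documented default threshold of 0.5.
function krispSettings(turnDetection) {
  const opts = turnDetection === 'krisp' ? {} : turnDetection;
  const threshold = opts.threshold ?? 0.5;
  if (threshold < 0 || threshold > 1) {
    throw new RangeError('threshold must be between 0.0 and 1.0');
  }
  return opts.model ? { threshold, model: opts.model } : { threshold };
}

// ASSUMPTION: end-of-turn is declared when confidence >= threshold.
function isEndOfTurn(confidence, threshold) {
  return confidence >= threshold;
}

const defaults = krispSettings('krisp');
const aggressive = krispSettings({ threshold: 0.3 });

// The same 0.4 confidence ends the turn only under the lower threshold.
console.log(isEndOfTurn(0.4, defaults.threshold));   // false
console.log(isEndOfTurn(0.4, aggressive.threshold)); // true
```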
You must have a Krisp API key configured in order to use Krisp turn detection on a self-hosted jambonz system. Contact support@jambonz.org for details.
Early generation speculatively sends the transcript to the LLM before end-of-turn is confirmed. If the transcript matches when the turn does end, buffered tokens are released immediately — shaving off the LLM prompt time. If the user keeps talking and the transcript changes, the speculative response is discarded.
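The outcome of a speculative request can be modeled with a short sketch. The function and field names are hypothetical, and comparing transcripts by exact string equality is an assumption; the real matching rule may be more lenient. The three labels are the preflight outcomes reported in turn_end.

```javascript
// Classify a speculative (preflight) LLM request once the turn has ended.
// ASSUMPTION: transcripts are compared with strict equality.
function resolvePreflight({ speculativeTranscript, finalTranscript, completed }) {
  if (speculativeTranscript !== finalTranscript) {
    return 'miss'; // the user kept talking, so the response is discarded
  }
  // Transcript matched: tokens are released if the request already finished.
  return completed ? 'hit' : 'pending';
}

// Fraction of turns where the speculative response could be used directly.
function hitRate(outcomes) {
  if (outcomes.length === 0) return 0;
  const hits = outcomes.filter((o) => o === 'hit').length;
  return hits / outcomes.length;
}

const outcomes = [
  resolvePreflight({ speculativeTranscript: 'book a table', finalTranscript: 'book a table', completed: true }),
  resolvePreflight({ speculativeTranscript: 'book a', finalTranscript: 'book a table', completed: true }),
  resolvePreflight({ speculativeTranscript: 'hello', finalTranscript: 'hello', completed: false }),
  resolvePreflight({ speculativeTranscript: 'yes', finalTranscript: 'yes', completed: true }),
];
console.log(outcomes, hitRate(outcomes));
```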
There are two ways early generation is triggered:
- earlyGeneration: true to opt in. Krisp emits an early signal that triggers the speculative LLM prompt before final end-of-turn confirmation.
- EagerEndOfTurn event that triggers preflight regardless of the earlyGeneration setting.
For other STT vendors with native turn-taking (assemblyai, speechmatics), early generation is not available.
The turn_end event includes preflight metrics so you can track hit rates:
- hit: speculative transcript matched final, tokens released immediately
- miss: transcript changed, speculative response discarded
- pending: preflight was still in progress when the turn ended
By default, users can interrupt the assistant while it’s speaking. The bargeIn object controls this behavior:
- enable: Allow interruptions. Default: true.
- minSpeechDuration: Seconds of speech required to confirm an interruption. Prevents brief noises (coughs, background sounds) from cutting off the assistant. Default: 0.5.
- sticky: If true, once the user interrupts, the assistant does not resume speaking the interrupted response. Default: false.
Tuning tips:
- Lower minSpeechDuration (e.g., 0.2) for more responsive barge-in
- Raise minSpeechDuration (e.g., 1.0) for noisy environments where false triggers are common
- Set enable: false for scenarios where the assistant must complete its message (e.g., legal disclaimers)
The noResponseTimeout property handles the case where the user goes silent after the assistant finishes speaking.
When the timeout fires, the LLM is prompted with a system cue: “The user has not responded. Briefly check if they are still there or ask if they need help.” This generates a natural follow-up rather than leaving dead air.
Defaults to 12 seconds. Set to 0 to disable. The timer is cancelled if the user starts speaking.
This also covers the “missed speech” case: when VAD detects speech but STT returns no transcript, the no-response timer handles the re-prompt.
By default (greeting: true), the agent verb prompts the LLM to generate an initial greeting before the user speaks. Set greeting: false if you want the agent to wait silently for the user to speak first.
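The conversational settings covered so far (bargeIn, noResponseTimeout, greeting) and their documented defaults can be summarized in one sketch. The withDefaults helper is hypothetical and exists only to show which values apply when a property is left out; the agent verb applies its own defaults internally.

```javascript
// Documented defaults for the conversational settings.
const DEFAULTS = {
  bargeIn: { enable: true, minSpeechDuration: 0.5, sticky: false },
  noResponseTimeout: 12, // seconds; 0 disables the timer
  greeting: true,
};

// Hypothetical helper: merge caller options over the defaults so the
// effective configuration is visible in one place.
function withDefaults(opts = {}) {
  return {
    ...DEFAULTS,
    ...opts,
    bargeIn: { ...DEFAULTS.bargeIn, ...(opts.bargeIn ?? {}) },
  };
}

// A noisy call-center line: require a full second of speech to interrupt.
const noisyLine = withDefaults({ bargeIn: { minSpeechDuration: 1.0 } });

// A legal disclaimer: no interruptions, no re-prompt, caller speaks first.
const disclaimer = withDefaults({
  bargeIn: { enable: false },
  noResponseTimeout: 0,
  greeting: false,
});

console.log(noisyLine, disclaimer);
```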
The agent verb supports LLM tool/function calling, allowing your agent to perform actions like looking up data, calling APIs, or transferring calls. There are two ways to provide tools:
- Custom tools: define the schema in llmOptions.tools, and handle the tool call yourself in a toolHook handler. Use this for tools specific to your application (CRM lookups, business logic, proprietary APIs).
- @jambonz/tools: drop in ready-made tools (web search, weather, Wikipedia, calculator, datetime) without writing schemas or handlers. Use this for common utility tools.
You can mix both approaches in the same agent; they share the same toolHook path.
Use this approach for tools that are specific to your application. You supply the schema and handle the execution yourself.
Define tools in llm.llmOptions.tools using the standard function-calling format:
The agent verb normalizes tool schemas across LLM vendors. You always define tools in the same format regardless of whether you’re using OpenAI, Anthropic, Google, or Bedrock.
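As an illustration, a single tool definition might look like the sketch below. It assumes the OpenAI Chat Completions tool shape (type: 'function' wrapping a name, description, and JSON Schema parameters); the lookup_order tool, its parameter, the vendor id, and the model name are hypothetical.

```javascript
// A hypothetical custom tool in OpenAI function-calling format.
const lookupOrderTool = {
  type: 'function',
  function: {
    name: 'lookup_order',
    description: 'Look up the status of a customer order by its id.',
    parameters: {
      type: 'object',
      properties: {
        order_id: { type: 'string', description: 'The order identifier' },
      },
      required: ['order_id'],
    },
  },
};

// The tool list lives under llm.llmOptions.tools.
// ASSUMPTION: vendor id and model name are placeholders.
const llm = {
  vendor: 'openai',
  model: 'example-model-name',
  llmOptions: { tools: [lookupOrderTool] },
};

console.log(JSON.stringify(llm, null, 2));
```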
Tool calls arrive as events on the toolHook path with tool_call_id, name, and arguments (already parsed as an object). Respond with session.sendToolOutput():
In webhook mode, the tool call arrives as an HTTP POST to the toolHook URL. Return the tool result as JSON in the response body.
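For the WebSocket path, a handler could be structured like the sketch below. It uses a stand-in session object so it runs on its own. The argument order of sendToolOutput (tool call id first, then the result) is an assumption, and the lookup_order tool and order data are hypothetical.

```javascript
// Hypothetical application data.
const orders = new Map([['A100', { status: 'shipped' }]]);

// Execute a tool by name; arguments arrive already parsed as an object.
function runTool(name, args) {
  if (name === 'lookup_order') {
    return orders.get(args.order_id) ?? { error: 'order not found' };
  }
  return { error: `unknown tool: ${name}` };
}

// ASSUMPTION: sendToolOutput(tool_call_id, result) is the call signature.
function handleToolCall(session, evt) {
  const { tool_call_id, name, arguments: args } = evt;
  session.sendToolOutput(tool_call_id, runTool(name, args));
}

// Stand-in session that records what would be sent back to the agent.
const sent = [];
const fakeSession = {
  sendToolOutput: (id, data) => sent.push({ id, data }),
};

handleToolCall(fakeSession, {
  tool_call_id: 'call_1',
  name: 'lookup_order',
  arguments: { order_id: 'A100' },
});
console.log(sent);
```

Returning an error object for unknown tools, rather than throwing, gives the LLM something it can explain to the caller.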
@jambonz/tools
For common utility tools (web search, weather, Wikipedia, calculator, datetime), the @jambonz/tools package lets you skip the schema definition and handler code entirely. Each tool bundles a JSON Schema (for the LLM) and an execute() function (for your application) that are wired into your session with a single call.
@jambonz/tools is open source (MIT-licensed) and we actively welcome community contributions. If you’ve built a useful tool — a CRM lookup, a scheduling integration, a knowledge-base query — please consider opening a PR so other jambonz developers can use it. See the contributing guidelines in the repo README.
Available tools:
registerTools() wires the tools into your session — it listens on the toolHook path, dispatches each incoming tool call to the matching execute() function, and sends the result back via sendToolOutput():
registerTools() also accepts a logger option and returns errors to the LLM if a tool throws or is called with an unknown name.
You can mix pre-built tools from @jambonz/tools with your own custom tools in the same agent. Include the schemas from both in llmOptions.tools, use registerTools() for the pre-built ones, and attach your own toolHook handler for the custom ones. The two dispatch paths run side by side — registerTools() only handles tool calls whose name matches one it was given, so custom calls fall through to your handler.
You can also inject pre-built tools mid-conversation using updateAgent:
Instead of (or in addition to) defining tools inline, you can connect to external MCP servers. The agent verb connects to each server at startup via SSE or Streamable HTTP transport, discovers available tools, and makes them callable by the LLM.
A caller can simply ask “what football matches are on right now?” and the LLM will use the tools discovered from the MCP server to fetch real-time data — no need to define tool schemas in llmOptions.tools.
If an MCP server requires authentication:
Tool dispatch priority: When the LLM requests a tool call, MCP servers are checked first. If the tool name matches one discovered from an MCP server, the call is dispatched there. Otherwise, it falls through to the toolHook webhook. You can use both together.
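The priority rule can be expressed as a small routing function. This is a model of the documented behavior, not the agent verb's internal code; the function name and the example tool names are hypothetical.

```javascript
// Decide where a tool call goes: MCP servers are checked first,
// everything else falls through to the toolHook.
function routeToolCall(toolName, mcpToolNames) {
  return mcpToolNames.has(toolName) ? 'mcp' : 'toolHook';
}

// Tool names discovered from MCP servers at startup (hypothetical).
const discovered = new Set(['get_live_matches', 'get_standings']);

console.log(routeToolCall('get_live_matches', discovered)); // 'mcp'
console.log(routeToolCall('lookup_order', discovered));     // 'toolHook'
```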
The agent verb supports asynchronous updates while a conversation is in progress, allowing you to change the agent’s behavior, inject context, modify tools, or trigger responses — without interrupting the verb stack.
Updates are sent via WebSocket (session.updateAgent(data)) or REST API.
Replace the LLM system prompt mid-conversation. Useful for persona switching or topic transitions.
Append messages to the LLM conversation history. System messages are routed to the system prompt for vendors that don’t support inline system messages (Bedrock, Anthropic, Google).
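The routing of system messages can be pictured with the sketch below. It is a simplified model only: the vendor id strings, the state shape, and the choice to append the text to the existing system prompt on a new line are all assumptions.

```javascript
// ASSUMPTION: vendor ids for the vendors without inline system messages.
const NO_INLINE_SYSTEM = new Set(['bedrock', 'anthropic', 'google']);

// Append messages to a conversation, moving system messages into the
// system prompt when the vendor cannot accept them inline.
function appendMessages(state, vendor, newMessages) {
  const next = { systemPrompt: state.systemPrompt, history: [...state.history] };
  for (const msg of newMessages) {
    if (msg.role === 'system' && NO_INLINE_SYSTEM.has(vendor)) {
      next.systemPrompt = `${next.systemPrompt}\n${msg.content}`;
    } else {
      next.history.push(msg);
    }
  }
  return next;
}

const start = { systemPrompt: 'You are a helpful agent.', history: [] };
const crmNote = [{ role: 'system', content: 'Caller is a premium customer.' }];

const onAnthropic = appendMessages(start, 'anthropic', crmNote);
const onOpenAI = appendMessages(start, 'openai', crmNote);
console.log(onAnthropic, onOpenAI);
```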
Replace the tool set available to the LLM. The new tools take effect on the next turn.
Prompt the LLM to generate a new response. If the agent verb is idle, the prompt executes immediately. If busy, the request is queued.
Use interrupt: true to cancel the current response and generate immediately — useful for supervisor overrides or urgent notifications.
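The idle, busy, and interrupt cases can be modeled with a small scheduler. This is an illustrative model rather than the agent verb's implementation; the class is hypothetical, and keeping previously queued prompts in the queue after an interrupt is an assumption.

```javascript
// Minimal model of response scheduling: run when idle, queue when busy,
// and cancel the current response first when interrupt is set.
class ResponseScheduler {
  constructor() {
    this.busy = false;
    this.queue = [];
    this.log = [];
  }

  request(prompt, { interrupt = false } = {}) {
    if (this.busy && interrupt) {
      this.log.push('cancel'); // stop the response in progress
      this.busy = false;
    }
    if (this.busy) {
      this.queue.push(prompt);
      return 'queued';
    }
    this.busy = true;
    this.log.push(`run:${prompt}`);
    return 'started';
  }

  // Called when the current response completes; runs the next queued prompt.
  finish() {
    this.busy = false;
    const next = this.queue.shift();
    if (next !== undefined) this.request(next);
  }
}

const scheduler = new ResponseScheduler();
const results = [
  scheduler.request('greet'),
  scheduler.request('offer upgrade'),
  scheduler.request('supervisor override', { interrupt: true }),
];
scheduler.finish();
console.log(results, scheduler.log);
```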
The eventHook receives real-time events during the conversation. In WebSocket mode, listen with session.on():
The turn_end event is the most useful for observability. Example payload:
The turn_end latency breakdown helps you identify bottlenecks and optimize response time.
For long conversations that might exceed the LLM’s context window, the agent verb can automatically summarize older turns.
Set the JAMBONES_PIPELINE_SUMMARIZE_TURNS environment variable to control how often summarization runs. Values 1–7 are clamped to 8. Set to 0 to disable (default).
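The clamping rule can be written out as a short function. It mirrors the behavior described above and is not the server's actual parsing code; the function name is hypothetical, and treating unset, negative, or non-numeric values as disabled is an assumption.

```javascript
// Interpret JAMBONES_PIPELINE_SUMMARIZE_TURNS:
//   unset or 0 -> 0 (disabled), 1 to 7 -> 8 (clamped), 8 and above -> as given.
// ASSUMPTION: invalid or negative input is treated as disabled.
function summarizeTurns(raw) {
  const n = Number.parseInt(raw ?? '0', 10);
  if (Number.isNaN(n) || n <= 0) return 0;
  return Math.max(n, 8);
}

const effective = summarizeTurns(process.env.JAMBONES_PIPELINE_SUMMARIZE_TURNS);
console.log(effective === 0 ? 'summarization disabled' : `summarize every ${effective} turns`);
```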
When summarization triggers:
A history_summarized event is sent to the eventHook:
The noiseIsolation property enables server-side noise cancellation on call audio, improving STT accuracy in noisy environments.
Two vendors are available:
- "krisp": Krisp’s proprietary noise cancellation. Requires a Krisp API key on self-hosted systems. Listen to audio samples to hear the model in action.
- "rnnoise": Open-source RNNoise-based noise cancellation. No API key required.
Shorthand:
Detailed configuration:
- level: Suppression level 0–100. Higher values are more aggressive. Default: 100.
- direction: "read" filters caller audio (default), "write" filters outbound audio.
The agent verb handles errors gracefully to keep the conversation going:
When the agent verb encounters an unrecoverable error, it invokes the actionHook with a completion_reason indicating the failure.
The agent examples repository contains runnable demos for each feature: