Agent

Orchestrate a complete voice agent by creating a cascaded voice pipeline consisting of your choice of STT, LLM, TTS, VAD and turn-taking components.

The agent verb is an experimental feature in jambonz v10.1.1.

session
  .agent({
    stt: {
      vendor: 'deepgram',
      language: 'multi',
      deepgramOptions: { model: 'nova-3-general' },
    },
    tts: {
      vendor: 'cartesia',
      voice: '9626c31c-bec5-4cca-baa8-f8ba9e84c8bc',
    },
    llm: {
      vendor: 'openai',
      model: 'gpt-4.1-mini',
      llmOptions: {
        messages: [
          { role: 'system', content: 'You are a helpful voice assistant.' },
        ],
        tools: [
          {
            name: 'get_weather',
            description: 'Get current weather for a city',
            parameters: {
              type: 'object',
              properties: {
                city: { type: 'string' },
              },
              required: ['city'],
            },
          },
        ],
      },
    },
    turnDetection: 'krisp',
    earlyGeneration: true,
    bargeIn: { enable: true, minSpeechDuration: 0.5 },
    noResponseTimeout: 12,
    toolHook: '/tool-call',
    eventHook: '/agent-event',
    actionHook: '/agent-complete',
  })
  .send();

The agent verb wires together three components — STT, LLM, and TTS — with integrated turn detection to create a complete voice AI agent. Unlike the llm verb (which connects to speech-to-speech APIs like OpenAI Realtime), the agent verb lets you mix and match separate vendors for each component.

The agent verb manages the full conversational turn cycle:

  1. User speaks → STT produces a transcript
  2. Turn detection decides the user is done speaking
  3. Transcript is sent to the LLM
  4. LLM response tokens stream to TTS
  5. TTS audio plays back to the caller
  6. If the user barges in, TTS stops and a new turn begins
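The turn cycle above can be pictured as a small state machine. The sketch below is a conceptual model only, not jambonz source code: the state names (listening, thinking, speaking) and event names are made up for illustration.

```javascript
// Conceptual model of the turn cycle. State and event names are
// hypothetical and do not correspond to jambonz internals.
const TRANSITIONS = {
  listening: { end_of_turn: 'thinking' },        // steps 1-3
  thinking: { first_token: 'speaking' },         // step 4
  speaking: {
    playback_done: 'listening',                  // step 5
    barge_in: 'listening',                       // step 6
  },
};

// Returns the next state, ignoring events that do not apply
// in the current state.
function nextState(state, event) {
  const next = TRANSITIONS[state]?.[event];
  return next ?? state;
}

// Folds a sequence of events into a final state.
function runTurn(events, start = 'listening') {
  return events.reduce((state, event) => nextState(state, event), start);
}
```

In this model a barge-in while speaking returns the agent to listening, which mirrors step 6: TTS stops and a new turn begins.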

For a comprehensive guide on building voice AI agents with the agent verb, see Voice Agents.

Parameters

actionHook
string

A webhook invoked when the agent verb ends. The payload includes a completion_reason field indicating why it terminated.

bargeIn
object

Controls whether and how the user can interrupt the assistant while it is speaking.

bargeIn.enable
boolean, defaults to true

Allow the user to interrupt the assistant while it is speaking.

bargeIn.minSpeechDuration
number, defaults to 0.5

Seconds of detected speech required before confirming an interruption. Prevents brief noises from cutting off the assistant.

bargeIn.sticky
boolean, defaults to false

If true, once the user interrupts, the assistant does not resume speaking the interrupted response.
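To make the interaction between bargeIn.enable and bargeIn.minSpeechDuration concrete, here is a minimal sketch of the decision they describe. The helper shouldInterrupt is hypothetical (it is not a jambonz API), and treating the threshold as inclusive is an assumption.

```javascript
// Hypothetical helper, not part of jambonz: decides whether a burst of
// detected speech should count as an interruption.
function shouldInterrupt(speechSeconds, bargeIn = {}) {
  // Defaults mirror the documented defaults for the bargeIn object.
  const { enable = true, minSpeechDuration = 0.5 } = bargeIn;
  if (!enable) return false;
  // Assumption: speech lasting exactly minSpeechDuration is enough.
  return speechSeconds >= minSpeechDuration;
}
```

With the defaults, a 0.2 second cough would be ignored while 0.6 seconds of speech would cut off the assistant. Raising minSpeechDuration makes the agent harder to interrupt.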

earlyGeneration
boolean, defaults to false

Enable speculative LLM prompting before end-of-turn is confirmed. When using Krisp turn detection, set this to true to speculatively prompt the LLM before Krisp confirms the turn has ended. If the transcript matches when the turn ends, buffered tokens are released immediately — reducing response latency. Deepgram Flux performs early generation automatically regardless of this setting.
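The buffer-then-release behavior can be sketched as follows. This is an illustration of the idea rather than the jambonz implementation: the SpeculativeBuffer class and its method names are hypothetical, and comparing transcripts by exact string equality is an assumption. The hit and miss labels match the preflight result values reported in the turn_end event.

```javascript
// Conceptual sketch only. Class and method names are hypothetical.
class SpeculativeBuffer {
  constructor() {
    this.transcript = null;
    this.tokens = [];
  }

  // Called when the LLM is prompted before end-of-turn is confirmed.
  start(transcript) {
    this.transcript = transcript;
    this.tokens = [];
  }

  // Called for each token produced by the speculative request.
  push(token) {
    this.tokens.push(token);
  }

  // Called when end-of-turn is confirmed with the final transcript.
  // On a match the buffered tokens are released; otherwise they are dropped.
  resolve(finalTranscript) {
    const hit = this.transcript !== null && this.transcript === finalTranscript;
    const released = hit ? this.tokens : [];
    this.transcript = null;
    this.tokens = [];
    return { result: hit ? 'hit' : 'miss', tokens: released };
  }
}
```

On a miss the speculative tokens are discarded and the LLM would be prompted again with the final transcript, so a wrong guess costs nothing but the wasted request.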

eventHook
string

A webhook invoked for agent events. Receives event types: user_transcript, agent_response, user_interruption, turn_end, and history_summarized. See eventHook Events below.

greeting
boolean, defaults to true

Whether the LLM should generate an initial greeting before the user speaks. Set to false if you want the agent to wait silently for the user to speak first.

id
string

An optional unique identifier for this verb instance.

llm
object, required

LLM configuration. This is the only required property. See Bring Your Own LLM for the full list of supported vendors and per-vendor setup.

llm.vendor
string, required

LLM vendor id. One of: openai, anthropic, google, vertex-gemini, vertex-openai, bedrock, deepseek, azure-openai, groq, huggingface. See Bring Your Own LLM for setup of each.

llm.model
string, required

Model name. Format is vendor-specific (e.g., gpt-4o-mini, claude-haiku-4-5-20251001, gemini-2.5-flash, llama-3.3-70b-versatile, meta-llama/Llama-3.3-70B-Instruct). For Azure OpenAI, this is your deployment name, not the model id.

llm.label
string

Only needed when the account has more than one LLM credential for the same vendor — give each a distinct label in the portal at create time and reference it explicitly here. Most accounts have a single credential per vendor and should leave this unset.

llm.auth
object

Inline authentication credentials. Format depends on the vendor — see the per-vendor pages under Bring Your Own LLM. If not provided, jambonz looks up credentials by vendor (and label, if the rare multi-credential case applies) from the LLM Services configured in the portal.

llm.llmOptions
object

LLM options including system prompt, tools, and generation parameters.

llm.llmOptions.messages
array

Initial conversation messages. Typically includes a system message with instructions: [{ role: 'system', content: '...' }].

llm.llmOptions.systemPrompt
string

System prompt for the LLM. Alternative to providing a system message in messages.

llm.llmOptions.tools
array

Tool/function definitions available to the LLM. Uses OpenAI function-calling format. See Tool Calling below.

llm.llmOptions.maxTokens
number

Maximum number of tokens in the LLM response.

mcpServers
array

External MCP servers that provide tools to the LLM. The agent verb connects to each server at startup, discovers available tools, and makes them callable by the LLM. Each entry requires a url property and optionally auth and roots.

noiseIsolation
string or object

Enable server-side noise isolation to reduce background noise. As a string, pass "krisp" or "rnnoise" for default settings. As an object: { mode: "krisp", level: 80, direction: "read" }. direction can be "read" (filter caller audio, default) or "write" (filter outbound audio). Krisp requires an API key on self-hosted systems.
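Because noiseIsolation accepts either a string or an object, application code that builds the value dynamically may want to normalize it first. The helper below is a hypothetical client-side convenience, not a jambonz API. It only applies the documented default for direction and leaves level unset so the server-side default is used.

```javascript
// Hypothetical helper: converts either accepted form of noiseIsolation
// into the object form.
function normalizeNoiseIsolation(value) {
  const modes = ['krisp', 'rnnoise'];
  const config = typeof value === 'string' ? { mode: value } : { ...value };
  if (!modes.includes(config.mode)) {
    throw new Error(`Unsupported noise isolation mode: ${config.mode}`);
  }
  // "read" (filter caller audio) is the documented default direction.
  return { direction: 'read', ...config };
}
```

For example, normalizeNoiseIsolation('krisp') yields an object with mode "krisp" and direction "read", while an object that already sets direction: "write" keeps its own value.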

noResponseTimeout
number, defaults to 12

Seconds to wait after the assistant finishes speaking before prompting the user to respond. When triggered, the LLM is prompted with a system cue to check if the user is still there. Set to 0 to disable.

stt
object

Speech-to-text configuration. See recognizer for available properties. Key properties: vendor, language, hints, and vendor-specific options like deepgramOptions.

toolHook
string

A webhook invoked when the LLM requests a tool/function call. The payload includes tool_call_id, name, and arguments (already parsed as an object). See Tool Calling below.

tts
object

Text-to-speech configuration. See synthesizer for available properties. Key properties: vendor, voice, language, and vendor-specific options.

turnDetection
string or object, defaults to "stt"

Turn detection strategy. Controls when the agent verb decides the user has finished speaking.

As a string:

  • "stt" — Uses the STT vendor’s native end-of-utterance signal. For most vendors this is silence-based. Vendors with smarter built-in turn detection (deepgramflux, assemblyai, speechmatics) always use their native detection regardless of this setting.
  • "krisp" — Uses the Krisp acoustic end-of-turn model, which analyzes speech patterns rather than just silence.

As an object (Krisp only):

  • mode: "krisp" (required)
  • threshold: Confidence threshold 0.0–1.0. Lower values trigger earlier turn transitions. Default: 0.5.
  • model: Optional Krisp model name override.
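As an illustration of the object form, the sketch below builds a Krisp turnDetection value and checks that the threshold falls in the documented range. The function krispTurnDetection is a hypothetical helper for your own application, not something jambonz provides.

```javascript
// Hypothetical helper: builds the object form of turnDetection for Krisp.
function krispTurnDetection(threshold = 0.5) {
  if (!Number.isFinite(threshold) || threshold < 0 || threshold > 1) {
    throw new RangeError(`threshold must be between 0.0 and 1.0, got ${threshold}`);
  }
  return { mode: 'krisp', threshold };
}

// A value below the 0.5 default should end turns a little sooner.
const eagerTurnDetection = krispTurnDetection(0.4);
```

The result can be passed as the turnDetection property of the agent verb in place of the string "krisp".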

Tool Calling

Define tools in llm.llmOptions.tools and handle calls via toolHook. The tool call payload includes tool_call_id, name, and arguments (already parsed — an object, not a JSON string).

In WebSocket mode, respond with session.sendToolOutput(tool_call_id, result):

session.on('/tool-call', async (evt) => {
  const { tool_call_id, name, arguments: args } = evt;
  if (name === 'get_weather') {
    // fetchWeather is a placeholder for your own application code
    const weather = await fetchWeather(args.city);
    session.sendToolOutput(tool_call_id, weather);
    return;
  }
  // Always answer the call, even for tools you do not recognize
  session.sendToolOutput(tool_call_id, `Unknown tool: ${name}`);
});

In webhook mode, return the tool result as JSON in the HTTP response body.
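A webhook-mode handler might look like the following sketch, which uses only Node's built-in http module. Several details are assumptions: the /tool-call path is taken from the example toolHook above, fetchWeather is a stub with made-up data, and the exact shape of the JSON result is up to your tool since this page only says the result is returned as JSON.

```javascript
const http = require('node:http');

// Stub standing in for a real weather lookup. The returned value is
// placeholder data, not real output.
async function fetchWeather(city) {
  return { city, temperature_f: 52 };
}

// Maps a toolHook payload (name plus already-parsed arguments) to a result.
async function runTool({ name, arguments: args }) {
  if (name === 'get_weather') return fetchWeather(args.city);
  return { error: `Unknown tool: ${name}` };
}

const server = http.createServer((req, res) => {
  if (req.method !== 'POST' || req.url !== '/tool-call') {
    res.writeHead(404).end();
    return;
  }
  let body = '';
  req.on('data', (chunk) => { body += chunk; });
  req.on('end', async () => {
    try {
      const result = await runTool(JSON.parse(body));
      // The tool result goes back as JSON in the HTTP response body.
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(result));
    } catch (err) {
      res.writeHead(500, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: err.message }));
    }
  });
});

// server.listen(3000); // uncomment to start serving requests
```

Keeping runTool separate from the HTTP plumbing means the same function could also back a WebSocket handler that calls session.sendToolOutput.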

Alternatively, connect external MCP servers to provide tools automatically without defining them inline.

MCP Servers

Instead of (or in addition to) defining tools inline, you can connect to external MCP servers. The agent verb connects to each server at startup via SSE or Streamable HTTP transport, discovers available tools, and makes them callable by the LLM.

session
  .agent({
    llm: {
      vendor: 'openai',
      model: 'gpt-4.1',
      llmOptions: {
        messages: [{ role: 'system', content: 'You are a sports assistant.' }],
      },
    },
    stt: { vendor: 'deepgram', language: 'en-US' },
    tts: { vendor: 'cartesia', voice: 'sonic-english' },
    mcpServers: [
      { url: 'https://livescoremcp.com/sse' },
    ],
    actionHook: '/agent-complete',
  })
  .send();

When the LLM requests a tool call, the agent verb checks MCP servers first. If the tool name matches one discovered from an MCP server, the call is dispatched there directly. If no MCP server provides the tool, it falls through to the toolHook.
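The dispatch order described above can be sketched as a small routing function. This is a conceptual illustration, not jambonz code: routeToolCall and the tool name get_live_scores are hypothetical, and the "none" outcome for a call that matches nothing is an assumption, since this page does not say what happens in that case.

```javascript
// Conceptual sketch of the dispatch order. mcpToolsByServer maps each
// MCP server URL to the tool names discovered from it at startup.
function routeToolCall(toolName, mcpToolsByServer, toolHook) {
  // MCP servers are checked first.
  for (const [serverUrl, toolNames] of Object.entries(mcpToolsByServer)) {
    if (toolNames.includes(toolName)) return { target: 'mcp', serverUrl };
  }
  // No MCP server provides the tool, so fall through to the toolHook.
  if (toolHook) return { target: 'toolHook', hook: toolHook };
  // Assumption: nothing is available to handle the call.
  return { target: 'none' };
}
```

One consequence of this ordering is that an MCP tool shadows an inline tool of the same name, so it is worth keeping tool names distinct across the two sources.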

eventHook Events

The eventHook receives real-time events during the conversation. In WebSocket mode, listen with session.on('/your-event-hook', handler).

turn_end

Sent at the end of each conversational turn. The most useful event for observability.

{
  "type": "turn_end",
  "transcript": "What's the weather in Portland?",
  "confidence": 0.998,
  "response": "The current temperature in Portland is 52°F with wind speed 12 km/h.",
  "interrupted": false,
  "latency": {
    "stt_ms": 320,
    "eot_ms": 180,
    "llm_ms": 890,
    "tool_ms": 420,
    "tts_ms": 210,
    "preflight": {
      "result": "hit",
      "tokens": 12
    }
  },
  "tool_calls": [
    { "name": "get_weather", "rtt_ms": 420 }
  ]
}

Latency fields (all in milliseconds):

  • stt_ms — STT processing time (user stops talking → final transcript received)
  • eot_ms — Additional wait for end-of-turn detection after transcript
  • llm_ms — Pure LLM thinking time (tool RTT subtracted)
  • tool_ms — Total time spent in tool calls
  • tts_ms — TTS engine latency (text sent → first audio received)
  • preflight — Early generation metrics: result (hit, miss, or pending) and tokens buffered on a hit
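For dashboards or logs, the latency fields can be condensed into a per-turn summary. The sketch below is one hypothetical way to do that. Summing the stages assumes they run back to back, which may overstate the real delay when early generation overlaps LLM work with end-of-turn detection.

```javascript
const STAGES = ['stt_ms', 'eot_ms', 'llm_ms', 'tool_ms', 'tts_ms'];

// Hypothetical helper: condenses the latency object of a turn_end event.
function summarizeLatency(turnEnd) {
  const latency = turnEnd.latency ?? {};
  // Treat a missing stage as 0 ms.
  const values = STAGES.map((stage) => [stage, latency[stage] ?? 0]);
  const total_ms = values.reduce((sum, [, ms]) => sum + ms, 0);
  // Pick the stage with the largest value.
  const [slowest] = values.reduce((max, cur) => (cur[1] > max[1] ? cur : max));
  return { total_ms, slowest, preflight: latency.preflight?.result ?? null };
}
```

Applied to the sample turn_end event above, this reports a total of 2020 ms with llm_ms as the slowest stage and a preflight result of "hit".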

user_transcript

Sent when the user’s final transcript is available.

{
  "type": "user_transcript",
  "transcript": "What's the weather in Portland?"
}

agent_response

Sent when the LLM finishes generating its response.

{
  "type": "agent_response",
  "response": "The current temperature in Portland is 52°F."
}

user_interruption

Sent when the user barges in while the assistant is speaking.

{
  "type": "user_interruption"
}

history_summarized

Sent when conversation history summarization completes (requires the JAMBONES_PIPELINE_SUMMARIZE_TURNS environment variable).

{
  "type": "history_summarized",
  "turn": 8,
  "messages_dropped": 5,
  "messages_kept": 6,
  "summary": "The user is a software developer looking for a MacBook Pro..."
}
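Since every event carries a type field, a single eventHook handler can branch on it. The following sketch turns each event into a log line. The describeEvent function is a hypothetical helper, and the field names it reads are the ones shown in the sample payloads above.

```javascript
// Hypothetical helper: turns an eventHook payload into a one-line log entry.
function describeEvent(evt) {
  switch (evt.type) {
    case 'user_transcript':
      return `user: ${evt.transcript}`;
    case 'agent_response':
      return `agent: ${evt.response}`;
    case 'user_interruption':
      return 'user interrupted the agent';
    case 'turn_end':
      return `turn ended (interrupted: ${evt.interrupted})`;
    case 'history_summarized':
      return `history summarized at turn ${evt.turn}, ${evt.messages_dropped} messages dropped`;
    default:
      // Unknown types are reported rather than dropped silently.
      return `unhandled event type: ${evt.type}`;
  }
}

// In WebSocket mode this could be wired up as:
//   session.on('/agent-event', (evt) => console.log(describeEvent(evt)));
```

The default branch keeps the handler from breaking if further event types are added later.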

Mid-conversation Updates

The agent verb supports asynchronous updates while a conversation is in progress. Send updates over WebSocket with session.updateAgent(data) or through the REST API.

update_instructions

Replace the LLM system prompt mid-conversation.

session.updateAgent({
  type: 'update_instructions',
  instructions: 'You are now a billing support agent.',
});

inject_context

Append messages to the LLM conversation history. System messages are routed to the system prompt for vendors that don’t support inline system messages.

session.updateAgent({
  type: 'inject_context',
  messages: [
    { role: 'user', content: 'CRM context: Customer name: Sarah Mitchell. Account tier: Gold.' },
  ],
});
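In practice the injected message is usually built from a record fetched at call time. Here is a minimal sketch, assuming a flat customer object whose keys are human-readable labels. The buildCrmContext helper is hypothetical and the "CRM context:" prefix simply follows the example above.

```javascript
// Hypothetical helper: builds an inject_context update from a flat record.
function buildCrmContext(customer) {
  // Each entry becomes a short "label: value." sentence.
  const facts = Object.entries(customer).map(([key, value]) => `${key}: ${value}.`);
  return {
    type: 'inject_context',
    messages: [{ role: 'user', content: `CRM context: ${facts.join(' ')}` }],
  };
}

// Usage in WebSocket mode:
//   session.updateAgent(buildCrmContext({ 'Customer name': 'Sarah Mitchell', 'Account tier': 'Gold' }));
```

Called with the record in the usage comment, this produces the same message content as the hand-written example.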

update_tools

Replace the tool set available to the LLM.

session.updateAgent({
  type: 'update_tools',
  tools: [
    {
      name: 'transfer_call',
      description: 'Transfer the caller to a specialist',
      parameters: { type: 'object', properties: { department: { type: 'string' } } },
    },
  ],
});

generate_reply

Prompt the LLM to generate a new response. Use interrupt: true to cancel the current response and generate immediately.

session.updateAgent({
  type: 'generate_reply',
  interrupt: true,
  user_input: 'URGENT: Tell the customer about the flash sale.',
});

Supported LLM Vendors

| Vendor | llm.vendor | Example models |
| --- | --- | --- |
| OpenAI | openai | gpt-5.4-mini, gpt-5.4, gpt-5.5, gpt-4o-mini |
| Anthropic | anthropic | claude-haiku-4-5-20251001, claude-sonnet-4-6 |
| Google AI Studio | google | gemini-2.5-flash, gemini-2.5-pro |
| Vertex AI — Gemini | vertex-gemini | gemini-2.5-flash, gemini-2.5-pro |
| Vertex AI — Partner Models | vertex-openai | meta/llama-3.3-70b-instruct-maas, mistral-large |
| AWS Bedrock | bedrock | amazon.nova-micro-v1:0, us.anthropic.claude-haiku-4-5-20251001-v1:0 |
| DeepSeek | deepseek | deepseek-v4-flash, deepseek-v4-pro |
| Azure OpenAI | azure-openai | your deployment name |
| Groq | groq | llama-3.3-70b-versatile, llama-3.1-8b-instant |
| HuggingFace | huggingface | meta-llama/Llama-3.3-70B-Instruct, …:fastest |

See Bring Your Own LLM for per-vendor credential setup, model recommendations, and known issues.
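If your application assembles the llm object from user or tenant settings, a quick pre-flight check can catch typos before the verb is sent. The sketch below is a hypothetical client-side helper. It only checks the two required properties against the vendor ids listed on this page and does not validate model names.

```javascript
// Vendor ids as listed on this page.
const LLM_VENDORS = [
  'openai', 'anthropic', 'google', 'vertex-gemini', 'vertex-openai',
  'bedrock', 'deepseek', 'azure-openai', 'groq', 'huggingface',
];

// Hypothetical helper: returns a list of problems, empty when the
// required llm.vendor and llm.model properties look usable.
function checkLlmConfig(llm) {
  const problems = [];
  if (!LLM_VENDORS.includes(llm?.vendor)) {
    problems.push(`unknown llm.vendor: ${llm?.vendor}`);
  }
  if (typeof llm?.model !== 'string' || llm.model.length === 0) {
    problems.push('llm.model is required');
  }
  return problems;
}
```

An empty array means the config passed both checks. Anything else can be logged or surfaced to whoever supplied the settings.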

Example Applications

See the agent examples for runnable demos covering basic usage, tool calling, MCP servers, CRM injection, persona switching, and more.