Recognizer
A property that can be used in gather, transcribe, or other verbs to override the application default recognizer settings.
Parameters
Speech vendor to use (see list below, along with any others you add via the custom speech API). Note: this field is case sensitive; all the built-in vendors are lower case, e.g. aws, not AWS.
(Google, Microsoft) An array of alternative languages that the speaker may be using.
DTMF key that terminates continuous ASR feature.
Timeout value for continuous ASR feature.
Custom service endpoint to connect to instead of hosted Microsoft regional endpoints.
(Google) Enable speaker diarization.
(Google) Set the maximum speaker count.
(Google) Set the minimum speaker count.
(Google) Use an enhanced model.
(AWS) The method to use when filtering speech: remove, mask, or tag.
(Google, Microsoft, Deepgram, Nvidia, Soniox) Array of words or phrases to assist speech detection.
See examples below.
(Google, Nvidia) Number indicating the strength to assign to the configured hints.
See examples below.
(AWS) Enable channel identification.
(Microsoft) Initial speech timeout in milliseconds.
(Google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation.
If true, interim transcriptions are sent. Default: false.
Note: this only affects the transcribe verb; in gather, interim transcripts are sent based on the presence of a partialResponseHook.
Language code to use for speech detection.
Defaults to the application-level setting.
(AWS) The name of the custom language model when processing speech.
If provided, final transcripts with confidence lower than this value return a reason of 'stt-low-confidence' in the webhook.
(Google) Speech recognition model to use.
Default: phone_call.
(Google) Set an industry NAICS code that is relevant to the speech.
(Microsoft) simple or detailed. Default: simple.
(Google, Deepgram, Nuance, Nvidia) If true, filter profanity from speech transcription.
Default: false.
(Microsoft) masked, removed, or raw. Default: raw.
(Google) Enable automatic punctuation.
(Microsoft) Request signal-to-noise ratio information.
If true, recognize both caller and called party speech using separate recognition sessions.
(Google) If true, return only a single utterance/transcript.
Default: true for gather.
Webhook to receive an HTTP POST when an interim or final transcription is received.
If true, delay connecting to the cloud recognizer until speech is detected.
If vad is enabled, this setting governs the sensitivity of the voice activity detector; value must be between 0 and 3 inclusive. Lower numbers mean more sensitivity.
If vad is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer.
(AWS) The name of a vocabulary filter to use when processing the speech.
(AWS) The name of a vocabulary to use when processing the speech.
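As a minimal sketch, here is how several of these parameters might be combined inside a gather verb. The property names shown (vendor, language, hints, vad and its sub-keys) follow jambonz verb syntax, and the specific values are illustrative assumptions; check them against your application:

```json
{
  "verb": "gather",
  "input": ["speech"],
  "actionHook": "/gatherResult",
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": ["sales", "support", "billing"],
    "vad": {
      "enable": true,
      "mode": 2,
      "voiceMs": 250
    }
  }
}
```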
Vendor-specific options
azureOptions
Duration (in milliseconds) of non-speech audio within a phrase that’s currently being spoken before that phrase is considered “done.”
See here for details.
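A sketch of setting this option on a Microsoft recognizer; the speechSegmentationSilenceTimeoutMs key name and the value shown are assumptions based on the description above:

```json
{
  "recognizer": {
    "vendor": "microsoft",
    "language": "en-US",
    "azureOptions": {
      "speechSegmentationSilenceTimeoutMs": 1000
    }
  }
}
```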
deepgramOptions
Number of alternative transcripts to return.
Deepgram API key to authenticate with (overrides setting in Jambonz portal).
ID of custom model.
Whether to assign a speaker to each word in the transcript.
If set to '2021-07-14.0', the legacy diarization feature will be used.
Indicates the number of milliseconds of silence Deepgram will use to determine a speaker has finished saying a word or phrase. Value must be either a number of milliseconds or 'false' to disable the feature entirely. Default: 10ms.
An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context.
Deepgram model used to process submitted audio. Example models: 'nova-3', 'nova-2', 'nova-2-phonecall'; see Deepgram docs for the full list. Default: 'general'.
Indicates whether to transcribe each audio channel independently.
Indicates whether to enable Deepgram’s nodelay feature.
Indicates whether to convert numbers from written format (e.g., "one") to numerical format (e.g., "1").
Indicates whether to remove profanity from the transcript.
Indicates whether to add punctuation and capitalization to the transcript.
Whether to redact information from transcripts. Allowed values: 'pci', 'numbers', 'true', 'ssn'.
An array of terms or phrases to search for in the submitted audio and replace.
An array of terms or phrases to search for in the submitted audio.
Causes a transcript to be returned as soon as Deepgram's is_final property is set. This should only be used in scenarios where you expect a very short confirmation or directed command and want minimal latency.
Indicates whether to enable Deepgram's Smart Formatting feature.
A tag to associate with the request.
Tags appear in usage reports.
Deepgram tier you would like to use. Allowed values: 'enhanced', 'base'. Default: 'base'.
A number of milliseconds of silence that Deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which is used by Jambonz to trigger the transcript webhook if this property is supplied. This is essentially Deepgram's version of continuous ASR.
Deepgram version of the model to use. Default: 'latest'.
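A sketch of a Deepgram recognizer; key names such as apiKey, model, punctuate, smartFormatting, and utteranceEndMs are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "deepgram",
    "language": "en-US",
    "deepgramOptions": {
      "apiKey": "YOUR_DEEPGRAM_API_KEY",
      "model": "nova-2-phonecall",
      "punctuate": true,
      "smartFormatting": true,
      "utteranceEndMs": 1000
    }
  }
}
```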
ibmOptions
ID of a custom acoustic model.
Base model to be used.
IBM speech instance ID (overrides setting in Jambonz portal).
ID of a custom language model.
The model to use for speech recognition.
IBM API key to authenticate with (overrides setting in Jambonz portal).
IBM region (overrides setting in Jambonz portal).
Set to true to prevent IBM from using your API request data to improve their service.
A tag value to apply to the request data provided.
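A sketch of an IBM Watson recognizer; key names such as sttApiKey, sttRegion, instanceId, and model are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "ibm",
    "language": "en-US",
    "ibmOptions": {
      "sttApiKey": "YOUR_IBM_API_KEY",
      "sttRegion": "us-south",
      "instanceId": "YOUR_SPEECH_INSTANCE_ID",
      "model": "en-US_Telephony"
    }
  }
}
```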
nuanceOptions
When true, custom resources (DLMs, wordsets, etc.) can use the entire weight range.
An object containing arbitrary key-value pairs to inject into the call log.
Nuance client ID to authenticate with (overrides setting in Jambonz portal).
If speaker profiles are used, whether to discard updated speaker data.
By default, data is stored.
Whether to remove the wakeup word from the final result.
Object containing key-value pairs of formatting options and values defined in the data pack.
Keyword for a formatting type defined in the data pack.
Whether to include a tokenized recognition result.
Endpoint of the on-prem Krypton endpoint to connect to.
Default: Hosted service.
Whether to terminate recognition when failing to load external resources.
Maximum number of n-best hypotheses to return.
Maximum silence (in milliseconds) allowed while waiting for user input after recognition timers are started.
Whether to enable auto-punctuation.
Maximum duration (in milliseconds) of the recognition turn.
An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition.
Name of a built-in resource in the data pack.
An external DLM or settings file for creating or updating a speaker profile.
An object containing HTTP cache-control directives (e.g., max-age).
When true, allow transcription to proceed even if resource loading fails.
Time to wait when downloading resources.
Resource type: 'undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings'.
Location of the resource as a URN reference.
Inline grammar in SRGS XML format.
Inline wordset JSON resource.
See Wordsets for details.
Whether the resource will be used multiple times. Allowed values: 'undefined_reuse', 'low_reuse', 'high_reuse'. Default: low_reuse.
Input field setting the weight of the domain LM or built-in resource relative to the data pack. Allowed values: 'defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest'. Default: MEDIUM.
Weight of the DLM or built-in resource as a numeric value from 0 to 1. Default: 0.25.
Array of wakeup words.
The level of recognition results: 'final', 'partial', 'immutable_partial'. Default: final.
Nuance secret to authenticate with (overrides setting in Jambonz portal).
A balance between detecting speech and noise (breathing, etc.), ranging from 0 to 1: 0 means ignore all noise, 1 means interpret all noise as speech. Default: 0.5.
Mapping to internal weight sets for language models in the data pack.
Whether to disable call logging and audio capture.
By default, call logs, audio, and metadata are collected.
When true, the first word in a sentence is not automatically capitalized.
Specialized language model.
How many sentences (utterances) within the audio stream are processed.
Allowed values: 'single', 'multiple', 'disabled'. Default: single.
Minimum silence (in milliseconds) that determines the end of a sentence.
Identifies a specific user within the application.
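A sketch of a Nuance recognizer supplying credentials and an inline wordset resource. The key names (clientId, secret, utteranceDetectionMode, resources, inlineWordset, weightValue) are assumptions corresponding to the options described above, and the wordset format follows Nuance Krypton conventions:

```json
{
  "recognizer": {
    "vendor": "nuance",
    "language": "en-US",
    "nuanceOptions": {
      "clientId": "YOUR_NUANCE_CLIENT_ID",
      "secret": "YOUR_NUANCE_SECRET",
      "utteranceDetectionMode": "multiple",
      "resources": [
        {
          "inlineWordset": {
            "COMPANY_NAMES": [
              { "literal": "jambonz", "spoken": ["jam bones"] }
            ]
          },
          "weightValue": 0.7
        }
      ]
    }
  }
}
```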
nvidiaOptions
An object of key-value pairs that can be sent to Nvidia for custom configuration.
Number of alternative transcripts to return.
Indicates whether to remove profanity from the transcript.
Indicates whether to provide punctuation in the transcripts.
GRPC endpoint (ip:port) that Nvidia Riva is listening on.
Indicates whether to provide verbatim transcripts.
Indicates whether to provide word-level detail.
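A sketch of an Nvidia Riva recognizer; the rivaUri, punctuation, profanityFilter, and maxAlternatives key names are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "nvidia",
    "language": "en-US",
    "nvidiaOptions": {
      "rivaUri": "10.0.0.5:50051",
      "punctuation": true,
      "profanityFilter": false,
      "maxAlternatives": 1
    }
  }
}
```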
sonioxOptions
Soniox API key.
Soniox model to use. Default: 'precision_ivr'.
Indicates whether to remove profanity from the transcript.
Properties that dictate whether to store audio and/or transcripts.
Can be useful for debugging purposes.
If true, do not allow search.
If true, do not store audio.
If true, do not store transcripts.
Storage identifier.
Storage title.
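A sketch of a Soniox recognizer; the api_key and model key names and the shape of the storage object (id, title, disable flags) are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "soniox",
    "language": "en-US",
    "sonioxOptions": {
      "api_key": "YOUR_SONIOX_API_KEY",
      "model": "precision_ivr",
      "profanityFilter": true,
      "storage": {
        "id": "call-1234",
        "title": "support call",
        "disableStore": false,
        "disableSearch": true
      }
    }
  }
}
```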
speechmaticsOptions
Audio events to report: 'applause', 'laughter', or 'music'.
Audio transcription configuration.
Additional vocabulary words.
Audio filtering configuration.
Enable partial transcriptions.
Language to transcribe.
'fixed' or 'flexible'.
Punctuation configuration.
Audio filtering configuration.
Volume threshold to filter.
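A sketch of a Speechmatics recognizer; the nesting and key names (transcription_config, additional_vocab, enable_partials, max_delay_mode, audio_filtering_config) are assumptions drawn from the Speechmatics realtime API and the options described above:

```json
{
  "recognizer": {
    "vendor": "speechmatics",
    "language": "en",
    "speechmaticsOptions": {
      "transcription_config": {
        "additional_vocab": [
          { "content": "jambonz", "sounds_like": ["jam bones"] }
        ],
        "enable_partials": true,
        "max_delay_mode": "flexible",
        "audio_filtering_config": { "volume_threshold": 3.1 }
      }
    }
  }
}
```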
Providing speech hints
Many recognizers support the ability to provide a dynamic list of words or phrases that should be "boosted" by the recognizer, i.e. the recognizer should be more likely to detect these terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation it would look like this:
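A minimal sketch, assuming the hints and hintsBoost parameters described above (the vocabulary shown is illustrative):

```json
{
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": ["benign", "malignant", "biopsy"],
    "hintsBoost": 50
  }
}
```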
Additionally, Google and Nvidia allow a boost factor to be specified at the phrase level, e.g.:
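A sketch of per-phrase boosting, where each hint is an object with phrase and boost keys (assumed names):

```json
{
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": [
      { "phrase": "benign", "boost": 50 },
      { "phrase": "malignant", "boost": 10 },
      { "phrase": "biopsy", "boost": 20 }
    ]
  }
}
```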