Recognizer
A property that can be used in gather, transcribe, or other verbs to override the application default recognizer settings.
Parameters
Speech vendor to use (see list below, along with any others you add via the custom speech API). Note: this field is case sensitive; all the built-in vendors are lower case, e.g. aws, not AWS.
(Google, Microsoft) An array of alternative languages that the speaker may be using.
DTMF key that terminates continuous ASR feature.
Timeout value for continuous ASR feature.
Custom service endpoint to connect to instead of hosted Microsoft regional endpoints.
(Google) Enable speaker diarization.
(Google) Set the maximum speaker count.
(Google) Set the minimum speaker count.
(Google) Use an enhanced model.
(AWS) The method to use when filtering speech: remove, mask, or tag.
(Google, Microsoft, Deepgram, Nvidia, Soniox) Array of words or phrases to assist speech detection.
See examples below.
(Google, Nvidia) Number indicating the strength to assign to the configured hints.
See examples below.
(AWS) Enable channel identification.
(Microsoft) Initial speech timeout in milliseconds.
(Google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation.
If true, interim transcriptions are sent. Default: false.
Note: this only affects the transcribe verb; in gather, interim transcripts are sent based on the presence of a partialResponseHook.
Language code to use for speech detection.
Defaults to the application-level setting.
(AWS) The name of the custom language model when processing speech.
If provided, final transcripts with confidence lower than this value return a reason of 'stt-low-confidence' in the webhook.
(Google) Speech recognition model to use.
Default: phone_call.
(Google) Set an industry NAICS code that is relevant to the speech.
(Microsoft) simple or detailed. Default: simple.
(Google, Deepgram, Nuance, Nvidia) If true, filter profanity from speech transcription.
Default: false.
(Microsoft) masked, removed, or raw. Default: raw.
(Google) Enable automatic punctuation.
(Microsoft) Request signal-to-noise ratio information.
If true, recognize both caller and called party speech using separate recognition sessions.
(Google) If true, return only a single utterance/transcript.
Default: true for gather.
Webhook to receive an HTTP POST when an interim or final transcription is received.
If true, delay connecting to the cloud recognizer until speech is detected.
If vad is enabled, this setting governs the sensitivity of the voice activity detector; value must be between 0 and 3 inclusive. Lower numbers mean more sensitivity.
If vad is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer.
(AWS) The name of a vocabulary filter to use when processing the speech.
(AWS) The name of a vocabulary to use when processing the speech.
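As a minimal sketch, here is how several of these parameters might be combined inside a gather verb. The property names shown (vendor, language, hints, vad and its sub-keys) follow jambonz verb syntax, and the specific values are illustrative assumptions; check them against your application:

```json
{
  "verb": "gather",
  "input": ["speech"],
  "actionHook": "/gatherResult",
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": ["sales", "support", "billing"],
    "vad": {
      "enable": true,
      "mode": 2,
      "voiceMs": 250
    }
  }
}
```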
Vendor-specific options
azureOptions
Duration (in milliseconds) of non-speech audio within a phrase that’s currently being spoken before that phrase is considered “done.”
See here for details.
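A sketch of setting this option on a Microsoft recognizer; the speechSegmentationSilenceTimeoutMs key name and the value shown are assumptions based on the description above:

```json
{
  "recognizer": {
    "vendor": "microsoft",
    "language": "en-US",
    "azureOptions": {
      "speechSegmentationSilenceTimeoutMs": 1000
    }
  }
}
```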
deepgramOptions
Number of alternative transcripts to return.
Deepgram API key to authenticate with (overrides setting in Jambonz portal).
ID of custom model.
Whether to assign a speaker to each word in the transcript.
If set to '2021-07-14.0', the legacy diarization feature will be used.
Indicates the number of milliseconds of silence Deepgram will use to determine a speaker has finished saying a word or phrase. Value must be either a number of milliseconds or 'false' to disable the feature entirely. Default: 10ms.
An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context.
Deepgram model used to process submitted audio. Example models: 'nova-3', 'nova-2', 'nova-2-phonecall'; see Deepgram docs for the full list. Default: 'general'.
Indicates whether to transcribe each audio channel independently.
Indicates whether to enable Deepgram’s nodelay feature.
Indicates whether to convert numbers from written format (e.g., "one") to numerical format (e.g., "1").
Indicates whether to remove profanity from the transcript.
Indicates whether to add punctuation and capitalization to the transcript.
Whether to redact information from transcripts. Allowed values: 'pci', 'numbers', 'true', 'ssn'.
An array of terms or phrases to search for in the submitted audio and replace.
An array of terms or phrases to search for in the submitted audio.
Causes a transcript to be returned as soon as Deepgram's is_final property is set. This should only be used in scenarios where you expect a very short confirmation or directed command and want minimal latency.
Indicates whether to enable Deepgram's Smart Formatting feature.
A tag to associate with the request.
Tags appear in usage reports.
Deepgram tier you would like to use. Allowed values: 'enhanced', 'base'. Default: 'base'.
A number of milliseconds of silence that Deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which is used by Jambonz to trigger the transcript webhook if this property is supplied. This is essentially Deepgram's version of continuous ASR.
Deepgram version of the model to use. Default: 'latest'.
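A sketch of a Deepgram recognizer; key names such as apiKey, model, punctuate, smartFormatting, and utteranceEndMs are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "deepgram",
    "language": "en-US",
    "deepgramOptions": {
      "apiKey": "YOUR_DEEPGRAM_API_KEY",
      "model": "nova-2-phonecall",
      "punctuate": true,
      "smartFormatting": true,
      "utteranceEndMs": 1000
    }
  }
}
```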
ibmOptions
ID of a custom acoustic model.
Base model to be used.
IBM speech instance ID (overrides setting in Jambonz portal).
ID of a custom language model.
The model to use for speech recognition.
IBM API key to authenticate with (overrides setting in Jambonz portal).
IBM region (overrides setting in Jambonz portal).
Set to true to prevent IBM from using your API request data to improve their service.
A tag value to apply to the request data provided.
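A sketch of an IBM Watson recognizer; key names such as sttApiKey, sttRegion, instanceId, and model are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "ibm",
    "language": "en-US",
    "ibmOptions": {
      "sttApiKey": "YOUR_IBM_API_KEY",
      "sttRegion": "us-south",
      "instanceId": "YOUR_SPEECH_INSTANCE_ID",
      "model": "en-US_Telephony"
    }
  }
}
```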
nuanceOptions
When true, custom resources (DLMs, wordsets, etc.) can use the entire weight range.
An object containing arbitrary key-value pairs to inject into the call log.
Nuance client ID to authenticate with (overrides setting in Jambonz portal).
If speaker profiles are used, whether to discard updated speaker data.
By default, data is stored.
Whether to remove the wakeup word from the final result.
Object containing key-value pairs of formatting options and values defined in the data pack.
Keyword for a formatting type defined in the data pack.
Whether to include a tokenized recognition result.
Endpoint of the on-prem Krypton endpoint to connect to.
Default: Hosted service.
Whether to terminate recognition when failing to load external resources.
Maximum number of n-best hypotheses to return.
Maximum silence (in milliseconds) allowed while waiting for user input after recognition timers are started.
Whether to enable auto-punctuation.
Maximum duration (in milliseconds) of the recognition turn.
An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition.
Name of a built-in resource in the data pack.
An external DLM or settings file for creating or updating a speaker profile.
An object containing HTTP cache-control directives (e.g., max-age).
When true, allow transcription to proceed even if resource loading fails.
Time to wait when downloading resources.
Resource type: 'undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings'.
Location of the resource as a URN reference.
Inline grammar in SRGS XML format.
Inline wordset JSON resource.
See Wordsets for details.
Whether the resource will be used multiple times. Allowed values: 'undefined_reuse', 'low_reuse', 'high_reuse'. Default: low_reuse.
Input field setting the weight of the domain LM or built-in resource relative to the data pack. Allowed values: 'defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest'. Default: MEDIUM.
Weight of the DLM or built-in resource as a numeric value from 0 to 1. Default: 0.25.
Array of wakeup words.
The level of recognition results: 'final', 'partial', 'immutable_partial'. Default: final.
Nuance secret to authenticate with (overrides setting in Jambonz portal).
A balance between detecting speech and noise (breathing, etc.), ranging from 0 to 1: 0 means ignore all noise, 1 means interpret all noise as speech. Default: 0.5.
Mapping to internal weight sets for language models in the data pack.
Whether to disable call logging and audio capture.
By default, call logs, audio, and metadata are collected.
When true, the first word in a sentence is not automatically capitalized.
Specialized language model.
How many sentences (utterances) within the audio stream are processed.
Allowed values: 'single', 'multiple', 'disabled'. Default: single.
Minimum silence (in milliseconds) that determines the end of a sentence.
Identifies a specific user within the application.
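A sketch of a Nuance recognizer supplying credentials and an inline wordset resource. The key names (clientId, secret, utteranceDetectionMode, resources, inlineWordset, weightValue) are assumptions corresponding to the options described above, and the wordset format follows Nuance Krypton conventions:

```json
{
  "recognizer": {
    "vendor": "nuance",
    "language": "en-US",
    "nuanceOptions": {
      "clientId": "YOUR_NUANCE_CLIENT_ID",
      "secret": "YOUR_NUANCE_SECRET",
      "utteranceDetectionMode": "multiple",
      "resources": [
        {
          "inlineWordset": {
            "COMPANY_NAMES": [
              { "literal": "jambonz", "spoken": ["jam bones"] }
            ]
          },
          "weightValue": 0.7
        }
      ]
    }
  }
}
```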
nvidiaOptions
An object of key-value pairs that can be sent to Nvidia for custom configuration.
Number of alternative transcripts to return.
Indicates whether to remove profanity from the transcript.
Indicates whether to provide punctuation in the transcripts.
GRPC endpoint (ip:port) that Nvidia Riva is listening on.
Indicates whether to provide verbatim transcripts.
Indicates whether to provide word-level detail.
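A sketch of an Nvidia Riva recognizer; the rivaUri, punctuation, profanityFilter, and maxAlternatives key names are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "nvidia",
    "language": "en-US",
    "nvidiaOptions": {
      "rivaUri": "10.0.0.5:50051",
      "punctuation": true,
      "profanityFilter": false,
      "maxAlternatives": 1
    }
  }
}
```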
sonioxOptions
Soniox API key.
Soniox model to use. Default: 'precision_ivr'.
Indicates whether to remove profanity from the transcript.
Properties that dictate whether to store audio and/or transcripts.
Can be useful for debugging purposes.
If true, do not allow search.
If true, do not store audio.
If true, do not store transcripts.
Storage identifier.
Storage title.
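A sketch of a Soniox recognizer; the api_key and model key names and the shape of the storage object (id, title, disable flags) are assumptions corresponding to the options described above:

```json
{
  "recognizer": {
    "vendor": "soniox",
    "language": "en-US",
    "sonioxOptions": {
      "api_key": "YOUR_SONIOX_API_KEY",
      "model": "precision_ivr",
      "profanityFilter": true,
      "storage": {
        "id": "call-1234",
        "title": "support call",
        "disableStore": false,
        "disableSearch": true
      }
    }
  }
}
```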
speechmaticsOptions
Audio events to report: 'applause', 'laughter', or 'music'.
Audio transcription configuration.
Additional vocabulary words.
Audio filtering configuration.
Enable partial transcriptions.
Language to transcribe.
'fixed' or 'flexible'.
Punctuation configuration.
Audio filtering configuration.
Volume threshold to filter.
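A sketch of a Speechmatics recognizer; the nesting and key names (transcription_config, additional_vocab, enable_partials, max_delay_mode, audio_filtering_config) are assumptions drawn from the Speechmatics realtime API and the options described above:

```json
{
  "recognizer": {
    "vendor": "speechmatics",
    "language": "en",
    "speechmaticsOptions": {
      "transcription_config": {
        "additional_vocab": [
          { "content": "jambonz", "sounds_like": ["jam bones"] }
        ],
        "enable_partials": true,
        "max_delay_mode": "flexible",
        "audio_filtering_config": { "volume_threshold": 3.1 }
      }
    }
  }
}
```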
Providing speech hints
Many recognizers support the ability to provide a dynamic list of words or phrases that should be "boosted" by the recognizer, i.e. the recognizer should be more likely to detect these terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation it would look like this:
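A minimal sketch, assuming the hints and hintsBoost parameters described above (the vocabulary shown is illustrative):

```json
{
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": ["benign", "malignant", "biopsy"],
    "hintsBoost": 50
  }
}
```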
Additionally, Google and Nvidia allow a boost factor to be specified at the phrase level, e.g.:
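A sketch of per-phrase boosting, where each hint is an object with phrase and boost keys (assumed names):

```json
{
  "recognizer": {
    "vendor": "google",
    "language": "en-US",
    "hints": [
      { "phrase": "benign", "boost": 50 },
      { "phrase": "malignant", "boost": 10 },
      { "phrase": "biopsy", "boost": 20 }
    ]
  }
}
```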