Speech recognition and synthesis

Bots that make and receive calls rely on speech synthesis (text-to-speech, TTS) and automatic speech recognition (ASR).

Text-to-speech (TTS), or speech synthesis, is the process of generating speech from written text.
Automatic speech recognition (ASR) is the process of converting speech to text.

You can do the following when you create a telephone channel:

Select providers and set up speech synthesis and recognition, for example by choosing a specific voice or recognition model. You can also keep the defaults.
Create a connection that uses a provider’s account to recognize and synthesize speech.

Select a provider

You can select ASR and TTS providers when you create a telephone channel. Open the ASR tab and select the connection, and then repeat these steps for TTS.

Note that if you select a specific ASR or TTS provider and that provider fails, you need to switch the channel to another provider manually.

You can also keep the Default settings, in which case the settings of the most stable ASR and TTS providers are applied. If a provider fails, the channel is switched to another provider.
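
The difference between a pinned provider and the Default setting can be illustrated with a minimal sketch. It is not the platform's implementation: the `recognize` function, the `ProviderError` exception, and the provider names `provider_a` and `provider_b` are all hypothetical and only model the behavior described above.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ProviderError(Exception):
    """Hypothetical error raised when a provider is unavailable."""


Recognizer = Callable[[bytes], str]


def recognize(
    audio: bytes,
    providers: Dict[str, Recognizer],
    default_order: List[str],
    pinned: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (provider name, recognized text).

    With a pinned provider, a failure is not handled here: someone has to
    switch the channel manually. With the defaults, the next provider in
    the list is tried.
    """
    if pinned is not None:
        return pinned, providers[pinned](audio)

    last_error: Optional[ProviderError] = None
    for name in default_order:
        try:
            return name, providers[name](audio)
        except ProviderError as exc:
            last_error = exc  # remember the failure and try the next provider
    raise ProviderError("all providers failed") from last_error


# Stand-in providers for the demonstration (assumptions, not real services).
def failing(_audio: bytes) -> str:
    raise ProviderError("service unavailable")


def working(_audio: bytes) -> str:
    return "hello"


providers = {"provider_a": failing, "provider_b": working}
order = ["provider_a", "provider_b"]

print(recognize(b"", providers, order))  # ('provider_b', 'hello')
```

Calling `recognize(b"", providers, order, pinned="provider_a")` in this sketch raises `ProviderError`, which corresponds to the case where the channel has to be switched by hand.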

ASR and TTS configuration

ASR

You can select one of the connections for ASR and specify additional settings when you create a telephone channel.

Connection	Settings	Description
Google	Language	The service can recognize speech in multiple languages. You can find the complete list here. English (`en-US`) is used by default.
	Model	The machine learning model used for speech recognition. Google trained these models for particular audio types and sources. See the table for the list of models available for each language. `Phone call`: use this model to recognize speech in a phone call. `Command and search`: use this model to recognize speech in short audio files, such as voice commands. `Default`: use this model if neither of the above fits your audio.
Yandex	Language	The service can recognize speech in the following languages: `ru-RU` (default) — Russian, `en-US` — English, `tr-TR` — Turkish.
	Model	The machine learning model used for speech recognition. The models are trained on data from Yandex services and applications.
Tinkoff		Settings for this connection are currently not available.
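
The language defaults from the table above can be expressed as a small lookup. This is only a sketch: the dictionary layout and the `resolve_asr_language` helper are assumptions made for illustration, and only the language codes and defaults come from the table.

```python
from typing import Optional

# Defaults and supported codes as listed in the ASR table.
ASR_DEFAULT_LANGUAGE = {"Google": "en-US", "Yandex": "ru-RU"}
YANDEX_ASR_LANGUAGES = {"ru-RU", "en-US", "tr-TR"}


def resolve_asr_language(connection: str, language: Optional[str] = None) -> str:
    """Pick the recognition language for a connection (hypothetical helper)."""
    if connection not in ASR_DEFAULT_LANGUAGE:
        # For example Tinkoff, which has no configurable settings in the table.
        raise ValueError(f"no ASR settings available for {connection!r}")
    if language is None:
        return ASR_DEFAULT_LANGUAGE[connection]
    if connection == "Yandex" and language not in YANDEX_ASR_LANGUAGES:
        raise ValueError(f"Yandex ASR does not list {language!r}")
    # The Google language list is not reproduced here, so any code is passed through.
    return language


print(resolve_asr_language("Google"))           # en-US
print(resolve_asr_language("Yandex"))           # ru-RU
print(resolve_asr_language("Yandex", "tr-TR"))  # tr-TR
```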

TTS

You can select one of the connections for TTS and specify additional settings when you create a telephone channel.

Connection	Settings	Description
Google	Language	The service can synthesize speech in multiple languages. You can find the complete list here.
	Voice	You can use multiple voice options in the service (see here for the complete list). The following voice is used by default: `en-US-Wavenet-A` for English; `ru-RU-Wavenet-B` for Russian; `cmn-CN-Wavenet-B` for Chinese; `Wavenet-A` for other languages.
	Speed	Speech rate. `1` is the normal speed of the selected voice.
	Voice pitch	Pitch shift in semitones relative to the original pitch of the voice. `20` raises the pitch by 20 semitones, and `-20` lowers it by 20 semitones.
	Raise volume	Volume gain in dB relative to the normal volume of the selected voice. A value of `+6.0` dB plays at approximately twice the normal amplitude. We strongly recommend not exceeding `+10.0` dB.
Yandex	Language	Speech can be synthesized in three languages: `ru-RU` (Russian); `en-US` (English); `tr-TR` (Turkish).
	Voice	You can use multiple voice options in the service (see here for the complete list). The following voice is used by default: `alyss` for English; `alena` for Russian; `alyss` for other languages.
	Speed	Speech rate. `1` is the normal speed of the selected voice.
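
To get a feel for the pitch and volume values in the Google TTS settings, the standard conversions can be worked out in a few lines. The function names below are made up for this illustration. The formulas are the usual ones for decibels (amplitude ratio) and equal-tempered semitones; how the provider applies the values internally is not described here.

```python
def gain_db_to_amplitude_ratio(gain_db: float) -> float:
    """Amplitude ratio for a gain in dB: ratio = 10 ** (dB / 20)."""
    return 10 ** (gain_db / 20)


def semitones_to_frequency_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift: ratio = 2 ** (semitones / 12)."""
    return 2 ** (semitones / 12)


# +6.0 dB is roughly a doubling of amplitude.
print(round(gain_db_to_amplitude_ratio(6.0), 3))   # 1.995

# 12 semitones is one octave, which doubles the frequency.
print(round(semitones_to_frequency_ratio(12), 3))  # 2.0
```

A negative value works the same way in the other direction: `-12` semitones halves the frequency, and `-6.0` dB roughly halves the amplitude.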