
Speech recognition and synthesis


Bots that make and accept calls use the voice synthesis (text-to-speech, TTS) and automatic speech recognition (ASR) features.

  • Text-to-speech (TTS), or voice synthesis, is the process of generating speech from written text.
  • Automatic speech recognition (ASR) is the process of converting speech to text.

You can do the following when you create a telephone channel:

  • Select a provider
  • Configure ASR and TTS


Select a provider

You can select ASR and TTS providers when you create a telephone channel. Open the ASR tab and select the connection, and then repeat these steps for TTS.

Note that if you select a specific ASR or TTS provider, the channel is not switched when that provider fails. You will need to switch the channel to another provider manually.

You can also keep the Default settings, in which case the settings of the most stable ASR and TTS providers are applied. If a provider fails, the channel is switched to another provider automatically.
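
The difference between the Default settings and a specific provider can be sketched as follows. This is only an illustration of the behavior described above, not the platform's implementation: `Provider`, `ProviderError`, `run_with_providers`, and the two demo providers are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Raised by a (hypothetical) provider call when the service fails."""


@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # hypothetical: takes a request, returns a result


def run_with_providers(request: str,
                       providers: Sequence[Provider],
                       pinned: Optional[str] = None) -> str:
    """Return the result of the first candidate provider that succeeds.

    pinned=None models the Default settings: providers are tried in order,
    so a fault in one moves the request to the next.
    pinned="<name>" models a specific selection: only that provider is tried,
    and a fault is raised to the caller, who has to switch manually.
    """
    candidates = [p for p in providers if pinned is None or p.name == pinned]
    if not candidates:
        raise ValueError(f"unknown provider: {pinned}")
    last_error = None
    for provider in candidates:
        try:
            return provider.call(request)
        except ProviderError as exc:
            last_error = exc  # remember the fault and try the next candidate
    raise ProviderError(f"all candidate providers failed: {last_error}")


# Demo providers (assumptions for illustration only).
def _failing(_request: str) -> str:
    raise ProviderError("service unavailable")


def _working(request: str) -> str:
    return f"recognized:{request}"


PROVIDERS = [Provider("primary", _failing), Provider("backup", _working)]
```

With these demo providers, `run_with_providers("hello", PROVIDERS)` falls through to `backup`, while `run_with_providers("hello", PROVIDERS, pinned="primary")` raises `ProviderError`, which corresponds to the case where a manual switch is needed.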


ASR and TTS configuration

ASR

You can select one of the connections for ASR and specify additional settings when you create a telephone channel.


Connection: Google

  • Language: The service can recognize speech in multiple languages. You can find the complete list here. English (en-US) is used by default.
  • Model: One of several machine learning models is used for speech recognition. Google trained these models for specific sound types and sources. See the table for the list of models available for each language. The options are:
      • Phone call: use this model to recognize speech in a phone call.
      • Command and search: use this model to recognize speech in short audio files, such as voice commands.
      • Default: use this model if neither of the models above suits your audio.

Connection: Yandex

  • Language: The service can recognize speech in the following languages:
      • ru-RU (default): Russian
      • en-US: English
      • tr-TR: Turkish
  • Model: One of several machine learning models is used for speech recognition. The models are trained on data from Yandex services and applications.

Connection: Tinkoff

This connection is currently not available.
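
The language defaults and restrictions for ASR listed above can be summarized in a short sketch. The function name `resolve_asr_settings`, the lowercase provider keys, and the model identifiers are assumptions made for this example and are not the platform's actual configuration format. No default model is assumed, because none is stated above.

```python
from typing import Dict, Optional

# Language defaults and restrictions taken from the ASR settings above.
ASR_DEFAULT_LANGUAGE = {"google": "en-US", "yandex": "ru-RU"}
YANDEX_ASR_LANGUAGES = {"ru-RU", "en-US", "tr-TR"}

# Hypothetical identifiers for the three Google model options listed above.
GOOGLE_ASR_MODELS = {"phone_call", "command_and_search", "default"}


def resolve_asr_settings(provider: str,
                         language: Optional[str] = None,
                         model: Optional[str] = None) -> Dict[str, str]:
    """Fill in the default language and reject unsupported combinations."""
    provider = provider.lower()
    if provider == "tinkoff":
        raise ValueError("the Tinkoff connection is currently not available")
    if provider not in ASR_DEFAULT_LANGUAGE:
        raise ValueError(f"unknown ASR connection: {provider}")

    # Fall back to the documented default language for the connection.
    language = language or ASR_DEFAULT_LANGUAGE[provider]
    if provider == "yandex" and language not in YANDEX_ASR_LANGUAGES:
        raise ValueError(f"Yandex ASR does not support {language}")

    settings = {"provider": provider, "language": language}
    if model is not None:
        # Only the Google options are listed above, so only those are checked.
        if provider == "google" and model not in GOOGLE_ASR_MODELS:
            raise ValueError(f"unknown Google model: {model}")
        settings["model"] = model
    return settings
```

For example, `resolve_asr_settings("yandex")` yields the `ru-RU` default, while `resolve_asr_settings("yandex", language="de-DE")` is rejected.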

TTS

You can select one of the connections for TTS and specify additional settings when you create a telephone channel.


Connection: Google

  • Language: The service can synthesize speech in multiple languages. You can find the complete list here.
  • Voice: The service offers multiple voices (see here for the complete list). The following voices are used by default:
      • en-US-Wavenet-A for English
      • ru-RU-Wavenet-B for Russian
      • cmn-CN-Wavenet-B for Chinese
      • Wavenet-A for other languages
  • Speed: Speech tempo. A value of 1 is the normal speed of the selected voice.
  • Voice pitch: Pitch shift in semitones. A value of 20 raises the pitch 20 semitones above the original tone, and -20 lowers it by 20 semitones.
  • Raise volume: Volume gain in dB relative to the normal volume of the selected voice. At +6.0 dB, the playback amplitude is approximately twice the normal one. We strongly discourage exceeding +10.0 dB.

Connection: Yandex

  • Language: Speech can be synthesized in three languages:
      • ru-RU (Russian)
      • en-US (English)
      • tr-TR (Turkish)
  • Voice: The service offers multiple voices (see here for the complete list). The following voices are used by default:
      • alyss for English
      • alena for Russian
      • alyss for other languages
  • Speed: Speech tempo. A value of 1 is the normal speed of the selected voice.
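
To make the Google pitch and volume values above more concrete, the following sketch converts them to ratios using the standard decibel and equal-temperament formulas. The function names are introduced here for illustration only and are not part of any provider API.

```python
def gain_db_to_amplitude_ratio(gain_db: float) -> float:
    """Amplitude ratio for a gain in dB: ratio = 10 ** (dB / 20)."""
    return 10 ** (gain_db / 20)


def semitones_to_frequency_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift: ratio = 2 ** (semitones / 12)."""
    return 2 ** (semitones / 12)


if __name__ == "__main__":
    print(round(gain_db_to_amplitude_ratio(6.0), 3))    # 1.995, about twice the amplitude
    print(round(gain_db_to_amplitude_ratio(10.0), 3))   # 3.162, the discouraged upper bound
    print(round(semitones_to_frequency_ratio(20), 3))   # 3.175, more than an octave and a half up
    print(round(semitones_to_frequency_ratio(-20), 3))  # 0.315
```

The numbers show why large values are rarely useful: +10.0 dB already more than triples the amplitude, which makes clipping likely, and a shift of 20 semitones moves the voice by more than an octave and a half.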