Speech recognition and synthesis
Bots that make and receive calls use speech synthesis (text-to-speech, TTS) and automatic speech recognition (ASR).
- Text-to-speech (TTS), or speech synthesis, is the process of generating speech from written text.
- Automatic speech recognition (ASR) is the process of converting speech to text.
You can do the following when you create a telephone channel:
- Select providers and set up speech synthesis and recognition, for example by choosing a specific voice or recognition mode. You can also keep the defaults.
- Create a connection that uses a provider’s account to recognize and synthesize speech.
Select a provider
You can select ASR and TTS providers when you create a telephone channel. Open the ASR tab and select the connection, and then repeat these steps for TTS.
Note that if you select a specific ASR or TTS provider and that provider fails, you will need to switch the channel to another provider manually.
You can also keep the Default settings. In that case the settings of the most stable ASR and TTS providers are applied, and the channel is switched to another provider if the current one fails.
ASR and TTS configuration
ASR
You can select one of the connections for ASR and specify additional settings when you create a telephone channel.
| Connection | Settings | Description |
|---|---|---|
| Google | Language | The service can recognize speech in multiple languages. You can find the complete list here. English (en-US) is used by default. |
| | Model | One of the machine learning models is used for speech recognition. These models were trained by Google for certain sound types and sources. See the table for the list of models available for each language. Phone call: use this model to recognize speech in a phone call. Command and search: use this model to recognize speech in short audio files, such as voice commands. Default: use this model if neither of the above fits. |
| Yandex | Language | The service can recognize speech in the following languages: ru-RU (Russian, default), en-US (English), tr-TR (Turkish). |
| | Model | One of the machine learning models is used for speech recognition. The models are trained on data from Yandex services and applications. |
| Tinkoff | | This connection setting is currently not available. |
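The defaults and allowed values in the ASR table can be summarized as a small validation sketch. The function name `resolve_asr_settings` and the dictionary layout are hypothetical and do not reflect the platform's configuration format. The connection name `Google` is assumed from the model description in the table.

```python
from typing import Optional

# Values taken from the ASR table; the layout itself is hypothetical.
ASR_DEFAULT_LANGUAGE = {"Google": "en-US", "Yandex": "ru-RU"}
YANDEX_LANGUAGES = {"ru-RU", "en-US", "tr-TR"}
GOOGLE_MODELS = {"Phone call", "Command and search", "Default"}


def resolve_asr_settings(
    connection: str,
    language: Optional[str] = None,
    model: Optional[str] = None,
) -> dict:
    """Fill in the default language and reject values the table does not list."""
    if connection not in ASR_DEFAULT_LANGUAGE:
        # For example Tinkoff, whose settings are currently not available.
        raise ValueError(f"no configurable ASR settings for {connection!r}")
    language = language or ASR_DEFAULT_LANGUAGE[connection]
    if connection == "Yandex" and language not in YANDEX_LANGUAGES:
        raise ValueError(f"Yandex ASR does not list language {language!r}")
    if connection == "Google" and model is not None and model not in GOOGLE_MODELS:
        raise ValueError(f"unknown Google ASR model {model!r}")
    return {"connection": connection, "language": language, "model": model}


print(resolve_asr_settings("Yandex"))
# {'connection': 'Yandex', 'language': 'ru-RU', 'model': None}
```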
TTS
You can select one of the connections for TTS and specify additional settings when you create a telephone channel.
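Two of the numeric TTS settings listed in the table that follows, volume gain in dB and voice pitch in halftones (semitones), map to simple ratios. The sketch below uses the standard decibel and equal-temperament formulas. It assumes the providers follow these conventions, which the source does not state, and the function names are hypothetical.

```python
def gain_db_to_amplitude_ratio(gain_db: float) -> float:
    """Amplitude ratio for a gain in dB (20 dB corresponds to a factor of 10)."""
    return 10 ** (gain_db / 20)


def semitones_to_frequency_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in semitones (12 semitones = 1 octave)."""
    return 2 ** (semitones / 12)


print(round(gain_db_to_amplitude_ratio(6.0), 3))    # 1.995
print(round(gain_db_to_amplitude_ratio(10.0), 3))   # 3.162
print(round(semitones_to_frequency_ratio(12), 3))   # 2.0
```

A gain of +6.0 dB therefore roughly doubles the amplitude, while +10.0 dB already more than triples it.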
| Connection | Settings | Description |
|---|---|---|
| Google | Language | The service can synthesize speech in multiple languages. You can find the complete list here. |
| | Voice | The service offers multiple voices (see here for the complete list). Default voices: en-US-Wavenet-A for English; ru-RU-Wavenet-B for Russian; cmn-CN-Wavenet-B for Chinese; Wavenet-A for other languages. |
| | Speed | Speech tempo. A value of 1 is the normal speed of the selected voice. |
| | Voice pitch | Pitch shift in semitones. A value of 20 raises the pitch 20 semitones above the original, and -20 lowers it by the same amount. |
| | Raise volume | Volume gain in dB relative to the normal volume of the selected voice. At +6.0 dB, the playback amplitude is approximately twice the normal one. We strongly discourage exceeding +10.0 dB. |
| Yandex | Language | Speech can be synthesized in three languages: ru-RU (Russian); en-US (English); tr-TR (Turkish). |
| | Voice | The service offers multiple voices (see here for the complete list). Default voices: alyss for English; alena for Russian; alyss for other languages. |
| | Speed | Speech tempo. A value of 1 is the normal speed of the selected voice. |