Connecting the classifier to the bot (deprecated)

Please note that the example-based classifier has been deprecated since version 1.10.0. Learn more about migrating projects to CAILA.

Do the following to connect the classifier to the bot:

Specify the parameters for the platform’s nlp functions in the chatbot.yaml bot configuration file:

morphology

Selects the library for morphological analysis of words. The library is used when processing the ~, $lemma, and $morph pattern elements and in the $nlp.parseMorph function.

Specify one of the libraries:

aot — the AOT.ru library is used.
myStem — the myStem utility is used.
pyMorphy — the pyMorphy library is used, which is the best analyzer for Russian.

tokenizer

Specifies the rules used to split the text into words.

Supported tokenizer types:

regexp — a simple tokenizer based on regular expressions.
srx — a configurable tokenizer based on customizable segmentation rules. When this tokenizer is specified, you also need to specify a grammar file in the srxPath parameter.
myStem — segmentation via the myStem utility. The preferred tokenizer to be used with patterns and classifier.
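As a minimal sketch, selecting the srx tokenizer could look like the fragment below. It assumes that srxPath sits next to tokenizer in the nlp block. The file path is a hypothetical placeholder, not a value from the platform.

```yaml
nlp:
  tokenizer: srx                       # configurable tokenizer based on segmentation rules
  srxPath: path/to/segmentation.srx    # hypothetical path to the grammar file; placement in the nlp block is an assumption
```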

vocabulary

Word weight dictionary for pattern ranking. Default: common-vocabulary.json.

lengthLimit, timeLimit

These parameters set the limits on the incoming message size and on the nlp module processing time.

Default parameters:

nlp:
  lengthLimit:
    enabled: true
    symbols: 400
    words: 100000
  timeLimit:
    enabled: true
    timeout: 10000

For lengthLimit:

symbols — sets the limit on the number of symbols in an incoming message. When this limit is exceeded, the lengthLimit event is triggered which can be processed in the bot script by the event: lengthLimit tag.
words — sets the limit on the number of words in an incoming message. When this limit is exceeded, the lengthLimit event is triggered which can be processed in the bot script by the event: lengthLimit tag.

When you set the words limit, note that the word counter also counts the following symbols as words: !,.:;?"'()*/[\]{|}

For timeLimit:

timeout — sets the maximum request processing time (in milliseconds) for the nlp module. When this limit is exceeded, the timeLimit event is triggered which can be processed in the bot script by the event: timeLimit tag.
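As an illustration, the default limits can be overridden in the nlp block. The values below are arbitrary examples chosen for this sketch, not recommendations.

```yaml
nlp:
  lengthLimit:
    enabled: true
    symbols: 200      # arbitrary example: the lengthLimit event is triggered above 200 symbols
    words: 50         # arbitrary example: punctuation symbols also count as words
  timeLimit:
    enabled: true
    timeout: 5000     # arbitrary example: 5000 milliseconds, after which the timeLimit event is triggered
```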

Example of an nlp block:

nlp:                                    # platform’s nlp function parameters
  morphology: myStem                    # library for morphological analysis of words
  tokenizer: myStem                     # tokenizer, specifies rules to be used to split the text into words
  vocabulary: common-vocabulary.json    # word weight dictionary for pattern ranking
  lengthLimit:
    enabled: true
    symbols: 400                        # limit on the number of symbols in an incoming message
    words: 100000                       # limit on the number of words in an incoming message
  timeLimit:
    enabled: true
    timeout: 10000                      # maximum request processing time (in milliseconds) for the nlp module

Next, specify classification parameters:

engine

Classifier type (sts by default).

noMatchThreshold

The lower similarity threshold: phrases whose similarity falls below it are considered different. The value 0.2 was found empirically to be optimal during classifier development.

parameters: algorithm

The classification algorithm to use. match-aligner is the primary algorithm of the sts classifier. You can also use aligner or aligner2, an alternative implementation of the classification algorithm.
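The classification parameters described above can be sketched as follows, here with the match-aligner algorithm. The fragment assumes the same classifier block layout as the full configuration example in this article.

```yaml
classifier:
  enable: true
  engine: sts                   # classifier type (the default)
  noMatchThreshold: 0.2         # phrases with similarity below this value are considered different
  parameters:
    algorithm: match-aligner    # primary algorithm of the sts classifier
```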

Next, configure the classifier algorithm. All parameters keep their default values, so you only need to specify the word weight dictionary, which must be the same dictionary as the one specified in the nlp block. Default: common-vocabulary.json.

Here is an example of a chatbot.yaml configuration file with a classifier connected to it:

name: demo

entryPoint:
  - main.sc

tests:
  exclude:
    - tests.xml

messages:
    onError: 
        defaultMessage: Oops, something has gone wrong. 
        locales: 
            ru: Oops, something has gone wrong.

nlp:                                    # platform’s nlp function parameters
  morphology: myStem                    # library for morphological analysis of words
  tokenizer: myStem                     # tokenizer, specifies rules to be used to split the text into words
  vocabulary: common-vocabulary.json    # word weight dictionary for pattern ranking
  lengthLimit:
    enabled: true
    symbols: 400                        # limit on the number of symbols in an incoming message
    words: 100000                       # limit on the number of words in an incoming message
  timeLimit:
    enabled: true
    timeout: 10000                      # maximum request processing time (in milliseconds) for the nlp module

classifier:                             # classifier parameters
  enable: true
  engine: sts                           # classifier type
  noMatchThreshold: 0.2
  parameters:
    algorithm: aligner2                 # classification algorithm

aligner:
  vocabulary: common-vocabulary.json

exampleGroups:
    - src/dictionaries/examples.json    # specified when a group of examples is used