Advanced NLU settings
You can define additional NLU settings when you create or edit your project. Settings are passed as text in the form of a JSON object.
Specify the parameters according to the classifier algorithm selected for the project.
Common settings
Common settings include parameters that do not depend on the algorithm of the classifier in the project:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "dictionaryAutogeneration": true
}
Parameters
- `patternsEnabled` — if the parameter is active, patterns can be used in training phrases.
- `tokenizerEngine` — the tokenizer engine to use.
- `dictionaryAutogeneration` — if the parameter is active, the user dictionary is automatically filled with entity content.
tokenizerEngine
For different NLU languages, different tokenizer engines are supported.
NLU language | Tokenizers | Notes |
---|---|---|
Russian | udpipe, mystem, morphsrus | The mystem and morphsrus tokenizers are used for migrating projects to CAILA. |
Chinese | pinyin | |
Portuguese | udpipe | |
Kazakh | kaznlp | |

All languages missing from the table above support the spacy tokenizer.
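For example, a project in Kazakh could select the kaznlp tokenizer in the common settings. This is a sketch; the other values are illustrative:

{
    "patternsEnabled": true,
    "tokenizerEngine": "kaznlp",
    "dictionaryAutogeneration": true
}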
STS
STS classifier settings:
{
    "patternsEnabled": true,
    "namedEntitiesRequired": true,
    "tokenizerEngine": "udpipe",
    "allowedPatterns": [],
    "stsSettings": {
        "exactMatch": 1.0,
        "lemmaMatch": 0.95,
        "jaccardMatch": 0.5,
        "jaccardMatchThreshold": 0.82,
        "acronymMatch": 1.0,
        "synonymMatch": 0,
        "synonymContextWeight": 0.0,
        "patternMatch": 1,
        "throughPatternMatch": 0,
        "wordSequence1": 0.8,
        "wordSequence2": 0.9,
        "wordSequence3": 1.0,
        "intermediateAlternativesLimit": 5,
        "finalAlternativesLimit": 5,
        "idfShift": 0.0,
        "idfMultiplier": 1.0,
        "namedEntitiesRequired": true
    }
}
Parameters
- `allowedPatterns` — the array of entities that have the Automatically expand intents setting turned on.
- `exactMatch` — weight of an exact word match in phrases.
- `lemmaMatch` — weight of a lemma word match.
- `jaccardMatch` — weight of a Jaccard-based symbol-by-symbol word match.
- `jaccardMatchThreshold` — threshold for the Jaccard-based word match.
- `acronymMatch` — weight of an acronym-based word match.
- `synonymMatch` — weight for synonym matches.
- `synonymContextWeight` — weight applied during ranking to the `weight` value from the synonym reference book.
- `patternMatch` — weight of a pattern-based match.
- `throughPatternMatch` — weight of a match by entities found in both the example and the input text.
- `wordSequence1` — weight of lookalike sequences of length 1.
- `wordSequence2` — weight of lookalike sequences of length 2.
- `wordSequence3` — weight of lookalike sequences of length greater than 2.
- `intermediateAlternativesLimit` — cutoff threshold for intermediate alternatives processed by the algorithm.
- `finalAlternativesLimit` — maximum number of final results required for the algorithm to finish its work.
- `namedEntitiesRequired` — for a phrase to match the intent, it must contain a system entity.
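For illustration, to make the classifier rely only on exact and lemma matches, you could zero out the synonym weights. This sketch assumes that parameters omitted from stsSettings keep their default values:

{
    "stsSettings": {
        "exactMatch": 1.0,
        "lemmaMatch": 0.95,
        "synonymMatch": 0,
        "synonymContextWeight": 0.0
    }
}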
namedEntitiesRequired
Consider the `"namedEntitiesRequired": true` setting. Suppose a phrase with a system entity was added to the intent, for example:

I need @duckling.number of apples

Then the client request I need apples will not match the intent, because no system entity was found in it.
Override `namedEntitiesRequired` in the advanced NLU settings so that phrases without system entities can activate the intent.
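A minimal sketch of such an override, setting the parameter to false inside stsSettings:

{
    "stsSettings": {
        "namedEntitiesRequired": false
    }
}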
Deep Learning
Deep Learning classifier settings:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "cnnSettings": {
        "lang": "ru",
        "kernel_sizes": [1, 2],
        "n_filters": 1024,
        "emb_drp": 0.25,
        "cnn_drp": 0.25,
        "bs": 64,
        "n_epochs": 15,
        "lr": 0.001,
        "pooling_name": "max"
    }
}
Parameters
- `kernel_sizes` — list of convolution kernel sizes. A convolution kernel is the size of the context window the classifier takes into account. For instance, `"kernel_sizes": [3]` means that the model will use all triplets of adjacent words to find features in the text. Multiple convolution kernels can be defined for a single model.
- `n_filters` — number of filters. A filter is a specific pattern learned by the model. A model has a unique set of patterns for each kernel. For instance, if we specify `"kernel_sizes": [2, 3]` and `"n_filters": 512`, the total number of filters will be 1024 (512 per kernel); see the fragment after this list.
- `emb_drp` — probability of dropout on the embedding layer. Dropout is a mechanism that forcibly disables part of the network weights during training. It prevents the network from overfitting (i.e., it helps the network generalize instead of merely memorizing the entire dataset). `emb_drp` can take any value from 0 to 1.
- `cnn_drp` — probability of dropout on the convolution layers of the network.
- `bs` — size of the input batch for training. This value defines the number of training examples fed to the network per step. If the dataset has fewer than 3000 examples, a value from 16 to 32 is recommended. For larger datasets, this value can be from 32 to 128.
- `n_epochs` — number of training epochs (i.e., the number of times the model will see all the training data).
- `lr` — learning rate: the factor the model uses to update its weights in the course of training.
- `pooling_name` — aggregation strategy. The model has to aggregate the patterns found in the input string before the final classification layer. The following aggregation strategies are possible: `max`, `mean`, `concat`.
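For instance, the n_filters example above corresponds to the following cnnSettings fragment, which yields 1024 filters in total, 512 per kernel:

"cnnSettings": {
    "kernel_sizes": [2, 3],
    "n_filters": 512
}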
General recommendations
Recommended Deep Learning classifier settings depending on the dataset size:

- Over 100 thousand examples:

    "kernel_sizes": [2, 3, 4],
    "n_filters": 1024-2048,
    "emb_drp": 0.3-0.4,
    "cnn_drp": 0.3-0.4,
    "bs": 64-128,
    "n_epochs": 3,
    "lr": 0.001,
    "pooling_name": "max" or "concat"

- 30 to 100 thousand examples:

    "kernel_sizes": [2, 3, 4],
    "n_filters": 1024-2048,
    "emb_drp": 0.3-0.4,
    "cnn_drp": 0.3-0.4,
    "bs": 32-128,
    "n_epochs": 3,
    "lr": 0.001,
    "pooling_name": "max" or "concat"

- 10 to 30 thousand examples:

    "kernel_sizes": [2, 3, 4] or [2, 3],
    "n_filters": 1024,
    "emb_drp": 0.3-0.5,
    "cnn_drp": 0.3-0.5,
    "bs": 32-64,
    "n_epochs": 3-5,
    "lr": 0.001,
    "pooling_name": "max"

- 3 to 10 thousand examples:

    "kernel_sizes": [2, 3, 4] or [2, 3],
    "n_filters": 1024,
    "emb_drp": 0.4-0.5,
    "cnn_drp": 0.4-0.5,
    "bs": 32,
    "n_epochs": 4-7,
    "lr": 0.001,
    "pooling_name": "max"

- 1 to 3 thousand examples:

    "kernel_sizes": [2, 3],
    "n_filters": 512,
    "emb_drp": 0.5,
    "cnn_drp": 0.5,
    "bs": 16-32,
    "n_epochs": 7-15,
    "lr": 0.001,
    "pooling_name": "max"
Classic ML
Classic ML classifier settings:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "classicMLSettings": {
        "C": 1,
        "lang": "ru",
        "word_ngrams": [1, 2],
        "lemma_ngrams": [0],
        "stemma_ngrams": [1, 2],
        "char_ngrams": [3, 4],
        "lower": true
    }
}
Parameters
- `C` — regularization coefficient that can be used to control model overfitting: larger target function coefficients are penalized by the value of this parameter. Possible values: 0.01, 0.1, 1, 10.
- `word_ngrams` — number of words to be combined into word combinations. For instance, for `"word_ngrams": [2, 3]`, combinations of two and three words will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apples",
    "I like green",
    "like green apples"
]
Values greater than 3 are not recommended for this parameter.
- `lemma_ngrams` — number of words to be normalized (lemmatized) and combined into word combinations. For instance, for `"lemma_ngrams": [2]`, combinations of two words will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apple"
]
Values greater than 3 are not recommended for this parameter.
- `stemma_ngrams` — number of stemmas to be combined into word combinations. A stemma is the stem of a source word, which is not necessarily equal to the morphological root of that word. For instance, for `"stemma_ngrams": [2]`, combinations of two stemmas will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apple"
]
Using both the `lemma_ngrams` and `stemma_ngrams` parameters is not recommended due to possible model overfitting. Setting `stemma_ngrams` to a value greater than 3 is also not recommended.
- `char_ngrams` — number of characters to be combined and treated as a single unit of a phrase.

For instance, for `"char_ngrams": [3]`, the green apples phrase is converted to the following set:

[
    "gre",
    "ree",
    "een",
    ...
]
- `lower` — if set to `true`, all the phrases are converted to lowercase.
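Conversely, to use lemma n-grams instead of stemma n-grams, disable the latter the same way. This is a sketch, assuming a zero value disables the corresponding feature, as the `"lemma_ngrams": [0]` entry in the sample configuration above suggests:

{
    "classicMLSettings": {
        "C": 1,
        "word_ngrams": [1, 2],
        "lemma_ngrams": [1, 2],
        "stemma_ngrams": [0],
        "char_ngrams": [3, 4],
        "lower": true
    }
}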
External NLU service
The JAICP platform supports connecting an external NLU service that complies with the Model API specification. You can also create and configure intents, entities, and patterns.
Model API allows you to use third-party tokenizers, external named entity and intent recognition NLU services in JAICP projects.
To use an external NLU service in a project, add `externalNluSettings` to the Advanced NLU settings field:
...
"externalNluSettings": {
    "nluProviderSettings": {
        "markup": {
            "nluType": "external",
            "url": "http://example.com"
        },
        "ner": {
            "nluType": "external",
            "url": "http://example.com"
        },
        "classification": {
            "nluType": "external",
            "url": "http://example.com"
        }
    },
    "language": "ja",
    "nluActionAdditionalProperties": {
        "markup": null,
        "ner": null,
        "classification": {
            "modelId": "123",
            "classifierName": "example",
            "properties": null
        }
    }
}
...
Parameters
Parameter | Description |
---|---|
`classifierName` | Classifier name. |
`classification` | Map of parameters for classification requests. |
`language` | External NLU language. If not set, the language from the project settings is used. |
`markup` | Map of parameters for markup requests. |
`modelId` | Classifier model ID. |
`ner` | Map of parameters for named entity recognition requests. |
`nluActionAdditionalProperties` | Additional NLU service properties. |
`nluProviderSettings` | Object that determines where each NLU action is performed. |
`nluType` | NLU type: either `external` or `caila`. |
How to use
Please note that you cannot use both CAILA and external NLU service intents or entities at the same time in a project.

In a JAICP project you can:

- Use entities and intents from the external NLU service.
- Use CAILA intents and entities from the external NLU service:
  - Set `nluType` to `external` for the `ner` parameter and to `caila` for the `markup` and `classification` parameters (see the sketch after this list).
  - Entities from the external NLU service cannot be used when setting up intents and slot filling.
  - Entities are available in the script using the `q` tag.
- Use CAILA entities and intents from the external NLU service:
  - Set `nluType` to `external` for the `classification` parameter and to `caila` for the `markup` and `ner` parameters.
  - Intents are available in the script using the `q` tag.
- Use external NLU service markup with CAILA intents and entities:
  - Set `nluType` to `external` for the `markup` parameter and to `caila` for the `classification` and `ner` parameters.
  - In the CAILA > Intents section, you can use Training phrases in languages that are not supported by JAICP; these phrases will still be recognized in the script.
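For example, to use CAILA intents together with entities from the external NLU service (the second option above), `nluProviderSettings` could look like this sketch. The URL is a placeholder, and the assumption here is that `url` is only required for external providers:

"nluProviderSettings": {
    "markup": {
        "nluType": "caila"
    },
    "ner": {
        "nluType": "external",
        "url": "http://example.com"
    },
    "classification": {
        "nluType": "caila"
    }
}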
You can look through an example of an external NLU service in the GitHub repository.