Advanced NLU settings
You can define additional NLU settings when you create or edit your project. Settings are passed as text in the form of a JSON object.
Specify the parameters according to the classifier algorithm selected for the project.
Common settings
Common settings include parameters that do not depend on the algorithm of the classifier in the project:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "dictionaryAutogeneration": true
}
Parameters
- `patternsEnabled` — if the parameter is active, patterns can be used in training phrases.
- `tokenizerEngine` — the tokenizer engine to use.
- `dictionaryAutogeneration` — if the parameter is active, the user dictionary is automatically filled with entity content.
tokenizerEngine
For different NLU languages, different tokenizer engines are supported.
NLU language | Tokenizers | Notes |
---|---|---|
Russian | udpipe, mystem, morphsrus | The mystem and morphsrus tokenizers are used for migrating projects to CAILA. |
Chinese | pinyin | |
Portuguese | udpipe | |
Kazakh | kaznlp | |

All languages missing from the table above support the spacy tokenizer.
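For example, a project in Kazakh could select the kaznlp tokenizer in the common settings. This is a sketch; the other values are illustrative:

{
    "patternsEnabled": true,
    "tokenizerEngine": "kaznlp",
    "dictionaryAutogeneration": true
}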
STS
STS classifier settings:
{
    "patternsEnabled": true,
    "namedEntitiesRequired": true,
    "tokenizerEngine": "udpipe",
    "allowedPatterns": [],
    "stsSettings": {
        "exactMatch": 1.0,
        "lemmaMatch": 0.95,
        "jaccardMatch": 0.5,
        "jaccardMatchThreshold": 0.82,
        "acronymMatch": 1.0,
        "synonymMatch": 0,
        "synonymContextWeight": 0.0,
        "patternMatch": 1,
        "throughPatternMatch": 0,
        "wordSequence1": 0.8,
        "wordSequence2": 0.9,
        "wordSequence3": 1.0,
        "intermediateAlternativesLimit": 5,
        "finalAlternativesLimit": 5,
        "idfShift": 0.0,
        "idfMultiplier": 1.0,
        "namedEntitiesRequired": true
    }
}
Parameters
- `allowedPatterns` — the array of entities that have the Automatically expand intents setting turned on.
- `exactMatch` — weight of an exact word match in phrases.
- `lemmaMatch` — weight of a lemma word match.
- `jaccardMatch` — weight of a Jaccard-based symbol-by-symbol word match.
- `jaccardMatchThreshold` — threshold for the Jaccard-based word match.
- `acronymMatch` — weight of an acronym-based word match.
- `synonymMatch` — weight for synonym matches.
- `synonymContextWeight` — weight applied during ranking to the `weight` value from the synonym reference book.
- `patternMatch` — weight of a pattern-based match.
- `throughPatternMatch` — weight of a match by entities found in both the example and the input text.
- `wordSequence1` — weight of lookalike sequences of length 1.
- `wordSequence2` — weight of lookalike sequences of length 2.
- `wordSequence3` — weight of lookalike sequences of length greater than 2.
- `intermediateAlternativesLimit` — cutoff threshold for intermediate alternatives processed by the algorithm.
- `finalAlternativesLimit` — maximum number of final results required for the algorithm to finish its work.
- `namedEntitiesRequired` — for a phrase to match the intent, it must contain a system entity.
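For illustration, to make the classifier rely only on exact and lemma matches, you could zero out the synonym weights. This sketch assumes that parameters omitted from stsSettings keep their default values:

{
    "stsSettings": {
        "exactMatch": 1.0,
        "lemmaMatch": 0.95,
        "synonymMatch": 0,
        "synonymContextWeight": 0.0
    }
}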
namedEntitiesRequired
Consider the `"namedEntitiesRequired": true` setting. Suppose a phrase with a system entity was added to the intent, for example:

I need @duckling.number of apples

Then the client request I need apples will not match the intent, because no system entity was found in it.
Override `namedEntitiesRequired` in the advanced NLU settings so that phrases without system entities can activate the intent.
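A minimal sketch of such an override, setting the parameter to false inside stsSettings:

{
    "stsSettings": {
        "namedEntitiesRequired": false
    }
}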
Deep Learning
Deep Learning classifier settings:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "cnnSettings": {
        "lang": "ru",
        "kernel_sizes": [1, 2],
        "n_filters": 1024,
        "emb_drp": 0.25,
        "cnn_drp": 0.25,
        "bs": 64,
        "n_epochs": 15,
        "lr": 0.001,
        "pooling_name": "max"
    }
}
Parameters
- `kernel_sizes` — list of convolution kernel sizes. A convolution kernel is the size of the context window the classifier takes into account. For instance, `"kernel_sizes": [3]` means that the model will use all triplets of adjacent words to find features in the text. Multiple convolution kernels can be defined for a single model.
- `n_filters` — number of filters. A filter is a specific pattern learned by the model. A model has a unique set of patterns for each kernel. For instance, if we specify `"kernel_sizes": [2, 3]` and `"n_filters": 512`, the total number of filters will be 1024 (512 per kernel); see the fragment after this list.
- `emb_drp` — probability of dropout on the embedding layer. Dropout is a mechanism that forcibly disables part of the network weights during training. It prevents the network from overfitting (i.e., it helps the network generalize instead of merely memorizing the entire dataset). `emb_drp` can take any value from 0 to 1.
- `cnn_drp` — probability of dropout on the convolution layers of the network.
- `bs` — size of the input batch for training. This value defines the number of training examples fed to the network per step. If the dataset has fewer than 3000 examples, a value from 16 to 32 is recommended. For larger datasets, this value can be from 32 to 128.
- `n_epochs` — number of training epochs (i.e., the number of times the model will see all the training data).
- `lr` — learning rate: the factor the model uses to update its weights in the course of training.
- `pooling_name` — aggregation strategy. The model has to aggregate the patterns found in the input string before the final classification layer. The following aggregation strategies are possible: `max`, `mean`, `concat`.
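For instance, the n_filters example above corresponds to the following cnnSettings fragment, which yields 1024 filters in total, 512 per kernel:

"cnnSettings": {
    "kernel_sizes": [2, 3],
    "n_filters": 512
}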
General recommendations
Recommended Deep Learning classifier settings depending on the dataset size:

- Over 100 thousand examples:

    "kernel_sizes": [2, 3, 4],
    "n_filters": 1024-2048,
    "emb_drp": 0.3-0.4,
    "cnn_drp": 0.3-0.4,
    "bs": 64-128,
    "n_epochs": 3,
    "lr": 0.001,
    "pooling_name": "max" or "concat"

- 30 to 100 thousand examples:

    "kernel_sizes": [2, 3, 4],
    "n_filters": 1024-2048,
    "emb_drp": 0.3-0.4,
    "cnn_drp": 0.3-0.4,
    "bs": 32-128,
    "n_epochs": 3,
    "lr": 0.001,
    "pooling_name": "max" or "concat"

- 10 to 30 thousand examples:

    "kernel_sizes": [2, 3, 4] or [2, 3],
    "n_filters": 1024,
    "emb_drp": 0.3-0.5,
    "cnn_drp": 0.3-0.5,
    "bs": 32-64,
    "n_epochs": 3-5,
    "lr": 0.001,
    "pooling_name": "max"

- 3 to 10 thousand examples:

    "kernel_sizes": [2, 3, 4] or [2, 3],
    "n_filters": 1024,
    "emb_drp": 0.4-0.5,
    "cnn_drp": 0.4-0.5,
    "bs": 32,
    "n_epochs": 4-7,
    "lr": 0.001,
    "pooling_name": "max"

- 1 to 3 thousand examples:

    "kernel_sizes": [2, 3],
    "n_filters": 512,
    "emb_drp": 0.5,
    "cnn_drp": 0.5,
    "bs": 16-32,
    "n_epochs": 7-15,
    "lr": 0.001,
    "pooling_name": "max"
Classic ML
Classic ML classifier settings:
{
    "patternsEnabled": true,
    "tokenizerEngine": "udpipe",
    "classicMLSettings": {
        "C": 1,
        "lang": "ru",
        "word_ngrams": [1, 2],
        "lemma_ngrams": [0],
        "stemma_ngrams": [1, 2],
        "char_ngrams": [3, 4],
        "lower": true
    }
}
Parameters
- `C` — regularization coefficient that can be used to control model overfitting: larger target function coefficients are penalized by the value of this parameter. Possible values: 0.01, 0.1, 1, 10.
- `word_ngrams` — number of words to be combined into word combinations. For instance, for `"word_ngrams": [2, 3]`, combinations of two and three words will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apples",
    "I like green",
    "like green apples"
]
Values greater than 3 are not recommended for this parameter.
- `lemma_ngrams` — number of words to be normalized (lemmatized) and combined into word combinations. For instance, for `"lemma_ngrams": [2]`, combinations of two words will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apple"
]
Values greater than 3 are not recommended for this parameter.
- `stemma_ngrams` — number of stemmas to be combined into word combinations. A stemma is the stem of a source word, which is not necessarily equal to the morphological root of that word. For instance, for `"stemma_ngrams": [2]`, combinations of two stemmas will be used.
For instance, the following word combinations will be generated for the I like green apples phrase:

[
    "I like",
    "like green",
    "green apple"
]
Using both the `lemma_ngrams` and `stemma_ngrams` parameters is not recommended due to possible model overfitting. Setting `stemma_ngrams` to a value greater than 3 is also not recommended.
- `char_ngrams` — number of characters to be combined and treated as a single unit of a phrase.

For instance, for `"char_ngrams": [3]`, the green apples phrase is converted to the following set:

[
    "gre",
    "ree",
    "een",
    ...
]
- `lower` — if set to `true`, all the phrases are converted to lowercase.
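Conversely, to use lemma n-grams instead of stemma n-grams, disable the latter the same way. This is a sketch, assuming a zero value disables the corresponding feature, as the `"lemma_ngrams": [0]` entry in the sample configuration above suggests:

{
    "classicMLSettings": {
        "C": 1,
        "word_ngrams": [1, 2],
        "lemma_ngrams": [1, 2],
        "stemma_ngrams": [0],
        "char_ngrams": [3, 4],
        "lower": true
    }
}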
External NLU service
The JAICP platform supports connecting an external NLU service that complies with the Model API specification. You can also create and configure intents, entities, and patterns.
Model API allows you to use third-party tokenizers, external named entity and intent recognition NLU services in JAICP projects.
To use an external NLU service in a project, add `externalNluSettings` to the Advanced NLU settings field:
...
"externalNluSettings": {
    "nluProviderSettings": {
        "markup": {
            "nluType": "external",
            "url": "http://example.com"
        },
        "ner": {
            "nluType": "external",
            "url": "http://example.com"
        },
        "classification": {
            "nluType": "external",
            "url": "http://example.com"
        }
    },
    "language": "ja",
    "nluActionAdditionalProperties": {
        "markup": null,
        "ner": null,
        "classification": {
            "modelId": "123",
            "classifierName": "example",
            "properties": null
        }
    }
}
...
Parameters
Parameter | Description |
---|---|
`classifierName` | Classifier name. |
`classification` | Map of parameters for classification requests. |
`language` | External NLU language. If not set, the language from the project settings is used. |
`markup` | Map of parameters for markup requests. |
`modelId` | Classifier model ID. |
`ner` | Map of parameters for named entity recognition requests. |
`nluActionAdditionalProperties` | Additional NLU service properties. |
`nluProviderSettings` | Object that determines where each NLU action is performed. |
`nluType` | NLU type: either `external` or `caila`. |
How to use
Please note that you cannot use both CAILA and external NLU service intents or entities at the same time in a project.

In a JAICP project you can:

- Use entities and intents from the external NLU service.
- Use CAILA intents and entities from the external NLU service:
  - Set `nluType` to `external` for the `ner` parameter and to `caila` for the `markup` and `classification` parameters (see the sketch after this list).
  - Entities from the external NLU service cannot be used when setting up intents and slot filling.
  - Entities are available in the script using the `q` tag.
- Use CAILA entities and intents from the external NLU service:
  - Set `nluType` to `external` for the `classification` parameter and to `caila` for the `markup` and `ner` parameters.
  - Intents are available in the script using the `q` tag.
- Use external NLU service markup with CAILA intents and entities:
  - Set `nluType` to `external` for the `markup` parameter and to `caila` for the `classification` and `ner` parameters.
  - In the CAILA > Intents section, you can use Training phrases in languages that are not supported by JAICP; these phrases will still be recognized in the script.
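For example, to use CAILA intents together with entities from the external NLU service (the second option above), `nluProviderSettings` could look like this sketch. The URL is a placeholder, and the assumption here is that `url` is only required for external providers:

"nluProviderSettings": {
    "markup": {
        "nluType": "caila"
    },
    "ner": {
        "nluType": "external",
        "url": "http://example.com"
    },
    "classification": {
        "nluType": "caila"
    }
}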
You can look through an example of an external NLU service in the GitHub repository.