7.3.7 How to create effective NLU training data
In this section, we provide practical insights and tips for creating robust Natural Language Understanding (NLU) training data that empowers chatbots to accurately interpret user intent. From understanding the pivotal role of intents to ensuring a diverse set of training examples, we delve into intent merging, entity extraction, and maintaining balanced training data. Join us as we uncover key strategies for enhancing the NLU capabilities of your chatbots.
The intent of a message is what a person wants to achieve
When we say something, we often try to achieve a specific goal. Here are some ways of greeting someone:
Hello there!
Good morning.
Hi!
These are some ways to book a table at a restaurant:
I'd like to book a table for two at noon, please.
Do you have space for two people for lunch?
I want a table for two at lunchtime.
The intent (or intention) of a message is the goal that a person is trying to achieve with this message. In the two series of examples above, the way the message is delivered varies, but the intent stays the same.
Humans understand the intent of a message intuitively
For us humans, it is easy (most of the time) to understand another human's intention based on what they say.
Would it be possible for you to set aside a flat surface in your establishment, for one plus one people, so that they can consume a bit of food when the sun is at its peak?
This is a convoluted way of expressing the same goal as in the previous section: to book a table at a restaurant. Yet it is still understandable; we do not need to have seen the exact same wording before in order to understand the intent of that sentence.
We are capable of doing that because we have outside knowledge like:
We know tables are flat
We know 1 + 1 = 2
We know that the sun is (usually) at its peak at around 12:00
This makes us capable of deducing other facts about this person's request, even if they are not (as) explicit: the number of people (2), the time of the reservation (12:00), etc.
We could even try to guess some things about the speaker based on the way they wrote: what kind of people they are, their age, etc.
Intents and chatbots
Unlike humans, a chatbot only has access to messages and does not understand their context. It does not have access to the same knowledge as a human: all of its knowledge comes from the textual information it receives.
Its entire world consists of messages, and they are the only things it can use to recognize their intents.
There are two stages that we need to distinguish when working with a chatbot:
The learning or training phase: the chatbot is given training data in order to become better at recognizing the intents of messages
The prediction phase: the chatbot has been trained, and we're now asking it to predict (recognize) the intents of unseen messages
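For a Rasa-based chatbot, for example, the difference between the two phases shows up in the data each one handles. The sketch below is illustrative only (the intent name book_table is hypothetical); it reuses the restaurant examples from above:

```yaml
# Training phase: the chatbot sees messages paired with their intent (nlu.yml sketch).
# The intent name "book_table" is hypothetical.
version: "3.1"
nlu:
  - intent: book_table
    examples: |
      - I'd like to book a table for two at noon, please.
      - Do you have space for two people for lunch?
      - I want a table for two at lunchtime.

# Prediction phase: the trained model only receives raw text, e.g.
#   "Do you have a free table for two around midday?"
# and must output the intent it thinks is most likely (here, book_table).
```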
Chatbots need to learn how to recognize the intent of a message
When a chatbot learns, it looks at many messages and the intent associated with each of those messages. These messages are called training examples.
Let's recap, using the same message as before:
Would it be possible for you to set aside a flat surface in your establishment, for one plus one people, so that they can consume a bit of food when the sun is at its peak?
A human has seen and heard many things, has been in many different situations, and has talked with many different people. They have access to all of this information when guessing the intent of the message above.
A chatbot has only ever seen messages and has no access to other kinds of information. It guesses the intent of the message above by finding out how similar the message is to all the previous messages it has seen before.
Unfortunately, understanding what a human means is not so intuitive for chatbots.
A chatbot's entire world consists of the messages it receives
It would be very hard to make chatbots process information in the same way humans do, since we do not even know exactly how humans do it.
For a chatbot, learning to understand users means correctly guessing a user's intent based on the message they sent.
A good intent has a lot of diverse training examples
The chatbot needs to learn that there are different ways of saying the same thing:
Bad:
Can covid be spread by animals?
Can covid be spread by mosquitoes?
Can covid be spread by flies?
Better:
Can covid be spread by animals?
Do mosquitoes transmit corona?
Are flies capable of giving me the coronavirus?
Annotating user messages is a good way to gather diverse training examples. Here are real user messages sent to Uji, CLEAR Global's COVID-19-related chatbot deployed in the DRC:
Je voudries connaitre le cas confirm de ventre 19 (roughly: "I would like to know the confirmed cases of [COVID-]19", with misspellings)
Savoir actuellement le nombre de cas dans chaque province... (roughly: "To know the current number of cases in each province...")
La situation épidémiologique de COVID-19 en ce jour ("The epidemiological situation of COVID-19 today")
These are all asking about disease_stats.
It's okay that some training examples look similar to one another, but they must not all be the same.
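In Rasa's YAML format, annotated real messages simply become additional examples under the intent. Below is a minimal sketch of what the disease_stats intent might contain; the real Uji training file may of course look different:

```yaml
# nlu.yml -- sketch: real user messages annotated with the disease_stats intent.
version: "3.1"
nlu:
  - intent: disease_stats
    examples: |
      - Je voudries connaitre le cas confirm de ventre 19
      - Savoir actuellement le nombre de cas dans chaque province...
      - La situation épidémiologique de COVID-19 en ce jour
```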
A lot of training examples
Chatbots need a lot of data to learn well.
Rasa recommends a minimum of 75 training examples for each intent. It really is a lot for our purposes, but we should still aim for it.
There are a few ways to increase the number of training examples in an intent:
Write more training examples by hand
Observe real user messages
Merge similar intents together
The easiest way is to annotate user messages. Annotating also helps with making the training data more diverse.
Training examples checklist
When looking at an intent's training data, check that:
There are at least 35 training examples. More is even better.
The training examples are diverse.
Some (as many as possible) training examples come from real users.
An intent's meaning should be distinct from that of other intents
Merge similar intents together
The chatbot gets confused when the meaning of two or more intents is too similar. For example:
covid_myth_heat_kills
Can corona survive in the heat?
Is there a temperature that kills COVID-19?
Can corona not be transmitted when it's hot?
Can COVID-19 survive in humid temperatures?
covid_myth_cold_kills
Does the cold kill the coronavirus?
Does snow kill corona?
What temperature is worst for COVID-19?
Does the coronavirus die in snow?
The meanings of the intents covid_myth_heat_kills and covid_myth_cold_kills are similar. This could cause issues for the chatbot, so we could try to merge the two intents. Merging can help in two ways:
Removing the potential for confusion between the two intents
Increasing the number of training examples in the new, merged intent
There are two ways of merging intents with overlapping meanings.
The intents to merge have similar answers
The answers to covid_myth_heat_kills and covid_myth_cold_kills are:
covid_myth_heat_kills
COVID-19 MAY be transmitted in areas with a hot and humid climate. Exposure to the sun or high temperatures DOES NOT PREVENT against contracting coronavirus disease.
covid_myth_cold_kills
Cold weather and snow CANNOT kill COVID-19.
They could be rewritten and merged into a single answer, for example:
covid_myth_weather_kills
You can catch COVID-19 regardless of an area's climate. Hot and humid weather and exposure to the sun do not prevent the transmission of the coronavirus. Cold weather and snow do not kill the coronavirus either.
In that case, we can simply put all the training examples from before into a single intent:
covid_myth_weather_kills
Can corona survive in the heat?
Is there a temperature that kills COVID-19?
Can corona not be transmitted when it's hot?
Can COVID-19 survive in humid temperatures?
Does the cold kill the coronavirus?
Does snow kill corona?
What temperature is worst for COVID-19?
Does the coronavirus die in snow?
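In practice, a merge with a single answer only requires relabeling the training examples and rewriting the response. A rough sketch of what this could look like in Rasa (the response and rule names are illustrative, not taken from a real deployment):

```yaml
# nlu.yml -- the relabeled training examples under the merged intent
version: "3.1"
nlu:
  - intent: covid_myth_weather_kills
    examples: |
      - Can corona survive in the heat?
      - Is there a temperature that kills COVID-19?
      - Does the cold kill the coronavirus?
      - Does snow kill corona?

# domain.yml -- the single rewritten answer (response name is illustrative)
responses:
  utter_covid_myth_weather_kills:
    - text: >-
        You can catch COVID-19 regardless of an area's climate.
        Hot and humid weather and exposure to the sun do not prevent
        the transmission of the coronavirus. Cold weather and snow
        do not kill the coronavirus either.

# rules.yml -- trigger the rewritten answer whenever the merged intent is detected
rules:
  - rule: Answer weather-related COVID-19 myths
    steps:
      - intent: covid_myth_weather_kills
      - action: utter_covid_myth_weather_kills
```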
The intents to merge need distinct answers
Use entity extraction to trigger the correct answer
Let's say that two intents have a similar meaning and should be merged, but we want to keep two separate answers: for example, because a single answer would be too long.
We can use a mechanism called entity extraction. An entity is a "thing of interest" in the user message: for example, it could be a date or time, a person's name, a location, etc.
It is possible to trigger a specific answer based on the intent detected, but also on the entities present in the user's message.
This is what happens when answering questions about a specific disease:
Can mosquitoes transmit corona?
Are mosquitoes a vector of Ebola?
The first message has the intent disease_myth_mosquitoes and the chatbot has found the disease entity with the value covid.
The second message also has the intent disease_myth_mosquitoes, but the chatbot found the disease entity with the value ebola instead.
The answers triggered are different: the first message gets an answer about COVID-19, and the second an answer about Ebola.
If the disease entity is missing, we can prompt the user for more information.
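In Rasa, one way to do this is with rules (or stories) that condition on the extracted entity. The sketch below assumes a disease entity and illustrative response names; if no disease entity is found, a separate rule or form could ask the user which disease they mean:

```yaml
# rules.yml -- sketch: same intent, different answers depending on the "disease" entity.
# The response names are illustrative.
rules:
  - rule: Mosquito myth asked about COVID-19
    steps:
      - intent: disease_myth_mosquitoes
        entities:
          - disease: covid
      - action: utter_mosquitoes_covid

  - rule: Mosquito myth asked about Ebola
    steps:
      - intent: disease_myth_mosquitoes
        entities:
          - disease: ebola
      - action: utter_mosquitoes_ebola
```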
Example scenario
We can use the same idea to merge disease_myth_mosquitoes and disease_myth_flies together.
disease_myth_flies
Can flies transmit corona?
Are flies a vector for Covid?
Can a fly give me ebola
Should I stay away from flies?
disease_myth_mosquitoes
Can mosquitoes transmit ebola?
Is it possible to get the virus from a mosquito bite?
Are mosquitoes capable of giving me COVID?
When a mosquito bites me, can I get EBOLA
If we merge those intents together, we get:
disease_myth_insects
Can flies transmit corona?
Are flies a vector for Covid?
Can a fly give me ebola
Should I stay away from flies?
Can mosquitoes transmit ebola?
Is it possible to get the virus from a mosquito bite?
Are mosquitoes capable of giving me COVID?
When a mosquito bites me, can I get EBOLA
The entity annotation component will transform those training examples before building the model. The annotation is done automatically; the training examples should not be annotated manually.
This is what the training examples with all the annotations look like:
disease_myth_insects
Can [flies]{"entity": "insect", "value": "fly"} transmit [corona]{"entity": "disease", "value": "covid"}?
Are [flies]{"entity": "insect", "value": "fly"} a vector for [Covid]{"entity": "disease", "value": "covid"}?
Can a [fly]{"entity": "insect", "value": "fly"} give me [ebola]{"entity": "disease", "value": "ebola"}
Should I stay away from [flies]{"entity": "insect", "value": "fly"}?
Can [mosquitoes]{"entity": "insect", "value": "mosquito"} transmit [ebola]{"entity": "disease", "value": "ebola"}?
Is it possible to get the virus from a [mosquito]{"entity": "insect", "value": "mosquito"} bite?
Are [mosquitoes]{"entity": "insect", "value": "mosquito"} capable of giving me [COVID]{"entity": "disease", "value": "covid"}?
When a [mosquito]{"entity": "insect", "value": "mosquito"} bites me, can I get [EBOLA]{"entity": "disease", "value": "ebola"}
The intents are now merged, and the entities are annotated. After adding conditions to the stories, each combination of intent and entities should trigger the desired answer.
If an entity is missing, we can prompt the user for more information.
Entity extraction checklist
To merge intents but retain distinct answers using entity extraction, we need to create patterns to find the entities. Once it has been decided that two or more intents should be merged together, we need to know:
Which entities will need to be created (for example, the entity insect or animal)
Which values can be assigned to those entities (for example, fly, mosquito, pet)
Which words are used to designate those values, in each language (for example, "corona", "covid", "c19", and "coronavirus" are all valid ways of designating covid)
Which answer should be triggered by a given combination of intent and entities
With this information, we can set up the entities and modify the stories accordingly.
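Once those lists exist, the different wordings can be mapped onto a single entity value with Rasa's entity synonyms (assuming the NLU pipeline includes the EntitySynonymMapper component). A sketch based on the examples in this checklist:

```yaml
# nlu.yml -- sketch: synonyms map different wordings onto one entity value.
version: "3.1"
nlu:
  - synonym: covid
    examples: |
      - corona
      - coronavirus
      - c19
  - synonym: fly
    examples: |
      - flies
  - synonym: mosquito
    examples: |
      - mosquitoes
```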
Intent merging checklist
Are there intents with similar meaning? It might be possible to merge them.
Are the answers to those intents similar? Merging can be done just by rewriting the answer and relabeling the training examples.
Should distinct answers be kept? Merging can still be achieved, and distinct answers can be triggered using entity extraction. Check the previous section for details.
Each intent should have around the same amount of training data
Unbalanced training data can cause class bias
Sometimes, some intents can have a lot more training examples than others:
disease_myth_fruits
Can bananas cure covid-19?
Are bananas a cure for coronavirus?
Are bananas a cure?
Can I eat bananas to treat Corona?
Can I eat bananas to protect me?
Can I eat fruit to protect myself?
Will eating lemons protect me?
Are oranges good against Corona?
Tea with orange helps against corona?
Lemon tea protects me from the virus?
disease_myth_spices
Can chili prevent covid?
Should I eat more chili to cure corona?
When that is the case, the chatbot can get confused and select the intent with many training examples too often. This is known as class bias. The effect is likely to be stronger when intents are close together in meaning.
Prioritize adding training examples for intents with little data
The easiest way to address class bias is to add more training examples to "small" intents.
If disease_stats has more than 100 examples and disease_myth_spices has just 10, then the priority is to increase the number of training examples in disease_myth_spices. Another option could of course be to merge the two into a single intent encompassing both topics.
Keeping the intent sizes balanced is important, but not as important as increasing the number of training examples or merging similar intents.