# 7.3.7 How to create effective NLU training data

In this section, we provide practical insights and tips for creating robust Natural Language Understanding (NLU) training data that empowers chatbots to accurately interpret user intent. From understanding the pivotal role of intents to ensuring a diverse set of training examples, we delve into intent merging, entity extraction, and maintaining balanced training data. Join us as we uncover key strategies for enhancing the NLU capabilities of your chatbots.

**The *intent* of a message is what a person wants to achieve**

When we say something, we often try to achieve a specific goal. These may be some ways of *greeting someone*:

* *Hello there!*
* *Good morning.*
* *Hi!*

These are some ways to *book a table at a restaurant*:

* *I'd like to book a table for two at noon, please.*
* *Do you have space for two people for lunch?*
* *I want a table for two at lunchtime.*

The *intent* (or intention) of a message is the goal that a person is trying to achieve with this message. In the two series of examples above, the way the message is delivered varies, but the *intent* stays the same.

**Humans understand the intent of a message intuitively**

For us humans, it is easy (most of the time) to understand another human's intention based on what they say.

*Would it be possible for you to set aside a flat surface in your establishment, for one plus one people, so that they can consume a bit of food when the sun is at its peak?*

This is a convoluted way of expressing the same goal as in the previous section: to *book a table at a restaurant*. Yet it is still understandable: we do not need to have seen the exact same wording before in order to understand the intent of that sentence.

We are capable of doing that because we have outside knowledge like:

* We know tables are flat
* We know 1 + 1 = 2
* We know that the sun is (usually) at its peak at around 12:00

This makes us capable of deducing other facts about this person's request, even if they are not (as) explicit: the number of people (2), the time of the reservation (12:00), etc.

We could even *try* to guess some things about the speaker based on the way they wrote: what kind of person they are, their age, etc.

**Intents and chatbots**

Unlike us humans, chatbots only have access to messages and do not understand their context. A chatbot does not have access to the same knowledge as a human: all of its knowledge comes from the textual information it receives.

Its entire world consists of messages, and they are the only things it can use to recognize their intents.

There are two stages that we need to distinguish when working with a chatbot:

* The *learning* or *training* phase: the chatbot is given *training data* in order to become better at recognizing the intents of messages
* The *prediction* phase: the chatbot has been trained, and we're now asking it to *predict* (recognize) the intents of *unseen* messages

**Chatbots need to *learn* how to recognize the intent of a message**

When a chatbot *learns*, it looks at many messages and the intent associated with each of those messages. These messages are called *training examples*.

Let's recap, using the same message as before:

*Would it be possible for you to set aside a flat surface in your establishment, for one plus one people, so that they can consume a bit of food when the sun is at its peak?*

* A human has seen and heard many things, has been in many different situations, and has talked with many different people. They can draw on all of this information to guess the intent of the message above.
* A chatbot has *only ever seen messages* and has no access to other kinds of information. It guesses the intent of the message above by finding out *how similar* the message is to all the *messages it has seen before*.
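The idea of guessing an intent from similarity to previously seen messages can be illustrated with a deliberately simple sketch. It assumes word overlap (Jaccard similarity) as the similarity measure and hypothetical intent names `greet` and `book_table`; real NLU pipelines learn far richer representations, so this is an illustration of the principle rather than how any particular chatbot works.

```python
import re

def tokenize(message: str) -> set[str]:
    """Lowercase the message and split it into a set of words."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two messages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Training phase: the chatbot is given messages together with their intent.
# The intent names are hypothetical.
TRAINING_DATA = {
    "greet": ["Hello there!", "Good morning.", "Hi!"],
    "book_table": [
        "I'd like to book a table for two at noon, please.",
        "Do you have space for two people for lunch?",
        "I want a table for two at lunchtime.",
    ],
}

def predict_intent(message: str) -> tuple[str, float]:
    """Prediction phase: return the intent of the most similar training example."""
    words = tokenize(message)
    best_intent, best_score = "unknown", 0.0
    for intent, examples in TRAINING_DATA.items():
        for example in examples:
            score = jaccard(words, tokenize(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score
```

With this toy data, `predict_intent("Can I book a table for lunch?")` returns `book_table`. The convoluted message above is also matched to `book_table`, but with a much lower similarity score, because it shares only a few common words with the training examples.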

Unfortunately, understanding what a human means is not so intuitive for chatbots. The messages a chatbot receives are the only things it can use to deduce an intent.

It would be very hard to make chatbots process information in the same way humans do since we don't even know how exactly humans do it.

For a chatbot, *learning* to understand users means to correctly guess a user's intent based on a message they sent.

**A good intent has *a lot* of *diverse* training examples**

The chatbot needs to learn there are different ways of saying the same thing:

**Bad:**

* *Can covid be spread by animals?*
* *Can covid be spread by mosquitoes*
* *Can covid be spread by flies?*

**Better:**

* *Can covid be spread by animals?*
* *Do mosquitoes transmit corona?*
* *Are flies capable of giving me the coronavirus?*

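Annotating user messages is a good way to gather diverse training examples. Here are real user messages from Uji, CLEAR Global’s COVID-19 chatbot deployed in the DRC (shown as typed, typos included, with approximate English translations):

* *Je voudries connaitre le cas confirm de ventre 19* ("I would like to know the confirmed cases of 'ventre 19'")
* *Savoir actuellement le nombre de cas dans chaque province...* ("To know the current number of cases in each province...")
* *La situation épidémiologique de COVID-19 en ce jour* ("The epidemiological situation of COVID-19 today")

These are all asking about `disease_stats`.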

It's okay that **some** training examples look similar to one another, but they must not **all** be the same.
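One rough way to put a number on "diverse" is to measure how many words the examples of an intent share with each other. The sketch below assumes average pairwise word overlap as the measure; the function name `average_overlap` is hypothetical, and the score is only a hint, not a substitute for reading the examples.

```python
import re
from itertools import combinations

def tokenize(message: str) -> set[str]:
    """Lowercase the message and split it into a set of words."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))

def average_overlap(examples: list[str]) -> float:
    """Mean word overlap over all pairs of examples (1.0 = identical wording)."""
    pairs = list(combinations(examples, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for first, second in pairs:
        a, b = tokenize(first), tokenize(second)
        union = a | b
        if union:
            total += len(a & b) / len(union)
    return total / len(pairs)

bad = [
    "Can covid be spread by animals?",
    "Can covid be spread by mosquitoes",
    "Can covid be spread by flies?",
]
better = [
    "Can covid be spread by animals?",
    "Do mosquitoes transmit corona?",
    "Are flies capable of giving me the coronavirus?",
]
```

Here `average_overlap(bad)` is high because every pair differs by a single word, while `average_overlap(better)` is zero because the three examples share no words at all. A set of examples that scores very high is a candidate for rewording.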

**A lot of training examples**

Chatbots need a lot of data to learn well.

Rasa recommends a **minimum** of 75 training examples for each intent. It really is *a lot* for our purposes, but we should still aim for it.

There are a few ways to increase the number of training examples in an intent:

* Write more training examples by hand
* Observe real user messages
* Merge similar intents together

The easiest way is to annotate user messages. Annotating also helps with making the training data more diverse.

{% hint style="info" %}
**Training examples checklist**

When looking at an intent's training data, check that:

* There are **at least** **35** training examples. More is even better.
* The training examples are **diverse**.
* Some (as many as possible) training examples come from **real users**.
  {% endhint %}
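Parts of this checklist can be checked mechanically. The following is a minimal sketch, assuming each training example is stored with a flag saying whether it came from a real user; the `TrainingExample` class and `checklist_report` function are hypothetical, and the duplicate check is only a crude stand-in for judging diversity.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str
    from_real_user: bool  # assumption: we record where each example came from

def checklist_report(
    examples: list[TrainingExample], min_examples: int = 35
) -> dict[str, bool]:
    """Check an intent's examples against the checklist above."""
    distinct_texts = {example.text.strip().lower() for example in examples}
    return {
        "enough_examples": len(examples) >= min_examples,
        # Exact duplicates add no diversity; near-duplicates still need a human eye.
        "no_exact_duplicates": len(distinct_texts) == len(examples),
        "has_real_user_examples": any(e.from_real_user for e in examples),
    }
```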

**An intent's meaning should be distinct from other intents**

**Merge similar intents together**

The chatbot gets confused when the meaning of two or more intents is too similar. For example:

`covid_myth_heat_kills`

* *Can corona survive in the heat?*
* *Is there a temperature that kills COVID-19?*
* *Can corona not be transmitted when it's hot?*
* *Can COVID-19 survive in humid temperatures?*

`covid_myth_cold_kills`

* *Does the cold kill the coronavirus?*
* *Does snow kill corona?*
* *What temperature is worst for COVID-19?*
* *Does the coronavirus die in snow?*

The meanings of the intents `covid_myth_heat_kills` and `covid_myth_cold_kills` are similar. This could confuse the chatbot, so we could try to merge the two intents. Merging can help in two ways:

* Removing the potential for confusion between the two intents
* Increasing the number of training examples in the new, merged intent

There are two ways of merging intents with overlapping meanings.

**The intents to merge have similar answers**

The answers to `covid_myth_heat_kills` and `covid_myth_cold_kills` are:

`covid_myth_heat_kills`

*COVID-19 MAY be transmitted in areas with a hot and humid climate. Exposure to the sun or high temperatures DOES NOT PREVENT against contracting coronavirus disease.*

`covid_myth_cold_kills`

*Cold weather and snow CANNOT kill COVID-19.*

They could be rewritten and merged into a single answer, for example:

`covid_myth_weather_kills`

*You can catch COVID-19 regardless of an area's climate. Hot and humid weather and exposure to the sun do not prevent the transmission of the coronavirus. Cold weather and snow do not kill the coronavirus either.*

In that case, we can simply put all the training examples from before into a single intent:

`covid_myth_weather_kills`

* *Can corona survive in the heat?*
* *Is there a temperature that kills COVID-19?*
* *Can corona not be transmitted when it's hot?*
* *Can COVID-19 survive in humid temperatures?*
* *Does the cold kill the coronavirus?*
* *Does snow kill corona?*
* *What temperature is worst for COVID-19?*
* *Does the coronavirus die in snow?*
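If the training data is held as a mapping from intent name to examples, this kind of merge amounts to relabeling. The sketch below assumes that in-memory representation (a plain dictionary) and a hypothetical helper `merge_intents`; the actual file format of the training data is not covered here.

```python
def merge_intents(
    training_data: dict[str, list[str]], sources: list[str], merged_name: str
) -> dict[str, list[str]]:
    """Return a copy of the data where the source intents become one merged intent."""
    merged_examples: list[str] = []
    for intent in sources:
        for example in training_data[intent]:
            if example not in merged_examples:  # skip exact duplicates
                merged_examples.append(example)
    result = {
        name: list(examples)
        for name, examples in training_data.items()
        if name not in sources
    }
    result[merged_name] = merged_examples
    return result

training_data = {
    "covid_myth_heat_kills": [
        "Can corona survive in the heat?",
        "Is there a temperature that kills COVID-19?",
    ],
    "covid_myth_cold_kills": [
        "Does the cold kill the coronavirus?",
        "Does snow kill corona?",
    ],
}
merged = merge_intents(
    training_data,
    ["covid_myth_heat_kills", "covid_myth_cold_kills"],
    "covid_myth_weather_kills",
)
```

After the call, `merged` contains only `covid_myth_weather_kills`, holding the examples of both original intents. The answer text still has to be rewritten by hand.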

**The intents to merge need distinct answers**

**Use entity extraction to trigger the correct answer**

Let's say that two intents have a similar meaning and should be merged. However, we want to keep two separate answers, for example because a single answer would be too long.

We can use a mechanism called *entity extraction*. An entity is a "thing of interest" in the user message: for example, it could be a date or time, a person's name, a location, etc.

It is possible to trigger a specific answer based on the intent detected, but also on the entities present in the user's message.

This is what happens when answering questions about a specific disease:

* *Can mosquitoes transmit corona?*
* *Are mosquitoes a vector of Ebola?*

The first message has the intent `disease_myth_mosquitoes`, and the chatbot has found the `disease` entity with the value `covid`.

The second message also has the intent `disease_myth_mosquitoes`, but the chatbot found the `disease` entity with the value `ebola` instead.

The answers triggered are different:

| Intent predicted          | `disease` entity found | Answer triggered               |
| ------------------------- | ---------------------- | ------------------------------ |
| `disease_myth_mosquitoes` | covid                  | `answer_covid_myth_mosquitoes` |
| `disease_myth_mosquitoes` | ebola                  | `answer_ebola_myth_mosquitoes` |

If the disease entity is missing, we can prompt the user for more information.
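The selection logic in the table can be sketched as a simple lookup. In the chatbot itself this behaviour is configured through the stories rather than written as code, so the function below only illustrates the decision; the names `select_answer`, `ask_which_disease`, and `answer_fallback` are hypothetical.

```python
# (intent, disease entity value) -> answer, as in the table above
ANSWERS = {
    ("disease_myth_mosquitoes", "covid"): "answer_covid_myth_mosquitoes",
    ("disease_myth_mosquitoes", "ebola"): "answer_ebola_myth_mosquitoes",
}

def select_answer(intent: str, entities: dict[str, str]) -> str:
    """Pick the answer from the predicted intent and the entities found."""
    disease = entities.get("disease")
    if disease is None:
        # The disease entity is missing: prompt the user for more information.
        return "ask_which_disease"  # hypothetical prompt name
    # Unknown combinations fall back to a generic answer (hypothetical name).
    return ANSWERS.get((intent, disease), "answer_fallback")
```

For example, `select_answer("disease_myth_mosquitoes", {"disease": "ebola"})` picks `answer_ebola_myth_mosquitoes`, while an empty entity dictionary leads to the prompt.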

**Example scenario**

We can use the same idea to merge `disease_myth_mosquitoes` and `disease_myth_flies` together.

`disease_myth_flies`

* *Can flies transmit corona?*
* *Are flies a vector for Covid?*
* *Can a fly give me ebola*
* *Should I stay away from flies?*

`disease_myth_mosquitoes`

* *Can mosquitoes transmit ebola?*
* *Is it possible to get the virus from a mosquito bite?*
* *Are mosquitoes capable of giving me COVID?*
* *When a mosquito bites me, can I get EBOLA*

If we merge those intents together, we get:

`disease_myth_insects`

* *Can flies transmit corona?*
* *Are flies a vector for Covid?*
* *Can a fly give me ebola*
* *Should I stay away from flies?*
* *Can mosquitoes transmit ebola?*
* *Is it possible to get the virus from a mosquito bite?*
* *Are mosquitoes capable of giving me COVID?*
* *When a mosquito bites me, can I get EBOLA*

The entity annotation component will transform those training examples before building the model. **The annotation is done automatically; the training examples should not be annotated manually.**

This is what the training examples with all the annotations look like:

`disease_myth_insects`

* *Can `[flies]{"entity": "insect", "value": "fly"}` transmit `[corona]{"entity": "disease", "value": "covid"}`?*
* *Are `[flies]{"entity": "insect", "value": "fly"}` a vector for `[Covid]{"entity": "disease", "value": "covid"}`?*
* *Can a `[fly]{"entity": "insect", "value": "fly"}` give me `[ebola]{"entity": "disease", "value": "ebola"}`*
* *Should I stay away from `[flies]{"entity": "insect", "value": "fly"}`?*
* *Can `[mosquitoes]{"entity": "insect", "value": "mosquito"}` transmit `[ebola]{"entity": "disease", "value": "ebola"}`?*
* *Is it possible to get the virus from a `[mosquito]{"entity": "insect", "value": "mosquito"}` bite?*
* *Are `[mosquitoes]{"entity": "insect", "value": "mosquito"}` capable of giving me `[COVID]{"entity": "disease", "value": "covid"}`?*
* *When a `[mosquito]{"entity": "insect", "value": "mosquito"}` bites me, can I get `[EBOLA]{"entity": "disease", "value": "ebola"}`*
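To give an idea of how such annotations could be produced automatically, here is a minimal sketch based on word patterns. It is not the actual annotation component: the `ENTITY_SYNONYMS` table and the `annotate` function are hypothetical, and the word lists only cover the examples on this page.

```python
import json
import re

# Hypothetical patterns: entity -> value -> words that designate that value
ENTITY_SYNONYMS = {
    "insect": {
        "fly": ["fly", "flies"],
        "mosquito": ["mosquito", "mosquitoes"],
    },
    "disease": {
        "covid": ["corona", "covid", "c19", "coronavirus"],
        "ebola": ["ebola"],
    },
}

def annotate(example: str) -> str:
    """Wrap every known word in [word]{"entity": ..., "value": ...} notation."""
    lookup = {}
    for entity, values in ENTITY_SYNONYMS.items():
        for value, words in values.items():
            for word in words:
                lookup[word] = (entity, value)
    # Try longer words first so "coronavirus" is not cut short at "corona".
    alternatives = "|".join(
        re.escape(word) for word in sorted(lookup, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)

    def replace(match: re.Match) -> str:
        entity, value = lookup[match.group(1).lower()]
        label = json.dumps({"entity": entity, "value": value})
        return f"[{match.group(1)}]{label}"

    return pattern.sub(replace, example)
```

Calling `annotate("Can flies transmit corona?")` produces the first annotated example in the list above. Matching ignores case, so `EBOLA` and `Covid` are annotated with the lowercase values `ebola` and `covid`.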

The intents are now merged, and the entities are annotated. After adding conditions to the stories, the combination of intent and entities should trigger the desired answer.

<table><thead><tr><th width="240">Intent predicted</th><th width="128">disease entity found</th><th width="167">insect entity found</th><th>Answer triggered</th></tr></thead><tbody><tr><td><code>disease_myth_insects</code></td><td>covid</td><td>mosquito</td><td><code>answer_covid_myth_mosquitoes</code></td></tr><tr><td><code>disease_myth_insects</code></td><td>covid</td><td>fly</td><td><code>answer_covid_myth_flies</code></td></tr><tr><td><code>disease_myth_insects</code></td><td>ebola</td><td>mosquito</td><td><code>answer_ebola_myth_mosquitoes</code></td></tr><tr><td><code>disease_myth_insects</code></td><td>ebola</td><td>fly</td><td><code>answer_ebola_myth_flies</code></td></tr></tbody></table>

If an entity is missing, we can prompt the user for more information.

{% hint style="success" %}
**Entity extraction checklist**

To merge intents but retain distinct answers using entity extraction, patterns need to be created to find the entities. Once it has been decided that two or more intents should be merged, I need to know:

* Which entities need to be created (for example, the entity `insect` or `animal`)
* Which values can be assigned to those entities (for example, `fly`, `mosquito`, `pet`)
* Which words are used to designate those values, **in each language** (for example, "corona", "covid", "c19", and "coronavirus" are all valid ways of designating `covid`)
* Which answer should be triggered by a given combination of intent and entities

With this information, I will set up the entities and modify the stories accordingly.
{% endhint %}

{% hint style="success" %}
**Intent merging checklist**

* Are there intents with similar meaning? It might be possible to merge them.
* Are the answers to those intents similar? Merging can be done just by rewriting the answer and relabeling the training examples.
* Should distinct answers be kept? Merging can still be achieved, and distinct answers can be triggered using entity extraction. Check the previous section for details.
  {% endhint %}

### **Each intent should have around the same amount of training data**

**Unbalanced training data can cause class bias**

Sometimes, some intents can have a lot more training examples than others:

`disease_myth_fruits`

* *Can bananas cure covid-19?*
* *Are bananas a cure for coronavirus?*
* *Are bananas a cure?*
* *Can I eat bananas to treat Corona?*
* *Can I eat bananas to protect me?*
* *Can I eat fruit to protect myself?*
* *Will eating lemons protect me?*
* *Are oranges good against Corona?*
* *Tea with orange helps against corona?*
* *Lemon tea protects me from the virus?*

`disease_myth_spices`

* *Can chili prevent covid?*
* *Should I eat more chili to cure corona?*

When that is the case, the chatbot can get confused and too often select the intent with many training examples. This is known as *class bias*. This effect is likely to be stronger when intents are close together in meaning.

**Prioritize adding training examples for intents with little data**

The easiest way to address class bias is to add more training examples to "small" intents.

If `disease_stats` has more than 100 examples, and `disease_myth_spices` has just 10, then the priority is to increase the number of training examples in `disease_myth_spices`. Another option, of course, could be to merge the two into a single intent that covers both topics.
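Finding the intents that need attention first can be done by comparing example counts. The sketch below assumes the counts are available as a dictionary; the function name `intents_to_prioritize` and the default threshold of 5 are arbitrary assumptions, not recommended values.

```python
def intents_to_prioritize(
    counts: dict[str, int], max_ratio: float = 5.0
) -> list[str]:
    """Return intents that are more than max_ratio times smaller than the
    largest intent, smallest first."""
    if not counts:
        return []
    largest = max(counts.values())
    small = [name for name, count in counts.items() if count * max_ratio < largest]
    return sorted(small, key=lambda name: counts[name])

# Illustrative counts only
counts = {
    "disease_stats": 120,
    "disease_myth_fruits": 40,
    "disease_myth_spices": 10,
}
```

With these illustrative counts, only `disease_myth_spices` is flagged: it is twelve times smaller than `disease_stats`, while `disease_myth_fruits` is only three times smaller.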

Keeping intent sizes balanced is important, but not as important as increasing the number of training examples or merging similar intents.

