Language AI Playbook
  • 1. Introduction
    • 1.1 How to use the partner playbook
    • 1.2 Chapter overviews
    • 1.3 Acknowledgements
  • 2. Overview of Language Technology
    • 2.1 Definition and uses of language technology
    • 2.2 How language technology helps with communication
    • 2.3 Areas where language technology can be used
    • 2.4 Key terminology and concepts
  • 3. Partner Opportunities
    • 3.1 Enabling Organizations with Language Technology
    • 3.2 Bridging the Technical Gap
    • 3.3 Dealing with language technology providers
  • 4. Identifying Impactful Use Cases
    • 4.1 Setting criteria to help choose the use case
    • 4.2 Conducting A Needs Assessment
    • 4.3 Evaluating What Can Be Done and What Works
  • 5 Communication and working together
    • 5.1 Communicating with Communities
    • 5.2 Communicating and working well with partners
  • 6. Language Technology Implementation
    • 6.1 Navigating the Language Technology Landscape
    • 6.2 Creating a Language-Specific Peculiarities (LSP) Document
    • 6.3 Open source data and models
    • 6.4 Assessing data and model maturity
      • 6.4.1 Assessing NLP Data Maturity
      • 6.4.2 Assessing NLP Model Maturity
    • 6.5 Key Metrics for Evaluating Language Solutions
  • 7 Development and Deployment Guidelines
    • 7.1 Serving models through an API
    • 7.2 Machine translation
      • 7.2.1 Building your own MT models
      • 7.2.2 Deploying your own scalable Machine Translation API
      • 7.2.3 Evaluation and continuous improvement of machine translation
    • 7.3 Chatbots
      • 7.3.1 Overview of chatbot technologies and RASA framework
      • 7.3.2 Building data for a climate change resilience chatbot
      • 7.3.3 How to obtain multilinguality
      • 7.3.4 Components of a chatbot in deployment
      • 7.3.5 Deploying a RASA chatbot
      • 7.3.6 Channel integrations
        • 7.3.6.1 Facebook Messenger
        • 7.3.6.2 WhatsApp
        • 7.3.6.3 Telegram
      • 7.3.7 How to create effective NLU training data
      • 7.3.8 Evaluation and continuous improvement of chatbots
  • 8 Sources and further bibliography

7.3.8 Evaluation and continuous improvement of chatbots


Developing a successful chatbot is not a one-off task: it is an ongoing process of evaluation and continuous improvement. Understanding how effective your chatbot is in real-world conversations is essential for enhancing its performance and user satisfaction, and this iterative approach ensures that the bot evolves alongside user needs and changing conversation dynamics. By collecting user data and analyzing interactions, you can see how well the chatbot meets user expectations, identify where it falls short, and make informed decisions about what to improve next.

To illustrate this process, consider the diagram above; its steps are explained below:

  1. Curated data: This is the NLU training data used to obtain the first models. It consists of intents, responses, and stories that are manually prepared by linguists, and it can be modified once the cycle is running, based on the insights gained from evaluations.

  2. NLU training data: This is the dataset that grows with newly labeled user input at each cycle. Initially, it is identical to the curated data.

  3. Train: This step is where the model is trained with the current NLU training data. This should be done on a development server. Rasa tests are performed right after this step.

  4. Deploy: The newly trained model should be deployed to the production server only if it performs at least as well as the previous model.

  5. Collect user data: User input to the production bot is collected from the bot’s database. The information needed for the following steps is the triple <user input, bot’s classification, confidence score>. The user metrics database should also be updated in this step.

  6. Manual labeling: The user data collected in the previous step serves two purposes: (1) evaluating how the model performs in real life, and (2) capturing the cases where the model is least sure of its classifications so they can be improved. This first requires that the user input be manually classified by linguists, which adds a new field to the data collected in the previous step: <user input, bot’s classification, manual classification, confidence score>. Since it is not possible to label all incoming user data, two subsets, sized in proportion to the linguists’ capacity, should be selected:

    1. A random sample of user inputs, to be used for evaluation.

    2. A set of user inputs whose classifications received low confidence scores (i.e., where the model was least sure about its classification of the user’s input).

    3. Optionally, user data can also be collected for intents that are commonly confused with each other, as detected in the confusion matrix, to strengthen the modeling of those specific intents.

At the end of this step, the newly labeled sets should be fed back into the NLU training data, which will help improve the model in the next cycle. One possible way to select these subsets is sketched below.
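
The sketch assumes the collected user messages have already been exported from the bot’s database into a JSON-lines file with text, predicted_intent, and confidence fields; the file name, the field names, the confidence threshold, and the sample size are all illustrative assumptions rather than anything prescribed by Rasa.

```python
import json
import random

# Illustrative values: tune the threshold and sample size to linguist capacity.
CONFIDENCE_THRESHOLD = 0.6
EVAL_SAMPLE_SIZE = 100

def load_collected_messages(path):
    """Load user messages exported from the bot's database as JSON lines,
    one object per line with "text", "predicted_intent" and "confidence"."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def select_for_labeling(messages):
    """Return (random evaluation sample, low-confidence sample) for manual labeling."""
    eval_sample = random.sample(messages, min(EVAL_SAMPLE_SIZE, len(messages)))
    low_confidence = [m for m in messages if m["confidence"] < CONFIDENCE_THRESHOLD]
    return eval_sample, low_confidence

if __name__ == "__main__":
    # "collected_user_messages.jsonl" is a hypothetical export file name.
    messages = load_collected_messages("collected_user_messages.jsonl")
    eval_sample, low_confidence = select_for_labeling(messages)
    print(f"{len(eval_sample)} messages selected for evaluation labeling")
    print(f"{len(low_confidence)} low-confidence messages selected for improvement labeling")
```

Once labeled by linguists, the evaluation sample is used to measure real-world performance, while the low-confidence sample is added to the NLU training data for the next training cycle.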

Rasa Open Source offers a suite of powerful evaluation tools to facilitate this continuous improvement journey. These tools enable you to validate your data, test dialogues, assess the quality of your natural language understanding (NLU) model, and compare different pipeline configurations. By utilizing these evaluation functionalities, you can make data-driven decisions to fine-tune your chatbot’s performance and enhance its effectiveness over time.
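
As a minimal sketch of how these tools might be wired into the training step of the cycle, the script below runs Rasa’s data validation, test-story evaluation, and NLU cross-validation. Invoking the commands through subprocess is only a convenience; the same commands can be run directly in a shell, and the exact flags available depend on your Rasa version.

```python
import subprocess

# Each entry is a standard Rasa CLI command; adjust paths and flags to your project.
EVALUATION_COMMANDS = [
    ["rasa", "data", "validate"],                    # check domain, NLU data and stories for conflicts
    ["rasa", "test", "core", "--stories", "tests"],  # play out the end-to-end test stories
    ["rasa", "test", "nlu", "--cross-validation"],   # cross-validate the NLU model on the training data
]

def run_evaluations():
    for command in EVALUATION_COMMANDS:
        print("Running:", " ".join(command))
        # check=True stops the script on the first failing command.
        subprocess.run(command, check=True)

if __name__ == "__main__":
    run_evaluations()
```

In recent Rasa versions, the NLU evaluation writes its reports, including a confusion matrix and per-intent scores, to a results/ directory by default.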

End-to-end testing plays out pre-programmed conversations with the bot and reports whether they go as planned.

A confusion matrix shows which intents (what user questions are “about”) are frequently confused with each other and therefore need more training data.

Cross-validation testing is the best way to assess the maturity of your bot before rolling it out and collecting any user feedback. It measures performance, including recall, precision and F1 scores, directly on the curated training data you prepare.
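
To act on these reports, it can help to pull the per-intent scores out of the NLU evaluation output and flag weak intents for the next labeling round. The sketch below assumes the report was written to results/intent_report.json (the default location in recent Rasa versions) and that it follows the structure of a scikit-learn classification report, with precision, recall, and F1 per intent; treat the path and the threshold as assumptions to adapt.

```python
import json

# Assumed output location of `rasa test nlu`; may differ between Rasa versions.
REPORT_PATH = "results/intent_report.json"
F1_THRESHOLD = 0.8  # illustrative cut-off for "needs more training data"

def weak_intents(report_path=REPORT_PATH, threshold=F1_THRESHOLD):
    """Return (intent, F1) pairs whose F1 score falls below the threshold."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    weak = []
    for name, scores in report.items():
        # Skip aggregate entries such as "accuracy" or "weighted avg".
        if not isinstance(scores, dict) or "avg" in name or name == "accuracy":
            continue
        f1 = scores.get("f1-score", 0.0)
        if f1 < threshold:
            weak.append((name, f1))
    return sorted(weak, key=lambda item: item[1])

if __name__ == "__main__":
    for intent, f1 in weak_intents():
        print(f"{intent}: F1 = {f1:.2f} -- consider adding or relabeling examples")
```

The flagged intents are natural candidates for the optional third labeling subset described in the manual labeling step above.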

For a detailed guide on how to leverage Rasa’s evaluation capabilities, please refer to the official Rasa documentation on testing your assistant. For higher-level strategies for evaluating language solutions, please refer to 6.5 Key Metrics for Evaluating Language Solutions.
