7.3.8 Evaluation and continuous improvement of chatbots

Developing a successful chatbot involves more than just building it – it is an ongoing process of careful evaluation and continuous improvement. Understanding how effective your chatbot is in real-world scenarios is essential for improving its performance and user satisfaction, and this iterative approach ensures that the bot evolves alongside user needs and changing conversation dynamics. By collecting user data and analyzing interactions, you can see how well the chatbot meets user expectations, identify areas for refinement, and make informed decisions about enhancing its capabilities.

To illustrate this process, consider the diagram above; each step is explained below:

  1. Curated data: This is the NLU training data used to train the first models. It consists of intents, responses, and stories that are manually prepared by linguists. It can be modified once the cycle is running, based on the insights gained from evaluations.

  2. NLU training data: This is the NLU training data that grows with each cycle as new labeled user input is added. Initially, it is equal to the curated data.

  3. Train: In this step, the model is trained on the current NLU training data. This should be done on a development server. Rasa tests are performed right after this step.

  4. Deploy: The newly trained model should be deployed to the production server if it performs at least as well as the previous model.

  5. Collect user data: User input to the production bot is collected from the bot’s database. The information needed for the following steps is the tuple <user input, bot’s classification, confidence score>. The user metrics database should also be updated in this step.

  6. Manual labeling: The user data collected in the previous step is useful for two purposes: (1) evaluating how the model performs in real life, and (2) capturing occasions where the model is least sure of its classifications so they can be improved. However, the user input must first be manually classified by linguists, which adds a new field to the data collected in the previous step: <user input, bot’s classification, manual classification, confidence score>. Since it is not possible to label all incoming user data, two subsets, sized according to linguist capacity, should be selected (a sketch of this selection follows below):

    1. A random sample of user inputs, to be used for evaluation.

    2. A set of inputs whose classifications received low confidence scores (i.e., where the model was least sure of its prediction).

    3. Optionally, user data can also be collected for intents that are commonly confused with one another, which can be detected in the confusion matrix, in order to strengthen the modeling of those specific intents.

At the end of this step, both sets should be fed back into the NLU training data. This will help improve the model for the next cycle.
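
The selection of these two subsets can be scripted. Below is a minimal sketch in Python, assuming the records collected in step 5 are available as a list of dictionaries with text, predicted_intent, and confidence fields (the field names are illustrative, not a fixed Rasa schema):

```python
import random

def select_for_labeling(records, eval_size=100, low_conf_size=100, threshold=0.6):
    """Split collected user inputs into two sets for manual labeling.

    records: list of dicts such as
        {"text": "...", "predicted_intent": "...", "confidence": 0.87}
    eval_size / low_conf_size: sizes matched to linguist capacity.
    threshold: confidence below which a prediction counts as "unsure".
    """
    # 1) Random sample, used to evaluate real-life performance.
    eval_set = random.sample(records, min(eval_size, len(records)))

    # 2) Lowest-confidence predictions, used to improve weak spots.
    #    Skip items already chosen for evaluation to avoid double labeling.
    eval_texts = {r["text"] for r in eval_set}
    low_conf = [r for r in records
                if r["confidence"] < threshold and r["text"] not in eval_texts]
    low_conf.sort(key=lambda r: r["confidence"])
    low_conf_set = low_conf[:low_conf_size]

    return eval_set, low_conf_set
```

Sorting by confidence, rather than sampling, puts the inputs the model is most unsure about in front of the linguists first, which is where labeling effort pays off most.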

Rasa Open Source offers a suite of powerful evaluation tools to facilitate this continuous improvement journey. These tools enable you to validate your data, test dialogues, assess the quality of your natural language understanding (NLU) model, and compare different pipeline configurations. By utilizing these evaluation functionalities, you can make data-driven decisions to fine-tune your chatbot’s performance and enhance its effectiveness over time.
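
These evaluations are driven from the Rasa command line (for example rasa data validate, rasa test, and rasa test nlu --cross-validation). The sketch below simply shells out to those commands so they can run as a single script, e.g. in a CI job; exact flags vary between Rasa versions, and the configuration and data file names in the last command are placeholders, so treat this as an illustration rather than a definitive recipe.

```python
import subprocess

def run(cmd):
    """Run a Rasa CLI command and stop the script if it fails."""
    print(">>>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Check the training data, domain, and stories for mistakes and inconsistencies.
run(["rasa", "data", "validate"])

# Evaluate the NLU model with cross-validation on the training data
# (reports precision, recall, and F1 per intent).
run(["rasa", "test", "nlu", "--cross-validation"])

# Play out the test conversations (end-to-end / story tests).
run(["rasa", "test"])

# Compare two pipeline configurations on the same NLU data
# (config_1.yml, config_2.yml, and data/nlu.yml are placeholder names).
run(["rasa", "test", "nlu", "--nlu", "data/nlu.yml",
     "--config", "config_1.yml", "config_2.yml"])
```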

Cross-validation testing is the best way to assess the maturity of your bot before rolling it out and collecting any user feedback. It measures precision, recall, and F1 scores directly from the curated training data you prepare.
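
These scores can also feed the deploy decision in step 4: promote the newly trained model only if it does at least as well as the model currently in production. The sketch below assumes both evaluations produced an intent report shaped like scikit-learn’s classification_report output_dict, which is the format rasa test nlu has historically written to results/intent_report.json; the exact file paths used here are assumptions.

```python
import json

def weighted_f1(report_path):
    """Read the weighted-average F1 score from an intent report file.

    Assumes the report follows scikit-learn's classification_report
    output_dict format, as written by `rasa test nlu`.
    """
    with open(report_path) as f:
        report = json.load(f)
    return report["weighted avg"]["f1-score"]

def should_deploy(new_report, old_report, tolerance=0.0):
    """Deploy only if the new model performs at least as well (step 4)."""
    new_f1 = weighted_f1(new_report)
    old_f1 = weighted_f1(old_report)
    print(f"new F1 = {new_f1:.3f}, previous F1 = {old_f1:.3f}")
    return new_f1 + tolerance >= old_f1

if __name__ == "__main__":
    # Paths are placeholders for the new and previously deployed model reports.
    if should_deploy("results/intent_report.json", "previous/intent_report.json"):
        print("OK to deploy the new model.")
    else:
        print("Keep the previous model in production.")
```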

End-to-end testing plays out pre-programmed conversations with the bot and reports whether they go as planned.

The confusion matrix shows which intents (what user questions are “about”) are often confused with each other and therefore need more training data.
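
Rasa draws a confusion matrix plot as part of its NLU evaluation, and the same analysis can be repeated on the manually labeled production data from step 6 to find the intent pairs that deserve extra training examples. Below is a minimal sketch using scikit-learn, with hypothetical intent names in the usage example:

```python
from collections import Counter
from sklearn.metrics import confusion_matrix

def most_confused_pairs(manual_labels, predicted_labels, top_n=5):
    """Return the intent pairs that are most often confused with each other."""
    labels = sorted(set(manual_labels) | set(predicted_labels))
    matrix = confusion_matrix(manual_labels, predicted_labels, labels=labels)

    confusions = Counter()
    for i, true_intent in enumerate(labels):
        for j, predicted_intent in enumerate(labels):
            if i != j and matrix[i][j] > 0:
                confusions[(true_intent, predicted_intent)] = matrix[i][j]
    return confusions.most_common(top_n)

# Hypothetical manual vs. predicted intent labels from the labeled subsets:
manual = ["greet", "goodbye", "ask_hours", "ask_hours", "ask_location"]
predicted = ["greet", "greet", "ask_hours", "ask_location", "ask_hours"]
for (true_intent, predicted_intent), count in most_confused_pairs(manual, predicted):
    print(f"{true_intent} was classified as {predicted_intent} {count} time(s)")
```

Intent pairs that show up repeatedly here are good candidates for the optional third labeling subset described in step 6.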

For a detailed guide on how to leverage Rasa’s evaluation capabilities, please refer to the official Rasa documentation on testing your assistant. For higher-level strategies for evaluating language solutions, please refer to 6.5 Key Metrics for Evaluating Language Solutions.
