
7.2.3 Evaluation and continuous improvement of machine translation

Evaluating the effectiveness of a machine translation (MT) system is a crucial step in ensuring the quality of its outputs and guiding its improvement over time. Automatic evaluation metrics such as COMET, chrF++, BLEU and TER, which are mentioned in 6.4 Assessing data and model maturity, provide quantitative insights into the performance of an MT system by comparing its output to reference translations. While these metrics offer a quick and convenient way to assess translation quality, they do not always capture the nuances of language or the specific context of the use case. Therefore, while automatic evaluation methods are valuable, manual evaluation remains an essential component in gauging MT quality accurately.
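As a minimal sketch of how such reference-based scoring can be run in practice, the example below uses the open-source sacrebleu library to compute BLEU, chrF++ and TER over a small set of illustrative hypothesis/reference pairs. COMET is not shown here because it requires the separate unbabel-comet package and a pretrained model; the sentences and variable names below are assumptions for illustration only.

```python
# Sketch: scoring MT output against reference translations with sacrebleu.
# The sentences are illustrative; COMET needs the separate `unbabel-comet` package.
from sacrebleu.metrics import BLEU, CHRF, TER

# MT system outputs (hypotheses) for a small evaluation set.
hypotheses = [
    "The clinic opens at nine in the morning.",
    "Floods has damaged the main road to the village.",
]

# One reference stream: the i-th entry is the human reference for the i-th hypothesis.
references = [[
    "The clinic opens at 9 a.m.",
    "Flooding has damaged the main road to the village.",
]]

bleu = BLEU()
chrf = CHRF(word_order=2)  # word_order=2 corresponds to chrF++
ter = TER()

# Each corpus_score call compares all hypotheses against the references at once.
for metric in (bleu, chrf, ter):
    print(metric.corpus_score(hypotheses, references))
```

Running this on your own test set simply means replacing the two lists with your system's outputs and the reference translations, keeping them aligned line by line.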

Manual evaluation relies on human judgment and understanding, making it particularly valuable for assessing the suitability of an MT system in real-world scenarios. When designing an evaluation strategy, it is essential to align it with the specific use case of the MT system. For instance, in scenarios where linguists post-edit the MT output to produce final translations, seeking feedback directly from those linguists is highly recommended.

One question we find simple to understand and useful for assessment is “Has MT helped you with your work?”, with the answer options:

  1. No, it was useless;

  2. No, it was not very helpful;

  3. Yes, it was sometimes helpful;

  4. Yes, it was very helpful.

For direct quality assessment of MT output, one approach is to have your linguists rate sentences or paragraphs on a scale from 1 to 5 (a sketch for summarizing such ratings follows the list), where:

  1. The MT does not communicate the source meaning at all or is incomprehensible.

  2. The MT communicates only part of the source meaning, important information is missing and the text is difficult to follow.

  3. The MT communicates most of the source meaning but some details are incorrect or there is an awkward style.

  4. The MT communicates the source meaning, but terminology needs to be improved.

  5. The MT conveys the source meaning and sounds natural.
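Once ratings like these are collected, a short script can keep track of how the distribution shifts between model versions. The sketch below assumes a hypothetical CSV file with columns segment_id, linguist and rating; the file name and column names are illustrative, not a prescribed format.

```python
# Sketch: summarizing 1-5 quality ratings collected from linguists.
# Assumes a hypothetical CSV with columns: segment_id, linguist, rating.
import csv
from collections import Counter
from statistics import mean

def summarize_ratings(path: str) -> None:
    ratings = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ratings.append(int(row["rating"]))

    counts = Counter(ratings)
    print(f"Segments rated: {len(ratings)}")
    print(f"Average rating: {mean(ratings):.2f}")

    # Share of segments rated 1 or 2: a quick signal of how often the MT output is unusable.
    low = counts[1] + counts[2]
    print(f"Low-rated (1-2): {low / len(ratings):.1%}")

    for score in range(1, 6):
        print(f"  {score}: {counts.get(score, 0)}")

summarize_ratings("mt_ratings.csv")
```

Tracking the average rating and the share of low-rated segments over successive model versions gives a simple, comparable signal to set alongside the automatic metrics above.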

However, it’s important to note that manual evaluation goes beyond quantitative metrics. Qualitative assessment from linguists can provide invaluable insights into where the MT system consistently falls short. Examples of consistent errors, such as gender mix-ups or term mistranslations, shed light on areas that need refinement. By analyzing these qualitative assessments, developers can uncover patterns of errors and focus on addressing specific linguistic challenges. This assessment can also guide your next data collection strategy.

Additionally, both the evaluation process and the deployment should be designed to facilitate continuous improvement. In a post-editing setup, collecting the post-edited MT outputs yields a valuable dataset for training and refining the MT model in subsequent iterations. This data can be used to fine-tune the model’s performance and address the specific challenges identified during manual evaluation. Deployment, in turn, should make it easy for users to flag mistranslations and other errors.
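One simple way to close this loop, assuming the MT system is served over an HTTP API as described in 7.2.2, is to add a feedback endpoint that stores source/MT/post-edit triples and error flags for later reuse. The sketch below uses FastAPI and a JSONL file; the endpoint name, fields and storage format are illustrative assumptions, not part of any prescribed setup.

```python
# Sketch: a feedback endpoint for collecting post-edits and error flags.
# Endpoint name, fields and storage format are illustrative assumptions.
import json
from datetime import datetime, timezone
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
FEEDBACK_FILE = "mt_feedback.jsonl"

class Feedback(BaseModel):
    source: str                     # original source sentence
    mt_output: str                  # raw machine translation shown to the user
    post_edit: Optional[str] = None  # corrected translation, if a linguist provided one
    flag: Optional[str] = None       # e.g. "mistranslation", "wrong_term", "gender_error"
    comment: Optional[str] = None    # free-text note from the linguist or user

@app.post("/feedback")
def record_feedback(item: Feedback) -> dict:
    record = item.dict()
    record["timestamp"] = datetime.now(timezone.utc).isoformat()
    # Append one JSON object per line; post-edited pairs can later be
    # extracted from this file as fine-tuning or evaluation data.
    with open(FEEDBACK_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return {"status": "stored"}
```

A post-editing interface or the deployed front end would send a POST request to this endpoint whenever a segment is corrected or flagged, and the resulting JSONL file can be filtered into training and evaluation sets for the next model iteration.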

In conclusion, while automatic evaluation metrics provide a preliminary assessment of MT quality, manual evaluation and feedback from linguists are essential for a comprehensive understanding of its effectiveness. By aligning the evaluation process with the use case, implementing structured quality ratings, and incorporating qualitative feedback, developers can ensure that the MT system evolves to meet the demands of real-world translation scenarios. Furthermore, a well-designed evaluation process becomes an integral part of the continuous improvement loop, enhancing the MT system’s performance over time.

