7.2.3 Evaluation and continuous improvement of machine translation

Evaluating the effectiveness of a machine translation (MT) system is a crucial step in ensuring the quality of its outputs and guiding its improvement over time. Automatic evaluation metrics such as COMET, chrF++, BLEU, and TER, which are mentioned in 6.4 Assessing data and model maturity, provide quantitative insights into the performance of an MT system by comparing its output to reference translations. While these metrics offer a quick and convenient way to assess translation quality, they do not always capture the nuances of language or the specific context of the use case. Automatic evaluation is therefore valuable, but manual evaluation remains an essential component in gauging MT quality accurately.
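As a concrete illustration of how such metrics can be computed, the sketch below scores a couple of invented segments against reference translations using the open-source sacrebleu library (its metric-class API); COMET needs a separate neural model, for example via the unbabel-comet package, and is not shown here. The example sentences and the setup are placeholders rather than part of any particular system.

```python
# Minimal sketch: corpus-level BLEU, chrF++ and TER with sacrebleu
# (pip install sacrebleu). Hypotheses are MT outputs; references are
# human translations aligned line by line with the hypotheses.
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = [
    "The contract comes into power on 1 January.",
    "The parties shall notify each other in writing.",
]
references = [[  # one reference stream, aligned with the hypotheses
    "The contract enters into force on 1 January.",
    "The parties must notify each other in writing.",
]]

bleu = BLEU()
chrf = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++
ter = TER()

print(bleu.corpus_score(hypotheses, references))   # higher is better
print(chrf.corpus_score(hypotheses, references))   # higher is better
print(ter.corpus_score(hypotheses, references))    # lower is better (edit rate)
```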

Manual evaluation relies on human judgment and understanding, making it particularly valuable for assessing the suitability of an MT system in real-world scenarios. When designing an evaluation strategy, it’s essential to align it with the specific use case of the MT system. For instance, in scenarios where the MT output is post-edited by linguists, seeking feedback directly from those linguists is highly recommended.

One question we find simple to understand and useful for assessment is “Has MT helped you with your work?”, with the following answer options:

  1. No, it was useless;

  2. No, it was not very helpful;

  3. Yes, it was sometimes helpful;

  4. Yes, it was very helpful.

For a direct quality assessment of MT output, one option is to have your linguists rate sentences or paragraphs on a scale from 1 to 5 (a sketch for aggregating such ratings follows the list), where:

  1. The MT does not communicate the source meaning at all or is incomprehensible.

  2. The MT communicates only part of the source meaning; important information is missing and the text is difficult to follow.

  3. The MT communicates most of the source meaning, but some details are incorrect or the style is awkward.

  4. The MT communicates the source meaning, but terminology needs to be improved.

  5. The MT conveys the source meaning and sounds natural.
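If these ratings are collected in a structured form, they are straightforward to aggregate into an overall picture. The sketch below assumes a hypothetical CSV export with one row per rated segment and columns segment_id, linguist and rating; the file name and layout are illustrative, not a required format.

```python
# Minimal sketch: summarising 1-5 MT quality ratings collected from linguists.
# Assumes a CSV file with a header row containing at least a "rating" column.
import csv
from collections import Counter
from statistics import mean

def summarise_ratings(path: str) -> None:
    with open(path, newline="", encoding="utf-8") as f:
        ratings = [int(row["rating"]) for row in csv.DictReader(f)]

    distribution = Counter(ratings)
    print(f"Segments rated: {len(ratings)}")
    print(f"Mean rating:    {mean(ratings):.2f}")
    for score in range(1, 6):
        share = distribution[score] / len(ratings)
        print(f"  rated {score}: {distribution[score]:4d} ({share:.0%})")
    # Segments rated 1 or 2 are good candidates for closer error analysis.

summarise_ratings("mt_ratings.csv")  # hypothetical export from your review tool
```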

Manual evaluation, however, goes beyond quantitative scores. Qualitative assessment from linguists can provide invaluable insight into where the MT system consistently falls short. Examples of recurring errors, such as gender mix-ups or term mistranslations, shed light on areas that need refinement. By analyzing these qualitative assessments, developers can uncover patterns of errors and focus on addressing specific linguistic challenges. This assessment can also guide your next data collection strategy.

Additionally, both the evaluation process and the deployment setup should be designed to facilitate continuous improvement. In a post-editing setup, collecting post-edited MT outputs produces a valuable dataset for training and refining the MT model in subsequent iterations. This data can be used to fine-tune the model and address the specific challenges identified during manual evaluation. Deployment, in turn, should make it easy for users to flag mistranslations and other errors.
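In practice this can be as simple as exporting (source, MT output, post-edit) triplets from the post-editing tool and converting them into a training file. The sketch below writes such triplets to a JSONL file; the record fields, the export format and the file name are assumptions about your tooling rather than a fixed interface.

```python
# Minimal sketch: turning post-edited MT output into fine-tuning data.
# Each record pairs the source segment with the linguist's post-edit
# (the human-approved target); the raw MT output is kept for error analysis.
import json

def build_finetuning_set(records: list[dict], out_path: str) -> int:
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for rec in records:
            source = rec["source"].strip()
            post_edit = rec["post_edit"].strip()
            if not source or not post_edit:
                continue  # skip empty or untranslated segments
            out.write(json.dumps(
                {
                    "source": source,
                    "target": post_edit,                    # training target
                    "mt_output": rec.get("mt_output", ""),  # kept for analysis
                },
                ensure_ascii=False,
            ) + "\n")
            kept += 1
    return kept

# Hypothetical usage with records exported from a CAT / post-editing tool.
records = [
    {"source": "Der Vertrag tritt am 1. Januar in Kraft.",
     "mt_output": "The contract comes into power on 1 January.",
     "post_edit": "The contract enters into force on 1 January."},
]
print(build_finetuning_set(records, "postedits.jsonl"), "segments written")
```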

In conclusion, while automatic evaluation metrics provide a preliminary assessment of MT quality, manual evaluation and feedback from linguists are essential for a comprehensive understanding of its effectiveness. By aligning the evaluation process with the use case, implementing structured quality ratings, and incorporating qualitative feedback, developers can ensure that the MT system evolves to meet the demands of real-world translation scenarios. Furthermore, a well-designed evaluation process becomes an integral part of the continuous improvement loop, enhancing the MT system’s performance over time.
