# 7.2.3 Evaluation and continuous improvement of machine translation

Evaluating a machine translation (MT) system is a crucial step in ensuring the quality of its output and guiding its improvement over time. Automatic evaluation metrics such as COMET, chrF++, BLEU and TER, which are mentioned in [6.4 Assessing data and model maturity](/6.-language-technology-implementation/6.4-assessing-data-and-model-maturity.md), provide quantitative insights into the performance of an MT system by comparing its output to reference translations. While these metrics offer a quick and convenient way to assess translation quality, they do not always capture the nuances of language or the specific context of the use case. Automatic evaluation is therefore valuable, but manual evaluation remains essential for gauging MT quality accurately.
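To make the idea of comparing MT output against a reference translation concrete, here is a minimal sketch of a character n-gram F-score. It is a simplified illustration in the spirit of chrF, not the official chrF++ metric: the function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace handling is simplified, and word n-grams are left out. For real evaluations, an established implementation such as sacreBLEU is the safer choice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1].

    Illustrative only: this is NOT the official chrF++ implementation.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score; beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("the cat sat on the mat", "the cat is on the mat"))
```

A score like this only measures surface overlap with one reference, which is exactly why it can miss meaning, register and context.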

Manual evaluation relies on human judgment and understanding, making it particularly valuable for assessing the suitability of an MT system in real-world scenarios. When designing an evaluation strategy, it’s essential to align it with the specific use case of the MT system. For instance, in scenarios where the MT output is post-edited by linguists, seeking feedback directly from these linguists is highly recommended.

One question we find simple to understand and useful for assessment is “*Has MT helped you with your work?*” with the answers:

1. No, it was useless.
2. No, it was not very helpful.
3. Yes, it was sometimes helpful.
4. Yes, it was very helpful.
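Responses to this question can be tallied with a few lines of code. The sketch below assumes each response is stored as the answer number (1 to 4); the function name `summarize_survey` and the "helpful share" summary (answers 3 and 4 combined) are illustrative choices, not part of the playbook.

```python
from collections import Counter

# Answer options from the survey question, keyed by answer number.
ANSWERS = {
    1: "No, it was useless",
    2: "No, it was not very helpful",
    3: "Yes, it was sometimes helpful",
    4: "Yes, it was very helpful",
}


def summarize_survey(responses: list[int]) -> dict:
    """Count each answer and compute the share of 'Yes' answers (3 or 4)."""
    invalid = [r for r in responses if r not in ANSWERS]
    if invalid:
        raise ValueError(f"Unknown answer codes: {invalid}")
    counts = Counter(responses)
    total = len(responses)
    helpful = counts[3] + counts[4]
    return {
        "counts": {label: counts.get(code, 0) for code, label in ANSWERS.items()},
        "helpful_share": helpful / total if total else 0.0,
    }


if __name__ == "__main__":
    print(summarize_survey([4, 3, 2, 4]))
```

Tracking the helpful share across releases gives a simple signal of whether linguists find the system more or less useful over time.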

For direct quality assessment of MT output, one approach is to have your linguists rate sentences or paragraphs on a scale from 1 to 5, where:

1. The MT does not communicate the source meaning at all or is incomprehensible.
2. The MT communicates only part of the source meaning; important information is missing and the text is difficult to follow.
3. The MT communicates most of the source meaning but some details are incorrect or there is an awkward style.
4. The MT communicates the source meaning, but terminology needs to be improved.
5. The MT conveys the source meaning and sounds natural.
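Ratings on this scale can be summarized per evaluation round. The following is a minimal sketch, assuming ratings are collected as integers from 1 to 5; the function name `summarize_ratings` and the `low_threshold` cut-off of 2 are hypothetical choices for illustration.

```python
from collections import Counter
from statistics import mean


def summarize_ratings(ratings: list[int], low_threshold: int = 2) -> dict:
    """Summarize 1-5 quality ratings: mean, distribution, share of low scores."""
    if not ratings:
        raise ValueError("No ratings provided")
    if any(r not in range(1, 6) for r in ratings):
        raise ValueError("Ratings must be integers from 1 to 5")
    counts = Counter(ratings)
    low = sum(1 for r in ratings if r <= low_threshold)
    return {
        "mean": mean(ratings),
        "distribution": {score: counts.get(score, 0) for score in range(1, 6)},
        "low_share": low / len(ratings),
    }


if __name__ == "__main__":
    print(summarize_ratings([5, 4, 2, 1, 3]))
```

Reporting the distribution and the share of low scores alongside the mean helps, because an acceptable average can hide a cluster of segments that fail to communicate the source meaning.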

However, it’s important to note that manual evaluation goes beyond quantitative metrics. Qualitative assessment from linguists can provide invaluable insights into where the MT system consistently falls short. Examples of consistent errors, such as gender mix-ups or term mistranslations, shed light on areas that need refinement. By analyzing these qualitative assessments, developers can uncover patterns of errors and focus on addressing specific linguistic challenges. This assessment can also guide your next data collection strategy.
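If linguists tag each problem they find with an error category, recurring patterns become easy to spot. The sketch below assumes a hypothetical annotation format, a list of dictionaries with a `category` key; the category names and the function `top_error_categories` are illustrative only.

```python
from collections import Counter


def top_error_categories(annotations: list[dict], top_n: int = 3) -> list[tuple[str, int]]:
    """Return the most frequent error categories with their counts."""
    counts = Counter(a["category"] for a in annotations)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical annotations collected from linguists.
    annotations = [
        {"segment_id": 1, "category": "gender"},
        {"segment_id": 2, "category": "terminology"},
        {"segment_id": 3, "category": "gender"},
    ]
    print(top_error_categories(annotations))
```

The most frequent categories suggest where targeted data collection, such as terminology lists or gender-balanced examples, is likely to pay off first.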

Additionally, both evaluation and deployment should be designed to facilitate continuous improvement. In a post-editing setup, the collected post-edited MT outputs form a valuable dataset for training and refining the MT model in subsequent iterations. This data can be used to fine-tune the model and address the specific challenges identified during manual evaluation. Deployment, in turn, should make it possible for users to give feedback and flag mistranslations and errors.
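One way to collect post-edited outputs is to store each segment as a record containing the source, the raw MT output and the post-edited text, together with a rough measure of how much the linguist changed. The sketch below is an assumption-laden illustration: the record fields, the JSON Lines format and the function names are hypothetical, and the edit rate is a plain word-level edit distance divided by the post-edit length, similar in spirit to TER but without shift operations.

```python
import json


def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def post_edit_record(source: str, mt_output: str, post_edit: str) -> dict:
    """Build one training record with a rough post-editing effort estimate."""
    mt_words, pe_words = mt_output.split(), post_edit.split()
    distance = word_edit_distance(mt_words, pe_words)
    edit_rate = distance / len(pe_words) if pe_words else 0.0
    return {"source": source, "mt": mt_output,
            "post_edit": post_edit, "edit_rate": edit_rate}


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = post_edit_record("source text", "the cat sat", "the dog sat")
    print(to_jsonl([record]))
```

Segments with a high edit rate are natural candidates for closer qualitative review, while the source and post-edit pairs can feed the next fine-tuning round.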

In conclusion, while automatic evaluation metrics provide a preliminary assessment of MT quality, manual evaluation and feedback from linguists are essential for a comprehensive understanding of its effectiveness. By aligning the evaluation process with the use case, implementing structured quality ratings, and incorporating qualitative feedback, developers can ensure that the MT system evolves to meet the demands of real-world translation scenarios. Furthermore, a well-designed evaluation process becomes an integral part of the continuous improvement loop, enhancing the MT system’s performance over time.

