7.2.1 Building your own MT models

In the field of machine translation, pre-trained models have brought a significant shift, offering valuable starting points for creating translation systems. Platforms like Hugging Face provide access to a range of pre-trained models suitable for various language pairs and directions. For instance, the Helsinki-NLP repository houses both unidirectional and multilingual models trained using parallel data sourced from OPUS. Evaluations of these models across different benchmark datasets are accessible at https://opus.nlpl.eu/dashboard/.
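As a quick illustration, the sketch below loads one of these Opus-MT checkpoints with the Hugging Face Transformers library and translates a single sentence. The English–French model name and the example sentence are illustrative choices; substitute the language pair you need (the `transformers` and `sentencepiece` packages must be installed).

```python
from transformers import MarianMTModel, MarianTokenizer

# Example Opus-MT checkpoint from the Helsinki-NLP repository (English -> French).
# Swap in the pair you need, e.g. "Helsinki-NLP/opus-mt-en-de".
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Pre-trained models are a useful starting point."], return_tensors="pt")
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```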

Recently, multilingual models have gained prominence, accommodating many language directions in a single model. Meta AI's NLLB model stands out, supporting a remarkable 200 languages. However, it's crucial to acknowledge that translation quality can vary considerably across language pairs, since the size and quality of the available training data differ from one pair to the next.

BLEU scores on the FLORES-200 benchmark set offer insight into the NLLB model's performance across different language directions. While these models offer impressive capabilities, using them in practice typically requires customization to suit specific language pairs and domains.
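To give a concrete sense of how such a multilingual model is used, the sketch below translates an English sentence into Kinyarwanda with the distilled 600M-parameter NLLB checkpoint. The checkpoint name, example sentence, and FLORES-200 language codes (`eng_Latn`, `kin_Latn`) are illustrative picks, not requirements.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Smallest public NLLB checkpoint; larger variants follow the same interface.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Machine translation helps people communicate.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force the decoder to start in the target language (Kinyarwanda).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kin_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```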

These pre-trained models are advantageous due to their adaptability: researchers can download them and fine-tune them further on their own data, provided they have access to the necessary computational resources.
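A minimal fine-tuning sketch using the Transformers `Seq2SeqTrainer` is shown below. It assumes a hypothetical JSON Lines parallel corpus with `en` and `rw` fields; the checkpoint, file names, and hyperparameters are placeholders to adapt to your own data and hardware.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Any seq2seq checkpoint can be fine-tuned this way; NLLB is used to match the example above.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="kin_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical parallel corpus in JSON Lines format with "en" and "rw" fields.
raw = load_dataset("json", data_files={"train": "train.jsonl", "validation": "dev.jsonl"})

def preprocess(batch):
    # Tokenize the source side and build labels from the target side in one call.
    return tokenizer(batch["en"], text_target=batch["rw"], truncation=True, max_length=128)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-finetuned-en-rw",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.evaluate()
```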

In 2022, CLEAR Global collaborated with Digital Umuganda to fine-tune the NLLB model for Kinyarwanda in two domains: financial education and tourism. The source code for the training and evaluation scripts, together with the results, can be found at this link.

Several other well-known frameworks have also gained traction in the MT landscape. Each brings its own strengths and features, catering to different needs and preferences in machine translation.

OpenNMT, for instance, stands out as an open-source toolkit that provides comprehensive support for neural machine translation. With its modular design, OpenNMT offers flexibility in building and fine-tuning translation models.

You can consult CLEAR Global’s codebase for training OpenNMT models. It contains the necessary scripts and short instructions for training and evaluating a neural machine translation system from scratch.

The Hugging Face Transformers library was created to provide a unified interface for loading, training, and storing Transformer models, streamlining the process for NLP practitioners. Notably, the library is easy to use: you can download and run state-of-the-art NLP models for inference with just a few lines of code. Moreover, it integrates seamlessly with the open models and datasets on the Hugging Face platform, so loading a dataset or a model takes a single line of Python. Refer to their NLP course for a general overview of NLP in practice, and to the Translation section for a deeper look at developing translation models.
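For example, assuming the `datasets` library is installed alongside `transformers`, a public parallel corpus and a pre-trained translation model can each be loaded with one line; the `opus_books` English–French corpus and the Opus-MT model below are just illustrative picks.

```python
from datasets import load_dataset
from transformers import pipeline

# One line to pull an open parallel corpus from the Hub ...
dataset = load_dataset("opus_books", "en-fr")
# ... and one line to load a ready-to-use translation pipeline.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

print(dataset["train"][0]["translation"]["en"])
print(translator("This library keeps the interface simple.")[0]["translation_text"])
```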

MarianNMT is another notable framework that focuses on efficiency and high performance. Written in C++, it delivers fast training and inference, making it a preferred choice for production MT systems; the Helsinki-NLP Opus-MT models mentioned above were trained with it.

JoeyNMT is recognized for its user-friendly interface and ease of use. Built on top of PyTorch, JoeyNMT simplifies the process of creating, training and deploying translation models. Its straightforward configuration and accessibility have made it popular among developers and researchers.

Sockeye is an open-source sequence-to-sequence framework for Neural Machine Translation built on PyTorch. It implements distributed training and optimized inference for state-of-the-art models, powering Amazon Translate and other MT applications.
