7.2 Machine translation

Machine Translation (MT) is the automatic conversion of text in one language into another. The field has evolved over the years from rule-based systems to statistical approaches, which model the probabilities of mappings between sub-phrases of the two sides of a translation. These probabilities are learned statistically from parallel texts, i.e. sentence-aligned translations in the languages involved (referred to as the source and target languages). The diagram below illustrates the modeling of translations of the word “sure” from English into Spanish using translations from United Nations proceedings.
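
As a toy illustration of the statistical idea, the sketch below estimates translation probabilities for “sure” by relative frequency over a handful of word-aligned pairs; the aligned pairs and counts are invented purely for the example.

```python
from collections import Counter

# Invented word-aligned pairs (source word, target word) for illustration only.
aligned_pairs = [
    ("sure", "seguro"), ("sure", "seguro"), ("sure", "segura"),
    ("sure", "claro"), ("sure", "por supuesto"),
]

# Relative-frequency estimate: p(target | "sure") = count(target) / total count
counts = Counter(tgt for _, tgt in aligned_pairs)
total = sum(counts.values())
for tgt, c in counts.most_common():
    print(f"p({tgt!r} | 'sure') = {c / total:.2f}")
```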

In recent years, the field has undergone a significant paradigm shift with the emergence of Neural Machine Translation (NMT) systems. Unlike Statistical Machine Translation (SMT), NMT does not explicitly store these translation probabilities. Instead, an NMT model estimates the probability of generating each target token (word or subword) given the source sentence and the previously generated tokens. These probabilities, denoted p(y_i | x, y_<i), are learned during training from parallel corpora in which source and target sentences are aligned. Notably, an NMT model stores the parameters of a complex mathematical function rather than explicit probabilities: it transforms the input sentence (a sequence of discrete symbols) into continuous embeddings, which are then manipulated mathematically. This shift to continuous vector representations of language is a departure from the discrete, symbol-based approaches of the past.
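
To make the notation concrete, the following sketch uses the Hugging Face transformers library to translate a sentence and inspect the per-token probabilities p(y_i | x, y_<i) that the decoder assigns at each step. The OPUS-MT model name is an assumption; any encoder-decoder NMT model on the Hub behaves the same way.

```python
from transformers import MarianMTModel, MarianTokenizer
import torch

# An OPUS-MT English-to-Spanish model from the Helsinki-NLP group
# (model name assumed to be available on the Hugging Face Hub).
name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

inputs = tokenizer("Are you sure?", return_tensors="pt")

# Greedy decoding, keeping the scores so we can inspect p(y_i | x, y_<i)
gen = model.generate(**inputs, num_beams=1, do_sample=False,
                     output_scores=True, return_dict_in_generate=True)

print(tokenizer.decode(gen.sequences[0], skip_special_tokens=True))

# gen.scores holds one logit vector per decoding step; a softmax turns it
# into the distribution over the next target token given x and y_<i.
for step, logits in enumerate(gen.scores):
    probs = torch.softmax(logits[0], dim=-1)
    token_id = gen.sequences[0, step + 1].item()  # +1 skips the decoder start token
    print(step, tokenizer.decode([token_id]), round(probs[token_id].item(), 3))
```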

The figure above roughly sketches an encoder-decoder NMT architecture, the common structure in this family of MT systems since its introduction in 2014. The encoder is responsible for "reading" the source sentence and producing continuous embeddings, while the decoder generates the target tokens one at a time. This architecture served as a precursor to the Transformer architecture, which is now widely used both within and outside MT. Transformers introduced self-attention, a mechanism that lets the model weigh the relevance of the different words in the source and target sentences, significantly improving translation quality by capturing long-range dependencies and context.
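
The core of self-attention fits in a few lines. The sketch below is a deliberately simplified illustration (single head, no masking, no learned projections) of how attention weights capture the pairwise relevance of tokens:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: tensors of shape (batch, seq_len, d_model)
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ value, weights

# Toy self-attention: one "sentence" of 4 tokens with 8-dimensional embeddings,
# so queries, keys, and values all come from the same sequence.
x = torch.randn(1, 4, 8)
output, attention = scaled_dot_product_attention(x, x, x)
print(attention[0])  # 4x4 matrix: how much each token attends to every other token
```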

This paradigm shift represented a breakthrough in MT, offering more fluent and context-aware translations by letting neural networks handle the entire translation process and transcending the limitations of earlier rule-based and statistical methods. According to a 2018 study, the neural models introduced in 2014 made 50% fewer word-order errors, 17% fewer lexical errors, and 19% fewer grammar errors than the earlier models.

In recent years, machine translation services like Google Translate and DeepL have become reliable tools for translators and regular users alike. The principal uses of machine translation are as follows:

  1. Assimilation: understanding a document written in another language. This use case enables e.g. reading a news site or technical paper in a language that we don’t understand. We know that the translation is not 100% accurate, but it gives us the gist and helps us decide whether to explore further.

  2. Communication: enabling communication between individuals and organizations, e.g. in chat, tourism, and e-commerce, reducing the need for a lingua franca.

  3. Monitoring: tracking information in large-scale multilingual document collections, e.g. discovering international trends on Twitter.

  4. Assistance: improving translation workflows, e.g. computer-assisted translation and post-editing.

Parallel data (bitext)

The type of data needed to build a machine translation system is parallel data, which consists of a collection of sentences in one language together with their translations in another. Historically, parallel data were sourced from translations produced in multilingual public institutions like the United Nations and the European Parliament. Today, the greatest resource of parallel text is the multilingual web.
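
In its simplest form, a bitext is stored as two line-aligned files, one per language (the Moses convention). The snippet below sketches how such a pair of files could be read into sentence pairs; the file names and the example pair are illustrative.

```python
# A bitext as two line-aligned plain-text files, one sentence per line.
# The file names here are placeholders for whatever corpus you are using.
with open("train.en", encoding="utf-8") as f_src, open("train.es", encoding="utf-8") as f_tgt:
    bitext = [(src.strip(), tgt.strip()) for src, tgt in zip(f_src, f_tgt)]

print(bitext[0])  # e.g. ("Are you sure?", "¿Estás seguro?")
```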

Sourcing parallel data

As we presented in Section 6.3, OPUS is a collection of almost all publicly available parallel data. It is the go-to place for researchers to publish their parallel data or to source data for the development of MT models. In addition to OPUS, the Common Crawl initiative plays a pivotal role in the creation of modern MT systems. Common Crawl is a vast web archive that captures a snapshot of the internet, including multilingual content from diverse sources. Researchers and developers can tap into this rich resource to extract parallel text from websites, news articles, and other online content, which provides real-world linguistic diversity and enables the training of MT models that are robust and adaptable to different language pairs and domains. Together with traditional sources of parallel data such as multilingual websites, movie subtitles, religious texts, parliamentary proceedings, and software localization data, these initiatives greatly expand the availability of high-quality parallel data, facilitating the advancement of multilingual communication through MT technology.
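
As a sketch of how OPUS data can be pulled programmatically, the example below loads one OPUS corpus through the Hugging Face datasets library; the opus_books corpus and its en-es configuration are an illustrative choice, and OPUS hosts many more collections.

```python
from datasets import load_dataset

# Load the English-Spanish portion of an OPUS corpus from the Hugging Face Hub
# (corpus and configuration names are illustrative).
bitext = load_dataset("opus_books", "en-es", split="train")

example = bitext[0]["translation"]
print(example["en"])
print(example["es"])
```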

Note: It's important to recognize that the quality of training data can significantly impact the performance of machine translation models, especially for less-resourced languages. As highlighted in the paper "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets," the success of large-scale pre-training and multilingual modeling in NLP has led to the emergence of numerous web-mined text datasets for many languages. However, lower-resource corpora often exhibit systematic issues, including unusable text, sentences of subpar quality, mislabeling, and nonstandard or ambiguous language codes. The study demonstrates that these quality issues are discernible even to non-proficient speakers, and automated analyses further support these findings. As you work with multilingual corpora, it's essential to consider these challenges and explore techniques to evaluate and enhance the quality of your data sources.
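
Some of these quality checks can be approximated with simple heuristics. The sketch below shows illustrative filters (length bounds, length ratio, untranslated copies) of the kind commonly applied to web-mined bitext; the thresholds are assumptions for the example, not values taken from the paper.

```python
def keep_pair(src, tgt, min_len=3, max_len=200, max_ratio=3.0):
    """Illustrative heuristics for filtering noisy sentence pairs (thresholds are assumed)."""
    src_tok, tgt_tok = src.split(), tgt.split()
    if not (min_len <= len(src_tok) <= max_len):    # source too short or too long
        return False
    if not (min_len <= len(tgt_tok) <= max_len):    # target too short or too long
        return False
    ratio = max(len(src_tok), len(tgt_tok)) / min(len(src_tok), len(tgt_tok))
    if ratio > max_ratio:                           # implausible length ratio
        return False
    if src.strip().lower() == tgt.strip().lower():  # untranslated copy
        return False
    return True

pairs = [
    ("Are you sure you want to delete this file?", "¿Seguro que quieres borrar este archivo?"),
    ("click here", "click here"),  # untranslated copy, will be dropped
]
print([p for p in pairs if keep_pair(*p)])
```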
