6.1 Navigating the Language Technology Landscape
This flowchart describes the key steps and decision points an initiative faces while developing a solution in a marginalized language using NLP.
The starting point of the diagram assumes that you have already come up with a rough idea involving an NLP solution, for example, one modelled on an existing solution for a high-resource language. Because NLP technologies are data-driven, it is likely that no well-developed equivalent exists for a low-resource language; the diagram also depicts the decisions to follow in that case.
Data-driven means that the intelligence these tools exhibit is learned from large volumes of information, or simply data. For example, a machine translation engine “models” translation from one language to another by looking at a collection of human-translated documents and sentences. Similarly, a sentiment analyzer learns to label whether a tweet says something good or bad about a company from thousands of tweets labeled by humans as carrying a positive or negative sentiment.
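As a minimal illustration of this learning-from-labeled-data idea, the sketch below trains a tiny sentiment classifier on a handful of invented example sentences using scikit-learn. The sentences and labels are placeholders; real systems learn from thousands of examples.

```python
# A minimal, hypothetical illustration of "learning from labeled data":
# a tiny sentiment classifier trained on a handful of invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great service, I am very happy",      # positive
    "the staff was friendly and helpful",  # positive
    "terrible support, never again",       # negative
    "the product broke after one day",     # negative
]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative sentiment

# Turn the text into simple word features, then fit a classifier on the labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# The model has "learned" sentiment only from the examples it was shown.
print(model.predict(["friendly and helpful staff"]))  # expected: [1]
```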
This dependency on data is what makes these technologies accessible to some languages but not to others, hence the widely used distinction between high-resource and low-resource languages. The resources available for a language directly determine whether an application can be developed for it. Since the largest source of textual data is the internet, which is dominated by a few languages, these technologies tend to focus on a handful of dominant languages such as English, Spanish, Chinese, and Arabic.
The diagram below by Microsoft Research Labs India illustrates the hierarchy created by this “power law” among languages.
Let’s now navigate through our landscape diagram step by step.
The first point to identify is what kind of AI task or tasks the solution is using. Some of the most common NLP tasks are as follows:
Text-based:
Machine Translation: Converting text from one language to another while preserving its meaning.
Information Retrieval: Finding relevant documents or information from a collection based on user queries.
Information Extraction: Automatically extracting structured information from unstructured text.
Sentiment Analysis: Determining the emotional tone or sentiment expressed in a piece of text.
Question Answering: Automatically answering questions posed in natural language.
Text Summarization: Generating concise summaries from longer texts while retaining key information.
Named-Entity Recognition: Identifying and classifying named entities (such as names of people, places, and organizations) in text.
Speech-based:
Automatic Speech Recognition: Converting spoken language into written text.
Text-to-Speech Synthesis: Generating natural-sounding speech from written text.
Image-based:
Image Captioning: Generating descriptive captions for images.
Image Classification: Assigning labels to images based on their content.
Image Generation: Creating new images based on certain input criteria.
Image Sentiment Analysis: Determining the emotional tone or sentiment expressed in images.
Visual Question Answering: Answering questions about the content of images.
Scene Recognition: Identifying the type or context of a scene in an image.
Recently, many NLP tasks are carried out with large language models (LLMs), where a particular task is achieved by writing the right prompt. So you can either look for a model built specifically for your task, or for a large language model that supports your language.
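As a sketch of what prompting looks like, the example below asks a hosted LLM to perform sentiment classification on a Swahili sentence. It uses the OpenAI Python client as one possible interface; the model name is illustrative, and any hosted or open LLM with a chat-style API works the same way.

```python
# Sketch: steering a general-purpose LLM toward a specific task through the prompt.
from openai import OpenAI

client = OpenAI()  # expects an API key, e.g. via the OPENAI_API_KEY environment variable

prompt = (
    "Classify the sentiment of the following Swahili sentence as positive or negative, "
    "and answer with a single word.\n\n"
    "Sentence: Huduma ilikuwa nzuri sana."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute your provider's model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```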
When you're designing a solution or product, keep in mind that these NLP tasks are only one component of the solution. For example, consider a customer support chatbot. In this scenario, a question answering NLP component handles customer inquiries, but building the complete solution involves many additional components: data collection and preprocessing, named-entity recognition, intent recognition, response generation, a knowledge base, user interface design, back-end integration, sentiment analysis, an escalation mechanism, and mechanisms for continuous learning and improvement.
There are many open-source models for language technology available online on platforms such as GitHub, Hugging Face, and Kaggle. These platforms host repositories containing pre-trained models and resources that cover a wide array of NLP tasks. Developers can access, fine-tune, and adapt these models to their specific needs, significantly expediting the development process and benefiting from the collective knowledge of the open-source community. We’ll explore this in more detail in Section 6.3 Open source data and models.
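For instance, Hugging Face’s transformers library lets you download and run a published model in a few lines. The model name below is one example of the kind of identifier you would look up on the Hub; substitute a model that matches your own language and task.

```python
# Minimal sketch: running a pre-trained translation model from the Hugging Face Hub.
# "Helsinki-NLP/opus-mt-sw-en" is a published Swahili->English model; check the Hub
# for a model that matches your own language pair and task.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sw-en")
print(translator("Habari ya asubuhi")[0]["translation_text"])
```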
It’s easy to get lost in the vast number of models available in the wild. The choice of model depends on your use case, performance requirements, and ethical considerations. Compare candidate models based on their capabilities, limitations, and suitability for your problem, and weigh the trade-offs between complexity, accuracy, and efficiency. We’ll explore this in more detail in the next section.
Let’s say you have found several candidate models; the next step is to check whether they actually serve your particular purpose.
To verify that a model does what it promises, the published benchmarks give a first hint, but the best practice is to test it on a representative sample of data from your own domain. Evaluate the model’s quality, robustness, and fairness using appropriate metrics and benchmarks, and check that its license and terms of use are compatible with your intended application. We'll explore this in more detail in 6.4 Assessing data and model maturity.
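As a concrete illustration, the sketch below scores a translation model on a tiny held-out sample with sacreBLEU. The sample sentences and references are placeholders; in practice they would come from your own domain, and the right metric depends on the task (accuracy or F1 for classification, word error rate for speech recognition, and so on).

```python
# Sketch: evaluating a translation model on a small, representative sample.
import sacrebleu
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sw-en")  # example model

sources = ["Habari ya asubuhi", "Asante sana"]
references = [["Good morning", "Thank you very much"]]  # one list per reference set

hypotheses = [translator(s)[0]["translation_text"] for s in sources]
score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU on the sample: {score.score:.1f}")
```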
To judge whether a model serves your purpose, define clear and measurable objectives for your solution, collect feedback from your users and stakeholders to assess its impact and value, and monitor and update the model regularly to keep it reliable and relevant.
Once you have identified a working model for your solution, the next step is to make it available to your application. Making the model accessible and usable is crucial, and machine learning models are usually served through what's called an application programming interface (API).
What is an API?
An Application Programming Interface (API) is a set of rules and protocols that allow different software applications to communicate with each other. It defines the methods and data structures that developers can use to interact with a specific software component, such as a machine learning model. APIs enable applications to request certain tasks or information from another system and receive appropriate responses.
Serving Machine Learning Models through APIs
Machine learning models, including those used in language technology, are often served through APIs. These APIs expose the functionality of the model to other applications, allowing developers to integrate the model's capabilities without needing to understand its intricate internal workings.
For instance, if you have a machine translation model, you can create an API that accepts text in one language and returns the translated text in another. This simplifies the integration process, as developers can interact with the model using standard HTTP requests rather than delving into the complexities of the model architecture.
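A minimal sketch of this pattern is shown below, using FastAPI to wrap a translation model behind an HTTP endpoint. The model name, file name, and route are illustrative only.

```python
# Minimal sketch: serving a translation model behind an HTTP API with FastAPI.
# Assuming this file is saved as serve_translation.py, run it with:
#   uvicorn serve_translation:app --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sw-en")  # example model

class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(request: TranslationRequest):
    # Run the model and return only the translated string.
    result = translator(request.text)[0]["translation_text"]
    return {"translation": result}
```

A client application can then call the endpoint with a standard HTTP request, for example: `curl -X POST http://localhost:8000/translate -H "Content-Type: application/json" -d '{"text": "Habari ya asubuhi"}'`.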
Some models and data are supported by ready-made APIs that allow you to access them easily and quickly. For example, you can use Google Cloud APIs for language technology tasks such as translation, speech recognition, natural language understanding, etc. However, not all models and data are available through APIs, and you may need to build your own API or use other methods to access them.
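For example, a few lines of Python are enough to call Google’s managed Translation API through its client library. This assumes you have a Google Cloud project with the Translation API enabled, the google-cloud-translate package installed, and credentials configured; exact library versions and supported languages vary.

```python
# Sketch: calling a ready-made, managed translation API (Google Cloud Translation).
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key for a
# project where the Translation API is enabled.
from google.cloud import translate_v2 as translate

client = translate.Client()
result = client.translate("Muraho, amakuru?", target_language="en")
print(result["translatedText"])
```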
To deploy and scale your solution, consider the infrastructure, platform, and tools you will use, as well as the security, privacy, and compliance issues that may arise. Plan for maintenance, updates, and improvements to your solution over time. In Chapter 7, we give a detailed overview of deploying an MT backend and a RASA-based chatbot.
In the event that you haven’t found a suitable model for your purpose, you should already be thinking about how to obtain high-quality data for your domain. This data can be used to train a model or fine-tune an already existing one.
Fine-tuning involves taking a foundational model and adjusting it to better suit your specific task or domain. If a model for your exact language or task isn't readily available, you might consider fine-tuning from a foundational model or even a similar language.
For instance, imagine you're working on a translation model for Kinyarwanda, a Bantu language. While there might not be a pre-trained model for Kinyarwanda specifically, you could fine-tune a Swahili-English translation model to help with Kinyarwanda-English translation. Swahili and Kinyarwanda share similar linguistic characteristics as Bantu languages, making the fine-tuned model a valuable starting point.
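A condensed sketch of what such fine-tuning looks like with the Hugging Face Trainer API is shown below. The base model name, the tiny in-memory Kinyarwanda-English pairs, and the hyperparameters are illustrative only; a real run needs thousands of sentence pairs, careful preprocessing, and typically a GPU.

```python
# Condensed sketch: fine-tuning an existing translation model on new parallel data.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base_model = "Helsinki-NLP/opus-mt-sw-en"   # related-language model used as a starting point
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Placeholder parallel data: Kinyarwanda source sentences and English translations.
pairs = {"source": ["Muraho", "Murakoze"], "target": ["Hello", "Thank you"]}
dataset = Dataset.from_dict(pairs)

def preprocess(batch):
    # Tokenize sources as inputs and targets as labels.
    model_inputs = tokenizer(batch["source"], truncation=True)
    model_inputs["labels"] = tokenizer(text_target=batch["target"], truncation=True)["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="finetuned-rw-en", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```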
Ensuring you have enough high-quality data for training your model is crucial. The availability of data might vary based on your language and specific task. Sometimes, pre-existing datasets might be inadequate, leading to the need for data augmentation or active collection efforts. We will look at various open data resources in Section 6.3.
There are many sources of open data for language technology, such as Hugging Face, Kaggle, Common Crawl, and Common Voice. You can also find some examples of prominent open data resources in Section 6.3 of this chapter.
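Many of these resources can be pulled directly with the Hugging Face datasets library, as in the sketch below. The dataset identifiers and language codes are examples only and change over time; check the hub for current names, configurations, and license terms (Common Voice releases, for instance, require accepting the dataset terms and authenticating).

```python
# Sketch: loading open datasets with the Hugging Face `datasets` library.
from datasets import load_dataset

# Swahili subset of a Common Voice speech release (gated; accept the terms on the hub first).
speech = load_dataset("mozilla-foundation/common_voice_13_0", "sw", split="train")

# A text dataset, e.g. a news classification corpus covering African languages.
news = load_dataset("masakhane/masakhanews", "swa", split="train")

# Inspect which fields each example carries before building on the data.
print(speech[0].keys())
print(news[0].keys())
```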
The type of data you need depends on your use case and the task at hand. Look for data that is relevant, representative, reliable, and diverse for your problem, and consider its size, format, quality, and licensing.
Once you have collected enough good data, you can start experimenting with training your model.
To build a model, you need the skills, tools, and time to perform tasks such as data collection, preprocessing, modeling, training, testing, evaluation, and optimization. You may also need to collaborate with other experts or stakeholders to ensure the quality and usability of your model, and you should factor in the cost and availability of computational resources such as GPUs or TPUs.
If you can’t find the right dataset, or what you find isn’t good enough, you have the option to take matters into your own hands and collect your own.
Even when you have access to sufficient data, it’s good practice to design your project so that it accumulates data over time, ensuring that your system keeps improving.
Open data initiatives like Common Voice and Tatoeba provide valuable platforms where individuals voluntarily contribute by recording and translating sentences in various languages. These initiatives tap into the collective efforts of people around the world, resulting in diverse and substantial datasets that can be used for training and fine-tuning models. By harnessing the power of community-driven data collection, you can ensure the availability of relevant and authentic language resources tailored to your specific needs. We’ll discuss this further in 6.3 Open source data and models.
The amount of data you need to gather depends on how complex your problem is and how much data you are starting from. A rule of thumb is to collect a dataset that is as representative as possible of your real-world application.
Here's a simple way to think about it: Imagine you're creating a system that understands spoken words, like the kind of system that listens to what people say on the phone. If you want it to work well in the real world, you need data that's similar to the real phone conversations people will have. This data should reflect the different kinds of people who might use your system, like different ages, genders, dialects and levels of education. That way, your system can understand and respond to everyone.
Think about the words people will use, too. If your system is going to be used in a specific area, like banking, it's important to have recordings of conversations about banking. The words, phrases, and terms used in banking will be different from, say, a system meant for ordering food.
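One simple sanity check is to tabulate the metadata of the recordings you have collected and compare the distribution against the users and domain you expect to serve. The sketch below does this with a plain Counter; the field names ("gender", "age", "dialect") are hypothetical and depend on what your collection process records.

```python
# Sketch: checking how representative a speech dataset is by tabulating speaker metadata.
from collections import Counter

recordings = [
    {"gender": "female", "age": "20-29", "dialect": "urban"},
    {"gender": "male",   "age": "40-49", "dialect": "rural"},
    {"gender": "female", "age": "60-69", "dialect": "rural"},
    # ... one entry per collected recording
]

for field in ("gender", "age", "dialect"):
    counts = Counter(r[field] for r in recordings)
    total = sum(counts.values())
    # Print the share of recordings in each category to spot gaps in coverage.
    print(field, {k: f"{v / total:.0%}" for k, v in counts.items()})
```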
Mobilizing the community means actively engaging a group of people who share a common interest or expertise, often related to language or data. For language technology, it involves reaching out to individuals who can contribute valuable data or insights. This can include native speakers, domain experts, or volunteers. Strategies involve organizing workshops, online challenges, or forums where contributors can share recordings, translations, or annotations. Mobilizing the community helps amass diverse, high-quality data that improves the accuracy and effectiveness of language technology solutions.
Sometimes you have potential data sources that can be processed to obtain the necessary data. For instance, if your goal is to build a translation engine specialized in news articles, you can scrape news articles from websites and translate them into your target language, creating parallel data (refer to Masakhane’s work on news topic classification and news translation). Similarly, if you're developing a telephony application for banking, you can gather banking-related text to fine-tune your language model. In these cases, you would need to collect, clean, and restructure the data to prepare it for input into your model.
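A minimal scraping-and-cleaning sketch using requests and BeautifulSoup is shown below. The URL and the HTML structure are placeholders; you would adapt the selectors to the actual site, and respect its robots.txt and terms of use, as noted below.

```python
# Sketch: collecting raw text for a domain-specific corpus by scraping a news site.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/news/article-123"  # placeholder article URL
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Keep only the article body paragraphs and drop empty or very short lines.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
sentences = [p for p in paragraphs if len(p.split()) > 3]

# Append the cleaned text to a growing corpus file for later translation/alignment.
with open("news_corpus.txt", "a", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```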
When scraping data from websites, adhere to ethical guidelines and the website's terms of use. Prioritize sources that provide structured and relevant data while respecting privacy and legality.
Licensing is crucial when using external data or models. Ensure the data and models you use are legally accessible and that you comply with licensing terms. Additionally, consider licensing your own work to enable future sharing and collaboration. We’ll discuss open data licensing further in 6.3 Open source data and models.