6.3 Open source data and models

In this section, we delve into the world of open data sets and models, pivotal resources that have catalyzed the advancement of language technology. Open data sets are collections of data that are freely available to the public, fostering collaboration and innovation. We suggest adopting an open data strategy when building language technology solutions: it ensures that a wider community can participate in and benefit from the data you create.

We will now explore where to find such data, important considerations like licensing, and the significance of these resources in driving progress.

Discovering the NLP landscape with Hugging Face

A notable platform in the realm of open data sets and models is Hugging Face. It offers an extensive collection of pre-trained models and datasets. This platform not only facilitates model search but also offers interactive demonstrations of these models, allowing you to witness their capabilities firsthand. Hugging Face is a hub for initiatives by individuals, organizations, and communities, offering a wealth of language technology resources.
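As a minimal illustration, the sketch below pulls an open dataset and a pre-trained model from the Hub using the `datasets` and `transformers` libraries; the `imdb` dataset and the default sentiment-analysis pipeline are examples only, not recommendations for any particular task.

```python
# Minimal sketch of pulling open resources from the Hugging Face Hub.
# Assumes `pip install datasets transformers`; the identifiers below are
# illustrative examples, not recommendations.
from datasets import load_dataset
from transformers import pipeline

# Download an open dataset by its Hub identifier (here only the first 100 rows).
dataset = load_dataset("imdb", split="train[:100]")
print(dataset[0]["text"][:200])

# Load a pre-trained model behind a ready-made task pipeline.
classifier = pipeline("sentiment-analysis")
print(classifier("Open data accelerates language technology."))
```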

OPUS

When in search of targeted data sets for specific tasks, platforms like OPUS emerge as invaluable resources, particularly for machine translation data. The OPUS project, an initiative from the University of Helsinki's NLP group, serves as a veritable goldmine of translated texts sourced from across the web. This initiative aims to transform and align freely available online content, augmenting it with linguistic annotations. The outcome is a meticulously curated parallel corpus that is publicly accessible. The OPUS project operates on the principles of open source, offering the corpus as an open content package. The compilation of this corpus relies on a variety of open-source tools, and the entire pre-processing pipeline is automated so that the data is served in formats readily usable with machine translation libraries.

While both Hugging Face and OPUS fall within the realm of open data, they diverge in scope and offerings. OPUS stands as a specialized resource focused on parallel data, making it a potent asset for machine translation tasks. The platform operates in a moderated environment in which content is carefully curated, ensuring the quality and relevance of the provided data. Unlike Hugging Face, OPUS is designed primarily to host parallel corpora and does not allow users to upload their own datasets. This tailored approach to parallel data signifies OPUS's commitment to providing high-quality language resources for the language technology community.
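Several OPUS corpora are also mirrored on the Hugging Face Hub, so a quick way to experiment is through the `datasets` library; the sketch below assumes the `opus_books` corpus and its `en-fr` configuration purely as an illustrative example, and the same data can be downloaded directly from the OPUS website in formats such as Moses or TMX.

```python
# Illustrative sketch: loading an OPUS-derived parallel corpus through the
# `datasets` library. The dataset name and language pair are examples only;
# the same corpus can be downloaded directly from the OPUS portal.
from datasets import load_dataset

books = load_dataset("opus_books", "en-fr", split="train")

# Each record holds an aligned sentence pair under the "translation" key.
pair = books[0]["translation"]
print(pair["en"])
print(pair["fr"])
```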

Speech data

Within the realm of language technology, speech data forms the cornerstone for fueling technologies like automatic speech recognition (ASR) and text-to-speech (TTS). ASR enables machines to transcribe spoken language into text, while TTS transforms text into natural-sounding speech. To drive the development of these technologies, open speech datasets are indispensable. Among the valuable resources is OpenSLR, a repository of curated open-source speech and audio datasets. This platform provides datasets that serve as the bedrock for training and evaluating some of the most important foundational ASR and TTS models.
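To make this concrete, the sketch below transcribes an audio file with an openly released ASR model through the `transformers` pipeline API; the `openai/whisper-tiny` checkpoint and the file path are placeholders chosen for illustration, and a local audio backend such as ffmpeg is assumed.

```python
# Sketch of transcribing an audio file with an openly released ASR model.
# Assumes `pip install transformers` plus an audio backend such as ffmpeg;
# the model identifier and the file path are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr("recording.wav")  # path to a local audio file
print(result["text"])
```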

Common Voice: A Paradigm Shift for Inclusive Speech Data

In the landscape of speech data, the Common Voice initiative stands out as a transformative force, reshaping the trajectory of voice-enabled technology. Common Voice hosts a publicly accessible voice dataset that owes its existence to the collaborative contributions of volunteers worldwide. Rather than being owned by corporations, this platform thrives on community-driven efforts, offering an extensive dataset for training machine learning models used in ASR and TTS applications.

Engaging with the Common Voice movement is straightforward yet impactful. Participants can register to create accounts, enabling them to actively contribute to expanding the dataset while keeping track of their progress. The process involves crafting sentences, recording them, and validating the contributions of others. This collective approach not only enriches the dataset but also empowers contributors with a sense of ownership in driving language technology forward. The datasets are downloadable with CC-0 (Public domain) licenses through their portal.
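Once downloaded, a Common Voice release is essentially a folder of audio clips plus tab-separated metadata files. The sketch below inspects such a release with pandas; the directory path is a placeholder, and the exact file names and columns can vary between dataset versions.

```python
# Sketch of inspecting a downloaded Common Voice release with pandas.
# The layout (a clips/ folder plus TSV metadata such as validated.tsv) and
# the column names may differ slightly between dataset versions.
import pandas as pd

validated = pd.read_csv("cv-corpus/rw/validated.tsv", sep="\t")  # placeholder path

# Each row pairs an audio clip with its prompted sentence.
print(len(validated), "validated clips")
print(validated[["path", "sentence"]].head())
```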

An illustrative triumph of Common Voice lies in languages once deemed low-resource. Languages like Kinyarwanda and Catalan, historically lacking comprehensive resources, have experienced a resurgence driven by community involvement. Through local campaigns (see Digital Umuganda (LINK) and Aina (https://www.projecteaina.cat/)), these languages have flourished with diverse contributions and now rank among the top languages in terms of contributed hours.

As of 11 August 2023, Common Voice supports data contribution in 123 languages, with many more under development. If your language is not on this list yet, you can ask for it to be included. The tasks to complete before a new language can start collecting data are:

  • Localization of the site through Mozilla’s platform Pontoon

  • Collection of sentences released under a CC-0 (Public domain) license

Licensing and sharing of data and models

Licensing stands as a cornerstone in the realm of open data and models, dictating the rules of engagement for usage and sharing. Diverse licenses exist, each carrying its own set of permissions and restrictions. Understanding these licenses is paramount to responsibly leveraging open resources in your language technology projects. In the open data space, here are some commonly used Creative Commons (CC) licenses that you might encounter:

  1. CC BY (Attribution): This license permits you to use, modify, and share the data or model, even for commercial purposes, as long as you provide appropriate attribution to the original creator.

  2. CC BY-SA (Attribution-ShareAlike): Similar to CC BY, this license allows usage, modification, and sharing, but any derivative works must be licensed under the same terms. This ensures that subsequent work remains open as well.

  3. CC BY-NC (Attribution-NonCommercial): This license permits usage, modification, and sharing for non-commercial purposes while requiring attribution to the original creator.

  4. CC BY-ND (Attribution-NoDerivatives): Under this license, you can use and share the work, but modifications or derivatives are not allowed. Proper attribution is necessary.

  5. CC BY-NC-SA (Attribution-NonCommercial-ShareAlike): Like CC BY-SA, this license allows modification and sharing, but only for non-commercial purposes. Any derivative works must be licensed under the same terms.

  6. CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives): This is the most restrictive CC license. You can download and share the work, but you can’t modify it or use it commercially.
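On platforms such as Hugging Face, the declared license of a model or dataset is part of its repository metadata, so it can be checked programmatically before reuse. The sketch below uses the `huggingface_hub` client; the identifiers are examples, and a license only appears here if the uploader declared one.

```python
# Sketch of checking the declared license of resources on the Hugging Face
# Hub before reuse. Identifiers are examples; the license shows up only if
# the uploader declared it in the repository metadata.
from huggingface_hub import dataset_info, model_info

model = model_info("bert-base-uncased")
print([t for t in model.tags if t.startswith("license:")])

dataset = dataset_info("imdb")
print([t for t in dataset.tags if t.startswith("license:")])
```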

Adhering to the terms of these licenses ensures ethical and legal use of the resources you encounter. While utilizing open data and models is vital, the spirit of collaboration is equally significant. Contributing back to the community by sharing your own fine-tuned models and datasets, even if they aren't perfect, reinforces the communal nature of language technology advancement. This practice enables others to build upon your work, collaboratively enhancing the capabilities of language models, fostering innovation, and nurturing a dynamic ecosystem of collective growth.
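As a sketch of what contributing back can look like in practice, the snippet below pushes a locally fine-tuned model and its tokenizer to the Hugging Face Hub; the local path and repository name are placeholders, it assumes you are authenticated (for example via `huggingface-cli login`), and a model card stating the license should accompany the upload.

```python
# Sketch of sharing a fine-tuned model back to the community. Assumes you
# are logged in to the Hub; the local path and repository name below are
# placeholders, and a model card declaring the license should be added.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./my-finetuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")

model.push_to_hub("my-username/my-finetuned-model")
tokenizer.push_to_hub("my-username/my-finetuned-model")
```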
