Language AI Playbook

7.1 Serving models through an API


Deploying open-source machine translation models involves implementing an application programming interface (API) to make your model accessible to users and applications. An API acts as a bridge between your application and the machine translation model, providing a standardized way for the two to communicate: in the case of machine translation, it allows a client to send a source text to the model and receive the translated output in return.
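As a purely illustrative sketch, the snippet below shows what such a client interaction might look like from the caller's side. The endpoint URL and the text, source_lang, target_lang, and translation fields are hypothetical placeholders rather than a fixed standard; your own API may use different names.

```python
import requests

# Hypothetical endpoint; replace with the URL of your deployed translation API.
API_URL = "https://example.org/api/translate"

payload = {
    "text": "Plant the seedlings after the first rains.",
    "source_lang": "en",
    "target_lang": "sw",
}

# Send the source text to the model behind the API and read the translation back.
response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["translation"])
```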

Traditional server deployment involves setting up a dedicated server to host the model and serve requests. You'll need to configure the server environment, manage resources, and handle scaling as the number of requests increases. Web frameworks like Flask, Django, or FastAPI can be used to create the API endpoints and handle the translation process. In 7.2.2 Deploying your own scalable Machine Translation API, we present an example of a machine translation API built on the FastAPI framework.
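Ahead of that fuller example, here is a minimal sketch of what a FastAPI endpoint for translation could look like. The /translate route, the request fields, and the translate_text stub are assumptions for illustration; in a real deployment the stub would call your loaded model.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    # Stand-in for a real model call (e.g. an MT model loaded at startup).
    return f"[{source_lang}->{target_lang}] {text}"

@app.post("/translate")
def translate(request: TranslationRequest):
    translation = translate_text(request.text, request.source_lang, request.target_lang)
    return {"translation": translation}
```

Saved as main.py, this can be served locally with, for example, uvicorn main:app --reload.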

Although training machine learning models is generally feasible only with a GPU at hand, it is possible to serve ML models without one: even most large-scale models can be loaded for inference, given sufficient memory. A GPU does, however, decrease inference time substantially, so the real-time constraints of your application are something to take into consideration. At the same time, GPU servers come at a higher cost due to their enhanced processing capabilities. Even in the absence of dedicated GPUs, deploying ML models remains viable, making it an accessible option for many scenarios.
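In practice, this often comes down to a single device check at load time. The sketch below uses a Hugging Face Transformers pipeline with a small, freely available checkpoint (t5-small, chosen purely as an example) and falls back to CPU inference when no GPU is present.

```python
import torch
from transformers import pipeline

# Use a GPU if one is available, otherwise fall back to CPU inference.
device = 0 if torch.cuda.is_available() else -1  # pipeline() takes a device index; -1 = CPU

# Illustrative model choice; substitute the checkpoint you actually serve.
translator = pipeline("translation_en_to_fr", model="t5-small", device=device)

result = translator("Serving models without a GPU is possible, just slower.")
print(result[0]["translation_text"])
```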

Frameworks for serving models

For streamlined deployment of PyTorch models in production environments, TorchServe provides a comprehensive solution. TorchServe facilitates efficient model serving with minimal latency, making it suitable for high-performance inference. The platform offers default handlers for common applications such as object detection and text classification, eliminating the need to write custom serving code. With advanced features like multi-model serving, model versioning for A/B testing, monitoring metrics, and RESTful endpoints, TorchServe helps move models from research to production. It supports various machine learning environments, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2. Detailed information and documentation for using TorchServe can be found in its official documentation.
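Once a model has been archived and registered with TorchServe, it is exposed through the inference API, which listens on port 8080 by default and serves predictions under /predictions/<model_name>. The sketch below queries such an endpoint; the model name my_translator is a hypothetical example, and it assumes a TorchServe instance is already running locally with that model loaded.

```python
import requests

# Assumes a local TorchServe instance with a model registered under the
# (hypothetical) name "my_translator"; 8080 is TorchServe's default inference port.
response = requests.post(
    "http://localhost:8080/predictions/my_translator",
    data="Text to translate".encode("utf-8"),
)
response.raise_for_status()
print(response.text)
```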

TensorFlow Extended (TFX), on the other hand, is a comprehensive platform designed for deploying end-to-end production ML pipelines of TensorFlow-based models. As you're ready to move your models from research to production, TFX aids in creating and managing a production pipeline. A TFX pipeline consists of components that implement an ML pipeline tailored for scalable, high-performance machine learning tasks. These components can be built using TFX libraries, which can also be employed individually. More information about TFX and its capabilities can be accessed through its official documentation.
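As a rough orientation, a TFX pipeline is assembled from component objects and handed to a runner. The sketch below follows the pattern used in the official TFX tutorials; the data directory, trainer module, and pipeline paths are placeholders, and a real pipeline would typically add further components (for example, for data validation, evaluation, and pushing the trained model).

```python
from tfx import v1 as tfx

# Ingest training data from CSV files (placeholder path).
example_gen = tfx.components.CsvExampleGen(input_base="data/")

# Train a model using user-provided training code (placeholder module file).
trainer = tfx.components.Trainer(
    module_file="trainer_module.py",
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=100),
    eval_args=tfx.proto.EvalArgs(num_steps=10),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="demo_pipeline",
    pipeline_root="pipeline_root/",
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config("metadata.db"),
    components=[example_gen, trainer],
)

# Run locally; production deployments typically use an orchestrator such as
# Kubeflow Pipelines or Apache Airflow instead.
tfx.orchestration.LocalDagRunner().run(pipeline)
```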

Hugging Face Inference Endpoints provides a secure and convenient solution for deploying Hugging Face Transformers, Sentence-Transformers, and Diffusion models from the Hub onto dedicated, auto-scaling infrastructure managed by Hugging Face. This service enables you to deploy models without the need to rent and administer a server yourself; instead, you pay Hugging Face for their managed infrastructure. A Hugging Face Endpoint is created from a Hugging Face Model Repository, which is used to generate image artifacts from the chosen model or from a custom-provided container image. These image artifacts are detached from the Hugging Face Hub source repositories to ensure heightened security and reliability.

Hugging Face Inference Endpoints supports all tasks associated with Hugging Face Transformers, Sentence-Transformers, and Diffusion, as well as custom tasks not currently supported by Hugging Face Transformers, such as speaker diarization and diffusion. Additionally, the service allows for the use of a custom container image managed externally via services like Docker Hub, AWS Elastic Container Registry, Azure Container Registry, or Google Artifact Registry. This deployment approach proves particularly advantageous if you want to sidestep the complexities of managing your own server while leveraging state-of-the-art language models in your applications.
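After an endpoint has been deployed, calling it is a plain HTTPS request. The sketch below assumes a deployed endpoint URL (shown in the Inference Endpoints interface after deployment) and a Hugging Face access token with permission to call it; both values here are placeholders, and the exact shape of the response depends on the task the deployed model performs.

```python
import requests

# Placeholders: copy the real URL from the Inference Endpoints interface and
# use a Hugging Face access token that is allowed to call the endpoint.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}

response = requests.post(ENDPOINT_URL, headers=headers, json={"inputs": "Text to translate."})
response.raise_for_status()
print(response.json())
```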

Serverless Deployment

Serverless deployment offers a more hands-off approach, allowing you to focus on the code rather than infrastructure management. Platforms like AWS Lambda, Google Cloud Functions, and Azure Functions enable you to deploy your API without provisioning servers. These platforms automatically scale based on demand, which can be particularly advantageous for handling varying workloads.
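To make the idea concrete, the sketch below shows the general shape of an AWS Lambda handler behind an API Gateway proxy integration; the request and response fields are illustrative, and the model call is reduced to a placeholder. Google Cloud Functions and Azure Functions use different handler signatures but follow the same pattern of writing a single function instead of managing a server.

```python
import json

# In a real function, load the model (or a client for a hosted model) here,
# outside the handler, so it is reused across warm invocations.

def lambda_handler(event, context):
    # API Gateway proxy integrations deliver the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    text = body.get("text", "")

    # Placeholder for the actual translation call.
    translation = f"(translated) {text}"

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"translation": translation}),
    }
```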

The potential benefits of serverless computing become clear when considering the 'classic' workflow for setting up a new server in the cloud. With traditional methods, creating a new virtual machine (VM), configuring the machine, setting up NGINX, and managing auto-scaling rules consume significant time and energy. Additionally, you're billed for every second of server uptime, regardless of usage patterns. In contrast, serverless deployment platforms alleviate these concerns, allowing you to focus on writing and deploying your application's core code. This approach is particularly cost-effective for apps with variable usage patterns, as you only pay for the resources you actually use.

As you consider serverless deployment, it's important to be aware of certain drawbacks. When serverless functions have been inactive, there can be a brief lag (a "cold start") when they're invoked again, affecting responsiveness. Limited resources and configurability may hinder tasks requiring extensive CPU or memory usage, such as serving large models. Debugging can be challenging due to reduced control over the runtime environment, especially when dealing with complex architectures. Moreover, vendor lock-in is a concern, as cloud providers offer their own frameworks for serverless deployment, potentially complicating migration to a different platform.
