7.1 Serving models through an API

Deploying open-source machine translation models typically involves implementing an application programming interface (API) to make your model accessible to users and applications. An API acts as a bridge between your application and the machine translation model, allowing seamless communication and integration. In the context of machine translation, an API allows you to send a source text to the model and receive the translated output in return, providing a standardized way for applications to communicate with the model.
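
From a client's perspective, this is a simple request-response exchange. The sketch below shows what calling such an API might look like; the URL, payload fields, and response format are placeholders that depend on how the API is designed.

```python
import requests

# Hypothetical translation API endpoint; the URL and field names are placeholders.
API_URL = "http://localhost:8000/translate"

payload = {
    "text": "Machine translation is becoming increasingly accessible.",
    "source_lang": "en",
    "target_lang": "de",
}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()

# Assuming the API returns JSON such as {"translation": "..."}
print(response.json()["translation"])
```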

Traditional server deployment involves setting up a dedicated server to host the model and serve translation requests. You'll need to configure the server environment, manage resources, and handle scaling as the number of requests increases. Frameworks like Flask, Django, or FastAPI can be used to create the API endpoints and handle the translation process. In 7.2.2 Deploying your own scalable Machine Translation API, we will present an example of a machine translation API built on the FastAPI framework.
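
As a preview, a minimal FastAPI translation service can fit in a few lines. The sketch below loads a Hugging Face translation model and exposes a single endpoint; the model name, route, and field names are illustrative and not the implementation presented in 7.2.2.

```python
# A minimal sketch of a translation API built with FastAPI.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup rather than once per request.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(request: TranslationRequest):
    result = translator(request.text)
    return {"translation": result[0]["translation_text"]}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
```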

Even though training machine learning models is generally feasible only with a GPU at hand, it is possible to serve ML models without one: even most large-scale models can be loaded, given sufficient memory, although a GPU would decrease inference time substantially. Depending on the real-time constraints of your application, this is something to take into consideration. On the other hand, GPU servers come at a higher cost due to their enhanced processing capabilities. Even in the absence of dedicated GPUs, the deployment of ML models remains viable, making it an accessible option for various scenarios.
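
In practice, libraries such as Hugging Face Transformers let you target whichever hardware is available at load time, as in the sketch below (the model name is just an example):

```python
import torch
from transformers import pipeline

# Use a GPU when one is available, otherwise fall back to CPU.
# Inference still works on CPU, just with higher latency.
device = 0 if torch.cuda.is_available() else -1  # pipeline convention: -1 means CPU

translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-de",  # example model
    device=device,
)

print(translator("Serving models on CPU is possible, only slower.")[0]["translation_text"])
```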

Frameworks for serving models

For streamlined deployment of PyTorch models in production environments, TorchServe provides a comprehensive solution. TorchServe facilitates efficient model serving with minimal latency, making it suitable for high-performance inference. The platform offers default handlers for common applications such as object detection and text classification, so no custom handler code needs to be written for these use cases. With advanced features like multi-model serving, model versioning for A/B testing, monitoring metrics, and RESTful endpoints, TorchServe seamlessly transitions models from research to production. It supports various machine learning environments, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2. Detailed information and documentation for using TorchServe can be found in its official documentation.
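
Once a model has been packaged into a model archive and registered with a running TorchServe instance, clients query it over TorchServe's REST inference API. In the sketch below, the model name and the input format are assumptions that depend on how the model archive and its handler were set up.

```python
import requests

# Query a model registered with a local TorchServe instance.
# "opus_mt_en_de" is a placeholder model name; the expected payload
# format is defined by the handler packaged with the model archive.
response = requests.post(
    "http://localhost:8080/predictions/opus_mt_en_de",
    data="Machine translation models can be served with TorchServe.",
    timeout=30,
)
print(response.text)
```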

TensorFlow Extended (TFX), on the other hand, is a comprehensive platform designed for deploying end-to-end production ML pipelines of TensorFlow-based models. As you're ready to move your models from research to production, TFX aids in creating and managing a production pipeline. A TFX pipeline consists of components that implement an ML pipeline tailored for scalable, high-performance machine learning tasks. These components can be built using TFX libraries, which can also be employed individually. More information about TFX and its capabilities can be accessed through its official documentation.
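
The sketch below illustrates what a small TFX pipeline definition can look like, loosely following the structure of the official TFX tutorials; the data path, the trainer module, and the push destination are placeholders rather than a ready-made translation pipeline.

```python
from tfx import v1 as tfx

def create_pipeline(pipeline_name: str, pipeline_root: str,
                    data_root: str, module_file: str):
    # Ingest training data from CSV files on disk.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Train a TensorFlow model using user code defined in module_file.
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=100),
        eval_args=tfx.proto.EvalArgs(num_steps=10),
    )

    # Push the trained model to a serving directory.
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory="/serving/models")),
    )

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, trainer, pusher],
    )

# Run the pipeline locally; other orchestrators (e.g. Kubeflow, Airflow) are supported.
tfx.orchestration.LocalDagRunner().run(
    create_pipeline("mt_pipeline", "/pipelines/mt", "/data/csv", "trainer_module.py"))
```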

Hugging Face Inference Endpoints provides a secure and convenient solution for deploying Hugging Face Transformers, Sentence-Transformers, and Diffusers models from the Hub onto dedicated and auto-scaling infrastructure managed by Hugging Face. This service enables you to deploy models without the need to rent and administer a server, as you pay Hugging Face for their managed infrastructure instead. An Inference Endpoint is created from a Hugging Face Model Repository; the service builds image artifacts from the chosen model or from a custom-provided container image. These image artifacts are detached from the Hugging Face Hub source repositories to ensure heightened security and reliability.

Hugging Face Inference Endpoints supports all tasks associated with Hugging Face Transformers, Sentence-Transformers, and Diffusers, as well as custom tasks not currently supported by Hugging Face Transformers, such as speaker diarization. Additionally, the service allows for the use of a custom container image managed externally via services like Docker Hub, AWS Elastic Container Registry, Azure Container Registry, or Google Artifact Registry. This deployment approach proves particularly advantageous if you seek to sidestep the complexities of managing your own server while leveraging the capabilities of state-of-the-art language models for your applications.
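
After deployment, an Inference Endpoint is queried over HTTPS with an access token. The sketch below shows one way to call it with plain HTTP requests; the endpoint URL and token are placeholders, and the exact response shape depends on the deployed model and task.

```python
import requests

# Placeholders: use the URL shown for your endpoint and your own access token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."  # Hugging Face access token with permission to call the endpoint

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Inference Endpoints remove the need to manage a server."},
    timeout=30,
)
# For a translation model, the response is typically a list such as
# [{"translation_text": "..."}].
print(response.json())
```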

Serverless Deployment

Serverless deployment offers a more hands-off approach, allowing you to focus on the code rather than infrastructure management. Platforms like AWS Lambda, Google Cloud Functions, and Azure Functions enable you to deploy your API without provisioning servers. These platforms automatically scale based on demand, which can be particularly advantageous for handling varying workloads.

The potential benefits of serverless computing become clear when considering the 'classic' workflow for setting up a new server in the cloud. With traditional methods, creating a new virtual machine (VM), configuring the machine, setting up NGINX, and managing auto-scaling rules consume significant time and energy. Additionally, you're billed for every second of server uptime, regardless of usage patterns. In contrast, serverless deployment platforms alleviate these concerns, allowing you to focus on writing and deploying your application's core code. This approach is particularly cost-effective for apps with variable usage patterns, as you only pay for the resources you actually use.
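
On AWS Lambda, for instance, the unit of deployment is a handler function that the platform invokes per request. The sketch below shows a minimal handler for a translation API behind API Gateway; the translate() helper is a stand-in for whatever inference backend you package with the function (for example, a small CPU model in a container image or a call to an external translation service), since size and memory limits make bundling large models impractical.

```python
import json

def translate(text: str, source_lang: str, target_lang: str) -> str:
    # Placeholder: plug in your translation backend here.
    raise NotImplementedError("translation backend not configured")

def lambda_handler(event, context):
    # API Gateway proxy integration passes the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    translation = translate(
        body.get("text", ""),
        body.get("source_lang", "en"),
        body.get("target_lang", "de"),
    )
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"translation": translation}),
    }
```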

As you consider serverless deployment, it's crucial to be aware of certain drawbacks. When serverless functions have been inactive, there can be a brief lag (a so-called cold start) when they're invoked again, affecting responsiveness. Limited resources and configurability may hinder tasks requiring extensive CPU or memory usage. Debugging can be challenging due to less control over the runtime environment, especially when dealing with complex architectures. Moreover, vendor lock-in is a concern, as cloud providers often offer specific frameworks for serverless deployment, potentially complicating migration to different platforms.