Infrastructure for Model Serving
There are two main options for model-serving infrastructure:
- On-Premises: In this approach, you have all the hardware and software required for running your models on your own premises.
- pros: provides you with complete control and flexibility to adapt to changes quickly.
- cons: it can be complicated and expensive to procure, install, configure, and maintain the hardware infrastructure.
- Cloud Provider: Alternatively, you can outsource your infrastructure needs to a cloud provider like Amazon Web Services, Google Cloud Platform, or Microsoft Azure.
- pros: Cloud providers manage the hardware infrastructure for you, offering services such as pipeline management.
- cons: it can sometimes be more expensive.
Model Servers
Model servers are responsible for instantiating machine learning models and exposing the desired methods to clients. Some popular model servers include:
- TensorFlow Serving: Serves TensorFlow models and provides an API for clients to make predictions.
- TorchServe: Serves PyTorch models and offers an interface for making predictions.
- KServe (formerly KFServing): Designed for serving models on Kubernetes within a Kubeflow environment.
- NVIDIA Triton Inference Server: A server optimized for NVIDIA GPU-accelerated models.
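As a concrete illustration of the interface these servers expose, the sketch below sends a prediction request to a TensorFlow Serving instance over its REST API. It assumes a server running locally on the default REST port (8501) and a model registered under the placeholder name `my_model`; the input shape is likewise made up and must match your model's signature.

```python
import json

import requests  # third-party HTTP client (pip install requests)

# Assumptions: TensorFlow Serving runs locally on its default REST
# port (8501) and serves a model registered as "my_model" (placeholder).
SERVER_URL = "http://localhost:8501/v1/models/my_model:predict"

def predict(instances):
    """Send a batch of inputs to the model server and return predictions."""
    payload = json.dumps({"instances": instances})
    response = requests.post(SERVER_URL, data=payload, timeout=5.0)
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    # Two example feature vectors; the shape must match the served model.
    print(predict([[1.0, 2.0, 5.0], [3.0, 4.0, 6.0]]))
```

TorchServe, KServe, and Triton follow the same pattern, each with its own endpoint and payload conventions.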
Optimization
Strategies for Optimization
There are several areas where optimization can be applied:
- Infrastructure: Scale the infrastructure by adding more powerful hardware or using containerized and virtualized environments. This allows for efficient resource allocation and scaling based on demand.
- Vertical Scaling: upgrading to more powerful hardware on a single machine, e.g., faster CPUs, more RAM, or newer GPUs.
- Horizontal Scaling: adding more machines (replicas) as the workload increases and distributing requests across them; see the round-robin sketch after this list.
- Model Architecture: Understand the trade-off between inference speed and accuracy. If a slightly less accurate model significantly improves latency, it may be worth considering. Also consider model resource optimizations such as quantization or pruning; see the quantization sketch after this list.
- Model Compilation: If the target hardware is known, compile the model to create a model artifact and execution runtime optimized for that hardware. This can reduce memory consumption and latency; see the compilation sketch after this list.
- Application Layer: Optimize the application layer by caching frequently accessed data (e.g., in a feature store) or precomputing certain predictions. For example, popular product recommendations can be stored in faster storage (a key-value store) to minimize database hits; see the caching sketch after this list.
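To make horizontal scaling concrete, here is a minimal client-side sketch that spreads requests round-robin across several identical model-server replicas. In production this routing is normally handled by a load balancer or a Kubernetes Service rather than by the client; the replica URLs and model name below are placeholders.

```python
import itertools

import requests

# Hypothetical replica endpoints; in practice a load balancer or
# Kubernetes Service would sit in front of these instead.
REPLICAS = [
    "http://model-server-0:8501/v1/models/my_model:predict",
    "http://model-server-1:8501/v1/models/my_model:predict",
    "http://model-server-2:8501/v1/models/my_model:predict",
]
_next_replica = itertools.cycle(REPLICAS)

def predict(instances):
    """Route each request to the next replica in round-robin order."""
    url = next(_next_replica)
    response = requests.post(url, json={"instances": instances}, timeout=5.0)
    response.raise_for_status()
    return response.json()["predictions"]
```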
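For the speed/accuracy trade-off in the model architecture, one standard resource optimization is post-training dynamic quantization, which stores weights as 8-bit integers and typically reduces model size and CPU latency at a small accuracy cost. A minimal PyTorch sketch, with a toy two-layer network standing in for a real model:

```python
import torch
import torch.nn as nn

# Toy stand-in; substitute your trained network here.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Quantize the Linear layers' weights to int8. This usually shrinks
# the model and speeds up CPU inference, at a small accuracy cost
# that should be measured on a validation set.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, cheaper inference
```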
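For model compilation, the sketch below uses TorchScript tracing as one example: it compiles a PyTorch model into a self-contained artifact that a serving runtime (TorchServe, or the C++ libtorch runtime) can load without the original Python code. Hardware-specific compilers such as TensorRT or exchange formats such as ONNX follow the same idea; the toy model is again a stand-in.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Trace the model with an example input to produce a compiled,
# serializable artifact independent of the Python class definitions.
example_input = torch.randn(1, 128)
traced = torch.jit.trace(model, example_input)

# The saved artifact can be deployed to the target runtime.
traced.save("model_traced.pt")
reloaded = torch.jit.load("model_traced.pt")
print(reloaded(example_input).shape)
```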
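Finally, for the application layer, a minimal caching sketch: precomputed recommendations for popular products are kept in an in-memory key-value store so that most requests never reach the model or the database. A production system would typically use Redis or a similar store; the dictionary and the fallback function here are placeholders.

```python
from typing import Dict, List

# Stand-in for a key-value store such as Redis: maps a product id
# to a precomputed list of recommended product ids.
recommendation_cache: Dict[str, List[str]] = {
    "popular-product-1": ["acc-1", "acc-2", "acc-3"],
}

def compute_recommendations(product_id: str) -> List[str]:
    """Placeholder for the expensive path: model inference plus DB lookups."""
    return ["fallback-1", "fallback-2"]

def get_recommendations(product_id: str) -> List[str]:
    # Serve from the cache when possible; otherwise compute once
    # and store the result for subsequent requests.
    if product_id in recommendation_cache:
        return recommendation_cache[product_id]
    recs = compute_recommendations(product_id)
    recommendation_cache[product_id] = recs
    return recs

print(get_recommendations("popular-product-1"))  # cache hit
print(get_recommendations("rare-product-9"))     # computed, then cached
```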