ML Model Serving


img

Infrastructure for Model Serving

There are two main options for serving infrastructure:

  1. On-Premises: In this approach, you have all the hardware and software required for running your models on your own premises.

    1. Pros: Provides complete control and the flexibility to adapt to changes quickly.
    2. Cons: Procuring, installing, configuring, and maintaining the hardware can be complicated and expensive.
  2. Cloud Provider: Alternatively, you can outsource your infrastructure needs to a cloud provider like Amazon Web Services, Google Cloud Platform, or Microsoft Azure.

    1. Pros: The cloud provider manages the hardware infrastructure for you and offers managed services such as pipeline management.
    2. Cons: Can be more expensive than running your own hardware.

Model Servers

Model servers are responsible for instantiating machine learning models and exposing the desired methods to clients. Some popular model servers include:

  1. TensorFlow Serving: Serves TensorFlow models and provides an API for clients to make predictions (see the client sketch after this list).
  2. TorchServe: Serves PyTorch models and offers an interface for making predictions.
  3. KServe (formerly KFServing): Designed for serving models on Kubernetes and commonly used within Kubeflow.
  4. NVIDIA Triton Inference Server: An inference server optimized for NVIDIA GPUs that can serve models from multiple frameworks.
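
To make the client side concrete, here is a minimal sketch of calling a model behind TensorFlow Serving's REST API. The host, port, and model name (`demo_model`) are placeholders; they depend on how the server was started, and the input shape must match the served model's signature.

```python
# Hypothetical client for a model served via TensorFlow Serving's REST API.
# Assumes a model named "demo_model" is being served on localhost:8501
# (the default REST port); host, port, and model name are placeholders.
import json
import requests

SERVER_URL = "http://localhost:8501/v1/models/demo_model:predict"

def predict(instances):
    """Send a batch of input rows to the model server and return predictions."""
    payload = json.dumps({"instances": instances})
    response = requests.post(SERVER_URL, data=payload, timeout=5)
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    # Two example feature vectors; shapes must match the served model's signature.
    print(predict([[1.0, 2.0, 5.0], [3.0, 4.0, 6.0]]))
```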

Optimization

Strategies for Optimization

There are several areas where optimization can be applied:

  1. Infrastructure: Scale the infrastructure by adding more powerful hardware or using containerized and virtualized environments. This allows for efficient resource allocation and scaling based on demand.

    1. Vertical Scaling: Use more powerful hardware by upgrading CPUs, adding RAM, or moving to newer GPUs.
    2. Horizontal Scaling: Add more machines as the workload increases.
  2. Model Architecture: Understand the trade-off between inference speed and accuracy; if a slightly less accurate model significantly improves latency, it may be worth using. Also consider model resource optimizations such as quantization (see the quantization sketch after this list).

  3. Model Compilation: If the target hardware is known, compile the model to produce a model artifact and execution runtime optimized for that hardware. This can reduce memory consumption and latency (see the compilation sketch after this list).

  4. Application Layer: Optimize the application layer by caching frequently accessed data (e.g., in a feature store) or precomputing certain predictions. For example, popular product recommendations can be stored in a fast key-value store to minimize database hits (see the caching sketch after this list).
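
As one example of model resource optimization, the sketch below applies post-training dynamic quantization with PyTorch, trading a small amount of accuracy for lower memory use and faster CPU inference. The model is a toy stand-in; the same call applies to a real model.

```python
# Minimal sketch of post-training dynamic quantization (PyTorch).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()

# Quantize the Linear layers' weights to int8; activations are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # torch.Size([1, 10])
```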
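For compilation, one common path is exporting a model to ONNX and running it with ONNX Runtime, which applies graph optimizations for the target hardware. This is a rough sketch; the model, file name, and input shape are illustrative placeholders, and other toolchains (TensorRT, TFLite, torch.compile) follow the same idea.

```python
# Rough sketch: export a PyTorch model to ONNX and run it with ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 2))
model.eval()

dummy_input = torch.randn(1, 16)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Create an optimized execution session; other execution providers (e.g. CUDA)
# can be requested when the matching hardware and build are available.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy_input.numpy().astype(np.float32)})
print(outputs[0].shape)  # (1, 2)
```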
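For the application layer, the following is a minimal caching sketch: frequently requested predictions are stored in a key-value structure with a time-to-live, here a plain in-process dict. In production this would typically be an external store such as Redis, and `recommend_products` stands in for the expensive model call or database query.

```python
# Minimal sketch of caching frequently requested predictions with a TTL.
import time

CACHE = {}           # key -> (expiry_timestamp, cached_value)
TTL_SECONDS = 300    # how long a cached recommendation stays fresh

def recommend_products(user_id):
    """Placeholder for an expensive model call or database query."""
    return [f"product_{user_id}_{i}" for i in range(3)]

def cached_recommendations(user_id):
    now = time.time()
    hit = CACHE.get(user_id)
    if hit is not None and hit[0] > now:
        return hit[1]                      # serve the precomputed result
    value = recommend_products(user_id)    # fall back to the expensive path
    CACHE[user_id] = (now + TTL_SECONDS, value)
    return value

print(cached_recommendations(42))  # computed once...
print(cached_recommendations(42))  # ...then served from the cache
```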

