MLOps


Introduction to MLOps

MLOps, or Machine Learning Operations, refers to the practices and processes involved in managing and deploying machine learning models in production. It combines the principles of software engineering and DevOps with the unique challenges presented by machine learning systems.

img source: https://neptune.ai/blog/mlops

Challenges in ML Engineering

There are several challenges in ML engineering that drive the need for MLOps:

  1. Slow Deployment: Deploying ML models to production can take days, weeks, or even months, resulting in missed opportunities or in models that are already outdated by the time they ship.

  2. Lack of Tracking and Reproducibility: Traditional data science projects often lack proper tracking and reproducibility, making collaboration and iteration difficult. Manual tracking, lack of provenance, and non-reproducible models are common issues.

  3. Insufficient Monitoring: Once models are deployed, their performance and the data they receive at inference time often go unmonitored.
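To make the tracking and reproducibility point concrete, here is a minimal sketch of a run record that ties a model's metrics to the exact parameters and data that produced them. The helper names (fingerprint, make_run_record) and the example values are hypothetical, chosen only for illustration; a real project would more likely rely on a dedicated experiment tracker.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable object."""
    # sort_keys makes the digest independent of dictionary key order.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_run_record(params: dict, data_rows: list, metrics: dict) -> dict:
    """Bundle what is needed to trace a model back to its inputs."""
    return {
        "params": params,
        "params_hash": fingerprint(params),
        "data_hash": fingerprint(data_rows),
        "metrics": metrics,
    }


# Illustrative values only; they do not come from a real experiment.
record = make_run_record(
    params={"learning_rate": 0.1, "max_depth": 3},
    data_rows=[[1.0, 2.0, 0], [3.0, 4.0, 1]],
    metrics={"accuracy": 0.5},
)
print(json.dumps(record, indent=2))
```

Two runs with the same parameters and data produce the same hashes, so a changed hash signals that something in the inputs differs between runs.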

MLOps and DevOps

MLOps can be seen as an extension of DevOps principles, adapted for machine learning systems. DevOps focuses on developing and managing software systems, aiming to reduce development cycles, increase deployment velocity, and ensure high-quality software releases.

Key differences between DevOps and MLOps include:

  • Continuous Integration (CI): In MLOps, continuous integration involves testing and validating not only code and components but also data schemas and models.

  • Continuous Delivery (CD): MLOps goes beyond deploying a single software package or service. It involves deploying a system (the ML pipeline) that in turn automatically deploys a service (the ML model).

  • Continuous Training (CT): MLOps adds a step with no direct DevOps counterpart, which is automatically retraining and serving models.
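As an illustration of CI that tests data schemas as well as code, the sketch below shows a schema check that could run as an ordinary unit test. The column names in EXPECTED_SCHEMA and the validate_schema helper are assumptions made for this example, not part of any particular library.

```python
# Hypothetical schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"age": int, "income": float, "label": int}


def validate_schema(rows: list, schema: dict) -> list:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected columns {sorted(extra)}")
        for name, expected_type in schema.items():
            if name in row and not isinstance(row[name], expected_type):
                errors.append(f"row {i}: {name} should be {expected_type.__name__}")
    return errors


def test_training_data_matches_schema():
    """A CI-style test: the build fails if the data no longer fits the schema."""
    rows = [{"age": 31, "income": 52000.0, "label": 1}]
    assert validate_schema(rows, EXPECTED_SCHEMA) == []
```

A test runner such as pytest would pick up test_training_data_matches_schema alongside the regular code tests, so a schema change breaks the build in the same way a code regression would.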

MLOps Levels

The level of automation of these steps defines the maturity of the ML process, which reflects how quickly new models can be trained when new data or new implementations become available.

Level 0 (Manual Process)

Level 0 is a manual, data-scientist-driven process suitable for models that change infrequently. Its key challenges are maintaining model accuracy over time and addressing model failures in production.

img source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Level 1 (ML Pipeline Automation)

Level 1 aims to achieve continuous training of models by automating the training pipeline.

img source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Characteristics

  • Rapid experimentation
    • Data validation, data preparation, model training, model evaluation, and model validation are orchestrated so that experiments can run quickly.
  • CT of the model in production
    • The model is automatically retrained in production on fresh data, based on live pipeline triggers.
  • Experimental-operational symmetry
    • The infrastructure used for experimentation is kept compatible and consistent with the prediction service.
  • Modularized code
    • Pipeline components are reusable and composable.
    • Components are containerized for reproducibility.
    • Each component in the pipeline is isolated.
  • CD of models
    • Model validation evaluates the model’s predictive quality and compares it to the current model before promotion to production.
  • Pipeline deployment
    • The entire training pipeline is deployed, not just the trained model.
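The characteristics above can be sketched as a small pipeline of composable steps with a validation gate before promotion. This is a toy illustration under stated assumptions: the step functions (validate_data, train, evaluate, run_pipeline) are hypothetical, the "model" is a stand-in that always predicts the majority label, and the rule "promote if the candidate is not worse on the holdout set" is one simple choice among many.

```python
from collections import Counter


def validate_data(rows):
    """Data validation step: reject empty data or labels outside {0, 1}."""
    if not rows:
        raise ValueError("no training data")
    if any(label not in (0, 1) for _, label in rows):
        raise ValueError("unexpected label value")
    return rows


def train(rows):
    """Training step: a stand-in 'model' that always predicts the majority label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


def evaluate(model, rows):
    """Evaluation step: fraction of rows predicted correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def run_pipeline(train_rows, holdout_rows, current_model):
    """Run the steps in order and promote the candidate only if it is not worse."""
    candidate = train(validate_data(train_rows))
    candidate_score = evaluate(candidate, holdout_rows)
    current_score = evaluate(current_model, holdout_rows)
    promoted = candidate if candidate_score >= current_score else current_model
    return promoted, candidate_score, current_score


# Illustrative data: each row is (features, label).
train_rows = [((0.1,), 1), ((0.2,), 1), ((0.3,), 0)]
holdout_rows = [((0.4,), 1), ((0.5,), 0), ((0.6,), 1)]
current_model = lambda features: 0  # the model currently "in production"

promoted, candidate_score, current_score = run_pipeline(
    train_rows, holdout_rows, current_model
)
print(candidate_score, current_score)
```

Because each step is a plain function with explicit inputs and outputs, any one of them could be swapped out or run in its own container without touching the others, which is the point of the modularized-code characteristic.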

Additional components

  1. Data Validation
  2. Model Validation
  3. Feature Store
  4. Metadata Management
  5. ML Pipeline Triggers
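As a rough illustration of the last component, a pipeline trigger can be thought of as a function that inspects the current state and decides whether to start retraining. The PipelineState fields, the should_retrain helper, and all thresholds below are assumptions made for this sketch; real systems choose their own signals and limits.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical snapshot of the signals a trigger might look at."""
    hours_since_last_run: float
    new_rows_since_last_run: int
    live_accuracy: float


def should_retrain(state, max_hours=24.0, min_new_rows=1000, min_accuracy=0.9):
    """Return the reasons to trigger retraining (empty list means no trigger)."""
    reasons = []
    if state.hours_since_last_run >= max_hours:
        reasons.append("schedule")
    if state.new_rows_since_last_run >= min_new_rows:
        reasons.append("new_data")
    if state.live_accuracy < min_accuracy:
        reasons.append("performance_degradation")
    return reasons


print(should_retrain(PipelineState(30.0, 50, 0.8)))
```

Returning the list of reasons rather than a bare boolean makes it easy to log why a retraining run started, which fits alongside the metadata management component.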

Level 2 (CI/CD Pipeline Automation)

Level 2 focuses on rapid and reliable updates of pipelines in production. This level is still somewhat speculative.

Stages of the ML CI/CD automation pipeline:

img source: https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning


References

  1. https://neptune.ai/blog/mlops
  2. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning