Unit 8 · AI in Production

Introduction to MLOps: From Notebook to Production

10 min read · Lesson 1 of 4 in Unit 8 · Published 5 April 2026

You build a model in a notebook. It works. You get 87% accuracy. You're happy. You deploy it to production.

Three months later it's performing at 64% accuracy. You didn't touch it. The data changed. The users changed. The world changed. And now your model is broken in ways that are hard to debug because production is messier than your notebook.

That gap - between a notebook experiment and an actual working system - is where MLOps lives.

The gap between notebook and production

A notebook is a controlled environment. You have a dataset. You split it into train and test. You train once, evaluate once, iterate. When you're done, you have a model file.

Production is chaos. Data arrives continuously. It's different from your training data. Your model makes predictions on it. Those predictions affect real decisions. Sometimes they're wrong and you need to understand why.

Your notebook didn't have to handle:
  • new data arriving every minute
  • data in different formats or with different distributions
  • multiple models running in parallel
  • rolling back when a new model is bad
  • monitoring whether the model still works
  • retraining when performance degrades
  • version control for data, models, and code
  • compliance and audit requirements
  • budget and compute constraints

A notebook solves exactly one problem: build a model. Production needs to solve dozens more.

What MLOps is and why it exists

MLOps is the infrastructure and practices that turn a model into a system that works reliably in production.

It's borrowed from DevOps - the practices that make software deployable and maintainable. But ML adds complexity because your system has three moving parts: code, data, and models. In traditional software, the code is the source of truth. In ML, the code matters less than the data and the model weights.

MLOps exists because ML in production fails in different ways than regular software. A regular application can be fully tested before deployment. An ML system can never be fully tested - you'll always encounter data distributions you didn't see in training. A regular application is deterministic - the same input always produces the same output. An ML system is probabilistic - it makes mistakes, and those mistakes change over time as data changes.

You need different tools and practices for this reality.

The key stages: data, training, deployment, monitoring

Data management. You need to know what data your model was trained on. You need to track data quality. You need to catch when the data distribution shifts. You need to version datasets so that when a model breaks, you can recreate the exact conditions that created it.
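As a minimal sketch of catching distribution shift on one numeric feature, you can compare incoming values against a snapshot of the training data. The function and data below are illustrative; production systems typically use proper statistical tests (PSI, Kolmogorov-Smirnov) over many features.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Absolute difference of means, scaled by the training std.
    A crude z-score-style check, not a substitute for a real test."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) / sigma

train = [10.0, 12.0, 11.0, 9.0, 10.5, 11.5]   # snapshot from training time
live_ok = [10.2, 11.1, 9.8, 10.9]             # production data, same regime
live_shifted = [18.0, 19.5, 17.2, 20.1]       # production data after a shift

assert drift_score(train, live_ok) < 1.0      # looks like training data
assert drift_score(train, live_shifted) > 3.0 # distribution has moved
```

The point is that the check needs the versioned training snapshot to compare against - which is exactly why data versioning is a first-class concern.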

Training. You need to be able to retrain models automatically. You need to track hyperparameters and results. You need to run experiments in parallel and compare them. You need to version models and know which version is in production.
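A toy stand-in for what an experiment tracker records per run - real tools like MLflow or Weights & Biases do this with a server and UI, but the core record looks something like this (all names here are illustrative):

```python
import hashlib
import json

def log_run(params, metrics, registry):
    """Append one experiment record; a hash of the params doubles
    as a reproducible model version id. A toy tracker, not MLflow."""
    run = {
        "version": hashlib.sha1(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()[:8],
        "params": params,
        "metrics": metrics,
    }
    registry.append(run)
    return run["version"]

registry = []
log_run({"lr": 0.1, "depth": 6}, {"val_acc": 0.87}, registry)
log_run({"lr": 0.01, "depth": 8}, {"val_acc": 0.89}, registry)

# Promote the best run - you always know which version won and why
best = max(registry, key=lambda r: r["metrics"]["val_acc"])
assert best["params"]["lr"] == 0.01
```

Because every run carries its params and a version id, "which model is in production?" has a concrete answer instead of a guess.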

Deployment. You need to serve models with low latency and high availability. You need to be able to roll out new models gradually and roll back if something breaks. You need to handle multiple versions running simultaneously.
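A gradual rollout can be sketched as deterministic traffic splitting: hash each user id into a bucket and send a fixed fraction to the new model. The model names and fraction below are made up for illustration.

```python
import hashlib

def route(user_id: str, canary_fraction: float) -> str:
    """Deterministically send a fixed fraction of users to the new model.
    Hashing the id keeps each user on the same variant across requests."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_v2" if bucket < canary_fraction * 100 else "model_v1"

# With a 10% canary, roughly 100 of 1000 users hit the new model
hits = sum(route(f"user-{i}", 0.10) == "model_v2" for i in range(1000))
assert 50 < hits < 150
assert route("user-42", 0.10) == route("user-42", 0.10)  # sticky per user
```

Rolling back is then just setting the fraction to zero - no redeploy, no stranded users.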

Monitoring. You need to track model performance on real data over time. You need to notice when performance degrades and alert the right people. You need to log predictions so you can debug failures. You need to know when to retrain.
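A minimal sketch of the "notice when performance degrades" part: track accuracy over a sliding window of labelled predictions and flag when it drops below a threshold. Class name, window size, and threshold are all illustrative.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the last `window` labelled predictions
    and flag when it falls below a threshold. Illustrative only."""
    def __init__(self, window=100, threshold=0.80):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def degraded(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

mon = AccuracyMonitor(window=10, threshold=0.8)
for _ in range(10):
    mon.record(1, 1)   # model doing fine
assert not mon.degraded()
for _ in range(5):
    mon.record(1, 0)   # a run of mistakes drags window accuracy to 0.5
assert mon.degraded()  # this is where an alert would fire
```

The hard part in practice is getting the `actual` labels - ground truth often arrives hours or days after the prediction, which is why monitoring also watches input and output distributions as earlier signals.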

These stages feed back into each other. Monitoring tells you when to retrain. Training produces a new model. Deployment puts it in production. Monitoring checks if it works. The cycle continues.

Why ML fails differently in production

A regular application usually fails because of bugs. The code has a flaw, you fix it, you deploy the fix. ML systems fail because the world changed.

Your model learned patterns from training data. In production, the data is different. Not in format - usually the same format. But the distribution is different. Users behave differently. Products change. Campaigns succeed or fail. The patterns the model learned aren't true any more.

This is called drift and it's inevitable. You can't prevent it. You can only detect it and handle it.

A regular application is also deterministic - you can test every branch. An ML model is probabilistic. You can't test every possible input. You don't know in advance what mistakes it will make. You can only observe mistakes in production and try to learn from them.

This is why monitoring matters so much in ML. You can't catch all errors before deployment. You have to catch them after and respond.

Do data scientists need to care about MLOps?

They don't need to become MLOps engineers - that's a different skill. But they should understand the basics. How models go from notebook to production. What happens when data distribution changes. How to instrument a model so it's debuggable.

The companies that are good at ML have data scientists and MLOps engineers who work together. The data scientist doesn't have to implement the monitoring system, but they should understand what needs to be monitored and why. They should think about edge cases that might not show up in training data. They should care that their model is reliable, not just accurate.

Check your understanding

Why does an ML model's performance degrade over time even if the code hasn't changed?

What makes ML systems harder to fully test before deployment compared to regular software?


Frequently Asked Questions

What is MLOps?

MLOps is the infrastructure and practices that turn a trained model into a system that works reliably in production. It borrows from DevOps but adds complexity because ML systems have three moving parts: code, data, and model weights. A regular application's source of truth is the code; an ML system's source of truth is the data and model.

Why do ML models fail in production?

ML models fail in production primarily because the world changes. The model learned patterns from training data, but in production the data distribution shifts - users behave differently, products change, markets move. This is called drift and it's inevitable. Unlike regular software bugs that you fix once, drift is ongoing and requires continuous monitoring and retraining.

What are the four stages of the ML lifecycle?

Data management (versioning, quality, distribution tracking), training (reproducible experiments, hyperparameter tracking, model versioning), deployment (serving with low latency and high availability, gradual rollouts), and monitoring (tracking performance on real data, detecting drift, alerting on degradation). These stages feed back into each other in a continuous cycle.

Do data scientists need to learn MLOps?

They don't need to become MLOps engineers, but they should understand the basics. How models go from notebook to production. What happens when data distribution changes. How to instrument a model so it's debuggable. The companies that are good at ML have data scientists who care that their models work reliably, not just that they score well on a held-out test set.

How It Works

MLOps combines three areas: infrastructure for running ML workloads, processes for managing the ML lifecycle, and monitoring for keeping models healthy in production.

A mature MLOps setup has: a feature store (centralised, versioned features for training and serving), an experiment tracker (MLflow, Weights & Biases - logs params, metrics, and model artefacts), a model registry (stores model versions with their lineage), a serving infrastructure (APIs, containers, or managed services), and a monitoring system (tracks input distributions, output distributions, and performance metrics).

When things work well: new data triggers a pipeline, models retrain automatically, evaluation gates compare new vs old, deployment is gradual (A/B or canary), and alerts fire before customers notice problems.
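The evaluation gate in that pipeline can be sketched as a simple comparison: promote the new model only if its primary metric improves and no guardrail metric regresses. Metric names and thresholds below are assumptions for illustration.

```python
def evaluation_gate(old_metrics, new_metrics,
                    min_gain=0.0, guard_keys=("latency_ms",)):
    """Promote the candidate only if accuracy improves and no
    guardrail metric regresses by more than 10%. Illustrative."""
    if new_metrics["accuracy"] < old_metrics["accuracy"] + min_gain:
        return False
    return all(new_metrics[k] <= old_metrics[k] * 1.1 for k in guard_keys)

old = {"accuracy": 0.87, "latency_ms": 40}   # current production model
good = {"accuracy": 0.89, "latency_ms": 42}  # candidate: better, still fast
slow = {"accuracy": 0.90, "latency_ms": 95}  # candidate: better but too slow

assert evaluation_gate(old, good)        # passes the gate
assert not evaluation_gate(old, slow)    # accuracy up, latency regressed
```

Guardrail metrics matter because a model can win on accuracy while losing on something the business cares about just as much.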

Key Points
  • The gap between notebook experiment and production system is where most ML projects fail
  • Production requirements notebooks don't handle: continuous data, version control, rollback, monitoring, retraining
  • MLOps is DevOps adapted for ML's three moving parts: code, data, and models
  • ML fails in production because data distributions drift, not just because of code bugs
  • The four MLOps stages: data management, training, deployment, monitoring - forming a cycle
  • ML systems can't be fully tested before deployment - some errors only emerge with real data
  • Monitoring is not optional - it's how you detect drift before it causes visible business damage
  • Data scientists don't need to build the infrastructure but should understand what it does and why
Sources
  • Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS.
  • Shankar, S. et al. (2022). Operationalizing Machine Learning: An Interview Study. arXiv.
  • Google. (2022). MLOps: Continuous delivery and automation pipelines in machine learning. cloud.google.com.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.