Monitoring and Drift: Keeping Models Healthy in Production
You deploy a model. It's tested. It works. You put it in production on a Monday.
By Friday it's performing badly. Not because you changed it - because the world changed. The data your model sees in production is different from the data it trained on. This is drift. It's normal, and it's what kills most ML systems in production.
Why a model that worked at launch can stop working
A model learns patterns from training data. Those patterns are true for that data at that time. In production, the conditions change.
An example: you build a model to predict whether someone will buy a product based on their browsing behaviour. You trained it on data from 2024. You deploy it in January 2025. There's a new marketing campaign that changes browsing patterns. Suddenly your model is wrong about what user behaviour means.
Or simpler: you train on historical data where a specific user segment makes up 40% of the records. In production, that segment is 10% of your traffic. The distribution changed, and the model's predictions aren't calibrated for the new mix.
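To make that concrete, here's a toy calculation. The segment shares and per-segment conversion rates are invented for illustration; the point is that a mix shift changes the overall base rate your model was calibrated against, even if behaviour within each segment stays the same.

```python
# Invented per-segment conversion rates; only the mix shift matters here.
def overall_rate(segment_share, rate_a=0.20, rate_b=0.05):
    """Blend per-segment conversion rates by segment A's share of traffic."""
    return segment_share * rate_a + (1 - segment_share) * rate_b

training_rate = overall_rate(0.40)    # segment A was 40% of training data
production_rate = overall_rate(0.10)  # now it's 10% of production traffic

print(f"training base rate:   {training_rate:.3f}")   # 0.110
print(f"production base rate: {production_rate:.3f}")  # 0.065
```

A model whose scores were calibrated against an 11% base rate will systematically overpredict positives against a 6.5% one.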
The model doesn't change. The data it's receiving changed. So performance degrades. This is inevitable - you can't train on data that doesn't exist yet. The only question is how fast you notice and how you respond.
What data drift is
Data drift is when the input data distribution changes. The features your model sees in production are statistically different from the features it saw in training.
You trained on data where customer age ranged from 18 to 65. In production, you're seeing users from 15 to 85. That's drift. You trained on mobile and desktop users equally. In production, 80% are mobile. Drift. You trained on products in three categories. A new category launched and now 30% of your predictions are on that category. Drift.
Data drift is relatively easy to detect. You log the features you see in production and compare them to the training data distribution. If the statistics are significantly different, you have drift. Calculate means, variances, and percentiles of features in a recent production window and compare them to the training data. Statistical tests like the Kolmogorov-Smirnov test, or distance measures like the Wasserstein distance, can quantify the difference.
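A minimal sketch of that check in Python, assuming numpy and scipy are available. The feature values are synthetic and the significance threshold is illustrative, not a recommendation.

```python
import numpy as np
from scipy import stats

def detect_drift(train_values, prod_values, alpha=0.01):
    """Two-sample KS test on one numeric feature.
    Returns (drifted, p_value); a small p-value means the production
    distribution differs significantly from the training one."""
    statistic, p_value = stats.ks_2samp(train_values, prod_values)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
train = rng.normal(loc=40, scale=10, size=5000)         # e.g. customer age
prod_shifted = rng.normal(loc=50, scale=15, size=5000)  # distribution moved

print(detect_drift(train, train))         # identical data: no drift
print(detect_drift(train, prod_shifted))  # shifted data: drift
```

In practice you'd run this per feature on a rolling production window. Beware of large samples: the KS test will flag tiny, harmless shifts as significant, which is one reason to pair the p-value with an effect-size measure like the Wasserstein distance.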
What concept drift is and how it differs
Concept drift is different. It's when the relationship between features and the target changes. The features look the same but they mean different things.
You train a model on the pattern "if a user visits a product page for 10 seconds, they're likely to buy." You deploy it. Then users start reading reviews more carefully, and 10 seconds on a page stops signalling intent to buy. The relationship changed even though the feature is the same.
Or: you train on data from a recession. Now the economy is different. Economic patterns that predicted buying behaviour then don't predict it now.
Concept drift is harder to detect than data drift. You can see data drift immediately - the input distribution changed. With concept drift, you only notice when performance metrics degrade over time. That's slower feedback, and slower feedback means more damage before you notice.
How to detect drift
You need to monitor your model's inputs and outputs.
For inputs: log the features and track their distributions. Set thresholds for how much change is acceptable. If feature distributions change significantly compared to training data, you have data drift.
For outputs: log predictions and track the distribution. If the model suddenly predicts "yes" 90% of the time when it used to predict 50%, something's wrong. Crude but catches obvious issues fast.
For performance: this is the hard part. You need ground truth labels to measure performance. In some domains you get them quickly - if you're predicting credit default, you know in months. In others it's slow. Identify where you can measure performance and monitor it actively.
Most teams set up a monitoring dashboard: input distributions, output distributions, performance metrics if labels are available. Set alerts that fire when things deviate significantly from baseline.
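The output-side check can be as simple as comparing a rolling positive-prediction rate against the baseline observed at launch. A sketch of that idea; the window size and deviation threshold below are made up for illustration.

```python
from collections import deque

class PredictionRateMonitor:
    """Fires an alert when the recent positive-prediction rate strays
    too far from the baseline rate. Thresholds are illustrative."""

    def __init__(self, baseline_rate, window=1000, max_deviation=0.15):
        self.baseline = baseline_rate
        self.recent = deque(maxlen=window)
        self.max_deviation = max_deviation

    def record(self, prediction):
        """Log one binary prediction (0 or 1); return True to alert."""
        self.recent.append(prediction)
        if len(self.recent) < self.recent.maxlen:
            return False  # wait until the window fills before judging
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.baseline) > self.max_deviation

monitor = PredictionRateMonitor(baseline_rate=0.50, window=100)
alerts = [monitor.record(1) for _ in range(100)]  # model stuck on "yes"
print(alerts[-1])  # True: 100% positive against a 50% baseline
```

This catches the "suddenly predicting yes 90% of the time" failure without needing any labels, which is why it's a common first alert to wire up.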
What to do when drift is detected
You have a few options.
Investigate first. Understand what changed. Are your assumptions about the problem still valid? Did the business change? Did user behaviour change? Did your data pipeline break? Understanding the cause matters because it determines the solution.
Retrain. Most commonly, you retrain on newer data. If drift is detected, gather recent data, retrain, and deploy a new model. Ideally this is automated - drift detection triggers retraining which triggers evaluation which triggers deployment if the new model is better.
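One way to wire that loop together, sketched with stand-in `train_fn` and `eval_fn` callables (the names are invented; substitute your own training and evaluation code):

```python
def retrain_if_drifted(drift_detected, current_model, recent_data, holdout,
                       train_fn, eval_fn, min_improvement=0.01):
    """Retrain on recent data when drift fires; promote the challenger
    only if it beats the incumbent on a holdout set by a margin."""
    if not drift_detected:
        return current_model
    challenger = train_fn(recent_data)
    if eval_fn(challenger, holdout) >= eval_fn(current_model, holdout) + min_improvement:
        return challenger
    return current_model

# Toy demonstration: "models" are just dicts carrying a fixed accuracy.
winner = retrain_if_drifted(
    drift_detected=True,
    current_model={"accuracy": 0.80},
    recent_data=None,
    holdout=None,
    train_fn=lambda data: {"accuracy": 0.85},
    eval_fn=lambda model, holdout: model["accuracy"],
)
print(winner["accuracy"])  # 0.85: the retrained model won and was promoted
```

The evaluation gate is the important part: retraining on drifted data doesn't guarantee a better model, so promotion should never be automatic just because training succeeded.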
Accept it. Sometimes drift is acceptable. If your model is degrading slowly and retraining is expensive, the cost may not be worth it. A business decision, not a technical one.
Change the problem. Sometimes drift signals that your original assumption is wrong. If the business has fundamentally changed, you might need a different model for a different problem.
Fix the pipeline. Sometimes drift happens because your data pipeline is broken. You're supposed to log customer demographics but they're not being logged correctly. Fix the pipeline and drift might disappear.
How seriously most teams take monitoring
Not seriously enough.
Most teams log some metrics and maybe watch a dashboard occasionally. But they don't have automated alerts. They don't have processes for responding to drift. When a model starts failing, they discover it because customers complain or because a business metric tanks.
Good practices: automated drift detection with alerts. Someone on-call who responds to alerts. An automated retraining pipeline that tests new models before deploying. Monitoring that covers inputs, outputs, and performance.
This is hard. It requires infrastructure and maintenance. But it's the difference between a model that degrades gracefully and one that crashes and burns. The teams that are good at ML in production treat monitoring the same as running any service. Observability matters. You plan for failure and you have processes to respond. You don't ship and forget.
Check your understanding
What is the key difference between data drift and concept drift?
Why is concept drift harder to detect than data drift?
Podcast version
Prefer to listen on the go? The podcast episode for this lesson covers the same material in a conversational format.