A model that was accurate at deployment will not remain accurate indefinitely. The relationship between inputs and outputs that the model learned from historical data reflects the state of the world at the time that data was collected. The world changes. Customer behavior shifts. Economic conditions move. Competitors change their pricing. The source systems that feed the model produce data in subtly different formats after a schema migration. Each of these changes degrades model performance. The question is not whether a deployed model will decay, but when and how fast — and whether the organization has the monitoring infrastructure to detect the decay before it produces consequential decision errors.
Data drift is a change in the statistical distribution of model inputs that occurs without any change in the model itself. If the average transaction value in the training data was two hundred dollars and the average in current traffic has shifted to three hundred, the model is making predictions on input values it has limited experience with. Distribution shift tests, such as the population stability index, the Kolmogorov-Smirnov statistic, and Jensen-Shannon divergence, detect this change in input distributions and trigger investigation into whether the model's learned relationships still hold under the new distribution.
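As a minimal sketch of two such checks in Python: the simulated transaction values, the bin count, and the alert thresholds below are illustrative assumptions, not recommended settings.

```python
# Input-distribution checks on a scoring batch: PSI and a two-sample KS test.
import numpy as np
from scipy import stats

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and current (actual) feature sample."""
    # Bin edges come from the training distribution so both samples share buckets.
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # Convert to proportions; a small epsilon avoids division by or log of zero.
    eps = 1e-6
    exp_pct = exp_counts / exp_counts.sum() + eps
    act_pct = act_counts / act_counts.sum() + eps
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Illustrative data: training averaged around $200, the current batch around $300.
rng = np.random.default_rng(0)
train_values = rng.gamma(shape=2.0, scale=100.0, size=50_000)   # mean ~ $200
current_values = rng.gamma(shape=2.0, scale=150.0, size=5_000)  # mean ~ $300

psi = population_stability_index(train_values, current_values)
ks_stat, ks_pvalue = stats.ks_2samp(train_values, current_values)

# A common (but not universal) rule of thumb flags PSI above 0.2 for investigation.
if psi > 0.2 or ks_stat > 0.1:
    print(f"Input drift flagged: PSI={psi:.3f}, KS={ks_stat:.3f}")
```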
Concept drift is the more serious problem: a change in the relationship between inputs and outputs that the model was trained to predict. A fraud detection model trained before a new fraud pattern emerged will correctly classify the cases it was trained on and fail on the new pattern, not because its inputs have changed, but because the concept it is trying to predict has changed. Concept drift is harder to detect than data drift because measuring it requires ground truth labels, and those labels often arrive with substantial delays in production settings. Shadow scoring, in which a new model version runs alongside the production model and their predictions are compared, is a standard technique for detecting concept drift before it degrades business outcomes.
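A minimal shadow-scoring sketch follows; the model objects, the 0.5 decision threshold, and the report fields are hypothetical, assuming scikit-learn-style predict_proba interfaces rather than any particular serving stack.

```python
# Compare a candidate model against the production model on the same batch,
# and score both against ground truth when delayed labels eventually arrive.
import numpy as np

def shadow_score(production_model, candidate_model, batch, labels=None):
    prod_scores = production_model.predict_proba(batch)[:, 1]
    cand_scores = candidate_model.predict_proba(batch)[:, 1]

    report = {
        # How often the two models agree at the production decision threshold.
        "decision_agreement": float(
            np.mean((prod_scores >= 0.5) == (cand_scores >= 0.5))
        ),
        # Average absolute gap between the two score distributions.
        "mean_score_gap": float(np.mean(np.abs(prod_scores - cand_scores))),
    }
    # Once ground truth is available, measure both models on the same sample.
    if labels is not None:
        labels = np.asarray(labels)
        report["production_accuracy"] = float(np.mean((prod_scores >= 0.5) == labels))
        report["candidate_accuracy"] = float(np.mean((cand_scores >= 0.5) == labels))
    return report
```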
The monitoring architecture that supports sustained model health is not a single metric or a single alert. It is a pipeline: input distribution checks that run on every scoring batch, output distribution checks that flag when the model's prediction distribution shifts materially, performance checks that run against labeled samples as ground truth arrives, and operational checks that detect infrastructure problems — latency spikes, throughput drops, feature computation errors — separately from model quality problems. The teams that manage models well in production instrument all four layers from day one. The teams that struggle treat monitoring as a project for after things go wrong.
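One way the four layers might be wired into a single pass over a scoring batch is sketched below; the batch attributes, thresholds, and check names are invented for illustration and would map onto whatever metrics a given team actually computes.

```python
# A four-layer check pipeline run against each scoring batch.
from dataclasses import dataclass

@dataclass
class CheckResult:
    layer: str
    passed: bool
    detail: str

def run_monitoring_checks(batch):
    results = []
    # 1. Input distribution checks on every scoring batch (e.g. per-feature PSI).
    results.append(CheckResult("input", batch.max_feature_psi < 0.2,
                               f"max feature PSI {batch.max_feature_psi:.3f}"))
    # 2. Output distribution check: has the prediction distribution shifted?
    results.append(CheckResult("output", batch.score_psi < 0.2,
                               f"score PSI {batch.score_psi:.3f}"))
    # 3. Performance check, run only when delayed ground truth is available.
    if batch.labeled_auc is not None:
        results.append(CheckResult("performance", batch.labeled_auc > 0.75,
                                   f"AUC on labeled sample {batch.labeled_auc:.3f}"))
    # 4. Operational checks: latency, throughput, feature computation errors,
    #    tracked separately from model quality.
    results.append(CheckResult("operational",
                               batch.p99_latency_ms < 200
                               and batch.feature_error_rate < 0.01,
                               f"p99 {batch.p99_latency_ms}ms, "
                               f"feature errors {batch.feature_error_rate:.2%}"))
    return results
```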
