MLOps, without the hype deck
MLOps is what the job actually is: data, training, serving, and the monitoring loop that catches the silent failures before your users do.
- mlops
- infra
- production
Most MLOps writing is a tour of logos. This one is a tour of problems. The logos change every eighteen months; the problems do not.
If you squint at any ML team that has been in production for more than a year, you find the same four questions on the whiteboard. Every MLOps decision you will ever make is an answer to one of them.
The four questions
- Can I rebuild this model from scratch? Reproducibility.
- Can I ship a new version without holding my breath? Deployment.
- Will I know when it breaks? Monitoring.
- Can I improve it from what users actually do? Feedback.
Hold that in your head. Every tool, every pipeline, every 2 a.m. decision maps onto one of those four.
1. Reproducibility: the quiet foundation
Reproducibility is the thing you take for granted until you can't. A model that worked six months ago, on a branch that no longer exists, trained on a data snapshot nobody pinned, is not a model. It's a rumor.
The minimum:
- Code: git; every run has a commit SHA in its metadata.
- Data: content-addressed. DVC, LakeFS, or a plain S3 path with an immutable snapshot ID. The data version is part of the model version.
- Environment: a Dockerfile or a uv/poetry lockfile. Not a `pip install -r requirements.txt` with a torn-off top.
- Randomness: seed everything you can (numpy, torch, framework splits). Document the things you can't (CUDA non-determinism at certain ops).
You do not need a platform for this. You need discipline and five yaml fields in a run manifest.
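As a concrete sketch of that discipline: the snippet below seeds what it can and writes a five-field run manifest. The field names, the fallback value, and JSON-instead-of-YAML (to stay dependency-free) are all illustrative choices, not a prescribed schema.

```python
import json
import random
import subprocess

import numpy as np


def seed_everything(seed: int) -> None:
    """Seed the RNGs we control; CUDA-level non-determinism gets documented, not fixed."""
    random.seed(seed)
    np.random.seed(seed)


def current_git_sha() -> str:
    """Commit SHA of the code that produced this run; 'unknown' only in this sketch."""
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, OSError):
        return "unknown"  # a real pipeline should fail loudly here instead


def write_run_manifest(path: str, data_snapshot_id: str, seed: int) -> dict:
    """Record the four reproducibility axes: code, data, environment, randomness."""
    manifest = {
        "git_sha": current_git_sha(),
        "data_snapshot_id": data_snapshot_id,        # e.g. an immutable S3 prefix
        "environment": "training.Dockerfile",        # or a lockfile hash (illustrative)
        "seed": seed,
        "known_nondeterminism": ["cuda_atomics"],    # documented, not eliminated
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

With the manifest stored next to the model artifact, "can I rebuild this?" becomes a lookup, not an archaeology project.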
2. Deployment: small changes, fast
Deploying a model is not deploying code. Models have weights, versions, shapes, latencies, and failure modes the rest of your stack does not. Treat them as a separate lifecycle.
The patterns that keep working:
- Model registry: an artifact store with versions, stages (`staging`, `production`, `archived`), and metadata (training data version, eval metrics, git SHA). MLflow, W&B, or a boring S3 bucket with a convention.
- Shadow deploys: run the new model alongside the old, compare outputs offline before any user sees the new version.
- Canary + rollback: 1% of traffic first, SLOs on latency and quality, one-button rollback. If rollback takes more than a minute, you do not have rollback.
- Decouple model from service: the serving binary should be able to load any registered model. If pushing a new model requires a code deploy, every retrain becomes a release. That is how teams stop retraining.
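To make the registry-plus-decoupling idea concrete, here is a minimal in-memory sketch of the "boring bucket with a convention" registry. The class, field names, and stage strings are invented for illustration; the point is that promotion is a metadata change the serving binary can pick up without a code deploy.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str             # "staging" | "production" | "archived"
    artifact_uri: str      # e.g. s3://models/<name>/<version>/model.bin (convention)
    git_sha: str
    data_snapshot_id: str  # ties the model back to its training data version


class ModelRegistry:
    """Toy registry: in real life this is MLflow, W&B, or bucket metadata."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, int], ModelVersion] = {}

    def register(self, mv: ModelVersion) -> None:
        self._versions[(mv.name, mv.version)] = mv

    def promote(self, name: str, version: int, stage: str) -> None:
        """Promotion (and rollback) is a metadata flip, not a release."""
        self._versions[(name, version)].stage = stage

    def latest(self, name: str, stage: str = "production") -> ModelVersion:
        """What the serving binary asks for on startup or reload."""
        candidates = [v for (n, _), v in self._versions.items()
                      if n == name and v.stage == stage]
        return max(candidates, key=lambda v: v.version)
```

Rollback under this scheme is `promote(name, old_version, "production")`: one metadata write, well under the one-minute bar.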
3. Monitoring: where most failures hide
The interesting failures are not the ones that throw exceptions. They are the silent ones: a feature pipeline starts returning zeros at 3 a.m., a segment of users sees quietly worse predictions, a feature's unit changes upstream (dollars become cents) and nobody notices for a week.
You need signal on three planes:
- System: p50/p95/p99 latency, error rate, QPS, GPU utilization, cost per prediction. Your SRE toolkit.
- Data: schema checks (Great Expectations, Pandera), null-rate alerts, distribution drift on the inputs you actually use. Run them on both training and inference data. A discrepancy is the bug.
- Model: prediction distribution over time, calibration, and, when you have labels, quality. For delayed-label problems (fraud, churn), use proxies: click-through, complaint rate, follow-on conversions.
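Two of the data-plane checks above fit in a few lines of plain numpy, no framework required. This is a sketch: the PSI thresholds quoted in the comment are a common rule of thumb, not a standard, and should be tuned per feature.

```python
import numpy as np


def null_rate(column: np.ndarray) -> float:
    """Fraction of NaNs; a sudden jump usually means an upstream join broke."""
    return float(np.isnan(column).mean())


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training baseline and live inputs.

    Rule of thumb (an assumption, tune per feature):
    < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) on empty bins
    e_frac, a_frac = e_frac + eps, a_frac + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Run the same checks on the training snapshot and on live inference traffic; when the two disagree, the disagreement is the bug.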
The single most useful dashboard a team can build is not the one that looks best. It is the one on-call actually opens at 3 a.m. Keep it short and actionable. One glance, one decision.
4. Feedback: the part everyone skips
A model that never learns from production traffic becomes worse, slowly, forever. User behavior drifts. The world drifts. Even a perfectly stable model rots by standing still.
The feedback loop has three moving pieces:
- Capture: log every prediction, the input features, the model version, and any downstream signal you can legally retain.
- Label: implicit (user clicks, corrections, dwell time) or explicit (annotator queues, active learning on uncertain predictions).
- Retrain: on a schedule you trust, with the same pipeline that trained the original. Treat retraining as a deploy, not as a script.
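The capture and label steps hinge on one detail: every prediction needs an ID that a delayed label can join back to. A minimal sketch, with an in-memory list standing in for whatever event sink you actually use, and all field names invented for illustration:

```python
import time
import uuid


def log_prediction(model_name, model_version, features, prediction, sink):
    """Append one prediction event; the ID lets a later label join back to it."""
    event = {
        "prediction_id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model_name,
        "model_version": model_version,   # which weights produced this output
        "features": features,             # only fields you can legally retain
        "prediction": prediction,
        "label": None,                    # filled in later by the labeling job
    }
    sink.append(event)
    return event["prediction_id"]


def attach_label(sink, prediction_id, label):
    """Join a delayed signal (click, chargeback, correction) to its prediction."""
    for event in sink:
        if event["prediction_id"] == prediction_id:
            event["label"] = label
            return True
    return False
```

Once events carry both features and labels, the retraining dataset is a filter over this log rather than a bespoke export.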
The teams that get this right have one thing in common: the same person who trains the model also gets paged when it regresses. The second best proxy is a shared Slack channel. The worst is a handoff.
A pragmatic 2026 stack
You do not need more than this to start:
- Training orchestration: Dagster or Prefect (DAGs with types; logs that don't lie).
- Experiment + registry: MLflow (self-hostable, boring, works).
- Feature store: Feast if you need online features, a Parquet lake + dbt if you don't.
- Serving: Ray Serve, BentoML, or a FastAPI container behind whatever load balancer you already have.
- Monitoring: Prometheus/Grafana for system, Evidently or WhyLabs for data/model.
- CI: every PR rebuilds the training image, runs a subset eval, blocks merge on regression.
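The CI regression gate in that last bullet can be as small as a metric comparison with a tolerance. A sketch, assuming higher-is-better metrics; the metric names and the 0.005 tolerance are placeholders, and the returned failure list is what you would turn into a nonzero exit code to block the merge:

```python
def eval_gate(baseline: dict, candidate: dict, tolerance: float = 0.005) -> list:
    """Return a failure message for each metric the candidate regresses on.

    Assumes every metric is higher-is-better; invert the comparison for
    losses. An empty list means the PR is clear to merge.
    """
    failures = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            failures.append(f"{metric}: missing from candidate eval")
        elif cand_value < base_value - tolerance:
            failures.append(
                f"{metric}: {cand_value:.4f} < baseline {base_value:.4f}"
            )
    return failures
```

The same gate, run on the retraining schedule's output, is what turns "retraining as a deploy" from a slogan into a mechanism.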
None of that is fancy. That is the point. MLOps is boring, on purpose, because the alternative is ML that only works on Tuesdays.
What to build first
If you are the first ML person on a team, build these in order:
1. A reproducible training script that emits a versioned model artifact with its training data ID.
2. A single serving path that loads any registered model.
3. A monitoring board with system metrics, feature schema checks, and a prediction drift panel.
4. A retraining job on a schedule, gated by the same evals that blocked the original merge.
You can build this in two weeks. Everything else (feature stores, vector DBs, LLM gateways) is a refinement on top of that loop. Build the loop first. The rest is negotiable.