MLOps on Azure: From Experiment to Production in 8 Weeks
A step-by-step guide to building production MLOps on Azure — pipelines, model registry, retraining, feature stores, and A/B deployment.
Most machine learning models never reach production. The ones that do often degrade silently because no one built the infrastructure to monitor, retrain, and redeploy them. MLOps — the discipline of operationalizing ML — bridges this gap.
This guide presents a practical 8-week roadmap for standing up production-grade MLOps on Azure, based on patterns we have implemented across multiple enterprise clients.
Why 8 Weeks?
Eight weeks is not arbitrary. It is the shortest timeline that allows for proper foundation-building without cutting corners that create technical debt. Teams that try to compress this into two weeks end up with brittle pipelines that break at the first model update. Teams that stretch it to six months lose organizational momentum.
The 8-week plan assumes a team of 2-3 ML engineers with Azure experience and an existing model that has been validated in notebooks.
Week 1-2: Foundation — Azure ML Workspace and Version Control
Set up the Azure ML workspace
- Resource group structure: Separate resource groups for dev, staging, and production, each with its own Azure ML workspace.
- Compute: Use compute instances for development, compute clusters for training, and managed online endpoints for inference.
- Networking: Private endpoints for the workspace, storage account, and container registry. No public internet access to training data or model artifacts.
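With the Azure ML CLI v2, the workspace itself can be defined declaratively. A minimal sketch of a production workspace definition, with locked-down networking as described above — names and region are hypothetical placeholders:

```yaml
# workspace-prod.yml (hypothetical names; adapt to your naming convention)
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: mlw-fraud-prod
location: westeurope
display_name: Fraud detection (production)
public_network_access: Disabled   # private endpoints only, per the networking guidance above
```

Applied with `az ml workspace create --file workspace-prod.yml --resource-group rg-fraud-prod`, with the private endpoints themselves provisioned separately (for example via Bicep or Terraform).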
Establish version control practices
Every artifact must be versioned and traceable:
- Code: Git repository with branching strategy (we recommend trunk-based development for ML)
- Data: Azure ML data assets with versioning. Never reference raw storage paths directly in training code.
- Environments: Conda or pip requirements files, checked into source control. Use Azure ML curated environments as a base and extend them.
- Models: Azure ML model registry from day one. Every trained model gets a version, linked to its training run, dataset version, and code commit.
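As a sketch of the data-versioning practice, a versioned data asset in CLI v2 YAML — the asset name, version, and storage path are illustrative:

```yaml
# data-transactions.yml (hypothetical names and path)
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: transactions-training
version: "3"
type: uri_folder
path: azureml://datastores/workspaceblobstore/paths/curated/transactions/v3/
description: Curated transactions snapshot for training
```

Training code then references `azureml:transactions-training:3` rather than a raw storage path, so every run records exactly which data version it consumed.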
Critical practice: If you cannot trace a production model back to the exact data, code, and environment that produced it, you do not have MLOps — you have organized chaos.
Week 3-4: Training Pipelines
Azure ML Pipelines
Replace notebook-based training with reproducible, parameterized pipelines:
- Pipeline components: Break training into discrete steps — data preparation, feature engineering, training, evaluation, registration. Each step is a reusable component.
- Parameterization: Hyperparameters, dataset versions, and compute targets should all be pipeline parameters, not hardcoded values.
- Compute scaling: Use compute clusters with auto-scaling (min nodes = 0) to avoid paying for idle compute. Spot instances can reduce training cost by 60-80% for fault-tolerant jobs.
Pipeline structure example
Data Prep → Feature Engineering → Train → Evaluate → Register (if metrics pass)

The evaluation step is critical. Define go/no-go metrics before building the pipeline:
- Minimum accuracy/F1/AUC threshold for model registration
- Performance comparison against the current production model
- Data quality checks — schema validation, null rates, distribution drift
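The gating logic inside the evaluation step can be sketched as a plain function. The thresholds, metric names, and data-quality bound below are illustrative, not prescriptive:

```python
# Sketch of the go/no-go gate in the Evaluate step. A model is registered only
# if it clears absolute thresholds, beats the current production model, and
# the evaluation data passes basic quality checks.

REGISTRATION_THRESHOLDS = {"f1": 0.80, "auc": 0.85}  # hypothetical minimums
MAX_NULL_RATE = 0.02  # hypothetical data-quality bound


def should_register(candidate: dict, production: dict, null_rates: dict) -> bool:
    """Return True only if the candidate model clears every gate."""
    # Gate 1: absolute metric thresholds
    if any(candidate[m] < t for m, t in REGISTRATION_THRESHOLDS.items()):
        return False
    # Gate 2: no regression against the current production model
    if any(candidate[m] < production.get(m, 0.0) for m in REGISTRATION_THRESHOLDS):
        return False
    # Gate 3: basic data-quality check on the evaluation set
    if any(rate > MAX_NULL_RATE for rate in null_rates.values()):
        return False
    return True
```

In the pipeline, this function runs at the end of the Evaluate component and its boolean output conditions the Register step.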
Automated triggers
- Schedule-based: Retrain weekly or monthly on a fixed cadence
- Data-driven: Trigger retraining when new data lands in the bronze layer (use Azure Event Grid + ML pipeline triggers)
- Drift-driven: Trigger when monitoring detects data or prediction drift beyond thresholds
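A schedule-based trigger can be defined as a CLI v2 schedule resource. A sketch of a weekly retrain at 02:00, with a hypothetical pipeline file path:

```yaml
# schedule-weekly-retrain.yml (hypothetical name and pipeline path)
$schema: https://azuremlschemas.azureedge.net/latest/schedule.schema.json
name: weekly-retrain
trigger:
  type: recurrence
  frequency: week
  interval: 1
  schedule:
    hours: [2]
    minutes: [0]
create_job: ./pipelines/train_pipeline.yml
```

Data-driven and drift-driven triggers are wired up separately, typically via Event Grid subscriptions that invoke the same pipeline definition.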
Week 5: Feature Store
A feature store eliminates the most common source of training-serving skew: features computed differently in notebooks vs. production.
Azure ML managed feature store
- Feature sets: Define features as reusable transformations registered in the feature store
- Materialization: Features are precomputed and stored for both training (offline store) and inference (online store)
- Point-in-time correctness: The feature store handles temporal joins automatically, preventing data leakage in training
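The point-in-time correctness property is worth understanding even if the feature store handles it for you. A minimal illustration of the idea using pandas (the data and feature names are invented): for each label row, join the most recent feature value at or before the label's timestamp, never a future one.

```python
import pandas as pd

# Labels: events we want to train on, each with an event timestamp.
labels = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-10", "2024-03-20"]),
    "customer_id": [1, 1],
    "churned": [0, 1],
})

# Feature snapshots: the value of a feature as of each materialization time.
features = pd.DataFrame({
    "ts": pd.to_datetime(["2024-03-01", "2024-03-15"]),
    "customer_id": [1, 1],
    "avg_spend_30d": [120.0, 95.0],
})

# merge_asof picks, for each label row, the latest feature row at or before
# the label timestamp -- never a future value, so no leakage into training.
training_set = pd.merge_asof(
    labels.sort_values("ts"),
    features.sort_values("ts"),
    on="ts",
    by="customer_id",
)
print(training_set["avg_spend_30d"].tolist())  # [120.0, 95.0]
```

A naive join on `customer_id` alone would attach the March 15 feature value to the March 10 label, leaking future information into training.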
When a feature store is worth the investment
- You have 3+ models sharing common features
- Features require complex transformations (aggregations, window functions, joins across data sources)
- You have experienced training-serving skew — the model performs differently in production than in evaluation
If you have a single model with simple features, a feature store adds complexity without proportional benefit. Start with well-structured pipeline components and migrate to a feature store when the need becomes clear.
Week 6: Model Deployment and A/B Testing
Managed online endpoints
Azure ML's managed online endpoints handle the infrastructure for real-time inference:
- Blue/green deployment: Deploy a new model version alongside the existing one. Route a percentage of traffic to the new version.
- Traffic splitting: Start with 10% traffic to the new model. Monitor error rates, latency, and business metrics. Gradually increase to 100% if metrics are healthy.
- Auto-scaling: Configure scaling rules based on CPU utilization or request count. Set minimum replicas to handle baseline traffic without cold starts.
Deployment checklist
Before any model hits production:
- Model registered in Azure ML with linked training run
- Inference code tested with representative inputs
- Endpoint health probe configured
- Scaling rules validated under load
- Rollback procedure documented and tested
- Logging capturing input features, predictions, and latency
Batch deployment
For non-real-time workloads (scoring millions of records overnight), use batch endpoints:
- Cost-efficient: Use low-priority compute
- Parallelized: Azure ML handles distribution across nodes
- Output to storage: Results written directly to blob storage or data lake
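A sketch of what such a batch deployment can look like in CLI v2 YAML — model, endpoint, and compute names are placeholders, and the exact schema fields should be checked against the current Azure ML documentation:

```yaml
# batch-deployment.yml (hypothetical names; verify fields against current schema)
$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: nightly-scoring
endpoint_name: batch-scoring-endpoint
model: azureml:churn-model:7
compute: azureml:cpu-cluster
resources:
  instance_count: 4          # Azure ML fans the input out across these nodes
mini_batch_size: 1000        # records handed to each scoring call
output_action: append_row    # collect all results into a single output file
```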
Week 7: Monitoring and Drift Detection
Deployment without monitoring is a liability. Models degrade — the question is whether you detect it before or after business impact.
What to monitor
| Signal | Tool | Alert threshold |
|---|---|---|
| Data drift | Azure ML data drift monitor | Statistical distance > 0.1 on key features |
| Prediction drift | Custom metrics in Application Insights | Distribution shift in predicted classes/values |
| Model performance | Ground truth comparison (when available) | Accuracy drop > 5% from baseline |
| Operational health | Azure Monitor | P99 latency above target, error rate > 1% |
| Feature freshness | Feature store monitoring | Materialization lag > SLA |
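"Statistical distance" in the table can mean several metrics; one common choice is the population stability index (PSI), where roughly 0.1 is a typical warning level. A self-contained sketch of computing PSI for a single numeric feature, comparing training and production samples:

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a training (expected) and production (actual) sample of one
    feature. Values near 0 mean the distributions match; ~0.1 is a common
    warning threshold, ~0.25 a common alert threshold."""
    # Bin edges come from the training distribution so both samples are
    # compared on the same grid.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert to proportions, flooring at a tiny value to avoid log(0).
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

In practice this runs over each monitored feature on a schedule, with the per-feature scores feeding the drift alert in the table above.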
Setting up Azure ML monitoring
- Enable data collection on managed endpoints to capture input features and predictions
- Configure data drift monitors comparing production input distributions against training data
- Set up Azure Monitor alerts for endpoint health metrics
- Create a monitoring dashboard in Azure Dashboards or Grafana combining ML and operational metrics
Practical tip: Do not alert on everything. Start with three signals that directly correlate with business impact. Expand monitoring breadth after the team builds operational muscle.
Week 8: CI/CD and Operational Readiness
CI/CD for ML
Integrate your ML pipelines into your existing DevOps workflow:
- CI (on pull request): Lint code, run unit tests on pipeline components, validate environment dependencies, run a fast training pipeline on a data sample
- CD (on merge to main): Execute full training pipeline, deploy to staging, run integration tests, promote to production with traffic splitting
Use Azure DevOps or GitHub Actions with the Azure ML CLI v2 extension. Define pipelines as YAML, not through the UI — this ensures reproducibility and code review.
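A sketch of the CI half as a GitHub Actions workflow — the job names, file paths, and pipeline parameter are hypothetical, and it assumes a service principal stored in the `AZURE_CREDENTIALS` secret:

```yaml
# .github/workflows/ml-ci.yml (illustrative; paths and parameters are placeholders)
name: ml-ci
on:
  pull_request:
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Smoke-test the training pipeline on a data sample
        run: |
          az extension add --name ml
          az ml job create --file pipelines/train_pipeline.yml \
            --set inputs.dataset_version=sample
```

The CD workflow mirrors this on merge to main, running the full pipeline and promoting through staging before any production traffic shift.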
Operational runbooks
Document these procedures before go-live:
- Model rollback: How to revert to the previous model version in under 5 minutes
- Emergency model disable: How to disable ML-powered features and fall back to rules-based logic
- Retraining failure: What happens when a scheduled retraining pipeline fails? Who is notified? What is the SLA for investigation?
- Drift response: When drift is detected, what is the escalation path?
Team responsibilities
| Role | Responsibility |
|---|---|
| ML Engineer | Pipeline development, model training, feature engineering |
| MLOps Engineer | CI/CD, monitoring, infrastructure, endpoint management |
| Data Engineer | Data pipeline reliability, feature store materialization |
| Product Owner | Business metric definition, go/no-go decisions on model updates |
What Success Looks Like
After 8 weeks, you should have:
- Automated training pipelines triggered by schedule or data changes
- A model registry with full lineage from data to production
- Managed endpoints with blue/green deployment and auto-scaling
- Monitoring dashboards with drift detection and alerting
- CI/CD pipelines that test and deploy model changes like any other software
This is not the end — it is the foundation. From here, you can add advanced capabilities like online experimentation frameworks, champion/challenger testing, and multi-model orchestration.
Ready to build your MLOps foundation? Contact us — we can accelerate your team from notebooks to production ML in a structured, low-risk manner.