We Open-Sourced Our Enterprise Databricks AI Platform Blueprint
A production-grade, open-source reference architecture for Azure Databricks covering networking, security, MLOps, agentic AI, and CI/CD — built on the Azure Well-Architected Framework.
Building an enterprise AI platform from scratch is a multi-month effort. You need to solve networking, security, data governance, compute, ML lifecycle management, and CI/CD — and you need them all to work together. Most teams either cobble together tutorials that leave critical gaps or pay consultants to build something proprietary they can never fully own.
We decided to change that. Today we are open-sourcing our complete enterprise Databricks platform blueprint: databricks-enterprise-ai-platform.
What This Is
This is not a quickstart or a hello-world Terraform module. It is a full production reference architecture that provisions, configures, and operates an Azure Databricks platform for ML and LLM workloads. Every design decision is mapped to the Azure Well-Architected Framework across all five pillars: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.
The repository contains:
- 9 Terraform modules covering landing zone, networking, firewall, security, storage, monitoring, Databricks, compute integration, and container registry
- Delta Lake medallion architecture with Unity Catalog governance
- 3 sample ML projects with real training code (revenue forecasting, anomaly detection, LLM document triage)
- 3 agentic AI patterns running on Azure Functions (orchestrator, multi-agent, monitoring responder)
- 4 GitHub Actions workflows for infrastructure and ML CI/CD with zero stored secrets
The Architecture
The platform follows a hub-spoke network topology:
Hub VNet hosts shared services — Azure Firewall with forced tunneling, Azure Bastion for secure management access, VPN Gateway with Entra ID authentication, and Private DNS Zones for seven Azure services.
Spoke VNet hosts the workloads — Databricks with VNet injection and Secure Cluster Connectivity (no public IPs on compute nodes), Azure Functions with VNet integration, and private endpoints for storage, Key Vault, and Container Registry.
All egress traffic flows through Azure Firewall with explicit FQDN whitelists. No PaaS service has a public endpoint. Three user-assigned managed identities handle authentication — no passwords or API keys anywhere in the stack.
Why We Built This
Every enterprise engagement we run at CC Conceptualise starts with the same foundational work: setting up secure networking, configuring Databricks with private connectivity, building the data lake with proper encryption, and wiring up CI/CD. We found ourselves solving the same architectural problems repeatedly.
Rather than keep this knowledge locked in client deliverables, we extracted the patterns into a reusable, opinionated blueprint that any team can fork and adapt.
What Makes This Different
Zero-trust by default, not bolted on later. Private endpoints on every PaaS service. Firewall-forced tunneling on all egress. OIDC federation for CI/CD — not a single client secret in GitHub. This is the security posture enterprises need from day one.
WAF alignment is documented, not assumed. Every Terraform resource includes comments mapping it to specific Well-Architected Framework pillars with rationale. This is audit-ready — your cloud architects and compliance teams can trace every design decision.
End-to-end, not just infrastructure. Most open-source Terraform repos stop at "here's a Databricks workspace." Ours continues through Unity Catalog setup, medallion data pipelines, ML model training and promotion, agentic AI workflows, and automated CI/CD — the full stack from terraform apply to production model serving.
Cost controls built in. Configurable budget alerts (default $1,000), cluster policies limiting max workers and enforcing auto-termination, storage lifecycle rules, and consumption-based Functions. Because unchecked cloud spend kills more AI projects than bad models.
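As an illustration of the cluster-side limits, a Databricks cluster policy enforcing a worker cap and auto-termination might look like the sketch below. The specific keys follow Databricks' policy-definition JSON format, but the values (8 workers, 30 minutes) and the tag name are illustrative, not necessarily what ships in the repo's `databricks` module.

```python
import json

# Illustrative cluster policy in Databricks' policy-definition JSON format.
# "range" caps a numeric attribute; "fixed" pins it and prevents override.
cluster_policy = {
    "num_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {"type": "fixed", "value": 30},
    "custom_tags.project": {"type": "fixed", "value": "enterprise-ai-platform"},
}

policy_json = json.dumps(cluster_policy, indent=2)
print(policy_json)
```

Attaching a policy like this to every interactive cluster is what turns "please remember to shut down your cluster" into an enforced invariant.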
The 9 Modules
| Module | What It Does |
|---|---|
| landing_zone | Resource group + Azure Policy enforcement (tags, location, HTTPS, diagnostics) |
| networking | Hub-spoke VNets, 7+ subnets, NSGs, route tables, NAT Gateway, Private DNS Zones, VPN, Bastion |
| firewall | Azure Firewall with application and network rule collections (Databricks control plane, PyPI, GitHub whitelists) |
| security | Key Vault with private endpoint, 3 managed identities, RBAC role assignments |
| storage | ADLS Gen2, 4 containers (bronze/silver/gold/mlflow), CMK encryption, lifecycle rules, private endpoints |
| monitoring | Log Analytics, Application Insights, action groups, budget alerts, scheduled query alerts |
| databricks | Premium workspace with VNet injection, 4 private endpoints, Access Connector, diagnostics |
| compute_integration | Azure Functions EP1, Service Bus Premium (3 queues), Event Grid system topic |
| acr | Container Registry Premium with private endpoint and AcrPull RBAC |
Agentic AI Patterns
The repository includes three production-ready agentic AI patterns running on Azure Functions:
Durable Orchestrator: A sequential workflow that validates data, triggers a Databricks training job, polls for completion, evaluates metrics, and decides whether to promote or reject the model. This replaces fragile cron-based ML pipelines with a self-healing, resumable orchestration.
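The orchestrator's decision flow can be sketched in plain Python, independent of the Durable Functions runtime. The activity names, the stubbed job result, and the accuracy threshold below are all illustrative stand-ins, not the repo's actual code; in the real pattern each step is a durable activity that survives restarts.

```python
# Hedged sketch of the sequential validate -> train -> evaluate -> promote flow.
# In the repo this runs as an Azure Durable Functions orchestration; here the
# Databricks Jobs API call and polling are replaced by a deterministic stub.

def validate_data(dataset: list) -> bool:
    """Activity 1: reject empty or missing training data."""
    return bool(dataset)

def run_training_job(dataset: list) -> dict:
    """Activities 2-3 (stub): trigger a Databricks training run and poll
    the Jobs API until it reaches a terminal state."""
    return {"state": "TERMINATED", "accuracy": 0.91}

def orchestrate(dataset: list, accuracy_threshold: float = 0.85) -> str:
    """Activity 4: evaluate metrics and decide promote vs. reject."""
    if not validate_data(dataset):
        return "rejected: invalid data"
    result = run_training_job(dataset)
    if result["state"] != "TERMINATED":
        return "rejected: training failed"
    return "promote" if result["accuracy"] >= accuracy_threshold else "reject"

print(orchestrate([1, 2, 3]))  # the stubbed 0.91 accuracy clears the 0.85 bar
```

The value of the durable version over this sketch is checkpointing: if the function host recycles mid-poll, the orchestration resumes from its last completed activity instead of retraining from scratch.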
Multi-Agent (Planner-Executor-Critic): An iterative loop where a Planner agent decomposes a task, an Executor agent runs it, and a Critic agent evaluates the output — retrying up to three times if the result is rejected. This pattern handles complex, multi-step ML workflows where simple sequential execution is insufficient.
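The core of the pattern is the retry loop, which can be shown with toy agents. The three callables below are deliberately trivial stand-ins for the repo's Azure Functions; only the loop structure (plan, execute, critique, feed criticism back into the next plan) reflects the pattern itself.

```python
# Hedged sketch of the Planner-Executor-Critic loop with illustrative agents.

def run_with_critic(task, planner, executor, critic, max_attempts=3):
    """Plan, execute, and retry with critic feedback up to max_attempts."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        plan = planner(task, feedback)       # Planner decomposes the task,
                                             # using prior criticism if any
        output = executor(plan)              # Executor carries out the plan
        accepted, feedback = critic(output)  # Critic accepts or explains why not
        if accepted:
            return {"status": "accepted", "attempts": attempt, "output": output}
    return {"status": "rejected", "attempts": max_attempts, "output": output}

# Toy agents: the critic only accepts uppercase, forcing exactly one retry.
planner = lambda task, fb: task.upper() if fb else task
executor = lambda plan: plan
critic = lambda out: (out == "TRAIN MODEL", "needs uppercase")

result = run_with_critic("train model", planner, executor, critic)
print(result["status"], result["attempts"])
```

The important design choice is that the Critic's feedback flows back into the Planner, so each retry is informed rather than a blind re-run.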
Monitoring Responder: An event-driven agent triggered by Service Bus messages from monitoring alerts. It classifies the severity, creates an incident record, and automatically mitigates — rolling back a model, triggering retraining, or logging for manual review depending on the alert type.
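The responder's dispatch logic reduces to classify-then-mitigate. The alert fields, severity rules, and mitigation names below are illustrative assumptions, not the repo's actual schema; in the deployed pattern the input arrives as a Service Bus message and the mitigations call real Databricks and registry APIs.

```python
# Hedged sketch of the event-driven responder's classify-and-mitigate logic.

def classify_severity(alert: dict) -> str:
    """Map an incoming monitoring alert to a severity bucket (illustrative rules)."""
    if alert.get("metric") == "model_error_rate" and alert.get("value", 0) > 0.2:
        return "critical"
    if alert.get("metric") == "data_drift":
        return "warning"
    return "info"

def respond(alert: dict) -> str:
    """Classify, record an incident, and choose a mitigation action."""
    severity = classify_severity(alert)
    incident = {"severity": severity, "alert": alert}  # would be persisted
    if severity == "critical":
        return "rollback_model"      # would revert to the last good model version
    if severity == "warning":
        return "trigger_retraining"  # would enqueue a Databricks training job
    return "log_for_review"          # leave for a human to triage

print(respond({"metric": "model_error_rate", "value": 0.35}))
```

Keeping the classification rules separate from the mitigation dispatch makes it straightforward to tune thresholds without touching the response actions.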
Getting Started
```shell
git clone https://github.com/MedGhassen/databricks-enterprise-ai-platform.git
cd databricks-enterprise-ai-platform
```

Start with the docs/ directory for architecture documentation and the WAF alignment matrix. Then review the Terraform variable definitions in the module directories to understand the configuration knobs before running terraform init and terraform plan.
The project is MIT-licensed. Fork it, adapt it, break it apart, put it back together. If you build something interesting on top of it, we would love to hear about it.
Related Resources
- Data Lakehouse Architecture on Azure — Deep dive into the medallion architecture pattern used in this platform.
- MLOps on Azure: From Experiment to Production — The MLOps practices that govern the ML lifecycle in this blueprint.
- Infrastructure as Code Strategy — Strategic foundations for managing Terraform at enterprise scale.
Questions about deploying this platform or adapting it to your infrastructure? Contact us — we built it, and we help teams operationalize it.