
We Open-Sourced Our Enterprise Databricks AI Platform Blueprint

A production-grade, open-source reference architecture for Azure Databricks covering networking, security, MLOps, agentic AI, and CI/CD — built on the Azure Well-Architected Framework.

Building an enterprise AI platform from scratch is a multi-month effort. You need to solve networking, security, data governance, compute, ML lifecycle management, and CI/CD — and you need them all to work together. Most teams either cobble together tutorials that leave critical gaps or pay consultants to build something proprietary they can never fully own.

We decided to change that. Today we are open-sourcing our complete enterprise Databricks platform blueprint: databricks-enterprise-ai-platform.

What This Is

This is not a quickstart or a hello-world Terraform module. It is a full production reference architecture that provisions, configures, and operates an Azure Databricks platform for ML and LLM workloads. Every design decision is mapped to the Azure Well-Architected Framework across all five pillars: Reliability, Security, Cost Optimization, Operational Excellence, and Performance Efficiency.

The repository contains:

  • 9 Terraform modules covering landing zone, networking, firewall, security, storage, monitoring, Databricks, compute integration, and container registry
  • Delta Lake medallion architecture with Unity Catalog governance
  • 3 sample ML projects with real training code (revenue forecasting, anomaly detection, LLM document triage)
  • 3 agentic AI patterns running on Azure Functions (orchestrator, multi-agent, monitoring responder)
  • 4 GitHub Actions workflows for infrastructure and ML CI/CD with zero stored secrets

The Architecture

The platform follows a hub-spoke network topology:

Hub VNet hosts shared services — Azure Firewall with forced tunneling, Azure Bastion for secure management access, VPN Gateway with Entra ID authentication, and Private DNS Zones for seven Azure services.

Spoke VNet hosts the workloads — Databricks with VNet injection and Secure Cluster Connectivity (no public IPs on compute nodes), Azure Functions with VNet integration, and private endpoints for storage, Key Vault, and Container Registry.

All egress traffic flows through Azure Firewall with explicit FQDN whitelists. No PaaS service has a public endpoint. Three user-assigned managed identities handle authentication — no passwords or API keys anywhere in the stack.
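To make the deny-by-default egress model concrete, here is a minimal Python sketch of an FQDN allowlist check. The domain patterns are illustrative assumptions; the authoritative rules live in the repository's firewall Terraform module.

```python
from fnmatch import fnmatch

# Hypothetical allowlist mirroring the firewall's application rules:
# Databricks control plane, PyPI, and GitHub. Real entries are defined
# in the firewall Terraform module, not here.
EGRESS_ALLOWLIST = [
    "*.azuredatabricks.net",
    "pypi.org",
    "files.pythonhosted.org",
    "github.com",
    "*.github.com",
]

def egress_allowed(fqdn: str) -> bool:
    """Return True only if the FQDN matches an allowlisted pattern (deny by default)."""
    return any(fnmatch(fqdn, pattern) for pattern in EGRESS_ALLOWLIST)
```

Anything not matching a pattern is dropped, which is exactly the posture Azure Firewall application rules enforce at the network layer.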

Why We Built This

Every enterprise engagement we run at CC Conceptualise starts with the same foundational work: setting up secure networking, configuring Databricks with private connectivity, building the data lake with proper encryption, and wiring up CI/CD. We found ourselves solving the same architectural problems repeatedly.

Rather than keep this knowledge locked in client deliverables, we extracted the patterns into a reusable, opinionated blueprint that any team can fork and adapt.

What Makes This Different

Zero-trust by default, not bolted on later. Private endpoints on every PaaS service. Firewall-forced tunneling on all egress. OIDC federation for CI/CD — not a single client secret in GitHub. This is the security posture enterprises need from day one.

WAF alignment is documented, not assumed. Every Terraform resource includes comments mapping it to specific Well-Architected Framework pillars with rationale. This is audit-ready — your cloud architects and compliance teams can trace every design decision.

End-to-end, not just infrastructure. Most open-source Terraform repos stop at "here's a Databricks workspace." Ours continues through Unity Catalog setup, medallion data pipelines, ML model training and promotion, agentic AI workflows, and automated CI/CD — the full stack from terraform apply to production model serving.

Cost controls built in. Configurable budget alerts (default $1,000), cluster policies limiting max workers and enforcing auto-termination, storage lifecycle rules, and consumption-based Functions. Because unchecked cloud spend kills more AI projects than bad models.
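A cluster policy of the kind described above can be sketched as a Python dict in the standard Databricks policy-definition shape; the specific limits below are illustrative examples, not the repository's defaults.

```python
import json

# Illustrative Databricks cluster policy: caps worker count and forces
# auto-termination. Values here are examples; the repository sets its own.
COST_CONTROL_POLICY = {
    "num_workers": {"type": "range", "maxValue": 8},
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
}

# Policies are submitted to the workspace as a JSON definition string,
# e.g. via the Databricks Terraform provider.
policy_json = json.dumps(COST_CONTROL_POLICY, indent=2)
```

Clusters created under such a policy cannot exceed the worker cap and cannot disable auto-termination, which bounds the worst-case spend of any single job.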

The 9 Modules

  • landing_zone: Resource group + Azure Policy enforcement (tags, location, HTTPS, diagnostics)
  • networking: Hub-spoke VNets, 7+ subnets, NSGs, route tables, NAT Gateway, Private DNS Zones, VPN, Bastion
  • firewall: Azure Firewall with application and network rule collections (Databricks control plane, PyPI, GitHub whitelists)
  • security: Key Vault with private endpoint, 3 managed identities, RBAC role assignments
  • storage: ADLS Gen2, 4 containers (bronze/silver/gold/mlflow), CMK encryption, lifecycle rules, private endpoints
  • monitoring: Log Analytics, Application Insights, action groups, budget alerts, scheduled query alerts
  • databricks: Premium workspace with VNet injection, 4 private endpoints, Access Connector, diagnostics
  • compute_integration: Azure Functions EP1, Service Bus Premium (3 queues), Event Grid system topic
  • acr: Container Registry Premium with private endpoint and AcrPull RBAC

Agentic AI Patterns

The repository includes three production-ready agentic AI patterns running on Azure Functions:

Durable Orchestrator: A sequential workflow that validates data, triggers a Databricks training job, polls for completion, evaluates metrics, and decides whether to promote or reject the model. This replaces fragile cron-based ML pipelines with a self-healing, resumable orchestration.
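The orchestrator's decision flow can be read as a plain function. The sketch below is runtime-free (in the repository this runs as an Azure Durable Functions orchestration); the activity names and accuracy threshold are hypothetical.

```python
# Runtime-free sketch of the orchestrator's decision flow. The "activities"
# are injected as plain callables so the promote/reject logic can be read
# and tested in isolation from the Durable Functions runtime.

def orchestrate_training(validate, train, evaluate, promote, reject,
                         min_accuracy=0.85):
    """Validate data, train, evaluate metrics, then promote or reject."""
    if not validate():
        return "rejected: validation failed"
    run_id = train()            # triggers the Databricks training job
    metrics = evaluate(run_id)  # polls the run and collects metrics
    if metrics.get("accuracy", 0.0) >= min_accuracy:
        promote(run_id)
        return "promoted"
    reject(run_id)
    return "rejected: metrics below threshold"
```

Because the orchestration state lives in the framework rather than a cron schedule, a crash mid-run resumes from the last completed activity instead of restarting from scratch.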

Multi-Agent (Planner-Executor-Critic): An iterative loop where a Planner agent decomposes a task, an Executor agent runs it, and a Critic agent evaluates the output, retrying for up to three attempts if the result is rejected. This pattern handles complex, multi-step ML workflows where simple sequential execution is insufficient.
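The control loop behind that pattern is compact. This is a minimal sketch with the agent internals stubbed out as callables (the repository's agents call LLMs and Databricks jobs instead):

```python
# Minimal Planner-Executor-Critic loop. The plan/execute/critique callables
# stand in for the real agents; the loop structure is what matters here.

def run_with_critic(plan, execute, critique, max_attempts=3):
    """Plan once, then execute and critique for up to max_attempts rounds."""
    steps = plan()
    feedback = None
    for attempt in range(1, max_attempts + 1):
        result = execute(steps, feedback)
        accepted, feedback = critique(result)  # (accepted?, feedback for retry)
        if accepted:
            return {"status": "accepted", "attempts": attempt, "result": result}
    return {"status": "rejected", "attempts": max_attempts, "result": result}
```

Feeding the Critic's feedback back into the Executor on the next attempt is what distinguishes this from a blind retry loop.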

Monitoring Responder: An event-driven agent triggered by Service Bus messages from monitoring alerts. It classifies the severity, creates an incident record, and automatically mitigates — rolling back a model, triggering retraining, or logging for manual review depending on the alert type.
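The responder's classify-then-dispatch step can be sketched as a lookup table. Alert types, the severity rule, and the mitigation names below are illustrative assumptions; the real agent consumes Service Bus messages emitted by the monitoring module.

```python
# Sketch of the responder's dispatch logic. The alert schema and the
# mapping from alert type to mitigation are hypothetical examples.

MITIGATIONS = {
    "model_drift":     "trigger_retraining",
    "serving_failure": "rollback_model",
    "latency_spike":   "log_for_manual_review",
}

def respond(alert: dict) -> dict:
    """Classify an alert's severity and pick a mitigation (default: manual review)."""
    severity = "high" if alert.get("error_rate", 0.0) > 0.05 else "low"
    action = MITIGATIONS.get(alert.get("type"), "log_for_manual_review")
    return {
        "severity": severity,
        "action": action,
        "incident_id": f"inc-{alert.get('id', 'unknown')}",
    }
```

Unknown alert types deliberately fall through to manual review rather than triggering an automated action, which keeps the agent's blast radius bounded.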

Getting Started

Bash
git clone https://github.com/MedGhassen/databricks-enterprise-ai-platform.git
cd databricks-enterprise-ai-platform

Start with the docs/ directory for architecture documentation and the WAF alignment matrix. Then review the Terraform variable definitions in the module directories to understand the configuration knobs before running terraform init and terraform plan.

The project is MIT-licensed. Fork it, adapt it, break it apart, put it back together. If you build something interesting on top of it, we would love to hear about it.


Questions about deploying this platform or adapting it to your infrastructure? Contact us — we built it, and we help teams operationalize it.


Frequently Asked Questions

What is the databricks-enterprise-ai-platform project?
It is a production-grade, open-source reference architecture for building an end-to-end AI/ML platform on Azure Databricks. It covers everything from hub-spoke networking and firewall rules to Unity Catalog governance, MLOps pipelines, agentic AI patterns, and CI/CD automation — all aligned to the Azure Well-Architected Framework.
Is this suitable for production use?
Yes. The architecture enforces zero-trust networking, private endpoints on all PaaS services, managed identities with no stored secrets, and OIDC federation for CI/CD. It includes environment parity between dev and prod, budget alerts, and Azure Policy enforcement. However, you should review and adapt the Terraform variables, CIDR ranges, and cost thresholds to your organisation's requirements.
What Azure services does the platform use?
The platform provisions Azure Databricks Premium (VNet-injected), ADLS Gen2 with CMK encryption, Azure Key Vault, Azure Firewall, Azure Functions Premium, Azure Service Bus, Azure Event Grid, Azure Container Registry, Log Analytics, Application Insights, and Azure Policy — all connected via private endpoints in a hub-spoke network topology.
Can I use this with AWS or GCP instead of Azure?
The Terraform modules are Azure-specific. However, the architectural patterns — hub-spoke networking, medallion data architecture, MLOps lifecycle, and agentic AI workflows — are cloud-agnostic concepts that can be adapted to AWS or GCP equivalents.
What license is the project under?
The project is released under the MIT License, allowing free use, modification, and distribution for both commercial and non-commercial purposes.
