
Building a Zero-Trust AI Platform on Azure Databricks

How we designed a fully private Azure Databricks platform with hub-spoke networking, forced-tunnel firewall, private endpoints, and managed identities — no public IPs, no stored secrets.

Most Databricks deployments start with the default configuration: public endpoints, permissive networking, and service principal secrets stored in environment variables. It works for a proof of concept. It does not work for enterprises handling sensitive data, operating under regulatory obligations, or facing audit scrutiny.

This post walks through the security architecture of our open-source enterprise Databricks platform, explaining how each layer enforces Zero Trust principles.

The Security Problem with Default Databricks

A default Azure Databricks workspace has several security gaps:

  • Public endpoints for the workspace UI and API
  • Compute nodes with public IPs that can reach the internet directly
  • Shared infrastructure with other Databricks customers in the control plane
  • Service principal secrets that need to be stored and rotated somewhere

For regulated industries — financial services, healthcare, government, critical infrastructure — any of these is a compliance blocker. For NIS2-scoped organisations, the combination is untenable.

Layer 1: Hub-Spoke Network Topology

The foundation is a hub-spoke network design where shared security services live in the hub and workloads live in isolated spokes.

Hub VNet contains:

  • Azure Firewall (forced tunneling subnet)
  • Azure Bastion (management access without public RDP/SSH)
  • VPN Gateway with Entra ID authentication
  • Private DNS Resolver for VPN client name resolution
  • 7 Private DNS Zones for Azure services

Spoke VNet contains:

  • Databricks public and private subnets (VNet injection)
  • Azure Functions integration subnet
  • Private endpoint subnet
  • NAT Gateway for controlled outbound connectivity

VNet peering connects hub to spoke. User-Defined Routes on the spoke subnets force all egress through the hub firewall. No spoke resource can reach the internet without passing through firewall inspection.
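
The forced-egress pattern can be sketched in Terraform with the azurerm provider. Resource names, the spoke subnet reference, and the firewall reference are illustrative placeholders, not the repository's actual identifiers:

```hcl
# Sketch: a User-Defined Route that sends all spoke egress to the hub firewall.
# All names here are placeholders; adapt to the repo's module structure.
resource "azurerm_route_table" "spoke_egress" {
  name                = "rt-spoke-egress"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name

  route {
    name                   = "default-via-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    # Next hop is the firewall's private IP inside the hub VNet
    next_hop_in_ip_address = azurerm_firewall.hub.ip_configuration[0].private_ip_address
  }
}

# Associate the route table with a spoke subnet (e.g. Databricks private subnet)
resource "azurerm_subnet_route_table_association" "databricks_private" {
  subnet_id      = azurerm_subnet.databricks_private.id
  route_table_id = azurerm_route_table.spoke_egress.id
}
```

Because the default route (0.0.0.0/0) points at the firewall's private IP rather than `Internet`, there is no path out of the spoke that bypasses inspection.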

Layer 2: Firewall-Forced Tunneling

Azure Firewall sits at the network boundary with explicit allow rules:

Application rules allow only approved FQDN targets:

  • Databricks control plane endpoints
  • PyPI (for library installation)
  • Ubuntu package repositories (for Databricks Runtime)
  • GitHub (for CI/CD)

Network rules allow specific IP ranges for Databricks infrastructure services.

Everything else is denied by default. If a compromised node attempts to exfiltrate data to an unknown endpoint, the firewall blocks it and logs the attempt.
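
An application rule collection of this shape might look as follows in Terraform. The source address range and rule names are placeholders; the PyPI FQDNs are the standard ones, but verify the full FQDN list against the Databricks documentation for your region:

```hcl
# Sketch: explicit FQDN allow rules on the hub firewall (classic rules).
# Placeholder names and CIDR; extend with Databricks control plane FQDNs.
resource "azurerm_firewall_application_rule_collection" "allow_fqdns" {
  name                = "allow-required-fqdns"
  azure_firewall_name = azurerm_firewall.hub.name
  resource_group_name = azurerm_resource_group.hub.name
  priority            = 200
  action              = "Allow"

  rule {
    name             = "pypi"
    source_addresses = ["10.1.0.0/16"] # spoke VNet range (placeholder)
    target_fqdns     = ["pypi.org", "files.pythonhosted.org"]

    protocol {
      port = 443
      type = "Https"
    }
  }
}
```

With the firewall's default-deny behaviour, anything not matched by an allow rule is dropped and logged, which is what makes the alerting in Layer 7 meaningful.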

Firewall diagnostic logs feed into Log Analytics with scheduled query alerts. A spike in denied connections triggers investigation.

Layer 3: Private Endpoints Everywhere

Every PaaS service is accessed exclusively through private endpoints:

| Service | Private Endpoints | Why |
| --- | --- | --- |
| Databricks | 4 (UI/API, browser auth, DFS, blob) | Eliminates public workspace and DBFS access |
| ADLS Gen2 | 2 (blob, DFS) | Data lake accessible only from VNet |
| Key Vault | 1 | Secrets never traverse public networks |
| Container Registry | 1 | Container images pulled over private network |

Each private endpoint has a corresponding Private DNS Zone linked to both hub and spoke VNets. When a service references *.blob.core.windows.net, DNS resolves to a private IP within the VNet — not a public endpoint.

Public network access is disabled on every service. Even with the correct credentials, you cannot reach these services from the public internet.
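
One endpoint plus its DNS zone can be sketched like this. Resource names are placeholders; the zone name `privatelink.blob.core.windows.net` is the real zone Azure expects for blob private endpoints:

```hcl
# Sketch: blob private endpoint for the data lake, with private DNS resolution.
resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = azurerm_resource_group.hub.name
}

# Link the zone to the spoke VNet (repeat for the hub VNet)
resource "azurerm_private_dns_zone_virtual_network_link" "blob_spoke" {
  name                  = "link-spoke"
  resource_group_name   = azurerm_resource_group.hub.name
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = azurerm_virtual_network.spoke.id
}

resource "azurerm_private_endpoint" "datalake_blob" {
  name                = "pe-datalake-blob"
  location            = azurerm_resource_group.spoke.location
  resource_group_name = azurerm_resource_group.spoke.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-datalake-blob"
    private_connection_resource_id = azurerm_storage_account.datalake.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  # Registers the endpoint's private IP in the DNS zone automatically
  private_dns_zone_group {
    name                 = "dns"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}
```

The `private_dns_zone_group` block is what makes `*.blob.core.windows.net` resolve to the endpoint's private IP from inside the linked VNets.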

Layer 4: Managed Identities — No Secrets

The platform uses three user-assigned managed identities:

  1. Databricks identity — Used by the Access Connector for Unity Catalog to reach ADLS Gen2
  2. Functions identity — Used by Azure Functions to interact with Service Bus and Key Vault
  3. CI/CD identity — Used by GitHub Actions via OIDC federation

No identity has a client secret. Authentication happens through Azure's internal token service. The CI/CD identity deserves special attention: GitHub Actions exchanges an OIDC token for an Azure access token at runtime — no secrets are stored in GitHub, no credentials to rotate, no risk of secret leakage in logs.

RBAC assignments follow least privilege:

  • Databricks identity gets Storage Blob Data Contributor on the data lake — nothing else
  • Functions identity gets Key Vault Secrets User and Service Bus Data Sender/Receiver — nothing else
  • CI/CD identity gets Contributor scoped to the resource group — not the subscription
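
The CI/CD identity's OIDC federation and resource-group-scoped role can be sketched as follows. The credential `subject` shown here assumes a workflow running on the repository's `main` branch; adjust it to your trust policy:

```hcl
# Sketch: secretless CI/CD identity federated with GitHub Actions via OIDC.
resource "azurerm_user_assigned_identity" "cicd" {
  name                = "id-cicd"
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
}

# Trust tokens issued by GitHub for this repo/branch; no client secret exists
resource "azurerm_federated_identity_credential" "github" {
  name                = "github-main"
  resource_group_name = azurerm_resource_group.platform.name
  parent_id           = azurerm_user_assigned_identity.cicd.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = "https://token.actions.githubusercontent.com"
  subject             = "repo:MedGhassen/databricks-enterprise-ai-platform:ref:refs/heads/main"
}

# Least privilege: Contributor on the resource group, not the subscription
resource "azurerm_role_assignment" "cicd_rg" {
  scope                = azurerm_resource_group.platform.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.cicd.principal_id
}
```

At runtime, the GitHub workflow presents its OIDC token to Entra ID, which exchanges it for an Azure access token because the `issuer` and `subject` match this federated credential.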

Layer 5: Data Encryption

At rest: ADLS Gen2 uses Customer-Managed Keys stored in Key Vault with infrastructure-level double encryption. This means data is encrypted twice — once by the storage service and once at the infrastructure layer.

In transit: All connections use TLS. Private endpoints ensure traffic never leaves the Azure backbone. HSTS headers enforce HTTPS on all web interfaces.

Key management: Key Vault has soft-delete (90 days) and purge protection enabled. Even an administrator cannot permanently delete an encryption key without waiting through the retention period.
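
The relevant Key Vault and storage settings can be sketched like this. Names are placeholders, and the CMK wiring (key creation plus the storage account's key reference) is omitted for brevity:

```hcl
# Sketch: Key Vault with soft-delete and purge protection for CMK storage.
resource "azurerm_key_vault" "platform" {
  name                       = "kv-platform" # placeholder
  location                   = azurerm_resource_group.platform.location
  resource_group_name        = azurerm_resource_group.platform.name
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "premium"
  soft_delete_retention_days = 90   # deleted keys recoverable for 90 days
  purge_protection_enabled   = true # no permanent deletion during retention
}

# Sketch: ADLS Gen2 account with infrastructure (double) encryption
resource "azurerm_storage_account" "datalake" {
  name                              = "stdatalake" # placeholder
  resource_group_name               = azurerm_resource_group.platform.name
  location                          = azurerm_resource_group.platform.location
  account_tier                      = "Standard"
  account_replication_type          = "ZRS"
  is_hns_enabled                    = true # hierarchical namespace = ADLS Gen2
  infrastructure_encryption_enabled = true # second encryption layer
  public_network_access_enabled     = false
}
```

Note that `infrastructure_encryption_enabled` can only be set at account creation time, so it belongs in the initial deployment, not a later change.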

Layer 6: Azure Policy Enforcement

The landing zone module deploys Azure Policy assignments that act as guardrails:

  • Require tags on all resources (environment, project, owner, cost-centre)
  • Restrict locations to approved Azure regions
  • Enforce HTTPS on storage accounts
  • Require diagnostics to be sent to Log Analytics

These policies prevent configuration drift. If someone creates a storage account without HTTPS, the deployment fails. If a resource is missing required tags, it is flagged for remediation.
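
A single guardrail of this kind can be sketched by assigning a built-in policy definition at resource group scope. Looking the definition up by display name avoids hard-coding its GUID:

```hcl
# Sketch: assign the built-in "secure transfer" policy to the resource group.
data "azurerm_policy_definition" "https_storage" {
  display_name = "Secure transfer to storage accounts should be enabled"
}

resource "azurerm_resource_group_policy_assignment" "https_storage" {
  name                 = "enforce-https-storage"
  resource_group_id    = azurerm_resource_group.platform.id
  policy_definition_id = data.azurerm_policy_definition.https_storage.id
}
```

The same pattern repeats for the tag, location, and diagnostics policies, each as its own assignment.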

Layer 7: Monitoring and Alerting

Security without visibility is security theatre. The platform deploys:

  • Log Analytics workspace aggregating diagnostics from all services
  • Scheduled query alerts for anomalous patterns (firewall denies, authentication failures, unexpected storage access)
  • Configurable budget alerts (default $1,000, adjustable per environment) — because runaway compute is a security incident too
  • Application Insights for Azure Functions telemetry

Firewall deny logs, Key Vault access logs, storage authentication failure logs, and Databricks job outcome logs all feed into a single pane of glass.
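
A firewall-deny alert of the kind described above might be sketched like this. The KQL assumes firewall diagnostics land in the legacy `AzureDiagnostics` table with its `msg_s` column; the threshold and schedule are placeholders to tune per environment:

```hcl
# Sketch: scheduled query alert on a spike in firewall deny events.
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "fw_denies" {
  name                 = "alert-firewall-denies"
  resource_group_name  = azurerm_resource_group.hub.name
  location             = azurerm_resource_group.hub.location
  scopes               = [azurerm_log_analytics_workspace.platform.id]
  severity             = 2
  evaluation_frequency = "PT5M"  # run the query every 5 minutes
  window_duration      = "PT15M" # over a 15-minute lookback window

  criteria {
    query                   = <<-KQL
      AzureDiagnostics
      | where Category in ("AzureFirewallApplicationRule", "AzureFirewallNetworkRule")
      | where msg_s has "Deny"
    KQL
    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 50 # placeholder: >50 denies in 15 min
  }
}
```

Wire the alert to an action group for paging; a sustained spike in denies is exactly the exfiltration signal the firewall layer is designed to surface.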

Mapping to Compliance Frameworks

This architecture directly supports:

  • NIS2 — Network segmentation, incident detection (firewall + monitoring), access control (managed identities + RBAC), supply chain security (private endpoints)
  • ISO 27001 — Asset management (tags + policy), access control (RBAC), cryptography (CMK), operations security (monitoring), communications security (private endpoints)
  • EU AI Act — Logging and traceability (Log Analytics + MLflow), cybersecurity (the entire stack), robustness (drift detection + auto-mitigation)

The WAF alignment documentation in the repository maps every resource to specific compliance controls.

Getting Started

The full implementation is available at github.com/MedGhassen/databricks-enterprise-ai-platform under the MIT license. Start with the modules/networking and modules/firewall directories to understand the network security foundation.


Questions about implementing Zero Trust for your Databricks environment? Contact us — this is what we do.


Frequently Asked Questions

Why does Databricks need VNet injection for Zero Trust?
Without VNet injection, Databricks compute nodes communicate over the public internet to the control plane. VNet injection places compute in your own subnets, allowing you to route all traffic through your firewall, enforce NSGs, and eliminate public IP addresses on worker nodes via Secure Cluster Connectivity.
How many private endpoints does the platform use?
The platform deploys private endpoints for Databricks (4 endpoints: combined UI/API, browser authentication, DFS, and blob for DBFS), ADLS Gen2 (blob and DFS), Key Vault, and Container Registry — each with a corresponding Private DNS Zone for name resolution.
Can I use this architecture with Databricks on AWS?
The Zero Trust principles apply universally, but the implementation is Azure-specific. On AWS, equivalent concepts include VPC with PrivateLink, AWS Network Firewall, and IAM roles instead of managed identities. The architectural patterns transfer; the Terraform code does not.
Does the forced-tunnel firewall add latency to Databricks jobs?
Azure Firewall adds approximately 1-2ms of latency per connection. For Databricks workloads, the overhead is negligible compared to job execution time. The security benefit of inspecting and controlling all egress traffic far outweighs the minimal performance impact.
