Multi-Region Azure Architecture: Disaster Recovery Patterns That Actually Work
Practical multi-region disaster recovery patterns for Azure with Bicep templates, RTO/RPO targets, and real cost analysis for active-active, active-passive, and pilot light architectures.
Disaster recovery slides look great in executive presentations. Multi-region architecture diagrams with arrows flowing between paired regions are reassuring. Then a region actually fails, and organisations discover that their DR plan was never tested, their failover does not work, and their RTO target of "15 minutes" is actually four hours of panicked manual intervention.
This guide covers DR patterns that actually work in production. We have deployed and tested each of these patterns with enterprise clients. We will be specific about what each pattern delivers, what it costs, and where it breaks down.
Understanding RTO and RPO in Practice
Recovery Time Objective (RTO): How long your application can be down. This is not a technical metric — it is a business decision. An e-commerce site losing EUR 50,000 per hour of downtime has a different RTO budget than an internal HR portal.
Recovery Point Objective (RPO): How much data loss is acceptable. RPO near zero means no transactions can be lost. RPO of one hour means you accept losing up to 60 minutes of data.
The uncomfortable truth about RTO/RPO
Most enterprises set RTO/RPO targets without understanding the cost implications:
| RTO Target | RPO Target | Pattern Required | Approximate Cost Premium |
|---|---|---|---|
| < 1 minute | Near zero | Active-active | 80-100 % of base cost |
| < 15 minutes | < 5 minutes | Active-passive (warm) | 40-60 % of base cost |
| < 1 hour | < 15 minutes | Active-passive (cold) | 20-35 % of base cost |
| < 4 hours | < 1 hour | Pilot light | 10-20 % of base cost |
| < 24 hours | < 24 hours | Backup/restore | 5-10 % of base cost |
Negotiate RTO/RPO with the business before designing the architecture. "Everything must be active-active" is a budget decision, not a technical one.
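The table above can be read as a lookup: given the targets the business agrees to, pick the cheapest pattern whose bounds still fit. The following is a minimal sketch of that reading, not a sizing tool. The `Tier` class and `cheapest_pattern` function are hypothetical names, the thresholds are copied from the table, and "near zero" RPO is modelled as 0 minutes, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    pattern: str
    max_rto_min: float  # RTO bound of the pattern, in minutes
    max_rpo_min: float  # RPO bound of the pattern, in minutes
    premium_pct: tuple  # (low, high) cost premium over single-region, in percent


# Rows copied from the table above, cheapest pattern first.
# "Near zero" RPO is modelled as 0 minutes (assumption of this sketch).
TIERS = [
    Tier("Backup/restore", 24 * 60, 24 * 60, (5, 10)),
    Tier("Pilot light", 4 * 60, 60, (10, 20)),
    Tier("Active-passive (cold)", 60, 15, (20, 35)),
    Tier("Active-passive (warm)", 15, 5, (40, 60)),
    Tier("Active-active", 1, 0, (80, 100)),
]


def cheapest_pattern(rto_target_min, rpo_target_min):
    """Return the cheapest tier whose RTO and RPO bounds fit inside the targets."""
    for tier in TIERS:
        if tier.max_rto_min <= rto_target_min and tier.max_rpo_min <= rpo_target_min:
            return tier
    raise ValueError("No pattern in the table meets these targets")


choice = cheapest_pattern(rto_target_min=30, rpo_target_min=10)
print(choice.pattern, choice.premium_pct)
```

Relaxing a target from 30 minutes to 4 hours moves the answer from warm standby to pilot light, which is the budget conversation the table is meant to start.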
Pattern 1: Active-Active with Azure Front Door
This is the gold standard: both regions serve production traffic, and users are routed to the closest healthy endpoint. If one region fails, the other absorbs all traffic with minimal disruption.
Architecture
Bicep template: Front Door with multi-region backends
resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'fd-dr-production'
  location: 'global'
  sku: {
    name: 'Premium_AzureFrontDoor'
  }
}
resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoor
  name: 'app-endpoint'
  location: 'global'
  properties: {
    enabledState: 'Enabled'
  }
}
resource originGroupApp 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: frontDoor
  name: 'app-origins'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/health'
      probeRequestType: 'GET'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 10
    }
    sessionAffinityState: 'Disabled'
  }
}
resource originWestEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'west-europe'
  properties: {
    hostName: appServiceWestEurope.properties.defaultHostName
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}
resource originNorthEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'north-europe'
  properties: {
    hostName: appServiceNorthEurope.properties.defaultHostName
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}
Data layer for active-active
The hard part of active-active is the data layer. Options:
Cosmos DB with multi-region writes: The simplest path to active-active data. Cosmos DB natively supports multi-region writes with configurable consistency levels. Session consistency is the sweet spot for most applications — it guarantees read-your-own-writes within a session while allowing eventual consistency across regions.
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
  name: 'cosmos-dr-production'
  location: 'westeurope'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'
    }
    locations: [
      { locationName: 'westeurope', failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'northeurope', failoverPriority: 1, isZoneRedundant: true }
    ]
  }
}
Azure SQL with active geo-replication: SQL does not support multi-region writes natively. The secondary is read-only. For true active-active with SQL, you need application-level write routing or acceptance of a primary region for writes with read replicas in secondary regions.
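Application-level write routing can be as small as choosing a hostname per operation. The sketch below assumes the databases sit in a failover group (such as the `fog-production` group used later in this guide) and relies on the two listener names a failover group exposes: `<name>.database.windows.net` for read-write and `<name>.secondary.database.windows.net` for read-only. The function name and the `needs_fresh_read` flag are hypothetical.

```python
def sql_endpoint(failover_group: str, is_write: bool, needs_fresh_read: bool = False) -> str:
    """Pick the failover group listener for one database operation.

    Writes, and reads that must see the latest write, go to the read-write
    listener, which follows whichever server is currently primary.
    Other reads go to the read-only listener on the geo-secondary and may
    lag behind the primary by the replication delay.
    """
    if is_write or needs_fresh_read:
        return f"{failover_group}.database.windows.net"
    return f"{failover_group}.secondary.database.windows.net"


print(sql_endpoint("fog-production", is_write=True))
print(sql_endpoint("fog-production", is_write=False))
```

Connections to the read-only listener are expected to declare read intent (`ApplicationIntent=ReadOnly`) in the connection string. Because both hostnames follow the failover group, the routing logic does not need to change after a failover.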
RTO/RPO achieved
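- RTO: Under 30 seconds. This is the time Front Door needs to mark the failed origin unhealthy (a few 10-second health probe intervals). Traffic is then rerouted at the Front Door edge, so no DNS change has to propagate.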
- RPO: Near zero for Cosmos DB multi-region writes. Up to 5 seconds for SQL geo-replication.
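To see where a sub-30-second figure can come from, here is a rough model of the probe settings in the origin group above (`sampleSize: 4`, `successfulSamplesRequired: 3`, `probeIntervalInSeconds: 10`). It assumes a single prober and that every probe fails once the outage starts. Front Door actually probes from many edge locations, so treat the result as an order-of-magnitude estimate, not a guarantee. The function name is hypothetical.

```python
def probe_detection_window(sample_size, successful_required, interval_s):
    """Return (best_case_s, worst_case_s) until an origin is marked unhealthy.

    Simplified model: the origin is unhealthy once fewer than
    `successful_required` of the last `sample_size` probes succeeded.
    """
    failures_needed = sample_size - successful_required + 1
    # Worst case: the outage starts just after a successful probe.
    worst_case = failures_needed * interval_s
    # Best case: the outage starts just before the next probe fires.
    best_case = (failures_needed - 1) * interval_s
    return best_case, worst_case


print(probe_detection_window(4, 3, 10))  # prints (10, 20)
```

Under this model, two consecutive failed probes are enough to drop an origin, which is also why a flaky `/health` endpoint can cause unnecessary failovers.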
Cost reality
Active-active doubles your compute costs and approximately doubles your database costs (Cosmos DB multi-region writes consume RUs in each region). For a typical enterprise application costing EUR 15,000/month in a single region, expect EUR 28,000-32,000/month for active-active.
Pattern 2: Active-Passive with Warm Standby
The secondary region is fully provisioned but does not serve production traffic. Failover is automated but involves promoting replicas and switching traffic.
Architecture
Failover automation
# Automated failover runbook
param(
  # The switch must run against the SECONDARY server, which becomes the new primary.
  # "sql-prod-northeurope" is an assumed name; use your secondary server and its resource group.
  [string]$ResourceGroupName = "rg-production",
  [string]$SqlServerName = "sql-prod-northeurope",
  [string]$FailoverGroupName = "fog-production",
  [string]$AksClusterName = "aks-prod-northeurope",
  [string]$AksResourceGroup = "rg-production-northeurope"
)
# Step 1: Failover SQL
# If the primary region is unreachable, a planned failover cannot complete.
# Add -AllowDataLoss to force it, accepting the loss of unreplicated transactions.
Write-Output "Initiating SQL failover group switch..."
Switch-AzSqlDatabaseFailoverGroup `
  -ResourceGroupName $ResourceGroupName `
  -ServerName $SqlServerName `
  -FailoverGroupName $FailoverGroupName
# Step 2: Scale up AKS in secondary region
Write-Output "Scaling AKS in secondary region..."
az aks nodepool scale --resource-group $AksResourceGroup `
  --cluster-name $AksClusterName `
  --name systempool --node-count 5
# Step 3: Front Door health probes handle traffic routing automatically
Write-Output "Failover complete. Secondary region is now serving traffic."
RTO/RPO achieved
- RTO: 15-30 minutes (SQL failover: 5 min, AKS scale-up: 5-15 min, DNS propagation: 1-5 min)
- RPO: Under 5 minutes (SQL geo-replication lag)
Cost reality
The secondary region runs at minimum scale. Expect 40-60 % cost premium over single-region. For the EUR 15,000/month example: EUR 21,000-24,000/month.
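The monthly figures in this guide are the base cost plus a percentage premium. A small helper makes it easy to rerun the arithmetic with your own numbers. The function name is hypothetical, and the EUR 15,000 base and 40-60 % premium are the example values from the text.

```python
def monthly_cost_range(base_eur, premium_low_pct, premium_high_pct):
    """Return (low, high) monthly cost in EUR for a base cost and premium range."""
    low = base_eur * (100 + premium_low_pct) / 100
    high = base_eur * (100 + premium_high_pct) / 100
    return low, high


# Warm standby: 40-60 % premium on a EUR 15,000/month single-region baseline.
print(monthly_cost_range(15_000, 40, 60))  # prints (21000.0, 24000.0)
```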
Pattern 3: Pilot Light
Only the data layer is replicated. Compute infrastructure is defined in IaC but not deployed until needed.
Architecture
Deployment pipeline for pilot light activation
# Azure DevOps pipeline - activate DR region
trigger: none # Manual or alert-triggered
pool:
  vmImage: 'ubuntu-latest'
stages:
  - stage: ActivateDR
    displayName: 'Activate DR Region'
    jobs:
      - job: DeployInfrastructure
        steps:
          - task: AzureCLI@2
            displayName: 'Deploy AKS in DR region'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az deployment group create \
                  --resource-group rg-production-northeurope \
                  --template-file ./bicep/aks-dr.bicep \
                  --parameters environment=dr nodeCount=5
          - task: AzureCLI@2
            displayName: 'Deploy application workloads'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az aks get-credentials --resource-group rg-production-northeurope \
                  --name aks-dr-northeurope
                kubectl apply -k ./k8s/overlays/dr/
          - task: AzureCLI@2
            displayName: 'Failover SQL'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # set-primary must target the secondary server, which becomes the new primary.
                # "sql-prod-northeurope" is an assumed name; use your secondary server and its resource group.
                # Add --allow-data-loss if the primary region is unreachable.
                az sql failover-group set-primary \
                  --resource-group rg-production \
                  --server sql-prod-northeurope \
                  --name fog-production
RTO/RPO achieved
- RTO: 1-4 hours (infrastructure deployment: 30-90 min, application deployment: 15-30 min, validation: 15-30 min)
- RPO: Under 5 minutes (SQL geo-replication is always running)
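Adding up the step estimates is a quick sanity check on the RTO claim. The sketch below assumes the three steps run strictly one after another and uses the per-step ranges quoted above. The dictionary and function names are hypothetical.

```python
# Per-step (best, worst) estimates in minutes, taken from the RTO breakdown above.
PILOT_LIGHT_STEPS_MIN = {
    "infrastructure deployment": (30, 90),
    "application deployment": (15, 30),
    "validation": (15, 30),
}


def rto_window(steps):
    """Sum best-case and worst-case durations of sequential recovery steps."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


best, worst = rto_window(PILOT_LIGHT_STEPS_MIN)
print(f"{best}-{worst} minutes")  # prints 60-150 minutes
```

The listed steps account for 1 to 2.5 hours. Anything between that and the 4-hour ceiling has to cover time that is not itemised here, such as detecting the outage and deciding to activate the DR region, so it is worth measuring those separately in a drill.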
Cost reality
Only data replication costs during normal operation. 10-20 % premium. For the EUR 15,000/month example: EUR 16,500-18,000/month.
Service-Specific DR Patterns
Azure Kubernetes Service (AKS) Multi-Cluster
// AKS multi-region with zone redundancy
// dnsPrefix and identity are added so the clusters can deploy; their values are illustrative assumptions.
resource aksWestEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-westeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 3
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}
resource aksNorthEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-northeurope'
  location: 'northeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-northeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 2 // Warm standby
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}
Azure SQL Geo-Replication with Failover Groups
resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2023-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'fog-production'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      // Automatic failover waits out this grace period before accepting data loss.
      // Azure documents one hour as the minimum; use the manual runbook for faster recovery.
      failoverWithDataLossGracePeriodMinutes: 60
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Enabled'
    }
    partnerServers: [
      { id: sqlServerSecondary.id }
    ]
    databases: [sqlDatabase.id]
  }
}
Azure Site Recovery for VMs
For legacy workloads running on VMs that cannot be containerised, Azure Site Recovery (ASR) provides continuous replication with 15-minute RPO and approximately 2-hour RTO including boot time and recovery plan execution. It is not the fastest option, but it requires zero application changes and works for any VM workload.
DR Pattern Selection Decision Flow
Testing Your DR Plan
A DR plan that is not tested is not a plan — it is a hope.
Monthly: Automated health checks
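#!/bin/bash
# Exit with a non-zero status if any check fails, so a scheduler can alert on it
set -euo pipefail
echo "=== DR Readiness Report ==="
# Check SQL failover group replication state (a state value, not a lag measurement)
az sql failover-group show \
  --resource-group rg-production \
  --server sql-prod-westeurope \
  --name fog-production \
  --query "replicationState"
# Verify Bicep templates compile
az bicep build --file ./bicep/dr-region.bicep
# Check storage availability in the primary and secondary regions
az storage account show \
  --resource-group rg-production \
  --name stprodwesteurope \
  --query "{primary: statusOfPrimary, secondary: statusOfSecondary}"
Quarterly: Simulated failover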
Execute a full failover to the secondary region during a maintenance window. Route a percentage of traffic to the DR region and validate application functionality. Measure actual RTO against targets.
Annually: Full chaos exercise
Simulate a complete primary region failure without warning the operations team (inform management only). Measure detection time, escalation time, failover time, and data integrity.
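Measuring a drill comes down to recording four timestamps and subtracting them. The sketch below shows one way to turn those into the phase durations named above and compare the total with the RTO target. The event names, the timestamps, and the 30-minute target are hypothetical, and data integrity still has to be verified separately.

```python
from datetime import datetime


def drill_durations(events):
    """Turn ISO-8601 drill timestamps into phase durations in minutes."""
    t = {name: datetime.fromisoformat(stamp) for name, stamp in events.items()}

    def minutes(start, end):
        return (t[end] - t[start]).total_seconds() / 60

    return {
        "detection_min": minutes("outage_start", "detected"),
        "escalation_min": minutes("detected", "escalated"),
        "failover_min": minutes("escalated", "failover_complete"),
        "actual_rto_min": minutes("outage_start", "failover_complete"),
    }


# Hypothetical timestamps from a drill log.
result = drill_durations({
    "outage_start": "2024-03-01T02:00:00",
    "detected": "2024-03-01T02:04:00",
    "escalated": "2024-03-01T02:10:00",
    "failover_complete": "2024-03-01T02:31:00",
})
rto_target_min = 30  # assumed target for this example
print(result, "meets target:", result["actual_rto_min"] <= rto_target_min)
```

In this example the failover itself takes 21 minutes, but detection and escalation add another 10 and push the drill past the target. That is the kind of gap a failover-only test never reveals.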
Common Mistakes
Mistake 1: Testing failover but never testing failback. Failing over is half the problem. Failing back to the primary region after it recovers is often harder and less tested.
Mistake 2: Forgetting stateful components. DNS records, SSL certificates, configuration stored in App Configuration or Key Vault, secrets — all need to be available in both regions.
Mistake 3: Not accounting for capacity. Your DR region needs enough capacity to run production workloads. If you are using a warm standby, ensure the secondary region has quota for the VMs you need.
Mistake 4: Ignoring cross-region latency. If your application has synchronous calls between services, and some services fail over while others do not, you may introduce cross-region latency that breaks timeout assumptions.
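The capacity check in Mistake 3 is simple arithmetic once you know the quota numbers. The sketch below uses the warm-standby example from this guide (scaling from 2 to 5 `Standard_D4s_v5` nodes, 4 vCPUs each). The quota limit and current usage are hypothetical; real values can be read with `az vm list-usage --location northeurope`.

```python
def quota_shortfall(quota_limit, current_usage, additional_needed):
    """vCPUs missing for the failover scale-up; 0 means the region can absorb it."""
    return max(0, current_usage + additional_needed - quota_limit)


VCPUS_PER_NODE = 4               # Standard_D4s_v5
standby_nodes, target_nodes = 2, 5
additional = (target_nodes - standby_nodes) * VCPUS_PER_NODE  # 12 vCPUs

# Hypothetical quota: limit of 16 vCPUs for the VM family, 8 already in use.
print(quota_shortfall(quota_limit=16, current_usage=8, additional_needed=additional))  # prints 4
```

A non-zero result means the scale-up step of the failover runbook would fail until a quota increase is approved, so raise the request before the outage rather than during it.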
Choosing the Right Pattern
| Factor | Active-Active | Active-Passive (Warm) | Pilot Light |
|---|---|---|---|
| RTO | < 1 min | 15-30 min | 1-4 hours |
| RPO | Near zero | < 5 min | < 5 min |
| Cost premium | 80-100 % | 40-60 % | 10-20 % |
| Operational complexity | High | Medium | Low (until activation) |
| Best for | Revenue-critical apps | Core business apps | Internal, non-critical apps |
The right pattern depends on the cost of downtime, not the cost of infrastructure. If one hour of downtime costs EUR 500,000, active-active at EUR 15,000/month premium is trivially justified.
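The justification can be made concrete with a break-even calculation: how much downtime per year does the DR premium need to prevent before it pays for itself? The sketch uses the two figures from the paragraph above, and the function name is hypothetical.

```python
def break_even_hours_per_year(downtime_cost_per_hour_eur, monthly_premium_eur):
    """Hours of avoided downtime per year at which the DR premium pays for itself."""
    return (monthly_premium_eur * 12) / downtime_cost_per_hour_eur


hours = break_even_hours_per_year(500_000, 15_000)
print(f"{hours * 60:.1f} minutes per year")  # prints 21.6 minutes per year
```

At these numbers, avoiding a single 22-minute outage per year covers the premium. Running the same calculation for an internal portal with a low hourly downtime cost shows why pilot light is usually enough there.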
Next Steps
Disaster recovery architecture is one of those areas where the gap between "what we planned" and "what actually works" is widest. The patterns above are proven, but they need to be adapted to your specific application architecture, compliance requirements, and budget constraints.
If you need help designing or testing a multi-region DR architecture on Azure, contact us at mbrahim@conceptualise.de. We specialise in architectures that work under pressure, not just in presentations.