Multi-Region Azure Architecture: Disaster Recovery Patterns That Actually Work

Practical multi-region disaster recovery patterns for Azure with Bicep templates, RTO/RPO targets, and real cost analysis for active-active, active-passive, and pilot light architectures.

Disaster recovery slides look great in executive presentations. Multi-region architecture diagrams with arrows flowing between paired regions are reassuring. Then a region actually fails, and organisations discover that their DR plan was never tested, their failover does not work, and their RTO target of "15 minutes" is actually four hours of panicked manual intervention.

This guide covers DR patterns that actually work in production. We have deployed and tested each of these patterns with enterprise clients. We will be specific about what each pattern delivers, what it costs, and where it breaks down.

Understanding RTO and RPO in Practice

Recovery Time Objective (RTO): How long your application can be down. This is not a technical metric — it is a business decision. An e-commerce site losing EUR 50,000 per hour of downtime has a different RTO budget than an internal HR portal.

Recovery Point Objective (RPO): How much data loss is acceptable. RPO near zero means no transactions can be lost. RPO of one hour means you accept losing up to 60 minutes of data.

The uncomfortable truth about RTO/RPO

Most enterprises set RTO/RPO targets without understanding the cost implications:

| RTO Target | RPO Target | Pattern Required | Approximate Cost Premium |
|---|---|---|---|
| < 1 minute | Near zero | Active-active | 80-100 % of base cost |
| < 15 minutes | < 5 minutes | Active-passive (warm) | 40-60 % of base cost |
| < 1 hour | < 15 minutes | Active-passive (cold) | 20-35 % of base cost |
| < 4 hours | < 1 hour | Pilot light | 10-20 % of base cost |
| < 24 hours | < 24 hours | Backup/restore | 5-10 % of base cost |

Negotiate RTO/RPO with the business before designing the architecture. "Everything must be active-active" is a budget decision, not a technical one.
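
To make the budget conversation concrete, the premiums in the table can be turned into a monthly cost range. The following is a minimal sketch, not a pricing tool: the function name `monthly_cost_range` and the EUR 10,000 base cost are hypothetical, and only the premium percentages come from the table above.

```python
# Cost premiums per pattern, taken from the table above (fraction of base cost).
PREMIUMS = {
    "active-active": (0.80, 1.00),
    "active-passive (warm)": (0.40, 0.60),
    "active-passive (cold)": (0.20, 0.35),
    "pilot light": (0.10, 0.20),
    "backup/restore": (0.05, 0.10),
}


def monthly_cost_range(base_cost: float, pattern: str) -> tuple[float, float]:
    """Return (low, high) total monthly cost including the DR premium."""
    low, high = PREMIUMS[pattern]
    return (round(base_cost * (1 + low), 2), round(base_cost * (1 + high), 2))


if __name__ == "__main__":
    base = 10_000  # hypothetical single-region cost in EUR/month
    for pattern in PREMIUMS:
        low, high = monthly_cost_range(base, pattern)
        print(f"{pattern:<24} EUR {low:>9,.0f} - {high:>9,.0f}")
```

With the hypothetical EUR 10,000 base, pilot light lands at EUR 11,000-12,000 per month and active-active at EUR 18,000-20,000, which is usually the number that reopens the RTO discussion.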

Pattern 1: Active-Active with Azure Front Door

This is the gold standard: both regions serve production traffic, and users are routed to the closest healthy endpoint. If one region fails, the other absorbs all traffic with minimal disruption.

Architecture

Bicep template: Front Door with multi-region backends

Bicep
resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'fd-dr-production'
  location: 'global'
  sku: {
    name: 'Premium_AzureFrontDoor'
  }
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoor
  name: 'app-endpoint'
  location: 'global'
  properties: {
    enabledState: 'Enabled'
  }
}

resource originGroupApp 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: frontDoor
  name: 'app-origins'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/health'
      probeRequestType: 'GET'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 10
    }
    sessionAffinityState: 'Disabled'
  }
}

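// appServiceWestEurope and appServiceNorthEurope are assumed to be declared elsewhere in the template.
resource originWestEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'west-europe'
  properties: {
    hostName: appServiceWestEurope.properties.defaultHostName
    // App Service routes by host header, so forward the app's own hostname
    originHostHeader: appServiceWestEurope.properties.defaultHostName
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}

resource originNorthEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'north-europe'
  properties: {
    hostName: appServiceNorthEurope.properties.defaultHostName
    // Same priority and weight as West Europe, so both regions serve traffic
    originHostHeader: appServiceNorthEurope.properties.defaultHostName
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}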

Data layer for active-active

The hard part of active-active is the data layer. Options:

Cosmos DB with multi-region writes: The simplest path to active-active data. Cosmos DB natively supports multi-region writes with configurable consistency levels. Session consistency is the sweet spot for most applications — it guarantees read-your-own-writes within a session while allowing eventual consistency across regions.
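
The read-your-own-writes guarantee is easier to reason about with a toy model. The sketch below is not how Cosmos DB is implemented and does not use the Cosmos SDK; `Replica` and `SessionClient` are hypothetical classes that only illustrate the idea of a session token, where a read is served only by a replica that has caught up to the session's last write.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A regional replica that applies writes in log-sequence order."""
    name: str
    lsn: int = 0  # highest log sequence number applied so far
    data: dict = field(default_factory=dict)

    def apply(self, lsn: int, key: str, value: str) -> None:
        self.data[key] = value
        self.lsn = lsn


class SessionClient:
    """Tracks the highest LSN this session has written (its session token)."""

    def __init__(self) -> None:
        self.session_token = 0

    def write(self, replica: Replica, key: str, value: str) -> None:
        replica.apply(replica.lsn + 1, key, value)
        self.session_token = replica.lsn

    def read(self, replicas: list[Replica], key: str):
        # Serve the read from the first replica that has caught up to the token.
        for replica in replicas:
            if replica.lsn >= self.session_token:
                return replica.data.get(key)
        raise RuntimeError("no replica has caught up to this session yet")


if __name__ == "__main__":
    west, north = Replica("westeurope"), Replica("northeurope")
    client = SessionClient()
    client.write(west, "order-1", "paid")
    # northeurope has not replicated the write yet, so the read falls through to westeurope.
    print(client.read([north, west], "order-1"))
```

A different session without that token could still read the older state from the lagging replica, which is the eventual consistency across regions that Session consistency allows.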

Bicep
resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
  name: 'cosmos-dr-production'
  location: 'westeurope'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'
    }
    locations: [
      { locationName: 'westeurope', failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'northeurope', failoverPriority: 1, isZoneRedundant: true }
    ]
  }
}

Azure SQL with active geo-replication: SQL does not support multi-region writes natively. The secondary is read-only. For true active-active with SQL, you need application-level write routing or acceptance of a primary region for writes with read replicas in secondary regions.
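
Application-level write routing can be as simple as choosing a different endpoint per intent. The sketch below assumes the `fog-production` failover group defined later in this guide and relies on the failover group listener names (`<group>.database.windows.net` for read-write, `<group>.secondary.database.windows.net` for read-only); the database name `appdb` and both function names are hypothetical.

```python
FAILOVER_GROUP = "fog-production"  # failover group name used in this guide


def sql_server_host(read_only: bool) -> str:
    """Pick the failover group listener for the given intent."""
    if read_only:
        # Read-only listener: resolves to the current secondary.
        return f"{FAILOVER_GROUP}.secondary.database.windows.net"
    # Read-write listener: follows the current primary after a failover.
    return f"{FAILOVER_GROUP}.database.windows.net"


def connection_string(database: str, read_only: bool = False) -> str:
    """Build a connection string fragment; credentials are deliberately omitted."""
    intent = "ReadOnly" if read_only else "ReadWrite"
    return (
        f"Server=tcp:{sql_server_host(read_only)},1433;"
        f"Database={database};ApplicationIntent={intent};Encrypt=True;"
    )


if __name__ == "__main__":
    print(connection_string("appdb"))                  # writes go to the primary
    print(connection_string("appdb", read_only=True))  # reads go to the secondary
```

Because the application connects to the listener rather than a specific server, the connection strings do not change when the failover group switches roles.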

RTO/RPO achieved

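  • RTO: Under 30 seconds, driven by Front Door health probe detection (10-second probe interval). Front Door is an anycast service, so origin failover does not wait for DNS propagation.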
  • RPO: Near zero for Cosmos DB multi-region writes. Up to 5 seconds for SQL geo-replication.
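
The detection part of that RTO can be estimated from the origin group's probe settings. This is a deliberately simplified model, assuming a single prober and an origin that was fully healthy before the failure; in reality probes arrive from many Front Door edge locations, so treat the result as a rough upper-bound sketch rather than a guarantee. The function names are hypothetical.

```python
def probes_to_mark_unhealthy(sample_size: int, successful_samples_required: int) -> int:
    """Failed probes needed before fewer than the required samples succeed."""
    return sample_size - successful_samples_required + 1


def estimated_detection_seconds(
    sample_size: int, successful_samples_required: int, probe_interval_seconds: int
) -> int:
    """Rough time until an origin that starts failing is taken out of rotation."""
    failed = probes_to_mark_unhealthy(sample_size, successful_samples_required)
    return failed * probe_interval_seconds


if __name__ == "__main__":
    # Values from the origin group above: sampleSize 4, successfulSamplesRequired 3, interval 10 s.
    print(estimated_detection_seconds(4, 3, 10))  # 20
```

With the settings from the template, two failed probes are enough to drop below three successes out of four, which gives roughly 20 seconds in this model.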

Cost reality

Active-active doubles your compute costs and approximately doubles your database costs (Cosmos DB multi-region writes consume RUs in each region). For a typical enterprise application costing EUR 15,000/month in a single region, expect EUR 27,000-30,000/month for active-active, in line with the 80-100 % premium above.

Pattern 2: Active-Passive with Warm Standby

The secondary region is fully provisioned but does not serve production traffic. Failover is automated but involves promoting replicas and switching traffic.

Architecture

Failover automation

PowerShell
# Automated failover runbook
# Assumes the Az.Sql module and the Azure CLI are installed and already authenticated.
param(
    [string]$ResourceGroupName = "rg-production",
    # The failover cmdlet must target the SECONDARY server, which becomes the new primary.
    # "sql-prod-northeurope" is an assumed name for that server.
    [string]$SqlServerName = "sql-prod-northeurope",
    [string]$FailoverGroupName = "fog-production",
    [string]$AksClusterName = "aks-prod-northeurope",
    [string]$AksResourceGroup = "rg-production-northeurope"
)

$ErrorActionPreference = "Stop"

# Step 1: Failover SQL
# Add -AllowDataLoss to force the failover when the primary region is unreachable.
Write-Output "Initiating SQL failover group switch..."
Switch-AzSqlDatabaseFailoverGroup `
    -ResourceGroupName $ResourceGroupName `
    -ServerName $SqlServerName `
    -FailoverGroupName $FailoverGroupName

# Step 2: Scale up AKS in secondary region
Write-Output "Scaling AKS in secondary region..."
az aks nodepool scale --resource-group $AksResourceGroup `
    --cluster-name $AksClusterName `
    --name systempool --node-count 5
if ($LASTEXITCODE -ne 0) { throw "AKS scale-up failed" }

# Step 3: Front Door health probes handle traffic routing automatically
Write-Output "Failover complete. Secondary region is now serving traffic."

RTO/RPO achieved

  • RTO: 15-30 minutes (SQL failover: 5 min, AKS scale-up: 5-15 min, DNS propagation: 1-5 min)
  • RPO: Under 5 minutes (SQL geo-replication lag)

Cost reality

The secondary region runs at minimum scale. Expect 40-60 % cost premium over single-region. For the EUR 15,000/month example: EUR 21,000-24,000/month.

Pattern 3: Pilot Light

Only the data layer is replicated. Compute infrastructure is defined in IaC but not deployed until needed.

Architecture

Deployment pipeline for pilot light activation

YAML
# Azure DevOps pipeline - activate DR region
trigger: none  # Manual or alert-triggered

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: ActivateDR
    displayName: 'Activate DR Region'
    jobs:
      - job: DeployInfrastructure
        steps:
          - task: AzureCLI@2
            displayName: 'Deploy AKS in DR region'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az deployment group create \
                  --resource-group rg-production-northeurope \
                  --template-file ./bicep/aks-dr.bicep \
                  --parameters environment=dr nodeCount=5

          - task: AzureCLI@2
            displayName: 'Deploy application workloads'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az aks get-credentials --resource-group rg-production-northeurope \
                  --name aks-dr-northeurope
                kubectl apply -k ./k8s/overlays/dr/

          - task: AzureCLI@2
            displayName: 'Failover SQL'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
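              inlineScript: |
                # Run against the secondary server, which becomes the new primary.
                # sql-prod-northeurope is an assumed name for that server.
                # Add --allow-data-loss if the primary region is unreachable.
                az sql failover-group set-primary \
                  --resource-group rg-production \
                  --server sql-prod-northeurope \
                  --name fog-production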

RTO/RPO achieved

  • RTO: 1-4 hours (infrastructure deployment: 30-90 min, application deployment: 15-30 min, validation: 15-30 min)
  • RPO: Under 5 minutes (SQL geo-replication is always running)
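
The RTO range above is a sum of step durations, and it is worth checking that sum against the agreed target before an incident rather than during one. A minimal sketch, assuming the steps run strictly one after another; the step durations are the ones listed above, while the function names and the 240-minute target (the 4-hour upper bound) are illustrative.

```python
# (best case, worst case) in minutes for each activation step listed above.
STEPS_MINUTES = {
    "infrastructure deployment": (30, 90),
    "application deployment": (15, 30),
    "validation": (15, 30),
}


def rto_range(steps: dict) -> tuple[int, int]:
    """Sum best-case and worst-case durations, assuming sequential steps."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


def within_target(steps: dict, target_minutes: int) -> bool:
    """True if even the worst case fits inside the RTO target."""
    return rto_range(steps)[1] <= target_minutes


if __name__ == "__main__":
    best, worst = rto_range(STEPS_MINUTES)
    print(f"Estimated RTO: {best}-{worst} minutes")
    print("Fits 4-hour target:", within_target(STEPS_MINUTES, 240))
```

The listed steps add up to 60-150 minutes, so the remaining headroom up to four hours is what absorbs detection, decision-making, and anything that fails on the first attempt.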

Cost reality

Only data replication costs during normal operation. 10-20 % premium. For the EUR 15,000/month example: EUR 16,500-18,000/month.

Service-Specific DR Patterns

Azure Kubernetes Service (AKS) Multi-Cluster

Bicep
// AKS multi-region with zone redundancy
// The identity and dnsPrefix values below are assumptions added to make the template deployable.
resource aksWestEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-westeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 3
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}

resource aksNorthEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-northeurope'
  location: 'northeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-northeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 2  // Warm standby
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}

Azure SQL Geo-Replication with Failover Groups

Bicep
resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2023-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'fog-production'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic'
      failoverWithDataLossGracePeriodMinutes: 60  // one hour is the minimum grace period for automatic failover
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Enabled'
    }
    partnerServers: [
      { id: sqlServerSecondary.id }
    ]
    databases: [sqlDatabase.id]
  }
}

Azure Site Recovery for VMs

For legacy workloads running on VMs that cannot be containerised, Azure Site Recovery (ASR) provides continuous replication with 15-minute RPO and approximately 2-hour RTO including boot time and recovery plan execution. It is not the fastest option, but it requires zero application changes and works for any VM workload.

DR Pattern Selection Decision Flow
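
The selection logic can be approximated from the RTO/RPO table earlier in this guide: start with the cheapest pattern and move up only when it cannot meet the targets. This is a minimal sketch under that assumption. The thresholds come from the table, with "near zero" treated as 0 and the table's strict "<" bounds simplified to "<="; the function name `select_pattern` is hypothetical.

```python
# (pattern, achievable RTO in minutes, achievable RPO in minutes), cheapest first.
PATTERNS = [
    ("Backup/restore", 24 * 60, 24 * 60),
    ("Pilot light", 4 * 60, 60),
    ("Active-passive (cold)", 60, 15),
    ("Active-passive (warm)", 15, 5),
    ("Active-active", 1, 0),
]


def select_pattern(rto_target_minutes: float, rpo_target_minutes: float) -> str:
    """Return the cheapest pattern whose RTO and RPO fit inside the targets."""
    for name, rto, rpo in PATTERNS:
        if rto <= rto_target_minutes and rpo <= rpo_target_minutes:
            return name
    # Nothing fits: fall back to the strictest pattern and revisit the targets.
    return "Active-active"


if __name__ == "__main__":
    print(select_pattern(240, 60))  # Pilot light
    print(select_pattern(30, 10))   # Active-passive (warm)
    print(select_pattern(1, 0))     # Active-active
```

Walking the list from cheapest to most expensive keeps the default answer modest, so any move towards active-active has to be justified by a target the cheaper patterns cannot meet.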

Testing Your DR Plan

A DR plan that is not tested is not a plan — it is a hope.

Monthly: Automated health checks

Bash
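#!/bin/bash
set -euo pipefail
echo "=== DR Readiness Report ==="

# Check SQL failover group replication state and the current role of this server
az sql failover-group show \
  --resource-group rg-production \
  --server sql-prod-westeurope \
  --name fog-production \
  --query "{state: replicationState, role: replicationRole}"

# Verify Bicep templates compile
az bicep build --file ./bicep/dr-region.bicep

# Check storage primary status and the last geo-replication sync time
az storage account show \
  --resource-group rg-production \
  --name stprodwesteurope \
  --expand geoReplicationStats \
  --query "{primary: statusOfPrimary, lastSync: geoReplicationStats.lastSyncTime}"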

Quarterly: Simulated failover

Execute a full failover to the secondary region during a maintenance window. Route a percentage of traffic to the DR region and validate application functionality. Measure actual RTO against targets.

Annually: Full chaos exercise

Simulate a complete primary region failure without warning the operations team (inform management only). Measure detection time, escalation time, failover time, and data integrity.
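
The four measurements fall out of four timestamps recorded during the exercise. A minimal sketch, assuming the timestamps are captured by hand or exported from your incident tooling; the function name `drill_metrics` and the example times are hypothetical.

```python
from datetime import datetime


def drill_metrics(
    failure: datetime, detected: datetime, escalated: datetime, recovered: datetime
) -> dict:
    """Return each phase of a DR drill in minutes, plus the end-to-end RTO."""

    def minutes(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 60

    return {
        "detection": minutes(failure, detected),
        "escalation": minutes(detected, escalated),
        "failover": minutes(escalated, recovered),
        "actual_rto": minutes(failure, recovered),
    }


if __name__ == "__main__":
    # Hypothetical drill timeline.
    metrics = drill_metrics(
        failure=datetime(2024, 3, 1, 9, 0),
        detected=datetime(2024, 3, 1, 9, 4),
        escalated=datetime(2024, 3, 1, 9, 10),
        recovered=datetime(2024, 3, 1, 9, 31),
    )
    for phase, value in metrics.items():
        print(f"{phase:<12} {value:.0f} min")
```

Splitting the total this way shows whether the time went into noticing the problem, deciding to act, or the failover itself, which are three different things to fix.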

Common Mistakes

Mistake 1: Testing failover but never testing failback. Failing over is half the problem. Failing back to the primary region after it recovers is often harder and less tested.

Mistake 2: Forgetting stateful components. DNS records, SSL certificates, configuration stored in App Configuration or Key Vault, secrets — all need to be available in both regions.

Mistake 3: Not accounting for capacity. Your DR region needs enough capacity to run production workloads. If you are using a warm standby, ensure the secondary region has quota for the VMs you need.

Mistake 4: Ignoring cross-region latency. If your application has synchronous calls between services, and some services fail over while others do not, you may introduce cross-region latency that breaks timeout assumptions.

Choosing the Right Pattern

| Factor | Active-Active | Active-Passive (Warm) | Pilot Light |
|---|---|---|---|
| RTO | < 1 min | 15-30 min | 1-4 hours |
| RPO | Near zero | < 5 min | < 5 min |
| Cost premium | 80-100 % | 40-60 % | 10-20 % |
| Operational complexity | High | Medium | Low (until activation) |
| Best for | Revenue-critical apps | Core business apps | Internal, non-critical apps |

The right pattern depends on the cost of downtime, not the cost of infrastructure. If one hour of downtime costs EUR 500,000, active-active at EUR 15,000/month premium is trivially justified.
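
The justification is a one-line break-even calculation: how many hours of avoided downtime per year pay for the premium. A minimal sketch using the figures from the paragraph above; the function name is hypothetical.

```python
def break_even_hours_per_year(monthly_premium: float, hourly_downtime_cost: float) -> float:
    """Hours of avoided downtime per year at which the DR premium pays for itself."""
    return monthly_premium * 12 / hourly_downtime_cost


if __name__ == "__main__":
    hours = break_even_hours_per_year(15_000, 500_000)
    print(f"Break-even: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

At EUR 15,000/month and EUR 500,000 per hour of downtime, the premium is recovered after 0.36 hours, or about 22 minutes, of avoided downtime per year.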

Next Steps

Disaster recovery architecture is one of those areas where the gap between "what we planned" and "what actually works" is widest. The patterns above are proven, but they need to be adapted to your specific application architecture, compliance requirements, and budget constraints.

If you need help designing or testing a multi-region DR architecture on Azure, contact us at mbrahim@conceptualise.de. We specialise in architectures that work under pressure, not just in presentations.

Frequently Asked Questions

What is the difference between active-active and active-passive disaster recovery?

In active-active, both regions serve production traffic simultaneously, providing near-zero RTO but at higher cost. In active-passive, the secondary region is provisioned but does not serve traffic until failover occurs, offering lower cost but with an RTO measured in minutes to hours depending on the architecture.
