What is the difference between active-active and active-passive disaster recovery?

In active-active, both regions serve production traffic simultaneously, providing near-zero RTO but at higher cost. In active-passive, the secondary region is provisioned but does not serve traffic until failover occurs, offering lower cost but with an RTO measured in minutes to hours depending on the architecture.

What RTO and RPO can I realistically achieve on Azure?

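Active-active with Azure Front Door and Cosmos DB multi-region writes can achieve an RTO under 30 seconds and an RPO near zero. Active-passive with a warm standby (SQL failover groups plus a scaled-down AKS cluster) typically delivers a 15-30 minute RTO and an RPO under 5 minutes. Pilot light patterns have a 1-4 hour RTO, depending on how long the compute layer takes to deploy and scale up. Azure Site Recovery for VM workloads lands at roughly a 2-hour RTO with a 15-minute RPO.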

How much does multi-region disaster recovery cost on Azure?

Costs vary dramatically by pattern. Pilot light adds 10-20% to your baseline infrastructure cost. Active-passive with warm standby adds 40-60%. Active-active effectively doubles your compute and database costs but provides the best availability. The right investment depends on your business continuity requirements and the cost of downtime.

Multi-Region Azure Architecture: Disaster Recovery Patterns That Actually Work

Disaster recovery slides look great in executive presentations. Multi-region architecture diagrams with arrows flowing between paired regions are reassuring. Then a region actually fails, and organisations discover that their DR plan was never tested, their failover does not work, and their RTO target of "15 minutes" is actually four hours of panicked manual intervention.

This guide covers DR patterns that actually work in production. We have deployed and tested each of these patterns with enterprise clients. We will be specific about what each pattern delivers, what it costs, and where it breaks down.

Understanding RTO and RPO in Practice

Recovery Time Objective (RTO): How long your application can be down. This is not a technical metric — it is a business decision. An e-commerce site losing EUR 50,000 per hour of downtime has a different RTO budget than an internal HR portal.

Recovery Point Objective (RPO): How much data loss is acceptable. RPO near zero means no transactions can be lost. RPO of one hour means you accept losing up to 60 minutes of data.

The uncomfortable truth about RTO/RPO

Most enterprises set RTO/RPO targets without understanding the cost implications:

RTO Target	RPO Target	Pattern Required	Approximate Cost Premium
< 1 minute	Near zero	Active-active	80-100 % of base cost
< 15 minutes	< 5 minutes	Active-passive (warm)	40-60 % of base cost
< 1 hour	< 15 minutes	Active-passive (cold)	20-35 % of base cost
< 4 hours	< 1 hour	Pilot light	10-20 % of base cost
< 24 hours	< 24 hours	Backup/restore	5-10 % of base cost

Negotiate RTO/RPO with the business before designing the architecture. "Everything must be active-active" is a budget decision, not a technical one.
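
The table can also be read as a lookup: given the targets the business agrees to, pick the cheapest row that still meets them. The sketch below is one way to encode that reading; the function name and the pattern labels are illustrative, and the thresholds are simply the table's upper bounds converted to minutes.

```bash
#!/bin/bash
# Map RTO/RPO targets (both in minutes) to the cheapest pattern in the table
# whose advertised bounds fit inside the targets.
select_dr_pattern() {
  local rto_min=$1 rpo_min=$2
  if [ "$rto_min" -ge 1440 ] && [ "$rpo_min" -ge 1440 ]; then
    echo "backup-restore"
  elif [ "$rto_min" -ge 240 ] && [ "$rpo_min" -ge 60 ]; then
    echo "pilot-light"
  elif [ "$rto_min" -ge 60 ] && [ "$rpo_min" -ge 15 ]; then
    echo "active-passive-cold"
  elif [ "$rto_min" -ge 15 ] && [ "$rpo_min" -ge 5 ]; then
    echo "active-passive-warm"
  else
    echo "active-active"
  fi
}

select_dr_pattern 240 60   # prints pilot-light
select_dr_pattern 240 5    # prints active-passive-warm (the RPO target rules out the cheaper rows)
```

Note how a tight RPO alone can push a workload into a more expensive pattern even when the RTO target is relaxed.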

Pattern 1: Active-Active with Azure Front Door

This is the gold standard: both regions serve production traffic, and users are routed to the closest healthy endpoint. If one region fails, the other absorbs all traffic with minimal disruption.

Architecture

[Diagram: active-active architecture with Azure Front Door in front of both regions]

Bicep template: Front Door with multi-region backends

Bicep

resource frontDoor 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'fd-dr-production'
  location: 'global'
  sku: {
    name: 'Premium_AzureFrontDoor'
  }
}

resource endpoint 'Microsoft.Cdn/profiles/afdEndpoints@2023-05-01' = {
  parent: frontDoor
  name: 'app-endpoint'
  location: 'global'
  properties: {
    enabledState: 'Enabled'
  }
}

resource originGroupApp 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: frontDoor
  name: 'app-origins'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/health'
      probeRequestType: 'GET'
      probeProtocol: 'Https'
      probeIntervalInSeconds: 10
    }
    sessionAffinityState: 'Disabled'
  }
}

resource originWestEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'west-europe'
  properties: {
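    // appServiceWestEurope is assumed to be declared elsewhere in the template
    hostName: appServiceWestEurope.properties.defaultHostName
    // App Service routes requests by host header, so send the app's own hostname
    originHostHeader: appServiceWestEurope.properties.defaultHostName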
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}

resource originNorthEurope 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroupApp
  name: 'north-europe'
  properties: {
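    // appServiceNorthEurope is assumed to be declared elsewhere in the template
    hostName: appServiceNorthEurope.properties.defaultHostName
    // App Service routes requests by host header, so send the app's own hostname
    originHostHeader: appServiceNorthEurope.properties.defaultHostName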
    httpPort: 80
    httpsPort: 443
    priority: 1
    weight: 1000
    enabledState: 'Enabled'
  }
}

// Without a route the endpoint forwards nothing; this one sends all paths to the origin group
resource route 'Microsoft.Cdn/profiles/afdEndpoints/routes@2023-05-01' = {
  parent: endpoint
  name: 'app-route'
  dependsOn: [
    originWestEurope
    originNorthEurope
  ]
  properties: {
    originGroup: {
      id: originGroupApp.id
    }
    supportedProtocols: ['Http', 'Https']
    patternsToMatch: ['/*']
    forwardingProtocol: 'HttpsOnly'
    linkToDefaultDomain: 'Enabled'
    httpsRedirect: 'Enabled'
  }
}

Data layer for active-active

The hard part of active-active is the data layer. Options:

Cosmos DB with multi-region writes: The simplest path to active-active data. Cosmos DB natively supports multi-region writes with configurable consistency levels. Session consistency is the sweet spot for most applications — it guarantees read-your-own-writes within a session while allowing eventual consistency across regions.

Bicep

resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
  name: 'cosmos-dr-production'
  location: 'westeurope'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'
    }
    locations: [
      { locationName: 'westeurope', failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'northeurope', failoverPriority: 1, isZoneRedundant: true }
    ]
  }
}

Azure SQL with active geo-replication: SQL does not support multi-region writes natively. The secondary is read-only. For true active-active with SQL, you need application-level write routing or acceptance of a primary region for writes with read replicas in secondary regions.
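
A minimal sketch of the write-routing idea, assuming the databases sit in a failover group as in the later examples of this guide. A failover group exposes a read-write listener at `<name>.database.windows.net` and a read-only listener at `<name>.secondary.database.windows.net`; the helper name `sql_listener_host` and the `write`/`read` intent labels are illustrative.

```bash
#!/bin/bash
# Return the failover group listener host for a given intent.
# Writes always go to the current primary; reads can be served by the secondary.
sql_listener_host() {
  local fog_name=$1 intent=$2
  case "$intent" in
    write) echo "${fog_name}.database.windows.net" ;;
    read)  echo "${fog_name}.secondary.database.windows.net" ;;
    *)     echo "unknown intent: $intent" >&2; return 1 ;;
  esac
}

sql_listener_host fog-production write   # prints fog-production.database.windows.net
sql_listener_host fog-production read    # prints fog-production.secondary.database.windows.net
```

Because the listener names stay the same after a failover, connection strings built this way do not need to change when the primary moves.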

RTO/RPO achieved

RTO: Under 30 seconds (time for Front Door health probes to mark the origin unhealthy; rerouting happens at the Front Door edge, so no DNS change is involved)
RPO: Near zero for Cosmos DB multi-region writes. Up to 5 seconds for SQL geo-replication.
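
The RTO figure can be sanity-checked against the probe settings in the Front Door template (`sampleSize: 4`, `successfulSamplesRequired: 3`, `probeIntervalInSeconds: 10`). The sketch below uses a deliberately simplified model: a single prober, and an origin counted as unhealthy once fewer than the required number of the last samples succeeded. Front Door actually probes from many edge locations, so treat the result as a rough estimate, not a guarantee. The function name is illustrative.

```bash
#!/bin/bash
# Estimate seconds until an origin is marked unhealthy, starting from a window
# of all-successful samples and assuming every probe after the outage fails.
probe_detection_seconds() {
  local sample_size=$1 required=$2 interval=$3
  # Failures needed before fewer than "required" of the last samples are good
  local failures_needed=$(( sample_size - required + 1 ))
  echo $(( failures_needed * interval ))
}

probe_detection_seconds 4 3 10   # prints 20
```

Shortening the probe interval reduces detection time but increases probe traffic against `/health` in both regions.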

Cost reality

Active-active doubles your compute costs and approximately doubles your database costs (Cosmos DB multi-region writes consume RUs in each region). For a typical enterprise application costing EUR 15,000/month in a single region, expect roughly EUR 27,000-30,000/month for active-active, in line with the 80-100 % premium in the table above.

Pattern 2: Active-Passive with Warm Standby

The secondary region is provisioned and running at reduced scale, but does not serve production traffic. Failover is automated but involves promoting replicas, scaling up compute, and switching traffic.

Architecture

[Diagram: active-passive architecture with a warm standby region]

Failover automation

PowerShell

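# Automated failover runbook
param(
    # Resource group that contains the secondary SQL server
    [string]$ResourceGroupName = "rg-production",
    # Failover is issued against the SECONDARY server, which becomes the new primary.
    # "sql-prod-northeurope" is an assumed name; substitute your secondary server.
    [string]$SqlServerName = "sql-prod-northeurope",
    [string]$FailoverGroupName = "fog-production",
    [string]$AksClusterName = "aks-prod-northeurope",
    [string]$AksResourceGroup = "rg-production-northeurope"
)

# Stop on the first failed cmdlet instead of reporting a false success
$ErrorActionPreference = "Stop"

# Step 1: Failover SQL
# If the primary region is unreachable, a planned failover cannot complete;
# add -AllowDataLoss to force it and accept losing unreplicated transactions.
Write-Output "Initiating SQL failover group switch..."
Switch-AzSqlDatabaseFailoverGroup `
    -ResourceGroupName $ResourceGroupName `
    -ServerName $SqlServerName `
    -FailoverGroupName $FailoverGroupName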

# Step 2: Scale up AKS in secondary region
Write-Output "Scaling AKS in secondary region..."
az aks nodepool scale --resource-group $AksResourceGroup `
    --cluster-name $AksClusterName `
    --name systempool --node-count 5

# Step 3: Front Door health probes handle traffic routing automatically
Write-Output "Failover complete. Secondary region is now serving traffic."

RTO/RPO achieved

RTO: 15-30 minutes (SQL failover: 5 min, AKS scale-up: 5-15 min, traffic cutover through Front Door: 1-5 min)
RPO: Under 5 minutes (SQL geo-replication lag)

Cost reality

The secondary region runs at minimum scale. Expect a 40-60 % cost premium over single-region. For the EUR 15,000/month example: EUR 21,000-24,000/month.

Pattern 3: Pilot Light

Only the data layer is replicated. Compute infrastructure is defined in IaC but not deployed until needed.

Architecture

[Diagram: pilot light architecture with replicated data layer only]

Deployment pipeline for pilot light activation

YAML

# Azure DevOps pipeline - activate DR region
trigger: none  # Manual or alert-triggered

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: ActivateDR
    displayName: 'Activate DR Region'
    jobs:
      - job: DeployInfrastructure
        steps:
          - task: AzureCLI@2
            displayName: 'Deploy AKS in DR region'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az deployment group create \
                  --resource-group rg-production-northeurope \
                  --template-file ./bicep/aks-dr.bicep \
                  --parameters environment=dr nodeCount=5

          - task: AzureCLI@2
            displayName: 'Deploy application workloads'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az aks get-credentials --resource-group rg-production-northeurope \
                  --name aks-dr-northeurope
                kubectl apply -k ./k8s/overlays/dr/

          - task: AzureCLI@2
            displayName: 'Failover SQL'
            inputs:
              azureSubscription: 'production-subscription'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
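              inlineScript: |
                # Run against the secondary server, which becomes the new primary
                # (sql-prod-northeurope is an assumed name). Add --allow-data-loss
                # if the primary region is unreachable.
                az sql failover-group set-primary \
                  --resource-group rg-production \
                  --server sql-prod-northeurope \
                  --name fog-production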

RTO/RPO achieved

RTO: 1-4 hours (infrastructure deployment: 30-90 min, application deployment: 15-30 min, validation: 15-30 min)
RPO: Under 5 minutes (SQL geo-replication is always running)

Cost reality

During normal operation you pay only for data replication, which works out to a 10-20 % premium over single-region. For the EUR 15,000/month example: EUR 16,500-18,000/month.
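
The per-pattern figures follow directly from applying the percentage premiums in the RTO/RPO table to the baseline. A small sketch of that arithmetic, using the EUR 15,000/month example from this guide (the function name is illustrative):

```bash
#!/bin/bash
# Print the monthly cost range for a baseline and a premium range given in percent.
monthly_cost_range() {
  local base=$1 low_pct=$2 high_pct=$3
  echo "$(( base + base * low_pct / 100 ))-$(( base + base * high_pct / 100 ))"
}

monthly_cost_range 15000 10 20    # pilot light:   prints 16500-18000
monthly_cost_range 15000 40 60    # warm standby:  prints 21000-24000
monthly_cost_range 15000 80 100   # active-active: prints 27000-30000
```

Swap in your own baseline to get a first-order budget figure before detailed pricing.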

Service-Specific DR Patterns

Azure Kubernetes Service (AKS) Multi-Cluster

Bicep

// AKS multi-region with zone redundancy
resource aksWestEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-westeurope'
  location: 'westeurope'
  // A cluster identity and a DNS prefix are required for the deployment to succeed
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-westeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 3
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}

resource aksNorthEurope 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'aks-prod-northeurope'
  location: 'northeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-prod-northeurope'
    kubernetesVersion: '1.29'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 2  // Warm standby; the failover runbook scales this up
        vmSize: 'Standard_D4s_v5'
        availabilityZones: ['1', '2', '3']
        mode: 'System'
      }
    ]
  }
}

Azure SQL Geo-Replication with Failover Groups

Bicep

resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2023-05-01-preview' = {
  parent: sqlServerPrimary
  name: 'fog-production'
  properties: {
    readWriteEndpoint: {
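      failoverPolicy: 'Automatic'
      // The documented minimum grace period for automatic failover is 1 hour;
      // faster failover relies on the manual runbook or pipeline shown earlier
      failoverWithDataLossGracePeriodMinutes: 60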
    }
    readOnlyEndpoint: {
      failoverPolicy: 'Enabled'
    }
    partnerServers: [
      { id: sqlServerSecondary.id }
    ]
    databases: [sqlDatabase.id]
  }
}

Azure Site Recovery for VMs

For legacy workloads running on VMs that cannot be containerised, Azure Site Recovery (ASR) provides continuous replication with 15-minute RPO and approximately 2-hour RTO including boot time and recovery plan execution. It is not the fastest option, but it requires zero application changes and works for any VM workload.

DR Pattern Selection Decision Flow

[Diagram: DR pattern selection decision flow]

Testing Your DR Plan

A DR plan that is not tested is not a plan — it is a hope.

Monthly: Automated health checks

Bash

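#!/bin/bash
# Abort on the first failing check so a broken step is not silently skipped
set -euo pipefail
echo "=== DR Readiness Report ==="

# Check the failover group's replication state and which role this server holds
# (this reports state, not lag in seconds)
az sql failover-group show \
  --resource-group rg-production \
  --server sql-prod-westeurope \
  --name fog-production \
  --query "{state: replicationState, role: replicationRole}"

# Verify Bicep templates compile
az bicep build --file ./bicep/dr-region.bicep

# Check storage availability in both regions and the last geo-replication sync time
az storage account show \
  --resource-group rg-production \
  --name stprodwesteurope \
  --expand geoReplicationStats \
  --query "{primary: statusOfPrimary, secondary: statusOfSecondary, lastSync: geoReplicationStats.lastSyncTime}"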

Quarterly: Simulated failover

Execute a full failover to the secondary region during a maintenance window. Route a percentage of traffic to the DR region and validate application functionality. Measure actual RTO against targets.
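
Measuring actual RTO is easier when the stopwatch is scripted. The sketch below polls a probe command from the moment failover is triggered and prints the seconds elapsed until the first success. The function name `measure_rto` and its argument order are assumptions made for this example; any command that exits 0 on success can serve as the probe.

```bash
#!/bin/bash
# Poll a probe command until it succeeds and print the elapsed seconds.
# Usage: measure_rto <max_attempts> <sleep_seconds> <probe command...>
measure_rto() {
  local max_attempts=$1 sleep_seconds=$2
  shift 2
  local start attempt=1
  start=$(date +%s)
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    sleep "$sleep_seconds"
    attempt=$(( attempt + 1 ))
  done
  echo "probe never succeeded after ${max_attempts} attempts" >&2
  return 1
}

# Example (hypothetical hostname): start this right after triggering failover.
#   measure_rto 360 5 curl --fail --silent --max-time 3 https://<front-door-host>/health
```

Record the printed value alongside the RTO target for each quarterly exercise so that drift becomes visible over time.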

Annually: Full chaos exercise

Simulate a complete primary region failure without warning the operations team (inform management only). Measure detection time, escalation time, failover time, and data integrity.

Common Mistakes

Mistake 1: Testing failover but never testing failback. Failing over is half the problem. Failing back to the primary region after it recovers is often harder and less tested.

Mistake 2: Forgetting stateful components. DNS records, SSL certificates, configuration stored in App Configuration or Key Vault, secrets — all need to be available in both regions.

Mistake 3: Not accounting for capacity. Your DR region needs enough capacity to run production workloads. If you are using a warm standby, ensure the secondary region has quota for the VMs you need.

Mistake 4: Ignoring cross-region latency. If your application has synchronous calls between services, and some services fail over while others do not, you may introduce cross-region latency that breaks timeout assumptions.
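
Mistakes 2 and 3 lend themselves to automated checks. The sketch below shows two small helpers: one that lists secret names present in the primary region but missing in the DR region, and one that checks whether a regional quota leaves room for the vCPUs a failover needs. The helper names and the Key Vault names in the comments are hypothetical; the inputs would typically come from `az keyvault secret list` and `az vm list-usage`.

```bash
#!/bin/bash
# Mistake 2: print names present in the primary list but absent from the DR list.
# Both arguments are newline-separated lists of names.
missing_in_dr() {
  local primary_names=$1 dr_names=$2
  printf '%s\n' "$primary_names" | sort -u | while IFS= read -r name; do
    [ -n "$name" ] || continue
    if ! printf '%s\n' "$dr_names" | grep -Fxq -- "$name"; then
      printf '%s\n' "$name"
    fi
  done
}

# Mistake 3: succeed only if the quota has room for the vCPUs a failover needs.
has_quota_headroom() {
  local current=$1 limit=$2 needed=$3
  [ $(( limit - current )) -ge "$needed" ]
}

# Example inputs (hypothetical vault names; requires a logged-in Azure CLI):
#   primary=$(az keyvault secret list --vault-name kv-prod-westeurope --query "[].name" -o tsv)
#   dr=$(az keyvault secret list --vault-name kv-prod-northeurope --query "[].name" -o tsv)
#   missing_in_dr "$primary" "$dr"
#   az vm list-usage --location northeurope --output table   # source for current and limit
```

As a worked case, scaling the standby pool to five Standard_D4s_v5 nodes needs 20 vCPUs, so `has_quota_headroom 80 100 20` succeeds while `has_quota_headroom 90 100 20` fails.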

Choosing the Right Pattern

Factor	Active-Active	Active-Passive (Warm)	Pilot Light
RTO	< 1 min	15-30 min	1-4 hours
RPO	Near zero	< 5 min	< 5 min
Cost premium	80-100 %	40-60 %	10-20 %
Operational complexity	High	Medium	Low (until activation)
Best for	Revenue-critical apps	Core business apps	Internal, non-critical apps

The right pattern depends on the cost of downtime, not the cost of infrastructure. If one hour of downtime costs EUR 500,000, active-active at EUR 15,000/month premium is trivially justified.
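
To make that comparison concrete, the sketch below computes how many minutes of avoided downtime per month are enough to pay for a given DR premium. It uses the figures from the sentence above; the function name is illustrative, and `awk` is used only for the decimal division.

```bash
#!/bin/bash
# Minutes of avoided downtime per month at which the DR premium pays for itself.
breakeven_minutes() {
  local monthly_premium=$1 downtime_cost_per_hour=$2
  awk -v p="$monthly_premium" -v c="$downtime_cost_per_hour" \
    'BEGIN { printf "%.1f\n", p * 60 / c }'
}

breakeven_minutes 15000 500000   # prints 1.8
```

At EUR 500,000 per hour, the premium is recovered after less than two minutes of avoided downtime in a month; at EUR 50,000 per hour the same premium needs 18 minutes.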

Next Steps

Disaster recovery architecture is one of those areas where the gap between "what we planned" and "what actually works" is widest. The patterns above are proven, but they need to be adapted to your specific application architecture, compliance requirements, and budget constraints.

If you need help designing or testing a multi-region DR architecture on Azure, contact us at mbrahim@conceptualise.de. We specialise in architectures that work under pressure, not just in presentations.
