Event-Driven Architecture on Azure: When, Why, and How
A practical guide to event-driven architecture on Azure — Event Grid, Service Bus, Event Hubs, CQRS, and saga patterns explained.
Event-driven architecture (EDA) is not a universal solution — but when the problem fits, it is transformative. The challenge for enterprise teams is knowing when EDA is the right choice, which Azure services to use for which scenario, and how to avoid the pitfalls that turn event-driven systems into debugging nightmares.
This guide is the condensed version of what we cover in our architecture workshops, drawn from production systems processing millions of events daily.
When Event-Driven Architecture Makes Sense
EDA adds complexity. That complexity is justified when you have:
- Temporal decoupling needs: the producer should not wait for the consumer, and they may operate on different schedules.
- Multiple consumers for the same event: one action triggers reactions in several bounded contexts (e.g., an order placed triggers inventory, shipping, notifications, and analytics).
- Spike absorption: traffic is bursty and downstream systems cannot scale as quickly as upstream producers.
- Audit requirements: you need a complete, immutable history of what happened and when.
When EDA is the wrong choice:
- Simple CRUD applications with a single database
- Scenarios requiring synchronous, immediate responses
- Teams without operational maturity for distributed debugging
- Domains where eventual consistency is unacceptable
Rule of thumb: If your system needs to respond "yes, this is done" within the same HTTP request, event-driven is the wrong default. Use synchronous processing and consider events only for side effects.
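The rule of thumb can be sketched in a few lines. This is an illustrative in-memory stand-in, not an Azure API: the request is completed synchronously, and only the side effects ride on an event.

```python
# Sketch of the rule of thumb: answer "yes, this is done" synchronously,
# then publish an event for side effects only. EventBus and all names here
# are illustrative stand-ins for a broker such as Service Bus.
from dataclasses import dataclass, field

@dataclass
class EventBus:
    """Minimal in-memory stand-in for a message broker."""
    published: list = field(default_factory=list)

    def publish(self, event_type: str, payload: dict) -> None:
        self.published.append((event_type, payload))

def place_order(order_id: str, bus: EventBus) -> dict:
    # Synchronous work the caller must see completed within the HTTP request:
    order = {"id": order_id, "status": "confirmed"}
    # Side effects (email, analytics) ride on an event, not the request path:
    bus.publish("OrderPlaced", {"order_id": order_id})
    return order

bus = EventBus()
response = place_order("ord-1", bus)
print(response["status"], len(bus.published))  # confirmed 1
```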
Choosing the Right Azure Service
Azure offers three core messaging services. Choosing the wrong one is a common and expensive mistake.
Azure Event Grid
What it is: Lightweight event routing service optimized for reactive, event-driven scenarios.
Use when:
- You need to react to Azure resource events (blob created, resource group changed)
- You want simple pub/sub with filtering by event type or subject
- Events are notifications ("something happened"), not commands ("do this")
- You need near-real-time delivery with low latency
Limitations: No ordering guarantee, limited retry capabilities, maximum event size of 1 MB. Not suitable for high-throughput data streaming.
Azure Service Bus
What it is: Enterprise message broker with full queuing and pub/sub capabilities.
Use when:
- You need guaranteed ordered delivery (sessions)
- You need at-least-once delivery with duplicate detection (effectively once within a configurable deduplication window)
- Messages are commands that must be processed reliably
- You need dead-letter queues, scheduled delivery, or message deferral
- Transaction support across multiple queues/topics is required
This is the default choice for most enterprise scenarios. Its feature set covers 80% of use cases.
Azure Event Hubs
What it is: High-throughput data streaming platform (comparable to Apache Kafka).
Use when:
- You are ingesting telemetry, logs, or IoT data at massive scale (millions of events per second)
- You need consumer groups that read at their own pace
- You want to replay events from a specific point in time
- Event ordering within a partition is required
Key distinction: Event Hubs retains events for a configurable period (up to seven days on the Standard tier, up to 90 days on Premium and Dedicated tiers). Consumers pull events — the broker does not push.
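The pull model above can be illustrated with a toy partitioned log (this is a sketch, not the Event Hubs SDK): the broker retains events, each consumer group tracks its own offset, and replay is just resetting that offset.

```python
# Illustrative sketch of the Event Hubs consumption model: a retained,
# append-only log per partition, with per-consumer-group offsets so each
# group reads at its own pace and can replay from an earlier point.
class PartitionLog:
    def __init__(self):
        self.events = []   # retained events, append-only
        self.offsets = {}  # consumer group -> next offset to read

    def append(self, event):
        self.events.append(event)

    def pull(self, group, max_count=10):
        start = self.offsets.get(group, 0)
        batch = self.events[start:start + max_count]
        self.offsets[group] = start + len(batch)
        return batch

    def rewind(self, group, offset=0):
        """Replay support: reset the group's offset to an earlier point."""
        self.offsets[group] = offset

log = PartitionLog()
for i in range(5):
    log.append({"seq": i})

print([e["seq"] for e in log.pull("analytics")])     # [0, 1, 2, 3, 4]
print([e["seq"] for e in log.pull("billing", 2)])    # [0, 1]
log.rewind("analytics")
print([e["seq"] for e in log.pull("analytics", 3)])  # [0, 1, 2]
```

Note that pulling for "billing" does not advance the "analytics" offset: the two groups are fully independent, which is what lets slow consumers coexist with fast ones.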
Decision Matrix
| Requirement | Event Grid | Service Bus | Event Hubs |
|---|---|---|---|
| Pub/sub notifications | Best | Good | Overkill |
| Command processing | No | Best | No |
| Ordered delivery | No | Yes (sessions) | Yes (partition) |
| High throughput streaming | No | Moderate | Best |
| Dead-letter handling | Limited | Full | Manual |
| Event replay | No | No | Yes |
Implementing CQRS: Separating Reads from Writes
Command Query Responsibility Segregation (CQRS) pairs naturally with EDA. The core idea: use different models for reading and writing data.
When CQRS adds value:
- Read and write workloads have different scaling requirements
- The read model needs to be denormalized for query performance
- Multiple bounded contexts need different views of the same data
Practical implementation on Azure:
- Write side: Commands are processed, domain events are published to Service Bus.
- Event handlers: Subscribe to events, update read-optimized projections (e.g., in Azure Cosmos DB or Azure SQL).
- Read side: Queries hit the denormalized read store directly — no joins, no complex queries.
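The three-part flow above can be sketched with in-memory stand-ins (the event shape and store names are illustrative, not a specific Azure API): a command handler updates the write model and publishes a domain event, a subscriber maintains a denormalized projection, and the read side is a plain lookup.

```python
# Hedged sketch of the CQRS flow: write side publishes domain events,
# an event handler maintains a denormalized projection, queries hit the
# projection directly. All names are illustrative.
write_store = {}  # normalized write model (stand-in for the write database)
read_store = {}   # denormalized projection (stand-in for Cosmos DB / SQL)
handlers = []     # event subscribers

def publish(event):
    for handler in handlers:
        handler(event)

# Write side: process the command, then publish the domain event.
def handle_place_order(order_id, customer, total):
    write_store[order_id] = {"customer": customer, "total": total}
    publish({"type": "OrderPlaced", "order_id": order_id,
             "customer": customer, "total": total})

# Event handler: keep a per-customer summary projection up to date.
def project_order_placed(event):
    if event["type"] != "OrderPlaced":
        return
    summary = read_store.setdefault(event["customer"],
                                    {"orders": 0, "spend": 0})
    summary["orders"] += 1
    summary["spend"] += event["total"]

handlers.append(project_order_placed)

handle_place_order("ord-1", "alice", 40)
handle_place_order("ord-2", "alice", 60)

# Read side: no joins, the query is a dictionary lookup.
print(read_store["alice"])  # {'orders': 2, 'spend': 100}
```

In production the publish step goes through Service Bus rather than an in-process list, which is exactly where the eventual consistency discussed below comes from: the projection is updated after, not during, the command.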
Critical considerations:
- Eventual consistency is inherent. The read model will lag behind the write model. Design your UI to handle this (e.g., redirect to a confirmation page rather than immediately querying the read model).
- Do not apply CQRS everywhere. Simple domains with low read/write asymmetry gain nothing but complexity.
- Projection rebuilds must be possible. Store events durably so you can rebuild read models from scratch when the schema changes.
Event Sourcing: When the Event Log Becomes the Truth
Event sourcing goes further than CQRS: instead of storing current state, you store every event that changed the state.
Compelling use cases:
- Financial systems where auditability is non-negotiable
- Domains where understanding "how we got here" is as important as "where we are"
- Systems that need to reconstruct state at any point in time
Implementation realities:
- Event store: Use Azure Cosmos DB with a change feed, or a dedicated event store library like Marten (PostgreSQL).
- Snapshots: For aggregates with thousands of events, periodically create snapshots to avoid replaying the entire event stream.
- Schema evolution: Events are immutable. When your event schema needs to change, use upcasting (transforming old events to new format at read time).
- Storage costs: Events accumulate. Plan for archival — move events older than N months to cold storage.
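The mechanics above (replay, snapshots, upcasting) fit in a short sketch. The aggregate, event names, and fields are illustrative assumptions, not a specific event store's API.

```python
# Minimal event-sourcing sketch: state is rebuilt by replaying events,
# a snapshot caps replay cost, and an upcaster migrates an old event
# shape at read time. Event names and fields are illustrative.
def upcast(event):
    """Schema evolution: assume v1 'Deposited' events lacked a currency."""
    if event["type"] == "Deposited" and "currency" not in event:
        return {**event, "currency": "EUR"}
    return event

def apply(state, event):
    event = upcast(event)
    if event["type"] == "Deposited":
        return {"balance": state["balance"] + event["amount"]}
    if event["type"] == "Withdrawn":
        return {"balance": state["balance"] - event["amount"]}
    return state

def load(events, snapshot=None):
    """Rebuild current state: start from a snapshot if one exists,
    then replay only the events recorded after it."""
    state = {"balance": 0} if snapshot is None else dict(snapshot["state"])
    start = 0 if snapshot is None else snapshot["version"]
    for event in events[start:]:
        state = apply(state, event)
    return state

events = [
    {"type": "Deposited", "amount": 100},  # old v1 event, no currency field
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 50, "currency": "EUR"},
]
snapshot = {"version": 2, "state": {"balance": 70}}  # taken after two events

print(load(events))            # {'balance': 120}
print(load(events, snapshot))  # {'balance': 120}, with a shorter replay
```

The same `load` function is what makes projection rebuilds possible: because the events are durable and immutable, any read model can be reconstructed from scratch.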
Warning: Event sourcing is powerful but adds significant complexity. We recommend it only when the business value of a complete audit trail justifies the engineering investment.
The Saga Pattern: Managing Distributed Transactions
In a microservices world, you cannot use database transactions across service boundaries. The saga pattern coordinates multi-step business processes through events.
Choreography (preferred for simple sagas):
Each service listens for events and publishes its own events. No central coordinator.
OrderService -> OrderPlaced
InventoryService (listens) -> InventoryReserved
PaymentService (listens) -> PaymentProcessed
ShippingService (listens) -> ShipmentScheduled
Orchestration (preferred for complex sagas):
A central saga orchestrator manages the flow and compensation logic.
SagaOrchestrator:
1. Send ReserveInventory command -> wait for response
2. Send ProcessPayment command -> wait for response
3. Send ScheduleShipment command -> wait for response
On failure at any step -> send compensating commands
Practical guidance:
- Choreography works for 3-4 steps. Beyond that, the implicit flow becomes impossible to reason about.
- Orchestration is better for complex flows with conditional logic, timeouts, and human intervention steps.
- MassTransit provides excellent saga state machine support for .NET, including persistence to Azure SQL or Cosmos DB.
- Always define compensating actions. If payment fails after inventory is reserved, the saga must release the reservation.
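The orchestration flow and the compensation rule above can be sketched as a small coordinator (step and service names are illustrative; a production version would use a saga framework such as MassTransit with persisted state): run the steps in order, and on failure run the compensating action for every completed step in reverse.

```python
# Sketch of saga orchestration with compensation: each step pairs an
# action with a compensating action; on failure, completed steps are
# rolled back in reverse order. All names are illustrative.
def run_saga(steps, context):
    """steps: list of (name, action, compensation); actions return True/False."""
    completed = []
    for name, action, compensate in steps:
        if action(context):
            completed.append((name, compensate))
        else:
            # Roll back completed steps in reverse order.
            for done_name, comp in reversed(completed):
                comp(context)
            return {"status": "compensated", "failed_step": name}
    return {"status": "completed"}

log = []
steps = [
    ("ReserveInventory",
     lambda ctx: log.append("inventory reserved") or True,
     lambda ctx: log.append("inventory released")),
    ("ProcessPayment",
     lambda ctx: False,  # simulate a payment failure
     lambda ctx: log.append("payment refunded")),
    ("ScheduleShipment",
     lambda ctx: log.append("shipment scheduled") or True,
     lambda ctx: log.append("shipment cancelled")),
]

result = run_saga(steps, {})
print(result)  # {'status': 'compensated', 'failed_step': 'ProcessPayment'}
print(log)     # ['inventory reserved', 'inventory released']
```

Note the asymmetry: the payment step that failed is not compensated, only the steps that had already succeeded. Getting this boundary right is most of the work in real sagas.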
Dead-Letter Handling: The Safety Net
Dead-letter queues (DLQ) are where messages go when they cannot be processed. Ignoring them is ignoring production failures.
A robust dead-letter strategy:
- Monitor DLQ depth with Azure Monitor alerts. A growing DLQ is a symptom of a bug, not a feature.
- Build a dead-letter processing pipeline: inspect, diagnose, fix, and replay. Do not manually re-send messages.
- Categorize failures: transient (retry), poison (investigate and discard), and schema mismatch (fix consumer and replay).
- Set appropriate MaxDeliveryCount: too low and you dead-letter on transient failures; too high and you delay detection. We default to 5-10.
- Preserve context: log the exception, the message body, and the trace ID when moving to DLQ.
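The failure categories above map naturally to a small triage function. The reason codes here are illustrative assumptions, not actual Service Bus dead-letter reason strings:

```python
# Sketch of DLQ triage: classify a dead-lettered message as transient,
# schema mismatch, or poison, and map each class to an action. The
# reason codes are illustrative, not real Service Bus reasons.
TRANSIENT = {"timeout", "throttled", "connection_reset"}
SCHEMA = {"deserialization_failed", "missing_field"}

def triage(dead_lettered):
    reason = dead_lettered["reason"]
    if reason in TRANSIENT:
        return "replay"                    # safe to resubmit as-is
    if reason in SCHEMA:
        return "fix_consumer_then_replay"  # deploy a fix, then replay
    return "investigate_and_discard"       # treat as poison by default

print(triage({"reason": "timeout"}))        # replay
print(triage({"reason": "missing_field"}))  # fix_consumer_then_replay
print(triage({"reason": "null_reference"})) # investigate_and_discard
```

Defaulting unknown reasons to "investigate and discard" rather than blind replay is deliberate: replaying a poison message just dead-letters it again.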
Tooling recommendation: Build an internal "DLQ dashboard" that operations teams can use to inspect, replay, or discard dead-lettered messages. This pays for itself within weeks.
Architecture Checklist
Before adopting event-driven architecture on Azure:
- Identified clear event producers and consumers
- Chosen the correct Azure messaging service per scenario
- Defined event schemas with versioning strategy
- Designed compensation logic for distributed transactions
- Established dead-letter monitoring and replay procedures
- Accepted eventual consistency and designed the UI accordingly
- Implemented distributed tracing across all event flows
At CC Conceptualise, we design and implement event-driven architectures on Azure that handle real-world complexity. If you are evaluating EDA for your enterprise, reach out.