Designing Highly Available Systems Across Multiple Regions
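Cloud providers like AWS and GCP design their regions to be isolated from one another, so that a failure in one region should not spread to the others. When a region such as us-east-1
suffers a major outage (which happens more often than we'd like), applications architected for that single region go offline.
To target 99.999% availability ("five nines", roughly five minutes of downtime per year), you need to design for multi-region failover.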
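To make the availability target concrete, here is a minimal sketch of the arithmetic. The function names are illustrative, and the combined figure assumes that region failures are independent and that failover is instant and flawless, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year for an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability if the system is up whenever at least one region is up.

    Assumption: failures are independent and failover costs no time.
    """
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_minutes_per_year(target)
        print(f"{target:.3%} availability allows {budget:.1f} minutes of downtime per year")
    print(f"Two regions at 99.9% each: {combined_availability(0.999, 2):.4%}")
```

Under these idealized assumptions, two regions at 99.9% each combine to about 99.9999%. In practice, shared dependencies and failover time bring the real figure well below that.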
Active-Passive (Disaster Recovery)
The simplest multi-region architecture is Active-Passive. Your primary infrastructure runs in Region A (e.g., US-East). Region B (e.g., US-West) runs a scaled-down version of your app or sits idle.
The Data Problem
The hardest part of multi-region architecture is state (data). If Region A dies, Region B needs access to the latest data.
- Asynchronous Replication: You configure your primary database in Region A to asynchronously replicate to a read replica in Region B.
- Failover Event: When Region A fails, you manually (or via automated scripts) promote the Region B database replica to become the primary write node. You then update DNS (for example, Route 53 on AWS) to route all traffic to Region B. Clients that have cached the old DNS record keep trying Region A until the record's TTL expires.
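The failover event can be sketched as a small controller. This is an illustration rather than a production design: `check_primary`, `promote_replica`, and `update_dns` are hypothetical callbacks standing in for whatever health probe, database promotion command, and DNS API you actually use, and the threshold of three consecutive failures is an arbitrary choice to avoid failing over on a single missed probe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverController:
    check_primary: Callable[[], bool]      # hypothetical health probe for Region A
    promote_replica: Callable[[], None]    # hypothetical: promote the Region B replica
    update_dns: Callable[[str], None]      # hypothetical: point DNS at a region
    standby_region: str
    failure_threshold: int = 3             # consecutive failed probes before failover
    consecutive_failures: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Run one health check; return True if this call triggered failover."""
        if self.failed_over:
            return False
        if self.check_primary():
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False
        # Promote before redirecting traffic so Region B can accept writes
        # by the time requests arrive.
        self.promote_replica()
        self.update_dns(self.standby_region)
        self.failed_over = True
        return True


if __name__ == "__main__":
    events: List[str] = []
    probes = iter([True, False, False, False])
    controller = FailoverController(
        check_primary=lambda: next(probes),
        promote_replica=lambda: events.append("promote replica"),
        update_dns=lambda region: events.append(f"dns -> {region}"),
        standby_region="us-west-2",
    )
    for _ in range(4):
        controller.tick()
    print(events)  # ['promote replica', 'dns -> us-west-2']
```

Requiring several consecutive failures trades a slower reaction for fewer false failovers. That matters because promoting a replica is usually hard to undo.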
Trade-off: Because replication is asynchronous, you may lose the last few seconds of data during a hard failure (Recovery Point Objective > 0).
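A toy model shows where the lost data comes from. It assumes a constant replication lag, which is a simplification since real lag varies with load and network conditions. The class and method names are made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AsyncReplicatedLog:
    lag_seconds: float  # assumed constant delay before a write reaches the replica
    writes: List[Tuple[float, str]] = field(default_factory=list)

    def write(self, at: float, value: str) -> None:
        """Record a write committed on the primary at time `at` (seconds)."""
        self.writes.append((at, value))

    def replica_state(self, failure_time: float) -> List[str]:
        """Writes that had reached the replica when the primary died."""
        cutoff = failure_time - self.lag_seconds
        return [value for at, value in self.writes if at <= cutoff]

    def lost_writes(self, failure_time: float) -> List[str]:
        """Writes acknowledged by the primary but never replicated."""
        cutoff = failure_time - self.lag_seconds
        return [value for at, value in self.writes if cutoff < at <= failure_time]


if __name__ == "__main__":
    log = AsyncReplicatedLog(lag_seconds=2.0)
    for at, value in [(0.0, "order-1"), (5.0, "order-2"), (9.0, "order-3"), (9.5, "order-4")]:
        log.write(at, value)
    print(log.replica_state(failure_time=10.0))  # ['order-1', 'order-2']
    print(log.lost_writes(failure_time=10.0))    # ['order-3', 'order-4']
```

In this model the replication lag at the moment of failure is the effective recovery point: every write inside that window was acknowledged to the client but is missing after promotion.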
Active-Active Architecture
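In an Active-Active setup, both Region A and Region B handle live customer traffic simultaneously. Users in New York hit the US-East region; users in California hit US-West. If one region fails, DNS or global load balancer routing shifts all traffic to the surviving region, which must have enough spare capacity to absorb the full load. The shift is not instantaneous: clients that have cached DNS records keep trying the failed region until those records expire.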
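The routing decision can be sketched as "nearest healthy region wins". The latency figures and region names below are invented for illustration. A managed DNS or load-balancing service would make this choice for you based on its own measurements and health checks.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the lowest-latency region among those currently healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies as seen by a user in New York.
    new_york = {"us-east-1": 12.0, "us-west-2": 70.0}
    print(pick_region(new_york, healthy={"us-east-1", "us-west-2"}))  # us-east-1
    print(pick_region(new_york, healthy={"us-west-2"}))               # us-west-2
```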
The CAP Theorem Challenge
Active-Active requires a multi-master database setup. However, the CAP Theorem dictates that in the presence of a network partition (which always happens eventually), you must choose between Consistency and Availability.
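If a user updates their profile in US-East and immediately queries it in US-West, the update might not have replicated yet (eventual consistency). Worse, if the same record is written in both regions at nearly the same time, the two versions conflict. Managed databases take different sides of this trade-off. Amazon DynamoDB Global Tables, by default, favors availability and resolves conflicts with "Last Writer Wins" semantics based on timestamps, which silently discards the older write. CockroachDB makes the opposite choice: it uses consensus-based replication to keep data strongly consistent across regions, at the cost of higher write latency and of data becoming unavailable when it loses its quorum during a partition.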
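Here is a minimal sketch of timestamp-based "Last Writer Wins" merging. It illustrates the general technique rather than any particular database's implementation. The `Versioned` type and the use of the region name as a tie-breaker are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # wall-clock time of the write; clock skew affects who "wins"
    region: str       # assumed tie-breaker so both regions pick the same winner


def lww_merge(local: Dict[str, Versioned], remote: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two replicas, keeping the newest version of every key."""
    merged = dict(local)
    for key, incoming in remote.items():
        current = merged.get(key)
        if current is None or (incoming.timestamp, incoming.region) > (current.timestamp, current.region):
            merged[key] = incoming
    return merged


if __name__ == "__main__":
    east = {"profile:42": Versioned("Ada", 100.0, "us-east-1")}
    west = {"profile:42": Versioned("Ada L.", 100.2, "us-west-2")}
    # Both regions converge on the same value regardless of merge direction.
    print(lww_merge(east, west)["profile:42"].value)  # Ada L.
    print(lww_merge(west, east)["profile:42"].value)  # Ada L.
```

Both replicas converge, but the US-East write is discarded without any error reaching the user who made it. If losing a concurrent update is unacceptable for a given record type, Last Writer Wins is the wrong resolution strategy for it.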
Stateless Application Tiers
Regardless of your database strategy, your application servers should be stateless. They must not keep sessions in local memory or rely on files written to local disk surviving between requests. All sessions should go to a data store that is replicated across regions (for example, DynamoDB Global Tables or a Redis deployment configured for cross-region replication), allowing users to move between regions without being logged out.
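The idea can be sketched with an application server that holds no session data of its own. `SessionStore`, `InMemoryStore`, and `AppServer` are hypothetical names. The in-memory store merely stands in for a replicated store so that the example runs on its own, and it does not model replication delay.

```python
import uuid
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    def get(self, session_id: str) -> Optional[Dict[str, str]]: ...
    def put(self, session_id: str, data: Dict[str, str]) -> None: ...


class InMemoryStore:
    """Stand-in for a cross-region replicated store (assumption for the demo)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}

    def get(self, session_id: str) -> Optional[Dict[str, str]]:
        return self._data.get(session_id)

    def put(self, session_id: str, data: Dict[str, str]) -> None:
        self._data[session_id] = dict(data)


class AppServer:
    def __init__(self, region: str, store: SessionStore) -> None:
        self.region = region
        self.store = store  # all session state lives here, none on the server

    def login(self, user: str) -> str:
        session_id = uuid.uuid4().hex
        self.store.put(session_id, {"user": user})
        return session_id

    def whoami(self, session_id: str) -> Optional[str]:
        session = self.store.get(session_id)
        return session["user"] if session else None


if __name__ == "__main__":
    store = InMemoryStore()
    east = AppServer("us-east-1", store)
    west = AppServer("us-west-2", store)
    sid = east.login("ada")
    print(west.whoami(sid))  # ada
```

Because neither server caches the session, a request that lands in US-West after a login in US-East is still authenticated, provided the store has replicated the session by then.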
Conclusion
Multi-region architecture can roughly double your infrastructure costs and adds substantial operational complexity. It should only be pursued when the financial cost of downtime significantly outweighs the engineering and operating cost of the multi-region setup you choose, whether active-passive or active-active.