Two concepts are key to understanding how the AWS physical infrastructure is architected:
- Region
- Availability Zone (AZ)
AWS infrastructure planning implements an elaborate strategy to offer highly available, resilient [1] and scalable [2] services. AWS abstracts away most infrastructure management tasks from its users, from renting data center real estate to wiring up multiple machines in a local network.
Despite following a distributed model with many levels of replication (hardware, data, software, network), different parts of this infrastructure fail occasionally, and it is difficult to predict which ones will fail, when, and how.
When these systems do fail, having different Regions and AZs enables AWS to continue providing services to its customers with minimal to zero disruption. This model isn’t completely fail-safe: some failures might still be disruptive, but they are rare.
Availability Zone (AZ)
A collection of data centers representing a partition of the AWS infrastructure and services. Each data center is hosted in a separate facility and may have hundreds of thousands of machines.
AZs within each Region are interconnected with high-throughput, low-latency links. AWS uses a fully redundant network with dedicated metro fiber [3]. By replicating application resources across different AZs, AWS provides redundancy against natural events and disasters (lightning strikes, tornadoes, flooding, etc.).
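To make the benefit of replicating across AZs concrete, here is a minimal sketch of the arithmetic. It assumes that AZs fail independently and that each has the same availability; both are simplifications, and the 99% figure is purely illustrative, not an AWS number. The function name `combined_availability` is hypothetical.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Probability that at least one AZ is up, ASSUMING independent failures."""
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The application is down only when every AZ is down at the same time.
    all_down = (1.0 - per_az_availability) ** az_count
    return 1.0 - all_down


if __name__ == "__main__":
    # Illustrative value only: 0.99 is an assumption, not a published figure.
    for az_count in (1, 2, 3):
        print(az_count, f"{combined_availability(0.99, az_count):.6f}")
```

Under these assumptions, each additional AZ multiplies the chance of a total outage by the per-AZ failure probability. Correlated failures, which the independence assumption ignores, would reduce the real-world gain.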
Although a single Region offers a great level of redundancy with multiple AZs, some risks still apply. Political instability, social unrest or military conflicts are some of the factors that may take down an entire Region.
To ensure maximum availability and resilience, applications can benefit from cross-Region replication. In this case, if an entire Region goes offline, the application can continue to serve its users from another Region. Latency might increase for users that were previously served by the unavailable Region, but services won’t be disrupted.
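The failover behaviour described above can be sketched on the client side as follows. This is only an illustration under stated assumptions: `call_with_failover`, `fake_request` and the Region ordering are hypothetical, and a real deployment would more likely rely on DNS or a managed routing layer than on a hand-written loop.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsUnavailable(Exception):
    """Raised when no Region in the preference list could serve the request."""


def call_with_failover(
    regions: Sequence[str], request: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each Region in order of preference and return (region, response)."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, request(region)
        except ConnectionError as exc:
            # Treat a connection failure as "Region unavailable" and move on.
            errors[region] = exc
    raise AllRegionsUnavailable(f"no Region answered: {sorted(errors)}")


def fake_request(region: str) -> str:
    """Stand-in for a real service call; pretends us-east-1 is offline."""
    if region == "us-east-1":
        raise ConnectionError("simulated Region outage")
    return f"ok from {region}"


if __name__ == "__main__":
    print(call_with_failover(["us-east-1", "eu-west-1"], fake_request))
```

Users normally served by the first Region are answered by the second one, which is where the extra latency mentioned above would come from.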
Some services provide an easy way to implement cross-Region replication, such as DynamoDB Global Tables and S3 Replication, while others require developers to implement their own logic.
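As a rough illustration of the "easy" path, the sketch below builds the request that adds a replica Region to an existing DynamoDB table through boto3's `update_table` call with `ReplicaUpdates`. It assumes the current version of Global Tables, an existing table that already meets the service's prerequisites, and valid AWS credentials; the table name, Regions and helper names are hypothetical.

```python
def add_replica_request(table_name: str, replica_region: str) -> dict:
    """Build the keyword arguments for DynamoDB's update_table call."""
    return {
        "TableName": table_name,
        # Ask DynamoDB to create a replica of the table in another Region.
        "ReplicaUpdates": [{"Create": {"RegionName": replica_region}}],
    }


def add_replica(table_name: str, home_region: str, replica_region: str) -> dict:
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without boto3

    dynamodb = boto3.client("dynamodb", region_name=home_region)
    return dynamodb.update_table(**add_replica_request(table_name, replica_region))


if __name__ == "__main__":
    # Hypothetical names, printed only; no AWS call is made here.
    print(add_replica_request("orders", "eu-west-1"))
```

Separating the request builder from the network call keeps the example inspectable without touching an AWS account.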
Not all AWS services provide multi-AZ redundancy automatically, though. For many of them, the feature can be enabled relatively easily. This is the case for Relational Database Service (RDS) instances [6] and File Systems [7], for example. There are tutorials for other services, such as Elastic Compute Cloud (EC2) [8].
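For RDS, enabling multi-AZ redundancy comes down to a single flag at creation time. The following is a minimal sketch using boto3's `create_db_instance`; the engine, instance class, storage size, user name and helper names are illustrative assumptions, and the call itself requires boto3 and valid AWS credentials.

```python
def multi_az_db_params(identifier: str, master_password: str) -> dict:
    """Build keyword arguments for rds.create_db_instance with Multi-AZ enabled."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "postgres",              # assumption: any supported engine works
        "DBInstanceClass": "db.t3.micro",  # assumption: illustrative size only
        "AllocatedStorage": 20,            # GiB, illustrative
        "MasterUsername": "app_admin",     # hypothetical user name
        "MasterUserPassword": master_password,
        "MultiAZ": True,                   # request a standby in another AZ
    }


def create_multi_az_db(identifier: str, master_password: str, region: str) -> dict:
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without boto3

    rds = boto3.client("rds", region_name=region)
    return rds.create_db_instance(**multi_az_db_params(identifier, master_password))


if __name__ == "__main__":
    params = multi_az_db_params("demo-db", "change-me")
    print(params["DBInstanceIdentifier"], params["MultiAZ"])
```

In practice the password would come from a secrets store rather than a literal, and the same `MultiAZ` flag can also be set on an existing instance through a modification request.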
For compute workloads running on EC2 [9], AWS offers placement groups, which control how instances are laid out on the underlying hardware. Partition placement groups [10] divide instances into logical partitions that do not share racks with one another, so a hardware failure affects only one partition. Large distributed and replicated workloads such as Kafka, Hadoop and HBase may benefit from this feature.
Cluster placement groups [11] keep multiple EC2 instances close together inside a single AZ to reduce network latency, as typically required by high-performance computing (HPC) workloads.
Spread placement groups [12] distribute a small number of critical instances across different server racks, reducing the exposure to correlated failures.
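The three placement strategies map onto one EC2 API call. The sketch below builds the arguments for boto3's `create_placement_group`; the group names, the default partition count and the helper names are hypothetical, and the actual call requires boto3 and valid AWS credentials.

```python
VALID_STRATEGIES = ("cluster", "partition", "spread")


def placement_group_request(name: str, strategy: str, partition_count: int = 3) -> dict:
    """Build keyword arguments for ec2.create_placement_group."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"strategy must be one of {VALID_STRATEGIES}")
    params = {"GroupName": name, "Strategy": strategy}
    if strategy == "partition":
        # Only partition groups take a partition count.
        params["PartitionCount"] = partition_count
    return params


def create_placement_group(name: str, strategy: str, region: str) -> dict:
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_placement_group(**placement_group_request(name, strategy))


if __name__ == "__main__":
    # Hypothetical group names, printed only; no AWS call is made here.
    for strategy in VALID_STRATEGIES:
        print(placement_group_request(f"demo-{strategy}", strategy))
```

Instances are then launched into a group by naming it in their placement settings; the choice of strategy is a trade-off between low latency (cluster) and failure isolation (partition, spread).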
1. Refer to the Reliability page.
2. Refer to the Scalability page.
3. AWS Availability Zones
4. Not necessarily following any political borders, but more aligned with business and commercial practices (e.g. “Asia Pacific”, “Middle East”).
5. List of AWS Regions and AZs
6. AWS RDS Multi-AZ Deployments
7. Deploying Multi-AZ File Systems
8. Increase the Availability of Your Application on Amazon EC2
9. EC2: Elastic Compute Cloud
10. Using partition placement groups for large distributed and replicated workloads in Amazon EC2
11. EC2 Cluster Placement Groups
12. EC2 Spread Placement Groups