The strategy and practical considerations about AWS physical infrastructure
Two concepts are key to understanding how the AWS physical infrastructure is architected: Regions and Availability Zones (AZs).
AWS infrastructure planning implements an elaborate strategy to offer highly available, resilient [1] and scalable [2] services. AWS abstracts away most infrastructure management tasks from its users, from renting data center real estate to wiring up multiple machines in a local network.
Despite a distributed model and many levels of replication (hardware, data, software, network), parts of this infrastructure fail occasionally, and it’s difficult to predict which ones will fail, when, or how.
When these systems do fail, having different Regions and AZs enables AWS to continue providing services to its customers with minimal to zero disruption. This model isn’t completely fail-safe: some failures might still be disruptive, but they are rare.
An Availability Zone (AZ) is a collection of data centers representing a partition of the AWS infrastructure and services. Each data center is hosted in a separate facility and may have hundreds of thousands of machines.
AZs within each Region are interconnected through high-throughput, low-latency links. AWS uses a fully redundant network with dedicated metro fiber [3]. By replicating application resources across different AZs, AWS provides redundancy against natural events and disasters (lightning strikes, tornadoes, flooding, etc.).
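To make the benefit of replicating across AZs concrete, here is a minimal arithmetic sketch. It assumes that AZ failures are fully independent and that the application stays up as long as at least one AZ is up. Both are idealizations, and the 99% per-AZ availability figure is a hypothetical number chosen for illustration, not an AWS figure.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Availability of a service that stays up while at least one AZ is up.

    Assumes AZ failures are independent, which is an idealization.
    """
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The service is down only when every AZ is down at the same time.
    return 1.0 - (1.0 - per_az_availability) ** az_count


if __name__ == "__main__":
    # Hypothetical per-AZ availability of 99%.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, each additional AZ multiplies the chance of a total outage by the per-AZ failure probability. Real failures are often correlated, so the actual gain is smaller than this model suggests.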
Although a single Region offers a great level of redundancy with multiple AZs, some risks still apply. Political instability, social unrest or military conflicts are some of the factors that may take down an entire Region.
To ensure maximum availability and resilience, applications can benefit from cross-Region replication. In this case, if an entire Region goes offline, the application can continue to serve its users from a Region elsewhere on the planet. Latency might increase slightly for users that were previously served by the unavailable Region, but services won’t be disrupted.
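The failover behavior described above can be sketched as a simple selection rule: serve each user from the lowest-latency Region that is still healthy. This is an illustrative model only. The `pick_region` helper, the latency figures and the health flags are all hypothetical, and in practice this decision is usually delegated to DNS or a global load balancer rather than to application code.

```python
from typing import Mapping, Optional


def pick_region(
    latency_ms: Mapping[str, float],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the healthy Region with the lowest latency, or None if all are down.

    Regions missing from `healthy` are treated as unavailable.
    """
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


if __name__ == "__main__":
    # Hypothetical latencies measured from one user's location.
    latencies = {"eu-west-1": 20.0, "us-east-1": 90.0, "ap-southeast-1": 180.0}

    all_up = {"eu-west-1": True, "us-east-1": True, "ap-southeast-1": True}
    print(pick_region(latencies, all_up))

    # The closest Region goes offline: the user is served from the next best one,
    # at a higher latency but without losing service.
    eu_down = dict(all_up, **{"eu-west-1": False})
    print(pick_region(latencies, eu_down))
```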
Not all AWS services provide multi-AZ redundancy automatically, but the feature is often relatively easy to enable. This is the case for Relational Database Service (RDS) instances [6] and file systems [7], for example. There are tutorials for other services, such as Elastic Compute Cloud (EC2) [8].
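As an illustration of how little is involved for RDS, the sketch below builds the parameters for the `modify_db_instance` call of the boto3 SDK, where `MultiAZ=True` requests a standby in another AZ. The helper names and the `orders-db` identifier are hypothetical. The call itself needs the third-party `boto3` package, valid AWS credentials and an existing instance, so it is kept in a separate function that is not run here.

```python
def multi_az_request(db_instance_id: str, apply_immediately: bool = False) -> dict:
    """Build the keyword arguments for enabling Multi-AZ on an RDS instance."""
    if not db_instance_id:
        raise ValueError("db_instance_id must not be empty")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MultiAZ": True,
        # When False, the change waits for the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


def enable_multi_az(db_instance_id: str, apply_immediately: bool = False) -> dict:
    """Send the request to AWS. Requires boto3 and configured credentials."""
    import boto3  # third-party; imported lazily so the builder works without it

    rds = boto3.client("rds")
    return rds.modify_db_instance(**multi_az_request(db_instance_id, apply_immediately))


if __name__ == "__main__":
    # "orders-db" is a hypothetical instance identifier.
    print(multi_az_request("orders-db"))
```

Separating the request builder from the network call makes the parameters easy to review and test before anything is changed in a live account.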
For compute workloads running on EC2 [9], AWS offers partition placement groups [10]. They spread instances across logical partitions so that instances in different partitions do not share the same underlying hardware, which limits the impact of a hardware failure inside a single AZ. Large distributed and replicated services such as Kafka, Hadoop and HBase may benefit from this feature.
Cluster placement groups [11] keep multiple EC2 instances packed close together inside one AZ to reduce network latency, as typically required by High-Performance Computing (HPC) workloads.
Spread placement groups [12] distribute critical instances across distinct server racks, reducing the exposure to correlated failures.
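The three placement strategies map onto the `Strategy` parameter of the `create_placement_group` call in the boto3 EC2 client. The following is a minimal sketch under that assumption: the helper names and group names are hypothetical, and the function that actually calls AWS requires the third-party `boto3` package and configured credentials, so it is not run here.

```python
from typing import Optional

VALID_STRATEGIES = ("cluster", "partition", "spread")


def placement_group_request(
    name: str,
    strategy: str,
    partition_count: Optional[int] = None,
) -> dict:
    """Build the keyword arguments for creating an EC2 placement group."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"strategy must be one of {VALID_STRATEGIES}")
    request = {"GroupName": name, "Strategy": strategy}
    if partition_count is not None:
        # PartitionCount only has meaning for the partition strategy.
        if strategy != "partition":
            raise ValueError("partition_count applies only to 'partition' groups")
        if partition_count < 1:
            raise ValueError("partition_count must be at least 1")
        request["PartitionCount"] = partition_count
    return request


def create_placement_group(
    name: str,
    strategy: str,
    partition_count: Optional[int] = None,
) -> dict:
    """Send the request to AWS. Requires boto3 and configured credentials."""
    import boto3  # third-party; imported lazily so the builder works without it

    ec2 = boto3.client("ec2")
    return ec2.create_placement_group(
        **placement_group_request(name, strategy, partition_count)
    )


if __name__ == "__main__":
    # Hypothetical group names for the three strategies discussed above.
    print(placement_group_request("kafka-brokers", "partition", partition_count=3))
    print(placement_group_request("hpc-nodes", "cluster"))
    print(placement_group_request("critical-api", "spread"))
```

Instances are then launched into a group by referencing its name in the placement settings of the launch request.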