Global Infrastructure

The strategy and practical considerations about AWS physical infrastructure

Overview

Three concepts are key in understanding how the AWS physical infrastructure is architected:

  1. Region
  2. Availability Zone (AZ)
  3. Edge

AWS infrastructure planning implements an elaborate strategy to offer highly available, resilient and scalable services. AWS abstracts away most infrastructure management tasks from its users, from renting data center real estate to wiring up multiple machines in a local network.

Despite following a distributed model with many levels of replication (hardware, data, software, network), different parts of this infrastructure fail occasionally, and it’s difficult to predict which ones will fail, when, and how.

When these systems do fail, having different Regions and AZs enables AWS to continue providing services to its customers with minimal to zero disruption. This model isn’t completely fail-safe: some failures might still be disruptive, but they are rare.
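To make the benefit of redundancy concrete, here is a back-of-the-envelope sketch. It assumes that every AZ has the same availability and that AZs fail independently of each other, which is a simplification (correlated failures do happen). The 99.9% figure and the function name are illustrative assumptions, not AWS numbers.

```python
def combined_availability(single_az_availability: float, az_count: int) -> float:
    """Availability of a service that stays up while at least one AZ is up.

    Assumes AZ failures are independent, which real-world correlated
    failures can violate.
    """
    if not 0.0 <= single_az_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The service is down only when every AZ is down at the same time.
    probability_all_down = (1.0 - single_az_availability) ** az_count
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Illustrative input: each AZ is up 99.9% of the time.
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.999, count):.9f}")
```

Under these assumptions each additional AZ multiplies the chance of a total outage by the single-AZ failure probability, which is why even two or three AZs make a large difference.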

Availability Zone (AZ)

An AZ is a collection of data centers representing a partition of the AWS infrastructure and services. Each data center is hosted in a separate facility and may have hundreds of thousands of machines.

AZs within each Region are interconnected with high-throughput, low-latency links. AWS uses a fully redundant network with dedicated metro fiber. By replicating application resources across different AZs, AWS provides redundancy against natural events and disasters (lightning strikes, tornadoes, flooding, etc.).

Region

An AWS Region corresponds to a geographical area that contains multiple AZs (typically 3). AWS offers more than 20 geographical Regions across the globe.
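The Region/AZ relationship is visible in the naming scheme: a standard AZ name is the Region code followed by a single letter, for example "eu-west-1a". The sketch below splits such names and groups zones by Region. The function names are hypothetical, and the pattern deliberately covers only this standard form (other zone types with longer names are rejected).

```python
import re
from typing import Iterable

# Standard AZ names look like "<region code><letter>", e.g. "eu-west-1a".
_AZ_PATTERN = re.compile(r"^(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)(?P<zone>[a-z])$")


def split_az_name(az_name: str) -> tuple[str, str]:
    """Split a standard AZ name into (region code, zone letter)."""
    match = _AZ_PATTERN.match(az_name)
    if match is None:
        raise ValueError(f"not a standard AZ name: {az_name!r}")
    return match.group("region"), match.group("zone")


def group_by_region(az_names: Iterable[str]) -> dict[str, list[str]]:
    """Map each Region code to the zone letters seen for it, in input order."""
    zones: dict[str, list[str]] = {}
    for name in az_names:
        region, zone = split_az_name(name)
        zones.setdefault(region, []).append(zone)
    return zones


if __name__ == "__main__":
    print(group_by_region(["us-east-1a", "us-east-1b", "eu-west-1a"]))
```

Keep in mind that AZ names are mapped per account, so the same name in two different AWS accounts does not necessarily point to the same physical zone.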

Replication Options

Cross-Region Replication

Although a single Region offers a great level of redundancy with multiple AZs, some risks still apply. Political instability, social unrest or military conflicts are some of the factors that may take down an entire Region.

To maximize availability and resilience, applications can use cross-Region replication. In this case, if an entire Region goes offline, the application can continue to serve its users from another Region. Latency might increase slightly for users who were previously served by the unavailable Region, but services won’t be disrupted.
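The failover behaviour described above can be sketched as a simple selection rule: serve each user from the healthy Region with the lowest latency. This is only an illustration of the idea. The function name, the latency figures and the health flags are made up, and in practice this decision is usually delegated to DNS-level or load-balancer-level routing rather than application code.

```python
from typing import Mapping, Optional


def pick_region(
    latency_ms: Mapping[str, float],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the healthy Region with the lowest latency, or None if all are down."""
    # Regions with no health information are treated as unavailable.
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


if __name__ == "__main__":
    # Hypothetical latencies measured from one user's location.
    latencies = {"eu-west-1": 20.0, "us-east-1": 90.0, "ap-southeast-2": 280.0}
    print(pick_region(latencies, {"eu-west-1": True, "us-east-1": True, "ap-southeast-2": True}))
    # Simulate the nearest Region going offline.
    print(pick_region(latencies, {"eu-west-1": False, "us-east-1": True, "ap-southeast-2": True}))
```

When the nearest Region is marked unhealthy, the user is served from the next closest one, which matches the "slightly higher latency, no disruption" trade-off.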

Some services provide an easy way to implement cross-Region replication, such as DynamoDB Global Tables and S3 Replication, while others require developers to implement their own logic.
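As an illustration of the "easy" case, the sketch below builds the kind of configuration that S3 Replication expects when set up through boto3's `put_bucket_replication` call. The helper name, rule ID, role ARN and bucket names are placeholders, and the rule shown is a minimal assumption rather than a recommended setup. Both buckets also need versioning enabled before replication can be configured.

```python
def replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a minimal one-rule S3 replication configuration (illustrative)."""
    return {
        # IAM role that S3 assumes to copy objects (placeholder ARN expected).
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                # Empty prefix: intended to select every object in the bucket.
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


if __name__ == "__main__":
    config = replication_config(
        "arn:aws:iam::123456789012:role/example-replication-role",
        "example-destination-bucket",
    )
    print(config)
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("s3").put_bucket_replication(
    #       Bucket="example-source-bucket",
    #       ReplicationConfiguration=config,
    #   )
```

The destination bucket would normally live in a different Region from the source bucket for this to count as cross-Region replication.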

Multi-AZ Replication

Managed services usually provide multi-AZ replication by default. This is the case for all serverless services, such as Lambda, DynamoDB, and S3.

Not all AWS services provide multi-AZ redundancy automatically, though. In some of them, the feature can be enabled relatively easily. This is the case for Relational Database Service (RDS) instances and File Systems, for example. There are tutorials for other services, such as Elastic Compute Cloud (EC2).
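For RDS, enabling the feature on an existing instance comes down to a single API call. The sketch below wraps boto3's `modify_db_instance` in a small helper. The helper name and the instance identifier are hypothetical, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def enable_multi_az(rds_client, db_instance_id: str, apply_immediately: bool = False) -> dict:
    """Ask RDS to convert an existing DB instance to a Multi-AZ deployment.

    `rds_client` is expected to behave like `boto3.client("rds")`.
    With apply_immediately=False the change waits for the next
    maintenance window.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        MultiAZ=True,
        ApplyImmediately=apply_immediately,
    )


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   enable_multi_az(boto3.client("rds"), "example-db-instance")
```

Passing the client in, rather than creating it inside the function, is a design choice that keeps the helper easy to test with a stand-in object.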

Controlling Multi-AZ

For compute workloads running on EC2, AWS offers partition placement groups. They allow developers to control which services must run in a single AZ, as well as how services are distributed inside a single data center. Each partition is placed on its own set of racks, so a hardware failure affects only one partition. Large distributed and replicated services such as Kafka, Hadoop and HBase may benefit from this feature.

Cluster placement groups keep multiple EC2 instances close together to reduce network latency, as typically required by High-Performance Computing (HPC) workloads.

Spread placement groups distribute critical instances across different server racks, reducing the exposure to correlated failures.
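The three strategies map directly onto the `Strategy` parameter of the EC2 `create_placement_group` API. The sketch below only builds and validates the request arguments. The helper name and group names are hypothetical, and the validation is a simplified assumption (it does not check AWS quotas).

```python
from typing import Optional

VALID_STRATEGIES = ("cluster", "partition", "spread")


def placement_group_request(
    name: str,
    strategy: str,
    partition_count: Optional[int] = None,
) -> dict:
    """Build keyword arguments for an EC2 create_placement_group call."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    request = {"GroupName": name, "Strategy": strategy}
    if partition_count is not None:
        # A partition count only makes sense for the partition strategy.
        if strategy != "partition":
            raise ValueError("partition_count requires the 'partition' strategy")
        if partition_count < 1:
            raise ValueError("partition_count must be at least 1")
        request["PartitionCount"] = partition_count
    return request


if __name__ == "__main__":
    print(placement_group_request("example-hpc", "cluster"))
    print(placement_group_request("example-kafka", "partition", partition_count=3))
    print(placement_group_request("example-critical", "spread"))
    # With boto3 installed and credentials configured, a request could be sent as:
    #   boto3.client("ec2").create_placement_group(**request)
```

Instances are then launched into the group by referencing its name, and the chosen strategy determines how EC2 places them on the underlying hardware.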

