Building fault-tolerant serverless functions with AWS Lambda
Resilience is the ability of a cloud system to anticipate and handle faults without disrupting or discontinuing services to its users.
Lambda offers a high level of resilience because functions are replicated across multiple Availability Zones (AZs) by default. Each function can run from one or more AZs within an AWS Region. Even if multiple machines or an entire data center fail, AWS can continue serving invocations to a Lambda function.
Cross-region replication, function versioning and retry behavior are other features provided by Lambda that increase application reliability.
Function versioning enables multiple consumers of a single function to upgrade to newer versions when it best suits each of them. This reduces the likelihood of the service disruption that can occur when a function is upgraded for all consumers at the same time. Versioning also makes blue/green and rolling deployments possible for Lambda functions.
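A rolling deployment on top of versions can be sketched with a weighted alias. The example below is a minimal sketch, not a definitive implementation: the helper names `shift_traffic` and `promote`, the alias name and the version numbers are all hypothetical, and `lambda_client` is assumed to be a boto3 Lambda client (for example `boto3.client("lambda")`). It relies on the `update_alias` call and its `RoutingConfig` parameter.

```python
def shift_traffic(lambda_client, function_name, alias, stable_version,
                  new_version, new_weight):
    """Send a fraction of the alias traffic to new_version (hypothetical helper).

    The alias keeps pointing at stable_version; new_weight (0.0 to 1.0) is the
    share of invocations routed to new_version instead.
    """
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0 and 1")
    return lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=stable_version,
        RoutingConfig={"AdditionalVersionWeights": {new_version: new_weight}},
    )


def promote(lambda_client, function_name, alias, new_version):
    """Point the alias fully at new_version and clear the weighted routing."""
    return lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=new_version,
        RoutingConfig={"AdditionalVersionWeights": {}},
    )


# Example usage (assumed names, requires AWS credentials):
#   import boto3
#   client = boto3.client("lambda")
#   shift_traffic(client, "my-function", "live", "1", "2", 0.1)  # 10% to v2
#   promote(client, "my-function", "live", "2")                  # 100% to v2
```

Consumers that invoke the alias rather than a fixed version follow the shift automatically, while consumers pinned to a version number stay on it until they choose to upgrade.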
When a function invocation fails, Lambda may retry it several times until the execution succeeds or the retry attempts are exhausted. A retry is simply invoking the same function again with the same event payload.
This behavior enables fault tolerance in Lambda applications, since it prevents transient faults from causing a request to fail permanently. For more information, please read the page about Lambda retry behavior.
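Because a retry delivers the same event payload again, a handler may run more than once for one logical request. A common way to cope with that is to make the handler idempotent. The sketch below is illustrative only: the `request_id` field is an assumed part of the event, `process` is a hypothetical stand-in for the real work, and the in-memory dictionary would need to be replaced by a durable store (for example a database table) in a real function, since memory is not shared across execution environments.

```python
# In-memory stand-ins for illustration only (assumption: a real function
# would keep this state in a durable store shared by all instances).
_results = {}
side_effects = []


def process(event):
    """Hypothetical business logic with a side effect we must not repeat."""
    side_effects.append(event["request_id"])
    return {"request_id": event["request_id"], "status": "done"}


def handler(event, context=None):
    """Lambda-style handler that returns the stored result on a retry."""
    key = event["request_id"]  # assumed unique per logical request
    if key in _results:
        # Same payload delivered again: skip the work, reuse the outcome.
        return _results[key]
    result = process(event)
    _results[key] = result
    return result
```

Calling `handler` twice with the same `request_id` runs `process` only once, so a retried invocation does not duplicate the side effect.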
Although multi-AZ replication is enabled by default for all Lambda functions, cross-region replication must be implemented manually by developers. This can be accomplished by combining API Gateway regional endpoints with a Route 53 active-active setup.
Below is an outline of the implementation:
For a detailed walk-through, please check this AWS blog post.
Lambda can scale very quickly to accommodate hundreds or thousands of concurrent requests to multiple functions. To protect the entire platform from abuse and DoS attacks, there is a limit to how much it can scale. The default value is 1,000 concurrent requests (burstable to 3,000 to cope with short peaks).
It is very common that applications will rely on multiple functions. If a single function scales up to 1,000 concurrent requests, it will prevent all others from being executed. To avoid this type of scenario, Lambda provides Reserved Concurrency. Read more about it in the Scalability and Concurrency page.
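Reserving concurrency per function could look roughly like the sketch below. It assumes `lambda_client` is a boto3 Lambda client and uses its `put_function_concurrency` call. The helper name `reserve_concurrency`, the function names in the usage comment and the `min_unreserved` safety margin of 100 are assumptions made for illustration; check the limits that apply to your own account.

```python
def reserve_concurrency(lambda_client, reservations, account_limit=1000,
                        min_unreserved=100):
    """Apply per-function reserved concurrency (hypothetical helper).

    reservations maps function name -> reserved concurrent executions.
    The total is validated first so the shared, unreserved pool never
    drops below min_unreserved (an assumed safety margin).
    """
    total = sum(reservations.values())
    if total > account_limit - min_unreserved:
        raise ValueError(
            f"reserving {total} leaves fewer than {min_unreserved} "
            f"unreserved executions out of {account_limit}"
        )
    for name, limit in reservations.items():
        lambda_client.put_function_concurrency(
            FunctionName=name,
            ReservedConcurrentExecutions=limit,
        )
    # Concurrency still available to functions without a reservation.
    return account_limit - total


# Example usage (assumed function names, requires AWS credentials):
#   import boto3
#   client = boto3.client("lambda")
#   reserve_concurrency(client, {"checkout": 300, "thumbnails": 100})
```

A reservation works in both directions: it guarantees that share of concurrency to the function and also caps the function at that value, so a single busy function can no longer consume the whole account limit.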
When asynchronous invocations fail, Lambda may retry the request multiple times. When the last retry still fails, Lambda can be configured to send the request payload to a Dead-Letter Queue (DLQ). This queue can store messages for several days, which allows developers to inspect failed requests, possibly fix the causes of failure and replay them.
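Replaying failed requests from a DLQ might be sketched as follows. This assumes the DLQ is an SQS queue whose message body holds the original event payload, and that `sqs_client` and `lambda_client` are boto3 clients. The helper name `replay_dlq`, the queue URL and the function name are hypothetical. The calls used are `receive_message`, `delete_message` and `invoke`.

```python
def replay_dlq(sqs_client, lambda_client, queue_url, function_name,
               max_messages=10):
    """Re-invoke a function with payloads read from its DLQ (hypothetical helper).

    Reads one batch of up to max_messages messages, re-invokes the function
    asynchronously for each, and deletes a message only after its invoke
    call has returned without raising.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=1,
    )
    replayed = 0
    for message in response.get("Messages", []):
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous, like the original call
            Payload=message["Body"].encode("utf-8"),
        )
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        replayed += 1
    return replayed


# Example usage (assumed names, requires AWS credentials):
#   import boto3
#   replay_dlq(boto3.client("sqs"), boto3.client("lambda"),
#              "https://sqs.example/queue-url", "my-function")
```

Deleting only after a successful `invoke` call means a crash in the middle of a replay leaves the remaining messages in the queue, at the cost of a possible duplicate invocation for the message being processed at that moment.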
Note: for functions running inside a VPC, AWS is only able to provide multi-AZ replication if the VPC has subnets in different AZs. More info in Configuring a Lambda Function to Access Resources in a VPC.