Why Serverless Apps Fail and How to Design Resilient Architectures

We’ve been monitoring hundreds of thousands of serverless backend components for more than two years at Dashbird. In our experience, serverless infrastructure failures boil down to:

  • Throughput and concurrency limitations;
  • Increased latency;
  • Timeout errors.

These isolated faults become causes of failure due to dependencies in our cloud architectures (ref. Difference of Fault vs. Failure). If a serverless Lambda function relies on a database that is under stress, the entire API may start returning 5XX errors.

You may think this is just a fact of life, but we can dodge or at least mitigate these failures in many cases.

Serverless is not a magical silver bullet. These services have their limitations, especially when it comes to scalability. AWS Lambda, for example, can only increase its concurrency by a certain amount per minute. Throw in 10,000 concurrent requests out of thin air and it will throttle.
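To make the throttling concrete, here is a toy model of a concurrency ramp. The starting capacity and the per-minute increase are made-up round numbers, not AWS's actual quotas (those depend on region and have changed over time). Only the shape of the result matters.

```python
def throttled_per_minute(demand, initial_capacity, ramp_per_minute, minutes):
    """Return how many concurrent requests get throttled in each minute.

    All parameters are hypothetical; real Lambda quotas differ.
    """
    throttled = []
    for minute in range(minutes):
        # Capacity grows by a fixed step each minute (assumed linear ramp).
        capacity = initial_capacity + ramp_per_minute * minute
        throttled.append(max(0, demand - capacity))
    return throttled


# 10,000 concurrent requests arrive at once against an assumed starting
# capacity of 3,000 that grows by 500 every minute.
result = throttled_per_minute(10_000, 3_000, 500, 5)
print(result)  # [7000, 6500, 6000, 5500, 5000]
```

Even after five minutes of ramping up, half of the demand in this sketch is still being rejected.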

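A typical architecture looks like this: API Endpoint 1 invokes Lambda function 1, API Endpoint 2 invokes Lambda function 2, and both functions rely on the same RDS instance.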

It usually works well at low scale. Under heavier load, a single component’s fault can bring the whole implementation to its knees.

Consider this scenario: due to market reasons, API Endpoint 1 starts receiving an unusually high number of requests. Your clients are generating more data and your backend needs to store it in the RDS instance. Relational databases usually don’t scale linearly with I/O load, so we can expect an increase in query latency during this peak in demand. API Endpoint 1 or Lambda function 1 will start timing out at some point due to the database delays.
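One way to picture this non-linear latency growth is a textbook single-server queue (M/M/1). This model is our assumption for illustration, not a measurement of RDS, and the service rate and timeout below are hypothetical values.

```python
def mean_query_latency(arrival_rate, service_rate):
    """Mean time a query spends in an M/M/1 queue, in seconds.

    Formula: 1 / (service_rate - arrival_rate). The database is treated
    as a single server, which is a deliberate simplification.
    """
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0   # queries/second the database can complete (assumed)
TIMEOUT_SECONDS = 0.5  # time budget of the calling function (assumed)

for load in (50, 90, 95, 99):
    latency = mean_query_latency(load, SERVICE_RATE)
    status = "timeout" if latency > TIMEOUT_SECONDS else "ok"
    print(f"{load} queries/s -> {latency:.2f} s ({status})")
```

In this model, going from 50 to 90 queries per second multiplies mean latency by five (0.02 s to 0.1 s), and at 99 queries per second it reaches a full second, well past the assumed timeout.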

Another possible fault scenario is throttling of Lambda function 1 due to a rapid increase in concurrency.

Not only will API Endpoint 1 become unavailable to clients, but so will the second endpoint. In the first scenario, Endpoint 2 also relies on the same RDS instance. In the second scenario, Lambda function 1 will consume the entire concurrency limit of your AWS account, causing Lambda function 2 to throttle requests as well.
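The second scenario can be sketched as two functions drawing from one shared pool. The account limit, the demand figures, and the first-come-first-served allocation are all simplifying assumptions; the function names are just labels from the example above.

```python
def allocate_concurrency(account_limit, demands):
    """Hand out a shared concurrency pool in the order functions ask for it.

    `demands` maps a function name to the concurrent executions it wants.
    Returns two dicts: what each function was granted and what was throttled.
    """
    remaining = account_limit
    granted, throttled = {}, {}
    for name, wanted in demands.items():
        granted[name] = min(wanted, remaining)
        throttled[name] = wanted - granted[name]
        remaining -= granted[name]
    return granted, throttled


# Function 1 spikes to 5,000 while function 2 only needs 50, against an
# assumed account-wide limit of 1,000.
granted, throttled = allocate_concurrency(
    1_000, {"lambda_function_1": 5_000, "lambda_function_2": 50}
)
print(granted)    # {'lambda_function_1': 1000, 'lambda_function_2': 0}
print(throttled)  # {'lambda_function_1': 4000, 'lambda_function_2': 50}
```

Function 2 asked for very little, yet every one of its requests is throttled because function 1 got to the pool first.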

We can avoid this by decoupling API Endpoint 1 and Lambda function 1. In this example, our clients are only sending information that needs to be stored; no processing or customized response is needed. Here is an alternative architecture:

Instead of sending requests directly from API Endpoint 1 to Lambda function 1, we first store all requests in a highly scalable SQS queue. The API can immediately return a 200 response to clients. Lambda function 1 will later pull messages from the queue at a rate that is manageable for its own concurrency limits and the RDS instance’s capabilities.
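Here is a minimal sketch of both sides of the queue. `enqueue_request`, `make_consumer_handler`, `FakeSqs`, and the queue URL are hypothetical names of ours, and we don't assume anything about how the endpoint actually writes to SQS. In a real deployment you would pass in `boto3.client("sqs")`, whose `send_message` takes the same `QueueUrl` and `MessageBody` keyword arguments; the fake client only exists so the example runs locally.

```python
import json


def enqueue_request(sqs_client, queue_url, body):
    """Store the raw request body in the queue and acknowledge right away."""
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
    return {"statusCode": 200, "body": json.dumps({"status": "queued"})}


def make_consumer_handler(store_record):
    """Build the handler for Lambda function 1, fed by an SQS batch event."""
    def handler(event, context):
        # SQS-triggered invocations deliver messages under event["Records"].
        for record in event["Records"]:
            store_record(json.loads(record["body"]))
        return {"stored": len(event["Records"])}
    return handler


class FakeSqs:
    """In-memory stand-in for the SQS client (hypothetical, for local runs)."""

    def __init__(self):
        self.messages = []

    def send_message(self, QueueUrl, MessageBody):
        self.messages.append(MessageBody)
        return {"MessageId": str(len(self.messages))}


fake_sqs = FakeSqs()
payload = json.dumps({"client": "a", "value": 1})
response = enqueue_request(fake_sqs, "https://example.invalid/queue", payload)

stored = []  # stands in for the RDS write
consume = make_consumer_handler(stored.append)
summary = consume({"Records": [{"body": m} for m in fake_sqs.messages]}, None)
print(response["statusCode"], summary, stored)
```

The client gets its 200 as soon as the message is stored in the queue; the database write happens later, whenever the consumer picks up the batch.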

With this modification, the potential for widespread failure is minimized because the queue absorbs peaks in demand. SQS standard queues can handle nearly unlimited throughput. At the same time, all components serving Endpoint 2 can continue to work normally, since data consumption by Lambda function 1 is smoothed out.
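The smoothing effect is easy to see in a small simulation. The arrival pattern and the drain rate below are invented numbers; the point is that the consumer never sees more than its cap, while the queue depth rises and then drains.

```python
def simulate_backlog(arrivals_per_tick, drain_per_tick):
    """Track queue depth when arrivals are bursty but consumption is capped.

    Returns (depths, consumed): the backlog left after each tick and the
    number of messages the consumer took in that tick.
    """
    backlog = 0
    depths, consumed = [], []
    for arrivals in arrivals_per_tick:
        backlog += arrivals
        taken = min(backlog, drain_per_tick)
        backlog -= taken
        depths.append(backlog)
        consumed.append(taken)
    return depths, consumed


# A spike of 900 messages per tick hits a consumer capped at 300 per tick.
arrivals = [100, 100, 900, 900, 100, 100, 100, 100]
depths, consumed = simulate_backlog(arrivals, drain_per_tick=300)
print(consumed)  # [100, 100, 300, 300, 300, 300, 300, 300]
print(depths)    # [0, 0, 600, 1200, 1000, 800, 600, 400]
```

The database behind the consumer sees a steady 300 messages per tick at most, and the backlog shrinks once the spike has passed.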

This is a simplified example; there are several other aspects to consider in terms of potential failure points and architectural improvements. We hosted a webinar to cover these topics in much more depth. You can rewatch it here to find out more.
