AWS Well-Architected Framework in Serverless: Reliability Pillar

Kay Plößer

November 25th, 2020

This is part three of the “Well-Architected in Serverless” series where we decipher the AWS Well-Architected Framework (WAF) pillars and translate them into real-life actions. In this article, we will focus on the AWS WAF Reliability (REL) pillar.

Read the other posts in this series:

Part 1: Security Pillar
Part 2: Operational Excellence Pillar
Part 4: Cost Optimization Pillar
Part 5: Performance Efficiency Pillar

The Reliability Pillar

Unlike the Operational Excellence (OPS) and Security (SEC) pillars, the REL pillar is tradable. You can trade its goals for getting more out of the remaining two pillars: the Cost Optimization (COST) and the Performance Efficiency (PERF) pillars.

Trading means that you don’t have to go all-in in every one of these pillars. Maybe you want to save money, so you don’t do global replications, which would make your system more reliable but also more expensive.

The same goes for the PERF pillar; maybe you want to be as reliable as possible, this can imply that you wait for eventually consistent data storage to do its thing before you respond to a client, which makes your system more reliable in terms of a crash, but also slower in terms of performance.

The three parts that make up the REL pillar:

Foundations

The foundation of the REL pillar is the knowledge of quotas and constraints of the services you use. If you make a system unreliable because of a bug, that’s one thing, but if you didn’t know that a service is eventually consistent, you have a greater problem. This is also true for forgetting that you can only send a specific amount of requests per time frame to a service.

Luckily, for AWS services, some tools can help with that. The AWS Service Quotas Console can give you insights about each AWS service and even notify you when your systems hit the limits of the services they’re using. The AWS Trusted Advisor could also help to find out how much of a service you already used.

Managing Foundations

Dashbird integrates with the majority of the popular managed services in AWS to provide alerting and warning notifications for when the usage of a service reaches any sort of limits, such as timeouts, throttling, out of memory, and the like. In addition, developers can implement custom alarms and policies for use cases specific to their environment. Moreover, the platform visualizes the limits of services to grasp the state of resource usage easily and understand the capacity and long-term threats to the system.

Change Management

Change management is the anticipation of changes to your serverless system. This means how customers are changing their system’s usage patterns and how you change your system in terms of code.

Examples of this are traffic spikes, which are usually handled automatically by a serverless system because it can scale out automatically. Still, change management also includes new features you want to deploy or migrations when you change databases.

Staying on top of Change Management

Dashbird gives engineering teams confidence and the ability to iterate quickly. A large factor in this is the reduction of the time it takes to detect and respond to incidents. Another topic that Dashbird helps with is getting real-time visibility into the inner workings of serverless applications. Developers can use this functionality to monitor the service at critical times and measure the performance, cost, and quality impact of system changes.

Failure Management

Failure management is about what you do when things fail, and they will fail because nothing is forever. Serverless services, especially managed ones, provide much of the failure management, for low-level issues, out-of-the-box, but this doesn’t mean that everything will keep working indefinitely.

Serverless systems are often event-based and utilize asynchronous communication rather heavily. In essence, this means if you send a request to an API, it might not respond with the actual result but just tells you that it accepted your request and will now start to process it. Now, if something goes wrong along the way, you have no direct way of finding out in the client that sent the request.

To make sure nothing gets lost, you need to keep track of your events. Implement retry logic for your Lambda functions with dead-letter queues and log what went wrong.

Staying on top of failures

Dashbird helps you monitor SQS queues and provides functionality to set alarms for DLQs.

Maintaining reliability

A serverless developer needs a tool that automatically monitors for known and unknown failures across all managed services. Dashbird platform provides engineering organisations with end-to-end visibility into all monitoring data across cloud-native services (logs, metrics and traces in one place) combined with an automatic failure detection functionality, identifying know and unknown failures as soon as they happen.

SAL Questions for the Reliability Pillar

There are two serverless related questions about the REL pillar in the SAL. Let’s look into them.

REL 1: How are you regulating inbound request rates?

Your serverless applications will have some kind of entry point, a front door, so to say, where all external data comes into your system. AWS offers different services to facilitate this, one is API Gateway, and another one is AppSync.

These services, like all the other services you’ll be using downstream, have their limits. It can lead to reliability issues if you rely on these limits alone. If your system gets sufficiently complex, it’s not easy to calculate what service will fold first.

That’s why you should set up adequate throttling for API Gateway and AppSync. These services also allow defining usage plans for issued API keys; that way, you can clearly communicate how much a customer can expect from your system.

It’s also crucial to use concurrency controls of Lambda because it can scale faster than most services. If you integrate with a non-serverless service and suddenly your Lambda function scales up to thousands of concurrent invocations, it will be like a distributed denial-of-service (DDoS) attack.

REL 2: How are you building resiliency into your serverless application?

The main lever for increasing resiliency is decoupling of logic and responsibility between resources and designing the system to handle failures on its own. In most use cases, as much as possible should be made asynchronous. This is a great post outlining the design principles for building resilience into serverless applications .

In addition to system design, it’s important to have tools and processes to measure and track system activity and to get notified on unexpected events in reasonable time windows. No system will be 100% resilient and have the ability to recover from any failure. Engineering teams building on serverless should be responsible for testing their system with different failure scenarios and make continuous improvements and modifications, constantly learn from past incidents and thrive to develop the most optimal processes and tooling to respond to incidents.

Summary

The REL pillar is all about designing your system in a way that won’t break down. Learn about the services quotas and limits. Sometimes a service sounds like just what you need before reading that it can’t handle more than 1000 requests per second. Throttle your systems entry-points so clients can’t overload downstream services and give customers clear answers on what they can expect from your system.

Also, keep everything monitored. The inherent asynchronicity of serverless systems makes them less straightforward to debug when something has gone wrong; this means you need a way to get notified when things go out of bounds so you can react quickly. This also means you need logging data to evaluate what has gone wrong after an incident.

If you’re still curious to learn more about the WAF, how it came about and some more best practices for each of the five pillars, you can watch our recent session with Tim Robinson (AWS) here:

Read our blog

Making serverless applications reliable and bug-free

In this guide, we’ll talk about common problems developers face with serverless applications on AWS and share some practical strategies to help you monitor and manage your applications more effectively.

ANNOUNCEMENT: new pricing and the end of free tier

Today we are announcing a new, updated pricing model and the end of free tier for Dashbird.

4 Tips for AWS Lambda Performance Optimization

In this article, we’re covering 4 tips for AWS Lambda optimization for production. Covering error handling, memory provisioning, monitoring, performance, and more.

Made by developers for developers

Dashbird was born out of our own need for an enhanced serverless debugging and monitoring tool, and we take pride in being developers.

Get started free or learn more

What our customers say

Dashbird gives us a simple and easy to use tool to have peace of mind and know that all of our Serverless functions are running correctly. We are instantly aware now if there’s a problem. We love the fact that we have enough information in the Slack notification itself to take appropriate action immediately and know exactly where the issue occurred.

Thanks to Dashbird the time to discover the occurrence of an issue reduced from 2-4 hours to a matter of seconds or minutes. It also means that hundreds of dollars are saved every month.

Great onboarding: it takes just a couple of minutes to connect an AWS account to an organization in Dashbird. The UI is clean and gives a good overview of what is happening with the Lambdas and API Gateways in the account.

I mean, it is just extremely time-saving. It’s so efficient! I don’t think it’s an exaggeration or dramatic to say that Dashbird has been a lifesaver for us.

Dashbird provides an easier interface to monitor and debug problems with our Lambdas. Relevant logs are simple to find and view. Dashbird’s support has been good, and they take product suggestions with grace.

Great UI. Easy to navigate through CloudWatch logs. Simple setup.

Dashbird helped us refine the size of our Lambdas, resulting in significantly reduced costs. We have Dashbird alert us in seconds via email when any of our functions behaves abnormally. Their app immediately makes the cause and severity of errors obvious.