AWS Well-Architected Framework in Serverless: Reliability Pillar

This is part three of the “Well-Architected in Serverless” series where we decipher the AWS Well-Architected Framework (WAF) pillars and translate them into real-life actions. In this article, we will focus on the AWS WAF Reliability (REL) pillar.

Read the other posts in this series:

Part 1: Security Pillar

Part 2: Operational Excellence Pillar

Part 4: Cost Optimization Pillar

Part 5: Performance Efficiency Pillar

The Reliability Pillar

Unlike the Operational Excellence (OPS) and Security (SEC) pillars, the REL pillar is tradable: you can give up some of its goals to get more out of the remaining two pillars, Cost Optimization (COST) and Performance Efficiency (PERF).

Trading means that you don’t have to go all-in on every one of these pillars. Maybe you want to save money, so you skip global replication, which would make your system more reliable but also more expensive.

The same goes for the PERF pillar. If you want to be as reliable as possible, you might wait for an eventually consistent data store to do its thing before you respond to a client. That makes your system more reliable if something crashes, but also slower.

Three parts make up the REL pillar: foundations, change management, and failure management.

Foundations

The foundation of the REL pillar is knowing the quotas and constraints of the services you use. If you make a system unreliable because of a bug, that’s one thing, but if you didn’t know that a service is eventually consistent, you have a greater problem. The same is true if you forget that a service only accepts a specific number of requests per time frame.

Luckily, for AWS services, some tools can help with that. The Service Quotas console gives you insight into the quotas of each AWS service and can even notify you when your systems hit the limits of the services they’re using. AWS Trusted Advisor can also help you find out how much of a service quota you have already used.
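To make the idea of watching quotas concrete, here is a minimal sketch in plain Python. It is not how Service Quotas or Trusted Advisor work internally; the `QuotaStatus` class, the `quotas_near_limit` helper, the 80% threshold, and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaStatus:
    name: str
    used: float
    limit: float

    @property
    def utilization(self) -> float:
        # Fraction of the quota already consumed.
        return self.used / self.limit


def quotas_near_limit(statuses, threshold=0.8):
    """Return the quotas whose utilization is at or above the threshold."""
    return [s for s in statuses if s.utilization >= threshold]


# Hypothetical numbers for illustration only; real values would come from
# Service Quotas and your own usage metrics.
statuses = [
    QuotaStatus("lambda-concurrent-executions", used=850, limit=1000),
    QuotaStatus("api-gateway-requests-per-second", used=2000, limit=10000),
]

for status in quotas_near_limit(statuses):
    print(f"{status.name}: {status.utilization:.0%} of quota used")
```

Alerting at a threshold below 100% leaves time to request a quota increase or change the design before requests start failing.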

Managing Foundations

Dashbird integrates with the majority of the popular managed services in AWS to provide alerts and warnings when the usage of a service reaches a limit, such as timeouts, throttling, or running out of memory. In addition, developers can implement custom alarms and policies for use cases specific to their environment. The platform also visualizes service limits, so you can easily grasp the state of resource usage and understand the capacity of, and long-term threats to, the system.

Change Management

Change management is the anticipation of changes to your serverless system. This covers both how customers change the way they use your system and how you change the system’s code.

Traffic spikes are one example; a serverless system usually handles them on its own because it scales out automatically. Change management also includes the new features you want to deploy and the migrations you run when you change databases.

Staying on top of Change Management

Dashbird gives engineering teams confidence and the ability to iterate quickly. A large factor in this is the reduction of the time it takes to detect and respond to incidents. Another topic that Dashbird helps with is getting real-time visibility into the inner workings of serverless applications. Developers can use this functionality to monitor the service at critical times and measure the performance, cost, and quality impact of system changes.

Failure Management

Failure management is about what you do when things fail, and they will fail because nothing is forever. Serverless services, especially managed ones, provide much of the failure management for low-level issues out of the box, but this doesn’t mean that everything will keep working indefinitely.

Serverless systems are often event-based and rely heavily on asynchronous communication. In essence, this means that if you send a request to an API, it might not respond with the actual result but just tell you that it accepted your request and will now start to process it. If something then goes wrong along the way, the client that sent the request has no direct way of finding out.

To make sure nothing gets lost, you need to keep track of your events. Implement retry logic for your Lambda functions with dead-letter queues and log what went wrong.
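The retry-and-dead-letter pattern can be sketched in a few lines of plain Python. This is an in-memory illustration only, not how Lambda or SQS implement it; `process_with_retries`, the example `handler`, and the event shape are hypothetical, and in AWS the dead-letter queue would typically be an SQS queue or SNS topic rather than a list.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("events")


def process_with_retries(events, handler, max_attempts=3):
    """Run handler on each event; park events that keep failing in a DLQ list."""
    dead_letter_queue = []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                break  # success, move on to the next event
            except Exception as error:
                # Log every failure so the incident can be reconstructed later.
                logger.warning("event %s failed on attempt %d: %s",
                               event["id"], attempt, error)
        else:
            # The inner loop never hit `break`: all attempts failed.
            dead_letter_queue.append(event)
    return dead_letter_queue


def handler(event):
    # Hypothetical handler: rejects events without a payload.
    if event.get("payload") is None:
        raise ValueError("missing payload")


events = [{"id": 1, "payload": "ok"}, {"id": 2, "payload": None}]
failed = process_with_retries(events, handler)
print(failed)  # [{'id': 2, 'payload': None}]
```

The event that keeps failing is not lost: it ends up in the dead-letter list, where it can be inspected and replayed once the cause is fixed.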

Staying on top of failures

Dashbird helps you monitor SQS queues and provides functionality to set alarms for DLQs.

Maintaining reliability

A serverless developer needs a tool that automatically monitors for known and unknown failures across all managed services. The Dashbird platform provides engineering organisations with end-to-end visibility into all monitoring data across cloud-native services (logs, metrics, and traces in one place), combined with automatic failure detection that identifies known and unknown failures as soon as they happen.

SAL Questions for the Reliability Pillar

The Serverless Application Lens (SAL) contains two serverless-related questions about the REL pillar. Let’s look into them.

REL 1: How are you regulating inbound request rates?

Your serverless applications will have some kind of entry point, a front door, so to say, where all external data comes into your system. AWS offers different services for this; one is API Gateway, and another is AppSync.

These services, like all the other services you’ll be using downstream, have their limits. Relying on these limits alone can lead to reliability issues. If your system gets sufficiently complex, it’s not easy to calculate which service will fold first.

That’s why you should set up adequate throttling for API Gateway and AppSync. API Gateway also lets you define usage plans for issued API keys; that way, you can clearly communicate how much a customer can expect from your system.
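API Gateway describes its throttling in terms of a steady-state rate and a burst, which is the token bucket model. The following is a minimal sketch of that model in plain Python, not API Gateway’s actual implementation; the `TokenBucket` class and the numbers are assumptions for illustration, and time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at burst.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a gateway would answer with HTTP 429 here


bucket = TokenBucket(rate=2, burst=3)
# Five requests arrive at t=0: the burst of three passes, the rest is throttled.
print([bucket.allow(now=0.0) for _ in range(5)])  # [True, True, True, False, False]
# One second later, two tokens have been refilled.
print([bucket.allow(now=1.0) for _ in range(3)])  # [True, True, False]
```

The burst value absorbs short spikes, while the rate caps sustained load, so downstream services see a bounded request rate no matter how clients behave.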

It’s also crucial to use Lambda’s concurrency controls because Lambda can scale faster than most services. If you integrate with a non-serverless service and your Lambda function suddenly scales up to thousands of concurrent invocations, the effect on that service will be like a distributed denial-of-service (DDoS) attack.
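A rough way to pick a concurrency limit is to compare what the workload needs with what the downstream service can take. The sketch below estimates concurrency as request rate times average duration and derives a cap from a connection limit. The function names, the 80% headroom, and the workload numbers are assumptions for illustration, not AWS recommendations.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Estimate concurrent executions as request rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def concurrency_cap(downstream_max_connections, connections_per_invocation=1,
                    headroom_percent=80):
    """Largest concurrency that keeps the downstream below its connection limit.

    `headroom_percent` leaves part of the limit to other clients (an assumption).
    """
    usable = downstream_max_connections * headroom_percent
    return usable // (100 * connections_per_invocation)


# Hypothetical workload: 500 requests/s at 0.25 s each,
# talking to a database that allows 100 connections.
needed = required_concurrency(500, 0.25)
cap = concurrency_cap(100)
print(needed, cap)  # 125 80
```

Here the workload wants more concurrency than the database can serve, so the cap is the value to set as the function’s reserved concurrency, with a queue in front of the function to buffer the excess.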

REL 2: How are you building resiliency into your serverless application?

The main lever for increasing resiliency is decoupling logic and responsibility between resources and designing the system to handle failures on its own. In most use cases, as much as possible should be made asynchronous. This is a great post outlining the design principles for building resilience into serverless applications.

In addition to system design, it’s important to have tools and processes to measure and track system activity and to get notified of unexpected events within reasonable time windows. No system will be 100% resilient and able to recover from any failure. Engineering teams building on serverless should test their system against different failure scenarios, make continuous improvements and modifications, constantly learn from past incidents, and strive to develop the best possible processes and tooling for responding to incidents.

Summary

The REL pillar is all about designing your system in a way that won’t break down. Learn about service quotas and limits. Sometimes a service sounds like just what you need until you read that it can’t handle more than 1000 requests per second. Throttle your system’s entry points so clients can’t overload downstream services, and give customers clear answers on what they can expect from your system.

Also, keep everything monitored. The inherent asynchronicity of serverless systems makes them less straightforward to debug when something has gone wrong; this means you need a way to get notified when things go out of bounds so you can react quickly. This also means you need logging data to evaluate what has gone wrong after an incident.

If you’re still curious to learn more about the WAF, how it came about, and some more best practices for each of the five pillars, you can watch our recent session with Tim Robinson (AWS).


Further reading:

Exploring Lambda limits

Why your Lambda functions may be doomed to fail

How systems can be reliable and the importance to cloud applications
