Serverless Well-Architected – Reconciling Resilience and Cost-Optimization

Taavi Rehemägi

January 7th, 2021

We recently wrote about the reasons why serverless apps fail and explored some ideas to make architectures more resilient and scalable.

Some of these designs can become expensive if we ignore the financial impact of architectural decisions. With proper care, it is possible to achieve the same scalability and resiliency while keeping costs at a manageable level.

Being able to reconcile these two aspects of architectural decision-making will not only make you a better architect or developer, it will also make your services more valuable in the marketplace. Large projects in particular can save thousands to millions of dollars with simple cost optimizations.

In an illustrative example, we ended up with the following architecture idea. Since Endpoint 1 sometimes scales to a level higher than the underlying resources can handle, we placed a queue in front of Lambda function 1 to smooth demand peaks.

This architectural change (adding a Queue) will contribute to the Reliability pillar of the Well-Architected framework. Nonetheless, depending on the project, it might be undesirable from the perspective of the Cost-Optimization pillar.

Suppose this API is only used internally, with the purpose of decoupling services. Endpoint 1 accepts POST requests (write-only) and uses the simplified API Gateway model, “HTTP API”. The API and Lambda will cost $1.20 per million requests in total ($1.00 for HTTP API and $0.20 for Lambda invocations), not counting Lambda’s memory time and RDS I/O.

Let’s now analyze how SQS adds to this cost. Ideally, we want to provide a smooth experience to the client relying on Endpoint 1, so we will not require it to group requests in order to take advantage of SQS batching. Each incoming API request will translate into a new SQS request, adding another $0.40 per million API requests, a 33% increase in cost.

On top of that, Lambda function 1 will need to consume these messages in order to process them and store the results in the RDS database. Here SQS batching can be used to reduce costs. Let’s say Lambda polls SQS frequently and gets, on average, 5 messages per request. The added cost of consuming the messages from 1 million API requests is then $0.08 ($0.40 ÷ 5).

The cost for 1 million API requests has now jumped to $1.68, up from $1.20, a 40% increase! For a bill dominated by these request charges, picture a $100,000 AWS bill jumping to $140,000.

If we analyze closely how the application behaves, we will see that demand only exceeds capacity on two occasions throughout the day. If AWS Lambda and RDS are capable of coping with demand for most of the day, why would we pay the extra 40% for SQS all the time?
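The arithmetic above can be reproduced with a short sketch. It uses only the per-million prices quoted in this article and follows the article's simplified model (the Lambda invocation charge is kept unchanged, and Lambda memory time and RDS I/O are excluded). The function name and the `avg_batch_size` parameter are illustrative, not part of any AWS API.

```python
# Prices in USD per million requests, as quoted in the article.
HTTP_API_PER_MILLION = 1.00
LAMBDA_INVOKE_PER_MILLION = 0.20
SQS_PER_MILLION = 0.40


def cost_per_million(use_queue: bool, avg_batch_size: int = 5) -> float:
    """Request-level cost of 1 million API calls under the article's model.

    Excludes Lambda memory time and RDS I/O, as the article does.
    """
    cost = HTTP_API_PER_MILLION + LAMBDA_INVOKE_PER_MILLION
    if use_queue:
        # One SQS request per incoming API request (no client-side batching).
        cost += SQS_PER_MILLION
        # Lambda function 1 receives messages in batches of avg_batch_size.
        cost += SQS_PER_MILLION / avg_batch_size
    return round(cost, 2)


baseline = cost_per_million(use_queue=False)
queued = cost_per_million(use_queue=True)
increase_pct = round((queued - baseline) / baseline * 100)
print(f"direct: ${baseline}, with SQS: ${queued}, increase: {increase_pct}%")
```

Changing `avg_batch_size` shows how sensitive the consumption cost is to batching: with no batching at all (a batch size of 1), the queue would add $0.80 instead of $0.48 per million requests.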

One solution here would be to create three additional resources:

CloudWatch alarm
SNS topic to receive alarm state changes
Lambda to respond to alarm

With a CloudWatch alarm, we can detect when Lambda or the API starts failing due to increased demand, for example through concurrency or throughput errors. When the alarm changes state, it notifies the SNS topic, which in turn invokes another Lambda: function 3.

The role of function 3 is to turn SQS on or off depending on demand. During low demand, this Lambda temporarily detaches SQS from Endpoint 1 and routes API requests directly to Lambda function 1. When the alarm identifies concurrency errors, function 3 reestablishes SQS as the destination for API requests, shielding clients from backend faults.

A dead-letter queue could be configured for Lambda function 1 so that no request is lost, even in the interval between the start of a demand peak and SQS being switched back on.
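As a rough sketch of what such an alarm could look like, the function below builds the parameters for a CloudWatch alarm on the `Throttles` metric of a Lambda function. The helper name, the alarm name, and the one-minute period and zero threshold are assumptions chosen for illustration; the resulting dictionary is intended to be passed as keyword arguments to the CloudWatch `put_metric_alarm` call (for instance through boto3), which is not invoked here.

```python
def build_throttle_alarm_params(function_name: str, topic_arn: str) -> dict:
    """Build put_metric_alarm parameters for a Lambda throttling alarm.

    The period and threshold are illustrative and should be tuned per workload.
    """
    return {
        "AlarmName": f"{function_name}-throttles",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,                # evaluate one-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any throttle triggers ALARM
        "TreatMissingData": "notBreaching",
        # Notify the SNS topic on both transitions, so the responder
        # hears about the peak starting and about it ending.
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
    }


params = build_throttle_alarm_params(
    "lambda-function-1",                                # hypothetical function name
    "arn:aws:sns:us-east-1:123456789012:demand-alarm",  # hypothetical topic ARN
)
print(params["AlarmName"])
```

Sending both `AlarmActions` and `OKActions` to the same topic is a design choice: it lets a single subscriber react to the alarm firing and to it clearing.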
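A minimal sketch of function 3's decision logic might look like the following. It assumes the Lambda is subscribed to the SNS topic and receives the standard CloudWatch alarm notification, whose JSON message carries a `NewStateValue` field. How the route is actually switched (for example, by updating the API integration) is left to a caller-supplied `switch_route` function, which is hypothetical and not an AWS API.

```python
import json
from typing import Callable, Optional


def desired_route(alarm_state: str) -> Optional[str]:
    """Map a CloudWatch alarm state to a routing mode.

    ALARM -> buffer requests through SQS; OK -> call Lambda function 1 directly.
    Any other state (e.g. INSUFFICIENT_DATA) leaves the routing unchanged.
    """
    if alarm_state == "ALARM":
        return "queue"
    if alarm_state == "OK":
        return "direct"
    return None


def make_handler(switch_route: Callable[[str], None]):
    """Create a Lambda handler that applies the route via switch_route."""

    def handler(event, context=None):
        # SNS delivers the alarm notification as a JSON string in Sns.Message.
        message = json.loads(event["Records"][0]["Sns"]["Message"])
        route = desired_route(message["NewStateValue"])
        if route is not None:
            switch_route(route)
        return route

    return handler


# Usage with a stand-in switch function that only records the decision.
decisions = []
handler = make_handler(decisions.append)
sample_event = {
    "Records": [{"Sns": {"Message": json.dumps({"NewStateValue": "ALARM"})}}]
}
print(handler(sample_event))
```

Keeping the decision separate from the switching mechanism makes the logic easy to test locally before wiring it to real infrastructure.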

This way, the extra 40% cost would only apply to times when the demand is really high and a message buffer is absolutely needed.
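To get a feel for the savings, the blended cost can be estimated from the share of requests that arrive while the queue is switched on. The sketch below reuses the $1.20 and $1.68 figures from above; the 10% peak share is a made-up value for illustration, and the small cost of the alarm, the SNS topic, and function 3 is ignored.

```python
def blended_cost_per_million(
    peak_share: float, direct: float = 1.20, queued: float = 1.68
) -> float:
    """Average cost per million requests when SQS is used only during peaks.

    peak_share is the fraction of requests (0.0 to 1.0) routed through SQS.
    """
    if not 0.0 <= peak_share <= 1.0:
        raise ValueError("peak_share must be between 0 and 1")
    return round(peak_share * queued + (1 - peak_share) * direct, 4)


# Hypothetical workload: 10% of requests arrive during demand peaks.
blended = blended_cost_per_million(0.10)
overhead_pct = round((blended - 1.20) / 1.20 * 100, 1)
print(f"blended: ${blended} per million, overhead: {overhead_pct}%")
```

Under that assumption the overhead falls from 40% to roughly 4%, since it scales linearly with the share of traffic that passes through the queue.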

The solution proposed would add extra complexity to the architecture. It might not be the most suitable solution in all cases. Our purpose was to illustrate possible scenarios and coordination between different AWS services that can be combined to actually reduce the AWS bill.
