Serverless Well-Architected – Reconciling Resilience and Cost-Optimization

We recently wrote about the reasons why serverless apps fail and explored some ideas to make architectures more resilient and scalable.

Some of these architectural designs can become expensive if we don’t consider the financial impacts of architectural decisions. With proper care and consideration to this aspect, it is possible to achieve the same value in terms of scalability and resiliency while keeping costs at a manageable level.

Being able to reconcile these two aspects of architectural decision-making will not only make you a better architect or developer, it will also make your services more valuable in the marketplace. Large projects in particular can save thousands to millions of dollars with simple cost optimizations.

In an illustrative example, we ended up with the following architecture idea. Since Endpoint 1 sometimes scales to a level higher than the underlying resources can handle, we placed a queue in front of Lambda function 1 to smooth demand peaks.

This architectural change (adding a queue) contributes to the Reliability pillar of the Well-Architected Framework. Nonetheless, depending on the project, it might be undesirable from the perspective of the Cost-Optimization pillar.

Suppose this API is only used internally, to decouple services. Endpoint 1 accepts POST requests (write-only) and uses the simplified API Gateway model, “HTTP API”. The API and Lambda will cost a total of $1.20 per million requests ($1.00 for HTTP API and $0.20 for Lambda invocations), not counting Lambda’s memory time and RDS I/O.

Let’s now analyze how SQS adds to this cost. Ideally, we want to provide a smooth experience to the client relying on Endpoint 1, so we will not require it to group requests to take advantage of SQS batching. Each incoming API request translates into one SQS request, adding another $0.40 per million API requests, a 33% increase in cost.

On top of that, Lambda function 1 needs to consume these messages in order to process them and store the results in the RDS database. Here SQS batching can be used to reduce costs. Let’s say Lambda polls SQS frequently and gets, on average, 5 messages per request. Consuming the messages from 1 million API requests then adds a further $0.08 ($0.40 ÷ 5).

The cost for 1 million API requests has now jumped to $1.68 ($1.20 + $0.40 + $0.08), up from $1.20, a 40% increase! Picture a $100,000 AWS bill jumping to $140,000.
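The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the article's simplified model: the prices are the figures quoted in the text, the function and constant names are made up for illustration, and Lambda memory time and RDS I/O are left out, as they are above.

```python
# Per-million-request prices in USD, taken from the figures quoted in this article.
HTTP_API_PER_MILLION = 1.00
LAMBDA_INVOKE_PER_MILLION = 0.20
SQS_PER_MILLION = 0.40


def cost_per_million(use_queue: bool, avg_batch_size: float = 5.0) -> float:
    """Request cost in USD for 1 million API requests.

    Excludes Lambda memory time and RDS I/O, like the figures in the text.
    """
    base = HTTP_API_PER_MILLION + LAMBDA_INVOKE_PER_MILLION
    if not use_queue:
        return base
    send = SQS_PER_MILLION                      # one SQS request per API request
    receive = SQS_PER_MILLION / avg_batch_size  # one receive per batch of messages
    return base + send + receive


direct = cost_per_million(use_queue=False)
queued = cost_per_million(use_queue=True)
increase_pct = (queued - direct) / direct * 100
print(f"direct=${direct:.2f} queued=${queued:.2f} increase={increase_pct:.0f}%")
```

Changing `avg_batch_size` shows how sensitive the consumption side is to batching: larger batches shrink the $0.08 component, while the $0.40 for enqueueing stays the same.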

If we look closely at how the application behaves, we will see that demand only exceeds capacity on two occasions throughout the day. If AWS Lambda and RDS can cope with demand for most of the day, why would we pay the extra 40% SQS cost all the time?

One solution here would be to create three additional resources:

  • CloudWatch alarm
  • SNS topic to receive alarm state changes
  • Lambda to respond to alarm

With a CloudWatch alarm, we can detect when Lambda or the API starts failing with concurrency or throughput errors due to increased demand. The alarm publishes its state change to the SNS topic, which in turn triggers another Lambda: function 3.
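As a rough sketch, such an alarm could watch the `Throttles` metric that Lambda publishes in the `AWS/Lambda` namespace. The function below only builds the arguments for CloudWatch's `put_metric_alarm` call; the alarm name, the one-minute period, the threshold, and the example function name and topic ARN are all assumptions chosen for illustration, not values from the original architecture.

```python
from typing import Any, Dict


def build_throttle_alarm(function_name: str, topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    The alarm fires as soon as the function is throttled at least once
    within a one-minute window (assumed values, tune them per workload).
    """
    return {
        "AlarmName": f"{function_name}-throttles",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no throttling
        "AlarmActions": [topic_arn],         # notify SNS when entering ALARM
        "OKActions": [topic_arn],            # notify SNS when returning to OK
    }


# Placeholder function name and topic ARN, for illustration only.
params = build_throttle_alarm(
    "function-1", "arn:aws:sns:us-east-1:123456789012:demand-alarms"
)
print(params["AlarmName"])

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Sending both `AlarmActions` and `OKActions` to the same topic matters here, because function 3 needs to hear about the return to normal as well as the start of a peak.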

The role of function 3 is to turn SQS on or off depending on demand. During low demand, this Lambda temporarily detaches SQS from Endpoint 1 and routes API requests directly to Lambda function 1. When the alarm reports concurrency errors, function 3 reestablishes SQS as the destination for API requests, so that backend faults do not spill over to the API’s clients.
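A minimal sketch of function 3's decision logic follows. It relies on the fact that CloudWatch alarm notifications arrive through SNS as a JSON string carrying a `NewStateValue` field. How the route is actually switched is left as an injected callable named `switch_route`, which is hypothetical; one possible implementation would re-point the HTTP API route at a different integration, but that part depends on how the API is set up.

```python
import json
from typing import Any, Callable, Dict, Optional


def desired_route(alarm_state: str) -> Optional[str]:
    """Map a CloudWatch alarm state to the route Endpoint 1 should use."""
    if alarm_state == "ALARM":
        return "queue"   # demand exceeds capacity: buffer requests in SQS
    if alarm_state == "OK":
        return "direct"  # capacity is sufficient: invoke function 1 directly
    return None          # e.g. INSUFFICIENT_DATA: leave the routing untouched


def make_handler(switch_route: Callable[[str], None]):
    """Create a Lambda handler that applies the route chosen for each alarm."""

    def handler(event: Dict[str, Any], context: Any = None) -> Optional[str]:
        # SNS delivers the alarm notification as a JSON string.
        message = json.loads(event["Records"][0]["Sns"]["Message"])
        route = desired_route(message["NewStateValue"])
        if route is not None:
            switch_route(route)
        return route

    return handler


# Local dry run with a fake switch that only records what was requested.
requested = []
handler = make_handler(requested.append)
alarm_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({"NewStateValue": "ALARM"})}}
    ]
}
print(handler(alarm_event), requested)
```

Keeping the switching step behind a callable makes the decision logic easy to test locally and keeps the AWS-specific part small and replaceable.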

A dead-letter queue could be configured on Lambda function 1 so that no request is lost, even in the window when demand has already peaked but SQS has not yet been re-enabled.

This way, the extra 40% cost would only apply to times when the demand is really high and a message buffer is absolutely needed.
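To get a feel for the savings, the two per-million costs from earlier ($1.20 direct, $1.68 through the queue) can be blended by the share of requests that arrive while SQS is enabled. The 25% share used below is a made-up figure for illustration, not a measurement, and the small cost of the alarm, SNS topic, and function 3 is ignored.

```python
DIRECT_COST = 1.20  # USD per million requests without SQS (from the text)
QUEUED_COST = 1.68  # USD per million requests with SQS (from the text)


def blended_cost_per_million(queued_share: float) -> float:
    """Average cost per million requests when only a share goes through SQS."""
    if not 0.0 <= queued_share <= 1.0:
        raise ValueError("queued_share must be between 0 and 1")
    return DIRECT_COST * (1.0 - queued_share) + QUEUED_COST * queued_share


share = 0.25  # hypothetical: a quarter of all requests arrive during peaks
blended = blended_cost_per_million(share)
overhead_pct = (blended - DIRECT_COST) / DIRECT_COST * 100
print(f"blended=${blended:.2f} overhead={overhead_pct:.0f}%")
```

Under that assumed share, the overhead drops from 40% to roughly 10%. Since peaks carry more traffic than quiet periods, the share of requests is the number to estimate, not the share of hours in the day.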

The solution proposed would add extra complexity to the architecture. It might not be the most suitable solution in all cases. Our purpose was to illustrate possible scenarios and coordination between different AWS services that can be combined to actually reduce the AWS bill.
