
Solving the Challenges of Serverless at Scale


Best Practices of Serverless at Scale 

A serverless application in its infancy looks and runs very differently from one at scale. When there are more components to manage, operational excellence depends on serverless best practices. Dashbird was created with the mission to help developers succeed with modern cloud environments, no matter their size. As experienced developers ourselves, we’ve faced and understand the challenges of running serverless architecture at scale. In this article, we run through the common serverless challenges and the architectural patterns and best practices to combat them.

 

Find out more about scalable serverless designs for enterprises.

 

Exploring the Challenges 

As with anything, we should constantly aim to catch problems sooner rather than later. Here is an example of an established but early-stage serverless application:

 

[Figure: serverless architecture]

 

As you can see, its workflow is simple and the load is minimal, so the requests, execution times, and concurrency are all manageable.

 

In just a few months, that same architecture can look like this: 

 

[Figure: serverless architecture at scale]

 

As load increases, the existing infrastructure comes under stress. This is a great exercise in identifying the potential points of failure in your system and the scenarios in which they could fail. In this example, you can see clearly how each resource has its own limit, and that reaching it leads to either failure or performance degradation. It’s important to remember that while different services have different API and throttling limits, failures can also come from configuration mistakes and code errors.

 

Common issues at higher loads:

 

[Figure: common issues with serverless architecture]

 

Lambda Concurrency

Lambda concurrency is the number of requests that your function is serving at any given time. A good formula for estimating Lambda concurrency is:

Average Execution Time (in seconds) * Average Requests Per Second = Estimated Concurrency
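
As a minimal sketch, the formula can be written as a small Python helper. The function name and the sample numbers are illustrative assumptions, not measurements from a real workload.

```python
import math


def estimate_concurrency(avg_execution_seconds: float, avg_requests_per_second: float) -> int:
    """Estimate concurrent executions, rounded up to whole containers."""
    return math.ceil(avg_execution_seconds * avg_requests_per_second)


# Illustrative numbers only: 250 ms average execution at 400 requests per second.
print(estimate_concurrency(0.25, 400))  # 100
```

Rounding up is a deliberate choice here: a fractional result still needs a whole extra container to serve it.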

 

This estimate indicates how many containers will be in use simultaneously. With this in mind, let’s remind ourselves of some of the default AWS limits in place.

 

  • Function-Based Burst Limits 

Throttling from burst limits can still occur even when overall concurrency is within bounds. Functions have an initial burst limit of between 500 and 3,000 concurrent executions (region dependent), after which concurrency can scale up by 500 every minute.

  • Account Wide Limit 

This is a soft limit, built in for your protection. By default, it is set to 1,000 concurrent executions, but it can be raised.

  • API Gateway Limit 

There is a limit of 10k requests per second per region, which can be increased as needed. However, the 5k concurrency burst limit and the 29-second timeout limit cannot be changed.

  • Other AWS API Limits

All AWS APIs have limits, which are important to factor in when building and mapping out your application for scale. For example, KMS has a limit of between 5,500 and 10,000 requests per second, depending on the region.

As your application scales, or if it often experiences spiky loads, these limits need to be kept in mind for stable performance.
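
To make the burst figures quoted above concrete, here is a rough sketch of how long a function would need to ramp up to a target concurrency. It assumes the numbers stated in this article (an initial burst of 500 to 3,000, then 500 more per minute); the function name is hypothetical, and the account-wide limit would still have to be high enough for the target.

```python
import math


def minutes_to_reach(target_concurrency: int, initial_burst: int, step_per_minute: int = 500) -> int:
    """Minutes of scaling needed after the initial burst to reach the target."""
    if target_concurrency <= initial_burst:
        return 0  # the initial burst already covers the target
    return math.ceil((target_concurrency - initial_burst) / step_per_minute)


# Ramp-up to 7,500 concurrent executions in a 3,000-burst region and a 500-burst region.
print(minutes_to_reach(7500, 3000))  # 9
print(minutes_to_reach(7500, 500))   # 14
```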

 

Architectural Patterns and Best Practice

An unoptimized at-scale serverless application would look like this: 

 

[Figure: serverless at scale]

 

With so many requests per second, the stress becomes clear as the other resources multiply. For a relational database, 3,000 new connections per second is a huge load and can cause lag in your system. Additionally, the 7,500 containers now needed increase your costs significantly.

These are our top tips for code-level optimizations to help with this. 

  • Keep everything you can in the initialization phase, and only connect to the database once the KMS queries have been cached. That way, each execution only runs your main logic.
  • Keep orchestration out of code.
  • Manage all connections outside of the handler code.
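
The first and third tips can be sketched as follows. This is a minimal, locally runnable illustration rather than production code: `decrypt_secret` and `open_db_connection` are hypothetical stand-ins for a real KMS decrypt call and a real database driver, and the call counter exists only to show that the work happens once per container.

```python
import json

# Counters used only to demonstrate how often each step runs.
CALLS = {"decrypt": 0, "connect": 0}


def decrypt_secret(ciphertext: str) -> str:
    """Hypothetical stand-in for a KMS decrypt call."""
    CALLS["decrypt"] += 1
    return ciphertext[::-1]  # placeholder for real decryption


def open_db_connection(password: str) -> dict:
    """Hypothetical stand-in for a database driver's connect call."""
    CALLS["connect"] += 1
    return {"password": password, "open": True}


# Initialization phase: module-level code runs once per container, not per request.
DB_PASSWORD = decrypt_secret("terces")           # decrypted value is cached
DB_CONNECTION = open_db_connection(DB_PASSWORD)  # reused by warm invocations


def handler(event, context):
    # Only the main logic runs on each invocation.
    body = {"order": event.get("order_id"), "db_open": DB_CONNECTION["open"]}
    return {"statusCode": 200, "body": json.dumps(body)}


for order_id in range(3):
    handler({"order_id": order_id}, None)
print(CALLS)  # {'decrypt': 1, 'connect': 1}
```

Three invocations share one decryption and one connection, which is the effect the tips above are aiming for.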

 

Using the above, the optimized at-scale serverless architecture now looks like this: 

 

[Figure: optimized serverless architecture at scale]

 

You can see a huge reduction in the execution time, as the connection no longer needs to be established on every request, and in the total number of connections, resulting in far smoother performance.

Additional Serverless Patterns to Question

  • Do you need an API response? 

A habit we can fall into is always returning a detailed database response from the API, when sometimes a simple acknowledgment is all that’s needed. When an acknowledgment is enough, you can decouple the database from the KMS request and create an asynchronous processing model using SQS and Lambda, which lets you set your own concurrency limit and control the load. There is no change to the model.
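
The acknowledge-then-process idea can be sketched like this. To keep the example runnable anywhere, an in-memory `deque` stands in for the SQS queue, and both handler names are hypothetical; in a real deployment the first function would send to SQS and the second would be triggered by it.

```python
import json
from collections import deque

# In-memory stand-in for an SQS queue (assumption made for a local sketch).
QUEUE = deque()


def api_handler(event, context):
    """Accept the request, enqueue it, and acknowledge without touching the database."""
    QUEUE.append(json.dumps(event["body"]))
    return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}


def worker_handler(batch_size: int = 10):
    """Drain up to batch_size messages; stands in for a queue-triggered function."""
    processed = []
    while QUEUE and len(processed) < batch_size:
        message = json.loads(QUEUE.popleft())
        processed.append(message)  # the database write would happen here
    return processed


print(api_handler({"body": {"id": 1}}, None)["statusCode"])  # 202
print(worker_handler())  # [{'id': 1}]
```

Because the worker pulls from the queue at its own pace, its concurrency can be capped independently of the incoming request rate.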

  • Definitely need an API response?

If an API response is needed, there are a few optimization tweaks to consider. 

  • Switch to a serverless database such as DynamoDB (non-relational) or Aurora Serverless (relational, with an HTTP interface). With the HTTP interface and the proxy/cache elements, connection limits stop being the bottleneck, which means less lag and slowness. 
  • Implement client retries and backoffs, so the client waits for the response outside of the synchronous call*. 
  • Implement webhooks or polling for long-running tasks*.

*These features may have a negative impact on the client; however, at a very high scale, the compromise can be worthwhile.
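
As one way to picture the client-side retries and backoffs mentioned above, here is a minimal sketch of exponential backoff with jitter. The function name, default delays, and attempt count are assumptions chosen for illustration, and `sleep` is injectable so the example can be exercised without real waiting.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Retry operation with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retries from many clients


# Usage: an operation that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("throttled")
    return "ok"


print(call_with_backoff(flaky_request, sleep=lambda seconds: None))  # ok
```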

  • Don’t orchestrate in code

The purpose of serverless is to keep code focused on business logic, meaning that the parts of your application that offer no differentiating value can be handed over to managed services. Make use of the best services to support your application’s functionality. 

 

Additionally, don’t wait in code. Instead, use Step Functions to run tasks in parallel and to enable automatic triggers and retries. This is one of the most effective optimizations many of our customers have made, from both a performance and a cost perspective.
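
A state machine with parallel branches and automatic retries might look roughly like the sketch below, written as a Python dictionary and serialized to Amazon States Language JSON. The state names, the function ARNs, and the retry settings are all hypothetical placeholders.

```python
import json

# Hypothetical Lambda ARNs; replace with real function ARNs.
TASK_A_ARN = "arn:aws:lambda:us-east-1:123456789012:function:task-a"
TASK_B_ARN = "arn:aws:lambda:us-east-1:123456789012:function:task-b"

# Retry policy handled by Step Functions instead of wait loops in code.
RETRY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def task(resource_arn: str) -> dict:
    """Build a terminal Task state with the shared retry policy."""
    return {"Type": "Task", "Resource": resource_arn, "Retry": RETRY, "End": True}


state_machine = {
    "Comment": "Sketch: run two tasks in parallel with automatic retries",
    "StartAt": "RunInParallel",
    "States": {
        "RunInParallel": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "TaskA", "States": {"TaskA": task(TASK_A_ARN)}},
                {"StartAt": "TaskB", "States": {"TaskB": task(TASK_B_ARN)}},
            ],
            "End": True,
        }
    },
}

definition_json = json.dumps(state_machine, indent=2)
print(definition_json)
```

The resulting JSON is the kind of definition you would supply when creating the state machine; the waiting and retrying then happen in the managed service rather than inside a running function.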

 

Tackling Operating and Monitoring Challenges 

 

With the benefits of serverless comes a new host of monitoring challenges to overcome, which is where Dashbird can provide value and expertise.

Challenges Using Managed Services

  1. There is no code access like we are used to. It’s no longer a case of attaching an agent to the API to send a failure alarm; instead, we have a more abstract control panel to work from. 
  2. Serverless components also output a huge amount of data. With each resource providing logs, tracing data, errors, and configuration data, it rapidly piles up. 
  3. Failures are very specific to the service used. The issues found in API Gateway differ from those in Lambda, for example, which emphasizes the need for deep knowledge of individual services and all their possible errors. 
  4. The large scale of these systems means that problems are potentially larger and more widespread.

 

Challenges Using a Distributed System

  1. There is a lot of surface area to manage. There can be hundreds or thousands of parts to your infrastructure, which naturally increases the likelihood of failures, errors, and vulnerabilities open to attackers. 
  2. It’s a dynamic and constantly changing system, adapting to demand and requirements. 
  3. Understanding resource relationships and their interactions is new in the serverless world. 

 

Dashbird is built on three core pillars that target all these issues: 

  • Centralized Observability and Visualization 
  • Automated Failure and Error Alerts 
  • Actionable Well-Architected and Best Practice Insights

 

Centralized Observability and Visualization 

[Figure: serverless observability]

It’s important to make the mass of data output that is already available work efficiently for us. Democratizing data breaks down traditional silos and enables users to navigate their own data more easily through customizable queries and searches. Dashbird’s prebuilt views and simple dashboards visualize your data for easier and quicker understanding. 

The centralized platform offers dynamic resource management, where you’re able to understand resource relationships and view your entire application in one place.

 

Automated Failure and Error Alerts 

Monitoring is only effective if there is continuous alert coverage across your entire infrastructure. Dashbird provides out-of-the-box automated alerts that notify you of failures and errors and integrate seamlessly into a developer’s workflow, arriving in real time via Slack or email.

Dashbird also proactively listens to log and metric data, meaning that any potentially negative trends (not yet failures) are highlighted and can be investigated before they escalate. 

 

Actionable Well-Architected and Best Practice Insights

Building serverless applications requires consistent best practice habits, which can be difficult to maintain or even start. Using the AWS Well-Architected lens, Dashbird helps to ensure your system is built and fixed based on industry-standard best practices.

The Insights Engine detects non-binary issues such as delays, consumption issues, or approaching limits, enabling users to take action and optimize their architecture to be reliable at any scale. Within its periodic assessments, Dashbird also helps to instill strong security and compliance practices, discovering areas that need encryption, inactive resources, and over- or under-provisioned components, all of which can increase exposure to attacks.

