When building serverless applications, Lambda functions often form the backbone of the system. They might provide just a few lines of code, but these lines are usually what hold the whole architecture composed of many managed services together.
This style is called event-driven architecture, and it’s most prevalent in serverless applications. API gateways collect requests from your users, convert them to events, and send these along the way. Sometimes a downstream service, like DynamoDB or SQS, can handle such an event directly to save on Lambda invocation costs.
More often than not, we need to apply some validation and transformation to an event so the service can handle it. That’s where Lambda comes into play, and that’s also one source of our timeouts.
One of the main beginner mistakes when implementing a Lambda function is to wait for other services inside that function.
Usually, it goes like this:
You create a function that needs to read or write data to multiple other services; this can be S3, Kinesis, or even another Lambda function. So you choose one of these services as an event source for the Lambda and then try to call the other services inside your function. They are all connected over the network, and so your function waits for all their responses.
Depending on what you are doing and how many services are involved, this waiting time will add up, and sooner or later, the Lambda timeout takes your function down.
The default timeout of a Lambda function is three seconds. This means that if you don’t explicitly configure a timeout, each invocation will be terminated after three seconds.
Now, if you call a few services, some of which are currently at capacity, a request can very well take a second on its own. So it’s no surprise that a three-second timeout is reached quickly.
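The arithmetic behind this is simple enough to sketch. The example below assumes the function calls its services one after another; the latencies are made-up numbers for illustration, not measurements.

```python
LAMBDA_TIMEOUT_S = 3.0  # Lambda's default timeout, as described above


def total_wait(latencies_s):
    """Sequential calls block one after another, so their latencies add up."""
    return sum(latencies_s)


def exceeds_timeout(latencies_s, timeout_s=LAMBDA_TIMEOUT_S):
    """True if the combined waiting time reaches the configured timeout."""
    return total_wait(latencies_s) >= timeout_s


# Hypothetical latencies in seconds: two healthy services, three at capacity.
calls = [0.2, 0.3, 1.0, 1.0, 1.0]
print(total_wait(calls), exceeds_timeout(calls))
```

With only the two healthy services, the function would finish comfortably; the three slow ones alone are enough to use up the whole default timeout.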
The naive solution for this problem is dialing up the timeout in the function’s configuration. After all, Lambda’s upper limit for timeouts is 15 minutes, so there is quite some runway.
The problem with this solution is that it isn’t free. You pay for every millisecond your function spends waiting and doing nothing. This includes errors on the other side of the network: if a service has a problem and doesn’t answer, you still pay for the waiting, only to be told that the service timed out, and you get nothing for your money. Not to mention that if this function directly affects the end user, the delay is a major UX issue.
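To get a feeling for what idle waiting costs, here is a rough estimate. It assumes Lambda’s duration charge is memory (in GB) times billed seconds times a per-GB-second price; the price constant is an assumption based on the commonly listed x86 rate and should be checked against the current pricing page for your region. Request charges are ignored.

```python
# Assumed price per GB-second (x86); verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def idle_wait_cost_usd(memory_mb, wait_seconds, invocations):
    """Estimate the duration charge for time spent only waiting on other services."""
    memory_gb = memory_mb / 1024
    return memory_gb * wait_seconds * invocations * PRICE_PER_GB_SECOND


# Hypothetical workload: 512 MB function, 2 s of waiting, 1 million invocations.
print(round(idle_wait_cost_usd(512, 2.0, 1_000_000), 2))
```

The absolute number is small for a single function, but it scales linearly with memory size, waiting time, and traffic.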
Refactoring your architecture is a better idea.
If you have a Lambda function that times out because it waits too long for multiple services, you can try to use the services as event sources for new Lambda invocations instead. For example, don’t let one Lambda function call ten services, as seen in Figure 1; let it call only one, and have that service trigger a new Lambda function when it’s finished, and so on, as Figure 2 illustrates.
This way, you chain up multiple services and Lambda functions and don’t have to pay for Lambda execution time while the other services work.
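A minimal sketch of one link in such a chain might look like the following. The function validates its event, hands the result to a queue, and returns immediately instead of waiting for further processing. The event shape (`order_id`), the queue URL, and the `FakeQueue` stand-in are hypothetical; in a real function the client would be something like a boto3 SQS client, whose `send_message` call takes the same `QueueUrl` and `MessageBody` arguments.

```python
import json


def make_handler(queue_client, queue_url):
    """Build a Lambda handler that hands work off instead of waiting for it."""

    def handler(event, context):
        if "order_id" not in event:  # hypothetical event field
            raise ValueError("event is missing 'order_id'")
        payload = {"order_id": event["order_id"], "status": "validated"}
        # Hand off to the queue; the next Lambda is triggered by the queue itself.
        queue_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))
        return {"statusCode": 202, "body": json.dumps({"accepted": True})}

    return handler


class FakeQueue:
    """In-memory stand-in for a queue client, for local experiments only."""

    def __init__(self):
        self.messages = []

    def send_message(self, QueueUrl, MessageBody):
        self.messages.append((QueueUrl, MessageBody))
        return {"MessageId": "fake"}


queue = FakeQueue()
handler = make_handler(queue, "https://example.invalid/hypothetical-queue")
print(handler({"order_id": "42"}, None))
```

The function’s billed time now covers only validation and a single enqueue call, no matter how long the downstream work takes.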
You shouldn’t orchestrate with Lambda functions. When your workload is sufficiently complex, you should use a dedicated messaging or workflow service to orchestrate it. AWS offers plenty of managed options, and most of them are serverless in nature. SQS, SNS, Kinesis, and Step Functions are all ways to manage data flow in your architecture outside of a Lambda function.
Step Functions can wait for weeks for an event to resolve and can even work with manual steps handled by human interaction. In Figure 3 above, you see what such an architecture can look like. First, API Gateway calls Step Functions, and then Step Functions takes care of orchestrating other services, like Lambda and DynamoDB.
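As a rough illustration of this kind of orchestration, the sketch below builds a small state machine definition in Amazon States Language: one Lambda task followed by a direct DynamoDB write. The state names, the function ARN, the account ID, the table name, and the `orderId` field are all placeholders, not part of any real setup.

```python
import json

# Hypothetical two-step workflow: validate with Lambda, then write to DynamoDB.
state_machine = {
    "Comment": "Illustrative workflow; all names and ARNs are placeholders",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "TimeoutSeconds": 10,
            "Next": "SaveOrder",
        },
        "SaveOrder": {
            "Type": "Task",
            # Service integration: Step Functions writes the item itself,
            # so no Lambda function is billed for this step.
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "orders",
                "Item": {"orderId": {"S.$": "$.orderId"}},
            },
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

The resulting JSON is what you would supply as the state machine definition; the waiting between steps happens inside Step Functions rather than inside a running Lambda function.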
The next reason for timeouts is complex Lambda functions that do too much at once. If you’re coming from a more monolithic approach for application architecture, you’re probably inclined to put as much logic into a function as possible. This can increase the time it takes to complete.
Lambda has a default timeout of three seconds, but it allows you to change this configuration. The maximum is 15 minutes, but keep in mind that you have to pay for it. If you really have long-running processes you can’t cut down on, maybe Lambda isn’t the right solution, and you should look into EC2 or ECS instead.
Also, API Gateway has its own integration timeout of roughly 30 seconds (29 seconds by default). So, even if you configure your Lambda function to run for 15 minutes, you will hit this limit when you use API Gateway to call your Lambda functions.
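Whatever timeout you configure, a function can check how much of it is left and stop cleanly instead of being cut off mid-task. The sketch below uses the `get_remaining_time_in_millis()` method of the Lambda context object; the `process` function, the `items` event field, the safety margin, and the `FakeContext` class are illustrative assumptions.

```python
SAFETY_MARGIN_MS = 500  # hypothetical buffer to leave time for returning a result


def process(item):
    """Placeholder for the real per-item work."""
    return item * 2


def handler(event, context):
    """Process items until the remaining time drops below the safety margin."""
    processed = []
    pending = list(event["items"])
    while pending:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # stop early and report what is left instead of timing out
        processed.append(process(pending.pop(0)))
    return {"processed": processed, "remaining": pending}


class FakeContext:
    """Local stand-in for the Lambda context; returns scripted time readings."""

    def __init__(self, readings_ms):
        self._readings = iter(readings_ms)

    def get_remaining_time_in_millis(self):
        return next(self._readings)


print(handler({"items": [1, 2, 3]}, FakeContext([3000, 1500, 400])))
```

The unprocessed items can then be passed on to a queue or a follow-up invocation. Note that the context only knows about the Lambda timeout, not about a shorter limit imposed by a caller such as API Gateway.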
If you increase the memory of your Lambda function, it also gets more CPU power. More computation power means quicker execution and, in turn, a shorter billed duration. But as with an increased timeout, this isn’t always free. Sometimes more memory accelerates the function enough to make it cheaper, but sometimes it makes it more expensive.
You should look into tools like AWS Lambda Power Tuning to get the most out of your functions. Otherwise, you may well leave money and performance on the table.
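The trade-off can be sketched with a little arithmetic. The example assumes the duration charge is memory in GB times billed seconds times a per-GB-second price; the price constant and the three measurements are made-up values for illustration, and this is not how the tuning tool itself is implemented.

```python
# Assumed price per GB-second (x86); verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost_usd(memory_mb, duration_ms):
    """Duration charge for a single invocation at a given memory setting."""
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND


def cheapest(measurements):
    """Pick the (memory_mb, duration_ms) pair with the lowest cost per invocation."""
    return min(measurements, key=lambda m: invocation_cost_usd(*m))


# Hypothetical measurements of the same function at three memory sizes.
measurements = [(128, 2400), (512, 500), (1024, 300)]
for memory_mb, duration_ms in measurements:
    print(memory_mb, duration_ms, invocation_cost_usd(memory_mb, duration_ms))
print("cheapest:", cheapest(measurements))
```

In this made-up data set, the middle setting is both faster and cheaper than the smallest one, while the largest setting is faster still but costs more again.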
The safest approach here is to simplify your Lambda functions. For example, instead of writing one function that does ten tasks and takes about 3 minutes, try to split it up. Ten functions that take roughly 20 seconds each are far more flexible. In addition, this allows more integration options (each one fits within the 30-second API Gateway limit), and you can tune each of these tasks individually.
Maybe one task takes the majority of the time, and you can allocate more memory for it. The other tasks can then run with less memory, which can be cheaper.
Dashbird shows you a central location for all of your Lambda errors. So even if you have hundreds of Lambda functions in your AWS account, you only have to look into one place to find out what’s going on.
The Dashbird app will show you the most recent errors in your AWS account. Figure 4 below shows a list that includes the timeout error we were talking about.
If you click on the timeout error, you’re directly taken to a detailed view of the related event, as seen in Figure 5. It includes all metrics of the event and related invocations.
If you want to know more, you can then click on an invocation at the bottom under occurrences, and you will be presented with the view in Figure 6.
Here you find everything related to the timed-out invocation: duration, error, and logs. If you enabled X-Ray tracing for your function, you can even check how long it took to communicate with other services. Figure 7 shows what this looks like.
With traces, you see right away if you’re paying for waiting on other services.
Try Dashbird for free—it’s an industry standard for serverless observability and will help you debug those pesky timeouts.