In this article, we’ll discuss the basics of AWS Lambda error handling and some popular approaches using Step Functions and X-Ray. Whether you’re an AWS Lambda expert or a new Lambda user, there’s always something new to learn. You’ve probably already encountered Lambda errors that seem challenging, since the mechanism behind Lambda retries often makes it difficult to follow what is actually happening inside your serverless application.
Serverless is not just about straightforward execution of code in a Lambda function; it is a different architecture for your entire system, made up of distributed nodes that are activated by asynchronous events. Every node has to be designed as a self-contained unit with its own API.
To learn how to define all these nodes accurately, you have to know how to handle Lambda errors. In addition, it’s necessary to deal appropriately with Lambda retry behavior as well.
So let’s jump right into how AWS Lambda retries and errors actually work, and why they matter.
Lambda functions can fail (and they will), typically for one of these reasons:

- an unhandled exception thrown by your code
- a timeout (the function runs longer than its configured limit)
- running out of memory
- throttling, when the account’s concurrency limit is exceeded
When failure occurs, and it will occur at some point, the retry behavior depends on how the Lambda was invoked:

- synchronous invocation (e.g., via API Gateway): no automatic retry; the error is returned to the caller
- asynchronous invocation (e.g., via SNS or S3): Lambda retries the event twice, with delays between attempts
- stream-based invocation (Kinesis or DynamoDB streams): Lambda retries the batch until it succeeds or the data expires
Every Lambda can therefore be executed multiple times with the same input, without the caller being aware of it. To safely perform the same operation several times, a Lambda has to be idempotent: no additional side effects may occur when it runs with the same input more than once.
Serverless functions aren’t the only place where idempotency matters. A classic example is a network API: when a request doesn’t receive a response, the client sends the same request again.
In serverless architectures, a similar case occurs when a Lambda times out before returning a response. Even though it may seem unlikely, incorrect retry handling can cause severe problems, such as a database integrity violation.
Formally, idempotency is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application. In practice, that definition can still be somewhat confusing.
What happens if you wish to execute the same operation several times and it’s not actually a retry?
Say a Lambda receives a user operation log as input and is solely responsible for recording that operation log in a database. Here we need to distinguish between a retry and a genuine new trigger, because the input looks identical when the user initiates the same operation again.
The right solution is to treat the Lambda’s request ID as part of the input itself: you’ll receive the same ID only when the invocation is a Lambda retry. To extract it, use context.awsRequestId in Node.js, or the corresponding field in other languages. This provides a general way to detect retry executions.
Using the request ID alone isn’t always enough for genuine idempotency. As the previous example shows, the ID would also have to be saved in the DB so that subsequent invocations can determine whether a new record needs to be added. Another option is an in-memory data store, though that can add significant overhead.
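With DynamoDB, for example, the ID check and the write can be collapsed into a single conditional put. The sketch below is one possible approach; the table and attribute names are hypothetical:

```javascript
// Sketch: an idempotent DynamoDB write. The ConditionExpression makes
// the put fail with ConditionalCheckFailedException if a record with
// this request ID already exists, so the insert happens at most once.
function buildIdempotentPut(requestId, record) {
  return {
    TableName: 'operation-logs',            // hypothetical table name
    Item: { requestId, ...record },
    ConditionExpression: 'attribute_not_exists(requestId)',
  };
}

// Usage with the AWS SDK DocumentClient (not executed here):
// try {
//   await docClient.put(buildIdempotentPut(context.awsRequestId, log)).promise();
// } catch (err) {
//   if (err.code !== 'ConditionalCheckFailedException') throw err;
//   // otherwise: retry detected, safely ignore
// }
```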
AWS Lambda error handling can be done in different ways, such as using wrappers. AWS Step Functions, however, have proved incredibly useful for building serverless applications that deal with retries and errors appropriately. You can learn more in our Ultimate Guide to Step Functions.
Say the application has to perform multiple operations in response to an event. If you combine them all in the same Lambda, the code has to check each operation separately: when one operation fails and the Lambda is retried, which operations should be redone? Keeping such a Lambda idempotent quickly becomes difficult.
Remember that this can cause severe headaches. Note the difference from a monolithic application: a monolith can orchestrate its own retries because it is able to wait between them, something a short-lived Lambda cannot do.
With Step Functions, however, you can run every operation in a separate Lambda and define suitable transitions between them for each case. You can also control the retry behavior, both the delay duration and the number of attempts, and quickly tune it to your particular use case, or disable it entirely when that’s the right call. Even for a single Lambda, creating a state machine may be the most straightforward way to disable unwanted retry behavior.
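In Amazon States Language, per-Task retry behavior is configured with a `Retry` block; the sketch below uses a placeholder Lambda ARN. `IntervalSeconds` sets the initial delay, `BackoffRate` multiplies it on each attempt, and `MaxAttempts` caps the number of retries. Omitting the `Retry` block leaves retries disabled, which is the default for a Task:

```json
{
  "StartAt": "RecordOperation",
  "States": {
    "RecordOperation": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:record-op",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
```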
As you probably know, the available Step Functions triggers are quite limited: API Gateway, or manual execution via the SDK.
To deploy this Lambda, you can use the Serverless framework along with the handy ‘serverless-resources-env’ plugin to pass the state machine ARN to the function. Additionally, use ‘serverless-pseudo-parameters’ and ‘serverless-step-functions’ so you can define the state machine as shown in the example below:
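A rough sketch of the corresponding serverless.yml follows. Service and function names are placeholders, and the exact way ‘serverless-resources-env’ exposes the state machine ARN to the function should be checked against that plugin’s documentation:

```yaml
service: retry-demo              # placeholder service name

plugins:
  - serverless-step-functions
  - serverless-pseudo-parameters
  - serverless-resources-env

provider:
  name: aws
  runtime: nodejs12.x

functions:
  invoker:
    # Lambda that starts the state machine execution. The
    # serverless-resources-env plugin injects the deployed state
    # machine ARN as an environment variable for this function.
    handler: invoker.handler

stepFunctions:
  stateMachines:
    mainMachine:
      definition:
        StartAt: FirstStep
        States:
          FirstStep:
            Type: Task
            # pseudo-parameters resolve region/account at deploy time
            Resource: arn:aws:lambda:#{AWS::Region}:#{AWS::AccountId}:function:first-step
            End: true
```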
The SNS event is an arbitrary choice made purely to trigger the state machine, and it’s accessible as input to the first-step Lambda. Everything becomes idempotent because we deliberately name the state machine execution after the invoker Lambda’s request ID. If a retry happens to this invoker Lambda, AWS will give it the same request ID.
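A sketch of the invoker Lambda follows. Execution names must be unique per state machine, so naming the execution after the request ID means an automatic retry (which reuses the same ID) cannot start a second execution. The `STATE_MACHINE_ARN` environment variable is an assumption about how the ARN is passed in:

```javascript
// Build the startExecution parameters: the execution is named after
// the Lambda request ID so a retry cannot start a duplicate execution.
function buildExecutionParams(context, event) {
  return {
    stateMachineArn: process.env.STATE_MACHINE_ARN,
    name: context.awsRequestId,        // execution names must be unique
    input: JSON.stringify(event),
  };
}

// Usage in the handler (not executed here):
// exports.handler = async (event, context) => {
//   const stepfunctions = new (require('aws-sdk')).StepFunctions();
//   try {
//     await stepfunctions
//       .startExecution(buildExecutionParams(context, event))
//       .promise();
//   } catch (err) {
//     // On a retry, AWS rejects the duplicate execution name:
//     if (err.code !== 'ExecutionAlreadyExists') throw err;
//   }
// };
```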
On a retry, AWS will refuse to execute the state machine again because an execution with the same name already exists. The execution name is also available as part of the state machine’s input. While this solution works well in many scenarios, be aware that it adds a significant complexity overhead, which affects the system’s overall observability and debugging.
It’s essential to understand that the Step Functions error handling mechanism differs from the AWS Lambda error handling mechanism. For each Task state, a timeout duration can be set; if the Task isn’t completed in time, a States.Timeout error is generated. This timeout is effectively unlimited by default. However, in the typical case of a Task that executes a Lambda, it won’t help on its own: the Lambda’s actual timeout is determined solely by its own configured value and cannot be extended this way. Therefore, make sure the Task timeout is configured to equal the Lambda’s timeout. The Task’s retry behavior is disabled by default and can be configured explicitly.
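A minimal sketch of aligning the Task timeout with the Lambda’s own timeout (here assumed to be 30 seconds); the ARN and state names are placeholders. If the Task exceeds the limit, the resulting States.Timeout error can be caught and routed to a handler state:

```json
{
  "CallLambda": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-fn",
    "TimeoutSeconds": 30,
    "Catch": [
      { "ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout" }
    ],
    "End": true
  }
}
```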
The Error Processor sample application demonstrates using AWS Lambda to handle events from a CloudWatch Logs subscription. CloudWatch Logs lets you invoke a Lambda function when a log entry matches a particular pattern. The subscription in this application monitors a function’s log group for entries containing the word ERROR and, in response, invokes a processor Lambda function. The processor function then retrieves the full log stream and trace data for the request that caused the error and stores them for later use.
The function code for both Lambdas is included in the sample’s repository.
You can quickly deploy the sample within a few minutes via AWS CloudFormation and AWS CLI.
This sample application utilizes these AWS services:

- AWS Lambda
- Amazon CloudWatch Logs
- AWS X-Ray
- Amazon S3
- AWS CloudFormation
The application includes a Lambda function that randomly generates errors. When CloudWatch Logs detects the word ERROR in the function’s logs, it sends an event to the processor function for processing.
When decoded, the event data contains details about the log event. The function uses these details to identify the log stream and parses the log message to obtain the ID of the request that caused the error.
The processor function uses the information from the CloudWatch Logs event to download the X-Ray trace and the full log stream for the request that caused the error, and stores both in an Amazon S3 bucket. To allow the trace and log stream time to finalize, the function waits for a short period before accessing the data.
The application utilizes AWS X-Ray to trace function invocations and all the calls that functions make to AWS. X-Ray utilizes the received trace data from functions to create a service map that is of significant help for error identification. This particular service map showcases the random error function that generates errors for some specific requests. Additionally, it showcases the processor function that calls CloudWatch Logs, Amazon S3, and X-Ray.
Both Node.js functions are configured for active tracing in the template and are instrumented with the AWS X-Ray SDK for Node.js in their code. With active tracing, Lambda adds a tracing header to incoming requests and sends a trace with timing details to AWS X-Ray. The random error function additionally uses the X-Ray SDK to record the request ID and user information in annotations. Annotations are attached to the trace, so you can use them to locate the trace for a specific request.
The processor function will obtain the request ID from the CloudWatch Logs event, and it’ll utilize the AWS SDK for JavaScript to search X-Ray for that particular request. It also utilizes AWS SDK clients, which are instrumented by the X-Ray SDK to download the log stream and the trace. After that, it’ll store them in the output bucket. The X-Ray SDK will record all these calls, and they’ll appear within the trace as subsegments.
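One way to sketch that lookup is a filter expression on the annotation recorded by the random error function. The annotation key `RequestID` and the parameter values are assumptions for illustration:

```javascript
// Build parameters for the X-Ray GetTraceSummaries API, filtering by
// the request-ID annotation recorded on the trace. StartTime/EndTime
// bound the search window (epoch seconds or Date objects).
function buildTraceQuery(requestId, start, end) {
  return {
    StartTime: start,
    EndTime: end,
    FilterExpression: `annotation.RequestID = "${requestId}"`,
  };
}

// Usage with the AWS SDK (not executed here):
// const xray = new (require('aws-sdk')).XRay();
// const summaries = await xray
//   .getTraceSummaries(buildTraceQuery(requestId, start, end))
//   .promise();
```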
The application is implemented in two Node.js modules and deployed with shell scripts and an AWS CloudFormation template. The template creates the processor function, the random error function, and supporting resources, including a primer function, a CloudWatch Logs subscription, a resource-based policy, and an S3 output bucket.
To work around limitations of integrating Lambda and CloudWatch Logs with CloudFormation, the template includes an additional function that runs during deployments. Every Lambda function gets a CloudWatch log group that stores the output of its executions, but the log group isn’t created until the function is invoked for the first time. Because the subscription depends on the log group’s existence, the application uses this third Lambda function, the primer, to invoke the random error function during deployment. The template includes the primer function’s code inline, and an AWS CloudFormation custom resource invokes it during deployment. DependsOn properties ensure that the log group and the resource-based policy are created before the subscription.
Serverless architectures fundamentally change how we develop, deploy, and monitor applications. As you now know, services such as AWS Lambda also come with their own limits and idiosyncrasies: limited memory and execution time, retry-behavior, and many others may create side-effects that can easily become monitoring nightmares.
Composing multiple services for compute, data storage, queues, etc. magnifies the problem. The number of potential issues is multiplied by the interactions and dependencies throughout the cloud stack.
Running such architectures at scale is even more challenging. We cannot expect the stack to behave homogeneously at every level of traffic; AWS Lambda functions may scale faster than a downstream database, for example.
Dashbird is designed to provide developers with ways to easily navigate such complex problems while achieving a high degree of visibility and quality in any serverless architecture.
Dashbird was created by serverless developers, for serverless developers to enhance monitoring and operating specifically for AWS services at scale. By continuously collecting and filtering your log data, Dashbird automatically detects all your code exceptions, timeouts, configuration errors, and other anomalies in real-time, and sends you a notification immediately if there’s an error or something is about to break.
On top of that, the platform will give you actionable insights based on the AWS Well-Architected Framework to help you improve your infrastructure and make it truly reliable at any scale with the ability to take on added complexity over time.
You can give Dashbird a try for free:
AWS Lambda error handling in serverless architectures may seem confusing, but as hard as it is to grasp how it affects your entire system, it’s vital to understand it thoroughly. You need to know how to manage AWS Lambda retry behavior, and the same goes for Step Functions. A retry counter field in the context parameter is undoubtedly a missing feature.
Besides the techniques mentioned in this article, there are various other methods that will help with AWS Lambda error handling, and utilizing wrappers is only one example.
The Step Functions architecture we’ve discussed today is useful in many cases, and AWS Lambda error handling is one of them. Besides helping control Lambda retries appropriately, it also encourages the separation of components, an excellent practice in the serverless world.