Debugging serverless applications with Dashbird

Taavi Rehemägi

January 25th, 2019

With AWS Lambda, we get scalability and resilience out-of-the-box. What’s more, AWS also provides built-in monitoring, logging and tracing support through CloudWatch and X-Ray. These built-in tools provide a good starting point but many developers eventually outgrow them as their serverless application becomes more complex.

In this post, let’s take a serverless application and see how Dashbird can help you debug it.

Challenges with serverless observability

When it comes to observability, serverless has introduced some interesting challenges. For so long, we have relied on agents and daemons to collect metrics and logs. They run silently in the background, away from our critical paths where we are concerned with minimizing user-facing latencies. They collect, buffer and publish this observability data in batches to improve efficiency. This practice has been deeply ingrained in how we monitor our applications, until now.

When it comes to serverless, specifically with managed platforms such as AWS Lambda, there’s simply nowhere for us to install these agents!

Collecting metrics and logs as part of your function’s invocation would understandably add overhead to its invocation time. Since AWS already collects logs from your function asynchronously and publishes them to CloudWatch Logs, a common workaround is to subscribe to these logs and post-process them.

Indeed, that is how Dashbird collects data about your function’s execution. It subscribes the CloudWatch Logs log groups to a Kinesis stream and then processes the events from there. You can read about the advantages of this approach in this article.
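To make the log-subscription approach more concrete, here is a minimal sketch of how a consumer might decode one such record. It relies on the documented delivery format (CloudWatch Logs subscription payloads arrive gzip-compressed, and Kinesis record data is base64-encoded when handed to a Lambda consumer). The function names and the sample payload are illustrative assumptions, and this is not a description of Dashbird’s actual pipeline.

```python
import base64
import gzip
import json


def decode_subscription_record(kinesis_data_b64):
    """Decode one Kinesis record carrying a CloudWatch Logs subscription payload."""
    compressed = base64.b64decode(kinesis_data_b64)
    return json.loads(gzip.decompress(compressed))


def extract_messages(payload):
    """Return (log group, timestamp, message) tuples, skipping control messages."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        (payload["logGroup"], event["timestamp"], event["message"])
        for event in payload["logEvents"]
    ]


# Hypothetical sample payload, built here only to exercise the decoder.
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/aws/lambda/create-post",
    "logStream": "example-stream",
    "logEvents": [
        {"id": "1", "timestamp": 1548374400000, "message": "START RequestId: abc"},
    ],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")

for log_group, timestamp, message in extract_messages(decode_subscription_record(encoded)):
    print(log_group, timestamp, message)
```

Because the decoding happens downstream of the log stream, none of this work runs inside the monitored function’s own invocation.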

As our serverless applications become more complex, it’s important for us to be able to trace executions across multiple functions. As the demo app demonstrates, even a simple user transaction can span across multiple event sources as well as Lambda functions.

The demo app

Imagine you’re building a Twitter clone. One of the core features of the system is to distribute a user’s post to their followers’ feeds. To implement this feature, imagine we have two separate API endpoints:

POST posts/create : to create a new post for the current user
GET followers/{userId} : to fetch a user’s followers

Each endpoint is handled by a separate Lambda function – create-post and get-followers respectively.

When a user publishes a new post, the create-post function would save the post in the posts DynamoDB table and also publish a post-created event into a Kinesis stream. This event then triggers a distribute-post function. This function would query the GET followers/{userId} endpoint and then add the post to the followers’ feeds. The get-followers function would query the followers DynamoDB table to return the IDs of the user’s followers.
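The flow described above can be sketched roughly as follows. This is not the demo app’s source code: the function signatures, the field names and the injected callables (`save_post`, `publish_event`, `get_followers`, `add_to_feed`) are hypothetical stand-ins for the DynamoDB, Kinesis and HTTP calls, so the sketch runs without any AWS resources.

```python
import uuid


def create_post(user_id, text, save_post, publish_event):
    """Save the post, then announce it with a post-created event."""
    post = {"postId": str(uuid.uuid4()), "userId": user_id, "text": text}
    save_post(post)                                         # stand-in for the posts table write
    publish_event({"type": "post-created", "post": post})   # stand-in for the Kinesis publish
    return post


def distribute_post(event, get_followers, add_to_feed):
    """React to a post-created event by fanning the post out to followers."""
    post = event["post"]
    followers = get_followers(post["userId"])  # stand-in for GET followers/{userId}
    for follower_id in followers:
        add_to_feed(follower_id, post)
    return len(followers)


# In-memory stand-ins so the whole flow can be exercised locally.
posts_table, stream, feeds = [], [], {}
followers_table = {"alice": ["bob", "carol"]}

create_post("alice", "hello world", posts_table.append, stream.append)
for event in stream:
    distribute_post(
        event,
        lambda user_id: followers_table.get(user_id, []),
        lambda follower_id, post: feeds.setdefault(follower_id, []).append(post),
    )
print(sorted(feeds))
```

Even in this simplified form, one user action touches two functions directly and a third through the followers lookup, which is why tracing across functions matters.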

For brevity’s sake, I have omitted the logic for actually distributing the posts. The overall architecture for our demo app looks like the following.

To make things more interesting, each of the Lambda functions is hardwired to error or time out based on a configurable probability. The source code for the demo app is available on GitHub, so feel free to try it out yourself.
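One way to wire in this kind of failure injection is a wrapper around the handler, sketched below. The wrapper name, its parameters and the `InjectedError` class are assumptions made for illustration; the demo app’s own mechanism may look quite different.

```python
import random
import time


class InjectedError(Exception):
    """Raised when the wrapper decides this invocation should fail."""


def with_chaos(handler, error_rate, timeout_rate, delay_seconds,
               rng=random.random, sleep=time.sleep):
    """Wrap a handler so it errors or stalls with the given probabilities."""
    def wrapped(event, context):
        roll = rng()
        if roll < error_rate:
            raise InjectedError("injected failure")
        if roll < error_rate + timeout_rate:
            # Stall long enough for the platform's configured timeout to fire.
            sleep(delay_seconds)
        return handler(event, context)
    return wrapped


def handler(event, context):
    return {"statusCode": 200}


# Hypothetical rates: 10% errors, 10% stalls of 10 seconds each.
chaotic_handler = with_chaos(handler, error_rate=0.1, timeout_rate=0.1, delay_seconds=10)
```

Passing `rng` and `sleep` as parameters keeps the wrapper deterministic under test, where real randomness and real sleeping would be unwelcome.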

Introducing Dashbird

Even with a simple serverless application like the one outlined above, we have quite a few functions to look after. Let’s see how we can use Dashbird to help us monitor this application and debug issues.

As soon as I log in, I have a high level dashboard for my account.

In addition to the data I get in the AWS Lambda console (see below), the Dashbird dashboard has two useful data points:

Average memory utilization for the functions
Cost for the Lambda invocations
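Both data points can in principle be derived from the `REPORT` line that Lambda writes to CloudWatch Logs at the end of every invocation. The sketch below shows one way to do that. It is an illustration and not Dashbird’s implementation: the helper names are made up, the price per GB-second is a parameter you would fill in from current AWS pricing, and the per-request charge is ignored.

```python
import re

# Matches the tab-separated fields of a Lambda REPORT log line.
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<size>\d+) MB\s+"
    r"Max Memory Used: (?P<used>\d+) MB"
)


def parse_report(line):
    """Extract duration and memory figures from a REPORT line, or return None."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "duration_ms": float(match["duration"]),
        "billed_ms": float(match["billed"]),
        "memory_size_mb": int(match["size"]),
        "max_memory_used_mb": int(match["used"]),
    }


def memory_utilization(report):
    """Fraction of the configured memory that the invocation actually used."""
    return report["max_memory_used_mb"] / report["memory_size_mb"]


def compute_cost(report, price_per_gb_second):
    """Compute cost of one invocation: billed seconds x configured GB x price."""
    gb_seconds = (report["billed_ms"] / 1000) * (report["memory_size_mb"] / 1024)
    return gb_seconds * price_per_gb_second


line = ("REPORT RequestId: abc\tDuration: 102.25 ms\tBilled Duration: 200 ms "
        "\tMemory Size: 128 MB\tMax Memory Used: 35 MB\t")
report = parse_report(line)
print(memory_utilization(report))
```

Averaging `memory_utilization` over many invocations gives a signal for whether a function is over-provisioned.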

Next, in Dashbird’s Lambda console, I can see a high level summary of my functions and their activities over the last 24 hours.

What I find very useful here is the fact that it highlights functions that have been idle for 10 days as inactive. As your serverless architecture expands and you end up with hundreds of functions, maintained by different teams, it’s very difficult to track which functions are no longer needed. Having redundant functions lying around is a security risk as they remain an attack surface that might be exploited.

This view alone cannot tell you definitively which functions are no longer used, because many functions are not run on a regular basis. Maybe they are part of a cron job that only runs once a month. Or maybe they are only used during disaster recovery scenarios. Nonetheless, being aware of which functions are inactive encourages teams to ask the question “Is this function still needed?”. From here, maybe better practices can emerge. For example, use tags to mark functions that are expected to be used sparingly so they are not flagged by these checks.
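A check along these lines, including the tag-based opt-out, could look like the following sketch. The record shape, the `expected-idle` tag name and the function name are all assumptions made for illustration; only the 10-day threshold comes from the behaviour described above.

```python
from datetime import datetime, timedelta, timezone


def find_inactive(functions, now, idle_after=timedelta(days=10), exempt_tag="expected-idle"):
    """Return names of functions idle for at least `idle_after`, unless tagged as exempt."""
    inactive = []
    for fn in functions:
        if fn.get("tags", {}).get(exempt_tag) == "true":
            continue  # deliberately sparse function, e.g. monthly cron or disaster recovery
        last = fn.get("last_invoked")
        if last is None or now - last >= idle_after:
            inactive.append(fn["name"])
    return inactive


now = datetime(2019, 1, 25, tzinfo=timezone.utc)
functions = [
    {"name": "create-post", "last_invoked": now - timedelta(hours=1)},
    {"name": "old-importer", "last_invoked": now - timedelta(days=40)},
    {"name": "dr-failover", "last_invoked": None, "tags": {"expected-idle": "true"}},
]
print(find_inactive(functions, now))
```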

If I navigate to one of the functions, then I have a function-centric view of invocation time, error, cost and memory utilization. In addition, I can also see a list of the recent invocations. What’s really useful here is that cold starts and retries are clearly labelled. When debugging live issues this lets me quickly narrow down the invocations that I need to pay attention to.

Straight away, I can see that 3 of the invocations timed out after 6 seconds. What’s more, the original Kinesis event was retried 3 times and finally succeeded on the third retry.

If I click on the “+” button next to an invocation then I can drill down to the invocation itself. Here I can see the logs and X-Ray trace for this invocation in one screen. This is great as it saves me from having to constantly jump between different AWS consoles.

Debugging with Dashbird

As mentioned before, the demo app is hardwired to error and time out. And sure enough, when these failure cases happened, Dashbird’s built-in alerting kicked in and I promptly received emails notifying me that something went wrong.

While this built-in alerting is great, I couldn’t see any settings to adjust the alert sensitivity.

I followed the links in the emails to the function, and then to the failed invocation. Dashbird neatly groups the related invocations (the initial timeout and the subsequent retries) together. I can quickly see that the Kinesis event was successfully processed on the 3rd retry.
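To illustrate what grouping retries involves, here is a minimal sketch that collects invocations sharing the same source event and reports how each group ended. How Dashbird actually correlates retries is not described here; the `event_id`, `started_at` and `status` fields are hypothetical.

```python
from collections import defaultdict


def group_attempts(invocations):
    """Group invocations by source event and summarise attempts and final outcome."""
    groups = defaultdict(list)
    for invocation in invocations:
        groups[invocation["event_id"]].append(invocation)

    summary = {}
    for event_id, attempts in groups.items():
        attempts.sort(key=lambda invocation: invocation["started_at"])
        summary[event_id] = {
            "attempts": len(attempts),
            "retries": len(attempts) - 1,          # everything after the first attempt
            "final_status": attempts[-1]["status"],
        }
    return summary


# Hypothetical data: an initial timeout, two more timeouts, then a success on the third retry.
invocations = [
    {"event_id": "batch-1", "started_at": 0, "status": "timeout"},
    {"event_id": "batch-1", "started_at": 7, "status": "timeout"},
    {"event_id": "batch-1", "started_at": 14, "status": "timeout"},
    {"event_id": "batch-1", "started_at": 21, "status": "ok"},
    {"event_id": "batch-2", "started_at": 3, "status": "ok"},
]
print(group_attempts(invocations))
```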

Dashbird also tracks the open issues in the Errors console. Now that I know the problem has resolved itself I can go ahead and resolve the error.

Tailing function logs

Another nice feature of Dashbird is that it’s able to tail the logs for multiple functions at the same time. For the demo app, I want to see the logs for both the create-post and distribute-post functions as I curl the POST posts/create endpoint.

That way, I can see that the event was successfully published into the Kinesis stream, and was subsequently received by the distribute-post function.
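Conceptually, tailing several functions at once is a timestamp-ordered merge of their log streams. The sketch below shows that idea with in-memory data; the function name, the log messages and the timestamps are made up, and each stream is assumed to be sorted by timestamp already.

```python
import heapq


def merge_logs(streams):
    """Merge per-function log streams into one list ordered by timestamp.

    `streams` maps a function name to a list of (timestamp, message) pairs,
    each list already sorted by timestamp.
    """
    tagged = (
        [(timestamp, name, message) for timestamp, message in events]
        for name, events in streams.items()
    )
    return list(heapq.merge(*tagged))


streams = {
    "create-post": [(100, "saved post"), (105, "published post-created event")],
    "distribute-post": [(130, "received post-created event")],
}
for timestamp, name, message in merge_logs(streams):
    print(timestamp, name, message)
```

Reading the merged output top to bottom shows the event leaving one function and arriving at the next.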

Conclusion

Overall I was impressed with what Dashbird has to offer, and it’s clear that a lot of thought has gone into the product. It has many nice touches that make debugging much easier, such as grouping retries together and integrating X-Ray traces and logs in one screen. These might seem like trivial niceties, but they can make a big difference under the high-pressure scenarios of dealing with a live issue.

From what I have been able to see, I think Dashbird is a really great tool. The main thing missing for me is the ability to trace executions end-to-end. Personally, I’d really like to see the entire execution traced from the API call to create a post, all the way to the get-followers function performing a Query against the followers DynamoDB table.

If you’re solely relying on the built-in AWS tools (CloudWatch, CloudWatch Logs, and X-Ray) then you should give Dashbird a try. Why not sign up for a free trial, and deploy the demo app to your environment so you can see how Dashbird can help you debug your serverless application?
