Passing the “Is it Working?” Test with Serverless Architectures Is Not Enough

Mariliis Retter

December 14th, 2020

This post is published with the author’s, Paul Singman’s, approval. Original post here.

Setting the Scene

Say you are an awesome developer sitting contentedly at your desk when a Slack message suddenly interrupts your peaceful mental flow:

It would appear there is a data issue with the new Activity History service released last month… Or at least a couple people think there is.

Now, instead of making progress on new tasks, you need to drop them and look into what’s happening here.

Sigh.

Setting up the Problem

What this Activity History service does is calculate and then expose counts of how many times users have used the company’s application.

If we’re Netflix, it’s how many episodes a user’s watched. If we’re Spotify, perhaps this powers their popular Year In Review feature that shows how many minutes you’ve listened this year. [Answer: A lot.]

It is powered by a modern, serverless pipeline built on AWS, with the architecture described below.

The way this works is that user activity gets POSTed to an ingestion API Gateway endpoint. Backing the API is a Lambda function that writes the data to a Kinesis Data Stream for temporary storage. Next, a second Lambda function is invoked to validate the schema of the ingested data. If the data looks good, it is written to a DynamoDB table that holds the activity events for all users.
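To make the validation step concrete, here is a minimal sketch of what the Lambda reading from Kinesis might do. The field names in `REQUIRED_FIELDS`, the `handle_batch` function, and the injected `write_item` callable (a stand-in for the DynamoDB write) are all assumptions for illustration and not the actual service code. The one detail taken from AWS itself is that Lambda delivers each Kinesis payload base64-encoded under `record["kinesis"]["data"]`.

```python
import base64
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"user_id": str, "activity_type": str, "timestamp": int}


def is_valid(activity: dict) -> bool:
    """Return True when every required field is present with the expected type."""
    return all(
        isinstance(activity.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def handle_batch(kinesis_event: dict, write_item) -> dict:
    """Decode each Kinesis record, validate it, and hand valid ones to write_item.

    write_item is a hypothetical stand-in for the DynamoDB write.
    """
    written, rejected = 0, 0
    for record in kinesis_event.get("Records", []):
        # Kinesis payloads arrive base64-encoded in the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            activity = json.loads(raw)
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            rejected += 1
            continue
        if isinstance(activity, dict) and is_valid(activity):
            write_item(activity)
            written += 1
        else:
            rejected += 1
    return {"written": written, "rejected": rejected}
```

Returning the written and rejected counts gives the function something simple to log, which is the kind of signal you would want when checking this stage later.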

Finally, we have an API Gateway endpoint backed by a Lambda that is responsible for fetching and aggregating records for a user to be shown on an Activity History screen in the mobile app.
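The aggregation behind that endpoint can be pictured as a simple count over a user's stored events. The sketch below assumes the records have already been fetched from the table as a list of dictionaries with an `activity_type` field; both the field name and the `summarize_activity` function are hypothetical.

```python
from collections import Counter


def summarize_activity(items: list) -> dict:
    """Count a user's activity events overall and per activity type."""
    counts = Counter(item["activity_type"] for item in items)
    return {"total": sum(counts.values()), "by_type": dict(counts)}


# Illustrative records only, not real data.
events = [
    {"user_id": "u1", "activity_type": "episode_watched"},
    {"user_id": "u1", "activity_type": "episode_watched"},
    {"user_id": "u1", "activity_type": "search"},
]
summary = summarize_activity(events)
```

If the number shown in the app looks wrong, comparing this summary against the raw records in the table is one quick way to tell whether the problem is in aggregation or further upstream.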

In my experience, this is a typical serverless architecture for an app that contains these types of features.

Anyway, to debug such a system and respond confidently to the inquiry from Slack, there are a number of things we should check:

Is the API endpoint working and if so, what value does it return for this user?
Is the Lambda function backing the API returning successfully?
What value is stored in the DynamoDB table for this user?
Is the Lambda function that validates data and writes to Dynamo experiencing any issues?
How is the performance of the Kinesis Data Stream that triggers the Lambda?
And are there any errors or latency in the Lambda ingesting data and writing to Kinesis?

That’s a lot to do, no?

Make no mistake…

AWS deserves praise for creating the services that enable such functionality to be possible in the first place. However, we can also admit that the out-of-the-box monitoring tools like CloudWatch Logs and Metrics don’t make debugging tasks like the one delineated above easy.

Speaking personally, after building and maintaining serverless architectures over the last several years, I have found it crucial to be able to debug them quickly, at least if you want to take full advantage of the fast development speed serverless promises rather than spend most of your time looking into problems as they arise.

What’s the Solution?

The issue is that for each debugging step, there’s an isolated log group or metric graph to inspect, and frankly you’ll drive yourself crazy trying to pull up each one in a separate browser tab to identify the location of the issue.

A better approach would be to have access to a single centralized location from which you get a pulse on the recent behavior of all AWS resources. This allows you to narrow the scope of your investigation into the problem.
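As a rough illustration of the idea, the sketch below gathers one health record per resource into a single list and flags the resources worth investigating first. The `ResourceHealth` class, the `triage` function, the 1% threshold, and the sample numbers are all assumptions made for this example; in practice the counts would come from whatever metrics source you use.

```python
from dataclasses import dataclass


@dataclass
class ResourceHealth:
    name: str
    invocations: int
    errors: int

    @property
    def error_rate(self) -> float:
        """Fraction of invocations that failed (0.0 when there was no traffic)."""
        return self.errors / self.invocations if self.invocations else 0.0


def triage(resources, threshold=0.01):
    """Return resources whose error rate exceeds the threshold, worst first."""
    flagged = [r for r in resources if r.error_rate > threshold]
    return sorted(flagged, key=lambda r: r.error_rate, reverse=True)


# Illustrative numbers only.
pipeline = [
    ResourceHealth("ingestion-lambda", invocations=1000, errors=50),
    ResourceHealth("validation-lambda", invocations=1000, errors=0),
    ResourceHealth("history-api-lambda", invocations=500, errors=1),
]
suspects = [r.name for r in triage(pipeline)]
```

With every resource in one view, the investigation starts at the flagged resource instead of at six separate log groups.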

As with everything, you can either build this yourself or buy an existing solution. After looking into what’s out there, I ended up hooking up the serverless monitoring and intelligence tool Dashbird to my AWS account.

Within 5 minutes a whole host of new functionalities and insights were unlocked about my serverless resources!

Now in a single browser tab, any issues that arise across our example Activity History Pipeline become visible. On the Dashbird Insights tab, we can see ConnectionErrors occurring in the Ingestion Lambda. And now we can dig further into the Logs and Performance of that Lambda function specifically, and in a short time triage the issue raised at the beginning of the article.

Had the source of the error instead been in an API Gateway endpoint, Kinesis Data Stream, or DynamoDB table, Dashbird also has dashboards that show problematic behavior in each of these services.

The Moral of the Story

If you are a developer of an application or data pipeline using serverless architectures, it can be exciting to get your first project up and running. The beauty of the modern cloud is how you can stitch together resources like an artist to achieve functionality in a seamless, integrated way.

In some sense, we’ve reached cloud-computing nirvana with Function-as-a-Service offerings like AWS Lambda that integrate with countless other services and are billed on a 1-millisecond basis.

If this is a development pattern you want to take full advantage of beyond an initial single process, it is imperative to extend your monitoring capabilities beyond the AWS defaults.

No matter how good you get at querying CloudWatch Logs, viewing Lambda metric graphs, or managing capacity on a DynamoDB table, once the number of resources under your watch grows into the tens or even hundreds, the time it takes to fix errors in the system will increase to untenable levels.

And what seemed like fun at the beginning will become a headache, until finally you wake up one morning and admit, “There must be a better way!”

If you find yourself spending more and more of your time looking into the performance of your serverless pipelines — or worse, not checking them at all — I recommend integrating with Dashbird or a similar tool.

For Dashbird specifically, you can learn more about their services on their website. If you come across other solutions in this space, I’m interested to hear about them as well!
