Passing the “Is it Working?” Test with Serverless Architectures Is Not Enough
Setting the Scene
Say you are an awesome developer sitting contentedly at your desk when a Slack message suddenly interrupts your peaceful mental flow:
It would appear there is a data issue with the new Activity History service released last month… Or at least a couple of people think there is.
Now, instead of making progress on new tasks, you need to drop them and look into what’s happening here.
Setting up the Problem
The Activity History service calculates and exposes counts of how many times users have used the company’s application.
If we’re Netflix, it’s how many episodes a user’s watched. If we’re Spotify, perhaps this powers their popular Year In Review feature that shows how many minutes you’ve listened this year. [Answer: A lot.]
It is powered by a modern, serverless pipeline built on AWS, with an architecture that looks like this:
The way this works is that user activity gets POSTed to an ingestion API Gateway endpoint. Backing the API is a Lambda function that writes the data to a Kinesis Data Stream for temporary storage. Next, a Lambda function is invoked to validate the schema of the ingested data. If it looks good, the event is written to a DynamoDB table that holds the activity events for all users.
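To make the validation step concrete, here is a minimal sketch of what the schema check in the validation Lambda might look like. The field names (`user_id`, `event_type`, `timestamp`) and the handler shape are assumptions for illustration, not taken from the actual service; the record decoding does follow the base64-encoded payload format that Kinesis-triggered Lambdas receive.

```python
# Hypothetical sketch of the validation Lambda's schema check.
# Field names (user_id, event_type, timestamp) are assumptions for illustration.
import base64
import json

REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}


def validate_activity_event(event: dict) -> bool:
    """Return True if the ingested record has all required fields."""
    return REQUIRED_FIELDS.issubset(event)


def handler(kinesis_event, context=None):
    """Triggered by the Kinesis stream; returns the records that pass validation.

    In the real pipeline the valid records would be written to DynamoDB;
    here we just collect and return them so the logic is easy to test locally.
    """
    valid = []
    for record in kinesis_event["Records"]:
        # Kinesis delivers each payload base64-encoded inside the record.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if validate_activity_event(payload):
            valid.append(payload)
    return valid
```

Keeping the validation rule in a small pure function like this also makes it trivial to reproduce a “bad record” locally when debugging, without touching AWS at all.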
Finally, we have an API Gateway endpoint backed by a Lambda that is responsible for fetching and aggregating records for a user to be shown on an Activity History screen in the mobile app.
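As a sketch of that aggregation step, the read-side Lambda might summarize a user’s stored events along these lines. The item shape here is a guess for illustration; the real table schema isn’t shown in this article.

```python
# Hypothetical sketch of the read-side aggregation logic.
# Assumes each DynamoDB item has an "event_type" attribute;
# the real table schema may differ.
from collections import Counter


def aggregate_activity(items: list) -> dict:
    """Summarize a user's activity items into a total and per-event-type counts."""
    counts = Counter(item["event_type"] for item in items)
    return {"total": sum(counts.values()), "by_type": dict(counts)}
```

In the real handler, `items` would come from a DynamoDB Query on the user’s partition key; the aggregation itself is just counting, which makes it easy to sanity-check locally when a user reports a wrong number.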
In my experience, this is a typical serverless architecture for an app that contains these types of features.
Anyway, to debug such a system and respond confidently to the inquiry from Slack, there are a number of things we should check:
- Is the API endpoint working and if so, what value does it return for this user?
- Is the Lambda function backing the API returning successfully?
- What value is stored in the DynamoDB table for this user?
- Is the Lambda function that validates data and writes to Dynamo experiencing any issues?
- How is the performance of the Kinesis Data Stream that triggers the Lambda?
- And are there any errors or latency in the Lambda ingesting data and writing to Kinesis?
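One sub-step that recurs throughout a checklist like this is scanning a Lambda’s recent log events for error lines. A tiny helper such as the following shows the idea; the `message` field matches the shape of CloudWatch log events (e.g. as returned by the `filter_log_events` API), while the specific error markers are just illustrative assumptions.

```python
# Minimal sketch: scan log events (e.g. fetched from CloudWatch Logs)
# for lines that look like errors. The marker strings are illustrative.
ERROR_MARKERS = ("ERROR", "Task timed out", "Traceback")


def find_error_events(events: list) -> list:
    """Return only the log events whose message looks like an error."""
    return [
        e for e in events
        if any(marker in e.get("message", "") for marker in ERROR_MARKERS)
    ]
```

Even with a helper like this, you would still have to run it against each log group in the pipeline separately, which is exactly the pain point discussed next.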
That’s a lot to do, no?
Make no mistake…
AWS deserves praise for creating the services that make such functionality possible in the first place. However, we can also admit that the out-of-the-box monitoring tools like CloudWatch Logs and Metrics don’t make debugging tasks like the one delineated above easy.
And speaking personally, having built and maintained serverless architectures over the last several years, I can say that being able to debug them quickly is crucial. At least if you want to take full advantage of the fast development speed serverless can promise, and not spend most of your time looking into problems that arise.
What’s the Solution?
The issue is that for each debugging step, there’s an isolated log group or metric graph to inspect, and frankly you’ll drive yourself crazy trying to pull up each one in a separate browser tab to identify the location of the issue.
A better approach would be to have access to a single centralized location from which you get a pulse on the recent behavior of all AWS resources. This allows you to narrow the scope of your investigation into the problem.
As with everything, this is either something you can build yourself or something you can buy if a solution already exists. After looking into what’s out there, I ended up hooking a serverless monitoring and intelligence tool, Dashbird, up to my AWS account.
Within 5 minutes a whole host of new functionalities and insights were unlocked about my serverless resources!
Now in a single browser tab, any issues that arise across our example Activity History Pipeline become visible. On the Dashbird Insights tab, we can see ConnectionErrors occurring in the Ingestion Lambda. And now we can dig further into the Logs and Performance of that Lambda function specifically, and in a short time triage the issue raised at the beginning of the article.
Had the source of the error been in an API Gateway endpoint, Kinesis Data Stream, or DynamoDB table instead, Dashbird also contains dashboards showing problematic behavior in any of these services.
The Moral of the Story
If you are a developer of an application or data pipeline using serverless architectures, it can be exciting to get your first project up and running. The beauty of the modern cloud is how you can stitch together resources like an artist to achieve functionality in a seamless, integrated way.
In some sense, we’ve reached cloud-computing nirvana with Function-as-a-Service offerings like AWS Lambda that integrate with countless other services and are billed on a 1-millisecond basis.
If this is a development pattern you want to take full advantage of beyond an initial single process, it is imperative to extend your monitoring capabilities beyond the AWS defaults.
No matter how good you get at querying CloudWatch Logs, viewing Lambda metric graphs, or managing capacity on a DynamoDB table, once the number of resources under your watch grows into the tens or even hundreds, the time it takes to find and fix errors in the system will grow to untenable levels.
And what seemed like fun at the beginning will become a headache, until finally you wake up one morning and admit, “There must be a better way!”
If you find yourself spending more and more of your time looking into the performance of your serverless pipelines — or worse, not checking them at all — I recommend integrating with Dashbird or a similar tool.
For Dashbird specifically, you can learn more about their services on their website. If you come across other solutions in this space, I’m interested to hear about them as well!