Why You Should Stop Hoarding Metrics

Kay Plößer

July 21st, 2021

Serverless lets you deploy applications far away in a data center of a cloud provider. This relieves you of the lion’s share of operational burdens. The more you buy into your cloud provider’s ecosystem, the less you have to do yourself: no more OS updates or database bugfix installations.

But you still need to do some operation-related work on your own. For instance, monitoring your application to know what’s going on in that far away data center.

Usually, the monitoring journey of a new software product in the cloud goes like this:

The first version gets built with just basic monitoring capabilities or worse, without any monitoring. Then things go wrong as they always do, and nobody is really sure why. After debugging the problem, people get the idea that they didn’t have enough metrics. Depending on the severity of the issue, this leads either to more metrics added to the monitoring setup with every new incident or an overkill solution where everything is monitored.

Either way, after a sufficient number of issues with the application, so many metrics and alerts are set up that you have gone from no insight into your infrastructure to so much that the important parts drown in the sheer volume of metrics and alerts.

Why are Too Many Metrics and Alerts Bad?

If you encounter grave failures that threaten your business, you become cautious. You don’t want anything to slip through the cracks, so you add everything you can find. But in the end, all that extra data drowns out the important data that could have saved you from the next incident.

If you can’t see the forest for the trees, you’re back at square one, with no insights. The crucial question is: how much data can you and your team, the humans looking at the monitoring dashboards, reasonably perceive?

All the metrics you don’t need can distract you from the important parts. They can also lead you to optimize for metrics that don’t matter.

Should You Get Rid of Your Metrics?

Short answer: No, you shouldn’t.

Long answer: All these metrics are potentially obscuring important information. Should you stop storing them altogether? Well, the problem isn’t that the metrics are saved somewhere. While storage can become a financial and performance problem under certain circumstances, it isn’t a direct problem for your operations. The issue is that your teams can’t make sense of them. Depending on future requirements, it could be good to have the metrics stored somewhere, so you can display them when needed.

Instead, cut back on the dashboards and graphs for these metrics to relieve your team of the information overload.

Think in Terms of SLAs, SLOs, and SLIs

A Service Level Agreement (SLA) includes promises you made to your customers in your contracts with them. For example, your service will respond within one second for 99% of all requests. You are legally bound to that promise, so it must never be broken. You can look at the AWS Lambda SLA to see what this means in practice. AWS loses real money when their service is down for too long.
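To make the 99% example concrete, here is a minimal Python sketch of the arithmetic behind such a promise. The helper name `allowed_slow_requests` and the traffic figure are illustrative assumptions, not taken from any real SLA.

```python
def allowed_slow_requests(total_requests: int, target_percent: int) -> int:
    """Return how many requests may miss the latency promise.

    Hypothetical helper; integer math avoids floating-point rounding.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return total_requests * (100 - target_percent) // 100


# Assumed traffic: one million requests in the measurement period.
budget = allowed_slow_requests(1_000_000, 99)
print(budget)
```

With a 99% target, only one request in a hundred may be slow, so the room for error shrinks quickly as the target rises.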

A Service Level Objective (SLO) is how you would redefine the SLA to measure it in a meaningful way. In the one-second response example above, this might be easy to measure, but that is not always the case. SLOs are the thresholds you define for your metrics that should not be broken. The successful response rate should not be under a certain ratio; the response latency should not be over a certain value, etc.

In Figure 1, you can see how such an SLO gets set as a Dashbird alert. Here the latency should always be under one second on average.

Figure 1: Dashbird alert for API latency. A critical-level alarm will alert you via email, Slack, SMS, etc., when the API response time is above 1,000 milliseconds (1 second).
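The thresholds described above can also be written down as data. The following Python sketch is only an illustration: the `Slo` class, its fields, and the example values are assumptions and do not reflect how Dashbird stores alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO: a named threshold on one measured value."""

    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, value: float) -> bool:
        # A success rate must stay at or above its threshold,
        # while a latency must stay at or below it.
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Assumed example objectives, mirroring the two cases in the text.
success_rate_slo = Slo("successful response rate", 0.99, higher_is_better=True)
latency_slo = Slo("average latency (ms)", 1000.0, higher_is_better=False)

print(success_rate_slo.is_met(0.995))
print(latency_slo.is_met(1200.0))
```

Keeping the direction of each threshold explicit avoids the common mistake of treating every SLO as an upper limit.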

The Service Level Indicator (SLI) is where the solution to our hoarding problem lies. An SLI consists of one or more important metrics that show whether your system is currently breaking any SLOs. “One or more metrics” is the key phrase. If you can calculate a single value from multiple metrics that shows your SLOs are met, you don’t have to look at every metric to check if things are going well.

In Figure 2, you can see such an SLI, the duration a Lambda function took to execute in milliseconds.

Figure 2: Dashbird function details for AWS Lambda
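As a rough illustration of combining several metrics into one SLI, the sketch below folds status codes and latencies into a single "good request" ratio. The `Request` type, the field names, and the rule that 5xx responses count as failures are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one handled request."""

    status_code: int
    latency_ms: float


def good_request_ratio(
    requests: Sequence[Request], max_latency_ms: float = 1000.0
) -> float:
    """Fraction of requests that succeeded and were fast enough."""
    if not requests:
        # Assumption: with no traffic, nothing has violated the objective.
        return 1.0
    good = sum(
        1
        for r in requests
        if r.status_code < 500 and r.latency_ms <= max_latency_ms
    )
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 1500.0),  # too slow
    Request(500, 90.0),    # server error
    Request(200, 300.0),
]
print(good_request_ratio(sample))
```

One number like this can sit on a dashboard in place of separate error and latency graphs, while the raw metrics remain stored for debugging.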

It’s a top-down approach. You write a contract; it explains your SLAs; you define your SLOs based on these requirements and then set up SLIs so you can check if the SLOs are holding.

Smart Triggers Help Solve Real Business Problems

In the end, alerts are triggered when your SLOs aren’t met anymore or, better, well before they are broken, so you can solve problems that are about to happen.
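One simple way to alert before an SLO is broken is to add a warning band just above the target. This Python sketch assumes a "higher is better" SLI such as a success ratio; the function name, the level names, and the margin value are hypothetical.

```python
def alert_level(sli: float, slo_target: float, warning_margin: float) -> str:
    """Classify an SLI reading against an SLO target.

    Hypothetical levels: "critical" once the SLO is broken, "warning"
    while the SLI is within `warning_margin` of breaking it, else "ok".
    """
    if sli < slo_target:
        return "critical"
    if sli < slo_target + warning_margin:
        return "warning"
    return "ok"


# Assumed target: 99% good requests, with a warning band of 0.5 points.
for reading in (0.999, 0.992, 0.98):
    print(reading, alert_level(reading, slo_target=0.99, warning_margin=0.005))
```

The warning level gives the team time to react while the promise to customers is still being kept.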

Does disk space or CPU utilization play into your SLOs? If not, don’t display them.

This doesn’t mean you should only define SLOs for the SLAs in your contracts. Your contract may well omit something that should be an important SLA; after all, contracts aren’t perfect, and your customers could still become angry if you fail to deliver something they expect. Such a gap only means they can’t sue you, which is important for your company but is only the bare minimum.

Dashbird was Built with Best Practices in Mind

Dashbird comes with plenty of serverless know-how out of the box. After all, it was created because the founders saw firsthand how a lack of observability, or too much information, could get in the way of product development.

After you integrate Dashbird with your AWS account, it starts to collect monitoring data from CloudWatch and builds important metrics for your infrastructure right away without any additional coding.

Dashbird sets up metrics and alerts for all supported AWS resources, so you don’t have to. These are based on years of experience with monitoring serverless systems for Dashbird customers and, of course, the AWS Well-Architected Framework, the official resource from AWS for building and maintaining applications on the AWS cloud.

Figure 3 shows an example of these insights: a Lambda function’s runtime can be upgraded. This is important because runtime versions aren’t supported forever. Dashbird shows the upgrade option well before support for the current runtime version ends.

Figure 3: Dashbird Well-Architected Lens

Dashbird only shows you what’s important, so your team doesn’t get overwhelmed. This gives room for your team to add their SLIs and SLOs for the SLAs defined in your contracts, while highlighting crucial metrics you should keep in mind.
