Monitoring vs observability – is there even a difference and is your monitoring system observable?
Photo by Scott Webb from Pexels
Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and “robustness” of IT systems, engineering teams typically use logs, metrics, and traces, which are used by various developer tools to facilitate observability. But what is observability exactly, and how does it differ from monitoring?
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia
An observable system allows us to assess how the system works without interfering or even interacting with it. Simply by looking at the outputs of a system (such as logs, metrics, traces), we can assess how this system is performing.
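To make the idea of inferring internal state from external outputs concrete, here is a minimal Python sketch. It assumes a hypothetical service called `orders-service` that writes one JSON object per log line; the function names and log fields are illustrative and do not come from any specific tool. The point is that `summarize` never touches the service itself and works only from what the service emitted.

```python
import io
import json
import logging
import time


def make_json_logger(stream):
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("orders-service")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def process_order(logger, order_id, fail=False):
    """Hypothetical business function that reports its own outcome and latency."""
    started = time.perf_counter()
    status = "error" if fail else "ok"
    duration_ms = round((time.perf_counter() - started) * 1000, 3)
    logger.info(json.dumps({
        "event": "order_processed",
        "order_id": order_id,
        "status": status,
        "duration_ms": duration_ms,
    }))
    return status


def summarize(log_text):
    """Infer system health purely from the emitted log lines."""
    records = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    errors = sum(1 for record in records if record["status"] == "error")
    return {"total": len(records), "errors": errors}


buffer = io.StringIO()  # stands in for a real log sink
logger = make_json_logger(buffer)
process_order(logger, "a1")
process_order(logger, "a2", fail=True)
health = summarize(buffer.getvalue())  # {"total": 2, "errors": 1}
```

Structured (JSON) log lines are a design choice here, not a requirement: they simply make the outputs easy to parse later without guessing at free-form text.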
One of the best explanations about monitoring and observability I’ve seen was provided in an online course, “Building Modern Python Applications on AWS”, by Morgan Willis, a Senior Cloud Technologist at AWS.
“Monitoring is the act of collecting data. What types of data we collect, what we do with the data, and if that data is readily analyzed or available is a different story. This is where observability comes into play. Observability is not a verb, it’s not something you do. Instead, observability is more of a property of a system.” — Morgan Willis
According to this explanation, tools such as CloudWatch or X-Ray can be viewed as monitoring or tracing tools. They allow us to collect logs and metrics about our system and send alerts about errors and incidents. Monitoring is therefore the active process of collecting data that helps us assess the health of our system and how its different components work together. Once we establish monitoring that continuously collects logs, system outputs, metrics, and traces, our system becomes observable.
As a data engineer, I like to think of monitoring as the data ingestion part of ETL (extract, transform, load): you gather data from multiple sources (logs, traces, metrics) and put it into a data lake. Once all this data is available, a skilled analyst can gain insights from it and build dashboards that tell the story the data conveys. That’s the observability part: gaining insights from the collected data. Observability platforms such as Dashbird play the role of that skilled analyst, providing visualizations and insights about the health of your system.
Monitoring gives you information about your system and lets you know if there’s a failure, while observability gives you an easy way to understand where and why that failure happened and what caused it.
Monitoring is a prerequisite for observability. A system that we don’t monitor is not observable.
The ultimate purpose of monitoring is to control a system’s health by actively collecting error logs and system metrics and then leveraging those to alert about incidents. This means:
Even though monitoring is an active process, AWS takes care of that automatically when we use CloudWatch or X-Ray.
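The idea of leveraging collected metrics to alert about incidents can be sketched in a few lines of Python. This is a simplified illustration, not how CloudWatch alarms are implemented: the `MetricPoint` type, the 5% threshold, and the three-point window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MetricPoint:
    """One collected data point (hypothetical shape)."""
    timestamp: int    # epoch seconds
    invocations: int  # number of requests in this interval
    errors: int       # number of failed requests in this interval


def error_rate(points):
    """Fraction of failed invocations across the given points."""
    total = sum(point.invocations for point in points)
    if total == 0:
        return 0.0
    return sum(point.errors for point in points) / total


def should_alert(points, threshold=0.05, window=3):
    """Alert when the error rate over the most recent `window` points exceeds `threshold`."""
    recent = sorted(points, key=lambda point: point.timestamp)[-window:]
    return error_rate(recent) > threshold


incident = [
    MetricPoint(1, 100, 0),
    MetricPoint(2, 100, 1),
    MetricPoint(3, 100, 2),
    MetricPoint(4, 100, 30),
]
alert = should_alert(incident)  # True: 33 errors in the last 300 invocations
```

Evaluating a window of recent points rather than a single point is a common way to avoid alerting on a one-off spike; the right window and threshold depend on the workload.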
The purpose of observability is to use the system’s outputs to gather insights and act on them. Examples:
Although serverless microservices offer a myriad of benefits in terms of decoupling, reducing dependencies between individual components, and overall faster development cycles, the biggest challenge is to ensure that all those small “moving parts” are working well together. It’s highly impractical, if not impossible, to track all microservices by manually looking up the logs, metrics, and traces scattered across different cloud services.
On AWS, you would have to go to CloudWatch to see the logs, find your Lambda function’s log group, and then find the log entries you are really interested in. Then, to see the corresponding API traces, you would go to either X-Ray or CloudTrail and again search across potentially hundreds of components to find the one you want to investigate. Finding and accessing the logs and traces of every single component this way is time-consuming. Moreover, debugging individual parts doesn’t give you the “big-picture” view of how those components work together.
To put it simply, you get observability in your application by knitting together monitoring and alerting with a debugging solution that makes your data clear. Missing just one of these aspects will leave you at a great disadvantage, chasing your tail trying to figure out what went wrong within your app. It’s not enough to be notified every time something breaks down, nor is it enough to know when something is about to break. You have to be able to pinpoint the issue within your platform efficiently.
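As a rough illustration of what the “big-picture” view involves, the sketch below merges log events from several components into one ordered timeline per request. It assumes every component logs a shared `request_id`, which is a convention you would have to establish yourself; the event shape and function names are hypothetical. The only AWS-specific detail is the default Lambda log group naming scheme, `/aws/lambda/<function-name>`.

```python
from collections import defaultdict


def lambda_log_group(function_name):
    """Default CloudWatch Logs group name for a Lambda function."""
    return f"/aws/lambda/{function_name}"


def build_timelines(events_by_group):
    """Merge per-component log events into one ordered timeline per request.

    `events_by_group` maps a log group name to a list of dicts with the
    hypothetical keys `request_id`, `timestamp`, and `message`.
    """
    timelines = defaultdict(list)
    for group, events in events_by_group.items():
        for event in events:
            timelines[event["request_id"]].append(
                (event["timestamp"], group, event["message"])
            )
    # Sorting the tuples orders each timeline by timestamp first.
    return {request_id: sorted(items) for request_id, items in timelines.items()}


# Events as they might be exported from two separate log groups.
events = {
    lambda_log_group("checkout"): [
        {"request_id": "r1", "timestamp": 2, "message": "charge card"},
    ],
    lambda_log_group("api"): [
        {"request_id": "r1", "timestamp": 1, "message": "received"},
        {"request_id": "r2", "timestamp": 5, "message": "received"},
    ],
}
timelines = build_timelines(events)
```

Doing this by hand for every component is exactly the kind of work that becomes impractical as the number of services grows.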
With a growing architecture of microservices, we need an easier (automated) way to add observability to the serverless ecosystem.
Here’s an example of a service we’re all too familiar with: Twitter. As you might imagine, a product like Twitter has a lot of moving parts, and when something breaks down it can be difficult to understand why or what caused the problem. Imagine having 350 million active users who interact with each other through your system by tweeting, liking, sending direct messages, retweeting, and so on. That’s a lot of information to follow, and if you’ve ever worked on a platform this size, you can imagine the effort it would take to figure out why a tweet isn’t posted or a message takes too long to be delivered.
Before they made the switch from a monolithic application to a distributed system, finding out why something doesn’t work was, at times, as simple as opening an error log file and seeing what went wrong.
When you have hundreds, maybe thousands, of small services communicating asynchronously with each other, saying that debugging a simple thing like a tweet not being posted is hard would be a complete understatement. Twitter published a post about its migration to microservices in 2013. Read the post here.
With distributed systems (read: microservices), especially at scale, observability into your platform is more than a nice-to-have; it’s a requirement that can’t be circumvented by relying only on alerting or only on logs. You need an environment that provides visibility down to a microscopic level so that you have the right information to act upon.
Twitter’s observability system is humongous and took years to develop into the well-oiled machine it is today.
“The Observability Engineering team at Twitter provides full-stack libraries and multiple services to our internal engineering teams to monitor service health, alert on issues, support root cause investigation by providing distributed systems call traces, and support diagnosis by creating a searchable index of aggregated application/system logs.” – Anthony Asta in Observability in Twitter part I
“Our time series metric ingestion service handles more than 2.8 billion write requests per minute, stores 4.5 petabytes of time series data, and handles 25,000 query requests per minute.” This is how Anthony Asta described the scope of Twitter’s observability systems in a two-part series published in 2016, which covers architecture, metrics ingestion, the time series database, and indexing services. Check out part one and part two.
Understandably, not all businesses have the resources and time to build their own observability systems. With a 2-minute setup, you can sign up to Dashbird and add observability to your serverless AWS architecture immediately. Each serverless component in your AWS account, on which you enabled CloudWatch logs and X-Ray or CloudTrail traces, is automatically monitored with those tools. But it’s not yet observable until you do something with this collected data.
The true benefit of Dashbird is that it doesn’t require any code changes or any effort on your side. It simply uses the data that already exists, i.e., data for which you already enabled monitoring with AWS-native services designed for that purpose.
Dashbird is built by serverless developers specifically with AWS Lambda in mind.
As a serverless observability platform, Dashbird allows you to accomplish all of the points addressed when discussing examples of insights gathered from an observable system:
Monitoring tools allow you to collect application logs, metrics about resource utilization and network traffic, and traces of HTTP requests made to specific services. Observability, by contrast, is a property of a system that analyzes and visualizes the collected data, thereby allowing you to improve your application lifecycle by gathering insights about the underlying system. Furthermore, observability in the serverless space is non-negotiable. You have to have it, and it’s not a quantifiable attribute, meaning you can’t have some observability or too much of it. You either have it or you don’t.