Although serverless is very stable and reliable, many things can go wrong with our software. A hardware failure may cause glitches, network instability can disrupt API communications, and the application itself can present unexpected bugs.
AWS Lambda, for example, has three types of errors, as discussed in Lambda: Invocation, Function and Runtime Errors. Since developers don’t have access to the underlying infrastructure in serverless systems, logs are usually piped to a central repository (e.g. AWS CloudWatch Logs).
In some cases, errors are simply returned from API calls. It is the application’s responsibility to handle and log these error messages; otherwise it will be impossible to detect and inspect them later.
Take DynamoDB and its capacity modes. If the request rate exceeds the table’s provisioned capacity, DynamoDB returns a ProvisionedThroughputExceededException. A Lambda function that queries the table must log this error so that it ends up in CloudWatch Logs as well.
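The logging step described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: query_with_logging and the table argument are hypothetical names, and the function assumes table behaves like a boto3 DynamoDB Table resource, whose errors (botocore.exceptions.ClientError) expose the service error code under response["Error"]["Code"]. Inside Lambda, output from the standard logging module is sent to CloudWatch Logs.

```python
import logging

logger = logging.getLogger(__name__)

THROTTLE_CODE = "ProvisionedThroughputExceededException"


def query_with_logging(table, **query_kwargs):
    """Run table.query(...) and log any error before re-raising it.

    `table` is assumed to behave like a boto3 DynamoDB Table resource.
    """
    try:
        return table.query(**query_kwargs)
    except Exception as exc:
        # boto3 errors carry the service error code in exc.response;
        # fall back to the exception class name for anything else.
        response = getattr(exc, "response", None)
        error = response.get("Error", {}) if isinstance(response, dict) else {}
        code = error.get("Code", type(exc).__name__)
        if code == THROTTLE_CODE:
            logger.error("DynamoDB throttled the query: %s", code)
        else:
            logger.error("DynamoDB query failed: %s", code)
        raise  # let the caller decide whether to retry or fail
```

Re-raising after logging keeps the failure visible to Lambda itself, so the invocation is still reported as failed rather than silently swallowed.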
Failure detection is the process of inspecting logs and identifying all strings and patterns that indicate whether an error occurred in an application.
Logging errors is only the first step. Even for small applications, the amount of log information can easily become impossible for humans to parse. This is when a failure detection algorithm is valuable.
Such an algorithm will automatically identify a DynamoDB error, a Python or Node.js exception (e.g. TypeError) or an AWS Lambda misconfiguration, for example.
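As a rough illustration of the idea, the sketch below scans log lines with regular expressions. The pattern set and the detect_failures name are assumptions made for this example: the three patterns cover only a DynamoDB throttling error, Python-style exceptions, and the Lambda timeout message (one symptom that can point to a timeout configured too low). A production detector would need far broader coverage.

```python
import re

# Illustrative patterns only; real log streams need many more.
FAILURE_PATTERNS = {
    "dynamodb_throttling": re.compile(r"ProvisionedThroughputExceededException"),
    "python_exception": re.compile(
        r"Traceback \(most recent call last\)|\b\w+Error\b"
    ),
    "lambda_timeout": re.compile(r"Task timed out after [\d.]+ seconds"),
}


def detect_failures(log_lines):
    """Return (line_number, failure_type, line) for each line matching a pattern."""
    findings = []
    for number, line in enumerate(log_lines, start=1):
        for failure_type, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, failure_type, line))
                break  # report each line once, under the first matching type
    return findings
```

Feeding it a list of log lines returns only the lines that look like failures, each tagged with a type that an alerting step could act on.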
Waiting for a customer or manager to discover an error and report it to the development team can erode trust in the application. It is much better when developers are the first to learn about the failure and can proactively notify users, or perhaps even ship a quick fix.
For that reason, professional development teams must implement an error alerting mechanism coupled with the failure detection algorithms. Whenever something fails, the system should alert the responsible development team through the most convenient channels (e.g. email, Slack).
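To make the alerting step concrete, here is a minimal sketch that posts a message to a Slack incoming webhook, which accepts a JSON body with a text field. The function names, the message format, and the injectable opener parameter are assumptions for illustration; the webhook URL must come from your own Slack configuration.

```python
import json
import urllib.request


def build_slack_request(webhook_url, function_name, error_message):
    """Build an HTTP POST request carrying a Slack incoming-webhook payload."""
    payload = {"text": f"Failure in {function_name}: {error_message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, function_name, error_message,
               opener=urllib.request.urlopen):
    """Send the alert and return the HTTP status code.

    `opener` is injectable so the function can be exercised without network access.
    """
    request = build_slack_request(webhook_url, function_name, error_message)
    with opener(request, timeout=5) as response:
        return response.status
```

Separating request construction from sending keeps the message format easy to test and makes it simple to add another channel, such as email, later.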
Traditional logging and monitoring services from big, classic companies or open source projects won’t work with serverless, simply because serverless is not a traditional architecture. It requires specialized error inspection and alerting.
AWS CloudWatch Logs, for example, provides neither failure detection algorithms for serverless nor alerting mechanisms. It serves well only as a central log repository.
Having a dev team implement its own failure detection and alerting system would be a waste of time. There are professional services tailored for serverless, such as Dashbird, that can provide a much better solution at a fraction of the internal development costs.
Dashbird works by reading logs from CloudWatch Logs. This is an asynchronous way of inspecting failures: it requires no code instrumentation and neither interferes with nor degrades application performance. Please check How Dashbird Works for more information.
The best professional monitoring platforms will provide a failure management system as well. It helps to organize failures that are pending, resolve errors that have been fixed, etc.
In Dashbird, for example, errors are tracked in different states: open, resolved, or muted.
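The three states can be modeled with a small data structure. The sketch below is hypothetical and does not describe Dashbird's internals; in particular, reopening a resolved error when it occurs again, and keeping muted errors muted, are assumptions chosen for this example.

```python
from enum import Enum


class ErrorState(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    MUTED = "muted"


class ErrorTracker:
    """Tracks one state per error key (hypothetical structure)."""

    def __init__(self):
        self._states = {}

    def report(self, error_key):
        """Record an occurrence: new or resolved errors become open; muted stay muted."""
        if self._states.get(error_key) is not ErrorState.MUTED:
            self._states[error_key] = ErrorState.OPEN
        return self._states[error_key]

    def resolve(self, error_key):
        """Mark an error as fixed."""
        self._states[error_key] = ErrorState.RESOLVED

    def mute(self, error_key):
        """Silence an error so later occurrences do not reopen it."""
        self._states[error_key] = ErrorState.MUTED

    def pending(self):
        """Return the sorted keys of errors that still need attention."""
        return sorted(k for k, s in self._states.items() if s is ErrorState.OPEN)
```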
Another benefit is setting up custom alerting policies. Developers can control which AWS Lambda functions to monitor, for example. Perhaps testing and experimental functions can be ignored, but production ones must be monitored closely.
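An alerting policy of this kind can be expressed as a set of include and exclude rules. The following is a sketch under stated assumptions: AlertPolicy, its fields, the glob-style name patterns, and the failure type labels are all invented for illustration and do not reflect any particular product's configuration format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AlertPolicy:
    """Decides which Lambda functions and failure types trigger an alert."""

    include: tuple                       # glob patterns of function names to monitor
    exclude: tuple = ()                  # glob patterns of function names to ignore
    failure_types: tuple = ("timeout",)  # failure types this policy covers

    def should_alert(self, function_name, failure_type):
        if failure_type not in self.failure_types:
            return False
        # Exclusions win over inclusions, so test functions stay quiet.
        if any(fnmatchcase(function_name, p) for p in self.exclude):
            return False
        return any(fnmatchcase(function_name, p) for p in self.include)
```

For instance, a policy with include=("*-prod",) and exclude=("experimental-*",) would alert on timeouts in production functions while ignoring experimental ones, even if their names also end in -prod.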
The screenshot below illustrates a policy to monitor timeout failures in a given Lambda function.