How Can a Shared Slack Channel Improve Your Data Quality?

Have you ever heard anyone saying: “Our data is great, we’ve never had any data quality issues“? Ensuring data quality is hard. The magnitude of the problem makes us believe that we need some really big actions to make any improvements. But the reality shows, often the simplest and most intuitive solutions can be incredibly impactful. In this article, we’ll look at one idea to improve the process around data quality, and make it more rewarding and actionable.

Often the most impactful changes come from rethinking our processes.


Taking ownership of data

Regardless of how your data team is structured (centralized BI/Data Team vs. decentralized domain-oriented teams leveraging Data Mesh paradigm), people need to take ownership to make any lasting and impactful change. If nobody feels responsible for fixing an issue in the data, we shouldn’t expect that the situation will get better, regardless of the tools we use. 

How can we approach this?

The simplest way to improve data ownership is to assign owners to the most critical data artifacts, i.e., specific tables in a data warehouse, data lake datasets, and data science models. It’s not that we want to blame those people for data quality issues. Instead, assigning owners can create more transparency over who should look after specific data assets and do whatever they can to ensure that this data remains clean. The process can then evolve by adding automation and monitoring dashboards for more visibility.

In a nutshell, before taking any tools or automation scripts into consideration, it’s helpful to first think about a process of establishing data ownership.


Making the process more rewarding and easier to track

Once the ownership is defined, we can improve the process by making quality checks more rewarding and automated. Adding simple scripts that execute data quality checks and notify data owners via a shared Slack channel about any data quality issues can be highly effective to increase the team’s engagement in improving data quality.

The important part of the process is to send those alerts to a Slack channel which is shared across the data team. As social creatures, we are more motivated to fix issues if other people can see the effort we put into it. For instance, the data owner who took care of the issue can:

  • send a reply explaining what was the root cause and which steps have been taken to solve the problem,
  • simply adding a checkmark to show that this issue has been taken care of,
  • or adding a ticket link if the issue turned out to be much more complex and needs to be put into a backlog.

An example of a shared Data Quality Slack channel with users engaged in the process 

All of the above actions add visibility and a (social) proof that data quality issues are no longer ignored. It demonstrates how taking ownership and making the process more socially rewarding can already yield tangible improvements.


Leveraging automation to facilitate the process

Let’s assume that we established the process and agreed on data ownership. How can we go about implementing those automated data quality alerts? The process can be as simple as:

  • building SQL queries that check for anomalies in the data,
  • writing a script that sends a Slack notification if the alert condition is met,
  • creating a shared Slack channel and a webhook to send messages to it.

First, to create a webhook, go to https://api.slack.com/apps → Create an App.

improve data quality

Add a name to your app and select your desired Slack workspace.

improve data quality slack channel

Select incoming webhooks and create one for the Slack channel of your choice (“Add New Webhook to Workspace”).

improve data quality slack channel

improve data quality slack channel
improve data quality slack channel

Once all that’s done, you can copy your Webhook URL and use it in your Python script. Note that you should treat this webhook in the same way as an API key or password

improve data quality slack channel

The script to build the alert is as simple as sending a POST request to the Slack API endpoint represented by the webhook (line 19 in the Gist below). 

import json
import logging
import requests
import pandas as pd
import awswrangler as wr
from typing import List, Any


class DataQualityAlert:
    def __init__(self, slack_webhook_url: str, database: str = "ecommerce"):
        self.slack_webhook_url = slack_webhook_url
        self.database = database
        self.logger = logging.getLogger(__name__)

    def read_sql(self, query: str) -> pd.DataFrame:
        return wr.athena.read_sql_query(query, database=self.database)

    def send_slack_message(self, text: str) -> None:
        response = requests.post(url=self.slack_webhook_url, data=json.dumps({'text': text}))
        self.logger.info("Sent message to Slack. Status %d - %s. Message: %s", response.status_code, response.reason, text)

    def alert_about_outliers(self, alert_type: str, current_data: List[Any], expected: List[Any]) -> None:
        if sorted(current_data) != sorted(expected):
            detected_outliers = list(set(current_data) - set(expected))
            msg_text = f"{alert_type} check failed. Expected: `{', '.join(expected)}`. " \
                       f"Outliers: `{', '.join(detected_outliers)}`."
            self.send_slack_message(msg_text)
        else:
            self.logger.info("No outliers found. Skipping alert.")


if __name__ == '__main__':
    logging.basicConfig(format="[%(levelname)s] [%(name)s] [%(asctime)s]: %(message)s", level="INFO")
    # aws secretsmanager create-secret --name slack-webhook --secret-string '{"hook_url": "YOUR_HOOK_URL"}'
    default_webhook_url = wr.secretsmanager.get_secret_json("slack-webhook").get("hook_url")
    dqa = DataQualityAlert(slack_webhook_url=default_webhook_url)

    payments = dqa.read_sql('SELECT distinct payment_type FROM order_payments')
    curr_state_in_data = list(payments.payment_type)
    expected = ['boleto', 'credit_card', 'debit_card', 'voucher']
    dqa.alert_about_outliers("Ensure valid payment type", curr_state_in_data, expected)

    orders = dqa.read_sql('SELECT distinct order_status FROM orders')
    curr_state_in_data = list(orders.order_status)
    expected = ['approved', 'canceled', 'created', 'delivered', 'invoiced', 'processing', 'shipped']
    dqa.alert_about_outliers("Ensure valid order status", curr_state_in_data, expected)

Note that on line 35, the Webhook URL is retrieved from AWS Secrets Manager. If you want to follow the same approach to store this confidential piece of information, make sure to add it to your set of secrets:

aws secretsmanager create-secret --name slack-webhook --secret-string '{"hook_url": "YOUR_HOOK_URL"}'

About the checks from the example

In this code example, we are checking whether order status and payment type match the expected (allowed) values. If not, we should receive a Slack message informing us about the outliers:

data quality check

Obviously, those checks represent rather contrived examples (based on this e-commerce dataset from Kaggle). In a real-world scenario, your data quality checks may validate:

  • whether a specific KPI in your data reaches some critical value or when it’s surpassing the expected range of values,
  • the occurrence of highly improbable values (B2C-customer buying hundreds of items of the same product),
  • whether some values (for instance, marketing, payment, or logistic costs) significantly deviate from planned values,
  • whether data is up-to-date, complete, duplicate-free, and without missing values,
  • …and many more.

Deploying the scripts to AWS

To make running those periodic checks more scalable, we could leverage AWS Lambda. To make the previously shown Github gist work with Lambda, we need to wrap our main execution code into a lambda handler (starting from line 34). Also, we need to ensure our logger is defined globally in a way that is compliant with AWS Lambda.

import json
import logging
import requests
import pandas as pd
import awswrangler as wr
from typing import List, Any

logger = logging.getLogger()
logger.setLevel(logging.INFO)


class DataQualityAlert:
    def __init__(self, slack_webhook_url: str, database: str = "ecommerce"):
        self.slack_webhook_url = slack_webhook_url
        self.database = database

    def read_sql(self, query: str) -> pd.DataFrame:
        return wr.athena.read_sql_query(query, database=self.database)

    def send_slack_message(self, text: str) -> None:
        response = requests.post(url=self.slack_webhook_url, data=json.dumps({'text': text}))
        logger.info("Sent message to Slack. Status %d - %s. Message: %s", response.status_code, response.reason, text)

    def alert_about_outliers(self, alert_type: str, current_data: List[Any], expected: List[Any]) -> None:
        if sorted(current_data) != sorted(expected):
            detected_outliers = list(set(current_data) - set(expected))
            msg_text = f"{alert_type} check failed. Expected: `{', '.join(expected)}`. " \
                       f"Outliers: `{', '.join(detected_outliers)}`."
            self.send_slack_message(msg_text)
        else:
            logger.info("No outliers found. Skipping alert.")


def lambda_handler(event, context):
    # aws secretsmanager create-secret --name slack-webhook --secret-string '{"hook_url": "YOUR_HOOK_URL"}'
    default_webhook_url = wr.secretsmanager.get_secret_json("slack-webhook").get("hook_url")
    dqa = DataQualityAlert(slack_webhook_url=default_webhook_url)

    payments = dqa.read_sql('SELECT distinct payment_type FROM order_payments')
    curr_state_in_data = list(payments.payment_type)
    expected_data = ['boleto', 'credit_card', 'debit_card', 'voucher']
    dqa.alert_about_outliers("Ensure valid payment type", curr_state_in_data, expected_data)

    orders = dqa.read_sql('SELECT distinct order_status FROM orders')
    curr_state_in_data = list(orders.order_status)
    expected_data = ['approved', 'canceled', 'created', 'delivered', 'invoiced', 'processing', 'shipped']
    dqa.alert_about_outliers("Ensure valid order status", curr_state_in_data, expected_data)

The full project is available in this Github repository.

To deploy our container image to AWS, we build and push our container image to ECR (123456 is a placeholder for AWS Account ID). 

aws ecr create-repository --repository-name dq_alerts
docker build -t dq_alerts .
docker tag dq_alerts:latest 123456.dkr.ecr.eu-central-1.amazonaws.com/dq_alerts:latest
aws ecr get-login-password | docker login --username AWS --password-stdin 123456.dkr.ecr.eu-central-1.amazonaws.com
docker push 123456.dkr.ecr.eu-central-1.amazonaws.com/dq_alerts:latest

Then, in the Lambda configuration, we select our desired container image, as shown below. 

lambda configuration container image

Since performing database queries can be time-consuming, we need to increase the timeout setting. Also, increasing the memory size to at least 256 MB seems reasonable since data returned by the queries can take a larger amount of space in memory.

increase lambda memory

Make sure to add relevant IAM policies. For this example, we need Athena and S3 permissions.

add IAM policies Lambda athena s3

Lastly, to ensure that our checks run on schedule, we need to add a CloudWatch schedule as a trigger:

add CloudWatch schedule trigger
add CloudWatch schedule trigger

Additionally, we can test the function using an empty JSON payload:

test function JSON payload

How can we further improve the automated process?

So far, we established a semi-automated process ensuring that the owners are notified about data quality issues occurring in their data assets. We also started building automated checks to regularly validate our data. Now we can start thinking about how to run it at scale. While AWS Lambda provides highly scalable compute resources, it makes it difficult to track and fix errors occurring in your serverless functions. 

AWS Lambda monitoring

One possible solution is to leverage a serverless observability platform such as Dashbird. The initial onboarding is as simple as clicking a single button to deploy a CloudFormation template within your AWS account. Once the CloudFormation stack is finished, you can immediately start using the dashboards, configure alerts on failure, and dive into beautifully formatted log messages

Dashbird observability platform demonstrating a cold start in AWS Lambda 


Drawbacks of the presented approach

The first problem in the demonstrated approach is that we need some stateful logic to ensure that we don’t notify about the same problems too frequently, otherwise people will start ignoring the alerts and will possibly mute the Slack channel. Also, the social aspect might get lost if there are too many messages.

Additionally, writing all sorts of data quality checks by ourselves is not scalable, and possibly not even feasible if you deal with vast amounts of data. As Barr Moses points out, such quality checks can only cover the known unknowns, i.e. problems that can be anticipated. An interesting idea would be to combine the social aspect of a shared Slack channel with alerts from end-to-end observability pipelines.


Conclusion

In this article, we looked at how simple process adjustments can increase the team’s engagement and commitment to improving data quality. Often the most impactful changes don’t require any big decisions or investments but from rethinking our processes, adding automation to enhance their execution, and ensuring that the entire team keeps working together on a common goal of improving data quality


Further reading:

10 ways to protect your mission-critical database

Why you should consider a data lake?

The New Rules of Data Quality

Introducing a better way to manage data quality at scale with testing and observability

Read our blog

ANNOUNCEMENT: new pricing and the end of free tier

Today we are announcing a new, updated pricing model and the end of free tier for Dashbird.

4 Tips for AWS Lambda Performance Optimization

In this article, we’re covering 4 tips for AWS Lambda optimization for production. Covering error handling, memory provisioning, monitoring, performance, and more.

AWS Lambda Free Tier: Where Are The Limits?

In this article we’ll go through the ins and outs of AWS Lambda pricing model, how it works, what additional charges you might be looking at and what’s in the fine print.

More articles

Made by developers for developers

Dashbird was born out of our own need for an enhanced serverless debugging and monitoring tool, and we take pride in being developers.

What our customers say

Dashbird gives us a simple and easy to use tool to have peace of mind and know that all of our Serverless functions are running correctly. We are instantly aware now if there’s a problem. We love the fact that we have enough information in the Slack notification itself to take appropriate action immediately and know exactly where the issue occurred.

Thanks to Dashbird the time to discover the occurrence of an issue reduced from 2-4 hours to a matter of seconds or minutes. It also means that hundreds of dollars are saved every month.

Great onboarding: it takes just a couple of minutes to connect an AWS account to an organization in Dashbird. The UI is clean and gives a good overview of what is happening with the Lambdas and API Gateways in the account.

I mean, it is just extremely time-saving. It’s so efficient! I don’t think it’s an exaggeration or dramatic to say that Dashbird has been a lifesaver for us.

Dashbird provides an easier interface to monitor and debug problems with our Lambdas. Relevant logs are simple to find and view. Dashbird’s support has been good, and they take product suggestions with grace.

Great UI. Easy to navigate through CloudWatch logs. Simple setup.

Dashbird helped us refine the size of our Lambdas, resulting in significantly reduced costs. We have Dashbird alert us in seconds via email when any of our functions behaves abnormally. Their app immediately makes the cause and severity of errors obvious.