This simple FastAPI service will help you find data in a data lake
Data lakes provide a myriad of benefits. They are data agnostic and don’t require you to define a schema upfront. However, without a proper structure, it may be challenging to find the data that you need. In this article, we’ll address this problem by creating a FastAPI abstraction allowing us to query the AWS Glue metadata catalog.
If we implemented it as a Python package instead, we would be assuming that everyone knows enough Python to use it. However, your company may have other developers (Java, C#, JavaScript, Go, …) who need access to data from a data lake. By building a RESTful service for data discovery, you provide a programming-language-agnostic interface for everyone. This API-first approach has further benefits:
We can start by listing all the methods that can be potentially useful for data discovery:
We have three options: using the lower-level AWS SDKs, querying Athena’s information_schema, or leveraging awswrangler. In the end, we may combine all of them to satisfy the requirements from the previous section.
#1. To list all databases (i.e., schemas) and tables, we can use the awswrangler package, especially the module wr.catalog.
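A minimal sketch of this first method, assuming awswrangler is installed and AWS credentials are configured. The function names list_databases and list_tables are illustrative and not taken from the original repository; the import sits inside each function only so that the module loads without AWS dependencies.

```python
def list_databases():
    """Return every Glue database as a list of dicts, one per database."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    # wr.catalog.databases() returns a pandas DataFrame
    return wr.catalog.databases().to_dict(orient="records")


def list_tables(database):
    """Return all tables registered in a single Glue database."""
    import awswrangler as wr

    return wr.catalog.tables(database=database).to_dict(orient="records")
```

Returning plain lists of dicts keeps the result JSON-serializable, so a FastAPI route can return it directly.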
#2. To filter for table names with a specific prefix or suffix, we can also use wr.catalog.tables. The same is true for retrieving a table definition and doing a full-text search on those definitions:
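One way this could look, as a sketch rather than the repository's actual code. The helper names are hypothetical; name_prefix, name_suffix, and search_text are keyword arguments of wr.catalog.tables, and wr.catalog.table returns the column-level definition of a single table.

```python
def find_tables(database=None, prefix=None, suffix=None):
    """List tables whose names start with `prefix` and/or end with `suffix`."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    df = wr.catalog.tables(database=database, name_prefix=prefix, name_suffix=suffix)
    return df.to_dict(orient="records")


def get_table_definition(database, table):
    """Return column names, types, partition flags, and comments of one table."""
    import awswrangler as wr

    return wr.catalog.table(database=database, table=table).to_dict(orient="records")


def search_table_definitions(text):
    """Full-text search across the table definitions in the Glue catalog."""
    import awswrangler as wr

    return wr.catalog.tables(search_text=text).to_dict(orient="records")
```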
#3. List all partitions of a specific table:
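A possible implementation, assuming wr.catalog.get_partitions, which returns a dictionary mapping each partition's S3 location to its list of partition values. The function name and the reshaped output format are illustrative choices.

```python
def list_partitions(database, table):
    """Return the partitions of a table as JSON-friendly records."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    # get_partitions returns {s3_location: [partition_value, ...]}
    partitions = wr.catalog.get_partitions(database=database, table=table)
    return [{"location": loc, "values": vals} for loc, vals in partitions.items()]
```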
#4. Show a table’s DDL:
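A short sketch, assuming wr.athena.show_create_table, which runs SHOW CREATE TABLE in Athena and returns the statement as a string. The function name show_ddl is hypothetical.

```python
def show_ddl(database, table):
    """Return the CREATE TABLE statement that Athena reports for a table."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    return wr.athena.show_create_table(table=table, database=database)
```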
#5. List S3 directories or show specific objects filtered by file type (e.g., Parquet files) and last modified date:
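A sketch of how both lookups might be written with wr.s3.list_directories and wr.s3.list_objects. The function names and the days parameter are assumptions for illustration. Note that awswrangler expects a timezone-aware datetime for the last-modified filter.

```python
from datetime import datetime, timedelta, timezone


def list_directories(path):
    """List the 'directories' (common prefixes) directly under an S3 path."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    return wr.s3.list_directories(path)


def list_recent_objects(path, suffix=".parquet", days=7):
    """List objects with a given suffix modified within the last `days` days."""
    import awswrangler as wr

    # The filter must carry a timezone, so we build it in UTC
    since = datetime.now(timezone.utc) - timedelta(days=days)
    return wr.s3.list_objects(path, suffix=suffix, last_modified_begin=since)
```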
#6. Query column comments to find the dataset that you need:
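A hedged sketch that queries the comment column of Athena's information_schema.columns. The function names and the default database name are assumptions. Because the search term ends up inside a SQL string, the sketch accepts only letters, digits, and underscores.

```python
import re


def build_comment_query(term):
    """Build a query that finds columns whose comment contains `term`."""
    if not re.fullmatch(r"\w+", term):
        raise ValueError("search term may only contain letters, digits, and underscores")
    return (
        "SELECT table_schema, table_name, column_name, comment "
        "FROM information_schema.columns "
        f"WHERE lower(comment) LIKE '%{term.lower()}%'"
    )


def search_column_comments(term, database="default"):
    """Run the comment search in Athena; `database` is only the query context."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    # ctas_approach=False runs a plain SELECT instead of wrapping it in a CTAS
    return wr.athena.read_sql_query(
        build_comment_query(term), database=database, ctas_approach=False
    )
```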
#7. Describe an Athena table definition:
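A minimal sketch, assuming wr.athena.describe_table, which returns the table's columns as a pandas DataFrame. The wrapper name describe_table is illustrative.

```python
def describe_table(database, table):
    """Return the Athena description of a table as a list of column records."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    df = wr.athena.describe_table(table=table, database=database)
    return df.to_dict(orient="records")
```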
#8. Search for specific column names:
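This search can again go through information_schema.columns, this time filtering on column_name. The sketch below is one possible shape; the function names, the exact flag, and the default database are assumptions.

```python
import re


def build_column_name_query(name, exact=False):
    """Build a query that finds columns by exact or partial name."""
    if not re.fullmatch(r"\w+", name):
        raise ValueError("column name may only contain letters, digits, and underscores")
    # Use '=' for exact matches because '_' acts as a wildcard inside LIKE
    condition = f"= '{name.lower()}'" if exact else f"LIKE '%{name.lower()}%'"
    return (
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM information_schema.columns "
        f"WHERE column_name {condition}"
    )


def search_column_names(name, exact=False, database="default"):
    """Run the column-name search in Athena and return a pandas DataFrame."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    sql = build_column_name_query(name, exact=exact)
    return wr.athena.read_sql_query(sql, database=database, ctas_approach=False)
```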
The above methods constitute a basic MVP for a data discovery service. Together, they allow querying schemas, tables, columns, and column comments (aka the data dictionary), showing a preview of the data, and exploring the underlying files, directories, and partitions.
By leveraging services such as AWS X-Ray and Lake Formation, we could add methods to query access patterns and identify the most and least used datasets.
Side note: If you don’t want to attach your boto3_session separately in each API method, you can attach a global session using:
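A sketch using boto3.setup_default_session, which awswrangler falls back to when no boto3_session is passed. The wrapper function and the region name are placeholders.

```python
def configure_default_session(region_name="us-east-1"):
    """Set the process-wide default boto3 session; call once at startup."""
    import boto3  # lazy import: an assumption made for this sketch

    # Any awswrangler or boto3 call without an explicit session now uses it
    boto3.setup_default_session(region_name=region_name)
```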
The full code for this MVP can be found in this GitHub repository. Let’s briefly discuss some of the details.
In the following demo, we can see all the endpoints applied on a Brazilian E-commerce dataset from Kaggle.
The screencast above demonstrates how we can use this API to discover and query e-commerce data stored in an S3 data lake. The end users can identify tables or schemas that they need. By drilling down into the underlying files, data scientists and analysts can explore this data in an interactive notebook or other tools of their choice. Reading a specific parquet, CSV, Excel, or JSON file from S3 with awswrangler is as simple as:
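For instance, a small dispatcher could pick the matching awswrangler reader from the file extension. This is a sketch; the function name read_s3_file and the extension mapping are assumptions, while wr.s3.read_parquet, read_csv, read_excel, and read_json are existing awswrangler readers.

```python
def read_s3_file(path):
    """Read one S3 object into a pandas DataFrame based on its file extension."""
    import awswrangler as wr  # lazy import: an assumption made for this sketch

    readers = {
        ".parquet": wr.s3.read_parquet,
        ".csv": wr.s3.read_csv,
        ".xlsx": wr.s3.read_excel,
        ".json": wr.s3.read_json,
    }
    for extension, reader in readers.items():
        if path.lower().endswith(extension):
            return reader(path)
    raise ValueError(f"unsupported file type: {path}")
```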
There are several options to deploy a REST API. If you already use AWS, you may find it useful to leverage serverless services built specifically for building resilient and scalable APIs: AWS Lambda and API Gateway.
With the excellent package called mangum, converting our FastAPI to a Lambda handler is as simple as importing this package (from mangum import Mangum) and adding a single line of code: handler = Mangum(app=app).
The Dockerfile to deploy the API to AWS Lambda looks as follows:
We use the official Python Lambda image with Python 3.8 and install all required packages. Then, we copy our remaining API code and specify the Lambda handler that will be used as an entry point to our container.
Finally, we have to push the container to ECR, the AWS container registry. To replicate this demo, replace 123456789 with your AWS account ID and adjust the AWS region name. In case you’re wondering, dda is my abbreviation for data-discovery-api.
The container image is deployed to ECR. Now, we need to create a Lambda function. We choose the container image option and select our ECR image.
Since querying some tables can take longer than Lambda’s default timeout of just three seconds, we can extend it to three minutes.
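The same change can also be scripted. A minimal sketch with boto3's update_function_configuration, where the function name passed in is a placeholder for your own Lambda function:

```python
def extend_lambda_timeout(function_name, seconds=180):
    """Set a Lambda function's timeout and return the value AWS reports back."""
    import boto3  # lazy import: an assumption made for this sketch

    # Lambda accepts timeouts between 1 and 900 seconds (15 minutes)
    if not 1 <= seconds <= 900:
        raise ValueError("Lambda timeout must be between 1 and 900 seconds")
    client = boto3.client("lambda")
    response = client.update_function_configuration(
        FunctionName=function_name, Timeout=seconds
    )
    return response["Timeout"]
```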
Finally, we need to attach IAM policies to grant our Lambda function permissions to retrieve data from AWS Glue, Athena, and S3 (of course, we should use more granular permissions for production).
After configuring the Lambda function with our container image, we can create a new API in the API Gateway console. We need to choose REST API and specify a name.
Then, we can add a method and resource to configure Lambda Proxy integration.
Once all that is set up, we can deploy the API and start serving it.
To see it step by step, here is a screencast that demonstrates API Gateway configuration, deployment, and testing the API:
Imagine that you deployed a similar service at your company and it turned out to be a resounding success. Now you need to ensure that it stays healthy, can be easily maintained, and runs without errors. At this point, you can either:
The implications of each option:
In the image above, we can see how Dashbird pointed out the invocation that timed out after 3 seconds (the default Lambda timeout). It also reveals when the Lambda function experienced a cold start and shows the duration of each API call.
For API Gateway, we can see the average latency, number of API requests, and different types of errors that we encountered in the initial testing of the API.
In this article, we investigated how we can build a RESTful service for data discovery. By leveraging Python libraries such as FastAPI, awswrangler, boto3, and Mangum, we can build useful APIs in just a few hours rather than weeks. Additionally, by deploying this service to API Gateway and Lambda, we can serve this API at scale with no operational headaches. Lastly, by leveraging Dashbird, you can add observability to your serverless resources without having to install any CloudWatch agents, pull logs, or build any dashboards.
Thank you for reading! If this article was useful, have a look at related articles: