Monitoring platform for keeping systems up and running at all times.
Full stack visibility across the entire stack.
Detect and resolve any incident in record time.
Conform to industry best practices.
Investigating time-to-insights for data science with a data lake vs. data warehouse
A common challenge in data engineering is to combine traditional data warehousing and BI reporting with experiment-driven machine learning projects. Many data scientists tend to work more with Python and ML frameworks rather than SQL. Therefore, their data needs are often different from those of data analysts. In this article, we’ll explore why having a data lake often provides tremendous help for data science use cases. We’ll finish up with a fun computer-vision demo extracting text from images stored in an S3 data lake.
Having only a purely relational data warehouse imposes a limitation on the variety of data formats this data platform can support. Many data warehouse solutions allow you to analyze nested JSON-like structure, but it’s still a fraction of data formats that can be supported by a data lake.
While nothing beats relational table structure for analytics, it’s still beneficial to have an additional platform that allows you to do more than that. Data lakes are data agnostic. They support a large variety of data crucial for data science:
Manual ingestion of raw data into a data warehouse is quite a tedious and slow process.
You need to define a schema and specify all data types in advance, create a table, open a JDBC connection in your script or ETL tool, then finally you can start loading your data. In contrast, the load step in a data lake is often as simple as a single command. For instance, ingesting a Pandas dataframe into an S3-based data lake with AWS Glue catalog can be accomplished in a single line of Python code (the syntax in Pyspark and Dask is quite similar):
If you are mainly using Python for analytics and data engineering, you will likely find it much easier to write and read data using a data lake rather than using a data warehouse.
We discussed a variety of data supported by data lakes, but they also support a variety of processing frameworks. While a data warehouse encourages processing data in-memory using primarily SQL and UDFs, data lakes make it easy to retrieve data in a programming language or platform of your choice. Because no proprietary format is enforced, you have more freedom. This way, you can leverage the power of a Spark or Dask cluster and a wide range of extremely useful libraries that are built on top of them simply using Python. See the example below demonstrating reading a parquet file in Dask and Spark:
Note that it’s not an either-or decision, a good data engineer can handle both (DWH and data lake), but data lakes make it easier to use data with those distributed processing frameworks.
It’s difficult and time-consuming to fix the traditional ETL data pipelines in which only the “Load” part failed because you have to start the full pipeline from scratch. Data lakes encourage and enable the ELT approach. You can load extracted data in its raw format straight into a data lake and transform it later, either in the same or within an entirely separate data pipeline. There are many tools that let you sync raw data into your data warehouse or data lake and do the transformations later when needed. Decoupling of raw data ingestion from transformation leads to more resilient data workloads.
To illustrate the benefits of data lakes for data science projects, we’ll do a simple demo of the AWS Rekognition service to extract text from images.
What’s our use case? We upload an image to an S3 bucket that stores raw data. This triggers a Lambda function that extracts text from those images. Finally, we store the extracted text into a DynamoDB table and inspect the results using SQL.
How can you use it in your architecture? Instead of DynamoDB, you might as well use a data warehouse table or another S3 bucket location that could be queried using Athena. Also, instead of using the detect_text method (line 18 in the code snippet below) of AWS Rekognition, you can modify the code to:
How to implement this? First, I created an S3 bucket and a DynamoDB table. The table is configured with img_filename, i.e. the file name of an uploaded image, as a partition key so that rerunning our function will not cause any duplicates (idempotency).
I already have an S3 bucket with a folder called images: s3://data-lake-bronze/images.
We also need to create a Lambda function with an IAM role to which I attached the IAM policy for S3, Rekognition, and DynamoDB attached. The function shown below has a lambda handler called lambda_function.lambda_handler and runtime Python 3.8. It also has an S3 trigger attached, which calls the function upon any PUT object operation, i.e. on any file upload to the folder /images/.
The code of the function:
Let’s test it with some images:
After uploading those images to the S3 bucket that we defined in our Lambda trigger, the Lambda should be invoked once for each image upload. Once finished, we can inspect the results in DynamoDB.
It looks like the first two images were recognized pretty well, but the difficult image on the right (“Is this just fantasy”) was not.
If you want to implement a similar use case at scale, it may become challenging to monitor those resources and to implement proper alerting about errors. In that case, you may want to consider a serverless observability platform such as Dashbird. You could group all your Lambda functions and corresponding DynamoDB, ECS, Kinesis, Step Functions, SNS, or SQS resources into a project dashboard that allows you to see the current status of your microservices and ML pipelines at a glance.
Each color represents a different resource such as a specific Lambda function. By hovering over each of them, you can see more details. Then, you can drill down for more information.
Side note: The zero costs in the image above can be justified by an always-free tier of AWS Lambda and DynamoDB. I’ve been using Lambda for my personal projects for the last two years and have never been billed for Lambda so far.
To answer the question from the title: yes, a data lake can definitely speed up the development of data pipelines, especially those related to data science use cases. The ability to deal with a wide range of data formats and easily integrate this data with distributed processing and ML frameworks makes data lakes particularly useful for teams that do data science at scale. Still, if using raw data from a data lake requires too much cleaning, data scientists and data engineers may still prefer to use already preprocessed and historized data from DWH. As always, consider carefully what works best for your use case.
Thank you for reading! If this article was useful, follow me to see my next posts.
References & additional resources:
In this article, we’re covering 4 tips for AWS Lambda optimization for production. Covering error handling, memory provisioning, monitoring, performance, and more.
In this article we’ll go through the ins and outs of AWS Lambda pricing model, how it works, what additional charges you might be looking at and what’s in the fine print.
Dashbird was born out of our own need for an enhanced serverless debugging and monitoring tool, and we take pride in being developers.
Dashbird gives us a simple and easy to use tool to have peace of mind and know that all of our Serverless functions are running correctly. We are instantly aware now if there’s a problem. We love the fact that we have enough information in the Slack notification itself to take appropriate action immediately and know exactly where the issue occurred.
Thanks to Dashbird the time to discover the occurrence of an issue reduced from 2-4 hours to a matter of seconds or minutes. It also means that hundreds of dollars are saved every month.
Great onboarding: it takes just a couple of minutes to connect an AWS account to an organization in Dashbird. The UI is clean and gives a good overview of what is happening with the Lambdas and API Gateways in the account.
I mean, it is just extremely time-saving. It’s so efficient! I don’t think it’s an exaggeration or dramatic to say that Dashbird has been a lifesaver for us.
Dashbird provides an easier interface to monitor and debug problems with our Lambdas. Relevant logs are simple to find and view. Dashbird’s support has been good, and they take product suggestions with grace.
Great UI. Easy to navigate through CloudWatch logs. Simple setup.
Dashbird helped us refine the size of our Lambdas, resulting in significantly reduced costs. We have Dashbird alert us in seconds via email when any of our functions behaves abnormally. Their app immediately makes the cause and severity of errors obvious.
End-to-end observability and real-time error tracking for AWS applications.