Can Data Lakes Accelerate Building ML Data Pipelines?

Anna Geller

April 19th, 2021

Investigating time-to-insights for data science with a data lake vs. data warehouse

A common challenge in data engineering is to combine traditional data warehousing and BI reporting with experiment-driven machine learning projects. Many data scientists tend to work more with Python and ML frameworks rather than SQL. Therefore, their data needs are often different from those of data analysts. In this article, we’ll explore why having a data lake often provides tremendous help for data science use cases. We’ll finish up with a fun computer-vision demo extracting text from images stored in an S3 data lake.

1. Data lakes are data agnostic

Having only a purely relational data warehouse imposes a limitation on the variety of data formats this data platform can support. Many data warehouse solutions allow you to analyze nested JSON-like structure, but it’s still a fraction of data formats that can be supported by a data lake.

While nothing beats relational table structure for analytics, it’s still beneficial to have an additional platform that allows you to do more than that. Data lakes are data agnostic. They support a large variety of data crucial for data science:

different file types (csv, tsv, json, txt, parquet, orc), data encryption, and compression formats (snappy, gzip, zlib, lzo),
images, audio, video files enabling deep learning use cases with Computer Vision algorithms,
model checkpoints created by ML training jobs,
joins across relational and non-relational data, both server-side (ex. Presto or Athena) and on the client-side (ex. your Python script),
data from the web: clickstream, shopping cart data, social media (ex. tweets, Reddit, blog posts, and news article scraped for Natural Langage Processing analytics),
time-series data: IoT and sensor data, weather data, financial data.

2. Increased development efficiency

Manual ingestion of raw data into a data warehouse is quite a tedious and slow process.

You need to define a schema and specify all data types in advance, create a table, open a JDBC connection in your script or ETL tool, then finally you can start loading your data. In contrast, the load step in a data lake is often as simple as a single command. For instance, ingesting a Pandas dataframe into an S3-based data lake with AWS Glue catalog can be accomplished in a single line of Python code (the syntax in Pyspark and Dask is quite similar):

Ingestion into an S3 data lake — Image by author

If you are mainly using Python for analytics and data engineering, you will likely find it much easier to write and read data using a data lake rather than using a data warehouse.

3. Support for a wider range of data processing tools

We discussed a variety of data supported by data lakes, but they also support a variety of processing frameworks. While a data warehouse encourages processing data in-memory using primarily SQL and UDFs, data lakes make it easy to retrieve data in a programming language or platform of your choice. Because no proprietary format is enforced, you have more freedom. This way, you can leverage the power of a Spark or Dask cluster and a wide range of extremely useful libraries that are built on top of them simply using Python. See the example below demonstrating reading a parquet file in Dask and Spark:

Note that it’s not an either-or decision, a good data engineer can handle both (DWH and data lake), but data lakes make it easier to use data with those distributed processing frameworks.

4. Failures in your data pipelines become easier to fix

It’s difficult and time-consuming to fix the traditional ETL data pipelines in which only the “Load” part failed because you have to start the full pipeline from scratch. Data lakes encourage and enable the ELT approach. You can load extracted data in its raw format straight into a data lake and transform it later, either in the same or within an entirely separate data pipeline. There are many tools that let you sync raw data into your data warehouse or data lake and do the transformations later when needed. Decoupling of raw data ingestion from transformation leads to more resilient data workloads.

Demo: using data lake to provide ML as a service

To illustrate the benefits of data lakes for data science projects, we’ll do a simple demo of the AWS Rekognition service to extract text from images.

What’s our use case? We upload an image to an S3 bucket that stores raw data. This triggers a Lambda function that extracts text from those images. Finally, we store the extracted text into a DynamoDB table and inspect the results using SQL.

How can you use it in your architecture? Instead of DynamoDB, you might as well use a data warehouse table or another S3 bucket location that could be queried using Athena. Also, instead of using the detect_text method (line 18 in the code snippet below) of AWS Rekognition, you can modify the code to:

detect_faces
detect_custom_labels
recognize_celebrities
…and many more.

How to implement this? First, I created an S3 bucket and a DynamoDB table. The table is configured with img_filename, i.e. the file name of an uploaded image, as a partition key so that rerunning our function will not cause any duplicates (idempotency).

Create DynamoDB table for our demo — image by the author

I already have an S3 bucket with a folder called images: s3://data-lake-bronze/images.

We also need to create a Lambda function with an IAM role to which I attached the IAM policy for S3, Rekognition, and DynamoDB attached. The function shown below has a lambda handler called lambda_function.lambda_handler and runtime Python 3.8. It also has an S3 trigger attached, which calls the function upon any PUT object operation, i.e. on any file upload to the folder /images/.

Lambda function to detect text in images uploaded to data lake — image by author

The code of the function:

creates a client for Rekognition and corresponding S3 and DynamoDB resource objects,
extracts the S3 bucket name and key (filename) from the event trigger,
reads the image object and passes it to the Rekognition client,
finally, it retrieves the detected text and uploads it to our DynamoDB table.

Let’s test it with some images:

sample picture to test — Photo by Nadi Lindsay from Pexels

sample picture to test 2 — Photo by Alexas Fotos from Pexels

sample picture to test 3 — Photo by Mudassir Ali from Pexels

After uploading those images to the S3 bucket that we defined in our Lambda trigger, the Lambda should be invoked once for each image upload. Once finished, we can inspect the results in DynamoDB.

PartiQL query editor in DynamoDB (1) — image by the author

It looks like the first two images were recognized pretty well, but the difficult image on the right (“Is this just fantasy”) was not.

How can we do it at scale?

If you want to implement a similar use case at scale, it may become challenging to monitor those resources and to implement proper alerting about errors. In that case, you may want to consider a serverless observability platform such as Dashbird. You could group all your Lambda functions and corresponding DynamoDB, ECS, Kinesis, Step Functions, SNS, or SQS resources into a project dashboard that allows you to see the current status of your microservices and ML pipelines at a glance.

Each color represents a different resource such as a specific Lambda function. By hovering over each of them, you can see more details. Then, you can drill down for more information.

Side note: The zero costs in the image above can be justified by an always-free tier of AWS Lambda and DynamoDB. I’ve been using Lambda for my personal projects for the last two years and have never been billed for Lambda so far.

AWS always-free tier pricing table — AWS always-free tier. Source— screenshot by the author

Conclusion

To answer the question from the title: yes, a data lake can definitely speed up the development of data pipelines, especially those related to data science use cases. The ability to deal with a wide range of data formats and easily integrate this data with distributed processing and ML frameworks makes data lakes particularly useful for teams that do data science at scale. Still, if using raw data from a data lake requires too much cleaning, data scientists and data engineers may still prefer to use already preprocessed and historized data from DWH. As always, consider carefully what works best for your use case.

Thank you for reading! If this article was useful, follow me to see my next posts.

References & additional resources:

Read our blog

Making serverless applications reliable and bug-free

In this guide, we’ll talk about common problems developers face with serverless applications on AWS and share some practical strategies to help you monitor and manage your applications more effectively.

ANNOUNCEMENT: new pricing and the end of free tier

Today we are announcing a new, updated pricing model and the end of free tier for Dashbird.

4 Tips for AWS Lambda Performance Optimization

In this article, we’re covering 4 tips for AWS Lambda optimization for production. Covering error handling, memory provisioning, monitoring, performance, and more.

Made by developers for developers

Dashbird was born out of our own need for an enhanced serverless debugging and monitoring tool, and we take pride in being developers.

Get started free or learn more

What our customers say

Dashbird gives us a simple and easy to use tool to have peace of mind and know that all of our Serverless functions are running correctly. We are instantly aware now if there’s a problem. We love the fact that we have enough information in the Slack notification itself to take appropriate action immediately and know exactly where the issue occurred.

Thanks to Dashbird the time to discover the occurrence of an issue reduced from 2-4 hours to a matter of seconds or minutes. It also means that hundreds of dollars are saved every month.

Great onboarding: it takes just a couple of minutes to connect an AWS account to an organization in Dashbird. The UI is clean and gives a good overview of what is happening with the Lambdas and API Gateways in the account.

I mean, it is just extremely time-saving. It’s so efficient! I don’t think it’s an exaggeration or dramatic to say that Dashbird has been a lifesaver for us.

Dashbird provides an easier interface to monitor and debug problems with our Lambdas. Relevant logs are simple to find and view. Dashbird’s support has been good, and they take product suggestions with grace.

Great UI. Easy to navigate through CloudWatch logs. Simple setup.

Dashbird helped us refine the size of our Lambdas, resulting in significantly reduced costs. We have Dashbird alert us in seconds via email when any of our functions behaves abnormally. Their app immediately makes the cause and severity of errors obvious.