With the volume, velocity, and variety of today's data, we have all come to acknowledge that there is no one-size-fits-all database for every data need. Instead, many companies have shifted toward choosing the right data store for each specific use case or project. The distribution of data across different data stores brought the challenge of consolidating it for analytics. Historically, the only viable solution was to build a data warehouse: extract data from all those different sources, clean and bring it together, and finally load it into polished Data Warehouse (DWH) tables with a well-defined structure. While there is nothing wrong with this approach, a combination of a data lake and a data warehouse may be just the solution you need. Let's investigate why.
1. Building a staging area for your data warehouse
A data lake doesn't need to be the end destination of your data. Data is constantly flowing, moving, and changing its form and shape. A modern data platform should facilitate easy ingestion and discoverability, while at the same time allowing for a thorough and rigorous structure for reporting needs. A common emerging pattern is that the data lake serves as an immutable layer for data ingestion. Nothing ever gets deleted from it (at most overwritten by a new version, or removed for compliance reasons), so all raw data ever ingested into your data platform can be found in the data lake. You can still have ELT/ETL jobs that transform and clean the data and later load it into your data warehouse, while strictly following the Kimball, Inmon, or Data Vault methodology, including Slowly Changing Dimension historization and schema alignment.
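As a minimal illustration of this pattern in plain Python (the layer names and records are made up for the sketch): the bronze layer only ever appends, while a cleaned view can be rebuilt from it at any time.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the immutable "bronze" layer
# and the cleaned "silver" layer described above.
bronze_layer = []  # append-only: raw records are never mutated or deleted


def ingest_raw(record: dict) -> None:
    """Append a raw record to the immutable bronze layer as-is."""
    bronze_layer.append({
        "payload": record,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })


def build_silver() -> list:
    """Derive a cleaned 'silver' view; bronze stays untouched, so the
    transformation can be re-run or changed without losing raw data."""
    return [
        {**r["payload"], "name": r["payload"]["name"].strip().title()}
        for r in bronze_layer
        if r["payload"].get("name")  # drop records failing a basic quality check
    ]


ingest_raw({"name": "  gas plant a "})
ingest_raw({"name": None})  # bad record: kept in bronze, filtered out of silver
print(build_silver())       # [{'name': 'Gas Plant A'}]
```

The key property is that the cleaning logic can evolve freely because the raw layer is never modified.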
You don't need to choose between a data lake and a data warehouse. You can have both: a data lake as an immutable staging area and a data warehouse for BI and reporting.
Databricks coined the term data lakehouse, which strives to combine the best of both worlds in a single solution. Similarly, platforms such as Snowflake let you use cloud storage buckets such as S3 as external stages, effectively turning the data lake into a staging area.
In the end, you need to decide for yourself whether a single "data lakehouse" or a combination of a data lake and a data warehouse works best for your use case. Monte Carlo Data put it nicely:
“Increasingly, we’re finding that data teams are unwilling to settle for just a data warehouse, a data lake, or even a data lakehouse — and for good reason. As more use cases emerge and more stakeholders (with differing skill sets!) are involved, it is almost impossible for a single solution to serve all needs.” — Source
2. An audit log of all data ever ingested, thanks to the immutable staging area
An audit trail is often important to satisfy regulatory requirements. Data lakes make it easy to collect metadata about when and by which user the data was ingested. This can be helpful not only for compliance reasons but also to track data ownership.
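One lightweight way to capture such metadata is to record who ingested what, from where, and when alongside each object. A sketch in plain Python (the key layout and field names are hypothetical):

```python
import json
from datetime import datetime, timezone


def audit_entry(s3_key: str, user: str, source_system: str) -> str:
    """Build an audit-log entry for one ingested object.
    In practice this could be written to a separate S3 prefix,
    a DynamoDB table, or emitted as a structured log line."""
    return json.dumps({
        "s3_key": s3_key,
        "ingested_by": user,           # supports ownership tracking
        "source_system": source_system,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })


entry = json.loads(audit_entry("bronze/sales/2021-07-01.csv", "etl_user", "crm"))
print(entry["ingested_by"])  # etl_user
```

Because the staging area is immutable, this log stays an accurate record of everything that ever entered the platform.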
3. Shorter time-to-value and time-to-insights
By providing an immutable layer of all data ever ingested, we make data available to all consumers immediately after it is obtained. Serving raw data enables exploratory analysis that would be hard to support otherwise, since different data teams often use the same dataset in very different ways and need different transformations of the same raw data. A data lake allows you to dive into all sorts and flavors of data and decide for yourself what might be useful to generate insights.
4. A single data platform for real-time and batch analytics
Ingesting real-time data into a data warehouse remains a challenging problem. Even though there are tools on the market that try to address it, the problem becomes much easier to solve when you use the data lake as the immutable layer for ingesting all of your data. For instance, Kinesis Data Firehose, or Apache Kafka with its S3 sink connector, let you specify an S3 location as the sink for your streaming data.
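As an illustration of the Kafka route, the Confluent S3 sink connector streams topics straight into a bucket. A hedged sketch of its configuration (the connector name, topic, bucket, and region are made up; the property names are the connector's standard ones):

```json
{
  "name": "s3-sink-bronze",
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "topics": "clickstream-events",
  "s3.bucket.name": "data-lake-bronze",
  "s3.region": "eu-central-1",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "flush.size": "1000"
}
```

With a setup like this, streaming data lands in the same bronze layer as your batch ingestion, so both can feed the same downstream transformations.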
5. Independent scaling of storage and compute
With the growing volume of data from social media, sensors, logs, and web analytics, it can become expensive over time to store all of your data in a data warehouse. Many traditional data warehouses tie storage and processing tightly together, making it difficult to scale each independently.
Data lakes scale storage and processing (queries and API requests to retrieve data) independently of each other. Some cloud data warehouses support this paradigm as well. More on that in my previous article.
6. A fully serverless data platform
Typically, data warehouse solutions require you to manage the underlying compute clusters. Cloud vendors have started to address this pain by building fully managed or entirely serverless analytical data stores.
For instance, when leveraging S3 with AWS Glue and Athena, your platform remains fully serverless and you pay only for what you use. You can utilize this single data platform to:
- retrieve both relational and non-relational data,
- query historical and real-time data,
- checkpoint your ML training jobs and serve ML models,
- query data directly after ingestion, before any transformations have been applied,
- combine data from the data lake and DWH tables via external tables (available in nearly any DWH solution: Redshift Spectrum, Snowflake external tables, …),
- integrate with other services and distributed compute frameworks, such as Dask or Spark.
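As an illustration of querying such a platform from Python, here is a hedged sketch using the awswrangler library (the database and table names are hypothetical; the import is deferred so the sketch can be loaded without AWS dependencies installed):

```python
def query_data_lake(sql: str, database: str):
    """Run an Athena query against Glue-cataloged data lake tables
    and return the result as a Pandas dataframe."""
    # Deferred import: only needed when the query actually runs.
    import awswrangler as wr  # pip install awswrangler
    return wr.athena.read_sql_query(sql=sql, database=database)


# Hypothetical usage against a table registered in the Glue catalog:
# df = query_data_lake(
#     "SELECT country, COUNT(*) AS plants FROM power_plants GROUP BY country",
#     database="data_lake_silver",
# )
```

Because Athena reads directly from S3, the same call works for raw, preprocessed, and polished data alike.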
Regarding the integrations, on AWS, you can leverage:
- Lake Formation for management of access,
- awswrangler (Python library that can be described as Pandas on AWS),
- Quicksight (AWS BI tool),
- Delta Lake (an open-source storage layer created by Databricks providing, among other things, ACID-compliant transactions and upserts for your data lake),
- lakeFS (version control for your data),
- Upsolver (among other things, ingestion of streaming and batch data, including upserts, using the Kappa architecture),
- AWS Database Migration Service which allows you to incrementally export data from your RDS database tables (or even entire schemas) into S3 parquet files that can be crawled with AWS Glue and queried using Athena.
7. Future-proof
I couldn't find any trustworthy statistics, but my guess is that at least a third of the data typically stored in a data warehouse is almost never used. Such data sources are ingested, cleaned, and maintained "just in case" they might be needed later. This means that data engineers invest a lot of time and effort into building and maintaining something that may not even have a clear business need yet.
The ELT paradigm allows you to save engineering time by building data pipelines only for use cases that are really needed, while simultaneously storing all the data in a data lake for potential future use cases. If a specific business question arises in the future, you may find the answer because the data is already there. But you don’t have to spend time cleaning and maintaining data pipelines for something that doesn’t yet have a clear business use case.
Another reason why data lakes and cloud data platforms are future-proof is that if your business grows beyond your imagination, your platform is equipped for that growth. You don't need expensive migration scenarios to a larger or smaller database to accommodate it.
Regardless of your choice, your cloud data platform should allow you to grow your data assets with virtually no limits.
Demo: Serverless Event-driven ETL with Data Lake on AWS
To build an event-driven ETL demo, I used this dataset and followed the Databricks bronze-silver-gold principle. In short, the "bronze" layer holds raw data, "silver" holds preprocessed and cleaned data, and "gold" tables represent the final stage of polished data for reporting. To implement this, I created:
- S3 bucket for raw data: s3://data-lake-bronze
- S3 bucket for cleaned and transformed data: s3://data-lake-silver
- AWS Lambda function (called event-driven-etl), which is triggered any time a new file arrives in the "bronze" S3 bucket. It transforms the new object and loads the data to the "silver" stage.
In the images below you can find the Dockerfile and Lambda function code I used for this simple demo. For a step-by-step guide on how to build a Lambda function with a Docker container image, you can check my previous blog post on Dashbird.
Dockerfile for the lambda function — image by author
A simple event-driven ETL in AWS Lambda — image by author
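Since the function code is only shown as an image above, here is a hedged sketch of what such a handler could look like, assuming awswrangler; the bucket, database, and table names are made up, and this is not the exact code from the demo:

```python
import urllib.parse


def silver_key(bronze_key: str) -> str:
    """Map a raw object key to its location in the silver layer
    (hypothetical convention: one prefix per source file)."""
    stem = bronze_key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{stem}/"


def handler(event, context):
    """Triggered by S3 ObjectCreated events on the 'bronze' bucket:
    reads the new raw file, cleans it, and writes the result as
    Parquet to the 'silver' bucket."""
    # Imported inside the handler so this sketch can be loaded
    # locally without the dependency installed.
    import awswrangler as wr

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")
        df = df.dropna(subset=["name"])  # example cleaning step

        wr.s3.to_parquet(
            df=df,
            path=f"s3://data-lake-silver/{silver_key(key)}",
            dataset=True,
            database="data_lake",   # hypothetical Glue database
            table="power_plants",   # hypothetical table name
        )
```

The `dataset=True`, `database`, and `table` arguments are what make `wr.s3.to_parquet()` register the output in the Glue catalog, as described below.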
The command wr.s3.to_parquet() not only loads the data to a new data lake location, but also:
- compresses the data with snappy and stores it in the Parquet format,
- infers the schema from the Pandas dataframe's data types and column names,
- stores that schema in the AWS Glue catalog,
- creates a new Athena table.
As a result, we can see how S3, AWS Glue, and Athena play together in the management console:
The silver dataset in S3, Glue, and Athena — image by author
How does the serverless ETL with Lambda scale?
Imagine doing something similar for many more datasets: managing all those Lambda functions would likely become challenging. Even though the compute power of AWS Lambda scales virtually infinitely, managing the state of data transformations is difficult, especially in real-time, event-driven scenarios. If you use an observability platform such as Dashbird, you can easily inspect which event-driven ETL workloads succeeded and which did not.
When testing my function, I made several mistakes. Dashbird observability was very helpful for viewing the state of my event-driven ETL, including all the error messages. It allowed me to dive deeper into the logs and inspect all unsuccessful executions at a glance. Imagine how difficult this would be if you had to do it for hundreds of ETL jobs.
Similarly, configuring alerts on failure is as simple as adding your email address or Slack channel to the Alerting policy:
You can also be notified based on other conditions, such as cold starts, durations exceeding a specific threshold (potentially a zombie task), or an unusually high number of function invocations within a specific period of time.
It may be considered a strong statement, but I believe that data lakes, as well as data warehouse solutions with data lake capabilities, constitute an essential component in building any future-proof data platform. Building a relational schema in advance for all of your data is inefficient and often incompatible with today’s data needs. Also, having an immutable data ingestion layer storing all data ever ingested is highly beneficial for audit, data discovery, reproducibility, and fixing mistakes in data pipelines.
Global power plant database — Data source used for the demo