With the volume, velocity, and variety of today’s data, we have all come to acknowledge that there is no one-size-fits-all database for every data need. Instead, many companies have shifted towards choosing the right data store for a specific use case or project. The distribution of data across different data stores brought the challenge of consolidating data for analytics. Historically, the only viable solution was to build a data warehouse: extract data from all those different sources, clean and combine it, and finally load it into polished Data Warehouse (DWH) tables with a well-defined structure. While there is nothing wrong with this approach, a combination of a data lake and a data warehouse may be just the solution you need. Let’s investigate why.
A data lake doesn’t need to be the end destination of your data. Data is constantly flowing, moving, and changing its form and shape. A modern data platform should make ingestion and discovery easy, while at the same time allowing a thorough and rigorous structure for reporting needs. A common emerging pattern is for the data lake to serve as an immutable layer for data ingestion: nothing ever gets deleted from it, save for overwrites by a newer version or deletions required for compliance. All raw data ever ingested into your data platform can be found in the data lake. This means that you can still have ELT/ETL jobs that transform and clean the data and later load it into your data warehouse, strictly following the Kimball, Inmon, or Data Vault methodology, including Slowly Changing Dimension historization and schema alignment.
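The Slowly Changing Dimension historization mentioned above is easy to sketch. The following is a minimal, library-free illustration of an SCD Type 2 upsert, not code from any particular warehouse tool; the customer_id/city dimension and its field names are made up for the example:

```python
from datetime import date

def scd2_upsert(dim_rows: list[dict], incoming: dict, today: date) -> list[dict]:
    """SCD Type 2: expire the current version of a changed record
    (set its valid_to) and append a new version with an open end date."""
    key = incoming["customer_id"]
    updated, changed = [], True
    for row in dim_rows:
        if row["customer_id"] == key and row["valid_to"] is None:
            if row["city"] == incoming["city"]:
                changed = False          # attributes unchanged: keep the open row
                updated.append(row)
            else:
                updated.append({**row, "valid_to": today})  # expire old version
        else:
            updated.append(row)
    if changed:
        updated.append({"customer_id": key, "city": incoming["city"],
                        "valid_from": today, "valid_to": None})
    return updated

dim = [{"customer_id": 1, "city": "Berlin",
        "valid_from": date(2020, 1, 1), "valid_to": None}]
dim = scd2_upsert(dim, {"customer_id": 1, "city": "Hamburg"}, date(2021, 6, 1))
# dim now holds two versions: Berlin (expired) and Hamburg (current, open-ended)
```

Because the raw data stays in the lake, you can always rebuild such historized dimension tables from scratch if the logic changes.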
You don’t need to choose between a data lake and a data warehouse. You can have both: a data lake as an immutable staging area and a data warehouse for BI and reporting.
Databricks coined the term data lakehouse, which strives to combine the best of both worlds in a single solution. Similarly, platforms such as Snowflake allow you to use cloud storage buckets such as S3 as external stages, effectively turning the data lake into a staging area.
In the end, you need to decide for yourself whether a single “data lakehouse” or a combination of a data lake and a data warehouse works best for your use case. Monte Carlo Data put it nicely:
“Increasingly, we’re finding that data teams are unwilling to settle for just a data warehouse, a data lake, or even a data lakehouse — and for good reason. As more use cases emerge and more stakeholders (with differing skill sets!) are involved, it is almost impossible for a single solution to serve all needs.” — Source
An audit trail is often important to satisfy regulatory requirements. Data lakes make it easy to collect metadata about when and by which user the data was ingested. This can be helpful not only for compliance reasons but also to track data ownership.
By providing an immutable layer of all data ever ingested, we make data available to all consumers immediately after it is obtained. Raw data also enables exploratory analysis that would be difficult on polished tables, because different data teams often use the same dataset in very different ways and need different transformations of the same raw data. A data lake lets you dive into data of every sort and shape and decide for yourself what might be useful for generating insights.
Ingesting real-time data into a data warehouse is still a challenging problem. Even though there are tools on the market that try to address it, the problem becomes much easier to solve when you use the data lake as an immutable layer for ingesting all of your data. For instance, solutions such as Kinesis Data Firehose or Apache Kafka’s S3 sink connector let you deliver streaming data directly to an S3 location.
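As an illustration of that S3-sink pattern, the Confluent S3 sink connector for Kafka Connect can continuously land a topic’s records in the lake. In the following configuration sketch, the connector name, topic, bucket, and region are placeholders:

```json
{
  "name": "s3-sink-data-lake",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "orders",
    "s3.bucket.name": "my-data-lake-bronze",
    "s3.region": "eu-central-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "1"
  }
}
```

Once events are flowing into the bucket, downstream ELT jobs can pick them up from there instead of pushing streams directly into the warehouse.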
With the growing volume of data from social media, sensors, logs, and web analytics, storing all of your data in a data warehouse can become expensive over time. Many traditional data warehouses tie storage and processing tightly together, making it difficult to scale either one independently.
Data lakes scale storage and processing (queries and API requests to retrieve data) independently of each other. Some cloud data warehouses support this paradigm, as well. More on that in my previous article:
Why you are throwing money away if your cloud data warehouse doesn’t separate storage and compute
What you should consider before migrating to the cloud to make your data warehouse and data lake future-proof
Typically, data warehouse solutions require you to manage the underlying compute clusters. Cloud vendors recognized this pain point and built either fully managed or entirely serverless analytical data stores.
For instance, when leveraging S3 with AWS Glue and Athena, your platform remains fully serverless and you pay only for what you use. This single data platform gives you durable storage (S3), data cataloging and ETL (Glue), and ad-hoc SQL analytics (Athena).
Regarding integrations, on AWS you can plug this platform into services such as Lambda for event-driven transformations and Kinesis for streaming ingestion.
I couldn’t find any trustworthy statistics, but my guess is that at least a third of the data typically stored in a data warehouse is almost never used. Such data sources are ingested, cleaned, and maintained “just in case” they might be needed later. This means that data engineers invest a lot of time and effort into building and maintaining something that may not even have a clear business need.
The ELT paradigm allows you to save engineering time by building data pipelines only for use cases that are really needed, while simultaneously storing all the data in a data lake for potential future use cases. If a specific business question arises in the future, you may find the answer because the data is already there. But you don’t have to spend time cleaning and maintaining data pipelines for something that doesn’t yet have a clear business use case.
Another reason why data lakes and cloud data platforms are future-proof is that if your business grows beyond your imagination, your platform is equipped for that growth. You don’t need expensive migration scenarios to a larger or smaller database to accommodate it.
Regardless of your choice, your cloud data platform should allow you to grow your data assets with virtually no limits.
To build an event-driven ETL demo, I used this dataset and followed the Databricks bronze-silver-gold principle: the “bronze” layer holds raw data, “silver” holds preprocessed and cleaned data, and “gold” tables represent the final, polished stage used for reporting. To implement this, I created an S3 bucket with a prefix per layer, a Lambda function triggered by new objects landing in the bucket, and a Glue database to catalog the resulting tables.
In the images below you can find the Dockerfile and Lambda function code I used for this simple demo. For a step-by-step guide on how to build a Lambda function with a Docker container image, you can check my previous blog post on Dashbird.
Dockerfile for the lambda function — image by author
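The pictured Dockerfile is not reproduced here, but a minimal Lambda container Dockerfile along these lines follows the standard AWS Python base-image pattern; the Python version and the app.py handler module are assumptions, not the exact file from the image:

```dockerfile
# AWS-provided base image for Python Lambda functions
FROM public.ecr.aws/lambda/python:3.9

# awswrangler provides the S3/Glue/Athena helpers used by the handler
RUN pip install awswrangler

# copy the handler into the location the Lambda runtime expects
COPY app.py ${LAMBDA_TASK_ROOT}

# module.function that Lambda invokes
CMD ["app.lambda_handler"]
```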
A simple event-driven ETL in AWS Lambda — image by author
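Since the handler is shown only as an image, here is a sketch of what such an event-driven bronze-to-silver handler could look like with awswrangler; the prefixes, the cleaning step, and the Glue database and table names are illustrative assumptions, not the exact code from the image:

```python
import urllib.parse

def parse_s3_event(event: dict) -> tuple[str, str]:
    """Extract the bucket and URL-decoded object key from an S3 put event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return bucket, key

def lambda_handler(event, context):
    # awswrangler ships inside the container image built from the Dockerfile
    import awswrangler as wr

    bucket, key = parse_s3_event(event)
    # bronze -> silver: read the raw CSV, apply a cleaning step, then write
    # Parquet and register the table in the Glue Data Catalog
    df = wr.s3.read_csv(f"s3://{bucket}/{key}")
    df = df.dropna(axis="columns", how="all")  # illustrative cleaning step
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/silver/power_plants/",
        dataset=True,
        database="data_lake_demo",
        table="power_plants_silver",
        mode="overwrite",
    )
```

The handler is triggered by S3 object-created notifications on the bronze prefix, so each new raw file is cleaned and cataloged as soon as it lands.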
The command wr.s3.to_parquet() not only loads the data to a new data lake location; when called with dataset=True and a Glue database and table name, it also converts the data to Parquet and creates or updates the corresponding table in the AWS Glue Data Catalog, making it immediately queryable from Athena.
As a result, we can see how S3, AWS Glue, and Athena play together in the management console:
The silver dataset in S3, Glue, and Athena — image by author
Imagine doing something similar for many more datasets. Managing all those Lambda functions would likely become challenging. Even though the compute power of AWS Lambda scales virtually without limit, managing the state of data transformations is difficult, especially in real-time and event-driven scenarios. If you use an observability platform such as Dashbird, you can easily inspect which event-driven ETL workloads succeeded and which did not.
When testing my function, I made several mistakes. Dashbird’s observability was very helpful for viewing the state of my event-driven ETL, including all the error messages. It allowed me to dive deeper into the logs and inspect all unsuccessful executions at a glance. Imagine how difficult this would be if you had to do it for hundreds of ETL jobs.
Similarly, configuring alerts on failure is as simple as adding your email address or Slack channel to the Alerting policy:
You can also be notified based on other selected conditions such as cold starts, when duration exceeds a specific threshold (potentially a zombie task), or when there is an unusually high number of function invocations for a specific period of time.
It may be considered a strong statement, but I believe that data lakes, as well as data warehouse solutions with data lake capabilities, constitute an essential component in building any future-proof data platform. Building a relational schema in advance for all of your data is inefficient and often incompatible with today’s data needs. Also, having an immutable data ingestion layer storing all data ever ingested is highly beneficial for audit, data discovery, reproducibility, and fixing mistakes in data pipelines.
Deploying AWS Lambda with Docker containers
4 Immediate Benefits You’ll Gain from a Modern Monitoring Platform
Data lake vs. Warehouse
Global power plant database — Data source used for the demo