Will Serverless computing reshape big data and data science?

Serverless development has been turning heads in the market for quite some time now, but it has yet to be accepted by the majority of the development community. With AWS Lambda, Azure Functions, and IBM's OpenWhisk, the market is poised to take a different route in this field.

These organizations are spending heavily to win the market over to the serverless paradigm. In the coming years, our thinking, our terminology, and the way we develop will change significantly. This is true for many other domains as well: people have already started experimenting with serverless solutions for big data, data science, and virtual reality, among many other fields.

This development has yet to mature. But given the rapid pace at which the open-source community, commercial organizations, and market trends are embracing serverless computing, we could see it redefining big data solutions. Data science will surely evolve further with the rise of serverless development.

Hopefully, as serverless technology matures, development and monitoring solutions will become more sophisticated as well. The core business-value questions that organizations planning to go serverless commonly ask are:

  • Will it reduce the overall cost of setting up big data infrastructure?
  • Will it minimize the organization’s operational cost?
  • Is a pool of developers skilled in this type of development available?
  • What skills will be necessary?
  • How can we manage large data pipelines?

It will be fun to see where serverless takes us next. We look forward to the journey into uncharted territory, which should help untangle the complex world of the data economy.

The Evolution of Big Data

Purpelle.com, an online fashion retailer, has built scalable serverless data pipelines that run at low operating cost while being engineered and maintained by a single developer. These pipelines collect millions of data points every day, using AWS services such as Kinesis, Lambda, and Kinesis Data Firehose.

Serverless computing has opened new possibilities for the development community. Paying only for what you use has made it relatively cheap to leverage the power of the cloud.

Big data analytics has evolved through the following phases.

Lambda can be used to perform essential compute functions, much like a Hadoop distribution platform. Early Hadoop processing involved setting up on-premise infrastructure: cheap commodity hardware for the worker nodes and one rather costly machine for the master.

Soon enough, the cloud started to offer the same capability: you could spin up several instances in seconds, depending on your requirements. These instances worked much like the on-premise Hadoop infrastructure. The migration from physical infrastructure to the cloud gave an edge to organizations that had difficulty setting up extensive infrastructure on cheap commodity hardware.

The idea of pay-per-use and spinning up as many instances as required led people to run Hadoop jobs in the cloud without maintaining physical hardware. It also provided an additional advantage: as soon as the job finished, you could simply delete the entire cloud instance and stop paying for it.
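This ephemeral-cluster pattern can be sketched with boto3's EMR API. The cluster name, instance types, and step arguments below are illustrative assumptions, not from the article; the key setting is auto-termination once the job's steps complete, so you stop paying the moment the work is done.

```python
# Sketch of an ephemeral Hadoop job on AWS EMR via boto3 (names and
# sizes are made up for illustration).
job_flow = {
    "Name": "nightly-wordcount",
    "ReleaseLabel": "emr-6.2.0",
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        # Key setting: tear the cluster down once all steps finish,
        # so no idle instances keep billing.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming", "..."],  # job-specific args elided
        },
    }],
}

# Submitting it would look like this (requires AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**job_flow)
```

The cluster exists only for the lifetime of the job, which is exactly the "delete it when finished" advantage described above.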

But cloud computing still required manual intervention from the user to spin up the virtual machines, so an operations or infrastructure team was needed to maintain the instances.

Lambda goes above and beyond: it covers everything you need in your development environment. The best part is that you don’t need a team to scale up capacity manually; Lambda takes care of that itself, automatically scaling the system whenever required. Lambda is an event-driven system: it uses events to trigger actions.
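Concretely, a Lambda function is just a handler the platform invokes once per triggering event. A minimal sketch (the event shape here is a hand-built placeholder, not a real AWS payload):

```python
# Minimal AWS Lambda handler: the platform calls this function once per
# triggering event (an S3 upload, a Kinesis batch, an HTTP request, ...).
def handler(event, context):
    # 'event' carries the trigger payload; its shape depends on the source.
    source = event.get("source", "unknown")
    return {"status": "processed", "source": source}

# Locally you can exercise the handler with a hand-built event:
result = handler({"source": "aws.s3"}, None)
# result == {"status": "processed", "source": "aws.s3"}
```

Scaling is simply Lambda running many such invocations in parallel, one per event, with no servers for you to manage.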

What’s more, it can do Hadoop-like processing while eliminating the need to bring in the Hadoop framework at all. The only catch is that no one knows what hardware powers AWS Lambda. But judging by the other robust infrastructure AWS provides, Lambda can be trusted to perform big data processing without hardware-related failures.

Here’s a diagram of a Hadoop map and reduce task.
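The shape of such a task can be sketched in plain Python: a map step emits (key, value) pairs from each input chunk, and a reduce step aggregates values per key. This toy word count is only an illustration of the pattern, not Hadoop code.

```python
from collections import Counter
from itertools import chain

# Toy word count in the classic map/reduce shape.
def map_step(chunk):
    # Map: emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_step(pairs):
    # Reduce: sum the counts per word across all mapper outputs.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data", "big serverless data"]
word_counts = reduce_step(chain.from_iterable(map_step(c) for c in chunks))
# word_counts == {"big": 2, "data": 2, "serverless": 1}
```

In a serverless setup, each `map_step` call would be one Lambda invocation over one input chunk, with a final function performing the reduce.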

Overview of big data processing using AWS Lambda

Two techniques or approaches can be applied to achieve big data processing:

  1. Use Amazon S3 if working with persistent data.
  2. Use Amazon Kinesis if working with streaming data.
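In both approaches, Lambda receives the data through the event it is invoked with. The sketch below shows how a handler reads each type, following the documented AWS event formats; the object key and payload contents are made-up examples.

```python
import base64
import json

def s3_handler(event, context):
    # S3 "ObjectCreated" events list the affected objects under Records;
    # typically you'd then fetch each object and process it.
    return [r["s3"]["object"]["key"] for r in event["Records"]]

def kinesis_handler(event, context):
    # Kinesis record data arrives base64-encoded; decode before use.
    return [
        json.loads(base64.b64decode(r["kinesis"]["data"]))
        for r in event["Records"]
    ]

# Hand-built sample events in the shapes AWS delivers:
s3_event = {"Records": [{"s3": {"object": {"key": "raw/2021/clicks.json"}}}]}
kinesis_event = {"Records": [{"kinesis": {
    "data": base64.b64encode(json.dumps({"clicks": 3}).encode()).decode()
}}]}

keys = s3_handler(s3_event, None)          # ["raw/2021/clicks.json"]
payloads = kinesis_handler(kinesis_event, None)  # [{"clicks": 3}]
```

The choice is therefore mostly about the data's lifecycle: S3 triggers fire when a persistent object lands, while Kinesis triggers deliver batches from a continuous stream.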

A serverless map-reduce architecture using AWS Lambda (image source: AWS)

Cost comparison for Hadoop processing

During the early days of Hadoop, much of the cost went into setting up on-premise infrastructure with cheap commodity hardware. A petabyte-scale Hadoop cluster required around 200 nodes, at about $4,000 per node, for a total of approximately $1 million. It accumulated operational and hardware costs of roughly $36 per hour, whether idling or running.

As cloud solutions matured, organizations began to leverage their power for big data problems. No commodity-hardware infrastructure needed to be maintained anymore, and the pay-as-you-go model added flexibility by eliminating the need to support hardware.

In a cloud model, the virtual machines were shut down once the goal was achieved, bringing the cost down even further. Running a static EC2 cluster for a 100 TB workload costs around $78,000, while Amazon EMR with S3 for the same amount of data costs about $28,000.
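One way to read the figures quoted above is with a little arithmetic (using only the numbers in this article, not official AWS pricing; the gap between the $800,000 in commodity nodes and the "approximately $1 million" total presumably covers the master machine and setup):

```python
# Figures quoted in the text above; treat these as rough estimates.
NODES = 200
COST_PER_NODE = 4_000        # on-premise commodity node, USD
HOURLY_RUN_COST = 36         # operational + hardware, idle or busy

hardware_cost = NODES * COST_PER_NODE          # 800,000
annual_run_cost = HOURLY_RUN_COST * 24 * 365   # 315,360 per year

# Cloud options for the 100 TB workload quoted above:
static_ec2_cost = 78_000
emr_s3_cost = 28_000
savings = 1 - emr_s3_cost / static_ec2_cost    # ~64% cheaper

print(f"On-prem node hardware: ${hardware_cost:,}")
print(f"On-prem yearly run cost: ${annual_run_cost:,}")
print(f"EMR + S3 vs static EC2 savings: {savings:.0%}")
```

The striking part is the on-premise hourly cost: it accrues whether the cluster is working or idle, which is exactly what the pay-per-use cloud model eliminates.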

Coming to AWS Lambda’s pricing model, Amazon has been quite shrewd in reading market sentiment.

Source: Amazon

It’s pretty evident from the pricing model above that as serverless services mature, the cost of big data infrastructure will come down. For now, we wait for more community-driven development with serverless in the field of big data, and expect this remarkable service to revolutionize our existing big data and data science solutions.
