Serverless Big Data Pipeline Implementation

Bashkar Ndas

July 12th, 2018

Recently, I came across a summary of the AWS India Summit 2016, where Purplle showcased a data pipeline built on a serverless architecture. Surprisingly, it was handled by a one-person team, and so efficiently that I decided to explore the architecture and how they implemented it in their organization.

Image source: iamwire

In implementing the pipeline, the purplle.com team faced the challenges that Big Data is known for:

Variety
Velocity
Veracity

A general data pipeline architecture is made up of the following components:

Collectors/Routers: They handle the massive influx of data arriving through streams such as click-streams and ad impressions.
Data lake: A redundant and durable store that can handle I/O at high volumes and is available all the time.
Data warehouse: A flexible warehouse that allows experimentation with data modelling and continuous ingestion of raw data from the data lake.
Hot data tier (NoSQL/Cache): A store that reads and writes quickly, both for single items and for batches, and keeps performing under uneven traffic.

Purplle implemented the same architecture using AWS Lambda.

Image source: iamwire.com

Trackers (Kinesis SDK and API Gateway with Lambda): The Kinesis SDK and API Gateway with Lambda were suitable candidates for the tracker role. They were used to collect data seamlessly from various sources such as apps, the website, servers, and the CRM. This ensured that no data was lost to network or connection errors, and the team didn’t have to worry about managing exceptions and retries.
Collectors (Kinesis, Lambda, Kinesis Firehose): Among the collectors, Kinesis played a role equivalent to Kafka's, buffering the streaming data. Schema policing, validations, and enrichers were written in Node.js and ran whenever Lambda was triggered from Kinesis. Finally, Kinesis Firehose streamed the validated data into the downstream sinks, S3 and Redshift. A copy of the data was also ingested into a real-time prediction engine and eventually into DynamoDB.
Data lake (S3): AWS S3 is a highly durable, highly available, and ridiculously cheap object store. It supports on-the-fly data encryption as well. It is a natural fit for a data lake and is widely used in the industry for this very use case.
Data warehouse (Redshift): AWS Redshift enables us to quickly model and query our data using standard SQL queries. Loading raw data into a model can be done with a few clicks using AWS Data Pipeline. It’s very powerful, cheap, and flexible: the size of the cluster can be changed on the fly.
NoSQL/Caching (DynamoDB): We use DynamoDB as our hot data store. It’s a fully-managed, scalable, low-latency NoSQL database.
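
To make the collector stage a bit more concrete, here is a minimal sketch of what a Kinesis-triggered Lambda doing schema policing and enrichment could look like. It is not Purplle's code: their functions were written in Node.js, while this sketch uses Python, and the schema, the field names, and the enrichment fields are all assumptions made up for illustration. Only the shape of the incoming event (base64-encoded payloads under `Records[].kinesis.data`) follows what Lambda delivers for a Kinesis trigger.

```python
import base64
import json
from datetime import datetime, timezone

# Hypothetical schema (not from the source): field name -> required type.
REQUIRED_FIELDS = {"event_type": str, "user_id": str, "timestamp": int}


def validate(payload):
    """Return a list of schema problems; an empty list means the event passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def enrich(payload, sequence_number):
    """Return a copy of the event with illustrative pipeline metadata attached."""
    enriched = dict(payload)
    enriched["kinesis_sequence_number"] = sequence_number
    enriched["processed_at"] = datetime.now(timezone.utc).isoformat()
    return enriched


def handler(event, context):
    """Lambda entry point for a Kinesis trigger: decode, validate, enrich."""
    valid, rejected = [], []
    for record in event.get("Records", []):
        kinesis = record["kinesis"]
        sequence_number = kinesis.get("sequenceNumber")
        try:
            # Kinesis payloads reach Lambda base64-encoded.
            payload = json.loads(base64.b64decode(kinesis["data"]))
        except ValueError:
            # Covers bad base64, bad UTF-8, and malformed JSON.
            rejected.append({"sequence_number": sequence_number,
                             "problems": ["payload is not valid JSON"]})
            continue
        if not isinstance(payload, dict):
            rejected.append({"sequence_number": sequence_number,
                             "problems": ["payload is not a JSON object"]})
            continue
        problems = validate(payload)
        if problems:
            rejected.append({"sequence_number": sequence_number,
                             "problems": problems})
        else:
            valid.append(enrich(payload, sequence_number))
    # Forwarding the valid events to Kinesis Firehose is left out of this sketch.
    return {"valid": valid, "rejected": rejected}
```

In a real deployment the valid events would be handed on to Firehose rather than returned, and rejected events would probably go to a log or a dead-letter destination. Returning both lists here simply keeps the sketch runnable without an AWS account.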

Clearly, I could see the benefits of combining AWS Lambda with other AWS services, and the reasons behind that choice. Company CTO Suyash Katyayani said as much:

We could see the benefits of using serverless technologies.

Low cost of experimentation and agility in development.
Pay per use: there is no need to commit to particular infrastructure specifications.
Highly scalable and available.
Completely managed: we could focus on building our core product.

This not only saved the developers additional effort but also proved to be a low-cost solution for the startup. I am eager to see similar stories emerge from other startups and, soon, from bigger names. In my opinion, such solutions would help startups scale their business at a much faster pace without worrying about infrastructure-related issues.

Hope you guys and girls enjoyed reading this as much as I enjoyed writing it. If you liked it, feel free to share this tutorial. Until next time, be curious and have fun.
