Cutting Step-Functions Costs on Enterprise-Scale Workflows

Taavi Rehemägi

August 6th, 2020

AWS Step Functions is a great service for orchestrating multi-step workflows with complex logic. It’s fast to implement, relatively easy to use and just works. The problem is its price.

For relatively low-scale projects, it’s a feasible solution. But for large-scale, enterprise-grade orchestration with hundreds of millions of processes, each with dozens of steps, it can be cost prohibitive.

Why Step Functions is expensive

Behind the scenes, AWS Step Functions runs synchronously with our resources. This architecture triggers a double-billing issue, which is one side of the Serverless trilemma.

The recently announced Express Workflows swapped the $25-per-million state-transition charge of Standard Workflows for a $1-per-million request charge and created a new billing dimension: duration. And guess what? Duration is priced just like AWS Lambda: per memory-second, rounded up to the nearest 100 milliseconds.
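To make the trade-off concrete, here is a rough back-of-the-envelope calculator. It is only a sketch: the two per-million prices are the figures quoted above, the $1 fee is treated as a per-execution charge, the GB-second rate is an assumption modelled on Lambda's published rate, and the volumes are hypothetical. Check the current AWS pricing page before relying on any number it produces.

```python
import math

# Prices in USD. The first two are the figures quoted in the text; the
# GB-second rate is an ASSUMPTION modelled on Lambda's published rate.
STANDARD_PER_TRANSITION = 25 / 1_000_000
EXPRESS_PER_REQUEST = 1 / 1_000_000
EXPRESS_PER_GB_SECOND = 0.00001667


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard Workflows: pay for every state transition."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def billed_seconds(duration_ms: float) -> float:
    """Round a duration up to the next 100 ms increment, in seconds."""
    return math.ceil(duration_ms / 100) * 100 / 1000


def express_cost(executions: int, duration_ms: float, memory_mb: int) -> float:
    """Express Workflows: a request fee plus memory x billed duration."""
    request_fee = executions * EXPRESS_PER_REQUEST
    gb_seconds = executions * billed_seconds(duration_ms) * memory_mb / 1024
    return request_fee + gb_seconds * EXPRESS_PER_GB_SECOND


if __name__ == "__main__":
    runs = 100_000_000  # hypothetical monthly volume
    print(f"Standard: ${standard_cost(runs, 24):,.2f}")
    print(f"Express:  ${express_cost(runs, 3_000, 64):,.2f}")
```

Note that the Express duration keeps accruing while the workflow waits for its tasks, which is where the double billing shows up: the tasks themselves are billed separately for the same wall-clock time.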

This is like having a Lambda function deployed with a Finite-state Machine implementation, which triggers other resources and keeps running in idle state waiting for their responses.

AWS recommends using Express workflows if tasks have short execution times. Standard workflows probably contain an overpriced markup to account for the risk of long-running ones.

This is suboptimal, but it’s understandable why AWS went that route. Without access to the underlying code of tasks, it’s virtually impossible to provide the full Step Functions feature set without synchronous execution and double billing.

Affordable orchestration solutions

For large-scale and enterprise-level workflows that cannot afford the wasted resources of the Step Functions model, there are at least a couple of alternatives. One will certainly be able to figure out a dozen more, but the two we cover do the job of illustrating our point while staying 100% serverless, which is our goal.

I should note upfront that either of the two will probably require more effort to implement than Step Functions. This additional effort may be small or large, depending on your workflow requirements.

Real-world code examples

We are planning on open sourcing code examples illustrating the architectures below, along with CloudFormation and CDK templates for easy deployment in your own AWS accounts.

In case this is something you would like to have, please subscribe here to receive a heads-up once it’s ready.

Orchestration with EventBridge

EventBridge is a serverless event bus that routes events from sources to targets based on certain rules. Sounds a bit like Tasks and Choices on Step Functions, right?

With the Schema Registry feature, it became even easier to configure EventBridge to work similarly to a workflow orchestration mechanism. We can organize event routing schemas in logical groups, resembling how Workflows are organized in Step Functions.

Any part of your application can send an event to an Event Bus, where it is matched against a set of rules to determine which targets should receive it. The schemas kept in the registry describe the structure of those events and are defined in JSON following the OpenAPI standard.

Event Patterns allow us to determine how events are processed depending on the fields and values present on them. Content-based filtering provides even more granularity.
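As a rough illustration, an event pattern for routing a completed workflow step might look like the dictionary below. The source, detail-type and field names are hypothetical. The small `matches` function is not EventBridge's matching engine; it is a simplified local stand-in that only handles exact-value matching, included to show the idea that each pattern field lists the values it accepts.

```python
# Hypothetical pattern: route successful "validate-order" completions.
RULE_PATTERN = {
    "source": ["myapp.orders"],
    "detail-type": ["StepCompleted"],
    "detail": {"step": ["validate-order"], "status": ["SUCCEEDED"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified exact-value matcher (an illustration, not the real engine).

    Every key in the pattern must exist in the event. A list means
    "any of these values"; a nested dict is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    event = {
        "source": "myapp.orders",
        "detail-type": "StepCompleted",
        "detail": {"workflow_id": "wf-1", "step": "validate-order",
                   "status": "SUCCEEDED"},
    }
    print(matches(RULE_PATTERN, event))
```

Fields that the pattern does not mention (such as `workflow_id` here) are ignored, so one rule per workflow step and outcome is usually enough to express a simple Choice-like branch.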

An Event Bus, however, limits itself to receiving events and routing them to the appropriate target(s). It won’t keep track of what targets are working on or react to their responses automatically, as Step Functions does.

Another potential downside is that EventBridge is a relatively new service. Developer knowledge and tooling around it are still not as mature as they are for Step Functions. Dashbird, for example, just announced support for Step Functions in its architectural insights engine. While more advanced tools are not yet available for EventBridge, CloudWatch already supports it for basic metric monitoring.

The architecture we are discussing could involve, for example, one Event Bus and multiple Lambda functions. Each function is responsible for one step of the process. At the end of each step, the respective Lambda function is responsible for sending another event to the same Event Bus providing extra information about the latest process, so that EventBridge can parse and route to the next step in the process.
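A minimal sketch of one such step function follows. The bus name, source, detail-type and payload fields are all hypothetical, and the "work" is a placeholder. The only AWS call used is the `put_events` operation of the boto3 EventBridge client; the client is passed in as an optional argument so the function can be exercised locally with a stub.

```python
import json

EVENT_BUS_NAME = "workflow-bus"  # hypothetical bus name
SOURCE = "myapp.orders"          # hypothetical event source


def build_step_completed_entry(workflow_id: str, step: str, result: dict) -> dict:
    """Build one put_events entry announcing that a step has finished."""
    return {
        "EventBusName": EVENT_BUS_NAME,
        "Source": SOURCE,
        "DetailType": "StepCompleted",
        "Detail": json.dumps({
            "workflow_id": workflow_id,
            "step": step,
            "status": "SUCCEEDED",
            "result": result,
        }),
    }


def handler(event, context, events_client=None):
    """Run one workflow step, then hand control back to the bus."""
    if events_client is None:
        import boto3  # only needed when running inside AWS
        events_client = boto3.client("events")

    detail = event["detail"]
    result = {"valid": True}  # placeholder for the real work of this step

    entry = build_step_completed_entry(
        detail["workflow_id"], "validate-order", result)
    response = events_client.put_events(Entries=[entry])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError("EventBridge rejected the step-completed event")
    return result
```

The function finishes as soon as the event is accepted, so nothing sits idle waiting for the next step: a rule on the bus decides what runs next.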

Now that Lambda offers the Destinations feature, with a simple configuration parameter it will deliver function responses to an Event Bus automatically for us. The destination event also differentiates successful executions from failed ones, making it easy to respond to failures accordingly within EventBridge, similarly to Step Functions error handling.
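The configuration could be sketched as below, using the `put_function_event_invoke_config` operation of the boto3 Lambda client. The function name and bus ARN are placeholders. Keep in mind that Destinations only apply to asynchronous invocations. The detail-type strings in the two patterns reflect my understanding of what Lambda sends to EventBridge and should be verified against a real event before being used in a rule.

```python
def destination_invoke_config(function_name: str, event_bus_arn: str) -> dict:
    """Keyword arguments for put_function_event_invoke_config."""
    return {
        "FunctionName": function_name,
        "DestinationConfig": {
            "OnSuccess": {"Destination": event_bus_arn},
            "OnFailure": {"Destination": event_bus_arn},
        },
    }


def configure_destinations(lambda_client, function_name: str, event_bus_arn: str):
    """Apply the configuration with a boto3 Lambda client."""
    kwargs = destination_invoke_config(function_name, event_bus_arn)
    return lambda_client.put_function_event_invoke_config(**kwargs)


# ASSUMPTION: detail-type values as I understand them; confirm by
# inspecting an actual destination event delivered to your bus.
SUCCESS_PATTERN = {
    "source": ["lambda"],
    "detail-type": ["Lambda Function Invocation Result - Success"],
}
FAILURE_PATTERN = {
    "source": ["lambda"],
    "detail-type": ["Lambda Function Invocation Result - Failure"],
}
```

With two rules built on these patterns, the success path can route to the next step while the failure path routes to a retry or compensation function, which is roughly what a Catch clause does in Step Functions.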

DynamoDB Streams and Lambda

The DynamoDB Streams integration with Lambda can also be used to orchestrate complex workflows. When an item is written to a DynamoDB table, the table’s stream emits a change record that automatically triggers a particular Lambda function.

We cannot customize which Lambda is triggered for each item, so this architecture requires a central orchestrator as a Lambda function. It could be deployed with an open-source Finite-state Machine system, which would work similarly to Step Functions and deliver many of its features.

The main advantage here is that event routing is done in your preferred programming language, which certainly offers many more capabilities than the JSON used to configure EventBridge.

One disadvantage is that this central Lambda must be invoked prior to every single step, adding latency and costs. Nevertheless, as the central function receives events from DynamoDB, it can parse the workflow rules, determine targets and invoke them asynchronously to avoid the double billing issue.
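A bare-bones version of such a central function might look like this. The transition table, attribute names (`workflow_id`, `state`) and function names are hypothetical, and a plain dictionary stands in for a real finite-state machine library. It assumes the stream is configured to include the new item image. Targets are invoked with `InvocationType="Event"`, which is the asynchronous mode of the boto3 Lambda `invoke` call.

```python
import json

# Hypothetical workflow rules: current state -> function to run next.
TRANSITIONS = {
    "ORDER_RECEIVED": "validate-order-fn",
    "ORDER_VALIDATED": "charge-payment-fn",
    "PAYMENT_CHARGED": "ship-order-fn",
}


def next_targets(stream_event: dict) -> list:
    """Turn DynamoDB stream records into (function name, payload) pairs."""
    planned = []
    for record in stream_event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions do not advance the workflow
        # Requires a stream view type that includes the new image.
        image = record["dynamodb"].get("NewImage", {})
        state = image.get("state", {}).get("S")
        target = TRANSITIONS.get(state)
        if target is None:
            continue  # terminal or unknown state: nothing to do
        payload = {"workflow_id": image["workflow_id"]["S"], "state": state}
        planned.append((target, payload))
    return planned


def handler(event, context, lambda_client=None):
    """Central orchestrator: fire the next step and return immediately."""
    if lambda_client is None:
        import boto3  # only needed when running inside AWS
        lambda_client = boto3.client("lambda")

    planned = next_targets(event)
    for function_name, payload in planned:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps(payload).encode("utf-8"),
        )
    return {"invoked": len(planned)}
```

Because the orchestrator never waits for a target to finish, its billed duration stays at a few milliseconds per step regardless of how long the step itself runs.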

The target Lambdas are then responsible for updating the task items in the same DynamoDB table, which produces another stream record and prompts the central Lambda to move on to the next step of the workflow.

Unfortunately, Lambda Destinations does not support delivering responses to DynamoDB (yet!), so we need to manually embed this logic into our Lambda functions. As a result, the integration part of this architecture requires extra care in making sure events and responses will flow as expected throughout the entire process cycle.
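The manual write-back described above could be sketched as follows. The table name, key schema, state names and the `validate_order` placeholder are all hypothetical. The update goes through the `update_item` operation of the low-level boto3 DynamoDB client, with expression attribute names used so that the attribute names cannot collide with DynamoDB reserved words.

```python
import json

TABLE_NAME = "workflow-tasks"  # hypothetical table name


def build_update(workflow_id: str, new_state: str, result: dict) -> dict:
    """Keyword arguments for update_item that record a step's outcome."""
    return {
        "TableName": TABLE_NAME,
        "Key": {"workflow_id": {"S": workflow_id}},
        "UpdateExpression": "SET #state = :state, #result = :result",
        "ExpressionAttributeNames": {"#state": "state", "#result": "result"},
        "ExpressionAttributeValues": {
            ":state": {"S": new_state},
            ":result": {"S": json.dumps(result)},
        },
    }


def validate_order(payload: dict) -> dict:
    """Placeholder for the real work done by this step."""
    if payload.get("reject"):
        raise ValueError("order rejected")
    return {"valid": True}


def handler(event, context, dynamodb_client=None):
    """Target function: do the work, then report the outcome to the table."""
    if dynamodb_client is None:
        import boto3  # only needed when running inside AWS
        dynamodb_client = boto3.client("dynamodb")

    try:
        result = validate_order(event)
        new_state = "ORDER_VALIDATED"
    except Exception as exc:
        # Record failures too, so the orchestrator can react to them.
        result = {"error": str(exc)}
        new_state = "FAILED"

    dynamodb_client.update_item(
        **build_update(event["workflow_id"], new_state, result))
    return result
```

Writing the failure state back to the table, rather than letting the exception escape, is one way to make sure a failed step still produces a stream record for the central function to act on.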

Wrapping up

We’ve covered, at a high level, two possible alternatives to Step Functions for large-scale, enterprise-level workflows. Both are 100% serverless and take advantage of event-driven, asynchronous communication to improve resource utilization and reduce waste and overall costs in comparison to Step Functions. We also propose using some of the latest features offered by AWS on other services, such as EventBridge Schema Registry and Lambda Destinations.

As we mentioned earlier, in case you’d like to receive code examples as well as CloudFormation and CDK templates for implementing these architectural ideas, sign up to our newsletter.
