The Ultimate Guide to AWS Step Functions

Serverless computing has become a must nowadays, and some of you may already know a thing or two about Amazon Web Services offerings such as Lambda and Step Functions. However, if this is the first time you're hearing about them – fantastic!

In this article, we’ll discuss AWS Step Functions, what they are used for, how to use them, and the advantages or disadvantages that they bring.

AWS Step Functions 101

Before we can jump into Step Functions, you need to familiarize yourself with the basic structure behind them. Step Functions is an AWS-managed service built on the finite-state machine (FSM) model.

FSMs are studied in theoretical computer science and are a way of modeling workflows in software systems. With Step Functions, FSMs can be used in your serverless architecture to coordinate different AWS services to form processes that solve your use-cases in a well-defined way.

Why is that important?

By coordinating multiple AWS services into serverless workflows, you can quickly build and update your apps. With Step Functions, you can both design and run workflows that bring together various services, including Amazon ECS and AWS Lambda, into feature-rich applications.

For example, you can call a Lambda function on each step, but you can also wait for human interaction or external API input. This makes Step Functions a mighty service. Best of all, Step Functions itself is serverless too! This means on-demand pricing and minimal operational overhead.

FSM Model Explained

The FSM model does a simple job – it uses given states and transitions to complete the tasks at hand. FSMs are also known as a behavioral model. An FSM is an abstract machine (system) that is in exactly one state at any given time and can switch between a finite number of states. Because every state and transition is declared up front, all possible paths through the machine are explicit, which makes costly mistakes such as unintended endless loops much easier to spot.

The two keywords you need to remember are States and Transitions.

Now, why are these words so important?

This machine is defined solely by its states and the relationships, called transitions, between them. A very straightforward example is a door; you can see its state diagram below in Figure 1. The door can be either open or closed, and these are the only two possible states. A transition is the switch from one state to the other, and it only happens in response to an input. Closing the open door is an input that moves it to the closed state; opening the closed door is the input that moves it back.

Figure 1: Step Functions state, input, and transition

You can apply the same idea to other examples, like your daily routine. Let’s take “work, home, bed” as states. From work (state), you take a bus (input) to get home (state), and when you arrive home, you go to bed (another input leading to another state). The next morning, when you wake up and get out of bed, you transition from the last state you were in back to a previous one, and taking the bus from home to work is one more transition.

There are other, more complex examples with many more states, inputs, and transitions between them, and the more states you add, the more complex the FSM model becomes.

The conclusion is simple – FSMs are a method of modeling your system by defining the states and transitions between those states.
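To make the idea concrete, here is a minimal sketch of the door example as a transition table in TypeScript. The state names, input names, and helper functions are illustrative assumptions and have nothing to do with the Step Functions API itself.

```typescript
// States and inputs for the door example (names are illustrative).
type DoorState = "open" | "closed";
type DoorInput = "openDoor" | "closeDoor";

// Transition table: for each state, which input leads to which next state.
const transitions: Record<DoorState, Partial<Record<DoorInput, DoorState>>> = {
  open: { closeDoor: "closed" },
  closed: { openDoor: "open" },
};

// Apply one input; an input with no transition defined leaves the state unchanged.
function step(state: DoorState, input: DoorInput): DoorState {
  return transitions[state][input] ?? state;
}

// Feed a sequence of inputs through the machine, starting from `start`.
function run(start: DoorState, inputs: DoorInput[]): DoorState {
  return inputs.reduce(step, start);
}
```

Calling run("closed", ["openDoor", "closeDoor"]) walks the machine from closed to open and back to closed. Everything the machine can do is visible in the transition table, which is the main appeal of the model.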

What are Step Functions?

Step Functions are an FSM implementation offered as a serverless service by AWS. Step Functions are made of state machines (workflows) and tasks. Tasks are individual states or single units of work.

In Figure 2 below, you’ll see an example of an Amazon state machine in which the rectangles represent the states. The result of one state leads to the next, which then leads to a choice that depends on the given input (email or SMS). In this example, the green states were executed successfully, while the white state wasn’t executed.

Figure 2: Amazon’s state machine

This entire graph representing the state machine is also known as a Workflow, and there are two types of workflows available.

Types Of Workflows

Workflows are divided into two types: Standard and Express. Unlike Standard, the Express workflow is a relatively new option that has been available since late 2019. The table below shows the differences between the two workflow types.

Workflows                       Standard                         Express
Max duration                    1 year                           5 minutes
Execution rate (per second)     2,000+                           100,000+
State transitions (per second)  4,000+                           Unlimited
Price                           Per state transition             Number of executions, duration & memory
Execution                       Exactly once                     At least once
Execution history               API, AWS Console or CloudWatch   CloudWatch

State machines orchestrate the work of AWS services, like Lambda functions. When one function ends, it triggers another function to begin. Although the maximum duration of an Express workflow is far shorter, it scales much further. Express pricing also has more components: you pay for the number of executions as well as the duration and memory those executions use. With Standard workflows, you pay only for each state transition that occurs.

In short, Standard workflows suit long-running processes that have to be durable and auditable. In contrast, Express workflows suit high-frequency, high-volume event processing.

Workflow Execution

Now that you know the basics, the next step is execution. A workflow starts when an execution is created through the Step Functions API. To trigger that, you can use CloudWatch Events as a time trigger or use API Gateway as a proxy.

State Types

It’s essential to remember that States aren’t the same thing as Tasks; a Task is just one of the State types. There are several State types, and each has a role to play in the overall workflow:

  • Pass: Passes its input to its output.
  • Task: Takes input, performs work, and produces output.
  • Choice: Adds branching logic based on the input.
  • Wait: Adds a delay to the state machine execution.
  • Succeed: An expected dead-end that stops the execution successfully.
  • Fail: An expected dead-end that stops the execution with a failure.
  • Parallel: Runs parallel branches of execution, meaning multiple states can start at once.
  • Map: Runs a set of steps for every item in the input (dynamic mapping).
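As a rough illustration of how these state types fit together in a definition, the sketch below models a heavily simplified state machine in TypeScript and checks that every reference points at a state that exists. The StateDef and MachineDef shapes and the findDanglingReferences helper are assumptions made for this example; they cover only a small subset of what a real definition can contain.

```typescript
// Simplified model of a state machine definition (a small subset, for illustration only).
type StateType =
  | "Pass" | "Task" | "Choice" | "Wait"
  | "Succeed" | "Fail" | "Parallel" | "Map";

interface StateDef {
  Type: StateType;
  Next?: string;  // name of the state that follows
  End?: boolean;  // true if the execution stops after this state
}

interface MachineDef {
  StartAt: string;
  States: Record<string, StateDef>;
}

// Return a list of problems: an unknown start state or Next targets that don't exist.
function findDanglingReferences(machine: MachineDef): string[] {
  const problems: string[] = [];
  if (!(machine.StartAt in machine.States)) {
    problems.push(`StartAt refers to unknown state "${machine.StartAt}"`);
  }
  for (const stateName of Object.keys(machine.States)) {
    const next = machine.States[stateName].Next;
    if (next !== undefined && !(next in machine.States)) {
      problems.push(`${stateName}.Next refers to unknown state "${next}"`);
    }
  }
  return problems;
}

// A three-state workflow: Pass, then Wait, then Succeed.
const exampleMachine: MachineDef = {
  StartAt: "Greet",
  States: {
    Greet: { Type: "Pass", Next: "Pause" },
    Pause: { Type: "Wait", Next: "Done" },
    Done: { Type: "Succeed" },
  },
};
```

A check like this catches a common mistake early: a state that is named in Next or StartAt but was never defined at the top level of States.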

Tasks

Tasks are the main States, the ones in which all the work is done. Tasks can call Activities (remote executions), which lets you:

  • Run work on ECS, EC2 machines, or mobile devices.
  • Send SMS notifications and wait for input.

Another useful property of Step Functions Tasks is that they allow you to reach outside of your AWS environment.

Error Handling

Error handling consists of retries and catches. An example of how Step Functions handle errors is shown in Figure 3 below:

Figure 3: Step Function visual workflow

In this example, you can see a Parallel state with several branches. It illustrates how the entire execution will fail if even one branch encounters an error that isn’t handled.

The Amazon States Language lets you catch those errors and define retries. This is extremely important for business-critical operations.

With the Amazon States Language, you can add a comment, define the state at which the machine starts, and define the states and tasks themselves. For error handling, you can specify retries based on the error name, along with the retry interval, the maximum number of retry attempts, and the backoff rate, as in the example below:

{
  "Comment": "A Hello World example",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...",
      "Retry": [
        {
          "ErrorEquals": ["HandledError"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}

If you catch errors instead, you can route a failed task to a fallback state, and the execution history will show which tasks failed and why some states weren’t executed. Here is an example of how to catch an error:

{
  "Comment": "A Hello World example",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...",
      "Catch": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "Next": "fallback"
        }
      ],
      "End": true
    },
    "fallback": {
      "Type": "Pass",
      "Result": "Hello, AWS Step Functions!",
      "End": true
    }
  }
}

When retries are configured, the first retry attempt starts after the interval you’ve set (IntervalSeconds), and the wait before each further attempt is multiplied by the backoff rate (BackoffRate).
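The arithmetic behind that schedule can be sketched in a few lines of TypeScript. The field names mirror the Retry block from the first example; the retryDelays function is a hypothetical helper written for this article, and it ignores anything beyond the three fields shown.

```typescript
// Field names mirror the Retry block shown earlier; the function itself is illustrative.
interface RetryPolicy {
  IntervalSeconds: number; // wait before the first retry
  MaxAttempts: number;     // maximum number of retries
  BackoffRate: number;     // multiplier applied to the wait after each retry
}

// Wait (in seconds) before each retry: IntervalSeconds * BackoffRate^n for n = 0, 1, 2, ...
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  let delay = policy.IntervalSeconds;
  for (let attempt = 0; attempt < policy.MaxAttempts; attempt++) {
    delays.push(delay);
    delay *= policy.BackoffRate;
  }
  return delays;
}

// The HelloWorld retry settings from the first example: waits of 1 second, then 2 seconds.
const helloWorldDelays = retryDelays({ IntervalSeconds: 1, MaxAttempts: 2, BackoffRate: 2.0 });
```

With a larger MaxAttempts the waits grow quickly, so it is worth adding them up before choosing values for a task that users are waiting on.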

Error handling is critical because if one branch of a Parallel state fails, the entire execution will fail, even when all the other branches execute successfully. However, even if the entire execution fails, the changes already made by the steps that did complete will remain intact.

Error handling also leaves a trail of everything that happened in the execution log, which gives you better insight into why an error occurred so you can fix the core problem.

Step Function Demonstration

Let’s look into some Step Functions examples. These will be built with the AWS CDK.

Choice Step Function Example

You’ll have to pass a desired amount into your state machine. For example, if you choose the number 10 and a customer buys more than ten items from you, the execution follows one branch of the choice and succeeds. If the customer buys ten items or fewer, the execution also succeeds, but through the other branch.

The code for this example looks like this:

// Runs inside a CDK Stack constructor. Assumes the module is imported as:
//   import * as stepFunctions from "@aws-cdk/aws-stepfunctions";
// (with CDK v2, the path is "aws-cdk-lib/aws-stepfunctions").

// Terminal state shared by both branches.
const success = new stepFunctions.Succeed(this, "Success!");

const moreTask = new stepFunctions.Pass(this, "MORE");
moreTask.next(success);

const lessTask = new stepFunctions.Pass(this, "LESS");
lessTask.next(success);

// Branch on a comparison between two fields of the execution input.
const desiredAmountChoice = new stepFunctions.Choice(
  this,
  "More than desired amount?"
);
desiredAmountChoice.when(
  stepFunctions.Condition.numberGreaterThanJsonPath(
    "$.itemAmount",
    "$.desiredAmount"
  ),
  moreTask
);
desiredAmountChoice.when(
  stepFunctions.Condition.numberLessThanEqualsJsonPath(
    "$.itemAmount",
    "$.desiredAmount"
  ),
  lessTask
);

new stepFunctions.StateMachine(this, "StateMachine", {
  definition: desiredAmountChoice,
});

The desiredAmountChoice state compares the itemAmount with the desiredAmount input and branches accordingly. The input will be supplied when a new execution of the state machine is created.

The desiredAmountChoice leads to two different states, moreTask and lessTask. In this example, both are simple pass-type states, but you can swap them for task-type states that, for example, execute a Lambda function.

In Figure 4, you see how the state machine performs with the following input:

{
    "itemAmount": 23,
    "desiredAmount": 10
}
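For readers who want to trace that input by hand, the branching can be mirrored in plain TypeScript. This is only a sketch of the comparison the two Choice rules perform; the OrderInput type and chooseBranch function are illustrative and are not part of the CDK or of Step Functions.

```typescript
// Shape of the execution input used in this example.
interface OrderInput {
  itemAmount: number;
  desiredAmount: number;
}

// Mirrors the two Choice rules: greater-than goes to MORE, less-than-or-equal goes to LESS.
function chooseBranch(input: OrderInput): "MORE" | "LESS" {
  return input.itemAmount > input.desiredAmount ? "MORE" : "LESS";
}

// With the input above (23 items against a desired amount of 10), the MORE branch is taken.
const chosenBranch = chooseBranch({ itemAmount: 23, desiredAmount: 10 });
```

Note that an input where both numbers are equal takes the LESS branch, because the second rule uses a less-than-or-equals comparison.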

Retry & Catch Step Function Example

If your Lambda function throws an error, the task it belongs to will fail. In the next example, we will try to access an event attribute that doesn’t exist, so the Lambda function always crashes. After some retries, we will fall back to a pass-type state as a placeholder for our error handling.

Retry and catch are essential in Step Functions because they let an execution recover and complete successfully even when one of its tasks fails.

The code for the retry and catch example looks like this:

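// Runs inside a CDK Stack constructor. Assumes these imports:
//   import * as lambda from "@aws-cdk/aws-lambda";
//   import * as stepFunctions from "@aws-cdk/aws-stepfunctions";
//   import * as stepFunctionsTasks from "@aws-cdk/aws-stepfunctions-tasks";
const brokenTask = new stepFunctionsTasks.LambdaInvoke(this, "BrokenTask", {
  lambdaFunction: new lambda.Function(this, "BrokenFunction", {
    // Node.js 12 has since been retired by Lambda; use a current NODEJS_*_X runtime.
    runtime: lambda.Runtime.NODEJS_12_X,
    handler: "index.handler",
    // event.x is undefined, so reading event.x.y always throws.
    code: new lambda.InlineCode(`
            exports.handler = async (event) => {
              const error = event.x.y;
              return {Payload: "result text"};
            }
          `),
  }),
  outputPath: "$.Payload",
});

// Retry the failing task up to five times before giving up.
brokenTask.addRetry({ maxAttempts: 5 });

const handleFail = new stepFunctions.Pass(this, "HandleFail");

const success = new stepFunctions.Succeed(this, "Success!");

handleFail.next(success);

// Once the retries are exhausted, route the error to the fallback state.
brokenTask.addCatch(handleFail);

brokenTask.next(success);

new stepFunctions.StateMachine(this, "StateMachine", {
  definition: brokenTask,
});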

The brokenTask tries to invoke the BrokenFunction, but it never succeeds. It retries up to the maxAttempts value of 5 and then executes the handleFail state added with addCatch.

In Figure 5, you can see the state machine’s attempts to execute the brokenTask. You can also see that the interval between failed steps gets longer, because the default backoffRate for retries is a multiplier of 2.

Figure 5: Failed execution steps

In Figure 6, you see how the state machine performed in the end.

Figure 6: Retry state machine

When to Use Step Functions?

Step Functions Standard workflows are excellent for business-critical workflows and bring numerous business benefits. They provide much better error-handling logic than Lambda functions chained together by hand, and they make those functions relatively easy to orchestrate. On the other hand, Standard workflows are best reserved for business-critical processes, since they are pretty expensive compared to Express workflows. The Standard workflow price is $25 per one million state transitions, whereas Express workflows are billed by the number of executions plus the memory and duration used. If you’d like to learn more about saving money on your AWS Step Functions, check out our article on how to cut costs on Step Functions in enterprise-scale workflows.
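Because Standard workflows are billed per state transition (see the comparison table above), a rough cost estimate is simple arithmetic. The sketch below assumes the $25 figure quoted in this article applies per million state transitions and ignores any free tier; the function and constant names are made up for this example, so check current AWS pricing for your region before relying on the numbers.

```typescript
// Assumed price per million state transitions, taken from the figure quoted in this article.
const USD_PER_MILLION_TRANSITIONS = 25;

// Estimated Standard workflow cost for a number of executions,
// each making `transitionsPerExecution` state transitions.
function standardWorkflowCostUsd(
  executions: number,
  transitionsPerExecution: number,
  usdPerMillion: number = USD_PER_MILLION_TRANSITIONS
): number {
  const totalTransitions = executions * transitionsPerExecution;
  return (totalTransitions / 1000000) * usdPerMillion;
}

// 100,000 executions of a 10-transition workflow add up to 1,000,000 transitions.
const estimatedCostUsd = standardWorkflowCostUsd(100000, 10);
```

The takeaway is that the number of states per execution matters as much as the number of executions, so trimming unnecessary Pass states is a direct cost saving.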

Step Functions also handle complex workflows with a tremendous number of states. They are excellent for orchestrating microservices, since you won’t need to build connections between the services yourself, and each service can be written in a different language.

Step Functions are also beneficial for long-running or delayed workflows. A Standard workflow can run for up to a year, and you can pause it along the way with Wait states.

Step Function Best Practices

One of the best practices for Step Functions concerns large payloads. Step Functions limits how much data can be passed between states (256 KB), and a workflow that exceeds the limit will fail. Instead, put large payloads in S3 and pass only a reference to them through the workflow. You can do this by specifying the S3 location with an ARN, as shown in the example code below:

{
  "StartAt": "Invoke Lambda function",
  "States": {
    "Invoke Lambda function": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:...",
        "Payload": {
          "Data": "arn:aws:s3:::MyBucket/data.json"
        }
      },
      "End": true
    }
  }
}

Use Step Function Timeouts

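Using timeouts will help you avoid stuck executions, since there are no default timeouts in Step Functions tasks. Step Functions relies entirely on the activity worker’s response, so if a worker never responds, the execution simply keeps waiting. Setting TimeoutSeconds on a task makes it fail after a known period instead.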
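As a sketch of what that looks like, the snippet below builds a Task state with explicit timeouts as a plain TypeScript object and serializes it to JSON. TimeoutSeconds and HeartbeatSeconds are fields of the Amazon States Language; the values are arbitrary, and the Resource ARN is a placeholder rather than a real function.

```typescript
// A Task state with explicit timeouts, written as a plain object.
// The Resource ARN below is a placeholder, not a real function.
const taskWithTimeout = {
  Type: "Task",
  Resource: "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
  TimeoutSeconds: 300,  // fail the task if it runs longer than 5 minutes
  HeartbeatSeconds: 60, // fail the task if no heartbeat arrives within 60 seconds
  End: true,
};

// Serialize to JSON for use inside the "States" map of a state machine definition.
const taskWithTimeoutJson = JSON.stringify(taskWithTimeout, null, 2);
```

A timed-out task fails with the States.Timeout error, so it can be combined with the Retry and Catch rules described earlier.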

How to Handle Lambda Exceptions?

Lambda can have very short-lived service errors. This is why it’s good to add a Retry rule for Lambda service exceptions, so that your state machine handles them proactively, as shown in this example:

"Retry": [
  {
    "ErrorEquals": [
      "Lambda.ServiceException",
      "Lambda.AWSLambdaException",
      "Lambda.SdkClientException"
    ],
    "IntervalSeconds": 2,
    "MaxAttempts": 6,
    "BackoffRate": 2
  }
]

Integrations & Development Tools

Possible Integrations

About a dozen services are available for integration, and you can call them from Tasks:

  • Submit an AWS Batch job;
  • Use CodeBuild;
  • Get or put items in a DynamoDB table;
  • Run an Amazon ECS task;
  • Integrate with EMR;
  • Run a Fargate task;
  • Integrate with Glue;
  • Invoke a Lambda function;
  • Use SageMaker for classification, inference, and machine learning model training;
  • Publish a message to an SNS topic;
  • Send messages to an SQS queue;
  • Start another Step Functions execution.

Figure 7: Step Functions Integrations

Dev Tools

The AWS CDK has a Step Functions module that allows you to define your workflows directly in your CDK stack, with static type checks and everything.

There is also a Step Functions plugin for the Serverless Framework. It allows you to do everything Step Functions can do, while it helps devs take care of the IAM roles and many other things they would otherwise need to define.

It’s also possible to download Step Functions as a .jar file or a Docker image, so you can run and test workflows on your own machine.

It’s also vital to stay on top of your Step Functions’ performance. This is where serverless monitoring tools like Dashbird come in! Step Functions publishes events and metrics to CloudTrail and CloudWatch, which Dashbird monitors.

Dashbird’s Insights engine detects errors related to state machine definitions or task execution failures in real-time. It notifies you immediately, via Slack or email, when something within your workflows breaks or is about to go wrong. The Insights engine is based on AWS Well-Architected best practices and constantly runs your whole serverless infrastructure’s data against its rules to help you make sure your app is optimized and reliable at any scale.

Figure 8: Dashbird Insights for AWS services

Step Function Advantages & Disadvantages

Although Express workflows are much cheaper than Standard workflows, they don’t come with any visual aid for monitoring your executions, since they push their execution information to CloudWatch Logs. While the logs provide exceptional insights, the lack of a visual view can be challenging, especially with many executions at hand. It can be hard to figure out what’s failing and what’s not.

Final Thoughts

Step Functions are a relatively new AWS product that lets you break your applications down into basic service components and manipulate each of those components independently. That separation is what makes Step Functions so helpful for achieving higher performance rates.
