5 Common Step Function Issues

Step Functions is the serverless finite state machine service from AWS. Together with DynamoDB, Lambda, and API Gateway, it forms the core of serverless AWS services. If you have tasks with multiple steps and you want to ensure they execute in the proper order, Step Functions is your service of choice.

It offers direct integrations with many AWS services, so you don’t need to use Lambda Functions as glue. This can improve the performance of your state machine and lower its costs.

But when Lambda runs your code, debugging is much more straightforward than with a managed service that’s essentially a black box. That’s why I’m writing this article. Here you will find the most common issues when working with Step Functions, especially when you’re just starting with the service.

1. Task Returned a Result with a Size Exceeding the Maximum Number of Characters Service Limit

It’s a rather unwieldy error message, but it means one of the payloads you’re passing between states is over 256 KB. Make sure you don’t exceed this limit, which can quickly happen when you merge the outputs of multiple parallel branches.

Many services at AWS have some stringent limits on the data they can process. This allows AWS engineers to optimize these services and offer on-demand payment for serverless ones. But the downside for you, the user, is that those limits make the services unsuitable for many use-cases. 

In the best-case scenario, you can plan around those limits. So, when implementing a workflow as a state machine, make sure you stay inside the 256 KB limit when passing payloads along.
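To make the limit concrete, here is a minimal Python sketch of a guard you could run at the end of a Lambda task, assuming the payload is JSON-serializable. The names `fits_in_state` and `return_or_reference`, the `payloadRef` key, and the `store` callback (a stand-in for something like an upload to S3) are hypothetical and not part of any AWS API. The size check only approximates how the service counts the encoded payload, so leave yourself some headroom.

```python
import json

# Step Functions payload limit for state input/output: 256 KB.
MAX_PAYLOAD_BYTES = 256 * 1024


def payload_size(payload) -> int:
    """Approximate size of the payload as compact, UTF-8 encoded JSON."""
    text = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def fits_in_state(payload, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """True if the payload should be small enough to pass between states."""
    return payload_size(payload) <= limit


def return_or_reference(payload, store, limit: int = MAX_PAYLOAD_BYTES):
    """Return the payload itself, or a small reference if it is too large.

    `store` is a hypothetical callback that persists the payload somewhere
    (for example an S3 bucket) and returns a key to find it again.
    """
    if fits_in_state(payload, limit):
        return payload
    return {"payloadRef": store(payload)}
```

The next state then receives either the data or a short reference it can resolve itself, so the state machine never has to carry the large object.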

2. State Machine Canceled without Error

There are various reasons why a state machine could get canceled, and sometimes you won’t get an error directly. Check the execution history of your state machine, where you can find all outputs.

Errors are a fickle topic, and sometimes things go so wrong that the whole machine executing your code crashes. The logging still works, but there is no time to respond with an error. Rest assured that AWS knows about this and keeps logging everything it can.

The execution history is usually an excellent place to investigate Step Functions problems. Typical causes of a cancelation are more than 25,000 entries in the execution history, invalid data types in your outputs (e.g., numbers instead of strings), or missing variables in choice states.
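If you export the history instead of clicking through the console, a small filter helps to find the relevant entries. The following Python sketch assumes a list of events in the shape that boto3's `get_execution_history` returns under `events` (an `id`, a `type`, and a `...EventDetails` dictionary). The helper name `find_failures` and the sample events are made up for illustration.

```python
# Event types that usually explain why an execution stopped.
FAILURE_EVENT_TYPES = {
    "ExecutionFailed",
    "ExecutionAborted",
    "ExecutionTimedOut",
    "TaskFailed",
    "TaskTimedOut",
    "LambdaFunctionFailed",
    "LambdaFunctionTimedOut",
}


def find_failures(events):
    """Return id, type, and details for every failure-like history event."""
    failures = []
    for event in events:
        if event.get("type") not in FAILURE_EVENT_TYPES:
            continue
        # Each event carries its payload under a key like "taskFailedEventDetails".
        details = next(
            (value for key, value in event.items() if key.endswith("EventDetails")),
            {},
        )
        failures.append({"id": event["id"], "type": event["type"], "details": details})
    return failures


# Made-up sample data, only to show the expected shape.
sample_events = [
    {"id": 1, "type": "ExecutionStarted", "executionStartedEventDetails": {"input": "{}"}},
    {"id": 2, "type": "TaskFailed", "taskFailedEventDetails": {"error": "States.DataLimitExceeded"}},
]

for failure in find_failures(sample_events):
    print(failure["id"], failure["type"], failure["details"].get("error"))
```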

3. The Choice State’s Condition Path References an Invalid Value

You put an unresolvable path into the Variable field of your choice state. In the simplest case, it was just a typo, but it could also mean an object is incomplete or you’re trying to call functions.

State machine definitions aren’t statically typed, so wrong path definitions can lead to headaches, even if it’s only a typo. Make sure your state outputs always line up with the inputs.

Step Functions state machines are simple systems. They can execute basic logic to branch or parallelize states, but they aren’t computing engines. The paths you’re writing inside state machine definitions aren’t JavaScript; they’re JSONPath expressions, so none of your usual JavaScript methods for objects or arrays are available here. You must calculate the value your path targets inside a Lambda function before checking it in a choice state. It also has to be a boolean, number, string, or timestamp.
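As a rough illustration, this Python sketch shows a Lambda-style handler that precomputes the values a choice state needs, plus a matching choice rule written as a Python dictionary so it can be printed as JSON. The field names `items`, `itemCount`, and `hasItems` and the target state `ProcessItems` are assumptions for this example; the `IsPresent` check guards against the missing-variable case.

```python
import json


def handler(event, context=None):
    """Precompute plain values that a choice state can compare directly."""
    items = event.get("items") or []
    return {
        **event,
        "itemCount": len(items),     # number, usable with NumericGreaterThan
        "hasItems": len(items) > 0,  # boolean, usable with BooleanEquals
    }


# Choice rule that first checks the variable exists, then compares it.
choice_rule = {
    "And": [
        {"Variable": "$.hasItems", "IsPresent": True},
        {"Variable": "$.hasItems", "BooleanEquals": True},
    ],
    "Next": "ProcessItems",  # hypothetical state name
}

print(json.dumps(choice_rule, indent=2))
```

The choice state never has to compute anything itself; it only compares values the function already put into the state output.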

Try to define your state machines with tools like the AWS Step Functions Workflow Studio to minimize problems at definition time. 

4. State Machine Stops After 25,000 Execution History Events

You exceeded the execution history limit of a standard workflow. If you can, switch to an express workflow; if that’s not possible, you have to split up the state machine and continue the work in a new execution.

The Step Functions service logs every event of an execution into an execution history; this is nice for debugging. But this history is limited to 25,000 entries per execution, so the moment your execution would record its 25,001st event, the Step Functions service will shut it down.

You can configure state machines as standard or express workflows. Express workflows come with their own limit on execution time (five minutes), but they aren’t bound by the 25,000-entry execution history. You can use an express workflow to get around this limit if you have many short-lived executions.

If your state machine has long-running steps and needs more than 25,000 history entries, you will have to split it into multiple state machines. This way, each of these state machines can run as a new execution and, in turn, gets a new execution history.
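When the work is a long list of items, one way to split it is to plan batches that each stay below the history limit. This Python sketch assumes you have measured roughly how many history events one item produces (`events_per_item`); the function names, the fixed overhead of 10 events, and the 80 percent safety margin are illustrative guesses, not AWS recommendations.

```python
HISTORY_LIMIT = 25_000  # maximum history entries per standard execution


def max_items_per_execution(
    events_per_item: int,
    fixed_overhead: int = 10,
    safety_percent: int = 80,
) -> int:
    """How many items one execution can process within the event budget."""
    if events_per_item <= 0:
        raise ValueError("events_per_item must be positive")
    budget = HISTORY_LIMIT * safety_percent // 100 - fixed_overhead
    return max(1, budget // events_per_item)


def split_into_batches(items, events_per_item: int):
    """Split items into batches, each meant for its own execution."""
    size = max_items_per_execution(events_per_item)
    return [items[i:i + size] for i in range(0, len(items), size)]


batches = split_into_batches(list(range(10_000)), events_per_item=5)
print(len(batches), [len(batch) for batch in batches])
```

Each batch would then become the input of a separate execution, which starts with an empty history.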

5. Wrong String Format in Parameters

One of your states sends a result in the wrong format, and there is no alternative to select from. You have to use the States.Format() function for string construction to get around this.

Often you can simply select the right piece of data from your results to pass it into the parameters of the next state. And if not, you might at least be able to modify the target state so it accepts the structure you have available. But this might not always be the case.

The States.Format() function is globally available in your state machine definition. With this function, you can concatenate and reformat the data you have so it fits the parameters of the target state. 

Here is a simple example that builds a full name for a parameter:

{
  "Parameters": {
    "foo.$":
      "States.Format('{} {}', $.firstName, $.lastName)"
  }
}

If this isn’t enough, you will have to plug in a Lambda function that handles the more complex reformatting. This function call will cost extra and slow down the execution, but sometimes it’s the last resort.
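Such a function can be tiny. The following Python sketch shows a Lambda-style handler that trims and re-cases the name parts, which goes beyond plain placeholder substitution. The output field `displayName` and the "LASTNAME, Firstname" format are assumptions made up for this example.

```python
def handler(event, context=None):
    """Build a display name from firstName and lastName in the state input."""
    first = str(event.get("firstName", "")).strip()
    last = str(event.get("lastName", "")).strip()

    if last:
        display = f"{last.upper()}, {first.title()}"
    else:
        display = first.title()

    # Pass the original input along and add the reformatted value.
    return {**event, "displayName": display}


print(handler({"firstName": " ada ", "lastName": "lovelace"})["displayName"])
```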

Conclusion

AWS Step Functions is a powerful service that helps coordinate your serverless architecture’s more complex tasks. But as with all serverless services, it comes with severe limitations you have to keep in mind when building.

As with all technology, make sure to modularize your stack. Small state machines are more manageable than a single monolithic one that might exceed limits here and there.

Also, as with all serverless systems built in the AWS cloud, Lambda is here to help. If things don’t fit 100%, you can always throw in a function here and there to smooth things out. But keep in mind that they aren’t free.

How Dashbird can Help

As with the last article about common DynamoDB issues, Dashbird can also give you insights into state machines and their executions. It will automatically monitor all state machines in your AWS account; no extra config is needed.

All your state machines will be evaluated according to the Well-Architected Framework, so you see right away if everything is following best practices.

You can try Dashbird now; no credit card is required. The first one million Lambda invocations are on us! See our product tour here.
