Coordinates components using a workflow. You define a series of steps; Step Functions triggers and tracks each step, retries on errors, and logs the state. Tasks can run anywhere, not just on AWS. The console provides a visual view of the state machine.
Defined in the Amazon States Language. Specification here. Use statelint for validation.
Structure:
- StartAt : string, name of start state
- States
- Comment - optional
- Version - optional
- TimeoutSeconds: integer, time the execution is allowed to run
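Sketch of the top-level structure above, written as a Python dict and serialized to JSON. The state names ("Hello", "Done") and the values are made up for illustration; only the field names come from the notes.

```python
import json

# Hypothetical two-state machine; names and values are illustrative only.
definition = {
    "Comment": "Minimal example",   # optional
    "StartAt": "Hello",             # must name a key in "States"
    "TimeoutSeconds": 60,           # max time the execution may run
    "States": {
        "Hello": {"Type": "Pass", "Result": {"greeting": "hi"}, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}

# The JSON text is what would be supplied as the state machine definition.
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```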
State (and common fields)
- name
- Type (required), and allowed fields in addition to the commons:
- Task - do something
- Resource: a URI, typically an ARN, that identifies the task
- arn:partition:service:region:account:task_type:name
- partition: aws
- service: "states" (for activity), or "lambda"
- region
- account: account id
- task_type: "activity" or "function" (lambda)
- name: resource name, i.e. activity name or function name
- ResultPath, Retry, Catch, TimeoutSeconds, HeartbeatSeconds (optional)
- Choice - go to the first matching branch according to the Choice Rules (no "End" field, so a Choice state cannot end the execution)
- How it works: go through Choices in array order and evaluate each condition (if the variable to match does not exist in the input, the execution fails); on the first match, go to that rule's Next state; if nothing matches, go to Default; if there is no Default, the execution fails
- Choices: array of Choice Rules, which contains:
- a condition: the variable to match, the type of comparison, and the value to compare; comparisons are typed by variable, so a string never matches a number
- "Next"
- Default (optional, recommended) - name of default next state
- Fail / Succeed - stop execution
- Pass - just pass input to output or inject some fixed data
- Result (optional) - used as the output
- ResultPath (optional) - where to put the output in the input
- Wait - delay by amount of time or until date/time
- Seconds - number of seconds to wait
- Timestamp - absolute time stamp to wait until
- SecondsPath - path in the input that holds the number of seconds
- Parallel
- Branches: an array of state machines to execute in parallel; each branch must be self-contained, receives its own copy of the input, and generates its own output
- How it works: run all branches in parallel, wait until all terminate, then go to Next
- Output: an array of each branch's output; can further use ResultPath to insert it into the input
- Error Handling: if any branch fails, the entire Parallel state fails and all branches are stopped; the Parallel state may then handle that error
- Comment
- Next (more than one in Choice): name of next state
- End (alternative to "Next"):
- true - to end successfully
- InputPath, OutputPath (optional): see input/output processing
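Sketch of the Choice evaluation order described above, as a small local simulation. This is not how Step Functions is implemented; it covers only two comparison types (StringEquals, NumericEquals) and simple "$.a.b" paths, and the helper names are invented here.

```python
def lookup(data, path):
    """Resolve a simple '$.a.b' path; raises KeyError if a key is missing."""
    node = data
    for key in path.lstrip("$").strip(".").split("."):
        if key:
            node = node[key]
    return node


def choose_next(choice_state, state_input):
    """Return the name of the next state for a Choice state (simulation only)."""
    for rule in choice_state["Choices"]:          # rules are tried in array order
        value = lookup(state_input, rule["Variable"])  # missing variable -> error
        if "StringEquals" in rule:
            matched = isinstance(value, str) and value == rule["StringEquals"]
        elif "NumericEquals" in rule:
            # Typed comparison: a string such as "3" never matches the number 3.
            matched = (isinstance(value, (int, float))
                       and not isinstance(value, bool)
                       and value == rule["NumericEquals"])
        else:
            raise ValueError("comparison not covered by this sketch")
        if matched:
            return rule["Next"]
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("no choice matched and no Default: execution fails")


# Hypothetical Choice state; state names are illustrative.
route = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.kind", "StringEquals": "refund", "Next": "HandleRefund"},
        {"Variable": "$.order.qty", "NumericEquals": 3, "Next": "BulkOrder"},
    ],
    "Default": "HandleOther",
}
print(choose_next(route, {"kind": "refund", "order": {"qty": 1}}))
```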
- Task can be activity or lambda function
- Activity/Lambda has ARN
- Activity: code that does something and calls Step Functions via the API
- GetActivityTask, SendTaskSuccess, SendTaskFailure, SendTaskHeartbeat
- Lambda: called by Step Functions
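Sketch of the ARN layout listed under Resource, built from its parts. The region, account id, and resource names below are placeholders, and `task_arn` is a helper invented for this note.

```python
def task_arn(service, region, account, task_type, name, partition="aws"):
    """Join the parts into arn:partition:service:region:account:task_type:name."""
    return ":".join(["arn", partition, service, region, account, task_type, name])


# Placeholder values, not real resources.
activity_arn = task_arn("states", "us-east-1", "123456789012", "activity", "my-activity")
function_arn = task_arn("lambda", "us-east-1", "123456789012", "function", "my-function")
print(activity_arn)
print(function_arn)
```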
An activity is a task performed by a worker, which can be hosted anywhere.
- Name
- Arn - assigned
- Worker
- Poll for active work with GetActivityTask() - get task token and JSON input
- do the work
- SendTaskSuccess / SendTaskFailure - with token, output or error
- SendTaskHeartbeat - with token, report that the work is still in progress
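Sketch of one pass of the worker loop above. It assumes `sfn` is a boto3 Step Functions client (e.g. `boto3.client("stepfunctions")`), whose `get_activity_task`, `send_task_heartbeat`, `send_task_success`, and `send_task_failure` methods mirror the API calls listed; `run_worker_once`, `do_work`, and the worker name are invented for illustration.

```python
import json


def run_worker_once(sfn, activity_arn, do_work, worker_name="example-worker"):
    """Poll once; return False if no task was handed out, True otherwise."""
    task = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:                      # the poll returned without any work
        return False
    try:
        # A long-running job would repeat the heartbeat periodically.
        sfn.send_task_heartbeat(taskToken=token)
        result = do_work(json.loads(task["input"]))   # input arrives as JSON text
    except Exception as exc:
        sfn.send_task_failure(taskToken=token,
                              error=type(exc).__name__, cause=str(exc))
    else:
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

A real worker would call this in a loop; passing the client in as a parameter keeps the function easy to exercise with a stub.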
Start from "StartAt" state.
Each state:
- Transition into - can have multiple incoming transitions
- Next
Terminal state:
- Type: Succeed, Type: Fail, End: true, or runtime error
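Sketch of a local check on transitions and terminal states, in the spirit of a linter but far simpler. It only looks at top-level states (it does not descend into Parallel branches), and the function names are invented here.

```python
def transition_targets(state):
    """Collect every state name this state can transition to."""
    targets = []
    if "Next" in state:
        targets.append(state["Next"])
    if "Default" in state:                       # Choice fallback
        targets.append(state["Default"])
    for rule in state.get("Choices", []):        # Choice rules each carry a Next
        targets.append(rule["Next"])
    return targets


def dangling_targets(definition):
    """Names referenced by StartAt/Next/Default that are not defined states."""
    states = definition["States"]
    referenced = [definition["StartAt"]]
    for state in states.values():
        referenced.extend(transition_targets(state))
    return sorted({name for name in referenced if name not in states})


def terminal_states(definition):
    """States that end the execution: Succeed, Fail, or End: true."""
    return sorted(
        name for name, state in definition["States"].items()
        if state["Type"] in ("Succeed", "Fail") or state.get("End") is True
    )
```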
Executions:
State Machines:
Data & Input Output Processing
JSON data, output of previous state becomes input into next state. Can restrict states to working on a subset of the input by using Input & Output Processing.
Initial input - upon StartExecution. Default {}.
Output - returned from last state, as execution's result.
Input Output Processing:
- Uses JSONPath syntax
- InputPath: selects which components from "state input" to pass to the task (worker)
- For any state except "Fail"
- $.path - trim to that path
- ResultPath: how to combine "state input" and task result, then pass to output path, for state types:
- Pass, Task, Parallel
- Default, as if "ResultPath": "$" - replace state data with task result
- "ResultPath":"$.somePath" - put task result under some path, overwrite existing node if exists
- Use with "Catch" to include error in state data
- OutputPath: further filter the result before passing to "state output"
- For any state except "Fail", as with "InputPath"
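Sketch of how InputPath, ResultPath, and OutputPath combine, as a local simulation. It assumes JSON-object inputs and plain "$.a.b" paths only (no full JSONPath), and the helper names are made up for this note.

```python
import copy


def get_path(data, path):
    """Select the node at a simple '$.a.b' path ('$' selects everything)."""
    node = data
    for key in [k for k in path.lstrip("$").split(".") if k]:
        node = node[key]
    return node


def set_path(data, path, value):
    """Return a copy of data with value placed at path ('$' replaces all)."""
    keys = [k for k in path.lstrip("$").split(".") if k]
    if not keys:
        return value
    out = copy.deepcopy(data)
    node = out
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value            # overwrites an existing node
    return out


def run_state(state, state_input, task):
    """InputPath -> task -> ResultPath (against the state input) -> OutputPath."""
    task_input = get_path(state_input, state.get("InputPath", "$"))
    result = task(task_input)
    combined = set_path(state_input, state.get("ResultPath", "$"), result)
    return get_path(combined, state.get("OutputPath", "$"))


# Hypothetical input and task.
order_input = {"order": {"qty": 2}, "user": "ann"}
state = {"InputPath": "$.order", "ResultPath": "$.receipt"}
print(run_state(state, order_input, lambda order: {"total": order["qty"] * 5}))
```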
Possible errors: state machine definition issues, task failures (e.g. an exception from a Lambda function), transient issues (e.g. network).
Errors:
- Names
- Custom
- Built-in (States.XXX)
- ALL
- Timeout - Task timeout or missed heartbeat
- TaskFailed - fail during execution
- Permissions - insufficient privilege
- Lambda errors (catch certain exceptions)
Retrying:
- "Retry" field for Task and Parallel states
- Value: array of "retrier" objects
- Treated as state transition
- "Retrier" represents a certain number of retries for some type of errors
- ErrorEquals (required) - array of strings, match error names
- States.ALL - matches any error; if used, it must appear alone and in the last retrier
- IntervalSeconds (optional) - default 1, seconds to wait before the first retry
- MaxAttempts (optional) - default 3; 0 means never retry (use to exempt specific errors from retrying)
- BackoffRate (optional) - default 2.0, multiplier applied to the retry interval after each attempt
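Sketch of the wait schedule a retrier implies, using the defaults listed above. `retry_delays` is a helper invented for this note; it only does the arithmetic and does not call any AWS API.

```python
def retry_delays(retrier):
    """Seconds waited before each retry: IntervalSeconds * BackoffRate**n."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


# Defaults: three retries with doubling waits.
print(retry_delays({"ErrorEquals": ["States.Timeout"]}))
# MaxAttempts 0: the matched error is never retried.
print(retry_delays({"ErrorEquals": ["CustomError"], "MaxAttempts": 0}))
```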
Fallback uses "Catch" (applied only after retries, if any, are exhausted):
- Value is array of catchers
- Catcher:
- "ErrorEquals" (required)
- "Next" (required)
- "ResultPath" (optional) to include the error with state input, rather than replacing it
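Sketch of catcher matching and of how ResultPath keeps the state input alongside the error, as a local simulation. It assumes the error is passed on as an object with "Error" and "Cause" fields and handles only single-level paths such as "$.error"; `apply_catch` and the state names are invented here.

```python
import copy


def apply_catch(state, state_input, error_name, cause):
    """Return (next_state, output) for the first matching catcher, else None."""
    for catcher in state.get("Catch", []):        # catchers are tried in order
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            error_output = {"Error": error_name, "Cause": cause}
            path = catcher.get("ResultPath", "$")
            if path == "$":                       # default: error replaces input
                return catcher["Next"], error_output
            merged = copy.deepcopy(state_input)
            merged[path[2:]] = error_output       # only single-level "$.key"
            return catcher["Next"], merged
    return None


# Hypothetical Task state with two catchers.
task_with_catch = {
    "Type": "Task",
    "Catch": [
        {"ErrorEquals": ["CustomError"], "Next": "HandleCustom"},
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleAny"},
    ],
}
print(apply_catch(task_with_catch, {"id": 7}, "States.Timeout", "took too long"))
```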
Best practice:
- Handle Lambda.ServiceException and Lambda.SdkClientException in production code
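Sketch of a Task state that follows this practice, written as a Python dict and dumped to JSON. The ARN, the "NotifyFailure" state name, and the retry numbers are placeholders chosen for illustration, not recommended values.

```python
import json

# Hypothetical Lambda task; the ARN and state names are placeholders.
lambda_task = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
    "Retry": [
        {
            # Retry the Lambda service/client errors named in the note above.
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.SdkClientException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 6,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        # Anything still failing after the retries goes to a fallback state,
        # with the error kept next to the original input.
        {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "NotifyFailure"}
    ],
    "End": True,
}

print(json.dumps(lambda_task, indent=2))
```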