AWS Step Functions Overview
AWS Step Functions is a serverless orchestration service that enables you to coordinate multiple AWS
services into serverless workflows so you can build and update applications quickly. Using Step
Functions, you design and run workflows that stitch together services such as AWS Lambda, Amazon ECS,
and AWS Batch into an integrated application.
Key Concepts
1. State Machine: The workflow itself, defined as a JSON document in the Amazon States Language (ASL). It declares the states and how they connect.
2. States: The individual steps of a workflow. Common types include Task, Pass, Choice, Parallel, Map, Wait, Succeed, and Fail.
3. Transitions: How execution moves from one state to the next, controlled by each state's `Next` field (terminal states set `"End": true` instead).
Internals
Execution Model
1. Execution: A single run of a state machine against a given input. Each execution is tracked individually, and its complete event history can be inspected in the console or through the API.
2. Error Handling: States can declare `Retry` and `Catch` fields to retry failed work with backoff and to route unrecoverable errors to dedicated recovery states.
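As a minimal sketch (the ARN and input are placeholders), each call to the `StartExecution` API creates one tracked execution of a state machine:

```json
{
  "stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:MyWorkflow",
  "name": "run-2024-06-01-001",
  "input": "{\"dataset\": \"s3://my-bucket/raw/\"}"
}
```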
Service Integration
1. AWS Lambda: Lambda functions can be invoked directly as task states, which suits short-lived compute steps.
2. Amazon ECS and AWS Batch: Step Functions can manage long-running tasks by delegating them to containers or batch jobs and waiting for completion.
3. Other Services: Various other AWS services can be integrated into workflows to provide comprehensive orchestration.
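For example, a minimal task state that calls Lambda through the optimized `lambda:invoke` integration might look like this (the function ARN is a placeholder):

```json
{
  "InvokeFunction": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
      "Payload.$": "$"
    },
    "End": true
  }
}
```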
Example Workflow
```json
{
  "Comment": "An example of the Amazon States Language using a Pass state.",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Pass",
      "Result": "Hello, World!",
      "End": true
    }
  }
}
```
Advanced Features
1. Express Workflows:
Designed for high-volume, short-duration workflows; selected by creating the state machine with type `EXPRESS`.
Offers a different pricing model and execution limits compared to Standard Workflows (see the configuration sketch after the Monitoring and Logging list below).
2. Activity: Lets workers hosted anywhere (EC2, containers, on premises) poll Step Functions for tasks, perform the work, and report results back.
3. Service Integrations: Optimized integrations let task states call services such as SNS, SQS, DynamoDB, ECS, and Batch directly, including "Run a Job" (`.sync`) patterns that wait for the job to finish.
Monitoring and Logging
1. Amazon CloudWatch: Step Functions emits execution metrics to CloudWatch, and state machines can stream execution history to CloudWatch Logs.
2. AWS X-Ray: Tracing can be enabled to follow a request across the services a workflow invokes.
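As a minimal sketch of how these options fit together (names and ARNs are placeholders), the workflow type, logging, and tracing are all set on the state machine itself, for example in the input to the `CreateStateMachine` API:

```json
{
  "name": "HighVolumeWorkflow",
  "type": "EXPRESS",
  "roleArn": "arn:aws:iam::123456789012:role/StepFunctionsRole",
  "definition": "{ ... ASL document ... }",
  "loggingConfiguration": {
    "level": "ALL",
    "includeExecutionData": true,
    "destinations": [
      {
        "cloudWatchLogsLogGroup": {
          "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:/stepfunctions/high-volume:*"
        }
      }
    ]
  },
  "tracingConfiguration": {
    "enabled": true
  }
}
```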
Best Practices
1. Modular Workflows:
Break complex workflows into smaller, reusable components (see the sketch after this list).
2. Error Handling:
Implement robust error handling using retries and catch blocks.
3. Cost Management:
Be mindful of the costs associated with long-running tasks and high-frequency executions.
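One way to keep workflows modular is to have a parent workflow start a child state machine and wait for its result via the `states:startExecution.sync:2` integration. A sketch, with a placeholder child ARN:

```json
{
  "RunChildWorkflow": {
    "Type": "Task",
    "Resource": "arn:aws:states:::states:startExecution.sync:2",
    "Parameters": {
      "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ChildWorkflow",
      "Input.$": "$"
    },
    "End": true
  }
}
```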
Conclusion
AWS Step Functions provides a powerful way to orchestrate serverless workflows and integrate multiple
AWS services. By understanding its key concepts and internals, you can design robust, scalable, and
maintainable workflows that streamline your application development process.
Give an ASL example for a data processing pipeline for ML training with ingestion, cleaning, and
training steps on AWS Batch from Step Functions.
Sure! Here's an example of an AWS Step Functions state machine using Amazon States Language (ASL)
for a data processing pipeline for ML training. The pipeline includes steps for data ingestion, data
cleaning, and model training using AWS Batch.
```json
{
  "Comment": "A data processing pipeline for ML training with ingestion, cleaning, and training steps using AWS Batch.",
  "StartAt": "DataIngestion",
  "States": {
    "DataIngestion": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "DataIngestionJobDefinition",
        "JobName": "DataIngestionJob",
        "JobQueue": "DataProcessingQueue",
        "ContainerOverrides": {
          "Command": ["python", "data_ingestion.py"]
        }
      },
      "Next": "DataCleaning"
    },
    "DataCleaning": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "DataCleaningJobDefinition",
        "JobName": "DataCleaningJob",
        "JobQueue": "DataProcessingQueue",
        "ContainerOverrides": {
          "Command": ["python", "data_cleaning.py"]
        }
      },
      "Next": "ModelTraining"
    },
    "ModelTraining": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "ModelTrainingJobDefinition",
        "JobName": "ModelTrainingJob",
        "JobQueue": "TrainingQueue",
        "ContainerOverrides": {
          "Command": ["python", "model_training.py"]
        }
      },
      "End": true
    }
  }
}
```
Explanation
Each step is a `Task` state that submits a job to AWS Batch through the `batch:submitJob.sync` integration. The `.sync` suffix makes Step Functions wait for the Batch job to complete before transitioning, so the three jobs run strictly in sequence: ingestion, then cleaning, then training.
Notes
Job Definitions and Queues: Ensure that the job definitions (`DataIngestionJobDefinition`,
`DataCleaningJobDefinition`, `ModelTrainingJobDefinition`) and job queues (`DataProcessingQueue`,
`TrainingQueue`) exist in your AWS Batch setup.
Job Execution: The `Command` parameter in `ContainerOverrides` specifies the command to run
inside the Docker container defined by the job definition.
Error Handling: You may want to add error handling (retry and catch blocks) to handle job failures
gracefully.
This example provides a straightforward pipeline, but depending on your needs, you might add more
steps, parallel processing, or additional error handling and logging.
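For instance, if ingestion produced independent datasets, the cleaning stage could fan out with a `Parallel` state. A sketch with hypothetical branch names (each `Pass` placeholder would be a Batch task in practice):

```json
{
  "CleanInParallel": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "CleanImages",
        "States": { "CleanImages": { "Type": "Pass", "End": true } }
      },
      {
        "StartAt": "CleanLabels",
        "States": { "CleanLabels": { "Type": "Pass", "End": true } }
      }
    ],
    "Next": "ModelTraining"
  }
}
```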
To add built-in support for retry and catch blocks within state definitions, you can modify the
`DataIngestion`, `DataCleaning`, and `ModelTraining` states to include these features. Below is the
updated example of the AWS Step Functions state machine with retry and catch blocks added for each
state.
```json
{
  "Comment": "A data processing pipeline for ML training with ingestion, cleaning, and training steps using AWS Batch with retry and catch blocks.",
  "StartAt": "DataIngestion",
  "States": {
    "DataIngestion": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "DataIngestionJobDefinition",
        "JobName": "DataIngestionJob",
        "JobQueue": "DataProcessingQueue",
        "ContainerOverrides": {
          "Command": ["python", "data_ingestion.py"]
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "HandleFailure"
        }
      ],
      "Next": "DataCleaning"
    },
    "DataCleaning": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "DataCleaningJobDefinition",
        "JobName": "DataCleaningJob",
        "JobQueue": "DataProcessingQueue",
        "ContainerOverrides": {
          "Command": ["python", "data_cleaning.py"]
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "HandleFailure"
        }
      ],
      "Next": "ModelTraining"
    },
    "ModelTraining": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobDefinition": "ModelTrainingJobDefinition",
        "JobName": "ModelTrainingJob",
        "JobQueue": "TrainingQueue",
        "ContainerOverrides": {
          "Command": ["python", "model_training.py"]
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 5,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "HandleFailure"
        }
      ],
      "End": true
    },
    "HandleFailure": {
      "Type": "Fail",
      "Error": "JobFailed",
      "Cause": "The job failed after multiple retry attempts."
    }
  }
}
```
Explanation
1. Retry Block:
`ErrorEquals`: Specifies the errors that should trigger a retry. `States.ALL` retries on any error.
`IntervalSeconds`: The initial wait time before retrying.
`MaxAttempts`: The maximum number of retry attempts.
`BackoffRate`: Multiplier by which the retry interval increases after each attempt. With `IntervalSeconds: 5` and `BackoffRate: 2.0`, the waits before the three retries are 5, 10, and 20 seconds.
2. Catch Block:
`ErrorEquals`: Specifies the errors that should trigger the catch block. `States.ALL` catches
any error.
`ResultPath`: Path in the state’s output to store the error information.
`Next`: The next state to transition to when an error is caught.
3. HandleFailure State:
A `Fail` state that terminates the state machine execution when a job fails after retrying.
By incorporating these retry and catch blocks, the state machine will attempt to retry each task up to
three times with an increasing delay between retries. If all retries fail, the execution transitions to the
`HandleFailure` state, which marks the execution as failed and terminates the workflow.
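If you want a notification instead of a bare failure, the `Catch` blocks could route to a task that publishes to an SNS topic before failing. A sketch with a placeholder topic ARN (the caught error object is available at `$.error` and could be serialized into the message with the `States.JsonToString` intrinsic):

```json
{
  "NotifyFailure": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
      "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
      "Message": "ML pipeline job failed; see the execution history for details."
    },
    "Next": "HandleFailure"
  }
}
```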