Source code for the blog post, Serverless Analytics on AWS: Getting Started with Amazon EMR Serverless and MSK Serverless. The post demonstrates batch and streaming analytics with Apache Spark and Apache Kafka, using the newly released Amazon EMR Serverless and Amazon MSK Serverless.
01_example_console.py: Batch read from Kafka topic, aggregation to console
02_example_csv_s3.py: Batch read from Kafka topic, aggregation to CSV in S3
03_example_kafka.py: Batch read from Kafka topic, aggregation to Kafka topic
04_stream_sales_to_kafka.py: Streaming writes to Kafka topic
05_streaming_kafka.py: Streaming reads from Kafka topic, aggregations over sliding event-time window to console
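For orientation, here is a minimal sketch of the batch Kafka read that the first three scripts share, assuming topicA and the standard MSK IAM client settings; see 01_example_console.py for the real aggregation logic.

import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--bootstrap_servers", required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("01-example-console").getOrCreate()

# batch read of all messages currently in the topic, authenticating with MSK IAM
df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", args.bootstrap_servers)
    .option("subscribe", "topicA")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    # standard client settings for MSK IAM authentication
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM")
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;")
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
    .load()
)

# Kafka delivers binary key/value pairs; cast to strings before aggregating
df.selectExpr("CAST(value AS STRING) AS value").show(10, truncate=False)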
Download the following JAR dependencies and upload them to the jars/ path of your S3 bucket in advance (e.g., with aws s3 cp).
# dependencies
wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.4/aws-msk-iam-auth-1.1.4-all.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.8.1/kafka-clients-2.8.1.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.2.1/spark-sql-kafka-0-10_2.12-3.2.1.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.2.1/spark-token-provider-kafka-0-10_2.12-3.2.1.jar
# optional dependencies for avro-format messages
wget https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.2.1/spark-avro_2.12-3.2.1.jar
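The spark-avro package exposes from_avro for decoding Avro-format values; a minimal sketch, with a hypothetical schema for illustration (adapt it to your actual message format):

from pyspark.sql.avro.functions import from_avro

# hypothetical Avro schema for illustration only; match it to your messages
sales_schema = """{
  "type": "record",
  "name": "Sale",
  "fields": [
    {"name": "product_category", "type": "string"},
    {"name": "total_sales", "type": "double"}
  ]
}"""

# 'df' is a DataFrame read from Kafka, as in the scripts above
decoded = df.select(from_avro(df.value, sales_schema).alias("sale")).select("sale.*")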
Example Kafka topic commands, run from the Kafka client (e.g., the EC2 instance) against the MSK Serverless cluster.
cd kafka_2.12-2.8.1
# *** CHANGE ME ***
export BOOTSTRAP_SERVERS=<your_bootstrap_server>:9098
# create three topics: topicA, topicB, topicC (repeat the command below for each)
bin/kafka-topics.sh --create --topic topicA \
--partitions 6 \
--bootstrap-server $BOOTSTRAP_SERVERS \
--command-config config/client.properties
# list topics to confirm creation
bin/kafka-topics.sh --list \
--bootstrap-server $BOOTSTRAP_SERVERS \
--command-config config/client.properties
# write messages to topic
bin/kafka-console-producer.sh \
--topic topicA \
--bootstrap-server $BOOTSTRAP_SERVERS \
--producer.config config/client.properties
# copy and paste the contents of 'sample_data/sales_messages.txt', then press Ctrl+C to exit the producer
# display the first 10 messages in the topic
bin/kafka-console-consumer.sh \
--topic topicA \
--from-beginning --max-messages 10 \
--property print.value=true \
--property print.offset=true \
--property print.partition=true \
--property print.timestamp=true \
--bootstrap-server $BOOTSTRAP_SERVERS \
--consumer.config config/client.properties
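The --command-config, --producer.config, and --consumer.config flags above all point to config/client.properties. For MSK Serverless with IAM authentication, that file typically contains the standard aws-msk-iam-auth settings (the IAM auth JAR downloaded earlier must also be on the client's CLASSPATH):
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler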
Example commands for submitting the five Spark jobs to EMR Serverless.
aws emr-serverless start-job-run \
--application-id <your_application_id> \
--execution-role-arn <your_execution_role_arn> \
--name 01-example-console \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<your_s3_bucket>/scripts/01_example_console.py",
"entryPointArguments": [
"--bootstrap_servers=<your_bootstrap_server>"
],
"sparkSubmitParameters": "--conf spark.jars=s3://<your_s3_bucket>/jars/*.jar"
}
}'
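start-job-run returns a job run ID; to poll a run's state from Python, a minimal sketch using boto3 (both IDs are placeholders taken from the start-job-run response):

import boto3

# poll the state of a submitted EMR Serverless job run
client = boto3.client("emr-serverless")
response = client.get_job_run(
    applicationId="<your_application_id>",  # placeholder
    jobRunId="<your_job_run_id>",  # placeholder, returned by start-job-run
)
print(response["jobRun"]["state"])  # e.g., PENDING, RUNNING, SUCCESS, or FAILED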
aws emr-serverless start-job-run \
--application-id <your_application_id> \
--execution-role-arn <your_execution_role_arn> \
--name 02-example-csv-s3 \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<your_s3_bucket>/scripts/02_example_csv_s3.py",
"entryPointArguments": [
"--bootstrap_servers=<your_bootstrap_server>",
"--s3_bucket=<your_s3_bucket>"
],
"sparkSubmitParameters": "--conf spark.jars=s3://<your_s3_bucket>/jars/*.jar"
}
}'
aws emr-serverless start-job-run \
--application-id <your_application_id> \
--execution-role-arn <your_execution_role_arn> \
--name 03-example-kafka \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<your_s3_bucket>/scripts/03_example_kafka.py",
"entryPointArguments": [
"--bootstrap_servers=<your_bootstrap_server>"
],
"sparkSubmitParameters": "--conf spark.jars=s3://<your_s3_bucket>/jars/*.jar"
}
}'
# run 04 and 05 simultaneously
aws emr-serverless start-job-run \
--application-id <your_application_id> \
--execution-role-arn <your_execution_role_arn> \
--name 04-stream-sales-to-kafka \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<your_s3_bucket>/scripts/04_stream_sales_to_kafka.py",
"entryPointArguments": [
"--bootstrap_servers=<your_bootstrap_server>",
"--s3_bucket=<your_s3_bucket>"
],
"sparkSubmitParameters": "--conf spark.jars=s3://<your_s3_bucket>/jars/*.jar"
}
}'
aws emr-serverless start-job-run \
--application-id <your_application_id> \
--execution-role-arn <your_execution_role_arn> \
--name 05-streaming-kafka \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<your_s3_bucket>/scripts/05_streaming_kafka.py",
"entryPointArguments": [
"--bootstrap_servers=<your_bootstrap_server>"
],
"sparkSubmitParameters": "--conf spark.jars=s3://<your_s3_bucket>/jars/*.jar"
}
}'
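For reference, the sliding event-time window aggregation in 05_streaming_kafka.py boils down to a windowed groupBy on a streaming DataFrame; a minimal sketch, with illustrative column names and window durations (the Kafka read uses the same IAM options shown earlier):

from pyspark.sql.functions import col, window

# 'df' is a streaming DataFrame read from Kafka (spark.readStream.format("kafka")...),
# already parsed into 'timestamp' and 'product_category' columns (illustrative names)
windowed = (
    df.withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"),
             col("product_category"))
    .count()
)

query = (
    windowed.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", False)
    .start()
)
query.awaitTermination()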
Don't forget to clean up your resources when you're done with the demonstration.
# delete the MSK Serverless cluster, EMR Serverless application, and EC2 client instance
aws kafka delete-cluster --cluster-arn <your_msk_serverless_cluster_arn>
aws emr-serverless delete-application --application-id <your_application_id>
aws ec2 terminate-instances --instance-ids <your_ec2_instance_id>
# all objects (including all object versions and delete markers) in the bucket
# must be deleted before the bucket itself can be deleted.
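# with versioning disabled, a recursive delete is enough; versioned buckets
# also need each object version and delete marker removed first
aws s3 rm s3://<your_s3_bucket> --recursive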
aws s3api delete-bucket --bucket <your_s3_bucket>
The contents of this repository represent my viewpoints and not those of my past or current employers, including Amazon Web Services (AWS). All third-party libraries, modules, plugins, and SDKs are the property of their respective owners.