Project: Customer Order Data ETL Pipeline
Goal: To extract customer order data from a CSV source, transform it, and load it into a
Redshift data warehouse for analysis.
1. Source System (CSV):
● Scenario: We'll simulate a source system by creating a simple CSV file named orders.csv.
● Content:
Code snippet
order_id,customer_id,order_date,product_id,quantity,price
1,101,2023-10-26,A123,2,25.00
2,102,2023-10-27,B456,1,50.00
3,101,2023-10-28,A123,3,25.00
4,103,2023-10-29,C789,1,100.00
2. AWS Lambda (Extraction) / S3 (Raw CSVs):
● Lambda Function (Optional):
○ For simplicity, we'll manually upload orders.csv to an S3 bucket.
○ However, in a real-world scenario, you could create a Lambda function triggered by
events (e.g., a file upload to a specific S3 location) to automate the extraction step.
○ The Lambda function would read the CSV from the source (e.g., another S3 bucket or a
database) and store it in the "Raw" S3 bucket; a minimal handler is sketched after this list.
● S3 "Raw" Bucket:
○ Create an S3 bucket named your-project-raw-bucket.
○ Upload the orders.csv file to this bucket.
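A minimal sketch of such an event-driven Lambda function is shown below, assuming the S3 trigger is configured on the source bucket and the target is the your-project-raw-bucket created above; the handler simply copies the uploaded CSV into the raw bucket, with error handling and file-type filtering omitted.
Python
import urllib.parse

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "your-project-raw-bucket"  # target "Raw" bucket from the step above

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the source bucket.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the uploaded CSV into the raw bucket unchanged.
        s3.copy_object(
            Bucket=RAW_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
    return {"status": "ok"}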
3. AWS Glue Crawlers (Schema Discovery):
● Crawler Configuration:
○ Create a Glue crawler.
○ Configure it to crawl your-project-raw-bucket and point it at the orders.csv file.
○ Specify a Glue database (e.g., orders_db) to store the discovered schema.
○ Run the crawler (a boto3 sketch of these steps follows below).
● Outcome:
○ The crawler will analyze the orders.csv file and create a table schema in the Glue Data
Catalog, defining the columns and data types.
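If you prefer to script the crawler setup instead of using the console, a boto3 sketch is below; the crawler name and IAM role ARN are illustrative placeholders, while the database name and bucket path follow the values used above.
Python
import boto3

glue = boto3.client("glue")

# Crawler name and role ARN are placeholders; database and path match the steps above.
glue.create_crawler(
    Name="orders-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Targets={"S3Targets": [{"Path": "s3://your-project-raw-bucket/"}]},
)
glue.start_crawler(Name="orders-raw-crawler")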
4. AWS Glue Data Catalog:
● Verification:
○ Go to the Glue Data Catalog and verify that the orders_db database and the orders
table (created by the crawler) are present.
○ Inspect the table schema to ensure it matches the structure of your orders.csv file (a
quick boto3 check is sketched below).
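This verification can also be scripted; the sketch below assumes the crawler registered the table as orders in orders_db and simply prints the discovered column names and types.
Python
import boto3

glue = boto3.client("glue")

# Fetch the table the crawler created and print its schema.
table = glue.get_table(DatabaseName="orders_db", Name="orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])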
5. AWS Glue Jobs (Transformation):
● Job Creation:
○ Create a Glue job (Spark or Python Shell).
○ Use the Glue Data Catalog table (orders_db.orders) as the source.
○ Transformation Logic (Example):
■ Convert the order_date column to a date data type.
■ Calculate the total_amount column (quantity * price).
■ Example PySpark code:
Python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_date

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read the source table registered by the crawler.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db", table_name="orders", transformation_ctx="datasource0")

# Convert to a Spark DataFrame for column-level transformations.
df = datasource0.toDF()
transformed_df = df.withColumn("order_date", to_date(col("order_date"))) \
    .withColumn("total_amount", col("quantity") * col("price"))

# Convert back to a DynamicFrame and write the result as Parquet.
transformed_dynamic_frame = DynamicFrame.fromDF(
    transformed_df, glueContext, "transformed_dynamic_frame")
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dynamic_frame, connection_type="s3",
    connection_options={"path": "s3://your-project-transformed-bucket/"},
    format="parquet")

job.commit()
● S3 "Transformed" Bucket (Parquet):
○ Create an S3 bucket named your-project-transformed-bucket.
○ Run the Glue job.
○ The transformed data (in Parquet format) will be stored in this bucket.
6. AWS Glue Jobs (Redshift Load):
● Job Creation:
○ Create another Glue job.
○ Use the transformed Parquet data in your-project-transformed-bucket as the source.
○ Redshift Connection:
■ Configure a connection to your Redshift cluster.
■ Specify the target Redshift table (e.g., orders_transformed).
○ Load Logic:
■ Use the Glue Redshift connection to load the Parquet data into the Redshift table (a sketch follows this section).
● Redshift (Data Warehouse):
○ Verify that the orders_transformed table is created in your Redshift cluster and
contains the transformed data.
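A sketch of such a load job is shown below; it assumes a Glue connection named redshift-connection, a Redshift database named dev, and a temporary S3 staging directory, all of which are placeholders for your own values.
Python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read the transformed Parquet files from S3.
parquet_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-project-transformed-bucket/"]},
    format="parquet",
    transformation_ctx="parquet_dyf")

# Write into Redshift through a pre-configured Glue connection.
# The connection name, database, and temp directory are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=parquet_dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "orders_transformed", "database": "dev"},
    redshift_tmp_dir="s3://your-project-temp-bucket/redshift/")

job.commit()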
7. AWS Athena / BI Tools (Querying and Analysis):
● AWS Athena:
○ Create an external table in Athena that points to your-project-transformed-bucket
(Parquet data).
○ Run SQL queries to analyze the data (e.g., find the total sales per customer); a boto3 sketch follows below.
● BI Tools (Optional):
○ Connect a BI tool (e.g., Tableau, Power BI) to Redshift or Athena to create
visualizations and dashboards.
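For example, total sales per customer can be computed through the Athena API as sketched below; the external table name (orders_transformed_parquet) and the query-results bucket are assumed placeholders.
Python
import boto3

athena = boto3.client("athena")

# Total sales per customer; table and output bucket names are placeholders.
query = """
    SELECT customer_id, SUM(total_amount) AS total_sales
    FROM orders_transformed_parquet
    GROUP BY customer_id
    ORDER BY total_sales DESC
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://your-project-athena-results/"},
)
print(response["QueryExecutionId"])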
ETL Testing Interview Points:
● Source Data Validation: Explain how you would validate the orders.csv data (e.g., data
types, completeness, consistency).
● Data Quality Checks: Describe the data quality checks you would perform during the
transformation process (e.g., handling null values, data type conversions, business rule
validation).
● Schema Validation: Discuss how you would validate the schema discovered by the Glue
crawler and the schema of the transformed data.
● Data Reconciliation: Explain how you would reconcile the data between the source
orders.csv and the Redshift orders_transformed table (e.g., row counts, data sampling,
checksums); a minimal row-count check is sketched after this list.
● Performance Testing: Discuss how you would test the performance of the Glue jobs and
Redshift queries.
● Error Handling: Describe how you would handle errors during the ETL process (e.g.,
logging, alerting, retry mechanisms).
● Incremental Loads: Explain how you would implement incremental loads for new order
data.
● Data Lineage: How would you track the data as it moves through the system?
● Security: How would you handle security of the data at rest and in transit?
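As a concrete example of the reconciliation point above, the sketch below compares a row count taken from the source orders.csv in S3 with a row count from the Redshift orders_transformed table; the Redshift connection details are placeholders, and psycopg2 is just one driver option.
Python
import csv
import io

import boto3
import psycopg2  # any Redshift-compatible driver works

# Count data rows in the source CSV stored in the raw bucket.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="your-project-raw-bucket", Key="orders.csv")["Body"]
reader = csv.DictReader(io.StringIO(body.read().decode("utf-8")))
source_count = sum(1 for _ in reader)

# Count rows loaded into Redshift (connection details are placeholders).
conn = psycopg2.connect(
    host="your-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders_transformed")
    target_count = cur.fetchone()[0]

assert source_count == target_count, (
    f"Row count mismatch: source={source_count}, target={target_count}")
print(f"Reconciliation passed: {source_count} rows in both source and target")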
By walking through this project and highlighting these testing points, you'll demonstrate a solid
understanding of ETL testing principles and AWS services. Good luck!