Project: Customer Order Data ETL Pipeline
Goal: To extract customer order data from a CSV source, transform it, and load it into a
Redshift data warehouse for analysis.
1. Source System (CSV):
● Scenario: We'll simulate a source system by creating a simple CSV file named orders.csv.
● Content:
Code snippet
order_id,customer_id,order_date,product_id,quantity,price
1,101,2023-10-26,A123,2,25.00
2,102,2023-10-27,B456,1,50.00
3,101,2023-10-28,A123,3,25.00
4,103,2023-10-29,C789,1,100.00
2. AWS Lambda (Extraction) / S3 (Raw CSVs):
● Lambda Function (Optional):
○ For simplicity, we'll manually upload orders.csv to an S3 bucket.
○ However, in a real-world scenario, you could create a Lambda function triggered by
events (e.g., a file upload to a specific S3 location) to automate the extraction step.
○ The Lambda function would read the CSV from the source (e.g., another S3 bucket or a
database) and store it in the "Raw" S3 bucket; a minimal handler is sketched after this list.
● S3 "Raw" Bucket:
○ Create an S3 bucket named your-project-raw-bucket.
○ Upload the orders.csv file to this bucket.
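A minimal sketch of such an event-driven Lambda function is shown below, assuming the S3 trigger is configured on the source bucket and the target is the your-project-raw-bucket created above; the handler simply copies the uploaded CSV into the raw bucket, with error handling and file-type filtering omitted.
Python
import urllib.parse

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "your-project-raw-bucket"  # target "Raw" bucket from the step above

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event on the source bucket.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the uploaded CSV into the raw bucket unchanged.
        s3.copy_object(
            Bucket=RAW_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
    return {"status": "ok"}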
3. AWS Glue Crawlers (Schema Discovery):
● Crawler Configuration:
○ Create a Glue crawler.
○ Configure it to crawl your-project-raw-bucket and point it at the orders.csv file.
○ Specify a Glue database (e.g., orders_db) to store the discovered schema.
○ Run the crawler (a boto3 sketch of these steps follows below).
● Outcome:
○ The crawler will analyze the orders.csv file and create a table schema in the Glue Data
Catalog, defining the columns and data types.
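If you prefer to script the crawler setup instead of using the console, a boto3 sketch is below; the crawler name and IAM role ARN are illustrative placeholders, while the database name and bucket path follow the values used above.
Python
import boto3

glue = boto3.client("glue")

# Crawler name and role ARN are placeholders; database and path match the steps above.
glue.create_crawler(
    Name="orders-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Targets={"S3Targets": [{"Path": "s3://your-project-raw-bucket/"}]},
)
glue.start_crawler(Name="orders-raw-crawler")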
4. AWS Glue Data Catalog:
● Verification:
○ Go to the Glue Data Catalog and verify that the orders_db database and the orders
table (created by the crawler) are present.
○ Inspect the table schema to ensure it matches the structure of your orders.csv file (a
quick boto3 check is sketched below).
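This verification can also be scripted; the sketch below assumes the crawler registered the table as orders in orders_db and simply prints the discovered column names and types.
Python
import boto3

glue = boto3.client("glue")

# Fetch the table the crawler created and print its schema.
table = glue.get_table(DatabaseName="orders_db", Name="orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])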
5. AWS Glue Jobs (Transformation):
● Job Creation:
○ Create a Glue job (Spark or Python Shell).
○ Use the Glue Data Catalog table (orders_db.orders) as the source.
○ Transformation Logic (Example):
■ Convert the order_date column to a date data type.
■ Calculate the total_amount column (quantity * price).
■ Example PySpark code:
Python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, to_date

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read the source table registered by the crawler.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db", table_name="orders", transformation_ctx="datasource0")

# Convert to a Spark DataFrame for column-level transformations.
df = datasource0.toDF()
transformed_df = df.withColumn("order_date", to_date(col("order_date"))) \
    .withColumn("total_amount", col("quantity") * col("price"))

# Convert back to a DynamicFrame and write the result as Parquet.
transformed_dynamic_frame = DynamicFrame.fromDF(
    transformed_df, glueContext, "transformed_dynamic_frame")
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dynamic_frame, connection_type="s3",
    connection_options={"path": "s3://your-project-transformed-bucket/"},
    format="parquet")

job.commit()
● S3 "Transformed" Bucket (Parquet):
○ Create an S3 bucket named your-project-transformed-bucket.
○ Run the Glue job.
○ The transformed data (in Parquet format) will be stored in this bucket.
6. AWS Glue Jobs (Redshift Load):
● Job Creation:
○ Create another Glue job.
○ Use the transformed Parquet data in your-project-transformed-bucket as the source.
○ Redshift Connection:
■ Configure a connection to your Redshift cluster.
■ Specify the target Redshift table (e.g., orders_transformed).
○ Load Logic:
■ Use the Glue Redshift connection to load the Parquet data into the Redshift table (a sketch follows this section).
● Redshift (Data Warehouse):
○ Verify that the orders_transformed table is created in your Redshift cluster and
contains the transformed data.
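A sketch of such a load job is shown below; it assumes a Glue connection named redshift-connection, a Redshift database named dev, and a temporary S3 staging directory, all of which are placeholders for your own values.
Python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read the transformed Parquet files from S3.
parquet_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-project-transformed-bucket/"]},
    format="parquet",
    transformation_ctx="parquet_dyf")

# Write into Redshift through a pre-configured Glue connection.
# The connection name, database, and temp directory are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=parquet_dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "orders_transformed", "database": "dev"},
    redshift_tmp_dir="s3://your-project-temp-bucket/redshift/")

job.commit()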
7. AWS Athena / BI Tools (Querying and Analysis):
● AWS Athena:
○ Create an external table in Athena that points to your-project-transformed-bucket
(Parquet data).
○ Run SQL queries to analyze the data (e.g., find the total sales per customer); a boto3 sketch follows below.
● BI Tools (Optional):
○ Connect a BI tool (e.g., Tableau, Power BI) to Redshift or Athena to create
visualizations and dashboards.
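For example, total sales per customer can be computed through the Athena API as sketched below; the external table name (orders_transformed_parquet) and the query-results bucket are assumed placeholders.
Python
import boto3

athena = boto3.client("athena")

# Total sales per customer; table and output bucket names are placeholders.
query = """
    SELECT customer_id, SUM(total_amount) AS total_sales
    FROM orders_transformed_parquet
    GROUP BY customer_id
    ORDER BY total_sales DESC
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://your-project-athena-results/"},
)
print(response["QueryExecutionId"])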
ETL Testing Interview Points:
● Source Data Validation: Explain how you would validate the orders.csv data (e.g., data
types, completeness, consistency).
● Data Quality Checks: Describe the data quality checks you would perform during the
transformation process (e.g., handling null values, data type conversions, business rule
validation).
● Schema Validation: Discuss how you would validate the schema discovered by the Glue
crawler and the schema of the transformed data.
● Data Reconciliation: Explain how you would reconcile the data between the source
orders.csv and the Redshift orders_transformed table (e.g., row counts, data sampling,
checksums); a minimal row-count check is sketched after this list.
● Performance Testing: Discuss how you would test the performance of the Glue jobs and
Redshift queries.
● Error Handling: Describe how you would handle errors during the ETL process (e.g.,
logging, alerting, retry mechanisms).
● Incremental Loads: Explain how you would implement incremental loads for new order
data.
● Data Lineage: How would you track the data as it moves through the system?
● Security: How would you handle security of the data at rest and in transit?
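As a concrete example of the reconciliation point above, the sketch below compares a row count taken from the source orders.csv in S3 with a row count from the Redshift orders_transformed table; the Redshift connection details are placeholders, and psycopg2 is just one driver option.
Python
import csv
import io

import boto3
import psycopg2  # any Redshift-compatible driver works

# Count data rows in the source CSV stored in the raw bucket.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="your-project-raw-bucket", Key="orders.csv")["Body"]
reader = csv.DictReader(io.StringIO(body.read().decode("utf-8")))
source_count = sum(1 for _ in reader)

# Count rows loaded into Redshift (connection details are placeholders).
conn = psycopg2.connect(
    host="your-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="your-password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders_transformed")
    target_count = cur.fetchone()[0]

assert source_count == target_count, (
    f"Row count mismatch: source={source_count}, target={target_count}")
print(f"Reconciliation passed: {source_count} rows in both source and target")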
By walking through this project and highlighting these testing points, you'll demonstrate a solid
understanding of ETL testing principles and AWS services. Good luck!