This project processes The Movies Dataset to create analytics-ready tables for analyzing movie genres and production companies.
It includes a Python-based ETL pipeline, reproducible SQL models, and validation queries using DuckDB.
cd ~/Projects
unzip ~/Downloads/guild_takehome_project_v14.zip -d guild_takehome_project
cd guild_takehome_project
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The ETL pipeline accepts either an S3 URI or a local file path as input. S3 files are downloaded automatically with Boto3 before processing.
python src/main.py s3://com.guild.us-west-2.public-data/project-data/the-movies-dataset.zip \
  --out ./output --log ./error.log

Alternatively, you can provide a local file path:
Ensure the-movies-dataset.zip is placed in the project root (same folder as src/, sql/, docs/).
python src/main.py the-movies-dataset.zip --out ./output --log ./error.log
This generates six analytics-ready tables under ./output/:
tblMovie.csv
tblGenre.csv
tblCompany.csv
tblMovieGenres.csv
tblMovieCompanies.csv
tblMovieMetrics.csv
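
The input handling described above (S3 URI or local path) could be sketched roughly as follows. This is a minimal illustration, not the code in src/main.py: the function names `parse_s3_uri` and `resolve_input` are hypothetical, and anonymous (unsigned) access to the bucket is an assumption.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, key


def resolve_input(source: str, download_dir: str = ".") -> Path:
    """Return a local path to the dataset, downloading from S3 when needed."""
    if source.startswith("s3://"):
        # Imported lazily so that local-only runs do not require Boto3.
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        bucket, key = parse_s3_uri(source)
        target = Path(download_dir) / Path(key).name
        # Assumption: the bucket is public, so requests are left unsigned.
        s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        s3.download_file(bucket, key, str(target))
        return target

    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    return path
```

Keeping the URI parsing separate from the download makes the routing logic easy to check without network access.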
The ETL pipeline implemented in src/main.py and src/transform.py transforms The Movies Dataset into analytics-ready tables that support all required queries.
- Extract: Reads compressed CSVs from `the-movies-dataset.zip`.
- Transform:
  - Parses genre and company JSON arrays.
  - Derives `release_year` from `release_date`.
  - Cleans invalid IDs, null values, and missing genre tags.
  - Computes `profit = revenue - budget`.
- Load: Writes six normalized CSVs (`tblMovie`, `tblGenre`, `tblCompany`, `tblMovieGenres`, `tblMovieCompanies`, `tblMovieMetrics`) into the `./output/` directory.
- Validate: The `sql/*.sql` queries verify that the data model answers each required analytical question.
This implementation fulfills Deliverable #2 – Implement a program that transforms the input data into a form usable by the data model.
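
To make the Transform step concrete, here is a minimal per-row sketch. It is not the code in src/transform.py: the helper names and the input column names (`id`, `genres`, `release_date`, `budget`, `revenue`) are assumptions for illustration. It also assumes the serialized arrays may use Python-style single quotes, which is why `ast.literal_eval` is used rather than a strict JSON parser.

```python
import ast
from datetime import datetime


def parse_name_list(raw):
    """Parse a serialized list of {'id': ..., 'name': ...} dicts into (id, name) pairs."""
    if not raw:
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed arrays are treated as empty
    if not isinstance(items, list):
        return []
    return [
        (item["id"], item["name"])
        for item in items
        if isinstance(item, dict) and "id" in item and "name" in item
    ]


def release_year(release_date):
    """Return the year of a 'YYYY-MM-DD' date string, or None if it is invalid."""
    try:
        return datetime.strptime(release_date, "%Y-%m-%d").year
    except (TypeError, ValueError):
        return None


def to_number(value):
    """Convert a numeric string to float, falling back to 0.0 when missing or invalid."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def transform_row(row):
    """Clean one raw movie row; return None when the row should be dropped."""
    movie_id = str(row.get("id", "")).strip()
    if not movie_id.isdigit():
        return None  # non-numeric movie IDs are dropped
    genres = parse_name_list(row.get("genres"))
    year = release_year(row.get("release_date"))
    if not genres or year is None:
        return None  # genre-less movies and invalid dates are excluded
    budget = to_number(row.get("budget"))
    revenue = to_number(row.get("revenue"))
    return {
        "movie_id": int(movie_id),
        "release_year": year,
        "genres": genres,
        "budget": budget,
        "revenue": revenue,
        "profit": revenue - budget,
    }
```

A row whose `id` is not numeric, whose `release_date` does not parse, or whose genre list is empty yields `None` and would be skipped by the caller.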
After generating outputs, you can validate the model using DuckDB from the terminal.
Each query auto-registers CSVs from ./output/ as tables and prints the results.
python src/run_query.py sql/most_popular_genre_by_year.sql
python src/run_query.py sql/budget_by_genre_by_year.sql
python src/run_query.py sql/revenue_by_genre_by_year.sql
python src/run_query.py sql/profit_by_genre_by_year.sql
python src/run_query.py sql/budget_by_company_by_year.sql
python src/run_query.py sql/revenue_by_company_by_year.sql
python src/run_query.py sql/profit_by_company_by_year.sql
python src/run_query.py sql/releases_by_genre_per_company_per_year.sql
python src/run_query.py sql/average_popularity_by_company_by_year.sql

All queries sort results by their primary metric (descending) within each release_year, so each report clearly surfaces top performers.
The prompt asks for the most popular genre by year (singular).
This query ranks genres by average popularity within each year and returns only the top one (ROW_NUMBER() = 1).
To view all genres per year, simply remove the WHERE rn = 1 filter.
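
The ranking pattern described above can be illustrated with a small self-contained sketch. It runs against an in-memory SQLite database from the Python standard library rather than DuckDB, and the `movie_genre` table, its columns, and the sample rows are hypothetical stand-ins for the real joined tables; the actual query in sql/most_popular_genre_by_year.sql may differ.

```python
import sqlite3

# Hypothetical sample rows: (release_year, genre, popularity).
SAMPLE_ROWS = [
    (1995, "Animation", 20.0),
    (1995, "Animation", 10.0),
    (1995, "Drama", 12.0),
    (1996, "Drama", 8.0),
    (1996, "Action", 9.0),
    (1996, "Action", 5.0),
]

QUERY = """
WITH genre_avg AS (
    SELECT release_year, genre, AVG(popularity) AS avg_popularity
    FROM movie_genre
    GROUP BY release_year, genre
),
ranked AS (
    SELECT release_year, genre, avg_popularity,
           ROW_NUMBER() OVER (
               PARTITION BY release_year
               ORDER BY avg_popularity DESC
           ) AS rn
    FROM genre_avg
)
SELECT release_year, genre, avg_popularity
FROM ranked
{where_clause}
ORDER BY release_year, avg_popularity DESC
"""


def rank_genres(rows, top_only=True):
    """Rank genres by average popularity per year; keep only rank 1 if top_only."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE movie_genre (release_year INTEGER, genre TEXT, popularity REAL)"
        )
        conn.executemany("INSERT INTO movie_genre VALUES (?, ?, ?)", rows)
        where_clause = "WHERE rn = 1" if top_only else ""
        return conn.execute(QUERY.format(where_clause=where_clause)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in rank_genres(SAMPLE_ROWS):
        print(row)
```

Note that `ROW_NUMBER()` returns exactly one row per year even when two genres tie on average popularity, in which case the winner is arbitrary; `RANK()` would return all tied genres instead.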
run_query.py supports the --limit argument:
# Default (10 rows)
python src/run_query.py sql/revenue_by_genre_by_year.sql
# Top 50 rows
python src/run_query.py sql/revenue_by_genre_by_year.sql --limit 50
# All rows
python src/run_query.py sql/revenue_by_genre_by_year.sql --limit 0

- `docs/data_model.md`: table definitions and modeling decisions
- `docs/erd.png`: Entity-Relationship Diagram
- `docs/design_scaling_ai.md`: scaling strategy, backfills, monitoring, and AI/ML plan
src/
sql/
docs/
output/
requirements.txt
README.md
error.log (optional)
- Movies with missing or invalid data
  • Non-numeric `movie_id` rows are dropped.
  • Only movies with a valid `release_year` are analyzed.
- Numeric formatting
  • Integer-like fields are displayed as integers (no “.0” suffix).
- Genre filtering
  • Only movies with genre tags are included.
  • This keeps company lists consistent across metrics.
- Join strategy
  • All joins are `INNER JOIN`; no `LEFT JOIN` is used.
- Sorting
  • Each query orders results by the main metric (descending) within each year (e.g., `ORDER BY release_year, SUM(f.revenue) DESC`).
- Result display
  • The `--limit` flag controls how many rows are shown (default 10; 0 = all).
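
As an illustration of the result-display rules above, the following sketch shows one way the `--limit` flag and integer formatting could behave. It is not the code in src/run_query.py: the function names `build_parser`, `apply_limit`, and `format_value` are hypothetical, and rejecting negative limits is an assumption.

```python
import argparse


def build_parser():
    """Build a CLI parser with a positional SQL file and an optional --limit."""
    parser = argparse.ArgumentParser(description="Run a SQL file and print the results.")
    parser.add_argument("sql_file", help="path to the .sql file to execute")
    parser.add_argument(
        "--limit",
        type=int,
        default=10,
        help="number of rows to display (default 10; 0 shows all rows)",
    )
    return parser


def apply_limit(rows, limit):
    """Return the first `limit` rows, or every row when limit is 0."""
    if limit < 0:
        raise ValueError("--limit must be 0 or a positive integer")
    return rows if limit == 0 else rows[:limit]


def format_value(value):
    """Render integer-like floats without a trailing '.0'."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


if __name__ == "__main__":
    args = build_parser().parse_args()
    demo_rows = [(1995.0, "Animation", 373554033.0)] * 25  # placeholder rows
    for row in apply_limit(demo_rows, args.limit):
        print("\t".join(format_value(v) for v in row))
```

Treating 0 as "no limit" keeps the flag a plain integer, so no separate `--all` switch is needed.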
- All SQL validated and syntax-safe
- `ORDER BY` uses raw aggregates (no alias errors)
- Supports `--limit` parameter
- Clean integer output
- Genre-less movies excluded
- Sorted results by key metric per year
- “Most Popular Genre” correctly returns one per year
- Full documentation for review and submission