Wings1 T15 Cloud Analytics and AI

Challenge Overview:
This challenge comprises two distinct parts: Analytics and Machine Learning. You will
work with CSV datasets stored in an S3 bucket named “car-data” with a unique prefix
followed by a random number. Within this bucket, there are two essential folders:

1. Inputfile: This folder contains the car_data.csv dataset, which you will use for the Analytics part. Using car_data.csv, you will clean the data and load the cleaned data back to S3. You will also perform various analytics tasks based on the cleaned data.
2. car_cleaned_data: This folder contains the car_cleaned_data.csv dataset,
designated for the Machine Learning tasks.

Analytics and ML pipeline:

Analytics: S3 -> EMR -> S3 (45 mins)

Machine Learning: S3 -> SageMaker -> S3 (45 mins)

Note:

• Don't worry about the CSV files present in the local folder; use only the CSV files from the S3 bucket itself for the tasks outlined in this challenge.
• You can do either part first; the two parts (Analytics and Machine Learning) are independent.

Your mission is to tackle each part systematically, ensuring efficient data processing and
preparation for both Analytics and Machine Learning endeavors. Let’s dive into the
specifics of each task!
IMPORTANT NOTE: Use US East (N. Virginia) us-east-1 Region only.

Note: Please follow the naming conventions which are given in the problem statement.

How to log in to the AWS Account

Click the icon mentioned above on your desktop to be redirected to the AWS login page in the Firefox web browser. Here, you will find the username and password for the AWS access page. Click the "Access Lab" button to be redirected to a new page, then enter the credentials provided on the previous page.

Analytics

Note: In this challenge you will be using an EMR cluster. Before proceeding with the challenge, create an EMR cluster. It will take 8 to 10 minutes to create the cluster. While the cluster is being created, you can read the problem statement and write the code in the template given in the project folder.

EMR Cluster

EMR cluster creation:

Follow the configurations below for creating the cluster.

Name: spark_cluster

EMR version: 7.1.0

Application Bundle: select Custom and include the following applications:

• Hadoop 3.3.6
• Hive 3.1.3
• Livy 0.8.0
• Spark 3.5.0

Cluster Configuration:

• Uniform Instance Groups


• Keep only the primary node; remove the other two nodes by clicking Remove instance group. The primary node EC2 instance type should be m4.large.

EBS root Volume: Keep the default sizes.

Networking: Default

Steps: Leave as it is.

Cluster termination and node replacement: Termination Option - Manually terminate the cluster.

Bootstrap actions: Default

Cluster Logs: Default

Tags: Default
Software Settings: Default

Security configuration and EC2 key pair: Create a key pair named "emr_spark"

• Key pair type - RSA
• Private key file format - .pem

Download the key file to /home/labuser/Desktop/Project/kickoffs-cloud_analytics_and_ai-wingsT15-car_data/emr_spark.pem

Identity and Access Management (IAM) roles:

• Amazon EMR service role - click on Create new service role
• EC2 instance profile for Amazon EMR - create an instance profile
• S3 bucket access - All S3 buckets in this account with read and write access

Custom automatic scaling role: Default

Leave the other options as they are and click Create cluster (it will take 6 to 8 minutes to create the cluster). Meanwhile, you can read the problem statement and start coding.

Templates:

You can solve the analytics part using either PySpark or Scala. You are provided with a Python template in the challenge.py file and a Scala template in the challenge.scala file. Complete the code in the template using PySpark or Scala, push the file to EMR using SSH, and run it with spark-submit or sbt run. Alternatively, you can use the EMR Add step option to run the file.

Open the challenge folder, which is present in the Project folder, in VS Code. If you are using Python, complete the code in the challenge.py template; if you are using Scala, complete the code in the challenge.scala file.

Problem Statement
The following functions are given in the python and scala templates.

Task1:

Read data

Complete the following operations in the read_data function.

The following are the parameters:

• Spark session - spark

• Mention the bucket name inside the bucket_name variable.
• The dataset is in the S3 location inside the Inputfile folder.
• Read the CSV file into a DataFrame. Make sure to set only the header option to true.
• In the challenge file, the return statement is already defined; you need to replace df with your final output DataFrame.
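A minimal PySpark sketch of read_data follows, assuming the template passes in only the Spark session; the bucket-name suffix below is a placeholder that you replace with your own bucket's random number:

def read_data(spark):
    # Placeholder suffix; use the actual name of your car-data bucket.
    bucket_name = "car-dataXYZXYZ"
    input_path = f"s3://{bucket_name}/Inputfile/car_data.csv"

    # Read the CSV with only the header option set to true, as required.
    df = spark.read.option("header", "true").csv(input_path)
    return df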

Task2:

clean_data

Complete the following operations in the clean_data function.

The following are the parameters:

• Output of the read_data function - input_df

1. Drop rows that have null values in any of the columns.
2. Drop duplicate rows.
3. In the challenge file, the return statement is already defined; you need to replace df with your final output DataFrame.
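A minimal sketch of clean_data, assuming input_df is the DataFrame returned by read_data:

def clean_data(input_df):
    # Drop rows containing a null in any column, then drop duplicate rows.
    df = input_df.dropna(how="any").dropDuplicates()
    return df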

Task3:

S3_load_data

Complete the following operations in the "S3_load_data" function.

The following are the parameters:

• Final DataFrame to load the data: data (the final DataFrame produced by the result functions below)
• File name for the results: file_name

Mention the bucket name inside the bucket_name variable.

Write code to store the outputs to the respective locations using the output_path parameter.

• Output files should be a single-partition CSV file with a header.
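A hedged sketch of the S3 load function, assuming the parameter names data and file_name from the task text; the bucket-name suffix is again a placeholder:

def s3_load_data(data, file_name):
    bucket_name = "car-dataXYZXYZ"  # replace with your bucket's actual name
    output_path = f"s3://{bucket_name}/{file_name}"

    # coalesce(1) forces a single partition so the result is one CSV part file with a header.
    data.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)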

Task4:

result_1

Complete the following operations in the result_1 function.

The following are the parameters:

• Output of the clean_data function - input_df

• Group the data by "car_name".
• Calculate the average selling price for each brand and store it in the new column average_selling_price.
• Calculate the count of cars for each brand and store it in the new column car_count.
• Filter the grouped data to include only brands with more than 2 cars.
• In the challenge file, the return statement is already defined; you need to replace df with your final output DataFrame.

Sample Output:
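A hedged PySpark sketch of result_1, using the column names from the task text; cast selling_price to a numeric type first if it was read as a string:

from pyspark.sql import functions as F

def result_1(input_df):
    df = (
        input_df.groupBy("car_name")
        .agg(
            F.avg("selling_price").alias("average_selling_price"),
            F.count("*").alias("car_count"),
        )
        .filter(F.col("car_count") > 2)
    )
    return df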
Task5:

result_2

Complete the following operations in the result_2 function.

The following are the parameters:

• Output of the clean_data function - input_df

• Add a new column price_per_km calculated as selling_price / km_driven.
• Filter the data to include only rows where price_per_km is less than 10.
• In the challenge file, the return statement is already defined; you need to replace df with your final output DataFrame.

Sample Output:
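A hedged PySpark sketch of result_2, assuming selling_price and km_driven are numeric (cast them first if they were read as strings):

from pyspark.sql import functions as F

def result_2(input_df):
    df = input_df.withColumn(
        "price_per_km", F.col("selling_price") / F.col("km_driven")
    ).filter(F.col("price_per_km") < 10)
    return df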

Note:
The column names (column names are case-sensitive) and their order should be the same as given in the sample output for each task.

IMPORTANT

• You have been given two ways to run the Spark code: either inside the EMR cluster (over SSH) or by using an EMR step.

• Jump to the corresponding heading to complete the task.

Inside the EMR Operations

Steps to be followed:

1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/


2. In the navigation pane, choose Instances and locate the instance created by EMR.
3. Select the instance, go to the Security section, and add an inbound rule for the SSH port to the EC2 instance created by EMR.
4. Select the instance and choose Connect.
5. Choose the EC2 Instance Connect tab.
6. For Connection type, choose Connect using EC2 Instance Connect.
7. For Username, use root.
8. Choose Connect to open a terminal window.
9. A new browser tab opens with a terminal on the EMR cluster.

• Now run the emr_copy.py file using python emr_copy.py in VS Code inside /home/labuser/Desktop/Project/kickoffs-cloud_analytics_and_ai-wingsT15-car_data/; it will copy your updated file to the EMR cluster in the location /home/hadoop/.
[NOTE: Make sure the key pair is located at /home/labuser/Desktop/Project/kickoffs-cloud_analytics_and_ai-wingsT15-car_data/emr_spark.pem.]

• Go to the /home/hadoop/ directory; you will find the files which you have uploaded.
• Run the setup.sh file inside the /home/hadoop/setup directory with "bash setup.sh".

[NOTE: If sbt or pyspark is not working, run the command "source ~/.bashrc".]

• After Spark is set up successfully, you can submit the Spark application.
• For Pyspark:
spark-submit challenge.py
• For Scala:

Enter the scala directory, then run

sbt run

EMR Step Function

Follow these steps:

For Pyspark:

• Push the challenge.py file, which is present inside the python directory, to the S3 location.

For Scala:

• First, enter the scala directory.
• Run the command "sbt package"; it will produce the challenge file as a JAR inside the target/scala/ directory.
• Push the JAR file to the S3 location.
• Now open the Amazon EMR console at https://console.aws.amazon.com/emr.
• In the Cluster List, select your cluster.
• Scroll to the Steps section and expand it, then choose Add step.
• In the Add Step dialog:

• For Step type, choose Custom JAR.

• For Name, type any name.

• For JAR S3 location – “command-runner.jar”

For Arguments, mention this command:

For Scala - "spark-submit --class Main <s3_file_path>"

For PySpark - "spark-submit <s3_file_path>"

• For Action on failure, accept the default option (Continue).

• Choose Add. The step appears in the console with a status of Pending.

Machine Learning: Predicting Car Selling Prices

In this problem, you will build a machine learning model to predict the selling prices of
used cars. You are given a dataset with various features related to the cars, such as the
car’s age, mileage, fuel type, etc. Your goal is to preprocess the data, select appropriate
features, and train a model to make accurate price predictions. The final model should
achieve an R² score significantly higher than the baseline.

Dataset Description:

The dataset contains the following columns:

• car name: Name of the car.
• year: Year the car was manufactured.
• selling_price: Price at which the car was sold (target variable).
• km_driven: Kilometers driven by the car.
• fuel: Type of fuel used by the car.
• seller_type: Type of seller (individual/dealer).
• transmission: Type of transmission (manual/automatic).
• owner: Number of previous owners.

Perform the following Cloud Driven Task:

Task 1: Launching an Amazon SageMaker Instance


Objective: Launch an Amazon SageMaker notebook instance to develop and deploy
machine learning models.

Instructions:

1. Access Amazon SageMaker:

-Access the AWS Management Console.

-Navigate to the SageMaker service under the ‘Machine Learning’ category.

2. Create a New Notebook Instance:

- Create a new notebook instance under the name car-prediction-notebook.
- Choose the instance type 'ml.t3.medium'.
- Configure the necessary permissions to access S3 buckets by newly creating an appropriate role.
- Create the notebook instance.

3. Access the notebook:

- Wait for the notebook instance status to change to 'InService'.
- Open the Jupyter dashboard from the SageMaker console to begin your machine learning project.

Perform the following ML Tasks:

Task 1: Load and Explore the Dataset


Instructions:

1. Import the AWS SDK 'boto3' and other necessary libraries such as NumPy, Pandas & Sklearn in your notebook.
2. Load the dataset from the S3 bucket: build the S3 path for the dataset car_cleaned_data.csv using string formatting to concatenate the bucket name, the folder name (if any) and the file key, i.e. the name of the dataset. Note: Bucket name 'car-dataXYZXYZ' (XYZXYZ can be any random integers) & folder name 'car_cleaned_data'.
3. Load and store the dataset using pandas DataFrames.
4. Print the first few rows of the dataset to understand its structure and use built-in functions to get insights on the dataset.

Hints:

1. Sample S3 URI – “s3://bucket_name/folder_name/file_name.csv”
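A hedged sketch of the loading step; the bucket name uses the XYZXYZ placeholder from the note above, and reading an s3:// URI directly with pandas assumes the s3fs package is available (otherwise download the object with boto3 first):

import boto3
import numpy as np
import pandas as pd

bucket_name = "car-dataXYZXYZ"  # placeholder; substitute your bucket's random digits
file_key = "car_cleaned_data/car_cleaned_data.csv"
s3_uri = f"s3://{bucket_name}/{file_key}"

# Load the dataset into a pandas DataFrame.
df = pd.read_csv(s3_uri)

# Inspect the structure and basic statistics of the data.
print(df.head())
print(df.info())
print(df.describe())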

Task 2: Feature Engineering


Instructions:

1. Create a new feature 'car_age', which represents the age of the car in years. Add the feature as a column under the name car_age in the DataFrame.
2. Drop the columns 'car name' and 'year' from the dataset.

Hints:

1. The car's age can be calculated as 2024 - year.
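A short sketch of the feature engineering step; the column spellings used below ('car_name', 'year') are assumptions, so adjust them to match your dataset:

# Age of the car in years, computed as 2024 - year.
df["car_age"] = 2024 - df["year"]

# Drop the raw name and year columns; rename "car_name" if your CSV uses a different spelling.
df = df.drop(columns=["car_name", "year"])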

Task 3: Define Features and Target Variable


Instructions:

1. Define the features 'X' by dropping the target variable (selling_price) from the dataset.
2. Define the target variable 'y' as the selling price.
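A minimal sketch, assuming the target column is named selling_price:

X = df.drop(columns=["selling_price"])  # feature matrix
y = df["selling_price"]                 # target variable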

Task 4: Preprocess the Data


Instructions:

1. Identify the numerical and categorical features present in 'X' and 'y', and store them in lists for building a pipeline.
2. Apply a log transformation to the km_driven and selling_price columns to reduce skewness. Replace the transformed data back in 'X' & 'y'.
3. Use StandardScaler to scale the numerical features.
4. Use OneHotEncoder to encode the categorical features, ensuring that unknown categories are handled.

Hints:

1. Use 'np.log1p()' for the log transformation.
2. Example numerical feature list: numerical_features = ['xy', 'yz']
3. Example categorical feature list: categorical_features = ['xy', 'yz', 'zz']
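A hedged sketch of the preprocessing choices; the feature lists below are assumptions based on the dataset description and may need adjusting:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_features = ["km_driven", "car_age"]
categorical_features = ["fuel", "seller_type", "transmission", "owner"]

# Log-transform the skewed columns in place: km_driven in X and the selling price in y.
X["km_driven"] = np.log1p(X["km_driven"])
y = np.log1p(y)

# Scaler and encoder instances to be wired into the ColumnTransformer in the next task.
scaler = StandardScaler()
encoder = OneHotEncoder(handle_unknown="ignore")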

Task 5: Build a Transformer Pipeline


Instructions:

1. Create a ColumnTransformer pipeline to preprocess the numerical and categorical features.
2. Create a pipeline that includes the preprocessors, i.e. StandardScaler & OneHotEncoder. Store the transformer pipeline in a variable, for example preprocessor. We will utilize this for building the model pipeline in the upcoming steps.

Hints:

1. Sample pipeline:

preprocessor = ColumnTransformer(transformers=[
    ('xy', scaler(), feature_list1),
    ('yz', encoder(), feature_list2)
])
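A concrete version of the sample above, reusing the feature lists from the previous task; the transformer labels "num" and "cat" are arbitrary names:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)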

Task 6: Build a Model Pipeline and Split the Data


Instructions:

1. Create a Sklearn model pipeline with the preprocessor pipeline and a 'RandomForestRegressor' model with the random state set to '8'. Store it under the name 'model'.
2. Split the dataset into training and test sets using 'train_test_split' with a test size of 20% and the random state set to '8'.

Hints:

Sample pipeline:

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', ModelFunc(arg=value))
])
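A hedged sketch of the model pipeline and the split; the step name "regressor" is an assumption chosen so that it matches the "regressor__..." prefix used in the parameter grid of the next task (if you name the step "model" instead, change the grid keys accordingly):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", RandomForestRegressor(random_state=8)),
    ]
)

# 80/20 train/test split with the random state fixed at 8.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8
)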

Task 7: Hyperparameter Tuning


Instructions:

1. Perform hyperparameter tuning using GridSearchCV to find the best parameters for the RandomForestRegressor. The parameter grid is provided in the sample notebook and in the hint section.
2. Build the grid search, named grid_search, with the generated model pipeline, the parameter grid, the cross-validation generator set to '5', scoring set to 'r2', and the number of jobs set to '1'.

Hints:

1. Parameter grid:

param_grid = {
    'regressor__n_estimators': [100, 200, 300],
    'regressor__max_depth': [None, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10]}

2. Define a parameter grid as given above with different values for n_estimators, max_depth, and min_samples_split.
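A hedged sketch of the grid search setup, reusing the model pipeline from the previous task; the double-underscore prefix in the grid keys must match the name of the regressor step in that pipeline:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "regressor__n_estimators": [100, 200, 300],
    "regressor__max_depth": [None, 10, 20, 30],
    "regressor__min_samples_split": [2, 5, 10],
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring="r2", n_jobs=1)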

Task 8: Train the Model.


Instructions:

1. Fit the GridSearchCV on the training data to find the best model. (Note: the fitting process may take a few minutes to complete; please wait until it is finished.)
2. Extract the best model from the grid search results. Store the best model in a variable, for example 'best_model'.

Hints:
1. Use ‘grid_search.best_estimator_’ to get the best model.
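A two-line sketch of the training step; fitting evaluates every parameter combination, so expect it to run for several minutes:

grid_search.fit(X_train, y_train)          # trains all parameter combinations
best_model = grid_search.best_estimator_   # pipeline refit with the best parameters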

Task 9: Make Predictions


Instructions:

1. Use the best model to make predictions on the test set and store them in a variable called 'y_pred'.
2. Transform the predictions, i.e. 'y_test' and 'y_pred', back to the original scale.

Hints:

1. Use np.expm1() to reverse the log transformation.
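A short sketch, assuming y was log-transformed with np.log1p earlier so np.expm1 restores the original price scale:

import numpy as np

y_pred = best_model.predict(X_test)   # predictions on the log scale
y_pred_original = np.expm1(y_pred)    # back to the original price scale
y_test_original = np.expm1(y_test)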

Task 10: Evaluate the Model


Instructions:

1. Calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² score for the model.
2. Print the evaluation metrics and the best hyperparameters found.

Hints:

1. Use ‘grid_search.best_params_’ to get the best hyperparameter.
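A hedged sketch of the evaluation, using the back-transformed values from the previous task:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test_original, y_pred_original)
mse = mean_squared_error(y_test_original, y_pred_original)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_original, y_pred_original)

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R2: {r2:.4f}")
print("Best hyperparameters:", grid_search.best_params_)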

Task 11: Saving the Model to AWS S3


Instructions:

1. Serialize the trained GridSearchCV model using 'joblib'.
2. Initialize the S3 client using the 'boto3' library.
3. Save the serialized model to a temporary file using 'tempfile'.
4. Upload the model file to the specified S3 bucket named 'car-data'.
5. Ensure the model is saved as 'grid_search.pkl' in the S3 bucket.

Hints:

1. Temporary files in Python can be managed using 'tempfile.TemporaryFile()'.
2. Use joblib.dump() for saving the model.
3. You can push objects into S3 using the put_object(...) method with the necessary parameters, available under boto3.
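A hedged sketch of the upload step; the bucket name is again the XYZXYZ placeholder, and the object key follows the 'grid_search.pkl' requirement above:

import tempfile

import boto3
import joblib

bucket_name = "car-dataXYZXYZ"  # placeholder; use your actual car-data bucket name

s3_client = boto3.client("s3")
with tempfile.TemporaryFile() as tmp:
    joblib.dump(grid_search, tmp)  # serialize the fitted grid search into the temp file
    tmp.seek(0)                    # rewind before reading the bytes back
    s3_client.put_object(Bucket=bucket_name, Key="grid_search.pkl", Body=tmp.read())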

Once you complete the challenge, make sure all the files are saved in
the specified directory and click on SUBMIT.
