Basic Data Prep and Pre-Processing (2)

This document outlines a practical session for machine learning using Scikit-learn, focusing on data preparation, preprocessing, model training, and evaluation with the Iris dataset. It includes steps for importing libraries, loading data, preprocessing, training a Logistic Regression model, and visualizing results. Additionally, it suggests testing with other datasets such as Wine and Breast Cancer.

Uploaded by

reshika2080-0174

Week 3.1 Basic Data Prep and Pre-Processing.ipynb - Colab

Practical Session: Machine Learning with Scikit-learn

Objective: This practical session aims to introduce the use of sklearn to load datasets, perform data preprocessing, train a
machine learning model, and evaluate the model's performance. We will be using the Iris dataset for training a Logistic Regression
model and visualizing results.

There are 2 parts in this practical.

Requirements:

• Python (version 3.x)

• Libraries: sklearn, pandas, numpy, matplotlib, seaborn
• IDE: Jupyter Notebook, Google Colab, or any Python IDE

## 1st Part

Step 1: Import Required Libraries

1. Open your Python environment (Jupyter Notebook, Google Colab, etc.).

2. Start by importing the necessary libraries for data manipulation, visualization, and model building.

```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning utilities from sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

Step 2: Load and Explore the Iris Dataset

1. Load the Iris dataset from sklearn. This dataset is commonly used for classification problems and consists of 150 samples from three Iris species.

```python
# Load Iris dataset
iris = load_iris()

# Convert to a pandas DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Display the first few rows of the dataset
iris_df.head()
```

2. Observe the dataset structure and column names to understand the feature set (sepal length, sepal width, petal length, petal
width) and the target variable (target column).

Stop here. Compile and execute.

Note:

The Iris dataset returned by load_iris() from the sklearn.datasets module is a Bunch object, which behaves like a Python dictionary whose keys can also be read as attributes. This Bunch object contains the following key-value pairs:

• data: A 2D NumPy array with shape (150, 4). Each row represents one sample, and each column represents one of the four features (sepal length, sepal width, petal length, petal width).

• target: A 1D NumPy array of length 150. Each element represents the target label (species) of the corresponding sample in the data array.

• feature_names: A list containing the names of the four features.

• target_names: A NumPy array of the target labels (species names: 'setosa', 'versicolor', 'virginica').

• DESCR: A string containing a full description of the dataset.

• filename: The name of, or path to, the dataset file on disk (if applicable).

• frame: A pandas DataFrame, only present if the dataset was loaded with as_frame=True.

This structure allows easy access to both the data and the metadata associated with the dataset.
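
As a quick illustration of the note above, the following minimal sketch reads a few of the Bunch entries both dictionary-style and as attributes. The variable name iris_frame is only illustrative and does not come from the notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Entries can be read dictionary-style or as attributes
print(iris.data.shape)          # feature matrix: 150 samples x 4 features
print(iris["target"].shape)     # one label per sample
print(iris.feature_names)
print(list(iris.target_names))

# With as_frame=True the Bunch also carries a pandas DataFrame
# (four feature columns plus the target column)
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.shape)
```

Loading with as_frame=True is an alternative to building the DataFrame by hand with pd.DataFrame, as done in Step 2.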

## 2nd Part
Step 3: Data Preprocessing

1. Split the dataset into training and testing sets. This will allow us to train the model on a portion of the data and test its
performance on unseen data.

```python
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(iris_df.iloc[:, :-1], iris_df['target'], test_size=0.3, random_state=42)

# Check the shape of the training and testing data
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
```
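
To make the split sizes concrete, here is a small self-contained sketch. It assumes test_size=0.3 and random_state=42, the values that appear in the full listing later in this notebook, and it loads the arrays directly with return_X_y=True instead of going through iris_df.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 30% of the 150 samples are held out for testing, the remaining 70% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
```

Fixing random_state makes the shuffle reproducible, so repeated runs give the same train and test rows.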

2. Standardize the features using StandardScaler so that each feature has zero mean and unit variance. The scaler is fitted on the training set only and then applied unchanged to the test set.

```python
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
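
The sketch below is one way to check what the scaler does. It assumes the same 70/30 split as the rest of the notebook and compares transform() with the manual formula z = (x - mean) / std, using the statistics learned from the training data. The variable name manual is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Manual equivalent of transform(): z = (x - mean) / std
manual = (X_test - scaler.mean_) / scaler.scale_

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Train stds after scaling: ", X_train_scaled.std(axis=0).round(6))
print("Manual formula matches transform():", np.allclose(manual, X_test_scaled))
```

The test set is deliberately not used in fit_transform, so no information from the held-out data leaks into preprocessing.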

Step 4: Train a Logistic Regression Model

1. Train a Logistic Regression model on the Iris dataset. Logistic Regression is a classification algorithm that works well for small datasets like Iris.

```python
# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Display the model coefficients
print("Model coefficients:", model.coef_)
```
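
The coefficient array is easier to read when its rows and columns are labelled. The following sketch, assuming the same split and model settings as above, wraps model.coef_ in a DataFrame. The name coef_table is only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

model = LogisticRegression(max_iter=200).fit(X_train_scaled, y_train)

# coef_ has one row per class and one column per feature
coef_table = pd.DataFrame(model.coef_, index=iris.target_names, columns=iris.feature_names)
print(model.coef_.shape)
print(coef_table.round(3))
```

Because the features were standardized, the magnitudes within a row are roughly comparable across features.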

Step 5: Evaluate the Model

1. Predict the target values for the testing set using the trained model.
2. Evaluate the model's performance using metrics such as accuracy, the confusion matrix, and the classification report.

```python
# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)

# Display confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# Classification report for precision, recall, and F1-score
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:\n", class_report)
```
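
Accuracy, precision, recall, and F1-score can all be derived from the confusion matrix. The sketch below shows the arithmetic on a small set of hypothetical labels (made up for illustration, not taken from the Iris run) and checks the results against sklearn's own functions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels, chosen only to illustrate the arithmetic
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print("accuracy :", accuracy)
print("precision:", precision.round(3))
print("recall   :", recall.round(3))
print("f1       :", f1.round(3))
print("matches sklearn:", np.allclose(f1, f1_score(y_true, y_pred, average=None)))
```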

Step 6: Visualizing the Results

1. Visualize the confusion matrix using a heatmap to get a better understanding of the model's classification performance.

```python
# Visualize the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```

2. Plot the distribution of target values in the Iris dataset to see how the classes are distributed.

```python
# Visualize the distribution of target values
sns.countplot(x=iris_df['target'])
plt.title('Distribution of Target Values in Iris Dataset')
plt.show()
```
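
The same class distribution can also be checked numerically, without a plot. This is a minimal sketch using pandas value_counts. The names species and class_counts are only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Map the numeric labels (0, 1, 2) to species names, then count each class
species = pd.Series(iris.target_names[iris.target])
class_counts = species.value_counts()
print(class_counts)
```

Each species should appear equally often, which is what the count plot shows as three bars of equal height.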

Step 7: Test with Another Dataset

1. Repeat the steps above using a different dataset such as the Wine dataset or the Breast Cancer dataset.

```python
# load_wine must be imported in addition to the Step 1 imports
from sklearn.datasets import load_wine

# Load the Wine dataset
wine = load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

# Display first few rows of the dataset
wine_df.head()
```
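
One possible way to repeat the whole workflow on other datasets is to wrap the split, scale, train, and evaluate steps in a function. The helper evaluate_dataset below is hypothetical (it is not part of the notebook) and reuses the notebook's settings: a 70/30 split, random_state=42, and LogisticRegression(max_iter=200).

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_dataset(dataset, test_size=0.3, random_state=42):
    """Hypothetical helper: split, scale, train, and score one sklearn dataset."""
    X_train, X_test, y_train, y_test = train_test_split(
        dataset.data, dataset.target, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
    X_test_scaled = scaler.transform(X_test)

    model = LogisticRegression(max_iter=200)
    model.fit(X_train_scaled, y_train)
    return accuracy_score(y_test, model.predict(X_test_scaled))


results = {
    "wine": evaluate_dataset(load_wine()),
    "breast_cancer": evaluate_dataset(load_breast_cancer()),
}
for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.3f}")
```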

Steps to perform basic data prep:

1. Import statements
2. Ingest/load data (optionally convert to a DataFrame)
3. Display

Steps to perform ML:

1. Import statements
2. Ingest/load data
3. EDA
4. Split train and test
5. Scaling
6. Train
7. Test
8. Evaluate performance
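
The scaling and training steps in this list can also be chained into a single estimator. The sketch below uses sklearn's make_pipeline as one way to do that. It is an alternative formulation, not the approach used in the notebook cells that follow, and it assumes the same split and model settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the data and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling and training chained in one estimator:
# fit() fits the scaler on the training data, then trains the model on the scaled result
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clf.fit(X_train, y_train)

# score() scales the test data with the training statistics and returns the accuracy
test_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With a pipeline, the scaler cannot accidentally be fitted on the test data, because scaling happens inside fit() and score().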

```python
# Import necessary libraries
import numpy as np  # For numerical computations
import pandas as pd  # For data manipulation
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For advanced data visualization

# Import datasets from sklearn
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.linear_model import LogisticRegression  # Example model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Model evaluation

# Load datasets
iris = load_iris()
wine = load_wine()
breast_cancer = load_breast_cancer()

# Convert datasets to pandas DataFrame for ease of use
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
cancer_df['target'] = breast_cancer.target

# Display first few rows of each dataset
print("Iris dataset:")
print(iris_df.head())

print("\nWine dataset:")
print(wine_df.head())

print("\nBreast Cancer dataset:")
print(cancer_df.head())
```

Iris dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Wine dataset:
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80
1    13.20        1.78  2.14               11.2      100.0           2.65
2    13.16        2.36  2.67               18.6      101.0           2.80
3    14.37        1.95  2.50               16.8      113.0           3.85
4    13.24        2.59  2.87               21.0      118.0           2.80

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04
1        2.76                  0.26             1.28             4.38  1.05
2        3.24                  0.30             2.81             5.68  1.03
3        3.49                  0.24             2.18             7.80  0.86
4        2.69                  0.39             1.82             4.32  1.04

   od280/od315_of_diluted_wines  proline  target
0                          3.92   1065.0       0
1                          3.40   1050.0       0
2                          3.17   1185.0       0
3                          3.45   1480.0       0
4                          2.93    735.0       0

Breast Cancer dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0                 0.07871  ...          17.33           184.60      2019.0
1                 0.05667  ...          23.41           158.80      1956.0
2                 0.05999  ...          25.53           152.50      1709.0
3                 0.09744  ...          26.50            98.87       567.7
4                 0.05883  ...          16.67           152.20      1575.0

```python
# Split data into train and test sets (example using the Iris dataset)
X_train, X_test, y_train, y_test = train_test_split(iris_df.iloc[:, :-1], iris_df['target'], test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build and train a Logistic Regression model
# regression = predicting a continuous value
# classification = assigning the data to known classes (here, the 3 Iris species); supervised (with labelled data)
# clustering = grouping the data into clusters; unsupervised (no labels)
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
```

LogisticRegression(max_iter=200)

```python
# Predict and evaluate model
y_pred = model.predict(X_test_scaled)
print("\nAccuracy score:", accuracy_score(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualize the distribution of target values in Iris dataset
sns.countplot(x=iris_df['target'])
plt.title('Distribution of target values in Iris dataset')
plt.show()
```

Accuracy score: 1.0

Confusion Matrix:
[[19 0 0]
[ 0 13 0]
[ 0 0 13]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
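
The support column shows that the 45 test samples are not evenly spread over the three classes, because a plain random split does not preserve class proportions. As an optional aside that is not part of the original notebook, the sketch below compares the plain split with one that passes stratify=y to train_test_split. The names plain_counts and strat_counts are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split, as used in the notebook
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified split: each class keeps its share of the data in both subsets
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

plain_counts = np.bincount(y_test_plain)
strat_counts = np.bincount(y_test_strat)
print("Plain split test counts:     ", plain_counts)
print("Stratified split test counts:", strat_counts)
```

Stratification matters more for small or imbalanced datasets, where a random split can leave a class under-represented in the test set.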

```python
print("Model coefficients:", model.coef_)
```

Model coefficients: [[-0.96229141  1.02709252 -1.74177531 -1.59749108]
 [ 0.48511907 -0.3432642  -0.30050696 -0.66873113]
 [ 0.47717234 -0.68382832  2.04228227  2.26622221]]

```python
# Visualize the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```
