Basic Data Prep and Pre-Processing (2)

This document outlines a practical session for machine learning using Scikit-learn, focusing on data preparation, preprocessing, model training, and evaluation with the Iris dataset. It includes steps for importing libraries, loading data, preprocessing, training a Logistic Regression model, and visualizing results. Additionally, it suggests testing with other datasets such as Wine and Breast Cancer.

Uploaded by

reshika2080-0174

Week 3.1 Basic Data Prep and Pre-Processing.ipynb - Colab (04/11/2024, 2:13 pm)

Practical Session: Machine Learning with Scikit-learn

Objective: This practical session aims to introduce the use of sklearn to load datasets, perform data preprocessing, train a
machine learning model, and evaluate the model's performance. We will be using the Iris dataset for training a Logistic Regression
model and visualizing results.

There are 2 parts in this practical.

Requirements:

• Python (version 3.x)
• Libraries: scikit-learn (imported as sklearn), pandas, numpy, matplotlib, seaborn
• IDE: Jupyter Notebook, Google Colab, or any Python IDE

## 1st Part

Step 1: Import Required Libraries

1. Open your Python environment (Jupyter Notebook, Google Colab, etc.).


2. Start by importing the necessary libraries for data manipulation, visualization, and model building.

```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning utilities from sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

Step 2: Load and Explore the Iris Dataset


1. Load the Iris dataset from sklearn. This dataset is commonly used for classification problems and consists of 150 samples from three Iris species (setosa, versicolor, and virginica).
```python
# Load Iris dataset
iris = load_iris()

# Convert to a pandas DataFrame
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Display the first few rows of the dataset
iris_df.head()
```

2. Observe the dataset structure and column names to understand the feature set (sepal length, sepal width, petal length, petal
width) and the target variable (target column).

Stop here. Run the cells above before continuing.

Note:

The iris dataset loaded using load_iris() from the sklearn.datasets module is in the format of a Bunch object, which is similar to a
Python dictionary but with additional attributes. This Bunch object contains the following key-value pairs:

• data: A 2D NumPy array with shape (150, 4). Each row represents one sample, and each column represents one of the four
features (sepal length, sepal width, petal length, petal width).

• target: A 1D NumPy array of length 150. Each element represents the target label (species) of the corresponding sample in the
data array.

• feature_names: A list containing the names of the four features.

• target_names: A list of the target labels (species names: 'setosa', 'versicolor', 'virginica').

• DESCR: A string containing a full description of the dataset.

• filename: The path to the location of the dataset on disk (if applicable).

• frame: A pandas DataFrame, only present if the dataset was loaded with as_frame=True.

This structure allows easy access to both the data and the metadata associated with the dataset.
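
As a quick check of the note above, the following sketch inspects the Bunch returned by load_iris(). It relies only on the keys listed in the note; the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dictionary-style and attribute-style access
print(sorted(iris.keys()))
print(iris.data.shape)     # (150, 4)
print(iris.target.shape)   # (150,)
print(iris.feature_names)
print([str(name) for name in iris.target_names])
print(iris["data"] is iris.data)   # both spellings return the same array

# With as_frame=True, the 'frame' entry holds a DataFrame (four features plus target)
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.shape)    # (150, 5)
```

Attribute access (iris.data) is usually the more readable form in notebooks, while key access (iris["data"]) is handy when the key name is held in a variable.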

## 2nd Part

Step 3: Data Preprocessing

1. Split the dataset into training and testing sets. This will allow us to train the model on a portion of the data and test its
performance on unseen data.

```python
# Split the data into train and test sets
# (test_size=0.3 and random_state=42 match the worked solution further down)
X_train, X_test, y_train, y_test = train_test_split(
    iris_df.iloc[:, :-1], iris_df['target'], test_size=0.3, random_state=42
)

# Check the shape of the training and testing data
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
```
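
An optional variation, not used in this practical: train_test_split also accepts a stratify argument that keeps the class proportions the same in the training and testing sets. The sketch below assumes test_size=0.3 and random_state=42 and works on the raw arrays rather than on iris_df.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training data shape:", X_train.shape)       # (105, 4)
print("Testing data shape:", X_test.shape)         # (45, 4)
print("Test class counts:", np.bincount(y_test))   # [15 15 15]
```

Without stratify, the per-class counts in the test set depend on the random shuffle, which is why an unstratified split of the same data can end up with uneven class counts.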

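2. Standardize the features using StandardScaler so that each feature has zero mean and unit variance. The scaler is fitted on the training set only and then applied unchanged to the test set, so that no information from the test data leaks into training.
```python
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```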
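
To make the scaling step concrete, here is a minimal sketch that reproduces StandardScaler by hand with z = (x - mean) / std, using statistics computed from the training set only. The split parameters (test_size=0.3, random_state=42) are assumed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean/std from the training data
X_test_scaled = scaler.transform(X_test)         # reuses the training mean/std

# Manual equivalent, with the statistics taken from the training set
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, X_test_scaled))              # True
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_train_scaled.std(axis=0), 1.0))    # True
```

Note that the scaled test set will generally not have exactly zero mean and unit variance, because it is transformed with the training statistics.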

Step 4: Train a Logistic Regression Model

1. Train a Logistic Regression model on the Iris dataset. Logistic Regression is a classification algorithm that works well for small datasets like Iris.

```python
# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Display the model coefficients
print("Model coefficients:", model.coef_)
```
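
For a three-class problem such as Iris, coef_ holds one row per class and one column per feature. The sketch below (split parameters assumed to be test_size=0.3, random_state=42) checks the shapes and shows that predict() returns the class whose linear score is highest.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=200).fit(X_train_scaled, y_train)

print(model.coef_.shape)        # (3, 4): one row per class, one column per feature
print(model.intercept_.shape)   # (3,)

# Linear scores: X @ coef_.T + intercept_; predict() picks the class with the top score
scores = X_test_scaled @ model.coef_.T + model.intercept_
manual_pred = model.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, model.predict(X_test_scaled)))   # True

# predict_proba returns class probabilities that sum to 1 for each sample
proba = model.predict_proba(X_test_scaled)
print(np.allclose(proba.sum(axis=1), 1.0))   # True
```

Because the features were standardized, the magnitudes of the coefficients within a row are roughly comparable across features.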

Step 5: Evaluate the Model


1. Predict the target values for the testing set using the trained model.
2. Evaluate the model's performance using various metrics like accuracy, confusion matrix, and classification report.
```python
# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)

# Display confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# Classification report for precision, recall, and F1-score
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:\n", class_report)
```

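The metrics in the classification report can be derived directly from the confusion matrix. The sketch below uses small hypothetical label arrays (not the Iris results), chosen only so that some predictions are wrong and the numbers are not all 1.0.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)   # rows = actual class, columns = predicted class
print(cm)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)      # correct / all actually in that class
accuracy = true_positives.sum() / cm.sum()    # correct / all samples

print("precision:", precision)
print("recall:", recall)
print("accuracy:", accuracy)
```

The per-class values computed this way agree with precision_score and recall_score called with average=None.
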
Step 6: Visualizing the Results

1. Visualize the confusion matrix using a heatmap to get a better understanding of the model’s classification performance.

```python
# Visualize the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```

2. Plot the distribution of target values in the Iris dataset to see how the classes are distributed.

```python
# Visualize the distribution of target values
sns.countplot(x=iris_df['target'])
plt.title('Distribution of Target Values in Iris Dataset')
plt.show()
```
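
The same distribution can also be read off numerically. This is a minimal sketch that counts the samples per class and pairs each count with its species name; it reloads the dataset so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count samples per class and pair each count with its species name
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count}")
```

Each of the three classes has 50 samples, so the bars in the count plot are of equal height.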

Step 7: Test with Another Dataset


1. Repeat the steps above using a different dataset such as the Wine dataset or the Breast Cancer dataset.

```python
# load_wine is not among the Step 1 imports, so import it here
from sklearn.datasets import load_wine

# Load the Wine dataset
wine = load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

# Display first few rows of the dataset
wine_df.head()
```
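
One way to repeat the workflow without copying every cell is to wrap it in a function. The sketch below is an illustration, not part of the original practical: evaluate_dataset is a hypothetical helper, and the split parameters (test_size=0.3, random_state=42) are assumed.

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_dataset(loader, test_size=0.3, random_state=42):
    """Split, scale, train, and score one sklearn toy dataset (hypothetical helper)."""
    X, y = loader(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
    X_test_scaled = scaler.transform(X_test)

    model = LogisticRegression(max_iter=200)
    model.fit(X_train_scaled, y_train)
    return accuracy_score(y_test, model.predict(X_test_scaled))


for name, loader in [("Wine", load_wine), ("Breast Cancer", load_breast_cancer)]:
    print(f"{name} accuracy: {evaluate_dataset(loader):.3f}")
```

Passing the loader function itself (load_wine, not load_wine()) keeps the helper independent of any particular dataset.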

Steps to perform basic data prep:

1. Import statements
2. Ingest/load data (optionally convert to a DataFrame)
3. Display

Steps to perform ML:

1. Import statements
2. Ingest/load data
3. EDA (exploratory data analysis)
4. Split into train and test sets
5. Scaling
6. Train
7. Test
8. Evaluate performance
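
The ML steps listed above can be sketched compactly. As an assumption beyond the original notebook, this version chains scaling and training with sklearn's make_pipeline, so the scaler is fitted only on the data passed to fit(); the EDA step is omitted for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ingest/load data
X, y = load_iris(return_X_y=True)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling and training, chained in a single estimator
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_train, y_train)

# Test and evaluate: score() predicts on the test set and returns the accuracy
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The explicit fit_transform/transform calls used in the rest of this practical do the same work; the pipeline simply makes it harder to scale the test set with the wrong statistics by accident.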

```python
# Import necessary libraries
import numpy as np  # For numerical computations
import pandas as pd  # For data manipulation
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For advanced data visualization

# Import datasets from sklearn
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.linear_model import LogisticRegression  # Example model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Model evaluation

# Load datasets
iris = load_iris()
wine = load_wine()
breast_cancer = load_breast_cancer()

# Convert datasets to pandas DataFrame for ease of use
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

cancer_df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
cancer_df['target'] = breast_cancer.target

# Display first few rows of each dataset
print("Iris dataset:")
print(iris_df.head())

print("\nWine dataset:")
print(wine_df.head())

print("\nBreast Cancer dataset:")
print(cancer_df.head())
```

```
Iris dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Wine dataset:
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80
1    13.20        1.78  2.14               11.2      100.0           2.65
2    13.16        2.36  2.67               18.6      101.0           2.80
3    14.37        1.95  2.50               16.8      113.0           3.85
4    13.24        2.59  2.87               21.0      118.0           2.80

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04
1        2.76                  0.26             1.28             4.38  1.05
2        3.24                  0.30             2.81             5.68  1.03
3        3.49                  0.24             2.18             7.80  0.86
4        2.69                  0.39             1.82             4.32  1.04

   od280/od315_of_diluted_wines  proline  target
0                          3.92   1065.0       0
1                          3.40   1050.0       0
2                          3.17   1185.0       0
3                          3.45   1480.0       0
4                          2.93    735.0       0

Breast Cancer dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0                 0.07871  ...          17.33           184.60      2019.0
1                 0.05667  ...          23.41           158.80      1956.0
2                 0.05999  ...          25.53           152.50      1709.0
3                 0.09744  ...          26.50            98.87       567.7
4                 0.05883  ...          16.67           152.20      1575.0
```

```python
# Split data into train and test sets (example using the Iris dataset)
X_train, X_test, y_train, y_test = train_test_split(
    iris_df.iloc[:, :-1], iris_df['target'], test_size=0.3, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build and train a Logistic Regression model
# regression     = predicting a continuous value
# classification = assigning data to known classes (3 classes for Iris) - supervised (with labelled data)
# clustering     = grouping unlabelled data into clusters - unsupervised
# Despite its name, Logistic Regression is a classification model.
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
```

LogisticRegression(max_iter=200)

```python
# Predict and evaluate model
y_pred = model.predict(X_test_scaled)
print("\nAccuracy score:", accuracy_score(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualize the distribution of target values in Iris dataset
sns.countplot(x=iris_df['target'])
plt.title('Distribution of target values in Iris dataset')
plt.show()
```

```
Accuracy score: 1.0

Confusion Matrix:
 [[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
```
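
A perfect score on 45 test samples reflects one particular split. As an optional follow-up that is not part of the original notebook, k-fold cross-validation gives an estimate that depends less on a single split. The sketch below assumes 5 folds and uses a pipeline so that the scaler is refitted inside each fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline refits the scaler on each training fold, so no held-out data leaks into scaling
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# cv=5 uses stratified 5-fold splitting when the estimator is a classifier
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```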

```python
print("Model coefficients:", model.coef_)
```

```
Model coefficients: [[-0.96229141  1.02709252 -1.74177531 -1.59749108]
 [ 0.48511907 -0.3432642  -0.30050696 -0.66873113]
 [ 0.47717234 -0.68382832  2.04228227  2.26622221]]
```

```python
# Visualize the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
```
