Basic Data Prep and Pre-Processing (2)
Practical Session:
Machine Learning with Scikit-learn
Objective: This practical session aims to introduce the use of sklearn to load datasets, perform data preprocessing, train a
machine learning model, and evaluate the model's performance. We will be using the Iris dataset for training a Logistic Regression
model and visualizing results.
Requirements:
## 1st Part
1 of 12 04/11/2024, 2:13 pm
Week 3.1 Basic Data Prep and Pre-Processing.ipynb - Colab about:blank
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
2. Observe the dataset structure and column names to understand the feature set (sepal length, sepal width, petal length, petal
width) and the target variable (target column).
Note:
The iris dataset loaded using load_iris() from the sklearn.datasets module is in the format of a Bunch object, which is similar to a
Python dictionary but with additional attributes. This Bunch object contains the following key-value pairs:
• data: A 2D NumPy array with shape (150, 4). Each row represents one sample, and each column represents one of the four
features (sepal length, sepal width, petal length, petal width).
• target: A 1D NumPy array of length 150. Each element represents the target label (species) of the corresponding sample in the
data array.
• target_names: A NumPy array of the target labels (species names: 'setosa', 'versicolor', 'virginica').
• filename: The path to the location of the dataset on disk (if applicable).
• frame: A pandas DataFrame, populated only when the dataset is loaded with as_frame=True.
This structure allows easy access to both the data and the metadata associated with the dataset.
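As a small illustration of the note above, the sketch below loads the dataset and reads a few of the listed entries. The variable name `iris_frame` is an assumption made for this example; it is not part of the original notebook.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset as a Bunch (dictionary-like) object
iris = load_iris()

# Entries can be read like dictionary keys or like attributes
print(iris.keys())
print(iris.data.shape)        # feature matrix: (150, 4)
print(iris["target"].shape)   # label vector: (150,)
print(iris.feature_names)
print(list(iris.target_names))

# With as_frame=True, the `frame` entry holds a pandas DataFrame
# (four feature columns plus a `target` column)
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.shape)
```

Attribute access (`iris.data`) and key access (`iris["data"]`) return the same object, so either style can be used.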
## 2nd Part
Step 3: Data Preprocessing
1. Split the dataset into training and testing sets. This will allow us to train the model on a portion of the data and test its
performance on unseen data.
2. Standardize the features using StandardScaler to ensure that the model works with normalized data.
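The split in item 1 might look like the sketch below. The values `test_size=0.3` and `random_state=42` are the ones used in the worked cell later in this notebook; taking `X` and `y` straight from the Bunch object is an assumption made to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing; random_state fixes the shuffle
# so the same split is produced on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
```

Passing `stratify=y` as well would keep the class proportions the same in both sets, which can matter for small datasets.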
```python
# Standardize the features: fit the scaler on the training set only, then
# apply the same transformation to the test set to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
1. Train a Logistic Regression model on the Iris dataset. Logistic Regression is a classification algorithm that works well for
small datasets like Iris.
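A minimal sketch of the training step, assuming the same split and scaling as elsewhere in this notebook. `max_iter=200` matches the model shown in the notebook output; the remaining setup is repeated here only so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load features and labels, then split and scale as in the previous steps
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the classifier on the scaled training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Predict class labels for the unseen test samples
y_pred = model.predict(X_test_scaled)
print(y_pred[:10])
```

The resulting `y_pred` array is what the evaluation functions compare against `y_test`.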
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)
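To show how the three imported metrics relate to each other, here is a hedged toy example. The label lists `y_true_toy` and `y_pred_toy` are made up for illustration and are not results from the Iris model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: six samples, one class-1 sample misclassified as class 2
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Fraction of correct predictions (5 of 6 here)
accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are actual classes, columns are predicted classes
conf_matrix = confusion_matrix(y_true_toy, y_pred_toy)

# Per-class precision, recall, and F1 as a formatted string
class_report = classification_report(
    y_true_toy, y_pred_toy, target_names=["setosa", "versicolor", "virginica"]
)

print("Accuracy score:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)
```

In the notebook itself the same calls would take `y_test` and `y_pred` in place of the toy lists.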
print("\nClassification Report:\n", class_report)
1. Visualize the confusion matrix using a heatmap to get a better understanding of the model’s classification performance.
2. Plot the distribution of target values in the Iris dataset to see how the classes are distributed.
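For item 2, one possible sketch is a simple bar chart of the class counts. Using `np.bincount` and a plain matplotlib bar plot is an assumption made here; the original notebook may have used a different plotting call.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples carry each class label (0, 1, 2)
counts = np.bincount(iris.target)

# One bar per species
plt.figure(figsize=(6, 4))
plt.bar(list(iris.target_names), counts)
plt.title('Distribution of Target Classes')
plt.xlabel('Species')
plt.ylabel('Number of samples')
plt.show()
```

Iris is balanced, so all three bars have the same height. On an imbalanced dataset, a plot like this is a quick way to see whether accuracy alone could be misleading.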
1. Import statements
2. Ingest/Load data (optionally convert to a DataFrame)
3. Display
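These three steps can be sketched for the Iris dataset as follows. The name `iris_df` and the `target` column match the output shown later in this notebook, but the exact conversion code is an assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris

# 2. Ingest/load the data
iris = load_iris()

# Optional: convert to a DataFrame with one column per feature,
# then append the class labels as a `target` column
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# 3. Display the first five rows
print("Iris dataset:")
print(iris_df.head())
```

The same pattern should work for the other bundled datasets, such as those returned by `load_wine()` and `load_breast_cancer()`.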
Steps to perform ML
1. Import statements
2. Ingest/Load data
3. EDA (exploratory data analysis)
4. Split train and test
5. Scaling
6. Train
7. Test
8. Evaluate performance
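The eight steps can also be written compactly. The sketch below swaps in scikit-learn's `make_pipeline` helper, which this notebook does not otherwise use, to chain scaling and training, and applies it to the Wine dataset. Treat it as one possible arrangement rather than the notebook's own method.

```python
# 1. Import statements
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 2. Ingest/load data
X, y = load_wine(return_X_y=True)

# 3. EDA (here only a quick look at the dimensions)
print("Samples, features:", X.shape)

# 4. Split train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 5 and 6. Scaling and training, chained so that the scaler is fitted
# on the training data only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline.fit(X_train, y_train)

# 7. Test: predict on the held-out samples
y_pred = pipeline.predict(X_test)

# 8. Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score:", accuracy)
```

Because the pipeline applies the fitted scaler automatically inside `predict`, there is no separate `transform` call to forget on the test set.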
# Load datasets
iris = load_iris()
wine = load_wine()
breast_cancer = load_breast_cancer()
print("\nWine dataset:")
print(wine_df.head())
Iris dataset:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
target
0 0
1 0
2 0
3 0
4 0
Wine dataset:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
3 14.37 1.95 2.50 16.8 113.0 3.85
4 13.24 2.59 2.87 21.0 118.0 2.80
mean fractal dimension ... worst texture worst perimeter worst area \
0 0.07871 ... 17.33 184.60 2019.0
1 0.05667 ... 23.41 158.80 1956.0
2 0.05999 ... 25.53 152.50 1709.0
3 0.09744 ... 26.50 98.87 567.7
4 0.05883 ... 16.67 152.20 1575.0
# Split data into train and test sets (example using the Iris dataset)
X = iris_df.iloc[:, :-1]  # all feature columns
y = iris_df['target']     # class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
LogisticRegression(max_iter=200)
Confusion Matrix:
[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]
Classification Report:
              precision    recall  f1-score   support
    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
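# Plot the confusion matrix as an annotated heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')  # fmt='d' shows integer counts
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()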