KNN Model

This document explains each part of the code and the methods it uses. The code is a machine
learning workflow that uses the K-Nearest Neighbors (KNN) algorithm for classification; the goal
is to find the value of K that gives the highest accuracy when predicting outcomes on the test
set. The sections below go through the code step by step.
1. Importing Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
● numpy: Provides support for handling arrays and mathematical operations.

● pandas: Used for data manipulation and analysis, allowing us to load and preprocess
the dataset.
● train_test_split (from sklearn.model_selection): Splits the dataset into
training and testing sets.
● KNeighborsClassifier (from sklearn.neighbors): Implements the K-Nearest
Neighbors (KNN) classifier, which is used for classification tasks.
● accuracy_score, confusion_matrix, classification_report (from
sklearn.metrics): Provide metrics to evaluate the performance of the model.
● matplotlib.pyplot: Used for creating plots, helping visualize data and results.
2. Loading the Dataset

data = pd.read_csv(r"C:\Users\Admin\Desktop\Practical\ML\aaa.csv")  # Update the path as necessary
● This line loads the dataset from a CSV file into a Pandas DataFrame, making it easier to
manipulate and analyze. You need to replace the file path with the path where your
dataset is stored.
3. Handling Missing Values

cols_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction']
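# Mark zeros as missing first, so the column means below are computed
# without the placeholder zeros.
data[cols_clean] = data[cols_clean].replace(0, np.nan)
data[cols_clean] = data[cols_clean].fillna(data[cols_clean].mean())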
● cols_clean: This list defines the columns where 0 values are considered as missing
data (for instance, it’s unrealistic to have 0 for blood pressure).
● Replacing values:
○ .replace(0, np.nan): Replaces 0 values with NaN (Not a Number) to mark
them as missing.
○ .fillna(data[cols_clean].mean()): Fills missing values with the mean of
the respective column. This is a common data-cleaning step to ensure the
dataset has no missing values before training the model.
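The replace-then-fill idea can be sketched on a tiny table. The column values below are made up for illustration and are not taken from the dataset. The sketch takes the mean only after the zeros have been marked as NaN, so the placeholder zeros do not pull the average down.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with 0 standing in for a missing reading.
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 140.0],
    "BMI": [30.0, 20.0, 0.0],
})
cols = ["Glucose", "BMI"]

# Step 1: mark the zeros as missing.
marked = toy[cols].replace(0, np.nan)

# Step 2: fill each gap with its column mean. mean() skips NaN by default,
# so only the real readings contribute to the average.
filled = marked.fillna(marked.mean())

print(filled)
```

Here the missing Glucose reading becomes 120 (the mean of 100 and 140) and the missing BMI reading becomes 25 (the mean of 30 and 20).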
4. Splitting Features and Target Variables

X = data.drop('Outcome', axis=1) # Features
y = data['Outcome'] # Target variable
● X: Contains all the features in the dataset except the target variable Outcome.
● y: Stores the target variable Outcome, which we are trying to predict. Here, Outcome is
typically binary (e.g., 0 for "No" and 1 for "Yes" in diabetes prediction).
5. Splitting the Dataset into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=42)
● train_test_split: Splits the dataset into training and testing sets.

○ test_size=0.25: Reserves 25% of the data for testing and uses 75% for
training.
○ random_state=42: Ensures reproducibility by providing a fixed seed for the
random generator.
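To see what the split produces, here is a small sketch on synthetic data. The arrays X_demo and y_demo are invented for the example. The stratify argument is an optional addition that the code above does not use; it asks scikit-learn to keep the class proportions roughly equal in both parts, which can help when one class is rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# Same settings as in the walkthrough, plus the optional stratify argument.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # 75 training rows, 25 test rows
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```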
6. Finding the Optimal K Value

accuracies = []
max_k = min(15, X_train.shape[0])
for k in range(1, max_k):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
● accuracies: An empty list that will store the accuracy for each K value.
● max_k: Sets the upper bound for K to 15 or the number of training samples
(whichever is smaller).
● for k in range(1, max_k): Loops through K values from 1 to max_k - 1 (range excludes its
upper bound, so with max_k = 15 the loop tries K = 1 through 14), training and testing a KNN
model for each value.
● KNeighborsClassifier(n_neighbors=k): Creates a KNN classifier with K
neighbors.
● .fit(X_train, y_train): Trains the model using the training data.
● .predict(X_test): Predicts the target variable on the test set.
● accuracy_score(y_test, y_pred): Calculates the accuracy of the model for the
current K value; the result is appended to the accuracies list.
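To make the role of K concrete, the following is a minimal from-scratch sketch of the idea behind KNN classification: measure the distance from a new point to every training point, take the K closest, and let them vote. The function name knn_predict and the toy points are hypothetical. This is not how scikit-learn implements KNeighborsClassifier internally, and details such as tie-breaking may differ.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy 2-D points (hypothetical): class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                  [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.2, 0.3]), k=3))  # → 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # → 1
```

Because the vote depends on raw distances, features measured on larger scales have more influence on which neighbors count as "nearest".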
7. Selecting the Optimal K Value

optimal_k = np.argmax(accuracies) + 1
print(f'Optimal K: {optimal_k}, Accuracy: {max(accuracies):.2f}')
● np.argmax(accuracies): Finds the index of the highest accuracy in the accuracies list. If
several K values tie for the highest accuracy, it returns the first one, i.e. the smallest K.
● optimal_k: Adds 1 to the index because K values start from 1, not 0.
● Print statement: Displays the optimal K value and its corresponding accuracy.
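The index-to-K shift is easy to check on a short list. The accuracy values below are invented for illustration; two of them tie so that the behaviour of np.argmax on ties is visible.

```python
import numpy as np

# Hypothetical accuracies for K = 1..5; K = 2 and K = 4 tie for the best score.
demo_accuracies = [0.70, 0.75, 0.72, 0.75, 0.71]

best_index = int(np.argmax(demo_accuracies))  # first index of the maximum
best_k = best_index + 1                       # shift because K starts at 1

print(best_k)  # → 2
```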
8. Training and Evaluating the Model with the Optimal K

knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k)
knn_optimal.fit(X_train, y_train)
y_pred_optimal = knn_optimal.predict(X_test)
● KNeighborsClassifier(n_neighbors=optimal_k): Initializes the KNN model with the optimal K
value.
● .fit(X_train, y_train): Trains the model on the training set.
● .predict(X_test): Makes predictions on the test set using the optimal model.
9. Printing Evaluation Metrics

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_optimal))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_optimal))
● confusion_matrix(y_test, y_pred_optimal): Outputs the confusion matrix, which shows true
positives, true negatives, false positives, and false negatives. It helps evaluate the
performance of the classifier.
● classification_report(y_test, y_pred_optimal): Provides detailed metrics,
including precision, recall, and F1-score for each class, offering a comprehensive view of
model performance.
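The link between the confusion matrix and the metrics in the classification report can be sketched on a handful of labels. The arrays y_true and y_hat are hypothetical and unrelated to the dataset. For binary labels 0 and 1, scikit-learn arranges the matrix as [[TN, FP], [FN, TP]], and precision and recall for the positive class follow directly from those four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for a binary problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(tn, fp, fn, tp)       # → 4 1 1 4
print(precision, recall)    # → 0.8 0.8

# The hand-computed values agree with scikit-learn's own functions.
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```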
10. Plotting Accuracy vs. K Values
plt.plot(range(1, max_k), accuracies, marker='o', linestyle='-')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('K Value vs Accuracy')
plt.xticks(range(1, max_k))
plt.grid()
plt.show()
● plt.plot(range(1, max_k), accuracies, marker='o', linestyle='-'): Plots K values on the
x-axis and the corresponding accuracy scores on the y-axis. Using marker='o' adds a circle
marker at each K value to visualize each point more clearly.
● plt.xlabel, plt.ylabel, plt.title: Label the x-axis as "K Value," y-axis as
"Accuracy," and add a title to the plot.
● plt.xticks(range(1, max_k)): Ensures each K value is marked on the x-axis
for easier interpretation.
● plt.grid(): Adds a grid to the plot for better readability.
● plt.show(): Displays the plot.
Summary
This code loads a dataset, cleans it, and splits it into training and testing sets. It then uses the
K-Nearest Neighbors (KNN) algorithm to find the K value that provides the highest
accuracy for classification. After determining the best K, it evaluates the model's performance
using a confusion matrix and classification report, then visualizes how accuracy varies with
different K values.
Tuning K matters because it controls the balance between bias and variance: a small K gives a
flexible model that is sensitive to noise in individual points, while a large K gives a smoother
model that can miss local structure. For KNN, the choice of K therefore has a significant impact
on model accuracy.
