KNN Model


Here's a detailed explanation of each part of the code and the methods used.

This code is a machine learning workflow using the K-Nearest Neighbors (KNN) algorithm for
classification, where we aim to find the optimal K value that gives the highest accuracy for
predicting outcomes. Let's go through each section of the code step by step.

1. Importing Libraries
python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

● numpy: Provides support for handling arrays and mathematical operations.
● pandas: Used for data manipulation and analysis, allowing us to load and preprocess
the dataset.
● train_test_split (from sklearn.model_selection): Splits the dataset into
training and testing sets.
● KNeighborsClassifier (from sklearn.neighbors): Implements the K-Nearest
Neighbors (KNN) classifier, which is used for classification tasks.
● accuracy_score, confusion_matrix, classification_report (from
sklearn.metrics): Provide metrics to evaluate the performance of the model.
● matplotlib.pyplot: Used for creating plots, helping visualize data and results.

2. Loading the Dataset


python
# Update the path as necessary
data = pd.read_csv(r"C:\Users\Admin\Desktop\Practical\ML\aaa.csv")
● This line loads the dataset from a CSV file into a Pandas DataFrame, making it easier to
manipulate and analyze. You need to replace the file path with the path where your
dataset is stored.

3. Handling Missing Values


python
cols_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
              'BMI', 'DiabetesPedigreeFunction']
# Mark zeros as missing first, so they do not pull the column means down
cleaned = data[cols_clean].replace(0, np.nan)
data[cols_clean] = cleaned.fillna(cleaned.mean())

● cols_clean: This list defines the columns where 0 values are treated as missing
data (for instance, a blood pressure of 0 is not realistic).
● Replacing values:
○ .replace(0, np.nan): Replaces 0 values with NaN (Not a Number) to mark
them as missing.
○ .fillna(cleaned.mean()): Fills missing values with the mean of the respective
column. The mean is computed on cleaned, after the zeros have become NaN,
so the placeholder zeros do not lower it (pandas skips NaN when averaging).
This is a common data-cleaning step to ensure the dataset has no missing
values before training the model.
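
To make the effect of this step concrete, here is a minimal sketch on a made-up three-row table. The values are hypothetical and only the column names are borrowed from the walkthrough; it is an illustration, not the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; zeros stand in for missing measurements.
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 140.0],
    "BMI": [30.0, 20.0, 0.0],
    "Outcome": [0, 1, 0],
})
toy_cols = ["Glucose", "BMI"]

# Step 1: mark zeros as missing.
toy_cleaned = toy[toy_cols].replace(0, np.nan)

# Step 2: fill each gap with that column's mean (NaN is skipped by mean()).
toy[toy_cols] = toy_cleaned.fillna(toy_cleaned.mean())

print(toy)
```

The missing Glucose entry becomes 120.0 (the mean of 100 and 140) and the missing BMI entry becomes 25.0. Outcome is left alone because it is not in the list, and a 0 there is a legitimate label.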

4. Splitting Features and Target Variables


python
X = data.drop('Outcome', axis=1) # Features
y = data['Outcome'] # Target variable

● X: Contains all the features in the dataset except the target variable Outcome.
● y: Stores the target variable Outcome, which we are trying to predict. Here, Outcome is
typically binary (e.g., 0 for "No" and 1 for "Yes" in diabetes prediction).

5. Splitting the Dataset into Training and Testing Sets


python
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25, random_state=42)

● train_test_split: Splits the dataset into training and testing sets.
○ test_size=0.25: Reserves 25% of the data for testing and uses 75% for
training.
○ random_state=42: Ensures reproducibility by providing a fixed seed for the
random generator.
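
The two parameters can be checked on stand-in data. The sketch below uses a hypothetical 100-row array (not the dataset from this walkthrough) to show the 75/25 split and that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 3 features, binary labels.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)  # 75 training rows, 25 test rows

# Repeating the call with the same seed yields the identical split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

If the classes are imbalanced, train_test_split also accepts stratify=y to keep class proportions similar in both parts; the original code does not use it.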

6. Finding the Optimal K Value


python
accuracies = []
max_k = min(15, X_train.shape[0])
for k in range(1, max_k):
    # Train and score one model per candidate K
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

● accuracies: An empty list that will store the accuracy for each K value.
● max_k: The upper bound for K, set to 15 or the number of training samples
(whichever is smaller). Because range excludes its end point, the largest K
actually tried is max_k - 1 (14 when there are at least 15 training samples).
● for k in range(1, max_k): Loops through K values from 1 to max_k - 1,
training and testing a KNN model for each value.
● KNeighborsClassifier(n_neighbors=k): Creates a KNN classifier with K
neighbors.
● .fit(X_train, y_train): Trains the model using the training data.
● .predict(X_test): Predicts the target variable on the test set.
● accuracy_score(y_test, y_pred): Calculates the accuracy of the model for the
current K value; the result is appended to the accuracies list.
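
The search loop can be tried without the CSV file. The following sketch swaps in synthetic data from make_classification (an assumption for illustration, not the diabetes dataset) and prints one accuracy per candidate K.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset (hypothetical sizes).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

max_k = min(15, X_tr.shape[0])
k_values = list(range(1, max_k))  # 1 .. max_k - 1

accuracies = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, knn.predict(X_te)))

for k, acc in zip(k_values, accuracies):
    print(f"K={k:2d}  accuracy={acc:.3f}")
```

Note that the accuracies here are measured on the same test set that is later used for the final evaluation, so the best score is likely to be somewhat optimistic.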

7. Selecting the Optimal K Value


python
optimal_k = np.argmax(accuracies) + 1
print(f'Optimal K: {optimal_k}, Accuracy: {max(accuracies):.2f}')

● np.argmax(accuracies): Finds the index of the highest accuracy in the
accuracies list.
● optimal_k: Adds 1 to the index because K values start from 1, not 0.
● Print statement: Displays the optimal K value and its corresponding accuracy.
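
A small worked example of the index-to-K conversion, using made-up accuracy values for K = 1 to 5:

```python
import numpy as np

# Hypothetical accuracies for K = 1, 2, 3, 4, 5 (index 0 holds K = 1).
accuracies_demo = [0.70, 0.74, 0.78, 0.78, 0.75]

best_index = int(np.argmax(accuracies_demo))  # index of the first maximum
optimal_k_demo = best_index + 1               # shift from 0-based index to K

print(best_index, optimal_k_demo)  # prints: 2 3
```

When two K values tie, as K = 3 and K = 4 do here, np.argmax returns the first one, so the smaller K is selected.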

8. Training and Evaluating the Model with the Optimal K


python
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k)
knn_optimal.fit(X_train, y_train)
y_pred_optimal = knn_optimal.predict(X_test)

● KNeighborsClassifier(n_neighbors=optimal_k): Initializes the KNN model
with the optimal K value.
● .fit(X_train, y_train): Trains the model on the training set.
● .predict(X_test): Makes predictions on the test set using the optimal model.

9. Printing Evaluation Metrics


python
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_optimal))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_optimal))

● confusion_matrix(y_test, y_pred_optimal): Outputs the confusion matrix,
which shows true positives, true negatives, false positives, and false negatives. It helps
evaluate the performance of the classifier.
● classification_report(y_test, y_pred_optimal): Provides detailed metrics,
including precision, recall, and F1-score for each class, offering a comprehensive view of
model performance.
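
To show how the confusion matrix relates to precision and recall, here is a sketch with eight made-up labels (not output from the model above). For binary labels 0 and 1, scikit-learn puts true classes in rows and predicted classes in columns, so ravel() yields the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a binary problem.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
# [[3 1]
#  [2 2]]

# Unpack the four cells for the binary case.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

These are the same quantities that classification_report lists for class 1.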

10. Plotting Accuracy vs. K Values

python
plt.plot(range(1, max_k), accuracies, marker='o', linestyle='-')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('K Value vs Accuracy')
plt.xticks(range(1, max_k))
plt.grid()
plt.show()

● plt.plot(range(1, max_k), accuracies, marker='o', linestyle='-'):
Plots K values on the x-axis and corresponding accuracy scores on the y-axis. Using
marker='o' adds a circle marker at each K value to visualize each point more
clearly.
● plt.xlabel, plt.ylabel, plt.title: Label the x-axis as "K Value," y-axis as
"Accuracy," and add a title to the plot.
● plt.xticks(range(1, max_k)): Ensures each K value is marked on the x-axis
for easier interpretation.
● plt.grid(): Adds a grid to the plot for better readability.
● plt.show(): Displays the plot.

Summary

This code loads a dataset, cleans it, and splits it into training and testing sets. It then uses the
K-Nearest Neighbors (KNN) algorithm to find the optimal K value that provides the highest
accuracy for classification. After determining the best K, it evaluates the model's performance
using a confusion matrix and classification report, then visualizes how accuracy varies with
different K values.

Tuning K matters because it controls the balance between bias and variance: a small K follows
the training data closely (low bias, high variance), while a large K averages over many
neighbors and smooths the predictions (higher bias, lower variance). For KNN, the choice of K
therefore has a significant impact on model accuracy.
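
The bias and variance point can be seen by comparing training and test accuracy at a few K values. This is a sketch on synthetic data from make_classification (an assumption, not the dataset used above), with K values chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical sizes).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

scores = {}
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))

for k, (train_acc, test_acc) in scores.items():
    print(f"K={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With K = 1 every training point is its own nearest neighbor, so training accuracy is perfect while test accuracy is usually lower; a gap like that is a sign of high variance.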
