KNN Model
KNN Model
This code is a
machine learning workflow using the K-Nearest Neighbors (KNN) algorithm for classification,
where we aim to find the optimal KKK value that gives the highest accuracy for predicting
outcomes. Let’s go through each section of the code step-by-step.
1. Importing Libraries
python
Copy code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix,
classification_report
import matplotlib.pyplot as plt
● cols_clean: This list defines the columns where 0 values are considered as missing
data (for instance, it’s unrealistic to have 0 for blood pressure).
● Replacing values:
○ .replace(0, np.nan): Replaces 0 values with NaN (Not a Number) to mark
them as missing.
○ .fillna(data[cols_clean].mean()): Fills missing values with the mean of
the respective column. This is a common data-cleaning step to ensure the
dataset has no missing values before training the model.
● X: Contains all the features in the dataset except the target variable Outcome.
● y: Stores the target variable Outcome, which we are trying to predict. Here, Outcome is
typically binary (e.g., 0 for "No" and 1 for "Yes" in diabetes prediction).
● accuracies: An empty list that will store the accuracy for each KKK value.
● max_k: Sets the maximum value for KKK to 15 or the number of training samples
(whichever is smaller).
● for k in range(1, max_k): Loops through different values of KKK (from 1 to
max_k), training and testing a KNN model for each value.
● KNeighborsClassifier(n_neighbors=k): Creates a KNN classifier with KKK
neighbors.
● .fit(X_train, y_train): Trains the model using the training data.
● .predict(X_test): Predicts the target variable on the test set.
● accuracy_score(y_test, y_pred): Calculates the accuracy of the model for the
current KKK value and appends it to the accuracies list.
Summary
This code loads a dataset, cleans it, and splits it into training and testing sets. It then uses the
K-Nearest Neighbors (KNN) algorithm to find the optimal KKK value that provides the highest
accuracy for classification. After determining the best KKK, it evaluates the model's performance
using a confusion matrix and classification report, then visualizes how accuracy varies with
different KKK values.
This process is essential in machine learning to ensure a well-tuned model that balances bias
and variance, particularly when using algorithms like KNN where KKK selection has a significant
impact on model accuracy.