DEPARTMENT: Computer Science
Session: Spring-2024    Course Instructor: Shoukat Ali
Subject: Machine Learning    Course Code: __________    Max. Marks: 5
Class/Sec.: 8-C    Submission Date: 06/15/24    Time Duration: () From: ____ to ______
Student Name: Muhammad Zohaib    ID: CSC-20F-132
Assignment 02
Apply the following machine learning classifiers/algorithms to the PIMA Indians Diabetes database to predict
whether patients in the dataset have diabetes or not.
Moreover, perform a comparative study of the mentioned algorithms.
1. Logistic regression
2. Decision tree
3. Random forest
4. Naive Bayes
5. KNN
6. SVM
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
# Load the dataset (PIMA Indians Diabetes CSV; adjust the path/filename as needed)
data = pd.read_csv('diabetes.csv')
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Define pipelines for each model (feature scaling followed by the classifier)
pipelines = {
    'Logistic Regression': Pipeline([('scaler', RobustScaler()),
                                     ('logreg', LogisticRegression(max_iter=1000, solver='liblinear'))]),
    'Decision Tree': Pipeline([('scaler', StandardScaler()),
                               ('tree', DecisionTreeClassifier())]),
    'Random Forest': Pipeline([('scaler', StandardScaler()),
                               ('forest', RandomForestClassifier())]),
    'Naive Bayes': Pipeline([('scaler', StandardScaler()),
                             ('nb', GaussianNB())]),
    'KNN': Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier())]),
    'SVM': Pipeline([('scaler', StandardScaler()),
                     ('svm', SVC())])
}
# Train and evaluate each model
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f'\n{name}:')
    print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')  # accuracy to 4 decimal places
    print('Classification Report:\n', classification_report(y_test, y_pred))
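As a complement to the single train/test split above, k-fold cross-validation gives a more stable accuracy estimate. The sketch below is illustrative rather than part of the required solution: it uses synthetic data from `make_classification` (sized like the PIMA dataset, 768 samples and 8 features) in place of the CSV so it runs standalone, and it shows the pattern for two of the six models; the rest follow identically.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the PIMA Indians Diabetes data
X, y = make_classification(n_samples=768, n_features=8,
                           n_informative=5, random_state=42)

models = {
    'Logistic Regression': Pipeline([('scaler', StandardScaler()),
                                     ('clf', LogisticRegression(max_iter=1000))]),
    'Random Forest': Pipeline([('scaler', StandardScaler()),
                               ('clf', RandomForestClassifier(random_state=42))]),
}

for name, model in models.items():
    # 5-fold cross-validation: each fold serves once as the test set
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})')
```

Because the scaler sits inside the pipeline, it is re-fitted on each training fold, so no test-fold information leaks into the scaling step.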
Comparative Study:

Logistic Regression
- Winning qualities: simple, interpretable, handles linearly separable data, efficient with large datasets, benefits from feature scaling/outlier handling.
- Areas for improvement: assumes linear relationships, less accurate with complex decision boundaries.
- Performance on Pima Indians: 75-80%

Decision Tree
- Winning qualities: interpretable, visualizes decision rules, handles non-linear relationships, useful for feature selection.
- Areas for improvement: prone to overfitting, sensitive to small data changes, may not generalize well.
- Performance on Pima Indians: 70-75%

Random Forest
- Winning qualities: powerful, accurate, less prone to overfitting, handles non-linear relationships, works well with standardized features.
- Areas for improvement: less interpretable, computationally demanding.
- Performance on Pima Indians: 75-82%

Naive Bayes
- Winning qualities: simple, fast, handles high-dimensional data, good for categorical features.
- Areas for improvement: assumes feature independence, sensitive to data distribution.
- Performance on Pima Indians: 70-75%

K-Nearest Neighbors
- Winning qualities: non-parametric, simple, can learn complex decision boundaries, works well with standardized features.
- Areas for improvement: computationally expensive for large datasets, requires careful tuning of k, sensitive to irrelevant features.
- Performance on Pima Indians: 72-78%

Support Vector Machine
- Winning qualities: effective in high-dimensional spaces, flexible kernel choice, works well with standardized features.
- Areas for improvement: sensitive to hyperparameters, less interpretable, computationally demanding with large datasets.
- Performance on Pima Indians: 75-82%
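The rankings above can be reproduced in code by collecting each model's test accuracy in a dictionary and sorting it. This is a hypothetical sketch, again using synthetic data instead of the PIMA CSV so it runs standalone, and showing two of the six models as representatives.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PIMA Indians Diabetes data
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

results = {}
for name, clf in {'Naive Bayes': GaussianNB(), 'SVM': SVC()}.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])
    pipe.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipe.predict(X_test))

# Rank models from best to worst test accuracy
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.4f}')
```

Keeping all scores in one dictionary makes it easy to extend the comparison to every classifier in the assignment and to add further metrics (precision, recall, F1) alongside accuracy.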
_____________________________________________________________________________________
BEST OF LUCK