ML Lab File
EXPERIMENT OBJECTIVE
To study the NumPy, Pandas, and Matplotlib libraries in Python.
DESCRIPTION
This experiment introduces three Python libraries essential for effective programming and
data manipulation:
1. NumPy: A fundamental package for numerical computing in Python. It provides support
for arrays, matrices, and a collection of mathematical functions to operate on these data
structures.
2. Pandas: A powerful data manipulation and analysis library built on top of NumPy. It
offers data structures like Series and DataFrames, allowing for easy handling of structured
data.
3. Matplotlib: A plotting library for creating static, interactive, and animated visualizations
in Python. It is commonly used for data visualization and provides a wide variety of plots
and charts.
These libraries are fundamental in data science for tasks such as array operations, data
preprocessing, and graphical representation of data trends.
ALGORITHM
1. Install and import libraries: Ensure NumPy, Pandas, and Matplotlib are installed using
pip.
2. NumPy Operations:
Create arrays and perform arithmetic operations.
Utilize functions like mean(), reshape(), and slicing.
3. Pandas Operations:
Create DataFrames from dictionaries or CSV files.
Perform data filtering, aggregation, and column manipulations.
4. Matplotlib Visualizations:
Generate plots like line graphs, histograms, and scatter plots.
5. Integrate all three: Use NumPy arrays or Pandas DataFrames to generate visualizations.
PROGRAM CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print("NumPy Implementation:")
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.arange(1, 11)
print("NumPy Array 1:", arr1)
print("NumPy Array 2:", arr2)
mean_value = np.mean(arr1)
std_dev = np.std(arr1)
print("Mean of arr1:", mean_value)
print("Standard deviation of arr1:", std_dev)
reshaped_arr = arr2.reshape(2, 5)
print("Reshaped Array:\n", reshaped_arr)
print("\nPandas Implementation:")
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [24, 27, 22, 32, 29],
'Salary': [50000, 54000, 58000, 60000, 62000]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:\n", df)
salary_mean = df['Salary'].mean()
print("\nMean Salary:", salary_mean)
print("\nMatplotlib Implementation:")
plt.bar(df['Name'], df['Salary'], color='blue')
plt.title('Matplotlib Implementation (Salaries of Employees)')
plt.xlabel('Employee Names')
plt.ylabel('Salaries')
plt.grid(axis='y')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The output shows the basic usage of NumPy for creating arrays, Pandas for creating
DataFrames, and Matplotlib for visualizing data.
2. This experiment highlights how these libraries interact and can be used together for data
analysis and visualization.
EXPERIMENT OBJECTIVE
To perform Data Pre-Processing and Data Summarization on Iris Dataset.
DESCRIPTION
The Iris Dataset is a popular dataset used for classification tasks. It contains 150 records of iris
flowers with four features (sepal length, sepal width, petal length, and petal width) and a target
class (species).
1. Data Pre-Processing: This is the process of cleaning and transforming raw data into a
usable format. It involves handling missing values, normalizing data (see the short sketch
after this list), and converting categorical data into numerical formats.
2. Data Summarization: This involves describing the main features of the dataset, often
using statistical measures such as mean, median, mode, standard deviation, and
visualization techniques to understand data distribution.
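For instance, standardization (one common normalization) rescales a feature to zero mean and unit variance. A minimal sketch with hypothetical sepal lengths:
import numpy as np
x = np.array([4.9, 5.1, 5.8, 6.3])   # hypothetical sepal lengths (cm)
z = (x - x.mean()) / x.std()         # z-scores: zero mean, unit variance
print(z.round(2))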
ALGORITHM
1. Import the required libraries (NumPy, Pandas, and Matplotlib).
2. Load the Iris dataset into a Pandas DataFrame.
3. Check for missing values and handle them (if any).
4. Normalize the numeric data to bring it to a similar scale.
5. Encode the categorical target variable (Species).
6. Summarize the dataset using statistical metrics (mean, median, etc.).
7. Visualize the distributions of features to identify trends.
PROGRAM CODE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import seaborn as sns
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target # Keep it as numeric for analysis
print("Missing Values:\n", data.isnull().sum())
print("\nStatistical Summary:\n", data.describe())
scaler = StandardScaler()
data.iloc[:, :-1] = scaler.fit_transform(data.iloc[:, :-1])
print("\nNormalized Data:\n", data.head())
sns.countplot(x='species', data=data)
plt.title("Species Distribution")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()
g = sns.pairplot(data, hue='species')
g.fig.suptitle("Pairplot of Iris Dataset", y=1.02)
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Iris dataset contains no missing values, which simplifies the pre-processing steps.
2. The statistical summary provides insights into the distribution of the different features
(Sepal and Petal dimensions) across the species.
3. The pairplot visually illustrates the relationships between different features, showing how
species are distributed based on their measurements.
EXPERIMENT OBJECTIVE
To perform Data Preprocessing and Data Visualization on Iris Dataset.
DESCRIPTION
This experiment involves two tasks:
1. Data Preprocessing: The process of preparing raw data for analysis. This includes
cleaning the data, handling missing values, encoding categorical variables (a small sketch
follows this list), and normalizing numerical values.
2. Data Visualization: The representation of data in graphical formats to identify trends,
patterns, and outliers. Common techniques include scatter plots, histograms, and box
plots.
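For example, scikit-learn's LabelEncoder converts category names to integer codes. A minimal sketch with the Iris species names:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ['setosa', 'versicolor', 'virginica', 'setosa']
print(le.fit_transform(labels))   # [0 1 2 0] -- classes sorted alphabetically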
ALGORITHM
1. Import necessary libraries (NumPy, Pandas, Matplotlib, Seaborn).
2. Load the Iris dataset into a DataFrame.
3. Check and handle missing values (if any).
4. Normalize the numeric data using StandardScaler.
5. Encode the categorical target variable (species).
6. Visualize individual feature distributions using histograms.
7. Plot relationships between features using scatter plots and pair plots.
8. Display class distributions using bar charts or count plots.
PROGRAM CODE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target
print("Missing Values:\n", data.isnull().sum())
scaler = StandardScaler()
data.iloc[:, :-1] = scaler.fit_transform(data.iloc[:, :-1])
encoder = LabelEncoder()
data['species'] = encoder.fit_transform(data['species'])
# Histograms of each feature (algorithm step 6)
data.iloc[:, :-1].hist(figsize=(8, 6), bins=15)
plt.suptitle("Feature Distributions")
plt.show()
# Boxplot of sepal length per species, as referenced in the discussion
plt.figure(figsize=(6, 4))
sns.boxplot(x='species', y='sepal length (cm)', data=data)
plt.title("Sepal Length by Species")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data['sepal length (cm)'], y=data['sepal width (cm)'], hue=data['species'],
palette='bright')
plt.title("Sepal Length vs Sepal Width")
plt.show()
# Pair plot of all feature combinations (algorithm step 7)
sns.pairplot(data, hue='species')
plt.show()
plt.figure(figsize=(6, 4))
sns.countplot(x='species', data=data, hue='species', palette='bright', legend=False)
plt.title("Class Distribution - Iris Dataset")
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The histograms reveal the distribution of each feature, offering insights into possible
skewness or outliers.
2. The boxplot summarizes the range and distribution of Sepal Length for each species,
indicating any outliers.
3. The scatter plot highlights the relationship between Sepal Length and Sepal Width,
revealing patterns in feature clustering for different species.
4. The pair plot visualizes relationships between all features, helping to identify correlations
or trends.
EXPERIMENT OBJECTIVE
To implement K Means Clustering.
DESCRIPTION
K-Means Clustering is an unsupervised learning algorithm primarily used to partition datasets
into a predefined number of clusters (K) based on feature similarity. Each cluster is represented
by its centroid, and data points are assigned to clusters based on proximity to these centroids.
The process begins by randomly initializing K centroids and iteratively adjusts these centroids
to minimize intra-cluster variance. During each iteration, the algorithm calculates the distance
from each data point to each centroid, assigns the point to the closest cluster, and updates the
centroids by averaging the points within each cluster. The algorithm repeats this process until
the centroids stabilize, indicating that the clusters are optimally partitioned. This technique is
widely used for data segmentation, customer profiling, and exploratory data analysis.
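A minimal NumPy sketch of one assignment-and-update cycle (kmeans_step and the toy points are illustrative; the lab's program below uses scikit-learn):
import numpy as np

def kmeans_step(X, centroids):
    # Assign each point to the nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = X[[0, 2]]             # naive initialization: two of the points
for _ in range(5):                # repeat until centroids stabilize
    labels, centroids = kmeans_step(X, centroids)
print(labels, centroids, sep="\n")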
ALGORITHM
1. Import necessary libraries (Pandas, Matplotlib, Scikit-learn).
2. Load the dataset into a DataFrame.
3. Preprocess the data: Normalize feature values using StandardScaler.
4. Initialize the K-Means algorithm with a predefined number of clusters (K=3).
5. Fit the K-Means model to the data.
6. Assign each data point to the nearest cluster.
7. Visualize the clusters using scatter plots.
8. Evaluate the model by comparing clusters with actual labels (optional).
PROGRAM CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix
iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
true_labels = iris.target
X = iris_data[iris.feature_names]
# Fit K-Means with K = 3 (one cluster per species)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
iris_data['Cluster'] = kmeans.fit_predict(X)
# Map each cluster to its majority true label so accuracy is meaningful
mapping = {c: np.bincount(true_labels[iris_data['Cluster'] == c]).argmax() for c in range(3)}
mapped_labels = iris_data['Cluster'].map(mapping)
accuracy = accuracy_score(true_labels, mapped_labels)
conf_matrix = confusion_matrix(true_labels, mapped_labels)
plt.figure(figsize=(10, 6))
plt.scatter(X['sepal length (cm)'], X['sepal width (cm)'],
c=iris_data['Cluster'], cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='red', marker='X', s=200, label='Centroids')
plt.title('K Means Clustering of Iris Dataset')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
#plt.grid()
plt.show()
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
OUTPUT FILE
DISCUSSIONS
1. The K Means algorithm successfully grouped the Iris dataset into three clusters, which
correspond to the three species.
2. The centroids of the clusters represent the mean values of the features for each cluster.
3. The clustering visualizations help in understanding how well the algorithm has
differentiated between the species based on the selected features.
FINDINGS AND LEARNINGS
1. K Means clustering is an effective method for grouping data points based on feature
similarities.
2. Choosing the right number of clusters is crucial; it can be guided by prior knowledge of
the dataset or by heuristics such as the elbow method (sketched below).
3. Visualizations of clustering results can provide insights into the performance of the
clustering algorithm and the separation between different groups.
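A minimal sketch of the elbow method mentioned above: run K-Means for a range of K values, plot the inertia (within-cluster sum of squares), and pick the K where the curve bends (the K range and parameters here are illustrative):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]
plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method for Choosing K')
plt.show()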
EXPERIMENT OBJECTIVE
To implement Data Classification using KNN.
DESCRIPTION
K-Nearest Neighbours (KNN) is a supervised classification algorithm that classifies data points
based on the labels of their nearest neighbours in the feature space. It operates on the
assumption that similar data points are likely to belong to the same class. The K value
represents the number of neighbours considered in classification, where the majority class
among the K neighbours determines the predicted label for a test point. The algorithm
calculates the distance (usually Euclidean) between the test point and all training data points,
sorts the training points by distance, and identifies the top K closest points. A common choice
for K is an odd number to avoid ties, and the optimal K value may vary based on the dataset.
KNN is simple, effective, and widely used for various classification tasks in image recognition,
recommendation systems, and anomaly detection.
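A minimal from-scratch sketch of this idea (knn_predict and the toy points are illustrative; the lab's program below uses scikit-learn instead):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Majority vote among the k nearest neighbours
    nearest = y_train[np.argsort(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [6.5, 5.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # -> 0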
ALGORITHM
1. Import the required libraries (Pandas, Scikit-learn, Matplotlib).
2. Load the dataset into a DataFrame.
3. Split the dataset into training and testing sets.
4. Normalize the feature data using StandardScaler.
5. Initialize the KNN algorithm with a chosen value of K (e.g., 5).
6. Train the KNN model on the training data.
7. Predict the labels on the test data.
8. Evaluate the model's performance using metrics like accuracy and confusion matrix.
9. Visualize the confusion matrix to understand model performance.
PROGRAM CODE
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n",
      classification_report(y_test, y_pred, target_names=target_names))
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='d',
xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix for KNN Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The KNN algorithm classified the species of the Iris dataset with an accuracy that
depends on the choice of K.
2. The confusion matrix provides insights into how well the model performed for each class,
highlighting any misclassifications.
3. The classification report presents key performance metrics, useful for evaluating model
performance.
EXPERIMENT OBJECTIVE
To implement Decision Tree using ID3 Algorithm.
DESCRIPTION
The ID3 (Iterative Dichotomiser 3) algorithm is a popular algorithm used to build decision
trees. It selects the best feature to split the data based on Information Gain, which measures the
reduction in entropy after the dataset is split on a particular attribute. The algorithm builds the
tree recursively until all data points belong to a single class or there are no more attributes to
split on.
1. Entropy: Measures the impurity in a dataset. High entropy means more disorder, while
low entropy indicates homogeneity.
2. Information Gain: Quantifies the effectiveness of an attribute in classifying the data by
reducing entropy.
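For example, a node containing 9 'Yes' and 5 'No' samples (as in the PlayTennis data used below) has entropy -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940, close to the maximum of 1 because the classes are fairly mixed.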
The decision tree is non-parametric, meaning it doesn’t assume an underlying data distribution,
making it robust for both categorical and numerical datasets.
ALGORITHM
1. Calculate Entropy of the entire dataset.
2. For each attribute:
Compute Entropy for each possible value of the attribute.
Calculate the Information Gain for the attribute.
3. Select the attribute with the highest Information Gain as the decision node.
4. Split the dataset on this attribute and recursively apply the process to the child nodes.
5. Stop the recursion if:
All data points at a node belong to the same class.
There are no more features to split on.
6. Prediction: Classify new samples by traversing the decision tree based on feature values.
PROGRAM CODE
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
def entropy(data):
labels = data.iloc[:, -1]
label_counts = Counter(labels)
total = len(data)
ent = 0.0
for count in label_counts.values():
prob = count / total
ent -= prob * np.log2(prob) if prob > 0 else 0
return ent
def info_gain(data, attribute, target_attribute):
    # Information gain = entropy before the split - weighted entropy after it
    total = len(data)
    weighted_entropy = sum((len(subset) / total) * entropy(subset)
                           for _, subset in data.groupby(attribute))
    return entropy(data) - weighted_entropy
def id3(data, original_data, features, target_attribute):
    labels = data[target_attribute]
    if labels.nunique() == 1:          # all samples share one class
        return labels.iloc[0]
    if not features:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with the highest information gain
    gains = {f: info_gain(data, f, target_attribute) for f in features}
    best = max(gains, key=gains.get)
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value, subset in data.groupby(best):
        tree[best][value] = id3(subset, original_data, remaining, target_attribute)
    return tree
def predict(tree, sample):
    if not isinstance(tree, dict):     # leaf node: a class label
        return tree
    attribute = next(iter(tree))
    subtree = tree[attribute].get(sample[attribute])
    if isinstance(subtree, dict):
        return predict(subtree, sample)
    else:
        return subtree
data = pd.DataFrame({
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
})
features = list(data.columns[:-1])
tree = id3(data, data, features, target_attribute='PlayTennis')
print("ID3 Decision Tree:\n", tree)
sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}
print("Prediction for sample:", predict(tree, sample))  # Sunny + High humidity -> 'No'
# Fit an equivalent scikit-learn tree (entropy criterion) for visualization
X = pd.get_dummies(data.drop(columns='PlayTennis'))
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X, data['PlayTennis'])
custom_class_names = list(clf.classes_)
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=X.columns, class_names=custom_class_names, filled=True,
          rounded=True)
plt.title("Decision Tree By Using ID3 Algorithm", fontsize=16, loc='center')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Decision Tree algorithm, using the ID3 criterion, partitions the data based on features,
resulting in clear classification rules.
2. The sample prediction demonstrates how the learned rules classify a previously unseen
combination of attribute values.
3. The decision tree structure illustrates how the model makes decisions, providing
transparency in the classification process.
EXPERIMENT OBJECTIVE
To implement Logistic Regression for Binary Classification.
DESCRIPTION
Logistic Regression is a supervised machine learning algorithm used for binary classification
tasks, predicting the likelihood of a data point belonging to one of two classes. Instead of
providing direct output as a continuous variable, logistic regression uses the logistic function
(sigmoid) to output probabilities bounded between 0 and 1. The algorithm is particularly
suitable for classification tasks where the outcome variable is categorical (e.g., 0 or 1). Logistic
regression is widely used in fields like medical diagnosis, credit scoring, and spam detection.
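A minimal sketch of the logistic (sigmoid) function, showing how raw scores become probabilities bounded between 0 and 1 (the example scores are arbitrary):
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5, i.e. the decision boundary
print(sigmoid(3))   # ~0.95, confidently class 1
print(sigmoid(-3))  # ~0.05, confidently class 0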
ALGORITHM
1. Load essential packages (e.g., Pandas, Scikit-learn, Matplotlib).
2. Import the dataset, handle missing values if any, and normalize the feature set for better
performance.
3. Divide the data into training and test sets to enable model training and validation.
4. Initialize the Logistic Regression model and train it on the training data.
5. Use the trained model to predict class labels for the test dataset.
6. Assess the model's accuracy using metrics like accuracy score, precision, recall,
F1-score, and a confusion matrix.
7. Plot the decision boundary if applicable.
PROGRAM CODE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Generate a synthetic two-feature binary dataset and split it
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=["feature1", "feature2"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix: Logistic Regression")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
plt.figure(figsize=(8,6))
plt.scatter(X_test["feature1"], X_test["feature2"], c=y_test, cmap="bwr", edgecolor="k",
s=40, label="Actual")
plt.scatter(X_test["feature1"], X_test["feature2"], c=y_pred, cmap="coolwarm", marker="x",
s=30, label="Predicted")
x_min, x_max = X_test["feature1"].min() - 1, X_test["feature1"].max() + 1
y_min, y_max = X_test["feature2"].min() - 1, X_test["feature2"].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")  # shade the decision regions
plt.title("Logistic Regression: Decision Boundary")
plt.xlabel("feature1")
plt.ylabel("feature2")
plt.legend()
plt.show()
OUTPUT FILE
DISCUSSIONS
Logistic regression effectively models binary outcomes by predicting probabilities. The
confusion matrix and classification report provide insight into the model's performance for each
class, showing any misclassifications and overall accuracy.
EXPERIMENT OBJECTIVE
To implement Linear Regression for predicting Continuous Values.
DESCRIPTION
Linear Regression is a widely used supervised learning algorithm for predicting continuous
outcomes based on one or more independent variables. It models the relationship between the
dependent variable y and independent variables x by fitting a linear equation:
y = b0 + b1·x + ϵ
where b0 is the intercept, b1 is the slope, and ϵ is the error term. Linear regression minimizes
the sum of squared residuals to find the line of best fit. It is commonly applied in trend analysis,
forecasting, and financial modeling.
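A quick worked example of the line of best fit (the four data points are hypothetical); NumPy's polyfit recovers the least-squares coefficients:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([7.2, 9.9, 13.1, 15.8])
b1, b0 = np.polyfit(x, y, 1)          # degree-1 fit returns (slope, intercept)
print(f"y = {b0:.2f} + {b1:.2f}x")    # y = 4.25 + 2.90x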
ALGORITHM
1. Import necessary libraries (e.g., Pandas, Scikit-learn, Matplotlib).
2. Load the dataset, handle missing values, and normalize or standardize features if
needed.
3. Split data into training and test sets.
4. Initialize the Linear Regression model and train it using the training data.
5. Use the trained model to predict the target variable on the test dataset.
6. Use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and
R-squared to evaluate the model.
7. Plot the regression line along with actual data points to visualize the fit.
PROGRAM CODE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color="blue", label="Actual Values")
plt.plot(X_test, y_pred, color="red", linewidth=2, label="Regression Line")
plt.title("Linear Regression: Actual vs Predicted")
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.legend()
plt.grid(True)
plt.show()
OUTPUT FILE
DISCUSSIONS
Linear regression establishes a linear relationship between features and target, which can be
visualized by plotting predicted values against actual ones. Metrics like MSE and R-squared
offer insights into model accuracy and variance explained by the model.
EXPERIMENT OBJECTIVE
To implement Multi-Layer Neural Network.
DESCRIPTION
A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of multiple
layers of nodes (neurons), including an input layer, one or more hidden layers, and an output
layer. MLPs are capable of learning complex patterns through the use of non-linear activation
functions.
Key Components:
1. Input Layer: The first layer that receives input data.
2. Hidden Layers: Intermediate layers that transform input into something the output layer
can use.
3. Output Layer: The final layer that produces the output of the network.
4. Weights and Biases: Each connection between neurons has an associated weight that
adjusts as learning proceeds. Each neuron also has a bias term.
5. Activation Function: Functions like Sigmoid, ReLU, or Tanh that introduce non-linearity
into the model.
Forward Pass: During the forward pass, the input data is passed through the network, and
predictions are made based on the current weights and biases.
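As a minimal sketch of this computation (the input, weight, and bias values are made up), a single neuron's forward step is a weighted sum followed by the activation:
import numpy as np
x = np.array([0.5, -1.0])    # hypothetical inputs
w = np.array([0.8, 0.2])     # hypothetical weights
b = 0.1                      # bias
z = np.dot(w, x) + b         # weighted sum: 0.4 - 0.2 + 0.1 = 0.3
a = 1 / (1 + np.exp(-z))     # sigmoid activation: ~0.574
print(a)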
Backward Pass (Backpropagation): Backpropagation is used to update weights and biases
based on the error of predictions. This is done by:
1. Calculating the error at the output layer.
2. Propagating this error backward through the network to update weights and biases.
ALGORITHM
1. Initialize Weights and Biases: Randomly initialize weights and biases for each layer.
2. Forward Pass:
Calculate the output of each neuron in the hidden and output layers using the
activation function.
3. Calculate Loss: Use a loss function (e.g., Mean Squared Error for regression or Cross-
Entropy for classification) to evaluate the difference between predicted and actual
outputs.
4. Backward Pass:
Compute gradients of the loss with respect to weights and biases.
Update weights and biases using gradient descent.
5. Repeat: Continue forward and backward passes for a specified number of epochs or
until convergence.
PROGRAM CODE
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
    # Note: x is expected to already be a sigmoid activation value
    return x * (1 - x)
class MultiLayerNN:
    def __init__(self, input_size, hidden_layers, output_size, learning_rate=0.01):
        self.learning_rate = learning_rate
        self.weights, self.biases, self.losses = [], [], []
        # One weight matrix and bias vector per layer transition
        sizes = [input_size] + hidden_layers + [output_size]
        for i in range(len(sizes) - 1):
            self.weights.append(np.random.randn(sizes[i], sizes[i + 1]) * 0.5)
            self.biases.append(np.zeros((1, sizes[i + 1])))
    def forward(self, X):
        self.layer_outputs = [X]  # keep activations for backpropagation
        for w, b in zip(self.weights, self.biases):
            X = sigmoid(np.dot(X, w) + b)
            self.layer_outputs.append(X)
        return X
    def backward(self, y, output):
        m = y.shape[0]
        error = output - y
        for i in reversed(range(len(self.weights))):
            d_loss = error * sigmoid_derivative(self.layer_outputs[i + 1])
            error = np.dot(d_loss, self.weights[i].T)  # propagate before updating
            self.weights[i] -= self.learning_rate * np.dot(self.layer_outputs[i].T, d_loss) / m
            self.biases[i] -= self.learning_rate * np.sum(d_loss, axis=0, keepdims=True) / m
    def train(self, X, y, epochs=10000):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(y, output)
            if epoch % 1000 == 0:
                self.losses.append(np.mean((output - y) ** 2))
    def predict(self, X):
        return np.round(self.forward(X))
# XOR: a classic problem that a single-layer network cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = MultiLayerNN(input_size=2, hidden_layers=[4], output_size=1, learning_rate=1.0)
nn.train(X, y, epochs=10000)
predictions = nn.predict(X)
print("Predictions:\n", predictions)
plt.plot(nn.losses)
plt.title('Loss over Epochs')
plt.xlabel('Epochs (per 1000)')
plt.ylabel('Mean Squared Error')
plt.grid()
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Multi-Layer Perceptron effectively learns complex patterns in the data, leveraging its
hidden layers to improve classification accuracy.
2. The loss curve shows the mean squared error decreasing over epochs, confirming that
backpropagation is adjusting the weights and biases in the right direction.
3. The final predictions indicate how well the trained network reproduces the target
outputs after training.