Machine Learning Lab
List of Tasks
1. Exercises to solve the real-world problems using the following machine
learning methods:
a. Linear Regression
b. Logistic Regression.
6. Write a program to demonstrate the working of the decision tree based ID3
algorithm. Use an appropriate data set for
building the decision tree and apply this knowledge to classify a new sample.
data sets.
8. Write a program to implement the naïve Bayesian classifier for Iris data set.
Compute the accuracy of the classifier, considering a few test data sets.
Built-in Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data
set.
10. Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data
set for clustering using k-Means algorithm.
Compare the results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML
predictions.
13. For a given set of training data examples stored in a .CSV file, implement
and demonstrate the Candidate-Elimination
algorithm to output a description of the set of all hypotheses consistent with the
training examples.
14. Implement and demonstrate the FIND-S algorithm for finding the most
specific hypothesis based on a given set of
training data samples. Read the training data from a .CSV file.
1. First, you need to have Python installed on your machine. You can
download the latest version of Python from the official website
(https://www.python.org/downloads/). Make sure to select the option to
add Python to your system PATH during the installation process.
2. Once Python is installed, open the command prompt by pressing the
Windows key + R and typing cmd in the Run dialog box.
3. In the command prompt, type the following command to install Jupyter
Notebook:
pip install jupyter
This will download and install Jupyter Notebook and its dependencies.
Once you’ve created a new notebook, you can start writing code in the cells. To
run a cell, press Shift + Enter or click the Run button in the toolbar. You can
also add text, equations, and visualizations to your notebook using Markdown
syntax.
Jupyter Notebook also supports a variety of keyboard shortcuts to make your
work more efficient. Here are some of the most useful shortcuts:
• Shift + Enter: Run the current cell and move to the next one.
• Ctrl + Enter: Run the current cell.
• Esc: Enter command mode.
• Enter: Enter edit mode.
• A: Insert a new cell above the current cell.
• B: Insert a new cell below the current cell.
• D + D: Delete the current cell.
• M: Change the current cell type to Markdown.
• Y: Change the current cell type to code.
In Cmd:
#Data Preprocessing
X = dataset.iloc[:,:-1].values #independent variable array
y = dataset.iloc[:,1].values #dependent variable vector
Output:
Variance measures how much a random variable differs from its expected
value. For a model, it describes how much the prediction of the target
function changes when the training data set changes.
Low variance means the prediction varies only slightly with changes in the
training data set, while high variance means it varies widely.
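The definition above can be made concrete with a small simulation. The following is a minimal sketch, not the lab's original program: it assumes synthetic data drawn from y = 2x + 1 with Gaussian noise, fits a straight line to many independent training sets, and measures how much the prediction at one point changes from fit to fit. All function names and parameters are hypothetical, and the numbers it prints will not match the output listed in this manual.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def prediction_bias_variance(x0, true_y0, n_models=200, n_points=30, noise=1.0, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Each model sees its own freshly sampled training set
        xs = [rng.uniform(0, 10) for _ in range(n_points)]
        ys = [2.0 * x + 1.0 + rng.gauss(0, noise) for x in xs]
        a, b = fit_line(xs, ys)
        predictions.append(a + b * x0)
    mean_prediction = statistics.fmean(predictions)
    bias_sq = (mean_prediction - true_y0) ** 2     # systematic error
    variance = statistics.pvariance(predictions)   # spread across training sets
    return bias_sq, variance

bias_sq, variance = prediction_bias_variance(x0=5.0, true_y0=11.0)
print("Squared bias:", round(bias_sq, 4))
print("Variance:", round(variance, 4))
```

Because the straight-line model matches the assumed data-generating function, the squared bias should come out close to zero and the variance should shrink as n_points grows.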
Source code:
Output:
Bias: 540.17
Variance: 18.08
Average MSE using 5-fold CV: 566.52
5) Aim: Write a program to simulate a perceptron network for pattern
classification and function approximation.
Description:
Pattern Classification:
Function Approximation:
The program provides visual insights by plotting the generated data, the decision
boundary for classification, and the approximated function using the perceptron.
Through this simulation, one can observe the perceptron's ability to make binary
classifications and its approach to approximating simple functions.
Source code:
import numpy as np

class SimplePerceptron:
    def __init__(self):
        # One weight per input feature (two inputs)
        self.weights = np.zeros(2)
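The listing above stops after the constructor. As an illustration of the pattern-classification part only, here is a minimal sketch of the perceptron learning rule in plain Python. It is an assumption about how such a class is typically completed, not the original program; the function names, the learning rate, and the AND-gate training data are all hypothetical.

```python
def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Perceptron learning rule: nudge weights by lr * error * input."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0   # step activation
            error = target - predicted               # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    """Classify one input vector with the learned weights."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Assumed toy problem: the logical AND of two binary inputs
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
for x in samples:
    print(x, "->", predict(weights, bias, x))
```

AND is linearly separable, so the rule converges on it. A single perceptron cannot learn a problem such as XOR, which is not linearly separable.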
Description:
Working Mechanism:
Entropy Calculation:
The ID3 algorithm starts by calculating the entropy of the class labels in the
dataset and in the subsets produced by each attribute. Entropy measures the
randomness or impurity of a set of labels. Attributes whose subsets have lower
entropy are more effective for classification.
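A minimal sketch of the entropy calculation, assuming class labels are given as a plain Python list. The function name and the 9-yes / 5-no example counts are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# An evenly mixed set is maximally impure; a pure set has zero entropy
print(entropy(["yes", "no"]))
print(entropy(["yes"] * 4))
# Assumed example: 9 positive and 5 negative labels
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))
```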
Information Gain:
After the entropy calculation, the algorithm computes the information gain for
each attribute. Information gain quantifies the reduction in entropy achieved by
splitting the dataset on a particular attribute. The attribute with the highest
information gain is selected as the decision node.
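A minimal sketch of attribute selection by information gain. The six-row weather-style dataset and all names are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Assumed toy data: two candidate attributes and the class label
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
windy = ["no", "yes", "no", "no", "yes", "yes"]
play = ["no", "no", "yes", "yes", "no", "yes"]

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
best_attribute = max(gains, key=gains.get)
print(gains)
print("Decision node:", best_attribute)
```

Here splitting on outlook leaves two pure subsets and one mixed subset, so it yields a much larger gain than windy and becomes the decision node.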
Tree Construction:
Using the attribute with the maximum information gain, the dataset is split into
subsets, one per attribute value. The splitting is recursive: each subset becomes
a child node in the decision tree, and the process continues until a stopping
criterion is met, such as all examples in a subset sharing one class, no
attributes remaining, or a minimum node size or maximum depth being reached.
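The recursive construction described above can be sketched as follows. This is a compact illustration under stated assumptions, not the manual's buildTree implementation: rows are dictionaries, the tree is a nested dictionary, recursion stops when a subset is pure or no attributes remain, and the toy dataset is made up.

```python
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def split(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def information_gain(rows, attribute, target):
    remainder = sum(len(g) / len(rows) * entropy(g, target)
                    for g in split(rows, attribute).values())
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stop when the subset is pure or there is nothing left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in split(rows, best).items()}}

def classify(tree, sample):
    """Follow the branches matching the sample until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][sample[attribute]]
    return tree

# Assumed toy data
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

This sketch raises a KeyError for attribute values that never appeared in training. A fuller version would fall back to the majority class at that node.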
Decision Making:
def traverse(root):
    # Recursively visit every child of the current node
    n = len(root.childs)
    if n > 0:
        for i in range(0, n):
            traverse(root.childs[i])

def calculate():
    rows = [i for i in range(0, 14)]
    columns = [i for i in range(0, 4)]
    root = buildTree(X, rows, columns)
    root.decision = 'Start'
    traverse(root)

calculate()
Output:
8) Aim: Write a program to implement the naïve Bayesian classifier for
Iris data set. Compute the accuracy of the classifier, considering a few test
data sets.
Description:
The implementation of the Naïve Bayes classifier for the Iris dataset offers an
insightful exploration into the application of probabilistic classification
techniques on a well-known dataset in the machine learning community.
Implementation Steps:
Data Loading:
Data Preprocessing:
Split the dataset into a training set and a test set. Typically, a significant portion
(e.g., 70-80%) is used for training, and the remaining portion is used for testing.
Model Training:
Prior Probabilities: Compute the prior probabilities for each class based on the
frequency of occurrences in the training set.
Classification:
For each test instance, compute the posterior probabilities of all classes using
Bayes' theorem. Assign the class label to the instance based on the highest
posterior probability.
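The training and classification steps can be sketched from scratch as follows. This is a simplified illustration that assumes Gaussian-distributed features, as GaussianNB does. The function names and the six made-up two-feature training rows are hypothetical, and the scores are unnormalized log posteriors, which is enough to pick the most probable class.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {label: (prior, per-feature means, per-feature variances)}."""
    grouped = defaultdict(list)
    for features, label in zip(X, y):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        prior = len(rows) / len(X)                 # class frequency
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # Small constant keeps the variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (prior, means, variances)
    return model

def log_posterior(x, prior, means, variances):
    """log P(class) + sum of log Gaussian likelihoods (unnormalized)."""
    log_likelihood = sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
        for xi, m, var in zip(x, means, variances)
    )
    return math.log(prior) + log_likelihood

def predict(model, x):
    """Assign the label with the highest posterior score."""
    return max(model, key=lambda label: log_posterior(x, *model[label]))

# Assumed toy data: (petal length, petal width) style measurements
X_train = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [5.5, 2.0], [6.0, 2.5], [5.1, 1.9]]
y_train = ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"]
model = fit_gaussian_nb(X_train, y_train)
print(predict(model, [1.6, 0.2]))
print(predict(model, [5.8, 2.2]))
```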
Performance Evaluation:
After classifying all test instances, compare the predicted class labels with the
actual labels to compute the classifier's accuracy. Accuracy is a fundamental
metric that provides insights into the model's predictive performance on unseen
data.
Source code:
#Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets (fixed seed so the accuracy is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Naive Bayes classifier
model = GaussianNB()
model.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
# Make predictions on a new sample: sepal length, sepal width, petal length, petal width (cm)
sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(sample)
print('Prediction:', iris.target_names[prediction[0]])
Output:
11) Aim: Write a program to implement k-Nearest Neighbor algorithm to
classify the iris data set. Print both correct and wrong predictions.
Description:
K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms
based on the supervised learning technique. The K-NN algorithm assumes that
similar cases lie close together: it compares a new case with the available
cases and puts it into the category whose members it most resembles.
The working of K-NN can be explained by the following algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance from the new data point to each
training point.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Among these K neighbours, count the number of data points in each category.
Step-4: Assign the new data point to the category for which the number of
neighbours is maximum.
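The steps above can be sketched in plain Python. This is a minimal illustration rather than the lab's original program: the function name, the two-cluster toy data, and the deliberately ambiguous third test point are assumptions. It reports correct and wrong predictions separately, as the aim requests.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    # Count the categories of the k nearest neighbours
    votes = Counter(label for _, label in distances[:k])
    # The category with the most neighbours wins
    return votes.most_common(1)[0][0]

# Assumed toy data: two well-separated clusters
train_X = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["A", "A", "A", "B", "B", "B"]
test_set = [((1.5, 1.5), "A"), ((5.5, 5.5), "B"), ((2.5, 2.5), "B")]

correct, wrong = [], []
for point, actual in test_set:
    predicted = knn_predict(train_X, train_y, point, k=3)
    (correct if predicted == actual else wrong).append((point, actual, predicted))

print("Correct predictions:", correct)
print("Wrong predictions:", wrong)
```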
Source code:
Implementation Steps:
A synthetic dataset is created using numpy, comprising 100 data points generated
along a sine curve with some added random noise.
The LWR function computes a weight for each training point using a Gaussian
kernel, emphasizing points closer to the test point.
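A minimal sketch of the Gaussian kernel weighting on its own, assuming a bandwidth parameter tau as in the source code that follows. The helper name and the sample values are illustrative.

```python
import math

def gaussian_weights(train_x, test_point, tau):
    """Weight of each training point: exp(-(x - x0)^2 / (2 * tau^2))."""
    return [math.exp(-((x - test_point) ** 2) / (2 * tau ** 2)) for x in train_x]

weights = gaussian_weights([0.0, 1.0, 2.0, 3.0], test_point=1.0, tau=0.5)
print([round(w, 4) for w in weights])
```

A smaller tau makes the weights fall off faster, so the fit follows the data more closely; a larger tau gives a smoother fit that approaches ordinary linear regression.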
Prediction:
Using the LWR function, predictions are generated for each point in the dataset.
This means that instead of a single line of best fit, the algorithm calculates a curve
that best fits the data point-by-point.
Visualization:
The matplotlib library is utilized to visualize the synthetic data points as blue dots
and the LWR fit as a red curve. This visualization provides a clear understanding
of how LWR fits a curve that closely follows the underlying trend of the data,
especially in regions of high data density.
Source code:
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(0)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(0, 0.2, 100)
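# Locally Weighted Regression: fit a weighted straight line around each test point
def locally_weighted_regression(test_point, X, y, tau):
    m = X.shape[0]
    # Design matrix with an intercept column, so the local line need not pass through the origin
    X_b = np.column_stack([np.ones(m), X])
    # Gaussian kernel weights: training points near test_point count more
    weights = np.exp(-((X - test_point) ** 2) / (2 * tau ** 2))
    W = np.diag(weights)
    # Weighted normal equations: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.pinv(X_b.T @ W @ X_b) @ X_b.T @ W @ y
    return np.array([1.0, test_point]) @ theta
# Predictions using LWR
predictions = [locally_weighted_regression(test_point, X, y, tau=0.5) for test_point in X]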
# Plotting the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, predictions, color='red', label='LWR Fit')
plt.title('Locally Weighted Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
Output:
13) Aim: For a given set of training data examples stored in a .CSV file,
implement and demonstrate the Candidate-Elimination algorithm to output
a description of the set of all hypotheses consistent with the training
examples.
Description:
Iterative Refinement:
For positive instances, generalize the specific hypothesis just enough to cover
the instance, and remove any general hypothesis that does not cover it.
For negative instances, specialize the general hypotheses just enough to exclude
the instance, and remove any specific hypothesis that covers it.
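import pandas as pd

# Load dataset
def load_data(filename):
    return pd.read_csv(filename)

def candidate_elimination(examples):
    # Assumption: the last CSV column holds the label, with 'yes' marking a positive example
    rows = examples.values.tolist()
    n = len(rows[0]) - 1
    specific_h = [None] * n                     # most specific hypothesis
    general_h = [['?'] * n for _ in range(n)]   # one candidate general hypothesis per attribute
    for row in rows:
        if str(row[-1]).strip().lower() == 'yes':
            # Positive example: generalize specific_h just enough to cover it
            for j in range(len(specific_h)):
                if specific_h[j] is None:
                    specific_h[j] = row[j]
                elif specific_h[j] != row[j]:
                    specific_h[j] = '?'
                    general_h[j][j] = '?'
        else:
            # Negative example: specialize general_h where it differs from specific_h
            for j in range(len(specific_h)):
                if specific_h[j] not in (None, '?', row[j]):
                    general_h[j][j] = specific_h[j]
    # Drop general hypotheses that are still fully general
    to_remove = [hypothesis for hypothesis in general_h if all(v == '?' for v in hypothesis)]
    for hypothesis in to_remove:
        general_h.remove(hypothesis)
    return specific_h, general_h

data = load_data('training_data.csv')
specific, general = candidate_elimination(data)
# Display results
print("\nSpecific hypothesis:")
print(specific)
print("\nGeneral hypothesis:")
print(general)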
Output:
14) Aim: Implement and demonstrate the FIND-S algorithm for finding the
most specific hypothesis based on a given set of training data samples. Read
the training data from a .CSV file.
Description:
Find-S is a concept learning algorithm. It identifies the most specific
hypothesis that is consistent with all of the positive training examples.
Find-S considers only positive examples and ignores negative ones.
It starts with the most specific hypothesis and generalizes it, attribute by
attribute, whenever the current hypothesis fails to cover an observed positive
training example.
Representations:
• The most specific hypothesis is represented using ϕ.
• The most general hypothesis is represented using ?.
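The aim asks for the training data to be read from a .CSV file. A minimal sketch of that variant follows, assuming the label is the last column and positive examples are marked "Yes". The column names, the label values, and the in-memory CSV text (used in place of a file so the example is self-contained) are assumptions; the four rows mirror the training data used in this exercise.

```python
import csv
import io

def find_s(rows, positive_label="Yes"):
    """Return the most specific hypothesis covering all positive rows."""
    hypothesis = None  # None plays the role of the most specific hypothesis
    for *attributes, label in rows:
        if label != positive_label:
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)
        else:
            # Keep matching values, generalize mismatches to '?'
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

# Stand-in for open('training_data.csv'); the header names are hypothetical
csv_text = """Sky,AirTemp,Humidity,Wind,Water,Forecast,EnjoySport
Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No
Sunny,Warm,High,Strong,Cool,Change,Yes
"""
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip the column names
hypothesis = find_s(list(reader))
print("Most specific hypothesis:", hypothesis)
```

To read a real file, replace the io.StringIO object with open(filename, newline="") inside a with block.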
Source code:
import numpy as np
# Define the training data
X = np.array([
    ['Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'],
    ['Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'],
    ['Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'],
    ['Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change']
])
y = np.array(['+', '+', '-', '+'])
# Initialize the most specific hypothesis ('0' stands for ϕ)
hypothesis = ['0', '0', '0', '0', '0', '0']
# Find the most specific hypothesis
for i in range(len(y)):
    if y[i] == '+':
        for j in range(len(X[i])):
            if hypothesis[j] == '0':
                hypothesis[j] = X[i][j]
            elif hypothesis[j] != X[i][j]:
                hypothesis[j] = '?'
print("Most specific hypothesis:", hypothesis)
Output:
15) Aim: Solve optimal relay coordination as a linear programming problem
using Genetic Algorithm.
Description:
Objective: Minimize the sum of the relay time settings while ensuring
coordination.
Constraints: Ensure that the time settings for each relay are within their
minimum and maximum bounds.
For simplicity, let's assume we have 3 relays with minimum and maximum time
settings.
Source Code:
import numpy as np
import random
import matplotlib.pyplot as plt
# Define relay settings [min_time, max_time]
relay_settings = [
[0.1, 0.3], # Relay 1
[0.2, 0.5], # Relay 2
[0.4, 0.8] # Relay 3
]
# GA parameters
population_size = 100
num_generations = 100
mutation_rate = 0.1
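def compute_fitness(individual):
    # Objective: total relay operating time (lower is better)
    return sum(individual)
def crossover(parent1, parent2):
    # Single-point crossover; the cut point is chosen so that both parents contribute
    crossover_point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:crossover_point] + parent2[crossover_point:]
    child2 = parent2[:crossover_point] + parent1[crossover_point:]
    return child1, child2
def mutate(individual):
    # Resample a gene within its relay's [min_time, max_time] bounds
    for i in range(len(individual)):
        if random.random() < mutation_rate:
            individual[i] = random.uniform(relay_settings[i][0], relay_settings[i][1])
    return individual
# Initialize population
population = [[random.uniform(min_time, max_time) for min_time, max_time in relay_settings]
              for _ in range(population_size)]
# Main GA loop
for generation in range(num_generations):
    # Evaluate fitness and sort ascending (best individuals first)
    sorted_population = sorted(population, key=compute_fitness)
    # Selection: Top 50% of the population becomes parents
    parents = sorted_population[:population_size // 2]
    # Reproduction: fill the rest of the next generation with mutated offspring
    offspring = []
    while len(offspring) < population_size - len(parents):
        child1, child2 = crossover(*random.sample(parents, 2))
        offspring.extend([mutate(child1), mutate(child2)])
    population = parents + offspring[:population_size - len(parents)]
best = min(population, key=compute_fitness)
print("Best relay time settings:", best)
print("Total operating time:", compute_fitness(best))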
Output: