ML Lab File
EXPERIMENT OBJECTIVE
To study the NumPy, Pandas, and Matplotlib libraries in Python.
DESCRIPTION
This experiment introduces three Python libraries essential for effective programming and
data manipulation:
1. NumPy: A fundamental package for numerical computing in Python. It provides support
for arrays, matrices, and a collection of mathematical functions to operate on these data
structures.
2. Pandas: A powerful data manipulation and analysis library built on top of NumPy. It
offers data structures like Series and DataFrames, allowing for easy handling of structured
data.
3. Matplotlib: A plotting library for creating static, interactive, and animated visualizations
in Python. It is commonly used for data visualization and provides a wide variety of plots
and charts.
These libraries are fundamental in data science for tasks such as array operations, data
preprocessing, and graphical representation of data trends.
ALGORITHM
1. Install and import libraries: Ensure NumPy, Pandas, and Matplotlib are installed using
pip.
2. NumPy Operations:
Create arrays and perform arithmetic operations.
Utilize functions like mean(), reshape(), and slicing.
3. Pandas Operations:
Create DataFrames from dictionaries or CSV files.
Perform data filtering, aggregation, and column manipulations.
4. Matplotlib Visualizations:
Generate plots like line graphs, histograms, and scatter plots.
5. Integrate all three: Use NumPy arrays or Pandas DataFrames to generate visualizations.
PROGRAM CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print("NumPy Implementation:")
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.arange(1, 11)
print("NumPy Array 1:", arr1)
print("NumPy Array 2:", arr2)
mean_value = np.mean(arr1)
std_dev = np.std(arr1)
print("Mean of arr1:", mean_value)
print("Standard deviation of arr1:", std_dev)
reshaped_arr = arr2.reshape(2, 5)
print("Reshaped Array:\n", reshaped_arr)
print("\nPandas Implementation:")
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [24, 27, 22, 32, 29],
'Salary': [50000, 54000, 58000, 60000, 62000]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:\n", df)
salary_mean = df['Salary'].mean()
print("\nMean Salary:", salary_mean)
print("\nMatplotlib Implementation:")
plt.bar(df['Name'], df['Salary'], color='blue')
plt.title('Matplotlib Implementation (Salaries of Employees)')
plt.xlabel('Employee Names')
plt.ylabel('Salaries')
plt.grid(axis='y')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The output shows the basic usage of NumPy for creating arrays, Pandas for creating
DataFrames, and Matplotlib for visualizing data.
2. This experiment highlights how these libraries interact and can be used together for data
analysis and visualization.
EXPERIMENT OBJECTIVE
To perform Data Pre-Processing and Data Summarization on Iris Dataset.
DESCRIPTION
The Iris Dataset is a popular dataset used for classification tasks. It contains 150 records of iris
flowers with four features (sepal length, sepal width, petal length, and petal width) and a target
class (species).
1. Data Pre-Processing: This is the process of cleaning and transforming raw data into a
usable format. It involves handling missing values, normalizing data (see the short sketch
after this list), and converting categorical data into numerical formats.
2. Data Summarization: This involves describing the main features of the dataset, often
using statistical measures such as mean, median, mode, standard deviation, and
visualization techniques to understand data distribution.
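For instance, standardization (one common normalization) rescales a feature to zero mean and unit variance. A minimal sketch with hypothetical sepal lengths:
import numpy as np
x = np.array([4.9, 5.1, 5.8, 6.3])   # hypothetical sepal lengths (cm)
z = (x - x.mean()) / x.std()         # z-scores: zero mean, unit variance
print(z.round(2))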
ALGORITHM
1. Import the required libraries (NumPy, Pandas, and Matplotlib).
2. Load the Iris dataset into a Pandas DataFrame.
3. Check for missing values and handle them (if any).
4. Normalize the numeric data to bring it to a similar scale.
5. Encode the categorical target variable (Species).
6. Summarize the dataset using statistical metrics (mean, median, etc.).
7. Visualize the distributions of features to identify trends.
PROGRAM CODE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import seaborn as sns
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target # Keep it as numeric for analysis
print("Missing Values:\n", data.isnull().sum())
print("\nStatistical Summary:\n", data.describe())
scaler = StandardScaler()
data.iloc[:, :-1] = scaler.fit_transform(data.iloc[:, :-1])
print("\nNormalized Data:\n", data.head())
sns.countplot(x='species', data=data)
plt.title("Species Distribution")
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()
g = sns.pairplot(data, hue='species')
g.fig.suptitle("Pairplot of Iris Dataset", y=1.02)
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Iris dataset contains no missing values, which simplifies the pre-processing steps.
2. The statistical summary provides insights into the distribution of the different features
(Sepal and Petal dimensions) across the species.
3. The pairplot visually illustrates the relationships between different features, showing how
species are distributed based on their measurements.
EXPERIMENT OBJECTIVE
To perform Data Preprocessing and Data Visualization on Iris Dataset.
DESCRIPTION
This experiment involves two tasks:
1. Data Preprocessing: The process of preparing raw data for analysis. This includes
cleaning the data, handling missing values, encoding categorical variables (a small sketch
follows this list), and normalizing numerical values.
2. Data Visualization: The representation of data in graphical formats to identify trends,
patterns, and outliers. Common techniques include scatter plots, histograms, and box
plots.
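For example, scikit-learn's LabelEncoder converts category names to integer codes. A minimal sketch with the Iris species names:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
labels = ['setosa', 'versicolor', 'virginica', 'setosa']
print(le.fit_transform(labels))   # [0 1 2 0] -- classes sorted alphabetically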
ALGORITHM
1. Import necessary libraries (NumPy, Pandas, Matplotlib, Seaborn).
2. Load the Iris dataset into a DataFrame.
3. Check and handle missing values (if any).
4. Normalize the numeric data using StandardScaler.
5. Encode the categorical target variable (species).
6. Visualize individual feature distributions using histograms.
7. Plot relationships between features using scatter plots and pair plots.
8. Display class distributions using bar charts or count plots.
PROGRAM CODE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target
print("Missing Values:\n", data.isnull().sum())
scaler = StandardScaler()
data.iloc[:, :-1] = scaler.fit_transform(data.iloc[:, :-1])
encoder = LabelEncoder()
data['species'] = encoder.fit_transform(data['species'])
# Histograms of each feature (algorithm step 6)
data.iloc[:, :-1].hist(figsize=(8, 6), bins=15)
plt.suptitle("Feature Distributions")
plt.show()
# Boxplot of sepal length per species, as referenced in the discussion
plt.figure(figsize=(6, 4))
sns.boxplot(x='species', y='sepal length (cm)', data=data)
plt.title("Sepal Length by Species")
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data['sepal length (cm)'], y=data['sepal width (cm)'], hue=data['species'],
palette='bright')
plt.title("Sepal Length vs Sepal Width")
plt.show()
# Pair plot of all feature combinations (algorithm step 7)
sns.pairplot(data, hue='species')
plt.show()
plt.figure(figsize=(6, 4))
sns.countplot(x='species', data=data, hue='species', palette='bright', legend=False)
plt.title("Class Distribution - Iris Dataset")
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The histograms reveal the distribution of each feature, offering insights into possible
skewness or outliers.
2. The boxplot summarizes the range and distribution of Sepal Length for each species,
indicating any outliers.
3. The scatter plot highlights the relationship between Sepal Length and Sepal Width,
revealing patterns in feature clustering for different species.
4. The pair plot visualizes relationships between all features, helping to identify correlations
or trends.
EXPERIMENT OBJECTIVE
To implement K Means Clustering.
DESCRIPTION
K-Means Clustering is an unsupervised learning algorithm primarily used to partition datasets
into a predefined number of clusters (K) based on feature similarity. Each cluster is represented
by its centroid, and data points are assigned to clusters based on proximity to these centroids.
The process begins by randomly initializing K centroids and iteratively adjusts these centroids
to minimize intra-cluster variance. During each iteration, the algorithm calculates the distance
from each data point to each centroid, assigns the point to the closest cluster, and updates the
centroids by averaging the points within each cluster. The algorithm repeats this process until
the centroids stabilize, indicating that the clusters are optimally partitioned. This technique is
widely used for data segmentation, customer profiling, and exploratory data analysis.
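A minimal NumPy sketch of one assignment-and-update cycle (kmeans_step and the toy points are illustrative; the lab's program below uses scikit-learn):
import numpy as np

def kmeans_step(X, centroids):
    # Assign each point to the nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
    return labels, new_centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = X[[0, 2]]             # naive initialization: two of the points
for _ in range(5):                # repeat until centroids stabilize
    labels, centroids = kmeans_step(X, centroids)
print(labels, centroids, sep="\n")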
ALGORITHM
1. Import necessary libraries (Pandas, Matplotlib, Scikit-learn).
2. Load the dataset into a DataFrame.
3. Preprocess the data: Normalize feature values using StandardScaler.
4. Initialize the K-Means algorithm with a predefined number of clusters (K=3).
5. Fit the K-Means model to the data.
6. Assign each data point to the nearest cluster.
7. Visualize the clusters using scatter plots.
8. Evaluate the model by comparing clusters with actual labels (optional).
PROGRAM CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix
iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
true_labels = iris.target
X = iris_data[iris.feature_names]
# Fit K-Means with K = 3 (one cluster per species)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
iris_data['Cluster'] = kmeans.fit_predict(X)
# Map each cluster to its majority true label so accuracy is meaningful
mapping = {c: np.bincount(true_labels[iris_data['Cluster'] == c]).argmax() for c in range(3)}
mapped_labels = iris_data['Cluster'].map(mapping)
accuracy = accuracy_score(true_labels, mapped_labels)
conf_matrix = confusion_matrix(true_labels, mapped_labels)
plt.figure(figsize=(10, 6))
plt.scatter(X['sepal length (cm)'], X['sepal width (cm)'],
c=iris_data['Cluster'], cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='red', marker='X', s=200, label='Centroids')
plt.title('K Means Clustering of Iris Dataset')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
#plt.grid()
plt.show()
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
OUTPUT FILE
DISCUSSIONS
1. The K Means algorithm successfully grouped the Iris dataset into three clusters, which
correspond to the three species.
2. The centroids of the clusters represent the mean values of the features for each cluster.
3. The clustering visualizations help in understanding how well the algorithm has
differentiated between the species based on the selected features.
FINDINGS AND LEARNINGS
1. K Means clustering is an effective method for grouping data points based on feature
similarities.
2. Choosing the right number of clusters is crucial; it can be guided by prior knowledge of
the dataset or by heuristics such as the elbow method (sketched below).
3. Visualizations of clustering results can provide insights into the performance of the
clustering algorithm and the separation between different groups.
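A minimal sketch of the elbow method mentioned above: run K-Means for a range of K values, plot the inertia (within-cluster sum of squares), and pick the K where the curve bends (the K range and parameters here are illustrative):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]
plt.plot(range(1, 7), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method for Choosing K')
plt.show()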
EXPERIMENT OBJECTIVE
To implement Data Classification using KNN.
DESCRIPTION
K-Nearest Neighbours (KNN) is a supervised classification algorithm that classifies data points
based on the labels of their nearest neighbours in the feature space. It operates on the
assumption that similar data points are likely to belong to the same class. The K value
represents the number of neighbours considered in classification, where the majority class
among the K neighbours determines the predicted label for a test point. The algorithm
calculates the distance (usually Euclidean) between the test point and all training data points,
sorts the training points by distance, and identifies the top K closest points. A common choice
for K is an odd number to avoid ties, and the optimal K value may vary based on the dataset.
KNN is simple, effective, and widely used for various classification tasks in image recognition,
recommendation systems, and anomaly detection.
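A minimal from-scratch sketch of this idea (knn_predict and the toy points are illustrative; the lab's program below uses scikit-learn instead):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Majority vote among the k nearest neighbours
    nearest = y_train[np.argsort(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [6.5, 5.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # -> 0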
ALGORITHM
1. Import the required libraries (Pandas, Scikit-learn, Matplotlib).
2. Load the dataset into a DataFrame.
3. Split the dataset into training and testing sets.
4. Normalize the feature data using StandardScaler.
5. Initialize the KNN algorithm with a chosen value of K (e.g., 5).
6. Train the KNN model on the training data.
7. Predict the labels on the test data.
8. Evaluate the model's performance using metrics like accuracy and confusion matrix.
9. Visualize the confusion matrix to understand model performance.
PROGRAM CODE
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Split into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n",
      classification_report(y_test, y_pred, target_names=target_names))
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='d',
xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix for KNN Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The KNN algorithm classified the species of the Iris dataset with an accuracy that
depends on the choice of K.
2. The confusion matrix provides insights into how well the model performed for each class,
highlighting any misclassifications.
3. The classification report presents key performance metrics, useful for evaluating model
performance.
EXPERIMENT OBJECTIVE
To implement Decision Tree using ID3 Algorithm.
DESCRIPTION
The ID3 (Iterative Dichotomiser 3) algorithm is a popular algorithm used to build decision
trees. It selects the best feature to split the data based on Information Gain, which measures the
reduction in entropy after the dataset is split on a particular attribute. The algorithm builds the
tree recursively until all data points belong to a single class or there are no more attributes to
split on.
1. Entropy: Measures the impurity in a dataset. High entropy means more disorder, while
low entropy indicates homogeneity.
2. Information Gain: Quantifies the effectiveness of an attribute in classifying the data by
reducing entropy.
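For example, a node containing 9 'Yes' and 5 'No' samples (as in the PlayTennis data used below) has entropy -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940, close to the maximum of 1 because the classes are fairly mixed.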
The decision tree is non-parametric, meaning it doesn’t assume an underlying data distribution,
making it robust for both categorical and numerical datasets.
ALGORITHM
1. Calculate Entropy of the entire dataset.
2. For each attribute:
Compute Entropy for each possible value of the attribute.
Calculate the Information Gain for the attribute.
3. Select the attribute with the highest Information Gain as the decision node.
4. Split the dataset on this attribute and recursively apply the process to the child nodes.
5. Stop the recursion if:
All data points at a node belong to the same class.
There are no more features to split on.
6. Prediction: Classify new samples by traversing the decision tree based on feature values.
PROGRAM CODE
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
def entropy(data):
labels = data.iloc[:, -1]
label_counts = Counter(labels)
total = len(data)
ent = 0.0
for count in label_counts.values():
prob = count / total
ent -= prob * np.log2(prob) if prob > 0 else 0
return ent
def info_gain(data, attribute, target_attribute):
    # Information gain = entropy before the split - weighted entropy after it
    total = len(data)
    weighted_entropy = sum((len(subset) / total) * entropy(subset)
                           for _, subset in data.groupby(attribute))
    return entropy(data) - weighted_entropy
def id3(data, original_data, features, target_attribute):
    labels = data[target_attribute]
    if labels.nunique() == 1:          # all samples share one class
        return labels.iloc[0]
    if not features:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Split on the attribute with the highest information gain
    gains = {f: info_gain(data, f, target_attribute) for f in features}
    best = max(gains, key=gains.get)
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value, subset in data.groupby(best):
        tree[best][value] = id3(subset, original_data, remaining, target_attribute)
    return tree
def predict(tree, sample):
    if not isinstance(tree, dict):     # leaf node: a class label
        return tree
    attribute = next(iter(tree))
    subtree = tree[attribute].get(sample[attribute])
    if isinstance(subtree, dict):
        return predict(subtree, sample)
    else:
        return subtree
data = pd.DataFrame({
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
})
features = list(data.columns[:-1])
tree = id3(data, data, features, target_attribute='PlayTennis')
print("ID3 Decision Tree:\n", tree)
sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}
print("Prediction for sample:", predict(tree, sample))  # Sunny + High humidity -> 'No'
# Fit an equivalent scikit-learn tree (entropy criterion) for visualization
X = pd.get_dummies(data.drop(columns='PlayTennis'))
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X, data['PlayTennis'])
custom_class_names = list(clf.classes_)
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=X.columns, class_names=custom_class_names, filled=True,
          rounded=True)
plt.title("Decision Tree By Using ID3 Algorithm", fontsize=16, loc='center')
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Decision Tree algorithm, using the ID3 criterion, partitions the data based on features,
resulting in clear classification rules.
2. The sample prediction demonstrates how the learned rules classify a previously unseen
combination of attribute values.
3. The decision tree structure illustrates how the model makes decisions, providing
transparency in the classification process.
EXPERIMENT OBJECTIVE
To implement Logistic Regression for Binary Classification.
DESCRIPTION
Logistic Regression is a supervised machine learning algorithm used for binary classification
tasks, predicting the likelihood of a data point belonging to one of two classes. Instead of
providing direct output as a continuous variable, logistic regression uses the logistic function
(sigmoid) to output probabilities bounded between 0 and 1. The algorithm is particularly
suitable for classification tasks where the outcome variable is categorical (e.g., 0 or 1). Logistic
regression is widely used in fields like medical diagnosis, credit scoring, and spam detection.
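A minimal sketch of the logistic (sigmoid) function, showing how raw scores become probabilities bounded between 0 and 1 (the example scores are arbitrary):
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))   # 0.5, i.e. the decision boundary
print(sigmoid(3))   # ~0.95, confidently class 1
print(sigmoid(-3))  # ~0.05, confidently class 0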
ALGORITHM
1. Load essential packages (e.g., Pandas, Scikit-learn, Matplotlib).
2. Import the dataset, handle missing values if any, and normalize the feature set for better
performance.
3. Divide the data into training and test sets to enable model training and validation.
4. Initialize the Logistic Regression model and train it on the training data.
5. Use the trained model to predict class labels for the test dataset.
6. Assess the model's accuracy using metrics like accuracy score, precision, recall,
F1-score, and a confusion matrix.
7. Plot the decision boundary if applicable.
PROGRAM CODE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
# Generate a synthetic two-feature binary dataset and split it
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=["feature1", "feature2"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix: Logistic Regression")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
plt.figure(figsize=(8,6))
plt.scatter(X_test["feature1"], X_test["feature2"], c=y_test, cmap="bwr", edgecolor="k",
s=40, label="Actual")
plt.scatter(X_test["feature1"], X_test["feature2"], c=y_pred, cmap="coolwarm", marker="x",
s=30, label="Predicted")
x_min, x_max = X_test["feature1"].min() - 1, X_test["feature1"].max() + 1
y_min, y_max = X_test["feature2"].min() - 1, X_test["feature2"].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.2, cmap="coolwarm")  # shade the decision regions
plt.title("Logistic Regression: Decision Boundary")
plt.xlabel("feature1")
plt.ylabel("feature2")
plt.legend()
plt.show()
OUTPUT FILE
DISCUSSIONS
Logistic regression effectively models binary outcomes by predicting probabilities. The
confusion matrix and classification report provide insight into the model's performance for each
class, showing any misclassifications and overall accuracy.
EXPERIMENT OBJECTIVE
To implement Linear Regression for predicting Continuous Values.
DESCRIPTION
Linear Regression is a widely used supervised learning algorithm for predicting continuous
outcomes based on one or more independent variables. It models the relationship between the
dependent variable y and independent variables x by fitting a linear equation:
y = b0 + b1·x + ϵ
where b0 is the intercept, b1 is the slope, and ϵ is the error term. Linear regression minimizes
the sum of squared residuals to find the line of best fit. It is commonly applied in trend analysis,
forecasting, and financial modeling.
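A quick worked example of the line of best fit (the four data points are hypothetical); NumPy's polyfit recovers the least-squares coefficients:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([7.2, 9.9, 13.1, 15.8])
b1, b0 = np.polyfit(x, y, 1)          # degree-1 fit returns (slope, intercept)
print(f"y = {b0:.2f} + {b1:.2f}x")    # y = 4.25 + 2.90x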
ALGORITHM
1. Import necessary libraries (e.g., Pandas, Scikit-learn, Matplotlib).
2. Load the dataset, handle missing values, and normalize or standardize features if
needed.
3. Split data into training and test sets.
4. Initialize the Linear Regression model and train it using the training data.
5. Use the trained model to predict the target variable on the test dataset.
6. Use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and
R-squared to evaluate the model.
7. Plot the regression line along with actual data points to visualize the fit.
PROGRAM CODE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color="blue", label="Actual Values")
plt.plot(X_test, y_pred, color="red", linewidth=2, label="Regression Line")
plt.title("Linear Regression: Actual vs Predicted")
plt.xlabel("Feature (X)")
plt.ylabel("Target (y)")
plt.legend()
plt.grid(True)
plt.show()
OUTPUT FILE
DISCUSSIONS
Linear regression establishes a linear relationship between features and target, which can be
visualized by plotting predicted values against actual ones. Metrics like MSE and R-squared
offer insights into model accuracy and variance explained by the model.
EXPERIMENT OBJECTIVE
To implement Multi-Layer Neural Network.
DESCRIPTION
A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of multiple
layers of nodes (neurons), including an input layer, one or more hidden layers, and an output
layer. MLPs are capable of learning complex patterns through the use of non-linear activation
functions.
Key Components:
1. Input Layer: The first layer that receives input data.
2. Hidden Layers: Intermediate layers that transform input into something the output layer
can use.
3. Output Layer: The final layer that produces the output of the network.
4. Weights and Biases: Each connection between neurons has an associated weight that
adjusts as learning proceeds. Each neuron also has a bias term.
5. Activation Function: Functions like Sigmoid, ReLU, or Tanh that introduce non-linearity
into the model.
Forward Pass: During the forward pass, the input data is passed through the network, and
predictions are made based on the current weights and biases.
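As a minimal sketch of this computation (the input, weight, and bias values are made up), a single neuron's forward step is a weighted sum followed by the activation:
import numpy as np
x = np.array([0.5, -1.0])    # hypothetical inputs
w = np.array([0.8, 0.2])     # hypothetical weights
b = 0.1                      # bias
z = np.dot(w, x) + b         # weighted sum: 0.4 - 0.2 + 0.1 = 0.3
a = 1 / (1 + np.exp(-z))     # sigmoid activation: ~0.574
print(a)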
Backward Pass (Backpropagation): Backpropagation is used to update weights and biases
based on the error of predictions. This is done by:
1. Calculating the error at the output layer.
2. Propagating this error backward through the network to update weights and biases.
ALGORITHM
1. Initialize Weights and Biases: Randomly initialize weights and biases for each layer.
2. Forward Pass:
Calculate the output of each neuron in the hidden and output layers using the
activation function.
3. Calculate Loss: Use a loss function (e.g., Mean Squared Error for regression or Cross-
Entropy for classification) to evaluate the difference between predicted and actual
outputs.
4. Backward Pass:
Compute gradients of the loss with respect to weights and biases.
Update weights and biases using gradient descent.
5. Repeat: Continue forward and backward passes for a specified number of epochs or
until convergence.
PROGRAM CODE
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
    # Note: x is expected to already be a sigmoid activation value
    return x * (1 - x)
class MultiLayerNN:
    def __init__(self, input_size, hidden_layers, output_size, learning_rate=0.01):
        self.learning_rate = learning_rate
        self.weights, self.biases, self.losses = [], [], []
        # One weight matrix and bias vector per layer transition
        sizes = [input_size] + hidden_layers + [output_size]
        for i in range(len(sizes) - 1):
            self.weights.append(np.random.randn(sizes[i], sizes[i + 1]) * 0.5)
            self.biases.append(np.zeros((1, sizes[i + 1])))
    def forward(self, X):
        self.layer_outputs = [X]  # keep activations for backpropagation
        for w, b in zip(self.weights, self.biases):
            X = sigmoid(np.dot(X, w) + b)
            self.layer_outputs.append(X)
        return X
    def backward(self, y, output):
        m = y.shape[0]
        error = output - y
        for i in reversed(range(len(self.weights))):
            d_loss = error * sigmoid_derivative(self.layer_outputs[i + 1])
            error = np.dot(d_loss, self.weights[i].T)  # propagate before updating
            self.weights[i] -= self.learning_rate * np.dot(self.layer_outputs[i].T, d_loss) / m
            self.biases[i] -= self.learning_rate * np.sum(d_loss, axis=0, keepdims=True) / m
    def train(self, X, y, epochs=10000):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(y, output)
            if epoch % 1000 == 0:
                self.losses.append(np.mean((output - y) ** 2))
    def predict(self, X):
        return np.round(self.forward(X))
# XOR: a classic problem that a single-layer network cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
nn = MultiLayerNN(input_size=2, hidden_layers=[4], output_size=1, learning_rate=1.0)
nn.train(X, y, epochs=10000)
predictions = nn.predict(X)
print("Predictions:\n", predictions)
plt.plot(nn.losses)
plt.title('Loss over Epochs')
plt.xlabel('Epochs (per 1000)')
plt.ylabel('Mean Squared Error')
plt.grid()
plt.show()
OUTPUT FILE
DISCUSSIONS
1. The Multi-Layer Perceptron effectively learns complex patterns in the data, leveraging its
hidden layers to improve classification accuracy.
2. The loss curve shows the mean squared error decreasing over epochs, confirming that
backpropagation is adjusting the weights and biases in the right direction.
3. The final predictions indicate how well the trained network reproduces the target
outputs after training.