The Dalai Lama Institute for Higher Education
Bangalore University
Machine Learning Lab Manual
BCA 6th Sem
LIST OF PROGRAMS
1. Install and set up Python and essential libraries like NumPy and pandas
2. Introduce scikit-learn as a machine learning library.
3. Install and set up scikit-learn and other necessary tools.
4. Write a program to Load and explore the dataset of .CVS and excel files using
pandas.
5. Write a program to Visualize the dataset to gain insights using Matplotlib or Seaborn
by plotting scatter plots, bar charts.
6. Write a program to Handle missing data, encode categorical variables, and perform
feature scaling.
7. Write a program to implement a k-Nearest Neighbours (k-NN) classifier using scikit-
learn and Train the classifier on the dataset and evaluate its performance.
8. Write a program to implement a linear regression model for regression tasks and
Train the model on a dataset with continuous target variables.
9. Write a program to implement a decision tree classifier using scikit-learn and
visualize the decision tree and understand its splits.
10. Write a program to Implement K-Means clustering and Visualize clusters.
1. install and set up Python and essential libraries like NumPy and Pandas.
Install Python: If you haven't already installed Python, you can download it from the official
website:
To verify (terminal)
python --version
Install pip: pip is a package manager for Python that allows you to easily install and manage
libraries. Most recent versions of Python come with pip pre-installed. You can verify if pip is
installed by running the following command in your terminal or command prompt:
pip --version
Install NumPy and pandas: Once you have Python and pip installed, you can use pip to install
NumPy and pandas by running the following commands in your terminal or command prompt:
#In terminal
pip install numpy
pip install pandas
This will download and install NumPy and Pandas along with any dependencies they require.
Verify installation: After installing NumPy and pandas, you can verify that they were installed
correctly by running the following commands in Python's interactive mode or a Python script:
import numpy
import pandas
print(numpy.__version__)
print(pandas.__version__)
These commands should print the versions of NumPy and pandas that were installed.
Output:
2. Introduce sci-kit-learn as a machine learning library.
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction via a consistence interface
in Python. This library, which is largely written in Python, is built upon NumPy,
SciPy and Matplotlib.
Installation
If you already installed NumPy and Scipy, the following are the two easiest ways to install
scikit-learn −
Using pip
The following command can be used to install sci-kit-learn via pip
pip install -U scikit-learn
Features
Rather than focusing on loading, manipulating and summarising data, Scikit-learn library is
focused on modeling the data. Some of the most popular groups of models provided by Sklearn
are as follows −
Supervised Learning algorithms − Almost all the popular supervised learning algorithms,
like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are the part of
scikit-learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular
unsupervised learning algorithms from clustering, factor analysis, PCA (Principal Component
Analysis) to unsupervised neural networks.
Clustering − This model is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data.
3. Install and set up scikit-learn and other necessary tools.
scikit-learn, a powerful Python library for machine learning. Here are the steps to set it up:
Install Python: If you haven’t already installed Python, download and install the latest version
of Python 3 from the official Python website.
Install scikit-learn using pip: Open your terminal or command prompt and run the following
command:
pip install -U scikit-learn
To verify your installation, you can use the following commands:
python -m pip show scikit-learn
# To see which version and where scikit-learn is installed
python -m pip freeze
# To see all packages installed in the active virtual environment
import sklearn
import numpy
import pandas
import matplotlib
print(sklearn.__version__)
print(numpy.__version__)
print(pandas.__version__)
print(matplotlib.__version__)
These commands should print the versions of scikit-learn and other libraries that were
installed.
Output:
4. Write a program to Load and explore the dataset of .CVS and excel files using
pandas.
import pandas as pd
def explore_dataset(file_path):
# Check if the file is a CSV or Excel file
if file_path.endswith('.csv'):
# Load CSV file into a pandas DataFrame
df = pd.read_csv(file_path)
elif file_path.endswith('.xlsx'):
# Load Excel file into a pandas DataFrame
df = pd.read_excel(file_path)
else:
print("Unsupported file format. Please provide a CSV or Excel file.")
return
# Display basic information about the DataFrame
print("Dataset information:")
print([Link]())
# Display the first few rows of the DataFrame
print("\nFirst few rows of the dataset:")
print([Link]())
# Display summary statistics for numerical columns
print("\nSummary statistics:")
print([Link]())
# Display unique values for categorical columns
print("\nUnique values for categorical columns:")
for column in df.select_dtypes(include='object').columns:
print(f"{column}: {df[column].unique()}")
# Example usage
file_path = '[Link]'
# Change this to the path of your CSV or Excel file
explore_dataset(file_path)
Output:
5. Write a program to Visualize the dataset to gain insights using Matplotlib or
Seaborn by plotting scatter plots, and bar charts.
import pandas as pd
import [Link] as plt
import seaborn as sns
def visualize_dataset(file_path):
# Load the dataset into a pandas DataFrame
df = pd.read_csv(file_path)
# Assuming it's a CSV file, change accordingly if it's an Excel file
# Plot scatter plots
[Link](df)
[Link]("Pairplot of the Dataset")
[Link]()
# Plot bar chart for categorical column (assuming the first column is categorical)
if [Link][:, 0].dtype == 'object':
[Link](x=[Link][0], data=df)
[Link]("Bar Chart of Categorical Column")
[Link]([Link][0])
[Link]("Count")
[Link]()
else:
print("No categorical column found to plot bar chart.")
# Example usage
file_path = '[Link]' # Change this to the path of your CSV file
visualize_dataset(file_path)
Output:
6. Write a program to Handle missing data, encode categorical variables, and
perform feature scaling.
import pandas as pd
from [Link] import load_iris
from [Link] import SimpleImputer
from [Link] import OneHotEncoder, StandardScaler
# Load Iris dataset
iris = load_iris()
iris_df = [Link](data=[Link], columns=iris.feature_names)
iris_df['target'] = [Link]
def preprocess_dataset(df):
# Handle missing data (Iris dataset doesn't have missing values, but we'll simulate
some)
[Link][::10, 0] = float('NaN')
# Simulate missing values in the first column
imputer = SimpleImputer(strategy='mean')
df[[Link]] = imputer.fit_transform(df[[Link]])
# Encode categorical variable (if applicable)
# Since Iris dataset doesn't have categorical variables, we'll skip this step
# Perform feature scaling
scaler = StandardScaler()
df[[Link][:-1]] = scaler.fit_transform(df[[Link][:-1]])
return df
# Preprocess Iris dataset
preprocessed_df = preprocess_dataset(iris_df)
# Display preprocessed dataset
print("Preprocessed dataset:")
print(preprocessed_df.head())
Output:
7. Write a program to implement a k-Nearest Neighbours (k-NN) classifier using
scikit-learn and Train the classifier on the dataset and evaluate its performance.
import numpy as np
import pandas as pd
from [Link] import load_iris
from sklearn.model_selection import train_test_split
from [Link] import KNeighborsClassifier
from [Link] import accuracy_score, classification_report
# Load Iris dataset
iris = load_iris()
X = [Link]
y = [Link]
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize the k-NN classifier
k = 3 # Number of neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=k)
# Train the classifier
knn_classifier.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = knn_classifier.predict(X_test)
# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
Output:
8. Write a program to implement a linear regression model for regression tasks
and Train the model on a dataset with continuous target variables.
import numpy as np
import pandas as pd
from [Link] import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from [Link] import mean_squared_error, r2_score
# Load Boston Housing dataset
boston = load_boston()
X = [Link]
y = [Link]
# Convert the data to a pandas DataFrame for easier manipulation
boston_df = [Link](data=X, columns=boston.feature_names)
boston_df['target'] = y
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Initialize Linear Regression model
linear_regression = LinearRegression()
# Train the model
linear_regression.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = linear_regression.predict(X_test)
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
Output:
9. Write a program to implement a decision tree classifier using scikit-learn and
visualize the decision tree and understand its splits.
import numpy as np
import pandas as pd
from [Link] import load_iris
from sklearn.model_selection import train_test_split
from [Link] import DecisionTreeClassifier, plot_tree
import [Link] as plt
# Load Iris dataset
iris = load_iris()
X = [Link]
y = [Link]
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize Decision Tree classifier
decision_tree = DecisionTreeClassifier()
# Train the classifier
decision_tree.fit(X_train, y_train)
# Visualize the decision tree
[Link](figsize=(12, 8))
plot_tree(decision_tree, feature_names=iris.feature_names,
class_names=iris.target_names, filled=True)
[Link]()
Output:
10. Write a program to Implement K-Means clustering and Visualize clusters.
import numpy as np
import [Link] as plt
from [Link] import make_blobs
from [Link] import KMeans
# Generate sample data
X, y = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
# Create a K-Means clusterer with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
# Fit the data
[Link](X)
# Get cluster labels
labels = kmeans.labels_
# Plot the data with cluster labels
[Link](figsize=(8, 6))
[Link](X[:, 0], X[:, 1], c=labels, cmap='viridis')
[Link](kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100,
c='red', label='Centroids')
[Link]('K-Means Clustering')
[Link]('X')
[Link]('Y')
[Link]()
[Link]()
Output: