ML Practical File
Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format.
1. Binning Method:
This method works on sorted data in order to smooth it. The data is divided
into segments (bins) of equal size, and each segment is handled separately. All values
in a segment can be replaced by the segment mean, or by the nearest segment
boundary value.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or
fall outside the clusters.
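As a minimal sketch of the binning method described above, the following splits sorted data into equal-size bins and replaces each value with its bin mean. The function name `smooth_by_bin_means` and the sample values are illustrative assumptions, not part of the original practical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` items,
    and replace every value in a bin with that bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical sample data
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

Smoothing by bin boundaries would follow the same loop, replacing each value with the closer of the bin's minimum and maximum instead of the mean.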
2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following methods:
1. Normalization:
It is done in order to scale the data values to a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute
“city” can be converted to “country”.
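The normalization step listed above can be sketched with min-max scaling, which is one common way to map values into a target range. The helper name `min_max_scale` and the sample ages are assumptions made for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so that the smallest maps to `new_min`
    and the largest maps to `new_max`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical sample data
ages = [20, 30, 40, 60]
print(min_max_scale(ages))             # scaled to 0.0 .. 1.0
print(min_max_scale(ages, -1.0, 1.0))  # scaled to -1.0 .. 1.0
```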
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. To address this, we use data reduction techniques,
which aim to increase storage efficiency and reduce data
storage and analysis costs. The various steps to data reduction are:
PROGRAM:
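import pandas as pd
import numpy as np

data = pd.read_csv("data1.csv")
print("data\n")
print(data.head())

# Assumption: the definition of null_data is missing from this listing;
# here it is taken to be a mask of the rows that contain at least one null.
null_data = data.isnull().any(axis=1)
print(null_data)
print(data[null_data])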
# Train-Test Splitting
def split_train_test_data(data, test_ratio):
    np.random.seed(42)
    shuffled = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled[:test_set_size]
    train_indices = shuffled[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test_data(data, 0.4)
print(f"Rows in train set: {len(train_set)}\nRows in test set: {len(test_set)} \n")
OUTPUT:
EXPERIMENT-2
AIM: Implement Simple Linear Regression.
Simple Linear Regression is a type of regression algorithm that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear, that is, a sloped straight line, hence the name Simple
Linear Regression.
Y = β0 + β1X
Here,
• Y is a dependent variable.
• X is an independent variable.
• β0 and β1 are the regression coefficients.
• β0 is the intercept or the bias that fixes the offset to a line.
• β1 is the slope or weight that specifies the factor by which X has an impact on Y.
Case-01: β1 < 0
Case-02: β1 = 0
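To make the roles of β0 and β1 concrete, here is a small sketch that estimates them with the ordinary least-squares closed form (β1 = Sxy / Sxx, β0 = mean(Y) − β1·mean(X)). The function name and the sample points are assumptions for illustration; the program below uses scikit-learn instead.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) for the least-squares line Y = beta0 + beta1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


# Hypothetical points lying exactly on Y = 1 + 2X
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
beta0, beta1 = fit_simple_linear_regression(xs, ys)
print("Intercept:", beta0, "Slope:", beta1)
```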
PROGRAM:
import pandas
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

data = pandas.read_csv('Linear_regression_basic.csv')
print("Dataset : \n",data)
print(data.describe())
x = DataFrame(data,columns=['x'])
y = DataFrame(data,columns=['y'])
# plt.figure(figsize=(10,10))
plt.title('LINEAR REGRESSION')
plt.xlabel('X axis') #label of X axis
plt.ylabel('Y axis') #label of Y axis
plt.ylim(0,7) #for Y axis limit
plt.xlim(0,8) #for X axis limit
plt.grid() #for grid
# plt.scatter(x,y,alpha=(0.7)) #Visibility of point
plt.scatter(x,y,color='green',s=50) #s for size
plt.show()
print("Data in x : \n",data['x'])
print("Data in y : \n",data['y'])
# Correlation Coefficient
corr,_ = pearsonr(data['x'],data['y'])
print("Correlation Coefficient : ",corr)
regression = LinearRegression()
regression.fit(x,y)
print("Regression Coefficient : ",regression.coef_)
# Intercept
print("Regression Intercept : ",regression.intercept_)
plt.figure(figsize=(10,6))
plt.title('LINEAR REGRESSION')
plt.xlabel('x --->')
plt.ylabel('y --->')
plt.ylim(0,15)
plt.xlim(0,15)
plt.scatter(x,y,alpha=(0.5))
plt.plot(x,regression.predict(x),color = "red",linewidth=2)
plt.show()
# score() returns the coefficient of determination (R^2), not a classification accuracy
print("R^2 score : ",regression.score(x,y))
print(" New data\n")
# Use .loc to avoid chained assignment on the DataFrame
data.loc[2, 'y'] = 7
print(data)
X = DataFrame(data,columns=['x'])
Y = DataFrame(data,columns=['y'])
plt.title('LINEAR REGRESSION NEW')
plt.xlabel('X axis') #label of X axis
plt.ylabel('Y axis') #label of Y axis
plt.ylim(0,10) #for Y axis limit
plt.xlim(0,10) #for X axis limit
plt.grid() #for grid
# plt.scatter(x,y,alpha=(0.7)) #Visibility of point
plt.scatter(X,Y,color='green',s=50) #s for size
plt.show()
regression = LinearRegression()
regression.fit(X,Y)
print("Regression Coefficient : ",regression.coef_)
print("Regression Intercept : ",regression.intercept_)
plt.figure(figsize=(10,6))
plt.title('LINEAR REGRESSION NEW')
plt.xlabel('x --->')
plt.ylabel('y --->')
plt.ylim(0,15)
plt.xlim(0,15)
plt.scatter(X,Y)
plt.plot(X,regression.predict(X),color = "red",linewidth=2)
plt.show()
print("New score : ",regression.score(X,Y))
OUTPUT:
EXPERIMENT-3
AIM: Simulate Multiple Linear Regression.
In multiple linear regression, the dependent variable depends on more than one independent
variable.
Y = β0 + β1X1 + β2X2 + … + βnXn
Here,
• Y is a dependent variable.
• X1, X2, …., Xn are independent variables.
• β0, β1,…, βn are the regression coefficients.
• βj (1<=j<=n) is the slope or weight that specifies the factor by
which Xj has an impact on Y.
PROGRAM:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Assumption: the data-loading step is missing from this listing; `data` is taken
# to be a DataFrame with the columns area, bedrooms, age and price.

# Train-Test Splitting
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(f"Rows in train set: {len(train_set)}\nRows in test set: {len(test_set)} \n")
print("Train set : ",train_set)
print("Test set : ",test_set)
reg = linear_model.LinearRegression()
reg.fit(train_set[["area","bedrooms","age"]],train_set["price"])
print("Regression Coefficient : ",reg.coef_) # β1, β2, β3
print("Intercept : ",reg.intercept_) # β0
print("predicted value of test set : ")
print(reg.predict(test_set[["area","bedrooms","age"]]))
OUTPUT:
EXPERIMENT-4
AIM: Implement Decision Trees.
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas
Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such a final node is called a leaf node.
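Step-2 above can be illustrated with a short sketch that scores each attribute and picks the best one. Information gain (based on entropy) is assumed here as the Attribute Selection Measure; CART implementations commonly use the Gini index instead. The toy rows, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Return the attribute with the highest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy dataset: the label depends only on "outlook"
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]
print("Best attribute:", best_attribute(rows, labels))
```

A full tree builder would apply `best_attribute` recursively to each subset, as Step-5 describes.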
Implementation:
import pandas as pd
Output:
EXPERIMENT-5
AIM: Implement Random Forest Classification.
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and predicts the final output based on the
majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the
risk of overfitting compared with a single decision tree.
Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions with each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
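The majority-vote rule in Step-5 can be sketched in a few lines. The per-tree predictions below are hypothetical values chosen for illustration; in the program that follows, scikit-learn's `RandomForestClassifier` handles the aggregation internally.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees for one data point."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from N = 5 trees for three new data points
votes_per_point = [
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
final = [majority_vote(votes) for votes in votes_per_point]
print(final)
```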
Python Code:
# Random Forest Classification
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Training the Random Forest Classification model on the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Output:
[[63 5]
[ 4 28]]
EXPERIMENT-6
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms,
which helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
The name Naïve Bayes combines two words, Naïve and Bayes, which can be described
as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
Each feature individually contributes to identifying it as an apple, without depending
on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(B|A) is the likelihood: the probability of the evidence B given that the
hypothesis A is true.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether or not to play on a particular day
according to the weather conditions. To solve this problem, we need to follow the steps below:
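As a rough illustration of the idea, the sketch below applies Bayes' theorem to a small, hypothetical weather/play table (the records are invented for this example and are not the dataset used in the practical). Only one feature is used, so the naïve independence assumption does not yet come into play.

```python
def posterior(records, weather, play):
    """P(play | weather) computed from frequency counts via Bayes' theorem."""
    total = len(records)
    n_play = sum(1 for w, p in records if p == play)
    n_weather = sum(1 for w, p in records if w == weather)
    n_both = sum(1 for w, p in records if w == weather and p == play)
    likelihood = n_both / n_play    # P(weather | play)
    prior = n_play / total          # P(play)
    evidence = n_weather / total    # P(weather)
    return likelihood * prior / evidence


# Hypothetical (weather, play) observations
records = (
    [("Sunny", "Yes")] * 3 + [("Sunny", "No")] * 2
    + [("Overcast", "Yes")] * 3
    + [("Rainy", "Yes")] * 1 + [("Rainy", "No")] * 1
)
p_yes = posterior(records, "Sunny", "Yes")
p_no = posterior(records, "Sunny", "No")
print("P(Yes|Sunny) =", p_yes)
print("P(No|Sunny)  =", p_no)
print("Decision on a sunny day:", "Play" if p_yes > p_no else "Do not play")
```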
PROGRAM:
temp=np.array(temp)
label=np.array(label)
for r in range(0,len(wheather)):
    print("%10s\t%d\t%10s\t%d\t%10s\t%d\t%10s\t%d\t%10s\t%d"%(wheather[r],wheather_encoded[r],temp[r],temp_encoded[r],humidity[r],humidity_encoded[r],windy[r],windy_encoded[r],play[r],label[r]))
OUTPUT:
EXPERIMENT-7
AIM: Implement K-Nearest Neighbors (K-NN), K-means.
K-NN Algorithm:
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and the available cases and
puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an
action on the dataset.
o At the training phase, the K-NN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
K-NN Algorithm:
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the available data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is
maximum.
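The steps above can be sketched in plain Python as follows. The function name `knn_classify`, the toy points, and the labels are assumptions made for illustration; a library implementation such as scikit-learn's `KNeighborsClassifier` would normally be used in practice.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, new_point, k):
    """Classify `new_point` by majority vote among its k nearest training points."""
    # Step-2: Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step-3: keep the k nearest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step-4 and Step-5: count the labels and return the most frequent one
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]


# Hypothetical training data with two well-separated categories
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 2), k=3))
print(knn_classify(points, labels, (6, 5), k=3))
```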
Program:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
print("IRIS\nFeature Names:\n",iris.feature_names)
print("\nTarget Names:\n",iris.target_names)
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['target'] = iris.target
print("\nDATASET:\n",df.head())
print("Shape: ",df.shape)
Output:
k-Means Algorithm:
K-Means Clustering is an Unsupervised Learning algorithm that groups an unlabelled dataset
into different clusters. Here K defines the number of predefined clusters that need to be created in
the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so
on.
It allows us to cluster the data into different groups and is a convenient way to discover the categories
of groups in an unlabelled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of
this algorithm is to minimize the sum of distances between the data points and the centroids of their
corresponding clusters.
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters,
and repeats the process until the clusters no longer change. The value of k should be
predetermined in this algorithm.
k-Means ALGORITHM:
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest
centroid.
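The assign-and-update loop in these steps can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation used in the program below: it assumes the first k points serve as initial centroids, places each new centroid at the mean of its cluster, and uses hypothetical sample points.

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    # Assumption: the first k points are used as the initial centroids
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        # Assign each point to its closest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments


# Hypothetical sample points forming two groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = k_means(points, k=2)
print("Centroids:", centroids)
print("Assignments:", assignments)
```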
Program:
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
df = pd.read_csv("income.csv")
print("DATAFRAME:\n",df.head())
plt.scatter(df.Age,df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
print("Y predicted:",y_predicted)
df['cluster']=y_predicted
print("New dataframe:\n",df.head())
print("\nCluster Centers : ",km.cluster_centers_)
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
plt.show()
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed Support Vector Machine.
PROGRAM:
import pandas as pd
from sklearn.datasets import load_digits
digits = load_digits()
print("Digits Target : ",digits.target)
df = pd.DataFrame(digits.data,digits.target)
df['target'] = digits.target
print("\nDATASET DIGIT\n",df.head())
OUTPUT:
APRIORI ALGORITHM:
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to
work on databases that contain transactions. With the help of these association rules, it
determines how strongly or how weakly two objects are connected. This algorithm uses a
breadth-first search and a Hash Tree to calculate the itemset associations efficiently. It is an iterative
process for finding the frequent itemsets in a large dataset.
This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for
market basket analysis and helps to find products that are likely to be bought together. It can also be
used in the healthcare field to find drug reactions for patients.
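The level-wise search for frequent itemsets can be sketched as below. This is a simplified illustration that omits the hash tree and the subset-pruning step; the transactions, the helper names, and the 0.5 support threshold are assumptions. The program that follows uses the `mlxtend` library instead.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) search: size-1 itemsets first, then size 2, and so on."""
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        # Keep only the candidates that meet the minimum support
        level = {c: support(transactions, c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Build larger candidates only from itemsets that were frequent at this level
        size += 1
        current = list({a | b for a, b in combinations(list(level), 2) if len(a | b) == size})
    return frequent


# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)

# Confidence of the rule bread -> milk = support(bread, milk) / support(bread)
confidence = support(transactions, {"bread", "milk"}) / support(transactions, {"bread"})
print("confidence(bread -> milk) =", confidence)
```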
PROGRAM:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
data = pd.read_excel('Online_Retail.xlsx')
print("DataSet:",data.head())
print("Data Columns : ",data.columns)
print("Data Shape : ",data.shape)
# Stripping extra spaces in the description
data['Description'] = data['Description'].str.strip()
Neural networks (NN), also called artificial neural networks (ANN) are a subset of learning algorithms
within the machine learning field that are loosely based on the concept of biological neural networks.
Andrey Bulezyuk, who is a German-based machine learning specialist with more than five
years of experience, says that “neural networks are revolutionizing machine learning because they are
capable of efficiently modelling sophisticated abstractions across an extensive range of disciplines
and industries.”
A deliberate activation function for every hidden layer. In this simple neural network Python
tutorial, we’ll employ the Sigmoid activation function.
There are several types of neural networks. In this project, we are going to create a feed-forward,
or perceptron, neural network. This type of ANN relays data directly from the front to the back.
Training feed-forward neurons often needs back-propagation, which provides the network with
a corresponding set of inputs and outputs. When the input data is transmitted into the neuron, it is
processed, and an output is generated.
Summarizing an Artificial Neural Network:
1. Take inputs
2. Add bias (if required)
3. Assign random weights to input features
4. Run the code for training.
5. Find the error in prediction.
6. Update the weight by gradient descent algorithm.
7. Repeat the training phase with updated weights.
8. Make predictions.
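The summary above can be sketched for a single sigmoid neuron in plain Python. This is an illustrative sketch under stated assumptions, not the listing from the practical: the function names, the learning rate, the per-sample update, and the use of a bias term are all choices made here. The three training rows match the training set that appears in the code below.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, x):
    """Step 8: weighted sum of the inputs passed through the sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(inputs, targets, iterations=10000, learning_rate=1.0, seed=1):
    rng = random.Random(seed)
    bias = 0.0                                                   # step 2
    weights = [rng.uniform(-1, 1) for _ in range(len(inputs[0]))]  # step 3
    for _ in range(iterations):                                  # step 7
        for x, target in zip(inputs, targets):
            output = predict(weights, bias, x)                   # step 4
            error = target - output                              # step 5
            # Step 6: gradient-descent update using the sigmoid derivative
            gradient = error * output * (1.0 - output)
            weights = [w + learning_rate * gradient * xi for w, xi in zip(weights, x)]
            bias += learning_rate * gradient
    return weights, bias


inputs = [[0, 1, 1], [1, 0, 0], [1, 0, 1]]   # step 1
targets = [1, 0, 1]
weights, bias = train_neuron(inputs, targets)
for x in inputs:
    print(x, round(predict(weights, bias, x), 3))
```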
Python Code:
from numpy import array, random

# Note: this listing is incomplete; the weight initialisation, the learn()
# method and the weight update are not included in it.
class NeuralNet(object):
    def __init__(self):
        # Generate random numbers
        random.seed(1)

    # Train the neural network and adjust the weights each time.
    def train(self, inputs, outputs, training_iterations):
        # The built-in range replaces the Python 2 style xrange import
        for iteration in range(training_iterations):
            # Pass the training set through the network.
            output = self.learn(inputs)

if __name__ == "__main__":
    # Initialize
    neural_network = NeuralNet()
    # The training set.
    inputs = array([[0, 1, 1], [1, 0, 0], [1, 0, 1]])
    outputs = array([[1, 0, 1]]).T
Python Code:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#import the breast cancer dataset
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
label=cancer["target"]
#splitting the model into training and testing set
X_train, X_test, y_train, y_test = train_test_split(df, label, test_size=0.30,
                                                    random_state=101)
#training a logistic regression model
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print("Accuracy = "+ str(accuracy_score(y_test,predictions)))
#defining various steps required for the genetic algorithm
def initilization_of_population(size,n_feat):
    population = []
    for i in range(size):
        # np.bool was removed from recent NumPy releases; the built-in bool is equivalent
        chromosome = np.ones(n_feat,dtype=bool)
        chromosome[:int(0.3*n_feat)]=False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population
def fitness_score(population):
    scores = []
    for chromosome in population:
        logmodel.fit(X_train.iloc[:,chromosome],y_train)
        predictions = logmodel.predict(X_test.iloc[:,chromosome])
        scores.append(accuracy_score(y_test,predictions))
    scores, population = np.array(scores), np.array(population)
    inds = np.argsort(scores)
    return list(scores[inds][::-1]), list(population[inds,:][::-1])
def selection(pop_after_fit,n_parents):
    population_nextgen = []
    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])
    return population_nextgen
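def crossover(pop_after_sel):
    # Copy the list so that appending children does not modify the parents list
    population_nextgen = list(pop_after_sel)
    for i in range(len(pop_after_sel)):
        # Copy the parent so that the crossover does not overwrite it in place
        child = pop_after_sel[i].copy()
        child[3:7] = pop_after_sel[(i+1)%len(pop_after_sel)][3:7]
        population_nextgen.append(child)
    return population_nextgen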
def mutation(pop_after_cross,mutation_rate):
    population_nextgen = []
    for i in range(0,len(pop_after_cross)):
        # Copy so that mutation does not alter the input chromosome in place
        chromosome = pop_after_cross[i].copy()
        for j in range(len(chromosome)):
            if random.random() < mutation_rate:
                chromosome[j]= not chromosome[j]
        population_nextgen.append(chromosome)
    #print(population_nextgen)
    return population_nextgen
def generations(size,n_feat,n_parents,mutation_rate,n_gen,X_train,
                X_test, y_train, y_test):
    best_chromo= []
    best_score= []
    population_nextgen=initilization_of_population(size,n_feat)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print(scores[:2])
        pop_after_sel = selection(pop_after_fit,n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross,mutation_rate)
        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])
    return best_chromo,best_score
OUTPUT: