Practical File
1. Implement data pre-processing
2. Deploy Simple Linear Regression
3. Simulate Multiple Linear Regression
4. Implement Decision Tree
5. Deploy Random Forest classification
6. Simulate Naïve Bayes algorithm
7. Implement K-Nearest Neighbors (K-NN)
8. Deploy Support Vector Machine (SVM)
9. Simulate Artificial Neural Network
10. Implement the Genetic Algorithm code
Practical No. 1
Aim: Implement data pre-processing
• Pre-processing refers to the transformations applied to the data before feeding it to the algorithm.
• Data pre-processing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis.
Need of Data Preprocessing
• To achieve better results from the applied model in Machine Learning projects, the data must be in a proper format. Some Machine Learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, so to execute it, null values have to be managed in the original raw data set.
• Another aspect is that the data set should be formatted in such a way that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, and the best of them is chosen.
Steps:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling
Python Code:
Figure 1: Dataset uploaded
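The snippet below references a ColumnTransformer ct and a data_set that are not defined in the extract; a minimal sketch of the assumed setup is given first (the file name Data.csv is an assumption):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the raw dataset (file name assumed)
data_set = pd.read_csv('Data.csv')

# One-hot encode the first (categorical) column, pass the rest through
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')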
# Extracting independent (x) and dependent (y) variables
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 3].values

# Encoding categorical data (np.float is deprecated, so plain float is used)
x = np.array(ct.fit_transform(x), dtype=float)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Output: Dividing Dataset into training and testing datasets
Figure 5: Feature scaling
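The feature-scaling code behind Figure 5 did not survive extraction; a minimal sketch, assuming scikit-learn's StandardScaler fitted on the training set only (so no information leaks from the test set):

from sklearn.preprocessing import StandardScaler

# Standardize features: fit on the training set, reuse the same parameters on the test set
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)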
Practical No. 2
Aim: Deploy Simple Linear Regression
Regression searches for relationships among variables.
For example, you can observe several employees of some company and try to understand how
their salaries depend on the features, such as experience, level of education, role, city they
work in, and so on.
This is a regression problem where data related to each employee represent one observation.
The presumption is that the experience, education, role, and city are the independent features,
while the salary depends on them.
Similarly, you can try to establish a mathematical dependence of the prices of houses on
their areas, numbers of bedrooms, distances to the city center, and so on.
In regression analysis, you typically consider some phenomenon of interest and have
a number of observations. Each observation has two or more features. Following the
assumption that (at least) one of the features depends on the others, you try to establish a
relation among them.
In other words, you need to find a function that maps some features or variables to others
sufficiently well.
The dependent features are called the dependent variables, outputs, or responses.
The independent features are called the independent variables, inputs, or predictors.
Regression problems usually have one continuous and unbounded dependent variable. The
inputs, however, can be continuous, discrete, or even categorical data such as gender,
nationality, brand, and so on.
For Simple Linear Regression, the following 5 steps have to be performed:
1. Importing the dataset.
2. Splitting the dataset into a training set and a testing set (two dimensions of X and y per set). Normally, the testing set should be 5% to 30% of the dataset.
3. Visualizing the training set and testing set to double-check (you can bypass this step if you want).
4. Initializing the regression model and fitting it using the training set (both X and y).
5. Let's predict.
# Importing the dataset
data_set = pd.read_csv('Salary_Data.csv')
Figure 1: Output of the uploaded dataset
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 1].values
Step 4: Fitting the Simple Linear Regression model to the Training Set
This will create a prediction vector y_pred and a vector x_pred, which contain the predictions for the test dataset and the training set respectively.
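The fitting and prediction code itself was lost in extraction; a minimal sketch, assuming scikit-learn's LinearRegression and a train/test split as in Practical 1:

from sklearn.linear_model import LinearRegression

# Fit the simple linear regression model on the training set
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# y_pred: predictions on the test set; x_pred: predictions on the training set
y_pred = regressor.predict(x_test)
x_pred = regressor.predict(x_train)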
Figure 6: Prediction variables
Practical Outcomes:
• Model the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
• Forecast new observations, such as weather forecasting according to temperature, or the revenue of a company according to its investments in a year.
Practical No. 3
Aim: Simulate Multiple Linear Regression
• For MLR, the dependent or target variable (Y) must be continuous/real, while the predictor or independent variables may be continuous or categorical.
• Each feature variable must model a linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data points.
“Multiple Linear Regression is one of the important regression algorithms which models the
linear relationship between a single dependent continuous variable and more than one
independent variable.”
Python Code:
# Extracting Independent and dependent Variables
# (nm is numpy, and ct is a ColumnTransformer, set up as in Practical 1)
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values

# Encoding categorical data (nm.float is deprecated, so float is used)
x = nm.array(ct.fit_transform(x), dtype=float)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
Figure 3: Conversion of categorical data to numerical
Step 4: Splitting training data and testing data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
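Step 5 (fitting the regressor to the training set) is missing from the extract; a minimal sketch, assuming the same LinearRegression estimator as in Practical 2:

from sklearn.linear_model import LinearRegression

# Step 5: Fit the multiple linear regression model on the training set
regressor = LinearRegression()
regressor.fit(x_train, y_train)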
Step 6: Prediction of Test set results

# Predicting the Test set result
y_pred = regressor.predict(x_test)

We can also check the score for the training dataset and the test dataset. Below is the code for it:

print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
Practical Outcomes:
• A linear relationship should exist between the Target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
Applications of MLR
Practical No. 4
Aim: Implement Decision Tree
Algorithm:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain possible values for the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final nodes are called leaf nodes.
Python Code:
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')  # the read_csv line was truncated; this file name is assumed
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling (reconstructed; the fitting lines were lost in extraction)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
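The classifier training and the plotting grid are absent from the extract; a minimal sketch, assuming the entropy criterion and the decision-boundary plotting pattern used in the later practicals:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier

# Train the decision tree on the scaled training set
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Dense grid over the feature space for the decision-boundary plot below
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))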
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Training set)')
plt.show()
Output:
Practical Outcomes:
• Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
• The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Practical No. 5
Aim: Deploy Random Forest classification
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
Python Code:
# Random Forest Classification

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training the Random Forest Classification model on the Training set
# (the constructor call was lost in extraction; 10 trees with the entropy
# criterion is assumed here)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
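The grid construction for the decision-boundary plot was lost in extraction; a minimal sketch, matching the pattern used in the other practicals (X_set, y_set are assumed to be the training split, as the plot title suggests):

from matplotlib.colors import ListedColormap

# Build a dense grid over the feature space for the boundary plot
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))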
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Practical Outcomes:
• There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate results rather than guessed ones.
• The predictions from each tree must have very low correlations.
• It takes less training time compared to other algorithms.
• It predicts output with high accuracy, and it runs efficiently even for large datasets.
• It can also maintain accuracy when a large proportion of data is missing.
There are four main sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Practical No. 6
Aim: Simulate Naïve Bayes algorithm
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
Python Code:
# Naive Bayes

# Importing the libraries and the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
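The training step and the start of the visualisation code fall on a page missing from the extract; a minimal sketch, assuming scikit-learn's GaussianNB and the grid setup used in the other practicals:

from sklearn.naive_bayes import GaussianNB
from matplotlib.colors import ListedColormap

# Fit the Gaussian Naive Bayes classifier on the training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Visualising the Training set results
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))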
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Practical Outcomes:
• The random forest algorithm is a machine learning algorithm that is easy to use and flexible. It uses ensemble learning, which enables organizations to solve regression and classification problems.
• It is an ideal algorithm for developers because it solves the problem of overfitting of datasets. It is a very resourceful tool for making the accurate predictions needed in strategic decision making in organizations.
Practical No. 7
Aim: Implement K-Nearest Neighbors (K-NN)
The K-NN working can be explained on the basis of the below algorithm:
Python Code:
# K-Nearest Neighbors (K-NN)

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training the K-NN model on the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Visualising the Training set results (X_set, y_set are assumed to be the
# training split, matching the plot title)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Practical Outcomes:
• Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm.
• KNN is very easy to implement. There are only two parameters required to implement KNN: the value of K and the distance function (e.g. Euclidean, Manhattan, etc.).
Applications:
• KNN is widely used in almost all industries, such as healthcare, financial
services, eCommerce, political campaigns, etc.
• Healthcare companies use the KNN algorithm to determine if a patient is
susceptible to certain diseases and conditions.
• Financial institutions predict credit card ratings or qualify loan applications and
the likelihood of default with the help of the KNN algorithm.
• Political analysts classify potential voters into separate classes based on whom
they are likely to vote for.
Practical No. 8
Aim: Deploy Support Vector Machine (SVM)
Python Code:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Feature scaling (the fitting lines were lost; a StandardScaler is assumed)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
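# The SVM training step was lost in extraction; a minimal sketch is assumed
# here, using scikit-learn's SVC with a linear kernel:
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)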
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Output:
Practical Outcomes:
• It works really well with a clear margin of separation
• It is effective in high dimensional spaces.
• It is effective in cases where the number of dimensions is greater than the number
of samples.
• It uses a subset of training points in the decision function (called support vectors), so it
is also memory efficient.
Applications:
• Face detection – SVMs classify parts of the image as face and non-face and create a square boundary around the face.
• Text and hypertext categorization – SVMs allow text and hypertext categorization for both inductive and transductive models. They use training data to classify documents into different categories, categorizing on the basis of the score generated and then comparing it with the threshold value.
• Classification of images – Use of SVMs provides better search accuracy for image
classification. It provides better accuracy in comparison to the traditional query-
based searching techniques.
• Bioinformatics – It includes protein classification and cancer classification. We use
SVM for identifying the classification of genes, patients on the basis of genes and
other biological problems.
• Protein fold and remote homology detection – Apply SVM algorithms for
protein remote homology detection.
• Handwriting recognition – SVMs are widely used to recognize handwritten characters.
• Generalized predictive control(GPC) – Use SVM based GPC to control
chaotic dynamics with useful parameters.
Practical No. 9
Aim: Simulate Artificial Neural Network
Python Code:
from joblib.numpy_pickle_utils import xrange
from numpy import *

class NeuralNet(object):
    def __init__(self):
        # Seed the random number generator so every run is reproducible
        random.seed(1)
    # (the rest of the class body, including the "thinks" step, was lost in extraction)

neural_network = NeuralNet()  # the variable name was lost; neural_network is assumed
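Most of the network body was lost in extraction; below is a minimal single-neuron sketch in the same spirit (sigmoid activation, weights adjusted by the error-weighted gradient). The class name, training data, and iteration count are illustrative assumptions, not the original code:

import numpy as np

class SimpleNeuralNet(object):
    def __init__(self):
        np.random.seed(1)
        # One neuron with 3 inputs: random weights in [-1, 1)
        self.weights = 2 * np.random.random((3, 1)) - 1

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def _sigmoid_derivative(self, s):
        # Derivative expressed in terms of the sigmoid output s
        return s * (1 - s)

    def think(self, inputs):
        # Forward pass: weighted sum pushed through the sigmoid
        return self._sigmoid(np.dot(inputs, self.weights))

    def train(self, inputs, outputs, iterations):
        for _ in range(iterations):
            prediction = self.think(inputs)
            error = outputs - prediction
            # Adjust weights by the error scaled by the sigmoid gradient
            self.weights += np.dot(inputs.T, error * self._sigmoid_derivative(prediction))

# Illustrative training set: the output equals the first input column
training_inputs = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
training_outputs = np.array([[0, 1, 1, 0]]).T

net = SimpleNeuralNet()
net.train(training_inputs, training_outputs, 10000)
print(net.think(np.array([1, 0, 0])))  # expected: close to 1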
Output:
Practical Outcomes:
• Know the main provisions of neuromathematics;
• Know the main types of neural networks;
• Know and apply the methods of training neural networks;
• Know the applications of artificial neural networks;
• Be able to formalize a problem and solve it by using a neural network.
Applications of ANN
a. Classification of data:
Based on a set of data, our trained neural network predicts whether it is a dog or a cat.
b. Anomaly detection:
Given the details about the transactions of a person, it can say whether a transaction is fraudulent or not.
c. Speech recognition:
We can train our neural network to recognize speech patterns. Example: Siri, Alexa, Google Assistant.
d. Audio generation:
Given audio files as input, it can generate new music based on various factors like genre, singer, and others.
e. Time series analysis:
f. Spell checking:
We can train a neural network that detects misspelled words and can also suggest similar correct words. Example: Grammarly.
g. Character recognition:
h. Machine translation:
We can develop a neural network that translates one language into another.
i. Image processing:
We can train a neural network to process an image and extract pieces of information from it.
Practical No. 10
Aim: Implement the Genetic Algorithm code
One of the advanced algorithms in the field of computer science is the Genetic Algorithm, inspired by the human genetic process of passing genes from one generation to the next. It is generally used for optimization purposes, is heuristic in nature, and can be applied in many places, e.g. solving NP problems, game theory, and code-breaking.
Here are quick steps for how the genetic algorithm works:
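(The original list of steps was lost in extraction; the steps below are reconstructed from the code that follows.)
1. Initialize a population of candidate solutions (chromosomes).
2. Compute a fitness score for each chromosome.
3. Select the fittest chromosomes as parents.
4. Apply crossover between parents to produce children.
5. Randomly mutate the children.
6. Repeat steps 2-5 for the desired number of generations.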
# Assumed imports and a pre-split dataset (X_train, X_test, y_train, y_test)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Training a logistic regression model
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
print("Accuracy = " + str(accuracy_score(y_test, predictions)))

# Defining the steps required for the genetic algorithm
def initilization_of_population(size, n_feat):
    population = []
    for i in range(size):
        # np.bool is deprecated, so the builtin bool is used
        chromosome = np.ones(n_feat, dtype=bool)
        chromosome[:int(0.3 * n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population

def fitness_score(population):
    scores = []
    for chromosome in population:
        logmodel.fit(X_train.iloc[:, chromosome], y_train)
        predictions = logmodel.predict(X_test.iloc[:, chromosome])
        scores.append(accuracy_score(y_test, predictions))
    scores, population = np.array(scores), np.array(population)
    inds = np.argsort(scores)
    return list(scores[inds][::-1]), list(population[inds, :][::-1])

def selection(pop_after_fit, n_parents):
    population_nextgen = []
    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])
    return population_nextgen
def crossover(pop_after_sel):
    population_nextgen = pop_after_sel
    for i in range(len(pop_after_sel)):
        child = pop_after_sel[i]
        child[3:7] = pop_after_sel[(i + 1) % len(pop_after_sel)][3:7]
        population_nextgen.append(child)
    return population_nextgen
def mutation(pop_after_cross, mutation_rate):
    # (the loop body was truncated in extraction; the usual implementation
    # flips each gene with probability mutation_rate)
    population_nextgen = []
    for i in range(0, len(pop_after_cross)):
        chromosome = pop_after_cross[i]
        flip = np.random.rand(len(chromosome)) < mutation_rate
        chromosome[flip] = ~chromosome[flip]
        population_nextgen.append(chromosome)
    return population_nextgen
def generations(size, n_feat, n_parents, mutation_rate, n_gen, X_train, X_test, y_train, y_test):
    best_chromo = []
    best_score = []
    population_nextgen = initilization_of_population(size, n_feat)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print(scores[:2])
        pop_after_sel = selection(pop_after_fit, n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross, mutation_rate)
        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])
    return best_chromo, best_score
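A usage sketch for the driver function above (all parameter values here are illustrative assumptions, and X_train/X_test are assumed to be pandas DataFrames, as the .iloc indexing in fitness_score requires):

# Run the GA for feature selection, then refit on the best chromosome found
chromo, score = generations(size=200, n_feat=X_train.shape[1], n_parents=100,
                            mutation_rate=0.10, n_gen=38,
                            X_train=X_train, X_test=X_test,
                            y_train=y_train, y_test=y_test)
logmodel.fit(X_train.iloc[:, chromo[-1]], y_train)
predictions = logmodel.predict(X_test.iloc[:, chromo[-1]])
print("Accuracy after GA feature selection = " + str(accuracy_score(y_test, predictions)))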
Output:
Practical Outcomes:
Applications:
• Robotics
The use of genetic algorithms in the field of robotics is quite extensive. Genetic algorithms are being used to create learning robots which behave like humans and do tasks like cooking a meal, doing the laundry, etc.
• Traffic and Shipment Routing (Travelling Salesman Problem)
This is a famous problem that has been efficiently adopted by many sales-based companies, as it is time-saving and economical. It is also solved using genetic algorithms.
• Engineering Design
Engineering design has relied heavily on computer modeling and simulation to make the design cycle fast and economical. Genetic algorithms have been used to optimize designs and provide robust solutions.