21 Machine Learning Using Scikit Learn Ipynb Colaboratory PDF

11/2/2020 Machine Learning Using scikit-learn.
ipynb - Colaboratory
01 Machine Learning Project
Course - Introduction
Welcome to the course on scikit-learn.
In this course, you will understand all the practical aspects of fitting a Machine Learning Model.
You will learn:
The different steps involved in this process such as data acquisition, data transformation, data cleaning and model fitting.
How to perform each step using Python scikit-learn package?
Introduction to Machine Learning
According to Arthur Samuel,
Machine learning is a eld of computer science that gives computers the ability to learn without being explicitly
programmed.
A Machine learning project is typically classified into two categories, depending on its learning system.
Supervised Learning
Unsupervised Learning.
Supervised Learning
https://colab.research.google.com/drive/19TOvwTd4Xa2nNDZxm45dnq1JWpQ3hfO3#scrollTo=Rb6NJLxGnlYP&printMode=true 1/23
11/2/2020 Machine Learning Using scikit-learn.ipynb - Colaboratory
Here, we are illustrating a supervised learning approach.
Unsupervised Learning
Here, we are illustrating an unsupervised learning approach.
Steps in a Machine Learning Project
A Machine Learning Project involves the following steps:
Defining the Problem :
Define a problem statement, which addresses a business problem.

Obtaining the Source Data :
The raw data required to build a model can be presented in a single or multiple sources such as relational databases, and
social networking sites.
Understanding Data Through Visualization :
Look into data and understand important features such as its mean, and spread.
Preparing Data for Machine Learning Algorithms :
Mostly, the captured raw data cannot be used to train using a Machine learning algorithm. The raw datasets have to be
manipulated or transformed through one or more pre-processing steps.
Choosing an algorithm :
Based on features of data set, pick a suitable algorithm.

Building the Model :
Train the algorithm with considered training data set and verify its performance through a metric.
Fine-tuning the Model :
Identify values of vital parameters, associated with the chosen model for better performance.
Use the best model :
Use the model with better performance for addressing the defined problem.
02 Introduction to scikit-learn
Introduction to scikit-learn
scikit-learn is a Machine learning toolkit in Python . The package contains efficient tools used for Data Mining and
Data Analysis .
It is built on NumPy, SciPy, and matplotlib packages. It is opensource and also commercially usable under BSD license.
scikit-learn Utilities
scikit-learn library has many utilities that can be used to perform the following tasks involved in Machine Learning.
Preprocessing
Model Selection
Classification
Regression
Clustering
Dimensionality Reduction
Steps with scikit-learn
Mostly, one would perform the following steps while working on a Machine learning problem with scikit-learn :
1. Cleaning raw data set.

2. Further transforming with many scikit-learn pre-processing utilities.
3. Splitting data into train and test sets with train_test_split utility.
4. Creating a suitable model with default parameters.
5. Training the Model using fit function.
. Evaluating the Model and fine-tuning it.
03 Gathering Data from Multiple Sources
About the Topic
In this topic, you will learn how scikit-learn library can be used to get public datasets.
You will also understand how this library simpli es your tasks required in fitting Machine Learning models.
Reading Data for ML
Any Machine Learning Algorithm requires data for building a model.
The data can be obtained from Multiple sources such as http, ftp repositories, databases, local repositories, etc.
Many times raw data, read from a source, cannot be used directly by an ML algorithm for building a Model.
So, raw data has to be cleaned, processed, transformed (if required) and then passed to an ML algorithm always.
Example Data - Breast Cancer Dataset
Breast Cancer data set is a popular one, which contains details of 30 features obtained from 569 cancer patients.
We will be doing the following tasks and make cancer data set ready for ML.
Reading raw data from UCI archive

Extract features from Raw data.
Naming or Labelling features
Extract target values from Raw data
Naming or Labelling target values
Reading Data from UCI Archive
The raw data set from UCI archive can be read with the following code snippet.
import pandas as pd
cancer_set = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
header = None)
print(cancer_set.shape)
Output
(569, 32)
Read raw dataset contains 32 columns.

the 1st column has patient ID details, and the 2nd one has tumor type, i.e. malignant or benign .
The rest 30 columns represent various features obtained from each patient.
Extracting Features from Raw Set
All columns, representing features are extracted with the following code snippet.
cancer_features = cancer_set.iloc[:,2:]
print(cancer_features.shape)
print(type(cancer_features))
Output
(569, 30)
<class 'pandas.core.frame.DataFrame'>
cancer_features is a dataframe . It is converted to a numpy array with below code.
cancer_features = cancer_features.values
print(type(cancer_features))
print(cancer_features.shape)
Output
<class 'numpy.ndarray'>
(569, 30)
Naming features
The 30 features used associated with cancer_features dataset are labeled with the following listed names.
cancer_features_names = ['mean radius',

'mean texture', 'mean perimeter',
'mean area', 'mean smoothness',
'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry',
'mean fractal dimension','radius error',
'texture error','perimeter error',
'area error', 'smoothness error',
'compactness error','concavity error',
'concave points error','symmetry error',
'fractal dimension error','worst radius',
'worst texture', 'worst perimeter',
'worst area','worst smoothness',
'worst compactness', 'worst concavity',
'worst concave points','worst symmetry',
'worst fractal dimension']
Extracting target values from Raw data
Target values of each patient are extracted with below code snippet.
cancer_target = cancer_set.iloc[:, 1]
Replacing 'M' with 0 and 'B' with 1

cancer_target = cancer_target.replace(['M', 'B'], [0, 1])
Converting to numpy array

cancer_target = cancer_target.values
print(type(cancer_target))
print(cancer_target.shape)
Output
<class 'numpy.ndarray'>
(569,)
Thus obtained cancer_features and cancer_target can be used by a ML algorithm.
scikit-learn Datasets
scikit-learn by default comes with few popular datasets.
They can be loaded into your working environment and used.
You can know more about datasets fromscikit-learnin the following video.
Reading Cancer Data from scikit-learn
Previously, you have read breast cancer data from UCI archive and derived cancer_features and cancer_target arrays.
The same processed data is available in scikit-learn . The below code snippet illustrates accessing features and target
arrays.
import sklearn.datasets as datasets
cancer = datasets.load_breast_cancer()
print(cancer.data.shape)
print(cancer.target.shape)
The multiple steps explained earlier are simplified using the above set of commands.
Output
(569, 30)
(569,)
from sklearn import datasets
iris = datasets.load_iris()
type(iris)
sklearn.utils.Bunch
04 Preprocessing with scikit-learn
Preprocessing - Introduction
Preprocessing is a step, in which raw data is modi ed or transformed into a format, suitable for further downstream processing.
scikit-learn provides many preprocessing utilities such as,
Standardization mean removal

Scaling
Normalization
Binarization
One Hot Encoding
Label Encoding
Imputation
Standardization
Standardization or Mean Removal is the process of transforming each feature vector into a normal distribution with mean 0 and
variance 1.
This can be achieved using StandardScaler .

An example with its output is shown in the next two cards, which requires the following imports.
import sklearn.preprocessing as preprocessing
Standardization - Example
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(breast_cancer.data)
breast_cancer_standardized = standardizer.transform(breast_cancer.data)
print('Mean of each feature after Standardization :\n\n')

print(breast_cancer_standardized.mean(axis=0))
print('\nStd. of each feature after Standardization :\n\n')
print(breast_cancer_standardized.std(axis=0))
Scaling
Scaling transforms existing data values to lie between a minimum and maximum value.
MinMaxScaler transforms data to range 0 and 1.
MaxAbsScaler transforms data to range -1 and 1.
Transforming breast_cancer dataset through Scaling is shown in next three cards.
Using MinMaxScaler
MinMaxScaler with specified range
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)
breast_cancer_minmaxscaled10 = min_max_scaler.transform(breast_cancer.data)
In the above example, data is transformed to range 0 and 10.
Using MaxAbsScaler
Using MaxAbsScaler , the maximum absolute value of each feature is scaled to unit size, i.e., 1. It is intended for data that is previously
centered at sparse or zero data.
Example for MaxAbsScaler
max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)
breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)
By default, MaxAbsScaler transforms data to the range -1 and 1.
Normalization
Normalization scales each sample to have a unit norm.

Normalization can be achieved with 'l1', 'l2', and 'max' norms.
'l1' norm makes the sum of absolute values of each row as 1, and 'l2' norm makes the sum of squares of each row as 1.
'l1' norm is insensitive to outliers.

By default l2 norm is considered. Hence, removing outliers is recommended before applying l2 norm .
Normalization - Example
normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)
breast_cancer_normalized = normalizer.transform(breast_cancer.data)
In above example, l1 norm is used with norm parameter.
Binarization
Binarization is the process of transforming data points to 0 or 1 based on a given threshold.
Any value above the threshold is transformed to 1, and any value below the threshold is transformed to 0.
By default, a threshold of 0 is used.
Binarization - Example
binarizer = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])
Output
[[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]]
OneHotEncoder
OneHotEncoder converts categorical integer values into one-hot vectors. In an on-hot vector, every category is transformed into a
binary attribute having only 0 and 1 values.
An example creating two binary attributes for the categorical integers 1 and 2, is shown in the next slide.
OneHotEncoder - Example
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])
Transforming category values 1 and 2 to one-hot vectors

print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())
Output
[[ 1. 0.]]
[[ 0. 1.]]
Imputation
Imputation replaces missing values with either median, mean, or the most common value of the column or row in which the missing
values exist.
Below example replaces missing values, represented by np.nan , with the mean of respective column (axis 0).
Example
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)
Label Encoding
Label Encoding is a step in which, in which categorical features are represented as categorical integers. An example of transforming
categorical values ["benign","malignant"] into [0, 1]` is shown below.
Example
labels = ['malignant', 'benign', 'malignant', 'benign']
labelencoder = preprocessing.LabelEncoder()
labelencoder = labelencoder.fit(labels)
bc_labelencoded = labelencoder.transform(breast_cancer.target_names)
Using MinMaxScaler
Example for MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler().fit(breast_cancer.data)
breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)
By default, transformation occurs to a range of 0 and 1. It can also be customized with feature_range argument as shown in
next example.
05 Preprocessing Exercises
Hands-On - Machine Learning Using Scikit-Learn | 1 | Preprocessing
Installation
Let's get the installations done, prior to creating tasks.
Run the command given below.
pip install --user numpy scipy scikit-learn
Task 1
Import two modules sklearn.datasets and sklearn.preprocessing .
Load popular iris data set from sklearn.datasets module and assign it to variable 'iris' .
Perform Normalization on iris.data with l2 norm and save the transformed data in variable iris_normalized .
Hint: Use Normalizer API.
Print the mean of every column using the below command. print(iris_normalized.mean(axis=0))

normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))
[0.75140029 0.40517418 0.45478362 0.14107142]
Task 2
Convert the categorical integer list iris.target into three binary attribute representation and store the result in variable
iris_target_onehot .
Hint: Use reshape(-1,1) on iris.target and OneHotEncoder .
Execute the following print statement print(iris_target_onehot.toarray()[[0,50,100]])
binarizer = preprocessing.Binarizer(threshold=3.0).fit(iris.target.reshape(-1,1))
iris_binarized = binarizer.transform(iris.target.reshape(-1,1))
print(iris_binarized)
iris_target_onehot = preprocessing.OneHotEncoder()
print(iris_target_onehot.fit_transform(iris.target.reshape(-1,1)).toarray()[[0,50,100]])
Task 3
Set first 50 row values of iris.data to Null values. Use numpy.nan
Perform Imputation on 'iris.data' and save the transformed data in variable 'iris_imputed' .
Hint : use Imputer API, Replace numpy.NaN values with mean of corresponding data.
Print the mean of every column using the below command. print(iris_imputed.mean(axis=0))
iris.data[:50] = np.nan
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(iris.data)
iris_imputed = imputer.transform(iris.data)
print(iris_imputed.mean(axis=0))
[6.262 2.872 4.906 1.676]

/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:66: DeprecationWarning: Class Imputer is deprecated
warnings.warn(msg, category=DeprecationWarning)
x = [[0, 0], [0, 1], [2,0]]

enc = preprocessing.OneHotEncoder()
print(enc.fit(x).transform([[1, 1]]).toarray())
[[0. 0. 0. 1.]]
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py:415: FutureWarning: The handling of integer d
If you want the future behaviour and silence this warning, you can specify "categories='auto'".
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use t
warnings.warn(msg, FutureWarning)
regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']

print(preprocessing.LabelEncoder().fit(regions).transform(regions))
[1 0 3 1 2 0]
06 Nearest Neighbors Technique
About the Topic
From this topic, you will understand how to implement various Machine Learning Algorithms using scikit-learn.
You will be learning some supervised and unsupervised learning algorithms.
Nearest Neighbors
Nearest neighbors method is used to determine a predefined number of data points that are closer to a sample point and predict its
label.
sklearn.neighbors provides utilities for unsupervised and supervised neighbors-based learning methods.
scikit-learn implements two different nearest neighbors classifiers:
KNeighborsClassifier
RadiusNeighborsClassifier
Nearest Neighbor Classifiers
KNeighborsClassifier classifies based on k nearest neighbors of every query point, where k is an integer value specified by
the user.
RadiusNeighborsClassifier classifies based on the number of neighbors present in a fixed radius r of every training point.
Nearest Neighbors Regression
scikit-learn implements the following two regressors:
KNeighborsRegressor predicts based on the k nearest neighbors of each query point.

RadiusNeighborsRegressor predicts based on the neighbors present in a fixed radius r of the query point.
Demo of KNeighborsClassifier
The following code snippet illustrates importing required modules and loading cancer dataset.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
cancer = datasets.load_breast_cancer() # Loading the data set
Building a Model of KNN classifier
The following code creates training and test data sets, initializes a KNN classifier, and fits it with training data.
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target,

stratify=cancer.target, random_state=42)
knn_classifier = KNeighborsClassifier()
knn_classifier = knn_classifier.fit(X_train, Y_train)
Determining Accuracy of the Model
The following code determines the accuracy of model on train and test data sets.
print('Accuracy of Train Data :', knn_classifier.score(X_train,Y_train))

print('Accuracy of Test Data :', knn_classifier.score(X_test,Y_test))
Output
Accuracy of Train Data : 0.946009389671

Accuracy of Test Data : 0.93006993007
Hands-On - KNN
Installation
Task 1
Import two modules sklearn.datasets, and sklearn.model_selection.
Load popular iris data set from sklearn.datasets module and assign it to variable iris.
Split iris.data into two sets names X_train and X_test. Also, split iris.target into two sets Y_train and Y_test.
Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30 and perform stratified sampling.
Print the shape of X_train dataset.
Print the shape of X_test dataset.

X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
(112, 4)
(38, 4)
Task 2
Import required module from sklearn.neighbors
Fit K nearest neighbors model on X_train data and Y_train labels, with default parameters. Name the model as knn_clf .
Evaluate the model accuracy on training data set and print it's score.
Evaluate the model accuracy on testing data set and print it's score.
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(X_train, Y_train)
print('Accuracy of Train Data :', knn_clf.score(X_train,Y_train))
print('Accuracy of Test Data :', knn_clf.score(X_test,Y_test))

Task 3
Fit multiple K nearest neighbors models on X_train data and Y_train labels with n_neighbors parameter value changing from
3 to 10.
Evaluate each model accuracy on testing data set.
Hint: Make use of for loop
Print the n_neighbors value of the model with highest accuracy.
for i in range(3,10):
knn_clf = KNeighborsClassifier(n_neighbors=i)
knn_clf = knn_clf.fit(X_train, Y_train)
print('Accuracy of Test Data :', knn_clf.score(X_test,Y_test))
print(6)

6
07 Decision Trees Technique
Decision Trees
Decision Trees is another Supervised Learning method used for Classification and Regression .
Decision Trees learn simple decision rules from training data and build a Model.
DecisionTreeClassifier and DecisionTreeRegressor are the two utilities from sklearn.tree , which can be used for
classification and regression respectively.
Advantages of Decision Trees
Advantages
Decision Trees are easy to understand.

They often do not require any preprocessing.
Decision Trees can learn from both numerical and categorical data.
Disadvantages of Decision Trees
Decision trees sometimes become complex, which do not generalize well and leads to overfitting . Overfitting can be
addressed by placing the least number of samples needed at a leaf node or placing the highest depth of the tree.
A small variation in data can result in a completely different tree . This problem can be addressed by using decision
trees within an ensemble.
Building a Decision Tree Classifier Model
The subsequent code represents the building of a Decision Tree Classifier model.
Before executing this code, perform importing required modules, load cancer dataset, and create train and test data sets as
shown in Neighbors classifier example.
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier()
dt_classifier = dt_classifier.fit(X_train, Y_train)
Determining Accuracy of the Model
Further the below code determines the model accuracy. You can observe that the model is overfitted .
print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))
Output

Fine Tuning the Model
Further the model is improved with change in max_depth value to 2.
dt_classifier = DecisionTreeClassifier(max_depth=2)
dt_classifier = dt_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', dt_classifier.score(X_test,Y_test))
Output

Hands-On
Installation
Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .
Import numpy and set random seed to 100 .
Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .
Split boston.data into two sets names X_train and X_test . Also, split boston.target into two sets Y_train and Y_test .
Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 .

import numpy as np
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_test.shape)
(379, 13)
(127, 13)
Task 2
Import required module from sklearn.tree .
Build a Decision tree Regressor model from X_train set and Y_train labels, with default parameters. Name the model as
dt_reg .
Predict the housing price for first two samples of X_test set and print them.(Hint : Use predict() function)
from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', dt_reg.score(X_test,Y_test))
print(dt_reg.predict(X_test[:1]))

[18.2]
Task 3
Fit multiple Decision tree regressors on X_train data and Y_train labels with max_depth parameter value changing from 2 to
5.
Print the max_depth value of the model with highest accuracy.
for i in range(2,5):
dt_reg = DecisionTreeRegressor(max_depth=i)
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Test Data :', dt_reg.score(X_test,Y_test))
print(4)
print(4)

4
08 Ensemble Methods
Ensemble Methods
Ensemble methods combine predictions of other learning algorithms, to improve the generalization.
Ensemble methods are two types:
Averaging Methods : They build several base estimators independently and finally average their predictions.
E.g.: Bagging Methods, Forests of randomised trees

Boosting Methods : They build base estimators sequentially and trie to reduce the bias of the combined estimator.
E.g.: Adaboost, Gradient Tree Boosting
Bagging Methods
Bagging Methods draw random subsets of the original dataset, build an estimator and aggregate individual results to form a final one.
BaggingClassifier and BaggingRegressor are the utilities from sklearn.ensemble to deal with Bagging.
Randomized Trees
sklearn.ensemble offers two types of algorithms based on randomized trees: Random Forest and Extra randomness algorithms.
RandomForestClassifier and RandomForestRegressor classes are used to deal with random forests.
In random forests, each estimator is built from a sample drawn with replacement from the training set.
ExtraTreesClassifier and ExtraTreesRegressor classes are used to deal with extremely randomized forests.
In extremely randomized forests, more randomness is introduced, which further reduces the variance of the model.
Boosting Methods
Boosting Methods combine several weak models to create a improvised ensemble.
sklearn.ensemble also provides the following boosting algorithms:
AdaBoostClassifier
GradientBoostingClassifier
Demo of Random Forest Classifier
Example of creating a Random forest model is shown below.
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier()
rf_classifier = rf_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', rf_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', rf_classifier.score(X_test,Y_test))
Output
Accuracy of Train Data: 0.995305164319

Hands-On - Ensemble Methods

Installation
Task 1
Import numpy and set random seed to 100
Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .
Split boston.data into two sets names X_train and X_test . Also, split boston.target into two sets Y_train and Y_test .
Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 .

import numpy as np
np.random.seed(100)
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_test.shape)
(379, 13)
(127, 13)
Task 2
Import required module from sklearn.ensemble .
Build a Random Forest Regressor model from X_train set and Y_train labels, with default parameters. Name the model as
rf_reg .
Predict the housing price for first two samples of X_test set and print them.
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()
rf_reg = rf_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', rf_reg.score(X_train,Y_train))
print('Accuracy of Test Data :', rf_reg.score(X_test,Y_test))
print(rf_reg.predict(X_test[:2]))

[19.302 9.397]
Task 3
Build multiple Random forest regressor on X_train set and Y_train labels with max_depth parameter value changing from 3 to
5 and also setting n_estimators to one of 50, 100, 200 values.
Print the max_depth and n_estimators values of the model with highest accuracy.
Note: Print the parameter values in the form of tuple (a, b) . a refers to max_depth value and b refers to n_estimators
all_scores = {}
n= 100
00
for m in range(3,6):
rf_reg = RandomForestRegressor(n_estimators=n, max_depth=m)
rf_reg = rf_reg.fit(X_train, Y_train)
print(m,n, rf_reg.score(X_test,Y_test))
all_scores[(m,n)] = rf_reg.score(X_test,Y_test)
max_score = max(all_scores, key=all_scores.get)

print(max_score)
09 Support Vector Machines Technique
Understanding SVM
Support Vector Machines (SVMs) separates data points based on decision planes, which separates objects belonging to different
classes in a higher dimensional space.
SVM algorithm uses the best suitable kernel, which is capable of separating data points into two or more classes.
Commonly used kernels are:
linear
polynomial
rbf
sigmoid
Support Vector Classification
scikit-learn provides the following three utilities for performing Support Vector Classification.
SVC ,
NuSVC : Same as SVC but uses a parameter to control the number of support vectors.
LinearSVC : Similar to SVC with parameter kernel taking linear value.
Support Vector Regression
scikit-learn provides the following three utilities for performing Support Vector Regression.
SVR
NuSVR
LinearSVR
Advantages of SVMs
SVM can distinguish the classes in a higher dimensional space.
SVM algorithms are memory efficient.
SVMs are versatile, and a different kernel can be used by a decision function.
Disadvantages of SVMs
SVMs do not perform well on high dimensional data with many samples.
SVMs work better only with Preprocessed data.
They are harder to visualize.
Demo of Support Vector Classification
An example of creating an SVM classifier is shown below.
The shown model overfits the training data.
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))
Output

Improving Accuracy Using Scaled Data
In the following example, scaled input data is used to improve the accuracy of SVM classifier.
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)
svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train, Y_train)
Determining Accuracy of New Model
print('Accuracy of Train Data :', svm_classifier.score(X_train,Y_train))
print('Accuracy of Test Data :', svm_classifier.score(X_test,Y_test))
Output

Viewing the Classification Report
from sklearn import metrics
Y_pred = svm_classifier.predict(X_test)
print('Classification report : \n',metrics.classification_report(Y_test, Y_pred))
Output
Classification report :
precision recall f1-score support
0 0.96 0.98 0.97 53
1 0.99 0.98 0.98 90
avg 0.98 0.98 0.98 143
Hands-On - SVM
Installation
Task 1
Load popular digits dataset from sklearn.datasets module and assign it to variable digits .
Split digits.data into two sets names X_train and X_test . Also, split digits.target into two sets Y_train and Y_test .
Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 ; and perform stratified sampling.

digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target, stratify=digits.target, random_state=30)
print(X_test.shape)
(1347, 64)
(450, 64)
Task 2
Import required module from sklearn.svm .
Build an SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf .
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf = svm_clf.fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_clf.score(X_test,Y_test))
/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change f

"avoid this warning.", FutureWarning)
Task 3
Perform Standardization of digits.data and store the transformed data in variable digits_standardized .
Hint : Use required utility from sklearn.preprocessing .
Once again, split digits_standardized into two sets names X_train and X_test . Also, split digits.target into two sets
Y_train and Y_test .
Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 ; and perform stratified sampling.
Build another SVM classifier from X_train set and Y_train labels, with default parameters. Name the model as svm_clf2 .
digits_standardized = standardizer.fit(digits.data)
digits_standardized = standardizer.transform(digits.data)
X_train, X_test, Y_train, Y_test = train_test_split(digits_standardized, digits.target, stratify=digits.target, random_state=
svm_clf2 = SVC()
svm_clf2 = svm_clf2.fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_clf2.score(X_test,Y_test))

/usr/local/lib/python3.6/dist-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change f
"avoid this warning.", FutureWarning)
10 Clustering Technique
Introduction to Clustering
Clustering is one of the unsupervised learning technique.
The technique is typically used to group data points into clusters based on a specific algorithm.
Major clustering algorithms that can be implemented using scikit-learn are:
K-means Clustering
Agglomerative clustering
DBSCAN clustering
Mean-shift clustering
Affinity propagation
Spectral clustering
K-Means Clustering
In K-means Clustering entire data set is grouped into k clusters.
Steps involved are:
k centroids are chosen randomly.

The distance of each data point from k centroids is calculated. A data point is assigned to the nearest cluster.
Centroids of k clusters are recomputed.
The above steps are iterated till the number of data points a cluster reach convergence.
KMeans from sklearn.cluster can be used for K-means clustering.
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a bottom-up approach.
Steps involved are:
Each data point is treated as a single cluster at the beginning.
The distance between each cluster is computed, and the two nearest clusters are merged together.
The above step is iterated till a single cluster is formed.
AgglomerativeClustering from sklearn.cluster can be used for achieving this.
Merging of two clusters can be any of the following linkage type: ward , complete or average .
Density Based Clustering
Now let's understand how density-based clustering is performed. DBSCAN from sklearn.cluster is used for this purpose. - video 1
Mean Shift Clustering
Mean Shift Clustering aims at discovering dense areas.
Steps Involved:
Identify blob areas with randomly guessed centroids.

Calculate the centroid of each blob area and shift to a new one, if there is a difference.
Repeat the above step till the centroids converge.
make_blobs from sklearn.cluster can be used to initialize the blob areas. MeanShift from sklearn.cluster can be used to
perform Mean Shift clustering.
Affinity Propagation
Affinity Propagation generates clusters by passing messages between pairs of data points, until convergence.
AffinityPropagation class from sklearn.cluster can be used.
The above class can be controlled with two major parameters:
preference : It controls the number of exemplars to be chosen by the algorithm.

damping : It controls numerical oscillations while updating messages.
Spectral Clustering
Spectral Clustering is ideal to cluster data that is connected, and may not be in a compact space.
In general, the following steps are followed:
Build an affinity matrix of data points.

Embed data points in a lower dimensional space.
Use a clustering method like k-means to partition the points on lower dimensional space.
spectral_clustering from sklearn.cluster can be used for achieving this.
Demo of KMeans
An example of performing KMeans clustering is shown below
from sklearn.cluster import KMeans
kmeans_cluster = KMeans(n_clusters=2)
kmeans_cluster = kmeans_cluster.fit(X_train)
kmeans_cluster.predict(X_test)
Output
array([0, 1, 0, ... , 0, 0, 0])
Evaluating a Clustering algorithm
A clustering algorithm is majorly evaluated using the following scores:
Homogeneity : Evaluates if each cluster contains only members of a single class.
Completeness : All members of a given class are assigned to the same cluster.
V-measure : Harmonic mean of Homogeneity and Completeness .
Adjusted Rand index : Measures similarity of two assignments.
Evaluation with scikit-learn
print(metrics.homogeneity_score(kmeans_cluster.predict(X_test), Y_test))
print(metrics.completeness_score(kmeans_cluster.predict(X_test), Y_test))
print(metrics.v_measure_score(kmeans_cluster.predict(X_test), Y_test))
print(metrics.adjusted_rand_score(kmeans_cluster.predict(X_test), Y_test))
Output
0.573236466834
0.483862796607
0.524771531969
0.54983994112
Hands-On - Clustering
Installation
Task 1
Import three modules sklearn.datasets , sklearn.cluster , and sklearn.metrics .
Load popular iris dataset from sklearn.datasets module and assign it to variable iris .
Cluster iris.data set into 3 clusters using K-means with default parameters. Name the model as km_cls .
Hint : Import required utility from sklearn.cluster
Determine the homogeneity score of the model and print it.
Hint : Import required utility from sklearn.metrics

from sklearn.cluster import KMeans
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target)
km_cls = KMeans(n_clusters=3)
km_cls = km_cls.fit(X_train)
print(metrics.homogeneity_score(km_cls.predict(X_test), Y_test))
0.7519188912745023
Task 2
Cluster iris.data set into 3 clusters using Agglomerative clustering. Name the model as agg_cls .
from sklearn.cluster import AgglomerativeClustering

agg_cls = AgglomerativeClustering(n_clusters=3)
print(metrics.homogeneity_score(agg_cls.fit_predict(X_test), Y_test))
0.829701532311426
Task 3
Cluster iris.data set using Affinity Propagation clustering method with default parameters. Name the model as af_cls .
from sklearn.cluster import AffinityPropagation

af_cls = AffinityPropagation()
print(metrics.homogeneity_score(af_cls.fit_predict(X_test), Y_test))
10 Course Summary
Scikit Learn Course Summary
In this course, you have studied the following,
Introduced to scikit-learn .
Used scikit-learn datasets for learning Machine Learning concepts.
Used scikit-learn for build models based on Supervised learning techniques like Nearest neigbors , Decision Trees , Random
forests , and SVMs .
Used scikit-learn for build models based on Supervised learning techniques such as K-means , Agglomerative clustering ,
and Density-based clustering .
x = [[7.8], [1.3], [4.5], [0.9]]

print(preprocessing.Binarizer().fit(x).transform(x))
[[1.]
[1.]
[1.]
[1.]]
Double-click (or enter) to edit
Double-click (or enter) to edit

21 Machine Learning Using Scikit Learn Ipynb Colaboratory PDF

Uploaded by

21 Machine Learning Using Scikit Learn Ipynb Colaboratory PDF

Uploaded by

11/2/2020 Machine Learning Using scikit-learn.

01 Machine Learning Project

Welcome to the course on scikit-learn.

You will learn:

Introduction to Machine Learning

According to Arthur Samuel,

Here, we are illustrating a supervised learning approach.

Here, we are illustrating an unsupervised learning approach.

Steps in a Machine Learning Project

A Machine Learning Project involves the following steps:

Defining the Problem :

Deﬁne a problem statement, which addresses a business problem.

Understanding Data Through Visualization :

Based on features of data set, pick a suitable algorithm.

Steps with scikit-learn

1. Cleaning raw data set.

03 Gathering Data from Multiple Sources

About the Topic

Reading Data for ML

Any Machine Learning Algorithm requires data for building a model.

Example Data - Breast Cancer Dataset

Reading raw data from UCI archive

Reading Data from UCI Archive

Read raw dataset contains 32 columns.

Extracting Features from Raw Set

cancer_features is a dataframe . It is converted to a numpy array with below code.

cancer_features_names = ['mean radius',

Extracting target values from Raw data

Replacing 'M' with 0 and 'B' with 1

Converting to numpy array

Thus obtained cancer_features and cancer_target can be used by a ML algorithm.

scikit-learn by default comes with few popular datasets.

They can be loaded into your working environment and used.

Reading Cancer Data from scikit-learn

import sklearn.datasets as datasets

from sklearn import datasets

04 Preprocessing with scikit-learn

scikit-learn provides many preprocessing utilities such as,

Standardization mean removal

This can be achieved using StandardScaler .

import sklearn.preprocessing as preprocessing

print('Mean of each feature after Standardization :\n\n')

MinMaxScaler transforms data to range 0 and 1.

MaxAbsScaler transforms data to range -1 and 1.

Transforming breast_cancer dataset through Scaling is shown in next three cards.

MinMaxScaler with speciﬁed range

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)

In the above example, data is transformed to range 0 and 10.

Example for MaxAbsScaler

By default, MaxAbsScaler transforms data to the range -1 and 1.

Normalization scales each sample to have a unit norm.

'l1' norm is insensitive to outliers.

In above example, l1 norm is used with norm parameter.

Binarization is the process of transforming data points to 0 or 1 based on a given threshold.

Transforming category values 1 and 2 to one-hot vectors

imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')

labels = ['malignant', 'benign', 'malignant', 'benign']

Example for MinMaxScaler

Hands-On - Machine Learning Using Scikit-Learn | 1 | Preprocessing

Run the command given below.

pip install --user numpy scipy scikit-learn

Hint: Use Normalizer API.

import sklearn.datasets as datasets

[0.75140029 0.40517418 0.45478362 0.14107142]

Hint: Use reshape(-1,1) on iris.target and OneHotEncoder .

Execute the following print statement print(iris_target_onehot.toarray()[[0,50,100]])

[6.262 2.872 4.906 1.676]

import sklearn.preprocessing as preprocessing

x = [[0, 0], [0, 1], [2,0]]

import sklearn.preprocessing as preprocessing

regions = ['HYD', 'CHN', 'MUM', 'HYD', 'KOL', 'CHN']

06 Nearest Neighbors Technique

About the Topic

You will be learning some supervised and unsupervised learning algorithms.

scikit-learn implements two different nearest neighbors classiﬁers:

Nearest Neighbor Classiﬁers