Machine Learning Using scikit-learn.ipynb - Colaboratory
Course - Introduction
In this course, you will understand the practical aspects of fitting a Machine Learning model: the different steps involved in this process, such as data acquisition, data transformation, data cleaning, and model fitting, and how to perform each step using the Python scikit-learn package.
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.
A Machine learning project is typically classified into one of two categories, depending on its learning system:
Supervised Learning
Unsupervised Learning
Supervised Learning
Unsupervised Learning
The raw data required to build a model can come from a single source or multiple sources, such as relational databases and social networking sites.
Look into the data and understand its important characteristics, such as its mean and spread.
Preparing Data for Machine Learning Algorithms:
Mostly, the captured raw data cannot be used directly to train a Machine learning algorithm. The raw datasets have to be manipulated or transformed through one or more pre-processing steps.
Choosing an algorithm:
Train the algorithm with the chosen training data set and verify its performance through a metric.
Fine-tuning the Model:
Identify the values of vital parameters associated with the chosen model for better performance.
Use the best model:
Use the model with the best performance for addressing the defined problem.
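The steps above map directly onto a few lines of scikit-learn code. Here is a minimal sketch of the end-to-end workflow; the breast cancer dataset and the K-nearest-neighbors model are chosen purely for illustration.

from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Data acquisition and preparation
cancer = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(cancer.data, cancer.target, random_state=42)

# Choosing an algorithm and training it
knn = KNeighborsClassifier().fit(X_train, Y_train)

# Fine-tuning a vital parameter (the number of neighbors) with a grid search
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(3, 11))})
grid = grid.fit(X_train, Y_train)

# Using the best model to address the problem
print(grid.best_params_, grid.score(X_test, Y_test))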
02 Introduction to scikit-learn
scikit-learn is a Machine learning toolkit in Python. The package contains efficient tools for Data Mining and Data Analysis.
It is built on the NumPy, SciPy, and matplotlib packages. It is open source and also commercially usable under the BSD license.
scikit-learn Utilities
The scikit-learn library has many utilities that can be used to perform the following tasks involved in Machine Learning:
Preprocessing
Model Selection
Classification
Regression
Clustering
Dimensionality Reduction
Mostly, one would perform the following sequence of steps while working on a Machine learning problem with scikit-learn: acquire the data, preprocess it, fit a model, and evaluate it.
In this topic, you will learn how scikit-learn library can be used to get public datasets.
You will also understand how this library simplifies the tasks required in fitting Machine Learning models.
The data can be obtained from multiple sources such as HTTP or FTP repositories, databases, local repositories, etc.
Many times, raw data read from a source cannot be used directly by an ML algorithm for building a model.
So, raw data always has to be cleaned, processed, and transformed (if required) before being passed to an ML algorithm.
The Breast Cancer data set is a popular one, which contains 30 features obtained from 569 cancer patients.
We will perform the following tasks to make the cancer data set ready for ML.
The raw data set from the UCI archive can be read with the following code snippet.
import pandas as pd
cancer_set = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                         header=None)
print(cancer_set.shape)
Output
(569, 32)
All columns representing features are extracted with the following code snippet.
cancer_features = cancer_set.iloc[:,2:]
print(cancer_features.shape)
print(type(cancer_features))
Output
(569, 30)
<class 'pandas.core.frame.DataFrame'>
The feature DataFrame is then converted into a NumPy array:
cancer_features = cancer_features.values
print(type(cancer_features))
print(cancer_features.shape)
Output
<class 'numpy.ndarray'>
(569, 30)
Naming features
The 30 features associated with the cancer_features dataset are labeled with the following listed names.
Target values for each patient are extracted with the code snippet below.
cancer_target = cancer_set.iloc[:, 1].values  # .values converts the pandas Series into a NumPy array
print(type(cancer_target))
print(cancer_target.shape)
Output
<class 'numpy.ndarray'>
(569,)
scikit-learn Datasets
You can learn more about datasets from scikit-learn in the following video.
Previously, you read breast cancer data from the UCI archive and derived the cancer_features and cancer_target arrays.
The same processed data is available in scikit-learn. The code snippet below illustrates accessing the features and target arrays.
from sklearn import datasets

cancer = datasets.load_breast_cancer()
print(cancer.data.shape)
print(cancer.target.shape)
The multiple steps explained earlier are simplified using the above set of commands.
Output
(569, 30)
(569,)
iris = datasets.load_iris()
type(iris)
sklearn.utils.Bunch
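A Bunch behaves like a dictionary whose keys are also accessible as attributes. A quick sketch for inspecting a loaded dataset (the exact key list can vary slightly across scikit-learn versions):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.keys())        # e.g., dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
print(iris.data.shape)    # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']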
Preprocessing - Introduction
Preprocessing is a step in which raw data is modified or transformed into a format suitable for further downstream processing.
Normalization
Binarization
One Hot Encoding
Label Encoding
Imputation
Standardization
Standardization or Mean Removal is the process of transforming each feature so that it has mean 0 and variance 1.
Standardization - Example
from sklearn import preprocessing, datasets

breast_cancer = datasets.load_breast_cancer()  # data set used in all the examples below

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(breast_cancer.data)
breast_cancer_standardized = standardizer.transform(breast_cancer.data)
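As a quick sanity check (a sketch assuming the breast_cancer_standardized array from above), every feature should now have mean approximately 0 and standard deviation approximately 1:

import numpy as np

print(np.allclose(breast_cancer_standardized.mean(axis=0), 0))  # True
print(np.allclose(breast_cancer_standardized.std(axis=0), 1))   # True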
Scaling
Scaling transforms existing data values to lie between a minimum and maximum value.
Using MinMaxScaler
min_max_scaler = preprocessing.MinMaxScaler().fit(breast_cancer.data)
breast_cancer_minmaxscaled = min_max_scaler.transform(breast_cancer.data)
By default, transformation occurs to a range of 0 and 1. It can also be customized with the feature_range argument, as in this example scaling to a range of 0 to 10:
min_max_scaler10 = preprocessing.MinMaxScaler(feature_range=(0, 10)).fit(breast_cancer.data)
breast_cancer_minmaxscaled10 = min_max_scaler10.transform(breast_cancer.data)
Using MaxAbsScaler
Using MaxAbsScaler, the maximum absolute value of each feature is scaled to unit size, i.e., 1. It is meant for data that is already centered at zero, or for sparse data.
max_abs_scaler = preprocessing.MaxAbsScaler().fit(breast_cancer.data)
breast_cancer_maxabsscaled = max_abs_scaler.transform(breast_cancer.data)
Normalization
Normalization scales each sample (row) individually so that it has unit norm (such as l1 or l2).
Normalization - Example
normalizer = preprocessing.Normalizer(norm='l1').fit(breast_cancer.data)
breast_cancer_normalized = normalizer.transform(breast_cancer.data)
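As a quick check (a sketch assuming the breast_cancer_normalized array from above), the absolute values in each row should sum to 1 under the l1 norm:

import numpy as np

print(np.abs(breast_cancer_normalized).sum(axis=1)[:5])  # [1. 1. 1. 1. 1.]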
Binarization
Any value above the threshold is transformed to 1, and any value below the threshold is transformed to 0.
By default, a threshold of 0 is used.
Binarization - Example
binarizer = preprocessing.Binarizer(threshold=3.0).fit(breast_cancer.data)
breast_cancer_binarized = binarizer.transform(breast_cancer.data)
print(breast_cancer_binarized[:5,:5])
Output
[[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 0.]]
OneHotEncoder
OneHotEncoder converts categorical integer values into one-hot vectors. In a one-hot vector, every category is transformed into a binary attribute having only 0 and 1 values.
An example creating two binary attributes for the categorical integers 1 and 2 is shown below.
OneHotEncoder - Example
onehotencoder = preprocessing.OneHotEncoder()
onehotencoder = onehotencoder.fit([[1], [1], [1], [2], [2], [1]])
print(onehotencoder.transform([[1]]).toarray())
print(onehotencoder.transform([[2]]).toarray())
Output
[[ 1. 0.]]
[[ 0. 1.]]
Imputation
Imputation replaces missing values with the median, mean, or most common value of the column or row in which the missing values exist.
The example below replaces missing values, represented by np.nan, with the mean of the respective column (axis 0).
Example
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(breast_cancer.data)
breast_cancer_imputed = imputer.transform(breast_cancer.data)
Label Encoding
Label Encoding is a step in which categorical features are represented as categorical integers. An example of transforming the categorical values ["benign", "malignant"] into [0, 1] is shown below.
Example
labels = breast_cancer.target_names  # ['malignant', 'benign']
labelencoder = preprocessing.LabelEncoder()
labelencoder = labelencoder.fit(labels)
bc_labelencoded = labelencoder.transform(breast_cancer.target_names)
05 Preprocessing Exercises
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Import two modules sklearn.datasets and sklearn.preprocessing .
Load the popular iris data set from the sklearn.datasets module and assign it to the variable iris.
Perform Normalization on iris.data with the l2 norm and save the transformed data in the variable iris_normalized.
Print the mean of every column using the command: print(iris_normalized.mean(axis=0))
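One possible solution, sketched with the modules named in the task:

from sklearn import datasets, preprocessing

iris = datasets.load_iris()
normalizer = preprocessing.Normalizer(norm='l2').fit(iris.data)
iris_normalized = normalizer.transform(iris.data)
print(iris_normalized.mean(axis=0))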
Task 2
Convert the categorical integer list iris.target into a three binary attribute representation (one-hot encoding) and store the result in the variable iris_target_onehot.
binarizer = preprocessing.Binarizer(threshold=3.0).fit(iris.target.reshape(-1,1))
iris_binarized = binarizer.transform(iris.target.reshape(-1,1))
print(iris_binarized)
iris_target_onehot = preprocessing.OneHotEncoder()
print(iris_target_onehot.fit_transform(iris.target.reshape(-1,1)).toarray()[[0,50,100]])
Task 3
Set the first 50 row values of iris.data to null values. Use numpy.nan.
Perform Imputation on iris.data and save the transformed data in the variable iris_imputed.
Hint: use the Imputer API; replace numpy.nan values with the mean of the corresponding column.
Print the mean of every column using the below command. print(iris_imputed.mean(axis=0))
import numpy as np

iris.data[:50] = np.nan
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean')
imputer = imputer.fit(iris.data)
iris_imputed = imputer.transform(iris.data)
print(iris_imputed.mean(axis=0))
In this topic, you will understand how to implement various Machine Learning algorithms using scikit-learn.
Nearest Neighbors
The nearest neighbors method determines a predefined number of data points closest to a sample point and uses them to predict its label.
sklearn.neighbors provides utilities for unsupervised and supervised neighbors-based learning methods.
KNeighborsClassifier
RadiusNeighborsClassifier
KNeighborsClassifier classifies based on k nearest neighbors of every query point, where k is an integer value specified by
the user.
RadiusNeighborsClassifier classifies based on the number of neighbors present in a fixed radius r of every training point.
Demo of KNeighborsClassifier
The following code imports the required modules, loads the cancer dataset, creates training and test data sets, initializes a KNN classifier, fits it with training data, and determines the accuracy of the model on the train and test data sets.
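A minimal reconstruction of the demo; the stratified split and random_state=42 are assumptions, and the accuracy values will vary accordingly:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Import required modules and load the cancer dataset
cancer = datasets.load_breast_cancer()

# Create training and test data sets
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Initialize a KNN classifier and fit it with training data
knn_classifier = KNeighborsClassifier()
knn_classifier = knn_classifier.fit(X_train, Y_train)

# Determine the accuracy of the model on train and test data sets
print('Accuracy of Train Data :', knn_classifier.score(X_train, Y_train))
print('Accuracy of Test Data :', knn_classifier.score(X_test, Y_test))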
Hands-On - KNN
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Load popular iris data set from sklearn.datasets module and assign it to variable iris.
Split iris.data into two sets named X_train and X_test. Also, split iris.target into two sets Y_train and Y_test.
Hint: Use train_test_split method from sklearn.model_selection; set random_state to 30 and perform stratified sampling.
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
(112, 4)
(38, 4)
Task 2
Import required module from sklearn.neighbors
Fit K nearest neighbors model on X_train data and Y_train labels, with default parameters. Name the model as knn_clf .
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
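One possible solution, a sketch assuming the X_train, X_test, Y_train, Y_test sets from Task 1:

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf = knn_clf.fit(X_train, Y_train)
print('Accuracy of Train Data :', knn_clf.score(X_train, Y_train))
print('Accuracy of Test Data :', knn_clf.score(X_test, Y_test))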
Task 3
Fit multiple K nearest neighbors models on X_train data and Y_train labels with n_neighbors parameter value changing from
3 to 10.
for i in range(3, 10):
    knn_clf = KNeighborsClassifier(n_neighbors=i)
    knn_clf = knn_clf.fit(X_train, Y_train)
    print('Accuracy of Test Data :', knn_clf.score(X_test, Y_test))

print(6)  # n_neighbors value that gave the highest test accuracy
Decision Trees
Decision Trees are another Supervised Learning method, used for Classification and Regression.
Decision Trees learn simple decision rules from training data and build a model.
DecisionTreeClassifier and DecisionTreeRegressor are the two utilities from sklearn.tree that can be used for classification and regression, respectively.
Disadvantages of Decision Trees
Decision trees sometimes become complex models that do not generalize well, which leads to overfitting. Overfitting can be addressed by setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree.
A small variation in data can result in a completely different tree. This problem can be addressed by using decision trees within an ensemble.
The subsequent code builds a Decision Tree Classifier model.
Before executing this code, import the required modules, load the cancer dataset, and create the train and test data sets as shown in the Neighbors classifier example.
dt_classifier = DecisionTreeClassifier()
Further, the code below determines the model accuracy. You can observe that the model is overfitted.
dt_classifier = DecisionTreeClassifier(max_depth=2)
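A minimal sketch of both fits, assuming the same cancer train/test split as in the KNN demo:

from sklearn.tree import DecisionTreeClassifier

# Default tree: typically overfits (train accuracy 1.0, noticeably lower test accuracy)
dt_classifier = DecisionTreeClassifier()
dt_classifier = dt_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_classifier.score(X_train, Y_train))
print('Accuracy of Test Data :', dt_classifier.score(X_test, Y_test))

# Restricting max_depth reduces overfitting
dt_classifier = DecisionTreeClassifier(max_depth=2)
dt_classifier = dt_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_classifier.score(X_train, Y_train))
print('Accuracy of Test Data :', dt_classifier.score(X_test, Y_test))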
Hands-On
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .
Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .
Split boston.data into two sets named X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
(379, 13)
(127, 13)
Task 2
Import required module from sklearn.tree .
Build a Decision tree Regressor model from the X_train set and Y_train labels, with default parameters. Name the model as dt_reg.
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
Predict the housing price for the first two samples of the X_test set and print them. (Hint: use the predict() function)
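One possible solution, a sketch assuming the X_train, X_test, Y_train, Y_test sets from Task 1:

from sklearn.tree import DecisionTreeRegressor

dt_reg = DecisionTreeRegressor()
dt_reg = dt_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', dt_reg.score(X_train, Y_train))
print('Accuracy of Test Data :', dt_reg.score(X_test, Y_test))
print('Predicted prices :', dt_reg.predict(X_test[:2]))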
Task 3
Fit multiple Decision tree regressors on X_train data and Y_train labels, with the max_depth parameter value changing from 2 to 5.
for i in range(2, 5):
    dt_reg = DecisionTreeRegressor(max_depth=i)
    dt_reg = dt_reg.fit(X_train, Y_train)
    print('Accuracy of Test Data :', dt_reg.score(X_test, Y_test))

print(4)  # max_depth value that gave the highest test accuracy
08 Ensemble Methods
Ensemble methods combine the predictions of several learning algorithms to improve generalization.
Averaging Methods: They build several base estimators independently and finally average their predictions.
Bagging Methods
Bagging Methods draw random subsets of the original dataset, build an estimator and aggregate individual results to form a final one.
BaggingClassifier and BaggingRegressor are the utilities from sklearn.ensemble to deal with Bagging.
Randomized Trees
sklearn.ensemble offers two types of algorithms based on randomized trees: Random Forests and Extremely Randomized Trees.
RandomForestClassifier and RandomForestRegressor classes are used to deal with random forests.
In random forests, each estimator is built from a sample drawn with replacement from the training set.
ExtraTreesClassifier and ExtraTreesRegressor classes are used to deal with extremely randomized forests.
In extremely randomized forests, more randomness is introduced, which further reduces the variance of the model.
Boosting Methods
Boosting Methods build base estimators sequentially, with each one trying to reduce the bias of the combined estimator. sklearn.ensemble provides the following boosting utilities:
AdaBoostClassifier
GradientBoostingClassifier
rf_classifier = RandomForestClassifier()
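A minimal sketch completing the demo, assuming the same cancer train/test split as in the earlier demos:

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()
rf_classifier = rf_classifier.fit(X_train, Y_train)
print('Accuracy of Train Data :', rf_classifier.score(X_train, Y_train))
print('Accuracy of Test Data :', rf_classifier.score(X_test, Y_test))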
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .
Load popular Boston dataset from sklearn.datasets module and assign it to variable boston .
Split boston.data into two sets named X_train and X_test. Also, split boston.target into two sets Y_train and Y_test.
boston = datasets.load_boston()
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
(379, 13)
(127, 13)
Task 2
Import required module from sklearn.ensemble .
Build a Random Forest Regressor model from the X_train set and Y_train labels, with default parameters. Name the model as rf_reg.
Evaluate the model accuracy on the training data set and print its score.
Evaluate the model accuracy on the testing data set and print its score.
Predict the housing price for the first two samples of the X_test set and print them.
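One possible solution, a sketch assuming the X_train, X_test, Y_train, Y_test sets from Task 1:

from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()
rf_reg = rf_reg.fit(X_train, Y_train)
print('Accuracy of Train Data :', rf_reg.score(X_train, Y_train))
print('Accuracy of Test Data :', rf_reg.score(X_test, Y_test))
print('Predicted prices :', rf_reg.predict(X_test[:2]))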
Task 3
Fit multiple Random forest regressors on the X_train set and Y_train labels, with the max_depth parameter value changing from 3 to 5 and n_estimators set to each of the values 50, 100, and 200.
Print the max_depth and n_estimators values of the model with the highest accuracy.
Note: print the parameter values in the form of a tuple (a, b), where a refers to the max_depth value and b refers to n_estimators.
all_scores = {}
for m in range(3, 6):
    for n in [50, 100, 200]:
        rf_reg = RandomForestRegressor(n_estimators=n, max_depth=m)
        rf_reg = rf_reg.fit(X_train, Y_train)
        all_scores[(m, n)] = rf_reg.score(X_test, Y_test)

# (max_depth, n_estimators) of the model with the highest test accuracy
print(max(all_scores, key=all_scores.get))
Understanding SVM
Support Vector Machines (SVMs) separate data points using decision planes, which divide objects belonging to different classes in a higher dimensional space.
The SVM algorithm uses a suitable kernel function, which is capable of separating data points into two or more classes. Commonly used kernels are:
linear
polynomial
rbf
sigmoid
scikit-learn provides the following three utilities for performing Support Vector Classification.
SVC: classic C-support vector classification.
NuSVC: same as SVC, but uses a parameter to control the number of support vectors.
LinearSVC: similar to SVC with the parameter kernel fixed to the linear value.
scikit-learn provides the following three utilities for performing Support Vector Regression.
SVR
NuSVR
LinearSVR
Advantages of SVMs
SVMs are versatile: different kernel functions can be specified for the decision function.
Disadvantages of SVMs
SVMs do not scale well to datasets with a very large number of samples.
svm_classifier = SVC()
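A minimal sketch completing the unscaled fit, assuming the same cancer train/test split as in the earlier demos:

from sklearn.svm import SVC

svm_classifier = SVC()
svm_classifier = svm_classifier.fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_classifier.score(X_test, Y_test))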
In the following example, scaled input data is used to improve the accuracy of SVM classifier.
standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(cancer.data)
cancer_standardized = standardizer.transform(cancer.data)
svm_classifier = SVC()
Y_pred = svm_classifier.predict(X_test)
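A sketch of the full scaled-data pipeline; the stratified split with random_state=42 is an assumption, and classification_report comes from sklearn.metrics:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, Y_train, Y_test = train_test_split(
    cancer_standardized, cancer.target, stratify=cancer.target, random_state=42)

svm_classifier = SVC().fit(X_train, Y_train)
Y_pred = svm_classifier.predict(X_test)
print('Classification report : \n', classification_report(Y_test, Y_pred))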
Output
Classification report :
precision recall f1-score support
0 0.96 0.98 0.97 53
1 0.99 0.98 0.98 90
Hands-On - SVM
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Import two modules sklearn.datasets , and sklearn.model_selection .
Load popular digits dataset from sklearn.datasets module and assign it to variable digits .
Split digits.data into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use train_test_split method from sklearn.model_selection ; set random_state to 30 ; and perform stratified sampling.
digits = datasets.load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(digits.data, digits.target, stratify=digits.target, random_state=30)
print(X_train.shape)
print(X_test.shape)
(1347, 64)
(450, 64)
Task 2
Import required module from sklearn.svm .
Build an SVM classifier from the X_train set and Y_train labels, with default parameters. Name the model as svm_clf.
Evaluate the model accuracy on the testing data set and print its score.
from sklearn.svm import SVC

svm_clf = SVC()
svm_clf = svm_clf.fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_clf.score(X_test,Y_test))
Task 3
Perform Standardization of digits.data and store the transformed data in the variable digits_standardized.
Once again, split digits_standardized into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use the train_test_split method from sklearn.model_selection; set random_state to 30; and perform stratified sampling.
Build another SVM classifier from the X_train set and Y_train labels, with default parameters. Name the model as svm_clf2.
Evaluate the model accuracy on the testing data set and print its score.
from sklearn import preprocessing

standardizer = preprocessing.StandardScaler()
standardizer = standardizer.fit(digits.data)
digits_standardized = standardizer.transform(digits.data)

X_train, X_test, Y_train, Y_test = train_test_split(
    digits_standardized, digits.target, stratify=digits.target, random_state=30)
svm_clf2 = SVC().fit(X_train, Y_train)
print('Accuracy of Test Data :', svm_clf2.score(X_test, Y_test))
10 Clustering Technique
Introduction to Clustering
Clustering is an unsupervised technique used to group data points into clusters based on a specific algorithm. Commonly used clustering methods include:
K-means Clustering
Agglomerative clustering
DBSCAN clustering
Mean-shift clustering
Affinity propagation
Spectral clustering
K-Means Clustering
K-Means clustering partitions data points into a chosen number of clusters, assigning each point to the cluster with the nearest mean (centroid).
Agglomerative Clustering
In agglomerative clustering, the distance between each pair of clusters is computed, and the two nearest clusters are merged together.
Merging of two clusters can follow any of the following linkage types: ward, complete, or average.
Now let's understand how density-based clustering is performed. DBSCAN from sklearn.cluster is used for this purpose.
Mean Shift Clustering
make_blobs from sklearn.datasets can be used to generate the blob areas, and MeanShift from sklearn.cluster can be used to perform Mean Shift clustering, as sketched below.
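A minimal sketch, with the number of samples and centers chosen purely for illustration:

from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift

# Generate three well-separated blobs of points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Mean Shift discovers the number of clusters on its own
mean_shift = MeanShift().fit(X)
print(mean_shift.cluster_centers_)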
Affinity Propagation
Affinity Propagation generates clusters by passing messages between pairs of data points, until convergence.
Spectral Clustering
Spectral Clustering is ideal to cluster data that is connected, and may not be in a compact space.
Demo of KMeans
from sklearn.cluster import KMeans

kmeans_cluster = KMeans(n_clusters=2)
kmeans_cluster = kmeans_cluster.fit(X_train)
kmeans_cluster.predict(X_test)
The clustering can be evaluated with the following metrics:
Homogeneity: each cluster contains only members of a single class.
Completeness: all members of a given class are assigned to the same cluster.
# The metric functions expect (labels_true, labels_pred)
print(metrics.homogeneity_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.completeness_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.v_measure_score(Y_test, kmeans_cluster.predict(X_test)))
print(metrics.adjusted_rand_score(Y_test, kmeans_cluster.predict(X_test)))
Output
0.483862796607
0.573236466834
0.524771531969
0.54983994112
Hands-On - Clustering
Installation
Let's get the installations done prior to starting the tasks.
Task 1
Import three modules sklearn.datasets , sklearn.cluster , and sklearn.metrics .
Load popular iris dataset from sklearn.datasets module and assign it to variable iris .
Cluster iris.data set into 3 clusters using K-means with default parameters. Name the model as km_cls .
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

iris = datasets.load_iris()
km_cls = KMeans(n_clusters=3)
km_cls = km_cls.fit(iris.data)
print(metrics.homogeneity_score(iris.target, km_cls.predict(iris.data)))
0.7519188912745023
Task 2
Cluster iris.data set into 3 clusters using Agglomerative clustering. Name the model as agg_cls .
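One possible solution sketch; AgglomerativeClustering has no predict method, so the fitted labels_ attribute is used:

from sklearn.cluster import AgglomerativeClustering

agg_cls = AgglomerativeClustering(n_clusters=3)
agg_cls = agg_cls.fit(iris.data)
print(metrics.homogeneity_score(iris.target, agg_cls.labels_))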
0.829701532311426
Task 3
Cluster iris.data set using Affinity Propagation clustering method with default parameters. Name the model as af_cls .
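One possible solution sketch; Affinity Propagation determines the number of clusters itself:

from sklearn.cluster import AffinityPropagation

af_cls = AffinityPropagation()
af_cls = af_cls.fit(iris.data)
print(metrics.homogeneity_score(iris.target, af_cls.labels_))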
11 Course Summary
Introduced to scikit-learn .
Used scikit-learn to build models based on Supervised learning techniques like Nearest neighbors, Decision Trees, Random forests, and SVMs.
Used scikit-learn to build models based on Unsupervised learning techniques such as K-means, Agglomerative clustering, and Density-based clustering.