Python Machine Learning for Beginners
Edited by AI Publishing
eBook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-5-3
Legal Notice:
You are not permitted to amend, use, distribute, sell, quote, or paraphrase any part
of the content within this book without the specific consent of the author.
Disclaimer Notice:
Kindly note that the information contained within this document is solely for
educational and entertainment purposes. No warranties of any kind are indicated or
expressed. Readers accept that the author is not providing any legal, professional,
financial, or medical advice. Kindly consult a licensed professional before trying out
any techniques explained in this book.
By reading this document, the reader consents that under no circumstances is the
author liable for any losses, direct or indirect, that are incurred as a consequence of
the use of the information contained within this document, including, but not
restricted to, errors, omissions, or inaccuracies.
How to Contact Us
Preface
Book Approach
Who Is This Book For?
How to Use This Book?
Exercises Solutions
Exercise 2.1
Exercise 2.2
Exercise 3.1
Exercise 3.2
Exercise 4.1
Exercise 4.2
Exercise 5.1
Exercise 5.2
Exercise 6.1
Exercise 6.2
Exercise 7.1
Exercise 7.2
Exercise 8.1
Exercise 8.2
Exercise 9.1
Exercise 9.2
Exercise 10.1
Exercise 10.2
Preface
§ Book Approach
The book follows a very simple approach. It is divided into 10
chapters. The first five chapters of the book are dedicated to
data analysis and visualization, while the last five chapters are
based on machine learning and statistical models for data
science. Chapter 1 provides a very brief introduction to data
science and machine learning and provides a roadmap for
a step-by-step learning approach to data science and machine
learning. The process for environment setup, including the
software needed to run scripts in this book, is also explained
in this chapter.
https://www.aispublishing.net/book-pmld
In this book, you will learn both Data Science and Machine
Learning. In the first five chapters, you will study the concepts
required to store, analyze, and visualize the datasets. From the
6th chapter onwards, different types of machine learning
concepts are explained.
Once you are familiar with basic machine learning and deep
learning algorithms, you are good to go for developing data
science applications. Data science applications can be of
different types, i.e., predicting house prices, recognizing
images, classifying text, etc. Being a beginner, you should try
to develop versatile data science applications, and later, when
you find your area of interest, e.g., natural language
processing or image recognition, delve deep into that. It is
important to mention that this book provides a very generic
introduction to data science, and you will see applications of
data science to structured data, textual data, and image data.
However, this book is not dedicated to any specific data
science field.
$ cd /tmp
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
$ sha256sum Anaconda3-5.2.0-Linux-x86_64.sh
09f53738b0cd3bb96f5b1bac488e5528df9906be2480fe61df40e0e0d19e3d48  Anaconda3-5.2.0-Linux-x86_64.sh
$ bash Anaconda3-5.2.0-Linux-x86_64.sh
Output:
[/home/tola/anaconda3] >>>
Output
…
Installation finished.
Do you wish the installer to prepend Anaconda3 install location to path
in your /home/tola/.bashrc? [yes|no]
[no]>>>
$ source ~/.bashrc
8. You can also test the installation using the conda command.
$ conda list
https://colab.research.google.com/
1. import tensorflow as tf
2. print(tf.__version__)
With Google Colab, you can import datasets from your Google Drive. Execute the following script, and click on the link that appears, as shown below:
You will be prompted to allow Google Colab to access your
Google drive. Click the Allow button, as shown below:
You will see a link appear, as shown in the following image
(the link has been redacted here).
Copy the link and paste it in the empty field in the Google
Colab cell, as shown below:
This way, you can import datasets from your Google drive to
your Google Colab environment.
In the next chapter, you will see how to write your first
program in Python, along with other Python programming
concepts.
Python Crash Course
Script 1:
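A minimal sketch of such a first script (the exact string printed in the original may differ) is:

# print a string value to the console
print("Welcome to Python")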
The above script prints a string value in the output using the print() method. The print() method prints any string passed to it on the console. If you see the following output, you have successfully run your first Python program.
Output:
b. Integers
f. Tuples
g. Dictionaries
Script 2:
1. # A string Variable
2. first_name = "Joseph"
3. print(type(first_name))
4.
5. # An Integer Variable
6. age = 20
7. print(type(age))
8.
9. # A floating point variable
10. weight = 70.35
11. print(type(weight))
12.
13. # A boolean variable
14. married = False
15. print(type(married))
16.
17. #List
18. cars = ["Honda", "Toyota", "Suzuki"]
19. print(type(cars))
20.
21. #Tuples
22. days = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
23. print(type(days))
24.
25. #Dictionaries
26. days2 = {1:"Sunday", 2:"Monday", 3:"Tuesday", 4:"Wednesday", 5:"Thursday", 6:"Friday", 7:"Saturday"}
27. print(type(days2))
Output:
<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>
<class 'list'>
<class 'tuple'>
<class 'dict'>
b. Logical Operators
c. Comparison Operators
d. Assignment Operators
e. Membership Operators
§ Arithmetic Operators
Arithmetic operators are used to perform arithmetic
operations in Python. The following table sums up the
arithmetic operators supported by Python. Suppose X = 20,
and Y = 10.
Here is an example of arithmetic operators with output:
Script 3:
1. X = 20
2. Y = 10
3. print(X + Y)
4. print(X - Y)
5. print(X * Y)
6. print(X / Y)
7. print(X ** Y)
Output:
30
10
200
2.0
10240000000000
§ Logical Operators
Logical operators are used to perform logical AND, OR, and
NOT operations in Python. The following table summarizes the
logical operators. Here, X is True, and Y is False.
Here is an example that explains the usage of Python logical
operators.
Script 4:
1. X = True
2. Y = False
3. print(X and Y)
4. print(X or Y)
5. print(not(X and Y))
Output:
1. False
2. True
3. True
§ Comparison Operators
Comparison operators, as the name suggests, are used to compare two or more operands. Depending upon
the relation between the operands, comparison operators
return Boolean values. The following table summarizes
comparison operators in Python. Here, X is 20, and Y is 35.
The comparison operators have been demonstrated in action
in the following example:
Script 5
1. X = 20
2. Y = 35
3.
4. print(X == Y)
5. print(X != Y)
6. print(X > Y)
7. print(X < Y)
8. print(X >= Y)
9. print(X <= Y)
Output:
False
True
False
True
False
True
§ Assignment Operators
Assignment operators are used to assign values to variables.
The following table summarizes the assignment operators.
Here, X is 20, and Y is equal to 10.
Take a look at script 6 to see Python assignment operators in
action.
Script 6:
1. X = 20; Y = 10
2. R = X + Y
3. print(R)
4.
5. X = 20;
6. Y = 10
7. X += Y
8. print(X)
9.
10. X = 20;
11. Y = 10
12. X -= Y
13. print(X)
14.
15. X = 20;
16. Y = 10
17. X *= Y
18. print(X)
19.
20. X = 20;
21. Y = 10
22. X /= Y
23. print(X)
24.
25. X = 20;
26. Y = 10
27. X %= Y
28. print(X)
29.
30. X = 20;
31. Y = 10
32. X **= Y
33. print(X)
Output:
30
30
10
200
2.0
0
10240000000000
§ Membership Operators
Membership operators are used to find if an item is a member
of a collection of items or not. There are two types of
membership operators: the in operator and the not in
operator. The following script shows the in operator in action.
Script 7:
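A minimal sketch of the in operator, reusing the list of day names defined earlier (the exact collection in the original script may differ), is:

# check whether an item exists in a collection
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
print("Sunday" in days)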
Output:
True
Script 8:
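A minimal sketch of the not in operator (again, the exact collection in the original script may differ) is:

# check whether an item is absent from a collection
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
print("Someday" not in days)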
Output:
True
b. If-else statement
c. If-elif statement
§ IF Statement
If you have to check for a single condition and you are not concerned with the alternate condition, you can use the if statement. For instance, if you want to check if 10 is greater than 5 and, based on that, print a message, you can use the if statement. The condition evaluated by the if statement returns a Boolean value. If the condition evaluated by the if statement is true, the code block that follows the if statement executes. It is important to mention that in Python, a new code block starts on a new line, indented one tab from the left compared with the outer block.
Here, in the following example, the condition 10 > 5 is
evaluated, which returns true. Hence, the code block that
follows the if statement executes, and a message is printed on
the console.
Script 9:
1. # The if statement
2.
3. if 10 > 5:
4.     print("Ten is greater than 5")
Output:
Ten is greater than 5
§ IF-Else Statement
The if-else statement comes in handy when you want to execute an alternate piece of code in case the condition for the if statement returns false. For instance, in the following example, the condition 5 > 10 will return false. Hence, the code block that follows the else statement will execute.
Script 10:
1. # if-else statement
2.
3. if 5 > 10:
4.     print("5 is greater than 10")
5. else:
6.     print("10 is greater than 5")
Output:
10 is greater than 5
§ IF-Elif Statement
The if-elif statement comes in handy when you have to evaluate multiple conditions. For instance, in the following example, we first check if 5 > 10, which evaluates to false. Next, an elif statement evaluates the condition 8 < 4, which also returns false. Hence, the code block that follows the last else statement executes.
Script 11:
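A minimal sketch matching the conditions described above is:

# if-elif-else statement

if 5 > 10:
    print("5 is greater than 10")
elif 8 < 4:
    print("8 is smaller than 4")
else:
    print("neither of the two conditions is true")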
Output:
b. While Loop
§ For Loop
The for loop is used to iteratively execute a piece of code for
a certain number of times. You should typically use a for loop
when you know the exact number of iterations or repetitions
for which you want to run your code. A for loop iterates over
a collection of items. In the following example, we create a
collection of five integers using the range() method. Next, a
for loop iterates five times and prints each integer in the
collection.
Script 12:
1. items = range(5)
2. for item in items:
3. print(item)
Output:
0
1
2
3
4
§ While Loop
The while loop keeps executing a certain piece of code as long as the evaluation condition remains true. For instance, the while loop in the following script keeps executing until the variable c becomes equal to 10.
Script 13:
1. c = 0
2. while c < 10:
3. print(c)
4. c = c +1
Output:
0
1
2
3
4
5
6
7
8
9
2.6. Functions
In any programming language, functions are used to implement code that needs to be executed multiple times at different locations in a program. In such cases, instead of writing long pieces of code again and again, you can simply define a function that contains the piece of code, and then you can call the function wherever you want in the code.
Script 14:
1. def myfunc():
2.     print("This is a simple function")
3.
4. ### function call
5. myfunc()
Output:
This is a simple function
You can also pass values to a function. The values are passed inside the parentheses of the function call. However, you must specify the parameter name in the function definition, too. In the following script, we define a function named myfuncparam(). The function accepts one parameter, i.e., num. The value passed in the parentheses of the function call will be stored in this num variable and will be printed by the print() method inside the myfuncparam() function.
Script 15:
1. def myfuncparam(num):
2.     print("This is a function with parameter value: "+num)
3.
4. ### function call
5. myfuncparam("Parameter 1")
Output:
This is a function with parameter value: Parameter 1
Script 16:
1. def myreturnfunc():
2.     return "This function returns a value"
3.
4. val = myreturnfunc()
5. print(val)
Output:
Script 17:
1. class Fruit:
2.
3.     name = "apple"
4.     price = 10
5.
6.     def eat_fruit(self):
7.         print("Fruit has been eaten")
8.
9.
10. f = Fruit()
11. f.eat_fruit()
12. print(f.name)
13. print(f.price)
Output:
Script 18:
1. class Fruit:
2.
3.     name = "apple"
4.     price = 10
5.
6.     def __init__(self, fruit_name, fruit_price):
7.         Fruit.name = fruit_name
8.         Fruit.price = fruit_price
9.
10.     def eat_fruit(self):
11.         print("Fruit has been eaten")
12.
13.
14. f = Fruit("Orange", 15)
15. f.eat_fruit()
16. print(f.name)
17. print(f.price)
Output:
2.8.1. NumPy
https://numpy.org/
2.8.2. Matplotlib
While Matplotlib graphs are easy to plot, the look and feel of Matplotlib plots has a distinct 1990s flavor. Many wrapper libraries, such as Pandas and Seaborn, have been developed on top of Matplotlib. These libraries allow users to plot much cleaner and more sophisticated graphs.
https://matplotlib.org/
2.8.3. Seaborn
https://seaborn.pydata.org/
2.8.4. Pandas
The Pandas library, like Seaborn, is based on the Matplotlib library and offers utilities that can be used to plot different types of static plots in a single line of code. With Pandas, you can import data in various formats, such as CSV (Comma-Separated Values) and TSV (Tab-Separated Values), and can plot a variety of data visualizations via these data sources.
https://pandas.pydata.org/
https://scikit-learn.org/stable/
2.8.6. TensorFlow
https://www.tensorflow.org/
2.8.7. Keras
https://keras.io/
Exercise 2.1
Question 1
B. While Loop
C. Both A and B
D. None of the above
Question 2
B. Double Value
Question 3
B. Out
C. Not In
D. Both A and C
Exercise 2.2
Print the table of integer 9 using a while loop:
Python NumPy Library for Data Analysis
In the next section, you will see how to create NumPy arrays
using different methods.
Script 1:
1. import numpy as np
2. nums_list = [10,12,14,16,20]
3. nums_array = np.array(nums_list)
4. type(nums_array)
Output:
numpy.ndarray
Script 2:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. nums_2d.shape
Output:
(3, 3)
Script 3:
1. nums_arr = np.arange(5,11)
2. print(nums_arr)
Output:
[5 6 7 8 9 10]
Script 4:
1. nums_arr = np.arange(5,12,2)
2. print(nums_arr)
Output:
[5 7 9 11]
Script 5:
1. ones_array = np.ones(6)
2. print(ones_array)
Output:
[1. 1. 1. 1. 1. 1.]
Script 6:
1. ones_array = np.ones((6,4))
2. print(ones_array)
Output:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
Script 7:
1. zeros_array = np.zeros(6)
2. print(zeros_array)
Output:
[0. 0. 0. 0. 0. 0.]
Script 8:
1. zeros_array = np.zeros((6,4))
2. print(zeros_array)
Output:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
3.2.5. Using the eye() Method
Script 9:
1. eyes_array = np.eye(5)
2. print(eyes_array)
Output:
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
Script 10:
1. uniform_random = np.random.rand(4, 5)
2. print(uniform_random)
Output:
Script 11:
1. normal_random = np.random.randn(4, 5)
2. print(normal_random)
Output:
Script 12:
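Judging from the output below, this script generates an array of five random integers, which NumPy's randint() method can do (the bounds 10 and 50 below are illustrative assumptions):

# five random integers between 10 (inclusive) and 50 (exclusive)
integer_random = np.random.randint(10, 50, 5)
print(integer_random)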
Output:
[25 49 21 35 17]
Script 13:
1. uniform_random = np.random.rand(4, 6)
2. uniform_random = uniform_random.reshape(3, 8)
3. print(uniform_random)
Output:
Script 14:
1. s = np.arange(1,11)
2. print(s)
Output:
[ 1 2 3 4 5 6 7 8 9 10]
Script 15:
print(s[1])
Output:
2
Script 16:
print(s[1:9])
Output:
[2 3 4 5 6 7 8 9]
If you specify only the upper bound, all the items from the first index to the upper bound are returned. Similarly, if you specify only the lower bound, all the items from the lower bound to the last item of the array are returned.
Script 17:
1. print(s[:5])
2. print(s[5:])
Output:
[1 2 3 4 5]
[ 6 7 8 9 10]
Script 18:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[:2,:])
Output:
[[10 12 13]
[45 32 16]]
Similarly, the following script returns all the rows but only the
first two columns.
Script 19:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[:,:2])
Output:
[[10 12]
[45 32]
[45 32]]
Script 20:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[1:,1:])
Output:
[[32 16]
[32 16]]
The sqrt() function is used to find the square roots of all the
elements in a list as shown below:
Script 21:
1. nums = [10,20,30,40,50]
2. np_sqr = np.sqrt(nums)
3. print(np_sqr)
Output:
The log() function is used to find the logs of all the elements
in a list as shown below:
Script 22:
1. nums = [10,20,30,40,50]
2. np_log = np.log(nums)
3. print(np_log)
Output:
Script 23:
1. nums = [10,20,30,40,50]
2. np_exp = np.exp(nums)
3. print(np_exp)
Output:
You can find the sines and cosines of items in a list using the
sine and cosine function, respectively, as shown in the
following script.
Script 24:
1. nums = [10,20,30,40,50]
2. np_sine = np.sin(nums)
3. print(np_sine)
4.
5. nums = [10,20,30,40,50]
6. np_cos = np.cos(nums)
7. print(np_cos)
Output:
To find a matrix dot product, you can use the dot() function.
To find the dot product, the number of columns in the first
matrix must match the number of rows in the second matrix.
Here is an example.
Script 25:
1. A = np.random.randn(4,5)
2.
3. B = np.random.randn(5,4)
4.
5. Z = np.dot(A,B)
6.
7. print(Z)
Output:
[[ 1.43837722 -4.74991285 1.42127048 -0.41569506]
[-1.64613809 5.79380984 -1.33542482 1.53201023]
[-1.31518878 0.72397674 -2.01300047 0.61651047]
[-1.36765444 3.83694475 -0.56382045 0.21757162]]
Script 26:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. multiply = np.multiply(nums_2d, nums_2d)
7. print(multiply)
Output:
Script 27:
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. inverse = np.linalg.inv(nums_2d)
8. print(inverse)
Output:
Script 28:
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. determinant = np.linalg.det(nums_2d)
8. print(determinant)
Output:
-9.51619735392994e-16
This matrix is singular, so its exact determinant is 0; the tiny non-zero value is the result of floating point rounding error.
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. trace = np.trace(nums_2d)
8. print(trace)
Output:
15
Exercise 3.1
Question 1:
B. np.multiply(matrix1, matrix2)
C. np.elementwise(matrix1, matrix2)
D. none of the above
Question 2:
B. np.id(4,4)
C. np.eye(4,4)
D. All of the above
Question 3:
B. np.arange(4, 16, 3)
C. np.arange(4, 15,3)
D. none of the above
Exercise 3.2
Create a random NumPy array of five rows and four columns.
Using array indexing and slicing, display the items from row
three to end and column two to end.
Introduction to Pandas Library for Data
Analysis
4.1. Introduction
In this chapter, you will see how to use Python’s Pandas library
for data analysis. In the next chapter, you will see how to use
the Pandas library for data visualization by plotting different
types of plots.
import pandas as pd
Script 1:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
Output:
The read_csv() method reads data from a CSV or TSV file and
stores it in a Pandas dataframe, which is a special object that
stores data in the form of rows and columns.
Script 2:
1. titanic_pclass1= (titanic_data.Pclass == 1)
2. titanic_pclass1
Output:
0 False
1 True
2 False
3 True
4 False
…
886 False
887 True
888 False
889 True
890 False
Name: Pclass, Length: 891, dtype: bool
Script 3:
1. titanic_pclass1= (titanic_data.Pclass == 1)
2. titanic_pclass1_data = titanic_data[titanic_pclass1]
3. titanic_pclass1_data.head()
Output:
Script 4:
1. titanic_pclass_data = titanic_data[titanic_data.Pclass == 1]
2. titanic_pclass_data.head()
Output:
Script 5:
1. ages = [20,21,22]
2. age_dataset = titanic_data[titanic_data[“Age”].isin(ages)]
3. age_dataset.head()
Output:
Script 6:
1. ages = [20,21,22]
2. ageclass_dataset = titanic_data[titanic_data[“Age”].isin(ages) &
(titanic_data[“Pclass”] == 1) ]
3. ageclass_dataset.head()
Output:
Script 7:
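A sketch consistent with the description below, using the Pandas filter() method to keep only the Name, Sex, and Age columns (the result variable name is an assumption), is:

# keep only the Name, Sex, and Age columns
titanic_filtered = titanic_data.filter(["Name", "Sex", "Age"])
titanic_filtered.head()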
The output below shows that the dataset now contains only
Name, Sex, and Age columns.
Output:
In addition to filtering columns, you can also drop columns
that you don’t want in the dataset. To do so, you need to call
the drop() method and pass it the list of columns that you
want to drop. For instance, the following script drops the
Name, Age, and Sex columns from the Titanic dataset and
returns the remaining columns.
Script 8:
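A sketch consistent with the description above, dropping the Name, Age, and Sex columns via the drop() method (the result variable name is an assumption), is:

# drop the Name, Age, and Sex columns and keep the rest
titanic_dropped = titanic_data.drop(["Name", "Age", "Sex"], axis=1)
titanic_dropped.head()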
Output:
Script 9:
1. titanic_pclass1_data = titanic_data[titanic_data.Pclass == 1]
2. print(titanic_pclass1_data.shape)
3.
4. titanic_pclass2_data = titanic_data[titanic_data.Pclass == 2]
5. print(titanic_pclass2_data.shape)
Output:
(216, 12)
(184, 12)
Script 10:
1. final_data = titanic_pclass1_data.append(titanic_pclass2_data,
ignore_index=True)
2. print(final_data.shape)
Output:
(400, 12)
The output now shows that the total number of rows is 400,
which is the sum of the number of rows in the two dataframes
that we concatenated.
Script 11:
Output:
(400, 12)
Script 12:
1. df1 = final_data[:200]
2. print(df1.shape)
3. df2 = final_data[200:]
4. print(df2.shape)
5.
6. final_data2 = pd.concat([df1, df2], axis = 1, ignore_index = True)
7. print(final_data2.shape)
Output:
(200, 12)
(200, 12)
(400, 24)
Script 13:
1. age_sorted_data = titanic_data.sort_values(by=[‘Age’])
2. age_sorted_data.head()
Output:
Script 14:
Output:
Script 15:
1. age_sorted_data = titanic_data.sort_values(by=[‘Age’,’Fare’],
ascending = False)
2. age_sorted_data.head()
Output:
Script 16:
1. updated_class = titanic_data.Pclass.apply(lambda x : x + 2)
2. updated_class.head()
The output shows that all the values in the Pclass column have
been incremented by 2.
Output:
0 5
1 3
2 5
3 3
4 5
Script 17:
1. def mult(x):
2. return x * 2
3.
4. updated_class = titanic_data.Pclass.apply(mult)
5. updated_class.head()
Output:
0 6
1 2
2 6
3 2
4 6
Script 18:
Output:
Script 19:
1. flights_data_pivot =flights_data.pivot_table(index=‘month’,
columns=‘year’, values=‘passengers’)
2. flights_data_pivot.head()
Output:
Script 20:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
4.
5. pd.crosstab(titanic_data.Pclass, titanic_data.Age, margins=True)
Output:
Script 21:
1. import numpy as np
2. titanic_data.Fare = np.where( titanic_data.Age > 20, titanic_data.Fare
+5, titanic_data.Fare)
3.
4. titanic_data.head()
Output:
Hands-on Time – Exercise
Now, it is your turn. Follow the instructions in the exercises
below to check your understanding of data analysis with the
Pandas library. The answers to these exercises are provided
after chapter 10 in this book.
Exercise 4.1
Question 1
B. 1
C. 2
D. None of the above
Question 2
B. sort_rows()
C. sort_values()
D. sort_records()
Question 3
B. filter_columns()
C. apply_filter()
D. None of the above()
Exercise 4.2
Use the apply function to subtract 10 from the Fare column
of the Titanic dataset, without using a lambda expression.
Data Visualization via Matplotlib, Seaborn,
and Pandas Libraries
In this chapter, you will see some of the most commonly used
Python libraries for data visualization. You will see how to plot
different types of plots using the Matplotlib, Seaborn, and Pandas
libraries.
A line plot is the first plot that we are going to plot in this
chapter. A line plot is the easiest of all the Matplotlib plots.
This plot is basically used to plot the relationship between two
numerical sets of values. Usually, a line plot is used to plot an
increasing or decreasing trend between two dependent
variables. For instance, if you want to see how the temperature changed over a period of 24 hours, you can use a line plot, where the x-axis contains the hourly information, and the y-axis contains the temperature in degrees. Let us plot a line plot that displays the square roots of 20 equidistant numbers between 0 and 20. Look at Script 1:
Script 1:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. x_vals = np.linspace(0, 20, 20)
6. y_vals = [math.sqrt(i) for i in x_vals]
7. plt.plot(x_vals, y_vals)
Output:
Script 2:
Output:
You can also increase the default plot size of a Matplotlib plot.
To do so, you can use the rcParams list of the pyplot module
and then set two values for the figure.figsize attribute. The
following script sets the plot size to 8 inches wide and 6
inches tall.
Script 3:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.rcParams[“figure.figsize”] = [8,6]
6.
7. x_vals = np.linspace(0, 20, 20)
8. y_vals = [math.sqrt(i) for i in x_vals]
9. plt.plot(x_vals, y_vals)
In the output, you can see that the default plot size has been
increased.
Output:
Script 4:
Here, in the output, you can see the labels and title that you specified in Script 4.
Output:
Script 5:
Output:
Script 6:
Output:
You can also plot multiple line plots inside one graph. All you
have to do is call the plot() method twice with different values
for x and y axes. The following script plots a line plot for
square root in red and for a cube function in blue.
Script 7:
Output:
Further Readings – Matplotlib Line Plot
To study more about the Matplotlib line plot, please check
Matplotlib’s official documentation for line plots
(https://bit.ly/33BqsIR). Get used to searching and reading
this documentation. It is a great resource of knowledge.
Script 8:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.csv")
If you do not see any error, the file has been read successfully.
To see the first five rows of the Pandas dataframe containing
the data, you can use the head() method as shown below:
Script 9:
data.head()
Output:
You can see that the iris_data.csv file has five columns. We can use values from any two of these columns to plot a line plot. To do so, for the x and y axes, we need to pass the data
dataframe column names to the plot() function of the pyplot
module. To access a column name from a Pandas dataframe,
you need to specify the dataframe name followed by a pair of
square brackets. Inside the brackets, the column name is
specified. The following script plots a line plot where the x-
axis contains values from the sepal_length column, whereas
the y-axis contains values from the petal_length column of
the dataframe.
Script 10:
Output:
Like CSV, you can also read a TSV file via the read_csv() method. You have to pass '\t' as the value for the sep parameter. Script 11 reads the iris_data.tsv file and stores it in a Pandas dataframe. Next, the first five rows of the dataframe are printed via the head() method.
Script 11:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.tsv", sep="\t")
3. data.head()
Output:
The remaining process to plot the line plot remains the same,
as it was for the CSV file. The following script plots a line plot,
where the x-axis contains sepal length, and the y-axis displays
petal length.
Script 12:
Output:
5.2.4. Scatter Plots
Script 13:
The output shows a scatter plot with blue points. The plot
clearly shows that with an increase in sepal length, the petal
length of an iris flower also increases.
Output:
Script 14:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. data.head()
Output:
To plot a bar plot, you need to call the bar() method. The
categorical values are passed as the x-axis and corresponding
aggregated numerical values are passed on the y-axis. The
following script plots a bar plot between genders and ages of
the passengers on the Titanic ship.
Script 15:
Output:
5.2.6. Histograms
Script 16:
Output:
Script 17:
Output:
Further Readings – Matplotlib Pie Charts
To study more about Matplotlib pie charts, please check
Matplotlib’s official documentation for Pie Charts
(https://bit.ly/31qoXdy). Get used to searching and reading
this documentation. It is a great resource of knowledge.
Script 18:
Output:
Script 19:
1. plt.rcParams[“figure.figsize”] = [10,8]
2. sns.distplot(tips_data[‘total_bill’])
Output:
Script 20:
Output:
The scatter plot can be replaced by a regression line in a joint
plot. To do so, you need to pass reg as the value for the kind
parameter of the jointplot() function.
Script 21:
Output:
Further Readings – Seaborn Joint Plots
To study more about Seaborn joint plots, please check
Seaborn’s official documentation for jointplots
(https://bit.ly/31DHFyO). Try to plot joint plots with a
different set of attributes, as mentioned in the official
documentation.
The pair plot is used to plot a joint plot for all the
combinations of numeric and Boolean columns in a dataset.
To plot a pair plot, you need to call the pairplot() function and pass it your dataset.
Script 22:
sns.pairplot(data=tips_data)
Output:
Further Readings – Seaborn Pair Plot
To study more about Seaborn pair plots, please check
Seaborn’s official documentation for pairplots
(https://bit.ly/3a7PdgK). Try to plot pair plots with a
different set of attributes, as mentioned in the official
documentation.
Script 23:
Output:
Output:
A count plot is similar to a bar plot. However, unlike a bar plot, which plots average values, the count plot simply displays the counts of the occurrences of records for each unique value in a categorical column. The countplot() function
is used to plot a count plot with Seaborn. The following script
plots a count plot for the pclass column of the Titanic dataset.
Script 25:
sns.countplot(x='pclass', data=titanic_data)
Output:
Further Readings – Seaborn Count Plot
To study more about Seaborn count plots, please check
Seaborn’s official documentation for count plots
(https://bit.ly/3ilzH3N). Try to plot count plots with a
different set of attributes, as mentioned in the official
documentation.
The box plot is used to plot the quartile information for data in
a numeric column. To plot a box plot, the boxplot() method is
used. To plot a horizontal box plot, the column name of the
dataset is passed to the x-axis. The following script plots a
box plot for the fare column of the Titanic dataset.
Script 26:
sns.boxplot(x=titanic_data["fare"])
Output:
Violin plots are similar to Box plots. However, unlike Box plots
that plot quartile information, the Violin plots plot the overall
distribution of values in the numeric columns. The following
script plots two Violin plots for the passengers traveling alone
and for the passengers traveling along with another
passenger. The violinplot() function is used to plot a violin plot with Seaborn.
Script 27:
Output:
Before you can plot any visualization with the Pandas library,
you need to read data into a Pandas dataframe. The best way
to do so is via the read_csv() method. The following script
shows how to read the Titanic dataset into a dataframe
named titanic_data. You can give any name to the dataframe.
Script 28:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
Output:
Let’s now see how to plot different types of plots with Pandas
dataframe. The first plot we are going to plot is a Histogram.
There are multiple ways to plot a graph in Pandas. The first
way is to select the dataframe column by specifying the name
of the column in square brackets that follows the dataframe
name and then append the plot name via dot operator. The
following script plots a histogram for the Age column of the
Titanic dataset using the hist() function. It is important to
mention that behind the scenes, the Pandas library makes use
of the Matplotlib plotting functions. Therefore, you need to
import the Matplotlib’s pyplot module before you can plot
Pandas visualizations.
Script 29:
Output:
Further Readings – Pandas Histogram
To study more about the Pandas histogram, please check
Pandas’ official documentation for histogram
(https://bit.ly/30F0qT9). Try to execute the histogram
method with a different set of attributes, as mentioned in the
official documentation.
To plot line plots via the Pandas dataframe, we will use the
Flights dataset. The following script imports the Flights
dataset from the built-in seaborn library.
Script 30:
1. flights_data = sns.load_dataset('flights')
2.
3. flights_data.head()
Output:
By default, the index serves as the x-axis. In the above script,
the leftmost column, i.e., the column containing 0,1,2 … is the
index column. To plot a line plot, you have to specify the
column names for x and y axes. If you only specify the column
value for the y-axis, the index is used as the x-axis. The
following script plots a line plot for the passengers column of
the flights data.
Script 31:
flights_data.plot.line(y='passengers', figsize=(8,6))
Output:
Further Readings – Pandas Line Plots
To study more about Pandas line plots, please check Pandas’
official documentation for line plots
(https://bit.ly/30F0qT9). Try to execute the line() method
with a different set of attributes, as mentioned in the official
documentation.
Script 32:
flights_data.plot.scatter(x='year', y='passengers', figsize=(8,6))
Output:
Further Readings – Pandas Scatter Plots
To study more about Pandas scatter plots, please check
Pandas’ official documentation for scatter plots
(https://bit.ly/2DxSg6b). Try to execute the scatter()
method with a different set of attributes, as mentioned in the
official documentation.
Script 33:
Output:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
<class ‘list’>
Script 34:
1. df = pd.DataFrame({'Gender':['Female', 'Male'], 'Age':sex_mean.tolist()})
2. ax = df.plot.bar(x='Gender', y='Age', figsize=(8,6))
Output:
Further Readings – Pandas Bar Plots
To study more about Pandas bar plots, please check Pandas’
official documentation for bar plots (https://bit.ly/31uCe5a).
Try to execute bar plot methods with a different set of
attributes, as mentioned in the official documentation.
To plot box plots via the Pandas library, you need to call the
box() function. The following script plots box plots for all the
numeric columns in the Titanic dataset.
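A sketch using the Pandas box() function on the Titanic dataframe (the figure size is an illustrative assumption) is:

# box plots for all numeric columns of the Titanic dataset
titanic_data.plot.box(figsize=(8,6))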
Output:
Further Readings – Pandas Box Plots
To study more about Pandas box plots, please check
Pandas’ official documentation for box plots
(https://bit.ly/3kAvRWG). Try to execute box plot methods
with a different set of attributes, as mentioned in the official
documentation.
Exercise 5.1
Question 1
B. barh()
C. bar_horizontal()
D. horizontal_bar()
Question 2:
B. label
C. axis
D. All of the above
Question 3:
B. percentage = ‘%1.1f%%’
C. perc = ‘%1.1f%%’
D. None of the Above
Exercise 5.2
Plot two scatter plots on the same graph using the
tips_dataset. In the first scatter plot, display values from the
total_bill column on the x-axis and from the tip column on the
y-axis. The color of the first scatter plot should be green. In
the second scatter plot, display values from the total_bill
column on the x-axis and from the size column on the y-axis.
The color of the second scatter plot should be blue, and
markers should be x.
Solving Regression Problems in Machine
Learning Using Sklearn Library
You can read data from CSV files. However, the datasets we
are going to use in this section are available by default in the
Seaborn library. To view all the datasets, you can use the
get_dataset_names() function as shown in the following
script:
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4. sns.get_dataset_names()
Output:
[‘anagrams’,
‘anscombe’,
‘attention’,
‘brain_networks’,
‘car_crashes’,
‘diamonds’,
‘dots’,
‘exercise’,
‘flights’,
‘fmri’,
‘gammas’,
‘geyser’,
‘iris’,
‘mpg’,
‘penguins’,
‘planets’,
‘tips’,
‘titanic’]
The following script loads the Tips dataset and displays its
first five rows.
Script 2:
1. tips_df = sns.load_dataset(“tips”)
2. tips_df.head()
Output:
Script 3:
1. diamond_df = sns.load_dataset(“diamonds”)
2. diamond_df.head()
Output:
As a first step, we divide the data into features and labels sets.
Our labels set consists of values from the “tip” column, while
the features set consists of values from the remaining
columns. The following script divides the data into features
and labels sets.
Script 4:
1. X = tips_df.drop([‘tip’], axis=1)
2. y = tips_df[“tip”]
Script 5:
1. X.head()
Output:
And the following script prints the label set.
Script 6:
1. y.head()
Output:
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
Name: tip, dtype: float64
Machine learning algorithms, for the most part, can only work
with numbers. Therefore, it is important to convert categorical
data into a numeric format.
Script 7:
numerical = X.drop([‘sex’, ‘smoker’, ‘day’, ‘time’], axis = 1)
Script 8:
1. numerical.head()
Output:
Script 9:
Output:
One of the most common approaches to convert a categorical
column to a numeric one is via one-hot encoding. In one-hot
encoding, for every unique value in the original columns, a
new column is created. For instance, for sex, two columns, Female and Male, are created. If the original sex column contained Male, a 1 is added to the newly created Male column; if the original sex column contained Female, a 1 is added to the newly created Female column.
Script 10:
1. import pandas as pd
2. cat_numerical = pd.get_dummies(categorical,drop_first=True)
3. cat_numerical.head()
Output:
The final step is to join the numerical columns with the one-
hot encoded columns. To do so, you can use the concat()
function from the Pandas library as shown below:
Script 11:
The final dataset looks like this. You can see that it doesn’t
contain any categorical value.
Output:
Script 12:
Script 13:
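A sketch of these two steps, assuming the standard Sklearn train/test split and a feature scaler named sc (the scaler is reused later in Script 21; the variable X is assumed to hold the final feature set from Script 11), is:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split the data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=0)

# scale the features so that all columns have a similar range
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)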
The methods used to find the values of these metrics are available in the sklearn.metrics module. The predicted and actual values have to be passed to these methods, as shown in the script below.
Script 15:
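A sketch computing the common regression metrics with sklearn.metrics, assuming y_pred holds the predictions of a trained regressor on the test set, is:

import numpy as np
from sklearn import metrics

# compare the predicted values against the actual test labels
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))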
Output:
Script 16:
Output:
3. The random forest algorithm can be used when you have very
high dimensional data.
Output:
With the Sklearn library, you can use the SVR class from the svm module to implement the support vector regression algorithm, as shown below.
Script 18:
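A sketch of support vector regression with the SVR class (default hyperparameters are an assumption) is:

from sklearn import svm

# train a support vector regressor and make predictions on the test set
svm_reg = svm.SVR()
regressor = svm_reg.fit(X_train, y_train)
y_pred = regressor.predict(X_test)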
Script 19:
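A sketch of K-fold cross-validation, assuming five folds and the mean absolute error scoring mentioned below, is:

from sklearn.model_selection import cross_val_score

# five-fold cross-validation; scores are negated mean absolute errors
print(cross_val_score(regressor, X.values, y.values, cv=5, scoring='neg_mean_absolute_error'))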
Output:
The output shows the mean absolute error value for each of the K folds.
Script 20:
1. tips_df.loc[100]
The output shows that the value of the tip in the 100th record
in our dataset is 2.5.
Output:
total_bill 11.35
tip 2.5
sex Female
smoker Yes
day Fri
time Dinner
size 2
Name: 100, dtype: object
We will try to predict the value of the tip of the 100th record
using the random forest regressor algorithm and see what
output we get. Look at the script below:
Note that you have to scale your single record before it can be
used as input to your machine learning algorithm.
Script 21:
1. from sklearn.ensemble import RandomForestRegressor
2. rf_reg = RandomForestRegressor(random_state=42, n_estimators=500)
3. regressor = rf_reg.fit(X_train, y_train)
4.
5. single_record = sc.transform (X.values[100].reshape(1, -1))
6. predicted_tip = regressor.predict(single_record)
7. print(predicted_tip)
Output:
[2.2609]
Exercise 6.1
Question 1
B. Red
C. 2.5
D. None of the above
Question 2
B. KNN
C. SVM
D. Linear Regression
Question 3
B. Recall
C. F1 Measure
D. All of the above
Exercise 6.2
Using the Diamonds dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
Solving Classification Problems in Machine
Learning Using Sklearn Library
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
Script 2:
Output:
Script 3:
Script 4:
1. X = churn_df.drop([‘Exited’], axis=1)
2. y = churn_df[‘Exited’]
The following script prints the first five rows of the feature set.
Script 5:
1. X.head()
Output:
And the following script prints the first five rows of the label
set, as shown below:
Script 6:
1. y.head()
Output:
0 1
1 0
2 1
3 0
4 0
Name: Exited, dtype: int64
Script 7:
Script 8:
1. numerical.head()
Output:
Next, create a dataframe that contains categorical values only.
You can do so by using the filter() function as shown below:
Script 9:
Output:
Script 10:
1. import pandas as pd
2. cat_numerical = pd.get_dummies(categorical,drop_first=True)
3. cat_numerical.head()
Output:
Script 11:
Output:
Script 12:
Script 13:
Script 14:
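A sketch covering these three steps (train/test split, feature scaling, and logistic regression training; variable names are assumptions carried over from the earlier scripts) is:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# split the data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=0)

# scale the features
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# train a logistic regression classifier and predict on the test set
lr_clf = LogisticRegression()
classifier = lr_clf.fit(X_train, y_train)
y_pred = classifier.predict(X_test)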
True Positive: True positives are those labels that are actually true and also predicted as true by the model.
False Negative: False negatives are those labels that are actually true but predicted as false by the model.
Confusion Matrix
A confusion matrix summarizes the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) produced by a classifier on a test set.
Precision
Precision is calculated by dividing true positives by the sum of the true positives and false positives: Precision = TP / (TP + FP).
Recall
Recall is calculated by dividing true positives by the sum of the true positives and false negatives: Recall = TP / (TP + FN).
F1 Measure
The F1 measure is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
Accuracy
Accuracy is the ratio of correct predictions to the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
The methods used to find the values of these metrics are available in the sklearn.metrics module. The predicted and actual values have to be passed to these methods, as shown in the script below.
Script 15:
Output:
The output shows that for 81 percent of the records in the test
set, logistic regression correctly predicted whether or not a
customer will leave the bank.
Script 16:
Output:
Further Readings – KNN Classification
To study more about KNN classification, please check these
links:
1. https://bit.ly/33pXWIj
2. https://bit.ly/2FqNmZx
Script 17:
1. from sklearn.ensemble import RandomForestClassifier
2. rf_clf = RandomForestClassifier(random_state=42, n_estimators=500)
3.
4. classifier = rf_clf.fit(X_train, y_train)
5.
6. y_pred = classifier.predict(X_test)
7.
8.
9. from sklearn.metrics import classification_report, confusion_matrix,
accuracy_score
10.
11. print(confusion_matrix(y_test,y_pred))
12. print(classification_report(y_test,y_pred))
13. print(accuracy_score(y_test, y_pred))
Output:
With the Sklearn library, you can use the SVM module to
implement the support vector classification algorithm, as
shown below. The SVC class from the SVM module is used to
implement the support vector classification, as shown below:
Script 18:
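A sketch using the SVC class, mirroring the evaluation pattern of Script 17 (default hyperparameters are an assumption), is:

from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# train a support vector classifier and evaluate it on the test set
svm_clf = svm.SVC()
classifier = svm_clf.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))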
Output:
Further Readings – SVM Classification
To study more about SVM classification, please check these
links:
1. https://bit.ly/3hr4jAi
2. https://bit.ly/3iF0gln
Script 19:
Output:
[0.796 0.796 0.7965 0.7965 0.7965]
Script 20:
1. churn_df.loc[100]
Output:
CreditScore 665
Geography France
Gender Female
Age 40
Tenure 6
Balance 0
NumOfProducts 1
HasCrCard 1
IsActiveMember 1
EstimatedSalary 161848
Exited 0
Name: 100, dtype: object
The output above shows that the customer did not exit the
bank after six months since the value for the Exited attribute is
0. Let’s see what our classification model predicts:
Script 21:
Output:
[0]
Exercise 7.1
Question 1
B. Red
C. Male
D. None of the above
Question 2
B. F1
C. Precision
D. Recall
Question 3
B. pd.get_dummies()
C. pd.get_numeric()
D. All of the above
Exercise 7.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice, which predicts the
species of the iris plant. Perform all the preprocessing steps.
Data Clustering with Machine Learning
Using Sklearn Library
3. Assign the data point to the cluster of the centroid with the shortest distance.
5. Repeat steps 2–4 until the new centroid values for all the clusters are the same as the previous centroid values.
Script 1:
1. import numpy as np
2. import pandas as pd
3. from sklearn.datasets import make_blobs
4. from sklearn.cluster import KMeans
5. from matplotlib import pyplot as plt
6. %matplotlib inline
Script 2:
The output looks like this. Using K Means clustering, you will see how to create four clusters in this dataset.
Output:
Note:
It is important to mention that dummy data is generated
randomly, and hence, you can have a slightly different plot
than the plot in the above figure.
Script 3:
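A sketch that trains a K Means model with four clusters on the dummy data (the variable name features is assumed from the data-generation step) is:

# train the K Means model with 4 clusters
km_model = KMeans(n_clusters=4)
km_model.fit(features)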
Once the model is trained, you can print the cluster centers using the cluster_centers_ attribute of the KMeans class object.
Script 4:
Output:
[[-4.54070231 7.26625699]
[ 0.10118215 -0.23788283]
[ 2.57107155 8.17934929]
[-0.38501161 3.11446039]]
To print the cluster ids for all the data points, you can use the labels_ attribute of the KMeans class, as shown below.
Script 5:
Output:
[0 2 3 2 1 1 3 1 2 0 0 2 3 3 1 1 2 0 1 2 2 1 3 3 1 1 0 2 0 2 0 1 0 1 3 2
2 3 0 0 0 2 1 2 0 1 3 1 3 2 1 3 3 1 0 2 1 3 0 0 3 3 3 1 1 1 3 0 1 3 2
1 1 2 0 2 1 2 1 0 0 2 1 2 1 0 2 0 0 2 2 3 3 0 2 0 2 3 0 0 3 1 0 3 2 1
3 2 2 0 2 1 1 0 0 3 3 2 3 1 0 0 3 0 1 0 3 1 0 3 2 0 1 1 0 2 1 2 2 0 3
1 3 3 0 1 1 0 2 0 0 0 3 3 3 3 0 3 1 2 1 0 3 2 3 1 3 3 0 3 2 3 0 1 3 2
3 2 1 2 2 3 0 3 2 0 3 0 1 2 2 3 2 2 1 0 1 1 2 3 2 0 1 3 3 3 3 0 0 3 1
0 1 1 3 3 1 3 1 0 0 2 1 1 1 1 2 2 0 2 1 0 1 2 3 0 1 2 0 1 1 0 1 0 3 1
2 1 1 2 3 0 0 1 3 1 2 0 1 1 0 1 0 0 2 2 0 1 2 0 1 2 0 0 1 1 0 1 2 3 0
1 2 3 0 0 3 2 3 0 3 1 3 1 3 0 1 3 3 1 1 2 2 2 3 1 1 3 1 3 3 0 1 1 2 0
2 2 3 1 0 3 2 1 0 2 3 1 0 2 0 0 3 1 1 2 3 3 1 2 2 3 0 3 3 3 1 0 2 0 0
3 1 1 0 1 0 3 1 3 1 0 0 1 3 1 2 0 0 0 1 1 0 0 2 0 0 2 2 3 2 3 3 3 0 3
1 1 1 1 3 1 1 1 2 3 0 2 3 3 1 1 3 3 3 3 3 0 0 3 2 0 3 2 1 1 3 2 1 2 1
1 1 3 3 2 3 1 1 1 2 0 2 1 1 0 0 3 1 2 3 0 2 0 2 0 2 3 3 2 2 0 0 2 0 0
0 1 3 2 2 1 1 2 1 1 0 1 2 1 0 0 2 2 0 3 3 0 0 2 1 3 2 0 3 3 1 2 1 1 3
0 3 3 0 0 1 2 3 1]
Script 6:
Output:
The following script prints the actual four clusters in the
dataset.
Script 7:
Output:
Note:
Script 8:
Output:
We do not use data labels for clustering. Hence, we will
separate features from labels. Execute the following script to
do so:
Script 9:
Output:
Script 11:
1. print(km_model.labels_)
Output:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2
3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 0 2 0 0 0 0
3 0 0 0 2 2 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 2 0 0 0 2 0 0
0 2 0 0 0 2 2 0 2]
Script 12:
Output:
Until now, in this chapter, we have been randomly initializing the value of K, i.e., the number of clusters. However, there is a way to find the ideal number of clusters: the elbow method. In the elbow method, the value of inertia obtained by training K Means models with different numbers of clusters (K) is plotted.
Script 13:
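A sketch of the elbow method, training K Means with K from 1 to 10 on the Iris feature set and plotting the inertia values (variable names are assumptions), is:

# train K Means with 1 to 10 clusters and record the inertia for each K
loss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k)
    km.fit(features)
    loss.append(km.inertia_)

plt.plot(range(1, 11), loss)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')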
From the output below, it can be seen that the value of inertia
didn’t decrease much after 3 clusters.
Output:
Let’s now cluster the Iris data using 3 clusters and see if we
can get close to the actual clusters.
Script 14:
Output:
Let’s now plot the actual clusters and see how close the actual
clusters are to predicted clusters.
Script 16:
The output shows that the actual clusters are pretty close to
predicted clusters.
Output:
Example 1
Script 17:
1. import numpy as np
2. import pandas as pd
3. from sklearn.datasets import make_blobs
4. from matplotlib import pyplot as plt
5. %matplotlib inline
Script 18:
Output:
Output:
From the figure above, it can be seen that points 1 and 5 are
closest to each other. Hence, a cluster is formed by
connecting these points. The cluster of 1 and 5 is closest to
data point 10, resulting in a cluster containing points 1, 5, and
10. In the same way, the remaining clusters are formed until a
big cluster is formed.
Script 20:
Output:
Script 21:
Output:
Example 2
Script 22:
Output:
Script 23:
1. # performing hierarchical clustering using the AgglomerativeClustering class
2. hc_model = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
3. hc_model.fit_predict(features)
The output shows the labels of some of the data points in our
dataset. You can see that since there are 4 clusters, there are
4 unique labels, i.e., 0, 1, 2, and 3.
Output:
array([0, 1, 1, 0, 1, 0, 3, 0, 0, 1, 0, 0, 1, 3, 0, 2, 0, 3, 1, 0, 0,
0,], dtype=int64)
Script 24:
Output:
Similarly, to plot the actual clusters in the dataset (for the sake
of comparison), execute the following script.
Script 25:
Output:
8.2.2. Clustering the Iris Dataset
In this section, you will see how to cluster the Iris dataset
using hierarchical agglomerative clustering. The following
script imports the Iris dataset and displays the first five rows
of the dataset.
Script 26:
Output:
The following script divides the data into features and labels
sets and displays the first five rows of the labels set.
Script 27:
Output:
Script 28:
1. # training Hierarchical clustering model
2. from sklearn.cluster import AgglomerativeClustering
3.
4. # training agglomerative clustering model
5. features = features.values
6. hc_model = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
7. hc_model.fit_predict(features)
The output below shows the predicted cluster labels for the
feature set in the Iris dataset.
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2,
2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0,
0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0],
dtype=int64)
Script 29:
Output:
You can also create dendrograms from the feature set using the shc module (imported from the scipy.cluster.hierarchy library). You have to pass the feature set to the linkage() function of the shc module, and the object returned by linkage() is then passed to the dendrogram() function to plot the dendrogram, as shown in the following script.
Script 30:
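A sketch that follows the description above (the figure size and title are illustrative assumptions) is:

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Iris Dendrogram")
dend = shc.dendrogram(shc.linkage(features, method='ward'))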
Output:
If you want to cluster the dataset into three clusters, you can
simply draw a horizontal line that passes through the three
vertical lines, as shown below. The clusters below the
horizontal line are the resultant clusters. In the following
figure, we form three clusters.
Hands-on Time – Exercise
Now, it is your turn. Follow the instructions in the exercises
below to check your understanding of the clustering
algorithms in machine learning. The answers to these
exercises are provided after chapter 10 in this book.
Exercise 8.1
Question 1
B. Hierarchical Clustering
Question 2
Question 3
B. vertical, horizontal
Exercise 8.2
Apply KMeans clustering on the banknote.csv dataset
available in the Datasets folder in the GitHub repository. Find
the optimal number of clusters and then print the clustered
dataset. The following script imports the dataset and prints
the first five rows of the dataset.
Deep Learning with Python TensorFlow 2.0
In our neural network, we will first find the value of zh1, which
can be calculated as follows:
In the same way, you find the values of ah2, ah3, and ah4.
To find the value of zo, you can use the following formula:
9.1.2. Backpropagation
Our weights are divided into two parts. We have weights that
connect input features to the hidden layer and the hidden
layer to the output node. We call the weights that connect the
input to the hidden layer collectively as wh (w1, w2, w3 ……
w8), and the weights connecting the hidden layer to the
output as wo (w9, w10, w11, w12).
and,
Using equations 10, 12, and 13 in equation 9, you can find the value of dcost/dwh.
Script 1:
Script 2:
1. import tensorflow as tf
2. print(tf.__version__)
Output:
2.1.0
Script 3:
1. import seaborn as sns
2. import pandas as pd
3. import numpy as np
4. from tensorflow.keras.layers import Dense, Dropout, Activation
5. from tensorflow.keras.models import Model, Sequential
6. from tensorflow.keras.optimizers import Adam
Script 4:
The following script plots the first five rows of the dataset.
Script 5:
1. banknote_data.head()
Output:
The output shows that our dataset contains five columns. Let’s
see the shape of our dataset.
Script 6:
1. banknote_data.shape
The output shows that our dataset has 1372 rows and 5
columns.
Output:
(1372, 5)
Script 7:
1. sns.countplot(x='Target', data=banknote_data)
Output:
The output shows that the number of fake notes (represented
by 1) is slightly less than the number of original banknotes.
Script 8:
1. X = banknote_data.drop([‘Target’], axis=1).values
2. y = banknote_data[[‘Target’]].values
3.
4. print(X.shape)
5. print(y.shape)
Output:
(1372, 4)
(1372, 1)
The variable X contains our feature set while the variable y
contains target labels.
Script 9:
Script 10:
Script 11:
1. dropout_rate = 0.1
2. epochs = 20
3. batch_size = 4
4. learn_rate = 0.001
Script 12:
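A sketch matching the model summary described below (an input of 4 features, dense layers of 12, 6, and 1 nodes, with a dropout layer after each hidden dense layer; the activation functions are assumptions), built with the Sequential API imported in Script 3, is:

# define a densely connected neural network for binary classification
model = Sequential()
model.add(Dense(12, input_dim=4, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(6, activation='relu'))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=learn_rate), metrics=['accuracy'])
model.summary()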
Script 13:
Output:
From the above output, you can see that the input layer contains four nodes; the input to the first dense layer is 4, while its output is 12. Similarly, the input to the second dense layer is 12, while the output is 6. Finally, in the last dense layer, the input is 6 nodes, while the output is 1 since we are performing binary classification. You can also see a dropout layer after each dense layer.
To train the model, you need to call the fit method on the model object. The fit method takes the training features and targets as parameters, along with the batch size, the number of epochs, and the validation split. The validation split is the fraction of the training data that is held out to evaluate the model during training.
Script 14:
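A sketch of the training call, assuming a 20 percent validation split and the hyperparameter variables defined in Script 11, is:

# train the model; model_history is used later to plot accuracy and loss curves
model_history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, verbose=1)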
Output:
Script 15:
1. accuracies = model.evaluate(X_test, y_test, verbose=1)
2. print(“Test Score:”, accuracies[0])
3. print(“Test Accuracy:”, accuracies[1])
Output:
Let’s now plot the accuracy on the training and test sets to
see if our model is overfitting or not.
Script 16:
1. import matplotlib.pyplot as plt
2. plt.plot(model_history.history['accuracy'], label='accuracy')
3. plt.plot(model_history.history['val_accuracy'], label='val_accuracy')
4. plt.legend(['train', 'test'], loc='lower left')
Output:
The above curves meet near 1 and then become stable, which shows that our model is not overfitting.
Similarly, the loss values for test and training sets can be
printed as follows:
Script 17:
Output:
And this is it. You have successfully trained a neural network
for classification. In the next section, you will see how to
create and train a recurrent neural network for stock price
prediction.
§ What Is an RNN?
A recurrent neural network is a type of neural network that is
used to process data that is sequential in nature, e.g., stock
price data, text sentences, or sales of items.
Here, we have a single neuron with one input and one output. On the right side, the process followed by a recurrent neural network is unfolded. You can see that at time step t, the input Xt is multiplied by the weight vector U, while the previous output St-1 is multiplied by the weight vector W. The sum of these two products, XtU + St-1W, becomes the output St at time step t. This is how a recurrent neural network captures the sequential information.
RNN can easily guess that the missing word is “Clouds” here.
Here, the RNN can only guess that the missing word is
“French” if it remembers the first sentence, i.e., “Mike grew up
in France.”
§ What Is an LSTM?
LSTM is a type of RNN that is capable of remembering longer sequences, and hence, it is one of the most commonly used RNN variants for sequence tasks.
§ Cell State
The cell state in LSTM is responsible for remembering a long
sequence. The following figure describes the cell state:
The cell state contains data from all the previous cells in the
sequence. The LSTM is capable of adding or removing
information to a cell state. In other words, LSTM tells the cell
state which part of previous information to remember and
which information to forget.
§ Forget Gate
The forget gate basically tells the cell state which information
to retain from the information in the previous step and which
information to forget. The working and calculation formula for
the forget gate is as follows:
§ Input Gate
The forget gate decides which information to remember and
which to forget. The input gate, on the other hand, is
responsible for updating or adding any new information to the
cell state. The input gate has two parts: a sigmoid layer, which
decides which values of the cell state are to be updated, and a
tanh layer, which creates a vector of new candidate values that
can be added to the cell state. The working of the input gate is
explained in the following figure:
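The figure itself is not reproduced here. In the standard LSTM formulation, the two parts of the input gate compute:

it = σ(Wi · [ht−1, xt] + bi)
C̃t = tanh(WC · [ht−1, xt] + bC)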
§ Update Gate
The forget gate tells us what to forget, and the input gate tells
us what to add to the cell state. The next step is to actually
perform these two operations. The update gate is basically
used to perform these two operations. The functioning and
the equations for the update gate are as follows:
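The figure itself is not reproduced here. In the standard formulation, the new cell state Ct combines the two results:

Ct = ft * Ct−1 + it * C̃t

The forget gate ft scales down the old cell state, and the input gate it scales the new candidate values C̃t before they are added.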
§ Output Gate
Finally, you have the output gate, which outputs the hidden
state and the output, just like a common recurrent neural
network. The additional output from an LSTM node is a cell
state, which runs between all the nodes in a sequence. The
equations and the functioning of the output gate are depicted
by the following figure:
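The figure itself is not reproduced here. In the standard formulation, the output gate computes:

ot = σ(Wo · [ht−1, xt] + bo)
ht = ot * tanh(Ct)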
In the following sections, you will see how to use LSTM for
solving different types of sequence problems. We will train an
LSTM to predict the opening stock price of the Facebook
company. The historical stock price data can be downloaded
from Yahoo Finance:
https://finance.yahoo.com/quote/FB/history?p=FB.
The test data will consist of the opening stock prices of the
Facebook company for the month of January 2020. The
training file fb_train.csv and the test file fb_test.csv are also
available in the Datasets folder in the GitHub repository. Let’s
begin with the coding now.
Script 18:
Script 19:
Script 20:
1. # importing libraries
2. import pandas as pd
3. import numpy as np
4.
5. #importing dataset
6. fb_complete_data = pd.read_csv("/gdrive/My Drive/datasets/fb_train.csv")
Running the following script will print the first five rows of the
dataset.
Script 21:
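The script body is not reproduced above; presumably it simply calls the head() method on the dataframe:
1. fb_complete_data.head()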
Output:
Script 22:
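Script 22 is not reproduced above. Since Script 23 below operates on a variable named fb_training_processed, the script presumably extracts the Open column as the training feature, along these lines:
1. # keeping only the Open column for training (a sketch)
2. fb_training_processed = fb_complete_data[['Open']].values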
Script 23:
1. #scaling features
2. from sklearn.preprocessing import MinMaxScaler
3. scaler = MinMaxScaler(feature_range = (0, 1))
4.
5. fb_training_scaled = scaler.fit_transform(fb_training_processed)
If you check the total length of the dataset, you will see it has
1257 records, as shown below:
Script 24:
1. len(fb_training_scaled)
Output:
1257
Script 25:
Script 26:
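Scripts 25 and 26 are not reproduced above. A sketch that creates the 60-timestep training sequences, mirroring the test-set loop shown later in Script 37, might look like this:
1. fb_training_features = []
2. fb_training_labels = []
3.
4. for i in range(60, len(fb_training_scaled)):
5.     fb_training_features.append(fb_training_scaled[i-60:i, 0])
6.     fb_training_labels.append(fb_training_scaled[i, 0])
7.
8. X_train = np.array(fb_training_features)
9. y_train = np.array(fb_training_labels)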
Script 27:
1. print(X_train.shape)
2. print(y_train.shape)
Output:
(1197, 60)
(1197,)
Script 28:
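Script 28 is not reproduced above; given the shape printed by Script 31, it presumably converts the training features into the three-dimensional format expected by the LSTM:
1. X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)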
Script 29:
1. #importing libraries
2. import numpy as np
3. import matplotlib.pyplot as plt
4. from tensorflow.keras.layers import Input, Activation, Dense, Flatten,
Dropout, LSTM
5. from tensorflow.keras.models import Model
Script 30:
Script 31:
1. print(X_train.shape)
2. print(y_train.shape)
3. y_train = y_train.reshape(-1,1)
4. print(y_train.shape)
Output:
(1197, 60, 1)
(1197,)
(1197, 1)
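Script 30 (the model definition) and the training script are not reproduced above. A minimal sketch, assuming a single LSTM layer built with the functional API imported in Script 29 and a mean-squared-error loss, might look like this:
1. # defining the LSTM model; the layer size is an assumption
2. input_layer = Input(shape=(X_train.shape[1], 1))
3. lstm_layer = LSTM(100, activation='relu')(input_layer)
4. output_layer = Dense(1)(lstm_layer)
5.
6. model = Model(input_layer, output_layer)
7. model.compile(optimizer='adam', loss='mse')
8.
9. # a batch size of 32 is consistent with the 38 steps per epoch shown below
10. model_history = model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=1)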
You can see the results for the last five epochs in the output.
Output:
Epoch 96/100
38/38 [==============================] - 11s 299ms/step - loss: 0.0018
Epoch 97/100
38/38 [==============================] - 11s 294ms/step - loss: 0.0019
Epoch 98/100
38/38 [==============================] - 11s 299ms/step - loss: 0.0018
Epoch 99/100
38/38 [==============================] - 12s 304ms/step - loss: 0.0018
Epoch 100/100
38/38 [==============================] - 11s 299ms/step - loss: 0.0021
Our model has been trained. Next, we will test our stock
prediction model on the test data.
The test data should also be converted into the right shape to
test our stock prediction model. We will do that later. Let’s
first import the data and then remove all the columns from the
test data except the Open column.
Script 33:
Script 34:
1. fb_all_data = pd.concat((fb_complete_data['Open'],
fb_testing_complete_data['Open']), axis=0)
Script 35:
You can see that the length of the input data is 80. Here, the
first 60 records are the last 60 records from the training data,
and the last 20 records are the 20 records from the test file.
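Script 35 itself is not reproduced above. A sketch that extracts these 80 values from the concatenated series might look like this:
1. test_inputs = fb_all_data[len(fb_all_data) - len(fb_testing_complete_data) - 60:].values
2. print(test_inputs.shape)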
Output:
(80,)
Script 36:
1. test_inputs = test_inputs.reshape(-1,1)
2. test_inputs = scaler.transform(test_inputs)
3. print(test_inputs.shape)
Output:
(80, 1)
Script 37:
1. fb_test_features = []
2. for i in range(60, 80):
3.     fb_test_features.append(test_inputs[i-60:i, 0])
Script 38:
1. X_test = np.array(fb_test_features)
2. print(X_test.shape)
Output:
(20, 60)
Script 39:
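Script 39 is not reproduced above; presumably it reshapes the test features into the three-dimensional format expected by the LSTM:
1. X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
2. print(X_test.shape)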
Output:
(20, 60, 1)
Script 40:
Script 41:
Script 42:
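Scripts 40 through 42 are not reproduced above. A sketch that makes predictions, converts them back to the original price scale, and plots them against the actual prices might look like this:
1. # predictions come out in the scaled range; convert them back
2. y_pred = model.predict(X_test)
3. y_pred = scaler.inverse_transform(y_pred)
4.
5. # plotting predicted vs. actual opening prices
6. plt.plot(fb_testing_complete_data['Open'].values, color='red', label='Actual')
7. plt.plot(y_pred, color='green', label='Predicted')
8. plt.legend()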
Output:
The output shows that our algorithm has been able to partially
capture the trend of the future opening stock prices for
Facebook data.
Look at the following image. Here, in the 3rd and 4th rows and
1st and 2nd columns, we have four values 1, 0, 1, and 4. When
we apply max pooling on these four pixels, the maximum
value will be chosen, i.e., you can see 4 in the pooled feature
map.
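As a quick sanity check (a sketch, not one of the book's scripts), max pooling over those four values can be reproduced with NumPy:
1. import numpy as np
2.
3. patch = np.array([[1, 0], [1, 4]])
4. print(patch.max())   # prints 4, the value kept in the pooled feature map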
§ Flattening and Fully Connected Layer
The pooled feature maps are flattened to form a one-
dimensional vector, which is then fed to a fully connected layer
that learns the final classification from the extracted features,
as shown in the following figure:
In this section, you will see how to implement a CNN for image
classification in TensorFlow Keras. We will create a CNN that is
able to classify images of fashion items, such as shirts,
trousers, and sandals, into one of 10 predefined categories. So,
let's begin without much ado.
Execute the following script to make sure that you are running
the latest version of TensorFlow.
Script 43:
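Script 43 is not reproduced above; presumably it imports TensorFlow and prints the installed version:
1. import tensorflow as tf
2. print(tf.__version__)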
Output:
2.3.0
Script 45:
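Script 45 is not reproduced above. A sketch that loads the Fashion MNIST dataset into the variable names used by Script 46 might look like this:
1. fashion_mnist = tf.keras.datasets.fashion_mnist
2. (training_images, training_labels), (test_images, test_labels) = fashion_mnist.load_data()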
Script 46:
1. #scaling images
2. training_images, test_images = training_images/255.0,
test_images/255.0
Script 47:
1. print(training_images.shape)
Output:
Script 48:
Output:
The output shows that the 9th image in our test set is the
image of a sneaker.
Script 49:
Output:
Script 50:
Output:
Script 51:
1. training_images[0].shape
Output:
(28, 28, 1)
The shape of a single image is (28, 28, 1). This shape will be
used to train our convolutional neural network. The following
script creates a model for our convolutional neural network.
Script 52:
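Script 52 is not reproduced above. A minimal sketch of such a model, assuming a single convolution and pooling stage (the layer sizes and the use of the Sequential API are assumptions), might look like this:
1. from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
2. from tensorflow.keras.models import Sequential
3.
4. model = Sequential([
5.     Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
6.     MaxPooling2D((2, 2)),
7.     Flatten(),
8.     Dense(128, activation='relu'),
9.     Dense(10, activation='softmax')
10. ])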
Script 53:
Script 54:
1. from tensorflow.keras.utils import plot_model
2. plot_model(model, to_file='model_plot1.png', show_shapes=True,
show_layer_names=True)
Output:
The following script trains the image classification model.
Script 55:
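Script 55 is not reproduced above. A sketch that compiles and trains the model, assuming sparse categorical cross-entropy loss and an accuracy metric (so that the history keys used in Script 56 exist), might look like this:
1. model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
2.
3. # the epoch count is an assumption
4. model_history = model.fit(training_images, training_labels, epochs=10,
validation_data=(test_images, test_labels))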
The results from the last five epochs are shown in the output.
Output:
Let’s plot the training and test accuracies for our model.
Script 56:
1. #plotting accuracy
2. import matplotlib.pyplot as plt
3.
4. plt.plot(model_history.history['accuracy'], label = 'accuracy')
5. plt.plot(model_history.history['val_accuracy'], label =
'val_accuracy')
6. plt.legend(['train','test'], loc='lower left')
Output:
Let’s make a prediction on one of the images in the test set.
Let’s predict the label for image 9. We know that image 9
contains a sneaker, as we saw earlier by plotting the image.
Script 57:
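Script 57 is not reproduced above. A sketch that predicts the label of image 9 might look like this:
1. import numpy as np
2.
3. # reshape to a batch of one image and pick the most probable class
4. output = model.predict(test_images[9].reshape(1, 28, 28, 1))
5. print(np.argmax(output))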
Output:
Exercise 9.1
Question 1
Question 2
Question 3
B. Non-linearity
C. Quadraticity
D. None of the above
Exercise 9.2
Using the CIFAR-10 image dataset, perform image
classification to recognize the images. Here is the dataset:
1. cifar_dataset = tf.keras.datasets.cifar10
Dimensionality Reduction with PCA and
LDA Using Sklearn
§ Disadvantages of PCA
There are two major disadvantages of PCA:
1. You need to standardize the data before you apply PCA
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
The following script imports the Iris dataset using the Seaborn
library and prints the first five rows of the dataset.
Script 2:
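Script 2 is not reproduced above; presumably it mirrors the dataset loading used elsewhere in the book:
1. iris_df = sns.load_dataset("iris")
2. iris_df.head()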
Output:
The following script divides the data into the features and
labels sets.
Script 3:
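Script 3 is not reproduced above. A sketch of the split, following the pattern used in the Exercise 7.2 solution, might look like this:
1. X = iris_df.drop(['species'], axis=1)
2. y = iris_df['species']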
Finally, both the training and test sets should be scaled before
PCA could be applied to them.
Script 5:
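Script 5 is not reproduced above; the scaling presumably mirrors Script 15 later in this chapter:
1. from sklearn.preprocessing import StandardScaler
2. sc = StandardScaler()
3. X_train = sc.fit_transform(X_train)
4. X_test = sc.transform(X_test)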
To apply PCA via Sklearn, all you have to do is import the PCA
class from the sklearn.decomposition module. Next, to apply
PCA to the training set, pass the training set to the
fit_transform() method of the PCA class object. To apply PCA
on the test set, pass the test set to the transform() method of
the PCA class object. This is shown in the following script.
Script 6:
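Script 6 is not reproduced above. A sketch that follows the API just described might look like this:
1. from sklearn.decomposition import PCA
2.
3. pca = PCA()
4. X_train = pca.fit_transform(X_train)
5. X_test = pca.transform(X_test)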
Script 7:
Output:
Script 8:
Script 9:
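Scripts 8 and 9 are not reproduced above. A sketch that keeps two principal components and evaluates a classifier (logistic regression is an assumption; the book may use a different model) might look like this:
1. pca = PCA(n_components=2)
2. X_train = pca.fit_transform(X_train)
3. X_test = pca.transform(X_test)
4.
5. from sklearn.linear_model import LogisticRegression
6. lg = LogisticRegression()
7. lg.fit(X_train, y_train)
8. y_pred = lg.predict(X_test)
9.
10. from sklearn.metrics import accuracy_score
11. print(accuracy_score(y_test, y_pred))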
Output:
0.8666666666666667
The output shows that even with two features, the accuracy
for correctly predicting the label for the iris plant is 86.66
percent.
Finally, with two features, you can easily visualize the dataset
using the following script.
Script 10:
1. from matplotlib import pyplot as plt
2. %matplotlib inline
3.
4. #print actual datapoints
5.
6. plt.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap='rainbow')
Output:
§ Disadvantages of LDA
There are three major disadvantages of LDA:
1. Not able to detect correlated features
Script 11:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
Script 12:
1. #importing dataset
2. banknote_df = pd.read_csv(r"E:\Hands on Python for Data Science and
Machine Learning\Datasets\banknote.csv")
3.
4. #displaying dataset header
5. banknote_df.head()
Output:
Script 13:
Finally, the following script divides the data into training and
test sets.
Script 14:
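Scripts 13 and 14 are not reproduced above. A sketch, assuming the label column in banknote.csv is named Target (a guess, since the printed header is not shown here; adjust the name to match banknote_df.head()), might look like this:
1. # 'Target' is an assumed column name
2. X = banknote_df.drop(['Target'], axis=1)
3. y = banknote_df['Target']
4.
5. from sklearn.model_selection import train_test_split
6. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, random_state=0)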
Like PCA, you need to scale the data before you can apply
LDA on it. The data scaling is performed in the following step.
Script 15:
1. #applying scaling on training and test data
2. from sklearn.preprocessing import StandardScaler
3. sc = StandardScaler()
4. X_train = sc.fit_transform(X_train)
5. X_test = sc.transform(X_test)
Script 16:
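Script 16 is not reproduced above. A sketch applying LDA (note that, unlike PCA, the fit step needs the training labels) might look like this:
1. from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
2.
3. lda = LDA()
4. X_train = lda.fit_transform(X_train, y_train)
5. X_test = lda.transform(X_test)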
Like PCA, you can find variance ratios for LDA using the
explained_variance_ratio attribute.
Script 17:
Output:
[1.]
The above output shows that even with one component, the
maximum variance can be achieved.
Script 18:
Script 19:
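Scripts 18 and 19 are not reproduced above. A sketch using logistic regression, as in the Exercise 10.2 solution, might look like this:
1. from sklearn.linear_model import LogisticRegression
2.
3. lg = LogisticRegression()
4. lg.fit(X_train, y_train)
5. y_pred = lg.predict(X_test)
6.
7. from sklearn.metrics import accuracy_score
8. print(accuracy_score(y_test, y_pred))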
Output:
0.9890909090909091
The output shows that even with a single feature, we are able
to correctly predict whether or not a banknote is fake with
98.90 percent accuracy.
Exercise 10.1
Question 1
Question 2
In PCA, dimensionality reduction depends upon the:
A. Feature set only
Question 3
B. Semi-Supervised
C. Supervised
D. Reinforcement
Exercise 10.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the
Datasets folder in the GitHub repository. Print the accuracy
using the two principal components. Also, plot the results on
the test set using the two principal components.
Exercises Solutions
Exercise 2.1
Question 1
B. While Loop
C. Both A and B
D. None of the above
Answer: A
Question 2
B. Double Value
Answer: C
Question 3
B. Out
C. Not In
D. Both A and C
Answer: D
Exercise 2.2
Print the table of integer 9 using a while loop:
1. j = 1
2. while j < 11:
3.     print("9 x " + str(j) + " = " + str(9*j))
4.     j = j + 1
Exercise 3.1
Question 1:
B. np.multiply(matrix1, matrix2)
C. np.elementwise(matrix1, matrix2)
D. None of the above
Answer: B
Question 2:
B. np.id(4,4)
C. np.eye(4,4)
D. All of the above
Answer: C
Question 3:
B. np.arange(4, 16, 3)
C. np.arange(4, 15,3)
D. None of the above
Answer: D
Exercise 3.2
Create a random NumPy array of five rows and four columns.
Using array indexing and slicing, display the items from row
three to end and column two to end.
Solution:
1. import numpy as np
2.
3. uniform_random = np.random.rand(5, 4)
4. print(uniform_random)
5. print("Result")
6. print(uniform_random[2:, 1:])
Exercise 4.1
Question 1
B. 1
C. 2
D. None of the above
Answer: B
Question 2
B. sort_rows()
C. sort_values()
D. sort_records()
Answer: C
Question 3
B. filter_columns()
C. apply_filter ()
D. None of the above
Answer: A
Exercise 4.2
Use the apply function to subtract 10 from the Fare column of
the Titanic dataset, without using the lambda expression.
Solution:
1. def subt(x):
2.     return x - 10
3.
4. updated_class = titanic_data.Fare.apply(subt)
5. updated_class.head()
Exercise 5.1
Question 1
B. barh()
C. bar_horizontal()
D. horizontal_bar()
Answer: B
Question 2:
B. label
C. axis
D. All of the above
Answer: B
Question 3:
B. percentage = ‘%1.1f%%’
C. perc = ‘%1.1f%%’
D. None of the Above
Answer: A
Exercise 5.2
Plot two scatter plots on the same graph using the
tips_dataset. In the first scatter plot, display values from the
total_bill column on the x-axis and from the tip column on the
y-axis. The color of the first scatter plot should be green. In
the second scatter plot, display values from the total_bill
column on the x-axis and from the size column on the y-axis.
The color of the second scatter plot should be blue, and the
markers should be x.
Solution:
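The solution script is not reproduced above. A sketch matching the problem statement might look like this:
1. import matplotlib.pyplot as plt
2. import seaborn as sns
3.
4. tips_df = sns.load_dataset("tips")
5.
6. plt.scatter(tips_df['total_bill'], tips_df['tip'], c='green')
7. plt.scatter(tips_df['total_bill'], tips_df['size'], c='blue', marker='x')
8. plt.xlabel('total_bill')
9. plt.show()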
Output:
Exercise 6.1
Question 1
B. Red
C. 2.5
D. None of the above
Answer: C
Question 2
Which of the following algorithm is a lazy algorithm?
A. Random Forest
B. KNN
C. SVM
D. Linear Regression
Answer: B
Question 3
B. Recall
C. F1 Measure
D. All of the above
Answer: D
Exercise 6.2
Using the Diamonds dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
Solution:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4.
5. diamonds_df = sns.load_dataset("diamonds")
6.
7. X = diamonds_df.drop(['price'], axis=1)
8. y = diamonds_df["price"]
9.
10. numerical = X.drop(['cut', 'color', 'clarity'], axis = 1)
11.
12. categorical = X.filter(['cut', 'color', 'clarity'])
13.
14. cat_numerical = pd.get_dummies(categorical, drop_first=True)
15.
16. X = pd.concat([numerical, cat_numerical], axis = 1)
17.
18. from sklearn.model_selection import train_test_split
19.
20. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, random_state=0)
21.
22. from sklearn.preprocessing import StandardScaler
23. sc = StandardScaler()
24. X_train = sc.fit_transform(X_train)
25. X_test = sc.transform(X_test)
26.
27. from sklearn import svm
28. svm_reg = svm.SVR()
29. regressor = svm_reg.fit(X_train, y_train)
30. y_pred = regressor.predict(X_test)
31.
32.
33.
34. from sklearn import metrics
35.
36. print('Mean Absolute Error:', metrics.mean_absolute_error(y_test,
y_pred))
37. print('Mean Squared Error:', metrics.mean_squared_error(y_test,
y_pred))
38. print('Root Mean Squared Error:',
np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Exercise 7.1
Question 1
C. Male
D. None of the above
Answer: D
Question 2
B. F1
C. Precision
D. Recall
Answer: C
Question 3
B. pd.get_dummies()
C. pd.get_numeric()
D. All of the above
Answer: B
Exercise 7.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice, which predicts the
species of the iris plant. Perform all the preprocessing steps.
Solution:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4.
5. iris_df = sns.load_dataset("iris")
6.
7. iris_df.head()
8.
9. X = iris_df.drop(['species'], axis=1)
10. y = iris_df["species"]
11.
12.
13. from sklearn.model_selection import train_test_split
14.
15. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, random_state=0)
16.
17. from sklearn.preprocessing import StandardScaler
18. sc = StandardScaler()
19. X_train = sc.fit_transform(X_train)
20. X_test = sc.transform(X_test)
21.
22. from sklearn.ensemble import RandomForestClassifier
23. rf_clf = RandomForestClassifier(random_state=42, n_estimators=500)
24.
25. classifier = rf_clf.fit(X_train, y_train)
26.
27. y_pred = classifier.predict(X_test)
28.
29.
30. from sklearn.metrics import classification_report, confusion_matrix,
accuracy_score
31.
32. print(confusion_matrix(y_test,y_pred))
33. print(classification_report(y_test,y_pred))
34. print(accuracy_score(y_test, y_pred))
Exercise 8.1
Question 1
B. Hierarchical Clustering
Answer: D
Question 2
Answer: C
Question 3
B. vertical, horizontal
Exercise 8.2
Apply KMeans clustering on the banknote.csv dataset
available in the Datasets folder in the GitHub repository. Find
the optimal number of clusters and then print the clustered
dataset. The following script imports the dataset and prints
the first five rows of the dataset.
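The import script is not reproduced above; it presumably mirrors Script 12 from the PCA/LDA chapter (the local path is the one used there):
1. import pandas as pd
2.
3. banknote_df = pd.read_csv(r"E:\Hands on Python for Data Science and
Machine Learning\Datasets\banknote.csv")
4. banknote_df.head()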
Exercise 9.1
Question 1
B. Height, Width
Answer: D
Question 2:
Answer (C)
Question 3
B. Non-linearity
C. Quadraticity
D. None of the above
Answer: B
Exercise 9.2
Using the CIFAR-10 image dataset, perform image
classification to recognize the images. Here is the dataset:
1. cifar_dataset = tf.keras.datasets.cifar10
Solution:
Exercise 10.1
Question 1
Answer: C
Question 2
Answer: A
Question 3
B. Semi-Supervised
C. Supervised
D. Reinforcement
Answer: C
Exercise 10.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the
Datasets folder in the GitHub repository. Print the accuracy
using the two principal components. Also, plot the results on
the test set using the two principal components.
Solution:
1. import pandas as pd
2. import numpy as np
3.
4. churn_df = pd.read_csv(r"E:\Hands on Python for Data Science and
Machine Learning\Datasets\customer_churn.csv")
5. churn_df.head()
6.
7. churn_df = churn_df.drop(['RowNumber', 'CustomerId', 'Surname'],
axis=1)
8.
9. X = churn_df.drop(['Exited'], axis=1)
10. y = churn_df['Exited']
11.
12. numerical = X.drop(['Geography', 'Gender'], axis = 1)
13. categorical = X.filter(['Geography', 'Gender'])
14. cat_numerical = pd.get_dummies(categorical, drop_first=True)
15. X = pd.concat([numerical, cat_numerical], axis = 1)
16. X.head()
17.
18. from sklearn.model_selection import train_test_split
19.
20. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, random_state=0)
21.
22. #applying scaling on training and test data
23. from sklearn.preprocessing import StandardScaler
24. sc = StandardScaler()
25. X_train = sc.fit_transform(X_train)
26. X_test = sc.transform(X_test)
27.
28. #importing PCA class
29. from sklearn.decomposition import PCA
30.
31. #creating object of the PCA class
32. pca = PCA()
33.
34. #training PCA model on training data
35. X_train = pca.fit_transform(X_train)
36.
37. #making predictions on test data
38. X_test = pca.transform(X_test)
39.
40. #printing variance ratios
41. variance_ratios = pca.explained_variance_ratio_
42. print(variance_ratios)
43.
44. #selecting two principal components
45. from sklearn.decomposition import PCA
46.
47. pca = PCA(n_components=2)
48. X_train = pca.fit_transform(X_train)
49. X_test = pca.transform(X_test)
50.
51. #making predictions using logistic regression
52. from sklearn.linear_model import LogisticRegression
53.
54. #training the logistic regression model
55. lg = LogisticRegression()
56. lg.fit(X_train, y_train)
57.
58.
59. # Predicting the Test set results
60. y_pred = lg.predict(X_test)
61.
62. #evaluating results
63.
64. from sklearn.metrics import accuracy_score
65.
66. print(accuracy_score(y_test, y_pred))
67.
68. from matplotlib import pyplot as plt
69. %matplotlib inline
70.
71. #print actual datapoints
72.
73. plt.scatter(X_test[:,0], X_test[:,1], c=y_test, cmap='rainbow')