Python Machine Learning For Beginners Ebook Final
Edited by AI Publishing
eBook Converted and Cover by Gazler Studio
Published by AI Publishing LLC
ISBN-13: 978-1-7347901-5-3
Legal Notice:
You are not permitted to amend, use, distribute, sell, quote, or paraphrase
any part of the content within this book without the specific consent of
the author.
Disclaimer Notice:
Kindly note that the information contained within this document is
solely for educational and entertainment purposes. No warranties of
any kind are indicated or expressed. Readers accept that the author
is not providing any legal, professional, financial, or medical advice.
Kindly consult a licensed professional before trying out any techniques
explained in this book.
www.aipublishing.io/book-machine-learning-python
Preface
Book Approach
Who Is This Book For?
How to Use This Book?
§§ Book Approach
The book follows a very simple approach. It is divided into
10 chapters. The first five chapters of the book are dedicated
to data analysis and visualization, while the last five chapters
cover machine learning and statistical models for data
science. Chapter 1 provides a very brief introduction to data
science and machine learning, along with a roadmap for
learning them step by step. The process for environment
setup, including the installation of the software needed to run
Python scripts, is also explained in Chapter 1.
Requirements
This box lists all the requirements that need to be met before
proceeding to the next topic. Generally, it works as a checklist
to confirm that everything is ready before starting a tutorial.
Further Readings
Here, you will be pointed to external references or sources
that serve as additional content about the specific topic being
studied. In general, these consist of packages, documentation,
and cheat sheets.
Hands-on Time
Here, you will be pointed to an external file to practice and test
all the knowledge acquired about the tool that has been studied.
Generally, these files are Jupyter notebooks (.ipynb), Python
(.py) files, or documents (.pdf).
www.aipublishing.io/book-machine-learning-python
1
Introduction and
Environment Set Up
In this book, you will learn both Data Science and Machine
Learning. In the first five chapters, you will study the concepts
required to store, analyze, and visualize the datasets. From
the 6th chapter onwards, different types of machine learning
concepts are explained.
3. Run the executable file after the download is complete.
You will most likely find the downloaded file in your
Downloads folder. The name of the file should be similar
to “Anaconda3-5.1.0-Windows-x86_64.” The installation
wizard will open when you run the file, as shown in the
following figure. Click the Next button.
4. Now, click I Agree on the License Agreement dialog, as
shown in the following screenshot.
5. Check the Just Me radio button from the Select
Installation Type dialogue box. Click the Next button to
continue.
7. Go for the second option, Register Anaconda as my
default Python 3.7 in the Advanced Installation Options
dialogue box. Click the Install button to start the
installation, which can take some time to complete.
8. Click Next once the installation is complete.
10. You have successfully installed Anaconda on your
Windows. Excellent job. The next step is to uncheck
both checkboxes on the dialog box. Now, click on the
Finish button.
3. Run the installer file after the download is complete.
You will most likely find the downloaded file in your
Downloads folder. The name of the file should be similar
to “Anaconda3-5.1.0-MacOSX-x86_64.” The installation
wizard will open when you run the file, as shown in the
following figure. Click the Continue button.
5. The Important Information dialog will pop up. Simply
click Continue to go with the default version that is
Anaconda 3.
7. It is mandatory to read the license agreement and click
the Agree button before you can click the Continue
button again.
The system will prompt you to enter your password.
Use the same password you use to log in to your Mac
computer. Now, click on Install Software.
The next screen will display the message that the
installation has completed successfully. Click on the
Close button to close the installer.
2. The second step is to download the installer bash script.
Log into your Linux computer and open your terminal.
Now, go to the /tmp directory and download the bash
script from Anaconda’s home page using curl:
$ cd /tmp
$ curl -O https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh
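Once the download completes, run the installer script with bash (a minimal sketch of the standard Anaconda install command, assuming you are still in /tmp):

$ bash Anaconda3-5.2.0-Linux-x86_64.sh

The installer walks you through the license terms and then prompts for an installation location: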
[/home/tola/anaconda3] >>>
Another option for running Python scripts is Google Colab, which runs in the browser and offers free GPU support. To get started, visit:
https://colab.research.google.com/
Next, to run your code using GPU, from the top menu, select
Runtime -> Change runtime type, as shown in the following
screenshot:
You should see the following window. Here, from the dropdown
list, select GPU, and click the Save button.
With Google Colab, you can import datasets from your
Google Drive. Execute the following script, and click on the
link that appears, as shown below:
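The script itself is short; a minimal sketch using Colab's standard google.colab.drive module looks like this:

from google.colab import drive
drive.mount('/content/drive')

Running this cell prints an authorization link.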
Copy the authorization code that appears and paste it into the
empty field in the Google Colab cell, as shown below:
This way, you can import datasets from your Google drive to
your Google Colab environment.
In the next chapter, you will see how to write your first program
in Python, along with other Python programming concepts.
2
Python Crash Course
Script 1:
1. print("Welcome to Data Visualization with Python")
Output:
Welcome to Data Visualization with Python
Python supports the following basic data types:
a. Strings
b. Integers
c. Floating point numbers
d. Booleans
e. Lists
f. Tuples
g. Dictionaries
A variable is an alias for the memory address where actual
data is stored. The data or the values stored at a memory
address can be accessed and updated via the variable name.
Unlike other programming languages like C++, Java, and C#,
Python is loosely typed, which means that you don’t have to
define the data type while creating a variable. Rather, the type
of data is evaluated at runtime.
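For instance (a minimal illustration), the same variable name can be re-bound to values of different types, and Python simply updates the type at runtime:

x = 10       # x currently holds an int
x = "ten"    # now x holds a str

The following script creates variables of several different types and prints the type of each using the type() function.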
Script 2:
1. # A string Variable
2. first_name = "Joseph"
3. print(type(first_name))
4.
5. # An Integer Variable
6. age = 20
7. print(type(age))
8.
9. # A floating point variable
10. weight = 70.35
11. print(type(weight))
12.
13. # A boolean variable
14. married = False
15. print(type(married))
16.
17. #List
18. cars = ["Honda", "Toyota", "Suzuki"]
19. print(type(cars))
20.
21. #Tuples
22. days = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
23. print(type(days))
24.
25. #Dictionaries
26. days2 = {1:"Sunday", 2:"Monday", 3:"Tuesday", 4:"Wednesday", 5:"Thursday", 6:"Friday", 7:"Saturday"}
27. print(type(days2))
27. print(type(days2))
Output:
<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>
<class 'list'>
<class 'tuple'>
<class 'dict'>
§§ Arithmetic Operators
Arithmetic operators are used to perform arithmetic operations
in Python. The following table sums up the arithmetic operators
supported by Python. Suppose X = 20, and Y = 10.
Operator         Symbol   Functionality                                      Example
Addition         +        Adds the operands on either side                   X + Y = 30
Subtraction      -        Subtracts the right operand from the left          X - Y = 10
Multiplication   *        Multiplies the operands on either side             X * Y = 200
Division         /        Divides the left operand by the right              X / Y = 2.0
Modulus          %        Divides the left operand by the right and
                          returns the remainder                              X % Y = 0
Exponent         **       Raises the left operand to the power of
                          the right operand                                  X ** Y = 10240000000000 (1.024 × 10^13)
Script 3:
1. X = 20
2. Y = 10
3. print(X + Y)
4. print(X - Y)
5. print(X * Y)
6. print(X / Y)
7. print(X ** Y)
Output:
30
10
200
2.0
10240000000000
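Python also has a floor division operator (//), which is not listed in the table above; it divides and rounds the result down to the nearest whole number. A minimal example, reusing X = 20 and Y = 10:

print(X // Y)   # 2: 20 divided by 10, rounded down
print(7 // 2)   # 3: 3.5 rounded down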
§§ Logical Operators
Logical operators are used to perform logical AND, OR, and
NOT operations in Python. In the following example, X is True,
and Y is False.
Script 4:
1. X = True
2. Y = False
3. print(X and Y)
4. print(X or Y)
5. print(not(X and Y))
Output:
False
True
True
§§ Comparison Operators
Comparison operators, as the name suggests, are used to
compare two or more operands. Depending upon the relation
between the operands, comparison operators return Boolean
values. The following script demonstrates the comparison
operators in Python. Here, X is 20, and Y is 35.
Script 5:
1. X = 20
2. Y = 35
3.
4. print(X == Y)
5. print(X != Y)
6. print(X > Y)
7. print(X < Y)
8. print(X >= Y)
9. print(X <= Y)
Output:
False
True
False
True
False
True
§§ Assignment Operators
Assignment operators are used to assign values to variables.
The following script demonstrates the assignment operators
in Python. Here, X is 20, and Y is 10.
Script 6:
1. X = 20; Y = 10
2. R = X + Y
3. print(R)
4.
5. X = 20;
6. Y = 10
7. X += Y
8. print(X)
9.
10. X = 20;
11. Y = 10
12. X -= Y
13. print(X)
14.
15. X = 20;
16. Y = 10
17. X *= Y
18. print(X)
19.
20. X = 20;
21. Y = 10
22. X /= Y
23. print(X)
24.
25. X = 20;
26. Y = 10
27. X %= Y
28. print(X)
29.
30. X = 20;
31. Y = 10
32. X **= Y
33. print(X)
Output:
30
30
10
200
2.0
0
10240000000000
§§ Membership Operators
Membership operators are used to find if an item is a member
of a collection of items or not. There are two types of
membership operators: the in operator and the not in operator.
The following script shows the in operator in action.
Script 7:
1. days = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
2. print('Sunday' in days)
Output:
True
Similarly, the not in operator returns True if an item is not present in the collection:
Script 8:
1. days = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
2. print('Xunday' not in days)
Output:
True
§§ IF Statement
If you have to check for a single condition and you are not
concerned about the alternate condition, you can use the if
statement. For instance, if you want to check if 10 is greater
than 5 and, based on that, print a statement, you can use the if
statement. The condition evaluated by the if statement returns
a Boolean value. If the condition evaluated by the if statement
is true, the code block that follows the if statement executes.
It is important to mention that in Python, code blocks are
defined by indentation.
Script 9:
1. # The if statement
2.
3. if 10 > 5:
4.     print("Ten is greater than 5")
Output:
Ten is greater than 5
§§ IF-Else Statement
The if-else statement comes in handy when you want to execute
an alternate piece of code in case the condition for the if
statement returns false. For instance, in the following example,
the condition 5 > 10 will return false. Hence, the code block
that follows the else statement will execute.
Script 10:
1. # if-else statement
2.
3. if 5 > 10:
4.     print("5 is greater than 10")
5. else:
6.     print("10 is greater than 5")
Output:
10 is greater than 5
§§ IF-Elif Statement
The if-elif statement comes in handy when you have to evaluate
multiple conditions. For instance, in the following example,
we first check if 5 > 10, which evaluates to false. Next, an elif
statement evaluates the condition 8 < 4, which also returns
false. Hence, the code block that follows the last else statement
executes.
Script 11:
1. #if-elif and else
2.
3. if 5 > 10:
4.     print("5 is greater than 10")
5. elif 8 < 4:
6.     print("8 is smaller than 4")
7. else:
8.     print("5 is not greater than 10 and 8 is not smaller than 4")
Output:
5 is not greater than 10 and 8 is not smaller than 4
§§ For Loop
The for loop is used to iteratively execute a piece of code a
certain number of times. You should typically use a for loop
when you know the exact number of iterations or repetitions
for which you want to run your code. A for loop iterates over
a collection of items. In the following example, we create a
collection of five integers using the range() function. Next,
a for loop iterates five times, printing each integer in the
collection.
Script 12:
1. items = range(5)
2. for item in items:
3.     print(item)
Output:
0
1
2
3
4
§§ While Loop
The while loop keeps executing a certain piece of code as long
as the evaluation condition remains true. For instance, the
while loop in the following script keeps executing as long as
the variable c is less than 10.
Script 13:
1. c = 0
2. while c < 10:
3.     print(c)
4.     c = c + 1
Output:
0
1
2
3
4
5
6
7
8
9
2.6. Functions
In any programming language, functions are used to implement
pieces of code that need to be executed numerous times at
different locations in the code. In such cases, instead of writing
long pieces of code again and again, you can simply define a
function that contains the piece of code, and then call the
function wherever you want in the code.
Script 14:
1. def myfunc():
2.     print("This is a simple function")
3.
4. ### function call
5. myfunc()
Output:
This is a simple function
You can also pass values to a function. The values are passed
inside the parentheses of the function call. However, you
must specify the parameter name in the function definition,
too. In the following script, we define a function named
myfuncparam(). The function accepts one parameter, i.e.,
num. The value passed in the parentheses of the function call
will be stored in the num variable and will be printed by the
print() method inside the myfuncparam() function.
Script 15:
1. def myfuncparam(num):
2.     print("This is a function with parameter value: " + num)
3.
4. ### function call
5. myfuncparam("Parameter 1")
Output:
This is a function with parameter value: Parameter 1
A function can also return a value using the return statement. In the following script, the myreturnfunc() function returns a string, which is stored in the val variable and then printed.
Script 16:
1. def myreturnfunc():
2.     return "This function returns a value"
3.
4. val = myreturnfunc()
5. print(val)
Output:
This function returns a value
Python also supports object-oriented programming. A class is a blueprint that bundles attributes (data) and methods (functions). The following script defines a Fruit class with two attributes, name and price, and one method, eat_fruit(). It then creates an object of the class and accesses its method and attributes.
Script 17:
1. class Fruit:
2.
3.     name = "apple"
4.     price = 10
5.
6.     def eat_fruit(self):
7.         print("Fruit has been eaten")
8.
9.
10. f = Fruit()
11. f.eat_fruit()
12. print(f.name)
13. print(f.price)
Output:
Fruit has been eaten
apple
10
A class can also define a constructor, the special __init__() method, which executes when an object of the class is created and initializes the object's attributes.
Script 18:
1. class Fruit:
2.
3.     name = "apple"
4.     price = 10
5.
6.     def __init__(self, fruit_name, fruit_price):
7.         self.name = fruit_name
8.         self.price = fruit_price
9.
10.     def eat_fruit(self):
11.         print("Fruit has been eaten")
12.
13.
14. f = Fruit("Orange", 15)
15. f.eat_fruit()
16. print(f.name)
17. print(f.price)
Output:
Fruit has been eaten
Orange
15
2.8.1. NumPy
NumPy is one of the most commonly used libraries for
numeric and scientific computing. NumPy is extremely fast
and contains support for multiple mathematical domains such
as linear algebra, geometry, etc. It is extremely important to
learn NumPy in case you plan to make a career in data science
and data preparation.
https://numpy.org/
2.8.2. Matplotlib
Matplotlib is the de facto standard for static data visualization
in Python, which is the first step in data science and machine
learning. Being the oldest data visualization library in Python,
Matplotlib is the most widely used data visualization library.
Matplotlib was developed to resemble MATLAB, which is one
of the most widely used programming languages in academia.
While Matplotlib graphs are easy to plot, the look and feel
of Matplotlib plots has a distinct 1990s feel. Many wrapper
libraries, such as Pandas and Seaborn, have been developed
on top of Matplotlib. These libraries let users plot much
cleaner and more sophisticated graphs.
https://matplotlib.org/
2.8.3. Seaborn
Seaborn library is built on top of the Matplotlib library and
contains all the plotting capabilities of Matplotlib. However,
with Seaborn, you can plot much more pleasing and aesthetic
graphs with the help of Seaborn default styles and color
palettes.
https://seaborn.pydata.org/
2.8.4. Pandas
The plotting functionality of the Pandas library, like that of
Seaborn, is built on top of Matplotlib and offers utilities that
can plot different types of static plots in a single line of code.
With Pandas, you can import data in various formats, such as
CSV (Comma-Separated Values) and TSV (Tab-Separated
Values), and plot a variety of visualizations from these data
sources.
https://pandas.pydata.org/
2.8.5. Scikit-Learn
Scikit-Learn is the de facto standard Python library for classical machine learning; the machine learning chapters of this book rely on it.
https://scikit-learn.org/stable/
2.8.6. TensorFlow
TensorFlow is one of the most frequently used libraries for
deep learning. TensorFlow has been developed by Google and
offers an easy to use API for the development of various deep
learning models. TensorFlow is consistently being updated,
and at the time of writing of this book, TensorFlow 2 is the
latest major release of TensorFlow. With TensorFlow, you can
not only easily develop deep learning applications but also
deploy them with ease owing to the deployment functionalities
of TensorFlow.
https://www.tensorflow.org/
2.8.7. Keras
Keras is a high-level TensorFlow library that implements
complex TensorFlow functionalities under the hood. If you
are a newbie to deep learning, Keras is the deep learning
library that you should start with for developing deep learning
applications.
https://keras.io/
Exercise 2.1
Question 1
Question 3
Exercise 2.2
Print the table of integer 9 using a while loop:
3
Python NumPy Library
for Data Analysis
NumPy's core data structure is the NumPy array. You can create a NumPy array by passing a Python list to the np.array() method:
Script 1:
1. import numpy as np
2. nums_list = [10,12,14,16,20]
3. nums_array = np.array(nums_list)
4. type(nums_array)
Output:
numpy.ndarray
To create a two-dimensional array, pass a list of lists. The shape attribute returns the number of rows and columns:
Script 2:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. nums_2d.shape
Output:
(3, 3)
The arange() method returns evenly spaced integers between a lower bound (inclusive) and an upper bound (exclusive):
Script 3:
1. nums_arr = np.arange(5,11)
2. print(nums_arr)
Output:
[ 5 6 7 8 9 10]
A third argument to arange() specifies the step size:
Script 4:
1. nums_arr = np.arange(5,12,2)
2. print(nums_arr)
Output:
[ 5 7 9 11]
The ones() method creates an array of all ones:
Script 5:
1. ones_array = np.ones(6)
2. print(ones_array)
Output:
[1. 1. 1. 1. 1. 1.]
Passing a tuple of (rows, columns) creates a two-dimensional array of ones:
Script 6:
1. ones_array = np.ones((6,4))
2. print(ones_array)
Output:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
Similarly, the zeros() method creates an array of all zeros:
Script 7:
1. zeros_array = np.zeros(6)
2. print(zeros_array)
Output:
[0. 0. 0. 0. 0. 0.]
Script 8:
1. zeros_array = np.zeros((6,4))
2. print(zeros_array)
Output:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
The eye() method creates an identity matrix of the specified size:
Script 9:
1. eyes_array = np.eye(5)
2. print(eyes_array)
Output:
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
The random.rand() method creates an array of the given shape filled with uniform random values between 0 and 1:
Script 10:
1. uniform_random = np.random.rand(4, 5)
2. print(uniform_random)
Output:
[[0.36728531 0.25376281 0.05039624 0.96432236 0.08579293]
[0.29194804 0.93016399 0.88781312 0.50209692 0.63069239]
[0.99952044 0.44384871 0.46041845 0.10246553 0.53461098]
[0.75817916 0.36505441 0.01683344 0.9887365 0.21490949]]
The random.randn() method draws random numbers from a standard normal distribution:
Script 11:
1. normal_random = np.random.randn(4, 5)
2. print(normal_random)
Output:
(Random values drawn from a standard normal distribution; your output will differ from run to run and will include negative values.)
The random.randint() method returns random integers between a lower bound (inclusive) and an upper bound (exclusive); the third argument specifies how many integers to draw:
Script 12:
1. integer_random = np.random.randint(10, 50, 5)
2. print(integer_random)
Output:
[25 49 21 35 17]
The reshape() method changes the shape of an array, provided the total number of items remains the same (here, 4 × 6 = 3 × 8 = 24):
Script 13:
1. uniform_random = np.random.rand(4, 6)
2. uniform_random = uniform_random.reshape(3, 8)
3. print(uniform_random)
Output:
[[0.37576967 0.5425328 0.56087883 0.35265748 0.19677258
0.65107479 0.63287089 0.70649913]
[0.47830882 0.3570451 0.82151482 0.09622735 0.1269332
0.65866216 0.31875221 0.91781242]
[0.89785438 0.47306848 0.58350797 0.4604004 0.62352155
0.88064432 0.0859386 0.51918485]]
NumPy arrays can be indexed and sliced much like Python lists. First, create a simple array of the integers 1 through 10:
Script 14:
1. s = np.arange(1,11)
2. print(s)
Output:
[ 1 2 3 4 5 6 7 8 9 10]
Individual items are accessed via square brackets; indexes start at 0, so index 1 returns the second item:
Script 15:
print(s[1])
Output:
2
To slice an array, pass the lower bound (inclusive) and the upper bound (exclusive), separated by a colon:
Script 16:
print(s[1:9])
Output:
[2 3 4 5 6 7 8 9]
If you specify only the upper bound, all the items from the first
index to the upper bound are returned. Similarly, if you specify
only the lower bound, all the items from the lower bound to
the last item of the array are returned.
Script 17:
1. print(s[:5])
2. print(s[5:])
Output:
[1 2 3 4 5]
[ 6 7 8 9 10]
Two-dimensional arrays take one slice per dimension, separated by a comma. The following script returns the first two rows and all the columns:
Script 18:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[:2,:])
Output:
[[10 12 13]
[45 32 16]]
Similarly, the following script returns all the rows but only the
first two columns.
Script 19:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[:,:2])
Output:
[[10 12]
[45 32]
[45 32]]
Rows and columns can be sliced simultaneously. The following script skips the first row and the first column:
Script 20:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. print(nums_2d[1:,1:])
Output:
[[32 16]
[32 16]]
NumPy also provides vectorized math functions that operate element-wise. The sqrt() method takes the square root of each item:
Script 21:
1. nums = [10,20,30,40,50]
2. np_sqr = np.sqrt(nums)
3. print(np_sqr)
Output:
[3.16227766 4.47213595 5.47722558 6.32455532 7.07106781]
The log() method computes the natural logarithm of each item:
Script 22:
1. nums = [10,20,30,40,50]
2. np_log = np.log(nums)
3. print(np_log)
Output:
[2.30258509 2.99573227 3.40119738 3.68887945 3.91202301]
The exp() method computes the exponential of each item:
Script 23:
1. nums = [10,20,30,40,50]
2. np_exp = np.exp(nums)
3. print(np_exp)
Output:
[2.20264658e+04 4.85165195e+08 1.06864746e+13 2.35385267e+17
5.18470553e+21]
The sin() and cos() methods compute the element-wise sine and cosine:
Script 24:
1. nums = [10,20,30,40,50]
2. np_sine = np.sin(nums)
3. print(np_sine)
4.
5. nums = [10,20,30,40,50]
6. np_cos = np.cos(nums)
7. print(np_cos)
Output:
[-0.54402111 0.91294525 -0.98803162 0.74511316 -0.26237485]
[-0.83907153 0.40808206 0.15425145 -0.66693806 0.96496603]
For linear algebra, the dot() method computes the matrix product of two arrays; the inner dimensions of the operands must match:
Script 25:
1. A = np.random.randn(4,5)
2.
3. B = np.random.randn(5,4)
4.
5. Z = np.dot(A,B)
6.
7. print(Z)
Output:
[[ 1.43837722 -4.74991285 1.42127048 -0.41569506]
[-1.64613809 5.79380984 -1.33542482 1.53201023]
[-1.31518878 0.72397674 -2.01300047 0.61651047]
[-1.36765444 3.83694475 -0.56382045 0.21757162]]
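As a quick sanity check on the shapes involved, the product of a (4, 5) and a (5, 4) matrix is a (4, 4) matrix, which you can confirm by adding one line to the script above:

print(Z.shape)   # (4, 4): the outer dimensions of the two operands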
The multiply() method performs element-wise multiplication of two arrays of the same shape:
Script 26:
1. row1 = [10,12,13]
2. row2 = [45,32,16]
3. row3 = [45,32,16]
4.
5. nums_2d = np.array([row1, row2, row3])
6. multiply = np.multiply(nums_2d, nums_2d)
7. print(multiply)
Output:
[[ 100 144 169]
[2025 1024 256]
[2025 1024 256]]
The linalg.inv() method finds the inverse of a square matrix:
Script 27:
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. inverse = np.linalg.inv(nums_2d)
8. print(inverse)
Output:
[[ 3.15251974e+15 -6.30503948e+15  3.15251974e+15]
 [-6.30503948e+15  1.26100790e+16 -6.30503948e+15]
 [ 3.15251974e+15 -6.30503948e+15  3.15251974e+15]]
Note that this particular matrix is singular (its rows are linearly dependent), so the huge values above are numerical artifacts rather than a meaningful inverse. The near-zero determinant computed next confirms this.
The linalg.det() method computes the determinant of a matrix:
Script 28:
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. determinant = np.linalg.det(nums_2d)
8. print(determinant)
Output:
-9.51619735392994e-16
This value is effectively zero, which confirms that the matrix is singular.
The trace() method returns the sum of the elements on the main diagonal:
Script 29:
1. row1 = [1,2,3]
2. row2 = [4,5,6]
3. row3 = [7,8,9]
4.
5. nums_2d = np.array([row1, row2, row3])
6.
7. trace = np.trace(nums_2d)
8. print(trace)
Output:
15
Exercise 3.1
Question 1:
Question 2:
Question 3:
Exercise 3.2
Create a random NumPy array of five rows and four columns.
Using array indexing and slicing, display the items from row
three to end and column two to end.
4
Introduction to Pandas
Library for Data Analysis
4.1. Introduction
In this chapter, you will see how to use Python’s Pandas library
for data analysis. In the next chapter, you will see how to use
the Pandas library for data visualization by plotting different
types of plots.
Script 1:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
Output:
The read_csv() method reads data from a CSV or TSV file and
stores it in a Pandas dataframe, which is a special object that
stores data in the form of rows and columns.
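Besides head(), two standard dataframe attributes are handy for a first look at freshly loaded data (a small sketch, reusing the titanic_data dataframe from above):

print(titanic_data.shape)     # (number of rows, number of columns)
print(titanic_data.columns)   # the column names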
Comparing a dataframe column against a value returns a boolean series containing True for every row that satisfies the condition:
Script 2:
1. titanic_pclass1= (titanic_data.Pclass == 1)
2. titanic_pclass1
Output:
0 False
1 True
2 False
3 True
4 False
...
886 False
887 True
888 False
889 True
890 False
Name: Pclass, Length: 891, dtype: bool
Passing this boolean series inside square brackets returns only the rows where the series contains True:
Script 3:
1. titanic_pclass1= (titanic_data.Pclass == 1)
2. titanic_pclass1_data = titanic_data[titanic_pclass1]
3. titanic_pclass1_data.head()
Output:
Both steps can also be combined into a single statement:
Script 4:
1. titanic_pclass_data = titanic_data[titanic_data.Pclass == 1]
2. titanic_pclass_data.head()
Output:
The isin() method keeps the rows whose column value matches any item in a list:
Script 5:
1. ages = [20,21,22]
2. age_dataset = titanic_data[titanic_data["Age"].isin(ages)]
3. age_dataset.head()
Output:
Multiple conditions can be combined with the & operator:
Script 6:
1. ages = [20,21,22]
2. ageclass_dataset = titanic_data[titanic_data["Age"].isin(ages) & (titanic_data["Pclass"] == 1)]
3. ageclass_dataset.head()
Output:
To keep only a subset of columns, pass the column names to the filter() method:
Script 7:
1. titanic_data_filter = titanic_data.filter(["Name", "Sex", "Age"])
2. titanic_data_filter.head()
The output below shows that the dataset now contains only
Name, Sex, and Age columns.
Output:
Alternatively, the drop() method removes the specified columns (axis = 1 refers to columns):
Script 8:
1. titanic_data_filter = titanic_data.drop(["Name", "Sex", "Age"], axis = 1)
2. titanic_data_filter.head()
Output:
To demonstrate concatenation, the following script creates two dataframes containing the rows for passenger classes 1 and 2, respectively:
Script 9:
1. titanic_pclass1_data = titanic_data[titanic_data.Pclass == 1]
2. print(titanic_pclass1_data.shape)
3.
4. titanic_pclass2_data = titanic_data[titanic_data.Pclass == 2]
5. print(titanic_pclass2_data.shape)
Output:
(216, 12)
(184, 12)
The append() method adds the rows of one dataframe to the end of another:
Script 10:
1. final_data = titanic_pclass1_data.append(titanic_pclass2_data, ignore_index=True)
2. print(final_data.shape)
2. print(final_data.shape)
Output:
(400, 12)
The output now shows that the total number of rows is 400,
which is the sum of the number of rows in the two dataframes
that we concatenated.
The concat() method produces the same result:
Script 11:
1. final_data = pd.concat([titanic_pclass1_data, titanic_pclass2_data])
2. print(final_data.shape)
2. print(final_data.shape)
Output:
(400, 12)
Passing axis = 1 concatenates dataframes column-wise. The following script splits the 400 rows into two dataframes of 200 rows each and then concatenates them horizontally, yielding 24 columns:
Script 12:
1. df1 = final_data[:200]
2. print(df1.shape)
3. df2 = final_data[200:]
4. print(df2.shape)
5.
6. final_data2 = pd.concat([df1, df2], axis = 1, ignore_index = True)
7. print(final_data2.shape)
Output:
(200, 12)
(200, 12)
(400, 24)
The sort_values() method sorts a dataframe by one or more columns, in ascending order by default:
Script 13:
1. age_sorted_data = titanic_data.sort_values(by=['Age'])
2. age_sorted_data.head()
Output:
Pass ascending = False to sort in descending order:
Script 14:
1. age_sorted_data = titanic_data.sort_values(by=['Age'], ascending = False)
2. age_sorted_data.head()
Output:
You can also sort by multiple columns; ties in the first column are broken by the second:
Script 15:
1. age_sorted_data = titanic_data.sort_values(by=['Age','Fare'], ascending = False)
2. age_sorted_data.head()
Output:
The apply() method runs a function on every value in a column. Here, a lambda expression adds 2 to each value in the Pclass column:
Script 16:
1. updated_class = titanic_data.Pclass.apply(lambda x : x + 2)
2. updated_class.head()
The output shows that all the values in the Pclass column have
been incremented by 2.
Output:
0 5
1 3
2 5
3 3
4 5
Instead of a lambda expression, you can also pass a named function to apply():
Script 17:
1. def mult(x):
2.     return x * 2
3.
4. updated_class = titanic_data.Pclass.apply(mult)
5. updated_class.head()
Output:
0 6
1 2
2 6
3 2
4 6
The pivot_table() method reshapes a dataframe around selected columns. First, load the Flights dataset from the Seaborn library:
Script 18:
1. import matplotlib.pyplot as plt
2. import seaborn as sns
3.
4.
5. flights_data = sns.load_dataset('flights')
6.
7. flights_data.head()
Output:
The following script pivots the data so that months appear as rows, years as columns, and passenger counts as cell values:
Script 19:
1. flights_data_pivot = flights_data.pivot_table(index='month', columns='year', values='passengers')
2. flights_data_pivot.head()
Output:
The crosstab() method computes a frequency table between two columns:
Script 20:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
4.
5. pd.crosstab(titanic_data.Pclass, titanic_data.Age, margins=True)
Output:
The np.where() method updates column values conditionally. The following script adds 5 to the fare of every passenger older than 20:
Script 21:
1. import numpy as np
2. titanic_data.Fare = np.where(titanic_data.Age > 20, titanic_data.Fare + 5, titanic_data.Fare)
3.
4. titanic_data.head()
Output:
Exercise 4.1
Question 1
Question 2
Question 3
Exercise 4.2
Use the apply function to subtract 10 from the Fare column
of the Titanic dataset, without using a lambda expression.
5
Data Visualization via
Matplotlib, Seaborn, and
Pandas Libraries
In this chapter, you will see some of the most commonly used
Python libraries for data visualization. You will see how to plot
different types of plots using the Matplotlib, Seaborn, and
Pandas libraries.
Finally, before you can plot any graphs with the Matplotlib
library, you will need to import the pyplot module from the Matplotlib
library. And since all the scripts will be executed inside Jupyter
Notebook, the statement %matplotlib inline has been used to
generate plots inside Jupyter Notebook. Execute the following
script:
1. import matplotlib.pyplot as plt
2. %matplotlib inline
Script 1:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. x_vals = np.linspace(0, 20, 20)
6. y_vals = [math.sqrt(i) for i in x_vals]
7. plt.plot(x_vals, y_vals)
Output:
Script 2:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. x_vals = np.linspace(0, 20, 20)
6. y_vals = [math.sqrt(i) for i in x_vals]
7.
8. fig = plt.figure()
9. ax = plt.axes()
10. ax.plot(x_vals, y_vals)
Here is the output of the above script. This method can be used
to plot multiple plots, which we will see in the next chapter. In
this chapter, we will stick to the first approach, where we call
the plot() method directly from the pyplot module.
Output:
You can also increase the default plot size of a Matplotlib plot.
To do so, you can use the rcParams list of the pyplot module
and then set two values for the figure.figsize attribute. The
following script sets the plot size to 8 inches wide and 6 inches
tall.
Script 3:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.rcParams["figure.figsize"] = [8,6]
6.
7. x_vals = np.linspace(0, 20, 20)
8. y_vals = [math.sqrt(i) for i in x_vals]
9. plt.plot(x_vals, y_vals)
In the output, you can see that the default plot size has been
increased.
Output:
Script 4:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. x_vals = np.linspace(0, 20, 20)
6. y_vals = [math.sqrt(i) for i in x_vals]
7. plt.xlabel('X Values')
8. plt.ylabel('Y Values')
9. plt.title('Square Roots')
10. plt.plot(x_vals, y_vals)
Here, in the output, you can see the labels and title that you
specified in Script 4.
Output:
Script 5:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5.
6. x_vals = np.linspace(0, 20, 20)
7. y_vals = [math.sqrt(i) for i in x_vals]
8. plt.xlabel('X Values')
9. plt.ylabel('Y Values')
10. plt.title('Square Roots')
11. plt.plot(x_vals, y_vals, 'r')
Output:
To add a legend, you first pass a value for the label parameter
of the plot() function. Next, you pass a location for the legend
to the loc attribute of the legend() method of the pyplot
module. The following script plots a legend at the upper
center of the plot.
Script 6:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5.
6. x_vals = np.linspace(0, 20, 20)
7. y_vals = [math.sqrt(i) for i in x_vals]
8. plt.xlabel('X Values')
9. plt.ylabel('Y Values')
10. plt.title('Square Roots')
11. plt.plot(x_vals, y_vals, 'r', label = 'Square Root')
12. plt.legend(loc='upper center')
Output:
You can also plot multiple line plots inside one graph. All you
have to do is call the plot() method twice with different values
for the x and y axes. The following script plots a line plot for
the square root function in red and for the cube function in blue.
Script 7:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5.
6. x_vals = np.linspace(0, 20, 20)
7. y_vals = [math.sqrt(i) for i in x_vals]
8. y2_vals = x_vals ** 3
9. plt.xlabel('X Values')
10. plt.ylabel('Y Values')
11. plt.title('Square Roots')
12. plt.plot(x_vals, y_vals, 'r', label = 'Square Root')
13. plt.plot(x_vals, y2_vals, 'b', label = 'Cube')
14. plt.legend(loc='upper center')
Output:
Script 8:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.csv")
If you do not see any error, the file has been read successfully.
To see the first five rows of the Pandas dataframe containing
the data, you can use the head() method as shown below:
Script 9:
data.head()
Output:
You can see that the iris_data.csv file has five columns. We
can use the values from any two of these columns to plot a line
plot. To do so, we need to pass the dataframe column names
to the plot() function of the pyplot module for the x and y
axes. To access a column from a Pandas dataframe, you
specify the dataframe name followed by a pair of square
brackets containing the column name. The following script
plots a line plot where the x-axis contains values from the
sepal_length column, whereas the y-axis contains values from
the petal_length column of the dataframe.
Script 10:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.xlabel('Sepal Length')
6. plt.ylabel('Petal Length')
7. plt.title('Sepal vs Petal Length')
8. plt.plot(data["sepal_length"], data["petal_length"], 'b')
Output:
Like CSV, you can also read a TSV file via the read_csv()
method. You have to pass ‘\t’ as the value for the sep
parameter. Script 11 reads the iris_data.tsv file and stores it in
a Pandas dataframe. Next, the first five rows of the dataframe
are printed via the head() method.
Script 11:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\iris_data.tsv", sep='\t')
3. data.head()
Output:
The remaining process to plot the line plot remains the same,
as it was for the CSV file. The following script plots a line plot,
where the x-axis contains sepal length, and the y-axis displays
petal length.
Script 12:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.xlabel('Sepal Length')
6. plt.ylabel('Petal Length')
7. plt.title('Sepal vs Petal Length')
8. plt.plot(data["SepalLength"], data["PetalLength"], "b")
Output:
To plot a scatter plot between two columns, call the scatter() method of the pyplot module; the c parameter sets the color of the points.
Script 13:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.xlabel('Sepal Length')
6. plt.ylabel('Petal Length')
7. plt.title('Sepal vs Petal Length')
8. plt.scatter(data["SepalLength"], data["PetalLength"], c = "b")
The output shows a scattered plot with blue points. The plot
clearly shows that with an increase in sepal length, the petal
length of an iris flower also increases.
Output:
The next plots use the Titanic dataset:
Script 14:
1. import pandas as pd
2. data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. data.head()
Output:
To plot a bar plot, you need to call the bar() method. The
categorical values are passed on the x-axis, and the corresponding
aggregated numerical values are passed on the y-axis. The
following script plots a bar plot between the genders and ages
of the passengers.
Script 15:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.xlabel('Gender')
6. plt.ylabel('Ages')
7. plt.title('Gender vs Age')
8. plt.bar(data["Sex"], data["Age"])
Output:
5.2.6. Histograms
Histograms are basically used to display the distribution of
data for a numeric list of items. The hist() method is used
to plot a histogram. You simply have to pass a collection
of numeric values to the hist() method. For instance, the
following histogram plots the distribution of values in the Age
column of the Titanic dataset.
Script 16:
1. import matplotlib.pyplot as plt
2. import numpy as np
3. import math
4.
5. plt.title('Age Histogram')
6. plt.hist(data["Age"])
Output:
Pie charts show the share of each category in a whole. The pie() method plots a pie chart; the explode parameter offsets the slices, and autopct formats the percentage labels.
Script 17:
1. labels = 'IT', 'Marketing', 'Data Science', 'Finance'
2. values = [500, 156, 300, 510]
3. explode = (0.05, 0.05, 0.05, 0.05)
4.
5. plt.pie(values, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True)
6. plt.show()
Output:
The Seaborn library plots more aesthetically pleasing statistical graphs. The following script imports the libraries and loads the Tips dataset:
Script 18:
1. import matplotlib.pyplot as plt
2. import seaborn as sns
3.
4. plt.rcParams["figure.figsize"] = [10,8]
5.
6. tips_data = sns.load_dataset('tips')
7.
8. tips_data.head()
Output:
The distplot() method plots the distribution of a numeric column, here the total bill:
Script 19:
1. plt.rcParams["figure.figsize"] = [10,8]
2. sns.distplot(tips_data['total_bill'])
Output:
The jointplot() method plots two columns against each other, along with their individual distributions:
Script 20:
sns.jointplot(x='total_bill', y='tip', data=tips_data)
Output:
Passing kind = 'reg' adds a regression line to the joint plot:
Script 21:
sns.jointplot(x='size', y='total_bill', data=tips_data, kind = 'reg')
Output:
The pairplot() method plots every numeric column against every other column in the dataframe:
Script 22:
sns.pairplot(data=tips_data)
Output:
The following script sets the Seaborn style and loads the Titanic dataset:
Script 23:
1. import matplotlib.pyplot as plt
2. import seaborn as sns
3.
4. plt.rcParams["figure.figsize"] = [8,6]
5. sns.set_style("darkgrid")
6.
7. titanic_data = sns.load_dataset('titanic')
8.
9. titanic_data.head()
Output:
The barplot() method plots, by default, the mean value of a numeric column for each category:
Script 24:
sns.barplot(x='pclass', y='age', data=titanic_data)
Output:
The countplot() method plots the number of occurrences of each category:
Script 25:
sns.countplot(x='pclass', data=titanic_data)
Output:
The boxplot() method displays the quartiles and outliers of a numeric column:
Script 26:
sns.boxplot(x=titanic_data["fare"])
Output:
The violinplot() method resembles a box plot but also shows the shape of the distribution:
Script 27:
sns.violinplot(x='alone', y='age', data=titanic_data)
Output:
Pandas dataframes can also plot directly. First, reload the Titanic dataset:
Script 28:
1. import pandas as pd
2. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
3. titanic_data.head()
3. titanic_data.head()
Output:
Calling the hist() method on a column plots its histogram:
Script 29:
1. import matplotlib.pyplot as plt
2. titanic_data['Age'].hist()
Output:
For line and scatter plots, load the Flights dataset:
Script 30:
1. flights_data = sns.load_dataset('flights')
2.
3. flights_data.head()
Output:
The plot.line() method draws a line plot of the selected column:
Script 31:
flights_data.plot.line( y='passengers', figsize=(8,6))
Output:
The plot.scatter() method draws a scatter plot between two columns:
Script 32:
flights_data.plot.scatter(x='year', y='passengers', figsize=(8,6))
Output:
To plot aggregated values, first compute them. The following script uses groupby() to find the mean age per gender:
Script 33:
1. titanic_data = pd.read_csv(r"E:\Data Visualization with Python\Datasets\titanic_data.csv")
2. titanic_data.head()
3. sex_mean = titanic_data.groupby("Sex")["Age"].mean()
4.
5. print(sex_mean)
6. print(type(sex_mean.tolist()))
Output:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
<class 'list'>
The means can then be placed in a new dataframe and plotted as a bar plot:
Script 34:
1. df = pd.DataFrame({'Gender':['Female', 'Male'], 'Age':sex_mean.tolist()})
2. ax = df.plot.bar(x='Gender', y='Age', figsize=(8,6))
Output:
Exercise 5.1
Question 1
Question 2:
Question 3:
Exercise 5.2
Plot two scatter plots on the same graph using the tips_
dataset. In the first scatter plot, display values from the total_
bill column on the x-axis and from the tip column on the y-axis.
The color of the first scatter plot should be green. In the second
scatter plot, display values from the total_bill column on the
x-axis and from the size column on the y-axis. The color of the
second scatter plot should be blue, and markers should be x.
6
Solving Regression Problems
in Machine Learning Using
Sklearn Library
You can read data from CSV files. However, the datasets we
are going to use in this section are available by default in the
Seaborn library. To view all the datasets, you can use the
get_dataset_names() function, as shown in the following script:
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4. sns.get_dataset_names()
Output:
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'geyser',
'iris',
'mpg',
'penguins',
'planets',
'tips',
'titanic']
The following script loads the Tips dataset and displays its
first five rows.
Script 2:
1. tips_df = sns.load_dataset("tips")
2. tips_df.head()
Output:
Similarly, the following script loads the Diamonds dataset and displays its first five rows:
Script 3:
1. diamond_df = sns.load_dataset("diamonds")
2. diamond_df.head()
Output:
The first preprocessing step is to divide the dataset into features and labels. For the Tips dataset, the tip column is the label, and the remaining columns form the feature set:
Script 4:
1. X = tips_df.drop(['tip'], axis=1)
2. y = tips_df["tip"]
The following script prints the first five rows of the feature set:
Script 5:
1. X.head()
Output:
And the following script prints the first five rows of the label set:
Script 6:
1. y.head()
Output:
0 1.01
1 1.66
2 3.50
3 3.31
4 3.61
Name: tip, dtype: float64
Machine learning algorithms work with numbers, so the categorical columns must be converted into numeric ones. First, separate the numerical columns:
Script 7:
numerical = X.drop(['sex', 'smoker', 'day', 'time'], axis = 1)
Script 8:
1. numerical.head()
Output:
Next, collect the categorical columns into their own dataframe:
Script 9:
1. categorical = X.filter(['sex', 'smoker', 'day', 'time'])
2. categorical.head()
Output:
The get_dummies() method performs one-hot encoding, converting each categorical column into numeric dummy columns. Passing drop_first=True drops one redundant dummy column per original column:
Script 10:
1. import pandas as pd
2. cat_numerical = pd.get_dummies(categorical,drop_first=True)
3. cat_numerical.head()
Output:
The final step is to join the numerical columns with the one-hot
encoded columns. To do so, you can use the concat() function
from the Pandas library as shown below:
Script 11:
1. X = pd.concat([numerical, cat_numerical], axis = 1)
2. X.head()
The final dataset looks like this. You can see that it doesn’t
contain any categorical value.
Output:
Next, divide the data into training and test sets. The test_size parameter specifies the fraction of the data reserved for testing:
Script 12:
1. from sklearn.model_selection import train_test_split
2.
3. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Feature scaling brings all columns onto a comparable scale. Note that the scaler is fitted on the training set only and then applied, without refitting, to the test set, which avoids leaking test information into training:
Script 13:
1. from sklearn.preprocessing import StandardScaler
2. sc = StandardScaler()
3. # scaling the training set
4. X_train = sc.fit_transform(X_train)
5. # scaling the test set
6. X_test = sc.transform(X_test)
The following script trains a linear regression model on the training set and makes predictions on the test set:
Script 14:
1. from sklearn.linear_model import LinearRegression
2. # training the algorithm
3. lin_reg = LinearRegression()
4. regressor = lin_reg.fit(X_train, y_train)
5. # making predictions on test set
6. y_pred = regressor.predict(X_test)
The methods used to find the values of these metrics are
available in the sklearn.metrics module. The predicted and
actual values have to be passed to these methods, as shown
in the script below.
Script 15:
1. from sklearn import metrics
2.
3. print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
4. print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
5. print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
Mean Absolute Error: 0.7080218832979829
Mean Squared Error: 0.893919522160961
Root Mean Squared Error: 0.9454731736865732
The KNN regressor is trained and evaluated in the same way. The n_neighbors parameter sets the number of neighbors, K:
Script 16:
1. from sklearn.neighbors import KNeighborsRegressor
2. knn_reg = KNeighborsRegressor(n_neighbors=5)
3. regressor = knn_reg.fit(X_train, y_train)
4.
5. y_pred = regressor.predict(X_test)
6.
7.
8. from sklearn import metrics
9.
10. print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
11. print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
12. print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
Mean Absolute Error: 0.7513877551020406
Mean Squared Error: 0.9462902040816326
Root Mean Squared Error: 0.9727744877830794
Script 17:
1. # training and testing the random forest
2. from sklearn.ensemble import RandomForestRegressor
3. rf_reg = RandomForestRegressor(random_state=42, n_estimators=500)
4. regressor = rf_reg.fit(X_train, y_train)
5. y_pred = regressor.predict(X_test)
6.
7. # evaluating algorithm performance
8. from sklearn import metrics
9.
10. print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
11. print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
12. print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
Mean Absolute Error: 0.7054065306122449
Mean Squared Error: 0.8045782841306138
Root Mean Squared Error: 0.8969828783932354
With the Sklearn library, you can use the SVR class from the
svm module to implement the support vector regression
algorithm, as shown below.
Script 18:
1. # training and testing the SVM
2.
3. from sklearn import svm
4. svm_reg = svm.SVR()
5.
6. regressor = svm_reg.fit(X_train, y_train)
7. y_pred = regressor.predict(X_test)
8.
9.
10. from sklearn import metrics
11.
12. print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
13. print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
14. print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Rather than relying on a single train/test split, K-fold cross-validation evaluates the model on K different splits of the data:
Script 19:
1. from sklearn.model_selection import cross_val_score
2.
3. print(cross_val_score(regressor, X, y, cv=5, scoring ="neg_mean_absolute_error"))
Output:
[-0.66386205 -0.57007269 -0.63598762 -0.96960743 -0.87391702]
The output shows the negative mean absolute error for each
of the five folds (Scikit-Learn negates error metrics so that
higher values are always better).
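To summarize the five folds as a single number, you can average the scores; a small sketch (numpy is already imported as np earlier in the chapter):

scores = cross_val_score(regressor, X, y, cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())   # average MAE across the folds, negated back to a positive error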
Finally, let's make a prediction on a single record. The following script prints the 100th record of the dataset:
Script 20:
1. tips_df.loc[100]
The output shows that the value of the tip in the 100th record
in our dataset is 2.5.
Output:
total_bill 11.35
tip 2.5
sex Female
smoker Yes
day Fri
time Dinner
size 2
Name: 100, dtype: object
We will try to predict the value of the tip of the 100th record
using the random forest regressor algorithm and see what
output we get. Look at the script below:
Note that you have to scale your single record before it can be
used as input to your machine learning algorithm.
Script 21:
1. from sklearn.ensemble import RandomForestRegressor
2. rf_reg = RandomForestRegressor(random_state=42, n_estimators=500)
3. regressor = rf_reg.fit(X_train, y_train)
4.
5. single_record = sc.transform(X.values[100].reshape(1, -1))
6. predicted_tip = regressor.predict(single_record)
7. print(predicted_tip)
Output:
[2.2609]
The predicted tip of 2.26 is quite close to the actual tip of 2.5 for this record.
Exercise 6.1
Question 1
Question 2
Question 3
Exercise 6.2
Using the Diamonds dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
7
Solving Classification
Problems in Machine Learning
Using Sklearn Library
The following script imports the libraries required in this chapter:
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
The following script loads the customer churn dataset and displays its first five rows:
Script 2:
1. churn_df = pd.read_csv("E:\Hands on Python for Data Science and Machine Learning\Datasets\customer_churn.csv")
2. churn_df.head()
Output:
The RowNumber, CustomerId, and Surname columns carry no information useful for prediction, so they are dropped:
Script 3:
1. churn_df = churn_df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
Next, divide the data into features and labels. The Exited column, which records whether or not the customer left the bank, is the label:
Script 4:
1. X = churn_df.drop(['Exited'], axis=1)
2. y = churn_df['Exited']
The following script prints the first five rows of the feature set.
Script 5:
1. X.head()
Output:
And the following script prints the first five rows of the label
set, as shown below:
Script 6:
1. y.head()
Output:
0 1
1 0
2 1
3 0
4 0
Name: Exited, dtype: int64
As before, separate the numerical columns from the categorical ones:
Script 7:
1. numerical = X.drop(['Geography', 'Gender'], axis = 1)
Script 8:
1. numerical.head()
Output:
And collect the categorical columns:
Script 9:
1. categorical = X.filter(['Geography', 'Gender'])
2. categorical.head()
Output:
One-hot encode the categorical columns via the get_dummies() method:
Script 10:
1. import pandas as pd
2. cat_numerical = pd.get_dummies(categorical, drop_first=True)
3. cat_numerical.head()
Output:
Join the numerical and one-hot encoded columns to form the final feature set:
Script 11:
1. X = pd.concat([numerical, cat_numerical], axis = 1)
2. X.head()
Output:
Script 12:
1. from sklearn.model_selection import train_test_split
2. # test size is the fraction of the data used for testing
3. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
Scale the features; again, the scaler is fitted on the training set only:
Script 13:
1. from sklearn.preprocessing import StandardScaler
2. sc = StandardScaler()
3. X_train = sc.fit_transform(X_train)
4. X_test = sc.transform(X_test)
The following script trains a logistic regression classifier and makes predictions on the test set:
Script 14:
1. from sklearn.linear_model import LogisticRegression
2.
3. log_clf = LogisticRegression()
4. classifier = log_clf.fit(X_train, y_train)
5.
6. y_pred = classifier.predict(X_test)
7.
8.
True Positives: labels that are actually true and are also predicted as true by the model.
False Negatives: labels that are actually true but are predicted as false by the model.
False Positives: labels that are actually false but are predicted as true by the model.
True Negatives: labels that are actually false and are also predicted as false by the model.
From these four counts, the commonly used classification metrics are defined as follows:
Confusion Matrix: a table with the actual labels along one axis and the predicted labels along the other, containing the four counts above.
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1 Measure: 2 × (Precision × Recall) / (Precision + Recall)
Accuracy: (TP + TN) / (TP + TN + FP + FN)
The methods used to find the values of these metrics are
available in the sklearn.metrics module. The predicted and
actual values have to be passed to these methods, as shown
in the script below.
Script 15:
1. from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
2.
3. print(confusion_matrix(y_test,y_pred))
4. print(classification_report(y_test,y_pred))
5. print(accuracy_score(y_test, y_pred))
Output:
[[1526 69]
[ 309 96]]
precision    recall  f1-score   support
(per-class values not shown)

0.811
The output shows that for 81 percent of the records in the test
set, logistic regression correctly predicted whether or not a
customer would leave the bank.
The pros and cons of the KNN classifier algorithm are the same
as the KNN regression algorithm, which is explained already in
Chapter 6, section 6.3.
Script 16:
1. from sklearn.neighbors import KNeighborsClassifier
2. knn_clf = KNeighborsClassifier(n_neighbors=5)
3. classifier = knn_clf.fit(X_train, y_train)
4.
5. y_pred = classifier.predict(X_test)
6.
7.
8. from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
9.
10. print(confusion_matrix(y_test,y_pred))
11. print(classification_report(y_test,y_pred))
12. print(accuracy_score(y_test, y_pred))
Output:
[[1486 109]
[ 237 168]]
precision    recall  f1-score   support
(per-class values not shown)

0.827
The random forest classifier is trained and evaluated the same way:
Script 17:
1. from sklearn.ensemble import RandomForestClassifier
2. rf_clf = RandomForestClassifier(random_state=42, n_estimators=500)
3.
4. classifier = rf_clf.fit(X_train, y_train)
5.
6. y_pred = classifier.predict(X_test)
7.
8.
9. from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
10.
11. print(confusion_matrix(y_test,y_pred))
12. print(classification_report(y_test,y_pred))
13. print(accuracy_score(y_test, y_pred))
Output:
[[1521 74]
[ 196 209]]
precision    recall  f1-score   support
(per-class values not shown)

0.865
With the Sklearn library, you can use the SVC class from the
svm module to implement the support vector classification
algorithm, as shown below:
Script 18:
1. # training the SVM algorithm
2. from sklearn import svm
3. svm_clf = svm.SVC()
4.
5. classifier = svm_clf.fit(X_train, y_train)
6. # making predictions on the test set
7. y_pred = classifier.predict(X_test)
8.
9. # evaluating the algorithm
10. from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
11.
12. print(confusion_matrix(y_test,y_pred))
13. print(classification_report(y_test,y_pred))
14. print(accuracy_score(y_test, y_pred))
Output:
[[1547 48]
[ 225 180]]
precision    recall  f1-score   support
(per-class values not shown)

0.8635
Again, K-fold cross-validation gives a more reliable performance estimate than a single split:
Script 19:
1. from sklearn.model_selection import cross_val_score
2.
3. print(cross_val_score(classifier, X, y, cv=5, scoring ="accuracy"))
Output:
[0.796 0.796 0.7965 0.7965 0.7965]
The output shows the accuracy achieved on each of the five folds.
Finally, let's predict churn for a single customer. The following script prints the 100th record of the dataset:
Script 20:
1. churn_df.loc[100]
Output:
CreditScore 665
Geography France
Gender Female
Age 40
Tenure 6
Balance 0
NumOfProducts 1
HasCrCard 1
IsActiveMember 1
EstimatedSalary 161848
Exited 0
Name: 100, dtype: object
The output above shows that the customer did not exit the
bank after six months since the value for the Exited attribute
is 0. Let’s see what our classification model predicts:
Script 21:
1. # training the random forest algorithm
2. from sklearn.ensemble import RandomForestClassifier
3. rf_clf = RandomForestClassifier(random_state=42, n_estimators=500)
4.
5. classifier = rf_clf.fit(X_train, y_train)
6.
7. # scaling the single record
8. single_record = sc.transform(X.values[100].reshape(1, -1))
9.
10. # making predictions on the single record
11. predicted_churn = classifier.predict(single_record)
12. print(predicted_churn)
Output:
[0]
The model predicts 0, which matches the actual value of the Exited attribute for this customer.
Exercise 7.1
Question 1
Question 2
Question 3
Exercise 7.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice, which predicts the
species of the iris plant. Perform all the preprocessing steps.
8
Data Clustering with
Machine Learning Using
Sklearn Library
Script 1:
1. import numpy as np
2. import pandas as pd
3. from sklearn.datasets import make_blobs  # in older Sklearn versions: sklearn.datasets.samples_generator
4. from sklearn.cluster import KMeans
5. from matplotlib import pyplot as plt
6. %matplotlib inline
Script 2:
1. # generating dummy data of 500 records with 4 clusters
2. features, labels = make_blobs(n_samples=500, centers=4,
cluster_std = 2.00)
3.
4. #plotting the dummy data
5. plt.scatter(features[:,0], features[:,1] )
The output looks as follows. Using K-means clustering, you will
see how to divide this dataset into four clusters.
Output:
Script 3:
1. # performing kmeans clustering using KMeans class
2. km_model = KMeans(n_clusters=4)
3. km_model.fit(features)
Once the model is trained, you can print the cluster centers
using the cluster_centers_ attribute of the KMeans class
object.
Script 4:
1. #printing centroid values
2. print(km_model.cluster_centers_)
Output:
[[-4.54070231 7.26625699]
[ 0.10118215 -0.23788283]
[ 2.57107155 8.17934929]
[-0.38501161 3.11446039]]
To print the cluster ids for all the labels, you can use the labels_
attribute of the KMeans class, as shown below.
Script 5:
1. #printing predicted label values
2. print(km_model.labels_)
Output:
[0 2 3 2 1 1 3 1 2 0 0 2 3 3 1 1 2 0 1 2 2 1 3 3 1 1 0 2 0 2 0
1 0 1 3 2 2 3 0 0 0 2 1 2 0 1 3 1 3 2 1 3 3 1 0 2 1 3 0 0 3 3
3 1 1 1 3 0 1 3 2 1 1 2 0 2 1 2 1 0 0 2 1 2 1 0 2 0 0 2 2 3 3
0 2 0 2 3 0 0 3 1 0 3 2 1 3 2 2 0 2 1 1 0 0 3 3 2 3 1 0 0 3 0
1 0 3 1 0 3 2 0 1 1 0 2 1 2 2 0 3 1 3 3 0 1 1 0 2 0 0 0 3 3 3
3 0 3 1 2 1 0 3 2 3 1 3 3 0 3 2 3 0 1 3 2 3 2 1 2 2 3 0 3 2 0
3 0 1 2 2 3 2 2 1 0 1 1 2 3 2 0 1 3 3 3 3 0 0 3 1 0 1 1 3 3 1
3 1 0 0 2 1 1 1 1 2 2 0 2 1 0 1 2 3 0 1 2 0 1 1 0 1 0 3 1 2 1
1 2 3 0 0 1 3 1 2 0 1 1 0 1 0 0 2 2 0 1 2 0 1 2 0 0 1 1 0 1 2
3 0 1 2 3 0 0 3 2 3 0 3 1 3 1 3 0 1 3 3 1 1 2 2 2 3 1 1 3 1 3
3 0 1 1 2 0 2 2 3 1 0 3 2 1 0 2 3 1 0 2 0 0 3 1 1 2 3 3 1 2 2
3 0 3 3 3 1 0 2 0 0 3 1 1 0 1 0 3 1 3 1 0 0 1 3 1 2 0 0 0 1 1
0 0 2 0 0 2 2 3 2 3 3 3 0 3 1 1 1 1 3 1 1 1 2 3 0 2 3 3 1 1 3
3 3 3 3 0 0 3 2 0 3 2 1 1 3 2 1 2 1 1 1 3 3 2 3 1 1 1 2 0 2 1
1 0 0 3 1 2 3 0 2 0 2 0 2 3 3 2 2 0 0 2 0 0 0 1 3 2 2 1 1 2 1
1 0 1 2 1 0 0 2 2 0 3 3 0 0 2 1 3 2 0 3 3 1 2 1 1 3 0 3 3 0 0
1 2 3 1]
Script 6:
1. #print the data points
2. plt.scatter(features[:,0], features[:,1], c= km_model.
labels_, cmap='rainbow' )
3.
4. #print the centroids
5. plt.scatter(km_model.cluster_centers_[:, 0], km_model.
cluster_centers_[:, 1], s=100, c='black')
Output:
Script 7:
1. #print actual datapoints
2. plt.scatter(features[:,0], features[:,1], c= labels,
cmap='rainbow' )
Output:
Script 8:
1. import seaborn as sns
2.
3. iris_df = sns.load_dataset("iris")
4. iris_df.head()
Output:
Script 9:
1. # dividing data into features and labels
2. features = iris_df.drop(["species"], axis = 1)
3. labels = iris_df.filter(["species"], axis = 1)
4. features.head()
Output:
Script 10:
1. # training KMeans model
2. features = features.values
3. km_model = KMeans(n_clusters=4)
4. km_model.fit(features)
Script 11:
1. print(km_model.labels_)
Output:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 2 3 2 3 2 3 3 3
3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3
3 3 3 3 2 3 3 0 2 0 0 0 0 3 0 0 0 2 2 0 2 2 0 0 0 0 2 0 2 0 2
0 0 2 2 0 0 0 0 0 2 2 0 0 0 2 0 0 0 2 0 0 0 2 2 0 2]
Script 12:
1. #print the data points
2. plt.scatter(features[:,0], features[:,1], c= km_model.
labels_, cmap='rainbow' )
3.
4. #print the centroids
5. plt.scatter(km_model.cluster_centers_[:, 0], km_model.
cluster_centers_[:, 1], s=100, c='black')
Output:
To calculate the inertia value, you can use the inertia_ attribute
of the KMeans class object. The following script computes
inertia values for K = 1 to 10 and plots them as a line plot, as
shown below:
Script 13:
1. # training KMeans on K values from 1 to 10
2. loss =[]
3. for i in range(1, 11):
4. km = KMeans(n_clusters = i).fit(features)
5. loss.append(km.inertia_)
6.
7. #printing loss against number of clusters
8.
9. import matplotlib.pyplot as plt
10. plt.plot(range(1, 11), loss)
11. plt.title('Finding Optimal Clusters via Elbow Method')
12. plt.xlabel('Number of Clusters')
13. plt.ylabel('loss')
14. plt.show()
From the output below, it can be seen that the value of inertia
didn’t decrease much after 3 clusters.
Output:
Let’s now cluster the Iris data using 3 clusters and see if we
can get close to the actual clusters.
Script 14:
1. # training KMeans with 3 clusters
2. km_model = KMeans(n_clusters=3)
3. km_model.fit(features)
Script 15:
1. #print the data points with predicted labels
2. plt.scatter(features[:,0], features[:,1], c= km_model.
labels_, cmap='rainbow' )
3.
4. #print the predicted centroids
5. plt.scatter(km_model.cluster_centers_[:, 0], km_model.
cluster_centers_[:, 1], s=100, c='black')
Output:
Let’s now plot the actual clusters and see how close the actual
clusters are to predicted clusters.
Script 16:
1. # converting categorical labels to numbers
2.
3. from sklearn import preprocessing
4. le = preprocessing.LabelEncoder()
5. labels = le.fit_transform(labels)
6.
7. #print the data points with original labels
8. plt.scatter(features[:,0], features[:,1], c= labels,
cmap='rainbow' )
The output shows that the actual clusters are pretty close to
predicted clusters.
Output:
Example 1
Script 17:
1. import numpy as np
2. import pandas as pd
3. from sklearn.datasets import make_blobs
4. from matplotlib import pyplot as plt
5. %matplotlib inline
Script 18:
1. # generating dummy data of 10 records with 2 clusters
2. features, labels = make_blobs(n_samples=10, centers=2,
cluster_std = 2.00)
3.
4. #plotting the dummy data
5. plt.scatter(features[:,0], features[:,1], color ='r' )
6.
7. #adding numbers to data points
8. annots = range(1, 11)
9. for label, x, y in zip(annots, features[:, 0],
features[:, 1]):
10. plt.annotate(
11. label,
12. xy=(x, y), xytext=(-3, 3),
13. textcoords='offset points', ha='right',
va='bottom')
14. plt.show()
Output:
Script 19:
1. from scipy.cluster.hierarchy import dendrogram, linkage
2.
3.
4. dendos = linkage(features, 'single')
5.
6. annots = range(1, 11)
7.
8. dendrogram(dendos,
9. orientation='top',
10. labels=annots,
11. distance_sort='descending',
12. show_leaf_counts=True)
13. plt.show()
Output:
From the figure above, it can be seen that points 1 and 5 are
closest to each other. Hence, a cluster is formed by connecting
these points. The cluster of 1 and 5 is closest to data point 10,
resulting in a cluster containing points 1, 5, and 10. In the same
way, the remaining clusters are formed until a big cluster is
formed.
Script 20:
1. from sklearn.cluster import AgglomerativeClustering
2.
3. # training agglomerative clustering model
4. hc_model = AgglomerativeClustering(n_
clusters=2, affinity='euclidean', linkage='ward')
5. hc_model.fit_predict(features)
Output:
array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0], dtype=int64)
Script 21:
1. #print the data points
2. plt.scatter(features[:,0], features[:,1], c= hc_model.
labels_, cmap='rainbow' )
Output:
Example 2
Script 22:
1. # generating dummy data of 500 records with 4 clusters
2. features, labels = make_blobs(n_
samples=500, centers=4, cluster_std = 2.00)
3.
4. #plotting the dummy data
5. plt.scatter(features[:,0], features[:,1] )
Output:
Script 23:
1. # performing hierarchical clustering using the
AgglomerativeClustering class
2. hc_model = AgglomerativeClustering(n_clusters=4, affinity=
'euclidean', linkage='ward')
3. hc_model.fit_predict(features)
The output shows the labels of some of the data points in our
dataset. You can see that since there are 4 clusters, there are
4 unique labels, i.e., 0, 1, 2, and 3.
Output:
array([0, 1, 1, 0, 1, 0, 3, 0, 0, 1, 0, 0, 1, 3, 0, 2, 0, 3,
       1, 0, 0, 0, ...], dtype=int64)
Script 24:
1. #print the data points
2. plt.scatter(features[:,0], features[:,1], c= hc_model.
labels_, cmap='rainbow' )
Output:
Similarly, to plot the actual clusters in the dataset (for the sake
of comparison), execute the following script.
Script 25:
1. #print actual datapoints
2. plt.scatter(features[:,0], features[:,1], c= labels,
cmap='rainbow' )
Output:
Script 26:
1. import seaborn as sns
2.
3. iris_df = sns.load_dataset("iris")
4. iris_df.head()
Output:
The following script divides the data into features and labels
sets and displays the first five rows of the feature set.
Script 27:
1. # dividing data into features and labels
2. features = iris_df.drop(["species"], axis = 1)
3. labels = iris_df.filter(["species"], axis = 1)
4. features.head()
Output:
Script 28:
1. # training Hierarchical clustering model
2. from sklearn.cluster import AgglomerativeClustering
3.
4. # training agglomerative clustering model
5. features = features.values
6. hc_model = AgglomerativeClustering(n_
clusters=3, affinity='euclidean', linkage='ward')
7. hc_model.fit_predict(features)
The output below shows the predicted cluster labels for the
feature set in the Iris dataset.
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2,
2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2,
0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2,
2, 2, 0, 2, 2, 0], dtype=int64)
Script 29:
1. #print the data points
2. plt.scatter(features[:,0], features[:,1], c= hc_model.
labels_, cmap='rainbow' )
Output:
You can also create dendrograms for the feature set using the
scipy.cluster.hierarchy module (imported below as shc). You
have to pass the feature set to the linkage() function of this
module, and the resulting linkage matrix is then passed to the
dendrogram() function, as shown in the following script.
Script 30:
1. import scipy.cluster.hierarchy as shc
2.
3. plt.figure(figsize=(10, 7))
4. plt.title("Iris Dendograms")
5. dend = shc.dendrogram(shc.linkage(features, method='ward'))
Output:
If you want to cluster the dataset into three clusters, you can
simply draw a horizontal line that passes through the three
vertical lines, as shown below. The clusters below the horizontal
line are the resultant clusters. In the following figure, we form
three clusters.
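Drawing the horizontal line can also be done programmatically. The
following is a minimal sketch (not one of the book's scripts) that
retrieves the three cluster labels directly from the same linkage
matrix, assuming features still holds the Iris feature array used
above:

from scipy.cluster.hierarchy import fcluster, linkage

linkage_matrix = linkage(features, method='ward')
# cut the dendrogram so that exactly three clusters remain
cluster_ids = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(cluster_ids)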
Exercise 8.1
Question 1
Question 2
Question 3
Exercise 8.2
Apply KMeans clustering on the banknote.csv dataset available
in the Data folder in the book resources. Find the optimal
number of clusters and then print the clustered dataset. The
following script imports the dataset and prints the first five
rows of the dataset.
9
Deep Learning with
Python TensorFlow 2.0
In this chapter, you will be using TensorFlow 2.0 and Keras API
to implement different types of neural networks in Python.
From TensorFlow 2.0, Google has officially adopted Keras as
the main API to run TensorFlow scripts.
In our neural network, we will first find the value of zh1, which
can be calculated as follows:
$z_{h1} = x_1 w_1 + x_2 w_2 + b_h$   ---------- (1)
Using zh1, we can find the value of ah1, which is:
$a_{h1} = \frac{1}{1 + e^{-z_{h1}}}$   ---------- (2)
In the same way, you find the values of ah2, ah3, and ah4.
To find the value of zo, you can use the following formula:
$z_o = a_{h1} w_9 + a_{h2} w_{10} + a_{h3} w_{11} + a_{h4} w_{12} + b_o$   ---- (3)
Finally, to find the output of the neural network ao:
$a_o = \frac{1}{1 + e^{-z_o}}$   ---------- (4)
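As a quick illustration, the four equations above can be written
directly in Python. This is a minimal sketch; the input values and
weights are arbitrary numbers chosen for demonstration, and the
biases are set to zero for brevity:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x1, x2 = 0.5, 0.8                      # two input features (assumed values)
w = [0.1 * i for i in range(1, 13)]    # illustrative values for w1..w12

zh = [x1 * w[2 * i] + x2 * w[2 * i + 1] for i in range(4)]  # equation (1) for zh1..zh4
ah = [sigmoid(z) for z in zh]                               # equation (2)
zo = sum(a * wi for a, wi in zip(ah, w[8:12]))              # equation (3)
ao = sigmoid(zo)                                            # equation (4)
print(ao)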
9.1.2. Backpropagation
The purpose of backpropagation is to minimize the overall loss
by finding the optimum values of weights. The loss function
we are going to use in this section is the mean squared error,
which in our case is represented as:

$cost = \frac{1}{n}\sum_{i=1}^{n}(a_o - y)^2$

where y is the actual output and n is the number of training
samples.
Our weights are divided into two parts. We have weights that
connect input features to the hidden layer and the hidden
layer to the output node. We call the weights that connect
the input to the hidden layer collectively as wh (w1, w2, w3
…… w8), and the weights connecting the hidden layer to the
output as wo (w9, w10, w11, w12).
To find the derivative of the cost with respect to the output
layer weights wo, we apply the chain rule:

$\frac{dcost}{dw_o} = \frac{dcost}{da_o} \cdot \frac{da_o}{dz_o} \cdot \frac{dz_o}{dw_o}$   ------ (5)

where:

$\frac{dcost}{da_o} = \frac{2}{n}(a_o - y)$   ------- (6)

$\frac{da_o}{dz_o} = sigmoid(z_o)(1 - sigmoid(z_o))$   --- (7)

$\frac{dz_o}{dw_o} = a_h$   ----- (8)
In the same way, you find the derivative of cost with respect
to the bias in the output layer, i.e., dcost/dbo. Since dzo/dbo = 1,
it is given as:

$\frac{dcost}{db_o} = \frac{dcost}{da_o} \cdot \frac{da_o}{dz_o}$   ..... (9)

For the weights wh connecting the input to the hidden layer,
we first need the derivative of cost with respect to the hidden
activations:

$\frac{dcost}{da_h} = \frac{dcost}{da_o} \cdot \frac{da_o}{dz_o} \cdot \frac{dz_o}{da_h}$   ....... (10)

The values of dcost/dao and dao/dzo can be calculated from
equations 6 and 7, respectively. The value of dzo/dah is given
as:

$\frac{dz_o}{da_h} = w_o$   ...... (11)

Putting the values of equations 6, 7, and 11 in equation 10, you
can get the value of dcost/dah. Furthermore,

$\frac{da_h}{dz_h} = sigmoid(z_h)(1 - sigmoid(z_h))$   ..... (12)

and,

$\frac{dz_h}{dw_h} = x$, the input features   ...... (13)

Using equations 10, 12, and 13, you can find the value of
dcost/dwh:

$\frac{dcost}{dw_h} = \frac{dcost}{da_h} \cdot \frac{da_h}{dz_h} \cdot \frac{dz_h}{dw_h}$
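Equations 5 through 13 translate into a short, vectorized NumPy
sketch of one gradient descent step. The dummy data, the learning
rate, and the zero biases below are assumptions made purely for
illustration:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

np.random.seed(42)
X = np.random.rand(10, 2)               # 10 samples, 2 input features (assumed)
y = np.random.randint(0, 2, (10, 1))    # dummy binary targets
wh = np.random.rand(2, 4)               # w1..w8: input to hidden layer
wo = np.random.rand(4, 1)               # w9..w12: hidden layer to output
lr = 0.5                                # learning rate (assumed)

# forward pass (equations 1-4)
zh = X @ wh
ah = sigmoid(zh)
zo = ah @ wo
ao = sigmoid(zo)

# backward pass
dcost_dao = (2 / len(X)) * (ao - y)          # equation 6
dao_dzo = sigmoid(zo) * (1 - sigmoid(zo))    # equation 7
dcost_dwo = ah.T @ (dcost_dao * dao_dzo)     # equations 5 and 8

dcost_dah = (dcost_dao * dao_dzo) @ wo.T     # equations 10 and 11
dah_dzh = sigmoid(zh) * (1 - sigmoid(zh))    # equation 12
dcost_dwh = X.T @ (dcost_dah * dah_dzh)      # chain rule for dcost/dwh (eqs. 12, 13)

# gradient descent update of both weight sets
wo -= lr * dcost_dwo
wh -= lr * dcost_dwh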
Script 1:
pip install --upgrade tensorflow
Script 2:
1. import tensorflow as tf
2. print(tf.__version__)
Output:
2.1.0
Script 3:
1. import seaborn as sns
2. import pandas as pd
3. import numpy as np
4. from tensorflow.keras.layers import Dense, Dropout, Activation
5. from tensorflow.keras.models import Model, Sequential
6. from tensorflow.keras.optimizers import Adam
Script 4:
1. # reading data from CSV File
2. banknote_data = pd.read_csv("https://raw.githubusercontent.
com/AbhiRoy96/Banknote-Authentication-UCI-Dataset/master/
bank_notes.csv")
The following script displays the first five rows of the dataset.
Script 5:
1. banknote_data.head()
Output:
The output shows that our dataset contains five columns. Let’s
see the shape of our dataset.
Script 6:
1. banknote_data.shape
The output shows that our dataset has 1372 rows and 5
columns.
Output:
(1372, 5)
Script 7:
1. sns.countplot(x='Target', data=banknote_data)
Output:
The task is to predict the values for the “Target” column, based
on the values in the first four columns. Let’s divide our data
into features and target labels.
Script 8:
1. X = banknote_data.drop(['Target'], axis=1).values
2. y = banknote_data[['Target']].values
3.
4. print(X.shape)
5. print(y.shape)
Output:
(1372, 4)
(1372, 1)
Script 9:
1. from sklearn.model_selection import train_test_split
2. X_train, X_test, y_train, y_test = train_test_
split(X, y, test_size=0.20, random_state=42)
Script 10:
1. from sklearn.preprocessing import StandardScaler
2. sc = StandardScaler()
3. X_train = sc.fit_transform(X_train)
4. X_test = sc.transform(X_test)
Script 11:
def create_model(learning_rate, dropout_rate):
    #create sequential model
    model = Sequential()
    #adding dense layers
    model.add(Dense(12, input_dim=X_train.shape[1], activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(6, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    #compiling the model
    adam = Adam(lr=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model
Script 12:
1. # the learning rate and dropout rate are not specified in the
2. # text; the values below are example values
3. learning_rate = 0.001
4. dropout_rate = 0.2
5. model = create_model(learning_rate, dropout_rate)
Script 13:
1. from tensorflow.keras.utils import plot_model
2. plot_model(model, to_file='model_plot1.png', show_
shapes=True, show_layer_names=True)
Output:
From the above output, you can see that the input layer
contains four nodes. The input to the first dense layer is 4,
while the output is 12. Similarly, the input to the second dense
layer is 12, while the output is 6. Finally, in the last dense layer,
the input is 6 nodes, while the output is 1 since we are making
a binary classification. Also, you can see a dropout layer after
each dense layer.
To train the model, you need to call the fit method on the
model object. The fit method takes the training features and
targets as parameters, along with the batch size, the number
of epochs, and the validation split. The validation split
specifies the fraction of the training data that is held out to
validate the model at the end of each epoch.
Script 14:
1. # the batch size and number of epochs are not specified in
2. # the text; the values below are example values
3. batch_size = 16
4. epochs = 100
5. model_history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2, verbose=1)
Output:
Script 15:
1. accuracies = model.evaluate(X_test, y_test, verbose=1)
2. print("Test Score:", accuracies[0])
3. print("Test Accuracy:", accuracies[1])
Output:
275/275 [==============================] - 0s 374us/sample -
loss: 0.0040 - accuracy: 1.0000
Test Score: 0.00397354013286531
Test Accuracy: 1.0
Let’s now plot the accuracy on the training and test sets to see
if our model is overfitting or not.
Script 16:
1. import matplotlib.pyplot as plt
2. plt.plot(model_history.history['accuracy'], label = 'accuracy')
3. plt.plot(model_history.history['val_accuracy'], label = 'val_accuracy')
4. plt.legend(['train','test'], loc='lower left')
Output:
Both curves converge near 1 and then remain stable, which
shows that our model is not overfitting.
Similarly, the loss values for test and training sets can be
printed as follows:
Script 17:
1. plt.plot(model_history.history['loss'], label = 'loss')
2. plt.plot(model_history.history['val_loss'], label = 'val_
loss')
3. plt.legend(['train','test'], loc='upper left')
Output:
§§ What Is an RNN?
A recurrent neural network is a type of neural network that is
used to process data that is sequential in nature, e.g., stock
price data, text sentences, or sales of items.
Here, we have a single neuron with one input and one output.
On the right side of the figure, the process followed by a
recurrent neural network is unfolded through time. You can
see that at time step t, the input X is multiplied by the weight
vector U, while the previous output at time step t-1, i.e., S(t-1),
is multiplied by the weight vector W. The sum of the two
products, XU + S(t-1)W, becomes the output S(t) at time step t.
This is how a recurrent neural network captures the sequential
information.
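The recurrence above can be sketched in a few lines of NumPy. This
is a minimal illustration, not one of the book's scripts; the
dimensions, the random inputs, and the tanh nonlinearity are
assumptions:

import numpy as np

np.random.seed(0)
U = np.random.rand(3, 5)         # input-to-state weights (assumed shape)
W = np.random.rand(5, 5)         # state-to-state weights
S = np.zeros(5)                  # initial state S(0)

sequence = np.random.rand(4, 3)  # a toy sequence of 4 time steps
for X_t in sequence:
    # S(t) = f(X(t)U + S(t-1)W), here with f = tanh
    S = np.tanh(X_t @ U + S @ W)
print(S)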
For a short sequence, an RNN can easily guess a missing word
such as "clouds." However, when the relevant context appears
much earlier, the RNN can only guess that the missing word is
"French" if it remembers the first sentence, i.e., "Mike grew up
in France."
§§ What Is an LSTM?
LSTM is a type of RNN which is capable of remembering
longer sequences, and hence, it is one of the most commonly
used RNN variants for sequence tasks.
§§ Cell State
The cell state in LSTM is responsible for remembering a long
sequence. The following figure describes the cell state:
The cell state contains data from all the previous cells in
the sequence. The LSTM is capable of adding or removing
information to a cell state. In other words, LSTM tells the cell
state which part of previous information to remember and
which information to forget.
§§ Forget Gate
The forget gate basically tells the cell state which information
to retain from the information in the previous step and which
information to forget. The working and calculation formula for
the forget gate is as follows:
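The accompanying figure is not reproduced here; in the standard
LSTM formulation (an assumption, since the book's figure is
unavailable), the forget gate computes:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

where $h_{t-1}$ is the previous hidden state, $x_t$ is the current
input, and the sigmoid output $f_t$ (between 0 and 1) scales how
much of the previous cell state is kept.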
§§ Input Gate
The forget gate is used to decide which information to
remember or forget. The input gate is responsible for updating
or adding any new information in the cell state. The input gate
has two parts: an input layer, which decides which part of the
cell state is to be updated, and a tanh layer, which actually
creates a vector of new values that are added or replaced in
the cell state. The working of the input gate is explained in the
following figure:
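In standard notation (again assuming the usual LSTM formulation in
place of the missing figure), the two parts of the input gate
compute:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

where $i_t$ selects which entries of the cell state to update and
$\tilde{C}_t$ is the vector of candidate values.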
§§ Update Gate
The forget gate tells us what to forget, and the input gate tells
us what to add to the cell state. The next step is to actually
perform these two operations. The update gate is basically
used to perform these two operations. The functioning and
the equations for the update gate are as follows:
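In the same standard notation, the updated cell state combines the
forget gate and input gate results:

$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$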
§§ Output Gate
Finally, you have the output gate, which outputs the hidden
state and the output, just like a common recurrent neural
network. The additional output from an LSTM node is a cell
state, which runs between all the nodes in a sequence. The
equations and the functioning of the output gate are depicted
by the following figure:
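In the standard formulation, the output gate computes:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

$h_t = o_t * \tanh(C_t)$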
In the following sections, you will see how to use LSTM for
solving different types of sequence problems. The training data
consists of Facebook's historical stock prices, which can be
downloaded from the following Yahoo Finance link:
https://finance.yahoo.com/quote/FB/history?p=FB.
The test data will consist of the opening stock prices of the
Facebook company for the month of January 2020. The
training file fb_train.csv and the test file fb_test.csv are also
available in the Datasets folder in the GitHub repository. Let’s
begin with the coding now.
Script 18:
pip install --upgrade tensorflow
Script 19:
1. # mounting google drive
2. from google.colab import drive
3. drive.mount('/gdrive')
Script 20:
1. # importing libraries
2. import pandas as pd
3. import numpy as np
4.
5. #importing dataset
6. fb_complete_data = pd.read_csv("/gdrive/My Drive/datasets/
fb_train.csv")
Running the following script will print the first five rows of the
dataset.
Script 21:
1. #printing dataset header
2. fb_complete_data.head()
Output:
Script 22:
1. #filtering open column
2. fb_training_processed = fb_complete_data[['Open']].values
Script 23:
1. #scaling features
2. from sklearn.preprocessing import MinMaxScaler
3. scaler = MinMaxScaler(feature_range = (0, 1))
4.
5. fb_training_scaled = scaler.fit_transform(fb_training_
processed)
If you check the total length of the dataset, you will see it has
1257 records, as shown below:
Script 24:
1. len(fb_training_scaled)
Output:
1257
Script 25:
1. #training features contain data of last 60 days
2. #training labels contain data of 61st day
3.
4. fb_training_features= []
5. fb_training_labels = []
6. for i in range(60, len(fb_training_scaled)):
7. fb_training_features.append(fb_training_scaled[i-
60:i, 0])
8. fb_training_labels.append(fb_training_scaled[i, 0])
Script 26:
1. #converting training data to numpy arrays
2. X_train = np.array(fb_training_features)
3. y_train = np.array(fb_training_labels)
Script 27:
1. print(X_train.shape)
2. print(y_train.shape)
Output:
(1197, 60)
(1197,)
Script 28:
1. #converting data into 3D shape
2. X_train = np.reshape(X_train, (X_train.shape[0], X_train.
shape[1], 1))
Script 29:
1. #importing libraries
2. import numpy as np
3. import matplotlib.pyplot as plt
4. from tensorflow.keras.layers import Input, Activation, Dense, Flatten, Dropout, LSTM
5. from tensorflow.keras.models import Model
Script 30:
1. #defining the LSTM network
2.
3. input_layer = Input(shape = (X_train.shape[1], 1))
4. lstm1 = LSTM(100, activation='relu', return_
sequences=True)(input_layer)
5. do1 = Dropout(0.2)(lstm1)
6. lstm2 = LSTM(100, activation='relu', return_
sequences=True)(do1)
7. do2 = Dropout(0.2)(lstm2)
8. lstm3 = LSTM(100, activation='relu', return_
sequences=True)(do2)
9. do3 = Dropout(0.2)(lstm3)
10. lstm4 = LSTM(100, activation='relu')(do3)
11. do4 = Dropout(0.2)(lstm4)
12.
13. output_layer = Dense(1)(do4)
14. model = Model(input_layer, output_layer)
15. model.compile(optimizer='adam', loss='mse')
Script 31:
1. print(X_train.shape)
2. print(y_train.shape)
3. y_train= y_train.reshape(-1,1)
4. print(y_train.shape)
Output:
(1197, 60, 1)
(1197,)
(1197, 1)
Script 32:
1. #training the model
2. model_history = model.fit(X_train, y_
train, epochs=100, verbose=1, batch_size = 32)
You can see the results for the last five epochs in the output.
Output:
Epoch 96/100
38/38 [==============================] - 11s 299ms/step -
loss: 0.0018
Epoch 97/100
38/38 [==============================] - 11s 294ms/step -
loss: 0.0019
Epoch 98/100
38/38 [==============================] - 11s 299ms/step -
loss: 0.0018
Epoch 99/100
38/38 [==============================] - 12s 304ms/step -
loss: 0.0018
Epoch 100/100
38/38 [==============================] - 11s 299ms/step -
loss: 0.0021
Our model has been trained. Next, we will test our stock
prediction model on the test data.
Script 33:
1. #creating test set
2. fb_testing_complete_data = pd.read_csv("/gdrive/My Drive/
datasets/fb_test.csv")
3. fb_testing_processed = fb_testing_complete_data[['Open']].
values
Script 34:
1. fb_all_data = pd.concat((fb_complete_data['Open'], fb_
testing_complete_data['Open']), axis=0)
Script 35:
1. test_inputs = fb_all_data[len(fb_all_data) - len(fb_testing_complete_data) - 60:].values
2. print(test_inputs.shape)
2. print(test_inputs.shape)
You can see that the length of the input data is 80. Here, the
first 60 records are the last 60 records from the training data,
and the last 20 records are the 20 records from the test file.
Output:
(80,)
Script 36:
1. test_inputs = test_inputs.reshape(-1,1)
2. test_inputs = scaler.transform(test_inputs)
3. print(test_inputs.shape)
Output:
(80, 1)
Script 37:
1. fb_test_features = []
2. for i in range(60, 80):
3. fb_test_features.append(test_inputs[i-60:i, 0])
Script 38:
1. X_test = np.array(fb_test_features)
2. print(X_test.shape)
Output:
(20, 60)
Script 39:
1. #converting test data into 3D shape
2. X_test = np.reshape(X_test, (X_test.shape[0], X_test.
shape[1], 1))
3. print(X_test.shape)
Output:
(20, 60, 1)
Script 40:
1. #making predictions on test set
2. y_pred = model.predict(X_test)
Script 41:
1. #converting scaled data back to original data
2. y_pred = scaler.inverse_transform(y_pred)
Script 42:
1. #plotting original and predicted stock values
2. plt.figure(figsize=(8,6))
3. plt.plot(fb_testing_processed, color='red', label='Actual Facebook Stock Price')
4. plt.plot(y_pred, color='green', label='Predicted Facebook Stock Price')
5. plt.title('Facebook Stock Prices')
6. plt.xlabel('Date')
7. plt.ylabel('Stock Price')
8. plt.legend()
9. plt.show()
Output:
The output shows that our algorithm has been able to partially
capture the trend of the future opening stock prices for
Facebook data.
Here, the box on the left is what humans see: a smiling face.
A computer, however, sees it in the form of pixel values of 0s
and 1s, as shown on the right-hand side. In this illustration, 0
indicates a white pixel, whereas 1 indicates a black pixel. In
real grayscale images, the convention is usually the reverse:
higher pixel values represent brighter pixels.
The two grids below show a feature map before and after
applying the ReLU activation function, which replaces every
negative value with zero:

Before ReLU:
-4  2  1 -2
 1 -1  8  0
 3 -3  1  4
 1  0  1 -2

After ReLU:
 0  2  1  0
 1  0  8  0
 3  0  1  4
 1  0  1  0
Now look at the "After ReLU" grid above. In its 3rd and 4th
rows and 3rd and 4th columns, we have the four values 1, 4, 1,
and 0. When we apply max pooling on these four pixels, the
maximum value will be chosen, i.e., you can see 4 in the pooled
feature map.
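The same operation can be verified with a few lines of NumPy.
This is an illustrative sketch (not one of the book's scripts)
using the ReLU-activated feature map shown above, with a 2 x 2
pooling window and a stride of 2:

import numpy as np

feature_map = np.array([[0, 2, 1, 0],
                        [1, 0, 8, 0],
                        [3, 0, 1, 4],
                        [1, 0, 1, 0]])

# split the 4x4 map into 2x2 blocks and take the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[2 8]
#  [3 4]]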
Execute the following script to make sure that you are running
the latest version of TensorFlow.
Script 43:
1. pip install --upgrade tensorflow
2.
3. import tensorflow as tf
4. print(tf.__version__)
Output:
2.3.0
Script 44:
1. #importing required libraries
2. import numpy as np
3. import matplotlib.pyplot as plt
4. from tensorflow.keras.layers import Input, Conv2D, Dense,
Flatten, Dropout, MaxPool2D
5. from tensorflow.keras.models import Model
Script 45:
1. #importing the fashion MNIST dataset
2. mnist_data = tf.keras.datasets.fashion_mnist
3.
4. #dividing data into training and test sets
5. (training_images, training_labels), (test_images, test_labels) = mnist_data.load_data()
Script 46:
1. #scaling images
2. training_images, test_images = training_images/255.0, test_
images/255.0
Script 47:
1. print(training_images.shape)
Output:
(60000, 28, 28)
Script 48:
1. #plotting image number 9 from test set
2. plt.figure()
3. plt.imshow(test_images[9])
4. plt.colorbar()
5. plt.grid(False)
6. plt.show()
Output:
The output shows that the 9th image in our test set is the image
of a sneaker.
Script 49:
1. #converting data into the right shape
2. training_images = np.expand_dims(training_images, -1)
3. test_images = np.expand_dims(test_images, -1)
4. print(training_images.shape)
Output:
(60000, 28, 28, 1)
Script 50:
1. #printing number of output classes
2. output_classes = len(set(training_labels))
3. print("Number of output classes is: ", output_classes)
Output:
Number of output classes is: 10
Script 51:
1. training_images[0].shape
Output:
(28, 28, 1)
The shape of a single image is (28, 28, 1). This shape will be
used to train our convolutional neural network. The following
script creates a model for our convolutional neural network.
Script 52:
1. #Developing the CNN model
2.
3. input_layer = Input(shape = training_images[0].shape )
4. conv1 = Conv2D(32, (3,3), strides = 2, activation= 'relu')
(input_layer)
5. maxpool1 = MaxPool2D(2, 2)(conv1)
6. conv2 = Conv2D(64, (3,3), strides = 2, activation= 'relu')
(maxpool1)
7. #conv3 = Conv2D(128, (3,3), strides = 2, activation=
'relu')(conv2)
8. flat1 = Flatten()(conv2)
9. drop1 = Dropout(0.2)(flat1)
10. dense1 = Dense(512, activation = 'relu')(drop1)
11. drop2 = Dropout(0.2)(dense1)
12. output_layer = Dense(output_classes, activation=
'softmax')(drop2)
13.
14. model = Model(input_layer, output_layer)
Script 53:
1. #compiling the CNN model
2. model.compile(optimizer = 'adam', loss= 'sparse_
categorical_crossentropy', metrics =['accuracy'])
Script 54:
1. from tensorflow.keras.utils import plot_model
2. plot_model(model, to_file='model_plot1.png', show_
shapes=True, show_layer_names=True)
Output:
The following script trains the image classification model.
Script 55:
1. #training the CNN model
2. model_history = model.fit(training_images, training_
labels, epochs=20, validation_data=(test_images, test_
labels), verbose=1)
The results from the last five epochs are shown in the output.
Output:
Let’s plot the training and test accuracies for our model.
Script 56:
1. #plotting accuracy
2. import matplotlib.pyplot as plt
3.
4. plt.plot(model_history.
history['accuracy'], label = 'accuracy')
5. plt.plot(model_history.history['val_
accuracy'], label = 'val_accuracy')
6. plt.legend(['train','test'], loc='lower left')
Output:
Script 57:
1. #making predictions on a single image
2. output = model.predict(test_images)
3. prediction = np.argmax(output[9])
4. print(prediction)
Output:
7
Exercise 9.1
Question 1
Question 2
Question 3
Exercise 9.2
Using the CIFAR-10 image dataset, perform image classification
to recognize the image. Here is the dataset:
1. cifar_dataset = tf.keras.datasets.cifar10
10
Dimensionality Reduction
with PCA and LDA
Using Sklearn
§§ Disadvantages of PCA
There are three major disadvantages of PCA:
1. You need to standardize the data before you apply PCA.
2. The principal components are less interpretable than the
original independent variables.
3. Some amount of information is lost when you reduce the
number of features.
Script 1:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
The following script imports the Iris dataset using the Seaborn
library and prints the first five rows of the dataset.
Script 2:
1. #importing the dataset
2. iris_df = sns.load_dataset("iris")
3.
4. #print dataset header
5. iris_df.head()
Output:
The above output shows that the dataset contains four features
(sepal_length, sepal_width, petal_length, and petal_width) and
one output label, i.e., species. For PCA, we will only use the
feature set.
The following script divides the data into the features and
labels sets.
Script 3:
1. #creating feature set
2. X = iris_df.drop(['species'], axis=1)
3.
4.
5. #creating label set
6. y = iris_df["species"]
7.
8. #converting labels to numbers
9. from sklearn import preprocessing
10. le = preprocessing.LabelEncoder()
11. y = le.fit_transform(y)
Script 4:
1. #dividing data into 80-20% training and test sets
2. from sklearn.model_selection import train_test_split
3.
4. X_train, X_test, y_train, y_test = train_test_
split(X, y, test_size=0.20, random_state=0)
Finally, both the training and test sets should be scaled before
PCA can be applied to them.
Script 5:
1. #applying scaling on training and test data
2. from sklearn.preprocessing import StandardScaler
3. sc = StandardScaler()
4. X_train = sc.fit_transform(X_train)
5. X_test = sc.transform(X_test)
To apply PCA, you first call the fit_transform() method of the
PCA class object on the training set. To apply PCA on the test
set, pass the test set to the transform() method of the PCA
class object. This is shown in the following script.
Script 6:
1. #importing PCA class
2. from sklearn.decomposition import PCA
3.
4. #creating object of the PCA class
5. pca = PCA()
6.
7. #training PCA model on training data
8. X_train = pca.fit_transform(X_train)
9.
10. #making predictions on test data
11. X_test = pca.transform(X_test)
Once you have applied PCA on a dataset, you can use the
explained_variance_ratio_ attribute to print the fraction of
the total variance explained by each principal component. This
is shown in the following script:
Script 7:
1. #printing variance ratios
2. variance_ratios = pca.explained_variance_ratio_
3. print(variance_ratios)
Output:
[0.72229951 0.2397406 0.03335483 0.00460506]
Script 8:
1. #use two principal components
2. from sklearn.decomposition import PCA
3.
4. pca = PCA(n_components=2)
5. X_train = pca.fit_transform(X_train)
6. X_test = pca.transform(X_test)
Script 9:
1. #making predictions using logistic regression
2. from sklearn.linear_model import LogisticRegression
3.
4. #training the logistic regression model
5. lg = LogisticRegression()
6. lg.fit(X_train, y_train)
7.
8.
9. # Predicting the Test set results
10. y_pred = lg.predict(X_test)
11.
12. #evaluating results
13.
14. from sklearn.metrics import accuracy_score
15.
16. print(accuracy_score(y_test, y_pred))
Output:
0.8666666666666667
The output shows that even with two features, the accuracy
for correctly predicting the label for the iris plant is 86.66
percent.
Finally, with two features, you can easily visualize the dataset
using the following script.
Script 10:
1. from matplotlib import pyplot as plt
2. %matplotlib inline
3.
4. #print actual datapoints
5.
6. plt.scatter(X_test[:,0], X_test[:,1], c= y_
test, cmap='rainbow' )
Output:
§§ Disadvantages of LDA
There are three major disadvantages of LDA:
1. Not able to detect correlated features
2. Cannot be used with unsupervised or unlabeled data
3. Some amount of information is lost when you reduce
features.
Script 11:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
Script 12:
1. #importing dataset
2. banknote_df = pd.read_csv(r"E:\Hands on Python for Data
Science and Machine Learning\Datasets\banknote.csv")
3.
4. #displaying dataset header
5. banknote_df.head()
Output:
Script 13:
1. # dividing data into features and labels
2. X = banknote_df.drop(["class"], axis = 1)
3. y = banknote_df.filter(["class"], axis = 1)
Finally, the following script divides the data into training and
test sets.
Script 14:
1. #dividing data into 80-20% training and test sets
2. from sklearn.model_selection import train_test_split
3.
4. X_train, X_test, y_train, y_test = train_test_
split(X, y, test_size=0.20, random_state=0)
Like PCA, you need to scale the data before you can apply
LDA on it. The data scaling is performed in the following step.
Script 15:
1. #applying scaling on training and test data
2. from sklearn.preprocessing import StandardScaler
3. sc = StandardScaler()
4. X_train = sc.fit_transform(X_train)
5. X_test = sc.transform(X_test)
Script 16:
1. #importing LDA class
2. from sklearn.discriminant_
analysis import LinearDiscriminantAnalysis as LDA
3.
4.
5. #creating object of the LDA class
6. lda = LDA()
7.
8. #training LDA model on training data
9. X_train = lda.fit_transform(X_train, y_train)
10.
11. #making predictions on test data
12. X_test = lda.transform(X_test)
Like PCA, you can find variance ratios for LDA using the
explained_variance_ratio_ attribute.
Script 17:
1. #printing variance ratios
2. variance_ratios = lda.explained_variance_ratio_
3. print(variance_ratios)
Output:
[1.]
The above output shows that a single component captures
100 percent of the variance.
Script 18:
1. #creating object of the LDA class
2. lda = LDA(n_components = 1)
3.
4. #training LDA model on training data
5. X_train = lda.fit_transform(X_train, y_train)
6.
7. #making predictions on test data
8. X_test = lda.transform(X_test)
Script 19:
1. #making predictions using logistic regression
2. from sklearn.linear_model import LogisticRegression
3.
4. #training the logistic regression model
5. lg = LogisticRegression()
6. lg.fit(X_train, y_train)
7.
8.
9. # Predicting the Test set results
10. y_pred = lg.predict(X_test)
11.
12. #evaluating results
13. from sklearn.metrics import accuracy_score
14.
15. print(accuracy_score(y_test, y_pred))
Output:
0.9890909090909091
The output shows that even with a single feature, we are able
to correctly predict whether or not a banknote is fake with
98.90 percent accuracy.
Exercise 10.1
Question 1
Question 2
Question 3
Exercise 10.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the Data
folder in the book resources. Print the accuracy using the two
principal components. Also, plot the results on the test set
using the two principal components.
Exercises Solutions
Exercise 2.1
Question 1
Answer: A
Question 2
Answer: C
Question 3
Answer: D
Exercise 2.2
Print the table of integer 9 using a while loop:
1. j=1
2. while j< 11:
3. print("9 x "+str(j)+ " = "+ str(9*j))
4. j=j+1
Exercise 3.1
Question 1:
Answer: B
Question 2:
Answer: C
Question 3:
Answer: D
Exercise 3.2
Create a random NumPy array of five rows and four columns.
Using array indexing and slicing, display the items from row
three to end and column two to end.
Solution:
1. import numpy as np
2.
3. uniform_random = np.random.rand(5, 4)
4. print(uniform_random)
5. print("Result")
6. print(uniform_random[2:, 1:])
Exercise 4.1
Question 1
Answer: B
Question 2
Answer: C
Question 3
Answer: A
Exercise 4.2
Use the apply function to subtract 10 from the Fare column of
the Titanic dataset, without using the lambda expression.
Solution:
1. def subt(x):
2. return x - 10
3.
4. updated_class = titanic_data.Fare.apply(subt)
5. updated_class.head()
Exercise 5.1
Question 1
Answer: B
Question 2:
Answer: B
Question 3:
Answer: A
Exercise 5.2
Plot two scatter plots on the same graph using the tips_
dataset. In the first scatter plot, display values from the
total_bill column on the x-axis and from the tip column on the
y-axis. The color of the first scatter plot should be green. In the
second scatter plot, display values from the total_bill column
on the x-axis and from the size column on the y-axis. The color
of the second scatter plot should be blue, and the markers
should be x.
Solution:
1. sns.scatterplot(x="total_bill", y="tip", data=tips_data,
color = 'g')
2. sns.scatterplot(x="total_bill", y="size", data=tips_data,
color = 'b', marker = 'x')
Output:
Exercise 6.1
Question 1
Answer: C
Question 2
Answer: B
Question 3
Answer: D
Exercise 6.2
Using the Diamonds dataset from the Seaborn library, train a
regression algorithm of your choice, which predicts the price
of the diamond. Perform all the preprocessing steps.
Solution:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4.
5. diamonds_df = sns.load_dataset("diamonds")
6.
7. X = diamonds_df.drop(['price'], axis=1)
8. y = diamonds_df["price"]
9.
10. numerical = X.drop(['cut', 'color', 'clarity'], axis = 1)
11.
12. categorical = X.filter(['cut', 'color', 'clarity'])
13.
14. cat_numerical = pd.get_dummies(categorical,drop_first=True)
15.
16. X = pd.concat([numerical, cat_numerical], axis = 1)
17.
18. from sklearn.model_selection import train_test_split
19.
20. X_train, X_test, y_train, y_test = train_test_
split(X, y, test_size=0.20, random_state=0)
21.
22. from sklearn.preprocessing import StandardScaler
23. sc = StandardScaler()
24. X_train = sc.fit_transform(X_train)
25. X_test = sc.transform(X_test)
26.
27. from sklearn import svm
28. svm_reg = svm.SVR()
29. regressor = svm_reg.fit(X_train, y_train)
30. y_pred = regressor.predict(X_test)
31.
32.
33.
34. from sklearn import metrics
35.
36. print('Mean Absolute Error:', metrics.mean_absolute_
error(y_test, y_pred))
37. print('Mean Squared Error:', metrics.mean_squared_error(y_
test, y_pred))
38. print('Root Mean Squared Error:', np.sqrt(metrics.mean_
squared_error(y_test, y_pred)))
Exercise 7.1
Question 1
Answer: D
Question 2
Answer: C
Question 3
Answer: B
Exercise 7.2
Using the iris dataset from the Seaborn library, train a
classification algorithm of your choice, which predicts the
species of the iris plant. Perform all the preprocessing steps.
Solution:
1. import pandas as pd
2. import numpy as np
3. import seaborn as sns
4.
5. iris_df = sns.load_dataset("iris")
6.
7. iris_df.head()
8.
9. X = iris_df.drop(['species'], axis=1)
10. y = iris_df["species"]
11.
12.
13. from sklearn.model_selection import train_test_split
14.
15. X_train, X_test, y_train, y_test = train_test_
split(X, y, test_size=0.20, random_state=0)
16.
17. from sklearn.preprocessing import StandardScaler
18. sc = StandardScaler()
19. X_train = sc.fit_transform(X_train)
20. X_test = sc.transform(X_test)
21.
22. from sklearn.ensemble import RandomForestClassifier
23. rf_clf = RandomForestClassifier(random_state=42, n_
estimators=500)
24.
25. classifier = rf_clf.fit(X_train, y_train)
26.
27. y_pred = classifier.predict(X_test)
28.
29.
30. from sklearn.metrics import classification_
report, confusion_matrix, accuracy_score
31.
32. print(confusion_matrix(y_test,y_pred))
33. print(classification_report(y_test,y_pred))
34. print(accuracy_score(y_test, y_pred))
Exercise 8.1
Question 1
Answer: D
Question 2
Answer: C
Question 3
Answer: B
Exercise 8.2
Apply KMeans clustering on the banknote.csv dataset
available in the Data folder in the book resources. Find the
optimal number of clusters and then print the clustered
dataset. The following script imports the dataset and prints
the first five rows of the dataset.
1. banknote_df = pd.read_csv(r"E:\
Hands on Python for Data Science and Machine Learning\
Datasets\banknote.csv")
2. banknote_df.head()
3.
4. ### Solution:
5.
6. # dividing data into features and labels
7. features = banknote_df.drop(["class"], axis = 1)
8. labels = banknote_df.filter(["class"], axis = 1)
9. features.head()
10.
11. # training KMeans on K values from 1 to 10
12. loss =[]
13. for i in range(1, 11):
14. km = KMeans(n_clusters = i).fit(features)
15. loss.append(km.inertia_)
16.
17. #printing loss against number of clusters
18.
19. import matplotlib.pyplot as plt
20. plt.plot(range(1, 11), loss)
21. plt.title('Finding Optimal Clusters via Elbow Method')
22. plt.xlabel('Number of Clusters')
23. plt.ylabel('loss')
24. plt.show()
25.
26. # training KMeans with 2 clusters
27. features = features.values
28. km_model = KMeans(n_clusters=2)
29. km_model.fit(features)
30.
31. #print the data points with predicted labels
32. plt.scatter(features[:,0], features[:,1], c= km_model.
labels_, cmap='rainbow' )
33.
34. #print the predicted centroids
35. plt.scatter(km_model.cluster_centers_[:, 0], km_model.
cluster_centers_[:, 1], s=100, c='black')
Exercise 9.1
Question 1
Answer: (D)
Question 2:
Answer (C)
Question 3
Answer: (B)
Exercise 9.2
Using the CIFAR-10 image dataset, perform image classification
to recognize the image. Here is the dataset:
1. cifar_dataset = tf.keras.datasets.cifar10
Solution:
1. #importing required libraries
2. import numpy as np
3. import matplotlib.pyplot as plt
4. from tensorflow.keras.layers import Input, Conv2D, Dense,
Flatten, Dropout, MaxPool2D
5. from tensorflow.keras.models import Model
6.
7.
8. (training_images, training_labels), (test_images, test_
labels) = cifar_dataset.load_data()
9.
10. training_images, test_images = training_images/255.0,
test_images/255.0
11.
12. training_labels, test_labels = training_labels.flatten(),
test_labels.flatten()
13. print(training_labels.shape)
14. print(training_images.shape)
15. output_classes = len(set(training_labels))
16. print("Number of output classes is: ", output_classes)
17. input_layer = Input(shape = training_images[0].shape )
18. conv1 = Conv2D(32, (3,3), strides = 2, activation= 'relu')
(input_layer)
19. maxpool1 = MaxPool2D(2, 2)(conv1)
20. conv2 = Conv2D(64, (3,3), strides = 2, activation= 'relu')
(maxpool1)
21. #conv3 = Conv2D(128, (3,3), strides = 2, activation=
'relu')(conv2)
22. flat1 = Flatten()(conv2)
23. drop1 = Dropout(0.2)(flat1)
24. dense1 = Dense(512, activation = 'relu')(drop1)
25. drop2 = Dropout(0.2)(dense1)
26. output_layer = Dense(output_classes, activation=
'softmax')(drop2)
27.
28. model = Model(input_layer, output_layer)
29. model.compile(optimizer = 'adam', loss= 'sparse_
categorical_crossentropy', metrics =['accuracy'])
30. model_history = model.fit(training_images, training_labels,
epochs=20, validation_data=(test_images, test_labels),
verbose=1)
Exercise 10.1
Question 1
Answer: C
Question 2
Answer: A
Question 3
Answer: C
Exercise 10.2
Apply principal component analysis for dimensionality
reduction on the customer_churn.csv dataset from the Data
folder in the book resources. Print the accuracy using the two
principal components. Also, plot the results on the test set
using the two principal components.
Solution:
1. import pandas as pd
2. import numpy as np
3.
4. churn_df = pd.read_csv(r"E:\Hands on Python for Data Science and Machine Learning\Datasets\customer_churn.csv")
5. churn_df.head()
6.
7. churn_df = churn_df.drop(['RowNumber', 'CustomerId',
'Surname'], axis=1)
8.
9. X = churn_df.drop(['Exited'], axis=1)
10. y = churn_df['Exited']
11.
12. numerical = X.drop(['Geography', 'Gender'], axis = 1)
13. categorical = X.filter(['Geography', 'Gender'])
14. cat_numerical = pd.get_dummies(categorical,drop_first=True)
15. X = pd.concat([numerical, cat_numerical], axis = 1)
16. X.head()
17.
18. from sklearn.model_selection import train_test_split
19.
20. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.20, random_state=0)
21.
22. #applying scaling on training and test data
23. from sklearn.preprocessing import StandardScaler
24. sc = StandardScaler()
25. X_train = sc.fit_transform(X_train)
26. X_test = sc.transform (X_test)
27.
28. #importing PCA class
29. from sklearn.decomposition import PCA
30.
31. #creating object of the PCA class
32. pca = PCA()
33.
34. #training PCA model on training data
35. X_train = pca.fit_transform(X_train)
36.
37. #making predictions on test data
38. X_test = pca.transform(X_test)
39.
40. #printing variance ratios
41. variance_ratios = pca.explained_variance_ratio_
42. print(variance_ratios)
43.
44. #use two principal components
45. from sklearn.decomposition import PCA
46.
47. pca = PCA(n_components=2)
48. X_train = pca.fit_transform(X_train)
49. X_test = pca.transform(X_test)
50.
51. #making predictions using logistic regression
52. from sklearn.linear_model import LogisticRegression
53.
54. #training the logistic regression model
55. lg = LogisticRegression()
56. lg.fit(X_train, y_train)
57.
58.
59. # Predicting the Test set results
60. y_pred = lg.predict(X_test)
61.
62. #evaluating results
63.
64. from sklearn.metrics import accuracy_score
65.
66. print(accuracy_score(y_test, y_pred))
67.
68. from matplotlib import pyplot as plt
69. %matplotlib inline
70.
71. #print actual datapoints
72.
73. plt.scatter(X_test[:,0], X_test[:,1], c= y_
test, cmap='rainbow' )