Python - Andy Vickler

Python for Data Science and Machine Learning
OceanofPDF.com
© Copyright 2021 - All rights reserved.
The contents of this book may not be reproduced, duplicated, or transmitted without direct written
permission from the author.
Under no circumstances will any legal responsibility or blame be held against the publisher for any
reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content within this book
without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. No warranties of any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical, or professional advice. Please
consult a licensed professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of the information contained
within this document, including, but not limited to, errors, omissions, or inaccuracies.

Table of Contents
Introduction
Part One – An Introduction to Data Science and Machine Learning
What Is Data Science?
How Important Is Data Science?
Data Science Limitations
What Is Machine Learning?
How Important Is Machine Learning?
Machine Learning Limitations
Data Science vs. Machine Learning
Part Two – Introducing NumPy
What Is the NumPy Library?
How to Create a NumPy Array
Shaping and Reshaping a NumPy array
Index and Slice a NumPy Array
Stack and Concatenate NumPy Arrays
Broadcasting in NumPy Arrays
NumPy Ufuncs
Doing Math with NumPy Arrays
NumPy Arrays and Images
Part Three – Data Manipulation with Pandas
Question One: How Do I Create a Pandas DataFrame?
Question Two – How Do I Select a Column or Index from a DataFrame?
Question Three: How Do I Add a Row, Column, or Index to a DataFrame?
Question Four: How Do I Delete Indices, Rows, or Columns From a DataFrame?
Question Five: How Do I Rename the Columns or Index of a DataFrame?
Question Six: How Do I Format the DataFrame Data?
Question Seven: How Do I Create an Empty DataFrame?
Question Eight: When I Import Data, Will Pandas Recognize Dates?
Question Nine: When Should a DataFrame Be Reshaped? Why and How?
Question Ten: How Do I Iterate Over a DataFrame?
Question Eleven: How Do I Write a DataFrame to a File?
Part Four – Data Visualization with Matplotlib and Seaborn
Using Matplotlib to Generate Histograms
Using Matplotlib to Generate Scatter Plots
Using Matplotlib to Generate Bar Charts
Using Matplotlib to Generate Pie Charts
Visualizing Data with Seaborn
Using Seaborn to Generate Histograms
Using Seaborn to Generate Scatter Plots
Using Seaborn to Generate Heatmaps
Using Seaborn to Generate Pairs Plot
Part Five – An In-Depth Guide to Machine Learning
Machine Learning Past to Present
Machine Learning Features
Different Types of Machine Learning
Common Machine Learning Algorithms
Gaussian Naive Bayes classifier
K-Nearest Neighbors
Support Vector Machine Learning Algorithm
Fitting Support Vector Machines
Linear Regression Machine Learning Algorithm
Logistic Regression Machine Learning Algorithm
A Logistic Regression Model
Decision Tree Machine Learning Algorithm
Random Forest Machine Learning Algorithm
Artificial Neural Networks Machine Learning Algorithm
Machine Learning Steps
Evaluating a Machine Learning Model
Model Evaluation Metrics
Regression Metrics
Implementing Machine Learning Algorithms with Python
Advantages and Disadvantages of Machine Learning
Conclusion

References

Introduction
Thank you for purchasing this guide about data science and machine
learning with Python. One of the first questions people ask is, what is data
science? This is not a particularly easy question to answer because the term
is widespread these days, used just about everywhere. Some say it is
nothing more than an unnecessary label, given that science is all about data
anyway, while others say it is just a buzzword.
However, what many fail to see, or choose to ignore, is that data science is quite
possibly the ONLY label that could be given to the cross-discipline skill set
that is fast becoming one of the most important in industry. It isn't
possible to place data science firmly into one discipline; that much is
definitely true. It comprises three areas, all distinct and all overlapping:
Statistician – these people are skilled in modeling datasets and
summarizing them. This is an even more important skill set these
days, given the ever-increasing size of today's datasets.
Computer Scientist – these people design and use the algorithms
needed to store, process, and visualize data efficiently.
Domain Expert – this is akin to being classically trained in a
subject, i.e., having expert knowledge. It is important to ensure the
right questions are asked and the answers are placed in the right
context.
So, going by this, it wouldn't be remiss of me to encourage you to see data
science as a set of skills you can learn and apply in your own expertise area,
rather than being something new you have to learn from scratch. It doesn't
matter if you are examining microscope images looking for
microorganisms, forecasting returns on stocks, optimizing online ads to get
more clicks or any other field where you work with data. This book will
give you the fundamentals you need to learn to ask the right questions in
your chosen expertise area.
Who This Book Is For
This book is aimed at those with experience of programming in Python,
designed to help them further their skills and learn how to use their chosen
programming language to go much further than before. It is not aimed at
those with no experience of Python or programming, and I will not be
giving you a primer on the language.
The book assumes that you already have experience defining and using
functions, calling object methods, assigning variables, controlling program
flow, and all the other basic Python tasks. It is designed to introduce you to
the data science libraries in Python, such as NumPy, Pandas, Seaborn,
Scikit-Learn, and so on, showing you how to use them to store and
manipulate data and gain valuable insights you can use in your
work.
Why Use Python?
Over the last few decades, Python has emerged as the best and most popular
tool for analyzing and visualizing data, among other scientific tasks. When
Python was first designed, it certainly wasn't with data science in mind.
Instead, its use in these areas has come from the addition of third-party
packages, including:
NumPy – for manipulating homogeneous array-based data
Pandas – for manipulating heterogeneous and labeled data
SciPy – for common tasks in scientific computing
Matplotlib – for high quality, accurate visualizations
Scikit-Learn – for machine learning
And so many more.
Python 2 vs. 3
The code used in this book follows the Python 3 syntax. Many
enhancements are not backward compatible with Python 2, which is why
you are always urged to install the latest Python version.
Despite Python 3 being released around 13 years ago, people have been
slow to adopt it, especially in web development and scientific communities.
This is mostly down to the third-party packages not being made compatible
right from the start. However, from 2014, those packages started to become
available in stable releases, and more people have begun to adopt Python 3.
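As a small, generic illustration (these examples are mine, not taken from any particular package), here is the kind of change that is not backward compatible:

```python
# Two well-known Python 3 changes that break Python 2 code:
# print is a built-in function, and / always performs true division.
print("hello")   # in Python 2, print was a statement, not a function

result = 7 / 2   # 3.5 in Python 3; Python 2 floor-divides two ints to 3
floor = 7 // 2   # 3 - explicit floor division behaves the same in both
print(result, floor)
```

Code relying on the old integer-division behavior silently produces different results under Python 3, which is part of why library authors were slow to migrate.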
What's in This Book?
This book has been separated into five sections, each focusing on a specific
area:
Part One – introduces you to data science and machine learning
Part Two – focuses on NumPy and how to use it for storing and
manipulating data arrays
Part Three – focuses on Pandas and how to use it for storing and
manipulating columnar or labeled data arrays
Part Four – focuses on using Matplotlib and Seaborn for data
visualization
Part Five – takes up approximately half of the book and focuses on
machine learning, including a discussion on algorithms.
There is more to data science than all this, but these are the main areas you
should focus on as you start your data science and machine learning
journey.
Installation
Given that you should have experience in Python, I am assuming you
already have Python on your machine. However, if you don't, there are
several ways to install it, but Anaconda is the most common way. There are
two versions of this:
Miniconda – provides you with the Python interpreter and conda, a
command-line tool
Anaconda – provides conda and Python, along with many other
packages aimed at data science. This is a much larger installation, so
make sure you have plenty of free space.
It's also worth noting that the packages discussed in this book can be
installed on top of Miniconda, so it's down to you which version you choose
to install. If you choose Miniconda, you can use the following command to
install all the packages you need:
[~]$ conda install numpy pandas scikit-learn matplotlib seaborn jupyter
All packages and tools can easily be installed with the following:
conda install packagename
ensuring you type the correct name in place of packagename.
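Once installation finishes, a quick way to confirm a package imported correctly is to print its version string. This minimal sketch only checks NumPy, but the same pattern works for Pandas, Matplotlib, and the rest:

```python
# Each package in the data science stack exposes a __version__ string,
# so a successful import plus a version print is a quick sanity check.
import numpy

print("NumPy version:", numpy.__version__)

# The same pattern applies to the other packages, e.g.:
# import pandas; print("pandas version:", pandas.__version__)
```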
Let's not waste any more time and dive into our introduction to data science
and machine learning.

Part One – An Introduction to Data Science and
Machine Learning
The definition of data science tells us that it is a research area used to derive
valuable insights from data using scientific methods. In short, data science is
a combination of statistics, technology, and business understanding.
Machine learning is a bit different. It is all about the techniques
computers use to learn from the data we give them. We use these
techniques to gain outcomes without needing to program any explicit rules.
Machine learning and data science are trending high these days and,
although many people use the terms interchangeably, they are different. So,
let's break this down and learn what we need to know to continue with the
rest of the book.
What Is Data Science?
The real meaning of data science is found in the deep analysis of huge
amounts of data contained in a company's archive. This involves working
out where that data originated from, whether the data is of sufficient quality,
and whether it can be used to move the business on in the future.
Often, an organization stores its data in one of two formats: structured or
unstructured. Examining that data gives us useful insights into consumer
and industry dynamics, which can give a company an advantage over its
rivals; this is done by looking for patterns in the data.
Data scientists know how to turn raw data into something a business can
use. They write the algorithms that process the data, drawing on knowledge
of statistics and machine learning. Some of the top companies in
the world use data analytics, including Netflix, Amazon, airlines, fraud
prevention departments, and healthcare providers.
Data Science Careers
Most businesses use data analysis effectively to expand, and data scientists
are in the highest demand across many industries. If you are considering
getting involved in data science, these are some of the best careers you can
train for:
Data Scientist – investigates trends in the data to see what effect they
will have on a business. Their primary job is to determine the
significance of the data and clarify it so that everyone can understand
it.
Data Analyst – analyzes data to see what industry trends exist. They
help develop clear, simplified views of where a business stands in its
industry.
Data Engineer – often called the backbone of a business, data
engineers create databases to store data, designing them for the best
use, and manage them. Their job entails data pipeline design,
ensuring the data flows adequately, and getting to where it needs to
be.
Business Intelligence Analyst – sometimes called a market
intelligence consultant, they study data to improve income and productivity
for a business. The job is less theoretical and more applied, and it
requires a deep understanding of the business and its data.
How Important Is Data Science?
In today's digital world, a world full of data, it is one of the most important
jobs. Here are some of the reasons why.
First, it gives businesses a better way of identifying who their
customers are. Given that customers keep a business's wheels turning,
they determine whether that business fails or succeeds. With data
science, businesses can now communicate with customers in the right
way, in different ways for different customers.
Second, data science makes it possible for a product to tell its story in
an entertaining and compelling way, one of the main reasons why
data science is so popular. Businesses and brands can use the
information gained from data science to communicate what they want
their customers to know, ensuring much stronger relationships
between them.
Next, the results gained from data science can be used in just about
every field, from schooling to banking, healthcare to tourism, etc.
Data science allows a business to see problems and respond to them
accordingly.
A business can use data analytics to better communicate with its
customers/consumers and provide the business with a better view of
how its products are used.
Data science is quickly gaining traction in just about every market
and is now a critical part of how products are developed. That has led
to a rise in the need for data scientists to help manage data and find
the answers to some of the more complex problems a business faces.
Data Science Limitations
Data science may be one of the most lucrative careers in the world right
now, but it isn't perfect and has its fair share of drawbacks. It wouldn't be
right to brag about what data science can do without pointing out its
limitations:
Data science is incredibly difficult to master. That's because it isn't
just one discipline involving a combination of information science,
statistics, and mathematics, to name a few. It isn't easy to master
every discipline involved to a competent level.
Data science also requires a high level of domain awareness. In fact,
it's fair to say that it relies on it. If you have no knowledge of
computer science and statistics, you will find it hard to solve any data
science problem.
Data privacy is a big issue. Data makes the world go around, and data
scientists create data-driven solutions for businesses. However, there
is always a chance that the data they use could infringe privacy. A
business will always have access to sensitive data and, if data
protection is compromised in any way, it can lead to data breaches.
What Is Machine Learning?
Machine learning is a subset of data science, enabling machines to learn
things without the need for a human to program them. Machine learning
analyzes data using algorithms and prepares possible predictions without
human intervention. It involves the computer learning a series of inputs in
the form of details, observations, or commands, and it is used extensively
by companies such as Google and Facebook.
Machine Learning Careers
There are several directions you can take once you have mastered the
relevant skills. These are three of those directions:
Machine Learning Engineer – this is one of the most prized careers
in the data science world. Engineers typically use machine learning
algorithms to ensure machine learning applications and systems are as
effective as they can be. A machine learning engineer is responsible
for molding self-learning applications, improving their effectiveness
and efficiency by using the test results to fine-tune them and run
statistical analyses. Python is one of the most used programming
languages for performing machine learning experiments.
NLP Scientist – these are concerned with Natural Language
Processing, and their job is to ensure machines can interpret natural
languages. They design and engineer software and computers to learn
human speech habits and convert the spoken word into different
languages. Their aim is to get computers to understand human
languages the same way we do, and two of the best examples are apps
called DuoLingo and Grammarly.
Developer/Engineer of Software – these develop intelligent
computer programs, and they specialize in machine learning and
artificial intelligence. Their main aim is to create algorithms that
work effectively and implement them in the right way. They design
complex functions, plan flowcharts, graphs, layouts, product
documentation, tables, and other visual aids. They are also
responsible for composing code and evaluating it, creating tech specs,
updating programs and managing them, and more.
How Important Is Machine Learning?
Machine learning is an ever-changing field, and the faster it evolves, the
greater its significance and the more demand grows. One of the main
explanations for why data scientists can't live without machine learning is
this: "high-value forecasts that direct smart decisions and behavior in real
time, without interference from humans."
It is fast becoming the preferred way to interpret huge amounts of data and to
automate the daily tasks of data scientists. It's fair to say that machine
learning has changed the way we extract data and visualize it. Given how
much businesses rely on data these days, data-driven decisions help
determine if a company is likely to succeed or fall behind its competitors.
Machine Learning Limitations
Like data science, machine learning also has its limitations, and these are
some of the biggest ones:
Algorithms need huge amounts of training data. Rather than
being explicitly programmed, an AI model is trained. This means they
require vast amounts of data to learn how to do something and
execute it at a human level. In many cases, these huge data sets are
not easy to generate for specific uses, despite the rapid level at which
data is being produced.
Neural networks are taught to identify things, for example, images,
and they do this by being trained on massive amounts of labeled data
in a process known as supervised learning. No matter how large or
small, any change to the task requires a new collection of labeled data,
which also needs to be prepared. It's fair to say that a neural network
cannot yet operate at a human intelligence level because of the brute
force needed to get it there, though that may change in the future.
Labeling the training data takes a great deal of time. AI
commonly uses supervised learning on deep neural nets. It is critical
to label the data in the processing stage, and this is done using
predefined target attributes taken from historical data. Labeling
is also where the data is cleaned and sorted so the neural nets can
work on it.
There's no doubt that vast amounts of labeled information are needed
for deep learning and, while this isn't particularly hard to grasp, it isn't
the easiest job to do. If unlabeled results are used, the program cannot
learn to get smarter.
AI algorithms don't always transfer well. Although there have
been some breakthroughs in recent times, an AI model may still fail
to generalize when asked to do something it didn't see in
training. AI models cannot always transfer what they learn between
sets of conditions, meaning whatever a model accomplishes for one
use case may only be relevant to that case.
Consequently, businesses must spend extra resources keeping models
trained and training new ones, even when the use cases are similar.
Data Science vs. Machine Learning
Data scientists need to understand data analysis in-depth, besides having
excellent programming abilities. The expertise required varies depending on
the business doing the hiring, but the skills can be split into
two categories:
Technical Skills
You need to specialize in:
Algebra
Computer Engineering
Statistics
You must also be competent in:
Computer programming
Analytical tools, such as R, Spark, Hadoop, and SAS
Working with unstructured data obtained from different networks
Non-Technical Skills
Most of a data scientist's abilities are non-technical and include:
A good sense of business
The ability to interact
Data intuition
Machine learning experts also need command over certain skills, including:
Probability and statistics – theoretical experience is closely linked to
how well you comprehend algorithms such as Naïve Bayes, Hidden
Markov models, and Gaussian Mixture models. Being experienced in
statistics and probability makes these easier to grasp.
Evaluating and modeling data – frequently evaluating model
efficacy is a critical part of maintaining measurement reliability in
machine learning. Classification, regression, and other methods are
used to evaluate consistency and error rates in models, along with
assessment plans, and knowledge of these is vital.
Machine learning algorithms – there are tons of machine learning
algorithms, and you need to have knowledge of how they operate to
know which one to use for each scenario. You will need knowledge of
quadratic programming, gradient descent, convex optimization,
partial differential equations, and other similar topics.
Programming languages – you will need experience in
programming in one or more languages – Python, R, Java, C++, etc.
Signal processing techniques – feature extraction is one of the most
critical factors in machine learning. You will need to know some
specialized processing algorithms, such as shearlets, bandlets,
curvelets, and contourlets.
Data science is a cross-discipline sector that uses huge amounts of data to
gain insights. At the same time, machine learning is an exciting subset of
data science, used to encourage machines to learn by themselves from the
data provided.
Both have a huge amount of use cases, but they do have limits. Data science
may be strong, but, like anything, it is only effective if the proper training
and best-quality data are used.
Let's move on and start looking into using Python for data science. Next, we
introduce you to NumPy.

Part Two – Introducing NumPy
NumPy is one of Python's most fundamental libraries, and, over time, you will
come to rely on it in all kinds of data science tasks, ranging from simple math
to image classification. NumPy can handle data sets of all sizes efficiently, even
the largest ones, and having a solid grasp of how it works is essential to
your success in data science.
First, we will look at what NumPy is and why you should use it over other
methods, such as Python lists. Then we will look at some of its operations
to understand exactly how to use it – this will include plenty of code
examples for you to study.
What Is the NumPy Library?
NumPy is short for Numerical Python, and it is, by far, Python's most
important scientific computing library. It supports multidimensional array
objects, along with all the tools needed to work with those arrays. Many
other popular libraries, like Pandas, Scikit-Learn, and Matplotlib, are all
built on NumPy.
So, what is an array? If you know your basic Python, you know that an
array is a collection of values or elements with one or more dimensions. A
one-dimensional array is a Vector, while a two-dimensional array is a
Matrix.
NumPy arrays are known as N-dimensional arrays, otherwise called
ndarrays, and the elements they store are the same type and same size. They
are high-performance, efficient at data operations, and an effective form of
storage as the arrays grow larger.
When you install Anaconda, you get NumPy with it, but you can install it
on your machine separately by typing the following command into the
terminal:
pip install numpy
Once you have done that, the library must be imported, using this
command:
import numpy as np
NOTE
np is the common abbreviation for NumPy
Python Lists vs. NumPy Arrays
Those used to Python will probably wonder why we would use
NumPy arrays when we have perfectly good Python lists at our
disposal. Lists already act as arrays and can store various types of elements,
after all, so why change things?
Well, the answer to this lies in how objects are stored in memory by Python.
Python objects are pointers to memory locations where object details are
stored. These details include the value and the number of bytes. This
additional information is part of the reason why Python is classed as a
dynamically typed language, but it also comes with an additional cost. That
cost becomes clear when large collections of objects, such as arrays, are stored.
A Python list is an array of pointers. Each pointer points to a specific
location containing information that relates to the element. As far as
computation and memory go, this adds a significant overhead cost. To make
matters worse, when all the stored objects are the same type, most of the
additional information becomes redundant, meaning the extra cost for no
reason.
NumPy arrays overcome this issue. These arrays only store objects of the
same type, otherwise known as homogenous objects. Doing it this way
makes storing the array and manipulating it far more efficient, and we can
see this difference clearly when there are vast amounts of elements in the
array, say millions of them. We can also carry out element-wise operations
on NumPy arrays, something we cannot do with a Python list.
This is why we prefer to use these arrays over lists when we want
mathematical operations performed on large quantities of data.
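To see the difference concretely, here is a small sketch comparing the two: multiplying a Python list "repeats" it, while multiplying a NumPy array operates on every element at once:

```python
import numpy as np

py_list = [1, 2, 3, 4]
np_array = np.array([1, 2, 3, 4])

# On a list, * repeats the sequence; element-wise math needs a loop.
print(py_list * 2)               # [1, 2, 3, 4, 1, 2, 3, 4]
print([x * 2 for x in py_list])  # [2, 4, 6, 8]

# On an ndarray, * applies the operation to every element at once.
print(np_array * 2)              # [2 4 6 8]
```

The array version also runs in optimized C code under the hood, which is where the performance gap on millions of elements comes from.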
How to Create a NumPy Array
Basic ndarray
NumPy arrays are simple to create, regardless of which one it is and the
complexity of the problem it is solving. We’ll start with the most basic
NumPy array, the ndarray.
All you need is the np.array() method, ensuring you pass the
array values as a list:
np.array([1,2,3,4])
The output is:
array([1, 2, 3, 4])
Here, we have included integer values and the dtype argument is used to
specify the data type:
np.array([1,2,3,4],dtype=np.float32)
The output is:
array([1., 2., 3., 4.], dtype=float32)
Because a NumPy array can only include homogeneous data types, if
the data types don’t match, the values are upcast:
np.array([1,2.0,3,4])
The output is:
array([1., 2., 3., 4.])
In this example, the integer values are upcast to float values.
You can also create multidimensional arrays:
np.array([[1,2,3,4],[5,6,7,8]])
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
This creates a two-dimensional array containing values.
NOTE
The above example is a 2 x 4 matrix. A matrix is a rectangular array
containing numbers. Its shape is N x M – N indicates how many rows there
are, and M indicates how many columns there are.
An Array of Zeroes
You can also create arrays containing nothing but zeros. This is done with
the np.zeros() method, ensuring you pass the shape of the array you want:
np.zeros(5)
array([0., 0., 0., 0., 0.])
Above, we have created a one-dimensional array, while the one below is a
two-dimensional array:
np.zeros((2,3))
array([[0., 0., 0.],
[0., 0., 0.]])
An Array of Ones
The np.ones() method is used to create an array containing 1s:
np.ones(5,dtype=np.int32)
array([1, 1, 1, 1, 1])
Using Random Numbers in a ndarray
The np.random.rand() method is commonly used to create ndarrays with a
given shape, containing random values from [0, 1):
# random
np.random.rand(2,3)
array([[0.95580785, 0.98378873, 0.65133872],
[0.38330437, 0.16033608, 0.13826526]])
Your Choice of Array
You can even create arrays with any value you want with the np.full()
method, ensuring the shape of the array you want is passed in along with
the desired value:
np.full((2,2),7)
array([[7, 7],
[7, 7]])
NumPy IMatrix
The np.eye() method is another good one that returns 1s on the diagonal and
0s everywhere else. More formally known as an identity matrix, it is a
square matrix with an N x N shape, meaning there are the same number of
rows as columns. Below is a 3 x 3 identity matrix:
# identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
However, you can also change the diagonal where the values are 1s. It could
be above the primary diagonal:
# not an identity matrix
np.eye(3,k=1)
array([[0., 1., 0.],
[0., 0., 1.],
[0., 0., 0.]])
Or below it:
np.eye(3,k=-2)
array([[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.]])
NOTE
A matrix can only be an identity matrix when the 1s appear on the main
diagonal and nowhere else.
Evenly-Spaced ndarrays
The np.arange() method provides evenly-spaced number arrays:
np.arange(5)
array([0, 1, 2, 3, 4])
We can explicitly define the start and end and the interval step size by
passing in an argument for each value.
NOTE
We define the interval as [start, end) and the final number is not added into
the array:
np.arange(2,10,2)
array([2, 4, 6, 8])
Because we defined the step size as 2, we got alternate elements as a
result. The end value, 10, was not included.
A similar function is called np.linspace(), which takes an argument in the
form of the number of samples we want from the interval.
NOTE
This time, the last number will be included in the result.
np.linspace(0,1,5)
array([0. , 0.25, 0.5 , 0.75, 1. ])
That is how to create a variety of arrays with NumPy, but there is something
else just as important – the array’s shape.
Shaping and Reshaping a NumPy array
When your ndarray has been created, you need to check three things:
How many axes the ndarray has
The shape of the array
The size of the array
NumPy Array Dimensions
Determining how many axes or dimensions an ndarray has is easily done
using the ndim attribute:
# number of axis
a = np.array([[5,10,15],[20,25,20]])
print('Array :','\n',a)
print('Dimensions :','\n',a.ndim)
Array :
[[ 5 10 15]
[20 25 20]]
Dimensions :
2
Here, we have a two-dimensional array with two rows and three columns.
NumPy Array Shape
The shape attribute shows the number of elements along each of the
array's dimensions. The shape can be indexed further so
the result shows the value for each dimension:
a = np.array([[1,2,3],[4,5,6]])
print('Array :','\n',a)
print('Shape :','\n',a.shape)
print('Rows = ',a.shape[0])
print('Columns = ',a.shape[1])
Array :
[[1 2 3]
[4 5 6]]
Shape :
(2, 3)
Rows = 2
Columns = 3
Array Size
The size attribute tells you the total number of values in your array; for a
two-dimensional array, that is the number of rows multiplied by the number of columns:
# size of array
a = np.array([[5,10,15],[20,25,20]])
print('Size of array :',a.size)
print('Manual determination of size of array :',a.shape[0]*a.shape[1])
Size of array : 6
Manual determination of size of array : 6
Reshaping an Array
You can use the np.reshape() method to reshape your ndarray without
changing any of the data in it:
# reshape
a = np.array([3,6,9,12])
np.reshape(a,(2,2))
array([[ 3, 6],
[ 9, 12]])
We reshaped this from a one-dimensional to a two-dimensional array.
When you reshape your array and are unsure of the size along one of the axes,
pass in -1 for it. When NumPy sees -1, it will calculate that dimension automatically:
a = np.array([3,6,9,12,18,24])
print('Three rows :','\n',np.reshape(a,(3,-1)))
print('Three columns :','\n',np.reshape(a,(-1,3)))
Three rows :
[[ 3 6]
[ 9 12]
[18 24]]
Three columns :
[[ 3 6 9]
[12 18 24]]
Flattening Arrays
There may be times when you want to collapse a multidimensional array
into a single-dimensional array. There are two methods you can use for this
– flatten() or ravel():
a = np.ones((2,2))
b = a.flatten()
c = a.ravel()
print('Original shape :', a.shape)
print('Array :','\n', a)
print('Shape after flatten :',b.shape)
print('Array :','\n', b)
print('Shape after ravel :',c.shape)
print('Array :','\n', c)
Original shape : (2, 2)
Array :
[[1. 1.]
[1. 1.]]
Shape after flatten : (4,)
Array :
[1. 1. 1. 1.]
Shape after ravel : (4,)
Array :
[1. 1. 1. 1.]
However, one critical difference exists between the two methods – flatten()
will return a copy of the original while ravel() only returns a reference. If
there are any changes to the array ravel() returns, the original array will
reflect them, but this doesn’t happen with the flatten() method.
b[0] = 0
print(a)
[[1. 1.]
[1. 1.]]
The original array did not show the changes.
c[0] = 0
print(a)
[[0. 1.]
[1. 1.]]
But this one did show the changes.
A deep copy of the ndarray is created by flatten(), while a shallow copy is
created by ravel().
A new ndarray is created in a deep copy, stored in memory, with the object
flatten() returns pointing to that location in memory. That is why changes
will not reflect in the original.
A reference to the original location is returned by a shallow copy which
means the object ravel() returns points to the memory location where the
original object is stored. That means changes will be reflected in the
original.
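You can confirm this behavior directly with np.shares_memory(), which reports whether two arrays view the same underlying buffer – a small check along these lines:

```python
import numpy as np

a = np.ones((2, 2))
b = a.flatten()  # deep copy: a new buffer in memory
c = a.ravel()    # shallow copy: a view of the original buffer

print(np.shares_memory(a, b))  # False – flatten() copied the data
print(np.shares_memory(a, c))  # True – ravel() points at the original
```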
Transposing a ndarray
The transpose() method is a reshaping method that takes the input array,
swapping the row values with the columns and vice versa:
a = np.array([[1,2,3],
[4,5,6]])
b = np.transpose(a)
print('Original','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
Original
Shape (2, 3)
[[1 2 3]
[4 5 6]]
Expand along columns:
Shape (3, 2)
[[1 4]
[2 5]
[3 6]]
When you transpose a 2 x 3 array, the result is a 3 x 2 array.
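As a shorthand, every ndarray also exposes its transpose through the .T attribute, which behaves the same as np.transpose() for a two-dimensional array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
# .T is equivalent to np.transpose(a)
print(a.T)
print('Shape :', a.T.shape)
```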
Expand and Squeeze NumPy Arrays
Expanding a ndarray involves adding new axes. This is done with the
method called expand_dims(), passing the array and the axis you want to
expand along:
# expand dimensions
a = np.array([1,2,3])
b = np.expand_dims(a,axis=0)
c = np.expand_dims(a,axis=1)
print('Original:','\n','Shape',a.shape,'\n',a)
print('Expand along columns:','\n','Shape',b.shape,'\n',b)
print('Expand along rows:','\n','Shape',c.shape,'\n',c)
Original:
Shape (3,)
[1 2 3]
Expand along columns:
Shape (1, 3)
[[1 2 3]]
Expand along rows:
Shape (3, 1)
[[1]
[2]
[3]]
However, if you want an axis reduced, you would use the squeeze()
method. This removes an axis with only one entry so, where you have a
matrix of 2 x 2 x 1, the third dimension is removed, leaving a 2 x 2 matrix.
# squeeze
a = np.array([[[1,2,3],
[4,5,6]]])
b = np.squeeze(a, axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)
Original
Shape (1, 2, 3)
[[[1 2 3]
[4 5 6]]]
Squeeze array:
Shape (2, 3)
[[1 2 3]
[4 5 6]]
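If you call squeeze() without an axis argument, NumPy removes every axis of length one at once – a quick sketch:

```python
import numpy as np

a = np.ones((1, 3, 1))
b = np.squeeze(a)  # no axis given: all size-1 axes are dropped
print('Original shape :', a.shape)
print('Squeezed shape :', b.shape)
```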
If you try to squeeze an axis whose size is not one – as with this 2 x 3
array – you would get an error:
# squeeze
a = np.array([[1,2,3],
[4,5,6]])
b = np.squeeze(a, axis=0)
print('Original','\n','Shape',a.shape,'\n',a)
print('Squeeze array:','\n','Shape',b.shape,'\n',b)
The error message will point to line 3 being the issue, providing a message
saying:
ValueError: cannot select an axis to squeeze out which has size not
equal to one
Index and Slice a NumPy Array
We know how to create NumPy arrays and how to change their shape and
size. Now, we will look at using indexing and slicing to extract values.
Slicing a One-Dimensional Array
When you slice an array, you retrieve elements from a specified "slice" of
the array. You must provide the start and end points like this [start: end], but
you can also take things a little further and provide a step size. Let's say you
want every alternate element in the array printed. In that case, your step size
would be defined as 2, which means you want the element 2 steps on from
the current index. This would look something like [start:end:step-size]:
a = np.array([1,2,3,4,5,6])
print(a[1:5:2])
[2 4]
Note the final element wasn't considered – when you slice an array, only the
start index is included, not the end index.
You can get around this by making the end index one higher than the last value you want:
a = np.array([1,2,3,4,5,6])
print(a[1:6:2])
[2 4 6]
If the start or end index isn’t specified, it will default to 0 for the start and
the array size for the end index, with a default step size of 1:
a = np.array([1,2,3,4,5,6])
print(a[:6:2])
print(a[1::2])
print(a[1:6:])
[1 3 5]
[2 4 6]
[2 3 4 5 6]
Slicing a Two-Dimensional Array
Two-dimensional arrays have both columns and rows so slicing them is not
particularly easy. However, once you understand how it's done, you will be
able to slice any array you want.
Before we slice a two-dimensional array, we need to understand how to
retrieve elements from it:
a = np.array([[1,2,3],
[4,5,6]])
print(a[0,0])
print(a[1,2])
print(a[1,0])
1
6
4
Here, we identified the relevant element by supplying its row and column
values. In a one-dimensional array, we only need to provide a single index
because there is only one axis.
Slicing a two-dimensional array requires that a slice for both the column
and the row are mentioned:
a = np.array([[1,2,3],[4,5,6]])
# print first row values
print('First row values :','\n',a[0:1,:])
# with step-size for columns
print('Alternate values from first row:','\n',a[0:1,::2])
# print values from the second column
print('Second column values :','\n',a[:,1::2])
print('Arbitrary values :','\n',a[0:1,1:3])
First row values :
[[1 2 3]]
Alternate values from first row:
[[1 3]]
Second column values :
[[2]
[5]]
Arbitrary values :
[[2 3]]
Slicing a Three-Dimensional Array
We haven't looked at three-dimensional arrays yet, so let's see what one
looks like:
a = np.array([[[1,2],[3,4],[5,6]],# first axis array
[[7,8],[9,10],[11,12]],# second axis array
[[13,14],[15,16],[17,18]]])# third axis array
# 3-D array
print(a)
[[[ 1 2]
[ 3 4]
[ 5 6]]

[[ 7 8]
[ 9 10]
[11 12]]

[[13 14]
[15 16]
[17 18]]]
Three-dimensional arrays don't just have rows and columns. They also have
depth axes, which is where a two-dimensional array is stacked behind
another one. When a three-dimensional array is sliced, you must specify the
two-dimensional array you want to be sliced. This would typically be listed
first in the index:
# value
print('First array, first row, first column value :','\n',a[0,0,0])
print('First array last column :','\n',a[0,:,1])
print('First two rows for second and third arrays :','\n',a[1:,0:2,0:2])
First array, first row, first column value :
1
First array last column :
[2 4 6]
First two rows for second and third arrays :
[[[ 7 8]
[ 9 10]]

[[13 14]
[15 16]]]
If you wanted the values listed as a one-dimensional array, the flatten()
method can be used:
print('Printing as a single array :','\n',a[1:,0:2,0:2].flatten())
Printing as a single array :
[ 7 8 9 10 13 14 15 16]
Negative Slicing
You can also use negative slicing on your array. This prints the elements
from the end, instead of starting at the beginning:
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print(a[:,-1])
[ 5 10]
Each row's final value was printed, but if we wanted several values
extracted from the end, a negative step size would also need to be provided.
If not, you get an empty list.
print(a[:,-1:-3:-1])
[[ 5 4]
[10 9]]
That said, slicing logic is the same – you won’t get the end index in the
output.
You can also use negative slicing if you want the original array reversed:
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array :','\n',a[::-1,::-1])
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array :
[[10 9 8 7 6]
[ 5 4 3 2 1]]
And a ndarray can be reversed using the flip() method:
a = np.array([[1,2,3,4,5],
[6,7,8,9,10]])
print('Original array :','\n',a)
print('Reversed array vertically :','\n',np.flip(a,axis=1))
print('Reversed array horizontally :','\n',np.flip(a,axis=0))
Original array :
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
Reversed array vertically :
[[ 5 4 3 2 1]
[10 9 8 7 6]]
Reversed array horizontally :
[[ 6 7 8 9 10]
[ 1 2 3 4 5]]
Stack and Concatenate NumPy Arrays
Existing arrays can be combined to create new arrays, and this can be done
in two ways:
Vertically combine arrays along the rows. This is done with the
vstack() method and increases the number of rows in the array that
gets returned.
Horizontally combine arrays along the columns. This is done using
the hstack() method and increases the number of columns in the array
that gets returned.
a = np.arange(0,5)
b = np.arange(5,10)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
Array 1 :
[0 1 2 3 4]
Array 2 :
[5 6 7 8 9]
Vertical stacking :
[[0 1 2 3 4]
[5 6 7 8 9]]
Horizontal stacking :
[0 1 2 3 4 5 6 7 8 9]
Worth noting is that the arrays being combined must have the same size
along every axis other than the one being stacked. If they don’t, an error is
thrown:
a = np.arange(0,5)
b = np.arange(5,9)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Vertical stacking :','\n',np.vstack((a,b)))
print('Horizontal stacking :','\n',np.hstack((a,b)))
The error message points to line 5 as where the error is and reads:
ValueError: all the input array dimensions for the concatenation axis
must match exactly, but along dimension 1, the array at index 0 has
size 5 and the array at index 1 has size 4.
You can also use the dstack() method to combine arrays by combining the
elements one index at a time and stacking them along the depth axis.
a = [[1,2],[3,4]]
b = [[5,6],[7,8]]
c = np.dstack((a,b))
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Dstack :','\n',c)
print(c.shape)
Array 1 :
[[1, 2], [3, 4]]
Array 2 :
[[5, 6], [7, 8]]
Dstack :
[[[1 5]
[2 6]]

[[3 7]
[4 8]]]
(2, 2, 2)
You can stack arrays to combine old ones into a new one, but you can also
join passed arrays on an existing axis using the concatenate() method:
a = np.arange(0,5).reshape(1,5)
b = np.arange(5,10).reshape(1,5)
print('Array 1 :','\n',a)
print('Array 2 :','\n',b)
print('Concatenate along rows :','\n',np.concatenate((a,b),axis=0))
print('Concatenate along columns :','\n',np.concatenate((a,b),axis=1))
Array 1 :
[[0 1 2 3 4]]
Array 2 :
[[5 6 7 8 9]]
Concatenate along rows :
[[0 1 2 3 4]
[5 6 7 8 9]]
Concatenate along columns :
[[0 1 2 3 4 5 6 7 8 9]]
However, this method has one primary drawback – the original array must
contain the axis along which you are combining, or else you will get an
error.
The append() method can be used to add new elements at the end of a
ndarray. This is quite useful if you want new values added to an existing
ndarray:
# append values to ndarray
a = np.array([[1,2],
[3,4]])
np.append(a,[[5,6]], axis=0)
array([[1, 2],
[3, 4],
[5, 6]])
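Keep in mind that np.append() returns a new array rather than modifying the original in place, so you need to capture the return value – a minimal check:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.append(a, [[5, 6]], axis=0)  # returns a new array; `a` is untouched
print('Original shape :', a.shape)
print('Appended shape :', b.shape)
```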
Broadcasting in NumPy Arrays
While ndarrays have many good features, one of the best is undoubtedly
broadcasting. It allows mathematical operations between ndarrays of
different sizes, or between an ndarray and a plain number.
In essence, broadcasting stretches smaller ndarrays to match the larger one’s
shape.
a = np.arange(10,20,2)
b = np.array([[2],[2]])
print('Adding two different size arrays :','\n',a+b)
print('Multiplying a ndarray and a number :',a*2)
Adding two different size arrays :
[[12 14 16 18 20]
[12 14 16 18 20]]
Multiplying a ndarray and a number : [20 24 28 32 36]
Think of it as making a copy of the smaller operand – the scalar, or the
column array b – and stretching it to match the larger ndarray's shape
before doing an elementwise operation. However, no actual copies are
made – it is just the best way to think of how broadcasting works.
Why is this so useful?
Because operating on an array with a scalar avoids building a second
full-size array in memory, making it faster and more efficient.
NOTE
A pair of ndarrays must be compatible to broadcast together. This
compatibility happens when:
Both ndarrays have identical dimensions
One ndarray has a dimension of 1, which is then broadcast to meet the
larger one's size requirements
Should they not be compatible, an error is thrown. In the following example, the shapes (3, 3) and (1,) are compatible:
a = np.ones((3,3))
b = np.array([2])
a+b
array([[3., 3., 3.],
[3., 3., 3.],
[3., 3., 3.]])
In this case, we hypothetically stretched the second array to a shape of 3 x 3
before calculating the result.
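For contrast, here is a sketch of an incompatible pair: shapes (3, 3) and (2,) cannot be matched under the rules above, so NumPy raises a ValueError:

```python
import numpy as np

a = np.ones((3, 3))
b = np.array([2, 2])  # shape (2,) cannot be stretched to (3, 3)
try:
    a + b
except ValueError as e:
    print('Broadcast failed:', e)
```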
NumPy Ufuncs
If you know anything about Python, you know it is dynamically typed.
That means that Python doesn't need to know its data type at the time a
variable is assigned because it will determine it automatically at runtime.
However, while this ensures a cleaner code, it does slow things down a bit.
This tends to worsen when Python needs to do loads of operations over and
again, such as adding two arrays. Python must check each element's data
type whenever an operation has to be performed, slowing things down, but
we can get over this by using ufuncs.
NumPy speeds things up with something called vectorization, which applies
an operation to every element of a ndarray in compiled code. In this way,
there is no need to determine each element's data type every time, thus
making things faster.
A ufunc is a universal function, nothing more than a mathematical function
that performs high-speed element-by-element operations. When you do
simple math operations on an array, the ufunc is automatically called
because the arrays are ufuncs wrappers.
For example, when you use the + operator to add NumPy arrays, a ufunc
called add() is called automatically under the hood and does what it has to
do quietly. This is how a Python list would look:
a = [1,2,3,4,5]
b = [6,7,8,9,10]
%timeit a+b
And here is the equivalent with NumPy arrays (note that + concatenates lists, while on arrays it adds element by element):
a = np.arange(1,6)
b = np.arange(6,11)
%timeit a+b
Comparing the timings, the ufunc-based array addition takes much less time on arrays of any appreciable size.
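You can also call the ufunc explicitly; np.add(a, b) is exactly what the + operator dispatches to for arrays:

```python
import numpy as np

a = np.arange(1, 6)
b = np.arange(6, 11)
# the + operator on arrays calls the add ufunc under the hood
print(a + b)
print(np.add(a, b))
```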
Doing Math with NumPy Arrays
Math operations are common in Python, never more so than in NumPy
arrays, and these are some of the more useful ones you will perform:
Basic Arithmetic
Those with Python experience will know all about arithmetic operations
and how easy they are to perform. And they are just as easy to do on
NumPy arrays. All you really need to remember is that the operation
symbols play the role of ufuncs wrappers:
a = np.arange(1,6)
print('Subtract :',a-5)
print('Multiply :',a*5)
print('Divide :',a/5)
print('Power :',a**2)
print('Remainder :',a%5)
Subtract : [-4 -3 -2 -1 0]
Multiply : [ 5 10 15 20 25]
Divide : [0.2 0.4 0.6 0.8 1. ]
Power : [ 1 4 9 16 25]
Remainder : [1 2 3 4 0]
Mean, Median and Standard Deviation
Finding the mean, median and standard deviation on NumPy arrays requires
the use of three methods – mean(), median(), and std():
a = np.arange(5,15,2)
print('Mean :',np.mean(a))
print('Standard deviation :',np.std(a))
print('Median :',np.median(a))
Mean : 9.0
Standard deviation : 2.8284271247461903
Median : 9.0
Min-Max Values and Their Indexes
In a ndarray, we can use the min() and max() methods to find the min and
max values:
a = np.array([[1,6],
[4,3]])
# minimum along a column
print('Min :',np.min(a,axis=0))
# maximum along a row
print('Max :',np.max(a,axis=1))
Min : [1 3]
Max : [6 4]
You can also use the argmin() and argmax() methods to get the minimum or
maximum value indexes along a specified axis:
a = np.array([[1,6,5],
[4,3,7]])
# minimum along a column
print('Min :',np.argmin(a,axis=0))
# maximum along a row
print('Max :',np.argmax(a,axis=1))
Min : [0 1 0]
Max : [1 2]
Breaking down the output, the first column's minimum value is the
column's first element. The second column's minimum is the second
element, while it's the first element for the third column.
Similarly, the outputs can be determined for the maximum values.
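Called without an axis argument, argmin() and argmax() return a single position into the flattened array; np.unravel_index() converts that back into row and column coordinates – a quick sketch:

```python
import numpy as np

a = np.array([[1, 6, 5],
              [4, 3, 7]])
flat = np.argmax(a)                       # index into the flattened array
coords = np.unravel_index(flat, a.shape)  # back to (row, column)
print('Flat index :', flat)
print('Coordinates :', coords)
```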
The Sorting Operation
Any good programmer will tell you that the most important thing to
consider is an algorithm's time complexity. One basic yet important
operation you are likely to use daily as a data scientist is sorting. For that
reason, a good sorting algorithm must be used, and it must have the
minimum time complexity.
The NumPy library is top of the pile when it comes to sorting an array's
elements, with a decent range of sorting algorithms on offer. The sort()
method's kind argument lets you choose between quicksort (the default),
mergesort, and heapsort under the hood.
a = np.array([1,4,2,5,3,6,8,7,9])
np.sort(a, kind='quicksort')
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
It is also possible to sort arrays on any axis you want:
a = np.array([[5,6,7,4],
[9,2,3,7]])
# sort along the column
print('Sort along column :','\n',np.sort(a, kind='mergesort',axis=1))
# sort along the row
print('Sort along row :','\n',np.sort(a, kind='mergesort',axis=0))
Sort along column :
[[4 5 6 7]
[2 3 7 9]]
Sort along row :
[[5 2 3 4]
[9 6 7 7]]
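When you need the positions rather than the values – for example, to reorder a second array the same way – np.argsort() returns the indices that would sort the array:

```python
import numpy as np

a = np.array([1, 4, 2, 5, 3])
order = np.argsort(a)   # indices that would sort `a`
print('Order :', order)
print('Sorted :', a[order])
```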
NumPy Arrays and Images
NumPy arrays are widely used for storing image data and manipulating it,
but what is image data?
Every image consists of pixels, and these are stored in arrays. Every pixel
has a value somewhere between 0 and 255, with 0 being a black pixel and
255 being a white pixel. Colored images comprise three two-dimensional
arrays for the three color channels (RGB – Red, Green, Blue). These are
stacked back-to-back to make up a three-dimensional array. Each of the
array's values stands for a pixel value so, the array size depends on how
many pixels there are along each dimension.
Python can read images as arrays with the scipy.misc.imread() method
from the SciPy library (note that this function has been removed in newer
SciPy releases, where imageio.imread() serves as a replacement). The
output is a three-dimensional array comprised of the pixel values:
import numpy as np
import matplotlib.pyplot as plt
from scipy import misc

# read image
im = misc.imread('./original.jpg')
# image
im
array([[[115, 106, 67],
[113, 104, 65],
[112, 103, 64],
...,
[160, 138, 37],
[160, 138, 37],
[160, 138, 37]],

[[117, 108, 69],
[115, 106, 67],
[114, 105, 66],
...,
[157, 135, 36],
[157, 135, 34],
[158, 136, 37]],

[[120, 110, 74],
[118, 108, 72],
[117, 107, 71],
...,
Checking the type and shape of the array is done like this:
print(im.shape)
print(type(im))
(561, 997, 3)
numpy.ndarray
Because images are nothing more than arrays, we can use array functions to
manipulate them. For example, the image could be horizontally flipped
using a method called np.flip():
# flip
plt.imshow(np.flip(im, axis=1))
Or the pixel values could be rescaled or normalized, which is especially
useful if you need faster computation:
im/255
array([[[0.45098039, 0.41568627, 0.2627451 ],
[0.44313725, 0.40784314, 0.25490196],
[0.43921569, 0.40392157, 0.25098039],
...,
[0.62745098, 0.54117647, 0.14509804],
[0.62745098, 0.54117647, 0.14509804],
[0.62745098, 0.54117647, 0.14509804]],

[[0.45882353, 0.42352941, 0.27058824],
[0.45098039, 0.41568627, 0.2627451 ],
[0.44705882, 0.41176471, 0.25882353],
...,
[0.61568627, 0.52941176, 0.14117647],
[0.61568627, 0.52941176, 0.13333333],
[0.61960784, 0.53333333, 0.14509804]],

[[0.47058824, 0.43137255, 0.29019608],
[0.4627451 , 0.42352941, 0.28235294],
[0.45882353, 0.41960784, 0.27843137],
...,
[0.6 , 0.52156863, 0.14117647],
[0.6 , 0.52156863, 0.13333333],
[0.6 , 0.52156863, 0.14117647]],

...,
That is as much of an introduction as I can give you to NumPy, showing
you the basic methods and operations you are likely to use as a data
scientist. Next, we discover how to manipulate data using Pandas.

Part Three – Data Manipulation with Pandas
Pandas is another incredibly popular Python data science package, and there
is a good reason for that. It offers users flexible, expressive, and powerful
data structures that make data analysis and manipulation dead simple. One
of those structures is the DataFrame.
In this part, we will look at Pandas DataFrames, from fundamental
manipulation up to more advanced operations. To do that, we will answer
the eleven questions users ask most often, providing you with the answers
you need.
What Is a DataFrame?
First, let's look at what a DataFrame is.
If you have any experience using the R language, you will already know
that a data frame provides rectangular grids to store data so it can be looked
at easily. The rows in the grids all correspond to a measurement or an
instance value, and the columns are vectors with data relating to a specific
variable. So, there is no need for the rows to contain identical value types,
although they can if needed. They can store any type of value, such as
logical, character, numeric, etc.
Python DataFrames are much the same. They are included in the Pandas
library, defined as two-dimensional data structures, labeled, and with
columns containing possible different types.
It would be fair to say that there are three main components in a Pandas
DataFrame – data, index, and columns.
First, we can store the following type of data in a DataFrame:
A Pandas DataFrame
A Pandas Series, a labeled one-dimensional array that can hold any
data type, with an index or axis label. A simple example would be a
DataFrame column
A NumPy ndarray, which could be a structure or record
A two-dimensional ndarray
A dictionary containing lists, Series, dictionaries or one-dimensional
ndarrays
NOTE
Do not confuse np.ndarray with np.array(). The first is a data type, and the
second is a function that uses other data structures to make arrays.
With a structured array, a user can manipulate data via fields, each named
differently. The example below shows a structured array being created
containing three tuples. Each tuple's first element is called foo and is the int
type. The second element is a float called bar.
The same example shows a record array. These are used to expand the
properties of a structured array, and users can access the fields by attribute
and not index. As you can see in the example, we access the foo values in
the record array called r2:
# A structured array
my_array = np.ones(3, dtype=([('foo', int), ('bar', float)]))
# Print the structured array
print(my_array['foo'])
# A record array
my_array2 = my_array.view(np.recarray)
# Print the record array
print(my_array2.foo)
Second, the index and column names can also be specified for the
DataFrame. The index provides the row differences, while the column name
indicates the column differences. Later, you will see that these are handy
DataFrame components for data manipulation.
Assuming Pandas and NumPy are already installed on your system, you
can import them with the following commands:
import numpy as np
import pandas as pd
Now you know what DataFrames are and what you can use them for, let's
learn all about them by answering those top 11 questions.
Question One: How Do I Create a Pandas DataFrame?
The first step when it comes to data munging or manipulation is to create
your DataFrame, and you can do this in two ways – convert an existing one
or create a brand new one. We'll only be looking at converting existing
structures in this part, but we will discuss creating new ones later on.
NumPy ndarrays are just one thing that you can use as a DataFrame input.
Creating a DataFrame from a NumPy array is pretty simple. All you need to
do is pass the array to the DataFrame() function inside the data argument:
data = np.array([['','Col1','Col2'],
['Row1',1,2],
['Row2',3,4]])

print(pd.DataFrame(data=data[1:,1:],
index=data[1:,0],
columns=data[0,1:]))
One thing you need to pay attention to is how the DataFrame is constructed
from NumPy array elements. First, the values from lists starting with Row1
and Row2 are selected. Then the row numbers or index, Row1 or Row2, are
selected, followed by the column names of Col1 and Col2.
Next, we print a small data selection. As you can see, this is much the same
as when you subset a two-dimensional NumPy array – you first indicate the
row where you want your data looked for, followed by the column. Don't
forget that, in Python, indices begin at 0. In the above example, we take the
rows from index 1 onward and, within those rows, the elements from index
1 onward. The result is 1, 2, 3, and 4 being selected.
This is the same approach to creating a DataFrame as for any structure
taken as an input by DataFrame().
Try it for yourself on the next example:
# Take a 2D array as input to your DataFrame
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))
# Take a dictionary as input to your DataFrame
my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
print(pd.DataFrame(my_dict))
# Take a DataFrame as input to your DataFrame
my_df = pd.DataFrame(data=[4,5,6,7], index=range(0,4), columns=['A'])
print(pd.DataFrame(my_df))
# Take a Series as input to your DataFrame
my_series = pd.Series({"Belgium":"Brussels", "India":"New Delhi", "United Kingdom":"London", "United States":"Washington"})
print(pd.DataFrame(my_series))
Your Series and DataFrame index has the original dictionary keys, but in a
sorted manner – index 0 is Belgium, while index 3 is the United States.
Once your DataFrame is created, you can find out some things about it. For
example, the len() function or shape property can be used with the .index
property:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
# Use the `shape` property
print(df.shape)
# Or use the `len()` function with the `index` property
print(len(df.index))
These options provide differing DataFrame information. We get the
DataFrame dimensions using the shape property, which means the height
and width of the DataFrame. And, if you use the index property and len()
function together, you will only get the DataFrame height.
The df[0].count() function could also be used to find out more information
about the DataFrame height but, should there be any NaN values, this won't
provide them, which is why .count() is not always the best option.
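To see the difference, compare len(df.index) with df[0].count() on a column that holds a NaN – count() skips missing values, so the two can disagree (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, np.nan, 3.0]})
print(len(df.index))   # 3: counts every row
print(df[0].count())   # 2: NaN values are skipped
```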
Okay, now we know how to create a DataFrame from an existing structure,
it's time to get down to the real work and start looking at some of the basic
operations.
Question Two – How Do I Select a Column or Index from a
DataFrame?
Before we can begin with basic operations, such as deleting, adding, or
renaming, we need to learn how to select the elements. It isn't difficult to
select a value, column, or index from a DataFrame. In fact, it's much the
same as it is in other data analysis languages. Here's an example of the
values in the DataFrame:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
We want to get access to the value in column A at index 0 and we have
several options to do that:
# Using `iloc[]`
print(df.iloc[0][0])
# Using `loc[]`
print(df.loc[0]['A'])
# Using `at[]`
print(df.at[0,'A'])
# Using `iat[]`
print(df.iat[0,0])
There are two of these that you must remember - .iloc[] and .loc[]. There
are some differences between these, but we'll discuss those shortly.
Now let's look at how to select some values. If you wanted rows and
columns, you would do this:
# Use `iloc[]` to select row `0`
print(df.iloc[0])
# Use `loc[]` to select column `'A'`
print(df.loc[:,'A'])
Right now, it's enough for you to know that values can be accessed by using
their label name or their index or column position.
Question Three: How Do I Add a Row, Column, or Index to a
DataFrame?
Now you know how to select values, it's time to understand how a row,
column, or index is added.
Adding an Index
When a DataFrame is created, you want to ensure that you get the right
index so you can add some input to the index argument. If this is not
specified, by default, the DataFrame's index will be of numerical value and
will start with 0, running until the last row in the DataFrame.
However, even if your index is automatically specified, you can still re-use
a column, turning it into your index. We call set_index() on the DataFrame
to do this:
# Print out your DataFrame `df` to check it out
print(df)
# Set 'C' as the index of your DataFrame
df.set_index('C')
Adding a Row
Before reaching the solution, we should first look at the concept of loc and
how different it is from .iloc, .ix, and other indexing attributes:
.loc[] – works on your index labels. For example, if you have loc[2],
you are looking for DataFrame values with an index labeled 2.
.iloc[] – works on your index positions. For example, if you have
iloc[2], you are looking for DataFrame values at index position 2.
.ix[] – this is a little more complex. If you have an integer-based
index, a label is passed to .ix[]. In that case, .ix[2] means you are
looking for DataFrame values with an index labeled 2, similar to
.loc[]. But where you don't have an index that is purely integer-
based, .ix[] also works with positions. (Note that .ix[] has been
deprecated and removed in modern versions of Pandas, so prefer
.loc[] and .iloc[] in new code.)
Let’s make this a little simpler with the help of an example:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=[2, 'A', 4], columns=[48, 49, 50])
# Pass `2` to `loc`
print(df.loc[2])
# Pass `2` to `iloc`
print(df.iloc[2])
# Pass `2` to `ix`
print(df.ix[2])
Note that we used a DataFrame not based solely on integers, just to show
you the differences between them. As you can clearly see, you don’t get the
same result when you pass 2 to .loc[] or .iloc[]/.ix[].
.loc[] will examine those values labeled 2, and you might get something
like the following returned:
48 1
49 2
50 3
.iloc[] looks at the index position, and passing 2 will give you something
like this:
48 7
49 8
50 9
Because we don't just have integers in the index, .ix[] behaves the same way
as .iloc[] and examines the index positions. In that case, you get the same
results that .iloc[] returned.
Understanding the difference between those three is a critical part of adding
rows to a DataFrame. You should also have realized that the
recommendation is to use .loc[] when you insert rows. If you were to use
df.ix[], you might find you are referencing numerically indexed values, and
you could overwrite an existing DataFrame row, albeit accidentally.
Once again, have a look at the example below:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=[2.5, 12.6, 4.8], columns=[48, 49, 50])
# There's no index labeled `2`, so you will change the index at position `2`
df.ix[2] = [60, 50, 40]
print(df)
# This will make an index labeled `2` and add the new values
df.loc[2] = [11, 12, 13]
print(df)
It’s easy to see how this is all confusing.
Adding a Column
Sometimes, you may want your index to be a part of a DataFrame, and this
can be done by assigning a DataFrame column or referencing and assigning
one that you haven't created yet. This is done like this:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
# Use `.index`
df['D'] = df.index
# Print `df`
print(df)
What you are doing here is copying the index values into a new column
labeled D.
However, appending columns could be done in the same way as adding an
index - .loc[] or .iloc[] is used but, this time, a Series is added to the
DataFrame using .loc[]:
# Study the DataFrame `df`
print(df)
# Append a column to `df` (the Series must match the number of rows)
df.loc[:, 4] = pd.Series(['5', '6', '7'], index=df.index)
# Print out `df` again to see the changes
print(df)
Don't forget that a Series object is similar to a DataFrame's column, which
explains why it is easy to add them to existing DataFrames. Also, note that
the previous observation we made about .loc[] remains valid, even when
adding columns.
Resetting the Index
If your index doesn't look quite how you want it, you can reset it using
reset_index(). However, be aware when you are doing this because it is
possible to pass arguments that can break your reset.
# Check out the strange index of your DataFrame
print(df)
# Use `reset_index()` to reset the values.
df_reset = df.reset_index(level=0, drop=True)
# Print `df_reset`
print(df_reset)
Use the code above but replace drop with inplace and see what would
happen.
The drop argument is useful for indicating that you want the existing index
eliminated, but if you use inplace, the original index will be added, with
floats, as an additional column.
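A quick sketch of the contrast, assuming a small DataFrame with a float index: with drop=True the old index disappears, while with drop=False (the default) it is kept as a new column named 'index':

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]}, index=[2.5, 12.6])
print(df.reset_index(drop=True))   # old index discarded
print(df.reset_index(drop=False))  # old index kept as column 'index'
```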
Question Four: How Do I Delete Indices, Rows, or Columns
From a Data Frame?
Now you know how to add rows, columns, and indices, let's look at how to
remove them from the data structure.
Deleting an Index
Removing an index from a DataFrame isn't always a wise idea because
Series and DataFrames should always have one. However, you can try these
instead:
Reset your DataFrame index, as we did earlier
If your index has a name, use del df.index.name (or set df.index.name = None) to remove it
Remove any duplicate values – to do this, reset the index, drop the
index column duplicates, and reinstate the column with no duplicates
as the index:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [40,
50, 60], [23, 35, 37]]),
index= [2.5, 12.6, 4.8, 4.8, 2.5],
columns=[48, 49, 50])

df.reset_index().drop_duplicates(subset='index',
keep='last').set_index('index')
Finally, you can remove an index with a row – more about this later.
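Putting the duplicate-removal recipe above together into a complete, runnable sketch (with the same example data):

```python
import numpy as np
import pandas as pd

# DataFrame with duplicate index labels (example data from the text)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                            [40, 50, 60], [23, 35, 37]]),
                  index=[2.5, 12.6, 4.8, 4.8, 2.5],
                  columns=[48, 49, 50])

# Reset the index, drop rows whose old index label is duplicated
# (keeping the last occurrence), then reinstate the column as the index
deduped = (df.reset_index()
             .drop_duplicates(subset='index', keep='last')
             .set_index('index'))
print(deduped)
```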
Deleting a Column
To remove one or more columns, the drop() method is used:
# Check out the DataFrame `df`
print(df)
# Drop the column with label 'A'
df.drop('A', axis=1, inplace=True)
# Drop the column at position 1
df.drop(df.columns[[1]], axis=1)
This might look too straightforward, and that's because the drop() method
has some extra arguments:
The axis argument is 0 when dropping rows and 1 when dropping
columns
inplace can be set to True, thus deleting the column without needing
the DataFrame to be reassigned.
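A short, self-contained sketch of both arguments in action (using made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['A', 'B', 'C'])

# axis=1 targets columns; without inplace=True a new DataFrame is returned
dropped = df.drop('A', axis=1)
print(dropped.columns.tolist())   # ['B', 'C']

# inplace=True mutates df directly and returns None
df.drop('B', axis=1, inplace=True)
print(df.columns.tolist())        # ['A', 'C']
```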
Removing a Row
You can use df.drop_duplicates() to remove duplicate rows but you can also
do it by considering only the column’s values:
# Check out your DataFrame `df`
print(df)
# Drop the duplicates in `df`
df.drop_duplicates([48], keep='last')
If you don’t need to add any unique criteria to your deletion, the drop()
method is ideal, using the index property for specifying the index for the
row you want removed:
# Check out the DataFrame `df`
print(df)
# Drop the index at position 1
df.drop(df.index[1])
After this, you can reset your index.
Question Five: How Do I Rename the Columns or Index of a
DataFrame?
If you want to change the index or column values in your DataFrame to
different values, you should use a method called .rename():
# Check out your DataFrame `df`
print(df)
# Define the new names of your columns
newcols = {
'A': 'new_column_1',
'B': 'new_column_2',
'C': 'new_column_3'
}
# Use `rename()` to rename your columns
df.rename(columns=newcols, inplace=True)
# Use `rename()` to rename your index
df.rename(index={1: 'a'})
If you were to change the inplace argument value to False, you would see
that the DataFrame didn't get reassigned when the columns were renamed.
This would result in the second part of the example taking an input of the
original argument rather than the one returned from the rename() operation
in the first part.
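The following sketch (with hypothetical data) shows that, with the default inplace=False, rename() returns a renamed copy and leaves the original untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['A', 'B', 'C'])

newcols = {'A': 'new_column_1', 'B': 'new_column_2', 'C': 'new_column_3'}

# inplace=False (the default) returns a renamed copy; df itself is untouched
renamed = df.rename(columns=newcols)
print(renamed.columns.tolist())
print(df.columns.tolist())  # still ['A', 'B', 'C']
```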
Question Six: How Do I Format the DataFrame Data?
On occasion, you may also want to perform operations on the DataFrame
values. We're going to look at a few of the most important methods for
formatting DataFrame values.
Replacing All String Occurrences
The replace() method can easily be used when you want to replace specific
strings in the DataFrame. All you have to do is pass in the values you want
to be changed and the new, replacement values, like this:
# Study the DataFrame `df` first
print(df)
# Replace the strings by numerical values (0-4)
df.replace(['Awful', 'Poor', 'OK', 'Acceptable', 'Perfect'], [0, 1, 2, 3, 4])
You can also use the regex argument to help you out with odd combinations
of strings:
# Check out your DataFrame `df`
print(df)
# Replace strings by others with `regex`
df.replace({'\n': '<br>'}, regex=True)
Put simply, replace() is the best method to deal with replacing strings or
values in the DataFrame.
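Here is a runnable sketch of the first replace() example, using a hypothetical rating column:

```python
import pandas as pd

# Hypothetical ratings column using ordinal strings
df = pd.DataFrame({'rating': ['Awful', 'OK', 'Perfect']})

# Map each string rating onto a numerical scale 0-4
df['rating'] = df['rating'].replace(['Awful', 'Poor', 'OK', 'Acceptable', 'Perfect'],
                                    [0, 1, 2, 3, 4])
print(df['rating'].tolist())  # [0, 2, 4]
```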
Removing Bits of Strings from DataFrame Cells
It isn't easy to remove bits of strings and it can be a somewhat cumbersome
task. However, there is a relatively easy solution:
# Check out your DataFrame
print(df)
# Delete unwanted parts from the strings in the `result` column
df['result'] = df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
# Check out the result again
df
To apply the lambda function element-wise, or over individual elements in
the column, we use map() on the result column. The lambda function takes
each string value and removes the + or - on the left and any of the
characters aAbBcC on the right.
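A self-contained version of the example, with made-up result values, looks like this:

```python
import pandas as pd

# Hypothetical 'result' column with stray signs and grade letters
df = pd.DataFrame({'result': ['+63B', '-52A', '+89cC']})

# Strip leading '+'/'-' and trailing a/A/b/B/c/C characters element-wise
df['result'] = df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
print(df['result'].tolist())  # ['63', '52', '89']
```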
Splitting Column Text into Several Rows
This is a bit more difficult but the example below walks you through what
needs to be done:
# Inspect your DataFrame `df`
print(df)
# Split out the two values in the third row
# Make it a Series
# Stack the values
ticket_series = df['Ticket'].str.split(' ').apply(pd.Series, 1).stack()
# Get rid of the stack:
# Drop the level to line up with the DataFrame
ticket_series.index = ticket_series.index.droplevel(-1)
# Make your `ticket_series` a dataframe
ticketdf = pd.DataFrame(ticket_series)
# Delete the `Ticket` column from your DataFrame
del df['Ticket']
# Join the `ticketdf` DataFrame to `df`
df.join(ticketdf)
# Check out the new `df`
df
What you are doing is:
Inspecting the DataFrame. The values in the last column and the last
row are long. It looks like we have two tickets because one guest took
a partner with them to the concert.
The Ticket column is removed from the DataFrame df and the strings
on a space. This ensures that both tickets are in separate rows. These
four values, which are the ticket numbers, are placed into a Series
object:
0 1
0 23:44:55 NaN
1 66:77:88 NaN
2 43:68:05 56:34:12
Something still isn't right because we can see NaN values. The Series
must be stacked to avoid NaN values in the result.
Next, we see the stacked Series:
0 0 23:44:55
1 0 66:77:88
2 0 43:68:05
1 56:34:12
That doesn't really work, either, and this is why the level is dropped, so
it lines up with the DataFrame:
0 23:44:55
1 66:77:88
2 43:68:05
2 56:34:12
dtype: object
Now that is what you really want to see.
Next, your Series is transformed into a DataFrame so it can be
rejoined to the initial DataFrame. However, we don't want duplicates,
so the original Ticket column must be deleted.
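The whole sequence can be sketched end to end with hypothetical guests and tickets. Note that expand=True is used here in place of the older apply(pd.Series, 1) idiom, and a dropna() is added so the result is the same even on pandas versions where stack() keeps NaN values:

```python
import pandas as pd

# Hypothetical guest list; the last row holds two tickets in one cell
df = pd.DataFrame({'Guest': ['Ann', 'Bob', 'Cleo'],
                   'Ticket': ['23:44:55', '66:77:88',
                              '43:68:05 56:34:12']})

# Split on spaces into columns, stack into one long Series, drop the NaNs
ticket_series = df['Ticket'].str.split(' ', expand=True).stack().dropna()

# Drop the inner index level so it lines up with df again
ticket_series.index = ticket_series.index.droplevel(-1)

# Back to a DataFrame, remove the old column, and join the new one on
ticketdf = pd.DataFrame({'Ticket': ticket_series})
del df['Ticket']
df = df.join(ticketdf)
print(df)
```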
Applying Functions to Rows or Columns
You can apply functions to data in the DataFrame to change it. First, we
make a lambda function:
doubler = lambda x: x*2
Next, we apply the function:
# Study the `df` DataFrame
print(df)
# Apply the `doubler` function to the `A` DataFrame column
df['A'].apply(doubler)
Note the DataFrame row can also be selected and the doubler lambda
function applied to it. You know how to select rows – by using .loc[] or
.iloc[].
Next, something like the following would be executed, depending on
whether your index is selected based on position or label:
df.loc[0].apply(doubler)
Here, the apply() function is only relevant to the doubler function on the
DataFrame axis. This means you target the columns or the index – a row or
a column in simple terms.
However, if you wanted it applied element-wise, the map() function can be
used. Simply replace apply() with map(), and don't forget that the doubler
still has to be passed to map() to ensure the values are multiplied by 2.
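A quick sketch contrasting apply() on a row with map() on a column (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['A', 'B', 'C'])
doubler = lambda x: x * 2

# On a single row selected by label, apply() doubles each element
row = df.loc[0].apply(doubler)
print(row.tolist())  # [2, 4, 6]

# On a column, map() gives the same element-wise result
col = df['A'].map(doubler)
print(col.tolist())  # [2, 8, 14]
```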
Let's assume this doubling function is to be applied to the entire DataFrame,
not just the A column. In that case, the doubler function is applied using
applymap() to every element in your DataFrame:
doubled_df = df.applymap(doubler)
print(doubled_df)
Here, we've been working with anonymous functions created at runtime,
also known as lambda functions, but you can create your own:
def doubler(x):
    if x % 2 == 0:
        return x
    else:
        return x * 2
# Use `applymap()` to apply `doubler()` to your DataFrame
doubled_df = df.applymap(doubler)
# Check the DataFrame
print(doubled_df)
Question Seven: How Do I Create an Empty DataFrame?
We will use the DataFrame() function for this, passing in the data we want
to be included, along with the columns and indices. Remember, you don't
have to use homogenous data for this – it can be of any type.
There are a few ways to use the DataFrame() function to create empty
DataFrames. First, numpy.nan can be used to initialize it with NaNs. The
numpy.nan value has a float data type:
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A'])
print(df)
At the moment, this DataFrame's data type is inferred. This is the default
and, because numpy.nan has the float data type, that is the type of values the
DataFrame will contain. However, the DataFrame can also be forced to be a
specific type by using the dtype attribute and supplying the type you want,
like this:
df = pd.DataFrame(index=range(0,4), columns=['A'], dtype='float')
print(df)
If the index or axis labels are not specified, the input data is used to
construct them, using common sense rules.
Question Eight: When I Import Data, Will Pandas Recognize
Dates?
Yes, Pandas can recognize dates but only with a bit of help. When you read
the data in, for example, from a CSV file, the parse_dates argument should
be added:
import pandas as pd
pd.read_csv('yourFile', parse_dates=True)
# or this option:
pd.read_csv('yourFile', parse_dates=['columnName'])
However, there is always the chance that you will come across odd date-
time formats. There is nothing to worry about as it's simple enough to
construct a parser to handle this. For example, you could create a lambda
function to take the DateTime and use a format string to control it:
import pandas as pd
from datetime import datetime

dateparser = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

# Which makes your read command:
pd.read_csv(infile, parse_dates=['columnName'],
            date_parser=dateparser)

# Or combine two columns into a single DateTime column


pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']},
            date_parser=dateparser)
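Here is a minimal, self-contained check you can run without a file on disk, using an in-memory CSV via io.StringIO as a stand-in for 'yourFile':

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'yourFile'
csv_data = io.StringIO("date,value\n2021-03-01,10\n2021-03-02,20\n")

# Without parse_dates the column stays a string; with it, pandas parses dates
df = pd.read_csv(csv_data, parse_dates=['date'])
print(df['date'].dtype)
```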
Question Nine: When Should a DataFrame Be Reshaped? Why
and How?
When you reshape a DataFrame, you transform it to ensure that the
structure you get better fits your data analysis. We can infer from this that
reshaping is less concerned with formatting the DataFrame values and more
concerned with transforming the DataFrame shape.
So that tells you when and why you should reshape, but how do you do it?
You have a choice of three ways, and we'll look at each of them in turn:
Pivoting
The pivot() function is used for creating a derived table from the original.
Three arguments can be passed with the function:
values – lets you specify the values from the original DataFrame to
bring into your pivot table.
columns – whatever this argument is passed to becomes a column in
the table.
index – whatever this argument is passed to becomes an index.
# Import pandas
import pandas as pd
products = pd.DataFrame({'category': ['Cleaning', 'Cleaning', 'Entertainment',
                                      'Entertainment', 'Tech', 'Tech'],
                         'store': ['Walmart', 'Dia', 'Walmart',
                                   'Fnac', 'Dia', 'Walmart'],
                         'price': [11.42, 23.50, 19.99, 15.95, 55.75, 111.55],
                         'testscore': [4, 3, 5, 7, 5, 8]})
# Use `pivot()` to pivot the DataFrame
pivot_products = products.pivot(index='category', columns='store',
                                values='price')
# Check out the result
print(pivot_products)
If you don’t specify the values you want on the table, you will end up
pivoting by several columns:
# Import the Pandas library
import pandas as pd
# Construct the DataFrame
products = pd.DataFrame({'category': ['Cleaning', 'Cleaning', 'Entertainment',
                                      'Entertainment', 'Tech', 'Tech'],
                         'store': ['Walmart', 'Dia', 'Walmart',
                                   'Fnac', 'Dia', 'Walmart'],
                         'price': [11.42, 23.50, 19.99, 15.95, 55.75, 111.55],
                         'testscore': [4, 3, 5, 7, 5, 8]})
# Use `pivot()` to pivot your DataFrame
pivot_products = products.pivot(index='category', columns='store')
# Check out the results
print(pivot_products)
The data may not have any rows containing duplicate values for the
specified columns. If they do, an error message is thrown. If your data
cannot be unique, you must use a different method, pivot_table():
# Import the Pandas library
import pandas as pd
# Your DataFrame
products = pd.DataFrame({'category': ['Cleaning', 'Cleaning', 'Entertainment',
                                      'Entertainment', 'Tech', 'Tech'],
                         'store': ['Walmart', 'Dia', 'Walmart',
                                   'Fnac', 'Dia', 'Walmart'],
                         'price': [11.42, 23.50, 19.99, 15.95, 19.99, 111.55],
                         'testscore': [4, 3, 5, 7, 5, 8]})
# Pivot your `products` DataFrame with `pivot_table()`
pivot_products = products.pivot_table(index='category', columns='store',
                                      values='price', aggfunc='mean')
# Check out the results
print(pivot_products)
We passed an extra argument to the pivot_table method, called aggfunc.
This indicates that multiple values are combined with an aggregation
function. In our example, we used the mean function.
Reshaping a DataFrame Using Stack() and Unstack()
If you refer back to question five, you will see we already used a stacking
example. When a DataFrame is stacked, it is made taller. The innermost
column index is moved to become the innermost row index, thus returning a
DataFrame with an index containing a new innermost row labels level.
Conversely, we can use unstack() to make the innermost row index into the
innermost column index.
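A tiny sketch (hypothetical data) showing that the two methods are inverses of each other:

```python
import pandas as pd

# Small DataFrame with two columns (hypothetical data)
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# stack() moves the column labels into the innermost row index level
stacked = df.stack()
print(stacked)

# unstack() reverses the operation, restoring the original shape
restored = stacked.unstack()
print(restored.equals(df))  # True
```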
Using melt() to Reshape a DataFrame
Melting is useful when your data has at least one column that is an
identifier variable, and all others are measured variables.
Measured variables are considered as unpivoted to the row axis. This means
that, where the measured variables were scattered over the DataFrame's
width, melt() will ensure they are placed to its height. So, instead of
becoming wider, the DataFrame becomes longer. This results in a pair of
non-identifier columns called value and variable. Here's an example to
show you what this means:
# The `people` DataFrame
people = pd.DataFrame({'FirstName' : ['John', 'Jane'],
'LastName' : ['Doe', 'Austen'],
'BloodType' : ['A-', 'B+'],
'Weight' : [90, 64]})
# Use `melt()` on the `people` DataFrame
print(pd.melt(people, id_vars=['FirstName', 'LastName'],
              var_name='measurements'))
Question Ten: How Do I Iterate Over a DataFrame?
Iterating over the DataFrame rows is done with a for loop and calling
iterrows() on the DataFrame:
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7,
8, 9]]), columns=['A', 'B', 'C'])
for index, row in df.iterrows():
    print(row['A'], row['B'])
Using iterrows() lets you loop efficiently over the rows as pairs – index,
Series. In simple terms, the result will be (index, row) tuples.
Question Eleven: How Do I Write a DataFrame to a File?
When you have finished doing all you want to do, you will probably want
to export your DataFrame into another format. We'll look at two ways to do
this – Excel or CSV file.
Outputting to a CSV File
Writing to a CSV file requires the use of to_csv():
import pandas as pd
df.to_csv('myDataFrame.csv')
That looks like simple code but, for many people, it's where things get
difficult. This is because everyone has their own requirements for the data
output. You may not want to use commas as the delimiter, or you may want
a specific type of encoding.
We can get around this with specific arguments passed to to_csv() to ensure
you get the output in the format you want:
Using a tab delimiter requires the sep argument:
import pandas as pd
df.to_csv('myDataFrame.csv', sep='\t')
The encoding argument allows you to choose the character encoding
you want:
import pandas as pd
df.to_csv('myDataFrame.csv', sep='\t', encoding='utf-8')
You can even specify how to represent missing or NaN values – you
may or may not want the header outputted, you may or may not want
the row names written, you may want compression, and so on. You
can find out how to do all that in the pandas documentation for to_csv().
Outputting to an Excel File
For this one, you use to_excel() but it isn’t quite as straightforward as the
CSV file:
import pandas as pd
writer = pd.ExcelWriter('myDataFrame.xlsx')
df.to_excel(writer, 'DataFrame')
writer.save()
However, it's worth noting that you require many additional arguments, just
like you do with to_csv, to ensure your data is output as you want. These
include startrow, startcol, and more, and you can find out about them in the
pandas documentation for to_excel().
More than Just DataFrames
While that was a pretty basic Pandas tutorial, it gives you a good idea of
how to use the library. Those 11 questions we answered are among the most
commonly asked, and they represent the fundamental skills you need to
import your data, clean it, and manipulate it.
In the next chapter, we'll take a brief look at data visualization with
Matplotlib and Seaborn.

Part Four – Data Visualization with Matplotlib
and Seaborn
Data visualization is one of the most important parts of communicating
your results to others. It doesn't matter whether it is in the form of pie
charts, scatter plots, bar charts, histograms, or one of the many other forms
of visualization; getting it right is key to unlocking useful insights from the
data. Thankfully, data visualization is easy with Python.
It's fair to say that data visualization is a key part of analytical tasks, like
exploratory data analysis, data summarization, model output analysis, and
more. Python offers plenty of libraries and tools to help us gain good
insights from data, and the most commonly used is Matplotlib. This library
enables us to generate all different visualizations, all from the same code if
you want.
Another useful library is Seaborn, built on Matplotlib, providing highly
aesthetic data visualizations that are sophisticated, statistically speaking.
Understanding these libraries is critical for data scientists to make the most
out of their data analysis.
Using Matplotlib to Generate Histograms
Matplotlib is packed with tools that help provide quick data visualization.
For example, researchers who want to analyze new data sets often want to
look at how values are distributed over a set of columns, and this is best
done using a histogram.
A histogram approximates a distribution by dividing the range of values into
a set of intervals and counting how many values fall into each bucket, or
bin. Using Matplotlib makes this kind of visualization straightforward.
We'll be using the FIFA19 dataset, which is freely available online. Run
these code snippets to see the outputs for yourself.
To start, if you haven't already done so, import Pandas:
import pandas as pd
Next, the pyplot module from Matplotlib is required, and the customary
way of importing it is as plt:
import matplotlib.pyplot as plt
Next, the data needs to be read into a DataFrame. We'll use the set_option()
method from Pandas to relax the display limit of rows and columns:
df = pd.read_csv("fifa19.csv")

pd.set_option('display.max_columns', None)

pd.set_option('display.max_rows', None)
Now we can print the top five rows of the data, and this is done with the
head() method:
print(df.head())
A histogram can be generated for any numerical column. The hist() method
is called on the plt object, and the selected DataFrame column is passed in.
We'll try this on the Overall column because it corresponds to the overall
player rating:
plt.hist(df['Overall'])
The x and y axes can also be labeled, along with the plot title, using three
methods:
xlabel()
ylabel()
title()
plt.xlabel('Overall')

plt.ylabel('Frequency')

plt.title('Histogram of Overall Rating')

plt.show()
This histogram is one of the best ways to see the distribution of values and
see which ones occur the most and the least.
Using Matplotlib to Generate Scatter Plots
Scatterplots are another useful visualization that shows variable
dependence. For example, if we wanted to see if a positive relationship
existed between wage and overall player rating (if the wage increases, does
the rating go up?), we can use the scatterplot to find it.
Before generating the scatterplot, we need the wage column converted from
string to floating-point, which is a numerical column. To do this, a new
column is created, named wage_euro:
df['wage_euro'] = df['Wage'].str.strip('€')

df['wage_euro'] = df['wage_euro'].str.strip('K')

df['wage_euro'] = df['wage_euro'].astype(float)*1000.0
Now, let’s display our new column wage_euro and the overall
column:
print(df[['Overall', 'wage_euro']].head())
The result would look like this:
Overall wage_euro
0 94 565000.0
1 94 405000.0
2 92 290000.0
3 91 260000.0
4 91 355000.0
Generating a Matplotlib scatter plot is as simple as using scatter() on the plt
object. We can also label each axis and provide the plot with a title:
plt.scatter(df['Overall'], df['wage_euro'])

plt.title('Overall vs. Wage')


plt.ylabel('Wage')

plt.xlabel('Overall')

plt.show()
Using Matplotlib to Generate Bar Charts
Another good visualization tool is the bar chart, useful for analyzing data
categories. For example, using our FIFA19 data set, we might want to look
at the most common nationalities, and bar charts can help us with this.
Visualizing categorical columns requires the values to be counted first. We
can generate a dictionary of count values for each category in a categorical
column using the Counter class from Python's collections module. We'll do
that for the Nationality column:
from collections import Counter

print(Counter(df['Nationality']))
The result would be a long list of nationalities, starting with the most values
and ending with the least.
This dictionary can be filtered using Counter's most_common() method.
We'll ask for the ten most common nationalities in this case. Note that
Counter has no least_common() method; calling most_common() with no
argument returns every count in descending order, so the least common
values appear at the end of that list:
print(dict(Counter(df['Nationality']).most_common(10)))
Again, the result would be a dictionary containing the ten most common
nationalities in order of the most values to the least.
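Since Counter has no least_common() method, a common trick is to slice most_common() from the end. A sketch with made-up nationality counts:

```python
from collections import Counter

# Hypothetical nationality column values
nationalities = ['England'] * 5 + ['Spain'] * 3 + ['Peru'] * 1 + ['Japan'] * 2

counts = Counter(nationalities)

# The two most common, in descending order
print(dict(counts.most_common(2)))  # {'England': 5, 'Spain': 3}

# The two least common: slice the full descending list from the end
least_two = counts.most_common()[:-3:-1]
print(least_two)  # [('Peru', 1), ('Japan', 2)]
```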
Finally, the bar chart can be generated by calling the bar method on the plt
object, passing the dictionary keys and values in:
nationality_dict = dict(Counter(df['Nationality']).most_common(10))
plt.bar(nationality_dict.keys(), nationality_dict.values())

plt.xlabel('Nationality')

plt.ylabel('Frequency')

plt.title('Bar Plot of Ten Most Common Nationalities')

plt.xticks(rotation=90)

plt.show()
You can see from the resulting bar chart that the x-axis values overlap,
making them difficult to see. If we want to rotate the values, we use the
xticks() method, like this:
plt.xticks(rotation=90)
Using Matplotlib to Generate Pie Charts
Pie charts are an excellent way of visualizing proportions in the data. Using
our FIFA19 dataset, we could visualize the player proportions for three
countries – England, Germany, and Spain – using a pie chart.
First, we create four new columns, one each for England, Germany, Spain,
and Other, which contains all the other nationalities:
df.loc[df.Nationality =='England', 'Nationality2'] = 'England'

df.loc[df.Nationality =='Spain', 'Nationality2'] = 'Spain'

df.loc[df.Nationality =='Germany', 'Nationality2'] = 'Germany'


df.loc[~df.Nationality.isin(['England', 'Germany', 'Spain']),
'Nationality2'] = 'Other'

Next, let’s create a dictionary that contains the proportion values for
each of these:

prop = dict(Counter(df['Nationality2']))

for key, values in prop.items():
    prop[key] = values / len(df) * 100

print(prop)
The result is a dictionary showing Other, England, Germany, and Spain
with their player proportions. From here, we can use the Matplotlib pie
method to create a pie chart from the dictionary:
fig1, ax1 = plt.subplots()

ax1.pie(prop.values(), labels=prop.keys(), autopct='%1.1f%%',
        shadow=True, startangle=90)

ax1.axis('equal')  # Equal aspect ratio ensures the pie is drawn as a circle
plt.show()
As you can see, these methods give us very powerful ways of visualizing
category proportions in the data.
Visualizing Data with Seaborn
Seaborn is another useful library, built on Matplotlib and offering a
powerful plot formatting method. Once you have got to grips with
Matplotlib, you should move on to Seaborn to use more complex
visualizations.
For example, we can use the set() method from Seaborn to improve how
our Matplotlib plots look.
First, Seaborn needs to be imported, and all the figures generated earlier
need to be reformatted. Add the following to the top of your script and run
it:
import seaborn as sns

sns.set()
Using Seaborn to Generate Histograms
We can use Seaborn to generate the same visualizations we generated with
Matplotlib. First, we'll generate the histogram showing the overall column.
This is done on the Seaborn object with distplot:
sns.distplot(df['Overall'])
We can also reuse our plt object to apply additional formatting to the axes
and set the title:
plt.xlabel('Overall')

plt.ylabel('Frequency')

plt.title('Histogram of Overall Rating')

plt.show()
As you can see, this is much better looking than the plot generated with
Matplotlib.
Using Seaborn to Generate Scatter Plots
We can also use Seaborn to generate scatter plots in a straightforward way,
so let's recreate our earlier one:
sns.scatterplot(df['Overall'], df['wage_euro'])

plt.title('Overall vs. Wage')

plt.ylabel('Wage')

plt.xlabel('Overall')

plt.show()
Using Seaborn to Generate Heatmaps
Seaborn is perhaps best known for its ability to create correlation heatmaps.
These are used for visualizing how strongly variables are correlated with
one another and, before generating them, the correlation between a
specified set of numerical columns must be calculated. We'll do this for
four columns:
age
overall
wage_euro
skill moves
corr = df[['Overall', 'Age', 'wage_euro', 'Skill Moves']].corr()

sns.heatmap(corr, annot=True)

plt.title('Heatmap of Overall, Age, wage_euro, and Skill Moves')

plt.show()
Setting annot to True, as we did above, is what displays the correlation
values inside each cell:
sns.heatmap(corr, annot=True)
Using Seaborn to Generate Pairs Plot
This is the final tool we'll look at – a method called pairplot – which allows
for a matrix of distributions to be generated, along with a scatter plot for a
specified numerical feature set. We'll do this with three columns:
age
overall
potential
data = df[['Overall', 'Age', 'Potential']]

sns.pairplot(data)

plt.show()
This proves to be one of the quickest and easiest ways to visualize the
relationships between the variables and the numerical value distributions
using scatter plots.
Matplotlib and Seaborn are both excellent tools, very valuable in the data
science world. Matplotlib helps you label, title, and format graphs, one of
the most important factors in effectively communicating your results. It also
gives many of the fundamental tools we need to produce visualizations,
such as scatter plots, histograms, bar charts, and pie charts.
Seaborn is also a very important library because it offers in-depth statistical
tools and stunning visuals. As you saw, the Seaborn-generated visuals were
much more aesthetically pleasing to look at than the Matplotlib visuals. The
Seaborn tools also allow for more sophisticated visuals and analyses.
Both libraries are popular in data visualization, both allowing quick data
visualization to tell stories with the data. There is some overlap in their use
cases, but it is wise to learn both to ensure you do the best job with your
data as you can.
Now buckle up and settle in. The next part is by far the longest as we dive
deep into the most important subset of data science – machine learning.

Part Five – An In-Depth Guide to Machine
Learning
The term “machine learning” was first coined in 1959 by Arthur Samuel,
an AI and computer gaming pioneer who defined it as “a field of study
that provides computers with the capability of learning without explicit
programming.”
That means that machine learning is an AI application that allows
software to learn by experience, continually improving without explicit
programming. For example, think about writing a program to identify
vegetables based on properties such as shape, size, color, etc.
You could hardcode it all, add a few rules which you use to identify the
vegetables. To some, this might seem to be the only way that will work,
but there isn’t one programmer who can write the perfect rules that will
work for all cases. That’s where machine learning comes in, without any
rules – this makes it more practical and robust. Throughout this chapter,
you will see how machine learning can be used for this task.
So, we could say that machine learning is a study in making machines
exhibit human behavior in terms of decision making and behavior by
allowing them to learn with little human intervention – which means no
explicit programming. This begs the question, how does a machine gain
experience, and where does it learn from? The answer to that is simple –
data, the fuel that powers machine learning. Without it, there is no
machine learning.
One more question – if machine learning was first mentioned back in
1959, why has it taken so long to become mainstream? The main reason
for this is the significant amount of computing power machine learning
requires, not to mention the hardware capable of storing vast amounts of
data. It’s only in recent years that we have achieved the requirements
needed to practice machine learning.
How Do Machine Learning and Traditional Programming Differ?
Traditional programming involves input data being fed into a machine,
together with a clean, tested program, to generate some output. With
machine learning, the input and the output data are fed to the machine
during the initial phase, known as the learning phase. From there, the
machine will work the programming out for itself.
Don’t worry if this doesn’t make sense right now. As you work through
this in-depth chapter, it will all come together.
Why We Need Machine Learning
These days, machine learning has more attention than you would think
possible. One of its primary uses is to automate tasks that require human
intelligence to perform. Machines can only replicate this intelligence via
machine learning.
Businesses use machine learning to automate tasks and to create data
analysis models. Some industries are heavily reliant on huge amounts
of data which they use to optimize operations and make the right decisions.
Machine learning makes it possible to create models to process and analyze
complex data in vast amounts and provide accurate results. Those models
are scalable and precise, and they function quickly, allowing businesses to
leverage their huge power and the opportunities they provide while
avoiding risks, both known and unknown.
Some of the many uses of machine learning in the real world include text
generation, image recognition, and more, thus increasing its scope and
ensuring machine learning experts become highly sought-after individuals.
How Does Machine Learning Work?
Machine learning models are fed historical data, known as training data.
They learn from that and then build a prediction algorithm that predicts
an output for a completely different set of data, known as testing data,
that is given as an input to the system. How accurate the models are is
entirely dependent on the quality of the data and how much of it is
provided – the more quality data there is, the better the results and
accuracy are.
Let’s say we have a highly complex issue that requires predictions.
Rather than writing complex code, we could give the data to machine
learning algorithms. These develop logic and provide predictions, and
we’ll be discussing the different types of algorithms shortly.
Machine Learning Past to Present
These days, we see plenty of evidence of machine learning in action,
such as Natural Language Processing and self-drive vehicles, to name
but two. Machine learning has been in existence for more than 70 years,
though, all beginning in 1943. Warren McCulloch, a neuropsychologist,
and Walter Pitts, a mathematician, produced a paper about neurons and
the way they work. They went on to create a model with an electrical
circuit, producing the first-ever neural network.
Alan Turing created the Turing test in 1950. This test was used to
determine whether computers possessed real intelligence, and passing
the test required the computer to fool a human into believing it was a
human.
Arthur Samuel wrote the very first computer learning program in 1952, a
game of checkers that the computer improved at the more it played. It
did this by studying the moves, determining which ones came into the
winning strategies and then using those moves in its own program.
In 1957, Frank Rosenblatt designed the perceptron, a neural network for
computers that simulates human brain thought processes. Ten years later,
an algorithm was created called “Nearest Neighbor.” This algorithm
allowed computers to begin using pattern recognition, albeit very basic
at the time. This was ideal for salesmen to map routes, ensuring all their
visits could be completed within the shortest time.
It was in the 1990s that the biggest changes became apparent. Work was
moving from an approach driven almost entirely by knowledge to a more
data-driven one. Scientists started creating programs that allowed
computers to analyze huge swathes of data, learning from the results to
draw conclusions.
In 1997, Deep Blue, created by IBM, was the first computer-based chess
player to beat a world champion at his own game. The program searched
through potential moves, using its computing power to choose the best
ones. And just ten years later, Geoffrey Hinton coined the term “deep”
learning to describe the new algorithms that allowed computers to
recognize text and objects in images and videos.
In 2012, Alex Krizhevsky published a paper with Ilya Sutskever and
Geoffrey Hinton. An influential paper, it described a computing model
that could reduce the error rate significantly in image recognition
systems. At the same time, Google’s X Lab came up with an algorithm
that could browse YouTube videos autonomously, choosing those that
had cats in them.
In 2016, Google DeepMind researchers created AlphaGo to play the
ancient Chinese game, Go, and pitted it against the world champion, Lee
Sedol. AlphaGo beat the reigning world champion four times out of five.
Finally, in 2020, GPT-3 was released, arguably the most powerful
language model in the world. Released by OpenAI, the model could
generate fully functioning code, write creative fiction, write good
business memos, and so much more, with its use cases only limited by
imagination.
Machine Learning Features
Machine learning offers plenty of features, with the most prominent
being the following:
Automation – most email accounts have a spam folder where all
your spam emails appear. You might question how your email
provider knows which emails are spam, and the answer is machine
learning. The process is automated after the algorithm is taught to
recognize spam emails.
Being able to automate repetitive tasks is one of machine
learning's best features. Many businesses worldwide are using it to
deal with email and paperwork automation. Take the financial
sector, for example. Every day, lots of data-heavy, repetitive tasks
need to be performed, and machine learning takes the brunt of this
work, automating it to speed it up and freeing up valuable staff
hours for more important tasks.
Better Customer Experience – good customer relationships are
critical to businesses, helping to promote brand loyalty and drive
engagement. And the best way for a business to do this is to
provide better services and good customer experiences. Machine
learning comes into play with both of these. Think about when you
last saw ads on the internet or visited a shopping website. Most of
the time, the ads are relevant to things you have searched for, and
the shopping sites recommend products based on previous searches
or purchases. Machine learning is behind these incredibly accurate
recommendation systems and ensures that we can tailor
experiences to each user.
In terms of services, most businesses use chatbots to ensure that
they are available 24/7. And some are so intelligent that many
customers don’t even realize they are not chatting to a real human
being.
Automated Data Visualization – these days, vast amounts of data
are being generated daily by individuals and businesses, for
example, Google, Facebook, and Twitter. Think about the sheer
amount of data each of those must be generating daily. That data
can be used to visualize relationships, enabling businesses to make
decisions that benefit them and their customers. Using automated
data visualization platforms, businesses can gain useful insights
into the data and use them to improve customer experiences and
customer service.
BI or Business Intelligence – When machine learning
characteristics are merged with data analytics, particularly big
data, companies find it easier to find solutions to issues and use
them to grow their business and increase their profit. Pretty much
every industry uses this type of machine learning to improve its
operations.
With Python, you get the flexibility to choose between scripting and
object-oriented programming. You don’t need to recompile any code –
the changes can be implemented, and the results are shown instantly. It is
the most versatile of all the programming languages and is multi-
platforms, which is why it works so well for machine learning and data
science.
Different Types of Machine Learning
Machine learning falls into three primary categories:
Supervised learning
Unsupervised learning
Reinforcement learning
We’ll look briefly at each type before we turn our attention to the
popular machine learning algorithms.
Supervised Learning
We’ll start with the easiest of examples to explain supervised learning.
Let’s say you are trying to teach your child how to tell the difference
between a cat and a dog. How do you do it?
You point out a dog and tell your child, “that’s a dog.” Then you do the
same with a cat. Show them enough and, eventually, they will be able to
tell the difference. They may even begin to recognize different breeds in
time, even if they haven’t seen them before.
In the same way, supervised learning works with two sets of variables.
One is a target variable or labels, which is the variable we are predicting,
while the other is the features variable, which are the ones that help us
make the predictions. The model is shown the features and the labels that
go with them and will then find patterns in the data it looks at. To clear
this up, below is an example taken from a dataset about house prices. We
want to predict a house price based on the number of rooms. The target
variable is the price, and the feature it depends on is the number of
rooms.

Number of rooms    Price
1                  $100
3                  $300
5                  $500
There are thousands of rows and multiple features in real datasets, such
as the size, number of floors, location, etc.
So, the easiest way to explain supervised learning is to say that the
model has x, which is a set of input variables, and y, which is the output
variable. An algorithm is used to find the mapping function between x
and y, and the relationship is y = f(x).
It is called supervised learning because we already know what the output
will be, and we provide it to the machine. Each time it runs, the
algorithm is corrected, and this continues until we get the best possible
results. The data set is used to train the algorithm to an optimal outcome.
Supervised learning problems can be grouped as follows:
Regression – predicts future values. Historical data is used to train
the model, and the prediction would be the future value of a
property, for example.
Classification – the algorithm is trained on various labels to
identify things in a particular category, i.e., cats or dogs, oranges
or apples, etc.
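The two problem types above can be sketched with scikit-learn. The tiny datasets and the choice of estimators below are made up purely for illustration:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: predict a continuous value (price from number of rooms)
rooms = [[1], [3], [5]]
prices = [100, 300, 500]
reg = LinearRegression().fit(rooms, prices)
print(reg.predict([[4]]))     # close to 400

# Classification: predict a discrete label (cat vs. dog from weight in kg)
weights = [[4], [5], [20], [30]]
labels = ["cat", "cat", "dog", "dog"]
clf = DecisionTreeClassifier().fit(weights, labels)
print(clf.predict([[25]]))
```

Regression returns a number on a continuous scale, while classification returns one of the labels seen during training.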
Unsupervised Learning
Unsupervised learning does not involve any target variables. The
machine is given only the features to learn from, which it does alone and
finds structures in the data. The idea is to find the distribution that
underlies the data to learn more about the data.
Unsupervised learning can be grouped as follows:
Clustering – the input variables sharing characteristics are put
together, i.e., all users that search for the same things on the
internet
Association – the rules governing meaningful data associations are
discovered, i.e., if a person watches this, they are likely to watch
that too.
Reinforcement Learning
In reinforcement learning, models are trained to provide decisions based on
a reward/feedback system. They are left to learn how to achieve a goal in
complex situations and, whenever they achieve the goal during the learning,
they are rewarded.
It differs from supervised learning in that there isn’t an available answer, so
the agent determines what steps are needed to do the task. The machine uses
its own experiences to learn when no training data is provided.
Common Machine Learning Algorithms
There are far too many machine learning algorithms to possibly mention
them all, but we can talk about the most common ones you are likely to use.
Later, we’ll look at how to use some of these algorithms in machine
learning examples, but for now, here’s an overview of the most common
ones.
Naïve Bayes Classifier Algorithm
The Naïve Bayes classifier is not just one algorithm. It is a family of
classification algorithms, all based on Bayes' Theorem. They all share
one common principle – each pair of features being classified is
independent of the others.
Let’s dive straight into this with an example.
Let’s use a fictional dataset describing weather conditions for deciding
whether to play a golf game or not. Given a specific condition, the
classification is Yes (fit to play) or No (unfit to play.)
The dataset looks something like this:
Outlook Temperature Humidity Windy Play Golf
0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
10 Rainy Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Sunny Mild High True No
This dataset is split in two:
Feature matrix –where all the dataset rows (vectors) are stored –
the vectors contain the dependent feature values. In our dataset, the
four features are Outlook, Temperature, Windy, and Humidity.
Response vector – where all the class variable values, i.e.,
prediction or output, are stored for each feature matrix row. In our
dataset, the class variable name is called Play Golf.
Assumption
The Naïve Bayes classifiers share an assumption that all features will
provide the outcome with an equal and independent contribution. As far as
our dataset is concerned, we can understand this as:
An assumption that there is no dependence between any features
pair. For example, a temperature classified as Hot is independent of
humidity, and a Rainy outlook is not related to the wind. As such,
we assume the features are independent.
Each feature has the same importance (weight). For example, if you
only know the humidity and the temperature, you cannot provide an
accurate prediction. No attribute is considered irrelevant, and all
play an equal part in determining the outcome.
Bayes’ Theorem
The Bayes’ Theorem uses the probability of an event that happened
already to determine the probability of another one happening. The
Theorem is mathematically stated as follows, where A and B are both
events and P(B) ≠ 0:

P(A|B) = P(B|A) * P(A) / P(B)
We want to predict the probability of event A, should event B equal
true. Event B is also called Evidence.
P(A) is the prior probability of A, i.e., the probability an event will
occur before any evidence is seen. That evidence consists of an
unknown instance’s attribute value – in our case, it is event B.
P(A|B) is called the posterior probability of A, or the probability an
event will occur once the evidence is seen.
We can apply the Theorem to our dataset like this:

P(y|X) = P(X|y) * P(y) / P(X)

Where the class variable is y and the dependent features vector, size n, is
X, where

X = (x_1, x_2, ...., x_n)
Just to clear this up, a feature vector and its corresponding class variable
example is:
X = (Rainy, Hot, High, False)
y = No
Here, P(y|X) indicates the probability of not playing, given a set of
weather conditions as follows:
Rainy outlook
Hot temperature
High humidity
No wind
Naïve Assumption
Giving the Bayes Theorem a naïve assumption means providing
independence between the features. The evidence can now be split into its
independent parts. If any two events, A and B, are independent:
P(A,B) = P(A)P(B)
The result is:

P(y | x_1, ..., x_n) = P(x_1|y) * P(x_2|y) * ... * P(x_n|y) * P(y) /
(P(x_1) * P(x_2) * ... * P(x_n))

And this is expressed as:

P(y | x_1, ..., x_n) = P(y) * Π_{i=1..n} P(x_i|y) / (P(x_1) * P(x_2) * ... * P(x_n))

For a given input, the denominator stays constant, so that term can be
removed:

P(y | x_1, ..., x_n) ∝ P(y) * Π_{i=1..n} P(x_i|y)
The next step is creating a classifier model. We will determine the
probability of a given input set for all possible values of the class
variable y, choosing the output that offers the maximum probability. This is
mathematically expressed as:

y = argmax_y P(y) * Π_{i=1..n} P(x_i | y)

That leaves us having to calculate P(y) and P(x_i | y).
P(y) is also known as the class probability, and P(x_i | y) is the conditional
probability.
There are a few Naïve Bayes classifiers, and their main difference is via
the assumptions about the P(xi | y) distribution.
Let’s apply this formula to our weather dataset manually. We will need to
carry out some precomputations, i.e., for each x_i in X and y_i in y, we
must find P(X = x_i | y_i) by counting rows in the table above. For
example, the probability of playing given a cool temperature is
P(temperature = Cool | play golf = Yes) = 3/9.
We also need the class probabilities P(y). For example,
P(play golf = Yes) = 9/14.
Now the classifier is ready to go, so we’ll test it on a feature set we’ll call
‘today.’
today = (Sunny, Hot, Normal, False)
Given this, the probability of playing is:

P(Yes | today) = P(Sunny|Yes) * P(Hot|Yes) * P(Normal|Yes) * P(False|Yes) * P(Yes) / P(today)

And not playing is:

P(No | today) = P(Sunny|No) * P(Hot|No) * P(Normal|No) * P(False|No) * P(No) / P(today)

P(today) appears in both of the probabilities, so we can ignore it and look
for the proportional probabilities:

P(Yes | today) ∝ 3/9 * 2/9 * 6/9 * 6/9 * 9/14 ≈ 0.0212

and

P(No | today) ∝ 2/5 * 2/5 * 1/5 * 2/5 * 5/14 ≈ 0.0046

We can make the sum equal to 1 to convert the numbers into a probability
– this is called normalization:

P(Yes | today) = 0.0212 / (0.0212 + 0.0046) ≈ 0.82

and

P(No | today) = 0.0046 / (0.0212 + 0.0046) ≈ 0.18

Because

P(Yes | today) > P(No | today)

In that case, we predict that Yes, golf will be played.
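The manual calculation above can be checked with a short pure-Python script that counts rows of the 14-row golf table shown earlier (no libraries needed):

```python
data = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

def posterior(features):
    """Naive Bayes posterior P(y | features) for y in {Yes, No}."""
    scores = {}
    for label in ("Yes", "No"):
        rows = [r for r in data if r[-1] == label]
        score = len(rows) / len(data)          # class prior P(y)
        for i, value in enumerate(features):
            # conditional probability P(x_i | y), counted from the table
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[label] = score
    total = sum(scores.values())               # normalization step
    return {label: s / total for label, s in scores.items()}

print(posterior(("Sunny", "Hot", "Normal", False)))
```

The normalized posterior comes out at roughly 0.82 for Yes and 0.18 for No, so the classifier predicts that golf will be played.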


We’ve discussed a method that we can use for discrete data. Should we be
working with continuous data, assumptions need to be made about each
feature’s value distribution.
We’ll talk about one of the popular Naïve Bayes classifiers.
Gaussian Naive Bayes classifier
In this classifier, the assumption is that each feature’s associated
continuous values are distributed via a Gaussian distribution, also known
as a Normal distribution. When this is plotted, you see a curve shaped
like a bell, symmetric around the feature values’ mean.
The assumption is that the likelihood of the features is Gaussian, which
means the conditional probability is provided via:

P(x_i | y) = (1 / sqrt(2π * σ_y²)) * exp(-(x_i - μ_y)² / (2σ_y²))

where μ_y and σ_y are the mean and standard deviation of feature x_i for
class y.
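As a quick sketch of that Gaussian conditional probability, Python's `statistics.NormalDist` can evaluate the density directly. The class mean and standard deviation below are made-up illustrative values:

```python
from statistics import NormalDist
import math

# hypothetical per-class statistics for a single feature
mu_y, sigma_y = 5.0, 1.2

# P(x_i | y): Gaussian density evaluated at x_i = 4.4
likelihood = NormalDist(mu=mu_y, sigma=sigma_y).pdf(4.4)

# the same value computed from the formula, for comparison
manual = math.exp(-((4.4 - mu_y) ** 2) / (2 * sigma_y ** 2)) / (
    sigma_y * math.sqrt(2 * math.pi))
```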
Let’s see what a Python implementation of this classifier looks like using
Scikit-learn:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# training the model on the training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response
# values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred)*100)
And the output is:
Gaussian Naive Bayes model accuracy(in %): 95.0
K-Nearest Neighbors
K-Nearest Neighbors, otherwise known as K-NN, is a critical yet simple
classification algorithm. It is a supervised learning algorithm used for data
mining, pattern detection, and intrusion detection.
In real-life scenarios, it is widely applicable because it is non-parametric,
which means it makes no underlying assumptions about the data
distribution. Some prior data, the training data, is provided, and its
attributes are used to classify new coordinates into specific groups.
Here is an example showing a table with some data points and two features.
With another set of data points, called the testing data, we can analyze the
data and allocate each point to a group. Unclassified points in the table
are marked "White."

Intuition
If those points were plotted on a graph, we might see some groups or
clusters. Unclassified points can be assigned to a group just by seeing which
group the unclassified point’s nearest neighbors are in. That means a point
near a cluster classified ‘Red’ is more likely to be classified as ‘Red.’
Using intuition, it’s clear that the first point (2.5, 7) should have a
classification of ‘Green’ while the second one (5.5, 4.5) should be ‘Red.’
Algorithm
Here, m indicates how many training samples we have, and p is an
unknown point. Here’s what we want to do:
1. The training samples are stored in arr[], which is an array containing
data points. Each of the array elements represents a tuple (x, y).
2. For i = 0 to m: calculate d(arr[i], p), which is the Euclidean distance
between arr[i] and p.
3. Let S be the set of the K smallest distances retrieved. Each distance
corresponds to a data point we already classified.
4. The majority label among S is returned.
We can keep K as an odd number, making it easier to work out the clear
majority where there are only two possible groups. As K increases, the
boundaries across the classifications become more defined and smoother.
And, as the number of data points increases in the training set, the better the
classifier accuracy becomes.
Here’s an example program, written in C++, where 0 and 1 are the
classifiers:
// C++ program to find groups of unknown
// Points using K nearest neighbor algorithm.
#include <bits/stdc++.h>
using namespace std;

struct Point
{
    int val;         // Group of point
    double x, y;     // Co-ordinate of point
    double distance; // Distance from test point
};

// Used to sort an array of points by increasing
// order of distance
bool comparison(Point a, Point b)
{
    return (a.distance < b.distance);
}

// This function finds classification of point p using
// k nearest neighbor algorithm. It assumes only two
// groups and returns 0 if p belongs to group 0, else
// 1 (belongs to group 1).
int classifyAPoint(Point arr[], int n, int k, Point p)
{
    // Fill distances of all points from p
    for (int i = 0; i < n; i++)
        arr[i].distance =
            sqrt((arr[i].x - p.x) * (arr[i].x - p.x) +
                 (arr[i].y - p.y) * (arr[i].y - p.y));

    // Sort the Points by distance from p
    sort(arr, arr + n, comparison);

    // Now consider the first k elements and only
    // two groups
    int freq1 = 0; // Frequency of group 0
    int freq2 = 0; // Frequency of group 1
    for (int i = 0; i < k; i++)
    {
        if (arr[i].val == 0)
            freq1++;
        else if (arr[i].val == 1)
            freq2++;
    }

    return (freq1 > freq2 ? 0 : 1);
}

// Driver code
int main()
{
    int n = 17; // Number of data points
    Point arr[n];

    arr[0].x = 1;    arr[0].y = 12;  arr[0].val = 0;
    arr[1].x = 2;    arr[1].y = 5;   arr[1].val = 0;
    arr[2].x = 5;    arr[2].y = 3;   arr[2].val = 1;
    arr[3].x = 3;    arr[3].y = 2;   arr[3].val = 1;
    arr[4].x = 3;    arr[4].y = 6;   arr[4].val = 0;
    arr[5].x = 1.5;  arr[5].y = 9;   arr[5].val = 1;
    arr[6].x = 7;    arr[6].y = 2;   arr[6].val = 1;
    arr[7].x = 6;    arr[7].y = 1;   arr[7].val = 1;
    arr[8].x = 3.8;  arr[8].y = 3;   arr[8].val = 1;
    arr[9].x = 3;    arr[9].y = 10;  arr[9].val = 0;
    arr[10].x = 5.6; arr[10].y = 4;  arr[10].val = 1;
    arr[11].x = 4;   arr[11].y = 2;  arr[11].val = 1;
    arr[12].x = 3.5; arr[12].y = 8;  arr[12].val = 0;
    arr[13].x = 2;   arr[13].y = 11; arr[13].val = 0;
    arr[14].x = 2;   arr[14].y = 5;  arr[14].val = 1;
    arr[15].x = 2;   arr[15].y = 9;  arr[15].val = 0;
    arr[16].x = 1;   arr[16].y = 7;  arr[16].val = 0;

    /* Testing Point */
    Point p;
    p.x = 2.5;
    p.y = 7;

    // Parameter to decide group of the testing point
    int k = 3;
    printf("The value classified to the unknown point"
           " is %d.\n", classifyAPoint(arr, n, k, p));
    return 0;
}
And the output is:
The value classified to the unknown point is 0.
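Since this book works in Python, the same routine can also be sketched as a short Python function. The helper below is a hypothetical translation of the C++ program, using the same 17 training points and the same test point (2.5, 7):

```python
import math

def classify_a_point(points, k, p):
    """Classify point p into group 0 or 1 by majority vote
    among its k nearest training points."""
    # sort the training points by Euclidean distance from p
    by_distance = sorted(points, key=lambda q: math.dist((q[0], q[1]), p))
    # collect the group labels of the k nearest neighbors
    votes = [group for _, _, group in by_distance[:k]]
    # return the majority label (ties go to group 1, as in the C++ version)
    return 0 if votes.count(0) > votes.count(1) else 1

# (x, y, group) training samples, matching the C++ example above
points = [(1, 12, 0), (2, 5, 0), (5, 3, 1), (3, 2, 1), (3, 6, 0),
          (1.5, 9, 1), (7, 2, 1), (6, 1, 1), (3.8, 3, 1), (3, 10, 0),
          (5.6, 4, 1), (4, 2, 1), (3.5, 8, 0), (2, 11, 0), (2, 5, 1),
          (2, 9, 0), (1, 7, 0)]

print(classify_a_point(points, k=3, p=(2.5, 7)))  # prints 0
```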
Support Vector Machine Learning Algorithm
Support vector machines or SVMs are grouped under the supervised
learning algorithms used to analyze the data used in regression and
classification analysis. The SVM is classed as a discriminative classifier,
formally defined by a separating hyperplane. In simple terms, with labeled
training data (the supervised learning part), the output will be an optimal
hyperplane that takes new examples and categorizes them.
SVMs represent examples as points in space. Each is mapped so that the
separate categories have a wide, clear gap separating them. As well as being
used for linear classification, an SVM can also be used for efficient non-
linear classification, with the inputs implicitly mapped into high-
dimensional feature spaces.
So, what does an SVM do?
When you provide an SVM algorithm with training examples, each one
labeled as belonging to one of the two categories, it will build a model that
will assign newly provided examples to one of the categories. This makes it
a 'non-probabilistic binary linear classifier.'
To illustrate this, we'll use an SVM model that uses ML tools, like Scikit-
learn, to classify cancer UCI datasets. For this, you need to have NumPy,
Pandas, Matplotlib, and Scikit-learn.
First, the dataset must be created:
# importing scikit-learn's make_blobs
# (sklearn.datasets.samples_generator was removed in newer versions;
# import directly from sklearn.datasets instead)
from sklearn.datasets import make_blobs

# creating dataset X containing n_samples
# Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)

import matplotlib.pyplot as plt

# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()
Support vector machines don't just draw lines between classes. They also
consider a region around the line of a specified width. Here's an example:
import numpy as np

# creating line space between -1 and 3.5
xfit = np.linspace(-1, 3.5)

# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5)
plt.show()
Importing Datasets
The intuition behind SVMs is to optimize a linear discriminant model
representing the perpendicular distance between the datasets. Let's
use our training data to train the model. Before we do that, the cancer
dataset needs to be imported as a CSV file, and we will train on two of the
features:
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# reading the csv file and extracting the class column to y
x = pd.read_csv("C:\...\cancer.csv")
a = np.array(x)
y = a[:, 30]  # classes having 0 and 1

# extracting two features
x = np.column_stack((x.malignant, x.benign))

# 569 samples and 2 features
x.shape

print(x, y)
The output would be:
[[ 122.8 1001. ]
[ 132.9 1326. ]
[ 130. 1203. ]
...,
[ 108.3 858.1 ]
[ 140.1 1265. ]
[ 47.92 181. ]]

array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., ....,
1.])
Fitting Support Vector Machines
Next, we need to fit the classifier to the points.
# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')

# fitting x samples and y classes
clf.fit(x, y)
Once the model is fitted, it can be used to make predictions on new values:
clf.predict([[120, 990]])
array([ 0.])

clf.predict([[85, 550]])
array([ 1.])
The data and the pre-processing methods are analyzed, and Matplotlib is
used to produce optimal hyperplanes.
Linear Regression Machine Learning Algorithm
No doubt you have heard of linear regression, as it is one of the most
popular machine learning algorithms. It is a statistical method used to
model relationships between one dependent variable and a specified set of
independent variables. For this section, the dependent variable is referred to
as response and the independent variables as features.
Simple Linear Regression
This is the most basic form of linear regression and is used to predict
responses from single features. Here, the assumption is that there is a linear
relationship between two variables. As such, we want a linear function that
can predict the response (y) value as accurately as it can as a function of the
independent (x) variable or feature.
Let’s look at a dataset with a response value for each feature:
x 0 1 2 3 4 5 6 7 8 9
y 1 3 2 5 7 8 8 9 10 12

We define the following for generality:
x is defined as a feature vector, i.e., x = [x_1, x_2, ...., x_n]
y is defined as a response vector, i.e., y = [y_1, y_2, ...., y_n]
These are defined for n observations; here, n = 10.
The model’s task is to find the best-fitting line to predict responses for new
feature values, i.e., those features not in the dataset already. The line is
known as a regression line, and its equation is represented as:

h(x_i) = b_0 + b_1 * x_i

Here:
h(x_i) represents the ith observation’s predicted response
b_0 and b_1 are both coefficients, representing the y-intercept
and the regression line slope, respectively.
Creating the model requires that we estimate or learn the coefficients’
values, and once they have been estimated, the model can predict responses.
We will be making use of the principle of least squares, so consider the
following:

y_i = b_0 + b_1 * x_i + e_i

In this, e_i is the ith observation’s residual error, and we want to reduce the
total residual error.
The cost function or squared error is defined as:

J(b_0, b_1) = (1 / 2n) * Σ e_i²

We want to find the b_0 and b_1 values where J(b_0, b_1) is the minimum.
Without dragging you through all the mathematical details, this is the result:

b_1 = SS_xy / SS_xx
b_0 = m_y - b_1 * m_x

Here, m_x and m_y are the means of x and y, and SS_xy represents the sum of
cross-deviations of y and x:

SS_xy = Σ (y_i * x_i) - n * m_y * m_x

While SS_xx represents the sum of squared deviations of x:

SS_xx = Σ (x_i * x_i) - n * m_x * m_x
Let’s look at the Python implementation of our dataset:


import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
    \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
And the output is:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
Multiple Linear Regression
Multiple linear regression tries to model relationships between at least two
features and one response, and to do this, a linear equation is fit to the
observed data. It’s nothing more complex than an extension of simple linear
regression.
Let’s say we have a dataset containing p independent variables or features
and one dependent variable or response. The dataset has n rows
(observations).
We define the following:
X (feature matrix) = a matrix of size n × p, where x_{ij} denotes the
value of the ith observation’s jth feature.
y (response vector) = a vector of size n, where y_{i} denotes the ith
observation’s response value.
The p features regression line is represented as:

h(x_i) = b_0 + b_1 * x_{i1} + b_2 * x_{i2} + .... + b_p * x_{ip}

Where h(x_i) is the ith observation’s predicted response, and the regression
coefficients are b_0, b_1, ...., b_p.
We can also write:

y_i = h(x_i) + e_i, i.e., e_i = y_i - h(x_i)

Where the ith observation’s residual error is represented by e_i.
By appending a column of ones to the feature matrix X, the linear model can
be generalized a bit further, and it can now be expressed as the following in
terms of matrices:

y = Xb + e

where b = [b_0, b_1, ...., b_p] is the coefficient vector and
e = [e_1, e_2, ...., e_n] is the residual vector.
The estimate of b can now be determined using the Least Squares method,
which is used to determine the b where the total residual errors are
minimized. The result is:

b' = (X'X)^(-1) X'y

Where ' represents the matrix transpose and ^(-1) represents the matrix
inverse.
Now that the least-squares estimate b' is known, we can estimate the
multiple linear regression model as:

y' = Xb'

Where the estimated response vector is y'.
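The closed-form least-squares estimate can be verified with NumPy on a tiny made-up dataset. The numbers below were generated from y = 1 + 2*x1 + 1*x2, so the recovered coefficients should match:

```python
import numpy as np

# made-up data: 5 observations, 2 features, generated from y = 1 + 2*x1 + 1*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# prepend a column of ones so the first coefficient acts as the intercept b_0
Xb = np.column_stack([np.ones(len(X)), X])

# normal equation: b' = (X'X)^(-1) X'y
b = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(b)  # approximately [1. 2. 1.]
```

In practice, `np.linalg.lstsq` (or a library estimator) is preferred over explicitly inverting X'X, which can be numerically unstable.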


Here’s a Python example of multiple linear regression on the Boston house
price dataset:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, metrics

# load the boston dataset
boston = datasets.load_boston(return_X_y=False)

# defining feature matrix (X) and response vector (y)
X = boston.data
y = boston.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# create linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients: ', reg.coef_)

# variance score: 1 means perfect prediction
print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

## setting plot style
plt.style.use('fivethirtyeight')

## plotting residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color = "green", s = 10, label = 'Train data')

## plotting residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color = "blue", s = 10, label = 'Test data')

## plotting line for zero residual error
plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)

## plotting legend
plt.legend(loc = 'upper right')

## plot title
plt.title("Residual errors")

## method call for showing the plot
plt.show()
And the output is:
Coefficients:
[ -8.80740828e-02   6.72507352e-02   5.10280463e-02   2.18879172e+00
  -1.72283734e+01   3.62985243e+00   2.13933641e-03  -1.36531300e+00
   2.88788067e-01  -1.22618657e-02  -8.36014969e-01   9.53058061e-03
  -5.05036163e-01]
Variance score: 0.720898784611
In this example, the Explained Variance Score is used to determine the
accuracy score.
We define:

explained_variance_score = 1 - Var{y - y'} / Var{y}

In this, y' is the estimated target output, y is the correct target output, and
Var is the variance, i.e., the square of the standard deviation. The best
possible score is 1.0; lower values are worse.
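The formula can be checked directly with NumPy on toy numbers (made up purely for illustration):

```python
import numpy as np

# toy true and estimated target values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_est = np.array([2.5, 5.0, 3.0, 8.0])

# explained_variance_score = 1 - Var{y - y'} / Var{y}
score = 1 - np.var(y_true - y_est) / np.var(y_true)
print(score)  # about 0.90
```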
Assumptions
These are the assumptions made by a linear regression model on the dataset
it is applied to:
Linear relationship – there should be a linear relationship between
the response and the feature variables. Scatter plots can be used to test
the linearity assumption.
Little to no multicollinearity – this occurs when the independent
variables are not independent of one another.
Little to no auto-correlation – this occurs when there is no
independence between the residual errors.
Homoscedasticity – this is when the error term is the same across all
the independent variables. This is defined as the random disturbance
or noise in the relationship between the independent and dependent
variables.
Logistic Regression Machine Learning Algorithm
Logistic regression is another machine learning technique from the field of
statistics and is the go-to for solving binary classification, where there are
two class values. It is named after the function at the very center of the
method, known as the logistic function.
You may also know it as the sigmoid function. It was developed to describe
the population growth properties in ecology, quickly rising and maxing out
at the environment's carrying capacity. An S curve, it takes numbers (real
values) and maps them to values between but not exactly at 0 and 1.
1 / (1 + e^-value)
Here, e is the base of the natural logarithms, while value is the numerical
value you want to transform. Let's see how the logistic function is used in
logistic regression.
Representation
The representation used by logistic regression is an equation similar to
linear regression. Input values, x, are linearly combined using coefficient
values or weights to predict y, the output value. It differs from linear
regression in that the modeled output value is binary, i.e., 0 or 1, and not a
numeric value.
Here’s an example of a logistic regression algorithm:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
The predicted output is y, the intercept or bias is b0, and the coefficient for
x, the input value, is b1. There is a b coefficient associated with each input
data column. This coefficient is a constant real value that the model learns
from the training data.
The coefficients you see in the equation are the actual model representation
stored in a file or in memory.
Probability Prediction
Logistic regression is used to model the default class's probability. For
example, let's say we are modeling sex (male or female) based on a person's
height. In that case, the default class might be male, and the model is
written for the probability of a male, given a certain height. More formally,
that would be:
P(sex=male|height)
We could write this another way and say that the model is for the
probability of X, an input belonging to Y=1, the default class. This could be
written formally as:
P(X) = P(Y=1|X)
Be aware that the predicted probability must be transformed into a binary
value, 0 or 1, to actually make a prediction.
Logistic regression may be linear, but the logistic function is used to
transform the predictions. This means that the predictions can no longer be
understood as linear combinations of inputs like we do with linear
regression. Continuing from earlier, we can state this model as:
p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))
Without going too deeply into the math, the above equation can be turned
around as follows, not forgetting that e can be removed from one side with
the addition of ln, a natural logarithm, on the other side:
ln(p(X) / 1 – p(X)) = b0 + b1 * X
This is useful because it allows us to see that the output calculation on the
right side is linear, while the left-side input is a log of the default class's
probability.
The right-side ratio is known as the default class's odds rather than
probability. We calculate odds as a ratio of the event's probability, divided
by the probability of the event not happening, for example, 0.8/(1-0.8). The
odds of this are 4, so we could write the following instead:
ln(odds) = b0 + b1 * X
As the odds are log-transformed, the left-hand side is called the logit or log-
odds.
The exponent can be shifted to the right again and written as:
odds = e^(b0 + b1 * X)
This helps us see that our model is still linear in its output combination, but
the combination relates to the default class's log-odds.
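The probability/odds/log-odds relationships above can be sketched in a few lines, using the example probability of 0.8:

```python
# A sketch of the probability -> odds -> log-odds transforms described above.
import math

p = 0.8
odds = p / (1 - p)          # probability to odds: 0.8 / (1 - 0.8)
log_odds = math.log(odds)   # odds to log-odds (the logit)

# the transforms invert each other, recovering the original probability
p_back = math.exp(log_odds) / (1 + math.exp(log_odds))

print(round(odds, 4))
print(round(p_back, 4))
```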
A Logistic Regression Model
You should use your training data to estimate the logistic regression
algorithm's coefficients (Beta values b). This is done with the maximum-
likelihood estimation. This is one of the more common learning algorithms
that many other algorithms use, but it is worth noting that it makes
assumptions about your data distribution.
The very best coefficients would give us a model that predicts values as
close as possible to 1 for our default class and as close as possible to 0 for
the other class. In this case, the maximum-likelihood intuition for logistic
regression is that a search procedure seeks the coefficient values that
minimize the error between the model's predicted probabilities and the data,
so that an instance of the default class gets a predicted probability as close
as possible to 1.
The math behind maximum-likelihood is out of the scope of this book. Still,
it is sufficient to say that minimization algorithms are used to optimize the
coefficient's best values for the training data.
Making Predictions with Logistic Regression
This is as easy as inputting the numbers to the equation and calculating your
result. Here's an example to make this clearer.
Let's assume we have a model that uses height to predict whether a person
is male or female – we'll use a base height of 150cm.
The coefficients have been learned – b0 = -100 and b1 = 0.6. Using the
above equation, we calculate the probability that a person is male, given a
height of 150cm. More formally, this would be
P(male|height=150)
EXP() is used for e:
y = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))
y = EXP(-100 + 0.6*150) / (1 + EXP(-100 + 0.6*150))
y = 0.0000453978687
This gives us a near-zero probability of a person being male.
In practice, the probabilities can be used directly. However, we want a solid
answer from this classification problem, so the probabilities can be snapped
to binary class values, i.e.
0 if p(male) < 0.5
1 if p(male) >= 0.5
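Reproducing the worked calculation and the snapping rule above in code (using the coefficients from the equation, b0 = -100 and b1 = 0.6):

```python
# The worked example above: b0 = -100, b1 = 0.6, height = 150 cm.
import math

b0, b1 = -100, 0.6
height = 150

# probability of male given the height
p_male = math.exp(b0 + b1 * height) / (1 + math.exp(b0 + b1 * height))
print(p_male)

# snap the probability to a binary class value at the 0.5 threshold
label = 1 if p_male >= 0.5 else 0
print(label)
```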
The next step is to prepare the data.
Prepare Data for Logistic Regression
Logistic regression makes assumptions about data distribution and
relationships which are similar to those made by linear regression. Much
work has gone into the definition of these assumptions, and they are written
in precise statistical and probabilistic language. My advice? Use these as
rules of thumb or guidelines, and don't be afraid to experiment with
preparation schemes.
Here’s an example of logistic regression in Python code:
We’ll use the Pima Indians Diabetes dataset for this, which is freely
available online. First, we load the data using the read_csv function in
Pandas:
#import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi',
'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None,
names=col_names)
pima.head()
Next, we want to select the features. The columns can be divided into two
groups – the dependent variable (the target) and the independent variables
(the features).
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
The dataset must be divided into a training and test dataset. We do that with
the train_test_split function, passing three parameters – features, target,
test_set_size. You can also use random_state to choose records randomly:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
Note that older tutorials import train_test_split from
sklearn.cross_validation, but that module was deprecated in scikit-learn 0.18
and removed in 0.20 in favor of sklearn.model_selection.
We used a ratio of 75:25 to break the dataset – 75% training data and 25%
testing data.
Developing the Model and Making the Prediction
First, import the logistic regression model and use the LogisticRegression()
function to create a classifier object. Then you can fit the model using fit()
on the training set and use predict() on the test set to make the predictions:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

# make predictions on the test set
y_pred = logreg.predict(X_test)
Using the Confusion Matrix to Evaluate the Model
A confusion matrix is a table we use when we want the performance of our
classification model evaluated. You can also use it to visualize an
algorithm's performance. A confusion matrix shows a class-wise summing
up of the correct and incorrect predictions:
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[119, 11],
[ 26, 36]])
The result is a confusion matrix in the form of an array object. The matrix
dimension is 2*2 because this was a binary classification problem with two
classes, 0 and 1. The diagonal values represent accurate predictions, while
the off-diagonal values represent inaccurate ones. In this case, 119 and 36
are the correct predictions, and 26 and 11 are the incorrect ones.
Using Heatmap to Visualize the Confusion Matrix
This uses Seaborn and Matplotlib to visualize the results in the confusion
matrix – we’ll be using Heatmap:
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True,
cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Evaluation Metrics
Now we can evaluate the model using precision, recall, and accuracy
evaluation metrics:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
The result is:
Accuracy: 0.8072916666666666
Precision: 0.7659574468085106
Recall: 0.5806451612903226
We got a classification accuracy of 80%, which is considered pretty good.
Precision measures how often the model is correct when it predicts the
positive class: here, when the model predicts that a person has diabetes, it
is right about 76% of the time.
Recall measures how many of the actual positives the model finds: it
correctly identifies 58% of the people with diabetes in the test set.
ROC Curve
The ROC or Receiver Operating Characteristic curve plots the true against
the false positive rate and shows the tradeoff between specificity and
sensitivity.
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
The AUC score is 0.86. A score of 0.5 indicates a classifier no better than
random guessing, while a score of 1 indicates a perfect classifier, so 0.86
is a good result.
Your ultimate focus in predictive modeling is making predictions that are as
accurate as possible, rather than analyzing the results. As such, some
assumptions can be broken, so long as you have a strong model that
performs well:
Binary Output Variable: this might seem obvious, given that we
mentioned it earlier, but logistic regression is designed for binary
classification. It predicts the probability of an instance belonging
to the first or default class and snaps that probability to a 0 or 1
classification.
Removing Noise: one assumption is that the output variable, y, has no
error. You should remove outliers and potentially misclassified data
from your training data right from the start.
Gaussian Distribution: because logistic regression is linear but with
the output being transformed non-linearly, an assumption is made of a
linear relationship between the input and output variables. You get a
more accurate model when you transform the input variables to
expose the relationship better. For example, univariate transforms,
such as Box-cox, log, and root, can do this.
Removing Correlated Inputs: in the same way as linear regression,
your model can end up overfitting if you have too many highly
correlated inputs. Think about removing the highly correlated inputs
and calculating pairwise correlations.
Failure to Converge: it may happen that the maximum-likelihood
estimation process that learns the coefficients fails to converge. This
can occur when the data contains many highly correlated inputs, or
when the data is sparse, meaning the input data contains many zeros.
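The pairwise-correlation check suggested above can be sketched with pandas' DataFrame.corr(); the data here is made up, with column "b" deliberately constructed as a near-copy of "a":

```python
# A sketch (with made-up data) of the pairwise-correlation check above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=100),  # nearly a copy of "a"
    "c": rng.normal(size=100),                       # independent column
})

corr = df.corr().abs()
print(corr.round(2))

# flag pairs whose absolute correlation exceeds a threshold, e.g. 0.9
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and corr.loc[i, j] > 0.9]
print(high)
```

Columns flagged this way are candidates for removal before fitting the model.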
Decision Tree Machine Learning Algorithm
The decision tree is a popular algorithm, undoubtedly one of the most
popular, and it comes under supervised learning. It is used successfully with
categorical and continuous output variables.
To show you how the decision tree algorithm works, we're going to walk
through implementing one on the UCI database, Balance Scale Weight &
Distance. The database contains 625 instances – 49 balanced and 288 each
to the left and right. There are four numeric attributes and the class name.
Python Packages
You need three Python packages for this:
Sklearn – a popular ML package containing many algorithms, and
we'll be using some popular modules, such as train_test_split,
accuracy_score, and DecisionTreeClassifier.
NumPy – a numeric module containing useful math functions; it is
used to read and manipulate data in NumPy arrays.
Pandas – commonly used to read/write files and uses DataFrames to
manipulate data.
Sklearn contains the modules that help us implement the algorithm, and it is
installed with the following command using pip:
pip install -U scikit-learn
Before you dive into this, ensure that NumPy and SciPy are installed and
that you have pip installed too. If pip is not installed, use the following
command to get it:
python get-pip.py
If you are using conda, use this command instead:
conda install scikit-learn
Decision Tree Assumptions
The following are the assumptions made for this algorithm:
The entire training set is considered as the root at the start of the
process.
It is assumed that the attributes are categorical for information gain
and continuous for the Gini index.
Records are recursively distributed, based on attribute values
Statistical methods are used to order the attributes as internal node or
root.
Pseudocode:
The best attribute must be found and placed on the tree's root node
The training dataset needs to be divided into subsets, ensuring that
each one has the same attribute value
Leaf nodes must be found in every branch by repeating steps one and
two on every subset
The implementation of the algorithm requires the following two phases:
The Building Phase – the dataset is preprocessed, then sklearn is
used to split the dataset into test and train sets before training the
classifier.
The Operational Phase – predictions are made and the accuracy
calculated
Importing the data
Data import and manipulation are done using the Pandas package. The URL
we use fetches the dataset directly from the UCI website, so you don't need
to download it at the start. That means you will need a good internet
connection when you run the code. Because "," is used to separate the
dataset's values, we pass it as the sep parameter. Lastly, because the dataset
has no header row, the header parameter must be passed as None; if no
header parameter is passed, the first line of the dataset will be used as the
header.
Slicing the Data
Before we train our model, the dataset must be split into two – the training
and testing sets. To do this, the train_test_split module from sklearn is used.
First, the target variable must be split from the dataset attributes:
X = balance_data.values[:, 1:5]
Y = balance_data.values[:,0]
Those two lines of code will separate the dataset. X and Y are both
variables – X stores the attributes, and Y stores the dataset's target variable.
Next, we need to split the dataset into the training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size = 0.3, random_state = 100)
These code lines split the dataset into a ratio of 70:30 – 70% for training
and 30% for testing. The parameter value for test_size is passed as 0.3. The
variable called random_state is a pseudo-random number generator
typically used in random sampling cases.
Code Terminology:
Both the information gain and Gini Index methods select the right attribute
to place at the internal node or root node.
Gini Index:
This is a metric used to measure how often elements chosen at
random would be identified wrongly.
Attributes with lower Gini Indexes are preferred
Sklearn provides support for the "gini" criteria and takes "gini" value
by default.
Entropy:
This is a measure of a random variable's uncertainty, characterizing
impurity in an arbitrary example collection.
The higher the entropy, the greater the information content.
Information Gain:
Typically, entropy will change when a node is used to divide training
instances in a decision tree into smaller sets. Information gain is one
measure of change.
Sklearn supports the "entropy" criterion for Information Gain. Because
"gini" is the default, the "entropy" criterion must be specified
explicitly if we want to use Information Gain.
Accuracy Score:
This is used to calculate the trained classifier's accuracy score.
Confusion Matrix:
This is used to help understand how the trained classifier behaves
over the test dataset or help validate the dataset.
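Before the full program, here is a small sketch (not part of the book's code) of how the Gini index and entropy measures above are computed for a list of class labels:

```python
# A sketch of the Gini index and entropy impurity measures for class labels.
from collections import Counter
import math

def gini(labels):
    # 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # negative sum of p * log2(p) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["L"] * 10               # a pure node: no impurity
mixed = ["L"] * 5 + ["R"] * 5   # a 50/50 node: maximum impurity for 2 classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Lower values mean purer nodes, which is why attributes with lower Gini indexes are preferred when splitting.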
Let's get into the code:
# Run this program on your local Python
# interpreter, provided you have installed
# the required libraries.

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-'+
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
        confusion_matrix(y_test, y_pred))

    print("Accuracy : ",
        accuracy_score(y_test, y_pred)*100)

    print("Report : ",
        classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")

    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")

    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling the main function
if __name__ == "__main__":
    main()
Here’s the data information:
Dataset Length: 625
Dataset Shape: (625, 5)
Dataset: 0 1 2 3 4
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
Using the Gini Index, the results are:
Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L'
'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']
The confusion matrix is:
[[ 0 6 7]
[ 0 67 18]
[ 0 19 71]]
The accuracy is:
73.4042553191
The report:
precision recall f1-score support
B 0.00 0.00 0.00 13
L 0.73 0.79 0.76 85
R 0.74 0.79 0.76 90
avg/total 0.68 0.73 0.71 188
Using Entropy, the results are:
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L'
'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L'
'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L'
'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R'
'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
The confusion matrix is:
[[ 0 6 7]
[ 0 63 22]
[ 0 20 70]]
The accuracy is
70.7446808511
And the report is:
precision recall f1-score support
B 0.00 0.00 0.00 13
L 0.71 0.74 0.72 85
R 0.71 0.78 0.74 90
avg / total 0.66 0.71 0.68 188
Random Forest Machine Learning Algorithm
The random forest classifier is another supervised machine learning
algorithm that can be used for regression, classification, and other decision
tree tasks. The random forest creates sets of decision trees from random
training data subsets. It then makes its decision using votes from the trees.
To demonstrate how this works, we’ll use the well-known iris flower
dataset to train the model and test it. The model will be built to determine
the flower classifications.
First, we load the dataset:
# importing required libraries
# importing Scikit-learn library and datasets package
from sklearn import datasets

# Loading the iris plants dataset (classification)


iris = datasets.load_iris()
Code: checking the dataset's contents and the feature names present in it.
print(iris.target_names)
Output:
[‘setosa’ ‘versicolor’ ‘virginica’]
Code:

print(iris.feature_names)
Output:
[‘sepal length (cm)’, ’sepal width (cm)’, ’petal length (cm)’, ’petal
width (cm)’]
Code:
# dividing the dataset into two parts, i.e., training and test datasets
X, y = datasets.load_iris(return_X_y=True)

# Splitting arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split
# i.e., 30% training dataset and 70% test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70)
Code: Importing the required libraries and the random forest classifier
module.
# importing random forest classifier from ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# creating dataframe of IRIS dataset
data = pd.DataFrame({'sepallength': iris.data[:, 0],
    'sepalwidth': iris.data[:, 1],
    'petallength': iris.data[:, 2],
    'petalwidth': iris.data[:, 3],
    'species': iris.target})
Code: Looking at a dataset
# printing the top 5 datasets in iris dataset
print(data.head())
The output of this is:
   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0
Next, we create the Random Forest classifier, train it, and then test it:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators=100)

# Training the model on the training dataset
# fit function is used to train the model using the training sets as
# parameters
clf.fit(X_train, y_train)

# performing predictions on the test dataset
y_pred = clf.predict(X_test)

# metrics are used to find accuracy or error
from sklearn import metrics

# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ",
metrics.accuracy_score(y_test, y_pred))
Output:
ACCURACY OF THE MODEL: 0.9238095238095239
Code: predicting the type of flower from the given measurements
# predicting which type of flower it is.
clf.predict([[3, 3, 2, 2]])
And the output is:
array([0])
The result tells us the flower classification is the setosa type. Note that the
dataset contains three types – setosa, versicolor, and virginica.
Next, we want to find the important features in the dataset:
# importing random forest classifier from ensemble module
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train, y_train)
Code: Calculating feature importance
# using the feature importance variable
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index =
iris.feature_names).sort_values(ascending = False)
feature_imp
The output is:
petal width (cm) 0.458607
petal length (cm) 0.413859
sepal length (cm) 0.103600
sepal width (cm) 0.023933
dtype: float64
Each decision tree created by the random forest algorithm has a high
variance. However, when they are all combined in parallel, the variance is
low. That's because each tree is trained on its own sample of the data, and
the output depends not on one tree but on all of them. In classification
problems, a majority voting classifier produces the final output. In
regression problems, the output is the result of aggregation – the mean of
all the trees' outputs.
Random forest uses the aggregation and bootstrap technique, also called
bagging. The idea is to combine several decision trees to determine the
output rather than relying on the individual ones.
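The bagging idea can be sketched in a few lines: each tree sees a bootstrap sample of the rows, and the forest's classification is a majority vote over the trees. This toy version uses the iris dataset from earlier in the chapter:

```python
# A toy sketch of bagging: bootstrap samples plus a majority vote.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# train each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(5):
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows])
    trees.append(tree)

# classify one sample by majority vote over the individual trees
votes = [int(t.predict(X[:1])[0]) for t in trees]
majority = Counter(votes).most_common(1)[0][0]
print(majority)
```

A real random forest additionally samples a random subset of features at each split, which this sketch omits for brevity.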
The bootstrap part of the technique is where random row and feature
sampling forms the sample dataset for each model. A random forest
regression task is then approached in the same way as any other machine
learning technique:
A specific data or question is designed, and the source is obtained to
work out what data we need
The data must be in an accessible format, or it must be converted to
an accessible format
All noticeable anomalies must be specified, along with missing data
points that might be needed to get the right data
The machine learning model is created, and a baseline model set that
you want to be achieved
The machine learning model is trained on the data
The test data is used to test the model
The performance metrics from the test and predicted data are
compared
If the model doesn’t do what it should, work on improving the model,
changing the data, or using a different modeling technique.
At the end, the data is interpreted and reported.
Here’s another example of how random forest works.
First, the libraries are imported:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, the dataset is imported and printed:
data = pd.read_csv('Salaries.csv')
print(data)
Then every row of column 1 is selected from the dataset as x, and every row
of column 2 as y:
x = data.iloc[:, 1:2].values
print(x)
y = data.iloc[:, 2].values
The random forest regressor is fitted to the dataset:
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor

# create regressor object


regressor = RandomForestRegressor(n_estimators = 100,
random_state = 0)

# fit the regressor with x and y data


regressor.fit(x, y)
The result is predicted:
Y_pred = regressor.predict(np.array([6.5]).reshape(1, 1)) # test the
output by changing values
The result is visualized:
# Visualizing the Random Forest Regression results

# arange for creating a range of values


# from min value of x to max
# value of x with a difference of 0.01
# between two consecutive values
X_grid = np.arange(min(x), max(x), 0.01)

# reshape for reshaping the data into a len(X_grid)*1 array,


# i.e. to make a column out of the X_grid value
X_grid = X_grid.reshape((len(X_grid), 1))

# Scatter plot for original data


plt.scatter(x, y, color = 'blue')

# plot predicted data


plt.plot(X_grid, regressor.predict(X_grid),
color = 'green')
plt.title('Random Forest Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
Artificial Neural Networks Machine Learning Algorithm
Artificial Neural Networks, or ANNs, are information-processing paradigms
inspired by the way the human brain works. Like a human, an ANN learns by
example. ANNs are configured for particular applications, like data
classification or pattern recognition, through a learning process that mostly
involves adjusting the connections between the neurons.
The human brain is filled with billions and billions of neurons connected to
one another via synapses. These are the connections by which neurons send
impulses to other neurons. When one neuron sends a signal to another, the
signal gets added to the neuron's inputs. Once a specified threshold is
exceeded, the target neuron fires a signal forward. That is pretty much how
the internal thinking process happens.
In data science, this process is modeled by networks created on computers
using matrices. We can understand these networks as neuron abstractions
with none of the biological complexities. We could get very complex here,
but we won't. We'll create a simple neural network that has two layers and
can solve a linear classification problem.
Let's say we want to predict the output, using specified inputs and outputs
as training examples.
The training process is:
1. Forward Propagation
The inputs are multiplied by the weights and summed:
Y = W1*I1 + W2*I2 + W3*I3
The result is passed through the sigmoid function so the neuron's output can
be calculated. This function normalizes the result so it falls between 0 and
1:
1 / (1 + e^-Y)
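As a numeric sketch of this forward-propagation step, with assumed example inputs and weights:

```python
# A numeric sketch of forward propagation: weighted sum, then sigmoid.
import numpy as np

inputs = np.array([1, 0, 1])          # I1, I2, I3 (assumed example values)
weights = np.array([0.5, -0.2, 0.3])  # W1, W2, W3 (assumed example values)

y = np.dot(inputs, weights)           # Y = W1*I1 + W2*I2 + W3*I3
output = 1 / (1 + np.exp(-y))         # sigmoid squashes Y into (0, 1)
print(round(float(output), 4))
```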
2. Back Propagation
The error is calculated as the difference between the expected and actual
output. Depending on the error, the weights are adjusted: multiply the error
by the input and by the gradient of the sigmoid curve:
Weight += Error * Input * Output * (1 - Output)
Here, Output * (1 - Output) is the derivative of the sigmoid curve.
The entire process must be repeated for however many iterations there are,
potentially thousands.
Let's see what the whole thing looks like coded in Python, using NumPy to
do the matrices calculations. Make sure NumPy is already installed and
then go ahead with the following code:
from numpy import array, dot, exp, random

class NeuralNet(object):
    def __init__(self):
        # Seed the random number generator
        random.seed(1)

        # Assign random weights to a 3 x 1 matrix
        self.synaptic_weights = 2 * random.random((3, 1)) - 1

    # The Sigmoid function
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    # The derivative of the Sigmoid function.
    # This is the gradient of the Sigmoid curve.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    # Train the neural network and adjust the weights each time.
    def train(self, inputs, outputs, training_iterations):
        for iteration in range(training_iterations):
            # Pass the training set through the network.
            output = self.learn(inputs)

            # Calculate the error
            error = outputs - output

            # Adjust the weights by a factor
            factor = dot(inputs.T, error *
                         self.__sigmoid_derivative(output))
            self.synaptic_weights += factor

    # The neural network thinks.
    def learn(self, inputs):
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    # Initialize
    neural_network = NeuralNet()

    # The training set.
    inputs = array([[0, 1, 1], [1, 0, 0], [1, 0, 1]])
    outputs = array([[1, 0, 1]]).T

    # Train the neural network
    neural_network.train(inputs, outputs, 10000)

    # Test the neural network with a test example.
    print(neural_network.learn(array([1, 0, 1])))
The Output
Once 10 iterations have been completed, the NN should predict a value of
0.65980921. This isn't what we want, as the answer should be 1. If the
number of iterations is increased to 100, the output is 0.87680541. That
means the network is improving the more iterations it does. So, for 10,000
iterations, we get as near as we can with an output of 0.9897704.
Machine Learning Steps
If only machine learning were as simple as just applying an algorithm to
the data and obtaining the predictions. Sadly, it isn’t, and it involves
many more steps for each project you do:
Step One – Gather the Data
This is the first and most important step in machine learning and is by
far the most time-consuming. It involves collecting all the data needed to
solve the problem and enough of it. For example, let’s say we wanted to
build a model that predicted house prices. In that case, we need a dataset
with information about previous sales, which we can use to form a
tabular structure. We’ll solve something similar when we show you how
to implement a regression problem later.
Step Two – Preparing the Data
When we have the data, it needs to be put into the right format and pre-
processed. Pre-processing involves different steps, including cleaning
the data. Should there be abnormal values, such as strings instead of
numbers, or empty values, you need to understand how you will deal
with them. The simplest way is to drop those rows containing empty
values, although there are a few other ways you can do it. Sometimes,
you may also have columns that will have no bearing on the results;
those columns can also be removed. Usually, data visualization will be
done to see the data in the form of charts, graphs, and diagrams. Once
these have been analyzed, we can determine the important features.
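As a rough sketch of the cleaning described in this step – the column names and values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: 'price' has an empty value, and 'id' has no
# bearing on the results, so we remove both.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "rooms": [3, 2, 4, 3],
    "price": [250000, np.nan, 410000, 305000],
})

df = df.drop(columns=["id"])   # drop a column that carries no signal
df = df.dropna()               # simplest fix: drop rows with empty values

print(df.shape)  # (3, 2) - one row and one column removed
```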
Step Three – Choosing Your Model
Now the data is ready, and we need a machine-learning algorithm to feed
it to. Some people use “machine learning algorithm” interchangeably
with “machine learning model,” but they are not the same. The model is
the algorithm’s output. When the algorithm is implemented on the data,
we get an output containing the numbers, rules, and other data structures
related to the algorithm and needed to make the predictions. For
example, we would get an equation showing the best-fit line if we did
linear regression on the data. That equation is the model. If we didn’t go
down the route of tuning the hyperparameters, your next step would be
training the model.
Step Four – Tuning the Hyperparameters
Hyperparameters are critical because they are responsible for controlling
how a machine learning model behaves. The goal is to find the best
hyperparameter combination to give us the best results. But what are
they? Take the K-NN algorithm and the K variable. When K is given
different values, the results are predictably different. K’s best value isn’t
defined, and each dataset will have its own value. There is also no easy
way of knowing what the best values are; all you can do is try different
ones and see what results you get. In this case, K is a hyperparameter,
and its values need to be tuned to get the right results.
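A minimal version of this trial-and-error search can be sketched with scikit-learn's K-NN classifier on the Iris data; the range of candidate K values here is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

# Try several candidate values of K and keep the one that scores best
best_k, best_score = None, 0.0
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```

In practice, scikit-learn's GridSearchCV automates exactly this kind of loop, with cross-validation built in.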
Step Five – Evaluating the Model
Next, we want to know how well the model has performed. The best way
to know that is to test it on more data. This is the testing data, and it
must be completely different from the data the algorithm was trained on.
When you train a model, you do not intend it to learn the values in the
training data. Instead, you want it to learn the data’s underlying pattern
and use that to make predictions on previously unseen data. We’ll
discuss this more in the next section.
Step Six – Making Predictions
The final step is to use the model on real-world problems in the hopes
that it will work as intended.
Evaluating a Machine Learning Model
Training a machine learning model is an important step in machine learning.
Equally important is evaluating how the model works on previously unseen
data and is something you should consider with every pipeline you build.
You need to know if your model works and, if it does, whether its
predictions can be trusted or not. Is the model memorizing the data you feed
it, or is it learning by example? If it is only memorizing data, its predictions
on unseen data will be next to useless.
We're going to take some time out to look at some of the techniques you can
use to evaluate your model to see how well it works. We'll also look at some
of the common evaluation metrics for regression and classification methods
with Python.
Model Evaluation Techniques
Understanding how your machine learning model performs requires evaluation,
an integral part of any data science project you are involved in. When you
evaluate the model, you aim to estimate how well it generalizes to unseen
data.
Model evaluation methods fall into two distinct categories – cross-validation
and holdout. Both use test sets, which are datasets of completely new,
previously unseen data, to evaluate the performance. As you already know
by now, testing a model on the same data it was trained on is not
recommended: the model may simply have memorized the training set, so its
predictions there will look deceptively accurate. This is called
overfitting.
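A quick illustration of this memorization problem, using a deliberately unconstrained decision tree on the Iris data (the model and split here are chosen just for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5)

# An unconstrained tree can memorize the training set perfectly...
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_train, y_train))  # 1.0 - perfect on data it has seen
print(tree.score(X_test, y_test))    # ...but scores lower on unseen data
```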
Holdout
Holdout evaluation tests models on different data to the training data,
providing an unbiased learning performance estimate. The dataset will be
divided into three random subsets:
The training set – this is the subset of the full dataset used to
build the predictive model
The validation set – this subset of the dataset is used to assess the
model's performance. It gives us a test platform that allows us to fine-
tune the parameters for our model and choose the model that performs
the best. Be aware that you won't need a validation set for every single
algorithm.
The test set – this is the unseen data, a subset of the primary dataset
used to tell us how the model may perform in the future. If you find
the model fitting the training set better than it does the test set, you
probably have a case of overfitting on your hands.
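The three-way split can be sketched with two calls to scikit-learn's train_test_split; the 60/20/20 proportions below are just one common choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the remainder again to get a
# validation set; 0.25 of the remaining 80% gives 20% of the full data.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```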
The holdout technique is useful because it is simple, flexible, and fast.
However, it is associated with high variance, because differences between
the training and test datasets can produce significant differences in the
accuracy estimate.
Cross-Validation
The cross-validation technique will partition the primary dataset into two – a
training set to train the model and another independent set for evaluating the
analysis. Perhaps the best-known cross-validation technique is k-fold, which
partitions the primary dataset into k number of subsamples, equal in size.
Each subsample is called a fold, and the k defines a specified number –
typically, 5 or 10 are the preferred options. This process is repeated k
number of times, and each time a k subset is used to validate, while the
remaining subsets are combined into a training set. The model effectiveness
is drawn from an average estimation of all the k trials.
For example, when you perform ten-fold cross-validation, the data is split
into ten parts, roughly equal in size. Models are then trained in sequence –
the first fold is used as the test set, while the other nine are used for training.
This is repeated for each part, and accuracy estimation is averaged from all
ten trials to see how effective the model is.
Each data point is used in the test set exactly once and is part of the
training set k − 1 times – in our case, nine. This significantly reduces
bias, as almost all the data is used for fitting, and it also reduces
variance, because every point eventually serves in a test set.
Interchanging the test and training sets in this way makes the method more
effective.
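The fold mechanics can be seen directly with scikit-learn's KFold on toy data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy samples, two features each

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # each fold holds out a different fifth of the data for testing
    test_indices.extend(test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")

# every sample ends up in a test set exactly once
print(sorted(test_indices))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```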
Model Evaluation Metrics
Quantifying the model performance requires the use of evaluation metrics,
and which ones you use depend on what kind of task you are doing –
regression, classification, clustering, and so on. Some metrics are good for
multiple tasks, such as precision-recall. Classification, regression, and other
supervised learning tasks make up the largest number of machine learning
applications, so we will concentrate on regression and classification.
Classification Metrics
We'll take a quick look at the most common metrics for classification
problems, including:
Classification Accuracy
Confusion matrix
Logarithmic Loss
AUC – Area Under Curve
F-Measure
Classification Accuracy
This is one of the more common metrics used in classification and indicates
the correct predictions as a ratio of all predictions. The sklearn module is
used for computing classification accuracy, as in the example below:
#import modules
import warnings
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score
#ignore warnings
warnings.filterwarnings('ignore')
# Load digits dataset
iris = datasets.load_iris()
# Create feature matrix
X = iris.data
# Create target vector
y = iris.target
#test size
test_size = 0.33
#generate the same set of random numbers
seed = 7
#cross-validation settings
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
#Model instance
model = LogisticRegression()
#Evaluate model performance
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, y, cv=kfold,
scoring=scoring)
print('Accuracy -val set: %.2f%% (%.2f)' % (results.mean()*100,
results.std()))
#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
y, test_size=test_size, random_state=seed)
#fit model
model.fit(X_train, y_train)
#accuracy on test set
result = model.score(X_test, y_test)
print("Accuracy - test set: %.2f%%" % (result*100.0))
In this case, the accuracy is 88%.
Cross-validation allows us to test the model while training, checking for
overfitting, and looking to see how the model will generalize on the test
data. Cross-validation can also compare how different models perform on
the same data set and helps us choose the best parameter values to ensure the
maximum possible model accuracy. This is also known as parameter tuning.
Confusion Matrix
Confusion matrices provide a more detailed breakdown of correct and
incorrect classifications for each individual class. In this case, we will
use the Iris dataset for classification and compute the confusion matrix:
#import modules
import warnings
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

#ignore warnings
warnings.filterwarnings('ignore')
# Load digits dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url)
# df = df.values
X = df.iloc[:,0:4]
y = df.iloc[:,4]
# print (y.unique())
#test size
test_size = 0.33
#generate the same set of random numbers
seed = 7
#Split data into train and test set.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
y, test_size=test_size, random_state=seed)
#Train Model
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

#Construct the Confusion Matrix


labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, pred, labels=labels)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix of the classifier')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
The short way of explaining how a confusion matrix should be interpreted is
this:
The elements on the diagonal represent how many points for which the
predicted label and true label are equal. Anything not on the diagonal has
been mislabeled. The higher the values on the diagonal, the better because
this indicates a higher number of correct predictions.
Our classifier was correct in predicting all 13 Setosa and 18 Virginica iris
plants in the test data, but it was wrong in classifying four Versicolor plants
as Virginica.
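The diagonal rule translates directly into an accuracy figure; the matrix below is made up to resemble the one described above:

```python
import numpy as np

# A hypothetical 3-class confusion matrix: rows are true labels,
# columns are predicted labels.
cm = np.array([[13,  0,  0],
               [ 0, 15,  4],
               [ 0,  0, 18]])

correct = np.trace(cm)   # sum of the diagonal = correct predictions
total = cm.sum()         # all predictions
print(correct / total)   # 46 / 50 = 0.92
```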
Logarithmic Loss
Logarithmic loss, also called logloss, is used to measure a classification
model's performance where the prediction input is a probability value
between 0 and 1. As the predicted probability diverges from the actual
label, the log loss increases. The machine learning model should minimize
that value – the smaller the logloss, the better, and a perfect model has a
log loss of 0.
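Before the full example, this behavior can be checked on a few hand-picked probabilities (the numbers are invented for illustration):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1]

# Confident, mostly-correct probabilities give a small log loss...
good = log_loss(y_true, [0.9, 0.1, 0.8])
# ...while hesitant, barely-better-than-chance predictions cost more.
poor = log_loss(y_true, [0.6, 0.4, 0.5])

print(round(good, 4))  # 0.1446
print(round(poor, 4))  # 0.5716
```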
#Classification LogLoss
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

warnings.filterwarnings('ignore')
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7
model = LogisticRegression()
#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)
#predict and compute logloss
pred = model.predict(X_test)
accuracy = log_loss(y_test, pred)
print("Logloss: %.2f" % (accuracy))
Area Under Curve (AUC)
This performance metric measures how well binary classifiers discriminate
between positive classes and negative ones.:
#Classification Area under curve
import warnings
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7
model = LogisticRegression()
#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]

auc = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc*100))

# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()
In this example, the AUC is greater than 0.5 and close to 1. A perfect
classifier's ROC curve runs straight up the y-axis to the top-left corner
and then along the top of the plot.
F-Measure
Also called F-Score, this measures the accuracy of a test, using the test's
recall and precision to come up with the score. Precision indicates correct
positive results divided by the total number of predicted positive
observations. Recall indicates correct positive results divided by the total
actual positives or the total of all the relevant samples:
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7

model = LogisticRegression()
#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)
pred = model.predict(X_test)
# precision: tp / (tp + fp)
precision = precision_score(y_test, pred)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_test, pred)
print('Recall: %f' % recall)
# f1: 2 * (precision * recall) / (precision + recall)
f1 = f1_score(y_test, pred)
print('F1 score: %f' % f1)
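The definitions can also be verified on a tiny hand-made example; the labels below are invented:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# By hand: TP = 3, FP = 1, FN = 1
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.75

print(precision, recall, f1)
```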
Regression Metrics
There are two primary metrics used to evaluate regression problems – Root
Mean Squared Error and Mean Absolute Error.
Mean Absolute Error, also called MAE, is the average of the absolute
differences between the actual and predicted values. Root Mean Squared
Error, also called RMSE, measures the average error magnitude by
"taking the square root of the average of squared differences between
prediction and actual observation."
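A quick sanity check of both formulas on toy values (these numbers are invented for illustration):

```python
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# MAE: mean of |3-2.5|, |-0.5-0|, |2-2|, |7-8| = (0.5 + 0.5 + 0 + 1) / 4
mae = mean_absolute_error(y_true, y_pred)
# RMSE: square root of the mean of the squared differences
rmse = sqrt(mean_squared_error(y_true, y_pred))

print(mae)   # 0.5
print(rmse)  # sqrt(0.375), roughly 0.612
```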
Here’s how we implement both metrics:
import pandas
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
dataframe = pandas.read_csv(url, delim_whitespace=True)
df = dataframe.values
X = df[:,:-1]
y = df[:,-1]
test_size = 0.33
seed = 7
model = LinearRegression()
#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)
#predict
pred = model.predict(X_test)
print("MAE test score:", mean_absolute_error(y_test, pred))
print("RMSE test score:", sqrt(mean_squared_error(y_test, pred)))
In an ideal world, a model’s estimated performance will tell us if the model
performs well on unseen data or not. Most of the time, we will be building
models to predict on future data and, before a metric is chosen, you must
understand the context – each model will use different datasets and different
objectives to solve the problem.
Implementing Machine Learning Algorithms with Python
By now, you should already have all the software and libraries you need
on your machine – we will mostly be using Scikit-learn to solve two
problems – regression and classification.
Implementing a Regression Problem
We want to solve the problem of predicting house prices given specific
features, like the house size, how many rooms there are, etc.
Gather the Data – we don't need to collect the required data
manually because someone has already done it for us. All we need
to do is import that dataset. It's worth noting that not all available
datasets are free, but the ones we are using are.
We want the Boston Housing dataset, which contains records describing
suburbs or towns in Boston. The data came from the Boston SMSA
(Standard Metropolitan Statistical Area) in 1970, and the attributes have
been defined as:
1. CRIM: the per capita crime rate in each town
2. ZN: the proportion of residential land zoned for lots of
more than 25,000 sq.ft.
3. INDUS: the proportion of non-retail business acres in each
town
4. CHAS: the Charles River dummy variable (= 1 if the tract
bounds river; 0 if it doesn't)
5. NOX: the nitric oxide concentrations in parts per 10
million
6. RM: the average number of rooms in each property
7. AGE: the proportion of owner-occupied units built before
1940
8. DIS: the weighted distances to five employment centers in
Boston
9. RAD: the index of accessibility to radial highways
10. TAX: the full-value property tax rate per $10,000
11. PTRATIO: the pupil-teacher ratio by town
12. B: 1000(Bk − 0.63)² where Bk is the proportion of black
people in each town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000s
You can download the dataset here.
Download the file and open it to see the data on the house sales. Note
that the dataset is not in tabular form, and there are no names for the
columns. Whitespace separates the values.
The first step is to place the data in tabular form, and Pandas will help us
do that. We give Pandas a list with all the column names and a delimiter
of '\s+', meaning that entries are recognized whether they are separated
by single or multiple spaces.
We'll start by importing the libraries we need and then the dataset in
CSV format. All this is imported into a Pandas DataFrame.
import numpy as np
import pandas as pd
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
'AGE', 'DIS', 'RAD', 'TAX','PTRATIO', 'B', 'LSTAT', 'MEDV']
bos1 = pd.read_csv('housing.csv', delimiter=r"\s+",
names=column_names)
Preprocess the Data
Next, the data must be pre-processed. When we look at the dataset, we can
immediately see that it doesn't contain any NaN values. We also see that the
data is in numbers, not strings, so we don't need to worry about errors when
we train the model.
We’ll divide the data into two – 70% for training and 30% for testing.
bos1.isna().sum()
from sklearn.model_selection import train_test_split
X=np.array(bos1.iloc[:,0:13])
Y=np.array(bos1["MEDV"])
#testing data size is of 30% of entire data
x_train, x_test, y_train, y_test =train_test_split(X,Y, test_size = 0.30,
random_state =5)
Choose Your Model
To solve this problem, we want two supervised learning algorithms for
solving regression. Later, the results from both will be compared. The
first is K-NN and the second is linear regression, both of which were
demonstrated earlier in this part.
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
#load our first model
lr = LinearRegression()
#train the model on training data
lr.fit(x_train,y_train)
#predict the testing data so that we can later evaluate the model
pred_lr = lr.predict(x_test)
#load the second model
Nn=KNeighborsRegressor(3)
Nn.fit(x_train,y_train)
pred_Nn = Nn.predict(x_test)
Tune the Hyperparameters
All we are going to do here is tune the K value in K-NN – we could
do more, but this is just for starters, so I don't want you to get too
overwhelmed. We'll use a for loop and check the results for k from
1 to 50. K-NN works fast on small datasets, so this won't take too much
time.
import sklearn.metrics

for i in range(1, 50):
    model = KNeighborsRegressor(i)
    model.fit(x_train, y_train)
    pred_y = model.predict(x_test)
    mse = sklearn.metrics.mean_squared_error(y_test, pred_y, squared=False)
    print("{} error for k = {}".format(mse, i))
The output will be a list of errors for k = 1 right through to k = 50. The
least error is for k = 3, justifying why k = 3 was used when the model
was being trained.
Evaluate the Model
We will use MSE (mean_squared_error) method in Scikit-learn to evaluate
our model. The 'squared' parameter must be set as False, otherwise you
won’t get the RMSE error:
#error for linear regression
mse_lr= sklearn.metrics.mean_squared_error(y_test,
pred_lr,squared=False)
print("error for Linear Regression = {}".format(mse_lr))
#error for K-NN
mse_Nn= sklearn.metrics.mean_squared_error(y_test,
pred_Nn,squared=False)
print("error for K-NN = {}".format(mse_Nn))
The output is:
error for Linear Regression = 4.627293724648145
error for K-NN = 6.173717334583693
We can conclude from this that Linear Regression is much better than K-
NN on this dataset at any rate. This won't always be the case as it is
dependent on what data you are working with.
Make the Prediction
At last, we can predict the house prices using our models and the predict
function. Ensure that you get all the features that were in the data you used
to train the model.
Here’s the entire program from start to finish:
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
bos1 = pd.read_csv('housing.csv', delimiter=r"\s+",
names=column_names)
X = np.array(bos1.iloc[:,0:13])
Y = np.array(bos1["MEDV"])
#testing data size is 30% of the entire data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,
random_state=5)
#load our first model
lr = LinearRegression()
#train the model on training data
lr.fit(x_train, y_train)
#predict the testing data so that we can later evaluate the model
pred_lr = lr.predict(x_test)
#load the second model, using the best k found during tuning
Nn = KNeighborsRegressor(3)
Nn.fit(x_train, y_train)
pred_Nn = Nn.predict(x_test)
#error for linear regression
mse_lr = sklearn.metrics.mean_squared_error(y_test, pred_lr, squared=False)
print("error for Linear Regression = {}".format(mse_lr))
#error for K-NN
mse_Nn = sklearn.metrics.mean_squared_error(y_test, pred_Nn, squared=False)
print("error for K-NN = {}".format(mse_Nn))
Implementing a Classification Problem
The problem we are solving here is the Iris species classification
problem. The Iris dataset is one of the most popular and common for
beginners. It has 50 samples of each of three Iris species, along with
other properties of the flowers. One species is linearly separable from
the other two, but those two are not linearly separable from each other.
The dataset has the following columns:
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species
This dataset is already included in Scikit-learn, so all we need to do is
import it:
from sklearn.datasets import load_iris
iris = load_iris()
X=iris.data
Y=iris.target
print(X)
print(Y)
The output is a list of samples, each containing four values – these are
the features. The bottom part is a list of labels, all transformed to
numbers because the model is unable to understand strings – each species
name must be coded as a number.
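When the labels arrive as strings, this name-to-number coding is typically done with scikit-learn's LabelEncoder – a small sketch:

```python
from sklearn.preprocessing import LabelEncoder

species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
           "Iris-setosa", "Iris-virginica"]

# Each species name becomes an integer the model can work with;
# classes are assigned codes in sorted order.
encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(list(codes))             # [0, 1, 2, 0, 2]
print(list(encoder.classes_))  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```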
Here’s the whole thing:
from sklearn.model_selection import train_test_split
#testing data size is of 30% of entire data
x_train, x_test, y_train, y_test =train_test_split(X,Y, test_size =
0.3, random_state =5)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
#fitting our model to train and test
Nn = KNeighborsClassifier(8)
Nn.fit(x_train,y_train)
#the score() method calculates the accuracy of model.
print("Accuracy for K-NN is ",Nn.score(x_test,y_test))
Lr = LogisticRegression()
Lr.fit(x_train,y_train)
print("Accuracy for Logistic Regression is ", Lr.score(x_test, y_test))
The output is:
Accuracy for K-NN is 1.0
Accuracy for Logistic Regression is 0.9777777777777777
Advantages and Disadvantages of Machine Learning
Everything has its advantages and disadvantages, and machine learning
is no different:
Advantages
Makes Identifying Trends and Patterns in the Data Easy
With machine learning, huge amounts of data can be reviewed, looking
for trends and patterns the human eye wouldn't see. For example,
Flipkart, Amazon, and other e-commerce websites use machine learning
to help understand how their customers browse and what they buy to
send them relevant deals, product details, and reminders. The machine
learning results are used to show relevant ads to each customer based on
their browsing and purchasing history.
Continuous Improvement
New data is continuously being generated, and providing machine
learning models with the new data helps them upgrade over time,
become more accurate, and continuously improve their performance.
This leads to continually better decisions being made.
They can Handle Multi-Variety and Multidimensional Data
Machine learning algorithms can handle multi-variety and
multidimensional data easily in uncertain or dynamic environments.
They are Widely Used
It doesn't matter what industry you are in, machine learning can work for
you. Wherever it is applied, it can give businesses the means and
answers they need to make better decisions and provide their customers
with a more personal user experience.
Disadvantages
It Needs Huge Amounts of Data
Machine learning models need vast amounts of data to learn, but the
quantity of data is only a small part. It must be high-quality, unbiased
data, and there must be enough of it. At times, you may even need to
wait until new data has been generated before you can train the model.
It Takes Time and Resources
Machine learning needs sufficient time for the algorithms to learn and
develop to ensure they can do their job with a significant level of
relevancy and accuracy. They also require a high level of resources,
which can mean using more computing power than you might have
available.
Interpretation of Results
One more challenge is accurately interpreting the results that the
algorithms output. You must also carefully choose an algorithm based on
the task at hand and what it is intended to do. You won't always choose
the right algorithm the first time, even when your analysis points you to
a specific one.
Susceptible to Errors
Although machine learning is autonomous, it is highly susceptible to
errors. Let's say you train an algorithm with datasets so small that they
are not inclusive. The result would be biased predictions, because the
training set was biased. This could lead to a business showing its
customers irrelevant ads or purchase choices. With machine learning, this
type of thing can kick-start a whole chain of errors that may go
undetected for some time and, when they are finally detected, it is
time-consuming to find where the issues started and put them right.

Conclusion
Thank you for taking the time to read my guide. Data science is one of the
most used buzzwords of the current time, which will not change. With the
sheer amount of data being generated daily, companies rely on data science
to help them move their business forward, interact intelligently with their
customers, and ensure they provide exactly what their customers need when
they need it.
Machine learning is another buzzword, a subset of data science that has
only really become popular in the last couple of decades. Machine learning
is all about training machines to think like humans, to make decisions like a
human would and, although true human thought is still some way off for
machines, it is improving by the day.
We discussed several data science techniques and machine learning
algorithms that can help you work with data and draw meaningful insights
from it. However, we’ve really only scraped the surface, and there is so
much more to learn. Use this book as your starting point and, once you’ve
mastered what’s here, you can take your learning further. Data science and
machine learning are two of the most coveted job opportunities in the world
these days and, with the rise in data, there will only be more positions
available. It makes sense to learn something that can potentially lead you to
a lucrative and highly satisfying career.
Thank you once again for choosing my guide, and I wish you luck in your
data science journey.
