
4 MANUSCRIPTS

______________________
DATA SCIENCE FOR BEGINNERS,
DATA ANALYSIS WITH PYTHON,
SQL COMPUTER PROGRAMMING FOR BEGINNERS,
STATISTICS FOR BEGINNERS

The Tools Every Data Scientist Should Know

Matt Foster
Table of Contents
Data Science for Beginners
Introduction
Chapter 1 - Introduction to Data Science
Chapter 2 - Fields of Study
Chapter 3 - Data Analysis
Chapter 4 - The Python Data Types
Chapter 5 - Some of The Basic Parts of The Python Code
Chapter 6 - Use Case, Creating Requirements and Mindmapping
Chapter 7 - Basic Statistics Concepts of Data Scientists
Chapter 8 - Exploring Our Raw Data
Chapter 9 - Languages Required for Data Science
Chapter 10 - Classification and Prediction
Chapter 11 - Data Cleaning and Preparation
Chapter 12 - Introduction to Numpy
Chapter 13 - Manipulating Array
Chapter 14 - Python Debugging
Chapter 15 - Advantages of Machine Learning
Chapter 16 - Numba - Just-in-Time Python Compiler
Conclusion
Table of Contents
Data Analysis with Python
Introduction

Chapter 1 - What is Data Analysis

Chapter 2 - Python Crash Course

Chapter 3 - Data Munging

Chapter 4 - Why Data Preprocessing Is Important

Chapter 5 - What is Data Wrangling?

Chapter 6 - Inheritances to Clean Up the Code

Chapter 7 - Reading and writing data

Chapter 8 - The Different Types of Data We Can Work With

Chapter 9 - The Importance of Data Visualization

Chapter 10 - Indexing and selecting arrays

Chapter 11 - Common Debugging Commands

Chapter 12 - Neural Networks and What to Use Them For

Conclusion
Table of Contents
SQL COMPUTER PROGRAMMING FOR
BEGINNERS
Introduction
Chapter 1 - Data Types in SQL
Chapter 2 - Constraints
Chapter 3 - Database Backup and Recovery
Chapter 4 - SQL Aliases
Chapter 5 - Database Normalization
Chapter 6 - SQL Server and Database Data Types
Chapter 7 - Downloading and Installing SQL Server Express
Chapter 8 - Deployment
Chapter 9 - SQL Syntax And SQL Queries
Chapter 10 - Relational Database Concepts
Chapter 11 - SQL Injections
Chapter 12 - Fine-Tune Your Indexes
Chapter 13 - Deadlocks
Chapter 14 - Functions: UDFs, SVF, ITVF, MSTVF, Aggregate, System, CLR
Chapter 15 - Triggers: DML, DDL, After, Instead Of, DB, Server, Logon
Chapter 16 - Select Into Table Creation & Population
Chapter 17 - Data Visualizations
Chapter 18 - Python Debugging
Conclusion
Table of Contents
STATISTICS FOR BEGINNERS
Introduction
Chapter 1 - The Fundamentals of descriptive statistics
Chapter 2 - Predictive Analytics Techniques
Chapter 3 - Decision Tree and how to Use them
Chapter 4 - Measures of central tendency, asymmetry, and variability
Chapter 5 - Distributions
Chapter 6 - Confidence Intervals: Advanced Topics
Chapter 7 - Handling and Manipulating Files
Chapter 8 - BI and Data Mining
Chapter 9 - What Is R-Squared and How Does It Help Us
Chapter 10 - Public Big Data
Chapter 11 - Gamification
Chapter 12 - Introduction To PHP
Chapter 13 - Python Programming Language
Chapter 14 - A brief look at Machine Learning
Chapter 15 - Python Crash Course
Chapter 16 - Unsupervised Learning
Chapter 17 - Neural Networks
Conclusion
DATA SCIENCE FOR
BEGINNERS:
THE ULTIMATE GUIDE TO DEVELOPING
STEP BY STEP YOUR DATA SCIENCE SKILLS
FROM SCRATCH, TO MAKE THE BEST
DECISIONS AND PREDICTIONS

Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or
transmitted without direct written permission from the author or the
publisher.
Under no circumstances will any blame or legal responsibility be held against
the publisher, or author, for any damages, reparation, or monetary loss due to
the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. All effort has been executed to present
accurate, up to date, reliable, complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical, or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is
the author responsible for any losses, direct or indirect, that are incurred as a
result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
The first thing that we need to take a look at here is what data analysis is all about. This is basically a practice where we take all of the raw data that has been collected over time, and then order and organize it. This needs to be done in a way that allows the business to extract all of the useful information out of it.
The process of organizing, and then thinking about the data that is available
is really going to be one of the key things that help us to understand what is
inside that data, and what might not be present in the data. There are a lot of
methods that a company can choose to use in order to analyze their data, and
choosing the right approach can make it easier to get the right insights and
predictions out of your data.
Of course, one thing that you need to be careful about when working with
this data, and performing data analysis is that it is really easy for someone to
manipulate the data when they are in the analysis phase. This is something
that you need to avoid doing at all costs. Pushing your own agenda or your
own conclusions is not what this analysis is supposed to be about. There is so
much data present in your set, that if you try to push your agenda or the
conclusions that you want, you are likely to find something that matches up
with it along the way.
This doesn’t mean that this is the right information that you should follow
through. You may just match up to some of the outliers, and miss out on a lot
of important information that can lead your business in the right direction. To
avoid failing and using the data analysis in the wrong manner, it is important
for businesses, and data analysts, to pay attention when the data is presented
to them, and then think really critically about the data, as well as about some
of the conclusions that were drawn based on that data.
Remember here that we can get our raw data from a lot of different sources,
and this means that it can come to us in a variety of forms. Some of it may come from our social media pages, our own observations, responses to surveys that are sent out, and other measurements. All of this data, even when
it is still in its raw form, is going to be useful. But there is so much of it, that
sometimes it all seems a bit overwhelming.
Chapter 1 - Introduction to Data Science
Over the course of the process for data analysis, the raw data has to be
ordered in a way that can make it something useful. For example, if survey results are part of your data, you need to take the time to
tally up the results. This helps people to look at the chart or the graph and see
how many people answered the survey, and how these people responded to
some of the specific questions that were on the survey to start with.
As we go through the process of organizing the data, it will not take long
until some big trends start to emerge. And this is exactly what you want when
working with this data analysis. These trends are what we need to highlight
when we try to write up the data to ensure that our readers will take note of
it. We can see a lot of examples of how this is going to work.
For example, if we are working with a casual survey about the different ice cream preferences of men and women, we may find that women
are more likely to have a fondness for chocolate compared to men.
Depending on what the end goals of that research survey were about, this
could be a big point of interest for the researcher. They may decide that this is
the flavor they are going to market to women in online and television ads,
with the hopes that they can increase the number of women who want to
check out the product.
We then need to move on to do some modeling of the data using lots of
mathematics and other tools. These sometimes can sort through all of that
data we have and then exaggerate the points of interest. This is a great thing
because it makes these points so much easier for the researcher to see, so they
can act on this information.
Another thing that we need to focus on when we do our data analysis is the
idea of data visualization. Charts, graphs, and even some textual write ups of
the data are all going to be important parts of data analysis. These methods
are designed in order to refine and then distill the data in a manner that makes
it easier for the reader to glean through the interesting information, without
having to go through and sort out the data on their own.
Just by how the human mind works, having graphs and charts can make a big
difference in how well we understand all of the data we look at. We could
write about the information all day long. But a nice chart or graph that helps
to explain the information and shows us some of the relationships that come
up between the data points, can be the right answer to make understanding
the data easier than ever before.
Summarizing all of this data is often going to be a critical part of supporting
any of the arguments that we try to make with the data. It is just as important
as trying to present the data in a manner that is clear and understandable. The
raw data may also be included in the appendix or in another manner. This
allows the key decision-makers of the company to look up some of the
specifics, and double-check the results that you had. This doesn’t mean that
the data analyst got it wrong. But it adds in a level of trust between the parties
and can give us a reference point if needed.
When those who make decisions in the business encounter some data and
conclusions that have been summarized, they must still view them in a
critical manner. Do not just take the word of the data analyst when it comes
to this information, no matter how much you trust them. There can always be
mistakes and other issues that we have to watch out for, and being critical
about anything that we are handed in the data analysis can be critical to how
we will use that data later.
This means that we need to ask where the data was collected from, and then
ask about the size of the sample, and the sampling method that was used in
order to gather up the data we want to use. All of these are going to be critical
to helping us understand whether the data is something that we can actually
use, and it can help us determine if there are any biases found in the data that
we need to be worried about.
For example, if the source of the data comes from somewhere that may have
a conflict of interest with the type of data that is gathered, or you worry that
the source is not a good one to rely on, then this can sometimes call the
results that you are looking at into question. In a similar manner, if the data is high-quality but it is pulled from a small sample size, or the sample that was used is not truly random like it should be, this is going to call into question the utility of that data.
As we are going through this, the data analyst needs to remember to provide
as much information about the data as possible, including the methods that
were used to collect that data. Reputable researchers are always going to
make sure that they provide information about the techniques of gathering the
data, the source of funding with any surveys and more that are used, and the
point of the data collection, right at the beginning of the analysis. This makes
it easier for other groups to come and look at the data, see if it is legitimate
and will work for their needs, and then determine if this is what they are
going to base their decisions on.
Learning how to use data analysis is going to be an important step in this
process. Without this, all of the data that we gather is just sitting around and
isn’t being used the way that we would like. It doesn’t do us any good to just
gather the data and then hold it in storage. Without analyzing the data and
learning how to use it, you are basically just wasting money with a lot of
storage for the data.
The data analysis can come into the mix and makes it so much easier for us to
handle our data, and really see some great results. Rather than just having the
data sit in storage, we are going to be able to add in some algorithms and
machine learning models, in order to see what insights and predictions are
hidden in all of that data.
Businesses are then able to take all of those insights and predictions, and use them to make smart business decisions that they can rely on over and over again.
And with the right machine learning algorithm in place, and some small
adjustments over time, the business is able to add in some more information
as it comes in, helping them to always stay ahead of the competition.
In the past, these tools were not available at all. Business owners who were
good at reading the market and had been in business for some time could
make good predictions, and sometimes they just got lucky. But there was
always a higher risk that something wouldn’t work out, and they would end
up with a failure in their business, rather than a success.
With data analysis, this is no longer an issue at all. The data analysis is going
to allow us to really work with the data, and see what insights are there. This
provides us with a way to make decisions that are backed by data, rather than
estimates and hoping that they are going to work out. With the right
algorithm or model in place, we are able to learn a lot about the market, our
customers, what the competition is doing, how to reduce waste, and so much
more that can really propel our business forward.
There are a lot of different ways that a business can use data analysis to help
them succeed. They can use this as a way to learn more about their customers
and how to pick the right products and increase customer satisfaction all at
the same time. They can use this to identify waste in the business, and how to
cut some of this out without harming the quality of the product. They can use
this to learn more about what the competition is doing or to discover some
new trends in the market that they can capitalize on and get ahead of the
competition. This can even be used for marketing purposes to ensure that the
right ads reach the right customers each time.
There are so many benefits that come to a well-thought-out and researched
data analysis. And it is not just as simple as glancing down at the information
and assuming that it all falls into place and you will be able to get insights in
a few minutes. It requires gathering good information, making a model that
can read through all of that data in a short amount of time, and then even
writing out and creating visuals that go with that data. But when it all comes
together, it can really provide us with some good insights and predictions
about our business, customers, and competition, and can be the trick that gets
us over the top.
Chapter 2 - Fields of Study
A company needs to manage an enormous amount of data, such as salaries, employee records, customer information, customer feedback, and so forth. This data can be in both unstructured and structured form. A company will always want this data to be meaningful and complete so that it can improve its decisions and future policies. This is where data science comes in handy.
Data science helps its users make the right choices from accurate information drawn out of a large mass of messy data.

Applications of data science


Data science has now turned into an unavoidable and indispensable part of fields like risk management, market analysis, market development, fraud detection, and public policy, among others. By using statistics, machine learning, and predictive modeling, data science helps enterprises resolve a variety of problems and achieve measurable benefits. There are plenty of reasons to settle on a data science course as a career choice. The following applications help us to understand it better:
It helps organizations understand customer behavior and tendencies in a much more focused manner. It lets them interact with customers in an increasingly personalized way and guarantees better services to customers.
It helps brands use data thoroughly to communicate their message to the target audience in an engaging and persuasive manner.
The outcomes and findings of data science can be applied in practically all sectors like healthcare, education, and travel, among others, helping them to address the challenges in their field in a more effective manner.
Big data is a recently emerged field and is helping organizations handle issues in human resources, resource management, and IT in a strategic manner by making use of material and non-material resources.
The data scientist holds one of the prime positions in an organization. They open new grounds of experimentation and research for the organization. Some of the typical jobs of a data scientist are:
To connect new data with past data to offer new products that fulfill the goals of the target audience.
To interpret weather conditions and, accordingly, reroute the supply chain.
To upgrade the speed of dataset assessment and integration.

To uncover anomalies and fraud in the market.


An understanding of the Data Science Course
A data science course is typically 160+ hours of training with experienced faculty working in top organizations to keep you abreast of recent advances. An overview of the course is as follows:
Mathematics and statistics: This is an indispensable subject of a data science course and includes integration, differentiation, differential equations, and so on. Statistics covers inferential statistics, descriptive statistics, chi-squared tests, regression analysis, and so on.
Programming language: One can choose from a variety of programming languages and tools like Python, C++, MATLAB, Hadoop, and so forth.
Data wrangling and data management: This part deals with data mining, cleaning, and management using MySQL, NoSQL, Cassandra, and so on.
Machine learning: This includes supervised and unsupervised learning, testing, reinforcement learning, evaluation of models, and their validation.
Data analysis and data visualization: This part teaches the use of plotting libraries for programming languages, like seaborn, plotly, and matplotlib in Python, and ggplot2 in R. It also includes using Excel, Tableau, and D3.js for data representation.
The data science field offers excellent learning and earning potential. If you also want to pursue a career in data analysis and management, the right organization offering a data science course is within your reach. Get yourself enrolled in a course that suits your schedule and explore a world brimming with opportunities and growth.
Data Science: Giving Value to Analytics
With an industry growing at a 33.5% compound annual growth rate, one can imagine the number of applications with data science at their center. The footprint of data science is developing and spreading at a quick pace, locally as well as globally. Over 40% of analytics revenue originates from countries like the USA and the UK. This demonstrates that the analytics business has found many uses for data science to support business quality.
Data science:
Data science is a field which brings together different subjects and areas of expertise, like mathematics, statistics, computer science, and so on. Beyond these, there are smaller, specialty skills as well, which one needs to hone. Aside from technical skills, one needs business acumen to understand the working of a business unit and to know the ongoing market trends.
Data science is used in industries like digital marketing, e-commerce, healthcare, education, transport, entertainment, and so on. Analytics is used by all types of businesses, whether private, public, or non-profit organizations, as the primary aim is to offer value to the clients and increase efficiency as well.
Steps in data science
Data science incorporates different activities and procedures combined together for only one goal: to recognize what is hidden in the pile of data. Data can come from numerous sources like external media and the web, government survey datasets, and the internal databases of one's own organization. Whatever the source, the data should be worked upon steadily and with insight to uncover the meaning in it.

The steps included are:


Frame the objectives: This is the simple initial step of data analysis. Here management must establish what they want from their data analytics team. This step additionally includes defining the parameters for measuring the performance of the insights recovered.
Choosing business resources: For solving any problem, there must be sufficient resources accessible as well. If a firm is not in a position to spend its resources on a new development or workflow channel, then one should not waste time on insignificant analysis. A few metrics and levers ought to be positioned in advance to provide a guide to the data analysis.
Data accumulation: More data leads to more chances of solving a problem. Having a constrained amount of data, limited to just a couple of variables, can lead to stagnation and unreliable insights. Data ought to be gathered from varied sources like the web, IoT, social media, and so on, and using varied methods like GPS, satellite imaging, sensors, and so forth.
Data cleaning: This is an essential step, as mistaken data can give misleading results. Algorithms and automation programs prune the data of irregularities, wrong figures, and gaps.
Data modeling: This is where AI and business acumen come into use. This includes building algorithms that can correlate with the data and give the results and recommendations required for key decision making.
Communicate and enhance: The results found are communicated, action is taken on them, and the performance of the decision taken is monitored. If the models worked, then the data venture is a success; if not, the models and procedures are enhanced and the process starts over again.

USES OF DATA SCIENCE IN DIFFERENT INDUSTRIES
Data science is one of the most current and varied fields of technology today. It is all about gathering data that is unstructured and raw in form and afterward discovering insights from it, which can enable any enterprise to become more productive. There is data all over the place, from every kind of source, regardless of whether internal or external. This data tells a story and points to something helpful which a business ought to understand to make more profitable strategies.

Data science
It is a pipeline of activities all organized together. It begins with gathering the data and afterward storing it in structures. That is then followed by cleaning the data to remove the unwanted and duplicate portions of the data, and furthermore correct the mistaken bits and complete the fragmented data. After all the pruning is done, it is followed by analyzing the data using numerous statistical and mathematical models. This stage is to understand the concealed patterns in the data. All of this is then finally followed by communicating everything to the top administration so that they can take decisions in regard to new items or existing items.
Nowadays, one can discover a few data science courses to turn into a trained professional in the field of data science, and why not? Job openings are expected to take off by 28% - 30% by 2020, which means more chances. To be a data scientist, one does not necessarily need an excess of experience; even a fresher with a mathematics, computer science, or economics foundation can get trained to be a data scientist. This soaring requirement for data scientists is a direct result of the increasing use of big data in pretty much every industry imaginable.

Data science in banking and finance


Today, numerous banks are utilizing big data to analyze customers' financial behavior and offer relevant banking advice to them. This increases the ease of banking among customers, and furthermore, they get personalized banking advice and information. Big data is additionally helping banks to battle fraud and identify nonperforming assets.
Data science in construction
This is an industry which needs to track many kinds of data concerning customer value, materials and land costing, income, prospects of land, and so forth. This has turned out to be much simpler as big data helps in analyzing the data and gives insights about the decisions to be taken.
Data science in retail
Retail businesses depend heavily on stock and customer satisfaction as two noteworthy pillars of their core business. Both these features can be dealt with by big data and its analysis. It can help in understanding the ongoing patterns and customer demands, additionally in analyzing customer feedback and, above all, in handling stock and warehousing.
Data science in transportation
The transportation industry utilizes big data to break down routes and journeys. It helps in mapping the roads and giving individuals the shortest routes. It additionally helps in tracking past travel details and giving customers customized travel packages. Big data likewise helps the rail industry by utilizing sensor-produced data to understand braking systems and wear.

Data science in medicine


It helps in managing and analyzing medical and healthcare data, which in turn helps in decision making by doctors. Additionally, it helps in health reviews, makes hospital management more effective, tracks patients' vital signs, and furthermore helps in disease diagnosis.
Data science is omnipresent and will grow exponentially even in the forthcoming years, in this manner making it a promising career decision.
Chapter 3 - Data Analysis
The next library that we are going to spend some time on is the Pandas
library. This is a great one that is always included when we look at data
science, and because it works with the Python language, we know that we
will see some powerful coding that is easy to get started with. With this in
mind, let us take a look at some of the things that we can come to expect
when we work with the Pandas library.
To start, pandas is going to be a package from Python that is open-sourced
and can provide us with a lot of different tools when it comes to completing
our data analysis. This package also includes a few different structures that
we are able to learn about and bring out for a variety of tasks that deal with
data manipulation. For someone who wants to sort through a lot of data in a
quick and orderly fashion to find out the insights and predictions that are
inside, the Pandas library is the best one to work with.
In addition to some of the tasks that we outlined above, Pandas is also going
to bring out a lot of different methods that programmers can invoke for
helping with the data analysis. This is always a good thing when we are
working on things like data science and a variety of problems that machine
learning is able to help us solve along the way.
While we are here, we need to take a look at some of the advantages that
come with using the Pandas library over some of the others. This library may
not be the one that you want to use instead of some of the others, but it is
definitely one that you should consider learning about and keeping in your
toolbox any time that you want to do some data analysis or work with data
science in some shape or form. Some of the different advantages that you will
notice with the Pandas library will include:

1. It is going to take the data that you have and present it in a manner that is suitable for analyzing the large amounts of data that you have. This data analysis is going to be completed in Pandas with the use of the DataFrame and the Series data structures.

2. You will find that the Pandas package is going to come with many methods to help us filter through our data in a convenient manner while seeing some great results.

3. Pandas also comes with a variety of utilities that are going to help us when it is time to perform the input and output operations in a way that is quick and seamless. In addition, Pandas is able to read data that comes in a variety of formats, which is very important in data science, such as Excel, TSV, and CSV to name a few.

You will find that Pandas is really going to change up the game and how you
do some coding when it comes to analyzing the data a company has with
Python. Pandas is free to use and open source, and was meant to be used by anyone who is looking to handle the data they have in a safe, fast,
and effective manner.
There are a lot of other libraries that are out there, but you will find that a lot
of companies and individuals are going to love working with Pandas. One
thing that is really cool about Pandas is that it is able to take data, whether it is from an SQL database, a TSV file, or even a CSV file, and then turn that information into a Python object. This is going to be
changed over to columns and rows and will be called a data frame, one that
will look very similar to a table that we are going to see in other software that
is statistical.
If you have worked with R in the past, then the objects are going to share a
lot of similarities to R as well. And these objects are going to be easier to
work with when you want to do work and you don’t want to worry about
dictionaries or lists for loops or list comprehension. Remember that we talked
earlier about how loops can be nice in Python, but you will find that when it
comes to data analysis, these loops can be clunky, take up a lot of space, and
just take too long to handle. Working with this kind of coding language will
help you to get things done without all of the mess along the way.
For the most part, it is going to be best if you are able to download the
Pandas library at the same time that you download Python. This makes it
easier to work with and will save you some time later. But if you already have
Python on your computer and later decide that you want to work with Pandas
as well, then this is not a problem. Take some time now to find the pandas
library on its official page and follow the steps that are needed to download it
on the operating system of your choice.
Once you have had some time to download the Pandas library, it is time to
actually learn how this one works and some of the different things that you
are able to do to get it to work for you. The Pandas library is a lot of fun
because it has a ton of capabilities that are on it, and learning what these are
and how to work with them is going to make it easier to complete some of
your own data analysis in the process.
At this point, the first thing that we need to focus on is the steps that we can
take to load up any data, and even save it before it can be run through with
some of the algorithms that come with Pandas. When it is time to work with
this library from Python in order to take all of that data you have collected
and then learn something from it and gain insights, we have to keep in mind
that there are three methods that we can use with this. These three methods
are going to include the following:

1. You can convert a NumPy array, a Python list, or a Python dictionary over to the data frame that is available with Pandas.

2. You can open up a local file that is found on your computer with the help of Pandas. This is often going to be something like a CSV file, but it is possible that it could be something else like Excel or a delimited text file in some cases.

3. You can also choose to use this in order to open up a file or another database that is remote, including a JSON or CSV file that is located on a website through a URL, or you can have the program read through a database or a table that is from SQL.

Now, as you go through with these three methods, we have to remember that there are actually a couple of commands that will show up for each one, and which command you will choose depends on which method you go with. However, one thing that all three share in common is that the command they use to open up a data file follows the same pattern. The command that you need to use to open up your data file, regardless of the method above that you choose to use, will look like:
pd.read_filetype()
As we talked about a bit earlier on, and throughout this guidebook, there are a few file types that you are able to use and see results with when writing in Python. And you get to choose which one is the best for your
project. So, when working on the code above, you would just need to replace
the part that says “filetype” with the actual type of file that you would like to
use. You also need to make sure that you add in the name of your file, the
path, or another location to help the program pull it up and know what it is
doing.
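To make this concrete, here is a minimal sketch that assumes a hypothetical file named sales.csv sitting next to the script; the file name is only an example, not something from this book:

import pandas as pd

# Read a local CSV file into a DataFrame.
# "sales.csv" is just an illustrative name; point this at your own file or path.
df = pd.read_csv("sales.csv")

# The same pd.read_<filetype> pattern covers other formats, for example:
# df = pd.read_excel("sales.xlsx")
# df = pd.read_json("sales.json")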
You will find that while you work in the Pandas library, there are also a ton
of arguments that you are able to choose from and to know what all of these
mean and how to pull up each one at the right time is going to be a big
challenge. To save some time, and to not overwhelm you with just how many
options there are, we are going to focus on just the ones that are the most
important for our project, the ones that can help us with a good data analysis,
and leave the rest alone for now.
With this idea in mind, we are going to start out by learning how we can
convert one of the objects that we are already using in Python, whether this is
a list or a dictionary or something else, over to the Pandas library so we can actually use it for our needs. The command that we are able to use to make that conversion happen is:
pd.DataFrame()
With the code above, the part that goes inside of the parentheses is where we are able to specify the object, and sometimes the different objects, that are being placed inside that data frame. This command also takes a few arguments, and you can choose which ones of those you want to work with here as well.
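As a small, hedged example of that conversion, here is a made-up dictionary of ice cream sales being turned into a data frame; the column names and numbers are purely illustrative:

import pandas as pd

# A plain Python dictionary: the keys become column names, the lists become columns.
sales = {
    "flavor": ["chocolate", "vanilla", "strawberry"],
    "units": [120, 95, 60],
}

# Convert the dictionary into a Pandas DataFrame.
df = pd.DataFrame(sales)
print(df)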
In addition to helping out with some of the tasks that we just listed out, we can also use Pandas to help us save that data frame, so we can pull it up and do more work later on, or if we are working with more than one type of file. This is nice because Pandas is going to save tables in many different formats, whether that is CSV, Excel, SQL, or JSON. The general pattern that we need to use to not only work on the data frame that we currently have, but to make sure that we can save it as well, will include the following:
df.to_filetype(filename)
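Here is a quick sketch of saving a small, made-up data frame in a couple of formats; the file names are only examples:

import pandas as pd

df = pd.DataFrame({"flavor": ["chocolate", "vanilla"], "units": [120, 95]})

# Write the DataFrame out as a CSV file; index=False leaves out the row index column.
df.to_csv("sales_copy.csv", index=False)

# The same DataFrame can also be written out as JSON.
df.to_json("sales_copy.json")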
When we get to this point, you should see that the data is already loaded up, so now we need to take this a step further and look at some of the inspecting that can be done with this as well. To start, we need to take a look at the data frame and see whether or not it matches up with what we expect or want it to. To help us do this, we just need to run the name of the data frame to bring up the entire table, but we can limit this a bit more and get more control by only getting a certain amount of the table to show up based on what we want to look at.
For example, to help us just get the first n rows (you can decide how many rows this ends up being), you would just need to use the function df.head(n). Alternatively, if your goal was to work with the last n rows in the table, you would need to write out the code df.tail(n). The df.shape attribute is going to help if you want to see the number of columns and rows that show up, and if you would like to gather up some of the information that is there about the data types, memory, or the index, the only code that you will need to use to make this happen is df.info().
Then you can also take on the command s.value_counts(dropna=False), and this one allows us to view some of the unique values and counts for a series, such as if you would like to work with just one, and sometimes a few, columns. A useful command that you may want to learn as well is going to be the df.describe() function. This one is going to help you out by outputting some of the summary statistics that come with the numerical columns. It is also possible for you to get the statistics on the entire data frame or a series.
To help us make a bit more sense out of what we are doing here, and what it all means, we need to look at a few of the different commands that you are able to use in Pandas that are going to help us view and inspect the data we have (a short sketch of these commands in action follows this list). These include:

1. df.mean(). This function is going to help us get the mean of


all our columns.
2. df.corr() This function is going to return the correlation
between all of the columns that we have in the frame of
info.
3. Df.count(): This function is going to help us return the
number of non-null values in each of the frames of info
based on the columns.
4. Df.max(). The function is going to return the highest value
in each of the columns.
5. Df.min(). This function is going to return the lowest value
that is found in each of the columns that you have.
6. Df.median(). This is going to be the function that you can
use when you want to look at each column and figure out
the median.
7. Df.std(). This function is going to be the one that you would
use in order to look at each of the columns and then find the
standard deviation that comes with it.
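To tie these commands together, here is a short sketch on a small, made-up data frame; the numbers are invented purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "units": [120, 95, 60, 80],
    "price": [2.5, 2.0, 3.0, 2.75],
})

print(df.head(2))     # the first two rows
print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # summary statistics for the numerical columns
print(df.mean())      # the mean of each column
print(df.corr())      # the correlation between the columns
print(df.count())     # the number of non-null values in each column
print(df.max())       # the highest value in each column
print(df.min())       # the lowest value in each column
print(df.median())    # the median of each column
print(df.std())       # the standard deviation of each column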

Another cool thing that we are able to do when working with the Pandas library is that we are able to join together and combine different parts. This is a basic set of commands, so learning how to use them from the beginning can make a big difference. It is so important for helping us to combine or join the frames of data, or to help out with combining or joining the rows and columns that we want. There are three main commands that can come into play to make all of this happen (a sketch of their use follows this list):

1. Dfl.appent(df2). This one is going to add in the rows of df1


to the end of df2. You need to make sure that the columns
are going to be identical in the process.

2. Df.concat([df1, df2], axis=1). This command is going to add


in the columns that you have in df1 to the end of what is
there with df2. You want to make sure that you have the
rows added together identical.

3. Dfl.oin(df2, on=col1, hot=’inner’). This is going to be an


SQL style join the columns in the df1 with the columns on
df2 where the rows for col have identical values how can be
equal to one of left, right, inner, and outer.
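Below is a minimal sketch of these combining commands on two tiny, made-up frames. Note that newer versions of Pandas prefer pd.concat over the older append method, so the row-stacking step is shown with pd.concat, and the SQL-style join is shown with merge, which takes the same how= options described above:

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "flavor": ["chocolate", "vanilla"]})
df2 = pd.DataFrame({"id": [3, 4], "flavor": ["strawberry", "mint"]})

# Stack the rows of df2 underneath df1 (the columns are identical).
rows = pd.concat([df1, df2], ignore_index=True)

# Place the columns of df2 next to the columns of df1 (the rows line up).
side_by_side = pd.concat([df1, df2], axis=1)

# An SQL-style join on a shared key column.
prices = pd.DataFrame({"id": [1, 2, 3], "price": [2.5, 2.0, 3.0]})
joined = df1.merge(prices, on="id", how="inner")

print(rows)
print(side_by_side)
print(joined)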

There is so much that we are able to do when it comes to the Pandas library,
and that is one of the reasons why it is such a popular option to go with.
Many companies who want to work with data science are also going to be willing to add on the Pandas library because it helps them to do a bit more with data science, and the coding is often simple thanks to the Python language that runs along with it.
The commands that we looked at in this chapter are going to be some of the basic ones with Python and with Pandas, but they are meant to help us learn a bit more about this library, and all of the things that we can do with the Pandas library when it comes to Python and to the data science work that we would like to complete. There is a lot of power that comes with the Pandas library, and being able to put all of this together, and use some of the algorithms and models that come with this library, can make our data analysis so much better.
The work that we did in this chapter is a great introduction to what we are able to do with the Pandas library, but this is just the beginning. You will find that when you work with Pandas to help out with your data analysis, you are going to see some great results, and will be able to really write out some strong models and code that help not only to bring in the data that your company needs, but to provide you with the predictions and insights that are needed as well, so your business can be moved into the future.
Chapter 4 - The Python Data Types
The next thing that we need to take a look at is the Python data types. Each value in Python has a data type.
Since everything is an object in Python programming, data types are going to be like classes, and variables are going to be instances, also known as objects, of these classes. There are a lot of different types of data in Python. Some of the crucial data types that we are able to work with include:

Python numbers
The first option that we are able to work with in the Python data types is the Python numbers. These are going to include things like complex numbers, floating-point numbers, and even integers. They are going to be defined as the complex, float, and int classes in Python. For example, we are able to work with the type() function to identify which class a value or a variable belongs to, and then the isinstance() function to check if an object belongs to a distinct class.
Integers can be of any length; they are only going to be limited by how much memory you have available on your computer. Then there is the floating-point number.
This is going to be accurate up to 15 decimal places, though you can definitely go with a smaller amount as well.
A floating-point number is recognized by its decimal point: 1 is going to be an integer, and 1.0 will be a floating-point number.
And finally, we have complex numbers. These are going to be the numbers that we will want to write out as x + yj, where x is going to be the real part, and y is going to be the imaginary part.
We need to have these two put together in order to make up the complexity that we need with this kind of number.
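A short sketch of these number types, using the type() and isinstance() functions mentioned above; the values are arbitrary:

a = 1        # an integer
b = 1.0      # a floating-point number
c = 3 + 5j   # a complex number: real part 3, imaginary part 5

print(type(a))                  # <class 'int'>
print(type(b))                  # <class 'float'>
print(isinstance(c, complex))   # True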

Python lists
The next type of data that will show up in the Python language is the Python list. The Python list is going to be an ordered sequence of items. It is going to be one of the data types that are used the most in Python, and it is exceedingly flexible.
All of the items that show up on the list can be of the same type, but this is not a requirement. You are able to work with a lot of different items on your list, without them being the same type, to make it easier to work with.
Declaring a list is going to be a straightforward option that we are able to work with. The items are going to be separated out by commas, and then we just need to include them inside square brackets like this: [ ]. We can also employ the slicing operator to help us obtain a piece or a selection of items out of that list.
The index starts at 0 in Python.
And we have to remember while working on these that lists are going to be mutable.
What this means is that the value of the elements that are on your list can be altered in order to meet your own needs overall.
Python Tuple
We can also work with something that is known as a Python tuple. The tuple is going to be an ordered sequence of elements, just like a list, and it is sometimes hard to see how these are going to be similar and how they are going to be different.
The big difference that we are going to see between a tuple and a list is that tuples are going to be immutable.
Tuples, once you create them, are not modifiable.
Tuples are used to write-protect data, and they are generally quicker than a list, as they cannot change dynamically. A tuple is going to be defined with parentheses (), where the items are also going to be separated out by commas, as we see with the lists.
We can then use the slicing operator to help us extract some of the components that we want to use, but we still are not able to change the values while we are working with the code or the program.
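A quick sketch of a tuple, showing that slicing works but assignment does not; the values are arbitrary:

# A tuple is declared with parentheses and cannot be changed afterwards.
my_tuple = (10, "chocolate", 3.5)

print(my_tuple[0:2])   # (10, 'chocolate') -- slicing still works

# Trying to change an element raises a TypeError because tuples are immutable.
try:
    my_tuple[1] = "vanilla"
except TypeError as error:
    print("Cannot modify a tuple:", error)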
Python Strings
Python strings are important as well. A string is going to be a sequence that will include some Unicode characters.
We can work with either single quotes or double quotes to mark off our strings, but we need to make sure that the type of quote that we use at the beginning is the one that we finish it off with, or we will cause some confusion with the compiler.
We can even work with multi-line strings with the help of a triple quote.
Like what we are going to see when we use the tuple or the list that we talked about above, the slicing operator is something that we are able to use with our string as well. And just like with what we see in the tuples, we will find that the string is going to be immutable.
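A brief sketch of strings with single, double, and triple quotes, plus slicing; the text is arbitrary:

single = 'hello'
double = "world"
multi_line = """This string
spans more than one line."""

print(single[0:3])     # 'hel' -- the slicing operator works on strings too
print(double.upper())  # 'WORLD'
print(multi_line)

# Strings are immutable, so a line like single[0] = "H" would raise a TypeError.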

Python Set
Next on the list is going to be the Python set. The set is going to be an option from Python that will include an unordered collection of items that are unique. The set is going to be defined by values that we can separate with commas inside braces. The elements in the set are not going to be ordered, so we can use them in any manner that we would like.
We have the option to perform set operations such as a union or an intersection on two sets.
The sets that we work with are going to hold unique values, and they will make sure that we eliminate the duplicates. Since the set is going to be an unordered collection, indexing has no meaning.
Therefore, the slicing operator is not going to work for this kind of option.
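A small example of a set, showing automatic duplicate removal plus the union and intersection operations:

a = {1, 2, 3, 3, 2}   # the duplicates are removed automatically
b = {3, 4, 5}

print(a)        # {1, 2, 3}
print(a | b)    # union: {1, 2, 3, 4, 5}
print(a & b)    # intersection: {3}

# Sets are unordered, so indexing such as a[0] would raise a TypeError.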

Python Dictionary
And the final type of Python data that we are going to take a look at is known as the Python dictionary. This is going to be an unordered collection of key-value pairs that we are able to work with. It is generally going to be used when we are working with a very large amount of data. Dictionaries are optimized in such a way that they do a great job of retrieving our data. We have to know the key ahead of time in order to retrieve the value.
When we are working with the Python language, a dictionary is going to be defined inside braces, with every element being a pair in the form key: value. The key and the value can be any type that you would like based on the kind of code that you would like to write. We can also use the key to help us retrieve the respective value that we need. But we are not able to turn this around and look up a key from its value in that manner at all.
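A short sketch of a dictionary, where values are looked up by key (the reverse lookup, from value back to key, is not available directly); the entries are made up for illustration:

# Keys map to values; retrieving a value requires knowing its key.
prices = {"chocolate": 2.5, "vanilla": 2.0, "strawberry": 3.0}

print(prices["vanilla"])         # 2.0 -- retrieve a value by its key
prices["mint"] = 2.75            # add a new key: value pair
print(prices.get("coffee", 0))   # 0 -- a safe default when the key is missing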
Working with the different types of data is going to be so important for all of the work that you can do in your Python coding, and can help you out when it is time to work with data science.
Take a look at the different types of data that are available with the Python language, and see how useful they can be for any of the codes and algorithms that you want to write into your data science project overall.
Chapter 5 - Some of The Basic Parts of The Python
Code
Now that we have learned a bit more about the Python code, and some of the
things that you need to do in order to get this coding language set up on your
computer, it is time to take a look at some of the different things that you can
do with your code. We are going to start out with some of the basics, and
then will build on this when we get a bit further on in this guidebook to see
some of the other things that we are able to do with this language. With this
in mind, let’s take a look at some of the basics that you need to know about
any code in Python, and all that you are going to be able to do with this
coding language.
The Keywords in Python
The first part of the Python code that we are going to focus on is the Python
keywords. These keywords are going to be reserved because they give the
commands over to the compiler. You do not want to let these keywords show
up in other parts of the code, and it is important to know that you are using
them in the right part of the code.
Any time that you are using these keywords in the wrong manner, or in the
wrong part of the code, you are going to end up with some errors in place.
These keywords are going to be there to tell your compiler what you wanted
to happen, allowing it to know what it should do at the different parts of the
code. They are really important to the code and will make sure that everything works in the proper manner and at the right times.
How To Name The Identifiers in your Code
The next thing that we need to focus on for a moment when it comes to your
code is working with the identifiers. There are a lot of different identifiers
that you are able to work with, and they do come in a variety of names
including classes, variables, entities, and functions. The neat thing that
happens when you go through the process of naming an identifier is that the
same rules are going to apply no matter what name you have, which can
make it easier for a beginner to remember the different rules that come with
them.
So, let’s dive into some of the rules that we need to remember when doing
these identifiers. You have a lot of different options to keep in mind when
you decide to name the identifiers. For example, you can rely on using all
kinds of letters, whether they are lowercase or uppercase. Numbers work
well, too. You will be allowed to bring in the underscore symbol any time
that you would like. And any combination of these together can help you to
finish up the naming that you want to do.
One thing to remember with the naming rules though is that you should not
start the name with any kind of number, and you do not want to allow any
kind of space between the words that you are writing out. So, you would not
want to pick out the name of 5kids, but you could call it fivekids. And five
kids for a name would not work, but five_kids would be fine.
When you are working on the name for any of the identifiers that you want to
create in this kind of coding language, you need to make sure that you are
following the rules above, but add to this that the name you choose has to be
one that you are able to remember later. You are going to need to, at some
point, pull that name back up, and if you picked out one that is difficult to
remember or doesn’t make sense in the code that you are doing, and you
can’t call it back up, it is going to raise an error or another problem along the
way. Outside of these rules, you will be fine naming the identifier anything
that makes sense for that part of the code.
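As a quick illustration of those naming rules, here are a few identifiers; the names themselves are made up:

five_kids = 5       # valid: letters, numbers, and underscores are allowed
fiveKids = 5        # valid: uppercase and lowercase letters are both fine
# 5kids = 5         # invalid: an identifier cannot start with a number
# five kids = 5     # invalid: spaces are not allowed inside a name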
How to Handle the Control Flow with Python
The control flow in this language is important. This control flow is there to ensure that you write out the code the proper way. There are some types of strings in your code that you may want to write out so that the compiler can read them the right way. But if you write out the string in the wrong manner, you are going to end up with errors in the system. We will take a look at many examples in this guidebook that follow the right control flow for this language, which can make it easier to know what you need to get done and how you can write out code in this language.

The Python Statements


The next topic that we need to take a look at when we do some of our codings
is the idea of the statements. These are going to be a simple thing to work on
when it comes to Python. They are simply going to be strings of code that
you are able to write out, and then you will tell the compiler to show that string on the computer screen at the right time.
When you give the compiler the instructions that it needs to follow, you will
find that there are going to be statements that come with it. As long as you
write these statements out in the right manner, the compiler is going to be
able to read them and will show the message that you have chosen on the
computer screen. You are able to choose to write these statements out as long
or as short as you would like, and it all is going to depend on the kind of code
that you are trying to work on at the time.
The Importance of the Python Comments
Any time that you are writing out new code in Python, it is important to know
how to work with the comments. You may find that as you are working on
the various parts of your code, and changing things around, you may want to
add a note or name a part of the code or leave any other explanation that
helps to know what that part of the code is all about. These notes are things
that you and anyone else who is reading through the code will be able to see
and utilize, but they are not going to affect the code. The compiler knows that a comment is going on, and will just skip it and go to the next part of the
code that you wrote out.
Making your own comment in Python is a pretty easy process. You just need
to add in the # symbol before the note that you want to write, and then the
compiler knows that a note is there and that it doesn’t need to read that part
of the code at all. It is possible for you to go through and add in as many of these comments to the code that you are writing as you would like, and you could fill up the whole code with comments. The compiler would be able to handle this, but the best coding practice is to just add in the amount that you
really need. This helps to keep things organized and ensures that you are
going to have things looking nice and neat.
Variables in Python
Variables are another part of the code that you will need to know about
because they are so common in your code. The variables are there to help
store some of the values that you place in the code, helping them to stay
organized and nice. You can easily add in some of the values to the right
variable simply by using the equal sign. It is even possible for you to assign two values to two variables at the same time if you want, and you will see this occur in a few of the codes that we discuss throughout this guidebook.
Variables are very common and you will easily see them throughout the
examples that we show.
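Here is a tiny sketch that pulls comments and variables together; the values are arbitrary:

# This line is a comment, so the compiler simply skips it.
age = 25                   # store a value in a variable with the equal sign
name = "Python learner"    # variables can hold text as well as numbers

# Two values can be assigned to two variables on a single line.
width, height = 10, 20
print(name, age, width * height)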

Looking for the Operators


Another part of the code to focus on when working in Python is the operator. Operators are simple to use, and you will see them in most of the code you write; even though they are easy to work with, they add a level of power that many of your programs rely on. There are several kinds of operators to choose from when you write Python code, so you have some options.
For example, there are the arithmetic operators, which are good any time you need to do some kind of mathematics in your code. There are the assignment operators, which make sure a value is assigned over to the variable you are working with. And there are the comparison operators, which let you take two parts of the code, or the code and the input from the user, compare them to see whether they are the same, and then react in the way you choose based on the code you wrote.
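Here is a small sketch that touches each of those operator families:

a = 10
b = 3
print(a + b)    # arithmetic operator: addition, prints 13
print(a * b)    # arithmetic operator: multiplication, prints 30
a += 1          # assignment operator: shorthand for a = a + 1
print(a == b)   # comparison operator: prints False, since 11 is not equal to 3
print(a > b)    # comparison operator: prints True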

As you can see, there are a lot of different pieces that make up the basics of Python. You will see many of them in the code you write in this language, and together they can really help you start writing programs of your own. As we move through the examples and practice exercises in this guidebook, you will find these basics in most of the code you work on.
Chapter 6 - Use Case, Creating Requirements and
Mindmapping
Object-oriented programming (OOP) is a model in which programs are organized around objects and data rather than around functions and logic. An object has its own behavior and attributes. OOP moves away from the historical approach to programming, where the stress was on how the logic is written, and instead puts the data itself at the center. Examples of objects range from physical entities, such as a human, to small on-screen components such as widgets.
A programmer focuses first on data modeling, the step in which all the objects to be manipulated are identified, along with how these objects relate to each other. After an object is identified, it is generalized as a class of objects. The class defines the kind of data it contains as well as the logic sequences that can manipulate it. Each logic sequence is called a method, while objects communicate with one another through well-defined interfaces called messages.
In OOP, developers focus on manipulating objects rather than on the logic required to manipulate them. This approach is well suited to programs that are complex, large, and actively maintained. Open-source organizations also support object-oriented programming by allowing programmers to contribute to projects in groups, which results in collaborative development. Further benefits of object-oriented programming include code scalability, reusability, and efficiency.

Principles of Object-Oriented Programming


There are many principles involved in the object-oriented programming. Here
are some of them discussed below:
· Encapsulation
· Inheritance
· Abstraction
· Polymorphism
Encapsulation
The state and implementation of each object are held privately inside a defined class or boundary. Other objects can only access this class by calling a list of public functions or methods; they cannot reach into the class directly or change it without that authority. This characteristic of data hiding avoids unintended data corruption and provides greater program security.

Inheritance
Inheritance builds a relation between classes: subclasses can be derived from existing classes, which lets developers reuse common logic while still maintaining a unique hierarchy. This property results in more thorough data analysis, a higher level of accuracy, and reduced development time.

Abstraction
Objects reveal only the internal mechanisms that are relevant for the use of other objects and hide everything else. This gives the developer a simpler picture of each object to build on, which makes it easier to add functionality or make changes over time.

Polymorphism
Depending on the context, objects are allowed to take on more than one form. The program determines which meaning or behavior to use for each execution of an object, which cuts down on the need to duplicate code.
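As a rough sketch of how these four principles look in Python (the class and method names here are invented purely for illustration):

class Animal:
    def __init__(self, name):
        self._name = name              # leading underscore: kept "private" by convention (encapsulation)

    def speak(self):                   # other code only needs this interface, not the details (abstraction)
        raise NotImplementedError

class Dog(Animal):                     # Dog reuses everything Animal defines (inheritance)
    def speak(self):
        return self._name + " says woof"

class Cat(Animal):
    def speak(self):
        return self._name + " says meow"

for pet in [Dog("Rex"), Cat("Tom")]:
    print(pet.speak())                 # the same call behaves differently per object (polymorphism)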

Criticism on Object-oriented programming


Developers have criticized object-oriented programming for multiple reasons. One of the major concerns is that it does not focus on computation or algorithms. Object-oriented code can also be more complicated to write and take longer to compile. Developers have, however, found alternatives for dealing with these complications.
The alternatives include the following:
· Functional programming
· Imperative programming
· Structured programming
However, the most advanced programming languages give developers the option to combine these paradigms with OOP.

Python as an Object-Oriented Programming Language
The Python programming language is widely used as an object-oriented programming language for web application development. According to one survey, 90 percent of programmers prefer to work with Python over other languages, for a number of reasons: its simplicity, readability, and easy interfacing are the major ones. Python supports object-oriented programming as well as the procedural paradigm, so advanced and diverse applications can be built with clean and very simple code.
Application development with Python also relies on frameworks, which make development easier for developers. The most used frameworks include Django, CherryPy, Pyramid, Flask, Qt, PyGUI, and Kivy. Which framework to use depends on the nature and requirements of the individual project. These Python frameworks help developers build sophisticated applications with minimal effort and time.
Python is also a popular scripting language for many software development processes. Furthermore, Python can be economically utilized to integrate disparate systems together.
Chapter 7 - Basic Statistics Concepts of Data
Scientists
When opting for data science, one of the main things anyone should know is the fundamentals of the field. These fundamentals include statistics, which is further broken down into topics such as distributions, dimensionality reduction, and probability. You should be familiar with these fundamentals before getting into data science, so let us take a look at the statistical concepts.
Statistics
Statistics is one of the most important aspects of data science; it is a powerful tool for performing the art of data science. So what is statistics, exactly? Statistics is the branch of mathematics used to perform technical analysis of data. A bar chart or line graph may give you a basic visualization of the data and some high-level information, but statistics lets us extract more: we are able to work on the data in a deeper, more targeted way, in what can be called an "information-driven approach." The math involved in this approach produces concrete and accurate conclusions, which reduces the amount of estimating or guesstimating you have to do.
By the use of statistics, we are able to gain deep-level and more refined
insights about the structural view of the data. The structural view helps us in
identifying and getting more information by optimally applying other data
science techniques. There are five basic statistics concepts that a data scientist
must be familiar with, and they must know how to apply them in the most
effective way!
Statistical Features
The most commonly used statistical concepts are statistical features. You will work with things like bias, variance, mean, median, percentiles, graphs, and tables. Statistical features are probably the first stats technique you will apply when exploring a dataset, and understanding them is fairly easy for a beginner!
[Figure: box plot example – source: http://www.physics.csbsju.edu]
Observe the above figure: the line cutting through the middle of the box is the median value of the data. The median is sometimes confused with the mean; the median is usually preferred over the mean because it is more robust to outliers. As the figure shows, the box is bounded by quartiles: the first quartile sits at the 25th percentile, meaning 25% of the data points fall below that value, and the third quartile sits at the 75th percentile, meaning 75% of the data fall below that value. The minimum and maximum values represent the lower and upper limits of the data range, respectively.
The above plot illustrates the following statistical features:
Ø If many values fall in a small range, with most data points similar, the box plot will be "short."
Ø If the values cover a wide range, with most data points different, the box plot will be "tall."
Ø The position of the median also tells a story: if the median is near the bottom of the box, most of the data have lower values; inversely, if the median is near the top, most of the data have higher values. To sum up, you are looking at skewed data if the median is not at the center of the box plot.
Ø Is the tail very long? That indicates the data have a high standard deviation and variance, i.e., the values are scattered and highly varying. If the long tail bends towards only one side of the box and not the other, then your data vary mostly in that direction.
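To see these features in code, here is a small sketch using NumPy and Matplotlib on some made-up sample data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=50, scale=10, size=200)   # invented sample data

print("median:", np.median(data))
print("first quartile:", np.percentile(data, 25))
print("third quartile:", np.percentile(data, 75))
print("min:", data.min(), "max:", data.max())

plt.boxplot(data)    # draws a box plot like the one described above
plt.show()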
The above box plot example was quite simple to understand, and it is easy to read information off it. Now let us look at some related concepts in more detail:
Probability Distributions
Probability can be defined as the chance that some event occurs. Probability is quantified on a standard scale from 0 to 1, where 0 indicates that the event is certain not to occur and 1 indicates that the event is certain to occur. A probability distribution is a function that represents the probabilities of all possible values in the model. Let us take a look at some common probability distribution patterns:
Uniform Distribution Pattern
[Figure: uniform distribution – source: https://miro.medium.com]
Uniform Distribution is the simplest, most basic type of distribution. Observe the above figure for the X and Y coordinates and the points plotted on them. A uniform distribution takes a single constant value over a certain range, and anything outside that range has a probability of 0. You can think of it as an "on or off" distribution: a value is either 0 or the constant value. A variable may take more than one non-zero value, but that can still be visualized in a similar way, as a piecewise function made up of multiple uniform distributions.
Normal Distribution Pattern
Normal Distribution, also referred to as a "Gaussian Distribution," is defined by its mean value and its standard deviation. The mean shifts the distribution spatially, whereas the spread is controlled by the standard deviation. The notable property of this distribution is that the spread around the mean is the same in all directions, regardless of where the mean sits. With a Gaussian distribution, we can understand the overall average value of the data, and we can also identify the range over which the data are spread: the data may be concentrated around a few values, or they can have a wider range spread over many values.

Poisson Distribution
[Figure: Poisson distribution – source: https://miro.medium.com]
The above graph depicts a Poisson Distribution. A Poisson Distribution is similar to a normal distribution, with the addition of skewness. In statistics, skewness measures how asymmetric a distribution is. If the skewness is low, the values spread out fairly evenly in all directions, much like a normal distribution. On the other hand, if the skewness is high, the distribution is lopsided: the values might be concentrated in one position, or they can be scattered over a wide part of the graph.
This is an overview of the most commonly used distribution patterns, but distributions are not limited to these three; many more are used in data science. Of these three, the Gaussian distribution can be used with many algorithms, whereas choosing an algorithm for Poisson-distributed data must be done carefully because of its skewness.
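If you want to play with these three patterns yourself, NumPy can draw samples from each of them (the parameters below are arbitrary):

import numpy as np

uniform_samples = np.random.uniform(low=0, high=1, size=1000)   # uniform distribution
normal_samples = np.random.normal(loc=0, scale=1, size=1000)    # Gaussian: mean 0, standard deviation 1
poisson_samples = np.random.poisson(lam=3, size=1000)           # Poisson with rate 3

# The sample statistics roughly match the parameters chosen above
print(normal_samples.mean(), normal_samples.std())
print(poisson_samples.mean())   # should come out close to 3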
Dimensionality Reduction
Dimensionality Reduction is fairly intuitive to understand. In data science we are given a dataset, and with a dimensionality reduction technique we reduce the number of dimensions of the data. Imagine you have been given a dataset of 1,000 data points arranged as a 3-dimensional cube. You might think that computing 1,000 data points is easy, but at a larger scale this creates problems and complexity. If we use dimensionality reduction to look at the data in 2 dimensions, it becomes easy to rearrange the colors into categories, which might reduce the data from 1,000 points to perhaps 100 if you group the colors into 10 categories. Computing 100 data points is much easier than computing 1,000. In some cases these 100 data points can be reduced further, to 10, by identifying color similarity and grouping similar shades together, which a 1-dimensional view makes possible. This technique brings a big computational saving.
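The chapter describes the idea in general terms; one common concrete technique for it is Principal Component Analysis (PCA), which scikit-learn provides out of the box. A minimal sketch with invented data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 3)     # 1,000 made-up data points in 3 dimensions

pca = PCA(n_components=2)       # ask for a 2-dimensional view of the same data
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (1000, 2)
print(pca.explained_variance_ratio_)     # how much information each new dimension keeps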
Feature Pruning
Feature pruning is another way of performing dimensionality reduction. Where the previous technique reduced the number of data points, here we reduce the number of features that are less important, or not important at all, to our analysis. For example, a dataset may have 20 features; 15 of them might have a high correlation with the output, whereas five may have little or no correlation at all. We may then want to remove those five features with feature pruning to cut unwanted elements and reduce computing time and effort, as long as the output remains unaffected.
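A rough sketch of this idea with pandas, assuming a made-up dataset and an arbitrary correlation threshold of 0.1:

import numpy as np
import pandas as pd

# Invented dataset: 20 features plus a target column
columns = ["feature_%d" % i for i in range(20)] + ["target"]
df = pd.DataFrame(np.random.rand(100, 21), columns=columns)

correlations = df.corr()["target"].drop("target").abs()
weak_features = correlations[correlations < 0.1].index   # features with almost no correlation
df_pruned = df.drop(columns=weak_features)

print("dropped features:", list(weak_features))
print("remaining shape:", df_pruned.shape)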
Over and Under Sampling
Over- and under-sampling are techniques used when classifying imbalanced problems. Whenever we try to classify a dataset, the classes may not be balanced. For example, imagine we have 2,000 data points in Class 1 but only 200 data points in Class 2. Many of the machine learning techniques we might use to model the data and make predictions would struggle with that imbalance. This is where over- and under-sampling come into the picture. Look at the representation below.
Looking at the image carefully, it can be seen that on both sides the blue class has a far larger number of samples than the orange class. In such a case, we have two pre-processing options that make it easier to predict the result.

Defining Under-Sampling
Under-sampling means selecting only some of the data from the majority class, keeping only about as many data points as the minority class has. The selection is made so that the classes become balanced while the probability distribution of the majority class is preserved.
Defining Over-Sampling
Over-sampling means creating copies of data points from the minority class so that its number of data points matches the majority class. The copies are made so that the distribution of the minority class is maintained.
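Both ideas can be sketched with scikit-learn's resample helper (the class sizes below mirror the 2,000-versus-200 example, and the arrays are just stand-ins for real samples):

import numpy as np
from sklearn.utils import resample

major_class = np.arange(2000)   # stand-in for 2,000 samples of the major class
minor_class = np.arange(200)    # stand-in for 200 samples of the minor class

# Under-sampling: keep only as many major-class points as the minor class has
major_down = resample(major_class, replace=False, n_samples=len(minor_class), random_state=0)

# Over-sampling: copy minor-class points until they match the major class
minor_up = resample(minor_class, replace=True, n_samples=len(major_class), random_state=0)

print(len(major_down), len(minor_up))   # 200 2000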
Bayesian Statistics
Before understanding Bayesian statistics, we need to know why plain frequency analysis is not always enough. You can see it in the following example.
Imagine you are playing a dice game. What is the chance of rolling a 6? You would probably say 1 in 6, right? Indeed, if we performed a frequency analysis and someone rolled the dice 10,000 times, we would arrive at roughly that 1-in-6 estimate. But if the dice is loaded so that it always lands on 6, that estimate is no longer the whole story. Frequency analysis only takes the prior data, the observed rolls, into account, so it sometimes fails when there is other evidence, such as knowing the dice is loaded. Bayesian statistics takes that evidence into account.
Bayes' Theorem
Let us work through the meaning of the theorem:

P(H|E) = P(E|H) · P(H) / P(E)

The quantity P(H) comes from the frequency analysis: given our prior information, what is the probability of our event happening? The P(E|H) term in the equation is known as the likelihood, and it is basically the probability that our evidence is correct, given the information from the frequency analysis. For instance, if you roll the dice 5,000 times and the first 1,000 rolls all come out as 6, it is pretty clear that the dice is loaded to give a 6. Finally, P(E) is the probability that the actual evidence is true.
If the frequency analysis you performed is very good, you can be fairly certain that the dice is loaded with a perfect 6. But you should also weigh the evidence of the loaded dice, whether it is true or not, based on both its prior information and the frequency analysis you just performed. Everything is taken into consideration in this theorem. Bayes' theorem is useful when you suspect that the prior data alone is not sufficient to predict the result. These statistical concepts are very useful for an aspiring data scientist.
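As a tiny numerical illustration of the theorem (all of the probabilities below are made-up assumptions, not values from the text):

# H = "the dice is loaded", E = "we observed an unusually long run of sixes"
p_loaded = 0.01           # P(H): prior belief that the dice is loaded
p_e_if_loaded = 0.95      # P(E|H): chance of seeing that run if the dice is loaded
p_e_if_fair = 0.05        # chance of seeing that run if the dice is fair

# P(E) by the law of total probability
p_e = p_e_if_loaded * p_loaded + p_e_if_fair * (1 - p_loaded)

# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_loaded_given_e = p_e_if_loaded * p_loaded / p_e
print(round(p_loaded_given_e, 3))   # the evidence pushes our belief well above the 1% prior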
Chapter 8 - Exploring Our Raw Data
The first step in the process of data science is exploring some of the raw data you are working with. We cannot build the models we want, or learn anything of value, if we are not first able to explore where to find this data and then see what is inside.
Many companies already know a lot about collecting data, and it is likely that you already have a ton of it sitting in storage. That is a great place to start, but it is uncertain whether it is the right data for your needs. Just because you have collected a large amount of data doesn't mean you are able to use it, and that is part of what we need to explore in this guidebook to help us get started.

Answering Your Big Business Questions


Before we spend any time collecting the data we would like to use in data analysis, it is important to stop and focus on how we plan to answer our big business questions along the way. There is always some reason why you want to do data science in the first place.
This is not something we do just to keep ourselves busy or for the fun of it.
Instead, it is a time-consuming process, and we use it to help our business become more successful and see more growth.
Maybe you decide that you want to work with data science in order to figure out which product is the best one to come out with next. Maybe you want to provide a better customer service experience to your customers. Maybe you feel there is some waste in your company and you want to figure out the best way to reduce it, ensuring that you keep costs down while quality stays high. Or maybe it is time to look for a new niche for your products, and you want to figure out where that would be.
Each business that is ready to work with data science has some kind of big question or problem it would like to solve. This is something we need to figure out right from the beginning, because it guides how we find the data. When we know what we will use the data for when all is said and done, it is a lot easier to find just the data we need, rather than collecting a lot of data we will never use at all.
Of course, if you have already spent time collecting a lot of big data, do not fret. The business question is something we can use when sorting through all of that data to find the insights that actually matter to you.
Often the data, no matter what kind you collected, holds some insights, but perhaps not the exact kind of insight you are really looking for. Having that business problem or question in mind is critical because it helps you know where to look and how to stay on track as you go through all of that data.

Deciding Where to Store the Data


As we go through this process, we need to spend a bit of time deciding where we would like to store all of the information in our possession. In the past, this was a big issue for a lot of companies: the options for storing all of that data were often expensive and too time-consuming to work with, so companies had to either drop some of the important information or make do with what they had.
Today, though, there are a lot of great storage options to choose from. Many of them are inexpensive and able to hold a large amount of data, unlike what we saw in the past. This makes it easier to collect as much data as we would like in order to complete the process of data science.
With this in mind, we have to go through the different options and decide which storage option is best for our needs. We should look at the security features offered with each option, how much data it can hold, the other features that come with the storage, and how much it is going to cost.
This helps us choose the storage option that will work for our needs and all of the data we want to work with.
Places to Search for Data
Another thing that we need to consider as we go through this process is
where we should search for some of the data that we want to work with. This
is not too difficult if you already have an idea of your big business question.
This can lead you to a few places where you would like to search for the
needed data.
The first place you may want to search is with your customers. Below we will talk about a few of the different methods you can use, such as interviews, focus groups, and observations, to figure out more about what the customer is looking for. You can then plug this information into one of the models we will create later and gain some new predictions and insights into what will suit your customers later on.
Another option is to work through studies that third-party researchers have done. You can do some of your own research, but it often takes so long to complete and costs so much that it is hard to do on your own.
There are a variety of third-party researchers who may have some of the information you need, and you can use their work to help complete the data science process.
If you are going with one of these options, just be careful about the source you are using. You want to know the motivations behind the researcher doing the study and figure out whether there is any bias or slant in the information that you need to be aware of before you start using that data.
Social media is becoming an ever more prevalent method of gathering data. Social data often comes in unstructured, and it needs a lot of work before you can pass it through a machine learning algorithm or model. But since many companies meet their customers online and learn more about them there, it is a great place to gather information about your customers, so it is definitely not one we should miss out on.
Some of the types of research that you are able to work with will include
options like:
1. One-on-one interviews: These are the interviews where someone
on your team is able to meet up with someone in your study
group and ask them questions. The responses that you are able to
get from these individuals are going to be useful for you to use in
the analysis to figure out how others in the population that you
are interested in, such as your target customers, are going to feel
about something as well. This is going to work the best when it
is time to create a list of questions to ask so that the interview
stays on track and is ready to go when you are.
2. Focus groups: We may be able to find a lot of great options
when it comes to the one-on-one interviews, but these can take a
lot of time since we have to go through and actually meet up
with and talk to a lot of people one at a time. But with the focus
groups, we can bring in a lot of people to the discussion and then
work off the information that we need. This is a much faster
method to work with, and can help us to really get the answers
and results that we want without having to go to each person one
by one.
3. Surveys: Surveys are going to be a great and economical method
of data collection that you are able to get from a larger audience.
Similar to some of the other methods, surveys are going to need
a set of questions that others are able to answer. This is often
done by an internet application, but other methods can be used as
well. This helps you to get a lot of responses in a short amount of
time and will ensure that you are going to get the results that you
would like in no time at all.
4. Observations: This is going to be one of the methods of data
collection that is going to involve physically viewing the actions
of the customers at the end of the process. To use this kind of
method, you would want to observe the participants to see what works
works for them. You could offer a finished product that you want
to sell to the individuals, and then ask them to talk about what
they like and don’t like about the product in order to see how
other customers are going to behave.
Gathering the data that you want to use is going to take some time, and a lot
of research needs to go into it. But this will ensure that you are getting some
great data that can work for your insights and predictions. The amount of data
that you are able to collect is not going to be that important, even though
most businesses do collect a ton of data.
The important factor is that the data you work with is high quality and that you can actually gain good insights out of it when you get started with your algorithms and models.
Chapter 9 - Languages Required for Data Science.

Basics of Python
Keywords are an important part of Python programming; they are words that
are reserved for use by Python itself. You can’t use these names for anything
other than what they are intended for, and you most definitely can’t use them
as part of an identifier name, such as a function or a variable. Reserved
keywords are used for defining the structure and the syntax of Python. There are, at the time of writing, 33 of these words (newer Python releases add a few more, such as async and await), and one thing you must remember is that they are case sensitive—only three of them are capitalized, and the rest are in lower case. These are the keywords, written exactly as they appear in the language:

False
if
assert
as

is
global
in
pass
finally
try
not
while
return
None
for
True
class
break
elif
continue
yield
and
del
with
import
else
def
except
from
or
lambda
nonlocal
raise
Note: only True, False, and None are capitalized.
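If you want to check this list on your own machine, the standard library's keyword module will print it for you (the exact count can differ slightly between Python versions):

import keyword

print(keyword.kwlist)         # every reserved word for the Python version you are running
print(len(keyword.kwlist))    # 33 on older Python 3 releases, 35 on newer ones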
The identifiers are the names that we give to things like variables, functions,
classes, etc., and the name is just so that we can identify one from another.
There are certain rules that you must abide by when you write an identifier:
● You may use a combination of lowercase letters (a to z), uppercase letters
(A to Z), digits (0 to 9), and the underscore (_). Names such as func_2,
myVar and print_to_screen are all examples of perfectly valid identifier
names.
● You may not start an identifier name with a digit, so 2Class would be
invalid, whereas class2 is valid.
● You may not, as mentioned above, use a reserved keyword in the
identifier name. For example:
>>> global = 2
File "<interactive input>", line 3
global = 2
^
Would give you an error message of:
SyntaxError: invalid syntax
● You may not use any special symbols, such as $, %, #, !, etc., in the
identifier name. For example:
>>> a$ = 1
File "<interactive input>", line 13
a$ = 1
^
Would also give you the following error message:
SyntaxError: invalid syntax
An identifier name can be any length you require.
Things to note are:
• Because the Python programming language is case sensitive, variable
and Variable would mean different things.
• Make sure your identifier names reflect what the identifier does. For
example, while you could get away with writing c = 12, it would make
more sense to write count = 12. You know at a glance exactly what it
does, even if you don’t look at the code for several weeks.
• Use underscores where possible to separate a name made up of multiple words, for example, this_variable_has_many_words
You may also use camel case. This is a writing style where the first letter of
every word is capitalized except for the first one, for example,
thisVariableHasManyWords.
Advantages of Machine Learning
Due to the sheer volume and magnitude of the tasks, there are some instances
where an engineer or developer cannot succeed, no matter how hard they try;
in those cases, the advantages of machines over humans are clearly stark.

Identifies Patterns
When the engineer feeds a machine with artificial intelligence a training data
set, it will then learn how to identify patterns within the data and produce
results for any other similar inputs that the engineer provides the machine
with. This is efficiency far beyond that of a normal analyst. Due to the strong
connection between machine learning and data science (which is the process
of crunching large volumes of data and unearthing relationships between the
underlying variables), through machine learning, one can derive important
insights into large volumes of data.

Improves Efficiency
Humans might have designed certain machines without a complete
appreciation for their capabilities, since they may be unaware of the different
situations in which a computer or machine will work. Through machine
learning and artificial intelligence, a machine will learn to adapt to
environmental changes and improve its own efficiency, regardless of its
surroundings.

Completes Specific Tasks


A programmer will usually develop a machine to complete certain tasks, most of which involve an elaborate and arduous program, leaving scope for the programmer to make errors of omission. He or she might forget a few steps or details that should have been included in the program. An artificially intelligent machine that can learn on its own would not face these challenges, as it would learn the tasks and processes on its own.

Helps Machines Adapt to a Changing Environment
With ever-changing technology and the development of new programming languages to communicate these technological advancements, it is nearly impossible to convert all existing programs and systems into the new syntaxes. Redesigning every program from its coding stage to adapt to technological advancements is counterproductive. At such times, it is highly efficient to use machine learning so that machines can upgrade and adapt to the changing technological climate on their own.

Helps Machines Handle Large Data Sets


Machine learning brings with it the capability to handle multiple dimensions
and varieties of data simultaneously and in uncertain conditions. An
artificially intelligent machine with abilities to learn on its own can function
in dynamic environments, emphasizing the efficient use of resources.
Machine learning has helped to develop tools that provide continuous
improvements in quality in small and larger process environments.

Disadvantages of Machine Learning


• It is difficult to acquire data to train the machine. The engineer must know what algorithm he or she wants to use to train it, and only then can he or she identify the data set they will need to use to do so. There can be a significant impact on the results obtained if the engineer does not make the right decision.
• It's difficult to interpret the results accurately to determine the effectiveness of the machine-learning algorithm.
• The engineer must experiment with different algorithms before he or she chooses one to train the machine with.
• Technology that surpasses machine learning is being researched; therefore, it is important for machines to constantly learn and transform to adapt to new technology.
Subjects Involved in Machine Learning
Machine learning is a process that uses concepts from multiple subjects. Each
of these subjects helps a programmer develop a new method that can be used
in machine learning, and all these concepts together form the discipline of the
topic. This section covers some of the subjects and languages that are used in
machine learning.

Statistics
A common problem in statistics is testing a hypothesis and identifying the
probability distribution that the data follows. This allows the statistician to
predict the parameters for an unknown data set. Hypothesis testing is one of
the many concepts of statistics that are used in machine learning. Another
concept of statistics that’s used in machine learning is predicting the value of
a function using its sample values. The solutions to such problems are
instances of machine learning, since the problems in question use historical
(past) data to predict future events. Statistics is a crucial part of machine
learning.
Brain Modeling
Neural networks are closely related to machine learning. Scientists have
suggested that nonlinear elements with weighted inputs can be used to create
a neural network. Extensive studies are being conducted to assess these
elements.

Adaptive Control Theory


Adaptive control theory is the part of control theory that deals with methods that help a system adapt to changes in its environment and continue to perform optimally. The idea is that a system should anticipate changes and modify itself accordingly.
Psychological Modeling
For years, psychologists have tried to understand human learning. The EPAM
network is a method that’s commonly used to understand human learning.
This network is utilized to store and retrieve words from a database when the
machine is provided with a function. The concepts of semantic networks and
decision trees were only introduced later. In recent times, research in
psychology has been influenced by artificial intelligence. Another aspect of
psychology called reinforcement learning has been extensively studied in
recent times, and this concept is also used in machine learning.
Artificial Intelligence
As mentioned earlier, a large part of machine learning is concerned with the
subject of artificial intelligence. Studies in artificial intelligence have focused
on the use of analogies for learning purposes and on how past experiences
can help in anticipating and accommodating future events. In recent years,
studies have focused on devising rules for systems that use the concepts of
inductive logic programming and decision tree methods.

Evolutionary Models
A common theory in evolution is that animals prefer to learn how to better
adapt to their surroundings to enhance their performance. For example, early
humans started to use the bow and arrow to protect themselves from
predators that were faster and stronger than them. As far as machines are
concerned, the concepts of learning and evolution can be synonymous with
each other. Therefore, models used to explain evolution can also be utilized
to devise machine learning techniques. The most prominent technique that
has been developed using evolutionary models is the genetic algorithm.
Programming Languages
R
R is a programming language that is estimated to have close to 2 million
users. This language has grown rapidly to become very popular since its
inception in 1990. Although R is often thought of as a programming language for statistical analysis, it can also be used for many other purposes; the tool is not limited to the statistical domain. There are many features that make it a powerful language.
The programming language R is one that can be used for many purposes,
especially by data scientists to analyze and predict information through data.
The idea behind developing R was to make statistical analysis easier.
As time passed, the language began to be used in different domains. There
are many people who are adept at coding in R, although they are not
statisticians. This situation has arisen since many packages are being
developed that help to perform functions like data processing, graphic
visualization, and other analyses. R is now used in the spheres of finance,
genetics, language processing, biology, and market research.

Python
Python is a language that supports multiple paradigms. You can think of Python as a Swiss Army knife in the world of coding, since it supports structured programming, object-oriented programming, functional programming, and other styles. Python is regarded as one of the best languages in the world, since it can be used to write programs in every industry, for data mining, and for website construction.
The creator, Guido van Rossum, decided to name the language Python after Monty Python. If you look at some built-in packages, you will find references to Monty Python sketches in the code or documentation. It is for this reason and many others that Python is a language most programmers love, though engineers or those with a scientific background who move into data science may find it takes some getting used to at first.
Python’s simplicity and readability make it quite easy to understand. The
numerous libraries and packages available on the internet demonstrate that
data scientists in different sectors have written programs that are tailored to
their needs and are available to download.
Since Python can be extended to work best for different programs, data
scientists have begun to use it to analyze data. It is best to learn how to code
in Python since it will help you analyze and interpret data and identify
solutions that will work best for a business.
Chapter 10 - Classification and Prediction

Classification
Classification is one of the most important tasks in data mining. It is based on examining an object's features and, based on those features, assigning the object to one of a predetermined set of classes.
The basic idea goes like this: given a set of categories (classes) and a dataset of samples for which we know the class they belong to, the goal of classification is to create a model which will then be able to automatically assign a class to new, unknown, unclassified samples.

Decision Trees
Decision trees are one of the most popular classification models. They are a simple form of rule representation and are widely used because they are easy to understand.

Description
Decision trees are the simplest classification model. A decision tree consists of internal nodes and leaves. Internal nodes are the nodes which have children, while leaves are the lowest-level nodes, which have no children. A decision tree is represented as follows:

· Each internal node gets the name of a feature
· Each branch between two nodes is named with a condition or a value for the feature of the parent node
· Each leaf is named with the name of a class
On the above image we can see a decision tree based on data from the Titanic passengers. Under the leaves we can see the chance of survival and the percentage of samples leading to that particular leaf. As expected, most men died, since priority in the lifeboats was given to women and children.
In this example the gender, age, and number of family members variables were used in order to determine the class value. Since we have a finite number of class values (survived, died), this is a decision tree that performs classification.
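A tree like the Titanic one can be grown with scikit-learn; the handful of rows below are invented stand-ins for the real passenger data, so the resulting splits are only illustrative:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up sample in the spirit of the Titanic example
data = pd.DataFrame({
    "is_male":  [1, 1, 0, 0, 1, 0, 1, 0],
    "age":      [22, 35, 28, 8, 40, 31, 19, 45],
    "family":   [0, 1, 2, 3, 0, 1, 0, 2],
    "survived": [0, 0, 1, 1, 0, 1, 0, 1],
})

features = ["is_male", "age", "family"]
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data[features], data["survived"])

print(export_text(tree, feature_names=features))   # prints the learned splits as text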
Decision Tree creation – ID3 Algorithm
One of the most popular algorithms for creating decision trees is the ID3 algorithm. This particular algorithm uses the concepts of entropy and information gain for choosing the nodes of the decision tree. As mentioned, the information gain of a feature A is calculated by the formula:

Gain(S, A) = E(S) − Σj ( |Sj| / |S| ) · E(Sj)

where Sj denotes the samples with value j for the feature A, |Sj| their number, S all samples, |S| their number, and E(Sj) the entropy of the subset of the dataset with value j for the feature A. The entropy E of a given set is calculated based on the class assignments of the set's samples. If we have k classes, the entropy of the dataset S is:

E(S) = − Σi pi · log2(pi)

where pi is the probability of class i in S.
In order to create a decision tree, the ID3 algorithm follows these steps:
1. Calculates the information gain from each variable
2. Puts the variable with the highest information gain as root of the tree
3. Creates as many branches as the discrete values of a variable
4. Splits the dataset in as many subsets as the discrete values of the
variable chosen
5. Chooses a value-subset, which is not yet chosen. If for the
current value-subset corresponds only one class value, go to step
6, else go to step 7
6. Put the class value as leaf and continue with the next variable-
subset value and go to step 5
7. Calculate the information gain of the remaining variables for this
particular subset
8. Choose the variable with the highest information gain and add a
new node on the branch corresponding to the current value-
subset
9. Repeat from step 3, until no more leaves can be created
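The two formulas above translate into a few lines of plain Python; this is an illustrative sketch, not the book's own listing:

import math
from collections import Counter

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes present in the set
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(samples, labels, feature):
    # Gain(S, A) = E(S) - sum(|Sj| / |S| * E(Sj)) over the values j of feature A
    total = len(labels)
    gain = entropy(labels)
    for value in set(sample[feature] for sample in samples):
        subset = [label for sample, label in zip(samples, labels)
                  if sample[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

samples = [{"Weather": "Sunshine"}, {"Weather": "Rainy"}, {"Weather": "Rainy"}]
labels = ["In", "Out", "Out"]
print(information_gain(samples, labels, "Weather"))   # 0.918..., the full gain on this tiny set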
Let's see an example of how we can create a decision tree with ID3 for the above dataset. First, we calculate the entropy E(S). For the class variable we have three times the value In and five times the value Out, so:

E(S) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) ≈ 0.954
Next, we will calculate the information gain for each variable. We start with
the Weather variable. We have a total of eight samples and the Weather
variable gets two times the value Sunshine, three times the value Cloudy and
three times the value Rainy. Both two samples with value Weather =
Sunshine have a class value of In. For the three samples with value Weather
= Cloudy one has the class value In and two have the class value Out. So we
have:
where

So finally:
Next, we will calculate the information gain for the Temperature variable. We have a total of 8 samples and the Temperature variable gets 4 times the value High, 2 times the value Normal and 2 times the value Low. For the 4 samples with value Temperature=High, 2 of them have the class value In and 2 of them have the class value Out. Both samples with Temperature=Normal have a class value of Out. For the two samples with value Temperature=Low, 1 has a class value of In and 1 has a class value of Out. So, we have:
where:
So finally:
We then continue with the Humidity variable. We have a total of 8 samples and the Humidity variable gets 4 times the value High and 4 times the value Normal. For the 4 samples with Humidity = High, 2 have a class value of In and 2 have a class value of Out. For the 4 samples with Humidity = Normal, 1 has a class value of In and 3 have a class value of Out. So, we have:
where:
So finally:
Last, we have the Wind variable. We have a total of 8 samples and the Wind
variable gets 6 times the value Light and 2 times the value Strong. For the 6
samples with value Wind =Light, 1 has a class value of In and 5 have a class
value of Out. For the two samples with value Wind =Strong, 1 has a class
value of In and 1 has a class value of Out. So, we have:
where:
So finally:
From the above we can see that the Weather variable has the highest information gain. So, we choose it as the root of our tree.
We then need to examine how each branch will continue. For the Sunshine
and Cloudy values, we notice that all samples belong to the same class, In
and Out respectively. This leads us to leaves:
We now need to examine the samples with value Weather=Rainy.
Initially, we calculate the information gain of the other variables. For the
Temperature (Wind) variable we have 2 samples with Normal (Light) and 1
sample with Low (Strong). For the Temperature=Normal (Wind=Light) we
have 2 samples with class Out and 0 samples with class In, while for the
Temperature=Low (Wind=Strong) we have 1 sample with class In and 0
samples with class Out. Therefore, we have:
where:
Therefore:
Last, for the Humidity variable we have two samples with Normal value and
1 sample with High value. For the sample with Humidity=High we have 1
time the class Out and 0 times the class In. For the two samples with value
Humidity=Normal we have 1 time the class In and 1 time the class Out.
where:
So, we have:
We select the variable with the higher information gain, that is either the
Temperature variable or the Wind variable since they have the same
information gain. On the image below, we can see the final decision tree by
using the algorithm ID3.

Decision Tree creation – Gini Index


Another way of creating decision trees is by using the Gini index for node selection. The Gini index measures the inequality among the values of a frequency distribution. The values range from 0 to 1, with 0 representing perfect equality and 1 representing perfect inequality. For a dataset S with m samples and k classes, gini(S) is calculated by the formula:

gini(S) = 1 − Σj pj²

where pj is the probability of occurrence of class j in the dataset S. If S is divided into S1 and S2, then:

gini_split(S) = (n1/m)·gini(S1) + (n2/m)·gini(S2)

where n1 and n2 are the number of samples in S1 and S2 respectively. The advantage of this method is that for the calculations we only need the split of the classes in each subset. The best feature is the one with the lowest Gini value. Let's see how we can use the Gini index to create a decision tree.
We start with the Weather variable. First, we make the split based on the
values of the variable so we have:
Therefore, for the Weather variable we have:
Then we continue with the Temperature variable:
So, for the Temperature variable we have:
Then we continue with the Humidity variable:
So, for the Humidity variable we have:
Last, we have the Wind variable:
So, for the Wind variable we have:
We choose the feature with the lowest Gini value, i.e. the Weather value.
Next, we will need to examine the values Sunshine, Cloudy and Rainy
individually.
For the Sunshine and Cloudy values, we can see that all samples belong to
the same class, In and Out respectively. Therefore, this leads us to leaves:
For the Rainy value we need to further examine the split. We only need to
examine the samples for which the Weather variable have the Rainy value.
Once again, we start with the Weather variable:
So, for the Weather variable we have:
We notice that for the Temperature, Humidity and Wind variables we have a
similar split, i.e. correspondence of different variable value and class value.
Therefore, the calculation is made with the same way and the resulting values
will be equal. So, we will just need to calculate the Gini index for only one of
these variables. Let’s choose the Temperature variable.
So, for the temperature variable we have:
If we calculate the Gini index for the Humidity and Wind variables as well, we will get:
We have a draw between them so we randomly choose the Temperature
variable. Finally, below we can see the decision tree created:
We should note that the decision trees created with the ID3 algorithm and with the Gini index happen to come out the same in this case.
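For completeness, the Gini calculations can also be sketched in a few lines of Python (again an illustrative sketch, not the book's own code):

def gini(labels):
    # gini(S) = 1 - sum(p_j ** 2) over the classes in the set
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_split(subsets):
    # weighted Gini of a split: sum(n_k / n * gini(S_k)) over the subsets
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)

print(gini(["In", "In", "Out"]))                         # impurity of a single node
print(gini_split([["In", "In"], ["Out", "Out", "In"]]))  # impurity after one split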

Prediction
Difference between Classification and Prediction
At first glance, classification and prediction seem similar. The basic difference between them is that in classification there is a finite set of discrete classes: the samples are used to create a model which is then able to classify new samples. In prediction, the value derived from the model is continuous and doesn't belong to any predefined finite set. As mentioned previously in the Titanic example, we have a finite number of class values (Survived, Died), thus we have a decision tree which makes a classification. If the values of the target variable were not finite, we would instead have a regression tree which would make a prediction.

Linear Regression
Description, Definitions and Notations

The image above presents a simple example of linear regression. The variables are the square meters of a house and its sale price in dollars. Linear regression fits a line to the samples of the dataset, marked as red Xs. The fit is based on a cost function whose value we want to minimize. Once we have the optimal line, i.e. the line that minimizes the value of the cost function, we can answer questions like "What is the selling price of a 150-square-meter house?" fairly accurately. Therefore, given the values of the goal variable (in our case the selling price) for each sample, we try to predict the value of the target variable for new samples.
We will now mention some definitions. We will use m to denote the number
of samples of the training set. We will use X to denote the input variables,
and use y for the goal variable. We will use β for the model parameters.
Cost Function
The cost function F is given by the following formula:

F(β) = (1/2m) · Σ ( hβ(x(i)) − y(i) )²

where the sum runs over the m training samples and x(i), y(i) denote the i-th sample's input and goal value. The basic idea is that we want to minimize the cost function with respect to the βj parameters, so that the value of the hypothesis hβ, i.e. the prediction, is as close as possible to the value of the real goal variable y. The above cost function is the most popular one and is known as the squared error function.

Gradient Descent Algorithm


Our goal is to minimize the value of the cost function F. This can be achieved
by using the right values for the βj parameters. Manual search is prohibitively
time consuming. The goal of gradient descent is to choose the right βj so that
the value of the cost function can be minimized. In short, the algorithm works
like this:

· Random βj values are chosen
· Their values are changed repeatedly, in a predefined way, so that the value of the cost function decreases at each step.
Before we dive into the formulas, let's have a look at how the algorithm works. We will use a simple example with just one input variable and, thus, two parameters, β0 and β1. Imagine we are at a specific point on the graph below, e.g. on one of the two red hills, and we want to move to a lower point. The first thing we need to do is think: if we could take a small step, what direction should it have in order to lead us to a lower point? A similar logic is applied by the gradient descent algorithm to the cost function value. As we will see later on, this logic is implemented through the partial derivatives with respect to β0 and β1. Remember that the value of a derivative gives the slope of a line and thus, in our case, the direction of the path the algorithm should follow at each step it makes.
The gradient descent algorithm has one important feature. From a different
starting point, it is possible that we get a different final point as we can see in
the below image:
The algorithm is as follows:
repeat until convergence {
    βj := βj − α · ∂F/∂βj     (for every j; update all βj simultaneously, at the end of the iteration)
}

The α (alpha) parameter is called the learning parameter and declares how big each step will be in each iteration during the algorithm's execution. Usually, parameter α has a fixed value and is not adjusted during execution. The partial derivative with respect to βj determines the direction in which the algorithm will proceed on the current step. Finally, the update of the βj parameters is applied at the end of each iteration. The corresponding pseudocode demonstrating how the β0 and β1 parameters are updated is the following:

tmp0 = β0 − α · ∂F/∂β0
tmp1 = β1 − α · ∂F/∂β1
β0 = tmp0
β1 = tmp1

So, after we calculate the new value of β0 (tmp0), we still use the old value of β0 to calculate the new value of β1 (tmp1). The new values are then assigned and used in the next iteration.

Gradient Descent in Linear Regression


We previously examined linear regression and gradient descent separately. Now let's see how they work together. Assume we have a linear regression model with the β0 and β1 parameters; the hypothesis is given as:

hβ(x) = β0 + β1 · x

which defines a line y = ax + b with slope a = β1 and constant term b = β0. For this particular linear regression model, the cost function is:

F(β0, β1) = (1/2m) · Σ ( hβ(x(i)) − y(i) )²

In fact, by using the gradient descent algorithm we will minimize the cost function F. First, we need to calculate the partial derivatives:

∂F/∂β0 = (1/m) · Σ ( hβ(x(i)) − y(i) )
∂F/∂β1 = (1/m) · Σ ( hβ(x(i)) − y(i) ) · x(i)
Based on the above calculation, the algorithm becomes:
repeat until convergence {
    β0 := β0 − α · (1/m) · Σ ( hβ(x(i)) − y(i) )
    β1 := β1 − α · (1/m) · Σ ( hβ(x(i)) − y(i) ) · x(i)
    (update β0 and β1 simultaneously, at the end)
}
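The same loop can be written directly in Python with NumPy; the data points below are invented, and scaling x keeps the steps numerically stable:

import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=5000):
    # Fits h(x) = b0 + b1 * x by repeatedly stepping against the gradient of the cost F
    m = len(y)
    b0, b1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (b0 + b1 * x) - y
        tmp0 = b0 - alpha * (1 / m) * error.sum()          # partial derivative w.r.t. b0
        tmp1 = b1 - alpha * (1 / m) * (error * x).sum()    # partial derivative w.r.t. b1
        b0, b1 = tmp0, tmp1                                # simultaneous update
    return b0, b1

square_meters = np.array([50.0, 80.0, 110.0, 150.0, 200.0])   # made-up house sizes
price = np.array([100.0, 160.0, 210.0, 290.0, 400.0])         # made-up prices in thousands
print(gradient_descent(square_meters / 100, price))           # prints the fitted intercept (b0) and slope (b1)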

Learning Parameter
The learning parameter is the α parameter we saw in the gradient descent algorithm. The most important question at this point is by what criteria we choose the value of this parameter. First, let's see how we can make sure that our algorithm works correctly. We need to plot the cost function F against the number of algorithm iterations; as the number of iterations grows, we expect the cost function to follow a descending route.
On the contrary, if we have a graph like the one below then the algorithm will
not work right. This could be caused by the value of the learning parameter.
In the algorithm graph, the learning parameter defines how large the step will
be. If the value is very small then the algorithm will need a lot of time to find
a minimum (see image below):
On the contrary, if it is too large, it is possible to overshoot the minimum and even start moving to higher values of the cost function (see image below):
Unfortunately, there is no rule for choosing the learning parameter. The only way is through testing, by carefully watching the graph of the cost function against the number of iterations and making sure it stays on a descending route.
Overfitting and Regularization
Overfitting
Previously we examined linear regression. As we saw, the produced model tries to match the data as closely as possible. There are three possible scenarios for our model:
1. The model doesn't correspond well to the data, and we have underfitting
2. The model corresponds well to the data and generalizes correctly, i.e. it correctly classifies new samples
3. The model approaches the data perfectly but cannot generalize
The third scenario is known as overfitting. The model is overtrained: it produces perfect results on the training set, but it cannot generalize and produce equally good results on new data. If we have many features but the number of records in the dataset is small, then we are likely to have an overfitting issue.
There are two solutions for dealing with this issue. The first solution is to
reduce the number of features, either by manually choosing the features we
will use or by using a selection algorithm. The second solution is to make a
model regularization. What this means is that we keep all features, but reduce
the corresponding βj parameter, i.e. the importance this particular feature has
during training and model creation. Regularization gives good results when
each of these features contributes a little.

Model Regularization
The basic idea of model regularization is that small values of the β1, β2, …, βn parameters lead to simpler hypotheses, thus reducing the chances of overfitting. In the case of linear regression, we just need to add an additional term to the cost function:

F(β) = (1/2m) · [ Σ ( hβ(x(i)) − y(i) )²  +  λ · Σj βj² ]

Essentially, the additional term pushes the βj parameters to be smaller so that the overall value of the function is smaller. The regularization parameter λ regulates how closely the model will approach the data and what the order of magnitude of the βj parameters will be, so that we can avoid overfitting. If λ takes very high values (e.g. λ = 10^10), though, the βj parameters become so small that they tend to 0, leading to underfitting.

Linear Regression with Regularization


We will now examine the implementation of basic functions of the linear
regression models with regularization.
The arguments of the function are:

X: the dataset, i.e. all samples, without the goal variable
theta: the βj parameters
alpha: the learning parameter α
lambda: the regularization parameter
num_iters: the number of iterations that the gradient descent algorithm will make
The computeCost function is basically the cost function F, and is created as
follows:
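The original listing is not reproduced here, so the following is only a minimal sketch of what a computeCost function and the matching gradient descent loop could look like; the goal variable y, the assumption that X carries a leading column of ones for β0, and the name lambda_ (lambda is a reserved word in Python) are choices of this sketch, not the book's code.

import numpy as np

def computeCost(X, y, theta, lambda_):
    """Cost function F: squared error plus the lambda * sum(theta_j^2) term (j >= 1)."""
    m = len(y)
    error = X.dot(theta) - y
    reg = lambda_ * np.sum(theta[1:] ** 2)        # beta_0 (the intercept) is not regularized
    return (np.sum(error ** 2) + reg) / (2 * m)

def gradientDescent(X, y, theta, alpha, lambda_, num_iters):
    """Apply the simultaneous update of all beta_j for num_iters iterations."""
    m = len(y)
    for _ in range(num_iters):
        error = X.dot(theta) - y
        grad = X.T.dot(error) / m
        grad[1:] += (lambda_ / m) * theta[1:]     # regularization contribution for j >= 1
        theta = theta - alpha * grad
    return theta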
We should note that data (features and goal variable) should be numerical in
order to execute the above code without any issues.
Chapter 11 - Data Cleaning and Preparation
The next topic that we need to take a look at in our process of data science is
known as data cleaning and preparation. During the course of doing our own
data analysis and modeling, a lot of time is going to be spent on preparing the
data before it even enters into the model that we want to use. The process of
data preparation is going to include a lot of different tasks, including loading,
cleaning, transforming, and rearranging the data. These tasks are so important and take up so much of our time that an analyst is likely to spend at least 80 percent of their time on them.
Sometimes the way that we see the data stored in a database or a file is not
going to provide us with the right format when we work with a particular
task. Many researchers find that it is easier to do ad hoc processing of the
data, taking it from one form to another working with some programming
language. The most common programming languages to use to make this
happen include Perl, R, Python, or Java.
The good news here though is that the Pandas library that we talked about
before, along with the features it gets from Python, can provide us with
everything that we need. It has the right tools that are fast, flexible, and high-
level that will enable us to get the data manipulated into the form that is most
needed at that time. There are a few steps that we can work through in order to clean the data and get it all prepared, and we will walk through them below.

What Is Data Preparation?


Let's suppose that you are going through the log files of a website and analyzing them, hoping to find out which IP addresses the spammers are coming from. Or you can use this to figure out which demographic on the website is leading to more sales. To answer these
questions or more, an analysis has to be performed on the data with two
important columns. These are going to include the number of hits that have
been made to the website, and the IP address of the hit.
As we can imagine here, the log files that you are analyzing are not going to
be structured, and they could contain a lot of textual information that is
unstructured. To keep this simple, preparing the log file to extract the data in
the format that you require in order to analyze it can be the process known as
data preparation.
Data preparation is a big part of the whole data science process. According to CrowdFlower, a provider of data enrichment platforms that data scientists can work with, a survey of about 80 data scientists found that they spend their time in the following way:

1. 60 percent of their time is spent on organizing and then cleaning the data they have collected.
2. 19 percent is spent on collecting the sets of data that they want to use.
3. 9 percent is used to mine the data that they have collected and prepared in order to draw the necessary patterns.
4. 3 percent of their time will be spent doing any of the necessary training for the sets of data.
5. 4 percent of the time is going to be spent trying to refine the algorithms that were created and working on getting them better at their jobs.
6. 5 percent of the time is spent on some of the other tasks that are needed for this job.

The statistics from the survey above show that most of a data scientist's time goes into preparing the data, which means they have to spend a good deal of time organizing, cleaning, and collecting before they are even able to start the process of analyzing the data. There are valuable tasks in data science, like data visualization and data exploration, but the least enjoyable part of the process tends to be the data preparation.
The amount of time that you actually will spend on preparing the data for a
specific problem with the analysis is going to directly depend on the health of
the data. If there are a lot of errors, missing parts, and duplicate values, then
this is a process that will take a lot longer. But if the data is well-organized
and doesn’t need a lot of fixing, then the data preparation process is not going
to take that long at all.

Why Do I Need Data Preparation?


One question that a lot of people have when it is time to work on the process
of data preparation is why they need to do it in the first place. It may seem to
someone who is just getting started in this field that collecting the data and
getting it all as organized as possible would be the best steps to take, and then
they can go on to making their own model. But there are a few different
reasons why data preparation will be so important to this process and they
will include the following:

1. The set of data that you are working with could contain a
few discrepancies in the codes or the names that you are
using.
2. The set of data that you are working with could contain a lot
of outliers or some errors that mess with the results.
3. The set of data that you are working with could lack the attributes of interest that you need for the analysis.
4. The set of data that you want to explore could have plenty of quantity but little quality. These are not the same things, and quality is usually the more important of the two.

Each of these things has the potential to really mess up the model that you are
working on and could get you results or predictions that are not as accurate as
you would like. Taking the time to prepare your data and get it clean and
ready to go can solve this issue, and will ensure that your data is going to be
more than ready to use in no time.

What Are the Steps for Data Preparation?


At this point, we need to take some time to look at some of the steps that are
needed to handle the data preparation for data mining. The first step is to
clean the data. This is one of the first and most important steps to handling
the data and getting it prepared. We need to go through and correct any of the
data that is inconsistent by filling out some of the values that are missing and
then smoothing out the outliers and any data that is making a lot of noise and
influencing the analysis in a negative manner.
There is the possibility that we end up with many rows in our set of data that
do not have a value for the attributes of interest, or they could be inconsistent
data that is there as well. In some cases, there are records that have been
duplicated or some other random error that shows up. We need to tackle all of
these issues with the data quality as quickly as possible in order to get a
model at the end that provides us with an honest and reliable prediction.
There are a few methods that we can use to handle the missing values. The method chosen depends on the requirement: we can ignore the tuple, or fill in the missing values with the mean value of the attribute, with a global constant, or with the help of other Python machine learning techniques such as a Bayesian formula or a decision tree.
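As a small, hedged illustration of the mean-value approach with Pandas (the toy DataFrame and its column names are invented for this example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, np.nan, 40],
                   'city': ['NY', 'LA', 'NY', 'SF', 'LA']})

# fill the missing ages with the mean value of the attribute
df['age'] = df['age'].fillna(df['age'].mean())
print(df)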
We can also take some time to tackle the noisy data when needed. It is
possible to handle this in a manual manner. Or there are several techniques of
clustering or regression that can help us to handle this as well. You have to
choose the one that is needed based on the data that you have.
The second step that we need to focus on here is going to be known as data
integration. This step is going to involve a few things like integrating the
schema, resolving some of the conflicts of the data if any show up, and even
handling any of the redundancies that show up in the data that you are using.
Next on the list is going to be the idea of data transformation. This step is
going to be important because it will take the time to handle some of the
noise that is found in your data. This step is going to help us to take out that
noise from the data so it will not cause the analysis you have to go wrong.
We can also see the steps of normalization, aggregation, and generalization
showing up in this step as well.
We can then move on to the fourth step, which is going to be all about
reducing the data. The data warehouse that you are using might be able to
contain petabytes of data, and running an analysis on this complete set of data
could take up a lot of time and may not be necessary for the goals that you
want to get in the end with your model.
In this step, it is the responsibility of the data scientist to obtain a reduced
representation of their set of data. We want this set to be smaller in size than
some of the others, but inclusive enough that it will provide us with some of
the same analysis outcomes that we want. This can be hard when we have a
very large set of data, but there are a few reduction strategies for the data that
we can apply. Some of these include numerosity reduction, data cube aggregation, and dimensionality reduction, among others, based on the requirements that you have.
And finally, the fifth step of this is going to be known as data discretization.
The set of data that you are working with will contain three types of
attributes. These three attributes are going to include continuous, nominal,
and ordinal. Some of the algorithms that you will choose to work with only
handle the attributes that are categorical.
This step of data discretization can help someone in data science divide
continuous attributes into intervals, and can also help reduce the size of the
data. This helps us to prepare it for analysis. Take your time with this one to
make sure that it all matches up and does some of the things that you are
expecting.
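For instance, here is a hedged sketch of discretization with Pandas; the pd.cut call, the bin edges, and the labels are illustrative choices rather than something prescribed by the text.

import pandas as pd

ages = pd.Series([3, 17, 25, 42, 58, 71, 88])

# divide the continuous attribute into labelled intervals (bins)
age_groups = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                    labels=['child', 'young adult', 'adult', 'senior'])
print(age_groups.value_counts())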
Many of the methods and techniques that you can use in this part of the process are strong and can get a lot of the work done for you. But even with all of these tools, data preparation is still considered an area of research, one that many scientists are going to explore more, hopefully coming up with some new strategies and techniques that you can use to get it done.

Handling the Missing Data


It is common for data to become missing in many applications of data
analysis. One of the goals of working with the Pandas library here is that we
want to make working with some of this missing data as easy and as painless
as possible. For example, all of the descriptive statistics that happen on the
objects of Pandas exclude the missing data by default.
The way that this data is going to be represented in Pandas is going to have
some problems, but it can be really useful for many of the users who decide
to go with this kind of library. For some of the numeric data that we may
have to work with, the Pandas library is going to work with a floating-point
value that is known as NaN, or not a number, to represent the data that is
missing inside of our set of data.
In the Pandas library, we have adopted a convention that is used in the
programming language of R in order to refer to the missing data. This
missing data is going to show up as NA, which means not available right
now. In the applications of statistics, NA data can either be data that doesn’t
exist at all, or that exists, but we are not going to be able to observe through
problems with collecting the data. When cleaning up the data to be analyzed,
it is often important to do some of the analysis on the missing data itself to
help identify the collection of the data and any problems or potential biases in
the data that has been caused by the missing data.
There are also times when the data is going to have duplicates. When you get
information online or from other sets of data, it is possible that some of the
results will be duplicated. If this happens often, then there is going to be a
mess with the insights and predictions that you get. The data is going to lean
towards the duplicates, and it will not work the way that you would like.
There are ways to work with the Pandas library to improve this and make sure that the duplicates are eliminated, or at least limited.
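A brief, hedged sketch of both ideas with Pandas follows; the toy log-style DataFrame is invented for the example.

import numpy as np
import pandas as pd

log = pd.DataFrame({'ip': ['10.0.0.1', '10.0.0.2', '10.0.0.2', np.nan],
                    'hits': [120, 45, 45, 30]})

print(log.isnull().sum())               # count the missing (NA) values per column
cleaned = log.dropna()                  # drop the rows that contain missing values
deduped = cleaned.drop_duplicates()     # remove exact duplicate rows
print(deduped)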
There is so much that we are able to do when it comes to working with data
preparation in order to complete the process of data mining and getting the
results that we want in no time with our analysis. Make sure to take some
time on this part, as it can really make or break the system that we are trying
to create. If you do spend enough time on it, and ensure that the data is as
organized and clean as possible, you are going to be happy with the results
and ready to take on the rest of the process.
Chapter 12 - Introduction to Numpy
Now that you know the basics of loading and preprocessing data with the
help of pandas, we can move on to data processing with NumPy. The purpose
of this stage is to have a data matrix ready for the next stage, which involves
supervised and unsupervised machine learning mechanisms. NumPy data
structure comes in the form of ndarray objects, and this is what you will later
feed into the machine learning process. For now, we will start by creating
such an object to better understand this phase.

The n-dimensional Array


You can build complex data structures with Python lists because they are powerful at storing data; however, they're not great at operating on that data. They
aren’t optimal when it comes to processing power and speed, which are
critical when working with complex algorithms. This is why we’re using
NumPy and its ndarray object, which stands for an “n-dimensional array”.
Let’s look at the properties of a NumPy array:

1. It is optimal and fast at transferring data. When you work with complex data, you want the memory to handle it efficiently instead of being bottlenecked.
2. You can perform vectorization. In other words, you can make linear algebra computations and specific element operations without being forced to use “for” loops. This is a large plus for NumPy because Python “for” loops cost a lot of resources, making it really expensive to work with a large number of loops instead of ndarrays.
3. In data science operations you will have to use tools, or libraries, such as SciPy and Scikit-learn. You can't use them without arrays because they are required as an input, otherwise functions won't perform as intended.
With that being said, here are a few methods of creating a ndarray:

1. Take an already existing data structure and turn it into an array.
2. Build the array from the start and add in the values later.
3. You can also upload data to an array even when it's stored on a disk.
Converting a list to a one-dimensional array is a fairly common operation in
data science processes. Keep in mind that you have to take into account the
type of objects such a list contains. This will have an impact on the
dimensionality of the result. Here’s an example of this with a list that
contains only integers:
In: import numpy as np
int_list = [1,2,3]
Array_1 = np.array(int_list)
In: Array_1
Out: array([1, 2, 3])
You can access the array just like you access a list in Python. You simply use
indexing, and just like in Python, it starts from 0. This is how this operation
would look:
In: Array_1[1]
Out: 2
Now you can gain more data about the objects inside the array like so:
In: type(Array_1)
Out: numpy.ndarray
In: Array_1.dtype
Out: dtype('int64')
The result of the dtype is related to the type of operating system you're running. In this example, we're using a 64-bit operating system.
At the end of this exercise, our basic list is transformed into a uni-
dimensional array. But what happens if we have a list that contains more than
just one type of element? Let’s say we have integers, strings, and floats. Let’s
see an example of this:
In: import numpy as np
composite_list = [1,2,3] + [1.,2.,3.] + ['a','b','c']
Array_2 = np.array(composite_list[:3])  # here we have only integers
print('composite_list[:3]', Array_2.dtype)
Array_2 = np.array(composite_list[:6])  # now we have integers and floats
print('composite_list[:6]', Array_2.dtype)
Array_2 = np.array(composite_list)  # strings have been added to the array
print('composite_list[:]', Array_2.dtype)
Out:
composite_list[:3] int64
composite_list[:6] float64
composite_list[:] <U32
As you can see, we have a “composite_list” that contains integers, floats, and strings. It's important to understand that when we mix types like this, NumPy upcasts every element to the smallest common type: the integers become floats once floats are present, and everything becomes a string (here <U32) once strings are added, which is exactly what the dtype output above shows.
Next, let’s see how we can load an array from a file. N-dimensional arrays
can be created from the data contained inside a file. Here’s an example in
code:
In: import numpy as np
cars = np.loadtxt('regression-datasets-cars.csv', delimiter=',', dtype=float)
In this example, we tell our tool to create an array from a file with the help of
the “loadtxt” method by giving it a filename, delimiter, and a data type.

Package Installations
To get started with NumPy, we have to install the package into our version of
Python. While the basic method for installing packages to Python is the pip
install method, we will be using the conda install method. This is the
recommended way of managing all Python packages and virtual
environments using the anaconda framework.
Since we installed a recent version of Anaconda, most of the packages we
need would have been included in the distribution. To verify if any package
is installed, you can use the conda list command via the anaconda prompt.
This displays all the packages currently installed and accessible via anaconda.
If your intended package is not available, then you can install via this
method:
First, ensure you have an internet connection. This is required to download
the target package via conda. Open the anaconda prompt, then enter the
following code:
conda install package
Note: In the code above, ‘package’ is what needs to be installed e.g. NumPy,
Pandas, etc.
As described earlier, we would be working with NumPy arrays. In
programming, an array is an ordered collection of similar items. Sounds
familiar? Yeah, they are just like Python lists, but with superpowers. NumPy
arrays are in two forms: Vectors, and Matrices. They are mostly the same,
only that vectors are one-dimensional arrays (either a column or a row of
ordered items), while a matrix is 2-dimensional (rows and columns). These
are the fundamental blocks of most operations we would be doing with
NumPy. While arrays incorporate most of the operations possible with
Python lists, we would be introducing some newer methods for creating, and
manipulating them.
To begin using the NumPy methods, we have to first import the package into
our current workspace. This can be achieved in two ways:
import numpy as np
Or

from numpy import *


In Jupyter notebook, enter either of the codes above to import the NumPy
package. The first method of import is recommended, especially for
beginners, as it helps to keep track of the specific package a called
function/method is from. This is due to the variable assignment e.g. ‘np’,
which refers to the imported package throughout the coding session.
Notice the use of an asterisk in the second import method. This signifies
‘everything/all’ in programming. Hence, the code reads ‘from NumPy import
everything!!’
Tip: In Python, we would be required to reference the package we are
operating with e.g. NumPy, Pandas, etc. It is easier to assign them variable
names that can be used in further operations. This is significantly useful in a
case where there are multiple packages being used, and the use of standard
variable names such as: ‘np’ for NumPy, ‘pd’ for Pandas, etc. makes the code
more readable.

Example: Creating vectors and matrices from Python lists.


Let us declare a Python list.
In []: # This is a list of integers
Int_list = [1,2,3,4,5]
Int_list

Out[]: [1,2,3,4,5]
Importing the NumPy package and creating an array of integers.
In []: # import syntax
import numpy as np
np.array(Int_list)

Out[]: array([1, 2, 3, 4, 5])


Notice the difference in the outputs? The second output indicates that we
have created an array, and we can easily assign this array to a variable for
future reference.
To confirm, we can check for the type.
In []: x = np.array(Int_list)
type(x)
Out[]: numpy.ndarray
We have created a vector, because it has one dimension (1 row). To check
this, the ‘ndim’ method can be used.
In []: x.ndim # this shows how many dimensions the array has
Out[]: 1
Alternatively, the shape method can be used to see the arrangements.
In []: x.shape # this shows the shape

Out[]: (5,)
Python describes matrices as (rows, columns). In this case, it describes a
vector as (number of elements, ).
To create a matrix from a Python list, we need to pass a nested list containing
the elements we need. Remember, matrices are rectangular, and so each list
in the nested list must have the same size.
In []: # This is a matrix

x = [1,2,3]
y = [4,5,6]

my_list = [y,x] # nested list

my_matrix = np.array(my_list) # creating the matrix

A = my_matrix.ndim
B = my_matrix.shape

# Printing
print('Resulting matrix:\n\n',my_matrix,'\n\nDimensions:',A,
'\nshape (rows,columns):',B)

Out[]: Resulting matrix:


[[4 5 6]
[1 2 3]]

Dimensions: 2
shape (rows,columns): (2, 3)
Now, we have created a 2 by 3 matrix. Notice how the shape method displays
the rows and columns of the matrix. To find the transpose of this matrix i.e.
change the rows to columns, use the transpose () method.
In []: # this finds the transpose of the matrix
t_matrix = my_matrix.transpose()
t_matrix
Out[]: array([[4, 1],
[5, 2],
[6, 3]])
Tip: Another way of knowing the number of dimensions of an array is by
counting the square-brackets that opens and closes the array (immediately
after the parenthesis). In the vector example, notice that the array was
enclosed in single square brackets. In the two-dimensional array example,
however, there are two brackets. Also, tuples can be used in place of lists for
creating arrays.
There are other methods of creating arrays in Python, and they may be more
intuitive than using lists in some applications. One quick method uses the
arange() function.
Syntax: np.arange(start value, stop value, step size, dtype = ‘type’)
In this case, we do not need to pass its output to the list function, our result is
an array object of a data type specified by ‘dtype’.
Example: Creating arrays with the arange() function.
We will create an array of numbers from 0 to 10, with an increment of 2
(even numbers).
In []: # Array of even numbers between 0 and 10
Even_array = np.arange(0,11,2)
Even_array

Out[]: array([ 0, 2, 4, 6, 8, 10])


Notice it behaves like the range() method from our list examples. It returned all even values between 0 and 11 (10 being the maximum). Here, we did not specify the types of the elements.
Tip: Recall, the arange method returns values up to, but not including, the stop value; hence, even if we change the 11 to 12, we would still get 10 as the maximum.
Since the elements are numeric, they can either be integers or floats. Integers
are the default, however, to return the values as floats, we can also specify the
numeric type.
In []: Even_array2 = np.arange(0,11,2, dtype='float')
Even_array2

Out[]: array([ 0., 2., 4., 6., 8., 10.])


Another handy function for creating arrays is linspace(). This returns a
numeric array of linearly space values within an interval. It also allows for
the specification of the required number of points, and it has the following
syntax:
np.linspace(start value, end value, number of points)

At default, linspace returns an array of 50 evenly spaced points within the


defined interval.

Example: Creating arrays of evenly spaced points with linspace()


In []: # Arrays of linearly spaced points
A = np.linspace(0,5,5) # 5 equal points between 0 & 5
B = np.linspace(51,100) # 50 equal points between 51 & 100
print('Here are the arrays:\n')
A
B
Here are the arrays:
Out[ ]: array([0. , 1.25, 2.5 , 3.75, 5. ])
Out[ ]: array([ 51., 52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63.,
64., 65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77., 78.,
79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90., 91., 92., 93.,
94., 95., 96., 97., 98., 99., 100.])
Notice how the second use of linspace did not require a third argument. This is because we wanted 50 equally spaced values, which is the default. The 'dtype' can also be specified, just like we did with the arange function.

Tip 1: Linspace arrays are particularly useful in plots. They can be used to
create a time axis or any other required axis for producing well defined and
scaled graphs.
Tip 2: The output format in the example above is not the default way of displaying output in a Jupyter notebook. By default, Jupyter displays only the last result per cell. To display multiple results (without having to use the print statement every time), the output behaviour can be changed using the following code.
In[]: # Allowing Jupyter output all results per cell.
# run the following code in a Jupyter cell.

from IPython.core.interactiveshell import InteractiveShell


InteractiveShell.ast_node_interactivity = "all"
There are times when a programmer needs unique arrays like the identity
matrix, or a matrix of ones/zeros. NumPy provides a convenient way of
creating these with the zeros(), ones() and eye() functions.

Example: creating arrays with unique elements.


Let us use the zeros () function to create a vector and a matrix.
In []: np.zeros(3) # A vector of 3 elements
np.zeros((2,3)) # A matrix of 6 elements i.e. 2 rows, 3 columns
Out[]: array([0., 0., 0.])
Out[]: array([[0., 0., 0.],
[0., 0., 0.]])
Notice how the second output is a two-dimensional array, i.e. two square brackets (a matrix of 2 rows and 3 columns as specified in the code).
The same thing goes for creating a vector or matrix with all elements having
a value of ‘1’.
In []: np.ones(3) # A vector of 3 elements
np.ones((2,3)) # A matrix of 6 elements i.e. 2 rows, 3 columns
Out[]: array([1., 1., 1.])
Out[]: array([[1., 1., 1.],
[1., 1., 1.]])
Also, notice how the code for creating the matrices requires the row and
column instructions to be passed as a tuple. This is because the function
accepts one input, so multiple inputs would need to be passed as tuples or
lists in the required order (Tuples are recommended. Recall, they are faster to
operate.).
In the case of the identity matrix, the function eye () only requires one value.
Since identity matrices are always square, the value passed determines the
number of rows and columns.
In []: np.eye(2) # A matrix of 4 elements 2 rows, 2 columns
np.eye(3) # 3 rows, 3 columns

Out[]: array([[1., 0.],


[0., 1.]])
Out[]: array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
NumPy also features random number generators. These can be used for
creating arrays, as well as single values, depending on the required
application. To access the random number generator, we call the library via
np.random, and then choose the random method we prefer. We will consider
three methods for generating random numbers: rand(), randn(), and
randint().

Example: Generating arrays with random values.


Let us start with the rand () method. This generates random, uniformly
distributed numbers between 0 and 1.
In []: np.random.rand (2) # A vector of 2 random values
np.random.rand (2,3) # A matrix of 6 random values

Out[]: array([0.01562571, 0.54649508])


Out[]: array([[0.22445055, 0.35909056, 0.53403529],
[0.70449515, 0.96560456, 0.79583743]])
Notice how each value within the arrays are between 0 & 1. You can try this
on your own and observe the returned values. Since it is a random generation,
these values may be different from yours. Also, in the case of the random
number generators, the matrix specifications are not required to be passed as
lists or tuples, as observed in the second line of code.
The randn () method generates random numbers from the standard normal or
Gaussian distribution. You might want to brush up on some basics in
statistics, however, this just implies that the values returned would have a
tendency towards the mean (which is zero in this case) i.e. the values would
be centered around zero.
In []: np.random.randn (2) # A vector of 2 random values
np.random.randn (2,3) # A matrix of 6 random values

Out[]: array([ 0.73197866, -0.31538023])


Out[]: array([[-0.79848228, -0.7176693 , 0.74770505],
[-2.10234448, 0.10995745, -0.54636425]])
The randint() method generates random integers within a specified range or
interval. Note that the higher range value is exclusive (i.e. has no chance of
being randomly selected), while the lower value is inclusive (could be
included in the random selection).
Syntax: np.random.randint(lower value, higher value, number of values, dtype)
If the number of values is not specified, Python just returns a single value
within the defined range.
In []: np.random.randint (1,5) # A random value between 1 and 5
np.random.randint (1,100,6) # A vector of 6 random values
np.random.randint (1,100,(2,3)) # A matrix of 6 random values
Out[]: 4
Out[]: array([74, 42, 92, 10, 76, 43])
Out[]: array([[92, 9, 99],
[73, 36, 93]])

Tip: Notice how the size parameter for the third line was specified using a
tuple. This is how to create a matrix of random integers using randint.

Example: Illustrating randint().


Let us create a fun dice roll program using the randint() method. We would
allow two dice, and the function will return an output based on the random
values generated in the roll.
In []: # creating a dice roll game with randint()
# Defining the function
def roll_dice():
    """ This function displays a
    dice roll value when called"""
    dice1 = np.random.randint(1,7) # This allows 6 to be inclusive
    dice2 = np.random.randint(1,7)

    # Display Condition.
    if dice1 == dice2:
        print('Roll: ',dice1,'&',dice2,'\ndoubles !')
        if dice1 == 1:
            print('snake eyes!\n')
    else:
        print('Roll: ',dice1,'&',dice2)

In []: # Calling the function


roll_dice()

Out[]: Roll: 1 & 1


doubles !
snake eyes!
Hint: Think of a fun and useful program to illustrate the use of these random
number generators, and writing such programs will improve your chances of
comprehension. Also, a quick review of statistics, especially measures of
central tendency & dispersion/spread will be useful in your data science
journey.
Chapter 13 - Manipulating Array
Now that we have learned how to declare arrays, we would be proceeding
with some methods for modifying these arrays. First, we will consider the
reshape () method, which is used for changing the dimensions of an array.
Example: Using the reshape() method.
Let us declare a few arrays and call the reshape method to change their
dimensions.
In []: freq = np.arange(10);values = np.random.randn(10)
freq; values

Out[]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Out[]: array([ 1.33534821, 1.73863505, 0.1982571 , -0.47513784,


1.80118596, -1.73710743,
-0.24994721, 1.41695744, -0.28384007, 0.58446065])
Using the reshape method, we would make ‘freq’ and ‘values’ 2
dimensional.
In []: np.reshape(freq,(5,2))

Out[]: array([[0, 1],


[2, 3],
[4, 5],
[6, 7],
[8, 9]])

In []: np.reshape(values,(2,5))

Out[]: array([[ 1.33534821, 1.73863505, 0.1982571 ,


-0.47513784, 1.80118596],
[-1.73710743, -0.24994721, 1.41695744, -0.28384007, 0.58446065]])
Even though the values array still looks similar after reshaping, notice the
two square brackets that indicate it has been changed to a matrix. The reshape
method comes in handy when we need to do array operations, and our arrays
are inconsistent in dimensions. It is also important to ensure the new size
parameter passed to the reshape method does not differ from the number of
elements in the original array. The idea is simple: when calling the reshape
method, the product of the size parameters must equal the number of
elements in the original array.
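As a quick illustrative check (not part of the original text), asking for a shape whose product does not match the number of elements fails; the 10-element 'freq' array cannot become a 3 by 4 matrix because 3 * 4 = 12.

In []: np.reshape(freq,(3,4))  # 3 * 4 = 12, but freq has only 10 elements

ValueError: cannot reshape array of size 10 into shape (3,4)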
We often need the maximum and minimum values within an array (or within real-world data), and possibly the index of those maximum or minimum values. To get this information, we can use the .max(), .min(), .argmax() and .argmin() methods respectively.

Example:
Let us find the maximum and minimum values in the ‘values’ array, along
with the index of the minimum and maximum within the array.
In []: A = values.max();B = values.min();
C = values.argmax()+1; D = values.argmin()+1

print('Maximum value: {}\nMinimum Value: {}\


\nItem {} is the maximum value, while item {}\
is the minimum value'.format(A,B,C,D))
Output
Maximum value: 1.8011859577930067
Minimum Value: -1.7371074259180737
Item 5 is the maximum value, while item 6 is the minimum value
A few things to note in the code above: The variables C&D, which defines
the position of the maximum and minimum values are evaluated as shown
[by adding 1 to the index of the maximum and minimum values obtained via
argmax () and argmin ()], because Python indexing starts at zero. Python
would index maximum value at 4, and minimum at 5, which is not the actual
positions of these elements within the array (you are less likely to start
counting elements in a list from zero! Unless you are Python, of course.).
Another observation can be made in the code. The print statement is broken
across a few lines using enter. To allow Python to know that the next line of
code is a continuation, the backslash ‘\’ is used. Another way would be to use
three quotes for a multiline string.

Conditional Selection
Similar to how conditional selection works with NumPy arrays, we can select elements from a data frame that satisfy a Boolean criterion.
Example: Let us grab sections of the data frame ‘Arr_df’ where the value is >
5.
In []: # Grab elements greater than five

Arr_df[Arr_df>5]

Output:

odd1 even1 odd2 even2 Odd sum Even sum

A NaN NaN NaN NaN NaN 6

B NaN 6.0 7.0 8.0 12.0 14

C 9.0 10.0 11.0 12.0 20.0 22

D 13.0 14.0 15.0 16.0 28.0 30

E 17.0 18.0 19.0 20.0 36.0 38

Notice how the instances of values less than 5 are represented with a ‘NaN’.
Another way to use this conditional formatting is to format based on column
specifications.
You could remove entire rows of data, by specifying a Boolean condition
based off a single column. Assuming we want to return the Arr_df data frame
without the row ‘C’. We can specify a condition to return values where the
elements of column ‘odd1’ are not equal to ‘9’ (since row C contains 9 under
column ‘odd1’).
In []: # removing row C through the first column
Arr_df[Arr_df['odd1']!= 9]
Output:

odd1 even1 odd2 even2 Odd sum Even sum

A 1 2 3 4 4 6

B 5 6 7 8 12 14

D 13 14 15 16 28 30

E 17 18 19 20 36 38

Notice that row ‘C’ has been filtered out. This can be achieved through a
smart conditional statement through any of the columns.
In []: # does the same thing : remove row ‘C’
# Arr_df[Arr_df['even2']!= 12]
In[]: # Let us remove rows D and E through 'even2'
Arr_df[Arr_df['even2']<= 12]

Output

odd1 even1 odd2 even2 Odd sum Even sum

A 1 2 3 4 4 6
B 5 6 7 8 12 14

C 9 10 11 12 20 22

Exercise: Remove rows C, D, E via the ‘Even sum’ column. Also, try out
other such operations as you may prefer.

To combine conditional selection statements, we can use the ‘logical and, i.e.
&’, and the ‘logical or, i.e. |’ for nesting multiple conditions. The regular
‘and’ and ‘or’ operators would not work in this case as they are used for
comparing single elements. Here, we will be comparing a series of elements
that evaluate to true or false, and those generic operators find such operations
ambiguous.
Example: Let us select elements that meet the criteria of being greater than 1
in the first column, and less than 22 in the last column. Remember, the ‘and
statement’ only evaluates to true if both conditions are true.
In []:Arr_df[(Arr_df['odd1']>1) & (Arr_df['Even sum']<22)]
Output:

odd1 even1 odd2 even2 Odd sum Even sum

B 5 6 7 8 12 14
Only the elements in Row ‘B’ meet this criterion, and were returned in the
data frame.
This approach can be expounded upon to create even more powerful data
frame filters.
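As a final hedged sketch built on the same Arr_df used above, the 'logical or' keeps a row when either condition holds:

In []: # keep rows where 'odd1' equals 1 OR 'Even sum' is greater than 30
Arr_df[(Arr_df['odd1'] == 1) | (Arr_df['Even sum'] > 30)]
Output:

odd1 even1 odd2 even2 Odd sum Even sum

A 1 2 3 4 4 6

E 17 18 19 20 36 38

Row A satisfies the first condition and row E the second, so both are returned.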
Chapter 14 - Python Debugging
Like most computer programming languages, Python relies on debugging processes to produce dependable programs. A debugger lets you run an application under its control and pause it at different breakpoints, and Python also offers an interactive source code debugger that supports this kind of program control. Other activities that go hand in hand with debugging in Python are unit and integration testing, analysis of log files and log flows, and system-level monitoring.
Running a program within a debugger can involve several tools, working either from the command line or inside an IDE. The development of more sophisticated computer programs has significantly contributed to the expansion of debugging tools. These tools offer various methods for detecting abnormalities in Python programs, evaluating their impact, and planning the updates and patches that correct emerging problems. In some cases, debugging tools also help programmers during the development of new programs by eliminating code and Unicode faults early.

Debugging
Debugging is the technique of detecting and providing solutions to defects or problems within a computer program. The term 'debugging' is often credited to Admiral Grace Hopper who, while working on the Mark II computer at Harvard University in the 1940s, discovered a moth stuck between relays that was hindering the computer's operation; removing it was described as 'debugging' the system. Although Thomas Edison had already used the term 'bug' in 1878, debugging became popular in the early 1950s as programmers adopted it to refer to fixing computer programs.
By the 1960s, debugging was in wide use among computer users and was the most common term for describing solutions to major computing problems. As the world has become more digitalized, with ever more challenging programs, the scope of debugging has grown, and blunt words like errors, bugs, and defects are sometimes replaced with more neutral ones such as anomaly or discrepancy. The neutral terms are themselves assessed to determine whether they describe computing problems in a cost-effective way or whether further changes should be made; the aim is a practical term that defines computer problems, retains the meaning, and prevents end-users from denying the acceptability of faults.

Anti-Debugging
Anti-debugging is the opposite of debugging and encompasses the implementation of different techniques to prevent debugging or reverse engineering of computer code. It is primarily used by developers, for example in copy-protection schemes, and by malware that wants to identify and block debuggers. Anti-debugging therefore works against debugger tools, preventing the detection and removal of the errors that occasionally appear during Python programming. Some of the conventional techniques used are:

API-based
Exception-based
Modified code
Determining and penalizing debugger
Hardware-and register-based
Timing and latency

Concepts of Python Debugging


Current Line
The current line reflects the notion that a computer does only one thing at any given time when running a program. The flow of code is controlled from one point to another, with execution happening on the current line before moving to the next one down the screen. In Python programming, the current line can only be redirected with constructs such as loops, IF statements, and function calls, among others. It is also not necessary to begin stepping from the first line; you can use breakpoints to decide where to start and what to skip.
Breakpoints
When you run a Python program, the code usually starts executing from the first line and runs continuously until it either succeeds or hits an error. However, a bug may live in a specific function or section of the program that is not reached right away, so the problem may only become visible long after the program has started. At this point breakpoints become useful, as they readily stop execution at the right place. A breakpoint tells the debugger where the problem area is, halts program execution there, and lets you make the necessary corrections. This concept therefore enables you to produce solid Python programs within a short time.
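As a small illustration (a sketch, not part of the original text), a breakpoint can be planted directly in Python code with the built-in breakpoint() function (Python 3.7 and later) or the older pdb.set_trace() call; the average function below is just an example.

import pdb

def average(values):
    total = sum(values)
    breakpoint()              # execution pauses here and the (Pdb) prompt appears
    # pdb.set_trace()         # the equivalent older form
    return total / len(values)

average([4, 8, 15, 16])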

Stepping
Stepping is another concept that works with debugging tools to make programs more dependable. Stepping through a Python program means moving through the code line by line to find the lines with defects, as well as any other mistakes that need attention. Stepping comes in three forms: step in, step over, and step out. Step in executes the next line and, if it contains a function call, jumps into that function so you can debug it. Step over executes the next line of the current function without entering the functions it calls. Step out runs the remainder of the current function and stops again once it has returned to its caller.

Continuous Program Execution


There are cases where you simply want the program to keep running on its own. The continue command gives control back to the program, which resumes execution until it finishes or hits another breakpoint. The exact resume button or command may vary between operating systems and programming packages, but the behaviour is similar everywhere, which makes Python debugging adaptable for different end-users and developers.

Exiting the Debugging Tool


The primary purpose of acquiring a debugger tool is to identify and eliminate problems. After using the debugger to detect an error or problem within the program or its code, correction of the problem follows. The typical sequence is to fix the failure by rewriting the offending code, stop the debugging process, insert a breakpoint on the fixed line, and launch the debugger again. As before, the exact procedure may vary depending on the OS and on packages other than Python.

Function Verification
When writing code, it is vital to keep track of the state of each step, especially calculations and variables. Function calls can also stack up, so it helps to follow the calling chain to understand how each function affects the next. When stepping in, it is recommended to enter the nested calls first, so that the innermost code is verified before the code that depends on it.

Processes of Debugging
Problem Reproduction
The primary function of a debugger application is to detect and eliminate problems affecting the programming process. The first step in debugging is to identify and reproduce the existing problem, which can be nontrivial for rare or intermittent software bugs. Debugging focuses primarily on the immediate state of your program, noting the bugs present at that time. Reproduction is typically affected by the computer's usage history and the immediate environment, both of which can influence the end results.

Simplification of the Problem


The second step is simplifying the program's input by breaking it into smaller pieces so that bugs are easier to eliminate. For example, a compiler that crashes while parsing a large source file is processing all of that data at once; breaking the file down, or subdividing it, makes the problem much easier to reproduce and keeps the whole program from breaking down. The programmer can then identify the bugs by checking the different source files from the original test case and see whether there are more problems that need immediate debugging.
Elimination of Bugs
After reproducing the problem reliably and simplifying the program enough to check on the bugs, the next step is to use a debugger tool to analyze the state of your software. Scanning through a well-organized, simplified program also makes it easier to determine where the fault originates. Bug tracing can likewise be used to track down the source, which makes it possible to remove the problem at its point of origin. In Python programs, tracing usually means inspecting variables at different points of execution. With the source found, the debugger can then work on the bug or bugs present, removing them and keeping the program free from faults.

Debugging In Constant Variable Systems


Fixed, constant, or embedded systems are quite different from general-purpose computer software. Embedded development tends to involve multiple platforms, for instance different operating systems and CPU architectures along with their variants. Embedded debugger tools are therefore often designed to perform a single task for a given piece of software in order to optimize the program, and the need for such a specialized tool for each particular task makes it much harder to decide on a specific one.
Faced with this challenge of heterogeneity, embedded debugger tools exist in different categories, for instance commercial and research tools, as well as subdivisions aimed at specific problems. Green Hills Software provides an example of commercial debuggers, while research tools include FlockLab. Identifying, simplifying, and eliminating embedded bugs relies on collecting information about the operating state, which in turn boosts performance and optimizes the system.

Debugging Techniques
Like other programming languages, Python also makes use of debugging techniques to improve bug identification and elimination. Some of the standard methods of debugging are interactive, print, remote, post-mortem, algorithmic, and delta debugging. The names describe how each technique goes about removing bugs. For instance, print debugging entails monitoring and tracing values by printing them out as the program runs.
Remote debugging removes bugs from a program that runs on a different machine than the debugger tool, while post-mortem debugging identifies and eliminates bugs from programs that have already crashed. Learning the different types of debugging helps you decide which one to use when pinning down Python programming problems. Other techniques are the Saff Squeeze, which isolates faults, and causality tracking, which is essential for tracing causal agents in a computation.

Python Debugging Tools


With so many tools available today, it can be difficult to pick the best choices for Python programs. Python has numerous debugging tools to help keep code free from errors. How well a Python debugger works can depend on the operating system and on whether it is built into the program or obtained separately. Broadly, Python debugging tools work from an IDE, from the command line, or by analysing the available data to avoid bugs.

Debuggers Tools
Python debuggers are specific or multiple purposes in nature, depending on
the platform used, that is, depending on the operating system. Some of the
all-purpose debuggers are pdb and PdbRcldea while multipurpose include
pudb, Winpdb, and Epdb2, epdb, JpyDbg, pydb, trepan2, and
Pythonpydebug. On the other hand, specific debuggers are gdb, DDD, Xpdb,
and HAP Python Remote Debugger. All the above debugging tools operate in
different parts of the Python program with some used during installation,
program creation, remote debugging, and thread debugging and graphic
debugging, among others.

IDEs Tools
Integrated Development Environment (IDE) is the best Python debugging
tools as they suit well on big projects. Despite the tools varying between the
IDEs, the features remain the same for executing codes, analyzing variables,
and creating breakpoints. The most common and widely used IDE Python
debugging tool is the PyCharm comprising of complete elements of
operations, including plugins essential for maximizing the performance of
Python programs. Subsequently, other IDE debugging tools are also great and
readily available in the market today. Some of them include Komodo IDE,
Thonny, PyScripter, PyDev, Visual Studio Code, and Wing IDE, among
others.

Special-Purpose Tools
Special-purpose debugging tools are essential for detecting and eliminating
bugs from different sections of the Python program primarily working on
remote processes. These types of debugging tools are more useful when
tracing problems in the most sensitive and remote areas where it is unlikely
for other debuggers to access. Some of the most commonly used special-purpose debugging tools are FirePython (used in Firefox as a Python logger), manhole, PyConquer, pyringe, hunter, icecream, and PySnooper. This
subdivision of debugging tools enables programmers to quickly identify
hidden and unnoticed bugs and thereby displaying them for elimination from
the system.

Understanding Debugging and Python


Programming
Before venturing more deeply into the connection between a program and its debugging, it helps to understand how the application behaves under a debugger. One of the significant features of debugging is that it runs the code in your program one line at a time and lets you watch the execution as it happens. It acts as an instant replay of what has occurred in the Python program, a systematic walkthrough in which semantic errors become visible.
When code is executed normally, your computer gives you only a limited view of what is happening; a debugger makes it possible to see it. The Python program effectively runs in slow motion while you identify the errors or bugs present in the code. As such, the debugger enables you to determine the following:
The flow of codes in the program
The techniques used to create variables
Specific data contained in each variable within the program
The addition, modification, and elimination of functions
Any other types of calculations performed
Code looping
How the IF and ELSE statements have been entered

Debugger Commands
With debugging being a common feature of the language, there are several commands used when moving between the various operations. The basic commands are the most essential for beginners, and most of them can be abbreviated to one or two letters. Arguments to a command are separated from it by blank spaces, while optional arguments are shown enclosed in square brackets in the command syntax; the brackets themselves are not typed, and alternatives in the syntax are separated by a vertical bar. Entries that the debugger does not recognize as commands are assumed to be Python statements and are executed in the context of the program being debugged.
Python statements can also be prefixed with an exclamation mark to make sure they are run as statements, which makes it possible to change variables and call functions from inside the debugger. Several commands may be entered on the same line, separated by ';;'. The debugger also works with aliases, which let you define your own shortcuts for longer commands. In addition, if a .pdbrc file exists in your home directory or the current directory, it is read in and executed as if it had been typed at the debugger prompt.

Common Debugging Commands


Starting
The command used in debugging is ‘s(tart)' which launches the debugger
from its source. The procedure involved includes typing the title of the
debugger and then the name of the file, object, or program executable to
debug. Inside the debugging tool, there appears a prompt providing you with
several commands to choose from and make the necessary corrections.

Running
The command used is ‘[!]statement’ or ‘r(un)’, which facilitates the execution
of the command to the intended lines and identify errors if any. The
command prompt will display several arguments probably at the top of the
package, especially when running programs without debuggers. For example,
when the application is named ‘prog1’, then the command to use is “r prog1
<infile". The debugger will, therefore, execute the command by redirecting
the program name from the file name.
Breakpoints
As essential components in debugging, breakpoints utilize the command
‘b(reak) [[filename:]lineno|function[, condition]]” to enable debuggers to
stop code input process when program execution reaches this point. When a
developer inputs the codes or values, and it meets a breakpoint, the process
gets suspended for a while, and the debugger command dialog appears on the
screen. Thereby provides time to check on the variables while identifying any
errors or mistakes, which might affect the process. Therefore, breakpoints can
be scheduled to halt at any line on either numerical or functions names which
designate program execution.
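A hedged illustration of what this looks like at the pdb prompt; the line numbers, the file name utils.py, and the condition are invented for the example.

(Pdb) b 14                     # break at line 14 of the current file
(Pdb) b utils.py:40, x > 5     # break in utils.py at line 40 only when x > 5 is true
(Pdb) c                        # continue running until a breakpoint is reached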

Back Trace
Backtrace is executed with the command 'bt' and prints the list of pending function calls at the moment the program stops. The backtrace command is only meaningful while execution is suspended, either at a breakpoint or after the program has exited abnormally with a runtime error such as a segmentation fault. This form of debugging is especially valuable for segmentation faults, since the chain of pending function calls points at the source of the error.

Printing
Printing is primarily used to inspect the value of variables or expressions while execution is paused. It uses the command 'p(rint) expression' and is useful after the running program has been stopped at a breakpoint or by a runtime error; the expression can be any legal expression of the language being debugged, including function calls. Besides printing, resuming the execution after a breakpoint or runtime error uses the command 'c(ont(inue))'.

Single Step
Single stepping uses the commands 's(tep)' and 'n(ext)' after a breakpoint to move through source lines one at a time. The two commands behave slightly differently: 'step' executes the next line and steps into any function it calls, while 'next' executes the next line of the current function without descending into the functions it calls. Running the program line by line like this is often the most effective way to trace errors during execution.

Trace Search
With the command, ‘up, down,' the program functions can either be scrolled
downwards or upwards using the trace search within the pending calls. This
form of debugging enables you to go through the variables within varying
levels of calls in the list. Henceforth, you can readily seek out mistakes as
well as eliminate errors using the desired debugging tool.

File Select
Another basic debugger command is file select which utilizes ‘l(ist) [first[,
last]]’. There exist programs which compose of up to two to several source
files, especially complex programming techniques, thereby the need to utilize
debugging tools in such cases. Debuggers should be set on the main source
file for the benefit of scheduling breakpoints and runtime error to examine the
lines in the folders. With Python, the list of the source files can be readily
selected and prescribe it as the working file.

Help and Quit


The help command is written as 'h(elp)' while quit is written as 'q(uit)'. The help command displays the list of available help topics and can be pointed at a particular command to get details about it, while the quit command is used to exit or abort the debugger.
Alias
Alias debugging entails creating an alias name that executes a command; the command must not be enclosed in single or double quotes. The syntax used is alias [alias [command]]. Replaceable parameters are also supported and are substituted when the alias runs. If the alias is given without a command or arguments, its existing definition is left unchanged. Aliases may contain anything that is legal at the pdb prompt.
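As a small, hedged illustration (the alias name pl is just an example), a pdb session might define and use an alias like this:
(Pdb) alias pl pp locals()
(Pdb) pl
After the first line, typing pl at the prompt behaves exactly like pp locals(), pretty-printing the local variables of the current frame.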

Python Debugger
In the Python programming language, the module pdb defines an interactive source code debugger. It supports setting breakpoints, single-stepping at the source line level, listing source code, and evaluating arbitrary Python code in the context of any stack frame. Post-mortem debugging is also supported, and the debugger can be called under program control. Python debugging is extensible, usually by deriving from the Pdb class; the interface uses the pdb and cmd modules.
The pdb debugger prompt is essential for running programs under the control of the debugging tools; for instance, pdb.py can be invoked as a script to debug other scripts. It can also be used to examine crashed programs, applying several functions in slightly different ways. Some of the functions provided are run(statement[, globals[, locals]]) for running Python statements and runeval(expression[, globals[, locals]]) for evaluating expressions. There are also several functions, not mentioned above, for executing Python programs efficiently.
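As a minimal sketch of how these entry points are used (the function slow_add and its arguments are invented for illustration), you might write:
import pdb

def slow_add(x, y):
    result = x + y      # a breakpoint here would let you inspect x, y, and result
    return result

# run a statement under debugger control; the prompt appears before execution starts
pdb.run('slow_add(2, 3)')

# evaluate an expression under debugger control and return its value
value = pdb.runeval('slow_add(10, 20)')

# drop into the debugger at exactly this point in the program
pdb.set_trace()
The same module can also be debugged from a shell by running python -m pdb followed by the script name.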

Using Debugger Commands


As mentioned above, the debugger command prompt runs as a continuous loop, displaying a window with an input area at the bottom where you type your commands. When a command succeeds, the window shows its output and then displays the prompt again. The Debugger Command Window is therefore also known as the Debugger Immediate Window. It presents two panes: the small bottom pane is where you enter your commands, and the larger upper pane shows your results.
The command prompt is the window where you enter your debugging requests, especially when you need to scan through your program for errors. The Python debugging prompt is user-friendly and encompasses all the features needed to detect and eliminate problems. The prompt displays your current command, and you can quickly stop, modify, or select other debugging parameters.

Debugging Session
Debugging in Python, as in any programming language, is usually a repetitive process: you write code and run it, it does not work, you apply debugging tools, fix the errors, and then repeat the cycle again and again. Because a debugging session tends to use the same techniques each time, there are a few key points to note. The sequence below streamlines your programming process and minimizes the repetition seen during program development.

Set breakpoints.
Run the program under the relevant debugging tool.
Check the values of variables and compare them with what the function should produce.
When all seems correct, either resume the program or wait at another breakpoint and repeat if need be.
When something is wrong, determine the source of the problem, alter the offending lines of code, and begin the process once more.
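A rough sketch of one such cycle at the pdb prompt might look like this (the file name buggy.py, the line number 42, and the variable total are placeholders):
$ python -m pdb buggy.py
(Pdb) b 42        # set a breakpoint at line 42
(Pdb) c           # run until the breakpoint is hit
(Pdb) p total     # inspect a variable and compare it with the expected value
(Pdb) n           # step over the next line and look again
(Pdb) c           # everything looks right, resume until the next breakpoint
(Pdb) q           # something is wrong: quit, edit the code, and start over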

Tips in Python Debugging


Create a Reliable Branch
Debugging is repetitive and broadly similar across different programming language platforms, so it is essential to learn your principles and lean on them. Setting your parameters plays a significant role in ensuring that your programs run within a given environment; that said, make sure you set your debugging parameters deliberately, especially as a beginner.

Install pdb++
When working with Python, it is worth installing the pdb++ package to make it easier to maneuver on the command line. The software gives you a colorized prompt and excellent tab completion, presented elegantly. Pdb++ also improves the overall experience of your debugger by acting as a drop-in replacement for the standard pdb module.
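Assuming you use pip and that the package is published under the name pdbpp (as it is on PyPI at the time of writing), installation is a single command:
pip install pdbpp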

Conduct Daily Practices


Playing around with Python's debugging tools is one of the best ways to learn, in depth, how debugging fits into your programs. With that in mind, make a plan while using a debugger, deliberately make mistakes or create errors, and see what happens. Similarly, try commands such as breakpoints, help, and step to learn more about Python debugging. Write small practical programs while focusing on using the debugger to correct the sections that contain errors.

Learn to Work on One Thing at a Time

Learning Python debugging techniques does not only help you detect and eliminate errors; it also teaches you how to remove such problems systematically. One way to do this is to get used to correcting one anomaly at a time, that is, removing one bug at a time. Begin with the most obvious errors, and think before making immediate corrections, since hasty changes can sometimes remove essential variables. Make the change and then test the program to confirm the outcome.

Ask Questions
If you know developers who use Python or another platform, ask them questions about debugging, since they use these tools heavily. If you are just beginning and have no such friends, go online and find forums, of which there are many today. Interact with them by asking for answers to your debugging problems, and keep experimenting with programs of your own while using the debugging tools. Avoid making assumptions about any part of Python programming, especially debugging, as assumptions can lead to failures during program development.

Be Clever
When we create programs and avoid errors with the help of debuggers, the outcome can leave you feeling excited and even overwhelmed. Be smart, but within limits: keep an eye on your work as well as your future projects. Succeeding in creating a realistic and useful program does not mean you will never fail in the future. Remaining in control will prepare you to use Python's debugging tools wisely and to claim your future accomplishments.

Languages Required for Data Science


Basics of Python
Keywords are an important part of Python programming; they are words that
are reserved for use by Python itself. You can’t use these names for anything
other than what they are intended for, and you most definitely can’t use them
as part of an identifier name, such as a function or a variable. Reserved
keywords are used for defining the structure and the syntax of Python. There
are, at the present time, 33 of these words, and one thing you must remember
is that they are case sensitive—only three of them are capitalized, and the rest
are in lower case. These are the keywords, written exactly as they appear in
the language:
●False
●if
●assert
●as
●is
●global
●in
●pass
●finally
●try
●not
●while
●return
●None
●for
●True
●class
●break
●elif
●continue
●yield
●and
●del
●with
●import
●else
●def
●except
●from
●or
●lambda
●nonlocal
●raise
Note: only True, False, and None are capitalized.
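You do not have to memorize this list: the keyword module that ships with Python can print it for you, and the exact count may differ slightly between Python versions:
import keyword

print(keyword.kwlist)         # the full list of reserved words
print(len(keyword.kwlist))    # how many keywords this Python version defines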
The identifiers are the names that we give to things like variables, functions,
classes, etc., and the name is just so that we can identify one from another.
There are certain rules that you must abide by when you write an identifier:

You may use a combination of lowercase letters (a to z), uppercase letters (A to Z), digits (0 to 9), and the underscore (_). Names such as func_2, myVar and print_to_screen are all examples of perfectly valid identifier names.
You may not start an identifier name with a digit, so 2Class
would be invalid, whereas class2 is valid.
You may not, as mentioned above, use a reserved keyword
in the identifier name. For example:
>>> global = 2
File "<interactive input>", line 3
global = 2
^
Would give you an error message of:
SyntaxError: invalid syntax

You may not use any special symbols, such as $, %, #, !, etc., in the identifier name. For example:
>>> a$ = 1
File "<interactive input>", line 13
a$ = 1
^
Would also give you the following error message:
SyntaxError: invalid syntax
An identifier name can be any length you require.
Things to note are:
Because the Python programming language is case sensitive, variable and
Variable would mean different things.
Make sure your identifier names reflect what the identifier does. For
example, while you could get away with writing c = 12, it would make more
sense to write count = 12. You know at a glance exactly what it does, even if
you don’t look at the code for several weeks.
Use underscores where possible to separate a name made up of multiple words, for example, this_variable_has_many_words.
You may also use camel case.
This is a writing style where the first letter of every word is capitalized except
for the first one, for example, thisVariableHasManyWords.
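Python can check these rules for you. A short sketch using the built-in str.isidentifier() method together with keyword.iskeyword():
import keyword

for name in ['count', 'class2', '2Class', 'a$', 'global']:
    valid = name.isidentifier() and not keyword.iskeyword(name)
    print(name, 'is a valid identifier:', valid)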
Chapter 15 - Advantages of Machine Learning
Due to the sheer volume and magnitude of the tasks, there are some instances
where an engineer or developer cannot succeed, no matter how hard they try;
in those cases, the advantages of machines over humans are clearly stark.

Identifies Patterns
When the engineer feeds a machine with artificial intelligence a training data
set, it will then learn how to identify patterns within the data and produce
results for any other similar inputs that the engineer provides the machine
with. This is efficiency far beyond that of a normal analyst. Due to the strong
connection between machine learning and data science (which is the process
of crunching large volumes of data and unearthing relationships between the
underlying variables), through machine learning, one can derive important
insights into large volumes of data.

Improves Efficiency
Humans might have designed certain machines without a complete
appreciation for their capabilities, since they may be unaware of the different
situations in which a computer or machine will work. Through machine
learning and artificial intelligence, a machine will learn to adapt to
environmental changes and improve its own efficiency, regardless of its
surroundings.

Completes Specific Tasks


A programmer will usually develop a machine to complete certain tasks, most of which involve an elaborate and arduous program where there is
scope for the programmer to make errors of omission. He or she might forget
about a few steps or details that they should have included in the program. An
artificially intelligent machine that can learn on its own would not face these
challenges, as it would learn the tasks and processes on its own.
Helps Machines Adapt to the Changing
Environment
With ever-changing technology and the development of new programming
languages to communicate these technological advancements, it is nearly
impossible to convert all existing programs and systems into these new
syntaxes. Redesigning every program from its coding stage to adapt to
technological advancements is counterproductive. At such times, it is highly
efficient to use machine learning so that machines can upgrade and adapt to the changing technological climate all on their own.

Helps Machines Handle Large Data Sets


Machine learning brings with it the capability to handle multiple dimensions
and varieties of data simultaneously and in uncertain conditions. An
artificially intelligent machine with abilities to learn on its own can function
in dynamic environments, emphasizing the efficient use of resources.
Machine learning has helped to develop tools that provide continuous
improvements in quality in small and larger process environments.

Disadvantages of Machine Learning


It is difficult to acquire data to train the machine. The
engineer must know what algorithm he or she wants to use
to train it, and only then can he or she identify the data set
they will need to use to do so. There can be a significant
impact on the results obtained if the engineer does not make
the right decision.
It’s difficult to interpret the results accurately to determine
the effectiveness of the machine-learning algorithm.
The engineer must experiment with different algorithms
before he or she chooses one to train the machine with.
Technology that surpasses machine learning is being
researched; therefore, it is important for machines to
constantly learn and transform to adapt to new technology.
Subjects Involved in Machine Learning
Machine learning is a process that uses concepts from multiple subjects. Each
of these subjects helps a programmer develop a new method that can be used
in machine learning, and all these concepts together form the discipline of the
topic. This section covers some of the subjects and languages that are used in
machine learning.

Statistics:
A common problem in statistics is testing a hypothesis and identifying the
probability distribution that the data follows. This allows the statistician to
predict the parameters for an unknown data set. Hypothesis testing is one of
the many concepts of statistics that are used in machine learning. Another
concept of statistics that’s used in machine learning is predicting the value of
a function using its sample values. The solutions to such problems are
instances of machine learning, since the problems in question use historical
(past) data to predict future events. Statistics is a crucial part of machine
learning.

Brain Modeling:
Neural networks are closely related to machine learning. Scientists have
suggested that nonlinear elements with weighted inputs can be used to create
a neural network. Extensive studies are being conducted to assess these
elements.

Adaptive Control Theory:


Adaptive control theory deals with methods that help a system adapt to changes in its environment or in its own parameters and continue to perform optimally. The idea is that a system should anticipate the changes and modify itself accordingly.

Psychological Modeling:
For years, psychologists have tried to understand human learning. The EPAM
network is a method that’s commonly used to understand human learning.
This network is utilized to store and retrieve words from a database when the
machine is provided with a function. The concepts of semantic networks and
decision trees were only introduced later. In recent times, research in
psychology has been influenced by artificial intelligence. Another aspect of
psychology called reinforcement learning has been extensively studied in
recent times, and this concept is also used in machine learning.

Artificial Intelligence:
As mentioned earlier, a large part of machine learning is concerned with the
subject of artificial intelligence. Studies in artificial intelligence have focused
on the use of analogies for learning purposes and on how past experiences
can help in anticipating and accommodating future events. In recent years,
studies have focused on devising rules for systems that use the concepts of
inductive logic programming and decision tree methods.

Evolutionary Models:
A common theory in evolution is that animals prefer to learn how to better
adapt to their surroundings to enhance their performance. For example, early
humans started to use the bow and arrow to protect themselves from
predators that were faster and stronger than them. As far as machines are
concerned, the concepts of learning and evolution can be synonymous with
each other. Therefore, models used to explain evolution can also be utilized
to devise machine learning techniques. The most prominent technique that
has been developed using evolutionary models is the genetic algorithm.

Programming Languages

R:
R is a programming language that is estimated to have close to 2 million
users. This language has grown rapidly to become very popular since its
inception in 1990. It is a common belief that R is not only a programming
language for statistical analysis but can also be used for multiple functions.
This tool is not limited to only the statistical domain. There are many features
that make it a powerful language.
The programming language R is one that can be used for many purposes,
especially by data scientists to analyze and predict information through data.
The idea behind developing R was to make statistical analysis easier.
As time passed, the language began to be used in different domains. There
are many people who are adept at coding in R, although they are not
statisticians. This situation has arisen since many packages are being
developed that help to perform functions like data processing, graphic
visualization, and other analyses. R is now used in the spheres of finance,
genetics, language processing, biology, and market research.

Python:
Python is a language that has multiple paradigms. You can probably think of
Python as a Swiss Army knife in the world of coding, since this language
supports structured programming, object-oriented programming, functional
programming, and other types of programming. Python is consistently ranked among the most popular languages in the world, since it can be used to write programs in every industry, from data mining to website construction.
The creator, Guido van Rossum, decided to name the language Python after Monty Python. If you were to look into some built-in packages, you would find references to Monty Python sketches in the code or documentation. It is for this reason and many others that Python is a language most programmers love, though engineers or those with a scientific background who are now data scientists may find it takes some getting used to.
Python’s simplicity and readability make it quite easy to understand. The
numerous libraries and packages available on the internet demonstrate that
data scientists in different sectors have written programs that are tailored to
their needs and are available to download.
Since Python can be extended to work best for different programs, data
scientists have begun to use it to analyze data. It is best to learn how to code
in Python since it will help you analyze and interpret data and identify
solutions that will work best for a business.
Chapter 16 - Numba - Just In Time Python
Compiler
Although numpy is written in C or Fortran and standard routines working on
arrays of data are highly optimized, non-standard operations are still coded in
python and might be painfully slow. Fortunately, the Pydata company
developed a package that can translate python code into native machine code
on the fly and execute it at the same speed as C programs. In some respects,
this approach is even better than compiled code because the resulting code is
optimized for each particular machine and can take advantage of all the
features of the processor, whereas regular compiled programs might ignore
some processor features for the sake of compatibility with older machines, or
might have even been compiled before new features were even developed.
Besides, a Python program using the Numba just-in-time compiler will work on any platform for which Python and Numba are available. The user does not need to worry about a C compiler. There will be no hassle with
dependencies or complex makefiles and scripts. Python code just works out
of the box - taking full advantage of all available hardware.
The LLVM virtual machine used by Numba allows compiled code to run on
different processor architectures, GPU, and accelerator boards. It is under
heavy development, so while I was writing this book, execution times for the example programs were cut by more than half.
Such heavy development on both Numba and LLVM has some disadvantages
as well. Obviously, some Python features could never be significantly
accelerated. But some could and will be accelerated in future versions of
Numba. When I started working on this book, Numba’s compiled functions
could not handle lists or create numpy arrays. Now, they can do it.
Obviously, some material in this section will be obsolete well before the rest
of the book. But it is a good thing. Just keep an eye on Pydata's Numba web
site.
For some strange reason, numba was not included in the Anaconda Linux installer, so I had to install it manually by opening the anaconda3/bin folder in a terminal and typing
conda install numba
The same should work on Windows; just use the terminal shortcut from Anaconda's folder in the Windows start menu. Numba is usually included with later versions of WinPython. If not, download the wheel package and its dependency packages from Christoph Gohlke's page and install them using WinPython's setup utility.
To illustrate the speedups you can get with numba, I'll implement the Sieve of Eratosthenes prime number search algorithm. In order to accelerate a function, Numba needs to know the type of all the variables, or at least be able to guess them, and those types should not change during function execution. NumPy arrays are therefore the data structures of choice when working with numba.
Here is the Python code:
from numba import jit
import numpy as np
import time

@jit('i4(i4[:])')
def get_primes(a):
    m = a.shape[0]
    n = 0
    a[1] = 0                          # 1 is not a prime number
    for i in range(2, m):
        if a[i] != 0:                 # i has not been crossed out, so it is prime
            n += 1
            for j in range(i**2, m, i):
                a[j] = 0              # cross out the multiples of i
    return n

# create an array of integers 0 to a million
a = np.arange(1000000, dtype=np.int32)
start_time = time.time()              # get system time
n = get_primes(a)                     # count prime numbers
# print number of prime numbers below a million
# and execution time
print(n, time.time() - start_time)
First, we import numba, numpy, and the time module that will be used to time the program execution. Then, we need a function implementing the Sieve of Eratosthenes on a numpy array of integers. The function's definition is preceded by the decorator @jit (Just In Time compile) imported from the numba package. It tells numba to compile this function into machine code. The rest of the program is executed as plain Python. The decorator tells numba that the function must return a four-byte (32-bit) integer and receives a parameter that is a one-dimensional array of 4-byte integers.
Using numpy's arange function, we create an array of consecutive integer numbers between zero and a million and remember the current time. We then call the function get_primes, which counts the prime numbers in the array and zeroes out the non-prime numbers. As soon as the function returns, we get the current time again and print the number of prime numbers found as well as the time the function took to execute.
On my Sandy Bridge laptop, the numba-accelerated function takes about 7 ms to complete. If I comment out the @jit decorator:
#@jit('i4(i4[:])')
the execution time increases to 3 s. Compilation results in a 428-fold speedup. Not bad for one line of code. Searching for prime numbers between 1 and 10 million takes 146 ms with numba and 42 s in pure Python, respectively, which is a 287-fold speedup. These numbers are bound to change as numba, LLVM, and processors improve.
Because the function get_primes receives a reference, not a copy, of the original array, the non-prime numbers in the array are zeroed out in place, and we can get the prime numbers using the fancy indexing discussed in the numpy section:
print(a[a > 0])
The default array printing behavior is not particularly useful here, as it only shows a few numbers at the beginning and the end of the array. You can change this behavior or just iterate through the filtered array using a for loop.
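Continuing the example above, and assuming you really do want to see every element, you can raise numpy's printing threshold before printing, or simply loop over the filtered array:
import sys
np.set_printoptions(threshold=sys.maxsize)   # print arrays in full instead of abbreviating them
print(a[a > 0])

for p in a[a > 0]:   # ...or iterate explicitly
    print(p)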

Troubleshooting numba functions


Although numba is under heavy development and is quickly becoming more robust, it is still a tool for optimizing the most critical parts of your code. These parts should be refactored into small functions, debugged in plain Python, and then decorated with numba's @jit decorator.
In the best-case scenario, you will instantly see a performance boost. But sometimes you see no difference. It is likely that numba failed to compile the function into machine code and fell back on using Python objects to represent the problematic variables. This slows execution down to almost pure-Python level. Perhaps, in some cases, it is good that the function doesn't fail completely, but it doesn't report problems either, and you don't know whether you could tweak your code a little to get your two orders of magnitude performance increase.
One way to force compilation to machine code is by giving the @jit decorator the parameter nopython=True. This will force numba to fail compilation and show an error message if any variable could not be compiled into the processor's native types. Another approach is to set the environment variable NUMBA_WARNINGS before importing numba. You can do this from within your Python script by adding two lines at the top of it:
import os
os.environ['NUMBA_WARNINGS'] = "1"
from numba import jit
Finally, you can dump numba's intermediate representation of your function by applying the method inspect_types to your numba-compiled functions. If any variable has type pyobject instead of something like int32 or float64, there might be a problem. As numba gets smarter, the impact of this problem outside of tight loops might diminish dramatically, but, on the other hand, the problematic parts of code that can easily reduce performance several fold become harder to spot.
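Continuing with the get_primes function from the earlier example, that check is a single call:
get_primes.inspect_types()   # prints numba's typed intermediate representation
# look for variables typed as pyobject: they indicate a fallback to object mode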
Describing the types of the function parameters, return value, and local variables in the @jit decorator might also significantly increase the performance of your numba-compiled function. You might also play with some additional numba compilation parameters. For instance, the use of AVX instructions is disabled on Sandy Bridge and Ivy Bridge processors by default, and you might want to try enabling it. This can be done by setting the environment variable NUMBA_ENABLE_AVX. In case you are curious to see the assembly code of your numba-compiled function, you may request numba to print it for you by setting the environment variable NUMBA_DUMP_ASSEMBLY.
import os
os.environ['NUMBA_ENABLE_AVX'] = "1"
os.environ['NUMBA_DUMP_ASSEMBLY'] = "1"
from numba import jit
See numba documentation for more details.

Process level parallelism


If you need an even higher performance, you can use process level
parallelism. Python objects are not designed for use in parallel programs, so Python employs the Global Interpreter Lock (GIL) to block parallel execution.
Compiled modules, including numba compiled functions, can use parallelism
to take advantage of multicore processors or several processors in a system.
Autoparallelisation might even make use of parallelism transparent for a
programmer. But, for now, the use of multiple cores is complicated.
Fortunately, time consuming computations can often be divided into
independent chunks. If you have several hundreds of images to analyze,
program run might take minutes or even hours. But, the analysis of different
images can be carried out independently; so, you can spawn several
subprocesses - each running an independent Python interpreter - and hand
every subprocess its fair share of images to analyse. You can start as many
processes on as many cores as your processor has, or, if you have a processor
capable of multithreading, as many threads as it simultaneously supports.
Each process gets a list of file names and returns results of the analysis to the
parent process.
Python's standard library offers facilities to simplify this approach in a
module called multiprocessing. It even allows you to utilize other computers
over a network. Of course, you can still take advantage of Numba’s just in
time compilation. Actually, I suggest you try it first. The 200 fold speed up
you can obtain with Numba might be all you need. It is definitely worth
trying before you buy 200 computers to make a cluster or start hogging the
resources of a cluster you have at work. Using Numba in a cluster will
probably require installing it on each computer.
Using the multiprocessing module is pretty simple.
import glob
import os
from multiprocessing import Pool

def f(file_name):                           # worker function
    process_id = os.getpid()                # obtain process id
    file_size = os.path.getsize(file_name)  # obtain file size
    return [process_id, file_size, file_name]

if __name__ == '__main__':                  # only the parent process runs this block
    with Pool(8) as p:                      # create a pool of 8 workers
        # obtain names of files in the working directory
        # using the glob function from module glob
        files = glob.glob('*.*')
        result = p.map(f, files)            # run the analysis in parallel
    for r in result:                        # print the results
        print('\t'.join([str(s) for s in r]))
It is important to use the
if __name__ == '__main__':
block to spawn child processes. The copies of the Python interpreter running in the child processes open the same module in order to import the worker function, and this conditional statement prevents them from spawning new processes recursively.
The Pool object has several methods that allow it to run workers asynchronously, set a time limit for the completion of the parallel task, and control whether the results are returned in the same order as the arguments or in an arbitrary order. I refer you to the multiprocessing module documentation for further detail.
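For instance, reusing the worker f and the files list from the script above (the 60-second time limit is purely illustrative), a couple of those variations look like this inside the with Pool(8) as p: block:
# run asynchronously and collect the results later, with a time limit
async_result = p.map_async(f, files)
rows = async_result.get(timeout=60)

# receive results as soon as each worker finishes, in arbitrary order
for row in p.imap_unordered(f, files):
    print(row)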
Conclusion
Python training is always an excellent idea for those who wish to become part of this constantly developing industry. Developers need to have a grasp of this language, which is not only easy to learn but also places relatively little emphasis on syntax. Because of this convenience, Python stands apart from other languages and does not punish programmers too harshly for making a few mistakes here and there.
Python programmers can branch out into different fields on the basis of the solid foundation the language provides. Python training ensures that programmers can use this language in many ways and to the best of its capabilities. Programmers, especially those who want to build a career as software engineers, can take on hands-on projects and find that Python lives up to their expectations.

What is the best way to learn Python?


Learning this language can be a little challenging, and almost every student asks the same question: "What is the best way to learn Python?" Those who have decided to embark on the journey of learning the Python programming language can get started by following a few necessary steps.
· Stick to it
· Daily practice
· Make notes
· Go interactive
· Take breaks
· Debugging
· Collaborative Work
· Build Something

Stick to it
Beginners can master their programming skills only if they stick to it like glue and practice code repeatedly. Here are some ways for beginners and intermediate learners to master the skills of Python programming.

Daily Practice
Beginners should practice code on a daily basis in order to get a grasp of a specific task, dividing the task into smaller steps where possible. Consistency is the key to achieving any challenging goal, and learning Python is no different: real programmers believe in consistency. Even a student with little knowledge can commit to coding for at least an hour a day.

Make Notes
After making some progress on the journey as a new programmer, making notes is also a good idea. Research suggests that taking notes by hand on important work is beneficial for long-term retention, and for becoming a real programmer this habit is very useful. Furthermore, writing code on paper helps build a programmer's mind. It is difficult at the beginning, but some programmers can write code in their heads, so sketching code on paper is no great burden, and it leads to the kind of frictionless thinking a programmer needs before taking on big projects in the future.

Go Interactive
Beginners can take advantage of an IDE in order to practice with strings, lists, classes, dictionaries, and so on. Install any IDE for practicing code; the easiest and simplest option is the Jupyter Notebook that comes with Anaconda Navigator. Install Anaconda Navigator and launch Jupyter Notebook: a window will open in the default browser of your desktop computer or laptop, and you can start practicing. Write your code and check the results.
Make changes to your code and analyze the results and errors. It is a mental exercise and helps a lot in learning.
Take Breaks
It is important to take breaks and to revisit the concepts behind the code; practice works best when combined with absorbing the concepts. There is a well-known method called the Pomodoro Technique, which is widely used and can help with learning: practice a task for 25-30 minutes, take a break, review the concepts, and then repeat the process. Exercise is a kind of refreshment, whether you go for a walk or chat with friends, and you can even chat with friends about the concepts you have learned in the course.
Debugging
Becoming a good bug hunter is vital in programming, especially in Python programming. Bugs in the code happen even to professionals, mostly on hard tasks, though not always; and remember that professionals were also beginners when their journey started. Embrace these moments and do not get frustrated when you run into bugs; hunt them down instead.
It is essential to have a methodical approach to finding out where things are breaking down when debugging. Make sure each part of the written code works properly, which is easiest if you check the code from start to end once it is finished.
When you have identified the area where things are breaking down, insert this line into your code and run it:
import pdb; pdb.set_trace()   ### add the line to your script and run it
The Python debugger can also be started from the command line with:
python -m pdb <my_file.py>   ### command line

Collaborative Work
Once programmers have stuck with Python through the early part of the journey, moving on to collaborative work makes it easier to take on tasks that might be a little challenging alone. In short, when more than one mind starts thinking about a problem, the problem does not remain a problem for long. To make your learning collaborative, here are some tips to follow.

Work with other learners


Learning to code is not easy at the start, but it works best when you work with other learners. Sharing tips and tricks makes it easier to learn well and move forward. There is no need to worry if you have no partner or friend to collaborate with; there are always options, such as joining public events organized for learners or using the online peer-to-peer community support available to Python enthusiasts.

Teach
The saying "a teacher learns more by teaching students" is famous among teaching and learning communities, and it holds true when learning the Python language. There are many ways to teach and, in doing so, to learn more by understanding and solving problems. Teaching at a whiteboard is the most common way, but you can also write blog posts about Python learning tips, problems or mistakes, and ways of solving specific errors, record videos, or share useful tricks. Beyond this, there is a very simple way to teach: talk through or repeat what you have just done. These strategies solidify your concepts and understanding and expose any errors or gaps.

Ask Questions
There is no such thing as a bad question when learning a programming language; a programmer should feel free to ask anything, however basic it seems. Concepts, rules, and results should be learned in any way possible, even by asking seemingly foolish questions. That said, a programmer should try to ask good questions in such a way that the conversation with others stays pleasant, which also makes it easier to have further conversations the next time help is needed.

Build Something
Almost all programmers believe that learning programming is easiest when you are solving a simple problem or building something simple. Learning by building something is widely seen as the way to become a real programmer.

Build Anything Small


There are many exercises to solve, and they help you learn the Python language effectively. Along the way, a programmer gains confidence, which makes it easier to move on to more challenging tasks. Follow this route once you have learned the basic data structures such as lists, dictionaries, classes, strings, and functions; that is the best time to start building something.
A programmer with the basics can proceed to build something simple that applies the concepts learned during Python training. Some basic tasks to build are as follows (a sketch of the first one appears after this list):
· A number guessing game built around a "while loop"
· A simple calculator application, with or without additional functionality
· A dice roll simulator
· A poker game
· A price notification service for any currency
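Here is a minimal sketch of the first idea, the number guessing game built around a while loop (the range of 1 to 100 is arbitrary):
import random

secret = random.randint(1, 100)   # pick the number the player has to guess
guess = None
while guess != secret:            # keep asking until the player gets it right
    guess = int(input('Guess a number between 1 and 100: '))
    if guess < secret:
        print('Too low, try again.')
    elif guess > secret:
        print('Too high, try again.')
print('Correct! The number was', secret)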
Programmers can also come up with ideas of their own, although it takes some sharpness to invent genuinely new tasks to perform with Python. There are thousands of programming projects and tasks for both beginners and intermediate learners to practice with and master their skills.

Contribute to Open Source


Open-source communities have been built around cooperating to solve problems and learn new skills. In these communities, the code for specific programs is available, and programmers who want to practice and master their skills can build on it. Many open-source organizations, from small projects to large companies, as well as mentors, are available and willing to help on GitHub and other platforms. By staying active in such communities, a programmer can read and work with code uploaded by experienced engineers.
Contributing to an open-source project is a great way to put your knowledge into action. A programmer can submit a bug fix or get help fixing bugs in his or her own script. The maintainers of the projects leave replies on your posts, which can be useful and instructive. In this way, a programmer learns more by communicating with other developers, especially experienced ones, and helping other learners is itself a kind of contribution to the open-source community.
DATA ANALYSIS WITH
PYTHON:
THE ULTIMATE BEGINNER'S GUIDE TO
LEARN PROGRAMMING IN PYTHON FOR
DATA SCIENCE WITH PANDAS AND NUMPY,
MASTER STATISTICAL ANALYSIS, DATA
MINING, AND VISUALIZATION

Matt Foster
© Copyright 2019 - All rights reserved.

The content contained within this book may not be reproduced, duplicated, or transmitted without
direct written permission from the author or the publisher.

Under no circumstances will any blame or legal responsibility be held against the publisher, or author,
for any damages, reparation, or monetary loss due to the information contained within this book. Either
directly or indirectly.

Legal Notice:

This book is copyright protected. This book is only for personal use. You cannot amend, distribute, sell,
use, quote or paraphrase any part, or the content within this book, without the consent of the author or
publisher.

Disclaimer Notice:

Please note the information contained within this document is for educational and entertainment
purposes only. All effort has been executed to present accurate, up to date, and reliable, complete
information. No warranties of any kind are declared or implied. Readers acknowledge that the author is
not engaging in the rendering of legal, financial, medical, or professional advice. The content within
this book has been derived from various sources. Please consult a licensed professional before
attempting any techniques outlined in this book.

By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, errors, omissions, or inaccuracies.
Introduction
Banks are obliged to collect, store, and analyze vast amounts of data. With data science applications, that data is being transformed into an opportunity for banks to learn more about their customers and drive new revenue opportunities, instead of being treated as a mere compliance exercise. Digital banking is widely used and more popular than ever, and this influx produces terabytes of customer data, so isolating the genuinely relevant data is the first task for data scientists. Armed with customers' preferences, interactions, and behaviors, data science applications can isolate the information about the most relevant clients and process it to enhance business decision-making.

Investment banks risk modeling


Risk modeling is a high priority for investment banks: it serves critical purposes in the pricing of financial instruments and helps regulate commercial activities. Investment banking evaluates the value of businesses to facilitate mergers and acquisitions, to conduct corporate reorganizations or restructurings, and to raise capital in corporate financing. Risk modeling therefore matters enormously to banks, and with more data science tools in reserve and more information at hand, they can assess risk to their benefit. Industry innovators are now leveraging these new technologies and data science applications for more efficient risk modeling and better data-driven decisions.

Personalized marketing
Providing a customized offer that fits the preferences and needs of particular customers is crucial to success in marketing. It is now possible to make the right offer, on the right device, to the right customer, at the right time. Data science applications are used for target selection, identifying the potential customers for a new product. With the aid of such applications, scientists create a model that predicts the probability of a customer responding to an offer or promotion, based on demographics, purchase history, and behavioral data. Banks have thus improved their customer relations, personalized their outreach, and made their marketing more efficient through data science applications.
Health and Medicine
Health and medicine form an industry with enormous potential for implementing data science solutions. From the exploration of genetic disease to drug discovery and the computerization of medical records, data analytics is taking medical science to an entirely new level, and it is perhaps astonishing that this dynamic is just the beginning. Data science and healthcare are most often connected through finances, as the industry tries to cut down on its expenses with the help of large amounts of data. The relationship between medicine and data science is developing significantly, and its advancement is crucial. Here are some of the impacts data science applications have on medicine and health.

Analysis of medical images
Medical imaging is one of the most significant benefits the healthcare sector gets from data science applications. Research on big data analytics in healthcare indicates that the imaging techniques used in medicine include X-ray, magnetic resonance imaging (MRI), mammography, computed tomography, and many others. Applications in development will extract data from images more effectively, present accurate interpretations, and enhance image quality. As these data science applications suggest better treatment solutions, they also boost the accuracy of diagnoses.

Genomics and genetics


Sophisticated therapy individualization is made possible through studies in genomics and genetics. The primary purpose of this field is to find the individual biological correlations between disease, genetics, and drug response, and to understand the effect of DNA on our health. Data science techniques make it possible to integrate various kinds of data with genomic data in disease research, giving a deeper understanding of how genetic issues react to particular conditions and drugs. It is useful to look at some of the frameworks and technologies involved: MapReduce allows genetic sequence mapping to be processed efficiently in a short time, SQL can be used to retrieve genomic data, and BAM files can be computed and manipulated. Deep Genomics also makes a substantial impact, focusing principally on DNA interpretation to predict the molecular effects of genetic variation. With their database, scientists are able to understand the way genetic variations affect the genetic code.

Drug creation
The drug discovery process involves many disciplines and is highly complicated. The best ideas often have to pass through enormous amounts of time, financial expenditure, and testing; typically, getting a drug submitted officially can take up to twelve years. Data science applications have shortened and simplified this process by adding analytics at each stage, from the screening of drug compounds to the prediction of success rates based on biological factors. Using advanced mathematical modeling and simulations rather than "lab experiments" alone, these applications can forecast how a compound will act in the body. Computational drug discovery produces computer-model simulations as a biologically relevant network, simplifying the prediction of future results with high accuracy.

Virtual assistance for customer and patient support
The idea that some patients do not necessarily have to visit doctors in person is the concept behind clinical process optimization; doctors do not always have to be visited when patients can get effective solutions through a mobile application. AI-powered mobile apps, commonly in the form of chatbots, can provide vital healthcare support. Built on a massive network connecting symptoms to causes, they make it as simple as describing your symptoms and receiving key information about your medical condition. When necessary, such applications can book an appointment with a doctor and remind you to take your medicine on time. By letting doctors focus on more critical cases, these applications save patients the time spent waiting in line for an appointment and promote a healthy lifestyle.

Industry knowledge
To offer the best possible treatment and improve services, knowledge management in healthcare is vital: it brings together externally generated information and internal expertise. With new technologies being created and the industry changing rapidly every day, the effective gathering, storing, and distribution of different facts is essential. Data science applications allow healthcare organizations to integrate various sources of knowledge securely and use them together in the treatment process to achieve progressive results.

Oil and Gas


Machine learning and data science are the primary forces behind various trends in industries like marketing, finance, and the internet, among others, and the oil and gas industry is no exception: important observations are being extracted by applications in the upstream, midstream, and downstream sectors. As a result, refined data has become a valuable asset to companies within the industry. Data science applications are quite useful in several of these areas of oil and gas.

Immediate drag and torque calculation using neural networks
In drilling, there is a need to analyze the structured visual data that operators obtain through logging. They can also capture electronic drilling recorder data and contextual data in the form of daily drilling log reports. Because drilling operations are time-bound, instant decisions are essential. As a result, companies predict drilling key performance indicators and analyze rig states for real-time data visualization with the use of neural networks. Using AI, they can estimate the friction coefficients and normal contact forces between the wellbore and the string, and they can calculate drag and torque on the drill strings in real time for any given well. Operators can also use historical data on pump washouts and, through alerts on their phones, know when and whether a washout is likely.

Predicting well production profiles through feature extraction models
Recurrent neural networks and time series forecasting are part of the optimization of oil and gas production. Predicted oil rates and gas-to-oil ratios are significant KPIs. Operators can calculate bottom-hole pressure, choke, wellhead temperature, and daily oil rate predictions from data on nearby wells with the use of feature extraction models. When predicting production decline, they make use of fracture parameters, and for pattern recognition on sucker rod dynamometer cards they use neural networks and deep learning.

Downstream optimization
To process gas and crude oil, oil refineries use a massive volume of water, and there are now systems that tackle water solution management in the oil and gas industry. In addition, by analyzing distribution data effectively through cloud-based services, modeling speed for forecasting revenues has increased.

The Internet
Anytime anyone thinks about data science, the first idea that comes to mind is the internet. It is typical to think of Google when we talk about searching for something online, but Bing, Yahoo, AOL, Ask, and several others are also search engines. What they all have in common are the data science algorithms that let them return results to you in a fraction of a second when you run a search. Google alone processes more than 20 petabytes of data every day; search engines are what they are today with the help of data science.

Targeted advertising
Of all the data science applications, the digital marketing spectrum rivals even the search engines in significance. Data science algorithms decide the distribution of digital billboards and banner displays on different websites, and compared with traditional advertisements, they have helped marketers achieve higher click-through rates. Using a user's behavior, marketers can target them with specific adverts; at the same time and in the same place online, one user might see an ad about anger management while another sees an ad for a keto diet.

Website recommendations
This case is familiar to everyone: you see suggestions for similar products on sites such as eBay and Amazon. This adds a great deal to the user experience while helping users discover appropriate products among the many available. Leaning on users' relevant information and interests, many businesses have promoted their products and services with such engines. To improve the user experience, internet giants including Google Play, Amazon, Netflix, and others use this system, deriving the recommendations from the results of a user's previous searches.

Advanced image recognition


The face recognition algorithm powers the automatic tag suggestion feature: when users upload a picture to a social network like Facebook, they start getting tag suggestions. For some time now, Facebook has made significant gains in the capacity and accuracy of its image recognition. You can also upload an image to the internet and search with it on Google, which uses image recognition to return related results.

Speech recognition
Siri, Google Voice, Cortana, and many others are among the best-known speech recognition products. They make life easy for anyone who is not in a position to type a message: speak the words and they are converted to text. The accuracy of speech recognition is, however, not always guaranteed.

Travel and Tourism


There are constant challenges and changes, even with the exceptional opportunities data science has brought to many industries, and travel and tourism is no exception. Today there is a rise in travel culture, since a broader audience can afford to travel; by becoming larger than ever before, the target market has changed dramatically. As a worldwide trend, travel and tourism is no longer a privilege of the rich and the noble.
Data science algorithms have become essential in this industry for processing massive amounts of data and meeting the requirements of the rising number of consumers. Hotels, airlines, booking and reservation websites, and many others now see big data as a vital tool for enhancing their services every day. The travel industry uses some of the following tools to become more efficient:

Customer segmentation and personalized marketing
Personalization has become a preferred trend for travelers who want their experience to be appreciated. Customer segmentation adapts the general stack of services to please the needs of every group by segmenting customers according to their preferences; hence, finding a solution that aligns with every situation is crucial. Customer segmentation and personalized marketing are all about collecting and unifying users' social media data, behavioral data, metadata, and geolocation, then processing the user's preferences and anticipating them for the future.

Analysis of customer sentiment


Sentiment analysis means analyzing textual data and recognizing the emotional elements in the text. Service providers and business owners can learn customers' real attitudes towards their brands through sentiment analysis. Customer reviews play a huge role in the travel industry: to make decisions, travelers read the reviews customers have posted on various websites and platforms and then act upon those recommendations. As a result, some modern booking websites provide sentiment analysis as part of their service packages for the hotels and travel agencies willing to cooperate with them.

Recommendation engine
According to some experts, this concept is one of the most promising and efficient. Several major booking and travel web platforms use recommendation engines in their everyday work, matching customers' needs and wishes with the available offers. Based on preferences and previous searches, travel and tourism companies can propose alternative travel dates, rental deals, new routes, attractions, and destinations when they apply data-powered recommendation engine solutions. Booking service providers and travel agencies use recommendation engines to offer suitable options to all these customers.

Travel support bots


By providing exceptional assistance with travel arrangements and support for customers, travel bots are changing the travel industry. An AI-powered travel bot can save users money and time, answer questions, suggest new places to visit, and organize trips. It is an excellent solution for customer support thanks to its support for multiple languages and its 24/7 availability. Significantly, these bots are always learning, so they become smarter and more helpful every day. A chatbot can therefore solve major travel and tourism tasks, and both customers and business owners benefit from it.

Route optimization
In the travel and tourism industry, route optimization plays a significant role.
It can be quite challenging to account for several destinations, plan trips,
schedules, and working distances and hours. With route optimization, it
becomes easy to do some of the following:

Time management
Minimization of the travel costs
Minimization of distance
Data science certainly improves lives and continues to change the face of several industries, giving them the opportunity to provide unique experiences for their customers with high satisfaction rates. Beyond shifting our attitudes, data science has become one of the most promising technologies bringing change to different businesses. With the many solutions data science applications provide, their benefits cannot be over-emphasized.
Chapter 1 - What is Data Analysis

Overview of Decision Tree


In this chapter, we will look at the concept of decision trees and their importance in data science. Whenever an analysis involves multiple variables, decision trees come into the picture.
You might then ask, "How are these decision trees generated?" They are generated by specific algorithms that use different rules to split the data into segments. These segments are then combined into groups and form an upside-down decision tree whose root node sits at the top of the tree. The main information lies in the root node, which is usually presented as a simple one-dimensional display in the decision tree interface.
A decision tree will have a root node splitting into two or more decision
nodes that are categorized by decision rule. Further, the decision nodes are
categorized as a terminal node or leaf node.
The leaf node has the response or dependent variable as the value. Once the
relationship between leaf nodes and decision nodes is established, it becomes
easy to define the relationship between the inputs and the targets while
building the decision tree. You can select and apply rules to the decision tree.
It has the ability to search hidden values or predict new ones for specific
inputs. This rule allocates observations from a dataset to a segment that
depends on the value of the column in the data. These columns are referred to
as inputs.
Splitting rules are responsible for generating the skeleton of the decision tree.
The decision tree appears like a hierarchy. There is a root node at the top,
followed by the decision nodes for the root nodes, and leaf nodes are the part
of decision nodes. For a leaf node, there is a special path defined for the data
to identify in which leaf it should go. Once you have defined the decision
tree, it will become easier for you to generate other node values depending
upon the unseen data.
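To make this concrete, here is a minimal sketch of fitting and printing a small decision tree with scikit-learn; the library choice, the bundled iris sample data, and the shallow depth are assumptions for illustration, not part of the original discussion.

# A minimal sketch: fit a shallow decision tree and print its splitting rules.
# scikit-learn and its bundled iris sample data are assumed to be installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target          # inputs and target

# Keep the tree shallow so the root node and decision nodes stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Text view of the tree: root node at the top, leaf nodes at the bottom
print(export_text(tree, feature_names=iris.feature_names))

The printed rules mirror the structure described above: a root node at the top, a decision rule at each split, and leaf nodes holding the predicted target values.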

History of Decision Tree


The decision tree concept has been in practice for more than five decades; one of the first uses of a decision tree is attributed to Belson's work on television broadcasting research back in 1956. From that period on, the decision tree concept was widely adopted, and various forms of decision trees with new and different capabilities were developed. It has been used in fields such as data mining and machine learning, and the concept has been refreshed with new techniques and implemented at a larger scale.

Modeling Techniques in Decision Trees


The decision tree concept works well alongside regression. These techniques are vital for selecting inputs or generating dummy variables that represent effects in regression equations.
Decision trees are also used to collapse a group of categorical values into specific ranges that are aligned with the values of the target variable; this is referred to as optimal value collapsing. Because categories with similar target values are combined, there is minimal information loss when the categories are collapsed together. The result is a stronger prediction with better classification outputs.

Why Are Decision Trees Important?


Decision trees are used for multiple variable analysis. Multiple variable analysis helps us explain, identify, describe, and classify a target. For example, sales, the probability of a sale, or the time it takes a prospect to respond to a marketing campaign are all affected by multiple input variables, dimensions, and factors. Multiple variable analysis opens the door to discovering other relationships and explaining them in several ways. This kind of analysis is crucial for problem-solving, because success usually depends on multiple factors at once. Many multiple variable techniques have been developed to date, which is an attractive part of data science and decision trees; choosing among them depends on factors like ease of use, robustness, and the relative power of different data and their measurement levels.
Decision trees are represented in an incremental format. Any set of multiple influences can therefore be expressed as a group of one-cause, one-effect relationships depicted in the recursive format of the decision tree. This means a decision tree can work around the limits of human short-term memory in a more controlled way, presenting results in a form that is easier to understand than complex multiple variable techniques.
A decision tree is important because it helps transform raw data into knowledge and into specific awareness about business, scientific, social, and engineering problems. It presents this knowledge in a simple but powerful, human-readable format, because decision trees discover and maintain strong relationships between the input values and the target values in the set of observations used to build the dataset. If a set of input values is found to be associated with the target value during the selection process, those target values are grouped together into a bin, which eventually forms a branch of the decision tree. This kind of grouping is built around two things: the bin value and the target value. As an example of binning, suppose the average target values are stored in three bins created from the input values. Binning then takes each input value, establishes the relationship between that input value and the target value, and finally determines how the input value is linked to the target. You will need strong interpretation skills to understand the relationship between the input and target values. That relationship is established when you are able to predict the value of the target effectively. You will come to understand not only the relation between the input and target values but also the nature of the target itself, and you can then predict values based on those relationships.
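As a rough, hedged illustration of this kind of binning, the sketch below uses pandas to cut an invented numeric input into three bins and then looks at the average target value inside each bin; all column names and values here are made up for the example.

import pandas as pd

# Invented data: one numeric input column and one numeric target column
df = pd.DataFrame({
    "input_value": [1, 2, 3, 10, 11, 12, 25, 26, 27],
    "target":      [5, 6, 5, 20, 22, 21, 40, 41, 39],
})

# Split the input values into three bins of equal width
df["bin"] = pd.cut(df["input_value"], bins=3)

# The average target value per bin shows how input ranges relate to the target
print(df.groupby("bin", observed=True)["target"].mean())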
Chapter 2 - Python Crash Course
A lot of tools for processing data are available. Simply put, data analysis is a methodology that requires inspecting, cleansing, transforming, and modeling data. Its purpose is to discover vital information, interpret it properly, and decide on a suitable course of action. This chapter gives you an idea of the best tools and techniques that data scientists use for data analysis.

Open Source Data Tools


OpenRefine
OpenRefine was earlier known as Google Refine. This tool is most efficient when working with disorganized datasets. It enables data scientists to clean data and put it into a different format, and it also allows them to integrate different datasets (external and internal). OpenRefine is a great tool for large-scale data exploration, enabling the user to discover data patterns easily.
Orange
Orange is an open-source data visualization and analysis tool designed for people who do not have expertise in data science. It helps the user build an interactive workflow that can be used for analysis and visualization of data, using a simple interactive workflow editor and an advanced toolbox. Its output ranges from mainstream scatter plots and bar charts to dendrograms.

KNIME
KNIME is another open-source tool that enables the user to explore data and interpret hidden insights effectively. One of its strong points is that it contains more than 1,000 modules, along with numerous examples that help the user understand the applications and effective use of the tool. It also ships with advanced integrated tools and some complex algorithms.
R-programming
R is the most common and widely used tool of this kind and has become a standard for statistical programming. R is free, open-source software that any user can install, use, upgrade, modify, clone, and even resell. It can easily and effectively be used for statistical computing and graphics, and it is compatible with operating systems such as Windows, macOS, and UNIX. It is a high-performance language that lets the user manage big data. Since it is free and regularly updated, it makes technological projects cost-effective. Along with data mining, it lets the user apply statistical and graphical techniques, including common tasks such as statistical tests, clustering, and linear and non-linear modeling.

RapidMiner
RapidMiner is similar to KNIME in that it offers visual programming for data modeling, analysis, and manipulation. It helps improve the overall productivity of data science project teams. It offers an open-source platform for machine learning, model deployment, and data preparation, and it speeds up the development of an entire analytical workflow, from model validation to deployment.
Pentaho
Pentaho tackles the issues organizations face when they need to take in values from different data sources. It simplifies data preparation and data blending, and it provides tools for analysis, visualization, reporting, exploration, and prediction of data. It lets each member of a team give meaning to the data.

Weka
Weka is another open-source software package, designed to apply machine-learning algorithms to data mining tasks. The user can apply these algorithms directly to a data set. Since it is implemented in Java, it can also be used to develop new machine learning schemes. Its simple graphical user interface makes for an easy transition into the field of data science, and any user acquainted with Java can call the library from their own code.
NodeXL
NodeXL is an open-source data visualization and analysis tool that is capable of displaying relationships in datasets. It has numerous modules, such as social network data importers and automation.
Gephi
Gephi is an open-source visualization and network analysis tool written in Java.
Talend
Talend is one of the leading open-source software providers that data-driven companies go for. It enables customers to connect easily, wherever they are.

Data Visualization
Datawrapper
Datawrapper is an online data-visualization tool that can be used to build interactive charts. Data in the form of CSV, Excel, or PDF files can be uploaded, and the tool can generate maps, bar charts, and line charts. The graphs created with it come with ready-to-use embed codes and can be placed on any website.

Tableau Public
Tableau Public is a powerful tool that can create stunning visualizations that
can be used in any type of business. Data insights can be identified with the
help of this tool. Using visualization tools in Tableau Public, a data scientist
can explore data prior to processing any complex statistical process.
Infogram
Infogram contains more than 35 interactive chart types and 500 maps that allow the user to visualize data. It can make various charts, such as word clouds, pie charts, and bar charts.

Google Fusion Tables


Google Fusion Tables is one of the most powerful data analysis tools. It is
widely used when an individual has to deal with massive datasets.
Solver
Solver supports effective financial reporting, budgeting, and analysis. It offers a push-button way to interact with a company's profit-related data.

Sentiment Tools
OpenText
This specialized classification engine makes it possible to identify and evaluate expressions and patterns. It carries out analysis at various levels: document, sentence, and topic.

Trackur
Trackur is an automated sentiment analysis software emphasizing a specific
keyword that is tracked by an individual. It can draw vital insights by
monitoring social media and mainstream news. In short, it identifies and
discovers different trends.
Opinion Crawl
Opinion Crawl is another online sentiment analysis tool that analyzes the latest news, products, and companies. Every visitor is free to check web sentiment on a specific topic: anyone can submit a topic and receive an assessment of it. For each topic, a pie chart reflects the latest real-time sentiment, various thumbnails and cloud tags represent the concepts people associate with the topic, and the positive and negative weight of the sentiment is also displayed. Web crawlers search the most recent content published on the relevant subjects and issues to create a comprehensive analysis.

Data Extraction Tools


Content Grabber
Content Grabber is a tool designed for organizations that need to extract data and save it in specific formats such as CSV, XML, and Excel reports. It also has a scripting and editing module, which makes it a good option for programming experts; individuals can use C# or VB.NET to debug and write scripts.

IBM Cognos Analytics


IBM Cognos Analytics was developed after Cognos Business Intelligence. It
is used for data visualization in the BI product. It is developed with a Web-
based interface. It covers a variety of modules, such as data governance,
strong analytics, and management. The integration of data from different
sources to make reports and visualizations is possible using this tool.

Sage Live
Sage Live is a cloud-based accounting platform that can be used by small and mid-sized businesses. It enables the user to create invoices and handle bill payments from a smartphone. It is a good choice if you want a tool that supports different companies, currencies, and banks.
GNU Awk (gawk)
GNU Awk (gawk) interprets a special-purpose programming language that enables users to handle simple data-reformatting jobs with just a few lines of code. Following are its main attributes:
➢ It is data-driven rather than procedural.
➢ Writing programs is easy.
➢ It makes searching for a variety of patterns in text units straightforward.
GraphLab Create
GraphLab Create can be used by data scientists as well as developers. It enables the user to build state-of-the-art data products that use machine learning to power smart applications.
The attributes of this tool include automatic feature engineering, machine learning visualizations, and model selection. It can identify and link records within and across data sources, and it can simplify the development of machine learning models.
Netlink Business Analytics
Netlink Business Analytics is a comprehensive on-demand analytics solution. You can access it through any simple browser or through company software, and its collaboration features allow users to share dashboards among teams. Features can be customized for sales and for more complex analytic capabilities, such as inventory forecasting, fraud detection, sentiment analysis, and customer churn analysis.

Apache Spark
Apache Spark is designed to run in memory and to process data in real time.
The top 5 data analytics tools and techniques

Visual analytics
Several different methods can be used for data analysis, and visual analytics combines them through the integrated effort of human interaction, data analysis, and visualization.

Business Experiments
Business experiments include all the techniques used to test the validity of a process, such as A/B testing and experimental design.
Regression Analysis
Regression analysis identifies which factors make two different variables related to each other, and how a change in one is associated with a change in the other.

Correlation Analysis
Correlation Analysis is a statistical technique that detects whether a
relationship exists between two different variables.

Time Series Analysis


Time series analysis gathers data at specific time intervals. It makes it possible to identify changes over time and, by looking back at past behavior, to forecast future events.
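As a small, hedged sketch of what correlation analysis can look like in practice with pandas, consider the following; the figures are invented purely for illustration.

import pandas as pd

# Made-up monthly figures for two variables we suspect are related
df = pd.DataFrame({
    "ad_spend": [10, 12, 15, 18, 20, 24],
    "sales":    [100, 115, 140, 160, 175, 210],
})

# Pearson correlation matrix: values close to +1 or -1 suggest a strong linear relationship
print(df.corr())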
Chapter 3 - Data Munging
Now that you’ve gone through a Python programming crash course and you
have some idea of the basic concepts behind programming, we can start
discussing the data science process.
So, what does “data munging” even mean? A few decades ago, a group of
MIT students came up with this term. Data munging is about changing some
original data to more useful data by taking very specific steps. This is
basically the data science pipeline. You might sometimes hear about this term
being referred to as data preparation, or sometimes even data wrangling.
Know that they are all synonyms.
In this chapter we’re going to discuss the data science process and learn how
to upload data from files, deal with missing data, as well as manipulate it.
The Process
All data science projects are different one way or another, however they can
all be broken down into typical stages. The very first step in this process is
acquiring data. This can be done in many ways. Your data can come from
databases, HTML, images, Excel files, and many other sources, and
uploading data is an important step every data scientist needs to go through.
Data munging comes after uploading the data, however at the moment that
raw data cannot be used for any kind of analysis. Data can be chaotic, and
filled with senseless information or gaps. This is why, as an aspiring data
scientist, you solve this problem with the use of Python data structures that
will turn this data into a data set that contains variables. You will need these
data sets when working with any kind of statistical or machine learning
analysis. Data munging might not be the most exciting phase in data science,
but it is the foundation for your project and much needed to extract the
valuable data you seek to obtain.
In the next phase, once you observe the data you obtain, you will begin to
create a hypothesis that will require testing. You will examine variables
graphically, and come up with new variables. You will use various data
science methodologies such as machine learning or graph analysis in order to
establish the most effective variables and their parameters. In other words, in
this phase you process all the data you obtain from the previous phase and
you create a model from it. You will undoubtedly realize in your testing that
corrections are needed and you will return to the data munging phase to try
something else. It’s important to keep in mind that most of the time, the
solution for your hypothesis will be nothing like the actual solution you will
have at the end of a successful project. This is why you cannot work purely
theoretically. A good data scientist is required to prototype a large variety of
potential solutions and put them all to the test until the best course of action is
revealed.
One of the most essential parts of the data science process is visualizing the results through tables, charts, and plots. The overall workflow is often referred to as "OSEMN", which stands for "Obtain, Scrub, Explore, Model, iNterpret". While this abbreviation doesn't entirely illustrate the process behind data science, it captures the most important stages you should be aware of as an aspiring data scientist. Just keep in mind that data munging will often take the majority of your effort when working on a project.

Importing Datasets with pandas


Now is the time to open the toolset we discussed earlier and take out pandas.
We need pandas to first start by loading the tabular data, such as spreadsheets
and databases, from any files. This tool is great because it will create a data
structure where every row will be indexed, variables kept separate by
delimiters, data can be converted, and more. Now start running Jupyter, and
we’ll discuss more about pandas and CSV files. Type:
In: import pandas as pd
iris_filename = ‘datasets-ucl-iris.csv’
iris = pd.read_csv(iris_filename, sep=',', decimal='.', header=None,
names= ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'])
We start by importing pandas and naming our file. In the third line we define which character should be used as a separator with the "sep" keyword, as well as the decimal character with the "decimal" keyword. We can also specify whether there's a header with the "header" keyword, which in our case is set to none.
named “iris” and we refer to it as a pandas DataFrame. In some ways it’s
similar to the lists and dictionaries we talked about in Python, however there
are many more features. You can explore the object’s content just to see how
it looks for now by typing the following line:
In: iris.head()
As you can see, we aren’t using any parameters with these commands, so
what you should get is a table with only the first 5 rows, because that’s the
default if there are no arguments. However, if you want a certain number of
rows to be displayed, simply type the instruction like this:
iris.head(3)
Now you should see the first three rows instead. Next, let’s access the column
names by typing:
In: iris.columns
Out: Index(['sepal_length', 'sepal_width', 'petal_length',
'petal_width', 'target'], dtype='object')
The result of this will be a pandas index of the column names that looks like
a list. Let’s extract the target column. You can do it like this:
In: Y = iris['target']
Y
Out:
0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
             ...
149    Iris-virginica
Name: target, dtype: object
For now it’s important only to understand that Y is a pandas series. That
means it is similar to an array, but in this case it’s one directional. Another
thing that we notice in this example is that the pandas Index class is just like
a dictionary index. Now let’s type the following:
In: X = iris[['sepal_length', 'sepal_width']]
All we did here was ask for a list of columns by index. By doing so, we received a pandas DataFrame as the result. In the first example, we received a one-dimensional pandas Series; now we have a matrix instead, because we requested multiple columns. What's a matrix? If your basic math is a bit rusty, you should know that it is an array of numbers arranged in rows and columns.
Next, we want to have the dimensions of the dataset:
In: print (X.shape)
Out: (150, 2)
In: print (Y.shape)
Out: (150,)
What we have now is a tuple. We can now see the size of the array in both
dimensions. Now that you know the basics of this process, let’s move on to
basic preprocessing.

Preprocessing Data with pandas


The next step after learning how to load datasets is to get accustomed to the
data preprocessing routines. Let’s say we want to apply a function to a certain
section of rows. To achieve this, we need a mask. What’s a mask? It’s a
series of true or false values (Boolean) that we need to tell when a certain line
is selected. As always, let’s examine an example because reading theory can
be dry and confusing.
In: mask_feature = iris['sepal_length'] > 6.0
In: mask_feature
0      False
1      False
...
146     True
147     True
148     True
149    False
In this example we’re trying to select all the lines of our “iris” dataset that
have the value of “sepal length” larger than 6. You can clearly see the
observations that are either true or false, and therefore know the ones that fit
our query. Now let’s use a mask in order to change our “iris-virginica” target
with a new label. Type:
In: mask_target = iris['target'] == 'Iris-virginica'
In: iris.loc[mask_target, 'target'] = 'New label'
All “Iris-virginica” labels will now be shown as “New label” instead. We are
using the “loc()” method to access this data with row and column indexes.
Next, let’s take a look at the new label list in the “target” column. Type:
In: iris['target'].unique()
Out: array(['Iris-setosa', 'Iris-versicolor', 'New label'], dtype=object)
In this example we are using the “unique” method to examine the new list.
Next we can check the statistics by grouping every column. Let’s see this in
action first, and then discuss how it works. Type:
In: grouped_targets_mean = iris.groupby(['target']).mean()
grouped_targets_mean
Out:
In: grouped_targets_var = iris.groupby(['target']).var()
grouped_targets_var
Out:
We start by grouping each column with the “groupby” method. If you are a
bit familiar with SQL, it’s worth noting that this works similarly to the
“GROUP BY” instruction. Next, we use the “mean” method, which computes
the average of the values. This is an aggregate method that can be applied to
one or several columns. Then we can have several other pandas methods such
as “var” which stands for the variance, “sum” for the summation, “count” for
the number of rows, and more. Keep in mind that the result you are looking at
is still a data frame. That means that you can link as many operations as you
want. In our example we are using the “groupby” method to group the
observations by label and then check what the difference is between the
values and variances for each of our groups.
Now let’s assume the dataset contains a time series. What’s a time series, you
ask? In data science, sometimes we have to analyze a series of data points
that are graphed in a certain chronological order. In other words, it is a
sequence of the equally spaced points in time. Time series’ are used often in
statistics, for weather forecasting, and for counting sunspots. Often, these
datasets have really noisy data points, so we have to use a "rolling" operation. In older versions of pandas this was written as pd.rolling_mean(time_series, 5); in current versions you call rolling() on the Series itself, like this:
In: smooth_time_series = time_series.rolling(window=5).mean()
As you can see, we're using the "mean" method again in order to obtain the average of the values. You can also replace it with "median" in order to get the median of the values. In this example, we also specified a rolling window of 5 samples.
Now let’s explore pandas “apply” method that has many uses due to its
ability to perform programmatically operations on rows and columns. Let’s
see this in action by counting the number of non-zero elements that exist in
each line.
In: iris.apply(np.count_nonzero, axis=1).head()
Out:
0    5
1    5
2    5
3    5
4    5
dtype: int64
Lastly, let’s use the “applymap” method for element level operations. In the
next example, we are going to assume we want the length of the string
representation of each cell. Type:
In: iris.applymap(lambda el:len(str(el))).head()
To receive our value, we need to cast every individual cell to a string value.
Once that is done, we can gain the value of the length.

Data Selection with pandas


The final section about working with pandas is data selection. Let’s say you
find yourself in a situation where your dataset has an index column, and you
need to import it and then manipulate it. To visualize this, let’s say we have a
dataset with an index from 100. Here’s how it would look:
n,val1,val2,val3
100,10,10,C
101,10,20,C
102,10,30,B
103,10,40,B
104,10,50,A
So the index of row 0 is 100. If you import such a file, you will have an index
column like in our case labeled as “n”. There’s nothing really wrong with it,
however you might use the index column by mistake, so you should separate
it instead in order to prevent such errors from happening. To avoid possible
issues and errors, all you need to do is mention that “n” is an index column.
Here’s how to do it:
In: dataset = pd.read_csv('a_selection_example_1.csv',
index_col=0) dataset
Out:
Your index column should now be separate. Now let’s access the value inside
any cell. There’s more than one way to do that. You can simply target it by
mentioning the column and line. Let’s assume we want to obtain “Val3” from
the 5th line, which is marked by an index of 104.

In: dataset['val3'][104]
Out: 'A'
Keep in mind that this isn’t a matrix, even though it might look like one.
Make sure to specify the column first, and then the row in order to extract the
value from the cell you want.

Categorical and Numerical Data


Now that we’ve gone through some basics with pandas, let’s learn how to
work with the most common types of data, which are numerical and
categorical.
Numerical data is quite self-explanatory, as it deals with any data expressed
in numbers, such as temperature or sums of money. These numbers can either
be integers or floats that are defined with operators such as greater or less
than.
Categorical data, on the other hand, is expressed by a value that can’t be
measured. A great example of this type of data, which is sometimes referred
to as nominal data, is the weather, which holds values such as sunny, partially
cloudy, and so on. Basically, data to which you cannot apply equal to, greater
than, or less than operators is nominal data. Other examples of this data
include products you purchase from an online store, computer IDs, IP
addresses, etc. Booleans are the one thing that is needed to work with both
categorical and numerical data. They can even be used to encode categorical
values as numerical values. Let’s see an example:
Categorical_feature = sunny     Numerical_features = [1, 0, 0, 0, 0]
Categorical_feature = foggy     Numerical_features = [0, 1, 0, 0, 0]
Categorical_feature = snowy     Numerical_features = [0, 0, 1, 0, 0]
Categorical_feature = rainy     Numerical_features = [0, 0, 0, 1, 0]
Categorical_feature = cloudy    Numerical_features = [0, 0, 0, 0, 1]
Here we take our earlier weather example that takes the categorical data
which is in the form of sunny, foggy, etc, and encode them to numerical data.
This turns the information into a map with 5 true or false statements for each
categorical feature we listed. One of the numerical features (1) confirms the
categorical feature, while the other four are o. Now let’s turn this result into a
dataframe that presents each categorical feature as a column and the
numerical features next to that column. To achieve this you need to type the
following code:
In: import pandas as pd
categorical_feature = pd.Series(['sunny', 'foggy', 'snowy', 'rainy', 'cloudy'])
mapping = pd.get_dummies(categorical_feature)
mapping
Out:
In data science, this is called binarization. We do not use one categorical
feature with as many levels as we have. Instead, we create all the categorical
features and assign two binary values to them. Next we can map the
categorical values to a list of numerical values. This is how it would look:
In: mapping['sunny']
Out:
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: sunny, dtype: float64
In: mapping['foggy']
Out:
0    0.0
1    1.0
2    0.0
3    0.0
4    0.0
Name: foggy, dtype: float64
You can see in this example that the categorical value “sunny” is mapped to
the following list of Booleans: 1, 0, 0, 0, 0 and you can go on like this for all
the other values.
Next up, let’s discuss scraping the web for data.

Scraping the Web


You won’t always work with already established data sets. So far in our
examples, we assumed we already had the data we needed and worked with it
as it was. Often, you will have to scrape various web pages to get what you’re
after and download it. Here are a few real world situations where you will
find the need to scrape the web:

1. In finance, many companies and institutions need to scrape the


web in order to obtain up to date information about all the
organizations in their portfolio. They perform this process on
websites belonging to newspaper agencies, social networks,
various blogs, and other corporations.
2. Did you use a product comparison website lately to find out
where to get the best deal? Well, those websites need to
constantly scrape the web in order to update the situation on the
market’s prices, products, and services.
3. How do advertising companies figure out whether something is
popular among people? How do they quantify the feelings and
emotions involved with a particular product, service, or even
political debate? They scrape the web and analyze the data they
find in order to understand people’s responses. This enables
them to predict how the majority of consumers will respond
under similar circumstances.
As you can see, web scraping is necessary when working with data, however
working directly with web pages can be difficult because of the different
people, server locations, and languages that are involved in creating websites.
However, data scientists can rejoice because all websites have one thing in
common, and that is HTML. For this reason, web scraping tools focus almost
exclusively on working with HTML pages. The most popular tool that is used
in data science for this purpose is called Beautiful Soup, and it is written in
Python.
Using a tool like Beautiful Soup comes with many advantages. Firstly it
enables you to quickly understand and navigate HTML pages. Secondly, it
can detect errors and even fill in gaps found in the HTML code of the
website. Web designers and developers are humans, after all, and they make
mistakes when creating web pages. Sometimes those mistakes can turn into
noisy or incomplete data, however Beautiful Soup can rectify this problem.
Keep in mind that Beautiful Soup isn’t a crawler that goes through websites
to index and copy all their web pages. You simply need to import and use the
“urllib” library to download the code behind a webpage, and later import
Beautiful Soup to read the data and run it through a parser. Let’s first start by
downloading a web page.
In: import urllib.request
url = 'https://en.wikipedia.org/wiki/Marco_Polo'
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
With this request, we download the code behind Wikipedia’s Marco Polo
web page. Next up, we use Beautiful Soup to read and parse the resources
through its HTML parser.
In: from bs4 import BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')
Now let’s extract the web page’s title like so:
In: soup.title
Out: <title>Marco Polo - Wikipedia, the free encyclopedia</title>
As you can see, we extracted the HTML title tag, which we can use further
for investigation. Let’s say you want to know which categories are linked to
the wiki page about Marco Polo. You would need to first analyze the page to
learn which HTML tag contains the information we want. There is no
automatic way of doing this because web information, especially on
Wikipedia, constantly changes. You have to analyze the HTML page
manually to learn in which section of the page the categories are stored. How
do you achieve that? Simply navigate to the Marco Polo webpage, press the
F12 key to bring up the web inspector, and go through the code manually.
For our example, we find the categories inside a div tag called “mw-normal-
catlinks”. Here’s the code required to print each category and how the output
would look:
In:
section = soup.find_all(id='mw-normal-catlinks')[0]
for catlink in section.find_all("a")[1:]:
print(catlink.get("title"), "->", catlink.get("href"))
Out:
Category:Marco Polo -> /wiki/Category:Marco_Polo
Category:1254 births -> /wiki/Category:1254_births
Category:1324 deaths -> /wiki/Category:1324_deaths
Category:13th-century explorers -> /wiki/Category:13thcentury_explorers
Category:13th-century Venetian people -> /wiki/Category:13thcentury_venetian_people
Category:13th-century Venetian writers -> /wiki/Category:13thcentury_venetian_writers
Category:14th-century Italian writers -> /wiki/Category:14thcentury_Italian_writers
In this example, we use the “find all” method to find the HTML text
contained in the argument. The method is used twice because we first need to
find an ID, and secondly we need to find the “a” tags.
A word of warning when it comes to web scraping: be careful, because it is not always permitted. You might need authorization, because to some websites this minor intrusion looks similar to a DoS attack, and that can lead the website to ban your IP address. So if you
download data this way, read the website’s terms and conditions section, or
simply contact the moderators to gain more information. Whatever you do,
do not try to extract information that is copyrighted. You might find yourself
in legal trouble with the website / company owners.
With that being said, let’s put pandas away, and look at data processing by
using NumPy.

NumPy and Data Processing


Now that you know the basics of loading and preprocessing data with the
help of pandas, we can move on to data processing with NumPy. The purpose
of this stage is to have a data matrix ready for the next stage, which involves
supervised and unsupervised machine learning mechanisms. NumPy data
structure comes in the form of ndarray objects, and this is what you will later
feed into the machine learning process. For now, we will start by creating
such an object to better understand this phase.
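As a quick, hedged preview of what such an object can look like, here is a minimal sketch of creating an ndarray; the values are arbitrary and chosen only for illustration.

import numpy as np

# Build a 2-D ndarray (a small data matrix) from a plain Python list of lists
data_matrix = np.array([[5.1, 3.5, 1.4],
                        [4.9, 3.0, 1.4],
                        [4.7, 3.2, 1.3]], dtype=np.float64)

print(data_matrix.shape)   # (3, 3): three rows, three columns
print(data_matrix.dtype)   # float64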
Chapter 4 - Why Data Preprocessing Is Important
Data preprocessing is a technique that we use with data mining. In this process, the raw data we need is gathered and analyzed so that we can find a way to transform it into useful data. For example, when you browse an e-commerce site, you are generating the data needed to keep things running and to make sure useful information shows up. Along the way, that data is also transformed into a more understandable form so that the right product recommendations can reach you at the right time.
Data is the fuel that a lot of companies use in many ways. But in the real world, most of the data being collected is pretty noisy. It arrives with a lot of errors, meaning that we have unstructured data that is sometimes hard to read through and understand.
To turn that data into a more structured form that is easier to read and work with, we rely on the data preprocessing step.

Import the Needed Libraries


With this in mind, we need to take some time to import the libraries that are
needed to ensure that we can actually do some of the preprocessing that we
need.
These are pretty basic steps, but they are going to be so important in making
sure that we can get the data to do what we want. The first step that we will
focus on is importing all of the libraries and algorithms that are needed to
take care of that raw data.
In this step, there are a few libraries that are going to be required to get this
data processing off and running. The libraries that are the most important to
work on this step will include:

1. NumPy: This is the library we are going to use for some of the more complicated mathematical computations in machine learning. It can work with N-dimensional arrays, Fourier transforms, and linear algebra, to name a few.
2. Matplotlib: This library is used when we want to plot graphs and figures such as line charts, pie charts, and bar graphs. It is the one we use when it is time to create a visualization of the data during analysis, so that we can understand the patterns in our data as easily as possible.
3. Pandas: Pandas is a library that is mainly used for data manipulation. It is open-source and contains all of the functions related to data structures. It also has all of the tools needed for data analysis, so whether you want to use it only for data preprocessing or for the whole process, you will have the tools to get it done.
4. Seaborn: This is another option that you can use for data visualization. You could say it is an upgraded version of Matplotlib, and we can use it to make graphs and charts that are more informative.
The codes that you need to use to bring up all of these libraries and get them
ready to use will include:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Importing the Sets of Data


Once we have imported the libraries above, along with any other libraries we plan to use in this process, it is time to import the sets of data that we want to work with. Before we move on to the data preprocessing step, we need some data sets ready to use, and pandas is the best library for importing them.
For most smaller preprocessing jobs you can easily import the data sets from CSV files. The data can also come in HTML or xlsx files, but since CSV files are smaller in size, they are one of the faster formats to work with.
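As a minimal sketch, assuming a file named dataset.csv sits in the working directory (the file name is only a placeholder), importing it with pandas can look like this:

import pandas as pd

# 'dataset.csv' is a placeholder name for whatever file you are importing
df = pd.read_csv("dataset.csv")

print(df.head())    # first five rows, to check that the import worked
print(df.shape)     # (number of rows, number of columns)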

Fill in Some of the Missing Values


It is likely that if you are collecting data that is considered unstructured, then
you will find that there are a lot of missing values that are going to show up
in the mix as well. This happens when you are working through a lot of
different sources and more with the data you have. Finding these missing
values and taking care of them as much as possible is going to be critical to
ensuring that we can really use the data and that the outliers and missing
values are not going to negatively impact our process.
When we are ready to import a set of data, you will find that there are going
to be a few missing values that show up inside of it. If it is not corrected, then
it is going to make it really hard to manage or do any of the preprocessing
that is needed with your data. You will find that there is a lot of inaccurate
information about it, and this can harm the kind of results that you are able to
get.
This means that before you go through and do any data preprocessing, you
have to deal with the issue of missing values first.
While we are on this topic, there are going to be two methods that we can use
when it comes to replacing some of those missing values. Both can work
really well depending on what you are going for in the process, and what
works the best for your project.
With the first method, if you are going through a set of data that is really
large, then you may not want to have any of the missing values there and you
don’t want to spend a lot of time trying to fill them in either. For this
example, we would go through and delete the row of data that has the missing
values. It is not going to have a big effect when it comes to getting the
accurate results that you need from the output.
Another option is to fill in those missing values. Maybe you are going
through the set of data and you notice that there is a numeric column that is
present. And through that column, there are a lot of missing values that are
showing up.
With this method, we are able to replace some of the missing values with the
help of the mean, median, or mode of the values of the entire row of that
column that you want to use. This can help to keep some of the results that
you are working with as organized and as even as possible and will help you
to get the results that you want.
Both of these methods can work well for the data preprocessing that you need to get done. We just need to make sure that we set them up in the right manner and pick the one that works best for our set of data.
Make sure to deal with these missing data points before you start processing
the information so that the missing values don’t mess with the output that you
get.
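Here is a minimal sketch of both methods with pandas, using an invented DataFrame with a numeric "age" column and a few gaps; the column names and values are placeholders.

import pandas as pd
import numpy as np

# Placeholder data containing some missing values
df = pd.DataFrame({"age":  [25, np.nan, 31, 40, np.nan],
                   "city": ["NY", "LA", None, "SF", "NY"]})

# Method 1: delete every row that contains a missing value
df_dropped = df.dropna()

# Method 2: fill the missing numeric values with the column mean
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())

print(df_dropped)
print(df_filled)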

Modifying the Text Values Over to Numerical Values


The fourth step that we need to take a look at here is how we can take some
of the text value that we have and turn it over into a number. There are times
when this needs to happen in order to make the model work the way that you
would like.
It takes a bit of work, but it is definitely possible, and it may be just what
your project needs to be successful.
When we do data preprocessing for machine learning, we will find that it requires the values of the data to be in numerical form, because machine learning models are built on mathematical calculations.
This means that we need to convert all of the text values that are in our
columns of the set of data over to a form that is numerical. There are a few
methods that work to handle this, but the LabelEncoder class is going to be
used in most cases to take the categorical or the string variable and change it
over to a numerical value.
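A short, hedged sketch of how the LabelEncoder class from scikit-learn is typically used follows; the category values below are invented for the example.

from sklearn.preprocessing import LabelEncoder

# Invented categorical column values
cities = ["Paris", "London", "Paris", "Tokyo", "London"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(cities)   # each category becomes an integer

print(encoded)            # [1 0 1 2 0]
print(encoder.classes_)   # ['London' 'Paris' 'Tokyo']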
The steps that we have gone through above are going to be some of the
biggest things that we need to follow when it is time to preprocess some of
the data that we are working with. But there are a few other steps that have to
come into play, depending on the data that we are working with and what we
need to see get done overall.
For example, some of the other steps that we may need to add to this process
include the Creation of Training and Test data sets and even Feature Scaling.
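For completeness, here is a hedged sketch of those two extra steps with scikit-learn, using placeholder arrays whose shapes and values are arbitrary.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix X and target vector y
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Creation of training and test data sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training data only, then apply to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)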
Data is becoming the biggest fuel source for most businesses, and its growth is likely to continue well into the future. Most of the data these companies focus on will come in an unstructured format, and we have to go through and set some rules on how to convert it into data that is useful.
This is why data preprocessing is so important. Using the steps that are above
can help us to get through this step of the process, and will ensure that the
data is ready to go through our models, and present us with the right output,
each time.
Chapter 5 - What is Data Wrangling?
The next topic that we need to spend some time on is known as data wrangling. This is basically the process in which we clean, and then unify, the messy and complex sets of data that we have, in order to make them easier to access and analyze whenever we like. This may seem like part of the boring stuff in our data science process, but it is so important to the end results that we need to spend some time seeing how it works.
With all of the vast amounts of data present in the world right now, and with the sources of that data growing at a rapid rate and always expanding, it is becoming more and more essential for these large amounts of available data to be organized and ready to go before you attempt any analysis. If you just leave the data in its messy form, it is not going to provide you with an accurate analysis in the end, and you will be disappointed by the results.
Now, the process of data wrangling typically includes a few steps. We may find that we need to manually convert or map data from one raw form into another format. The reason this is done in the first place is that it makes the data more convenient to consume for the company that wants to use it.

What Is Data Wrangling?


When you work with your own project in data science, there are going to be
times when you gather a lot of data and it is incomplete or messy. This is
pretty normal considering all of the types of data you have to collect, from a
variety of sources overall. The raw data that we are going to gather from all
of those different sources is often going to be hard to use in the beginning.
And this is why we need to spend some time cleaning it. Without the data
being cleaned properly, it will not work with the analytical algorithm that we
want to create.
Our algorithm is going to be an important part of this process as well. It is
able to take all of the data you collect over time and will turn it into some
good insights and predictions that can then help to propel your business into
the future with success. But if you are feeding the analytical data a lot of
information that is unorganized or doesn’t make sense for your goals, then
you are going to end up with a mess. To ensure that the algorithm works the
way that you want, you need to make sure that you clean it first, and this is
the process that we can call data wrangling.
If you as the programmer would like to create your own efficient ETL
pipeline, which is going to include extract, transform and load, or if you
would like to create some great looking data visualizations of your work
when you are done, then just get prepared now for the data wrangling.
Like most data scientists, data analysts, and statisticians will admit, most of
the time that they spend implementing an analysis is going to be devoted to
cleaning or wrangling up the data on its own, rather than in actually coding or
running the model or algorithm that they want to use with the data.
According to the O’Reilly 2016 Data Science Salary Survey, almost 70 percent of data scientists spend a big portion of their time on basic exploratory data analysis, and 53 percent spend their time on cleaning their data before using it in an algorithm.
Data wrangling, as we can see here, is going to be an essential part of the data
science process. And if you are able to gain some skills in data wrangling,
and become more proficient with it, you will soon find that you are one of
those people who can be trusted and relied on when it comes to some of the
cutting-edge data science work.

Data Wrangling with Pandas


Another topic that we can discuss in this chapter is the idea of data wrangling
with Pandas. Pandas is seen as one of the most popular libraries in Python for
data science, and specifically to help with data wrangling. Pandas gives us a variety of techniques that work well for data wrangling, and these come together to help us deal with the most common data formats out there, along with their transformations.
We have already spent a good deal of time talking about what the Pandas
library is all about. And when it comes to data science, Pandas can definitely
step in and help get a ton of the work done. With that said, it is especially
good at helping us to get a lot of the data wrangling process that we want
doing as well. There may be a few other libraries out there that can do the job
but none are going to be as efficient or as great to work with, as the Pandas
library.
Pandas will have all of the functions and the tools that you need to really
make your project stand out, and to ensure that we are going to see some
great results in the process of data wrangling as well. So, when you are ready
to work with data wrangling, make sure to download the Pandas library, and
any of the other extensions that it needs.

Our Goals with Data Wrangling


When it comes to data wrangling, most data scientists are going to have a few
goals that they would like to meet in order to get the best results. Some of the
main goals that can come up with data wrangling, and should be high on the
list of priorities, include:

1. Reveal a deep intelligence inside of the data that you are


working with. This is often going to be accomplished by
gathering data from multiple sources.
2. Provide us with accurate and actionable data and then put it in the hands of an analyst for the business, in a timely manner, so they can see what is there.
3. Reduce the time that is spent collecting, and even
organizing, some of the data that is really unruly, before it
can be analyzed and utilized by that business.
4. Enables the data scientists, and any other analyst to focus on
the analysis of the data, rather than just the process of
wrangling.
5. Drives better skills for making decisions by senior leaders in
that company.

The Key Steps with Data Wrangling


Just like with some of the other processes that we have discussed in this
guidebook, there are a few key steps that need to come into play when it
comes to data wrangling. There are three main steps that we can focus on for
now, but depending on the goals you have and the data that you are trying to
handle, there could be a few more that get added in as well. The three key
steps that we are going to focus on here though will include data acquisition,
joining data, and data cleansing.
First on the list is data acquisition. How are you meant to organize and get
the data ready for your model if you don’t even have the data in the first
place? In this part of the process, our goal is to first identify, and then obtain access to, the data in your preferred sources so that you can use it as you need in the model.
The second step is going to be where we join together the data. You have
already been able to gather in the data that you want to use from a variety of
sources and even did a bit of editing in the process. Now it is time for us to
combine together the edited data for further use and more analysis in the
process.
And then we can end up with the process that is known as data cleansing. In
the data cleansing process, we need to redesign the data into a format that is
functional and usable, and then remove or correct any of the data that we
consider as something that is bad.

What to Expect with Data Wrangling?


The process of data wrangling can be pretty complex, and we need to take
some time to get through all of it and make sure that we have things in the
right order. When people first get into the process of data wrangling, they are
often surprised that there are a number of steps, but each of these is going to
be important to ensure that we can see the results that we want.
To keep things simple for now, we are going to recognize that the data
wrangling process is going to contain six iterative steps. These are going to
include the following:
The process of discovering. Before you are able to dive into the data and the
analysis that you want to do too deeply, we first need to gain a better
understanding of what might be found in the data. This information is going
to give you more guidance on how you would like to analyze the data. How
you wrangle your customer data, as an example, maybe informed by where
they are located, what the customer decided to buy, and what promotions
they were sent and then used.
The second iterative step that comes with the data wrangling process is going
to be structuring. This means that we need to organize the data. This is a
necessary process because the raw data that we have collected may be useful,
but it does come to us in a variety of shapes and sizes. A single column may actually turn into a few rows to make the analysis a bit easier to work with in the end, and one column can sometimes become two. However we reshape the data, remember that this movement is necessary in order to make our analysis and computation much easier than before.
Then we can go on to the process of cleaning. We are not able to take that
data and then just throw it at the model or the algorithm that we want to work
with. We do not want to allow all of those outliers and errors into the data
because they are likely to skew some of our data and ruin the results that we
are going to get. This is why we want to clean off the data.
There are a number of things we will spend our time cleaning in this step. We can get rid of some of the noise and the outliers, and we can take the null values and change them into something useful. Sometimes it is as simple as applying a standard format, replacing missing values, or handling the duplicates that show up in the data. The point of doing this, though, is to increase the quality of the data that you have, no matter what source you found it in.
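As an illustrative sketch of this cleaning step with pandas (the DataFrame, the duplicate row, and the outlier threshold are all invented for the example):

import pandas as pd

# Invented data with a duplicate row and one obvious outlier
df = pd.DataFrame({"amount": [10, 12, 11, 11, 9500],
                   "store":  ["A", "B", "C", "C", "A"]})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Drop rows whose 'amount' falls far outside the expected range (a simple outlier rule)
df = df[df["amount"] < 1000]

# Replace any remaining missing values with a standard placeholder
df = df.fillna("unknown")

print(df)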
Next on the list is the process of enriching the data. Here we are going to take
stock of the data that we are working with, and then we can strategize about
how some other additional data might be able to augment it out. This is going
to be a stage of questions to make sure that it works, so get ready to put on
your thinking cap.
Some of the questions that you may want to ask during this step could
include things like what new types of data can I derive from what I already
have? What other information would better inform my decision making about
this current data? This is the part where we will fill in some of the holes that
may have found their way into the data, and then find the supplementation
that is needed to make that data pop out.
From here we can move on to the step of validation. The validation rules that
we are going to work with this step in the data science process are going to be
repetitive programming sequences. The point of working with these is that
we want to check out and verify the consistency, quality, and security of our
data to make sure that it is going to do the work that we want.
There are a lot of examples that come with the validation stage. But this can
include something like ensuring the uniform distribution of attributes that
should be distributed in a normal way, such as birth dates. It can also be used
as a way to confirm the accuracy of fields through a check across the data.
And the last stage is going to be publishing. Analysts are going to be able to
prepare the wrangled data to use downstream, whether by a software or a
particular user. This one also needs us to go through and document any of the
special steps that were taken or the logic that we used to wrangle this data.
Those who have spent some time wrangling data understand that
implementation of the insights is going to rely upon the ease with which we
are able to get others the information, and how easy it is for these others to
access and utilize the data at hand.
Data wrangling is an important part of our process and ensures that we are
able to get the best results with any process that we undertake. We need to
remember that this is going to help us to get ahead with many of the aspects
of our data science project, and without the proper steps being taken we are
going to be disappointed in what we see as the result in the end. Make sure to
understand what data wrangling is all about, and why it is so important so
that it can be used to help with your data science project.
Chapter 6 - Inheritances to Clean Up the Code
The next topic that we will take a look at when writing Python codes is how
to work on inheritance codes. These codes are great because they will save
you a lot of time and will make your code look nicer because you can reuse
parts of your code without tiring yourself out by having to rewrite it so many
times. This is something that you can do with object-oriented programming,
or OOP, languages, a category which Python is a part of. You can work with
inheritances so you can use a parent code and then make some adjustments to
the parts of the code that you want and make the code unique. As a beginner,
you will find that these inheritances can be quite easy to work with because
you can get the code to work the way you want it to work without having to
write it out a million times over.
To help you keep things simple and to understand how inheritances work a
little better, an inheritance is when you will take a ‘parent’ code and copy it
down into a ‘child’ code. You will then be able to work on the child code and
make some adjustments without having to make any changes in the parent
part of the code. You can do this one time and stop there, or you can keep on
going down the line and change the child code at each level without making
any changes to the parent code.
Working with inheritances can be a fun part of making your own code, and
you can make it look so much nicer without all that mess. Let’s take a look at
what the inheritance code looks like and how it will work inside of your
code:

#Example of inheritance
#base class
class Student(object):
    def __init__(self, name, rollno):
        self.name = name
        self.rollno = rollno

#Graduate class inherits or is derived from Student class
class GraduateStudent(Student):
    def __init__(self, name, rollno, graduate):
        Student.__init__(self, name, rollno)
        self.graduate = graduate

    def DisplayGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.graduate)

#Post Graduate class inherits from Student class
class PostGraduate(Student):
    def __init__(self, name, rollno, postgrad):
        Student.__init__(self, name, rollno)
        self.postgrad = postgrad

    def DisplayPostGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.postgrad)

#instantiate from Graduate and PostGraduate classes
objGradStudent = GraduateStudent("Mainu", 1, "MS-Mathematics")
objPostGradStudent = PostGraduate("Shainu", 2, "MS-CS")
objGradStudent.DisplayGraduateStudent()
objPostGradStudent.DisplayPostGraduateStudent()
When you type this into your interpreter, you will get the following results:
Student Name: Mainu
Student Rollno: 1
Study Group: MS-Mathematics
Student Name: Shainu
Student Rollno: 2
Study Group: MS-CS

How To Override The Base Class


The next thing that we can work on when it comes to inheritance codes is
how to override a base class. There will be a lot of times that while you are
working on a derived class, you have to go in and override what you have
placed inside a base class. What this means is that you will take a look at
what was placed inside the base class and then make changes to alter some of
the behavior that was programmed inside of it. This helps to bring in new
behavior which will then be available inside the child class that you plan to
create from that base class.
This does sound a little bit complicated to work with, but it can really be
useful because you can choose and pick the parental features that you would
like to place inside the derived class, which ones you would like to keep
around, and which ones you no longer want to use. This whole process will
make it easier for you to make some changes to the new class and keep the
original parts from your base class that might help you out later. It is a simple
process that you can use to make some changes in the code, get rid of the
parts of the base class that are no longer working, and replace them with
something that will work better.
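Here is a minimal sketch of what overriding can look like, reusing the Student idea from above (the describe method is invented purely for this illustration):
# The base class provides a default version of describe
class Student(object):
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "Student: " + self.name

# The child class overrides describe, but can still reuse the parent version
class GraduateStudent(Student):
    def describe(self):
        return super().describe() + " (graduate)"

print(Student("Mainu").describe())           # Student: Mainu
print(GraduateStudent("Shainu").describe())  # Student: Shainu (graduate)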

Overloading
Another process that you may want to consider when you’re working with
inheritances is learning how to ‘overload.’ When you work on the process
known as overloading, you can take one of the identifiers that you are
working with and then use that to define at least two methods, if not more.
For the most part, there will only be two methods that are inside of each
class, but sometimes this number will be higher. The two methods should be
inside the exact same class, but they need to have different parameters so that
they can be kept separate in this process. You will find that it is a good idea
to use this method when you want the two matched methods to do the same
tasks, but you would like them to do that task while having different
parameters.
This is not something that is common to work with, and as a beginner, you
will have very little need to use this since many experts don’t actually use it
either. But it is still something that you may want to spend your time learning
about just in case you do need to use it inside of your code. There are some
extra modules available for you that you can download so you can make sure
that overloading will work for you.
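Python does not overload methods by their signature the way languages such as Java do, so treat the following as just one possible sketch: the standard library decorator functools.singledispatchmethod (available from Python 3.8 onward) lets a single method name behave differently depending on the type of its argument.
from functools import singledispatchmethod

class Greeter:
    # One identifier, several implementations chosen by the parameter type
    @singledispatchmethod
    def greet(self, arg):
        return "Hello, " + str(arg)

    @greet.register
    def _(self, arg: int):
        return "Hello, student number " + str(arg)

g = Greeter()
print(g.greet("Mainu"))   # Hello, Mainu
print(g.greet(2))         # Hello, student number 2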

Final Notes About Inheritances


As you are working on your codes, you will find that it is possible that you
could work on more than one inheritance code. If you are doing this, it means
that you can make a line of inheritances that are similar to each other, but you
can also make some changes to them as well if needed. You will notice that
multiple inheritances are not all that different from what you did with a
normal inheritance. Instead, you are just adding more steps and continuously
repeating yourself so you can make the changes that you want.
When you want to work with multiple inheritances, you have to take one
class and then give it two or more parent classes to get it started. This is
important once you are ready to write your own code, but you can also use
the inheritances to make sure the code looks nice as you write it out.
Now, as a beginner, you may be worried that working with these multiple
inheritances might be difficult because it sounds too complicated. When you
are working with these types of inheritances, you will create a new class,
which we will call Class3, and you will find that this class was created from
the features that were inside of Class2. Then you can go back a bit further
and will find that Class2 was created with the features that come from Class1
and so on and so forth. Each layer will contain features from the class that
was ahead of it, and you can really go down as far as you would like. You
can have ten of these classes if you would like, with features from the past
parent class in each one, as long as it works inside of your code.
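A quick sketch of that chain of classes, plus one class with two parents, might look like this (the class and method names are just placeholders):
# Each class in the chain inherits from the one before it
class Class1(object):
    def feature1(self):
        return "feature from Class1"

class Class2(Class1):
    def feature2(self):
        return "feature from Class2"

class Class3(Class2):
    def feature3(self):
        return "feature from Class3"

obj = Class3()
print(obj.feature1())   # inherited from Class1
print(obj.feature2())   # inherited from Class2
print(obj.feature3())   # defined in Class3

# True multiple inheritance lists two or more parents directly
class Mixin(object):
    def extra_feature(self):
        return "feature from Mixin"

class Combined(Class3, Mixin):
    pass

print(Combined().extra_feature())   # feature from Mixin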
One of the things that you should remember when you are creating new code
and considering adding in some multiple inheritances is that the
Python language will not allow you to create a circular inheritance. You can
add in as many parent classes as you want, but you are not allowed to go into
the code and make the parent class go in a circle, or the program will get mad
at you if you do so. Expanding out the example that we did above to make
another class or more is fine, but you must make sure that you are copying
the codes out properly before you even make changes so you can get this
program to work.
As you start to write out some more codes using the Python programming
language, you will find that working with different types of inheritances is
actually pretty popular.
There are many times when you can just stick with the same block of code in
the program and then make some changes without having to waste your time
and tire yourself out by rewriting the code over and over again.
Chapter 7 - Reading and writing data
In real-world applications, data comes in various formats. These are the most
common ones: CSV, Excel spreadsheets (xlsx / xls), HTML and SQL. While
Pandas can read SQL files, it is not necessarily the best for working with
SQL databases, since there are quite a few SQL engines: SQL lite,
PostgreSQL, MySQL, etc. Hence, we will only be considering CSV, Excel
and HTML.
Read

The pd.read_file_type('file_name') method is the default way to read files
into the Pandas framework. After import, pandas displays the content as a
data frame for manipulation using all the methods we have practiced so far,
and more.
CSV (comma separated variables) & Excel
Create a CSV file in Excel and save it in your Python directory. You can
check where your Python directory is in a Jupyter notebook by typing pwd (or
os.getcwd()). If you want to change to another directory containing your files
(e.g. Desktop), you can use the following code:
In []: import os
os.chdir('C:\\Users\\Username\\Desktop')
To import your CSV file, type: pd.read_csv('csv_file_name'). Pandas will
automatically detect the data stored in the file and display it as a data frame.
A better approach would be to assign the imported data to a variable like
this:
In []: Csv_data = pd.read_csv('example file.csv')
Csv_data # show
Running this cell will assign the data in ‘example file.csv’ to the variable
Csv_data, which is of the type data frame. Now it can be called later or used
for performing some of the data frame operations.
For excel files (.xlsx and .xls files), the same approach is taken. To read an
excel file named ‘class data.xlsx’, we use the following code:
In []: Xl_data = pd.read_excel('class data.xlsx')
Xl_data # show
This returns a data frame of the required values. You may notice that an
index starting from 0 is automatically assigned at the left side. This is similar
to declaring a data frame without explicitly including the index field. You can
add index names, like we did in previous examples.
Tip: in case the Excel spreadsheet has multiple sheets filled, you can specify
the sheet you need to be imported. Say we need only sheet 1: we use
sheet_name = 'Sheet 1'. For extra functionality, you may check the
documentation for read_excel() by using shift+tab.
Write

After working with our imported or pandas-built data frames, we can write
the resulting data frame back into various formats. We will, however, only
consider writing back to CSV and excel. To write a data frame to CSV, use
the following syntax:
In []: Csv_data.to_csv('file name', index = False)

This writes the data frame ‘Csv_data’ to a CSV file with the specified
filename in the python directory. If the file does not exist, it creates it.
For writing to an excel file, a similar syntax is used, but with sheet name
specified for the data frame being exported.
In []: Xl_data.to_excel('file name.xlsx', sheet_name = 'Sheet 1')
This writes the data frame Xl_data to sheet one of ‘file name.xlsx’.
HTML
Reading Html files through pandas requires a few libraries to be installed:
htmllib5, lxml, and BeautifulSoup4. Since we installed the latest Anaconda,
these libraries are likely to be included. Use conda list to verify, and conda
install to install any missing ones.
HTML tables can be directly read into pandas using the pd.read_html('sheet
url') method. The sheet url is a web link to the data set to be imported. As an
example, let us import the ‘Failed bank lists’ dataset from FDIC’s website
and call it w_data.
In []: w_data =
pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
w_data[0]
To display the result, here we used w_data[0]. This is because the table we
need is the first table element in the webpage source code. If you are familiar
with HTML, you can easily identify where each element lies. To inspect a
web page source code, use Chrome browser. On the web page >> right click
>> then, select ‘view page source’. Since what we are looking for is table-like
data, it will be specified like that in the source code. For example, here is
where the data set is created in the FDIC page source code:

FDIC page source via chrome


This section concludes our lessons on the Pandas framework. To test your
knowledge on all that has been introduced, ensure to attempt all the exercises
below.
For the exercise, we will be working on an example dataset. A salary
spreadsheet from Kaggle.com. Go ahead and download the spreadsheet from
this link: www.kaggle.com/kaggle/sf-salaries
Note: You might be required to register before downloading the file.
Download the file to your python directory and extract the file.

Exercises: We will be applying all we have learned here.


1. Import pandas as pd
2. Import the CSV file into Jupyter notebook, assign it to a
variable ‘Sal’, and display the first 5 values.

Hint: use the .head() method to display the first 5 values of a
data frame. Likewise, .tail() is used for displaying the last 5
results. To specify more values, pass ‘n=value’ into the
method.
3. What is the highest pay (including benefits)? Answer:
567595.43

Hint: Use data frame column indexing and .max() method.


4. According to the data, what is ‘MONICA FIELDS’s Job
title, and how much does she make plus benefits? Answer:
Deputy Chief of the Fire Department, and $ 261,366.14.

Hint: Data frame column selection and conditional selection
works (conditional selection can be found under Example 72.
Use column index == ‘string’ for the Boolean condition).
5. Finally, who earns the highest basic salary (minus benefits),
and by how much is their salary higher than the average
basic salary. Answer: NATHANIEL FORD earns the
highest. His salary is higher than the average by $
492827.1080282971.

Hint: Use the .max(), and .mean() methods for the pay gap.
Conditional selection with column indexing also works for
the employee name with the highest pay.
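If you get stuck, here is a sketch of the kind of code the hints for questions 2 and 3 are pointing at. The file and column names ('Salaries.csv' and 'TotalPayBenefits') are how they appeared in our copy of the Kaggle download, so check your own file before running this.
In []: import pandas as pd
Sal = pd.read_csv('Salaries.csv') # question 2: load the data
Sal.head() # display the first 5 rows
Sal['TotalPayBenefits'].max() # question 3: highest pay including benefits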
Chapter 8 - The Different Types
of Data We Can Work With
There are two main types of data, structured and unstructured, and the types
of algorithms and models that we can run on them will depend on what kind
of data we are working with. Both can be valuable, but it often depends on
what we are trying to learn, and which one will serve us the best for the topic
at hand. With that in mind, let’s dive into some of the differences between
structured and unstructured data and why each can be so important to our
data analysis.

Structured Data
The first type of data that we will explore is known as structured data. This is
often the kind that will be considered traditional data. This means that we will
see it consisting mainly of lots of text files that are organized and have a lot
of useful information. We can quickly glance through this information and
see what kind of data is there, without having to look up more information,
labeling it, or looking through videos to find what we want.
Structured data is going to be the kind that we can store inside one of the
options for warehouses of data, and we can then pull it up any time that we
want for analysis. Before the era of big data, and some of the emerging
sources of data that we are using on a regular basis now, structured data was
the only option that most companies would use to make their business
decisions.
Many companies still love to work with this structured data. The data is very
organized and easy to read through, and it is easier to digest. This ensures
that our analysis is going to be easier to go through with legacy solutions to
data mining. To make this more specific, this structured data is going to be
made up largely of some of the customer data that is the most basic and could
provide us with some information including the contact information, address,
names, geographical locations and more of the customers.
In addition to all of this, a business may decide to collect some transactional
data and this would be a source of structured data as well. Some of the
transactional data that the company could choose to work with would include
financial information, but we must make sure that when this is used, it is
stored in the appropriate manner so it meets the standards of compliance for
the industry.
There are several methods we can use in order to manage this structured data.
For the most part, though, this type of data is going to be managed with
legacy solutions of analytics because it is already well organized and we do
not need to go through and make adjustments and changes to the data at all.
This can save a lot of time and hassle in the process and ensures that we are
going to get the data that we want to work the way that we want.
Of course, even with some of the rapid rise that we see with new sources of
data, companies are still going to work at dipping into the stores of structured
data that they have. This helps them to produce higher quality insights, ones
that are easier to gather and will not be as hard to look through the model for
insights either. These insights are going to help the company learn some of
the new ways that they can run their business.
While companies that are driven by data all over the world have been able to
analyze this structured data for a long period of time, over many decades,
they are just now starting to really take some of the new and emerging
sources of data as seriously as they should. The good news with this one
though is that it is creating a lot of new opportunities in their company, and
helping them to gain some of the momentum and success that they want.
Even with all of the benefits that come with structured data, this is often not
the only source of data that companies are going to rely on. First off, finding
this kind of data can take a lot of time and can be a waste if you need to get
the results in a quick and efficient manner. Collecting structured data is
something that takes some time, simply because it is so structured and
organized.
Another issue that we need to watch out for when it comes to structured data
is that it can be more expensive. It takes someone a lot of time to sort through
and organize all of that data. And while it may make the model that we are
working on more efficient than other forms, it can often be expensive to work
with this kind of data. Companies need to balance their cost and benefit ratio
here and determine if they want to use any structured data at all, and if they
do, how much of this structured data they are going to add to their model.

Unstructured Data
The next option of data that we can look at is known as unstructured data.
This kind of data is a bit different than what we talked about before, but it is
really starting to grow in influence as companies are trying to find ways to
leverage the new and emerging data sources. Some companies choose to
work with just unstructured data on their own, and others choose to do some
mixture of unstructured data and structured data. This provides them with
some of the benefits of both and can really help them to get the answers they
need to provide good customer service and other benefits to their business.
There are many sources where we are able to get these sources of data, but
mainly they come from streaming data. This streaming data comes in from
mobile applications, social media platforms, location services, and the
Internet of Things. Since the diversity that is there among unstructured
sources of data is so prevalent, and it is likely that those businesses who
choose to use unstructured data will rely on many different sources,
businesses may find that it is harder to manage this data than it was with
structured data.
Because of this trouble with managing the unstructured data, there are many
times when a company will be challenged by this data, in ways that they
weren’t in the past. And many times, they have to add in some creativity in
order to handle the data and to make sure they are pulling out the relevant
data, from all of those sources, for their analytics.
The growth and the maturation of things known as data lakes, and even the
platform known as Hadoop, are going to be a direct result of the expanding
collection of unstructured data. The traditional environments that were used
with structured data are not going to cut it at this point, and they are not going
to be a match when it comes to the unstructured data that most companies
want to collect right now and analyze.
Because it is hard to handle the new sources and types of data, we can’t use
the same tools and techniques that we did in the past. Companies who want to
work with unstructured data have to pour additional resources into various
programs and human talent in order to handle the data and actually collect
relevant insights and data from it.
The lack of any structure that is easily defined inside of this type of data can
sometimes turn businesses away from this kind of data in the first place. But
there really is a lot of potentials that are hidden in that data. We just need to
learn the right methods to use to pull that data out. The unstructured data is
certainly going to keep the data scientist busy overall because they can’t just
take the data and record it in a data table or a spreadsheet. But with the right
tools and a specialized set of skills to work with, those who are trying to use
this unstructured data to find the right insights, and are willing to make some
investments in time and money, will find that it can be so worth it in the end.
Both of these types of data, the structured and the unstructured, are going to
be so important when it comes to the success you see with your business.
Sometimes our project just needs one or the other of these data types, and
other times it needs a combination of both of them.
For a company to reach success though, they need to be able to analyze, in a
proper and effective manner, all of their data, regardless of the type of the
source. Given the experience that the enterprise has with data, it is not a big
surprise that all of this buzz already surrounds data that comes from sources
that may be seen as unstructured. And as new technologies begin to surface
that can help enterprises of all sizes analyze their data in one place it is more
important than ever for us to learn what this kind of data is all about, and how
to combine it with some of the more traditional forms of data, including
structured data.
Why Python for Data Analysis?
The next thing that we need to spend some of our time on in this guidebook is
the Python language. There are a lot of options that you can choose when
working on your own data analysis, and bringing out all of these tools can
really make a big difference in how much information you are able to get out
of your analysis. But if you want to pick a programming language that is easy
to learn, has a lot of power, and can handle pretty much all of the tasks that
you need to handle with data analysis and machine learning, then Python is
the choice for you. Let’s dive into the Python language a little bit and see
how this language can be used to help us see some great results with our data
analysis.

The Basics of the Python Language


To help us understand a bit more about how the Python language is able to
help us out while handling a data analysis, we first need to take a look at what
the Python language is all about. The Python language is an object-oriented
programming language or OOP language, that is designed with the user in
mind, while still providing us with the power that we need, and the
extensions and libraries, that will make data analysis and machine learning as
easy to work with as possible.
There are a lot of benefits that come with the Python coding language, and
this is one of the reasons why so many people like to learn how to code with
this language compared to one of the other options. First, this coding
language was designed with the beginner in mind. There are a lot of coding
languages that are hard to learn, and only more advanced programmers, those
who have spent years in this kind of field, can learn how to use them.
This is not the case when we talk about the Python language. This one has
been designed to work well for beginners. Even if you have never done any
coding in Python before you will find that this language is easy to catch on
to, and you will be able to write some complex codes, even ones with enough
power to handle machine learning and data science, in no time at all.
Even though the Python language is an easy one to learn how to use, there is
still a lot of power that comes with this language as well. This language is
designed to take on some of those harder projects, the ones that may need a
little extra power behind them. For example, there are a lot of extensions that
come with Python that can make it work with machine learning, a process
where we teach a model or a computer how to make decisions on its own.
Due to the many benefits that come with the Python coding language, there
are a lot of people who are interested in learning more about it, and how to
make it work for their needs. This results in many large communities, all
throughout the world, of people sharing their ideas, asking for help, and
offering any advice you may need. If you are a beginner who is just getting
started with doing data analysis or any kind of Python programming at all,
then this large community is going to be one of the best resources for you to
use. It will help you to really get all of your questions answered and ensures
you are going to be able to finish your project, even if you get stuck on it for
a bit.
This coding language also combines well with some of the other coding
languages out there. While Python can do a lot of work on its own, when you
combine it with some of the other libraries that are out there, sometimes it
needs to be compatible with other languages as well. This is not a problem at
all when it comes to Python, and you can add on any extension, and still write
out the code in Python, knowing that it will be completely compatible with
the library in question.
There are also a lot of different libraries that you can work with when it
comes to the Python language. While we can see a lot of strong coding done
with the traditional library of Python, sometimes adding some more
functionality and capabilities can be the trick that is needed to get results. For
example, there are a number of deep learning and machine learning libraries
that connect with Python and can help this coding language really take on
some of the data science and data analysis projects that you want to use.
Python is also seen as an object-oriented programming language or an OOP
language. This means that it is going to rely on classes and objects to help
organize the information and keep things in line. The objects that we use,
which are going to be based on real objects that we can find in our real world,
are going to be placed in a class to pull out later when they are needed in the
code. This is much easier to work with than we see with the traditional
coding languages of the past, and ensures that all of the different parts of your
code are going to stay exactly where you would like them to be.
As we can see here, there is so much that the Python coding language is
going to be able to do to help us with our data analysis. There are a lot of
different features and capabilities that come with Python, and this makes it
perfect for almost any action or project that we want to handle. When we
combine it together with some of the different libraries that are available, we
can get some of these more complicated tasks done.

How Can Python Help with a Data Analysis?


Now that we have had some time to discuss some of the benefits that come
with the Python language and some of the parts that make up this coding
language, it is now time for us to learn a few of the reasons why Python is the
coding language to help out with all of the complexities and programs that we
want to do with data science.
Looking back, we can see that Python has actually been pretty famous with
data scientists for a long time. Although this language was not really built to
just specifically help out with data science, it is a language that has been
accepted readily and implemented by data scientists for much of the work
that they try to accomplish. Of course, we can imagine some of the obvious
reasons why Python is one of the most famous programming languages, and
why it works so well with data science, but some of the best benefits of using
Python to help out with your data science model or project include:
Python is as simple as it gets. One of the best parts about learning how to
work with the Python coding language is that even as someone who is
completely new to programming and who has never done any work in this in
the past, you can grasp the basics of it pretty quickly. This language, in
particular, had two main ideas in mind when it was first started and these
include readability and simplicity.
These features are actually something that is pretty unique when we talk
about coding languages, and they are often only going to apply to a coding
language that is object-oriented, and one that has a tremendous amount of
potential for problem-solving.
What all of this means is that, if you are a beginner to working with data
science and with working on the Python language, then adding these two
together could be the key that you need to get started. They are both going to
seem like simple processes when they work together, and yet you are able to
get a ton done in a short amount of time. Even if you are more experienced
with coding, you will find that Python data science is going to add a lot of
depth to your resume, and can help you get those projects done.
The next benefit is that Python is fast and attractive. Apart from being as
simple as possible, the code that we can write with Python is going to be
leaner and much better looking than others. For example, the Python code
takes up one-third of the volume that we see with code in Java, and one-fifth
of the volume of code in C++, just to do the exact same task.
The use of the common expressions in code writing, rather than going with
variable declarations and empty space in place of ugly brackets can also help
the code in Python to look better. But in addition to having the code look
more attractive, it can help take some of the tediousness that comes in when
learning a new coding language. This coding language can save a lot of time
and is going to tax the brain of the data scientist a lot less, making working
on some of the more complex tasks, like those of data analysis, much easier
to handle overall.
Another benefit here is that the data formats are not going to be as worrisome
with Python. Python is able to work with any kind of data format that you
would like. It is possible for us to directly import SQL tables in the code
without having to convert to a specific format or worry that our chosen
format is not going to work. In addition, we can work with the Comma
Separated Value documents and web-sourced JSON. The Python requests
library can make it really easy to import data from a lot of websites and
build up sets of data to help with your analysis.
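As a quick illustration of this point (the URL below is only a placeholder, not a real endpoint), pulling JSON records from the web into a data frame can be as short as this:
import pandas as pd
import requests

# Placeholder URL: substitute an API endpoint that returns JSON records
response = requests.get('https://example.com/api/records')
records = response.json()          # a list of dictionaries
web_data = pd.DataFrame(records)   # build a data frame from the JSON
web_data.head()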
The Python data analysis library known as Pandas is one of the best for
helping us to handle all of the parts of not only our data analysis but also for
the whole process of data science. Pandas is able to grab onto a lot of data,
without having to worry about lagging and other issues in the process. This is
great news for the data scientist because it helps them to filter, sort, and
display their data in a quick manner.
Next on the list is that Python skills are quickly growing in demand.
While the demand for professionals in the world of IT has seen a decline
recently, at least compared to what it was in the past, the demand for
programmers who can work with Python is steadily on the rise. This is good
news for those who still want to work in this field, and are looking for their
niche or their way to stand out when getting a new job.
Since Python has so many great benefits and has been able to prove itself as a
great language for many things, including programs for data analytics and
machine learning algorithms, many companies who are centered around data
are going to be interested in those with Python skills. If you already have a strong
grasp of Python, you can really ride the market that is out there right now.
And finally, we come back to the idea of the vibrant community that is
available with the Python language. There are times when you will work on a
project, and things are just not working the way that you had thought they
would, or the way you had planned. Getting frustrated is one option, but it is
not really going to help you to find the solution.
The good news with this is that you will be able to use the vibrant
community, and all of the programmers who are in this community, to
provide you with a helping hand when you get stuck. The community that is
around Python has grown so big and it includes members who are passionate
and very active in these communities. For the newer programmer, this means
that there is always an ample amount of material that is flowing on various
websites, and one of these may have the solution that you are looking for
when training your data.
Once you are able to get ahold of the data that you want to use, and all of the
libraries that you want to work with as well, you can really lean on this community
to see some of the results that you want. Never get stuck and just give up on a
project or an idea that you have with your code, when you have that
community of programmers and more, often many of whom have a lot of
experience with Python, who will be able to help answer your questions and
get that problem solved.
As we work through this guidebook, and you do more work with Python and
data analysis, you will find that there are a lot of libraries that are compatible
with Python that can help to get the work done. These are all going to handle
different algorithms and different tasks that you want to get done during your
data analysis, so the ones that you will want to bring out may vary. There are
many great choices that you can make, including TensorFlow, Pandas,
NumPy, SciPy, and Scikit-Learn to name a few.
Sometimes these libraries work all on their own, and sometimes they need to
be combined with another library so that they can draw features and
functionalities from each other. When we are able to choose the right library
to work with, and we learn how to make them into the model that we need,
our data analysis is going to become more efficient overall.
While there may be other programming languages out there that are able to
handle the work of data analysis, and that may be able to help us create the
models that we need to see accurate insights and predictions based on the
data, none of them are going to work as well as the Python library. Taking
the time to explore this library and seeing what it can do for your data
analysis process can really be a winner when it comes to your business and
using data science to succeed.
Chapter 9 - The Importance of Data Visualization
The next topic that we need to spend some time looking through is the idea of
data visualization.
This is a unique part of our data science journey, and it is so important that
we spend some time and effort looking through it and understanding what
this process is all about. Data visualization is so important when it comes to
our data analysis. It can take all of the complex relationships that we have
been focusing on in our analysis and puts them in a graph format, or at least
in another visual format that is easier to look through.
Sometimes, looking at all of those numbers and figures can be boring and
really hard to concentrate on. It can take a long time for us to figure out what
relationships are present, and which ones are something that we should
ignore. But when we put the information into some kind of graph form, such
as a graph, a chart, or something similar, then we will be able to easily see
some of the complex relationships that show up, and the information will
make more sense overall.
Many of those who are in charge of making decisions based on that data and
on the analysis that you have worked on will appreciate having a graph or
another tool in place to help them out.
Having the rest of the information in place as well can make a difference and
can back up what you are saying, but having that graph in place is one of the
best ways to ensure that they are able to understand the data and the insights
that you have found.
To make it simple, data visualization is going to be the presentation of data
that shows up in a graphical or a pictorial format of some kind or another. It
is going to enable those who make the big decisions for a business to see the
analytics presented in a more visual manner so that they can really grasp
some of the more difficult concepts or find some new patterns out of the data
that they would never have known in any other manner.
There are a lot of different options that you are able to work with when it
comes to data visualization, and having it organized and ready to go the way
that you like, using the right tools along the way, can make life easier. With
an interactive type of visual, for example, you will be able to take this
concept a bit further and use technology to drill down the information, while
interactively changing what data is shown and how it is processed for you.

A Look at the History


As we can imagine, the process of visualization, and using pictures to help us
understand the data in front of us is something that has been around for a long
time. Whether we look at the pictures that show up in our books or even
maps and graphs that were found in the 17th century and before, we have
been using images and more to help us make sense of the world around us
and all of the data that we have to sort through can really be understood with
some of these visuals as well.
However, it is really a big boost in technology that has helped to make data
visualization something that is as popular as it is today. For example,
computers are really making it possible for us to process a large amount of
data, and we are able to do this at faster speeds than ever before. Today, the
data visualization and all that comes with it is an industry or a field that is
rapidly evolving. Add to it that this is now something that needs a nice blend
of science and art and that it can go a long way in helping us to work with our
own data analysis, and it is no wonder that these visuals are as popular as
they seem.

Why Is Data Visualization So Important?


The next thing that we need to take a look at here is why data visualization is
so important to us. The reason that data visualization is something that we
want to spend our time and energy on is because of the way that someone is
able to process information. It is hard to gather all of the important insights
and more on a process when we have to just read it off a table or a piece of
paper. Sure the information is all right there, but sometimes it is still hard to
form the conclusions and actually see what we are doing when it is just in
text format for us.
For most people, being able to look at a chart or a graph or some other kind
of visual can make things a little bit easier.
Because of the way that our brains work and process the information that we
see, using graphs and charts to visualize a large amount of complex data is
going to be so much easier compared to pouring over some reports or
spreadsheets.
When we work with data visualization, we will find that it is a quick and easy
way to convey a lot of hard and challenging concepts, usually in a manner
that is more universal. And we are able to experiment with the different
scenarios by using an interactive visual that can make some slight
adjustments when we need it the most.
This is just the beginning of what data visualization is able to do for us
though, and it is likely that we will find more and more uses for this as time
goes on. Some of the other ways that data visualization will be able to help us
out will include:

1. Identify the areas that will need the most attention when it comes to improvements.
2. Help us to figure out which of our products we should place where.
3. It can clarify which factors are the most likely to influence the behavior of a customer.
4. It can make it easier to make predictions about our sales volumes, whether these volumes are going to be higher or lower at a specific time period.

The process of data visualization is going to help us change up the way that
we can work with the data that we are using. Data analysis is supposed to
respond to any issues that are found in the company in a faster manner than
ever before.
And they need to be able to dig through and find more insights as well, look
at data in a different manner, and learn how to be more imaginative and
creative in the process. This is exactly something that data visualization is
able to help us out with.

How Can We Use Data Visualization?


The next thing that we need to take some time to look at is how companies
throughout many industries are able to use data visualization for their own
needs. No matter the size of the company or what kind of industry they are in,
it is possible to use some of the basics of data visualization in order to help
make more sense of the data at hand. And there are a variety of ways that this
data visualization will be able to help you succeed.
The first benefit that we can look at is the fact that these visuals are going to
be a great way for us to comprehend the information that we see in a faster
fashion. When we are able to use a graphical representation of all that data on
our business, rather than reading through charts and spreadsheets, we will be
able to see these large amounts of data in a clear and cohesive manner.
It is much easier to go through all of that information and see what is found
inside, rather than having to try and guess and draw the conclusions on our
own.
And since it is often much faster for us to analyze this kind of information in
a graphical format, rather than analyzing it on a spreadsheet, it becomes
easier for us to understand what is there. When we are able to do it in this
manner, it is so much easier for a business to address problems or answer
some of their big questions in a timely manner so that things are fixed
without issue or without having to worry about more damage.
The second benefit that comes with using data visuals to help out with the
process of data science is that they can really make it easy to pinpoint some
of the emerging trends that we need to focus on. This information is within
the data, and we are going to be able to find them even if we just read
through the spreadsheets and the documents.
But this takes a lot of time, can be boring, and often it is hard for us to really
see these correlations and relationships, and we may miss out on some of the
more important information that we need.
Using the idea of these visuals to handle the data, and to discover trends,
whether this is the trends just in the individual business or in the market as a
whole, can really help to ensure that your business gains some big advantages
over others in your competition base. And of course, any time that you are
able to beat out the competition, it is going to positively affect your bottom
line.
When you use the right visual to help you get the work done, it is much easier
to spot some of the outliers that are present, the ones that are more likely to
affect the quality of your product, the customer churn, or other factors that
will change your business. In addition, it is going to help you to address
issues before they are able to turn into much bigger problems that you have to
work with.
Next on the list is that these visuals are going to be able to help you identify
some relationships and patterns that are found in all of that data that you are
using. Even with extensive amounts of data that is complicated, we can find
that the information starts to make more sense when it is presented in a
graphic format, rather than in just a spreadsheet or another format.
With the visuals, it becomes so much easier for a business to recognize some
of the different parameters that are there and how these are highly correlated
with one another. Some of the correlations that we are able to see within our
data are going to be pretty obvious, but there are others that won’t be as
obvious. When we use these visuals to help us find and know about these
relationships, it is going to make it much easier for our business to really
focus on the areas that are the most likely to influence some of our most
important goals.
We may also find that working with these visuals can help us to find some of
the outliers in the information that is there. Sometimes these outliers mean
nothing. If you are looking at the charts and graphs and find just a few
random outliers that don’t seem to connect with each other, it is best to cut
these out of the system and not worry about them.
But there are times when these outliers are going to be important and we
should pay more attention to them.
If you are looking at some of the visuals that you have and you notice that
there are a substantial amount of them that fall in the same area, then you will
need to pay closer attention. This could be an area that you can focus your
attention on to reach more customers, a problem that could grow into a major
challenge if you are not careful, or something else that you need to pay some
attention to.
These visuals can also help us to learn more about our customers. We can use
them to figure out where our customers are, what kinds of products our
customers would be the happiest with, how we can provide better services to
our customers, and more. Many companies decide to work with data
visualization to help them learn more about their customers and to ensure that
they can really stand out from the crowd with the work they do.
And finally, we need to take a look at how these visuals are a great way to
communicate a story to someone else. Once your business has had the time to
uncover some new insights from visual analytics, the next step here is to
communicate some of those insights to others. It isn’t going to do you much
good to come up with all of those insights, and then not actually show them
to the people responsible for key decisions in the business.
Now, we could just hand these individuals, the ones who make some of the
big decisions, the spreadsheets and some of the reports that we have. And
they will probably be able to learn a lot of information from that. But this is
not always the best way to do things.
Instead, we need to make sure that we set things up with the help of a visual,
ensuring that these individuals who make the big decisions can look it over
and see some of the key relationships and information at a glance.
Using graphs, charts, and some of the other visuals that are really impactful
as a representation of our data is going to be so important in this step because
it is going to be engaging and can help us to get our message across to others
in a faster manner than before.
As we can see, there are a lot of benefits that come in when we talk about
data visualizations and all of the things that we are able to do with them.
Being able to figure out the best kind of visualization that works for your
needs, and ensuring that you can actually turn that data into a graph or chart
or another visualization is going to be so important when it is time to work
with your data analysis.
We can certainly do the analysis without data visualization. But when it
comes to showcasing the findings in an attractive and easy to understand
format, nothing is going to be better than data visualization.

How to Lay the Groundwork


Before we try to implement in a brand new technology of any sort, there are
going to be a few types of steps that we need to take and go through to see
the results.
Not only is it important for us to have a nice solid grasp on the data that we
want to use which is something that should happen during the data analysis
part of the process, we also need to understand three other important things
including the needs of the company, the goals of the company, and the
audience of your company.
Some of the things that have to happen before you can prepare and organize
all of the data that you have and complete this kind of data visualization will
include:

1. Understand the data that we need to visualize in the first place. This means that we need to know how much cardinality is present in the data, meaning how much uniqueness is going to show up in the columns, and we need to know the size of the data. Some of the algorithms that we will use to work on the data visualization are not going to do as well when it comes to very large sets of data.
2. Determine what you would like to visualize, and the kind of information that you want to be able to communicate with this. This will make it a bit easier to figure out which type of visual you want to go with in this process.
3. Know the audience that you are working with and understand how they are going to process the visual information that you want to show off. Management may need a different visual than a team. Those in manufacturing may need a different visual than someone in a more creative role. Being able to make the visual fit the audience so that they can actually utilize the information is going to be a critical step.
4. Use a visual that is able to convey the information in the form that is not only the best but also the simplest for your audience. There are a lot of cool visuals out there that you can work with, and they can offer a lot of different ways to show your data. But if the visual is too difficult to understand, it is not going to make anyone happy. Put away some of the neat gadgets and find the best way to showcase the information that makes the most sense to your audience.

Once you have been able to go through and answer all of the initial questions
that we had about the data type that we would like to work with, and you
know what kind of audience is going to be there to consume the information,
it is time for us to make some preparations for the amount of data that we
plan to work with in this process.
Keep in mind here that big data is great for many businesses and is often
necessary to make data science work. But it is also going to bring in a few
new challenges to the visualization that we are doing. Large volumes, varying
velocities, and different varieties are all going to be taken into account with
this one.
Plus, data is often going to be generated at a rate that is much faster than it
can be managed and analyzed so we have to figure out the best way to deal
with this problem.
There are factors that we need to consider in this process as well, including
the cardinality of the columns that we want to be able to work with.
We have to be aware of whether there is a high level of cardinality in the
process or a low level. If we are dealing with high cardinality, this is a sign
that we are going to have a lot of unique values in our data. A good example
of this would include bank account numbers since each individual would
have a unique account number.
Then it is possible that your data is going to have a low cardinality. This
means that the column of data that you are working with will come with a
large percentage of repeat values. This is something that we may notice when
it comes to the gender column on our system. The algorithm is going to
handle the amount of cardinality, whether it is high or low, in a different
manner, so we always have to take this into some consideration when we do
our work.
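One quick way to check this in pandas before choosing a visual, with invented column names, is to count the unique values in each column as a rough measure of cardinality:
import pandas as pd

# High cardinality: every account number is unique.
# Low cardinality: the gender column repeats a few values.
df = pd.DataFrame({
    "account_number": [1001, 1002, 1003, 1004],
    "gender": ["F", "M", "F", "F"],
})
print(df.nunique())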

Different Types of Data Visualization Available


As you go through and start to work on adding some visualizations to your
own project, you will quickly notice that there are a lot of choices that you
are able to make. And all of them can work well depending on the kind of
data that you are working with, and the way that you would like to present it.
Sometimes the question that you are asking out of the data will help to
determine which type of visualization is going to be the best for your own
needs.
There are options like bar graphs, line graphs, histograms, pie charts, and
more that can all show the information. If you are trying to separate
information into groups and see where your customers lie, or which decision
is the best for you, something like a scatterplot could be the best option to
work with.
There are a lot of options when it comes to working with visuals, and we
have to just figure out which one is the best for our needs.
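As a small sketch using matplotlib and some randomly generated data, here is how two of these common options, a histogram and a scatterplot, can be produced side by side:
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only
values = np.random.randn(500)            # one variable for the histogram
x = np.random.rand(100)
y = 2 * x + np.random.randn(100) * 0.1   # two related variables for the scatterplot

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)
ax1.set_title("Histogram: distribution of one variable")
ax2.scatter(x, y)
ax2.set_title("Scatterplot: relationship between two variables")
plt.show()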
When you are first exploring some of the new data that you have collected,
for example, you may find that something like an auto-chart is the best option
for your needs. This is because they can give you kind of a quick view into a
large amount of data, in a way that other options just are not able to do. It
may not be the final step that you take, but it is going to make a difference in
how well you are able to understand the data in the beginning, and can lead
you on the right path to picking out models and algorithms to work with later
on.
This kind of data exploration capability is going to be helpful, even to those
who are more experienced in machine learning, data science, and statistics as
they seek to speed up the process of analytics because it is going to eliminate
some of the repeated sampling that has to happen for each of the models that
you are working on overall.
None of the visual options are necessarily going to be bad ones; we just have
to learn which one is the best option for the data we have, and for the uses
that we want to do with the data. Each set of data is going to lend itself well
to one type of visual or another, and having a good understanding of what
you are expecting out of this data, and what your data contains in the end, can
help us to figure out which visual we are most interested in.
Data visualization is definitely a part of data science that we do not want to
forget about.
Being able to make this work for our needs, and understanding some of the
process that comes with it, as well as why we actually need to work with a
visual overall, can be important. Make sure to figure out which visual is
going to be the best for your needs to ensure that you will get the best way to
understand the complex relationships in your data in no time.
Chapter 10 - Indexing and selecting arrays
Array indexing is very much similar to List indexing with the same
techniques of item selection and slicing (using square brackets). The methods
are even more similar when the array is a vector.
Example 1:
In []: # Indexing a vector array (values)
values
values[0] # grabbing 1st item
values[-1] # grabbing last item
values[1:3] # grabbing 2nd & 3rd item
values[3:8] # item 4 to 8
Out[]: array([ 1.33534821, 1.73863505, 0.1982571 , -0.47513784,
1.80118596, -1.73710743, -0.24994721, 1.41695744,
-0.28384007, 0.58446065])
Out[]: 1.3353482110285562
Out[]: 0.5844606470172699
Out[]: array([1.73863505, 0.1982571 ])
Out[]: array([-0.47513784, 1.80118596, -1.73710743, -0.24994721,
1.41695744])
The main difference between arrays and lists is in the broadcast property of
arrays. When a slice of a list is assigned to another variable, any changes on
that new variable does not affect the original list. This is seen in the example
below:
In []: num_list = list(range(11)) # list from 0-10
num_list # display list
list_slice = num_list[:4] # first 4 items
list_slice # display slice
list_slice[:] = [5,7,9,3] # Re-assigning elements
list_slice # display updated values

# checking for changes


print(' The list changed !') if list_slice == num_list[:4]\
else print(' no changes in original list')
Out[]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Out[]: [0, 1, 2, 3]
Out[]: [5, 7, 9, 3]
no changes in original list
For arrays, however, a change in the slice of an array also updates or
broadcasts to the original array, thereby changing its values.
In []: # Checking the broadcast feature of arrays
num_array = np.arange(11) # array from 0-10
num_array # display array
array_slice = num_array[:4] # first 4 items
array_slice # display slice
array_slice[:] = [5,7,9,3] # Re-assigning elements
array_slice # display updated values
num_array
Out[]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Out[]: array([0, 1, 2, 3])
Out[]: array([5, 7, 9, 3])
Out[]: array([ 5, 7, 9, 3, 4, 5, 6, 7, 8, 9, 10])
This happens because Python tries to save memory allocation by allowing
slices of an array to be like shortcuts or links to the actual array. This way it
doesn’t have to allocate a separate memory location to it. This is especially
ingenious in the case of large arrays whose slices can also take up significant
memory. However, to take a slice of an array without this broadcast behavior,
you can create a ‘slice of a copy’ of the array. The array.copy() method is
called to create a copy of the original array.
In []:# Here is an array allocation without broadcast
num_array # Array from the last example
# copies the first 4 items of the array copy
array_slice = num_array.copy()[:4]
array_slice # display array
array_slice[:] = 10 # re-assign array
array_slice # display updated values
num_array # checking original list
Out[]:array([ 5, 7, 9, 3, 4, 5, 6, 7, 8, 9, 10])
Out[]:array([5, 7, 9, 3])
Out[]:array([10, 10, 10, 10])
Out[]:array([ 5, 7, 9, 3, 4, 5, 6, 7, 8, 9, 10])
Notice that the original array remains unchanged.
For two-dimensional arrays or matrices, the same indexing and slicing
methods work. However, it is always easy to consider the first dimension as
the rows and the other as the columns. To select any item or slice of items,
the index of the rows and columns are specified. Let us illustrate this with a
few examples:
Example 2: Grabbing elements from a matrix
There are two methods for grabbing elements from a matrix:
array_name[row][col] or array_name[row,col].
In []: # Creating the matrix
matrix = np.array(([5,10,15],[20,25,30],[35,40,45]))
matrix #display matrix
matrix[1] # Grabbing second row
matrix[2][0] # Grabbing 35
matrix[0:2] # Grabbing first 2 rows
matrix[2,2] # Grabbing 45
Out[]: array([[ 5, 10, 15],
[20, 25, 30],
[35, 40, 45]])

Out[]: array([20, 25, 30])


Out[]: 35
Out[]: array([[ 5, 10, 15],
[20, 25, 30]])
Out[]: 45
Tip: It is recommended to use the array_name[row,col] method, as it
saves typing and is more compact. This will be the convention for the
rest of this section.
To grab columns, we specify a slice of the row and column. Let us try to grab
the second column in the matrix and assign it to a variable column_slice.
In []: # Grabbing the second column
column_slice = matrix[:,1:2] # Assigning to variable
column_slice
Out[]: array([[10],
       [25],
       [40]])
Let us consider what happened here. To grab the column slice, we first
specify the row before the comma. Since our column contains elements in all
rows, we need all the rows to be included in our selection, hence the ‘:’ sign
for all. Alternatively, we could use ‘0:’, which might be easier to understand.
After selecting the row, we then choose the column by specifying a slice
from ‘1:2’, which tells Python to grab from the second item up to (but not
including) the third item. Remember, Python indexing starts from zero.

Exercise: Try to create a larger array, and use these indexing techniques to
grab certain elements from the array. For example, here is a larger array:
In []: # 5x10 array of even numbers between 0 and 100.
large_array = np.arange(0,100,2).reshape(5,10)
large_array # show
Out[]: array([[ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18],
[20, 22, 24, 26, 28, 30, 32, 34, 36, 38],
[40, 42, 44, 46, 48, 50, 52, 54, 56, 58],
[60, 62, 64, 66, 68, 70, 72, 74, 76, 78],
[80, 82, 84, 86, 88, 90, 92, 94, 96, 98]])
Tip: Try grabbing single elements and rows from random arrays you
create. After getting very familiar with this, try selecting columns. The
point is to try as many combinations as possible to get you familiar
with the approach. If the slicing and indexing notations are confusing,
try to revisit the section under list or string slicing and indexing.
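For instance, here is a quick sketch (not part of the original exercise) of a few such selections on the large_array created above, assuming numpy is still imported as np:
In []: # Practicing selections on large_array
large_array[2,4] # single element: row 3, column 5
large_array[1] # entire second row
large_array[1:3,5:8] # a 2x3 block from rows 2-3, columns 6-8
large_array[:,0] # entire first column
Out[]: 48
Out[]: array([20, 22, 24, 26, 28, 30, 32, 34, 36, 38])
Out[]: array([[30, 32, 34],
       [50, 52, 54]])
Out[]: array([ 0, 20, 40, 60, 80])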
Conditional selection
Consider a case where we need to extract certain values from an array that
meet a Boolean criterion. NumPy offers a convenient way of doing this
without having to use loops.
Example 3: Using conditional selection
Consider this array of odd numbers between 0 and 20. Assuming we need to
grab elements above 11. We first have to create the conditional array that
selects this:
In []: odd_array = np.arange(1,20,2) # Vector of odd numbers
odd_array # Show vector
bool_array = odd_array > 11 # Boolean conditional array
bool_array

Out[]: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])


Out[]: array([False, False, False, False, False, False, True, True, True, True])
Notice how the bool_array evaluates to True at all instances where the
elements of the odd_array meet the Boolean criterion.
The Boolean array itself is not usually so useful. To return the values that we
need, we will pass the Boolean_array into the original array to get our
results.
In []: useful_Array = odd_array[bool_array] # The values we want
useful_Array
Out[]: array([13, 15, 17, 19])
Now, that is how to grab elements using conditional selection. There is
however a more compact way of doing this. It is the same idea, but it reduces
typing.
Instead of first declaring a Boolean_array to hold our truth values, we just
pass the condition into the array itself, like we did for useful_array.
In []: # This code is more compact
compact = odd_array[odd_array>11] # One line
compact
Out[]: array([13, 15, 17, 19])
See how we achieved the same result with just two lines? It is recommended
to use this second method, as it saves coding time and resources. The first
method helps explain how it all works. However, we would be using the
second method for all other instances in this book.
Exercise: The conditional selection works on all arrays (vectors and matrices
alike). Create a new 3x3 array of the elements greater than 80 from the
‘large_array’ given in the last exercise.
Hint: use the reshape method to convert the resulting array into a 3x3
matrix.
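Here is one possible solution sketch, assuming the large_array from the previous exercise is still defined:
In []: # One way to solve the exercise
exercise_array = large_array[large_array > 80].reshape(3,3)
exercise_array # show
Out[]: array([[82, 84, 86],
       [88, 90, 92],
       [94, 96, 98]])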
NumPy Array Operations
Finally, we will be exploring basic arithmetical operations with NumPy
arrays. These operations are not unlike that of integer or float Python lists.
Array – Array Operations
In NumPy, arrays can operate with and on each other using various arithmetic
operators. Things like the addition of two arrays, division, etc.
Example 4:
In []: # Array - Array Operations
# Declaring two arrays of 10 elements
Array1 = np.arange(10).reshape(2,5)
Array2 = np.random.randn(10).reshape(2,5)
Array1;Array2 # Show the arrays
# Addition
Array_sum = Array1 + Array2
Array_sum # show result array
#Subtraction
Array_minus = Array1 - Array2
Array_minus # Show array
# Multiplication
Array_product = Array1 * Array2
Array_product # Show
# Division
Array_divide = Array1 / Array2
Array_divide # Show
Out[]: array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
array([[ 2.09122638,  0.45323217, -0.50086442,  1.00633093,  1.24838264],
       [ 1.64954711, -0.93396737,  1.05965475,  0.78422255, -1.84595505]])
array([[2.09122638, 1.45323217, 1.49913558, 4.00633093, 5.24838264],
       [6.64954711, 5.06603263, 8.05965475, 8.78422255, 7.15404495]])
array([[-2.09122638,  0.54676783,  2.50086442,  1.99366907,  2.75161736],
       [ 3.35045289,  6.93396737,  5.94034525,  7.21577745, 10.84595505]])
array([[  0.        ,   0.45323217,  -1.00172885,   3.01899278,   4.99353055],
       [  8.24773555,  -5.60380425,   7.41758328,   6.27378038, -16.61359546]])
array([[ 0.        ,  2.20637474, -3.99309655,  2.9811267 ,  3.20414581],
       [ 3.03113501, -6.42420727,  6.60592516, 10.20118591, -4.875525  ]])
Each of the arithmetic operations performed is element-wise. The division
operations require extra care, however. In Python, most arithmetic errors in
code throw a run-time error, which helps in debugging. With NumPy,
however, the code could still run, with only a warning issued.
Array – Scalar operations
Also, NumPy supports scalar with Array operations. A scalar in this context
is just a single numeric value of either integer or float type. The scalar –
Array operations are also element-wise, by virtue of the broadcast feature of
NumPy arrays.
Example 5:
In []: #Scalar- Array Operations
new_array = np.arange(0,11) # Array of values from 0-10
print('New_array')
new_array # Show
Sc = 100 # Scalar value
# let us make an array with a range from 100 - 110 (using +)
add_array = new_array + Sc # Adding 100 to every item
print('\nAdd_array')
add_array # Show
# Let us make an array of 100s (using -)
centurion = add_array - new_array
print('\nCenturion')
centurion # Show
# Let us do some multiplication (using *)
multiplex = new_array * 100
print('\nMultiplex')
multiplex # Show
# division [take care], let us deliberately generate
# an error. We will do a divide by Zero.
err_vec = new_array / new_array
print('\nError_vec')
err_vec # Show
New_array
Out[]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Add_array
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Centurion
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Multiplex
Out[]: array([ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
Error_vec
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:27:
RuntimeWarning: invalid value encountered in true_divide
array([nan, 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
Notice the runtime warning generated? It was caused by the division of the
first element of new_array by itself, i.e. 0/0. This would raise a divide-by-zero
error in a normal Python environment and the code would not run. NumPy,
however, ran the code and indicated the 0/0 result in the Error_vec array as a
‘nan’ value (not-a-number). The same goes for values that evaluate to infinity,
which would be represented by the value ‘+/- inf’ (try 1/0 using a NumPy
array-scalar or array-array operation).
Tip: Always take caution when using division to avoid such runtime
warnings, which could later introduce bugs into your code.
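If you want to check for such values after a division, NumPy offers the np.isnan and np.isinf functions; here is a minimal sketch:
In []: # Checking for problem values after division
risky = np.arange(3) / np.arange(3) # the first element is 0/0, which gives nan
np.isnan(risky) # True where the result is not-a-number
np.isinf(np.arange(1,4) / 0) # 1/0, 2/0 and 3/0 all evaluate to inf
Out[]: array([ True, False, False])
Out[]: array([ True,  True,  True])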
Universal Array functions
These are some built-in functions designed to operate in an element-wise
fashion on NumPy arrays. They include mathematical, comparison,
trigonometric, Boolean, etc. operations. They are called using the
np.function_name(array) method.
Example 6: A few Universal Array functions (U-Func)
In []: # Using U-Funcs
U_add = np.add(new_array,Sc) # addition
U_add # Show
U_sub = np.subtract(add_array,new_array)
U_sub # Show
U_log = np.log(new_array) # Natural log
U_log # Show
sinusoid = np.sin(new_array) # Sine wave
sinusoid # Show
# Alternatively, we can use the .method
new_array.max() # find maximum
np.max(new_array) # same thing
Out[]: array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110])
Out[]: array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\ipykernel_launcher.py:8: RuntimeWarning: divide by
zero encountered in log
Out[]: array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436,
1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458,
2.30258509])
Out[]: array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
-0.95892427, -0.2794155 , 0.6569866
, 0.98935825, 0.41211849, -0.54402111])
Out[]: 10
Out[]: 10
There are still many more functions available, and a full reference can be
found in the NumPy documentation for Universal functions here:
https://docs.scipy.org/doc/numpy/reference/ufuncs.html
Now that we have explored NumPy for creating arrays, we would consider
the Pandas framework for manipulating these arrays and organizing them into
data frames.

Pandas
This is an open source library that extends the capabilities of NumPy. It
supports data cleaning and preparation, with fast analysis capabilities. It is
more like a Microsoft Excel framework, but within Python. Unlike NumPy, it has
its own built-in visualization features and can work with data from a variety
of sources. It is one of the most versatile packages for data science with
Python, and we will be exploring how to use it effectively.
To use pandas, make sure it is currently part of your installed packages by
verifying with the conda list method. If it is not installed, then you can install
it using the conda install pandas command; you need an internet connection
for this.
Now that Pandas is available on your PC, you can start working with the
package. First, we start with the Pandas series.

Series
This is an extension of the NumPy array. It has a lot of similarities, but with a
difference in indexing capacity. NumPy arrays are only indexed via number
notations corresponding to the desired rows and columns to be accessed. For
Pandas series, the axes have labels that can be used for indexing their
elements. Also, while NumPy arrays (much like numeric Python lists) are
essentially used for holding numeric data, Pandas series are used for holding
any form of Python data/object.
Example 7: Let us illustrate how to create and use the Pandas series
First, we have to import the Pandas package into our workspace. We will use
the variable name pd for Pandas, just as we used np for NumPy in the
previous section.
In []: import numpy as np #importing numpy for use
import pandas as pd # importing the Pandas package
We also imported the numpy package because this example involves a
numpy array.
In []: # python objects for use
labels = ['First','Second','Third']
# string list
values = [10,20,30] # numeric list
array = np.arange(10,31,10) # numpy array
dico = {'First':10,'Second':20,'Third':30}
# Python dictionary
# create various series
A = pd.Series(values)
print('Default series')
A #show
B = pd.Series(values,labels)
print('\nPython numeric list and label')
B #show
C = pd.Series(array,labels)
print('\nUsing python arrays and labels')
C #show
D = pd.Series(dico)
print('\nPassing a dictionary')
D #show
Default series
Out[]: 0 10
1 20
2 30
dtype: int64
Python numeric list and label
Out[]: First 10
Second 20
Third 30
dtype: int64
Using python arrays and labels
Out[]: First 10
Second 20
Third 30
dtype: int32
Passing a dictionary
Out[]: First 10
Second 20
Third 30
dtype: int64
We have just explored a few ways of creating a Pandas series using a numpy
array, Python list, and dictionary. Notice how the labels correspond to the
values? Also, the dtypes are different. Since the data is numeric and of type
integer, Python assigns an appropriate integer type to the data. On this
system, creating the series from the NumPy array returned a 32-bit integer
type (int32), while the list-based series used int64.
The difference between 32-bit and 64-bit integers is the corresponding
memory allocation: 32 bits requires less memory (4 bytes, since 8 bits make a
byte), while 64 bits requires double (8 bytes). A 32-bit integer is cheaper to
store, but it has a much more limited range of values it can hold, as compared
with 64 bits.
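If you are curious about the exact memory per element, the itemsize attribute of a NumPy dtype reports it in bytes; a quick sketch:
In []: # Bytes used per element by 32-bit and 64-bit integers
np.dtype(np.int32).itemsize
np.dtype(np.int64).itemsize
Out[]: 4
Out[]: 8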
Pandas series also support the assignment of any data type or object as its
data points.
In []: pd.Series(labels,values)
Out[]: 10 First
20 Second
30 Third
dtype: object

Here, the string elements of the label list are now the data points. Also, notice
that the dtype is now ‘object’.
This kind of versatility in item operation and storage is what makes pandas
series very robust. Pandas series are indexed using labels. This is illustrated
in the following examples:
Example 8:
In []: # series of WWII countries
pool1 = pd.Series([1,2,3,4],['USA','Britain','France','Germany'])
pool1 #show
print('grabbing the first element')
pool1['USA'] # first label index
Out[]: USA 1
Britain 2
France 3
Germany 4
dtype: int64
grabbing the first element
Out[]: 1
As shown in the code above, to grab a series element, use the same approach
as with numpy array indexing, but pass the label corresponding to that data
point. The data type of the label is also important: notice that the ‘USA’ label
was passed as a string to grab the data point ‘1’. If the label is numeric, then
the indexing would be similar to that of a numpy array. Consider numeric
indexing in the following example:
In []: pool2 = pd.Series(['USA','Britain','France','Germany'],[1,2,3,4])
pool2 #show
print('grabbing the first element')
pool2[1] #numeric indexing
Out[]: 1 USA
2 Britain
3 France
4 Germany
dtype: object
grabbing the first element
Out[]: 'USA'
Tip: you can easily know the data held by a series through the dtype.
Notice how the dtype for pool1 and pool2 are different, even though
they were both created from the same lists. The difference is that pool2
holds strings as its data points, while pool1 holds integers (int64).
Panda series can be added together. It works best if the two series have
similar labels and data points.
Example 9: Adding series
Let us create a third series, ‘pool3’. This is a similar series to pool1, but
Britain has been replaced with ‘USSR’, with a corresponding data point value
of 5.
In []: pool3 = pd.Series([1,5,3,4],['USA','USSR','France',
'Germany'])
pool3
Out[]: USA 1
USSR 5
France 3
Germany 4
dtype: int64
Now adding series:
In []:# Demonstrating series addition
double_pool = pool1 + pool1
print('Double Pool')
double_pool
mixed_pool = pool1 + pool3
print('\nMixed Pool')
mixed_pool
funny_pool = pool1 + pool2
print('\nFunny Pool')
funny_pool
Double Pool
Out[]: USA 2
Britain 4
France 6
Germany 8
dtype: int64
Mixed Pool
Out[]: Britain NaN
France 6.0
Germany 8.0
USA 2.0
USSR NaN
dtype: float64
Funny Pool
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\pandas\core\indexes\base.py:3772: RuntimeWarning: '<' not
supported between instances of 'str' and 'int', sort order is undefined
for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
Out[]: USA NaN
Britain NaN
France NaN
Germany NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
When series are added, the data point values of matching labels (or indexes)
are summed. A ‘NaN’ is returned in instances where the labels do not match.
Notice the difference between mixed_pool and funny_pool: in mixed_pool, a
few labels are matched, and their values are added together (due to the add
operation). For funny_pool, no labels match, and the data points are of
different types. A warning message is returned, and the output is a vertical
concatenation of the two series with ‘NaN’ data points.
Tip: As long as two series contain the same labels and data points of
the same type, basic array operations like addition, subtraction, etc. can
be done. The order of the labels does not matter; the values will be
combined based on the operator being used. To fully grasp this, try
running variations of the examples given above.
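If you would rather not get ‘NaN’ for the labels that do not match, the Series.add method accepts a fill_value argument; here is a small sketch using the pools from above:
In []: # Treating missing labels as zero instead of NaN
pool1.add(pool3, fill_value=0)
Out[]: Britain 2.0
France 6.0
Germany 8.0
USA 2.0
USSR 5.0
dtype: float64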
Chapter 11 - Common Debugging Commands

Starting
The command used to begin debugging is ‘s(tart)’, which launches the
debugger from its source. The procedure involves typing the name of the
debugger and then the name of the file, object, or program executable to
debug. Inside the debugging tool, a prompt appears, providing you with
several commands to choose from so you can make the necessary corrections.

Running
The command used is ‘[!]statement’ or ‘r(un)’, which executes the program
up to the intended lines and identifies errors, if any. The command prompt
will display several arguments, probably at the top of the package, especially
when running programs without debuggers. For example, when the
application is named ‘prog1’, the command to use is “r prog1 <infile”. The
debugger will therefore run the program while redirecting its input from the
named file.

Breakpoints
As essential components in debugging, breakpoints utilize the command
‘b(reak) [[filename:]lineno|function[, condition]]’, which lets the debugger
stop the program when execution reaches that point. When execution hits a
breakpoint, the process is suspended for a while, and the debugger command
dialog appears on the screen. This provides time to check on the variables
while identifying any errors or mistakes that might affect the process.
Breakpoints can therefore be scheduled to halt at any line, designated either
by line number or by function name.

Back Trace
Backtrace is executed with the command ‘bt’ and produces a list of the
pending function calls at the point where the program stops. The backtrace
command is only valid when execution is suspended at a breakpoint, or after
the program has exited abnormally during a runtime error, a state called a
segmentation fault. This form of debugging is most useful during
segmentation faults, as it indicates the source of the error among the pending
function calls.

Printing
Printing is primarily used in programming to inspect the value of variables
or expressions during function examination before execution continues. It
uses the command ‘w(here)’ and is useful after the program has been stopped
at a breakpoint or during a runtime error. The expressions used here can
include variables as well as function calls. Besides printing, resuming the
execution after a breakpoint or runtime error uses the command
‘c(ont(inue))’.

Single Step
The single step uses the commands ‘s(tep)’ and ‘n(ext)’ after a breakpoint to
move through source lines one at a time. The two commands behave slightly
differently: ‘step’ steps into function calls and executes every line, while
‘next’ skips over function calls rather than tracing each one. It is often
valuable to run the program line by line, as this gives a more effective
outcome when it comes to tracing errors during execution.

Trace Search
With the commands ‘up’ and ‘down’, you can scroll upwards or downwards
through the pending function calls in the trace. This form of debugging
enables you to go through the variables at different levels of the calls in the
list. From there, you can readily seek out mistakes as well as eliminate errors
using the desired debugging tool.

File Select
Another basic debugger command is file select, which utilizes ‘l(ist) [first[,
last]]’. Some programs, especially complex ones, are composed of two or
more source files, hence the need to use the debugging tools across them.
The debugger should be set on the main source file so that breakpoints and
runtime errors can be examined against the lines in those files. With Python,
a source file from the list can be readily selected and treated as the working
file.

Help and Quit


The help command is represented as ‘h(elp)’ while quit is symbolized as
‘q(uit)’, with both providing assistance during program execution. The help
command displays all the help topics and can be directed at a particular
command for the current problem, while the quit command is used to exit or
abort the debugger tool.

Alias
Alias debugging entails the creation of an alias name to execute a command;
the command must not be enclosed in either single or double quotes. The
syntax used is alias [alias [command]]. Replaceable parameters are indicated
in the command and can be substituted with other values when the alias is
used. The alias remains unchanged if it is re-declared without arguments, and
aliases may incorporate any command that is legal at the pdb prompt.

Python Debugger
In the Python programming language, the module pdb defines the interactive
source code debugger; it supports setting breakpoints and single-stepping at
the source line level, source code listing, and evaluation of arbitrary Python
code in the context of any stack frame. Post-mortem debugging is also
supported, and the debugger can be called under program control. The
Python debugger is extensible, usually by way of classes derived from pdb
itself; the interface uses the pdb and cmd modules.
The debugger prompt, pdb, is essential for running programs under the
control of the debugging tools; for instance, pdb.py can be invoked like a
script to debug other scripts. It may also be used to inspect crashed programs,
using several of its functions in a slightly different way. Some of the
functions used are run(statement[, globals[, locals]]) for running Python
statements, and runeval(expression[, globals[, locals]]). There are also
multiple functions not mentioned above for executing Python programs
efficiently.
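To make this more concrete, here is a minimal, hypothetical sketch of how pdb is typically used (the file name and function are made up purely for illustration):
# buggy_interest.py - a tiny made-up script used to illustrate pdb
import pdb

def simple_interest(principal, years, rate):
    pdb.set_trace()  # execution pauses here and the (Pdb) prompt appears
    return principal * years * rate / 100

print(simple_interest(1000, 3, 5))
At the (Pdb) prompt you can then type n to execute the next line, p principal to print a variable, b to set further breakpoints, c to continue, and q to quit. The same script can also be run entirely under the debugger with python -m pdb buggy_interest.py.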

Using Debugger Commands


As mentioned above, the debugger command prompt is a continuous process,
displaying a window where you input your commands at the bottom. When a
command succeeds in a given window, the output is shown and the prompt is
displayed again. The Debugger Command Window is therefore sometimes
described as the Debugger Immediate Window. It presents two panes: a small
one at the bottom where you enter your commands, and a larger upper one
that shows your results.
The command prompt is the window where you input your debugging
instructions, especially when you need to scan through your program for any
errors. The Python debugging prompt is user-friendly and encompasses all
the relevant features for detecting and eliminating problems. That said, the
prompt will display your current command, and you can quickly stop,
modify, or select other debugging parameters.

Debugging Session
Using a debugger in Python programming is usually a repetitive process: you
write code and run it; it does not work, so you use the debugging tools, fix
errors, and repeat the process again and again. Because debugging sessions
tend to use the same techniques, there are a few key points to note. The
sequence below streamlines your programming process and minimizes the
repetition experienced during program development.

Setting of breakpoints
Running programs by the relevant debugging tools
Check variable outcomes and compare with the existing
function
When all seems correct, you may either resume the program
or wait for another breakpoint and repeat if need be
When everything seems to go wrong, determine the source of
the problem, alter the current line of codes and begin the
process once more

Tips in Python Debugging


Create a Reliable Branch
With the process of debugging being repetitive and somehow constant across
different programming language platforms, leaning your principles is
essential. Setting your parameters play a significant role in ensuring that your
programs are performed within a given environment. That said; ensure you
set your debugging parameters, especially for beginners.
Install pdb++
When working with Python, it is worth installing the pdb++ package to ease
maneuvering around the command line. The package gives you a nicely
colorized prompt and good tab completion, presented elegantly. Pdb++ also
enhances the appearance of your debugger tool, acting as a drop-in
replacement for the standard pdb module.
Conduct Daily Practices
Playing around with Python debugging tools is one of the methods used to
learn more in-depth about incorporating programs with debugging. That said,
create a plan while using a debugger and try to make mistakes or creating
errors and see what happens. Similarly, try to use commands such as
breakpoints, help, and steps to learn further on Python debugging. Create
practical programs while primarily focusing on the use of debuggers to make
corrections on sections with errors.
Learn to Work on One Thing at a Time
Learning Python debugging techniques does not only help you detect and
eliminate errors, but also prepares you to understand how to prevent such
problems. One way to do this is by getting used to correcting one anomaly at
a time, that is, removing one bug at a time. Begin with the most obvious
errors, then think before making immediate corrections, as rushing may lead
to the removal of essential variables. Make the changes and then test your
outcome to ascertain the program's behavior.

Ask Questions
If you know developers who use Python or other platforms, ask them
questions related to debugging, as they are likely using these tools heavily.
When you are just beginning and have no such contacts, go online and find
forums, of which there are many today. Interact with them by seeking
answers to your debugging problems, as well as by playing around with
some programs you create while using debugger tools. Avoid making
assumptions about any part of Python programming, especially in debugging,
as that may result in failures in program development.

Be Clever
When we create programs and avoid errors by using debuggers, the outcome
may make you feel excited and even overconfident. However, be smart: keep
an eye on your work as well as your future operations. The success of
creating a realistic and useful program now does not mean that you will not
fail in the future. Remaining in control will prepare you to use Python
debugging tools wisely and to claim your future accomplishments positively.
Chapter 12 - Neural Networks and What to Use Them For
Regular deep neural networks commonly receive a single vector as an input
and then transform it through a series of multiple hidden layers. Every hidden
layer in regular deep neural networks, in fact, is made up of a collection of
neurons in which every neuron is fully connected to all contained neurons
from the previous layers. In addition, all neurons contained in a deep neural
network are completely independent as they do not share any relations or
connections.
The last fully-connected layer in regular deep neural networks is called the
output layer and in every classification setting, this output layer represents
the overall class score.
Due to these properties, regular deep neural nets are not capable of scaling to
full images. For instance, in CIFAR-10, all images are sized as 32x32x3. This
means that all CIFAR-10 images have 3 color channels and that they are 32
pixels wide and 32 pixels high. A single fully-connected neuron in the first
hidden layer of a regular neural net would therefore have 32x32x3 = 3,072
weights. This number is already hard to manage, and fully-connected
structures are simply not capable of scaling to larger images.
In addition, you would want to have many more such neurons, which quickly
adds up to even more parameters. In the case of computer vision and other
similar problems, using fully-connected neurons is wasteful, as all those
parameters would lead to over-fitting of your model very quickly. Therefore,
convolutional neural networks take advantage of the fact that their inputs
consist of images for solving these kinds of deep learning problems.
Due to their structure, convolutional neural networks constrain the
architecture of images in a much more sensible way. Unlike a regular deep
neural network, the layers contained in the convolutional neural network are
comprised of neurons that are arranged in three dimensions including depth,
height, and width. For instance, the CIFAR-10 input images are part of the
input volume of all layers contained in a deep neural network and the volume
comes with the dimensions of 32x32x3.
The neurons in these kinds of layers can be connected to only a small area of
the layer before it, instead of all the layers being fully-connected like in
regular deep neural networks. In addition, the output of the final layers for
CIFAR-10 would come with dimensions of 1x1x10 as the end of
convolutional neural networks architecture would have reduced the full
image into a vector of class score arranging it just along the depth dimension.
To summarize, unlike the regular-three-layer deep neural networks, a
ConvNet composes all its neurons in just three dimensions. In addition, each
layer contained in convolutional neural network transforms the 3D input
volume into a 3D output volume containing various neuron activations.
A convolutional neural network contains layers that all have a simple API
resulting in 3D output volume that comes with a differentiable function that
may or may not contain neural network parameters.
A convolutional neural network is composed of several subsampling and
convolutional layers that are times followed by fully-connected or dense
layers. As you already know, the input of a convolutional neural network is a
nxnxr image where n represents the height and width of an input image while
the r is the total number of channels present. The convolutional layers may
also contain k filters, known as kernels, whose depth q can be the same as
the number of channels or smaller.
Each convolutional feature map is subsampled with max or mean pooling
over a pxp contiguous region, in which p commonly ranges from two for
small images to five or more for larger images. Either after or before
the subsampling layer a sigmoidal non-linearity and additive bias is applied
to every feature map. After these convolutional neural layers, there may be
several fully-connected layers and the structure of these fully-connected
layers is the same as the structure of standard multilayer neural networks.

How Convolutional Neural Networks Work?


A convolutional neural network structure of ConvNet is normally used for
various deep learning problems. As already mentioned, convolutional neural
networks are used for object recognition, object segmentation, detection and
computer vision due to their structure. Convolutional neural networks, in fact,
learn directly from image data, so there is no need to perform manual feature
extraction which is commonly required in regular deep neural networks.
The use of convolutional neural networks has become popular due to three
main factors. The first of them is the structure of CNNs, which eliminates the
need for performing manual data extraction as all data features are learned
directly by the convolutional neural networks. The second reason for the
increasing popularity of convolutional neural networks is that they produce
amazing, state-of-art object recognition results. The third reason is that
convolutional neural networks can be easily retained for many new object
recognition tasks to help build other deep neural networks.
A CNN can contain hundreds of layers, each of which learns automatically to
detect different features of the image data. Filters are commonly applied to
every training image at different resolutions, and the output of each
convolved image is used as the input to the following convolutional layer.
The filters can start with very simple image features, like edges and
brightness, and then increase in complexity toward features that uniquely
define the object as the convolutional layers progress.
Convolutional neural networks can be trained on hundreds, thousands and
millions of images.
When you are working with large amounts of image data and with some very
complex network structures, you should use GPUs that can significantly
boost the processing time required for training a neural network model.
Once you train your convolutional neural network model, you can use it in
real-time applications like object recognition, pedestrian detection in ADAS
or advanced driver assistance systems and many others.

Convolutional Neural Networks Applications


Convolutional neural networks are one of the main categories of deep neural
networks which have proven to be very effective in numerous computer
science areas like object recognition, object classification, and computer
vision. ConvNets have been used for many years for distinguishing faces
apart, identifying objects, powering vision in self-driving cars, and robots.
A ConvNet can easily recognize countless image scenes as well as suggest
relevant captions. ConvNets are also able to identify everyday objects,
animals or humans, as well. Lately, convolutional neural networks have also
been used effectively in natural language processing problems like sentence
classification.
Therefore, convolutional neural networks are one of the most important tools
when it comes to machine learning and deep learning tasks. LeNet was the
very first convolutional neural network introduced that helped significantly
propel the overall field of deep learning. This very first convolutional neural
network was proposed by Yann LeCun back in 1988. It was primarily used
for character recognition problems such as reading digits and codes.
Convolutional neural networks that are regularly used today for innumerable
computer science tasks are very similar to this first convolutional neural
network proposed back in 1988.
Just like today’s convolutional neural networks, LeNet was used for many
character recognition tasks. Just like in LeNet, the standard convolutional
neural networks we use today come with four main operations including
convolution, ReLU non-linearity activation functions, sub-sampling or
pooling and classification of their fully-connected layers.
These operations, in fact, are the fundamental steps of building every
convolutional neural network. To move onto dealing with convolutional
neural networks in Python, we must get deeper into these four basic functions
for a better understanding of the intuition lying behind convolutional neural
networks.
As you know, every image can be easily represented as a matrix containing
multiple values. We are going to use the conventional term channel when
referring to a specific component of an image. An image derived from a
standard camera commonly has three channels: red, green, and blue. You can
imagine such an image as three 2D matrices that are stacked over each other,
each holding pixel values in the range zero to two hundred fifty-five.
On the other hand, if you have a grayscale image, you only get one channel
as there are no colors present, just black and white. In our case here, we are
going to consider grayscale images, so the example we are studying is just a
single-2D matrix that represents a greyscale image. The value of each pixel
contained in the matrix must range from zero to two hundred fifty-five. In
this case, zero indicates a color of black while two hundred fifty-five
indicates a color of white.
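To make this concrete, here is a small sketch of such a matrix built with NumPy; the pixel values are made up purely for illustration:
import numpy as np

# A tiny 4x4 "grayscale image" represented as a matrix of pixel intensities,
# where 0 is black, 255 is white, and everything in between is a shade of grey.
image = np.array([[  0,  50, 100, 150],
                  [ 50, 100, 150, 200],
                  [100, 150, 200, 255],
                  [150, 200, 255, 255]], dtype=np.uint8)
print(image.shape)  # (4, 4) - a single channel, so just height and width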

Stride and Padding


Secondly, after specifying the depth, you must also specify the stride with
which you slide the filter. When the stride is one, the filter moves one pixel at
a time. When the stride is two, the filter moves two pixels at a time, which
produces spatially smaller output volumes. By default, the stride value is one.
You can use bigger strides when you want less overlap between your
receptive fields but, as already mentioned, this will result in smaller feature
maps, because you are skipping over image locations.
In the case when you use bigger strides but want to maintain the same
dimensionality, you must use padding, which surrounds your input with extra
values; you can pad either with the values on the edge or with zeros. Once the
dimensionality of your feature map matches your input, you can move on to
adding pooling layers. Padding is commonly used in convolutional neural
networks when you want to preserve the size of your feature maps.
If you do not use padding, your feature maps will shrink at every layer.
Adding zero-padding, which pads the input volume with zeros all around the
border, is at times very convenient.
This is called zero-padding, and the amount of it is a hyperparameter. By
using zero-padding, you can control the size of your output volumes.
You can easily compute the spatial size of your output volume as a simple
function of your input volume size, the convolution layers receptive field
size, the stride you applied and the amount of zero-padding you used in your
convolutional neural network border.
For instance, if you have a 7x7 input and you use a 3x3 filter with stride one
and pad zero, you will get a 5x5 output following the formula. If you have
stride two, you will get a 3x3 output volume, and so on, using the following
formula, in which W represents the size of your input volume, F represents
the receptive field size of your convolutional layers, S represents the stride
applied, and P represents the amount of zero-padding you used:
(W - F + 2P)/S + 1
Using this formula, you can easily calculate how many neurons fit along each
dimension of your convolutional layer. Consider using zero-padding
whenever you can. For instance, with an input of size five and a receptive
field of size three, using zero-padding of one keeps the output size at five,
equal to the input. If you do not use zero-padding in a case like this, the
output volume will have a spatial dimension of only three, because three is
the number of neurons that would fit across the original input.
Spatial arrangement hyperparameters have mutual constraints. For instance,
if you have an input size of ten with no zero-padding used and a filter size of
three, it is impossible to use a stride of two, since (10 - 3)/2 + 1 = 4.5, which
is not an integer. The set of hyperparameters is therefore invalid, and a
convolutional neural network library would either throw an exception or
zero-pad the rest to make it fit.
Fortunately, sizing the convolutional layers properly, so that all the
dimensions work out with the help of zero-padding, makes the job much
easier.
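To tie this together, here is a small sketch (not from the original text) of the output-size formula written as a Python helper, using the numbers from the examples above:
# A small helper that applies the (W - F + 2P)/S + 1 formula
def conv_output_size(W, F, S=1, P=0):
    size = (W - F + 2 * P) / S + 1
    if size != int(size):
        raise ValueError("These hyperparameters do not fit the input")
    return int(size)

print(conv_output_size(7, 3, S=1, P=0))    # 5 -> a 5x5 output
print(conv_output_size(7, 3, S=2, P=0))    # 3 -> a 3x3 output
print(conv_output_size(5, 3, S=1, P=1))    # 5 -> padding of one preserves the size
print(conv_output_size(227, 11, S=4, P=0)) # 55, as in the example further down
# conv_output_size(10, 3, S=2, P=0) would raise ValueError, as discussed above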

Parameter Sharing
You can use a parameter sharing scheme in your convolutional layers to
control the total number of parameters. If you denote a single two-
dimensional slice of depth as a depth slice, you can constrain the neurons
contained in every depth slice to use the same bias and weights. Using this
parameter sharing technique, you end up with only one unique collection of
weights per depth slice, rather than one per neuron. Therefore, you can
significantly reduce the number of parameters contained in the first layer of
your ConvNet. With this step, all neurons in every depth slice of your
ConvNet use the same parameters.
In other words, during backpropagation, every neuron contained in the
volume will automatically compute the gradient for all its weights. However,
these computed gradients add up over every depth slice, so you only update a
single collection of weights per depth slice. Note that all neurons contained
in one depth slice use the exact same weight vector. Therefore, the forward
pass of the convolutional layer in every depth slice can be computed as a
convolution of the neurons’ weights with the input volume. This is the reason
why we refer to the collection of weights as a kernel or a filter, which is
convolved with your input.
However, there are a few cases in which this parameter sharing assumption,
in fact, does not make any sense. This is commonly the case with many input
images to a convolutional layer that come with certain centered structure,
where you must learn different features depending on your image location.
For instance, when you have an input of several faces which have been
centered in your image, you probably expect to get different hair-specific or
eye-specific features that could be easily learned at many spatial locations.
When this is the case, it is very common to just relax this parameter sharing
scheme and simply use a locally-connected layer.

Matrix Multiplication
The convolution operation commonly performs those dot products between
the local regions of the input and between the filters. In these cases, a
common implementation technique of the convolutional layers is to take full
advantage of this fact and to formulate the specific forward pass of the main
convolutional layer representing it as one large matrix multiply.
The matrix multiplication implementation works by stretching the local
regions of an input image out into separate columns, in an operation
commonly known as im2col. For instance, if you have an input of size
227x227x3 and you convolve it with a filter of size 11x11x3 at a stride of 4,
you take blocks of pixels of size 11x11x3 in the input and stretch every block
into a column vector of size 363.
Iterating this process over the input at a stride of 4 gives fifty-five locations
along both the width and the height, leading to an output matrix, often called
X_col, in which every column is a stretched-out receptive field and there are
55x55 = 3025 such fields in total.
Note that since the receptive fields overlap, each number in your input
volume may be duplicated in multiple distinct columns. Also remember that
the weights of the convolutional layer are similarly stretched out into rows.
For instance, if you have 96 filters of size 11x11x3, you will get a weight
matrix, often called W_row, of size 96x363.
When it comes to matrix multiplications, the result you get from your
convolution will be equal to performing one huge matrix multiply that
evaluates the dot products between every receptive field and between every
filter resulting in the output of your dot production of every filter at every
location. Once you get your result, you must reshape it back to its right
output dimension, which in this case is 55x55x96.
This is a great approach, but it has a downside. The main downside is that it
uses a lot of memory, as the values contained in your input volume are
replicated several times. However, the main benefit is that there are many
very efficient implementations of matrix multiplication that can speed up
your model. In addition, the same im2col idea can be re-used when
performing the pooling operation.
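As a quick sanity check on these numbers, here is a short sketch (illustrative only; the names X_col and W_row are simply the conventions used above) that works out the shapes involved:
# Shape bookkeeping for the im2col example above (a sketch, not a full implementation)
W, F, S, P = 227, 11, 4, 0        # input size, filter size, stride, padding
channels, num_filters = 3, 96

locations = (W - F + 2 * P) // S + 1     # 55 positions along width and height
col_height = F * F * channels            # 363 values per stretched receptive field

X_col_shape = (col_height, locations * locations)  # (363, 3025)
W_row_shape = (num_filters, col_height)            # (96, 363)
output_shape = (locations, locations, num_filters) # (55, 55, 96) after reshaping

print(locations, col_height, X_col_shape, W_row_shape, output_shape)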
Conclusion
Thank you for making it through to the end! The next step is to start putting
the information and examples that we talked about in this guidebook to good
use. There is a lot of information inside all that data that we have been
collecting for some time now. But all of that data is worthless if we are not
able to analyze it and find out what predictions and insights are in there. This
is part of what the process of data science is all about, and when it is
combined together with the Python language, we are going to see some
amazing results in the process as well.
This guidebook took some time to explore more about data science and what
it all entails. This is an in-depth and complex process, one that often includes
more steps than what data scientists were aware of when they first get started.
But if a business wants to be able to actually learn the insights that are in
their data, and they want to gain that competitive edge in so many ways, they
need to be willing to take on these steps of data science, and make it work for
their needs.
This guidebook went through all of the steps that you need to know in order
to get started with data science and some of the basic parts of the Python
code. We can then put all of this together in order to create the right
analytical algorithm that, once it is trained properly and tested with the right
kinds of data, will work to make predictions, provide information, and even
show us insights that were never possible before. And all that you need to do
to get this information is to use the steps that we outline and discuss in this
guidebook.
There are so many great ways that you can use the data you have been
collecting for some time now, and being able to complete the process of data
visualization will ensure that you get it all done. When you are ready to get
started with Python data science, make sure to check out this guidebook to
learn how.
Loops are going to be next on the list of topics we need to explore when we
are working with Python. These are going to be a great way to clean up some
of the code that you are working on so that you can add in a ton of
information and processing in the code, without having to go through the
process of writing out all those lines of code. For example, if you would like
a program that would count out all of the numbers that go from one to one
hundred, you would not want to write out that many lines of code along the
way. Or if you would like to create a program for doing a multiplication
table, this would take forever as well. But doing a loop can help to get all of
this done in just a few lines of code, saving you a lot of time and code writing
in the process.
It is possible to add in a lot of different information into the loops that you
would like to write, but even with all of this information, they are still going
to be easy to work with. These loops are going to have all of the ability to tell
your compiler that it needs to read through the same line of code, over and
over again, until the program has reached the conditions that you set. This
helps to simplify the code that you are working on while still ensuring that it
works the way that you want when executing it.
As you decide to write out some of these loops, it is important to remember
to set up the kind of condition that you would like to have met before you
ever try to run the program. If you just write out one of these loops, without
this condition, the loop won’t have any idea when it is time to stop and will
keep going on and on. Without this kind of condition, the code is going to
keep reading through the loop and will freeze your computer. So, before you
execute this code, double-check that you have been able to put in these
conditions before you try to run it at all.
As you go through and work on these loops and you are creating your own
Python code, there are going to be a few options that you can use with loops.
There are a lot of options but we are going to spend our time looking at the
three main loops that most programmers are going to use, the ones that are
the easiest and most efficient.

The while loop


Out of the three loops that we are going to discuss in this guidebook, we are
going to start out with the while loop. The while loop is going to be one that
will tell the compiler the specific times that you would like it to go through
with that loop. This could be a good loop to use any time that you want the
compiler to count from one to ten. With this example, your condition would
tell the compiler to stop when it reaches ten, so that it doesn’t freeze and keep
going. This is also a good option that programmers like to work with because
it will make sure that it goes through the code at least one time, if not more
before it decides to head on to the other parts of the code that you are writing.
To see a good example of how you can work with the while loop take a look
at the code that we have below for reference:
#calculation of simple interest. Ask the user to input the principal, rate of
interest, number of years.
counter = 1
while(counter <= 3):
    principal = int(input("Enter the principal amount:"))
    numberofyears = int(input("Enter the number of years:"))
    rateofinterest = float(input("Enter the rate of interest:"))
    simpleinterest = principal * numberofyears * rateofinterest/100
    print("Simple interest = %.2f" %simpleinterest)
    #increase the counter by 1
    counter = counter + 1
print("You have calculated simple interest for 3 times!")
This example allows your users to go through and place the information that
pertains to them inside the program. The code will then compute the interest
based on the information that the user provides. For this one, we set up the
while condition (right at the beginning of the loop) and told it to only go
through the loop three times. You can change it to go through the process as
many times as you would like.
The for loop
Now that we have had some time to look at the while loop, it is time to take a
look at what is known as the for loop and how we are able to use this in some
of the coding that we want to do. When you bring out the for loop, there are
going to be some differences compared to the while loop from before. Keep
in mind that most programmers are going to consider the for loop the
traditional form of a loop. You can use it in many of the situations where you
would work with the while loop, so it is important to learn how to use this for
your needs.
When you are ready to create your own for loop, the user is not going to be
the one who provides the information that is needed to start the loop. Rather,
the for loop is set up to go through an iteration in whatever order you put into
the code. There is no need for this kind of input from the user, because the
loop just goes through the full iteration until it reaches the end.
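For example, a pair of nested for loops can print out an entire multiplication table in just four lines of code; here is a minimal sketch of what that might look like, starting with 1 * 1 = 1, then 1 * 2 = 2, and so on:
for i in range(1, 11):
    for j in range(1, 11):
        product = i * j
        print(i, "*", j, "=", product)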
And so on, until you end up with 10 * 10 = 100 as the final entry in the table.
Any time you need to get one loop to run inside another loop, the nested loop
will be able to help you get this done. You can combine the for loop, the
while loop, or any mixture of the two, based on what you want to get done
inside the code. It definitely shows you how much time and space inside the
code these loops can save. The multiplication table above only took up four
lines to write out, and you got a huge table. Think of how long this would
take if you had to write out each part of the table!
The for loop, the while loop, and the nested loop are the most common types
of loops that beginners will use when writing their own code in Python. You
can use these loops to get a lot of work done in your chosen program without
having to write out as many lines, and to make sure that a certain part of your
code will repeat itself as needed.
SQL COMPUTER
PROGRAMMING FOR
BEGINNERS:
THE PRACTICAL STEP BY STEP GUIDE,
TO MASTER THE FUNDAMENTALS OF SQL
DATABASE PROGRAMMING MADE SIMPLE
AND STRESS-FREE, THAT WILL GET YOU
HIRED

Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or
transmitted without direct written permission from the author or the
publisher.
Under no circumstances will any blame or legal responsibility be held against
the publisher, or author, for any damages, reparation, or monetary loss due to
the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. All effort has been executed to present
accurate, up to date, reliable, complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical, or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is
the author responsible for any losses, direct or indirect, that are incurred as a
result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
Starting Off with SQL with the Help of Microsoft Access
Understand that the Rapid Application Development tool, or RAD for short,
is designed to be used with Access, requiring no knowledge of programming.
It is possible to write, develop, and execute SQL statements using Access,
but it is essential to implement the back-door method.
The following steps are going to help you open a basic editor in Access to
start writing your SQL code.

Open up your database first and press the CREATE tab to


bring forward the ribbon across the top of the window.
Look for the Queries section and click Query Design.
A dialog box should appear containing show table.
Go ahead and click on the POWER table, after which close
your dialog box.
Next, click on the Home tab and press the View icon
situated at the left corner of the Ribbon
A dropdown menu will be available, showing the different
kinds of view
As expected, choose the SQL view to display the SQL view
object tab and write down the following code.

SELECT
FROM POWER ;
After that, you will need to call upon the WHERE clause right after the
FROM line, making sure to put an asterisk (*) in the blank area after
SELECT. A very common mistake people make here is forgetting to put the
semicolon at the end of the statement. Don’t do that.
SELECT *
FROM POWER
WHERE LastName = ‘Marx’ ;
Once done, click on the floppy-diskette icon to save your
statement.
Enter a specified name and click ok.

Create you first table


Before delving deeper into this, please keep in mind that even though we are
teaching you how to create tables with Access, the information is not going
to change even if you go for Microsoft SQL Server, IBM, or even Oracle!
The only major difference is that in some cases you will be given a visual
guideline to help you, while in others you will have to stick to the plain
typing section.
So, let us have a look at the code for our first table then!
Since you have already opened up the SQL Access window, you are going to
need to enter the following code in order to establish the foundation of your
table.
CREATE TABLE POWERSQL (
ProposalNumber INTEGER PRIMARY KEY,
FirstName CHAR (15),
LastName CHAR (20),
Address CHAR (30),
City CHAR (25),
StateProvince CHAR (2),
PostalCode CHAR (10),
Country CHAR (30),
Phone CHAR (14),
HowKnown CHAR (30),
Proposal CHAR (50),
BusinessOrCharity CHAR (1) );
But at this point you might be a little confused as to where you should write this code, right? Assuming that you are using Access 2013, simply follow the steps below to initiate your POWER tool and start writing!
First, click on the Create tab on the Ribbon to display the icons related to creation.
Next, click on the Query Design icon.
Locate POWER and click on the Add button.
After that, press the Close button.
Next, press the Home tab, followed by the View icon situated at the left end of the Ribbon, and choose SQL View.
Now, in the space where "SELECT From POWER" is written, wipe that text from the face of the earth and enter the code above.
Once entered, press the red exclamation-pointed Run icon.
Your POWERSQL table should now be created. Tada!!

As we mentioned earlier, the information here is pretty much the same as what you would have entered through a graphical interface. However, since SQL is a universally followed language, the syntax is transferable to any ISO standard-compliant DBMS product.
Data retrieval
Now that you are well versed in the ways of creating your basic data structure, let us move a little further into how you are going to manipulate the data in a database.

Table manipulation
The following code is when you will want to add a second address field in
your POWERSQL table.
ALTER TABLE POWERSQL
ADD COLUMN Address2 CHAR (30);
Again, when it comes to deleting a table, you will want to use the following
code.
DROP TABLE POWERSQL;
Not as hard as it sounded before, right? Bear with us and even the more advanced concepts are going to become much easier for you eventually! Amongst the various concepts, the most common one is the simple task of retrieving the required information from a given database, say, just the data of one row from a collection of thousands.
The very basic code for this method is
SELECT column_list FROM table_name
WHERE condition ;
As you can see, it utilizes the SELECT and WHERE statement to specify the
desired column and condition.
Having familiarized yourself with the skeleton of the code, the following
example should make things clear. Here, the code is asking for information
from the customer.
SELECT FirstName, LastName, Phone FROM CUSTOMER
WHERE State = ‘NH’
AND Status = ‘Active’ ;
Specifically speaking, the statement above returns the name and phone number of all active customers who live in New Hampshire (NH). Keep in mind that the keyword AND is used, which simply means that for a row to be retrieved, both of the given conditions must be met.

View creation from tables


The SELECT statement is what allows you to return a result in virtual table format.
Take an example where the database is made up of CLIENT, TEST, ORDERS, EMPLOYEE and RESULTS tables. A view can be conjured up using the SELECT command. Say that you want to create a view for a national marketing manager who wants to observe the state of the company's orders. You are not going to need all of the information, only a few columns. The following code is how it should look:
CREATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName ;
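Once the view exists, you can query it just like an ordinary table. As a quick hedged sketch using the ORDERS_BY_STATE view defined above (the state value is only an example):
SELECT ClientName, OrderNumber
FROM ORDERS_BY_STATE
WHERE State = 'NH' ;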

Adding data to rows


Having created the shell of your database, you are now going to want to add data, right? It's simple. Just use the following syntax:
INSERT INTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
Take note of the [] brackets: the column list inside them is optional. Each value you supply is assigned to its respective column, in order.
Another example for a CUSTOMER table might look similar to
INSERT INTO CUSTOMER (CustomerID, FirstName, LastName,
Street, City, State, Zipcode, Phone)
VALUES (:vcustid, ‘David’, ‘Taylor’, ‘235 Loco Ave.’,
‘El Pollo’, ‘CA’, ‘92683’, ‘(617) 555-1963’) ;

Data addition to selected column


Similarly, when you only want to supply data for some of the columns of a row instead of all of them, you will have to utilize the following code:
INSERT INTO CUSTOMER (CustomerID, FirstName, LastName)
VALUES (:vcustid, ‘Tyson’, ‘Taylor’) ;

Transferring multiple columns and rows amongst


various tables
When you are working with larger databases containing multiple tables, clients might sometimes ask you to create a completely new table, or to modify one, by filling it with just a few pieces of information drawn from the columns and rows of perhaps four other tables. These are scenarios where you will have to use the UNION relational operator.
Assuming that you have two tables, namely PROSPECT and CUSTOMER, you might want a combined list of the prospects and customers who live in Maine. The following is the code you should use:
SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
SELECT FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;

Deleting old data


Nothing in this world is static, and as time passes, things are bound to become obsolete and useless. The same thing happens to data stored in a database. At one point or another, you are going to want to erase old data to make way for new data, or simply to clean up the whole database. To perform this action, you will need SQL's DELETE statement, which behaves in a similar way to the SELECT statement. The example below removes the data of David Taylor from the CUSTOMER table.
DELETE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’ ;

Study questions
Q1) The RAD is designed to be used with what?
a) Access
b) Power point
c) Mozilla FireFox
d) Word
Answer: A
Q2) Which of the following can be used to add data to a row?
a) INSERT UNTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
b) ADD INTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
c) INSERT INTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
d) ADD UNTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
Answer: C
Q3) How can you transfer data between two tables, namely PROSPECT and CUSTOMER?
a) CHOOSE FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
SELECT FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
b) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
TRANSFER FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
c) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
INTO FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
d) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
SELECT FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
Answer: D
Q4) How can you eliminate unwanted data?
a) ABOLISH FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
b) ELIMINATE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
c) DELETE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
d) REMOVE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
Answer: C
Q5) How can you design a view with the following criteria – CLIENT,
TEST, ORDERS, EMPLOYEE, RESULTS
a) GEENERATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName:
b) INITIATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName:
c) DESIGN VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName:
d) CREATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName ;
Answer: D
Chapter 1 - Data Types in SQL
Data is at the core of SQL. After all, SQL was created to make it easier to
manipulate the data stored inside a database. It ensures that you do not have
to sort through large chunks of data manually in search of what you want.
Now, there are various types of data that can be stored in the database
depending on the platform used. As such, you now need to learn about the
data types available in SQL.

Data Types: An Introduction


The data type specifies the kind of data which can be stored in a column of a
database table. When creating a table, it is important to decide the data type
which is going to be used for defining the columns. Data types can also be
used for defining variables and for storing procedure output and input
parameters. The data type will instruct SQL to expect a particular kind of data
in each column.
A data type must be selected for each variable or column which is appropriate
for the kind of data which is to be stored in that column or variable.
Additionally, storage requirements must be considered. You need to select
data types which ensure efficiency in the storage.
Selecting the correct data type for the variables, tables and stored procedures
will improve the performance greatly as it will ensure that the execution plan
is correct. At the same time, it will be a great improvement on data integrity
as it ensures that the right data has been stored inside a database.

The List of Data Types


There are several data types used in SQL. The following table lists all of
them along with a short description of what they are. Mark this table as it will
prove to be invaluable as a reference guide on data types when you are
learning and even later.
Before you start, you should take a moment to understand what precision and
scale are. Precision is the total number of digits that is present in a number.
Scale is the total number of digits located on the right side of the decimal
point of a number. In the case of a number like 123.4567, the precision is 7
while the scale is 4.
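To make that concrete, here is a small hedged sketch (the table and column names are invented for illustration) of a column that could hold 123.4567 exactly:
CREATE TABLE measurements (
reading DECIMAL(7, 4)   -- precision 7, scale 4: seven digits in total, four of them after the decimal point
);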

Variations among Database Platforms


The data types given above are common to all database platforms. However,
there are some which are known by different names in different database
platforms. This can be a source of confusion. As such, take a look at the
variations in the table given below.

Data Types for Different Databases


Each database platform tends to have its own range of data types. As such,
you need to know what those data types are in order to use those platforms
effectively. Here, we shall be focusing on the most popular database
platforms: MySQL, SQL Server and Microsoft Access. A short description
has been provided with each data type.
Data Types in Microsoft Access
Data Types in MySQL
In MySQL, the data types available can be widely classified into 3 categories.
They are text, date/time and number. We shall take a look at the available
data types in each category in the following tables.
MySQL Data Types
Integers:
In MySQL, integer data types have an extra option known as UNSIGNED. Normally, an integer's range runs from a negative value up to a positive one. Adding the UNSIGNED attribute shifts that range upwards, so the integer starts at zero instead of a negative number and its positive maximum roughly doubles.
Date/Time:
The two data types, TIMESTAMP and DATETIME may return the same
format. However, they work in different ways. TIMESTAMP will set itself
automatically to the current time and date in an UPDATE or INSERT query.
Different formats are accepted by TIMESTAMP as well.
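To make both points concrete, here is a minimal MySQL sketch (the table and column names are invented for illustration): an UNSIGNED integer column that can never hold a negative value, and a TIMESTAMP column that MySQL fills in and refreshes automatically.
CREATE TABLE visits (
visit_count INT UNSIGNED NOT NULL,   -- zero and up only; negative values are rejected
last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);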
Data Types in SQL Server
Like MySQL, the data types in SQL Server can also be classified into
different categories. We shall be looking at each one in the following tables.
String:
Number:
Date/Time:
The above tables will certainly prove to be of immense help in your work.
You should certainly go through them more than once to ensure that you have
a thorough idea of the data types in use in SQL in different database
platforms. In fact, it will be a good idea to mark down the chapter so that you
can quickly refer to it whenever you need to refresh your knowledge.
Chapter 2 - Constraints
We’re coming near the finish! In this chapter, we’ll be looking at constraints.
If you’ve been wondering what they are, then you’d be hard pressed to find a
more self-describing name than constraints.
Constraints do precisely that: they constrain the values that a column is allowed to take. For example, a customer's age cannot be "0" and cannot be "null", which is to say it cannot be left empty. Think about the difference between 0 and NULL: in SQL, NULL means that the value is "undefined" at the current moment, while 0 is an ordinary integer.

Not null, default & unique


Now, the default setting for any column is to be able to hold both 0 and
NULL values. If you really don’t need something like that, then you use the
NOT NULL constraint to make sure that this value isn’t allowed in that
column. Think of ID numbers, you can have the ID number “0” but your ID
number can’t be undefined. For these cases, you’ll be using the NOT NULL
constraint.
Let’s try to recreate our table from before, or at least, create a program which
lets us input it. In the next SQL query, we’ll be making a new table called
CUSTOMER, after that, we’ll add 5 columns to it. The first three will not be
NULL, as they’ll be IDS, NAMES, and AGES, all of which are crucial to the
running of our program.
CREATE TABLE CUSTOMER(
IDS INT NOT NULL,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2),
PRIMARY KEY (IDS)
);
Now, if this is the first time for you to see the CHAR and INT keywords,
don’t worry. Char simply means any characters, while INT restricts it to
integers. If you’re wondering why we even use INT’s when we have
CHAR’s, it’s because INT takes much less memory, and as such, works
much faster.
Now, let’s say you’ve already created the table, but you want to change it.
Maybe you find salary very important, so you don’t want to ever enable it to
be NULL. While regular SQL would have some issues with this, you could
easily do it with MySQL.
ALTER TABLE CUSTOMER
MODIFY SALARIES DECIMAL (20, 2) NOT NULL;
And done! It really is that simple, you’ll find that when it comes to many
things, SQL is extremely simple to use.
Now, let’s take a little look at the DEFAULT keyword. It enables you to set a
value for things that are not specified. Let’s take SALARIES to be, say
2000.00 USD for your average customer, so instead of typing it a thousand
times over, wouldn’t it be better to have a way to simply autofill it if you
don’t provide a value? The way you do that is really simple, take a look at
this:
CREATE TABLE CUSTOMER(
IDS INT NOT NULL,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS)
);
Now, what if you’ve already created the table? Fortunately, SQL will let you
change things about it on the fly. You can easily add a DEFAULT keyword
even while the table already exists, all you need to do is run a little snippet of
code like:
ALTER TABLE CUSTOMER
MODIFY SALARIES DECIMAL (20,2) DEFAULT 2000.00 ;
And done! It’s that easy. Keep in mind that the DEFAULT keyword can
always be overridden, should you need to apply a different value.
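To see the DEFAULT in action, here is a hedged sketch (the inserted values are made up): the first INSERT leaves SALARIES out, so the default of 2000.00 is applied, while the second supplies its own value and overrides the default.
-- SALARIES is omitted, so it falls back to the DEFAULT of 2000.00
INSERT INTO CUSTOMER (IDS, NAMES, AGES)
VALUES (1, 'Anna', 30) ;
-- An explicit value always overrides the DEFAULT
INSERT INTO CUSTOMER (IDS, NAMES, AGES, SALARIES)
VALUES (2, 'Bruno', 41, 3500.00) ;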
Can you guess what the UNIQUE constraint does? It’s quite similar to having
a pre-emptive DISTINCT query. In fact, you’ll find that most constraints are
precisely that, queries you do before the running of the program. Let’s
assume you want to stop people from having the same IDS. This would make
sense right? After all, you can’t have 2 people with the IDS “4.”
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS)
);
Now, like in every example so far, it’s possible to modify this on the fly.
Even if you’ve already created the CUSTOMER table, it’s still possible to
modify it. Let’s say that you’ve found a critical bug in the company code, and
it requires IDS’ to never be identical. To solve this issue, you could just run a
bit of code that modifies the IDS column on the fly.
ALTER TABLE CUSTOMER
MODIFY IDS INT NOT NULL UNIQUE ;
Now, let’s say these simply aren’t enough for you. You find these named
constraints to be insufficient, maybe you have some better ideas? Luckily,
SQL supports creativity, so you’ll be able to create your own constraints,
even those affecting multiple columns, let’s take a look at how this would
look in the code, shall we?
ALTER TABLE CUSTOMER
ADD CONSTRAINT your_own_constraint UNIQUE(IDS, AGES) ;
When run, this code will apply your constraint of UNIQUE-ness to the IDS
and AGES columns.

Primary & foreign keys


When it comes to primary keys, you should remember them from the start of
the book. Remember when we were discussing parent-child relationships in
SQL? Well, the primary key is the key which gives the table it belongs to a unique identity and identifies every row in it. By default, the primary key is filtered through a UNIQUE and a NOT NULL constraint, because primary keys cannot contain duplicate or NULL values.
You can only have one primary key per table, and it might be made up of one or more than one field. For example, if you use 4-5 different fields to create your primary key, then you'd have made what is called a composite key.
Now, if your table already has a primary key assigned to any of its fields, then you can't add a second one. This is because every row must be identified by one, and only one, key value, and no two rows can share that value.
Now, let’s try looking at the code from before, and let’s make the ID attribute
our primary key, as it is necessary to the function of the program.
Here is the syntax to define the IDS attribute as a primary key in a
CUSTOMERS table.
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS)
);
Now, if you were to do this retroactively, by which I mean you already had a table and wanted to make IDS the primary key afterwards, you would do it by running the following code:
ALTER TABLE CUSTOMER ADD PRIMARY KEY (IDS) ;
You should note that if you are adding a primary key this way, you’ll need
the primary key column to have been declared to be NOT NULL at the start.
Otherwise, you won’t be able to do it.
Now, let’s imagine we want to create a composite key. While this may sound
intimidating, all you need to do is enforce more than one PRIMARY KEY
constraint, for example, let’s say NAMES are just as necessary as IDS.
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
That’s really it; all you need to do is declare the NAMES and IDS to be the
primary keys, and you’ve made them a composite key. The declarative nature
of SQL is finally paying off (83 pages in, but hey, who’s counting.)
Now, let's assume you want to alter this retroactively too. To make IDS and NAMES the PRIMARY KEY, you'll need to do the following:
ALTER TABLE CUSTOMER
ADD CONSTRAINT PK_CUSTID PRIMARY KEY (IDS, NAMES) ;

Foreign Key
Now we're getting to the interesting bit: foreign keys, invoked by the FOREIGN KEY command, let you connect two tables together. This is why, most of the time, you'll see the foreign key called a referencing key. The foreign key is simply a column, or a combination of columns, that refers to the primary key of a different, parent table. In the parent-child dynamic of SQL, there is no doubt that the foreign key is the child.
The relationship between these two tables, connected like this, would be that
the first gives its primary attributes to the other. For example, in the
CUSTOMER table, you might want to have a bit more functionality, so you
create a table called BUYS. In the BUYS table, you’ll hold all of the
customer’s orders, rather than simply doing this on a sheet of paper like in
the olden days.
Let’s do an example of this!
-- The customer table
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
-- The buys table
CREATE TABLE BUYS (
IDS INT NOT NULL UNIQUE,
TIME DATETIME,
CUST_IDS INT references CUSTOMER(IDS),
AMOUNT DOUBLE,
PRIMARY KEY (IDS)
);
While this will create a BUYS table, it doesn't yet maintain its relationship with its parent table, the CUSTOMER table. Because of this, we'll add the foreign key by altering the table rather than within its construction (this also gives you a pattern you can reuse later).
ALTER TABLE BUYS
ADD FOREIGN KEY (CUST_IDS) REFERENCES CUSTOMER(IDS);
This tells the program its foreign key will be CUST_IDS, which will pull its
values from the CUSTOMER table.
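To see the relationship being enforced, here is a hedged sketch (the specific values are invented): an order for an existing customer goes in fine, while an order whose CUST_IDS has no matching IDS in CUSTOMER is rejected by the foreign key.
-- Works, assuming a customer with IDS = 1 already exists
INSERT INTO BUYS (IDS, TIME, CUST_IDS, AMOUNT)
VALUES (100, '2019-05-01 10:30:00', 1, 49.99) ;
-- Fails: no customer has IDS = 999, so the foreign key rejects the row
INSERT INTO BUYS (IDS, TIME, CUST_IDS, AMOUNT)
VALUES (101, '2019-05-01 10:35:00', 999, 12.50) ;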

Index & check constraint


These next 2 constraints are a bit different than the ones you’re used to. The
first, the INDEX constraint, was created in order to create and retrieve data
from a database without losing out on speed. You can create one of these
INDEX-es by using one or N columns from a given table that you want to
pull from. When you create an index, it will assign a row ID, conveniently
called ROWID for every row in the reference table. This will be done before
any sorting of the data is able to occur.
The biggest tell on an SQL developer’s ability is how well they do indexes.
Well-done indexes are the most crucial bit of running a large database. If
you’re planning to get a job in database administration, then the INDEX
constraint is going to become your best friend.
Now, let’s use our CUSTOMER table again, and after creating it again, add
onto it another table which will be an index of it, pulling from 1 column in
the CUSTOMER table.
-- The customer table
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
-- Creating an INDEX for a general table, without referring to our original table
CREATE INDEX index_names
ON name_of_table (column01, column02, ...) ;
-- Creating an index on the AGES column is also possible, and will optimize
-- searching for customers of a given age, so let's check out how to do that!
CREATE INDEX index_ages
ON CUSTOMER (AGES) ;
Naturally, you could do the same with as many other statements as you want.
For example, you could create an index of names, salaries etc.

Check
The CHECK constraint is similar to a conditional statement in other programming languages. The CHECK constraint tests whether a given condition is satisfied or not. If the condition is satisfied, the row is accepted; if it is not, the row violates the constraint and is rejected as invalid input. In a way, it is an if-statement that guards the table: only rows for which the condition holds true get in.
Now, let's take another look at our creation of the CUSTOMER database. Let's say you're working for a supermarket firm; they're not allowed to sell alcohol to people under 21, right? Well, you can handle that inside SQL with a check constraint on the customer's age. We'll start by requiring every customer to be at least 18, and then tighten the rule to 21 in a moment.
-- The customer table
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL CHECK (AGES >= 18),
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
Now, if you've already got this table, can you guess what code you would run in order to alter the table so that it checks that the age is at least 21?
ALTER TABLE CUSTOMER
MODIFY AGES INT NOT NULL CHECK(AGES >= 21);
You can also achieve this differently, by using a syntax which will let you
name the constraint, which is sometimes useful when making multiple
constraints of the same type.
ALTER TABLE CUSTOMER
ADD CONSTRAINT check21 CHECK (AGES >= 21) ;

Dropping constraints and integrity constraints


Dropping a constraint is precisely what it sounds like. You take a constraint you used to impose on the database and, at a certain point, you stop enforcing it. For example, if your store stops selling alcohol, cigarettes and similar products, you'll no longer need to check the customer's age. Dropping constraints can also be useful for efficiency, as constraints you no longer need still cost the database extra work on every change.
The following commented bulk of code will outline the ways you should
drop constraints in the future:
-- DROP CHECK constraints
-- If you want to drop a CHECK constraint (which you may want to do once the rule no
-- longer applies, since checks add work to every insert and update), drop it by the
-- name it was given, in this case check21:
ALTER TABLE CUSTOMER
DROP CONSTRAINT check21 ;
-- To DROP a FOREIGN KEY constraint (which is generally done when you
-- want to give the table another foreign key), drop it by its name:
ALTER TABLE BUYS
DROP FOREIGN KEY constraint_name ;
-- If you're looking to DROP an INDEX, then you better know what you're
-- doing, but it's useful sometimes. If you want to risk the integrity of your
-- index, you would do it like this:
ALTER TABLE CUSTOMER
DROP INDEX IDS ;
-- Now, if you want to delete the primary key constraint, you're either a
-- genius or a madman. Without delving too much into your psyche, you would
-- do it like this:
ALTER TABLE X DROP PRIMARY KEY;
-- Dropping a DEFAULT constraint has a bit different syntax than the rest of
-- them, as it refers to a particular column, rather than the totality of the table.
ALTER TABLE X
ALTER COLUMN Y DROP DEFAULT ;
-- To drop a UNIQUE constraint (maybe after your teen years?) you can run
-- the following SQL lines, using the name you gave the constraint earlier:
ALTER TABLE X
DROP CONSTRAINT your_own_constraint ;

Integrity Constraints
Tables are, as you've seen so far, very easy to manipulate. With that being said, integrity constraints are vital to their continued existence. They ensure the correct data is mapped to the correct places. In essence, they're the unsung heroes of SQL, as they enable us developers to do our jobs. Relational databases check the integrity of the data using a concept called referential integrity.
While there are many kinds of integrity constraint, you've already learned the most important ones, such as the primary key, the foreign key and UNIQUE. You'll find that while there are a ton of integrity constraints, not all of them are very useful in practice.

Stored procedures and functions


So far in this eBook we have covered how to build queries as single executable statements. However, you can place a number of statements into what is known as a stored procedure or function within SQL Server and call them whenever required.
Stored procedures and functions have a number of benefits beyond code reuse, including better security, reduced development cost, consistent and safe data, modularization, and the sharing of application logic.
Stored procedures and functions are similar in that they both store and run code, but functions are executed within the context of another unit of work.

T-sql
T-SQL, or Transact Structured Query Language, is an extension of the SQL commands that have been executed thus far in this eBook. T-SQL offers a number of extra features that are not available in standard SQL. These features include local variables and programming elements, which are used a lot in stored procedures.

Creating a stored procedure


To create a stored procedure, you begin with the CREATE PROCEDURE
statement. After which you will have access to the (as mentioned) additional
programming commands in T-SQL. The following is the syntax for creating
a stored procedure:
CREATE PROCEDURE procedureName
[ { @parameterName} datatype [= default_value] [OUTPUT]]
WITH RECOMPILE, ENCRYPTION, EXECUTE AS Clause
AS
BEGIN
SELECT * FROM tableName1
END
You must give the procedure a name so it can be referenced later; the sp_ prefix is commonly seen, although Microsoft reserves sp_ for system procedures, so a prefix such as usp_ is a safer habit.
The next thing to do is define the optional input and output parameter names. Parameters are used to pass information into a stored procedure. They are prefixed by the @ symbol, must have a data type specified, and are placed in parentheses separated by commas, for example @customerID varchar(50).
There are a number of different ways in which you can execute the query.
You can specify Recompile to indicate that the database engine doesn’t cache
this stored procedure, so it must be recompiled every time it's executed. You
can use the encryption keyword to hide the stored procedure so it’s not
readily readable. The EXECUTE AS Clause identifies the specific security
context under which the procedure will execute, i.e. control which user
account is used to validate the stored procedure.
After you declare the optional parameters, you use the mandatory keyword AS, which marks the start of the T-SQL code, and the body finishes with END. You can use a stored procedure for more than just regular SQL statements like SELECT; for example, you can return a value, which is useful for error checking.
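As a concrete, hedged sketch of that syntax (the Orders table and its columns are invented for illustration), here is a small procedure that takes a customer ID as an input parameter and returns that customer's orders:
CREATE PROCEDURE usp_GetCustomerOrders
@customerID INT
AS
BEGIN
-- Return every order placed by the given customer
SELECT OrderID, OrderDate, Amount
FROM Orders
WHERE CustomerID = @customerID
END
You would then call it with EXEC usp_GetCustomerOrders @customerID = 42 ;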

Controlling the execution of the stored procedure


When you create a stored procedure, you often need to control the T-SQL between the BEGIN and END statements when dealing with more than one T-SQL statement. You can use the following: IF ELSE, BEGIN END, WHILE BREAK and CASE.

If else
Often, a stored procedure will contain statements for which you need a logical true or false answer before you can proceed to the next statement. The IF ELSE statement facilitates this. To test for a true or false condition you can use >, <, = and NOT, along with testing variables. The syntax for the IF ELSE statement is the following; note that only one statement is allowed between each IF and ELSE:
IF X=Y
Statement when True
ELSE
Statement when False

Begin end
If you need to execute more than one statement in the IF or ELSE block, then you can use the BEGIN END statement. You can put together a series of statements which will run one after the other regardless of what was tested before them. The syntax for BEGIN END is the following:
IF X=Y
BEGIN
statement1
statement2
END

While break
When you need to loop over a piece of code a number of times, you can use the WHILE BREAK statement. It will keep looping until either the Boolean test condition becomes false or the code hits the BREAK statement. The WHILE statement will continue to execute as long as the Boolean expression returns true; once it is false, the loop ends and the next statement is executed. You can also use the optional CONTINUE statement, which moves processing right back to the WHILE statement. The syntax for the WHILE BREAK command is the following:
WHILE booleanExpression
SQL_statement1 | statementBlock1
BREAK
SQL_statement2 | statementBlock2
Continue
SQL_statement3 | statementBlock3
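Here is a small hedged example of that pattern (the variable name is invented): a counter loop that prints the numbers 1 to 5, with a BREAK acting as a safety valve.
DECLARE @counter INT = 1
WHILE @counter <= 5
BEGIN
PRINT @counter
SET @counter = @counter + 1
-- Safety valve: bail out if the counter somehow runs away
IF @counter > 100
BREAK
END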

Case
When you have to evaluate a number of conditions and possible answers, you can use CASE. The decision making is carried out within an initial SELECT or UPDATE statement. A CASE expression (not a statement) is then stated, after which each possibility is handled with a WHEN clause. You can use CASE as part of a SELECT, UPDATE or INSERT statement.
There are two forms of CASE. You can use the simple form to compare one value or scalar expression to a list of possible values and return a value for the first match, or you can use the searched CASE form when you need more flexibility to specify a predicate or mini function as opposed to an equality comparison. The following code illustrates the simple form:
SELECT column1
CASE expression
WHEN valueMatched THEN
statements to be executed
WHEN valueMatched THEN
statements to be executed
ELSE
statements to catch all other possibilities
END
The following code illustrates the more complex form; it is useful for computing a value depending on the condition:
SELECT column1
CASE
WHEN valueX_is_matched THEN
resulting_expression1
WHEN valueY_is_matched THEN
resulting_ expression 2
WHEN valueZ_is_matched THEN
resulting_ expression 3
ELSE
statements to catch all other possibilities
END
The CASE expression works like so: each table row is put through the CASE expression and, instead of the raw column value, the value produced by the computation is returned.
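As a hedged illustration of the searched form, reusing the CUSTOMER table from the constraints chapter, this query buckets each customer into an age band rather than returning the raw AGES value:
SELECT NAMES,
CASE
WHEN AGES < 21 THEN 'Under 21'
WHEN AGES < 65 THEN 'Adult'
ELSE 'Senior'
END AS AgeBand
FROM CUSTOMER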

Functions
As mentioned, functions are similar to stored procedures, but they differ in that functions (or User Defined Functions, UDFs) can execute within another piece of work; you can use them anywhere you would use a table or column. They are like methods: small and quick to run. You simply pass in some information and they return a result. There are two types of functions, scalar and table-valued, and the difference between the two is what you can return from the function.

Scalar functions
A scalar function can only return a single value of the type defined in the RETURNS clause. You can use scalar functions anywhere a scalar of the same data type could be used in a T-SQL statement. When calling them, parameters that have default values can be skipped by passing the DEFAULT keyword. You need to include a RETURN statement for the function to complete and return control to the calling code. The syntax for a scalar function is the following:
CREATE FUNCTION schema_Name.function_Name
(@parameterName datatype)   -- parameters
RETURNS dataType
AS
BEGIN
-- function code goes here
RETURN scalar_Expression
END
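To make that concrete, here is a hedged sketch of a scalar function (the name and the 8 percent rate are invented) that adds sales tax to an amount:
CREATE FUNCTION dbo.fn_AddTax (@amount DECIMAL(10, 2))
RETURNS DECIMAL(10, 2)
AS
BEGIN
-- Apply a flat 8 percent tax to the amount passed in
RETURN @amount * 1.08
END
It can then be used anywhere a scalar fits, for example SELECT dbo.fn_AddTax(100.00) ; which returns 108.00.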

Table-valued functions
A table-valued function (TVF) lets you return a table of data rather than the single value of a scalar function. You can use a table-valued function anywhere you would normally use a table, usually in the FROM clause of a query. With table-valued functions it is possible to create a reusable code framework in a database. The syntax of a TVF is the following:
CREATE FUNCTION function_Name (@variableName datatype)
RETURNS TABLE
AS
RETURN
SELECT columnName1, columnName2
FROM Table1
WHERE columnName > @variableName
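Because a TVF returns a table, you call it in the FROM clause exactly as if it were a table. Using the skeleton above, with 100 standing in as an example value for @variableName:
SELECT columnName1, columnName2
FROM function_Name(100) ;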

Notes on functions
A function cannot alter any external resource, such as a table. A function needs to be robust; if an error is generated inside it, either from invalid data being passed in or from the logic itself, it will stop executing and control will return to the T-SQL that called it.
Chapter 3 - Database Backup and Recovery
As mentioned, the most important task a DBA can perform is backing up the database. When you create a maintenance plan, it's important to have the backup at the top of the list in case the job doesn't run to completion. First, it is important to understand the transaction log and why it matters.

The transaction log


Whenever a change is made to the database, be it a transaction or a modification, it is recorded in the transaction log. The transaction log is the most important file in a SQL Server database, and everything revolves around either saving it or using it.
The transaction log facilitates transaction recovery, the rollback of incomplete transactions, rolling a restored file, filegroup or page forward to the point of failure, transactional replication, and disaster recovery.

Recovery
The first step in backing up a database is choosing a recovery option for the
database. You can perform the three types of backups when SQL Server is
online and even while users are making requests from the database.

Recovery models
When you back up and restore in SQL Server, you do so in the context of the recovery model, a setting designed to control the maintenance of the transaction log. The recovery model is a database property that controls how transactions are logged.

Simple recovery
You cannot back up the transactional log when utilizing the simple recovery
model. Usually this model is used where updates are infrequent.
Transactions are logged to a minimum and the log will be truncated.

Full recovery
In the full recovery model the transaction log backup must be taken. Only
when the backup process begins will the transactional log be truncated. You
can recover to any point in time. However, you also need the full chain of
log files to restore the database to the nearest time possible.

Bulk logged
This model is designed to be utilized for short term use when you use a bulk
import operation. You use it along with full recovery model whenever you
don’t need a certain point in time recovery. It has performance gains and also
doesn’t fill up the transaction log.

Changing the recovery model


To change the recovery model, you can right click on a database in SQL
Server Management Studio and selecting properties, then select options and
then selecting the recovery model from the drop-down box. Or you can use
one of the following three:
ALTER DATABASE SQLEbook SET RECOVERY SIMPLE
GO
ALTER DATABASE SQLEbook SET RECOVERY FULL
GO
ALTER DATABASE SQLEbook SET RECOVERY BULK_LOGGED
GO

Backups
There are three types of backup: full, differential and transaction log:

Full backup
When you create a full backup, SQL Server creates a CHECKPOINT, which ensures that any dirty pages that exist are written to disk. SQL Server then backs up each and every page of the database. Finally, it backs up enough of the transaction log to ensure transactional consistency. What all of this means is that you are able to restore your database to its state at the time of the backup, with every transaction right up to the end of the backup included.
Differential backup
The differential backup, as its name suggests, backs up every page in the database that has been modified since the last full backup. SQL Server keeps track of which pages have been modified via flags held on DIFF pages.

Transaction log backup


With a log backup, SQL Server backs up the data in the transaction log only, i.e. only the transactions committed since the previous log backup. The transaction log backup is not as resource hungry, which is exactly why it is so useful: you can run it far more often without a noticeable impact on database performance.
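Backups can also be taken with T-SQL rather than through the SSMS dialogs. Here is a hedged sketch (the SQLEbook database name is reused from the recovery model examples above, and the file paths are invented):
-- Full backup of the whole database
BACKUP DATABASE SQLEbook
TO DISK = 'C:\Backups\SQLEbook_full.bak'
GO
-- Transaction log backup; only valid under the full or bulk-logged recovery model
BACKUP LOG SQLEbook
TO DISK = 'C:\Backups\SQLEbook_log.trn'
GO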

Backup strategy
When database administrators set out a backup plan, they base it on two measures: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the amount of time it should take to recover after notification of a disruption to the business process. RPO is the maximum amount of time that may pass during a disruption before the amount of data lost exceeds what the business process can tolerate.
If you had an RPO of 60 minutes, you couldn't achieve that goal with a backup set to run every 24 hours. You need to set your backup plan based on these two measures.

Full backup
Relying on this alone is the least flexible option. Essentially, you're only able to restore your database back to one point in time, which is the last full backup. So, if the database became corrupt two hours before the next midnight backup (and you back up at midnight), your data loss would be twenty-two hours. Likewise, if a user truncated a table at that same moment, you would have the same twenty-two-hour loss of business transactions.

Full backup and log backup


If you have selected the full recovery model, you can run both full backups and transaction log backups. You can run backups more frequently since a transaction log backup takes fewer resources. This is a very good choice if your database is updated throughout the day.
When you are scheduling transaction log backups, it's best to follow the RPO. So, if you have an RPO of 60 minutes, set the log backups to run every 60 minutes. However, you must also check the RTO for such a plan: if you had an RPO of 60 minutes and were only performing a full backup once a week, you might not be able to restore the long chain of log backups within the allotted recovery time.

Full, differential and log backup


To get around the problem mentioned above, you can add differential backups to the plan. A differential backup is cumulative, which means a serious reduction in the number of backups you would have to restore to recover your database to the point just before the failure.

Performing a backup
To back up a database, right-click the database in SSMS, then select Tasks -> Back Up. You can select what kind of backup to perform (full, differential or transaction log) and when to perform it. The copy-only backup option allows you to take a backup that doesn't affect the restore sequence.

Restoring a database
When you want to restore a database in SSMS, right-click the database then select Tasks -> Restore -> Database. You can select the database from the drop-down, and the rest of the tabs will be populated.
If you click on Timeline you will see a graphical diagram of when the last backup was created, which shows how much data would be lost. You can recover to the end of the log or to a specific date and time.
The Verify Backup Media button enables you to verify the backup media before you actually restore it. If you want to change where the database files are going to be restored, you can click on Files to select a different location. You can specify the restore options you are going to use on the Options page: either overwrite the existing database or keep it, and the recovery state either brings the database online or allows further backups to be applied.
Once you click OK at the bottom, the database will be restored.
Sequences
A sequence refers to a set of numbers that have been generated in a specified
order on demand. These are popular in databases. The reason behind this is
that sequences provide an easy way to have a unique value for each row in a
specified column. In this chapter, we will explore how to use sequences in
SQL.
AUTO_INCREMENT Column
This provides you with the easiest way of creating a sequence in MySQL.
You only have to define the column as auto_increment and leave MySQL to
take care of the rest. To show how to use this property, we will create a
simple table and insert some records into the table.
The following command will help us create the table:
CREATE TABLE colleagues
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id),
name VARCHAR(20) NOT NULL,
home_city VARCHAR(20) NOT NULL
);
The command should create the table successfully.
We have created a table named colleagues. This table has 3 columns namely
id, name and home_city. The first column is of integer data type while the
rest are varchars (variable characters). We have added the auto_increment
property to the id column, so the column values will be incremented
automatically. When entering data into the table, we don’t need to specify the
value of this column. The reason is that it will start at 1 by default then
increment the values automatically for each record you insert into the table.
Let us now insert some records into the table:
INSERT INTO colleagues
VALUES (NULL, "John", "New York");
INSERT INTO colleagues
VALUES (NULL, "Joel", "New Jersey");
INSERT INTO colleagues
VALUES (NULL, "Cate", "New York");
INSERT INTO colleagues
VALUES (NULL, "Boss", "Washington");
The commands should run successfully.
Now, we can run a SELECT statement against the table to see its contents. We see that the id column has been populated automatically with values starting from 1; each time you insert a record, the value of this column is incremented by 1. We have successfully created a sequence.
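For reference, based on the four INSERT statements above, querying the table should produce rows along these lines:
SELECT * FROM colleagues;
-- id | name | home_city
-- 1  | John | New York
-- 2  | Joel | New Jersey
-- 3  | Cate | New York
-- 4  | Boss | Washington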
Renumbering a Sequence
You will notice that when you delete a record from a sequence such as the one we have created above, the remaining records are not renumbered. You may not be impressed by that kind of numbering; however, it is possible for you to re-sequence the records. It only involves a single trick, but be careful to check whether the table is joined to another table first.
If you find you have to re-sequence your records, the best way to do it is by dropping the column and then adding it back. Let us show how to drop the id column of the colleagues table.
The table is as follows for now:
Let us drop the id column by running the following command:
ALTER TABLE colleagues DROP id;
To confirm that the deletion has taken place, let us view the table data.
The deletion was successful. We combined ALTER TABLE with the DROP command to delete the column. Now, let us re-add the column to the table:
column to the table:
ALTER TABLE colleagues
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
ADD PRIMARY KEY (id);
The command should run successfully.
We started with the ALTER TABLE command to specify the name of the
table we need to change. The ADD command has then been used to add the
column and set it as the primary key for the table. We have also used the
auto_increment property in the column definition. We can now query the
table to see what has happened:
The id column was added successfully. The sequence has also been
numbered correctly.
By default, MySQL starts the sequence at 1. However, it is possible to change the value the sequence continues from. For example, on the colleagues table we can tell MySQL that the next auto_increment value it hands out should be 2. This can be done by running the following command:
ALTER TABLE colleagues AUTO_INCREMENT = 2;
The command should run successfully. Note that this statement sets the next value of the sequence, not the step size; the amount by which the counter increases for each new record is controlled separately by the auto_increment_increment server variable.
We can also specify where the auto_increment will start at the time the table is created. The following example shows this:
CREATE TABLE colleagues2
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id),
name VARCHAR(20) NOT NULL,
home_city VARCHAR(20) NOT NULL
) AUTO_INCREMENT = 10;
In the above example, we have set the auto_increment property on the id column and, through the AUTO_INCREMENT table option, the initial value for the column will be 10.
Chapter 4 - Sql Aliases
SQL allows you to rename a table or a column temporarily. The new name is
referred to as an alias. Table aliases help us in renaming tables in SQL
statements. Note that the renaming is only temporary, meaning it won’t
change the actual name of the table. We use the column aliases to rename the
columns of a table in a certain SQL query.
Table aliases can be created using the following syntax:
SELECT column_1, column_2....
FROM tableName AS aliasName
WHERE [condition];
Column aliases can be created using the following syntax:
SELECT columnName AS aliasName
FROM tableName
WHERE [condition];
To demonstrate how to use table aliases, we will use two tables, the students'
table and the fee table.
The students table has the following data:
The fee table has the following data:
We can now run the following command showing how to use table aliases:
SELECT s.regno, s.name, s.age, f.amount
FROM students AS s, fee AS f
WHERE s.regno = f.student_regno;
The command should return the following result:
We have used the alias s for the students table and the alias f for the fee table. We have fetched three columns from the students table and one column from the fee table.
A column alias can be created as showed below:
SELECT regno AS student_regno, name AS student_name
FROM students
WHERE age IS NOT NULL;
Upon execution, the field with the registration numbers is returned under the title student_regno, while the field with the student names is returned under the title student_name. This is because these are the aliases we gave to those columns.
Chapter 5 - Database Normalization
Now that you’re more familiar with database components, like primary and
foreign keys, let’s review database normalization.
By standard practice, every database must go through the normalization
process. Normalization is a process that was created by Robert Boyce and
Edgar Codd back in the 1970’s in order to optimize a database as much as
possible. Each step of the normalization process has what’s known as a form,
which ranges from one to five, where five is the highest normal form.
Though, typically, you can implement up to the third normal form in most
databases without negatively impacting functionality.
The main goal is to maintain the integrity of the data, optimize the efficiency
of the database, provide a more efficient method in tracking and storing data
and help avoid any data issues along the way.
Speaking of avoiding data issues, there are some points to be aware of, like
data anomalies, that can create data issues in the database if the conditions of
a normal form are not met. There are three types of anomalies: insert, update
and delete.
Below is a table that will be used to explain data anomalies.
Insert anomaly:
This occurs when we’re not able to add a new record unless other attributes
exist already. For instance, let’s say there’s a new product that will be sold
but the company doesn’t have a supplier yet. We’d have to wait to find a
valid supplier in order to enter that here, instead of just adding product
information.
Update anomaly:
This occurs when one value needs to be changed in many places, rather than
being changed in only one place. For example, if the supplier changes their
name, like Friendly Supplements, Co., then we have to update that in every
row that it exists.
Delete anomaly:
This occurs when there’s data that we’d like to remove, but if it were to be
removed, we’d be forced to remove other values that we’d like to keep. Let’s
say the energy drink isn’t sold anymore, so this row is deleted. Then all of the
other values will be deleted also. But perhaps we want to know who supplied
that product originally as a way of keeping track of the supplier’s
information.
Now that you’re aware of data anomalies and how they can create issues,
let’s move to the first step in normalization.

First Normal Form (1NF):


This is the first step in normalization. This must be satisfied before moving
onto the next step in the normalization process. Below are its conditions and
the goal of first normal form:

No multiple values can be used in a single cell


Eliminate repeating groups/columns of data
Identify the primary or composite key of each table
Below is an example of a table that is not normalized. You can see that the ‘Amount’ column has multiple values in it.
To satisfy the above condition to only use one value per cell, we can split the
table up like this:
Now, there aren’t any duplicate columns in this table, but the Product ID
(which is intended to be the primary key for each product) is not unique since
each ID shows up more than once. To satisfy the condition of a primary key,
the table above can be split into two tables.
Note that ‘PK’ stands for primary key.
Second Normal Form (2NF):
This is the second step in the normalization process and to reach this point,
you must first satisfy all of the conditions in 1NF. Below are the conditions
of 2NF and its goals:

First normal form (1NF) must be satisfied


Remove any non-key column that’s not dependent on the
primary key
Implement a foreign key
Let’s use a different, yet similar example from the last one to demonstrate the
process of Second Normal Form.
To expand upon the second bullet point of “Remove any non-key column not
dependent on the primary key”, this means that any column that’s not directly
relative to the primary key should be moved to a different table.
In the above example, the ‘Product ID’ is the primary key for the Products.
There’s also ‘Product Name’ and ‘Product Description’, which are both
dependent on the primary key.
How about the ‘Supplier’? Is the name of the Supplier relative to the primary
key? No, it should be relative to its own primary key, like a ‘Supplier ID’. So,
these values should be split up into their own table.
Now, to expand upon the foreign key, this is used to link tables together by a
common or relative column. As I mentioned previously, these also enforce
referential integrity (meaning table relationships should always be consistent)
and help avoid data issues.
Below, we’ve added the foreign key ‘Supplier ID’ to the Products table. This
foreign key will reference the primary key in the Supplier’s table.
So what happens if we try to add a record to the Products table with a
‘Supplier ID’ of 3? Well, the database will throw an error because the
Supplier ID of 3 doesn’t exist in the Supplier’s table. Essentially, the
database will use the foreign key as a reference before inserting a new record
to see if that particular Supplier ID exists.
Note in the above example that we’re adding a new product, Ground Coffee,
with a Product ID of 4 which works. However, the Supplier ID is 3, which is
referencing the Supplier’s table. There currently is no Supplier with an ID of
3. This insert would then fail.
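The same idea expressed as SQL, in a hedged sketch (table and column names such as SupplierID are simplified stand-ins for the ones shown in the illustration):
CREATE TABLE SUPPLIERS (
SupplierID INT NOT NULL,
SupplierName VARCHAR (50) NOT NULL,
PRIMARY KEY (SupplierID)
);
CREATE TABLE PRODUCTS (
ProductID INT NOT NULL,
ProductName VARCHAR (50) NOT NULL,
SupplierID INT,
PRIMARY KEY (ProductID),
FOREIGN KEY (SupplierID) REFERENCES SUPPLIERS (SupplierID)
);
-- This insert fails while no supplier with SupplierID = 3 exists
INSERT INTO PRODUCTS (ProductID, ProductName, SupplierID)
VALUES (4, 'Ground Coffee', 3);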

Third Normal Form (3NF):


This is the third step in the normalization process. This states that second
normal form must be satisfied, as well as first normal form (since that must
be satisfied before moving forward with second normal form). Below are its
conditions and goals:

Second normal form (2NF) must be satisfied


Eliminate any transitive functional dependencies
Let’s discuss “eliminating transitive functional dependencies” further. Below
is an example of a table that will be used to explain.
In the table above, you can see that the ‘Product Name’ is defined by the
‘Product ID’, which is correct. However, you can also see that based on the
Product ID and its name, this also determines its category (‘Category ID’ and
‘Category Name’).
Since this is the case, then the Product ID determines the Category ID, thus
determining the name of the Category. The ‘Category ID’ and its name
should not be dependent on the ‘Product ID’. In order to satisfy third normal
form (3NF) here, we have to split these tables up into two.
Note that the Products table has the ‘Product ID’ and ‘Product Name’, as well
as ‘Category ID’. The ‘Category ID (FK)’ merely references the ‘Category
ID (PK)’ in the Categories table.
Below, the ‘Category Name’ is properly defined by the ‘Category ID’, as
opposed to the ‘Product ID’.
To further expand this, we could also create a “mapping” table that holds the
‘Product ID’ and ‘Category’ ID, both would be foreign keys that reference
the primary key in their respective tables.
We could then remove the Category ID from the Product table.
Another way to look at the conditions for transitive dependencies is to look at
every column in the table and see if it relates to the table’s primary key. If it
doesn’t, move that column or set of columns to a new table and properly
define those with a primary key.
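A hedged sketch of that mapping table (again using simplified stand-in names, and assuming the PRODUCTS and CATEGORIES tables already exist), where both columns are foreign keys back to their own tables and together form the primary key:
CREATE TABLE PRODUCT_CATEGORIES (
ProductID INT NOT NULL,
CategoryID INT NOT NULL,
PRIMARY KEY (ProductID, CategoryID),
FOREIGN KEY (ProductID) REFERENCES PRODUCTS (ProductID),
FOREIGN KEY (CategoryID) REFERENCES CATEGORIES (CategoryID)
);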
There are a few more forms as well. Though, even if these aren’t used and a
database has been normalized up to the third normal form, it should not affect
the database’s functionality.
Next, you’ll find just an overview of the next several normal forms.

Boyce-Codd Normal Form (BCNF):


This normal form, also known as 3.5 Normal Form, is an extension of 3NF. It
is considered to be a stricter version of 3NF, in which records within a table
are considered unique. These unique values are based upon a composite key,
which is created by a combination of columns. Though, this does not always
apply or need to be applied for every table, because sometimes the data in a
table does not need to be normalized up to BCNF.

Fourth Normal Form (4NF):


This is the second to last step in normalization. For it, the previous form
(BCNF) must be satisfied. This particular form deals with isolating
independent multi-valued dependencies, in which one specific value in a
column has multiple values dependent upon it. You’d most likely see this
particular value several times in a table.

Fifth Normal Form (5NF):


This is the last step in normalization. The previous normal form must be
satisfied (4NF) before this can be applied. This particular form deals with
multi-valued relationships being associated to one another and isolating said
relationships.
Chapter 6 - SQL Server and Database Data Types
To be able to hold data in certain columns, SQL Server and other relational
database management systems utilize what are called “data types.”
There are different data types available, depending on what data you plan to
store.
For instance, you may be storing currency values, a product number and a
product description. There are certain data types that should be used to store
that information.
The majority of the data types between each RDBMS are relatively the same,
though their names differ slightly, like between SQL Server and MySQL.
There are a lot of data types, though some are more frequently used than
others. The following is a list of common ones that you may find or work
with.
The AdventureWorks2012 database will be used as an example.
VARCHAR
This is an alphanumeric data type, great for holding strings like first and last
names, as well as an email address for example. You can specify the length
of your varchar data type like so when creating a table: VARCHAR(n). The
value of ‘n’ can be anywhere from 1 to 8,000, or you can substitute MAX,
which is 2 to the 31st power, minus 1. However, that length is rarely used.
When designing your tables, estimate the length of the longest string plus a
few bytes to be on the safe side. If you know that the strings you will be
storing will be around 30 characters, you may want to specify VARCHAR(40)
to be on the safe side.


This data type is flexible in a sense to where it will fit only the characters
entered into it, even if you don’t insert 40 characters like in the example
above.
However, there is a bit of overhead with storage, as it will add 2 bytes to your
entire string. For instance, if your string is 10 bytes/characters in length, then
it will be 12 in all actuality.
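As a quick, hedged example (the table and column names are invented for illustration), a VARCHAR column could be declared like this:

-- FirstName holds strings of up to 40 characters; a 10-character name
-- uses 10 bytes for the data plus 2 bytes of overhead
CREATE TABLE Customers (
FirstName VARCHAR(40),
Email VARCHAR(100)
);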
NVARCHAR
Much like the varchar data type, this is alphanumeric as well. However, it
also stores international characters. So this is a good option if you end up
using characters and letters from another country’s language.
The other difference between VARCHAR and NVARCHAR is that
NVARCHAR’s values go up to 4,000 instead of 8,000 like VARCHAR.
Though, they are defined in the same way: NVARCHAR(n), where ‘n’ is the
length in characters.

EXACT NUMERICS
There are various number data types that can be used to represent numbers in
the database. These are called exact numbers.
These types are commonly used when creating IDs in the database, like an
Employee ID for instance.
Bigint – Values range from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807, which isn’t used so frequently.
Int – most commonly used data type and its values range from
-2,147,483,648 to 2,147,483,647
Smallint – Values range from -32,768 to 32,767
Tinyint – Values range from 0 to 255
In any case, it’s best to pick the data type that will be the smallest out of all of
them so that you can save space in your database.

DECIMAL
Much like the exact numeric data types, this holds numbers; however, they
are numbers including decimals. This is a great option when dealing with
certain numbers, like weight or money. Decimal values can only hold up to
38 digits, including the decimal points.
In order to define the length of the decimal data type when creating a table,
you would write the following: DECIMAL(precision, scale). Precision is
indicative of the total amount of digits that will be stored both to the left and
to the right of the decimal point. Scale is how many digits you can have to the
right of your decimal point.
Let’s say that you wanted to enter $1,000.50 into your database. First, you
would change this value to 1000.50 and not try to add it with the dollar sign
and comma. The proper way to define this value per the data type would be
DECIMAL(6,2).
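Putting that example into a table definition (the table and column names are assumptions), it could look like this:

-- Price can hold up to 6 digits in total, 2 of them after the decimal point
CREATE TABLE Orders (
OrderID INT,
Price DECIMAL(6,2)
);

INSERT INTO Orders VALUES (1, 1000.50);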
FLOAT
This data type is similar to the Exact Numerics as previously explained.
However, this is more of an Approximate Numeric, meaning it should not be
used for values that you do not expect to be exact. One example is that they
are used in scientific equations and applications.
A FLOAT column stores approximate values with up to 15 digits of
precision. It uses scientific notation and its range is from
-1.79E+308 to 1.79E+308. The “E” represents an exponent. In this
case, the lowest value is -1.79 multiplied by 10 to the 308th power, and the
maximum value is 1.79 multiplied by 10 to the 308th power (notice how this
is in the positive range now).
To specify a float data type when creating a table, you’d simply specify the
name of your column and then use FLOAT. There is no need to specify a
length with this data type, as it’s already handled by the database engine
itself.

DATE
The DATE data type in SQL Server is used quite often for storing dates of
course. Its format is YYYY-MM-DD. This data type will only show the
month, day and year and is useful if you only need to see that type of
information aside from the time.
The values of the date data type range from ‘0001-01-01’ to ‘9999-12-31’.
So, you have a lot of date ranges to be able to work with!
When creating a table with a date data type, there’s no need to specify any
parameters. Simply inputting DATE will do.
DATETIME
This is similar to the DATE data type, but more in-depth, as this includes
time. The time is denoted in seconds; more specifically it is accurate by
0.00333 seconds.
Its format is as follows: YYYY-MM-DD HH:MI:SS. The values of this data
type range between '1753-01-01 00:00:00' and '9999-12-31 23:59:59'.
Just as the DATE data type, there is no value or length specification needed
for this when creating a table. Simply adding DATETIME will suffice.
If you’re building a table and are deciding between these two data types,
there isn’t much overhead between either. Though, you should determine
whether or not you need the times or would like the times in there. If so, then
use the DATETIME data type, and if not, use the DATE data type.

BIT
This is an integer value that can either be 0, 1 or NULL. It’s a relatively small
data type in which it doesn’t take up much space (8 bit columns = 1 byte in
the database). The integer value of 1 equates to TRUE and 0 equates to
FALSE, which is a great option if you only have true/false values in a
column.
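To tie a few of these types together, here is a hedged sketch of a table definition (the table and column names are invented for illustration):

-- OrderDate stores only a date, ShippedAt stores a date and time,
-- and IsPaid is a true/false flag stored as a BIT
CREATE TABLE OrderStatus (
OrderID INT,
OrderDate DATE,
ShippedAt DATETIME,
IsPaid BIT
);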
Chapter 7 - Downloading and Installing SQL
Server Express
Before we go any further, I want you to download and install SQL Server on
your own computer. This is going to help you tremendously when it’s time to
write syntax and it’s also necessary if you want to gain some hands-on
experience.
Note: if you have performed this before, you don’t have to follow this step-
by-step, but make sure you’re installing the proper SQL Server “Features”
that we’ll be using, which is shown in this section of the book. If you haven’t
performed this before, just follow along!
Click the link to be taken to the download page for SQL Server:
https://www.microsoft.com/en-us/download/details.aspx?id=54276
Scroll down the page and choose the language that you’d like to use and click
‘Download’.
After the download has completed, open the file.
When the install window comes up, it provides the “Basic”, “Custom” or
“Download Media” options. Let’s select “Custom”.
On the next page, where it asks where SQL Server should be installed, keep the default settings.
After that, select “Install”.
Note: If you don’t receive the below menu after the install, just navigate to
the path above on your computer, open the ExpressAdv folder and click on
the SETUP.EXE file and you’ll be launched right into it!
It may take a little while, so go ahead and grab a snack or perhaps some
coffee while you wait.
After the installer has finished, you’ll then be brought to a setup screen with
several options. You’ll land on the “Installation” menu and select the option
to add a New SQL Server stand-alone installation.
Go ahead and accept the license agreements. It will then run a few quick
verification processes.
On the “Install Rules” menu, ensure that you’re receiving a “Passed” status
for just about every rule, which you most likely will. If you end up getting a
Windows Firewall warning like me, ignore it and continue anyway.

Feature Selection
Here, you’ll be able to select the features you want to install on your SQL
Server Instance. Thankfully Microsoft has provided a description of what
each feature is on the right-hand side. Make sure that you have the following
items checked, as these will all be a part of the features available when you
use SQL Server:
Instance Features:

Database Engine Services

R Services (In-Database)
Full-Text and Semantic Extractions for Search

Reporting Services – Native (which is what you can use if


you ever want to delve into SQL Server Reporting Services
or SSRS for short)

Shared Features:

Client Tools Connectivity


Client Tools Backwards Compatibility
Client Tools SDK
Documentation Components
SQL Client Connectivity SDK
Keep the default path in the Instance Root Directory as well:
On the “Instance Configuration” page, keep the default settings for the
default instance.
Next is the “Server Configuration” page. Here, you can specify which
accounts you’d like to run the SQL related services. I’m keeping mine as the
default and I recommend that you do the same. You can always change these
later if you’d like.
No need to worry about ‘Collation’, as this is relative to the language
standards that are used when sorting by strings, etc. This is also important if
you plan on using Unicode in your database (like storing foreign languages).
On the “Database Engine Configuration” page, you can specify how you
want accounts to log in to the database engine. I recommend using ‘Mixed
Mode’, as it will be a great learning opportunity when creating database and
server roles. You can even authenticate using your own Windows account,
too.
When using ‘Mixed Mode’, it needs to ask you for a password for the ‘sa’
account (System Administrator). Create a password that you’ll remember and
store it away somewhere safe – you’ll need it later.
Also, keep whatever populates in the area for SQL Server Administrators. If
nothing populates automatically, click ‘Add Current User’.
Hop on over to the ‘Data Directories’ tab. Here, you can specify where you
want the data, logs and backups stored. If you have another hard drive on
your computer that always stays connected, feel free to use that, as the data
has the potential to grow rather large. But if you only have one hard drive,
that will do.
By default, the installation will keep the logs and data in the same folder. I’ve
changed mine and used the following directories. Feel free to modify yours as
you’d like, but keep it simple. Also, the install package will create these
folders automatically during the installation process.
My configuration:
User database directory: C:\Program Files\Microsoft SQL
Server\MSSQL13.MSSQLSERVER\MSSQL\Data
User database log directory: C:\Program Files\Microsoft SQL
Server\MSSQL13.MSSQLSERVER\MSSQL\Logs
Backup directory: C:\Program Files\Microsoft SQL
Server\MSSQL13.MSSQLSERVER\MSSQL\Backups
On the ‘TempDB’ tab, keep all of the default settings. I recommend changing
the path for the logs here as well, but you can keep the path for the data
directory as is.
As for ‘User Instances’ and ‘FILESTREAM’, you can keep these as they are
and don’t necessarily need to tab over to them, unless you’d like to. Other
than that, click ‘Next’ when ready.
If you selected the Reporting Services – Native option in the “Feature
Selection” page, then you’ll be asked for options when installing and
configuring it on the “Reporting Services Configuration” page. Keep the
default ‘Install and Configure’ option and click ‘Next’.
Then, continue with the setup and Accept the terms of use. After that, it will
start installing. Once the install has finished, you’re ready to go!
Chapter 8 - Deployment
Deployment is a relatively weak part of Python world. There are many
excellent packages available. They are collected in nice distributions, but
your users have to download sometimes huge distributions to use your
program. Often, a program requires some additional packages that need to be
installed on top of the standard distribution. This is already a challenge for a
regular user. But should more than one Python distribution be present on the
system, the problem quickly gets out of control.
Fortunately, there are several packages that allow you to pack your program,
Python interpreter, and dependency libraries into a neat installer or even one
executable file. I'll show how to use pyinstaller, which allows you to create
installers for your program on all major platforms.
Even though pyinstaller is cross platform, you still need to package your
program on the platform for which you want to make your installer. If you
want to generate an installer for Windows, make sure your program works on
a Windows machine and package it there. If you want an installer for OS/X,
make your program work on an OS/X box and package it on Mac. Oh, and
programs packaged on 64 bit systems will not run on a 32 bit version of the
same system.
If it is OS/X, Linux, or another UNIX-like platform, you may want to
package your program on the oldest available OS version. The packaged
program is dynamically linked to the system's C library and can work with
later versions, but might fail on earlier ones. You don't need separate
computers for all OSs though. You can use a VirtualBox environment to run a
guest OS on the same physical computer.
It is very handy that Anaconda distribution can work on all platforms. You
can install it and the necessary packages for all systems, and just transfer
your code to package it on all systems. Unfortunately, there might still be
some problems with packaging installers from Anaconda. At the time of this
writing, pyinstaller is only available as an Anaconda add on package on
Linux, which might reflect problems on other platforms. But a few months
ago, OpenCV was also only available for the Linux version of Anaconda.
Now, it is available on Windows and OS/X too. So maybe pyinstaller will be
available for Anaconda on all platforms.
For now, on Windows and OS/X, you can install it from the Python Package
Index (PyPI). Open your terminal window and run:
pip install pyinstaller
If you have more than one Python installation on your computer, you might
type the complete path to Anaconda's bin directory to make sure pyinstaller
will be installed for the right distribution. On Windows, both Anaconda and
winpython provide a dedicated command prompt shortcut that works with the
given distribution. You can find it in the Start menu for Anaconda or in the
Winpython folder respectively.
After pyinstaller is installed for your given Python distribution, all you have
to do is start a command from the command prompt window.
pyinstaller yourscript.py
where yourscript.py is the name of your main program module. If you have a
GUI program and want to suppress the console window on Windows or
OS/X, use:
pyinstaller -w yourscript.py
Pyinstaller packages the distribution version of your program in a folder
named dist. A subfolder with the same name as the packaged program contains an
executable with the program's name as well as all the dependencies, including
Python itself. To distribute the program, all you have to do is give the user a
copy of this folder. Pyinstaller can create a single file executable and encrypt
Python program's byte code for you if you wish.
The size of the distribution is pretty big. The diet calculator takes over 20MB
of disk space. A Zip compressed version is a little over 9MB. This shouldn't
be an issue with modern hard drives and Internet connection speed though. It
is definitely much smaller than the entire Anaconda distribution, which is
close to 400MB when compressed, and far less confusing than making the
user install Python and possibly the add on packages your program might
require.
Chapter 9 - SQL Syntax And SQL Queries
SQL has its own language elements, which is executed on a CLI (Command
Line Interface). These are the necessary language commands that you can
utilize for your databases. This language is your SQL syntax.
On the other hand, the SQL queries are used to search the databases for the
data or files that you need.
It is important that you understand the basic language that is used for SQL
queries to proceed successfully.
The SQL syntax is the basis of SQL queries, so at times, they are
interchanged with one another.
SELECT STATEMENTS and SQL Queries
These SELECT STATEMENTS are not case-sensitive, but upper-case letters
are used in this book to facilitate reading. So, you can use lower case letters,
if you want.
SQL queries can be more specific through the use of clauses such as ORDER
BY (order by), FROM (from) and WHERE (where).
ORDER BY - is a clause that refers to the sorting of the data;
FROM - is the designated table for the search; and
WHERE - is the clause that defines the rows specified for the query.
Take note of the following important SQL commands too. The terms are self-
explanatory but for the sake of clarity, here they are:
SELECT – This command extracts the file/data from your database.
CREATE DATABASE – This command creates files/data.
DELETE – This command erases file/data from your database.
ALTER DATABASE – This command alters the file/data in your database.
INSERT INTO – This command will allow you to insert a new file/data into
your database.
CREATE TABLE – You can also create a new table in your database with
this command.
DROP TABLE – This command is specifically used in deleting tables in your
database.
CREATE INDEX – You can create an index with this command. An index is
the search key used for your database.
DROP INDEX – With this command, you can drop or delete your index from
your database.
IMPORTANT REMINDERS
SQL STATEMENTS (commands) are generally separated by a semicolon.
But in a few, new database systems, reportedly, they don’t make use of it. So,
be aware of this.
The semicolon is used to separate SQL SELECT STATEMENTS, when there
are more than one statements to be executed using the same server.
Below are examples of SELECT STATEMENTS or SQL Query

SELECT “column_name2”, “column_name3”


FROM “table_name1”
WHERE “column_name3”=’value’;

SELECT * FROM “table_name”
WHERE “column_name”=’value’
ORDER BY “column_name”;

More keywords and SQL commands will be introduced as you read the book,
so take it easy!

Common operators in sql


You will need them to define the values in your tables.
Here are the most common operators with their corresponding symbols:
Comparison Operators
Equal =
Not equal <>
Less than <
Less than or equal <=
Greater than >
Greater than or equal >=
Logical operators
LIKE - this keyword will allow you to retrieve the data that you will specify
in your LIKE statement.
ALL – this keyword is utilized to compare all values between tables.
BETWEEN – this keyword displays range values within a set from the
minimum to the maximum values. You can set the range of your values,
using this keyword.
IS NULL – this operator is used in the comparison of value to the NULL
value in a set.
AND – this operator is used to add more conditions in the WHERE clause of
your SQL query.
IN – this compares specified values in your tables.
OR – this operator is also used with the WHERE clause to specify more
conditions in a SQL query.
ANY – this operator compares a value to any specified value indicated in the
SQL statement.
EXISTS – this operator or keyword searches for the specified condition in
your SQL syntax.
UNIQUE – this operator will allow the display of only unique values.
Arithmetic operators
* The asterisk, when used as an arithmetic operator, will multiply the values
that are found before and after the symbol.
+ The plus sign will add the values that are positioned before and after the
plus sign.
/ The division sign will divide the left value with the right value of the sign.
- The minus sign will subtract the right value from the left value.
% The percent sign divides the left value with the right value, and displays
the remainder.
Learn how to use these operators properly to optimize your SQL statements
and obtain tables that can be useful to you.
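As a small, hedged example of combining these operators (the table and column names are placeholders), a query might look like this:

-- BETWEEN sets a salary range, while AND and OR combine the conditions
SELECT *
FROM employees
WHERE salary BETWEEN 30000 AND 60000
AND (department = 'Sales' OR department = 'Marketing');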

Commonly used symbols in sql


Before you can construct or create proper and correct SQL statements or
queries, you have to know the most commonly used symbols in SQL.
SQL symbols
Semicolon ;
This is used to end SQL statements or queries. It should always be added to
complete the query. An exception is Caché SQL, which does
not use semicolons.
Open and close parentheses ( )
These have several uses. Those are used to enclose data types, conditions
and sometimes names of columns. They are used also to enclose a
subquery in the “from” clause, and arithmetic equations. They are also used
when there are comma-separated lists of values.
Double quotes “ “
These indicate a delimited identifier or values.
Single quotes ‘ ’
These are usually used to enclose ‘strings’ of data or literal values in conditions.
Asterisk *
The asterisk indicates “all” data, columns or tables.
Underscore _
This is used in table or column names to identify them properly. It is also
used as an identifier.
Percent %
This is used as a wildcard character in LIKE patterns, where it matches any
sequence of characters in values such as names or keywords.
Comma ,
This symbol is used as a list separator such as, in a series of columns or
multiple field names.
Open and close square brackets [ ]
This is used to enclose a list of match data types, or characters, or pattern
strings.
Plus +
This is usually used in number operations.
There are still various symbols that you can learn as your knowledge
advances.
These common symbols are appropriate for a beginner, who is just starting to
learn SQL.
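As a hedged illustration of the percent symbol used as a wildcard (the table and values are made up):

-- Returns every customer whose name starts with 'Jo'
SELECT *
FROM customers
WHERE customer_name LIKE 'Jo%';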

How to create databases


As a beginner in SQL, you must know how to create DATABASES.
Databases are simply systematic collections of data that can be stored in your
computer, where they can be retrieved easily for future use.
The system is called DBMS (Database Management System), and is used
extensively by all institutions that need to store large volumes of information
or data.
Examples of data are: Name, Age, Address, Country, ID number and all vital
information that have to be organized and stored.
The retrieval of these databases is possible through database software or
programs such as, SQL, MySQL, ORACLE and other similar apps.
Creating databases is simple with this SQL statement:
Example: CREATE DATABASE “database_name”;
If you want to create a “MyStudents” database, you can state the SQL query
this way:
Example: CREATE DATABASE MyStudents;
If you want to create a “My_Sales” database, you can state your SQL this
way:
Example: CREATE DATABASE My_Sales;
The names of your databases must be unique within the RDBMS (Relational
Database Management System). After creating your database, you can now
create tables for your databases.
You can double check if your database exists by this SQL query:
Example: SHOW DATABASES;
This SQL statement will display all the databases that you have created.
It is important to note that your ability to retrieve or fetch the data that you
have stored is one vital consideration.
Therefore, you have to choose the most applicable and most appropriate SQL
server or software that you can optimize and synchronize with the computer
you are using.

Data types
There are various data types that you should be familiar with. This is because
they make use of SQL language that is significant in understanding SQL
more.
There are six SQL data types
Date and Time Data
As the name implies, this type of data deals with date and time.
Examples are: datetime (FROM Jan 1, 1753 TO Dec 31, 9999), smalldatetime
(FROM Jan 1, 1900 TO Jun 6, 2079), date (e.g., Jun 1, 2016) and time
(e.g., 3:20 AM).
Exact Numeric Data
Under exact numeric data, there are several subtypes too such as;
tinyint – FROM 0 TO 255
bit – FROM 0 TO 1
bigint – FROM -9,223,372,036,854,775,808 TO 9,223,372,036,854,775,807
numeric – FROM -10^38+1 TO 10^38-1
int - FROM -2,147,483,648 TO 2,147,483,647
decimal – FROM -10^38+1 TO 10^38-1
money – FROM -922,337,203,685,477.5808 TO 922,337,203,685,477.5807
smallmoney – FROM -214,748.3648 TO +214,748.3647
smallint – FROM -32,768 TO 32,767
Binary Data
Binary data have different types, as well. These are: Binary (fixed),
varbinary (variable length binary) varbinary (max) (variable length binary)
and image.
They are classified according to the length of their bytes, with Binary
having the shortest and the fixed value.
Approximate Numeric Data
These have two types, the float and the real. The float has a value FROM
- 1.79E +308 TO 1.79E +308, while the real data has a value FROM
-3.40E +38 TO 3.40E +38
Unicode Character Strings Data
There are four types of Unicode Character Strings Data namely; ntext,
nchar, nvarchar, and nvarchar (max). They are classified according to their
character lengths.
For ntext, it has a maximum character length of 1,073,741,823, which is
variable.
For nchar, it has a unicode maximum fixed length of 4,000 characters.
For nvarchar (max), it has a unicode variable maximum length of 2^31 – 1
characters.
For nvarchar, it has a variable maximum length of 4,000 unicode
characters.
Character Strings Data
The character Strings Data have almost similar types as the Unicode
Character Strings Data, only, some have different maximum values and
they are non-unicode characters, as well.
For text, it has a maximum variable length of 2,147,483,647 non-unicode
characters.
For char, it has a non-unicode maximum fixed length of 8,000 characters.
For varchar (max), it has a non-unicode variable maximum length of 2^31 – 1
characters.
For varchar, it has a variable maximum length of 8,000 non-unicode
characters.
Miscellaneous Data

Aside from the 6 major types of data, miscellaneous data are also stored as
tables, SQL variants, cursors, XML files, unique identifiers, cursors and/or
timestamps.
You can refer to this chapter when you want to know about the maximum
values of the data you are preparing.

Downloading sql software


Although, almost all of the SQL queries presented here are general, it would
be easy for you to adjust to whatever type of SQL server you will be using,
eventually.
Before you can perform any SQL task in your computer, you have first to
download a SQL software.
Since you’re a beginner, you can use the free MySQL databases software.
Hence, we will be focusing on how to download this application.

What is MySQL?
MySQL is a tool (database server) that uses SQL syntax to manage databases.
It is an RDBMS (Relational Database Management System) that you can use
to facilitate the manipulation of your databases.
If you are managing a website using MySQL, ascertain that the host of your
website supports MySQL too.
Here’s how you can install MySQL in your Microsoft Windows. We will be
using Windows because it is the most common application used in
computers.

How to install MySQL on Microsoft Windows in


your computer.
Step #1 – Go to the MySQL website
Go to www.mysql.com and browse through the applications to select
MySQL. Ascertain that you obtain the MySQL from its genuine website to
prevent downloading viruses, which can be harmful to your computer.
Step #2 – Select the ‘download’ option
Next, click on the download option this will bring you to the MySQL
Community Server, and to the MySQL Community Edition. Click
‘download’.
Step #3 – Choose your Windows’ processor version
Choose your Windows’ processor version by perusing the details given on
the page. Choose from the ‘other downloads’ label. You can choose the 32-
bit or 64-bit.
Click the download button for the Windows (x86, 32-bit), ZIP Archive or the
Windows (x86, 64-bit), ZIP Archive, whichever is applicable to your
computer.
Step #4 – Register on the site
Before you can download your selected version, you will be requested to
register by answering the sign in form for an Oracle account.
You don’t have to reply to the questions that are optional. You can also click
on the ‘no thanks’ button.
There is another option of just downloading the server without signing up,
but you will not be enjoying some freebies such as, being able to download
some white papers and technical information, faster access to MySQL
downloads and other services.
Step #5 – Sign in to your MySQL account
After registering, you can sign in now to your new account. A new page will
appear, select your area through the displayed images of flags. Afterwards,
you can click the download button and save it on your computer.
This can take several minutes.
Step #6 – Name the downloaded file
After downloading the file. You can name your MySQL file and save it in
your desktop or C drive. It’s up to you, whichever you prefer.
Step #7 – Install your MySQL Server
Click the file to open it and then click ‘install’ to install MySQL on your
computer. This will open a small window on your computer that will ask if
you want to open and install the program. Just click the “OK” button.
Step #8 – Browse your MySQL packages
The MySQL Enterprise Server page will appear giving you some information
about what your MySQL package contains.
There are packages offered for a small fee, but since we’re just interested in
the community server, just click ‘next’ until you reach the ‘finish’ button.
Step #9 – Uncheck the box ‘Register the MySQL Server now’
After the Wizard has completed the set-up, a box appears asking you to
configure and register your MySQL Server. Uncheck the ‘Register the
MySQL Server now’ box, and check the small box for the “Configure the
MySQL Server now’.
Then click ‘finish’.
Step #10 – Click ‘next’ on the Configuration Wizard box
A box will appear, and you just have to click next.

Step #11 – Select the configuration type


A box will appear, select your configuration type. Tick the small circle for
the ‘Detailed Configuration’, instead of the ‘Standard Configuration’. Click
the ‘next’ button.
Step #12 – Select the server type
There will be three choices; the Developer Machine, the Server Machine and
the Dedicated MySQL Server Machine.
Select the Server Machine because it will have medium memory usage,
which is ideal for a beginner like you, who is interested to learn more about
MySQL.
The Developer Machine uses minimal memory and may not allow you the
maximum usage of your MySQL.
On the other hand, the MySQL Server Machine is for people who work as
database programmers or full-time MySQL users. It will use all of the
available memory in your computer, so it is not recommended for you.
Step #13 – Select the database usage
For database usage, there are three choices, namely; Multifunctional
Database, Transactional Database Only, and Non-Transactional Database
Only. Choose the Multifunctional Database because your purpose is for
general purposes.
The Transactional and Non-transactional are used for more specific purposes.
Click the ‘next’ button at the bottom of the display box.
Step #14 – Select the drive for the InnoDB datafile
Select the drive on your computer where you want to store your InnoDB data
file. Choose the drive you prefer and then click ‘next’.
Step #15 - Set the number of concurrent connections to the server
This will indicate the number of users that will be connecting simultaneously
to your server. The choices are: Decision Support (DSS)/OLAP, Online
Transaction Processing (OLTP) and Manual Setting.
It is recommended that you choose the DSS/OLAP option because you will
not be requiring a high number of concurrent connections. OLTP is needed
for highly loaded servers, while the manual setting can be bothersome to
adjust every now and then.
After setting this, click ‘next’.
Step #16 – Set the networking options
Enable the TCP/IP Networking by checking the small box before it. Below
it, add your port number and then check the small box to Enable Strict Mode
to set the server SQL mode.
Click ‘next’.
Step #17 – Select the default character set
The most recommended is the Standard Character Set because it is suited for
English and other West European languages. It is also the default for
English.
The other two choices namely; Best Support For Multilingualism and the
Manual Default Character Set are best for those who have other languages
other than English.
Tick the small circle before the Standard Character Set and click ‘next’.

Step #18 – Set the Windows options


Tick the two choices displayed, which are: Install As Windows Service and
Include Bin Directory in Windows Path. This will allow you to work with
your MySQL from your command line.
Selecting Install As Windows Service will automatically display the
Service Name. The small box below the Service Name must be checked too.
Click ‘next’.
Step #19 – Set the security options
Set your password. The box will indicate where you can type it.
Click ‘next’.
Step #20 - Execute your configurations
Click ‘execute’ and your computer will configure by itself based on your
specifications.
Once the configuration is complete and all the boxes are checked, click
‘finish’.
Step #21 – Set the verification process
Type cmd and press enter in the start menu. This will take you to the
command panel.
Type the following:
MySQL -u root -p
Press ‘enter’.
There is a space between MySQL and the dash symbol, and between u and
root. Also, there is a space between the root and the dash symbol.
The command panel will ask for your password. Type your password and
press ‘enter’.
A MySQL prompt will appear.
You can type any SQL command to display the databases. Add the semicolon
at the end of your SQL statement.
Close your command panel for the meantime.
Using your MySQL can motivate you to learn more about other related
applications such as, PHP, and similar products.
What is important is for you to learn the basics of SQL first.

How to create tables


Your tables are used to store the data or information in your database.
Specific names are assigned to the tables to identify them properly and to
facilitate their manipulation. The rows of the tables contain the information
for the columns.
Knowing how to create tables is important for a beginner, who wants to learn
SQL.

The following are the simple steps:


Step #1 – Enter the keywords CREATE TABLE
These keywords will express your intention and direct what action you have
in mind.
Example: CREATE TABLE
Step #2 – Enter the table name
Right after your CREATE TABLE keywords, add the table name. The table
name should be specific and unique to allow easy and quick access later on.
Example: CREATE TABLE “table_name”
The name of your table must not be easy to guess by anyone. You can do this
by including your initials and your birthdate. If your name is Henry Sheldon,
and your birthdate is October 20, 1964, you can add that information to the
name of your table.
Let’s say you want your table to be about the traffic sources on your website,
you can name the table “traffic_hs2064”
Take note that all SQL statements must end with a semicolon (;). All the data
variables must be enclosed with quotation marks (“ “), as well.
Example: CREATE TABLE traffic_hs2064
Step #3 – Add an open parenthesis in the next line
The parenthesis will indicate the introduction of the columns you want to
create.
Example: CREATE TABLE “table_name”
(
Let’s apply this step to our specific example.
Example: CREATE TABLE traffic_hs2064
(
In some instances, the parentheses are not used.
Step #4 – Add the first column name
This should be related to the data or information you want to collect for your
table. Always separate your column definitions with a comma.
Example: CREATE TABLE “table_name”
(“column_name” “data type”,
In our example, the focus of the table is on the traffic sources of your
website. Hence, you can name the first column “country”.
Example: CREATE TABLE traffic_hs2064
(country
Step #5 – Add more columns based on your data

You can add more columns if you need more data about your table. It’s up to
you. So, if you want to add four more columns, this is how your SQL
statement would appear.
Example: CREATE TABLE “table_name”
(“column_name1” “data type”,
“column_name2” “data type”,
“column_name3” “data type”,
“column_name4” “data type”);
Add the closing parenthesis and the semi-colon after the SQL statement.
Let’s say you have decided to add for column 2 the keyword used in
searching for your website, for column 3, the number of minutes that the
visitor had spent on your website, and for column 4, the particular post that
the person visited. This is how your SQL statement would appear.
Take note:

The name of the table or column must start with a letter, then it can be
followed by a number, an underscore, or another letter. It's preferable that the
number of characters does not exceed 30.
You can also use a VARCHAR (variable-length character) data type to help
create the column.
Common data types are:
date – date specified or value
number (size) – you should specify the maximum number of column digits
inside the open and close parentheses
char (size) – you should specify the size of the fixed length inside the open
and close parentheses.
varchar (size) – you should specify the maximum size inside the open and
close parentheses. This is for variable lengths of the entries.
Number (size, d) – This is similar to number (size), except that ‘d’ represents
the maximum number of digits to the right of the decimal point.
Hence, if you want your column to show 10.21, your data type would be:
number (4,2)
Example: CREATE TABLE traffic_hs2064
(country varchar (40),
keywords varchar (30),
time number (3),
post varchar (40) );
Step #6 – Add CONSTRAINTS, if any
CONSTRAINTS are rules that are applied for a particular column. You can
add CONSTRAINTS, if you wish. The most common CONSTRAINTS are:

“NOT NULL” – this indicates that the columns should


not contain blanks

“UNIQUE” – this indicates that all entries added must


be unique and not similar to any item on that particular
column.
In summary, creating a table using a SQL statement will start with the
CREATE TABLE, then the “table name”, then an open parenthesis, then the
“column names”, the “data type”, (add a comma after every column), then
add any “CONSTRAINTS”.
Add the closing parenthesis and the semicolon at the end of your SQL
statement.
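Putting the summary together, a sketch of the same traffic table with constraints added might look like this (whether NOT NULL or UNIQUE actually makes sense for these particular columns is only an assumption for illustration):

CREATE TABLE traffic_hs2064
(country varchar (40) NOT NULL,
keywords varchar (30),
time number (3),
post varchar (40) UNIQUE);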
Chapter 10 - Relational Database Concepts
SQL's database engine is capable of making decisions independently using
user-input instructions. The language of SQL called Transact-SQL (T-SQL)
contains some basic objects used for defining data requests. Literal values
are constant and may include hex, numeric, or alphanumeric values enclosed
in single quotes as a string. Double quotes are also possible, but there are a
variety of uses for double quotes, so the best practice to follow for simple
strings is single quotation boundaries. Delimited identifiers are used to
reserve specific keywords, and they also enable database objects to contain
names that include spaces. The default behavior of delimited statements uses
double quotes, but manipulation of the setting is possible using the SET
statement. In some cases, there may be statements that the database engine
does not need to process. Enclosing these lines between /* and */ will render
the text between the notation as a comment. It is
also possible to create a line that is only a partial comment, delineated by
double hyphen notation. Identifiers in T-SQL are used to reference diverse
resources such as tables, objects, or databases, providing users a shorthand
for requesting complex data in a query, and there is a specific syntax for
implementing them as well as defining them. Identifiers add complexity to
string queries since they introduce variable syntax rules. Identifiers can
begin with nearly any character, but once set, the user syntax requires caution
not to call identifiers unintentionally. Finally, T-SQL contains reserved
keywords which are names with reserved meaning. These words are used to
define operations the database engine can perform, and objects cannot be
identified using these words unless named as "delimited identifiers."
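A brief, hedged sketch pulling a few of these elements together (the values and the delimited column alias are invented):

/* Block comment: the database engine ignores everything between these markers */
-- Line comment introduced by a double hyphen
SELECT 'O''Brien' AS "Last Name",   -- a string literal and a delimited identifier containing a space
0x1F AS hex_literal,                -- a hexadecimal literal
42 AS numeric_literal;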
Creating referenceable data requires adhering to specific organizational
processes and maintaining consistency regarding data type within a column
of a table. This is achieved by limiting data within a column to a single data
type. There is an exception, but the majority of the rule requires a single data
type per column. Numeric data can represent various forms of counted data.
Money, Real, Integer, Decimal and others are all sorts of numerical data.
Character data containers exist in fixed or variable length strings of a single
byte or multiple byte characters. Unicode data types store character strings
using more than one byte per character. Data related to time is called “temporal data” and concerns
date and time information stored in various sizes and specificity. Most
temporal data types don't understand daylight savings or time zones, but
DATETIMEOFFSET defines a data type that can accommodate time zones
for data manipulation. T-SQL also allows for some various data types that do
not fit into numerical, character, or temporal data types as well.
Miscellaneous data types supported vary, but some acceptable parameters
include binary data, large objects, and sql_variant data (which allows a single
column to contain more than a single data type), UNIQUEIDENTIFIER data
(which provides for data used in distributed systems without causing conflict
due to 16-byte identification strings), hierarchical data, and timestamp data as
well which is usually maintained to detect changes in a column.
T-SQL enlists two kinds of functions: aggregate and scalar. An aggregate
function, as the name implies, aggregates data contained within a single
column and returns a single value for the query. Convenient, statistical,
analytical, and user-defined aggregate functions are possible. An example of
an aggregate function is the SUM function which calculates the total value of
all entries in a single column. SUM is a convenient aggregate function, and
the complexity possible is staggering when forming queries applying other
functions, especially user-defined aggregate functions. Scalar functions have
five categories and operate on a single row or singular value, as opposed to
aggregate functionality which runs on many rows within a single column.
There are hundreds of scalar functions used for data manipulation, but they
are broken down into numeric, date, string, system, and metadata functions.
Functions manipulate data, and numeric functions transform data using
mathematical instruction. Date functions perform calculations on date and
time formatted data. String function uses include shaping character string
data. System functions are more varied and provide information about
database objects. These functions don't transform data as much as they
provide information on that state of a database. Metadata functions also
follow this form and provide information regarding the state of a database.
Metadata functions focus on retrieving names or IDs of database objects.
Schemas, views, tables, and data types, as well as databases and database
files, are of primary concern. Metadata functions also allow retrieval of
values for a given property within database objects, the databases themselves,
or the server instance.
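To make the distinction concrete, here is a hedged sketch (the employees table and its columns are placeholders; SUM, AVG, UPPER, and GETDATE are standard T-SQL functions):

-- Aggregate functions collapse many rows into one value
SELECT SUM(salary) AS total_pay, AVG(salary) AS average_pay
FROM employees;

-- Scalar functions operate on a single value per row
SELECT UPPER(last_name) AS shouted_name, GETDATE() AS queried_at
FROM employees;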
Operators within T-SQL are scalar and apply to Boolean and mathematical
operations as well as concatenation. Concatenation, in the database context,
refers to the joining together of multiple objects, tables, or fields. Operator
types include unary and binary arithmetic, bitwise, set operators, logic
operators, compound, comparison, and more. Various symbols signifying
possible data manipulation represent these operations. AND, NOT, OR are
examples of operators that function across all data types. There are too many
various operators to include without beginning to describe highly technical
operations and models of their usage which does not fit the scope of this
book. The crucial thing of note is that there are many possibilities to consider
and as familiarity with SQL increases encountering the most common of
these operators is inevitable. Microsoft and Oracle both present free
education regarding operators and their usage freely available on the world
wide web.
Global variables and NULL values require inclusion in the essential
components of SQL. Global variables are used in place of constants and are
preceded by @@ to denote a global variable insertion. NULL values are a
significant feature of SQL. Even though relational models require that all
data to be referenceable, in real-world practice unknown data occurs, and
NULL values allow table creation and manipulation without complete
information. Understanding the basics of T-SQL's data types functions and
objects provide a base of knowledge that expands into progressively more
specific areas of database creation and manipulation.
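As a small, hedged illustration (@@VERSION and the IS NULL test are standard T-SQL; the employees table is a placeholder):

-- A global variable, preceded by @@
SELECT @@VERSION;

-- NULL marks unknown data; it is tested with IS NULL rather than with =
SELECT * FROM employees WHERE middle_name IS NULL;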
Chapter 11 - SQL Injections
SQL injection is a special type of hacking method nowadays. By using this
method, a hacker can access the database if the site is vulnerable, and get all
details from the database (“SQL Injection”). The database could even be
destroyed.

How do they work?


Consider the statement given below, which shows how to authenticate a user
in a web application:
SELECT * FROM users WHERE username='username'

AND password='password';
The username and password enclosed in single quotes represent the username
and the password entered by a user. If, for example, someone enters the
username alphas and the password pass123, the query will be:
SELECT * FROM users WHERE username='alphas' AND
password='pass123';
Suppose the user is an attacker, and instead of entering a valid username and
password in the fields, he enters something such as ' OR 'x'='x'.
In such a case, the query will evaluate to the following:
SELECT * FROM users WHERE username='' OR 'x'='x'

AND password='' OR 'x'='x';


The above is still a valid SQL statement. The
expression WHERE 'x'='x' will always evaluate to true. This means that the
above query will return all the rows in the users table. With this, the attacker
will then be able to log into the database and do whatever they want to do.
This shows how easy it is for an attacker to gain access to the database by the
use of a dirty yet simple trick.
The user’s table may also be large and loaded with millions of records. In
such a case, the above query can result in a denial of service (DoS) attack.
This is because the system resources may be overloaded, making the
application unavailable to the users.
In the above case, we have seen what an SQL injection can do to your
database table for a select query. This can even be dangerous in a case where
you have a DELETE or UPDATE query. The attacker may delete all the data
from your table or change all its rows permanently.

Preventing an sql injection


Escaped characters are meant to be handled in the application language, such
as Perl or PHP. However, with MySQL, there is an extension that enables
PHP to use the proper escaping function, so the escaped characters are
handled in a way specific to MySQL.

LIKE Quandary
When dealing with a LIKE quandary, you have to have an escape method in
place to change the characters that the user inserts into your prompt box so
that they are treated as literals. The addslashes function allows you to specify
the range of characters that the system needs to escape.
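Beyond escaping, one widely used prevention technique (not covered in the text above, so treat this as a hedged sketch) is to pass user input as parameters instead of concatenating it into the statement text. In SQL Server this can be done with sp_executesql:

-- The user-supplied values are bound as parameters, never spliced into the SQL string
DECLARE @user NVARCHAR(50) = N'alphas';
DECLARE @pass NVARCHAR(50) = N'pass123';

EXEC sp_executesql
N'SELECT * FROM users WHERE username = @u AND password = @p',
N'@u NVARCHAR(50), @p NVARCHAR(50)',
@u = @user,
@p = @pass;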

Hacking Scenario
Google is one of the best reconnaissance tools in the world, thanks to the
inurl: operator, which can be used to find potentially vulnerable websites.
For example:
inurl:index.php?id=
inurl:shop.php?id=
inurl:article.php?id=
inurl:pageid=
To use these, copy one of the above commands and paste it in the Google
search engine box. Press Enter.
You will get a list of websites. Start from the first website and check each
website’s vulnerability one by one.
Chapter 12 - Fine-Tune Your Indexes
When you are using SQL, you will want to become an expert on your
database. This, however, is not going to happen overnight. Just like with
learning how to use the system, you will need to invest time and effort to
learn the important information and have the proper awareness of how the
database works. You should also ensure that you are equipped with the right
education targeted at working with the database because you never know
what may happen in future.
In order to make your education more streamlined when you are learning the
ins and outs of the database, here are some helpful insights:

1. When using the database, you need to work with the 3NF
design.
2. Numbers compare to characters differently, and you could
end up downgrading your database's performance, so do not
change numbers unless you absolutely have to!
3. With the SELECT statement, you will only have data to
display on the screen. Ensure that you avoid asterisks in
your SELECT statement searches to avoid loading data that
you do not need at that time.
4. Any indexes should be constructed carefully, and only for
the tables that require them. If you do not intend to utilize
the table as much, then it does not need to have an index.
Essentially, you should attempt to save space on the disk,
and if you are creating an index for each table, you are going
to run out of room.
5. A full table scan happens when no index can be found on
that table. You can avoid this by creating an index on the
column you search by rather than scanning the entire table
(see the sketch after this list).
6. Take precautions when using equality operators, especially
when dealing with times, dates, and real numbers. There is a
possibility that differences will occur, but you are not
necessarily going to notice these differences right away.
Equality operators make it almost impossible to get exact
matches in your queries.
7. Pattern matching can be used, but use it sparingly.
8. Look at how your search is structured, as well as the script
that is being used with your table. You can manipulate the
script of your table to have a faster response time, as long as
you change everything about the table and not just part of it.
9. Searches will be performed regularly on the SQL. Stick to
the standard procedures that work with a large group of
statements rather than small ones. The procedures have been
put into place by the database before you even get the
chance to use it. The database is not like the search engine
though; the procedure is not going to be optimized before
your command is executed.
10. The OR operator should not be used unless necessary.
This operator will slow your searches.
11. Drop any indexes you currently have before you load
larger batches of data. Think of a history table with
millions of different rows: it will probably require
multiple indexes to cover the queries run against it,
which takes up space on the disk. While indexes will get
you the information you want faster, during a batch load
an index slows the system down because it has to be
maintained for every row, so recreate it after the batch
has been loaded.
12. Large batch loads require that you use the COMMIT
function regularly. You should use this function after you
insert a new batch of records.
13. Databases have to be defragmented at least once a
week to ensure everything is working properly.
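Following tip 5, here is a hedged sketch of creating an index on a frequently searched column (the table and column names are placeholders):

-- A nonclustered index on last_name lets queries that filter on that
-- column avoid a full table scan
CREATE NONCLUSTERED INDEX IX_employees_last_name
ON employees (last_name);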

Sql tuning tools


The Oracle program grants you access to various tools to use if you want to
tune your database or its performance. Two of the most popular tools it offers
are:

1. TKProf — lets you measure how the database performs over


a certain period of time based on every statement you enter
into SQL to be processed
2. EXPLAIN PLAN — shows you the path that is followed to
ensure that statements are carried out as they are meant to
be
Use SQL*Plus Command to measure the time that passes between each
search on your SQL database.
Chapter 13 - Deadlocks
In most cases, multiple users access database applications simultaneously,
which means that multiple transactions are being executed on a database in
parallel. By default, when a transaction performs an operation on a database
resource such as a table, it locks the resource. During that period, no other
transaction can access the locked resource. Deadlocks occur when two or
more processes try to access resources that are locked by the other processes
participating in the deadlock (Steynberg, 2016).
Deadlocks are best explained with the help of an example. Consider a
scenario where some transactionA has performed an operation on tableA and
has acquired a lock on the table. Similarly, there is another transaction named
transactionB that is executing in parallel and performs some operation on
tableB. Now, transactionA wants to perform some operation on tableB, which
is already locked by transactionB. In addition, transactionB wants to perform
an operation on tableA, but that table is already locked by transactionA. This
results in a deadlock since transactionA is waiting on a resource locked by
transactionB, which is waiting on a resource locked by transactionA.
For the sake of this chapter, we will create a dummy database. This database
will be used in the deadlock example that we shall see in the next section. Execute
the following script:
CREATE DATABASE dldb;
GO
USE dldb;
CREATE TABLE tableA
(
id INT IDENTITY PRIMARY KEY,
patient_name NVARCHAR(50)
)

INSERT INTO tableA VALUES ('Thomas')


CREATE TABLE tableB
(
id INT IDENTITY PRIMARY KEY,
patient_name NVARCHAR(50)
)

INSERT INTO tableB VALUES ('Helene')


The above script creates a database named “dldb.” In the database, we create
two tables, tableA and tableB. We then insert one record into each table.
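The walkthrough that actually generates the deadlock is not reproduced here, but based on the statements that appear in the error log later in this chapter, a hedged sketch of it looks like this, run from two separate query windows (sessions):

-- Session 1
BEGIN TRANSACTION transactionA;
UPDATE tableA SET patient_name = 'Thomas - TransactionA' WHERE id = 1;

-- Session 2
BEGIN TRANSACTION transactionB;
UPDATE tableB SET patient_name = 'Helene - TransactionB' WHERE id = 1;

-- Session 1 again (blocks, waiting on tableB, which transactionB has locked)
UPDATE tableB SET patient_name = 'Helene - TransactionA' WHERE id = 1;

-- Session 2 again (a deadlock now exists; SQL Server picks one session as the victim)
UPDATE tableA SET patient_name = 'Thomas - TransactionB' WHERE id = 1;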

Deadlock analysis and prevention


In the above section, a deadlock was generated. We understand the processes
involved in the deadlock. In real-world scenarios, this is not the case.
Multiple users access the database simultaneously, which often results in
deadlocks. However, in such cases we cannot tell which transactions and
resources are involved in the deadlock. We need a mechanism that allows us
to analyze deadlocks in detail so that we can see what transactions and
resources are involved, and determine how to resolve the deadlocks. One
such way is via SQL Server error logs.

Reading Deadlock Info via SQL Server Error Log


The SQL Server provides minimal information about the deadlock. You can
get detailed information about the deadlock via SQL Server error log.
However, to log deadlock information to the error log, first you have to use a
trace flag 1222. You can turn trace flag 1222 on global as well as session
level. To turn on trace flag 1222, execute the following script:
DBCC Traceon(1222, -1)
The above script turns the trace flag on at the global level. If you do not pass the
second argument, the trace flag is turned on at the session level. To see if the trace
flag is actually turned on, execute the query below:
DBCC TraceStatus(1222)
The above statement results in the following output:
TraceFlag Status Global Session
1222 1 1 0

Here, Status value 1 shows that trace flag 1222 is on. The 1 in the Global
column implies that the trace flag has been turned on globally.
Now, try to generate a deadlock by following the steps that we performed in
the last section. The detailed deadlock information will be logged in the error
log. To view the SQL Server error log, you need to execute the following
stored procedure:
execute sp_readerrorlog
The above stored procedure will retrieve a detailed error log. A snippet of this
is shown below:
Your error log might be different depending upon the databases on your
server. The information about all the deadlocks in your database starts with
the log text “deadlock-list.” You may need to scroll down a bit to find this row.
Let’s now analyze the log information that is retrieved by the deadlock that
we just created. Note that your values will be different for each column, but
the information remains the same.
ProcessInfo Text
spid13s deadlock-list
spid13s deadlock victim=process1fcf9514ca8
spid13s process-list
spid13s process id=process1fcf9514ca8 taskpriority=0 logused=308 waitresource=KE
waittime=921 ownerId=388813 transactionname=transactionBlasttranstarted=
XDES=0x1fcf8454490 lockMode=X schedulerid=3 kpid=1968 status=suspen
trancount=2 lastbatchstarted=2019-05-27T15:51:54.380 lastbatchcompleted=2
lastattention=1900-01-01T00:00:00.377 clientapp=Microsoft SQL Server Ma
hostname=DESKTOP-GLQ5VRA hostpid=968 loginname=DESKTOP-GLQ
(2) xactid=388813 currentdb=8 lockTimeout=4294967295 clientoption1=671
spid13s executionStack
spid13s frame procname=adhoc line=2 stmtstart=58 stmtend=164
sqlhandle=0x0200000014b61731ad79b1eec6740c98aab3ab91bd31af4d00000
spid13s unknown

spid13s frame procname=adhoc line=2 stmtstart=4 stmtend=142


sqlhandle=0x0200000080129b021f70641be5a5e43a1ca1ef67e9721c9700000
spid13s unknown

spid13s inputbuf
spid13s UPDATE tableA SET patient_name = 'Thomas - TransactionB'
spid13s WHERE id = 1

spid13s process id=process1fcf9515468 taskpriority=0 logused=308 waitresource=KE


waittime=4588 ownerId=388767 transactionname=transactionAlasttranstarted
XDES=0x1fcf8428490 lockMode=X schedulerid=3 kpid=11000 status=suspe
trancount=2 lastbatchstarted=2019-05-27T15:51:50.710 lastbatchcompleted=2
lastattention=1900-01-01T00:00:00.710 clientapp=Microsoft SQL Server Ma
hostname=DESKTOP-GLQ5VRA hostpid=1140 loginname=DESKTOP-GLQ
committed (2) xactid=388767 currentdb=8 lockTimeout=4294967295 clientop
spid13s executionStack
spid13s frame procname=adhoc line=1 stmtstart=58 stmtend=164
sqlhandle=0x02000000ec86cd1dbe1cd7fc97237a12abb461f1fc27e278000000
spid13s unknown

spid13s frame procname=adhoc line=1 stmtend=138


sqlhandle=0x020000003a45a10eb863d6370a5f99368760983cacbf489500000
spid13s unknown

spid13s inputbuf
spid13s UPDATE tableB SET patient_name = 'Helene - TransactionA'
spid13s WHERE id = 1
spid13s resource-list
spid13s keylockhobtid=72057594043105280 dbid=8
objectname=dldb.dbo.tableAindexname=PK__tableA__3213E83F1C2C4D64
associatedObjectId=72057594043105280
spid13s owner-list
spid13s owner id=process1fcf9515468 mode=X
spid13s waiter-list
spid13s waiter id=process1fcf9514ca8 mode=X requestType=wait
spid13s keylockhobtid=72057594043170816 dbid=8
objectname=dldb.dbo.tableBindexname=PK__tableB__3213E83FFE08D6AB
associatedObjectId=72057594043170816
spid13s owner-list
spid13s owner id=process1fcf9514ca8 mode=X
spid13s waiter-list
spid13s waiter id=process1fcf9515468 mode=X requestType=wait

The deadlock information logged by the SQL server error log has three main
parts.

1. The Deadlock Victim


As mentioned earlier, to resolve a deadlock, SQL Server selects one of the
processes involved in the deadlock as the deadlock victim. Above, you can see
that the ID of the process selected as the deadlock victim is
process1fcf9514ca8; it appears in the “deadlock victim=” line near the top of
the log.

2. Process List
The process list lists all the processes involved in the deadlock. In the
deadlock that we generated, two processes were involved, and in the process
list you can see details of both. Each entry begins with “process id=”; the
first process is process1fcf9514ca8 and the second is process1fcf9515468.
Notice that the first process in the list is also the one that was selected
as the deadlock victim.
Apart from the process ID, you can also see other information about the
processes. For instance, you can find login information of the process, the
isolation level of the process, and more. You can even see the script that the
process was trying to run. For instance, if you look at the first process in the
process list, you will find that it was trying to update the patient_name
column of tableA when the deadlock occurred.

3. Resource List
The resource list contains information about the resources involved in the
deadlock. In our example, tableA and tableB were the only two resources; you
can identify them by the “objectname=” entries in the resource list above.

Some Tips for Avoiding Deadlock


From the error log, we can get detailed information about the deadlock.
However, we can minimize the chance of deadlock occurrence if we follow
these tips:

Execute transactions in a single batch and keep them short

Release resources automatically after a certain time period

Access shared resources in the same order in every transaction (see the sketch below)

Don't allow the user to interact with the application while transactions are
being executed
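
The resource-ordering tip is worth a short illustration. The sketch below reuses the tableA and tableB examples from earlier in this chapter (the column values are only placeholders). Because both sessions update tableA first and tableB second, neither session can end up holding tableB while waiting for tableA, so the circular wait that produces a deadlock cannot form.

-- Session 1: always touch tableA first, then tableB
BEGIN TRANSACTION
UPDATE tableA SET patient_name = 'Thomas' WHERE id = 1
UPDATE tableB SET patient_name = 'Helene' WHERE id = 1
COMMIT TRANSACTION

-- Session 2: follows the same order (tableA, then tableB)
BEGIN TRANSACTION
UPDATE tableA SET patient_name = 'Jane' WHERE id = 1
UPDATE tableB SET patient_name = 'Sara' WHERE id = 1
COMMIT TRANSACTION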
Chapter 14 - Functions: UDFs, SVF, ITVF, MSTVF, Aggregate, System, CLR
A user-defined function (UDF) returns a value, so it is used inside an
expression and, unlike a stored procedure, it is not invoked with an EXECUTE
statement. A function can call another function, nesting can go up to 32
levels deep, and a function is invoked along with its schema name, e.g., dbo.
The name of a function can be up to 128 characters. Unlike a stored
procedure, a DML operation (insert, update, delete) on a database table
cannot be performed inside a function. These features of UDFs are summarized
below:

a. Can nest functions (up to 32 levels).

b. Cannot call stored procedures, but can call extended stored procedures.

c. Cannot modify database tables.

d. Can take input parameters.

e. Cannot contain output parameters.

f. Cannot use TRY-CATCH error handling.

The different types of user-defined functions are explained below.


SCALAR-VALUED FUNCTION (SVF)
A scalar-valued function (SVF) returns a single value and is created as follows:
Go to Object Explorer in Management Studio -> Expand the IMS node under
Databases (click on the + sign) -> Expand the Programmability node -> Expand
the Functions node -> Right-click the Scalar-valued Functions node -> Select
New Scalar-valued Function… A new UDF template is displayed in the Query
Editor window.
a. Enter the Author name, Create date and Description of the
UDF, e.g., get total of customers.
b. Change the name of the UDF to reflect what it will do:
svfIMSGetTotCustomers. Note that we prefix svf to the name of
the UDF to indicate that it is a scalar-valued function, and we
also include the database name, IMS, to easily identify which
database the function belongs to. These are part of a naming
convention, and each company can have its own standards.
c. Remove both parameters, Param1 and Param2, since we are not passing any
input parameters in this example.
d. Enter the SQL statements to get the total of customers; the UDF will
look as below.

USE IMS
GO
-- =============================================
-- Author: Neal Gupta
-- Create date: 10/01/2013
-- Description: Get total of customers
-- =============================================
CREATE FUNCTION [dbo].[svfIMSGetTotCustomers]
()
RETURNS INT
AS
BEGIN
-- Declare the return variable here
DECLARE @TotCustomers INT
SELECT @TotCustomers = COUNT(CustomerID)
FROM [IMS].[dbo].[TblCustomer]
-- Return the result of the function
RETURN @TotCustomers
END
You can use the following SQL statement to invoke the above user-defined
function:
SELECT [dbo].[svfIMSGetTotCustomers]()
Notice that we did not pass any input value (empty parentheses) to the above
UDF. In the example below, however, we will pass a single input value.
USE IMS
GO
-- =============================================
-- Description: Get Qty of product available
-- =============================================
CREATE FUNCTION [dbo].[svfIMSGetQtyProduct]
(
@ProductID INT
)
RETURNS INT
AS
BEGIN
DECLARE @QtyAvailable INT
SELECT @QtyAvailable = QtyAvailable
FROM [IMS].[dbo].[TblProduct]
WHERE ProductID = @ProductID
-- Return the result of the function
RETURN @QtyAvailable
END
The above scalar UDF can be called using the following SQL statement:
SELECT ProductID, [dbo].[svfIMSGetQtyProduct](ProductID)

FROM [IMS].[dbo].[TblProduct]
INLINE TABLE-VALUED FUNCTION (I-TVF)
This is a UDF that returns the table data type, i.e., a set of rows, similar
to the set of data returned by a view. However, a TVF provides much more
capability than a view, which is limited to one SELECT statement: a TVF can
contain logic and draw from one or more tables. A TVF can often replace a
stored procedure, and a TVF is invoked using SELECT, unlike a stored
procedure, which needs to be executed. For example, to get all the orders for
the products, an inline TVF is created below:
USE IMS
GO
-- =============================================
-- Description: Get orders for products
--=============================================
CREATE FUNCTION [dbo].[tvfIMSGetProductOrders] ()
RETURNS TABLE
AS
RETURN

SELECT [OrderID], [CustomerID], [ProductID], [OrderQty],


[OrderDate], [Comment]

FROM [IMS].[dbo].[TblOrder]
You can run the following SQL statement to call the above inline TVF:
SELECT * FROM [dbo].[tvfIMSGetProductOrders]()

Note that no input parameter was passed to the above inline TVF. However, if
we want to get all the orders for a particular product ID, we can use the
following inline TVF:
USE IMS

GO
-- =============================================
-- Description: Get orders for a product
-- =============================================
CREATE FUNCTION [dbo].[tvfIMSGetOrdersByProductID]
(
@ProductID INT
)
RETURNS TABLE
AS
RETURN

SELECT [OrderID], [CustomerID], [ProductID], [OrderQty],


[OrderDate], [Comment]
FROM [IMS].[dbo].[TblOrder]
WHERE ProductID = @ProductID
Similarly, you can invoke the above inline TVF as follows:
SELECT * FROM [dbo].[tvfIMSGetOrdersByProductID](2)
MULTI-STATEMENT TABLE-VALUED FUNCTION (MS-TVF)
Similar to the inline TVF explained above, a multi-statement TVF returns a
table. However, the structure of the returned table needs to be defined
first; rows are then inserted into this table variable and returned to the
caller. For example, if we want to get the order details of a product, we
create a multi-statement TVF as below:
USE IMS
GO

-- =============================================
-- Description: Get order details for a product
-- =============================================
CREATE FUNCTION [dbo].[tvfIMSGetOrderProductDetails]
(
@ProductID INT
)
RETURNS @OrderProductDetails TABLE
(
OrderID INT NOT NULL
, OrderDate DATETIME
, ProductID INT
, Name VARCHAR(50)
, Price DECIMAL(9,2)
)
AS
BEGIN

INSERT INTO @OrderProductDetails
SELECT O.OrderID, O.OrderDate
, P.ProductID, P.Name, P.Price
FROM [IMS].[dbo].[TblOrder] O
INNER JOIN
[IMS].[dbo].[TblProduct] P
ON O.ProductID = P.ProductID
WHERE O.ProductID = @ProductID

-- Return the rows inserted into the table variable above
RETURN
END

The above multi-statement TVF is invoked using the following SQL statement:
SELECT * FROM [dbo].[tvfIMSGetOrderProductDetails](5)

AGGREGATE FUNCTION
These functions give you summarized information such as the average salary,
the total count of products, minimums and maximums, etc. Aggregate functions
are actually part of the system functions (covered next); some of them are
listed below, followed by a small example query:
COUNT(): Counts rows: COUNT(*), COUNT(ALL column), COUNT(DISTINCT column),
which counts only distinct values.

AVG(): Average value of the column used


SUM(): Totals of values for column provided
MAX(), MIN(): Max or Min value of the value for column
STDEV(): Get the standard deviation value for column
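
As a small illustration, the query below (a sketch against the TblProduct table used earlier in this book; the column names are assumed from those examples) combines several aggregate functions in a single statement:

SELECT COUNT(*) AS TotalProducts
, COUNT(DISTINCT Manufacturer) AS TotalManufacturers
, AVG(Price) AS AvgPrice
, SUM(QtyAvailable) AS TotalQtyAvailable
, MIN(Price) AS MinPrice
, MAX(Price) AS MaxPrice
, STDEV(Price) AS PriceStdDev
FROM [IMS].[dbo].[TblProduct]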

SYSTEM FUNCTION
These are built-in functions in SQL Server that help execute various
operations related to date and time (GETDATE, GETUTCDATE, DAY, MONTH, YEAR,
DATEADD, DATEPART, ISDATE), security (USER, USER_ID, USER_NAME, IS_MEMBER),
string manipulation (CHARINDEX, RTRIM, LTRIM, SUBSTRING, LOWER, UPPER, LEN),
mathematics (ABS, COS, SIN, SQUARE, PI), etc.
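A few of these built-in functions in action (a minimal sketch; the values returned will, of course, depend on when and where you run it):

SELECT GETDATE() AS CurrentDateTime
, YEAR(GETDATE()) AS CurrentYear
, DATEADD(DAY, 7, GETDATE()) AS OneWeekFromNow
, UPPER(LTRIM(RTRIM(' sql server '))) AS CleanedString
, CHARINDEX('Server', 'SQL Server') AS PositionOfServer
, ABS(-10) AS AbsoluteValue
, SQUARE(PI()) AS PiSquared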
CLR FUNCTION
A Common Language Runtime (CLR) function is implemented in a compiled .NET
assembly (DLL) and then registered inside SQL Server, similar to a CLR stored
procedure.
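The T-SQL side of a CLR function looks roughly like the sketch below. The assembly path, assembly name, class name, method name, and function name here are only placeholders for illustration; the actual logic lives in the compiled .NET DLL, and CLR integration must be enabled on the server first.

-- Enable CLR integration (server-level option)
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
-- Register the compiled assembly (path and names are placeholders)
CREATE ASSEMBLY IMSClrLibrary
FROM 'C:\Assemblies\IMSClrLibrary.dll'
WITH PERMISSION_SET = SAFE;
GO
-- Bind a T-SQL function to a static method inside the assembly
CREATE FUNCTION dbo.clrIMSFormatPhone (@Phone NVARCHAR(20))
RETURNS NVARCHAR(20)
AS EXTERNAL NAME IMSClrLibrary.[ClrFunctions].FormatPhone;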
Similar to stored procedures, you can create schema-bound or encrypted scalar
or table-valued UDFs using the following options:
SCHEMABINDING AND ENCRYPTION UDF

A. SCHEMABINDING UDF: Once a UDF is schema-bound, the underlying database
objects it references cannot be altered or dropped:

CREATE FUNCTION [dbo].[svfIMSSchemaFn]
()
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
...

END

B. ENCRYPTION UDF: Once an encrypted UDF is created, its definition is hidden
and cannot be viewed later.

CREATE FUNCTION [dbo].[svfIMSEncryptionFn]
(
...
)
RETURNS DECIMAL(9,2)
WITH ENCRYPTION
AS
BEGIN
...

END
If you want to modify an existing function, ALTER FUNCTION is used, which
keeps the permissions granted on the UDF intact. To delete a UDF, DROP
FUNCTION is used, for example:
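The scalar UDF created earlier in this chapter could be modified or removed as follows (the new body shown for ALTER FUNCTION is only an illustration):

-- Modify the existing UDF; permissions granted on it are preserved
ALTER FUNCTION [dbo].[svfIMSGetTotCustomers]
()
RETURNS INT
AS
BEGIN
DECLARE @TotCustomers INT
SELECT @TotCustomers = COUNT(DISTINCT CustomerID)
FROM [IMS].[dbo].[TblCustomer]
RETURN @TotCustomers
END
GO
-- Remove the UDF entirely
DROP FUNCTION [dbo].[svfIMSGetTotCustomers]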
Chapter 15 - Triggers: DML, DDL, After, Instead Of, DB, Server, Logon
A trigger is a special type of stored procedure that is executed, invoked, or
fired automatically when a certain database event occurs, such as a DML
(insert, update, delete) operation. However, a trigger cannot be passed any
values and does not return a value. A trigger fires either before or after a
database event and is associated with a table or a view. A trigger can also
be used to check the integrity of data before or after an event occurs and
can roll back a transaction. There are three categories of triggers:

a. DML Triggers: Fire on DML operations (insert, update, delete)

b. DDL Triggers: Fire on DDL operations (create, alter, drop, grant)

c. Logon Triggers: Fire in response to a LOGON event.
Multiple triggers can also be created on the same table or view, even for the
same database event. Triggers can also be nested: if a trigger changes
another table that has its own trigger, that trigger fires as well, and so
on, up to 32 levels of nesting. Triggers should be used only when absolutely
required because of the potential for extra I/O overhead; otherwise, stored
procedures or functions should be considered.
We will go over the different types of triggers below:
DML Triggers (DML-TR)
AFTER TRIGGER: An AFTER trigger is fired whenever a DML statement specified
in the trigger occurs, as below. Note that a FOR trigger is the same as an
AFTER trigger.
USE IMS
GO
IF OBJECT_ID ('trgIMSTblCustomerInsertUpdate', 'TR') IS NOT NULL
DROP TRIGGER trgIMSTblCustomerInsertUpdate
-- =============================================
-- Author: Neal Gupta
-- Create date: 11/01/2013
-- Description: Create a trigger for Insert, Update
-- =============================================
CREATE TRIGGER trgIMSTblCustomerInsertUpdate
ON [IMS].[dbo].[TblCustomer]
AFTER INSERT, UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- SQL statements for trigger here
PRINT 'trgIMSTblCustomerInsertUpdate Invoked'
-- Add any other SQL statement
END
Now, if you run the insert SQL statement below, the above trigger is fired:
INSERT INTO [IMS].[dbo].[TblCustomer] ([FirstName],[MiddleName],
[LastName],[Address],[City],[State],[ZipCode],[Phone],[Country])
VALUES ('John', '', 'Doe', '123 Denton Rd', 'Seattle', 'WA', '10006', '623-456-
7890', 'USA')
You will notice that in the Messages tab in Management Studio, the message
'trgIMSTblCustomerInsertUpdate Invoked' is printed, indicating that the
trigger was called after the insert was performed.
Note that when an insert or update SQL statement runs, a copy of the new rows
is added to the INSERTED table, which can be accessed within a trigger to
perform additional checks or roll back the transaction if there is some
violation of data integrity (see the sketch below). Similarly, when a delete
SQL statement runs, a copy of the deleted rows is added to the DELETED table.
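As a sketch of using the INSERTED table for such a data-integrity check (the 10-digit phone rule is only an illustrative assumption), a trigger can inspect the new rows and roll back the statement that fired it:

CREATE TRIGGER trgIMSTblCustomerValidatePhone
ON [IMS].[dbo].[TblCustomer]
AFTER INSERT, UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- Reject the statement if any new or changed row has fewer than 10 digits in Phone
IF EXISTS (SELECT 1 FROM INSERTED
           WHERE LEN(REPLACE(Phone, '-', '')) < 10)
BEGIN
PRINT 'Invalid phone number - rolling back'
ROLLBACK TRANSACTION
END
END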
AFTER triggers are further classified as follows:
1. After Insert Trigger: AFTER trigger for an insert SQL statement
2. After Update Trigger: AFTER trigger for an update SQL statement
3. After Delete Trigger: AFTER trigger for a delete SQL statement, as below.
Note that one trigger can cover any combination of insert, update, and delete
DML operations.
USE IMS
GO
-- =============================================
-- Description: Create a trigger for delete
-- =============================================
CREATE TRIGGER trgIMSTblCustomerDelete
ON [IMS].[dbo].[TblCustomer]
AFTER DELETE
AS
IF EXISTS (SELECT * FROM DELETED)
BEGIN
SET NOCOUNT ON;
DECLARE @CountDeleted INT
SET @CountDeleted = (SELECT COUNT(*) FROM DELETED)
PRINT 'trgIMSTblCustomerDelete Invoked'
PRINT CAST(@CountDeleted AS VARCHAR(5)) + ' rows deleted from
TblCustomer table'
END
In order for SQL Server to fire the above delete trigger, run the following
delete SQL statement:
DELETE FROM [IMS].[dbo].[TblCustomer]
WHERE CustomerID = 6
You will see the message 'trgIMSTblCustomerDelete Invoked', indicating that
the above delete trigger was fired.
INSTEAD OF TRIGGER: When an INSTEAD OF trigger is activated, or fired, by a
DML operation, the alternative action defined inside the trigger takes place
instead of the original statement. For example, if you run an insert SQL
statement to add a new row, that row will not actually be added unless the
INSTEAD OF trigger itself performs the insert. Similar to AFTER triggers,
these triggers are categorized into 3 types:

1. INSTEAD OF Insert Trigger : INSTEAD OF Trigger fired


for insert SQL statement
2. INSTEAD OF Update Trigger : INSTEAD OF Trigger fired
for update SQL statement
3. INSTEAD OF delete Trigger : INSTEAD OF Trigger fired
for delete SQL statement

USE IMS
GO
INSERT INTO [IMS].[dbo].[TblProduct] ([Name],[Description],
[Manufacturer],[QtyAvailable],[Price])
VALUES ('UHDTV', 'Ultra-Hi Smart 3D TV', 'LG', 1, 50000.00)
-- =============================================
-- Description: Create INSTEAD OF trigger for delete
-- =============================================
CREATE TRIGGER trgIMSTblProductInsteadOfDelete
ON [IMS].[dbo].[TblProduct]
INSTEAD OF DELETE
AS
BEGIN
SET NOCOUNT ON;
DECLARE @ProductID INT
SELECT @ProductID = (SELECT ProductID FROM DELETED)
IF (@ProductID > 0)
BEGIN
DELETE FROM [IMS].[dbo].[TblProduct]
WHERE ProductID = @ProductID
END
ELSE
BEGIN
PRINT 'Error: ' + CAST(@@ERROR AS VARCHAR(10))
END
PRINT 'INSTEAD OF TRIGGER Invoked: trgIMSTblProductInsteadOfDelete'
PRINT CAST(@ProductID AS VARCHAR(5)) + ' ProductID deleted from TblProduct table'
END
GO
-- Run Delete SQL Query to fire above INSTEAD OF trigger
DELETE FROM [IMS].[dbo].[TblProduct]
WHERE Name = 'UHDTV'
Once you run the above delete SQL statement, you will see the messages
printed by the trigger in the Messages tab in Management Studio.
Note that a TRUNCATE TABLE statement does not fire a delete trigger, since it
does not perform individual row deletions.
DDL Triggers (DDL-TR)
These triggers are fired after a DDL statement, e.g., CREATE, ALTER, DROP,
GRANT, or REVOKE, is executed on a database, or after a server-related event.
Some of the features of DDL triggers are summarized below:

a. No INSTEAD OF triggers exist


b. Fire only after DDL SQL statements.
c. INSERTED and DELETED tables are not created.

Now, we will review the different types of DDL triggers below:


A. DATABASE-SCOPED DDL TRIGGER (DS-DDL-TR) :

USE IMS
GO
-- =============================================
-- Description: Create DDL Trigger (DS-DDL-TR)
-- =============================================
CREATE TRIGGER trgIMSDatabaseTrigger1
ON DATABASE
FOR ALTER_TABLE, DROP_TABLE
AS
BEGIN
PRINT 'DDL TRIGGER Fired: ' + 'trgIMSDatabaseTrigger1'
-- Perform a SQL operation:
ROLLBACK
END
The above DDL trigger is fired once you run any ALTER TABLE or DROP TABLE
command. The trigger appears in Object Explorer under Instance -> Databases
-> IMS -> Programmability -> Database Triggers.

B. SERVER-SCOPED DDL TRIGGER (SS-DDL-TR): These triggers are fired when an
event occurs at the server-instance level, e.g., when a database is created,
as below.

-- =============================================
-- Description: Create DDL Trigger (SS-DDL-TR)
-- =============================================
CREATE TRIGGER trgIMSServerTrigger1
ON ALL SERVER
FOR CREATE_DATABASE
AS
BEGIN
PRINT 'SERVER DDL TRIGGER Fired: ' + 'trgIMSServerTrigger1'
END
Once you run a CREATE DATABASE SQL statement, the above trigger fires. Server
triggers are located in Object Explorer under SQL Server Instance -> Server
Objects -> Triggers.
C. CLR DDL Triggers: These triggers are defined in an external routine
written in a .NET language such as C# and compiled into a DLL file. This DLL,
or assembly, is registered inside SQL Server. These are typically used for
specialized purposes.

LOGON TRIGGERS (LOG-TR)


These triggers fire when a LOGON event occurs, after authentication to the
SQL Server instance completes. A logon trigger does not fire if
authentication of the user session fails.
-- =============================================
-- Description: Create LOGON Trigger
-- =============================================
CREATE TRIGGER trgSQLInstanceLogon
ON ALL SERVER
FOR LOGON
AS
BEGIN

PRINT 'LOGON TRIGGER Fired: ' + 'trgSQLInstanceLogon'


DECLARE @CountSessions INT
SET @CountSessions = (SELECT COUNT(*) FROM
SYS.DM_EXEC_SESSIONS
WHERE is_user_process = 1)

PRINT 'Count Sessions: ' + CAST(@CountSessions AS


VARCHAR(5))
END

If you log in to the SQL Server instance, the above logon trigger is fired,
and its messages are written to the SQL Server Logs under Management in
Object Explorer.
A logon trigger can be used for auditing and tracking purposes, or even for
restricting access for a particular login or limiting its number of sessions,
as in the sketch below.
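As a sketch of the session-limiting idea, the trigger below rolls back a login once a particular account already has more than three open sessions. The login name is a placeholder; be careful with such triggers, because a faulty logon trigger can lock everyone out of the instance.

CREATE TRIGGER trgSQLInstanceLimitSessions
ON ALL SERVER
FOR LOGON
AS
BEGIN
-- Placeholder login name; replace with the account you want to limit
IF ORIGINAL_LOGIN() = 'ReportingUser'
   AND (SELECT COUNT(*)
        FROM SYS.DM_EXEC_SESSIONS
        WHERE is_user_process = 1
          AND original_login_name = 'ReportingUser') > 3
BEGIN
-- Rolling back inside a logon trigger denies the connection
ROLLBACK
END
END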
MULTIPLE TRIGGERS (MUL-TR)
A single insert, update, or delete DML operation on a table can cause more
than one trigger to fire if that table has multiple triggers defined for the
same event. Multiple triggers can be fired on DML, DDL, or even LOGON
database events.
There are also two related behaviors, recursive and nested triggers, which
allow a maximum of 32 levels of nesting; however, they are only rarely
required. The sketch below shows the options that control them.
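Nested trigger firing is controlled by a server-level option, and direct trigger recursion by a database-level option. The following is a minimal sketch showing how to change them for the IMS sample database used in this chapter:

-- Server-wide option: allow (1) or disallow (0) nested trigger firing
EXEC sp_configure 'nested triggers', 1;
RECONFIGURE;
GO
-- Database option: allow a trigger to fire itself (direct recursion)
ALTER DATABASE IMS SET RECURSIVE_TRIGGERS ON; -- or OFF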
Chapter 16 - Select Into Table Creation &
Population

Simple select into statement variations


SELECT INTO is an easy way to create a table for ad-hoc purposes in database
development and administration. An added benefit is minimal logging and
therefore good performance. INSERT SELECT, by contrast, is fully logged,
although with special setup minimal logging can be achieved in some cases.
-- Create and populate copy of the product table in tempdb
SELECT *
INTO tempdb.dbo.Product
FROM AdventureWorks2012.Production.Product;
-- (504 row(s) affected)
SELECT TableRows = count(*) FROM tempdb.dbo.Product; -- 504
-- Copy all persons into new table with last name starting with 'A'
SELECT BusinessEntityID AS ID,
CONCAT(FirstName, ' ', LastName) AS
FullName,
PersonType
INTO ListA
FROM AdventureWorks2012.Person.Person
WHERE LEFT(LastName, 1) = 'A'
ORDER BY LastName, FirstName;
-- (911 row(s) affected)
SELECT TOP (10) ID, FullName, PersonType FROM ListA ORDER BY
ID;

ID FullName PersonType
38 Kim Abercrombie EM
43 Nancy Anderson EM
67 Jay Adams EM
121 Pilar Ackerman EM
207 Greg Alderson EM
211 Hazem Abolrous EM
216 Sean Alexander EM
217 Zainal Arifin EM
227 Gary Altman EM
270 François Ajenstat EM

USE AdventureWorks2012;
-- Create a copy of table in a different schema, same name
-- The WHERE clause predicate with >=, < comparison is better performing
than the YEAR function
SELECT *
INTO dbo.SalesOrderHeader
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20080101' AND OrderDate < '20090101'; --
YEAR(OrderDate)=2008
-- (13951 row(s) affected)
-- Create a table without population
SELECT TOP (0) SalesOrderID,
OrderDate
INTO SOH
FROM Sales.SalesOrderHeader;
-- (0 row(s) affected)
-- SELECT INTO cannot be used to target an existing table
SELECT * INTO SOH FROM Sales.SalesOrderHeader;
/* Msg 2714, Level 16, State 6, Line 1
There is already an object named 'SOH' in the database. */

NOTE
An IDENTITY column is automatically populated. A direct insert into an
IDENTITY column requires the use of SET IDENTITY_INSERT.

INSERT SOH (SalesOrderID, OrderDate)


SELECT SalesOrderID, OrderDate
FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
GO
/* ERROR due to SalesOrderID in SOH inherited the IDENTITY property.
Msg 544, Level 16, State 1, Line 1
Cannot insert explicit value for identity column in table 'SOH' when
IDENTITY_INSERT is set to OFF. */
-- Turn on forced IDENTITY insert
SET IDENTITY_INSERT dbo.SOH ON;
GO
INSERT SOH(SalesOrderID, OrderDate)
SELECT SalesOrderID, OrderDate
FROM Sales.SalesOrderHeader ORDER BY SalesOrderID;
GO
-- (31465 row(s) affected)
SET IDENTITY_INSERT dbo.SOH OFF;
-- Filter on date
SELECT *
INTO SOH1
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20080101' AND OrderDate < '20090101';
-- (13951 row(s) affected)
-- Descending sort for population
SELECT *
INTO SOH2
FROM Sales.SalesOrderHeader
ORDER BY SalesOrderID DESC
-- (31465 row(s) affected)
-- 3 columns only
SELECT SalesOrderID,
OrderDate,
SubTotal
INTO SOH3
FROM Sales.SalesOrderHeader;
-- (31465 row(s) affected)
-- SELECT INTO with GROUP BY query source
SELECT [Year]=YEAR(OrderDate),
Orders=COUNT(*)
INTO SOH4
FROM Sales.SalesOrderHeader
GROUP BY YEAR(OrderDate)
-- (4 row(s) affected)
SELECT * FROM SOH4 ORDER BY Year DESC;

Year Orders
2008 13951
2007 12443
2006 3692
2005 1379

-- All source columns, and a new populated datetime column


SELECT *,
[CreateDate]=getdate()
INTO SOH5
FROM Sales.SalesOrderHeader ;
-- (31465 row(s) affected)
-- SELECT INTO temporary table
SELECT TotalOrders = COUNT(*)
INTO #TotalOrders
FROM Sales.SalesOrderHeader ;
-- (1 row(s) affected)
SELECT * FROM #TotalOrders;

TotalOrders
31465

-- Empty table create with one NULL row


SELECT Name=CONVERT(VARCHAR(45), NULL),
Age=CONVERT(INT, NULL)
INTO tempdb.dbo.Person;
INSERT tempdb.dbo.Person (Name, Age)
SELECT 'Roger Bond', 45;
-- (1 row(s) affected)
SELECT * FROM tempdb.dbo.Person;

Name Age
NULL NULL
Roger Bond 45

DELETE tempdb.dbo.Person WHERE Name is NULL;


-- (1 row(s) affected)
SELECT * FROM tempdb.dbo.Person;

Name Age
Roger Bond 45

-- Create gaps in ID sequence; increment by 2: 2, 4, 6, 8 instead of 1, 2, 3, 4


SELECT 2 * [BusinessEntityID] AS
BusinessEntityID
,[PhoneNumber]
,[PhoneNumberTypeID]
,[ModifiedDate]
INTO dbo.Phone
FROM [AdventureWorks2012].[Person].[PersonPhone] pp ORDER BY
pp.BusinessEntityID;
-- (19972 row(s) affected)
-- Populate with 100 random rows
SELECT TOP (100) *
INTO POH
FROM Purchasing.PurchaseOrderHeader
ORDER BY NEWID();
-- (100 row(s) affected)
SELECT PurchaseOrderID,
CONVERT(date, OrderDate) AS
OrderDate,
FORMAT(SubTotal, 'c', 'en-US') AS
SubTotal
FROM POH;

PurchaseOrderID OrderDate SubTotal


3553 2008-08-03 $9,948.33
1637 2008-02-07 $25,531.28
2796 2008-05-31 $97.97
684 2007-09-26 $270.81
3478 2008-07-28 $28,072.28
1904 2008-03-09 $43,878.45
755 2007-10-01 $50,860.43
2660 2008-05-19 $944.37
2787 2008-05-31 $34,644.23
601 2007-09-19 $146.29

-- SELECT INTO with data transformation


SELECT CultureID
,UPPER(Name) AS
Name
,CONVERT(date,ModifiedDate) AS
ModifiedDate
INTO dbo.Culture
FROM [AdventureWorks2012].[Production].[Culture]
ORDER BY CultureID;
-- (8 row(s) affected)
SELECT * FROM dbo.Culture WHERE CultureID != '' ORDER BY
CultureID; -- exclude empty ID
CultureID Name ModifiedDate
ar ARABIC 2002-06-01
en ENGLISH 2002-06-01
es SPANISH 2002-06-01
fr FRENCH 2002-06-01
he HEBREW 2002-06-01
th THAI 2002-06-01
zh-cht CHINESE 2002-06-01

Select into with identity column


The column data types are inherited in SELECT INTO table create. The
IDENTITY property is also inherited in a SELECT INTO unless it is
prevented with special coding. No other constraint is inherited.
-- IDENTITY property of ProductID is inherited
SELECT TOP (0) ProductID, ProductNumber, ListPrice, Color
INTO tempdb.dbo.Product
FROM AdventureWorks2012.Production.Product;
-- (0 row(s) affected)
INSERT tempdb.dbo.Product (ProductID, ProductNumber, ListPrice, Color)
SELECT 20001, 'FERRARI007RED', $400000, 'Red';
GO
/* Msg 544, Level 16, State 1, Line 1
Cannot insert explicit value for identity column in table 'Product'
when IDENTITY_INSERT is set to OFF. */
-- The following is one way to check for IDENTITY property
USE tempdb;
EXEC sp_help 'dbo.Product';

Identity Seed Increment Not For Replication


ProductID 1 1 0

USE AdventureWorks2012;
DROP TABLE tempdb.dbo.Product;
GO
-- The following construct will prevent IDENTITY inheritance
SELECT TOP (0) CAST(ProductID AS INT) AS ProductID, --
Cast/Convert the identity column
ProductNumber,
ListPrice,
Color
INTO tempdb.dbo.Product FROM
AdventureWorks2012.Production.Product;
-- (0 row(s) affected)
INSERT tempdb.dbo.Product (ProductID, ProductNumber, ListPrice, Color)
SELECT 20001, 'FERRARI007RED', $400000, 'Firehouse Red';
GO
SELECT * FROM tempdb.dbo.Product;

ProductID ProductNumber ListPrice Color


20001 FERRARI007RED 400000.00 Firehouse Red
SELECT INTO From Multiple-Table Queries
SELECT INTO works with almost any query, with some restrictions; for example,
as the error below shows, XML columns typed with a schema collection from
another database cannot be included.
SELECT JobCandidateID
,BusinessEntityID
,Resume
,ModifiedDate
INTO dbo.Resume
FROM AdventureWorks2012.HumanResources.JobCandidate;
/* ERROR Msg 458, Level 16, State 0, Line 2
Cannot create the SELECT INTO target table "dbo.Resume" because the xml
column "Resume"
is typed with a schema collection "HRResumeSchemaCollection" from
database "AdventureWorks2012".
Xml columns cannot refer to schemata across databases. */
-- SELECT INTO from joined tables
SELECT soh.SalesOrderID,
OrderDate,
OrderQty,
ProductID
INTO SalesOrder
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesOrderDetail sod ON soh.SalesOrderID =
sod.SalesOrderID ;
-- (121317 row(s) affected)
SELECT TOP(5) * FROM SalesOrder ORDER BY SalesOrderID DESC;

SalesOrderID OrderDate OrderQty ProductID


75123 2008-07-31 00:00:00.000 1 878
75123 2008-07-31 00:00:00.000 1 879
75123 2008-07-31 00:00:00.000 1 712
75122 2008-07-31 00:00:00.000 1 878
75122 2008-07-31 00:00:00.000 1 712

-- Check column types - partial results


EXEC sp_help SalesOrder;
Column_name Type Computed Length Prec Scale
SalesOrderID int no 4 10 0
OrderDate datetime no 8
OrderQty smallint no 2 5 0
ProductID int no 4 10 0

Select into with sorted table population


We can create ordering in a new temporary table by using the IDENTITY
function. There is no guarantee though that the IDENTITY sequence will be
the same as the ORDER BY clause specifications. Unique identity values on
the other hand are guaranteed.
SELECT ID=IDENTITY(int, 1, 1),
ProductNumber,
ProductID=CAST(ProductID AS INT),
ListPrice,
COALESCE(Color, 'N/A') AS Color
INTO #Product
FROM Production.Product WHERE ListPrice > 0.0 ORDER BY
ProductNumber;
GO
-- (304 row(s) affected)
SELECT TOP 10 * FROM #Product ORDER BY ID;

ID ProductNumber ProductID ListPrice Color


1 BB-7421 994 53.99 N/A
2 BB-8107 995 101.24 N/A
3 BB-9108 996 121.49 N/A
4 BC-M005 871 9.99 N/A
5 BC-R205 872 8.99 N/A
6 BK-M18B-40 989 539.99 Black
7 BK-M18B-42 990 539.99 Black
8 BK-M18B-44 991 539.99 Black
9 BK-M18B-48 992 539.99 Black
10 BK-M18B-52 993 539.99 Black

-- Permanent table create


SELECT * INTO ProductByProdNo FROM #Product ORDER BY ID;
GO -- (304 row(s) affected)
SELECT TOP (6) * FROM ProductByProdNo ORDER BY ID;

ID ProductNumber ProductID ListPrice Color


1 BB-7421 994 53.99 N/A
2 BB-8107 995 101.24 N/A
3 BB-9108 996 121.49 N/A
4 BC-M005 871 9.99 N/A
5 BC-R205 872 8.99 N/A
6 BK-M18B-40 989 539.99 Black

Select into with random population


We can create a random population by sorting with the NEWID() function.
USE tempdb;
SELECT TOP(5) ID = ContactID,
FullName =
CONCAT(FirstName, ' ', LastName),
Email = EmailAddress
INTO dbo.Person
FROM AdventureWorks.Person.Contact
WHERE EmailPromotion = 2
ORDER BY NEWID();
-- (5 row(s) affected)
SELECT * FROM dbo.Person;
GO

ID FullName Email
1075 Diane Glimp diane0@adventure-works.com
15739 Jesse Mitchell jesse36@adventure-works.com
5405 Jose Patterson jose33@adventure-works.com
1029 Wanida Benshoof wanida0@adventure-works.com
8634 Andrea Collins andrea26@adventure-works.com

-- Rerun the script again after dropping the table


DROP TABLE tempdb.dbo.Person;
GO
-- Command(s) completed successfully.
SELECT TOP(5) ID = ContactID,
FullName = CONCAT(FirstName, ' ',
LastName),
Email = EmailAddress
INTO dbo.Person
FROM AdventureWorks.Person.Contact
WHERE EmailPromotion = 2 ORDER BY NEWID();
SELECT * FROM dbo.Person;

ID FullName Email
9984 Sydney Clark sydney81@adventure-works.com
15448 Denise Raman denise13@adventure-works.com
12442 Carson Jenkins carson5@adventure-works.com
1082 Mary Baker mary1@adventure-works.com
18728 Emma Kelly emma46@adventure-works.com

Combining select into with insert select


First we create an empty table with identity property using SELECT INTO,
then we populate it with INSERT SELECT.
-- Following will fail - only one IDENTITY column per table
SELECT TOP (0) IDENTITY(int, 1, 1) AS ID,
ProductID,
Name AS ProductName,
ListPrice,
COALESCE(Color,
'N/A') AS Color
INTO #Product FROM Production.Product;
GO
/* ERROR Msg 8108, Level 16, State 1, Line 1
Cannot add identity column, using the SELECT INTO statement, to table
'#Product',
which already has column 'ProductID' that inherits the identity property. */
SELECT TOP (0) IDENTITY(int, 1, 1) AS ID,
CAST(ProductID AS INT) AS
ProductID, -- IDENTITY will not be inherited
Name AS ProductName,
ListPrice,
COALESCE(Color, 'N/A') AS
Color
INTO #Product FROM Production.Product;
GO
-- (0 row(s) affected)
DECLARE @Rows tinyint = 5;
INSERT INTO #Product (ProductID, ProductName, ListPrice, Color)
SELECT TOP (@Rows) ProductID,
Name,
ListPrice,
Color
FROM Production.Product
WHERE ListPrice > 0.0 AND Color IS NOT NULL ORDER BY
ListPrice DESC;
-- (5 row(s) affected)
SELECT * FROM #Product;

ID ProductID ProductName ListPrice Color


1 749 Road-150 Red, 62 3578.27 Red
2 750 Road-150 Red, 44 3578.27 Red
3 751 Road-150 Red, 48 3578.27 Red
4 752 Road-150 Red, 52 3578.27 Red
5 753 Road-150 Red, 56 3578.27 Red

Copy table into different database with select into


It requires 3-part name referencing to operate between databases (cross
database). The current database requires only 2-part object name referencing.
USE tempdb;
SELECT *, CopyDate = CONVERT(DATE,GETDATE())
INTO Department
FROM AdventureWorks.HumanResources.Department ORDER BY
DepartmentID;
GO
SELECT TOP (5) DepartmentID, Department=Name, CopyDate FROM Department ORDER BY
DepartmentID;

DepartmentID Department CopyDate


1 Engineering 2016-07-19
2 Tool Design 2016-07-19
3 Sales 2016-07-19
4 Marketing 2016-07-19
5 Purchasing 2016-07-19

-- SQL drop table - full referencing of table for mistake reduction


DROP TABLE tempdb.dbo.Department;

Combining select into with update


After creating a populated table with SELECT INTO, we perform UPDATE
to change a column.
USE tempdb;
SELECT TOP 100 * INTO PurchaseOrderHeader
FROM AdventureWorks.Purchasing.PurchaseOrderHeader ORDER BY
NEWID();
GO
-- The following logic updates dates to different values - multiple value
assignment operator
DECLARE @OrderDate DATETIME = CURRENT_TIMESTAMP;
UPDATE PurchaseOrderHeader SET @OrderDate = OrderDate =
dateadd(day, -1, @OrderDate);
GO
SELECT TOP
(5) PurchaseOrderID, VendorID, OrderDate FROM PurchaseOrderHeader;

PurchaseOrderID VendorID OrderDate


631 39 2016-07-18 09:03:18.193
759 32 2016-07-17 09:03:18.193
2652 33 2016-07-16 09:03:18.193
769 80 2016-07-15 09:03:18.193
949 30 2016-07-14 09:03:18.193

DROP TABLE tempdb.dbo.PurchaseOrderHeader;


SELECT INTO Table Create from Complex Query
SELECT INTO table create works from simple to very complex queries.
USE AdventureWorks;
SELECT SalesStaff = CONCAT(C.LastName, ', ', C.FirstName),
ZipCode = A.PostalCode,
TotalSales = FORMAT(SUM(SOD.LineTotal), 'c', 'en-US'),
PercentOfTotal = FORMAT(
SUM(SOD.LineTotal) /
SUM(SUM(SOD.LineTotal)) OVER (), 'p')
INTO tempdb.dbo.SalesSummary
FROM Person.Contact C
INNER JOIN Person.[Address] A
ON A.AddressID = C.ContactID
INNER JOIN Sales.SalesOrderHeader SOH
ON SOH.SalesPersonID = C.ContactID
INNER JOIN Sales.SalesOrderDetail SOD
ON SOD.SalesOrderID = SOH.SalesOrderID
WHERE TerritoryID IS NOT NULL
GROUP BY C.FirstName, C.LastName, A.PostalCode, C.ContactID
ORDER BY SalesStaff, ZipCode;
-- (17 row(s) affected)
-- SELECT 10 random rows, then sort them by name (SalesStaff) - derived table construct
SELECT * FROM
( SELECT TOP (10) *
FROM tempdb.dbo.SalesSummary ORDER BY NEWID()
) x -- x is called a derived table; also dubbed SELECT FROM SELECT
ORDER BY SalesStaff;

SalesStaff ZipCode TotalSales PercentOfTotal


Dusza, Maciej 98027 $9,293,903.00 11.55 %
Dyck, Shelley 98027 $10,367,007.43 12.88 %
Ecoffey, Linda 98027 $10,065,803.54 12.51 %
Eldridge, Carla 98027 $3,609,447.21 4.48 %
Elliott, Carol 98027 $7,171,012.75 8.91 %
Emanuel, Michael 98055 $5,926,418.36 7.36 %
Erickson, Gail 98055 $8,503,338.65 10.56 %
Estes, Julie 98055 $172,524.45 0.21 %
Esteves, Janeth 98055 $1,827,066.71 2.27 %
Evans, Twanna 98055 $1,421,810.92 1.77 %

DROP TABLE tempdb.dbo.SalesSummary ;

Select into table create from system procedure execution

Using OPENROWSET and OPENQUERY, we can make the result sets of
system procedures and user stored procedures table-like.
SELECT *
INTO #spwho
FROM OPENROWSET ( 'SQLOLEDB',
'SERVER=.;Trusted_Connection=yes',
'SET FMTONLY OFF EXEC sp_who');
GO -- (64 row(s) affected) - it varies, depends on the number server
connections
SELECT TOP (5) * FROM #spwho ORDER BY spid;
GO

spid ecid status loginame hostname blk dbname cmd


1 0 background sa 0 NULL LOG WRITER
2 0 background sa 0 NULL RECOVERY WRITER
3 0 background sa 0 NULL LAZY WRITER
4 0 background sa 0 NULL LOCK MONITOR
5 0 background sa 0 master SIGNAL HANDLER

/* Requirement for OPENQUERY operation on current instance.


DATA ACCESS to current SQL Server named instance can be setup the
following way:
exec sp_serveroption @server = 'PRODSVR\SQL2008' -- computer name
for default instance
,@optname = 'DATA ACCESS'
,@optvalue = 'TRUE' ;
This way, OPENQUERY can be used against current instance. Usually
OPENQUERY is used to access linked servers.
*/
SELECT DB_NAME(dbid) AS DB, *
INTO #splock
FROM OPENQUERY(HPESTAR, 'EXEC sp_lock');
GO
-- (156 row(s) affected) - it varies, depends how busy is the system with
OLTP activities
SELECT TOP(2) * FROM #splock ;

DB spid dbid ObjId IndId Type Resource Mode Status


ReportServer 52 5 0 0 DB S GRANT
msdb 54 4 0 0 DB S GRANT

Select into from openquery stored procedure execution

The following shows another way to make stored procedure results table-like,
this time using OPENQUERY.
The bill-of-materials stored procedure is recursive.
USE AdventureWorks2012;
GO
SELECT Name FROM Production.Product WHERE ProductID = 900; --
LL Touring Frame - Yellow, 50
-- First we test the query execution
DECLARE @RC int; DECLARE @StartProductID int; DECLARE
@CheckDate datetime;
EXECUTE @RC = [dbo].[uspGetBillOfMaterials] @StartProductID =
900 , @CheckDate = '20080216';
GO
-- 24 rows returned
-- Transform query into SELECT INTO table create - Single quotes (around
date literal) must be doubled
SELECT * INTO BOM900
FROM OPENQUERY(HPESTAR, 'EXECUTE [AdventureWorks2012].
[dbo].[uspGetBillOfMaterials] 900,''20080216''');
GO
-- (1 row(s) affected) -- create table
-- (24 row(s) affected) -- inserts
SELECT * FROM BOM900;

ProductAssemblyID ComponentID ComponentDesc TotalQuantity StandardCost


900 324 Chain Stays 2.00 0.00
900 325 Decal 1 2.00 0.00
900 326 Decal 2 1.00 0.00
900 327 Down Tube 1.00 0.00
900 399 Head Tube 1.00 0.00
900 496 Paint - Yellow 8.00 0.00
900 532 Seat Stays 4.00 0.00
900 533 Seat Tube 1.00 0.00
900 534 Top Tube 1.00 0.00
900 802 LL Fork 1.00 65.8097
324 486 Metal Sheet 5 1.00 0.00
327 483 Metal Sheet 3 1.00 0.00
399 485 Metal Sheet 4 1.00 0.00
532 484 Metal Sheet 7 1.00 0.00
533 478 Metal Bar 2 1.00 0.00
534 482 Metal Sheet 2 1.00 0.00
802 316 Blade 2.00 0.00
802 331 Fork End 2.00 0.00
802 350 Fork Crown 1.00 0.00
802 531 Steerer 1.00 0.00
316 486 Metal Sheet 5 1.00 0.00
331 482 Metal Sheet 2 1.00 0.00
350 486 Metal Sheet 5 1.00 0.00
531 487 Metal Sheet 6 1.00 0.00
Execution of select into from dynamic sql
This T-SQL script demonstrates SELECT INTO execution within dynamic SQL. The
biggest challenge is to get the single quotes right; using CHAR(39) instead
of embedded quotes is an option.
-- SQL Server 2008 new feature: instant assignment to a localvariable
DECLARE @DynamicQuery nvarchar(max) =
'SELECT *
INTO BOM400
FROM OPENQUERY(' + QUOTENAME(CONVERT(sysname,
@@SERVERNAME))+ ',
''EXECUTE [AdventureWorks2012].[dbo].
[uspGetWhereUsedProductID] 400,
''''2007-11-21'''''')' ;
PRINT @DynamicQuery; -- test query; this is the static query which will
be executed
/*
SELECT *
INTO BOM400
FROM OPENQUERY([HPESTAR],
'EXECUTE [AdventureWorks2012].[dbo].
[uspGetWhereUsedProductID] 400,
''2007-11-21''')
*/
EXEC sp_executeSQL @DynamicQuery;
GO -- (64 row(s) affected)
SELECT TOP ( 5 ) * FROM BOM400
ORDER BY NEWID() ;
ProductAssemblyID ComponentID ComponentDesc TotalQuantity StandardCost
761 818 Road-650 Red, 62 1.00 486.7066
987 823 Mountain-500 Silver, 48 1.00 308.2179
990 823 Mountain-500 Black, 42 1.00 294.5797
765 826 Road-650 Black, 58 1.00 486.7066
770 818 Road-650 Black, 52 1.00 486.7066

-- Cleanup
DROP TABLE BOM400;

Select into table create from view


Transact-SQL script demonstrates how to import view query results into a
table.
SELECT [FullName],
[SalesPersonID] AS
StaffID,
[SalesTerritory],
COALESCE(FORMAT([2006], 'c','en-US'), '') AS [2006],
COALESCE(FORMAT([2007], 'c','en-US'), '') AS [2007],
COALESCE(FORMAT([2008], 'c','en-US'), '') AS [2008]
INTO #Sales
FROM [AdventureWorks2012].[Sales].[vSalesPersonSalesByFiscalYears]
ORDER BY SalesTerritory, FullName;
GO
SELECT *
FROM #Sales
ORDER BY SalesTerritory, FullName;
GO

FullName StaffID SalesTerritory 2006 2007 2008


Lynn N Tsoflias 286 Australia $1,421,810.92
Garrett R Vargas 278 Canada $930,259.47 $1,225,468.28 $1,453,719.47
José Edvaldo Saraiva 282 Canada $2,088,491.17 $1,233,386.47 $2,604,540.72
Jillian Carson 277 Central $2,737,537.88 $4,138,847.30 $3,189,418.37
Ranjit R Varkey Chudukatil 290 France $1,388,272.61 $3,121,616.32
Rachel B Valdez 288 Germany $1,827,066.71
Michael G Blythe 275 Northeast $1,602,472.39 $3,928,252.44 $3,763,178.18
David R Campbell 283 Northwest $1,017,402.86 $1,139,529.55 $1,573,012.94
Pamela O Ansman-Wolfe 280 Northwest $1,226,461.83 $746,063.63 $1,352,577.13
Tete A Mensa-Annan 284 Northwest $735,983.49 $1,576,562.20
Tsvi Michael Reiter 279 Southeast $2,645,436.95 $2,210,390.19 $2,315,185.61
Linda C Mitchell 276 Southwest $2,260,118.45 $3,855,520.42 $4,251,368.55
Shu K Ito 281 Southwest $1,593,742.92 $2,374,727.02 $2,458,535.62
Jae B Pak 289 United Kingdom $4,386,467.42 $4,116,871.23

Select into data import from excel


This T-SQL OPENROWSET query imports data from Excel into a table. Your Excel
OLE DB provider may be different than the one in the example.
SELECT * INTO ContactList FROM
OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'Excel 8.0;Database=D:\data\excel\Contact.xls', 'SELECT *
FROM [Contact$]')
-- (19972 row(s) affected)
Keep Learning
This probably goes without saying, but you should never stop learning. Don’t
stop learning SQL either if it really interests you.
Dig into the AdventureWorks, Company_Db, Chinook and the Northwind
databases by running queries and understanding the data. Part of learning
SQL is understanding the data within a database, too.
Once you come up with some scripts of your own for these databases, or even
your own database project, post them on GitHub.com to show examples of your
work.

More References
If you missed where I mentioned my free blog earlier in the book, don’t
worry! Here’s a link to The SQL Vault so that you can follow my latest
experiences and they’ll provide something for you to learn as well!
Chapter 17 - Data Visualizations
The final topic that we are going to spend some time learning about in this
guidebook is how we are able to handle some of our data visualizations. This
is where we are going to be able to figure out the best way to present the data
to those who need it the most. Often the data scientist and the person who is
going to need to use the information for their own needs are not going to be
the same people. A company will need to use that data in order to help them
to make some good decisions, but they may not have the technical resources
and knowledge in order to create the algorithms and get it all set up on their
own.
This is why many times they are going to hire a specialist who is able to help
them with the steps of the data science project. This ensures they are able
to work with the data in order to make some smart decisions along the way.
But then the data scientist has to make sure the results can be read and
understood. These algorithms can produce some pretty technical output that is
hard to understand if you do not know how to work with them.
This means that the data scientist has to be able to go through and find a way
in order to share the information in a manner that the person who will use it is
able to understand. There are a number of ways that we are able to do this,
but we must remember that one of the best ways to do this is through the help
of data visualization.
Sure, we can go through all of this and try to write it all up in a report or
on a spreadsheet and hope that this is going to work. And this is not a bad
method to work with. But it is boring and harder to read through. It takes a
lot more time to read through this kind of information and hope that we find
what we need. It is possible, but it is not as easy.
For most people, working with a visual is going to be so much easier than
trying to look through a lot of text. These visuals give us a way to just glance
at some of the information and figure out what is there. When we are able to
look at two parts of our data side by side in a chart or a graph, we are going
to be able to see what information is there and make decisions on that a
whole lot faster than we are able to do with just reading a few pages of
comparisons on a text document.
Picking out the right kind of visual that you will want to work with is going
to be so important to this process. You have to make sure that we are picking
out a visual that works for the kind of data that you want to be able to show
off to others. If you go with the wrong kind of graph, then you are going to
end up with a ton of trouble. The visuals are important and can show us a lot
of information, but they are not going to be all that helpful if you are not even
able to read through them at all or if they don’t showcase the information all
that well in the first place.
Often when we take a look at a visual, we are able to absorb a ton of
information in a short amount of time. Something that could take ten pages of
a report can be conveyed in a simple chart that takes a few minutes to glance
at and understand. And when you use a few of these visuals along the way, you
will find that it is much easier to work with the data and understand what is
there.
This doesn’t mean that we can’t work with some of the basics that are there
with the reports and more. The person who is taking a look at the information
and trying to make some smart decisions about it will find that it is really
useful for them to see some of the backgrounds about your information as
well. They need to be able to see how the data was collected, what sources
were used, and more. And this is something that you are able to put inside of
your data and text as well.
There is always a lot of use for a report of this kind, but we need to make sure
that it is more of a backup to some of the other things that you have been able
to do. If this is all that you have, then it is going to be really hard for you to
work with some of this, and it can get boring to figure out what information
is present in the data or what you learned about in your analysis.
The good news here is that there are a ton of different types of visuals that
you are able to work with. This variety is going to help you to really see some
good results with the data because you can make sure that you are able to find
the visual that works for any kind of data that you are working with. There
are options like histograms, pie charts, bar graphs, line graphs, scatterplots,
and more.
Before you end your project, it is a good idea to figure out what kind of
visuals you would like to work with. This is going to ensure that you are able
to pick out the visual that will match with the data, and with the results, that
you have gotten, and this will ensure that we are going to be able to really see
the information that you need to sort through.
There are many options that you are able to work with as you need. You can
choose to pick out the one that is the best for you, and maybe even try a few
of these to figure out which one is going to pack the biggest punch and can
help you to get things done. Make sure to check what your data is telling you,
and learn a bit more about the different visuals that are there and how you are
able to work with them.
With this in mind, we need to take a look at what makes a good data
visualization. Good visualizations are created when design, data science, and
communication come together. Data visuals, when they are done right, offer
key insights into more complicated data sets, and they do this in a way that
is intuitive and meaningful. This is why they are often the best way to take
a look at some of the more complicated ideas out there.
In order to call something a good data visualization, you have to start out
with data that is clean, complete, and well-sourced. Once the data is set up
and ready to visualize, you need to pick the right chart to work with. This is
sometimes a challenge to work with, but you will find that there are a variety
of resources out there that you can choose to work with, and which will help
you pick out the right chart type for your needs.
Once you have a chance to decide which of these charts is the best, it is time
to go through and design, as well as customize, the visuals to the way that
you would like. Remember that this simplicity is going to be key. You do not
want to have so many elements in it that this distracts from the true message
that you are trying to do within the visual in the first place.
There are many reasons why we would want to work with these data visuals
in the first place. The number one reason is that it can help us to make some
better decisions. Today, more than ever before, companies are using data
tools and visuals in order to ask better questions and to make better
decisions. Some of the emerging computer technologies and other software
programs have made it easier to learn as much as possible about your company,
and this can help us make better decisions that are driven by data.
The strong emphasis right now on performance metrics, KPIs, and data
dashboards shows the importance of monitoring and measuring company data.
Common quantitative information measured by businesses includes the products
or units sold, the revenue earned each quarter, the expenses of each
department, statistics on employees, the market share of the company, and
more.
These are also going to help us out with some meaningful storytelling as
well. These visuals are going to be a very big tool for the mainstream media
as well. Data journalism is already something that is on the rise, and many
journalists are going to rely on really good visual tools in order to make it
easier to tell their stories, no matter where they are in the world. And many of
the biggest and most well-known institutions are already embracing all of this
and using these visuals on a regular basis.
You will also find that marketers are going to be able to benefit from these
visuals. Marketers are going to benefit from the combination of quality data
and some emotional storytelling that is going on as well. Some of the best
marketers out there are able to make decisions that are driven by data each
day, but then they have to switch things around and use a different approach
with their customers.
The customer doesn't want to be treated like they are dumb, but they also
don't want to have all of the data and facts thrown at them all of the time.
This is why a marketer needs to be able to reach the customer both
intelligently and emotionally. Data visuals make it easier for marketers to
share their message with statistics, as well as with heart.
Those are just a few of the examples of how we are able to work with the
idea of data visuals for your needs. There are so many times when we are
able to complete a data visually, and then use it along with some of the other
work that we have been doing with data analysis to ensure that it provides us
with some more context on what is going on with our work.
Being able to not only read but also understand these data visuals has become
a necessary requirement in the modern business world. Because these tools and
the resources that come with them are readily available now, even
non-technical professionals need to be able to look through this data and
figure out what is there.
Increasing the literacy of data for many professionals, no matter what their
role in the company is all about, is going to be a very big mission to
undertake from the very beginning. This is something that your company
needs to learn how to focus on because it is really going to end up benefiting
everyone who is involved in the process as well. With the right kind of data
education and some good support, we are going to make sure that everyone
not only can read this information, but that they are more informed, and that
they are able to read the data and use that data to help them make some good
decisions overall. All of this can be done simply by being able to read
through these visuals.
Chapter 18 - Python Debugging
Like most computer programming languages, Python relies on debugging to
produce reliable programs. A debugger lets you run an application under its
control and pause it at breakpoints you set. Python also provides an
interactive source-code debugger for running a program under debugger
control. Related activities include unit and integration testing, analysis of
log files and log flows, and system-level monitoring.
Running a program within a debugger involves several tools, whether driven
from the command line or from an IDE. The development of more sophisticated
computer programs has significantly contributed to the expansion of debugging
tools. These tools provide methods for detecting abnormalities in Python
programs, evaluating their impact, and planning updates and patches to
correct emerging problems. In some cases, debugging tools also help
programmers develop new programs by eliminating code and Unicode faults.

Debugging
Debugging is the technique used to detect and provide solutions to defects or
problems within a computer program. The term 'debugging' is often credited to
Admiral Grace Hopper, who was working at Harvard University on the Mark II
computer in the 1940s when a moth found between relays was hindering computer
operations; removing it was jokingly described as 'debugging' the system.
Although a similar use of 'bug' goes back to Thomas Edison in 1878, debugging
became popular in the early 1950s as programmers adopted the term for their
work on computer programs. By the 1960s, debugging was widespread among
computer users and was the most common term used to describe solving major
computing problems. As the world has become more digitalized and programs
more challenging, the scope of debugging has grown, and words like errors,
bugs, and defects are sometimes replaced with more neutral ones such as
anomaly and discrepancy. However, the neutral terms are themselves debated:
the goal is to find practical terms that describe computing problems
accurately without encouraging end users to dismiss faults as acceptable.

Anti-Debugging
Anti-debugging is the opposite of debugging and encompasses techniques that
prevent debugging or reverse engineering of code. It is used, for example, in
copy-protection schemes, and by malware, to detect and block debuggers.
Anti-debugging therefore works against debugging tools, preventing the
detection and removal of errors that occasionally appear during Python
programming. Some of the conventional techniques used are:

API-based
Exception-based
Modified code
Determining and penalizing debugger
Hardware-and register-based
Timing and latency

Concepts of Python Debugging


Current Line
The current line reflects the idea that a computer executes only one
statement at a time. The flow of code runs from one line to the next, and
only the current line is being executed at any moment. In Python, the current
line changes according to constructs such as loops, if statements, and
function calls, among others. Debugging also does not have to start at the
first line; you can use breakpoints to decide where execution should stop and
which parts to skip.

Breakpoints
When you run a Python program, execution usually begins at the first line and
runs continuously until it finishes or hits an error. However, a bug may sit
in a specific function or section of the program, far from where the failure
is first noticed. This is where breakpoints become useful. A breakpoint tells
the debugger where to pause: program execution halts at that point so you can
inspect the state and make the necessary corrections. This concept,
therefore, enables you to produce working Python programs in a shorter time.

Stepping
Stepping is another concept that debugging tools use to make programs more
efficient. In Python, stepping means moving through the code line by line to
find the lines with defects, as well as any other mistakes that need
attention before execution. Stepping comes in three forms: step in, step
over, and step out. Step in moves into the function called on the current
line so you can debug its code directly. Step over executes the current line
and moves to the following line in the existing function without entering
any calls. Step out runs the rest of the current function and returns to the
caller before execution continues. A minimal sketch of a breakpoint and
these stepping commands follows.
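
The sketch below is a minimal example rather than a full program: it sets a
breakpoint with the built-in breakpoint() call (Python 3.7 and later). At
the resulting (Pdb) prompt, "s" steps into a call, "n" steps over the
current line, "r" runs until the current function returns, and "c"
continues. The function names are only illustrations.

def add_tax(price):
    return price * 1.2

def order_total(prices):
    breakpoint()              # execution pauses here and the pdb prompt appears
    total = 0
    for p in prices:
        total += add_tax(p)   # typing "s" at the prompt steps into add_tax()
    return total

print(order_total([10, 20, 30]))
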

Continuous Program Execution


There are cases where you will want the program to resume running on its
own. The continue command gives your computer control to resume executing
the code until the end, unless there is another breakpoint. The exact resume
command or button may vary depending on the computer's operating system or
the language programming package, but there are enough similarities between
them to make Python debugging adaptable for different end users and
developers.

Exiting the Debugging Tool


The primary purpose of acquiring a debugger tool is to identify and eliminate
problems. After the debugger has detected an error within the program or its
code, the problem must be corrected. The usual sequence is to fix the failure
by rewriting the offending code, stop the debugging session, set a breakpoint
on the fixed line, and launch the debugger again to confirm the fix. The
exact procedure may vary with the operating system and with packages other
than Python.

Function Verification
When writing code, it is vital to keep track of the state of the program,
especially calculations and variables. Similarly, as the number of functions
grows, the calls stack up, so it helps to follow the chain of function calls
to understand how each one affects the next. Likewise, when stepping, it is
recommended to enter the nested code first, so that you develop a sequential
approach and execute the right code in the right order.

Processes of Debugging
Problem Reproduction
The primary function of a debugger application is to detect and eliminate
problems that affect the programming process. The first step in debugging is
to identify and reproduce the existing problem, whether it is a failure in a
non-trivial function or a rarer software bug. Debugging focuses on the
immediate state of your program, noting the bugs present at that time.
Reproduction is typically affected by the computer's usage history and its
immediate environment, both of which can influence the end results.

Simplification of the Problem


The second step is to simplify the program's input by breaking it into
smaller pieces so the bug is easier to isolate. For example, a compiler fed a
large amount of data containing a bug may crash during parsing because it
processes all the data at once. Breaking the input down into smaller files
makes the problem easier to reproduce and prevents the program from failing
wholesale. The programmer can then identify the bug by checking the different
source files from the original test case and see whether more problems need
immediate debugging action.
Elimination of Bugs
After reproducing the problem and simplifying the program to check for bugs,
the next step is to use a debugger tool to analyze the state of your
software. Scanning through a well-organized, simplified program also lets you
determine where the fault originates. Similarly, bug tracing can be used to
track down the source, which makes it possible to remove the problem at its
point of origin. In Python programs, tracing plays a significant role in
watching variables in different sections so that execution can be followed
closely. The debugger tool then works on the bug or bugs present, removing
them and keeping the program free of faults.

Debugging in Embedded Systems


Fixed, constant, or embedded systems are quite different from general-purpose
computer software. Embedded development tends to involve multiple platforms,
for instance different operating systems and CPU architectures along with
their variants. Embedded debugger tools are therefore designed to carry out a
single task for a given piece of software in order to optimize it, and a
specialised tool is needed for each task, which makes choosing a specific one
much harder.
Faced with this heterogeneity, embedded debuggers fall into different
categories, for instance commercial and research tools, as well as
subdivisions for specific problems. Green Hills Software provides an example
of a commercial debugger, while research tools include FlockLab. Identifying,
simplifying, and eliminating embedded bugs relies on an approach that
collects information about the operating state, which boosts performance and
optimizes your system adequately.

Debugging Techniques
Like other programming languages, Python relies on debugging techniques to
identify and eliminate bugs. Some of the standard methods are interactive,
print, remote, post-mortem, algorithmic, and delta debugging, and the names
describe how each one goes about removing bugs. For instance, print debugging
means monitoring and tracing the program by printing out intermediate values.
Remote debugging is a technique for debugging a program that runs on a
different system from the debugger tool, while post-mortem debugging means
examining a program after it has already crashed. Learning the different
types of debugging helps you decide which one to use when you need to pin
down a Python programming problem. Other techniques are the Saff Squeeze,
which isolates a fault by progressively narrowing down the failing code, and
causality tracking, which is essential for tracing causal relationships in a
computation.
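
As one illustration, post-mortem debugging is available in the standard pdb
module: running a script as "python -m pdb script.py" drops you into the
debugger if the program crashes, and pdb.post_mortem() can be called from an
exception handler, as in this small sketch with a deliberately faulty
function.

import pdb
import traceback

def faulty():
    return 1 / 0              # a deliberate bug for the example

try:
    faulty()
except ZeroDivisionError:
    traceback.print_exc()
    pdb.post_mortem()         # inspect the frame where the crash happened
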

Python Debugging Tools


With several tools available today, it can be difficult to pick the best one
for Python programs. Python, however, has numerous debugging tools to help
keep code free from errors. How well a Python debugger works can also depend
on the operating system and on whether it is built into the program or
installed separately. Python debuggers may run inside an IDE, on the command
line, or by analyzing the available data to head off bugs.

Debuggers Tools
Python debuggers are either general-purpose or specific in nature, depending
on the platform used, that is, on the operating system. Some of the
all-purpose debuggers are pdb and PdbRcldea, while multipurpose tools include
pudb, Winpdb, Epdb2, epdb, JpyDbg, pydb, trepan2, and Pythonpydebug. On the
other hand, specific debuggers include gdb, DDD, Xpdb, and the HAP Python
Remote Debugger. These debugging tools operate on different parts of the
Python workflow, with some used during installation, program creation, remote
debugging, thread debugging, and graphical debugging, among others.

IDEs Tools
An Integrated Development Environment (IDE) provides some of the best Python
debugging tools, as they suit big projects well. Although the tools vary
between IDEs, the core features remain the same: executing code, inspecting
variables, and creating breakpoints. The most common and widely used IDE for
Python debugging is PyCharm, which includes a complete set of debugging
features as well as plugins for maximizing the performance of Python
programs. Other IDE debugging tools are also good and readily available
today, including Komodo IDE, Thonny, PyScripter, PyDev, Visual Studio Code,
and Wing IDE, among others.

Special-Purpose Tools
Special-purpose debugging tools detect and eliminate bugs in particular parts
of a Python program, often working on remote processes. These tools are most
useful when tracing problems in sensitive or remote areas that other
debuggers are unlikely to reach. Some of the most commonly used
special-purpose tools are FirePython (a Python logger used with Firefox),
manhole, PyConquer, pyringe, hunter, icecream, and PySnooper. This
subdivision of debugging tools lets programmers quickly find hidden and
unnoticed bugs and display them so they can be eliminated from the system.
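
For instance, assuming the third-party pysnooper package is installed (pip
install pysnooper), a single decorator logs every line executed and every
variable change in a function, which helps surface the hidden bugs described
above; the function here is only an illustration.

import pysnooper

@pysnooper.snoop()
def to_binary(number):
    digits = []
    while number:
        digits.append(number % 2)
        number //= 2
    return digits[::-1]

to_binary(6)
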

Understanding Debugging and Python Programming

Before venturing deeper into the connection between a program and its
debugging, it helps to understand how the application behaves. One of the
significant features of debugging is that it runs the code in your program
one line at a time and lets you see the process of data execution. It acts
like an instant replay of what has happened in the Python program, a
systematic walk-through that reveals the semantic errors that occurred.
While code is executing, your computer gives you only a limited view of what
is happening; a debugger makes it visible. The Python program effectively
plays back in slow motion while you identify the errors or bugs present in
the code. As such, the debugger enables you to determine the following:
The flow of codes in the program
The techniques used to create variables
Specific data contained in each variable within the program
The addition, modification, and elimination of functions
Any other types of calculations performed
Code looping
How the IF and ELSE statements have been entered

Debugger Commands
Debugging is a common feature across programming languages, and there are
several commands for moving between its various operations. The basic
commands are the most essential for beginners, and each can usually be
abbreviated to one or a few letters. A command is separated from its
arguments by a blank space, while optional parts are shown in square
brackets; the brackets themselves are not typed, and alternatives are
separated by a vertical bar. In Python, any input that the debugger does not
recognize as one of its commands is executed as a Python statement in the
context of the program being debugged.
To run a Python statement explicitly, for instance to inspect variables
against errors and other faults, prefix it with an exclamation mark; this
also makes it possible to change variables and make function calls. Several
commands may be entered on the same line, separated by ';;', with each input
spaced separately from the others. The debugger also works with aliases,
which let you define short names for longer command sequences, and it reads
start-up files from the directory so that your preferred commands are
available at the debugger prompt.
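
As one illustration of these conventions, the lines below show a
hypothetical .pdbrc start-up file, which pdb reads from the home or current
directory when it launches; the alias shown is only an example.

# ~/.pdbrc: example pdb start-up file
# Define an alias: "pl" prints the local variables of the current frame
alias pl p locals()
# At the (Pdb) prompt you can also chain commands with ";;",
# for example:  b 25 ;; c   (set a breakpoint on line 25, then continue)
# and run a raw Python statement by prefixing it with "!", e.g.  !x = 10
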
Conclusion
Whenever someone visits a website, data is taken from their computer and sent
through the site. You can take this same data and place it in your SQL
database. However, doing so carelessly is risky, because it leaves you open
to what is known as SQL injection, which can end up wiping out all of the
hard work that you have put into your SQL script.
An injection typically occurs when you ask a user to enter some type of data
into a prompt box before they can continue. You will not necessarily get the
information that you want; instead, you could end up with a statement that
runs through your database without you ever knowing it happened.
Users cannot be trusted to give you exactly the data you are requesting, so
you need to make sure that the data they enter is checked before it is sent
to your database. This helps secure your database from any SQL injection
statements. Most of the time, you will use pattern matching to examine the
data before you decide to send it to your main database.
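
A minimal Python sketch of that idea, using the standard sqlite3 module and
assuming a hypothetical customers table already exists, shows why a
parameterized query is safer than building the SQL string yourself:

import sqlite3

conn = sqlite3.connect("shop.db")        # hypothetical database file
cur = conn.cursor()

user_input = "Alice'; DROP TABLE customers; --"   # untrusted form data

# Unsafe: string formatting lets the input rewrite the query itself
# cur.execute(f"SELECT * FROM customers WHERE name = '{user_input}'")

# Safer: the "?" placeholder treats the input purely as data, never as SQL
cur.execute("SELECT * FROM customers WHERE name = ?", (user_input,))
print(cur.fetchall())
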
Function calls are used when you pull a particular record from the requested
table, working with the title of that row. The data should match what you
have received from the user, which helps keep your database safe from SQL
injection statements.
With MySQL, queries cannot be carried or stacked into a single call. This
helps keep calls from failing because of stacked queries.
Extensions such as SQLite, however, do allow your queries to be stacked as
you run your searches in that one string. This is where safety issues come
into play with your script and your database.
STATISTICS
FOR BEGINNERS:
FUNDAMENTALS OF PROBABILITY AND
STATISTICS FOR DATA SCIENCE AND
BUSINESS APPLICATIONS, MADE EASY FOR
YOU

Matt Foster

© Copyright 2019 - All rights reserved.


The content contained within this book may not be reproduced, duplicated, or
transmitted without direct written permission from the author or the
publisher.
Under no circumstances will any blame or legal responsibility be held against
the publisher, or author, for any damages, reparation, or monetary loss due to
the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. All effort has been executed to present
accurate, up to date, reliable, complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical, or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is
the author responsible for any losses, direct or indirect, that are incurred as a
result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
Once the data has been processed, it is cleaned, which is done to prevent and
correct errors. After processing, the data may be full of gaps, incomplete,
or riddled with errors. Data cleaning is often done by record matching and by
making sure there is no duplicate information. Because such a large amount of
data is being processed, it is vital for the data to be cleaned.
The first technique that can be used for data cleaning is called data profiling.
This is done by looking at the data, and understanding the minimum and
maximum values as well as the types of data in each field. By understanding
these values, it becomes easier to identify data that contains quality issues as
well as those that have been misunderstood.
The second technique that can be used for data cleaning is to manipulate the
data by making small changes; for example, changing all of the letter O’s to
the number 0, or even just removing spaces within the information.
The next technique is simply ensuring that the data has been entered into the
proper field. For example, it is not unheard of for a zip code to be entered
into the wrong field when the data is sorted by a computer. It is also
important to look at small details, such as ensuring that a name like Sam is
identified as a nickname for a male named Samuel. Checking for
small mistakes made by the software is very important when it comes to
cleaning up the data.
Another mistake that needs attention during data cleaning is spelling. For
example, to a human, wherever and where ever look like they mean the same
thing, but to a computer these are completely different words, and this can
change the way the data is interpreted.
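
A minimal sketch of these cleaning steps, assuming the pandas library and a
small made-up customer table, might look like this:

import pandas as pd

# A made-up table with typical quality problems
df = pd.DataFrame({
    "name": ["Sam ", "  Anna", "JOHN"],
    "zip":  ["12345", "1234O", "99501"],   # a letter O typed instead of zero
})

# Profiling: inspect counts, ranges, and types to spot suspicious entries
print(df.describe(include="all"))

# Small corrective manipulations: strip stray spaces, fix the O/0 mix-up
df["name"] = df["name"].str.strip().str.title()
df["zip"] = df["zip"].str.replace("O", "0", regex=False)
print(df)
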
The next step is exploratory data analysis, which is simply looking at the data
sets and determining the main characteristics. The main function of
exploratory data analysis is to help businesses understand what can be
learned from the data beyond any hypothesis or theory used to first initiate
the data collection.
After the exploratory data analysis, models will be created to determine if by
taking a specific action, a specific outcome would result. For example, in the
beer and diapers example, the model would have been used to determine if
more beer sales would occur by moving the beer closer to the diapers.
A data product may also be used. This is simply a computer program that takes
the data and determines that if a customer buys product X, they might also be
interested in product Y, based on purchases made by other customers.
This is much like the “other customers also purchased” section on the Amazon
website. Simply by looking at what customers purchase, the program is able to
determine what other products they might be interested in purchasing, using a
select set of data.
After the data has been analyzed, it is then reported, which can lead to
feedback and further analysis. During this step, a business has to determine
how they will report the results, whether it be in charts, graphs, or other
forms. This depends on what the information will be used for in the future.
Chapter 1 - The Fundamentals
of descriptive statistics

Data can come in many forms. It might be in the form of location data created
from cell phone pings, a listing of all the YouTube videos you have ever
watched, or all the books you have purchased on Amazon. Often, it is
desirable to integrate different types of data into a single coherent picture.
Data might be used in real time or could be analyzed later to find hidden
patterns.
In this chapter, we will explore the general types or classes of big data. As we
will see, big data can come in the form of structured or unstructured data.
Moreover, it can come from different sources. Understanding the types of big
data will be important for getting a full understanding of how big data is
processed and used.

Structured Data
Structured data is the kind of data you would expect to find in a database. It
can include stored items such as dates, names, account numbers, and so forth.
Data scientists can often access structured data using SQL. Large amounts of
structured data have been collected over decades.
Structured data can be human-generated, such as people entering payment
information when ordering a product, or it could be data entered manually by
people working at a company. If you apply for a loan and fill out an online
form, this is human-generated data, which is also structured data. This data
would include an entry that could be put in a database with name, social
security number, address, place of employment, and so on.
In today’s world, structured data is also computer-generated without the
involvement of any people. When data is generated by computer systems, it
might be of a different character than that described above, but it can still be
structured data. For example, if your cell phone company were tracking you,
it could create data points containing your GPS coordinates, together with
the date and time. Additional information, like your name or the customer
identifier used by the cell phone company, could also be included.
Other structured data can include tracking websites. As you are using your
computer, your activity could be tracked, and the URL, date, and time could
be recorded and stored as structured data.
Traditionally, structured data has been stored in relational databases and
accessed using a computer language paired with SQL. However, these tools
are in the midst of an evolving process as they adapt to the world of big data.
The reason things are changing is that many types of data, drawn from
different sources, are finding their way together into the same bits of
structured data.
For those who have little familiarity with relational databases, you can think
of an entry in a database having different fields. We can stick to the example
of an application for a loan as an example. It will have first and last name
fields with pre-determined character lengths. The first name field might be
ten characters and the last name field might be twenty characters. We are just
providing these values as examples; whoever designs the database will make
them long enough to be able to record data from most names.
When collecting information for a financial application, date of birth and
social security number will be collected. These will be given specific formats
in the database, with a date field and a character field that is eleven characters
wide to collect the social security number.
We could go on describing all the fields, but I think you get the point of how
the data is structured. With structured data, specific pieces of information
collected, and the formats of the information, are pre-defined. Each data point
collected is called a field, and every element in the database will have the
same fields, even if the person neglects to fill out some of the data.
Batch processing of structured data can be managed using Hadoop.

Unstructured Data
A lot of big data is classified as unstructured data. This encompasses a wide
variety of data that comes from many sources. One example of unstructured
data is spam email. Machine learning systems have been developed to
analyze email and estimate whether it is spam. The data in this case is the text
included in the message, the subject line, and possibly the email address and
sending information for the message. While there are certain common
phrases used in spam emails, someone can type an email with any text they
please, so there is no structure at all to the data. Think about this in terms of a
database. As we mentioned above, a database has fields that are specific data
types and sizes, and structured data will include specific items collected with
the data.
Another example of unstructured data could be text messages. They are of
varied length and may contain different kinds of information. Not only could
a person enter in numerical or alphabetic/textual information, but images,
emojis, and even videos can be included. Any randomly selected text
message may have one or all these elements or some value in between. There
is no specific structure to the data, unlike an entry in a relational database.
Similar to text messages, posting on social media sites is unstructured data.
One person might type a plain text message, while someone else might type a
text message and include an image. Someone else might include many emojis
in their message, and another posting might include a video.
Often, unstructured data is analyzed to extract structured data. This can be
done with text messages or postings on social media sites to glean
information about people’s behaviors.
There are many kinds of unstructured data. For example, photographs and
surveillance data—which includes reams of video—are examples of
unstructured data.

Semi-Structured Data
Data can also be classified as semi-structured. This is data that can have
structured and unstructured elements together.

Storing Data
As mentioned earlier, structured data is stored in relational databases. In the
1990s, this was the primary storage mechanism of big data, before large
amounts of unstructured data began to be collected.
Unstructured data is not necessarily amenable for storage in a database and is
often stored in a graph database. Companies use content management
systems, known in the business as CMSs to store unstructured data. Although
CMSs are not formally structured like a relational database, they can be
searched in real time.
Chapter 2 - Predictive Analytics Techniques (I.E.,
Regression Techniques and Machine Learning
Techniques)

Predictive analytics is currently being used by many different companies all
over the world to turn collected data into valuable information. Because the
predictive analytic techniques are able to learn from the past to predict the
future, more and more companies are finding them useful when it comes to
their business.
There are countless predictive analytics techniques in daily use by different
businesses, many of them created specifically to support one particular
business. However, several techniques are generic and can be used by almost
all companies.
Linear Regression - To understand linear regression, we must first start with
linear models. Let's begin with a mathematical model, which is simply a
mathematical expression used to describe the relationship between
measurements. For example, if the price of your product is $10 and you want
to find the price for several of your products, you would write a
mathematical model like this: y = 10*x, with x being the number of products.
A linear model is simply a mathematical model that contains an independent
variable and a dependent variable, such as our model y = 10*x. Whatever x
equals, y gives the total price for the products sold.
What if you have to deal with shipping and handling? Let's say, for example,
that your pricing includes a shipping and handling fee of $20. You would now
create a pricing model of y = 20 + (10*x).
Linear models are what make linear regression easy to interpret. Linear
regression is used when you do not know the parameters of your linear model;
the model is determined through analysis, and linear regression is the tool
that completes it.
Linear regression scans the data you do have and uses that information to
compute the parameters that fit the data best. Let's say, for example, you
want to predict the number of calves that will be born on small dairy farms.
Each farm uses the same basic practices and is located in the same general
area; the only difference is the size of the farm. This leads you to believe
that the size of the dairy farm is the most important piece of data when it
comes to predicting the number of calves that will be born. The linear model
you create using the size of the farms gives you what is known as the linear
regression line.
This line will show you the number of calves you can expect to be born just
from the information you entered. Of course, it does not take other
information into account, such as the total number of cows on each farm and
other data that could be collected.
This is a very basic way of using predictive analytics, and it is usually the
first technique tried. However, because not all of the available data is
used, it can easily lead to incorrect assumptions.
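
A minimal sketch of fitting such a regression line in Python, with made-up
farm sizes and calf counts and assuming the scikit-learn library, would look
like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: farm size (hectares) and calves born last year
size = np.array([[40], [55], [70], [90], [120]])
calves = np.array([18, 25, 31, 40, 54])

model = LinearRegression().fit(size, calves)
print(model.intercept_, model.coef_)    # the fitted parameters of the line
print(model.predict([[100]]))           # expected calves for a 100-hectare farm
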
Decision trees - This is a very popular data mining technique, liked by
analysts because of the user-friendly results it produces. Let's look at the
example of a credit card company. Credit card companies have two types of
customers: those who are profitable and those who are not.
Customers who always pay their bills on time and in full, or who do not use
their credit card at all, are not profitable customers. However, customers
who carry a balance on their cards or do not pay the balance in full each
month are profitable for the credit card company. Customers who do not make
their payments on time are also profitable for the company.
For this example, we will assume that this credit card company has a total of
10 customers, 5 who are profitable and 5 who are not profitable. This is what
is described as the company’s customer base. However, outside of this
customer base is a huge number of potential customers.
The credit card company does not know if these potential customers will be
profitable customers, but because there is a limited marketing budget,
meaning that the company can only market to a limited number of these
potential customers, the company wants to make sure that the budget is used
in a manner that will ensure the company is able to attract the maximum
amount of profitable customers possible.
In other words, the company wants to ensure that they are only marketing to
those customers who are likely to be profitable if they become the credit card
company’s customers. This leads to the problem of the company being able
to predict if a person is going to be profitable or not. This is where analytics
come in.
The credit card company is going to have information available to them about
their potential customers, such as age, gender, the number of credit cards they
already own, and their marital status. The credit card company needs to find
out if any of this data can help them predict if a potential customer is going to
be profitable or not.
Because the same information is available about the company's existing
customers, a decision tree can be built from that customer base. At the top
of the tree, the company will find that 50% of its customers are profitable
and 50% are not. The tree then branches into two segments on one of the
variables; we will use age for this example, putting those who are 35 and
older in one segment and those who are 34 and under in the other. The company
will then examine the profitability rate of the two segments.
Now let's say that six of the ten customers are 35 or older, and that four of
those six are profitable, giving that segment a profitability rate of about
67%. Comparing this with the overall profitability rate of 50%, the company
can determine that people who are 35 and older tend to be more profitable for
it than the younger group.
This allows the company to understand that if they are marketing their credit
card only to people who are 35 and older, they will end up with a more
profitable group of customers.
This segment will then be segmented into smaller groups, which will show an
even higher profitability using the information provided such as marital status
and sex.
Following this technique will allow the company to determine exactly what
group of people they need to market to in order to increase the profitability of
their customer base.
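
A minimal sketch of that first split, using made-up numbers consistent with
the example above and assuming the pandas library, might look like this:

import pandas as pd

# Made-up customer base: age and whether the customer is profitable (1/0)
customers = pd.DataFrame({
    "age":        [22, 27, 30, 33, 36, 40, 44, 50, 57, 63],
    "profitable": [0,  1,  0,  0,  1,  1,  0,  1,  0,  1],
})

# The first split of the decision tree: segment on age 35 and compare rates
customers["segment"] = customers["age"].apply(lambda a: "35+" if a >= 35 else "34-")
print(customers.groupby("segment")["profitable"].mean())
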
Machine Learning - This process is very similar to data mining, in that, like
data mining, it searches through data looking for patterns. However, instead
of searching for data that is going to help a person or a company understand
trends, machine learning searches for data that will help to improve the
program itself.
Think about Facebook's news feed. This program looks for patterns, such as
one person interacting often with another, liking their posts, or writing on
the other person's wall. Using machine learning, Facebook assumes that the
two people are close, and more of that friend's posts will appear in the
person's news feed.
In other words, machine learning is used in predictive analytics just as the
other techniques are: by extracting large amounts of data, assessing risk,
and predicting a customer's behavior by looking at how they have behaved in
the past.
There are, of course, many other techniques that can be used in predictive
analytics, each with its own advantages and disadvantages. It may take a
company a few attempts to find the technique that works best for it, but the
payoff is definitely worth the work.
Chapter 3 - Decision Tree and how to Use them
We now know a bit about how supervised machine learning is going to work.
It is time to look at some specific examples of how these learning algorithms
are going to work. And the first one we will look at is known as a decision
tree learning algorithm.
With this one, you will find a lot of efficiency when it comes to data,
especially if you are looking at a few different options and want to figure
out which one is the right decision for your business or your project. When
several options are presented with a decision tree, you get the benefit of
seeing the possibilities and the outcome that each one would produce for you.
This is often one of the most efficient and accurate ways to make sure that a
decision fits your needs.
There are different situations in which you would want to work with a
decision tree. It can be used with a continuous random variable or with
categorical variables, but you will usually use this kind of learning
algorithm for classification problems.
To build a good decision tree, you need to split the whole domain set into at
least two, and often three or more, subsets of similar data. These are then
sorted using the independent variables, because those variables are what
distinguish the different subsets in front of you.
This brings up the question of how you make all of this work. Say we have a
group of 60 people. Each person in the group has three independent variables:
gender, height, and class. Looking at this group, you know from the start
that 30 of these students like to play soccer in their free time. You can use
a decision tree to figure out which people in the group play soccer and which
do not.
To do this, you take the decision tree learning algorithm and let it divide
the group using the variables of height, class, and gender. The hope is that
when the whole thing is done, you end up with homogeneous subsets of people.
Of course, there are a few other learning algorithms that you can use for
this, and they may work well alongside the decision tree to split up the data
you are working with. The result is at least two subsets whose outcomes are
fairly homogeneous. It is possible to have more, but since we only want to
know whether someone plays soccer or not, two groups are enough.
Decision trees are a good option for programmers because they make it easy to
split up all of your data and then make sound decisions based on what shows
up in that data. They help you make better decisions for your business
because the information is laid out in front of you, rather than leaving you
to guess.
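
A minimal sketch of the soccer example, shrunk to a handful of made-up rows
and assuming pandas and scikit-learn, could look like this:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A tiny made-up version of the 60-person group
people = pd.DataFrame({
    "gender": [0, 0, 1, 1, 0, 1],     # 0 = female, 1 = male (encoded)
    "height": [150, 160, 170, 180, 155, 175],
    "class":  [9, 10, 9, 10, 10, 9],
    "plays_soccer": [0, 0, 1, 1, 0, 1],
})

X = people[["gender", "height", "class"]]
y = people["plays_soccer"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Predict for a new person described by the same three variables
new_person = pd.DataFrame([[1, 165, 10]], columns=X.columns)
print(tree.predict(new_person))
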

Random Forests
The next type of learning algorithm that you are able to work with is the
random forest. There are a lot of times when the decision tree is going to
work out well for you, but there are times when you may want to make this a
bit different, and the random forest is going to be the right option for you.
One time when you would want to use a random forest is when you have a task
that explores your data, such as dealing with missing values in the data set
or handling outliers in it.
That is one of the situations where you would choose the random forest rather
than the decision tree, so it is worth knowing when to use each of these
learning algorithms. Some key points about how a random forest is built
include:

• When the training sets are built, the objects inside each set are drawn
randomly with replacement, so a tree's sample can repeat some objects and
leave others out.
• If there are M input variables, a number m < M is specified at the
beginning and held constant. The reason this matters is that, at each split,
m variables are picked at random out of the M available.

• The goal at each split of a random tree is to find the best split over
those m randomly chosen variables.

• As the trees grow, each one is grown as big as it possibly can be; these
random trees are not pruned.
The forest created from these random trees is good at predicting outcomes
because it takes the prediction from each of the trees you create and then
selects the average for regression or the consensus vote for classification.
These random forests are going to be the tool that you want to use many
times with the various parts of data science, and this makes them very
advantageous compared to the other options. First, these algorithms are able
to handle any kind of problem that you are focusing on, both the regression
and classification problems. Most of the other learning algorithms that you
will encounter in this guidebook are only able to handle one type of problem
rather than all of them.
Another benefit of these random forests is that they are going to help you
handle large amounts of data. If your business has a lot of different points
that you want to go through and organize, then the random forest is one of the
algorithms that you need to at least consider.
There is a limitation to random forests, though, which is why you will not
use them for every problem. For example, they can work on regression problems
as we discussed, but they cannot make predictions beyond the range of the
target values in your training data. You will still get predictions, of
course, but they are limited: they stop at the ranges you provided, which
lowers the accuracy outside that range.
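
As a small sketch assuming scikit-learn, a random forest can be built with
just a few lines; the synthetic data here simply stands in for a real
business data set.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a larger business data set
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 100 trees, each grown on a random bootstrap sample, with a random
# subset of the features considered at every split (the m < M idea above)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:3]))
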
KNN Algorithm
Next on the list of learning algorithms that we are going to take a look at is
the K-nearest neighbors, or KNN, algorithm. This is one that is used a lot in
supervised machine learning, so it is worth our time to take a look at it here.
When you work with the KNN algorithm, you use it to search through a lot of
data. The goal is to find the k most similar examples to any data instance
you want a prediction for. Once the data is organized properly, the KNN
algorithm looks through the data set, summarizes the nearest neighbors it
finds, and uses them to make the prediction you need.
A lot of businesses will use this kind of model in order to help them become
more competitive with the kind of learning that they are able to do in the
industry. This is going to work because there will be a few elements in this
model that will compete against each other. The elements that end up
winning in here are going to be the way that you are the most successful and
you get the prediction that will work the best for you.
Compared with the other learning algorithms we have covered, this one is a
bit different. In fact, some programmers describe it as a lazy learner
because it does not really build a model until you ask it for a new
prediction. That is useful for projects where you want the model to stay
current with the latest data, or where you want more say over what goes into
it, but in other situations it is not all that helpful.
There are a lot of benefits of working with the KNN learning algorithm. For
example, when you choose to use this kind of algorithm, you can learn how
to cut out the noise that sometimes shows up inside the set of data. The
reason that this works is that it is going to work solely with the method of
competition to help sort through all of the data in the hopes of finding the
stuff that is the most desirable. This algorithm is useful because it can take in
a lot of data, even larger amounts, at the same time which can be useful in a
lot of different situations.
However, there are a few trade-offs to consider with this algorithm. The
biggest issue is its high computational cost compared with some of the other
learning algorithms, because KNN looks through all of the points before it
returns a prediction. That can take a lot of time and resources overall, and
it may not be the one you want to use.
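
A minimal sketch assuming scikit-learn shows the idea of voting among the k
nearest neighbours; the synthetic data is only an illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# k = 5: each prediction is a vote among the 5 most similar training examples
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))
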

Regression Algorithms
Next on the list is the regression algorithm. You use it to investigate the
relationship between the dependent variable and the predictor variables you
choose. This is the method a programmer reaches for whenever there appears to
be a causal relationship between the variables, for example in forecasting
and time-series modeling.
You will want to work with these regression algorithms any time that you
want to take all of the different points in your set of data and you want it to fit
onto a line or a curve as closely as possible. This helps you to really see if
there are some factors that are common between these data points so that you
can learn about the data and maybe make some predictions as well.
Many programmers and companies are going to use this kind of regression
algorithm in order to help them make great predictions that then help the
business to grow, along with their profits. You will be able to use it in order
to figure out a good estimation of the growth in sales that the company is
looking for, while still being able to base it on how the conditions of the
economy in the market are doing now and how they will do in the future.
The neat thing about these kinds of learning algorithms is that you are able to
place in any kind of information that seems to be pertinent for your needs.
You are able to add in some information about the economy, both how it has
acted in the present and in the past so that this learning algorithm is able to
figure out what may happen to your business in the future. The information
that you add to this needs to be up to date and easy to read through, or this
algorithm could run into some issues.
Let’s take a look at an example of how this can work. If you go with the
regression algorithm and find that your company is growing near or at the
same rate that other industries have been doing in this kind of economy, then
it is possible to take that new information and use it to make some predictions
about how your company will do in the future based on whether the economy
goes up or down or even stays the same.
There are going to be more than one option of learning algorithms that you
are able to work with when we explore these regression algorithms. And you
will have to take a look at some of the benefits and the differences between
them all to figure out which one is the right for you. There are a lot of options
when it comes to picking out an algorithm that you would like to use, but
some of the most common of these will include:
1. Stepwise regression
2. Logistic regression
3. Linear regression
4. Ridge regression
5. Polynomial regression
Any time that you decide to work with one of these learning algorithms, you
are going to be able to see quickly whether or not there is a relationship
between your dependent and independent variables, as well as what that
relationship is all about. This kind of algorithm is going to be there because it
shows the company the impact that they have to deal with if they try to add or
change the variables in the data. This allows for some experimentation so that
you can see what changes are going to work the best for you and which ones
don’t.
There are a few shortcomings to work around with regression algorithms. The
first is that, as the name suggests, they are intended for regression
problems rather than classification problems. A regression model can also
spend too much effort overfitting the data you have, which makes the process
tedious, so it is best to guard against that.

Naïve Bayes
And finally, we are going to move on to the other supervised machine
learning method that we need to look at. This one is known as the Naïve
Bayes method, and it is going to be really useful in a lot of the different kinds
of programs that you want to create, especially if you are looking to
showcase your model to others, even those who don’t understand how all of
this is supposed to work.
To get a better understanding of how this learning algorithm works, we need
to use our imaginations a bit. Imagine that you are working on a program or
problem that needs classification. You want to come up with a new hypothesis
for it, and then design new features and rules based on how important the
variables in the data are going to be.
Once all of the information is sorted out and you are ready to work on the
model, the stakeholders enter the picture. They want to know what is going on
with the model and what kinds of predictions and results you expect to get
from it. This raises two questions: how do you show them what you are working
on before the work is even done, and how do you do it in a way that is easy
to understand?
The good thing to consider with this one is that the Naïve Bayes algorithm is
going to be able to help you, even in the earliest stages of your model, so that
you can organize everything and show others what is going on. The learning
algorithm is going to be what you will need to use in order to do a
demonstration to show off your model, even when it is still found in one of
the earlier stages of development.
This may seem a bit confusing right now, so let's use an example with apples.
Imagine you go to the store and grab an apple that looks pretty average.
Holding it, you can state some of the features that distinguish an apple from
other fruits: it is about three inches round, it is red, and it has a stem.
Yes, some of these features appear in other types of fruit, but the fact that
they all show up in the same object at the same time means you are holding an
apple rather than something else. This is a simple way of thinking about how
an apple differs from other fruit, and it is a good picture of what happens
when you use the Naïve Bayes algorithm: it treats the features as independent
pieces of evidence and combines them to pick the most likely class.
A programmer is likely to reach for the Naïve Bayes model when they want
something that is easy to get started with, and when they have a large data
set that they want to simplify a bit. One of its biggest advantages is that
it is simple to use, and even when a more sophisticated method is possible,
it is often the better option to start with.
As you learn more about the Naïve Bayes algorithm, you will start to see
more and more reasons in order to work with it. This kind of model is going
to be an easy one to use and it is the most effective when it comes to
predicting the class of your test data so that it becomes one of the best
choices for anyone who would like to keep the process simple or those who
are new to working with the machine learning process for the first time. The
neat thing, though, is that even though this is a simple algorithm, it can
still be used in much the same way as more advanced algorithms.
Of course, just as with the other supervised learning algorithms you might
like to work with, there are some negatives that show up along the way.
First, when you have categorical variables and the test data contains a
category value that never appeared in the training data, the model assigns it
zero probability and cannot make a sensible prediction for it.
If you still want to use the Naïve Bayes algorithm despite these issues,
there are methods that address the problem; Laplace estimation, which adds a
small count to every category, is a good example. But the more corrections
you add, the more complicated things become, which rather defeats the purpose
of using this algorithm. Keeping it simple and knowing when it is appropriate
will help you get the results you want.
This is a good method to use, but realize that you will not pull it out all of the
time. If you have a lot of information that you would like to work on, and you
need to be able to take that information and show it off in a manner that is
simple and easy to understand, then this learning algorithm is going to be a
good option for you.
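
A minimal sketch assuming scikit-learn, with the classic iris flower data
standing in for the apple example, shows how little code a Naïve Bayes model
needs.

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Several simple features that, taken together, identify the class
X, y = load_iris(return_X_y=True)

model = GaussianNB().fit(X, y)
print(model.predict(X[:5]))         # predicted classes
print(model.predict_proba(X[:1]))   # class probabilities, Bayes-style
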
These are just a few of the different options that you are able to work with
when it comes to working with supervised machine learning. This is one of
the easiest types of machine learning that you are able to work with, and it is
going to prove to be really useful overall. Try out a few of these learning
algorithms and see how supervised machine learning works before we take
some time to move on to the other two types as well.
Chapter 4 - Measures of central tendency,
asymmetry, and variability

The best way to determine the future is to look at the past and that is exactly
what predictive analytics does. Predictive analytics works in much the same
way as a car insurance company does. You see, the company will look at a
set of facts, or data set, often, your age, gender, driving record, and the type
of car you drive.
By looking at this data set, they can use it to predict your chances of getting
into a car accident in the future. Therefore, they are able to determine if they
are willing to insure you and what rate you will pay for the insurance.
Predictive analysis uses a specific data set to determine if a pattern can be
found. If a pattern is found, this pattern is used to predict future trends. This
is much different than other techniques used to analyze the data sets because,
unlike predictive analytics, other techniques provide a business with
information about what has happened in the past.
Of course, knowing what has happened in the past when it comes to
understanding data is very important, but the majority of business people
would agree that what is more important is understanding what is going to
happen in the future.

What Types of Predictive Analytics Are There?


There are two basic types of predictive analytics. The first is the type that is
based off of variables, for example, sales on any given day, the satisfaction of
customers, and losses and gains. The second type of predictive analytics is
based on binary outcomes, for example, we are about to lose this customer,
why a purchase was or was not made, and whether or not a transaction was
fraudulent.
Companies that use predictive analytics have been found to use the process
mostly for marketing and increasing sales; however, predictive analytics can
be used in any of the major processes of a business.
Smooth Forecast Model
In this model, a business will use the variables collected, whether it be a large
or small number of variables, believed to impact the specific event that is
being looked at. For example, if your specific event is the increase in beer
sales, you would look at the variables you believe are impacting these sales.
Of course, you can use this to predict much more complex scenarios such as
how well a product will sell based on the sales of similar products, if sales
will increase or decrease over time, and even customer satisfaction. This is
the simple way to use predictive analytics; it is easy to understand and the
data can easily be manipulated.

Scoring Forecast Model


This model for predictive analytics is a bit more complex than the smooth
forecast model. This is a model in which data is collected and is translated
into a 1 or a 0, which is why this is often referred to as a binary model.
Let’s say, for example, you are trying to determine the likelihood of a
specific set of customers switching to your product instead of the product
they currently use.
You will first gather the information that you have about these customers,
like age, where they are from, how many people live in the household, and so
forth. By using the scoring forecast model, you will be able to determine the
likelihood of future customers switching to your product by examining the
customers who have already switched. After this is done, you will be able to
give them a score, say from 1 to 10, determining the likelihood of them
switching to your product from a competitor’s product.
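
A minimal sketch of such a scoring model, assuming scikit-learn and using
synthetic customer attributes in place of real ones, turns a predicted
probability of switching into a score from 1 to 10.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer attributes and whether they switched (1/0)
X, y = make_classification(n_samples=500, n_features=4, random_state=2)

model = LogisticRegression().fit(X, y)

# Convert the predicted probability of switching into a 1-to-10 score
prob = model.predict_proba(X[:5])[:, 1]
scores = np.clip(np.ceil(prob * 10), 1, 10).astype(int)
print(scores)
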
Predictive analytics does not have to be complicated and it can be one of the
most important benefits of data collection. There are also two other systems
for predictive analytics, the SQL and the RDBMS systems.
These are newer systems and although they are good programs, they are not
perfect, only providing companies with about 80% accuracy. Of course, this
is better than just guessing but there are better ways of using predictive
analytics right now. However, when it comes to dealing with big data, these
may have their advantages.
Natural Language Processing
The process of NLP, also known as Sentiment Analysis, is used to pull data
from unstructured data such as social media. This application is used to help
understand how customers feel about specific products or services based on
the words and phrases pulled from social media comments, for example.
Comments such as “Not worth the price”, “Great service”, “Waste of money”, or
“It broke right after we got it” are all the types of comments that allow a
company to understand how its customers feel about its products. They also
help the company understand what it needs to do to ensure its customers'
happiness, and based on these comments, it can be predicted how well a
product will do.
This type of processing can also be used for customer service calls, even as
they are happening, based on the key words the company is looking for.
Of course, natural language processing will not give you the same accuracy as
the other models, because you are looking for specific key words, which could
make it seem as if there are many more unsatisfied customers than satisfied
ones. It will, however, give you some insight into the direction you want to
take your company. It can also help you find ways to up-sell other products.
For example, if you were to look for the phrase “I wish it came with”, you
would begin to understand what your customers are looking for and be able to
offer it to them.
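
A toy keyword-matching pass over such comments, written in plain Python as an
illustration only (real NLP systems are far more sophisticated), might look
like this:

comments = [
    "Not worth the price",
    "Great service",
    "Waste of money",
    "It broke right after we got it",
    "I wish it came with a charger",
]

negative_phrases = ["not worth", "waste of money", "broke"]
upsell_phrase = "i wish it came with"

for text in comments:
    lowered = text.lower()
    if any(phrase in lowered for phrase in negative_phrases):
        print("negative feedback:", text)
    if upsell_phrase in lowered:
        print("up-sell hint:", text)
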
Each of these techniques, if used properly, can improve the finances of any
company, and help them understand not only what the future may hold based
on the past but also what their customers are really looking for. At this point,
if a company has not gotten involved in predictive analytics, they are really
missing out on important information that they really need.
Chapter 5 - Distributions
In this chapter, we will investigate the field of data science and then look at
current trends in big data and where they might be heading. No matter what
you think about big data, one thing we know for sure is that it is not going
away any time soon.

What Is Data Science?


As the importance of big data has grown, a new field has emerged for
scientists/engineers who are specialists in working with big data. This field is
known as data science. It is an interdisciplinary field, combining statistics and
probability with computer science and business acumen. The field is in high
demand, and it is expected that in the coming years that demand will only
increase.
In short, the role of a data scientist is to interpret and choose things of value
from big data. They will use many tools to assist them in this process, and
computing power will be an important part of that. While much attention is
focused on machine learning and artificial intelligence, in the end, the data
scientist and the interpretive analysis of the human mind play a central role in
the value of big data.
Since computers are at the heart of big data and the analytics used on it, some
background in computer science and/or computer engineering will be
fundamental to the field of data science. Programming skills are necessary,
but that does not mean a practitioner of data science must be an advanced
computer scientist or get a PhD in the field. In fact, compared to that level,
the kind of programming most data scientists do is a bit rudimentary.
However, you can go further in certain areas if you have more expertise in
hot topics like machine learning.
Statistics and probability perform a central function in the field of data
science. That is the nature of this game. When you have large amounts of
data, the analysis of the data will naturally be statistical. People won’t be
doing this by pencil and paper, but a data scientist needs a thorough
understanding of statistics and the application of many mathematical models
such as linear regression. This will help the data scientist develop the right
models used to look for patterns in data and build models used for machine
learning.
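As a small illustration of the kind of mathematical model mentioned above, the following Python sketch fits a simple linear regression with NumPy. The advertising-spend and sales numbers are invented for illustration only.
# A minimal linear regression sketch using NumPy; the figures are invented.
import numpy as np

ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)   # e.g. thousands of dollars
sales = np.array([25, 41, 58, 79, 95], dtype=float)      # e.g. units sold

slope, intercept = np.polyfit(ad_spend, sales, deg=1)    # fit sales = slope * spend + intercept
predicted = slope * 60 + intercept                       # predict sales for a new spend level

print("fitted slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("predicted sales at spend = 60:", round(predicted, 1))
The same pattern-finding idea scales up: with far more rows and variables, libraries do the fitting rather than pencil and paper, but the underlying statistics are the same.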
It is also important for data scientists to have some business acumen. Again,
this is an interdisciplinary field, so you don’t need an MBA or anything like
that. Some understanding of business operations, marketing, logistics, and
other issues make a data scientist more useful to large corporations, like
manufacturers or airlines that may be looking to use data scientists.
To see why data science and big data are necessary, visualize having direct
access to the kind of data we are talking about. It could be listed in a comma-
separated file or a spreadsheet. If you opened such a file, it would look
almost like gibberish. You might see columns of names and numbers, and
you could scroll up and down and left and right trying to make sense of it,
which would be virtually impossible. This is why computing power must be
applied to big data. Going through reams of data to find patterns contained in
it is a task perfectly suited for computers.
You cannot just feed a computer data and expect answers to pop out like
magic. The data scientist uses certain tools and mathematical models to tease
out the information. The initial judgment of the data scientist will get things
started, as he will choose the best method to analyze the data. Many data
scientists write programs in Python, a simple interpretive language, to
analyze big data. They can also use a statistical package called R. If the data
is stored in a formal database, then SQL, a language used to sort and pull data
from large databases, can be used as part of the process.
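As a rough sketch of how these tools fit together, the following Python snippet builds a tiny in-memory SQLite database, uses SQL to sort and aggregate the rows, and hands the result to pandas. The table name and figures are invented for illustration.
# A minimal sketch combining SQL and Python; the table and values are invented.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")                 # an in-memory database for the example
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 150.0), ("South", 95.0)],
)

query = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY total_sales DESC
"""
df = pd.read_sql_query(query, conn)                # pull the aggregated data into pandas
conn.close()
print(df)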
Data scientists may do testing by using different machine learning algorithms
and methods to get the most out of the data. It is important to avoid being led
astray. Unfortunately, that possibility always exists, and it is always possible
to draw the wrong conclusions from a set of data.
Later, we will explore the role of machine learning in data science. It turns
out that machine learning is playing a more central role in data science and
the evaluation of big data. This is a different type of approach whereby,
rather than giving the computer specific instructions to execute, you let the
computer learn on its own by presenting it with large data sets. For working
with big data, the computer can be trained on small subsets of data. Once the
data scientist is satisfied with the level of learning the computer has achieved,
it can then be unleashed on the big data, where it can search for the patterns
and relationships of interest for the application being investigated.
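A minimal sketch of that workflow, assuming scikit-learn and randomly generated stand-in data, might look like this:
# Train on a small sample first, then apply the model to the full data set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))                  # stand-in for "big data" features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # stand-in for the labels

# Train on a small subset while iterating on the approach.
X_small, _, y_small, _ = train_test_split(X, y, train_size=5_000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# Once satisfied with the level of learning, apply the model to everything.
predictions = model.predict(X)
print("accuracy on the full data:", (predictions == y).mean())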
Past Trends and Their Continuation
In the past, the biggest factors influencing the growth of big data have
included increasing the processing power of computer chips, increasing
storage capacity, both of individual computers and cloud computing, cheaper
memory with a much larger capacity, and increasing broadband speeds.
While the famous “Moore’s Law” is often cited, with processor speed and
capacity roughly doubling every two years, it is clear that, at some point,
this progress will slow down. In fact, the evidence indicates that, as far as
increasing speed and capacity of computer chips is concerned, there is
already a slowdown. Over the coming years, we can continue to expect to see
increased performance and capacity, but the progress seen over the past two
decades is likely to start slowing down.
One wildcard in the development of big data is the possible invention of devices
governed by quantum rather than classical computation. Quantum
computation could make its presence felt in many ways if it becomes a reality.
One area it will likely impact is the development of totally secure
communications. It might also be able to produce computers that can process
data at much higher and faster rates as compared to conventional or classical
computers. It is important to understand that quantum computation would be
a revolutionary and not evolutionary change. However, many practical
difficulties might prevent quantum computation from ever being anything
more than a theory. At this time, it remains to be seen whether it will become
a practical reality.
Chapter 6 - Confidence Intervals: Advanced Topics
Data Integration refers to the process of combining data from independent
sources, which are stored using different tools, in order to offer a single,
unified view of the data. Integrating data is crucial when merging two
businesses or consolidating systems inside one company to get a single
perspective of the company's data assets.
Probably the most common example of data integration is setting up the data
warehouse of the business. A data warehouse enables a business to conduct
analyses on the data inside the warehouse. This may not be doable on the
individual source systems, because they may not hold the required data,
even if their data sets are named similarly.
Moreover, if you want to keep data integration solutions completely aligned
with your business goals, then you have to always be mindful of the
particular kinds of business value that could result from the effective use of
data integration tools and strategies.
In this chapter, we will discuss the top ways that data integration can bring
value to your business. I have included several actual cases to show the
various types of value that data integration can provide. Hopefully, this can
help you explain to your partners or your boss the value of data integration. It
could also serve as a guide on how to plan and design suitable data
integration strategies to advance your business.

Business Practices Values


Let us begin with a more generalized perspective of data integration. Most
valuable data-driven practices in business often rely on one or several forms
of data integration. As a matter of fact, there are business processes that
cannot function without data integration. This is particularly true for data
warehousing and business intelligence.
Remember, effective decisions may rely on calculated, aggregated, and time-
bounded data sets within a data warehouse, which can never exist
without effective data integration. Success in sales, for example, usually
relies on a complete view of customer information that is aggregated
using data integration tools and techniques.
Moreover, integrating various businesses as well as their processes using
shared data should be backed up by a data integration solution. This is helpful
whether the businesses are divisions inside one enterprise or different
enterprises that can share data from one business to another. Meanwhile,
business processes like just-in-time inventory or operational business
intelligence should be backed up by an efficient data integration solution, which
can be used in real time or with few delays. As you try to advance your
business, your pace will also accelerate. Data integration can speed up the
process of gathering and integrating time-sensitive data at speeds that were not
even possible a decade ago.
Data Integration and related business processes such as data management and
data quality assurance can add value to business data. As a result, the value of
business processes will also increase.

Visibility of Data Integration


Identifying the business value of data integration when you see it can be more
difficult than you might expect because this data analytics process is usually
separated a level or two from the systems that your business might be using.
But in general, the data integration value is usually visible as valuable data.
Below are common examples of data integration in this value field:
● A business executive who accesses a single view of customer information,
which was built with data integration through data sync.
● A business intelligence user entering a query into a data warehouse, where
the system responds with complete data models and metadata that were
set up using data integration.
● Several business supervisors accessing information on a computer that is
updated in real time, or as needed, through a data integration solution.
● A product supervisor who accesses a list of available supplies from a supplier
within a data set, which the supplier established through data integration
and delivered across business boundaries through business-to-business
exchange.
Even if data is accessible in a Graphical User Interface (GUI) or a report,
business users may overlook that data integration provided the information.
Many business executives fail to realize that data integration is responsible
for collecting, preparing, and delivering most of the data that they take for
granted. Nowadays, data integration is a fast-changing discipline, which
supplies data for several types of applications, whether operational or
analytical.

Collaborative Business Practices of Data Integration


To ensure that Data Integration offers the best type of business value, the
system must be aligned with the business goals that relate to the data.
Fortunately, several collaborative practices have emerged in recent
years, so data specialists can more easily streamline their work with a broad
range of colleagues.

Data Governance
Data Governance refers to the processes that focus on data privacy,
security, risk, and compliance. However, many businesses have expanded
Data Governance to also cover quality, standards, architecture, and many
other issues on data. The team working on Data Governance could help data
scientists to get a single view of business goals that are relevant to data and
align their work properly. Meanwhile, the change management process of
Data Integration can enable Data Integration specialists to think of possible
solutions to increase data value.

Data Stewardship
Data Stewardship is designed for managing data quality by identifying and
prioritizing quality work according to the needs of the business and
constraints such as technological capacity and budget. The person who is in
charge of the data, also known as the data steward, should work together with
business and technical people. Through the years, data integration specialists
have incorporated stewardship into their array of strategies for better credibility
in the alignment and prioritization of data integration work.

Collaborative Data Integration


Collaborative Data Integration is a loose strategy for coordinating the tasks of
data integration teams, which include data specialists. In general,
collaborative data integration uses applications and practices like code
review, team hierarchy, project management, and software versioning.

Unified Data Management


Unified Data Management is a recent business practice, which aims to
coordinate tasks across several data management disciplines described above.
UDM also enables collaboration between business management and data
management to ensure that most data management tasks add business value
by supporting business management goals.

What Data Integration Can Do for Your Business


The outcomes of data integration are ubiquitous in the business world, where
they enable commercial activities. However, we often don’t trace these
activities back to data integration or recognize it as a crucial process in today’s
business. If your business needs to confirm the value of data integration (a
common requirement for sponsorship, investment, or approval), then you have
to educate your partners or your boss about the critical role that data
integration can play in your data-driven business processes.

Data Warehousing and Business Intelligence


As a support system for data warehousing, data integration can add value to
the business process. Through data integration, you can collect raw data from
different sources and combine it all to develop new data products. A data
warehouse will contain data and data sets that do not exist anywhere else in the
business.
Moreover, because of the requirements of business intelligence, data that
goes into the warehouse must be regularly reconfigured into
calculated, aggregated, and time-bounded data, organized into multidimensional
data sets. The warehouse cannot build these structures on its own; data
integration is what shifts the data into the necessary structures.
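As a small illustration of what “calculated, aggregated, and time-bounded” data can look like in practice, here is a pandas sketch that rolls raw transactions up into monthly totals per region. The records are invented for illustration.
# A minimal sketch of building an aggregated, time-bounded data set with pandas.
import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-17"]),
    "region": ["North", "South", "North", "South"],
    "amount": [120.0, 80.0, 150.0, 95.0],
})

# Aggregate raw transactions into monthly totals per region,
# the kind of calculated data set a warehouse might hold.
monthly = (
    transactions
    .groupby([transactions["date"].dt.to_period("M"), "region"])["amount"]
    .sum()
    .reset_index(name="monthly_total")
)
print(monthly)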
Data integration for business intelligence will allow high-value processes. A
data warehouse constructed through data integration allows decision making
at tactical, strategic, and operational layers. Data created through data
integration is crucial to Business Intelligence strategies such as dashboard
reporting, performance management, advanced analytics, and online
analytics. These data warehousing and business intelligence activities - also
enabled by data integration - can help with customer retention, increased
sales, more efficient business operations, better-guided sales and marketing
activities, strategic planning, and other valuable business outcomes.

Data Integration Could Add Value to Business Data


Many business owners think of data integration as a process of moving data. But
those who are trained in data science understand that it is not enough to just
move data around; there is a need to improve it. As a matter of fact, every
good data integration solution adds value along the way.
Data integration improves data during the process. Data quality strategies are
being built into data integration solutions. This is natural, because data
integration surfaces data quality problems that should be fixed as
well as areas for improvement. Data integration can also help in improving
metadata, data models, master data, and other attributes of data. Hence, the
data can come out complete, clean, and consistent.
Data integration can also help in building new databases that are valuable for
the business. Remember, the data contained in the data warehouse can never
be found anywhere else in the business. Similar to the value-adding process in
manufacturing, data integration can capture raw material and build it into
new data sets.
Therefore, data integration can convert data to make it more valuable for
more business processes. Aside from moving data, data integration can also
convert data so it is suitable for the target system. To put it simply, data
integration repurposes data so that more business units, as well as their
processes, can benefit from it.

Single, Unified View of Business Entities


Through data integration, the business can capture data from several sources
to complete a single view of the entities of the business such as assets,
locations, staff, finances, products, and clients. This is similar to data
warehousing, but it is aimed at operations rather than business intelligence.
By effectively using data integration, the business can complete its customer
profile and improve the value of any client-oriented business process, from sales
and marketing to client support. Complete product data can also add value to
business systems for procurement, product management, and supply chain
manufacturing.

Data Replication
Data replication, also known as data synchronization, is another data
integration system that can help add value to the business. For instance, data
replication may build a complete view of a central data hub for access by
several users and applications. This is seen in central hubs for product data,
customer data, and master data. Replication may also synchronize relevant data
across several applications and their databases. For instance, client-facing
applications for contact centers can be limited to a partial view of a customer
unless a total view is developed by replicating customer data across these
applications.
The business value of replication is that more business users get a unified view
of a single entity such as finances, customers, or products. Data replication
systems also tend to move and integrate data frequently, usually several times a
day. This improves the freshness, or currency, of data in applications. Hence,
data is not just complete but also up to date, which is crucial for businesses
that need current data for their decision making.

B2B Data Exchange


B2B data exchange is a promising area for development because businesses
can apply data integration tools and strategies in areas where they are still
rare. Many data exchanges are low-tech and manually entered and should
be replaced with synchronized, automated exchanges. Experts project a broad
modernization of data exchange between businesses, especially in product-
centric enterprises like retail, suppliers, and manufacturing. This is also
crucial for financial institutions, healthcare providers, and other organizations
that use procurement and supply chain systems.
The need to modernize data exchange between businesses is an urgent
concern. However, there is also the need to develop business value in this
area. In general, business partnerships are crucial to advance businesses in
terms of market reach, revenue, and brand development. Business
partnerships can grow by achieving better operational excellence through
data integration.

Real-time Delivery of Data


Businesses need to adapt to the fast pace of the world, and data integration
can help by integrating data at speeds that were impossible a decade ago.
Real-time data delivery that is usually enabled by modern data integration
systems can enable several high-value business processes.
Businesses are now using applications to monitor data such as business
activities, facility status, grid monitoring, and so on. These would be practically
impossible without the real-time information delivery supported
by data integration.
Operational business intelligence often captures data several times a day from
operational applications and makes the data available for monitoring and
other kinds of management or operational reports. This gives the business
access to data for strategic and operational decision-making.
Chapter 7 - Handling and Manipulating Files

Python provides basic methods and functions for manipulating files by default.
The open Function
Before you can write or read a file, you need to open it using the open ( )
function which creates the file object that is necessary for calling other
methods related to it. It has the following syntax:
file object = open ( file_name [ , access_mode ] [ , buffering ] )
The following are the parameter details:

file_name: It is a string value containing the name of the file
that you wish to access.
access_mode: It identifies the mode in which the file must be
opened. For instance, it determines whether the file is to be read,
written, or appended. The possible values are listed in the next
section. The access_mode parameter is optional,
and its default file access mode is read ( r ).
buffering: Buffering does not take place if the value is set to
0. However, if the value is set to 1, line buffering is performed
while the file is accessed. If you declare the value as any integer
bigger than 1, buffering is performed with your indicated buffer
size. If the value is negative, the buffer size becomes the system
default.

Different Modes for Opening Files


Mode Description
r It opens the file for the sole purpose of reading.
r+ It opens the file for both writing and reading.
rb It opens the file for reading, but only in the binary format.
rb+ It opens the file for both writing and reading in the binary format.
a It opens the file for the purpose of appending.
a+ It opens the file for both reading and appending.
ab It opens the file for appending, but only in the binary format.
ab+ It opens the file for both reading and appending in the binary
format.
w It opens the file for the sole purpose of writing. It overwrites the
file if it already exists.
w+ It opens the file for both reading and writing. It overwrites the file
if it already exists.
wb It opens the file for writing, but only in the binary format. It
overwrites the file if it already exists.
wb+ It opens the file for both reading and writing in the binary format.
It overwrites a file if it already exists.

The close ( ) Method


It closes the file object and flushes unwritten information. When this occurs,
further writing can no longer be done. In Python, a file is also closed
automatically once its reference is reassigned to another file object. Still, see to
it that you always close your files explicitly using the close ( ) method. Its
syntax is as follows:
fileObject.close( );
Consider the following example:
#!/usr/bin/env python
# Open the file
fo = open ( "bar.txt", "wb" )
print ( "The name of your file is:", fo.name )
# Close the opened file
fo.close( )
If you execute the code shown above, you will get the following output:
The name of your file is: bar.txt
Writing and Reading Files
The write( ) Method
It writes a string to an open file, but it does not add a newline character ( \n ) at
the end of the string. Take note that you can also write binary data if you open
the file in a binary mode. The syntax of the write( ) method is as follows:
fileObject.write(string);
Take a look at the following example:
#!/usr/bin/env python
# Open the file
fo = open ( "bar.txt" , "w" )
fo.write ( "This example shows how to use this method in Python.\nPython is very easy to comprehend!\n" )
# Close the opened file
fo.close( )
The example shown above would create the bar.txt file, write its content, and
close it. When you open this file, you would see the following:
This example shows how to use this method in Python.
Python is very easy to comprehend!
The read( ) Method
It reads a string from an open file. Take note that you can also read binary data
if the file was opened in a binary mode. The syntax of the read( ) method is as
follows:
fileObject.read([count]);
It starts reading the opened file from the beginning. If count is not given, it
attempts to read as much as it can, up to the end of the file.
Take a look at the following example:
#!/usr/bin/env python
# Open the file
fo = open ( "bar.txt" , "r+" )
text = fo.read( 10 )
print ( "Read String is :" , text )
# Close the opened file
fo.close( )
Running the above code displays the following output:
Read String is : This examp
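As a side note, modern Python code often uses the with statement instead of calling close( ) by hand; the file is then closed automatically even if an error occurs. This is an addition to the examples above, not a replacement for them:
#!/usr/bin/env python
# Read the same bar.txt file, letting the with statement close it automatically.
with open ( "bar.txt" , "r" ) as fo:
    text = fo.read( 10 )
print ( "Read String is :" , text )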
Deleting and Renaming Files
The remove( ) Method
It can be used to delete files. You just have to supply the name of the file that
you want to delete as the argument. Its syntax is as follows:
os.remove(file_name)
Take a look at the following example:
#!/usr/bin/env python
import os
# This deletes the existing file bar.txt.
os.remove ( "bar.txt" )
The rename( ) Method
It takes two arguments: the new file name and the current file name. Its
syntax is as follows:
os.rename(current_file_name, new_file_name)
Consider the following example:
#!/usr/bin/env python
import os
# This renames the file from foo.txt to bar.txt.
os.rename( "foo.txt" , "bar.txt" )
The sample code above renames the existing file foo.txt.
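As a small extra precaution (an addition, not part of the examples above), you may want to check that a file exists before deleting or renaming it, since os.remove( ) and os.rename( ) raise an error when the file is missing:
#!/usr/bin/env python
import os
# Rename foo.txt only if it exists, then delete bar.txt only if it exists.
if os.path.exists ( "foo.txt" ):
    os.rename ( "foo.txt" , "bar.txt" )
if os.path.exists ( "bar.txt" ):
    os.remove ( "bar.txt" )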
Chapter 8 - BI and Data Mining
The core idea behind business intelligence is to use the data that a business
has available in order to develop actionable information. This will help the
business operate more efficiently because management will be able to make
data-driven decisions rather than trying to act on incomplete information and
hunches. As it stands on its own, data is not useful in its raw form. This is
where business intelligence comes in. It is going to take that data and put it
into a form that can be presented to human beings, who can then make
informed decisions based on what the data is telling them.
As such, business intelligence will pull together all the data in the
organization, analyze the data, present that data in report form or in a
visualized way, which will make the data meaningful for management. Then
in order to enhance the competitiveness and efficiency of the business,
management can make data-driven decisions.

Data Mining
Now let’s become more acquainted with data mining. Data mining is a
jargon word in a sense. It already has a lot in common with some of the
things we’ve been discussing. The first thing that data mining is involved in
is large datasets. In
other words, here we have big data yet again – but that is only a first
impression. In fact, part of data mining is “mining” the data, finding smaller
subsets within the large datasets that are useful for the analytical purposes at
hand.
Another thing that data mining is involved in is recognizing hidden patterns
that exist in these large datasets. Thus, we are back to the tasks that are
carried out with machine learning, although this isn’t explicitly specified
when discussing data mining. Data mining attempts to classify and
categorize data so that it’s more useful to the organization.
So, we start with raw data which is basically useless. Data mining helps
convert that data into something that can provide value as far as the
information that it contains. A part of data mining is going to be selecting the
data that you want to use. Data warehousing is an important foundation upon
which data mining is based. Companies need to be able to store and access
data in large amounts which is why data warehousing with effective solutions
that are fast and accurate is important. Then the data must go through a
cleansing process. That is, when you have huge amounts of data, one of the
problems that you’re going to encounter is that data is often going to be
corrupted or missing. This is something that is very common when it comes
to relational databases, but it can also happen when you’re storing huge
amounts of unstructured data.
After the data has been gathered, extracted, and cleansed, the process of data
mining moves on to look for the patterns needed to gather useful information
from the data. Once this is done, the data can be used in many ways by a
business. For example, it could be used for sales analysis or for customer
management and service. Data mining has also been used for fraud
detection. There is a lot of overlap between data mining and other activities
involving big data, such as machine learning. When it comes to data mining,
you’re going to see a lot of statistical analysis.
Business intelligence and data mining are both involved in the process of
converting raw data into actionable information for the business. However,
the goal of business intelligence is to present data in meaningful ways so that
management can make data-driven decisions. In contrast, data mining is used
to find solutions to existing problems.
If you remember when we talked about big data, one of the things that was
important was volume. Business intelligence is certainly driven by large
datasets. However, data mining is different in this respect. Relevant data is
going to be extracted from the raw data to be used in data mining. Therefore,
relatively speaking, data mining is going to be working with smaller subsets
of the data that is available. This is one characteristic that is going to separate
data mining from the other topics that we have talked about so far. Data
mining might be used as a part of an overall strategy of business intelligence.
So, what management is looking for from data mining is solutions that can be
applied to business intelligence. This contrasts with business intelligence on
its own, as it is usually used to present data to people.
So, the core result obtained from data mining is knowledge. This is in the
form of a solution that can be applied within business intelligence. This
provides a big advantage to business and operations. That is because the
findings from data mining can be applied rapidly within business intelligence.
Data mining is also a tool within business intelligence that allows business
intelligence to extract complex data, presenting it in understandable forms that
are useful for the people in the organization. The data extracted with data
mining can be presented in readable reports or in graphical format containing
graphs and charts. In this form, it becomes a part of business intelligence so
that the people in the organization can understand, better interpreting the data
and making actionable decisions based on that data.
The volume of data coming to large businesses is only growing with time.
This makes both data mining and business intelligence more important to the
organization as the onslaught of information continues to pour in. It is going
to be important to cull the data in terms of saliency; this is where data mining
plays a role. The data is always changing, making this task even more
important. Demand for data mining and business intelligence solutions will
increase in proportion to the growth of the volume of data.
For companies to remain competitive - and especially if they want to be a
market leader - they are going to have to utilize data mining and business
intelligence solutions to retain their advantages.

Data Analytics
Data is not useful if you cannot draw conclusions from it. Data analytics is a
process of organizing and examining datasets for the purpose of extracting
useful and actionable information from the data. Data analytics plays a role
in business intelligence, using tools like OLAP for reporting and analytical
processing. When done effectively, data analytics can help a business
become more competitive and efficient, build better and more targeted
marketing campaigns, improve customer service, and meet the goals that are
a part of business intelligence. Data analytics can be applied to any data that
an organization has access to, including internal and external sources of data.
It can use old data or even real-time data to provide more readable
information that can be accessed by employees in the organization in an
effective way to help them make actionable decisions.
While data analytics can be used as a part of business intelligence efforts, like
machine learning, data analytics can be used for predictive modeling, which
is not part of business intelligence. Typically, BI is used for an informed
decision-making process based on analytics of past data. Data analytics uses
past data but can apply it with predictive analytics to help the company use
modeling and tools to determine future directions of various efforts that can
help the company maintain its edge and advance even further.
Data analytics will also be used in many ways that are like processing data
with machine learning. That is, it will be useful for pattern recognition,
prediction, and cluster analysis. Data analytics is also an important part of
the data mining process.

BI and Social Media


Over the past year, the data collection powers of the social media companies
have come to the forefront of many discussions. At the top of the list of
concerns is privacy. Regardless of what you think about these discussions,
one thing is clear: social media has resulted in the collection of
unprecedented amounts of data - not only about individual people but also
about businesses that use these platforms. Social media is a very effective
way to collect data on customers.
Social media represents unprecedented opportunities for businesses. For one
thing, social media will help businesses understand the behavior of their
customers. Social media also helps businesses target the market to new
customers and acquire them. It also provides an opportunity for a business to
put their face forward in new ways.
In this chapter, we are going to look at the power of social media in terms of
interaction between business intelligence and social media; we will explore
how that can help businesses expand and improve their competitive
advantage.

Leveraging Social Media


There are many ways that a business can leverage social media. The first
way is to recognize that companies like Facebook and Twitter have an
unimaginable treasure trove of data on every customer. The data collected is
thorough and global. Moreover, companies like Facebook are ahead in terms
of organizing that data and putting it into a useful form. So, the existence of
social media companies not only provides a platform through which a
business can increase awareness, but it has also created an environment
where other companies are doing a lot of the hard work for you. So, there’s a
bonanza cache of unstructured data that is not only being stored by
companies like Facebook, it is also being analyzed. So, let’s begin at the
beginning: the first advantage that we have here is that we don’t have to
worry about storage capacity because Facebook or Google already has that
data stored for us.
This data has also been put into a form that makes it friendly for all kinds of
analysis. And although there is a lot of hype about privacy violations, the
facts show that for the most part, data is presented in aggregate ways,
preventing the targeting of any individuals unless those individuals
voluntarily choose to interact with other companies. “Voluntarily choose”
means that an individual has freely given their name and email address to the
company for its use – and has read, understood, and agreed to the company’s
privacy policies.

Web Scraping Tools


Web scraping tools allow a company to obtain data that is on other websites
(publicly available data) without having to copy and paste it manually. These
types of tools allow businesses to get data from social media sites that can
then be analyzed and used. This data will be in an unstructured form, and as
such, will be well suited for analysis using machine learning. No matter
which social media platform you are scraping data from, you are going to
have mixed data in almost every case. Consider a Facebook post as an
example. A Facebook post may have an image associated with it, but it
might be plain text. It could also include a video. It might have a hyperlink
and emojis. So, there is a mixture of data that is contained in one single
object. This is also true on many other social media platforms; while
Pinterest and Instagram are photo platforms (primarily), postings will have
text, hashtags, and possibly hyperlinks.
In order to get usable information with this data, it must be rigorously
“percolated” via a data grinder. As we mentioned, this is clearly big data,
and it’s also unstructured data. That means it is not particularly well suited
for use in traditional business intelligence, but rather must be processed using
big data and machine learning methods. You will also be searching for
hidden patterns in the data. As an example, if you were looking for a
hyperlink appearing in many Facebook posts, you would need to cross-link it
with the people that are posting the link in order to find their demographics.
Web scraping might seem like an insurmountable task, but it can have many
advantages. It provides a way to collect data that can be used for marketing
research. It can also be used to extract contact information. However, the
value of doing that is questionable, as most people don’t respond well to
contacts that they – themselves – have not initiated.
There are many effective web scraping tools that are available. You can
consider using import.io, dexi.io, and visual scraper. These powerful tools
can help you do some pre-processing on the data, assisting you in retrieving
the desired type of data that is useful for your purposes. In some cases, you
might be able to get it into a form that can be directly used in business
intelligence.
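For readers who prefer to write their own scraper, here is a minimal Python sketch using the widely available requests and BeautifulSoup libraries. The URL and the post-text class are placeholders rather than a real site, and any real scraping should respect the website's terms of service.
# A minimal scraping sketch; the URL and CSS class below are placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/public-page", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every element carrying the (hypothetical) post-text class.
posts = [element.get_text(strip=True) for element in soup.find_all(class_="post-text")]

for post in posts:
    print(post)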

Direct Interaction with Customers


One of the interesting benefits of social media is that it provides businesses
with the ability to interact with customers directly. This is going to involve
some effort on the part of the business; the more effort that is put in, the more
likely it is going to pay off. And the fact is, this is easy to maintain, and can
be done on a low budget.
The first step toward direct interaction with customers is to create a Facebook
page for the business. A Facebook page is essentially like a Facebook
profile, but pages are created either for group interests, hobbies, or, in the case
of interest to us, businesses. The Facebook page is going to have a timeline
and a photo repository, just like any normal Facebook profile would. One of
the mistakes that a lot of businesses are making: they create Facebook pages,
but then fail to maintain them. If you’re a small business, it’s vitally
important that you maintain the Facebook page. It doesn’t take a lot of work,
and one or two posts per day are enough to grow the Facebook page over
time.
Unfortunately, what you often see when visiting Facebook pages for
businesses is very few posts at all, and many just create the page and leave it
there as a placeholder. Among those that do post, they don’t do so
effectively. Your post must be engaging, inviting users to comment on and
share the post.
But the main benefit of the Facebook page is that once somebody likes the
page, your posts will start showing up on their timeline! This is only one of
the many benefits of marketing by posting regularly on a Facebook page:
when your posts are shared, it creates a domino effect! Friends of the person
who originally shared will see your post, and they may share, and then - you
get the picture! By posting interesting, engaging content, you are reaching
far more people than just those who happen to see your post in the first
place.
A side point to note: user comments left by viewers on your Facebook page
are data. Even if you are running a small-scale operation, these comments
can be analyzed by those at your company, providing actionable
information.

The Challenges of Social Media


As a business, one of the challenges of social media is determining which
platforms are the most useful for the purposes of your business. The first
factor to consider is the question: what are your main customer
demographics? There are certain social media platforms that are used more
frequently by young people, whereas other social media platforms are used
by the general public. The form of data on social media platforms may also
vary from platform to platform.
Let’s get started by looking at a few examples. One of the most popular
social media platforms (and one that does not get much press) is Pinterest.
This platform has been around for a while, and although it has a mobile app
now, its original introduction to the public was as a website. This website’s
purpose is to share user images, and its primary audience is female. Of
course, that doesn’t mean that males are not on the website; however, in
proportional terms, the audience tends to be female, and more specifically,
female in a certain age group. It is estimated that the age group
of 18-30 makes up most of the active users on Pinterest.
Let’s contrast this information with that of Facebook and its users. Facebook
has been around for a long time, becoming dominant around ten years ago.
The advantage of Facebook is that most people already use it in a personal
manner, so the audience is already vast. If you are targeting Facebook and
looking for the advantages, the biggest one is already baked in: you are going
to be able to reach nearly every demographic that there is! If you are using
social media to reach people over the age of 50, Facebook works for you.
It would also work perfectly fine if your customer demographic was 18 to 34.
Now let’s look at Instagram, which is owned by Facebook. Instagram is an
app-only interface. This fact alone probably makes it more appealing to
younger people. Although Instagram has recently seen more use by
older people, its audience remains primarily the 18 to 34 age group. Gender is more
balanced on Instagram as compared to Pinterest.
Twitter is another platform that, like Facebook, seems to appeal to all age
groups. One downside of Twitter is that due to the nature of the platform’s
communication protocol, advertising on Twitter is a bit more difficult. Some
businesses may have trouble connecting with users due to this flaw. That
said, it has hundreds of millions of users, and so it can reach a lot of people.
There are other social media platforms such as Snapchat and WhatsApp.
These are confined to the mobile space and appeal mostly to people under the
age of 30.
This is an incomplete review but helps to demonstrate that you need to
choose your social media platforms carefully based on what your business is
doing, as well as its demographics.

Data Issues with Social Media


Social media can present data in a wide variety of forms. One of the first
factors that you need to consider is heterogeneity. This means we need to
examine the data in terms of the data types included in social media postings.
So, if heterogeneity is strong, it means that the data is taking many different
forms. Think multimedia here. Data may be in the form of text, images, and
video – or other types that you don’t readily consider but may be extremely
useful. For example, the hyperlinks that people put in their social media
postings are going to tell you a lot about that person. Hashtags are also
important to look at. As you can see, social media postings contain more
data than meets the eye at first. By looking at hashtags, it may be possible to
glean information such as political affiliation for things that the poster is
interested in.
The viral nature of social media is also something to consider. One question
you must ask: is a post by an individual an original post, or is it a shared
post? Even if a person shares a post, it may still provide data about that
person. More than likely, if the person shares a post, they either found it
interesting or they share the views or interests that are present in the post.
They may just think it’s funny. These factors can present a challenge to
businesses trying to do an analysis of social media data.
It is possible to retrieve data from social media sites using web scraping
tools. That takes a lot of processing power, and how it handles and separates
that data presents a difficult choice. Data will have to be sorted by source,
and by type. Different types of analysis can be applied to the data once you
have it in your possession. Clustering analysis could be very helpful. For
example, you could consider looking at all the posts that share a particular
link. Then an analysis could be used to determine the characteristics or other
data points of the people that share the link. This can be important for
marketing purposes, and it’s even used by political professionals that are
trying to target people that may be open to their messages.
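To give a flavor of what such a clustering analysis can look like, here is a minimal scikit-learn sketch. The “users who shared a particular link” and their two features are invented for illustration.
# A minimal clustering sketch; the user features below are invented.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [age, posts_per_week] for a user who shared the link.
sharers = np.array([
    [19, 14], [22, 11], [24, 15],   # younger, very active users
    [41, 3],  [45, 2],  [52, 4],    # older, less active users
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sharers)
print("cluster labels:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)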
Another issue related to social media data concerns immediacy. If you’re
going to utilize social media data, you are going to need to know how recent
the data is. Data from five or ten years ago probably isn’t going to be very
relevant. The more immediate the data is, the more valuable the data is.
Second, social media provides huge amounts of data. This is the scaling
problem. For organizations working with this data on their own, this can be
an extremely difficult problem to deal with. This may force organizations to
either: 1) have companies like Facebook do analysis on their behalf (probably
the most efficient way to do it), or 2) work with smaller datasets
and try to do the analysis themselves. That process, however, is probably not as
effective as utilizing the power of the social media companies
themselves in order to help your business.

What Can You Get from Social Media?


Let's take a step back and think about what we can get from social media.
Social media offers unprecedented power that we can use to investigate the
private lives of people. This is not to say that you should be snooping on
people. We are talking about information that these people have voluntarily
chosen to make publicly available. This information is extremely useful from
a business perspective, allowing you to determine many things about
different people, including interests, hobbies, and goals. People often list
important information with their profile that can be combined with
information such as age, gender, and birthplace. All this information can be
combined and analyzed using machine learning capabilities in order to extract
useful patterns for marketing purposes.
One of the keys to social media is that you want to be able to speak directly to
your consumers in ways that they can relate to. Analyzing data that is
either shared with you by the social media company or that you have scraped
yourself will help you connect better with your customers.

Dashboarding the Data


One way that many businesses deal with data from social media companies is
by using dashboards. So, for example, they can ask questions about people
with different demographics. This information can be gathered from social
media sites. In fact, Facebook provides a dashboard-like interface that you
can use to analyze Facebook users in terms of many characteristics such as
education, websites they have shown interest in, and more.
For larger organizations, it may be possible to integrate social media data and
develop internal data dashboards that can be used in order to access and
analyze the data. For example, you can create clusters of users by various
characteristics. You might be interested in males, ages 30 to 44, unmarried.
Then you could use the data dashboard to extract different information about
this demographic. This is all part of business intelligence.
Hopefully, you can see where this is going: the result is that we can create
actionable information from these types of analyses.
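Behind the scenes, a dashboard slice like the one described above often boils down to a simple filter over a user table. Here is a minimal pandas sketch with invented records:
# A minimal sketch of slicing a user table by demographics; the records are invented.
import pandas as pd

users = pd.DataFrame({
    "gender": ["male", "female", "male", "male"],
    "age": [35, 28, 42, 31],
    "married": [False, True, False, True],
    "interest": ["sports", "travel", "sports", "music"],
})

# Males, ages 30 to 44, unmarried.
segment = users[(users["gender"] == "male")
                & users["age"].between(30, 44)
                & ~users["married"]]
print(segment["interest"].value_counts())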

Three Ways to Use Social Media


The businesses that have the most success with social media are those that
understand the central premise of social media. Take a step back and think
about what Facebook is about. The central theme of Facebook is connection.
Put differently, social media is about building relationships. This means that
there are going to be three ways that a business can utilize and exploit social
media.
The first way is to build relationships with your client base. This can include
active clients as well as prospective ones. That is one reason why you should
create and maintain a Facebook page and actively engage with the people that
post there. This will go a long way toward getting your company to build
relationships with its customers. You want to make it personal and real.
People have a certain type of radar that they can use on a subconscious level
to detect whether something is genuine or not. This is not to say that people
are impossible to fool; of course, they can be fooled - Bernie Madoff proved that!
Even so, people do have a certain sense of these sorts of things. Businesses
should go into social media with the intent of being genuine and building real
relationships with people. Even if a company is a large organization, it
would benefit by putting an actual face to the page. So, it might be possible
to have an employee who is dedicated to running Facebook pages and
interacting with customers.
The second way that social media can be utilized includes what we have been
discussing up to this point: data gathering. This is the first step toward
incorporating social media data into your business intelligence. From there,
it can be analyzed using the tools that you already have. The applications
of using this data are going to depend on your business needs and the reasons
that you’re collecting this data. It might be to get a picture of the ideal
customer. This is a very powerful way of marketing if you can get an idea of
who the ideal customer is. Alternatively, you may be breaking things down,
using clustering in order to determine what different groups are driven by in
terms of their interests and desires. This will help you market to different
groups using targeted methods that are going to speak more directly to them.
In summary, social media can help you develop more effective relationships
with consumers and more effective marketing tools. The way that this
information should be used is to communicate directly to users in a way that
touches them directly and strikes a sensitive chord in them.
Third, you are going to want to use social media to put your advertising and
marketing efforts on steroids.
Chapter 9 - What Is R-Squared and How Does It Help Us

Even though Big Data is usually described as beneficial for businesses, it will
also become valuable to individual customers through their
personal data. More and more quantified-self programs and applications will
enable customers to store, monitor, visualize, and make sense of their own
lives.
Information about sleeping, eating, and physical activities could be available,
along with records of your life. It is only a matter of years before there will
be applications that integrate all the data from these separate apps into
Soon, these apps could mash it up with geolocation information from your
handheld devices as well as your activities on social media. There are also
apps that allow you to compare your performance with your friends.
These quantified-self applications have the advantage of generating a massive
volume of data, which will allow information to be tracked about whole
population groups. This might raise some privacy issues, but more and
more individuals are so eager to learn about their own lives that they
sign up readily.
The concept of the quantified self has been around since the 1970s, but it only
took off thanks to IoT and the existence of monitoring devices, which can be
connected to handheld devices. These devices carry all types of sensors,
which could monitor almost anything.
Most technologies on quantified self are focused on improving health and
attitude. Hence, most devices are helping individuals to monitor their
sleeping habits, emotions, activities, stress levels, food consumption, caffeine
consumption, smoking, and much more.
Hence, the movement is targeted towards people who are completely fine
with having their personal information gathered, and made available for
public access with a certain degree of anonymity.

Big Social Data


In the last few years, organizations have seen the benefits of harnessing Big
Social Data, because it contains important information that
will enable them to better understand their audience or consumers.
Through sentiment analytics, business organizations could learn what their
customers are thinking of their products, services, advertisements,
announcements, promos, and more.
Moreover, all accessible social data could be used to run predictive analytics
about what customers may like and when they like it. Based on the
feedback customers post on social media, businesses can gather insights that
would normally call for expensive conventional research.
Organizations that are using information available on social media channels
could start hypertargeting customers. Hence, instead of just targeting possible
customers by a specific age, gender, or location, businesses could now focus
on customers according to their latent or actual needs. All of this data can be
derived from what customers say on social media – the retweet or the like
and their context.
For example, Walmart is using the information shared on Facebook and
Twitter to send personalized coupons to possible customers. It also keeps
track of what their customers are commenting – the moment someone tweets
about, let’s say flowers and chocolates, Walmart can send a discount coupon
for those products at the nearest branch.
Another example of hypertargeting is practiced by MyBuys, which provides
multi-channel personalization for online stores and consumer
brands.
The goal here is to improve engagement, drive conversions, and increase
sales through the analysis of the individual behavior of their customers,
which is no easy task as they have 200 million customers who have generated
100 terabytes of data and counting.
Chapter 10 - Public Big Data

In 2011, the Vice President of the European Commission, Neelie Kroes,
introduced some proposals to legally access the data produced and stored by
public institutions in Europe. Kroes believes that opening access to these
datasets could double their worth to about 70 billion euros, because when data is
integrated and converted into information, it can provide added value to the
economy.
The open data portal highlights transparency, innovation, and open
governance. The available data could be reused, integrated, analyzed and
visualized for commercial or personal use. This is a significant leap forward
as it could create new business opportunities and could drive innovation.
Other governments are also considering similar action. For example, the
Netherlands has already developed a portal on which publicly funded open
datasets are available for anyone's consumption.
The Netherlands actively supports local authorities and departments in sharing
their datasets on this portal to boost innovation and business opportunities.
This is believed to result in a more transparent and more efficient
government.
Meanwhile, the United States is also now looking into the opportunities on
Big Data. Former President Barack Obama introduced a Big Data project
worth $200 million to look into Big Data technologies and opportunities.
The goal is to advance available tools and technologies to effectively process,
access, analyze, store, and visualize the massive volume of data generated by
the local, state, and federal governments.
Australia also developed its own public strategy on Big Data, with the goal
of making information held by regional or national authorities available to
the public. Australia developed this strategy to make certain that business
organizations and governments can take full advantage of all of Big Data's
benefits while securing the privacy of its citizens.
Chapter 11 - Gamification

Much of the new data will be generated from gamification, which in business
is not only an effective tool for marketing campaigns but can also
revolutionize the manner in which organizations communicate with their
audiences. It will also create valuable Big Data that can enrich the large
databases of businesses.
Gamification refers to the use of game elements in non-game contexts. This
could be used to communicate with customers and enhance marketing efforts
that could lead to more revenue.
Gamification is also usually used within the organization to improve
employee productivity and crowdsourcing initiatives. Ultimately,
gamification could also change consumer behavior. The quantified-self
movement is an ideal example of the integration between Big Data and
gamification.
The gamification elements that are usually tapped are
challenges, leaderboards, avatars, badges, points, awards, and levels.
Furthermore, gamification could also be used to learn something, to achieve
something, and also to stimulate personal success.
The objective is to enhance real-life experiences and make people more
willing to perform something. However, gamification is not all about gaming,
but merely the application of gaming elements in a different context.
Various aspects of gamification offer a lot of data that could be analyzed. The
business can easily compare the performance of users and understand why
some groups perform better than other groups. When customers log in
through the social graph, a lot of public data can be added to
provide context around the data from gamification.
Aside from the various elements that offer directly accessible insights,
gamification could also help in understanding consumer behavior and their
performance. For instance, how long do various groups take to finish a
challenge or how do they use specific services or products. Gamification data
could be used to enhance your offerings.
Gamification can also be used to motivate people to act and to encourage
them to share the right data for the right context. As a matter of fact,
gamification should be considered a catalyst for sharing. The higher the user
engagement, the greater the chance that users will share. This can lead to more
attention for the company as well as more valuable information.
The success of gamification in your big data strategy will largely depend on the
speed and quality of the information that is returned to the user. Users will be
more involved if the content is better. Big Data can also be used to
personalize content. Buying behavior, the time needed to do specific tasks,
and engagement levels could be integrated with public data like posts or
tweets as well as user profiles.
This will provide your business with a lot of valuable insights if the data has
been stored, analyzed and visualized. But, users are now expecting immediate
results and feedback. Hence, real-time data processing is quite crucial.
A few years from now, gamification will become more integrated with how
consumers access and consume data. This will result in more data
generation. With Big Data, businesses will be able to learn how and why their
consumers behave in the context of gamification, and this, in turn, will
provide more insight into how those consumers behave in real life.
This information is quite valuable for the marketing and sales departments to
reach out to potential consumers using the right message, in the right context,
and with perfect timing.
Business organizations should create the right design for their gamification
strategy to gain the desired insights and results. Based on a report by Gartner,
80% of gamification solutions may not deliver the intended results
because of flaws in their design. Remember that, as with Big Data in general,
a flawed design will only result in flawed data and poor insights.
Chapter 12 - Introduction To PHP
PHP is an acronym for Hypertext Preprocessor. The language is a server-side,
HTML-embedded scripting language. For beginners it can be hard to understand
that statement, so let me break it down. When I say the language is
server-side, I mean the execution of the scripts takes place on the
server where the website is hosted. By HTML-embedded, I mean PHP
code can be used inside HTML code. A scripting language is a
programming language that is interpreted instead of being compiled like
the C++ and C programming languages. Examples of scripting languages include
JavaScript, Python, Perl, and Ruby.
You can use the PHP language on several platforms, including UNIX, Linux, and
Windows, and it supports many databases, including Oracle, Sybase, MySQL,
etc. Furthermore, PHP files contain scripts, HTML tags, and plain text, with
extensions such as .php3, .php, or .phtml. Finally, PHP is open-source
software, which is free.

Pre-requisite for learning PHP


If you are wondering whether there is anything special you need to know before
learning PHP, the answer to this question is no. Going through the
documentation section gives you the necessary information. One
major reason many find it easy to learn PHP is its documentation, in which
every concept is explained in simple terms.
Additionally, PHP is a simple and straightforward language for anyone to
learn. However, if you want to learn web development effectively, it is important
to learn the basics of the following languages:

HTML – This is what PHP sends to the web browser

MySQL – You need a database to store data
CSS – You need this to add style to your HTML pages
JavaScript – You need this to make your pages interactive for users
If you can equip yourself with these languages, then you can learn PHP
effectively.
Getting Started
Before starting this lesson, you should have the following:

PHP and MySQL installed


A web server (e.g., Apache)
With these two programs, you can successfully write and execute PHP code.
You can purchase an inexpensive hosting plan that supports MySQL and
PHP. However, if you want to save some cash, you can decide to install them on
your own system. To do this on Windows, install the WAMP server on your
machine. After the installation, you can access it through
http://localhost in your browser. Ensure you have this set up before starting
this course.

PHP Syntax
As I indicated at the start, PHP code is executed on the server side.
Every PHP code block begins with <?php and ends with ?>.
Let us begin with a simple program. You can copy and paste the program
below into any text editor before saving it with the file name index1.php.
I named the file "index1.php" because the web root folder usually already has a
file named index.

<html>
<head>
</head>
<body>
<?php
/* This is a comment
that spans
several lines */
// This is a single-line comment
// echo prints the statement onto the screen
echo "Hello World, Welcome to PHP Programming!";
?>
</body>
</html>

When executed, you should see the following output:
Hello World, Welcome to PHP Programming!

Variables in PHP Programming


Every variable in PHP begins with the dollar ($) sign. Most
beginners make the mistake of not including the dollar sign at the beginning. I
know you won't make that mistake.
<?php
$variable1 = 280;
$variable2 = "PHP Programming";
?>

We first declare $variable1 with the value 280, an integer. The second,
$variable2, is a string variable with the value "PHP Programming".
It is important to note that every statement in PHP ends with a semicolon. You
will get an error whenever you don't include a semicolon to indicate the
end of a statement.
Variable Rules in PHP
A variable name always begins with an underscore (_) or a
letter
A variable name must not include spaces
Variable names can only contain underscores and alphanumeric
characters
String Variables
String variables are important, especially if you want to store and manipulate
text in your program. The code below assigns the text "Welcome to PHP
Programming" to the variable $beginner and prints its content to the
screen.
<?php
$beginner = 'Welcome to PHP Programming';
echo $beginner;
?>
Output
Welcome to PHP Programming
The strlen() function
Perhaps you want to determine the length of a word or sentence; the
strlen() function is what you need. Consider the example below.
<?php
echo strlen("Today is the best day of your life. Programming is a
lifelong skill and PHP is all your need");
?>

The outcome will be the length of the string, counting letters, spaces, and
punctuation. In this situation, the result will be 92.

Operators in PHP Programming


In this segment, I will run through the basic operators in PHP. I will
look at the assignment, arithmetic, comparison, logical, and concatenation
operators.
Assignment operators
Operator   Example    Long notation
%=         p %= q     p = p % q
*=         p *= q     p = p * q
.=         p .= q     p = p . q
/=         p /= q     p = p / q
+=         p += q     p = p + q
=          p = q      p = q
-=         p -= q     p = p - q

Logical Operators
Operator   Description   Example
!          not           p = 9, q = 9; !(p == q) returns false
&&         and           p = 9, q = 9; (p < 10 && q > 1) returns true
||         or            p = 9, q = 9; (p == 9 || q == 5) returns true

Arithmetic Operators
Operator   Description                     Example          Result
+          Addition                        a = 8; a + 5     13
-          Subtraction                     a = 17; 20 - a   3
/          Division                        a = 40; a / 2    20
*          Multiplication                  a = 7; a * 5     35
++         Increment                       a = 9; a++       a = 10
--         Decrement                       a = 14; a--      a = 13
%          Modulus (division remainder)    56 % 6           2
Comparison Operators
Operator   Description                    Example
==         is equal to                    48 == 49 returns false
!=         is not equal                   48 != 49 returns true
<          is less than                   48 < 49 returns true
<=         is less than or equal to       48 <= 49 returns true
<>         is not equal                   48 <> 49 returns true
>          is greater than                48 > 49 returns false
>=         is greater than or equal to    48 >= 49 returns false

Conditional Statements in PHP Programming


At times, you may want to make decisions that require different actions when
writing a program; conditional statements play a huge role in making such
decisions. In the PHP language, we have the if statement, the if…else statement,
and the if…elseif…else statement. In this section, I will expand on these
statements, including their syntax.

If Statement
The statement executes a line of code as long as the condition
stated is true. Consider the example below.
<?php
$number = 23;
if ($number == 23)
echo "Wake up! Time to begin Your Programming lesson.";
?>

In the statement above, we first assign the value 23 to the variable
$number. The if statement then evaluates whether $number is equal
to 23; since this is true, it will return:
Wake up! Time to begin Your Programming lesson.

The If…else statement


This condition examines two different statements and executes one of them
depending on the condition specified. A simple English illustration would be:
if the decision is donut, buy a donut when coming; otherwise, buy pizza.

<?php
$decision1='Donut';
if($decision1 == 'Donut') {
echo 'Buy Donut when coming';
} else {
echo 'Buy Pizza when coming';
}
?>

The output will be:
Buy Donut when coming


Let’s twist the same code and consider the output.
<?php
$decision1='Donut';
if($decision1 == 'Don') {
echo 'Buy Donut when coming';
} else {
echo 'Buy Pizza when coming';
}
?>
The output will be:
Buy Pizza when coming
It doesn't work on strings alone; you can also use it with number operations.

The if … else if…else statement


You can use this statement to select a single option from several different
blocks of code. The example below will explain it better.
<?php
$number1 = 10;
$number2 = 10;
if ($number1 == 8) {
echo 'The expression is true';
} elseif ($number1 == $number2) {
echo 'The second expression is true';
} else {
echo 'None of the expressions is true';
}
?>

The output:
The second expression is true
Switch Statement
The statement allows you to change the course of the program flow. It is best
suited when you want to perform different actions for different conditions.
Consider the example below.
<html>
<body>
<?php
$a=2;
switch ($a)
{
case 1:
echo 'The number is 10';
break;
case 2:
echo 'The number is 20';
break;
case 3:
echo 'The number is 30';
break;
default:
echo 'There is no number that match';
}
?>
</body>
</html>

Output:
The number is 20

Explanation
In the example above, we set the variable $a to 2. The switch
statement contains several blocks of code, each labeled with a case, plus a
default block. If the value of a case is equal to the value of the variable $a,
the statements within that case are executed and then break ends the switch.
If no case value matches the variable, the default code block is executed
instead.

Conclusion
The PHP language isn't restricted to professional web developers alone. You
don't have to be an IT professional to learn it. Like any scripting language,
it may seem complicated at first; however, if you persevere, you will discover
it is an interesting language to learn. Learning PHP programming is a perfect
way of understanding the server-side world. Writing PHP code is not
intimidating if you start from the foundation, as I have done in this book.
PHP is one of those languages you don't need anyone to teach you, as long as
you are ready to learn. In this book, you have learned everything you need to
get your environment ready, along with variables, conditional statements, and
much more.
Chapter 13 - Python Programming Language

Introduction
The Python language is one of the easiest and most straightforward
object-oriented languages to learn. Its syntax is simple, making it easy for
beginners to learn and understand the language. In this chapter, I
will cover several aspects of the Python programming language. This
programming guide is for beginners who want to learn a new language;
however, if you are an advanced programmer, you will also learn
something.
Guido van Rossum developed the Python language; the implementation
began in 1989. You might initially think it was named after the
python snake; however, it was named after a comedy television show called
"Monty Python's Flying Circus."

Features of Python
There are certain features that make the Python programming language
unique among other programming languages. They are summarized
below.

Easy to learn – Because Python is a high-level and expressive
language, it is easy for everyone – including you – to learn and
understand, irrespective of programming level, from beginner
to advanced.
Readable – It is quite easy to read the language.
Open source – The language is open source.
Cross-platform – Python is available and runnable on
different operating systems, including UNIX, Linux, Windows,
and Mac. This has contributed to its portability.
Free – The language is downloadable without paying anything.
Furthermore, not only can you download it, you can use it for
various applications, and the software can be distributed freely.
Large standard library – Python has its own standard library, which
contains various functions and modules that you can use in your
code.
Supports exception handling – Most programming languages
have this feature. An exception is a situation that takes place in the
course of program execution and has the tendency to disrupt the flow
of the program. With Python's exception handling feature, you can
write less error-prone code while testing the various situations that
may lead to an exception in the future.
Memory management – The language also supports automatic memory
management, which means it clears and frees memory automatically. There is no
need to clear the memory on your own.

Uses of Python
Most beginners, before choosing to learn a programming language, first
consider what the uses of such a language are. There are various
applications of the Python language in real-world situations. These include:

Data Analysis – You can use python to develop data analysis and
visualization in the form of charts
Game development – Today, the game industry is a huge market
that yields billions of dollars per year. It may interest you to
know that you can use python to develop interesting games.
Machine learning – We have various machine learning
applications that are written using the python language. For
instance, products recommendation in websites such as eBay,
Flipkart, Amazon, etc. uses a machine-learning algorithm, which
recognizes the user’s interest. Another area of machine learning
is a voice and facial recognition on your phone.
Web development – You didn't see this coming. Well, web
frameworks such as Flask and Django are based on the Python
language. With Python, you can write backend programming
logic, manage databases, map URLs, etc.
Embedded applications – You can use Python to develop embedded applications

How to Install Python Programming Language


It is very easy to install Python on your system. Since it is cross-platform,
you can install it on Ubuntu, Mac, UNIX, Windows, etc. To install it, visit
https://www.python.org/downloads and download the installer for your
particular operating system; the installation process is not complicated.
After downloading the software according to your operating system, follow
the on-screen instructions to complete the process.
Since you are a beginner, I will also show you how to install PyCharm, which
is a common IDE used for Python programming. IDE stands for integrated
development environment; an IDE contains a debugger, an interpreter or
compiler, and a code editor.

Installation of PyCharm IDE


First, go to this address – https://www.jetbrains.com/pycharm/download/ – to
download the edition you want, then install the downloaded file. If you
are using a Mac, double-click the .dmg file and drag PyCharm into the
Applications folder. For Windows users, open the .exe file and follow the
directions on the screen.

Launching PyCharm
For windows users, after installing the .exe file, you will see the PyCharm
icon on the desktop depending on the option you selected during installation.
You can also go to your program files > Jetbrains >PyCharm2017 and look
for the PyCharm.exe file to launch PyCharm.

Python Program Beginning


Once you open the IDE and give a name to your project, you can start
programming. You can begin with this simple program
# This Python program prints Welcome to Python Programming on the
screen
print('Welcome to Python Programming')
If you did that, you should have the following
Welcome to Python Programming
Comments in Python
If you look at the first line, it begins with "#". Whenever you see that, it is
a comment, and in Python a comment doesn't change the program outcome.
Comments are very important because they help you to easily read the program
by providing further explanation of the code in English for everyone to
understand.
Comments can be written in two ways in Python: as single or
multiple line comments. A single-line comment uses the #, as in
the previous code example. A multiple line comment uses three single
quotes (''') at the beginning and end respectively.
'''
Example of multiple line comment
'''

Let me use a real example to explain both the single and multiple line
comment.
'''
Sample program to illustrate multiple line comment
Pay close attention
'''
print("We are making progress")
# Second print statement
print("Do you agree?")
print("Python Programming for Beginners") # Third print statement
Output:
We are making progress
Do you agree?
Python Programming for Beginners
Python Variables
We use variables to store data in programming. Variable creation is very
simple to implement in Python: you declare the variable name and assign its
value together. For instance:
number1 = 140  # number1 is of integer type
str1 = "Beginner"  # str1 is of string type

Variable Name Convention in Python


Another name for a variable name is an identifier. Python has some laid-down
rules when it comes to naming variables. These rules differ slightly from
other programming languages.

Variable names must always start with an underscore (_) or a
letter. For example, _number1, number1
Variable names cannot contain special characters such as #, %, $,
etc.; however, they can contain underscores and alphanumeric
characters
A variable name cannot begin with a number. For instance,
3number is invalid
Variable names are case sensitive. For instance, number1 and NUMBER1 are
entirely different variable names in Python
Python Variable examples
Number1 = 589
Str = "Python Programming"
print(Number1)
print(Str)
The output will be:
589
Python Programming

Multiple Assignment
You can also assign a single value to several variables in one expression in
Python. Consider the example below:
Profit = returns = yields = 35
print(Profit)
print(yields)
print(returns)

Output
35
35
35

Let us consider another example


A, B, C = 35, 8, 90
print (A)
print (B)
print (C)
Output
35
8
90

Concatenation and plus Operation on variables


A = 44
B = 68
print (A + B)
c = “Welcome”
d = “Home”
print (c + “ “ + d)

Output
112
Welcome Home

However, if you decide to use the + operator with A and c together, it

will display an error such as: unsupported operand type(s) for +: 'int' and 'str'

Data Types in Python


The purpose of a data type is to define the kind of data a variable can hold.
For instance, "welcome home" is a string data type while 234 is an integer data
type. In Python, data types are divided into two different groups. We have the
immutable data types, whose values are unchangeable; they include tuples,
strings, and numbers. The other group is the mutable data types, whose values
are changeable; they include sets, dictionaries, and lists. In this book, my
focus will be on the immutable data types.
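To make the mutable/immutable distinction concrete, here is a minimal sketch (the variable names are invented for illustration): a list element can be reassigned, while trying the same thing on a tuple raises an error.
# Quick illustration of mutable vs immutable types.
my_list = [1, 2, 3]
my_list[0] = 99          # lists are mutable, so this reassignment works
print(my_list)           # [99, 2, 3]

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99     # tuples are immutable, so this raises TypeError
except TypeError as error:
    print(error)         # 'tuple' object does not support item assignment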
Numbers
When working with numbers, Python supports integers, floats, and complex
numbers. Float numbers are those with decimal points, such as 9.9, 4.2, and
42.9. An integer is the opposite of a float because it does not have a decimal
point attached to it, for instance 3, 35, and 89. A complex number contains a
real and an imaginary part, such as 7+10j.
Let’s demonstrate the use of numbers in a python program
# Python program to show how we can use numbers
# declaring the variables number1 and number2 as integer
number1 = 78
number2 = 12
print(number1+number2)
# declaring a and b as float data type
a = 15.9
b = 5.8
print(a-b)
# declaring x and y as complex numbers
x = 5 + 2j
y = 9 + 6j
print(y-x)
Output
90
10.1 (the float result may display extra decimal places due to floating-point arithmetic)
(4+4j)

Strings
This is a series of characters enclosed within a special character. In Python,
you have the option of using a single or double quote to represent a string.
There are various means of creating strings in python.

You have the option of single quotes (')

You can use double quotes (")
You can use triple double quotes (""") or triple single quotes (''') for
multi-line strings
# Ways of creating strings in Python
str = 'single string example'
print(str)
str2 = "double string example"
print(str2)
# multi-line string
str3 = """ Triple double-quote string"""
print(str3)
str4 = '''This is Python Programming '''
print(str4)

single string example

double string example
 Triple double-quote string
This is Python Programming

Tuple
A tuple works like a list, but the difference is that in a tuple the objects are
unchangeable. The elements of a tuple cannot be modified once assigned,
whereas in the case of a list the elements are changeable.
To create a tuple in Python, you place all the elements in
parentheses (), separated by commas. Let me use an example to illustrate
tuples in Python.
# tuple of strings
bioData = ("John", "M", "Lawson")
print(bioData)
# tuple of int, float, string
data_new = (1, 2.8, "John Lawson")
print(data_new)
# tuple of string and list
details = ("The Programmer", [1, 2, 3])
print(details)
# tuples inside another tuple
# nested tuple
details2 = ((2, 3, 4), (1, 2, "John"))
print(details2)
The output will be:
('John', 'M', 'Lawson')
(1, 2.8, 'John Lawson')
('The Programmer', [1, 2, 3])
((2, 3, 4), (1, 2, 'John'))

Control Statement in Python Programming


There are various control statements used in Python to make a decision.

If Statement
The statement prints out a message if a specific condition is satisfied. The
format or syntax is as follows:
if condition:
    lines of code
flag = True
if flag == True:
    print("Welcome")
    print("To")
    print("Python Programming")
Output
Welcome
To
Python Programming

Consider another example


number1 = 180
if number1 < 290:
    print("number1 is less than 290")

Output
number1 is less than 290

If-else statement
In our previous example, we only tested a single condition. What if you want
to test two different conditions? That is where the if-else statement comes
into play. In Python, the statement executes one block of code if the condition
is true and another block if it is not.
Syntax
if condition:
    statement1
else:
    statement2
Let us use our last example to illustrate this.
number1 = 180
if number1 > 290:
    print("number1 is greater than 290")
else:
    print("number1 is less than 290")

Output
number1 is less than 290
number1 = 15
if number1 % 2 == 0:
    print("The Number is an Even Number")
else:
    print("The Number is an Odd Number")
Output:
The Number is an Odd Number

Bonus Programs
# Program to display the Fibonacci sequence depending on the number the
user wants
# For a different result, change the values
numb1 = 12
# uncomment the next line to take input from the user
# numb1 = int(input("How many times? "))
# first two terms
a1 = 0
a2 = 1
count = 0
# Verify if the number of times is valid
if numb1 <= 0:
    print("Please enter a positive integer")
elif numb1 == 1:
    print("Fibonacci sequence up to", numb1, ":")
    print(a1)
else:
    print("Fibonacci sequence up to", numb1, ":")
    while count < numb1:
        print(a1, end=' , ')
        nth = a1 + a2
        # update values
        a1 = a2
        a2 = nth
        count += 1
What do you think the output will be?
Fibonacci sequence up to 12 :
0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34 , 55 , 89 ,
Chapter 14 - A brief look at Machine Learning
Machine learning is a data science field that comes from all the research done
into artificial intelligence. It has a close association to statistics and to
mathematical optimization, which provides the field with application
domains, theories and methods. We use machine learning more than we
realize, in applications and tasks where it isn’t possible for a rule-based
algorithm to be explicitly programmed.
Some of the applications are search engines spam filters on email, computer
vision and language translation. Very often, people will confuse machine
learning with data mining, but machine learning focuses mostly on
exploratory data analysis.
Some of the terminology you will come across in this section is:
Features – distinctive traits used in defining the outcome
Samples – an item being processed, i.e. an image, document, audio, CSV file
etc.
Feature Vector – numerical features that are representing an object. i.e. an n-
dimensional vector
Feature Extraction – a feature vector processed and data transformed from
high-dimensional to low-dimensional space
Training Set – data set where potential predictive relationships are discovered
Testing set – data set where predictions are tested
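To make the training set and testing set ideas concrete, here is a minimal hedged sketch using scikit-learn's train_test_split helper; the feature values and labels below are toy numbers invented for the example, not data from this book.
# Minimal sketch: splitting toy data into training and testing sets.
from sklearn.model_selection import train_test_split

X = [[50], [58], [63], [68], [70], [79], [84], [75], [65]]   # toy feature vectors
y = [0, 0, 0, 1, 1, 1, 1, 1, 1]                              # toy labels

# Hold out 30% of the samples as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), "training samples,", len(X_test), "testing samples")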

Different Machine Learning Types


There are three main types of machine learning:
1. Supervised – the computer is given a set of inputs and the
outputs associated with them. The program will learn from
those inputs so the outputs can be reproduced
2. Unsupervised – with no target variable, the computer learns,
by itself, to find the patterns in the given data
3. Reinforcement – a program is required to interact
dynamically with its environment

Supervised Learning
A supervised machine learning algorithm will study given data and will
generate a function which can then be used for the prediction of new
instances. We’ll assume that we have training data comprised of a set of text
representing news articles related to all kinds of news categories. These
categories, such as sport, national, international, etc., will be our labels. From
the training data, we are going to derive some feature vectors; each word may
be a vector or we may derive certain vectors from the text. For example, we
could count a vector as how many times the word ‘football’ occurs.
The machine learning algorithm is given the labels and the feature vectors
and it will learn from that data. Once training is completed, the model is then
fed the new data; once again, we extract features and input them to the model
and the target data is generated.
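As a hedged sketch of that train-then-predict flow (not the author's own code), the snippet below builds word-count feature vectors from a few invented article snippets with scikit-learn and trains a simple classifier on the sport/national labels.
# Minimal supervised text classification sketch; the articles and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

articles = [
    "the football team won the championship final",
    "parliament passed the national budget today",
    "the striker scored twice in the derby",
    "the senate debated the new trade policy",
]
labels = ["sport", "national", "sport", "national"]

# Turn each article into a word-count feature vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(articles)

# Train the model on the labeled feature vectors.
model = MultinomialNB()
model.fit(X, labels)

# Extract features from new text and predict its category.
new_articles = ["the goalkeeper saved a penalty"]
print(model.predict(vectorizer.transform(new_articles)))   # likely ['sport']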

Unsupervised Learning
Unsupervised learning is when unlabeled data is analyzed for hidden
structures. For our example, we will use images as the training data set and
the input dataset. The images are of the faces of insects, horses and a human
being; features will be extracted from them and these features identify which
group each image should go to. The features are given to the unsupervised
algorithm, which looks for any patterns; we can then use the algorithm on
new images that can be identified and put into the right group.
Some of the unsupervised machine learning algorithms that we will be
discussing are:
k-means clustering
Hierarchical clustering

Reinforcement Learning
With reinforcement learning, the data for input is a stimulus from the
environment the machine learning model needs to respond to and react to.
The feedback provided is more of a rewards and punishment system in the
environment rather than the teaching process we see in supervised learning.
The actions that the agent takes lead to an outcome that the agent can learn
from rather than being taught; the actions selected by the agent are based on
two things – past experience and new choices, meaning it learns by a system
of trial and error. The reinforcement signal is sent to the agent by way of a
numerical reward that contains an encoding of the success and the agent will
learn to take the actions that increate that reward each time.
Reinforcement learning is not used much in data science, more so in robotics
and the two main algorithms used are:
Temporal difference learning
Q learning
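As a small hedged sketch of the Q learning idea (the states, actions, rewards, and learning-rate values are assumptions for illustration only), the tabular update rule nudges a stored value toward the received reward plus the discounted best value of the next state.
# Minimal tabular Q-learning update sketch on a made-up two-state, two-action problem.
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor

# Q-table: Q[state][action], initialised to zero for 2 states and 2 actions.
Q = [[0.0, 0.0], [0.0, 0.0]]

def q_update(state, action, reward, next_state):
    # Move Q(state, action) toward reward + discounted best future value.
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# Example: in state 0, taking action 1 yields reward 1.0 and leads to state 1.
q_update(state=0, action=1, reward=1.0, next_state=1)
print(Q)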

Decision Trees
A decision tree is a predictive model that maps item outcomes to input data.
A popular technique, models generally fall under these two types:
Classification tree – the dependent variable takes a finite set of values. The
feature rules are represented by branches leading to class labels, and the
outcome class labels are represented by leaves.
Regression tree – the dependent variable takes a continuous value.
As an example, we’ll use data that represents whether a person should play a
game of tennis, based on weather, wind intensity, and humidity:
Play   Wind   Humidity   Outlook
No     Low    High       Sunny
No     High   Normal     Rain
Yes    Low    High       Overcast
Yes    Weak   Normal     Rain
Yes    Low    Normal     Sunny
Yes    Low    Normal     Overcast
Yes    High   Normal     Sunny
If you were to use this data, the target variable being Play and the rest as
independent variables, you would get a decision tree model with a structure
like this:

(Decision tree diagram: the root node splits on Outlook into Sunny, Overcast,
and Rain branches, which then split on Humidity and Wind to reach the Play
outcome.)
Now, when we get new data, it will traverse the tree to reach the conclusion,
which is the outcome.
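As a hedged sketch of how such a tree could be built in code (a minimal example, not the author's own implementation), the snippet below one-hot encodes the weather table with pandas and fits scikit-learn's DecisionTreeClassifier on it.
# Minimal decision tree sketch on the tennis/weather table above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Rain", "Overcast", "Rain", "Sunny", "Overcast", "Sunny"],
    "Humidity": ["High", "Normal", "High", "Normal", "Normal", "Normal", "Normal"],
    "Wind":     ["Low", "High", "Low", "Weak", "Low", "Low", "High"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"],
})

# One-hot encode the categorical features so the tree can split on them.
X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]])
y = data["Play"]

tree = DecisionTreeClassifier()
tree.fit(X, y)

# Classify a new day; reindex keeps the columns aligned with the training features.
new_day = pd.get_dummies(pd.DataFrame(
    {"Outlook": ["Sunny"], "Humidity": ["Normal"], "Wind": ["Low"]}
)).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_day))   # likely ['Yes'] for this combination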
Decision trees are very simple and have several advantages:
1. They are easy to communicate and to visualize
2. Odd patterns can be found. Let’s say that you were looking
for a voting pattern between two parties up for election and
your data includes income, education, gender, and age. You
might see a pattern whereby people with higher education
have low incomes and vote for a certain party.
3. Minimal assumptions are made on the data

But there are also disadvantages:


1. They have a high rate of classification errors when the
training set is small compared to the number of classes.
2. When the number of dependent variables and the size of
the data increase, the computation grows exponentially.
3. For a certain construction algorithm, discrete data is
required.

Linear Regression
Linear regression is a modeling approach that models the linear relationship
between one or more independent variables X and a scalar dependent
variable y:
y = Xβ + ε
Let’s use an example to understand this. Below, you can see a list of student
heights and weights:
Height (inches)   Weight (pounds)
50                125
58                135
63                145
68                144
70                170
79                165
84                171
75                166
65                160
Put this data through a linear regression function (discussed later), using the
weight as the dependent variable y and the height as the independent
variable x, and you would get this equation:
y = 1.405405405x + 57.87687688
If that equation were plotted as a line with an intercept of 57.88 and a
slope of 1.4 over a scatter plot that has Height on the x-axis and Weight
on the y-axis, you would see that the regression algorithm has tried
to find the equation that gives the least error when the weight of the
student is predicted.
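For reference, a quick way to reproduce those coefficients is NumPy's polyfit; this is only a sketch using the height/weight table above, not the specific regression function the author refers to.
# Fit a straight line (degree-1 polynomial) to the height/weight data.
import numpy as np

height = np.array([50, 58, 63, 68, 70, 79, 84, 75, 65])           # independent variable x
weight = np.array([125, 135, 145, 144, 170, 165, 171, 166, 160])  # dependent variable y

slope, intercept = np.polyfit(height, weight, 1)
print(slope, intercept)      # roughly 1.4054 and 57.8769, matching the equation above

# Predict the weight of a student who is 72 inches tall.
print(slope * 72 + intercept)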

Logistic Regression
Another of the supervised learning techniques, logistic regression is classed
as a 'probabilistic classification model'. These tend to be used mostly for
predicting binary outcomes, such as whether a customer is going to move to a
competitor.
As the name indicates, a logistic function is used in logistic regression.
Logistic functions are useful because they take any value from negative to
positive infinity and output a value between 0 and 1, which means the output
can be interpreted as a probability. The logistic function below generates a
predicted value between 0 and 1 based on an independent variable x:

F(x) = 1 / (1 + e^(-x))

Here x is the independent variable while F(x) is the dependent value.
If you were to plot the logistic function from negative to positive
infinity, the outcome would be an S-shaped graph.
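Here is a tiny sketch of that S-shaped curve in code; the sample inputs are arbitrary values chosen only to show how the output is squeezed into the 0 to 1 range.
# Evaluate the logistic (sigmoid) function at a few arbitrary points.
import math

def logistic(x):
    # Map any real number to a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

for x in [-6, -2, 0, 2, 6]:
    print(x, round(logistic(x), 4))
# Large negative inputs approach 0, large positive inputs approach 1.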
We can apply logistic regression in these scenarios:
1. Deriving a score for the propensity of a retail customer
buying a brand new product
2. How likely it is that a transformer fails by using the data
from the sensor associated with it
3. How likely it is that a user clicks on an ad on a given
website based on previous user behavior

Naïve Bayes Classifier


This is another probabilistic classifier and is based on the Bayes theorem. The
reason it is called naïve is that it makes a very strong assumption that the
features are independent of one another. The Bayes theorem is:

P(A|B) = P(B|A) P(A) / P(B)

Breaking this down:
A and B are both events
P(A) and P(B) are the probabilities of A and B on their own
P(A|B) is the probability of A given that B is true (a conditional probability)
P(B|A) is the probability of B given that A is true
The naïve Bayes formula, for a class Ak given evidence B, is:

P(Ak|B) = P(Ak) P(B|Ak) / [P(A1) P(B|A1) + P(A2) P(B|A2) + … + P(An) P(B|An)]


We’ll use an example to understand the formula by solving the equation.
Tomorrow, Suzanne has her engagement party outdoors in Fort Lauderdale.
In the past couple of years, Fort Lauderdale has only seen six days of rain per
year. However, rain is forecast for the day of the party. In 80% of forecasts,
the weatherman is accurate, but he is inaccurate 20% of the time. We want to
work out what probability there is of rain on the day of the engagement. We
can use events like the following to base the calculation on:
A1 – the event that it rains on the day of the party
A2 – the event that it does not rain
B – the event that the weatherman predicts rain
Below are the probabilities, based on previous events:
P(A1) = 6/365 = 0.0164 – it rains on six days every year
P(A2) = 359/365 = 0.9836 – it doesn't rain on 359 days of the year
P(B|A1) = 0.8 – 80% of the time, rain predicted by the weatherman actually
happens
P(B|A2) = 0.2 – 20% of the time, the weatherman predicts rain and it doesn't
happen
We use the following formula to calculate the naïve Bayes probability:
P(A1|B) = P(A1) P(B|A1) / [P(A1) P(B|A1) + P(A2) P(B|A2)]
P(A1|B) = (0.0164 × 0.8) / (0.0164 × 0.8 + 0.9836 × 0.2) ≈ 0.063
This calculation states that, although rain was predicted, the Bayes theorem
says there is only about a 6% chance that it will actually rain.
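The same arithmetic can be checked with a few lines of Python; this is just a verification of the numbers above, not part of the original example.
# Verify the rain example with the Bayes formula.
p_rain = 6 / 365          # P(A1): prior probability of rain on any given day
p_dry = 359 / 365         # P(A2): prior probability of no rain
p_pred_given_rain = 0.8   # P(B|A1): forecast says rain and it does rain
p_pred_given_dry = 0.2    # P(B|A2): forecast says rain but it stays dry

posterior = (p_rain * p_pred_given_rain) / (
    p_rain * p_pred_given_rain + p_dry * p_pred_given_dry
)
print(round(posterior, 3))   # about 0.063, i.e. roughly a 6% chance of rain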
We see naïve Bayes used quite a lot in email filtering where the probability of
spam is determined by computing the instance of every word in the email.
The model learns from previous email history, marking some mail as spam
and helping it to determine what is and isn’t spam email.

K-Means Clustering
This is an unsupervised learning technique used to partition data of n
observations into K buckets (clusters) of similar observations. It is known
as a clustering algorithm because it computes the mean of the features used
to cluster the observations, and that mean value becomes the center of the
cluster. An example would be segmenting customers based on the average
amount they spend per transaction or the average number of products they
purchase in a year. K refers to how many clusters there are, so the technique
builds the clusters around the K means.
So, how is K chosen? If we know what we are looking for, or how many clusters
we want or expect to see, we set K to this number before the algorithm starts
computing.
If we don't know the number, things take a bit longer to complete and will
require some trial and error; for example, we would need to try K = 4, 5,
and 6 until the clusters start to make sense for the domain.
K-means clustering is used quite a lot in market segmentation, computer
vision, geostatistics, astronomy, and in agriculture. We’ll talk more about it
later.
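As a small hedged sketch of the idea (the customer numbers are invented, not a case study from this chapter), scikit-learn's KMeans can segment customers described by average spend per transaction and purchases per year.
# Minimal K-means sketch: [average spend per transaction, purchases per year].
from sklearn.cluster import KMeans

customers = [
    [20, 5], [22, 4], [25, 6],        # low spenders
    [80, 20], [85, 18], [90, 22],     # mid spenders
    [200, 40], [210, 45], [195, 42],  # high spenders
]

# We expect three segments, so K = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(customers)

print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the mean (center) of each cluster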

Hierarchical Clustering
Another unsupervised learning technique, this involves building a hierarchy of
clusters from the observations. Data is grouped at various levels of a
dendrogram, or cluster tree. It is not one single set of clusters, but a
hierarchy made up of multiple levels, where clusters on one level are joined
to form the clusters on the next. This gives you the choice of working out
which level of clustering is right for your problem.
There are two fundamental types of hierarchical clustering:
Agglomerative hierarchical clustering – a bottom-up method; each observation
begins in a cluster of its own and pairs of clusters are merged as they rise
through the hierarchy
Divisive hierarchical clustering – a top-down method; all observations start
in one cluster and clusters are split as they move down the hierarchy
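A minimal sketch of the agglomerative (bottom-up) approach with SciPy, assuming the same invented customer data as the K-means sketch: linkage builds the hierarchy of merges and fcluster cuts the tree at a chosen number of clusters.
# Minimal agglomerative hierarchical clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

customers = np.array([
    [20, 5], [22, 4], [25, 6],
    [80, 20], [85, 18], [90, 22],
    [200, 40], [210, 45], [195, 42],
])

# Build the full hierarchy of merges using Ward linkage.
Z = linkage(customers, method="ward")

# Cut the tree so that three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # cluster id for each customer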
Chapter 15 - Python Crash Course
Before we dig deeper into data science and working with Python and Jupyter,
you should understand the basics of programming. If you already grasp the
concepts or you have some experience with programming in Python or any
other language, feel free to skip this chapter. However, even if you already
possess the basic knowledge, you might want to refresh your memory.
In this chapter we're going to discuss basic programming concepts and go
through simple examples that illustrate them. It is recommended that you put
into practice what you read as soon as possible, even if at first you use cheat
sheets. The goal here is to practice, because theory is not enough to solidify
what you learn.
For the purpose of this chapter, we will not use Jupyter or any other IDE that
is normally used when programming. All we need is a shell where we put our
code to the test and exercise. To do that, just head to Python’s main website
here https://www.python.org/shell/ and you’ll be able to try everything out
without installing anything on your computer.

Programming Naming Conventions


Before we start discussing programming concepts such as strings, functions,
conditional statements and loops, you need to learn how to write clean, easy
to understand code. Readability and consistency are vital when working on a
project, especially if others have to read your work. For this reason, you need
to learn about programming naming conventions. A programmer or a data
analyst should be able to look through your code and understand it at a
glance. It should be self-explanatory with intuitive variables that make it clear
to the reader what their purpose is.
Do not ignore this aspect of programming. Someday you might write a
program, hit a stump, abandon it for a short while and when you come back
to it, it will look like gibberish due to variables that have no meaning to you.
With that being said, here are the most commonly used naming conventions
used by programmers and data scientists:

1. Pascal Case: Capitalize the first letter of each word without


using any spaces or symbols between them. A variable name
written using Pascal Case should look something like this:
PascalCaseVariable, MyComplexPassword,
CarManufacturer.
2. Camel Case: This is almost the same as Pascal case, except
the first word starts with a lowercase letter. A variable name
written using Camel Case should look something like this:
camelCaseVariable, myInterface, userPassword.
3. Snake Case: This naming convention is often used to clearly
illustrate multi-word variables by separating each word with
an underscore sign. Snake Case can be applied together with
Camel Case or Pascal Case. The only thing that
differentiates it from others is word separation. A variable
name written using Snake Case should look something like
this: my_snake_case_variable, this_is_also_snake_case,
my_account_password. Reading variable names might seem
easier on the eye by using this naming convention. As a side
note, Snake Case is also great for naming your project
folders and files and it is often used for that purpose.
Keep in mind that there is no such thing as the “best” naming convention. It
all depends on your preference, however you should always be consistent
when writing your code. Try not to mix different naming conventions. Pick
one style and stick to it throughout your project. Give your variables a
descriptive name that allows the reader to understand what it does and then
write them by using one of the mentioned conventions.

Data Types
Understanding basic data types is essential to the aspiring data scientist.
Python has several in-built data types, and in this section we will discuss each
one of them. Remember to follow along with the exercises and try to use the
bits of knowledge you gain to come up with your own lines of code.
Here are the most important data types we will be discussing: numbers,
strings, dictionaries, lists, and tuples. Start up your Python shell and let’s
begin!

Numbers
In programming and mathematics, there are several types of numbers, and
you need to specify them in Python when writing code. You have integers,
floats, longs, complex numbers, and a few more. The ones you will use most
often, however, are integers and floats.
An integer (written as “int” in Python) is a positive or negative whole
number. That means that when you declare an integer, you cannot use a
number with decimal points. If you need to use decimals, however, you
declare a float.
In Python, there are several mathematical operators that you can use to make
various calculations using integers and floats. The arithmetic operators are for
adding (+), subtracting (-), multiplication (*), division (/), modulus (%), floor
division (//) and exponent (**). There are also comparison operators such as
greater than (>), less than (<), equal to (==), not equal to (!=), greater than or
equal to (>=) and less than or equal to (<=). These are the basic operators,
and they are included with any Python installation. There’s no need to install
a package or a module for them. Now let’s try a simple exercise to put some
of these operators in action.
x = 100
y = 25
print (x + y)
This simple operation will print the result of x + y. You can use any of the
other arithmetic operators this way. Play around with them and create
complex equations if you want to. The process is the same. Now let’s look at
an example of comparison operators:
x = 100
y = 25
print (x > 100)
The result you will see is “false” because our declared x variable is not
greater than y. Now let’s move on to strings!
Strings
Strings are everything that is in text format. You can declare anything as
simple textual information, such as letters, numbers, or punctuation signs.
Keep in mind that numbers written as strings are not the same as numbers
used as variables. To write a string, simply type whatever you want to type
in-between quotation marks. Here’s an example
x = "10"
In this case, x is a string and not an integer.
So what are strings for? They are used frequently in programming, so let’s
see some of the basic operations in action. You can write code to determine
the character length of a line of text, to concatenate, or for iteration. Here’s an
example:
len("hello")
The result you get is 5, because the “len” function is used to return the length
of this string. The word “hello” is made of 5 characters, therefore the
calculation returns 5 to the console. Now let’s see how concatenation looks.
Type the following instruction:
'my' + 'stubborn' + 'cat'
The result will be mystubborncat, without any spaces in between the words.
Why? Because we didn’t add any spaces inside the strings. A space is
considered as a character. Try writing it like this:
'my ' + 'stubborn ' + 'cat'
Now the result will be “my stubborn cat”. By the way, did you realize we
changed the quotation marks to single quotes? The code still performed as
intended, because Python can’t tell the difference between the two. You can
use both double quotes and single quotes as you prefer, and it will have no
impact on your code.
Now let's see an example of string iteration. Type:
movieTitle = "Star Wars"
for c in movieTitle:
    print(c)

These lines of code will return all individual characters in the declared
string. We first declare a variable called "movieTitle" to which we assign
"Star Wars" as its string. Next we print each character within "movieTitle".
There are other string operations that you can perform with Python, however
for the purposes of this book it’s enough to stick to the basics. If you wish,
you can always refer to Python’s online documentation and read all the
information they have on strings. Next up, let’s discuss lists!
Lists
Lists are incredibly useful in programming, and you will have to use them
often in your work. If you are familiar with object oriented programming
languages, Python lists are in fact identical to arrays. You can use them to
store data, manipulate it on demand, and store different objects in them and
so on. Using them in Python is simple, so let’s first see how to make a new
list. Type the following line:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The list is created by declaring a series of objects enclosed by square
brackets. As we already mentioned, lists don't have to contain only one data
type. You can store any kind of information in them. Here's another example
of a list:
myBook = ["title", "somePages", 1, 2, 3, 22, 42, "bookCover"]
As you can see, we are creating a list that contains both strings and numbers.
Next up, you can start performing all the operations you used for strings.
They work the same with lists. For instance, here’s how you can concatenate
the two previous lists we created:
x + myBook
Here’s the result:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'title', 'somePages', 1, 2, 3, 22, 42,
'bookCover']
Try out any other operation yourself and see what happens. Explore and
experiment with what you already know.

Dictionaries
Dictionaries are similar to lists, however you need to have a key that is
associated with the objects inside. You use the key to access those objects.
Let’s explain this through an example in order to avoid any confusion. Type
the following lines:
dict = {'weapon' : 'sword', 'soldier' : 'archer'}
dict['weapon']
As you can see, in the first line we declared a dictionary. It is defined
between two curly brackets, and inside it contains objects with a key assigned
to each of them. For instance, we have object “sword” that has the “weapon”
as its attributed key. In order to access the “sword” we have to call on its key,
which we do in the second line of code. Keep in mind that the “weapon” or
“soldier” keys are only examples. Keys don’t have to be strings. You can use
anything.

Tuples
Tuples are also similar to lists, however their objects can’t be changed after
they are set. Let’s see an example of a tuple and then discuss it. Type the
following line:
x = (1, 2, 'someText', 99, [1, 2, 3])
The tuple is in between parentheses and it contains three data types. There are
three integers, a string, and a list. You can now perform any operation you
want on the tuple. Try the same commands you used for lists and strings.
They will work with tuples as well because they are so similar. The only real
difference is that once you declare the elements inside a tuple, you cannot
modify them through code. If you have some knowledge about object
oriented programming, you might notice that Python tuples are similar to
constants.

Python Code Structure


Now that you know some basic data types, we can discuss Python’s code
structure before progressing with statements, loops, and more. Python uses
whitespace or indentation in order to organize code blocks into readable and
executable programs. If you do not respect the indentation, you will receive
an indentation error. Other languages such as those that are heavily based on
the C use curly brackets to mark the beginning and end of a code block. Let’s
see this difference through an example. This is what a C code block would
look like:
if (x == 100)
{
    printf("x is 100");
    printf("moving on");
}
printf("there's nothing else to declare");
As you can see, we have curly braces to mark the borders of a code block.
There’s also another key difference. In Python, there’s no need to use
semicolons to mark the end of a line of code. Here’s the same block of code,
but written in Python:
if x == 100 :
    print("x is 100")
    print("moving on")
print("there's nothing else to declare")
Notice how in Python the contents of the “if” statement are indented. That’s
what tells the program that those lines belong to the statement above. If you
type in Python something like this:
if x == 100 :
    print("x is 100")
        print("moving on")
print("there's nothing else to declare")
The result will look like this:
IndentationError: unexpected indent
So whenever you write code in Python, make sure you make the appropriate
use of whitespace to define the code blocks. Simply press the “tab” key on
your keyboard to indent as needed.
Another thing worth mentioning here is the use of parentheses in the “if”
statement that we declared in our example. In the C version of the code we
used them, however in the Python version we ignored them. While in Python
you don’t have to use those parentheses, you should. They are optional, but
they make the code easier to read and understand. It is widely accepted in the
programming world that using them is best practice, so you should learn how
to write clean code from the beginning.

Conditional Statements
This is the part when things start to get fun. Conditional statements are used
to give your program some ability to think and decide on their own what they
do with the data they receive. They are used to analyze the condition of a
variable and instruct the program to react based on values. We already used
the most common conditional statement in the example above.
Statements in Python programming are as logical as those you make in the
real world when making decisions. “If I’m sick tomorrow, I will skip school,
else I will just have to go” is a simple way of describing how the “if”
statement above works. You tell the program to check whether you are sick
tomorrow. If it returns a false value, because you aren’t sick, then it will
continue to the “else” statement which tells you to go to school because you
aren’t sick. “If” and “if, else” conditional statements are a major part of
programming. Now let’s see how they look in code:
x = 100
if (x < 100):
    print("x is small")
This is the basic “If” statement with no other conditions. It simply examines
if the statement is true. You declared the value of x to be 100. If x is smaller
than 100, the program will print “x is small”. In our case, the statement is
false and nothing will happen because we didn’t tell the program what to do
in such a scenario. Let’s extend this code by typing:
x = 100
if (x < 100):
    print("x is small")
else:
    print("x is big")
print("This part will be returned whatever the result")
Now that we introduced the “else” statement, we are telling the program what
to execute if the statement that “x is smaller than 100” is not true. At the end
of the code block, we also added a separate line outside of the “if else”
statement and it will return the result without considering any of the
conditions. Pay special attention to the indentation here. The last line is not
considered as part of the “if” and “else” statements because of the way we
wrote it.
But what if you want your program to check for several statements and do
something based on the results? That’s when the “elif” conditional comes in.
Here’s how the syntax would look:
if (condition1):
    add a statement here
elif (condition2):
    add another statement for this condition
elif (condition3):
    add another statement for this condition
else:
    if none of the conditions apply, do this
As you may have noticed, we haven’t exactly used code to express how the
“elif” statement is used. What we did instead was write what is known as
pseudo code. Pseudo code is useful when you quickly want to write the logic
of your code without worrying about using code language. This makes it
easier to focus on how your code is supposed to work and see if your thinking
is correct. Once you write your pseudo code and decide it’s the correct path
to take, you can replace it with actual code. Here’s how to use elif with real
code:
x = 10
if (x > 10):
    print("x is larger than ten")
elif x < 4:
    print("x is a smaller number")
else:
    print("x is not that big")
Now that you know how conditionals work, start practicing. Use strings, lists
and operators, followed by statements that use that data. You don’t need
more than basic foundations to start programming. The sooner you nudge
yourself in the right direction, the easier you will learn.

Logical Operators
Sometimes you need to make comparisons when using conditional
statements, and that’s what logical operators are for. There are three types:
and, or, and not. We use the “and” operator to receive a certain result if both
of them are checked to be true. The “or” operator will return a result if only
one of the specified statements are true. Finally, the “not” operator is used to
reverse the result.
Let’s see an example of a logical operator used in a simple “if” statement.
Type the following code:
y = 100
if y < 200 and y > 1:
    print("y is smaller than 200 and bigger than 1")
The program will check if the value of y is smaller than 200, as well as bigger
than 1 and if both statements are true, a result will be printed.
Introduce logical operators when you practice your conditionals. You can
come up with many operations because there’s no limit to how many
statements you can make or how many operators you use.
Loops
Sometimes we need to tell the program to repeat a set of instructions every
time it meets a condition. To achieve this, we have two kinds of loops, known
as the “for” loop and the “while” loop. Here’s an example of a “for” loop:
for x in range(1, 10):
    print(x)
In this example, we instruct our program to keep repeating until every value
of x from 1 up to (but not including) 10 is printed. When the printed value is
2, for instance, the program checks whether x is still within the range(1, 10)
and, if the condition is true, it prints the next number, and the next, and so on.
Here’s an example with a string:
for x in "programming":
    print(x)
The code will be executed repeatedly until all characters inside the word
“programming” are printed.
Here’s another example using a list of objects:
medievalWeapons = ["swords", "bows", "spears", "throwing axes"]
for x in medievalWeapons:
    print(x)
In this case, the program will repeat the set of instructions until every object
inside the list we declared is printed.
Next up we have the “while” loop that is used to repeat the code only as long
as a condition is true. When a statement no longer meets the condition we set,
the loop will break and the program will continue the next lines of code after
the loop. Here’s an example:
x = 1
while x < 10:
    print(x)
    x += 1
First we declare that x is an integer with the value of 1. Next we instruct the
program that while x is smaller than 10 it should keep printing the result.
However, we can't end the loop with just this amount of information. If we
left it at that, we would create an infinite loop, because x would always be 1
and would therefore forever be smaller than 10. The "x += 1" at the end
tells the program to increase x's value by 1 every single time the loop is
executed. This means that at one point x will no longer be smaller than 10,
and therefore the statement will no longer be true. The loop will finish
executing, and the rest of the program will continue.
But what about that risk of running into infinite loops? Sometimes accidents
happen, and we create an endless loop. Luckily, this is preventable by using a
“break” statement at the end of the block of code. This is how it would look:
while True:
    answer = input("Type command:")
    if answer == "Yes":
        break
The loop will continue to repeat until the correct command is used. In this
example, you break out of the loop by typing “Yes”. The program will keep
running the code until you give it the correct instruction to stop.

Functions
Now that you know enough basic programming concepts, we can discuss
making your programs more efficient, better optimized, and easier to analyze.
Functions are used to reduce the number of lines of code that are actually
doing the same thing. It is generally considered best practice to not repeat the
same code more than twice. If you have to, you need to start using a function
instead. Let’s take a look at what a function looks like in code:
def myFunction():
    print("Hello, I'm your happy function!")
We declare a function with the “def” keyword, which contains a simple string
that will be printed whenever the function is called. The defined functions are
called like this:
myFunction()
You type the name of function followed by two parentheses. Now, these
parentheses don’t always have to stay empty. They can be used to pass
parameters to the function. What’s a parameter? It’s simply a variable that
becomes part of the function’s definition. Let’s take a look at an example to
make things clearer:
def myName(firstname):
    print(firstname + " Johnson")

myName("Andrew")
myName("Peter")
myName("Samuel")
In this example we use the parameter “firstname” in the function’s definition.
We then instruct the function to always print the information inside the
parameter, plus the word “Johnson”. After defining the function, we call it
several times with different “firstname”. Keep in mind that this is an
extremely crude example. You can have as many parameters as you want. By
defining functions with all the parameters you need, you can significantly
reduce the amount of code you write.
Now let’s examine a function with a set default parameter. A default
parameter will be called when you don’t specify any other information in its
place. Let’s go through an example for a better explanation. Nothing beats
practice and visualization. Type the following code:
def myHobby(hobby="leatherworking"):
    print("My hobby is " + hobby)

myHobby("archery")
myHobby("gaming")
myHobby()
myHobby("fishing")
These are the results you should receive when calling the function:
My hobby is archery
My hobby is gaming
My hobby is leatherworking
My hobby is fishing
Here you can see that the function without a parameter will use the default
value we set.
Finally, let’s discuss a function that returns values. So far our functions were
set to perform something, such as printing a string. We can’t do much with
these results. However, a returned value can be reassigned to a variable and
used in more complex operations. Here’s an example of a return function:
def square(x):
    return x * x

print(square(5))
We defined the function and then we used the “return” command to return the
value of the function, which in this example is the square of 5.
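To show why returning values matters, here is a small follow-up sketch (the names “area” and “double_area” are just for illustration) in which the returned value is stored in a variable and reused in a later calculation:
def square(x):
    return x * x

# Store the returned value so it can be reused later.
area = square(5)
double_area = area * 2

print("Area:", area)
print("Double the area:", double_area)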

Code Commenting
We discussed earlier that maintaining a clear, understandable code is one of
your priorities. On top of naming conventions, there’s another way you can
help yourself and others understand what your code does. This is where code
commenting comes in to save the day.
Few things are worse than abandoning a project for a couple of weeks and
coming back to it only to stare at it in confusion. In programming, you
constantly evolve, so the code you thought was brilliant a while back will
seem like it’s complete nonsense. Luckily, Python gives you the ability to
leave text-based comments anywhere without having any kind of negative
effect on the code. Comments are ignored by the program, and you can use
them to briefly describe what a certain block of code is meant to achieve. A
comment in Python is marked with a hashtag (#).
# This is my comment.
Python disregards everything written after the hash symbol on that line. You can
place a comment on its own line before a block of code, or at the end of a line of
code. Here’s an example of this in action:
print("This is part of the program and will be executed")  # This is a comment
Comments don’t interfere with the program in any way, but you should pay
attention to how you express yourself and how you write the comment lines.
First of all, comments should not be written in an endless line - you should
break them up into several lines to make them easy to read. Secondly, you
should only use them to write a short, concise description. Don’t be more
detailed than you have to be.
# Here’s how a longer comment should look:
# Keep it simple, readable and to the point
# without describing obvious data types and variables.
Get used to using comments throughout your code early on. Other
programmers or data scientists will end up reading it someday, and comments
make it much easier for them to understand what you wanted to accomplish.
Every programmer has a different way of solving a problem, and not
everyone thinks the same way, even if they arrive at the same conclusion. In
the long run, good comments will save you a lot of headaches, and those who
read your code may hate you a little less.
Chapter 16 - Unsupervised Learning
Unsupervised machine learning uses unlabeled data, meaning data scientists
don’t know the output yet. The algorithm must discover patterns on its own,
finding structure in data where none is immediately observable. The model
segments the data by itself, looking for patterns and structure in an otherwise
unlabeled and unrecognizable mass of information. Unsupervised learning lets
us find patterns that would stay hidden otherwise; massive collections of data
often contain trends that would be impossible to uncover by sifting through
everything manually.
This is good for examining the purchasing habits of consumers so that you
can group customers into categories based on patterns in their behavior. The
model may discover that there are similarities in buying patterns between
different subsets of a market, but if you didn’t have your model to sift
through these massive amounts of complicated data, you would never even
realize the nature of these patterns. The beauty of unsupervised learning is the
possibility of discovering patterns or characteristics in massive sets of data
that you would not be able to identify without the help of your model.
A good example of unsupervised learning is fraud detection. Fraud can be a
major problem for financial companies, and with large amounts of daily
users, it can be difficult for companies to identify fraud without the help of
machine learning tools. Models can learn how to spot fraud as the tactics
change with technology. If you want to deal with new, unknown fraud
techniques, then you will need to employ a model that can detect fraud under
unique circumstances.
In the case of detecting fraud, more data is better. Fraud detection
services must use a range of machine learning models, both supervised and
unsupervised, to combat fraud effectively. It’s estimated that there will be
about $32 billion in fraudulent credit card activity in 2020 alone. Models for
fraud detection classify the output (credit card transactions) as legitimate or
fraudulent.
They can classify based on a feature like time of day or location of the
purchase. If a merchant usually makes sales around $20, and suddenly has a
sale for $8000 from a strange location, then the model will most likely
classify this transaction as fraudulent.
The challenge of using machine learning for fraud detection is the fact that
the vast majority of transactions are not fraudulent. If fraudulent transactions
made up any significant share of the total, credit cards would not be a viable
industry. The percentage of fraudulent card transactions is so small that models
trained on the raw data can end up skewed toward labeling everything as
legitimate. The $8000 purchase from a strange location is suspicious, but it is
more likely to be the result of a traveling cardholder than fraudulent activity.
Unsupervised learning makes it easier to identify suspicious buying patterns
such as strange shipping locations and sudden jumps in user reviews.

Clustering
Clustering is a subgroup of unsupervised learning. It is the task of
grouping similar things together. When we use clustering, we can identify
characteristics and sort our data based on those characteristics. If we are using
machine learning for marketing, clustering can help us identify similarities
within groups of customers or potential clients. It can sort customers into
categories we might never have come up with on our own, and it can also help
when you are working with a large number of variables.
K-Means clustering
K-means clustering works similarly to K-nearest neighbors: you pick a
number for k to decide how many groups you want to see, then cluster and
repeat until the clusters are clearly defined.
Your data is grouped around centroids, which are the points on your graph
around which the data will be clustered. You start with k of them, placed at
random. Once you introduce your data to the model, each data point is assigned
to the category of its closest centroid, measured by Euclidean distance. Then
you take the average value of the data points assigned to each centroid and
move that centroid to the average. Keep repeating this process until your
results stay the same and you have consistent clusters. Each data point is only
ever assigned to one cluster.
In each round, finding the average values for x and y within a cluster gives you
the new position of that cluster’s centroid. K-means clustering can help you
identify previously unknown or overlooked patterns in the data.
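As a rough sketch of how this looks in practice, here is a minimal example using the KMeans class from the scikit-learn library; the two-dimensional points and the choice of k = 2 are made up purely for illustration:
import numpy as np
from sklearn.cluster import KMeans

# Made-up two-dimensional data points.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [10, 9]])

# Ask for two clusters; the centroids start from random positions.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points)

print(model.labels_)           # the cluster each point was assigned to
print(model.cluster_centers_)  # the final centroid positions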
Choose the value for k that is optimal for the number of categories you want
to create. Ideally, you should have more than 3. However, the advantage
associated with adding more clusters diminishes the higher the number of
clusters you have. The higher the value for k that you choose, the smaller and
more specific the clusters are. You wouldn’t want to use a value for k that is
the same as the number of data points because each data point would end up
in its own cluster.
You will have to know your dataset well and use your intuition to guess how
many clusters are appropriate, and what sort of differences that will be
present. However, our intuition and knowledge of the data are less helpful
once we have more than just a few potential groups.
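One common way to support that intuition, sketched below with made-up data, is to fit the model for several values of k and watch the total within-cluster distance (which scikit-learn exposes as inertia_); once the improvement levels off, adding more clusters is no longer buying much:
import numpy as np
from sklearn.cluster import KMeans

# Made-up data points, reused from the earlier sketch.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [10, 9]])

# Fit K-means for k = 1 through 5 and print the within-cluster distance.
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    print(k, round(model.inertia_, 2))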
Dimensionality Reduction
When you are using dimensionality reduction, you are trimming down data to
remove unwanted features. Simply put, you're scaling down the number of
variables in a dataset.
When we have a lot of variables in our model, then we run the risk of having
dimensionality problems. Dimensionality problems are problems that are
unique to models with large datasets and can affect prediction accuracy.
When we have many variables, we need much larger samples in order to
create our model. With that many variables, it’s hard to collect enough data to
cover all the possible combinations needed to create a well-fitting model.
If we use too many variables, then we can also encounter overfitting.
Overfitting is the main problem which would cause a data scientist to
consider dimensionality reduction.
We must identify data that we don’t need, or that is irrelevant, and remove it.
If we have a model predicting someone’s income, do we need a variable that
tells us what their favorite color is? Probably not. We can drop it from our dataset.
Usually, it's not that easy to tell when a variable should be dropped. There are
some tools we can use to determine which variables aren’t as important.
Principal Component Analysis is a method of dimensionality reduction. We
take the old set of variables and convert them into a new, smaller set that
captures as much of the variation in the data as possible. The new variables
we’ve created are called principal components. There is a tradeoff between
reducing the number of variables and maintaining the accuracy of your model.
We can also standardize the values of our variables, making sure they are all
on the same relative scale so that we don’t inflate the importance of any one
variable. For example, a variable measured as a probability between 0 and 1
would otherwise be swamped by a variable measured in whole numbers above
100.
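Here is a minimal sketch of both steps together, assuming scikit-learn is available and using made-up numbers: the variables are standardized first, and then Principal Component Analysis reduces four variables down to two principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Made-up dataset: five observations of four variables on very
# different scales (a probability next to large whole numbers).
data = np.array([[0.2, 150, 3.1, 40],
                 [0.4, 320, 2.9, 38],
                 [0.9, 180, 3.5, 45],
                 [0.1, 400, 2.7, 35],
                 [0.6, 260, 3.3, 42]])

scaled = StandardScaler().fit_transform(data)   # put everything on the same scale
pca = PCA(n_components=2)                       # keep two principal components
components = pca.fit_transform(scaled)

print(components)                      # the new, smaller set of variables
print(pca.explained_variance_ratio_)   # share of the variation each one keeps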
Linear Discriminant Analysis is another method of dimensionality reduction,
where we combine features or variables rather than get rid of them altogether.
Kernel Principal Component Analysis is a third method of dimensionality
reduction. Here, too, the variables are converted into a new set, but the
transformation is non-linear, which can give us even better insight into the true
structure of the data than the original variables.
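If you want to experiment, scikit-learn includes a KernelPCA class; the following is only a sketch, with a made-up circular dataset and an RBF kernel chosen as one example of a non-linear transformation:
import numpy as np
from sklearn.decomposition import KernelPCA

# Made-up non-linear data: points arranged roughly in a circle.
angles = np.linspace(0, 2 * np.pi, 20)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

# The RBF kernel lets the new components capture non-linear structure.
kpca = KernelPCA(n_components=2, kernel="rbf")
new_features = kpca.fit_transform(circle)
print(new_features[:3])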
Chapter 17 - Neural Networks
Neural networks are a form of machine learning that is referred to as deep
learning. It’s probably the most advanced method of machine learning, and
truly understanding how it works might require a Ph.D. You could write an
entire book on machine learning’s most technical type of model.
Neural networks are computer systems designed to mimic the path of
communication within the human brain. In your body, you have billions of
neurons that are all interconnected and travel up through your spine and into
your brain. They are attached by root-like nodes that pass messages through
each neuron one at a time all the way up the chain until it reaches your brain.
While there is no way to fully replicate this with a computer yet, we take the
principal idea and apply it to computer neural networks to mimic the way a
human brain learns: recognizing patterns and inferring information from newly
discovered information.
In the case of neural networks, as with all our machine learning models,
information is processed as numerical data. By feeding the network numerical
values, we give it the power to use algorithms to make predictions.
Just as with the neurons in the brain, data starts at the top and works its way
down, being first separated into nodes. The neural network uses nodes to
communicate through each layer. A neural network is made up of three
parts: input, hidden, and output layers.
In the picture below, we have a visual representation of a neural network,
with the circles being every individual node in the network. On the left side,
we have the input layer; this is where our data goes in. After the data passes
through the input layer, it gets filtered through several hidden layers. The
hidden layers are where data gets sorted by different characteristics and
features. The hidden layers look for patterns within the data set. The hidden
layers are where the ‘magic' is happening because the data is being sorted by
patterns that we probably wouldn’t recognize if we sorted it manually. Each
connection into a node carries a weight, which helps determine the significance
of the feature being sorted.
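To make layers, nodes, and weights a little more concrete, here is a tiny hand-written sketch of one pass through a network with two inputs, a single hidden layer of three nodes, and one output; the weights are random stand-ins rather than a trained model:
import numpy as np

rng = np.random.default_rng(0)

inputs = np.array([0.5, 0.8])        # input layer: two features
w_hidden = rng.normal(size=(2, 3))   # weights into three hidden nodes
w_output = rng.normal(size=(3, 1))   # weights into one output node

def activation(z):
    # A squashing function (the sigmoid discussed later in this chapter).
    return 1 / (1 + np.exp(-z))

hidden = activation(inputs @ w_hidden)   # hidden layer values
output = activation(hidden @ w_output)   # final prediction between 0 and 1
print(output)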
The best use of these neural networks would be a task that would be easy for
a human but extremely difficult for a computer. Recall at the beginning of the
book when we talked about reasoning and inductive reasoning. Our human
brain is a powerful tool for inductive reasoning; it’s our advantage over
advanced computers that can calculate high numbers of data in a matter of
seconds. We model neural networks after human thinking because we are
attempting to teach a computer how to ‘reason’ like a human. This is quite a
challenge. A good example of a neural network at work is exactly the kind of
task mentioned above: something extremely easy for a human but very
challenging for a computer.
Neural networks can take a huge amount of computing power. The first
reason neural networks are a challenge to process is the sheer volume of
data required to make an accurate model. If you want the model to learn
how to sort photographs, there are many subtle differences between photos
that the model will need to learn to complete the task effectively. That leads
to the next challenge, which is the number of variables required for a neural
network to work properly. The more data you use and the more variables you
analyze, the more hidden layers and nodes the network needs. At any given
time, several hundred or even thousands of features
are being analyzed and classified through the model. Take self-driving cars as
an example. Self-driving cars have more than 150 nodes for sorting. This
means that the amount of computing power required for a self-driving car to
make split-second decisions while analyzing thousands of inputs at a time is
quite large.
In the instance of sorting photos, neural networks can be very useful, and the
methods that data scientists use are improving rapidly. If I showed you a
picture of a dog and a picture of a cat, you could easily tell me which one was
the cat and which one was the dog. But for a computer, this takes
sophisticated neural networks and a large volume of data to teach the model.
A common issue with neural networks is overfitting. The model can predict
the values for the training data, but when it's exposed to unknown data, it is
fit too specifically to the old data and cannot make generalized predictions
for new data.
Say that you have a math test coming up and you want to study. You can
memorize all the formulas that you think will appear on the test and hope that
when the test day comes, you will be able to just plug in the new information
into what you’ve already memorized. Or you can study more deeply; learning
how each formula works so that you can produce good results even when the
conditions change. An overfitted model is like memorizing the formulas for a
test. It will do well if the new data is similar, but when there is a variation,
then it won’t know how to adapt. You can usually tell if your model is
overfitted if it performs well with training data but does poorly with test data.
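A simple, practical check for this, sketched below with scikit-learn’s built-in iris dataset and a deliberately flexible decision tree standing in for any model, is to compare the score on the training data with the score on held-out test data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A very deep tree is prone to memorizing the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("Training score:", model.score(X_train, y_train))
print("Test score:", model.score(X_test, y_test))
# A large gap between the two scores is a sign of overfitting.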
When we are checking the performance of our model, we can measure it
using the cost value. The cost value is the difference between the predicted
value and the actual value of our model.
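One common way to turn that difference into a single number is the mean squared error; here is a tiny sketch with made-up predicted and actual values:
import numpy as np

actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.5, 5.5, 7.0, 11.0])

# Average of the squared differences between prediction and truth.
cost = np.mean((predicted - actual) ** 2)
print(cost)  # 0.4375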
One of the challenges with neural networks is that there is no way to
determine the relationship between specific inputs with the output. The
hidden layers are called hidden layers for a reason; they are too difficult to
interpret or make sense of.
The most simplistic type of neural network is called a perceptron. It
derives its simplicity from the fact that it has only one layer through which
data passes. The input layer leads to one classifying hidden layer, and the
resulting prediction is a binary classification. Recall that when we refer to a
classification technique as binary, that means it only sorts between two
different classes, represented by 0 and 1.
The perceptron was first developed by Frank Rosenblatt. It’s a good idea to
familiarize yourself with the perceptron if you’d like to learn more about
neural networks. The perceptron uses the same process as other neural
network models, but typically you’ll be working with more layers and more
possible outputs. When data is received, the perceptron multiples the input by
the weight they are given. Then the sum of all these values is plugged into the
activation function. The activation function tells the input which category it
falls into, in other words predicting the output.
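Here is a minimal hand-coded sketch of that process, with made-up inputs, weights, and bias, and a simple step used as the activation function:
def perceptron_predict(inputs, weights, bias):
    # Multiply each input by its weight and add everything up.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step activation: one class above the threshold, the other below.
    return 1 if total > 0 else 0

weights = [0.6, -0.4]
bias = -0.1
print(perceptron_predict([0.7, 0.3], weights, bias))  # prints 1
print(perceptron_predict([0.1, 0.9], weights, bias))  # prints 0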
If you were to look at the perceptron on a graph, its line would appear like
this:
The graph of the perceptron looks like a step, with two values, one on either
side of the threshold. These two sides of the step are the different classes that
the model will predict based on the inputs. As you might be able to tell from
the graph, it’s a bit crude, because there is very little separation along the line
between classes. Even a small change in an input variable can cause the
predicted output to jump to a different class. Because it is a step function, it
won’t perform as well outside of the original dataset you use for training.
An alternative to the perceptron is a model called a sigmoid neuron. The
principal advantage of using the sigmoid neuron is that it is not binary.
Unlike the perceptron, which sorts data into one of two categories, the sigmoid
function produces a probability rather than a hard classification. The image
below shows the curve of a sigmoid neuron.
Notice the shape of the curve compared with the perceptron’s step. With the
perceptron, the step makes it difficult to classify data points that differ only
marginally. With the sigmoid neuron, the output is the probability that a data
point falls into a given class. As you can see, the line rises smoothly toward
one, which means the probability of belonging to that class increases gradually
with the input, but it remains only a probability.
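The curve comes from the sigmoid formula, which squashes any weighted sum into a value between 0 and 1; this short sketch uses a handful of made-up weighted sums to show how nearby inputs produce nearby probabilities:
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Nearby weighted sums give nearby probabilities instead of
# jumping straight from one class to the other.
for z in [-2, -0.5, 0, 0.5, 2]:
    print(z, round(sigmoid(z), 3))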
Conclusion
As we move into the third decade of the twenty-first century, several new
trends in big data may take hold. The first is the streaming of data combined
with machine learning. Traditionally, computers have learned from data sets
that were fed to computer systems in a controlled fashion. Now the idea of
streaming data in real time is developing, so that computer systems can
learn as they go. It remains to be seen if this is the best approach, but
combining this with the Internet of Things, there is a big hope for massive
improvements in accuracy, value, and efficiency regarding big data.
Another important trend in the coming years is sure to be the increasing role
of artificial intelligence. This has applications across the board, with simple
things like detecting spam email all the way to working robots that many
fear will destroy large numbers of jobs that only require menial labor. The
belief among those familiar with the industry is that despite decades of slow
progress regarding artificial intelligence, its time has definitely arrived. It is
expected to explode over the next decade. Recently, robots have been
unveiled that can cook meals in fast-food restaurants, work in warehouses
unloading boxes and stacking them on shelves, and everyone is talking about
the possibilities of self-driving cars and trucks.
Businesses are eager to take advantage of AI as it becomes more capable and
less expensive. It is believed that applications of artificial intelligence to
business needs will increase company efficiency exponentially. In the
process, tedious and time-consuming tasks, both physical-related tasks like
unloading boxes at a warehouse and data-related tasks done in offices, will be
replaced by artificially intelligent systems and robotics. The movement in this
direction is already well underway, and some people are fretting quite a bit
over the possibility of millions of job losses. However, one must keep in
mind that revolutionary technology has always caused large numbers of job
losses, but this impact is only temporary because the freed labor and
productive capacity have resulted in the creation of new jobs and industries
that nobody anticipated before. One example of this is the famous Luddites
who protested the mechanical looms that manufactured textile goods in the
early nineteenth century. They rioted and destroyed many factories when these
early machines were introduced. However, by the end of the century, literally
ten times as many people were working in the same industry because of the
increased productivity provided by the introduction of machines. It remains
to be seen, but one can assume this is likely to happen yet again.
Cloud computing has played a central role in the expansion of big data.
Hybrid clouds are expected to gain ground in the coming years. A hybrid cloud will
combine a company’s own locally managed and controlled data storage with
cloud computing. This will help increase flexibility while enhancing the
security of the data. Cloud bursting will be used, where the company can use
its own local storage until surges in demand force it to use the cloud.
One up-and-coming issue related to big data is privacy. Privacy concerns are
heightening, with people becoming more aware of the ubiquitous targeted
advertising that many companies are using. In addition, large-scale hacks of
data are continually making the news, and consumers are becoming
increasingly concerned about what companies like Facebook and Amazon are
doing with their data. If people are concerned with Facebook invading their
privacy, they will certainly be concerned about their toilet, electric meter, and
refrigerator collecting data on their activities and sending it who knows
where. Politicians the world over are also getting in on the act, with calls for
regulation of big tech companies coming from both sides of the Atlantic.
Many people are anticipating with excitement the implementation of 5G
cellular networks. This is supposed to result in much faster connection speeds
for mobile devices. The capacity for data transfer is expected to be much
larger than is currently available. 5G networks are claimed to have download
speeds that are ten times as great compared with 4G cellular networks. This
will increase not only the speed of using the internet on a mobile device but
also the ability of companies to collect data on customers in real time,
possibly integrating streaming data from 5G devices with machine learning.
A 5G connection will also allow you to connect more devices
simultaneously. This could be helpful for the advent of the Internet of Things
described earlier. At the time of writing, 5G is only tentatively being rolled
out in select cities like Chicago.
One anticipated trend in the next few years will be that more companies will
make room for a data curator. This is a management position that will work
with data, present data to others, and understand the types of analysis needed
to get the most out of big data.
