Data Analytics - 4 Manuscripts - Data Science For Beginners, Data Analysis With Python, SQL Computer Programming For Beginners, Statistics For Beginners
______________________
DATA SCIENCE FOR BEGINNERS,
DATA ANALYSIS WITH PYTHON,
SQL COMPUTER PROGRAMMING FOR BEGINNERS,
STATISTICS FOR BEGINNERS
Matt Foster
Table of Contents
Data Science for Beginners
Introduction
Chapter 1 - Introduction to Data Science
Chapter 2 - Fields of Study
Chapter 3 - Data Analysis
Chapter 4 - The Python Data Types
Chapter 5 - Some of The Basic Parts of The Python Code
Chapter 6 - Use Case, Creating Requirements and Mindmapping
Chapter 7 - Basic Statistics Concepts of Data Scientists
Chapter 8 - Exploring Our Raw Data
Chapter 9 - Languages Required for Data Science
Chapter 10 - Classification and Prediction
Chapter 11 - Data Cleaning and Preparation
Chapter 12 - Introduction to Numpy
Chapter 13 - Manipulating Array
Chapter 14 - Python Debugging
Chapter 15 - Advantages of Machine Learning
Chapter 16 - Numba - Just In Time Python Compiler
Conclusion
Table of Contents
Data Analysis with Python
Introduction
Conclusion
Table of Contents
SQL COMPUTER PROGRAMMING FOR
BEGINNERS
Introduction
Chapter 1 - Data Types in SQL
Chapter 2 - Constraints
Chapter 3 - Database Backup and Recovery
Chapter 4 - Sql Aliases
Chapter 5 - Database Normalization
Chapter 6 - SQL Server and Database Data Types
Chapter 7 - Downloading and Installing SQL Server Express
Chapter 8 - Deployment
Chapter 9 - SQL Syntax And SQL Queries
Chapter 10 - Relational Database Concepts
Chapter 11 - SQL Injections
Chapter 12 - Fine-Tune Your Indexes
Chapter 13 - Deadlocks
Chapter 14 - Functions: UDFs, SVF, ITVF, MSTVF, Aggregate, System, CLR
Chapter 15 - Triggers: DML, DDL, After, Instead Of, DB, Server, Logon
Chapter 16 - Select Into Table Creation & Population
Chapter 17 - Data Visualizations
Chapter 18 - Python Debugging
Conclusion
Table of Contents
STATISTICS FOR BEGINNERS
Introduction
Chapter 1 - The Fundamentals of descriptive statistics
Chapter 2 - Predictive Analytics Techniques
Chapter 3 - Decision Tree and how to Use them
Chapter 4 - Measures of central tendency, asymmetry, and variability
Chapter 5 - Distributions
Chapter 6 - Confidence Intervals: Advanced Topics
Chapter 7 - Handling and Manipulating Files
Chapter 8 - BI and Data Mining
Chapter 9 - What Is R-Squared and how does it help us
Chapter 10 - Public Big Data
Chapter 11 - Gamification
Chapter 12 - Introduction To PHP
Chapter 13 - Python Programming Language
Chapter 14 - A brief look at Machine Learning
Chapter 15 - Python Crash Course
Chapter 16 - Unsupervised Learning
Chapter 17 - Neural Networks
Conclusion
DATA SCIENCE FOR
BEGINNERS:
THE ULTIMATE GUIDE TO DEVELOPING
STEP BY STEP YOUR DATA SCIENCE SKILLS
FROM SCRATCH, TO MAKE THE BEST
DECISIONS AND PREDICTIONS
Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or
transmitted without direct written permission from the author or the
publisher.
Under no circumstances will any blame or legal responsibility be held against
the publisher, or author, for any damages, reparation, or monetary loss due to
the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. Every effort has been made to present accurate, up-to-date, reliable, and complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical, or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is
the author responsible for any losses, direct or indirect, that are incurred as a
result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
The first thing that we need to take a look at here is what data analysis is all about. It is essentially the practice of taking all of the raw data that has been collected over time and then ordering and organizing it. This needs to be done in a way that allows the business to extract all of the useful information out of it.
The process of organizing, and then thinking about the data that is available
is really going to be one of the key things that help us to understand what is
inside that data, and what might not be present in the data. There are a lot of
methods that a company can choose to use in order to analyze their data, and
choosing the right approach can make it easier to get the right insights and
predictions out of your data.
Of course, one thing that you need to be careful about when working with
this data, and performing data analysis is that it is really easy for someone to
manipulate the data when they are in the analysis phase. This is something
that you need to avoid doing at all costs. Pushing your own agenda or your
own conclusions is not what this analysis is supposed to be about. There is so
much data present in your set, that if you try to push your agenda or the
conclusions that you want, you are likely to find something that matches up
with it along the way.
This doesn’t mean that this is the right information that you should follow
through. You may just match up to some of the outliers, and miss out on a lot
of important information that can lead your business in the right direction. To
avoid failing and using the data analysis in the wrong manner, it is important
for businesses, and data analysts, to pay attention when the data is presented
to them, and then think really critically about the data, as well as about some
of the conclusions that were drawn based on that data.
Remember here that we can get our raw data from a lot of different sources,
and this means that it can come to us in a variety of forms. Some of these sources may include our social media pages, our observations, responses to surveys that are sent out, and other measurements. All of this data, even when
it is still in its raw form, is going to be useful. But there is so much of it, that
sometimes it all seems a bit overwhelming.
Chapter 1 - Introduction to Data Science
Over the course of the process for data analysis, the raw data has to be
ordered in a way that can make it something useful. For example, if you are
doing the results of a survey as part of your data, you need to take the time to
tally up the results. This helps people to look at the chart or the graph and see
how many people answered the survey, and how these people responded to
some of the specific questions that were on the survey to start with.
As we go through the process of organizing the data, it will not take long
until some big trends start to emerge. And this is exactly what you want when
working with this data analysis. These trends are what we need to highlight when we write up the data to ensure that our readers will take note of them. We can see a lot of examples of how this is going to work.
For example, if we are working with a casual kind of survey about the
different ice cream preferences of men and women, we may find that women
are more likely to have a fondness for chocolate compared to men.
Depending on what the end goals of that research survey were about, this
could be a big point of interest for the researcher. They may decide that this is
the flavor they are going to market to women in online and television ads,
with the hopes that they can increase the number of women who want to
check out the product.
We then need to move on to do some modeling of the data using mathematics and other tools. These can sort through all of the data we have and then accentuate the points of interest. This is a great thing
because it makes these points so much easier for the researcher to see, so they
can act on this information.
Another thing that we need to focus on when we do our data analysis is the
idea of data visualization. Charts, graphs, and even some textual write ups of
the data are all going to be important parts of data analysis. These methods
are designed to refine and then distill the data in a manner that makes it easier for the reader to glean the interesting information, without having to go through and sort out the data on their own.
Just by how the human mind works, having graphs and charts can make a big
difference in how well we understand all of the data we look at. We could
write about the information all day long. But a nice chart or graph that helps
to explain the information and shows us some of the relationships that come
up between the data points, can be the right answer to make understanding
the data easier than ever before.
Summarizing all of this data is often going to be a critical part of supporting
any of the arguments that we try to make with the data. It is just as important
as trying to present the data in a manner that is clear and understandable. The
raw data may also be included in the appendix or in another manner. This
allows the key decision-makers of the company to look up some of the
specifics, and double-check the results that you had. This doesn’t mean that
the data analyst got it wrong. But it adds in a level of trust between the parties
and can give us a reference point if needed.
When those who make decisions in the business encounter some data and
conclusions that have been summarized, they must still view them in a
critical manner. Do not just take the word of the data analyst when it comes
to this information, no matter how much you trust them. There can always be
mistakes and other issues that we have to watch out for, and being critical
about anything that we are handed in the data analysis can be critical to how
we will use that data later.
This means that we need to ask where the data was collected from, and then
ask about the size of the sample, and the sampling method that was used in
order to gather up the data we want to use. All of these are going to be critical
to helping us understand whether the data is something that we can actually
use, and it can help us determine if there are any biases found in the data that
we need to be worried about.
For example, if the source of the data comes from somewhere that may have
a conflict of interest with the type of data that is gathered, or you worry that
the source is not a good one to rely on, then this can sometimes call the
results that you are looking at into question. In a similar manner, if the data is high-quality but it is pulled from a small sample size, or the sample that was used is not truly random like it should be, this is going to call into question the utility of that data.
As we are going through this, the data analyst needs to remember to provide
as much information about the data as possible, including the methods that
were used to collect that data. Reputable researchers are always going to
make sure that they provide information about the techniques used to gather the data, the source of funding for any surveys that are used, and the point of the data collection, right at the beginning of the analysis. This makes
it easier for other groups to come and look at the data, see if it is legitimate
and will work for their needs, and then determine if this is what they are
going to base their decisions on.
Learning how to use a data analysis is going to be an important step in this
process. Without this, all of the data that we gather is just sitting around and
isn’t being used the way that we would like. It doesn’t do us any good to just
gather the data and then hold it in storage. Without analyzing the data and
learning how to use it, you are basically just wasting money with a lot of
storage for the data.
Data analysis comes into the mix and makes it so much easier for us to
handle our data, and really see some great results. Rather than just having the
data sit in storage, we are going to be able to add in some algorithms and
machine learning models, in order to see what insights and predictions are
hidden in all of that data.
Businesses are then able to take all of those insights and predictions, and use them to make smart business decisions over and over again.
And with the right machine learning algorithm in place, and some small
adjustments over time, the business is able to add in some more information
as it comes in, helping them to always stay ahead of the competition.
In the past, these tools were not available at all. Business owners who were
good at reading the market and had been in business for some time could
make good predictions, and sometimes they just got lucky. But there was
always a higher risk that something wouldn’t work out, and they would end
up with a failure in their business, rather than a success.
With data analysis, this is no longer an issue at all. The data analysis is going
to allow us to really work with the data, and see what insights are there. This
provides us with a way to make decisions that are backed by data, rather than
estimates and hoping that they are going to work out. With the right
algorithm or model in place, we are able to learn a lot about the market, our
customers, what the competition is doing, how to reduce waste, and so much
more that can really propel our business forward.
There are a lot of different ways that a business can use data analysis to help
them succeed. They can use this as a way to learn more about their customers
and how to pick the right products and increase customer satisfaction all at
the same time. They can use this to identify waste in the business, and how to
cut some of this out without harming the quality of the product. They can use
this to learn more about what the competition is doing or to discover some
new trends in the market that they can capitalize on and get ahead of the
competition. This can even be used for marketing purposes to ensure that the
right ads reach the right customers each time.
There are so many benefits that come with a well-thought-out and researched data analysis. And it is not as simple as glancing down at the information and assuming that it all falls into place and you will be able to get insights in
a few minutes. It requires gathering good information, making a model that
can read through all of that data in a short amount of time, and then even
writing out and creating visuals that go with that data. But when it all comes
together, it can really provide us with some good insights and predictions
about our business, customers, and competition, and can be the trick that gets
us over the top.
Chapter 2 - Fields of Study
A company needs to manage a huge amount of data, such as salaries, employee records, customer records, customer feedback, and so on. This data can be in both unstructured and structured forms. A company will always want this data to be meaningful and complete so that it can make better, more correct decisions and plan future approaches. This is where data science comes in handy.
Data science helps its users make the right choices by extracting accurate information from a large amount of messy data.
Data science
Data science is a pipeline of activities all organized together. It begins with gathering the data and then storing it in structured form. That is followed by cleaning the data to remove the unwanted and duplicate portions, correct the erroneous bits, and complete the incomplete parts. After all the pruning is done, it is followed by analyzing the data using many statistical and mathematical models. This stage is about understanding the hidden patterns in the data. All of this is finally followed by communicating everything to top management so that they can take decisions regarding new or existing products.
Nowadays, one can find several data science courses to become a trained professional in the field of data science, and why not? Job openings are expected to grow by 28% to 30% by 2020, which means more opportunities. To be a data scientist, one does not necessarily need a lot of experience; even a fresher with a background in mathematics, computer science, or economics can get trained to be a data scientist. This soaring demand for data scientists is a direct result of the increasing use of big data in pretty much every industry imaginable.
You will find that the Pandas package comes with many methods to help us filter through our data in a convenient manner while seeing some great results.
You will find that Pandas is really going to change the game and how you do your coding when it comes to analyzing a company's data with Python. Pandas is free to use and open source, and it is meant to be used by anyone who is looking to handle their data in a safe, fast, and effective manner.
There are a lot of other libraries that are out there, but you will find that a lot
of companies and individuals are going to love working with Pandas. One
thing that is really cool about Pandas is that it is able to take data, whether it is from an SQL database, a TSV file, or even a CSV file, and turn that information into a Python object. This object is organized into columns and rows and is called a DataFrame, one that will look very similar to a table that we would see in other statistical software.
If you have worked with R in the past, then these objects are going to share a lot of similarities with R's data frames as well. And these objects are going to be easier to work with when you don't want to worry about dictionaries, lists, for loops, or list comprehensions. Remember that we talked earlier about how loops can be nice in Python, but you will find that when it comes to data analysis, these loops can be clunky, take up a lot of space, and just take too long to run. Working with this library will
help you to get things done without all of the mess along the way.
For the most part, it is going to be best if you are able to download the
Pandas library at the same time that you download Python. This makes it
easier to work with and will save you some time later. But if you already have Python on your computer and later decide that you want to work with Pandas as well, then this is not a problem. Take some time now to find the Pandas library on its official page and follow the steps that are needed to install it on the operating system of your choice.
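For example, after installing the library (commonly with "pip install pandas"), every script that uses it starts with an import. A minimal sketch:
# import the library under its conventional short alias
import pandas as pd
print(pd.__version__)   # quick check that the installation worked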
Once you have had some time to download the Pandas library, it is time to
actually learn how this one works and some of the different things that you
are able to do to get it to work for you. The Pandas library is a lot of fun
because it has a ton of capabilities that are on it, and learning what these are
and how to work with them is going to make it easier to complete some of
your own data analysis in the process.
At this point, the first thing that we need to focus on is the steps that we can
take to load up any data, and even save it before it can be run through with
some of the algorithms that come with Pandas. When it is time to work with
this library from Python in order to take all of that data you have collected
and then learn something from it and gain insights, we have to keep in mind
that there are three methods that we can use with this. These three methods
are going to include the following:
Now, as you go through these three steps, we have to remember that there are actually a couple of commands that will show up for each one, and which command you will choose depends on which method you go with. However, one thing that all three share in common is that the command they use to open up a data file will be the same. The command that you need to use to open up your data file, regardless of the method above that you choose to use, will be:
pd.read_filetype()
Like we talked about a bit earlier on, and throughout this guidebook, there are
a few file types that you are able to use and see results with when writing in
Python. And you get the control of choosing which one is the best for your
project. So, when working on the code above, you would just need to replace
the part that says “filetype” with the actual type of file that you would like to
use. You also need to make sure that you add in the name of your file, the
path, or another location to help the program pull it up and know what it is
doing.
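For example, here is a minimal sketch of reading a CSV file into a DataFrame; the file name "data.csv" is just a placeholder for your own file or path:
import pandas as pd
# read a CSV file into a DataFrame; replace "data.csv" with your own file or path
df = pd.read_csv("data.csv")
# other readers follow the same pattern, e.g. pd.read_excel("data.xlsx") or pd.read_json("data.json")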
You will find that while you work in the Pandas library, there are also a ton of arguments that you are able to choose from, and knowing what all of these mean and how to use each one at the right time is going to be a big challenge. To save some time, and to not overwhelm you with just how many options there are, we are going to focus on just the ones that are the most important for our project, the ones that can help us with a good data analysis, and leave the rest alone for now.
With this idea in mind, we are going to start out by learning how we can
convert one of the objects that we are already using in Python, whether this is
a list or a dictionary or something else, over to a Pandas DataFrame so we can
actually use it for our needs. The command that we are able to use to make
that conversion happen is going to include:
pd.DataFrame()
With the code above, the part that goes inside of the parentheses is where we specify the object, and sometimes the objects, that are being turned into that DataFrame. This command also takes a few arguments, and you can choose which of those you want to work with here as well.
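As a quick sketch, assuming we have a plain Python dictionary of lists, the conversion could look like this:
import pandas as pd
# a plain Python dictionary (hypothetical example data)
sales = {"product": ["apples", "pears", "plums"], "units": [10, 4, 7]}
# convert the dictionary into a DataFrame with one row per entry
df = pd.DataFrame(sales)
print(df)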
In addition to helping out with some of the tasks that we just listed out, we can also use Pandas to help us save that DataFrame, so we can pull it up and do more work later on, or so we can work with more than one type of file. This is nice because Pandas can save tables in many different formats, whether that is CSV, Excel, SQL, or JSON. The general code that we need to use, not only to work on the DataFrame that we currently have but also to make sure that we can save it, will be the following:
df.to_filetype(filename)
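Again, "filetype" is a stand-in for the real format. A minimal sketch of saving the DataFrame from the earlier example, using a made-up file name:
# write the DataFrame out as a CSV file (the file name is just an example)
df.to_csv("sales_backup.csv", index=False)
# the same pattern works for other formats, e.g. df.to_json("sales_backup.json")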
When we get to this point, you should see that the data is already loaded up, so now we need to take this a step further and look at some of the inspecting that can be done with it as well. To start, we need to take a look at the DataFrame and see whether or not it matches up with what we expect or want it to. To help us do this, we just need to run the name of the DataFrame we are using to bring up the entire table, but we can limit this a bit more and get more control by only getting a certain amount of the table to show up based on what we want to look at.
For example, to get just the first n rows (you can decide how many rows this ends up being), you would just need to use the function df.head(n). Alternatively, if your goal was to work with the last n rows of the table, you would write out the code df.tail(n). The df.shape attribute is going to help if you want to see the number of rows and columns, and if you would like to gather up some of the information about the data types, memory usage, or the index, the only code that you will need to use to make this happen is df.info().
Then you can also use the command s.value_counts(dropna=False), which allows us to view the unique values and their counts for a Series, such as when you want to work with just one, or sometimes a few, columns. A useful command that you may want to learn as well is the df.describe() function. This one helps you out by outputting summary statistics for the numerical columns. It is also possible for you to get statistics on the entire DataFrame or on a single Series.
To help us make a bit more sense out of what we are doing here, and what it all means, it helps to see the different Pandas commands for viewing and inspecting data used together in one place, as shown in the sketch below.
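A minimal sketch, reusing the hypothetical sales DataFrame from above:
# quick inspection of a DataFrame
print(df.head(2))        # first two rows
print(df.tail(2))        # last two rows
print(df.shape)          # (number of rows, number of columns)
df.info()                # column data types, memory usage, and index
print(df.describe())     # summary statistics for the numerical columns
print(df["units"].value_counts(dropna=False))  # unique values and counts for one column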
Another cool thing that we are able to do when working with the Pandas library is join together and combine different parts. This is a basic operation, so learning how to do it from the beginning can make a big difference. It is important for helping us combine or join DataFrames, or for combining and joining the rows and columns that we want. There are three main commands that come into play to make all of this happen, and a sketch of them follows.
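As an assumption, the three combining commands usually meant in Pandas are pd.concat(), DataFrame.merge(), and DataFrame.join(); here is a minimal sketch with two small hypothetical DataFrames:
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Ben"]})
right = pd.DataFrame({"id": [1, 2], "score": [90, 75]})

stacked = pd.concat([left, left])              # stack rows of DataFrames on top of each other
merged = left.merge(right, on="id")            # SQL-style join on a shared column
joined = left.set_index("id").join(right.set_index("id"))  # join on the index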
There is so much that we are able to do when it comes to the Pandas library,
and that is one of the reasons why it is such a popular option to go with.
Many companies that want to work with data science are also going to be willing to add on the Pandas library because it helps them do a bit more with data science, and the coding is often simple thanks to the Python language that it runs on.
The commands that we looked at in this chapter are going to be some of the
basic ones with Python and with Pandas, but they are meant to help us learn a
bit more about this language, and all of the things that we can do with the
Pandas library when it comes to Python and to the data science work that we would like to complete. There is a lot of power that comes with the Pandas library, and being able to put all of this together and use some of the algorithms and models that work with this library can make our data analysis so much better.
The work that we did in this chapter is a great introduction to what we are
able to do with the Pandas library, but this is just the beginning. You will find
that when you work with Pandas to help out with your data analysis, you are going to see some great results, and you will be able to write out some strong models and code that not only bring in the data that your company needs but also provide the predictions and insights that are needed to move your business into the future.
Chapter 4 - The Python Data Types
The next thing that we need to take a look at is the Python data types. Each
value in Python has a data type.
Since everything is an object in Python programming, data types are going to be like classes, and variables are going to be the instances, also known as objects, of these classes. There are a lot of different types of data in Python. Some of the crucial data types that we are able to work with include:
Python numbers
The first option that we are able to work with in Python data is the Python numbers. These include things like complex numbers, floating-point numbers, and integers. They are defined as the complex, float, and int classes in Python. For example, we are able to use the type() function to identify which class a value or a variable belongs to, and the isinstance() function to check whether an object belongs to a particular class.
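A minimal sketch of checking number types with these two functions:
x = 10          # an integer
y = 3.75        # a floating-point number
z = 2 + 3j      # a complex number: real part 2, imaginary part 3

print(type(x))              # <class 'int'>
print(type(y))              # <class 'float'>
print(type(z))              # <class 'complex'>
print(isinstance(x, int))   # True: x belongs to the int class
print(isinstance(y, int))   # False: y is a float, not an int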
When we work with integers, they can be of any length; the only limitation is how much memory you have available on your computer. Then there is the floating-point number.
This is going to be accurate up to about 15 decimal places, though you can definitely go with a smaller amount as well.
Floating-point numbers are distinguished by a decimal point: 1 is an integer, while 1.0 is a floating-point number.
And finally, we have complex numbers. These are the numbers that we write out as x + yj, where x is the real part and y is the imaginary part.
We need to have these two parts put together in order to make up a complex number.
Python lists
The next type of data that will show up in the Python language is the list. The Python list is an ordered sequence of items. It is one of the data types that is used the most in Python, and it is extremely flexible.
All of the items that will show up on the list can be similar, but this is not a
requirement. You are able to work with a lot of different items on your list,
without them being the same type, to make it easier to work with.
Declaring a list is a straightforward thing to do. The items are separated by commas, and then we just need to include them inside square brackets like this: [ ]. We can also use the slicing operator to extract a single item or a selection of items out of that list.
The index starts at 0 in Python.
And we have to remember while working on these that lists are going to be
mutable.
What this means is that the value of the elements that are on your list can be
altered in order to meet your own needs overall.
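A short sketch of list creation, slicing, and mutation:
fruits = ["apple", "pear", 3, 4.5]   # items in a list do not have to share a type
print(fruits[0])       # indexing starts at 0, so this prints "apple"
print(fruits[1:3])     # slicing: items at positions 1 and 2 -> ['pear', 3]
fruits[0] = "plum"     # lists are mutable, so elements can be changed in place
print(fruits)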
Python Tuple
We can also work with something that is known as a Python tuple. A tuple is an ordered sequence of elements, much the same as a list, and it is sometimes hard to see how they are similar and how they are different.
The big difference between a tuple and a list is that tuples are immutable.
Tuples, once you create them, cannot be modified.
Tuples are used to write-protect data, and they are generally faster than a list, since they cannot change dynamically. A tuple is defined with parentheses (), where the items are also separated by commas, as we see with lists.
We can then use the slicing operator to extract the elements that we want to use, but we are still not able to change their values while we are working with the code or the program.
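A minimal sketch showing that a tuple can be sliced but not changed:
point = (3, 7, 9)        # a tuple is defined with parentheses
print(point[0:2])        # slicing works: (3, 7)
try:
    point[0] = 10        # tuples are immutable, so this raises an error
except TypeError as err:
    print("cannot modify a tuple:", err)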
Python Strings
Python strings are also important. A string is a sequence of Unicode characters.
We can use either single quotes or double quotes to write our strings, but we need to make sure that the type of quote we use at the beginning is the one we finish with, or we will confuse the interpreter.
We can even work with multi-line strings with the help of a triple quote.
Like the tuple and the list that we talked about above, the slicing operator is something that we are able to use with our strings as well. And just like the tuple, the string is immutable.
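A quick sketch of strings, quotes, and slicing:
single = 'hello'
double = "world"
multi = """a string
spread over several lines"""   # triple quotes allow multi-line strings
print(single[1:4])             # slicing works on strings: "ell"
# single[0] = "H" would raise a TypeError, because strings are immutable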
Python Set
Next on the list is going to be the Python set. A set is an unordered collection of unique items. The set is defined by values separated with commas inside braces. The elements in the set are not ordered, so we can use them in any manner that we would like.
We have the option to perform set operations such as union and intersection on two sets.
The sets that we work with hold unique values, and they make sure that duplicates are eliminated. Since the set is an unordered collection, indexing has no meaning.
Therefore, the slicing operator is not going to work for this kind of data type.
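A minimal sketch of sets, union, and intersection:
a = {1, 2, 3, 3}          # duplicates are dropped automatically -> {1, 2, 3}
b = {3, 4, 5}
print(a | b)              # union: {1, 2, 3, 4, 5}
print(a & b)              # intersection: {3}
# a[0] would raise a TypeError: sets are unordered, so indexing and slicing do not apply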
Python Dictionary
And the final type of Python data that we are going to take a look at is known
as the Python dictionary. This is going to be an unordered collection of key-
value pairs that we are able to work with. It is generally going to be used
when we are working with a very large amount of data. Dictionaries are optimized in such a way that they do a great job of retrieving our data. We have to know the key ahead of time to retrieve the value and make these work.
When we are working with the Python language, a dictionary is defined inside braces, with every element being a pair in the form key: value. The key and the value can be any type that you would like, based on the kind of code that you would like to write. We can use the key to retrieve the respective value that we need, but we are not able to turn this around and look up a key from its value.
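A small sketch of defining a dictionary and looking up a value by its key:
ages = {"Ana": 31, "Ben": 27}    # key: value pairs inside braces
print(ages["Ana"])               # look up a value by its key -> 31
ages["Cleo"] = 45                # add a new key-value pair
# there is no direct way to go from the value 27 back to the key "Ben"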
Working with the different types of data is going to be so important for all of the work that you do in your Python coding, and it can help you out when it is time to work with data science.
Take a look at the different types of data that are available with the Python language, and see how useful they can be in any of the code and algorithms that you want to write into your data science project overall.
Chapter 5 - Some of The Basic Parts of The Python
Code
Now that we have learned a bit more about the Python code, and some of the
things that you need to do in order to get this coding language set up on your
computer, it is time to take a look at some of the different things that you can
do with your code. We are going to start out with some of the basics, and
then will build on this when we get a bit further on in this guidebook to see
some of the other things that we are able to do with this language. With this
in mind, let’s take a look at some of the basics that you need to know about
any code in Python, and all that you are going to be able to do with this
coding language.
The Keywords in Python
The first part of the Python code that we are going to focus on is the Python
keywords. These keywords are going to be reserved because they give the
commands over to the compiler. You do not want to let these keywords show
up in other parts of the code, and it is important to know that you are using
them in the right part of the code.
Any time that you are using these keywords in the wrong manner, or in the
wrong part of the code, you are going to end up with some errors in place.
These keywords are going to be there to tell your compiler what you wanted
to happen, allowing it to know what it should do at the different parts of the
code. They are really important to the code and will make sure that
everything works the proper manner and at the right times.
How To Name The Identifiers in your Code
The next thing that we need to focus on for a moment when it comes to your
code is working with the identifiers. There are a lot of different identifiers
that you are able to work with, and they do come in a variety of names
including classes, variables, entities, and functions. The neat thing that
happens when you go through the process of naming an identifier is that the
same rules are going to apply no matter what name you have, which can
make it easier for a beginner to remember the different rules that come with
them.
So, let’s dive into some of the rules that we need to remember when doing
these identifiers. You have a lot of different options to keep in mind when
you decide to name the identifiers. For example, you can rely on using all
kinds of letters, whether they are lowercase or uppercase. Numbers work
well, too. You will be allowed to bring in the underscore symbol any time
that you would like. And any combination of these together can help you to
finish up the naming that you want to do.
One thing to remember with the naming rules though is that you should not
start the name with any kind of number, and you do not want to allow any
kind of space between the words that you are writing out. So, you would not
want to pick out the name of 5kids, but you could call it fivekids. And five
kids for a name would not work, but five_kids would be fine.
When you are working on the name for any of the identifiers that you want to
create in this kind of coding language, you need to make sure that you are
following the rules above, but add to this that the name you choose has to be
one that you are able to remember later. You are going to need to, at some
point, pull that name back up, and if you picked out one that is difficult to
remember or doesn’t make sense in the code that you are doing, and you
can’t call it back up, it is going to raise an error or another problem along the
way. Outside of these rules, you will be fine naming the identifier anything
that makes sense for that part of the code.
How to Handle the Control Flow with Python
The control flow in this language is important. Control flow is there to ensure that you write out the code in the proper order. There are some statements in your code that you may want to write out so that the interpreter can read them the right way, and if you write them out in the wrong manner, you are going to end up with errors in the system. We will take a look at many code examples in this guidebook that follow the right control flow for this language, which can make it easier to know what you need to get done and how you can write out code in this language.
As you can see, there are a ton of different parts that come with the
basics of the Python code. Many of these are going to be seen in the
types of codes that you are trying to write out in Python, and can really
help you to start writing out some of your own codes. As we go through
some of the examples, as well as the practice exercises, as we go through
this guidebook, you will find that these basics are going to be found in a
lot of the codes that you would like to work on.
Chapter 6 - Use Case, Creating Requirements and
Mindmapping
Object-oriented programming (OOP) is a model in which programs are organized around objects and data rather than functions and logic. An object has its own unique behavior and attributes. Object-oriented programming opposes the historical approach to programming, in which the stress is placed on how the logic is written rather than on how the data within the logic is defined. Examples of objects range from physical entities such as humans to small program components like widgets.
A programmer focuses on the first step, known as data modeling, in which all the objects to be manipulated are identified, along with how these objects relate to each other. After an object is identified, it is generalized as a class of objects. The class defines the kind of data it contains as well as the logic sequences that can manipulate it. Each logic sequence is called a method, while the communication between objects through well-defined interfaces is done with messages.
In OOP, developers focus on manipulating the objects rather than the logic required to manipulate them. This approach is well suited to programs that are large, complex, and actively maintained. Open-source organizations also support object-oriented programming by allowing programmers to contribute to such projects in groups, which results in collaborative development. Furthermore, the additional benefits of object-oriented programming include code scalability, reusability, and efficiency.
Inheritance
Object-oriented programming ensures a higher level of accuracy and reduces development time. With inheritance, relationships and subclasses can be built between objects, which allows developers to reuse common logic while maintaining a unique hierarchy. This property results in more thorough data analysis, higher accuracy, and saved development time.
Abstraction
Objects reveal only the internal mechanisms that are helpful and relevant for the use of other objects. Because of this, a developer can build on the concept and make additions or changes over time more easily.
Polymorphism
Depending on the context, objects are allowed to take on more than one form. The program determines the meaning and usage for each execution of an object, cutting down on the need to duplicate code.
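A small sketch tying these ideas together, with hypothetical classes used purely for illustration:
class Animal:
    """Base class: shared attributes and behavior live here (inheritance)."""
    def __init__(self, name):
        self.name = name

    def speak(self):
        # subclasses override this, so callers only see the interface (abstraction)
        raise NotImplementedError

class Dog(Animal):
    def speak(self):
        return f"{self.name} says woof"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says meow"

# polymorphism: the same call takes a different form depending on the object
for pet in [Dog("Rex"), Cat("Misu")]:
    print(pet.speak())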
Poisson Distribution
[Figure: graph of a Poisson distribution; source: https://miro.medium.com]
The above graph depicts a Poisson Distribution, where all the signals are in
the continuous form with variable intensities. Poisson Distribution can be
compared to the uniform distribution as it is very much similar to that; the
exception is only the skewness. Skewness can be understood as a measure of how asymmetric the distribution is. If the skew value is low, the distribution has a fairly even spread in all directions, like the normal distribution. On the other hand, if the skewness is high, the distribution is shifted toward one side; it might be concentrated in one position, or it can be scattered all over the graph.
This is an overview of the commonly used distribution patterns. But distributions are not limited to these three; there are many more distributions that are used in data science. Out of these three, the Gaussian distribution can be used with many algorithms, whereas choosing an algorithm for Poisson-distributed data must be a careful decision due to its skewness.
Dimensionality Reduction
Dimensionality reduction can be somewhat intuitive to understand. In data
science, we would be given a dataset, and by the use of the Dimensionality
Reduction technique, we will have to reduce the dimensions of the data.
Imagine you have been given a dataset cube of 1000 data points, and it’s a 3-
Dimension cube. Now you might think that computing 1000 data points can
be an easy process, but at a larger scale, it might give birth to many problems
and complexity. Now by using the dimensionality reduction technique, if we
look at the data in 2-Dimension, then it is easy to re-arrange the colors into
categories. This will reduce the size of the data points from 1000 to maybe
100 if you categorize colors in 10 groups. Computing 100 data points are
much easier than earlier 1000 data points. In a rare case, these 100 data points
can also be reduced to 10 data points by the dimensional reduction technique
by identifying color similarity and grouping similar color shades in a group.
This is possible only in the 1-dimension view. This technique results in big computational savings.
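As a concrete sketch, one common way to reduce dimensions is principal component analysis; this assumes scikit-learn and NumPy are installed, and the data here is random, used only for illustration:
import numpy as np
from sklearn.decomposition import PCA

# 1000 hypothetical data points with 3 dimensions (e.g. RGB color values)
points = np.random.rand(1000, 3)

# project the 3-D points down to 2 dimensions
pca = PCA(n_components=2)
points_2d = pca.fit_transform(points)

print(points_2d.shape)                 # (1000, 2)
print(pca.explained_variance_ratio_)   # how much information each new axis keeps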
Feature Pruning
Feature pruning is another technique of performing dimensionality reduction.
As we saw that we reduce the points in the earlier technique, here, we can
reduce the number of features that are less important or not important to our
analysis. For illustration, while working on any data set, we may come across
20 features; 15 of them might have a high correlation with the output,
whereas five may not have any correlation, or maybe they have a very low correlation. Then we may want to remove those five features with the feature pruning technique to reduce unwanted elements as well as the computing time and effort, taking into consideration that the output remains unaffected.
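A minimal sketch of this idea with Pandas, using a made-up DataFrame and an arbitrary correlation cutoff:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
# a target that depends mostly on f1 and f2
df["target"] = 2 * df["f1"] - df["f2"] + rng.normal(scale=0.1, size=100)

# correlation of every feature with the target
corr = df.corr()["target"].drop("target").abs()
weak = corr[corr < 0.2].index          # features below an arbitrary 0.2 cutoff
pruned = df.drop(columns=weak)         # keep only the features worth modeling

print("dropped:", list(weak))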
Over and Under Sampling
Over and under sampling are techniques used for classification problems where the classes are not evenly represented. Imbalance can show up whenever we try to classify datasets. For example, imagine we have 2000 data points in Class 1 but only 200 data points in Class 2. This imbalance will throw off many of the machine learning techniques we use to model the data and make predictions from our observations! Here, over and under sampling come into the picture. Look at the representation below.
Looking at the image carefully, it can be seen that on both sides the blue class has a far higher number of samples than the orange class. In such a case, we have two pre-processing options that can help our later predictions.
Defining Under-Sampling
Under-sampling means selecting only some of the data from the majority class, using roughly as many data points as the minority class has. This selection is performed so that the probability distribution of the majority class is maintained.
Defining Oversampling
Oversampling means creating copies of the data points from the minority class to match the number of data points in the majority class. The copies are made so that the distribution of the minority class is maintained.
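A small sketch of both ideas with NumPy, using random index selection as a stand-in for more careful resampling strategies:
import numpy as np

rng = np.random.default_rng(1)
major = rng.normal(size=(2000, 2))   # 2000 samples in the majority class
minor = rng.normal(size=(200, 2))    # only 200 samples in the minority class

# under-sampling: keep a random subset of the majority class, same size as the minority class
keep = rng.choice(len(major), size=len(minor), replace=False)
major_down = major[keep]

# oversampling: draw from the minority class with replacement until it matches the majority class
dup = rng.choice(len(minor), size=len(major), replace=True)
minor_up = minor[dup]

print(major_down.shape, minor_up.shape)   # (200, 2) (2000, 2)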
Bayesian Statistics
Before understanding Bayesian Statistics, we need to know why frequency
analysis can't be applied here. You can understand from the following
example.
Imagine you are playing a dice game. What are the chances of you rolling a perfect 6? You would probably say that the chance would be 1 in 6, right? If we performed a frequency analysis here, and someone rolled the dice 10,000 times, we would indeed come out with an estimate of about 1 in 6. But if you are given a dice that is loaded to always land on 6, it will roll a six every time. Since frequency analysis only takes the prior rolls into account, it can fail in situations like this. Bayesian statistics also takes the evidence into account.
Bayes' Theorem
Let us learn the meaning of the terms in the theorem, P(H | E) = P(E | H) × P(H) / P(E). The term P(H) is our frequency analysis: given our prior information, what is the probability of our event occurring? The P(E|H) in our equation is known as the likelihood, and it is essentially the probability that our evidence is correct, given the information from our frequency analysis. For instance, if you rolled the dice 5,000 times and the first 1,000 rolls all came out as a 6, it is pretty clear that the dice only rolls a perfect six. Finally, P(E) is the probability that the actual evidence is true.
If the frequency analysis you performed is very good, then you can be fairly certain that the dice is loaded to roll a perfect 6. Yet you should also weigh the evidence of the loaded dice, whether it is true or not, against the prior information and the frequency analysis you just performed. Everything is taken into consideration in this theorem. Bayes' theorem is useful when you suspect that the prior data alone is not sufficient to predict the result. These statistical concepts are very useful for an aspiring data scientist.
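As a sketch of how the update works in practice, here is the loaded-dice example in code; the prior and likelihood values are made up purely for illustration:
# H = "the dice is loaded", E = "we observed a long run of sixes"
p_h = 0.01            # prior belief that a dice is loaded (assumed)
p_e_given_h = 0.95    # chance of seeing this run of sixes if the dice is loaded (assumed)
p_e_given_not_h = (1 / 6) ** 10   # chance of a fair dice producing ten sixes in a row

# total probability of the evidence, then Bayes' theorem
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e

print(round(p_h_given_e, 6))   # close to 1: the evidence makes "loaded" very likely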
Chapter 8 - Exploring Our Raw Data
The first step that shows up when you are working with this process of data
science is going through and exploring some of the raw data that you are
working with. We are not able to build up the models that we want to work with, or learn anything of value, if we are not first able to explore where to find this data and then see what is inside.
Many companies already know a lot about working with the collecting of
data, and it is likely that you already have a ton of this data present in some
storage for your needs. That is a great place to start, but it is uncertain
whether you will have the right data for your needs. Just because you have
been able to collect a large amount of data doesn’t mean that you are able to
use it, so that is part of what we need to explore in this guidebook to help us
get started.
Basics of Python
Keywords are an important part of Python programming; they are words that
are reserved for use by Python itself. You can’t use these names for anything
other than what they are intended for, and you most definitely can’t use them
as part of an identifier name, such as a function or a variable. Reserved
keywords are used for defining the structure and the syntax of Python. There
are, at the present time, 33 of these words, and one thing you must remember
is that they are case sensitive—only three of them are capitalized, and the rest
are in lower case. These are the keywords, written exactly as they appear in
the language:
False
if
assert
as
is
global
in
pass
finally
try
not
while
return
None
for
True
class
break
elif
continue
yield
and
del
with
import
else
def
except
from
or
lambda
nonlocal
raise
Note: only True, False, and None are capitalized.
The identifiers are the names that we give to things like variables, functions,
classes, etc., and the name is just so that we can identify one from another.
There are certain rules that you must abide by when you write an identifier:
● You may use a combination of lowercase letters (a to z), uppercase letters
(A to Z), digits (0 to 9), and the underscore (_). Names such as func_2,
myVar and print_to_screen are all examples of perfectly valid identifier
names.
● You may not start an identifier name with a digit, so 2Class would be
invalid, whereas class2 is valid.
● You may not, as mentioned above, use a reserved keyword in the
identifier name. For example:
>>> global = 2
File "<interactive input>", line 3
global = 2
^
Would give you an error message of:
SyntaxError: invalid syntax
● You may not use any special symbols, such as $, %, #, !, etc., in the
identifier name. For example:
>>> a$ = 1
File "<interactive input>", line 13
a$ = 1
^
Would also give you the following error message:
SyntaxError: invalid syntax
An identifier name can be any length you require.
Things to note are:
• Because the Python programming language is case sensitive, variable
and Variable would mean different things.
• Make sure your identifier names reflect what the identifier does. For
example, while you could get away with writing c = 12, it would make
more sense to write count = 12. You know at a glance exactly what it
does, even if you don’t look at the code for several weeks.
• Use underscores where possible to separate a name made up of
multiple words, for example, this_variable_has_many_words
You may also use camel case. This is a writing style where the first letter of
every word is capitalized except for the first one, for example,
thisVariableHasManyWords.
Advantages of Machine Learning
Due to the sheer volume and magnitude of the tasks, there are some instances
where an engineer or developer cannot succeed, no matter how hard they try;
in those cases, the advantages of machines over humans are clearly stark.
Identifies Patterns
When the engineer feeds a machine with artificial intelligence a training data
set, it will then learn how to identify patterns within the data and produce
results for any other similar inputs that the engineer provides the machine
with. This is efficiency far beyond that of a normal analyst. Due to the strong
connection between machine learning and data science (which is the process
of crunching large volumes of data and unearthing relationships between the
underlying variables), through machine learning, one can derive important
insights into large volumes of data.
Improves Efficiency
Humans might have designed certain machines without a complete
appreciation for their capabilities, since they may be unaware of the different
situations in which a computer or machine will work. Through machine
learning and artificial intelligence, a machine will learn to adapt to
environmental changes and improve its own efficiency, regardless of its
surroundings.
Statistics
A common problem in statistics is testing a hypothesis and identifying the
probability distribution that the data follows. This allows the statistician to
predict the parameters for an unknown data set. Hypothesis testing is one of
the many concepts of statistics that are used in machine learning. Another
concept of statistics that’s used in machine learning is predicting the value of
a function using its sample values. The solutions to such problems are
instances of machine learning, since the problems in question use historical
(past) data to predict future events. Statistics is a crucial part of machine
learning.
Brain Modeling
Neural networks are closely related to machine learning. Scientists have
suggested that nonlinear elements with weighted inputs can be used to create
a neural network. Extensive studies are being conducted to assess these
elements.
Evolutionary Models
A common theory in evolution is that animals prefer to learn how to better
adapt to their surroundings to enhance their performance. For example, early
humans started to use the bow and arrow to protect themselves from
predators that were faster and stronger than them. As far as machines are
concerned, the concepts of learning and evolution can be synonymous with
each other. Therefore, models used to explain evolution can also be utilized
to devise machine learning techniques. The most prominent technique that
has been developed using evolutionary models is the genetic algorithm.
Programming Languages
R
R is a programming language that is estimated to have close to 2 million
users. This language has grown rapidly to become very popular since its
inception in 1990. It is a common belief that R is not only a programming
language for statistical analysis but can also be used for multiple functions.
This tool is not limited to only the statistical domain. There are many features
that make it a powerful language.
The programming language R is one that can be used for many purposes,
especially by data scientists to analyze and predict information through data.
The idea behind developing R was to make statistical analysis easier.
As time passed, the language began to be used in different domains. There
are many people who are adept at coding in R, although they are not
statisticians. This situation has arisen since many packages are being
developed that help to perform functions like data processing, graphic
visualization, and other analyses. R is now used in the spheres of finance,
genetics, language processing, biology, and market research.
Python
Python is a language that has multiple paradigms. You can probably think of
Python as a Swiss Army knife in the world of coding, since this language
supports structured programming, object-oriented programming, functional
programming, and other types of programming. Python is the second-best
language in the world since it can be used to write programs in every industry
and for data mining and website construction.
The creator, Guido van Rossum, decided to name the language Python after Monty Python. If you were to use some built-in packages, you would find references to Monty Python in the code or documentation. It is for this reason and many others that Python is a language that most programmers love, though engineers or those with a scientific background who are now data scientists may find it difficult to work with at first.
Python’s simplicity and readability make it quite easy to understand. The
numerous libraries and packages available on the internet demonstrate that
data scientists in different sectors have written programs that are tailored to
their needs and are available to download.
Since Python can be extended to work best for different programs, data
scientists have begun to use it to analyze data. It is best to learn how to code
in Python since it will help you analyze and interpret data and identify
solutions that will work best for a business.
Chapter 10 - Classification and Prediction
Classification
Classification is one of the most important tasks in data mining. It is based on examining an object's features and, based on those features, assigning the object to a predetermined set of classes.
The basic idea goes like this: given a set of categories (classes) and a dataset of samples for which we know to which class they belong, the goal of classification is to create a model which will then be able to automatically assign these categories to new, unknown, unclassified samples.
Decision Trees
Decision trees are one of the most popular classification models. Decision
trees are a simple form of rules representation and are widely popular because
they are easily understandable.
Description
Decision trees are among the simplest classification models. A decision tree consists of internal nodes and leaves. Internal nodes are the nodes which have children, while leaves are the lowest-level nodes which have no children. The decision tree is represented as follows:
So finally:
Next, we will calculate the information gain for the Temperature variable.
We have a total of 8 samples and the Temperature variable gets 4 times the
value High, 2 times the value Normal and 2 times the value Low. For the 4
samples with value Temperature=Normal, 2 of them have the class value In
and 2 of them have the class value Out. Both two samples with
Temperature=Normal have a class value of Out. For the two samples with
value Temperature=Low, 1 has a class value of In and 1 has a class value of
Out. So, we have:
where:
So finally:
We then continue with the Humidity variable. We have a total of 8 samples
and the Humidity variable takes the value High 4 times and the value Normal
4 times. For the 4 samples with Humidity = High, 2 have a class value of In
and 2 have a class value of Out. For the 4 samples with Humidity = Normal,
1 has a class value of In and 3 have a class value of Out. So, we have:
where:
So finally:
Last, we have the Wind variable. We have a total of 8 samples and the Wind
variable gets 6 times the value Light and 2 times the value Strong. For the 6
samples with value Wind =Light, 1 has a class value of In and 5 have a class
value of Out. For the two samples with value Wind =Strong, 1 has a class
value of In and 1 has a class value of Out. So, we have:
where:
So finally:
From the above we can see that the Weather variable has the highest information
gain. So, we choose it as the root of our tree.
We then need to examine how each branch will continue. For the Sunshine
and Cloudy values, we notice that all samples belong to the same class, In
and Out respectively. This leads us to leaves:
We now need to examine the samples with the value Weather=Rainy.
Initially, we calculate the information gain of the other variables. For the
Temperature (Wind) variable we have 2 samples with Normal (Light) and 1
sample with Low (Strong). For the Temperature=Normal (Wind=Light) we
have 2 samples with class Out and 0 samples with class In, while for the
Temperature=Low (Wind=Strong) we have 1 sample with class In and 0
samples with class Out. Therefore, we have:
where:
Therefore:
Last, for the Humidity variable we have two samples with Normal value and
1 sample with High value. For the sample with Humidity=High we have 1
time the class Out and 0 times the class In. For the two samples with value
Humidity=Normal we have 1 time the class In and 1 time the class Out.
where:
So, we have:
We select the variable with the higher information gain, that is either the
Temperature variable or the Wind variable since they have the same
information gain. On the image below, we can see the final decision tree by
using the algorithm ID3.
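To make these calculations concrete, here is a minimal sketch in Python of how the entropy and information gain used throughout this example can be computed. The overall class distribution used in the call (2 In and 6 Out) is an assumption inferred from the Wind counts given above.
import math

def entropy(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    # Parent entropy minus the weighted average entropy of the children.
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# The Wind variable from the text: Light has 1 In and 5 Out,
# Strong has 1 In and 1 Out (parent distribution assumed to be 2 In, 6 Out).
print(information_gain([2, 6], [[1, 5], [1, 1]]))
The other variables can be scored the same way by swapping in their counts, and the variable with the largest gain becomes the splitting node.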
Prediction
Difference between Classification and Prediction
At first glance, classification and prediction seem similar. The basic
difference between them is that in classification there is a finite set of
discrete classes: the samples are used to create a model which is then able to
classify new samples. In prediction, the value derived from the model is
continuous and doesn't belong to any predefined finite set. As mentioned
previously in the Titanic example, we have a finite number of class values
(Survived, Died), so we have a decision tree which performs classification. If
the values of the target variable were not finite, we would instead have a
regression tree, which performs prediction.
Linear Regression
Description, Definitions and Notations
The α (alpha) parameter is called the learning parameter and determines how big
each step will be in each iteration of the algorithm. Usually, parameter α is
given a fixed value and is not adjusted during execution. The partial derivative
with respect to βj determines the direction in which the algorithm will move on
the current step. Finally, the updates of the βj parameters are applied at the
end of each iteration. The corresponding pseudocode demonstrating how the β0 and
β1 parameters are updated is the following:
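A minimal sketch in Python of one such iteration (b0 and b1 stand for β0 and β1; the sample data and the value of α below are arbitrary choices made only for illustration):
# One gradient descent iteration with a simultaneous update of b0 and b1,
# assuming a linear hypothesis h(x) = b0 + b1*x and m training samples.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
m = len(x)
b0, b1, alpha = 0.0, 0.0, 0.1

tmp0 = b0 - alpha * (1 / m) * sum(b0 + b1 * x[i] - y[i] for i in range(m))
tmp1 = b1 - alpha * (1 / m) * sum((b0 + b1 * x[i] - y[i]) * x[i] for i in range(m))
b0, b1 = tmp0, tmp1   # both new values are applied at the end of the iteration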
So, after we calculate the new value of β0 (tmp0), we still use the old value of
β0 to calculate the new value of β1 (tmp1); both new values are then applied
together and used in the next iteration.
Learning Parameter
The learning parameter is the α parameter we saw in the gradient descent
algorithm. The most important question at this point is by what criteria we
choose the value of this parameter. First, let's see how we can make sure that
our algorithm is working correctly. We will need to plot the cost function F
against the number of algorithm iterations. As the number of iterations grows,
we expect the cost function to follow a descending route.
On the contrary, if we have a graph like the one below then the algorithm will
not work right. This could be caused by the value of the learning parameter.
In the algorithm graph, the learning parameter defines how large the step will
be. If the value is very small then the algorithm will need a lot of time to find
a minimum (see image below):
Conversely, if it is too large, it is possible to overshoot the minimum and
even start moving toward higher values of the cost function (see image below):
Unfortunately, there is no universal rule for choosing the learning parameter.
The only way is through testing: carefully watch the graph of the cost function
against the number of iterations and make sure it stays on a descending route.
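As an illustration of this kind of check, the sketch below runs gradient descent on a small synthetic data set for a few candidate values of α and plots the cost against the iteration number. All of the data and the exact values of α are arbitrary choices made for the example.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration: y is roughly 3 + 2x plus noise.
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + np.random.randn(50)

def cost(b0, b1):
    # Mean squared error cost F for the current parameters.
    return np.mean((b0 + b1 * x - y) ** 2) / 2

for alpha in (0.001, 0.01, 0.05):
    b0 = b1 = 0.0
    history = []
    for _ in range(200):
        error = b0 + b1 * x - y
        b0, b1 = b0 - alpha * error.mean(), b1 - alpha * (error * x).mean()
        history.append(cost(b0, b1))
    plt.plot(history, label='alpha = %s' % alpha)

plt.xlabel('iteration')
plt.ylabel('cost F')
plt.legend()
plt.show()
A learning parameter that is too small produces a curve that descends very slowly, while one that is too large produces a curve that flattens out high or even rises.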
OVERFITTING AND REGULARIZATION
Overfitting
Previously we examined linear regression. As we saw, the produced model tries to
match the data as closely as possible. There are three possible scenarios for
our model: it can underfit the data, fit it well, or overfit it.
Model Regularization
The basic idea of model regularization is that small values of the β1, β2, …,
βn parameters lead to simpler hypotheses, thus reducing the chance of
overfitting. In the case of linear regression, we just need to add an
additional term to the cost function:
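A minimal sketch of such a regularized cost in code (assuming a linear hypothesis with m samples in X and y, and following the usual convention that the intercept β0 is left out of the penalty):
import numpy as np

def regularized_cost(beta, X, y, lam):
    # Least-squares cost plus an L2 penalty on the beta parameters;
    # beta[0] (the intercept) is conventionally not penalized.
    m = len(y)
    predictions = X @ beta              # X includes a leading column of ones
    squared_error = np.sum((predictions - y) ** 2)
    penalty = lam * np.sum(beta[1:] ** 2)
    return (squared_error + penalty) / (2 * m)

# Example call with a tiny design matrix (values are arbitrary):
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
print(regularized_cost(np.array([1.0, 2.0]), X, y, lam=0.5))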
Essentially, the additional term pushes the βj parameters toward smaller values
so that the value of the function is smaller overall. The λ regularization
parameter regulates how closely the model will fit the data and what the order
of magnitude of the βj parameters will be, so that we can avoid overfitting. If
λ takes very high values, though (e.g. λ = 10^10), the βj parameters will become
so small that they tend to 0, leading to underfitting.
As the survey statistics above show, most of a data scientist's time is spent
preparing the data, which means they have to spend a good deal of time
collecting, organizing, and cleaning it before they can even start analyzing it.
There are valuable and enjoyable tasks in data science, like data visualization
and data exploration, but data preparation is widely considered the least
enjoyable part of the process.
The amount of time that you actually will spend on preparing the data for a
specific problem with the analysis is going to directly depend on the health of
the data. If there are a lot of errors, missing parts, and duplicate values, then
this is a process that will take a lot longer. But if the data is well-organized
and doesn't need a lot of fixing, then the data preparation process is not going
to take long at all. In particular, watch out for issues like the following:
1. The set of data that you are working with could contain a
few discrepancies in the codes or the names that you are
using.
2. The set of data that you are working with could contain a lot
of outliers or some errors that mess with the results.
3. The set of data that you are working with will lack your
attributes of interest to help with the analysis.
4. The set of data that you want to explore may be strong in
quantity but weak in quality. These are not the same thing,
and quality is usually what matters most.
Each of these things has the potential to really mess up the model that you are
working on and could get you results or predictions that are not as accurate as
you would like. Taking the time to prepare your data and get it clean and
ready to go can solve this issue, and will ensure that your data is going to be
more than ready to use in no time.
Packages installations
To get started with NumPy, we have to install the package into our version of
Python. While the basic method for installing packages to Python is the pip
install method, we will be using the conda install method. This is the
recommended way of managing all Python packages and virtual
environments using the anaconda framework.
Since we installed a recent version of Anaconda, most of the packages we
need would have been included in the distribution. To verify if any package
is installed, you can use the conda list command via the anaconda prompt.
This displays all the packages currently installed and accessible via anaconda.
If your intended package is not available, then you can install via this
method:
First, ensure you have an internet connection. This is required to download
the target package via conda. Open the anaconda prompt, then enter the
following code:
conda install package
Note: In the code above, 'package' is a placeholder for the name of the package
to be installed, e.g. numpy, pandas, etc.
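For example, to see what is already installed and then add NumPy, you can type the following two commands at the anaconda prompt:
conda list
conda install numpy
The first command lists the packages installed in the active environment; the second downloads and installs NumPy (replace numpy with the name of any other package you need).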
As described earlier, we would be working with NumPy arrays. In
programming, an array is an ordered collection of similar items. Sounds
familiar? Yeah, they are just like Python lists, but with superpowers. NumPy
arrays are in two forms: Vectors, and Matrices. They are mostly the same,
only that vectors are one-dimensional arrays (either a column or a row of
ordered items), while a matrix is 2-dimensional (rows and columns). These
are the fundamental blocks of most operations we would be doing with
NumPy. While arrays incorporate most of the operations possible with
Python lists, we would be introducing some newer methods for creating, and
manipulating them.
To begin using the NumPy methods, we have to first import the package into
our current workspace. This can be achieved in two ways:
import numpy as np
Or
from numpy import *
The first form, which gives the package the short alias np, is the convention
used in the examples that follow.
Example: Importing the NumPy package and creating an array of integers from a
Python list.
In []: # import syntax
import numpy as np
# a Python list of integers
Int_list = [1,2,3,4,5]
np.array(Int_list)
Out[]: array([1, 2, 3, 4, 5])
In []: # the shape of the resulting vector
np.array(Int_list).shape
Out[]: (5,)
Python describes matrices as (rows, columns). In this case, it describes a
vector as (number of elements, ).
To create a matrix from a Python list, we need to pass a nested list containing
the elements we need. Remember, matrices are rectangular, and so each list
in the nested list must have the same size.
In []: # This is a matrix
x = [1,2,3]
y = [4,5,6]
my_matrix = np.array([y,x])   # build a 2 x 3 matrix from a nested list
A = my_matrix.ndim
B = my_matrix.shape
# Printing
print('Resulting matrix:\n\n',my_matrix,'\n\nDimensions:',A,
'\nshape (rows,columns):',B)
Resulting matrix:

 [[4 5 6]
 [1 2 3]]

Dimensions: 2
shape (rows,columns): (2, 3)
Now, we have created a 2 by 3 matrix. Notice how the shape method displays
the rows and columns of the matrix. To find the transpose of this matrix i.e.
change the rows into columns, use the transpose() method.
In []: # this finds the transpose of the matrix
t_matrix = my_matrix.transpose()
t_matrix
Out[]: array([[4, 1],
[5, 2],
[6, 3]])
Tip: Another way of knowing the number of dimensions of an array is by
counting the square-brackets that opens and closes the array (immediately
after the parenthesis). In the vector example, notice that the array was
enclosed in single square brackets. In the two-dimensional array example,
however, there are two brackets. Also, tuples can be used in place of lists for
creating arrays.
There are other methods of creating arrays in Python, and they may be more
intuitive than using lists in some applications. One quick method uses the
arange() function.
Syntax: np.arange(start value, stop value, step size, dtype = ‘type’)
In this case, we do not need to pass its output to the list function, our result is
an array object of a data type specified by ‘dtype’.
Example: Creating arrays with the arange() function.
We will create an array of numbers from 0 to 10, with an increment of 2
(even numbers).
In []: # Array of even numbers between 0 and 10
Even_array = np.arange(0,11,2)
Even_array
Out[]: array([ 0,  2,  4,  6,  8, 10])
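Another handy creation routine, referenced in the tip below, is linspace(), which returns a fixed number of evenly spaced values between two endpoints. The start, stop, and number of points used here are arbitrary choices for illustration.
In []: # 9 evenly spaced values between 0 and 2 (inclusive)
time_axis = np.linspace(0, 2, 9)
time_axis
Out[]: array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75, 2.  ])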
Tip 1: Linspace arrays are particularly useful in plots. They can be used to
create a time axis or any other required axis for producing well defined and
scaled graphs.
Tip 2: The output format in the example above is not the default way of
displaying output in a Jupyter notebook; by default, Jupyter displays only the
last result per cell. To display multiple results (without having to use the
print statement every time), the output behavior can be changed using the
following code.
In[]: # Allowing Jupyter output all results per cell.
# run the following code in a Jupyter cell.
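# a commonly used snippet, relying on the IPython machinery behind Jupyter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"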
Tip: Notice how the size parameter in the third line of the sketch below is
specified using a tuple. This is how to create a matrix of random integers
using randint.
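For reference, a small sketch of randint along the lines the tip describes (the bounds and sizes are arbitrary choices):
In []: # random integers between 1 and 6 (the upper bound is exclusive)
np.random.randint(1, 7)
np.random.randint(1, 7, size=5)        # a vector of 5 random integers
np.random.randint(1, 7, size=(3, 4))   # a 3 x 4 matrix: size given as a tuple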
In []: # Roll two dice with randint
dice1 = np.random.randint(1,7)   # a random integer between 1 and 6
dice2 = np.random.randint(1,7)
# Display Condition.
if dice1 == dice2:
    print('Roll: ',dice1,'&',dice2,'\ndoubles !')
    if dice1 == 1:
        print('snake eyes!\n')
else:
    print('Roll: ',dice1,'&',dice2)
In []: np.reshape(values,(2,5))
Example:
Let us find the maximum and minimum values in the ‘values’ array, along
with the index of the minimum and maximum within the array.
In []: A = values.max(); B = values.min()
C = values.argmax() + 1; D = values.argmin() + 1   # +1 turns the zero-based index into a position
print('Max:',A,' Min:',B,'\nPosition of max:',C,' Position of min:',D)
Conditional Selection
Similar to how conditional selection works with NumPy arrays, we can select
elements from a data frame that satisfy a Boolean criterion.
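These examples use a data frame called 'Arr_df' whose construction is not shown here. The sketch below rebuilds a frame of the same shape; all column names other than 'odd1', 'even2', and 'Even sum' are assumptions made for illustration.
In []: import pandas as pd
import numpy as np

# Rebuild a 5-row data frame resembling 'Arr_df'; only 'odd1', 'even2'
# and 'Even sum' are named in the text, the other labels are assumed.
data = np.arange(1, 21).reshape(5, 4)
Arr_df = pd.DataFrame(data, index=['A', 'B', 'C', 'D', 'E'],
                      columns=['odd1', 'even1', 'odd2', 'even2'])
Arr_df['Odd sum'] = Arr_df['odd1'] + Arr_df['odd2']
Arr_df['Even sum'] = Arr_df['even1'] + Arr_df['even2']
Arr_df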
Example: Let us grab sections of the data frame ‘Arr_df’ where the value is >
5.
In []: # Grab elements greater than five
Arr_df[Arr_df>5]
Output:
Notice how the instances of values that fail the condition (those not greater
than 5) are represented with a 'NaN'. Another way to use this conditional
selection is to filter based on column specifications.
You can remove entire rows of data by specifying a Boolean condition based on a
single column. Assuming we want to return the Arr_df data frame
without the row ‘C’. We can specify a condition to return values where the
elements of column ‘odd1’ are not equal to ‘9’ (since row C contains 9 under
column ‘odd1’).
In []: # removing row C through the first column
Arr_df[Arr_df['odd1']!= 9]
Output:
A 1 2 3 4 4 6
B 5 6 7 8 12 14
D 13 14 15 16 28 30
E 17 18 19 20 36 38
Notice that row ‘C’ has been filtered out. This can be achieved through a
smart conditional statement through any of the columns.
In []: # does the same thing : remove row ‘C’
# Arr_df[Arr_df['even2']!= 12]
In[]: # Let us remove rows D and E through 'even2'
Arr_df[Arr_df['even2']<= 12]
Output
A 1 2 3 4 4 6
B 5 6 7 8 12 14
C 9 10 11 12 20 22
Exercise: Remove rows C, D, E via the ‘Even sum’ column. Also, try out
other such operations as you may prefer.
To combine conditional selection statements, we can use the ‘logical and, i.e.
&’, and the ‘logical or, i.e. |’ for nesting multiple conditions. The regular
‘and’ and ‘or’ operators would not work in this case as they are used for
comparing single elements. Here, we will be comparing a series of elements
that evaluate to true or false, and those generic operators find such operations
ambiguous.
Example: Let us select elements that meet the criteria of being greater than 1
in the first column, and less than 22 in the last column. Remember, the ‘and
statement’ only evaluates to true if both conditions are true.
In []: Arr_df[(Arr_df['odd1']>1) & (Arr_df['Even sum']<22)]
Output:
B 5 6 7 8 12 14
Only row 'B' meets this criterion, so it is the only row returned in the data
frame.
This approach can be extended to create even more powerful data frame filters.
Chapter 14 - Python Debugging
Like most computer programming languages, Python relies on debugging processes
to produce exceptional programs. A debugger lets you run an application under
its control and pause at the breakpoints you set. Python also provides an
interactive source-code debugger that can run a program under program control.
Other activities that go hand in hand with a debugger are unit testing,
integration testing, analysis of log files and log flows, and system-level
monitoring.
Running a program within the debugger involves several tools that work through
a given command line or IDE. The development of more sophisticated computer
programs has significantly contributed to the expansion of debugging tools.
These tools offer various methods for detecting abnormalities in Python
programs, evaluating their impact, and planning updates and patches to correct
emerging problems. In some cases, debugging tools also help programmers develop
new programs by eliminating code and encoding faults early.
Debugging
Debugging is the technique used to detect and fix defects or problems within a
specific computer program. The term 'debugging' is often credited to Admiral
Grace Hopper who, while working at Harvard University on the Mark II computer
in the 1940s, found a moth stuck between relays that was hindering the
computer's operation and described removing it as 'debugging' the system.
Although the term had been used earlier by Thomas Edison in 1878, debugging
only became popular in the early 1950s, when programmers adopted it to refer to
fixing computer programs.
By the 1960s, debugging had gained popularity among computer users and become
the most common term used to describe solutions to major computing problems. As
the world has become more digitalized, with ever more challenging programs,
debugging has come to cover a significant scope, and words like errors, bugs,
and defects are increasingly replaced by more neutral ones such as anomaly or
discrepancy. These neutral terms are themselves under assessment to determine
whether their way of describing computing problems is cost-effective for the
system or whether further changes should be made. The aim of the assessment is
a more practical term for computer problems, one that retains the meaning while
preventing end-users from dismissing the acceptability of faults.
Anti-Debugging
Anti-debugging is the opposite of debugging and encompasses techniques
implemented to prevent debugging or reverse engineering of computer code. It is
used, for example, in copy-protection schemes, and by malware to detect and
evade debuggers. Anti-debugging is therefore the complete opposite of debugger
tools, which exist to detect and remove the errors that occasionally appear
during Python programming. Some of the conventional techniques used are:
API-based
Exception-based
Modified code
Determining and penalizing debugger
Hardware-and register-based
Timing and latency
Stepping
Stepping is another concept used with debugging tools to make programs work
correctly. In Python, stepping means moving through the code one line at a time
to find the lines with defects, or anything else that needs attention, before
full execution. Stepping through code can be done as step in, step over, and
step out. Step in executes the next line and, if that line contains a function
call, enters the called function so it can be debugged. Step over executes the
next line in the current function without stepping into any function it calls.
Step out runs the rest of the current function and stops once it returns to its
caller.
Function Verification
When writing code, it is vital to keep track of the state of each part of the
program, especially calculations and variables. Similarly, function calls can
stack up, so it helps to follow the calling chain and understand how each
function affects the next one. Likewise, when stepping in, it is recommended to
enter the nested calls first, so as to develop a sequential picture of which
code executes first.
Processes of Debugging
Problem Reproduction
The primary function of a debugger application is to detect and eliminate
problems affecting programming processes. The first step in the debugging
process is to try to identify and reproduce the existing problem, whether it is
a failing nontrivial function or a rarer software bug. Debugging then focuses
on the immediate state of your program, noting the bugs present at that time.
Reproduction is typically affected by the computer's usage history and the
immediate environment, which can influence the end results.
Debugging Techniques
Like other programming languages, Python offers a range of debugging techniques
to enhance bug identification and elimination. Some of the standard methods are
interactive, print, remote, postmortem, algorithmic, and delta debugging. The
differences between these techniques lie in how they track down and remove
bugs. For instance, print debugging entails monitoring and tracing bugs by
printing out the state of the program as it runs.
Remote debugging removes bugs from a program running on a machine other than
the one running the debugger tool, while postmortem debugging identifies and
eliminates bugs in programs that have already crashed. Learning the different
types of debugging helps you decide which one to use when tracking down a
Python programming problem. Other techniques are fault squeezing, which
isolates faults, and causality tracking, which is essential for tracing causal
agents in a computation.
Debuggers Tools
Python debuggers can be general-purpose or special-purpose in nature, depending
on the platform used, that is, the operating system. Some general-purpose
debuggers are pdb and PdbRcldea, along with pudb, Winpdb, Epdb2, epdb, JpyDbg,
pydb, trepan2, and Pythonpydebug. On the other hand, more specialized debuggers
include gdb, DDD, Xpdb, and the HAP Python Remote Debugger. All the above
debugging tools operate on different parts of a Python program, with some used
during installation, program creation, remote debugging, thread debugging, and
graphical debugging, among others.
IDEs Tools
Integrated Development Environments (IDEs) offer some of the best Python
debugging tools, and they suit big projects well. Although the tools vary
between IDEs, the core features remain the same: executing code, inspecting
variables, and creating breakpoints. The most common and widely used IDE for
Python debugging is PyCharm, which comes with a complete set of features,
including plugins, for maximizing the performance of Python programs. Other IDE
debugging tools are also great and readily available today, including Komodo
IDE, Thonny, PyScripter, PyDev, Visual Studio Code, and Wing IDE, among others.
Special-Purpose Tools
Special-purpose debugging tools are essential for detecting and eliminating
bugs in specific parts of a Python program, and many of them work on remote
processes. These tools are most useful when tracing problems in sensitive or
remote areas that other debuggers are unlikely to reach. Some of the most
commonly used special-purpose debugging tools are FirePython (used in Firefox
as a Python logger), manhole, PyConquer, pyringe, hunter, icecream, and
PySnooper. This class of debugging tools enables programmers to quickly
identify hidden and unnoticed bugs and bring them to light so they can be
eliminated from the system.
Debugger Commands
Since debugging is a common part of working in the language, there are several
commands used to move between the various operations. The basic commands are
the most essential for beginners and can usually be abbreviated to one or a few
letters. A blank space must separate a command from its arguments, and optional
parts are shown enclosed in square brackets; the square brackets themselves are
not typed, and alternatives are separated by a vertical bar. Statements that
the debugger does not recognize as commands are assumed to be Python statements
and are executed in the context of the program being debugged.
To inspect or run a Python statement explicitly, prefix it with an exclamation
mark; this makes it possible to change variables as well as call functions from
within the debugger. Several commands may also be entered on the same line,
separated by ';;'. Debugging also works with aliases, which allow a word to
stand for a longer command in the same context; aliases make it easier to
re-run useful command sequences from the debugger prompt.
Running
The command used is '[!]statement' or 'r(un)', which executes the program up to
the intended lines and identifies errors, if any. Arguments can be passed to
the run command much as they would be when running the program without a
debugger. For example, when the application is named 'prog1', the command to
use is "r prog1 <infile"; the debugger will then run the program with its input
redirected from the given file.
Breakpoints
As essential components of debugging, breakpoints use the command
'b(reak) [[filename:]lineno|function[, condition]]' to make the debugger stop
when program execution reaches that point. When execution reaches a breakpoint,
the process is suspended for a while and the debugger command prompt appears on
the screen. This gives you time to check the variables and identify any errors
or mistakes that might affect the process. Breakpoints can therefore be
scheduled to halt execution at any line, designated either by line number or by
function name.
Back Trace
A backtrace is produced with the command 'bt' and lists the chain of pending
function calls at the moment the program stops. Backtrace commands are only
valid while execution is suspended at a breakpoint, or after the program has
exited abnormally during a runtime error, a state called a segmentation fault.
This form of debugging is especially useful for segmentation faults, as it
indicates the source of the error through the list of pending function calls.
Printing
Printing is primarily used to examine the value of variables or expressions
while a function is being inspected, before execution continues. In pdb, an
expression can be printed with the 'p' command once the program has stopped at
a breakpoint or during a runtime error; in C-oriented debuggers the printed
expression may be any legal C expression, including function calls. Besides
printing, execution can be resumed after a breakpoint or runtime error with the
command 'c(ont(inue))'.
Single Step
Single stepping uses the commands 's(tep)' and 'n(ext)' after a breakpoint to
move through the source lines one at a time. The two commands differ slightly:
'step' executes the current line and stops inside any function it calls, while
'next' executes the current line without stopping inside called functions.
Running the program line by line in this way usually gives the best results
when tracing errors during execution.
Trace Search
With the command, ‘up, down,' the program functions can either be scrolled
downwards or upwards using the trace search within the pending calls. This
form of debugging enables you to go through the variables within varying
levels of calls in the list. Henceforth, you can readily seek out mistakes as
well as eliminate errors using the desired debugging tool.
File Select
Another basic debugger command is file select, which uses
'l(ist) [first[, last]]' to list lines of source code. Some programs,
especially those built with complex programming techniques, are made up of two
or more source files, which makes debugging tools all the more necessary in
such cases. The debugger should be set on the main source file so that
breakpoints and runtime errors can be examined against the lines in those
files. With Python, the desired source file can be readily selected from the
list and prescribed as the working file.
Python Debugger
In the Python programming language, the pdb module defines the interactive
source-code debugger; it supports setting breakpoints and single stepping at
the source line level, source code listing, and evaluation of arbitrary Python
code in the context of any stack frame. Post-mortem debugging is also
supported, and the debugger can be called under program control. The Python
debugger is extensible, usually in the form of classes built on pdb; the
interface uses the pdb and cmd modules.
The pdb debugger prompt is essential for running programs under the control of
the debugging tools; for instance, pdb.py can be invoked like a script to debug
other scripts. It may also be used to inspect crashed programs, using several
functions in slightly different ways. Some of the functions provided are
run(statement[, globals[, locals]]) for running Python statements under
debugger control and runeval(expression[, globals[, locals]]) for evaluating
expressions. There are also several other functions, not mentioned above, for
executing Python programs under the debugger.
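As a brief sketch of the module in use (the function and values here are invented for the example):
import pdb

def buggy_sum(numbers):
    total = 0
    for n in numbers:
        total += n
    return total

# Run a statement under debugger control; at the (Pdb) prompt you can
# use commands such as b, s, n, c, p and q.
pdb.run('buggy_sum([1, 2, 3])')

# Alternatively, drop into the debugger at a chosen point in your own code:
# import pdb; pdb.set_trace()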
Debugging Session
Debugging in Python is usually a repetitive process: you write code and run it;
it does not work; you bring in the debugging tools, fix the errors, and repeat
the process again and again. Because debugging sessions tend to reuse the same
techniques, there are some key points to note. The sequence below streamlines
your programming process and minimizes the repetition seen during program
development:
Set breakpoints
Run the program under the relevant debugging tool
Check the variable values and compare them with what the function is
expected to produce
When all seems correct, either resume the program or wait for another
breakpoint and repeat if need be
When something goes wrong, determine the source of the problem, alter the
current line of code, and begin the process once more
Install pdb++
When working with Python, it is worth installing the pdb++ package to make it
easier to move around within the command-line debugger. It gives you a nicely
colorized prompt and full tab completion, displayed elegantly. pdb++ also
improves the look and feel of your debugger, acting as a drop-in upgrade of the
standard pdb module.
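The package is published on PyPI under the name pdbpp, so a typical installation (assuming pip is available) looks like this:
pip install pdbpp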
Ask Question
If you know developers who use Python or other platforms, ask them questions
about debugging, since they almost certainly use these tools heavily. If you
are just beginning and have no one to ask, go online and find forums, of which
there are many today. Interact with them by seeking answers to your debugging
problems, and play around with some programs you create while using the
debugger tools. Avoid making assumptions about any part of Python programming,
especially debugging, as assumptions can lead to failures in program
development.
Be Clever
When we create programs and avoid errors by using debuggers, the outcome can
make you feel excited and even overwhelmed. Be smart, but keep an eye on your
work and on your future projects. The success of creating one realistic and
useful program does not mean that you will never fail in the future. Remaining
in control will prepare you to use Python's debugging tools wisely and to claim
your future accomplishments positively.
Identifies Patterns
When the engineer feeds a machine with artificial intelligence a training data
set, it will then learn how to identify patterns within the data and produce
results for any other similar inputs that the engineer provides the machine
with. This is efficiency far beyond that of a normal analyst. Due to the strong
connection between machine learning and data science (which is the process
of crunching large volumes of data and unearthing relationships between the
underlying variables), through machine learning, one can derive important
insights into large volumes of data.
Improves Efficiency
Humans might have designed certain machines without a complete
appreciation for their capabilities, since they may be unaware of the different
situations in which a computer or machine will work. Through machine
learning and artificial intelligence, a machine will learn to adapt to
environmental changes and improve its own efficiency, regardless of its
surroundings.
Statistics:
A common problem in statistics is testing a hypothesis and identifying the
probability distribution that the data follows. This allows the statistician to
predict the parameters for an unknown data set. Hypothesis testing is one of
the many concepts of statistics that are used in machine learning. Another
concept of statistics that’s used in machine learning is predicting the value of
a function using its sample values. The solutions to such problems are
instances of machine learning, since the problems in question use historical
(past) data to predict future events. Statistics is a crucial part of machine
learning.
Brain Modeling:
Neural networks are closely related to machine learning. Scientists have
suggested that nonlinear elements with weighted inputs can be used to create
a neural network. Extensive studies are being conducted to assess these
elements.
Psychological Modeling:
For years, psychologists have tried to understand human learning. The EPAM
network is a method that’s commonly used to understand human learning.
This network is utilized to store and retrieve words from a database when the
machine is provided with a function. The concepts of semantic networks and
decision trees were only introduced later. In recent times, research in
psychology has been influenced by artificial intelligence. Another aspect of
psychology called reinforcement learning has been extensively studied in
recent times, and this concept is also used in machine learning.
Artificial Intelligence:
As mentioned earlier, a large part of machine learning is concerned with the
subject of artificial intelligence. Studies in artificial intelligence have focused
on the use of analogies for learning purposes and on how past experiences
can help in anticipating and accommodating future events. In recent years,
studies have focused on devising rules for systems that use the concepts of
inductive logic programming and decision tree methods.
Evolutionary Models:
A common theme in evolution is that animals learn to adapt to their
surroundings in order to enhance their performance. For example, early
humans started to use the bow and arrow to protect themselves from
predators that were faster and stronger than them. As far as machines are
concerned, the concepts of learning and evolution can be synonymous with
each other. Therefore, models used to explain evolution can also be utilized
to devise machine learning techniques. The most prominent technique that
has been developed using evolutionary models is the genetic algorithm.
Programming Languages
R:
R is a programming language that is estimated to have close to 2 million
users. This language has grown rapidly to become very popular since its
inception in the early 1990s. It is a common belief that R is not only a programming
language for statistical analysis but can also be used for multiple functions.
This tool is not limited to only the statistical domain. There are many features
that make it a powerful language.
The programming language R is one that can be used for many purposes,
especially by data scientists to analyze and predict information through data.
The idea behind developing R was to make statistical analysis easier.
As time passed, the language began to be used in different domains. There
are many people who are adept at coding in R, although they are not
statisticians. This situation has arisen since many packages are being
developed that help to perform functions like data processing, graphic
visualization, and other analyses. R is now used in the spheres of finance,
genetics, language processing, biology, and market research.
Python:
Python is a language that has multiple paradigms. You can probably think of
Python as a Swiss Army knife in the world of coding, since this language
supports structured programming, object-oriented programming, functional
programming, and other types of programming. Python is one of the most widely
used languages in the world, since it can be used to write programs for every
industry, from data mining to website construction.
The creator, Guido van Rossum, decided to name the language Python after
Monty Python. If you were to look at some built-in packages, you would find
references to Monty Python sketches in the code or documentation. It is for this
reason, and many others, that Python is a language most programmers love, and
even engineers or those with a scientific background who have moved into data
science do not find it difficult to work with.
Python’s simplicity and readability make it quite easy to understand. The
numerous libraries and packages available on the internet demonstrate that
data scientists in different sectors have written programs that are tailored to
their needs and are available to download.
Since Python can be extended to work best for different programs, data
scientists have begun to use it to analyze data. It is best to learn how to code
in Python since it will help you analyze and interpret data and identify
solutions that will work best for a business.
Chapter 16 - Numba - Just In Time Python
Compiler
Although NumPy is written in C and Fortran and its standard routines working on
arrays of data are highly optimized, non-standard operations are still coded in
Python and might be painfully slow. Fortunately, the developers behind the
PyData ecosystem created a package that can translate Python code into native
machine code on the fly and execute it at the same speed as C programs. In some
respects, this approach is even better than compiled code, because the
resulting code is optimized for each particular machine and can take advantage
of all the features of the processor, whereas regular compiled programs might
ignore some processor features for the sake of compatibility with older
machines, or might even have been compiled before new features existed.
Besides, your Python program, using the Numba just-in-time compiler, will work
on any platform for which Python and Numba are available. You will not need to
worry about a C compiler, and there is no hassle with dependencies or complex
makefiles and scripts. The Python code just works out of the box, taking full
advantage of all available hardware.
The LLVM infrastructure used by Numba allows compiled code to run on different
processor architectures, GPUs, and accelerator boards. It is under heavy
development; while I was writing this book, execution times for the example
programs were cut by more than half.
Such heavy development of both Numba and LLVM has some disadvantages as well.
Obviously, some Python features may never be significantly accelerated, but
others can be, and will be, in future versions of Numba. When I started working
on this book, Numba-compiled functions could not handle lists or create NumPy
arrays; now they can. Some of the material in this section will therefore be
obsolete well before the rest of the book, but that is a good thing. Just keep
an eye on PyData's Numba website.
For some strange reason, Numba was not included in the Anaconda Linux
installer, so I had to install it manually by opening the anaconda3/bin folder
in a terminal and typing
conda install numba
The same should work on Windows; just use the terminal shortcut in Anaconda's
folder in the Windows Start menu. Numba is usually included with later versions
of WinPython. If not, download the wheel package and its dependency packages
from Christoph Gohlke's page and install them using WinPython's setup utility.
To illustrate the speedups you can get with Numba, I'll implement the Sieve of
Eratosthenes prime number search algorithm. In order to accelerate a function,
Numba needs to know the type of all the variables, or at least be able to guess
them, and these types should not change during the function's execution. NumPy
arrays are therefore the data structures of choice when working with Numba.
Here is the Python code:
from numba import jit
import numpy as np
import time

@jit('i4(i4[:])')
def get_primes(a):
    m = a.shape[0]
    n = 1
    for i in range(2, m):
        if a[i] != 0:
            n += 1
            for j in range(i**2, m, i):
                a[j] = 0
    return n

# create an array of integers 0 to a million
a = np.arange(1000000, dtype=np.int32)
start_time = time.time()  # get system time
n = get_primes(a)  # count prime numbers
# print number of prime numbers below a million
# and execution time
print(n, time.time() - start_time)
First, we import numba, numpy, and the time module that will be used to time
the program's execution. Then we define a function implementing the Sieve of
Eratosthenes on a NumPy array of integers. The function definition is preceded
by the decorator @jit (Just In Time compile) imported from the numba package,
which tells Numba to compile this function into machine code; the rest of the
program is executed as plain Python. The decorator also tells Numba that the
function must return a four-byte (32-bit) integer and that it receives one
parameter, a one-dimensional array of 4-byte integers.
Using numpy's arange function, we create an array of consecutive integers
between zero and a million and note the current time. We then call the function
get_primes, which counts the prime numbers in the array and zeroes out the
non-prime numbers. As soon as the function returns, we get the current time
again and print the number of primes found as well as the time the function
took to execute.
On my Sandy Bridge laptop, the Numba-accelerated function takes about 7 ms to
complete. If I comment out the @jit decorator -
#@jit('i4(i4[:])')
the execution time increases to 3 s. Compilation results in a 428-fold speedup.
Not bad for one line of code. Searching for prime numbers between 1 and 10
million takes 146 ms with Numba and 42 s in pure Python respectively, a
287-fold speedup. These numbers are bound to change as Numba, LLVM, and
processors improve.
Because the function get_primes receives just a reference to, and not a copy
of, the original array, the non-prime numbers in the array are still zeroed out
after the call, and we can extract the prime numbers using the fancy indexing
discussed in the NumPy section:
print(a[a>0])
The default array printing behavior is not particularly useful here, as it only
shows a few numbers at the beginning and the end of the array. You can change
this behavior or simply iterate through the filtered array using a for loop.
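One way to change that behavior is through NumPy's print options, for example:
import sys
import numpy as np

# Show every element instead of the abbreviated summary view;
# 'a' here is the array from the sieve example above.
np.set_printoptions(threshold=sys.maxsize)
print(a[a > 0])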
Stick to it
Beginners can master their programming skills only if they stick to the
language like glue and practice the code repeatedly. Here are some ways for
beginners and intermediate learners to master the skills of Python programming.
Daily Practice
It is recommended that beginners practice coding on a daily basis in order to
get a grasp of a specific task; divide a task into smaller steps if possible.
Consistency is the key to achieving any challenging goal, and learning Python
is no different: real programmers believe in consistency. Even a student with
little knowledge can make a commitment to code daily for at least an hour.
Make Notes
After making some progress on the journey as a new programmer, making notes is
also a good idea. Research suggests that taking notes on important material by
hand is beneficial for long-term retention, and for becoming a real programmer
this habit is very useful. Furthermore, writing code on paper can also help
build a programmer's mind. It is difficult in the beginning, yet some
programmers can even write code in their heads, so writing code on paper is no
great obstacle, and it develops the frictionless thinking a programmer needs to
take on big projects in the future.
Go Interactive
Beginners can take the help of an IDE in order to practice strings, lists,
classes, dictionaries, and so on. Install any IDE for practicing the code; the
easiest and simplest option is the Jupyter Notebook that comes with Anaconda
Navigator. Install Anaconda Navigator and run the Jupyter Notebook IDE. A
window will open in the default browser of your desktop computer or laptop, and
you can start practicing: write your code and check the results.
Then make changes to your code and analyze the results and errors. This is a
form of mental practice and helps a lot in learning.
Take Breaks
It is important to take breaks and revisit the concepts behind the code;
practice works best when combined with absorbing the concepts. A well-known
method called the Pomodoro Technique is widely used and can help here: practice
a task for 25-30 minutes, take a break to review the concepts, and then repeat
the process. A break is a kind of refreshment, whether you go for a walk or
chat with friends, and you can even chat with friends about the concepts you
have just learned in the course.
Debugging
Becoming a good bug hunter is vital in programming languages, especially in
Python. Even professionals get bugs in their code, mostly on hard tasks, though
not always; remember that professionals were also beginners when their journey
started. Embrace these moments and do not get frustrated when you hit a bug:
hunt it down.
It is essential to have a methodical approach to finding out where things are
breaking down when debugging. Make sure each part of the written code works
properly, which means checking the code from start to end. When you have
identified the area where things break down, insert the following line into
your code and run it:
import pdb; pdb.set_trace()   ### add this line to your script and run it
This starts the Python debugger. It can also be run from the command line with:
python -m pdb <my_file.py>   ### command line
Collaborative Work
Once a programmer has stuck with Python through the early part of the journey,
moving on to collaborative work makes it easier to tackle tasks that might be a
little challenging alone. In short, when more than one mind starts thinking
about a problem, the problem does not remain a problem for long. To make your
learning collaborative, here are some tips to follow.
Teach
The saying 'a teacher learns more when teaching students' is famous among
teaching and learning communities, and it is also true when it comes to
learning the Python language. There are many ways to teach, and thereby learn
more, through understanding and solving problems. Teaching at a whiteboard is
the most common, but you can also write blog posts about tips for learning
Python, about problems or mistakes, about solving specific errors, or about
useful tricks, and you can record videos. Beyond this, there is a super simple
way to teach: talk through, or repeat aloud, what you have just done. These
strategies solidify your concepts and understanding and point out any errors or
gaps.
Ask Questions
There are no good or bad questions when learning a programming language; the
concepts, rules, and results should be learned in any way possible, even by
asking seemingly foolish questions. However, a programmer should try to ask
questions in such a way that the conversation with others stays pleasant, which
also makes it easier to have further conversations the next time help is
needed.
Build Something
Almost all programmers believe that learning programming is easier when you are
solving a simple problem or building something simple. Learning by building
something is, in a sense, how one becomes a real programmer.
Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or transmitted without
direct written permission from the author or the publisher.
Under no circumstances will any blame or legal responsibility be held against the publisher, or author,
for any damages, reparation, or monetary loss due to the information contained within this book, either
directly or indirectly.
Legal Notice:
This book is copyright protected. This book is only for personal use. You cannot amend, distribute, sell,
use, quote or paraphrase any part, or the content within this book, without the consent of the author or
publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. All effort has been executed to present accurate, up to date, and reliable, complete
information. No warranties of any kind are declared or implied. Readers acknowledge that the author is
not engaging in the rendering of legal, financial, medical, or professional advice. The content within
this book has been derived from various sources. Please consult a licensed professional before
attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, — errors, omissions, or inaccuracies.
Introduction
It is part of the obligations of banks to collect, analyze, and store vast
amounts of data. With data science applications, that obligation becomes a
possibility for banks to learn more about their customers and drive new revenue
opportunities, instead of treating the data as a mere compliance exercise.
Digital banking is widely used and more popular these days, and this influx
produces terabytes of customer data, so isolating the genuinely relevant data
is the first task for data scientists. Drawing on customers' preferences,
interactions, and behaviors, data science applications then isolate the most
relevant client information and process it to enhance the decision-making of
the business.
Personalized marketing
Providing a customized offer that fits the preferences and needs of particular
customers is crucial to success in marketing. It is now possible to make the
right offer, on the right device, to the right customer at the right time. For
a new product, data science applications are used for target selection,
identifying the customers most likely to respond. With the aid of these
applications, scientists create a model that predicts the probability of a
customer's response to an offer or promotion from their demographics, purchase
history, and behavioral data. Through data science applications, banks have
thus improved their customer relations, personalized their outreach, and made
their marketing more efficient.
Health and Medicine
Health and medicine form an industry with great potential for implementing data
science solutions. From the exploration of genetic disease to drug discovery
and the computerization of medical records, data analytics is taking medical
science to an entirely new level, and this dynamic is perhaps only the
beginning. Data science and healthcare are often connected through finances, as
the industry tries to cut down its expenses with the help of large amounts of
data. The relationship between medicine and data science is developing
significantly, and its advancement is crucial. Here are some of the impacts
data science applications have on medicine and health.
Drugs creation
The process of drug discovery involves various disciplines and is highly
complicated. Even the best ideas must often pass through enormous amounts of
testing, time, and financial expenditure; typically, getting a drug officially
approved can take up to twelve years. By adding a data-driven perspective to
each stage, from the screening of drug compounds to the prediction of success
rates based on biological factors, data science applications have shortened and
simplified the process. Using simulations rather than only 'lab experiments',
together with advanced mathematical modeling, these applications can forecast
how a compound will act in the body. Computational drug discovery produces
computer-model simulations as a biologically relevant network, simplifying the
prediction of future results with high accuracy.
Industry knowledge
To offer the best possible treatment and improve services, knowledge management
in healthcare is vital: it brings together externally generated information and
internal expertise. With new technologies being created and the industry
changing rapidly every day, the effective gathering, storing, and distribution
of different facts is essential. For healthcare organizations to achieve
progressive results, data science applications secure the integration of
various sources of knowledge and their combined use in the treatment process.
Recurrent neural networks and time series forecasting are also part of
optimizing oil and gas production. Predicted gas-to-oil ratios and oil rates
are significant KPIs. Operators can predict bottom-hole pressure, choke,
wellhead temperature, and daily oil rate from data on nearby wells using
feature extraction models. When predicting production decline, they make use of
fracture parameters, and for pattern recognition on sucker rod dynamometer
cards they use neural networks and deep learning.
Downstream optimization
Oil refineries use a massive volume of water to process gas and crude oil, and
there are now systems that tackle water-solution management in the oil and gas
industry. Also, by analyzing distribution data effectively, cloud-based
services have increased modeling speed for forecasting revenues.
The Internet
Anytime anyone thinks about data science, the first thing that comes to mind is
the internet. It is typical to think of Google when we talk about searching for
something online, but Bing, Yahoo, AOL, Ask, and others are also search
engines. What they all have in common is data science algorithms, which are
what let them return results in a fraction of a second when you run a search.
Google alone processes more than 20 petabytes of data every day; these search
engines are what they are today thanks to data science.
Targeted advertising
Alongside search, the whole digital marketing spectrum is one of the most
significant applications of data science. Data science algorithms decide the
distribution of digital billboards and banner displays on different websites,
and compared with traditional advertisements, they have helped marketers
achieve higher click-through rates. Using a user's behavior, marketers can
target them with specific adverts, so at the same time and in the same place
online, one user might see ads about anger management while another sees an ad
for a keto diet.
Website recommendations
This case is familiar to everyone, as you see suggestions for similar products
on eBay and Amazon. Doing this adds a great deal to the user experience while
helping users discover appropriate products among the many available. Leaning
on users' relevant information and interests, many businesses have promoted
their products and services with this kind of engine. To improve user
experience, some internet giants, including Google Play, Amazon, Netflix, and
others, have used this system. They derive these recommendations from the
results of a user's previous searches.
Speech recognition
Siri, Google Voice, Cortana, and many others are some of the best-known speech
recognition products. They make life easier for those who are not in a position
to type a message: their speech is converted to text when they speak their
words aloud, though the accuracy of speech recognition is not always perfect.
Recommendation engine
According to some experts, this concept is one of the most promising and
efficient. Some central booking and travel web platforms use recommendation
engines in their everyday work, matching customers' needs and wishes to the
available offers. Based on preferences and previous searches, travel and
tourism companies that apply data-powered recommendation engine solutions can
suggest alternative travel dates, rental deals, new routes, attractions, and
destinations. Booking service providers and travel agencies achieve suitable
offers for all their customers with the use of recommendation engines.
Route optimization
In the travel and tourism industry, route optimization plays a significant
role. It can be quite challenging to account for several destinations, plan
trips and schedules, and work out distances and hours. With route optimization,
it becomes easy to do some of the following:
Time management
Minimization of the travel costs
Minimization of distance
For sure, data science improves lives and also continues to change the faces
of several industries, giving them the opportunity of providing unique
experiences for their customers with high satisfaction rates. Apart from
shifting our attitudes, data science has become one of the promising
technologies that bring change to different businesses. With the many
solutions that data science applications provide, there is no doubt that its
benefits cannot be over-emphasized.
Chapter 1 - What is Data Analysis
Knime
Knime is another open-source solution tool that enables the user to explore
data and interpret the hidden insights effectively. One of its good attributes is
that it contains more than 1000 modules along with numerous examples to
help the user to understand the applications and effective use of the tool. It is
equipped with the most advanced integrated tools with some complex
algorithms.
R-programming
R programming is the most common and widely used tool. It has become a
standard tool for programming. R is a free open source software that any user
can install, use, upgrade, modify, clone, and even resell. It can easily and
effectively be used in statistical computing and graphics. It is made in a way
that is compatible with any type of operating system like Windows, macOS
platforms, and UNIX. It is a high-performance language that lets the user
manage big data. Since it is free and is regularly updated, it makes
technological projects cost-effective. Along with data mining, it lets the user
apply their statistical and graphical knowledge, using common techniques
such as statistical tests, clustering, and linear and non-linear modeling.
Rapidminer
Rapidminer is similar to KNIME with respect to dealing with visual
programming for data modeling, analysis, and manipulation. It helps to
improve the overall yield of data science project teams. It offers an open-
source platform that permits Machine Learning, model deployment, and data
preparation. It is responsible for speeding up the development of an entire
analytical workflow, right from the steps of model validation to deployment.
Pentaho
Pentaho tackles issues faced by the organization concerning its ability to
accept values from another data source. It is responsible for simplifying data
preparation and data blending. It also provides tools used for analysis,
visualization, reporting, exploration, and prediction of data. It lets each
member of a team assign meaning to the data.
Weka
Weka is another open-source software that is designed with a view of
handling machine-learning algorithms to simplify data Mining tasks. The
user can use these algorithms directly in order to process a data set. Since it is
implemented in JAVA programming, it can be used for developing a new
Machine Learning scheme. Its simple graphical user interface makes for an
easy transition into the field of data science, and any user acquainted with
Java can invoke the library from their own code.
NodeXL
NodeXL is an open-source data visualization and analysis tool capable of
displaying relationships in datasets. It has numerous modules, like social
network data importers and automation features.
Gephi
Gephi is an open-source visualization and network analysis tool written in
Java.
Talend
Talend is one of the leading open-source software providers and the choice of
many data-driven companies. It enables customers to connect their data easily,
regardless of where that data lives.
Data Visualization
Data Wrapper
It is an online data-visualization tool that can be used to build interactive
charts. Data can be uploaded as CSV, Excel, or PDF files, and the tool can
generate maps, bar charts, and line charts. The graphs created with it come
with ready-to-use embed codes and can be placed on any website.
Tableau Public
Tableau Public is a powerful tool that can create stunning visualizations that
can be used in any type of business. Data insights can be identified with the
help of this tool. Using visualization tools in Tableau Public, a data scientist
can explore data prior to processing any complex statistical process.
Infogram
Infogram contains more than 35 interactive chart types and 500 maps that
allow the user to visualize data. It can make various charts, such as word
clouds, pie charts, and bar charts.
Sentiment Tools
Opentext
Identification and evaluation of expressions and patterns are possible in this
specialized classification engine. It carries out analysis at various levels:
document, sentence, and topic level.
Trackur
Trackur is an automated sentiment analysis software emphasizing a specific
keyword that is tracked by an individual. It can draw vital insights by
monitoring social media and mainstream news. In short, it identifies and
discovers different trends.
Opinion Crawl
Opinion Crawl is also an online sentiment analysis tool that analyzes the
latest news, products, and companies. Every visitor is free to check the Web
sentiment on a specific topic; anyone can submit a topic and receive an
assessment. A pie chart reflecting the latest real-time sentiment is displayed
for every topic, and the different concepts that people associate with it are
represented by thumbnails and tag clouds. The positive and negative weight
of the sentiment is also displayed. Web crawlers search the most up-to-date
content published on recent subjects and issues to create a comprehensive
analysis.
Sage Live
Sage Live is a cloud-based accounting platform that can be used in small and
mid-sized businesses. It enables the user to create invoices and make bill
payments from a smartphone. It is a good choice if you want a tool that
supports multiple companies, currencies, and banks.
Gawk GNU
GNU Awk (gawk) lets the user handle text-processing tasks without writing a
full program. It interprets a special-purpose programming language that
enables users to handle simple data-reformatting jobs. The following are its
main attributes:
➢ It is data-driven rather than procedural.
➢ Writing programs is easy.
➢ It can search for a variety of patterns in text.
GraphLab Create
Graphlab can be used by data scientists as well as developers. It enables the
user to build state-of-the-art data products using Machine Learning to create
smart applications.
The attributes of this tool include the integration of automatic feature
engineering, machine learning visualizations, and model selection into the
application. It can identify and link records within and across data sources,
and it simplifies the development of Machine Learning models.
Netlink Business Analytics
Netlink Business Analytics is a comprehensive, on-demand analytics solution.
You can access it through any simple browser or through company software.
Collaboration features allow the user to share dashboards among teams.
Features can be customized for sales and for advanced analytic capabilities
such as inventory forecasting, fraud detection, sentiment analysis, and
customer churn analysis.
Apache Spark
Apache Spark is designed to run in-memory and to process data in real time.
The top 5 data analytics tools and techniques
Visual analytics
Visual analytics makes many different methods of data analysis possible by
combining automated data analysis, visualization, and human interaction.
Business Experiments
Business experiments cover all of the techniques used to test the validity of a
process, including A/B testing, field experiments, and experimental design.
Regression Analysis
Regression Analysis allows the identification of factors that make two
different variables related to each other.
Correlation Analysis
Correlation Analysis is a statistical technique that detects whether a
relationship exists between two different variables.
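As a quick, hedged illustration of how a correlation check might look in Python, the column names and values below are hypothetical:
In []: import pandas as pd
# A minimal sketch: two made-up columns, study hours and exam score.
df = pd.DataFrame({'hours': [1, 2, 3, 4, 5],
                   'score': [52, 57, 63, 70, 74]})
df['hours'].corr(df['score'])   # Pearson correlation; a value close to 1 indicates a strong positive relationship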
In: dataset['val3'][104]
Out: 'A'
Keep in mind that this isn’t a matrix, even though it might look like one.
Make sure to specify the column first, and then the row in order to extract the
value from the cell you want.
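To make the column-then-row order concrete, here is a small reconstruction of that kind of dataset; the values and index labels are made up for illustration:
In []: import pandas as pd
dataset = pd.DataFrame({'val3': ['A', 'B', 'C']}, index=[104, 105, 106])
dataset['val3'][104]   # column label first, then row label
Out[]: 'A'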
# Example of inheritance
# Base class
class Student(object):
    def __init__(self, name, rollno):
        self.name = name
        self.rollno = rollno

# GraduateStudent class inherits (is derived) from the Student class
class GraduateStudent(Student):
    def __init__(self, name, rollno, graduate):
        Student.__init__(self, name, rollno)
        self.graduate = graduate

    def DisplayGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.graduate)

# PostGraduate class inherits from the Student class
class PostGraduate(Student):
    def __init__(self, name, rollno, postgrad):
        Student.__init__(self, name, rollno)
        self.postgrad = postgrad

    def DisplayPostGraduateStudent(self):
        print("Student Name:", self.name)
        print("Student Rollno:", self.rollno)
        print("Study Group:", self.postgrad)

# Instantiate the GraduateStudent and PostGraduate classes
objGradStudent = GraduateStudent("Mainu", 1, "MS-Mathematics")
objPostGradStudent = PostGraduate("Shainu", 2, "MS-CS")
objGradStudent.DisplayGraduateStudent()
objPostGradStudent.DisplayPostGraduateStudent()
When you type this into your interpreter, you will get the following results:
Student Name: Mainu
Student Rollno: 1
Study Group: MS-Mathematics
Student Name: Shainu
Student Rollno: 2
Study Group: MS-CS
Overloading
Another process that you may want to consider when you’re working with
inheritances is learning how to ‘overload.’ When you work on the process
known as overloading, you can take one of the identifiers that you are
working with and then use that to define at least two methods, if not more.
For the most part, there will only be two methods that are inside of each
class, but sometimes this number will be higher. The two methods should be
inside the exact same class, but they need to have different parameters so that
they can be kept separate in this process. You will find that it is a good idea
to use this method when you want the two matched methods to do the same
tasks, but you would like them to do that task while having different
parameters.
This is not something that is common to work with, and as a beginner, you
will have very little need to use this since many experts don’t actually use it
either. But it is still something that you may want to spend your time learning
about just in case you do need to use it inside of your code. There are some
extra modules available for you that you can download so you can make sure
that overloading will work for you.
After working with our imported or pandas-built data frames, we can write
the resulting data frame back into various formats. We will, however, only
consider writing back to CSV and excel. To write a data frame to CSV, use
the following syntax:
In []:Csv_data.to_csv(‘file name’,index = False)
This writes the data frame ‘Csv_data’ to a CSV file with the specified
filename in the python directory. If the file does not exist, it creates it.
For writing to an excel file, a similar syntax is used, but with sheet name
specified for the data frame being exported.
In []: Xl_data.to_excel(‘file name.xlsx’,sheet_name = ‘Sheet 1’)
This writes the data frame Xl_data to sheet one of ‘file name.xlsx’.
Html
Reading HTML files through pandas requires a few libraries to be installed:
html5lib, lxml, and BeautifulSoup4. Since we installed the latest Anaconda,
these libraries are likely to be included. Use conda list to verify, and conda
install to install any missing ones.
Html tables can be directly read into pandas using the pd.read_html (‘sheet
url’) method. The sheet url is a web link to the data set to be imported. As an
example, let us import the ‘Failed bank lists’ dataset from FDIC’s website
and call it w_data.
In []: w_data =
pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
w_data[0]
To display the result, here we used w_data [0]. This is because the table we
need is the first sheet element in the webpage source code. If you are familiar
with HTML, you can easily identify where each element lies. To inspect a
web page source code, use Chrome browser. On the web page >> right click
>> then select ‘view page source’. Since what we are looking for is table-like
data, it will be specified as a table element in the source code of the FDIC
page.
Hint: Use the .max(), and .mean() methods for the pay gap.
Conditional selection with column indexing also works for
the employee name with the highest pay.
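As a rough sketch of what that might look like, the ‘salaries’ data frame below, its column names, and its values are assumptions for illustration only:
In []: import pandas as pd
salaries = pd.DataFrame({'name': ['Ada', 'Ben', 'Cara'],
                         'gender': ['F', 'M', 'F'],
                         'pay': [95000, 87000, 102000]})
salaries['pay'].max()                                          # highest pay
salaries['pay'].mean()                                         # average pay
salaries[salaries['pay'] == salaries['pay'].max()]['name']     # employee with the highest pay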
Chapter 8 - The Different Types
of Data We Can Work With
There are two main types of data, structured and unstructured, and the types
of algorithms and models that we can run on them will depend on which kind
of data we are working with. Both can be valuable, but it often depends on
what we are trying to learn, and which one will serve us the best for the topic
at hand. With that in mind, let’s dive into some of the differences between
structured and unstructured data and why each can be so important to our
data analysis.
Structured Data
The first type of data that we will explore is known as structured data. This is
often the kind that will be considered traditional data. This means that we will
see it consisting mainly of lots of text files that are organized and have a lot
of useful information. We can quickly glance through this information and
see what kind of data is there, without having to look up more information,
labeling it, or looking through videos to find what we want.
Structured data is going to be the kind that we can store inside one of the
options for warehouses of data, and we can then pull it up any time that we
want for analysis. Before the era of big data, and some of the emerging
sources of data that we are using on a regular basis now, structured data was
the only option that most companies would use to make their business
decisions.
Many companies still love to work with this structured data. The data is very
organized and easy to read through, and it is easier to digest. This ensures
that our analysis is going to be easier to go through with legacy solutions to
data mining. To be more specific, this structured data is largely made up of
the most basic customer data, providing information such as contact details,
addresses, names, and geographical locations of the customers.
In addition to all of this, a business may decide to collect some transactional
data and this would be a source of structured data as well. Some of the
transactional data that the company could choose to work with would include
financial information, but we must make sure that when this is used, it is
stored in the appropriate manner so it meets the standards of compliance for
the industry.
There are several methods we can use in order to manage this structured data.
For the most part, though, this type of data is going to be managed with
legacy solutions of analytics because it is already well organized and we do
not need to go through and make adjustments and changes to the data at all.
This can save a lot of time and hassle in the process and ensures that we are
going to get the data that we want to work the way that we want.
Of course, even with some of the rapid rise that we see with new sources of
data, companies are still going to work at dipping into the stores of structured
data that they have. This helps them to produce higher quality insights, ones
that are easier to gather and will not be as hard to look through the model for
insights either. These insights are going to help the company learn some of
the new ways that they can run their business.
While companies that are driven by data all over the world have been able to
analyze this structured data for a long period of time, over many decades,
they are just now starting to really take some of the new and emerging
sources of data as seriously as they should. The good news with this one
though is that it is creating a lot of new opportunities in their company, and
helping them to gain some of the momentum and success that they want.
Even with all of the benefits that come with structured data, this is often not
the only source of data that companies are going to rely on. First off, finding
this kind of data can take a lot of time and can be a waste if you need to get
the results in a quick and efficient manner. Collecting structured data is
something that takes some time, simply because it is so structured and
organized.
Another issue that we need to watch out for when it comes to structured data
is that it can be more expensive. It takes someone a lot of time to sort through
and organize all of that data. And while it may make the model that we are
working on more efficient than other forms, it can often be expensive to work
with this kind of data. Companies need to balance their cost and benefit ratio
here and determine if they want to use any structured data at all, and if they
do, how much of this structured data they are going to add to their model.
Unstructured Data
The next option of data that we can look at is known as unstructured data.
This kind of data is a bit different than what we talked about before, but it is
really starting to grow in influence as companies are trying to find ways to
leverage the new and emerging data sources. Some companies choose to
work with just unstructured data on their own, and others choose to do some
mixture of unstructured data and structured data. This provides them with
some of the benefits of both and can really help them to get the answers they
need to provide good customer service and other benefits to their business.
There are many sources where we are able to get these sources of data, but
mainly they come from streaming data. This streaming data comes in from
mobile applications, social media platforms, location services, and the
Internet of Things. Since the diversity that is there among unstructured
sources of data is so prevalent, and it is likely that those businesses who
choose to use unstructured data will rely on many different sources,
businesses may find that it is harder to manage this data than it was with
structured data.
Because of this trouble with managing the unstructured data, there are many
times when a company will be challenged by this data, in ways that they
weren’t in the past. And many times, they have to add in some creativity in
order to handle the data and to make sure they are pulling out the relevant
data, from all of those sources, for their analytics.
The growth and the maturation of things known as data lakes, and even the
platform known as Hadoop, are going to be a direct result of the expanding
collection of unstructured data. The traditional environments that were used
with structured data are not going to cut it at this point, and they are not going
to be a match when it comes to the unstructured data that most companies
want to collect right now and analyze.
Because it is hard to handle the new sources and types of data, we can’t use
the same tools and techniques that we did in the past. Companies who want to
work with unstructured data have to pour additional resources into various
programs and human talent in order to handle the data and actually collect
relevant insights and data from it.
The lack of any structure that is easily defined inside of this type of data can
sometimes turn businesses away from this kind of data in the first place. But
there really is a lot of potential hidden in that data. We just need to
learn the right methods to use to pull that data out. The unstructured data is
certainly going to keep the data scientist busy overall because they can’t just
take the data and record it in a data table or a spreadsheet. But with the right
tools and a specialized set of skills to work with, those who are trying to use
this unstructured data to find the right insights, and are willing to make some
investments in time and money, will find that it can be so worth it in the end.
Both of these types of data, the structured and the unstructured, are going to
be so important when it comes to the success you see with your business.
Sometimes our project just needs one or the other of these data types, and
other times it needs a combination of both of them.
For a company to reach success though, they need to be able to analyze, in a
proper and effective manner, all of their data, regardless of the type of the
source. Given the experience that the enterprise has with data, it is not a big
surprise that all of this buzz already surrounds data that comes from sources
that may be seen as unstructured. And as new technologies begin to surface
that can help enterprises of all sizes analyze their data in one place it is more
important than ever for us to learn what this kind of data is all about, and how
to combine it with some of the more traditional forms of data, including
structured data.
WHY PYTHON FOR DATA ANALYSIS?
The next thing that we need to spend some of our time on in this guidebook is
the Python language. There are a lot of options that you can choose when
working on your own data analysis, and bringing out all of these tools can
really make a big difference in how much information you are able to get out
of your analysis. But if you want to pick a programming language that is easy
to learn, has a lot of power, and can handle pretty much all of the tasks that
you need to handle with data analysis and machine learning, then Python is
the choice for you. Let’s dive into the Python language a little bit and see
how this language can be used to help us see some great results with our data
analysis.
The process of data visualization is going to help us change up the way that
we can work with the data that we are using. Data analysts are expected to
respond to any issues found in the company faster than ever before.
And they need to be able to dig through and find more insights as well, look
at data in a different manner, and learn how to be more imaginative and
creative in the process. This is exactly something that data visualization is
able to help us out with.
Once you have been able to go through and answer all of the initial questions
that we had about the data type that we would like to work with, and you
know what kind of audience is going to be there to consume the information,
it is time for us to make some preparations for the amount of data that we
plan to work with in this process.
Keep in mind here that big data is great for many businesses and is often
necessary to make data science work. But it is also going to bring in a few
new challenges to the visualization that we are doing. Large volumes, varying
velocities, and different varieties are all going to be taken into account with
this one.
Plus, data is often going to be generated at a rate that is much faster than it
can be managed and analyzed so we have to figure out the best way to deal
with this problem.
There are factors that we need to consider in this process as well, including
the cardinality of the columns that we want to be able to work with.
We have to be aware of whether there is a high level of cardinality in the
process or a low level. If we are dealing with high cardinality, this is a sign
that we are going to have a lot of unique values in our data. A good example
of this would include bank account numbers since each individual would
have a unique account number.
Then it is possible that your data is going to have a low cardinality. This
means that the column of data that you are working with will come with a
large percentage of repeat values. This is something that we may notice when
it comes to the gender column on our system. The algorithm is going to
handle the amount of cardinality, whether it is high or low, in a different
manner, so we always have to take this into some consideration when we do
our work.
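One quick way to check the cardinality of a column in Python is to count its unique values; the data frame and column names below are hypothetical:
In []: import pandas as pd
accounts = pd.DataFrame({'account_number': [1001, 1002, 1003, 1004],
                         'gender': ['F', 'M', 'M', 'F']})
accounts['account_number'].nunique()   # high cardinality: every value is unique
accounts['gender'].nunique()           # low cardinality: many repeated values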
Pandas
This is an open source library that extends the capabilities of NumPy. It
supports data cleaning and preparation, with fast analysis capabilities. It is
somewhat like a Microsoft Excel framework, but inside Python. Unlike NumPy, it has
its own built-in visualization features and can work with data from a variety
of sources. It is one of the most versatile packages for data science with
Python, and we will be exploring how to use it effectively.
To use pandas, make sure it is currently part of your installed packages by
verifying with the conda list method. If it is not installed, then you can install
it using the conda install pandas command; you need an internet connection
for this.
Now that Pandas is available on your PC, you can start working with the
package. First, we start with the Pandas series.
Series
This is an extension of the NumPy array. It has a lot of similarities, but with a
difference in indexing capacity. NumPy arrays are only indexed via number
notations corresponding to the desired rows and columns to be accessed. For
Pandas series, the axes have labels that can be used for indexing their
elements. Also, while NumPy arrays, like Python lists, are essentially used
for holding numeric data, Pandas series can hold any form of Python
data/object.
Example 7: Let us illustrate how to create and use the Pandas series
First, we have to import the Pandas package into our workspace. We will use
the variable name pd for Pandas, just as we used np for NumPy in the
previous section.
In []: import numpy as np #importing numpy for use
import pandas as pd # importing the Pandas package
We also imported the numpy package because this example involves a
numpy array.
In []: # python objects for use
labels = ['First','Second','Third']
# string list
values = [10,20,30] # numeric list
array = np.arange(10,31,10) # numpy array
dico = {'First':10,'Second':20,'Third':30}
# Python dictionary
# create various series
A = pd.Series(values)
print('Default series')
A #show
B = pd.Series(values,labels)
print('\nPython numeric list and label')
B #show
C = pd.Series(array,labels)
print('\nUsing python arrays and labels')
C #show
D = pd.Series(dico)
print('\nPassing a dictionary')
D #show
Default series
Out[]: 0 10
1 20
2 30
dtype: int64
Python numeric list and label
Out[]: First 10
Second 20
Third 30
dtype: int64
Using python arrays and labels
Out[]: First 10
Second 20
Third 30
dtype: int32
Passing a dictionary
Out[]: First 10
Second 20
Third 30
dtype: int64
We have just explored a few ways of creating a Pandas series using a numpy
array, Python list, and dictionary. Notice how the labels correspond to the
values? Also, the dtypes are different. Since the data is numeric and of type
integer, Python assigns the appropriate type of integer memory to the data.
Creating a series from a NumPy array here returns the smaller integer size
(int32). The difference between 32-bit and 64-bit integers is the
corresponding memory allocation: 32 bits requires less memory (4 bytes,
since 8 bits make a byte), while 64 bits requires double that (8 bytes).
However, 32-bit integers are generally processed faster but have a more
limited range of values compared with 64-bit integers.
Pandas series also support the assignment of any data type or object as its
data points.
In []: pd.Series(labels,values)
Out[]: 10 First
20 Second
30 Third
dtype: object
Here, the string elements of the label list are now the data points. Also, notice
that the dtype is now ‘object’.
This kind of versatility in item operation and storage is what makes pandas
series very robust. Pandas series are indexed using labels. This is illustrated
in the following examples:
Example 8:
In []: # series of WWII countries
pool1 = pd.Series([1,2,3,4],['USA','Britain','France','Germany'])
pool1 #show
print('grabbing the first element')
pool1['USA'] # first label index
Out[]: USA 1
Britain 2
France 3
Germany 4
dtype: int64
grabbing the first element
Out[]: 1
As shown in the code above, to grab a series element, use the same approach
as the numpy array indexing, but by passing the label corresponding to that
data point. The data type of the label is also important; notice that the ‘USA’ label
was passed as a string to grab the data point ‘1’. If the label is numeric, then
the indexing would be similar to that of a numpy array. Consider numeric
indexing in the following example:
In []: pool2 = pd.Series(['USA','Britain','France','Germany'],[1,2,3,4])
pool2 #show
print('grabbing the first element')
pool2[1] #numeric indexing
Out[]: 1 USA
2 Britain
3 France
4 Germany
dtype: object
grabbing the first element
Out[]: 'USA'
Tip: you can easily know the data held by a series through the dtype.
Notice how the dtype for pool1 and pool2 are different, even though
they were both created from the same lists. The difference is that pool2
holds strings as its data points, while pool1 holds integers (int64).
Pandas series can be added together. This works best if the two series have
similar labels and data points.
Example 9: Adding series
Let us create a third series, ‘pool3’. It is similar to pool1, but Britain has been
replaced with ‘USSR’, with a corresponding data point value of 5.
In []: pool3 = pd.Series([1,5,3,4],['USA','USSR','France',
'Germany'])
pool3
Out[]: USA 1
USSR 5
France 3
Germany 4
dtype: int64
Now adding series:
In []:# Demonstrating series addition
double_pool = pool1 + pool1
print('Double Pool')
double_pool
mixed_pool = pool1 + pool3
print('\nMixed Pool')
mixed_pool
funny_pool = pool1 + pool2
print('\nFunny Pool')
funny_pool
Double Pool
Out[]: USA 2
Britain 4
France 6
Germany 8
dtype: int64
Mixed Pool
Out[]: Britain NaN
France 6.0
Germany 8.0
USA 2.0
USSR NaN
dtype: float64
Funny Pool
C:\Users\Oguntuase\Anaconda3\lib\site-
packages\pandas\core\indexes\base.py:3772: RuntimeWarning: '<' not
supported between instances of 'str' and 'int', sort order is undefined
for incomparable objects
return this.join(other, how=how, return_indexers=return_indexers)
Out[]: USA NaN
Britain NaN
France NaN
Germany NaN
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
When series are added, the data point values of matching labels (or indexes)
are summed. A ‘NaN’ is returned wherever the labels do not match.
Notice the difference between Mixed_pool and Funny_pool: in the mixed
pool, a few labels match, and their values are added together. For
Funny_pool, no labels match and the data points are of different types, so a
warning is issued and the output is a vertical concatenation of the two series
with ‘NaN’ data points.
Tip: As long as two series contain the same labels and data points of
the same type, basic array operations like addition, subtraction, etc. can
be done. The order of the labels does not matter, the values will be
changed based on the operator being used. To fully grasp this, try
running variations of the examples given above.
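For example, one variation worth trying is to shuffle the label order of a second series and confirm that the addition still matches values by label rather than by position. This small sketch assumes pandas is imported as pd and pool1 is defined as in the examples above:
In []: shuffled = pd.Series([4,3,2,1],['Germany','France','Britain','USA'])
pool1 + shuffled   # labels are aligned, so each country's values are summed
Out[]: Britain 4
France 6
Germany 8
USA 2
dtype: int64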
Chapter 11 - Common Debugging Commands
Starting
The command used in debugging is ‘s(tart)', which launches the debugger
from its source. The procedure involves typing the name of the debugger
followed by the name of the file, object, or program executable to debug.
Inside the debugging tool, a prompt appears offering several commands to
choose from so that you can make the necessary corrections.
Running
The command used is ‘[!]statement’ or ‘r(un)’, which executes the program
up to the intended lines and identifies any errors along the way. The
command accepts several arguments, much as the program would when run
without a debugger. For example, when the application is named ‘prog1’, the
command to use is “r prog1 <infile". The debugger will then execute the
program, redirecting its standard input from the file ‘infile’.
Breakpoints
As essential components of debugging, breakpoints use the command
‘b(reak) [[filename:]lineno|function[, condition]]’ to make the debugger stop
when program execution reaches that point. When execution meets a
breakpoint, the process is suspended and the debugger prompt appears on the
screen, giving you time to check the variables and identify any errors or
mistakes that might affect the process. Breakpoints can therefore be
scheduled to halt execution at any line, set either by line number or by
function name.
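For example, a hypothetical pdb session on a script named example.py might set breakpoints like this; the file name, function name, and line number are made up for illustration:
$ python -m pdb example.py
(Pdb) b 12               # break at line 12 of the current file
(Pdb) b process_data     # break whenever the function process_data is entered
(Pdb) c                  # continue running until a breakpoint is reached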
Back Trace
Backtrace is executed with the command ‘bt’ and lists the pending function
calls in the program immediately after it stops. Backtrace commands are only
valid when execution is suspended at a breakpoint, or after the program has
exited abnormally during a runtime error, a state called a segmentation fault.
This form of debugging is most valuable during segmentation faults, as it
points to the source of the error among the pending function calls.
Printing
Printing is primarily used in debugging to inspect the value of variables or
expressions during function examination. In pdb this is done with the
command ‘p expression’, and it is most useful after the program has been
stopped at a breakpoint or by a runtime error. Any legal expression can be
printed, including function calls. Besides printing, resuming execution after a
breakpoint or runtime error uses the command ‘c(ont(inue))’.
Single Step
The single-step commands ‘s(tep)’ and ‘n(ext)’ are used after a breakpoint to
move through source lines one at a time. The two commands behave
differently: ‘step’ executes every line, including lines inside called functions,
while ‘next’ executes function calls without stepping into them. It is often
worth running the program line by line, as this makes it much easier to trace
errors during execution.
Trace Search
With the commands ‘up’ and ‘down’, you can move upwards or downwards
through the pending calls in the stack. This form of debugging lets you
examine the variables at different levels of the call list, so you can readily
seek out mistakes and eliminate errors with the debugging tool.
File Select
Another basic debugger command is file select, which uses ‘l(ist) [first[,
last]]’. Some programs, especially complex ones, are composed of two or
more source files, which makes debugging tools all the more necessary. The
debugger should be pointed at the main source file so that breakpoints and
runtime errors can be examined against the right lines. With Python, a source
file can be readily selected and listed as the working file.
Alias
Alias debugging entails creating an alias name that executes a command; the
command must not be enclosed in single or double quotes. The syntax is
alias [name [command]]. Replaceable parameters are indicated with %1, %2,
and so on, and are substituted when the alias is used. If ‘alias’ is entered
without arguments, the existing aliases are simply listed. Aliases may be
composed of anything that can legally be typed at the pdb prompt.
Python Debugger
In the Python programming language, the module pdb defines an interactive
source code debugger. It supports setting breakpoints, single stepping at the
source line level, listing source code, and evaluating arbitrary Python code in
the context of any stack frame. Post-mortem debugging is also supported and
can be called under program control. The debugger is extensible, usually by
way of the Pdb class defined in the module, and its interface uses the pdb and
cmd modules.
The pdb prompt is essential for running programs under the control of the
debugging tools; for instance, pdb.py can be invoked like a script to debug
other scripts. It can also be used to examine crashed programs, using several
of its functions in slightly different ways. Some of the functions used are
run(statement[, globals[, locals]]) for running Python statements and
runeval(expression[, globals[, locals]]) for evaluating expressions. There are
also other functions, not mentioned here, for executing Python programs
efficiently under the debugger.
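As a small, hedged sketch of how these functions can be used, the buggy_sum function below is a made-up example:
In []: import pdb

def buggy_sum(values):
    total = 0
    for v in values:
        total += v
    return total

# Run a statement under debugger control: 'n' steps to the next line,
# 'p total' prints a variable, 'c' continues, and 'q' quits.
pdb.run('buggy_sum([1, 2, 3])')

# runeval works the same way but returns the value of an expression.
pdb.runeval('buggy_sum([4, 5, 6])')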
Debugging Session
Debugging in Python is usually a repetitive process: you write code and run
it; it does not work, so you bring in the debugging tools, fix the errors, and
redo the process again and again. Because a debugging session tends to use
the same techniques each time, there are some key points to note. The
sequence below streamlines your programming process and minimizes the
repeats seen during program development.
Setting of breakpoints
Running programs by the relevant debugging tools
Check variable outcomes and compare with the existing
function
When all seems correct, you may either resume the program
or wait for another breakpoint and repeat if need be
When everything seems to go wrong, determine the source of
the problem, alter the current line of codes and begin the
process once more
Ask Question
If you know developers who use Python or other platforms, ask them
questions about debugging, as they use these tools heavily. If you are just
beginning and have no one to ask, go online and find forums, of which there
are many today. Interact with them by seeking answers to your debugging
problems and by experimenting with programs you create while using
debugger tools. Avoid making assumptions about any part of Python
programming, especially debugging, as assumptions can lead to failures in
program development.
Be Clever
Creating programs and avoiding errors with the help of debuggers can leave
you feeling excited, even overwhelmed, by the outcome. Be smart, but within
limits: keep an eye on your current work as well as your future projects.
Successfully creating a realistic and useful program does not mean you will
never fail in the future. Remaining in control will prepare you to use Python
debugging tools wisely and to build on your accomplishments.
Chapter 12 - Neural Networks and What to Use Them For
Regular deep neural networks commonly receive a single vector as an input
and then transform it through a series of multiple hidden layers. Every hidden
layer in regular deep neural networks, in fact, is made up of a collection of
neurons in which every neuron is fully connected to all contained neurons
from the previous layers. In addition, all neurons contained in a deep neural
network are completely independent as they do not share any relations or
connections.
The last fully-connected layer in regular deep neural networks is called the
output layer and in every classification setting, this output layer represents
the overall class score.
Due to these properties, regular deep neural nets are not capable of scaling to
full images. For instance, in CIFAR-10, all images are sized 32x32x3: they
have 3 color channels and are 32 pixels wide and 32 pixels high. This means
that a single fully-connected neuron in the first hidden layer of a regular
neural net would have 32x32x3 = 3,072 weights. That amount may still seem
manageable, but fully-connected structures clearly do not scale to larger
images.
In addition, you would almost certainly want several such neurons, so the
parameters would add up quickly. In computer vision and other similar
problems, using fully-connected neurons is therefore wasteful, as the huge
number of parameters would quickly lead to over-fitting of your model.
Convolutional neural networks instead take advantage of the fact that their
inputs consist of images to solve these kinds of deep learning problems.
Due to their structure, convolutional neural networks constrain the
architecture of images in a much more sensible way. Unlike a regular deep
neural network, the layers contained in the convolutional neural network are
comprised of neurons that are arranged in three dimensions including depth,
height, and width. For instance, the CIFAR-10 input images form the input
volume of activations for the first layer of such a network, and that volume
has the dimensions 32x32x3.
The neurons in these kinds of layers can be connected to only a small area of
the layer before it, instead of all the layers being fully-connected like in
regular deep neural networks. In addition, the output of the final layers for
CIFAR-10 would come with dimensions of 1x1x10 as the end of
convolutional neural networks architecture would have reduced the full
image into a vector of class score arranging it just along the depth dimension.
To summarize, unlike the regular-three-layer deep neural networks, a
ConvNet composes all its neurons in just three dimensions. In addition, each
layer contained in convolutional neural network transforms the 3D input
volume into a 3D output volume containing various neuron activations.
Every layer of a convolutional neural network has a simple API: it transforms
a 3D input volume into a 3D output volume with some differentiable function
that may or may not contain neural network parameters.
A convolutional neural network is composed of several convolutional and
subsampling layers that are at times followed by fully-connected or dense
layers. As you already know, the input of a convolutional neural network is
an nxnxr image, where n represents the height and width of the input image
and r is the total number of channels present. A convolutional layer has k
filters, known as kernels, of size mxmxq, where m is smaller than the image
dimension and q can be the same as the number of channels r or smaller.
Each feature map is then subsampled with max or mean pooling over a pxp
contiguous area, where p commonly ranges from 2 for small images to 5 or
more for larger images. Either before or after the subsampling layer, an
additive bias and a sigmoidal non-linearity are applied to every feature map.
After these convolutional layers, there may be several fully-connected layers,
and the structure of these fully-connected layers is the same as that of
standard multilayer neural networks.
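To make that structure concrete, here is a minimal sketch of such a network using the Keras API; it assumes TensorFlow is installed, and the filter counts and layer sizes are illustrative only, not the exact ones discussed in this chapter:
In []: from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),  # convolutional layer
    layers.MaxPooling2D((2, 2)),                                            # subsampling (pooling) layer
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),     # fully-connected layer
    layers.Dense(10, activation='softmax')   # one class score per CIFAR-10 category
])
model.summary()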
Parameter Sharing
You can use a parameter sharing scheme in your convolutional layers to
control the total number of parameters. If you denote a single two-
dimensional slice of depth as a depth slice, you can constrain the neurons
contained in every depth slice to use the same weights and bias. With
parameter sharing, you get one unique collection of weights per depth slice,
so you can significantly reduce the number of parameters in the first layer of
your ConvNet. After this step, all neurons in a given depth slice of your
ConvNet use the same parameters.
In other words, during backpropagation, every neuron contained in the
volume will automatically compute the gradient for all its weights.
However, these computed gradients are added up over every depth slice, so
only a single collection of weights is updated per depth slice. Note that all
neurons contained in one depth slice will therefore use the exact same weight
vector. As a result, the forward pass of the convolutional layer in every depth
slice can be computed as a convolution of the neurons’ weights with the
input volume. This is the reason why we refer to the resulting collection of
weights as a kernel or a filter, which is convolved with your input.
However, there are a few cases in which this parameter sharing assumption,
in fact, does not make any sense. This is commonly the case with many input
images to a convolutional layer that come with certain centered structure,
where you must learn different features depending on your image location.
For instance, when you have an input of several faces which have been
centered in your image, you probably expect to get different hair-specific or
eye-specific features that could be easily learned at many spatial locations.
When this is the case, it is very common to just relax this parameter sharing
scheme and simply use a locally-connected layer.
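A quick back-of-the-envelope calculation in Python shows how much parameter sharing saves, using the 11x11x3 filter and 96-filter example referenced later in this chapter:
In []: filters, k, channels = 96, 11, 3
weights_per_filter = k * k * channels + 1                     # 363 weights plus 1 bias
with_sharing = filters * weights_per_filter                   # 34,944 parameters in total
locations = 55 * 55                                           # output positions at stride 4
without_sharing = filters * locations * weights_per_filter    # roughly 105 million parameters
print(with_sharing, without_sharing)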
Matrix Multiplication
The convolution operation commonly performs those dot products between
the local regions of the input and between the filters. In these cases, a
common implementation technique of the convolutional layers is to take full
advantage of this fact and to formulate the specific forward pass of the main
convolutional layer representing it as one large matrix multiply.
The implementation works by stretching the local regions of the input image
out into columns, in an operation commonly known as im2col. For instance,
if you have an input of size 227x227x3 and you convolve it with a filter of
size 11x11x3 at a stride of 4, you take blocks of pixels of size 11x11x3 from
the input and stretch every block into a column vector of size 11x11x3 = 363.
Iterating this process over the input at a stride of 4 gives 55 locations along
both width and height, leading to an output matrix X_col in which every
column is a maximally stretched-out receptive field, with 55x55 = 3,025
fields in total. Note that since the receptive fields overlap, each number in
your input volume may be duplicated in multiple distinct columns. Also
remember that the weights of the convolutional layer are similarly stretched
out into rows. For instance, if you have 96 filters of size 11x11x3, you will
get a matrix W_row of size 96x363.
The result of the convolution is then equivalent to performing one huge
matrix multiply that evaluates the dot product between every filter and every
receptive field, giving the output of each filter at each location. Once you get
your result, you must reshape it back to its proper output dimension, which in
this case is 55x55x96.
This is a great approach, but it has a downside: it uses a lot of memory, as the
values contained in your input volume are replicated several times. The main
benefit of casting convolution as matrix multiplication, however, is that there
are many very efficient implementations of matrix multiply that can speed up
your model. In addition, the same im2col idea can be re-used when
performing the pooling operation.
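Below is a simplified, single-channel sketch of the im2col idea, for illustration only; a real implementation would also handle multiple channels, padding, and batching:
In []: import numpy as np

def im2col(x, k, stride=1):
    # Stretch every k x k patch of a 2-D input into one column.
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.zeros((k * k, out_h * out_w))
    idx = 0
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            cols[:, idx] = x[i:i + k, j:j + k].ravel()
            idx += 1
    return cols

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input image
kernel = np.ones((3, 3))                       # toy 3x3 filter
cols = im2col(x, 3)                            # shape (9, 4): four receptive fields
out = kernel.ravel() @ cols                    # the convolution as one matrix multiply
print(out.reshape(2, 2))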
Conclusion
Thank you for making it through to the end! The next step is to start putting
the information and examples that we talked about in this guidebook to good
use. There is a lot of information inside all that data that we have been
collecting for some time now. But all of that data is worthless if we are not
able to analyze it and find out what predictions and insights are in there. This
is part of what the process of data science is all about, and when it is
combined together with the Python language, we are going to see some
amazing results in the process as well.
This guidebook took some time to explore more about data science and what
it all entails. This is an in-depth and complex process, one that often includes
more steps than data scientists were aware of when they first got started.
But if a business wants to be able to actually learn the insights that are in
their data, and they want to gain that competitive edge in so many ways, they
need to be willing to take on these steps of data science, and make it work for
their needs.
This guidebook went through all of the steps that you need to know in order
to get started with data science and some of the basic parts of the Python
code. We can then put all of this together in order to create the right
analytical algorithm that, once it is trained properly and tested with the right
kinds of data, will work to make predictions, provide information, and even
show us insights that were never possible before. And all that you need to do
to get this information is to use the steps that we outline and discuss in this
guidebook.
There are so many great ways that you can use the data you have been
collecting for some time now, and being able to complete the process of data
visualization will ensure that you get it all done. When you are ready to get
started with Python data science, make sure to check out this guidebook to
learn how.
Loops are going to be next on the list of topics we need to explore when we
are working with Python. These are going to be a great way to clean up some
of the code that you are working on so that you can add in a ton of
information and processing in the code, without having to go through the
process of writing out all those lines of code. For example, if you would like
a program that would count out all of the numbers that go from one to one
hundred, you would not want to write out that many lines of code along the
way. Or if you would like to create a program for doing a multiplication
table, this would take forever as well. But doing a loop can help to get all of
this done in just a few lines of code, saving you a lot of time and code writing
in the process.
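As a quick sketch of the two examples just mentioned:
# Count from one to one hundred in three lines instead of a hundred:
for number in range(1, 101):
    print(number)

# A small multiplication table using a nested loop:
for i in range(1, 11):
    for j in range(1, 11):
        print(i * j, end=' ')
    print()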
It is possible to add in a lot of different information into the loops that you
would like to write, but even with all of this information, they are still going
to be easy to work with. These loops are going to have all of the ability to tell
your compiler that it needs to read through the same line of code, over and
over again, until the program has reached the conditions that you set. This
helps to simplify the code that you are working on while still ensuring that it
works the way that you want when executing it.
As you decide to write out some of these loops, it is important to remember
to set up the kind of condition that you would like to have met before you
ever try to run the program. If you just write out one of these loops, without
this condition, the loop won’t have any idea when it is time to stop and will
keep going on and on. Without this kind of condition, the code is going to
keep reading through the loop and will freeze your computer. So, before you
execute this code, double-check that you have been able to put in these
conditions before you try to run it at all.
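For instance, a while loop needs a condition that eventually becomes false; the small sketch below stops after five passes because the counter changes on every pass:
count = 1
while count <= 5:
    print(count)
    count += 1   # without this line the condition never changes and the loop runs forever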
As you go through and work on these loops and you are creating your own
Python code, there are going to be a few options that you can use with loops.
There are a lot of options but we are going to spend our time looking at the
three main loops that most programmers are going to use, the ones that are
the easiest and most efficient.
Matt Foster
© Copyright 2019 - All rights reserved.
The content contained within this book may not be reproduced, duplicated, or
transmitted without direct written permission from the author or the
publisher.
Under no circumstances will any blame or legal responsibility be held against
the publisher, or author, for any damages, reparation, or monetary loss due to
the information contained within this book, either directly or indirectly.
Legal Notice:
This book is copyright protected. It is only for personal use. You cannot
amend, distribute, sell, use, quote or paraphrase any part, or the content
within this book, without the consent of the author or publisher.
Disclaimer Notice:
Please note the information contained within this document is for educational
and entertainment purposes only. All effort has been executed to present
accurate, up to date, reliable, complete information. No warranties of any
kind are declared or implied. Readers acknowledge that the author is not
engaging in the rendering of legal, financial, medical, or professional advice.
The content within this book has been derived from various sources. Please
consult a licensed professional before attempting any techniques outlined in
this book.
By reading this document, the reader agrees that under no circumstances is
the author responsible for any losses, direct or indirect, that are incurred as a
result of the use of information contained within this document, including,
but not limited to, errors, omissions, or inaccuracies.
Introduction
Starting off with SQL with the help
of Microsoft Access
Understand that Microsoft Access is a Rapid Application Development
(RAD) tool, designed to be used without any knowledge of programming. It
is possible to write, develop, and execute SQL statements using Access, but
you have to use a back-door method to do it.
The following steps will help you open a basic SQL editor in Access so you
can start writing your SQL code. Begin with a skeleton statement like this:
SELECT
FROM POWER ;
After which, you will need to add the WHERE clause right after the FROM
line, while making sure to put an asterisk (*) in the blank area after SELECT.
A very common mistake people make here is forgetting the semicolon at the
end of the statement. Don’t do that.
SELECT *
FROM POWER
WHERE LastName = ‘Marx’ ;
Once done, click on the floppy-diskette icon to save your statement.
Enter a name for it and click OK.
Table manipulation
Use the following code when you want to add a second address field to your
POWERSQL table.
ALTER TABLE POWERSQL
ADD COLUMN Address2 CHAR (30);
Again, when it comes to deleting a table, you will want to use the following
code.
DROP TABLE POWERSQL;
Not as hard as it sounded, right? Bear with us, and even the more advanced
concepts will become much easier to you eventually! Among the various
concepts, the most common one is the simple task of retrieving the required
information from a given database, say, for example, the data of one row
from a collection of thousands.
The very basic code for this method is
SELECT column_list FROM table_name
WHERE condition ;
As you can see, it utilizes the SELECT and WHERE statement to specify the
desired column and condition.
Having familiarized yourself with the skeleton of the code, the following
example should make things clearer. Here, the code asks for information
from the CUSTOMER table.
SELECT FirstName, LastName, Phone FROM CUSTOMER
WHERE State = ‘NH’
AND Status = ‘Active’ ;
Specifically speaking, the statement above returns the names and phone
numbers of all active customers who live in New Hampshire (NH). Keep in
mind that the keyword AND is used, which simply means that for a row to be
retrieved, both of the given conditions must be met.
Study questions
Q1) The RAD is designed to be used with what?
a) Access
b) Power point
c) Mozilla FireFox
d) Word
Answer: A
Q2) Which of the following can be used to add data to a row?
a) INSERT UNTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
b) ADD INTOtable_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
c) INSERT INTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
d) ADD UNTO table_1 [(column_1, column_2, ..., column_n)]
VALUES (value_1, value_2, ..., value_n) ;
Answer: C
Q3) How can you transfer data between two tables, namely PROSPECT and
CUSTOMER?
a) CHOOSE FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
SELECT FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
b) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
TRANSFER FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
c) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
INTO FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
d) SELECT FirstName, LastName
FROM PROSPECT
WHERE State = ‘ME’
UNION
SELECT FirstName, LastName
FROM CUSTOMER
WHERE State = ‘ME’ ;
Answer: D
Q4) How can you eliminate unwanted data?
a) ABOLISH FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
b) ELIMINATE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
c) DELETE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
d) REMOVE FROM CUSTOMER
WHERE FirstName = ‘David’ AND LastName = ‘Taylor’;
Answer: C
Q5) How can you design a view using the following tables: CLIENT, TEST,
ORDERS, EMPLOYEE, RESULTS?
a) GENERATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName;
b) INITIATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName;
c) DESIGN VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName;
d) CREATE VIEW ORDERS_BY_STATE
(ClientName, State, OrderNumber)
AS SELECT CLIENT.ClientName, State, OrderNumber
FROM CLIENT, ORDERS
WHERE CLIENT.ClientName = ORDERS.ClientName;
Answer: D
Chapter 1 - Data Types in SQL
Data is at the core of SQL. After all, SQL was created to make it easier to
manipulate the data stored inside a database. It ensures that you do not have
to sort through large chunks of data manually in search of what you want.
Now, there are various types of data that can be stored in the database
depending on the platform used. As such, you now need to learn about the
data types available in SQL.
Foreign Key
Now we're getting to the interesting bit: foreign keys. Declared with the
FOREIGN KEY keyword, a foreign key lets you connect two tables together,
which is why you'll often see it called a referencing key. The foreign key is
simply a column, or a combination of columns, that refers to the primary key
of a different, parent table. In the parent-child dynamic of SQL, there is no
doubt that the foreign key is the child.
The relationship between these two tables, connected like this, would be that
the first gives its primary attributes to the other. For example, in the
CUSTOMER table, you might want to have a bit more functionality, so you
create a table called BUYS. In the BUYS table, you’ll hold all of the
customer’s orders, rather than simply doing this on a sheet of paper like in
the olden days.
Let’s do an example of this!
-- The customer table
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL,
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
-- The buys table
CREATE TABLE BUYS (
IDS INT NOT NULL UNIQUE,
TIME DATETIME,
CUST_IDS INT,
AMOUNT double,
PRIMARY KEY (IDS)
);
While this will create a BUYS table, it doesn't yet maintain a relationship
with its parent table, the CUSTOMER table. Because of this, we'll add the
foreign key by altering the table rather than declaring it within its
construction (this also lets you reuse the table later):
ALTER TABLE BUYS
ADD FOREIGN KEY (CUST_IDS) REFERENCES CUSTOMER(IDS);
This tells the program its foreign key will be CUST_IDS, which will pull its
values from the CUSTOMER table.
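To see the constraint in action, here is a small illustrative test (the values are made up for this example): once the foreign key is in place, a row in BUYS may only reference an IDS value that already exists in CUSTOMER.
-- Assuming customer 1 exists, these inserts succeed
INSERT INTO CUSTOMER (IDS, NAMES, AGES) VALUES (1, 'Anna', 30);
INSERT INTO BUYS (IDS, TIME, CUST_IDS, AMOUNT) VALUES (10, '2019-06-01', 1, 59.99);
-- This insert fails, because there is no customer with IDS = 99
INSERT INTO BUYS (IDS, TIME, CUST_IDS, AMOUNT) VALUES (11, '2019-06-01', 99, 19.99);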
Check
The CHECK constraint is similar to a conditional statement in object-oriented
languages: it tests whether a given condition is satisfied. If the condition is
not satisfied for a value you try to enter, that value violates the constraint,
making it invalid input. In other words, the condition describes what valid
data must look like, and anything that fails the test is rejected.
Now, let’s take another look at our creation of the CUSTOMER database.
Let’s say you’re working for a supermarket firm, they’re not allowed to sell
alcohol to people under 21 right? Well, you can tick that off inside SQL by
using a check constraint to check the customer’s age, and reject them if
they’re under 21.
-- The customer table
CREATE TABLE CUSTOMER(
IDS INT NOT NULL UNIQUE,
NAMES VARCHAR (15) NOT NULL,
AGES INT NOT NULL CHECK (AGES >= 18),
ADDRESS CHAR (100) ,
SALARIES DECIMAL (20, 2) DEFAULT 2000.00,
PRIMARY KEY (IDS, NAMES)
);
Now, if you’ve already got this table, can you guess what line of code you
would run in order to alter the table to check if the age is above 21?
ALTER TABLE CUSTOMER
MODIFY AGES INT NOT NULL CHECK(AGES >= 21);
You can also achieve this differently, by using a syntax which will let you
name the constraint, which is sometimes useful when making multiple
constraints of the same type.
ALTER TABLE CUSTOMER
ADD CONSTRAINT check21 CHECK (AGES >= 21) ;
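With the constraint in place, any insert that breaks the rule is rejected. A quick illustrative test (the values are invented for this example):
-- Fails: AGES is below 21, so the check21 constraint rejects the row
INSERT INTO CUSTOMER (IDS, NAMES, AGES) VALUES (2, 'Ben', 19);
-- Succeeds: the condition AGES >= 21 holds
INSERT INTO CUSTOMER (IDS, NAMES, AGES) VALUES (3, 'Cara', 34);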
Integrity Constraints
Tables are, as you've seen so far, very easy to manipulate. That said,
integrity constraints are vital to their continued existence: they ensure the
correct data is mapped to the correct places. In essence, they're the unsung
heroes of SQL, as they enable us developers to do our jobs. Relational
databases check the integrity of the data using a concept called referential
integrity.
While there are many kinds of integrity constraint, you've already learned the
most important ones, such as the primary key, the foreign key and UNIQUE.
You'll also find that, although there are a ton of integrity constraints, not all
of them are very useful in practice.
T-sql
T-SQL, or Transact-SQL (Transact Structured Query Language), is an
extension of the SQL commands that have been executed thus far in this
ebook. T-SQL offers a number of extra features that are not available in
standard SQL. These features include local variables and programming
elements which are used a lot in stored procedures.
If else
Often you will write statements in a stored procedure where you need a
logical true or false answer before you can proceed to the next statement.
The IF ELSE statement can facilitate this. To test for a true or false
condition you can use the >, <, = and NOT operators along with the
variables being tested. The syntax for the IF ELSE statement is the
following; note there is only one statement allowed after the IF and one
after the ELSE:
IF X=Y
Statement when True
ELSE
Statement when False
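As a concrete sketch (the variable and the values are made up for illustration), an IF ELSE block in T-SQL might look like this:
DECLARE @Stock INT = 5
IF @Stock = 0
PRINT 'Out of stock'
ELSE
PRINT 'In stock'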
Begin end
If you need to execute more than one statement in the IF or ELSE block, then
you can use the BEGIN END statement. You can put together a series of
statements which will run one after the other regardless of what was tested
before them. The syntax for BEGIN END is the following:
IF X=Y
BEGIN
statement1
statement2
END
While break
When you need to loop around a piece of code a number of times, you can
use the WHILE BREAK statement. It will keep looping until the Boolean
test condition becomes false or the code hits the BREAK statement. The
WHILE statement will continue to execute as long as the Boolean expression
returns true; once it is false, the loop ends and the next statement is executed.
You can also use the optional CONTINUE statement, which moves
processing straight back to the WHILE condition. The syntax for the
WHILE BREAK command is the following:
WHILE booleanExpression
SQL_statement1 | statementBlock1
BREAK
SQL_statement2 | statementBlock2
Continue
SQL_statement3 | statementBlock3
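Here is a small illustrative loop (the counter variable is invented for this example) that prints the numbers one to five and then leaves the loop:
DECLARE @Counter INT = 1
WHILE @Counter <= 10
BEGIN
PRINT 'Iteration ' + CAST(@Counter AS VARCHAR(2))
SET @Counter = @Counter + 1
IF @Counter > 5
BREAK -- leave the loop early once we pass five
END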
Case
When you have to evaluate a number of conditions and a number of possible
answers, you can use CASE. The decision making is carried out within an
initial SELECT or UPDATE statement; a CASE expression (not a statement)
is stated, after which you determine the outcome with one or more WHEN
clauses. You can use CASE as part of a SELECT, UPDATE or INSERT
statement.
There are two forms of CASE: you can use the simple form to compare one
value or scalar expression to a list of possible values and return a value for
the first match, or you can use the searched form when you need more
flexibility to specify a predicate or mini function as opposed to an equality
comparison. The following code illustrates the simple form:
SELECT column1
CASE expression
WHEN valueMatched THEN
statements to be executed
WHEN valueMatched THEN
statements to be executed
ELSE
statements to catch all other possibilities
END
The following code illustrates the more complex, searched form; it is useful
for computing a value depending on a condition:
SELECT column1
CASE
WHEN valueX_is_matched THEN
resulting_expression1
WHEN valueY_is_matched THEN
resulting_ expression 2
WHEN valueZ_is_matched THEN
resulting_ expression 3
ELSE
statements to catch all other possibilities
END
CASE works like so: each table row is put through the CASE expression and,
instead of the raw column value being returned, the value from the
computation is returned instead.
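As an illustrative sketch (the table and column names are made up for this example), a searched CASE inside a SELECT could label products by price band:
SELECT Name,
CASE
WHEN Price >= 1000 THEN 'Premium'
WHEN Price >= 100 THEN 'Standard'
ELSE 'Budget'
END AS PriceBand
FROM Products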
Functions
As mentioned, functions are similar to stored procedures, but they differ in that
functions (or User Defined Functions, UDFs) can execute within another piece
of work – you can use them anywhere you would use a table or column.
They are like methods: small and quick to run. You simply pass in some
information and a result is returned. There are two types of functions, scalar
and table-valued. The difference between the two is what you can return
from the function.
Scalar functions
A scalar function can only return a single value of the type defined in the
RETURNS clause. You can use scalar functions anywhere a scalar of the
same data type is valid in your T-SQL statements. When calling them, you
can omit a number of the function's parameters. You need to include a
RETURN statement if you want the function to complete and return control
to the calling code. The syntax for a scalar function is the following:
CREATE FUNCTION schema_Name.function_Name
(
-- parameters go here
)
RETURNS dataType
AS
BEGIN
-- function code goes here
RETURN scalar_Expression
END
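For example, a minimal scalar function (the name and the tax rate are invented for illustration) that adds sales tax to a price could look like this and can then be used anywhere a numeric expression is allowed:
CREATE FUNCTION dbo.AddTax (@Price DECIMAL(9,2))
RETURNS DECIMAL(9,2)
AS
BEGIN
RETURN @Price * 1.08 -- assume an 8% tax rate for the example
END
GO
-- Use it inside a query just like a column expression
SELECT dbo.AddTax(100.00) AS PriceWithTax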
Table-valued functions
A table-valued function (TVF) lets you return a table of data rather than the
single value of a scalar function. You can use a table-valued function
anywhere you would normally use a table, usually in the FROM clause of
a query. With table-valued functions it is possible to create a reusable code
framework in a database. The syntax of a TVF is the following:
CREATE FUNCTION function_Name (@variableName dataType)
RETURNS TABLE
AS
RETURN
SELECT columnName1, columnName2
FROM Table1
WHERE columnName > @variableName
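Once created, the function is queried just like a table; for instance, using the placeholder names from the syntax above:
SELECT columnName1, columnName2
FROM dbo.function_Name(10)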
Notes on functions
A function cannot alter any external resource, a table for example. A
function needs to be robust; if an error is generated inside it, either from
invalid data being passed in or from its own logic, it will stop executing and
control will return to the T-SQL which called it.
Chapter 3 - Database Backup and Recovery
As mentioned, the most important task a DBA can perform is backing up the
database. When you create a maintenance plan, it's important to have the
backup at the top of the maintenance list in case the job doesn't get fully
completed. First, it is important to understand the transaction log and why it
matters.
Recovery
The first step in backing up a database is choosing a recovery option for the
database. You can perform the three types of backups when SQL Server is
online and even while users are making requests from the database.
Recovery models
When you back up and restore in SQL Server, you do so in the context of the
recovery model, which is a model designed to control the maintenance of the
transaction log. The recovery model is a database property that controls
how transactions are logged.
There are three different recovery options: Simple, Full and Bulk Logged.
Simple recovery
You cannot back up the transactional log when utilizing the simple recovery
model. Usually this model is used where updates are infrequent.
Transactions are logged to a minimum and the log will be truncated.
Full recovery
In the full recovery model the transaction log backup must be taken. Only
when the backup process begins will the transactional log be truncated. You
can recover to any point in time. However, you also need the full chain of
log files to restore the database to the nearest time possible.
Bulk logged
This model is designed to be utilized for short term use when you use a bulk
import operation. You use it along with full recovery model whenever you
don’t need a certain point in time recovery. It has performance gains and also
doesn’t fill up the transaction log.
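You can check or change the recovery model with T-SQL as well; a brief illustration (the database name is a placeholder):
-- See the current recovery model of every database
SELECT name, recovery_model_desc FROM sys.databases;
-- Switch a database to the full recovery model
ALTER DATABASE MyDatabase SET RECOVERY FULL;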
Backups
There are three types of backup: full, differential and transaction log:
Full backup
When you create a full backup, SQL Server issues a CHECKPOINT, which
ensures that any dirty pages that exist are written to disk. Then SQL Server
backs up each and every page in the database. It then backs up the majority
of the transaction log to ensure there is transactional consistency. What all of
this means is that you are able to restore your database to the most recent
point and have all the transactions, including those made right up to the very
beginning of the backup.
Differential backup
The differential backup, as its name suggests, backs up every page in the
database that has been modified since the last full backup. SQL Server
keeps track of all the pages that have been modified via flags and
DIFF pages.
Backup strategy
When database administrators set out a backup plan, they base it on
two measures: Recovery Time Objective (RTO) and Recovery Point
Objective (RPO). RTO is the amount of time it takes to recover after
notification of a disruption in the business process. RPO is the amount of
time that might pass during a disruption before the amount of data lost
exceeds the maximum the business process can tolerate.
If there were an RPO of 60 minutes, you couldn't achieve that goal with a
backup set to run every 24 hours. You need to set your backup plan based
on these two measures.
Full backup
Exercising this alone is the least flexible option. Essentially, you're only able to
restore your database back to one point in time, which is the last full backup.
So, if the database went corrupt two hours before the next midnight backup
(and you back up at midnight), your data loss would be twenty-two hours.
Likewise, if a user truncated a table at that same moment, you would have the
same twenty-two-hour loss of business transactions.
Performing a backup
To back up a database right click the database in SSMS then select Tasks->
Backup. You can select what kind of backup (full, differential or transaction
log) to perform and when to perform a backup. The copy-only backup allows
you to perform a backup which doesn’t affect the restore sequence.
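The same backups can be scripted in T-SQL instead of using the SSMS dialog; a minimal sketch, with a made-up database name and file paths:
-- Full backup
BACKUP DATABASE MyDatabase
TO DISK = 'C:\Backups\MyDatabase_Full.bak';
-- Differential backup
BACKUP DATABASE MyDatabase
TO DISK = 'C:\Backups\MyDatabase_Diff.bak'
WITH DIFFERENTIAL;
-- Transaction log backup (full or bulk-logged recovery model only)
BACKUP LOG MyDatabase
TO DISK = 'C:\Backups\MyDatabase_Log.trn';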
Restoring a database
When you want to restore a database in SSMS right click the database then
select Tasks -> Restore -> Database. You can select the database from the
drop down and thus the rest of the tabs will be populated.
If you click on Timeline you will see a graphical diagram of when the last
backup was created which shows how much data was lost. You can recover
to the end of log or a specific date and time.
The Verify Backup Media button enables you to verify the backup media
before you actually restore it. If you want to change where the database
files will be restored to, you can click on the Files page to select a different
location. You can specify the restore options that you are going to use on the
Options page: either overwrite the existing database or keep it. The recovery
state either brings the database online or allows further backups to be
applied.
Once you click OK at the bottom, the database will be restored.
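A restore can be scripted too; a minimal sketch using the same made-up names, restoring the full backup first and then the log:
RESTORE DATABASE MyDatabase
FROM DISK = 'C:\Backups\MyDatabase_Full.bak'
WITH NORECOVERY; -- leave the database ready for more backups
RESTORE LOG MyDatabase
FROM DISK = 'C:\Backups\MyDatabase_Log.trn'
WITH RECOVERY; -- bring the database online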
Sequences
A sequence refers to a set of numbers that have been generated in a specified
order on demand. These are popular in databases. The reason behind this is
that sequences provide an easy way to have a unique value for each row in a
specified column. In this chapter, we will explore how to use sequences in
SQL.
AUTO_INCREMENT Column
This provides you with the easiest way of creating a sequence in MySQL.
You only have to define the column as auto_increment and leave MySQL to
take care of the rest. To show how to use this property, we will create a
simple table and insert some records into the table.
The following command will help us create the table:
CREATE TABLE colleagues
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id),
name VARCHAR(20) NOT NULL,
home_city VARCHAR(20) NOT NULL
);
The command should create the table successfully.
We have created a table named colleagues. This table has 3 columns namely
id, name and home_city. The first column is of integer data type while the
rest are varchars (variable characters). We have added the auto_increment
property to the id column, so the column values will be incremented
automatically. When entering data into the table, we don’t need to specify the
value of this column. The reason is that it will start at 1 by default then
increment the values automatically for each record you insert into the table.
Let us now insert some records into the table:
INSERT INTO colleagues
VALUES (NULL, "John", "New York");
INSERT INTO colleagues
VALUES (NULL, "Joel", "New Jersey");
INSERT INTO colleagues
VALUES (NULL, "Cate", "New York");
INSERT INTO colleagues
VALUES (NULL, "Boss", "Washington");
The commands should run successfully.
Now, we can run the select statement against the table and see its contents:
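Given the four inserts above, the query and its expected result are:
SELECT * FROM colleagues;
-- Expected rows:
-- 1, John, New York
-- 2, Joel, New Jersey
-- 3, Cate, New York
-- 4, Boss, Washington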
We see that the id column has also been populated with values starting from
1. Each time you enter a record, the value of this column is incremented by a
1. We have successfully created a sequence.
Renumbering a Sequence
You notice that when you delete a record from a sequence such as the one we
have created above, the records will not be renumbered. You may not be
impressed by such kind of numbering. However, it is possible for you to re-
sequence the records. This only involves a single trick, but be keen by
checking whether the table has a join with another table or not.
However, if you find you have to re-sequence your records, the best way to
do it is by dropping the column and then adding it. Let us show how to drop
the id column of the colleagues' table.
The table is as follows for now:
Let us drop the id column by running the following command:
ALTER TABLE colleagues DROP id;
To confirm whether the deletion has taken place, let us view the table data.
The deletion was successful. We combined the ALTER TABLE and the
DROP commands for the deletion of the column. Now, let us re-add the
column to the table:
ALTER TABLE colleagues
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
ADD PRIMARY KEY (id);
The command should run successfully.
We started with the ALTER TABLE command to specify the name of the
table we need to change. The ADD command has then been used to add the
column and set it as the primary key for the table. We have also used the
auto_increment property in the column definition. We can now query the
table to see what has happened:
The id column was added successfully. The sequence has also been
numbered correctly.
By default, MySQL starts the sequence at 1. However, it is possible for you to
change the value at which the sequence continues on an existing table, or the
value at which it starts when the table is created. For the table named
colleagues, we can alter the table so that the next auto_increment value handed
out is 2. This can be done by running the following command:
ALTER TABLE colleagues AUTO_INCREMENT = 2;
The command should run successfully.
We can also specify where the auto_increment will start at the time the table
is created, by using a table option. The following example shows this:
CREATE TABLE colleagues2
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id),
name VARCHAR(20) NOT NULL,
home_city VARCHAR(20) NOT NULL
) AUTO_INCREMENT = 10;
In the above example, we have set the auto_increment property on the id
column, and the table option makes the sequence start at 10.
Chapter 4 - Sql Aliases
SQL allows you to rename a table or a column temporarily. The new name is
referred to as an alias. Table aliases help us in renaming tables in SQL
statements. Note that the renaming is only temporary, meaning it won’t
change the actual name of the table. We use the column aliases to rename the
columns of a table in a certain SQL query.
Table aliases can be created using the following syntax:
SELECT column_1, column_2....
FROM tableName AS aliasName
WHERE [condition];
Column aliases can be created using the following syntax:
SELECT columnName AS aliasName
FROM tableName
WHERE [condition];
To demonstrate how to use table aliases, we will use two tables, the students
table and the fee table. Assume both tables already contain a few sample rows,
and that the fee table has a student_regno column referencing the students table.
We can now run the following command showing how to use table aliases:
SELECT s.regno, s.name, s.age, f.amount
FROM students AS s, fee AS f
WHERE s.regno = f.student_regno;
The command returns the regno, name and age columns from the students
table alongside the matching fee amount from the fee table. We have used the
alias s for the students table and the alias f for the fee table, fetching three
columns from the students table and one column from the fee table.
A column alias can be created as shown below:
SELECT regno AS student_regno, name AS student_name
FROM students
WHERE age IS NOT NULL;
Upon execution, the field with the registration numbers is returned under the
title student_regno, while the field with the student names is returned under
the title student_name. This is because these are the aliases we gave to these
columns.
Chapter 5 - Database Normalization
Now that you’re more familiar with database components, like primary and
foreign keys, let’s review database normalization.
By standard practice, every database should go through the normalization
process. Normalization is a process that was created by Raymond Boyce and
Edgar Codd back in the 1970s in order to optimize a database as much as
possible. Each step of the normalization process has what's known as a
normal form, which ranges from one to five, where five is the highest.
Typically, though, you can implement up to the third normal form in most
databases without negatively impacting functionality.
The main goal is to maintain the integrity of the data, optimize the efficiency
of the database, provide a more efficient method in tracking and storing data
and help avoid any data issues along the way.
Speaking of avoiding data issues, there are some points to be aware of, like
data anomalies, that can create data issues in the database if the conditions of
a normal form are not met. There are three types of anomalies: insert, update
and delete.
To explain data anomalies, imagine a single table that lists each product sold alongside its supplier's details.
Insert anomaly:
This occurs when we’re not able to add a new record unless other attributes
exist already. For instance, let’s say there’s a new product that will be sold
but the company doesn’t have a supplier yet. We’d have to wait to find a
valid supplier in order to enter that here, instead of just adding product
information.
Update anomaly:
This occurs when one value needs to be changed in many places, rather than
being changed in only one place. For example, if the supplier changes their
name, like Friendly Supplements, Co., then we have to update that in every
row that it exists.
Delete anomaly:
This occurs when there’s data that we’d like to remove, but if it were to be
removed, we’d be forced to remove other values that we’d like to keep. Let’s
say the energy drink isn’t sold anymore, so this row is deleted. Then all of the
other values will be deleted also. But perhaps we want to know who supplied
that product originally as a way of keeping track of the supplier’s
information.
Now that you’re aware of data anomalies and how they can create issues,
let’s move to the first step in normalization.
The value of 'n' can be anywhere from 1 to 8,000, or you can substitute MAX,
which allows up to 2 to the 31st power, minus 1, characters; however, this
length is rarely used. When designing your tables, estimate the length of the
longest string plus a few bytes to be on the safe side. If you know that the
strings you will be storing will be around 30 characters, you may want to
specify VARCHAR(40).
EXACT NUMERICS
There are various number data types that can be used to represent numbers in
the database. These are called exact numbers.
These types are commonly used when creating ID’s in the database, like an
Employee ID for instance.
Bigint – Values range from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807, which isn’t used so frequently.
Int – most commonly used data type and its values range from
-2,147,483,648 to 2,147,483,647
Smallint – Values range from -32,768 to 32,767
Tinyint – Values range from 0 to 255
In any case, it’s best to pick the data type that will be the smallest out of all of
them so that you can save space in your database.
DECIMAL
Much like the exact numeric data types, this holds numbers; however, they
are numbers including decimals. This is a great option when dealing with
certain numbers, like weight or money. Decimal values can only hold up to
38 digits, including the decimal points.
In order to define the length of the decimal data type when creating a table,
you would write the following: DECIMAL(precision, scale). Precision is
indicative of the total amount of digits that will be stored both to the left and
to the right of the decimal point. Scale is how many digits you can have to the
right of your decimal point.
Let’s say that you wanted to enter $1,000.50 into your database. First, you
would change this value to 1000.50 and not try to add it with the dollar sign
and comma. The proper way to define this value per the data type would be
DECIMAL(6,2).
FLOAT
This data type is similar to the Exact Numerics as previously explained.
However, this is more of an Approximate Numeric, meaning it should not be
used for values that you do not expect to be exact. One example is that they
are used in scientific equations and applications.
This data type uses scientific notation, and its range is from
-1.79E + 308 to 1.79E + 308. The "E" represents an exponent, a power of ten:
the lowest value is -1.79 times 10 to the 308th power, and the max value is
1.79 times 10 to the 308th power (notice how this is in the positive range
now).
To specify a float data type when creating a table, you’d simply specify the
name of your column and then use FLOAT. There is no need to specify a
length with this data type, as it’s already handled by the database engine
itself.
DATE
The DATE data type in SQL Server is used quite often for storing dates of
course. Its format is YYYY-MM-DD. This data type will only show the
month, day and year and is useful if you only need to see that type of
information aside from the time.
The values of the date data type range from ‘0001-01-01’ to ‘9999-12-31’.
So, you have a lot of date ranges to be able to work with!
When creating a table with a date data type, there’s no need to specify any
parameters. Simply inputting DATE will do.
DATETIME
This is similar to the DATE data type, but more in-depth, as this includes
time. The time is denoted in seconds; more specifically it is accurate by
0.00333 seconds.
Its format is as follows: YYYY-MM-DD HH:MI:SS. The values of this data
type range between '1753-01-01 00:00:00' and '9999-12-31 23:59:59'.
Just as the DATE data type, there is no value or length specification needed
for this when creating a table. Simply adding DATETIME will suffice.
If you’re building a table and are deciding between these two data types,
there isn’t much overhead between either. Though, you should determine
whether or not you need the times or would like the times in there. If so, then
use the DATETIME data type, and if not, use the DATE data type.
BIT
This is an integer value that can either be 0, 1 or NULL. It’s a relatively small
data type in which it doesn’t take up much space (8 bit columns = 1 byte in
the database). The integer value of 1 equates to TRUE and 0 equates to
FALSE, which is a great option if you only have true/false values in a
column.
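To tie these data types together, here is a small illustrative table definition; the table and column names are invented for this example:
CREATE TABLE EmployeeBadge (
EmployeeID INT NOT NULL,   -- exact numeric
HourlyRate DECIMAL(6, 2),  -- up to 9999.99
HireDate DATE,             -- date only
LastLogin DATETIME,        -- date and time
IsActive BIT               -- 1 = TRUE, 0 = FALSE
);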
Chapter 7 - Downloading and Installing SQL
Server Express
Before we go any further, I want you to download and install SQL Server on
your own computer. This is going to help you tremendously when it’s time to
write syntax and it’s also necessary if you want to gain some hands-on
experience.
Note: if you have performed this before, you don’t have to follow this step-
by-step, but make sure you’re installing the proper SQL Server “Features”
that we’ll be using, which is shown in this section of the book. If you haven’t
performed this before, just follow along!
Click the link to be taken to the download page for SQL Server:
https://www.microsoft.com/en-us/download/details.aspx?id=54276
Scroll down the page and choose the language that you’d like to use and click
‘Download’.
After the download has completed, open the file.
When the install window comes up, it provides the “Basic”, “Custom” or
“Download Media” options. Let’s select “Custom”.
On the next page where it asks to be installed, keep the default settings.
After that, select “Install”.
Note: If you don’t receive the below menu after the install, just navigate to
the path above on your computer, open the ExpressAdv folder and click on
the SETUP.EXE file and you’ll be launched right into it!
It may take a little while, so go ahead and grab a snack or perhaps some
coffee while you wait.
After the installer has finished, you’ll then be brought to a setup screen with
several options. You’ll land on the “Installation” menu and select the option
to add a New SQL Server stand-alone installation.
Go ahead and accept the license agreements. It will then run a few quick
verification processes.
On the “Install Rules” menu, ensure that you’re receiving a “Passed” status
for just about every rule, which you most likely will. If you end up getting a
Windows Firewall warning like me, ignore it and continue anyway.
Feature Selection
Here, you’ll be able to select the features you want to install on your SQL
Server Instance. Thankfully Microsoft has provided a description of what
each feature is on the right-hand side. Make sure that you have the following
items checked, as these will all be a part of the features available when you
use SQL Server:
Instance Features:
R Services (In-Database)
Full-Text and Semantic Extractions for Search
Shared Features:
SELECT * FROM "table_name"
WHERE "column_name" = 'value'
ORDER BY "column_name";
More keywords and SQL commands will be introduced as you read the book,
so take it easy!
Data types
There are various data types that you should be familiar with. This is because
they make use of SQL language that is significant in understanding SQL
more.
There are six major SQL data types.
Date and Time Data
As the name implies, this type of data deals with dates and times.
Examples are: datetime (from January 1, 1753 to December 31, 9999),
smalldatetime (from January 1, 1900 to June 6, 2079), date (a calendar date
such as Jun 1, 2016) and time (a time of day such as 3:20 AM).
Exact Numeric Data
Under exact numeric data, there are several subtypes too such as;
tinyint – FROM 0 TO 255
bit – FROM 0 TO 1
bigint – FROM -9,223,372,036,854,775,808 TO 9,223,372,036,854,775,807
numeric – FROM -10^38+1 TO 10^38-1
int - FROM -2,147,483,648 TO 2,147,483,647
decimal – FROM -10^38+1 TO 10^38-1
money – FROM -922,337,203,685,477.5808 TO 922,337,203,685,477.5807
smallmoney – FROM -214,748.3648 TO +214,748.3647
smallint – FROM -32,768 TO 32,767
Binary Data
Binary data has several types as well. These are: binary (fixed length),
varbinary (variable-length binary), varbinary (max) (variable-length binary
for large values) and image.
They are classified according to the length of their bytes, with binary
having the shortest, fixed length.
Approximate Numeric Data
These have two types, the float and the real. The float has a value FROM
- 1.79E +308 TO 1.79E +308, while the real data has a value FROM
-3.40E +38 TO 3.40E +38
Unicode Character Strings Data
There are four types of Unicode Character Strings Data namely; ntext,
nchar, nvarchar, and nvarchar (max). They are classified according to their
character lengths.
For ntext, it has a variable maximum length of 1,073,741,823 characters.
For nchar, it has a Unicode maximum fixed length of 4,000 characters.
For nvarchar (max), it has a Unicode variable maximum length of 2^31 - 1
bytes.
For nvarchar, it has a variable maximum length of 4,000 Unicode
characters.
Character Strings Data
The character Strings Data have almost similar types as the Unicode
Character Strings Data, only, some have different maximum values and
they are non-unicode characters, as well.
For text, it has a maximum variable length of 2,147,483,647 non-Unicode
characters.
For char, it has a non-Unicode maximum fixed length of 8,000 characters.
For varchar (max), it has a non-Unicode variable maximum length of 2^31 - 1
characters.
For varchar, it has a variable maximum length of 8,000 non-Unicode
characters.
Miscellaneous Data
Aside from the six major types of data, miscellaneous data can also be stored,
such as tables, SQL variants, cursors, XML documents, unique identifiers
and/or timestamps.
You can refer to this chapter when you want to know about the maximum
values of the data you are preparing.
What is MySQL?
MySQL is a tool (database server) that uses SQL syntax to manage databases.
It is an RDBMS (Relational Database Management System) that you can use
to facilitate the manipulation of your databases.
If you are managing a website using MySQL, ascertain that the host of your
website supports MySQL too.
Here's how you can install MySQL on Microsoft Windows. We will be
using Windows because it is the most commonly used operating system.
You can add more columns if you need more data about your table. It’s up to
you. So, if you want to add four more columns, this is how your SQL
statement would appear.
Example: CREATE TABLE “table_name”
(“column_name1” “data type”,
“column_name2” “data type”,
“column_name3” “data type”,
“column_name4” “data type”);
Add the closing parenthesis and the semi-colon after the SQL statement.
Let’s say you have decided to add for column 2 the keyword used in
searching for your website, for column 3, the number of minutes that the
visitor had spent on your website, and for column 4, the particular post that
the person visited. This is how your SQL statement would appear.
Take note:
The name of the table or column must start with a letter, then it can be
followed by a number, an underscore, or another letter. It's preferable that the
number of characters does not exceed 30.
You can also use a VARCHAR (variable-length character) data type to help
create the column.
Common data types are:
date – date specified or value
number (size) – you should specify the maximum number of column digits
inside the open and close parentheses
char (size) – you should specify the size of the fixed length inside the open
and close parentheses.
varchar (size) – you should specify the maximum size inside the open and
close parentheses. This is for variable lengths of the entries.
Number (size, d) – This is similar to number (size), except that 'd' represents
the maximum number of digits to the right of the decimal point.
Hence, if you want your column to show a value such as 10.21, your data
type would be: number (4,2)
Example: CREATE TABLE traffic_hs2064
(country varchar (40),
keywords varchar (30),
time number (3),
post varchar (40) );
Step #5 – Add CONSTRAINTS, if any
CONSTRAINTS are rules that are applied for a particular column. You can
add CONSTRAINTS, if you wish. The most common CONSTRAINTS are:
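The original list is not reproduced here, but the constraints you will see most often are NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK and DEFAULT. As an illustrative sketch, the traffic table from the earlier example could carry a couple of them like this:
CREATE TABLE traffic_hs2064
(country varchar (40) NOT NULL,
keywords varchar (30),
time number (3) DEFAULT 0,
post varchar (40) NOT NULL );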
A typical login query looks like this:
SELECT * FROM users WHERE username='username' AND
password='password';
The username and password inside the single quotes represent the values
entered by a user. If, for example, someone enters the username alphas and
the password pass123, the query will be:
SELECT * FROM users WHERE username='alphas' AND
password='pass123';
Suppose the user is an attacker, and instead of entering a valid username and
password in the fields, he enters something such as ' OR 'x'='x'.
In such a case, the query will evaluate to the following:
SELECT * FROM users WHERE username='' OR 'x'='x'
Because 'x'='x' is always true, the WHERE clause matches every row, so the
attacker is let in without knowing any valid credentials.
LIKE Quandary
When dealing with a LIKE query, you have to have an escape method in
place so that the characters a user inserts into your prompt box are treated as
literals rather than as part of the query. In PHP, for example, the addslashes
function escapes characters such as quotes before they reach the database.
Hacking Scenario
Google is one of the best hacking tools in the world through the use of the
inurl: command to find vulnerable websites.
For example:
inurl:index.php?id=
inurl:shop.php?id=
inurl:article.php?id=
inurl:pageid=
To use these, copy one of the above commands and paste it in the Google
search engine box. Press Enter.
You will get a list of websites. Start from the first website and check each
website’s vulnerability one by one.
Chapter 12 - Fine-Tune Your Indexes
When you are using SQL, you will want to become an expert on your
database. This, however, is not going to happen overnight. Just like learning
how to use the system, you will need to invest time and effort to learn the
important information and build a proper awareness of how the database
works. You should also ensure that you are equipped with the right education
for working with the database, because you never know what may happen in
the future.
In order to make your education more streamlined when you are learning the
ins and outs of the database, here are some helpful insights:
1. When using the database, you need to work with the 3NF
design.
2. Numbers compare to characters differently, and you could
end up downgrading your database's performance, so do not
change numbers unless you absolutely have to!
3. The SELECT statement should only retrieve the data you
need to display on the screen. Avoid asterisks in your
SELECT statements so that you do not load data that
you do not need at that time.
4. Indexes should be constructed carefully, and only for
the tables that require them. If you do not intend to query
the table very often, then it does not need to have an index.
Essentially, you should attempt to save space on the disk,
and if you are creating an index for every table, you are going
to run out of room.
5. A full table scan happens when no usable index can be found
for that table. You can avoid this by creating an index
specifically for the columns you search on rather than
scanning the entire table (see the example after this list).
6. Take precautions when using equality operators, especially
when dealing with times, dates, and real numbers. There is a
possibility that differences will occur, but you are not
necessarily going to notice these differences right away.
Equality operators make it almost impossible to get exact
matches in your queries.
7. Pattern matching can be used, but use it sparingly.
8. Look at how your search is structured, as well as the script
that is being used with your table. You can manipulate the
script of your table to have a faster response time, as long as
you change everything about the table and not just part of it.
9. Searches will be performed regularly on the SQL. Stick to
the standard procedures that work with a large group of
statements rather than small ones. The procedures have been
put into place by the database before you even get the
chance to use it. The database is not like the search engine
though; the procedure is not going to be optimized before
your command is executed.
10. The OR operator should not be used unless necessary.
This operator will slow your searches.
11. Remove any indexes you no longer need so that larger
batches of data can be loaded efficiently. Think of a
historical table with millions of rows: you would probably
need multiple indexes to cover the entire table, which takes
up space on the disk. While indexes get you the information
you want faster, during a large batch load every index has to
be maintained and will slow the system down because it is
now in the way.
12. Batch loads require that you use the commit
function. You should commit after each batch of new
records has been constructed.
13. Databases have to be defragmented at least once a
week to ensure everything is working properly.
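As a quick illustration of the indexing advice above, an index is created on just the column you actually filter on and dropped again once it is no longer needed (here using the CUSTOMER table from earlier in the book):
-- Speeds up queries that filter on the NAMES column
CREATE INDEX IX_CUSTOMER_NAMES ON CUSTOMER (NAMES);
-- Remove it once it is no longer useful, to save disk space
DROP INDEX IX_CUSTOMER_NAMES ON CUSTOMER;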
Here, Status value 1 shows that trace flag 1222 is on. The 1 in the Global
column implies that the trace flag has been turned on globally.
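For reference, the trace flag can be switched on globally and its status checked with commands like these:
DBCC TRACEON (1222, -1);  -- -1 turns the trace flag on for all sessions
DBCC TRACESTATUS (1222);  -- shows the Status and Global columns mentioned above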
Now, try to generate a deadlock by following the steps that we performed in
the last section. The detailed deadlock information will be logged in the error
log. To view the SQL Server error log, you need to execute the following
stored procedure:
EXECUTE sp_readerrorlog
The above stored procedure will retrieve a detailed error log. A snippet of this
is shown below:
Your error log might be different depending upon the databases on your
server. The information about the deadlocks in your database starts with the
log text "deadlock-list." You may need to scroll down a bit to find this row.
Let’s now analyze the log information that is retrieved by the deadlock that
we just created. Note that your values will be different for each column, but
the information remains the same.
ProcessInfo Text
spid13s deadlock-list
spid13s deadlock victim=process1fcf9514ca8
spid13s process-list
spid13s process id=process1fcf9514ca8 taskpriority=0 logused=308 waitresource=KE
waittime=921 ownerId=388813 transactionname=transactionBlasttranstarted=
XDES=0x1fcf8454490 lockMode=X schedulerid=3 kpid=1968 status=suspen
trancount=2 lastbatchstarted=2019-05-27T15:51:54.380 lastbatchcompleted=2
lastattention=1900-01-01T00:00:00.377 clientapp=Microsoft SQL Server Ma
hostname=DESKTOP-GLQ5VRA hostpid=968 loginname=DESKTOP-GLQ
(2) xactid=388813 currentdb=8 lockTimeout=4294967295 clientoption1=671
spid13s executionStack
spid13s frame procname=adhoc line=2 stmtstart=58 stmtend=164
sqlhandle=0x0200000014b61731ad79b1eec6740c98aab3ab91bd31af4d00000
spid13s unknown
spid13s inputbuf
spid13s UPDATE tableA SET patient_name = 'Thomas - TransactionB'
spid13s WHERE id = 1
spid13s inputbuf
spid13s UPDATE tableB SET patient_name = 'Helene - TransactionA'
spid13s WHERE id = 1
spid13s resource-list
spid13s keylockhobtid=72057594043105280 dbid=8
objectname=dldb.dbo.tableAindexname=PK__tableA__3213E83F1C2C4D64
associatedObjectId=72057594043105280
spid13s owner-list
spid13s owner id=process1fcf9515468 mode=X
spid13s waiter-list
spid13s waiter id=process1fcf9514ca8 mode=X requestType=wait
spid13s keylockhobtid=72057594043170816 dbid=8
objectname=dldb.dbo.tableBindexname=PK__tableB__3213E83FFE08D6AB
associatedObjectId=72057594043170816
spid13s owner-list
spid13s owner id=process1fcf9514ca8 mode=X
spid13s waiter-list
spid13s waiter id=process1fcf9515468 mode=X requestType=wait
The deadlock information logged by the SQL server error log has three main
parts.
2. Process List
The process list is the list of all the processes involved in a deadlock. In the
deadlock that we generated, two processes were involved. In the processes
list you can see details of both of these processes. The ID of the first process
is highlighted in red, and the ID of the second process is highlighted in green.
Notice that, in the process list, the first process is the process that has been
selected as deadlock victim, too.
Apart from the process ID, you can also see other information about the
processes. For instance, you can find login information of the process, the
isolation level of the process, and more. You can even see the script that the
process was trying to run. For instance, if you look at the first process in the
process list, you will find that it was trying to update the patient_name
column of tableA when the deadlock occurred.
3. Resource List
The resource list contains information about the resources involved in the
deadlock. In our example, tableA and tableB were the only two resources
involved. The tables are highlighted in blue in the resource list of the log
shown in the table above.
● Don't allow users to interact with the application while transactions are
being executed.
Chapter 14 - Functions: Udfs, SVF, ITVF, MSTVF,
Aggregate, System, CLR
Since there is a return value, a function is used inside an expression and,
unlike a stored procedure, it is not invoked with an EXECUTE statement. A
function can call another function, nesting can go up to 32 levels, and a
function is invoked along with its schema name, e.g., dbo. The name of a
function can be up to 128 characters. Inside a function, a DML operation
(insert, update, delete) cannot be performed, unlike in a stored procedure. A
scalar UDF example is outlined below:
USE IMS
GO
--
=============================================
-- Author: Neal Gupta
-- Create date: 10/01/2013
-- Description: Get total of customers
--
=============================================
FROM [IMS].[dbo].[TblProduct]
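The body of the function above is not shown in full in this extract; a minimal sketch in the same spirit (the function name is an assumption), returning a total count from the product table and then being used inside an expression, would be:
CREATE FUNCTION [dbo].[svfIMSGetTotalProducts] () -- hypothetical name
RETURNS INT
AS
BEGIN
DECLARE @Total INT
SELECT @Total = COUNT(*)
FROM [IMS].[dbo].[TblProduct]
RETURN @Total
END
GO
-- Because it returns a value, it is used inside an expression:
SELECT dbo.svfIMSGetTotalProducts() AS TotalProducts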
INLINE TABLE-VALUED FUNCTION (I-TVF)
This is a UDF that returns the TABLE data type, a set of rows, similar to the
set of data returned by a view; however, a TVF provides much more capability
than a view, which is limited to one SELECT statement. A TVF can contain
logic and multiple statements over one or more tables. A TVF can replace a
stored procedure, and TVFs are invoked using SELECT, unlike a stored
procedure that needs to be executed. For example, if we want to get all the
orders for the products, an inline TVF is created below:
USE IMS
GO
-- =============================================
-- Description: Get orders for products
--=============================================
CREATE FUNCTION [dbo].[tvfIMSGetProductOrders] ()
RETURNS TABLE
AS
RETURN
SELECT * FROM [IMS].[dbo].[TblOrder]
You can run following SQL statement to call above inline TVF:
SELECT * FROM [dbo].[tvfIMSGetProductOrders]()
Note that in the above inline TVF, no input parameter was passed. However,
if we want to get all the orders for a particular product ID, we can use the
following inline TVF:
USE IMS
GO
-- =============================================
-- Description: Get orders for a product
-- =============================================
CREATE FUNCTION [dbo].[tvfIMSGetOrdersByProductID]
(@ProductID INT)
RETURNS TABLE
AS
RETURN
SELECT * FROM [IMS].[dbo].[TblOrder]
WHERE ProductID = @ProductID -- assumes TblOrder stores the product ID in a ProductID column
-- =============================================
-- Description: Get order details for a product
-- =============================================
CREATE FUNCTION [dbo].[tvfIMSGetOrderProductDetails]
(@ProductID INT)
RETURNS @OrderProductDetails TABLE
(
OrderID INT NOT NULL,
OrderDate DATETIME,
ProductID INT,
Name VARCHAR(50),
Price DECIMAL(9,2)
)
AS
BEGIN
-- The body is not shown in the original text; the following INSERT/RETURN is an
-- assumed completion that joins the order and product tables (column names assumed).
INSERT INTO @OrderProductDetails (OrderID, OrderDate, ProductID, Name, Price)
SELECT o.OrderID, o.OrderDate, p.ProductID, p.Name, p.Price
FROM [IMS].[dbo].[TblOrder] o
JOIN [IMS].[dbo].[TblProduct] p ON p.ProductID = o.ProductID
WHERE p.ProductID = @ProductID
RETURN
END
AGGREGATE FUNCTION
These functions give you summarized information such as the average salary,
the total count of products, the minimum and the maximum. Aggregate
functions are actually part of the system functions (mentioned next), and
some of them are listed below:
COUNT(): counts rows. COUNT(*) counts all rows, COUNT(ALL column)
counts the non-NULL values in a column, and COUNT(DISTINCT column)
counts only the distinct values.
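For example, run against the TblProduct table used elsewhere in this book, a single query can return several aggregates at once:
SELECT COUNT(*) AS ProductCount,
MIN(Price) AS LowestPrice,
MAX(Price) AS HighestPrice,
AVG(Price) AS AveragePrice
FROM [IMS].[dbo].[TblProduct]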
SYSTEM FUNCTION
These are built-in functions inside SQL Server that help in executing
various operations related to date and time (GETDATE, GETUTCDATE, DAY,
MONTH, YEAR, DATEADD, DATEPART, ISDATE), security (USER,
USER_ID, USER_NAME, IS_MEMBER), string manipulation
(CHARINDEX, RTRIM, LTRIM, SUBSTRING, LOWER, UPPER, LEN),
mathematics (ABS, COS, SIN, SQUARE, PI), etc.
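A few of these can be tried directly in a single SELECT, for example:
SELECT GETDATE() AS RightNow,
DATEADD(DAY, 7, GETDATE()) AS OneWeekFromNow,
UPPER('sql server') AS UpperCased,
LEN('sql server') AS StringLength,
ABS(-42) AS AbsoluteValue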
CLR FUNCTION
Common Language Runtime (CLR) function is created inside a DLL, similar
to CLR stored procedure.
Similar to stored procedures, you can create a scalar or table-valued UDF with
schemabinding or encryption using the following options:
SCHEMABINDING AND ENCRYPTION UDF
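The original example is not reproduced in this extract; a minimal sketch of a scalar UDF created with these options (the name and body are invented) would be:
CREATE FUNCTION [dbo].[svfIMSGetFixedDiscount] ()
RETURNS DECIMAL(9,2)
WITH SCHEMABINDING, ENCRYPTION
AS
BEGIN
-- SCHEMABINDING ties the UDF to the objects it references; ENCRYPTION hides its definition
RETURN 10.00
END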
If you want to modify an existing function, ALTER FUNCTION is used
which keeps underlying security privileges for the UDF intact. For deleting a
UDF, DROP FUNCTION is used.
Chapter 15 - Triggers: Dml, Ddl, After, Instead Of,
Db, Server, Logon
A trigger is a special type of stored procedure that is executed, invoked or
fired automatically when a certain database event occurs, such as a DML
(insert, update, delete) operation. However, a trigger cannot be passed any
values and does not return a value. A trigger is fired either before or after a
database event and is associated with a table or a view. A trigger can also be
used to check the integrity of data before or after an event occurs and can roll
back a transaction. There are 3 categories of triggers:
DML Triggers: Fire on DML operations (insert, update, delete)
USE IMS
GO
INSERT INTO [IMS].[dbo].[TblProduct] ([Name],[Description],
[Manufacturer],[QtyAvailable],[Price])
VALUES ('UHDTV', 'Ultra-Hi Smart 3D TV', 'LG', 1, 50000.00)
--
=============================================
-- Description: Create INSTEAD OF trigger for delete
--
=============================================
CREATE TRIGGER trgIMSTblProductInsteadOfDelete
ON [IMS].[dbo].[TblProduct]
INSTEAD OF DELETE
AS
BEGIN
SET NOCOUNT ON;
DECLARE @ProductID INT
SELECT @ProductID = (SELECT ProductID FROM DELETED)
IF (@ProductID > 0)
BEGIN
DELETE FROM [IMS].[dbo].[TblProduct]
WHERE ProductID = @ProductID
END
ELSE
BEGIN
PRINT 'Error: ' + CAST(@@ERROR AS VARCHAR(10))
END
PRINT 'INSTEAD OF TRIGGER Invoked: trgIMSTblProductInsteadOfDelete'
PRINT CAST(@ProductID AS VARCHAR(5)) + ' ProductID deleted from TblProduct table'
END
GO
-- Run Delete SQL Query to fire above INSTEAD OF trigger
DELETE FROM [IMS].[dbo].[TblProduct]
WHERE Name = 'UHDTV'
Once you run the above DELETE statement, you will see the following
message in the Messages tab in Management Studio.
Note that a TRUNCATE statement does not fire a delete trigger, since it does
not perform individual row deletions.
DDL Triggers (DDL-TR)
These triggers are fired after a DDL statement is executed on a database,
e.g., CREATE, ALTER, DROP, GRANT, REVOKE, or after a server-related
event. The example below shows a database-scoped DDL trigger:
USE IMS
GO
--
=============================================
-- Description: Create DDL Trigger (DS-DDL-TR)
--
=============================================
CREATE TRIGGER trgIMSDatabaseTrigger1
ON DATABASE
FOR ALTER_TABLE, DROP_TABLE
AS
BEGIN
PRINT 'DLL TRIGGER Fired: ' + 'trgIMSDatabaseTrigger1'
-- Perform a SQL operation:
ROLLBACK
END
The above DDL trigger fires once you run any ALTER TABLE or DROP
TABLE command. The trigger can be found in Object Explorer under
Instance -> Databases -> IMS -> Programmability -> Database Triggers.
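Because this trigger rolls back every ALTER TABLE and DROP TABLE, you will want to disable or drop it once you are done experimenting, for example:
DISABLE TRIGGER trgIMSDatabaseTrigger1 ON DATABASE;
-- or remove it completely
DROP TRIGGER trgIMSDatabaseTrigger1 ON DATABASE;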
--
=============================================
-- Description: Create DDL Trigger (SS-DDL-TR)
--
=============================================
CREATE TRIGGER trgIMSServerTrigger1
ON ALL SERVER
FOR CREATE_DATABASE
AS
BEGIN
PRINT 'SERVER DLL TRIGGER Fired: ' +
'trgIMSServerTrigger1'
END
Once you run a CREATE DATABASE statement, the above trigger fires.
These server triggers are located in Object Explorer under SQL Server
Instance -> Server Objects -> Triggers.
C. CLR DLL Triggers : These triggers are defined in an
outside routine using some programming language like
C#.NET and compiled into a DLL file. This DLL or
assembly file is registered inside SQL Server. These are
typically used for some specialized purpose.
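The book's logon trigger is not reproduced in this extract; a minimal sketch of one, which simply writes a message whenever someone connects, could look like this:
CREATE TRIGGER trgIMSLogonTrigger1
ON ALL SERVER
FOR LOGON
AS
BEGIN
PRINT 'LOGON TRIGGER Fired: trgIMSLogonTrigger1'
END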
If you log in to the SQL Server instance, the logon trigger fires and the
message is recorded in the SQL Server Logs under Management.
A logon trigger can be used for auditing and tracking purposes, or even for
restricting access for certain logins or limiting session counts.
MULTIPLE TRIGGERS (MUL-TR)
A trigger fired by an insert, update or delete DML operation on a table can
itself cause another trigger on that table to fire. Multiple triggers can be fired
by DML, DDL or even LOGON database events.
There are two other types of triggers, recursive and nested, which allow a
maximum of 32 levels; however, there are only rare circumstances where
these will be required.
Chapter 16 - Select Into Table Creation &
Population
ID FullName PersonType
38 Kim Abercrombie EM
43 Nancy Anderson EM
67 Jay Adams EM
121 Pilar Ackerman EM
207 Greg Alderson EM
211 Hazem Abolrous EM
216 Sean Alexander EM
217 Zainal Arifin EM
227 Gary Altman EM
270 François Ajenstat EM
USE AdventureWorks2012;
-- Create a copy of table in a different schema, same name
-- The WHERE clause predicate with >=, < comparison is better performing
than the YEAR function
SELECT *
INTO dbo.SalesOrderHeader
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20080101' AND OrderDate < '20090101'; --
YEAR(OrderDate)=2008
-- (13951 row(s) affected)
-- Create a table without population
SELECT TOP (0) SalesOrderID,
OrderDate
INTO SOH
FROM Sales.SalesOrderHeader;
-- (0 row(s) affected)
-- SELECT INTO cannot be used to target an existing table
SELECT * INTO SOH FROM Sales.SalesOrderHeader;
/* Msg 2714, Level 16, State 6, Line 1
There is already an object named 'SOH' in the database. */
NOTE
IDENTITY column is automatically populated. Direct insert into
IDENTITY column requires using of SET IDENTITY_INSERT.
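A brief illustration of that note, using the SOH table created above (the inserted values are made up):
SET IDENTITY_INSERT dbo.SOH ON;
INSERT INTO dbo.SOH (SalesOrderID, OrderDate)
VALUES (90001, '20080615');
SET IDENTITY_INSERT dbo.SOH OFF;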
Year Orders
2008 13951
2007 12443
2006 3692
2005 1379
TotalOrders
31465
Name Age
NULL NULL
Roger Bond 45
Name Age
Roger Bond 45
USE AdventureWorks2012;
DROP TABLE tempdb.dbo.Product;
GO
-- The following construct will prevent IDENTITY inheritance
SELECT TOP (0) CAST(ProductID AS INT) AS ProductID, --
Cast/Convert the identity column
ProductNumber,
ListPrice,
Color
INTO tempdb.dbo.Product FROM
AdventureWorks2012.Production.Product;
-- (0 row(s) affected)
INSERT tempdb.dbo.Product (ProductID, ProductNumber, ListPrice, Color)
SELECT 20001, 'FERRARI007RED', $400000, 'Firehouse Red';
GO
SELECT * FROM tempdb.dbo.Product;
ID FullName Email
1075 Diane Glimp diane0@adventure-works.com
15739 Jesse Mitchell jesse36@adventure-works.com
5405 Jose Patterson jose33@adventure-works.com
1029 Wanida Benshoof wanida0@adventure-works.com
8634 Andrea Collins andrea26@adventure-works.com
ID FullName Email
9984 Sydney Clark sydney81@adventure-works.com
15448 Denise Raman denise13@adventure-works.com
12442 Carson Jenkins carson5@adventure-works.com
1082 Mary Baker mary1@adventure-works.com
18728 Emma Kelly emma46@adventure-works.com
-- Cleanup
DROP TABLE BOM400;
More References
If you missed where I mentioned my free blog earlier in the book, don’t
worry! Here’s a link to The SQL Vault so that you can follow my latest
experiences and they’ll provide something for you to learn as well!
Chapter 17 - Data Visualizations
The final topic that we are going to spend some time learning about in this
guidebook is how we are able to handle some of our data visualizations. This
is where we are going to be able to figure out the best way to present the data
to those who need it the most. Often the data scientist and the person who is
going to need to use the information for their own needs are not going to be
the same people. A company will need to use that data in order to help them
to make some good decisions, but they may not have the technical resources
and knowledge in order to create the algorithms and get it all set up on their
own.
This is why they will often hire a specialist who is able to help them with the
steps of the data science project. This is a great thing that ensures they are
able to work with the data in order to make some smart decisions along the
way. But then the data scientist has to make sure the client is able to read the
information. These algorithms can produce some pretty technical output that
is hard to understand if you do not know how to work with it.
This means that the data scientist has to be able to go through and find a way
in order to share the information in a manner that the person who will use it is
able to understand. There are a number of ways that we are able to do this,
but we must remember that one of the best ways to do this is through the help
of data visualization.
Sure, we can go through all of this and try to write it all up in a report or on a
spreadsheet and hope that this is going to work. And this is not a bad method
to work with. But it is going to be boring and harder to read through. It
takes a lot more time for us to read through this kind of information and hope
that we are going to find what we need. It is possible, but it is not as easy.
For most people, working with a visual is going to be so much easier than
trying to look through a lot of text. These visuals give us a way to just glance
at some of the information and figure out what is there. When we are able to
look at two parts of our data side by side in a chart or a graph, we are going
to be able to see what information is there and make decisions on that a
whole lot faster than we are able to do with just reading a few pages of
comparisons on a text document.
Picking out the right kind of visual that you will want to work with is going
to be so important to this process. You have to make sure that we are picking
out a visual that works for the kind of data that you want to be able to show
off to others. If you go with the wrong kind of graph, then you are going to
end up with a ton of trouble. The visuals are important and can show us a lot
of information, but they are not going to be all that helpful if you are not even
able to read through them at all or if they don’t showcase the information all
that well in the first place.
Often when we take a look at a visual and all of the information that is there,
we are able to see a ton of information in a short amount of time. Something
that could take ten pages of a report could be done in a simple chart that takes
a few minutes to glance at and understand. And when you are able to use a
few of these visuals along the way, you are going to find that it is much
easier to understand what is there.
This doesn’t mean that we can’t work with some of the basics that are there
with the reports and more. The person who is taking a look at the information
and trying to make some smart decisions about it will find that it is really
useful for them to see some of the backgrounds about your information as
well. They need to be able to see how the data was collected, what sources
were used, and more. And this is something that you are able to put inside of
your data and text as well.
There is always a lot of use for a report of this kind, but we need to make sure
that it is more of a backup to some of the other things that you have been able
to do. If this is all that you have, then it is going to be really hard for you to
work with some of this, and it can get boring to figure out what information
is present in the data or what you learned about in your analysis.
The good news here is that there are a ton of different types of visuals that
you are able to work with. This variety is going to help you to really see some
good results with the data because you can make sure that you are able to find
the visual that works for any kind of data that you are working with. There
are options like histograms, pie charts, bar graphs, line graphs, scatterplots,
and more.
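To make this concrete, here is a minimal sketch in Python, assuming the matplotlib library is available; the region names and sales figures are made-up values used only to show a bar graph and a scatterplot side by side.

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]   # hypothetical categories
units_sold = [120, 95, 150, 80]                # hypothetical values
ad_spend = [10, 8, 14, 6]                      # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(regions, units_sold)                   # bar graph compares categories
ax1.set_title("Units sold by region")

ax2.scatter(ad_spend, units_sold)              # scatterplot looks for a relationship
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Units sold")
ax2.set_title("Ad spend vs. units sold")

plt.tight_layout()
plt.show()

Glancing at the bar graph immediately shows which region sold the most, and the scatterplot hints at whether spending and sales move together, which is exactly the kind of quick comparison described above.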
Before you end your project, it is a good idea to figure out what kind of
visuals you would like to work with. This is going to ensure that you are able
to pick out the visual that will match with the data, and with the results, that
you have gotten, and this will ensure that we are going to be able to really see
the information that you need to sort through.
There are many options that you are able to work with as you need. You can
choose to pick out the one that is the best for you, and maybe even try a few
of these to figure out which one is going to pack the biggest punch and can
help you to get things done. Make sure to check what your data is telling you,
and learn a bit more about the different visuals that are there and how you are
able to work with them.
With this in mind, we need to take a look at what is going to make a good data
visualization. These are going to be created when design, data science, and
communication are able to come together. Data visuals, when they are done
right, are going to offer us some key insights into data sets that are more
complicated, and they do this in a way that is more intuitive and meaningful
than before. This is why they are often the best way to take a look at some of
the more complicated ideas out there.
In order to call something a good data visualization, you have to start out
with data that is clean, complete, and well-sourced. Once the data is set up
and ready to visualize, you need to pick the right chart to work with. This is
sometimes a challenge to work with, but you will find that there are a variety
of resources out there that you can choose to work with, and which will help
you pick out the right chart type for your needs.
Once you have a chance to decide which of these charts is the best, it is time
to go through and design, as well as customize, the visuals the way that
you would like. Remember that simplicity is going to be key. You do not
want to have so many elements in it that they distract from the true message
that you are trying to convey with the visual in the first place.
There are many reasons why we would want to work with these data visuals
in the first place. The number one reason is that they can help us to make some
better decisions. Today, more than ever before, we are going to see that
companies are using data tools and visuals in order to ask better questions
and to make some better decisions. Some of the emerging computer
technologies and other software programs have made it easier to learn as
much as possible about your company, and this can help us to make some
better decisions that are driven by data.
The strong emphasis that there is right now on performance metrics, KPIs,
and data dashboards is easily able to show us some of the importance that
comes with monitoring and measuring the company data. Common
quantitative information measured by businesses will include the product or
units sold, the amount of revenue that is done each quarter, the expenses of
the department, the statistics on the employees, the market share of the
company and more.
These are also going to help us out with some meaningful storytelling as
well. These visuals are going to be a very big tool for the mainstream media
as well. Data journalism is already something that is on the rise, and many
journalists are going to rely on really good visual tools in order to make it
easier to tell their stories, no matter where they are in the world. And many of
the biggest and most well-known institutions are already embracing all of this
and using these visuals on a regular basis.
You will also find that marketers are going to be able to benefit from these
visuals. Marketers are going to benefit from the combination of quality data
and some emotional storytelling that is going on as well. Some of the best
marketers out there are able to make decisions that are driven by data each
day, but then they have to switch things around and use a different approach
with their customers.
The customer doesn’t want to be treated like they are dumb, but they also
don’t want to have all of the data and facts thrown at them all of the
time. This is why a marketer needs to be able to reach the customer both
intelligently as well as emotionally. Data visuals are going to make it easier
for marketers to share their message with statistics, as well as with the heart.
Those are just a few of the examples of how we are able to work with the
idea of data visuals for your needs. There are so many times when we are
able to complete a data visual, and then use it along with some of the other
work that we have been doing with data analysis to ensure that it provides us
with some more context on what is going on with our work.
Being able to not only read but to understand these data visuals has become a
necessary requirement for the modern business world. Because these tools
and the resources that come with them are readily available now, even
professionals who are non-technical need to be able to look through
this data and make sense of what is there.
Increasing data literacy for professionals, no matter what their role in the
company is, is going to be a very big mission to undertake from the very
beginning. This is something that your company needs to learn how to focus
on, because it is really going to end up benefiting everyone who is involved
in the process as well. With the right kind of data education and some good
support, we can make sure that everyone not only can read this information,
but is more informed and able to use that data to help them make some good
decisions overall. All of this can be done simply by being able to read
through these visuals.
Chapter 18 - Python Debugging
Like most computer programming languages, Python provides debugging
facilities that help you produce reliable programs. The debugger lets you run
an application under its control and pause it at breakpoints you choose, and it
gives you interactive access to the source code of a Python program while it
is running. Other activities that go hand in hand with debugging are unit
testing, integration testing, analysis of log files and log flows, and
system-level monitoring.
Running a program within a debugger involves several tools working
together, depending on the command line and IDE systems being used. The
development of more sophisticated computer programs has significantly
contributed to the expansion of debugging tools. These tools offer various
methods for detecting abnormalities in Python programs, evaluating their
impact, and planning updates and patches to correct the emerging problems.
In some cases, debugging tools may also help programmers develop new
programs by eliminating code and Unicode faults.
Debugging
Debugging is the technique used in detecting and providing solutions to
defects or problems within a specific computer program. The term
‘debugging’ is often credited to Grace Hopper, who was working on the
Mark II computer at Harvard University in the 1940s. Her team found a moth
trapped between the relays that was hindering the computer's operation, and
removing it was recorded as ‘debugging’ the system. Although Thomas
Edison had already used the term ‘bug’ in 1878, debugging only became
popular in the early 1950s, when programmers adopted it to refer to fixing
computer programs.
By the 1960s, debugging had gained popularity among computer users and
was the most common term used to describe solutions to major computing
problems. As the world has become more digitalized, with ever more
challenging programs, debugging has come to cover a significant scope, and
words like errors, bugs, and defects are often replaced by more neutral ones
such as anomaly and discrepancy. These neutral terms are themselves being
assessed to determine whether their description of computing problems is
cost-effective for the system or whether further changes should be made. The
assessment tries to find a practical term for computer problems that retains
the meaning without encouraging end-users to dismiss the faults.
Anti-Debugging
Anti-debugging is the opposite of debugging and encompasses the
implementation of different techniques to prevent debugging or reverse
engineering of computer code. The approach is used, for example, by
developers in copy-protection schemes, as well as by malware that tries to
identify and block debugging. Anti-debugging is, therefore, the complete
opposite of debugger tools, which are aimed at detecting and removing the
errors that occasionally appear during Python programming.
Some of the conventional techniques used are:
API-based
Exception-based
Modified code
Determining and penalizing debugger
Hardware-and register-based
Timing and latency
Breakpoints
When you run a program in a Python package, the code will usually execute
from the first line and run continuously until it either succeeds or hits an
error. However, a bug may occur in a specific function or section of the
program even though that code did not look wrong when it was written, and
the problem may only become noticeable once the program starts. At this
point, breakpoints become useful. A breakpoint tells the debugger where the
problem area is, halts program execution at that spot, and lets you make the
necessary corrections. This concept, therefore, enables you to produce good
Python programs within a short time.
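As a minimal sketch of this idea, the snippet below pauses a small, made-up function with Python's built-in pdb debugger; the function name and the numbers are only illustrations.

import pdb

def compute_total(prices):
    total = 0
    for price in prices:
        pdb.set_trace()   # execution halts here each time through the loop;
                          # on Python 3.7+ the built-in breakpoint() does the same
        total += price
    return total

print(compute_total([3, 5, 7]))

When the program stops at the breakpoint, you get a (Pdb) prompt where you can inspect variables such as total and price before letting execution continue.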
Stepping
Stepping is another concept that works together with debugging tools to make
programs more efficient. Stepping through a Python program is the act of
moving through the code line by line to find the lines with defects, as well as
any other mistakes that need attention before execution. Stepping through
code happens as step in, step over, and step out. Step in executes the next line
and, if that line calls a function, moves into that function so you can debug it.
Step over executes the next line in the current function without entering any
functions it calls, and then lets you continue running the program. Step out
runs the rest of the current function and stops when it returns, completing that
code before execution continues.
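At the (Pdb) prompt of the example above, the three kinds of stepping map onto short commands; this is only a quick sketch, not a full command reference.

# s (step)     - execute the current line and stop inside any function it calls
# n (next)     - execute the current line without entering called functions
# r (return)   - keep running until the current function is about to return
# c (continue) - resume normal execution until the next breakpoint is hit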
Function Verification
When writing code in a program, it is vital to keep track of the state of
each piece of code, especially calculations and variables. Similarly, function
calls may stack up, which is why it helps to understand the calling sequence
and how each function affects the next one. Likewise, it is recommended to
step into the nested code first so that you develop a sequential approach of
executing and checking the innermost code before the rest.
Processes of Debugging
Problem Reproduction
The primary function of a debugger is to detect and eliminate problems
affecting programming processes. The first step in the debugging process is
to try to identify and reproduce the existing problem, whether it is a failure in
a nontrivial function or a rarer software bug. Debugging focuses primarily on
the immediate state of your program, noting the bugs present at the time.
Reproduction is typically affected by the computer's usage history and the
immediate environment, which can influence the end results.
Debugging Techniques
Like other programming languages, Python relies on debugging techniques to
identify and eliminate bugs. Some of the standard methods are interactive,
print, remote, postmortem, algorithmic, and delta debugging; the techniques
differ mainly in how they locate and remove the bugs. For instance, print
debugging entails monitoring and tracing bugs by printing out values as the
program runs.
Remote debugging removes bugs from a program running on a different
machine from the debugger tool, while postmortem debugging identifies and
eliminates bugs from programs that have already crashed. To this end,
learning the different types of debugging helps you decide which one to use
when tracking down Python programming problems. Other techniques are
the Saff squeeze, which isolates a failure by progressively inlining parts of a
failing test, and causality tracking, which is essential for tracing causal agents
in a computation.
Debuggers Tools
Python debuggers may be special-purpose or general-purpose, depending on
the platform used, that is, depending on the operating system. Some of the
all-purpose debuggers are pdb and PdbRcldea, while multipurpose ones
include pudb, Winpdb, Epdb2, epdb, JpyDbg, pydb, trepan2, and
Pythonpydebug. On the other hand, specific debuggers include gdb, DDD,
Xpdb, and the HAP Python Remote Debugger. All of the above debugging
tools operate on different parts of a Python program, with some used during
installation, program creation, remote debugging, thread debugging, and
graphical debugging, among others.
IDEs Tools
Integrated Development Environments (IDEs) are among the best Python
debugging tools, as they suit big projects well. Although the tools vary
between IDEs, the core features remain the same: executing code, analyzing
variables, and creating breakpoints. The most common and widely used IDE
for Python debugging is PyCharm, which provides a complete set of
features, including plugins, for maximizing the performance of Python
programs. Other IDE debugging tools are also great and readily available
today, including Komodo IDE, Thonny, PyScripter, PyDev, Visual Studio
Code, and Wing IDE, among others.
Special-Purpose Tools
Special-purpose debugging tools are essential for detecting and eliminating
bugs in particular parts of a Python program, often ones involving remote
processes. These debugging tools are most useful when tracing problems in
sensitive and remote areas that other debuggers are unlikely to reach. Some
of the most commonly used special-purpose debugging tools are FirePython,
used in Firefox as a Python logger, manhole, PyConquer, pyringe, hunter,
icecream, and PySnooper. This subdivision of debugging tools enables
programmers to quickly identify hidden and unnoticed bugs and display them
for elimination from the system.
Debugger Commands
With debugging being a common feature of the programming language, there
are several commands used for moving between the various operations. The
basic commands are the most essential for beginners and can usually be
abbreviated to one or more letters. A blank space must separate a command
from its arguments, and optional arguments are shown enclosed in square
brackets; the brackets themselves are never typed, and alternative forms are
separated by a vertical bar. Anything the debugger does not recognize as a
command is assumed to be a Python statement and is executed within the
context of the program being debugged.
To run a Python statement explicitly, even one that looks like a command,
you prefix it with an exclamation mark; this makes it possible to change
variables as well as make function calls. Several commands may also be
entered on the same line, separated by ‘;;’, with inputs spaced separately from
other code. The debugger also works with aliases, which let you define short
names for longer commands in the same context. Besides, the debugger reads
commands from a .pdbrc file in the current directory when it starts, so that
commonly used settings do not have to be retyped at the prompt.
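A brief sketch of these conventions as typed at the (Pdb) prompt follows; the variable name total is made up for illustration.

# (Pdb) p total            # p(rint) an expression; most commands can be abbreviated
# (Pdb) !total = 0         # '!' runs the rest of the line as a Python statement,
#                          #   so you can change variables or call functions
# (Pdb) n ;; p total       # ';;' separates two commands entered on one line
# (Pdb) alias pt p total   # 'alias' defines a short name for a longer command
# Commands like these can also be placed in a .pdbrc file in the current
# directory, and pdb reads them automatically when the debugger starts.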
Conclusion
Whenever someone goes into a website, data is taken from their computer
and sent through the site. You are able to take this same data and place it in
the database for SQL. However, it is risky to do this due to the fact that you
will be leaving yourself open to what is known as SQL injection, which can
end up wiping out all of the hard work that you have put into your SQL
script.
An injection typically occurs at the point in time that you ask a user to place
some type of data into a prompt box before they are able to continue.
However, you will not necessarily get the information that you want. Instead,
you could end up getting a statement that runs through your database, and
you won’t know that this has occurred.
Users cannot be trusted to give you the data that you are requesting, so you
need to make sure that the data they enter is validated before it is sent to
your database. This is going to help secure your database from
any SQL injection statements that may occur. Most of the time, you will use
pattern matching to check the data before you decide to send it to your main
database.
Your function calls are used when you try to pull a particular record out of
the database from the requested table, working with the title of that row. The
data used in the query should match what you expect to receive from the
user, so that you are able to keep your database safe from SQL injection
statements.
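A minimal sketch of this idea in Python's built-in sqlite3 module is shown below; the table, column, and input values are made up, and the point is simply the difference between pasting user input into the SQL text and passing it as a bound parameter.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database for the example
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('Alice', 'alice@example.com')")

user_input = "Alice' OR '1'='1"      # a typical injection attempt

# Unsafe: the input becomes part of the SQL text itself
unsafe_sql = "SELECT email FROM users WHERE name = '" + user_input + "'"
print(conn.execute(unsafe_sql).fetchall())   # returns rows it should not

# Safer: the input is passed as a parameter and never parsed as SQL
safe_sql = "SELECT email FROM users WHERE name = ?"
print(conn.execute(safe_sql, (user_input,)).fetchall())   # returns nothing

Checking or binding the input in this way is what keeps an injected statement from ever being executed by the database.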
With MySQL, queries are not allowed to be stacked into a single function
call. This helps keep calls from failing because of stacked queries, and it also
limits what an injected statement can do.
Extensions such as SQLite, however, do allow for your queries to be stacked
as you do your searches in that one string. This is where safety issues come
into play with your script and your database.
STATISTICS
FOR BEGINNERS:
FUNDAMENTALS OF PROBABILITY AND
STATISTICS FOR DATA SCIENCE AND
BUSINESS APPLICATIONS, MADE EASY FOR
YOU
Matt Foster
Data can come in many forms. It might be in the form of location data created
from cell phone pings, a listing of all the YouTube videos you have ever
watched, or all the books you have purchased on Amazon. Often, it is
desirable to integrate different types of data into a single coherent picture.
Data might be used in real time or could be analyzed later to find hidden
patterns.
In this chapter, we will explore the general types or classes of big data. As we
will see, big data can come in the form of structured or unstructured data.
Moreover, it can come from different sources. Understanding the types of big
data will be important for getting a full understanding of how big data is
processed and used.
Structured Data
Structured data is the kind of data you would expect to find in a database. It
can include stored items such as dates, names, account numbers, and so forth.
Data scientists can often access structured data using SQL. Large amounts of
structured data have been collected over decades.
Structured data can be human-generated, such as people entering payment
information when ordering a product, or it could be data entered manually by
people working at a company. If you apply for a loan and fill out an online
form, this is human-generated data, which is also structured data. This data
would include an entry that could be put in a database with name, social
security number, address, place of employment, and so on.
In today’s world, structured data is also computer-generated without the
involvement of any people. When data is generated by computer systems, it
might be of a different character than that described above, but it can still be
structured data. For example, if your cell phone company was tracking you,
it could create data points containing your GPS coordinates, together with the
date and time. Additional information like your name or the customer
identifier used by the cell phone company could also be included.
Other structured data can come from website tracking. As you are using your
computer, your activity could be tracked, and the URL, date, and time could
be recorded and stored as structured data.
Traditionally, structured data has been stored in relational databases and
accessed using a computer language paired with SQL. However, these tools
are in the midst of an evolving process as they adapt to the world of big data.
The reason things are changing is that many types of data, drawn from
different sources, are finding their way together into the same bits of
structured data.
For those who have little familiarity with relational databases, you can think
of an entry in a database having different fields. We can stick to the example
of an application for a loan as an example. It will have first and last name
fields with pre-determined character lengths. The first name field might be
ten characters and the last name field might be twenty characters. We are just
providing these values as examples; whoever designs the database will make
them long enough to be able to record data from most names.
When collecting information for a financial application, date of birth and
social security number will be collected. These will be given specific formats
in the database, with a date field and a character field that is eleven characters
wide to collect the social security number.
We could go on describing all the fields, but I think you get the point of how
the data is structured. With structured data, specific pieces of information
collected, and the formats of the information, are pre-defined. Each data point
collected is called a field, and every element in the database will have the
same fields, even if the person neglects to fill out some of the data.
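As a minimal sketch, the loan-application fields described above could look like this as a relational table, here built with Python's built-in sqlite3 module; the field names, lengths, and values are only the examples from the text, not a real schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE loan_applications (
        first_name    VARCHAR(10),
        last_name     VARCHAR(20),
        date_of_birth DATE,
        ssn           CHAR(11),    -- eleven characters, e.g. 000-00-0000
        address       TEXT,
        employer      TEXT
    )
""")
conn.execute(
    "INSERT INTO loan_applications VALUES (?, ?, ?, ?, ?, ?)",
    ("Jane", "Doe", "1985-03-14", "000-00-0000", "12 Main St", "Acme Corp"),
)
for row in conn.execute("SELECT first_name, last_name FROM loan_applications"):
    print(row)

Every row in the table has exactly the same pre-defined fields, which is what makes the data structured.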
Batch processing of structured data can be managed using Hadoop.
Unstructured Data
A lot of big data is classified as unstructured data. This encompasses a wide
variety of data that comes from many sources. One example of unstructured
data is spam email. Machine learning systems have been developed to
analyze email to estimate whether it is spam. The data in this case is the text
included in the message, the subject line, and possibly the email address and
sending information for the message. While there are certain common
phrases used in spam emails, someone can type an email with any text they
please, so there is no structure at all to the data. Think about this in terms of a
database. As we mentioned above, a database has fields that are specific data
types and sizes, and structured data will include specific items collected with
the data.
Another example of unstructured data could be text messages. They are of
varied length and may contain different kinds of information. Not only could
a person enter in numerical or alphabetic/textual information, but images,
emojis, and even videos can be included. Any randomly selected text
message may have one or all these elements or some value in between. There
is no specific structure to the data, unlike an entry in a relational database.
Similar to text messages, posting on social media sites is unstructured data.
One person might type a plain text message, while someone else might type a
text message and include an image. Someone else might include many emojis
in their message, and another posting might include a video.
Often, unstructured data is analyzed to extract structured data. This can be
done with text messages or postings on social media sites to glean
information about people’s behaviors.
There are many kinds of unstructured data. For example, photographs and
surveillance data—which includes reams of video—are examples of
unstructured data.
Semi-Structured Data
Data can also be classified as semi-structured. This is data that can have
structured and unstructured elements together.
Storing Data
As mentioned earlier, structured data is stored in relational databases. In the
1990s, this was the primary storage mechanism of big data, before large
amounts of unstructured data began to be collected.
Unstructured data is not necessarily amenable to storage in a relational
database and is often stored in a non-relational store such as a graph database.
Companies also use content management systems, known in the business as
CMSs, to store unstructured data. Although CMSs are not formally structured
like a relational database, they can be searched in real time.
Chapter 2 - Predictive Analytics Techniques (I.E.,
Regression Techniques and Machine Learning
Techniques)
Random Forests
The next type of learning algorithm that you are able to work with is the
random forest. There are a lot of times when the decision tree is going to
work out well for you, but there are times when you may want to make this a
bit different, and the random forest is going to be the right option for you.
One time, when you would want to work with a random forest, is when you
would like to work with some task that can take your data and explore it, like
dealing with any of the values in the set of data that is missing or if you
would like to be able to handle any of the outliers to that data set.
This is one of the times when you are going to want to choose the random
forest rather than the decision tree, and it helps to know when to use each of
these different learning algorithms. Some examples of when a programmer
would want to work with a random forest include:
• When you are working on your training sets, the objects inside each set
are drawn at random, with replacement, so the same object can appear
more than once if the random tree needs it.
• If there are M input variables, a number m < M is specified from the
beginning and held constant. The reason that this is so important is that
each tree then picks, at each split, its own random subset of m variables
out of the M.
• The goal of each of your random trees will be to find the best split on
those m variables.
• As the trees grow, they keep getting as big as they possibly can.
Remember that these random trees are not going to prune themselves.
The forest that is created from these random trees can be great because it is
much better at predicting certain outcomes. It is able to do this because it
takes the prediction from each of the trees that you create and then selects the
average for regression, or the consensus vote for classification.
These random forests are going to be the tool that you want to use many
times with the various parts of data science, and this makes them very
advantageous compared to the other options. First, these algorithms are able
to handle any kind of problem that you are focusing on, both the regression
and classification problems. Most of the other learning algorithms that you
will encounter in this guidebook are only able to handle one type of problem
rather than all of them.
Another benefit of these random forests is that they are going to help you
handle large amounts of data. If your business has a lot of different points
that you want to go through and organize, then the random forest is one of the
algorithms that you need to at least consider.
There is a limitation that comes with using random forests though, which is
why you will not be able to use it with all of the problems that you want to
take on. For example, this can work with regression problems like we talked
about before, but they are not going to be able to make any kind of prediction
that goes past the range that you add to your training data. You will be able to
get some predictions, of course, but these predictions will end up becoming
limited. It will stop at the ranges that you provide, lowering the amount of
accuracy that is found there.
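A minimal sketch of a random forest, assuming the scikit-learn library is installed, is shown below; the data set is synthetic and generated only for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Make a small synthetic classification problem
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a random bootstrap sample of the training set
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("accuracy:", forest.score(X_test, y_test))
print("first predictions:", forest.predict(X_test[:5]))

The forest's answer for each test point is the consensus of its many trees, which is exactly the averaging behavior described above.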
KNN Algorithm
Next on the list of learning algorithms that we are going to take a look at is
the K-nearest neighbors, or KNN, algorithm. This is one that is used a lot in
supervised machine learning, so it is worth our time to take a look at it here.
When you work with the KNN algorithm, you are going to use it to help take
a lot of data and search through it. The goal is to have k-most similar
examples for any data instance that you would like to work with. When you
get this all organized in the proper manner, this KNN algorithm will be able
to take a look through that set of data, and then it will summarize the results
before using these to make the right predictions that you need.
A lot of businesses will use this kind of model in order to help them become
more competitive with the kind of learning that they are able to do in their
industry. This works because a few elements in this model will compete
against each other, and the elements that end up winning are the ones that
shape the prediction that works the best for you.
Compared to the other two learning algorithms that are out there, this one is
going to be a bit different. In fact, some programmers are going to see this as
one of the lazier learning processes because it is not able to really create any
models unless you go through and ask it to do a new prediction. This is a
good thing for some projects if you would like to keep the information in the
models relevant or have more say in what you are adding to the models, but
in other situations, it is not going to be all that helpful.
There are a lot of benefits of working with the KNN learning algorithm. For
example, when you choose to use this kind of algorithm, you can learn how
to cut out the noise that sometimes shows up inside the set of data. The
reason that this works is that it is going to work solely with the method of
competition to help sort through all of the data in the hopes of finding the
stuff that is the most desirable. This algorithm is useful because it can take in
a lot of data, even larger amounts, at the same time which can be useful in a
lot of different situations.
However, there are a few drawbacks to consider when it comes to this
algorithm. The biggest issue is that the computational cost is high, especially
when you compare it to some of the other learning algorithms. This is
because KNN looks through all of the points before it gives you a prediction.
This takes a lot of time and resources overall, so it may not be the one that
you want to use.
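A minimal sketch of KNN, again assuming scikit-learn is installed, is shown below; it uses the small built-in iris data set purely as an illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5: each prediction looks at the five most similar training examples
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)   # "lazy" learning: fitting mostly just stores the data

print("accuracy:", knn.score(X_test, y_test))

Because every prediction searches through the stored training points, the cost grows with the size of the data, which is the computational drawback mentioned above.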
Regression Algorithms
Next on the list are the regression algorithms. You will be able to use these to
investigate the relationship between a dependent variable and the predictor
variables that you would like to use. You will find that this is the method a
programmer will want to work with any time there is a causal relationship
between the variables in the forecasting and time-series modeling that you do.
You will want to work with these regression algorithms any time that you
want to take all of the different points in your set of data and you want it to fit
onto a line or a curve as closely as possible. This helps you to really see if
there are some factors that are common between these data points so that you
can learn about the data and maybe make some predictions as well.
Many programmers and companies are going to use this kind of regression
algorithm in order to help them make great predictions that then help the
business to grow, along with their profits. You will be able to use it in order
to figure out a good estimation of the growth in sales that the company is
looking for, while still being able to base it on how the conditions of the
economy in the market are doing now and how they will do in the future.
The neat thing about these kinds of learning algorithms is that you are able to
place in any kind of information that seems to be pertinent for your needs.
You are able to add in some information about the economy, both how it has
acted in the present and in the past so that this learning algorithm is able to
figure out what may happen to your business in the future. The information
that you add to this needs to be up to date and easy to read through, or this
algorithm could run into some issues.
Let’s take a look at an example of how this can work. If you go with the
regression algorithm and find that your company is growing near or at the
same rate that other industries have been doing in this kind of economy, then
it is possible to take that new information and use it to make some predictions
about how your company will do in the future based on whether the economy
goes up or down or even stays the same.
There is more than one learning algorithm that you are able to work with
when we explore these regression algorithms, and you will have to take a
look at some of the benefits and the differences between them all to figure
out which one is right for you. There are a lot of options when it comes to
picking out an algorithm that you would like to use, but some of the most
common of these will include:
1. Stepwise regression
2. Logistic regression
3. Linear regression
4. Ridge regression
5. Polynomial regression
Any time that you decide to work with one of these learning algorithms, you
are going to be able to see quickly whether or not there is a relationship
between your dependent and independent variables, as well as what that
relationship is all about. This kind of algorithm is going to be there because it
shows the company the impact that they have to deal with if they try to add or
change the variables in the data. This allows for some experimentation so that
you can see what changes are going to work the best for you and which ones
don’t.
There are going to be a few negatives and shortcomings that you have to
work with in the regression algorithms. The first one is that you can only use
these on regression problems (as the name suggests) and not on classification
problems. Another is that this kind of algorithm can spend too much time
overfitting the data that you have, which makes the process tedious, so it is
best if you are able to avoid that where you can.
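A minimal sketch of a linear regression, assuming scikit-learn and NumPy are installed, is shown below; the "economy index" and "sales growth" numbers are made up purely to illustrate fitting a line and making a prediction.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
economy_index = rng.uniform(0, 10, size=(100, 1))                      # hypothetical predictor
sales_growth = 2.5 * economy_index[:, 0] + 4 + rng.normal(0, 1, 100)   # hypothetical response

model = LinearRegression().fit(economy_index, sales_growth)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predictions are most trustworthy inside the range the model was trained on
print("predicted growth at index 7:", model.predict([[7.0]])[0])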
Naïve Bayes
And finally, we are going to move on to the other supervised machine
learning method that we need to look at. This one is known as the Naïve
Bayes method, and it is going to be really useful in a lot of the different kinds
of programs that you want to create, especially if you are looking to
showcase your model to others, even those who don’t understand how all of
this is supposed to work.
To help us get a better understanding of how this learning algorithm is going
to work, we need to spend some time bringing out our imaginations a bit. For
this one, imagine that you are working on some program or problem that
needs classification. In this, you want to be able to come up with a new
hypothesis to go with it, and then you want to be able to design some new
features and discussions that are based on how important the variables in that
data are going to be.
Once all of the information is sorted out and you are ready to work on the
model that you want to use, enter the shareholders. These shareholders want
to know what is going on with the model and want to figure out what kinds
of predictions and results you are going to be able to get from your work.
This brings up the question: how are you going to be able to show all of the
information you are working on to the shareholders before the work is even
done? And how are you going to be able to do this in a way that is easy to
understand?
The good thing to consider with this one is that the Naïve Bayes algorithm is
going to be able to help you, even in the earliest stages of your model, so that
you can organize everything and show others what is going on. The learning
algorithm is going to be what you will need to use in order to do a
demonstration to show off your model, even when it is still found in one of
the earlier stages of development.
This may seem a bit confusing right now, so it is time to look at an example
with some apples to help explain how this works. Imagine that you go to the
store and grab an apple that looks pretty average to you. When you grab this
apple, you will be able to go through and state some of the features that
distinguish the apple from some of the other fruits that are out there. Maybe
you will say that it is about three inches round, that it is red, and that it has a
stem.
Yes, some of these features are going to be found in other types of fruit, but
the fact that all of them show up in the product at the same time means that
you have an apple instead of another type of product in your hand. This is a
simple way of thinking about an apple and figuring out how it is different
from some of the others out there, but it is a good example of what is going to
happen when you use the Naïve Bayes algorithm.
A programmer is likely to work with the Naïve Bayes model when they want
something that is easy to get started with, and when they have a lot of data,
or a large data set, that they want to be able to simplify a bit. One of the
biggest advantages of this kind of algorithm is that it is simple to use, and
even when you could do things with a more sophisticated method, it is often
still a good option to go with.
As you learn more about the Naïve Bayes algorithm, you will start to see
more and more reasons to work with it. This kind of model is easy to use and
is very effective when it comes to predicting the class of your test data, so it
becomes one of the best choices for anyone who would like to keep the
process simple or who is new to working with machine learning for the first
time. The neat thing here, though, is that even though this is a simple
algorithm to bring up, it can still be used in many of the same ways that
higher-class algorithms can.
Of course, just like with some of the other supervised learning algorithms that
you would like to work with, there are going to be some negatives that show
up along the way. First, when you work with categorical variables and the
test data contains a category that was never seen in the training data, the
model is not going to make good predictions for it, and the probability it
assigns will not be reliable either.
If you still want to use the Naïve Bayes algorithm despite some of these
issues, there are a few methods that you can work with to solve the problem.
The Laplace estimation is a good example of this. But the more methods you
add in, the more complications are going to show up, and that kind of defeats
the purpose of working with this algorithm. Keeping it simple and knowing
when you are able to use this algorithm will help you to get the results that
you want.
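A minimal sketch of a Naïve Bayes classifier, assuming scikit-learn is installed, is shown below; the feature counts and labels are made up, and the alpha parameter is the Laplace smoothing mentioned above.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Each row holds made-up counts of three features; each label is class 0 or 1
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2], [1, 0, 0], [0, 2, 3]])
y = np.array([0, 0, 1, 1, 0, 1])

# alpha=1.0 applies Laplace smoothing so unseen feature values do not force
# a probability of zero
model = MultinomialNB(alpha=1.0).fit(X, y)

print("predicted class:", model.predict([[2, 1, 0]])[0])
print("class probabilities:", model.predict_proba([[2, 1, 0]])[0])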
This is a good method to use, but realize that you will not pull it out all of the
time. If you have a lot of information that you would like to work on, and you
need to be able to take that information and show it off in a manner that is
simple and easy to understand, then this learning algorithm is going to be a
good option for you.
These are just a few of the different options that you are able to work with
when it comes to working with supervised machine learning. This is one of
the easiest types of machine learning that you are able to work with, and it is
going to prove to be really useful overall. Try out a few of these learning
algorithms and see how supervised machine learning works before we take
some time to move on to the other two types as well.
Chapter 4 - Measures of central tendency,
asymmetry, and variability
The best way to determine the future is to look at the past and that is exactly
what predictive analytics does. Predictive analytics works in much the same
way as a car insurance company does. You see, the company will look at a
set of facts, or data set, often, your age, gender, driving record, and the type
of car you drive.
By looking at this data set, they can use it to predict your chances of getting
into a car accident in the future. Therefore, they are able to determine if they
are willing to insure you and what rate you will pay for the insurance.
Predictive analysis uses a specific data set to determine if a pattern can be
found. If a pattern is found, this pattern is used to predict future trends. This
is much different than other techniques used to analyze the data sets because,
unlike predictive analytics, other techniques provide a business with
information about what has happened in the past.
Of course, knowing what has happened in the past when it comes to
understanding data is very important, but the majority of business people
would agree that what is more important is understanding what is going to
happen in the future.
Data Governance
Data Governance refers to data integration processes that focus on privacy,
security, risk, and compliance. However, many businesses have expanded
Data Governance to also cover quality, standards, architecture, and many
other issues on data. The team working on Data Governance could help data
scientists to get a single view of business goals that are relevant to data and
align their work properly. Meanwhile, the change management process of
Data Integration can enable Data Integration specialists to think of possible
solutions to increase data value.
Data Stewardship
Data Stewardship is designed for managing quality of data by identifying and
prioritizing quality of work according to the needs of the business and certain
parameters such as technological capacity and budget. The person who is in
charge of the data, also known as the data steward, should work together with
business and technical people. Through the years, data integration specialists
have incorporated stewardship into their array of strategies for better
credibility in the alignment and prioritization of data integration work.
Data Replication
Data replication, also known as data synchronization, is another data
integration system that can help add value to the business. For instance, data
replication may build a complete view of a central data hub for access by
several users and applications. This is seen in central hubs for product data,
customer data, and master data. Replication may also enhance relevant data
across several applications and their databases. For instance, client-facing
applications for contact centers can be limited to a partial view of a customer,
unless a total view can be developed by replicating customer data across
these applications.
The business value of replication is that more business owners get a unified
view of a given entity, like finances, customers, and products. In addition,
data replication systems tend to move and integrate data more often, usually
several times a day. This improves the freshness, or currency, of the data in
applications. Hence, data is not just complete but also up to date, which is
crucial for businesses that need current data for their decision making.
Python provides basic methods and functions for manipulating files by
default.
The open Function
Before you can write or read a file, you need to open it using the open()
function, which creates the file object that is necessary for calling the other
methods related to it. It has the following syntax:
file_object = open(file_name [, access_mode] [, buffering])
The following are the parameter details:
file_name: a string containing the name of the file you want to access.
access_mode: the mode in which the file is opened, such as 'r' for reading,
'w' for writing, or 'a' for appending; the default is 'r'.
buffering: an optional value that controls how the file is buffered.
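A minimal sketch of open() with these parameters is shown below; the file name sample.txt is made up for illustration.

# Write a small file, then read it back
fo = open("sample.txt", "w")        # file_name and access_mode
fo.write("Python file handling\n")
fo.close()

fo = open("sample.txt", "r")
print(fo.read())
print("Name of the file:", fo.name)
print("Opening mode:", fo.mode)
fo.close()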
Data Mining
Now let’s become more acquainted with data mining. Data mining is
something of a jargon word, and it has a lot in common with some of the
things we’ve been discussing. The first thing that data mining involves is
large datasets. In other words, here we have big data yet again, but that is
only at first appearance. In fact, part of data mining is “mining” the data:
finding smaller subsets within the large datasets that are useful for the
analytical purposes at hand.
Another thing data mining involves is recognizing hidden patterns that exist
in these large datasets. Thus, we are back to the tasks that are carried out with
machine learning, although this isn’t explicitly specified when discussing
data mining. Data mining attempts to classify and categorize data so that it is
more useful to the organization.
So, we start with raw data which is basically useless. Data mining helps
convert that data into something that can provide value as far as the
information that it contains. A part of data mining is going to be selecting the
data that you want to use. Data warehousing is an important foundation upon
which data mining is based. Companies need to be able to store and access
data in large amounts which is why data warehousing with effective solutions
that are fast and accurate is important. Then the data must go through a
cleansing process. That is because, when you have huge amounts of data, one
of the problems that you’re going to encounter is that data is often going to
be corrupted or missing. This is something that is very common when it
comes to relational databases, but it can also happen when you’re storing
huge amounts of unstructured data.
After the data has been gathered, extracted, and cleansed, the process of data
mining moves on to look for the patterns needed to gather useful information
from the data. Once this is done, the data can be used in many ways by a
business. For example, it could be used for sales analysis or for customer
management and service. Data mining has also been used for fraud
detection. There is much overlap between data mining and other activities
involving big data, such as machine learning. When it comes to data mining,
you’re going to see a lot of statistical analysis.
Business intelligence and data mining are both involved in the process of
converting raw data into actionable information for the business. However,
the goal of business intelligence is to present data in meaningful ways so that
management can make data-driven decisions. In contrast, data mining is used
to find solutions to existing problems.
If you remember when we talked about big data, one of the things that was
important was volume. Business intelligence is certainly driven by large
datasets. However, data mining is different in this respect. Relevant data is
going to be extracted from the raw data to be used in data mining. Therefore,
relatively speaking, data mining is going to be working with smaller subsets
of the data that is available. This is one characteristic that is going to separate
data mining from the other topics that we have talked about so far. Data
mining might be used as a part of an overall strategy of business intelligence.
So, what management is looking for from data mining is solutions that can be
applied to business intelligence. This contrasts with business intelligence on
its own, as it is usually used to present data to people.
So, the core result obtained from data mining is knowledge. This is in the
form of a solution that can be applied within business intelligence. This
provides a big advantage to business and operations. That is because the
findings from data mining can be applied rapidly within business intelligence.
Data mining is also a tool within business intelligence that allows business
intelligence to take complex data and present it in understandable forms that
are useful for the people in the organization. The data extracted with data
mining can be presented in readable reports or in graphical formats containing
graphs and charts. In this form, it becomes a part of business intelligence, so
that the people in the organization can understand it, interpret the data better,
and make actionable decisions based on that data.
The volume of data coming to large businesses is only growing with time.
This makes both data mining and business intelligence more important to the
organization as the onslaught of information continues to pour in. It is going
to be important to cull the data in terms of saliency; this is where data mining
plays a role. The data is always changing, making this task even more
important. Demand for data mining and business intelligence solutions will
be increased in proportion to the growth of the volume of data.
For companies to remain competitive and - especially if they want to be a
market leader - they are going to have to utilize data mining and business
intelligence solutions for retaining their advantages.
Data Analytics
Data is not useful if you cannot draw conclusions from it. Data analytics is a
process of organizing and examining datasets for the purpose of extracting
useful and actionable information from the data. Data analytics plays a role
in business intelligence, using tools like OLAP for reporting and analytical
processing. When done effectively, data analytics can help a business
become more competitive and efficient, build better and more targeted
marketing campaigns, improve customer service, and meet the goals that are
a part of business intelligence. Data analytics can be applied to any data that
an organization has access to, including internal and external sources of data.
It can use old data or even real-time data to provide more readable
information that can be accessed by employees in the organization in an
effective way to help them make actionable decisions.
While data analytics can be used as a part of business intelligence efforts, like
machine learning, data analytics can be used for predictive modeling, which
is not part of business intelligence. Typically, BI is used for an informed
decision-making process based on analytics of past data. Data analytics uses
past data but can apply it with predictive analytics to help the company use
modeling and tools to determine future directions of various efforts that can
help the company maintain its edge and advance even further.
Data analytics will also be used in many ways that are like processing data
with machine learning. That is, it will be useful for pattern recognition,
prediction, and cluster analysis. Data analytics is also an important part of
the data mining process.
Much of the new data will be generated from gamification, which in business
is not only an effective tool for marketing campaigns but can also
revolutionize the manner in which organizations communicate with their
audiences. It will also create valuable Big Data that could enrich the
databases of businesses.
Gamification refers to the use of game elements in non-game contexts. This
could be used to communicate with customers and enhance marketing efforts
that could lead to more revenue.
Gamification is also often used within the organization to improve
employee productivity and crowdsourcing initiatives. Ultimately,
gamification could also change consumer behavior. The quantified-self
movement is an ideal example of the integration between Big Data and
gamification.
The elements that are usually tapped in gamification are challenges,
leaderboards, avatars, badges, points, awards, and levels. Furthermore,
gamification can also be used to learn something, to achieve something, and
to stimulate personal success.
The objective is to enhance real-life experiences and make people more
willing to perform something. However, gamification is not all about gaming,
but merely the application of gaming elements in a different context.
Various aspects of gamification offer a lot of data that can be analyzed. The
business can easily compare the performance of users and understand why
some groups are performing better than other groups. When customers log in
through the social graph, a lot of public data can be added to provide context
around the data from gamification.
Aside from the various elements that offer directly accessible insights,
gamification can also help in understanding consumer behavior and
performance: for instance, how long do various groups take to finish a
challenge, or how do they use specific services or products? Gamification
data can then be used to enhance your offerings.
Gamification can also be used to motivate people to act and to encourage
them to share the right data in the right context. As a matter of fact,
gamification should be considered a catalyst for sharing. The higher the user
engagement, the greater the chance that users will share. This can lead to
more attention for the company as well as more valuable information.
Using gamification for your big data strategy will largely depend on the
speed and quality of the information that is returned to the user. Users will be
more involved if the content is also better. Big Data can also be used to
personalize content. Buying behavior, the time needed to do specific tasks,
and engagement levels could be integrated with public data like posts or
tweets as well as user profiles.
This will provide your business with a lot of valuable insights if the data has
been stored, analyzed and visualized. But, users are now expecting immediate
results and feedback. Hence, real-time data processing is quite crucial.
A few years from now, gamification will become more integrated with how
consumers access and consume data. This will result in more data generation.
With Big Data, businesses will also need to learn how and why their
consumers behave in the context of gamification, and this will provide more
insight into how their consumers behave in real life.
This information is quite valuable for marketing and sales departments trying
to reach out to potential consumers with the right message, in the right
context, and with perfect timing.
Business organizations should create the ideal design for their gamification
strategy to gain the desired insights and results. Based on a report by Gartner,
80% of gamification solutions may not deliver the intended results because of
flaws in the design. Remember that, as with Big Data, a flawed design will
only result in flawed data and poor insights.
Chapter 12 - Introduction To PHP
PHP is an acronym for Hypertext Preprocessor. The language is a server-side,
HTML-embedded scripting language. For beginners, it can be hard to
understand that statement, so let me break it down. When I say the language
is server-side, I mean the execution of the scripts takes place on the server
where the website is hosted. By HTML-embedded, I mean PHP code can be
used inside HTML code. Finally, a scripting language is a programming
language that is interpreted instead of being compiled like C++ and C.
Examples of scripting languages include JavaScript, Python, Perl, and Ruby.
You can use the PHP language on several platforms, including UNIX, Linux,
and Windows, and it supports many databases, including Oracle, Sybase,
MySQL, and others. Furthermore, PHP files contain scripts, HTML tags, and
plain text, with extensions such as .php3, .php, or .phtml. Finally, the
software is an open-source program, which is free.
PHP Syntax
When I started, I indicated that PHP code is executed on the server-side.
Every block of PHP code begins with <?php and ends with ?>.
Let us begin with a simple program. You can copy and paste the program
below using any text editor before saving it with the file name index1.php.
I named the file “index1.php” because the web server's root folder usually
already contains a file named index.
<html>
<head>
</head>
<body>
<?php
/* This line contains a comment
Which span to
several lines */
//This comment is a line comment
//echo prints the statement onto the screen
echo "Hello World, Welcome to PHP Programming!";
?>
</body>
</html>
In PHP, a variable is declared with a dollar sign in front of its name; for
example, one variable might hold the number 280 while a second, string
variable holds the text "PHP Programming".
It is important to note that every statement in PHP ends with a semicolon.
You will get an error whenever you don’t include a semicolon to indicate the
end of a statement.
Variable Rules in PHP
A variable name always begins with an underscore (_) or a
letter
A variable name must not include spaces
A variable name can only contain alpha-numeric characters
and underscores
String Variables
String variables are important, especially if you want to manipulate and store
text in your program. The code below assigns the text "Welcome to PHP
Programming" to the variable $beginner and prints the content to the
screen.
<?php
$beginner = 'Welcome to PHP Programming';
echo $beginner;
?>
Output: Welcome to PHP Programming
Strlen () function
Perhaps you want to determine the string length of a word or sentence, the
strlen function is what you need. Consider the example below.
<?php
echo strlen(‘‘Today is the best day of your life. Programming is a
lifelong skill and PHP is all your need’’);
?>
The outcome will be the length of the string, counting letters, spaces, and
punctuation. In this case, the result will be 92.
Logical Operators
Operator   Description   Example
!          not           p=9; q=9; !(p==q) returns false
&&         and           p=9; q=9; (p < 10 && q > 1) returns true
||         or            p=9; q=9; (p == 9 || q == 5) returns true
Arithmetic Operators
Operator   Description                     Example       Result
+          Addition                        a=8; a+5      13
–          Subtraction                     a=17; 20-a    3
/          Division                        a=40; 40/2    20
*          Multiplication                  a=7; a*5      35
++         Increment                       a=9; a++      a=10
--         Decrement                       a=14; a--     a=13
%          Modulus (division remainder)    56%6          2
Comparison Operator
Operator   Description                    Example
==         is equal to                    48==49 returns false
!=         is not equal                   48!=49 returns true
<          is less than                   48<49 returns true
<=         is less than or equal to       48<=49 returns true
<>         is not equal                   48<>49 returns true
>          is greater than                48>49 returns false
>=         is greater than or equal to    48>=49 returns false
If Statement
The if statement executes a block of code as long as the condition stated is
true. Consider the example below.
<?php
$number = 23;
if ($number == 23)
    echo "Wake up! Time to begin Your Programming lesson.";
?>
In the statement above, we first assign the value 23 to the variable $number.
The if statement then checks whether $number is equal to 23; since the
condition is true, it prints:
Wake up! Time to begin Your Programming lesson.
Here is another example, this time using if and else:
<?php
$decision1='Donut';
if($decision1 == 'Donut') {
echo 'Buy Donut when coming';
} else {
echo 'Buy Pizza when coming';
}
?>
The output:
Buy Donut when coming
Switch Statement
The statement allows you to change the course of the program flow. It is best
suited when you want to perform various actions on different conditions.
Consider the example below.
<html>
<body>
<?php
$a=2;
switch ($a)
{
case 1:
echo 'The number is 10';
break;
case 2:
echo 'The number is 20';
break;
case 3:
echo 'The number is 30';
break;
default:
echo 'There is no number that match';
}
?>
</body>
</html>
Output:
The number is 20
Explanation
In the example above, we declare the variable $a to be 2. The switch statement
contains several blocks of code, each introduced by case or default. If the
value of a case is equal to the value of $a, the statement within that case is
executed and the break then ends the switch. If no case matches the variable,
the default block is executed instead. Here, case 2 matches, so the program
prints "The number is 20".
Conclusion
The PHP language isn't restricted to professional web developers alone. You
don't have to be an IT professional to learn it. Like any scripting language,
it may seem complicated at first; however, if you persevere, you will discover
it is an interesting language to learn. Learning PHP programming is a perfect
way to understand the server-side world. Writing PHP code is not intimidating
if you start from the foundation, as we have done here. PHP is one of the
languages you don't need anyone to teach you, as long as you are ready to
learn. In this chapter, you have learned how to get your environment ready,
along with variables, conditional statements, and much more.
Chapter 13 - Python Programming Language
Introduction
The Python language is one of the easiest and most straightforward
object-oriented languages to learn. Its syntax is simple, making it easy for
beginners to pick up and understand. In this chapter, I will cover several
aspects of the Python programming language. This guide is aimed at beginners
who want to learn a new language; however, if you are an advanced programmer,
you will also learn something.
Guido van Rossum created the Python language, and its implementation began in
1989. You might think it was named after the python snake; however, it was
actually named after a comedy television show called "Monty Python's Flying
Circus."
Features of Python
There are certain features that make the Python programming language stand out
among other programming languages, including its simple, readable syntax, its
interpreted nature, and its large collection of libraries.
Uses of Python
Most beginners, before choosing to learn a programming language, first
consider what that language is used for. There are various applications of the
Python language in real-world situations. These include:
Data Analysis – You can use python to develop data analysis and
visualization in the form of charts
Game development – Today, the game industry is a huge market
that yields billions of dollars per year. It may interest you to
know that you can use python to develop interesting games.
Machine learning – We have various machine learning
applications that are written using the python language. For
instance, products recommendation in websites such as eBay,
Flipkart, Amazon, etc. uses a machine-learning algorithm, which
recognizes the user’s interest. Another area of machine learning
is a voice and facial recognition on your phone.
Web development – You didn't see this coming? Well, web
frameworks such as Flask and Django are based on the Python
language. With Python, you can write backend programming
logic, manage databases, map URLs, etc.
Embedded applications – You can use Python to develop embedded applications
Launching PyCharm
For Windows users, after installing the .exe file, you will see the PyCharm
icon on the desktop, depending on the options you selected during installation.
You can also go to Program Files > JetBrains > PyCharm2017 and look for the
PyCharm.exe file to launch PyCharm.
Let me use a real example to explain both the single and multiple line
comment.
'''
Sample program to illustrate multiple line comment
Pay close attention
'''
print("We are making progress")
# Second print statement
print("Do you agree?")
print("Python Programming for Beginners") # Third print statement
Output:
We are making progress
Do you agree?
Python Programming for Beginners
Python Variables
We use variables to store data in programming. Variable creation is very
simple to implement in Python. In python, you have to declare the variable
name and value together. For instance
number1 = 140  # number1 is of integer type
word = "Beginner"  # word is of string type
Multiple Assignment
You can also assign a single value to several variables in one expression in
Python. Consider the example below:
Profit = returns = yields = 35
print (Profit)
print (yields)
print (returns)
Output
35
35
35
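The snippet that produced the next output is not reproduced in this copy; a
minimal sketch that would print it, with assumed variable names, looks like
this:
number2 = 112              # an integer variable (name assumed)
message = "Welcome Home"   # a string variable (name assumed)
print(number2)
print(message)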
Output
112
Welcome Home
Strings
A string is a series of characters enclosed in quotation marks. In Python, you
have the option of using single or double quotes to represent a string. There
are various ways of creating strings in Python.
Tuple
Tuple works like a list but the difference is that in a tuple, the objects are
unchangeable. The elements of a tuple are unchangeable once assigned.
However, in the case of a list, the element is changeable.
In order to create tuple in python, you have to place all the elements in a
parenthesis () with a comma separating it. Let me use an example to illustrate
tuple in python.
# tuple of strings
bioData = ("John", "M", "Lawson")
print(bioData)
# tuple of int, float, string
data_new = (1, 2.8, "John Lawson")
print(data_new)
# tuple of string and list
details = ("The Programmer", [1, 2, 3])
print(details)
# tuples inside another tuple
# nested tuple
details2 = ((2, 3, 4), (1, 2, "John"))
print(details2)
Output will be:
('John', 'M', 'Lawson')
(1, 2.8, 'John Lawson')
('The Programmer', [1, 2, 3])
((2, 3, 4), (1, 2, 'John'))
If Statement
The statement executes a block of code if a specific condition is satisfied.
The format, or syntax, is as follows:
if condition:
    statements
flag = True
if flag == True:
    print("Welcome")
    print("To")
    print("Python Programming")
Output
Welcome
To
Python Programming
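The example behind the next output is not shown in this copy; assuming it
simply tests a variable against 290 with a plain if statement, it would look
something like this:
number1 = 180
# print the message only when the condition holds
if number1 < 290:
    print("number1 is less than 290")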
Output
number1 is less than 290
If-else statement
In our previous example, we only test a single condition; what if you want to
handle both outcomes? That is where the "if-else statement" comes into play. In
Python, the statement executes one block of code if the condition is true and a
different block if it is not.
Syntax
if condition:
    statement1
else:
    statement2
Let us use our last example to illustrate this.
number1 = 180
if number1 > 290:
    print("number1 is greater than 290")
else:
    print("number1 is less than 290")
Output
number1 is less than 290
number1 = 15
if number1 % 2 == 0:
    print("The Number is an Even Number")
else:
    print("The Number is an Odd Number")
Output:
The Number is an Odd Number
Bonus Programs
# Program to display the Fibonacci sequence up to the number of terms the user wants
# For a different result, change the values
numb1 = 12
# uncomment to take input from the user
#numb1 = int(input("How many times? "))
# first two terms
a1 = 0
a2 = 1
count = 0
# Verify that the number of terms is valid
if numb1 <= 0:
    print("Please enter a positive integer")
elif numb1 == 1:
    print("Fibonacci sequence up to",numb1,":")
    print(a1)
else:
    print("Fibonacci sequence up to",numb1,":")
    while count < numb1:
        print(a1,end=' , ')
        nth = a1 + a2
        # update values
        a1 = a2
        a2 = nth
        count += 1
What do you think the output will be?
Fibonacci sequence up to 12 :
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
Chapter 14 - A brief look at Machine Learning
Machine learning is a data science field that comes from all the research done
into artificial intelligence. It has a close association to statistics and to
mathematical optimization, which provides the field with application
domains, theories and methods. We use machine learning more than we
realize, in applications and tasks where it isn’t possible for a rule-based
algorithm to be explicitly programmed.
Some of its applications are search engines, spam filters for email, computer
vision, and language translation. Very often, people will confuse machine
learning with data mining, although data mining focuses more on exploratory
data analysis while machine learning is mostly concerned with prediction.
Some of the terminology you will come across in this section is:
Features – distinctive traits used to define the outcome
Samples – the items being processed, e.g. an image, a document, an audio clip,
a CSV file, etc.
Feature Vector – the numerical features representing an object, i.e. an
n-dimensional vector
Feature Extraction – the process of building a feature vector, transforming
data from a high-dimensional to a low-dimensional space
Training Set – the data set in which potential predictive relationships are
discovered
Testing Set – the data set on which predictions are tested
Supervised Learning
A supervised machine learning algorithm will study given data and will
generate a function which can then be used for the prediction of new
instances. We’ll assume that we have training data comprised of a set of text
representing news articles related to all kinds of news categories. These
categories, such as sport, national, international, etc., will be our labels. From
the training data, we are going to derive some feature vectors; each word may
be a vector or we may derive certain vectors from the text. For example, we
could count a vector as how many times the word ‘football’ occurs.
The machine learning algorithm is given the labels and the feature vectors
and it will learn from that data. Once training is completed, the model is then
fed the new data; once again, we extract features and input them to the model
and the target data is generated.
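To make this concrete, here is a minimal sketch of that workflow using
scikit-learn; the tiny training set, the variable names, and the predicted
label are assumptions made purely for illustration.
# Supervised text classification: learn from labeled articles, then predict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["the team won the football match",
               "parliament passed the new budget",
               "the striker scored twice last night",
               "the minister announced an international treaty"]
train_labels = ["sport", "national", "sport", "international"]

vectorizer = CountVectorizer()              # turns text into word-count feature vectors
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()                     # a simple classifier for count features
model.fit(X_train, train_labels)

# New, unseen article: extract features the same way, then predict its label
X_new = vectorizer.transform(["a famous football player was injured"])
print(model.predict(X_new))                 # likely ['sport'] for this toy data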
Unsupervised Learning
Unsupervised learning is when unlabeled data is analyzed for hidden
structures. For our example, we will use images as the training data set and
the input dataset. The images are of the faces of insects, horses and a human
being; features will be extracted from them and these features identify which
group each image should go to. The features are given to the unsupervised
algorithm, which looks for any patterns; we can then use the algorithm on
new images that can be identified and put into the right group.
Some of the unsupervised machine learning algorithms that we will be
discussing are:
k-means clustering
Hierarchical clustering
Reinforcement Learning
With reinforcement learning, the data for input is a stimulus from the
environment the machine learning model needs to respond to and react to.
The feedback provided is more of a rewards and punishment system in the
environment rather than the teaching process we see in supervised learning.
The actions that the agent takes lead to an outcome that the agent can learn
from rather than being taught; the actions selected by the agent are based on
two things – past experience and new choices, meaning it learns by a system
of trial and error. The reinforcement signal is sent to the agent by way of a
numerical reward that contains an encoding of the success and the agent will
learn to take the actions that increase that reward each time.
Reinforcement learning is not used much in data science, more so in robotics
and the two main algorithms used are:
Temporal difference learning
Q learning
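Although reinforcement learning is outside the scope of this book, the
Q-learning update mentioned above can be sketched in a few lines; the states,
actions, reward, and learning parameters below are invented purely for
illustration.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))     # the agent's value table, learned by trial and error
alpha, gamma = 0.1, 0.9                 # learning rate and discount factor

state, action, reward, next_state = 0, 1, 1.0, 2    # one observed transition
# Move Q toward the reward plus the discounted value of the best next action
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
print(Q[state, action])                 # 0.1 after this single update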
Decision Trees
A decision tree is a predictive model that maps item outcomes to input data.
A popular technique, decision tree models generally fall under two types:
Classification tree – the dependent variable takes a finite value. The feature
rules are represented by branches leading to class labels, and the outcome
class labels are represented by leaves.
Regression tree – the dependent variable takes a continuous value.
As an example, we’ll use data that represents whether a person should play a
game of tennis, based on weather, wind intensity, and humidity:
Play   Wind   Humidity   Outlook
No     Low    High       Sunny
No     High   Normal     Rain
Yes    Low    High       Overcast
Yes    Weak   Normal     Rain
Yes    Low    Normal     Sunny
Yes    Low    Normal     Overcast
Yes    High   Normal     Sunny
If you were to use this data, the target variable being Play and the rest as
independent variables, you would get a decision tree model with a structure
like this:
[Decision tree diagram: the root node splits on Outlook (Sunny / Overcast /
Rain); the Sunny branch splits further on Humidity (High / Normal) and the Rain
branch splits on Wind (High / Weak).]
Now, when we get new data, it will traverse the tree to reach the conclusion,
which is the outcome.
Decision trees are very simple and have several advantages:
1. They are easy to communicate and to visualize
2. Odd patterns can be found. Let’s say that you were looking
for a voting pattern between two parties up for election and
your data includes income, education, gender, and age. You
might see a pattern whereby people with higher education
have low incomes and vote for a certain party.
3. Minimal assumptions are made on the data
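If you want to try this yourself, the sketch below fits a classification tree
to the small tennis dataset above with scikit-learn; the column names and the
one-hot encoding step are assumptions about how you might lay the data out.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Wind":     ["Low", "High", "Low", "Weak", "Low", "Low", "High"],
    "Humidity": ["High", "Normal", "High", "Normal", "Normal", "Normal", "Normal"],
    "Outlook":  ["Sunny", "Rain", "Overcast", "Rain", "Sunny", "Overcast", "Sunny"],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"],
})

X = pd.get_dummies(data[["Wind", "Humidity", "Outlook"]])   # one-hot encode the categories
y = data["Play"]

tree = DecisionTreeClassifier().fit(X, y)

# Classify a new day: Low wind, Normal humidity, Overcast sky
new_day = pd.get_dummies(pd.DataFrame(
    {"Wind": ["Low"], "Humidity": ["Normal"], "Outlook": ["Overcast"]}))
new_day = new_day.reindex(columns=X.columns, fill_value=0)  # align columns with training
print(tree.predict(new_day))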
Linear Regression
Linear regression is a modeling approach that models the linear relationship
between one or more independent variables X and a scalar dependent
variable y:
y = Xβ + ε
Let’s use an example to understand this. Below, you can see a list of student
heights and weights:
Height (inches)   Weight (pounds)
50                125
58                135
63                145
68                144
70                170
79                165
84                171
75                166
65                160
Put this data through a linear regression function (discussed later) using the
weight as the dependent variable of y and the height as the independent
variable of x and you would get this equation:
y = 1.405405405 x + 57.87687688
If that equation were plotted as a line with an intercept of 57.88 and a slope
of 1.4 over a scatter plot with Height on the x-axis and Weight on the y-axis,
you would see that the regression algorithm has found the equation that gives
the least error when predicting a student's weight from their height.
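You can reproduce the fitted line yourself with a least-squares fit; the sketch
below uses numpy's polyfit, and the coefficients it returns should come out
very close to the slope and intercept quoted above.
import numpy as np

height = np.array([50, 58, 63, 68, 70, 79, 84, 75, 65])            # independent variable x
weight = np.array([125, 135, 145, 144, 170, 165, 171, 166, 160])   # dependent variable y

slope, intercept = np.polyfit(height, weight, 1)    # degree-1 least-squares fit
print(slope, intercept)                             # roughly 1.405 and 57.88

# Predict the weight of a student who is 72 inches tall
print(slope * 72 + intercept)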
Logistic Regression
Another of the supervised learning techniques, logistic regression is classed
as a 'probabilistic classification model'. These tend to be used mostly for
predicting binary outcomes, such as whether a customer is going to move to a
competitor.
As the name indicates, a logistic function is used in logistic regression.
Logistic functions are useful because they take any value from negative to
positive infinity and output a value between 0 and 1, which means the output
can be interpreted as a probability. The logistic function below will generate
a predicted value between 0 and 1 based on the independent variable x:
F(x) = 1 / (1 + e^(-x))
Here x is the independent variable while F(x) is the dependent value.
If you were to try plotting the logistic function from negative to positive
infinity, the outcome would be an S graph (s-shaped).
We can apply logistic regression in these scenarios:
1. Deriving a score for the propensity of a retail customer
buying a brand new product
2. How likely it is that a transformer fails by using the data
from the sensor associated with it
3. How likely it is that a user clicks on an ad on a given
website based on previous user behavior
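As a rough sketch of the churn scenario above, the snippet below fits a
logistic regression to one invented feature (months since the customer's last
purchase) and outputs a probability; the data and numbers are made up for
illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

months_inactive = np.array([[1], [2], [3], [8], [10], [12]])
churned         = np.array([ 0,   0,   0,   1,    1,    1])

model = LogisticRegression().fit(months_inactive, churned)

# predict_proba returns [P(stays), P(churns)] for each customer
print(model.predict_proba([[6]]))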
Naive Bayes
Naive Bayes is a probabilistic classification technique built on Bayes'
theorem:
P(A|B) = P(B|A) P(A) / P(B)
Breaking this down:
A and B are both events
P(A) and P(B) are the probabilities of A and B independently of each other
P(A|B) is the probability of A, given that B is true (a conditional probability)
P(B|A) is the probability of B, given that A is true
The naïve Bayes classifier applies this theorem with the simplifying assumption
that the features are independent of each other, so for a class y and features
x1, ..., xn the naïve Bayes formula is:
P(y | x1, ..., xn) is proportional to P(y) × P(x1 | y) × ... × P(xn | y)
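A tiny worked example with invented numbers shows how the theorem is applied:
how likely is an email to be spam, given that it contains the word "free"?
p_spam = 0.2                  # P(A): prior probability that an email is spam
p_free_given_spam = 0.6       # P(B|A): "free" appears in 60% of spam
p_free = 0.25                 # P(B): "free" appears in 25% of all email

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)      # 0.48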
K-Means Clustering
This is an unsupervised learning technique used to partition a set of n
observations into K groups (clusters) of similar observations. It is known as a
clustering algorithm because it computes the mean of the features within each
group, and these means determine how the observations are clustered.
An example would be a segment of customers based on the average amount
they spend per transaction or the average amount of products they purchase
in one year. The mean value will then be the center of the cluster. K is
referring to how many clusters there are so the technique refers to the number
of clusters around the k number of means.
So, how is K chosen? If we knew what it was we were looking for or the
number of clusters we wanted or expected to see, we set K as this number
before the algorithm starts computing.
If we don't know the number, things take a bit longer and will require some
trial and error. For example, we might need to try K = 4, 5, and 6 until the
clusters start to make sense for the domain.
K-means clustering is used quite a lot in market segmentation, computer
vision, geostatistics, astronomy, and in agriculture. We’ll talk more about it
later.
Hierarchical Clustering
Another unsupervised learning technique, this involves observations being
used to build a hierarchy of clusters. Data is grouped at various levels of a
dendrogram or cluster tree. It is not one single cluster set, but a hierarchy
made up of multiple levels where clusters on one level are joined as clusters
on the next. This gives you the choice of working out what level of clustering
is right for you.
There are two fundamental types of hierarchical clustering:
Agglomerative hierarchical clustering – a bottom-up method, where each
observation begins in a cluster of its own and pairs of clusters are merged as
they rise through the hierarchy (see the sketch below)
Divisive hierarchical clustering – a top-down method, where observations start
in one cluster, which splits in two as they drop through the hierarchy
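A short sketch of the bottom-up (agglomerative) approach with SciPy is shown
below; the six two-dimensional points are invented so that they form two
obvious groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 2], [1, 3], [2, 2],       # one tight group
                   [8, 8], [8, 9], [9, 8]])      # another tight group

# 'ward' repeatedly merges the pair of clusters that increases within-cluster variance least
merge_tree = linkage(points, method="ward")

# Cut the hierarchy at the level that yields two clusters
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)        # e.g. [1 1 1 2 2 2]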
Chapter 15 - Python Crash Course
Before we dig deeper into data science and working with Python and Jupyter,
you should understand the basics of programming. If you already grasp the
concepts or you have some experience with programming in Python or any
other language, feel free to skip this chapter. However, even if you already
possess the basic knowledge, you might want to refresh your memory.
In this chapter we're going to discuss basic programming concepts and go
through simple examples that illustrate them. It is recommended that you put
into practice what you read as soon as possible, even if at first you use cheat
sheets. The goal here is to practice, because theory is not enough to solidify
what you learn.
For the purpose of this chapter, we will not use Jupyter or any other IDE that
is normally used when programming. All we need is a shell where we put our
code to the test and exercise. To do that, just head to Python’s main website
here https://www.python.org/shell/ and you’ll be able to try everything out
without installing anything on your computer.
Data Types
Understanding basic data types is essential to the aspiring data scientist.
Python has several in-built data types, and in this section we will discuss each
one of them. Remember to follow along with the exercises and try to use the
bits of knowledge you gain to come up with your own lines of code.
Here are the most important data types we will be discussing: numbers,
strings, dictionaries, lists, and tuples. Start up your Python shell and let’s
begin!
Numbers
In programming and mathematics, there are several types of numbers, and you
need to keep track of them when writing Python code. You have integers, floats,
complex numbers, and a few more. The ones you will use most often, however, are
integers and floats.
An integer (written as “int” in Python) is a positive or negative whole
number. That means that when you declare an integer, you cannot use a
number with decimal points. If you need to use decimals, however, you
declare a float.
In Python, there are several mathematical operators that you can use to make
various calculations using integers and floats. The arithmetic operators are for
adding (+), subtracting (-), multiplication (*), division (/), modulus (%), floor
division (//) and exponent (**). There are also comparison operators such as
greater than (>), less than (<), equal to (==), not equal to (!=), greater than or
equal to (>=) and less than or equal to (<=). These are the basic operators,
and they are included with any Python installation. There’s no need to install
a package or a module for them. Now let’s try a simple exercise to put some
of these operators in action.
x = 100
y = 25
print (x + y)
This simple operation will print the result of x + y. You can use any of the
other arithmetic operators this way. Play around with them and create
complex equations if you want to. The process is the same. Now let’s look at
an example of comparison operators:
x = 100
y = 25
print (x > 100)
The result you will see is "False" because the value we declared for x is not
greater than 100. Now let's move on to strings!
Strings
Strings are everything that is in text format. You can declare anything as
simple textual information, such as letters, numbers, or punctuation signs.
Keep in mind that numbers written as strings are not the same as numbers
used as variables. To write a string, simply type whatever you want to type
in-between quotation marks. Here’s an example
x = "10"
In this case, x is a string and not an integer.
So what are strings for? They are used frequently in programming, so let’s
see some of the basic operations in action. You can write code to determine
the character length of a line of text, to concatenate, or for iteration. Here’s an
example:
len("hello")
The result you get is 5, because the “len” function is used to return the length
of this string. The word “hello” is made of 5 characters, therefore the
calculation returns 5 to the console. Now let’s see how concatenation looks.
Type the following instruction:
'my' + 'stubborn' + 'cat'
The result will be mystubborncat, without any spaces in between the words.
Why? Because we didn’t add any spaces inside the strings. A space is
considered as a character. Try writing it like this:
'my ' + 'stubborn ' + 'cat'
Now the result will be “my stubborn cat”. By the way, did you realize we
changed the quotation marks to single quotes? The code still performed as
intended, because Python can’t tell the difference between the two. You can
use both double quotes and single quotes as you prefer, and it will have no
impact on your code.
Now let's see an example of string iteration. Type:
movieTitle = "Star Wars"
for c in movieTitle:
    print(c)
These lines of code will print every individual character in the declared
string. We first declare a variable called "movieTitle" to which we assign
"Star Wars" as its string. Next we print each character within "movieTitle".
There are other string operations that you can perform with Python, however
for the purposes of this book it’s enough to stick to the basics. If you wish,
you can always refer to Python’s online documentation and read all the
information they have on strings. Next up, let’s discuss lists!
Lists
Lists are incredibly useful in programming, and you will have to use them
often in your work. If you are familiar with object oriented programming
languages, Python lists are in fact identical to arrays. You can use them to
store data, manipulate it on demand, and store different objects in them and
so on. Using them in Python is simple, so let’s first see how to make a new
list. Type the following line:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The list is created by declaring a series of objects enclosed in square
brackets. As we already mentioned, lists don't have to contain only one data
type. You can store any kind of information in them. Here’s another example
of a list:
myBook = ["title", "somePages", 1, 2, 3, 22, 42, "bookCover"]
As you can see, we are creating a list that contains both strings and numbers.
Next up, you can start performing all the operations you used for strings.
They work the same with lists. For instance, here’s how you can concatenate
the two previous lists we created:
x + myBook
Here’s the result:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, "title", "somePages", 1, 2, 3, 22, 42,
"bookCover"]
Try out any other operation yourself and see what happens. Explore and
experiment with what you already know.
Dictionaries
Dictionaries are similar to lists, however you need to have a key that is
associated with the objects inside. You use the key to access those objects.
Let’s explain this through an example in order to avoid any confusion. Type
the following lines:
myDict = {'weapon' : 'sword', 'soldier' : 'archer'}
myDict['weapon']
As you can see, in the first line we declared a dictionary. It is defined
between two curly brackets, and inside it contains objects with a key assigned
to each of them. For instance, we have object “sword” that has the “weapon”
as its attributed key. In order to access the “sword” we have to call on its key,
which we do in the second line of code. Keep in mind that the “weapon” or
“soldier” keys are only examples. Keys don’t have to be strings. You can use
anything.
Tuples
Tuples are also similar to lists, however their objects can’t be changed after
they are set. Let’s see an example of a tuple and then discuss it. Type the
following line:
x = (1, 2, 'someText', 99, [1, 2, 3])
The tuple is in between parentheses and it contains three data types. There are
three integers, a string, and a list. You can now perform any operation you
want on the tuple. Try the same commands you used for lists and strings.
They will work with tuples as well because they are so similar. The only real
difference is that once you declare the elements inside a tuple, you cannot
modify them through code. If you have some knowledge about object
oriented programming, you might notice that Python tuples are similar to
constants.
Conditional Statements
This is the part when things start to get fun. Conditional statements are used
to give your program the ability to think and decide on its own what it does
with the data it receives. They are used to analyze the condition of a variable
and instruct the program to react based on its value. The most common
conditional statement is the "if" statement, which we will look at below.
Statements in Python programming are as logical as those you make in the
real world when making decisions. “If I’m sick tomorrow, I will skip school,
else I will just have to go” is a simple way of describing how the “if”
statement above works. You tell the program to check whether you are sick
tomorrow. If it returns a false value, because you aren’t sick, then it will
continue to the “else” statement which tells you to go to school because you
aren’t sick. “If” and “if, else” conditional statements are a major part of
programming. Now let’s see how they look in code:
x = 100
if (x < 100):
    print("x is small")
This is the basic “If” statement with no other conditions. It simply examines
if the statement is true. You declared the value of x to be 100. If x is smaller
than 100, the program will print “x is small”. In our case, the statement is
false and nothing will happen because we didn’t tell the program what to do
in such a scenario. Let’s extend this code by typing:
x = 100
if (x < 100):
    print("x is small")
else:
    print("x is big")
print("This part will be returned whatever the result")
Now that we introduced the “else” statement, we are telling the program what
to execute if the statement that “x is smaller than 100” is not true. At the end
of the code block, we also added a separate line outside of the “if else”
statement and it will return the result without considering any of the
conditions. Pay special attention to the indentation here. The last line is not
considered as part of the “if” and “else” statements because of the way we
wrote it.
But what if you want your program to check for several statements and do
something based on the results? That’s when the “elif” conditional comes in.
Here’s how the syntax would look:
if (condition1):
add a statement here
elif (condition2):
add another statement for this condition
elif (condition3):
add another statement for this condition
else:
if none of the conditions apply, do this
As you may have noticed, we haven’t exactly used code to express how the
“elif” statement is used. What we did instead was write what is known as
pseudo code. Pseudo code is useful when you quickly want to write the logic
of your code without worrying about using code language. This makes it
easier to focus on how your code is supposed to work and see if your thinking
is correct. Once you write your pseudo code and decide it’s the correct path
to take, you can replace it with actual code. Here’s how to use elif with real
code:
x = 10
if (x > 10):
    print("x is larger than ten")
elif x < 4:
    print("x is a smaller number")
else:
    print("x is not that big")
Now that you know how conditionals work, start practicing. Use strings, lists
and operators, followed by statements that use that data. You don’t need
more than basic foundations to start programming. The sooner you nudge
yourself in the right direction, the easier you will learn.
Logical Operators
Sometimes you need to make comparisons when using conditional
statements, and that’s what logical operators are for. There are three types:
and, or, and not. We use the "and" operator to receive a certain result only if
both statements are checked to be true. The "or" operator will return a result
if at least one of the specified statements is true. Finally, the "not"
operator is used to reverse the result.
Let’s see an example of a logical operator used in a simple “if” statement.
Type the following code:
y = 100
if y < 200 and y > 1:
    print("y is smaller than 200 and bigger than 1")
The program will check if the value of y is smaller than 200, as well as bigger
than 1 and if both statements are true, a result will be printed.
Introduce logical operators when you practice your conditionals. You can
come up with many operations because there’s no limit to how many
statements you can make or how many operators you use.
Loops
Sometimes we need to tell the program to repeat a set of instructions every
time it meets a condition. To achieve this, we have two kinds of loops, known
as the “for” loop and the “while” loop. Here’s an example of a “for” loop:
for x in range(1, 10):
    print(x)
In this example, we instruct our program to keep repeating until every value of
x from 1 to 9 has been printed (the end of the range, 10, is not included).
When the printed value is 2, for instance, the program checks whether x is
still within the range(1, 10) and, if the condition is true, it prints the next
number, and the next, and so on.
Here’s an example with a string:
for x in "programming":
    print(x)
The code will be executed repeatedly until all characters inside the word
“programming” are printed.
Here’s another example using a list of objects:
medievalWeapons = ["swords", "bows", "spears", "throwing axes"]
for x in medievalWeapons:
    print(x)
In this case, the program will repeat the set of instructions until every object
inside the list we declared is printed.
Next up we have the “while” loop that is used to repeat the code only as long
as a condition is true. When a statement no longer meets the condition we set,
the loop will break and the program will continue the next lines of code after
the loop. Here’s an example:
x = 1
while x < 10:
    print(x)
    x += 1
First we declare that x is an integer with the value of 1. Next we instruct the
program that while x is smaller than 10 it should keep printing the result.
However, we can’t end the loop with just this amount of information. If we
leave it at that, we will create an infinite loop, because x would stay at 1
and would therefore forever be smaller than 10. The "x += 1" at the end
tells the program to increase x’s value by 1 every single time the loop is
executed. This means that at one point x will no longer be smaller than 10,
and therefore the statement will no longer be true. The loop will finish
executing, and the rest of the program will continue.
But what about that risk of running into infinite loops? Sometimes accidents
happen, and we create an endless loop. Luckily, this is preventable by using a
“break” statement at the end of the block of code. This is how it would look:
while True:
    answer = input("Type command:")
    if answer == "Yes":
        break
The loop will continue to repeat until the correct command is used. In this
example, you break out of the loop by typing “Yes”. The program will keep
running the code until you give it the correct instruction to stop.
Functions
Now that you know enough basic programming concepts, we can discuss
making your programs more efficient, better optimized, and easier to analyze.
Functions are used to reduce the number of lines of code that are actually
doing the same thing. It is generally considered best practice to not repeat the
same code more than twice. If you have to, you need to start using a function
instead. Let’s take a look at what a function looks like in code:
def myFunction():
    print("Hello, I'm your happy function!")
We declare a function with the “def” keyword, which contains a simple string
that will be printed whenever the function is called. The defined functions are
called like this:
myFunction()
You type the name of function followed by two parentheses. Now, these
parentheses don’t always have to stay empty. They can be used to pass
parameters to the function. What’s a parameter? It’s simply a variable that
becomes part of the function’s definition. Let’s take a look at an example to
make things clearer:
def myName(firstname):
    print(firstname + " Johnson")
myName("Andrew")
myName("Peter")
myName("Samuel")
In this example we use the parameter “firstname” in the function’s definition.
We then instruct the function to always print the information inside the
parameter, plus the word “Johnson”. After defining the function, we call it
several times with different “firstname”. Keep in mind that this is an
extremely crude example. You can have as many parameters as you want. By
defining functions with all the parameters you need, you can significantly
reduce the amount of code you write.
Now let’s examine a function with a set default parameter. A default
parameter will be called when you don’t specify any other information in its
place. Let’s go through an example for a better explanation. Nothing beats
practice and visualization. Type the following code:
def myHobby(hobby = "leatherworking"):
    print("My hobby is " + hobby)
myHobby("archery")
myHobby("gaming")
myHobby()
myHobby("fishing")
These are the results you should receive when calling the function:
My hobby is archery
My hobby is gaming
My hobby is leatherworking
My hobby is fishing
Here you can see that the function without a parameter will use the default
value we set.
Finally, let’s discuss a function that returns values. So far our functions were
set to perform something, such as printing a string. We can’t do much with
these results. However, a returned value can be reassigned to a variable and
used in more complex operations. Here’s an example of a return function:
def square(x):
    return x * x
print(square(5))
We defined the function and then we used the “return” command to return the
value of the function, which in this example is the square of 5.
Code Commenting
We discussed earlier that maintaining a clear, understandable code is one of
your priorities. On top of naming conventions, there’s another way you can
help yourself and others understand what your code does. This is where code
commenting comes in to save the day.
Few things are worse than abandoning a project for a couple of weeks and
coming back to it only to stare at it in confusion. In programming, you
constantly evolve, so the code you thought was brilliant a while back will
seem like it’s complete nonsense. Luckily, Python gives you the ability to
leave text-based comments anywhere without having any kind of negative
effect on the code. Comments are ignored by the program, and you can use
them to briefly describe what a certain block of code is meant to achieve. A
comment in Python is marked with a hashtag (#).
# This is my comment.
Python disregards everything that is written after a hash symbol. You can
comment before a line of code, after it, or even in the middle of it (though
this is not recommended). Here’s an example of this in action:
print (“This is part of the program and will be executed”) #This is a comment
Comments don’t interfere with the program in any way, but you should pay
attention to how you express yourself and how you write the comment lines.
First of all, comments should not be written in an endless line - you should
break them up into several lines to make them easy to read. Secondly, you
should only use them to write a short, concise description. Don’t be more
detailed than you have to be.
# Here’s how a longer comment line should look like.
# Keep it simple, readable and to the point
# without describing obvious data types and variables.
Get used to using comments throughout your code early on. Other
programmers or data scientists will end up reading it someday, and comments
make it much easier for them to understand what you wanted to accomplish.
Every programmer has a different way of solving a problem, and not
everyone thinks the same way, even if they arrive at the same conclusion. In
the long run, good comments will save you a lot of headaches, and those who
read your code may hate you a little less.
Chapter 16 - Unsupervised Learning
Unsupervised machine learning uses unlabeled data, where data scientists don't
yet know the output. The algorithm must discover patterns on its own, patterns
that would otherwise remain unknown, finding structure in a place where the
structure is otherwise unobservable. The algorithm finds data segments on its
own; the model looks for patterns and structure in an otherwise unlabeled and
unrecognizable mass of data. Unsupervised learning allows us to find patterns
that would otherwise go unnoticed, because massive collections of data often
contain patterns and it would be impossible to sift through all of it by hand
trying to find trends.
This is good for examining the purchasing habits of consumers so that you
can group customers into categories based on patterns in their behavior. The
model may discover that there are similarities in buying patterns between
different subsets of a market, but if you didn’t have your model to sift
through these massive amounts of complicated data, you will never even
realize the nature of these patterns. The beauty of unsupervised learning is the
possibility of discovering patterns or characteristics in massive sets of data
that you would not be able to identify without the help of your model.
A good example of unsupervised learning is fraud detection. Fraud can be a
major problem for financial companies, and with large amounts of daily
users, it can be difficult for companies to identify fraud without the help of
machine learning tools. Models can learn how to spot fraud as the tactics
change with technology. If you want to deal with new, unknown fraud
techniques, then you will need to employ a model that can detect fraud under
unique circumstances.
In the case of detecting fraud, it's better to have more data. Fraud detection
services must use a range of machine learning models, both supervised and
unsupervised, to be able to combat fraud effectively. It's estimated that there
will be about $32 billion in fraudulent credit card activity in 2020. Models
for fraud detection classify the output (credit card transactions) as
legitimate or fraudulent.
They can classify based on a feature like time of day or location of the
purchase. If a merchant usually makes sales around $20, and suddenly has a
sale for $8000 from a strange location, then the model will most likely
classify this transaction as fraudulent.
The challenge of using machine learning for fraud detection is the fact that
most transactions are not fraudulent. If a significant share of transactions
were fraudulent, credit cards would not be a viable industry. The percentage of
fraudulent card transactions is so small that it can skew the models. The
$8000 purchase from
a strange location is suspicious, but it is more likely to be the result of a
traveling cardholder than fraudulent activity. Unsupervised learning makes it
easier to identify suspicious buying patterns like strange shipping locations
and random jumps in user reviews.
Clustering
Clustering is a sub-group of unsupervised learning. Clustering is the task of
grouping similar things together. When we use clustering, we can identify
characteristics and sort our data based on these characteristics. If we are
using machine learning for marketing, clustering can help us identify
similarities in groups of customers or potential clients. Unsupervised learning
can help us sort customers into categories that we might not have come up with
on our own. It can also help you sort your data when you are working with a
large number of variables.
K-Means clustering
K-means clustering works similarly to K-nearest neighbors. You pick a number
for K to decide how many groups you want to see, and you continue to cluster
and repeat until the clusters are clearly separated.
Your data is grouped around centroids, which are the points on your graph
that you have chosen where you want to see your data clustered. You choose
them at random, and you have k of them. Once you introduce your data to the
model, data points are placed in categories indicated by the closest centroid,
which is measured by Euclidean distance. Then you take the average value of
the data points surrounding each centroid. Keep repeating this process until
your results stay the same, and you have consistent clusters. Each data point
is only assigned to one cluster.
You repeat this process by finding the average values for x and y within each
cluster. This will help you extrapolate the average value of the data points in
each cluster. K-means clustering can help you identify previously unknown
or overlooked patterns in the data.
Choose the value for k that is optimal for the number of categories you want
to create. Ideally, you should have more than 3. However, the advantage
associated with adding more clusters diminishes the higher the number of
clusters you have. The higher the value for k that you choose, the smaller and
more specific the clusters are. You wouldn’t want to use a value for k that is
the same as the number of data points because each data point would end up
in its own cluster.
You will have to know your dataset well and use your intuition to guess how
many clusters are appropriate, and what sort of differences that will be
present. However, our intuition and knowledge of the data are less helpful
once we have more than just a few potential groups.
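The scikit-learn sketch below shows the customer-segmentation idea in
miniature: six invented customers described by average spend per transaction
and purchases per year, clustered with K set to 2.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[20, 5], [22, 6], [25, 4],        # low spenders
                      [90, 30], [95, 28], [100, 35]])   # high spenders

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # which cluster each customer was assigned to
print(kmeans.cluster_centers_)   # the mean (centroid) of each cluster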
Dimensionality Reduction
When you are using dimensionality reduction, you are trimming down data to
remove unwanted features. Simply put, you're scaling down the number of
variables in a dataset.
When we have a lot of variables in our model, then we run the risk of having
dimensionality problems. Dimensionality problems are problems that are
unique to models with large datasets and can affect prediction accuracy.
When we have many variables, we need larger populations and sample
populations in order to create our model. With that many variables, it’s hard
to have enough data to have many possible combinations to create a well-
fitting model.
If we use too many variables, then we can also encounter overfitting.
Overfitting is the main problem which would cause a data scientist to
consider dimensionality reduction.
We must identify data that we don't need, or that is irrelevant. If we have a
model predicting someone’s income, do we need a variable that tells us what
their favorite color is? Probably not. We can drop it out of our dataset.
Usually, it's not that easy to tell when a variable should be dropped. There are
some tools we can use to determine which variables aren’t as important.
Principal Component Analysis is a method of dimensionality reduction. We take
the old set of variables and transform them into a new, smaller set. The new
variables we've created are called principal components. There is a tradeoff
between reducing the number of variables while maintaining the accuracy of
your model.
We can also standardize the values of our variables. Make sure they are all
valued in the same relative scale so that you don't inflate the importance of a
variable. For example, if we have variables measured as a probability
between 0 and 1 vs. variables that are measured by whole numbers above
100.
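A brief scikit-learn sketch of Principal Component Analysis, with the
standardization step described above applied first, might look like this; the
small dataset is invented.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = np.array([[0.2, 150, 3.1],
                 [0.4, 300, 2.9],
                 [0.3, 220, 3.0],
                 [0.9, 800, 1.2],
                 [0.8, 750, 1.4]])

scaled = StandardScaler().fit_transform(data)   # put every column on the same scale
pca = PCA(n_components=2)                       # keep two principal components
reduced = pca.fit_transform(scaled)

print(reduced.shape)                  # (5, 2): five rows, now only two variables
print(pca.explained_variance_ratio_)  # how much variance each component preserves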
Linear Discriminant Analysis is another method of dimensionality reduction,
where we combine features or variables rather than getting rid of them
altogether.
Kernel Principal Component Analysis is a third method for dimensionality
reduction. Here, variables are mapped into a new set. This method is
non-linear, and it can give us even better insight into the true structure of
the data than the original variables.
Chapter 17 - Neural Networks
Neural networks are a form of machine learning that is referred to as deep
learning. It's probably the most advanced method of machine learning, and
truly understanding how it works might require a Ph.D. You could write an
entire book on machine learning's most technical type of model.
Neural networks are computer systems designed to mimic the path of
communication within the human brain. In your body, you have billions of
neurons that are all interconnected and travel up through your spine and into
your brain. They are attached by root-like nodes that pass messages through
each neuron one at a time all the way up the chain until it reaches your brain.
While there is no way to replicate this with a computer yet, we take the
principle idea and apply it to computer neural networks to replicate the ability
to learn like a human brain learns; recognize patterns and inferring
information from the discovery of new information.
In the case of neural networks, as with all our machine learning models,
information is processed as numerical data. By feeding the network numerical
data values, we are giving it the power to use algorithms
to make predictions.
Just as with the neurons in the brain, data starts at the top and works its way
down, being first separated into nodes. The neural network uses nodes to
communicate through each layer. A neural network is comprised of three
parts; Input, hidden, and output layers.
In a typical diagram of a neural network, the circles represent the individual
nodes in the network. On the left side, we have the input layer; this is where
our data goes in. After the data passes
through the input layer, it gets filtered through several hidden layers. The
hidden layers are where data gets sorted by different characteristics and
features. The hidden layers look for patterns within the data set. The hidden
layers are where the ‘magic' is happening because the data is being sorted by
patterns that we probably wouldn't recognize if we sorted it manually. Each
node has a weight which will help to determine the significance of the feature
being sorted.
The best use of these neural networks would be a task that would be easy for
a human but extremely difficult for a computer. Recall at the beginning of the
book when we talked about reasoning and inductive reasoning. Our human
brain is a powerful tool for inductive reasoning; it’s our advantage over
advanced computers that can calculate high numbers of data in a matter of
seconds. We model neural networks after human thinking because we are
attempting to teach a computer how to ‘reason’ like a human. This is quite a
challenge. A good rule of thumb is the one we mentioned earlier: we apply
neural networks to tasks that would be extremely easy for a human but are very
challenging for a computer.
Neural networks can take a huge amount of computing power. The first
reason neural networks are a challenge to process is because of the volume of
datasets required to make an accurate model. If you want the model to learn
how to sort photographs, there are many subtle differences between photos
that the model will need to learn to complete the task effectively. That leads
to the next challenge, which is the number of variables required for a neural
network to work properly. The more data that you use and the higher the
number of variables analyzed means that there is an increase in hidden
networks. At any given time, several hundred or even thousands of features
are being analyzed and classified through the model. Take self-driving cars as
an example. Self-driving cars have more than 150 nodes for sorting. This
means that the amount of computing power required for a self-driving car to
make split-second decisions while analyzing thousands of inputs at a time is
quite large.
In the instance of sorting photos, neural networks can be very useful, and the
methods that data scientists use are improving rapidly. If I showed you a
picture of a dog and a picture of a cat, you could easily tell me which one a
cat was, and which one was a dog. But for a computer, this takes
sophisticated neural networks and a large volume of data to teach the model.
A common issue with neural networks is overfitting. The model can predict
the values for the training data, but when it's exposed to unknown data, it is
fit too specifically for the old data and cannot make generalized predictions
for new data.
Say that you have a math test coming up and you want to study. You can
memorize all the formulas that you think will appear on the test and hope that
when the test day comes, you will be able to just plug in the new information
into what you’ve already memorized. Or you can study more deeply; learning
how each formula works so that you can produce good results even when the
conditions change. An overfitted model is like memorizing the formulas for a
test. It will do well if the new data is similar, but when there is a variation,
then it won’t know how to adapt. You can usually tell if your model is
overfitted if it performs well with training data but does poorly with test data.
When we are checking the performance of our model, we can measure it
using the cost value. The cost value is the difference between the predicted
value and the actual value of our model.
One of the challenges with neural networks is that there is no way to
determine the relationship between specific inputs with the output. The
hidden layers are called hidden layers for a reason; they are too difficult to
interpret or make sense of.
The most simplistic type of neural network is called a perceptron. It derives
its simplicity from the fact that it has only one layer through which
data passes. The input layer leads to one classifying hidden layer, and the
resulting prediction is a binary classification. Recall that when we refer to a
classification technique as binary, that means it only sorts between two
different classes, represented by 0 and 1.
The perceptron was first developed by Frank Rosenblatt. It’s a good idea to
familiarize yourself with the perceptron if you’d like to learn more about
neural networks. The perceptron uses the same process as other neural
network models, but typically you’ll be working with more layers and more
possible outputs. When data is received, the perceptron multiplies each input
by the weight it is given. Then the sum of all these values is plugged into the
activation function. The activation function tells the input which category it
falls into, in other words predicting the output.
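A bare-bones sketch of that process is shown below: multiply the inputs by
their weights, add a bias, and apply a step activation; the weights, bias, and
inputs are invented for illustration.
import numpy as np

def perceptron_predict(inputs, weights, bias):
    total = np.dot(inputs, weights) + bias    # weighted sum of the inputs
    return 1 if total > 0 else 0              # step activation: binary output

weights = np.array([0.6, -0.4])
bias = -0.1

print(perceptron_predict(np.array([1.0, 0.5]), weights, bias))   # 1
print(perceptron_predict(np.array([0.2, 1.0]), weights, bias))   # 0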
If you were to look at the perceptron's activation on a graph, its line would
appear as a step, with two values, one on either side of the threshold. These
two sides of the step are the different classes that the model will predict
based on the inputs. As you might be able to tell, it's a bit crude because
there is very little separation along the line between classes. Even a small
change in some input variable can cause the predicted output to be a different
class. It won't perform as well outside of the original dataset that you use
for training because it is a step function.
An alternative to the perceptron is a model called a sigmoid neuron. The
principal advantage of using the sigmoid neuron is that it is not binary.
Unlike the perceptron, which classifies data into one of two categories, the
sigmoid function produces a probability rather than a hard classification. If
you plotted a sigmoid neuron, you would see a smooth, S-shaped curve instead of
a step. With the perceptron, the step makes it difficult to classify data
points with only marginal differences; with the sigmoid neuron, the data is
predicted by the probability that it falls into a given class. The curve rises
smoothly around the threshold, which means the probability that a data point
belongs to a given class increases gradually rather than jumping, and the
output is only a probability.
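A sigmoid neuron can be sketched in the same spirit: the weighted sum is
squashed into a value between 0 and 1 instead of being forced through a hard
step; the weights and inputs are the same invented values as in the perceptron
sketch above.
import numpy as np

def sigmoid_neuron(inputs, weights, bias):
    total = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-total))       # smooth, S-shaped activation

weights = np.array([0.6, -0.4])
bias = -0.1

print(sigmoid_neuron(np.array([1.0, 0.5]), weights, bias))   # about 0.57
print(sigmoid_neuron(np.array([0.2, 1.0]), weights, bias))   # about 0.41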
Conclusion
As we move into the third decade of the twenty-first century, several new
trends in big data may take hold. The first is the streaming of data combined
with machine learning. Traditionally, computers have learned from data sets
that were fed to computer systems in a controlled fashion. Now the idea is
developing to use data streaming in real time, so computer systems could
learn as they go. It remains to be seen if this is the best approach, but
combining this with the Internet of Things, there is a big hope for massive
improvements in accuracy, value, and efficiency regarding big data.
Another important trend in the coming years is sure to be the increasing role
of artificial intelligence. This has applications across the board, with simple
things like detecting spam email all the way to working robots that many fear
will destroy large numbers of jobs that only require menial labor. The
belief among those familiar with the industry is that despite decades of slow
progress regarding artificial intelligence, its time has definitely arrived. It is
expected to explode over the next decade. Recently, robots have been
unveiled that can cook meals in fast-food restaurants, work in warehouses
unloading boxes and stacking them on shelves, and everyone is talking about
the possibilities of self-driving cars and trucks.
Businesses are eager to take advantage of AI as it becomes more capable and
less expensive. It is believed that applications of artificial intelligence to
business needs will increase company efficiency exponentially. In the
process, tedious and time-consuming tasks, both physical-related tasks like
unloading boxes at a warehouse and data-related tasks done in offices, will be
replaced by artificially intelligent systems and robotics. The movement in this
direction is already well underway, and some people are fretting quite a bit
over the possibility of millions of job losses. However, one must keep in
mind that revolutionary technology has always caused large numbers of job
losses, but this impact is only temporary because the freed labor and
productive capacity have resulted in the creation of new jobs and industries
that nobody anticipated before. One example of this is the famous Luddites
who protested the mechanical looms that manufactured textile goods in the
early nineteenth century. They rioted and destroyed many factories when these
early machines were introduced. However, by the end of the century, literally
ten times as many people were working in the same industry because of the
increased productivity provided by the introduction of machines. It remains
to be seen, but one can assume this is likely to happen yet again.
Cloud computing has played a central role in the expansion of big data.
Hybrid clouds are expected to gain ground in the coming years. A hybrid cloud will
combine a company’s own locally managed and controlled data storage with
cloud computing. This will help increase flexibility while enhancing the
security of the data. Cloud bursting will be used, where the company can use
its own local storage until surges in demand force it to use the cloud.
One up-and-coming issue related to big data is privacy. Privacy concerns are
heightening, with people becoming more aware of the ubiquitous targeted
advertising that many companies are using. In addition, large-scale hacks of
data are continually making the news, and consumers are becoming
increasingly concerned about what companies like Facebook and Amazon are
doing with their data. If people are concerned with Facebook invading their
privacy, they will certainly be concerned about their toilet, electric meter, and
refrigerator collecting data on their activities and sending it who knows
where. Politicians the world over are also getting in on the act, with calls for
regulation of big tech companies coming from both sides of the Atlantic.
Many people are anticipating with excitement the implementation of 5G
cellular networks. This is supposed to result in much faster connection speeds
for mobile devices. The capacity for data transfer is expected to be much
larger than is currently available. 5G networks are claimed to have download
speeds that are ten times as great compared with 4G cellular networks. This
will increase not only the speed of using the internet on a mobile device but
also the ability of companies to collect data on customers in real time, and
possibly integrating streaming data from 5G devices with machine learning.
A 5G connection will also allow you to connect more devices
simultaneously. This could be helpful for the advent of the Internet of Things
described earlier. At the time of writing, 5G is barely being tentatively rolled
out in select cities like Chicago.
One anticipated trend in the next few years will be that more companies will
make room for a data curator. This is a management position that will work
with data, present data to others, and understand the types of analysis needed
to get the most out of big data.