Introduction To Data Science
Financial institutions
Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services. At the time of writing this book, at
least 50% of trades worldwide are performed automatically by machines based on algorithms
developed by quants, as data scientists who work on trading algorithms are often called, with the
help of big data and data science techniques. Governmental organizations are also aware of data’s
value.
Governmental organizations
Many governmental organizations not only rely on internal data scientists to discover valuable
information, but also share their data with the public. You can use this data to gain insights or build
data-driven applications. Data.gov is but one example; it’s the home of the US Government’s open
data. A data scientist in a governmental organization gets to work on diverse projects such as
detecting fraud and other criminal activity or optimizing project funding. A well-known example
was provided by Edward Snowden, who leaked internal documents of the American National
Security Agency and the British Government Communications Headquarters that show clearly
how they used data science and big data to monitor millions of individuals. Those organizations
collected 5 billion data records from widespread applications such as Google Maps, Angry Birds,
email, and text messages, among many other data sources. Then they applied data science
techniques to distill information.
Universities
Universities use data science in their research, but also to enhance the study experience of their students.
The rise of massive open online courses (MOOC) produces a lot of data, which allows universities
to study how this type of learning can complement traditional classes. MOOCs are an invaluable
asset if you want to become a data scientist and big data professional, so definitely look at a few
of the better-known ones: Coursera, Udacity, and edX. The big data and data science landscape
changes quickly, and MOOCs allow you to stay up to date by following courses from top
universities. If you aren’t acquainted with them yet, take time to do so now; you’ll come to love
them as we have.
Facets of data
In data science and big data, you’ll come across many different types of data, and each of them
tends to require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data: Structured data is data that depends on a data model and resides in a fixed field
within a record. As such, it’s often easy to store structured data in tables within databases or Excel
files. SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases. You may also come across structured data that is hard to store in a traditional relational database. Hierarchical data such as a family tree is one such
example. The world isn’t made up of structured data, though; it’s imposed upon it by humans and
machines. More often, data comes unstructured.
An Excel table is an example of structured data.
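As a minimal sketch of what this looks like in practice, the snippet below uses Python's built-in sqlite3 module with a made-up employees table: the structured records live in fixed fields, and SQL is used to query them. The table and column names are purely illustrative.

```python
import sqlite3

# Build a small structured table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "Sales", 52000.0), ("Bob", "Sales", 48000.0), ("Carol", "IT", 61000.0)],
)

# SQL is the natural way to query data that resides in fixed fields within records.
for row in conn.execute(
    "SELECT department, AVG(salary) FROM employees GROUP BY department"
):
    print(row)  # e.g. ('IT', 61000.0)

conn.close()
```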
Unstructured data: Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your regular email.
Although email contains structured elements such as the sender, title, and body text, it’s a challenge
to find the number of people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands of different languages
and dialects out there further complicate this. A human-written email is also a perfect example of natural language data.
Natural language: Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained in one
domain don’t generalize well to other domains. Even state-of-the-art techniques aren’t able to
decipher the meaning of every piece of text. This shouldn’t be a surprise though: humans struggle
with natural language as well. It’s ambiguous by nature. The concept of meaning itself is
questionable here. Have two people listen to the same conversation. Will they get the same
meaning? The meaning of the same words can vary when coming from someone upset or joyous.
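To make this concrete, here is a small sketch of sentiment analysis using NLTK's lexicon-based VADER model; it assumes NLTK is installed and its lexicon can be downloaded, and the example sentences are invented. Note how the sarcastic second sentence illustrates exactly the ambiguity described above.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER is a small lexicon-based sentiment model; its lexicon is downloaded once.
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()

# The same word ("love") carries a very different tone in the two sentences;
# lexicon-based models often miss the sarcasm in the second one.
for text in ["I love this service!", "I love waiting an hour for support..."]:
    print(text, "->", sia.polarity_scores(text)["compound"])
```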
Machine-generated data: Machine-generated data is information that’s automatically created by
a computer, process, application, or other machine without human intervention. Machine-
generated data is becoming a major data resource and will continue to do so. Wikibon has forecast
that the market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the
integration of complex physical machinery with networked sensors and software) will be
approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there will
be 26 times more connected things than people in 2020. This network is commonly referred to as
the internet of things. The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
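As a small illustration of working with machine-generated data, the sketch below parses a hypothetical Apache-style web server log line with a regular expression; the log format and field names are assumptions made for the example.

```python
import re

# A hypothetical Apache-style access log line (machine-generated data).
line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Pull out the client IP, timestamp, request method, path, status code, and size.
pattern = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d{3}) (\d+)')
match = pattern.match(line)
if match:
    ip, timestamp, method, path, status, size = match.groups()
    print(ip, method, path, status, size)
```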
Retrieving data
The second step is to collect data. You’ve stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which
means checking the existence of, the quality of, and access to the data. Data can also be delivered by
third-party companies and takes many forms ranging from Excel spreadsheets to different types of
databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing removes false values from a data source and inconsistencies across data sources; data integration enriches data sources by combining information from multiple data sources; and data transformation ensures that the data is in a suitable format for use in your models.
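A minimal sketch of these three subphases using the pandas library is shown below; the customer and region tables, column names, and cleaning rules are invented purely for illustration.

```python
import pandas as pd

# Two hypothetical sources: a customer table and a country lookup table.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country": ["FR ", "de", "de", "FR"],   # trailing space and inconsistent case
    "age": [34, 27, 27, 299],               # 299 is clearly a data entry error
})
regions = pd.DataFrame({"country": ["FR", "DE"],
                        "region": ["Western Europe", "Western Europe"]})

# Data cleansing: normalize the keys, drop duplicate rows, blank out impossible ages.
customers["country"] = customers["country"].str.strip().str.upper()
customers = customers.drop_duplicates()
customers["age"] = customers["age"].mask(customers["age"] > 120)

# Data integration: enrich the customers with information from a second source.
enriched = customers.merge(regions, on="country", how="left")

# Data transformation: bring age onto a standardized scale for modeling.
enriched["age_z"] = (enriched["age"] - enriched["age"].mean()) / enriched["age"].std()
print(enriched)
```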
Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether there
are outliers. To achieve this, you mainly use descriptive statistics, visual techniques, and simple
modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
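As a rough sketch of such a first exploratory pass, assuming pandas and matplotlib are available, the snippet below computes descriptive statistics and a correlation matrix and draws a scatter plot; the small height/weight dataset is made up for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A hypothetical dataset; in practice you would load your own data here.
df = pd.DataFrame({
    "height_cm": [162, 175, 181, 158, 190, 168, 177, 171],
    "weight_kg": [58, 74, 85, 52, 95, 63, 78, 70],
})

# Descriptive statistics: count, mean, std, min, quartiles, and max per column.
print(df.describe())

# How do the variables interact? A correlation matrix and a scatter plot help.
print(df.corr())
df.plot.scatter(x="height_cm", y="weight_kg")
plt.show()
```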
Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. Who
hasn’t lost a few days in a project because of a bug that was caused by whitespaces at the end of a
string? You ask the program to join two keys and notice that observations are missing from the
output file. After looking for days through the code, you finally find the bug. Then comes the
hardest part: explaining the delay to the project stakeholders. The cleaning during the ETL phase
wasn’t well executed, and keys in one table contained a whitespace at the end of a string. This
caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that couldn’t be
matched.
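A hedged illustration of the problem, using pandas and two invented tables: the join silently drops the mismatched key until the redundant whitespace is stripped.

```python
import pandas as pd

# Hypothetical tables whose keys should match, but one contains a trailing space.
orders = pd.DataFrame({"country": ["FR ", "BE"], "orders": [120, 45]})
countries = pd.DataFrame({"country": ["FR", "BE"], "name": ["France", "Belgium"]})

# Joining on the raw keys silently drops "FR " because it doesn't equal "FR".
print(orders.merge(countries, on="country"))   # only the BE row survives

# Stripping the whitespace before the join fixes the mismatch.
orders["country"] = orders["country"].str.strip()
print(orders.merge(countries, on="country"))   # both rows are matched
```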
Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here you check the value against physically
or theoretically impossible values such as people taller than 3 meters or someone with an age of
299 years. Sanity checks can be directly expressed with rules: check = 0 <= age <= 120
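Such a rule translates directly into code. The sketch below, assuming pandas and an invented people table, flags the rows that violate the checks.

```python
import pandas as pd

# Hypothetical measurements containing a few physically impossible values.
people = pd.DataFrame({"age": [31, 299, 54], "height_m": [1.78, 1.65, 3.40]})

# Express the sanity checks as boolean rules, then inspect the violations.
valid_age = people["age"].between(0, 120)
valid_height = people["height_m"].between(0, 3)
print(people[~(valid_age & valid_height)])   # rows that fail at least one check
```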
Outliers
An outlier is an observation that seems to be distant from other observations or, more
specifically, one observation that follows a different logic or generative process than the other
observations. The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values.
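As a simple sketch, the snippet below inspects the minimum and maximum of an invented income series and additionally applies the common 1.5 × IQR rule of thumb to flag the distant observation.

```python
import pandas as pd

# Hypothetical incomes with one observation far from the rest.
income = pd.Series([31_000, 28_500, 35_200, 29_800, 33_100, 410_000])

# The minimum and maximum already hint at the problem.
print(income.min(), income.max())

# Rule of thumb: flag values more than 1.5 IQR outside the quartiles.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(outliers)   # the 410,000 observation is flagged
```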
Transforming data
Certain models require their data to be in a certain shape. Now that you’ve cleansed and
integrated the data, this is the next task you’ll perform: transforming your data so it takes a
suitable form for data modeling.
Relationships between an input variable and an output variable aren’t always linear. Take, for
instance, a relationship of the form y = ae^(bx). Taking the logarithm of y turns this into a linear
relationship, log y = log a + bx, which simplifies the estimation problem dramatically.
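A brief worked example, with simulated data and an assumed multiplicative noise model, shows how the log transform turns the exponential relationship into an ordinary least-squares fit (here done with NumPy's polyfit).

```python
import numpy as np

# Simulate a hypothetical exponential relationship y = a * exp(b * x) with noise.
rng = np.random.default_rng(0)
a, b = 2.0, 0.5
x = np.linspace(0, 10, 50)
y = a * np.exp(b * x) * rng.lognormal(sigma=0.05, size=x.size)

# Taking the log of y gives a straight line: log y = log a + b * x,
# which a plain least-squares line fit recovers easily.
slope, intercept = np.polyfit(x, np.log(y), deg=1)
print("estimated b:", slope, "estimated a:", np.exp(intercept))
```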
Reducing the number of variables
Sometimes you have too many variables and need to reduce the number because they don’t
add new information to the model. Having too many variables in your model makes the model
difficult to handle, and certain techniques don’t perform well when you overload them with
too many input variables. For instance, all the techniques based on a Euclidean distance
perform well only up to 10 variables.
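One common way to reduce the number of variables is principal component analysis; the sketch below, with simulated data and scikit-learn's PCA, keeps just enough components to explain 95% of the variance. PCA is only one option among several, and the dataset here is invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 observations of 15 variables that are really driven
# by only 3 underlying factors plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 15)) + 0.1 * rng.normal(size=(200, 15))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)        # e.g. (200, 15) -> (200, 3)
print(pca.explained_variance_ratio_)
```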
You’ll need to select the variables you want to include in your model and a modeling technique.
Your findings from the exploratory analysis should already give a fair idea of what variables
will help you construct a good model. Many modeling techniques are available, and choosing
the right model for a problem requires judgment on your part. You’ll need to consider model
performance and whether your project meets all the requirements to use your model, as well
as other factors:
■ Must the model be moved to a production environment and, if so, would it be
easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
■ Does the model need to be easy to explain?
Model execution
Once you’ve chosen a model you’ll need to implement it in code. Luckily, most programming
languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These
packages use several of the most popular techniques. Coding a model is a nontrivial task in most
cases, so having these libraries available can speed up the process. As you can see in the following
code, it’s fairly easy to use linear regression with StatsModels or Scikit-learn. Doing this yourself
would require much more effort even for the simple techniques.
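A minimal sketch along those lines (not the original listing, and using simulated data and invented coefficients) fits a linear regression first with StatsModels and then with Scikit-learn; it assumes NumPy and both libraries are installed.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Hypothetical data: a target that depends linearly on two predictors plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# StatsModels: ordinary least squares with a full statistical summary available.
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)            # intercept and the two coefficients

# Scikit-learn: the same model with an API geared towards prediction.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)
```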
To understand Big Data, you need to get acquainted with its attributes known as the four V’s.
Volume is what’s “big” in Big Data. This relates to terabytes to petabytes of information
coming from a range of sources such as IoT devices, social media, text files, business
transactions, etc. Just so you can grasp the scale, 1 petabyte is equal to 1,000,000 gigabytes.
Streaming a single HD movie on Netflix uses roughly 4 gigabytes of data, so 1 petabyte could
hold about 250,000 such movies. And Big Data isn’t about 1 petabyte; it’s about thousands
and millions of them.
Velocity is the speed at which the data is generated and processed. It’s represented in terms
of batch reporting, near real-time/real-time processing, and data streaming. The best-case
scenario is when the speed with which the data is produced meets the speed with which it
is processed. Let’s take the transportation industry for example. A single car connected to
the Internet with a telematics device plugged in generates and transmits 25 gigabytes of
data hourly at a near-constant velocity. And most of this data has to be handled in real-time
or near real-time.
Variety is the vector showing the diversity of Big Data. This data isn’t just about structured
data that resides within relational databases as rows and columns. It comes in all sorts of
forms that differ from one application to another, and most of Big Data is unstructured.
Say, a simple social media post may contain some text information, videos or images, a
timestamp, etc.
Veracity is the measure of how truthful, accurate, and reliable data is and what value it
brings. Data can be incomplete, inconsistent, or noisy, decreasing the accuracy of the
analytics process. Due to this, data veracity is commonly classified as good, bad, and
undefined. That’s quite a help when dealing with diverse data sets such as medical records,
in which any inconsistencies or ambiguities may have harmful effects.
Knowing these key characteristics, you can see that not all data can be referred to as Big Data.
Big data refers to datasets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyse.
Unstructured data
In the modern world of big data, unstructured data is the most abundant. It’s so prolific because
unstructured data could be anything: media, imaging, audio, sensor data, text data, and much more.
Unstructured simply means datasets (typically large collections of files) that aren’t stored in a structured database format. Unstructured data has an internal structure, but it’s not predefined through data models. It might be human-generated or machine-generated, in a textual or a non-textual format.
■ Enable the appropriate organizational change to move towards fact-based decisions, adoption of new technologies, and uniting people from multiple disciplines into a single multidisciplinary team
■ Deliver faster and superior results by embracing and capitalizing on the ever-increasing rate of change that is occurring in the global marketplace.
Big Data analytics uses a wide variety of advanced analytics techniques and delivers benefits such as the following:
■ Deeper insights. Rather than looking at segments, classifications, regions, groups, or other summary levels, you’ll have insights into all the individuals, all the products, all the parts, all the events, all the transactions, etc.
■ Broader insights. The world is complex. Operating a business in a global, connected economy
is very complex given constantly evolving and changing conditions. As humans, we simplify
conditions so we can process events and understand what is happening. But our best-laid plans
often go astray because of the estimating or approximating. Big Data analytics takes into account
all the data, including new data sources, to understand the complex, evolving, and interrelated
conditions to produce more accurate insights.
■ Frictionless actions. Increased reliability and accuracy that will allow the deeper and broader
insights to be automated into systematic actions.
The key to success for organizations seeking to take advantage of this opportunity is:
■ Leverage all your current data and enrich it with new data sources
■ Enforce data quality policies and leverage today’s best technology and people to support the policies
■ Relentlessly seek opportunities to imbue your enterprise with fact-based decision making
■ Embed your analytic insights throughout your organization