Unit: Defining Data Science and Big Data
Prepared by: Varun Rao (Dean, Data Science & AI)
For: Data Science - 1st years
Defining Data Science and Big Data
Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional data
management techniques such as relational database management
systems (RDBMSs). The widely adopted RDBMS has long been regarded
as a one-size-fits-all solution, but the demands of handling big data have
shown otherwise.
The characteristics of big data are often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are different types of data?
■ Velocity—At what speed is new data generated?
An extended list of eight Vs is also used: Volume, Velocity, Variety,
Veracity, Vocabulary, Vagueness, Viability, and Value.
Volume: The amount of data that needs to be processed. This can
manifest either as an amount accumulated over time or as an amount that
must be processed at one time. For example, performing a matrix
operation on a 1-billion-by-1-billion matrix, or scanning the contents of
every newspaper published in a day for keywords, are both examples of
volume constraining computing.
Velocity: Similar to volume, this has to do with the speed of the data
coming in and the speed of the transformed data leaving. An example of a
high-velocity requirement is telemetry from a self-driving car that must be
analyzed in real time.
Variety: In the computing context we are discussing, this term refers to
heterogeneous data sources that need to be identified and normalized
before the computation can occur. In data science this is often referred to
as data cleaning, and it is frequently the most labor-intensive operation,
as it involves all of the pre-work required to set up the high-performance
computer. This is where the vast majority of errors and issues with data
are found, and it is the fundamental bottleneck in high-performance
computing.
Veracity: This term refers to how trustworthy and accurate the data is;
noisy, incomplete, or biased sources reduce confidence in any results
computed from them.
Vocabulary: This term is less a computing issue than a communication
issue between provider and customer; it has to do with the language used
to describe the desired outcome of an analysis. For example, the terms
"accuracy" or "performance" may mean something different in the context
of structural engineering than they do in rendering animation.
Vagueness: This term describes an interpretation issue with the results
being returned; what the returned results actually mean may be unclear.
Viability: This refers to a model's ability to represent reality. Models, by
their very nature, are idealized approximations of reality.
Value: This term is defined as whatever is important to the customer.
Another way to define value is as the removal of obstacles in the
customer's path so that they can reach their stated destination.
Data science
Data science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. Data science is an evolutionary
extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to
the library of statistics.
The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning,
computing, and algorithm building. Their tools tend to differ too, with data
scientist job descriptions more frequently mentioning the ability to use
Hadoop, Pig, Spark, R, Python, and Java, among others.
Benefits and uses of data science and big data
Data science and big data are used almost everywhere in both commercial
and non-commercial settings. The number of use cases is vast, and the
examples provided throughout this unit only scratch the surface of the
possibilities. A good example of this is Google AdSense, which collects
data from internet users so relevant commercial messages can be matched
to the person browsing the internet.
Facets of data
In data science and big data you’ll come across many different types of
data, and each of them tends to require different tools and techniques. The
main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in
tables within databases or Excel files. SQL, or Structured Query
Language, is the preferred way to manage and query data that resides in
databases.
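As a minimal sketch of this idea (assuming the DBI and RSQLite packages
are installed; the table and names below are made up), the following R
code stores a small structured table in a database and queries it with SQL:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # temporary in-memory database
dbWriteTable(con, "students", data.frame(name = c("Asha", "Ravi"),   # made-up rows
                                         score = c(82, 91)))
dbGetQuery(con, "SELECT name, score FROM students WHERE score > 85")
dbDisconnect(con)
Because every record has the same fixed fields, SQL can filter and
aggregate it directly.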
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email.
Natural language
Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques
and linguistics. The natural language processing community has had
success in entity recognition, topic recognition, summarization, text
completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
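As a small taste of treating natural language as data, the following base R
sketch splits a sentence into lowercase word tokens and counts how often
each occurs (the sentence is made up):
text <- "Data science is fun and data is everywhere"
tokens <- strsplit(tolower(text), "\\s+")[[1]]   # lowercase word tokens
sort(table(tokens), decreasing = TRUE)           # word frequencies
Real NLP techniques go far beyond word counts, but this is the kind of
normalization they all start from.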
Machine-generated data
Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention. Machine-generated data is becoming a major data resource
and will continue to do so.
Graph-based or network data
“Graph data” can be a confusing term because any data can be shown in a
chart. Here, “graph” comes from graph theory: a graph represents entities
as nodes and the relationships between them as edges. Examples of
graph-based data can be found on many social media
websites. For instance, on LinkedIn you can see who you know at which
company. Your follower list on Twitter is another example of graph-based
data.
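A minimal sketch of a follower network, assuming the igraph package is
installed (the account names are made up):
library(igraph)
edges <- matrix(c("alice", "bob",      # alice follows bob
                  "carol", "bob",      # carol follows bob
                  "bob",   "alice"),   # bob follows alice
                ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = TRUE)
degree(g, mode = "in")   # number of followers per account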
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a
data scientist. High-speed cameras at stadiums, for example, capture ball
and athlete movements in order to calculate, in real time, the path taken
by a defender relative to two baselines.
Streaming data
While streaming data can take almost any of the previous forms, it has an
extra property: the data flows into the system as events happen, rather
than being loaded into a data store in one large batch.
The data science process
The data science process typically consists of six steps:
Setting the research goal:
Data science is mostly applied in the context of an organization. The first
step is to draw up a project charter. This charter contains information such
as what you're going to research, how the company benefits from it, what
data and resources you need, a timetable, and deliverables.
Retrieving data:
The second step is to collect data. In this step you ensure that you can use
the data in your program, which means checking its existence, its quality,
and your access to it. Data can also be delivered by third-party
companies and takes many forms ranging from Excel spreadsheets to
different types of databases.
Data preparation:
Data collection is an error-prone process. This phase consists of three
sub-phases:
■ Data cleansing: removes false values from a data source and
inconsistencies across data sources.
■ Data integration: enriches data sources by combining information from
multiple data sources.
■ Data transformation: ensures that the data is in a suitable format for use
in your models.
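A minimal base R sketch of the three sub-phases on a made-up sales
table (all names and values here are hypothetical):
sales  <- data.frame(id = c(1, 2, 3), amount = c(100, -5, 250))
lookup <- data.frame(id = c(1, 2, 3), region = c("North", "South", "North"))
clean    <- sales[sales$amount >= 0, ]        # cleansing: drop an impossible value
combined <- merge(clean, lookup, by = "id")   # integration: combine two sources
combined$amount_log <- log(combined$amount)   # transformation: rescale for modeling
combined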
Data exploration:
Data exploration is concerned with building a deeper understanding of your
data. To achieve this you mainly use descriptive statistics, visual
techniques, and simple modeling. This step often goes by the abbreviation
EDA, for Exploratory Data Analysis.
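A minimal EDA sketch using base R and the built-in mtcars data set:
summary(mtcars$mpg)          # descriptive statistics for one variable
hist(mtcars$mpg)             # distribution of fuel efficiency
plot(mtcars$wt, mtcars$mpg)  # relationship between weight and mileage
Even these three lines reveal the shape of a variable and a candidate
relationship worth modeling.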
Data modeling or model building:
In this phase you use models, domain knowledge, and insights about the
data you found in the previous steps to answer the research question.
Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
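A minimal modeling sketch in base R, fitting a linear regression on the
built-in mtcars data and running basic diagnostics:
model <- lm(mpg ~ wt + hp, data = mtcars)  # select variables and execute the model
summary(model)                             # coefficients and fit statistics
plot(model)                                # diagnostic plots (residuals, leverage, ...)
In practice you would iterate: inspect the diagnostics, change the
variables, and refit.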
Presentation and automation:
Finally, you present the results to your business. These results can take
many forms, ranging from presentations to research reports.
History & Overview of R
R is a very powerful programming language widely used in the data
science world. It is a language and environment primarily built for
statistical computing and graphics, but thanks to its huge popularity it now
has enough provisions to implement machine learning and deep learning
algorithms in a fast and efficient manner.
R is a dialect of the S language. S was developed by John Chambers,
Rick Becker, and others at Bell Labs; Chambers received the ACM
Software System Award for it in 1998 and donated his prize money
(US$10,000) to the American Statistical Association. The S language was
started in 1976 as an internal statistical analysis environment,
implemented originally as Fortran libraries, and was rewritten in C in 1988.
The key idea behind the creation of the S language, and later R, was to
provide a language suitable both for interactive data analysis and for
writing longer programs.
Features of R:
1. Open-source
R is an open-source software environment. It is free of cost and can be
adjusted and adapted according to the user’s and the project’s requirements.
2. Strong Graphical Capabilities
R can produce production-quality static graphics and has extensive
libraries providing interactive graphics capabilities. This makes data
visualization and data representation very easy.
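For example, a minimal sketch using the popular ggplot2 package
(assuming it is installed); base R's plot() works out of the box as well:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")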
3. Highly Active Community
R's open-source libraries are supported by a large and growing
community of users.
4. A Wide Selection of Packages
R contains a sea of packages covering disciplines as varied as astronomy
and biology. While R was originally used for academic purposes, it is now
used in industry as well. CRAN, the Comprehensive R Archive Network,
houses more than 10,000 packages and extensions that help solve all
sorts of problems in data science.
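Installing a CRAN package and loading it takes two lines; dplyr here is just
one example of a popular package:
install.packages("dplyr")   # download from CRAN (needed only once)
library(dplyr)              # load the package for the current session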
5. Comprehensive Environment
R has a very comprehensive development environment, meaning it
supports both statistical computing and software development. R is an
object-oriented programming language. It also has a robust package
called shiny that can be used to produce full-fledged web apps.
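A minimal sketch of a shiny app, assuming the package is installed;
running it starts a local web server:
library(shiny)
ui <- fluidPage("Hello from R!")       # the page shown in the browser
server <- function(input, output) {}   # no server logic needed yet
shinyApp(ui, server)                   # launch the app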
6. Can Perform Complex Statistical Calculations
R can be used to perform simple and complex mathematical and statistical
calculations on data objects of a wide variety. It can also perform such
operations on large data sets.
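For example, a one-sample t-test on simulated data:
set.seed(42)                       # make the simulation reproducible
x <- rnorm(100, mean = 5, sd = 1)  # 100 draws from a normal distribution
t.test(x, mu = 4.5)                # test whether the true mean equals 4.5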
7. Distributed Computing
In distributed computing, tasks are split between multiple processing nodes to
reduce processing time and increase efficiency. R has packages like ddR and
multidplyr that enable it to use distributed computing to process large data
sets.
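The ddR and multidplyr APIs are beyond the scope of these notes; as a
minimal, related sketch, R's built-in parallel package can split work across
local CPU cores:
library(parallel)
cl <- makeCluster(2)                  # start two worker processes
parLapply(cl, 1:4, function(i) i^2)   # evaluate the function on the workers
stopCluster(cl)                       # shut the workers down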
8. Running Code Without a Compiler
R is an interpreted language which means that it does not need a compiler to
make a program from the code. R directly interprets provided code into
lower-level calls and pre-compiled code.
9. Interfacing with Databases
R contains several packages that enable it to interact with databases,
such as ROracle, RODBC (Open Database Connectivity), and RMySQL.
10. Machine Learning
R can be used for machine learning as well. R is at its best in machine
learning when used for exploration or for building one-off models.
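A minimal sketch of such an exploratory model: k-means clustering on the
built-in iris measurements:
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3)   # cluster the four measurements
table(fit$cluster, iris$Species)          # compare clusters with the true species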
11. Data Wrangling
Data wrangling is the process of cleaning complex and inconsistent data sets
to enable convenient computation and further analysis.
12. Cross-platform Support
R is machine-independent and supports cross-platform operation, so it
can be used on many different operating systems.
13. Compatible with Other Programming Languages
While most of its functions are written in R itself, C, C++ or FORTRAN can be
used for computationally heavy tasks. Java, .NET, Python, C, C++, and
FORTRAN can also be used to manipulate objects directly.
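As a minimal sketch of the C++ route, using the Rcpp package (an
assumption: Rcpp and a C++ compiler are installed; addTwo is a made-up
helper):
library(Rcpp)
cppFunction("int addTwo(int x, int y) { return x + y; }")  # compile a C++ helper
addTwo(2, 3)   # called like any R function; returns 5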
14. Data Handling and Storage
R can read from and write to most common data storage formats, which
makes data handling easy.
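For example, writing and reading a CSV file with base R (the filename is
made up):
write.csv(head(mtcars), "mtcars_head.csv", row.names = FALSE)  # write a small file
df <- read.csv("mtcars_head.csv")                              # read it back in
df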
R nuts and bolts
RStudio, the most widely used development environment for R, divides its
window into four panes:
Source:
The upper left window is a plain-text editor, like Notepad or TextEdit.
"Plain text" means no fonts or formatting, unlike a program like Microsoft
Word. You can have multiple files open at once, and they appear in tabs.
Depending on the type of the file being edited (i.e., its file extension),
there will be different tools and behavior, but it's all plain text.
Console:
The R console is where you give R commands; it is the lower left window
in RStudio. Interacting with it is the same as interacting with R on the
command line or in a terminal. In other words, the "Console" tab in the
lower left window is the only part of RStudio that is actually R itself;
everything else is optional tooling.
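For example, typing a command at the console prompt evaluates it
immediately and prints the result:
> x <- c(1, 2, 3)   # create a variable at the prompt
> mean(x)
[1] 2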
Environment:
The “Environment” tab in the top right window lists the variables and
functions present in the current R session. It does not include the
functions and data in loaded packages, however (unless you select a
package from the drop-down menu that says "Global Environment").
When you ask "what have I created so far?", the answer is in the
Environment tab.
File Browser:
The default tab in the lower right window is a basic file browser. You can
open, delete, and rename files there. It's not as well developed as your
operating system’s file browser and is mostly there so you don’t have to
switch applications to manage files. You can ignore the rest of the tabs
there for now (Plots, Packages, Help, and Viewer), since they are usually
automatically opened when they are relevant.