Big Data and Data Science
Big data is a blanket term for any collection of data sets so large or
complex that they become difficult to process using traditional
data management techniques such as relational database management
systems (RDBMS). The widely adopted
RDBMS has long been regarded as a one-size-fits-all solution, but the
demands of handling big data have shown otherwise. Data
science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. You can think of the
relationship between big data and data science as being like the
relationship between crude oil and an oil refinery. Data science and big
data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.
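The characteristics of big data are often referred to as the three Vs:
volume (how much data there is), variety (how diverse the types of data
are), and velocity (the speed at which new data is generated).
The main categories of data you’ll encounter include these: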
• Structured
• Unstructured
• Machine-generated
• Graph-based
• Audio, video, and images
Structured data
Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in tables
within databases or Excel files (figure 1.1).
Figure 1.1. An Excel table is an example of structured data.
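To make "a fixed field within a record" concrete, here is a minimal Python sketch that reads a small table in which every row has the same named columns. The column names and values are made up for illustration; they don’t come from figure 1.1.

```python
import csv
import io

# Hypothetical table: every record has the same fixed fields.
raw_table = """id,name,age
1,Alice,34
2,Bob,45
3,Carol,29
"""

# DictReader maps each row onto the column names from the header line.
records = list(csv.DictReader(io.StringIO(raw_table)))

# Because the fields are fixed, a query is a simple lookup by column name.
ages = [int(r["age"]) for r in records]
oldest = max(records, key=lambda r: int(r["age"]))["name"]
```

Because each value sits in a known field, questions such as "who is the oldest person?" reduce to a lookup on a column name.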
The world isn’t made up of structured data, though; structure is imposed on it
by humans and machines. More often, data comes unstructured.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email (figure 1.2). Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find, for example,
the number of people who have written an email complaint about a specific
employee, because so many ways exist to refer to a person. The thousands
of different languages and dialects out there further complicate this.
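The difficulty of counting complaints about one employee can be sketched in a few lines of Python. The email texts and the employee name below are invented, and the regular expression is only one hand-written guess at the variants that might occur.

```python
import re

# Hypothetical email bodies, all complaining about the same (made-up) employee.
emails = [
    "I want to complain about John Smith from support.",
    "Mr. Smith was rude to me on the phone.",
    "Your colleague Johnny S. never returned my call.",
]

# Naive approach: exact match on the full name.
exact = [e for e in emails if "John Smith" in e]

# Slightly broader: a hand-written pattern for a few known variants.
pattern = re.compile(r"\b(John Smith|Mr\.? Smith|Johnny S\.)", re.IGNORECASE)
broader = [e for e in emails if pattern.search(e)]
```

The exact match finds only one of the three emails. The broader pattern finds all three, but only because the variants were known in advance; every new nickname, misspelling, or language would need another rule.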
Machine-generated data
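Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
Because of its high volume and speed, analyzing machine data relies on highly
scalable tools. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry (figure 1.3).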
Figure 1.3. Example of machine-generated data
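As a small-scale illustration of working with machine data, the sketch below parses a few web server log lines and counts the response status codes. The log lines are made up and are assumed to follow the Common Log Format; real logs often use other layouts, and at real volumes you would reach for a scalable tool rather than a Python list.

```python
import re
from collections import Counter

# Made-up web server log lines, assumed to be in Common Log Format.
log_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2023:13:55:39 +0000] "POST /login HTTP/1.1" 200 512',
]

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Keep only the lines that could be parsed, then count status codes.
parsed = [p for p in map(parse_line, log_lines) if p is not None]
status_counts = Counter(p["status"] for p in parsed)
```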
Graph-based or network data
Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects. Graph structures use nodes, edges, and properties
to represent and store graph data.
Figure 1.4. Friends in a social network are an example of graph-based data.
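Nodes, edges, and properties can be sketched with plain Python dictionaries. The people, cities, and friendships below are hypothetical, and the adjacency-list layout is just one simple way to hold a graph in memory, not the way a dedicated graph database stores it.

```python
# Hypothetical social network: nodes are people, edges are friendships,
# and properties are stored per node.
properties = {
    "Ann": {"city": "Ghent"},
    "Ben": {"city": "Leuven"},
    "Cid": {"city": "Ghent"},
    "Dee": {"city": "Bruges"},
}
edges = [("Ann", "Ben"), ("Ben", "Cid"), ("Cid", "Dee")]

# Build an undirected adjacency list from the edge list.
adjacency = {name: set() for name in properties}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def friends_of_friends(person):
    """People two steps away who are not already direct friends."""
    direct = adjacency[person]
    two_steps = set()
    for friend in direct:
        two_steps |= adjacency[friend]
    return two_steps - direct - {person}
```

A question such as "who are Ann’s friends of friends?" is answered by following edges, which is the kind of relationship query that graph data is organized around.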
Retrieving data
The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can
use the data in your program, which means checking that the data exists, that
its quality is adequate, and that you have access to it. Data can also be
delivered by third-party companies
and takes many forms ranging from Excel spreadsheets to different types of
databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases: data cleansing removes false values from a data
source and inconsistencies across data sources; data integration enriches
data sources by combining information from multiple data sources; and data
transformation ensures that the data is in a suitable format for use in your
models.
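The three subphases can be sketched on a toy example. Everything below is made up for illustration: the two sources, the field names, and the rule that a negative amount counts as a false value.

```python
# Hypothetical raw data from two sources (all values are made up).
sales = [
    {"customer_id": 1, "amount": "120.50"},
    {"customer_id": 2, "amount": "-1"},      # impossible value: negative sale
    {"customer_id": 3, "amount": "80.00"},
]
customers = {1: "Belgium", 2: "France", 3: "belgium "}  # inconsistent spelling

# 1. Data cleansing: drop false values and fix inconsistencies.
clean_sales = [row for row in sales if float(row["amount"]) >= 0]
clean_customers = {cid: c.strip().title() for cid, c in customers.items()}

# 2. Data integration: enrich each sale with the customer's country.
integrated = [
    {**row, "country": clean_customers[row["customer_id"]]} for row in clean_sales
]

# 3. Data transformation: convert to the types and shape a model expects.
model_input = [(row["country"], float(row["amount"])) for row in integrated]
```

In a real project each subphase is usually far more involved, but the order stays the same: clean first, then combine, then reshape.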
Data modeling
In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select
a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and running model
diagnostics.
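One pass through that loop might look like the sketch below. It assumes the simplest possible case: a single input variable, a straight-line model fitted with ordinary least squares, and R-squared as the only diagnostic. The observations are invented, and the technique is chosen only for brevity, not because the text prescribes it.

```python
# Made-up observations: advertising budget (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(xs, ys))
    ss_tot = sum((b - mean_y) ** 2 for b in ys)
    return 1 - ss_res / ss_tot

slope, intercept = fit_line(x, y)                 # execute the model
fit_quality = r_squared(x, y, slope, intercept)   # model diagnostics
```

If the diagnostic were poor, you would go back, pick different variables or a different technique, and run the loop again.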
Presentation and automation
Finally, you present the results to the business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll
need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.