Big Data and Data Science

- The document discusses defining data science and big data, recognizing different types of data, and gaining insight into the data science process. - It begins by defining big data and how it differs from traditional data management. It then defines data science as using methods to analyze massive amounts of data and extract knowledge. - The document outlines the six main steps of the data science process: setting a research goal, retrieving data, data preparation, data exploration, data modeling/building, and presentation/automation.



• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process

Big data is a blanket term for any collection of data sets so large or
complex that they become difficult to process using traditional data
management techniques such as relational database management
systems (RDBMS). The widely adopted
RDBMS has long been regarded as a one-size-fits-all solution, but the
demands of handling big data have shown otherwise. Data
science involves using methods to analyze massive amounts of data
and extract the knowledge it contains. You can think of the
relationship between big data and data science as being like the
relationship between crude oil and an oil refinery. Data science and big
data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.

The characteristics of big data are often referred to as the three Vs:

• Volume —How much data is there?
• Variety —How diverse are the different types of data?
• Velocity —At what speed is new data generated?

Often these characteristics are complemented with a fourth V, veracity: How
accurate is the data? These four properties make big data different from the
data found in traditional data management tools. Consequently, the
challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big
data calls for specialized techniques to extract the insights.

Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from
computer science to the repertoire of statistics.
The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning, computing,
and algorithm building. Their tools tend to differ too, with data scientist job
descriptions more frequently mentioning the ability to use Hadoop, Pig,
Spark, R, Python, and Java, among others.
BENEFITS AND USES OF DATA SCIENCE AND BIG DATA
Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and
products. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.
Human resource professionals use people analytics and text mining to
screen candidates, monitor the mood of employees, and study informal
networks among coworkers.
Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.
A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or optimizing
project funding.
The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.

FACETS OF DATA (ALSO REFER TO PPT)

The main categories of data are these:

• Structured
• Unstructured
• Machine-generated
• Graph-based
• Audio, video, and images

Structured data

Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in tables
within databases or Excel files.
Figure 1.1. An Excel table is an example of structured data.
The world isn’t made up of structured data, though; it’s imposed upon it by
humans and machines. More often, data comes unstructured.
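As a minimal sketch (the names and values below are made up for illustration), structured data means every record has the same fixed fields, so it maps directly onto table rows and can be queried by field:

```python
# Structured data: each record has the same fixed fields,
# like rows in a database table or an Excel sheet.
rows = [
    {"id": 1, "name": "Alice", "city": "Pune"},
    {"id": 2, "name": "Bob", "city": "Mumbai"},
]

# Because the fields are fixed, queries are straightforward:
pune_names = [r["name"] for r in rows if r["city"] == "Pune"]
print(pune_names)  # ['Alice']
```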

Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email (figure 1.2). Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find the number of
people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands
of different languages and dialects out there further complicate this.
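The difficulty described above can be sketched with a toy example (the emails and the employee name are invented): a naive substring search misses most of the ways people refer to the same person.

```python
# Three complaints about the same (fictional) employee,
# each referring to him differently.
emails = [
    "I want to complain about Mr. John Smith.",
    "J. Smith was very rude yesterday.",
    "Smith, John handled my case poorly.",
]

# A naive search for the canonical name finds only one of the three.
naive_hits = sum("John Smith" in e for e in emails)
print(naive_hits)  # 1
```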

Machine-generated data

Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.

The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry (figure 1.3).
Figure 1.3. Example of machine-generated data
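Because machine data follows a regular, machine-imposed layout, it can be parsed automatically. Here is a sketch using a single made-up web-server log line in a common access-log format:

```python
import re

# One (fabricated) web-server access-log line.
line = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512'

# The layout is regular, so a pattern can pull out the fields.
pattern = r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]+)" (?P<status>\d+) (?P<size>\d+)'
m = re.match(pattern, line)
print(m.group("ip"), m.group("status"))  # 203.0.113.7 200
```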

Graph-based or network data

Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects. The graph structures use nodes, edges, and properties
to represent and store graphical data.
Figure 1.4. Friends in a social network are an example of graph-
based data.

Graph databases are used to store graph-based data.
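A social network like the one in figure 1.4 can be sketched as nodes and edges using an adjacency list (the names here are invented); adjacency queries then become cheap lookups:

```python
# Friendships stored as an adjacency list:
# each node maps to the set of nodes it is connected to.
friends = {
    "Ann": {"Bob", "Carol"},
    "Bob": {"Ann"},
    "Carol": {"Ann"},
}

# Who are Ann's friends? A single lookup answers the adjacency question.
print(sorted(friends["Ann"]))  # ['Bob', 'Carol']
```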

Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.

THE DATA SCIENCE PROCESS


The data science process typically consists of six steps, as you can see in the
mind map.

Setting the research goal


Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a
project charter. This charter contains information such as what you’re going
to research, how the company benefits from that, what data and resources
you need, a timetable, and deliverables.

Retrieving data

The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can
use the data in your program, which means checking the data’s existence,
quality, and accessibility. Data can also be delivered by third-party companies
and takes many forms ranging from Excel spreadsheets to different types of
databases.
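As a minimal sketch of this step, tabular data delivered as a CSV file can be read and checked with the standard library. A real project would open a file or query a database; here an in-memory string stands in so the example is self-contained, and the column names are made up:

```python
import csv
import io

# Stand-in for a delivered CSV file (a real project would open a file).
raw = "id,amount\n1,10\n2,25\n"

# Read the records and do a basic existence/quality check.
reader = csv.DictReader(io.StringIO(raw))
records = list(reader)
print(len(records))  # 2
```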

Data preparation
Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases:

• Data cleansing removes false values from a data source and
inconsistencies across data sources.
• Data integration enriches data sources by combining information from
multiple data sources.
• Data transformation ensures that the data is in a suitable format for use
in your models.
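The three subphases can be sketched on toy records (all values and field names are invented): cleansing drops a duplicate and a false value, integration pulls in a second source, and transformation casts strings to numbers.

```python
# Raw sales records: one false value (-5) and one duplicate id.
sales = [
    {"id": 1, "amount": "10"},
    {"id": 2, "amount": "-5"},
    {"id": 1, "amount": "10"},
]
customers = {1: "Alice", 2: "Bob"}  # a second data source

# Cleansing: drop impossible values and duplicate records.
seen, clean = set(), []
for r in sales:
    if int(r["amount"]) >= 0 and r["id"] not in seen:
        seen.add(r["id"])
        clean.append(r)

# Integration: enrich with customer names from the second source.
# Transformation: cast amounts to numbers for use in models.
prepared = [{"name": customers[r["id"]], "amount": int(r["amount"])} for r in clean]
print(prepared)  # [{'name': 'Alice', 'amount': 10}]
```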

Data exploration

Data exploration is concerned with building a deeper understanding of your
data. You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers. To achieve this you
mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
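A tiny EDA sketch with made-up values shows how descriptive statistics reveal an outlier: a mean pulled far above the median suggests a skewed value worth inspecting.

```python
import statistics

# Made-up sample with one suspicious value.
values = [2, 3, 3, 4, 5, 21]

mean = statistics.mean(values)
median = statistics.median(values)
# The mean (about 6.33) sits well above the median (3.5),
# hinting that 21 is an outlier.
print(round(mean, 2), median)  # 6.33 3.5
```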

Data modeling or model building

In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select
a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and model
diagnostics.
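As a minimal sketch of model building (the data is fabricated and deliberately noise-free), here a line y = a·x + b is fit by ordinary least squares, followed by a simple diagnostic on the residuals:

```python
# Toy data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

# Ordinary least squares for slope a and intercept b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Model diagnostics: residuals should be near zero for a good fit.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
print(a, b, max(abs(r) for r in residuals))  # 2.0 1.0 0.0
```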

Presentation and automation

Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll
need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
