Big Data and Data Science

- The document discusses defining data science and big data, recognizing different types of data, and gaining insight into the data science process.
- It begins by defining big data and how it differs from traditional data management. It then defines data science as using methods to analyze massive amounts of data and extract knowledge.
- The document outlines the six main steps of the data science process: setting a research goal, retrieving data, data preparation, data exploration, data modeling/building, and presentation/automation.

Uploaded by

Aishwarya Jagtap


• Defining data science and big data

• Recognizing the different types of data

• Gaining insight into the data science process

Big data is a blanket term for any collection of data sets so large or
complex that it becomes difficult to process them using traditional
data management techniques such as relational database management
systems (RDBMS). The widely adopted RDBMS has long been regarded as a
one-size-fits-all solution, but the demands of handling big data have
shown otherwise. Data science involves using methods to analyze massive
amounts of data and extract the knowledge it contains. You can think of the
relationship between big data and data science as being like the
relationship between crude oil and an oil refinery. Data science and big
data evolved from statistics and traditional data management but are
now considered to be distinct disciplines.

The characteristics of big data are often referred to as the three Vs:

• Volume: How much data is there?

• Variety: How diverse are different types of data?

• Velocity: At what speed is new data generated?

Often these characteristics are complemented with a fourth V, veracity: How
accurate is the data? These four properties make big data different from the
data found in traditional data management tools. Consequently, the
challenges they bring can be felt in almost every aspect: data capture,
curation, storage, search, sharing, transfer, and visualization. In addition, big
data calls for specialized techniques to extract the insights.

Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from
computer science to the repertoire of statistics.

The main things that set a data scientist apart from a statistician are the
ability to work with big data and experience in machine learning, computing,
and algorithm building. Their tools tend to differ too, with data scientist job
descriptions more frequently mentioning the ability to use Hadoop, Pig,
Spark, R, Python, and Java, among others.

BENEFITS AND USES OF DATA SCIENCE AND BIG DATA

Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, competition, and
products. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings.

Human resource professionals use people analytics and text mining to
screen candidates, monitor the mood of employees, and study informal
networks among coworkers.

Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.

A data scientist in a governmental organization gets to work on diverse
projects such as detecting fraud and other criminal activity or optimizing
project funding.

The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.

FACETS OF DATA (ALSO REFER TO THE PPT)

The main categories of data are these:

• Structured
• Unstructured
• Machine-generated
• Graph-based
• Audio, video, and images

Structured data

Structured data is data that depends on a data model and resides in a fixed
field within a record. As such, it’s often easy to store structured data in tables
within databases or Excel files.

Figure 1.1. An Excel table is an example of structured data.

The world isn’t made up of structured data, though; structure is imposed upon it by
humans and machines. More often, data comes unstructured.
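
As a minimal sketch of what "a fixed field within a record" looks like in practice, the Python snippet below reads a small table with the standard csv module. The table contents and column names are invented for illustration and are not taken from Figure 1.1.

```python
import csv
import io

# Hypothetical table, invented for illustration: every record has the
# same three fixed fields (id, name, age).
TABLE = """id,name,age
1,Ann,34
2,Bob,45
3,Cy,29
"""

# Each row becomes a dict keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(TABLE)))
fields = list(rows[0].keys())

# Because the "age" field sits in the same place in every record,
# aggregating it is a one-liner.
ages = [int(row["age"]) for row in rows]
average_age = sum(ages) / len(ages)

print(fields)
print(average_age)
```

Because every record follows the same data model, a question such as "what is the average age?" needs no interpretation of the content, only a lookup of the right field.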

Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is
your regular email (figure 1.2). Although email contains structured elements
such as the sender, title, and body text, it’s a challenge to find the number of
people who have written an email complaint about a specific employee
because so many ways exist to refer to a person, for example. The thousands
of different languages and dialects out there further complicate this.
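
The email example can be sketched in Python with the standard email module. The three messages, the addresses, and the employee name "John Smith" are all hypothetical. The sketch shows that the structured elements (sender, subject) parse reliably, while a naive search of the free-text body misses most of the complaints.

```python
from email import message_from_string

# Three hypothetical complaint emails about the same (invented) employee.
RAW_EMAILS = [
    "From: a@example.com\nSubject: Complaint\n\n"
    "John Smith was rude to me on the phone.\n",
    "From: b@example.com\nSubject: Service issue\n\n"
    "I want to complain about Mr. Smith in billing.\n",
    "From: c@example.com\nSubject: Unhappy\n\n"
    "Your colleague Johnny S. ignored my request.\n",
]

messages = [message_from_string(raw) for raw in RAW_EMAILS]

# Structured elements: header fields are easy to extract.
senders = [msg["From"] for msg in messages]

# Unstructured element: an exact-match search on the body text only
# finds the one email that happens to use the full name.
naive_hits = sum("John Smith" in msg.get_payload() for msg in messages)

print(senders)
print(naive_hits)
```

All three emails concern the same person, yet the exact-match search counts only one of them, which is the difficulty the paragraph above describes.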

Machine-generated data

Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human
intervention.

The analysis of machine data relies on highly scalable tools, due to its high
volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry (figure 1.3).

Figure 1.3. Example of machine-generated data.
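
To make the web server log example concrete, here is a small Python sketch that parses a few log lines and counts HTTP status codes. The log lines are invented, and the layout is assumed to resemble the common access-log format; real logs may differ.

```python
import re
from collections import Counter

# Hypothetical access-log lines (invented; layout is an assumption).
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:55:40 +0000] "GET /missing HTTP/1.1" 404 512',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 200 128',
]

# One named group per field we want to pull out of each line.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

records = []
for line in LOG_LINES:
    match = PATTERN.match(line)
    if match:  # silently skip lines that do not fit the assumed layout
        records.append(match.groupdict())

# Count how often each HTTP status code occurs.
status_counts = Counter(record["status"] for record in records)
print(status_counts)
```

A loop like this is fine for a handful of lines; at the volume and speed mentioned above, the same parsing step would run on a scalable tool instead.
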

Graph-based or network data

Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects. Graph structures use nodes, edges, and properties
to represent and store graph data.

Figure 1.4. Friends in a social network are an example of graph-based data.

Graph databases are used to store graph-based data.
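
The idea of nodes, edges, and properties can be sketched in plain Python. The people, ages, and friendships below are invented, and the helper friends_of_friends is a hypothetical name; a graph database would store and query the same structure at scale.

```python
# Nodes with properties (invented people and ages).
nodes = {
    "ann": {"age": 34},
    "bob": {"age": 45},
    "cy": {"age": 29},
    "dee": {"age": 51},
}

# Undirected friendship edges between nodes.
edges = [("ann", "bob"), ("bob", "cy"), ("cy", "dee")]

# Build an adjacency map: for each person, the set of direct friends.
adjacency = {name: set() for name in nodes}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def friends_of_friends(person):
    """Return people exactly two steps away from `person`."""
    direct = adjacency[person]
    two_steps = set()
    for friend in direct:
        two_steps |= adjacency[friend]
    # Exclude the person and anyone who is already a direct friend.
    return two_steps - direct - {person}


print(friends_of_friends("ann"))
```

Questions about adjacency, such as "who is a friend of a friend?", fall out of the edge structure directly, which is what distinguishes graph data from a flat table.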

Audio, image, and video

Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.

THE DATA SCIENCE PROCESS

The data science process typically consists of six steps, as you can see in the
mind map.

Setting the research goal

Data science is mostly applied in the context of an organization. When the
business asks you to perform a data science project, you’ll first prepare a
project charter. This charter contains information such as what you’re going
to research, how the company benefits from that, what data and resources
you need, a timetable, and deliverables.

Retrieving data

The second step is to collect data. You’ve stated in the project charter which
data you need and where you can find it. In this step you ensure that you can
use the data in your program, which means checking the existence and quality of,
and access to, the data. Data can also be delivered by third-party companies
and takes many forms ranging from Excel spreadsheets to different types of
databases.

Data preparation

Data collection is an error-prone process; in this phase you enhance the
quality of the data and prepare it for use in subsequent steps. This phase
consists of three subphases: data cleansing removes false values from a data
source and inconsistencies across data sources, data integration enriches
data sources by combining information from multiple data sources, and data
transformation ensures that the data is in a suitable format for use in your
models.
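
The three subphases can be sketched on a toy example in Python. The customer records, the order totals, the age limit of 120, and the choice of min-max scaling are all assumptions made for illustration; a real project would pick its own cleansing rules and transformations.

```python
# Source 1: hypothetical customer records with typical entry errors.
customers = [
    {"id": 1, "country": "NL", "age": 34},
    {"id": 2, "country": "nl ", "age": 290},  # impossible age
    {"id": 3, "country": " be", "age": 41},   # inconsistent spelling
]

# Source 2: hypothetical total order value per customer id.
order_totals = {1: 120.0, 3: 80.0}

# Data cleansing: drop records with impossible ages (assumed limit: 120)
# and normalize the spelling of the country code.
cleaned = [
    {**c, "country": c["country"].strip().upper()}
    for c in customers
    if 0 <= c["age"] <= 120
]

# Data integration: enrich each record with data from the second source.
integrated = [
    {**c, "order_total": order_totals.get(c["id"], 0.0)} for c in cleaned
]

# Data transformation: rescale age to the range [0, 1] (min-max scaling).
ages = [c["age"] for c in integrated]
low, high = min(ages), max(ages)
prepared = [
    {**c, "age_scaled": (c["age"] - low) / (high - low)} for c in integrated
]

for record in prepared:
    print(record)
```

Dropping the whole record is only one way to handle a false value; depending on the research goal, correcting or imputing the value may be preferable.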

Data exploration

Data exploration is concerned with building a deeper understanding of your
data. You try to understand how variables interact with each other, the
distribution of the data, and whether there are outliers. To achieve this you
mainly use descriptive statistics, visual techniques, and simple modeling.
This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
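
A minimal sketch of the descriptive-statistics side of EDA, using Python's standard statistics module. The sample values are invented, and the 1.5 × IQR rule used to flag outliers is one common convention, assumed here for illustration.

```python
import statistics

# Hypothetical measurements; one value looks suspiciously large.
values = [12, 14, 15, 15, 16, 18, 19, 95]

# Descriptive statistics: the mean is pulled up by the large value,
# while the median is not.
mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles and interquartile range (IQR) describe the spread.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(mean, median)
print(outliers)
```

The gap between the mean and the median is already a hint that the distribution is skewed by an outlier; a box plot or histogram would show the same thing visually.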

Data modeling or model building

In this phase you use models, domain knowledge, and insights about the data
you found in the previous steps to answer the research question. You select
a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves
selecting the variables for the model, executing the model, and model
diagnostics.
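
The iterative loop of selecting variables, executing the model, and running diagnostics can be illustrated with the simplest possible case: a one-variable linear regression fitted by ordinary least squares. The data points are invented, and linear regression is only one of many techniques that could be selected in this phase.

```python
# Variable selection: one predictor x and one target y (invented data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Model execution: closed-form least-squares estimates.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Model diagnostics: R squared compares the residual error of the
# model with the total variation in y.
predictions = [intercept + slope * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 2), round(intercept, 2), round(r_squared, 4))
```

If the diagnostics were poor, the next iteration would go back to the first step and try other variables or another technique.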

Presentation and automation

Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports. Sometimes you’ll
need to automate the execution of the process because the business will
want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
