Big Data CH 1
1
Chapter 1: Introduction to Big Data
2
Big Data Overview
• Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.
• Big Data refers to datasets whose size is beyond the ability of typical database
software tools to capture, store, manage and analyze.
3
4
Big Data Everywhere!
Lots of data is being collected and warehoused:
• Web data, e-commerce
• Purchases at department/grocery stores
• Bank/Credit Card transactions
• Social Network
5
Who’s Generating Big Data
6
Mobile devices
(tracking all objects all the time)
Social media and networks
(all of us are generating data)
7
How much data?
https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/
8
Disk Storage
1 Bit = 1 Binary Digit
8 Bits = 1 Byte
1000 Bytes = 1 Kilobyte
1000 Kilobytes = 1 Megabyte
1000 Megabytes = 1 Gigabyte
1000 Gigabytes = 1 Terabyte
1000 Terabytes = 1 Petabyte
1000 Petabytes = 1 Exabyte
1000 Exabytes = 1 Zettabyte
1000 Zettabytes = 1 Yottabyte
1000 Yottabytes = 1 Brontobyte (informal name, not an SI unit)
1000 Brontobytes = 1 Geopbyte (informal name, not an SI unit)
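The unit ladder above can be sketched in a few lines of Python. This is only an illustration, assuming the decimal (powers of 1000) convention used on this slide; the function names to_bytes and humanize are made up for the example, and the informal units beyond the yottabyte are left out.

```python
# Decimal (powers of 1000) storage units, in the order listed on the slide.
# Informal names beyond "yottabyte" are deliberately omitted.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def to_bytes(value, unit):
    """Convert a quantity in the given unit to a number of bytes."""
    exponent = UNITS.index(unit.lower())
    return value * 1000 ** exponent


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the last unit in the list.
        if value < 1000 or unit == UNITS[-1]:
            plural = "" if value == 1 else "s"
            return f"{value:g} {unit}{plural}"
        value /= 1000


if __name__ == "__main__":
    print(to_bytes(1, "terabyte"))
    print(humanize(2_500_000_000))
```

Each step up the ladder multiplies by 1000, so a unit's position in the list is also its exponent.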
9
Understanding Numbers
Name          Number of Zeros   Groups of (3) Zeros
Ten           1                 (10)
Hundred       2                 (100)
Thousand      3                 1 (1,000)
Million       6                 2 (1,000,000)
Billion       9                 3 (1,000,000,000)
Trillion      12                4 (1,000,000,000,000)
Quadrillion   15                5
Quintillion   18                6
Sextillion    21                7
Septillion    24                8
Octillion     27                9
Nonillion     30                10
...
10
3Vs of Big Data
Big Data Technology is a new set of approaches for analyzing data sets that
were not previously accessible because they posed challenges across one or
more of the “3 V’s” of Big Data:
• Volume (too big): terabytes and more of credit card transactions, web
usage data, system logs
• Variety (too complex): truly unstructured data such as social media,
customer reviews, call center records
• Velocity (too fast): sensor data, live web traffic, mobile phone usage,
GPS data
11
Some Make it 4V’s
12
13
Volume
14
Variety
15
Velocity
16
Veracity
• With many forms of big data, quality and accuracy are less
controllable (just think of Twitter posts with hashtags, abbreviations,
typos and colloquial speech, as well as the reliability and accuracy
of content), but big data and analytics technology now allows us to work
with these types of data.
17
Value
Having access to big data is no good unless we can turn it into value.
Companies are starting to generate amazing value from their big data.
18
The 7 Vs of Big Data
Validity
• Interpreted data has a sound basis in logic or fact, as a result of
the logical inferences from matching data.
Volume - Validity = Worthlessness
Visibility
• The state of being able to see or be seen – is implied
https://livingstoneadvisory.com/2013/06/vs-big-data/
19
The 10 Vs of Big Data
Variability
• Big data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources.
Vulnerability
• Big data brings new security concerns. After all, a data breach with big data is a big
breach.
Volatility
• How old does your data need to be before it is considered irrelevant, historic, or not
useful any longer? How long does data need to be kept for?
https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
20
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
Challenges in Handling Big Data
• The bottleneck is in technology
• New architectures, algorithms and techniques are needed
• Also in technical skills
• Experts in using the new technology and dealing with big data
22
Big Data Challenges
23
Background of Data Analytics
• The primary goal of big data analytics is to help companies make better
business decisions.
24
Big Data Consists of
26
Data Analytics
• The technologies associated with big data analytics
include NoSQL databases, Hadoop and MapReduce.
• These technologies form the core of an open-source
software framework that supports the processing of
large data sets across clustered systems.
• Pitfalls for Big Data analytics initiatives include
• Lack of internal data analytics skills
• High cost of hiring experienced analytics
professionals
• Challenges in integrating Hadoop systems and data
warehouses
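To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any real cluster API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are chosen for this illustration only, and an ordinary loop stands in for the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a
    # different node; here the "nodes" are simulated by a plain loop.
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

The map and reduce steps share no state, which is why a framework can run them on many machines at once.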
27
• Big Analytics delivers competitive advantage
compared to the traditional analytical model.
28
Big Analytics supports the following objectives for working with Big Data Analytics:
29
The Process of Data Analytics
30
31
Data Analytics Process: Discovery
32
Data Analytics Process: Discovery
Acquisition
• Data acquisition involves collecting or acquiring data for analysis.
• Acquisition requires access to information and a mechanism for
gathering it.
Pre-processing
• Pre-processing is necessary if analytics is to yield trustworthy, useful
results.
• It places the data in a standard format for analysis.
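As a rough illustration of placing data in a standard format, the sketch below normalizes raw purchase records. The field names (customer, date, amount) and the two accepted date layouts are assumptions made for this example, not part of the slides.

```python
from datetime import datetime

# Hypothetical date layouts that the raw sources are assumed to use.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def parse_date(text):
    """Try each assumed layout; return an ISO 8601 date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def preprocess(record):
    """Place one raw record in a standard format; return None if unusable."""
    date = parse_date(record.get("date", ""))
    try:
        # Remove thousands separators before converting to a number.
        amount = float(str(record.get("amount", "")).replace(",", ""))
    except ValueError:
        amount = None
    if date is None or amount is None:
        return None  # drop records that cannot be trusted
    return {
        "customer": record.get("customer", "").strip().lower(),
        "date": date,
        "amount": round(amount, 2),
    }


if __name__ == "__main__":
    raw = {"customer": " Alice ", "date": "05/03/2018", "amount": "1,250.50"}
    print(preprocess(raw))
```

Records that fail to parse are dropped rather than guessed at, since later analysis is only as trustworthy as its input.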
33
Data Analytics Process: Discovery
Integration
• Integration involves consolidating data for analysis.
• Retrieving relevant data from various sources for analysis
• Eliminating redundant data or clustering data to obtain a smaller
representative sample.
Analysis
• Searching for relationships between data items in a database or
exploring data in search of classifications or associations.
• Analysis can yield descriptions or predictions.
• Based on interpretation of the analysis, organizations can determine
whether and how to act on the results.
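The integration and analysis steps can be sketched together: consolidate records from several sources, remove redundant ones, then search for associations between items. The record layout (order_id, items) and the function names are hypothetical, and counting co-occurring pairs is only one simple way to look for associations.

```python
from collections import Counter
from itertools import combinations


def integrate(*sources):
    """Consolidate records from several sources, dropping exact duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for basket in source:
            key = (basket["order_id"], frozenset(basket["items"]))
            if key not in seen:
                seen.add(key)
                merged.append(basket)
    return merged


def pair_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "milk") and ("milk", "bread") the same pair.
        for pair in combinations(sorted(set(basket["items"])), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    web = [{"order_id": 1, "items": ["bread", "milk"]},
           {"order_id": 2, "items": ["bread", "butter", "milk"]}]
    store = [{"order_id": 2, "items": ["bread", "butter", "milk"]},
             {"order_id": 3, "items": ["milk", "eggs"]}]
    merged = integrate(web, store)
    print(pair_counts(merged).most_common(1))
```

The most frequent pairs are descriptions of past behavior; whether they justify a prediction or an action is decided in the interpretation step.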
34
Data Analytics Process: Discovery
Interpretation
• Analytic processes are reviewed by data scientists to understand results
and how they were determined.
• Interpretation involves retracing methods, understanding choices made
throughout the process and critically examining the quality of the
analysis.
• It provides the foundation for decisions about whether analytic
outcomes are trustworthy.
35
Data Analytics Process: Application
Application
• Associations discovered amongst data in the knowledge phase of the
analytic process are incorporated into an algorithm and applied.
• In the application phase organizations gather the benefits of knowledge
discovery.
• Through application of derived algorithms, organizations make
determinations upon which they can act.
36
A Brief History of Big Data
18,000 BCE
• Humans use tally sticks to record data for the first time. These are used
to track trading activity and record inventory.
2400 BCE
• The abacus is developed, and the first libraries are built in Babylonia.
300 BCE – 48 AD
• The Library of Alexandria is the world’s largest data storage center –
until it is destroyed by the Romans.
37
A Brief History of Big Data
200 BCE – 100 BCE
• The Antikythera Mechanism, the first mechanical computer, is developed in Greece.
1663
• John Graunt conducts the first recorded statistical-analysis experiments in an
attempt to curb the spread of the bubonic plague in Europe.
1865
• The term “business intelligence” is used by Richard Millar Devens in his
Cyclopaedia of Commercial and Business Anecdotes.
38
A Brief History of Big Data
1881
• Herman Hollerith creates the Hollerith Tabulating Machine which uses
punch cards to vastly reduce the workload of the US Census.
1926
• Nikola Tesla predicts that in the future, a man will be able to
access and analyze vast amounts of data using a device small enough to
fit in his pocket.
1928
• Fritz Pfleumer creates a method of storing data magnetically, which
forms the basis of modern digital data storage technology.
39
A Brief History of Big Data
1944
• Fremont Rider speculates that Yale Library will contain 200 million
books stored on 6,000 miles of shelves, by 2040.
1958
• Hans Peter Luhn defines Business Intelligence as “the ability to
apprehend the interrelationships of presented facts in such a way as to
guide action towards a desired goal.”
1965
• The US Government plans the world’s first data center to store 742
million tax returns and 175 million sets of fingerprints on magnetic
tape.
40
A Brief History of Big Data
1970
• The relational database model is developed by IBM mathematician Edgar F. Codd.
• The hierarchical file system allows records to be accessed using a simple index
system. This means anyone can use databases, not just computer scientists.
1976
• Material Requirements Planning (MRP) systems are commonly used in business.
Computers and data storage are used for everyday routine tasks.
1989
• Early use of the term Big Data in a magazine article by author Erik Larson,
commenting on advertisers’ use of data to target customers.
41
A Brief History of Big Data
1991
• The birth of the World Wide Web. Anyone can now go online and upload their
own data, or analyze data uploaded by other people.
1996
• The price of digital storage falls to the point where it is more cost-
effective than paper.
1997
• Google launch their search engine which will quickly become the most
popular in the world.
• Michael Lesk estimates the digital universe is increasing tenfold in
size every year.
42
A Brief History of Big Data
1999
• First use of the term Big Data in an academic paper – Visually Exploring Gigabyte
Datasets in Real-time (ACM).
• First use of term Internet of Things, in a business presentation by Kevin Ashton to
Procter and Gamble.
2001
• Three “Vs” of Big Data – Volume, Velocity, Variety – defined by Doug Laney.
2003
• Google File System paper published
2005
• Hadoop, an open-source Big Data framework now maintained by Apache, is
created.
• The birth of “Web 2.0 – the user-generated web”.
43
A Brief History of Big Data
2008
• Globally 9.57 zettabytes (9.57 trillion gigabytes) of information is processed by
the world’s CPUs.
• An estimated 14.7 exabytes of new information is produced this year.
2009
• The average US company with over 1,000 employees is storing more than 200
terabytes of data according to the report Big Data: The Next Frontier for
Innovation, Competition and Productivity by McKinsey Global Institute.
2010
• Eric Schmidt, executive chairman of Google, tells a conference that as much data
is now being created every two days, as was created from the beginning of
human civilization to the year 2003.
44
A Brief History of Big Data
2011
• The McKinsey report states that by 2018 the US will face a shortfall
of between 140,000 and 190,000 professional data scientists, and
warns that issues including privacy, security and intellectual
property will have to be resolved before the full value of Big
Data will be realized.
2014
• Mobile internet use overtakes desktop for the first time.
• 88% of executives responding to an international survey by GE say
that big data analysis is a top priority.
45
Role of Distributed System in Big Data
What is a Distributed System?
• Consists of a collection of autonomous computers, connected
through a network and distribution middleware
• Enables computers to coordinate their activities and to share the
resources of the system
• Users perceive the system as a single, integrated computing
facility.
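The three properties above can be imitated on one machine. In this hedged sketch, threads from Python's standard concurrent.futures module stand in for autonomous computers; a real distributed system would add networking, middleware and failure handling. The function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def worker_sum(chunk):
    """Work done independently by one (simulated) node."""
    return sum(chunk)


def distributed_sum(data, nodes=4):
    """Scatter chunks to workers, then gather and combine partial results."""
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(worker_sum, partition(data, nodes)))
    return sum(partials)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))
```

The caller only sees distributed_sum, which mirrors the idea that users perceive the system as a single facility even though the work is spread across several workers.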
46
Big data is distributed data
47
Distributed data generation is fueling big data growth
• Most of us walk around carrying devices that are constantly pulsing all
sorts of data into the cloud and beyond – our locations, our photos, our
tweets, our status updates, our connections, even our heartbeats.
48
“Hadoop” and “MapReduce”
49
50
Who are Data Scientists?
51
Data Scientist
52
53
Data scientist: a brand-new profession
• Data Scientist: The Sexiest Job of the 21st Century [Harvard
Business Review 2013]
• Data scientist? A guide to 2015's hottest profession
[Mashable 2015]
• “It’s official – data scientist is the best job in America”
[Forbes, 2016]
• "This hot new field promises to revolutionize industries from
business to government, health care to academia."
[The New York Times]
54
Successful Data Scientist Characteristics
Intellectual curiosity, Intuition
• Find a needle in a haystack (something that is difficult to locate in a much larger
space)
• Ask the right questions – value to the business
Communication and engagements
Presentation skills
• Let the data speak but tell a story
• Storyteller – drive business value not just data insights
Creativity
• Guide further investigation
Business Savvy
• Discovering patterns that identify risks and opportunities
• Measure
55
Skills of data scientists
56
Role/Skill of Data Scientist
57
Role/Skill of Data Scientist
58
Data Scientist Job Description
59
The 50 best jobs in America
60
Current Trend in Big Data Analytics
https://www.whizlabs.com/blog/big-data-trends-in-2018/
61
Thank You!
62