Lesson 1
Objectives
By the end of this lesson, you will be able to:
● Explain the need for Big Data
● Define the concept of Big Data
● Describe the basics and benefits of Hadoop
Need for Big Data
90% of the data in the world today has been created in the last two years alone.
Structured formats have limitations with respect to handling such large
quantities of data. Thus, there is a need for a mechanism such as Big Data to
handle these increasing quantities.
Big Data relies on three important aspects of data complexity: volume, velocity,
and variety.
What is Big Data
Defining Big Data
Big Data is the term applied to data sets whose size is beyond the ability of
the commonly used software tools to capture, manage, and process within a
tolerable elapsed time.
Sources of Big Data
● Web logs
● Sensor networks
● Social media
● Internet text and documents
● Internet pages
● Search index data
● Atmospheric science, astronomy, biochemical, and medical records
● Scientific research
● Military surveillance
● Photography archives
Types of Data
Three types of data can be identified:
Unstructured Data
Data that does not have a pre-defined data model
E.g., text files
Semi-structured Data
Data that does not conform to a formal data model but still carries some
organizational markers, such as tags
E.g., XML files
Structured Data
Data that is represented in a tabular format
E.g., databases
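To make the three categories concrete, here is a minimal Python sketch, not part of the original lesson, that represents one invented order record as unstructured text, as semi-structured XML, and as a structured database row:

```python
# A minimal sketch (illustrative only; the order data below is invented)
# showing the same fact as unstructured, semi-structured, and structured data.
import sqlite3
import xml.etree.ElementTree as ET

# Unstructured data: free text with no pre-defined data model.
text = "Order 4521 was shipped on Monday and the customer asked for an invoice."

# Semi-structured data: XML carries tags that mark elements, but no rigid
# relational schema is enforced.
order = ET.fromstring("<order><id>4521</id><status>shipped</status></order>")
print("XML order id:", order.find("id").text)

# Structured data: a fixed schema of rows and columns in a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
db.execute("INSERT INTO orders VALUES (4521, 'shipped')")
print("Table row:", db.execute("SELECT id, status FROM orders").fetchone())
```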
Handling Limitations of Big Data
How to handle system uptime and downtime
● Commodity hardware for data storage and analysis
● Maintaining a copy of the same data across clusters
How to combine accumulated data from all the systems
● Analyzing data across different machines
● Merging of data
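As a rough illustration of combining accumulated data from several systems, the following Python sketch merges partial word counts that are assumed to have been computed independently on three machines; the machine names and counts are invented for illustration:

```python
# A minimal sketch (invented data): partial results computed on separate
# machines are merged into a single combined result.
from collections import Counter

# Partial word counts, as if each machine analyzed only its own share of logs.
machine_a = Counter({"error": 120, "login": 45})
machine_b = Counter({"error": 80, "logout": 30})
machine_c = Counter({"login": 15, "error": 5})

# Merging the accumulated data from all the systems is a sum of the counters.
merged = machine_a + machine_b + machine_c
print(merged.most_common())  # [('error', 205), ('login', 60), ('logout', 30)]
```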
Introduction to Hadoop
What is Hadoop?
● A free, Java-based programming framework that supports the processing of
large data sets in a distributed computing environment
● Based on the Google File System (GFS)
Why Hadoop?
● Runs applications on distributed systems with thousands of nodes involving
petabytes of data
● Its distributed file system provides fast data transfers among the nodes
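Hadoop's processing model is usually expressed as MapReduce jobs written in Java, but the Hadoop Streaming utility also accepts mappers and reducers written in any language that reads standard input and writes standard output. The following word-count sketch is an illustrative assumption, not code from this lesson, showing what a streaming-style mapper and reducer can look like in Python:

```python
#!/usr/bin/env python3
# Word-count sketch in the Hadoop Streaming style (illustrative only).
# The mapper emits "word<TAB>1" per word; the reducer sums counts per word
# and assumes its input is already sorted by key, as Hadoop guarantees.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as "wordcount.py map" or "wordcount.py reduce" (file name assumed).
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

Locally, the same flow can be imitated with a shell pipe such as cat input.txt | python wordcount.py map | sort | python wordcount.py reduce, which mirrors the map, shuffle-and-sort, and reduce stages that Hadoop runs across the nodes of a cluster.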
History and Milestones of Hadoop
Hadoop originated from Nutch, an open-source search engine project designed to
work over distributed network nodes. Yahoo was the first company to develop and
use Hadoop as a core part of its system operations. Today, Hadoop is a core part
of systems at companies such as Facebook, LinkedIn, and Twitter.
Hadoop Milestones
Organizations Using Hadoop
Amazon
● Cluster specifications: Clusters vary from 1 to 100 nodes
● Uses: To build Amazon's product search indices; to process millions of
sessions daily for analytics
Yahoo
● Cluster specifications: More than 100,000 CPUs in approximately 20,000
computers running Hadoop; the biggest cluster has 2,000 nodes (2*4 CPU boxes
with 4 TB disk each)
● Uses: To support research for ad systems and web search
AOL
● Cluster specifications: 50 machines, Intel Xeon, dual processors, dual core,
each with 16 GB RAM and an 800 GB hard disk, giving a total of 37 TB HDFS
capacity
● Uses: For a variety of functions ranging from data generation to running
advanced algorithms for behavioral analysis and targeting
Facebook
● Cluster specifications: A 320-machine cluster with 2,560 cores and about
1.3 PB raw storage
● Uses: To store copies of internal log and dimension data sources; as a source
for reporting analytics and machine learning
Quiz 1
Which type of data is handled by Hadoop?
a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data
Quiz 1
Which type of data is handled by Hadoop?
a. Structured data
b. Semi-structured data
c. Unstructured data
d. Flexible-structure data
Answer: c.
Explanation: Hadoop handles unstructured data for processing.
Quiz 2
Which of the following is unstructured data?
a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets
Quiz 2
Which of the following is unstructured data?
a. Collection of text files
b. Collection of XML files
c. Collection of tables in databases
d. Collection of tickets
Answer: a.
Explanation: Text files are usually unstructured data.
Quiz 3
Which of the following is structured data?
a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files
Quiz 3
Which of the following is structured data?
a. Collection of text files
b. Collection of tickets
c. Collection of tables in databases
d. Collection of XML files
Answer: c.
Explanation: Databases are usually structured data.
Quiz 4
Which of the following is semi-structured data?
a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files
Quiz 4
Which of the following is semi-structured data?
a. Collection of tables in databases
b. Collection of text files
c. Collection of tickets
d. Collection of XML files
Answer: d.
Explanation: XML files are usually semi-structured data.
Quiz 5
Which of the following aspects of Big Data refers to data size?
a. Volume
b. Velocity
c. Variety
d. Value
Quiz 5
Which of the following aspects of Big Data refers to data size?
a. Volume
b. Velocity
c. Variety
d. Value
Answer: a.
Explanation: Volume in Big Data refers to the size of the data to be processed.
Quiz 6
Which of the following aspects of Big Data refers to the speed of response to data requests generated by the user?
a. Variety
b. Value
c. Velocity
d. Volume
Quiz 6
Which of the following aspects of Big Data refers to the speed of response to data requests generated by the user?
a. Variety
b. Value
c. Velocity
d. Volume
Answer: c.
Explanation: Velocity in Big Data refers to the speed of response to data requests generated by the user.
Quiz 7
Which of the following aspects of Big Data refers to multiple data sources?
a. Variety
b. Value
c. Volume
d. Velocity
Quiz 7
Which of the following aspects of Big Data refers to multiple data sources?
a. Variety
b. Value
c. Volume
d. Velocity
Answer: a.
Explanation: Variety in Big Data refers to multiple data sources.
Summary
Let us summarize the topics covered in this lesson:
● Big Data is the term applied to data sets whose size is beyond the ability
of the commonly used software tools to capture, manage, and process
within a tolerable elapsed time.
● Big Data relies on volume, velocity, and variety with respect to
processing.
● Data can be divided into three types: unstructured data, semi-structured
data, and structured data.
● Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment.
● Hadoop is a software framework used by organizations such as Facebook,
Yahoo, Amazon, and AOL.