FACULTY OF COMPUTERS AND
INFORMATION TECHNOLOGY
Information Systems Department
IS467
Selected Topics in Information Systems 1 (Big Data)
4th Level, Spring 2024
Lecture Notes 2
© 2024 by Dr. Mohamed Attia
Content
❑ Introduction to Hadoop
❑ Hadoop nodes & daemons
❑ Hadoop Architecture
❑ Characteristics
❑ Hadoop Features
❑ HDFS
What is Hadoop?
➢ Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
➢ It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
What is Hadoop?
An open-source framework that allows distributed processing of large data sets across a cluster of commodity hardware.
Distributed Processing
❖ Data is processed in a distributed manner on multiple nodes/servers
❖ Multiple machines process the data independently
Cluster
❖ Multiple machines connected together
❖ Nodes are connected via LAN
Hadoop Components
Hadoop consists of three key parts:
▪ HDFS (storage)
▪ MapReduce (processing)
▪ YARN (resource management)
Hadoop Nodes
▪ Master Node: NameNode, Resource Manager
▪ Slave Node: DataNode, Node Manager
What is YARN?
YARN (Yet Another Resource Negotiator) is a core component of the Apache Hadoop
ecosystem, designed to manage computing resources in clusters and use them for scheduling
users' applications.
YARN Components
YARN consists of the following main components:
▪ Resource Manager (RM):
The Resource Manager is the master that controls all the available cluster resources and is
responsible for allocating resources to various running applications based on constraints
such as capacity, queues, etc.
▪ Node Manager (NM):
The Node Manager is a per-node slave service responsible for containers, monitoring their
resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager.
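To make these two roles concrete, here is a minimal Java sketch (not from the lecture; it assumes a reachable cluster whose yarn-site.xml is on the classpath) that asks the Resource Manager for a report on every running Node Manager:

// Minimal sketch: query the Resource Manager for the cluster's Node Managers.
// Assumes yarn-site.xml on the classpath points at a running cluster.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListClusterNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // picks up yarn-site.xml
        yarnClient.start();

        // One NodeReport per running Node Manager, as reported to the RM.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capability=" + node.getCapability()); // memory + vcores
        }
        yarnClient.stop();
    }
}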
Basic Hadoop Architecture
[Figure: one unit of Work submitted to the cluster is split into many Sub Work units, which are processed in parallel across the nodes]
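The canonical instance of this work-splitting in Hadoop is a MapReduce job. The sketch below is the standard word-count example: each map task processes one input split (one "Sub Work"), and the reduce tasks merge the partial counts back into a single result.

// The standard Hadoop word-count job, shown here as a sketch of how one
// "Work" is split into parallel "Sub Work" units (map tasks) whose partial
// results are merged by reduce tasks.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task is one "Sub Work": it sees a single split of the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce tasks merge the partial results of all map tasks.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-merge on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}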
Hadoop Characteristics
Open Source
➢ Source code is freely available
➢ Can be redistributed
➢ Can be modified
➢ Free, affordable, community-driven, and with no vendor lock-in
Distributed Processing
➢ Data is processed in a distributed manner on the cluster, in contrast to centralized processing
➢ Multiple nodes in the cluster process data independently
Fault Tolerance
➢ Failures of nodes are recovered automatically
➢ The framework takes care of hardware failures as well as task failures
➢ Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Reliability
➢ Data is reliably stored on the
cluster of machines despite machine
failures
➢ Failure of nodes doesn’t cause
data loss
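This reliability comes from block replication: HDFS keeps several copies of every block (three by default). Below is a minimal sketch, with a hypothetical file path, of controlling the replication factor through the FileSystem API:

// Minimal sketch: controlling the replication factor, the mechanism behind
// HDFS reliability. The file path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default for files this client creates
        FileSystem fs = FileSystem.get(conf);

        // Raise an existing (hypothetical) file to 5 replicas, e.g. a hot
        // dataset that must survive more simultaneous node failures.
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
        fs.close();
    }
}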
Scalability
▪ Vertical Scalability – new hardware can be added to existing nodes
▪ Horizontal Scalability – new nodes can be added on the fly
Data Locality
▪ Move computation to data instead of data to computation.
▪ Data is processed on the nodes where it is stored.
[Figure: the traditional approach moves data from storage servers to app servers; Hadoop instead ships the algorithm to the servers that hold the data]
HDFS Architecture
[Figure: HDFS architecture — the Namenode holds the metadata (file names, replica counts, block locations) and serves clients' metadata operations; Datanodes spread across racks (Rack 1, Rack 2) store the blocks, serve client reads and writes, and replicate blocks across racks on instruction from the Namenode]
Namenode and Datanodes
▪ Master/slave architecture.
▪ An HDFS cluster consists of a single Namenode, a master server that manages the file system and regulates access to files by clients.
▪ There are a number of Datanodes, usually one per node in the cluster. The Datanodes manage the storage attached to the nodes that they run on.
▪ A file is split into one or more blocks, and these blocks are stored on a set of Datanodes.
▪ Datanodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
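A minimal sketch of this division of labor (the path is hypothetical): the client asks the Namenode for metadata, while the file bytes themselves are streamed to and from Datanodes behind the FileSystem API.

// Minimal sketch: a client writes and reads a small (hypothetical) file.
// Metadata requests go to the Namenode; the bytes flow to/from Datanodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path

        // Write: the Namenode allocates blocks, Datanodes store the bytes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the Namenode returns block locations, data comes from Datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}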
Namenode operations
▪ The Namenode maintains and manages the Datanodes and assigns tasks to them.
▪ The Namenode does not contain the actual file data.
▪ The Namenode stores metadata about the actual data, such as the filename, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information.
▪ The Namenode handles all client requests (read, write) for the actual data files.
▪ The Namenode executes file system namespace operations such as opening and closing files and renaming files and directories.
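A short sketch of such namespace operations (all paths are hypothetical); every call below is answered from Namenode metadata alone, without touching block data on the Datanodes:

// Minimal sketch: Namenode namespace operations (hypothetical paths).
// Each call is served from Namenode metadata; no block data is read.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));       // create a directory
        fs.rename(new Path("/user/demo/tmp.csv"),        // rename (move) a file
                  new Path("/user/demo/reports/jan.csv"));

        // Listing a directory: filename, length, replication factor, etc.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
            System.out.println(status.getPath() + "  len=" + status.getLen()
                    + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}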
Datanode operations
▪ Datanodes are responsible for storing the actual data.
▪ Upon instruction from the Namenode, they perform operations such as creation, replication, and deletion of data blocks.
▪ When a Datanode goes down, it has no effect on the Hadoop cluster, because its blocks are replicated elsewhere.
▪ All Datanodes in the Hadoop cluster are synchronized so that they can communicate with each other for various operations.
What happens if one of the Datanodes fails in HDFS?
1. The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster.
2. Every Datanode sends a heartbeat message to the Namenode every 3 seconds. This health report simply tells whether that particular Datanode is working properly or not; in other words, whether it is alive.
3. A block report from a particular Datanode contains information about all the blocks that reside on that Datanode.
4. When the Namenode does not receive any heartbeat message from a particular Datanode for 10 minutes (by default; see the sketch below for how this timeout is derived), it considers that Datanode dead or failed.
5. Since that Datanode's blocks are now under-replicated, the system starts the replication process from one Datanode to another, using the block information from that Datanode's block report.
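The "10 minutes" above is a rounding: in stock HDFS the dead-node timeout is derived from two configuration properties, dfs.heartbeat.interval (3 s by default) and dfs.namenode.heartbeat.recheck-interval (300,000 ms by default). The sketch below reproduces that arithmetic under the default values:

// Back-of-the-envelope sketch of the Namenode's dead-node timeout, using
// the stock HDFS defaults (assumed here): dfs.heartbeat.interval = 3 s and
// dfs.namenode.heartbeat.recheck-interval = 300,000 ms.
public class DeadNodeTimeout {
    public static void main(String[] args) {
        long heartbeatIntervalMs = 3 * 1000L;   // dfs.heartbeat.interval
        long recheckIntervalMs = 300_000L;      // dfs.namenode.heartbeat.recheck-interval

        // Formula used by the Namenode: 2 * recheck + 10 * heartbeat
        long deadIntervalMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;

        // Prints 10.5 minutes, which the slide rounds to "10 minutes".
        System.out.println("Datanode declared dead after "
                + deadIntervalMs / 60000.0 + " minutes");
    }
}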
HDFS Example
Let's assume a 10 GB file is to be stored in HDFS, where the cluster block size is 256 MB and the replication factor is 3. How much space does this 10 GB of data require on the Datanodes and on the Namenode?
Solution:
▪ 10 GB ≈ 10,000 MB total file size to store (given)
# of blocks = file size / block size
▪ 10,000 MB / 256 MB ≈ 39.06, rounded up to ~40 blocks
▪ Replication factor is 3 (given)
Total number of blocks = replication factor × # of blocks
▪ 40 × 3 = 120 blocks
Datanode storage = total number of blocks × block size
▪ 120 blocks × 256 MB = 30,720 MB ≈ 30 GB required on the Datanodes
HDFS Example
Given metadata of 150 bytes per block, what is the size required on the Namenode?
Total metadata for the Namenode = number of blocks × metadata per block
Total metadata in bytes: 40 × 150 = 6000 bytes
Total metadata in GB: 6000 bytes / (1024 × 1024 × 1024) ≈ 0.00000559 GB
Thus, the Datanodes need 30 GB to store the file, and the Namenode requires a minuscule amount of space for metadata, approximately 5.86 KB.
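The whole worked example can be checked with a few lines of code. This sketch uses the same given values and prints the block count, the Datanode storage, and the Namenode metadata size:

// Sketch reproducing the worked example: 10 GB file, 256 MB blocks,
// replication factor 3, 150 bytes of Namenode metadata per block.
public class HdfsSizing {
    public static void main(String[] args) {
        double fileSizeMb = 10_000;  // the example treats 10 GB as 10,000 MB
        double blockSizeMb = 256;
        int replication = 3;
        int metadataBytesPerBlock = 150;

        int blocks = (int) Math.ceil(fileSizeMb / blockSizeMb);       // 40
        int totalBlocks = blocks * replication;                       // 120
        double datanodeStorageGb = totalBlocks * blockSizeMb / 1024;  // ~30 GB
        double namenodeMetadataKb = blocks * metadataBytesPerBlock / 1024.0; // ~5.86 KB

        System.out.println("blocks = " + blocks);
        System.out.println("total blocks (with replicas) = " + totalBlocks);
        System.out.println("Datanode storage (GB) = " + datanodeStorageGb);
        System.out.println("Namenode metadata (KB) = " + namenodeMetadataKb);
    }
}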