FACULTY OF COMPUTERS AND
INFORMATION TECHNOLOGY
Information Systems Department
IS467
Selected Topics in Information Systems 1 (Big Data)
4th Level, Spring 2024
Lecture Notes 2
© 2024 by Dr. Mohamed Attia
Content
❑ Introduction to Hadoop
❑ Hadoop nodes & daemons
❑ Hadoop Architecture
❑ Characteristics
❑ Hadoop Features
❑ HDFS
What is Hadoop?
➢ Apache Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
➢ It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
What is Hadoop?
An open-source framework that allows distributed processing of large data sets across a cluster of commodity hardware.
Distributed Processing
❖ Data is processed in a distributed manner on multiple nodes/servers
❖ Multiple machines process the data independently
Cluster
❖ Multiple machines connected together
❖ Nodes are connected via LAN
Hadoop Components
Hadoop consists of three key parts:
▪ HDFS (storage)
▪ MapReduce (processing)
▪ YARN (resource management)
Hadoop Nodes
▪ Master Node: NameNode, Resource Manager
▪ Slave Node: DataNode, Node Manager
What is YARN?
YARN (Yet Another Resource Negotiator) is a core component of the Apache Hadoop
ecosystem, designed to manage computing resources in clusters and use them for scheduling
users' applications.
YARN Components
YARN consists of the following main components:
▪ Resource Manager (RM):
The Resource Manager is the master that controls all the available cluster resources and is
responsible for allocating resources to various running applications based on constraints
such as capacity, queues, etc.
▪ Node Manager (NM):
The Node Manager is a per-node slave service responsible for containers, monitoring their
resource usage (CPU, memory, disk, network), and reporting it to the Resource Manager.
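To make these two roles concrete, here is a minimal Java sketch (not from the lecture; it assumes a reachable cluster whose yarn-site.xml is on the classpath) that asks the Resource Manager for a report on every running Node Manager:

// Minimal sketch: query the Resource Manager for the cluster's Node Managers.
// Assumes yarn-site.xml on the classpath points at a running cluster.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListClusterNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration()); // picks up yarn-site.xml
        yarnClient.start();

        // One NodeReport per running Node Manager, as reported to the RM.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capability=" + node.getCapability()); // memory + vcores
        }
        yarnClient.stop();
    }
}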
Basic Hadoop Architecture
[Figure: one unit of Work submitted to the cluster is split into many Sub Work units, which are processed in parallel across the nodes]
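The canonical instance of this work-splitting in Hadoop is a MapReduce job. The sketch below is the standard word-count example: each map task processes one input split (one "Sub Work"), and the reduce tasks merge the partial counts back into a single result.

// The standard Hadoop word-count job, shown here as a sketch of how one
// "Work" is split into parallel "Sub Work" units (map tasks) whose partial
// results are merged by reduce tasks.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task is one "Sub Work": it sees a single split of the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce tasks merge the partial results of all map tasks.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-merge on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}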
Hadoop Characteristics
Open Source
➢ Source code is freely available
➢ Can be redistributed
➢ Can be modified
➢ Free, affordable, community-driven, and with no vendor lock-in
Distributed Processing
➢ Data is processed in a distributed manner on the cluster, in contrast to centralized processing
➢ Multiple nodes in the cluster process data independently
Fault Tolerance
➢ Failures of nodes are recovered automatically
➢ The framework takes care of hardware failures as well as task failures
➢ Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Reliability
➢ Data is reliably stored on the
cluster of machines despite machine
failures
➢ Failure of nodes doesn’t cause
data loss
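This reliability comes from block replication: HDFS keeps several copies of every block (three by default). Below is a minimal sketch, with a hypothetical file path, of controlling the replication factor through the FileSystem API:

// Minimal sketch: controlling the replication factor, the mechanism behind
// HDFS reliability. The file path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default for files this client creates
        FileSystem fs = FileSystem.get(conf);

        // Raise an existing (hypothetical) file to 5 replicas, e.g. a hot
        // dataset that must survive more simultaneous node failures.
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
        fs.close();
    }
}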
Scalability
▪ Vertical Scalability – new hardware can be added to existing nodes
▪ Horizontal Scalability – new nodes can be added on the fly
Data Locality
▪ Move computation to data instead of data to computation.
▪ Data is processed on the nodes where it is stored.
[Figure: the traditional approach moves data from storage servers to app servers; Hadoop instead ships the algorithm to the servers that hold the data]
HDFS Architecture
[Figure: HDFS architecture — the Namenode holds the metadata (file names, replica counts, block locations) and serves clients' metadata operations; Datanodes spread across racks (Rack 1, Rack 2) store the blocks, serve client reads and writes, and replicate blocks across racks on instruction from the Namenode]
Namenode and Datanodes
▪ Master/slave architecture.
▪ An HDFS cluster consists of a single Namenode, a master server that manages the file system and regulates access to files by clients.
▪ There are a number of Datanodes, usually one per node in the cluster. The Datanodes manage the storage attached to the nodes that they run on.
▪ A file is split into one or more blocks, and these blocks are stored on a set of Datanodes.
▪ Datanodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
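A minimal sketch of this division of labor (the path is hypothetical): the client asks the Namenode for metadata, while the file bytes themselves are streamed to and from Datanodes behind the FileSystem API.

// Minimal sketch: a client writes and reads a small (hypothetical) file.
// Metadata requests go to the Namenode; the bytes flow to/from Datanodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path

        // Write: the Namenode allocates blocks, Datanodes store the bytes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the Namenode returns block locations, data comes from Datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}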
Namenode operations
▪ The Namenode maintains and manages the Datanodes and assigns tasks to them.
▪ The Namenode does not contain the actual file data.
▪ The Namenode stores metadata about the actual data, such as the filename, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information.
▪ The Namenode handles all client requests (read, write) for the actual data files.
▪ The Namenode executes file system namespace operations such as opening and closing files and renaming files and directories.
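A short sketch of such namespace operations (all paths are hypothetical); every call below is answered from Namenode metadata alone, without touching block data on the Datanodes:

// Minimal sketch: Namenode namespace operations (hypothetical paths).
// Each call is served from Namenode metadata; no block data is read.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));       // create a directory
        fs.rename(new Path("/user/demo/tmp.csv"),        // rename (move) a file
                  new Path("/user/demo/reports/jan.csv"));

        // Listing a directory: filename, length, replication factor, etc.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
            System.out.println(status.getPath() + "  len=" + status.getLen()
                    + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}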
Datanode operations
▪ Datanodes are responsible for storing the actual data.
▪ Upon instruction from the Namenode, they perform operations such as creation, replication, and deletion of data blocks.
▪ When a Datanode goes down, it has no effect on the Hadoop cluster, because its blocks are replicated elsewhere.
▪ All Datanodes in the Hadoop cluster are synchronized so that they can communicate with each other for various operations.
What happens if one of the Datanodes fails in HDFS?
1. The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster.
2. Every Datanode sends a heartbeat message to the Namenode every 3 seconds. This health report simply tells whether that particular Datanode is working properly or not; in other words, whether it is alive.
3. A block report from a particular Datanode contains information about all the blocks that reside on that Datanode.
4. When the Namenode does not receive any heartbeat message from a particular Datanode for 10 minutes (by default; see the sketch below for how this timeout is derived), it considers that Datanode dead or failed.
5. Since that Datanode's blocks are now under-replicated, the system starts the replication process from one Datanode to another, using the block information from that Datanode's block report.
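The "10 minutes" above is a rounding: in stock HDFS the dead-node timeout is derived from two configuration properties, dfs.heartbeat.interval (3 s by default) and dfs.namenode.heartbeat.recheck-interval (300,000 ms by default). The sketch below reproduces that arithmetic under the default values:

// Back-of-the-envelope sketch of the Namenode's dead-node timeout, using
// the stock HDFS defaults (assumed here): dfs.heartbeat.interval = 3 s and
// dfs.namenode.heartbeat.recheck-interval = 300,000 ms.
public class DeadNodeTimeout {
    public static void main(String[] args) {
        long heartbeatIntervalMs = 3 * 1000L;   // dfs.heartbeat.interval
        long recheckIntervalMs = 300_000L;      // dfs.namenode.heartbeat.recheck-interval

        // Formula used by the Namenode: 2 * recheck + 10 * heartbeat
        long deadIntervalMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;

        // Prints 10.5 minutes, which the slide rounds to "10 minutes".
        System.out.println("Datanode declared dead after "
                + deadIntervalMs / 60000.0 + " minutes");
    }
}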
HDFS Example
Let's assume a 10 GB file is to be stored in HDFS, where the cluster block size is 256 MB and the replication factor is 3. How much space does this 10 GB of data require on the Datanodes and on the Namenode?
Solution:
▪ 10 GB ≈ 10,000 MB total file size to store (given)
# of blocks = file size / block size
▪ 10,000 MB / 256 MB ≈ 39.06, rounded up to ~40 blocks
▪ Replication factor is 3 (given)
Total number of blocks = replication factor × # of blocks
▪ 40 × 3 = 120 blocks
Datanode storage = total number of blocks × block size
▪ 120 blocks × 256 MB = 30,720 MB ≈ 30 GB required on the Datanodes
HDFS Example
Given metadata of 150 bytes per block, what is the size required on the Namenode?
Total metadata for the Namenode = number of blocks × metadata per block
Total metadata in bytes: 40 × 150 = 6000 bytes
Total metadata in GB: 6000 bytes / (1024 × 1024 × 1024) ≈ 0.00000559 GB
Thus, the Datanodes need 30 GB to store the file, and the Namenode requires a minuscule amount of space for metadata, approximately 5.86 KB.
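The whole worked example can be checked with a few lines of code. This sketch uses the same given values and prints the block count, the Datanode storage, and the Namenode metadata size:

// Sketch reproducing the worked example: 10 GB file, 256 MB blocks,
// replication factor 3, 150 bytes of Namenode metadata per block.
public class HdfsSizing {
    public static void main(String[] args) {
        double fileSizeMb = 10_000;  // the example treats 10 GB as 10,000 MB
        double blockSizeMb = 256;
        int replication = 3;
        int metadataBytesPerBlock = 150;

        int blocks = (int) Math.ceil(fileSizeMb / blockSizeMb);       // 40
        int totalBlocks = blocks * replication;                       // 120
        double datanodeStorageGb = totalBlocks * blockSizeMb / 1024;  // ~30 GB
        double namenodeMetadataKb = blocks * metadataBytesPerBlock / 1024.0; // ~5.86 KB

        System.out.println("blocks = " + blocks);
        System.out.println("total blocks (with replicas) = " + totalBlocks);
        System.out.println("Datanode storage (GB) = " + datanodeStorageGb);
        System.out.println("Namenode metadata (KB) = " + namenodeMetadataKb);
    }
}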