Big Data (Hadoop)
Big Data refers to the growing challenge that organisations face as they
deal with large and fast-growing sources of data or information.
Big Data deals with petabytes of data, whereas traditional systems deal with terabytes only.
Big Data challenges include,
Capturing data
Data Storage
Data Analysis
Searching
Data Transfer
Visualization
Attributes of Big Data:
Velocity
Volume
Variety
Big Data results in:
Large and growing files
At high speed
In various formats
Hadoop:
Hadoop is an Open Source Software Framework.
Hadoop is used for distributed storage and processing of Big Data
datasets.
The objective of Hadoop is to support running applications on
Big Data.
Hadoop deals with:
Storage
Processing
Key Features Of Hadoop:
Open Source
Distributed Technology
Batch Processing
Fault tolerance
Replication
Scalability
Commodity Hardware for Hadoop:
Low-end hardware.
Inexpensive hardware.
Distributions of Hadoop:
Cloudera
MapR
Hortonworks
Apache
Hadoop Cluster Nodes:
Hadoop cluster nodes handle both storage and processing:
HDFS for storage
MapReduce for processing
Data is stored in blocks.
The default block size is 64 MB.
The block size is configured in the default configuration file,
/home/hadoop/conf/[Link]
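A minimal sketch (assumed code, not part of these notes) of reading the default block size the client configuration reports, using the HDFS Java API. The property name differs across versions (dfs.block.size in Hadoop 1.x, dfs.blocksize in 2.x and later), and the path below is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml found on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default block size (in bytes) that new files under "/" would get
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + blockSize + " bytes");
    }
}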
Hadoop Architecture:
Components of Hadoop Architecture,
Name Node
Secondary Name Node
Data Node
Job Tracker
Task Tracker
Diagrammatic representation:
Master node: Name Node (HDFS layer) and Job Tracker (MapReduce layer).
Slave nodes: each runs a Data Node (HDFS layer) and a Task Tracker (MapReduce layer).
Name Node:
The Name Node divides a file/application into blocks based on the
configuration.
The Name Node gives the physical locations of blocks in the Hadoop cluster.
The Name Node deals with metadata only.
Data Node:
Each and every slave node is called a Data Node.
There is no threshold value on the number of Data Nodes;
the number of Data Nodes increases with the volume of data.
The Data Node is the work-horse of the Hadoop file system.
Secondary Name Node:
The SNN performs functionalities similar to the Name Node.
It gives the physical addresses/locations of blocks
and combines the blocks.
The SNN is not a direct backup node for the primary Name Node.
Job Tracker:
The Job Tracker is always associated with the Name Node only.
The responsibilities of the Job Tracker are:
Assign tasks
Schedule tasks
Re-schedule tasks
Task Tracker:
The responsibility of the Task Tracker is to execute the tasks assigned
by the Job Tracker.
The Job Tracker and Task Tracker communicate through
MapReduce jobs (MR jobs).
Hadoop Ecosystem:
HDFS
MapReduce
Hive
Pig
Sqoop
HBase
Oozie
Flume
Mahout
Impala
YARN
HDFS:
Each node contains a Local File System (LFS), with HDFS and MapReduce (MR)
layered on top of the LFS.
On node failure, the LFS still has the node's information, but there is no
information in HDFS/MR.
The metadata files are: FSImage, EditLog.
HDFS Features:
Support for very Large Files.
Commodity Hardware.
High Latency.
Streaming Access/Sequential file Access.
WRITE ONCE and READ many times.
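A minimal sketch of the write-once / read-many access pattern through the org.apache.hadoop.fs.FileSystem API; the HDFS path and the file contents below are assumptions for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/notes.txt");   // illustrative path

        // WRITE ONCE: create the file and write a line
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // READ many times: stream the file back sequentially
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}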
MapReduce:
MapReduce is built on top of HDFS.
It processes huge amounts of data in a highly parallel manner on
commodity machines.
The MapReduce component works on a key-value
architecture.
It has two processing daemons:
Job Tracker
Task Tracker
Phases in MapReduce:
There are three phases,
MAPPER Phase
Sort & Shuffle Phase
REDUCER Phase
input -> Mapper -> (K,V) -> Sort & Shuffle -> (K,V) -> Reducer -> output
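A minimal word-count sketch of the Mapper and Reducer phases using the org.apache.hadoop.mapreduce API; the class names are illustrative, and the Sort & Shuffle phase between them is performed by the framework itself.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountPhases {

    // MAPPER phase: emit (word, 1) for every word in the input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // (K,V) handed to Sort & Shuffle
                }
            }
        }
    }

    // REDUCER phase: receive (word, [1,1,...]) after Sort & Shuffle and sum the counts
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}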
File Formats in MapReduce:
FileInputFormat (FIF)
FileOutputFormat (FOF)
TextInputFormat (TIF)
TextOutputFormat (TOF)
KeyValueTextInputFormat (KVTIF)
NLineInputFormat (NLINE)
DBInputFormat (DBIF)
Combiner:
The Combiner is one of the predefined functionalities of MapReduce.
It is applied on the output of the Mapper class.
It achieves network optimisation by reducing the data sent to the reducers.
A combiner is set on the job in the driver, as in the sketch below.
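A minimal driver sketch showing how a combiner is typically attached to a job; it reuses the Mapper and Reducer classes from the word-count sketch above, and the HDFS input/output paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountPhases.TokenMapper.class);
        job.setCombinerClass(WordCountPhases.SumReducer.class);   // combiner runs on the mapper side
        job.setReducerClass(WordCountPhases.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/hadoop/in"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}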
PIG:
Pig is one of the components of Hadoop, built on top of HDFS.
It is an abstract, high-level language on top of the MapReduce
programming model.
Pig is meant for querying, data summarisation and advanced
querying.
Pig Latin is the language used to express Pig statements.
Different Modes of PIG Execution:
Local Mode:
Input is read from and output is written to the Local File System (LFS).
HDFS Mode/MapReduce Mode:
Input is read from and output is written to HDFS.
Different flavours of PIG Execution:
Grunt Shell
Script Mode
Embedded Mode
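A minimal sketch of embedded mode, assuming Pig's PigServer Java API and an illustrative local input file; the registered lines are ordinary Pig Latin statements.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // Local Mode; MAPREDUCE for HDFS mode
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "word_counts");              // writes the result relation
    }
}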
HIVE:
Hive is one of the components of Hadoop, built on top of HDFS.
It is a data-warehouse kind of system in Hadoop.
Hive is meant for data summarisation, querying and advanced
querying.
The complete data of Hive is organised by means of
two table types:
Managed tables
External tables
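A minimal sketch of querying Hive from Java over JDBC, assuming a HiveServer2 instance at localhost:10000 and an existing table named employees; the connection URL, credentials and table name are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM employees")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}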
SQOOP:
Sqoop is one of the components of Hadoop, built on top of HDFS.
It is meant for interacting with RDBMSs.
It imports data from RDBMS tables into the Hadoop world (HDFS).
It exports the processed data from the Hadoop world (HDFS) to
RDBMS tables.
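Sqoop is normally driven from the command line; the hedged sketch below calls the same import tool from Java. The Sqoop.runTool entry point, the JDBC connection string, the table and the target directory are all assumptions that depend on the installed Sqoop version and JDBC driver.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",   // illustrative RDBMS connection
            "--username", "hadoop",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/hadoop/orders"       // destination in HDFS
        };
        // Assumed entry point: equivalent to running `sqoop import ...` from the shell
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}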
HBASE:
HBase is built on top of HDFS and is used for performing real-time
random reads/writes.
HBase is an open source, distributed, scalable, fault-tolerant,
multi-dimensional, versioned and column-oriented database.
It does not have a query language.
It cannot be used for transaction processing.
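A minimal sketch of a real-time random write and read with the org.apache.hadoop.hbase.client API, assuming a table named users with a column family info already exists.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomReadWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id, one versioned cell
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random read: fetch the same row back by key
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}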
OOZIE:
Oozie is meant for creating workflows and scheduling them, i.e. it is a
job-scheduling tool in Hadoop.
Oozie is an open source, distributed, scalable, fault-tolerant, Java-based
web application accessed through a GUI.
Oozie works on the principle of a Directed Acyclic Graph (DAG).
It is a sequential way of executing jobs.
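A minimal sketch of submitting a workflow with the org.apache.oozie.client.OozieClient Java API, assuming an Oozie server at the URL below and a workflow application already deployed at the APP_PATH location; all URLs, ports and properties are assumptions.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/hadoop/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:8020");   // illustrative workflow parameters
        conf.setProperty("jobTracker", "localhost:8021");

        String jobId = oozie.run(conf);                          // submit and start the workflow (DAG)
        System.out.println("Workflow job submitted: " + jobId);
    }
}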
FLUME:
Flume is for collecting live streaming data and distributing the same
data over HDFS paths.
Flume Source: collects the data from events.
Interceptors: filter and select the event data sent to the collectors.
Collectors: convert the data into a suitable format for the Flume sink.