Big Data (Hadoop)
Big Data refers to the growing challenge that organisations face as they
deal with large and fast-growing sources of data or information.
Big Data deals with petabytes of data, whereas traditional systems deal with terabytes only.
Big Data challenges include,
Capturing data
Data Storage
Data Analysis
Searching
Data Transfer
Visualization
Attributes of Big Data:
Velocity
Volume
Variety
Big Data results in:
Large and growing files
At high speed
In various formats
Hadoop:
Hadoop is an Open Source Software Framework.
Hadoop is used for distributed storage and processing of Big Data
datasets.
The objective of Hadoop is to support running applications on
Big Data.
Hadoop deals with:
Storage
Processing
Key Features Of Hadoop:
Open Source
Distributed Technology
Batch Processing
Fault tolerance
Replication
Scalability
Commodity Hardware for Hadoop:
Low-end hardware.
Inexpensive hardware.
Distributions of Hadoop:
Cloudera
MapR
Hortonworks
Apache
Hadoop Cluster Nodes:
Hadoop cluster nodes handle both storage and processing:
HDFS for storage
MapReduce for processing
Data is stored in blocks.
The default block size is 64 MB.
The block size is configured in the default configuration file,
/home/hadoop/conf/[Link]
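A minimal sketch (assumed code, not part of these notes) of reading the default block size the client configuration reports, using the HDFS Java API. The property name differs across versions (dfs.block.size in Hadoop 1.x, dfs.blocksize in 2.x and later), and the path below is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml found on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default block size (in bytes) that new files under "/" would get
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + blockSize + " bytes");
    }
}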
Hadoop Architecture:
Components of Hadoop Architecture,
Name Node
Secondary Name Node
Data Node
Job Tracker
Task Tracker
Diagrammatic representation:
Master node: Name Node (HDFS layer) and Job Tracker (MapReduce layer).
Slave nodes: each runs a Data Node (HDFS layer) and a Task Tracker (MapReduce layer).
Name Node:
The Name Node divides a file/application into blocks based on the
configuration.
The Name Node gives the physical locations of blocks in the Hadoop cluster.
The Name Node deals with metadata only.
Data Node:
Each and every slave node is called a Data Node.
There is no threshold value on the number of Data Nodes;
the number of Data Nodes increases with the volume of data.
The Data Node is the work-horse of the Hadoop file system.
Secondary Name Node:
The SNN performs functionalities similar to the Name Node.
It gives the physical addresses/locations of blocks
and combines the blocks.
The SNN is not a direct backup node for the primary Name Node.
Job Tracker:
The Job Tracker is always associated with the Name Node only.
The responsibilities of the Job Tracker are:
Assign tasks
Schedule tasks
Re-schedule tasks
Task Tracker:
The responsibility of the Task Tracker is to execute the tasks assigned
by the Job Tracker.
The Job Tracker and Task Tracker communicate through
MapReduce jobs (MR jobs).
Hadoop Ecosystem:
HDFS
MapReduce
Hive
Pig
Sqoop
HBase
Oozie
Flume
Mahout
Impala
YARN
HDFS:
Each node contains a Local File System (LFS), with HDFS and MapReduce (MR)
layered on top of the LFS.
On node failure, the LFS still has the node's information, but there is no
information in HDFS/MR.
The metadata files are: FSImage, EditLog.
HDFS Features:
Support for very Large Files.
Commodity Hardware.
High Latency.
Streaming Access/Sequential file Access.
WRITE ONCE and READ many times.
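A minimal sketch of the write-once / read-many access pattern through the org.apache.hadoop.fs.FileSystem API; the HDFS path and the file contents below are assumptions for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/notes.txt");   // illustrative path

        // WRITE ONCE: create the file and write a line
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // READ many times: stream the file back sequentially
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}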
MapReduce:
MapReduce is built on top of HDFS.
It processes huge amounts of data in a highly parallel manner on
commodity machines.
The MapReduce component works on a key-value
architecture.
It has two processing daemons:
Job Tracker
Task Tracker
Phases in MapReduce:
There are three phases,
MAPPER Phase
Sort & Shuffle Phase
REDUCER Phase
input -> Mapper -> (K,V) -> Sort & Shuffle -> (K,V) -> Reducer -> output
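A minimal word-count sketch of the Mapper and Reducer phases using the org.apache.hadoop.mapreduce API; the class names are illustrative, and the Sort & Shuffle phase between them is performed by the framework itself.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountPhases {

    // MAPPER phase: emit (word, 1) for every word in the input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // (K,V) handed to Sort & Shuffle
                }
            }
        }
    }

    // REDUCER phase: receive (word, [1,1,...]) after Sort & Shuffle and sum the counts
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}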
File Formats in MapReduce:
FileInputFormat (FIF)
FileOutputFormat (FOF)
TextInputFormat (TIF)
TextOutputFormat (TOF)
KeyValueTextInputFormat (KVTIF)
NLineInputFormat (NLINE)
DBInputFormat (DBIF)
Combiner:
The Combiner is one of the predefined functionalities of MapReduce.
It is applied on the output of the Mapper class.
It achieves network optimisation by reducing the data sent to the reducers.
A combiner is set on the job in the driver, as in the sketch below.
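A minimal driver sketch showing how a combiner is typically attached to a job; it reuses the Mapper and Reducer classes from the word-count sketch above, and the HDFS input/output paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountPhases.TokenMapper.class);
        job.setCombinerClass(WordCountPhases.SumReducer.class);   // combiner runs on the mapper side
        job.setReducerClass(WordCountPhases.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/hadoop/in"));    // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}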
PIG:
Pig is one of the components of Hadoop, built on top of HDFS.
It is an abstract, high-level language on top of the MapReduce
programming model.
Pig is meant for querying, data summarisation and advanced
querying.
Pig Latin is the language used to express Pig statements.
Different Modes of PIG Execution:
Local Mode:
Input is read from and output is written to the Local File System (LFS).
HDFS Mode/MapReduce Mode:
Input is read from and output is written to HDFS.
Different flavours of PIG Execution:
Grunt Shell
Script Mode
Embedded Mode
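A minimal sketch of embedded mode, assuming Pig's PigServer Java API and an illustrative local input file; the registered lines are ordinary Pig Latin statements.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // Local Mode; MAPREDUCE for HDFS mode
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "word_counts");              // writes the result relation
    }
}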
HIVE:
Hive is one of the components of Hadoop, built on top of HDFS.
It is a data-warehouse kind of system in Hadoop.
Hive is meant for data summarisation, querying and advanced
querying.
The complete data of Hive is organised by means of
two table types:
Managed tables
External tables
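A minimal sketch of querying Hive from Java over JDBC, assuming a HiveServer2 instance at localhost:10000 and an existing table named employees; the connection URL, credentials and table name are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM employees")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}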
SQOOP:
Sqoop is one of the components of Hadoop, built on top of HDFS.
It is meant for interacting with RDBMSs.
It imports data from RDBMS tables into the Hadoop world (HDFS).
It exports the processed data from the Hadoop world (HDFS) to
RDBMS tables.
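Sqoop is normally driven from the command line; the hedged sketch below calls the same import tool from Java. The Sqoop.runTool entry point, the JDBC connection string, the table and the target directory are all assumptions that depend on the installed Sqoop version and JDBC driver.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",   // illustrative RDBMS connection
            "--username", "hadoop",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/hadoop/orders"       // destination in HDFS
        };
        // Assumed entry point: equivalent to running `sqoop import ...` from the shell
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}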
HBASE:
HBase is built on top of HDFS and is used for performing real-time
random reads/writes.
HBase is an open source, distributed, scalable, fault-tolerant,
multi-dimensional, versioned and column-oriented database.
It does not have a query language.
It cannot be used for transaction processing.
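A minimal sketch of a real-time random write and read with the org.apache.hadoop.hbase.client API, assuming a table named users with a column family info already exists.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomReadWrite {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id, one versioned cell
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random read: fetch the same row back by key
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}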
OOZIE:
Oozie is meant for creating workflows and scheduling them, i.e. it is a
job-scheduling tool in Hadoop.
Oozie is an open source, distributed, scalable, fault-tolerant, Java-based
web application accessed through a GUI.
Oozie works on the principle of a Directed Acyclic Graph (DAG).
It is a sequential way of executing jobs.
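A minimal sketch of submitting a workflow with the org.apache.oozie.client.OozieClient Java API, assuming an Oozie server at the URL below and a workflow application already deployed at the APP_PATH location; all URLs, ports and properties are assumptions.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/hadoop/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://localhost:8020");   // illustrative workflow parameters
        conf.setProperty("jobTracker", "localhost:8021");

        String jobId = oozie.run(conf);                          // submit and start the workflow (DAG)
        System.out.println("Workflow job submitted: " + jobId);
    }
}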
FLUME:
Flume is for collecting live streaming data and distributing the same
data over HDFS paths.
Flume Source: collects the data from events.
Interceptors: filter and select the event data sent to the collectors.
Collectors: convert the data into a suitable format for the Flume sink.