Hadoop Notes 1

Hadoop is an open-source framework designed for storing and processing Big Data using a distributed architecture on commodity hardware. It utilizes HDFS for storage, MapReduce for processing, and YARN for resource management, enabling efficient handling of large datasets. The Hadoop ecosystem includes various tools and components like Spark, Hive, and Pig, which enhance its functionality for data analysis and management.


Unit I Fundamentals of Big Data & Hadoop

Hadoop addresses the challenges of Big Data in the following ways:
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which
uses commodity hardware to form clusters and store data in a distributed fashion. It works on the
write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the
required output.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.

Hadoop
Introduction:

Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. It is maintained by the Apache
Software Foundation (ASF) and released under the Apache License.

Hadoop is written in the Java programming language and is one of the top-level Apache projects.

Doug Cutting and Mike J. Cafarella developed Hadoop.

Taking inspiration from Google, Hadoop builds on technologies such as the MapReduce programming
model and the Google File System (GFS).

It is optimized to handle massive quantities of data that could be structured, unstructured or
semi-structured, using commodity hardware, that is, relatively inexpensive computers.

It is designed to scale from a single server to thousands of machines, each offering local
computation and storage. It supports large data sets in a distributed computing environment.

History:
How did Hadoop come into existence, and why is it so popular in the industry nowadays? It all started
with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could
index 1 billion pages. After their research, they estimated that such a system would cost around half a
million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. They also
soon realized that their architecture would not be capable of scaling to the billions of pages on the web.


In 2003, Google published a paper describing its distributed file system, called GFS, which was being
used in production at Google. This paper on GFS proved to be exactly what Doug and Mike were looking for,
and they soon realized that it would solve their problem of storing the very large files generated as part
of the web crawl and indexing process. Later, in 2004, Google published another paper that introduced
MapReduce to the world. These two papers led to the foundation of the framework called Hadoop. Doug
Cutting, Mike Cafarella and their team took the solution provided by Google and started an open-source
project called Hadoop in 2005, and Doug named it after his son's toy elephant. Apache Hadoop is now a
registered trademark of the Apache Software Foundation.

Hadoop Architecture:
Apache Hadoop offers a scalable, flexible and reliable distributed computing framework for big data on
a cluster of systems with storage capacity and local computing power, by leveraging commodity
hardware.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on
different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run
on clusters of computers and perform complete statistical analysis of huge amounts of data.
Hadoop follows a master-slave architecture for the transformation and analysis of large datasets
using the Hadoop MapReduce paradigm.
The three core Hadoop components that play a vital role in the Hadoop architecture are:

1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)

Hadoop Distributed File System (HDFS):


o Hadoop Distributed File System runs on top of the existing file systems on each node in a
Hadoop cluster.
o Hadoop Distributed File System is a block-structured file system where each file is divided
into blocks of a pre-determined size.
o Data in a Hadoop cluster is broken down into smaller units (called blocks) and distributed
throughout the cluster. Each block is replicated twice (for a total of three copies); by default,
the second replica is stored on a node in a different rack and the third on a different node in
that same rack.
o Since the data has a default replication factor of three, it is highly available and fault-tolerant.
o If a copy is lost (because of machine failure, for example), HDFS automatically re-replicates
it elsewhere in the cluster, ensuring that the threefold replication factor is maintained.
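To make the block and replication behaviour above concrete, here is a minimal Java sketch (Hadoop is
written in Java) that writes a small file to HDFS and then asks the NameNode where its block replicas
live. It is only an illustration: it assumes a reachable HDFS cluster whose configuration
(core-site.xml / hdfs-site.xml) is on the classpath, and the /user/demo/sample.txt path is made up for
the example.

// Minimal sketch: write a file to HDFS and inspect its block locations.
// Assumes a reachable HDFS NameNode (fs.defaultFS) and the hadoop-client
// libraries on the classpath; the path used here is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client for the default file system (HDFS)

        Path file = new Path("/user/demo/sample.txt");     // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");                    // write-once data
        }

        // Ask the NameNode where the blocks (and their replicas) live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}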

Hadoop MapReduce: This is for parallel processing of large data sets.


o The MapReduce framework consists of a single master node (JobTracker) and n slave nodes
(TaskTrackers), where n can be in the thousands. The master manages, maintains and monitors
the slaves, while the slaves are the actual worker nodes.
o The client submits a job to Hadoop. A job specifies a mapper, a reducer and a list of inputs. The job
is sent to the JobTracker process on the master node, and each slave node runs a TaskTracker
process.

o The master is responsible for resource management, tracking resource consumption/availability,
and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed
tasks.
o The slave TaskTrackers execute the tasks as directed by the master and periodically provide
task-status information to the master.
o The master stores the metadata (data about data), while the slaves are the nodes that store the data.
The client connects to the master node to perform any task.
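The following is a hedged sketch of the classic word-count job written against the standard
org.apache.hadoop.mapreduce Java API, showing the mapper, reducer and job submission whose component
tasks the master schedules across the slaves. The input and output paths are supplied on the command
line and are purely illustrative.

// Classic word-count sketch using the org.apache.hadoop.mapreduce API.
// Assumes the hadoop-client libraries are available; input and output paths
// come from the command line and are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}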

Hadoop YARN: YARN (Yet Another Resource Negotiator) is the framework responsible for
assigning computational resources for application execution and cluster management. YARN
consists of three core components:

ResourceManager (one per cluster)
ApplicationMaster (one per application)
NodeManagers (one per node)
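As a small illustration of these components, the sketch below uses the YarnClient Java API to ask the
ResourceManager for the NodeManagers it knows about and the applications it is tracking. It assumes a
running cluster whose yarn-site.xml is on the classpath; what it prints depends entirely on the live
cluster.

// Small sketch that asks the YARN ResourceManager for cluster information.
// Assumes yarn-site.xml is on the classpath so the client can locate the
// ResourceManager.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // One NodeManager per worker node; the ResourceManager tracks them all.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        // Each submitted application gets its own ApplicationMaster.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " state: " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}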

Hadoop ecosystem:


The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework
which solves big data problems. You can consider it as a suite which encompasses a number of services
(ingesting, storing, analyzing and maintaining) inside it. Let us discuss and get a brief idea of how
these services work individually and in collaboration.
The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable home for
big data activity that reflects your specific needs and tastes.
The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions.
Below are the Hadoop components that together form the Hadoop ecosystem:
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory data processing
Pig, Hive -> Data processing services using query (SQL-like)
HBase -> NoSQL database
Mahout, Spark MLlib -> Machine learning
Apache Drill -> SQL on Hadoop
ZooKeeper -> Managing the cluster
Oozie -> Job scheduling
Flume, Sqoop -> Data ingestion services
Solr & Lucene -> Searching & indexing


Apache open source Hadoop ecosystem elements:


Spark, Pig, and Hive are three of the best-known Apache Hadoop projects. Each is used to create
applications to process Hadoop data.
Spark: Apache Spark is a framework for real-time data analytics in a distributed computing
environment. It executes in-memory computations to increase the speed of data processing compared to
MapReduce.
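Below is a minimal, illustrative Spark word count in the Java API, run in local mode so it needs no
cluster; the HDFS input path is hypothetical. It is meant only to show the in-memory RDD style of
processing described above, not a production job.

// Minimal word-count sketch on Spark (Java API), run in local mode.
// Assumes spark-core is on the classpath; the input path is illustrative.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input"); // illustrative path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);     // aggregation happens in memory where possible
            counts.take(10).forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}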
Hive: Facebook created Hive for people who are fluent in SQL. Basically, Hive is a data warehousing
component which performs reading, writing and managing of large data sets in a distributed
environment using an SQL-like interface. The query language of Hive is called Hive Query Language (HQL),
which is very similar to SQL (HIVE + SQL = HQL). It provides tools for ETL operations and brings some
SQL-like capabilities to the environment.
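As an illustration, the sketch below runs one HQL query from Java over JDBC. It assumes a HiveServer2
instance listening at localhost:10000 and the hive-jdbc driver on the classpath; the web_logs table is
hypothetical.

// Illustrative sketch of running an HQL query from Java over JDBC.
// Assumes HiveServer2 at localhost:10000 and the hive-jdbc driver on the
// classpath; the table name is hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks very much like SQL; Hive compiles it into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}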
Pig: Pig is a procedural language for developing parallel-processing applications for large data sets in
the Hadoop environment. Pig is an alternative to Java programming for MapReduce, and automatically
generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig
Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster.
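For illustration only, the sketch below drives a few Pig Latin statements from Java through the
PigServer class in local mode; the access.log input and its field layout are hypothetical, and a real
deployment would more commonly run the same statements as a standalone Pig script.

// Hedged sketch of running Pig Latin statements from Java through PigServer.
// Assumes the pig libraries are on the classpath and runs in local mode;
// the input file and field layout are hypothetical.
import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("local");    // "mapreduce" would target the cluster

        // Pig Latin statements; Pig compiles them into MapReduce jobs.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");

        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}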
HBase: HBase is a scalable, distributed NoSQL database that sits atop HDFS. It was designed to
store structured data in tables that could have billions of rows and millions of columns. It has been
deployed to power historical searches through large data sets, especially when the desired data is
contained within a large amount of unimportant or irrelevant data (also known as sparse data sets).
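A minimal, illustrative Java sketch of the HBase client API follows: it writes one cell and reads it
back. It assumes a running HBase cluster reachable through an hbase-site.xml on the classpath; the
web_events table and its "d" column family are hypothetical.

// Minimal sketch of writing and reading one cell in HBase.
// Assumes a running HBase cluster reachable through hbase-site.xml on the
// classpath; the table name and column family here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("web_events"))) { // hypothetical table

            // Rows are addressed by row key; columns live inside a column family.
            Put put = new Put(Bytes.toBytes("user42#2024-01-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user42#2024-01-01")));
            byte[] page = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("page"));
            System.out.println("page = " + Bytes.toString(page));
        }
    }
}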
Oozie: Oozie is the workflow scheduler that was developed as part of the Apache Hadoop project. It
manages how workflows start and execute, and also controls the execution path. Oozie is a server-
based Java web application that uses workflow definitions written in hPDL, which is an XML Process
Definition Language similar to JBoss jBPM's jPDL.
Sqoop: Sqoop is a bi-directional data ingestion tool; think of Sqoop as a front-end loader for big data.
Sqoop is a command-line interface that facilitates moving bulk data between Hadoop and relational
databases and other structured data stores. Using Sqoop replaces the need to develop scripts to
export and import data. One common use case is to move data from an enterprise data warehouse to
a Hadoop cluster for ETL processing. Performing ETL on the commodity Hadoop cluster is resource
efficient, while Sqoop provides a practical transfer method.
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which
includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig,
and Sqoop.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of streaming event data.
Mahout: A scalable machine learning and data mining library.
ZooKeeper: A high-performance coordination service for distributed applications.
The ecosystem elements described above are all open source Apache Hadoop projects. There are
numerous commercial solutions that use or support the open source Hadoop projects.

Hadoop Distributions:
Hadoop is an open-source, catch-all technology solution with incredible scalability, low-cost
storage and fast-paced big data analytics on economical servers.
Hadoop vendor distributions overcome the drawbacks and issues of the open-source edition
of Hadoop. These distributions add functionality that focuses on:

Support:
Most of the Hadoop vendors provide technical guidance and assistance that makes it easy for customers
to adopt Hadoop for enterprise level tasks and mission critical applications.
Reliability:
Hadoop vendors promptly act in response whenever a bug is detected. With the intent to make
commercial solutions more stable, patches and fixes are deployed immediately.
Completeness:
Hadoop vendors couple their distributions with various other add-on tools which help customers
customize the Hadoop application to address their specific tasks.
Fault Tolerant:
Since the data has a default replication factor of three, it is highly available and fault-tolerant.

Here is a list of top Hadoop vendors who play a key role in big data market growth:
Amazon Elastic MapReduce
Cloudera CDH Hadoop Distribution
Hortonworks Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform (IBM InfoSphere BigInsights)
Microsoft Azure HDInsight - cloud-based Hadoop distribution

Advantages of Hadoop:
The increase in the requirement for computing resources has made Hadoop a viable and extensively used
programming framework. Modern-day organizations can learn Hadoop and leverage it to manage the
processing power of their businesses.

1. Scalable: Hadoop is a highly scalable storage platform, because it can store and distribute very large
data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational
database systems (RDBMS) that cannot scale to process large amounts of data, Hadoop enables businesses
to run applications on thousands of nodes involving many thousands of terabytes of data.


2. Cost effective:
The problem with traditional relational database management systems is that it is extremely cost-
prohibitive to scale to such a degree in order to process such massive volumes of data. In an effort to
reduce costs, many companies in the past would have had to down-sample data and classify it based on
certain assumptions as to which data was the most valuable. The raw data would be deleted, as it would
be too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that
when business priorities changed, the complete raw data set was not available, as it was too expensive to
store. Hadoop, by contrast, is designed as a scale-out architecture that can affordably store all of a
company's data on commodity hardware for later use.

3. Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data. This means businesses can use
Hadoop to derive valuable business insights from data sources such as social media and email
conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.

4. Speed of Processing: Hadoop's storage method is based on a distributed file system, and the tools for
data processing are often located on the same servers where the data resides, which results in much
faster processing. When dealing with large volumes of unstructured data, Hadoop is able to efficiently
process terabytes of data in just minutes, and petabytes in hours.

5. Resilient to failure: A key advantage of using Hadoop is its fault tolerance. When data is sent to an
individual node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.
