Hadoop Notes 1

Hadoop is an open-source framework designed for storing and processing Big Data using a distributed architecture on commodity hardware. It utilizes HDFS for storage, MapReduce for processing, and YARN for resource management, enabling efficient handling of large datasets. The Hadoop ecosystem includes various tools and components like Spark, Hive, and Pig, which enhance its functionality for data analysis and management.


Unit I Fundamentals of Big Data & Hadoop

Hadoop addresses the challenges of Big Data in the following ways:
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which
uses commodity hardware to form clusters and store data in a distributed fashion. It works on the
write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the
required output.
Analyze: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.

Hadoop
Introduction:

Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. It is maintained by the Apache
Software Foundation (ASF) and released under the Apache License.

Hadoop is written in the Java programming language and is one of the top-level Apache projects.

Doug Cutting and Mike J. Cafarella developed Hadoop.

Taking inspiration from Google, Hadoop builds on technologies such as the MapReduce programming
model and the Google File System (GFS).

It is optimized to handle massive quantities of data that could be structured, unstructured or
semi-structured, using commodity hardware, that is, relatively inexpensive computers.

It is designed to scale from a single server to thousands of machines, each offering local
computation and storage. It supports large data sets in a distributed computing environment.

History:
How did Hadoop come into existence, and why is it so popular in the industry nowadays? It all started
with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could
index 1 billion pages. After their research, they estimated that such a system would cost around half a
million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive. They also
soon realized that their architecture would not be capable of scaling to the billions of pages on the web.


In 2003, Google published a paper describing its distributed file system, called GFS, which was being
used in production at Google. This paper on GFS proved to be exactly what Doug and Mike were looking for,
and they soon realized that it would solve their problem of storing the very large files generated as part
of the web crawl and indexing process. Later, in 2004, Google published another paper that introduced
MapReduce to the world. These two papers led to the foundation of the framework called Hadoop. Doug
Cutting, Mike Cafarella and their team took the solution provided by Google and started an open-source
project called Hadoop in 2005, and Doug named it after his son's toy elephant. Apache Hadoop is now a
registered trademark of the Apache Software Foundation.

Hadoop Architecture:
Apache Hadoop offers a scalable, flexible and reliable distributed computing framework for big data on
a cluster of systems with storage capacity and local computing power, by leveraging commodity
hardware.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on
different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run
on clusters of computers and perform complete statistical analysis of huge amounts of data.
Hadoop follows a master-slave architecture for the transformation and analysis of large datasets
using the Hadoop MapReduce paradigm.
The three core Hadoop components that play a vital role in the Hadoop architecture are:

1. Hadoop Distributed File System (HDFS)
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)

Hadoop Distributed File System (HDFS):


o Hadoop Distributed File System runs on top of the existing file systems on each node in a
Hadoop cluster.
o Hadoop Distributed File System is a block-structured file system where each file is divided
into blocks of a pre-determined size.
o Data in a Hadoop cluster is broken down into smaller units (called blocks) and distributed
throughout the cluster. Each block is replicated twice (for a total of three copies); by default,
the second replica is stored on a node in a different rack and the third on a different node in
that same rack.
o Since the data has a default replication factor of three, it is highly available and fault-tolerant.
o If a copy is lost (because of machine failure, for example), HDFS automatically re-replicates
it elsewhere in the cluster, ensuring that the threefold replication factor is maintained.
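To make the block and replication behaviour above concrete, here is a minimal Java sketch (Hadoop is
written in Java) that writes a small file to HDFS and then asks the NameNode where its block replicas
live. It is only an illustration: it assumes a reachable HDFS cluster whose configuration
(core-site.xml / hdfs-site.xml) is on the classpath, and the /user/demo/sample.txt path is made up for
the example.

// Minimal sketch: write a file to HDFS and inspect its block locations.
// Assumes a reachable HDFS NameNode (fs.defaultFS) and the hadoop-client
// libraries on the classpath; the path used here is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);              // client for the default file system (HDFS)

        Path file = new Path("/user/demo/sample.txt");     // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");                    // write-once data
        }

        // Ask the NameNode where the blocks (and their replicas) live.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}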

Hadoop MapReduce: This is for parallel processing of large data sets.


o The MapReduce framework consists of a single master node (JobTracker) and n slave nodes
(TaskTrackers), where n can be in the thousands. The master manages, maintains and monitors
the slaves, while the slaves are the actual worker nodes.
o The client submits a job to Hadoop. A job specifies a mapper, a reducer and a list of inputs. The job
is sent to the JobTracker process on the master node, and each slave node runs a TaskTracker
process.

o The master is responsible for resource management, tracking resource consumption/availability,
and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed
tasks.
o The slave TaskTrackers execute the tasks as directed by the master and periodically provide
task-status information to the master.
o The master stores the metadata (data about data), while the slaves are the nodes that store the data.
The client connects to the master node to perform any task.
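The following is a hedged sketch of the classic word-count job written against the standard
org.apache.hadoop.mapreduce Java API, showing the mapper, reducer and job submission whose component
tasks the master schedules across the slaves. The input and output paths are supplied on the command
line and are purely illustrative.

// Classic word-count sketch using the org.apache.hadoop.mapreduce API.
// Assumes the hadoop-client libraries are available; input and output paths
// come from the command line and are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}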

Hadoop YARN: YARN (Yet Another Resource Negotiator) is the framework responsible for
assigning computational resources for application execution and cluster management. YARN
consists of three core components:

ResourceManager (one per cluster)
ApplicationMaster (one per application)
NodeManagers (one per node)
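As a small illustration of these components, the sketch below uses the YarnClient Java API to ask the
ResourceManager for the NodeManagers it knows about and the applications it is tracking. It assumes a
running cluster whose yarn-site.xml is on the classpath; what it prints depends entirely on the live
cluster.

// Small sketch that asks the YARN ResourceManager for cluster information.
// Assumes yarn-site.xml is on the classpath so the client can locate the
// ResourceManager.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // One NodeManager per worker node; the ResourceManager tracks them all.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        // Each submitted application gets its own ApplicationMaster.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " state: " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}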

Hadoop ecosystem:


The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework
which solves big data problems. You can consider it as a suite which encompasses a number of services
(ingesting, storing, analyzing and maintaining) inside it. Let us discuss and get a brief idea of how
these services work individually and in collaboration.
The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable home for
big data activity that reflects your specific needs and tastes.
The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions.
Below are the Hadoop components that together form the Hadoop ecosystem:
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory data processing
Pig, Hive -> Data processing services using query (SQL-like)
HBase -> NoSQL database
Mahout, Spark MLlib -> Machine learning
Apache Drill -> SQL on Hadoop
ZooKeeper -> Managing the cluster
Oozie -> Job scheduling
Flume, Sqoop -> Data ingestion services
Solr & Lucene -> Searching & indexing


Apache open source Hadoop ecosystem elements:


Spark, Pig, and Hive are three of the best-known Apache Hadoop projects. Each is used to create
applications to process Hadoop data.
Spark: Apache Spark is a framework for real-time data analytics in a distributed computing
environment. It executes in-memory computations to increase the speed of data processing compared to
MapReduce.
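Below is a minimal, illustrative Spark word count in the Java API, run in local mode so it needs no
cluster; the HDFS input path is hypothetical. It is meant only to show the in-memory RDD style of
processing described above, not a production job.

// Minimal word-count sketch on Spark (Java API), run in local mode.
// Assumes spark-core is on the classpath; the input path is illustrative.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input"); // illustrative path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);     // aggregation happens in memory where possible
            counts.take(10).forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}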
Hive: Facebook created Hive for people who are fluent in SQL. Basically, Hive is a data warehousing
component which performs reading, writing and managing of large data sets in a distributed
environment using an SQL-like interface. The query language of Hive is called Hive Query Language (HQL),
which is very similar to SQL (HIVE + SQL = HQL). It provides tools for ETL operations and brings some
SQL-like capabilities to the environment.
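As an illustration, the sketch below runs one HQL query from Java over JDBC. It assumes a HiveServer2
instance listening at localhost:10000 and the hive-jdbc driver on the classpath; the web_logs table is
hypothetical.

// Illustrative sketch of running an HQL query from Java over JDBC.
// Assumes HiveServer2 at localhost:10000 and the hive-jdbc driver on the
// classpath; the table name is hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks very much like SQL; Hive compiles it into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}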
Pig: Pig is a procedural language for developing parallel-processing applications for large data sets in
the Hadoop environment. Pig is an alternative to Java programming for MapReduce, and automatically
generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig
Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster.
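For illustration only, the sketch below drives a few Pig Latin statements from Java through the
PigServer class in local mode; the access.log input and its field layout are hypothetical, and a real
deployment would more commonly run the same statements as a standalone Pig script.

// Hedged sketch of running Pig Latin statements from Java through PigServer.
// Assumes the pig libraries are on the classpath and runs in local mode;
// the input file and field layout are hypothetical.
import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("local");    // "mapreduce" would target the cluster

        // Pig Latin statements; Pig compiles them into MapReduce jobs.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");

        Iterator<Tuple> it = pig.openIterator("hits");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}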
HBase: HBase is a scalable, distributed NoSQL database that sits atop HDFS. It was designed to
store structured data in tables that could have billions of rows and millions of columns. It has been
deployed to power historical searches through large data sets, especially when the desired data is
contained within a large amount of unimportant or irrelevant data (also known as sparse data sets).
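A minimal, illustrative Java sketch of the HBase client API follows: it writes one cell and reads it
back. It assumes a running HBase cluster reachable through an hbase-site.xml on the classpath; the
web_events table and its "d" column family are hypothetical.

// Minimal sketch of writing and reading one cell in HBase.
// Assumes a running HBase cluster reachable through hbase-site.xml on the
// classpath; the table name and column family here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("web_events"))) { // hypothetical table

            // Rows are addressed by row key; columns live inside a column family.
            Put put = new Put(Bytes.toBytes("user42#2024-01-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user42#2024-01-01")));
            byte[] page = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("page"));
            System.out.println("page = " + Bytes.toString(page));
        }
    }
}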
Oozie: Oozie is the workflow scheduler that was developed as part of the Apache Hadoop project. It
manages how workflows start and execute, and also controls the execution path. Oozie is a server-
based Java web application that uses workflow definitions written in hPDL, which is an XML Process
Definition Language similar to JBoss jBPM's jPDL.
Sqoop: Sqoop is a bi-directional data ingestion tool; think of Sqoop as a front-end loader for big data.
Sqoop is a command-line interface that facilitates moving bulk data between Hadoop and relational
databases and other structured data stores. Using Sqoop replaces the need to develop scripts to
export and import data. One common use case is to move data from an enterprise data warehouse to
a Hadoop cluster for ETL processing. Performing ETL on the commodity Hadoop cluster is resource
efficient, while Sqoop provides a practical transfer method.
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which
includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig,
and Sqoop.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving
large amounts of streaming event data.
Mahout: A scalable machine learning and data mining library.
ZooKeeper: A high-performance coordination service for distributed applications.
The ecosystem elements described above are all open source Apache Hadoop projects. There are
numerous commercial solutions that use or support the open source Hadoop projects.

Hadoop Distributions:
Hadoop is an open-source, catch-all technology solution with incredible scalability, low-cost
storage and fast-paced big data analytics on economical servers.
Hadoop vendor distributions overcome the drawbacks and issues of the open-source edition
of Hadoop. These distributions add functionality that focuses on:

Support:
Most of the Hadoop vendors provide technical guidance and assistance that makes it easy for customers
to adopt Hadoop for enterprise level tasks and mission critical applications.
Reliability:
Hadoop vendors promptly act in response whenever a bug is detected. With the intent to make
commercial solutions more stable, patches and fixes are deployed immediately.
Completeness:
Hadoop vendors couple their distributions with various other add-on tools which help customers
customize the Hadoop application to address their specific tasks.
Fault Tolerant:
Since the data has a default replication factor of three, it is highly available and fault-tolerant.

Here is a list of top Hadoop vendors who play a key role in big data market growth:
Amazon Elastic MapReduce
Cloudera CDH Hadoop Distribution
Hortonworks Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform (IBM InfoSphere BigInsights)
Microsoft Azure HDInsight - cloud-based Hadoop distribution

Advantages of Hadoop:
The increase in the requirement for computing resources has made Hadoop a viable and extensively used
programming framework. Modern-day organizations can learn Hadoop and leverage it to manage the
processing power of their businesses.

1. Scalable: Hadoop is a highly scalable storage platform, because it can store and distribute very large
data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational
database systems (RDBMS) that cannot scale to process large amounts of data, Hadoop enables businesses
to run applications on thousands of nodes involving many thousands of terabytes of data.


2. Cost effective:
The problem with traditional relational database management systems is that it is extremely cost-
prohibitive to scale to such a degree in order to process such massive volumes of data. In an effort to
reduce costs, many companies in the past would have had to down-sample data and classify it based on
certain assumptions as to which data was the most valuable. The raw data would be deleted, as it would
be too cost-prohibitive to keep. While this approach may have worked in the short term, it meant that
when business priorities changed, the complete raw data set was not available, as it was too expensive to
store. Hadoop, by contrast, is designed as a scale-out architecture that can affordably store all of a
company's data on commodity hardware for later use.

3. Flexible: Hadoop enables businesses to easily access new data sources and tap into different types of
data (both structured and unstructured) to generate value from that data. This means businesses can use
Hadoop to derive valuable business insights from data sources such as social media and email
conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.

4. Speed of Processing: Hadoop's storage method is based on a distributed file system, and the tools for
data processing are often located on the same servers where the data resides, which results in much
faster processing. When dealing with large volumes of unstructured data, Hadoop is able to efficiently
process terabytes of data in just minutes, and petabytes in hours.

5. Resilient to failure: A key advantage of using Hadoop is its fault tolerance. When data is sent to an
individual node, that data is also replicated to other nodes in the cluster, which means that in the event
of failure, there is another copy available for use.
