Understanding MapReduce in Hadoop

MapReduce is a programming framework that processes large datasets in parallel by splitting input data into independent chunks for map tasks, which generate intermediate data that is then combined by reduce tasks. The framework includes a Job Tracker for scheduling and monitoring tasks, and Task Trackers for executing them, ensuring efficient data locality and high throughput. Key components of MapReduce include mappers, combiners, partitioners, and reducers, which work together to transform and aggregate data effectively.


5.11 PROCESSING DATA WITH HADOOP

• MapReduce Programming is a software framework that helps you process massive amounts of data in parallel.
• In MapReduce Programming, the input dataset is split into independent chunks. Map tasks process these independent chunks completely in parallel.
• The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.
• The output of the mappers is automatically shuffled and sorted by the framework. The MapReduce framework sorts the output based on keys.
• This sorted output becomes the input to the reduce tasks. A reduce task produces the reduced output by combining the output of the various mappers.
• Job inputs and outputs are stored in a file system. The MapReduce framework also takes care of other tasks such as scheduling, monitoring, and re-executing failed tasks.
• The Hadoop Distributed File System and the MapReduce framework run on the same set of nodes. This configuration allows effective scheduling of tasks on the nodes where the data is present (data locality), which in turn results in very high throughput.

 There are two daemons associated with MapReduce Programming: a single master Job Tracker per cluster and one slave Task Tracker per cluster node.

 The Job Tracker is responsible for scheduling tasks on the Task Trackers, monitoring the tasks, and re-executing a task in case its Task Tracker fails. The Task Tracker executes the tasks. Refer Figure 5.21.

 The MapReduce functions and input/output locations are specified by the MapReduce applications, which use suitable interfaces to construct the job.

 The application and the job parameters together are known as the job configuration. The Hadoop job client submits the job (jar/executable, etc.) to the Job Tracker. It is then the responsibility of the Job Tracker to schedule tasks on the slaves. In addition to scheduling, it also monitors the tasks and provides status information to the job client.
MapReduce Framework

Phases:
  Map: converts input into key-value pairs.
  Reduce: combines the output of the mappers and produces a reduced result set.

Daemons:
  Job Tracker: master, schedules tasks.
  Task Tracker: slave, executes tasks.

Figure 5.21 MapReduce Programming phases and daemons

5.11.1 MapReduce Daemons

1. Job Tracker: It provides connectivity between Hadoop and your application. When you submit code to the cluster, the Job Tracker creates the execution plan by deciding which task to assign to which node. It also monitors all the running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries. The Job Tracker is a master daemon responsible for executing the overall MapReduce job. There is a single Job Tracker per Hadoop cluster.
2. Task Tracker: This daemon is responsible for executing the individual tasks that are assigned by the Job Tracker. There is a single Task Tracker per slave node, and it spawns multiple Java Virtual Machines (JVMs) to handle multiple map or reduce tasks in parallel.
The Task Tracker continuously sends heartbeat messages to the Job Tracker. When the Job Tracker fails to receive a heartbeat from a Task Tracker, it assumes that the Task Tracker has failed and resubmits the task to another available node in the cluster.
Once the client submits a job to the Job Tracker, the Job Tracker partitions the job and assigns the resulting MapReduce tasks to the Task Trackers in the cluster. Figure 5.22 depicts Job Tracker and Task Tracker interaction.
5.11.2 How Does MapReduce Work?
MapReduce divides a data analysis task into two parts - map and reduce. Figure 5.23 depicts how the
MapReduce Programming works. In this example, there are two mappers and one reducer. Each mapper
works on the partial dataset that is stored on that node and the reducer combines the output from the
mappers to produce the reduced result set.
Figure 5.22 Job Tracker and TaskTracker interaction (the client submits a job to the Job Tracker, which assigns map and reduce tasks to the Task Trackers)


Figure 5.24 describes the working model of MapReduce Programming. The following steps describe how
MapReduce performs its task.

1. First, the input dataset is split into multiple pieces of data (several small subsets).
2. Next, the framework creates a master process and several worker processes and executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data assigned to it. The map worker uses the map function to process only the data present on its server and generates key-value pairs for that data.
4. The map worker uses the partitioner function to divide the data into regions. The partitioner decides which reducer should get the output of each mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin their work. The reduce workers in turn contact the map workers to get the key-value data for their partition. The data thus received is shuffled and sorted by key.
6. The reduce worker then calls the reduce function for every unique key. This function writes the output to a file.
7. When all the reduce workers complete their work, the master transfers control to the user program.

5.11.3 MapReduce Example


The most famous example of MapReduce Programming is Word Count. For example, suppose you need to count the occurrences of each word across 50 files. You can achieve this using MapReduce Programming. Refer Figure 5.25.
Word Count MapReduce Programming using Java
A MapReduce program requires three things.
1. Driver Class: This class specifies Job Configuration details.
2. Mapper Class: This class overrides the Map Function based on the problem statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem statement.
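A minimal Word Count sketch that ties these three pieces together is shown below. It is written against the org.apache.hadoop.mapreduce API (Hadoop 2.x and later); the class names TokenizerMapper and IntSumReducer are illustrative, and the exact driver settings may vary with your Hadoop version.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper class: overrides the map function, tokenizes each input line
  // and emits (word, 1) as an intermediate key-value pair.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer class: overrides the reduce function and sums the counts
  // received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver class: specifies the job configuration and submits the job.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional combiner (see Section 8.4)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged as a jar, such a job is typically submitted with a command of the form "hadoop jar wordcount.jar WordCount <input path> <output path>", where the paths are placeholders for HDFS directories.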
8.1 INTRODUCTION TO MAPREDUCE PROGRAMMING
In MapReduce Programming, jobs (applications) are split into a set of map tasks and reduce tasks. These tasks are then executed in a distributed fashion on the Hadoop cluster.
Each task processes a small subset of the data that has been assigned to it.
This way, Hadoop distributes the load across the cluster. A MapReduce job takes a set of files stored in HDFS (Hadoop Distributed File System) as input.
The map task takes care of loading, parsing, transforming, and filtering. The responsibility of the reduce task is grouping and aggregating the data produced by the map tasks to generate the final output.
Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
• The output produced by a map task is known as intermediate keys and values. These intermediate keys and values are sent to the reducer. Each reduce task is broken into the following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
• Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
• This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only the computational code is moved to process the data, which saves network bandwidth.
8.2 MAPPER

A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are individual tasks that have the responsibility of transforming input records into intermediate key-value pairs.

1. Record Reader: The Record Reader converts a byte-oriented view of the input (as generated by the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks with keys and values. Generally, the key is the positional information and the value is the chunk of data that constitutes the record.

2. Map: The map function works on the key-value pair produced by the Record Reader and generates zero or more intermediate key-value pairs. The MapReduce application decides the key-value pairs based on the context.

3. Combiner: It is an optional function, but it provides high performance in terms of network bandwidth and disk space. It takes the intermediate key-value pairs provided by a mapper and applies a user-specified aggregate function to the output of only that mapper.

4. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper, splits them into shards, and sends each shard to a particular reducer as per the user-specific code. Usually, all values with the same key go to the same reducer. The partitioned data of each map task is written to the local disk of that machine and pulled by the respective reducer.

8.3 REDUCER
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it onto the local machine where the reducer is running. These individual data pipes are then sorted by key, which produces a larger data list. The main purpose of this sort is grouping similar keys (the words, in the word count example) so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies the reduce function, and processes one group at a time. The reduce function iterates over all the values associated with a key. The reducer function provides various operations such as aggregation, filtering, and combining of data. Once it is done, the output (zero or more key-value pairs) of the reducer is sent to the output format.
3. Output Format: The output format separates each key-value pair with a tab (by default) and writes it out to a file using a record writer.
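The default tab separator comes from TextOutputFormat. A brief sketch of how that separator can be overridden in the driver is shown below; the property name is the one used by the newer mapreduce API (older releases used mapred.textoutputformat.separator), so treat it as an assumption to verify against your Hadoop version.

// Sketch: overriding TextOutputFormat's default tab separator in the driver.
// Property name assumed from the Hadoop 2.x+ mapreduce API.
Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");   // write "key,value" lines
Job job = Job.getInstance(conf, "word count");
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);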
Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word count problem.
The Word Count problem has been discussed under "Combiner" and "Partitioner".
8.4 COMBINER
It is an optimization technique for a MapReduce job. Generally, the reducer class is set to be the combiner class. The difference between the combiner class and the reducer class is as follows:
1. The output generated by the combiner is intermediate data, and it is passed to the reducer.
2. The output of the reducer is passed to the output file on disk.
• A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the
inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
• The main function of a Combiner is to summarize the map output records with the same key. The
output (key-value collection) of the combiner will be sent over the network to the actual Reducer
task as input.
• The Combiner class is used in between the Map class and the Reduce class to reduce the volume
of data transfer between Map and Reduce. Usually, the output of the map task is large and the
data transferred to the reduce task is high.
• The combiner phase sits in the MapReduce task flow between the map phase and the reduce phase.
How Does a Combiner Work?
• Here is a brief summary of how a MapReduce combiner works:
• A combiner does not have a predefined interface; it must implement the Reducer interface's reduce() method.
• A combiner operates on each map output key. It must have the same output key-value types as the Reducer class.
• A combiner can produce summary information from a large dataset because it replaces the original map output.
• Although the combiner is optional, it helps segregate data into multiple groups for the reduce phase, which makes the data easier to process.
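Since a combiner is just a Reducer implementation applied to each mapper's output, registering one is a single driver call. A minimal sketch, reusing the IntSumReducer class from the Word Count example above (the reuse is valid there because summing counts is associative and commutative):

// Sketch: registering a combiner in the driver (word count reuses its reducer
// as the combiner, since partial sums can be summed again at the reducer).
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // runs locally on each mapper's output
job.setReducerClass(IntSumReducer.class);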
MapReduce Combiner Implementation
The following example provides a theoretical idea about combiners. Let us assume we have the following input text file named [Link] for MapReduce.

The important phases of the MapReduce program with Combiner are discussed
below.
Record Reader
This is the first phase of MapReduce where the Record Reader reads every line
from the input text file as text and yields output as key-value pairs.
Input − Line by line text from the input file.
Output − Forms the key-value pairs. The following is the set of expected key-value
pairs.
Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the output as another set of key-value pairs.
Input − The following key-value pair is the input taken from the Record Reader.
The Map phase reads each key-value pair, divides the value into words using StringTokenizer, and treats each word as the key and the count of that word as the value. The Mapper class and the map function are essentially those shown in the Word Count sketch above.
• Output − The expected output is as follows −

Combiner Phase
The Combiner phase takes each key-value pair from the Map phase, processes it, and produces the output
as key-value collection pairs.
Input − The following key-value pair is the input taken from the Map phase.
• Output − The expected output is as follows −
8.5 PARTITIONER
• The partitioning happens after the map phase and before the reduce phase. The default partitioner is the hash partitioner.
• A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.
• A partitioner in MapReduce distributes the intermediate key-value pairs generated by the mappers to the reducers, ensuring a balanced workload and efficient processing.
• The number of partitions is equal to the number of reducers. That means the partitioner divides the data according to the number of reducers, and the data in a single partition is processed by a single reducer.
Partitioner
A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the
data using a user-defined condition, which works like a hash function. Let us take an
example to understand how the partitioner works.
MapReduce Partitioner Implementation
For the sake of convenience, let us assume we have a small table called Employee with the
following data. We will use this sample data as our input dataset to demonstrate how the
partitioner works.
We have to write an application to process the input dataset to find the highest salaried employee by gender in different age groups (for example, below 20, between 21 and 30, and above 30).
Input Data
The above data is saved as [Link] in the /home/hadoop/hadoopPartitioner directory and
given as input.
Based on the given input, the following is the algorithmic explanation of the program.
Map Tasks
The map task accepts key-value pairs as input, since we have the text data in a text file. The input for this map task is as follows −
Input − The key would be a pattern such as any special key + filename + line number (example: key = @input1)
and the value would be the data in that line (example: value = 1201 \t gopal \t 45 \t Male \t 50000).
Method − The operation of this map task is as follows −
Output − You will get the gender data and the record data value as key-value pairs.
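A minimal sketch of such a map task is given below. The field order (id, name, age, gender, salary, separated by tabs) is assumed from the sample record above, and the class name EmployeeMapper is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: emits (gender, whole record) so that records of the same gender
// form one group. Field order assumed: id \t name \t age \t gender \t salary.
public class EmployeeMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Text gender = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    gender.set(fields[3]);            // gender field becomes the output key
    context.write(gender, value);     // whole record becomes the output value
  }
}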
Partitioner Task
The partitioner task accepts the key-value pairs from the map task as its input. Partition
implies dividing the data into segments. According to the given conditional criteria of
partitions, the input key-value paired data can be divided into three parts based on the age
criteria.
Input − The whole data in a collection of key-value pairs.
key = Gender field value in the record.
value = Whole record data value of that gender.
Method − The process of partition logic runs as follows.
Output − The whole data of key-value pairs is segmented into three collections of key-value pairs. The Reducer works individually on each collection.
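A sketch of the partition logic described above is shown below. The class name AgePartitioner and the exact age boundaries (20 and below, 21 to 30, above 30) are assumptions for illustration; the age is read from the third field of the record value.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: routes each (gender, record) pair to one of three reducers based on
// the age field. Field order assumed: id \t name \t age \t gender \t salary.
public class AgePartitioner extends Partitioner<Text, Text> {

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    int age = Integer.parseInt(value.toString().split("\t")[2]);
    if (numReduceTasks == 0) {
      return 0;                        // no partitioning possible with zero reducers
    }
    if (age <= 20) {
      return 0;                        // age group: 20 and below
    } else if (age <= 30) {
      return 1 % numReduceTasks;       // age group: 21 to 30
    } else {
      return 2 % numReduceTasks;       // age group: above 30
    }
  }
}

In the driver, such a partitioner would be wired in with job.setPartitionerClass(AgePartitioner.class) and job.setNumReduceTasks(3), so that each age group lands on its own reducer.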
NoSQL (NOT ONLY SQL)
The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, relational database that did not expose the standard SQL interface. Johan Oskarsson, who was then a developer at Last.fm, reintroduced the term NoSQL in 2009 at an event called to discuss open-source distributed databases. The Twitter hashtag #NoSQL was coined by Eric Evans, and other database people at the event found it suitable to describe these non-relational databases.
Few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.
4.1.1 Where is it Used?
NoSQL databases are widely used in big data and other real-time web applications. Refer Figure 4.1. NoSQL databases are used to store log data, which can then be pulled for analysis. Likewise, they are used to store social media data and all such data that cannot be stored and analyzed comfortably in an RDBMS.

4.1.2 What is it?


NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed databases. They are hugely popular today owing to their ability to scale out (scale horizontally) and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured. Refer Figure 4.2 for additional features of NoSQL. NoSQL databases:
1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key-value, document-oriented, column-oriented, or graph-based databases.
2. Are distributed: The data is distributed across several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for the ACID properties of transactions. On the contrary, they adhere to Brewer's CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to the schema flexibility they support. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
