
2. Word Count Program Using the MapReduce API

Aim: To write a word count program using the MapReduce API for the Apache
Hadoop framework.

MapReduce API:
Hadoop MapReduce is a software framework for easily writing applications
that process vast amounts of data (multi-terabyte datasets) in parallel on
large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks
which are processed by the map tasks in a completely parallel manner. The
framework sorts the outputs of the maps, which are then given as input to
the reduce tasks.

Inputs and Outputs


The MapReduce framework operates exclusively on <key, value> pairs, that is,
the framework views the input to the job as a set of <key, value> pairs and
produces a set of <key, value> pairs as the output of the job, conceivably of
different types.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
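As a concrete illustration, in this word count job the input key is the byte
offset of each line (supplied by the default TextInputFormat) and the value is
the line itself, so a single input line flows through the stages roughly like this:

(input) <0, "hello world hello"> -> map -> <hello, 1>, <world, 1>, <hello, 1> -> combine -> <hello, 2>, <world, 1> -> reduce -> <hello, 2>, <world, 1> (output)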

Steps to execute the WordCount Java Program:

1. Set CLASSPATH

nano ~/.bashrc

export CLASSPATH=${HADOOP_HOME}/share/hadoop/common/hadoop-common-3.3.1.jar:${HADOOP_HOME}/share/hadoop/mapreduce/*:

source ~/.bashrc
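To confirm the variable is set, you can print it in a new shell session; the
hadoop-common jar version (3.3.1 here) should match the release actually
installed under ${HADOOP_HOME}:

echo $CLASSPATH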

2. WordCount.java program (write the program using gedit)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in an input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Split the line on whitespace and emit each token with a count of 1
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all the counts received for a given word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "WordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer class is also used as the combiner for local aggregation
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (args[0])
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (args[1]), must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
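Note that the same IntSumReducer class is registered both as the combiner and as the
reducer. This works because summing counts is associative and commutative: the
combiner pre-aggregates each mapper's output locally, cutting down the data shuffled
across the network without changing the final totals.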

3. Compilation

javac WordCount.java

4. Jar file creation

jar cf wc.jar WordCount*.class
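Compiling WordCount.java produces three class files (one for the outer class and one
for each nested class), and all of them must go into the JAR. To verify, you can list
the archive contents:

jar tf wc.jar

which should show WordCount.class, WordCount$TokenizerMapper.class and
WordCount$IntSumReducer.class.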

5. hello.txt (create this sample input file using gedit)

The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be prone
to failures.

The Hadoop Distributed File System (HDFS) is a distributed file system


designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant. HDFS is highly fault-tolerant and is designed to be
deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS
relaxes a few POSIX requirements to enable streaming access to file system
data. HDFS was originally built as infrastructure for the Apache Nutch web
search engine project. HDFS is part of the Apache Hadoop Core project. The
project URL is http://hadoop.apache.org/.

6. Execution of the Java program (the JAR file is a group of class files)

hdfs dfs -mkdir input

hdfs dfs -put hello.txt input

hadoop jar wc.jar WordCount input output
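Once the job completes, the word counts can be read back from HDFS. With the default
single reducer, the result is typically written to a file named part-r-00000 inside
the output directory:

hdfs dfs -cat output/part-r-00000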

7. Output

(HDFS) 1
Apache 3
Core 1
Distributed 1
File 1
HDFS 5
Hadoop 3
However, 1
It 2
Nutch 1
POSIX 1
Rather 1
System 1
The 3
URL 1
a 5
access 2
across 1
allows 1
and 4
application 2
applications 1
are 1
as 1
at 1
be 2
built 1
cluster 1
clusters 1
commodity 1
computation 1
computers 1
computers, 1
data 3
data. 1
deliver 1
delivering 1
deployed 1
designed 4
detect 1
differences 1
distributed 4
each 2
enable 1
engine 1
existing 1
failures 1
failures. 1
fault-tolerant 1
few 1
file 4
for 3
framework 1
from 2
handle 1
hardware 1
hardware. 2
has 1
have 1
high 1
high-availability, 1
highly 1
highly-available 1
http://hadoop.apache.org/. 1
infrastructure 1
is 9
itself 1
large 2
layer, 1
library 2
local 1
low-cost 1
machines, 1
many 1
may 1
models. 1
of 7
offering 1
on 4
originally 1
other 1
part 1
processing 1
programming 1
project 1
project. 2
prone 1
provides 1
relaxes 1
rely 1
requirements 1
run 1
scale 1
search 1
servers 1
service 1
sets 1
sets. 1
significant. 1
similarities 1
simple 1
single 1
so 1
software 1
storage. 1
streaming 1
suitable 1
system 2
systems 1
systems. 1
than 1
that 2
the 6
thousands 1
throughput 1
to 10
top 1
up 1
using 1
was 1
web 1
which 1
with 1
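The listing also reflects how the mapper tokenizes input: StringTokenizer splits only
on whitespace, so punctuation stays attached to words and counting is case-sensitive,
which is why entries such as hardware and hardware. (or project and project.) appear
separately.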
