WordCount Program Hadoop Task 2
Aim: To write a word count program using the MapReduce API for the
Apache Hadoop framework.
MapReduce API:
Hadoop MapReduce is a software framework for writing applications that
process vast amounts of data (multi-terabyte data-sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner. The
framework sorts the outputs of the maps, which are then input to the reduce
tasks.
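The flow described above (map each chunk, sort and group the map outputs by key, then reduce each group) can be sketched in plain Java without a cluster. This is only an in-memory illustration, not the Hadoop API: the class name MapReduceSketch and the methods map, shuffle, reduce and wordCount are hypothetical names chosen for this sketch, and the two input chunks are made-up sample strings.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of the map -> sort/group -> reduce flow.
// All names here are hypothetical; nothing in this class is part of Hadoop.
public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(chunk);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Sort/group phase: collect the values of each key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases over a list of independent input chunks.
    static TreeMap<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String chunk : chunks) {
            pairs.addAll(map(chunk));
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // Two made-up chunks standing in for input splits.
        List<String> chunks = Arrays.asList("Hadoop is a framework", "a framework for Hadoop");
        wordCount(chunks).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

In a real job the chunks are processed by separate map tasks on different nodes and the framework performs the sort/group step; here a single loop and a TreeMap stand in for both.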
1. Set CLASSPATH
nano ~/.bashrc
export CLASSPATH=${HADOOP_HOME}/share/hadoop/common/hadoop-common-3.3.1.jar:${HADOOP_HOME}/share/hadoop/mapreduce/*:
source ~/.bashrc
2. Program (WordCount.java)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
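Only the import lines of WordCount.java are given here; the class body is not shown. The listing below is a sketch of what a complete file could look like, assuming it follows the well-known word count example from the Apache Hadoop MapReduce tutorial, which uses exactly these imports. The nested class names TokenizerMapper and IntSumReducer, the job name "word count", and the use of the reducer as a combiner are assumptions taken from that example, not from this document. The imports are repeated so the listing can be saved as one file.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token of an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums all counts received for one word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path (must not exist yet).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Compiling this file needs the Hadoop jars on the CLASSPATH, as set in step 1. Summation is associative and commutative, which is why the same reducer class can also serve as the combiner in this sketch.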
3. Compilation
javac WordCount.java
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be prone
to failures.
7. Output
(HDFS) 1
Apache 3
Core 1
Distributed 1
File 1
HDFS 5
Hadoop 3
However, 1
It 2
Nutch 1
POSIX 1
Rather 1
System 1
The 3
URL 1
a 5
access 2
across 1
allows 1
and 4
application 2
applications 1
are 1
as 1
at 1
be 2
built 1
cluster 1
clusters 1
commodity 1
computation 1
computers 1
computers, 1
data 3
data. 1
deliver 1
delivering 1
deployed 1
designed 4
detect 1
differences 1
distributed 4
each 2
enable 1
engine 1
existing 1
failures 1
failures. 1
fault-tolerant 1
few 1
file 4
for 3
framework 1
from 2
handle 1
hardware 1
hardware. 2
has 1
have 1
high 1
high-availability, 1
highly 1
highly-available 1
http://hadoop.apache.org/. 1
infrastructure 1
is 9
itself 1
large 2
layer, 1
library 2
local 1
low-cost 1
machines, 1
many 1
may 1
models. 1
of 7
offering 1
on 4
originally 1
other 1
part 1
processing 1
programming 1
project 1
project. 2
prone 1
provides 1
relaxes 1
rely 1
requirements 1
run 1
scale 1
search 1
servers 1
service 1
sets 1
sets. 1
significant. 1
similarities 1
simple 1
single 1
so 1
software 1
storage. 1
streaming 1
suitable 1
system 2
systems 1
systems. 1
than 1
that 2
the 6
thousands 1
throughput 1
to 10
top 1
up 1
using 1
was 1
web 1
which 1
with 1
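The listing above counts "computers" and "computers," separately, keeps "The" apart from "the", and places capitalised words before lowercase ones. That pattern is what whitespace-only tokenizing with java.util.StringTokenizer (imported in the program) would produce, assuming the mapper uses it with its default delimiters. The sketch below shows the effect in plain Java; the class name TokenSketch, the method tokens and the sample line are hypothetical, and the sample line is made up from words that occur in the input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper showing how default StringTokenizer splitting treats
// punctuation and letter case; not part of the job itself.
public class TokenSketch {

    // Split on whitespace only, which is StringTokenizer's default behaviour.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line built from words in the input text.
        List<String> words = tokens("The library runs on clusters of computers, and the computers may fail");
        // Natural String order puts uppercase ASCII letters before lowercase ones.
        Collections.sort(words);
        System.out.println(words);
    }
}
```

Punctuation stays attached to its token, so "computers," and "computers" are distinct keys, and case is preserved, so "The" and "the" are counted apart. To merge such variants, the mapper would have to normalise each token (for example lowercase it and strip punctuation) before emitting it.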