02-Wordcount Mapreduce
A word count MapReduce program written in Java has two phases:
Map Phase – the data transformation and pre-processing step. Data is read as key-value pairs and, after processing, is sent to the reduce phase.
Reduce Phase – the data is aggregated and the business logic is applied in this phase; the result is then passed to the next big data tool in the data pipeline for further processing.
For example, in a word count job the map phase emits a (word, 1) pair for every word in its input, and the reduce phase sums these pairs to produce (word, total count).
The standard Hadoop MapReduce model has Mappers, Reducers, Combiners, a Partitioner, and sorting, all of which manipulate the structure of the data to fit the business requirements. To perform these transformation operations, the map and reduce phases make use of data structures such as arrays.
Ex. No. 2:
AIM:
Word count program to demonstrate the use of Map and Reduce tasks
STEPS:
1. Analyze the input file content
2. Develop the code
a. Writing a map function
b. Writing a reduce function
c. Writing the Driver class
3. Compiling the source
4. Building the JAR file
5. Starting the DFS
6. Creating Input path in HDFS and moving the data into Input path
7. Executing the program
$ cd ~
$ sudo mkdir wordcount
$ cd wordcount
$ sudo nano WordCount.java
Wordcount program:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*; // job configuration classes
import org.apache.hadoop.io.*; // Hadoop data types (Text, IntWritable, LongWritable)
import org.apache.hadoop.mapred.*; // old (mapred) MapReduce API: JobConf, Mapper, Reducer, ...
import org.apache.hadoop.util.*; // utilities for running the MapReduce application
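The listing above shows only the imports; a minimal sketch of the rest of WordCount.java, following the classic old-API (org.apache.hadoop.mapred) word count example, could look like this:

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory

    JobClient.runJob(conf);
  }
}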
Compile the source with javac:
$ cd ~
$ cd wordcount
$ javac -cp $(hadoop classpath) WordCount.java
Three Java class files are created (one for the outer class and one each for the inner map and reduce classes). To check, type
$ ls
Create a text file and move it into the input folder in the Hadoop file system (HDFS).
$ nano hello.txt
Type some sample text into the file, for example:
Data transformation and pre-processing step. Data is input in terms of key value pairs and
after processing is sent to the reduce phase.
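The commands for the remaining steps (building the JAR, starting the DFS, creating the input path, and executing the program) are not listed above; a typical sequence, assuming the JAR is named wc.jar and the HDFS input directory is named 'input', is:
$ jar cf wc.jar WordCount*.class           # build the JAR from the compiled classes
$ start-dfs.sh                             # start the DFS daemons
$ hdfs dfs -mkdir -p input                 # create the input path in HDFS
$ hdfs dfs -put hello.txt input            # move the data into the input path
$ hadoop jar wc.jar WordCount input out1   # execute the program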
The last two arguments are the input directory, which contains the input file, and the output directory, where the result will be generated in the ‘out1’ folder. ‘out1’ must not exist before the job runs, because Hadoop refuses to overwrite an existing output directory.
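Once the job completes, the result can be read directly from HDFS; with a single reducer the old-API output file is named part-00000 by default:
$ hdfs dfs -cat out1/part-00000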
If any error occurs:
1. Edit the file hadoop-env.sh in /usr/local/hadoop/etc/hadoop and add the following line (typically needed when Hadoop complains that it cannot load its native libraries):
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
2. If ssh localhost fails (for example, when starting the DFS):
● Enable debugging with ssh -vvv localhost and investigate the error in detail.
● Check the SSH server configuration in /etc/ssh/sshd_config, in
particular the options PubkeyAuthentication (which should be set to yes)
and AllowUsers (if this option is active, add the hduser user to it). If you
made any changes to the SSH server configuration file, you can force a
configuration reload with sudo /etc/init.d/ssh reload.
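● If passwordless SSH has not been set up for hduser at all, a typical key setup for the single-node configuration (commands assumed to run as hduser) is:
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa            # generate a key pair with an empty passphrase
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key for localhost logins
$ chmod 600 ~/.ssh/authorized_keys                    # restrict permissions so sshd accepts the file
$ ssh localhost                                       # verify that passwordless login works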