Exploring Dataset in MapReduce
Preface
The Data
Yelp
Yelp is a multi-platform local business review and social networking
application that publishes crowd-sourced reviews about local businesses.
It is operated by Yelp Inc., headquartered in San Francisco, California.
Reviewers rate businesses on a five-star rating system.
Yelp dataset
Every year Yelp publishes a dataset about local businesses, users, and
reviews from across the country. The Yelp Dataset Challenge, with a $35K
prize, invites participants to mine the academic data and analyze trends
such as which cuisines Yelpers are raving about in different countries,
and how much of a business's success is really due to just location.
The dataset we are going to explore is the Yelp business and review
dataset, which is provided in JSON format.
The business dataset represents every business with a unique
business id, and also provides the city, location, state, and similar
attributes.
The review dataset holds the individual reviews, each written by a user
identified by a user id against a business id.
[Figure: sample Business and Review records]
Data Processing Environments
The operations performed on the Yelp dataset used the following
environment settings.
Framework    | Host OS     | Virtualization Software | VM Image
HPCC Systems | Windows 8.1 | VMware Player 5.0.3     | HPCC Systems VM-amd64-4.2.43
HIVE         | Windows 8.1 | Oracle VM VirtualBox    | Cloudera Quick Start VM 4.7.0 (VirtualBox, Red Hat 64 bit)
Map Reduce   | Windows 8.1 | Oracle VM VirtualBox    | Ubuntu-12.04.5-desktop-i386 ISO 64 bit
Data Loading
HPCC
Data loading is controlled through the Distributed File Utility (DFU)
server. Data typically arrives on the landing zone (for example, by FTP).
File movement across components is initiated by the DFU. Data is copied
from the landing zone and distributed (sprayed) to the Data Refinery
(Thor), where it is addressed by ECL code.
A single physical file is distributed into multiple physical files across
the nodes of a cluster. The aggregate of the physical files creates one
logical file that is addressed by the ECL code.
Here the JSON data is converted into CSV and uploaded to the cluster.
HADOOP
We can use the command line to manage files on HDFS.
hadoop fs -put:
Copies a single source file, or multiple source files, from the local file
system to the Hadoop distributed file system.
hadoop fs -put /home/user/review_cleaned.csv /user/hadoop/dir3/
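The same transfer can also be done programmatically. Below is a minimal sketch using Hadoop's Java FileSystem API; the class name HdfsPut is ours, and the paths simply repeat the command above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        // Same transfer as the hadoop fs -put example above
        fs.copyFromLocalFile(new Path("/home/user/review_cleaned.csv"),
                             new Path("/user/hadoop/dir3/"));
        fs.close();
    }
}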
HUE
Using the Hue Metastore Manager in Cloudera, we can create a table
manually from a JSON file by specifying the input column names.
A Python script was used to clean the JSON; the cleaned review JSON
has the following headers.
["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
Queries
We mainly focus on three operations: COUNT, GROUP, and JOIN.
- A COUNT query of the number of records based on a filter (city)
- A GROUP query on one of the columns of the dataset (city, review count)
- A JOIN of the two datasets (business and review)
Results
HPCC
Q2: Find the total number of review counts for businesses in each
city.
This is a GROUP BY based on city.
Code Snippet
CityCountRec := RECORD
  Biz_Clean.city;
  statecount := SUM(GROUP, Biz_Clean.review_c);
END;
personSummary := TABLE(Biz_Clean, CityCountRec, city);
OUTPUT(personSummary, ALL);
In ECL, we group records by defining a cross-tab RECORD layout that
refers to the dataset and passing it, together with the grouping field,
to TABLE.
[Figure: query result]
[Figure: execution graph]
HIVE
Q2: Find the total number of review counts for businesses in each
city.
A Hive query with SUM and GROUP BY fetches the result; after executing
it in the Hive query editor we get the result below.
SELECT b.city, SUM(review_count) AS review FROM yelp_biz b GROUP BY b.city
[Figure: Hive query result]
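As an aside, the same HiveQL is not tied to the query editor: it can also be submitted over HiveServer2's JDBC interface. A minimal sketch follows; the host, port, and empty credentials are assumptions for a default single-node setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCityReviews {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host/port assume a default single-node setup
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT b.city, SUM(review_count) AS review "
                   + "FROM yelp_biz b GROUP BY b.city")) {
            while (rs.next()) {
                // Print each city with its total review count
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}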
MAP REDUCE
The CSV file for the dataset has been uploaded into HDFS, and the
processing is done with a Java MapReduce job.
[Figure: Mapper and Reducer code snippets]
Q2: Find the total number of review counts for businesses in each
city.
[Figure: Mapper code snippet]
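The mapper and reducer in the original report were screenshots and could not be recovered. Below is a minimal sketch of a job for Q2 in the standard Hadoop Java MapReduce API; it is a reconstruction, not the authors' code, and the CSV column positions (CITY_COL, REVIEW_COUNT_COL) are assumptions to be adjusted to the actual business CSV layout.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CityReviewCount {

    public static class CityMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // Assumed CSV column positions; adjust to the actual business file layout.
        private static final int CITY_COL = 3;
        private static final int REVIEW_COUNT_COL = 5;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Naive split; a real job would use a CSV parser to handle quoted commas
            String[] cols = value.toString().split(",");
            if (cols.length <= Math.max(CITY_COL, REVIEW_COUNT_COL)) return;
            try {
                long count = Long.parseLong(cols[REVIEW_COUNT_COL].trim());
                // Emit (city, review_count) for each business record
                context.write(new Text(cols[CITY_COL].trim()), new LongWritable(count));
            } catch (NumberFormatException e) {
                // Skip the header row and malformed lines
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text city, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get(); // total review count per city
            context.write(city, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "city review count");
        job.setJarByClass(CityReviewCount.class);
        job.setMapperClass(CityMapper.class);
        job.setCombinerClass(SumReducer.class); // sums are associative, so the reducer doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a jar, the job would be run as, for example, hadoop jar cityreviewcount.jar CityReviewCount <input csv> <output dir>, where the jar name and paths are hypothetical.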
Summary
The easiest way of processing the dataset was Hive, since the dataset
can be loaded and its layout defined through Hue. HiveQL increases
usability with its SQL-like syntax. Writing MapReduce programs for big
data problems is straightforward if you are familiar with Java, and it
gives you more control over implementation and execution.
HPCC is a lesser-known platform but powerful enough to compete with
Hadoop. Its language, ECL, is powerful, but its unfamiliar syntax and
semantics mean more development time for novices. If you consider
moving files from HPCC to Hadoop, comma-separated values (CSV) files
are the best option, even though HPCC can also produce XML files.
As most of the queries were executed on single-node local or virtual
machine environments, a performance benchmark comparing the big data
platforms or technologies would be of little value. At this point the
only comparable measures are lines of code and coding time.