Understanding MapReduce Basics

This document provides an overview of the MapReduce programming model and how it works using an example of counting the number of movies each user rated in the MovieLens dataset. It explains that MapReduce distributes processing across a cluster, divides data into mapped and reduced partitions, and is resilient to failure. The example shows how the input data would be mapped to key-value pairs by user ID and movie ID, shuffled and sorted, then reduced to output the count of movies for each user. It also discusses how MapReduce scales by distributing tasks across multiple nodes in a cluster managed by YARN.

Uploaded by

cellule HD HD

MAPREDUCE FUNDAMENTAL CONCEPTS

Why MapReduce?

■ Distributes the processing of data on your cluster

■ Divides your data up into partitions that are MAPPED (transformed) and REDUCED (aggregated) by mapper and reducer functions you define

■ Resilient to failure – an application master monitors your mappers and reducers on each partition

Let’s illustrate with an example

■ How many movies did each user rate in the MovieLens data set?
How MapReduce Works: Mapping

■ The MAPPER converts raw source data into key/value pairs

INPUT DATA

Mapper

K1:V K2:V K3:V K1:V K1:V

Example: MovieLens Data ([Link] file)
USER ID|MOVIE ID|RATING|TIMESTAMP

196 242 3 881250949

186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488
Map users to movies they watched

USER ID|MOVIE ID|RATING|TIMESTAMP

196 242 3 881250949
186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488

Mapper

196:242 186:302 196:377 244:51 166:346 186:474 186:265

Extract and Organize What We Care About

196:242 186:302 196:377 244:51 166:346 186:474 186:265
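
The mapping step above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-separated fields as in the sample rows; the name `map_user_to_movie` is made up for this sketch and does not come from the slides.

```python
# Sample rows copied from the MovieLens excerpt shown above.
SAMPLE_LINES = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "196 377 1 878887116",
    "244 51 2 880606923",
    "166 346 1 886397596",
    "186 474 4 884182806",
    "186 265 2 881171488",
]


def map_user_to_movie(line):
    """Turn one raw rating line into a (user_id, movie_id) key/value pair."""
    # split() with no argument accepts tabs or spaces (an assumption about the file).
    user_id, movie_id, _rating, _timestamp = line.split()
    return user_id, movie_id


mapped = [map_user_to_movie(line) for line in SAMPLE_LINES]
print(" ".join(f"{key}:{value}" for key, value in mapped))
```

The rating and timestamp are dropped because only the user and the movie matter for this question.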

MapReduce Sorts and Groups the Mapped Data (“Shuffle and Sort”)

196:242 186:302 196:377 244:51 166:346 186:474 186:265

166:346 186:302,474,265 196:242,377 244:51
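
Hadoop performs the shuffle and sort for you, but a rough local equivalent may help show what it produces. The helper name `shuffle_and_sort` is hypothetical, and the input list is the mapper output from the example.

```python
from collections import defaultdict

# Mapper output from the example, as (user_id, movie_id) pairs.
mapped = [
    ("196", "242"), ("186", "302"), ("196", "377"), ("244", "51"),
    ("166", "346"), ("186", "474"), ("186", "265"),
]


def shuffle_and_sort(pairs):
    """Collect all values that share a key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys are unique after grouping, so sorting the items orders them by key only.
    return sorted(groups.items())


grouped = shuffle_and_sort(mapped)
for user_id, movies in grouped:
    print(f"{user_id}:{','.join(movies)}")
```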

The REDUCER Processes Each Key’s Values

166:346 186:302,474,265 196:242,377 244:51

len(movies)

166:1 186:3 196:2 244:1
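
A reducer for this example only needs to count the values it receives for each key. The sketch below assumes the grouped data shown above; `reduce_count_movies` is an illustrative name.

```python
# Output of the shuffle-and-sort step from the example.
grouped = [
    ("166", ["346"]),
    ("186", ["302", "474", "265"]),
    ("196", ["242", "377"]),
    ("244", ["51"]),
]


def reduce_count_movies(user_id, movies):
    """Reduce one key's values to a single result: how many movies were rated."""
    return user_id, len(list(movies))


counts = [reduce_count_movies(user_id, movies) for user_id, movies in grouped]
print(" ".join(f"{user_id}:{count}" for user_id, count in counts))
```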

Putting it All Together
USER ID|MOVIE ID|RATING|TIMESTAMP
196 242 3 881250949
186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488

MAPPER

196:242 186:302 196:377 244:51 166:346 186:474 186:265

SHUFFLE AND SORT

166:346 186:302,474,265 196:242,377 244:51

REDUCER

166:1 186:3 196:2 244:1

MAPREDUCE ON A CLUSTER
How MapReduce Scales
Putting it All Together
USER ID|MOVIE ID|RATING|TIMESTAMP
196 242 3 881250949
186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488

MAPPER

196:242 186:302 196:377 244:51 166:346 186:474 186:265

SHUFFLE AND SORT

166:346 186:302,474,265 196:242,377 244:51

REDUCER

166:1 186:3 196:2 244:1
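
To see why this scales, it can help to simulate the split on one machine. The sketch below is only a toy: the function names are made up, the "nodes" are plain lists, and `int(user_id) % num_reducers` stands in for the hash partitioning Hadoop uses by default. The point is that each input chunk can be mapped independently, and every key ends up at exactly one reducer.

```python
from collections import defaultdict

LINES = [
    "196 242 3 881250949", "186 302 3 891717742", "196 377 1 878887116",
    "244 51 2 880606923", "166 346 1 886397596", "186 474 4 884182806",
    "186 265 2 881171488",
]


def split_input(lines, num_splits):
    """Deal the input lines out into chunks, one per simulated map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def run_map_task(chunk):
    """Map one chunk on its own; no other chunk is needed."""
    return [tuple(line.split()[:2]) for line in chunk]


def partition(pairs, num_reducers):
    """Route each pair to a reducer so that all values for a key meet in one place."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for user_id, movie_id in pairs:
        partitions[int(user_id) % num_reducers][user_id].append(movie_id)
    return partitions


def run_reduce_task(groups):
    """Reduce one partition: count the movies for each user it holds."""
    return {user_id: len(movies) for user_id, movies in groups.items()}


mapped = [pair for chunk in split_input(LINES, 3) for pair in run_map_task(chunk)]
results = {}
for groups in partition(mapped, 3):
    results.update(run_reduce_task(groups))
print(sorted(results.items()))
```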

What’s Happening
[Diagram: a Client Node talks to the YARN Resource Manager; one NodeManager hosts the MapReduce Application Master; the other NodeManager nodes each run MapTask / ReduceTask processes; the data lives in HDFS.]
How are mappers and reducers written?

■ MapReduce is natively Java

■ STREAMING allows interfacing to other languages (e.g., Python)

[Diagram: a MapTask / ReduceTask exchanges key/values with a Streaming Process over stdin and stdout.]
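
A streaming mapper and reducer are just programs that read lines and write lines. The sketch below is a local approximation, assuming tab-separated key/value lines (the Hadoop Streaming default) and a reducer whose input arrives sorted by key. The function names are illustrative, and `io.StringIO` stands in for the real stdin and stdout.

```python
import io
from itertools import groupby


def streaming_mapper(stdin, stdout):
    """Read raw rating lines and write 'user_id<TAB>movie_id' lines."""
    for line in stdin:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            stdout.write(f"{fields[0]}\t{fields[1]}\n")


def streaming_reducer(stdin, stdout):
    """Read key-sorted 'key<TAB>value' lines and write one count per key."""
    rows = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        stdout.write(f"{user_id}\t{sum(1 for _ in group)}\n")


# Local stand-in for the pipeline: map, sort by key, then reduce.
raw = io.StringIO("196 242 3 881250949\n186 302 3 891717742\n196 377 1 878887116\n")
map_out = io.StringIO()
streaming_mapper(raw, map_out)

sorted_lines = sorted(map_out.getvalue().splitlines(keepends=True))
reduce_out = io.StringIO()
streaming_reducer(sorted_lines, reduce_out)
print(reduce_out.getvalue(), end="")
```

In a real streaming job each function would live in its own script and use `sys.stdin` and `sys.stdout`.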
Handling Failure

■ Application master monitors worker tasks for errors or hanging

– Restarts as needed
– Preferably on a different node
■ What if the application master goes down?
– YARN can try to restart it
■ What if an entire Node goes down?
– This could be the application master
– The resource manager will try to restart it
■ What if the resource manager goes down?
– Can set up “high availability” (HA) using Zookeeper to
have a hot standby
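
The "restart, preferably on a different node" idea can be illustrated with a toy retry loop. This is not how YARN is implemented. The node names, the `RuntimeError` failure signal, and the `run_with_retries` helper are all assumptions made for the sketch.

```python
def run_with_retries(task, nodes, max_attempts=3):
    """Run a task, moving to the next node each time an attempt fails."""
    errors = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # pick a different node per attempt
        try:
            return task(node), node
        except RuntimeError as exc:
            errors.append(f"{node}: {exc}")
    raise RuntimeError("task failed on all attempts: " + "; ".join(errors))


def flaky_map_task(node):
    """Hypothetical task that always crashes on node-1."""
    if node == "node-1":
        raise RuntimeError("simulated crash")
    return "map output"


result, node_used = run_with_retries(flaky_map_task, ["node-1", "node-2", "node-3"])
print(result, "from", node_used)
```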
MAPREDUCE: A REAL EXAMPLE
How many of each movie rating exist?
Making it a MapReduce problem

■ MAP each input line to (rating, 1)

■ REDUCE each rating with the sum of all the 1’s

USER ID|MOVIE ID|RATING|TIMESTAMP    Map (rating, 1)

196 242 3 881250949                  3,1
186 302 3 891717742                  3,1
196 377 1 878887116                  1,1
244 51 2 880606923                   2,1
166 346 1 886397596                  1,1
186 474 4 884182806                  4,1
186 265 2 881171488                  2,1

Shuffle & Sort      Reduce

1 -> 1, 1           1, 2
2 -> 1, 1           2, 2
3 -> 1, 1           3, 2
4 -> 1              4, 1
Putting it all together
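
A minimal sketch of the rating-count job in plain Python. The mapper and reducer are written as generators, in the same shape as the `mapper_get_ratings` and `reducer_count_ratings` methods named later in these slides. The `run_local` driver is a hypothetical stand-in for what the framework does between them, and splitting on whitespace is an assumption about the input file.

```python
from itertools import groupby

LINES = [
    "196 242 3 881250949", "186 302 3 891717742", "196 377 1 878887116",
    "244 51 2 880606923", "166 346 1 886397596", "186 474 4 884182806",
    "186 265 2 881171488",
]


def mapper_get_ratings(_, line):
    """MAP each input line to (rating, 1)."""
    _user_id, _movie_id, rating, _timestamp = line.split()
    yield rating, 1


def reducer_count_ratings(rating, ones):
    """REDUCE each rating to the sum of all its 1's."""
    yield rating, sum(ones)


def run_local(lines):
    """Hypothetical local driver: map, sort by key, group, then reduce."""
    mapped = [pair for line in lines for pair in mapper_get_ratings(None, line)]
    mapped.sort(key=lambda pair: pair[0])
    output = []
    for rating, group in groupby(mapped, key=lambda pair: pair[0]):
        output.extend(reducer_count_ratings(rating, (value for _, value in group)))
    return output


print(run_local(LINES))
```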
RUNNING MAPREDUCE WITH MRJOB
Run our MapReduce job in our Hadoop installation
Installing what we need: HDP 2.5
■ PIP
– cd /etc/[Link].d
– cp [Link] /tmp
– rm [Link]
– cd ~
– yum install python-pip
■ MRJob
– pip install google-api-python-client==1.6.4
– pip install mrjob==0.5.11
■ Nano
– yum install nano
■ Data files and the script
– wget [Link]
– wget [Link]
Installing what we need: HDP 2.6.5

■ PIP
– Utility for installing Python packages
– yum install python-pip
■ MRJob
– pip install mrjob==0.5.11
■ Nano
– yum install nano
■ Data files and the script
– wget [Link]
100k/[Link]
– wget [Link]
[Link]/hadoop/[Link]
Running with mrjob

■ Run locally
– python [Link] [Link]
■ Run with Hadoop
– python [Link] -r hadoop --hadoop-streaming-jar
/usr/hdp/current/hadoop-mapreduce-client/[Link]
[Link]
YOUR CHALLENGE
Sort movies by popularity with Hadoop
Challenge exercise

■ Count up ratings given for each movie

– All you need is to change one thing in the mapper – we don’t care about
ratings now, we care about movie ID’s!
– Start with this and make sure you can do it.
– You can use nano to just edit the existing [Link] script
Stretch goal

■ Sort the movies by their numbers of ratings

■ Strategy:
– Map to (movieID, 1) key/value pairs
– Reduce with output of (rating count, movieID)
– Send this to a second reducer so we end up with things sorted by rating
count!
■ Gotchas:
– How do we set up more than one MapReduce step?
– How do we ensure the rating counts are sorted properly?
Multi-stage jobs

■ You can chain map/reduce stages together like this:

# Inside your MRJob subclass; needs "from mrjob.step import MRStep" at the top of the script
def steps(self):
    return [
        MRStep(mapper=self.mapper_get_ratings,
               reducer=self.reducer_count_ratings),
        MRStep(reducer=self.reducer_sorted_output)
    ]
Ensuring proper sorting

■ By default, streaming treats all input and output as strings. So things get
sorted as strings, not numerically.
■ There are different formats you can specify. But for now let’s just zero-pad
our numbers so they’ll sort properly.
■ The first reducer will look like this:
def reducer_count_ratings(self, key, values):
    # Zero-pad the count to 5 digits so that string order matches numeric order
    yield str(sum(values)).zfill(5), key
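
A quick way to see why the zero-padding matters, using a few made-up counts (not taken from the dataset):

```python
counts = [5, 40, 300, 7]  # illustrative values only

# Sorted as plain strings, "300" comes before "40" and "5".
as_strings = sorted(str(count) for count in counts)

# Zero-padded to a fixed width, string order matches numeric order.
padded = sorted(str(count).zfill(5) for count in counts)

print(as_strings)
print(padded)
```

A width of 5 only works while every count stays below 100,000; larger counts would need a wider pad.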
Iterating through the results

■ Spoiler alert!
def reducer_sorted_output(self, count, movies):
    # Keys (zero-padded counts) arrive in sorted order; emit every movie that has this count
    for movie in movies:
        yield movie, count
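
The two stages can be tried out without Hadoop by simulating the chain locally. In the sketch below the three functions mirror the ones from the slides, rewritten as plain functions. The `run_step` helper is a hypothetical stand-in for one MapReduce step, and the input rows are invented so that the movies have different counts.

```python
from itertools import groupby


def run_step(pairs, reducer, mapper=None):
    """Hypothetical stand-in for one step: optional map, sort by key, group, reduce."""
    if mapper is not None:
        pairs = [out for key, value in pairs for out in mapper(key, value)]
    pairs = sorted(pairs, key=lambda pair: pair[0])
    results = []
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        results.extend(reducer(key, [value for _, value in group]))
    return results


def mapper_get_ratings(_, line):
    _user_id, movie_id, _rating, _timestamp = line.split()
    yield movie_id, 1


def reducer_count_ratings(movie_id, ones):
    # The zero-padded count becomes the key, so the next step sorts by it.
    yield str(sum(ones)).zfill(5), movie_id


def reducer_sorted_output(count, movies):
    for movie in movies:
        yield movie, count


# Invented rows (user, movie, rating, timestamp), not from the MovieLens file.
LINES = ["1 50 5 0", "2 50 4 0", "3 50 3 0", "1 242 3 0", "2 242 2 0", "3 377 1 0"]

stage_one = run_step([(None, line) for line in LINES],
                     reducer_count_ratings, mapper=mapper_get_ratings)
stage_two = run_step(stage_one, reducer_sorted_output)
print(stage_two)
```

The second step has no mapper, so the output of the first reducer goes straight into the sort, which is what puts the least-rated movies first.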
CHECK YOUR RESULTS
Did it work?
My solution
