Algorithms
CSE 490H
Algorithms for MapReduce
Sorting
Searching
TF-IDF
BFS
PageRank
More advanced algorithms
MapReduce Jobs
Tend to be very short, code-wise
IdentityReducer is very common
Utility jobs can be composed
Represent a data flow, more so than a
procedure
Sort: Inputs
A set of files, one value per line.
Mapper key is file name, line number
Mapper value is the contents of the line
Sort Algorithm
Takes advantage of reducer properties:
(key, value) pairs are processed in order
by key; reducers are themselves ordered
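The sort described above can be sketched as a small in-memory simulation. This is a minimal sketch, not Hadoop code: `sort_mapper`, `range_partition`, `run_sort` and the `boundaries` list are hypothetical names, and the order-preserving partitioner is an assumption added so that the ordering of reducers yields a globally sorted result.

```python
from bisect import bisect_right
from collections import defaultdict

def sort_mapper(key, value):
    # key is (file name, line number) and is ignored;
    # the line's contents become the output key.
    yield value, None

def range_partition(key, boundaries):
    # Hypothetical order-preserving partitioner (an assumption):
    # k1 <= k2 implies partition(k1) <= partition(k2).
    return bisect_right(boundaries, key)

def run_sort(records, boundaries):
    partitions = defaultdict(list)
    for key, value in records:
        for out_key, _ in sort_mapper(key, value):
            partitions[range_partition(out_key, boundaries)].append(out_key)
    output = []
    for reducer_id in sorted(partitions):  # reducers are themselves ordered
        # sorted() stands in for the framework delivering keys in order;
        # the reducer itself is just the identity function.
        output.extend(sorted(partitions[reducer_id]))
    return output

records = [
    (("a.txt", 1), "pear"),
    (("a.txt", 2), "apple"),
    (("b.txt", 1), "zebra"),
    (("b.txt", 2), "mango"),
]
result = run_sort(records, boundaries=["m"])
```

With one boundary at "m", keys below "m" go to reducer 0 and the rest to reducer 1, so concatenating the reducer outputs in reducer order gives the full sorted list.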
Mapper
Input:
((word, docname), (n, N))
Output: (word, (docname, n, N, 1))
Reducer
Sums counts for word in corpus
Outputs ((word, docname), (n, N, m))
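The mapper and reducer above can be sketched in plain Python. This is a minimal sketch under assumptions: n is read as the count of the word in the document, N as the total words in that document, and m as the number of documents containing the word. The function names and the `run_job` driver are hypothetical and only simulate the shuffle in memory.

```python
from collections import defaultdict

def corpus_mapper(key, value):
    # Re-key by word alone so every document's record for that word
    # reaches the same reducer; the trailing 1 is what gets summed.
    (word, docname), (n, N) = key, value
    yield word, (docname, n, N, 1)

def corpus_reducer(word, values):
    # Buffers all records for the word, sums the 1s into m,
    # then re-emits each record with m attached.
    values = list(values)
    m = sum(one for _, _, _, one in values)
    for docname, n, N, _ in values:
        yield (word, docname), (n, N, m)

def run_job(records):
    # Hypothetical driver: groups mapper output by key, then reduces.
    grouped = defaultdict(list)
    for key, value in records:
        for k, v in corpus_mapper(key, value):
            grouped[k].append(v)
    out = {}
    for word in sorted(grouped):
        for k, v in corpus_reducer(word, grouped[word]):
            out[k] = v
    return out

records = [
    (("the", "d1"), (3, 10)),
    (("the", "d2"), (1, 5)),
    (("cat", "d1"), (2, 10)),
]
out = run_job(records)
```

Note that `corpus_reducer` holds every record for a word in a list before it can emit anything, which is fine for rare words but grows with the number of documents containing the word.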
Job 4: Calculate TF-IDF
Mapper
Input:
((word, docname), (n, N, m))
Assume D is known (or use an easy MR job to find it)
Output ((word, docname), TF*IDF)
Reducer
Just the identity function
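The Job 4 mapper can be sketched as below. The slide only says TF*IDF; the specific formulas used here, TF = n / N and IDF = log(D / m) with D the number of documents in the corpus, are the common textbook definitions and are an assumption. `tfidf_mapper` is a hypothetical name.

```python
import math

def tfidf_mapper(key, value, D):
    # key: (word, docname); value: (n, N, m); D: documents in the corpus.
    word, docname = key
    n, N, m = value
    tf = n / N              # assumed definition: term frequency in the document
    idf = math.log(D / m)   # assumed definition: inverse document frequency
    yield (word, docname), tf * idf

# The reducer is the identity function, so mapper output is the final output.
scores = dict(tfidf_mapper(("cat", "d1"), (2, 10, 1), D=2))
common = dict(tfidf_mapper(("the", "d1"), (3, 10, 2), D=2))
```

Under these definitions a word that occurs in every document has IDF = log(1) = 0, so its score is zero in every document.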
Working At Scale
Buffering (doc, n, N) counts while summing
1s into m may not fit in memory
How many documents does the word "the" occur
in?
Possible solutions
Ignore very-high-frequency words
Write out intermediate data to a file
Use another MR pass
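One way to read "use another MR pass" is sketched below: a first pass computes m per word with a running total instead of a buffer, and a second pass attaches m to each record. This is an illustration under assumptions, not the course's solution. All names are hypothetical, and the `counts` dictionary stands in for what would be a join against the first pass's output at real scale.

```python
from itertools import groupby

def count_reducer(word, ones):
    # Flat memory: a single running total, nothing is buffered.
    total = 0
    for one in ones:
        total += one
    yield word, total

def pass_a(records):
    # sorted() only simulates the shuffle that groups records by word.
    words = sorted(word for (word, _doc), _ in records)
    counts = {}
    for word, group in groupby(words):
        for w, m in count_reducer(word, (1 for _ in group)):
            counts[w] = m
    return counts

def pass_b(records, counts):
    # Second pass: attach m to each record, one record at a time.
    for (word, doc), (n, N) in records:
        yield (word, doc), (n, N, counts[word])

records = [
    (("the", "d1"), (3, 10)),
    (("the", "d2"), (1, 5)),
    (("cat", "d1"), (2, 10)),
]
counts = pass_a(records)
joined = list(pass_b(records, counts))
```

The trade-off is an extra write/read cycle in exchange for reducers whose memory use does not depend on how many documents contain a word.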
Final Thoughts on TF-IDF
Several small jobs add up to full algorithm
Lots of code reuse possible
Stock classes exist for aggregation, identity
Jobs 3 and 4 can really be done at once in
same reducer, saving a write/read cycle
Very easy to handle medium-large scale, but
must take care to ensure flat memory usage for
largest scale
BFS: Motivating Concepts
Performing computation on a graph data
structure requires processing at each node
Each node contains node-specific data as
well as links (edges) to other nodes
Computation must traverse the graph and
perform the computation step
1: 3, 18, 200
2: 6, 12, 80, 400
3: 1, 14
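Assuming each row above is a sparse adjacency list of the form `node: linked nodes`, it can be parsed into a node id and a list of neighbours. This is a minimal sketch; `parse_row` is a hypothetical helper and the row format is inferred from the three rows shown.

```python
def parse_row(line):
    # "1: 3, 18, 200" -> (1, [3, 18, 200])
    node, _, targets = line.partition(":")
    # int() tolerates the surrounding spaces; empty fields are skipped.
    return int(node), [int(t) for t in targets.split(",") if t.strip()]

rows = ["1: 3, 18, 200", "2: 6, 12, 80, 400", "3: 1, 14"]
graph = dict(parse_row(r) for r in rows)
```

Each entry of `graph` then carries exactly the links a mapper needs for that node, without the rest of the graph.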
Consequences of insights:
We can map each row of 'current' to a list of
PageRank fragments to assign to linkees
These fragments can be reduced into a single
PageRank value for a page by summing
Graph representation can be even more
compact; since each element is simply 0 or 1,
only transmit column numbers where it's 1
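The map and reduce steps described above can be sketched as one simplified iteration. This is an illustration under assumptions: each row of 'current' is taken to be (page, (rank, outgoing links)), a page splits its rank evenly among its linkees, and no damping factor is applied because the text above does not mention one. The function names are hypothetical.

```python
from collections import defaultdict

def pagerank_mapper(page, value):
    # Emit one PageRank fragment per linkee.
    rank, links = value
    for target in links:
        yield target, rank / len(links)

def pagerank_reducer(page, fragments):
    # Sum the fragments into a single value for the page.
    yield page, sum(fragments)

def one_iteration(current):
    # Hypothetical driver that simulates the shuffle in memory.
    grouped = defaultdict(list)
    for page, value in current.items():
        for target, fragment in pagerank_mapper(page, value):
            grouped[target].append(fragment)
    new_ranks = {}
    for page, fragments in grouped.items():
        for p, total in pagerank_reducer(page, fragments):
            new_ranks[p] = total
    return new_ranks

current = {1: (1.0, [2, 3]), 2: (1.0, [3]), 3: (1.0, [1])}
new_ranks = one_iteration(current)
```

This sketch returns only the new ranks; a full job would also carry each page's link list forward so the next iteration can run.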
Phase 1: Parse HTML