
CpSc 418 Homework 3: Architecture and Midterm Review

Due: October 22, 2024, 11:59pm
Early-Bird: October 20, 2024, 11:59pm

Prelude
Please submit your solution using:
    handin cs-418 hw3
Your solution should contain at least:
    hw3.pdf: PDF for your answers.

1 The Questions (100 points)


Attempt any four of the questions below.
1. Associative Functions (25 points)
For each function below, state whether it is associative or not. If it is associative, give a short proof; if not,
give a counter-example. Ignore round-off errors (i.e., treat all functions as mathematically exact).
If a function appears to be associative but you are stuck on the proof, you can give five test cases to get half
credit if it is associative. Your test cases should provide values X, Y, and Z such that
f(f(X, Y), Z) == f(X, f(Y, Z)).
If you give test cases that pass but the function is not actually associative, you will get 1 point
for the examples.

(a) a(X, Y) -> X+Y.
(b) b(X, Y) -> (X+Y)*2.
(c) c(X, Y) -> X*Y - 2.
(d) d(X, Y) -> max(X, Y) + 2.
(e) e(X, Y) -> max(X, Y) * 2.

Hint: max is associative and commutative, and addition distributes over max, i.e.
max(X, Y) + Z == max(X+Z, Y+Z).
What about max(X, Y) * Z?

[Figure 1 shows a binary tree over eight worker processes, numbered 0 through 7, whose leaf squares hold the values 3, 2, 4, −7, −4, 9, 8, −6. The interior circles are labeled 0:1, 2:3, 4:5, 6:7 and 0:3, 4:7; the root hexagon is labeled 0:7. The scan's initial accumulator is AccIn = 2.]

Figure 1: A Tree for Computing Exclusive Scan

2. Scan (25 points)
Figure 1 shows a tree for computing a scan. Assume that this is an exclusive scan. Let Combine
be the associative operator for combining values in the scan.
Each of the squares at the bottom of the figure represents the task performed by one of the worker processes.
For example, worker process 3 is represented by the square box labeled 3. Each worker performs some
local computation and then invokes scan. We are using Schwartz’s algorithm; so the processes that
did the local computation will also perform the operations in the tree nodes.
The circles and hexagon indicate the non-leaf tree nodes; the root-node is drawn as a hexagon. Each
non-leaf node is labeled Lo:Hi to indicate the inclusive range of worker processes in its subtree.
(a) (9 points)
Consider an interior node in the tree, i.e. a circle in Figure 1. When performing a scan, each such
node performs the following actions (but not necessarily in the order listed below):
• Receive a value, FromLeft, from its left subtree.
• Receive a value, FromRight, from its right subtree.
• Receive a value, FromParent, from its parent.
• Compute a value, ToLeft, to send to its left subtree.
• Compute a value, ToRight, to send to its right subtree.
• Compute a value, ToParent, to send to its parent.
• Send ToLeft to its left subtree.
• Send ToRight to its right subtree.
• Send ToParent to its parent.
Write these operations in an order that correctly implements Schwartz’s algorithm. You can write
the operations as “receive FromLeft”, “compute ToRight”, etc.

1.
2.
3.
4.
5.
6.
7.
8.
9.
(b) (4 points)
Schwartz's algorithm reduces communication by having the same process handle the tasks associated
with multiple nodes in the tree.
i. Which tasks are handled by the process for worker 4? You can write these by just listing the
labels for the tree nodes (i.e. the labels inside the squares, circles, or hexagon).

ii. For task 0:3, is FromLeft a value received from another process or is it a value previously
computed by the process for this node?

iii. For task 0:3, is FromRight a value received from another process or is it a value previously
computed by the process for this node?

iv. For task 0:3, is FromParent a value received from another process or is it a value previously
computed by the process for this node?

(c) (6 points) Write expressions for ToLeft, ToRight, and ToParent, using the FromLeft, FromRight,
FromParent, and the Combine function:

ToLeft =
ToRight =
ToParent =
(d) (6 points) The blue arrows in Figure 1 show the values sent from child tasks to their parents. The
values from the worker tasks are given in the figure. The green arrows in Figure 1 show the values
sent from a parent node to a child (or to the worker task). Assume that the Combine operation
is addition. Label each of the green arrows in the figure according to the value it conveys; in
other words, label each with the appropriate number.

#define HOW_MANY_THREADS 2
int acc[HOW_MANY_THREADS];

int count3s(int thread_id, Barrier *b, int *my_seg, int my_len) {


int *my_acc = &(acc[thread_id]);
int grand_total;

*my_acc = 0; // initialize per-thread accumulator


for(int i = 0; i < my_len; i++) {
if(my_seg[i] == 3)
(*my_acc)++;
}
barrier_wait(b);
grand_total = 0;
for(int j = 0; j < HOW_MANY_THREADS; j++) {
grand_total += acc[j];
}
return(grand_total);
}

Figure 2: Shared-memory implementation of Count 3’s

3. MESI (25 points)


When the Lin & Snyder text introduced the Count 3’s problem (in Chapter 1, not distributed on
Canvas), they showed four implementations using shared memory:

• the first was wrong: each thread increments the same shared variable whenever it encounters a 3,
with no locks to protect these accesses.
• the second was very slow: the shared accumulator was protected by a lock. Each thread acquires
the lock before incrementing the accumulator, and releases the lock after the increment.
• the third was slow: a global array of accumulators, one per thread. This leads to false sharing
when several accumulators are part of the same cache line.
• the fourth one got good performance: each thread had its own, local accumulator, and used a
lock once at the end to add its tally to the global total.
In this problem, we'll look more closely at the third approach, the one with false sharing; see Figure 2
for the code. Each thread wants to know the total number of 3s in some array. This array is divided into
disjoint segments, one per thread: each thread is responsible for counting the threes in my_seg[0]
through my_seg[my_len-1]. We'll assume that each thread encounters three 3s in its segment of the
shared array. The threads meet at a barrier, and then each thread computes the global total. We could
do this more efficiently using reduce, but the book hadn't introduced reduce at the point where this
example appears.
For this example, each thread accesses the acc array nine times when executing the count3s function:
• One write when executing *my_acc = 0;. In the questions below, refer to this write as INIT.
• One read followed by one write for each of the three times it executes
(*my_acc)++;
We will refer to these operations as LOOP_RD[k] and LOOP_WR[k] for the memory read and memory
write respectively with 0 ≤ k < 3.

• HOW_MANY_THREADS reads when combining the results from each thread with the statement:
grand_total += acc[j];
We will refer to these operations as TOTAL[k], with 0 ≤ k < HOW_MANY_THREADS.

(a) Per-processor memory order. (3 points)


As noted above, each thread performs nine memory operations. The order of the operations of
a single thread is completely determined by the code for that thread – we are assuming that
the compiler isn’t performing any optimizations that re-order or eliminate memory operations.
Complete the list of memory operations below that shows this per-thread order of operations:

Step Operation
1 INIT
2 LOOP_RD[0]
3
4
5
6
7
8 TOTAL[0]
9 TOTAL[1]
(b) Thread interleaving. (3 points)
While each thread must perform its operations in program order, the interleaving of the operations
of the two threads can be arbitrary. We need a sequence of pseudo-random bits to determine when
thread 0 performs the next operation and when thread 1 does.

i. What is your student number?


ii. Write the 24 least significant bits of your student number in binary. For example if your stu-
dent number is 12345678, then your answer to this question would be 101111000110000101001110.
Hint: io:format("~.2b~n", [YourStudentNumber rem (1 bsl 24)]).

iii. Do you prefer big-endian or little-endian formats?


(c) Cache operations. (15 points) Complete the table in Figure 3. For each row, either thread 0
or thread 1 performs an operation. To pick which thread performs each operation, select the bits
from your answer to Question 3(b)ii in left-to-right order if you're a big-endian fan, and in
right-to-left order if you prefer little-endian. For example, as a little-endian fan, I would choose
thread 0 for the first line, thread 1 for lines 2 through 4, thread 0 for lines 5 and 6, and so on.
If a thread has completed all of its nine steps, you should just fill in the remaining rows with
the remaining operations for that thread. For example, if your answer to Question 3(b)ii is
101000010000010000000110, you only have 6 ones. After thread 0 has taken its ninth step, thread
1 will only have taken three steps. Finish the table with thread 1 performing the operations in
the last nine rows.
Thread: enter 0 or 1 according to which thread is chosen to take a step for this row.
Step: which step this is for the thread.

Thread-op: the operation performed by the thread.
MESI-0: the state of the cache line holding acc[0] and acc[1] in Thread 0’s cache.
MESI-1: the state of the cache line holding acc[0] and acc[1] in Thread 1’s cache.
Main-memory op: if the operation includes a read from main memory, enter R. If the
operation includes a write to main memory, enter W.
If the CPU performs a write when the cache line is in state I, the cache should first load the
line from memory (to get the rest of the line), and then write the bytes that it is modifying.
We'll write RW for the main-memory operation here. After the operation, the cache of the
processor that did the write will have the line in state E, and all other caches will have the line in state I.
As an example, with my hypothetical student number of 12345678 and assuming that I cheer for
team little endian, my first operation will be Thread 0 performing an INIT operation. Thus the
beginning of my table will be:
Thread   Step   Thread op   MESI-0   MESI-1   Main-memory op
                               I        I
   0       1      INIT         E        I           RW
  ...     ...      ...        ...      ...          ...
End of the long explanation. Now, fill in the table in Figure 3.
(d) False-sharing (4 points)

i. (2 points) What is the total number of main memory operations performed for accessing the
acc array by the two threads evaluating count3s?
ii. (2 points) For a better solution, each thread uses a local variable to count the threes. These
will be on different cache lines for different threads. After counting all the threes, the threads
write their local tallies to acc[ThreadId], then meet at the barrier and compute grand_total
as before. What is the total number of main-memory operations performed for accessing the
acc array by the two threads evaluating count3s with this modification?
You don’t need to complete a new table, but you should provide a one or two sentence
justification for your answer.

Thread Step Thread op MESI-0 MESI-1 Main-memory op
I I

Figure 3: MESI operations

4. Message Passing Networks (25 points)
(a) Bisection Width of a Fat Tree (15 points)
Fat-trees (see slide 20 of the Oct. 2 lecture slides) are a generalization of the tree interconnect that
avoids the bandwidth bottleneck at the root by using an increasing number of links between tree
nodes when going up the tree. In particular, if a node is the root of a subtree with M nodes,
then there will be ⌈M^α⌉ links from the node to its parent, with 0 ≤ α ≤ 1 being a design parameter.
Fat trees have been used in real supercomputers. For example, the Tianhe-2 supercomputer held
the fastest-supercomputer position on the Top-500 list for three years, from June
2013 to June 2016, when it was superseded by the Sunway TaihuLight computer. The Tianhe-2
used a fat-tree for its interconnect; see High Performance Interconnect for Tianhe System (here is
the UBC ez-proxy link).
i. (5 points) What is the bisection width of a fat tree with 2^k leaves? Give a formula in terms
of k and α.
You can assume that the router-nodes are huge crossbars (see the paper cited above if you
want the details of how this can be done in practice, but we won't complicate this question with
more detail). You don't need to "look inside" the router-nodes to answer this question.

ii. (10 points) Give a short justification of your answer. For example:
• Point out that if the minimum bisection involves cutting one link between two nodes, it
must involve cutting all links between those nodes. If you use this in your justification,
give a one sentence justification.
• State which links are cut to get the minimum bisection.
• Give a short explanation of why this is the minimum cut. Short means no more than three
sentences.

(b) Dimension routing on a torus. (10 points)
Consider a 16 × 24 torus. For each source-destination pair listed below, give the shortest
route using dimension routing.
i. Example: source = (3,12), destination = (7,6).
The shortest route goes from (3,12) to (7,6) by first heading in the direction of increasing x, i.e.
(3, 12) → (4, 12) → (5, 12) → (6, 12) → (7, 12)
using 4 hops. It then goes from (7,12) to (7,6) by heading in the direction of decreasing y,
using 6 hops. The complete route requires 10 hops in the network.
Now that I have given one example of the details of "in the direction of increasing x", you don't
need to list all the intermediate nodes in your solutions.
ii. source = (3,12), destination = (10,22).

iii. source = (3,12), destination = (14,12).

Note: this problem gives more points per byte of answer than the others. Don’t worry about that,
just be happy to get an easy question.

5. Communication Overhead (25 points)
Consider the problem of computing the sum of the elements of a list, List, of N elements that is
distributed across NW workers. Each worker has a segment with N/NW elements of List. For simplicity,
we will assume that N is a multiple of NW and that NW is a power of two.
Of course, we will use reduce to compute the sum. The worker tasks will use lists:sum to compute
their sums. Assume that lists:sum(L) takes time 3ns * length(L), where ns is short for "nanosecond",
and one nanosecond is 10^-9 seconds. Assume that sending one integer between processes takes 2µs,
where µs is short for "microsecond", and one microsecond is 10^-6 seconds. This is the total time
from when the sending process starts the send until the receiving process completes the receive.
(a) (4 points) What is the time to compute the sum using reduce?
Note that reduce makes two passes through the tree, one bottom-up and the other top-down,
whether we use wtree:reduce or the book's method; so your answer should be the same for
either method. We are ignoring the compute time for non-leaf nodes of the tree because it is negligible
compared with the communication time.

(b) (4 points) What is the time to compute the sum sequentially, i.e. using a single process?

(c) (4 points) What is the speed-up?

(d) (3 points) What is the speed-up if N=100,000 and NW=16?

(e) (6 points) Now consider an implementation in C. Assume that C is 10× faster for computing the
sum of the elements of an array; in other words, the time to compute the sum of an array with N
elements is 0.3ns * N. Assume that communication costs are the same as in the Erlang case (similar
costs for any OS overhead, thread synchronization, etc.).
For the questions below, we have given you space to show two or three lines of work for each
answer.
i. What is the parallel time for the C implementation with N=100,000 and NW=16?

ii. What is the sequential time for the C implementation with N=100,000 and NW=16?

iii. What is the speed-up for the C implementation with N=100,000 and NW=16?

iv. Which is fastest: the Erlang sequential, Erlang parallel, C sequential, or C parallel implementation?

v. Which gets the greater speed-up: Erlang parallel vs. Erlang sequential, or C parallel vs. C
sequential?

vi. Give a one or two sentence explanation for the relationship you observe about which is fastest
and which gets the greater speed-up.

Unless otherwise noted or cited, the questions and other material in this homework problem set are
copyright 2024 by Susanne Bradley and Mark Greenstreet and are made available under the terms of
the Creative Commons Attribution 4.0 International license, http://creativecommons.org/licenses/
by/4.0/.
