Lecture 02: Amdahl's Law, Modern Hardware
ECE 459: Programming for Performance
Patrick Lam
University of Waterloo
January 7, 2015
About Prediction and Speedups
Cliff Click said: 5% miss rates dominate performance.
Why is that?
Recall: 100-1000 slot penalty for a miss.
See L02.pdf for a calculation.
Forcing Branch Mispredicts
blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
#include <stdlib.h>
#include <stdio.h>

static __attribute__ ((noinline)) int f(int a) { return a; }

#define BSIZE 1000000

/* EXPECT_RESULT is supplied at compile time, e.g. -DEXPECT_RESULT=0
   (good hint: p[k] is always 0) or -DEXPECT_RESULT=1 (bogus hint). */
int main(int argc, char* argv[])
{
    int *p = calloc(BSIZE, sizeof(int));
    int j, k, m1 = 0, m2 = 0;

    for (j = 0; j < 1000; j++) {
        for (k = 0; k < BSIZE; k++) {
            if (__builtin_expect(p[k], EXPECT_RESULT)) {
                m1 = f(++m1);
            } else {
                m2 = f(++m2);
            }
        }
    }
    printf("%d, %d\n", m1, m2);
    return 0;
}
Running times: 3.1s with good (or no) hint, 4.9s with bogus hint.
Limitations of Speedups
Our main focus is parallelization.
Most programs have a sequential part and a parallel part; and,
Amdahl's Law answers: what are the limits to parallelization?
Formulation (1)
S: fraction of serial runtime in a serial execution.
P: fraction of parallel runtime in a serial execution.
Therefore, S + P = 1.
With 4 processors, best case, what can happen to the following runtime?
[Diagram: a bar of total runtime, serial part S followed by the parallel part]
We want to split up the parallel part over 4 processors.
[Diagram: the same runtime bar, with serial part S followed by the parallel part split into four chunks of P running simultaneously]
Formulation (2)
Ts : time for the program to run in serial
N: number of processors/parallel executions
Tp : time for the program to run in parallel
Under perfect conditions, the parallel part gets an N-times speedup:
Tp = Ts (S + P/N)
Formulation (3)
How much faster can we make the program?

speedup = Ts / Tp
        = Ts / (Ts (S + P/N))
        = 1 / (S + P/N)

(assuming no overhead for parallelizing; or costs near zero)
Fixed-Size Problem Scaling, Varying Fraction of Parallel Code
[Graph: speedup (y-axis, 0-32) vs. number of processors (x-axis, up to 32), one curve each for 50%, 70%, 90%, 95%, 99%, and 100% parallel code]
Amdahl's Law
Replace S with (1 - P):

speedup = 1 / ((1 - P) + P/N)

maximum speedup = 1 / (1 - P), since P/N → 0 as N → ∞.

As you might imagine, the asymptotes in the previous
graph are bounded by the maximum speedup.
Assumptions behind Amdahl's Law
How can we invalidate Amdahl's Law?
We assume:
problem size is fixed (we'll see this soon);
program/algorithm behaves the same on 1 processor and on N processors; and
that we can accurately measure runtimes, i.e. that overheads don't matter.