Program Optimization
Overview
Program optimization
Code motion/precomputation
Strength reduction
Sharing of common subexpressions
Optimization blocker: Procedure calls
Optimization blocker: Memory aliasing
Exploiting Instruction-Level Parallelism
Dealing with Conditionals
Optimizing Compilers
Code Motion
Reduce frequency with which computation performed
If it will always produce same result
Especially moving code out of loop
Original:
    void set_row(double *a, double *b,
                 long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }
With n*i moved out of the loop:
        long j;
        long ni = n*i;
        for (j = 0; j < n; j++)
            a[ni+j] = b[j];
Compiler-Generated Code Motion
Source:
    void set_row(double *a, double *b,
                 long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }
C equivalent of the generated code:
        long j;
        long ni = n*i;
        double *rowp = a+ni;
        for (j = 0; j < n; j++)
            *rowp++ = b[j];
Strength Reduction
Original:
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
With the multiplication n*i replaced by a running sum:
    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
Share Common Subexpressions
Reuse portions of expressions
Compilers often not very sophisticated in exploiting arithmetic properties
Direct form:
    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;
With i*n + j computed once:
    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;
[Figure: CPU seconds (0 to 200) versus string length (0 to 500,000) for lower]
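Convert Loop To Goto Form
    void lower(char *s)
    {
        int i = 0;
        if (i >= strlen(s))
            goto done;
    loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))   /* strlen is called again on every iteration */
            goto loop;
    done:
        return;              /* a label must be followed by a statement */
    }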
Strlen performance
Only way to determine length of string is to scan its entire length, looking for null character.
Overall performance, string of length N
N calls to strlen
Each call scans the whole string, so each requires time proportional to N
Overall O(N^2) performance
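The quadratic cost can be made visible by counting how many characters the length routine examines. This is a minimal sketch, not code from the lecture: counted_strlen, scan_count, lower_slow and scans_for_lower_slow are hypothetical names introduced here for illustration.

```c
#include <stddef.h>

/* Hypothetical instrumentation: characters examined by counted_strlen. */
static unsigned long scan_count = 0;

/* Behaves like strlen, but records each non-null character it examines. */
static size_t counted_strlen(const char *s)
{
    size_t length = 0;
    while (s[length] != '\0') {
        length++;
        scan_count++;
    }
    return length;
}

/* Lowercase conversion with the length recomputed in every loop test. */
void lower_slow(char *s)
{
    size_t i;
    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Returns how many characters were scanned while lowercasing s. */
unsigned long scans_for_lower_slow(char *s)
{
    scan_count = 0;
    lower_slow(s);
    return scan_count;
}
```

For a string of length N the loop test runs N+1 times and each test scans N characters, so the count is N*(N+1). For "ABCD" that is 20 scanned characters to convert 4.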
Improving Performance
    void lower(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
[Figure: CPU seconds (0 to 200) versus string length (0 to 500,000) for lower and lower2]
Optimization Blocker: Procedure Calls
Why couldn’t compiler move strlen out of inner loop?
Procedure may have side effects
Alters global state each time called
Function may not return same value for given arguments
Depends on other parts of global state
Procedure lower could interact with strlen
Warning:
Compiler treats procedure call as a black box
Weak optimizations near them
Remedies:
Use of inline functions
GCC does this with -O2
See web aside ASM:OPT
Do your own code motion
A strlen with a side effect:
    int lencnt = 0;
    size_t strlen(const char *s)
    {
        size_t length = 0;
        while (*s != '\0') {
            s++; length++;
        }
        lencnt += length;
        return length;
    }
Memory Matters
    /* Sum rows of n X n matrix a
       and store in vector b */
    void sum_rows1(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            b[i] = 0;
            for (j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }
twiddle1, when xp and yp point to the same location, initially holding 2:
    *xp += *yp;          // *xp = 2 + 2 = 4
    *xp += *yp;          // *xp = 4 + 4 = 8
twiddle2, same starting state:
    *xp += 2 * (*yp);    // *xp = 2 + 2*2 = 6
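The two update sequences can be wrapped in functions to check when they agree. This is a sketch under the assumption that both take two long pointers; the signatures are not given in the text above.

```c
/* Adds *yp to *xp twice, re-reading *yp from memory before each add. */
void twiddle1(long *xp, long *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Adds twice *yp to *xp, reading *yp only once. */
void twiddle2(long *xp, long *yp)
{
    *xp += 2 * (*yp);
}
```

With distinct locations both functions leave the same value in *xp. With xp == yp and a starting value of 2, twiddle1 leaves 8 and twiddle2 leaves 6, so a compiler that cannot rule out aliasing may not rewrite the first form into the second.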
Memory Aliasing
    /* Sum rows of n X n matrix a
       and store in vector b */
    void sum_rows1(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            b[i] = 0;
            for (j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }
    double A[9] =
      {  0,  1,   2,
         4,  8,  16,
        32, 64, 128};
Value of B:
    init:  [4, 8, 16]
    i = 0: [3, 8, 16]
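The values of B shown above are what one would see if b points at the second row of a. The sketch below assumes the call is sum_rows1(A, A+3, 3), which is not shown in the text, and adds a hypothetical variant, sum_rows_local, that accumulates each row in a local variable so the inner loop no longer reads and writes b[i].

```c
/* Sum rows of n X n matrix a and store in vector b (as in the text). */
void sum_rows1(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];   /* b[i] is read and written every iteration */
    }
}

/* Illustrative variant: accumulate in a local, store b[i] once per row. */
void sum_rows_local(double *a, double *b, long n)
{
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

Under that assumption, sum_rows1 ends with B = [3, 22, 224], while sum_rows_local ends with B = [3, 27, 224]. The two functions behave differently when b overlaps a, which is why a compiler cannot substitute one for the other on its own.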
[Figure: Cycles (0 to 900) versus n = number of elements (0 to 200); vsum1: slope = 4.0, vsum2: slope = 3.5]
Benchmark Performance
Compute sum or product of vector elements
    void combine1(vec_ptr v, data_t *dest)
    {
        long int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }
[Figure: processor block diagram; surviving labels: Operation Results, Addr., Data, Data Cache, Execution]
Latency versus Throughput
Example:            latency   cycles/issue
Integer Multiply    10        1
Consequence:
How fast can 10 independent int mults be executed?
t1 = t2*t3; t4 = t5*t6; …
How fast can 10 sequentially dependent int mults be executed?
t1 = t2*t3; t4 = t5*t1; t6 = t7*t4; …
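The two questions above can be answered with a simple cost model. This is a rough sketch that assumes a single, fully pipelined multiplier and ignores every other limit in the processor; the function names are hypothetical.

```c
/* Cycles for n independent operations on one fully pipelined unit:
   the first result appears after `latency` cycles, and each later
   operation starts `issue` cycles after the one before it. */
long cycles_independent(long n, long latency, long issue)
{
    if (n <= 0)
        return 0;
    return latency + (n - 1) * issue;
}

/* Cycles for n operations where each one needs the previous result,
   so no two of them can overlap. */
long cycles_dependent(long n, long latency)
{
    if (n <= 0)
        return 0;
    return n * latency;
}
```

With latency 10 and 1 cycle/issue, the model gives 19 cycles for 10 independent multiplies and 100 cycles for 10 sequentially dependent ones.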
Loop Unrolling
    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *d = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = (x OP d[i]) OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OP d[i];
        }
        *dest = x;
    }
Overall Performance
Dependency graph: each pair d[i], d[i+1] is combined first, then folded into the accumulator, i.e. x = x OP (d[i] OP d[i+1])
N elements, D cycles latency/op
Should be (N/2+1)*D cycles: CPE = D/2
Measured CPE slightly worse for FP mult
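The D/2 estimate depends on the pairwise operation being independent of the accumulator. A minimal sketch of the two groupings follows, assuming data_t is double and OP is multiplication; the function names are hypothetical and the vector is passed as a plain array rather than a vec_ptr.

```c
/* 2x unrolled product, grouped as (x * d[i]) * d[i+1]:
   both multiplies wait on x, so the chain has n sequential multiplies. */
double product_unroll2(const double *d, long n)
{
    long i;
    double x = 1.0;
    for (i = 0; i + 1 < n; i += 2)
        x = (x * d[i]) * d[i+1];
    for (; i < n; i++)          /* finish any remaining element */
        x = x * d[i];
    return x;
}

/* 2x unrolled product, grouped as x * (d[i] * d[i+1]):
   the inner multiply does not depend on x, so only about n/2
   multiplies remain on the chain through x. */
double product_unroll2_reassoc(const double *d, long n)
{
    long i;
    double x = 1.0;
    for (i = 0; i + 1 < n; i += 2)
        x = x * (d[i] * d[i+1]);
    for (; i < n; i++)          /* finish any remaining element */
        x = x * d[i];
    return x;
}
```

Floating-point multiplication is not associative, so the two functions can differ in rounding for general inputs. That is one reason a compiler will not usually make this change for floating-point code without being told it may.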
Loop Unrolling with Separate Accumulators
    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *d = get_vec_start(v);
        data_t x0 = IDENT;
        data_t x1 = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x0 = x0 OP d[i];
            x1 = x1 OP d[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 = x0 OP d[i];
        }
        *dest = x0 OP x1;
    }
Unrolling & Accumulating
Idea
Can unroll to any degree L
Can accumulate K results in parallel
L must be multiple of K
Limitations
Diminishing returns
Cannot go beyond throughput limitations of execution units
Large overhead for short lengths
Finish off iterations sequentially
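As an illustration of the idea above, here is a sketch with unrolling factor L = 4 and K = 2 accumulators, so each accumulator takes L/K = 2 elements per iteration. It assumes long elements and addition as the operation; the function name is hypothetical.

```c
/* Sum of d[0..n-1] with unrolling factor L = 4 and K = 2 accumulators. */
long sum_unroll4x2(const long *d, long n)
{
    long i;
    long x0 = 0;            /* accumulates elements at even offsets */
    long x1 = 0;            /* accumulates elements at odd offsets  */
    long limit = n - 3;     /* last index where 4 elements remain   */

    /* Combine 4 elements at a time, 2 per accumulator */
    for (i = 0; i < limit; i += 4) {
        x0 = (x0 + d[i])   + d[i+2];
        x1 = (x1 + d[i+1]) + d[i+3];
    }
    /* Finish off the remaining 0 to 3 elements sequentially */
    for (; i < n; i++)
        x0 = x0 + d[i];

    return x0 + x1;
}
```

For short vectors the main loop may not run at all and the sequential loop does all the work, which is the overhead for short lengths noted above.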
Unrolling & Accumulating: Double * Case
Intel Nehalem (Shark machines)
Double FP Multiplication
Latency bound: 5.00. Throughput bound: 1.00
FP *
Unrolling factor L: 1 2 3 4 6 8 10 12
Accumulators K = 1: 5.00 5.00 5.00 5.00 5.00 5.00
Branch Outcomes
When a conditional branch is encountered, cannot determine where to continue fetching
Branch Taken: Transfer control to branch target
Branch Not-Taken: Continue with next instruction in sequence
Cannot resolve until outcome determined by branch/integer unit
Performance Cost
Multiple clock cycles on modern processor
Can be a major performance limiter
Effect of Branch Prediction
Loops
Typically, only miss when hit loop end
Checking code
Reliably predicts that error won't occur
    void combine4b(vec_ptr v, data_t *dest)
    {
        long int i;
        long int length = vec_length(v);
        data_t acc = IDENT;
        for (i = 0; i < length; i++) {
            if (i >= 0 && i < v->len) {
                acc = acc OP v->data[i];
            }
        }
        *dest = acc;
    }