
CS3350B Computer Architecture

CPU Performance and Profiling

Marc Moreno Maza

[Link]
Department of Computer Science
University of Western Ontario, Canada

Tuesday January 10, 2017


Components of a computer
Memory hierarchy
Levels of program code
▸ High-level language
▸ Level of abstraction closer to the problem domain
▸ Designed for productivity and portability
▸ Assembly language
▸ Textual representation of instructions
▸ Many constructs of the HLL are translated into combinations of low-level constructs
▸ Hardware representation
▸ Binary digits (bits)
▸ Encoded instructions and data
http://[Link]/~moreno//CS447/Lectures/[Link]/[Link]
Understanding Performance

▸ Algorithm analysis:
estimates the number of operations executed, the number of cache misses, etc.
▸ Programming language, compiler, architecture:
the compilation process determines the machine instructions executed per HLL operation
▸ Processor and memory system:
determine how fast instructions are executed
▸ I/O system (including OS):
determines how fast I/O operations are executed
Performance Metrics

▸ Purchasing perspective:
given a collection of machines, which one has the
▸ best cost?
▸ best cost/performance?
▸ Design perspective:
faced with design options, which one has the
▸ best performance improvement?
▸ best cost/performance?
▸ Both require:
▸ basis for comparison,
▸ metrics for evaluation.
▸ Our goal is to understand what factors in the architecture
contribute to the overall system performance and the relative
importance (and cost) of these factors
CPU Performance
▸ We are normally interested in reducing
▸ Response time (aka execution time) - the time between the start and the completion of a task
- Important to individual users
▸ Thus, to maximize performance, we need to minimize execution time:
performance_X = 1 / execution_time_X
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
(see the worked example below)
▸ And we are interested in increasing
▸ Throughput - the total amount of work done in a given unit of time
- Important to data center managers
▸ Decreasing response time usually improves throughput, but other factors are important (task scheduling, memory bandwidth, etc.)
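For instance (hypothetical numbers, not from the slides): if machine X runs a program in 10 s and machine Y needs 15 s for the same program, then performance_X / performance_Y = execution_time_Y / execution_time_X = 15 / 10 = 1.5, so X is 1.5 times faster than Y.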
CPU Clocking

▸ Almost all computers are constructed using a clock that determines when events take place in the hardware
▸ Clock period (cycle): duration of a clock cycle (CC)
▸ determines the speed of a computer processor
▸ e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
▸ Clock frequency or rate (CR): cycles per second
▸ the inverse of the clock period
▸ e.g., 3.0 GHz = 3000 MHz = 3.0 × 10⁹ Hz
▸ CR = 1 / CC
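For instance, using the numbers above: a 250 ps clock period corresponds to CR = 1 / (250 × 10⁻¹² s) = 4.0 × 10⁹ Hz = 4.0 GHz, and conversely a 3.0 GHz processor has a clock cycle of 1 / (3.0 × 10⁹ Hz) ≈ 333 ps.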
Performance Factors
▸ It is important to distinguish elapsed time and the time spent on our task
▸ CPU execution time (CPU time) - time the CPU spends working on a task
▸ Does not include time waiting for I/O or running other programs

CPU execution time for a program = #CPU clock cycles for a program × clock cycle time
or
CPU execution time for a program = #CPU clock cycles for a program / clock rate

▸ Thus, we can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Instruction Performance

#CPU clock cycles for a program = #Instructions for a program × Average # of clock cycles per instruction

▸ Clock cycles per instruction (CPI) - the average number of clock cycles each instruction takes to execute:
▸ different instructions may take different amounts of time depending on what they do;
▸ a way to compare two different implementations of the same instruction set architecture (ISA).
The Classic Performance Equation

CPU time = Instruction_count × CPI × clock_cycle
or
CPU time = Instruction_count × CPI / clock_rate

▸ Always keep in mind that the only complete and reliable measure of computer performance is time.
▸ For example, redesigning the hardware implementation of an instruction set to lower the instruction count may lead to an organization with
▸ a slower clock cycle time or,
▸ a higher CPI,
that offsets the improvement in instruction count.
▸ Similarly, because CPI depends on the type of instruction executed, the code that executes the fewest instructions may not be the fastest.
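As a minimal sketch of plugging numbers into the equation (the instruction count, CPI, and clock rate below are hypothetical, chosen only for illustration):

#include <stdio.h>

int main(void) {
    /* Hypothetical workload: values chosen for illustration only. */
    double instruction_count = 2.0e9;  /* 2 billion instructions          */
    double cpi               = 1.5;    /* average clock cycles per instr. */
    double clock_rate        = 3.0e9;  /* 3.0 GHz                         */

    /* Classic performance equation: CPU time = IC x CPI / clock_rate */
    double cpu_time = instruction_count * cpi / clock_rate;
    printf("CPU time = %.2f s\n", cpu_time);   /* prints 1.00 s */
    return 0;
}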
A Simple Example (1/2)

Overall effective CPI = ∑_{i=1}^{n} (CPI_i × IC_i)

Op       Freq   CPI_i   Freq × CPI_i   (1)
ALU      50%    1       .5             .5
Load     20%    5       1.0            .4
Store    10%    3       .3             .3
Branch   20%    2       .4             .4
                        ∑ = 2.2        ∑ = 1.6

(1) How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time_new = 1.6 × IC × CC; so 2.2 versus 1.6, which means 37.5% faster.
A Simple Example (2/2)
Overall effective CPI = ∑_{i=1}^{n} (CPI_i × IC_i)

Op       Freq   CPI_i   Freq × CPI_i   (2)       (3)
ALU      50%    1       .5             .5        .25
Load     20%    5       1.0            1.0       1.0
Store    10%    3       .3             .3        .3
Branch   20%    2       .4             .2        .4
                        ∑ = 2.2        ∑ = 2.0   ∑ = 1.95

(2) How does this compare with using branch prediction to save a cycle off the branch time?
CPU time_new = 2.0 × IC × CC; so 2.2 versus 2.0 means 10% faster.
(3) What if two ALU instructions could be executed at once?
CPU time_new = 1.95 × IC × CC; so 2.2 versus 1.95 means 12.8% faster.
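A small sketch that reproduces the effective-CPI values above (the instruction mix and CPIs come from the table; the program itself is only an illustration):

#include <stdio.h>

int main(void) {
    /* Instruction mix (ALU, Load, Store, Branch) and per-class CPIs from the table. */
    double freq[4]     = {0.50, 0.20, 0.10, 0.20};
    double cpi_base[4] = {1, 5, 3, 2};
    double cpi_1[4]    = {1, 2, 3, 2};    /* (1) better data cache: load CPI = 2    */
    double cpi_2[4]    = {1, 5, 3, 1};    /* (2) branch prediction: branch CPI = 1  */
    double cpi_3[4]    = {0.5, 5, 3, 2};  /* (3) two ALU ops at once: ALU CPI = 0.5 */

    double base = 0, c1 = 0, c2 = 0, c3 = 0;
    for (int i = 0; i < 4; i++) {
        base += freq[i] * cpi_base[i];
        c1   += freq[i] * cpi_1[i];
        c2   += freq[i] * cpi_2[i];
        c3   += freq[i] * cpi_3[i];
    }
    /* Prints 2.20 1.60 2.00 1.95, matching the sums in the two tables. */
    printf("%.2f %.2f %.2f %.2f\n", base, c1, c2, c3);
    return 0;
}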
Understanding Program Performance

CPU time = Instruction_count × CPI × clock_cycle

▸ The performance of a program depends on the algorithm, the language, the compiler, the architecture, and the actual hardware.

                         Instruction_count   CPI   clock_cycle
Algorithm                        X            X
Programming language             X            X
Compiler                         X            X
ISA                              X            X         X
Processor organization                        X         X
Performance Summary

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

▸ Performance depends on
▸ Algorithm: affects IC, possibly CPI
▸ Programming language: affects IC, CPI
▸ Compiler: affects IC, CPI
▸ Instruction set architecture: affects IC, CPI, Tc
Check Yourself

A given application written in Java runs in 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 times as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler? Pick the right answer from the three choices below:
a. (15 × 0.6) / 1.1 = 8.2 sec
b. 15 × 0.6 × 1.1 = 9.9 sec
c. (15 × 1.1) / 0.6 = 27.5 sec
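One way to check the choices is to plug the two ratios into the classic performance equation, assuming the clock cycle time is unchanged (a sketch, not part of the original slide):

#include <stdio.h>

int main(void) {
    double old_time  = 15.0;  /* seconds with the old compiler */
    double ic_ratio  = 0.6;   /* new IC  = 0.6 x old IC        */
    double cpi_ratio = 1.1;   /* new CPI = 1.1 x old CPI       */

    /* CPU time = IC x CPI x clock_cycle; the clock cycle is unchanged,
       so the new time scales by both ratios.                           */
    printf("new time = %.1f s\n", old_time * ic_ratio * cpi_ratio);
    return 0;
}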
Power Trends

▸ In complementary metal oxide semiconductor (CMOS) integrated circuit technology:
Power = Capacitive load × Voltage² × Frequency switched
▸ Over the past decades, the frequency switched grew by roughly ×1000 while the supply voltage dropped from 5 V to 1 V, so power grew by only about ×30.
Reducing Power

▸ Suppose a new CPU has
▸ 85% of capacitive load of old CPU
▸ 15% voltage and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) / (C_old × V_old² × F_old) = 0.85⁴ ≈ 0.52
(numeric check below)

▸ The power wall
▸ We can’t reduce voltage further
▸ We can’t remove more heat
▸ How else can we improve performance?
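A quick numeric check of the power ratio above (a sketch; the 0.85 factors come from the assumptions listed on this slide):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* New CPU relative to old: 85% capacitive load, 85% voltage, 85% frequency. */
    double cap = 0.85, volt = 0.85, freq = 0.85;

    /* P = C x V^2 x F, so the ratio is 0.85 * 0.85^2 * 0.85 = 0.85^4. */
    double ratio = cap * volt * volt * freq;
    printf("P_new / P_old = %.2f\n", ratio);          /* about 0.52 */
    printf("check: 0.85^4 = %.4f\n", pow(0.85, 4.0)); /* 0.5220     */
    return 0;
}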
Uniprocessor Performance

▸ Constrained by power, instruction-level parallelism, memory latency
Multiprocessors

▸ Multicore microprocessors
▸ More than one processor per chip
▸ Requires explicitly parallel programming
▸ Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
▸ Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
SPEC CPU Benchmark

▸ Programs used to measure performance
▸ Supposedly typical of actual workload
▸ Standard Performance Evaluation Corp (SPEC)
▸ Develops benchmarks for CPU, I/O, Web, ...
▸ SPEC CPU2006
▸ Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
▸ Normalize relative to reference machine
▸ Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
( ∏_{i=1}^{n} Execution time ratio_i )^(1/n)
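A minimal sketch of how such a geometric-mean summary is computed (the ratios here are made-up placeholders, not actual SPEC results):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Hypothetical execution time ratios (reference time / measured time). */
    double ratio[] = {10.0, 20.0, 15.0, 12.0};
    int n = sizeof(ratio) / sizeof(ratio[0]);

    /* Geometric mean = (product of the ratios)^(1/n); use logs for numerical safety. */
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratio[i]);
    printf("geometric mean = %.2f\n", exp(log_sum / n));
    return 0;
}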
CINT2006 for Intel Core i7 920
Profiling Tools

▸ Many profiling tools
▸ gprof (static instrumentation)
▸ Cachegrind, DTrace (dynamic instrumentation)
▸ perf (performance counters)
▸ perf in linux-tools, based on event sampling
▸ Keeps a list of where “interesting events” (cycles, cache misses, etc.) happen
▸ CPU feature: counters for hundreds of events
- Performance: cache misses, branch misses, instructions per cycle, ...
▸ Intel® 64 and IA-32 Architectures Software Developer’s Manual: Appendix A lists all counters ([Link])
▸ perf user guide: [Link]
Exercise 1

void copymatrix1(int n, int (*src)[n], int (*dst)[n]) {
  int i, j;
  for (i = 0; i < n; i++)        /* i (row index) in the outer loop    */
    for (j = 0; j < n; j++)
      dst[i][j] = src[i][j];
}

void copymatrix2(int n, int (*src)[n], int (*dst)[n]) {
  int i, j;
  for (j = 0; j < n; j++)        /* j (column index) in the outer loop */
    for (i = 0; i < n; i++)
      dst[i][j] = src[i][j];
}
▸ copymatrix1 vs copymatrix2
▸ What do they do?
▸ What is the difference?
▸ Which one performs better? Why?
▸ perf stat -e cycles -e cache-misses ./copymatrix1
perf stat -e cycles -e cache-misses ./copymatrix2
▸ What does the output look like?
▸ How to interpret it?
▸ Which program performs better?
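A minimal, hypothetical driver for running the two routines (the matrix size, repetition count, and argv-based selection are assumptions; the exercise’s separate ./copymatrix1 and ./copymatrix2 binaries can be obtained by hard-coding one call instead):

#include <stdlib.h>

void copymatrix1(int n, int (*src)[n], int (*dst)[n]);
void copymatrix2(int n, int (*src)[n], int (*dst)[n]);

int main(int argc, char **argv) {
    enum { N = 2048, REPS = 20 };            /* assumed problem size */
    int (*src)[N] = calloc(1, sizeof(int[N][N]));
    int (*dst)[N] = calloc(1, sizeof(int[N][N]));
    if (!src || !dst) return 1;

    int version = (argc > 1) ? atoi(argv[1]) : 1;
    for (int r = 0; r < REPS; r++) {
        if (version == 1) copymatrix1(N, src, dst);
        else              copymatrix2(N, src, dst);
    }
    free(src);
    free(dst);
    return 0;
}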
Exercise 2

#include <string.h>   /* for strlen */

void lower1(char *s) {
  int i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= 'A' - 'a';
}

void lower2(char *s) {
  int i;
  int n = strlen(s);
  for (i = 0; i < n; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= 'A' - 'a';
}
▸ lower1 vs lower2
▸ What do they do?
▸ What is the difference?
▸ Which one performs better? Why?
▸ perf stat -e cycles -e cache-misses ./lower1
perf stat -e cycles -e cache-misses ./lower2
▸ What does the output look like?
▸ How to interpret it?
▸ Which program performs better?
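Similarly, a minimal hypothetical driver for the two lowercase routines (the string length, its all-uppercase contents, and the argv-based selection are assumptions):

#include <stdlib.h>
#include <string.h>

void lower1(char *s);
void lower2(char *s);

int main(int argc, char **argv) {
    enum { LEN = 100000 };           /* assumed string length */
    char *s = malloc(LEN + 1);
    if (!s) return 1;
    memset(s, 'A', LEN);             /* all-uppercase input   */
    s[LEN] = '\0';

    if (argc > 1 && argv[1][0] == '2')
        lower2(s);
    else
        lower1(s);

    free(s);
    return 0;
}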
