
CS3350B Computer Architecture

CPU Performance and Profiling

Marc Moreno Maza

[Link]
Department of Computer Science
University of Western Ontario, Canada

Tuesday January 10, 2017


Components of a computer
Memory hierarchy
Levels of program code
▸ High-level language
▸ Level of abstraction closer to the problem domain
▸ Designed for productivity and portability
▸ Assembly language
▸ Textual representation of instructions
▸ Many constructs of the HLL are translated into combinations of low-level constructs
▸ Hardware representation
▸ Binary digits (bits)
▸ Encoded instructions and data
http://[Link]/~moreno//CS447/Lectures/[Link]/[Link]
Understanding Performance

▸ Algorithm analysis:
estimates the number of operations executed, the number of cache misses, etc.
▸ Programming language, compiler, architecture:
the compilation process determines the machine instructions executed per HLL operation
▸ Processor and memory system:
determine how fast instructions are executed
▸ I/O system (including OS):
determines how fast I/O operations are executed
Performance Metrics

▸ Purchasing perspective:
given a collection of machines, which one has the
▸ best cost?
▸ best cost/performance?
▸ Design perspective:
faced with design options, which one has the
▸ best performance improvement?
▸ best cost/performance?
▸ Both require:
▸ basis for comparison,
▸ metrics for evaluation.
▸ Our goal is to understand what factors in the architecture
contribute to the overall system performance and the relative
importance (and cost) of these factors
CPU Performance
▸ We are normally interested in reducing
▸ Response time (aka execution time) - the time between the start and the completion of a task
- Important to individual users
▸ Thus, to maximize performance, we need to minimize execution time:
performance_X = 1 / execution_time_X
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
(see the worked example below)
▸ And we are interested in increasing
▸ Throughput - the total amount of work done in a given unit of time
- Important to data center managers
▸ Decreasing response time usually improves throughput, but other factors are important (task scheduling, memory bandwidth, etc.)
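For instance (hypothetical numbers, not from the slides): if machine X runs a program in 10 s and machine Y needs 15 s for the same program, then performance_X / performance_Y = execution_time_Y / execution_time_X = 15 / 10 = 1.5, so X is 1.5 times faster than Y.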
CPU Clocking

▸ Almost all computers are constructed using a clock that determines when events take place in the hardware
▸ Clock period (cycle): duration of a clock cycle (CC)
▸ determines the speed of a computer processor
▸ e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
▸ Clock frequency or rate (CR): cycles per second
▸ the inverse of the clock period
▸ e.g., 3.0 GHz = 3000 MHz = 3.0 × 10⁹ Hz
▸ CR = 1 / CC
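For instance, using the numbers above: a 250 ps clock period corresponds to CR = 1 / (250 × 10⁻¹² s) = 4.0 × 10⁹ Hz = 4.0 GHz, and conversely a 3.0 GHz processor has a clock cycle of 1 / (3.0 × 10⁹ Hz) ≈ 333 ps.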
Performance Factors
▸ It is important to distinguish elapsed time and the time spent on our task
▸ CPU execution time (CPU time) - time the CPU spends working on a task
▸ Does not include time waiting for I/O or running other programs

CPU execution time for a program = #CPU clock cycles for a program × clock cycle time
or
CPU execution time for a program = #CPU clock cycles for a program / clock rate

▸ Thus, we can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Instruction Performance

#CPU clock cycles for a program = #Instructions for a program × Average # of clock cycles per instruction

▸ Clock cycles per instruction (CPI) - the average number of clock cycles each instruction takes to execute:
▸ different instructions may take different amounts of time depending on what they do;
▸ a way to compare two different implementations of the same instruction set architecture (ISA).
The Classic Performance Equation

CPU time = Instruction_count × CPI × clock_cycle
or
CPU time = Instruction_count × CPI / clock_rate

▸ Always keep in mind that the only complete and reliable measure of computer performance is time.
▸ For example, redesigning the hardware implementation of an instruction set to lower the instruction count may lead to an organization with
▸ a slower clock cycle time or,
▸ a higher CPI,
that offsets the improvement in instruction count.
▸ Similarly, because CPI depends on the type of instruction executed, the code that executes the fewest instructions may not be the fastest.
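As a minimal sketch of plugging numbers into the equation (the instruction count, CPI, and clock rate below are hypothetical, chosen only for illustration):

#include <stdio.h>

int main(void) {
    /* Hypothetical workload: values chosen for illustration only. */
    double instruction_count = 2.0e9;  /* 2 billion instructions          */
    double cpi               = 1.5;    /* average clock cycles per instr. */
    double clock_rate        = 3.0e9;  /* 3.0 GHz                         */

    /* Classic performance equation: CPU time = IC x CPI / clock_rate */
    double cpu_time = instruction_count * cpi / clock_rate;
    printf("CPU time = %.2f s\n", cpu_time);   /* prints 1.00 s */
    return 0;
}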
A Simple Example (1/2)

Overall effective CPI = ∑_{i=1}^{n} (CPI_i × IC_i)

Op       Freq   CPI_i   Freq × CPI_i   (1)
ALU      50%    1       .5             .5
Load     20%    5       1.0            .4
Store    10%    3       .3             .3
Branch   20%    2       .4             .4
                        ∑ = 2.2        ∑ = 1.6

(1) How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time_new = 1.6 × IC × CC; so 2.2 versus 1.6, which means 37.5% faster.
A Simple Example (2/2)
Overall effective CPI = ∑_{i=1}^{n} (CPI_i × IC_i)

Op       Freq   CPI_i   Freq × CPI_i   (2)       (3)
ALU      50%    1       .5             .5        .25
Load     20%    5       1.0            1.0       1.0
Store    10%    3       .3             .3        .3
Branch   20%    2       .4             .2        .4
                        ∑ = 2.2        ∑ = 2.0   ∑ = 1.95

(2) How does this compare with using branch prediction to save a cycle off the branch time?
CPU time_new = 2.0 × IC × CC; so 2.2 versus 2.0 means 10% faster.
(3) What if two ALU instructions could be executed at once?
CPU time_new = 1.95 × IC × CC; so 2.2 versus 1.95 means 12.8% faster.
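A small sketch that reproduces the effective-CPI values above (the instruction mix and CPIs come from the table; the program itself is only an illustration):

#include <stdio.h>

int main(void) {
    /* Instruction mix (ALU, Load, Store, Branch) and per-class CPIs from the table. */
    double freq[4]     = {0.50, 0.20, 0.10, 0.20};
    double cpi_base[4] = {1, 5, 3, 2};
    double cpi_1[4]    = {1, 2, 3, 2};    /* (1) better data cache: load CPI = 2    */
    double cpi_2[4]    = {1, 5, 3, 1};    /* (2) branch prediction: branch CPI = 1  */
    double cpi_3[4]    = {0.5, 5, 3, 2};  /* (3) two ALU ops at once: ALU CPI = 0.5 */

    double base = 0, c1 = 0, c2 = 0, c3 = 0;
    for (int i = 0; i < 4; i++) {
        base += freq[i] * cpi_base[i];
        c1   += freq[i] * cpi_1[i];
        c2   += freq[i] * cpi_2[i];
        c3   += freq[i] * cpi_3[i];
    }
    /* Prints 2.20 1.60 2.00 1.95, matching the sums in the two tables. */
    printf("%.2f %.2f %.2f %.2f\n", base, c1, c2, c3);
    return 0;
}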
Understanding Program Performance

CPU time = Instruction_count × CPI × clock_cycle

▸ The performance of a program depends on the algorithm, the language, the compiler, the architecture, and the actual hardware.

                         Instruction_count   CPI   clock_cycle
Algorithm                        X            X
Programming language             X            X
Compiler                         X            X
ISA                              X            X         X
Processor organization                        X         X
Performance Summary

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

▸ Performance depends on
▸ Algorithm: affects IC, possibly CPI
▸ Programming language: affects IC, CPI
▸ Compiler: affects IC, CPI
▸ Instruction set architecture: affects IC, CPI, Tc
Check Yourself

A given application written in Java runs in 15 seconds on a desktop processor. A new Java compiler is released that requires only 0.6 times as many instructions as the old compiler. Unfortunately, it increases the CPI by a factor of 1.1. How fast can we expect the application to run using this new compiler? Pick the right answer from the three choices below:
a. (15 × 0.6) / 1.1 = 8.2 sec
b. 15 × 0.6 × 1.1 = 9.9 sec
c. (15 × 1.1) / 0.6 = 27.5 sec
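One way to check the choices is to plug the two ratios into the classic performance equation, assuming the clock cycle time is unchanged (a sketch, not part of the original slide):

#include <stdio.h>

int main(void) {
    double old_time  = 15.0;  /* seconds with the old compiler */
    double ic_ratio  = 0.6;   /* new IC  = 0.6 x old IC        */
    double cpi_ratio = 1.1;   /* new CPI = 1.1 x old CPI       */

    /* CPU time = IC x CPI x clock_cycle; the clock cycle is unchanged,
       so the new time scales by both ratios.                           */
    printf("new time = %.1f s\n", old_time * ic_ratio * cpi_ratio);
    return 0;
}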
Power Trends

▸ In complementary metal oxide semiconductor (CMOS) integrated circuit technology:
Power = Capacitive load × Voltage² × Frequency switched
▸ Over the past decades, the frequency switched grew by roughly ×1000 while the supply voltage dropped from 5 V to 1 V, so power grew by only about ×30.
Reducing Power

▸ Suppose a new CPU has
▸ 85% of capacitive load of old CPU
▸ 15% voltage and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) / (C_old × V_old² × F_old) = 0.85⁴ ≈ 0.52
(numeric check below)

▸ The power wall
▸ We can’t reduce voltage further
▸ We can’t remove more heat
▸ How else can we improve performance?
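A quick numeric check of the power ratio above (a sketch; the 0.85 factors come from the assumptions listed on this slide):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* New CPU relative to old: 85% capacitive load, 85% voltage, 85% frequency. */
    double cap = 0.85, volt = 0.85, freq = 0.85;

    /* P = C x V^2 x F, so the ratio is 0.85 * 0.85^2 * 0.85 = 0.85^4. */
    double ratio = cap * volt * volt * freq;
    printf("P_new / P_old = %.2f\n", ratio);          /* about 0.52 */
    printf("check: 0.85^4 = %.4f\n", pow(0.85, 4.0)); /* 0.5220     */
    return 0;
}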
Uniprocessor Performance

▸ Constrained by power, instruction-level parallelism, memory latency
Multiprocessors

▸ Multicore microprocessors
▸ More than one processor per chip
▸ Requires explicitly parallel programming
▸ Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
▸ Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
SPEC CPU Benchmark

▸ Programs used to measure performance
▸ Supposedly typical of actual workload
▸ Standard Performance Evaluation Corp (SPEC)
▸ Develops benchmarks for CPU, I/O, Web, ...
▸ SPEC CPU2006
▸ Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
▸ Normalize relative to reference machine
▸ Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)
( ∏_{i=1}^{n} Execution time ratio_i )^(1/n)
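A minimal sketch of how such a geometric-mean summary is computed (the ratios here are made-up placeholders, not actual SPEC results):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Hypothetical execution time ratios (reference time / measured time). */
    double ratio[] = {10.0, 20.0, 15.0, 12.0};
    int n = sizeof(ratio) / sizeof(ratio[0]);

    /* Geometric mean = (product of the ratios)^(1/n); use logs for numerical safety. */
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratio[i]);
    printf("geometric mean = %.2f\n", exp(log_sum / n));
    return 0;
}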
CINT2006 for Intel Core i7 920
Profiling Tools

▸ Many profiling tools
▸ gprof (static instrumentation)
▸ Cachegrind, DTrace (dynamic instrumentation)
▸ perf (performance counters)
▸ perf in linux-tools, based on event sampling
▸ Keeps a list of where “interesting events” (cycles, cache misses, etc.) happen
▸ CPU feature: counters for hundreds of events
- Performance: cache misses, branch misses, instructions per cycle, ...
▸ Intel® 64 and IA-32 Architectures Software Developer’s Manual: Appendix A lists all counters ([Link])
▸ perf user guide: [Link]
Exercise 1

void copymatrix1(int n, int (*src)[n], int (*dst)[n]) {
  int i, j;
  for (i = 0; i < n; i++)        /* i (row index) in the outer loop    */
    for (j = 0; j < n; j++)
      dst[i][j] = src[i][j];
}

void copymatrix2(int n, int (*src)[n], int (*dst)[n]) {
  int i, j;
  for (j = 0; j < n; j++)        /* j (column index) in the outer loop */
    for (i = 0; i < n; i++)
      dst[i][j] = src[i][j];
}
▸ copymatrix1 vs copymatrix2
▸ What do they do?
▸ What is the difference?
▸ Which one performs better? Why?
▸ perf stat -e cycles -e cache-misses ./copymatrix1
perf stat -e cycles -e cache-misses ./copymatrix2
▸ What does the output look like?
▸ How to interpret it?
▸ Which program performs better?
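A minimal, hypothetical driver for running the two routines (the matrix size, repetition count, and argv-based selection are assumptions; the exercise’s separate ./copymatrix1 and ./copymatrix2 binaries can be obtained by hard-coding one call instead):

#include <stdlib.h>

void copymatrix1(int n, int (*src)[n], int (*dst)[n]);
void copymatrix2(int n, int (*src)[n], int (*dst)[n]);

int main(int argc, char **argv) {
    enum { N = 2048, REPS = 20 };            /* assumed problem size */
    int (*src)[N] = calloc(1, sizeof(int[N][N]));
    int (*dst)[N] = calloc(1, sizeof(int[N][N]));
    if (!src || !dst) return 1;

    int version = (argc > 1) ? atoi(argv[1]) : 1;
    for (int r = 0; r < REPS; r++) {
        if (version == 1) copymatrix1(N, src, dst);
        else              copymatrix2(N, src, dst);
    }
    free(src);
    free(dst);
    return 0;
}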
Exercise 2

#include <string.h>   /* for strlen */

void lower1(char *s) {
  int i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= 'A' - 'a';
}

void lower2(char *s) {
  int i;
  int n = strlen(s);
  for (i = 0; i < n; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= 'A' - 'a';
}
▸ lower1 vs lower2
▸ What do they do?
▸ What is the difference?
▸ Which one performs better? Why?
▸ perf stat -e cycles -e cache-misses ./lower1
perf stat -e cycles -e cache-misses ./lower2
▸ What does the output look like?
▸ How to interpret it?
▸ Which program performs better?
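Similarly, a minimal hypothetical driver for the two lowercase routines (the string length, its all-uppercase contents, and the argv-based selection are assumptions):

#include <stdlib.h>
#include <string.h>

void lower1(char *s);
void lower2(char *s);

int main(int argc, char **argv) {
    enum { LEN = 100000 };           /* assumed string length */
    char *s = malloc(LEN + 1);
    if (!s) return 1;
    memset(s, 'A', LEN);             /* all-uppercase input   */
    s[LEN] = '\0';

    if (argc > 1 && argv[1][0] == '2')
        lower2(s);
    else
        lower1(s);

    free(s);
    return 0;
}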
