104 (2015) NTU Electrical Engineering
3. Assume that we have three classes of instructions with the following CPI values.
Class A: CPI = 1, Class B: CPI = 2, Class C: CPI = 4
Now we have two programs, X and Y, fulfilling the same function. The instruction counts
in each class resulting from the execution of X and Y, respectively, are as follows.
Assuming that all instructions are executed without temporal overlapping, which of the
following statements are true?
(a) Program X has 600 instructions.
(b) Program Y has 430 instructions.
(c) CPI of program X is 2.5.
(d) CPI of program Y is 1.98.
(e) The execution time of program X is 0.81 of the execution time of program Y.
Answer: (b)
Note (c): CPI of program X = 0.25 × 1 + 0.5 × 2 + 0.25 × 4 = 2.25
Note (d): CPI of program Y = (150/430) × 1 + (230/430) × 2 + (50/430) × 4 = 1.88
Note (e): 1.88 / 2.25 = 0.84
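The weighted-CPI arithmetic in the notes can be sanity-checked with a short script (not part of the original solution; the class counts for Y and the class fractions for X are those used in the notes):

```python
cpi = {"A": 1, "B": 2, "C": 4}

# Program Y instruction counts per class, as used in Note (d)
y = {"A": 150, "B": 230, "C": 50}
y_total = sum(y.values())
cpi_y = sum(y[c] * cpi[c] for c in cpi) / y_total

# Program X class fractions, as used in Note (c)
x_frac = {"A": 0.25, "B": 0.5, "C": 0.25}
cpi_x = sum(x_frac[c] * cpi[c] for c in cpi)
```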
4. Suppose that we have a memory system with 32-bit addresses and a 256 Kilobyte cache. The
size of cache line (block) is 64 bytes. Which of the following statements are true of the system?
(a) When the cache is 8-way associative, there are 4K cache lines.
(b) When the cache is 8-way associative, the index field is 8 bits long.
(c) When the cache is 32-way associative, the index field is 20 bits long.
(d) When the cache is direct-mapped, the tag field is 14 bits long.
(e) When the writing policy is ‘write-back’, the dirty data are written to the main memory
only when the corresponding cache line is chosen as a victim by the replacement policy.
Answer: (a), (d), (e)
Note (a): Number of blocks = 256 KB / 64 B = 4K
Note (b): Number of sets = 4K / 8 = 512 → index field length = 9 bits
Note (c): Number of sets = 4K / 32 = 128 → index field length = 7 bits
Note (d): Tag length = 32 − 12 − 6 = 14 bits
104 (2015) NTU Computer Science
the callee doesn't have any local variables. That is, if the callee procedure has no local
variables, it benefits most from such a register partition.
(d) Different instructions treat the value of a register in different ways. If the ISA doesn't provide
different sets of registers, programmers have to keep in mind what type of data is
contained in each register to prevent the misuse of instructions, which burdens
programmers.
2. DATA REPRESENTATION
On a MIPS machine (Fig. 1) running UNIX, we observed the following binary string stored in
memory location x.
0011 1111 0111 0000 0000 0000 0000 0000
This binary string could mean many different things:
(a) If this is an integer number, what value is it?
(b) If this is a single precision floating point number, what value is it?
(c) If this is an instruction, what instruction is it?
(d) If this is a C string, what string is it?
(e) If this binary string was observed on your x86 desktop, what would be your answer for (d)?
(f) Suppose we have three variables a, b, and c. Give a case where (a + b) + c computes a
different value than a + (b + c) on a MIPS microprocessor.
3. BRANCH PREDICTION
For pipelined processors, control hazards could significantly decrease the performance.
Dynamic branch prediction techniques have been successfully adopted in many modern
processors to reduce the performance penalty caused by control hazards. However, unlike direct
branches (e.g., BEQ L1 in ARM or beq $t0, $t1, L1 in MIPS), indirect branches are usually
difficult to predict.
(a) Please give at least one static and one dynamic branch prediction scheme used for direct
branches in microprocessors.
(b) List the conditions where indirect branches, instead of direct branches, are used. Which one
of the listed cases is most frequent?
(c) Please explain why indirect branches are hard to predict.
Answer:
(a) Static branch prediction: always predict not taken
Dynamic branch prediction: two-bit branch prediction
(b) Case 1: an indirect branch is useful for a multi-way branch; for instance, the MIPS jr
instruction is used to translate the C switch/case construct. Case 2: an indirect branch is used
for a function return; for instance, MIPS jr $ra is a function-return instruction. Case 2 is the
most frequent.
(c) The target address of an indirect branch is not known until run time. This is why
indirect branches are hard to predict.
4. Cache block size is an important design parameter for cache architecture. Assume a 1-CPI
(Cycle-per-Instruction) machine with an average of 1.4 memory references (both Instruction
and data) per instruction. Assume the CPU stalls for cache misses. Answer the following
questions using the cache miss rates for different block sizes listed in the following table,
(a) If the miss penalty is 24 + B (block size in bytes) cycles, what is the optimal cache block
size? Please show how you derive the answer.
(b) If critical-word-first is implemented in the cache, what is the optimal cache block size?
Please show how you derive the answer.
Answer:
(a) 32 bytes is the optimal block size.

Block Size (bytes)          8     16    32
Miss rate                   8%    4%    3%
Miss penalty (cycles)       32    40    56
Miss rate × Miss penalty    2.56  1.6   0.96
(b) With critical-word-first, the miss penalty is independent of block size, so the 32-byte block,
which has the lowest miss rate, is the optimal block size.
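For part (b), the selection rule reduces to picking the lowest miss rate; a one-line check (an illustrative sketch using the miss rates from the table above, not part of the original solution):

```python
# With critical-word-first, the miss penalty is (approximately) independent of
# block size, so the stall cost per access is proportional to miss rate alone.
miss_rate = {8: 0.08, 16: 0.04, 32: 0.03}     # from the table above
best_block = min(miss_rate, key=miss_rate.get)
```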
5. You are given a task to parallelize the following problem on a multi-core architecture:

for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
(a) Is this a weak-scaling or strong-scaling problem? Please explain your answer.
(b) Is it possible to partition this problem among cores such that there are no cache coherency
misses? Please explain your answer.
Answer:
(a) This is a strong-scaling problem, since every element of array x can be computed
concurrently, i.e., the problem is 100% parallelizable. So increasing the problem size while
proportionally increasing the number of processors would not increase the speedup.
(b) Yes. If there are N cores in the multi-core architecture and a cache block contains N words,
we can assign each core the computation of one row of elements in array x. Since each row of
elements is in a different memory block, none of these memory blocks will be copied into
more than one cache, so there is no cache coherency problem and no cache coherency misses.
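The row partition described in (b) can be sketched as follows (a small illustrative script, not part of the original solution; the per-row calls would run on separate cores in hardware):

```python
# Minimal sketch of the row partition: core i computes row i of x,
# so no two cores ever write the same cache block of x.
N = 4
y = [[i + j for j in range(N)] for i in range(N)]       # sample inputs
z = [[i * j + 1 for j in range(N)] for i in range(N)]

def compute_row(i):
    """Work assigned to core i: one full row of x (no shared writes)."""
    return [sum(y[i][k] * z[k][j] for k in range(N)) for j in range(N)]

x = [compute_row(i) for i in range(N)]   # on real hardware, one call per core
```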
6. In current disk storage systems, the operating system generally completes all I/Os asynchronously via
interrupts. But recent studies show that for future NVM (Non-Volatile Memory) storage
systems, which have significantly lower access latency than disks, the synchronous approach
(i.e., polling) could be more efficient than the interrupt approach. Please provide the rationale
behind this.
Answer:
When polling is used to handle slow disks, a lot of processor time is wasted: the processor
may read the status register many times only to find that the disk has not yet completed the
I/O operation. In contrast, when polling is used to handle high-speed NVM, the number of
such wasted status reads is greatly reduced. If it takes the processor longer to execute an
interrupt service routine than to execute a polling loop, using polling to handle NVM can be
more efficient than the interrupt approach.
104 (2015) University System of Taiwan Electrical Engineering
1. Assume for arithmetic, load/store, and branch instructions, a processor has CPIs of 2, 4, and 2
respectively. Also assume that the instruction mix is 40%, 20% and 40% for the three kinds of
instructions respectively. Assume each processor has a 3 GHz clock frequency.
(1) Assume the clock frequency of this processor is proportional to its supply voltage. If we
add an extra processor to this system but reduce the supply voltage from 1.8V to 1.2V,
what is the performance improvement of this change? Assume only clock frequency is
affected by this change and the program can be perfectly parallelized on multiple
processors.
(2) In (1), if the CPI of the arithmetic instructions is doubled when the clock frequency is
slowed down, what is the performance improvement of this dual-core system?
(3) One way to evaluate the efficiency of a processor is using the power-delay product (PDP).
Smaller PDP means better efficiency. In this problem, we define the PDP as the product of
power and cycle time. Assume the power consumption of this processor is proportional to
the square of its supply voltage. Can we obtain efficiency improvement in terms of PDP in
(1)? Please briefly explain your reason.
Answer:
(1) Let the clock frequency after lowering the voltage be x:
3 GHz / 1.8 = x / 1.2 → x = 2 GHz
CPI_old = 2 × 0.4 + 4 × 0.2 + 2 × 0.4 = 2.4
Average instruction time for the old processor = 2.4 / 3 GHz = 0.8 ns
Average instruction time per core of the new processor = 2.4 / 2 GHz = 1.2 ns
Speedup = 0.8 / (1.2 / 2) = 1.33 → the dual-core system is 1.33 times faster than the original system.
(2) CPI_new = 4 × 0.4 + 4 × 0.2 + 2 × 0.4 = 3.2
Average instruction time per core of the new processor = 3.2 / 2 GHz = 1.6 ns
Speedup = 0.8 / (1.6 / 2) = 1 → no performance improvement for the dual-core system.
(3) PDP_old = 1.8² × 0.8 ns = 2.59
PDP_new = 2 × 1.2² × 1.2 ns = 3.46
PDP_old / PDP_new = 2.59 / 3.46 = 0.75 < 1
→ no; the PDP efficiency of the dual-core system is worse than that of the original system.
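The three parts can be verified numerically (a quick check mirroring the solution's definitions, not part of the original answer; note the PDP here uses the average instruction time, as the solution does):

```python
f_old = 3.0                                  # GHz
f_new = f_old * 1.2 / 1.8                    # frequency scales with Vdd -> 2 GHz

cpi_old = 2 * 0.4 + 4 * 0.2 + 2 * 0.4        # 2.4
t_old = cpi_old / f_old                      # 0.8 ns per instruction
t_new_core = cpi_old / f_new                 # 1.2 ns per instruction, per core
speedup = t_old / (t_new_core / 2)           # two cores, perfectly parallel

# (3) PDP as used in the solution: power ~ Vdd^2 (x2 for two cores),
# multiplied by the average instruction time.
pdp_old = 1.8 ** 2 * t_old
pdp_new = 2 * 1.2 ** 2 * t_new_core
```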
2. Consider the following MIPS code sequence:
LOOP: slt $t2, $0, $t1
beq $t2, $0, DONE
subi $t1, $t1, 1
addi $s2, $s2, 2
j LOOP
DONE:
(1) Assume that the register $t1 is initialized to the value of 10. What is the value in register
$s2 assuming $s2 is initially zero?
(2) Assume that the register $t1 is initialized to the value of N. How many instructions are
executed?
Answer:
(1) $s2 = 20
(2) 5N + 2
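The two answers can be confirmed by simulating the loop (a quick check, not part of the original solution):

```python
def run_loop(n):
    """Simulate the loop with $t1 = n; return (final $s2, executed instructions)."""
    t1, s2, count = n, 0, 0
    while True:
        t2 = 1 if 0 < t1 else 0   # slt $t2, $0, $t1
        count += 2                # slt and beq both execute
        if t2 == 0:               # beq $t2, $0, DONE (taken)
            return s2, count
        t1 -= 1                   # subi $t1, $t1, 1
        s2 += 2                   # addi $s2, $s2, 2
        count += 3                # subi, addi, j LOOP
```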
3. MIPS instructions:
(1) In MIPS, there are instructions lb, lbu, and sb for load byte, load byte unsigned, and store
byte, respectively; however, there is no corresponding sbu instruction. Why not?
(2) A MIPS branch instruction performs a modification of PC + 4 if the condition is true.
Suppose that the maximum range of the jump in MIPS is PC – A to PC + B, where both A
and B are positive numbers. What are A and B?
Answer:
(1) This is unnecessary since a single byte is being written to memory (thus, there's no sign
extension to 32 bits, as there is when reading a byte into a 32-bit register).
(2) PC + 4 − 2¹⁷ = PC − A → A = 2¹⁷ − 4
PC + 4 + 2¹⁷ − 4 = PC + B → B = 2¹⁷
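The range follows from the 16-bit signed immediate scaled by 4 and added to PC + 4; a quick check (not part of the original solution):

```python
IMM_BITS = 16
imm_min = -(2 ** (IMM_BITS - 1))             # -32768
imm_max = 2 ** (IMM_BITS - 1) - 1            # +32767

# target = PC + 4 + Immediate * 4, so relative to PC:
A = -(4 + imm_min * 4)    # farthest backward reach (PC - A)
B = 4 + imm_max * 4       # farthest forward reach (PC + B)
```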
4. IEEE 754-2008, the IEEE standard for floating-point (FP) arithmetic, contains a half precision
that is only 16 bits wide, in which the leftmost bit is the sign bit, followed by a 5-bit
exponent with a bias of 15 and a 10-bit mantissa. A hidden 1 is assumed.
(1) Explain why a biased exponent is generally applied in FP representations.
(2) Please write down the bit pattern representing 1.5625 × 10⁻¹ using IEEE 754-2008.
Comment on how the range and accuracy of this 16-bit FP format compares to the single
precision IEEE 754 standard.
Answer:
(1) Exponents have to be signed values in order to represent both tiny and huge
values, but two's complement would make comparison harder. To solve this problem the
exponent is biased by adjusting its value to put it within an unsigned range suitable for
comparison.
(2) 1.5625 × 10⁻¹ = 0.15625₁₀ = 0.00101₂ = 1.01 × 2⁻³ → 0 01100 0100000000
Range: the ranges of single and half precision FP are about 10³⁸ and 10⁵, respectively.
Accuracy: single precision is (24 − 16) / 24 = 33.33% more accurate than half precision.
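The half-precision bit pattern can be reproduced with a small encoder (an illustrative sketch for exactly representable normal values, not part of the original solution):

```python
def half_bits(x):
    """Encode a positive, exactly representable normal value into IEEE 754-2008
    half precision (1 sign, 5-bit exponent biased by 15, 10-bit fraction).
    Minimal sketch: no rounding, subnormals, or special values."""
    e = 0
    while x >= 2:
        x /= 2; e += 1
    while x < 1:
        x *= 2; e -= 1
    frac = round((x - 1) * 2 ** 10)       # 10-bit fraction, hidden 1 dropped
    return ((e + 15) << 10) | frac        # sign bit is 0 for positive x
```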
5. Suppose that you have a computer that, on average, exhibits the following characteristics on
the programs you run:
Type         Distribution   IF     ID     EX     MEM    WB
Load         25%            2 ns   1 ns   3 ns   3 ns   1 ns
Store        10%            2 ns   1 ns   3 ns   2 ns   X
Arithmetic   45%            2 ns   1 ns   3 ns   X      1 ns
Branch       20%            2 ns   1 ns   3 ns   X      X
i. IF: instruction fetch; ID: instruction decode; EX: ALU execution; MEM: data memory access; WB: write back
ii. X denotes that the corresponding stage is not needed
(1) If your computer is implemented as a single-cycle processor, what is its throughput
(measured by “instructions per second”)?
(2) If your computer is implemented as a multi-cycle processor, in which each stage is
executed in one cycle, what is its throughput (measured by “instructions per second”)?
(3) If your computer is implemented as a 5-stage pipelined processor, what is its idealized
throughput, assuming that there are no hazards between instructions?
Answer:
(1) Cycle time = 2 + 1 + 3 + 3 + 1 = 10 ns (set by the longest instruction, load)
Throughput = 1 / 10 ns = 10⁸ instructions per second.
(2) CPI = 5 × 0.25 + 4 × 0.1 + 4 × 0.45 + 3 × 0.2 = 4.05
Average instruction time = 4.05 × 3 ns = 12.15 ns
Throughput = 1 / 12.15 ns ≈ 8.23 × 10⁷ instructions per second.
(3) Average instruction time = 1 × 3 ns = 3 ns
Throughput = 1 / 3 ns ≈ 3.33 × 10⁸ instructions per second.
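The three implementations can be compared with a quick script built from the table (not part of the original solution):

```python
dist = {"load": 0.25, "store": 0.10, "arith": 0.45, "branch": 0.20}
stages = {"load": [2, 1, 3, 3, 1], "store": [2, 1, 3, 2],
          "arith": [2, 1, 3, 1], "branch": [2, 1, 3]}   # stage latencies, ns

single_cycle_time = max(sum(t) for t in stages.values())     # longest instruction
clock = max(max(t) for t in stages.values())                 # slowest stage
multi_cpi = sum(dist[k] * len(stages[k]) for k in dist)      # cycles per instr
multi_time = multi_cpi * clock                               # ns per instruction
pipelined_time = clock                                       # ideal: 1 instr/cycle
```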
6. A RISC processor with 18-stage pipeline runs a program P having 6,114 instructions.
Branches comprise 23% of the instructions, and the "branch not taken" assumption holds for
static branch prediction. Further assume that 40% of the branches are predicted correctly, and
there is an average penalty of 1.7 cycles for each mispredicted branch. Additionally, 2% of the
total instructions incur an average of 1.3 stalls each. Please calculate the CPI of P on this
pipeline. SHOW ALL WORK TO GET FULL CREDIT IF YOUR ANSWER IS CORRECT.
PARTIAL CREDIT IF NOT.
Answer:
Total clock cycles to execute program P = (18 − 1) + 6114 + 6114 × 0.23 × 0.6 × 1.7
+ 6114 × 0.02 × 1.3 ≈ 7724
CPI = 7724 / 6114 = 1.26
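The cycle count can be checked directly (a quick script, not part of the original solution; 0.6 is the 60% misprediction fraction):

```python
n = 6114
fill = 18 - 1                                   # pipeline fill cycles
branch_stalls = n * 0.23 * 0.6 * 1.7            # 23% branches, 60% mispredicted
other_stalls = n * 0.02 * 1.3                   # 2% of instructions, 1.3 stalls
cycles = fill + n + branch_stalls + other_stalls
cpi = cycles / n
```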
7. Consider a lite version of MIPS (Lite-MIPS) in which the immediate fields in lw and sw
instructions must be zero. Thus, lw $t0, 0($sp) is legal in Lite-MIPS, but lw $t0, 4($sp) is not.
Lite-MIPS can be implemented with a 4-stage pipeline: IF, ID, EM, WB where the EM stage
performs the EX and MEM tasks for the normal 5-stage pipeline in parallel. (Note that no
instruction in Lite-MIPS uses both EX and MEM stages.)
(1) Consider the instruction sequence:
lw $t0, 0($a0)
sw $t0, 0($a0)
Fill in the pipeline diagram below to indicate stalls (if any, please mark it with “*”) to
resolve data hazards in the above sequence on the 4-stage pipeline with full forwarding.
             Clock cycle
Instruction    1    2    3    4    5    6    7    8
lw
sw
(2) If the sw instruction was replaced with a conditional branch instruction, under what
circumstances would a stall be necessary between the two instructions?
(3) If the sw instruction was replaced with addi $t0, $t0, 1 and there was no forwarding
implemented, how many stalls would be necessary?
(4) If conditional branch targets are computed in the ID stage, and conditional branch
decisions are resolved in the EM stage, what would be the best prediction strategy (never
predict, always predict taken, always predict not-taken) for this datapath? Draw pipeline
diagrams to support your answer.
Answer:
(1) The data hazard between lw and sw can be resolved by the forwarding path from WB to EM,
so there is no need for the pipeline to stall.

             Clock cycle
Instruction    1    2    3    4    5
lw             IF   ID   EM   WB
sw                  IF   ID   EM   WB
(2) A stall would be necessary if the branch used the $t0 register and the branch decision was
resolved in the ID stage.
(3) One stall cycle is needed.

             Clock cycle
Instruction    1    2    3    4    5    6
lw             IF   ID   EM   WB
addi                IF   ID   ID   EM   WB   (the repeated ID is the stall cycle)
(4) It is always better to predict not-taken than not to predict at all, since no cycles are lost
when the prediction is correct. However, there is no clear “winner” between predict-taken
and predict-not-taken, as shown in the following pipeline diagrams:
Predict not-taken: wrong (2 flushes)

Instruction    1    2    3    4    5    6
beq            IF   ID   EM   WB
beq + 4             IF   ID   (flushed)
beq + 8                  IF   (flushed)
target                        IF
Predict taken: wrong (1 flush)

Instruction    1    2    3    4    5    6
beq            IF   ID   EM   WB
beq + 4             IF
target                   IF   (flushed)
If the branch is predicted taken, then only one flush is needed in the worst case (the incorrect
instruction can be flushed once the branch decision is known at the end of the EM stage).
Thus, if branches are usually taken, a predict-taken strategy costs one flush less than a
predict-not-taken strategy (two flushes, as shown above). However, if branches are usually
not taken, it is better to predict not-taken (no flushes) instead of taken (one flush).
8. Given a four-way set-associative cache with 1024 blocks, a 16-byte block size, and a 32-bit
address. Assume the access time of this cache is 1 ns, including all delays, and the average
access time of the main memory is 5 ns per byte, including all the miss handling.
(1) What are the numbers of bits for the tag, and index in this cache?
(2) Below is a list of 32-bit memory address references, given as word addresses. Please
identify whether each reference is a hit or miss, assuming the cache is initially empty.
3, 188, 43, 1026, 191, 1064, 2, 40
(3) Suppose the miss rate at the primary cache is 4%. How much reduction can we obtain on
the AMAT if we add a secondary cache that has a 5 ns access time and is large enough to
reduce the miss rate to 1%?
(4) Suppose the space of this memory system is extended from 32-bit to 36-bit by using the
virtual memory technique, in which each page has 8KB. What are the numbers of bits for
the virtual page number and the page offset?
(5) Assume the TLB is put into the primary cache, and the page table is put into the main
memory. What is the AMAT if TLB and cache are both hit? What is the AMAT if TLB is
a miss without page fault, assuming the cache is still hit?
Answer:
(1) Number of sets = 1024 / 4 = 256 → index field length = 8 bits
Tag field length = 32 − 8 − 4 = 20 bits
(2)
Word address   Block address   Tag   Index   Hit/Miss
3              0               0     0       Miss
188            47              0     47      Miss
43             10              0     10      Miss
1026           256             1     0       Miss
191            47              0     47      Hit
1064           266             1     10      Miss
2              0               0     0       Hit
40             10              0     10      Hit
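The hit/miss column can be reproduced with a small trace simulator (an illustrative sketch, not part of the original solution; with only a few tags per set, no eviction logic is needed):

```python
def split(word_addr, words_per_block=4, n_sets=256):
    """Decompose a word address for the cache above (16 B blocks, 256 sets)."""
    block = word_addr // words_per_block
    return block // n_sets, block % n_sets       # (tag, index)

def trace(addrs):
    cache = {}                                   # index -> set of resident tags
    result = []
    for a in addrs:
        tag, idx = split(a)
        ways = cache.setdefault(idx, set())
        result.append("Hit" if tag in ways else "Miss")
        ways.add(tag)                            # 4 ways: no eviction needed here
    return result
```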
104 (2015) NTHU Computer Science
1. In computers, floating-point numbers are expressed with a sign bit, an exponent field, and a
fraction field. The bits of the fraction represent a number between 0 and 1. Assume that the
floating-point numbers are 32 bits wide, with a bias of 127. Let x = 0.3125 and y = −0.09375
(in decimal).
(a) Show the floating-point representation of x and y in hexadecimal.
(b) Show the floating-point representation of x × y in hexadecimal.
Answer:
(a)
x: 0.3125₁₀ = 0.0101₂ = 1.01 × 2⁻²
   FP number: 0 01111101 01000000000000000000000 → hexadecimal 3EA00000
y: −0.09375₁₀ = −0.00011₂ = −1.1 × 2⁻⁴
   FP number: 1 01111011 10000000000000000000000 → hexadecimal BDC00000
(b) (1.01 × 2⁻²) × (−1.1 × 2⁻⁴) = −1.111 × 2⁻⁶
FP = 1 01111001 11100000000000000000000 = BCF00000 (biased exponent = −6 + 127 = 121 = 01111001₂)
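The single-precision bit patterns can be checked by reinterpreting each value's bytes (a quick check, not part of the original solution):

```python
import struct

def f32_hex(x):
    """Hexadecimal bit pattern of x rounded to an IEEE 754 single (big-endian)."""
    return struct.pack(">f", x).hex().upper()
```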
2. Two n-bit inputs A[i] and B[i] are combined by the two's-complement subtraction (A – B) with
the subtraction result denoted as Sub[i]. The most significant bit, n − 1, is the sign bit.
Bor(n − 2) denotes the borrow from bit n − 2 to bit n − 3. Indicate whether each of the
following conditions is a valid test for two's-complement overflow. (The condition must be
true if and only if there is overflow.) Answer True or False for the following cases.
(a) A(n – 1) XOR B(n – 1) = 1 and Sub(n – l) = 1
(b) A(n – 1) XOR B(n – 1) = Bor(n – 2)
(c) A(n – 1) XOR B(n – 1) = 1 and Sub(n – l) ≠ A(n – 1)
Answer:
(a) False  (b) True  (c) True
Note: use 4-bit two's complement (range −8 to +7) as an example: 4 − (−5) = 9 → overflow.
A[3] XOR B[3] = 1, and the borrow from bit 2 is 1:

  0100
− 1011
------
  1001
3. Assume a processor has a base CPI of 1.4, running at a clock rate of 4GHz. The access time of
the main memory is 50ns, including all the miss handling. Suppose the miss rate per instruction
at the primary cache is 4%. What will be the speedup after adding a secondary cache with a 5ns
access time for either a hit or a miss? Assume that the miss rate to main memory can be reduced
to 0.2%.
Answer:
At 4 GHz, 50 ns of main-memory access is 200 cycles and 5 ns of secondary-cache access is 20 cycles.
CPI for the one-level cache = 1.4 + 0.04 × 200 = 9.4
CPI for the two-level cache = 1.4 + 0.04 × 20 + 0.002 × 200 = 2.6
Speedup = 9.4 / 2.6 = 3.62
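The CPI arithmetic can be verified with a short script (not part of the original solution):

```python
clock = 4e9                              # Hz
base_cpi = 1.4
mem_cycles = 50e-9 * clock               # main-memory miss: 200 cycles
l2_cycles = 5e-9 * clock                 # secondary-cache access: 20 cycles

cpi_l1 = base_cpi + 0.04 * mem_cycles
cpi_l2 = base_cpi + 0.04 * l2_cycles + 0.002 * mem_cycles
speedup = cpi_l1 / cpi_l2
```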
4. Consider a system with the virtual memory address of 32 bits, and the physical address of 28
bits. The page size is 2KB. Each page table entry is 4 bytes in size.
(a) How many bits are in the page offset portion of the virtual address?
(b) What is the total page table size?
Answer:
(a) The width of the page-offset field = log₂(2K) = 11 bits
(b) The page table has 2³² / 2¹¹ = 2²¹ = 2M entries
The page table size = 2M × 4 bytes = 8 MB
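A quick numeric check (not part of the original solution):

```python
va_bits = 32
page_bytes = 2 * 1024                     # 2 KB pages
pte_bytes = 4

offset_bits = page_bytes.bit_length() - 1    # log2(2 KB) = 11
entries = 2 ** (va_bits - offset_bits)       # number of virtual pages
table_bytes = entries * pte_bytes
```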
6. The beq instruction of MIPS will cause the processor to branch to execute from a target address
if the contents of the two specified registers are equal:
beq $rs, $rt, Target # branch to Target if $rs == $rt
It uses the I-type format shown below, where Opcode is 6-bit wide, rs and rt are 5-bit wide, and
Immediate has 16 bits with the leftmost bit as the sign bit.
Opcode rs rt Immediate
The target address of beq is calculated as PC + (Immediate × 4). Consider the pipelined
implementation of the MIPS processor shown below and answer the following questions.
(a) Explain the purpose of the Adder in the ID stage.
(b) Explain the purpose of the Mux in the IF stage.
(c) Explain the purpose of the IF.Flush control signal.
(d) Suppose we have a branch target buffer, which uses PC as the input, can always predict
correctly whether a branch will take, and supply the target address. Draw a diagram to
explain how the IF stage may be modified. (Hint: Consider the outputs of the branch target
buffer and what existing components may be removed.)
[Figure: the standard MIPS 5-stage pipelined datapath, with a hazard detection unit, a forwarding unit, the IF.Flush control signal, a branch adder (with shift-left-2) in the ID stage, and the PC-source multiplexor in the IF stage.]
Answer:
(a) The adder is used to compute the branch target address.
(b) The Mux is used to select either the branch target or PC + 4 to update the PC in the next clock cycle.
(c) The IF.Flush control signal is used to flush the wrong instruction in the IF stage.
(d) The multiplexor in front of the instruction memory selects the next instruction address from
the PC or the BTB, depending on the BHT state. If the BHT state indicates a not-taken guess,
the PC is applied to the instruction-memory address input; otherwise the output of the BTB is
applied. The original multiplexor that selects between the branch target and PC + 4 is
removed. The following diagram shows only the modified part of the IF stage.
[Figure: modified IF stage — the PC feeds both the instruction memory and the BTB; a multiplexor selects the next PC from the PC + 4 adder or the BTB output, which is latched into IF/ID.]
104 (2015) NCTU Computer Science Joint Admission
Single-answer questions
1. Which of the following statements is correct?
(a) The instructions addi, beq, and j (jump) all need to perform sign extension.
(b) The most significant bit of the 2's-complement representation is the sign bit; a sign bit of 1
indicates a positive number.
(c) The range of numbers represented by n-bit 2's complement is −2ⁿ⁻¹ + 1 ~ ±0 ~ 2ⁿ⁻¹ − 1.
(d) Consider an addition of two 32-bit signed integers with a 32-bit ripple-carry adder, which
consists of 32 1-bit full adders connected in series: if the carry-in and the carry-out of the
most-significant-bit (MSB) full adder have different logic values, then there is an overflow
due to the addition.
(e) The operation of adding one very large positive integer and one very small negative integer
may produce an overflow.
Answer: (d)
4. In a memory system, there is one TLB, one physically addressed cache, and one physical main
memory. Assume all memory addresses are translated to physical addresses before the cache
is accessed. Which of the following events is impossible in this memory system (Note: PT
stands for page table)?
(a) TLB: hit, PT: hit, Cache: miss
(b) TLB: miss, PT: hit, Cache: hit
(c) TLB: miss, PT: hit, Cache: miss
(d) TLB: miss, PT: miss, Cache: miss
(e) TLB: hit, PT: miss, Cache: hit
Answer: (e)
Multiple-answer questions
5. Which of the following 32-bit hexadecimal data will have an identical storage sequence in
(byte-addressable) memory regardless of whether the machine is big-endian or little-endian?
(a) AABBAABBhex
(b) ABBAABBAhex
(c) ABBBBBABhex
(d) ABCDCDABhex
(e) ABCDDCBAhex
Answer: (c), (d)
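A word is stored identically under both byte orders exactly when its byte sequence is a palindrome; a quick check (not part of the original solution):

```python
def same_either_endian(word_hex):
    """True iff a 32-bit word's byte sequence reads the same in both orders."""
    b = bytes.fromhex(word_hex)
    return b == b[::-1]    # little-endian stores the reversed byte sequence
```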
7. Which of the following statements are correct?
(a) Trying to allow some instructions to take fewer cycles does not help, since the throughput
is determined by the clock cycle; the number of pipeline stages per instruction affects
latency, not throughput.
(b) Instead of trying to make instructions take fewer cycles, we should explore making the
pipeline longer (deeper), so that instructions take more cycles, but the cycles are shorter.
This could improve performance.
(c) A data hazard between two adjacent R-type instructions can always be resolved with
forwarding.
(d) Perfect branch prediction (i.e., 100% prediction accuracy) combined with data forwarding
would allow a processor to always keep its pipeline full.
(e) If we use an instruction from the fall-through (untaken) path to fill in a delay slot, we must
duplicate the instruction.
Answer: (a), (b), (c)
Note (e): duplication is required when the delay slot is filled with an instruction from the
branch target, not from the fall-through path.
8. Suppose you have a machine that executes a program spending 50% of its execution time in
floating-point multiply, 20% in floating-point divide, and 30% in integer instructions. Which
of the following statements are correct?
(a) If we make the floating-point divide run 3 times faster, the speedup relative to original
machine is about 0.87.
(b) If we make the floating-point multiply run 8 times faster, the speedup relative to original
machine is about 1.78.
(c) If we make both divide and multiply run faster as described in (a) and (b), the speedup
relative to original machine is about 2.33.
(d) If we can make all the floating-point instructions run 15 times faster, the percent of
floating-point instructions need to be about 90.36 in order to achieve a speedup 4.
(e) The rule stating that the performance enhancement possible with a given improvement is
limited by the amount that the improved feature is used is called Moore's law.
Answer: (b), (c)
Note (d): x = 80.36%
Note (e): Amdahl's law
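The speedups in (a)–(d) follow from Amdahl's law; a quick check (not part of the original solution):

```python
def amdahl(fracs, speedups):
    """Overall speedup when fraction f of the time is sped up by factor s."""
    return 1 / sum(f / s for f, s in zip(fracs, speedups))

a = amdahl([0.8, 0.2], [1, 3])              # FP divide 3x faster: ~1.15, not 0.87
b = amdahl([0.5, 0.5], [1, 8])              # FP multiply 8x faster
c = amdahl([0.3, 0.5, 0.2], [1, 8, 3])      # both improvements together

# Note (d): fraction x of FP time needed for an overall speedup of 4
# when FP runs 15x faster: 4 = 1 / ((1 - x) + x / 15)
x = (1 - 1 / 4) / (1 - 1 / 15)
```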
9. Consider two different implementations, M1 and M2, of the same instruction set. There are
three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 80MHz
and M2 has a clock rate of 100MHz. The average number of cycles for each instruction class
and their frequencies (for a typical program) are as shown in the table below. Which of the
following statements are correct (Note: MIPS in these statements stands for Millions of
Instructions Per Second)?
10. Assume a 128KB direct-mapped cache with a 32-byte block. Consider a video streaming
workload that accesses a 64KB working set sequentially with the following address stream: 0,
2, 4, 6, 8, 10, 12, 14, 16, ...etc. Which of the following statements are correct?
(a) The miss rate is 1/32.
(b) All the misses are compulsory misses based on the 3Cs model.
(c) The miss rate is sensitive to the size of the cache.
(d) The miss rate is sensitive to the size of the working set
(e) The miss rate is sensitive to the block size.
Answer: (b), (e)
Note (a): the miss rate is 1/16 (with 2-byte-stride accesses, each 32-byte block is accessed 16
times and misses only on the first access).
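The note's miss rate is just the reciprocal of accesses per block (a quick check, not part of the original solution):

```python
block_bytes = 32
stride = 2                                   # byte addresses 0, 2, 4, ...
accesses_per_block = block_bytes // stride   # 16 sequential accesses per block
miss_rate = 1 / accesses_per_block           # only the first access misses
```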
Question group A: Consider the following MIPS code in a 5-stage pipelined CPU (see Figure 1)
# memory address: instruction in the memory
36: sub $1, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
72: lw $4, 50($7)
76: add $14, $4, $2
80: slt $15, $6, $7
Figure 2 shows the pipeline state (in clock cycle 3) when the branch instruction (beq $1, $3, 7) is
in the ID stage. Assume that (i) this branch will be taken, (ii) the branch outcome is
determined in the ID stage, and (iii) although not shown in the figure, forwarding to the ID
stage (from the EX/MEM pipeline registers) is available.
11. In the next clock cycle (clock cycle 4), what is the output value of the ALU adder (for program
counter addition) in the IF stage?
(a) 40
(b) 44
(c) 48
(d) 72
(e) 76
Answer: (c)
Note: the pipeline is stalled in clock cycle 4 because there is a data hazard between sub and beq.
12. In the next clock cycle (clock cycle 4), what is the output value of the ALU adder (for program
counter addition) in the ID stage?
(a) 40
(b) 44
(c) 48
(d) 72
(e) 76
Answer: (d)
13. In the next clock cycle (clock cycle 4), what is the instruction being executed in the ID stage?
(a) sub $1, $4, $8
(b) beq $1, $3, 7
(c) and $12, $2, $5
(d) lw $4, 50($7)
(e) NOP (no operation)
Answer: (b)
14. In the next clock cycle (clock cycle 4), what is the instruction being executed in the EX stage?
(a) sub $1, $4, $8
(b) beq $1, $3, 7
(c) and $12, $2, $5
(d) lw $4, 50($7)
(e) NOP (no operation)
Answer: (e)
15. Is there any stall required after the branch is taken (given that Figure 1 shows the pipeline state
in clock cycle 3)?
(a) No, no required stall after the branch.
(b) Yes, one required stall will occur in clock cycle 4
(c) Yes, one required stall will occur in clock cycle 5
(d) Yes, one required stall will occur in clock cycle 6
(e) Yes, one required stall will occur in clock cycle 7
Answer: (e)
Question group B: Assume 4KB pages, a 4-entry two-way set-associative TLB, and true LRU
replacement. If pages must be brought in from disk, use the next largest page number. Given
the virtual address references and the initial TLB and page table states provided below, which
of the following statements is (are) true?
Virtual address references: 4669, 2227, 13916, 34587, 48870, 12608
TLB
Set   Valid   Tag   Physical page number
0     0       11    12
0     1       3     6
1     1       7     4
1     0       4     9

Page table (indexed by virtual page number, 0–11)
Valid   Physical page number or in disk
1       5
0       Disk
0       Disk
1       6
1       9
1       11
0       Disk
1       4
0       Disk
0       Disk
1       3
1       12
16. For the given address references, list the corresponding virtual page numbers.
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (c)
Note:
Virtual address       4669   2227   13916   34587   48870   12608
Virtual page number   1      0      3       8       11      3
TLB tag               0      0      1       4       5       1
TLB index             1      0      1       0       1       1
17. For the given address references, the corresponding indexes to the TLB are
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (e)
18. For the given address references, the corresponding tags to the TLB are
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (d)
20. For the given references, how many page faults occur in total?
(a) None
(b) 1
(c) 2
(d) 3
(e) 4
Answer: (c)
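The VPN, tag, index, and page-fault answers for this group can be checked with a short script (not part of the original solutions; the valid bits are transcribed from the page table above):

```python
PAGE = 4096                                   # 4 KB pages
refs = [4669, 2227, 13916, 34587, 48870, 12608]

def vpn_tag_index(va, sets=2):                # 4-entry, 2-way TLB -> 2 sets
    vpn = va // PAGE
    return vpn, vpn // sets, vpn % sets

vpns    = [vpn_tag_index(a)[0] for a in refs]
tags    = [vpn_tag_index(a)[1] for a in refs]
indexes = [vpn_tag_index(a)[2] for a in refs]

# Page-table valid bits for VPNs 0..11, transcribed from the table above
valid = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
page_faults = 0
for v in vpns:
    if not valid[v]:
        page_faults += 1                      # first touch faults the page in
        valid[v] = 1
```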
104 (2015) NCKU Electrical Engineering
Choose the correct answers for the following multiple-choice problems. Each question may have
more than one answer. 10 points each, no partial credit, no penalty.
2. Which of the following statements is (are) true for virtual memory system?
(a) The flash memory is a volatile device and it can be used for the swap space.
(b) The operating system usually creates the space on flash memory or disk for all the pages
of a process when it creates the process.
(c) The space on the disk or flash memory reserved for the full virtual memory space of a
process is called swap space.
(d) A memory access violation can be detected by the memory management unit.
Answer: (b), (c), (d)
Note (a): flash memory is a nonvolatile device.
3. For a conditional branch instruction such as beq rs, rt, loop, which of the following statements
are true?
(a) The label “loop” defines the base address of the branch target.
(b) The label “loop” is an offset relative to the program counter which points to the next
sequential instruction of the branch instruction.
(c) The label “loop” is an unsigned number.
(d) The label “loop” is coded into the instruction as “loop”.
Answer: (b)
4. Which of the following statements is (are) true for virtualization?
(a) The software that supports virtual machines is called a virtual machine monitor (VMM),
which determines how to map the virtual resources to the physical resources.
programs often have zero virtualization overhead.
(c) OS-intensive workloads which execute many system calls and privileged instructions can
result in high virtualization overheads.
(d) Virtualization is a simulation program that performs page walk for the virtual memory
system.
Answer: (a), (b), (c)
5. Which of the following is (are) true for cache system? The address is 32-bit long for each case.
(a) A 64KB direct-mapped cache has a line size of 64 bytes. The tag width is 18 bits.
(b) A 64KB direct-mapped cache has a line size of 64 bytes. The total tag memory is 16
Kbits.
(c) A 64KB 4-way set associative cache has a line size of 64 bytes. The tag width is 16 bits.
(d) A 64KB 4-way set associative cache has a line size of 64 bytes. The total tag memory is
18 Kbits.
Answer: (b), (d)
Note (a): tag width = 32 − 10 − 6 = 16 bits
Note (b): total tag memory = 16 bits × 1K lines = 16 Kbits
Note (c): tag width = 32 − 8 − 6 = 18 bits
Note (d): total tag memory = 18 bits × 1K lines = 18 Kbits
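The tag widths and tag-store sizes in the notes can be checked with a small helper (an illustrative sketch, not part of the original solution):

```python
import math

def tag_width_and_tag_bits(addr_bits, cache_bytes, line_bytes, ways):
    """Return (tag width, total tag-store bits) for a set-associative cache."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    index = int(math.log2(sets))
    offset = int(math.log2(line_bytes))
    tag = addr_bits - index - offset
    return tag, tag * lines                  # one tag per line
```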
6. Which of the following is (are) true for branch hazard in a pipelined processor?
(a) Branch prediction can eliminate branch hazard completely.
(b) Branch hazard comes from a data access hazard. It happens frequently.
(c) A branch hazard arises from the need to make a decision based on the result of the branch
instruction.
(d) Branch hazard is a control hazard: the proper instruction cannot execute in the proper
pipeline clock cycle because the instruction that was fetched is not the one that is needed.
Answer: (c), (d)
7. Which of the following is (are) true for processor implementation? Assume that for a single
cycle implementation, the processor’s cycle time is T nsec. The instruction count of the
program to run is N.
(a) For single cycle implementation, T is determined by the instruction which has the longest
latency.
(b) If N is the program instruction count, for single cycle CPU, it takes time of N × T nsec.
(c) For multi-cycle implementation that uses one fifth of T for the CPU clock cycle time, the
program execution time is 0.2 × N × T nsec.
(d) For a five-stage pipeline implementation that also uses 0.2T for the CPU clock cycle time,
the program execution time is 0.2 × N × T nsec.
Answer: (a), (b), (d)
9. Using 8K × 8 SRAM chips for the memory system, which of the following is (are) true?
(a) For 1 MB memory system, it needs 64 SRAM chips.
(b) The 8K × 8 chip has 8K address pins.
(c) It needs at least 8 chips for the connection to a 64-bit data bus for proper operation of full
bus width access. So the minimum memory size is 64KB.
(d) It needs at least 4 chips for the connection to a 64-bit data bus for proper operation of full
bus width access. So the minimum memory size is 32KB.
Answer: (c)
註(a):1MB / 8KB = 128 SRAM chips
註(b):8K = 2^13 → 13 address pins
10. Which of the following is (are) true about cache operations?
(a) A processor writes data into a cache line, which is also updated in other processor's cache.
This is the write-through policy.
(b) When a data cache write hit occurs, the written data are also updated in the next level of
memory. This is the write-back policy.
(c) When a data cache write miss occurs, the cache controller first fetches the missing block
into cache and then the data are written into the cache. This is the write-allocate policy.
(d) When a data cache write hit occurs, the data are only written into the cache. This is the
write-back policy.
Answer: (c), (d)
註(a):write-update (write-broadcast)
註(b):write-through
104 年成大資聯
1. Determine whether each of the following statements is true (T) or false (F).
(1) Program execution time reduces when the clock rate increases.
(2) Program execution time reduces when the CPI increases.
(3) Program execution time reduces when the instruction count (IC) increases.
(4) Suppose the floating point instructions are enhanced and can run 10 times faster. If the
execution time before the floating point enhancement is 80 seconds and three-fourth of the
execution time is spent executing floating-point instructions, the overall speed up is at
least 3.
(5) Suppose the floating point instructions are enhanced and can run 20 times faster. If the
execution time before the floating point enhancement is 80 seconds and one-half of the
execution time is spent executing floating-point instructions, the overall speed up is at least
2.
Answer:
(1) (2) (3) (4) (5)
T F F T F
註(4):Speedup = 80 / (20 + 60/10) = 80 / 26 = 3.08
註(5):Speedup = 80 / (40 + 40/20) = 80 / 42 = 1.9
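The two speedup notes follow Amdahl's law; a quick sketch (the function name is illustrative only):

```python
# Sketch: Amdahl's law for the floating-point enhancements in (4) and (5).
def speedup(total, enhanced_fraction, factor):
    enhanced = total * enhanced_fraction          # time covered by the enhancement
    new_time = (total - enhanced) + enhanced / factor
    return total / new_time

s4 = speedup(80, 0.75, 10)   # 80 / (20 + 6)  -> about 3.08
s5 = speedup(80, 0.50, 20)   # 80 / (40 + 2)  -> about 1.90
print(round(s4, 2), round(s5, 2))
```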
2. Determine whether each of the following statements is true (T) or false (F)
(1) R-type and I-type MIPS instructions can be distinguished by the opcode of an instruction.
(2) Base addressing mode is used by I-format instructions.
(3) PC-relative addressing is used by J-format instructions.
(4) Suppose the program counter (PC) is at address 0x0000 0000. It is possible to use one
single branch-on-equal (beq) MIPS instruction to get to address 0x00030000.
(5) Suppose the program counter (PC) is at address 0x0000 0000. It is possible to use the
jump MIPS instruction to get to 0xFFFFFFB0.
Answer:
(1) (2) (3) (4) (5)
T T F F F
註(1): The value of opcode of an R-type instruction is 0 and the value of opcode of an I-type
instruction is not 0.
3. The following descriptions are about the IEEE 754 single-precision floating-point format.
Determine whether each of the following statements is true (T) or false (F)?
(1) The floating-point format has 1 sign bit, 8 exponent bits, and 23 fraction bits.
(2) The smallest positive number it can represent is 0000 0001 0000 0000 0000 0000 0000
00002
(3) The result of “Divide 0 by 0” is 0111 1111 1000 0000 0000 0000 0000 00002
(4) To improve the accuracy of the results, IEEE 754 has one extra bit for rounding.
(5) 0.7510 is represented by 1011 1111 0100 0000 0000 0000 0000 00002
Answer:
(1) (2) (3) (4) (5)
T F F F F
註(2):Should be 0000 0000 1000 0000 0000 0000 0000 00002
註(3):The result of “Divide 0 by 0” should be NaN, so the fraction field cannot be all 0s.
註(4):Two extra bits: Guard and Round
註(5):Should be 0011 1111 0100 0000 0000 0000 0000 00002
4. Refer to the above table and Figure 1. The right side of the above table is the assembled
instructions of the MIPS instructions in the left side. The starting address of the loop is 4000010
in memory. What are the values of (1), (2), (3) and (4)? Express your answers in decimal
numbers.
Answer:
(1) (2) (3) (4)
2 10000 35 32
5. Refer to the following instruction sequence:
Instruction sequence
lw $1, 40($2)
add $2, $3, $3
add $1, $1, $2
sw $1, 20($2)
(1) Find all data dependences in this instruction sequence.
(2) Find all hazards in this instruction sequence for a 5-stage pipeline with and without
forwarding.
(3) To reduce the clock cycle time, we are considering a split of the MEM stage into two
stages. Repeat (2) for this 6-stage pipeline.
Answer:
(1)
Instr. sequence        RAW                  WAR              WAW
I1: lw $1, 40($2)      ($1) I1 to I3        ($2) I1 to I2    ($1) I1 to I3
I2: add $2, $3, $3     ($2) I2 to I3, I4
I3: add $1, $1, $2     ($1) I3 to I4
I4: sw $1, 20($2)
6. Suppose that in 1000 memory references there are 50 misses in the first-level cache, 20 misses
in the second-level cache, and 5 misses in the third-level cache. Assume the miss penalty from
the L3 cache to memory is 100 clock cycles, the hit time of the L3 cache is 10 clocks, the hit
time of the L2 cache is 4 clocks, the hit time of L1 is 1 clock cycle. What is the average
memory access time? Ignore the impact of writes.
Answer:
L1 cache L2 cache L3 cache
Miss rate 50/1000 20/1000 5/1000
Hit time 1 4 10
Miss penalty 4 10 100
AMAT = 1 + 4 × 50/1000 + 10 × 20/1000 + 100 × 5/1000 = 1.9
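A quick sketch of the AMAT computation in the answer (variable names are made up here; each level's hit time is weighted by the global miss rate of the level above):

```python
# Sketch: three-level AMAT with miss counts per 1000 memory references.
refs = 1000
miss_l1, miss_l2, miss_l3 = 50, 20, 5        # misses seen at each level
hit_l1, hit_l2, hit_l3, mem = 1, 4, 10, 100  # hit times and memory penalty (cycles)
amat = (hit_l1
        + (miss_l1 / refs) * hit_l2
        + (miss_l2 / refs) * hit_l3
        + (miss_l3 / refs) * mem)
print(amat)   # 1.9 clock cycles
```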
104 成大電通
Choose the correct answers for the following multiple choice problems. Each question may have
more than one answer. 10 points each, no partial point, no penalty.
1. Which of the following statements is (are) not true for virtual memory system?
(a) It is typically unknown when a page in memory will be replaced on flash memory or
disk.
(b) The flash memory is a volatile device and it can be used to store pages in memory.
(c) The operating system usually creates the space on flash memory or disk for all the pages
of a process when it creates the process.
(d) A program can be invoked by the operating system into different instances of processes.
(e) The space on the disk or flash memory reserved for the full physical memory space of a
process is called swap space.
Answer: (c)
註(d): A program can be decomposed by the OS into different instances of processes.
註(e): The space on the disk or flash memory reserved for the full virtual memory space of a
process is called swap space.
3. Which of the following is (are) true for the control hazards in a pipelined processor?
(a) Control hazard comes from a data cache miss.
(b) Considering two instructions i and j, with i occurring before j, j tries to read a source
before i writes it, so j incorrectly gets the old value. This causes a control hazard.
(c) Considering two instructions i and j, with i occurring before j, j tries to read a source
before i writes it, so j incorrectly gets the old value. This also causes a control hazard.
(d) A control hazard arises from the need to make a decision based on the result of a branch
instruction while others are executing.
(e) When the proper instruction cannot execute in the proper pipeline clock cycle because the
instruction that was fetched is not the one that is needed.
Answer: (d), (e)
104 年中央資工
單選題
1. The information of a virtual memory system is assumed to be:
- Page size: 1024 words
- Number of virtual pages: 8
- Number of physical pages: 4
- The current page table is
VPN 0 1 2 3 4 5 6 7
PPN 2 1 NULL NULL 3 NULL 0 NULL
For the following (decimal) virtual word addresses: VA1: 57, VA2: 2048, VA3: 1026, VA4:
7749, VA5: 6150, their corresponding physical addresses are PA1, PA2, PA3, PA4, PA5. K =
(PA1 + PA2 + PA3 + PA4 + PA5) mod 5, where “mod” is the modulo operator, and PA = -1 if it
is a page fault. What is "K"?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (A)
註:(2105 – 1 + 1026 – 1 + 6) mod 5 = 3135 mod 5 = 0
Virtual address       57     2048   1026   7749   6150
Virtual page number   0      2      1      7      6
Page offset           57     0      2      581    6
Physical address      2105   -1     1026   -1     6
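The translation table above can be reproduced with a small sketch (the `translate` helper is illustrative, not from the exam):

```python
# Sketch: page-table walk for the five virtual addresses (page size 1024 words).
PAGE = 1024
ppn = {0: 2, 1: 1, 4: 3, 6: 0}           # VPN -> PPN; missing entries page-fault

def translate(va):
    vpn, offset = divmod(va, PAGE)
    return ppn[vpn] * PAGE + offset if vpn in ppn else -1   # -1 on page fault

pas = [translate(va) for va in (57, 2048, 1026, 7749, 6150)]
print(pas, sum(pas) % 5)   # [2105, -1, 1026, -1, 6] -> K = 0
```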
2. A cache with data size 1Mbytes contains 32768 blocks and is eight-way set associative. The byte
address 0x11D9A4F1 is accessed and is a hit in this cache. Assume that the corresponding tag
value is “T”, what is “T mod 5”?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (E)
註:Block size = 1MB / 32K = 32 bytes. Number of sets = 32K / 8 = 4K
        Tag               Index           Offset
Length  15                12              5
Value   000100011101100   110100100111   10001
000100011101100₂ = 2284₁₀. 2284 mod 5 = 4
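As a cross-check, a short sketch that decomposes the address with the 15/12/5 field split; note that it gives a tag of 2284, hence T mod 5 = 4:

```python
# Sketch: address breakdown for the 1 MB, 8-way cache (32 B lines, 4K sets).
addr = 0x11D9A4F1
offset_bits, index_bits = 5, 12                 # log2(32), log2(4096)
offset = addr & ((1 << offset_bits) - 1)        # byte offset within the line
index = (addr >> offset_bits) & ((1 << index_bits) - 1)
tag = addr >> (offset_bits + index_bits)        # remaining upper 15 bits
print(tag, index, offset, tag % 5)              # 2284 3367 17 4
```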
3. Given a function or output F with four inputs A, B, C, D as F(A, B, C, D) = Σm(3, 4, 5, 6, 7)
(minterms) and its don't-care information is d(A, B, C, D) = Σd(10, 11, 14, 15). By the commonly
used Karnaugh map, simplify F with the minimum number of gates using “product of sums”, where
there are “M” OR gates, “N” AND gates, and “P” NOT/INVERT gates; each AND/OR gate
has two inputs and each NOT gate has one input. K = (M × 4 + N × 3 + P) mod 5. What is “K”?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (D)
註: F(A, B, C, D) = Σm(3, 4, 5, 6, 7) + Σd(10, 11, 14, 15) = A′(B + C)(B + D)
M = 2, N = 3, P = 1 → K = (2 × 4 + 3 × 3 + 1) mod 5 = 18 mod 5 = 3
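The simplified product-of-sums form A′(B + C)(B + D) can be verified exhaustively against the minterm and don't-care lists; a small sketch (the function name `F` is just illustrative):

```python
# Sketch: brute-force check of the POS form against the minterms, relying
# only on the don't-care freedom for entries 10, 11, 14, 15.
minterms = {3, 4, 5, 6, 7}
dont_care = {10, 11, 14, 15}

def F(m):
    a, b, c, d = (m >> 3) & 1, (m >> 2) & 1, (m >> 1) & 1, m & 1
    return (1 - a) & (b | c) & (b | d)    # A'(B + C)(B + D)

for m in range(16):
    if m in dont_care:
        continue                           # free to be either value
    assert F(m) == (1 if m in minterms else 0)
print("POS form verified")
```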
5. There are three classes of instructions: Class A, Class B and Class C. The CPI values for Classes
A, B and C are 3, 2 and 1 respectively. We have two compilers, C1 and C2. The compiler C1
generates 10^6 Class A instructions, 2 × 10^6 Class B instructions and 5 × 10^6 Class C instructions.
The compiler C2 generates 10^6 Class A instructions, 10^6 Class B instructions and 10^7 Class C
instructions. If C1 is tested on a 1GHz machine, M1, while C2 is tested on a 1.5GHz machine,
M2, the execution time of {C1, M1} is “T1” ms (milliseconds) while the execution time of {C2,
M2} is “T2” ms (milliseconds). Calculate K = Round(|T1 - T2| × 1.234) mod 5.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (C)
註:
Instruction class    A          B            C
CPIi                 3          2            1
IC for C1            10^6       2 × 10^6     5 × 10^6
IC for C2            10^6       10^6         10^7
CPU clock cycles for C1 = 3 × 10^6 + 2 × 2 × 10^6 + 1 × 5 × 10^6 = 12 × 10^6
CPU clock cycles for C2 = 3 × 10^6 + 2 × 10^6 + 1 × 10^7 = 15 × 10^6
T1 = 12 × 10^6 / 1GHz = 12 ms; T2 = 15 × 10^6 / 1.5GHz = 10 ms
K = Round(|12 – 10| × 1.234) mod 5 = Round(2.468) mod 5 = 2
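A sketch of the execution-time arithmetic for {C1, M1} and {C2, M2}, assuming the garbled constant in the question is 1.234:

```python
# Sketch: execution times (ms) and the K value; 1.234 is an assumption here.
cpi = {"A": 3, "B": 2, "C": 1}
ic_c1 = {"A": 1e6, "B": 2e6, "C": 5e6}
ic_c2 = {"A": 1e6, "B": 1e6, "C": 1e7}
cycles = lambda ic: sum(cpi[k] * ic[k] for k in cpi)
t1_ms = cycles(ic_c1) / 1e9 * 1e3       # M1 runs at 1 GHz   -> 12 ms
t2_ms = cycles(ic_c2) / 1.5e9 * 1e3     # M2 runs at 1.5 GHz -> 10 ms
k = round(abs(t1_ms - t2_ms) * 1.234) % 5
print(t1_ms, t2_ms, k)                  # 12.0 10.0 2
```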
6. Which of the following statements are true?
(A) Page table is a cache.
(B) TLB is a cache.
(C) “TLB miss. Page table hit. Cache miss” is possible.
(D) Conflict misses only occur in a direct-mapped or set-associative cache and can be
eliminated in a fully associative cache of the same size.
(E) A very large cache can avoid all the misses.
Answer: (B), (C), (D)
7. Which are the causes that limit the growth of uniprocessor performance and motivate the trend of
developing multiple processors per chip in recent years?
(A) Long memory latency
(B) Emergence of Reduced Instruction Set Computer
(C) Limits of power
(D) Little instruction-level parallelism left to exploit efficiently
(E) None of the above
Answer: (A), (C)
8. Suppose that the frequency of Floating Point (FP) operations is 25% and the average Clock
Cycle Per Instruction (CPI) of FP operations is 4. The frequency of Floating Point Square Root
(FPSQR) operations is 2% and the average CPI of FPSQR operations is
20. The average CPI of other instructions is 1.33. There are two alternatives to improve the
processor. Alternative 1 is to reduce the CPI of FPSQR to 2. Alternative 2 is to reduce the
average CPI of all FP to 2. Which of the following is true?
(A) Alternative 1 is better.
(B) Alternative 2 is better.
(C) The CPI of alternative 1 is 1.25.
(D) The CPI of alternative 2 is 1.25.
(E) The speedup for alternative 2 is 1.33.
Answer: (B)
註:
Instruction class FP FPSQR Other
Frequency 25% 2% 73%
CPIi 4 20 1.33
CPIoriginal = 4 × 0.25 + 20 × 0.02 + 1.33 × 0.73 = 2.37
CPIAlternative 1 = 4 × 0.25 + 2 × 0.02 + 1.33 × 0.73 = 2.01
CPIAlternative 2 = 2 × 0.25 + 20 × 0.02 + 1.33 × 0.73 = 1.87
Speedup for Alternative 2 = 2.37 / 1.87 = 1.27
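The CPI comparison can be reproduced with a short sketch (dictionary names are illustrative):

```python
# Sketch: weighted-CPI comparison of the two design alternatives.
freq = {"FP": 0.25, "FPSQR": 0.02, "Other": 0.73}
base = {"FP": 4, "FPSQR": 20, "Other": 1.33}
alt1 = dict(base, FPSQR=2)     # alternative 1: FPSQR CPI reduced to 2
alt2 = dict(base, FP=2)        # alternative 2: all FP CPI reduced to 2
cpi = lambda c: sum(freq[k] * c[k] for k in freq)
print(round(cpi(base), 2), round(cpi(alt1), 2), round(cpi(alt2), 2))
# 2.37 2.01 1.87 -> alternative 2 wins; speedup = 2.37 / 1.87, about 1.27
```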
9. Which of the following statements are true for General-purpose register (GPR) instruction
architecture?
(A) One of the advantages of register-register instructions is simpler code-generation
(B) The instruction count for register-register instructions is usually lower than register-memory
instructions.
(C) Using registers is more efficient for a compiler than other forms of internal storage.
(D) When variables are allocated to registers, the memory traffic reduces, the program speeds
up.
(E) None of the above
Answer: (A), (C), (D)
10. About single-cycle implementation and multi-cycle implementation of control and data path,
which of the following statements are true?
(A) Single-cycle implementation of control and data path is better than multi-cycle
implementation.
(B) Multi-cycle implementation of control and data path prevents an instruction from sharing
functional units with another instruction within the execution of the instruction.
(C) Single-cycle implementation facilitates the design of pipeline.
(D) For multi-cycle implementation, the clock cycle is determined by the longest possible path.
(E) None of the above.
Answer: (C), (D)
9 SD 0(R1), F4
10 SD -8(R1), F8
11 SUBI R1, R1, #32
12 SD A(R1), F12
13 BNEZ R1, LOOP
14 SD B(R1), F16
Where A and B stand for two numbers. Which of the following is true?
(A) Number A in line 12 should be 8.
(B) Number A in line 12 should be 16.
(C) Number B in line 14 should be 8.
(D) Number B in line 14 should be 16.
(E) Number B in line 14 should be 32.
Answer: (B), (C)
104 年中正電機
2. Booth's Algorithm and Modified Booth's Algorithm
(1) Calculate 11101010 × 11001110 by Booth's Algorithm.
(2) Give the rule of Modified Booth Recoding.
(3) Find the values of P0, P2, P4, and P6 by Modified Booth's Algorithm.
Multiplicand Y 11101010
Multiplier X 11001110
P0
P2
P4
P6
P = P0 + P2 + P4 + P6
Answer
(1) 11101010 × 11001110 = 00000100 01001100
Iteration   Step                    Multiplicand   Product (with extra bit)
0           initial values          11101010       00000000 11001110 0
1           00 → no operation       11101010       00000000 11001110 0
            shift right product     11101010       00000000 01100111 0
2           10 → – Multiplicand     11101010       00010110 01100111 0
            shift right product     11101010       00001011 00110011 1
3           11 → no operation       11101010       00001011 00110011 1
            shift right product     11101010       00000101 10011001 1
4           11 → no operation       11101010       00000101 10011001 1
            shift right product     11101010       00000010 11001100 1
5           01 → + Multiplicand     11101010       11101100 11001100 1
            shift right product     11101010       11110110 01100110 0
6           00 → no operation       11101010       11110110 01100110 0
            shift right product     11101010       11111011 00110011 0
7           10 → – Multiplicand     11101010       00010001 00110011 0
            shift right product     11101010       00001000 10011001 1
8           11 → no operation       11101010       00001000 10011001 1
            shift right product     11101010       00000100 01001100 1
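The iteration table can be reproduced by a direct implementation of radix-2 Booth's algorithm; this is a sketch, with the product register split into a high half, a low half, and the extra bit, as in the table:

```python
# Sketch: radix-2 Booth's algorithm for two's-complement operands.
def booth_multiply(y, x, bits=8):
    mask = (1 << bits) - 1
    hi, lo, extra = 0, x & mask, 0
    for _ in range(bits):
        pair = ((lo & 1) << 1) | extra
        if pair == 0b10:                  # 10 -> subtract multiplicand
            hi = (hi - y) & mask
        elif pair == 0b01:                # 01 -> add multiplicand
            hi = (hi + y) & mask
        # arithmetic right shift of the combined (hi, lo, extra) register
        extra = lo & 1
        lo = ((lo >> 1) | ((hi & 1) << (bits - 1))) & mask
        hi = (hi >> 1) | (hi & (1 << (bits - 1)))   # keep the sign bit
    return (hi << bits) | lo

p = booth_multiply(0b11101010, 0b11001110)   # (-22) x (-50)
print(format(p, "016b"))                     # 0000010001001100 = 1100
```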
(2)
Current bits (ai+1 ai)   Previous bit (ai-1)   Operation
0 0 0 None
0 0 1 Add the multiplicand
0 1 0 Add the multiplicand
0 1 1 Add twice the multiplicand
1 0 0 Subtract twice the multiplicand
1 0 1 Subtract the multiplicand
1 1 0 Subtract the multiplicand
1 1 1 None
(3)
Multiplicand Y   11101010
Multiplier X     11001110
P0   0000000000101100   ; –2 × Multiplicand
P2   0000000000000000   ; none
P4   1111111010100000   ; + Multiplicand × 2^4
P6   0000010110000000   ; – Multiplicand × 2^6
P = P0 + P2 + P4 + P6 = 0000010001001100
註: The figure shows a carry-save adder (CSA) multiplier example computing 101 × 110 = 11110.
The partial products PP0 = 000, PP1 = 101 (shifted left once), and PP2 = 101 (shifted left twice)
are reduced by a first stage of half adders, and a traditional adder then produces the final product.
4. The equation of Average Memory Access Time (AMAT) has three components, including hit
time, miss rate, and miss penalty.
(1) Give the equation of AMAT in terms of hit time, miss rate and miss penalty.
For each of the following cache optimizations, indicate which components of the AMAT
equation can be improved. Explain the reasons.
(2) Using a multi-level cache instead of a primary cache.
(3) Using an M-way set-associate cache instead of a direct-mapped cache.
(4) Using larger blocks instead of smaller blocks.
Answer
(1) AMAT = hit time + miss rate × miss penalty
(2) miss penalty
(3) miss rate
(4) miss rate
5. Suppose that the industry trends show that a new process technology scales capacitance by
1/2, voltage by 1/2, and clock rate by 3, by what factor does the dynamic power scale?
Answer
Powernew = (3 × Fold) × (0.5 × Cold) × (0.5 × Vold)^2 = 0.375 × Fold × Cold × Vold^2
= 0.375 × Powerold
Thus power scales by 0.375.
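The scaling follows from dynamic power P = C × V² × f; as a one-line check:

```python
# Sketch: dynamic power scaling factor under the new process technology.
c_scale, v_scale, f_scale = 0.5, 0.5, 3.0
factor = c_scale * v_scale**2 * f_scale   # C halves, V halves (squared), f triples
print(factor)   # 0.375
```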
6. For the MIPS assembly code below,
7. Assume that a CPU datapath contains five stages with different latencies below.
IF ID EX MEM WB
300ps 400ps 350ps 500ps 100ps
(1) What is the clock cycle time in a pipelined and a non-pipelined processor?
(2) What is the total latency of an add instruction in a pipelined and a non-pipelined processor?
(3) If we can split one stage of the pipelined datapath into two new stages for a higher clock
rate, each with half the latency of the original stage, which stage would you split, and what is
the new clock cycle time of the processor?
Answer
(1) (2)
Pipelined Single-cycle Pipelined Single-cycle
500ps 1650ps 2500ps 1650ps
(3)
Stage to split New clock cycle time
MEM 400ps
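A sketch of the cycle-time arithmetic (stage names follow the table):

```python
# Sketch: cycle times and add-instruction latency for the five stage latencies.
stages = {"IF": 300, "ID": 400, "EX": 350, "MEM": 500, "WB": 100}   # ps
pipelined_cycle = max(stages.values())               # 500 ps
single_cycle = sum(stages.values())                  # 1650 ps
add_latency_pipelined = len(stages) * pipelined_cycle   # 2500 ps
# split the longest stage (MEM) into two 250 ps stages:
split = dict(stages, MEM1=250, MEM2=250)
split.pop("MEM")
print(pipelined_cycle, single_cycle, add_latency_pipelined, max(split.values()))
# 500 1650 2500 400
```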
104 年中正資工
1. Please explain the reason why the single-cycle implementation is rarely used to implement any
instruction set of a processor.
Answer
The single-cycle datapath is inefficient, because the clock cycle must have the same length for
every instruction in this design. The clock cycle is determined by the longest path in the machine,
but several instruction classes could fit in a shorter clock cycle.
2. Suppose we want to design a carry-select adder to compute the addition of two 8-bit unsigned
numbers with ONLY 1-bit full adders and 2-to-1 multiplexers. In addition, the delay times of a
1-bit full adder and a 2-to-1 multiplexer are DFA and DMX, respectively. Moreover, DMX is equal
to 0.8 DFA. Please determine the minimum delay time for this carry-select adder.
Answer: The minimum delay time for this carry-select adder = 4DFA + DMX = 4.8DFA
註:The following diagram shows the 8-bit carry-select adder where the sum delay for 4-bit ripple
carry adder is 4DFA
(The figure shows the 8-bit carry-select adder: a 4-bit ripple-carry adder computes s3–s0 and its
carry-out; two speculative 4-bit adders compute s7–s4 for carry-in 0 and carry-in 1; 2-to-1
multiplexers then select the upper sum bits and the carry c once the lower carry-out is known.)
3. The following techniques have been developed for cache optimizations: hit time, miss rate or
miss penalty: “Non-blocking cache”, “multi-banked cache”, and “critical word first and early
restart”. Please briefly explain these techniques and how they work.
Answer
Non-blocking cache is a cache that allows the processor to make references to the cache while
the cache is handling an earlier miss. Two implementations are used to hide cache miss latency:
hit under miss allows additional cache hits during a miss, and miss under miss allows multiple
outstanding cache misses.
Multi-banked cache: rather than treat the cache as a single monolithic block, divide into
independent banks that can support simultaneous accesses and can increase cache bandwidth.
Critical word first and early restart are both used to reduce miss penalty. Critical word first is
to organize the memory so that the requested word is transferred from the memory to the cache
first. The remainder of the block is then transferred. Early restart is simply to resume execution
as soon as the requested word of the block is returned, rather than wait for the entire block.
4. What are “3C cache misses”? List one technique to improve each of the 3C misses.
Answer
3Cs model: a cache model in which all cache misses are classified into one of 3 categories:
compulsory, capacity, and conflict misses. Compulsory misses can be reduced with larger block
sizes, capacity misses with a larger cache, and conflict misses with higher associativity.
5. Given the memory references (word addresses): 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186,
253, and a direct-mapped cache with 10 blocks. Indicate which of the above 12 memory
accesses will encounter a cache miss, if (1) each cache block has only 1 word, and (2) each
cache block has 10 words.
Answer
(1)
Word address   Block address   Tag   Index   Hit/Miss
3 3 0 3 Miss
180 180 18 0 Miss
43 43 4 3 Miss
2 2 0 2 Miss
191 191 19 1 Miss
88 88 8 8 Miss
190 190 19 0 Miss
14 14 1 4 Miss
181 181 18 1 Miss
44 44 4 4 Miss
186 186 18 6 Miss
253 253 25 3 Miss
(2)
Word address   Block address   Tag   Index   Hit/Miss
3 0 0 0 Miss
180 18 1 8 Miss
43 4 0 4 Miss
2 0 0 0 Hit
191 19 1 9 Miss
88 8 0 8 Miss
190 19 1 9 Hit
14 1 0 1 Miss
181 18 1 8 Miss
44 4 0 4 Hit
186 18 1 8 Hit
253 25 2 5 Miss
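The two hit/miss tables can be reproduced by a small direct-mapped cache simulator (a sketch; the `simulate` helper is an illustrative name):

```python
# Sketch: direct-mapped cache with 10 blocks, for 1-word and 10-word blocks.
def simulate(addresses, block_words, num_blocks=10):
    cache = {}                                   # index -> tag currently stored
    hits = 0
    for a in addresses:
        block = a // block_words
        index, tag = block % num_blocks, block // num_blocks
        if cache.get(index) == tag:
            hits += 1
        else:
            cache[index] = tag                   # fill/replace on a miss
    return hits

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
print(simulate(refs, 1), simulate(refs, 10))     # 0 hits, then 4 hits
```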
104 年中山電機
2. Consider a four-level memory hierarchy, M1, M2, M3, and M4, with access times T1 = 10 nsec,
T2 = 50 nsec, T3 = 100 nsec, and T4 = 600 nsec. The cache hit ratio H1 = 0.85 at the first level,
H2 = 0.90 at the second level and H3 = 0.95 at the third level. Calculate the effective access time
of this memory system.
Answer: AMAT = 10 ns + 0.15 × 50 ns + 0.10 × 100 ns + 0.05 × 600 ns = 57.5 ns
註:
               M1       M2       M3       M4
Hit time       10 ns    50 ns    100 ns   600 ns
Miss rate      0.15     0.10     0.05
Miss penalty   50 ns    100 ns   600 ns
3. IEEE-754 floating-point representation
(a) Using 32-bit floating-point format (8-bit exponent, exponent bias = 127, and base = 2) to
represent -1/64.
(b) Using 64-bit floating-point format (11-bit exponent, exponent bias = 1023, and base = 2) to
represent -1/32.
Answer
(a) -1/64 = -0.000001₂ = -1.0 × 2^-6 → 1 01111001 00000000000000000000000
(b) -1/32 = -0.00001₂ = -1.0 × 2^-5 →
1 01111111010 0000000000000000000000000000000000000000000000000000
4. Consider a 32-bit microprocessor that has an on-chip 16 Kbytes four-way set associative cache.
Assume that the cache has a line size of four words (each word is 32 bits).
(a) Show the 32-bit physical address (Show how many tag bits, set bits, and offset bits).
(b) Where in the cache (by indicating the set number) is the double word from memory location
ABCDE8F8 mapped?
Answer: Number of sets = 16KB / (16B × 4) = 256
(a)
Tag bits Set bits Offset bits
20 8 4
(b) ABCDE8F8₁₆ = 1010 1011 1100 1101 1110 1000 1111 1000₂
Tag: 1010 1011 1100 1101 1110   Set: 1000 1111   Offset: 1000
Set number: 10001111₂ = 143₁₀
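A quick check of the field extraction (the masks follow the 20/8/4 tag/set/offset split above):

```python
# Sketch: map address 0xABCDE8F8 into the 16 KB 4-way cache (16 B lines, 256 sets).
addr = 0xABCDE8F8
offset = addr & 0xF              # 4 offset bits
set_no = (addr >> 4) & 0xFF      # 8 set bits
tag = addr >> 12                 # 20 tag bits
print(set_no, offset)            # 143 8
```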
5. A non-pipelined processor has a clock rate of 2.5 GHz and an average CPI (cycles per
instruction) of 4. An upgrade to this processor introduces a new processor with five-stage
pipeline. However, due to internal pipeline delays, such as latch delay, the clock rate of the new
processor has to be reduced to 2 GHz and an average CPI of 1.
(a) What is the speedup achieved for a typical program with 100 instructions?
(b) What is the MIPS rate for the two processors, respectively?
<Note>: MIPS = Million Instructions per Second.
Answer
(a) Execution time for non-pipelined processor = (100 × 4) / 2.5G = 160 ns
Execution time for pipelined processor = [(5 – 1) + 100] / 2G = 52 ns
Speedup = 160 ns / 52 ns = 3.08
(b) MIPS for non-pipelined processor = 100 / (160 ns × 10^6) = 625
MIPS for pipelined processor = 100 / (52 ns × 10^6) = 1923.08
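A sketch of the speedup and MIPS-rate arithmetic:

```python
# Sketch: non-pipelined (CPI 4 at 2.5 GHz) vs. five-stage pipeline (CPI 1 at 2 GHz).
n = 100                                 # instructions in the program
t_old = n * 4 / 2.5e9                   # 160 ns
t_new = ((5 - 1) + n) / 2e9             # pipeline fill (4 cycles) + 100 -> 52 ns
speedup = t_old / t_new
mips_old = n / t_old / 1e6              # 625 MIPS
mips_new = n / t_new / 1e6              # about 1923 MIPS
print(round(speedup, 2), round(mips_old), round(mips_new, 2))
```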
104 年中山資工
NOTE: If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
1. The following table shows the percentage of MIPS instructions executed by category for
average of SPEC2000 integer programs and SPEC2000 floating point programs.
                                                              Frequency
Instruction class    MIPS examples             Average CPI       Integer   Floating point
Arithmetic           add, sub, addi            1.0 clock cycles  24%       48%
Data transfer        lw, sw, lb, sb, lui       1.4 clock cycles  36%       39%
Logical              and, or, nor, andi, ori   1.0 clock cycles  18%       5%
Conditional branch   beq, bne, slt, slti       1.7 clock cycles  18%       6%
Jump                 j, jr, jal                1.2 clock cycles  4%        2%
(1) Using the average instruction mix information for the program SPEC2000fp, find the
percentage of all memory accesses (both data and instruction) that are for reads. Assume
that two-thirds of data transfers are loads.
(2) Compute the effective CPI for MIPS. Average the instruction frequencies for
SPEC2000int and SPEC2000fp to obtain the instruction mix.
(3) Consider an architecture that is similar to MIPS except that it supports update addressing
for data transfer instructions. If we run SPEC2000int using this architecture, some
percentage of the data transfer instructions will be able to make use of the new
instructions, and for each instruction changed, one arithmetic instruction can be
eliminated. If 25% of the data transfer instructions can be changed, which will be faster
for SPEC2000int, the modified MIPS architecture or the unmodified architecture? How
much faster? (Assume that both architectures have CPI values as given in the above table
and that the modified architecture has its cycle time increased by 20% in order to
accommodate the new instructions.)
Answer:
(1) (1 + 0.39 × 2/3) / 1.39 = 0.91
(2) Effective CPI = 1.0 × 0.36 + 1.4 × 0.375 + 1.0 × 0.115 + 1.7 × 0.12 + 1.2 × 0.03 = 1.24
(3) Modified architecture CPI = 1.0 × (0.24 – 0.36 × 0.25) + 1.4 × 0.36 + 1.0 × 0.18 + 1.7
× 0.18 + 1.2 × 0.04 = 1.188
Suppose instruction count = IC and clock cycle time for the MIPS architecture = T
Unmodified architecture execution time = IC × 1.24 × T
Modified architecture execution time = 0.91 × IC × 1.188 × 1.2T = IC × 1.30 × T
Unmodified architecture is 1.30 / 1.24 = 1.05 times faster than modified architecture
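The CPI and execution-time arithmetic can be reproduced with a sketch (dictionary keys are illustrative):

```python
# Sketch: effective CPI for the averaged mix, and the modified-architecture case.
cpi = {"arith": 1.0, "data": 1.4, "logic": 1.0, "branch": 1.7, "jump": 1.2}
mix = {"arith": 0.36, "data": 0.375, "logic": 0.115, "branch": 0.12, "jump": 0.03}
eff_cpi = sum(cpi[k] * mix[k] for k in cpi)              # about 1.24

int_mix = {"arith": 0.24, "data": 0.36, "logic": 0.18, "branch": 0.18, "jump": 0.04}
moved = 0.25 * int_mix["data"]                           # data transfers converted
mod_mix = dict(int_mix, arith=int_mix["arith"] - moved)  # one arith dropped each
mod_cpi = sum(cpi[k] * mod_mix[k] for k in cpi)          # about 1.188
new_ic = 1 - moved                                       # 9% fewer instructions
print(round(eff_cpi, 2), round(mod_cpi, 3), round(new_ic * mod_cpi * 1.2, 2))
# 1.24 1.188 1.3 -> the unmodified architecture is about 1.05x faster
```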
2. Consider two different implementations, I1 and I2, of the same instruction set. There are three
classes of instructions (A, B, and C) in the instruction set. I1 has a clock rate of 4 GHz, and I2
has a clock rate of 2 GHz. The following table shows the average number of cycles for each
instruction class on I1 and I2.
The table also contains a summary of average proportion of instruction classes generated by
three different compilers. C1 is a compiler produced by the makers of I1, C2 is produced by
the makers of I2, and C3 is a third-party product. Assume that each compiler uses the same
number of instructions for a given program but that the instruction mix is as described in the
table.
(1) Using C1 on both I1 and I2, how much faster can the makers of I1 claim I1 is compared to
I2?
(2) Which computer and compiler would you purchase if all other criteria are identical,
including cost?
Answer:
(1) For C1:
CPII1 = 2 × 0.4 + 3 × 0.4 + 5 × 0.2 = 3. Instruction TimeI1 = 3 / 4G = 0.75 ns
CPII2 = 1 × 0.4 + 2 × 0.4 + 2 × 0.2 = 1.6. Instruction TimeI2 = 1.6 / 2G = 0.8 ns
I1 is 0.8 ns / 0.75 ns = 1.067 times faster than I2.
(2) For C2:
CPII1 = 2 × 0.4 + 3 × 0.2 + 5 × 0.4 = 3.4. Instruction TimeI1 = 3.4 / 4G = 0.85 ns
CPII2 = 1 × 0.4 + 2 × 0.2 + 2 × 0.4 = 1.6. Instruction TimeI2 = 1.6 / 2G = 0.8 ns
For C3:
CPII1 = 2 × 0.5 + 3 × 0.25 + 5 × 0.25 = 3. Instruction TimeI1 = 3 / 4G = 0.75 ns
CPII2 = 1 × 0.5 + 2 × 0.25 + 2 × 0.25 = 1.5. Instruction TimeI2 = 1.5 / 2G = 0.75 ns
Compiler C3 has the lowest instruction time on both I1 and I2, so C3 should be purchased.
Average instruction time for I1 = (0.75 + 0.85 + 0.75) / 3 = 0.78 ns
Average instruction time for I2 = (0.8 + 0.8 + 0.75) / 3 = 0.78 ns
I1 and I2 have the same average instruction time over the three compilers, so either
one can be purchased.
3. You are going to enhance a computer, and there are two possible improvements: either make
multiply instructions run four times faster than before, or make memory access instructions run
two times faster than before. You repeatedly run a program that takes 100 seconds to execute.
Of this time, 20% is used for multiplication, 50% for memory access instructions, and 30% for
other tasks.
(1) What will the speedup be if both improvements are made?
(2) You are going to change the program described in Problem 3 so that the percentages are
not 20%, 50%, and 30% anymore. Assuming that none of the new percentages is 0, what
sort of program would result in a tie with regard to speedup (i.e., the same speedup)
between the two individual improvements? Provide both a formula and some examples.
Answer:
(1) Speedup = 100 / (20/4 + 50/2 + 30) = 100 / 60 = 1.67
(2) Let p be the fraction of time spent on multiplication and q the fraction spent on memory
access. The two individual improvements tie when the time saved is equal: (3/4)p = (1/2)q,
i.e., q = 1.5p. Example: 20% multiplication, 30% memory access, 50% other tasks.
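A quick check of the combined speedup, with the tie condition noted in a comment:

```python
# Sketch: both improvements applied to the 100-second program.
total = 100.0
new_time = 20 / 4 + 50 / 2 + 30      # 5 + 25 + 30 = 60 seconds
speedup = total / new_time
print(round(speedup, 2))             # 1.67
# The two individual improvements tie when (3/4)*p_mult equals (1/2)*p_mem,
# i.e., when the memory-access fraction is 1.5 times the multiplication fraction.
```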
4. Figure 1 gives the datapath of a pipelined processor with forwarding and hazard detection.
(1) We have a program of 1000 instructions in the format of "lw, add, lw, add,..." The add
instruction depends (and only depends) on the lw instruction right before it. The lw
instruction also depends (and only depends) on the add instruction right before it. If the
program is executed on the pipelined datapath of Figure 1, what would be the actual CPI?
(2) What would be the actual CPI for the program in Problem (1) without forwarding?
(3) Consider executing the following code on the pipelined datapath of Figure 1. How many
cycles will it take to execute this code?
lw $4, 100($2)
sub $6, $4, $3
add $2, $3, $6
add $7, $5, $2
(4) With regard to the code in Problem (3), explain what the forwarding unit is doing during
the sixth cycle of execution. If any comparisons are being made, mention them.
(Figure 1: the pipelined datapath with forwarding and hazard detection. The hazard detection
unit monitors ID/EX.MemRead and the IF/ID register fields (IF/ID.RegisterRs, IF/ID.RegisterRt)
and controls PCWrite and IF/IDWrite; the forwarding unit compares EX/MEM.RegisterRd and
MEM/WB.RegisterRd with ID/EX.RegisterRs and ID/EX.RegisterRt to steer the ALU inputs.)
Answer:
(1) Actual CPI = [(5 – 1) + 1000 + 500] / 1000 ≈ 1.5
(2) Actual CPI = [(5 – 1) + 1000 + 999 × 2] / 1000 ≈ 3
(3) Total clock cycles = (5 – 1) + 4 + 1 = 9
(4) Forwarding unit: EX/MEM.RegisterRd ($6, from sub) is compared with ID/EX.RegisterRs ($3)
and ID/EX.RegisterRt ($6); the match on $6 forwards the sub result to the ALU.
Hazard detection unit: ID/EX.RegisterRt ($6) is compared with IF/ID.RegisterRs ($5) and
IF/ID.RegisterRt ($2); no stall is needed because the instruction in EX is not a load.
註(4): The following table shows which instruction occupies which stage during the sixth cycle.
The stall after the lw leaves a bubble, so all the control signals of the instruction in the WB
stage have been cleared to 0.
Stage         ID               EX               MEM              WB
Instruction   add $7, $5, $2   add $2, $3, $6   sub $6, $4, $3   bubble (nop)
5. We have a program core consisting of several conditional branches. The program core will be
executed thousands of times. Below are the outcomes of each branch for one execution of the
program core (T for taken, N for not taken).
Branch 1: T-T-T-T, Branch 2: N-N-N-N-N, Branch 3: T-N-T-N-T-N, Branch 4: T-T-T-N-T
Assume the behavior of each branch remains the same for each program core execution. For
dynamic schemes, assume each branch has its own prediction buffer and each buffer initialized
to the same state before each execution. List the predictions and calculate the prediction
accuracy for the following branch prediction schemes:
(1) Always-taken
(2) 1-bit predictor (initialized to predict taken)
(3) 2-bit predictor as shown in Figure 2 (initialized to weakly predict taken)
Answer:
(1) Always-taken:
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-T-T-T-T, right: 0, wrong: 5
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Total: right: 11, wrong: 9, Accuracy = 100% × 11/20 = 55%
(2) 1-bit predictor:
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-N-N-N-N, right: 4, wrong: 1
Branch 3: prediction: T-T-N-T-N-T, right: 1, wrong: 5
Branch 4: prediction: T-T-T-T-N, right: 3, wrong: 2
Total: right: 12, wrong: 8, Accuracy = 100% × 12/20 = 60%
(3) 2-bit predictor:
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-N-N-N-N, right: 4, wrong: 1
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Total: right: 15, wrong: 5, Accuracy = 100% × 15/20 = 75%
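The three accuracy figures can be reproduced by simulating the predictors over the four branch streams (a sketch; function names are made up):

```python
# Sketch: always-taken, 1-bit, and 2-bit saturating predictors.
streams = ["TTTT", "NNNNN", "TNTNTN", "TTTNT"]
outcomes = [o for s in streams for o in s]        # 20 outcomes in total

def always_taken():
    return sum(o == "T" for o in outcomes)

def one_bit():
    right = 0
    for s in streams:
        state = "T"                                # initialized to predict taken
        for o in s:
            right += state == o
            state = o                              # remember the last outcome
    return right

def two_bit():
    right = 0
    for s in streams:
        state = 2                                  # 2 = weakly taken on a 0..3 scale
        for o in s:
            right += ("T" if state >= 2 else "N") == o
            state = min(state + 1, 3) if o == "T" else max(state - 1, 0)
    return right

print(always_taken(), one_bit(), two_bit())        # 11, 12, 15 correct out of 20
```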
6. The average memory access time (AMAT) is possibly useful as a figure of merit for different
cache systems.
(1) Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a
miss rate of 0.06 misses per reference, and a cache access time (including hit detection) of
1 clock cycle. Assume that the read and write miss penalties are the same and ignore other
write stalls.
(2) Suppose we can improve the miss rate to 0.04 misses per reference by doubling the cache
size. This causes the cache access time to increase to 1.2 clock cycles. Using the AMAT
as a metric, determine if this is a good trade-off.
(3) If the cache access time determines the processor's clock cycle time, which is often the
case, AMAT may not correctly indicate whether one cache organization is better than
another. If the processor's clock cycle time must be changed to match that of a cache, is
this a good trade-off? Assume the processors are identical except for the clock rate and the
number of cache miss cycles; assume 1.5 references per instruction and a CPI without
cache misses of 2. The miss penalty is 20 cycles for both processors.
Answer:
(1) AMAT = (1 + 0.06 × 20) × 2 ns = 4.4 ns
(2) AMAT = (1.2 + 0.04 × 20) × 2 ns = 4 ns
Yes, it's a good trade-off.
(3) Instruction time_old = (2 + 1.5 × 0.06 × 20) × 2 ns = 7.6 ns
Instruction time_new = (2 + 1.5 × 0.04 × 20) × 2.4 ns = 7.68 ns
So it's not a good trade-off.
7. Consider three processors with different cache configurations:
Cache 1: Direct-mapped with two-word blocks
Cache 2: Direct-mapped with four-word blocks
Cache 3: Two-way set associative with four-word blocks
The following miss rate measurements have been made:
Cache 1: Instruction miss rate is 3.75%; data miss rate is 5%
Cache 2: Instruction miss rate is 2%; data miss rate is 4%
Cache 3: Instruction miss rate is 2%; data miss rate is 3%
For these processors, one-half of the instructions contain a data reference. Assume that the
cache miss penalty is 6 + Block size in words. The CPI for this workload was measured on a
processor with Cache 1 and was found to be 2.0.
(1) Assuming a cache of 16K blocks and a 32-bit address, find the total number of sets and the
total number of tag bits for Cache 1, Cache 2, and Cache 3.
(2) Determine which processor spends the most cycles on cache misses.
(3) The cycle times for the processors in Problem 7.2 are 400 ps for the first and second
processors and 300 ps for the third processor. Determine which processor is the fastest and
which is the slowest.
Answer:
(1)
Cache   No. of sets   Tag bits per block    Total tag bits
1       16K           32 - 14 - 3 = 15      15 × 16K = 240 Kbits
2       16K           32 - 14 - 4 = 14      14 × 16K = 224 Kbits
3       8K            32 - 13 - 4 = 15      15 × 16K = 240 Kbits
(2)
Cache   Miss penalty   Stall cycles per instruction
1       6 + 2 = 8      0.0375 × 8 + 0.5 × 0.05 × 8 = 0.5
2       6 + 4 = 10     0.02 × 10 + 0.5 × 0.04 × 10 = 0.4
3       6 + 4 = 10     0.02 × 10 + 0.5 × 0.03 × 10 = 0.35
The processor with Cache 1 spends the most cycles on cache misses.
(3) The base CPI without stalls is 2.0 − 0.5 = 1.5, so the time per instruction is
(1.5 + 0.5) × 400 ps = 800 ps with Cache 1, (1.5 + 0.4) × 400 ps = 760 ps with Cache 2, and
(1.5 + 0.35) × 300 ps = 555 ps with Cache 3. The third processor is the fastest and the first
is the slowest.
104 (2015) National Chung Hsing University, Electrical Engineering
1. Consider two unsigned numbers A and B, with three bits shown as follows.
A = A2A1A0, B = B2B1B0
Let for i = 0, 1, and 2. Write the Boolean function for (A ≥ B)
Answer:
Boolean function for (A ≥ B) =
2. Describe three main technologies used in computer systems for I/O data transfer.
Answer:
Polling (programmed I/O): the processor periodically checks the status of an I/O device to
determine whether the device needs service.
Interrupt: I/O devices employ interrupts to indicate to the processor that they need attention.
DMA (direct memory access): a mechanism that provides a device controller with the ability to
transfer data directly to or from the memory without involving the processor
3. Two processors, i.e. CPU N and CPU K, have the same instruction set architecture. The two
processors execute the same program, which consists of 2500 instructions. CPU N has a clock
cycle time of 0.25ns, and CPI of 2.5 for the program. CPU K has a clock cycle time of 0.5ns,
and CPI of 1.5 for the same program.
(a) Determine the CPU time of CPU N and CPU K (in ns).
(b) Suppose the designer wants to change the clock rate of CPU N so that CPU N and
CPU K have the same CPU time for the program. Find the new clock rate of CPU N (in
GHz).
Answer:
(a) CPU time for CPU N = 2500 × 2.5 × 0.25 ns = 1562.5 ns
CPU time for CPU K = 2500 × 1.5 × 0.5 ns = 1875 ns
(b) Suppose the new clock rate for CPU N is x:
1875 ns = (2500 × 2.5) / x ⇒ x ≈ 3.3 GHz
4. The following C codes are compiled into the corresponding MIPS assembly codes. Assume
that i and k correspond to registers $s3 and $s5, and the base of the array save is in $s6.
C codes:
while (save[i] == k)
{
i += 1;
}
MIPS assembly codes:
Loop: sll $t1, $s3, 2
add $t1, $t1, OP1
OP3 $t0, 0(OP2)
bne $t0, $s5, Exit
addi $s3, $s3, 1
j Loop
Exit:
Please determine the proper values for operands (OP1, OP2), and the proper instruction for the
operator (OP3). Copy the following table (Table 1) to your answer sheet and fill in the two
operand values and the one instruction.
Table 1
Operand/Operator Value / Instruction
OP1
OP2
OP3
Answer:
Operand/Operator Value / Instruction
OP1 $s6
OP2 $t1
OP3 lw
5. The following MIPS assembly codes are compiled from the corresponding C codes. Assume
that base addresses for arrays x and y are found in $a0 and $a1, while i is in $s0. strcpy adjusts
the stack pointer and then saves the saved register $s0 on the stack.
MIPS assembly codes:
strcpy:
addi $sp, $sp, -4
sw $s0, 0($sp)
add $s0, $zero, $zero
L1: add $t1, $s0, $a1
lb $t2, 0($t1)
add $t3, $s0, $a0
sb $t2, 0($t3)
beq $t2, $zero, L2
addi $s0, $s0,1
j L1
L2: lw $s0, 0($sp)
addi $sp, $sp, 4
jr $ra
C codes:
void strcpy (char x[ ], char y[ ])
{
int i;
i = 0;
while (Code l)
Code 2;
}
Please determine the proper C codes for instructions (Code 1, Code 2). Copy the following
table (Table 2) to your answer sheet and fill in the two C code instructions.
Table 2
C code instructions
Code 1
Code 2
Answer:
C code instructions
Code 1 (x[i] = y[i]) != '\0'
Code 2 i += 1;
6. The basic single-cycle MIPS implementation in Figure 1 can only implement some
instructions. New instructions can be added to an existing Instruction Set Architecture (ISA),
but the decision whether or not to do that depends, among other things, on the cost and
complexity the proposed addition introduces into the processor datapath and control. Consider
the following new added instruction.
Instruction: LWI Rt, Rd(Rs)
Interpretation: Reg[Rt] = Mem[Reg[Rd] + Reg[Rs]]
(a) Which existing blocks (if any) can be used for this instruction?
(b) Which new functional blocks (if any) do we need for this instruction?
(c) What new signals do we need (if any) from the control unit to support this instruction?
Figure 1
Answer:
(a) This instruction uses instruction memory, both register read ports, the ALU to add Rd and
Rs together, data memory, and write port in Registers.
(b) None.
(c) None.
7. Consider three branch prediction schemes: predict not taken, predict taken, and dynamic
prediction. Assume that they have zero penalty when they predict correctly and two cycles
when they are wrong. Assume that the average predict accuracy of the dynamic predictor is
90%. Which predictor is the best choice for the following branches?
(1) A branch that is taken with 5% frequency
(2) A branch that is taken with 95% frequency
(3) A branch that is taken with 70% frequency
Answer:
Prediction accuracy of each scheme:
      Predict not taken   Predict taken   Dynamic prediction   Best choice
(1)   0.95                0.05            0.9                  Predict not taken
(2)   0.05                0.95            0.9                  Predict taken
(3)   0.3                 0.7             0.9                  Dynamic prediction
8. We examine how data dependences affect execution in the basic 5-stage pipeline. Consider the
following sequence of instructions:
or r1, r2, r3
or r2, r1, r4
or r1, r1, r2
Also, assume the following cycle times for each of the options related to forwarding:
(c) With forwarding, no stalls are needed:
or r1, r2, r3
or r2, r1, r4    ; RAW hazard on r1 from I1, resolved by forwarding
or r1, r1, r2    ; RAW hazard on r2 from I2, resolved by forwarding
(e) When the RAW hazards into I3 cannot use ALU-ALU forwarding, two nops are required:
or r1, r2, r3
or r2, r1, r4
nop
nop
or r1, r1, r2
104 (2015) National Chung Hsing University, Computer Science
Note (b): a 2-way set-associative cache means each memory block can be placed in exactly
two locations in the cache.
2. Machine A has a clock rate of 1 GHz and a CPI of 2.2 for some program, and machine B has a
clock rate of 500MHz and a CPI of 1.2 for the same program. Which machine is faster for this
program, and by how much?
Answer:
Time per instruction on Machine A = 2.2 / 1 GHz = 2.2 ns
Time per instruction on Machine B = 1.2 / 500 MHz = 2.4 ns
Machine A is (2.4 ns / 2.2 ns) ≈ 1.09 times faster than Machine B.
104 (2015) National Taiwan University of Science and Technology, Electronic Engineering
2. Pipeline
(a) Please explain how a MIPS processor can exploit instruction-level parallelism (ILP).
(b) Please draw the datapath of the MIPS processor with the control unit.
Answer:
(a) The MIPS pipeline partitions instruction execution into balanced stages and overlaps the
execution of consecutive instructions, thereby exploiting the ILP among them.
(b)
3. Assume that the miss rate of an instruction cache is 3% and the miss rate of the data cache is
5%. If a processor has a CPI of 3 without any memory stalls and the miss penalty is 100 cycles
for all misses, how much faster can the processor run with a perfect cache that never misses?
Assume the frequency of all loads and stores is 30%.
Answer:
CPI_effective = 3 + 1 × 0.03 × 100 + 0.3 × 0.05 × 100 = 7.5
The processor with a perfect cache is 7.5 / 3 = 2.5 times faster than the one without.
4. Please explain branch prediction buffer and use 2-bit prediction scheme as an example.
Answer:
Branch prediction buffer is a small memory that is indexed by the lower portion of the address
of the branch instruction and that contains one or more bits indicating whether the branch was
recently taken or not. Using 2-bit prediction scheme as an example, each entry of the branch
prediction buffer contains 2-bit information. Predict not taken when these two bits are 00 or 01
and taken when 10 or 11.
5. Virtual Memory
(a) Please explain TLB, cache, and page table.
(b) How are the TLB, cache, and page table integrated?
Answer:
(a) TLB: A cache that keeps track of recently used address mappings to avoid an access to the
page table.
Cache: a fast memory between CPU and main memory.
Page table: the table contains the virtual to physical address translations in a virtual memory
system.
(b) The following diagram shows the integration of TLB, cache, and page table.
104 (2015) National Taiwan University of Science and Technology, Computer Science
1. The following figure shows the pipelined datapath with control signals. For the instruction
sequence below, please answer the following questions.
add $1, $2, $3
and $1, $1, $4
lw $5, 4($1)
sub $7, $5, $6
or $8, $5, $7
[Figure: pipelined datapath with a hazard detection unit, control unit, IF/ID, ID/EX, EX/MEM,
and MEM/WB pipeline registers, and a forwarding unit; multiplexor select signals A, B, C,
control lines D, E, F, and lines X, Y, Z are marked on the figure.]
(1) Please give the control signals of multiplexors A, B, C and control lines D, E, F for clock
cycle 5. Note that the answer for each signal should be one of the following: 0, 1, 00, 01, 10,
or 11.
A B C D E F
Control Signal
(2) For clock cycle 5, what are sent through lines X, Y, and Z? Please write down the
corresponding register numbers ($1, $2 ... $8).
X Y Z
Register Number
(3) How many cycles does it take to finish the execution of this instruction sequence?
Answer:
(1)
A B C D E F
Control Signal 0 0 10 1 1 1
(2)
X Y Z
Register Number $5/$6 $5 $1
(3) The only stall is the load-use hazard between lw $5, 4($1) and sub, so the sequence takes
(5 − 1) + 5 + 1 = 10 clock cycles.
3. Consider the following performance measurements for a program:
Measurement Computer A Computer B Computer C
Instruction count 10 billion 10 billion 9 billion
Clock rate 2 GHz 4 GHz 2 GHz
CPI 1.1 2.4 1.2
For each of the following statements, decide TRUE or FALSE.
(1) Computer B is faster than Computer A for this program.
(2) Computer C is faster than Computer A for this program.
(3) Computer C has higher MIPS (million instructions per second) rating than Computer A for
this program.
(4) Suppose the program spends 10% of time on addition and 60% of time on multiplication.
This program can run 2 times faster by improving only the speed of addition.
(5) Suppose the program spends 10% of time on addition and 60% of time on multiplication.
This program can run 2 times faster by improving only the speed of multiplication.
Answer:
(1) (2) (3) (4) (5)
False True False False True
Note (1, 2): ExeTime_A = (10 × 10^9 × 1.1) / (2 × 10^9) = 5.5 s, ExeTime_B = (10 × 10^9 × 2.4) / (4 × 10^9) = 6 s,
ExeTime_C = (9 × 10^9 × 1.2) / (2 × 10^9) = 5.4 s
Note (4, 5): speeding up only addition (10% of time) caps the speedup at 1 / 0.9 ≈ 1.11 < 2,
while a 6× faster multiplier gives 1 / (0.4 + 0.6/6) = 2.
104 (2015) National Taiwan Normal University, Computer Science
2. Consider a CPU with clock cycle time 0.5 nanosecond (or 0.5 ns). Suppose the CPU executes a
program with 1000 instructions. The average CPI (clock cycles per instruction) is 2.0 for the
program.
(a) Find the clock rate of the CPU (in gigahertz, or GHz).
(b) Find the CPU time for executing the program (in ns).
Answer:
(a) 1 / (0.5 ns) = 2 GHz
(b) The CPU time = 1000 × 2 × 0.5 ns = 1000 ns
3. Consider a MIPS processor with separate instruction and data memories. Suppose the
following code sequence is executed on the processor.
LW R4, 24(R2); R4 ← MEM[R2 + 24]
SUB R5, R1, R4; R5 ← R1 - R4
LW R12, 32(R2); R12 ← MEM[R2 + 32]
ADD R8, R12, R4; R8 ← R12 + R4
ADD R10, R2, R8; R10 ← R2 + R8
LW R6, 8(R1); R6 ← MEM[R1 + 8]
LW R11, 16(R1); R11 ← MEM[R1 + 16]
SUB R9, R11, R6; R9 ← R11 - R6
(a) Determine the number of accesses to the instruction memory.
(b) Determine the number of accesses to the data memory.
(c) Suppose there is only one miss in the instruction memory for the code sequence. Compute
the miss rate of the instruction memory.
(d) Suppose there are two misses in the data memory for the code sequence. Compute the
miss rate of the data memory.
Answer:
(a) The number of accesses to the instruction memory is 8.
(b) The number of accesses to the data memory is 4.
(c) Miss rate of the instruction memory is 1/8 = 0.125.
(d) Miss rate of the data memory is 2/4 = 0.5.
4. Consider a five-stage (IF, ID, EX, MEM and WB) MIPS pipeline processor with hazard
detection and data forwarding units. Assume the processor includes separate instruction and
data memories so that the structural hazard for memory references can be avoided.
(a) Suppose the following code sequence is executed on the processor. Identify all the data
hazards which can be solved by forwarding.
ADD R5, R7, R12; R5 ← R7 + R12
ADD R4, R5, R6; R4 ← R5 + R6
LW R8, 12(R4); R8 ← MEM[12 + R4]
(b) Repeat part (a) for the following code sequence.
LW R9, 20(R7); R9 ← MEM[R7 + 20]
ADD R1, R9, R5; R1 ← R9 + R5
SW R1, 12(R6); MEM[R6 + 12] ← R1
(c) Suppose the following code sequence is executed on the processor. Determine the total
number of clocks needed to execute the code sequence.
ADD R6, R2, R3; R6 ← R2 + R3
SUB R8, R5, R6; R8 ← R5 - R6
Answer:
(a) (ADD, ADD) for register R5; (ADD, LW) for register R4
(b) (ADD, SW) for register R1
(c) (5 – 1) + 2 = 6 clock cycles
104 (2015) National Chi Nan University, Computer Science
2. Assume that executing program A on single CPU needs 100 cycles, and parallelizing program
A has fixed 10 cycle overheads. When parallelizing program A to be executed on 10 CPUs,
what fraction of program A must be parallelizable (i.e. x% of A) so that we get a speedup
of 5 with 10 CPUs?
Answer: Let T be the number of cycles of program A that can be parallelized. Then
10 + (100 − T) + T/10 = 100/5 = 20 ⇒ 110 − 0.9T = 20 ⇒ T = 100,
so 100% of program A must be parallelizable.
104 (2015) National University of Kaohsiung, Computer Science
Multiple-Choice Questions
1. Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all
the miss handling. Suppose the miss rate per instruction at the primary cache is 1%. What
will the new CPI be if we add a secondary cache that has a 5 ns access time for either a hit
or a miss and is large enough to reduce the miss rate to main memory to 0.4%?
(a) 1.2
(b) 2.6
(c) 2.8
(d) 6.5
Answer: (c)
Note: CPI = 1 + 0.01 × 20 + 0.004 × 400 = 2.8
4. Assume that individual stages of the datapath have the following latencies:
IF ID EX MEM WB
250ps 350ps 150ps 300ps 200ps
Fill-in-the-Blank Questions
6. ____ switches threads only on costly stalls, such as a level-2 cache miss.
7. ____ is caused by the first access to a block that has never been in the cache.
8. ____ is a structure that holds the destination program counters for branch instructions.
9. ____ is the principle stating that if a data location is referenced, data locations with nearby
addresses will tend to be referenced soon.
10. The section of a process containing temporary data such as function parameters, return
addresses, and local variables is called ____.
Answer:
6: Coarse-grained multithreading
7: Compulsory misses
8: Branch target buffer
9: Spatial locality
10: Activation record
Short-Answer Questions
11. The five stages of MIPS pipeline are IF (instruction fetch), ID (Instruction decode and register
read), EXE (Execute operation or calculate address), MEM (Access memory operand), and
WB (Write result back to register). Given the following code:
add $5, $2, $1
lw $3, 4($5)
lw $2, 0($2)
or $3, $5, $3
sw $3, 0($5)
Suppose the data hazards must be resolved by “stalling” the dependent instructions until the
needed operand is written back to the register file. We assume that when the needed operand is
written back to the register file, the dependent instruction can read the needed operand from
the register file in the same clock cycle. How many NOPs do you add, and where, to make
the code segment execute correctly? How many cycles do these instructions take to execute?
You must show how to get the answer.
Answer: 5 NOPs are required and the execution time = (5 − 1) + 5 + 5 = 14 clock cycles
add $5, $2, $1
NOP
NOP
lw $3, 4($5)
lw $2, 0($2)
NOP
or $3, $5, $3
NOP
NOP
sw $3, 0($5)
12. Assume that a two-way set-associative cache of 1K blocks, 1-word block size, and a 32-bit
address. How many total bits are required for the cache (including valid bits)? You must show
how to get the answer.
Answer:
Number of sets = 1K / 2 = 512, so the index is 9 bits; tag size = 32 − 9 − 2 = 21 bits
Total bits required for the cache = (1 + 21 + 32) × 1K = 54 Kbits
13. Consider a computer running a program that requires 300 s, with 100 s spent executing FP
instructions, 75 s spent executing load/store instructions, and 40 s spent executing branch
instructions. What will the speedup be if the times for FP and branch instructions are reduced
by 30% and 50%, respectively? You must show how to get the answer.
Answer:
Speedup = 300 / (300 − 100 × 0.3 − 40 × 0.5) = 300 / 250 = 1.2