
目錄 (Table of Contents)

104 年台大電機
104 年台大資工
104 年台聯大電機
104 清大資工
104 年交大資聯
104 年成大電機
104 年成大資聯
104 成大電通
104 年中央資工
104 年中正電機
104 年中正資工
104 年中山電機
104 年中山資工
104 年中興電機
104 年中興資工
104 年台科大電子
104 年台科大資工
104 年台師大資工
104 年暨南資工
104 年高大資工

104 年台大電機

1. The following is the definition of some MIPS assembly instructions.


Instruction          Example              Meaning                     Comments
Add registers        add $s1, $s2, $s3    $s1 = $s2 + $s3             $s0-$s7 are registers
Subtract registers   sub $t1, $t2, $t3    $t1 = $t2 - $t3             $t0-$t9 are also registers
Branch on equality   beq $v1, $v0, L      Branch to L if $v1 = $v0    $v0-$v1 are also registers;
                                                                      L is an instruction label
Add constant         addi $t1, $s1, 30    $t1 = $s1 + 30

We have the following MIPS assembly program.


L1: sub $t0, $t2, $t2
L2: addi $t1, $t0, 3
L3: add $t2, $t0, $t0
L4: beq $a1, $t0, L8
L5: add $t2, $t2, $a1
L6: sub $a1, $a1, $t1
L7: beq $a1, $a1, L4
L8: addi $t0, $t0, 10
L9: add $t1, $t1, $t1
Which of the following statements are true of the program?
(a) The program stops only when $t2 starts with a multiple of three.
(b) The program stops only when $t1 starts with a multiple of three.
(c) When the program stops, $t0 contains 13.
(d) When the program stops, $t1 contains 6.
(e) When the program stops, $a1 contains zero.
Answer: (d), (e)
Note (a)(b): The program stops only when $a1 starts with a multiple of three.
Note (c): When the program stops, $t0 contains 10.
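The claimed final register values can be checked with a small Python simulation of the program (a sketch; the function and its argument, the initial value of $a1, are ours):

```python
def run(a1):
    t0 = 0           # L1: sub $t0, $t2, $t2  -> always 0
    t1 = t0 + 3      # L2: addi $t1, $t0, 3   -> 3
    t2 = t0 + t0     # L3: add $t2, $t0, $t0  -> 0
    while a1 != t0:  # L4: beq $a1, $t0, L8
        t2 += a1     # L5: add $t2, $t2, $a1
        a1 -= t1     # L6: sub $a1, $a1, $t1
        # L7: beq $a1, $a1, L4 is an unconditional branch back to L4
        if a1 < 0:
            raise RuntimeError("loop never terminates")
    t0 += 10         # L8: addi $t0, $t0, 10
    t1 += t1         # L9: add $t1, $t1, $t1
    return t0, t1, a1

print(run(9))  # (10, 6, 0): $t0 = 10, $t1 = 6, $a1 = 0
```

Any initial $a1 that is a nonnegative multiple of three terminates with $t0 = 10, $t1 = 6, and $a1 = 0, confirming (d) and (e) and ruling out (c).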

2. Which of the following statements are true of 32-bit arithmetic?


(a) Hexadecimal FFFFF33A in 2's complement is -3260 in decimal.
(b) Hexadecimal 00000076 in 1's complement is 118 in decimal.
(c) Hexadecimals FFFFF420 + 00000039 in 2's complement addition is -3063 in decimal.
(d) Hexadecimals FFFFF420 × 00000039 in 2's complement multiplication is -115520 in
decimal.
(e) Hexadecimals FFFFF420 / 00000039 in 2's complement division is 77 in decimal.
Answer: (b)
Note (a): Hexadecimal FFFFF33A in 2's complement is -3270 in decimal.
Note (c)(d)(e): Hexadecimals FFFFF420 and 00000039 are -3040 and 57 in decimal,
respectively, so the sum is -2983, the product is -173280, and the quotient is about -53.
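These decimal values can be verified with a small helper (a Python sketch; the function name `twos` is ours):

```python
def twos(hexstr, bits=32):
    """Interpret a hex string as a two's-complement integer of the given width."""
    v = int(hexstr, 16)
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

print(twos("FFFFF33A"))                     # -3270   (option (a) claims -3260)
print(twos("FFFFF420"), twos("00000039"))   # -3040 57
print(twos("FFFFF420") + twos("00000039"))  # -2983   (option (c) claims -3063)
print(twos("FFFFF420") * twos("00000039"))  # -173280 (option (d) claims -115520)
```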

3. Assume that we have three classes of instructions with the following CPI values.
Class A: CPI = 1, Class B: CPI = 2, Class C: CPI = 4
Now we have two programs, X and Y, fulfilling the same function. The counts of instructions
in each class resulted respectively from the execution of X and Y are as follows.

Programs Class A Class B Class C


X 100 200 100
Y 150 230 50

Assuming that all instructions are executed without temporal overlapping, which of the
following statements are true?
(a) Program X has 600 instructions.
(b) Program Y has 430 instructions.
(c) CPI of program X is 2.5.
(d) CPI of program Y is 1.98.
(e) The execution time of program X is 0.81 of the execution time of program Y.
Answer: (b)
Note (c): CPI of program X = 0.25 × 1 + 0.5 × 2 + 0.25 × 4 = 2.25
Note (d): CPI of program Y = (150/430) × 1 + (230/430) × 2 + (50/430) × 4 ≈ 1.88
Note (e): execution time of X / execution time of Y = (400 × 2.25) / (430 × 1.88)
= 900 cycles / 810 cycles ≈ 1.11, so X is actually slower than Y, not faster.
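The instruction counts, CPIs, and cycle totals can be checked mechanically (a Python sketch; the dictionaries mirror the table above):

```python
cpi = {"A": 1, "B": 2, "C": 4}
X = {"A": 100, "B": 200, "C": 100}
Y = {"A": 150, "B": 230, "C": 50}

def total_cycles(mix):
    """Total clock cycles for a program given its per-class instruction counts."""
    return sum(cpi[c] * n for c, n in mix.items())

print(sum(X.values()), sum(Y.values()))             # 400 430
print(total_cycles(X) / sum(X.values()))            # 2.25
print(round(total_cycles(Y) / sum(Y.values()), 2))  # 1.88
print(round(total_cycles(X) / total_cycles(Y), 2))  # 1.11 -> X needs more cycles than Y
```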

4. Suppose that we have a memory system with 32-bit addresses and a 256 Kilobyte cache. The
size of cache line (block) is 64 bytes. Which of the following statements are true of the system?
(a) When the cache is 8-way associative, there are 4K cache lines.
(b) When the cache is 8-way associative, the index field is 8 bits long.
(c) When the cache is 32-way associative, the index field is 20 bits long.
(d) When the cache is direct-mapped, the tag field is 14 bits long.
(e) When the writing policy is ‘write-back’, dirty data are written to the main memory
only when the corresponding cache line is chosen as a victim by the replacement policy.
Answer: (a), (d), (e)
Note (a): Number of blocks = 256 KB / 64 B = 4K
Note (b): Number of sets = 4K / 8 = 512 ⇒ the index field is 9 bits, not 8
Note (c): Number of sets = 4K / 32 = 128 ⇒ the index field is 7 bits, not 20
Note (d): Tag = 32 − 12 (index) − 6 (offset) = 14 bits
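The field widths follow directly from the cache geometry; a small helper makes the arithmetic explicit (a sketch; the function name is ours):

```python
import math

def cache_fields(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag, index, offset) bit widths for a set-associative cache."""
    sets = cache_bytes // (block_bytes * ways)
    offset = int(math.log2(block_bytes))
    index = int(math.log2(sets))
    return addr_bits - index - offset, index, offset

print(cache_fields(256 * 1024, 64, 8))   # (17, 9, 6)  -> 9-bit index, not 8
print(cache_fields(256 * 1024, 64, 32))  # (19, 7, 6)  -> 7-bit index, not 20
print(cache_fields(256 * 1024, 64, 1))   # (14, 12, 6) -> 14-bit tag when direct-mapped
```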

104 年台大資工

1. ISA (INSTRUCTION SET ARCHITECTURE)


ISA Number of General registers Number of FP registers
Intel X86 (80386) 8 (eax to esp) 8 (ST0 to ST7)
Intel X86-64 16 32 ZMM registers
Intel Itanium 128 128
ARMv7 (VPFv2) 16 16
ARMv8 31 32
The above table shows the number of registers provided by Intel x86, Intel/HP Itanium, and
ARM ISA (Instruction Set Architecture).
(a) Please list one or more ISAs that provide no registers at all.
(b) Registers are usually used to hold temporaries. Modern compilers also use registers to
hold frequently used local variables, pointers, and so on. Ideally, the more the registers,
the more objects could be allocated to registers and memory references to those objects
could be saved. The trend of ISA has clearly shown that new architectures tend to have
more registers. So is it a good idea to take a quantum leap and go for 1024 registers in an
ISA? You must argue for your answer; just saying yes or no will get no credit.
(c) In each ISA's calling convention, the set of general-purpose registers is usually divided
into two subsets: caller-save and callee-save. Which type of procedure could benefit most
from such a register partition?
(d) Most ISAs have separate register files: general purpose registers, floating point registers,
and SIMD (e.g. Intel SSE/2MM and ARM Neon) registers. Why do most modern ISAs
provide different sets of registers? Would it be better to use just one set of general-purpose
registers? For example, we could have instructions like:
Add R1, R2, R3 /* integer add */
FADD R1, R2, R3 /* Floating point add */
PADD R1, R2, R3 /*SIMD Add*/
Although these are different types of instructions, they could use the same general-purpose
registers.
Answer:
(a) Stack machine ISAs (for example, the Java Virtual Machine) provide no general-purpose
registers at all; operands live on an operand stack.
(b) The trend toward more registers in new architectures is enabled by Moore's law. However, a
very large register file lengthens the register-access paths and may increase the clock cycle
time, slowing down the whole processor; 1024 registers would also require 10-bit register
fields, inflating every instruction encoding. So the designer must balance programs' craving
for more registers against the desire to keep the clock cycle fast; a quantum leap to 1024
registers is not a good idea.
(c) If the general-purpose registers were not divided into caller-save and callee-save subsets,
the caller would have to save every register holding a frequently used local variable around
each call, even when the callee touches almost no registers. Thus a callee procedure with few
or no local variables benefits most from such a register partition.
(d) Different instructions treat the value of a register in different ways. If an ISA did not
provide separate sets of registers, programmers (and compilers) would have to track what type
of data each register currently holds to prevent misuse of instructions, which is error-prone.
Separate register files also let each file have a width and number of ports suited to its own
datapath. So a single shared set of general-purpose registers would not be better.

2. DATA REPRESENTATION
On a MIPS machine (Fig. 1) running UNIX, we observed the following binary string stored in
memory location x.
0011 1111 0111 0000 0000 0000 0000 0000
This binary could mean many different things,
(a) If this is an integer number, what value is it?
(b) If this is a single precision floating point number, what value is it?
(c) If this is an instruction, what instruction is it?
(d) If this is a C string, what string is it?
(e) If this binary string was observed on your x86 desktop, what would be your answer for (d)?
(f) Suppose we have three variables a, b, and c. Give a case where (a + b) + c computes a
different value than a + (b + c) on a MIPS microprocessor.

Fig 1. MIPS Reference Data for Computer Architecture question 2.


Answer:
(a) 2^30 − 2^24 + 2^23 − 2^20 = 2^20 × (1024 − 16 + 8 − 1) = 1015 × 1024 × 1024 ≈ 1.06 × 10^9
(b) 1.111₂ × 2^−1 = 0.1111₂ = 0.9375₁₀
(c) lui $s0, 0 (the opcode field 001111 is lui)
(d) “?p” (on big-endian MIPS the bytes are 0x3F = ‘?’, 0x70 = ‘p’, then the 0x00 terminator)
(e) On a little-endian x86 machine the byte at address x is 0x00, so the C string is the empty
string “”.
(f) Suppose a = −1.5 × 10^38, b = 1.5 × 10^38, c = 1.0, and all are single-precision
floating-point numbers.
a + (b + c) = −1.5 × 10^38 + (1.5 × 10^38 + 1.0)
= −1.5 × 10^38 + 1.5 × 10^38 = 0.0 (the 1.0 is lost when added to 1.5 × 10^38)
(a + b) + c = (−1.5 × 10^38 + 1.5 × 10^38) + 1.0 = 0.0 + 1.0 = 1.0
∴ a + (b + c) ≠ (a + b) + c
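Python's struct module can reproduce parts (a), (b), and (f) (a sketch; the helper f32, which rounds a value to single precision, is ours):

```python
import struct

word = bytes.fromhex("3F700000")
print(struct.unpack(">i", word)[0])  # 1064304640, about 1.06e9   (part a)
print(struct.unpack(">f", word)[0])  # 0.9375                     (part b)

def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Part (f): single-precision addition is not associative.
a, b, c = f32(-1.5e38), f32(1.5e38), f32(1.0)
print(f32(f32(a + b) + c))  # 1.0
print(f32(a + f32(b + c)))  # 0.0  (the 1.0 is absorbed when added to 1.5e38)
```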

3. BRANCH PREDICTION
For pipelined processors, control hazards could significantly decrease the performance.
Dynamic branch prediction techniques have been successfully adopted in many modern
processors to reduce performance penalty caused by control hazards. However, unlike direct
branches (e.g. BEQ L1 in ARM or BEQ $t0, $tl, L1 in MIPS), indirect branches are usually
difficult to predict.
(a) Please give at least one static and one dynamic branch prediction scheme used for direct
branches in microprocessors.
(b) List the cases where indirect branches, instead of direct branches, are used. Which of
the listed cases is most frequent?
(c) Please explain why indirect branches are hard to predict.
Answer:
(a) Static branch prediction: always predict not taken
Dynamic branch prediction: two-bit branch prediction
(b) Case 1: an indirect branch is useful for multi-way branches; for instance, the MIPS jr
instruction is used to translate the C switch/case construct. Case 2: an indirect branch is used
for function returns; for instance, MIPS jr $ra is a function-return instruction. Case 2 is the
most frequent.
(c) The target address of an indirect branch is not known until run time, which is why
indirect branches are hard to predict.

4. Cache block size is an important design parameter for cache architecture. Assume a 1-CPI
(Cycle-per-Instruction) machine with an average of 1.4 memory references (both Instruction
and data) per instruction. Assume the CPU stalls for cache misses. Answer the following
questions using the cache miss rates for different block sizes listed in the following table,

Block Size (Bytes) 8 16 32


Miss rate 8% 4% 3%

(a) If the miss penalty is 24 + B (block size in bytes) cycles, what is the optimal cache block
size? Please show how you derive the answer.
(b) If critical-word-first is implemented in the cache, what is the optimal cache block size?
Please show how you derive the answer.
Answer:
(a) 16 bytes is the optimal block size:
Block Size (Bytes)         8      16     32
Miss rate                  8%     4%     3%
Miss penalty (24 + B)      32     40     56
Miss rate × Miss penalty   2.56   1.60   1.68
The 16-byte block minimizes miss rate × miss penalty (the 1.4 memory references per
instruction scale all three entries equally), so memory stall time is smallest at 16 bytes.

(b) In the critical-word-first scheme, the miss penalty is independent of block size, so the
32-byte block, which has the lowest miss rate, is the optimal block size.
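Part (a) reduces to minimizing miss rate × miss penalty over the three block sizes (a quick Python check; the 1.4 references per instruction multiply every candidate equally, so they do not affect the ranking):

```python
miss_rate = {8: 0.08, 16: 0.04, 32: 0.03}

# Miss penalty = 24 + B cycles, where B is the block size in bytes.
stalls = {b: round(r * (24 + b), 2) for b, r in miss_rate.items()}
print(stalls)                       # {8: 2.56, 16: 1.6, 32: 1.68}
print(min(stalls, key=stalls.get))  # 16
```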

5. You are given a task to parallelize the following problem in a multi-core architecture:
   for (i = 0; i < N; i = i + 1)
     for (j = 0; j < N; j = j + 1) {
       r = 0;
       for (k = 0; k < N; k = k + 1)
         r = r + y[i][k] * z[k][j];
       x[i][j] = r;
     }
(a) Is this a weak-scaling or strong-scaling problem? Please explain your answer.
(b) Is it possible to partition this problem among cores such that there are no cache coherency
misses? Please explain your answer.
Answer:
(a) This is a strong-scaling problem: every element of array x can be computed
concurrently, i.e., the problem is 100% parallelizable, so speedup can be obtained on a
fixed-size problem; there is no need to grow the problem size in proportion to the number
of processors.
(b) Yes. If there are N cores in the multi-core architecture and a cache block contains N
words, we can assign each core the computation of one row of elements in array x. Each
row then lies in its own memory block, so no block is cached by more than one core, and
hence there is no coherency traffic and there are no cache-coherency misses.

6. In current disk storages, the operating system generally completes all I/Os asynchronously via
interrupts. But recent studies show that for future NVM (Non-Volatile Memory) storage
system, which has significantly lower access latency than disks, the synchronous approach
(i.e., polling) could be more efficient than the interrupt approach. Please provide rationale
behind this.
Answer:
When polling a slow disk, the processor wastes a great deal of time: it may read the status
register many times only to find that the disk has not yet completed the I/O operation. With
a fast NVM device, by contrast, very few such fruitless status reads occur. If executing an
interrupt service routine (including saving and restoring processor state) takes longer than
the expected polling time, then polling the NVM device can be more efficient than the
interrupt approach.

104 年台聯大電機

1. Assume for arithmetic, load/store, and branch instructions, a processor has CPIs of 2, 4, and 2
respectively. Also assume that the instruction mix is 40%, 20% and 40% for the three kinds of
instructions respectively. Assume each processor has a 3 GHz clock frequency.
(1) Assume the clock frequency of this processor is proportional to its supply voltage. If we
add an extra processor to this system but reduce the supply voltage from 1.8V to 1.2V,
what is the performance improvement of this change? Assume only clock frequency is
affected by this change and the program can be perfectly parallelized on multiple
processors.
(2) In (1), if the CPI of the arithmetic instructions is doubled when the clock frequency is
slowed down, what is the performance improvement of this dual-core system?
(3) One way to evaluate the efficiency of a processor is using the power-delay product (PDP).
Smaller PDP means better efficiency. In this problem, we define the PDP as the product of
power and cycle time. Assume the power consumption of this processor is proportional to
the square of its supply voltage. Can we obtain efficiency improvement in terms of PDP in
(1)? Please briefly explain your reason.
Answer:
(1) Suppose the clock frequency after lowering the voltage is x:
3 GHz / 1.8 = x / 1.2 ⇒ x = 2 GHz
CPIold = 2 × 0.4 + 4 × 0.2 + 2 × 0.4 = 2.4
Average instruction time on the old processor = 2.4 / 3 GHz = 0.8 ns
Average instruction time on each new core = 2.4 / 2 GHz = 1.2 ns; with two perfectly
parallel cores, the effective time is 1.2 / 2 = 0.6 ns
Speedup = 0.8 / 0.6 = 1.33 ⇒ the dual-core system is 1.33 times faster than the original.
(2) CPInew = 4 × 0.4 + 4 × 0.2 + 2 × 0.4 = 3.2
Average instruction time on each new core = 3.2 / 2 GHz = 1.6 ns; with two cores, 0.8 ns
Speedup = 0.8 / 0.8 = 1 ⇒ no performance improvement for the dual-core system.
(3) With PDP = power × cycle time and power ∝ (number of processors) × V²:
PDPold ∝ 1.8² × (1/3 ns) = 1.08
PDPnew ∝ 2 × 1.2² × (1/2 ns) = 1.44
PDPold / PDPnew = 1.08 / 1.44 = 0.75
⇒ the PDP grows, so in terms of PDP the dual-core system is less efficient than the original.

2. Consider the following MIPS code sequence:
LOOP: slt $t2, $0, $t1
beq $t2, $0, DONE
subi $t1, $t1, 1
addi $s2, $s2, 2
j LOOP
DONE:
(1) Assume that the register $t1 is initialized to the value of 10. What is the value in register
$s2 assuming $s2 is initially zero?
(2) Assume that the register $t1 is initialized to the value of N. How many instructions are
executed?
Answer:
(1) $s2 = 20
(2) 5N + 2
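Both parts can be confirmed with a direct simulation of the loop (a Python sketch; the variable names follow the MIPS registers):

```python
def run(t1):
    s2, executed = 0, 0
    while True:
        executed += 2       # slt, beq
        if not (0 < t1):    # the beq falls through to DONE
            break
        t1 -= 1             # subi
        s2 += 2             # addi
        executed += 3       # subi, addi, j
    return s2, executed

print(run(10))  # (20, 52), matching $s2 = 20 and 5 * 10 + 2 = 52 instructions
```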

3. MIPS instructions:
(1) In MIPS, there are instructions lb, lbu, and sb for load byte, load byte unsigned, and store
byte, respectively; however, there is no the following instruction sbu. Why not?
(2) A MIPS branch instruction performs a modification of PC + 4 if the condition is true.
Suppose that the maximum range of the jump in MIPS is PC – A to PC + B, where both A
and B are positive numbers. What are A and B?
Answer:
(1) This is unnecessary: a single byte is written to memory, so there is no sign extension to
32 bits (as there is when reading a byte into a 32-bit register), and sb therefore handles both
signed and unsigned bytes.
(2) PC + 4 − 2^17 = PC − A ⇒ A = 2^17 − 4
PC + 4 + 2^17 − 4 = PC + B ⇒ B = 2^17

4. IEEE 754-2008, the IEEE standard for floating-point (FP) arithmetic, contains a half precision
that is only 16 bits wide, in which the left most bit is the sign bit followed by the 5-bit
exponent with a bias of 15 and the 10-bit mantissa. A hidden 1 is assumed.
(1) Explain why the biased exponent is generally applied to the FP representation?
(2) Please write down the bit pattern to represent 1.5625 × 10^−1 using IEEE 754-2008.
Comment on how the range and accuracy of this 16-bit FP format compare to the single
precision IEEE 754 standard.
Answer:
(1) Exponents have to be signed values in order to represent both tiny and huge magnitudes,
but a two's-complement exponent would make the comparison of floating-point numbers
harder. Biasing the exponent puts it within an unsigned range, so floating-point numbers
can be compared essentially as unsigned integers.
(2) 1.5625 × 10^−1 = 0.15625₁₀ = 0.00101₂ = 1.01₂ × 2^−3 ⇒ 0 01100 0100000000
Range: single precision reaches about 3.4 × 10^38, while half precision only reaches about
6.5 × 10^4, so single precision has a vastly larger range.
Accuracy: single precision has 24 significand bits (1 hidden + 23) versus 11 (1 hidden + 10)
for half precision, so it offers far more precision.
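Python's struct module supports the IEEE 754-2008 half-precision format directly (format code 'e'), so the bit pattern and range claims can be spot-checked:

```python
import struct

print(struct.pack(">e", 1.5625e-1).hex())             # '3100' = 0 01100 0100000000
print(struct.unpack(">e", bytes.fromhex("7bff"))[0])  # 65504.0, the largest finite half value
```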

5. Suppose that you have a computer that, on average, exhibits the following characteristics on
the programs you run:
Type        Distribution   IF    ID    EX    MEM   WB
Load        25%            2ns   1ns   3ns   3ns   1ns
Store       10%            2ns   1ns   3ns   2ns   X
Arithmetic  45%            2ns   1ns   3ns   X     1ns
Branch      20%            2ns   1ns   3ns   X     X
i. IF: instruction fetch; ID: instruction decode; EX: ALU execution; MEM: data
memory access; WB: write back
ii. X denotes that the corresponding stage is not needed
(1) If your computer is implemented as a single-cycle processor, what is its throughput
(measured by “instructions per second”)?
(2) If your computer is implemented as a multi-cycle processor, in which each stage is
executed in one cycle, what is its throughput (measured by “instructions per second”)?
(3) If your computer is implemented as a 5-stage pipelined processor, what is its idealized
throughput, assuming that the there are no hazards between instructions?
Answer:
(1) Every instruction takes the worst-case time = 2 + 1 + 3 + 3 + 1 = 10 ns
Throughput = 1 / 10 ns = 10^8 instructions per second.
(2) CPI = 5 × 0.25 + 4 × 0.1 + 4 × 0.45 + 3 × 0.2 = 4.05; cycle time = 3 ns (longest stage)
Average instruction time = 4.05 × 3 ns = 12.15 ns
Throughput = 1 / 12.15 ns ≈ 8.2 × 10^7 instructions per second.
(3) One instruction completes every cycle: average instruction time = 1 × 3 ns = 3 ns
Throughput = 1 / 3 ns ≈ 3.33 × 10^8 instructions per second.

6. A RISC processor with 18-stage pipeline runs a program P having 6,114 instructions.
Branches comprise 23% of the instructions, and the "branch not taken" assumption holds for
static branch prediction. Further assume that 40% of the branches are predicted correctly, and
there is an average penalty of 1.7 cycles for each mispredicted branch. Additionally, 2% of the
total instructions incur an average of 1.3 stalls each. Please calculate the CPI of P on this
pipeline. SHOW ALL WORK TO GET FULL CREDIT IF YOUR ANSWER IS CORRECT;
PARTIAL CREDIT IF NOT.
Answer:
Total clock cycles to execute program P = (18 − 1) + 6114 + 6114 × 0.23 × 0.6 × 1.7 +
6114 × 0.02 × 1.3 ≈ 7724
CPI = 7724 / 6114 ≈ 1.26
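The cycle count can be reproduced directly (a Python sketch of the same arithmetic):

```python
n = 6114
cycles = (
    (18 - 1)                       # filling the 18-stage pipeline
    + n                            # one cycle per instruction thereafter
    + n * 0.23 * (1 - 0.40) * 1.7  # penalty for the 60% of branches mispredicted
    + n * 0.02 * 1.3               # other stalls
)
print(round(cycles))         # 7724
print(round(cycles / n, 2))  # 1.26
```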

7. Consider a lite version of MIPS (Lite-MIPS) in which the immediate fields in lw and sw
instructions must be zero. Thus, lw $t0, 0($sp) is legal in Lite-MIPS, but lw $t0, 4($sp) is not.
Lite-MIPS can be implemented with a 4-stage pipeline: IF, ID, EM, WB where the EM stage
performs the EX and MEM tasks for the normal 5-stage pipeline in parallel. (Note that no
instruction in Lite-MIPS uses both EX and MEM stages.)
(1) Consider the instruction sequence:
lw $t0, 0($a0)
sw $t0, 0($a0)
Fill in the pipeline diagram below to indicate stalls (if any, please mark it with “*”) to
resolve data hazards in the above sequence on the 4-stage pipeline with full forwarding.

              Clock cycle
Instruction   1   2   3   4   5   6   7   8
lw
sw

(2) If the sw instruction was replaced with a conditional branch instruction, under what
circumstances would a stall be necessary between the two instructions?
(3) If the sw instruction was replaced with addi $t0, $t0, 1 and there was no forwarding
implemented, how many stalls would be necessary?
(4) If conditional branch targets are computed in the ID stage, and conditional branch
decisions are resolved in the EM stage, what would be the best prediction strategy (never
predict, always predict taken, always predict not-taken) for this datapath? Draw pipeline
diagrams to support your answer.
Answer:
(1) The data hazard between lw and sw can be solved by the forwarding path from WB to EM
so there is no need for the pipeline to stall.
              Clock cycle
Instruction   1    2    3    4    5
lw            IF   ID   EM   WB
sw                 IF   ID   EM   WB
(2) A stall would be necessary if the branch used the $t0 register and the branch decision was
resolved in the ID stage.

(3) 1 clock stall is needed.
              Clock cycle
Instruction   1    2    3    4    5    6
lw            IF   ID   EM   WB
addi               IF   ID   ID   EM   WB
(the repeated ID in cycle 4 is the stall)
(4) It is always better to predict not-taken than to not predict at all, since there are no cycles lost
when the prediction is correct. However, there is no clear “winner” between predict taken
and predict not-taken as shown in the following pipeline diagrams:
Predict not-taken: wrong (2 flushes)
Instruction   1    2    3    4    5    6
beq           IF   ID   EM   WB
beq + 4            IF   ID
beq + 8                 IF
target                       IF
Predict taken: wrong (1 flush)
Instruction   1    2    3    4    5    6
beq           IF   ID   EM   WB
beq + 4            IF
target                  IF
If the branch is predicted taken, then only 1 flush is needed in the worst case (the incorrect
instruction can be flushed once the branch decision is known at the end of the EM stage).
Thus if branches are often taken, a predict-taken strategy costs one flush less than a
predict-not-taken strategy (2 flushes, as shown above). However, if branches are often
not taken, then it is better to predict not-taken (0 flushes) instead of taken (1 flush).

8. Given a four-way set associative cache of 1024 blocks, a 16-byte block size, and a 32-bit
address. Assume the access time of this cache is 1 ns including all delay, and the average
access time of the main memory is 5 ns per byte, including all the miss handling.
(1) What are the numbers of bits for the tag, and index in this cache?
(2) Below is a list of 32-bit memory address references, given as word addresses. Please
identify whether each reference is a hit or miss, assuming the cache is initially empty.
 3, 188, 43, 1026, 191, 1064, 2, 40
(3) Suppose the miss rate at the primary cache is 4%. How much reduction can we obtain on
the AMAT if we add a secondary cache that has a 5 ns access time and is large enough to
reduce the miss rate to 1%?
(4) Suppose the space of this memory system is extended from 32-bit to 36-bit by using the
virtual memory technique, in which each page has 8KB. What are the numbers of bits for
the virtual page number and the page offset?
(5) Assume the TLB is put into the primary cache, and the page table is put into the main
memory. What is the AMAT if TLB and cache are both hit? What is the AMAT if TLB is
a miss without page fault, assuming the cache is still hit?
Answer:
(1) Number of sets = 1024 / 4 = 256 ⇒ the index field is 8 bits; the block offset is
log2 16 = 4 bits
Length of the tag field = 32 − 8 − 4 = 20 bits
(2) The block size is 16 bytes = 4 words, so block address = word address / 4 (truncated):
Word address   Block address   Tag   Index   Hit/Miss
3              0               0     0       Miss
188            47              0     47      Miss
43             10              0     10      Miss
1026           256             1     0       Miss
191            47              0     47      Hit
1064           266             1     10      Miss
2              0               0     0       Hit
40             10              0     10      Hit
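The hit/miss column can be reproduced with a short LRU cache simulation (a Python sketch; the function name and parameters are ours):

```python
from collections import deque

def simulate(word_addrs, num_sets=256, ways=4, words_per_block=4):
    """Classify each word-address reference as a Hit or Miss in a set-associative cache."""
    sets = [deque(maxlen=ways) for _ in range(num_sets)]  # LRU order per set
    results = []
    for w in word_addrs:
        block = w // words_per_block
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            s.remove(tag)
            s.append(tag)          # move to most-recently-used position
            results.append("Hit")
        else:
            s.append(tag)          # maxlen evicts the LRU entry when the set is full
            results.append("Miss")
    return results

print(simulate([3, 188, 43, 1026, 191, 1064, 2, 40]))
# ['Miss', 'Miss', 'Miss', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit']
```

This prints the same sequence as the table: four initial misses, a hit on 191, a miss on 1064, and hits on 2 and 40.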

(3) Miss penalty to main memory = 5 ns/byte × 16 bytes = 80 ns
AMAT (one level) = 1 + 0.04 × 80 = 4.2 ns
AMAT (two levels) = 1 + 0.04 × 5 + 0.01 × 80 = 2 ns
Reduction = (4.2 − 2) / 4.2 = 52.38%
(4) The number of bits for the page offset = log2 8K = 13 bits
The number of bits for the virtual page number = 36 − 13 = 23 bits
(5) Memory access time for a TLB hit and cache hit = 1 + 1 = 2 ns
Suppose each page-table entry is 1 word = 4 bytes:
Memory access time for a TLB miss (no page fault) with a cache hit = 1 + (5 × 4) + 1 = 22 ns

104 清大資工

1. In computers, floating-point numbers are expressed as the signed bit, exponent and fraction
field. The bits of the fraction present a number between 0 and 1. Assume that the floating point
numbers are 32 bits, with a bias of 127. Let x = 0.3125 and y = - 0.09375 (in decimal).
(a) Show the floating-point number presentation of x and y using hexadecimal representation.
(b) Show the floating-point number presentation of x × y using hexadecimal representation.
Answer:
(a)
                x                                     y
Normalization   0.3125₁₀ = 0.0101₂ = 1.01 × 2^−2      −0.09375₁₀ = −0.00011₂ = −1.1 × 2^−4
FP number       0 01111101 01000000000000000000000    1 01111011 10000000000000000000000
Hexadecimal     3EA00000                              BDC00000
(b) (1.01 × 2^−2) × (−1.1 × 2^−4) = −1.111 × 2^−6
⇒ FP = 1 01111001 11100000000000000000000 = BCF00000
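The encodings can be verified with Python's struct module, which packs IEEE 754 single-precision values (the helper name `enc` is ours):

```python
import struct

def enc(x):
    """Hex string of the IEEE 754 single-precision encoding of x (big-endian)."""
    return struct.pack(">f", x).hex().upper()

print(enc(0.3125))             # 3EA00000
print(enc(-0.09375))           # BDC00000
print(enc(0.3125 * -0.09375))  # BCF00000  (x × y = -0.029296875 = -1.111₂ × 2^-6)
```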

2. Two n-bit inputs A[i] and B[i] are combined by the two's-complement subtraction (A – B) with
the subtraction result denoted as Sub[i]. The most significant bit, n−1, is the sign bit. Bor(n − 2)
denotes the borrow from bit n−2 to bit n−3. Indicate whether each of the following conditions is a
valid test for two's complement overflow. (The condition must be true if and only if there is
overflow.) Answer True or False for the following cases.
(a) A(n − 1) XOR B(n − 1) = 1 and Sub(n − 1) = 1
(b) A(n − 1) XOR B(n − 1) = Bor(n − 2)
(c) A(n − 1) XOR B(n − 1) = 1 and Sub(n − 1) ≠ A(n − 1)
Answer:
(a) False  (b) True  (c) True
Note: use 4-bit 2's complement (range −8 to +7) as an example: 4 − (−5) = 9 ⇒ overflow
    0100      A = 4;  A[3] XOR B[3] = 1
  − 1011      B = −5; the borrow from bit 2 to bit 1 is 1
  ------
    1001      Sub[3] = 1, and Sub[3] ≠ A[3]

3. Assume a processor has a base CPI of 1.4, running at a clock rate of 4GHz. The access time of
the main memory is 50ns, including all the miss handling. Suppose the miss rate per instruction
at the primary cache is 4%. What will be the speedup after adding a secondary cache with a 5ns
access time for either a hit or a miss? Assume that the miss rate to main memory can be reduced
to 0.2%.

Answer:
Main-memory access = 50 ns × 4 GHz = 200 cycles; L2 access = 5 ns × 4 GHz = 20 cycles
CPI with one cache level = 1.4 + 0.04 × 200 = 9.4
CPI with two cache levels = 1.4 + 0.04 × 20 + 0.002 × 200 = 2.6
Speedup = 9.4 / 2.6 = 3.62
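The speedup computation can be checked numerically (a Python sketch):

```python
base_cpi = 1.4
mem_cycles = 50 * 4  # 50 ns main memory at 4 GHz = 200 cycles
l2_cycles = 5 * 4    # 5 ns secondary cache at 4 GHz = 20 cycles

cpi_l1_only = base_cpi + 0.04 * mem_cycles                     # 9.4
cpi_l1_l2 = base_cpi + 0.04 * l2_cycles + 0.002 * mem_cycles   # 2.6
print(round(cpi_l1_only / cpi_l1_l2, 2))  # 3.62
```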

4. Consider a system with the virtual memory address of 32 bits, and the physical address of 28
bits. The page size is 2KB. Each page table entry is 4 bytes in size.
(a) How many bits are in the page offset portion of the virtual address?
(b) What is the total page table size?
Answer:
(a) The width of the page-offset field = log2 2K = 11 bits
(b) The page table has 2^32 / 2^11 = 2^21 = 2M entries
Page table size = 2M × 4 bytes = 8 MB

5. A specific memory organization has the following memory access times:


i. 1 memory bus clock cycle to send the address;
ii. 10 memory bus clock cycle for each DRAM access initiated;
iii. 1 memory bus clock cycle to send a word of data.
Assume that a cache block has eight words. Each word has four bytes.
(a) Increase the width of the memory organization (as well as the bus) can increase the
memory bandwidth. Determine the smallest width of the memory organization in terms of
bytes, such that the bandwidth (number of bytes transferred per bus clock cycles) for a
single miss exceeds 1.2 bytes/cycle.
(b) Instead of increasing the width of memory, the interleaved memory organization can be
used utilizing the advantage of multiple banks, each with one word wide. What is the
bandwidth speedup of the interleaving scheme as compared with the original
one-word-wide memory?
Answer:
(a) Let n be the number of memory accesses needed per miss such that the bandwidth for a
single miss exceeds 1.2 bytes/cycle:
32 / [1 + (10 + 1) × n] ≥ 1.2 ⇒ n ≤ 2.33
n must be an integer ⇒ n = 2
So the smallest width of the memory organization = 8 words / 2 = 4 words = 16 bytes.
(b) Bandwidth of the one-word-wide memory = 32 / [1 + (10 + 1) × 8] ≈ 0.36 bytes/cycle
Bandwidth of the 4-bank interleaved memory = 32 / [1 + (10 + 1 × 4) × 2] = 32 / 29 ≈ 1.10
bytes/cycle
Speedup = 1.10 / 0.36 ≈ 3.06
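The bandwidth figures can be checked numerically (a Python sketch; `bandwidth` is our helper, taking the total bus cycles per 32-byte miss):

```python
block_bytes = 8 * 4  # eight 4-byte words per cache block

def bandwidth(cycles):
    return block_bytes / cycles

# (a) 4-word-wide memory: 1 address cycle + 2 rounds of (10 access + 1 transfer)
print(round(bandwidth(1 + 2 * (10 + 1)), 2))  # 1.39 bytes/cycle, which exceeds 1.2

# (b) one-word-wide memory vs. 4-bank interleaving (two rounds of a 10-cycle access
# overlapped across the banks, each followed by 4 one-word transfers)
one_wide = bandwidth(1 + 8 * (10 + 1))     # about 0.36
interleaved = bandwidth(1 + 2 * (10 + 4))  # about 1.10
print(round(interleaved / one_wide, 2))  # 3.07 (the 3.06 above comes from rounding first)
```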

6. The beq instruction of MIPS will cause the processor to branch to execute from a target address
if the contents of the two specified registers are equal:
beq $rs, $rt, Target # branch to Target if $rs == $rt
It uses the I-type format shown below, where Opcode is 6-bit wide, rs and rt are 5-bit wide, and
Immediate has 16 bits with the leftmost bit as the sign bit.
Opcode rs rt Immediate
The target address of beq is calculated as PC + (Immediate × 4). Consider the pipelined
implementation of the MIPS processor shown in the following and answer the following
questions.
(a) Explain the purpose of the Adder in the ID stage.
(b) Explain the purpose of the Mux in the IF stage.
(c) Explain the purpose of the IF.Flush control signal.
(d) Suppose we have a branch target buffer, which uses PC as the input, can always predict
correctly whether a branch will take, and supply the target address. Draw a diagram to
explain how the IF stage may be modified. (Hint: Consider the outputs of the branch target
buffer and what existing components may be removed.)
[Figure: the pipelined MIPS datapath with a hazard detection unit, forwarding unit, and
IF.Flush control; the ID stage contains an adder (fed by the shifted immediate) and a register
comparator for branches, and a mux in the IF stage selects between PC + 4 and the branch
target as the next PC.]
Answer:
(a) The adder is used to compute the branch target address.
(b) The Mux is used to select either the branch target or PC + 4 to update the PC in the next
clock cycle.
(c) The IF.Flush control signal is used to flush the wrongly fetched instruction in the IF stage
when a branch is taken.
(d) The multiplexor in front of the instruction memory selects the next instruction address from
either the PC or the BTB, depending on the BHT state. If the BHT state predicts not taken,
the PC is applied to the instruction memory address input; otherwise the output of the BTB
is applied. The original multiplexor that selects between the branch target and PC + 4 can be
removed. The following diagram shows only the modified part of the IF stage.
[Figure: modified IF stage, with the BTB indexed by the PC in parallel with the instruction memory; a mux driven by the BHT state selects between PC + 4 and the BTB target as the next PC.]
104 年交大資聯 (NCTU CS Joint Entrance Exam)

Single-Answer Questions
1. Which of the following statements is correct?
(a) Instructions addi, beq, and j (jump) all need to perform sign extension.
(b) The most significant bit of 2's complement representation is the sign bit - a sign bit of 1
indicates a positive number.
(c) The range of numbers represented by n-bit 2's complement is: -(2^(n-1)) + 1 ~ ±0 ~ (2^(n-1)) – 1.
(d) Consider an addition of two 32-bit signed integers with a 32-bit ripple-carry adder, which
consists of 32 1-bit full adders connected in series: if the carry-in and the carry-out of the
most-significant-bit (MSB) full adder have different logic values, then there is an overflow
due to the addition.
(e) The operation of adding one very large positive integer and one very small negative integer
may produce an overflow.
Answer: (d)
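Option (d)'s rule can be checked mechanically; the function below is our own sketch, treating the operands as 32-bit unsigned bit patterns:

```python
def adds_overflow_32(a, b):
    """Signed-overflow test from option (d): overflow occurs exactly when
    the carry into the MSB differs from the carry out of the MSB."""
    mask31 = (1 << 31) - 1
    carry_in_msb = ((a & mask31) + (b & mask31)) >> 31
    carry_out_msb = ((a & 0xFFFFFFFF) + (b & 0xFFFFFFFF)) >> 32
    return carry_in_msb != carry_out_msb
```

For example, 0x7FFFFFFF + 1 overflows (two positives yielding a negative result), while adding a large positive and a small negative number, as in option (e), never overflows.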
2. Which of the following statements is correct?
(a) Pipelining improves the performance of a processor by decreasing the latency of a job
(e.g., an instruction) to be done.
(b) The more balanced pipeline stages a specific datapath has, the higher performance this
pipelined datapath can achieve.
(c) Every single instruction, when being executed in a n-stage pipelined datapath, consistently
needs n clock cycles before its execution is completely done.
(d) It is likely to have fewer hazards and less severe impact of hazards with deeper design of
pipelining (i.e., more pipeline stage).
(e) The number of pipeline stages monotonically grows with technology advance such as
shrinking transistor size.
Answer: (b)
Note (c): an instruction executing in an n-stage pipelined datapath may need more than n clock cycles to complete when hazards stall the pipeline, so "consistently needs n clock cycles" is false.
3. Which of the following statements is correct?
(a) For write-back cache, both cache and memory will be updated on data-write hit.
(b) For write-through cache, multiple writes within a block may require only one write to the
memory.
(c) Write-through is widely used in virtual memory in order to keep memory and disk
consistent.
(d) First-level caches are more concerned about hit time, and second-level caches are more
concerned about miss rate.
(e) Increasing associativity may increase the miss rate due to conflict misses.
Answer: (d)
4. In a memory system, there is one TLB, one physically addressed cache, and one physical main
memory. Assume all memory addresses are translated to physical addresses before the cache
is accessed. Which of the following events is impossible in this memory system? (Note: PT stands
for page table)?
(a) TLB: hit, PT: hit. Cache: miss
(b) TLB: miss, PT: hit. Cache: hit
(c) TLB: miss, PT: hit, Cache: miss
(d) TLB: miss, PT: miss. Cache: miss
(e) TLB: hit, PT: miss. Cache: hit
Answer: (e)
Multiple-Answer Questions
5. Which of the following 32-bit hexadecimal data will have identical storage sequence in the
(byte-addressable) memory no matter the machine is big-endian or little-endian?
(a) AABBAABBhex
(b) ABBAABBAhex
(c) ABBBBBABhex
(d) ABCDCDABhex
(e) ABCDDCBAhex
Answer: (c), (d)
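A word's memory image is the same under both byte orders exactly when its four bytes form a palindrome, which a short check makes explicit (function name ours):

```python
def endian_invariant(word32):
    # the big-endian byte sequence equals its reverse (the little-endian layout)
    b = word32.to_bytes(4, "big")
    return b == b[::-1]
```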
6. Which of the following statements are correct?
(a) Given that every 1-bit ALU has a propagation delay of 1 unit, the cumulative propagation
delay for an SLT (set-on-less-than) operation is 32 units in the worst case.
(b) The operation of 00110011 (multiplicand) × 11011011 (multiplier) based on Booth's
algorithm requires 2 additions and 3 subtractions.
(c) The Booth's algorithm is always better than traditional multiplication in terms of the total
number of required additions and subtractions.
(d) 0.3125 × 2^130 can be represented by the IEEE 754 single-precision floating-point standard
format without any loss of accuracy.
(e) IEEE 754 double-precision floating-point representation is exactly 2X more precise than
the single-precision counterpart.
Answer: (a), (b)
Note (d): 0.3125 × 2^130 = 0.0101₂ × 2^130 = 1.01₂ × 2^128; the exponent 128 exceeds the single-precision maximum of 127, so it overflows.
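Note (b) can be confirmed by counting radix-2 Booth recoding operations; this sketch uses the standard bit-pair rule (10 → subtract, 01 → add, 00/11 → nothing), and the function name is ours:

```python
def booth_ops(multiplier, bits=8):
    """Count the additions and subtractions radix-2 Booth recoding performs,
    examining each (bit_i, bit_{i-1}) pair from the LSB upward."""
    adds = subs = 0
    prev = 0
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if cur == 1 and prev == 0:
            subs += 1          # 10: subtract the multiplicand
        elif cur == 0 and prev == 1:
            adds += 1          # 01: add the multiplicand
        prev = cur
    return adds, subs
```

For 11011011 it yields 2 additions and 3 subtractions; for a pattern like 01010101 it needs 8 operations where a conventional multiplier needs only 4 additions, which is why (c) is false.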
21
7. Which of the following statements are correct?
(a) Trying to allow some instructions to take fewer cycles does not help, since the throughput
is determined by the clock cycle; the number of pipeline stages per instruction affects
latency, not throughput.
(b) Instead of trying to make instructions take fewer cycles, we should explore making the
pipeline longer (deeper), so that instructions take more cycles, but the cycles are shorter.
This could improve performance.
(c) A data hazard between two adjacent R-type instructions can always be resolved with
forwarding.
(d) Perfect branch prediction (i.e., 100% prediction accuracy) combined with data forwarding
would allow a processor to always keep its pipeline full.
(e) If we use an instruction from the fall-through (untaken) path to fill in a delay slot, we must
duplicate the instruction.
Answer: (a), (b), (c)
Note (e): duplication is required when the delay slot is filled with an instruction from the branch target, not when it is filled from the fall-through path.
8. Suppose you have a machine which executes a program spending 50% of execution time in
floating-point multiply, 20% in floating-point divide, and 30% in integer instructions. Which
of the following statements are correct?
(a) If we make the floating-point divide run 3 times faster, the speedup relative to original
machine is about 0.87.
(b) If we make the floating-point multiply run 8 times faster, the speedup relative to original
machine is about 1.78.
(c) If we make both divide and multiply run faster as described in (a) and (b), the speedup
relative to original machine is about 2.33.
(d) If we can make all the floating-point instructions run 15 times faster, the percent of
floating-point instructions need to be about 90.36 in order to achieve a speedup 4.
(e) The rule stating that the performance enhancement possible with a given improvement is
limited by the amount that the improved feature is used is called Moore's law.
Answer: (b), (c)
Note (d): solving 1 / ((1 − x) + x/15) = 4 gives x ≈ 80.36%, not 90.36.
Note (e): this rule is Amdahl's Law, not Moore's Law.
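The options follow from Amdahl's Law; `speedup` below is our own helper, taking (fraction of original time, improvement factor) pairs:

```python
def speedup(parts):
    """Amdahl's Law: parts is a list of (fraction_of_time, factor) pairs;
    the remaining fraction of execution time is unaffected."""
    improved = sum(f for f, _ in parts)
    new_time = (1 - improved) + sum(f / s for f, s in parts)
    return 1 / new_time
```

Option (a) actually gives a speedup of about 1.15 (0.87 is the new-to-old time ratio, not a speedup), (b) gives 1.78, (c) gives 2.33, and x ≈ 0.8036 solves (d).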
9. Consider two different implementations, M1 and M2, of the same instruction set. There are
three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 80MHz
and M2 has a clock rate of 100MHz. The average number of cycles for each instruction class
and their frequencies (for a typical program) are shown in the table below. Which of the following
statements are correct? (Note: MIPS in these statements stands for Millions of Instructions Per
Second.)
Instruction class    M1 Cycles/instruction    M2 Cycles/instruction    Frequency
A                    1                        2                        60%
B                    2                        3                        30%
C                    4                        4                        10%

(a) The average CPI for M1 is 1.6
(b) The average CPI for M2 is 2.5
(c) M2 has higher MIPS rating than M1.
(d) If we change CPI of instruction class A for M2 to 1, M2 has higher MIPS rating than M1.
(e) If we increase the clock rate of M1 to 100MHz without affecting the CPI of A, B, and C.
The speedup of M1 is 0.8.
Answer: (a), (b), (d)
Note (c): MIPS(M1) = 80 × 10^6 / (1.6 × 10^6) = 50; MIPS(M2) = 100 × 10^6 / (2.5 × 10^6) = 40, so M1 has the higher rating.
Note (d): CPI(M2) becomes 1.9, and MIPS(M2) = 100 × 10^6 / (1.9 × 10^6) = 52.63 > 50.
Note (e): speedup = 100MHz / 80MHz = 1.25, not 0.8.
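The notes use CPI = Σ(CPIᵢ × frequencyᵢ) and MIPS = clock rate / (CPI × 10^6); a quick check with the table's numbers (helper names ours):

```python
def avg_cpi(cycles, freqs):
    # weighted average CPI over the instruction classes
    return sum(c * f for c, f in zip(cycles, freqs))

def mips_rating(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

FREQS = [0.6, 0.3, 0.1]                   # classes A, B, C
cpi_m1 = avg_cpi([1, 2, 4], FREQS)        # machine M1
cpi_m2 = avg_cpi([2, 3, 4], FREQS)        # machine M2
cpi_m2_fixed = avg_cpi([1, 3, 4], FREQS)  # (d): class A CPI changed to 1
```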
10. Assume a 128KB direct-mapped cache with a 32-byte block. Consider a video streaming
workload that accesses a 64KB working set sequentially with the following address stream: 0,
2, 4, 6, 8, 10, 12, 14, 16, ...etc. Which of the following statements are correct?
(a) The miss rate is 1/32.
(b) All the misses are compulsory misses based on the 3Cs model.
(c) The miss rate is sensitive to the size of the cache.
(d) The miss rate is sensitive to the size of the working set
(e) The miss rate is sensitive to the block size.
Answer: (b), (e)
Note (a): the miss rate is 1/16, not 1/32: each 32-byte block serves 16 sequential two-byte accesses.
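For a cold sequential sweep whose working set fits in the cache, the miss rate is just stride/block size — one compulsory miss per block, independent of the cache size; a one-line model (name ours):

```python
def sequential_miss_rate(block_bytes, stride_bytes):
    # cold sequential sweep: one compulsory miss per block, and
    # block_bytes / stride_bytes accesses land in each block
    return stride_bytes / block_bytes
```

Doubling the block size halves the miss rate, which is why (e) is true while (c) is not.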
Problem Set A: Consider the following MIPS code in a 5-stage pipelined CPU (see Figure 1)
# memory address: instruction in the memory
36: sub $1, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or $13, $2, $6
72: lw $4, 50($7)
76: add $14, $4, $2
80: slt $15, $6, $7
Figure 2 shows the pipeline state (in clock cycle 3) when the branch instruction (beq $1, $3, 7) is
in the ID stage. Assume that (i) this branch will be taken, (ii) the branch outcome is
determined in the ID stage, and (iii) although not shown in the figure, forwarding to the ID stage
(from the EX/MEM pipeline registers) is available.
Figure 2: A pipeline datapath
11. In the next clock cycle (clock cycle 4), what is the output value of the ALU adder (for program
counter addition) in the IF stage?
(a) 40
(b) 44
(c) 48
(d) 72
(e) 76
Answer: (c)
Note: the pipeline is stalled in clock cycle 4 because of the data hazard between sub and beq.

12. In the next clock cycle (clock cycle 4), what is the output value of the ALU adder (for program
counter addition) in the ID stage?
(a) 40
(b) 44
(c) 48
(d) 72
(e) 76
Answer: (d)
13. In the next clock cycle (clock cycle 4), what is the instruction being executed in the ID stage?
(a) sub $1, $4, $8
(b) beq $1, $3, 7
(c) and $12, $2, $5
(d) lw $4, 50($7)
(e) NOP (no operation)
Answer: (b)
14. In the next clock cycle (clock cycle 4), what is the instruction being executed in the EX stage?
(a) sub $1, $4, $8
(b) beq $1, $3, 7
(c) and $12, $2, $5
(d) lw $4, 50($7)
(e) NOP (no operation)
Answer: (e)
15. Is there any stall required after the branch is taken (given that Figure 2 shows the pipeline state
in clock cycle 3)?
(a) No, no required stall after the branch.
(b) Yes, one required stall will occur in clock cycle 4
(c) Yes, one required stall will occur in clock cycle 5
(d) Yes, one required stall will occur in clock cycle 6
(e) Yes, one required stall will occur in clock cycle 7
Answer: (e)
Problem Set B: Assume 4KB pages, a 4-entry two-way set-associative TLB, and true LRU replacement.
If pages must be brought in from disk, assign the next largest physical page number. Given the
virtual address references and the initial TLB and page table states provided below, which of the
following statements is (are) true?
Virtual address references: 4669, 2227, 13916, 34587, 48870, 12608
TLB
Set    Valid    Tag    Physical page number
0      0        11     12
0      1        3      6
1      1        7      4
1      0        4      9

Page Table
Virtual page    Valid    Physical page number or in disk
0               1        5
1               0        Disk
2               0        Disk
3               1        6
4               1        9
5               1        11
6               0        Disk
7               1        4
8               0        Disk
9               0        Disk
10              1        3
11              1        12
16. For the given address references, list the corresponding virtual page numbers.
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (c)
Note:
Virtual address       4669    2227    13916    34587    48870    12608
Virtual page number   1       0       3        8        11       3
TLB tag               0       0       1        4        5        1
TLB index             1       0       1        0        1        1
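The decomposition in the note: with 4KB pages the VPN is the address divided by 4096, and with a 2-set TLB the low VPN bit is the index while the rest is the tag (helper name ours):

```python
PAGE_SIZE = 4096   # 4 KB pages
TLB_SETS = 2       # 4 entries, two-way set associative

def tlb_fields(virtual_address):
    vpn = virtual_address // PAGE_SIZE
    return vpn, vpn // TLB_SETS, vpn % TLB_SETS   # (VPN, tag, index)

REFS = [4669, 2227, 13916, 34587, 48870, 12608]
```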
17. For the given address references, the corresponding indexes to the TLB are
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (e)
18. For the given address references, the corresponding tags to the TLB are
(a) 0, 0, 0, 2, 2, 0
(b) 1, 0, 3, 0, 3, 3
(c) 1, 0, 3, 8, 11, 3
(d) 0, 0, 1, 4, 5, 1
(e) 1, 0, 1, 0, 1, 1
Answer: (d)
19. For the given references, the accesses to the TLB are
(a) miss, miss, miss, miss, miss, miss
(b) miss, hit, miss, miss, miss, miss
(c) miss, miss, hit, hit, hit, hit
(d) miss, miss, miss, miss, miss, hit
(e) miss, hit, hit, miss, hit, hit
Answer: (d)
20. For the given references, how many page faults occur in total?
(a) None
(b) 1
(c) 2
(d) 3
(e) 4
Answer: (c)
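Questions 19 and 20 can be replayed with a small simulation. The initial TLB contents and page table come from the tables above; the list encoding, and the assumption that the first-listed way of each set is the least recently used, are ours:

```python
PAGE_SIZE, TLB_SETS = 4096, 2

# VPN -> physical page number, or None if the page is on disk
PAGE_TABLE = {0: 5, 1: None, 2: None, 3: 6, 4: 9, 5: 11,
              6: None, 7: 4, 8: None, 9: None, 10: 3, 11: 12}

def simulate(refs):
    # per set: list of [valid, tag]; position 0 is LRU, the end is MRU
    tlb = {0: [[False, 11], [True, 3]], 1: [[True, 7], [False, 4]]}
    next_page = 13                       # "next largest page number" on a fault
    hits, faults = [], 0
    for va in refs:
        vpn = va // PAGE_SIZE
        tag, index = vpn // TLB_SETS, vpn % TLB_SETS
        ways = tlb[index]
        entry = next((e for e in ways if e[0] and e[1] == tag), None)
        hits.append(entry is not None)
        if entry is None:
            if PAGE_TABLE[vpn] is None:  # page fault: bring the page in
                PAGE_TABLE[vpn] = next_page
                next_page += 1
                faults += 1
            # refill: prefer an invalid way, otherwise evict the LRU way
            victim = next((e for e in ways if not e[0]), ways[0])
            ways.remove(victim)
            ways.append([True, tag])     # the refill becomes most recently used
        else:
            ways.remove(entry)
            ways.append(entry)           # hit: move the entry to the MRU position
    return hits, faults
```

It reproduces five misses followed by one hit, and two page faults (VPN 1 and VPN 8).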
104 年成大電機 (NCKU EE Entrance Exam)
Choose the correct answers for the following multiple choice problems. Each question may have
more than one answer. 10 points each, no partial point, no penalty.

1. Which of the following statements is (are) true for data alignment?
(a) For a word size of 4 bytes, this 32-bit address, 0xF00ABCCC is aligned.
(b) For a word size of 8 bytes, this 32-bit address, 0xF0ABCCC is aligned.
(c) For a cache line size of 32 bytes, this 32-bit address, 0xF0ABCE0 is aligned.
(d) For a page size of 8KB, this 32-bit address, 0xF0ABC00 is aligned.
Answer: (a), (c)
Note (a): 0xF00ABCCC is a multiple of 4, so it is aligned.
Note (b): 0xF0ABCCC is not a multiple of 8, so it is not aligned.
Note (c): 0xF0ABCE0 is a multiple of 32, so it is aligned.
Note (d): 0xF0ABC00 is not a multiple of 8K, so it is not aligned.
2. Which of the following statements is (are) true for virtual memory system?
(a) The flash memory is a volatile device and it can be used for the swap space.
(b) The operating system usually creates the space on flash memory or disk for all the pages
of a process when it creates the process.
(c) The space on the disk or flash memory reserved for the full virtual memory space of a
process is called swap space.
(d) A memory access violation can be detected by the memory management unit.
Answer: (b), (c), (d)
Note (a): flash memory is a nonvolatile device.
3. For a conditional branch instruction such as beq rs, rt, loop, which of the following statements
are true?
(a) The label “loop” defines the base address of the branch target.
(b) The label “loop” is an offset relative to the program counter which points to the next
sequential instruction of the branch instruction.
(c) The label “loop” is an unsigned number.
(d) The label “loop” is coded into the instruction as “loop”.
Answer: (b)
28
4. Which of the following statements is (are) true for virtualization?
(a) The software that supports virtual machines is called a virtual machine monitor (VMM),
which determines how to map the virtual resources to the physical resources.
(b) The cost of processor virtualization depends on the workload. User-level processor-bound
programs often have zero virtualization overhead.
(c) OS-intensive workloads which execute many system calls and privileged instructions can
result in high virtualization overheads.
(d) Virtualization is a simulation program that performs page walk for the virtual memory
system.
Answer: (a), (b), (c)
5. Which of the following is (are) true for cache system? The address is 32-bit long for each case.
(a) A 64KB direct-mapped cache has a line size of 64 bytes. The tag width is 18 bits.
(b) A 64KB direct-mapped cache has a line size of 64 bytes. The total tag memory is 16
Kbits.
(c) A 64KB 4-way set associative cache has a line size of 64 bytes. The tag width is 16 bits.
(d) A 64KB 4-way set associative cache has a line size of 64 bytes. The total tag memory is
18 Kbits.
Answer: (b), (d)
Note (a): tag width = 32 – 10 – 6 = 16 bits, not 18.
Note (b): total tag memory = 16 × 1K = 16 Kbits.
Note (c): tag width = 32 – 8 – 6 = 18 bits, not 16.
Note (d): total tag memory = 18 × 1K = 18 Kbits.
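The note's arithmetic generalizes to tag bits = address bits − index bits − offset bits, with one tag per line; a sketch (helper names ours):

```python
from math import log2

def tag_bits(addr_bits, cache_bytes, line_bytes, ways):
    sets = cache_bytes // (line_bytes * ways)
    return addr_bits - int(log2(sets)) - int(log2(line_bytes))

def tag_storage_bits(addr_bits, cache_bytes, line_bytes, ways):
    lines = cache_bytes // line_bytes   # one tag per cache line
    return tag_bits(addr_bits, cache_bytes, line_bytes, ways) * lines
```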
6. Which of the following is (are) true for branch hazard in a pipelined processor?
(a) Branch prediction can eliminate branch hazard completely.
(b) Branch hazard comes from a data access hazard. It happens frequently.
(c) A branch hazard arises from the need to make a decision based on the result of the branch
instruction.
(d) Branch hazard is a control hazard: the proper instruction cannot execute in the proper
pipeline clock cycle because the instruction that was fetched is not the one that is needed.
Answer: (c), (d)
7. Which of the following is (are) true for processor implementation? Assume that for a single
cycle implementation, the processor’s cycle time is T nsec. The instruction count of the
program to run is N.
(a) For single cycle implementation, T is determined by the instruction which has the longest
latency.
(b) If N is the program instruction count, for single cycle CPU, it takes time of N × T nsec.
(c) For a multi-cycle implementation that uses one fifth of T for the CPU clock cycle time, the
program execution time is 0.2 × N × T nsec.
(d) For a five-stage pipeline implementation that also uses 0.2T for the CPU clock cycle time,
the program execution time is 0.2 × N × T nsec.
Answer: (a), (b), (d)
8. Which of the following is (are) true about program performance?
(a) The processor pipeline design affects the average cycles per instruction (CPI) and the
clock cycle time which can be achieved.
(b) Branch prediction improves the IPC, instruction per cycle of a pipelined processor.
(c) The programming language affects the instructions count since the statements in the
language are translated into the processor instructions.
(d) The instruction set architecture affects the instruction count, clock rate, and CPI.
Answer: (a), (c), (d)
Note (a): true, because the computer organization affects both CPI and the achievable clock cycle time.
Note (b): branch prediction improves the exploitation of ILP (instruction-level parallelism).
9. Using 8K × 8 SRAM chips for the memory system, which of the following is (are) true?
(a) For 1 MB memory system, it needs 64 SRAM chips.
(b) The 8K × 8 chip has 8K address pins.
(c) It needs at least 8 chips for the connection to a 64-bit data bus for proper operation of full
bus width access. So the minimum memory size is 64KB.
(d) It needs at least 4 chips for the connection to a 64-bit data bus for proper operation of full
bus width access. So the minimum memory size is 32KB.
Answer: (c)
Note (a): 1MB / 8KB = 128 SRAM chips, not 64.
Note (b): 8K = 2^13, so the chip has 13 address pins, not 8K.
30
10. Which of the following is (are) true about cache operations?
(a) A processor writes data into a cache line, which is also updated in other processor's cache.
This is the write-through policy.
(b) When a data cache write hit occurs, the written data are also updated in the next level of
memory. This is the write-back policy.
(c) When a data cache write miss occurs, the cache controller first fetches the missing block
into cache and then the data are written into the cache. This is the write-allocate policy.
(d) When a data cache write hit occurs, the data are only written into the cache. This is the
write-back policy.
Answer: (c), (d)
Note (a): this is the write-update (write-broadcast) policy, not write-through.
Note (b): this is the write-through policy, not write-back.
104 年成大資聯 (NCKU CS Joint Entrance Exam)
Figure 1. Tables for MIPS Instruction Encoding
1. Determine whether each of the following statements is true (T) or false (F).
(1) Program execution time reduces when the clock rate increases.
(2) Program execution time reduces when the CPI increases.
(3) Program execution time reduces when the instruction count (IC) increases.
(4) Suppose the floating point instructions are enhanced and can run 10 times faster. If the
execution time before the floating point enhancement is 80 seconds and three-fourth of the
execution time is spent executing floating-point instructions, the overall speed up is at
least 3.
(5) Suppose the floating point instructions are enhanced and can run 20 times faster. If the
execution time before the floating point enhancement is 80 seconds and one-half of the
execution time is spent executing floating-point instruction, the overall speed up is at least
2.
Answer:
(1) (2) (3) (4) (5)
T   F   F   T   F

Note (4): Speedup = 80 / (20 + 60/10) = 80 / 26 = 3.08 ≥ 3
Note (5): Speedup = 80 / (40 + 40/20) = 80 / 42 = 1.90 < 2
2. Determine whether each of the following statements is true (T) or false (F)
(1) R-type and I-type MIPS instruction can be distinguished by the opcode of an instruction.
(2) Base addressing mode is used by I-format instructions
(3) PC-relative addressing is used by J-format
(4) Suppose the program counter (PC) is at address 0x00000000. It is possible to use one
single branch-on-equal (beq) MIPS instruction to get to address 0x00030000.
(5) Suppose the program counter (PC) is at address 0x00000000. It is possible to use the
jump MIPS instruction to get to 0xFFFFFFB0.
Answer:
(1) (2) (3) (4) (5)
T T F F F
Note (1): the opcode of an R-type instruction is 0, while the opcode of an I-type instruction is nonzero.
3. The following descriptions are about IEEE 754 single precision float point format. Determine
whether each of the following statements is true (T) or false (F)?
(1) The float point format has 1 sign bit, 8 exponent bits, and 23 fraction bits.
(2) The smallest positive number it can represent is 0000 0001 0000 0000 0000 0000 0000
00002
(3) The result of “Divide 0 by 0” is 0111 1111 1000 0000 0000 0000 0000 00002
(4) To improve the accuracy of the results, IEEE 754 has one extra bit for rounding.
(5) 0.7510 is represented by 1011 1111 0100 0000 0000 0000 0000 00002
Answer:
(1) (2) (3) (4) (5)
T F F F F
Note (2): should be 0000 0000 1000 0000 0000 0000 0000 0000₂ (the smallest positive normalized number, 2^-126); if denormals are allowed, the smallest positive value is 0000 0000 0000 0000 0000 0000 0000 0001₂ = 2^-149.
Note (3): the result of dividing 0 by 0 is NaN, so the fraction field cannot be all 0s.
Note (4): IEEE 754 keeps two extra bits for rounding: guard and round.
Note (5): should be 0011 1111 0100 0000 0000 0000 0000 0000₂; the leading 1 shown would make the number negative.
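Notes (2) and (5) can be verified with Python's struct module, which exposes the IEEE 754 single-precision encoding:

```python
import struct

def float_to_bits(x):
    """Bit pattern of x encoded as IEEE 754 single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits):
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]
```

0.75 encodes as 0x3F400000 (note 5's pattern with the sign bit cleared), 0x00800000 decodes to 2^-126, and the denormal 0x00000001 decodes to 2^-149.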
MIPS instruction             Address    Assembled instruction fields
Loop: sll  $t1, $s3, 2       40000      0    0    19   9    4    0
      add  $t1, $t1, $s6     40004      0    9    22   9    0    (4)
      lw   $t0, 0($t1)       40008      (3)  9    8    0
      bne  $t0, $s5, Exit    40012      5    8    21   (1)
      addi $s3, $s3, 1       40016      8    19   19   1
      j    Loop              40020      2    (2)
Exit: .....                  40024

4. Refer to the above table and Figure 1. The right side of the table shows the assembled
encodings of the MIPS instructions on the left. The starting address of the loop is 40000
(decimal) in memory. What are the values of (1), (2), (3), and (4)? Express your answers
as decimal numbers.
Answer:
(1) (2) (3) (4)
2 10000 35 32
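The blanks follow from the MIPS encoding rules: lw's opcode is 35, add's funct field is 32, bne stores a PC-relative word offset measured from the instruction after the branch, and j stores the target's word address in its 26-bit field. A sketch of the two address computations (helper names ours):

```python
def bne_offset(branch_addr, target_addr):
    # PC-relative, in words, measured from the instruction after the branch
    return (target_addr - (branch_addr + 4)) // 4

def j_field(target_addr):
    # the 26-bit field holds the word address (byte address / 4)
    return target_addr // 4
```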
5. Refer to the following instruction sequence:
Instruction sequence
lw $1, 40($2)
add $2, $3, $3
add $1, $1, $2
sw $1, 20($2)
(1) Find all data dependences in this instruction sequence.
(2) Find all hazards in this instruction sequence for a 5-stage pipeline with and without
forwarding.
(3) To reduce the clock cycle time, we are considering a split of the MEM stage into two
stages. Repeat (2) for this 6-stage pipeline.
Answer:
(1) Data dependences (I1: lw $1, 40($2); I2: add $2, $3, $3; I3: add $1, $1, $2; I4: sw $1, 20($2)):
    RAW: ($1) I1 to I3; ($2) I2 to I3 and I2 to I4; ($1) I3 to I4
    WAR: ($2) I1 to I2
    WAW: ($1) I1 to I3
(2) 5-stage pipeline:
    Without forwarding, every RAW dependence above is a hazard: ($1) I1 to I3; ($2) I2 to I3
    and I2 to I4; ($1) I3 to I4.
    With forwarding, none of them requires a stall: even I1's load result is needed two
    instructions later, so it arrives in time through the forwarding paths.
(3) 6-stage pipeline (MEM split into two stages):
    Without forwarding, the hazards are the same as in (2).
    With forwarding, the load-use dependence ($1) I1 to I3 now costs one stall, because the
    loaded data becomes available one cycle later than in the 5-stage design; the other
    dependences are still covered by forwarding.
6. Suppose that in 1000 memory references there are 50 misses in the first-level cache, 20 misses
in the second-level cache, and 5 misses in the third-level cache. Assume the miss penalty from
the L3 cache to memory is 100 clock cycles, the hit time of the L3 cache is 10 clock cycles,
the hit time of the L2 cache is 4 clock cycles, and the hit time of L1 is 1 clock cycle. What
is the average memory access time? Ignore the impact of writes.
Answer:
                         L1 cache    L2 cache    L3 cache
Misses per reference     50/1000     20/1000     5/1000
Hit time (cycles)        1           4           10
Cost per miss (cycles)   4           10          100
AMAT = 1 + (50/1000) × 4 + (20/1000) × 10 + (5/1000) × 100 = 1.9 cycles
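With global miss rates (misses per reference), the AMAT sum can be computed by chaining each level's miss cost; the function below is our own formulation of the calculation above:

```python
def amat(hit_times, misses_per_ref, memory_penalty):
    """hit_times[i] is the hit time of cache level i; misses_per_ref[i] is
    that level's global misses per reference; the last level's misses pay
    memory_penalty."""
    total = hit_times[0]
    costs = hit_times[1:] + [memory_penalty]
    for miss_rate, cost in zip(misses_per_ref, costs):
        total += miss_rate * cost
    return total
```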
7. For a memory access in a virtual memory system,
(1) Is it possible that the TLB hits but a page fault occurs? Explain your answer.
(2) Is it possible that the TLB misses but no page fault occurs? Explain your answer.
[104成大資聯]
Answer:
(1) Impossible. Every TLB entry is a valid mapping taken from the page table, so a TLB hit
implies the page is resident in memory and no page fault can occur.
(2) Possible. A mapping can be present in the page table but not cached in the TLB, so the
TLB misses while the page table access succeeds without a page fault.
104 成大電通 (NCKU Computer and Communication Entrance Exam)
Choose the correct answers for the following multiple choice problems. Each question may have
more than one answer. 10 points each, no partial point, no penalty.

1. Which of the following statements is (are) not true for virtual memory system?
(a) It is typically unknown in advance when a page in memory will be replaced and written
back to flash memory or disk.
(b) The flash memory is a volatile device and it can be used to store pages in memory.
(c) The operating system usually creates the space on flash memory or disk for all the pages
of a process when it creates the process.
(d) A program can be invoked by the operating system into different instances of processes.
(e) The space on the disk or flash memory reserved for the full physical memory space of a
process is called swap space.
Answer: (b), (e)
Note (b): flash memory is a nonvolatile device.
Note (e): the swap space reserves the full virtual (not physical) memory space of a process.
2. Which of the following statements is (are) true for virtualization technology?
(a) On a conventional platform, a single operating system owns all the hardware resources,
but with a virtual machine (VM), multiple OSes all share the hardware resources.
(b) The software that supports VMs is called a virtual machine monitor (VMM) or hypervisor
which determines how to map the virtual resources to the physical resources. A physical
resource may be time-shared, partitioned, or even emulated in software.
(c) The cost of processor virtualization depends on the workload. User-level processor-bound
programs often have great virtualization overhead.
(d) I/O-intensive workloads are generally also OS-intensive, executing many system calls and
privileged instructions that can result in high virtualization overheads.
(e) If the I/O intensive workload is also I/O bound, the cost of processor virtualization can be
completely hidden, since the processor is often idle waiting for I/O.
Answer: (a), (b), (d), (e)
Note (c): user-level processor-bound programs often have zero (not great) virtualization overhead.
3. Which of the following is (are) true for the control hazards in a pipelined processor?
(a) Control hazard comes from a data cache miss.
(b) Considering two instructions i and j, with i occurring before j, j tries to read a source
before i writes it, so j incorrectly gets the old value. This causes a control hazard.
(c) Considering two instructions i and j, with i occurring before j, j tries to read a source
before i writes it, so j incorrectly gets the old value. This also causes a control hazard.
(d) A control hazard arises from the need to make a decision based on the result of a branch
instruction while others are executing.
(e) When the proper instruction cannot execute in the proper pipeline clock cycle because the
instruction that was fetched is not the one that is needed.
Answer: (d), (e)
4. Which of the following is (are) true about program performance?
(a) The efficiency of a compiler affects both instruction count and average cycles per
instruction.
(b) The algorithm determines the number of source program instructions executed and thus
affects the instruction count. The algorithm also affects the clock rate.
(c) The programming language affects the instructions count since the statements in the
language are translated into the processor instructions.
(d) The instruction set architecture affects the instruction count, clock rate, and CPI.
(e) The cache memory and DRAM used both affect the CPI.
Answer: (a), (c), (d)
Note (e): the cache memory and DRAM used affect the cycle time, not the CPI.
5. Which of the following is (are) true about cache operations?
(a) When a data cache write hit occurs, the written data are also updated in the next level of
memory. This is the write-through policy.
(b) When a data cache write miss occurs, the cache controller first fetches the missing block
into cache and then the data are written into the cache. This is the write-allocate policy.
(c) When a data cache write miss occurs, the cache controller first fetches the missing block
into cache and then the data are written into the cache. This is the write-around policy.
(d) When a data cache write hit occurs, the data are only written into the cache. This is the
write-back policy.
(e) A processor writes data into a cache line which is also present in other processor's cache.
This is the write-allocate policy.
Answer: (a), (b), (d)
Note (e): this describes the write-update (write-broadcast) coherence policy, not write-allocate.
104 年中央資工 (NCU CS Entrance Exam)

Single-Answer Questions
1. The information of a virtual memory system is assumed to be:
- Page size: 1024 words
- Number of virtual pages: 8
- Number of physical pages: 4
- The current page table is

VPN 0 1 2 3 4 5 6 7
PPN 2 1 NULL NULL 3 NULL 0 NULL

For the following (decimal) virtual word addresses: VA1: 57, VA2: 2048, VA3: 1026, VA4:
7749, VA5 6150, their corresponding physical addresses are PA1, PA2, PA3, PA4, PA5. K =
(PA1 + PA2 + PA3 + PA4 + PA5) mod 5, where “mod” is the modulo operator, and PA = -1 if it
is a page fault. What is "K"?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (A)
Note: K = (2105 − 1 + 1026 − 1 + 6) mod 5 = 3135 mod 5 = 0
Virtual address       57      2048    1026    7749    6150
Virtual page number   0       2       1       7       6
Page offset           57      0       2       581     6
Physical address      2105    −1      1026    −1      6
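The translation can be replayed directly from the page table; the dictionary encoding is ours, with NULL entries simply absent:

```python
PAGE_WORDS = 1024
PAGE_TABLE = {0: 2, 1: 1, 4: 3, 6: 0}   # VPN -> PPN; missing VPNs are NULL

def translate(va):
    vpn, offset = divmod(va, PAGE_WORDS)
    if vpn not in PAGE_TABLE:
        return -1                        # page fault, as the problem defines
    return PAGE_TABLE[vpn] * PAGE_WORDS + offset
```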
2. A cache with data size 1Mbytes contains 32768 blocks and is eight-way set associative. The byte
address 0x11D9A4F1 is accessed and is a hit in this cache. Assume that the corresponding tag
value is “T”, what is “T mod 5”?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (E)
Note: block size = 1MB / 32768 = 32 bytes; number of sets = 32768 / 8 = 4096
         Tag                Index            Offset
Length   15                 12               5
Value    000100011101100    110100100111    10001
000100011101100₂ = 2284₁₀, and 2284 mod 5 = 4
3. Given a function F of four inputs A, B, C, D with minterms F(A, B, C, D) = Σm(3, 4, 5, 6, 7)
and don't-care terms d(A, B, C, D) = Σd(10, 11, 14, 15), use the commonly used Karnaugh map
to simplify F in product-of-sums form with the minimum number of gates, giving "M" OR gates,
"N" AND gates, and "P" NOT/INVERT gates, where each AND/OR gate has two inputs and
each NOT gate has one input. K = (M × 4 + N × 3 + P) mod 5. What is "K"?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (A)
Note: covering the K-map of F′ with the don't cares gives F′ = A + B′C′ + B′D′, so in
product-of-sums form F = A′(B + C)(B + D)
⇒ M = 2, N = 2, P = 1 ⇒ K = (2 × 4 + 2 × 3 + 1) mod 5 = 15 mod 5 = 0
4. "M" is the floating-point number encoded in the IEEE 754 single-precision format as

01000000001010101010101010101010

K = {Round(M × 20)} mod 5. What is K?
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (D)
Note: M = 10.1010101010101010101010₂ ≈ 2.6666665
M × 20 ≈ 53.33, Round(53.33) = 53, and 53 mod 5 = 3
5. There are three classes of instructions: Class A, Class B, and Class C, with CPI values of 3, 2,
and 1 respectively. We have two compilers, C1 and C2. Compiler C1 generates 10^6 Class A
instructions, 2 × 10^6 Class B instructions, and 5 × 10^6 Class C instructions. Compiler C2
generates 10^6 Class A instructions, 10^6 Class B instructions, and 10^7 Class C instructions.
If C1 is tested on a 1GHz machine, M1, while C2 is tested on a 1.5GHz machine, M2, the
execution time of {C1, M1} is "T1" ms (milliseconds) and the execution time of {C2, M2} is
"T2" ms. Calculate K = {Round(|T1 − T2| × 1.234)} mod 5.
(A) 0 (B) 1 (C) 2 (D) 3 (E) 4.
Answer: (C)
Note:
Instruction class    A          B            C
CPI                  3          2            1
IC for C1            10^6       2 × 10^6     5 × 10^6
IC for C2            10^6       10^6         10^7
CPU clock cycles for C1 = 3 × 10^6 + 2 × 2 × 10^6 + 1 × 5 × 10^6 = 12 × 10^6
CPU clock cycles for C2 = 3 × 10^6 + 2 × 10^6 + 1 × 10^7 = 15 × 10^6
Execution time T1 = (12 × 10^6) / 1GHz = 12 ms
Execution time T2 = (15 × 10^6) / 1.5GHz = 10 ms
K = {Round(|12 ms − 10 ms| × 1.234)} mod 5 = Round(2.468) mod 5 = 2
6. Which of the following statements are true?
(A) Page table is a cache.
(B) TLB is a cache.
(C) “TLB miss. Page table hit. Cache miss” is possible.
(D) Conflict misses only occur in a direct-mapped or set-associative cache and can be
eliminated in a fully associative cache of the same size.
(E) A very large cache can avoid all the misses.
Answer: (B), (C), (D)
7. Which are the causes that limit the growth of uniprocessor performance and motivate the trend
of developing multiple processors per chip in recent years?
(A) Long memory latency
(B) Emergence of Reduced Instruction Set Computer
(C) Limits of power
(D) Little instruction-level parallelism left to exploit efficiently
(E) None of the above
Answer: (A), (C), (D)
8. Suppose that the frequency of Floating Point (FP) operations is 25% and the average Clock
Cycle Per Instruction (CPI) of FP operations is 4. The frequency of Floating Point Square Root
(FPSQR) operations is 2% and the average Clock Cycle Per Instruction (CPI) of FPSQR operations is
20. The average CPI of other instructions is 1.33. There are two alternatives to improve the
processor. Alternative 1 is to reduce the CPI of FPSQR to 2. Alternative 2 is to reduce the
average CPI of all FP to 2. Which of the following is true?
(A) Alternative 1 is better.
(B) Alternative 2 is better.
(C) The CPI of alternative 1 is 1.25.
(D) The CPI of alternative 2 is 1.25.
(E) The speedup for alternative 2 is 1.33.
Answer: (B)
Note:
Instruction class     FP      FPSQR     Other
Frequency            25%       2%        73%
CPI_i                 4        20        1.33
CPI_original = 4 × 0.25 + 20 × 0.02 + 1.33 × 0.73 = 2.37
CPI_Alternative1 = 4 × 0.25 + 2 × 0.02 + 1.33 × 0.73 = 2.01
CPI_Alternative2 = 2 × 0.25 + 20 × 0.02 + 1.33 × 0.73 = 1.87
Speedup for Alternative 2 = 2.37 / 1.87 = 1.27
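A short Python sketch of the weighted-CPI comparison (frequencies and CPIs taken from the problem; this only verifies the arithmetic):

```python
# Instruction-mix frequencies and per-class CPIs from the problem statement.
freq = {"FP": 0.25, "FPSQR": 0.02, "Other": 0.73}
cpi = {"FP": 4.0, "FPSQR": 20.0, "Other": 1.33}

def avg_cpi(c):
    """Frequency-weighted average CPI."""
    return sum(freq[k] * c[k] for k in freq)

original = avg_cpi(cpi)                        # 2.37
alt1 = avg_cpi({**cpi, "FPSQR": 2.0})          # reduce FPSQR CPI to 2 -> 2.01
alt2 = avg_cpi({**cpi, "FP": 2.0})             # reduce all FP CPI to 2 -> 1.87
speedup2 = original / alt2                     # ~1.27, so Alternative 2 wins
```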
9. Which of the following statements are true for General-purpose register (GPR) instruction
architecture?
(A) One of the advantages of register-register instructions is simpler code-generation
(B) The instruction count for register-register instructions is usually lower than register-memory
instructions.
(C) Using registers is more efficient for a compiler than other forms of internal storage.
(D) When variables are allocated to registers, the memory traffic reduces, the program speeds
up.
(E) None of the above
Answer: (A), (C), (D)
10. About single-cycle implementation and multi-cycle implementation of control and data path,
which of the following statements are true?
(A) Single-cycle implementation of control and data path is better than multi-cycle
implementation.
(B) Multi-cycle implementation of control and data path prevents an instruction from sharing
functional units with another instruction within the execution of the instruction.
(C) Single-cycle implementation facilitates the design of pipeline.
(D) For multi-cycle implementation, the clock cycle is determined by the longest possible path.
(E) None of the above.
Answer: (C), (D)
11. Consider the following code sequence.
1 Loop: LD F0, 0(R1)
2 SUBI R1, R1, 8
3 ADDD F4, F0, F2
4 stall
5 BNEZ R1, Loop
6 SD 8(R1), F4
Unrolling the loop four times and rearranging the code to minimize stalls gives:
1 Loop: LD F0, 0(R1)
2 LD F6,-8(R1)
3 LD F10, -16(R1)
4 LD F14, -24(R1)
5 ADDD F4, F0, F2
6 ADDD F8, F6, F2
7 ADDD F12, F10, F2
8 ADDD F16, F14, F2
9 SD 0(R1), F4
10 SD -8(R1), F8
11 SUBI R1, R1, #32
12 SD A(R1), F12
13 BNEZ R1, LOOP
14 SD B(R1), F16
Where A and B stand for two numbers. Which of the following is true?
(A) Number A in line 12 should be 8.
(B) Number A in line 12 should be 16.
(C) Number B in line 14 should be 8.
(D) Number B in line 14 should be 16.
(E) Number B in line 14 should be 32.
Answer: (B), (C)
104 (2015) National Chung Cheng University, Electrical Engineering

1. Explain the following terms:
(1) Write-through vs. write-back
(2) Virtual memory vs. physical memory
(3) Von Neumann bottleneck
(4) MIPS rate vs. MFLOPS rate
(5) Memory mapped I/O vs. I/O mapped I/O
(6) Multiprogramming
Answer
(1) Write-through: a scheme in which writes always update both the cache and the next lower
level of the memory hierarchy, ensuring that data is always consistent between the two.
Write-back: a scheme that handles writes by updating values only to the block in the cache,
then writing the modified block to the lower level of the hierarchy when the block is
replaced.
(2) Virtual memory: a technique that uses main memory as a “cache” for secondary storage.
Physical memory: also referred to as main memory; that is, the RAM installed in
your computer.
(3) Von Neumann bottleneck: The shared bus between the program memory and data memory
leads to the Von Neumann bottleneck, the limited throughput (data transfer rate) between the
CPU and memory compared to the amount of memory. Because program memory and data
memory cannot be accessed at the same time, throughput is much smaller than the rate at
which the CPU can work.
(4) MIPS: a measurement of program execution speed based on the number of millions of
instructions executed per second. MIPS is computed as the instruction count divided by the
product of the execution time and 10^6.
MFLOPS: refers to Million Floating Point Operations per Second, and is the measurement
used to describe the speed of a computer.
(5) Memory mapped I/O: an I/O scheme in which portions of address space are assigned to
I/O devices, and reads and writes to those addresses are interpreted as commands to the I/O
device.
I/O mapped I/O: IO mapped IOs have separate address space than the memory and use
dedicated instruction to give a command to an I/O device and that specifies both device
number and the command word.
(6) Multiprogramming: is the ability of an operating system to execute more than one
program on a single processor machine.
2. Booth's Algorithm and Modified Booth's Algorithm
(1) Calculate 11101010 × 11001110 by Booth's Algorithm.
(2) Give the rule of Modified Booth Recoding.
(3) Find the values of P0, P2, P4, and P6 by Modified Booth's Algorithm.
Multiplicand Y 11101010
Multiplier X 11001110
P0
P2
P4
P6
P = P0 + P2 + P4 + P6
Answer
(1) 11101010 × 11001110 = 00000100 01001100
Iteration   Step                   Multiplicand   Product
0           initial values         11101010       00000000 110011100
1           00: no operation       11101010       00000000 110011100
            shift right product    11101010       00000000 011001110
2           10: - multiplicand     11101010       00010110 011001110
            shift right product    11101010       00001011 001100111
3           11: no operation       11101010       00001011 001100111
            shift right product    11101010       00000101 100110011
4           11: no operation       11101010       00000101 100110011
            shift right product    11101010       00000010 110011001
5           01: + multiplicand     11101010       11101100 110011001
            shift right product    11101010       11110110 011001100
6           00: no operation       11101010       11110110 011001100
            shift right product    11101010       11111011 001100110
7           10: - multiplicand     11101010       00010001 001100110
            shift right product    11101010       00001000 100110011
8           11: no operation       11101010       00001000 100110011
            shift right product    11101010       00000100 010011001
(2)
a_{i+1}   a_i   a_{i-1}   Operation
0         0     0         None
0         0     1         Add the multiplicand
0         1     0         Add the multiplicand
0         1     1         Add twice the multiplicand
1         0     0         Subtract twice the multiplicand
1         0     1         Subtract the multiplicand
1         1     0         Subtract the multiplicand
1         1     1         None
(3)
Multiplicand Y   11101010
Multiplier X     11001110
P0   0000000000101100   ; -2 × multiplicand
P2   0000000000000000   ; none
P4   1111111010100000   ; + multiplicand
P6   0000010110000000   ; - multiplicand
P = P0 + P2 + P4 + P6 = 0000010001001100

3. Carry-save (Wallace tree) multiplier
(1) Give the architecture of a 101 × 110 Wallace tree multiplier.
(2) Give the inputs of each carry-save adder.
Answer
(1) The three partial products PP0, PP1 and PP2 feed one carry-save adder (CSA); the sum and
carry outputs of the CSA then go to a traditional (carry-propagate) adder, which produces the
final product.
(2) Inputs of the CSA are the partial products of 101 × 110:
PP0 = 000       (101 × 0)
PP1 = 1010      (101 × 1, shifted left by 1)
PP2 = 10100     (101 × 1, shifted left by 2)
[The original figure shows the CSA stage built from half adders and full adders feeding a final
adder; the final sum is 101 × 110 = 11110.]
46
4. The equation of Average Memory Access Time (AMAT) has three components, including hit
time, miss rate, and miss penalty.
(1) Give the equation of AMAT in terms of hit time, miss rate and miss penalty.
For each of the following cache optimizations, indicate which components of the AMAT
equation can be improved. Explain the reasons.
(2) Using a multi-level cache instead of a primary cache.
(3) Using an M-way set-associate cache instead of a direct-mapped cache.
(4) Using larger blocks instead of smaller blocks.
Answer
(1) AMAT = hit time + miss rate × miss penalty
(2) Miss penalty: a second-level cache services first-level misses much faster than main memory.
(3) Miss rate: higher associativity reduces conflict misses.
(4) Miss rate: larger blocks exploit spatial locality, reducing compulsory misses.
5. Supposing that the industry trends show that a new process technology scales capacitance by
1/2, voltage by 1/2, and clock rate by 3, by what factor does the dynamic power scale?
Answer
Power_new = (3 × F_old) × (0.5 × C_old) × (0.5 × V_old)^2 = 0.375 × C_old × V_old^2 × F_old
          = 0.375 × Power_old
Thus dynamic power scales by a factor of 0.375.
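Since dynamic power is proportional to capacitive load × voltage squared × frequency, the scaling factor can be checked directly (a sketch of the formula above):

```python
def power_scale(cap, volt, freq):
    """Dynamic power scaling factor: P ~ C * V^2 * f."""
    return cap * volt**2 * freq

# Capacitance x1/2, voltage x1/2, clock rate x3 (from the problem).
factor = power_scale(0.5, 0.5, 3)
print(factor)  # 0.375
```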
6. For the MIPS assembly code below,
addi $t2, $0, 10
LOOP: lw $s1, 0($s0)
add $s2, $s2, $s1
addi $s0, $s0, 4
subi $t2, $t2, 1
bne $t2, $0, LOOP
(1) What is the total number of MIPS instructions executed?
(2) Translate the MIPS code into C code. Assume that $t2 holds the C-level integer, i, $s2 holds
the C-level integer, x, and $s0 holds the base address of the integer array, MemArray.
Answer
(1) The total number of instructions executed = 1 + 5 × 10 = 51
(2) for (i = 10; i > 0; i--)
{
x = x + MemArray[10 - i];
}

7. Assume that a CPU datapath contains five stages with different latencies below.

IF ID EX MEM WB
300ps 400ps 350ps 500ps 100ps

(1) What is the clock cycle time in a pipelined and a non-pipelined processor?
(2) What is the total latency of an add instruction in a pipelined and a non-pipelined processor?
(3) If we can split one stage of the pipelined datapath into two new stages for higher clock rate,
and each with half the latency of the original stage, which stage would you split, and what is
the new clock cycle time of the processor?
Answer
(1) Pipelined clock cycle: 500 ps (the longest stage); single-cycle clock cycle: 1650 ps (the sum of all stages).
(2) Pipelined add latency: 5 × 500 ps = 2500 ps; single-cycle add latency: 1650 ps.
(3) Stage to split: MEM; new clock cycle time: 400 ps (set by the next-longest stage, ID).
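The stage-latency reasoning can be sketched in Python (latencies from the problem; the 5-stage pipeline depth is given):

```python
stages = {"IF": 300, "ID": 400, "EX": 350, "MEM": 500, "WB": 100}  # ps

single_cycle = sum(stages.values())     # non-pipelined cycle = sum of stages -> 1650 ps
pipelined = max(stages.values())        # pipelined cycle = slowest stage    -> 500 ps
add_latency_pipe = 5 * pipelined        # add traverses all 5 stages         -> 2500 ps

# Split the slowest stage (MEM) into two 250 ps stages; the new cycle time is
# the maximum of the remaining stages and the two halves.
new_cycle = max(max(v for k, v in stages.items() if k != "MEM"), 250)  # 400 ps
```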
104 (2015) National Chung Cheng University, Computer Science
1. Please explain the reason why the single-cycle implementation is rarely used to implement any
instruction set of a processor.
Answer
The single-cycle datapath is inefficient, because the clock cycle must have the same length for
every instruction in this design. The clock cycle is determined by the longest path in the machine,
but several instruction classes could fit in a shorter clock cycle.

2. We want to design a carry-select adder to compute the addition of two 8-bit unsigned numbers
using ONLY 1-bit full adders and 2-to-1 multiplexers. The delay times of a 1-bit full adder
and a 2-to-1 multiplexer are DFA and DMX, respectively, where DMX is equal to 0.8 × DFA.
Please determine the minimum delay time for this carry-select adder.
Answer: The minimum delay time for this carry-select adder = 4 × DFA + DMX = 4.8 × DFA

Note: In the 8-bit carry-select adder, the low-order 4 bits are computed by a 4-bit ripple-carry
adder, whose carry out is ready after 4 × DFA. The high-order 4 bits are computed twice in
parallel, by one 4-bit ripple-carry adder with carry-in 0 and another with carry-in 1; when the
carry out of the low adder arrives, a row of 2-to-1 multiplexers selects the correct high-order
sum and carry, adding one DMX delay. Total delay = 4 × DFA + DMX.

3. The following techniques have been developed for cache optimization (improving hit time, miss
rate, or miss penalty): "non-blocking cache", "multi-banked cache", and "critical word first and
early restart". Please briefly explain these techniques and how they work.
Answer
Non-blocking cache is a cache that allows the processor to make references to the cache while
the cache is handling an earlier miss. Two implementations, hit under miss allows additional
cache hit during a miss and miss under miss allows multiple outstanding cache miss, are used to
hide cache miss latency.
Multi-banked cache: rather than treat the cache as a single monolithic block, divide into
independent banks that can support simultaneous accesses and can increase cache bandwidth.
Critical word first and early restart are both used to reduce miss penalty. Critical word first is
to organize the memory so that the requested word is transferred from the memory to the cache
first. The remainder of the block is then transferred. Early restart is simply to resume execution
as soon as the requested word of the block is returned, rather than wait for the entire block.

4. What are “3C cache misses”? List one technique to improve each of the 3C misses.
Answer
3Cs model: a cache model in which all cache misses are classified into one of 3 categories:
compulsory, capacity, and conflict misses.
Miss type      Technique to reduce it
Compulsory     Increase block size
Capacity       Increase cache size
Conflict       Increase associativity

5. Given the memory references (word addresses): 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186,
253, and a direct-mapped cache with 10 blocks. Indicate which of the above 12 memory
accesses will encounter a cache miss, if (1) each cache block has only 1 word, and (2) each
cache block has 10 words.
Answer
(1)
Word address   Block address   Tag   Index   Hit/Miss
3 3 0 3 Miss
180 180 18 0 Miss
43 43 4 3 Miss
2 2 0 2 Miss
191 191 19 1 Miss
88 88 8 8 Miss
190 190 19 0 Miss
14 14 1 4 Miss
181 181 18 1 Miss
44 44 4 4 Miss
186 186 18 6 Miss
253 253 25 3 Miss
(2)
Word address   Block address   Tag   Index   Hit/Miss
3 0 0 0 Miss
180 18 1 8 Miss
43 4 0 4 Miss
2 0 0 0 Hit
191 19 1 9 Miss
88 8 0 8 Miss
190 19 1 9 Hit
14 1 0 1 Miss
181 18 1 8 Miss
44 4 0 4 Hit
186 18 1 8 Hit
253 25 2 5 Miss
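Both tables can be reproduced with a tiny direct-mapped cache simulator (a sketch; block address = word address / block size, index = block address mod number of blocks, as above):

```python
def simulate(refs, words_per_block, num_blocks=10):
    """Return the Hit/Miss outcome for each word-address reference."""
    cache = [None] * num_blocks          # tag stored per block; None = invalid
    outcomes = []
    for addr in refs:
        block = addr // words_per_block
        idx, tag = block % num_blocks, block // num_blocks
        outcomes.append("Hit" if cache[idx] == tag else "Miss")
        cache[idx] = tag                 # fill/replace the block on a miss
    return outcomes

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
print(simulate(refs, 1).count("Hit"))    # 0  (1-word blocks: all 12 miss)
print(simulate(refs, 10).count("Hit"))   # 4  (10-word blocks: 2, 190, 44, 186 hit)
```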
104 (2015) National Sun Yat-sen University, Electrical Engineering

1. Explain and compare the following terminology pairs.
(a) TLB (Translation Lookaside Buffer) vs Page Table
(b) Interrupt-Driven I/O vs DMA
(c) VLIW vs Superscalar
(d) Multi-Core vs Cluster
Answer
(a) TLB: a cache that keeps track of recently used address mappings to avoid an access to the
page table.
Page Table: the table contains the virtual to physical address translations in a virtual
memory system.
(b) Interrupt-Driven I/O: an I/O scheme that employs interrupts to indicate to the processor
that an I/O device needs attention.
DMA: a mechanism that provides a device controller the ability to transfer data directly to
or from the memory without involving the processor.
(c) VLIW: a style of instruction set architecture that launches many operations that are defined
to be independent in a single wide instruction, typically with many separate opcode fields.
Superscalar: an advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle.
(d) Multi-Core: a multi-core processor is a single computing component with two or more
independent actual processing units (called "cores"), which are the units that read and
execute program instructions.
Cluster: a set of computers connected over a local area network (LAN) that function as a
single large multiprocessor.

2. Consider a four-level memory hierarchy, M1, M2, M3, and M4, with access times T1 = 10 nsec,
T2 = 50 nsec, T3 = 100 nsec, and T4 = 600 nsec. The cache hit ratio H1 = 0.85 at the first level,
H2 = 0.90 at the second level and H3 = 0.95 at the third level. Calculate the effective access time
of this memory system.
Answer: AMAT = 10 ns + 0.15 × 50 ns + 0.10 × 100 ns + 0.05 × 600 ns = 57.5 ns
Note: H1, H2 and H3 are treated as cumulative hit ratios, so 15% of all accesses go past M1,
10% go past M2, and 5% go past M3 (to M4).
Level       M1      M2      M3       M4
Hit time    10 ns   50 ns   100 ns   600 ns
Fraction reaching the next level: 0.15, 0.10, 0.05
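A quick check of the effective access time, under the same cumulative-hit-ratio reading of H1..H3 (a sketch of the answer's arithmetic):

```python
t = [10, 50, 100, 600]     # ns, access times of M1..M4
h = [0.85, 0.90, 0.95]     # cumulative hit ratios after M1, M2, M3

# Every access pays t[0]; the fraction missing each level pays the next level's time.
amat = t[0] + (1 - h[0]) * t[1] + (1 - h[1]) * t[2] + (1 - h[2]) * t[3]
print(amat)  # 57.5
```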
3. IEEE-754 floating-point representation
(a) Using 32-bit floating-point format (8-bit exponent, exponent bias = 127, and base = 2) to
represent -1/64.
(b) Using 64-bit floating-point format (11 -bit exponent, exponent bias = 1023, and base = 2) to
represent -1/32.
Answer
(a) -1/64 = -0.000001_2 = -1.0 × 2^-6
    -> 1 01111001 00000000000000000000000
(b) -1/32 = -0.00001_2 = -1.0 × 2^-5
    -> 1 01111111010 0000000000000000000000000000000000000000000000000000
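Both encodings can be verified against the machine's own IEEE-754 formats with Python's standard struct module (a verification sketch, not part of the original answer):

```python
import struct

# Single precision: sign 1, exponent 127 - 6 = 121 = 01111001, mantissa 0.
bits32 = struct.unpack(">I", struct.pack(">f", -1 / 64))[0]
print(f"{bits32:032b}")  # 1 01111001 00000000000000000000000

# Double precision: sign 1, exponent 1023 - 5 = 1018 = 01111111010, mantissa 0.
bits64 = struct.unpack(">Q", struct.pack(">d", -1 / 32))[0]
print(f"{bits64:064b}")  # 1 01111111010 0000...0
```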
4. Consider a 32-bit microprocessor that has an on-chip 16 Kbytes four-way set associative cache.
Assume that the cache has a line size of four words (each word is 32 bits).
(a) Show the 32-bit physical address (Show how many tag bits, set bits, and offset bits).
(b) Where in the cache (by indicating the set number) is the double word from memory location
ABCDE8F8 mapped?
Answer: Number of sets = 16 KB / (16 B × 4) = 256
(a)
Tag bits: 20   Set bits: 8   Offset bits: 4
(b) ABCDE8F8_16 = 1010 1011 1100 1101 1110 1000 1111 1000_2
Tag = 1010 1011 1100 1101 1110 (20 bits), Set = 1000 1111 (8 bits), Offset = 1000 (4 bits)
Set number: 10001111_2 = 143_10
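The field extraction can be checked with bit masks (a sketch using the 20/8/4 split derived above):

```python
addr = 0xABCDE8F8

offset = addr & 0xF            # low 4 bits: byte offset within the 16 B line
set_no = (addr >> 4) & 0xFF    # next 8 bits: set index (256 sets)
tag = addr >> 12               # remaining 20 bits: tag

print(set_no)  # 143
```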
5. A non-pipelined processor has a clock rate of 2.5 GHz and an average CPI (cycles per
instruction) of 4. An upgrade to this processor introduces a new processor with a five-stage
pipeline. However, due to internal pipeline delays, such as latch delay, the clock rate of the new
processor has to be reduced to 2 GHz and an average CPI of 1.
(a) What is the speedup achieved for a typical program with 100 instructions?
(b) What is the MIPS rate for the two processors, respectively?
<Note>: MIPS = Million Instructions per Second.
Answer
(a) Execution time for non-pipelined processor = (100 × 4) / 2.5 GHz = 160 ns
Execution time for pipelined processor = [(5 - 1) + 100] / 2 GHz = 52 ns
Speedup = 160 ns / 52 ns = 3.08
(b) MIPS for non-pipelined processor = 100 / (160 ns × 10^6) = 625
MIPS for pipelined processor = 100 / (52 ns × 10^6) = 1923.08
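A Python sketch of the timing and MIPS-rate arithmetic (the 4-cycle pipeline fill for the first instruction is as in the answer):

```python
n = 100                         # instructions in the program
t_nonpipe = n * 4 / 2.5         # ns: CPI 4 at 2.5 GHz -> 160 ns
t_pipe = ((5 - 1) + n) / 2.0    # ns: fill 4 cycles, then 1 per instruction at 2 GHz -> 52 ns

speedup = t_nonpipe / t_pipe                    # ~3.08
mips_nonpipe = n / (t_nonpipe * 1e-9) / 1e6     # 625
mips_pipe = n / (t_pipe * 1e-9) / 1e6           # ~1923.08
```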
104 (2015) National Sun Yat-sen University, Computer Science

NOTE: If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
1. The following table shows the percentage of MIPS instructions executed by category for
average of SPEC2000 integer programs and SPEC2000 floating point programs.
Instruction class    MIPS examples             Average CPI        Frequency (Integer / Floating point)
Arithmetic           add, sub, addi            1.0 clock cycles   24% / 48%
Data transfer        lw, sw, lb, sb, lui       1.4 clock cycles   36% / 39%
Logical              and, or, nor, andi, ori   1.0 clock cycles   18% / 5%
Conditional branch   beq, bne, slt, slti       1.7 clock cycles   18% / 6%
Jump                 j, jr, jal                1.2 clock cycles   4% / 2%
(1) Using the average instruction mix information for the program SPEC2000fp, find the
percentage of all memory accesses (both data and instruction) that are for reads. Assume
that two-thirds of data transfers are loads.
(2) Compute the effective CPI for MIPS. Average the instruction frequencies for
SPEC2000int and SPEC2000fp to obtain the instruction mix.
(3) Consider an architecture that is similar to MIPS except that it supports update addressing
for data transfer instructions. If we run SPEC2000int using this architecture, some
percentage of the data transfer instructions will be able to make use of the new
instructions, and for each instruction changed, one arithmetic instruction can be
eliminated. If 25% of the data transfer instructions can be changed, which will be faster
for SPEC2000int, the modified MIPS architecture or the unmodified architecture? How
much faster? (Assume mat both architectures have CPI values as given in the above table
and that the modified architecture has its cycle time increased by 20% in order to
accommodate the new instructions.)
Answer:
(1) Reads = (instruction fetches + loads) / all accesses = (1 + 0.39 × 2/3) / (1 + 0.39) = 0.91
(2) Effective CPI = 1.0 × 0.36 + 1.4 × 0.375 + 1.0 × 0.115 + 1.7 × 0.12 + 1.2 × 0.03 = 1.24
(3) Modified architecture CPI = 1.0 × (0.24 - 0.36 × 0.25) + 1.4 × 0.36 + 1.0 × 0.18 + 1.7 ×
0.18 + 1.2 × 0.04 = 1.188
Suppose instruction count = IC and clock cycle time for the MIPS architecture = T.
Unmodified architecture execution time = IC × 1.24 × T
Modified architecture execution time = (0.91 × IC) × 1.188 × 1.2T = IC × 1.30 × T
The unmodified architecture is 1.30 / 1.24 = 1.05 times faster than the modified architecture.
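Parts (1) and (2) can be checked in a few lines (mix percentages and CPIs from the table; a verification sketch only):

```python
int_mix = {"arith": 0.24, "data": 0.36, "logic": 0.18, "branch": 0.18, "jump": 0.04}
fp_mix = {"arith": 0.48, "data": 0.39, "logic": 0.05, "branch": 0.06, "jump": 0.02}
cpi = {"arith": 1.0, "data": 1.4, "logic": 1.0, "branch": 1.7, "jump": 1.2}

# (2): average the two mixes, then take the frequency-weighted CPI.
mix = {k: (int_mix[k] + fp_mix[k]) / 2 for k in cpi}
effective_cpi = sum(mix[k] * cpi[k] for k in cpi)               # ~1.24

# (1): every instruction is one read (fetch); 2/3 of data transfers are loads.
reads = (1 + fp_mix["data"] * 2 / 3) / (1 + fp_mix["data"])     # ~0.91
```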
2. Consider two different implementations, I1 and I2, of the same instruction set. There are three
classes of instructions (A, B, and C) in the instruction set. I1 has a clock rate of 4 GHz, and I2
has a clock rate of 2 GHz. The following table shows the average number of cycles for each
instruction class on I1 and I2.

Class   CPI on I1   CPI on I2   C1 Usage   C2 Usage   C3 Usage
A       2           1           40%        40%        50%
B       3           2           40%        20%        25%
C       5           2           20%        40%        25%
The table also contains a summary of the average proportion of instruction classes generated by
three different compilers. C1 is a compiler produced by the makers of I1, C2 is produced by
the makers of I2, and C3 is a third-party product. Assume that each compiler uses the same
number of instructions for a given program but that the instruction mix is as described in the
table.
(1) Using C1 on both I1 and I2, how much faster can the makers of I1 claim I1 is compared to
I2?
(2) Which computer and compiler would you purchase if all other criteria are identical,
including cost?
Answer:
(1) For C1:
CPI_I1 = 2 × 0.4 + 3 × 0.4 + 5 × 0.2 = 3. Instruction Time_I1 = 3 / 4 GHz = 0.75 ns
CPI_I2 = 1 × 0.4 + 2 × 0.4 + 2 × 0.2 = 1.6. Instruction Time_I2 = 1.6 / 2 GHz = 0.8 ns
I1 is 0.8 ns / 0.75 ns = 1.067 times faster than I2.
(2) For C2:
CPI_I1 = 2 × 0.4 + 3 × 0.2 + 5 × 0.4 = 3.4. Instruction Time_I1 = 3.4 / 4 GHz = 0.85 ns
CPI_I2 = 1 × 0.4 + 2 × 0.2 + 2 × 0.4 = 1.6. Instruction Time_I2 = 1.6 / 2 GHz = 0.8 ns
For C3:
CPI_I1 = 2 × 0.5 + 3 × 0.25 + 5 × 0.25 = 3. Instruction Time_I1 = 3 / 4 GHz = 0.75 ns
CPI_I2 = 1 × 0.5 + 2 × 0.25 + 2 × 0.25 = 1.5. Instruction Time_I2 = 1.5 / 2 GHz = 0.75 ns
Compiler C3 gives the lowest instruction time on both I1 and I2, so compiler C3 should be purchased.
Average instruction time for I1 = (0.75 + 0.85 + 0.75) / 3 = 0.78 ns
Average instruction time for I2 = (0.8 + 0.8 + 0.75) / 3 = 0.78 ns
I1 and I2 have the same average instruction time over the three compilers, so either computer can be purchased.
3. You are going to enhance a computer, and there are two possible improvements: either make
multiply instructions run four times faster than before, or make memory access instructions run
two times faster than before. You repeatedly run a program that takes 100 seconds to execute.
Of this time, 20% is used for multiplication, 50% for memory access instructions, and 30% for
other tasks.
(1) What will the speedup be if both improvements are made?
(2) You are going to change the program described in Problem 3 so that the percentages are
not 20%, 50%, and 30% anymore. Assuming that none of the new percentages is 0, what
sort of program would result in a tie with regard to speedup (i.e., the same speedup)
between the two individual improvements? Provide both a formula and some examples.
Answer:
(1) Speedup = 100 / (20/4 + 50/2 + 30) = 100 / 60 = 1.67
(2) The two individual improvements tie when the time saved is equal:
X - X/4 = Y - Y/2, i.e., (3/4) × X = (1/2) × Y,
where X = multiplication% and Y = memory%. Solving, we get X = (2/3) × Y. Many
examples thus exist: for example, multiplication% = 20%, memory% = 30%, other = 50%.
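Amdahl's law makes both parts easy to verify (a sketch; `speedup` is the standard overall-speedup formula for one enhanced fraction):

```python
def speedup(frac, s):
    """Overall speedup when a fraction `frac` of the time is sped up by factor `s`."""
    return 1 / ((1 - frac) + frac / s)

# (1): both improvements at once, starting from a 100-second program.
both = 100 / (20 / 4 + 50 / 2 + 30)    # 100 / 60 -> ~1.67

# (2): the example tie point, mult 20% (x4 faster) vs. memory 30% (x2 faster).
tie_mult = speedup(0.20, 4)
tie_mem = speedup(0.30, 2)
```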
4. Figure 1 gives the datapath of a pipelined processor with forwarding and hazard detection.
(1) We have a program of 1000 instructions in the format of "lw, add, lw, add,..." The add
instruction depends (and only depends) on the lw instruction right before it. The lw
instruction also depends (and only depends) on the add instruction right before it. If the
program is executed on the pipelined datapath of Figure 1, what would be the actual CPI?
(2) What would be the actual CPI for the program in Problem (1) without forwarding?
(3) Consider executing the following code on the pipelined datapath of Figure 1. How many
cycles will it take to execute this code?
lw $4, 100($2)
sub $6, $4, $3
add $2, $3, $6
add $7, $5, $2
(4) With regard to the code in Problem (3), explain what the forwarding unit is doing during
the sixth cycle of execution. If any comparisons are being made, mention them.
[Figure 1: the 5-stage pipelined MIPS datapath. The forwarding unit compares the EX/MEM
and MEM/WB destination registers with the ID/EX source registers; the hazard detection unit
checks ID/EX.MemRead against the IF/ID source registers and, on a load-use hazard, stalls via
PCWrite, IF/IDWrite, and a control-signal mux.]
Answer:
(1) Actual CPI = [(5 – 1) + 1000 + 500] / 1000 ≈ 1.5
(2) Actual CPI = [(5 – 1) + 1000 + 999  2] / 1000 ≈ 3
(3) Total clock cycles = (5 – 1) + 4 + 1 = 9
(4) Forwarding unit: it compares $6 (the EX/MEM destination register, from sub) with $3 and
with $6 (the sources of add $2, $3, $6 in EX), and forwards the sub result on the match with $6.
Hazard detection unit: it compares $3 (ID/EX.RegisterRt) with $5 and with $2 (the sources of
add $7, $5, $2 in ID); no load is in EX, so no stall is needed.
Note (4): The following table shows which instruction occupies each stage in the sixth cycle.
All control signals of the instruction in the WB stage have been cleared to 0.
Stage         ID               EX               MEM              WB
Instruction   add $7, $5, $2   add $2, $3, $6   sub $6, $4, $3   sub $6, $4, $3
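The CPI figures in (1) and (2) can be checked with a quick cycle count (a sketch; it assumes the 4-cycle fill of the 5-stage pipeline, 1 stall per lw-to-add pair with forwarding, and 2 stalls per adjacent dependence without forwarding):

```python
n = 1000          # instructions: alternating lw, add, lw, add, ...
fill = 5 - 1      # pipeline fill cycles before the first instruction completes

# With forwarding: each of the 500 lw->add pairs needs one load-use stall.
cpi_fwd = (fill + n + n // 2) / n            # ~1.5

# Without forwarding: all 999 adjacent dependences need two stalls each.
cpi_nofwd = (fill + n + (n - 1) * 2) / n     # ~3.0
```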
5. We have a program core consisting of several conditional branches. The program core will be
executed thousands of times. Below are the outcomes of each branch for one execution of the
program core (T for taken, N for not taken).
Branch 1: T-T-T-T, Branch 2: N-N-N-N-N, Branch 3: T-N-T-N-T-N. Branch 4: T-T-T-N-T
Assume the behavior of each branch remains the same for each program core execution. For
dynamic schemes, assume each branch has its own prediction buffer and each buffer initialized
to the same state before each execution. List the predictions and calculate the prediction
accuracy for the following branch prediction schemes:
(1) Always-taken
(2) 1-bit predictor (initialized to predict taken)
(3) 2-bit predictor as shown in Figure 2 (initialized to weakly predict taken)
Answer:
(1) Always-taken
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-T-T-T-T, right: 0, wrong: 5
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Total: right: 11, wrong: 9. Accuracy = 11/20 = 55%
(2) 1-bit predictor
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-N-N-N-N, right: 4, wrong: 1
Branch 3: prediction: T-T-N-T-N-T, right: 1, wrong: 5
Branch 4: prediction: T-T-T-T-N, right: 3, wrong: 2
Total: right: 12, wrong: 8. Accuracy = 12/20 = 60%
(3) 2-bit predictor
Branch 1: prediction: T-T-T-T, right: 4, wrong: 0
Branch 2: prediction: T-N-N-N-N, right: 4, wrong: 1
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Total: right: 15, wrong: 5. Accuracy = 15/20 = 75%
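The 2-bit case can be simulated directly (a sketch of the standard saturating counter: states 0..3, predict taken when the state is 2 or more, initialized to weakly taken as the problem requires):

```python
def predict_2bit(outcomes, state=2):
    """Count correct predictions of a saturating 2-bit predictor (2 = weakly taken)."""
    right = 0
    for taken in outcomes:
        right += (state >= 2) == taken                       # predict taken iff state >= 2
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return right

branches = {
    1: [True] * 4,                            # T-T-T-T
    2: [False] * 5,                           # N-N-N-N-N
    3: [True, False] * 3,                     # T-N-T-N-T-N
    4: [True, True, True, False, True],       # T-T-T-N-T
}
right = sum(predict_2bit(o) for o in branches.values())
total = sum(len(o) for o in branches.values())
print(right, total)  # 15 20 -> 75% accuracy
```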
6. The average memory access time (AMAT) is possibly useful as a figure of merit for different
cache systems.
(1) Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a
miss rate of 0.06 misses per reference, and a cache access time (including hit detection) of
1 clock cycle. Assume that the read and write miss penalties are the same and ignore other
write stalls.
(2) Suppose we can improve the miss rate to 0.04 misses per reference by doubling the cache
size. This causes the cache access time to increase to 1.2 clock cycles. Using the AMAT
as a metric, determine if this is a good trade-off.
(3) If the cache access time determines the processor's clock cycle time, which is often the
case, AMAT may not correctly indicate whether one cache organization is better than
another. If the processor's clock cycle time must be changed to match that of a cache, is
this a good trade-off? Assume the processors are identical except for the clock rate and the
number of cache miss cycles; assume 1.5 references per instruction and a CPI without
cache misses of 2. The miss penalty is 20 cycles for both processors.
Answer:
(1) AMAT = (1 + 0.06 × 20) × 2 ns = 4.4 ns
(2) AMAT = (1.2 + 0.04 × 20) × 2 ns = 4.0 ns
Yes, it is a good trade-off: doubling the cache size lowers the AMAT from 4.4 ns to 4.0 ns.
(3) Instruction time_old = (2 + 1.5 × 0.06 × 20) × 2 ns = 7.6 ns
Instruction time_new = (2 + 1.5 × 0.04 × 20) × 2.4 ns = 7.68 ns
No, it is not a good trade-off: with the clock cycle stretched to 2.4 ns, the larger cache is slower.
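A sketch of both comparisons (all parameters from the problem; in part (3) the 1.2-cycle access time stretches the clock to 2.4 ns):

```python
clock = 2.0  # ns

# (1) and (2): AMAT in cycles, converted to ns with the same 2 ns clock.
amat_old = (1.0 + 0.06 * 20) * clock    # 4.4 ns
amat_new = (1.2 + 0.04 * 20) * clock    # 4.0 ns  -> better by the AMAT metric

# (3): time per instruction, 1.5 references per instruction, base CPI 2.
t_old = (2 + 1.5 * 0.06 * 20) * 2.0     # 7.6 ns
t_new = (2 + 1.5 * 0.04 * 20) * 2.4     # 7.68 ns -> worse once the clock slows
```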
7. Consider three processors with different cache configurations:
Cache 1: Direct-mapped with two-word blocks
Cache 2: Direct-mapped with four-word blocks
Cache 3: Two-way set associative with four-word blocks
The following miss rate measurements have been made:
Cache 1: Instruction miss rate is 3.75%; data miss rate is 5%
Cache 2: Instruction miss rate is 2%; data miss rate is 4%
Cache 3: Instruction miss rate is 2%; data miss rate is 3%
For these processors, one-half of the instructions contain a data reference. Assume that the
cache miss penalty is 6 + (block size in words) cycles. The CPI for this workload was measured on a
processor with Cache 1 and was found to be 2.0.
(1) Assuming a cache of 16K blocks and a 32-bit address, find the total number of sets and the
total number of tag bits for Cache 1, Cache 2, and Cache 3.
(2) Determine which processor spends the most cycles on cache misses.
(3) The cycle times for the processors in Problem 7.2 are 400 ps for the first and second
processors and 300 ps for the third processor. Determine which processor is the fastest and
which is the slowest.
Answer:
(1)
Cache   No. of sets   Tag bits per block   Total tag bits
1       16K           32 - 14 - 3 = 15     15 × 16K = 240 Kbits
2       16K           32 - 14 - 4 = 14     14 × 16K = 224 Kbits
3       8K            32 - 13 - 4 = 15     15 × 16K = 240 Kbits
(Cache 3 has 8K sets but still 16K blocks, hence 16K tags.)
(2)
Cache   Miss penalty   Stall cycles per instruction
1       6 + 2 = 8      0.0375 × 8 + 0.5 × 0.05 × 8 = 0.5
2       6 + 4 = 10     0.02 × 10 + 0.5 × 0.04 × 10 = 0.4
3       6 + 4 = 10     0.02 × 10 + 0.5 × 0.03 × 10 = 0.35
The processor with Cache 1 spends the most cycles on cache misses.
(3) CPI_base = CPI_P1 - stall cycles per instruction = 2 - 0.5 = 1.5
CPI_P2 = 1.5 + 0.4 = 1.9
CPI_P3 = 1.5 + 0.35 = 1.85
Instruction time for P1 = 2 × 400 ps = 800 ps
Instruction time for P2 = 1.9 × 400 ps = 760 ps
Instruction time for P3 = 1.85 × 300 ps = 555 ps
P3 is the fastest and P1 is the slowest.
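The stall and timing arithmetic for all three caches can be sketched in one loop (parameters from the problem; half the instructions make a data reference):

```python
caches = {  # cache: (instr miss rate, data miss rate, block words, cycle time ps)
    1: (0.0375, 0.05, 2, 400),
    2: (0.02, 0.04, 4, 400),
    3: (0.02, 0.03, 4, 300),
}
base_cpi = 1.5  # measured CPI of 2.0 on Cache 1 minus its 0.5 stall cycles

times = {}
for n, (im, dm, words, cycle) in caches.items():
    penalty = 6 + words                          # miss penalty in cycles
    stalls = im * penalty + 0.5 * dm * penalty   # instruction + data miss stalls
    times[n] = (base_cpi + stalls) * cycle       # ps per instruction

print(times)  # {1: 800.0, 2: 760.0, 3: 555.0...} -> P3 fastest, P1 slowest
```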
104 (2015) National Chung Hsing University, Electrical Engineering

1. Consider two unsigned numbers A and B, with three bits shown as follows.
A = A2A1A0, B = B2B1B0
Let xi = AiBi + Ai'Bi' for i = 0, 1, and 2. Write the Boolean function for (A ≥ B).
Answer:
Boolean function for (A ≥ B) = A2B2' + x2A1B1' + x2x1A0B0' + x2x1x0
2. Describe three main technologies used in computer systems for I/O data transfer.
Answer:
Polling (programmed I/O): the processor periodically checking the status of an I/O device to
determine the need to service the device
Interrupt: I/O devices employ interrupts to indicate to the processor that they need attention.
DMA (direct memory access): a mechanism that provides a device controller with the ability to
transfer data directly to or from the memory without involving the processor
3. Two processors, i.e. CPU N and CPU K, have the same instruction set architecture. The two
processors execute the same program, which consists of 2500 instructions. CPU N has a clock
cycle time of 0.25ns, and CPI of 2.5 for the program. CPU K has a clock cycle time of 0.5ns,
and CPI of 1.5 for the same program.
(a) Determine the CPU time of CPU N and CPU K (in ns).
(b) Suppose the designer want to change the clock rate of CPU N, and then both CPU N and
CPU K have the same CPU time for the program. Find the new clock rate of CPU N (in
GHz).
Answer:
(a) CPU time for CPU N = 2500 × 2.5 × 0.25 ns = 1562.5 ns
CPU time for CPU K = 2500 × 1.5 × 0.5 ns = 1875 ns
(b) Let the new clock rate of CPU N be x:
1875 ns = (2500 × 2.5) / x, so x = 6250 / 1875 ns ≈ 3.33 GHz
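A two-line check of part (b) (a sketch; cycles per ns is the same as GHz):

```python
cycles_n = 2500 * 2.5        # total cycles on CPU N
t_k = 2500 * 1.5 * 0.5       # CPU K's time: 1875 ns, the target for CPU N

new_rate = cycles_n / t_k    # cycles per ns = GHz -> ~3.33 GHz
```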
4. The following C codes are compiled into the corresponding MIPS assembly codes. Assume
that i and k correspond to registers $s3 and $s5, and the base of the array save is in $s6.
C codes:
while (save[i] == k)
{
i += 1;
}
MIPS assembly codes:
Loop: sll $t1, $s3, 2
add $t1, $t1, OP1
OP3 $t0, 0(OP2)
bne $t0, $s5, Exit
addi $s3, $s3, 1
j Loop
Exit:
Please determine the proper values for operands (OP1, OP2), and the proper instruction for the
operator (OP3). Copy the following table (Table 1) to your answer sheet and fill in the two
operand values and the one instruction.
Table 1
Operand/Operator Value / Instruction
OP1
OP2
OP3
Answer:
Operand/Operator Value / Instruction
OP1 $s6
OP2 $t1
OP3 lw
5. The following MIPS assembly codes are compiled from the corresponding C codes. Assume
that base addresses for arrays x and y are found in $a0 and $a1, while i is in $s0. strcpy adjusts
the stack pointer and then saves the saved register $s0 on the stack.
MIPS assembly codes:
strcpy:
addi $sp, $sp, -4
sw $s0, 0($sp)
add $s0, $zero, $zero
L1: add $t1, $s0, $a1
lb $t2, 0($t1)
add $t3, $s0, $a0
sb $t2, 0($t3)
beq $t2, $zero, L2
addi $s0, $s0,1
j L1
L2: lw $s0, 0($sp)
addi $sp, $sp, 4
jr $ra
C codes:
void strcpy (char x[ ], char y[ ])
{
int i;
i = 0;
while (Code l)
Code 2;
}
Please determine the proper C codes for instructions (Code 1, Code 2). Copy the following
table (Table 2) to your answer sheet and fill in the two C code instructions.
Table 2
C code instructions
Code 1
Code 2
Answer:
C code instructions
Code 1 (x[i] = y[i]) != '\0'
Code 2 i += 1;
6. The basic single-cycle MIPS implementation in Figure 1 can only implement some
instructions. New instructions can be added to an existing Instruction Set Architecture (ISA),
but the decision whether or not to do that depends, among other things, on the cost and
complexity the proposed addition introduces into the processor datapath and control. Consider
the following new added instruction.
Instruction: LWI Rt, Rd(Rs)
Interpretation: Reg[Rt] = Mem[Reg[Rd] + Reg[Rs]]
(a) Which existing blocks (if any) can be used for this instruction?
(b) Which new functional blocks (if any) do we need for this instruction?
(c) What new signals do we need (if any) from the control unit to support this instruction?
Figure 1
Answer:
(a) This instruction uses instruction memory, both register read ports, the ALU to add Rd and
Rs together, data memory, and write port in Registers.
(b) None.
(c) None.
7. Consider three branch prediction schemes: predict not taken, predict taken, and dynamic
prediction. Assume that they have zero penalty when they predict correctly and two cycles
when they are wrong. Assume that the average predict accuracy of the dynamic predictor is
90%. Which predictor is the best choice for the following branches?
(1) A branch that is taken with 5% frequency
(2) A branch that is taken with 95% frequency
(3) A branch that is taken with 70% frequency
Answer:
(Each entry is the prediction accuracy of that scheme for the given branch.)
                 Predict not taken   Predict taken   Dynamic prediction   Best choice
(1) taken 5%     0.95                0.05            0.9                  Predict not taken
(2) taken 95%    0.05                0.95            0.9                  Predict taken
(3) taken 70%    0.3                 0.7             0.9                  Dynamic prediction
8. We examine how data dependences affect execution in the basic 5-stage pipeline. Consider the
following sequence of instructions:
or r1, r2, r3
or r2, r1, r4
or r1, r1, r2
Also, assume the following cycle times for each of the options related to forwarding:
Without forwarding    With full forwarding    With ALU-ALU forwarding only
250 ps                300 ps                  290 ps
(a) Indicate dependences and their type.
(b) Assume there is no forwarding in this pipelined processor. Indicate hazards and add nop
instructions to eliminate them.
(c) Assume there is full forwarding. Indicate hazards and add nop instructions to eliminate
them.
(d) What is the total execution time of this instruction sequence without forwarding and with full
forwarding? What is the speedup achieved by adding full forwarding to a pipeline that had
no forwarding?
(e) Add nop instructions to this code to eliminate hazards if there is ALU-ALU forwarding
only (no forwarding from the MEM to the EX stage).
Answer:
(a)
Instruction sequence     Dependences
I1: or r1, r2, r3        RAW on r1 from I1 to I2 and I3
I2: or r2, r1, r4        RAW on r2 from I2 to I3
I3: or r1, r1, r2        WAR on r2 from I1 to I2
                         WAR on r1 from I2 to I3
                         WAW on r1 from I1 to I3
(b)
or r1, r2, r3
nop            Delay I2 to avoid RAW hazard on r1 from I1 to I2
nop
or r2, r1, r4
nop            Delay I3 to avoid RAW hazard on r2 from I2 to I3
nop
or r1, r1, r2
(c)
or r1, r2, r3
or r2, r1, r4        No RAW hazard on r1 from I1 (forwarded)
or r1, r1, r2        No RAW hazard on r2 from I2 (forwarded)
(d)
No forwarding: (5 − 1 + 3 + 4) × 250 ps = 2750 ps
With full forwarding: (5 − 1 + 3) × 300 ps = 2100 ps
Speedup: 2750 / 2100 = 1.31
(e)
or r1, r2, r3
or r2, r1, r4
nop            RAW hazard on r1 from I1 to I3 cannot use ALU-ALU forwarding
nop
or r1, r1, r2
104 (2015) National Chung Hsing University, Computer Science
1. True or False Questions:
(a) The control hazard can be eliminated by the forwarding method.
(b) The 2-way set-associative cache means that a cache has at least two locations where each
block can be placed.
(c) First-level caches are more concerned about hit time, and second-level caches are more
concerned about miss rate.
(d) RAID 3 is popular in applications with large data sets, such as multimedia and some
scientific codes.
(e) To benefit from a multiprocessor, an application must be concurrent.
Answer:
(a)    (b)    (c)    (d)    (e)
False  False  True   True   False
Note (b): a 2-way set-associative cache means that each memory block can be placed in exactly
two locations in the cache.
2. Machine A has a clock rate of 1 GHz and a CPI of 2.2 for some program, and machine B has a
clock rate of 500 MHz and a CPI of 1.2 for the same program. Which machine is faster for this
program, and by how much?
Answer:
Time per instruction on Machine A = 2.2 / 1 GHz = 2.2 ns
Time per instruction on Machine B = 1.2 / 500 MHz = 2.4 ns
Machine A is (2.4 ns / 2.2 ns) = 1.09 times faster than Machine B.
104 (2015) National Taiwan University of Science and Technology, Electronic Engineering
1. Please explain the following terms:
(a) Compulsory misses
(b) Capacity misses
(c) Conflict misses
Answer:
(a) Compulsory misses: a cache miss caused by the first access to a block that has never been
in the cache.
(b) Capacity misses: a cache miss that occurs because the cache, even with full associativity,
cannot contain all the blocks needed to satisfy the request.
(c) Conflict misses: a cache miss that occurs in a set-associative or direct-mapped cache when
multiple blocks compete for the same set.
2. Pipeline
(a) Please explain how a MIPS processor can exploit the instruction-level parallelism (ILP)?
(b) Please draw the datapath of the MIPS processor with the control unit.
Answer:
(a) The MIPS pipeline partitions instruction execution into balanced stages and overlaps the
execution of consecutive instructions, thereby exploiting ILP among the instructions.
(b) (Figure: the standard five-stage MIPS datapath with its control unit; the drawing is omitted
in this transcription.)
3. Assume that the miss rate of an instruction cache is 3% and the miss rate of the data cache is
5%. If a processor has a CPI of 3 without any memory stalls and the miss penalty is 100 cycles
for all misses. How much faster can the processor run with a perfect cache that never misses?
Assume the frequency of all loads and stores is 30%.
Answer:
CPI_effective = 3 + 1 × 0.03 × 100 + 0.3 × 0.05 × 100 = 7.5
The processor with a perfect cache is 7.5 / 3 = 2.5 times faster than the one without it.
4. Please explain branch prediction buffer and use 2-bit prediction scheme as an example.
Answer:
Branch prediction buffer is a small memory that is indexed by the lower portion of the address
of the branch instruction and that contains one or more bits indicating whether the branch was
recently taken or not. Using the 2-bit prediction scheme as an example, each entry of the branch
prediction buffer holds a 2-bit saturating counter: the branch is predicted not taken when the two
bits are 00 or 01 and taken when they are 10 or 11, so a prediction must be wrong twice before it
changes.
5. Virtual Memory
(a) Please explain TLB, cache, and page table.
(b) How to integrate TLB, cache, and page table together.
Answer:
(a) TLB: A cache that keeps track of recently used address mappings to avoid an access to the
page table.
Cache: a fast memory between CPU and main memory.
Page table: the table that contains the virtual-to-physical address translations in a virtual
memory system.
(b) The following diagram shows the integration of TLB, cache, and page table.
(Diagram: the CPU issues a virtual address to the TLB; on a TLB hit, the resulting physical
address is sent to the cache; on a TLB miss, the page table is consulted to refill the TLB.)
104 (2015) National Taiwan University of Science and Technology, Computer Science
1. The following figure shows the pipelined datapath with control signals. For the instruction
sequence below, please answer the following questions.
add $1, $2, $3
and $1, $1, $4
lw $5, 4($1)
sub $7, $5, $6
or $8, $5, $7
(Figure: the five-stage pipelined datapath with control, showing the hazard detection unit, the
forwarding unit, and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; multiplexors
A, B, C, control lines D, E, F, and datapath lines X, Y, Z are labeled in the original figure.)
(1) Please give the control signals of multiplexors A, B, C and control lines D, E, F for clock
cycle 5. Note that the answer for each signal should be one of the follows: 0, 1, 00, 01, 10,
or 11.
A B C D E F
Control Signal
(2) For clock cycle 5, what are sent through lines X, Y, and Z? Please write down the
corresponding register numbers ($1, $2 ... $8).

X Y Z
Register Number
(3) How many cycles does it take to finish the execution of this instruction sequence?
Answer:
(1)
A B C D E F
Control Signal 0 0 10 1 1 1
(2)
X Y Z
Register Number $5/$6 $5 $1
(3) (5 – 1) + 5 + 1 = 10 clock cycles
2. Below are some assumptions of memory access times:
• 1 memory bus clock cycle to send the address
• 10 memory bus clock cycles for each DRAM access initiated
• 1 memory bus clock cycle to send a word of data
Based on these assumptions, please calculate the miss penalty of a four-word block for the
following organizations of memory.
(1) One-word-wide memory organization: the memory and the bus between the processor and
the memory are both one word wide.
(2) Wider memory organization: the memory and the bus between the processor and the
memory are widened to two words.
(3) Interleaved memory organization: the memory chips are organized in four banks, where
each bank is one word wide, without widening the interconnection bus (in other words, the
bus is still one word wide).
Answer:
(1) Miss penalty = 1 + (10 + 1) × 4 = 45 clock cycles
(2) Miss penalty = 1 + (10 + 1) × 2 = 23 clock cycles
(3) Miss penalty = 1 + 10 + 1 × 4 = 15 clock cycles
3. Consider the following performance measurements for a program:
Measurement Computer A Computer B Computer C
Instruction count 10 billion 10 billion 9 billion
Clock rate 2 GHz 4 GHz 2 GHz
CPI 1.1 2.4 1.2
For each of the following statements, decide TRUE or FALSE.
(1) Computer B is faster than Computer A for this program.
(2) Computer C is faster than Computer A for this program.
(3) Computer C has higher MIPS (million instructions per second) rating than Computer A for
this program.
(4) Suppose the program spends 10% of time on addition and 60% of time on multiplication.
This program can run 2 times faster by improving only the speed of addition.
(5) Suppose the program spends 10% of time on addition and 60% of time on multiplication.
This program can run 2 times faster by improving only the speed of multiplication.
Answer:
(1)    (2)    (3)    (4)    (5)
False  True   False  False  True
Note (1, 2): ExeTime_A = (10 × 10^9 × 1.1) / (2 × 10^9) = 5.5 sec.
ExeTime_B = (10 × 10^9 × 2.4) / (4 × 10^9) = 6 sec.
ExeTime_C = (9 × 10^9 × 1.2) / (2 × 10^9) = 5.4 sec.
Note (3): MIPS_A = 2 × 10^9 / (1.1 × 10^6) = 1818.18
MIPS_C = 2 × 10^9 / (1.2 × 10^6) = 1666.67
Note (4, 5): by Amdahl's law, improving only addition gives a speedup of at most
1 / (1 − 0.1) ≈ 1.11 < 2, while improving only multiplication can give up to 1 / (1 − 0.6) = 2.5,
so a 2× overall speedup is possible (multiplication must be made 6 times faster).
104 (2015) National Taiwan Normal University, Computer Science
1. Convert the following unsigned 8-bit binary numbers to decimal numbers:
(a) 01110101
(b) 11100000
Answer:
(a) 01110101_2 = 2^6 + 2^5 + 2^4 + 2^2 + 2^0 = 117_10
(b) 11100000_2 = 2^7 + 2^6 + 2^5 = 224_10
2. Consider a CPU with clock cycle time 0.5 nanosecond (or 0.5 ns). Suppose the CPU executes a
program with 1000 instructions. The average CPI (clock cycles per instruction) is 2.0 for the
program.
(a) Find the clock rate of the CPU (in gigahertz, or GHz).
(b) Find the CPU time for executing the program (in ns).
Answer:
(a) 1 / (0.5 ns) = 2 GHz
(b) CPU time = 1000 × 2 × 0.5 ns = 1000 ns
3. Consider a MIPS processor with separate instruction and data memories. Suppose the
following code sequence is executed on the processor.
LW R4, 24(R2); R4 ← MEM[R2 + 24]
SUB R5, R1, R4; R5 ← R1 - R4
LW R12, 32(R2); R12 ← MEM[R2 + 32]
ADD R8, R12, R4; R8 ← R12 + R4
ADD R10, R2, R8; R10 ← R2 + R8
LW R6, 8(R1); R6 ← MEM[R1 + 8]
LW R11, 16(R1); R11 ← MEM[R1 + 16]
SUB R9, R11, R6; R9 ← R11 - R6
(a) Determine the number of accesses to the instruction memory.
(b) Determine the number of accesses to the data memory.
(c) Suppose there is only one miss in the instruction memory for the code sequence, Compute
the miss rate of the instruction memory,
(d) Suppose there are two misses in the data memory for the code sequence. Compute the
miss rate of the data memory.
Answer:
(a) The number of accesses to the instruction memory is 8.
(b) The number of accesses to the data memory is 4.
(c) Miss rate of the instruction memory is 1/8 = 0.125.
(d) Miss rate of the data memory is 2/4 = 0.5.
4. Consider a five-stage (IF, ID, EX, MEM and WB) MIPS pipeline processor with hazard
detection and data forwarding units. Assume the processor includes separate instruction and
data memories so that the structural hazard for memory references can be avoided.
(a) Suppose the following code sequence is executed on the processor. Identify all the data
hazards which can be solved by forwarding.
ADD R5, R7, R12; R5 ← R7 + R12
ADD R4, R5, R6; R4 ← R5 + R6
LW R8, 12(R4); R8 ← MEM[12 + R4]
(b) Repeat part (a) for the following code sequence.
LW R9, 20(R7); R9 ← MEM[R7 + 20]
ADD R1, R9, R5; R1 ← R9 + R5
SW R1, 12(R6); MEM[R6 + 12] ← R1
(c) Suppose the following code sequence is executed on the processor. Determine the total
number of clocks needed to execute the code sequence.
ADD R6, R2, R3; R6 ← R2 + R3
SUB R8, R5, R6; R8 ← R5 - R6
Answer:
(a) (ADD, ADD) for register R5; (ADD, LW) for register R4
(b) (ADD, SW) for register R1
(c) (5 – 1) + 2 = 6 clock cycles
5. Briefly explain the following terms.
(a) Program Counter.
(b) Control Hazard.
(c) Direct Mapped Cache.
Answer:
(a) Program Counter: the register containing the address of the instruction in the program being
executed.
(b) Control Hazard: when the proper instruction cannot execute in the proper pipeline clock
cycle because the instruction that was fetched is not the one that is needed; that is, the flow
of instruction addresses is not what the pipeline expected.
(c) Direct Mapped Cache: a cache structure in which each memory location is mapped to
exactly one location in the cache.
104 (2015) National Chi Nan University, Computer Science
1. Explain the following terms.
(1) Yield
(2) Normalization or SPECratio
(3) Least/most significant bit
(4) Branch target address
(5) Way-associative cache
Answer:
(1) Yield: The percentage of good dies from the total number of dies on the wafer.
(2) SPECratio is defined as reference time divided by target execution time.
(3) Least/most significant bit: The rightmost/leftmost bit in a MIPS word.
(4) Branch target address: the address specified in a branch, which becomes the new program
counter (PC) if the branch is taken.
(5) Way-associative cache: a cache that has a fixed number of locations (at least two) where
each block can be placed.
2. Assume that executing program A on a single CPU takes 100 cycles, and that parallelizing
program A incurs a fixed 10-cycle overhead. When program A is parallelized across 10 CPUs,
what fraction of program A must be parallelizable (e.g., x% of A) so that we get a speedup of 5
with 10 CPUs?
Answer: Let T be the number of cycles of program A that can be parallelized. Then
100 / ((100 − T) + T/10 + 10) = 5, so 110 − 0.9T = 20, giving T = 100. So 100% of the program
should be parallelizable.
3. How to decide the cycle time of a CPU with pipelining?
Answer:
The slowest stage sets the cycle time: the longest latency among the pipeline stages determines
the pipeline clock cycle time.
4. Answer the following questions related to caches.
(a) Name one cache replacement policy that you learned and explain the policy.
(b) Is it necessary for a direct-mapped cache to utilize the cache replacement policy described
above? Why?
(c) To reduce compulsory misses, what kind of method do you propose to utilize?
Answer:
(a) LRU replacement scheme: the block replaced is the one that has been unused for the longest
time.
(b) It is not necessary for a direct-mapped cache to utilize a cache replacement policy, because
the requested block can go in exactly one position, so the block occupying that position must be
replaced.
(c) Increase block size.
104 (2015) National University of Kaohsiung, Computer Science
Multiple-Choice Questions
1. Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all
the miss handling. Suppose the miss rate per instruction at the primary cache is 1%. What will
the new CPI be if we add a secondary cache that has a 5 ns access time for either a
hit or a miss and is large enough to reduce the miss rate to main memory to 0.4%?
(a) 1.2
(b) 2.6
(c) 2.8
(d) 6.5
Answer: (c)
Note: the 5 ns L2 access is 20 cycles at 4 GHz and the 100 ns memory access is 400 cycles, so
CPI = 1 + 0.01 × 20 + 0.004 × 400 = 2.8
2. Given an IEEE 754 single precision binary representation
00111111110000000000000000000000. What decimal number is represented by the float?
(a) 0.5
(b) 0.75
(c) 1.5
(d) 7.5
Answer: (c)
Note: sign 0, exponent 01111111 = 127, so the value is 1.1_2 × 2^(127−127) = 1.1_2 = 1.5_10
3. Which statement is not true?
(a) Branch Prediction Buffer can reduce control hazard.
(b) Forwarding can reduce data hazard
(c) Memory interleaving is a technique for reducing memory access time through increased
bandwidth utilization of the data bus
(d) Compulsory misses occur when the cache cannot contain all the blocks needed during
execution of a program
Answer: (d)
Note: statement (d) describes capacity misses, not compulsory misses.
4. Assume that individual stages of the datapath have the following latencies:
IF ID EX MEM WB
250ps 350ps 150ps 300ps 200ps
What is the total latency of an LW instruction in a pipelined and non-pipelined processor?
(a) 1750ps, 1250ps
(b) 1750ps, 1050ps
(c) 1250ps, 1250ps
(d) 1750ps, 1050ps
Answer: (a)
Note: the latency of an LW instruction in a pipelined processor = 350 × 5 = 1750 ps; in a
non-pipelined processor = 250 + 350 + 150 + 300 + 200 = 1250 ps
5. Which statement is true?
(a) For floating-point numbers, x, y, and z, (x + y) + z = x + (y + z).
(b) The conflict misses may occur in a fully-associative cache.
(c) For a five-stage MIPS architecture, Translation-Lookaside Buffer should be used in IF and
MEM stages.
(d) The write-through mechanism is more suitable for virtual memory systems than the
write-back mechanism.
Answer: (c)
Fill-in-the-Blank Questions
6. ______ switches threads only on costly stalls, such as a level-2 cache miss.
7. ______ is caused by the first access to a block that has never been in the cache.
8. ______ is a structure that holds the destination program counters for branch instructions.
9. ______ is the principle stating that if a data location is referenced, data locations with nearby
addresses will tend to be referenced soon.
10. The section of a process containing temporary data such as function parameters, return
addresses, and local variables is called ______.
Answer:
6: Coarse-grained multithreading
7: Compulsory miss
8: Branch target buffer
9: Spatial locality
10: Activation record
Short-Answer Questions
11. The five stages of MIPS pipeline are IF (instruction fetch), ID (Instruction decode and register
read), EXE (Execute operation or calculate address), MEM (Access memory operand), and
WB (Write result back to register). Given the following code:
add $5, $2, $1
lw $3, 4($5)
lw $2, 0($2)
or $3, $5, $3
sw $3, 0($5)
Suppose the data hazards must be resolved by “stalling” the dependent instructions until the
needed operand is written back to the register file. We assume that when the needed operand is
written back to the register file, the dependent instruction can read the needed operand from
the register file in the same clock cycle. How many NOPs and at what places that you add to
make the code segment execute correctly? How many cycles do these instructions execute?
You must show how to get the answer.
Answer: 5 NOPs are required, and the execution time = (5 − 1) + 5 + 5 = 14 clock cycles.
add $5, $2, $1
NOP
NOP
lw $3, 4($5)
lw $2, 0($2)
NOP
or $3, $5, $3
NOP
NOP
sw $3, 0($5)
12. Assume that a two-way set-associative cache of 1K blocks, 1-word block size, and a 32-bit
address. How many total bits are required for the cache (including valid bits)? You must show
how to get the answer.
Answer:
Number of sets = 1K / 2 = 512, so the index is 9 bits; with 1-word blocks the byte offset is 2 bits,
giving a tag of 32 − 9 − 2 = 21 bits.
Total bits required for the cache = (1 + 21 + 32) × 1K = 54 Kbits
13. Consider a computer running a program that requires 300s, with 100s spent executing FP
instructions, 75s spent executing Load/Store instructions, and 40s spent executing branch
instructions. What will the speedup be if the times for FP and branch instructions are reduced
by 30% and 50%, respectively? You must show how to get the answer.
Answer:
Speedup = 300 / (300 − 100 × 0.30 − 40 × 0.50) = 300 / 250 = 1.2