Assignment 2
QUESTION 1
a. The CPU time of a program is defined as the product of the CPI (cycles per
instruction) for the processor on which it runs, the total number of instructions
executed (I), and the processor clock period (φ). Describe the major factors which
influence CPI, I and φ. [7]
b. For a new architecture to be worth developing it must have a commercial lifespan of
at least 10 years. What long-term factors must designers of a new architecture take
into consideration during the design process?
[7]
c. Microprocessor core speeds increase at a rate of 40–60% per annum, compared with
speed increases of 30% every ten years for DRAM devices. In the light of this increasing
discrepancy between CPU and main memory speeds, what can architects, system designers
and memory chip designers do to reduce the harmful effects of high memory latency in future
computer systems?
SOLUTIONS
a.
Factors influencing I (instruction count):
Program complexity: The complexity of the program and the algorithms used affect the number
of instructions required to accomplish a task. More complex programs tend to have a higher
instruction count.
Loop iterations: Programs with loops execute the same set of instructions multiple times. The
number of loop iterations directly impacts the total instruction count.
Compiler and instruction set: The quality of the compiler's code generation and the
expressiveness of the instruction set determine how many machine instructions a given piece
of source code requires.
Factors influencing CPI:
Instruction mix: Different instruction classes (ALU operations, loads and stores, branches,
multiplies) take different numbers of cycles, so the proportions in which they occur determine
the average CPI.
Stalls and hazards: Cache misses, data and control hazards, and branch mispredictions insert
stall cycles into execution and raise the effective CPI.
Factors influencing φ (clock period):
Clock frequency: The clock period is the reciprocal of the clock frequency. Higher clock
frequencies result in shorter clock periods, reducing overall execution time.
Microarchitecture: The design and implementation of the processor's microarchitecture
determine the critical path. Improvements such as deepening the pipeline so each stage
contains less logic, or enhancing circuitry, can decrease the clock period.
Fabrication technology: Faster transistors and shorter wire delays in newer process
technologies directly shorten the achievable clock period.
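As a quick worked example with assumed values (I = 2 × 10⁹ instructions, CPI = 1.5, φ = 0.5 ns, all
chosen purely for illustration):

\[
T_{\text{CPU}} = I \times \text{CPI} \times \varphi
= (2 \times 10^{9}) \times 1.5 \times (0.5 \times 10^{-9}\,\text{s}) = 1.5\,\text{s}
\]

Reducing any one of the three factors reduces CPU time proportionally.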
b. What long-term factors must designers of a new architecture take into consideration
during the design process?
Compatibility and software base: The architecture must attract and retain a software ecosystem
(compilers, operating systems, applications), and later implementations must remain backward
compatible with code written for earlier ones.
Technology trends: Designers must anticipate a decade of improvement in process technology
(transistor density, clock speeds, memory capacity) so that future implementations can exploit
these gains rather than being constrained by early design decisions.
Address space: The architecture must provide a large enough address space to accommodate
many years of growth in program and data sizes; an exhausted address space has shortened the
life of several past architectures.
Extensibility: The instruction set should leave room (for example, unused opcode space) for
future extensions such as new data types or vector/multimedia instructions.
c. In the light of this increasing discrepancy between CPU and main memory speeds,
what can architects, system designers and memory chip designers do to reduce the
harmful effects of high memory latency in future computer systems?
Cache Hierarchies: Enhancing the cache hierarchy can help bridge the gap between CPU and
main memory speeds. By incorporating multiple cache levels with varying sizes and access
latencies, architects can improve data locality and reduce the frequency of memory
accesses, thereby reducing the impact of high memory latency.
Prefetching Techniques: Implementing intelligent prefetching techniques can help
anticipate and fetch data from main memory before it is explicitly requested by the CPU.
This proactive approach can hide memory latency by bringing data closer to the CPU in
advance (a small software-prefetch sketch follows this list).
Memory Controllers: Optimizing memory controllers is crucial for efficient memory access.
Designers can employ techniques like memory interleaving, bank-level parallelism, and
command pipelining to improve memory channel utilization and reduce latency.
Memory Technology Advancements: Memory chip designers can focus on developing new
memory technologies that offer lower latency and higher bandwidth. Technologies like High-
Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), or phase-change memory (PCM)
provide potential solutions to address the memory latency challenge.
Non-Volatile Memory (NVM): NVM, such as NAND flash or emerging technologies like
Resistive RAM (ReRAM) or Magnetoresistive RAM (MRAM), can offer lower latency
compared to traditional storage devices. Integrating NVM as a cache or main memory can
help reduce the impact of high memory latency.
Compression and Decompression: Employing compression techniques in the memory
subsystem can reduce the amount of data transferred between CPU and memory, improving
effective bandwidth and hiding part of the memory latency. Decompression can be performed
near the CPU to minimize the additional latency introduced.
Hardware-Software Co-design: Collaboration between architects, system designers, and
software developers can lead to optimized solutions. Designers can work closely with
software developers to develop memory-access-aware algorithms and techniques that
minimize the harmful effects of high memory latency.
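As an illustration of software prefetching, here is a minimal C sketch using the GCC/Clang
__builtin_prefetch intrinsic. The prefetch distance of 16 elements and the array-summing
workload are arbitrary assumptions for the example, not part of the original question.

#include <stddef.h>

/* Sum an array while prefetching ahead, so the memory latency of
 * a[i + PREFETCH_DIST] is overlapped with the useful work on a[i]. */
#define PREFETCH_DIST 16

long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1);
        }
        sum += a[i];
    }
    return sum;
}

In practice, hardware prefetchers already handle simple sequential streams like this one well;
explicit software prefetching tends to pay off on irregular access patterns (e.g., pointer
chasing) where the hardware cannot predict the next address.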
QUESTION 2
a. Name two RISC and two CISC processors. What are the main characteristics of RISC
processors? [4]
b. Define
i. Superscalar. [2]
ii. Super-pipeline. [2]
iii. Derive the equation for ideal speedup for a superscalar super-pipelined
processor compared to a sequential processor. Assume N instructions, k-stage
scalar base pipeline, superscalar degree of m, and super-pipeline degree of n.
[4]
c. For the same program, two different compilers are used. The table below shows the
execution time of the two different compiled programs.
i. Find the average CPI for each program given that the processor has a clock
cycle time of 1ns.
[4]
ii. Assume the average CPIs found in part (i), but that the compiled programs run
on two different processors. If the execution times on the two processors are
the same, how much faster is the clock of the processor running compiler A’s
code versus the clock of the processor running compiler B’s code?
[4]
SOLUTIONS
a. Name two RISC and two CISC processors. What are the main
characteristics of RISC processors?
[4]
Two RISC processors: the ARM Cortex-A series and MIPS. Two CISC processors: the Intel x86
family (e.g., the Pentium/Core processors) and the Motorola 68000.
ARM Cortex-A series: The ARM Cortex-A series, developed by ARM Holdings, is a popular example of
a RISC processor architecture. It is widely used in mobile devices, embedded systems, and low-
power applications. The main characteristics of ARM Cortex-A processors include:
Simple Instruction Set: RISC processors like ARM Cortex-A have a reduced instruction set, with a
focus on simple and fixed-length instructions. This simplicity allows for faster decoding and
execution.
Load-Store Architecture: RISC processors typically use a load-store architecture, where data must be
loaded into registers before manipulation and stored back to memory after processing. This design
simplifies instruction execution and improves performance.
Pipelining: RISC processors heavily utilize pipelining, breaking down instructions into multiple stages
to improve instruction throughput and achieve high performance.
Register File: RISC processors generally have a large number of general-purpose registers, typically
32 or more. This reduces the need for memory access, improving execution speed.
MIPS (Microprocessor without Interlocked Pipeline Stages): MIPS is another well-known RISC
processor architecture. It was developed by MIPS Technologies and found success in various
applications, including embedded systems, gaming consoles, and networking devices. The main
characteristics of MIPS processors include:
Fixed Instruction Length: MIPS processors have a fixed instruction length of 32 bits, simplifying
instruction decoding and pipeline design.
Load-Store Architecture: Similar to other RISC architectures, MIPS processors use a load-store
architecture, separating memory access instructions from arithmetic and logical instructions.
Delayed Branches: MIPS processors employ delayed branches, where the instruction following a
branch is always executed, regardless of whether the branch is taken or not. This technique helps
maintain pipeline efficiency.
Register Architecture: MIPS processors typically have a large number of general-purpose registers,
commonly 32. This reduces memory access and enhances performance.
b. Define
i. Superscalar.
A Superscalar processor is a type of microprocessor architecture
that enables parallel execution of multiple instructions within a
single clock cycle. It aims to improve instruction throughput and
overall performance by simultaneously executing multiple
instructions that are independent of each other.
ii. Super-pipeline.
In microprocessor architecture, a Super-pipeline refers to a design
approach that divides each stage of a conventional pipeline into several
shorter stages, producing a pipeline with a significantly higher number of
stages. Because each stage contains less logic, the clock period can be
shortened. The objective of a Super-pipeline is to maximize instruction
throughput by exploiting deeper instruction-level parallelism.
iii. Derive the equation for ideal speedup for a superscalar super-pipelined
processor compared to a sequential processor. Assume N instructions,
k-stage scalar base pipeline, superscalar degree of m, and super-
pipeline degree of n. [4]
A sequential (non-pipelined) processor takes k cycles per instruction, so executing N instructions
takes:
T_sequential = N * k base cycles
For the superscalar super-pipelined processor, the N instructions are issued as N / m groups (also
known as bundles) of m instructions executed in parallel, and the super-pipeline divides each base
cycle into n shorter cycles, so each bundle occupies k / n base cycles:
T_pipelined = (N / m) * (k / n) base cycles
The ideal speedup (S) is the ratio of the two:
S = (N * k) / ((N / m) * (k / n)) = m * n
Hence, the equation for the ideal speedup of a superscalar super-pipelined processor compared to a
sequential processor is simply the product of the superscalar degree (m) and the super-pipeline
degree (n).
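For completeness, a more careful idealized derivation that accounts for pipeline fill time is
sketched below in LaTeX. It assumes the comparison is made against the k-stage scalar base
pipeline (an assumption on our part, since the question says "sequential processor"), and it
recovers the same m × n limit as N grows large.

\[
T_{\text{base}} = k + (N - 1), \qquad
T_{\text{ss}} = k + \frac{N - m}{m n}
\]
\[
S(m, n) = \frac{k + (N - 1)}{k + \dfrac{N - m}{m n}}
\;\longrightarrow\; m n \quad \text{as } N \to \infty
\]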
c. For the same program, two different compilers are used. The table below shows the
execution time of the two different compiled programs.
i. Find the average CPI for each program given that the processor has a clock cycle time
of 1 ns. [4]
The average CPI follows from the CPU time equation:
Average CPI = Execution time / (Instruction count × clock cycle time)
With a clock cycle time of 1 ns, applying this to the execution times and instruction counts in the
table gives:
Compiler A's program: CPI_A = 1.5
Compiler B's program: CPI_B = 2.0
ii. Assume the average CPIs found in part (i), but that the compiled programs run on two
different processors. If the execution times on the two processors are the same, how
much faster is the clock of the processor running compiler A's code versus the clock
of the processor running compiler B's code? [4]
The average CPIs for the compiled programs from compiler A and compiler B are known
from part (i): CPI_A = 1.5 and CPI_B = 2.0. Let:
Clock_A: Clock rate of the processor running the code compiled by compiler A
Clock_B: Clock rate of the processor running the code compiled by compiler B
Since the execution times on the two processors are the same, we can write the following equation:
I_A × CPI_A / Clock_A = I_B × CPI_B / Clock_B
Rearranging the equation to solve for the ratio of the clock speeds, we get:
Clock_A / Clock_B = (I_A × CPI_A) / (I_B × CPI_B)
Assuming the two compiled programs execute the same number of instructions (the table's
instruction counts are not reproduced here), this reduces to:
Clock_A / Clock_B = CPI_A / CPI_B = 1.5 / 2.0 = 0.75
Therefore, the clock of the processor running compiler B's code must be 1.333 times (33.3%)
faster than the clock of the processor running compiler A's code for the execution times to be
equal; equivalently, processor A's clock can be 25% slower.
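The two calculations can also be written as a small C sketch. The instruction counts below are
placeholders (the assignment's table is not reproduced here), so only the structure of the
computation, not the specific numbers, should be taken from it.

#include <stdio.h>

int main(void) {
    /* Placeholder instruction counts; the real values come from the table. */
    double instr_a = 1.0e9, instr_b = 1.0e9;
    double clock_period = 1e-9;               /* 1 ns clock cycle time */

    /* Part (i): CPI = execution time / (instruction count * clock period).
     * Execution times chosen so the CPIs match the solution above. */
    double time_a = 1.5, time_b = 2.0;
    double cpi_a = time_a / (instr_a * clock_period);   /* = 1.5 */
    double cpi_b = time_b / (instr_b * clock_period);   /* = 2.0 */

    /* Part (ii): equal execution times imply
     * I_A*CPI_A/Clock_A = I_B*CPI_B/Clock_B,
     * so Clock_B/Clock_A = (I_B*CPI_B)/(I_A*CPI_A). */
    double clock_ratio_b_over_a = (instr_b * cpi_b) / (instr_a * cpi_a);
    printf("CPI_A = %.2f, CPI_B = %.2f, Clock_B/Clock_A = %.3f\n",
           cpi_a, cpi_b, clock_ratio_b_over_a);
    return 0;
}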
QUESTION 3
The multi-cycle and pipelined data paths can be broken down into 5 steps:
1. Hardware to fetch the instruction from instruction memory and increment the PC
2. Hardware to decode the instruction and read the source registers from the register file
3. Hardware to perform the ALU operation
4. Hardware to read or write data memory
5. Hardware to support the write back of the ALU operation back to the register file
Assume that each of the above steps takes the amount of time specified in the table below.
Note that these times include the overhead of performing the operation AND storing
the data in the register needed to save intermediate results between steps. Thus, the
times (Q) capture the critical path of the logic + latching overhead. After the Q seconds
listed for each stage above, the data can be used by another stage.
a. Given the times for the data path stages listed above, what would the clock period be
for the entire data path?
[4]
b. In a pipelined data path, assuming no hazards or stalls, how many seconds will it take
to execute 1 instruction?
[3]
c. Assuming that N instructions are executed, and all N instructions are add instructions,
what is the speedup of a pipelined implementation when compared to a multi-cycle
implementation? Your answer should be an expression that is a function of N. [4]
d. Assume you break up the memory stage into 2 stages instead of 1 to improve
throughput in a pipelined data path. Thus, the pipeline stages are now: F, D, EX, M1,
M2, WB. Show how the instructions below would progress through this 6-stage
pipeline. Full forwarding hardware is available.
[4]
e. List and briefly explain five important instruction set design issues. [5]
SOLUTION
i. Given the times for the data path stages listed above, what would the
clock period be for the entire data path?
To find the clock period for the entire data path, we need to find the maximum delay among these
five stages.
The clock period would be set to the duration of the longest stage delay, to ensure that all stages can
complete within one clock cycle.
The longest delay is in the memory and fetch stages, which take 305 ps each.
Therefore, the clock period for the entire data path would be:
Clock period = 305 ps = 0.305 ns (equivalently, a clock frequency of 1 / 0.305 ns ≈ 3.28 GHz)
ii. In a pipelined data path, assuming no hazards or stalls, how many
seconds will it take to execute 1 instruction?
In a pipelined data path, multiple instructions can be in different stages simultaneously. Assuming no
hazards or stalls, each instruction completes one stage per clock cycle.
To execute 1 instruction, it must pass through all 5 stages (IF, ID, EX, MEM, WB) in the pipelined data
path, so its latency is:
5 stages × 0.305 ns = 1.525 ns
Therefore, in a pipelined data path with a 0.305 ns clock period and no hazards or stalls, it will take
1.525 × 10⁻⁹ seconds (1.525 nanoseconds) to execute 1 instruction.
iii. Assuming that N instructions are executed, and all N instructions are
add instructions, what is the speedup of a pipelined implementation
when compared to a multi-cycle implementation? Your answer should
be an expression that is a function of N.
Let's calculate the speedup of a pipelined implementation compared to a multi-cycle
implementation, assuming all N instructions are adds and that both designs use the 0.305 ns clock
period set by the slowest stage.
Multi-cycle: each instruction takes 5 clock cycles, so N instructions take 5N cycles.
Pipelined: the first instruction completes after 5 cycles, and one instruction completes every cycle
thereafter, so N instructions take N + 4 cycles.
Speedup = (5N × 0.305 ns) / ((N + 4) × 0.305 ns) = 5N / (N + 4)
This speedup expression is a function of N: it approaches 5 (the number of pipeline stages) as N
grows large, with the pipeline fill time reducing the benefit for small N.
The key point is that in the pipelined implementation, multiple instructions can be in different stages
simultaneously, allowing for a significant performance improvement over the multi-cycle
implementation, where each instruction takes 5 clock cycles to complete.
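The speedup expression can be checked with a short C sketch; the 5-cycles-per-instruction
multi-cycle assumption and the N + 4 pipelined cycle count follow the reasoning above.

#include <stdio.h>

/* Multi-cycle: 5 cycles per instruction -> 5*N cycles in total.
 * Pipelined:   5 cycles to fill the pipeline, then one instruction
 *              completes per cycle -> N + 4 cycles in total. */
double pipeline_speedup(double n) {
    double multi_cycle_cycles = 5.0 * n;
    double pipelined_cycles = n + 4.0;
    return multi_cycle_cycles / pipelined_cycles;
}

int main(void) {
    /* Speedup approaches 5 as N grows. */
    for (double n = 1; n <= 1e6; n *= 100)
        printf("N = %-8.0f speedup = %.3f\n", n, pipeline_speedup(n));
    return 0;
}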
iv. Assume you break up the memory stage into 2 stages instead of 1 to
improve throughput in a pipelined data path. Thus, the pipeline stages
are now: F, D, EX, M1, M2, WB. Show how the instructions below
would progress through this 6-stage pipeline. Full forwarding hardware
is available.
Consider the case where the Memory Access (MEM) stage is broken up into two stages,
M1 and M2, to improve throughput in a pipelined data path. The pipeline stages are now:
F (Instruction Fetch)
D (Instruction Decode)
EX (Execute)
M1 (Memory Access 1)
M2 (Memory Access 2)
WB (Write Back)
Let's show how the following instructions would progress through this 6-stage pipeline, assuming full
forwarding hardware is available:
The instructions are:
1. ADD R1, R2, R3
2. SUB R4, R5, R6
3. LW R7, (R8)
4. SW R9, (R10)
Instruction          Cycle: 1   2   3   4   5   6   7   8   9
1. ADD R1, R2, R3           F   D   EX  M1  M2  WB
2. SUB R4, R5, R6               F   D   EX  M1  M2  WB
3. LW R7, (R8)                      F   D   EX  M1  M2  WB
4. SW R9, (R10)                         F   D   EX  M1  M2  WB
Explanation:
In the first cycle, the first instruction (ADD R1, R2, R3) is in the F stage.
In the second cycle, the first instruction is in the D stage, and the second instruction (SUB R4, R5,
R6) is in the F stage.
In the third cycle, the first instruction is in the EX stage, the second instruction is in the D stage, and
the third instruction (LW R7, (R8)) is in the F stage.
In the fourth cycle, the first instruction is in the M1 stage, the second instruction is in the EX stage,
the third instruction is in the D stage, and the fourth instruction (SW R9, (R10)) is in the F stage.
In the fifth cycle, the first instruction is in the M2 stage, the second instruction is in the M1 stage, the
third instruction is in the EX stage, and the fourth instruction is in the D stage.
In the sixth cycle, the first instruction is in the WB stage, the second instruction is in the M2 stage,
the third instruction is in the M1 stage, and the fourth instruction is in the EX stage.
In the seventh cycle, the second instruction is in the WB stage, the third instruction is in the M2
stage, and the fourth instruction is in the M1 stage.
In the eighth cycle, the third instruction is in the WB stage, and the fourth instruction is in the M2
stage.
In the ninth cycle, the fourth instruction is in the WB stage and the sequence completes.
Because none of the four instructions depends on a result produced by an earlier one, no stalls
occur and the forwarding hardware is not exercised in this example.
By breaking up the Memory Access (MEM) stage into two shorter stages, M1 and M2, the critical
path through the memory stage is shortened, which can permit a faster clock and hence higher
pipeline throughput. This allows for better utilization of the processor resources and improved
performance, as long as the necessary forwarding logic is available to handle any data
dependencies between the instructions (load results now become available one stage later, so
forwarding matters even more).
v. List and briefly explain five important instruction set design issues.
Instruction Encoding:
This refers to how the instructions are represented in binary format, including the size and layout of
the instruction fields (opcode, operands, etc.).
The encoding scheme affects instruction memory size, fetch complexity, and instruction decoding.
Orthogonality:
A highly orthogonal instruction set allows for greater flexibility and programming efficiency, as
programmers can combine different operations, operands and addressing modes in various ways.
Regularity and Simplicity:
The instruction set should be as regular and simple as possible, with consistent naming conventions,
operand formats, and addressing modes.
This simplifies the hardware design and makes the instruction set easier to understand and program.
Addressing Modes:
Addressing modes define how the operands of an instruction are accessed in memory.
The choice of addressing modes impacts code density, memory access patterns, and the complexity
of the processor hardware.
Exception and Interrupt Handling:
The instruction set should provide mechanisms for handling exceptions (e.g., divide-by-zero, page
faults) and interrupts (e.g., from I/O devices) efficiently.
The design of exception and interrupt handling affects the responsiveness and reliability of the
system.
QUESTION 4
Design a (very) simple CPU for an instruction set that contains only the following four
instructions: lw (load word), sw (store word), add, and jump (unconditional branch). Assume
that the instruction formats are similar to the MIPS architecture. If you assume a different
format, state the instruction formats. Show all the components, all the links, and all the
control signals in the data-path. You must show only the minimal hardware required to
implement these four instructions. For each instruction, show the steps involved and the
values of the control signals for a single cycle implementation.
SOLUTION
To design a simple CPU for the given instruction set, I will assume instruction formats similar to
MIPS:
- lw (load word): I-format — opcode (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits);
  rt ← Mem[rs + sign-extended immediate]
- sw (store word): I-format — opcode, rs, rt, immediate; Mem[rs + sign-extended immediate] ← rt
- add: R-format — opcode (6 bits), rs (5 bits), rt (5 bits), rd (5 bits), shamt (5 bits), funct (6 bits);
  rd ← rs + rt
- jump: J-format — opcode (6 bits), target (26 bits); PC ← {(PC + 4)[31:28], target, 00}
The minimal hardware required to implement these four instructions consists of: the PC and a +4
adder; instruction memory; a register file with two read ports and one write port; a sign-extension
unit; an ALU (which only ever needs to add, since lw, sw and add all compute additions, so no
ALUOp signal is required); data memory; a MUX selecting the ALU's second operand (ALUSrc); a
MUX selecting the register-file write data (MemtoReg); a MUX selecting the destination register
field (RegDst); shift-left-2 and concatenation logic forming the jump target, with a MUX on the PC
input (Jump); and a control unit, driven by the opcode, generating RegDst, ALUSrc, MemtoReg,
RegWrite, MemRead, MemWrite and Jump.
For each instruction, the steps and control-signal values in a single-cycle implementation
(X = don't care) are:
1. lw (load word):
- PC: Increment by 4
- Inst. Decode: read register rs; sign-extend the 16-bit immediate
- Execute: ALU computes rs + immediate (ALUSrc = 1)
- Memory: read data memory at the ALU result (MemRead = 1, MemWrite = 0)
- Write back: memory data written to register rt (MemtoReg = 1, RegWrite = 1, RegDst = 0, Jump = 0)
2. sw (store word):
- PC: Increment by 4
- Inst. Decode: read registers rs and rt; sign-extend the 16-bit immediate
- Execute: ALU computes rs + immediate (ALUSrc = 1)
- Memory: write rt to data memory at the ALU result (MemWrite = 1, MemRead = 0, RegWrite = 0,
  RegDst = X, MemtoReg = X, Jump = 0)
3. add:
- PC: Increment by 4
- Inst. Decode: read registers rs and rt
- Execute: ALU computes rs + rt (ALUSrc = 0)
- Write back: ALU result written to register rd (RegDst = 1, MemtoReg = 0, RegWrite = 1,
  MemRead = 0, MemWrite = 0, Jump = 0)
4. jump:
- Inst. Decode: the 26-bit target field is shifted left by 2 and concatenated with the upper 4 bits
  of PC + 4
- PC: replaced by the jump target
- Jump = 1 (RegWrite = 0, MemRead = 0, MemWrite = 0, other signals = X)
This design provides the minimal hardware required to implement the four instructions. The control
signals are set appropriately for each instruction to ensure the correct execution.
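As a compact summary of the control settings above, here is a small C sketch that encodes them as
a lookup table. The signal names follow the single-cycle MIPS convention assumed in this solution;
it is an illustrative software model, not a hardware description, and don't-care values are encoded
as 0.

#include <stdio.h>

/* Control signals for the minimal single-cycle data path. */
typedef struct {
    const char *op;
    int reg_dst, alu_src, mem_to_reg, reg_write;
    int mem_read, mem_write, jump;
} Control;

static const Control table[] = {
    /*  op    RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Jump */
    { "lw",     0,     1,      1,       1,      1,      0,      0 },
    { "sw",     0,     1,      0,       0,      0,      1,      0 },
    { "add",    1,     0,      0,       1,      0,      0,      0 },
    { "jump",   0,     0,      0,       0,      0,      0,      1 },
};

int main(void) {
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        printf("%-4s RegDst=%d ALUSrc=%d MemtoReg=%d RegWrite=%d "
               "MemRead=%d MemWrite=%d Jump=%d\n",
               table[i].op, table[i].reg_dst, table[i].alu_src,
               table[i].mem_to_reg, table[i].reg_write,
               table[i].mem_read, table[i].mem_write, table[i].jump);
    return 0;
}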