Unit 3
1. Basic Non-Pipelined CPU Architecture
A non-pipelined CPU executes instructions sequentially. One instruction must
complete all its stages before the next instruction can begin. Think of it like a
single worker handling one task entirely before starting the next.
Core Components:
1. Control Unit (CU): The "brain" of the CPU. It fetches instructions from
memory, decodes them, and generates control signals to direct the other
components (ALU, registers, memory interfaces) on what to do and when.
2. Arithmetic Logic Unit (ALU): Performs arithmetic operations (addition,
subtraction, multiplication, division) and logical operations (AND, OR,
NOT, XOR, shifts). It takes operands from registers and returns the result
to a register.
3. Registers: Small, extremely fast storage locations within the CPU. Used
to hold data, instructions, addresses, and status information temporarily
during execution. Key types include:
○ Program Counter (PC): Holds the memory address of the next
instruction to be fetched.
○ Instruction Register (IR): Holds the instruction currently being
decoded and executed.
○ Memory Address Register (MAR): Holds the address of the
memory location to be accessed (read from or written to).
○ Memory Data Register (MDR) / Memory Buffer Register (MBR):
Temporarily holds data being transferred to or from memory.
○ General Purpose Registers (GPRs): Used to hold operands and
results for ALU operations. Accessible by the programmer (via
assembly language).
○ Status Register / Flags Register: Holds status bits (flags)
indicating results of operations (e.g., Zero flag, Carry flag,
Overflow flag, Negative flag).
4. Internal Buses: Pathways connecting the different components (CU,
ALU, Registers) within the CPU, allowing data and control signals to
travel between them.
Operation: In a non-pipelined architecture, the CPU follows the
Fetch-Decode-Execute cycle strictly sequentially for each instruction. If
fetching takes 1 clock cycle, decoding 1, and executing 3, a single instruction
takes 5 clock cycles. The next instruction only starts fetching after the previous
one has fully completed execution. This leads to underutilization of CPU
components (e.g., the fetch unit is idle during decode and execute).
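
To make the arithmetic concrete, here is a minimal Python sketch, using the example latencies above (real stage latencies vary by design), that totals the cycles for a run of instructions on a non-pipelined CPU:

# Non-pipelined execution: every instruction pays the full stage cost.
STAGE_CYCLES = {"fetch": 1, "decode": 1, "execute": 3}  # example values from the text

def non_pipelined_cycles(num_instructions):
    per_instruction = sum(STAGE_CYCLES.values())  # 5 cycles per instruction
    return num_instructions * per_instruction

print(non_pipelined_cycles(10))  # 50 cycles for 10 instructions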
2. Memory Hierarchy
CPUs operate much faster than main memory (RAM). Accessing RAM for
every instruction and data operand would create a massive bottleneck. The
Memory Hierarchy is a structure that uses multiple levels of memory with
different speeds, sizes, and costs to bridge this gap.
Levels (Closest to CPU outwards):
1. Registers: Fastest, smallest, most expensive (part of the CPU). Hold
currently active data/instructions. Access time ~1 CPU cycle.
2. Cache Memory (L1, L2, L3): Small, fast Static RAM (SRAM) located
closer to the CPU (often on the same chip). Stores frequently accessed
data and instructions from main memory.
○ L1 Cache: Smallest, fastest cache, often split into instruction cache
(L1i) and data cache (L1d). Access time ~ a few CPU cycles.
○ L2 Cache: Larger and slower than L1. Access time ~ 10-20 cycles.
○ L3 Cache: Largest and slowest cache level, often shared by
multiple CPU cores. Access time ~ 30-50 cycles.
3. Main Memory (RAM): Dynamic RAM (DRAM). Much larger than
cache but significantly slower. Holds the currently running operating
system and application programs/data. Access time ~ 100-200+ cycles.
4. Secondary Storage (Virtual Memory/Swap Space): Hard Disk Drives
(HDDs) or Solid State Drives (SSDs). Largest capacity, slowest access
time, cheapest per bit. Holds data and programs not currently in RAM.
Used as an extension of RAM (virtual memory). Access time ~
milliseconds.
5. Tertiary Storage (Optional): Optical disks, magnetic tapes for backups
and archival. Very slow.
Principle of Locality: The memory hierarchy works efficiently because programs
tend to exhibit:
● Temporal Locality: If an item (instruction or data) is accessed, it's likely
to be accessed again soon (loops, reuse of variables). Caching keeps
recently used items close.
● Spatial Locality: If an item is accessed, items whose addresses are close
by are likely to be accessed soon (sequential code execution, array
processing). Caching fetches blocks of data around the requested item.
Goal: To provide the CPU with an average memory access time close to the
cache speed, while offering the large capacity of main memory and secondary
storage.
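
This goal is usually quantified as the average memory access time (AMAT). A minimal sketch, with illustrative hit rates and latencies (not measured from any specific CPU):

# AMAT = hit_time + miss_rate * miss_penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: cache hits cost 2 cycles; 5% of accesses miss
# and pay a further 150-cycle trip to main memory.
print(amat(hit_time=2, miss_rate=0.05, miss_penalty=150))  # 9.5 cycles

Even with a 150-cycle miss penalty, a 95% hit rate keeps the average under 10 cycles, which is the sense in which the hierarchy makes memory look almost as fast as the cache.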
3. I/O Techniques
Input/Output (I/O) techniques manage the communication between the
CPU/Memory system and external peripheral devices (keyboard, mouse, disk
drives, network interfaces, printers, etc.).
1. Programmed I/O (PIO):
○ Mechanism: The CPU executes specific I/O instructions. It
continuously checks (polls) the status register of the I/O device
until it's ready for data transfer. The CPU is directly responsible for
moving data between memory/registers and the I/O device buffer.
○ Pros: Simple to implement.
○ Cons: Very inefficient. The CPU wastes significant time waiting
(polling) for the slow I/O device, unable to perform other tasks.
Only suitable for very simple or slow devices. (A sketch
contrasting PIO and DMA follows this list.)
2. Interrupt-Driven I/O:
○ Mechanism: The CPU initiates an I/O operation and then
continues executing other tasks. When the I/O device is ready (e.g.,
data received, operation complete), it sends an interrupt signal to
the CPU. The CPU suspends its current task, saves its state,
executes an Interrupt Service Routine (ISR) to handle the data
transfer, restores its state, and resumes the interrupted task.
○ Pros: Much more efficient than PIO, as the CPU doesn't wait idly.
○ Cons: Interrupt handling introduces overhead (saving/restoring
state, context switching). Still involves the CPU in the actual data
transfer process.
3. Direct Memory Access (DMA):
○ Mechanism: A dedicated hardware controller, the DMA Controller
(DMAC), manages the data transfer directly between the I/O
device and main memory, without involving the CPU except at the
beginning (to set up the transfer: source address, destination
address, data count) and the end (the DMAC sends an interrupt
when done).
○ Pros: Most efficient for large data transfers. Frees the CPU almost
entirely during the transfer. Reduces CPU overhead significantly.
○ Cons: Requires a dedicated DMAC. Can lead to bus contention if
the DMAC and CPU need the memory bus simultaneously (cycle
stealing).
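
A minimal Python sketch contrasting the two extremes, programmed I/O and DMA. The device and DMAC objects and their method names (status_ready, read_byte, program, on_interrupt) are invented for illustration, not a real driver API:

# Programmed I/O: the CPU busy-waits on the status register and
# moves every byte itself.
def programmed_io_read(device, count):
    data = []
    for _ in range(count):
        while not device.status_ready():  # polling: no useful work happens here
            pass
        data.append(device.read_byte())   # the CPU performs each transfer
    return data

# DMA: the CPU only programs the transfer parameters; the DMAC moves
# the data and raises a single interrupt on completion.
def dma_read(dmac, source, destination, count, on_done):
    dmac.program(source, destination, count)  # CPU involvement: setup only
    dmac.on_interrupt(on_done)                # CPU involvement: completion
    # The CPU is free to run other work while the DMAC transfers the block.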
4. CPU Architecture Types (Based on Operand Storage)
This classification refers to how the Instruction Set Architecture (ISA) specifies
the operands for ALU instructions.
1. Accumulator Architecture:
○ Concept: Uses a single special register called the "accumulator" as
one implicit operand for most arithmetic/logic instructions. The
other operand typically comes from memory. The result is stored
back in the accumulator.
○ Example Instruction: ADD address (meaning ACC = ACC +
Memory[address])
○ Characteristics: Simple hardware, short instructions (one explicit
address). High memory traffic, as operands frequently need
loading/storing. An older architecture type (e.g., early
microprocessors).
2. Stack Architecture:
○ Concept: Operands are implicitly on the top of a processor stack.
ALU operations pop operands from the stack and push the result
back onto it. Requires PUSH and POP instructions to move data
between memory and the stack.
○ Example Instruction: ADD (pops the top two values, adds them,
pushes the result)
○ Characteristics: Can lead to very compact code ("zero-address
instructions"). Stack management can be complex. Efficient for
evaluating complex expressions. Used in some systems like the
Java Virtual Machine (JVM). (A small evaluation sketch follows
this list.)
3. General Purpose Register (GPR) Architecture:
○ Concept: Uses multiple general-purpose registers to hold operands
and results. The dominant modern architecture.
○ Sub-types:
■ Register-Memory: Allows ALU instructions to have one
operand in a register and another in memory. ADD R1,
address (meaning R1 = R1 + Memory[address]).
■ Register-Register (Load/Store): ALU operations only work
on registers. Separate LOAD and STORE instructions are
required to move data between registers and memory:
LOAD R1, address1; LOAD R2, address2; ADD R3, R1, R2;
STORE address3, R3.
○ Characteristics: Reduces memory traffic compared to
accumulator/stack (registers are faster). Requires more complex
instruction formats (specifying multiple registers). Load/Store is
the basis for most RISC architectures (like ARM, MIPS, RISC-V).
4. Memory-Memory Architecture (Less Common now):
○ Concept: Allows ALU instructions to operate directly on operands
located in main memory, potentially storing the result back to
memory.
○ Example Instruction: ADD address1, address2, address3
(meaning Memory[address1] = Memory[address2] +
Memory[address3])
○ Characteristics: Very high flexibility, complex instructions. Very
slow due to multiple memory accesses per instruction. Not
common in modern general-purpose CPUs, though some complex
instructions in CISC architectures might resemble this.
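
To illustrate the zero-address style of the stack architecture above, here is a minimal Python sketch of a stack machine evaluating (2 + 3) * 4 from its postfix form (the instruction tuples are an invented encoding):

# Stack machine: ADD and MUL name no operands; they use the stack top.
def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 in postfix: 2 3 + 4 *
print(run([("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]))  # 20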
5. Detailed Datapath of a Typical Register-Based (Load/Store) CPU
The datapath shows the physical connections (buses) and functional units (ALU,
registers, memory interfaces) through which data flows during instruction
execution. Here's a simplified view for a load/store architecture:
(Visualize blocks connected by lines/arrows representing buses)
● Instruction Fetch:
1. Content of the PC is sent to MAR.
2. A Read signal is sent to the Memory Interface.
3. PC is incremented (usually PC = PC + 4, assuming 32-bit
instructions/addresses).
4. Memory returns the instruction via the data bus to the MDR.
5. Instruction moves from MDR to the IR.
● Instruction Decode:
1. The opcode part of the instruction in IR is sent to the Control Unit.
2. The Control Unit decodes the instruction and generates control
signals for subsequent steps.
3. Register operands specified in the IR (e.g., source registers Rs, Rt)
are used to select registers from the Register File.
● Execute (Example: ADD R_dest, R_src1, R_src2)
1. Data from R_src1 and R_src2 in the Register File are sent to the
ALU inputs (Input A, Input B).
2. The Control Unit sends an "ADD" signal to the ALU.
3. ALU performs the addition.
4. ALU result is routed back towards the Register File.
● Execute (Example: Address calculation for LOAD/STORE R_t,
offset(R_s))
1. Data from R_s in the Register File is sent to ALU Input A.
2. The 'offset' value (part of the instruction, possibly sign-extended) is
sent to ALU Input B.
3. The Control Unit sends an "ADD" signal to the ALU.
4. ALU calculates the effective memory address (Base Address +
Offset).
● Memory Access (Example: LOAD R_t, address)
1. The calculated address (from ALU) is sent to the MAR.
2. The Control Unit sends a Read signal to the Memory Interface.
3. Memory returns the data via the data bus to the MDR.
● Memory Access (Example: STORE R_t, address)
1. The calculated address (from ALU) is sent to the MAR.
2. Data from register R_t (specified in IR) is read from the Register
File and sent to the MDR.
3. The Control Unit sends a Write signal to the Memory Interface.
4. Data from MDR is written to memory at the specified address.
● Write Back (Example: ADD R_dest, ... or LOAD R_t, ...)
1. For ADD: The result from the ALU output is written into R_dest in
the Register File.
2. For LOAD: The data fetched from memory (now in MDR) is
written into R_t in the Register File.
3. Control Unit provides the correct register address (R_dest or R_t)
and the Write Enable signal to the Register File.
Key Datapath Elements:
● PC -> Adder -> Mux -> PC (for incrementing the PC)
● PC -> MAR
● Memory Interface <-> MAR, MDR
● MDR -> IR
● IR -> Control Unit
● IR (register fields) -> Register File (Read/Write addresses)
● Register File (Read Ports) -> ALU Inputs (possibly via Muxes)
● IR (immediate field) -> Sign Extender -> ALU Input B (via Mux)
● ALU Output -> MAR (for address calculation)
● ALU Output -> Register File (Write Port) (for R-type results)
● MDR -> Register File (Write Port) (for Load results)
● Register File (Read Port) -> MDR (for Store data)
● Control Unit -> Control signals to Muxes, ALU, Register File (Write
Enable), Memory Interface (Read/Write).
6. Fetch-Decode-Execute Cycle (Typically 3 to 5 Stages)
This is the fundamental cycle performed by the CPU to execute instructions.
Basic 3-Stage Cycle:
1. Fetch:
○ Get the address from the PC.
○ Load the instruction from memory at that address into the IR.
○ Increment the PC to point to the next instruction.
2. Decode:
○ Interpret the opcode in the IR.
○ Identify the operands needed.
○ Generate control signals for the execute stage.
○ Fetch operands from registers if needed.
3. Execute:
○ Perform the operation specified by the instruction (using the ALU,
accessing memory, changing PC for jumps/branches, writing
results to registers).
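
A minimal Python sketch of this loop for a toy single-accumulator machine (the opcodes and (opcode, operand) encoding are invented for illustration):

# Toy fetch-decode-execute loop: memory holds (opcode, operand) pairs.
memory = [("LOAD", 7), ("ADD", 5), ("HALT", 0)]
acc, pc, running = 0, 0, True

while running:
    instr = memory[pc]            # Fetch: read the instruction at PC into the "IR"
    pc += 1                       #        and increment the PC
    opcode, operand = instr       # Decode: split into opcode and operand
    if opcode == "LOAD":          # Execute: perform the operation
        acc = operand
    elif opcode == "ADD":
        acc += operand
    elif opcode == "HALT":
        running = False

print(acc)  # 12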
Typical 5-Stage RISC Pipeline Cycle: (This breakdown is crucial for
understanding pipelining.)
1. IF (Instruction Fetch): Fetch the instruction from memory using the
address in the PC, store it in the IR, increment the PC.
2. ID (Instruction Decode & Register Fetch): Decode the instruction in the
IR, identify the required registers, read operand values from the Register
File. Decode immediate values. Check for hazards.
3. EX (Execute / Address Calculation):
○ For ALU instructions: Perform the operation using the ALU on
operands fetched in ID.
○ For Load/Store: Calculate the effective memory address using the
ALU (Base + Offset).
○ For Branches: Calculate the branch target address and evaluate the
branch condition.
4. MEM (Memory Access):
○ For Load: Read data from memory using the address calculated in
EX.
○ For Store: Write data (fetched from a register in ID) to memory
using the address calculated in EX.
○ Other instructions usually do nothing at this stage.
5. WB (Write Back): Write the result back into the Register File.
○ For ALU instructions: Write the result from the EX stage.
○ For Load instructions: Write the data fetched in the MEM stage.
In a non-pipelined CPU, one instruction goes through all 5 stages before the
next one starts IF.
7. Microinstruction Sequencing & Implementation of the Control Unit
The Control Unit generates the signals that control the datapath. There are two
main implementation approaches:
A. Hardwired Control Unit:
● Implementation: Uses fixed, dedicated combinational logic circuits
(AND, OR, NOT gates, decoders) to generate control signals based on the
instruction opcode, ALU flags, and timing signals (clock).
● Operation: The opcode bits feed directly into the logic gates. The
outputs of these gates are the control signals.
● Microinstruction Sequencing: Not applicable in the same way. The
"sequence" is determined by the flow through the fixed logic based on the
current state and instruction.
● Pros: Very fast execution speed.
● Cons: Complex to design and debug. Inflexible; modifying the
instruction set requires redesigning the hardware. Difficult to implement
complex instruction sets. Typically used in RISC processors.
B. Microprogrammed Control Unit:
● Implementation: Control signals are stored as sequences of
"microinstructions" in a special memory called the Control Store (or
Control Memory, CM), typically ROM or fast RAM.
● Components:
○ Control Store (CS): Holds the microprogram(s).
○ Microinstruction Register (µIR): Holds the current
microinstruction being executed.
○ Microprogram Counter (µPC): Holds the address of the next
microinstruction in the CS to be fetched (analogous to the main
PC).
○ Sequencing Logic: Determines the next value for the µPC.
● Operation:
○ The instruction opcode from the IR is mapped to a starting address
in the Control Store.
○ The µPC is loaded with this starting address.
○ The microinstruction at µPC address is fetched from CS into the
µIR.
○ The bits in the µIR directly represent the control signals needed for
the datapath for that micro-step.
○ The Sequencing Logic uses information from the µIR (next address
field), the instruction opcode, and ALU flags to calculate the
address of the next microinstruction (µPC update).
○ Repeat steps 3-5 until the end of the micro-routine for the current
machine instruction.
● Microinstruction Sequencing: How the next microinstruction address is
determined:
○ Increment: µPC = µPC + 1 (default sequential execution).
○ Branching: Based on ALU flags (e.g., if the Zero flag is set, jump
to microinstruction X, else continue).
○ Dispatching: Based on the opcode of the machine instruction
(used to find the start of the correct micro-routine).
○ Explicit Next Address: The current microinstruction contains the
address of the next one.
● Pros: Flexible (changing the instruction set means rewriting the
microprogram in the CS, not redesigning the hardware). Easier to
implement complex instruction sets (CISC). Simpler design process.
● Cons: Slower than hardwired control due to the extra memory access
time for fetching microinstructions from the CS.
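
A minimal Python sketch of a control-store walk: each microinstruction is a set of control-signal bits plus a next-address field. The signal names and word layout are invented for illustration:

# Control store: microaddress -> (control signals, next microaddress or None).
CONTROL_STORE = {
    0: ({"PC_to_MAR": 1, "Mem_Read": 1}, 1),  # fetch, step 1
    1: ({"MDR_to_IR": 1, "PC_inc": 1}, 2),    # fetch, step 2
    2: ({"Decode": 1}, None),                 # next: dispatch on the opcode
}

upc = 0                                        # the microprogram counter
while upc is not None:
    signals, next_addr = CONTROL_STORE[upc]    # fetch the microinstruction ("uIR")
    print(f"uPC={upc}: assert {sorted(signals)}")
    upc = next_addr                            # sequencing: explicit next address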
8. Enhancing Performance with Pipelining
Pipelining is a technique used to improve CPU throughput by overlapping the
execution stages of multiple instructions. It doesn't make a single instruction
faster, but it increases the number of instructions completed per unit of time.
● Concept: Divide instruction processing into multiple stages (like the
5-stage IF, ID, EX, MEM, WB). Insert pipeline registers between stages
to hold the intermediate results and control information for an instruction
as it moves down the "assembly line".
● Operation: In an ideal pipeline, a new instruction enters the first stage
(IF) in every clock cycle. While instruction 1 is in ID, instruction 2 is in
IF. While instruction 1 is in EX, instruction 2 is in ID, and instruction 3 is
in IF, and so on.
● Benefit: If there are 'k' stages, the ideal speedup compared to a
non-pipelined CPU is 'k' times (assuming balanced stage delays and no
interruptions). In the 5-stage example, after the first instruction takes 5
cycles to complete, subsequent instructions complete at a rate of one per
cycle (ideally).
● Challenges - Pipeline Hazards: Situations that prevent the next
instruction in the pipeline from executing during its designated clock
cycle. (Two sketches, of ideal pipeline timing and of RAW hazard
detection, follow this list.)
○ Structural Hazards: A hardware resource conflict. Two different
instructions in the pipeline need the same resource (e.g., memory
access) at the same time. Solved by duplicating resources (e.g.,
separate instruction/data caches) or stalling.
○ Data Hazards: An instruction depends on the result of a previous
instruction that is still in the pipeline and hasn't completed writing
its result.
■ Read After Write (RAW - True Dependence): Instruction J
tries to read before instruction I writes (e.g., ADD R1, R2, R3
followed by SUB R4, R1, R5). Solved by
forwarding/bypassing (routing the result directly from the
ALU output/MEM stage back to the ALU input for the next
instruction) or by stalling (inserting bubbles/NOPs).
■ Write After Read (WAR - Anti Dependence): Instruction J
tries to write before instruction I reads. Less common in
simple pipelines; handled by register renaming in more
advanced CPUs.
■ Write After Write (WAW - Output Dependence): Instruction J
tries to write before instruction I writes (to the same
register). Handled by ensuring writes happen in order, or by
register renaming.
○ Control Hazards (Branch Hazards): Occur with branch/jump
instructions. The pipeline fetches sequential instructions assuming
the branch is not taken, but if the branch is taken, the fetched
instructions are wrong and must be flushed. Solved by:
■ Stalling: Wait until the branch outcome is known.
■ Branch Prediction: Guess the outcome (e.g., predict not
taken, predict taken, use history). If wrong, flush and fetch
the correct path.
■ Delayed Branch: Execute one or more instructions after the
branch instruction, regardless of the outcome (the compiler
tries to fill this "delay slot" with useful independent
instructions).
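
Two minimal Python sketches of the ideas above. First, ideal pipeline timing: a non-pipelined CPU pays k cycles per instruction, while an ideal k-stage pipeline pays k cycles to fill and then finishes one instruction per cycle:

# Ideal cycle counts for n instructions on k balanced stages, no hazards.
def non_pipelined(n, k):
    return n * k

def pipelined(n, k):
    return k + (n - 1)  # fill the pipe, then one completion per cycle

n, k = 100, 5
print(non_pipelined(n, k), pipelined(n, k))   # 500 vs 104 cycles
print(non_pipelined(n, k) / pipelined(n, k))  # speedup ~4.8, approaching k

Second, RAW hazard detection between adjacent instructions, using the ADD/SUB example from the list (the (op, dest, src1, src2) tuples are an invented encoding):

# RAW hazard: instruction j reads a register that instruction i writes.
def raw_hazard(instr_i, instr_j):
    _, i_dest, *_ = instr_i
    _, _, *j_srcs = instr_j
    return i_dest in j_srcs

i = ("ADD", "R1", "R2", "R3")   # writes R1
j = ("SUB", "R4", "R1", "R5")   # reads R1
print(raw_hazard(i, j))         # True: forward the result or stall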
The Need for a Memory Hierarchy
Modern computer systems use a memory hierarchy to balance speed, cost, and
capacity. No single memory type can offer the ideal combination of very fast,
very large, and very cheap. Thus, the hierarchy is designed to optimize
performance while managing cost.
Why is a Memory Hierarchy Needed?
1. Processor vs. Memory Speed Gap
● Modern CPUs are extremely fast, capable of executing billions of
instructions per second.
● Main memory (RAM) is slower than the processor.
● If the CPU had to wait every time for RAM access, performance would
drastically degrade.
➤ Solution: Use faster, smaller memory (caches) closer to the CPU.
2. Cost vs. Capacity Trade-off
● Fast memory (like SRAM used in caches) is expensive.
● Slower memory (like DRAM or hard disks) is cheaper and provides
more capacity.
➤ Solution: Use small amounts of fast memory and larger amounts of
slow memory.
3. Locality of Reference Principle
Programs tend to access a small portion of memory repeatedly over short
periods:
● Temporal locality: If a memory location is accessed, it is likely to be
accessed again soon.
● Spatial locality: If one memory location is accessed, nearby locations are
likely to be accessed soon.
➤ Caches exploit this by storing recently used or nearby data to speed up
access.
Locality of Reference Principle
The locality of reference principle is a key concept in computer architecture that
describes how programs tend to access a relatively small portion of memory at
any given time. This principle can be broken down into two types:
1. Temporal Locality: This refers to the tendency of a program to access
the same memory locations repeatedly within a short time frame. For
example, if a program accesses a particular variable, it is likely to access
it again soon.
2. Spatial Locality: This refers to the tendency of a program to access
memory locations that are close to each other. For instance, if a program
accesses a certain array element, it is likely to access nearby elements
shortly thereafter.
The locality of reference principle is crucial for designing efficient memory
systems because it allows for the implementation of faster, smaller memory
types (like cache) that can store frequently accessed data, thereby reducing the
average time to access memory.
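
As a small illustration of both kinds of locality, consider a row-by-row matrix sum. (This Python sketch shows the access pattern; the actual address layout depends on the language runtime, so treat it as illustrative.)

# Row-major traversal: `total` is touched every iteration (temporal locality),
# and elements of a row are visited back-to-back (spatial locality).
matrix = [[i * 8 + j for j in range(8)] for i in range(8)]
total = 0
for row in matrix:       # rows visited in order
    for value in row:    # neighbouring elements accessed consecutively
        total += value
print(total)  # 2016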
Memory Hierarchy in Practice
A memory hierarchy is a structured arrangement of different types of memory
that vary in speed, size, and cost. The main levels of the memory hierarchy
include:
1. Cache Memory: This is the fastest type of memory, located closest to the
CPU. It is used to store frequently accessed data and instructions to speed
up processing. Cache memory is typically divided into levels (L1, L2,
L3), with L1 being the fastest and smallest.
2. Main Memory (RAM): This is the primary storage used by the CPU to
hold data and instructions that are currently in use. It is slower than cache
but larger in capacity. Main memory is volatile, meaning it loses its
contents when power is turned off.
3. Secondary Memory: This includes storage devices like hard drives,
SSDs, and optical disks. Secondary memory is non-volatile and is used
for long-term data storage. It is much slower than both cache and main
memory but offers much larger storage capacity at a lower cost.
Memory Parameters
Access Time: The time between a memory request and the delivery of data.
Cycle Time: The time between successive accesses.
Cache memory has the shortest access time, followed by main memory,
and then secondary memory.
Cost per Bit: This is a measure of how much it costs to store one bit of
data. Cache memory is the most expensive per bit, followed by main
memory, and then secondary memory, which is the least expensive.
Main Memory
Semiconductor RAM & ROM Organization
RAM (Random Access Memory): This is a type of volatile memory that
allows data to be read and written. It is organized into cells, each with a
unique address. RAM can be further categorized into:
Static RAM (SRAM): Uses bistable latching circuitry to store each bit. It
is faster and more expensive than DRAM, and is used for cache memory
due to its speed.
Dynamic RAM (DRAM): Stores each bit in a capacitor, which must be
refreshed periodically. It is slower and less expensive than SRAM and is
used for main memory.
ROM (Read-Only Memory): This is non-volatile memory that is used to
store firmware or software that does not change. It is organized similarly to
RAM but is typically slower and cannot be easily modified.
Memory Expansion
Memory expansion refers to the ability to increase the amount of RAM in a
system. This can be done by adding more RAM modules to the motherboard,
allowing for improved performance and the ability to run more applications
simultaneously.
Cache Memory
Associative & Direct Mapped Cache Organizations
Cache memory can be organized in different ways to optimize performance:
1. Direct Mapped Cache: Each block of main memory maps to exactly one
cache line. This is simple and fast but can lead to cache misses if multiple
memory blocks map to the same cache line (known as conflict misses).
2. Associative Cache: Any block of main memory can be stored in any
cache line. This flexibility reduces conflict misses but requires more
complex hardware to search for data, making it slower than
direct-mapped caches.
3. Set-Associative Cache: This is a compromise between direct-mapped
and fully associative caches. The cache is divided into sets, and each set
can hold multiple blocks. A block of memory can be placed in any line
within a specific set, balancing speed and complexity.
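
A minimal Python sketch of how a direct-mapped cache splits an address into tag, index, and offset. The cache geometry (64 lines of 16 bytes) is hypothetical:

# Direct-mapped lookup: address -> (tag, line index, byte offset).
LINES, LINE_SIZE = 64, 16

def split_address(addr):
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % LINES
    tag = addr // (LINE_SIZE * LINES)
    return tag, index, offset

print(split_address(0x1234))                      # (4, 35, 4)
print(split_address(0x1234 + LINES * LINE_SIZE))  # (5, 35, 4)

Both addresses land on line 35 with different tags, so the second access evicts the first: exactly the conflict miss described for direct-mapped caches above.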
Summary Table
Component     Speed      Cost/bit  Volatility    Use case
Registers     Fastest    Highest   Volatile      CPU operations
Cache (SRAM)  Very fast  High      Volatile      Speed up access
RAM (DRAM)    Moderate   Medium    Volatile      Main memory
ROM           Slow       Medium    Non-volatile  Firmware/boot code
HDD/SSD       Slowest    Low       Non-volatile  Long-term storage