cs61c Notes
Anmol Parande
Fall 2019 - Professors Dan Garcia and Miki Lustig
Contents
1 Binary Representation
  1.1 Base conversions
  1.2 Numeric Representations
    1.2.1 Sign and Magnitude
    1.2.2 One's Complement
    1.2.3 Two's Complement
    1.2.4 Bias Encoding
  1.3 Floating Point Representation
2 C
  2.1 C Features
  2.2 Bitwise Operators
  2.3 Pointers
    2.3.1 Pointer Operators
  2.4 Arrays
  2.5 Structs
    2.5.1 Struct Operators
3 Memory Management
  3.1 Memory Basics
  3.2 The Stack
  3.3 The Heap
  3.4 Heap Management
4 RISC-V
  4.1 Basics of Assembly Languages
  4.2 RISC-V Structure
    4.2.1 Caller-Callee Convention
    4.2.2 Directives
  4.3 Instruction Formats
    4.3.1 Addressing
  4.4 Compiling, Assembling, Linking, Loading
    4.4.1 Compiler
    4.4.2 Assembler
  4.5 Linker
  4.6 Loader
6 Datapath
  6.1 Pipelined Datapath
    6.1.1 Structural Hazards
    6.1.2 Data Hazards
    6.1.3 Control Hazards
8 Parallelism
  8.1 Data Level Parallelism and SIMD
  8.2 Thread Level Parallelism
    8.2.1 Threads
    8.2.2 Synchronization
    8.2.3 Cache Coherence
    8.2.4 OpenMP
    8.2.5 Amdahl's Law
9 Input/Output
Disclaimer: These notes reflect 61C when I took the course (Fall 2019).
They may not accurately reflect current course content, so use at your own risk.
If you find any typos, errors, etc., please report them on the GitHub repository.
1 Binary Representation
Each bit of information can either be 1 or 0. As a result, N bits can represent
at most $2^N$ values.
1.2.3 Two’s Complement
To convert from decimal to two's complement: if the number is positive, convert
to binary as normal. If the number is negative:
1. Convert the positive version of the number to binary
2. Invert each bit
3. Add one to the result
If given a two’s complement number, you can find -1 times that number by
following the same process (invert bits and add 1).
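As a quick check of this invert-and-add-one rule, here is a minimal C sketch (the helper name negate_twos is just for illustration):
#include <stdio.h>
#include <stdint.h>
/* Negate a value by inverting its bits and adding one, which is exactly the rule above. */
int32_t negate_twos(int32_t x) {
    return (int32_t)(~(uint32_t)x + 1u);
}
int main(void) {
    printf("%d\n", negate_twos(5));   /* prints -5 */
    printf("%d\n", negate_twos(-7));  /* prints 7  */
    return 0;
}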
1.3 Floating Point Representation
In the floating point representation, the leading bit of the mantissa is always one because we are
using scientific notation, so we don't bother storing it.
• Sign Bit: Determines the sign of the floating point number
• Exponent: 8-bit biased number (−127 bias)
• Significand: 23 significand bits representing $2^{-1}, 2^{-2}, \ldots$
Given a number in floating point representation, we can convert it back to
decimal by applying the following formula:
$n = (-1)^{s} \cdot (1 + \text{significand}) \cdot 2^{\text{exponent} - 127}$
Notice a couple things about this representation.
• If the exponent needed is too large to fit in 8 bits, overflow occurs
• If the exponent needed is too negative to fit in 8 bits, underflow occurs
• There are two zeros: positive 0 and negative 0
Because of the way that floating point is built, certain bit patterns are designated
to represent specific values.
Exponent   Significand   Object
0          0             0
0          nonzero       denorm
1-254      anything      ±#
255        0             ±∞
255        nonzero       NaN
A denormalized (denorm) number is a number where we don't have an implicit 1 as the
mantissa. These numbers let us represent incredibly small numbers. They have
an implicit exponent of −126.
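The formula above can be checked directly by pulling the fields out of a float's bit pattern. A minimal C sketch, assuming the standard 32-bit IEEE 754 layout:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int main(void) {
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bits */
    uint32_t sign     = bits >> 31;          /* 1 sign bit          */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 exponent bits     */
    uint32_t frac     = bits & 0x7FFFFF;     /* 23 significand bits */
    /* n = (-1)^s * (1 + significand) * 2^(exponent - 127) */
    printf("sign=%u exponent=%u (unbiased %d) significand=0x%06X\n",
           sign, exponent, (int)exponent - 127, frac);
    return 0;
}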
2 C
2.1 C Features
• C is a compiled language =⇒ executables are rebuilt for each new system
• Every variable holds garbage until initialization
• Function parameters are pass by value
2.3 Pointers
Definition 3 The address of an object is its location in memory
Pointers can point to anything, even other pointers. For example, int **a is
a pointer to an integer pointer. This is called a handle.
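A small C sketch of a handle in use (the helper allocate_int is a made-up name for illustration):
#include <stdio.h>
#include <stdlib.h>
/* The int** parameter is a handle: it lets this function change what the caller's pointer points to. */
void allocate_int(int **handle) {
    *handle = malloc(sizeof(int));
    **handle = 61;
}
int main(void) {
    int *p = NULL;
    allocate_int(&p);    /* &p has type int** */
    printf("%d\n", *p);  /* prints 61 */
    free(p);
    return 0;
}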
2.4 Arrays
In C, arrays are represented as adjacent blocks of memory. The way that we
interact with them is through pointers. Consider
int a[];
int *b;
Both of these variables represent arrays in C. They point to the first memory
location of the array. C arrays can be indexed into using [] subscripts like other
programming languages. They can also be indexed into using pointer arithmetic.
For example, *(a + 2) ≡ a[2]. By adding 2 to a, C knows to look two memory
locations into the array (i.e., the third element).
C arrays do not know their own lengths and they do not check their bounds. This means
whenever you are passing an array to a function, you should always pass its size as well.
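A short sketch of this convention in C (the function name sum_array is hypothetical):
#include <stdio.h>
/* The array does not carry its length, so the caller passes it explicitly. */
int sum_array(int *arr, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += *(arr + i);   /* equivalent to arr[i] */
    }
    return total;
}
int main(void) {
    int data[] = {1, 2, 3, 4};
    printf("%d\n", sum_array(data, 4));  /* prints 10 */
    return 0;
}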
2.5 Structs
Structs are the basic data structures in C. Like classes, they are composed of
simpler data structures, but there is no inheritance.
typedef can be a useful command with structs because it lets us name them
cleanly.
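A minimal sketch of a struct declared with typedef (the names are just illustrative):
#include <stdio.h>
/* typedef lets us write point_t instead of struct point everywhere. */
typedef struct point {
    int x;
    int y;
} point_t;
int main(void) {
    point_t p = {3, 4};
    point_t *ptr = &p;
    printf("%d %d\n", p.x, ptr->y);  /* . for structs, -> for pointers to structs */
    return 0;
}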
3 Memory Management
3.1 Memory Basics
In memory, a word is 4 bytes. When objects are saved to memory, they are
saved by words. How the words are ordered depends on the type of system. In
Little Endian systems, the Least Significant Byte is placed at the lowest memory
address. In other words, the memory address points to the least significant byte.
The opposite is true in Big Endian systems. For example, let's say we have the
number 0x12345678 stored at the memory address 0x00:
Address:        0x00  0x01  0x02  0x03
Little Endian:   78    56    34    12
Big Endian:      12    34    56    78
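A quick way to observe this on a real machine is to look at a word one byte at a time, as in this C sketch:
#include <stdio.h>
#include <stdint.h>
int main(void) {
    uint32_t value = 0x12345678;
    uint8_t *bytes = (uint8_t *)&value;   /* view the word byte by byte */
    /* A little-endian machine prints 78 56 34 12; a big-endian machine prints 12 34 56 78. */
    for (int i = 0; i < 4; i++) {
        printf("%02X ", bytes[i]);
    }
    printf("\n");
    return 0;
}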
There are four sections of memory: the stack, the heap, static, and code.
Definition 5 The Stack is where local variables are stored. This includes parameters
and return addresses. It is the "highest" level of memory and grows
"downward" towards heap memory.
Definition 6 The Heap is where dynamic memory is stored. Data lives here
until the programmer deallocates it. It sits below the stack and grows upward
toward it.
Definition 7 Static storage is where global variables are stored. This storage
does not change sizes and is mostly permanent.
Definition 8 Code storage is where the "code" is located. This includes
pre-processing instructions and function calls.
Definition 9 Stack overflow is when the stack grows so large that it intersects
with heap memory. This is mostly unavoidable.
Definition 10 Heap pileup is when the heap grows so large that it starts to
intersect with the stack. This is very avoidable because the programmer manages
it.
Implementation:
Every block in the heap has a header containing its size and a pointer to the
next block. The free blocks of memory are stored as a circular linked list. When
memory needs to be allocated to the heap, this linked list is searched. When
memory is freed, adjacent empty blocks are coalesced into a single larger block.
There are three different strategies which can be used to do this allocation/freeing.
Definition 14 Best-fit allocation is where the entire linked list is searched to
find the smallest block large enough to fit the requirements
Definition 15 First-fit allocation is where the first block that is large enough
to fit the requirement is returned
Definition 16 Next-fit allocation is like first-fit allocation except the memory
manager remembers where it left off in the free list and resumes searching from
there.
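A very simplified C sketch of a first-fit search over a free list (block_t and free_list are made-up names; a real allocator is considerably more involved):
#include <stddef.h>
/* Each free block's header records its size and the next free block. */
typedef struct block {
    size_t size;
    struct block *next;
} block_t;
static block_t *free_list = NULL;   /* head of the free list (circular in practice) */
/* First fit: return the first free block big enough for the request. */
block_t *first_fit(size_t request) {
    for (block_t *b = free_list; b != NULL; b = b->next) {
        if (b->size >= request) {
            return b;   /* best fit would keep scanning for the smallest block that fits */
        }
    }
    return NULL;        /* no block is large enough */
}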
4 RISC-V
4.1 Basics of Assembly Languages
In Assembly Languages like RISC-V, instructions are executed directly by the
CPU. The basic job of the CPU is to take a set of instructions and execute
them in order. The Instruction Set Architecture (ISA) determines how this
is done. In Reduced Instruction Set Computing (RISC) languages, the
instruction set is limited because it makes the hardware simple. Complex
operations are left to the software.
The special zero register always contains the value 0. Unlike variables, registers have no types.
The operation is what determines what the content of the register is treated as.
Immediates are numerical constants. They can also be used as the operands
of assembly instructions.
6. Release local variables and return data to used registers so the caller can
access them (Decrease stack)
7. Return control to the calling function (Jump to ra)
Because every function must have this control flow, the caller-callee convention
was set up as follows:
1. sp, gp, tp, s0-s11 are preserved across a function call
2. t0-t6, a0-a7 are not preserved across a function call
If a register is not preserved across a function call, then the caller cannot expect
its value to be the same after the callee returns. In order to preserve registers
across a function call, we use the stack. The Stack Frame stores the variables
which need to be saved in order to adhere to the caller-callee convention. Every
RISC-V function has a Prologue where it saves the necessary registers to the
stack. An example might look like:
addi sp, sp, -16
sw s0, 0(sp)
sw s1, 4(sp)
sw s2, 8(sp)
sw ra, 12(sp)
This function must use the s0-s2 saved registers. We first create the stack frame
by decrementing the stack pointer. Then we save the saved registers to the
newly allocated memory. This function must be calling another function, so
it has to remember its return address. That is why we save ra to the stack. The
Epilogue is the part of the function before it returns where everything on the
stack is put back and the stack frame is released.
lw s0, 0(sp)
lw s1, 4(sp)
lw s2, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16    # release the stack frame
jr ra
4.2.2 Directives
Directives are special demarcations in an assembly file which designate different
pieces of data.
.text: Subsequent items are put into the text segment of memory (i.e the code)
.data: Subsequent items are put into the data segment of memory (i.e static
variables)
.global sym: Declares a symbol to be global, meaning it can be referenced from other
files
.string str: Stores a null-terminated string in the data memory segment
.word: Stores n 32-bit quantities into contiguous memory words
4.3 Instruction Formats
R-Type
31:25 24:20 19:15 14:12 11:7 6:0
funct7 rs2 rs1 funct3 rd opcode
I-Type
31:20 19:15 14:12 11:7 6:0
Imm[11:0] rs1 funct3 rd opcode
S-Type
31:25 24:20 19:15 14:12 11:7 6:0
Imm[11:5] rs2 rs1 funct3 Imm[4:0] opcode
B-Type
31:25 24:20 19:15 14:12 11:7 6:0
Imm[12|10:5] rs2 rs1 funct3 Imm[4:1|11] opcode
U-Type
31:12 11:7 6:0
Imm[31:12] rd opcode
J-Type
31 30:21 20 19:12 11:7 6:0
Imm[20] Imm[10:1] Imm[11] Imm[19:12] rd opcode
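As a sketch of how the common fields line up with the tables above, here is C code that pulls them out of a 32-bit instruction word (illustrative only, not a full decoder):
#include <stdio.h>
#include <stdint.h>
int main(void) {
    uint32_t inst = 0x00A282B3;   /* encoding of add x5, x5, x10 */
    uint32_t opcode = inst & 0x7F;          /* bits 6:0   */
    uint32_t rd     = (inst >> 7)  & 0x1F;  /* bits 11:7  */
    uint32_t funct3 = (inst >> 12) & 0x07;  /* bits 14:12 */
    uint32_t rs1    = (inst >> 15) & 0x1F;  /* bits 19:15 */
    uint32_t rs2    = (inst >> 20) & 0x1F;  /* bits 24:20 */
    uint32_t funct7 = inst >> 25;           /* bits 31:25 */
    printf("opcode=0x%02X rd=%u funct3=%u rs1=%u rs2=%u funct7=%u\n",
           opcode, rd, funct3, rs1, rs2, funct7);
    return 0;
}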
4.3.1 Addressing
Notice that the J-Type and B-Type instructions use a label in code.
These labels are encoded in the instruction format as an offset from the program
counter. This is known as PC Relative Addressing. Since each instruction
in RISC-V is 1 word, we will never have an odd address. As a result, we don’t
need to store the last bit of the immediate in B-Type and J-Type instructions
because it is automatically 0.
4.4.1 Compiler
The input to the compiler is a file written in a high level programming language.
It outputs assembly language code which is built for the machine the code was
compiled on. The output of the compiler may still include pseudo-instructions
in assembly.
4.4.2 Assembler
The Assembler is the program which converts assembly language code to machine
language code. It reads and uses directives, replaces pseudo-instructions
with their real equivalents, and produces machine language code (i.e., bits) where
it can.
The output of the assembler is an object file. The object file contains the
following elements:
• Object file header: The size and position of the different sections of the
object file
• Symbol Table: A special data structure which lists the file's global labels
and static data labels
• Debugging information
The symbol table contains information which is public to other files in the
program.
• Global function labels
• Data Labels
The Relocation Table contains information that needs to be relocated in later
steps. It essentially tracks everything that the Assembler cannot directly convert
to machine code immediately because it doesn’t have enough information.
• List of labels this file doesn’t know about
• List of absolute labels that are jumped to
• Any piece of data in the static section
When the assembler parses a file, instructions which don't have a label are converted
into machine language. When a label is defined, its position is recorded in
the symbol table. When a label is encountered in code, the assembler looks
to see if its position has already been recorded. If it is found, the label
is replaced with the immediate and converted to machine code. Otherwise, the
line is marked for relocation.
In order to do its job, the assembler must take two passes through the code.
This is because of the forward reference problem. If a label is used before
it is defined in the same file, the first time the assembler encounters it, it won’t
know how to convert it to machine code. To resolve this, the assembler simply
takes two passes: it finds all labels in the first pass and converts the lines it
originally couldn't in the second pass.
4.5 Linker
The Linker is responsible for linking all of the object files together and producing
the final executable. The linker operates in three steps:
1. Put the text segments together
2. Put the data segments together
3. Resolve any referencing issues
After the linking step, the entire program is finally in pure machine code because
all references to labels must be resolved. The linker knows the length of each text
and data segment. It can use these lengths to order the segments appropriately.
It assumes that the first word will be stored at 0x10000000 and can calculate
the absolute address of each word from there. To resolve references, it uses the
relocation table to change addresses to their appropriate values.
• PC-Relative Addresses: Never relocated
• Absolute Function Addresses: Always relocated
• External Function Addresses: Always relocated
• Static data: Always relocated
When the linker encounters a label, it does the following.
1. Search for the reference in the symbol tables
2. Search the library files if the reference is not in the symbol tables
3. Fill in the machine code once the absolute address is determined
This approach is known as Static Linking because the executable doesn’t
change. By contrast, with dynamic linking, libraries are loaded during runtime
to make the program smaller. However, this adds runtime overhead because
libraries must be searched for during runtime.
4.6 Loader
The Loader is responsible for taking an executable and running it.
1. Load text and data segments into memory
2. Copy command line arguments onto the stack
3. Initialize the registers
4. Set the PC and run
5.1 Registers
Registers are state elements frequently used in circuits. On the rising edge of the
clock, the input d is sampled and transferred to the output q. At all other times,
the input d is ignored. There are three critical timing requirements which are
specific to registers:
1. Setup Time: How long the input must be stable before the rising edge of
the clock for the register to properly read it
2. Hold Time: How long the input must be stable after the rising edge of the
clock for the register to properly read it
3. Clock-to-Q Time: How long after the rising edge of the clock it takes for
the input to appear at the register's output
5.1.1 Pipelining
One place where registers become useful is in pipelining. Pipelining a circuit
means placing registers after combinational logic blocks. This stops delay times
from adding up because now intermediate quantities must be stored in the
register before being passed to the next block. This allows for higher clock
frequencies because we are no longer limited by the logic delays, only by our
registers.
[Figure: an example circuit computing (x + y) · (x · y) from inputs x and y, followed by the basic logic gates]
These are the basic gates which are used in combinational logic. Using Boolean
Algebra, we can simplify complicated logic statements/circuits.
$x \cdot \bar{x} = 0$                $x + \bar{x} = 1$
$x \cdot 0 = 0$                      $x + 1 = 1$
$x \cdot 1 = x$                      $x + 0 = x$
$x \cdot x = x$                      $x + x = x$
$x \cdot y = y \cdot x$              $x + y = y + x$
$(xy)z = x(yz)$                      $(x + y) + z = x + (y + z)$
$x(y + z) = xy + xz$                 $x + yz = (x + y)(x + z)$
$xy + x = x$                         $(x + y)x = x$
$\overline{xy} = \bar{x} + \bar{y}$  $\overline{x + y} = \bar{x}\,\bar{y}$
One other important circuit element is the data multiplexor (mux). A mux
takes in n different sources of data and selects one of those sources to output.
It does this using select inputs. For example, if a mux has 3 select bits, then it
can choose from 8 different streams of data.
5.3 Timing
In circuits, timing is incredibly important because combinational logic blocks have
propagation delays (i.e., the output does not change instantaneously with the
input). When used in conjunction with registers (which have their own timing
requirements), if one is not careful, we could build a circuit which produces
inaccurate outputs. When checking whether or not a circuit will do what it is
designed to do, we need to look at two special paths:
• Longest CL Path: The longest path of combinational logic blocks between
two registers
• Shortest CL Path: The shortest path of combinational logic blocks between
two registers
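The max-delay constraint (reconstructed here from the description that follows) is
$$t_{\text{clk-to-q}} + t_{\text{longest CL}} + t_{\text{setup}} \leq T_{\text{clock}}$$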
This first formula tells us the maximum amount of time it takes for an input
to propagate from a register to another register. When the clock rises, the first
register will read the input and output it tclk2q later. The input will then go
through the combinational logic elements. Since we are looking for the maximum,
we only consider the longest CL path. Finally, the output needs to be
stable for tsetup in order for the destination register to read it properly. As long
as our clock period is longer than the max delay, our circuit will work properly.
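The max-hold constraint (again reconstructed from the description that follows) is
$$t_{\text{hold}} \leq t_{\text{clk-to-q}} + t_{\text{shortest CL}}$$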
This second formula tells us the maximum hold time which our registers can
have. When the clock rises, the first register reads the input and outputs it
tclk2q later. The input will then go through the combinational logic elements.
Hold time is how long the input needs to be stable after the rising edge of the
clock, so if the combinational logic works incredibly fast, we will need a short
hold time because otherwise the input to the destination register will change before
its hold time has elapsed. This is why we look at the shortest CL path. As long as
our registers have a hold time which is less than the max-hold, our circuit will
work properly.
6 Datapath
The processor is the part of the computer which manipulates data and makes
decisions. The processor is split into two parts: datapath and control. The
datapath is the part of the processor with the necessary hardware to perform
the operations required by the ISA. The control tells the datapath what needs
to be done. On every tick of the clock, the processor executes a single instruction.
Instruction execution takes place in 5 stages.
1. Instruction Fetch
2. Instruction Decode
3. Execute
4. Memory Access
5. Write Back
Each datapath has several main state elements
• Register File: An array of registers which the processor uses to keep
information out of memory
• Program Counter: A special register which keeps track of where the
processor is in the program
• IMem: A read-only section of memory containing the instructions that
need to be executed
• DMem: Section of memory which contains data the processor needs to
read/write to
Important combinational logic elements of the datapath include
• Immediate Generator: Generates the immediate
• Branch Comparator: Decides whether or not branches should be taken
• ALU: Performs mathematical operations
All parts of the datapath operate at the same time. The control is how the
processor chooses which output to work with. Control can either be done through
combinational logic or by using ROM (Read-Only Memory).
6.1 Pipelined Datapath
A single-cycle datapath is one where every instruction passes through the dat-
apath one at a time. However, this is inefficient because faster stages (such as
register reading) are left unused while waiting for slower stages (such as memory
reading). One way to fix this is to pipeline the datapath. This allows multiple
instructions to use different parts of the datapath at once, speeding up the pro-
cessor because no stage is left unutilized. All we need to do is add registers after
each datapath stage. However, this introduces ”hazards” into the processor.
6.1.1 Structural Hazards
A structural hazard occurs when two instructions need the same hardware resource at the same time.
One solution is to have the instructions take turns to access the resource. The
other option is to add more hardware to distribute that resource. For example,
a Regfile structural hazard would be when the processor needs to read registers
for one instruction and write a register for another. This is solved by giving the
Regfile two independent read ports and one independent write port. Another
example is a memory structural hazard where instructional memory and data
memory are used simultaneously. This is solved by separating them into IMem
and DMem.
6.1.2 Data Hazards
Problem: The result from the ALU will take 2 cycles to be written back.
An instruction might need it before then
Solution 1: We could stall by introducing no-ops (instructions that do nothing),
but this would kill performance.
Solution 2: We can add a loop from the output of the ALU to the ALU
input via a mux and add a control element to determine when we should use
the forwarded value.
Solution 2: Turn off all write enables and run the second instruction twice
Solution 3: Reorder instructions so that the load delay slot doesn’t use the
loaded result
6.1.3 Control Hazards
A control hazard arises from branches: the processor does not know which instructions to fetch next until the branch is resolved.
Notice that this is only a problem when a branch is taken because it means the
pipeline must be flushed to get the wrong instructions out of the pipeline (by
converting them to no-ops). A more advanced way to fix this is to implement
branch prediction which is where the processor attempts to predict whether or
not a branch is taken and load instructions based on that.
7.1 Caching
The closer memory is to the processor, the faster it is. However, that also makes
it more expensive. The idea of caching is to copy a subset of main memory and
keep it close to the processor. We can then stack multiple layers of caches (each
one with more storage than the other) before we actually need to read DRAM.
Definition 20 Temporal locality is the idea that if data is used now, it will be
used later on
Definition 21 Spatial locality is the idea that if data is used now, nearby data
will likely be used later
Caches take advantage of both temporal and spatial locality to provide quick
memory access.
7.1.1 Direct Mapped Cache
In a direct mapped cache, data is transferred between main memory and the
cache in segments called blocks. Each memory address is associated with one
possible block in the cache. That way, when checking whether or not the data
located at a particular address is located in the cache, we only need to check a
single location. Each address is split into three parts.
• Offset: The number of the byte within the block that should be accessed.
• Index: The block in the cache
• Tag: Remaining bits which distinguish memory addresses from each other
In this way, the direct-mapped cache is similar to a Hashmap where the key is
the Index and the Tag. When given an address, the direct mapped cache will
first go to the block specified by the address' index. It will then compare the tags.
In a fully associative cache, by contrast, a block of data can go anywhere in the
cache, so the address is split into only a Tag and an Offset.
In this way, fully associative caches are like an array. When looking data up
in the array, we search through it. The benefit of this is that we have no more
conflict misses because there are no mappings between memory addresses. The
drawback is that we need a comparator for every single entry.
In an N-Way Set Associative cache, the cache is divided into sets that each hold N blocks.
To check whether a memory address is stored inside the cache, we first identify the set using the
index and then check all of the tags in that set. This is why the address is split
into 3 parts:
• Offset: The number of the byte within the block that should be accessed.
• Index: The set in the cache
• Tag: Remaining bits which distinguish memory addresses from each other
We can think of N-Way Set Associative caches like Hashmaps where the key
is the index and the value is an array. We must search through the array (by
checking tags) to actually get the correct block and then use the offset to retrieve
the data at a location in the block. N-Way set associative caches are useful
because now we get the benefits of fully-associative caches, but we only need N
comparators to check the tags, allowing us to build larger caches.
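For concreteness, here is a small C sketch of splitting an address into tag, index, and offset, assuming (hypothetically) 16-byte blocks and 64 sets:
#include <stdio.h>
#include <stdint.h>
#define OFFSET_BITS 4u   /* 16-byte blocks -> 4 offset bits */
#define INDEX_BITS  6u   /* 64 sets        -> 6 index bits  */
int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* The cache goes to set `index`, compares `tag` against the tags stored there,
       and on a hit returns the byte at `offset` within the matching block. */
    printf("tag=0x%X index=%u offset=%u\n", tag, index, offset);
    return 0;
}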
3. If the data is not in the cache, memory is read at that address, the data is
loaded into the cache, and then the cache sends the data to the processor
Definition 22 A cache hit is when the cache contains the requested data and
passes it to the processor
Definition 23 A cache miss is when the cache does not contain the requested
data, so it must read from memory
The hit rate is the fraction of accesses which are hits whereas the miss rate is the
fraction of accesses that are misses. It takes time to replace data in the cache
because it must be read from memory, so there is some miss penalty which is
how long it takes to replace that data. Likewise, the hit time is how long it
takes to access cache memory.
When choosing parameters for a cache, one must keep these metrics in mind.
For example, a large block size might give us better spatial locality, but eventually
the number of cache misses will increase, and they will have a larger penalty
because more memory is being accessed at once.
LRU replacement policies make sure that frequently accessed data stays in the
cache. Note that both FIFO and LRU replacement policies only kick in once
the Fully Associative cache is full or a set in the N-Way Associative cache is
full.
The main idea of Virtual Memory is to have programs operate using virtual
addresses and have the operating system translate those into physical addresses.
Definition 29 The virtual address space is the set of all addresses that the user
program knows about and has available to run its code.
Definition 30 The physical address space is the set of addresses which map to
a location in physical memory (i.e DRAM)
Both the virtual address space and physical address space are chunked into units
called pages (similar to blocks in caches). Pages are the units of transfer from
Virtual Memory to Physical Memory and Physical Memory to Disk.
Definition 31 The page table tells us which virtual page maps to which physical
page
The page table also contains information such as access rights for that page.
Definition 32 The page table base register tells the processor where the page
table is located in memory.
Because we must go to memory for the page table, every single memory access
would take 2 trips to memory (one for the page table, the other for the data).
To stop this, we use a Translation Lookaside Buffer (TLB) to cache the page
table entries so we can still access them quickly.
The VPN is further divided into the TLB tag and TLB index. We first check whether
the tags match at the TLB index. If they do, then the TLB will return
the PPN. If not, we need to check the page table for the mapping between VPN
and PPN. Once the PPN is retrieved, we concatenate the PPN with the page
offset. This gives us a physical address which we can then use to check the cache
and main memory.
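A small C sketch of the translation arithmetic, assuming 4 KiB pages and a toy page table (the array here stands in for the real page table the OS maintains):
#include <stdio.h>
#include <stdint.h>
#define PAGE_OFFSET_BITS 12u   /* assume 4 KiB pages */
/* Toy page table: index = VPN, value = PPN (a real entry also holds access rights). */
static uint32_t page_table[16] = { 5, 9, 3 };
uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_OFFSET_BITS;                /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_OFFSET_BITS) - 1);   /* page offset */
    uint32_t ppn    = page_table[vpn];                       /* the TLB caches this lookup */
    return (ppn << PAGE_OFFSET_BITS) | offset;               /* concatenate PPN and offset */
}
int main(void) {
    printf("0x%08X\n", translate(0x00001ABC));   /* VPN 1 -> PPN 9, so prints 0x00009ABC */
    return 0;
}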
[Figure: the processor sends a virtual address (VA) to the TLB, which maps the VPN to a PPN, falling back to the page table on a miss; the resulting physical address (PA) goes to the cache, and to memory on a cache miss.]
8 Parallelism
Parallelism is the concept of doing multiple things at the same time. This
can be done in both hardware and software. We classify parallelism based on
Flynn’s taxonomy which breaks things down into how a program handles its
data streams and how the program handles instruction streams.
Definition 36 Single Instruction, Single Data (SISD) programs are normal
programs where instructions are executed sequentially and there is a single source
of data
Definition 37 Single Instruction, Multiple Data (SIMD) programs are like SISD
in that they take a single sequence of instructions, but they leverage special
instructions which apply the same operation to many pieces of data at a time (i.e.,
vector operations)
The fourth part of the taxonomy is Multiple Instruction, Single Data (MISD),
but it is not as commonly used.
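As an illustration of SIMD from C, here is a sketch using Intel SSE intrinsics (one common way to express data-level parallelism; the course's own examples may differ, and this only compiles on x86 with SSE):
#include <stdio.h>
#include <immintrin.h>
int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];
    __m128 va = _mm_loadu_ps(a);      /* load four floats at once */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction performs four additions */
    _mm_storeu_ps(c, vc);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}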
8.2.1 Threads
Each core in a multiprocessor has multiple hardware threads which can execute
instructions. The Operating System multiplexes software threads onto these
hardware threads. All the threads which are not mapped to hardware threads
are marked as waiting. The OS chooses which threads run by
1. Removing a blocked/finished thread by interrupting execution and saving
the registers/PC
2. Multiplexing a new software thread onto the hardware thread by loading
the registers and PC
3. Starting execution by jumping to the saved PC and running the processor
8.2.2 Synchronization
Thread-Level Parallelism introduces some complications, however, because multiple
processors have to work together to make the program work. One issue
which can arise is synchronization. Suppose two threads need to update the same
variable. We don't know which order the threads will access the variable in, so
it is possible they could completely overwrite each other's work by mistake. To solve
this problem, we use locks.
8.2.4 OpenMP
OpenMP is a C extension which enables multi-level, shared memory parallelism.
It utilizes compiler directives called pragmas to tell the compiler it can
parallelize sections of code. OpenMP programs begin as a single process. When
a parallel region is encountered, OpenMP forks into parallel threads. These
threads are rejoined at the end of the parallel section.
#pragma omp parallel for
The parallel for pragma is placed above a for loop to parallelize it, as in the sketch below.
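A minimal sketch of a parallelized loop (vector addition with made-up array names):
#include <stdio.h>
#include <omp.h>
#define N 1000
int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
    /* Iterations of this loop are divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    printf("%f\n", c[N - 1]);   /* prints 2997.000000 */
    return 0;
}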
#pragma omp parallel
A regular parallel pragma marks a parallel section. In order to handle locks,
OMP uses the critical pragma
#pragma omp critical
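8.2.5 Amdahl's Law
Amdahl's Law bounds the overall speedup from parallelizing only part of a program. In its standard form (stated here since the original formula is not reproduced above),
$$\text{Speedup} = \frac{1}{S + \frac{1 - S}{P}}$$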
where S is the fraction of the program that is serial and P is the factor by which
the parallel part is sped up. In the limit, as P grows infinitely large, we see that
the maximum speedup we can get is $\frac{1}{S}$; therefore, we will always be at least as slow as
the serial part of the program.
9 Input/Output
Input/Output (I/O) is how humans interact with computers. IO devices send
data to the CPU via the IO interface. There are two different models of IO.
• Special IO instructions to direct the CPU
• Memory Mapped IO: A portion of the address space is reserved for IO
and the CPU uses normal lw, sw operations to access the data
In Memory Mapped IO, we have
• A control register: Says it's OK for the CPU to read/write the data register
• Data Register: Contains the data from the IO device
During Polling, the processor continuously reads the control register until the
data register is ready. After that, it reads the data. However, polling is a waste
of processor resources, so for devices which are read from infrequently, we use
Interrupts and Exceptions.
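A rough C sketch of polling with Memory Mapped IO (the addresses and the ready bit below are made up; real values depend on the device):
#include <stdint.h>
/* Hypothetical memory-mapped registers. volatile forces a real load on every access. */
#define CONTROL_REG ((volatile uint32_t *)0x40000000)
#define DATA_REG    ((volatile uint32_t *)0x40000004)
#define READY_BIT   0x1u
uint32_t poll_read(void) {
    /* Spin until the device sets the ready bit in the control register. */
    while ((*CONTROL_REG & READY_BIT) == 0) {
        /* busy-waiting: this is the wasted work that interrupts avoid */
    }
    return *DATA_REG;   /* read the data register once it is ready */
}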
Definition 41 An interrupt is an external event from the IO device which is
asynchronous to the current program, meaning it can be handled when the processor
finds it convenient