
Summary Mid-Term Concurrency

Lecture 3: Threads
Concurrency:

- Tasks are concurrent with respect to each other if:
o They may execute out of order;
o This implies they can be executed at the same time, but it is not required.
- Deals with lots of things at once.
- Composition of independently executing processes.

Parallelism:

- Tasks are parallel if they are executed simultaneously:
o Requires multiple processing elements;
o Motivation → reduce the overall (wall-clock) running time of the program through parallel execution.
- Do lots of things at once.
- Simultaneous execution of (possibly related) computations.

Application that is:

- Concurrent but not parallel, examples:
o Tasks occurring at the same time, in any order.
o If you are waiting on something (e.g. a network response), the order is not determined.
o The tasks may be communicating with each other.
- Parallel but not concurrent, examples:
o Tasks happen at the same time.
o There are no multiple independent tasks.
o A single instruction → addition over four elements.

Contexts of concurrency:

- Multithreading: concurrent threads share an address space, so they can talk to each other directly through shared memory.
- Multiprogramming: concurrent processes execute on a uniprocessor (each with its own memory space).
- Multiprocessing: concurrent processes execute on a multiprocessor.
- Distributed processing: concurrent processes execute on multiple nodes connected by a network.

Forms where concurrency is used:

- Multiple applications (multiprogramming).


- Structured applications (application is a set of threads/processes).
- Operating system structure (OS is a set of threads/processes).

Concurrent processes (threads) need special support:

- Communication among processes


- Allocation of processor time
- Sharing resources
- Synchronisation of multiple processes

Dangers:

- Sharing global resources (order of read & write operations)


- Management of allocation of resources (danger of deadlock)
- Programming errors are difficult to locate (Heisenbugs)

Critical sections: controlling access to the code utilising those shared resources.

Processes can:

- Compete for resources


o May not be aware of each other
o Execution must not be affected by each other
o OS is responsible for controlling access
- Cooperate by sharing a common resource
o Programmer is responsible for controlling access
o Hardware / OS / programming language may provide support

Threads of a process usually do not compete, but cooperate.

Three control problems:

- Mutual exclusion: critical resources  critical sections.


o Only one process at a time is allowed in a critical section.
o Only one process at a time is allowed to send commands to the GPU.
- Deadlock: two processes and two resources.
- Starvation: three processes compete for a single resource.

Terminology:

- Deadlock:
o Each member of a group of processes is waiting for another to take action.
o E.g. waiting for another to release a lock.
- Livelock:
o The states of the group of processes are constantly changing with regard to each
other, but none are progressing.
o E.g. trying to obtain a lock, but backing off if it fails.
- Obstruction-freedom:
o From any point after a thread begins executing in isolation, it finishes in a finite
number of steps.
o A thread will be able to finish if no other thread makes progress.
- Lock-freedom:
o Some method will finish in a finite number of steps.
- Wait-freedom:
o Every method will finish in a finite number of steps.
- Starvation:
o A process is perpetually denied access to a resource it needs.

Mutual exclusion
- Locking.
- Protects shared resources.
- Only one process at a time is allowed to access the critical resource.
- Modifications to the resource appear to happen atomically.
- Implementation requirements:
o Only one thread at a time is allowed in the critical section for a resource.
o No deadlock or starvation on attempting to enter/leave the critical section.
o A thread must not be denied access to a critical section when no other thread is using it.
o A thread that halts in its non-critical section must do so without interfering with
other threads.
o No assumptions made about relative thread speed or number of processes.
- Usage conditions:
o A thread remains inside its critical section for a finite time only.
o No potentially blocking operation should be executed inside the critical section.
- Three ways to satisfy the implementation requirements:
o Software approach: put the responsibility on the processes themselves.
o Systems approach: provide support within the OS or programming language.
o Hardware approach: special-purpose machine instructions.

Software approach

- Premise
o One or more threads with shared memory.
o Elementary mutual exclusion at the level of memory access
 Simultaneous accesses to the same memory location are serialised.
- Requirements for mutual exclusion
o Only one thread at a time is allowed in the critical section.
o No deadlock or starvation.
- Attempt 1
o The plan
 Threads take turns executing the critical section.
 Exploit serialisation of memory access to implement serialisation of access
to the critical section
o Employ a shared variable (memory location) turn that indicates whose turn it is to
enter the critical section.
o Busy waiting (spin lock)
 Process is always checking to see if it can enter the critical section
 Implements mutual exclusion
 Simple
o Disadvantages
 Process burns resources while waiting
 Processes must alternate access to the critical section
 If one process fails anywhere in the program, the other is permanently
blocked.
- Attempt 2
o The problem
 turn stores who can enter the critical section, rather than whether anybody
may enter the critical section.
o The new plan
 Store for each process whether it is in the critical section right now.
 flag[i] is true if process i is in the critical section right now.
o Requires programs to be well-behaved, which doesn’t always happen.
o If a process fails:
 Outside the critical section: the other is not blocked;
 Inside the critical section: the other is blocked (however, difficult to avoid).
o Does it work?
 Does not guarantee exclusive access.
- Attempt 3
o The goal
 Remove the gap between checking the other thread’s flag and setting our own.
o The updated plan
 Move setting the flag before checking whether we can enter.
o Does it work?
 No. The gap can cause a deadlock now.
- Attempt 4
o Problem
 A process sets its own state before knowing the other process’s state and cannot back off.
o The updated plan
 Process retracts its decision if it cannot enter.
o Does it work?
 No, we have a livelock.
 A special case of resource starvation, and a risk for algorithms which
attempt to detect and recover from deadlock.
- Attempt 5
o Improvements
 We can solve this problem by combining the fourth and first attempts.
 In addition to the flags we use a variable indicating whose turn it is to have
precedence in entering the critical section.
o Peterson’s algorithm
 Both processes are courteous and solve a tie in favour of the other.
 Algorithm can be generalised to work with n processes.
o Statement: mutual exclusion
 Threads 0 and 1 are never in the critical section at the same time.
o Proof
 If P0 is in the critical section then:
 flag[0] is true;
 flag[1] is false OR turn is 0 OR P1 is trying to enter the critical section, after setting flag[1] to true but before setting turn to 0.
 For both P0 and P1 to be in the critical section we would need:
 flag[0] AND flag[1] AND turn = 0 AND turn = 1, which is impossible, since turn cannot be 0 and 1 at once.
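As an illustration, a minimal Haskell transliteration of Peterson’s algorithm (the names newPeterson/enter/leave are made up here, not taken from the lecture code). Note that plain IORef reads and writes in GHC do not guarantee the sequentially consistent memory ordering the algorithm relies on, so this sketch is for exposition only:

import Data.IORef
import Control.Monad (when)

-- Illustrative sketch only: GHC's IORef does not promise the sequentially
-- consistent ordering that Peterson's algorithm assumes.
data Peterson = Peterson
  { flag0 :: IORef Bool   -- thread 0 wants to enter
  , flag1 :: IORef Bool   -- thread 1 wants to enter
  , turn  :: IORef Int    -- who yields in case of a tie
  }

newPeterson :: IO Peterson
newPeterson = Peterson <$> newIORef False <*> newIORef False <*> newIORef 0

enter :: Peterson -> Int -> IO ()   -- i is 0 or 1
enter p i = do
  let (mine, theirs) = if i == 0 then (flag0 p, flag1 p) else (flag1 p, flag0 p)
      other          = 1 - i
  writeIORef mine True              -- announce our intent
  writeIORef (turn p) other         -- be courteous: give the other precedence
  let wait = do                     -- busy-wait while the other is interested
        busy <- readIORef theirs    --   and it is their turn
        t    <- readIORef (turn p)
        when (busy && t == other) wait
  wait

leave :: Peterson -> Int -> IO ()
leave p i = writeIORef (if i == 0 then flag0 p else flag1 p) False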

Hardware support
- The compare-and-swap (CAS) operation is an atomic instruction which allows mutual
exclusion for any number of threads using a single bit of memory.
o Compares the memory location to a given value.
o If they are the same, writes a new value to that location.
o Returns the old value of the memory location.
- The plan
o Use a bit lock where zero represents unlocked and one represents locked.
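A minimal sketch of this plan in Haskell, using atomicModifyIORef' as the compare-and-swap-like primitive (the names newLock/acquire/release are illustrative, not from the lecture code):

import Data.IORef
import Control.Monad (unless)

-- A bit lock: False = unlocked, True = locked.
type BitLock = IORef Bool

newLock :: IO BitLock
newLock = newIORef False

acquire :: BitLock -> IO ()
acquire lock = do
  -- Atomically set the bit to True and report whether it was previously False.
  gotIt <- atomicModifyIORef' lock (\locked -> (True, not locked))
  unless gotIt (acquire lock)   -- someone else held it: spin and retry

release :: BitLock -> IO ()
release lock = atomicWriteIORef lock False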
Lecture 4: State
Threads:

- The fundamental action in concurrency: create a new thread of control.

forkIO :: IO () -> IO ThreadId

- Takes a computation of type IO () as its argument;
- This IO action executes in a new thread, concurrently with other threads;
- No specified order in which threads execute;
- Threads are very cheap: ~1.5 KB per thread, so you can easily run thousands of threads.

See lec4ex for examples

Ex1: interleaving of two threads

- The term n :: Int is shared between both threads (captured)
- → Safe because it is immutable;
- The program exits when main returns, even if there are other threads still running.
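A minimal sketch in the spirit of Ex1 (not the actual lec4ex code); the interleaving of the output is not determined:

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forM_)

main :: IO ()
main = do
  let n = 10 :: Int                                  -- immutable, so safe to capture in both threads
  _ <- forkIO (forM_ [1 .. n] (\i -> putStrLn ("thread A: " ++ show i)))
  forM_ [1 .. n] (\i -> putStrLn ("main: " ++ show i))
  threadDelay 100000                                 -- crude: give thread A a chance before main returns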

Sharing state

- IORef: mutable reference to some value


o Not designed for concurrency: you need to protect the critical section yourself.
o Compare-and-swap behaviour using atomicModifyIORef'
o Ex2, shared-state concurrency using IORef

import Data.IORef
newIORef :: a -> IO (IORef a)
readIORef :: IORef a -> IO a
writeIORef :: IORef a -> a -> IO ()
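A sketch in the spirit of Ex2 (not the actual lec4ex code): two threads bumping a shared counter. Plain readIORef/writeIORef would race; atomicModifyIORef' gives the compare-and-swap-style atomic update mentioned above:

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)
import Data.IORef

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  let bump = atomicModifyIORef' counter (\x -> (x + 1, ()))
  _ <- forkIO (replicateM_ 1000 bump)
  replicateM_ 1000 bump
  threadDelay 100000                 -- crude wait; real code would synchronise properly
  readIORef counter >>= print        -- 2000 with atomic updates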

- MVar: synchronised mutable references


o Like IORefs, but with a lock attached for safe concurrent access.
o For communication between concurrent threads.
o An MVar is a box that is either empty or full.
o takeMVar removes the value from the box; blocks if it is currently empty.
o putMVar puts a value in the box; blocks if it is currently full.
o Ex3, synchronising variables for communication between concurrent threads.

import Control.Concurrent.MVar
newMVar :: a -> IO (MVar a)
newEmptyMVar :: IO (MVar a)
takeMVar :: MVar a -> IO a
putMVar :: MVar a -> a -> IO ()
readMVar :: MVar a -> IO a
withMVar :: MVar a -> (a -> IO b) -> IO b

o The runtime system can (sometimes) detect when a group of threads are
deadlocked.
 Only a conservative approximation to the future behaviour of the program.
o Other useful operations
 If multiple threads are blocked in takeMVar or putMVar, a single thread is
woken up in FIFO order: fairness.
 readMVar is multiple wakeup.
 withMVar can be used to protect critical sections.
o A lock
 MVar () behaves as a lock: full is unlocked, empty is locked.
 Can be used as a mutex to protect some shared state or critical section.
o A one-place channel
 For passing messages between threads.
 An asynchronous channel with a buffer size of one.
o A container for shared mutable data.
o A building block for constructing larger concurrent data structures.
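A sketch in the spirit of Ex3 (not the actual lec4ex code): an MVar used as a one-place channel between a worker thread and main:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

main :: IO ()
main = do
  box <- newEmptyMVar
  _ <- forkIO (putMVar box (sum [1 .. 1000000 :: Int]))  -- worker sends its result
  result <- takeMVar box                                 -- main blocks until the value arrives
  print result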

Regaining determinism (IORef)

- The goal:
o Want a way to specify data-flow dependencies between computations.
o Results should be deterministic.
o Ex4.
- Data flow
o Key idea: a non-deterministic result can only arise from a choice between multiple puts, so make that an error.
o Ex4.

Asynchronous computations (MVar)

- The goal:
o Want a way to run computations asynchronously and wait for their result.
o Cancel running computations.
- Perform an action asynchronously and later wait for the result; a sketch follows below.
- Ex5
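A minimal sketch of such an Async abstraction built on MVar (in the spirit of Ex5, not the actual lec4ex code; exception handling and cancellation are omitted):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

newtype Async a = Async (MVar a)

async :: IO a -> IO (Async a)
async action = do
  box <- newEmptyMVar
  _ <- forkIO (action >>= putMVar box)   -- run the action and deposit its result
  return (Async box)

wait :: Async a -> IO a
wait (Async box) = readMVar box          -- readMVar, so several callers may wait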

Unbounded queue

- The goal:
o enqueue does not block (indefinitely).
o Writers and readers do not conflict with each other.
- See the slides for the structure of the queue
o Orange boxes are MVars; these are the parts we have to update (the read end and the write end) → mutable structures.
o Blue boxes are the sequence of values → immutable things, linked together by mutable references.
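A sketch of this structure (essentially the classic Control.Concurrent.Chan construction): a stream of items linked by MVars, with separate MVars for the read end and the write end so that readers and writers do not conflict:

import Control.Concurrent.MVar

type Stream a = MVar (Item a)            -- the mutable links ("orange boxes")
data Item a   = Item a (Stream a)        -- the immutable values ("blue boxes")

data Chan a = Chan (MVar (Stream a))     -- read end
                   (MVar (Stream a))     -- write end

newChan :: IO (Chan a)
newChan = do
  hole     <- newEmptyMVar
  readEnd  <- newMVar hole
  writeEnd <- newMVar hole
  return (Chan readEnd writeEnd)

writeChan :: Chan a -> a -> IO ()        -- never blocks indefinitely
writeChan (Chan _ writeEnd) x = do
  newHole <- newEmptyMVar
  oldHole <- takeMVar writeEnd
  putMVar oldHole (Item x newHole)       -- fill the old hole with the value
  putMVar writeEnd newHole               -- advance the write end

readChan :: Chan a -> IO a
readChan (Chan readEnd _) = do
  stream      <- takeMVar readEnd
  Item x rest <- readMVar stream         -- blocks while the queue is empty
  putMVar readEnd rest                   -- advance the read end
  return x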

Fairness

- I.e. no thread should be starved of CPU time indefinitely.
- Threads blocked on an MVar are woken up in FIFO order: single wakeup.
Lecture 5: STM (1)
Software transactional memory

Locks are bad:

- Taking too few/many locks;


o Inhibits concurrency (at best) or causes deadlock (at worst).
- Taking the wrong locks;
- Taking locks in the wrong order;
- Error recovery;
- Lost wake-ups or erroneous retries;
- Locks don’t support modular programming;
o Building larger functions out of smaller ones.

STM is loosely based on atomic blocks: it allows your program to say which parts need to behave atomically.

Atomic blocks:

- The idea
o Garbage collectors allow us to program without malloc and free → can we do the same for locks? What would that look like?
o Modular concurrency.
o Locks are pessimistic.
- STM
o A technique for implementing atomic blocks.
 Atomicity: effects become visible to other threads all at once.
 Isolation: cannot see the effects of other threads.
 Use a different type to wrap operations whose effect can be undone if
necessary.
o Sharing state
 We use TVar in place of IORef as a transactional variable.
 TMVar in place of MVar: the variable is either full or empty → threads wait for the appropriate state.
o STM actions are composed together in the same way as IO actions.
o STM actions are executed as a single, isolated atomic block.
o Types are used to isolate transactional actions from arbitrary IO actions.
 To get from STM to IO we have to execute the entire action atomically.
 Can’t mix monads.
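A small hypothetical example (not from the lecture code) of composing STM actions and running them atomically:

import Control.Concurrent.STM

-- Transfer between two TVar balances: the whole transfer is one atomic,
-- isolated action, so other threads never see the intermediate state.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to amount = atomically $ do
  x <- readTVar from
  writeTVar from (x - amount)
  y <- readTVar to
  writeTVar to (y + amount)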

Implementing transactional memory

- How to implement atomically


o Not with a single global lock (take the lock each time you start a block and release it when you are done) → it doesn’t scale and makes the program slower.
o Optimistic execution, without taking any locks.
- At the start of the atomic block, begin a thread-local transaction log
o Each writeTVar records the address and the new value in the log.
o Each readTVar searches the log and
 Takes the value of an earlier writeTVar; or
 Reads the TVar and records the value into the log.
- At the end of the atomic block the transaction log must be validated
o Check that each readTVar recorded in the log still matches the current value.
o If successful, all writeTVars recorded in the log are committed to the real TVars.
o The validate and commit steps must be truly atomic.
- If validation fails
o The transaction executed with an inconsistent view of memory.
o Re-execute the transaction with a new transaction log.
 Since none of the writes were committed to memory, this is safe to do.
 It is critical that the atomic block contains no actions other than reads and writes to TVars.
Lecture 6: STM (2)
Concurrency control required for safe access to shared state between threads.
- Mutual exclusion: critical resources  critical section.
 Only one process allowed in the critical section at once.
- Could lead to deadlock (a situation where the system is stuck) and starvation (a thread has
repeatedly bad luck).

Exam questions:

 What are the requirements for implementing mutual exclusion?


o There is mutual exclusion after you take the lock: only one thread at a time holds the lock, so only one thread at a time is in the critical section.
o The lock doesn’t cause a deadlock.
o Absence of starvation (not strictly necessary).
 MVars guarantee absence of starvation (threads are woken up in FIFO order); a CAS lock does not.
 What are the requirements for using critical sections?
o Don’t stay too long in the critical section.
o Guarantee that there are no deadlocks.
 What is an atomic section?
o A piece of code where the effects should become visible all at once.
o It shouldn’t be possible for another thread to observe an intermediate state of that
atomic block.

Blocking

- Wait for some condition to be true or a resource to become available.


o Abandon the current transaction and begin again;
o Only when the inputs change, to avoid busy waiting.

retry :: STM a

- Starts the entire atomic block again; a small sketch follows below.
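A hypothetical sketch (not from the lecture code) of a blocking take built with retry; if the box is empty, the transaction is abandoned and re-run when the TVar changes:

import Control.Concurrent.STM

takeTM :: TVar (Maybe a) -> STM a
takeTM v = do
  m <- readTVar v
  case m of
    Nothing -> retry                         -- block until v changes
    Just x  -> do writeTVar v Nothing        -- empty the box
                  return x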

TMVar

- Threads blocked on an MVar are woken up in FIFO order; STM doesn’t have that guarantee;
- When multiple threads are blocked on a TVar, which should be woken up?
o Choose an alternative action if the first transaction calls retry.
 If the first action returns a result, that is the result of the orElse.
 If the first action retries, the second action runs.
 If the second action retries, the whole action retries.
 Since the result of orElse is also an STM action, you can feed it another call
to orElse and so choose from an arbitrary number of alternatives.

orElse :: STM a  STM a  STM a

STM as a building block (I)

- The goal:
o Run computations asynchronously and wait for the results.
o Cancel and race running computations (run two computations and see which one finishes first).
o Same basic interface:

data Async a

async :: IO a  IO (Async a)
wait :: Async a  IO a
race :: Async a  Async b  IO (Either a b)

o See lec6ex.
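A sketch of such an STM-based Async in the spirit of lec6ex (not the actual code); the result lands in a TVar, and wait and race are built from retry and orElse:

import Control.Concurrent (forkIO)
import Control.Concurrent.STM

newtype Async a = Async (TVar (Maybe a))

async :: IO a -> IO (Async a)
async action = do
  var <- newTVarIO Nothing
  _ <- forkIO (action >>= atomically . writeTVar var . Just)
  return (Async var)

waitSTM :: Async a -> STM a
waitSTM (Async var) = readTVar var >>= maybe retry return

wait :: Async a -> IO a
wait = atomically . waitSTM

race :: Async a -> Async b -> IO (Either a b)
race a b = atomically ((Left <$> waitSTM a) `orElse` (Right <$> waitSTM b))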

STM as a building block (II)

- Key-value map
- The goal:
o A key-value map that can be accessed concurrently by multiple threads.
o Basic interface:

data Map k v

insert :: Ord k  k  v  Map k v  Map k v


lookup :: Ord k  k  Map k v  Maybe v

- Option #1 (see lec6ex).


o A regular (pure) key-value map in a mutable box.
o Simple, safe.
o No concurrency.
- Option #2 (see lec6ex).
o A pure map in a box, but using STM.
o Safe concurrent lookup.
o Insertion updates the entire tree (all other threads must retry).
- Option #3 (see lec6ex).
o Allows values to be read and adjusted (mutated) concurrently.
o Fixed key set.
- Option #4 (see lec6ex).
o Implement the data structure ourselves
o Goal: fully concurrent insertion and lookup
o Updates to disjoint parts of the tree do not conflict with each other.
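As an illustration of Option #2 above (not the actual lec6ex code): a pure Data.Map inside a single TVar. Lookups are safely concurrent, but every insert replaces the whole map, so it conflicts with every other transaction using the TVar:

import Control.Concurrent.STM
import qualified Data.Map.Strict as M

type TMap k v = TVar (M.Map k v)

newTMap :: IO (TMap k v)
newTMap = newTVarIO M.empty

insertT :: Ord k => k -> v -> TMap k v -> STM ()
insertT k v m = modifyTVar' m (M.insert k v)   -- rewrites the root: conflicts with everyone

lookupT :: Ord k => k -> TMap k v -> STM (Maybe v)
lookupT k m = M.lookup k <$> readTVar m        -- read-only: safe concurrent lookup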

STM offers composable blocking and atomicity → concurrent programming without locks.
Fairness: all blocked threads are woken up when a TVar changes.
Threads cannot block and have visible side effects.

Why smaller atomically blocks are better:

 more likely that the transaction will succeed.
 discarding the effects of the transaction is easy: delete the log.
 each readTVar must traverse the log to see if it was written by an earlier writeTVar: O(n).
 a transaction that called retry is woken up whenever one of the TVars in its read set changes: O(n).
 a long-running transaction can re-execute indefinitely because it is repeatedly aborted by shorter transactions: starvation.
Lecture 8: Parallelism
Parallelism: doing lots of things at once.
 simultaneous execution of (possibly related) computations.
 At least two threads are executing simultaneously.

Mostly hardware perspective

- SIMD (Single-instruction multiple data)


o Do more per instruction.
o Low level, something we can do.
- SMT (simultaneous multithreading)
o Goal → increase IPC and make better use of the functional units that are available every cycle.
o Execute multiple instructions from different threads in a single cycle.
o Two threads per core max (hyperthreading).
o Mostly hidden from us.
- Out-of-order / speculative execution
o Goal → hide latency.
o Improve performance.
o Do other stuff while going out to main memory to get data back.
o Guess what the value that would have returned is and use that to go further.
o Mostly hidden from us.
- Multi-core multi-socket
o NUMA: non-uniform memory access.
o The time it takes to read a piece of data can differ from one core to another.
o → Accessing data is faster for a CPU if that data is attached to that CPU rather than to the other socket.
- Accelerators
o GPU/DPU/SmartNIC/FPGA/…
- Distributed

How a processor works:

- Fetch instruction pointed to by PC in memory.


- Decode the instruction into an operator (e.g. the add function) and operands.
- Get the operands from the register file (the things to add).
- Execute the instruction.
- Write the result back to the register file.
- Increment the program counter.
- Extension: read and write memory at a computed address (Figure 2).
- Extension: instead of incrementing the PC, set it to a computed value (Figure 3).
[Figures 1–3]
How to make processor faster:

- Introduction of pipelining.
o Independent instructions can follow each other.
o Start instruction n+1 after instruction n finishes the first stage.
- Superscalar execution.
o Multiple execution units in parallel.
o Scheduler issues multiple instructions in the same cycle.
- Out of order execution.
- Simultaneous multi-threading (SMT) (hyper threading)
o Scheduler issues multiple instructions in the same cycle; from different threads.
- Do more per instruction.

SIMD

- Single-instruction multiple-data = a kind of data parallelism.


o Amortise the control overhead over the instruction width.
o In contrast to SIMT model of CUDA/OpenCL, the vector width is exposed directly to
the programmer.
- For a SIMD instruction such as add, instead of the operands being two numbers, you fetch a batch of numbers and execute that operation over all of those data elements simultaneously.
- Neglecting SIMD is becoming more costly.
- See lec8ex for SAXPY.
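For reference, SAXPY computes y := a·x + y element-wise. A scalar Haskell version (assuming the vector package; the lec8ex version presumably uses SIMD in C):

import qualified Data.Vector.Unboxed as U

saxpy :: Float -> U.Vector Float -> U.Vector Float -> U.Vector Float
saxpy a x y = U.zipWith (\xi yi -> a * xi + yi) x y   -- one multiply-add per element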
Array of structures (AoS)

- Most logical data organisation layout.


- Makes vector memory access expensive: reads become gathers and writes become scatters.
- Prevents efficient vectorisation.
- May lead to better cache utilisation if data is accessed randomly.
- Typical of object-oriented languages.
- See lec8ex.

Structure of arrays (SoA)

- Separate array for each field of the structure.


- Keeps memory access contiguous when vectorisation is performed over structure instances.
- Typically better for vectorisation and for avoiding false sharing.
- All x’s, y’s and z’s are next to each other.
- Typical of data-oriented design.
- See lec8ex; a sketch of both layouts follows below.
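A hypothetical illustration of the two layouts in Haskell (assuming the vector package; not from lec8ex):

import qualified Data.Vector         as V
import qualified Data.Vector.Unboxed as U

-- Array of structures: one array whose elements are whole records.
data Particle = Particle { px, py, pz :: !Float }
type ParticlesAoS = V.Vector Particle          -- the fields of one particle are adjacent

-- Structure of arrays: one contiguous unboxed array per field.
data ParticlesSoA = ParticlesSoA
  { xs :: !(U.Vector Float)                    -- all x coordinates next to each other
  , ys :: !(U.Vector Float)
  , zs :: !(U.Vector Float)
  }

-- SoA keeps memory access contiguous when operating on one field:
shiftX :: Float -> ParticlesSoA -> ParticlesSoA
shiftX d p = p { xs = U.map (+ d) (xs p) }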

Task parallelism

- Problem is broken down into separate tasks.


- Individual tasks are created and communicate/synchronise with
each other.
- Task decomposition dictates scalability.
- Explicit threads.
- Synchronise via locks, messages or STM.
- Modest parallelism.
- Hard to program.

Fork-Join

- Splits control flow into multiple forks which later rejoin.


- Can be used to implement many other patterns.
- See lec8ex.
[Figure 4: fork-join]

Divide-and-conquer

- Lends itself to fork-join parallelism.


- Sub-problems must be independent so that they
can execute in parallel.
- Correct task granularity is vital:
o Deep enough to expose enough parallelism.
o Not so fine-grained that scheduling
overheads dominate.

Data parallelism

- Problem is viewed as operations over parallel data.


- The same operation is applied to subsets of the data.
[Figure 5: divide-and-conquer]
- Scales to the amount of data & number of processors.
- Operate simultaneously on bulk data.
- Implicit synchronisation.
- Massive parallelism.
- Easy to program.

Improving application performance through parallelisation means:

- Reducing the total time to compute a single result (latency).


- Increasing the rate at which a series of results are computed (throughput).
- Reducing the power consumption of a computation.

To make the program run faster, we need to gain more from parallelisation than we lose due to the overhead of adding it:

- Granularity: If the tasks are too small, the overhead of managing the tasks outweighs any
benefit you might get from running them in parallel.
- Data dependencies: When one task depends on another, they must be performed
sequentially.

Load balancing

- The computation must be distributed evenly across the processors to obtain the fastest
possible execution speed.
o It may be that some processors will complete their tasks before others and become
idle because the work is not evenly distributed.
 The amount of work is not known prior to execution.
 Differences in processor speeds (e.g. noisy system, frequency boost…).
- Static load balancing can be viewed as a scheduling or bin packing problem.
o Estimate the execution time for parts of the program and their interdependencies.
o Generate a fixed number of equally sized tasks and distribute amongst the
processors in some way (e.g. round robin, recursive bisection, random…).
o Limitations:
 Accurate estimates of execution time are difficult.
 Does not account for variable delays (e.g. memory access time) or number
of tasks (e.g. search problems).
- In dynamic load balancing tasks are allocated to processors during execution.
o In a centralised dynamic scheme one process holds all tasks to be computed.
o Worker processes request new tasks from the work-pool.
o Readily applicable to divide-and-conquer problems.
[Figure: load balancing]

Speedup

- The performance improvement of a parallel application:
o Where Tp is the time to execute using P threads/processors.

speedup = Sp = T1 / Tp

- The efficiency of the program is:
o P is the number of processors.
o Ideally, efficiency is 1.

efficiency = Sp / P = T1 / (P · Tp)
- T1 can be:
o The parallel algorithm executed on one thread: relative speedup.
o An equivalent serial algorithm: absolute speedup.

Maximum speedup

- Several factors appear as overhead in parallel computations and limit the speedup of the
program.
o Periods when not all processors are performing useful work.
o Extra computations in the parallel version not appearing in the sequential version
(example: recompute constants locally).
o Communication time between processes.

Amdahl

- Amdahl’s law → the execution time of a program:
o Wser: time spent doing (non-parallelisable) serial work.
 E.g. reading data from disk.
o Wpar: time spent doing parallel work.

Tp ≥ Wser + Wpar / P

- If f is the fraction of serial work to be performed:
o If f = 0, the program is completely parallel.
o If f = 1, the program is completely sequential.

Sp ≤ 1 / (f + (1 − f) / P)

- The speedup bound is determined by the degree of sequential execution in the program, not the number of processors.
o Strong scaling (fixed-size speedup): lim P→∞ Sp ≤ 1/f

Gustafson-Barsis

- The problem size can increase as the number of processors increases.
o The proportion of the serial part decreases.
o Weak scaling (scaled speedup): Sp = 1 + (P − 1) · fpar.

Lecture 9: GPGPU
Data parallelism

- Only a programming model.


o The key is a single logical thread of control.
o It does not actually require the operations to be executed in parallel.

CPU vs GPU

- Traditional CPU designs optimise for single-threaded performance.


o Branch prediction, out-of-order, large caches, etc.
o Much of the available transistor space is dedicated to non-computation resources.
o CPUs are designed to optimise latency of an individual thread’s result.
o Must be good at everything, parallel or not.
- GPUs are designed to accelerate graphics processing (rasterization).
o Inherently data-parallel task.
o Maximise bandwidth: the time to process a single pixel is less important than the number of pixels processed per second.
o Specialised for compute intensive, highly parallel computation.

CPU

- Multiple tasks = multiple threads.


- Tasks run different instructions.
- 10s of complex threads execute on a few cores.
- Threads managed explicitly.
- Expensive to create & manage threads.
- Core does a lot of non-computation stuff
o Branch prediction, out-of-order execution.
o Dedicated to increase performance of a single thread.
- Vertical parallelism: hide latency.
o Keep functional units busy when waiting for dependencies, memory, etc.
- Spends a lot of resources to avoid latency.

GPU
- SIMD.
- 10s of thousands of lightweight threads.
- Threads are managed and scheduled by the hardware.
- Cheap to create many threads.
- Core has a lot of execution units.
o Little dedicated to thread scheduling.
o Dedicate all the hardware to computation resources.
o The programming model has to compensate for that.
- Horizontal parallelism: increase throughput.
o More execution units working in parallel.
- Uses parallelism to hide latency.
- GPU architecture
o No branch prediction.
o One task (kernel) at a time.
o No context switching.
o Limited superscalar pipeline.
o No out-of-order execution.
o No simultaneous multithreading (hyperthreading).
o Very low clock speed.

Each GPU has:

- A number of streaming multiprocessors (comparable to CPU cores).


- Each core executes a number of warps (comparable to a CPU thread).
- Each warp consists of 32 “threads” that run in lockstep (each comparable to a single lane on a SIMD execution unit).

Each streaming multiprocessor (SM) executes a number of warps:

- The SM has a number of active threads.


- The core will switch warps whenever there is a stall in execution (waiting for memory).
- Latency is thus hidden by having many active threads; this is only possible if you can feed the
GPU enough work.

Similarities between CPU and GPU:

- Multiple cores.
- A memory hierarchy.
- SIMD vector instructions.

Differences:

- Each SM executes up to 64 warps (GPU), instead of two threads (with SMT2, CPU).
- The memory hierarchy is explicit on the GPU (software managed cache).
- CPU uses thread (SMTx) and instruction level parallelism to saturate ALUs.
- GPU SIMD is implicit (SIMT model).
Execution model

- The GPU is a co-processor controlled by a host program.
o The host (CPU) and device (GPU) have separate memory spaces.
o The host program controls data management on the device (allocation, transfer) as well as launching kernels. (Figure 6)
- The GPU kernels execute multiple thread blocks over the SMs.
o All threads execute the same sequential program.
o Thread instructions are executed in logical SIMD groups (warps). (Figure 7)

[Figure 6: execution model 1; Figure 7: execution model 2]

Programming model

- The CUDA (and OpenCL) programming model provides:


o Data parallel programming model.
o A thread abstraction to deal with SIMD.
o Synchronisation and data sharing between small groups of threads (100s: the threads in a block).
o A scalable programming model to deal with lots of threads (10,000s: all threads across all SMs).
o A C-like language for device code.
 Similarity is only superficial; it is heavily influenced by the underlying
hardware model.
- A GPU program consists of the kernel run on the GPU.
o Kernels = functions which are executed n times in parallel by n different threads on
the device.
o Each thread executes the same sequential program.
 We cannot execute different code in parallel.

Example kernels, element-wise add two vectors  see lec9ex.


Threads:

- A kernel consists of multiple copies of the code executed in parallel.


o Each thread has its own program counter, registers, processor state…
o The order in which threads are executed is not specified.
- Threads are very fine-grained.
o Launching threads on the GPU is cheap compared to on the CPU.
- Threads execute in a single-instruction multiple-thread (SIMT) model.
o In a SIMD model the vector width is explicit.
o In SIMT this is left unspecified.
o Greatly simplifies the programming model.
o The underlying hardware executes SIMD, but you as a programmer get to write it like SIMT. (Figure 8)
o SIMT gets mapped down to SIMD.
o In CUDA, threads execute in groups of 32 called a warp = the logical vector width.

[Figure 8: SIMD & SIMT]
- Performance considerations.
o Threads in a warp share the same program counter.
o Good code will try to keep all threads convergent within a warp.
- The scalar (kernel) code is mapped onto the hardware SIMD execution.
o Hardware handles control flow divergence and convergence.
o Divergent control flow between warp threads is handled via an active mask (is the thread executing right now or not; i.e. predicated execution).
 At each cycle all threads in a warp must execute the same instruction.
 Conditional code is handled by temporarily disabling threads for which the condition is not true.
 If-then-else blocks sequentially execute the if and else branches.
 Can lead to subtle deadlocks.
- The GPU is a very wide vector processor.
- Benefits of SIMT vs SIMD:
o Similar to regular scalar code, easier to read and write.
- Drawbacks of SIMT vs SIMD:
o The (logical) vector width is always 32, regardless of data size.
o Scattered memory access and control flow are not discouraged.

Thread hierarchy:

- Parallel kernels are composed of many threads:


o Executing the same sequential program.
o Each thread has a unique identifier.
- Threads are grouped into blocks:
o Threads in the same block can cooperate.
- A grid of thread blocks is the collection of threads which will execute a given kernel.
o Thread blocks will be scheduled onto the SMs of the GPU for execution.
- Individual threads are grouped into thread blocks.
o Each thread block constitutes an independent data-parallel task.
o Threads in the same block can cooperate and synchronise with each other.
o Threads in different thread blocks cannot cooperate.
o The program must be valid for any interleaving of thread blocks.
- This independence requirement ensures scalability.
- Each thread block is mapped onto a SM of the GPU to be executed.
o The hardware is free to assign blocks to any processor (SM) at any time.
o A kernel scales across any number of parallel processors.
o Each block executes in any order relative to other blocks.
- Synchronisation is only within a thread block.
- Each GPU thread is individually very weak.
o Hardware multithreading is required to hide latency.
o This means that performance depends on the number of thread blocks which can be
allocated onto each SM.
o This is limited by the set of registers and shared memory on the SM which are
shared between all threads executing on that processor.
- Per-thread resource usage costs performance.
o More registers → fewer thread blocks.
o More shared (local) memory usage → fewer thread blocks.

The multiprocessor occupancy = the number of kernel threads which can run simultaneously on each SM, compared to the maximum possible.

→ Increase thread occupancy to improve the program.

Memory hierarchy:

- A many-core processor is a device for turning a compute-bound problem into a memory-bound problem.
o Lots of processors (ALUs).
o Memory concerns dominate performance tuning.
o Only global memory is persistent across kernel launches
- Global memory is accessed in 32-, 64-, or 128-byte transactions.
o Similar to how a CPU reads a cache line at a time.
o The GPU has a "coalescer" which examines the memory requests from threads in
the warp, and issues one or more global memory transactions.

→ This coalescing is how the memory bandwidth is used effectively.

[Figure 9: memory hierarchy]
