Summary Midterm Concurrency
Lecture 3: Threads
Concurrency: multiple computations are in progress at the same time and may be interleaved.
Parallelism: multiple computations execute simultaneously, on different processing units.
Contexts of concurrency:
- Multithreading: concurrent threads share an address space; threads communicate with each
other directly through shared memory.
- Multiprogramming: concurrent processes execute on a uniprocessor (own memory space).
- Multiprocessing: concurrent processes on a multiprocessor.
- Distributed processing: concurrent processes executing on multiple nodes connected by a
network.
Dangerous:
Critical sections: control access to the code that uses those shared resources.
Processes can:
Terminology:
- Deadlock:
o Each member of a group of processes is waiting for another to take action.
o E.g. waiting for another to release a lock.
- Livelock:
o The states of the group of processes are constantly changing with regard to each
other, but none are progressing.
o E.g. trying to obtain a lock, but backing off if fails.
- Obstruction-freedom:
o From any point after a thread begins executing in isolation, it finishes in a finite
number of steps.
o A thread will be able to finish if no other thread makes progress.
- Lock-freedom:
o Some method will finish in a finite number of steps.
- Wait-freedom:
o Every method will finish in a finite number of steps.
- Starvation:
o A process is perpetually denied access to a resource it needs in order to make progress.
Mutual exclusion
- Locking.
- Protects shared resources.
- Only one process at a time is allowed to access the critical resource.
- Modifications to the resource appear to happen atomically.
- Implementation requirements:
o Only one thread at a time is allowed in the critical section for a resource.
o No deadlock or starvation on attempting to enter/leave the critical section.
o A thread must not be delayed access to a critical section when there is no other
thread using it.
o A thread that halts in its non-critical section must do so without interfering with
other threads.
o No assumptions made about relative thread speed or number of processes.
- Usage conditions:
o A thread remains inside its critical section for a finite time only.
o No potentially blocking operation should be executed inside the critical section.
- Three ways to satisfy the implementation requirements:
o Software approach: put responsibility on the processes themselves.
o Systems approach: provide support within the OS or programming language.
o Hardware approach: special-purpose machine instructions.
Software approach
- Premise
o One or more threads with shared memory.
o Elementary mutual exclusion at the level of memory access:
Simultaneous accesses to the same memory location are serialised.
- Requirements for mutual exclusion
o Only one thread at a time is allowed in the critical section.
o No deadlock or starvation.
- Attempt 1
o The plan
Threads take turns executing the critical section.
Exploit serialisation of memory access to implement serialisation of access
to the critical section
o Employ a shared variable (memory location) turn that indicates whose turn it is to
enter the critical section.
o Busy waiting (spin lock)
Process is always checking to see if it can enter the critical section
Implements mutual exclusion
Simple
o Disadvantages
Process burns resources while waiting
Processes must alternate access to the critical section
If one process fails anywhere in the program, the other is permanently
blocked.
- Attempt 2
o The problem
turn stores who can enter the critical section, rather than whether anybody
may enter the critical section.
o The new plan
Store for each process whether it is in the critical section right now.
flag[i] is true if process i is in the critical section right now.
o Requires programs to be well-behaved, which doesn’t always happen.
o If a process fails:
Outside the critical section: the other is not blocked;
Inside the critical section: the other is blocked (however, difficult to avoid).
o Does it work?
Does not guarantee exclusive access.
- Attempt 3
o The goal
Remove the gap between toggling the two flags.
o The updated plan
Move setting the flag before checking whether we can enter.
o Does it work?
No. The gap can cause a deadlock now.
- Attempt 4
o Problem
Process sets its own state before knowing the other processes' states and
cannot back off.
o The updated plan
Process retracts its decision if it cannot enter.
o Does it work?
No, we have a livelock.
A special case of resource starvation, and a risk for algorithms which
attempt to detect and recover from deadlock.
- Attempt 5
o Improvements
We can solve this problem by combining the fourth and first attempts.
In addition to the flags we use a variable indicating whose turn it is to have
precedence in entering the critical section.
o Peterson’s algorithm
Both processes are courteous and solve a tie in favour of the other.
Algorithm can be generalised to work with n processes.
o Statement: mutual exclusion
Threads 0 and 1 are never in the critical section at the same time.
o Proof
If P0 is in the critical section then:
flag[0] is true;
flag[1] is false OR turn is 0 OR P1 is trying to enter the critical
section, after setting flag[1] to true but before setting turn to 0.
For both P0 and P1 to be in the critical section we would need:
flag[0] AND flag[1] AND turn = 0 AND turn = 1, which is impossible since turn cannot be both 0 and 1.
Hardware support
- The compare-and-swap (CAS) operation is an atomic instruction which allows mutual
exclusion for any number of threads using a single bit of memory.
o Compares the memory location to a given value.
o If they are the same, writes a new value to that location.
o Returns the old value of the memory location.
- The plan
o Use a bit lock where zero represents unlocked and one represents locked.
Lecture 4: State
Threads:
Sharing state
import Data.IORef
newIORef :: a -> IO (IORef a)
readIORef :: IORef a -> IO a
writeIORef :: IORef a -> a -> IO ()
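A minimal sketch (not from the slides) of why plain IORef reads and writes are not enough for shared state: two threads bump a counter with a non-atomic read-modify-write, so updates can be lost (most visibly when compiled with -threaded and run on several cores). The MVar here is only used to wait for the workers; MVar is introduced just below.
import Control.Concurrent
import Control.Monad (replicateM_)
import Data.IORef

main :: IO ()
main = do
  ref  <- newIORef (0 :: Int)
  done <- newEmptyMVar
  let worker = do
        -- read, then write: another thread can interleave between the two
        replicateM_ 100000 (readIORef ref >>= writeIORef ref . (+ 1))
        putMVar done ()
  _ <- forkIO worker
  _ <- forkIO worker
  takeMVar done
  takeMVar done
  readIORef ref >>= print   -- can be well below 200000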
import Control.Concurrent.MVar
newMVar :: a -> IO (MVar a)
newEmptyMVar :: IO (MVar a)
takeMVar :: MVar a -> IO a
putMVar :: MVar a -> a -> IO ()
readMVar :: MVar a -> IO a
withMVar :: MVar a -> (a -> IO b) -> IO b
o The runtime system can (sometimes) detect when a group of threads are
deadlocked.
Only a conservative approximation to the future behaviour of the program.
o Other useful operations
If multiple threads are blocked in takeMVar or putMVar, a single thread is
woken up in FIFO order: fairness.
readMVar is multiple wakeup.
withMVar can be used to protect critical sections.
o A lock
MVar () behaves as a lock: full is unlocked, empty is locked.
Can be used as a mutex to protect some shared state or critical section (see the sketch after this list).
o A one-place channel
For passing messages between threads.
An asynchronous channel with a buffer size of one.
o A container for shared mutable data.
o A building block for constructing larger concurrent data structures.
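For contrast with the IORef sketch above, a minimal example of an MVar protecting a shared counter: because the value lives inside the MVar, the take/put pair (here via modifyMVar_, a close relative of withMVar) turns the read-modify-write into a critical section.
import Control.Concurrent
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)
  done    <- newEmptyMVar
  let worker = do
        -- takeMVar/putMVar inside modifyMVar_ act as the lock
        replicateM_ 100000 (modifyMVar_ counter (return . (+ 1)))
        putMVar done ()
  _ <- forkIO worker
  _ <- forkIO worker
  takeMVar done
  takeMVar done
  readMVar counter >>= print   -- always 200000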
- The goal:
o Want a way to specify data-flow dependencies between computations.
o Results should be deterministic.
o Ex4.
- Data flow
o Key idea: a non-deterministic result can only arise from a choice between multiple
puts, so make the second put an error (see the sketch below).
o Ex4.
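A minimal sketch of this idea (the names newIVar/put/get are assumptions; Ex4 may differ): a write-once variable built on an MVar, where a second put is an error, so racing writers become a failure rather than a non-deterministic result.
import Control.Concurrent.MVar

newtype IVar a = IVar (MVar a)

newIVar :: IO (IVar a)
newIVar = IVar <$> newEmptyMVar

-- the first put fills the variable; a second put is an error
put :: IVar a -> a -> IO ()
put (IVar v) x = do
  ok <- tryPutMVar v x
  if ok then return () else error "put: IVar already filled"

-- get blocks until the variable is filled and never empties it
get :: IVar a -> IO a
get (IVar v) = readMVar v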
- The goal:
o Want a way to run computations asynchronously and wait for their result.
o Cancel running computations.
- Perform an action asynchronously and later wait for the results.
- Ex5
Unbounded queue
- The goal:
o enqueue does not block (indefinitely).
o Writers and readers do not conflict with each other.
- See slides for the structure of the queue (a sketch follows below).
o Orange boxes are MVars: the mutable parts we have to update (the read end and the write end).
o Blue boxes are an immutable sequence of values, linked together by mutable references.
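This is essentially the structure of Chan in base's Control.Concurrent.Chan (two MVars for the read and write ends, pointing into an immutable stream of items). A small usage example:
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Monad (replicateM_)

main :: IO ()
main = do
  chan <- newChan
  -- writeChan never blocks: the queue is unbounded
  _ <- forkIO (mapM_ (writeChan chan) [1 .. 5 :: Int])
  -- readChan blocks until an item is available
  replicateM_ 5 (readChan chan >>= print)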
Fairness
STM is (somewhat) based on atomic blocks: it allows your program to say which parts need to
behave atomically.
Atomic blocks:
- The idea
o Garbage collectors allow us to program without malloc and free; can we do the
same for locks? What would that look like?
o Modular concurrency.
o Locks are pessimistic.
- STM
o A technique for implementing atomic blocks.
Atomicity: effects become visible to other threads all at once.
Isolation: cannot see the effects of other threads.
Use a different type to wrap operations whose effect can be undone if
necessary.
o Sharing state
TVar is the transactional counterpart of IORef: a transactional variable.
TMVar is the counterpart of MVar: a variable that is either full or empty; threads
wait for the appropriate state.
o STM actions are composed together in the same way as IO actions.
o STM actions are executed as a single, isolated atomic block.
o Types are used to isolate transactional actions from arbitrary IO actions.
To get from STM to IO we execute the entire action atomically (atomically :: STM a -> IO a).
The two monads can't be mixed directly.
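A minimal sketch (an assumed account-transfer example, not from the slides): two TVar updates composed into one STM action; atomically runs them as a single isolated transaction, so other threads never observe an intermediate state.
import Control.Concurrent.STM

transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  writeTVar from (balance - amount)
  modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)   -- both updates become visible at once
  readTVarIO a >>= print         -- 70
  readTVarIO b >>= print         -- 30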
Exam questions:
Blocking
retry :: STM a
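A minimal sketch of blocking with retry (an assumed example): if the condition does not hold, retry abandons the transaction and the thread sleeps until one of the TVars it read changes, at which point the transaction reruns.
import Control.Concurrent.STM

-- block until the shared counter reaches the threshold
waitFor :: TVar Int -> Int -> STM ()
waitFor var threshold = do
  n <- readTVar var
  if n >= threshold then return () else retry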
TMVar
- Threads blocked on an MVar are woken up in FIFO order; STM doesn't give that guarantee.
- When multiple threads are blocked on a TVar, which should be woken up?
o orElse: choose an alternative action if the first transaction calls retry (see the sketch after this list).
If the first action returns a result, that is the result of the orElse.
If the first action retries, the second action runs.
If the second action retries, the whole action retries.
Since the result of orElse is also an STM action, you can feed it another call
to orElse and so choose from an arbitrary number of alternatives.
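A minimal sketch of orElse (an assumed example): take from whichever of two TMVars is full, trying the left one first and falling back to the right; if both are empty the whole transaction retries, i.e. blocks.
import Control.Concurrent.STM

takeEither :: TMVar a -> TMVar b -> STM (Either a b)
takeEither left right =
  (Left <$> takeTMVar left) `orElse` (Right <$> takeTMVar right)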
- The goal:
o Run computations asynchronously and wait for the results.
o Cancel and race running (run 2 computations, wait and see which one finishes first)
computations.
o Same basic interface:
data Async a
async :: IO a -> IO (Async a)
wait :: Async a -> IO a
race :: Async a -> Async b -> IO (Either a b)
o See lec6ex.
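A minimal sketch of async and wait in the spirit of the interface above (lec6ex presumably differs in detail; cancel and race are omitted): the forked thread writes its result, or the exception it died with, into an MVar, and wait rethrows any exception in the caller.
import Control.Concurrent
import Control.Exception (SomeException, throwIO, try)

data Async a = Async (MVar (Either SomeException a))

async :: IO a -> IO (Async a)
async action = do
  var <- newEmptyMVar
  _ <- forkIO (try action >>= putMVar var)
  return (Async var)

wait :: Async a -> IO a
wait (Async var) = readMVar var >>= either throwIO return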
- Key-value map
- The goal:
o A key-value map that can be accessed concurrently by multiple threads.
o Basic interface:
data Map k v
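One possible coarse-grained sketch, not necessarily the lecture's design: keep the whole map inside a single TVar, so every operation is a small STM transaction. Simple, but all writers contend on one variable.
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

newtype Map k v = Map (TVar (M.Map k v))

newMap :: IO (Map k v)
newMap = Map <$> newTVarIO M.empty

insert :: Ord k => Map k v -> k -> v -> IO ()
insert (Map var) k v = atomically (modifyTVar' var (M.insert k v))

lookupKey :: Ord k => Map k v -> k -> IO (Maybe v)
lookupKey (Map var) k = M.lookup k <$> readTVarIO var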
Processor works:
- See Figure 3 in the slides.
How to make processor faster:
- Introduction of pipelining.
o Independent instructions can follow each other.
o Start instruction n+1 after instruction n finishes the first stage.
- Superscalar execution.
o Multiple execution units in parallel.
o Scheduler issues multiple instructions in the same cycle.
- Out of order execution.
- Simultaneous multithreading (SMT, hyper-threading)
o Scheduler issues multiple instructions in the same cycle; from different threads.
- Do more per instruction.
SIMD
Task parallelism
Fork-Join
Divide-and-conquer
Data parallelism
To make the program run faster, we need to gain more from parallelisation than we lose due to the
overhead of adding it.
- Granularity: If the tasks are too small, the overhead of managing the tasks outweighs any
benefit you might get from running them in parallel.
- Data dependencies: When one task depends on another, they must be performed
sequentially.
Load balancing
- The computation must be distributed evenly across the processors to obtain the fastest
possible execution speed.
o It may be that some processors will complete their tasks before others and become
idle because the work is not evenly distributed.
The amount of work is not known prior to execution.
Differences in processor speeds (e.g. noisy system, frequency boost…).
- Static load balancing can be viewed as a scheduling or bin packing problem.
o Estimate the execution time for parts of the program and their interdependencies.
o Generate a fixed number of equally sized tasks and distribute amongst the
processors in some way (e.g. round robin, recursive bisection, random…).
o Limitations:
Accurate estimates of execution time are difficult.
Does not account for variable delays (e.g. memory access time) or number
of tasks (e.g. search problems).
- In dynamic load balancing tasks are allocated to processors during execution.
o In a centralised dynamic scheme one process holds all tasks to be computed.
o Worker processes request new tasks from the work-pool.
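A minimal sketch of the centralised dynamic scheme (the task and the numbers are made up): the work-pool is a Chan of task inputs; each worker repeatedly pulls the next task, so faster workers automatically take on more of the work.
import Control.Concurrent
import Control.Monad (forM_, forever, replicateM_)

main :: IO ()
main = do
  pool    <- newChan                      -- the central work-pool
  results <- newChan
  let nTasks = 20
      task x = x * x                      -- stand-in for real work
  forM_ [1 .. nTasks] (writeChan pool)
  -- four worker threads, each pulling tasks until the program ends
  replicateM_ 4 (forkIO (forever (readChan pool >>= writeChan results . task)))
  replicateM_ nTasks (readChan results >>= print)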
Speedup
- speedup = Sp = T1 / Tp
- The efficiency of the program is: efficiency = Sp / P = T1 / (P · Tp)
o P is the number of processors.
o Ideally, the efficiency is 1.
- T1 can be:
o The parallel algorithm executed on one thread: relative speedup.
o An equivalent serial algorithm: absolute speedup.
Maximum speedup
- Several factors appear as overhead in parallel computations and limit the speedup of the
program.
o Periods when not all processors are performing useful work.
o Extra computations in the parallel version not appearing in the sequential version
(example: recompute constants locally).
o Communication time between processes.
Amdahl
- With serial work Wser and parallelisable work Wpar:
Tp ≥ Wser + Wpar / P
- If f is the fraction of the work that must be performed serially:
o If f = 0, the program is completely parallel.
o If f = 1, the program is completely sequential.
Sp ≤ 1 / (f + (1 − f)/P)
- The speedup bound is determined by the degree of sequential execution in the program, not
the number of processors.
o Strong scaling (fixed-size speedup): as P → ∞, Sp ≤ 1/f.
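o Worked example (numbers assumed): with f = 0.1 and P = 8, Sp ≤ 1 / (0.1 + 0.9/8) ≈ 4.7; even with unlimited processors the speedup is bounded by 1/f = 10.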
Gustafson-Barsis
- Weak scaling (scaled speedup): the problem size grows with the number of processors; with serial fraction f measured on the parallel run, Sp = f + (1 − f)·P.
Lecture 9: GPGPU
Data parallelism
CPU vs GPU
CPU
GPU
- SIMD.
- 10s of thousands of lightweight threads.
- Threads are managed and scheduled by the hardware.
- Cheap to create many threads.
- Core has a lot of execution units.
o Little dedicated to thread scheduling.
o Dedicate all the hardware to computation resources.
o The programming model has to compensate for that.
- Horizontal parallelism: increase throughput.
o More execution units working in parallel.
- Uses parallelism to hide latency.
- GPU architecture
o No branch prediction.
o One task (kernel) at a time.
o No context switching.
o Limited superscalar pipeline.
o No out-of-order execution.
o No simultaneous multithreading (hyperthreading).
o Very low clock speed.
- Multiple cores.
- A memory hierarchy.
- SIMD vector instructions.
Differences:
- Each SM executes up to 64 warps (GPU), instead of two threads (with SMT2, CPU).
- The memory hierarchy is explicit on the GPU (software managed cache).
- The CPU uses thread-level (SMTx) and instruction-level parallelism to saturate the ALUs.
- GPU SIMD is implicit (SIMT model).
Execution model
Programming model
Thread hierarchy:
The multiprocessor occupancy = the number of kernel threads which can run simultaneously on
each SM, compared to the maximum possible.
Memory hierarchy:
- A many-core processor is a device for turning a compute-bound problem into a memory-bound
problem.
o Lots of processors (ALUs).
o Memory concerns dominate performance tuning.
o Only global memory is persistent across kernel launches.
- Global memory is accessed in 32-, 64-, or 128-byte transactions.
o Similar to how a CPU reads a cache line at a time.
o The GPU has a "coalescer" which examines the memory requests from threads in
the warp, and issues one or more global memory transactions.