
Microprocessor & Computer Architecture (μpCA)
Unit 4: Parallel Computing Concepts and Terminology, Design Issues & Constraints

UE21CS251B

Session: 4.3

Parallel Computing Concepts and Terminology

Supercomputing / High Performance Computing (HPC)

Using the world's fastest and largest computers to solve large problems.

Node
A standalone "computer in a box", usually comprising multiple CPUs/processors/cores, memory, network
interfaces, etc. Nodes are networked together to form a supercomputer.

CPU / Socket / Processor / Core


This varies, depending upon who you talk to. In the past, a CPU (Central Processing Unit) was a singular
execution component for a computer. Then, multiple CPUs were incorporated into a node. Then,
individual CPUs were subdivided into multiple "cores", each being a unique execution unit. CPUs with
multiple cores are sometimes called "sockets" - vendor dependent. The result is a node with multiple
CPUs, each containing multiple cores. The nomenclature is confused at times. Wonder why?

Task : A logically discrete section of computational work. A task is typically a program or program-like set of instructions
that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.

Pipelining : Breaking a task into steps performed by different processor units, with inputs streaming through, much like an
assembly line; a type of parallel computing.

Shared Memory: From a strictly hardware point of view, describes a computer architecture where all processors have
direct (usually bus based) access to common physical memory. In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory and can directly address and access the same logical memory locations
regardless of where the physical memory actually exists.

Symmetric Multi-Processor (SMP): Shared memory hardware architecture where multiple processors share a single
address space and have equal access to all resources.

Distributed Memory: In hardware, refers to network based memory access for physical memory that is not common. As a
programming model, tasks can only logically "see" local machine memory and must use communications to access memory
on other machines where other tasks are executing.

Communications : Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as
through a shared memory bus or over a network, however the actual event of data exchange is commonly referred to as
communications regardless of the method employed.

Synchronization : The coordination of parallel tasks in real time, very often associated with communications. Often
implemented by establishing a synchronization point within an application where a task may not proceed further until
another task(s) reaches the same or logically equivalent point. Synchronization usually involves waiting by at least one
task, and can therefore cause a parallel application's wall clock execution time to increase.
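
As an illustration of a synchronization point, here is a minimal sketch using an OpenMP barrier in C (one common way such a point is implemented, not the only one; it assumes a compiler with OpenMP support, e.g. built with -fopenmp). Threads that reach the barrier early wait, which is exactly the waiting that adds to wall-clock time.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        printf("thread %d finished phase 1\n", id);

        /* synchronization point: no thread proceeds to phase 2
           until every thread in the team has reached the barrier */
        #pragma omp barrier

        printf("thread %d starting phase 2\n", id);
    }
    return 0;
}
```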

Granularity : In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
Coarse: relatively large amounts of computational work are done between communication events
Fine: relatively small amounts of computational work are done between communication events

Observed Speedup: Observed speedup of a code which has been parallelized, defined as:
        wall-clock time of serial execution
    ------------------------------------------------
        wall-clock time of parallel execution
One of the simplest and most widely used indicators for a parallel program's performance.
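As a hypothetical illustration: if the serial version of a code runs in 300 seconds of wall-clock time and the parallelized version runs in 60 seconds on 8 processors, the observed speedup is 300 / 60 = 5, compared with an ideal speedup of 8.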

Parallel Overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel
overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel languages, libraries, operating system, etc.
• Task termination time

Massively Parallel : Refers to the hardware that comprises a given parallel system - having many processing elements.
The meaning of "many" keeps increasing, but currently, the largest parallel computers are comprised of processing
elements numbering in the hundreds of thousands to millions.

Embarrassingly Parallel: Solving many similar, but independent tasks simultaneously; little to no need for coordination
between the tasks.

Scalability : Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in
parallel speedup with the addition of more resources. Factors that contribute to scalability include:
• Hardware - particularly memory-CPU bandwidth and network communication properties
• Application algorithm
• Parallel overhead
• Characteristics of your specific application
Parallel Computing Models

There are several parallel programming models in common use, including the Shared Memory model (without threads) and
the Threads model.

• Shared Memory Model (without threads)
  • Three processes (depicted as boxes) interact with a single shared memory block (the large rectangle).
  • Arrows labeled "read" and "write" indicate the processes' actions on the memory.
  • Each process can read from or write to this shared memory.
  • Essentially, in this model, multiple processes communicate through a common memory space.

• Threads Model
  • Four tasks (represented by circles) are executed by four separate threads (each running on its own core).
  • These threads interact with both private and shared memory spaces.
  • The code snippet above the threads shows commands like "call sub10," indicating tasks assigned to different threads.
  • Unlike the shared memory (process) model, threads within a process can communicate more efficiently.
  • The Threads Model leverages thread-level parallelism (a minimal sketch follows below).
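
A minimal sketch of the threads model using OpenMP in C is shown below. The shared array and the subroutine name sub_work are hypothetical stand-ins for the slide's "call sub10" snippet; each thread keeps a private id while reading and writing the shared data.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000
double shared_data[N];                 /* shared: one copy, visible to all threads */

/* hypothetical subroutine, standing in for the slide's "call sub10" */
void sub_work(int id, int nthreads) {
    for (int i = id; i < N; i += nthreads)
        shared_data[i] += 1.0;         /* each thread updates its own slice */
}

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();          /* private: each thread has its own id */
        sub_work(id, omp_get_num_threads());
    }
    printf("shared_data[0] = %f\n", shared_data[0]);
    return 0;
}
```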
• Distributed Memory / Message Passing Model
This model demonstrates the following characteristics:

  • A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical
    machine and/or across an arbitrary number of machines. Tasks exchange data through communications by sending and
    receiving messages.

  • Data transfer usually requires cooperative operations to be performed by each process.

  • For example: a send operation must have a matching receive operation (see the sketch below).

The Distributed Memory / Message Passing Model allows tasks to communicate by passing messages, even when they are
distributed across different machines.
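
Below is a minimal sketch of the message passing model in C using MPI (assuming an MPI implementation such as MPICH or Open MPI, launched with at least two ranks via mpirun). It shows the cooperative pairing of a send with its matching receive mentioned above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 3.14;
        /* the send on rank 0 must be matched by a receive on rank 1 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```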
• Data Parallel Model: may also be referred to as the Partitioned Global Address Space (PGAS) model.

• The data parallel model demonstrates the following characteristics:
  • Address space is treated globally.
  • Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a
    common structure, such as an array or cube.
  • A set of tasks work collectively on the same data structure; however, each task works on a different partition of
    the same data structure.
  • Tasks perform the same operation on their partition of work, for example, "add 4 to every array element" (see the
    sketch below).
  • On shared memory architectures, all tasks may have access to the data structure through global memory.
  • On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of
    each task.
• Hybrid Model
• A hybrid model combines more than one of the previously described programming models.
  • A common example of a hybrid model is the combination of the message passing model (MPI) with the threads
    model (OpenMP).
    • Threads perform computationally intensive kernels using local, on-node data.
    • Communications between processes on different nodes occur over the network using MPI.
  • This hybrid model lends itself well to the most popular (currently) hardware environment of clustered
    multi/many-core machines.
• Another similar and increasingly popular example of a hybrid model is using MPI with CPU-GPU (Graphics Processing
  Unit) programming.
  • MPI tasks run on CPUs using local memory, communicating with each other over a network.
  • Computationally intensive kernels are off-loaded to GPUs on-node.
  • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).
Other hybrid models are common:
• MPI with Pthreads
• MPI with non-GPU accelerators
A sketch of the MPI + OpenMP combination appears below.
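
As a rough sketch of the MPI + OpenMP combination (assuming both an MPI library and OpenMP support; the array contents are made up for illustration): threads run the on-node kernel, and MPI carries the inter-node communication, here a single reduction. MPI_THREAD_FUNNELED is requested because only the main thread makes MPI calls.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double chunk[N];                 /* node-local data for this MPI task */

int main(int argc, char **argv) {
    int provided, rank;

    /* FUNNELED: threads exist, but only the main thread calls MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;

    /* OpenMP threads run the compute-intensive kernel on on-node data */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++) {
        chunk[i] = rank + i * 1e-6;     /* made-up data for illustration */
        local_sum += chunk[i];
    }

    /* communication between nodes goes over the network via MPI */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```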
Parallel Computing Models: Single Program Multiple Data (SPMD)

• SPMD is actually a "high level" programming model that can be built upon any combination of the previously
mentioned parallel programming models.

• SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be
threads, message passing, data parallel or hybrid.

• MULTIPLE DATA: All tasks may use different data.

• SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or
conditionally execute only those parts of the program they are designed to execute. That is, tasks do not
necessarily have to execute the entire program - perhaps only a portion of it (see the sketch below).

• The SPMD model, using message passing or hybrid programming, is probably the most commonly used
parallel programming model for multi-node clusters.
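
A minimal SPMD sketch in C with MPI is given below; do_io and do_compute are hypothetical helpers standing for different portions of the single program, selected by branching on the task's rank.

```c
#include <mpi.h>
#include <stdio.h>

/* hypothetical helpers: different portions of the one program */
static void do_io(void)       { printf("rank 0: handling input/output\n"); }
static void do_compute(int r) { printf("rank %d: computing\n", r); }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every task runs this same program, but branches on its rank,
       so each task executes only the portion meant for it */
    if (rank == 0)
        do_io();
    else
        do_compute(rank);

    MPI_Finalize();
    return 0;
}
```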
Parallel Computing Models: Multiple Program Multiple Data (MPMD)

• Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the
previously mentioned parallel programming models.

• MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message
passing, data parallel or hybrid.

• MULTIPLE DATA: All tasks may use different data

• MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems,
particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later
under Partitioning).
Parallel Computing - Design Issues

• Partitioning: splitting the problem into smaller sub-problems
• Mapping: distributing the sub-problems to multiple processors
• Communication: if required (depends on topology)
• Consolidating: combining the partial results into the final result

[Illustration: "Too many cooks..."]
Designing Parallel Programs: Automatic vs. Manual Parallelization
• Designing and developing parallel programs has characteristically been a very manual process.
The programmer is typically responsible for both identifying and actually implementing parallelism.
• Very often, manually developing parallel codes is a time consuming, complex, error-prone and iterative process.
• For a number of years now, various tools have been available to assist the programmer with converting serial programs
into parallel programs. The most common type of tool used to automatically parallelize a serial program is a parallelizing
compiler or pre-processor.
• A parallelizing compiler generally works in two different ways:
▪ Fully Automatic
The compiler analyzes the source code and identifies opportunities for parallelism.
The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the
parallelism would actually improve performance.
Loops (do, for) are the most frequent target for automatic parallelization.
▪ Programmer Directed
Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to
parallelize the code (see the sketch below).
May be used in conjunction with some degree of automatic parallelization.
The most common compiler-generated parallelization is done using on-node shared memory and threads (such as OpenMP).
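
A minimal sketch of programmer-directed parallelization in C, using an OpenMP compiler directive on a loop (loops being the most frequent target, as noted above); the array names and sizes are made up for illustration.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* the directive tells the compiler this loop is safe to parallelize;
       without it, a parallelizing compiler would have to prove the
       iterations independent on its own */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```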
Thread Level & Instruction Level Parallelism

Thread Level
• Multi-threading vs Hyper-threading or Simultaneous Multi-threading.

Instruction Level
• Pipelining
• Super Pipelining
• Super Scalar
• Vector & Array Processing
• VLIW
• EPIC
• Parallel Computing vs Multicore Computing
Single thread can run at any given time
[Figure: processor pipeline block diagram - L1 D-Cache/D-TLB, integer and floating-point units, schedulers, uop queues,
rename/alloc, BTB, trace cache, uCode ROM, decoder, I-TLB, L2 cache and control, bus. Only Thread 1 (floating point) is
active; the integer resources sit idle.]
Single thread can run at any given time
[Figure: the same pipeline diagram with only Thread 2 (integer operation) active; the floating-point resources sit idle.]
Multi-Threading
[Figure: a single-threaded process (one kernel thread) contrasted with a multi-threaded process (several kernel
threads), posing the question: Hyper-Threading or Simultaneous Multithreading?]
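
A minimal POSIX threads (pthreads) sketch in C of a multi-threaded process is shown below; whether the resulting kernel threads share one core via SMT/Hyper-Threading or run on separate cores is up to the hardware and the OS scheduler.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* each kernel-level thread runs this function independently */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
```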
SMT Processor: Both Threads Run Concurrently
[Figure: the same pipeline diagram with Thread 1 (floating point) and Thread 2 (integer operation) sharing the core's
resources simultaneously.]
But: Can't simultaneously use the same functional unit
[Figure: the same pipeline diagram with Thread 1 and Thread 2 both trying to issue to the integer unit - IMPOSSIBLE.
This scenario is impossible with SMT on a single core (assuming a single integer unit).]
Multi-core: Threads can run on separate cores
[Figure: two complete cores, each with its own L1 D-Cache/D-TLB, execution units, schedulers, decoder, and L2 cache and
control; Thread 3 runs on one core and Thread 4 on the other.]
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT cores; Threads 1 and 2 share one core while Threads 3 and 4 share the other, so all four threads run
concurrently.]
Instruction Level - Scalar Computing
[Timing diagram: scalar execution of the loop below, one instruction completing at a time, over clock cycles 1-10.]

for (i = 1; i <= 6; i++)
    Out = i + i;
Pipelining

Pipelining: several instructions are simultaneously at different stages of their execution.

[Timing diagram: pipelined execution over clock cycles 1-10, with a new instruction entering the pipeline each cycle.]

Think of executing the following loop (a rough timing estimate follows below):
for (i = 1; i <= 6; i++)
    Out = i + i;
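
As a hypothetical illustration (assuming the loop produces 6 instructions and a 5-stage pipeline with one stage per clock cycle): scalar execution needs about 6 × 5 = 30 cycles, while pipelined execution needs about 5 + (6 − 1) = 10 cycles, because once the pipeline is full a new instruction completes every cycle.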
Super-Scalar

Superscalar: several instructions are simultaneously at the same stage of their execution.
The CPU can execute more than one instruction per clock cycle.
A superscalar architecture includes parallel execution units which can execute instructions simultaneously.

[Timing diagram: superscalar execution over clock cycles 1-15, with multiple instructions at the same pipeline stage in
each cycle.]

Think of executing the following loop:
for (i = 1; i <= 6; i++)
    Out = i + i;
Super Pipelining
• Super pipelining breaks pipeline stages down into smaller sub-stages.
• Each sub-stage performs a specific operation within a (shorter) clock cycle.
• While superscalar focuses on parallel pipelines, super pipelining aims for a deeper pipeline with more, shorter stages.
• It balances higher parallelism with potential stalls due to dependencies.

[Timing diagram: super-pipelined execution over clock cycles 1-10.]

Think of executing the following loop:
for (i = 1; i <= 6; i++)
    Out = i + i;
VLIW: Very Long Instruction Word

Multiple independent instructions are packed together by the compiler.

[Figure: one VLIW word with slots ADD | MUL | ADDF | MULF | LDR | MOV, each slot feeding its own execution unit.]

VLIW allows multiple independent instructions to be packed together by the compiler, enabling parallel execution.
VLIW vs Superscalar

[Figure: a superscalar front end fetching and decoding multiple short instructions, contrasted with a VLIW machine
issuing one long instruction whose slots (ADD, MUL, ..., MOV) go to separate processing elements (PEs).]

VLIW:
• Static scheduling: the compiler determines which instructions can be executed in parallel.
• The compiler identifies independent instructions and bundles them into a long instruction word.
• Each processing element (PE) handles a specific instruction type.
• Simpler hardware, since the scheduling work is done at compile time.

Superscalar:
• Dynamic scheduling: the hardware determines at run time which instructions can be executed in parallel.
• Complex hardware is required to manage this dynamic scheduling.
• Issues multiple ordinary (short) instructions per cycle to parallel execution units.
VLIW: Very Long Instruction Word

Drawbacks of VLIW

If the compiler cannot find enough independent instructions to fill a long instruction word:
• It must insert NOPs (no-operation instructions) to pad the unused slots.
• Code must be recompiled for a different VLIW implementation.
THANK YOU

Team MPCA
Department of Computer Science and Engineering
