High Performance Computing Lecture 1 HPC Public
Many C/C++ programs are used in HPC today; multi-user HPC usage needs scheduling (monitored, e.g., with LLview [3])
Many multi-physics application challenges on different scales & levels of granularity drive the need for HPC
Students understand…
Latest developments in parallel processing & high performance computing (HPC)
How to create and use high-performance clusters
What are scalable networks & data-intensive workloads
The importance of domain decomposition
Complex aspects of parallel programming
HPC environment tools that support programming or analysing application behaviour
Different abstractions of parallel computing on various levels
Foundations and approaches of scientific domain-specific applications
Students are able to …
Program with and use HPC programming paradigms
Take advantage of innovative scientific computing simulations & technology
Work with technologies and tools to handle parallelism complexity
Lecture 1 – High Performance Computing 5 / 50
High Performance Computing (HPC) Basics
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques
through specific support with dedicated hardware such as high-performance CPU/core interconnections.
HPC: the network interconnection is important!
High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that
enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPU/cores.
HTC: the network interconnection is less important!
The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course focuses on High Throughput Computing
Parallel Computing
Lecture 3 will give in-depth details on parallelization fundamentals & performance term relationships & theoretical considerations
TOP500 List (June 2019) [6] TOP500 Supercomputing Sites
Power consumption is a major challenge; the list also marks the EU #1 system
JUBE benchmark suite (based on real applications) [9] JUBE Benchmark Suite
Multi-core CPU chip architecture (one chip)
Hierarchy of caches (on/off chip)
L1 cache is private to each core; on-chip
L2 cache is shared; on-chip [10] Distributed & Cloud Computing Book
Both of the above are considered ‘programming models’
More recently, emerging computing models are becoming relevant for HPC: e.g., quantum devices, neuromorphic devices
Shared Memory
2. Cache-coherent Nonuniform Memory Access (ccNUMA)
Selected Features
Socket is a physical package (with multiple cores), typically a replaceable component
Two dual-core chips (2 cores/socket)
P = Processor core
L1D = Level 1 Cache – Data (fastest)
L2 = Level 2 Cache (fast)
Memory = main memory (slow)
Chipset = enforces cache coherence and
mediates connections to memory
Selected Features
Eight cores (4 cores/socket); L3 = Level 3 Cache
Memory interface = establishes a coherent link to enable one
‘logical’ single address space of ‘physically distributed memory’
Shared Memory
Features
Bindings are defined for C, C++, and Fortran languages
Threads TX are ‘lightweight processes’ that mutually access data
Lecture 6 will give in-depth details on the shared-memory programming model with OpenMP and using its compiler directives
Distributed-Memory Computers
Features
Processors communicate via Network Interfaces (NI)
NI mediates the connection to a Communication network
From a programming-model view, this setup is rarely used directly today
Programming with Distributed Memory using MPI
Features
No remote memory access on distributed-memory systems
Processes PX are required to ‘send messages’ back and forth between each other
Many free Message Passing Interface (MPI) libraries available
Programming is tedious & complicated, but it is the most flexible method
Lecture 2 & 4 will give in-depth details on the distributed-memory programming model with the Message Passing Interface (MPI)
MPI Standard – GNU OpenMPI Implementation Example – Revisited
OpenMPI Implementation
Open source license based on the BSD license
Full MPI (version 3) standards conformance [13] OpenMPI Web page
Developed & maintained by a consortium of academic, research, & industry partners
Typically available as modules on HPC systems and used with mpicc compiler
Often built with the GNU compiler set and/or Intel compilers
Lecture 2 will provide a full introduction and many more examples of the Message Passing Interface (MPI) for parallel programming
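On systems that use environment modules, a typical compile-and-run session might look as follows (the module name, source file, and process count are assumptions that vary per site; this is an untested CLI sketch):

```shell
# Load an MPI implementation provided as a module (site-specific name)
module load openmpi

# Compile with the MPI compiler wrapper (here wrapping the GNU compiler)
mpicc -O2 -o hello_mpi hello_mpi.c

# Launch 4 MPI processes
mpirun -np 4 ./hello_mpi
```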
Hierarchical Hybrid Computers
Features
Shared-memory nodes (here ccNUMA) with local NIs
NI mediates connections to other remote ‘SMP nodes’
Operating System
In former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’
Scheduling Systems
Manage concurrent access of users on supercomputers
Different scheduling algorithms can be used with different ‘batch queues’
Examples: SLURM @ JÖTUNN Cluster, LoadLeveler @ JUQUEEN, etc.
Monitoring Systems
Monitor and test the status of the system (‘system health checks/heartbeat’)
Enable a view of the usage of the system per node/rack (‘system load’)
Examples: LLview, INCA, Ganglia @ JÖTUNN Cluster, etc.
HPC systems and supercomputers typically provide a software environment that supports the processing of parallel and scalable applications. Scheduling systems enable a method by which user processes are given access to processors; monitoring systems offer a comprehensive view of the current status of an HPC system or supercomputer.
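As an illustration of batch scheduling, a minimal SLURM job script might look as follows (the partition name, resource counts, and binary name are assumptions that vary per cluster; this is an untested config sketch):

```shell
#!/bin/bash
#SBATCH --job-name=hello_mpi     # job name shown in the queue
#SBATCH --nodes=2                # number of compute nodes requested
#SBATCH --ntasks-per-node=4     # MPI processes per node
#SBATCH --time=00:10:00          # wall-clock limit (10 minutes)
#SBATCH --partition=normal       # batch queue (site-specific)

# srun launches the MPI processes under SLURM's control
srun ./hello_mpi
```

Submitted with `sbatch job.sh`; the scheduler queues the job until the requested nodes become free, which is how concurrent user access is managed.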
Performance Analysis Systems
Measure the performance of an application and recommend improvements (e.g., Scalasca, Vampir, etc.)
Lecture 9 will offer more insights into performance analysis systems with debugging, profiling, and HPC performance toolsets
Scheduling vs. Emerging Interactive Supercomputing Approaches
JupyterHub is a multi-user version of the notebook, designed for companies, classrooms, and research labs
[20] A. Lintermann & M. Riedel et al., ‘Enabling Interactive Supercomputing at JSC – Lessons Learned’
[21] A. Streit & M. Riedel et al., ‘UNICORE 6 – Recent and Future Advancements’
[22] Project Jupyter Web page
BlueGene/P & BlueGene/Q: the network interconnection is important!
Lecture 10 will introduce the programming of accelerators with different approaches and their key benefits for applications
NVIDIA Fermi GPU Example
[26] A. Rosebrock
Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and how many-core HPC is used
Deep Learning Application Example – Using High Performance Computing
Using Convolutional Neural Networks (CNNs) with hyperspectral remote sensing image data ([27] J. Lange and M. Riedel et al., IGARSS Conference, 2018)
Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and remote sensing applications
HPC Relationship to ‘Big Data‘ in Machine & Deep Learning
[Figure] Model performance/accuracy versus training data and time: on ‘small datasets’ with ‘manual feature engineering’ the ordering changes, but with growing data and compute, large deep learning networks outperform medium deep learning networks and small neural networks. Training at this scale relies on High Performance Computing (e.g., JURECA) & cloud computing, e.g., performing parallel computing with Apache Spark across different worker nodes.
The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course teaches Apache Spark Approaches
HPC Relationship to ‘Big Data‘ in Simulation Sciences
P = Processor core elements; compute = floating-point or integer operations in arithmetic units (compute operations)
The modular supercomputing architecture (MSA) enables a flexible HPC system design, co-designed by the needs of different application workloads [17] DEEP Projects Web Page
[Figure] MSA module diagram [17]: Cluster Nodes (CN) with general-purpose CPUs and memory (MEM); Booster Nodes (BN) with many-core CPUs; Data Nodes (DN) with NVRAM storage; FPGAs; Network-Attached Memory (NAM); and a Global Collective Engine (GCE). A possible application workload can span several of these modules.
Large-scale Computing Infrastructures
Lecture 11 will give in-depth details on scalable approaches in large-scale HPC infrastructures and how to use them with middleware
[Video] PRACE – Introduction to Supercomputing
[23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019, Online:
https://www.researchgate.net/publication/335181248_Cloud_Deep_Networks_for_Hyperspectral_Image_Analysis
[24] Apache Spark, Online:
https://spark.apache.org/
[25] YouTube Video, ‘Neural Network 3D Simulation‘, Online:
https://www.youtube.com/watch?v=3JQ3hYko51Y
[26] A. Rosebrock, ‘Get off the deep learning bandwagon and get some perspective‘, Online:
http://www.pyimagesearch.com/2014/06/09/get-deep-learning-bandwagon-get-perspective/
[27] J. Lange, G. Cavallaro, M. Goetz, E. Erlingsson, M. Riedel, ‘The Influence of Sampling Methods on Pixel-Wise Hyperspectral Image Classification with 3D
Convolutional Neural Networks’, Proceedings of the IGARSS 2018 Conference, Online:
https://www.researchgate.net/publication/328991957_The_Influence_of_Sampling_Methods_on_Pixel-Wise_Hyperspectral_Image_Classification_with_3D_Convolutional_Neural_Networks
[28] G. Cavallaro, Y. Bazi, F. Melgani, M. Riedel, ‘Multi-Scale Convolutional SVM Networks for Multi-Class Classification Problems of Remote Sensing Images’,
Proceedings of the IGARSS 2019 Conference, to appear