High Performance Computing Lecture 1 HPC Public
Many C/C++ programs are used in HPC today; multi-user HPC usage needs scheduling (monitored, e.g., with LLview [3])
Many multi-physics application challenges on different scales & levels of granularity drive the need for HPC
Students understand…
Latest developments in parallel processing & high performance computing (HPC)
How to create and use high-performance clusters
What are scalable networks & data-intensive workloads
The importance of domain decomposition
Complex aspects of parallel programming
HPC environment tools that support programming or analysing application behaviour
Different abstractions of parallel computing on various levels
Foundations and approaches of scientific domain-specific applications
Students are able to …
Program with and use HPC programming paradigms
Take advantage of innovative scientific computing simulations & technology
Work with technologies and tools to handle parallelism complexity
Lecture 1 – High Performance Computing 5 / 50
High Performance Computing (HPC) Basics
High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques
through specific support with dedicated hardware such as high-performance CPU/core interconnections.
HPC: the network interconnection is important!
High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that
enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPU/cores.
HTC: the network interconnection is less important!
The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course focuses on High Throughput Computing
Parallel Computing
Lecture 3 will give in-depth details on parallelization fundamentals & performance term relationships & theoretical considerations
TOP500 List (June 2019) [6] TOP500 Supercomputing Sites
Power consumption is a major challenge; the list also marks the EU #1 system
JUBE benchmark suite (based on real applications) [9] JUBE Benchmark Suite
Multi-core CPU chip architecture (one chip)
Hierarchy of caches (on/off chip)
L1 cache is private to each core; on-chip
L2 cache is shared; on-chip [10] Distributed & Cloud Computing Book
Both of the above are considered ‘programming models’
More recently, emerging computing models are becoming relevant for HPC: e.g., quantum devices, neuromorphic devices
Shared Memory
2. Cache-coherent Nonuniform Memory Access (ccNUMA)
Selected Features
Socket is a physical package (with multiple cores), typically a replaceable component
Two dual-core chips (2 cores/socket)
P = Processor core
L1D = Level 1 Cache – Data (fastest)
L2 = Level 2 Cache (fast)
Memory = main memory (slow)
Chipset = enforces cache coherence and
mediates connections to memory
Selected Features
Eight cores (4 cores/socket); L3 = Level 3 Cache
Memory interface = establishes a coherent link to enable one
‘logical’ single address space of ‘physically distributed memory’
Shared Memory
Features
Bindings are defined for C, C++, and Fortran languages
Threads TX are ‘lightweight processes’ that mutually access data
Lecture 6 will give in-depth details on the shared-memory programming model with OpenMP and using its compiler directives
Distributed-Memory Computers
Features
Processors communicate via Network Interfaces (NI)
NI mediates the connection to a Communication network
From a programming-model view, this setup is rarely used directly today
Programming with Distributed Memory using MPI
Features
No remote memory access on distributed-memory systems
Processes PX are required to ‘send messages’ back and forth between each other
Many free Message Passing Interface (MPI) libraries available
Programming is tedious & complicated, but it is the most flexible method
Lecture 2 & 4 will give in-depth details on the distributed-memory programming model with the Message Passing Interface (MPI)
MPI Standard – GNU OpenMPI Implementation Example – Revisited
OpenMPI Implementation
Open source license based on the BSD license
Full MPI (version 3) standards conformance [13] OpenMPI Web page
Developed & maintained by a consortium of academic, research, & industry partners
Typically available as modules on HPC systems and used with mpicc compiler
Often built with the GNU compiler set and/or Intel compilers
Lecture 2 will provide a full introduction and many more examples of the Message Passing Interface (MPI) for parallel programming
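On systems that use environment modules, a typical compile-and-run session might look as follows (the module name, source file, and process count are assumptions that vary per site; this is an untested CLI sketch):

```shell
# Load an MPI implementation provided as a module (site-specific name)
module load openmpi

# Compile with the MPI compiler wrapper (here wrapping the GNU compiler)
mpicc -O2 -o hello_mpi hello_mpi.c

# Launch 4 MPI processes
mpirun -np 4 ./hello_mpi
```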
Hierarchical Hybrid Computers
Features
Shared-memory nodes (here ccNUMA) with local NIs
NI mediates connections to other remote ‘SMP nodes’
Operating System
In former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’
Scheduling Systems
Manage concurrent access of users on supercomputers
Different scheduling algorithms can be used with different ‘batch queues’
Examples: SLURM @ JÖTUNN Cluster, LoadLeveler @ JUQUEEN, etc.
Monitoring Systems
Monitor and test the status of the system (‘system health checks/heartbeat’)
Enable a view of the usage of the system per node/rack (‘system load’)
Examples: LLview, INCA, Ganglia @ JÖTUNN Cluster, etc.
HPC systems and supercomputers typically provide a software environment that supports the processing of parallel and scalable applications. Scheduling systems enable a method by which user processes are given access to processors; monitoring systems offer a comprehensive view of the current status of an HPC system or supercomputer.
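As an illustration of batch scheduling, a minimal SLURM job script might look as follows (the partition name, resource counts, and binary name are assumptions that vary per cluster; this is an untested config sketch):

```shell
#!/bin/bash
#SBATCH --job-name=hello_mpi     # job name shown in the queue
#SBATCH --nodes=2                # number of compute nodes requested
#SBATCH --ntasks-per-node=4     # MPI processes per node
#SBATCH --time=00:10:00          # wall-clock limit (10 minutes)
#SBATCH --partition=normal       # batch queue (site-specific)

# srun launches the MPI processes under SLURM's control
srun ./hello_mpi
```

Submitted with `sbatch job.sh`; the scheduler queues the job until the requested nodes become free, which is how concurrent user access is managed.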
Performance Analysis Systems
Measure the performance of an application and recommend improvements (e.g., Scalasca, Vampir, etc.)
Lecture 9 will offer more insights into performance analysis systems with debugging, profiling, and HPC performance toolsets
Scheduling vs. Emerging Interactive Supercomputing Approaches
JupyterHub is a multi-user version of the notebook, designed for companies, classrooms, and research labs
[20] A. Lintermann & M. Riedel et al., ‘Enabling Interactive Supercomputing at JSC – Lessons Learned’
[21] A. Streit & M. Riedel et al., ‘UNICORE 6 – Recent and Future Advancements’
[22] Project Jupyter Web page
BlueGene/P & BlueGene/Q: the network interconnection is important!
Lecture 10 will introduce the programming of accelerators with different approaches and their key benefits for applications
NVIDIA Fermi GPU Example
[26] A. Rosebrock
Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and how many-core HPC is used
Deep Learning Application Example – Using High Performance Computing
Using Convolutional Neural Networks (CNNs) with hyperspectral remote sensing image data ([27] J. Lange and M. Riedel et al., IGARSS Conference, 2018)
Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and remote sensing applications
HPC Relationship to ‘Big Data‘ in Machine & Deep Learning
[Figure] Model performance/accuracy versus training data and time: on ‘small datasets’ with ‘manual feature engineering’ the ordering changes, but with growing data and compute, large deep learning networks outperform medium deep learning networks and small neural networks. Training at this scale relies on High Performance Computing (e.g., JURECA) & cloud computing, e.g., performing parallel computing with Apache Spark across different worker nodes.
The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course teaches Apache Spark Approaches
HPC Relationship to ‘Big Data‘ in Simulation Sciences
P = Processor core elements; compute = floating-point or integer operations in arithmetic units (compute operations)
The modular supercomputing architecture (MSA) enables a flexible HPC system design, co-designed by the needs of different application workloads [17] DEEP Projects Web Page
[Figure] MSA module diagram [17]: Cluster Nodes (CN) with general-purpose CPUs and memory (MEM); Booster Nodes (BN) with many-core CPUs; Data Nodes (DN) with NVRAM storage; FPGAs; Network-Attached Memory (NAM); and a Global Collective Engine (GCE). A possible application workload can span several of these modules.
Large-scale Computing Infrastructures
Lecture 11 will give in-depth details on scalable approaches in large-scale HPC infrastructures and how to use them with middleware
[Video] PRACE – Introduction to Supercomputing
[23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019, Online:
https://www.researchgate.net/publication/335181248_Cloud_Deep_Networks_for_Hyperspectral_Image_Analysis
[24] Apache Spark, Online:
https://spark.apache.org/
[25] YouTube Video, ‘Neural Network 3D Simulation‘, Online:
https://www.youtube.com/watch?v=3JQ3hYko51Y
[26] A. Rosebrock, ‘Get off the deep learning bandwagon and get some perspective‘, Online:
http://www.pyimagesearch.com/2014/06/09/get-deep-learning-bandwagon-get-perspective/
[27] J. Lange, G. Cavallaro, M. Goetz, E. Erlingsson, M. Riedel, ‘The Influence of Sampling Methods on Pixel-Wise Hyperspectral Image Classification with 3D
Convolutional Neural Networks’, Proceedings of the IGARSS 2018 Conference, Online:
https://www.researchgate.net/publication/328991957_The_Influence_of_Sampling_Methods_on_Pixel-Wise_Hyperspectral_Image_Classification_with_3D_Convolutional_Neural_Networks
[28] G. Cavallaro, Y. Bazi, F. Melgani, M. Riedel, ‘Multi-Scale Convolutional SVM Networks for Multi-Class Classification Problems of Remote Sensing Images’,
Proceedings of the IGARSS 2019 Conference, to appear