Applied High-Performance Computing and Parallel
Programming
Presenter: Liangqiong Qu
Assistant Professor
The University of Hong Kong
Administration
• Assignment 2 has been released
- Due April 1, 2025, Tuesday, 11:59 PM
- Important: HPC system accounts for the second assignment can be used from
March 20 to April 1, 11:59 PM.
Review of Lecture 13: Predefined Data Types in MPI
• Different machines store multibyte data types differently,
e.g. Little-Endian and Big-Endian systems:
• Little-Endian: the least significant byte of a multibyte value is
stored first (at the lowest address).
• Big-Endian: the most significant byte of a multibyte value is
stored first.
• MPI derived data types provide support for
heterogeneous systems: automatic data type
conversion
• Process A on a Little-Endian machine sends a 32-bit
integer to Process B on a Big-Endian machine using
MPI.
• MPI automatically handles the conversion, ensuring
the data is correctly interpreted by both processes.
Review of Lecture 13: C Structures
▪ Structures (or structs) allow you to group multiple related variables into a single
unit. Each variable within the structure is called a member.
▪ To define a structure, use the struct keyword and declare its members inside curly
braces {}. To access a member of a structure, use the dot syntax (.):
C compilers typically require data types to be
stored at memory addresses that are
multiples of their size (alignment), due to
hardware design and performance
considerations.
NOTE: MPI is a library and it has no idea about the
C struct that we have set up in our main program.
C structs can have different memory layouts and
padding depending on the compiler and architecture
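As a quick refresher, here is a minimal C sketch of defining a struct and accessing its members with the dot syntax (the struct name and values are purely illustrative):

#include <stdio.h>

// Illustrative struct: the compiler may insert padding between members
// so that each member sits at an address that is a multiple of its size.
struct Particle {
    int    id;
    double mass;
    char   tag;
};

int main(void) {
    struct Particle p;     // variable of the struct type
    p.id   = 7;            // members are accessed with the dot syntax
    p.mass = 1.5;
    p.tag  = 'A';
    printf("id=%d mass=%.1f tag=%c, sizeof=%zu bytes\n",
           p.id, p.mass, p.tag, sizeof(struct Particle));
    return 0;
}

Note that sizeof(struct Particle) is usually larger than the sum of the member sizes because of the padding mentioned above.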
Review of Lecture 13: MPI Derived Datatypes
▪ MPI allows the programmer to create their own data types, analogous to defining
structures in C. This class of data types is called derived datatypes.
▪ Derived datatypes in MPI can be used for grouping data of different datatypes for
communication, and for grouping non-contiguous data for communication.
▪ Three steps to create a new MPI data type
• Construct the new data type
MPI_Datatype newtype;
MPI_Type_*(…); // define the new data type
Use a function such as MPI_Type_create_struct, MPI_Type_vector, or MPI_Type_create_subarray to
define the layout of the new data type.
• Commit new data type with
MPI_Type_commit(MPI_Datatype * newtype);
A datatype object has to be committed before it can be used in a communication.
• After use, deallocate the data type with
MPI_Type_free(MPI_Datatype * newtype);
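A minimal sketch of this three-step lifecycle (the contiguous type below is just a placeholder; any MPI_Type_* constructor can take its place):

MPI_Datatype newtype;
MPI_Type_contiguous(4, MPI_INT, &newtype);  // 1. construct: here, 4 contiguous ints
MPI_Type_commit(&newtype);                  // 2. commit before any communication/I/O use
// ... use newtype in MPI_Send / MPI_Recv / MPI_File_* calls ...
MPI_Type_free(&newtype);                    // 3. free the type once it is no longer needed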
Review of Lecture 13: A Flexible, Vector-Like Type: MPI_Type_vector
MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype,
MPI_Datatype * newtype);
• count = 2 (number of blocks)
• blocklength = 3 (number of elements in each block)
• stride = 5 (number of elements between the starts of consecutive blocks)
• oldtype = MPI_INT
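A hedged sketch of how these parameters could be used to transfer the two 3-element blocks out of a 10-element array (the ranks and array contents are illustrative):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int data[10];
MPI_Datatype vec;
// 2 blocks of 3 MPI_INTs; block starts are 5 elements apart
MPI_Type_vector(2, 3, 5, MPI_INT, &vec);
MPI_Type_commit(&vec);
if (rank == 0)
    MPI_Send(data, 1, vec, 1, 0, MPI_COMM_WORLD);   // sends elements 0-2 and 5-7
else if (rank == 1)
    MPI_Recv(data, 1, vec, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Type_free(&vec);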
A Sub-array Type: MPI_Type_create_subarray
MPI_Type_create_subarray(int ndims, const int array_of_sizes[], const int
array_of_subsizes[], const int array_of_starts[], int order, MPI_Datatype
oldtype, MPI_Datatype *newtype)
Input arguments:
• ndims: number of array dimensions
• array_of_sizes: number of elements in each dimension of the full array
• array_of_subsizes: number of elements in each dimension of the subarray
• array_of_starts: starting coordinates of the subarray in each dimension
• order: array storage order flag (row-major: MPI_ORDER_C, or column-
major: MPI_ORDER_FORTRAN)
Output arguments:
• newtype: new datatype (handle)
A Sub-array Type: MPI_Type_create_subarray
MPI_Type_create_subarray(int ndims, const int array_of_sizes[],
const int array_of_subsizes[], const int array_of_starts[], int order,
MPI_Datatype oldtype, MPI_Datatype *newtype)
• ndims: 2 (number of array dimensions)
• array_of_sizes: {nrows, ncols} (dimension of original full array)
• array_of_subsizes: {nrows-2, ncols-2} (actual dimension of subarray)
• array_of_starts: {1, 1} (starting coordinates: skip the first row and column)
• order : MPI_ORDER_C
• oldtype: MPI_INT
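A possible sketch of building the interior-region subarray described above (nrows, ncols and the variable names are assumptions):

int nrows = 6, ncols = 8;
int array_of_sizes[2]    = {nrows, ncols};          // full 2-D array
int array_of_subsizes[2] = {nrows - 2, ncols - 2};  // interior block without the boundary
int array_of_starts[2]   = {1, 1};                  // skip the first row and column
MPI_Datatype interior;
MPI_Type_create_subarray(2, array_of_sizes, array_of_subsizes, array_of_starts,
                         MPI_ORDER_C, MPI_INT, &interior);
MPI_Type_commit(&interior);
// e.g. MPI_Send(&full_array[0][0], 1, interior, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&interior);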
Most Flexible Type: MPI_Type_create_struct
▪ MPI does not directly support sending or receiving C struct types because MPI is
designed to be portable across different architectures and programming
languages. C structs can have different memory layouts and padding depending
on the compiler and architecture, making them non-portable for communication
between different MPI processes.
▪ To ensure portability, MPI provides functions like `MPI_Type_create_struct` to
create new MPI datatypes that can represent complex structures.
MPI_Type_create_struct(int block_count, const int block_lengths[], const MPI_Aint displs[],
MPI_Datatype block_types[], MPI_Datatype* new_datatype);
Most Flexible Type: MPI_Type_create_struct
▪ MPI_Type_create_struct is the most flexible routine to create an MPI datatype. It
describes blocks with arbitrary data types and arbitrary displacements.
MPI_Type_create_struct(int block_count, const int block_lengths[], const MPI_Aint displs[],
MPI_Datatype block_types[], MPI_Datatype* new_datatype);
Input arguments:
• block_count: The number of blocks to create.
• block_lengths : Array containing the length of each block.
• displs: Array containing the displacement for each block, expressed in bytes.
The displacement is the distance between the start of the MPI datatype created
and the start of the block.
• block_types : Type of elements in each block
Output arguments:
• new_datatype: new datatype (handle)
Most Flexible Type: MPI_Type_create_struct
▪ MPI_Type_create_struct is the most flexible routine to create an MPI datatype. It
describes blocks with arbitrary data types and arbitrary displacements.
MPI_Type_create_struct(int block_count, const int block_lengths[], const MPI_Aint displs[],
MPI_Datatype block_types[], MPI_Datatype* new_datatype);
• The contents of displs are either the
displacements in bytes of the block bases or
MPI addresses
• Displs (displacement) is important in order to
let MPI know where each field is located in
memory so it can correctly pack, send, receive,
and unpack the data.
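A minimal sketch of computing the displacements for an illustrative C struct with MPI_Get_address and MPI_Aint_diff, so that any compiler padding is accounted for:

struct Body { int id; double pos[3]; };   // illustrative struct

struct Body b;
MPI_Aint base, addr_id, addr_pos;
MPI_Get_address(&b, &base);
MPI_Get_address(&b.id, &addr_id);
MPI_Get_address(&b.pos, &addr_pos);

int          block_lengths[2] = {1, 3};
MPI_Aint     displs[2]        = {MPI_Aint_diff(addr_id, base),
                                 MPI_Aint_diff(addr_pos, base)};
MPI_Datatype block_types[2]   = {MPI_INT, MPI_DOUBLE};

MPI_Datatype body_type;
MPI_Type_create_struct(2, block_lengths, displs, block_types, &body_type);
MPI_Type_commit(&body_type);
// ... send/receive variables of type struct Body using body_type ...
MPI_Type_free(&body_type);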
Most Flexible Type: MPI_Type_create_struct
MPI_Type_create_struct(int block_count, const int block_lengths[], const MPI_Aint displs[],
MPI_Datatype block_types[], MPI_Datatype* new_datatype);
Derived Data Types: Summary
▪ A flexible tool to communicate complex data structures in MPI
▪ Most important calls:
• MPI_Type_create_struct(…)
specifies the data layout of user-defined structs (or classes)
• MPI_Type_vector(…)
specifies strided data, i.e. same-type data with missing elements
• MPI_Type_create_subarray(…)
specifies sub-ranges of multi-dimensional arrays
• MPI_Type_commit, MPI_Type_free
• MPI_Get_address, MPI_Aint_add, MPI_Aint_diff
▪ Matching rule: send and receive match if specified basic datatypes match one by
one, regardless of displacements
▪ Correct displacements at receiver side are automatically matched to the
corresponding data items
MPI Input/Output
Why MPI I/O?
▪ Many parallel applications need …
• coordinated parallel access to a file by a group of processes
• simultaneous access to a file
• non-contiguous access to pieces of the file by many processes
• i.e., the data may be distributed amongst the processes according to a partitioning
scheme.
And of course it should be
efficient!
MPI I/O: Principles
▪ MPI file contains elements of a single MPI data type (etype)
▪ The file is partitioned among processes using an access template (filetype)
▪ All file accesses to/from a contiguous or non-contiguous user buffer (MPI data type)
▪ Several different ways of reading/writing data:
• non-blocking / blocking
• collective / individual
• individual / shared file pointers, explicit offsets
▪ Automatic data conversion in heterogeneous systems
MPI I/O
▪ Just like POSIX (Portable Operating System Interface) I/O, you need to:
• Open the file
• Read or write data to the file
• Close the file
▪ In MPI, these steps are almost the same:
• Open the file: MPI_File_open
• Read/write the file: MPI_File_read / MPI_File_write
• Close the file: MPI_File_close
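A minimal end-to-end sketch of these three steps (the filename and payload are illustrative; it uses the explicit-offset write MPI_File_write_at, covered later in this lecture, so that every rank writes to its own location):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_File fh;
MPI_Status status;
int value = rank;                                    // illustrative payload: each rank writes its rank

MPI_File_open(MPI_COMM_WORLD, "out.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(int),
                  &value, 1, MPI_INT, &status);      // write 1 int at a rank-specific byte offset
MPI_File_close(&fh);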
MPI I/O: MPI_File_open
MPI_File_open(MPI_Comm comm, const char *filename, int amode, MPI_Info
info, MPI_File *fh)
▪ Description: MPI_File_open opens the file identified by filename on all
processes in the comm communicator group. MPI_File_open is a collective routine: all
processes must provide the same value for amode, and all processes must provide
filenames that reference the same file (the filenames themselves need not be identical).
• filename can be different, but must point to the same file
• amode describes file access mode (see next slide)
• info object to modify the behavior of MPI_File_open, can be MPI_INFO_NULL
• (output argument) fh represents the file handle, can be subsequently used to access
the file until the file is closed using MPI_File_close
• Process-local file I/O is possible by specifying MPI_COMM_SELF as comm
MPI I/O: MPI_File_open—File Access Modes (amode)
Access mode                 Description
MPI_MODE_RDONLY             read only
MPI_MODE_RDWR               read and write
MPI_MODE_WRONLY             write only
(exactly one of the three modes above is required)
MPI_MODE_CREATE             create the file if it does not exist
MPI_MODE_EXCL               raise an error if creating a file that already exists
MPI_MODE_DELETE_ON_CLOSE    the file is deleted when closed
MPI_MODE_UNIQUE_OPEN        the file is not concurrently opened by anybody else
MPI_MODE_APPEND             all file pointers are set to the end of the file
▪ Flags can be used together via |, e.g., MPI_MODE_WRONLY | MPI_MODE_APPEND
MPI I/O: MPI_File_open—Example
▪ All processes in MPI_COMM_WORLD open the file collectively in write-only mode;
if the file does not exist, it will be created
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_WRONLY |
MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
▪ Also possible to open file with only one process:
if (rank == 0) {
MPI_File fh;
MPI_File_open(MPI_COMM_SELF, filename, MPI_MODE_WRONLY |
MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
... }
• fh is used as a file handle for the file that is going to be opened.
• `MPI_INFO_NULL`: This is a special constant representing an empty info object, which means that no
additional information is provided to the `MPI_File_open` function.
MPI I/O: MPI_File_close
MPI_File_close(MPI_File *fh)
▪ Description: MPI_File_close first synchronizes file state, then closes the file associated
with fh. MPI_File_close is a collective routine. The user is responsible for ensuring that
all outstanding requests associated with fh have completed before calling
MPI_File_close.
▪ File is deleted if MPI_MODE_DELETE_ON_CLOSE was part of access mode during
MPI_File_open call.
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, …, &fh);
...
MPI_File_close(&fh);
MPI I/O File Views
▪ A view indicates the visible and accessible data from a file. It does not necessarily
include every byte of the file.
▪ Each process has its own view
▪ A file view is described via a triplet of arguments (displacement, etype, filetype)
• displacement = number of bytes to be skipped from the start of the file
• etype = The basic unit of data access (can be any basic or derived MPI datatype)
• filetype = specifies layout of etypes within the file
▪ Initially, all processes view the file as a linear byte stream; that is, the etype and filetype
are both MPI_BYTE (0, MPI_BYTE, MPI_BYTE). The file view can be changed via
the MPI_File_set_view routine.
MPI I/O File Views: The Default File View
▪ Initially, all processes view the file as a linear byte stream; that is, the etype and filetype
are both MPI_BYTE. The file view can be changed via the MPI_File_set_view routine.
▪ After file open, each file has the default view
▪ Default view: linear byte stream
• displacement = 0
• etype = MPI_BYTE
• filetype = MPI_BYTE
▪ MPI_BYTE matches with any data type
MPI I/O File View: Definitions
MPI I/O Custom File View: MPI_File_set_view
▪ The default file view can be changed via the
MPI_File_set_view routine by setting the following
parameters:
• displacement = number of bytes to be skipped from
the start of the file
• etype = unit of data access (can be any basic or derived
datatype)
• filetype = specifies layout of etypes within the file
MPI I/O: Setting and Getting the View
MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
MPI_Datatype filetype, const char *datarep, MPI_Info info)
• MPI_File_set_view changes the process’s view of the data.
• Collective operation
• Local and shared file pointers are reset to zero
• etype and filetype must be committed types
• datarep is a string specifying the representation in which data is written to the file: native,
internal, external32, or user-defined (see next slide)
• Same etype extent and same datarep on all processes
• disp: number of bytes to be skipped from the start of the file
• info: Hints are additional pieces of information that tell the MPI implementation
how to handle file I/O operations more efficiently. We use MPI_INFO_NULL to
skip this.
MPI I/O: Setting and Getting the View
MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
MPI_Datatype filetype, const char *datarep, MPI_Info info)
▪ MPI_File_set_view is collective across the file handle fh; all processes in the group
must pass identical values for datarep and provide an etype with an
identical extent.
▪ After setting a process’s view, the MPI_File_get_view routine can be used to
query the process’s current view of the data in the
file.
MPI_File_get_view(MPI_File fh, MPI_Offset *disp, MPI_Datatype *etype,
MPI_Datatype *filetype, const char *datarep)
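A hedged sketch of setting and then querying a view (the file handle fh is assumed to be open, and the filetype vec is assumed to be a committed derived datatype, e.g. created with MPI_Type_vector):

// Set the view: skip 0 header bytes, access units of MPI_INT,
// and lay them out in the file according to the committed filetype 'vec'
MPI_File_set_view(fh, 0, MPI_INT, vec, "native", MPI_INFO_NULL);

// Query the view back
MPI_Offset   disp;
MPI_Datatype etype, filetype;
char         datarep[MPI_MAX_DATAREP_STRING];
MPI_File_get_view(fh, &disp, &etype, &filetype, datarep);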
MPI I/O: Data Representation (Appendix)
MPI I/O: A Simple File View Example
▪ Basic example: File view for one process
▪ View contains holes with respect to original file
MPI I/O: A Simple File View Example
Note: the displacement is measured in bytes.
Reading and Writing from/to Files
Data in MPI is moved between files and processes by issuing read and write calls. There are
three orthogonal aspects to data access: positioning (explicit offset vs. implicit file pointer),
synchronism (blocking vs. nonblocking), and coordination (noncollective vs. collective).
▪ Direction: Read / Write
▪ Positioning (realized via routine names)
• explicit offset ( _AT )
• individual file pointer (no positional qualifier)
• shared file pointer ( _SHARED or _ORDERED )
▪ Coordination
• non-collective
• collective ( _ALL )
▪ Synchronism
• blocking
• non-blocking ( _I… )
Selected Important Data Access Routines
Positioning                Synchronism    Noncollective           Collective
Explicit offsets           Blocking       MPI_File_read_at        MPI_File_read_at_all
                                          MPI_File_write_at       MPI_File_write_at_all
                           Nonblocking    MPI_File_iread_at       MPI_File_iread_at_all
                                          MPI_File_iwrite_at      MPI_File_iwrite_at_all
Individual file pointers   Blocking       MPI_File_read           MPI_File_read_all
                                          MPI_File_write          MPI_File_write_all
                           Nonblocking    MPI_File_iread          MPI_File_iread_all
                                          MPI_File_iwrite         MPI_File_iwrite_all
Individual File Pointers
▪ Each process maintains its own individual file pointer, which tracks that process’s
current position in the file.
• These pointers are independent across processes: one process’s file pointer does not
affect another’s.
• This behavior is similar to standard POSIX file I/O semantics.
Individual File Pointer: MPI_File_read
MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status
*status)
▪ Description: MPI_File_read attempts to read from the file associated with fh (at the current
individual file pointer position maintained by the system) a total number of count data
items having datatype type into the user’s buffer buf. MPI_File_read stores the number of
data-type elements actually read in status. All other fields of status are undefined.
▪ Output arguments:
• buf: initial address of buffer
• status: status object
▪ Input arguments:
• fh: file handle (handle)
• count: read count elements of datatype
• datatype: datatype of each buffer element
• If all processes execute a read at the same logical time, it is better to use the collective call
MPI_File_read_all
MPI_File_read
MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status
*status)
▪ Description: MPI_File_read attempts to read from the file associated with fh (at the
current individual file pointer position maintained by the system) a total number
of count data items having datatype type into the user’s buffer buf. MPI_File_read stores
the number of data-type elements actually read in status.
• Each process maintains its own individual file pointer
• It is a blocking, non-collective call that behaves much like standard serial file I/O
• Individual file pointer is automatically incremented by
fp = fp + count * elements(datatype)/elements(etype)
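For example, two consecutive blocking reads pick up where the previous one stopped (a sketch; fh is assumed to be an open file with the default byte view):

int a[4], b[4];
MPI_Status status;
MPI_File_read(fh, a, 4, MPI_INT, &status);  // reads the first 4 ints; fp advances by 16 etypes (bytes)
MPI_File_read(fh, b, 4, MPI_INT, &status);  // continues at the position where the first read stopped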
Set Offset of Individual File Pointer fp: MPI_File_seek
MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)
▪ Description: MPI_File_seek updates the individual file pointer according to whence, which
has the following possible values. The offset can be negative, which allows seeking
backwards.
• fh (input/output argument): File handle (handle)
• Offset (input argument): File offset
• whence (input argument): Update mode
whence           description
MPI_SEEK_SET     the pointer fp is set to offset
MPI_SEEK_CUR     the pointer is set to the current position plus offset (fp + offset)
MPI_SEEK_END     the pointer is set to the end of the file plus offset
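A small sketch of repositioning the individual file pointer (fh is an open file handle; with the default view the etype is MPI_BYTE, so the offsets below are in bytes, and the values are illustrative):

MPI_File_seek(fh, 100, MPI_SEEK_SET);   // jump to offset 100 from the start of the view
MPI_File_seek(fh, -8, MPI_SEEK_CUR);    // move 8 etype units backwards from the current position
MPI_File_seek(fh, 0, MPI_SEEK_END);     // position at the end of the file (e.g. before appending)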
Explicit Offsets: MPI_File_read_at
MPI_File_read_at(MPI_File fh, MPI_Offset offset, void *buf, int count,
MPI_Datatype datatype, MPI_Status *status)
• Reads a file at an explicitly specified offset
• Arguments have same meaning as for MPI_File_read
• MPI_File_read_at = MPI_File_read + MPI_File_seek.
• _at indicates that the position in the file is specified as part of the call; this provides
thread-safety and clearer code than using a separate “seek” call. Preferred use.
• Reads data starting at offset, i.e. offset etype units from the beginning of the view
(after the displacement)
• Read count elements of datatype
• EOF can be detected by noting that the amount of data read is less than count
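A hedged sketch in which each rank reads its own disjoint block with an explicit offset (the block size N and the int payload are assumptions; rank comes from MPI_Comm_rank, fh from MPI_File_open, and the default byte view is assumed, so the offset is given in bytes):

#define N 100
int buf[N];
MPI_Status status;
MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);   // rank-specific byte offset
MPI_File_read_at(fh, offset, buf, N, MPI_INT, &status);

int got;
MPI_Get_count(&status, MPI_INT, &got);   // got < N indicates that EOF was reached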
Collective Operations
▪ I/O can be performed collectively by all processes in a communicator:
• MPI_File_read_all
• MPI_File_write_all
• MPI_File_read_at_all
• MPI_File_write_at_all
• All processes in communicator that opened file must call function
• Same parameters as for the non-collective functions (MPI_File_read etc).
Each process specifies only its own access information
• _at indicates that the position in the file is specified as part of the call; this
provides thread-safety and clearer code than using a separate “seek” call
• Performance potentially better than for individual functions
Non-blocking MPI I/O
▪ Non-blocking independent I/O is similar to the non-blocking send/recv routines
▪ MPI_File_iread(_at) / MPI_File_iwrite(_at)
▪ Wait for completion using MPI_Test, MPI_Wait, etc.
▪ Can be used to overlap I/O with computation.
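A minimal sketch of overlapping a read with computation (fh is an open file handle; the buffer size and do_other_work() are illustrative):

double buf[1024];
MPI_Request req;
MPI_File_iread(fh, buf, 1024, MPI_DOUBLE, &req);   // start the read and return immediately

do_other_work();                                   // computation that does not touch buf

MPI_Status status;
MPI_Wait(&req, &status);                           // block until the read has completed; buf is now valid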
Examples 1: Write "abcdabcd" Pattern
▪ Task 1:
Write an MPI program where each process writes a single character to a file such that
the output file contains the pattern "abcdabcdabcd" when executed with four processes.
Process "rank=0" should write 'a', process "rank=1" should write 'b', and so on.
Note: Use a 1-dimensional fileview with `MPI_Type_create_subarray` to achieve this.
Examples 1: Write "abcdabcd" Pattern
▪ Task 1: Write an MPI program where each process writes a single character to a file
such that the output file contains the pattern "abcdabcdabcd" when executed with
four processes. Process "rank=0" should write 'a', process "rank=1" should write 'b',
and so on. Use a 1-dimensional fileview with `MPI_Type_create_subarray` to
achieve this.
Instructions:
• Use a file view to set the data visible to each process
• MPI_File_set_view(MPI_File fh, MPI_Offset
disp, MPI_Datatype etype, MPI_Datatype
filetype, const char *datarep, MPI_Info info)
• With file displacement = 0 (number of header bytes),
identical on all processes
Examples 1: Write "abcdabcd" Pattern
▪ Task 1: Write an MPI program where each process writes a single character to a file such that the
output file contains the pattern "abcdabcdabcd" when executed with four processes. Process "rank=0"
should write 'a', process "rank=1" should write 'b', and so on.
Core code:
char buffer = rank + 'a'; // translates rank 0, 1 , 2.. to a, b, c..
int repeated_times = 2;
int array_of_sizes[1] = {size}; // size indicates the number of processes
int array_of_subsizes[1] = {1};
int array_of_starts[1] = {rank};
MPI_Type_create_subarray(1, array_of_sizes, array_of_subsizes, array_of_starts,
MPI_ORDER_C, MPI_CHAR, &subarray);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, filename,
MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_CHAR, subarray, "native", MPI_INFO_NULL);
char write_buf[repeated_times];
for (int i = 0; i < repeated_times; i++) {
write_buf[i] = buffer;
}
MPI_File_write_all(fh, write_buf, repeated_times, MPI_CHAR, &status);
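For reference, a fuller sketch of the same program with the declarations and cleanup that the excerpt above omits (the output filename is an assumption):

#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const char *filename = "abcdabcd.txt";     // assumed output file name
    char buffer = (char)('a' + rank);          // rank 0,1,2,... -> 'a','b','c',...
    int repeated_times = 2;

    // Each process sees one 1-char slot out of every block of 'size' characters
    int array_of_sizes[1]    = {size};
    int array_of_subsizes[1] = {1};
    int array_of_starts[1]   = {rank};
    MPI_Datatype subarray;
    MPI_Type_create_subarray(1, array_of_sizes, array_of_subsizes, array_of_starts,
                             MPI_ORDER_C, MPI_CHAR, &subarray);
    MPI_Type_commit(&subarray);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_CHAR, subarray, "native", MPI_INFO_NULL);

    char write_buf[2];                         // repeated_times characters
    for (int i = 0; i < repeated_times; i++)
        write_buf[i] = buffer;

    MPI_Status status;
    MPI_File_write_all(fh, write_buf, repeated_times, MPI_CHAR, &status);

    MPI_File_close(&fh);
    MPI_Type_free(&subarray);
    MPI_Finalize();
    return 0;
}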
Examples 1: Write "abcdabcdabcd" Pattern
Run this program with 4 processes; expected result:
abcdabcd (8 characters)
Examples 1.1 : Write "abcdabcdabab" Pattern and then Read
▪ Task 1.1:
(1) Write an MPI program where each process writes a single character to a file such
that the output file (filename 'abcdabcd.txt') contains the pattern “abcdabcd”
(repeated two times) when executed with four processes. Process "rank=0" should write
'a', process "rank=1" should write 'b', and so on. Compile and run your code with 4
processes. The abcdabcd.txt should look like:
a b c d a b c d
(2) Make a copy of your result using the command `cp abcdabcd.txt abcdabcd_1.txt`.
Now, modify your code by changing the displacement from 0 to the current filesize (Use
MPI_File_get_size()), and write "abab" to the output file "abcdabcd.txt" by appending it
to the existing content. The updated content of "abcdabcd.txt" should look like:
a b c d a b c d a b a b
Note: Use a 1-dimensional fileview with `MPI_Type_create_subarray` to achieve this.
Examples 1.1 : Write "abcdabcdabab" Pattern and then Read
▪ Task 1.1:
(2) Make a copy of your result using the command `cp abcdabcd.txt abcdabcd_1.txt`.
Now, modify your code by changing the displacement from 0 to the current filesize (Use
MPI_File_get_size()), and write "abab" to the output file "abcdabcd.txt" by appending it
to the existing content. The updated content of "abcdabcd.txt" should look like:
a b c d a b c d a b a b
Core code:
// Task 2: Modify the displacement and write "abab" to the output file
// Obtain the displacement from the file with MPI_File_get_size()
MPI_Offset displacement;
MPI_File_get_size(fh, &displacement);
//update the view
MPI_File_set_view(fh, displacement, MPI_CHAR, subarray, "native", MPI_INFO_NULL);
MPI_File_write_all(fh, write_buf, repeated_times, MPI_CHAR, &status);
Examples 1.1 : Write "abcdabcdabab" Pattern
• Obtain the displacement from the file
with MPI_File_get_size()
• Set the displacement of
MPI_File_set_view to the size of the
file.
Reading and Writing from/to Files
▪ Data is moved between files and processes by calling read and write routines.
▪ Read routines move data from a file into memory.
▪ Write routines move data from memory into a file.
▪ The file is designated by a file handle (e.g., fh).
▪ The location of the file data is specified by an offset into the current view.
▪ The data in memory is specified by a triple: buf, count, and datatype.
▪ A data access routine attempts to transfer (read or write) count data items of type
datatype between the user’s buffer buf and the file.
▪ Upon completion, the amount of data accessed by the calling process is returned
in a status.
Summarization of MPI I/O
▪ Different operation modes:
• 'Blocking mode' to finish data operations, then continue computations
• 'Non-blocking mode' (aka asynchronous) to perform computations while a file
is being read or written in the background (typically more difficult to use)
▪ Supports the concept of ‘collective operations’
• Processes can access files each on their own or all together at the same time
• Collective I/O routines can improve I/O performance
▪ Provides advanced concepts
• Rich functionality to support various data representations and access
options
• E.g., file views & derived data types/structures
Thank You!