Platform BPG Affinity
Best practices
Using Affinity Scheduling in IBM Platform LSF
Executive Summary
IBM Platform LSF (LSF) is a powerful workload management platform for demanding
distributed HPC environments. It provides a comprehensive set of intelligent,
policy-driven scheduling features that enable you to utilize all of your compute
infrastructure resources and ensure optimal application performance.
When executing workload on multi-core hosts with non-uniform memory architectures
(NUMA), many applications perform best when their processes are bound at the
operating system level so that they:
• Always execute on a specific subset of CPUs on the host, in order to maximize
  hits on internal caches or ensure exclusive use of these resources.
• Always allocate memory from the memory node nearest to where the application
  executes, if possible.
IBM Platform LSF 9.1.1 introduced new features to give end users control over these
kinds of allocation and binding behavior. This document presents guidelines for using
the LSF affinity scheduling features for common tasks such as CPU and memory binding
for sequential jobs and parallel jobs run through several popular MPI implementations.
Introduction
This document serves as a best practice guide for using the affinity scheduling
features of LSF 9.1.1. It covers the following topics:
• How to enable and configure affinity scheduling in Platform LSF 9.1.1 and above
• Several usage examples, including:
    • Querying affinity-related information for hosts and jobs
    • Submitting jobs with CPU binding requirements
    • Submitting jobs with memory binding requirements
    • Checking the binding of tasks managed by LSF
    • Submitting OpenMPI jobs with binding requirements
    • Submitting IBM Platform MPI jobs with binding requirements
    • Submitting jobs with binding requirements to the IBM Parallel Operating
      Environment
Currently, Platform LSF Affinity Scheduling is supported on hosts running Linux with
kernel version 2.6.18 or above on both x86 and Power architectures.
Cluster Configuration
This section discusses the configuration of the Platform LSF cluster. The example
cluster has 4 hosts with the following configurations:
Table 1. Cluster Host Information

Host Name    Hardware Information          Affinity Enabled
lsf_master   UMA                           Not enabled
             1 processor socket
             4 cores / socket
aff_none     UMA                           Not enabled
             1 processor socket
             4 cores / socket
aff_part     2 NUMA nodes                  Enabled (CPUs 1,3,5,7,8-15)
             1 processor socket / node
             4 cores / socket
             2 hardware threads / core
aff_full     2 NUMA nodes                  Enabled (all CPUs)
             1 processor socket / node
             4 cores / socket
             2 hardware threads / core
RB_PLUGIN   SCH_DISABLE_PHASES
()          ()
()          ()

MXJ   r1m   AFFINITY
!     ()    (N)
!     ()    (N)
!     ()    (CPU_LIST="1,3,5,7,8-15")
!     ()    (Y)
that a single configuration can be shared across all hosts by using the special host name
default.
MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
16   0      0    0      0      0

ut   pg   io   ls   it   tmp  swp  mem  slots
0%   0.0  0    0    0    0M   0M   64G  16
-    -    0    -    -    -    -    0M   -

pg   io   ls   it   tmp  swp  mem
-    -    -    -    -    -    -
Note that the numbers inside the cores are the physical CPU IDs detected on the
host. In this case each core contains 2 hardware threads, all of which are enabled
on aff_full. The host named aff_part has CPUs 1, 3, 5, 7, and 8-15 enabled
(excluding CPUs 0, 2, 4, and 6), yielding the following display:
$ bhosts -l -aff aff_part
HOST  aff_part
STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
ok      60.00  -     16   0      0    0      0      0

ut   pg   io   ls   it   tmp  swp  mem  slots
0%   0.0  0    0    0    0M   0M   64G  16
-    -    0    -    -    -    -    0M   -

pg   io   ls   it   tmp  swp  mem
-    -    -    -    -    -    -

NUMA[0: 0M / 32G]
    Socket0
        core0(8)
        core1(10)
        core2(12)
        core3(14)
NUMA[1: 0M / 32G]
    Socket0
        core0(1 9)
        core1(3 11)
        core2(5 13)
        core3(7 15)
Note the cores under NUMA node 0: the cores on the socket containing the excluded
CPUs now show only a single thread each (the excluded CPU IDs have been omitted).
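The relationship between the CPU_LIST value and the excluded CPU IDs can be checked with a small shell sketch. The function names expand_cpu_list and excluded_cpus are hypothetical helpers written for this illustration, not LSF commands; the sketch assumes the list uses only comma-separated IDs and simple low-high ranges, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): expand a CPU list such as
# "1,3,5,7,8-15" into space-separated individual CPU IDs.
expand_cpu_list() {
    list=$1
    out=""
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*)
                # A range such as 8-15: emit every ID from low to high
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    IFS=$old_ifs
    # Strip the single leading space
    echo "${out# }"
}

# Hypothetical helper: list the CPU IDs in 0..max that are NOT enabled.
excluded_cpus() {
    enabled=" $(expand_cpu_list "$1") "
    max=$2
    missing=""
    i=0
    while [ "$i" -le "$max" ]; do
        case $enabled in
            *" $i "*) ;;                      # enabled, skip
            *) missing="$missing $i" ;;       # not in the list
        esac
        i=$((i + 1))
    done
    echo "${missing# }"
}

expand_cpu_list "1,3,5,7,8-15"    # prints: 1 3 5 7 8 9 10 11 12 13 14 15
excluded_cpus "1,3,5,7,8-15" 15   # prints: 0 2 4 6
```

The second result matches the CPU IDs missing from the NUMA node 0 cores in the display for aff_part.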
For a host without affinity scheduling turned on, LSF does not show host topology
information in bhosts, and affinity scheduling is shown as disabled:
$ bhosts -l -aff aff_none
HOST  aff_none
STATUS  CPUF   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
ok      60.00  -     4    0      0    0      0      0

ut   pg   io   ls   it   tmp  swp  mem  slots
0%   0.0  0    0    0    0M   0M   32G  4
-         0    -    -    -         0M   -

pg   io   ls   it   tmp  swp  mem
-    -    -    -    -    -    -
After this job starts to run, use bjobs -l -aff jobID to check the affinity
allocation of the job:
$ bjobs -l -aff 102
Job <102>, User <rshen>, Project <default>, Status <RUN>, Queue <normal>, Comma
                     nd <sleep 9000>
Fri Sep 27 16:58:53: Submitted from host <bp860-04>, CWD <$HOME/LSF/proj/lsf/ut
                     opia/lsbatch/cmd>, 2 Processors Requested, Requested Resou
                     rces <affinity[core(1)]>;
Fri Sep 27 16:58:54: Started on 2 Hosts/Processors <aff_part> <aff_part>;

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

AFFINITY:
           CPU BINDING                    MEMORY BINDING
           -----------------------        -------------------
HOST       TYPE  LEVEL  EXCL  IDS         POL  NUMA  SIZE
aff_part   core               /0/0/0      -
aff_part   core               /0/0/1
Here, the AFFINITY: section displays the following information about the job:
• Each of the two requested tasks (-n) has been allocated on host aff_part,
  according to the HOST column.
• The TYPE column shows that each allocation unit is a core (because the job
  requested core(1) in the affinity[] string).
• The IDS column shows the specific logical ID on the host for that allocation: in
  this case the first task is on NUMA node 0, socket 0, core 0 (/0/0/0), and the
  second is on the same NUMA node and socket, but is allocated core 1 (/0/0/1).
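The way an IDS entry is read can be sketched as a small shell function. The name decode_affinity_id is a hypothetical helper for illustration only, and the sketch assumes the three-level /NUMA/socket/core form shown for core allocations in the example above; other allocation types may use a different number of levels.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): split an IDS entry of the form
# /NUMA/socket/core into its three components.
decode_affinity_id() {
    id=${1#/}             # drop the leading slash: "0/0/1"
    numa=${id%%/*}        # text before the first slash
    rest=${id#*/}         # text after the first slash: "0/1"
    socket=${rest%%/*}
    core=${rest#*/}
    echo "NUMA node $numa, socket $socket, core $core"
}

decode_affinity_id /0/0/0   # prints: NUMA node 0, socket 0, core 0
decode_affinity_id /0/0/1   # prints: NUMA node 0, socket 0, core 1
```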
This job has two tasks (ranks), and each task is allocated three cores, so the job
is allocated a total of 2*3 = 6 cores and 100MB of memory. Any memory allocated to
these tasks must come from the NUMA node closest to the core on which that task is
bound; this is the effect of the membind=localonly clause. If no memory is
available on this node, then the job will swap.
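The arithmetic behind this allocation can be sketched in shell. The function affinity_totals is a hypothetical helper, and the even split of the job's memory across tasks is an assumption taken from this particular example (two tasks sharing 100MB), not a general statement of how LSF divides rusage[] reservations.

```shell
#!/bin/sh
# Hypothetical helper: given the number of tasks (-n), the cores requested
# per task in affinity[core(N)], and the job memory in MB, print the total
# core count and the memory per task, assuming an even split across tasks.
affinity_totals() {
    tasks=$1
    cores_per_task=$2
    job_mem_mb=$3
    total_cores=$((tasks * cores_per_task))
    mem_per_task=$((job_mem_mb / tasks))
    echo "$total_cores cores total, $mem_per_task MB per task"
}

affinity_totals 2 3 100   # prints: 6 cores total, 50 MB per task
```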
Use bjobs -l -aff jobID to monitor this allocation:
$ bjobs -l -aff 105
Job <105>, User <rshen>, Project <default>, Status <RUN>, Queue <normal>, Comma
                     nd <sleep 9000>
Fri Sep 27 17:07:07: Submitted from host <bp860-04>, CWD <$HOME/LSF/proj/lsf/ut
                     opia/lsbatch/cmd>, 2 Processors Requested, Requested Resou
                     rces <affinity[core(3):membind=localonly] rusage[mem=100]>
                     ;
Fri Sep 27 17:07:08: Started on 2 Hosts/Processors <aff_part> <aff_part>;

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

CPU BINDING                    MEMORY BINDING
-----------------------        -------------------
TYPE  LEVEL  EXCL  IDS         POL    NUMA  SIZE
core               /0/0/0      local  0     50.0MB
                   /0/0/1
                   /0/0/2
core               /1/0/0      local  1     50.0MB
                   /1/0/1
                   /1/0/2
Here LSF has allocated the first task to the first NUMA node on aff_part, and the
second task to the second. The reason for this is that membind=localonly implicitly
requires that all the CPUs allocated to a given task access the same memory node. The
MEMORY BINDING subsection of the display shows us the ID of the NUMA node each task
is using, and the amount of memory allocated to each task on this node.
ut  pg  io  ls  it  tmp  swp  mem
-   -   -   -   -   -    -    -

CPU BINDING                    MEMORY BINDING
-----------------------        -------------------
TYPE  LEVEL  EXCL  IDS         POL  NUMA  SIZE
core         numa  /0/0/0      -
                   /1/0/0
This output shows that LSF displays the exclusivity level of the task in the EXCL
column, and that each core is allocated from a different NUMA node.
To verify the actual dispatched process binding, you must first get the process IDs
of the dispatched job using bjobs -l -aff after the job is running:
$ bjobs -l -aff 137
Job <137>, User <rshen>, Project <default>, Status <RUN>, Queue <normal>, Comma
                     nd <sleep 9000>
Mon Sep 30 14:02:24: Submitted from host <lsf-master>, CWD <$HOME>, Requested
                     Resources <affinity[core]>;
Mon Sep 30 14:02:25: Started on <aff_host>, Execution Home </home/rshen>, Execu
                     tion CWD </home/rshen>;
Mon Sep 30 14:02:41: Resource usage collected.
                     MEM: 6 Mbytes; SWAP: 0 Mbytes; NTHREAD: 4
                     PGID: 27645; PIDs: 27645 27646 27648

MEMORY USAGE:
MAX MEM: 6 Mbytes;

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -

CPU BINDING                    MEMORY BINDING
-----------------------        -------------------
TYPE  LEVEL  EXCL  IDS         POL  NUMA  SIZE
core               /0/0/0      -
Notice that the job was dispatched to aff_host and started three processes, one of which
is the actual sleep 9000 command. This process was allocated to NUMA 0, socket 0, and
core 0 on aff_host, and according to the bhosts output shown for this host, this
corresponds to physical CPUs 0 and 8.
You can verify the affinity binding for all of these processes by logging in to
aff_host and running taskset -pc pid to get the list of CPUs to which each process
has been bound:
$ taskset -pc 27645
pid 27645's current affinity list: 0,8
$ taskset -pc 27646
pid 27646's current affinity list: 0,8
$ taskset -pc 27648
pid 27648's current affinity list: 0,8
Note that each process is bound to the CPU IDs contained in the correct core.
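When a job has many processes, the same check can be scripted. The following is a minimal sketch: affinity_of_line and check_bindings are hypothetical helper names, and the sketch assumes each input line has the "current affinity list: ..." form printed by taskset -pc above.

```shell
#!/bin/sh
# Hypothetical helper: extract the CPU list from one line of
# `taskset -pc` output (everything after the last ": ").
affinity_of_line() {
    echo "${1##*: }"
}

# Hypothetical helper: read taskset output lines on stdin and report any
# line whose CPU list differs from the expected one. Returns non-zero if
# at least one mismatch was found.
check_bindings() {
    expected=$1
    rc=0
    while IFS= read -r line; do
        actual=$(affinity_of_line "$line")
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH: $line"
            rc=1
        fi
    done
    return $rc
}

# Demonstration using the output captured in the example above.
printf '%s\n' \
    "pid 27645's current affinity list: 0,8" \
    "pid 27646's current affinity list: 0,8" \
    "pid 27648's current affinity list: 0,8" |
    check_bindings "0,8" && echo "all processes bound to 0,8"
```

On the execution host, the printf could be replaced by a loop over the job's PIDs, for example: for pid in 27645 27646 27648; do taskset -pc "$pid"; done | check_bindings "0,8".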
The last step is required in order to generate the appropriate OpenMPI rank file
that enables it to bind each job task to its own allocation. When jobs are
submitted to this application using the -app option of bsub, LSF creates the
appropriate rank file and sets the variable LSB_RANK_HOSTFILE in the job execution
environment to its path. This file can then be passed to mpirun with the -rf option
to have the tasks bound correctly.
Finally, so that the LSB_RANK_HOSTFILE variable is properly escaped (expanded in
the job execution environment rather than at submission time), you should include
your mpirun command line inside a job script. The script can either be installed on
a shared file system or spooled as the standard input to bsub. Here is an example
using the latter approach:
$ cat /tmp/my_script
#!/bin/sh
mpirun -rf $LSB_RANK_HOSTFILE /share/bin/hello_c
$ bsub -I -n 4 -app openmpi -R "affinity[core]" < /tmp/my_script
Job <134> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on bp860-04>>
Hello, world, I am 1 of 4
Hello, world, I am 0 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4
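A job script can also guard against the rank file being absent, for example when the job is submitted without the -app option. The following is a minimal sketch under that assumption; rankfile_args is a hypothetical helper name, and only the LSB_RANK_HOSTFILE variable and the mpirun -rf option come from the text above.

```shell
#!/bin/sh
# Hypothetical helper: print the mpirun rank-file arguments, or fail with a
# message if LSF did not provide a readable rank file.
rankfile_args() {
    if [ -z "${LSB_RANK_HOSTFILE:-}" ]; then
        echo "LSB_RANK_HOSTFILE is not set; was the job submitted with -app?" >&2
        return 1
    fi
    if [ ! -r "$LSB_RANK_HOSTFILE" ]; then
        echo "cannot read rank file: $LSB_RANK_HOSTFILE" >&2
        return 1
    fi
    echo "-rf $LSB_RANK_HOSTFILE"
}

# In a job script, the arguments would be resolved before launching, e.g.:
#   args=$(rankfile_args) || exit 1
#   mpirun $args /share/bin/hello_c
# Outside of an LSF job the helper simply reports the missing variable:
rankfile_args || echo "not running inside an LSF job with a rank file"
```

Failing early keeps mpirun from starting unbound tasks when the rank file is missing.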
Each task in this job reserves two windows on its execution host (one window per
network), and is allocated a single core to which it will be bound by the OS. Use
the taskset command to verify this, as described in Example 4: Verifying the
bindings of running tasks.
Conclusion
This document describes the usage of the affinity scheduling feature in IBM Platform LSF
9.1.1, and how it integrates with OpenMPI, Platform MPI, and IBM Parallel Environment
Runtime Edition to bind individual tasks to CPUs and NUMA memory nodes.
Further reading
Contributors
Rong Song Shen
Software Developer: LSF
Sam Sanjabi
Senior Software Developer
Chong Chen
Principal Architect: LSF Product Family
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and services
currently available in your area. Any reference to an IBM product, program, or service is not
intended to state or imply that only that IBM product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe any IBM
intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in
this document. The furnishing of this document does not grant you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where
such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES
CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
Without limiting the above disclaimers, IBM provides no representations or warranties
regarding the accuracy, reliability or serviceability of any information or recommendations
provided in this publication, or with respect to any results that may be obtained by the use of
the information or observance of any recommendations provided herein. The information
contained in this document has not been submitted to any formal IBM test and is distributed
AS IS. The use of this information or the implementation of any recommendations or
techniques herein is a customer responsibility and depends on the customer's ability to
evaluate and integrate them into the customer's operational environment. While each item
may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee
that the same or similar results will be obtained elsewhere. Anyone attempting to adapt
these techniques to their own environment does so at their own risk.
This document and the information contained herein may be used solely in connection with
the IBM products discussed in this document.
This information could include technical inaccuracies or typographical errors. Changes are
periodically made to the information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or changes in the product(s)
and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only
and do not in any manner serve as an endorsement of those websites. The materials at those
websites are not part of the materials for this IBM product and use of those websites is at your
own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some
measurements may have been made on development-level systems and there is no
guarantee that these measurements will be the same on generally available systems.
Furthermore, some measurements may have been estimated through extrapolation. Actual
results may vary. Users of this document should verify the applicable data for their specific
environment.
Information concerning non-IBM products was obtained from the suppliers of those products,
their published announcements or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility or any other
claims related to non-IBM products. Questions on the capabilities of non-IBM products should
be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE: Copyright IBM Corporation 2013. All Rights Reserved.
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to IBM, for the purposes of
developing, using, marketing or distributing application programs conforming to the
application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions.
IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and
other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at "Copyright and trademark information" at
www.ibm.com/legal/copytrade.shtml.
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Contacting IBM
To provide feedback about this paper, contact [email protected].
To contact IBM in your country or region, check the IBM Directory of Worldwide
Contacts at http://www.ibm.com/planetwide