Exclusive process stride – EXCL_STRIDE
HPC compute nodes often have several GPUs per node. A common setup assigns a few MPI tasks (each on its own CPU core) to drive the GPUs, while the other MPI tasks run on the remaining CPU cores using main (DRAM) memory. This page describes how to set up this MPI task-to-core assignment.
Suppose a node has 42 CPU cores and 6 GPU devices, and the run needs 100 nodes. We use ATM as the component driving the GPUs in the examples below.
Process stride – PSTRID
To set up 6 MPI tasks per node for ATM and 42 MPI tasks per node for all other components:
./xmlchange MAX_TASKS_PER_NODE=42
./xmlchange MAX_MPITASKS_PER_NODE=42 # i.e. NTHRDS is 1
./xmlchange NTASKS=4200 # 42*100, i.e. all comps stacked on 100 nodes
./xmlchange PSTRID_ATM=7 # i.e. ATM procs stride 7 (= $NCPU/$NGPU = 42/6), giving 6 ATM procs per node
./xmlchange NTASKS_ATM=600 # 6*100, i.e. ATM is now at 6 procs per node
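The values in the commands above follow from the node description by simple arithmetic. The sketch below reproduces that arithmetic; the helper name `stride_settings` is hypothetical, and the requirement that the core count be an exact multiple of the GPU count is an assumption made here to keep the example simple, not a documented restriction.

```python
def stride_settings(cores_per_node, gpus_per_node, nodes):
    """Derive the xmlchange values for one GPU-driving component (ATM here).

    Hypothetical helper: assumes one MPI task per core (NTHRDS = 1) and
    that cores_per_node is an exact multiple of gpus_per_node.
    """
    if cores_per_node % gpus_per_node != 0:
        raise ValueError("cores_per_node must be a multiple of gpus_per_node")
    stride = cores_per_node // gpus_per_node  # $NCPU / $NGPU
    return {
        "MAX_TASKS_PER_NODE": cores_per_node,
        "MAX_MPITASKS_PER_NODE": cores_per_node,
        "NTASKS": cores_per_node * nodes,     # all components stacked
        "PSTRID_ATM": stride,                 # one ATM task every `stride` ranks
        "NTASKS_ATM": gpus_per_node * nodes,  # one ATM task per GPU
    }


settings = stride_settings(cores_per_node=42, gpus_per_node=6, nodes=100)
for name, value in settings.items():
    print(f"./xmlchange {name}={value}")
```

For the example node this gives a stride of 7, 4200 tasks in total, and 600 ATM tasks.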
Process striding helps to:
set 1 MPI process per GPU;
set MPI tasks close to GPUs when some CPU cores are closer to GPUs in a node’s NUMA context (ROOTPE_ATM, default 0, may need adjusting if core 0 is not the nearest to a GPU);
keep all CPU cores busy with other components.
With plain process striding, the CPU cores used by ATM are still shared with non-ATM components. Sharing can lead to out-of-memory errors or reduce ATM’s performance.
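To see which cores ATM lands on, it can help to enumerate its ranks. The sketch below is illustrative only: it assumes a component's global ranks are `ROOTPE + i * PSTRID`, that ranks fill nodes contiguously, and that rank `r` runs on core `r % 42` of node `r // 42`. The actual rank-to-core binding depends on the machine's job launcher, and the function names are hypothetical.

```python
def component_ranks(ntasks, rootpe=0, pstrid=1):
    """Global MPI ranks of a component: rootpe, rootpe + pstrid, ..."""
    return [rootpe + i * pstrid for i in range(ntasks)]


def cores_on_node(ranks, node, cores_per_node):
    """Core indices used on one node, assuming contiguous rank-to-node fill."""
    return [r % cores_per_node for r in ranks if r // cores_per_node == node]


# Default ROOTPE_ATM=0: ATM sits on cores 0, 7, 14, ... of every node.
atm = component_ranks(ntasks=600, rootpe=0, pstrid=7)
print(cores_on_node(atm, node=0, cores_per_node=42))

# Shifting ROOTPE_ATM moves every ATM task by the same offset,
# e.g. if core 3 (not core 0) is the one nearest to a GPU.
atm_shifted = component_ranks(ntasks=600, rootpe=3, pstrid=7)
print(cores_on_node(atm_shifted, node=0, cores_per_node=42))
```

Under these assumptions the first call lists cores 0, 7, 14, 21, 28, 35 and the second lists cores 3, 10, 17, 24, 31, 38. The same pattern repeats on every node because 42 is a multiple of the stride.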
Exclusive process stride – EXCL_STRIDE
In addition to the five xmlchange commands above, set:
./xmlchange EXCL_STRIDE_ATM=7 # i.e. only ATM runs on stride-7 procs, other components on the remaining procs
Exclusive striding keeps non-ATM components off the cores assigned to ATM, so ATM and the other components never execute on the same core. This minimizes core-sharing side effects.
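The resulting partition can be sketched as follows. This is a model of the behavior described above (ATM alone on the stride-7 ranks, everything else on the remaining ranks), not the coupler's actual implementation; the function name `exclusive_split` and the contiguous rank-to-node mapping are assumptions.

```python
def exclusive_split(total_tasks, ntasks_atm, stride, rootpe=0):
    """Split global ranks into ATM-only ranks and ranks left for other components."""
    atm = {rootpe + i * stride for i in range(ntasks_atm)}
    others = [r for r in range(total_tasks) if r not in atm]
    return sorted(atm), others


atm_ranks, other_ranks = exclusive_split(total_tasks=4200, ntasks_atm=600, stride=7)

# Ranks on the first node (assuming 42 contiguous ranks per node).
atm_node0 = [r for r in atm_ranks if r < 42]
others_node0 = [r for r in other_ranks if r < 42]
print(len(atm_node0), "ATM ranks and", len(others_node0), "other ranks on node 0")
```

In this model each node carries 6 ATM-only ranks and 36 ranks for the other components, and the two sets never overlap.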
How to verify component-to-task placement
./xmlchange INFO_MPROF=2 # to log memory usage profiling from every MPI process
zgrep "comps" run/e3sm.log*.gz # and look for "(pe= $task_id comps= cpl ICE GLC WAV IAC ESP)"
The resulting log lines list the components executing on each MPI task.
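For large runs, scanning the log by eye is tedious, so a small script can collect the placement. The sketch below assumes each relevant log line contains text of the form `(pe= <task_id> comps= <names>)`, as in the pattern quoted above; the exact spacing and the spelling of component names in real logs may differ. The sample lines are hypothetical and exist only to exercise the parser.

```python
import gzip
import re

# Assumed line format, taken from the zgrep hint: "(pe= 12 comps= cpl ICE GLC)"
PE_COMPS = re.compile(r"\(pe=\s*(\d+)\s+comps=\s*([^)]*)\)")


def parse_placement(lines):
    """Map MPI task id -> list of component names found in the log lines."""
    placement = {}
    for line in lines:
        match = PE_COMPS.search(line)
        if match:
            placement[int(match.group(1))] = match.group(2).split()
    return placement


def read_gzipped_log(path):
    """Parse one gzipped log file, e.g. a file matching run/e3sm.log*.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_placement(handle)


def shared_ranks(placement, name="ATM"):
    """Task ids where `name` runs together with at least one other component."""
    return sorted(pe for pe, comps in placement.items()
                  if name in comps and len(comps) > 1)


# Hypothetical sample lines in the assumed format.
sample = [
    "(pe= 0 comps= ATM)",
    "(pe= 1 comps= cpl ICE GLC WAV IAC ESP)",
    "unrelated log text",
]
placement = parse_placement(sample)
print(placement)
print("ranks sharing ATM with other components:", shared_ranks(placement))
```

With exclusive striding in effect, `shared_ranks` would be expected to return an empty list; a non-empty result points at ranks worth inspecting.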
Error checking for these settings is minimal, so take care when setting up striding and exclusive striding.
Additional details are in the following PR: https://github.com/E3SM-Project/E3SM/pull/4859