
Running the ACME model requires a specification of which cores are assigned to which model components, called the PE layout. Currently only a few people know how to create one, and the process is undocumented. This is a major bottleneck that makes running on a new machine, or coming up with an efficient layout for a new compset, slow. The goal of this page is to provide the information needed for anyone on the project to create their own PE layouts (or at least to recognize when a layout is bad).

Before diving into the tedium of creating your own PE layout, check this page to see whether someone has already done the work for you: /wiki/spaces/PERF/pages/59736197

Background Reading:

Here is a list of webpages describing PE layouts. Please add more pages if you know of any I've missed:

  • http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/x2574.html
  • http://www.cesm.ucar.edu/models/cesm1.0/cpl7/cpl7_doc/x29.html#design_seq
  • http://www.cesm.ucar.edu/models/cesm1.1/cesm/doc/usersguide/x730.html
  • http://www.cesm.ucar.edu/models/cesm1.3/cesm/doc/usersguide/x1516.html

Background:

ACME and CESM were designed to provide bit-for-bit identical answers regardless of PE layout (though this doesn't happen in practice unless certain debug options are used - see comment 3 by Patrick Worley (Unlicensed) below). This means that the information used by e.g. the ocean model needs to be available at the time it runs, regardless of whether the ocean runs in parallel with the other components or serially after all the other components have finished. This necessitates passing one component information which is lagged relative to the current state of the other components. It also imposes some limitations on which PE layouts are acceptable (see comments 4 and 5 by Patrick Worley (Unlicensed) below):

  1. ATM must run serially with LND, ROF, and ICE
  2. everything can (kind of) run concurrently with OCN
  3. LND and ICE can run concurrently
  4. ROF must run serially with LND
  5. the coupler is complicated. Patrick Worley (Unlicensed) says "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components need to call the coupler, so it inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."

The PE layout for a run is encapsulated in the env_mach_pes.xml file, which is created by the create_newcase command when the case is created. This page focuses on changing this .xml file for a particular run; changing the default behavior for all runs on a particular machine requires digging around in $ACME_code/cime/machines-acme/, which is outside the scope of this page.

env_mach_pes.xml consists of entries like:

<entry id="NTASKS_ATM"   value="5400"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   
<entry id="NINST_ATM"   value="1"  />   
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />    

where in addition to the ATM values shown there are values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component. NTHRDS is the number of OpenMP threads per task. ROOTPE is the index of the first core assigned to that component (counting from 0). Thus if ATM takes the first 10 cores and OCN takes the next 5 cores, you would have NTASKS_ATM=10, ROOTPE_ATM=0, NTASKS_OCN=5, ROOTPE_OCN=10. The model automatically runs components on disjoint sets of cores in parallel and components sharing cores in sequence (with the caveat that running some components concurrently is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). NINST entries relate to running multiple copies of a particular component at once; we aren't interested in this, so leave these quantities at the default values shown above. In short, coming up with a new PE layout is equivalent to choosing NTASKS, NTHRDS, and ROOTPE for each component.
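To make the NTASKS/ROOTPE bookkeeping concrete, here is a minimal sketch (plain Python, not part of ACME; the component sizes are the hypothetical 10-core ATM / 5-core OCN example above) that maps each component onto its core range and reports which pairs share cores and therefore run sequentially:

# Minimal sketch (not an ACME tool) of how NTASKS/ROOTPE define core ranges.
# Numbers match the hypothetical example above: ATM on cores 0-9, OCN on cores 10-14.
layout = {
    "ATM": {"ntasks": 10, "rootpe": 0},
    "OCN": {"ntasks": 5,  "rootpe": 10},
}

def cores(c):
    """Set of core indices used by a component."""
    return set(range(c["rootpe"], c["rootpe"] + c["ntasks"]))

names = list(layout)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        shared = cores(layout[a]) & cores(layout[b])
        mode = "sequentially (shared cores)" if shared else "concurrently (disjoint cores)"
        print(a, "and", b, "run", mode)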

Rules for choosing a PE layout:

  1. Choose a total number of tasks that is evenly divisible by the number of cores/node for your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores/node for your machine. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. Components can either all be run sequentially, or in a hybrid of sequential and concurrent execution (see the Examples section below).
    1. completely sequential execution is good:
      1. for single-component (e.g. CORE-forced ocean, AMIP atmos) compsets, since only the active component will consume appreciable time

      2. because it is by definition load balanced between components (each component only takes as long as it needs), so the need for performance tuning is eliminated.
    2. concurrent execution is good because:
      1. it allows the model to scale to larger core counts and may improve throughput
  4. For concurrent execution:
    1. ATM must run serially with LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND and ICE can run concurrently
    4. ROF must run serially with LND
    5. coupler rules are complicated. Patrick Worley (Unlicensed) says "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components need to call the coupler, so it inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."
    6. Comment from Patrick Worley (Unlicensed), taken from below: Also, sometimes components do not play well with each other. For example, the atmosphere, at scale, will want to use OpenMP threads. In contrast, CICE has no threading support at the moment, and if you overlap CICE with ATM when ATM is using threads, this spreads out the CICE processes across more nodes, increasing MPI communication overhead. Memory requirements for ATM and CICE can also make them infeasible to run together. In these (somewhat extreme) cases, running CICE concurrent with ATM and OCN may be justified. Parallel efficiency will be poor, but that is not the target in these cases.
  5. For the atmosphere:
    1. Choose NTASKS_ATM so it evenly divides the number of spectral elements in your atmos grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task, or to use more tasks than there are elements; doing so will speed up the physics but not the dynamics, so it is less efficient. (See the sketch after this list for a quick way to enumerate good task counts.)
    2. For linux clusters and low node counts (fewer than 1000 nodes) it is typically best to use NTHRDS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.
    3. When using threads, there are several additional considerations. The number of MPI tasks per node times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = # physics columns. The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are fine; they just will not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and the most efficient core counts for ACME atm resolutions:
    atm res | # elements | # physics columns | optimal core counts
    ne30    | 5400       | 48602             | 5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
    ne120   | 86400      | 777602            | 86400, 43200, 28800, 21600, ...

  6. The MPAS components work well at any core count but require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but may not exist yet if nobody has used a given core count before. It is trivial to generate these files, though. On Edison, you can type

module load metis/5.1.0
gpmetis <graph_file>   <# pes>

where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and # pes is the number of cores you want to use for that component.
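Putting rules 1, 2, and 5 together for the atmosphere, here is a small sketch (plain Python, not an ACME tool; the 24 cores/node value is Edison's and the grid is ne30) that lists candidate NTASKS_ATM values which both divide nelem and fill whole nodes:

# Sketch (not an ACME tool): candidate NTASKS_ATM values for an ne30 grid on a
# machine with 24 cores/node (Edison).  A good task count divides nelem evenly
# (rule 5.1) and, ideally, also fills whole nodes (rules 1 and 2).
nelem = 5400                  # spectral elements for ne30 (6 * 30**2)
cores_per_node = 24           # machine dependent

phys_columns = 9 * nelem + 2  # physics columns on a cubed-sphere SE grid
print("physics columns:", phys_columns)          # 48602 for ne30

divisors = sorted((n for n in range(1, nelem + 1) if nelem % n == 0), reverse=True)
node_aligned = [n for n in divisors if n % cores_per_node == 0]
print("task counts that divide nelem:", divisors[:13])
print("...and that also fill whole nodes:", node_aligned)
# With threading, remember that NTASKS_ATM*NTHRDS_ATM can usefully be as large as
# the number of physics columns for the physics, but only up to nelem for the dynamics.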

About Parallel I/O (PIO) Layout:

Input/output (I/O) can also have a big impact on model performance. To avoid conflating I/O performance with model performance, the usual steps for optimizing the PE layout are:

  1. Choose PIO settings that are good enough to avoid having the model crash while reading input files.
    1. Often the default settings will not work, especially at very large or very small processor counts.
  2. Disable history output and restart files for all components, then use short (5-10 day) simulations to determine an optimal processor layout.
    1. Without disabling IO, short simulations will be dominated by I/O costs and will give misleading performance numbers.
      1. Note that input and initialization times are not included in SYPD computations.
    2. To disable restart writes, set <entry id="REST_OPTION"   value="never"  /> in env_run.xml .
    3. For ATM, you can disable output by simply setting nhtfrq in user_nl_cam to an interval longer than the length of the model run.
    4. By default, MPAS always writes output during the first timestep. You can disable this by setting config_write_output_on_startup = .false. in user_nl_mpas-o.
  3. Once you have a reasonable processor layout without I/O, use 1 month simulations to test PIO settings.  

To understand the impact of I/O on model performance, start by looking at the time taken by the PIO functions (grep for pio*) in the timing subdirectory of the job run directory. Bad PIO settings can cause the model to crash or give horribly slow output performance, but make sure you have a reasonable processor layout before calibrating the PIO settings.

(At NERSC, ALCF, and OLCF, the timing and other performance-related data are also archived in a performance_archive for each system. The location is system specific. Currently:

  • NERSC: /project/projectdirs/(project id)/performance_archive
  • ALCF: /projects/(project id)/performance_archive
  • OLCF: /lustre/atlas/proj-shared/(project id)/performance_archive

)

Several aspects of the model's I/O performance can be controlled at runtime by setting the PIO* parameters in env_run.xml. Start by tuning the number of PIO I/O tasks (processes that perform I/O) and the stride between them by setting PIO_NUMTASKS and PIO_STRIDE, respectively, in env_run.xml. By default PIO_STRIDE is set to 4 and PIO_NUMTASKS is set to (total number of tasks)/PIO_STRIDE, which means that as you increase the number of tasks, more tasks end up trying to write to the filesystem.

You can get 'out of memory' errors in PIO if the number of I/O tasks is too small. But if the number of I/O tasks is too large, you can also get 'out of memory' errors in the MPI library itself because there are too many messages. At the same time, I/O performance is better for large data sets with more I/O tasks.

Users typically have to start with the default configuration and try a couple of PIO_NUMTASKS and PIO_STRIDE options to find an optimal PIO layout for a model configuration on a particular machine. On large machines we have empirically found that the number of PIO I/O tasks (PIO_NUMTASKS) needs to be constrained to less than 128 to prevent out-of-memory errors, but this is also system specific. The other rule of thumb is to not assign more than one PIO I/O task to each compute node, to minimize MPI overhead and to improve parallel I/O performance. This will occur naturally if PIO_NUMTASKS is sufficiently smaller than the total number of tasks.
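To see how these defaults and rules of thumb interact, here is a rough sketch (plain Python, illustrative numbers only; the actual defaults are set per machine in env_run.xml):

# Sketch of the PIO task arithmetic described above (not an ACME tool).
total_tasks = 9000            # total MPI tasks in the run (example)
cores_per_node = 24           # machine dependent (Edison)
pio_stride = 4                # default stride between I/O tasks
pio_numtasks = total_tasks // pio_stride    # default: one I/O task per PIO_STRIDE tasks

nodes = total_tasks // cores_per_node
print("default PIO_NUMTASKS:", pio_numtasks)        # 2250 - far above the ~128 suggested above
print("I/O tasks per node:", pio_numtasks / nodes)  # 6.0 - more than one per node

# Rule of thumb from the text: cap the number of I/O tasks (e.g. at 128) and spread
# them out with a larger stride so there is at most one per node.
pio_numtasks = 128
pio_stride = total_tasks // pio_numtasks            # 70 ranks between I/O tasks
print("suggested PIO_NUMTASKS:", pio_numtasks, "with PIO_STRIDE:", pio_stride)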

Examples:

For small core counts, it is usually most efficient to run all components sequentially, with every component using all available cores (though if one component scales poorly, it may make sense to run it on fewer cores while the remaining cores sit idle). Here is an example of a serial PE layout:


<entry id="NTASKS_ATM"   value="128"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   


<entry id="NTASKS_LND"   value="128"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="0"  />   


<entry id="NTASKS_ICE"   value="128"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />   


<entry id="NTASKS_OCN"   value="128"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="0"  />   


<entry id="NTASKS_CPL"   value="128"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="128"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="0"  />   


<entry id="NTASKS_ROF"   value="128"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="0"  />   


<entry id="NTASKS_WAV"   value="128"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />   


This layout can be expressed diagrammatically as below:



Fig: Serial PE layout from an earlier version of CESM which lacks WAV, GLC, and ROF. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg


 


Alternatively, some components can be run in parallel. Here is an example which uses 9000 cores (375 nodes) on Edison:


<entry id="NTASKS_ATM"   value="5400"  />  
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />    


<entry id="NTASKS_LND"   value="600"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="4800"  />    


<entry id="NTASKS_ICE"   value="4800"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />    


<entry id="NTASKS_OCN"   value="3600"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="5400"  />    


<entry id="NTASKS_CPL"   value="4800"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="600"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="4800"  />    


<entry id="NTASKS_ROF"   value="600"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="4800"  />    


<entry id="NTASKS_WAV"   value="4800"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />    


Here is a diagram showing how this setup translates into a processor map (x-axis) as it evolves over time within a timestep (y-axis):



Fig: processor layout when several components are run in parallel. Example uses 375 nodes on Edison.
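As a quick sanity check on the Edison example above, here is a small sketch (plain Python, not an ACME tool) confirming that the highest core index used matches the 9000 cores / 375 nodes requested and that the total fills whole 24-core nodes:

# Sanity check of the concurrent Edison example above (not an ACME tool).
layout = {  # component: (NTASKS, ROOTPE)
    "ATM": (5400, 0),    "LND": (600, 4800), "ICE": (4800, 0),   "OCN": (3600, 5400),
    "CPL": (4800, 0),    "GLC": (600, 4800), "ROF": (600, 4800), "WAV": (4800, 0),
}
cores_per_node = 24  # Edison

total_cores = max(rootpe + ntasks for ntasks, rootpe in layout.values())
print("total cores:", total_cores)                              # 9000
print("nodes:", total_cores // cores_per_node)                  # 375
print("fills whole nodes:", total_cores % cores_per_node == 0)  # True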

Testing PE Layout Timing:

It is hard to tell a priori which layout will be optimal. Luckily you don't need to. Try a couple of layouts and compare their timings (in simulated years per day, or SYPD) to figure out which is fastest. Keep in mind that timings can vary (a lot!) from run to run, so you may need to run for a while or do multiple runs with each layout to get accurate numbers. CESM recommends 20-day runs without saving any output or writing any restarts (because I/O complicates timings and is best handled as a separate optimization exercise). See the PIO section above for more discussion of run lengths and experimentation strategy.
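For reference, SYPD is simply the ratio of simulated time to wall-clock time. A minimal sketch of the arithmetic (the run length and elapsed time here are made-up numbers):

# SYPD (simulated years per wall-clock day) arithmetic, with hypothetical numbers.
simulated_days = 20          # length of the test run in model days
wallclock_seconds = 600.0    # elapsed wall-clock time for that run

sypd = (simulated_days / 365.0) / (wallclock_seconds / 86400.0)
print("SYPD =", round(sypd, 2))   # about 7.89 simulated years per wall-clock day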

Timing information is available in a variety of places. The log files run/cpl.log.* provide basic info like SYPD and elapsed time. Information broken down by the time spent in each component is available in case_scripts/timing/cesm_timing.<run_name>.<run_id>. This granular information is useful for load balancing cases where multiple components run concurrently: you want to increase or decrease the number of cores devoted to each component so that concurrent components take the same amount of time. This prevents components from sitting idle while they wait for their companions to finish. In the Edison example above, it would be ideal for ICE and LND to take the same amount of time, CPL and ROF to take the same amount of time, and WAV and GLC to take the same amount of time. Additionally, we want ATM+ICE+CPL+WAV to take the same amount of time as OCN. In reality, LND, ROF, and GLC are probably much quicker than ATM, ICE, and CPL, but since the number of cores they use is negligible this imbalance is tolerable.
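To illustrate why that imbalance is tolerable, here is a sketch with hypothetical per-component timings (plain Python; the timings are made up, only the core counts come from the Edison example):

# Illustration of the load-balancing trade-off described above (hypothetical timings).
# In the Edison example, ICE (4800 cores) and LND (600 cores) run concurrently
# before ATM, so ATM cannot start until the slower of the two finishes.
ice_time = 30.0    # seconds per model day spent in ICE (hypothetical)
lnd_time = 5.0     # seconds per model day spent in LND (hypothetical)
lnd_cores = 600

idle = ice_time - lnd_time
print("LND cores idle for", idle, "s per model day while waiting for ICE")
print("core-seconds wasted:", idle * lnd_cores, "- small, because LND uses few cores")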

 
