How to Create a New PE Layout

Running the ACME model requires specifying which processor cores are assigned to which model components; this assignment is called the PE layout. Currently only a few people know how to construct one and there is no documentation of the process. This is a huge bottleneck which makes running on a new machine or coming up with an efficient layout for a new compset slow. The goal of this page is to provide the information needed for anyone on the project to create their own PE layout (or at least to recognize when their layout is bad).

Before diving into the joy of creating your own PE layout, check this page to see whether someone has already gone to the effort for you:

Official ACME Performance Group /wiki/spaces/PERF/pages/114491519.  This is a new (12/2016) effort to collect optimal layouts.  Until it is complete, also check the many older/archived layout pages linked at the bottom of that page.  



Background Reading:

Here is a list of webpages describing PE layouts. Please add more pages if you know of any I've missed:

http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/x2574.html
http://www.cesm.ucar.edu/models/cesm1.0/cpl7/cpl7_doc/x29.html#design_seq
http://www.cesm.ucar.edu/models/cesm1.1/cesm/doc/usersguide/x730.html
http://www.cesm.ucar.edu/models/cesm1.3/cesm/doc/usersguide/x1516.html

Background:

ACME and CESM were designed to provide bit-for-bit identical answers regardless of PE layout (though this doesn't happen in practice unless certain debug options are used - see comment 3 by Patrick Worley (Unlicensed) below). This means that the information used by e.g. the ocean model needs to be available at the time it runs, regardless of whether the ocean is running in parallel with the other components or running serially after all the other components are finished. This necessitates passing one component information which is lagged relative to the current state of the other components. It also imposes some limitations on which PE layouts are acceptable (see comments 4 and 5 by Patrick Worley (Unlicensed) below):

  1. ATM must run serially with LND, ROF, and ICE
  2. everything can (kind of) run concurrently with OCN
  3. LND and ICE can run concurrently
  4. ROF must run serially with LND
  5. coupler is complicated. Patrick Worley (Unlicensed) says "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components need to call the coupler, so inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."

The PE layout for a run is encapsulated in the env_mach_pes.xml file, which is created by the create_newcase command when the case is created. This page focuses on changing this .xml file for a particular run; changing the default behavior for all runs on a particular machine requires muddling around with the files under $ACME_code/cime/machines-acme/, which is outside the scope of this page.

env_mach_pes.xml consists of entries like:

<entry id="NTASKS_ATM"   value="5400"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   
<entry id="NINST_ATM"   value="1"  />   
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />    

where in addition to the ATM values shown there are values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component. NTHRDS is the number of OpenMP threads per task. ROOTPE is the first MPI task number to use for this component: ROOTPE=0 means the component starts on the first block of cores. If you want to run another component in parallel, set the second component's ROOTPE equal to the first component's NTASKS. Note that NTHRDS does not affect ROOTPE values - if you want ICE and LND to run in parallel and ICE takes up 1200 MPI tasks with 4 threads starting at ROOTPE=0, ROOTPE for LND should be 1200, and this number would not change if ICE used only 1 thread. The model will automatically run all components on disjoint sets of MPI tasks in parallel and all components sharing tasks in sequence (with the caveat that parallelizing some combinations is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). The NINST entries are for running multiple copies (instances) of a particular component at once; we aren't interested in this, so leave those quantities at their default values shown above. In short, coming up with a new PE layout amounts to choosing NTASKS, NTHRDS, and ROOTPE for each component.
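
For example, here is a minimal env_mach_pes.xml sketch of the ICE/LND case just described (the LND task count of 600 is made up purely for illustration). Because the two components occupy disjoint MPI task ranges, they will run concurrently:

<entry id="NTASKS_ICE"   value="1200"  />
<entry id="NTHRDS_ICE"   value="4"  />
<entry id="ROOTPE_ICE"   value="0"  />

<entry id="NTASKS_LND"   value="600"  />
<entry id="NTHRDS_LND"   value="1"  />
<entry id="ROOTPE_LND"   value="1200"  />   <!-- equal to NTASKS_ICE, regardless of NTHRDS_ICE -->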

Rules for choosing a PE layout:

  1. Choose a total number of cores (MPI tasks × threads) that is evenly divisible by the number of cores/node for your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores/node for your machine. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. Components can either all be run sequentially or in a hybrid of sequential and concurrent execution (see the Examples section below).
    1. completely sequential execution is good:
      1. for single-component (e.g. CORE-forced ocean, AMIP atmosphere) compsets, since only the active component will consume appreciable time

      2. because it is by definition load balanced between components (each component only takes as long as it needs), which eliminates the need for performance tuning.
    2. concurrent execution is good because:
      1. it allows the model to scale to larger core counts and may improve throughput
  4. For concurrent execution:
    1. ATM must run serially with LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND, ICE, and ROF can run concurrently
    4. coupler rules are complicated. Patrick Worley (Unlicensed) says "I generally like the coupler to share processes with just one of the land and sea ice components, but this is very fuzzy. CPL needs enough processes to do its job, but if it shares processes with all of the other components, then it is not available when a subset of components need to call the coupler, so inserts unnecessary serialization. CPL can also run on processes that are not assigned to any of the other components - idle much of the time and causing more communication overhead, but always available."
    5. Comment from Patrick Worley (Unlicensed), taken from below: Also, sometimes components do not play well with each other. For example, the atmosphere, at scale, will want to use OpenMP threads. In contrast, CICE has no threading support at the moment, and if you overlap CICE with ATM when ATM is using threads, this spreads out the CICE processes across more nodes, increasing MPI communication overhead. Memory requirements for ATM and CICE can also make them infeasible to run together. In these (somewhat extreme) cases, running CICE concurrent with ATM and OCN may be justified. Parallel efficiency will be poor, but that is not the target in these cases.
  5. For the atmosphere:
    1. Choose NTASKS_ATM so it evenly divides the number of spectral elements in your atmosphere grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server.  The number of physics columns is 9*nelem+2.  It is possible to use uneven numbers of elements per MPI task or to use more tasks than there are elements; doing so will speed up the physics but will not speed up the dynamics, so it is less efficient.
    2. For Linux clusters and low node counts (less than 1000 nodes) it is typically best to use NTHRDS_ATM=1.   On Titan, Mira, and KNL systems, threads should be used.  On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.  As noted on http://www.nersc.gov/users/computational-systems/edison/running-jobs/example-batch-scripts/, hyperthreading is turned on by submitting a job asking for twice as many cores as the machine actually has... you don't need to make any other changes.
      1. More specifically, nersc says hyperthreading is invoked by asking for 2x as many cores as fit onto the requested nodes as part of your 'srun' command. Since the actual job submission is hidden under layers of CIME stuff, it is unclear how we should change our env_mach_pes.xml file, for example. I (Peter) have found that setting MAX_TASKS_PER_NODE to 48 while leaving PES_PER_NODE at 24 seems to do the trick.
    3. When using threads, there are several additional considerations.   The number of MPI tasks times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = # physics columns.  The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are OK; they just will not improve dynamics performance).   The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and the most efficient core counts for ACME atm resolutions:
    atm res | # elements | # physics columns | optimal core counts
    ne30    | 5400       | 48602             | 5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
    ne120   | 86400      | 777602            | 86400, 43200, 28800, 21600, ...

  6. The MPAS components work well at any core count but require graph partition files of the form mpas-cice.graph.info.<date>.part.<ICE_CORE_COUNT> and mpas-o.graph.info.<date>.part.<OCN_CORE_COUNT>. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but may not exist yet if nobody has used those core counts before. It is trivial to generate these files yourself, though. On Edison, you can type

module load metis/5.1.0
gpmetis <graph_file>   <# pes>

where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and # pes is the number of cores you want to use for that component.
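
For example, to generate an ocean partition for a hypothetical 3600-task OCN component (the core count here is chosen purely for illustration), the full sequence on Edison would be:

module load metis/5.1.0
gpmetis mpas-o.graph.info.151209 3600

gpmetis writes its result to a new file named after the input with a .part.<# pes> suffix, here mpas-o.graph.info.151209.part.3600, which is the partition file needed to run the ocean on 3600 tasks.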

A Note about Layouts for Single-Component Simulations

The above discussion is mostly relevant to coupled runs involving all model components (aka "b" or "A_WCYCL" compsets). For simulations where most components are just reading in data (e.g. atmosphere "F" compsets or ocean/ice "G" compsets) it doesn't make sense to assign lots of cores to those inactive components. A natural choice in these cases is to run all components sequentially, with every component asking for the same number of cores. This layout naturally avoids the load-balancing issues which arise when some components do all the work and others do nothing. A complication with this approach is that some components may not scale to the core count used for the main component and may actually slow down when given too many cores.

Here's an example of how an "F" compset could be load balanced:

  1. To start, use the same decomposition for all components
  2. Following the guidelines above for choosing NTASKS_ATM and threads, find the best decomposition based on ATM run time (ignore the other components initially)
  3. Compare ATM run time vs. TOT run time when there are few MPI tasks (lots of work per node).  ATM should be most of the cost.  Let's say ATM is 90% of the cost.
  4. As you increase the number of nodes to speed up the atmosphere, monitor the ATM/TOT cost ratio.  If it starts to decrease, see if you can determine whether one of the other components is starting to slow down because it is using too many cores.  TOT performance could also degrade if the cost to couple becomes large, but this cost is very hard to isolate and interpret.  Addressing this issue, if it is needed, requires experience and trial and error.   Some strategies to consider from Patrick Worley (Unlicensed):  
    1. Put all of the data models on their own nodes and leave these fixed (don't scale them) as you scale up ATM, CPL, and LND. This has been used especially when memory is limited. (See the sketch after this list.)
    2. ROF is a special case - it does not scale well, so you often don't want to scale it up as fast. If LND is cheap, you can stop scaling it early as well. And CPL may run out of scaling at some point too.
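
To illustrate strategy 1, here is a hypothetical env_mach_pes.xml fragment for an "F" compset on a machine with 24 cores/node (the task counts are invented for illustration): the data ocean is pinned to a single node and left alone, while ATM and the other active components share the remaining 100 nodes and are the ones you scale up or down.

<entry id="NTASKS_ATM"   value="2400"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />

<!-- in this sketch LND, ROF, ICE, and CPL use the same NTASKS and ROOTPE as ATM -->

<entry id="NTASKS_OCN"   value="24"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="2400"  />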


About Parallel I/O (PIO) Layout:

Input/output (IO) can also have a big impact on model performance.  To avoid convolving I/O performance with model performance, the usual steps for optimizing PE layout are:

  1. Choose PIO settings that are good enough to avoid having the model crash while reading its input files.
    1. Often the default settings will not work, especially at very large or very small processor counts.  
  2. Disable history output and restart files for all components, then use short (5-10 day) simulations to determine an optimal processor layout.  
    1. Without disabling IO, short simulations will be dominated by I/O costs and will give misleading performance numbers.
      1. Note that input and initialization times are not included in SYPD computations.
    2. To disable restart writes, set <entry id="REST_OPTION"   value="never"  /> in env_run.xml.
    3. For ATM, you can disable output by setting nhtfrq in user_nl_cam to an interval longer than the length of the model run.
    4. By default, MPAS always writes output during the first timestep. You can disable this by setting config_write_output_on_startup = .false. in user_nl_mpas-o (see the combined example after this list).
  3. Once you have a reasonable processor layout without I/O, use 1-month simulations to test PIO settings.  
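
As a concrete sketch of steps 2.2-2.4 combined (the nhtfrq value below is just one hypothetical choice; negative values are interpreted as hours, so -8760 writes history only once per simulated year, which is safely longer than any short test run):

In env_run.xml:
<entry id="REST_OPTION"   value="never"  />

In user_nl_cam:
 nhtfrq = -8760

In user_nl_mpas-o:
 config_write_output_on_startup = .false.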

To understand the impact of I/O on model performance, start by taking a look at the time taken by the PIO functions (grep for pio*) in the timing subdirectory of the job run directory. Bad PIO settings can cause the model to crash or can lead to horribly slow output performance, but make sure you have a reasonable processor layout before calibrating the PIO settings.  

(At NERSC, ALCF, and OLCF, the timing and other performance-related data are also archived in a performance_archive for each system. The location is system specific. Currently:

  • NERSC: /project/projectdirs/(project id)/performance_archive
  • ALCF: /projects/(project id)/performance_archive
  • OLCF: /lustre/atlas/proj-shared/(project id)/performance_archive

)

Several aspects of the model I/O performance can be controlled at runtime by setting the PIO* parameters in env_run.xml. The model developer can start by tuning the number of PIO I/O tasks (processes that perform I/O) and the stride between the I/O tasks by setting PIO_NUMTASKS and PIO_STRIDE, respectively, in env_run.xml. By default, PIO_STRIDE is set to 4 and PIO_NUMTASKS is set to (total number of tasks)/PIO_STRIDE, which means that as you increase the total number of tasks, more tasks try to write to the filesystem.

You can get 'out of memory' errors in PIO if the number of I/O tasks is too small. But if the number of I/O tasks is too large, you can also get 'out of memory' errors in the MPI library itself because there are too many messages. At the same time, I/O performance for large data sets is better with more I/O tasks.

Users typically have to start with the default configuration and try a couple of PIO_NUMTASKS and PIO_STRIDE options to get an optimal PIO layout for a model configuration on a particular machine. On large machines, we have found empirically that we need to constrain the number of PIO I/O tasks (PIO_NUMTASKS) to less than 128 to prevent out-of-memory errors, but this is also system specific. The other rule of thumb is to not assign more than one PIO I/O task to each compute node, to minimize MPI overhead and to improve parallel I/O performance. This will occur naturally if PIO_NUMTASKS is sufficiently smaller than the total number of tasks. Mark Taylor says "In my view, the fewer PIO tasks the better (more robust & faster) but with too few you can run out of memory".
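
For example, a hypothetical starting point for a 4800-task job on a machine with 24 cores per node (values chosen only to illustrate the rules of thumb above): a stride of 96 spaces the I/O tasks four nodes apart, so no node hosts more than one, and 4800/96 = 50 I/O tasks keeps the count comfortably below 128.

<entry id="PIO_STRIDE"   value="96"  />
<entry id="PIO_NUMTASKS"   value="50"  />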

Examples:

For small core counts, it's usually most efficient to run all the components sequentially, with each component using all available cores (though if one component scales poorly it may make sense to run it on fewer processors while the remaining processors idle).  Here is an example of a serial PE layout:


<entry id="NTASKS_ATM"   value="128"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   


<entry id="NTASKS_LND"   value="128"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="0"  />   


<entry id="NTASKS_ICE"   value="128"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />   


<entry id="NTASKS_OCN"   value="128"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="0"  />   


<entry id="NTASKS_CPL"   value="128"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="128"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="0"  />   


<entry id="NTASKS_ROF"   value="128"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="0"  />   


<entry id="NTASKS_WAV"   value="128"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />   


This layout can be expressed diagrammatically as below:



Fig: Serial PE layout from an earlier version of CESM which lacks WAV, GLC, and ROF. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg




Alternatively, some components can be run in parallel. Here is an example which uses 9000 cores (375 nodes) on Edison:


<entry id="NTASKS_ATM"   value="5400"  />  
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />    


<entry id="NTASKS_LND"   value="600"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="4800"  />    


<entry id="NTASKS_ICE"   value="4800"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />    


<entry id="NTASKS_OCN"   value="3600"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="5400"  />    


<entry id="NTASKS_CPL"   value="4800"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="600"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="4800"  />    


<entry id="NTASKS_ROF"   value="600"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="4800"  />    


<entry id="NTASKS_WAV"   value="4800"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />    


Here is the diagram showing how this setup translates into a processor map (x-axis) as it evolves over time within a timestep (y-axis):



Fig: processor layout when several components are run in parallel. Example uses 375 nodes on Edison.

Testing PE Layout Timing:

It is hard to tell a priori which layout will be optimal. Luckily, you don't need to. Try a couple of layouts and look at their timing (in simulated years per day, or SYPD) to figure out which is fastest. Keep in mind that timings can vary (a lot!) from run to run, so you may need to run for a while or do multiple runs with each layout to get accurate numbers. CESM recommends 20-day runs without saving any output or writing any restarts (because I/O complicates timings and is best handled as a separate optimization exercise). For ACME, and for high-resolution runs in particular, 20 days is impractical and unnecessary: 5 days (or even 2 or 3) should be good enough to collect data for determining reasonable layouts, especially given the higher frequency of coupling between the components. See the PIO section above for more discussion of run lengths and experimentation strategy.

Timing information is available in a variety of places. The log files run/cpl.log.* provide some basic info like SYPD and elapsed time.  Information broken down by the time spent in each component is available in case_scripts/timing/cesm_timing.<run_name>.<run_id>. This granular information is useful for load balancing when multiple components are running in parallel, in which case you want to increase or decrease the number of cores devoted to each component in order to make concurrent components take the same amount of time. This prevents components from sitting idle while they wait for their companions to finish up. In the 9000-core Edison example above, it would be ideal for ICE and LND to take the same amount of time, CPL and ROF to take the same amount of time, and WAV and GLC to take the same amount of time. Additionally, we want ATM+ICE+CPL+WAV to take the same amount of time as OCN. In reality, LND, ROF, and GLC are probably much quicker than ATM, ICE, and CPL, but since the number of cores they use is negligible this imbalance is tolerable.

Notes from Patrick Worley (Unlicensed) on interpreting timings in case_scripts/timing/cesm_timing.rs.* (communicated via email by Peter Caldwell): cesm_timing.rs files include output like the following:

TOT Run Time:     685.744 seconds       68.574 seconds/mday         3.45 myears/wday
LND Run Time:      13.492 seconds        1.349 seconds/mday       175.45 myears/wday
ROF Run Time:       4.397 seconds        0.440 seconds/mday       538.35 myears/wday
ICE Run Time:     136.503 seconds       13.650 seconds/mday        17.34 myears/wday
ATM Run Time:     438.496 seconds       43.850 seconds/mday         5.40 myears/wday
OCN Run Time:     336.464 seconds       33.646 seconds/mday         7.04 myears/wday
GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
CPL Run Time:      80.499 seconds        8.050 seconds/mday        29.41 myears/wday
CPL COMM Time:    599.469 seconds       59.947 seconds/mday         3.95 myears/wday

For layouts like the one above (with OCN running in parallel with everything else, and ICE and LND running in parallel on the cores used for ATM), Pat computes max(OCN, ATM + max(LND, ICE)) and compares it to the "TOT Run Time". The difference between these numbers is the amount of time spent in coupler computation and communication overhead. "CPL COMM Time" is kind of useless because it can include load imbalance from components that finish early (better load balancing generally leads to smaller "CPL COMM Time"). CPL Run Time is also unpredictable because it includes some MPI communication and may depend on which nodes CPL is running on.
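
To make this concrete with the sample numbers above (assuming that run used such a layout): max(OCN, ATM + max(LND, ICE)) = max(336.5, 438.5 + 136.5) = 575.0 seconds, versus a TOT Run Time of 685.7 seconds, so roughly 111 seconds (about 16% of the run) went to coupler computation and communication overhead.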