
Running the ACME model requires a specification of which processor cores are assigned to which model components, called the PE layout. Currently only a few people know how to create one, and the process is undocumented. This is a major bottleneck that makes running on a new machine, or devising an efficient layout for a new compset, slow. The goal of this page is to provide the information anyone on the project needs to create their own PE layouts (or at least to recognize when a layout is bad).

Background Reading:

Here is a list of webpages describing PE layouts. Please add more pages if you know of any I've missed:

http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/x2574.html
http://www.cesm.ucar.edu/models/cesm1.0/cpl7/cpl7_doc/x29.html#design_seq
http://www.cesm.ucar.edu/models/cesm1.1/cesm/doc/usersguide/x730.html

Background:

ACME and CESM are supposed to provide bit-for-bit identical answers regardless of PE layout. This means that the information used by, e.g., the ocean model must be available when it runs, regardless of whether the ocean runs in parallel with the other components or serially after all the other components have finished. This necessitates passing some components information which is lagged relative to the current state of the other components. It also imposes some limitations on which PE layouts are acceptable (and I don't know what these limitations are - help?).

The PE layout for a run is encapsulated in the env_mach_pes.xml file, which is created when you run the create_newcase command. This page focuses on changing this .xml file for a particular run; changing the default behavior for all runs on a particular machine requires editing the files under $ACME_code/cime/machines-acme/, which is outside the scope of this page.
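If you would rather not hand-edit the XML, the xmlchange script placed in the case directory can set individual entries. A minimal sketch, assuming CESM1-era flag syntax (check ./xmlchange -h if this form doesn't match your scripts version):

    cd <case directory>
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 675
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 675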

env_mach_pes.xml consists of entries like:

<entry id="NTASKS_ATM"   value="5400"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   
<entry id="NINST_ATM"   value="1"  />   
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />    

where, in addition to the ATM values shown, there are values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component. NTHRDS is the number of OpenMP threads per task. ROOTPE is the index of the first core assigned to that component, with numbering starting at 0. Thus if ATM takes the first 10 cores and OCN takes the next 5 cores, you would have NTASKS_ATM=10, ROOTPE_ATM=0, NTASKS_OCN=5, ROOTPE_OCN=10.

The model automatically runs components assigned to disjoint sets of cores in parallel, and components sharing cores in sequence (with the caveat that running some components concurrently is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). The NINST entries are for running multiple copies of a particular component at once; we aren't interested in that here, so leave those quantities at the default values shown above.

In short, coming up with a new PE layout is equivalent to choosing NTASKS, NTHRDS, and ROOTPE for each component.
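For example, here is a minimal sketch (illustrative values only, not taken from an actual case) of the 10-core ATM / 5-core OCN layout described above, with threads and instance settings left at their defaults and the other components omitted:

<entry id="NTASKS_ATM"   value="10"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NTASKS_OCN"   value="5"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="10"  />

Because ATM (cores 0-9) and OCN (cores 10-14) occupy disjoint cores, they would run concurrently.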

Examples:

For small core counts, it's usually most efficient (is this true?) to run all the components sequentially, with each component using all available cores:

Pic: Serial PE layout from an earlier model version which lacks WAV, GLC, and ROF. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg

To do: Insert the PE layout for this case.
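Until the real layout is filled in, here is a minimal sketch (illustrative values only, not from an actual case) of what a fully sequential env_mach_pes.xml looks like on 128 cores: every component uses all cores and has ROOTPE=0, so the components run one after another:

<entry id="NTASKS_ATM"   value="128"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NTASKS_LND"   value="128"  />
<entry id="ROOTPE_LND"   value="0"  />
<entry id="NTASKS_ICE"   value="128"  />
<entry id="ROOTPE_ICE"   value="0"  />
<entry id="NTASKS_OCN"   value="128"  />
<entry id="ROOTPE_OCN"   value="0"  />
<entry id="NTASKS_CPL"   value="128"  />
<entry id="ROOTPE_CPL"   value="0"  />

(and likewise for GLC, ROF, and WAV, with all NTHRDS values left at 1).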

Rules for choosing a PE layout:

  1. Choose a total number of tasks that is evenly divisible by the number of cores per node on your machine (e.g., asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Choose NTASKS_ATM so that it evenly divides the number of spectral elements in your atmosphere grid. For a cubed-sphere grid, the number of elements is N = 6*NE^2, and the number of physics columns is 9*N + 2. For RRM grids, the number of elements can be determined from the grid template file. Having uneven numbers of elements per task, or using more tasks than there are elements, is possible and will speed up the physics but not the dynamics, and is thus less efficient. (See the sketch after this list for a quick way to compute acceptable task counts.)

    When using threads, there are several additional considerations. If NTASKS_ATM*NTHRDS_ATM > the number of elements, nested OpenMP should be enabled. This is a new feature that is not enabled by default.

     The table below shows the number of elements and the acceptable core counts for the ACME atm resolutions:

     atm res    # elements    acceptable core counts
     ne30       5400          5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
     ne120      86400         ???


  3. The MPAS components work well at any core count, but require graph partition files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but may not exist yet if nobody has used those core counts before. It is trivial to generate these files, though. On Edison, you can type

    module load metis/5.1.0         # makes the gpmetis graph-partitioning tool available
    gpmetis <graph_file>   <# pes>  # writes the partition to <graph_file>.part.<# pes>

    where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and # pes is the number of cores you want to use for that component.

  4. For single-component compsets (e.g. CORE-forced ocean, AMIP atmosphere), you might as well use a serial configuration, because only the active component will consume appreciable time.
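As a rough illustration of rule 2 above, the sketch below is plain bash with nothing ACME-specific assumed: it computes the element count for a cubed-sphere grid from NE and prints the task counts that divide it evenly. Change NE to match your resolution.

    # Number of spectral elements on a cubed-sphere grid: N = 6*NE^2 (ne30 -> 5400)
    NE=30
    N=$((6 * NE * NE))
    echo "ne${NE}: ${N} elements, $((9 * N + 2)) physics columns"

    # Acceptable NTASKS_ATM values are the divisors of the element count
    for t in $(seq 1 ${N}); do
      if [ $((N % t)) -eq 0 ]; then
        echo ${t}
      fi
    done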



NOTE: this page is not done yet.


Testing PE Layout Timing: