
Running the ACME model requires a specification of which cores are assigned to which model components, called the PE layout. Currently only a few people know how to create one, and the process is undocumented. This is a major bottleneck that makes running on a new machine, or coming up with an efficient layout for a new compset, slow. The goal of this page is to provide the information anyone on the project needs to create their own PE layouts (or at least to recognize when a layout is bad).

Background Reading:

Here is a list of webpages describing PE layouts. Please add more pages if you know of any I've missed:

http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/x2574.html
http://www.cesm.ucar.edu/models/cesm1.0/cpl7/cpl7_doc/x29.html#design_seq
http://www.cesm.ucar.edu/models/cesm1.1/cesm/doc/usersguide/x730.html
http://www.cesm.ucar.edu/models/cesm1.3/cesm/doc/usersguide/x1516.html

Background:

ACME and CESM are supposed to provide bit-for-bit identical answers regardless of PE layout. This means that the information used by, e.g., the ocean model needs to be available at the time it runs, regardless of whether the ocean is running in parallel with the other components or is run serially after all the other components have finished. This necessitates passing some components information which is lagged relative to the current state of the other components. It also imposes some limitations on which PE layouts are acceptable (and I don't know what these limitations are - help?).

The PE layout for a run is encapsulated in the env_mach_pes.xml file, which is created in the case directory by the create_newcase command. This page focuses on changing this .xml file for a particular run; changing the default behavior for all runs on a particular machine requires modifying the files under $ACME_code/cime/machines-acme/, which is outside the scope of this page.

env_mach_pes.xml consists of entries like:

<entry id="NTASKS_ATM"   value="5400"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   
<entry id="NINST_ATM"   value="1"  />   
<entry id="NINST_ATM_LAYOUT"   value="concurrent"  />    

where, in addition to the ATM values shown, there are corresponding values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component, NTHRDS is the number of OpenMP threads per task, and ROOTPE is the index of the first core assigned to that component (counting from ROOTPE=0). Thus if ATM takes the first 10 cores and OCN takes the next 5 cores, you would have NTASKS_ATM=10, ROOTPE_ATM=0, NTASKS_OCN=5, and ROOTPE_OCN=10. The model automatically runs components assigned to disjoint sets of cores in parallel, and components sharing cores in sequence (with the caveat that parallelizing some components is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). The NINST entries are related to running multiple copies of a particular component at once; we aren't interested in this, so leave those quantities at the default values shown above. In short, coming up with a new PE layout amounts to choosing NTASKS, NTHRDS, and ROOTPE for each component.
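
For concreteness, here is a minimal sketch of the env_mach_pes.xml entries for the 10-core ATM / 5-core OCN example above (only the ATM and OCN entries are shown; threading and NINST settings are assumed to stay at their defaults):

<entry id="NTASKS_ATM"   value="10"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NTASKS_OCN"   value="5"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="10"  />

You can edit these values directly in env_mach_pes.xml or set them with the case directory's xmlchange utility; either way, the case setup and build generally need to be redone after the PE layout changes.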

Examples:

For small core counts, it's usually most efficient (is this true?) to run all the components sequentially, with each component using all of the available cores:

Pic: Serial PE layout from an earlier version of CESM, which lacks the WAV, GLC, and ROF components. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg
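
A minimal sketch of what such a fully sequential layout could look like in env_mach_pes.xml, assuming (for illustration only) the 128-core case from the figure above with no threading; every component gets all 128 tasks and starts at ROOTPE=0, so the components run one after another on the same cores:

<entry id="NTASKS_ATM"   value="128"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />
<entry id="NTASKS_OCN"   value="128"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="0"  />

with the same NTASKS=128, NTHRDS=1, ROOTPE=0 pattern repeated for LND, ICE, GLC, ROF, CPL, and WAV.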

To do: Insert the PE layout diagram for this case. (Note: the blue box on the right labeled ATM is actually the ocean.)

Rules for choosing a PE layout:

  1. Choose a total number of tasks that is evenly divisible by the number of cores per node on your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is wasteful: you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores per node. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. For single-component compsets (e.g. CORE-forced ocean, AMIP atmosphere), you might as well use a serial configuration because only the active component will consume appreciable time.

  4. For the atmosphere:
    1. Choose NTASKS_ATM so that it evenly divides the number of spectral elements in your atmos grid. The number of elements is "nelem" from the latlon grid template files available for each grid here (ACME inputdata server). The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task, or to use more tasks than there are elements; doing so will speed up the physics but not the dynamics, so it is less efficient.
    2. For Linux clusters and low node counts (fewer than 1000 nodes) it is typically best to use NTHRDS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.
    3. When using threads, there are several additional considerations. The number of MPI tasks times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = # physics columns. The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are ok; they just will not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the number of elements and acceptable core counts for ACME atm resolutions; a sketch of a threaded atmosphere configuration consistent with these rules follows the table.
atm res | # elements | # physics columns | acceptable core counts
ne30    | 5400       | 48602             | 5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
ne120   | 86400      | 777602            | 86400, 43200, 28800, 21600, ...
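
For illustration, here is a hedged sketch of the atmosphere entries for an ne30 case on a hypothetical machine with 16 cores per node, using 1800 MPI tasks and 2 OpenMP threads per task, so that 8 tasks x 2 threads fill each of 225 nodes and each task owns 5400/1800 = 3 elements:

<entry id="NTASKS_ATM"   value="1800"  />
<entry id="NTHRDS_ATM"   value="2"  />
<entry id="ROOTPE_ATM"   value="0"  />

Here NTASKS_ATM*NTHRDS_ATM = 3600, which stays below the dynamics limit of nelem = 5400 and far below the physics limit of 48602 columns, so every thread has work to do.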


  5. The MPAS components (ocean and sea ice) work well at almost any core count, but they require graph partition files matching the chosen core counts (of the form mpas-cice.graph.info.<datestamp>.part.<ICE_CORE_COUNT> and mpas-o.graph.info.<datestamp>.part.<OCN_CORE_COUNT>). These files are automatically downloaded by the ACME model from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/, but they may not exist yet if nobody has used those core counts before. They are trivial to generate, though. On Edison, you can type

    module load metis/5.1.0
    gpmetis <graph_file>   <# pes>

    where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and # pes is the number of cores you want to use for that component.
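
    For example (a hedged sketch, assuming the oRRS15to5 graph file above has already been downloaded to the current directory and that the ocean will run on 1024 cores; substitute your own graph file and core count):

    module load metis/5.1.0
    gpmetis mpas-o.graph.info.151209 1024
    # gpmetis writes the partition to mpas-o.graph.info.151209.part.1024

    The resulting .part.<# pes> file is the partition file needed for that core count.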




NOTE: this page is not done yet.


Testing PE Layout Timing: