Running the ACME model requires a specification of which cores are assigned to handle which model components, called the PE layout. There are currently only a few people who know how to create one and there is no documentation of the process. This is a huge bottleneck which makes running on a new machine, or coming up with an efficient layout for a new compset, slow. The goal of this page is to provide the information needed for anyone on the project to create their own PE layouts (or at least to know when their layout is bad).

Before diving into the joy of creating your own PE layout, check this page to see whether someone has already gone to the effort for you:

...

where in addition to the ATM values shown there are values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component. NTHRDS is the number of OpenMP threads per MPI task. ROOTPE is the first MPI task number to use for that component (with the first task being ROOTPE=0). Thus if ATM uses the first 10 tasks and OCN uses the next 5, you would set NTASKS_ATM=10, ROOTPE_ATM=0, NTASKS_OCN=5, ROOTPE_OCN=10. In general, to run a second component concurrently with a first, set the second component's ROOTPE equal to the first component's ROOTPE plus its NTASKS. Note that NTHRDS does not affect ROOTPE values: if you want ICE and LND to run in parallel and ICE takes up 1200 MPI tasks with 4 threads starting at ROOTPE=0, then ROOTPE for LND should be 1200; this number would not change if ICE used only 1 thread. The model will automatically run all components on disjoint sets of cores in parallel and all components sharing cores in sequence (with the caveat that parallelizing some tasks is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). NINST entries are related to running multiple copies of a particular component at once; we aren't interested in this, so leave these quantities at their default values shown above. In short, coming up with a new PE layout is equivalent to choosing NTASKS, NTHRDS, and ROOTPE for each component.
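For example, a minimal sketch of the ATM/OCN layout above, applied from the case directory with CIME's xmlchange tool (the VARIABLE=value form shown here works in recent CIME versions; older ACME scripts use ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 10, so treat this as illustrative rather than exact):

    # ATM on MPI tasks 0-9, OCN on tasks 10-14; because the two sets of
    # tasks are disjoint, ATM and OCN will run concurrently.
    ./xmlchange NTASKS_ATM=10
    ./xmlchange ROOTPE_ATM=0
    ./xmlchange NTASKS_OCN=5
    ./xmlchange ROOTPE_OCN=10
    # Components not listed keep their defaults; check env_mach_pes.xml
    # afterwards to confirm the values were applied.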

...

  1. Choose a total number of tasks that is evenly divisible by the number of cores/node for your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is dumb because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores/node for your machine. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. Components can either all be run sequentially or a hybrid of sequential and concurrent execution (see Examples section below).
    1. completely sequential execution is good:
      1. for single-component compsets (e.g. CORE-forced ocean, AMIP atmos), since only the active component will consume appreciable time

      2. because it is by definition load balanced between components (each component only takes as long as it needs), so the need for performance tuning is eliminated.
    2. concurrent execution is good because:
      1. it allows the model to scale to larger core counts and may improve throughput
  4. For concurrent execution (an illustrative layout is sketched after this list):
    1. ATM must run sequentially with LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND, ICE, and ROF can run concurrently with each other
    4. coupler rules are complicated. Patrick Worley says "I generally like the coupler to share processes with just one of the land and sea ice components, but this is very fuzzy. CPL needs enough processes to do its job, but if it shares processes with all of the other components, then it is not available when a subset of components need to call the coupler, so inserts unnecessary serialization. CPL can also run on processes that are not assigned to any of the other components - idle much of the time and causing more communication overhead, but always available."
    5. Comment from Patrick Worley, taken from below: Also, sometimes components do not play well with each other. For example, the atmosphere, at scale, will want to use OpenMP threads. In contrast, CICE has no threading support at the moment, and if you overlap CICE with ATM when ATM is using threads, this spreads out the CICE processes across more nodes, increasing MPI communication overhead. Memory requirements for ATM and CICE can also make them infeasible to run together. In these (somewhat extreme) cases, running CICE concurrent with ATM and OCN may be justified. Parallel efficiency will be poor, but that is not the target in these cases.
  5. For the atmosphere:
    1. Choose NTASKS_ATM so it evenly divides the number of spectral elements in your atmos grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task or to use more tasks than there are elements. Doing so will speed up the physics but will not speed up the dynamics, so it is less efficient. (A sketch for computing candidate core counts appears after this list.)
    2. For Linux clusters and low numbers of nodes (fewer than 1000) it is typically best to use NTHRDS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task, 24 MPI tasks per node. As noted on http://www.nersc.gov/users/computational-systems/edison/running-jobs/example-batch-scripts/, hyperthreading is turned on by submitting a job asking for twice as many cores as the machine actually has... you don't need to make any other changes.
      1. More specifically, NERSC says hyperthreading is invoked by asking for 2x as many cores as fit onto the requested nodes as part of your 'srun' command. Since the actual job submission is hidden under layers of CIME machinery, it is unclear how we should change our env_mach_pes.xml file, for example. I (Peter) have found that setting MAX_TASKS_PER_NODE to 48 while leaving PES_PER_NODE at 24 seems to do the trick (see the sketch after this list).
    3. When using threads, there are several additional considerations. The number of MPI tasks per node times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = # physics columns. The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are OK; they just will not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and the most efficient core counts for ACME atm resolutions:
    atm res    # elements    # physics columns    optimal core counts
    ne30       5400          48602                5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
    ne120      86400         777602               86400, 43200, 28800, 21600, ...

  6. The MPAS components work well at any core count but require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded by the ACME model from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/, but they may not exist yet if nobody has used those core counts before. It is trivial to generate these files, though. On Edison, you can type

...
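As an illustration of the concurrency rules in item 4 above (not a recommendation), here is a hedged sketch of one possible concurrent layout, expressed as xmlchange calls from the case directory. All task counts are made up for the example and would still need to satisfy the node-divisibility and atmosphere guidelines above:

    # LND and ROF share tasks 0-599; ICE uses tasks 600-1199, so ICE runs
    # concurrently with LND/ROF. ATM spans tasks 0-1199, so it runs in
    # sequence with LND, ROF, and ICE. CPL shares tasks with LND/ROF only
    # (one option for the coupler). OCN sits on its own tasks 1200-1679 and
    # runs concurrently with everything else.
    ./xmlchange NTASKS_LND=600
    ./xmlchange ROOTPE_LND=0
    ./xmlchange NTASKS_ROF=600
    ./xmlchange ROOTPE_ROF=0
    ./xmlchange NTASKS_ICE=600
    ./xmlchange ROOTPE_ICE=600
    ./xmlchange NTASKS_CPL=600
    ./xmlchange ROOTPE_CPL=0
    ./xmlchange NTASKS_ATM=1200
    ./xmlchange ROOTPE_ATM=0
    ./xmlchange NTASKS_OCN=480
    ./xmlchange ROOTPE_OCN=1200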
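As referenced in item 5.1, here is a small sketch (bash) for listing candidate NTASKS_ATM values that both divide the element count and fill whole nodes. The ne30 and 24 cores/node values are illustrative; substitute your own grid and machine:

    # nelem for ne30 is 5400; physics columns = 9*nelem + 2 = 48602.
    nelem=5400
    cores_per_node=24
    for (( n=cores_per_node; n<=nelem; n+=cores_per_node )); do
        if (( nelem % n == 0 )); then
            echo "NTASKS_ATM=$n  ->  $(( nelem / n )) elements per task"
        fi
    done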
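And for the Edison hyperthreading workaround mentioned in item 5.2.1, a sketch of the settings Peter describes (variable names as they appear in ACME's env_mach_pes.xml at the time; verify against your own file before applying):

    # Allow 48 task slots per 24-core Edison node while still placing
    # 24 MPI tasks per node; use 2 OpenMP threads per MPI task.
    ./xmlchange MAX_TASKS_PER_NODE=48
    ./xmlchange PES_PER_NODE=24
    ./xmlchange NTHRDS_ATM=2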

Here's an example of how an "F" compset could be load balanced:

  1. To start, use the same decomposition for all components.
  2. Following the guidelines above for atmosphere load balancing, choose NTASKS_ATM and threads, and find the best decomposition based on ATM run time (ignoring the other components initially).
  3. Compare ATM run time vs. TOT run time when there are few MPI tasks (lots of work per node). ATM should be most of the cost; let's say ATM is 90% of the cost. (One way to extract these timings is sketched after this list.)
  4. As you increase the number of nodes to speed up the atmosphere, monitor the ATM/TOT cost ratio. If it starts to decrease, see if you can determine whether one of the other components is starting to slow down because it is using too many cores. The TOT time could also grow if the cost of coupling becomes large, but this cost is very hard to isolate and interpret. Addressing this issue, if it is needed, requires experience and trial and error. Some strategies to consider from Patrick Worley:
    1. Put all of the data models on their own nodes and leave them fixed (don't scale them) as you scale up ATM, CPL, and LND. This has been used especially when memory is limited.
    2. ROF is a special case: it does not scale well, so you often don't want to scale it up as fast. If LND is cheap, you can stop scaling it early as well. CPL may also run out of scaling at some point.
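One hedged way to pull the numbers for steps 3 and 4 from a finished run. This assumes the usual ACME/CESM timing summary written under the case's timing/ directory and its "XXX Run Time:" line format; the file names and layout may differ on your machine and model version:

    # Show total and per-component run times from the timing summaries.
    grep -E "(TOT|ATM|LND|ICE|OCN|ROF|CPL) Run Time" timing/*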


About Parallel I/O (PIO) Layout:

...

Users typically have to start with the default configuration and try a couple of PIO_NUMTASKS and PIO_STRIDE options to get an optimal PIO layout for a model configuration on a particular machine. On large machines, we have found empirically that we need to constrain the number of PIO I/O tasks (PIO_NUMTASKS) to fewer than 128 to prevent out-of-memory errors, but this is also system specific. The other rule of thumb is to assign no more than one PIO I/O task to each compute node, to minimize MPI overhead and to improve parallel I/O performance; this will occur naturally if PIO_NUMTASKS is sufficiently smaller than the total number of tasks. Mark Taylor says "In my view, the fewer PIO tasks the better (more robust & faster) but with too few you can run out of memory".
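For example, one alternative PIO layout could be tried with something like the following from the case directory (a sketch with illustrative values; PIO_NUMTASKS and PIO_STRIDE are standard CIME XML variables, but confirm where they live and what defaults you have for your version):

    # Limit I/O to 64 tasks, spaced every 24 MPI ranks (roughly one per
    # 24-core node); rerun and compare the I/O time in the timing output.
    ./xmlchange PIO_NUMTASKS=64
    ./xmlchange PIO_STRIDE=24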

Examples:

For small core counts, it's usually most efficient to run all components sequentially, with each component using all available cores (though if one component scales poorly it may make sense to run it on fewer cores while the remaining cores sit idle). Here is an example of a serial PE layout:

...