Running the ACME model requires specifying which processor cores are assigned to which model components; this assignment is called the PE layout. There are currently only a few people who know how to construct one, and the process is not documented. This is a major bottleneck that makes running on a new machine, or coming up with an efficient layout for a new compset, slow. The goal of this page is to provide the information anyone on the project needs to create their own PE layouts (or at least to recognize when a layout is bad).
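For orientation, here is a minimal sketch (hypothetical values, assuming a standard CIME-based case; the exact xmlchange syntax differs between model versions) of the per-component entries in env_mach_pes.xml that make up a PE layout: the number of MPI tasks, the number of threads, and the starting MPI rank for each component.

    # Give the atmosphere 128 MPI tasks with 2 OpenMP threads each,
    # starting at global MPI rank 0 (values are illustrative only).
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 128
    ./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 2
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
    # Analogous NTASKS_*/NTHRDS_*/ROOTPE_* entries exist for LND, ICE, OCN,
    # ROF, GLC, WAV, and CPL.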

...

  1. To start, use the same decomposition for all components
  2. Following the guidelines above for choosing NTASKS_ATM and threads, find the best decomposition based on ATM run time (ignore the other components initially)
  3. Compare ATM run time vs. TOT run time when there are few MPI tasks (lots of work per node).  ATM should be most of the cost.  Let's say ATM is 90% of the cost
  4. As you increase the number of nodes to speed up the atmosphere, monitor the ATM/TOT cost ratio.  If it starts to decrease, see if you can determine whether one of the other components is starting to slow down because it is using too many cores.  The TOT cost could also slow down if the cost to couple becomes large, but this cost is very hard to isolate and interpret.  Addressing this issue, if it is needed, requires experience and trial and error.  Some strategies to consider from Patrick Worley:
    1. Put all of the data models on their own nodes and leave these fixed (don't scale) as you scale up ATM and CPL and LND (see the sketch after this list). This has been used especially when memory is limited.
    2. ROF is a special case - it does not scale well, so you often don't want to scale it up as fast. If LND is cheap, you can stop scaling it early as well. And CPL may run out of scaling at some point as well.
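As a sketch of the first strategy (hypothetical task counts; which components are data models depends on your compset, and the exact xmlchange syntax depends on the CIME version), the data models can be pinned to their own block of MPI ranks via ROOTPE while the active components scale up:

    # Active components (here ATM, LND, CPL) share ranks 0-1023 and are the
    # ones you scale up.
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 1024
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
    ./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 1024
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 0
    ./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 1024
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 0
    # Data models (e.g. a data ocean and data ice) get one 64-core node of
    # their own, starting at rank 1024, and are left fixed as the rest scales.
    ./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 1024
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 64
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 1024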


About Parallel I/O (PIO) Layout:

...

Users typically have to start with the default configuration and try a couple of PIO_NUMTASKS and PIO_STRIDE options to find an optimal PIO layout for a given model configuration on a particular machine. On large machines we have found empirically that the number of PIO I/O tasks (PIO_NUMTASKS) needs to be constrained to less than 128 to prevent out-of-memory errors, but this is also system specific. The other rule of thumb is to assign no more than one PIO I/O task per compute node, to minimize MPI overhead and improve parallel I/O performance. This happens naturally if PIO_NUMTASKS is sufficiently smaller than the total number of tasks. Mark Taylor says, "In my view, the fewer PIO tasks the better (more robust & faster), but with too few you can run out of memory."
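As a sketch (hypothetical values; on most machines these settings live in env_run.xml, and typically you set one of PIO_STRIDE or PIO_NUMTASKS and let the other be derived, so check your machine's defaults), a run using 4096 MPI tasks on 64-core nodes could place one I/O task per node like this:

    # One PIO I/O task every 64 ranks -> 64 I/O tasks on a 4096-task run,
    # i.e. at most one I/O task per 64-core node and well under ~128 tasks.
    ./xmlchange -file env_run.xml -id PIO_STRIDE -val 64
    # Alternatively, fix the task count directly (if both are set, keep them
    # consistent with each other and the total task count).
    ./xmlchange -file env_run.xml -id PIO_NUMTASKS -val 64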

Examples:

For small core counts, it is usually most efficient to run all the components sequentially, with every component using all available cores (though if one component scales poorly it may make sense to run it on fewer processors while the remaining processors idle).  Here is an example of a serial PE layout:

...