Running the ACME model requires a specification of which processor cores are assigned to which model components, called the PE layout. There are currently only a few people who know how to do this, and there is no documentation of the process. This is a huge bottleneck that makes it slow to start running on a new machine or to come up with an efficient layout for a new compset. The goal of this page is to provide the information needed for anyone on the project to create their own PE layouts (or at least to recognize when a layout is bad).

...

  1. Choose a total number of tasks that is evenly divisible by the number of cores/node for your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is dumb because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of assigned tasks for each component evenly divisible by the number of cores/node for your machine. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. Components can either all be run sequentially or a hybrid of sequential and concurrent execution (see Examples section below).
    1. completely sequential execution is good:
      1. for single-component (e.g. CORE-forced ocean, AMIP atmos) compsets, since only the active component will consume appreciable time

      2. because it is by definition load balanced between components (each component only takes as long as it needs), so the need for performance tuning is eliminated.
    2. concurrent execution is good because:
      1. it allows the model to scale to larger core counts and may improve throughput
  4. For concurrent execution (see the sketch after this list for a worked example):
    1. ATM must run sequentially with (i.e. not concurrently with) LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND, ICE, and ROF can run concurrently with each other
    4. coupler rules are complicated. Patrick Worley (Unlicensed) says "I generally like the coupler to share processes with just one of the land and sea ice components, but this is very fuzzy. CPL needs enough processes to do its job, but if it shares processes with all of the other components, then it is not available when a subset of components need to call the coupler, so inserts unnecessary serialization. CPL can also run on processes that are not assigned to any of the other components - idle much of the time and causing more communication overhead, but always available."
    5. Comment from Patrick Worley (Unlicensed), taken from below: Also, sometimes components do not play well with each other. For example, the atmosphere, at scale, will want to use OpenMP threads. In contrast, CICE has no threading support at the moment, and if you overlap CICE with ATM when ATM is using threads, this spreads out the CICE processes across more nodes, increasing MPI communication overhead. Memory requirements for ATM and CICE can also make them infeasible to run together. In these (somewhat extreme) cases, running CICE concurrent with ATM and OCN may be justified. Parallel efficiency will be poor, but that is not the target in these cases.
  5. For the atmosphere:
    1. Choose NTASKS_ATM so that it evenly divides the number of spectral elements in your atmos grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task or to use more tasks than there are elements. Doing so will speed up the physics but will not speed up the dynamics, so it is less efficient.
    2. For Linux clusters and low numbers of nodes (less than 1000) it is typically best to use NTHREADS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node. As noted on http://www.nersc.gov/users/computational-systems/edison/running-jobs/example-batch-scripts/, hyperthreading is turned on by submitting a job asking for twice as many cores as the machine actually has... you don't need to make any other changes.
      1. More specifically, NERSC says hyperthreading is invoked by asking for 2x as many cores as fit onto the requested nodes as part of your 'srun' command. Since the actual job submission is hidden under layers of CIME stuff, it is unclear how we should change our env_mach_pes.xml file, for example. I (Peter) have found that setting MAX_TASKS_PER_NODE to 48 while leaving PES_PER_NODE at 24 seems to do the trick.
    3. When using threads, there are several additional considerations. The number of MPI tasks per node times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHREADS_ATM = # physics columns. The dynamics by default can only make use of NTASKS_ATM*NTHREADS_ATM = nelem (extra threads are ok, they will just not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and the most efficient core counts for ACME atm resolutions:
    atm res    # elements    # physics columns    optimal core counts
    ne30       5400          48602                5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 350, 300, 270, ...
    ne120      86400         777602               86400, 43200, 28800, 21600, ...

  6. The MPAS components work well at any core count, but they require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but they may not exist yet if nobody has used that core count before. It is trivial to generate these files, though. On Edison, you can type

module load metis/5.1.0
gpmetis <graph_file>   <# pes>

where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and <# pes> is the number of cores you want to use for that component.
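
For example (using 1024 ocean tasks purely as an illustration), partitioning that oRRS15to5 ocean graph file would look like:

module load metis/5.1.0
gpmetis mpas-o.graph.info.151209 1024

gpmetis writes the result to a new file named <graph_file>.part.<# pes> (here mpas-o.graph.info.151209.part.1024) in the current directory; that partition file then needs to be staged wherever your case expects its MPAS graph files.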
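
To make the rules above (especially items 4 and 5) concrete, here is a minimal sketch of one possible concurrent layout for an ne30 coupled case on a machine with 24 cores/node, expressed with CIME's xmlchange tool. The task counts and root PEs are illustrative assumptions, not a recommended layout, and the commands assume the VAR=value syntax of the Python-based xmlchange (older Perl-based cases use the -file/-id/-val form instead):

cd <case_directory>                 # placeholder for your CIME case root

# ATM on 1800 tasks: 1800 divides ne30's 5400 elements and fills 75 full 24-core nodes
./xmlchange NTASKS_ATM=1800
./xmlchange ROOTPE_ATM=0

# LND and ROF share the first 864 of ATM's cores; ICE gets the remaining 936.
# ATM therefore runs sequentially with LND/ROF/ICE, while LND+ROF run concurrently with ICE.
./xmlchange NTASKS_LND=864
./xmlchange ROOTPE_LND=0
./xmlchange NTASKS_ROF=864
./xmlchange ROOTPE_ROF=0
./xmlchange NTASKS_ICE=936
./xmlchange ROOTPE_ICE=864

# CPL shares processes with just one of LND/ICE (here ICE), per the guidance above
./xmlchange NTASKS_CPL=936
./xmlchange ROOTPE_CPL=864

# OCN runs concurrently with everything else on its own 600 cores (25 nodes),
# for a total of 2400 cores = 100 nodes
./xmlchange NTASKS_OCN=600
./xmlchange ROOTPE_OCN=1800

# On Edison, hyperthreading can be tried by allowing 48 tasks per 24-core node
# (leave PES_PER_NODE at 24, as noted above):
# ./xmlchange MAX_TASKS_PER_NODE=48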

...


A Note about Layouts for Single-Component Simulations

The above discussion is mostly relevant to coupled runs involving all model components (aka "b" or "A_WCYCL" compsets). For simulations where most components are just reading in data (e.g. atmosphere "F" compsets or ocean/ice "G" compsets), it doesn't make sense to assign lots of cores to those inactive components. A natural choice in these cases is to run all components sequentially, with every component asking for the same number of cores. This layout naturally avoids load-balancing issues which might arise from some components doing all the work and others not doing anything. A complication with this approach is that some components may not scale to the core count used for the main components and may actually slow down if they are given too many cores.
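
As a minimal sketch of this fully sequential approach (same hypothetical xmlchange syntax as above; 600 tasks for an ne30 "F" case is purely illustrative), every component is given the same task count and the same root PE:

# Stack ATM, LND, ICE, OCN, ROF, and CPL on the same 600 cores (25 nodes at 24 cores/node).
# With identical PE ranges everything runs sequentially, so no inter-component
# load balancing is needed.  Adjust the component list to match your case.
for comp in ATM LND ICE OCN ROF CPL; do
  ./xmlchange NTASKS_${comp}=600
  ./xmlchange ROOTPE_${comp}=0
done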

Here's an example of how an "F" compset could be load balanced:

  1. Use the same decomposition for all components
  2. Following the guidelines above for Atmosphere load balancing, find the best decomposition based on ATM run time (ignore other components initially)
  3. Compare ATM run time vs. TOT run time when there are few MPI tasks (lots of work per node). ATM should be most of the cost. Let's say ATM is 90% of the cost.
  4. As you increase the number of nodes to speed up the atmosphere, monitor the ATM/TOT cost ratio (the sketch after this list shows one way to extract these numbers). If it starts to decrease, see if you can determine whether one of the other components is starting to slow down because it is using too many cores. The TOT cost could also grow if the cost of coupling becomes large, but this cost is very hard to isolate and interpret. This step requires experience and trial and error. Some strategies to consider from Patrick Worley (Unlicensed):
    1. Put all of the data models on their own nodes and leave these fixed (don't scale them up) as you scale up ATM, CPL, and LND. This has been used especially when memory is limited.
    2. ROF is a special case - it does not scale well, so you often don't want to scale it up as fast. If LND is cheap, you can stop scaling it early as well. And CPL may run out of scaling at some point too.
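
One way to monitor the ATM/TOT ratio from steps 3 and 4 is to pull the component run times out of the coupler's timing summary after each test run. A rough sketch, assuming the summary lives under the case's timing/ directory and contains the usual "XXX Run Time:" lines (the exact file name varies by machine and model version):

# Print the total and per-component run times from the timing summaries
grep "Run Time:" timing/*timing*

# Typical output includes lines such as "TOT Run Time: ..." and "ATM Run Time: ...";
# the ratio of the ATM and TOT numbers is the quantity tracked in step 4.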


About Parallel I/O (PIO) Layout:

Input/output (I/O) can also have a big impact on model performance. To avoid conflating I/O performance with model performance, the usual steps for optimizing the PE layout are:

...