...

ACME and CESM were designed to provide bit-for-bit identical answers regardless of PE layout (though this doesn't happen in practice unless certain debug options are used - see comment 4 by Patrick Worley below). This means that the information used by e.g. the ocean model needs to be available at the time it runs, regardless of whether the ocean runs in parallel with the other components or runs serially after all the other components have finished. This necessitates passing one component information that is lagged relative to the current state of the other components. It also imposes some limitations on which PE layouts are acceptable (see comments 4 and 5 by Patrick Worley below):

  1. ATM must run serially with LND, ROF, and ICE
  2. everything can (kind of) run concurrently with OCN
  3. LND and ICE can run concurrently
  4. ROF must run serially with LND
  5. The coupler rules are complicated. Patrick Worley says: "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components needs to call the coupler, so it inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."
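
As an illustration of these rules, here is a hypothetical 2048-core layout expressed in terms of per-component task counts (NTASKS) and starting PEs (ROOTPE); the numbers are invented for illustration, not a tuned configuration:

    NTASKS_ATM=1536   ROOTPE_ATM=0      # ATM overlaps LND, ROF, and ICE, so it runs serially with them
    NTASKS_LND=768    ROOTPE_LND=0      # LND and ICE use disjoint cores, so they run concurrently
    NTASKS_ICE=768    ROOTPE_ICE=768
    NTASKS_ROF=768    ROOTPE_ROF=0      # ROF shares cores with LND, so it runs serially with LND
    NTASKS_CPL=768    ROOTPE_CPL=0      # one reasonable choice: associate CPL with LND
    NTASKS_OCN=512    ROOTPE_OCN=1536   # OCN gets its own cores, so it runs concurrently with the rest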

The PE layout for a run is encapsulated in the env_mach_pes.xml file, which is created by the create_newcase command when the case is created. This page focuses on changing this .xml file for a particular run; changing the default behavior for all runs on a particular machine requires muddling around with the $ACME_code/cime/machines-acme/ files, which is outside the scope of this page.
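
For reference, here is a minimal sketch of inspecting and changing these values from the case directory with xmlquery and xmlchange; the values are illustrative, and the exact command syntax depends on the CIME version in your checkout (newer versions also accept a VARIABLE=value form):

    # show the current atmosphere task/thread/root-PE settings
    ./xmlquery NTASKS_ATM
    ./xmlquery NTHREADS_ATM
    ./xmlquery ROOTPE_ATM

    # change them (the -file/-id/-val form used by older CIME versions)
    ./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 1350
    ./xmlchange -file env_mach_pes.xml -id NTHREADS_ATM -val 1
    ./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0

After changing the PE layout you generally need to clean and re-run the case setup and rebuild before submitting; the commands for that step also depend on the scripts version.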

...

For small core counts, it's usually most efficient (is this true?) to run all the components sequentially, with each component using all available cores (though if one component scales poorly, it may make sense to run it on fewer processors while the remaining processors idle). Here is an example of a serial PE layout:

...

  1. Choose a total number of tasks that is evenly divisible by the number of cores/node for your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of assigned tasks for each component evenly divisible by the number of cores/node for your machine. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing and other cores doing something else.
  3. For single-component compsets (e.g. CORE-forced ocean, AMIP atmosphere), you might as well use a serial configuration because only the active component will consume appreciable time.

  4. For concurrent execution:
    1. ATM must run serially with LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND and ICE can run concurrently
    4. ROF must run serially with LND
    5. The coupler rules are complicated. Patrick Worley says: "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components needs to call the coupler, so it inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."
  5. For the atmosphere:
    1. Choose NTASKS_ATM so that it evenly divides the number of spectral elements in your atmosphere grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task, or to use more tasks than there are elements; doing so will speed up the physics but not the dynamics, so it is less efficient. (A short sanity-check script illustrating this arithmetic appears at the end of this section.)
    2. For Linux clusters and low numbers of nodes (fewer than 1000), it is typically best to use NTHREADS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.
    3. When using threads, there are several additional considerations. The number of MPI tasks times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC machines). The physics can make use of up to NTASKS_ATM*NTHREADS_ATM = # physics columns. The dynamics by default can only make use of NTASKS_ATM*NTHREADS_ATM = nelem (extra threads are fine; they just will not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and acceptable core counts for ACME atm resolutions:
    atm res    # elements    # physics columns    acceptable core counts
    ne30       5400          48602                5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
    ne120      86400         777602               86400, 43200, 28800, 21600, ...

  6. The MPAS components work well at any core count but require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but they may not exist if nobody has used those core counts yet. It is trivial to generate these files, though. On Edison, you can type

...
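
To make the arithmetic in guidelines 1, 2, and 5 above concrete, here is a small sanity-check script. The node size, grid, and task count are assumptions chosen for illustration (24 cores/node and the ne30 grid); adjust them for your machine and resolution:

    #!/bin/bash
    cores_per_node=24    # assumed machine: 24 cores per node
    nelem=5400           # ne30 has 5400 spectral elements
    ntasks_atm=1800      # candidate NTASKS_ATM
    nthreads_atm=1       # candidate NTHREADS_ATM

    # physics columns for a spectral-element grid: 9*nelem + 2
    echo "physics columns: $((9 * nelem + 2))"

    # guidelines 1-2: the layout should fill whole nodes
    if [ $(( (ntasks_atm * nthreads_atm) % cores_per_node )) -ne 0 ]; then
      echo "warning: NTASKS_ATM*NTHREADS_ATM does not fill whole nodes"
    fi

    # guideline 5.1: NTASKS_ATM should evenly divide the number of elements
    if [ $(( nelem % ntasks_atm )) -ne 0 ]; then
      echo "warning: NTASKS_ATM=$ntasks_atm does not evenly divide nelem=$nelem"
    fi

With these values the script reports 48602 physics columns and prints no warnings: 1800 tasks fill exactly 75 nodes and give 3 elements per task.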