...

where in addition to the ATM values shown there are values for the OCN, LND, ICE, GLC, ROF, CPL, and WAV components. NTASKS is the number of MPI tasks for that component, NTHRDS is the number of OpenMP threads per task, and ROOTPE is the index of the first core assigned to that component (with numbering starting at ROOTPE=0). Thus if ATM takes the first 10 cores and OCN takes the next 5 cores, you would have NTASKS_ATM=10, ROOTPE_ATM=0, NTASKS_OCN=5, ROOTPE_OCN=10. The model automatically runs components on disjoint sets of cores in parallel, and components sharing cores in sequence (with the caveat that running some components concurrently is impossible because of the need for bit-for-bit reproducibility regardless of PE layout). The NINST entries relate to running multiple copies of a particular component at once; we aren't interested in this, so leave those at the default values shown above. In short, coming up with a new PE layout amounts to choosing NTASKS, NTHRDS, and ROOTPE for each component.
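
For example, a minimal sketch of that 10-core ATM / 5-core OCN split (illustrative values only; a real layout would also need entries for the remaining components) would be:

<!-- illustrative sketch of the 10-core ATM / 5-core OCN example above -->
<entry id="NTASKS_ATM"   value="10"  />
<entry id="NTHRDS_ATM"   value="1"  />
<entry id="ROOTPE_ATM"   value="0"  />

<entry id="NTASKS_OCN"   value="5"  />
<entry id="NTHRDS_OCN"   value="1"  />
<entry id="ROOTPE_OCN"   value="10"  />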


Rules for choosing a PE layout:

  1. Choose a total number of tasks that is evenly divisible by the number of cores per node on your machine (e.g. asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores per node as well. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing while other cores do something else.
  3. Components can either all be run sequentially, or as a hybrid of sequential and concurrent execution (see the Examples section below).
    1. completely sequential execution is good:
      1. for single-component compsets (e.g. CORE-forced ocean, AMIP atmosphere), since only the active component will consume appreciable time

      2. because it is by definition load balanced between components (each component takes only as long as it needs), which eliminates the need for performance tuning.
    2. concurrent execution is good because:
      1. it allows the model to scale to larger core counts and may improve throughput
  4. For concurrent execution:
    1. ATM must run sequentially with LND, ROF, and ICE
    2. everything can (kind of) run concurrently with OCN
    3. LND and ICE can run concurrently with each other
    4. ROF must run sequentially with LND
    5. coupler rules are complicated. Patrick Worley says "I generally like the coupler to be associated with just one of land and sea ice, but this is very fuzzy. CPL needs enough processors to do its job, but if it overlaps all of the other components, then it is not available when a subset of components need to call the coupler, so this inserts unnecessary serialization. CPL can also be on its own nodes - idle much of the time and causing more communication overhead, but always available."
    6. Comment from Patrick Worley, taken from below: Also, sometimes components do not play well with each other. For example, the atmosphere, at scale, will want to use OpenMP threads. In contrast, CICE has no threading support at the moment, and if you overlap CICE with ATM when ATM is using threads, this spreads out the CICE processes across more nodes, increasing MPI communication overhead. Memory requirements for ATM and CICE can also make them infeasible to run together. In these (somewhat extreme) cases, running CICE concurrent with ATM and OCN may be justified. Parallel efficiency will be poor, but that is not the target in these cases.
  5. For the atmosphere:
    1. Choose NTASKS_ATM so it evenly divides the number of spectral elements in your atmosphere grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task, or more tasks than there are elements; doing so will speed up the physics but not the dynamics, so it is less efficient.
    2. For Linux clusters and low node counts (less than 1000 nodes) it is typically best to use NTHRDS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.
    3. When using threads, there are several additional considerations. The number of MPI tasks per node times the number of threads per task should equal the number of cores on the node (except when using hyperthreading on NERSC). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = (number of physics columns). The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are fine; they just will not improve dynamics performance). The new "nested OpenMP" feature can be used to allow the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and the most efficient core counts for ACME atm resolutions:
    atm res    # elements    # physics columns    optimal core counts
    ne30       5400          48602                5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 350, 300, 270, ...
    ne120      86400         777602               86400, 43200, 28800, 21600, ...


  6. The MPAS components work well at any core count but require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded by the ACME model from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/, but may not exist yet if nobody has used that core count before. They are trivial to generate, though. On Edison, you can type

module load metis/5.1.0
gpmetis <graph_file>   <# pes>

where graph_file is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and # pes is the number of cores you want to use for that component.
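
For example, to generate a 3600-task partition of the oRRS15to5 ocean graph file referenced above (3600 chosen to match the NTASKS_OCN used in the Edison example below; the output file name is assumed to follow the standard gpmetis convention):

module load metis/5.1.0
# partition the graph for 3600 MPI tasks; gpmetis writes mpas-o.graph.info.151209.part.3600 next to the input file
gpmetis mpas-o.graph.info.151209 3600

The resulting partition file may then need to be renamed or placed to match the file name pattern the model expects (see the forms listed above).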

About PIO Layout:

Input/output (IO) can also have a big impact on model performance. It is controlled by PIO_NUMTASKS and PIO_STRIDE (these appear to be set in env_run.xml, but someone should verify this). All I know about PIO comes from this email by Mark Taylor:

Regarding PIO:  we set a default stride, which means as you increase the number of tasks, you get more tasks trying to write to the filesystem.  I think this is a mistake, and on a cluster like this, you probably want to keep the maximum number of I/O processors to around 16.   (via setting pio_numtasks instead of stride, or by adjusting stride based on the number of tasks).

Also, on our institutional cluster here at Sandia (using Lustre), I get terrible performance with pnetcdf, so I change it to netcdf (this is one of the settings in env_run.xml).

You can get out-of-memory errors in PIO if the number of IO tasks is too small, but if the number of IO tasks is too large you can get out-of-memory errors in the MPI library itself because there are too many messages.
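
As a rough sketch only (assuming, per the note above, that these are ordinary env_run.xml entries; the values are illustrative, and PIO_TYPENAME is a guess at the id of the pnetcdf-vs-netcdf switch, so verify the exact ids in your env_run.xml):

<!-- illustrative values only; confirm ids and defaults in env_run.xml -->
<entry id="PIO_NUMTASKS"   value="16"  />
<entry id="PIO_TYPENAME"   value="netcdf"  />

Setting PIO_NUMTASKS caps the number of IO tasks directly; the alternative Mark mentions is to adjust PIO_STRIDE so that (total MPI tasks)/PIO_STRIDE stays near the same target.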

This section needs help

Examples:


For small core counts, it is usually most efficient to run all components sequentially, with every component using all available cores (though if one component scales poorly, it may make sense to run it on fewer cores while the remaining cores sit idle). Here is an example of a serial PE layout using 128 cores:


<entry id="NTASKS_ATM"   value="128"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   


<entry id="NTASKS_LND"   value="128"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="0"  />   


<entry id="NTASKS_ICE"   value="128"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />   


<entry id="NTASKS_OCN"   value="128"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="0"  />   


<entry id="NTASKS_CPL"   value="128"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="128"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="0"  />   


<entry id="NTASKS_ROF"   value="128"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="0"  />   


<entry id="NTASKS_WAV"   value="128"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />   


This layout can be expressed diagrammatically as below:




Fig: Serial PE layout from an earlier version of CESM/CAM, which lacks the WAV, GLC, and ROF components. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg


 


Alternatively, some components can be run in parallel. Here is an example which uses 9000 cores (375 nodes) on Edison:


<entry id="NTASKS_ATM"   value="5400"  />  
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />    


<entry id="NTASKS_LND"   value="600"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="4800"  />    


<entry id="NTASKS_ICE"   value="4800"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />    


<entry id="NTASKS_OCN"   value="3600"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="5400"  />    


<entry id="NTASKS_CPL"   value="4800"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    


<entry id="NTASKS_GLC"   value="600"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="4800"  />    


<entry id="NTASKS_ROF"   value="600"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="4800"  />    


<entry id="NTASKS_WAV"   value="4800"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />    


Here is a diagram showing how this setup translates into a processor map (x-axis) as execution evolves in time within a coupling timestep (y-axis). In this layout ATM occupies cores 0-5399, with ICE, CPL, and WAV sharing cores 0-4799 and LND, GLC, and ROF sharing cores 4800-5399, while OCN runs concurrently on cores 5400-8999:




Fig: Processor layout when several components are run in parallel. Example uses 9000 cores (375 nodes) on Edison.

Testing PE Layout Timing:

...