...

For small core counts, it is usually most efficient to run all components sequentially, with each component using all available cores. Here is an example of a serial PE layout:

<entry id="NTASKS_ATM"   value="128"  />   
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />   

<entry id="NTASKS_LND"   value="128"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="0"  />   

<entry id="NTASKS_ICE"   value="128"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />   

<entry id="NTASKS_OCN"   value="128"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="0"  />   

<entry id="NTASKS_CPL"   value="128"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    

<entry id="NTASKS_GLC"   value="128"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="0"  />   

<entry id="NTASKS_ROF"   value="128"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="0"  />   

<entry id="NTASKS_WAV"   value="128"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />   

This layout can be expressed diagrammatically as below:

PicFig: Serial PE layout from an earlier version of CESM that lacks the WAV, GLC, and ROF components. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg

To do: Insert the PE layout diagram for this case. (Note: in the figure referenced above, the blue box on the right labeled ATM is actually the ocean.)

Alternatively, some components can be run in parallel. Here is an example which uses 9000 cores (375 nodes) on Edison:

<entry id="NTASKS_ATM"   value="5400"  />  
<entry id="NTHRDS_ATM"   value="1"  />   
<entry id="ROOTPE_ATM"   value="0"  />    

<entry id="NTASKS_LND"   value="600"  />   
<entry id="NTHRDS_LND"   value="1"  />   
<entry id="ROOTPE_LND"   value="4800"  />    

<entry id="NTASKS_ICE"   value="4800"  />   
<entry id="NTHRDS_ICE"   value="1"  />   
<entry id="ROOTPE_ICE"   value="0"  />    

<entry id="NTASKS_OCN"   value="3600"  />   
<entry id="NTHRDS_OCN"   value="1"  />   
<entry id="ROOTPE_OCN"   value="5400"  />    

<entry id="NTASKS_CPL"   value="4800"  />   
<entry id="NTHRDS_CPL"   value="1"  />   
<entry id="ROOTPE_CPL"   value="0"  />    

<entry id="NTASKS_GLC"   value="600"  />   
<entry id="NTHRDS_GLC"   value="1"  />   
<entry id="ROOTPE_GLC"   value="4800"  />    

<entry id="NTASKS_ROF"   value="600"  />   
<entry id="NTHRDS_ROF"   value="1"  />   
<entry id="ROOTPE_ROF"   value="4800"  />    

<entry id="NTASKS_WAV"   value="4800"  />   
<entry id="NTHRDS_WAV"   value="1"  />   
<entry id="ROOTPE_WAV"   value="0"  />    

Here is a diagram showing how this setup translates into a processor map (x-axis) as it evolves over time within a timestep (y-axis):


Fig: processor layout when several components are run in parallel. Example uses 375 nodes on Edison.
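
To see why some components run concurrently in this layout, it helps to translate each component's ROOTPE/NTASKS pair into the global MPI ranks (and hence Edison nodes, at 24 cores per node) it occupies. A small throwaway shell sketch, not part of the model, with the numbers copied from the layout above:

# For each component, print the first and last global MPI rank and the node range it spans.
cores_per_node=24
for comp in ATM:0:5400 ICE:0:4800 CPL:0:4800 WAV:0:4800 LND:4800:600 ROF:4800:600 GLC:4800:600 OCN:5400:3600; do
  name=${comp%%:*}; rest=${comp#*:}
  root=${rest%%:*}; ntasks=${rest#*:}
  last=$((root + ntasks - 1))
  echo "$name: ranks $root-$last (nodes $((root / cores_per_node))-$((last / cores_per_node)))"
done

Components whose rank ranges do not overlap (here OCN, on ranks 5400-8999, versus everything rooted at 0) can run at the same time; components that share ranks (e.g. ATM and ICE) take turns on those cores.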

Rules for choosing a PE layout:

  1. Choose a total number of tasks that is evenly divisible by the number of cores per node on your machine (e.g., asking for 14 total cores on a machine with 12 cores/node is wasteful because you will be charged for 24 cores and 10 of them will sit idle).
  2. Try to make the number of tasks assigned to each component evenly divisible by the number of cores per node. It is more efficient to have entire nodes doing the same thing than to have some cores on a node doing one thing while other cores do something else.
  3. For single-component compsets (e.g., CORE-forced ocean, AMIP atmosphere), you might as well use a serial configuration because only the active component will consume appreciable time.

  4. For the atmosphere:
    1. Choose NTASKS_ATM so it evenly divides the number of spectral elements in your atmosphere grid. The number of elements is "nelem", which can be extracted from the latlon grid template files available for each grid on the ACME inputdata server. The number of physics columns is 9*nelem+2. It is possible to use uneven numbers of elements per MPI task, or to use more tasks than there are elements; doing so will speed up the physics but not the dynamics, so it is less efficient.
    2. For Linux clusters and low node counts (less than 1000 nodes) it is typically best to use NTHRDS_ATM=1. On Titan, Mira, and KNL systems, threads should be used. On Edison, small gains can sometimes be achieved by turning on hyperthreading and using 2 threads per MPI task with 24 MPI tasks per node.
    3. When using threads, there are several additional considerations. The number of MPI tasks times the number of threads per MPI task should equal the number of cores on the node (except when using hyperthreading on NERSC machines). The physics can make use of up to NTASKS_ATM*NTHRDS_ATM = (number of physics columns). The dynamics by default can only make use of NTASKS_ATM*NTHRDS_ATM = nelem (extra threads are OK; they just will not improve dynamics performance). The new "nested OpenMP" feature allows the dynamics to use more threads, but this compile-time option is not yet enabled by default.
    4. The table below shows the # elements and acceptable core counts for ACME atm resolutions:
    atm res    # elements    # physics columns    acceptable core counts
    ne30       5400          48602                5400, 2700, 1800, 1350, 1080, 900, 675, 600, 540, 450, 360, 300, 270, ...
    ne120      86400         777602               86400, 43200, 28800, 21600, ...


  5. The MPAS components work well at any core count but require mapping files of the form mpas-cice.graph.info.<ICE_CORE_COUNT>.nc and mpas-o.graph.info.<OCN_CORE_COUNT>.nc. These files are automatically downloaded from https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/ and https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ice/mpas-cice/ by the ACME model, but may not exist yet if nobody has used that core count before. It is trivial to generate these files, though. On Edison, you can type


module load metis/5.1.0
gpmetis <graph_file>   <# pes>

where <graph_file> is something like https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/ocn/mpas-o/oRRS15to5/mpas-o.graph.info.151209 and <# pes> is the number of cores you want to use for that component.
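
For example, to generate a partition of the oRRS15to5 ocean graph for a hypothetical 3600-task ocean (the core count here is illustrative):

module load metis/5.1.0
gpmetis mpas-o.graph.info.151209 3600

gpmetis writes its output to mpas-o.graph.info.151209.part.3600 in the current directory; rename or copy the result to match the filename pattern your model version expects.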

...

NOTE: this page is not done yet.

Testing PE Layout Timing:

It is hard to tell a priori which layout will be optimal. Luckily, you don't need to: try a couple of layouts and compare their timings (in simulated years per day, or SYPD) to figure out which is fastest. Keep in mind that timings can vary (a lot!) from run to run, so you may need to run for a while or do multiple runs with each layout to get accurate numbers. CESM recommends 20-day runs without saving any output or writing any restarts, because I/O complicates timings and is best handled as a separate optimization exercise.
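
One way to configure such a short timing run is with xmlchange against the standard env_run.xml variables (a sketch; confirm the variable names for your model version):

# Run 20 model days and skip restart writes for a cleaner timing signal.
./xmlchange -file env_run.xml -id STOP_OPTION -val ndays
./xmlchange -file env_run.xml -id STOP_N -val 20
./xmlchange -file env_run.xml -id REST_OPTION -val never

Reducing or disabling history output is handled separately in the component namelists (e.g. the user_nl_* files).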

Timing information is available in a variety of places. The log files run/cpl.log.* provide some basic information like SYPD and elapsed time. Timing broken down by component is available in case_scripts/timing/cesm_timing.<run_name>.<run_id>. This granular information is useful for load balancing cases where multiple components run in parallel: you want to increase or decrease the number of cores devoted to each component so that concurrent components take the same amount of time, which prevents components from sitting idle while they wait for their companions to finish. In the example below, it would be ideal for ICE and LND to take the same amount of time, CPL and ROF to take the same amount of time, and WAV and GLC to take the same amount of time. Additionally, we want ATM+ICE+CPL+WAV to take the same amount of time as OCN. In reality LND, ROF, and GLC are probably much quicker than ATM, ICE, and CPL, but since the number of cores they use is negligible this imbalance is tolerable.

Fig: per-component timing for the example layout discussed in the paragraph above.
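
To pull the headline numbers out of the cesm_timing files mentioned above, something like the following is usually enough (a sketch; the exact strings in the timing summary may differ between model versions):

# Overall throughput in simulated years per day (SYPD).
grep "Model Throughput" case_scripts/timing/cesm_timing.*
# Time spent in each component, which is what you compare when load balancing concurrent components.
grep "Run Time" case_scripts/timing/cesm_timing.*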