...

Fig: Serial PE layout from an earlier version of CAM, which lacks WAV, GLC, and ROF. Copied from http://www.cesm.ucar.edu/models/cesm1.0/cesm/cesm_doc_1_0_4/128pe_layout.jpg

 




Alternatively, some components can be run in parallel. Here is an example which uses 9000 cores (375 nodes) on Edison:

...

Timing information is available in several places. The log files run/cpl.log.* provide basic information such as SYPD (simulated years per day) and elapsed time. A breakdown of the time spent in each component is available in case_scripts/timing/cesm_timing.<run_name>.<run_id>. This breakdown is useful for load balancing cases where multiple components run in parallel: you increase or decrease the number of cores devoted to each component until concurrent components take the same amount of time, so that none sits idle waiting for its companions to finish. In the example below, ICE and LND should ideally take the same amount of time, as should CPL and ROF, and WAV and GLC. Additionally, ATM+ICE+CPL+WAV should take the same amount of time as OCN. In practice, LND, ROF, and GLC are probably much quicker than ATM, ICE, and CPL, but because they use a negligible number of cores this imbalance is tolerable.
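The balance targets described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `layout_imbalance` and the component times are hypothetical, and the times would have to be read by hand from the timing file, since no parsing of the real file format is attempted here.

```python
def layout_imbalance(t):
    """Return idle time (same units as the input) for each balance target.

    t maps a component name to its run time. The pairs and the chain
    follow the layout described in the text; they are not read from CESM.
    """
    # Components that run concurrently and should finish together.
    pairs = [("ICE", "LND"), ("CPL", "ROF"), ("WAV", "GLC")]
    idle = {f"{a}/{b}": abs(t[a] - t[b]) for a, b in pairs}

    # The sequential chain should take as long as the concurrent OCN.
    chain = t["ATM"] + t["ICE"] + t["CPL"] + t["WAV"]
    idle["chain/OCN"] = abs(chain - t["OCN"])
    return idle


# Hypothetical run times in seconds (made up for illustration).
example = {
    "ATM": 600.0, "ICE": 150.0, "LND": 50.0, "CPL": 40.0,
    "ROF": 5.0, "WAV": 30.0, "GLC": 2.0, "OCN": 900.0,
}

for target, seconds in layout_imbalance(example).items():
    print(f"{target}: {seconds:.1f} s idle")
```

A large value for a pair such as ICE/LND matters only if the waiting component holds many cores. The chain/OCN entry is usually the one worth tuning first.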

 

Notes from Patrick Worley on interpreting timings in case_scripts/timing/CESM_timing.rs.* (relayed from email by Peter Caldwell): For layouts like the one above (with OCN running in parallel with everything else, and ICE and LND running in parallel with each other on the same cores used for ATM), Pat computes max(OCN, ATM + max(LND, ICE)) and compares it to the "TOT Run Time". The difference between these two numbers is the time spent in communication overhead. "CPL COMM Time" is not very useful on its own because it can include load imbalance from components that finish early (a smaller "CPL COMM Time" means better load balancing). "CPL Run Time" is also unpredictable because it includes MPI communication and may depend on which nodes CPL is running on.
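Pat's calculation can be written as a short Python sketch. The function name `estimate_comm_overhead` and the sample numbers are hypothetical; the inputs are assumed to be the per-component run times and the "TOT Run Time" copied by hand from the timing file, all in the same units.

```python
def estimate_comm_overhead(ocn, atm, lnd, ice, tot_run_time):
    """Return (critical_path, overhead) for the layout described above.

    critical_path is max(OCN, ATM + max(LND, ICE)); overhead is the part
    of the total run time that the critical path does not account for.
    """
    critical_path = max(ocn, atm + max(lnd, ice))
    overhead = tot_run_time - critical_path
    return critical_path, overhead


# Hypothetical run times in seconds (not taken from a real run).
critical_path, overhead = estimate_comm_overhead(
    ocn=900.0, atm=700.0, lnd=50.0, ice=150.0, tot_run_time=1000.0
)
print(f"critical path: {critical_path:.1f} s")
print(f"communication overhead: {overhead:.1f} s "
      f"({100.0 * overhead / 1000.0:.1f}% of TOT Run Time)")
```

With these made-up numbers OCN sets the critical path, so adding cores to ATM would not shorten the run.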