
...

Input/output (I/O) can also have a big impact on model performance.  To avoid convolving I/O performance with model performance, the usual steps for optimizing the PE layout are:

  1. Choose PIO settings that are good enough that the model will not crash while reading input.
    1. Often the default settings will not work, especially on very large or very small processor counts.
  2. Disable history output and restart files for all components, then use short (5-10 day) simulations to determine an optimal processor layout.
    1. Without disabling I/O, short simulations will be dominated by I/O costs and will give misleading performance numbers.
    2. Note that input and initialization times are not included in SYPD computations.
    3. For ATM, you can disable output by simply setting nhtfrq (described here) in user_nl_cam to something longer than the length of the model run.
    4. By default, MPAS always writes output during the first timestep.  You can disable this by setting config_write_output_on_startup = .false. in user_nl_mpas-o (see the sketch after this list).
  3. Once you have a reasonable processor layout without I/O, use 1-month simulations to test PIO settings.
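As a concrete illustration, the two namelist edits from step 2 might look like the following (a minimal sketch; the nhtfrq value is a placeholder, and any frequency longer than the run length works):

  In user_nl_cam:
    nhtfrq = -8760        ! negative values mean hours, so this writes history yearly (i.e. never during a short run)

  In user_nl_mpas-o:
    config_write_output_on_startup = .false.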

...

Several aspects of the model I/O performance can be controlled at runtime by setting the PIO* parameters in env_run.xml. The model developer can start by tuning the number of PIO I/O tasks (processes that perform I/O) and the stride between the I/O tasks by setting PIO_NUMTASKS and PIO_STRIDE, respectively, in env_run.xml. By default PIO_STRIDE is set to 4 (one I/O task for every 4 MPI tasks), which means that as you increase the total number of tasks, more tasks end up trying to write to the filesystem.
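For example, these settings can be queried and adjusted with xmlchange from the case directory (the values here are illustrative starting points, not recommendations; setting one of the two parameters typically determines the other):

  ./xmlquery PIO_NUMTASKS,PIO_STRIDE
  ./xmlchange PIO_STRIDE=16      # one I/O task per 16 MPI tasks
  ./xmlchange PIO_NUMTASKS=128   # or set the number of I/O tasks directly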

You can get 'out of memory' errors in PIO if the number of I/O tasks is too small. But if the number of I/O tasks is too large, you can also get 'out of memory' errors in the MPI library itself because there are too many messages in flight. At the same time, I/O performance is better for large data sets with more I/O tasks, so there is a trade-off to tune.

...

It is hard to tell a priori which layout will be optimal. Luckily you don't need to. Try a couple of layouts and look at their timing (in simulated years per day, or SYPD) to figure out which is fastest. Keep in mind that timings can vary (a lot!) from run to run, so you may need to run for a while or do multiple runs with each layout to get accurate numbers. CESM recommends 20-day runs without saving any output or writing any restarts (because I/O complicates timings and is best handled as a separate optimization exercise). See the PIO section above for more discussion about run lengths and experimentation strategy.
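Such a 20-day timing run without restarts can be set up along these lines (a sketch using the standard CIME run-control variables; double-check the accepted values for your model version):

  ./xmlchange STOP_OPTION=ndays
  ./xmlchange STOP_N=20
  ./xmlchange REST_OPTION=never   # do not write restart files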

Timing information is available in a variety of places. The log files run/cpl.log.* provide some basic info like SYPD and time elapsed.  Info broken down into the time spent in each component is available in case_scripts/timing/cesm_timing.<run_name>.<run_id>. This granular information is useful for load balancing cases where multiple components are running in parallel, in which case you want to increase or decrease the number of cores devoted to each component in order to make concurrent components take the same amount of time. This prevents components from sitting idle while they wait for their companions to finish up.

In the example below, it would be ideal for ICE and LND to take the same amount of time, CPL and ROF to take the same amount of time, and WAV and GLC to take the same amount of time. Additionally, we want ATM+ICE+CPL+WAV to take the same amount of time as OCN. In reality, LND, ROF, and GLC are probably much quicker than ATM, ICE, and CPL, but since the number of cores they use is negligible this imbalance is tolerable.
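To adjust such a layout, the per-component task counts and root PEs can be changed with xmlchange (the counts and root PEs below are hypothetical, just to show how concurrency is expressed; the actual layout lives in env_mach_pes.xml):

  # ATM and the components sharing its cores run sequentially on the first block of PEs
  ./xmlchange NTASKS_ATM=1024,ROOTPE_ATM=0
  # OCN gets its own block of cores, so it runs concurrently with the components above
  ./xmlchange NTASKS_OCN=512,ROOTPE_OCN=1024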

...