...

Input/output (I/O) can also have a big impact on model performance. To understand the impact of I/O on model performance, start by looking at the time spent in the PIO functions (grep for pio*) in the performance archive.

Several aspects of the model's I/O performance can be controlled at runtime by setting the PIO* parameters in env_run.xml. A model developer can start by tuning the number of PIO I/O tasks (processes that perform I/O) and the stride between those tasks, by setting PIO_NUMTASKS and PIO_STRIDE respectively in env_run.xml. All I know about PIO comes from this email by Mark Taylor:

...

By default, PIO_STRIDE is set to 4, which means that as you increase the total number of tasks, you get more tasks trying to write to the filesystem.

...

Also, on our institutional cluster here at Sandia (using Lustre), I get terrible performance with pnetcdf, so I change it to netcdf (this is one of the settings in env_run.xml).

You can get out-of-memory errors in PIO if the number of I/O tasks is too small. But if the number of tasks is too large, you get out-of-memory errors in the MPI library itself because there are too many messages.
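The stride semantics described above can be sketched as follows. This is an illustrative model of the documented behavior (rank 0 and every stride-th rank after it act as I/O tasks), an assumption for exposition rather than actual PIO source code:

```python
def io_ranks(total_tasks, stride=4, numtasks=None):
    """Illustrative sketch: which MPI ranks act as PIO I/O tasks.

    If numtasks is None, it is derived from the stride, mirroring how
    the number of I/O tasks follows from PIO_STRIDE when only the
    stride is set.  This models the behavior described above; it is
    not taken from the PIO source.
    """
    if numtasks is None:
        numtasks = max(1, total_tasks // stride)
    return [r * stride for r in range(numtasks) if r * stride < total_tasks]

# With the default stride of 4, doubling the total task count doubles
# the number of tasks writing to the filesystem:
print(io_ranks(16))   # [0, 4, 8, 12]
print(io_ranks(32))   # [0, 4, 8, 12, 16, 20, 24, 28]
```

Increasing PIO_STRIDE (or capping PIO_NUMTASKS directly) reduces the number of concurrent writers, which is the lever the tuning advice below relies on.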

This section needs help

The trade-off is the one described above: too few I/O tasks cause out-of-memory errors in PIO, while too many cause out-of-memory errors in the MPI library itself. At the same time, I/O performance is better for large data sets with more I/O tasks.

Users typically have to start with the default configuration and try a couple of PIO_NUMTASKS and PIO_STRIDE options to find an optimal PIO layout for a given model configuration on a particular machine. On large machines, we have found empirically that the number of PIO I/O tasks (PIO_NUMTASKS) needs to be constrained to less than 128 to prevent out-of-memory errors.
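One way to organize such a trial-and-error sweep is to generate a few candidate (PIO_NUMTASKS, PIO_STRIDE) pairs around the default and benchmark each, keeping the number of I/O tasks under the empirical limit of 128. The helper below is hypothetical, a sketch of the tuning procedure rather than a provided tool:

```python
def candidate_layouts(total_tasks, max_io_tasks=128):
    """Hypothetical helper: (PIO_NUMTASKS, PIO_STRIDE) pairs to benchmark.

    Starts from a default-like layout (stride 4) and also tries half and
    double that number of I/O tasks, never exceeding max_io_tasks (the
    empirical out-of-memory limit observed on large machines).
    """
    base = min(max(1, total_tasks // 4), max_io_tasks)
    trials = {max(1, base // 2), base, min(base * 2, max_io_tasks)}
    return sorted((n, total_tasks // n) for n in trials)

for layout in candidate_layouts(1024):
    print(layout)   # (64, 16) then (128, 8)
```

Each candidate can then be applied in env_run.xml (for example via CIME's xmlchange) and timed against a short representative run; only the measured I/O time decides the winner.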

Examples:


For small core counts, it is usually most efficient to run all the components sequentially, with every component using all available cores (though if one component scales poorly, it may make sense to run it on fewer processors while the remaining processors idle). Here is an example of a serial PE layout:

...