...

It is important to employ the optimal ncclimo parallelization strategy for your computer hardware resources. Select from the three available choices with the '-p par_typ' switch. The options are serial mode ('-p nil' or '-p serial'), background mode parallelism ('-p bck'), and MPI parallelism ('-p mpi'). The default is background mode parallelism, which is appropriate for lower resolution (e.g., ne30L72) simulations on most nodes at high-performance computer centers. Use (or at least start with) serial mode on personal laptops/workstations. Serial mode requires twelve times less RAM than the parallel modes, and is much less likely to deadlock or cause OOM (out-of-memory) conditions on your personal computer. If the available RAM (+swap) is < 12*4*sizeof(monthly input file), then try serial mode first (12 is the optimal number of parallel processes for monthly climos, and the computational overhead per process is roughly a factor of four times the input file size). EAM ne30L30 output is about ~1 GiB per month, so each month requires about 4 GiB of RAM. EAMv1 ne30L72 output (with LINOZ) is about ~10 GiB/month, so each month requires ~40 GiB of RAM. EAMv1 ne120L72 output is about ~12 GiB/month, so each month requires ~48 GiB of RAM. The computer does not actually use all this memory at one time, and many kernels compress RAM usage to below what top reports, so the actual physical usage is hard to pin down, but it may be a factor of 2.5-3.0 (rather than four) times the size of the input file. For instance, a 16 GB MacBook Pro will successfully run an ne30L30 climatology (that requests 48 GB RAM) in background mode, but the laptop will be slow and unresponsive for other uses until it finishes the climos (in 6-8 minutes). Experiment a bit and choose the parallelization option that works best for you.
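As a concrete illustration (a minimal sketch only: the case name, years, and directories below are placeholders, and a real invocation will need the options appropriate to your own run), the '-p' switch is all that changes between the modes:

# Placeholder paths and case name -- adjust to your own simulation
drc_in=${HOME}/e3sm_output   # directory containing the monthly history files
drc_out=${HOME}/climo        # directory to receive the climatology files

# Serial mode: lowest RAM footprint, good first choice on a laptop/workstation
ncclimo -p serial -c caseid -s 1 -e 5 -i ${drc_in} -o ${drc_out}

# Background mode (the default): 12 monthly climos run simultaneously, so
# budget roughly 12*4*sizeof(monthly input file) of RAM (+swap)
ncclimo -p bck -c caseid -s 1 -e 5 -i ${drc_in} -o ${drc_out}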

Serial mode, as its name implies, uses one core at a time for climos, and computes the monthly, then seasonal, then annual climatologies sequentially. Only the climos are performed serially; the regridding will employ OpenMP threading on platforms that support it, and use up to 16 cores. By design each month and each season are independent of the others, so all months can be computed in parallel, then each season can be computed in parallel (using the monthly climatologies), then the annual average can be computed. Background parallelization mode exploits this independence and executes the climos in parallel as background processes on a single node, so that twelve cores are simultaneously employed for monthly climatologies, four for seasonal, and one for annual. The optional regridding will employ up to two cores per process. MPI parallelism executes the climatologies on different nodes, so that up to (optimally) twelve nodes are employed performing monthly climos. The full memory of each node is available for each individual climo. The optional regridding will employ up to eight cores per node. MPI mode, or background mode on a big-memory queue, must be used to process ne30L72 and ne120L72 climos on some, but not all, DOE computers. For example, attempting an ne120L72 climo in background mode on Cori (i.e., on one 96 GB compute node) will fail due to OOM. (OOM errors do not produce useful return codes, so if your climo processes die without printing useful information, the cause may be OOM.) However, the same climo will succeed if executed on a single big-memory (1 TB) node on Cori (use -C amd on Cori, or -lpartition=gpu on Andes, as shown below). Alternatively, MPI mode can be used for any climatology. The same ne120L72 climo will also finish blazingly fast in background mode on Cooley (i.e., on one 384 GB compute node), so MPI mode is unnecessary on the beefiest nodes. In general, the fatter the memory, the better the performance.
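As a rough sketch of how MPI mode is typically driven (the exact batch-queue commands for specific machines appear later on this page; the node count, walltime, and account below are placeholders), the climo is launched from within a multi-node allocation so that each monthly climo can land on its own node:

# Hypothetical SLURM allocation; adjust account, walltime, queue, and
# constraints (e.g., -C amd on Cori, -lpartition=gpu on Andes) for your machine
salloc --nodes=12 --time=00:30:00 --account=PROJECT

# With 12 nodes, each of the 12 monthly climos gets a full node's memory to itself
ncclimo -p mpi -c caseid -s 1 -e 5 -i ${drc_in} -o ${drc_out}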

This implementation of parallelism for climatology generation once had relatively poor granularity: nodes using background or MPI mode always computed 12 monthly climatologies simultaneously, nodes using serial mode computed only 1 climatology at a time, and there was no granularity in between these extremes. The '-j job_nbr' option (also available in splitter mode and in ncremap) allows the user to specify the exact granularity to match the node's resources. Here job_nbr specifies the maximum number of simultaneous climo tasks (averaging, regridding) to send to a node at one time. For example, if job_nbr=4 then the 12 monthly climos will be computed in three sequential batches of four months each. In MPI mode those four months are sent to different nodes, and in background mode those four months are computed on the host node. Some nodes, e.g., your personal workstation, are underpowered for 12 climo tasks yet overpowered for 1 task, and so benefit from the improved granularity.
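For instance (again a minimal sketch with placeholder names), a workstation that can comfortably average four months at once would be invoked as:

# Limit background mode to at most 4 simultaneous climo tasks
ncclimo -p bck -j 4 -c caseid -s 1 -e 5 -i ${drc_in} -o ${drc_out}
# The 12 monthly climos then run as 3 sequential batches of 4 months each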

For a Single, Dedicated Node on LCFs:

The basic approach above (running the script from a standard terminal window) can be unpleasantly slow on login nodes of LCFs and for longer or higher resolution (e.g., ne120) climatologies. As a baseline, generating a climatology of 5 years of ne30 (~1x1 degree) EAM output with ncclimo takes 1-2 minutes on rhea (at a time with little contention), and 6-8 minutes on a 2014 MacBook Pro. To make things a bit faster at LCFs, you can ask for your own dedicated node (note that this approach only makes sense on supercomputers that have a job-control queue). On rhea, do this via:

...