...

This implementation of parallelism for climatology generation once had relatively poor granularity: nodes using Background or MPI parallelism always computed all 12 monthly climatologies simultaneously, nodes using Serial mode computed only one climatology at a time, and there was no granularity in between these extremes. The -j $job_nbr option (also in ncremap) allows the user to specify the exact granularity to match the node's resources. Here $job_nbr specifies the maximum number of simultaneous climo tasks (averaging, regridding) to send to a node at one time. The default value of job_nbr is 12 for monthly climatologies in both MPI and Background parallelism modes, and it can be overridden to improve granularity. For example, if --job_nbr=4 is explicitly requested, the 12 monthly climos are computed in three sequential batches of four months each; in Background mode this means four months are computed simultaneously on the host node. In splitter (not climo) mode, ncclimo automatically sets job_nbr to the number of nodes available, so invoking ncclimo with four nodes in splitter mode means each of the four nodes receives one splitter task. Some nodes, e.g., your personal workstation, are underpowered for 12 climo tasks yet overpowered for one, and so benefit from improved granularity.
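For example, on a workstation with resources for roughly four simultaneous averaging tasks, an invocation might look like the sketch below (the case name, years, and directories are placeholders, not taken from this page):

ncclimo -p bck --job_nbr=4 -c caseid -s 2001 -e 2005 -i $drc_in -o $drc_out # compute 12 monthly climos in three sequential batches of 4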

...

Climos on Single Compute Nodes at LCFs

The basic approach above (running the script from a standard terminal window) works well for small cases, yet can be unpleasantly slow on login nodes of LCFs and for longer or higher resolution (e.g., ne120) climatologies. As a baseline, generating a climatology of 5 years of ne30pg2 (~1x1 degree) EAM output with ncclimo takes 1-2 minutes on rhea (at a time with little contention), and 6-8 minutes on a 2014 MacBook Pro. To make things a bit faster at LCFs, you can ask for your own dedicated node (note this approach does not make sense except on supercomputers that have a job-control queue). On Perlmutter do this via:

srun -A e3sm --constraint=cpu --nodes=1 --time=00:30:00 --qos=debug --job-name=ncclimo --pty bash # Perlmutter CPU node


Acquiring a dedicated node is useful for any calculation you want to do quickly, not just creating climo files, though it does burn through our computing allocation, so be prudent. This command returns a prompt once a node is assigned (the prompt is returned in your home directory, so you may then have to cd to the location you mean to run from). At that point you can simply invoke ncclimo as described above (a sketch follows the flag list below). It will be faster because you are not sharing the node with other people. Again, ne30pg2L72 climos only require < 2 minutes, so the 30 minutes requested in the example is conservative. Tune it with experience. Here is the meaning of each flag used:

-A: Name of the account to charge for time used. This page may be useful for figuring that out if the above defaults don't work: /wiki/spaces/ED/pages/1114710
--constraint: Type of node to request (cpu in the example above)
--nodes=1: Number of nodes to request. ncclimo will use multiple cores per node.
--time: How long to keep this dedicated node. Unless you kill the shell created by the srun command, the shell will exist for this amount of time, then die suddenly. In the above example, 30 minutes is requested.

--qos: Quality of service (queue) name, needed at centers that have multiple queues with no default queue


...


--pty bash: Submit in interactive mode = return a prompt rather than running a program
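
Once the prompt returns on the dedicated node, the basic invocation described above can be run directly. A minimal sketch, with placeholder case name, years, and directories (not taken from this page):

ncclimo -c caseid -s 1980 -e 1984 -i $drc_in -o $drc_out # 5-year monthly climo computed on the dedicated node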

Climos on Multiple Nodes at LCFs

The above parallel approaches will fail when a single node lacks enough RAM (plus swap) to store all twelve monthly input files, plus extra RAM for computations. One should employ MPI multi-node parallelism (-p mpi) on nodes with less RAM than 12*3*sizeof(monthly input). The longest an ne120 climo will take is less than half an hour (~25 minutes on Edison or Rhea), so the simplest method to run MPI jobs is to request 12 interactive nodes using the above commands (though remember to add -p mpi), then execute the script at the command line. It is also possible, and sometimes preferable, to request non-interactive compute nodes in a batch queue. Executing an MPI-mode climo (on machines with job scheduling and, optimally, 12 available nodes) in a batch queue can be done in two commands. First, write an executable file that calls the ncclimo script with appropriate arguments. We do this below by echoing to a file ~/ncclimo.pbs, but you could also open an editor, copy the stuff in quotes below into a file, and save it:

...