...
readonly COMPSET="WCYCL1850"
: compset (configuration)
readonly RESOLUTION="ne30pg2_EC30to60E2r2"
: resolution (low-resolution coupled simulation in this case). ne30 is the number of spectral elements for the atmospheric dynamics grid, while pg2 refers to the physics grid option. This mesh grid spacing is approximately 110 km. EC30to60E2r2 is the ocean and sea-ice resolution; the grid spacing varies between 30 and 60 km. For simulations with regionally refined meshes, such as the N American atmosphere grid coupled to the WC14 ocean and sea-ice, replace with northamericax4v1pg2_WC14to60E2r3 .
readonly CASE_NAME="v2.LR.piControl"
: v2.LR is a short custom description to help identify the simulation. piControl is the type of simulation. Other options here include, but are not limited to: amip , F2010 .
readonly CASE_GROUP="v2.LR"
: This will let you mark multiple cases as part of the same group for later processing (e.g., with PACE).
Note: If this is part of a simulation campaign, ask your group lead about using a case_group label. Otherwise, please use a unique name to distinguish it from existing case_group label names, e.g., "v2.LR".
# Code and compilation
readonly CHECKOUT="20210702"
: date the code was checked out, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.
readonly BRANCH="master"
: branch the code was checked out from. Valid options include "master", a branch name, or a git hash. For provenance purposes, it is best to specify the git hash.
readonly DEBUG_COMPILE=false
: option to compile with the DEBUG flag (leave set to false)
...
Before starting a long production run, it is highly recommended to perform a few short tests to verify:
The model starts without errors.
The model produces BFB (bit-for-bit) results after a restart.
The model produces BFB results when changing PE layout.
(1) can spare you from a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.
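As an illustration of check (2), here is a minimal, hedged sketch (not an official E3SM tool; the script name, arguments, and the assumption that matching output files should be byte-identical are ours) that compares identically named files from a baseline run and a restarted run:

```shell
#!/bin/bash
# Sketch: bit-for-bit comparison of identically named files in two run directories.
# Usage: ./bfb_compare.sh <baseline_run_dir> <restarted_run_dir>
dir_a=$1
dir_b=$2
status=0
for f in "$dir_a"/*; do
  name=$(basename "$f")
  if cmp -s "$f" "$dir_b/$name"; then
    echo "BFB: $name"       # files are byte-identical
  else
    echo "DIFFERS: $name"   # any difference breaks bit-for-bit reproducibility
    status=1
  fi
done
exit $status
```

Note that a raw cmp can flag files that differ only in metadata such as creation timestamps; CIME's cprnc tool, which compares history files field by field, is the more robust choice when available.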
...
To post-process a model run, do the following steps. Note that to post-process up to year n, you must have short-term archived up to year n.
Install zppy
If you haven't already, check out the zppy repo in <code_source_dir>: go to https://github.com/E3SM-Project/zppy , get the path to clone by clicking the green "Code" button, and run git clone <path> . You can ask questions about zppy at https://github.com/E3SM-Project/zppy/discussions/categories/general .
Load the E3SM unified environment. For Chrysalis, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
Commands for other machines, and the installation guide for the development version of zppy, can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html
...
If you run ls , you'll probably see files like e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status . Each such file corresponds to one e3sm_diags job. Parts of the file name are explained below:
...
To find the number of nodes, first look at the Processor # / Simulation Time chart on PACE. The x-axis lists the highest MPI rank used, with base-0 numbering of ranks. (PE layouts often don’t fit exactly N nodes but instead fill N-1 nodes and have some number of ranks left over on the final node, leaving some cores on that node unused.) Then find MPI tasks/node in the “Experiment Details” section. The number of nodes can then be calculated as ceil((highest MPI rank + 1) / (MPI tasks/node)).
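For example, with made-up numbers (a hypothetical chart ending at rank 1349 and 64 MPI tasks/node), the ceiling can be computed with integer arithmetic in the shell:

```shell
# Hypothetical values, for illustration only:
highest_rank=1349      # furthest-right value on the Processor # axis (base-0)
tasks_per_node=64      # "MPI tasks/node" from the Experiment Details section
# ceil((highest_rank + 1) / tasks_per_node) via integer arithmetic:
nodes=$(( (highest_rank + 1 + tasks_per_node - 1) / tasks_per_node ))
echo "$nodes"          # → 22 (1350 ranks fill 21 full nodes plus a partial 22nd)
```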
The SYPD (simulated years per day) is listed in PACE’s “Experiment Details” section as Model Throughput
.
...
Simulations that are deemed sufficiently valuable should be archived using zstash
for long-term preservation.
You can ask questions about zstash
on https://github.com/E3SM-Project/zstash/discussions/categories/general
Compy, Anvil and Chrysalis do not have local HPSS. We rely on NERSC HPSS for long-term archiving.
...
There may still be more files than necessary to archive. You can probably remove *.err , *.lock , *debug_block* , and *ocean_block_stats* files.
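A cautious way to do that cleanup (the case directory is a placeholder from the text above; the find patterns mirror the list, so preview the matches before deleting anything):

```shell
# Run from the case directory, e.g. <simulations_dir>/<case_name>
# 1) Preview what would be removed:
find . -name '*.err' -o -name '*.lock' \
       -o -name '*debug_block*' -o -name '*ocean_block_stats*'
# 2) Delete once you've reviewed the list:
find . \( -name '*.err' -o -name '*.lock' \
       -o -name '*debug_block*' -o -name '*ocean_block_stats*' \) -delete
```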
2. zstash create
...
On the machine that you ran the simulation on:
...
& Transfer to NERSC HPSS
2.a. E3SM Unified v1.6.0 / zstash v1.2.0 or greater
If you are using E3SM Unified v1.6.0
or greater, https://github.com/E3SM-Project/zstash/pull/154 has enabled Globus (https://www.globus.org/ ) transfer with zstash create
.
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=globus://nersc/home/<first letter>/<username>/E3SMv2/${EXP} --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
...
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
# (On Chrysalis, for 165 years of data, this takes ~14 hours)
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# `index.db` and a `zstash_create` log should be present
# No tar files should be listed
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
3. Transfer to NERSC
Then, on a NERSC machine (Cori):
Code Block |
---|
$ mkdir -p /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should be empty |
Log into Globus, using your NERSC credentials: https://www.globus.org/
...
(Left hand side): Transfer from <the machine's DTN> <simulations_dir>/<case_name>/zstash
...
# Tar files and `index.db` should be listed.
# Note `| wc -l` doesn't work on hsi
$ exit |
2.b. Earlier releases
2.b.i. zstash create
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=none --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
Then, do the following:
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
# (On Chrysalis, for 165 years of data, this takes ~14 hours)
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# tar files, `index.db`, and a `zstash_create` log should be present
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
2.b.ii. Transfer to NERSC
On a NERSC machine (Cori):
Code Block |
---|
$ mkdir -p /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should be empty |
Log into Globus, using your NERSC credentials: https://www.globus.org/
(Left hand side): Transfer from
<the machine's DTN> <simulations_dir>/<case_name>/zstash
Click enter on path and "select all" on left-hand side
Transfer to
NERSC DTN /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
Notice we're using cfs rather than scratch on Cori
Click enter on the path
Click “Transfer & Sync Options” in the center.
Choose:
“sync - only transfer new or changed files” (choose “modification time is newer” in the dropdown box)
“preserve source file modification times”
“verify file integrity after transfer”
For “Label This Transfer”: “zstash <case_name> <machine name> to NERSC”
Click "Start" on the left hand side.
You should get an email from Globus when the transfer is completed. (On Chrysalis, for 165 years of data, this transfer takes ~13 hours).
...
2.b.iii. Transfer to HPSS
On a NERSC machine (Cori):
Code Block |
---|
Log in to Cori
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls | wc -l
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ ls *.tar | wc -l
# Should be two less than the previous result,
# since `index.db` and the `zstash_create` log are also present.
$ hsi
$ pwd
# Should be /home/<first letter>/<username>
$ ls E3SMv2
# Check what you already have in the directory
# You don't want to accidentally overwrite a directory already in HPSS.
$ exit
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
$ screen
$ screen -ls # Output should say "Attached"
# https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss/managing-files-hsi
# cput will not transfer file if it exists.
$ hsi "cd /home/<first letter>/<username>/E3SMv2/; cput -R <case_name>"
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the `hsi` command to finish
# (On Chrysalis, for 165 years of data, this takes ~2 hours)
$ screen -r # Return to screen
# Check output for any errors
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ exit |
...
3. zstash check
On a NERSC machine (Cori):
Code Block |
---|
$ cd /global/homes/<first letter>/<username>
$ emacs batch_zstash_check.bash |
Paste the following in that file, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on NERSC dtn
# Load environment that includes zstash
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Checking ${EXP} ===
cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
mkdir -p ${EXP}/zstash
cd ${EXP}
stamp=`date +%Y%m%d`
time zstash check -v --hpss=/home/<first letter>/<username>/E3SMv2/${EXP} --workers 2 2>&1 | tee zstash/zstash_check_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
If you’re using E3SM Unified v1.6.0 / zstash v1.2.0 (or greater) and you want to check a long simulation, you can use the --tars
option introduced in https://github.com/E3SM-Project/zstash/pull/170 to split the checking into more manageable pieces:
Code Block |
---|
# Starting at 00005a until the end
zstash check --tars=00005a-
# Starting from the beginning to 00005a (included)
zstash check --tars=-00005a
# Specific range
zstash check --tars=00005a-00005c
# Selected tar files
zstash check --tars=00003e,00004e,000059
# Mix and match
zstash check --tars=000030-00003e,00004e,00005a- |
Then, do the following:
Code Block |
---|
$ ssh dtn01.nersc.gov
$ screen
$ screen -ls # Output should say "Attached"
$ cd /global/homes/<first letter>/<username>
$ ./batch_zstash_check.bash
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the script to finish
# (On Chrysalis, for 165 years of data, this takes ~5 hours)
$ screen -r # Return to screen
# Check that output ends with `INFO: No failures detected when checking the files.`
# as well as listing `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ exit # exit data transfer node
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>/zstash
$ tail zstash_check_<stamp>.log
# Output should match the output from the screen (without the time information) |
...
Note |
---|
Because of https://github.com/E3SM-Project/zstash/issues/167 , for now it is a good idea to run |
4. Document
On a NERSC machine (Cori):
...
Update the simulation Confluence page with information regarding this simulation (for Water Cycle’s v2 work, that page is V2 Simulation Planning, /wiki/spaces/ED/pages/2766340117). In the zstash archive column, specify:
/home/<first letter>/<username>/E3SMv2/<case_name>
zstash_create_<stamp>.log
zstash_check_<stamp>.log
...
5. Delete files
On a NERSC machine (Cori):
...