IN PROGRESS

Useful Aliases

Setting some aliases may be useful in running the model. You can edit ~/.bashrc to add aliases. Run source ~/.bashrc to start using them.

Note that the specific file name may differ amongst machines. Other possibilities include ~/.bash_profile, ~/.bashrc.ext, or ~/.bash_profile.ext.

Batch jobs

To check on all batch jobs:

alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10M %.10l %.8Q %55j" --sort=P,-t,-p'

To check on your batch jobs:

alias sq='sqa -u $USER'

The output of sq uses several abbreviations: ST = Status, R = running, PD = pending, CG = completing.

Directories

You will be working in several directories.

On anvil or chrysalis (i.e., LCRC machines), those paths are:

<run_scripts_dir>: ${HOME}/E3SM/scripts

<code_source_dir>: ${HOME}/E3SM/code

<post_processing_script_dir>: ${HOME}/E3SM/utils/post_v2

<simulations_dir>: /lcrc/group/e3sm/<username>/E3SM_simulations

So, it may be useful to set the following aliases:

# Model running
alias run_scripts="cd ${HOME}/E3SM/scripts/"
alias simulations="cd /lcrc/group/e3sm/<username>/E3SM_simulations/"
alias post_processing="cd ${HOME}/E3SM/utils/post_v2/"

On other machines, the paths are the same, except for the <simulations_dir>.

On compy (PNNL):

<simulations_dir>: /compyfs/<username>/E3SM_simulations

On cori (NERSC):

<simulations_dir>: ${CSCRATCH}/E3SM_simulations
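The per-machine paths above could be captured in a small shell function for ~/.bashrc. This is only a sketch: simulations_dir_for is a hypothetical helper name, not part of E3SM, and the username alice in the usage line is a placeholder.

```shell
# Sketch only: simulations_dir_for is a hypothetical helper, not part of E3SM.
# Usage: simulations_dir_for <machine> [username]
simulations_dir_for() {
    machine="$1"
    user="${2:-${USER}}"   # default to the current user
    case "${machine}" in
        anvil|chrysalis) echo "/lcrc/group/e3sm/${user}/E3SM_simulations" ;;
        compy)           echo "/compyfs/${user}/E3SM_simulations" ;;
        cori)            echo "${CSCRATCH}/E3SM_simulations" ;;
        *)               echo "unknown machine: ${machine}" >&2; return 1 ;;
    esac
}

simulations_dir_for compy alice
# prints /compyfs/alice/E3SM_simulations
```

With such a helper, the simulations alias could become alias simulations='cd "$(simulations_dir_for chrysalis)"' on any of the machines listed.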

Configuring the Model Run – Run Script

Start with an example of a run script for a low-resolution coupled simulation: run.20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis.sh. Create a new run script or copy an existing one. The path to it should be <run_scripts_dir>/run.<case_name>.sh

# Machine and project

readonly MACHINE=chrysalis: the name of the machine you’re running on.
readonly PROJECT="e3sm": SLURM project accounting (typically e3sm).

# Simulation

readonly COMPSET="WCYCL1850" : compset (configuration)
readonly RESOLUTION="ne30pg2_EC30to60E2r2": resolution (low-resolution coupled simulation in this case)
- ne30 is the number of spectral elements for the atmospheric dynamics grid, while pg2 refers to the physics grid option. This mesh grid spacing is approximately 110 km.
- EC30to60E2r2 is the ocean and sea-ice resolution. The grid spacing varies between 30 and 60 km.
- For simulations with regionally refined meshes such as the N American atmosphere grid coupled to the WC14 ocean and sea-ice, replace with northamericax4v1pg2_WC14to60E2r3.
readonly DESCRIPTOR="v2beta4.piControl":
- v2beta4 is a short custom description to help identify the simulation.
- piControl is the type of simulation. Other options here include, but are not limited to: amip, F2010.
readonly CASE_GROUP="v2beta4.piControl":
- This will let you mark multiple cases as part of the same group for later processing (e.g., with PACE).

# Code and compilation

readonly CHECKOUT="20210409": date the code was checked out, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.
readonly BRANCH="master": branch the code was checked out from. Valid options include “master”, a branch name, or a git hash.
readonly DEBUG_COMPILE=false : option to compile with DEBUG flag (leave set to false)

# Run options

readonly MODEL_START_TYPE="initial" : specifies how the model should start: from initial conditions, or by continuing from existing restart files.
readonly START_DATE="0001-01-01" : model start date. Typically year 1 for simulations with perpetual (time-invariant) forcing, or a real year for simulations with transient forcing.

# Case name

readonly CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.${MACHINE} : the case name is a unique identifier for the simulation. It is constructed from the variables defined above. If there is no risk of ambiguity, the machine name can be dropped: CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.
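As an illustration of how the case name is assembled, the snippet below plugs in the example values used in this section. It is a standalone sketch (readonly is omitted so it can be pasted into an interactive shell more than once); the variable RUN_SCRIPT is introduced here only for illustration.

```shell
# Values copied from the example run script described above.
CHECKOUT="20210409"
DESCRIPTOR="v2beta4.piControl"
RESOLUTION="ne30pg2_EC30to60E2r2"
MACHINE="chrysalis"

CASE_NAME="${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.${MACHINE}"
RUN_SCRIPT="run.${CASE_NAME}.sh"   # illustrative variable, not in the run script

echo "${CASE_NAME}"
# prints 20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis
echo "${RUN_SCRIPT}"
# prints run.20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis.sh
```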

# Set paths

readonly CODE_ROOT="${HOME}/E3SM/code/${CHECKOUT}": where the E3SM code will be checked out.
readonly CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SM_simulations/${CASE_NAME}": where the results will go. The directory ${CASE_NAME} will be in <simulations_dir>.

# Sub-directories (leave unchanged)

readonly CASE_BUILD_DIR=${CASE_ROOT}/build : all the compilation files, including the executable.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive : where short-term archived files will reside.

# Define type of run

readonly run='production': type of simulation to run. Short test for verification or long production run. (See next section for details).

# Coupler history

readonly HIST_OPTION="nyears"
readonly HIST_N="5"

# Leave empty (unless you understand what it does)

readonly OLD_EXECUTABLE="" : this is a somewhat risky option that allows you to re-use a pre-existing executable. This is not recommended because it breaks provenance.

# --- Toggle flags for what to do ----

This section controls what operations the script should perform. The run e3sm script can be invoked multiple times, and the user can bypass certain steps by toggling the flags between true and false.

do_fetch_code=true : fetch the source code from GitHub.
do_create_newcase=true : create new case.
do_case_setup=true : case setup.
do_case_build=true : compile.
do_case_submit=true : submit simulation.

The first time the script is called, all the flags should be set to true. Subsequently, the user may decide to bypass code checkout (do_fetch_code=false) or compilation (do_case_build=false). A user may also prefer to manually submit the job by setting do_case_submit=false and then invoking ./case.submit.
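The effect of the toggle flags can be pictured with the sketch below. It is not the code of the actual run script: run_step is a hypothetical helper that only prints what would happen, and the flag values shown are those of a later invocation that bypasses code checkout and compilation.

```shell
# Hypothetical sketch: run_step and the step names are illustrative, not the
# actual functions of the run script.
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

run_step() {
    # $1 = step name, $2 = value of the matching do_* flag
    if [ "$2" = "true" ]; then
        echo "running: $1"
    else
        echo "skipping: $1"
    fi
}

run_step fetch_code     "${do_fetch_code}"
run_step create_newcase "${do_create_newcase}"
run_step case_setup     "${do_case_setup}"
run_step case_build     "${do_case_build}"
run_step case_submit    "${do_case_submit}"
```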

Running the Model

Short tests

Before starting a long production run, it is highly recommended to perform a few short tests to verify that:

(1) The model starts without errors.
(2) The model produces BFB (bit-for-bit) results after a restart.
(3) The model produces BFB results when changing PE layout.

Check (1) can spare you a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.

Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch such non-BFB errors, it is possible that you’ll be running a slightly different configuration (for example a different physics option) for which those tests have not been performed.

Running short tests

The type of run to perform is controlled by the script variable run.

You should typically perform at least two short tests (two different layouts, with and without restart).

First, let’s start with a short test using the 'S' (small) PE layout and running for 2x5 days:

readonly run='S_2x5_ndays'

If you have not fetched and compiled the code, set all the toggle flags to true:

do_fetch_code=true
do_create_newcase=true
do_case_setup=true
do_case_build=true
do_case_submit=true

At this point, execute the run e3sm script:

cd <run_scripts_dir>
./run.<case_name>.sh

Fetching the code and compiling it will take some time (30 to 45 minutes), so go ahead and brew yourself a fresh cup of coffee. Once the script has finished, the test job will have been submitted to the batch queue.

You can immediately edit the script to prepare for the second short test. In this case, we will be running for 10 days (without restart) using the 'M' (medium) PE layout:

readonly run='M_1x10_ndays'

Since the code has already been fetched and compiled, change the toggle flags:

do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

and execute the script:

cd <run_scripts_dir>
./run.<case_name>.sh

Since we are bypassing the code fetch and compilation (by re-using the previous executable), the script should only take a few seconds to run and again should submit the second test.

Note that short tests use separate output directories, so it is safe to submit and run multiple tests at once. If you’d like, you could submit additional tests, for example 10 days with the medium 80 nodes ('M80') layout (M80_1x10_ndays).

Verifying results are BFB

Once the short tests are complete, we can confirm the results were bit-for-bit (BFB) the same. All the test output is located under the tests sub-directory:

cd <simulations_dir>/<case_name>/tests
for test in *
do
  gunzip -c ${test}/run/atm.log.*.gz | grep '^ nstep, te ' | uniq > atm_${test}.txt
done
md5sum *.txt
5bdfee6da8433cde08f33f3c046653c6  atm_M_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6  atm_M80_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6  atm_S_2x5_ndays.txt

To verify that the results are indeed BFB, we extract the global integrals from the atmosphere log files (lines starting with ‘nstep, te’) and make sure that they are identical for all tests.

If the BFB check fails, you should stop here and understand why. If it succeeds, you can now start the production simulation.
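Rather than comparing the checksums by eye, the comparison could be scripted. The function below is a minimal sketch, assuming md5sum is available; all_identical is a hypothetical helper name, and it only succeeds when every file passed to it has the same checksum.

```shell
# Sketch: succeed only if every file given has the same MD5 checksum.
# all_identical is a hypothetical helper, not part of E3SM.
all_identical() {
    # md5sum prints "<hash>  <file>"; keep the hash column and count the
    # number of distinct hashes.
    n=$(md5sum "$@" 2>/dev/null | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
    [ "${n}" -eq 1 ]
}

# Example use, from <simulations_dir>/<case_name>/tests after the atm_*.txt
# files have been generated:
#   if all_identical atm_*.txt; then echo "BFB check passed"; else echo "BFB check FAILED"; fi
```

If no files match, or the checksums differ, the function returns a non-zero status, so it can also guard a later step in a script.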

Production simulation

To prepare for the long production simulation, edit the run e3sm script and set:

readonly run='production'

In addition, you may need to customize some variables in the code block below to configure run options:

# Production simulation

readonly PELAYOUT="M": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Production simulations typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.
readonly WALLTIME="28:00:00" : maximum wall clock time requested for the batch jobs.
readonly STOP_OPTION="nyears"
readonly STOP_N="20" : units and length of each segment (i.e. each batch job)
readonly REST_OPTION="nyears"
readonly REST_N="5" : units and frequency for writing restart files (make sure STOP_N is a multiple of REST_N, otherwise the model will stop without writing a restart file at the end).
readonly RESUBMIT="9" : number of resubmissions beyond the original segment. This simulation would run for a total of 200 years (20 + 9x20).
readonly DO_SHORT_TERM_ARCHIVING=false : leave set to false if you want to run short-term archiving manually.
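The two constraints described above (STOP_N a multiple of REST_N, and the total simulated length) can be checked with a little shell arithmetic before submitting. This is a standalone sketch that assumes STOP_OPTION and REST_OPTION use the same units, as in the example; TOTAL_YEARS is a name introduced here for illustration.

```shell
# Values from the example above; both options are in years.
STOP_N=20
REST_N=5
RESUBMIT=9

# A restart file is only written at the end of a segment if STOP_N is a
# multiple of REST_N.
if [ $((STOP_N % REST_N)) -ne 0 ]; then
    echo "warning: STOP_N=${STOP_N} is not a multiple of REST_N=${REST_N}" >&2
fi

# One original segment plus RESUBMIT resubmitted segments.
TOTAL_YEARS=$((STOP_N * (RESUBMIT + 1)))
echo "total simulated years: ${TOTAL_YEARS}"
# prints: total simulated years: 200
```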

Since the code has already been fetched and compiled for the short tests, the toggle flags can be set to:

do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

Finally, execute the script:

cd <run_scripts_dir>
./run.<case_name>.sh

The script will automatically submit the first job. New jobs will automatically be resubmitted at the end of each segment until the total number of segments has been run.

Looking at Results

cd <simulations_dir>/<case_name>
ls

Explanation of directories:

build: all the stuff to compile. The executable (e3sm.exe) is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
- atm: atmosphere
- cpl: coupler
- ice: sea ice
- lnd: land
- ocn: ocean
- rof: river runoff

Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.

You can use the sq alias defined in the “Useful Aliases” section to check on the status of the job. The NODE column in the output indicates the number of nodes used, which depends on the processor_config / PELAYOUT size.

When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be the same bit-for-bit. It is not possible, using floating-point operations, to get bit-for-bit identical results across machines/compilers.

Compressing the logs to .gz files is one of the last steps before the job is done, so gzipped logs indicate successful completion of the segment. less <log>.gz will let you look directly at a gzipped log.

Short Term Archiving

Short term archiving can be accomplished with the following steps. This can be done while the model is still running.

Use --force-move to move files instead of copying them, since copying can take a long time. Set --last-date to the latest date in the simulation you want to archive. You do not have to specify a beginning date.

cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <simulations_dir>/<case_name>/archive
ls

Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files and rest holds the restart files.

Component	Subdirectory	Files in the Subdirectory
Atmosphere (Earth Atmospheric Model)	`atm/hist`	`eam.h*`
Coupler	`cpl/hist`	`cpl.h*`
Sea Ice (MPAS-Sea-Ice)	`ice/hist`	`mpassi.hist.*`
Land (Earth Land Model)	`lnd/hist`	`elm.h*`
Ocean (MPAS-Ocean)	`ocn/hist`	`mpaso.hist.*`
River Runoff (MOSART)	`rof/hist`	`mosart.h*`

Performance Information

Model throughput is the number of simulated years per day. You can find this with:

cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm*

PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. You can also simply search by providing the JobID appended to log files (NNNNN.yymmdd-hhmmss where NNNNN is the Slurm job id). Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.

Re-Submitting a Job After a Crash

If a job crashes, you can rerun with:

cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case.submit

To gzip log files from failed jobs, run gzip *.log.<job ID>.*

Post-Processing with zppy (needs update)

To post-process a model run, do the following steps. Note that to post-process up to year n, you must first have short-term archived up to year n.

cd <post_processing_script_dir>

Configuration File

Create a new configuration file or copy an existing one. Call it <case_name>.cfg.

The sections of the configuration file are described below:

[default]

input, output, www paths may need to be edited.

[climo]

mapping_file path may need to be edited.
Climatology files are typically generated every 20 or 50 years: years = begin_year:end_year:averaging_period – e.g., years = "1:80:20", "1:50:50",

[ts]

mapping_file path may need to be edited.
Time series, typically done in chunks of 10 years – e.g., years = "1:80:10"

[e3sm_diags]

reference_data_path may need to be edited.
short_name is a shortened version of the case_name
years should match the [climo] section years

[mpas_analysis]

enso_years should start with year 11 -- e.g., enso_years = "11-50",
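Putting the sections above together, a skeleton configuration file might be written as follows. This is a sketch only: it lists just the keys mentioned above, every value in angle brackets is a placeholder, and a working zppy configuration will need more settings than shown here (copy an existing configuration file for the full set). The file name example_case.cfg is hypothetical.

```shell
# Skeleton only: just the keys mentioned above, with placeholder values.
# A real zppy configuration needs more settings than shown here.
cfg_file="example_case.cfg"   # hypothetical name; use <case_name>.cfg in practice

cat > "${cfg_file}" <<'EOF'
[default]
input = <path to the simulation output>
output = <path for the post-processed output>
www = <path to the web directory>

[climo]
mapping_file = <path to the mapping file>
years = "1:80:20", "1:50:50",

[ts]
mapping_file = <path to the mapping file>
years = "1:80:10"

[e3sm_diags]
reference_data_path = <path to the reference data>
short_name = <shortened version of the case name>
years = "1:80:20", "1:50:50",

[mpas_analysis]
enso_years = "11-50",
EOF
```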

Launch Post-Processing Jobs

If you haven't already, check out the zppy repo in <post_processing_script_dir>. Go to https://github.com/E3SM-Project/zppy . Get the path to clone by clicking the green "Code" button. Run git clone <path>.

Load the E3SM unified environment. For LCRC machines, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified.sh. Commands for other machines can be found at https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html.

Run python zppy/post.py -c <case_name>.cfg. This will launch a number of jobs. Run sq to see what jobs are running.

e3sm_diags jobs are dependent on climo and ts jobs, so they wait for those to finish. Most run quickly, though MPAS analysis may take several hours.

These jobs create a new directory <simulations_dir>/<case_name>/post.

cd <simulations_dir>/<case_name>/post/scripts
cat *.status # should be a list of "OK"
grep -v "OK" *.status # lists files without "OK"

If you re-run post-processing, it will check status of tasks and will skip a task if its status is “OK”.

Tasks

If you run ls you’ll probably see a file like e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status. This is one e3sm_diags job. Parts of the file name are explained below:

Meaning	Part of File Name
Task	`e3sm_diags`
Grid	`180x360_aave`
Model/obs v. model/obs	`model_vs_obs`
First and last years	`0001-0020`

There is also a corresponding output file. It will have the same name but end with .o<job ID> instead of .status.
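When looping over many status files, the year range can be pulled out of the file name with plain shell parameter expansion. The sketch below uses the example file name from above and assumes the year range is always the last underscore-separated field; the variable names are illustrative.

```shell
status_file="e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status"

base="${status_file%.status}"   # drop the .status suffix
years="${base##*_}"             # text after the last underscore
first_year="${years%-*}"        # part before the dash
last_year="${years#*-}"         # part after the dash

echo "years ${first_year} to ${last_year}"
# prints: years 0001 to 0020
```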

Output

The post-processing output is organized hierarchically as follows (the exact year ranges are those defined in the configuration file):

<simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr has the time series files – one variable per file, in 10-year chunks, as defined in <post_processing_script_dir>/<case_name>.cfg.
<simulations_dir>/<case_name>/post/atm/180x360_aave/clim/20yr similarly has climatology files for 20-year periods, as defined in <post_processing_script_dir>/<case_name>.cfg.
<simulations_dir>/<case_name>/post/atm/glb/ts/monthly/10yr has globally averaged files for 10-year periods, as defined in <post_processing_script_dir>/<case_name>.cfg. The glb directory currently doesn't follow the same file naming convention as 180x360_aave.

Documenting the Model Run

You should create a Confluence page for your model run in v2 beta1 simulations. Use Simulation Run Template as a template. See below for how to fill out this template.

Code

code_root_dir and tag_name are defined in <run_scripts_dir>/<case_name>.csh.

cd <code_root_dir>/<tag_name>
git log

The commit hash at the top is the most recent commit.

Add “<branch name>, <commit hash>” to this section of your page.

Configuration

Compset and Res are specified in the PACE “Experiment Details” section. See “Performance Information” for how to access PACE. Choose the latest job and list these settings on your page.

Custom parameters should also be listed. Find these by running:

cd <run_scripts_dir>
grep -n "EOF >> user_nl" run.<case_name>.csh # Find the line numbers to look at

Copy the code blocks after cat <<EOF >> user_nl_eam, cat << EOF >> user_nl_elm, and cat << EOF >> user_nl_mosart to your page.

Scripts

Push your <run_scripts_dir>/run.<case_name>.csh to https://github.com/E3SM-Project/SimulationScripts , in the archive/v2/beta/coupled directory. Then link it to this section of your page.

Output files

Specify the path to your output files: <simulations_dir>/<case_name>.

Jobs

Fill out a table with columns for “Job”, “Years”, “Nodes”, “SYPD”, and “Notes”.

Log file names will give you the job IDs. Logs are found in <simulations_dir>/<case_name>/run. If you have done short term archiving, then they will instead be in <simulations_dir>/<case_name>/archive/logs. Use ls to see what logs are in the directory. The job ID is the two-part (period-separated) number after .log. in the file name.
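Extracting the job ID from a log file name can be done with shell parameter expansion, as sketched below. The file name is a made-up example that follows the NNNNN.yymmdd-hhmmss pattern described in “Performance Information”; the variable names are illustrative.

```shell
log_file="e3sm.log.123456.210409-101530.gz"   # hypothetical example name

job_id="${log_file#*.log.}"   # drop everything up to and including ".log."
job_id="${job_id%.gz}"        # drop the .gz suffix, if present
slurm_id="${job_id%%.*}"      # first part: the Slurm job number

echo "${job_id}"
# prints 123456.210409-101530
echo "${slurm_id}"
# prints 123456
```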

PACE’s “Experiment Details” section shows JobID as well. In the table, link each job ID to its corresponding PACE web page. Note that failed jobs will not have a web page on PACE, but you should still list them in the table.

Use less <log> to look at a gzipped log file. Scroll down a decent amount to DATE= to find the start date. Use SHIFT+g to go to the end of the file. Scroll up to DATE= to find the end date. In the “Years” column specify <start> - <end>, with each in year-month-day format.

To find the number of nodes, first look at Total PEs in PACE’s “Experiment Details” section. Divide that number by MPI tasks/node to get the number of nodes.
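The division can be done directly in the shell. The numbers below are hypothetical and only illustrate the arithmetic; read the real values from PACE. Note that shell arithmetic is integer-only, so a non-exact division would be truncated.

```shell
# Hypothetical values; read the real ones from PACE's "Experiment Details".
total_pes=3456
mpi_tasks_per_node=64

nodes=$((total_pes / mpi_tasks_per_node))
echo "nodes: ${nodes}"
# prints: nodes: 54
```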

The SYPD (simulated years per day) is listed in PACE’s “Experiment Details” section as Model Throughput.

In the “Notes” section, mention if a job failed or if you changed anything before re-running a job.

Global time series

Run the following commands:

cd <post_processing_script_dir>/zppy/global_time_series
# Edit the variables at the top of `generate_global_time_series_plots.sh`
# See below for an explanation of the variables
./generate_global_time_series_plots.sh

Explanation of variables:

# For unified environment paths, see https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html
unified_script=<the path to the unified environment script. Do NOT include `source`>                                                               
e3sm_simulations_dir=<simulations_dir>
case_dir=${e3sm_simulations_dir}/<case_name>
# For web directory paths, see https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html
web_dir=<html_path>/E3SM/v2/beta/<case_name>
zppy_dir=<post_processing_script_dir>/zppy/

# Names                                                                  
moc_file=<e.g., mocTimeSeries_0001-0100.nc>
experiment_name=<e.g., 20210122.v2_test01.piControl.chrysalis>
figstr=<e.g., coupled_v2_test01>

# Years                                                                  
start_yr=<first year to process>
end_yr=<last year to process>

That will produce <figstr>.pdf and <figstr>.png. The latter will be available automatically at <web_address>/E3SM/v2/beta/<case_name>/<figstr>.png, where <web_address> can be found on https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html. You can download the image from the website and then upload it to your Confluence page.

You can also scp the image directly to your computer. For Chrysalis, you’d run the following command on your computer: scp <username>@chrysalis.lcrc.anl.gov:<zppy_dir>/global_time_series/<figstr>.png .

E3SM Diags

The template page already includes baseline diagnostics. Add your own diagnostics links labeled as <start_year>-<end_year>.

Your diagnostics are located at the web address corresponding to the www path in <post_processing_script_dir>/<case_name>.cfg.

See the “Global time series” section above for finding the relevant web links. Fill the table with the specific web links: e.g., https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/e3sm_diags/180x360_aave/model_vs_obs_0001-0020/viewer/.

MPAS Analysis

See the “Global time series” section above for finding the relevant web links.

Make a bulleted list of links, e.g., for https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/mpas_analysis/ts_0001-0050_climo_0021-0050/, create a bullet “1-50 (time series), 21-50 (climatology)”.

Long Term Archiving with zstash

Simulations that are deemed sufficiently valuable should be archived using zstash for long-term preservation.

Compy / anvil / chrysalis

Compy, anvil and chrysalis do not have local HPSS. We rely on NERSC HPSS for long-term archiving. Archiving requires a few separate steps:

(1) Run 'zstash create' to archive to local disk.
(2) Using Globus, transfer the zstash archive files (everything under the zstash/ subdirectory) to NERSC HPSS. Select the option to preserve the original files' modification dates.
(3) Run 'zstash check' to verify the integrity of the zstash archives (and of their transfer).
(4) Update the simulation Confluence page with the path to HPSS.

Helper scripts

Below are some helper scripts to facilitate steps (1) and (3) above.

batch_zstash_create.bash to batch archive a number of simulations on compy. Run inside a ‘screen’ session to avoid any interruption:

#!/bin/bash

# Run on compy

# Load E3SM Unified
source /share/apps/E3SM/conda_envs/load_latest_e3sm_unified.sh

# List of experiments to archive with zstash
EXPS=(\
20200827.alpha4_v1GM.piControl.ne30pg2_r05_EC30to60E2r2-1900_ICG.compy \
20200905.alpha4_dtOcn.piControl.ne30pg2_r05_EC30to60E2r2-1900_ICG.compy \
)

# Loop over simulations
for EXP in "${EXPS[@]}"
do
    echo === Archiving ${EXP} ===
    cd /compyfs/gola749/E3SM_simulations/${EXP}
    mkdir -p zstash
    stamp=`date +%Y%m%d`
    time zstash create -v --hpss=none  --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done

batch_zstash_check.bash to batch check a number of simulations on NERSC dtn. Run inside a ‘screen’ session to avoid any interruption:

#!/bin/bash

# Run on NERSC dtn

# Load environment that includes zstash
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_e3sm_unified.sh

# List of experiments to check with zstash
EXPS=(\
20200827.alpha4_v1GM.piControl.ne30pg2_r05_EC30to60E2r2-1900_ICG.compy \
20200905.alpha4_dtOcn.piControl.ne30pg2_r05_EC30to60E2r2-1900_ICG.compy \
)

# Loop over simulations
for EXP in "${EXPS[@]}"
do
    echo === Checking ${EXP} ===
    #cd /global/cscratch1/sd/golaz/E3SM_simulations
    cd /global/cfs/cdirs/e3sm/golaz/E3SM_simulations
    mkdir -p ${EXP}/zstash
    cd ${EXP}
    stamp=`date +%Y%m%d`
    time zstash check --hpss=/home/g/golaz/2020/${EXP} --workers 2 2>&1 | tee zstash/zstash_check_${stamp}.log
done

More info

Refer to zstash's best practices for E3SM for details.

Running E3SM: step-by-step guide