
Configuring the Model Run

Run Script

Start with an example of a run script for a low-resolution coupled simulation: run.20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis.sh. A simplified run script template can also be found at https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/beta/coupled/run.20210205.v2_test02

# Machine and project

  • readonly MACHINE=chrysalis: the name of the machine you’re running on.

  • readonly PROJECT="e3sm": SLURM accounting project (typically e3sm).

# Simulation

  • readonly COMPSET="WCYCL1850" : compset (configuration)

  • readonly RESOLUTION="ne30pg2_EC30to60E2r2-1900_ICG": resolution (low-resolution coupled simulation in this case)

    • ne30 is the number of spectral elements for the atmospheric dynamics grid, while pg2 refers to the physics grid option. The resulting grid spacing is approximately 110 km.

    • EC30to60E2r2 is the ocean and sea-ice resolution. The grid spacing varies between 30 and 60 km.

    • ICG means that the ocean and sea-ice initial conditions come from a G-case. This suffix distinguishes partially spun-up ocean and sea-ice initial conditions from ones taken directly from data (temperature and salinity, with velocity assumed to be zero). ICG can apply to either EC30to60E2r2 or the RRM ocean (WC14). A “G-case” is a configuration in which the ocean and sea ice are active while the atmosphere, land, and river are data components.

    • rrm (regionally refined mesh) is an option that replaces the other resolutions. RRM for the ocean and sea ice is 14 km near the US coast and the Arctic and identical to EC30to60E2r2 elsewhere. RRM for the atmosphere is 120 km over North America and ne30 elsewhere.

    • For simulations with regionally refined meshes, such as the North American atmosphere grid coupled to the WC14 ocean and sea ice, replace the resolution with northamericax4v1pg2_WC14to60E2r3.

  • readonly DESCRIPTOR="v2beta4.piControl":

    • v2beta4 is a short custom description to help identify the simulation.

    • piControl is the type of simulation. Other options here include amip and F2010.

  • readonly CASE_GROUP="v2beta4.piControl":

    • This will let you mark multiple cases as part of the same group for later processing (e.g., with PACE).
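To make the resolution naming convention concrete, here is a minimal bash sketch that splits the RESOLUTION string above into its atmosphere, ocean/sea-ice, and initial-condition parts. It assumes the "<atm>_<ocean>-<suffix>_<ic>" layout of this particular example (other resolution strings, such as RRM grids without a suffix, may not follow it), and the names atm_grid, ocn_grid, and ic_tag are hypothetical, not run-script variables.

```shell
#!/usr/bin/env bash
# Sketch: split a resolution string into its components.
# Assumes the "<atm>_<ocean>-<suffix>_<ic>" layout of the example on this page;
# atm_grid, ocn_grid and ic_tag are hypothetical names for illustration only.
RESOLUTION="ne30pg2_EC30to60E2r2-1900_ICG"

atm_grid=${RESOLUTION%%_*}   # text before the first "_"  -> ne30pg2
rest=${RESOLUTION#*_}        # text after the first "_"   -> EC30to60E2r2-1900_ICG
ocn_grid=${rest%%-*}         # text before the first "-"  -> EC30to60E2r2
ic_tag=${RESOLUTION##*_}     # text after the last "_"    -> ICG

echo "atmosphere: ${atm_grid}"
echo "ocean/sea ice: ${ocn_grid}"
echo "initial conditions: ${ic_tag}"
```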

# Code and compilation

  • readonly CHECKOUT="20210409": date the code was checked out, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.

  • readonly BRANCH="master": branch the code was checked out from. Valid options include “master”, a branch name, or a git hash.

  • readonly DEBUG_COMPILE=false : option to compile with DEBUG flag (leave set to false)

# Run options

For a short test run, this section might look like:

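  • readonly MODEL_START_TYPE="initial": specify how the model should start: "initial" to start from initial conditions, or "continue" to continue from existing restart files.

  • readonly START_DATE="0001-01-01": model start date. Typically year 1 for simulations with perpetual (time-invariant) forcing, or a real year for simulations with transient forcing.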

For a long production run, this section might look like:

Code Block
readonly MODEL_START_TYPE="initial"  # initial, continue
readonly START_DATE="0001-01-01"
readonly STOP_OPTION="nyears"        # Units will be number of years
readonly STOP_N="20"                 # Stop after running 20 years (20 `STOP_OPTION`s)
readonly REST_OPTION="nyears"        # Units will be number of years
readonly REST_N="5"                  # Write restart file after running 5 years (5 `REST_OPTION`s)
readonly RESUBMIT="4"                # Submit 4 times after the initial submit (4+1 submits * 20 years/submit = 100 years)
readonly DO_SHORT_TERM_ARCHIVING=false

In the above configuration, the model is submitted 5 times (the initial submission plus 4 resubmissions). Each submission covers 20 simulated years, so the run covers 100 simulated years in total. Restart files are written every 5 simulated years, so each submission writes 4 sets of restart files.
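The arithmetic in the previous paragraph can be checked with a few lines of bash. This is only a sketch: it copies the values from the production example and assumes STOP_N is a multiple of REST_N, so that the integer division is exact. The derived variable names (submissions, total_years, restarts_per_submission) are illustrative and not part of the run script.

```shell
#!/usr/bin/env bash
# Sketch: derive the totals of a production run from its run options.
# Values copied from the production example; derived names are hypothetical.
STOP_N=20     # simulated years per submission (STOP_OPTION="nyears")
REST_N=5      # simulated years between restart writes (REST_OPTION="nyears")
RESUBMIT=4    # resubmissions after the initial submission

submissions=$(( RESUBMIT + 1 ))
total_years=$(( submissions * STOP_N ))
# Integer division: assumes STOP_N is a multiple of REST_N.
restarts_per_submission=$(( STOP_N / REST_N ))

echo "submissions: ${submissions}"
echo "simulated years: ${total_years}"
echo "restart sets per submission: ${restarts_per_submission}"
```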

Model runs must produce the same results whether or not they are restarted. If they do not, a non-bit-for-bit (non-BFB) change has been introduced.

# Coupler history

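  • readonly HIST_OPTION="nyears": units for the coupler history output frequency.

  • readonly HIST_N="5": write coupler history every 5 `HIST_OPTION`s (here, every 5 years).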


# Case name

  • readonly CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.${MACHINE} : the case name is a unique identifier for the simulation, constructed from the variables defined above. The machine name is useful when comparing the same case across different machines; if there is no risk of ambiguity, it can be dropped: CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.
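As an illustration, the bash sketch below assembles CASE_NAME from the example values on this page, with and without the machine name. INCLUDE_MACHINE is a hypothetical switch added only for this sketch; the run script itself simply sets CASE_NAME directly.

```shell
#!/usr/bin/env bash
# Sketch: assemble CASE_NAME from the variables defined above.
# Values are this page's examples; INCLUDE_MACHINE is a hypothetical switch.
CHECKOUT="20210409"          # today's date in this form: date +%Y%m%d
DESCRIPTOR="v2beta4.piControl"
RESOLUTION="ne30pg2_EC30to60E2r2-1900_ICG"
MACHINE="chrysalis"
INCLUDE_MACHINE=true

if [ "${INCLUDE_MACHINE}" = true ]; then
    # Keep the machine name when the same case may run on several machines.
    CASE_NAME="${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}.${MACHINE}"
else
    CASE_NAME="${CHECKOUT}.${DESCRIPTOR}.${RESOLUTION}"
fi
echo "${CASE_NAME}"
```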

# Set paths

  • readonly CODE_ROOT="${HOME}/E3SM/code/${CHECKOUT}": where the E3SM code will be checked out.

  • readonly CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SM_simulations/${CASE_NAME}": where the results will go. The directory ${CASE_NAME} will be in <simulations_dir>.


# Define type of run

  • readonly run='production': type of simulation to run: a short test for verification (for example, 'S_2x5_ndays') or 'production' for a long production run. See Running short tests for details.


# Leave empty (unless you understand what it does)

  • readonly OLD_EXECUTABLE="" : this is a somewhat risky option that allows you to reuse a pre-existing executable. It is not recommended because it breaks provenance.

# --- Toggle flags for what to do ----

This section controls what operations the script should perform. The run script can be invoked multiple times, and the user can bypass certain steps by setting the corresponding flags to true or false.

  • do_fetch_code=true : fetch the source code from GitHub.

  • do_create_newcase=true : create new case.

  • do_case_setup=true : case setup.

  • do_case_build=true : compile.

  • do_case_submit=true : submit simulation.

The first time the script is called, all the flags should be set to true. Subsequently, the user may decide to bypass code checkout (do_fetch_code=false) or compilation (do_case_build=false). A user may also prefer to manually submit the job by setting do_case_submit=false and then invoking ./case.submit.
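The gating pattern behind these flags can be sketched in a few lines of bash. This is not the actual run-script code: run_step is a hypothetical helper, and the step names are only printed rather than executed. The flag values shown are the ones a user might choose after the code has already been fetched and compiled.

```shell
#!/usr/bin/env bash
# Sketch of how true/false toggle flags can gate each step of a run script.
# run_step is a hypothetical helper; step names are printed, not executed.
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

run_step() {
    # $1: value of the toggle flag, $2: name of the step
    if [ "$1" = true ]; then
        echo "running: $2"
    else
        echo "skipping: $2"
    fi
}

run_step "${do_fetch_code}"     fetch_code
run_step "${do_create_newcase}" create_newcase
run_step "${do_case_setup}"     case_setup
run_step "${do_case_build}"     case_build
run_step "${do_case_submit}"    case_submit
```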


# Batch options

  • readonly PELAYOUT="L": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Use S for short tests. Full simulations should typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.

  • readonly WALLTIME="28:00:00" : maximum wall clock time requested for the batch jobs.

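In CIME-based workflows such as this one, these settings end up as arguments to create_newcase when the case is created. The dry-run bash sketch below only prints such a command. The option names (--case, --output-root, --compset, --res, --machine, --project, --walltime, --pecount) and the cime/scripts path are assumptions based on common CIME usage, so verify them with create_newcase --help for your E3SM version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print a create_newcase command built from the case variables.
# Option names and the cime/scripts path are assumptions; check create_newcase --help.
CODE_ROOT="${HOME}/E3SM/code/20210409"
CASE_NAME="20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2-1900_ICG.chrysalis"
CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SM_simulations/${CASE_NAME}"
COMPSET="WCYCL1850"
RESOLUTION="ne30pg2_EC30to60E2r2-1900_ICG"
MACHINE="chrysalis"
PROJECT="e3sm"
WALLTIME="28:00:00"
PELAYOUT="L"

# A bash array keeps each argument intact even if a value contains spaces.
cmd=(
    "${CODE_ROOT}/cime/scripts/create_newcase"
    --case "${CASE_NAME}"
    --output-root "${CASE_ROOT}"
    --compset "${COMPSET}"
    --res "${RESOLUTION}"
    --machine "${MACHINE}"
    --project "${PROJECT}"
    --walltime "${WALLTIME}"
    --pecount "${PELAYOUT}"
)
printf '%s\n' "${cmd[*]}"
```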



# Sub-directories (leave unchanged)

This section should typically not be changed:

Code Block
readonly CASE_SCRIPTS_DIR=${CASE_ROOT}/case_scripts # Where files for your particular simulation will go.
readonly CASE_BUILD_DIR=${CASE_ROOT}/build          # All compilation files, including the executable.
readonly CASE_RUN_DIR=${CASE_ROOT}/run              # Where all the output will be. Most components will have their own log files.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive      # Where short-term archived files will reside.


Running the Model

Short tests

Before starting a long production run, it is highly recommended to perform a few short tests to verify that:

  1. The model starts without errors.

  2. The model produces BFB (bit-for-bit) results after a restart.

  3. The model produces BFB results when changing PE layout.

Check (1) can save you a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover on Monday morning that the job started running on Friday evening and died within seconds because of a typo in a namelist variable or input file.

Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch such non-BFB errors, you may be running a slightly different configuration (for example, a different physics option) for which those tests have not been performed.
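One simple way to check whether two runs are BFB is a byte-by-byte comparison of corresponding output files. The bash sketch below wraps cmp in a hypothetical helper, is_bfb; the file paths you pass it are placeholders for matching outputs of two short tests. Note that a raw comparison of NetCDF files can report differences caused only by metadata (for example, an attribute recording when the file was written), so comparing specific fields or logged statistics may be more appropriate in practice.

```shell
#!/usr/bin/env bash
# Sketch: decide whether two output files are bit-for-bit identical.
# is_bfb is a hypothetical helper; pass it matching outputs of two short tests
# (e.g. with and without restart, or two PE layouts).
is_bfb() {
    # cmp -s compares byte by byte silently; exit status 0 means identical.
    if cmp -s "$1" "$2"; then
        echo "BFB: $1 and $2 are identical"
        return 0
    else
        echo "non-BFB: $1 and $2 differ"
        return 1
    fi
}
```

Usage: is_bfb <output_from_test_1> <output_from_test_2>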


Running short tests

The type of run to perform is controlled by the script variable run.

You should typically perform at least two short tests (with two different PE layouts, one with and one without a restart).

First, start with a short test using the 'S' (small) PE layout, running for 2x5 days (two 5-day segments, with a restart in between):

  • readonly run='S_2x5_ndays'
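The short-test names appear to follow a <layout>_<N>x<length>_<units> convention. The bash sketch below splits such a name into its parts, assuming that convention and reading N as the total number of submissions (so N-1 resubmissions, each restarting the model). parse_run and the variables it sets are hypothetical names, not taken from the run script.

```shell
#!/usr/bin/env bash
# Sketch: split a short-test name such as 'S_2x5_ndays' into its parts.
# Assumes the <layout>_<N>x<length>_<units> convention; all names are hypothetical.
parse_run() {
    local run=$1
    local middle
    layout=${run%%_*}            # PE layout, e.g. 'S'
    middle=${run#*_}             # e.g. '2x5_ndays'
    units=${middle#*_}           # e.g. 'ndays'
    middle=${middle%%_*}         # e.g. '2x5'
    nsubmit=${middle%%x*}        # total submissions, e.g. '2'
    length=${middle#*x}          # length of each submission, e.g. '5'
    resubmit=$(( nsubmit - 1 ))  # resubmissions after the first one
}

parse_run 'S_2x5_ndays'
echo "layout=${layout} submissions=${nsubmit} length=${length} units=${units} resubmit=${resubmit}"
```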

If you have not fetched and compiled the code, set all the toggle flags to true:

Code Block
do_fetch_code=true
do_create_newcase=true
do_case_setup=true
do_case_build=true
do_case_submit=true

Once the job has been submitted to the batch queue, you can immediately edit the script and submit the second short test. In this case, we will be running for 10 days (without restart) using the 'M' (medium) PE layout:

  • readonly run='M_1x10_ndays'

Since the code has already been fetched and compiled, change the toggle flags:

Code Block
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

Note that short tests use separate output directories, so it is safe to submit and run multiple tests at once.
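A minimal bash sketch of how per-test directories can keep concurrent tests apart is shown below. The ${CASE_ROOT}/tests/${run}/... layout and the test_dirs helper are assumptions for illustration; check your run script for the layout it actually uses.

```shell
#!/usr/bin/env bash
# Sketch: give each short test its own directories so several can run at once.
# The tests/${run} layout and test_dirs are assumptions, not the script's code.
CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SM_simulations/example_case"  # hypothetical

test_dirs() {
    # Print the case-scripts and run directories for a given value of run.
    local run=$1
    if [ "${run}" = production ]; then
        echo "${CASE_ROOT}/case_scripts ${CASE_ROOT}/run"
    else
        echo "${CASE_ROOT}/tests/${run}/case_scripts ${CASE_ROOT}/tests/${run}/run"
    fi
}

test_dirs S_2x5_ndays
test_dirs M_1x10_ndays
```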

Verifying Results Were BFB (needs update).
