...

The first time the script is called, all the flags should be set to true. Subsequently, the user may decide to bypass code checkout (do_fetch_code=false) or compilation (do_case_build=false). A user may also prefer to manually submit the job by setting do_case_submit=false and then invoking ./case.submit.
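As an illustration of how the toggle flags combine, the following minimal sketch reports which stages would run for a given set of flags. Only the flag names come from the text above; the helper planned_steps is hypothetical and is not part of the run script.

```shell
#!/bin/sh
# Hypothetical illustration: the flag names come from the run script,
# but planned_steps is an invented helper, not part of E3SM.
do_fetch_code=false       # code already checked out
do_create_newcase=true
do_case_setup=true
do_case_build=false       # reuse the existing executable
do_case_submit=false      # submit manually later with ./case.submit

planned_steps() {
    steps=""
    for flag in do_fetch_code do_create_newcase do_case_setup do_case_build do_case_submit; do
        eval "value=\${$flag}"             # look up the flag's value by name
        if [ "$value" = "true" ]; then
            steps="${steps}${flag#do_} "   # strip the do_ prefix
        fi
    done
    echo "${steps% }"                      # drop the trailing space
}

planned_steps   # prints: create_newcase case_setup
```

With these settings, code checkout, compilation, and submission are skipped, so the job would still need to be submitted by hand with ./case.submit.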

For a short test run, this section might look like:

...

For a long production run, this section might look like:

In the above configuration, the model is submitted 5 times (initially and then 4 times after). Each submission covers 20 simulated years, so this will run 100 simulated years. On each submission, restart files will be written every 5 years. Since each submission covers 20 simulated years, each one will have 4 restart files written.
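The arithmetic above can be checked with a few lines of shell. This is only a sketch: the variable names RESUBMIT, STOP_N, and REST_N are assumed to match the run script's settings, with RESUBMIT=4 inferred from "4 times after".

```shell
#!/bin/sh
# Values taken from the configuration described above (assumed variable names).
RESUBMIT=4      # resubmissions after the initial submission
STOP_N=20       # simulated years per submission
REST_N=5        # simulated years between restart files

submissions=$(( RESUBMIT + 1 ))
total_years=$(( submissions * STOP_N ))
restarts_per_submission=$(( STOP_N / REST_N ))

echo "submissions=${submissions} total_years=${total_years} restarts_per_submission=${restarts_per_submission}"
# prints: submissions=5 total_years=100 restarts_per_submission=4
```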

Model runs need to return the same results whether they use restart or not. If that is not the case, then a non-bit-for-bit change has been introduced.

# Batch options

  • readonly PELAYOUT="L": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Use S for short tests. Full simulations should typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.

  • readonly WALLTIME="28:00:00" : maximum wall clock time requested for the batch jobs.

  • readonly PROJECT="e3sm" : accounting project of the batch jobs.

# Sub-directories

This section should typically not be changed:

Code Block
readonly CASE_SCRIPTS_DIR=${CASE_ROOT}/case_scripts # Where files for your particular simulation will go.
readonly CASE_BUILD_DIR=${CASE_ROOT}/build          # All the stuff to compile. The executable will be there.
readonly CASE_RUN_DIR=${CASE_ROOT}/run              # Where all the output will be. Most components will have their own log files.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive      # Where archives will go.

Running the Model

Short tests

Before starting a long production simulation, it is highly recommended to perform a few short tests to verify that:

  1. The model starts without errors.

  2. The model produces BFB (bit-for-bit) results after a restart.

  3. The model produces BFB results when changing PE layout.

(1) can spare you a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover on Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.

Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch such non-BFB errors, it is possible that you’ll be running a slightly different configuration (for example a different physics option) for which those tests have not been performed.

Running short tests

The type of run to perform is controlled by the script variable run.

You should typically perform at least two short tests (two different layouts, with and without restart).

First, let’s start with a short test using the 'S' (small) PE layout and running for 2x5 days:

  • readonly run='S_2x5_ndays'
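The test name encodes the layout, the number of segments, the segment length, and the units. The sketch below decodes such a name, assuming it follows the pattern <layout>_<segments>x<length>_<units> suggested by the examples in this guide. The helper parse_run is hypothetical; the actual run script may decode the name differently.

```shell
#!/bin/sh
# Hypothetical helper: split a short-test name such as S_2x5_ndays into its parts.
parse_run() {
    run_name="$1"
    layout="${run_name%%_*}"            # text before the first underscore, e.g. S
    rest="${run_name#*_}"               # e.g. 2x5_ndays
    segments_spec="${rest%%_*}"         # e.g. 2x5
    units="${rest#*_}"                  # e.g. ndays
    segments="${segments_spec%%x*}"     # number of segments, e.g. 2
    length="${segments_spec##*x}"       # length of each segment, e.g. 5
    resubmit=$(( segments - 1 ))        # resubmissions after the first segment
    echo "layout=${layout} segments=${segments} length=${length} units=${units} resubmit=${resubmit}"
}

parse_run 'S_2x5_ndays'    # prints: layout=S segments=2 length=5 units=ndays resubmit=1
parse_run 'M_1x10_ndays'   # prints: layout=M segments=1 length=10 units=ndays resubmit=0
```

Under this reading, S_2x5_ndays runs two 5-day segments with a restart in between, while M_1x10_ndays runs a single 10-day segment with no restart.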

If you have not fetched and compiled the code, set all the toggle flags to true:

Code Block
do_fetch_code=true
do_create_newcase=true
do_case_setup=true
do_case_build=true
do_case_submit=true

At this point, execute the run e3sm script:

Code Block
cd <run_scripts_dir>
./run.<case_name>.sh

Fetching the code and compiling it will take some time (30 to 45 minutes), so go ahead and brew yourself a fresh cup of coffee. Once the script has finished, the test job will have been submitted to the batch queue.

You can immediately edit the script to prepare for the second short test. In this case, we will run for 10 days (without restart) using the 'M' (medium) PE layout:

  • readonly run='M_1x10_ndays'

Since the code has already been fetched and compiled, change the toggle flags:

Code Block
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

Then execute the script again:

Code Block
cd <run_scripts_dir>
./run.<case_name>.sh

Since we are bypassing the code fetch and compilation (by re-using the previous executable), the script should only take a few seconds to run and again should submit the second test.

Note that short tests use separate output directories, so it is safe to submit and run multiple tests at once. If you’d like, you can submit additional tests, for example 10 days with the medium 80-node ('M80') layout (M80_1x10_ndays).

Verifying results are BFB

Once the short tests are complete, we can confirm that the results are bit-for-bit (BFB) identical. All the test output is located under the tests sub-directory:

Code Block
cd <simulations_dir>/<case_name>/tests
for test in *
do
  gunzip -c ${test}/run/atm.log.*.gz | grep '^ nstep, te ' | uniq > atm_${test}.txt
done
md5sum *.txt
5bdfee6da8433cde08f33f3c046653c6  atm_M_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6  atm_M80_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6  atm_S_2x5_ndays.txt

To verify that the results are indeed BFB, we extract the global integrals from the atmosphere log files (lines starting with ‘nstep, te’) and make sure that they are identical for all tests. Identical md5sum values, as in the output above, mean the extracted lines match exactly.

If the BFB check fails, you should stop here and understand why. If it succeeds, you can now start the production simulation.
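Comparing checksums by eye becomes error-prone with many tests. The sketch below automates the comparison by counting distinct md5sum values. The helper bfb_check is hypothetical, and the demonstration uses stand-in files rather than real model output.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every file passed in has the same md5 checksum.
bfb_check() {
    n_unique=$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)
    if [ "$n_unique" -eq 1 ]; then
        echo "BFB: all $# files match"
        return 0
    fi
    echo "NOT BFB: ${n_unique} distinct checksums among $# files"
    return 1
}

# Self-contained demonstration with stand-in files (not real model output).
demo_dir=$(mktemp -d)
printf ' nstep, te 1 0.5\n' > "${demo_dir}/atm_A.txt"
printf ' nstep, te 1 0.5\n' > "${demo_dir}/atm_B.txt"
printf ' nstep, te 1 0.6\n' > "${demo_dir}/atm_C.txt"

bfb_check "${demo_dir}/atm_A.txt" "${demo_dir}/atm_B.txt"
bfb_check "${demo_dir}/atm_A.txt" "${demo_dir}/atm_C.txt" || echo "Investigate before starting production."

rm -rf "${demo_dir}"
```

In the tests sub-directory, the equivalent call would be bfb_check atm_*.txt, run after the extraction loop has produced the text files.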

Production simulation

To prepare for the long production simulation, edit the run e3sm script and set:

  • readonly run='production'

In addition, you may need to customize some variables in the code block below to configure run options:

# Production simulation

  • readonly PELAYOUT="M": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Production simulations typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.

  • readonly WALLTIME="28:00:00" : maximum wall clock time requested for the batch jobs.

  • readonly STOP_OPTION="nyears"

  • readonly STOP_N="20" : units and length of each segment (i.e. each batch job)

  • readonly REST_OPTION="nyears"

  • readonly REST_N="5" : units and frequency for writing restart files (make sure STOP_N is a multiple of REST_N; otherwise the model will stop without writing a restart file at the end).

  • readonly RESUBMIT="9" : number of resubmissions beyond the original segment. This simulation would run for a total of 200 years (20 + 9x20).

  • readonly DO_SHORT_TERM_ARCHIVING=false : leave set to false if you want to run the short-term archiving manually.
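Before submitting a long simulation, the relationship between STOP_N and REST_N can be checked mechanically. The following is a minimal sketch, assuming STOP_OPTION and REST_OPTION use the same units. The helper check_restart_alignment is hypothetical and not part of the run script.

```shell
#!/bin/sh
# Settings from the production example above.
STOP_N=20
REST_N=5
RESUBMIT=9

# Hypothetical helper: warn if a segment would end without a restart file.
# Assumes STOP_OPTION and REST_OPTION are expressed in the same units.
check_restart_alignment() {
    stop_n="$1"
    rest_n="$2"
    if [ $(( stop_n % rest_n )) -ne 0 ]; then
        echo "WARNING: STOP_N=${stop_n} is not a multiple of REST_N=${rest_n}"
        return 1
    fi
    echo "OK: STOP_N=${stop_n} is a multiple of REST_N=${rest_n}"
}

check_restart_alignment "$STOP_N" "$REST_N"
echo "total simulated years: $(( STOP_N * (RESUBMIT + 1) ))"   # prints: total simulated years: 200
```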

Since the code has already been fetched and compiled for the short tests, the toggle flags can be set to:

Code Block
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true

Finally, execute the script

Code Block
cd <run_scripts_dir>
./run.<case_name>.sh

The repo will be checked out only if do_fetch_code=true, and the code will be compiled only if do_case_build=true. The script will automatically submit the first job. After the script finishes, you are free to close your terminal; the job will still run. New jobs will automatically be resubmitted at the end of each segment until the total number of segments has been run.

Looking at Results

Code Block
cd <simulations_dir>/<case_name>
ls

...

Component                            | Subdirectory | Files in the Subdirectory
Atmosphere (Earth Atmospheric Model) | atm/hist     | eam.h*
Coupler                              | cpl/hist     | cpl.h*
Sea Ice (MPAS-Sea-Ice)               | ice/hist     | mpassi.hist.*
Land (Earth Land Model)              | lnd/hist     | elm.h*
Ocean (MPAS-Ocean)                   | ocn/hist     | mpaso.hist.*
River Runoff (MOSART)                | rof/hist     | mosart.h*

Archiving a Complete Run

cd <simulations_dir>/<case_name>

To archive a one month run, for example:

  • mkdir test_1M_S: 1M for one month and S for the processor_config size.

  • mv case_scripts test_1M_S/case_scripts

  • mv run test_1M_S/run

...

Performance Information

Model throughput is the number of simulated years per day of wall-clock time. You can find this with:

...

To gzip log files from failed jobs, run gzip *.log.<job ID>.*

Post-Processing with zppy

To post-process a model run, do the following steps. Note that to post-process up to year n, you must first have short-term archived up to year n.

...

Make a bulleted list of links, e.g., for https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/mpas_analysis/ts_0001-0050_climo_0021-0050/, create a bullet “1-50 (time series), 21-50 (climatology)”.
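The bullet label can be derived mechanically from the directory name. This minimal sketch assumes names follow the ts_<start>-<end>_climo_<start>-<end> pattern shown in the example URL. The helper label_from_dir is hypothetical and not part of zppy.

```shell
#!/bin/sh
# Hypothetical helper: turn a directory name such as ts_0001-0050_climo_0021-0050
# into the bullet label "1-50 (time series), 21-50 (climatology)".
label_from_dir() {
    echo "$1" | sed -E 's/^ts_0*([0-9]+)-0*([0-9]+)_climo_0*([0-9]+)-0*([0-9]+)$/\1-\2 (time series), \3-\4 (climatology)/'
}

label_from_dir 'ts_0001-0050_climo_0021-0050'
# prints: 1-50 (time series), 21-50 (climatology)
```

Names that do not match the assumed pattern pass through unchanged, which makes unexpected directories easy to spot in the resulting list.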

Long Term Archiving with zstash