Running the Model

Run the model by doing the following:

Code Block
cd <run_scripts_dir>
./run.<case_name>.csh

The repo will be checked out if do_fetch_code=true. The code will be compiled if do_case_build=true. After the script finishes, the job has been submitted and you are free to close your terminal; the job will continue to run.

Looking at Results

Code Block
cd <simulations_dir>/<case_name>
ls

Explanation of directories:

  • build: everything needed to compile the model. The executable (e3sm.exe) is also here.

  • case_scripts: the files for your particular simulation.

  • run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files; the coupler exchanges information between the components. The top-level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:

    • atm: atmosphere

    • cpl: coupler

    • ice: sea ice

    • lnd: land

    • ocn: ocean

    • rof: river runoff

Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.
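For example (with a made-up log name), tail shows the end of a log; adding -f keeps following the file as the model writes new lines:

```shell
# Stand-in for a real run/<component>.log.<jobid>.<timestamp> file.
printf 'nstep 1\nnstep 2\nnstep 3\n' > atm.log.demo

# Show the last two lines; with tail -f this would keep streaming
# new lines as they are appended.
tail -n 2 atm.log.demo
```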

You can use the sq alias defined in the “Useful Aliases” section to check on the status of the job. The NODE column in the output indicates the number of nodes used, which depends on the processor_config / PE layout size.

When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be bit-for-bit the same. Floating-point operations cannot produce bit-for-bit identical results across machines/compilers.

Logs being compressed to .gz files is one of the last steps before the job is done and indicates successful completion of the segment. less <log>.gz will let you look at a gzipped log directly.
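As a quick illustration (the file name here is made up), zgrep can also search inside a gzipped log without unpacking it:

```shell
# Create a stand-in for a gzipped log such as run/atm.log.<jobid>.<timestamp>.gz.
printf 'nstep 1\nnstep 2\n' | gzip > atm.log.demo.gz

# Search the compressed file directly, no gunzip needed.
zgrep 'nstep' atm.log.demo.gz
```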

Short Term Archiving

Short term archiving can be accomplished with the following steps. This can be done while the model is still running.

Use --force-move to move instead of copying, which can take a long time. Set --last-date to the latest date in the simulation you want to archive. You do not have to specify a beginning date.

Code Block
cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <e3sm_simulations_dir>/<case_name>/archive
ls

Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files and rest holds the restart files.

Component | Subdirectory | Files in the Subdirectory
Atmosphere (Earth Atmospheric Model) | atm/hist | eam.h*
Coupler | cpl/hist | cpl.h*
Sea Ice (MPAS-Sea-Ice) | ice/hist | mpassi.hist.*
Land (Earth Land Model) | lnd/hist | elm.h*
Ocean (MPAS-Ocean) | ocn/hist | mpaso.hist.*
River Runoff (MOSART) | rof/hist | mosart.h*

Archiving a Complete Run

Code Block
cd <simulations_dir>/<case_name>

To archive a one month run, for example:

  • mkdir test_1M_S: 1M for one month and S for the processor_config size.

  • mv case_scripts test_1M_S/case_scripts

  • mv run test_1M_S/run

test_1M_S is the archive of the one month run. Now you're free to change some settings on <run_scripts_dir>/run.<case_name>.csh and run it again.
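The steps above can be sketched in a throwaway directory (the path is a stand-in for <simulations_dir>/<case_name>):

```shell
# Work in a temporary directory instead of a real case directory.
simdir=$(mktemp -d)
cd "$simdir"
mkdir case_scripts run       # stand-ins for the real case directories
mkdir test_1M_S              # 1M = one month, S = processor_config size
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
ls test_1M_S
```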


Short tests

Before starting a long production run, it is highly recommended to perform a few short tests to verify:

  1. The model starts without errors.

  2. The model produces BFB (bit-for-bit) results after a restart.

  3. The model produces BFB results when changing PE layout.

(1) can save you a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover on Monday morning that it started running Friday evening and died within seconds because of a typo in a namelist variable or an input file.

Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch non-BFB errors, it is possible that you’ll be running a slightly different configuration (for example a different physics option) for which those tests have not been performed.

Running short tests

Verifying Results Were BFB (needs update)

If we run the model twice, we can confirm that the results are bit-for-bit (BFB) identical. Let's compare a one month run (processor_config = S) with a multi-year run (processor_config = L).

Code Block
cd <simulations_dir>/<case_name>/
gunzip -c test_1M_S/run/atm.log.<log>.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c archive/logs/atm.log.<log>.gz | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head

In this case, the diff begins at the time step where the multi-year run continues but the one month run has stopped. Thus, the first month is BFB the same between the two runs.

This BFB check will help you spot bugs in the code.
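A toy version of the check above makes the expected diff behavior concrete: two "logs" that agree for the first month, after which only the longer run continues (all values here are made up):

```shell
# Fabricate the grep output of a short run and a longer run whose
# first steps are bit-for-bit identical.
printf ' nstep, te 1 0.5\n nstep, te 2 0.6\n' > atm_S.txt
printf ' nstep, te 1 0.5\n nstep, te 2 0.6\n nstep, te 3 0.7\n' > atm_L.txt

# The diff only reports lines appended after the short run stopped,
# confirming the overlapping steps are BFB.
diff atm_S.txt atm_L.txt | head
```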

Production simulation

Run the model by doing the following:

Code Block
cd <run_scripts_dir>
./run.<case_name>.csh

The repo will be checked out if do_fetch_code=true. The code will be compiled if do_case_build=true. After the script finishes, the job has been submitted and you are free to close your terminal. The job will still run.


Performance Information

Model throughput is the number of simulated years per day. You can find this with:

Code Block
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm*

PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. You can also simply search by providing the JobID appended to log files (NNNNN.yymmdd-hhmmss where NNNNN is the Slurm job id). Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
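A back-of-the-envelope version of the cost metric PACE charts (the numbers below are made up for illustration):

```shell
# cost = time spent in a component * processors assigned to it.
atm_seconds=3600     # hypothetical wall-clock seconds in the atmosphere
atm_procs=1350       # hypothetical processors assigned to it
echo "atm cost: $(( atm_seconds * atm_procs )) processor-seconds"
```

A component that runs briefly on many processors can therefore cost as much as one that runs long on few, which is what the box areas in the PACE chart convey.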

Re-Submitting a Job After a Crash

...