...
readonly OLD_EXECUTABLE=""
Running the Model
Run the model by doing the following:
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.csh |
The repo will be checked out if fetch_code = true. The code will be compiled if old_executable = false. After the script finishes, the job has been submitted and you are free to close your terminal. The job will still run.
Looking at Results
Code Block |
---|
cd <simulations_dir>/<case_name>
ls |
Explanation of directories:
build: all the stuff to compile. The executable (e3sm.exe) is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
atm: atmosphere
cpl: coupler
ice: sea ice
lnd: land
ocn: ocean
rof: river runoff
Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.
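Picking the latest log file by hand gets tedious when a run has been resubmitted several times. As a minimal sketch, the helper below prints the most recently modified log for one component. The function name latest_log is hypothetical and not part of E3SM; it assumes the logs follow the run/<component>.log.* naming described above.

```shell
# Hypothetical helper (not part of E3SM): print the newest log file
# for one component, based on modification time.
# Usage: latest_log <run_dir> <component>
latest_log() {
    run_dir=$1
    component=$2
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$run_dir"/"$component".log.* 2>/dev/null | head -n 1
}
```

For example, tail -f "$(latest_log run atm)" would follow the newest atmosphere log.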
You can use the sq alias defined in the “Useful Aliases” section to check on the status of the job. The NODE in the output indicates the number of nodes used and is dependent on the processor_config / PELAYOUT size.
When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be the same, bit-for-bit. It is not possible using floating point operations to get bit-for-bit identical results across machines/compilers.
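One reason results differ is that floating point addition is not associative, so a compiler or machine that evaluates the same sum in a different order can change the last bits of the result. The sketch below is an illustration only and has nothing to do with the E3SM code itself: it uses awk, which computes in double precision, to add the same three numbers with two groupings. The function name fp_order_demo is made up for this example.

```shell
# Illustration only: the same sum, grouped two ways, in double precision.
fp_order_demo() {
    awk 'BEGIN {
        a = (0.1 + 0.2) + 0.3
        b = 0.1 + (0.2 + 0.3)
        # Print both values with 17 significant digits, then compare them.
        printf "%.17g\n%.17g\n", a, b
        print ((a == b) ? "identical" : "different")
    }'
}

fp_order_demo
```

On typical IEEE 754 double precision hardware the last line printed should be "different", even though the two sums are mathematically equal.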
Logs being compressed to .gz files is one of the last steps before the job is done. less <log>.gz will let you look at a gzipped log.
Short Term Archiving
Short term archiving can be accomplished with the following steps. This can be done while the model is still running.
Use --force-move to move files instead of copying them; copying can take a long time. Set --last-date to the latest date in the simulation you want to archive. You do not have to specify a beginning date.
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <e3sm_simulations_dir>/<case_name>/archive
ls |
Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files and rest holds the restart files.
...
Component | Subdirectory | Files in the Subdirectory |
---|---|---|
Atmosphere (Earth Atmospheric Model) | atm/hist | eam.h* |
Coupler | cpl/hist | cpl.h* |
Sea Ice (MPAS-Sea-Ice) | ice/hist | mpassi.hist.* |
Land (Earth Land Model) | lnd/hist | elm.h* |
Ocean (MPAS-Ocean) | ocn/hist | mpaso.hist.* |
River Runoff (MOSART) | rof/hist | mosart.h* |
Archiving a Complete Run
cd <simulations_dir>/<case_name>
To archive a one month run, for example:
mkdir test_1M_S: 1M for one month and S for the processor_config size.
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
test_1M_S is the archive of the one month run. Now you're free to change some settings in <run_scripts_dir>/run.<case_name>.csh and run it again.
Performance Information
...
Short tests
Before starting a long production, it is highly recommended to perform a few short tests to verify:
1. The model starts without errors.
2. The model produces BFB (bit-for-bit) results after a restart.
3. The model produces BFB results when changing PE layout.
Test (1) can save you a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.
Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch non-BFB errors, it is possible that you’ll be running a slightly different configuration (for example a different physics option) for which those tests have not been performed.
Running short tests
…
Verifying Results Were BFB (needs update).
If we ran the model twice, we can confirm the results were bit-for-bit (BFB) the same. Let's compare a one month run (processor_config = S) with a multi-year run (processor_config = L).
Code Block |
---|
cd <simulations_dir>/<case_name>/
gunzip -c test_1M_S/run/atm.log.<log>.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c archive/logs/atm.log.<log>.gz | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head |
In this case, the diff begins at the time step where the multi-year run continues but the one month run has stopped. Thus, the first month is BFB the same between the two runs.
This BFB check will help you spot bugs in the code.
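Reading the start of a diff by eye works, but the same check can be scripted: the shorter run is BFB with the longer one if its extracted lines exactly match the first lines of the longer file. The sketch below assumes atm_S.txt and atm_L.txt were produced by the grep commands above. The function name bfb_prefix_check is hypothetical, not an E3SM tool.

```shell
# Hypothetical helper (not part of E3SM): succeed if every line of the
# shorter file matches the start of the longer file.
# Usage: bfb_prefix_check <short_run.txt> <long_run.txt>
bfb_prefix_check() {
    short=$1
    long=$2
    # Arithmetic expansion strips any padding that wc adds.
    n=$(($(wc -l < "$short")))
    if [ "$n" -eq 0 ]; then
        echo "no lines extracted from $short; check the grep pattern" >&2
        return 2
    fi
    # Compare the short file against the first n lines of the long file.
    if head -n "$n" "$long" | cmp -s - "$short"; then
        echo "BFB for the first $n lines"
    else
        echo "NOT BFB; first differences:"
        head -n "$n" "$long" | diff "$short" - | head -n 4
        return 1
    fi
}
```

For example, bfb_prefix_check atm_S.txt atm_L.txt returns a non-zero status when the overlapping steps differ, which makes it usable inside a larger test script.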
Production simulation
Run the model by doing the following:
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.csh |
The repo will be checked out if do_fetch_code=true. The code will be compiled if do_case_build=true. After the script finishes, the job has been submitted and you are free to close your terminal. The job will still run.
Looking at Results
Code Block |
---|
cd <simulations_dir>/<case_name>
ls |
Explanation of directories:
build: all the stuff to compile. The executable (e3sm.exe) is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
atm: atmosphere
cpl: coupler
ice: sea ice
lnd: land
ocn: ocean
rof: river runoff
Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.
You can use the sq alias defined in the “Useful Aliases” section to check on the status of the job. The NODE in the output indicates the number of nodes used and is dependent on the processor_config / PELAYOUT size.
When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be the same, bit-for-bit. It is not possible using floating point operations to get bit-for-bit identical results across machines/compilers.
Logs being compressed to .gz files is one of the last steps before the job is done and indicates successful completion of the segment. less <log>.gz will let you look directly at a gzipped log.
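Because compressed logs signal that a segment finished, a quick way to check on a run is to list any logs that are still uncompressed. The sketch below is only a rough indicator. The function name uncompressed_logs is hypothetical, and it assumes every log in the run directory matches *.log.* as described above.

```shell
# Hypothetical helper (not part of E3SM): list logs in a run directory
# that have not been gzipped yet.
# Usage: uncompressed_logs <run_dir>
uncompressed_logs() {
    run_dir=$1
    for f in "$run_dir"/*.log.*; do
        # Skip the literal pattern when nothing matches.
        [ -e "$f" ] || continue
        case $f in
            *.gz) ;;           # already compressed
            *) echo "$f" ;;    # still being written, or the job died
        esac
    done
}
```

Empty output from uncompressed_logs run suggests the last segment completed, provided the directory contains logs at all. Any file names printed belong to a segment that is still running or that stopped before compression.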
Short Term Archiving
Short term archiving can be accomplished with the following steps. This can be done while the model is still running.
Use --force-move to move files instead of copying them; copying can take a long time. Set --last-date to the latest date in the simulation you want to archive. You do not have to specify a beginning date.
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm* |
PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
Verifying Results Were BFB
If we ran the model twice, we can confirm the results were bit-for-bit (BFB) the same. Let's compare a one month run (processor_config = S) with a multi-year run (processor_config = L).
Code Block |
---|
cd <simulations_dir>/<case_name>/
gunzip -c test_1M_S/run/atm.log.<log>.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c archive/logs/atm.log.<log>.gz | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head |
In this case, the diff begins at the time step where the multi-year run continues but the one month run has stopped. Thus, the first month is BFB the same between the two runs.
...
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <e3sm_simulations_dir>/<case_name>/archive
ls |
Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files and rest holds the restart files.
Component | Subdirectory | Files in the Subdirectory |
---|---|---|
Atmosphere (Earth Atmospheric Model) | atm/hist | eam.h* |
Coupler | cpl/hist | cpl.h* |
Sea Ice (MPAS-Sea-Ice) | ice/hist | mpassi.hist.* |
Land (Earth Land Model) | lnd/hist | elm.h* |
Ocean (MPAS-Ocean) | ocn/hist | mpaso.hist.* |
River Runoff (MOSART) | rof/hist | mosart.h* |
Archiving a Complete Run
cd <simulations_dir>/<case_name>
To archive a one month run, for example:
mkdir test_1M_S: 1M for one month and S for the processor_config size.
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
test_1M_S is the archive of the one month run. Now you're free to change some settings in <run_scripts_dir>/run.<case_name>.csh and run it again.
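The three commands above can be wrapped in a small function so that an existing archive is never overwritten by accident. This is a sketch under the assumption that it is called from <simulations_dir>/<case_name>. The function name archive_run is hypothetical and not an E3SM tool.

```shell
# Hypothetical helper (not part of E3SM): move case_scripts and run
# into a named archive directory inside the current case directory.
# Usage: archive_run test_1M_S
archive_run() {
    dest=$1
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing $dest" >&2
        return 1
    fi
    if [ ! -d case_scripts ] || [ ! -d run ]; then
        echo "case_scripts and run must both exist here" >&2
        return 1
    fi
    # Stop at the first failure so a partial move is easy to spot.
    mkdir "$dest" &&
    mv case_scripts "$dest/case_scripts" &&
    mv run "$dest/run"
}
```

Calling archive_run test_1M_S has the same effect as the mkdir and mv commands above, but it refuses to proceed if test_1M_S already exists.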
Performance Information
Model throughput is the number of simulated years per day. You can find this with:
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm* |
PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. You can also simply search by providing the JobID appended to log files (NNNNN.yymmdd-hhmmss where NNNNN is the Slurm job id). Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
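When a run has many timing files, it can help to pull out just the throughput number from each one. The sketch below assumes each matching line ends with a number followed by the token simulated_years/day; that layout is an assumption about the timing files, so check it against your own grep output before relying on it. The function name throughput is hypothetical.

```shell
# Hypothetical helper (not part of E3SM): print "<file>: <value>" for
# every number that directly precedes the token "simulated_years/day".
# Assumption: the value is the whitespace-separated field before the token.
# Usage: throughput e3sm*
throughput() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "simulated_years/day")
                print FILENAME ": " $(i - 1)
    }' "$@"
}
```

Run from the timing directory, throughput e3sm* prints one line per timing file, which is easier to compare across segments than the raw grep output.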
Re-Submitting a Job After a Crash
...