IN PROGRESS
Contact: Ryan Forsyth
Useful Aliases
Setting some aliases may be useful in running the model. You can edit ~/.bashrc to add aliases, then run source ~/.bashrc to start using them.
Note that the specific file name may differ amongst machines. Other possibilities include ~/.bash_profile, ~/.bashrc.ext, or ~/.bash_profile.ext.
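As a concrete sketch of the steps above (a temporary file stands in for ~/.bashrc so it is safe to try, and the alias ll here is just an example):

```shell
# Sketch: add an alias to a shell startup file and reload it.
# A temporary file stands in for ~/.bashrc so this is safe to run;
# the alias (ll) is a made-up example.
rcfile=$(mktemp)
echo "alias ll='ls -l'" >> "$rcfile"
source "$rcfile"
alias ll   # prints the definition that was just loaded
rm -f "$rcfile"
```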
Batch jobs
To check on all batch jobs:
alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10M %.10l %.8Q %55j" --sort=P,-t,-p'
To check on your batch jobs:
alias sq='sqa -u $USER'
The output of sq uses several abbreviations: ST = Status, R = running, PD = pending, CG = completing.
Directories
You will be working in several directories.
On anvil or chrysalis (i.e., LCRC machines), those paths are:
<run_scripts_dir>: ${HOME}/E3SM/scripts
<code_source_dir>: ${HOME}/E3SM/code
<post_processing_script_dir>: ${HOME}/E3SM/utils/post_v2
<simulations_dir>: /lcrc/group/e3sm/<username>/E3SM_simulations
So, it may be useful to set the following aliases:
# Model running
alias run_scripts="cd ${HOME}/E3SM/scripts/"
alias simulations="cd /lcrc/group/e3sm/<username>/E3SM_simulations/"
alias post_processing="cd ${HOME}/E3SM/utils/post_v2/"
On other machines, the paths are the same, except for the <simulations_dir>.
On compy (PNNL):
<simulations_dir>: /compyfs/<username>/E3SM_simulations
On cori (NERSC):
<simulations_dir>: ${CSCRATCH}/E3SM_simulations
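The per-machine paths above can be captured in a small shell helper. A sketch, using the machine names and paths listed on this page (set machine yourself; there is no automatic detection here):

```shell
# Sketch: select <simulations_dir> for the machine you are on.
# Paths are the ones listed above; "machine" must be set by hand.
machine=chrysalis   # one of: anvil, chrysalis, compy, cori
case "$machine" in
  anvil|chrysalis) simulations_dir="/lcrc/group/e3sm/${USER}/E3SM_simulations" ;;
  compy)           simulations_dir="/compyfs/${USER}/E3SM_simulations" ;;
  cori)            simulations_dir="${CSCRATCH}/E3SM_simulations" ;;
esac
echo "$simulations_dir"
```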
Configuring the Model Run – Old Run Script
A template for running the model is provided at https://github.com/E3SM-Project/E3SM/blob/master/run_e3sm.template.csh . Notice there is a section at the top labeled "THINGS USERS USUALLY CHANGE (SEE END OF SECTION FOR GUIDANCE)". These are the settings that you are most likely to change. The "EXPLANATION FOR OPTIONS ABOVE:" section explains these parameters.
Create a new run script or copy an existing one. The path to it should be <run_scripts_dir>/run.<case_name>.csh.
For ease of use, below are further explanatory notes:
BASIC INFO ABOUT RUN
set job_name = v2_test01.piControl: v2_test01 is a short custom description to help identify the simulation. piControl is the type of simulation. Other options here include amip and F2010.
set resolution = ne30pg2_EC30to60E2r2-1900_ICG: ne30 is the number of spectral elements for the atmospheric grid. EC30to60E2r2 is the ocean and sea-ice resolution. rrm (regionally refined mesh) is an option that replaces other resolutions.
SOURCE CODE OPTIONS
fetch_code: if you have not run the model before, want to incorporate new changes, or want to use a new branch, then set this to true. Otherwise, you can set this to false, which means time doesn't have to be spent checking out code.
e3sm_tag: the branch of the E3SM repo you want to run the model with.
tag_name: a short name you pick to replace e3sm_tag. Typically this will be a date (e.g., "20210122" for 2021-01-22). It is good practice to use year-month-day so ls will list runs chronologically.
CUSTOM CASE_NAME
set case_name = ${machine}.${tag_name}.${job_name}.${resolution}: the ordering ${tag_name}.${job_name}.${resolution}.${machine} is also used. Note that job_name (see BASIC INFO ABOUT RUN) typically has two parts (separated by a period), so case_name will actually have five parts.
PROCESSOR CONFIGURATION
set processor_config = S: S, M, L sizes, amongst others specified in the "EXPLANATION FOR OPTIONS ABOVE:" section. Use S for short tests. Full simulations should use L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.
DIRECTORIES
set code_root_dir = ~/E3SM/code/
set e3sm_simulations_dir = <simulations_dir>
set case_build_dir = ${e3sm_simulations_dir}/${case_name}/build
set case_run_dir = ${e3sm_simulations_dir}/${case_name}/run
set short_term_archive_root_dir = ${e3sm_simulations_dir}/${case_name}/archive
LENGTH OF SIMULATION, RESTARTS, AND ARCHIVING
For a short run, this section might look like:
set stop_units = nmonths # Units will be number of months
set stop_num = 1 # Stop after running one month (one stop_unit)
set restart_units = $stop_units
set restart_num = $stop_num
set num_resubmits = 0
For a long run, this section might look like:
set stop_units = nyears # Units will be number of years
set stop_num = 20 # Stop running after 20 years (20 stop_units)
set restart_units = $stop_units # Units will also be number of years
set restart_num = 5 # Write restart file after running 5 years (5 stop_units)
set num_resubmits = 4 # Submit 4 times after the initial submit (4+1 submits * 20 years/submit = 100 years)
In the above configuration, the model is submitted 5 times (initially and then 4 times after). Each submission covers 20 simulated years, so this will run 100 simulated years. On each submission, restart files will be written every 5 years. Since each submission covers 20 simulated years, each one will have 4 restart files written.
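The submission arithmetic above can be checked directly in the shell:

```shell
# Verify the submission arithmetic for the long-run settings above.
stop_num=20; restart_num=5; num_resubmits=4
total_years=$(( (num_resubmits + 1) * stop_num ))   # submissions * years/submission
restarts_per_submit=$(( stop_num / restart_num ))   # restart files per submission
echo "$total_years $restarts_per_submit"            # 100 4
```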
Model runs need to return the same results whether they use restart or not. If that is not the case, then a non-bit-for-bit change has been introduced.
Configuring the Model Run – New Run Script
A simplified run script template can be found at https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/beta/coupled/run.20210205.v2_test02.piControl.ne30pg2_EC30to60E2r2.chrysalis.sh.
# Machine
readonly MACHINE=chrysalis: the name of the machine you’re running on.
# Simulation
readonly COMPSET="WCYCL1850": compset (configuration).
readonly RESOLUTION="ne30pg2_EC30to60E2r2-1900_ICG": resolution. ne30 is the number of spectral elements for the atmospheric grid, pg2 refers to the phys-grid option, and EC30to60E2r2 is the ocean and sea-ice resolution. ICG means initial conditions from a G-case; ICG can apply to either EC30to60E2r2 or the RRM ocean (WC14). It specifies whether the ocean and ice are partially spun up or just from data (temperature and salinity; velocity is assumed zero): “G-case” indicates the ocean and sea-ice are active but the atmosphere, land, and river are from data. rrm (regionally refined mesh) is an option that replaces other resolutions. RRM for the ocean/sea ice is 14 km near the US coast and the Arctic and identical to the 30to60E2r2 elsewhere. RRM for the atmosphere is 120 km over North America and ne30 elsewhere.
readonly DESCRIPTOR="v2_test02.piControl.ne30pg2_EC30to60E2r2": v2_test02 is a short custom description to help identify the simulation. piControl is the type of simulation. Other options here include amip and F2010. ne30pg2_EC30to60E2r2 is the resolution. This should be identical to the RESOLUTION above.
# Code and compilation
readonly CHECKOUT="20210205": date the code was checked out on, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.
readonly BRANCH="master": branch the code was checked out from. Valid options include “master”, a branch name, or a git hash.
readonly DEBUG_COMPILE=false: option to compile with the DEBUG flag (leave set to false).
# Run options
For a short test run, this section might look like:
readonly MODEL_START_TYPE="initial" # initial, continue
readonly START_DATE="0001-01-01"
readonly STOP_OPTION="nmonths" # Units will be number of months
readonly STOP_N="1" # Stop after running 1 month (one `STOP_OPTION`)
readonly REST_OPTION="nmonths" # Units will be number of months
readonly REST_N="1" # Write restart file after running 1 month (one `REST_OPTION`)
readonly RESUBMIT="0" # Do not re-submit
readonly DO_SHORT_TERM_ARCHIVING=false
For a long production run, this section might look like:
readonly MODEL_START_TYPE="initial" # initial, continue
readonly START_DATE="0001-01-01"
readonly STOP_OPTION="nyears" # Units will be number of years
readonly STOP_N="20" # Stop after running 20 years (20 `STOP_OPTION`s)
readonly REST_OPTION="nyears" # Units will be number of years
readonly REST_N="5" # Write restart file after running 5 years (5 `REST_OPTION`s)
readonly RESUBMIT="4" # Submit 4 times after the initial submit (4+1 submits * 20 years/submit = 100 years)
readonly DO_SHORT_TERM_ARCHIVING=false
In the above configuration, the model is submitted 5 times (initially and then 4 times after). Each submission covers 20 simulated years, so this will run 100 simulated years. On each submission, restart files will be written every 5 years. Since each submission covers 20 simulated years, each one will have 4 restart files written.
Model runs need to return the same results whether they use restart or not. If that is not the case, then a non-bit-for-bit change has been introduced.
# Coupler history
readonly HIST_OPTION="nyears"
readonly HIST_N="5"
# Batch options
readonly PELAYOUT="L": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Use S for short tests. Full simulations should typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.
readonly WALLTIME="28:00:00": maximum wall-clock time requested for the batch jobs.
readonly PROJECT="e3sm": accounting project for the batch jobs.
# Case name
readonly CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${MACHINE}: sets the case name. This will look like 20210205.v2_test02.piControl.ne30pg2_EC30to60E2r2.chrysalis using the example values on this page. If you are not comparing the same case on different machines, you can exclude the .${MACHINE} part.
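Putting the example values on this page together:

```shell
# Compose the case name from the example values on this page.
CHECKOUT="20210205"
DESCRIPTOR="v2_test02.piControl.ne30pg2_EC30to60E2r2"
MACHINE="chrysalis"
CASE_NAME=${CHECKOUT}.${DESCRIPTOR}.${MACHINE}
echo "$CASE_NAME"   # 20210205.v2_test02.piControl.ne30pg2_EC30to60E2r2.chrysalis
```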
# Set paths
readonly CODE_ROOT="${HOME}/E3SM/code/${CHECKOUT}": where the E3SM code will be checked out.
readonly CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SM_simulations/${CASE_NAME}": where the results will go. The directory ${CASE_NAME} will be in <simulations_dir>.
# Sub-directories (leave unchanged)
This section should typically not be changed:
readonly CASE_SCRIPTS_DIR=${CASE_ROOT}/case_scripts # Where files for your particular simulation will go.
readonly CASE_BUILD_DIR=${CASE_ROOT}/build # All the stuff to compile. The executable will be there.
readonly CASE_RUN_DIR=${CASE_ROOT}/run # Where all the output will be. Most components will have their own log files.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive # Where archives will go.
# Leave empty (unless you understand what it does)
readonly OLD_EXECUTABLE=""
Running the Model
Run the model by doing the following:
cd <run_scripts_dir>
./run.<case_name>.csh
The repo will be checked out if fetch_code = true. The code will be compiled if old_executable = false. After the script finishes, the job has been submitted and you are free to close your terminal. The job will still run.
Looking at Results
cd <simulations_dir>/<case_name>
ls
Explanation of directories:
build: all the stuff to compile. The executable (e3sm.exe) is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top-level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
atm: atmosphere
cpl: coupler
ice: sea ice
lnd: land
ocn: ocean
rof: river runoff
Run tail -f run/<component>.log.<latest log file> to keep up with a log in real time.
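The "<latest log file>" part can be found with ls -t. A small sketch (the helper name newest_log is made up here, not part of E3SM):

```shell
# Sketch: return the newest atmosphere log in a run directory.
# newest_log is a hypothetical helper name, not part of E3SM.
newest_log() {
  ls -t "$1"/atm.log.* 2>/dev/null | head -n 1
}
# Usage (against a real run directory):
#   tail -f "$(newest_log run)"
```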
You can use the sq alias defined in the “Useful Aliases” section to check on the status of the job. The NODE column in the output indicates the number of nodes used and depends on the processor_config size.
When running on two different machines (such as Compy and Chrysalis) and/or with two different compilers, the answers will not be bit-for-bit the same. It is not possible with floating point operations to get bit-for-bit identical results across machines/compilers.
Logs being compressed to .gz files is one of the last steps before the job is done. less <log>.gz will let you look at a gzipped log.
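You can also search gzipped logs in place with zgrep. A self-contained sketch using a synthetic log (the file name and contents are made up for the demo):

```shell
# Sketch: grep a gzipped log without unpacking it (synthetic log for demo).
printf 'DATE=0001-01-01\nother line\n' | gzip > sample.log.gz
zgrep 'DATE=' sample.log.gz   # prints the matching line
```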
Short Term Archiving
Short term archiving can be accomplished with the following steps. This can be done while the model is still running.
Use --force-move to move instead of copying, which can take a long time. Set --last-date to the latest date in the simulation you want to archive. You do not have to specify a beginning date.
cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <simulations_dir>/<case_name>/archive
ls
Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files and rest holds the restart files.
Component | Subdirectory | Files in the Subdirectory
---|---|---
Atmosphere (Earth Atmospheric Model) | atm |
Coupler | cpl |
Sea Ice (MPAS-Sea-Ice) | ice |
Land (Earth Land Model) | lnd |
Ocean (MPAS-Ocean) | ocn |
River Runoff (MOSART) | rof |
Archiving a Complete Run
cd <simulations_dir>/<case_name>
To archive a one month run, for example:
mkdir test_1M_S: 1M for one month and S for the processor_config size.
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
test_1M_S is the archive of the one month run. Now you're free to change some settings in <run_scripts_dir>/run.<case_name>.csh and run it again.
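The moves above, run against a throwaway directory tree so the effect is easy to see:

```shell
# Sketch of the archiving moves above, using a scratch directory.
scratch=$(mktemp -d) && cd "$scratch"
mkdir -p case_scripts run          # stand-ins for the real directories
mkdir test_1M_S                    # 1M = one month, S = processor_config size
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
ls test_1M_S                       # shows case_scripts and run
```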
Performance Information
Model throughput is the number of simulated years per day. You can find this with:
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm*
PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm, ocn, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
Verifying Results Were BFB
If we ran the model twice, we can confirm the results were bit-for-bit (BFB) the same. Let's compare a one month run (processor_config = S) with a multi-year run (processor_config = L).
cd <simulations_dir>/<case_name>/
gunzip -c test_1M_S/run/atm.log.<log>.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c archive/logs/atm.log.<log>.gz | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head
In this case, the diff begins at the time step where the multi-year run continues but the one month run has stopped. Thus, the first month is BFB the same between the two runs.
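The comparison can be exercised end to end with synthetic logs, where the longer "run" simply continues past the shorter one (file names and values here are made up):

```shell
# Self-contained sketch of the BFB check with two fake gzipped logs.
workdir=$(mktemp -d) && cd "$workdir"
printf ' nstep, te 1 0.5\n nstep, te 2 0.6\n'                   | gzip > short.log.gz
printf ' nstep, te 1 0.5\n nstep, te 2 0.6\n nstep, te 3 0.7\n' | gzip > long.log.gz
gunzip -c short.log.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c long.log.gz  | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head   # only lines past the short run differ
```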
This BFB check will help you spot bugs in the code.
Re-Submitting a Job After a Crash
If a job crashes, you can rerun with:
cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case.submit
To gzip log files from failed jobs, run gzip *.log.<job ID>.*
Post-Processing
To post-process a model run, do the following steps. Note that to post-process up to year n, you must have short-term archived up to year n.
cd <post_processing_script_dir>
Configuration File
Create a new configuration file or copy an existing one. Call it <case_name>.cfg.
The sections of the configuration file are described below:
[default]
The input, output, and www paths may need to be edited.
[climo]
The mapping_file path may need to be edited. Typically, climatology files are generated every 20 or 50 years:
years = begin_yr:end_yr:averaging_period – e.g., years = "1:80:20", "1:50:50",
[ts]
The mapping_file path may need to be edited. Time series are typically done in chunks of 10 years – e.g., years = "1:80:10"
[e3sm_diags]
reference_data_path may need to be edited. short_name is a shortened version of the case_name. years should match the [climo] section years.
[mpas_analysis]
enso_years should start with year 11 – e.g., enso_years = "11-50",
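Pulling the sections above together, a skeleton <case_name>.cfg might look like the following. This is a hedged sketch: the section and key names are the ones described above, but exact keys and defaults can differ between zppy versions, and all bracketed paths are placeholders.

```ini
[default]
input = <simulations_dir>/<case_name>
output = <simulations_dir>/<case_name>/post
www = <html_path>

[climo]
mapping_file = <path to mapping file>
years = "1:80:20", "1:50:50",

[ts]
mapping_file = <path to mapping file>
years = "1:80:10",

[e3sm_diags]
reference_data_path = <path to reference data>
short_name = <short version of case_name>
years = "1:80:20", "1:50:50",

[mpas_analysis]
enso_years = "11-50",
```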
Launch Post-Processing Jobs
If you haven't already, check out the zppy repo in <post_processing_script_dir>. Go to https://github.com/E3SM-Project/zppy . Get the path to clone by clicking the green "Code" button. Run git clone <path>.
Load the E3SM unified environment. For LCRC machines, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified.sh. Commands for other machines can be found at https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html.
Run python zppy/post.py -c <case_name>.cfg. This will launch a number of jobs. Run sq to see what jobs are running.
e3sm_diags jobs are dependent on climo jobs, so they wait for those to finish. Most jobs run quickly, though MPAS Analysis may take several hours.
These jobs create a new directory <simulations_dir>/<case_name>/post.
cd <simulations_dir>/<case_name>/post/scripts
cat *.status # should be a list of "OK"
grep -v "OK" *.status # lists files without "OK"
If you re-run post-processing, it will check the status of each task and skip those whose status is OK.
Tasks
If you run ls you’ll probably see a file like e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status. This is one e3sm_diags job. Parts of the file name are explained below:
Meaning | Part of File Name
---|---
Task | e3sm_diags
Grid | 180x360_aave
Model/obs v. model/obs | model_vs_obs
First and last years | 0001-0020
There is also a corresponding output file. It will have the same name but end with .o<job ID> instead of .status.
Output
The post-processing output is organized hierarchically as follows:
<simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr has the time series files – one variable per file, in 10 year chunks as defined in <post_processing_script_dir>/<case_name>.cfg.
<simulations_dir>/<case_name>/post/atm/180x360_aave/clim/20yr similarly has climatology files for 20 year periods, as defined in <post_processing_script_dir>/<case_name>.cfg.
<simulations_dir>/<case_name>/post/atm/glb/ts/monthly/10yr has globally averaged files for 10 year periods, as defined in <post_processing_script_dir>/<case_name>.cfg. The glb directory currently doesn't follow the same file naming convention as 180x360_aave.
Documenting the Model Run
You should create a Confluence page for your model run in v2 beta1 simulations. Use Simulation Run Template as a template. See below for how to fill out this template.
Code
code_root_dir and tag_name are defined in <run_scripts_dir>/run.<case_name>.csh.
cd <code_root_dir>/<tag_name>
git log
The commit hash at the top is the most recent commit.
Add “<branch name>, <commit hash>” to this section of your page.
Configuration
Compset and Res are specified in the PACE “Experiment Details” section. See “Performance Information” for how to access PACE. Choose the latest job and list these settings on your page.
Custom parameters should also be listed. Find these by running:
cd <run_scripts_dir>
grep -n "EOF >> user_nl" run.<case_name>.csh # Find the line numbers to look at
Copy the code blocks after cat <<EOF >> user_nl_eam, cat <<EOF >> user_nl_elm, and cat <<EOF >> user_nl_mosart to your page.
Scripts
Push your <run_scripts_dir>/run.<case_name>.csh to https://github.com/E3SM-Project/SimulationScripts , in the archive/v2/beta/coupled directory. Then link it to this section of your page.
Output files
Specify the path to your output files: <simulations_dir>/<case_name>.
Jobs
Fill out a table with columns for “Job”, “Years”, “Nodes”, “SYPD”, and “Notes”.
Log file names will give you the job IDs. Logs are found in <simulations_dir>/<case_name>/run. If you have done short term archiving, then they will instead be in <simulations_dir>/<case_name>/archive/logs. Use ls to see what logs are in the directory. The job ID will be the two-part (period-separated) number after .log.
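Given a log name, the job ID can be cut out with plain shell parameter expansion. A sketch (the example file name is made up, following the .log.<two-part ID> pattern described above):

```shell
# Sketch: extract the two-part job ID from a hypothetical log file name.
f="e3sm.log.123456.210205-120000.gz"
jobid=${f#*.log.}   # strip everything through ".log."
jobid=${jobid%.gz}  # drop the trailing ".gz"
echo "$jobid"       # 123456.210205-120000
```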
PACE’s “Experiment Details” section shows JobID as well. In the table, link each job ID to its corresponding PACE web page. Note that failed jobs will not have a web page on PACE, but you should still list them in the table.
Use less <log> to look at a gzipped log file. Scroll down a decent amount to DATE= to find the start date. Use SHIFT+g to go to the end of the file. Scroll up to DATE= to find the end date. In the “Years” column, specify <start> - <end>, with each in year-month-day format.
To find the number of nodes, first look at Total PEs in PACE’s “Experiment Details” section. Divide that number by MPI tasks/node to get the number of nodes.
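For example, with made-up PACE values:

```shell
# Nodes = Total PEs / MPI tasks per node, rounded up (values are hypothetical).
total_pes=5400
tasks_per_node=64
nodes=$(( (total_pes + tasks_per_node - 1) / tasks_per_node ))
echo "$nodes"   # 85
```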
The SYPD (simulated years per day) is listed in PACE’s “Experiment Details” section as Model Throughput.
In the “Notes” section, mention if a job failed or if you changed anything before re-running a job.
Global time series
Run the following commands:
cd <post_processing_script_dir>/zppy/global_time_series
# Edit the variables at the top of `generate_global_time_series_plots.sh`
# See below for an explanation of the variables
./generate_global_time_series_plots.sh
Explanation of variables:
# For unified environment paths, see https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html
unified_script=<the path to the unified environment script. Do NOT include `source`>
e3sm_simulations_dir=<simulations_dir>
case_dir=${e3sm_simulations_dir}/<case_name>
# For web directory paths, see https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html
web_dir=<html_path>/E3SM/v2/beta/<case_name>
zppy_dir=<post_processing_script_dir>/zppy/
# Names
moc_file=<e.g., mocTimeSeries_0001-0100.nc>
experiment_name=<e.g., 20210122.v2_test01.piControl.chrysalis>
figstr=<e.g., coupled_v2_test01>
# Years
start_yr=<first year to process>
end_yr=<last year to process>
That will produce <figstr>.pdf and <figstr>.png. The latter will be available automatically at <web_address>/E3SM/v2/beta/<case_name>/<figstr>.png, where <web_address> can be found on https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html. You can download the image from the website and then upload it to your Confluence page.
You can also scp the image directly to your computer. For Chrysalis, you’d run the following command on your computer: scp <username>@chrysalis.lcrc.anl.gov:<zppy_dir>/global_time_series/<figstr>.png .
E3SM Diags
The template page already includes baseline diagnostics. Add your own diagnostics links labeled as <start_year>-<end_year>.
Your diagnostics are located at the web address corresponding to the www path in <post_processing_script_dir>/<case_name>.cfg.
See the “Global time series” section above for finding the relevant web links. Fill the table with the specific web links: e.g., https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/e3sm_diags/180x360_aave/model_vs_obs_0001-0020/viewer/ .
MPAS Analysis
See the “Global time series” section above for finding the relevant web links.
Make a bulleted list of links, e.g., for https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/mpas_analysis/ts_0001-0050_climo_0021-0050/ , create a bullet “1-50 (time series), 21-50 (climatology)”.