IN PROGRESS
Contact: Ryan Forsyth
Summary
...
Processes
To check on all running processes:
alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10M %.10l %.8Q %55j" --sort=P,-t,-p'
To check on your running processes:
alias sq='sqa -u $USER'
The output of sq
uses several abbreviations: ST = Status, R = running, PD = pending, CG = completing.
Directories
You will be working in several directories. On Anvil or Chrysalis (i.e., LCRC machines), those paths are:
<run_scripts_dir>: /home/<username>/E3SM/scripts
<simulations_dir>: /lcrc/group/e3sm/<username>/E3SM_simulations/
<post_processing_dir>: /home/<username>/E3SM/utils/post_v2/
So, it may be useful to set the following aliases:
Code Block |
---|
# Model running
alias run_scripts="cd /home/<username>/E3SM/scripts/"
alias simulations="cd /lcrc/group/e3sm/<username>/E3SM_simulations/"
alias post_processing="cd /home/<username>/E3SM/utils/post_v2/" |
Configuring the Model Run
A template for running the model is provided at https://github.com/E3SM-Project/E3SM/blob/master/run_e3sm.template.csh . Notice there is a section at the top labeled "THINGS USERS USUALLY CHANGE (SEE END OF SECTION FOR GUIDANCE)". These are the settings that you are most likely to change. The "EXPLANATION FOR OPTIONS ABOVE:" section explains these parameters.
Create a new run script or copy an existing one. The path to it should be <run_scripts_dir>/run.<case_name>.csh
For ease of use, below are further explanatory notes:
BASIC INFO ABOUT RUN
set job_name = v2_test01.piControl:
v2_test01 is a short custom description to help identify the simulation.
piControl is the type of simulation. Other options here include amip and F2010.
set resolution = ne30pg2_EC30to60E2r2-1900_ICG:
ne30 is the number of spectral elements for the atmospheric grid.
EC30to60E2r2 is the ocean and sea-ice resolution.
rrm (regionally refined mesh) is an option to replace ICG.
SOURCE CODE OPTIONS
fetch_code: if you have not run the model before, want to incorporate new changes, or want to use a new branch, then set this to true. Otherwise, you can set this to false, which means time doesn't have to be spent checking out code.
e3sm_tag: the branch of the E3SM repo you want to run the model with.
tag_name: a short name to replace e3sm_tag. Typically this will be a date (e.g., "20210122" for 2021-01-22). It is good practice to use year-month-day so ls will list runs chronologically.
CUSTOM CASE_NAME
set case_name = ${machine}.${tag_name}.${job_name}.${resolution}:
${tag_name}.${job_name}.${resolution}.${machine} is also used.
Note that job_name (see BASIC INFO ABOUT RUN) typically has two parts (separated by a period), so case_name will actually have five parts.
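As a minimal sketch, here is how the pieces combine into case_name (the values are examples taken from this guide, not requirements):

```shell
#!/bin/bash
# Example values from this guide; substitute your own.
machine="chrysalis"
tag_name="20210122"
job_name="v2_test01.piControl"   # two parts: description.simulation-type
resolution="ne30pg2_EC30to60E2r2-1900_ICG"

# Assemble the case name exactly as the run script does.
case_name="${machine}.${tag_name}.${job_name}.${resolution}"
echo "${case_name}"
# job_name itself contains a period, so case_name ends up with five
# dot-separated parts.
```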
PROCESSOR CONFIGURATION
set processor_config = S:
S, M, L sizes, amongst others specified in the "EXPLANATION FOR OPTIONS ABOVE:" section. Use S for short tests. Full simulations should use L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.
DIRECTORIES
Code Block |
---|
set code_root_dir = ~/E3SM/code/
set e3sm_simulations_dir = <simulations_dir>
set case_build_dir = ${e3sm_simulations_dir}/${case_name}/build
set case_run_dir = ${e3sm_simulations_dir}/${case_name}/run
set short_term_archive_root_dir = ${e3sm_simulations_dir}/${case_name}/archive |
LENGTH OF SIMULATION, RESTARTS, AND ARCHIVING
For a short run, this section might look like:
Code Block |
---|
set stop_units = nmonths # Units will be number of months
set stop_num = 1 # Stop after running one month (one stop_unit)
set restart_units = $stop_units
set restart_num = $stop_num
set num_resubmits = 0 |
For a long run, this section might look like:
Code Block |
---|
set stop_units = nyears # Units will be number of years
set stop_num = 20 # Stop running after 20 years (20 stop_units)
set restart_units = $stop_units # Units will also be number of years
set restart_num = 5 # Write restart file after running 5 years (5 stop units)
set num_resubmits = 4 # Submit 4 times after the initial submit (4+1 submits * 20 years/submit = 100 years) |
In the above configuration, the model is submitted 5 times (initially and then 4 times after). Each submission covers 20 simulated years, so this will run 100 simulated years. On each submission, restart files will be written every 5 years. Since each submission covers 20 simulated years, each one will have 4 restart files written.
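The arithmetic above can be spelled out in a few lines of shell (example values only):

```shell
#!/bin/bash
# The long-run settings from the example above.
stop_num=20        # years per submission
restart_num=5      # years between restart writes
num_resubmits=4    # resubmissions after the initial submit

# 4 resubmits + 1 initial submit, 20 years each = 100 simulated years.
total_years=$(( (num_resubmits + 1) * stop_num ))
# 20 years per submission / 5 years per restart = 4 restart files each.
restarts_per_submit=$(( stop_num / restart_num ))

echo "total simulated years: ${total_years}"
echo "restart files per submission: ${restarts_per_submit}"
```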
Model runs need to return the same results whether they use restart or not. If that is not the case, then a non-bit-for-bit change has been introduced.
Running the Model
Run the model by doing the following:
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.csh |
The repo will be checked out if fetch_code = true
. The code will be compiled if old_executable = false
. After the script finishes, the job has been submitted and you are free to close your terminal. The job will still run.
Looking at Results
Code Block |
---|
cd <simulations_dir>/<case_name>
ls |
Explanation of directories:
build: all the stuff to compile. The executable is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
atm: atmosphere
cpl: coupler
ice: sea ice
lnd: land
ocn: ocean
rof: river runoff
Run tail -f run/<component>.log.<latest log file>
to keep up with a log in real time.
You can use the sq
alias defined in the “Useful Aliases” section to check on the status of the job. The NODE
in the output indicates the number of nodes used and is dependent on the processor_config
size.
When running on two different machines (such as Compy and Chrysalis) and/or two different compilers, the answers will not be the same, bit-for-bit. It is not possible using floating point operations to get bit-for-bit identical results across machines/compilers.
Logs being compressed to .gz
files is one of the last steps before the job is done. less <log>.gz
will let you look at a gzipped log.
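Besides less, zcat and zgrep work directly on gzipped logs. A small sketch using a throwaway file (real logs live under <simulations_dir>/<case_name>/run):

```shell
#!/bin/bash
# Create a tiny fake atmosphere log and gzip it, just to demonstrate.
tmpdir=$(mktemp -d)
printf ' nstep, te      0   1.0\n nstep, te      1   2.0\n' > "${tmpdir}/atm.log.demo"
gzip "${tmpdir}/atm.log.demo"

# View without uncompressing on disk (or page through with: less <log>.gz).
zcat "${tmpdir}/atm.log.demo.gz"
# Search inside the gzipped log; here, count the global-integral lines.
zgrep -c '^ nstep, te ' "${tmpdir}/atm.log.demo.gz"
rm -r "${tmpdir}"
```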
Short Term Archiving
Short term archiving can be accomplished with the following steps. This can be done while the model is still running.
Use --force-move
to move instead of copying, which can take a long time. Set --last-date
to the latest date in the simulation you want to archive. You do not have to specify a beginning date.
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <e3sm_simulations_dir>/<case_name>/archive
ls |
Each component of the model has a subdirectory in archive
. There are also two additional subdirectories: logs
holds the gzipped log files and rest
holds the restart files.
Component | Subdirectory | Files in the Subdirectory |
---|---|---|
Atmosphere (Earth Atmospheric Model) | atm/hist | eam.h* |
Coupler | cpl/hist | cpl.h* |
Sea Ice (MPAS-Sea-Ice) | ice/hist | mpassi.hist.* |
Land (Earth Land Model) | lnd/hist | elm.h* |
Ocean (MPAS-Ocean) | ocn/hist | mpaso.hist.* |
River Runoff (MOSART) | rof/hist | mosart.h* |
Archiving a Complete Run
cd <simulations_dir>/<case_name>
To archive a one month run, for example:
mkdir test_1M_S
(1M for one month and S for the processor_config size)
mv case_scripts test_1M_S/case_scripts
mv run test_1M_S/run
test_1M_S is the archive of the one month run. Now you're free to change some settings in <run_scripts_dir>/run.<case_name>.csh and run it again.
Performance Information
Model throughput is the number of simulated years per day. You can find this with:
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm* |
PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm
, ocn
, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
Verifying Results Were BFB
If we ran the model twice, we can confirm the results were bit-for-bit (BFB) the same. Let's compare a one month run (processor_config = S
) with a multi-year run (processor_config = L
).
Code Block |
---|
cd <simulations_dir>/<case_name>/
gunzip -c test_1M_S/run/atm.log.<log>.gz | grep '^ nstep, te ' > atm_S.txt
gunzip -c archive/logs/atm.log.<log>.gz | grep '^ nstep, te ' > atm_L.txt
diff atm_S.txt atm_L.txt | head |
In this case, the diff begins at the time step where the multi-year run continues but the one month run has stopped. Thus, the first month is BFB the same between the two runs.
This BFB check will help you spot bugs in the code.
Re-Submitting a Job After a Crash
If a job crashes, you can rerun with:
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case_submit |
To gzip
log files from failed jobs, run gzip *.log.<job ID>.*
Post-Processing
To post-process a model run, do the following steps. Note that to post-process up to year n, you must have short-term archived up to year n.
cd <post_processing_dir>
Configuration File
Create a new configuration file or copy an existing one. Call it <case_name>.cfg
.
The sections of the configuration file are described below:
[default]
input, output, www paths may need to be edited.
[climo]
mapping_file path may need to be edited.
Typically generate climatology files every 20 or 50 years:
years = begin_year:end_year:averaging_period – e.g., years = "1:80:20", "1:50:50",
[ts]
mapping_file path may need to be edited.
Time series are typically done in chunks of 10 years – e.g., years = "1:80:10"
[e3sm_diags]
reference_data_path may need to be edited.
short_name is a shortened version of the case_name.
years should match the [climo] section years.
[mpas_analysis]
enso_years should start with year 11 – e.g., enso_years = "11-50",
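Putting the sections above together, a minimal configuration file might look like the sketch below. All paths, the short name, and the year ranges are placeholders, not recommendations:

```shell
#!/bin/bash
# Write a minimal config with the sections described above.
# Every path and year range here is a hypothetical example.
cat > demo.cfg << 'EOF'
[default]
input = /lcrc/group/e3sm/<username>/E3SM_simulations/<case_name>
output = /lcrc/group/e3sm/<username>/post/<case_name>
www = /lcrc/group/e3sm/<username>/www

[climo]
mapping_file = /path/to/mapping_file.nc
years = "1:80:20", "1:50:50",

[ts]
mapping_file = /path/to/mapping_file.nc
years = "1:80:10",

[e3sm_diags]
reference_data_path = /path/to/reference_data
short_name = v2_test01
years = "1:80:20", "1:50:50",

[mpas_analysis]
enso_years = "11-50",
EOF
grep -c '^\[' demo.cfg   # count of sections
rm demo.cfg
```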
Launch Post-Processing Jobs
If you haven't already, check out the PreAndPostProcessingScripts repo. Go to https://github.com/E3SM-Project/PreAndPostProcessingScripts. Get the path to clone by clicking the green "Code" button. Run git clone <path>.
Load the E3SM unified environment. For LCRC machines, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified.sh
. Commands for other machines can be found at https://e3sm-project.github.io/e3sm_diags/docs/html/quickguides/quick-guide-general.html.
Run python PreAndPostProcessingScripts/postprocessing_bundle/v2/post.py -c <case_name>.cfg
. This will launch a number of jobs. Run sq
to see what jobs are running.
e3sm_diags
jobs are dependent on climo
jobs, so they wait for those to finish. Most run quickly, though MPAS analysis may take several hours.
These jobs create a new directory <simulations_dir>/<case_name>/post
.
Code Block |
---|
cd <simulations_dir>/<case_name>/post/scripts
cat *.status # should be a list of "OK"
grep -v "OK" *.status # lists files without "OK" |
If you re-run post-processing, it will check the status of tasks and skip any whose status is OK.
Tasks
If you run ls
you’ll probably see a file like e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status
. This is one e3sm_diags
job. Parts of the file name are explained below:
Meaning | Part of File Name |
---|---|
Task | e3sm_diags |
Grid | 180x360_aave |
Model/obs v. model/obs | model_vs_obs |
First and last years | 0001-0020 |
There is also a corresponding file. It will have the same name but end with .o<job ID>
instead of .status
.
Output
The post-processing output is organized hierarchically as follows:
<e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr has the time series files – one variable per file, in 10 year chunks as defined in <post_processing_dir>/<case_name>.cfg.
<e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/clim/20yr similarly has climatology files for 20 year periods, as defined in <post_processing_dir>/<case_name>.cfg.
<e3sm_simulations_dir>/<case_name>/post/atm/glb/ts/monthly/10yr has globally averaged files for 10 year periods, as defined in <post_processing_dir>/<case_name>.cfg. The glb directory currently doesn't follow the same file naming convention as 180x360_aave.
Documenting the Model Run
You should create a Confluence page for your model run in v2 beta1 simulations. Use Simulation Run Template as a template. See below for how to fill out this template.
Code
code_root_dir and tag_name are defined in <run_scripts_dir>/run.<case_name>.csh.
Code Block |
---|
cd <code_root_dir>/<tag_name>
git log |
The commit hash at the top is the most recent commit.
Add “<branch name>, <commit hash>” to this section of your page.
Configuration
Compset
and Res
are specified in the PACE “Experiment Details” section. See “Performance Information” for how to access PACE. Choose the latest job and list these settings on your page.
Custom parameters should also be listed. Find these by running:
Code Block |
---|
cd <run_scripts_dir>
grep -n "EOF >> user_nl" run.<case_name>.csh # Find the line numbers to look at |
Copy the code blocks after cat <<EOF >> user_nl_eam
, cat << EOF >> user_nl_elm
, and cat << EOF >> user_nl_mosart
to your page.
Scripts
...
Useful Aliases
Setting some aliases may be useful in running the model. You can edit ~/.bashrc
to add aliases. Run source ~/.bashrc
to start using them.
Note that the specific file name may differ amongst machines. Other possibilities include ~/.bash_profile
, ~/.bashrc.ext
, or ~/.bash_profile.ext
.
Batch jobs
To check on all batch jobs:
alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %.8Q %j" --sort=P,-t,-p'
To check on your batch jobs:
alias sq='sqa -u $USER'
The output of sq
uses several abbreviations: ST = Status, R = running, PD = pending, CG = completing.
Directories
You will be working in several directories.
On Anvil or Chrysalis (i.e., LCRC machines), those paths are:
<run_scripts_dir>: ${HOME}/E3SM/scripts
<code_source_dir>: ${HOME}/E3SM/code
<simulations_dir>: /lcrc/group/e3sm/<username>/E3SMv2
So, it may be useful to set the following aliases:
Code Block |
---|
# Model running
alias run_scripts="cd ${HOME}/E3SM/scripts/"
alias simulations="cd /lcrc/group/e3sm/<username>/E3SMv2/" |
On other machines, the paths are the same, except for the <simulations_dir>.
On compy (PNNL):
<simulations_dir>: /compyfs/<username>/E3SMv2
On cori (NERSC):
<simulations_dir>: ${CSCRATCH}/E3SMv2
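The per-machine paths above can be captured in a small case statement. This is just a sketch (the machine name is a plain variable here, not auto-detected):

```shell
#!/bin/bash
# Pick <simulations_dir> per machine, following the paths listed above.
machine="compy"   # example; set to the machine you are on
case "${machine}" in
  anvil|chrysalis) simulations_dir="/lcrc/group/e3sm/${USER}/E3SMv2" ;;
  compy)           simulations_dir="/compyfs/${USER}/E3SMv2" ;;
  cori)            simulations_dir="${CSCRATCH}/E3SMv2" ;;
esac
echo "${simulations_dir}"
```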
Configuring the Model Run – Run Script
Start with an example of a run script for a low-resolution coupled simulation:
The template script in the E3SM repository, which uses cori-knl: https://github.com/E3SM-Project/E3SM/blob/master/run_e3sm.template.sh
This guide uses a similar script, which uses chrysalis: https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/run.v2.LR.piControl.sh (Note: this is a private repo).
Create a new run script or copy an existing one. The path to it should be <run_scripts_dir>/run.<case_name>.sh
# Machine and project
readonly MACHINE=chrysalis: the name of the machine you’re running on.
readonly PROJECT="e3sm": SLURM project accounting (typically e3sm).
# Simulation
readonly COMPSET="WCYCL1850": compset (configuration).
readonly RESOLUTION="ne30pg2_EC30to60E2r2": resolution (low-resolution coupled simulation in this case).
ne30 is the number of spectral elements for the atmospheric dynamics grid, while pg2 refers to the physics grid option. This mesh grid spacing is approximately 110 km.
EC30to60E2r2 is the ocean and sea-ice resolution. The grid spacing varies between 30 and 60 km.
For simulations with regionally refined meshes, such as the N American atmosphere grid coupled to the WC14 ocean and sea-ice, replace with northamericax4v1pg2_WC14to60E2r3.
readonly CASE_NAME="v2.LR.piControl":
v2.LR is a short custom description to help identify the simulation.
piControl is the type of simulation. Other options here include, but are not limited to: amip, F2010.
readonly CASE_GROUP="v2.LR":
This will let you mark multiple cases as part of the same group for later processing (e.g., with PACE).
Note: If this is part of a simulation campaign, ask your group lead about using a case_group label. Otherwise, please use a unique name to distinguish it from existing case_group label names, e.g. “v2.LR”.
# Code and compilation
readonly CHECKOUT="20210702": date the code was checked out on, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.
readonly BRANCH="master": branch the code was checked out from. Valid options include “master”, a branch name, or a git hash. For provenance purposes, it is best to specify the git hash.
readonly DEBUG_COMPILE=false: option to compile with DEBUG flag (leave set to false).
# Run options
readonly MODEL_START_TYPE="hybrid": specify how the model should start – use initial conditions, continue from existing restart files, branch, or hybrid.
readonly START_DATE="0001-01-01": model start date. Typically year 1 for simulations with perpetual (time-invariant) forcing, or a real year for simulations with transient forcings.
# Set paths
readonly CODE_ROOT="${HOME}/E3SMv2/code/${CHECKOUT}": where the E3SM code will be checked out.
readonly CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SMv2/${CASE_NAME}": where the results will go. The directory ${CASE_NAME} will be in <simulations_dir>.
# Sub-directories
readonly CASE_BUILD_DIR=${CASE_ROOT}/build: all the compilation files, including the executable.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive: where short-term archived files will reside.
# Define type of run
readonly run='production': type of simulation to run – i.e., a short test for verification or a long production run. (See next section for details.)
# Coupler history
readonly HIST_OPTION="nyears"
readonly HIST_N="5"
# Leave empty (unless you understand what it does)
readonly OLD_EXECUTABLE="": this is a somewhat risky option that allows you to re-use a pre-existing executable. This is not recommended because it breaks provenance.
# --- Toggle flags for what to do ----
This section controls what operations the script should perform. The run-e3sm script can be invoked multiple times, with the user having the option to bypass certain steps by toggling true/false:
do_fetch_code=false: fetch the source code from GitHub.
do_create_newcase=true: create a new case.
do_case_setup=true: case setup.
do_case_build=false: compile.
do_case_submit=true: submit the simulation.
The first time the script is called, all the flags should be set to true. Subsequently, the user may decide to bypass code checkout (do_fetch_code=false) or compilation (do_case_build=false). A user may also prefer to manually submit the job by setting do_case_submit=false and then invoking ./case.submit.
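The toggle-flag pattern can be sketched as below. This is a simplified illustration, not the run script itself; the real script wraps its fetch/create/setup/build/submit steps in conditionals like these:

```shell
#!/bin/bash
# Simplified sketch of the toggle-flag pattern.
do_fetch_code=false
do_case_build=true

run_step () {
  # $1 = flag value ("true"/"false"), $2 = step name
  if [ "$1" = "true" ]; then
    echo "running: $2"
  else
    echo "skipping: $2"
  fi
}

run_step "${do_fetch_code}" "fetch_code"
run_step "${do_case_build}" "case_build"
```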
Note |
---|
A case is tied to one code base and one executable. That is, if you change |
Running the Model
Short tests
Before starting a long production run, it is highly recommended to perform a few short tests to verify:
The model starts without errors.
The model produces BFB (bit-for-bit) results after a restart.
The model produces BFB results when changing PE layout.
(1) can spare you from a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.
Many code bugs can be caught with (2) and (3). While the E3SM nightly tests should catch such non-BFB errors, it is possible that you’ll be running a slightly different configuration (for example a different physics option) for which those tests have not been performed.
Running short tests
The type of run to perform is controlled by the script variable run
.
You should typically perform at least two short tests (two different layouts, with and without restart).
First, let’s start with a short test using the 'S' (small) PE layout and running for 2x5 days:
readonly run='S_2x5_ndays'
If you have not fetched and compiled the code, set all the toggle flags to true:
Code Block |
---|
do_fetch_code=true
do_create_newcase=true
do_case_setup=true
do_case_build=true
do_case_submit=true |
At this point, execute the run-e3sm script:
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.sh |
Fetching the code and compiling it will take some time (30 to 45 minutes), so go ahead and brew yourself a fresh cup of coffee. Once the script finishes, the test job will have been submitted to the batch queue.
You can immediately edit the script to prepare for the second short test. In this case, we will be running for 10 days (without restart) using the 'M' (medium) PE layout:
readonly run='M_1x10_ndays'
Since the code has already been fetched and compiled, change the toggle flags:
Code Block |
---|
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true |
and execute the script
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.sh |
Since we are bypassing the code fetch and compilation (by re-using the previous executable), the script should only take a few seconds to run and again should submit the second test.
Note that short tests use separate output directories, so it is safe to submit and run multiple tests at once. If you’d like, you could submit additional tests, for example 10 days with the medium 80 nodes ('M80') layout (M80_1x10_ndays).
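The short-test names follow a '<layout>_<submissions>x<length>_<units>' convention. A sketch of how to read such a name with plain parameter expansion (illustrative only; the run script does its own parsing):

```shell
#!/bin/bash
# Decode a short-test run name like S_2x5_ndays.
run='S_2x5_ndays'
layout=${run%%_*}        # PE layout: S
rest=${run#*_}
segments=${rest%%x*}     # number of submissions: 2
rest=${rest#*x}
length=${rest%%_*}       # length of each segment: 5
units=${rest#*_}         # units: ndays
echo "${layout} layout, ${segments} segment(s) of ${length} ${units}"
```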
Verifying results are BFB
Once the short tests are complete, we can confirm the results were bit-for-bit (BFB) the same. All the test output is located under the tests
sub-directory:
Code Block |
---|
cd <simulations_dir>/<case_name>/tests
for test in *
do
zgrep -h '^ nstep, te ' ${test}/run/atm.log.*.gz | uniq > atm_${test}.txt
done
md5sum *.txt
5bdfee6da8433cde08f33f3c046653c6 atm_M_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6 atm_M80_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6 atm_S_2x5_ndays.txt |
To verify that the results are indeed BFB, we extract global integral from the atmosphere log files (lines starting with ‘nstep, te’) and make sure that they are identical for all tests.
If the BFB check fails, you should stop here and understand why. If they succeed, you can now start the production simulation.
Production simulation
To prepare for the long production simulation, edit the run e3sm script and set:
readonly run='production'
In addition, you may need to customize some variables in the code block below to configure run options:
# Production simulation
readonly PELAYOUT="M": 1=single processor, S=small, M=medium, L=large, X1=very large, X2=very very large. Production simulations typically use M or L. The size determines how many nodes will be used. The exact number of nodes will differ amongst machines.
readonly WALLTIME="28:00:00": maximum wall clock time requested for the batch jobs.
readonly STOP_OPTION="nyears" and readonly STOP_N="20": units and length of each segment (i.e., each batch job).
readonly REST_OPTION="nyears" and readonly REST_N="5": units and frequency for writing restart files (make sure STOP_N is a multiple of REST_N, otherwise the model will stop without writing a restart file at the end).
readonly RESUBMIT="9": number of resubmissions beyond the original segment. This simulation would run for a total of 200 years (20 + 9x20).
readonly DO_SHORT_TERM_ARCHIVING=false: leave set to false if you want to manually run the short term archive.
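A quick sanity check of the arithmetic above, using the example values from this section:

```shell
#!/bin/bash
# Production settings from the example above.
STOP_N=20
REST_N=5
RESUBMIT=9

# 9 resubmissions + 1 initial segment, 20 years each = 200 simulated years.
total_years=$(( (RESUBMIT + 1) * STOP_N ))
echo "total simulated years: ${total_years}"

# STOP_N should be a multiple of REST_N, or the final restart is skipped.
if [ $(( STOP_N % REST_N )) -ne 0 ]; then
  echo "warning: STOP_N is not a multiple of REST_N"
fi
```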
Since the code has already been fetched and compiled for the short tests, the toggle flags can be set to:
Code Block |
---|
do_fetch_code=false
do_create_newcase=true
do_case_setup=true
do_case_build=false
do_case_submit=true |
Finally, execute the script
Code Block |
---|
cd <run_scripts_dir>
./run.<case_name>.sh |
The script will automatically submit the first job. New jobs will automatically be resubmitted at the end until the total number of segments has been run.
Looking at Results
Code Block |
---|
cd <simulations_dir>/<case_name>
ls |
Explanation of directories:
build: all the stuff to compile. The executable (e3sm.exe) is also there.
case_scripts: the files for your particular simulation.
run: where all the output will be. Most components (atmosphere, ocean, etc.) have their own log files. The coupler exchanges information between the components. The top level log file will be of the form run/e3sm.log.*. Log prefixes correspond to components of the model:
atm: atmosphere
cpl: coupler
ice: sea ice
lnd: land
ocn: ocean
rof: river runoff
Run tail -f run/<component>.log.<latest log file>
to keep up with a log in real time.
You can use the sq
alias defined in the “Useful Aliases” section to check on the status of the job. The NODE
in the output indicates the number of nodes used and is dependent on the processor_config
/ PELAYOUT
size.
When running on two different machines (such as Compy and Chrysalis) and/or two different compilers, the answers will not be the same, bit-for-bit. It is not possible using floating point operations to get bit-for-bit identical results across machines/compilers.
Logs being compressed to .gz
files is one of the last steps before the job is done and will indicate successful completion of the segment. less <log>.gz
will let you directly look at a gzipped log.
Short Term Archiving
By default, E3SM will store all output files under the run/
sub-directory. For long simulations, there could be 10,000s to 100,000s of output files. Having so many files in a single directory can be very impractical, slowing down simple operations like ls
to a crawl. CIME includes a short-term archiving utility that will neatly organize output files into a separate archive/
sub-directory.
Short term archiving can be accomplished with the following steps. This can be done while the model is still running.
Use --force-move
to move instead of copying, which can take a long time. Set --last-date
to the latest date in the simulation you want to archive. You do not have to specify a beginning date.
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
./case.st_archive --last-date 0051-01-01 --force-move --no-incomplete-logs
cd <e3sm_simulations_dir>/<case_name>/archive
ls |
Each component of the model has a subdirectory in archive
. There are also two additional subdirectories: logs
holds the gzipped log files and rest
holds the restart files.
Component | Subdirectory | Files in the Subdirectory |
---|---|---|
Atmosphere (Earth Atmospheric Model) | atm/hist | eam.h* |
Coupler | cpl/hist | cpl.h* |
Sea Ice (MPAS-Sea-Ice) | ice/hist | mpassi.hist.* |
Land (Earth Land Model) | lnd/hist | elm.h* |
Ocean (MPAS-Ocean) | ocn/hist | mpaso.hist.* |
River Runoff (MOSART) | rof/hist | mosart.h* |
Performance Information
Model throughput is the number of simulated years per day. You can find this with:
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts/timing
grep "simulated_years" e3sm* |
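Throughput also gives a rough wall-clock estimate for a planned run. A sketch, with a made-up throughput value (read yours from the timing files above):

```shell
#!/bin/bash
# Rough wall-clock estimate from model throughput.
sypd=10.5          # simulated years per day (example value, not measured)
target_years=200   # planned simulation length

awk -v y="${target_years}" -v s="${sypd}" \
  'BEGIN { printf "about %.1f wall-clock days\n", y / s }'
```

Queue wait time and any failed segments add to this, so treat it as a lower bound.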
PACE provides detailed performance information. Go to https://pace.ornl.gov/ and enter your username to search for your jobs. You can also simply search by providing the JobID appended to log files (NNNNN.yymmdd-hhmmss where NNNNN is the Slurm job id). Click on a job ID to see its performance details. “Experiment Details” are listed at the top of the job’s page. There is also a helpful chart detailing how many processors and how much time each component (atm
, ocn
, etc.) used. White areas indicate time spent idle/waiting. The area of each box is essentially the "cost = simulation time * number of processors" of the corresponding component.
Re-Submitting a Job After a Crash
If a job crashes, you can rerun with:
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case.submit |
If you need to change an XML value, the following commands in the case_scripts directory are useful:
Code Block |
---|
> ./xmlquery <variable> # Get value of a variable
> ./xmlchange -id <variable> -val <value> # Set value of a variable |
Before re-submitting:
Check that the rpointer files all point to the last restart. On very rare occasions, there might be some inconsistency if the model crashed at the end. Run head -n 1 rpointer.* to see the restart date.
gzip all the *.log files from the faulty segment so that they get moved during the next short-term archiving. To gzip log files from failed jobs, run gzip *.log.<job ID>* (where <job ID> has no periods/dots in it).
Delete core or error files if there are any. MPAS components will sometimes produce a large number of them. The following commands are useful for checking for these files:
ls | grep -in core
ls | grep -in err
If you are re-submitting the initial job, you will need to run ./xmlchange -id CONTINUE_RUN -val TRUE
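The rpointer and log checks above can be rehearsed against throwaway files (the file names below are hypothetical; real rpointer and log files live in the case run directory):

```shell
#!/bin/bash
# Demonstrate the pre-resubmit checks on fake files in a scratch directory.
tmpdir=$(mktemp -d); cd "${tmpdir}"
printf './case.r.0006-01-01-00000.nc\n' > rpointer.atm
printf './case.r.0006-01-01-00000.nc\n' > rpointer.ocn
printf 'log text\n' > e3sm.log.12345.210101-120000

head -n 1 rpointer.*        # restart dates should all agree
gzip ./*.log.*              # compress faulty-segment logs for archiving
ls | grep -in core || true  # check for core files (none here)
cd - > /dev/null && rm -r "${tmpdir}"
```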
Post-Processing with zppy
To post-process a model run, do the following steps. Note that to post-process up to year n, you must have short-term archived up to year n.
You can ask questions about zppy
on https://github.com/E3SM-Project/zppy/discussions/categories/general
Install zppy
Load the E3SM unified environment. For Chrysalis, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
. Commands for other machines, and an installation guide for the development version of zppy, can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html
Configuration File
Create a new post-processing configuration file or copy an existing one. A good starting point is the configuration file corresponding to the simulation above: https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/post.v2.LR.piControl.cfg (Note: this is a private repo)
Call it post.<case_name>.cfg
and place it in your <run_scripts_dir>
.
Edit the file and customize as needed. The file is structured with [section]
and [[sub-sections]]
. There is a [default]
section, followed by additional sections for each available zppy
task (climo, ts, e3sm_diags, mpas_analysis, …). Sub-sections can be used to have multiple instances of a particular task, for example regridded monthly or globally averaged time series files. Refer to the zppy documentation for more details.
The key sections of the configuration file are:
[default]
input, output, www paths may need to be edited.
[climo]
mapping_file path may need to be edited.
Typically generate climatology files every 20 or 50 years:
years = begin_year:end_year:averaging_period – e.g., years = "1:80:20", "1:50:50",
[ts]
mapping_file path may need to be edited.
Time series are typically done in chunks of 10 years – e.g., years = "1:80:10"
[e3sm_diags]
reference_data_path may need to be edited.
short_name is a shortened version of the case_name.
years should match the [climo] section years.
[mpas_analysis]
Years can be specified separately for time series, climatology and ENSO plots. The lists must have the same lengths and each entry will be mapped to a realization of mpas_analysis:
ts_years = "1-50", "1-100",
enso_years = "11-50", "11-100",
climo_years = "21-50", "51-100",
In this particular example, MPAS Analysis will be run twice. The first realization will produce time series plots covering years 1 to 50, ENSO plots for years 11 to 50 and climatology plots averaged over years 21-50. The second realization will cover years 1-100 for time series, 11-100 for ENSO, 51-100 for climatologies.
[global_time_series]
ts_years should match the [mpas_analysis] section ts_years.
climo_years should match the [mpas_analysis] section climo_years.
See the zppy tutorial for complete configuration file examples.
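As a sketch, here is a fragment with the paired year lists described above (the year ranges are the hypothetical examples from this section). The lists in [mpas_analysis] must have the same number of entries, entry for entry:

```shell
#!/bin/bash
# Write a fragment with paired year lists, then sanity-check list lengths.
cat > post.demo.cfg << 'EOF'
[mpas_analysis]
ts_years = "1-50", "1-100",
enso_years = "11-50", "11-100",
climo_years = "21-50", "51-100",

[global_time_series]
ts_years = "1-50", "1-100",
climo_years = "21-50", "51-100",
EOF

# Each *_years line should report the same entry count (here: 2).
grep '_years' post.demo.cfg | awk -F',' '{ print NF - 1 }'
rm post.demo.cfg
```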
Launch zppy
Make sure you load the E3SM unified environment.
Run `zppy -c post.<case_name>.cfg`. This will submit a number of jobs. Run `sq` to see what jobs are running. `e3sm_diags` jobs are dependent on `climo` and `ts` jobs, so they wait for those to finish. MPAS Analysis jobs re-use computations, so they are chained. Most jobs run quickly, though MPAS Analysis may take several hours.
These jobs create a new directory `<simulations_dir>/<case_name>/post`. Each realization will have a shell script (typically bash); this is the actual file that has been submitted to the batch system. There will also be a log file `*.o<job ID>` as well as a `*.status` file. The status file indicates the state (WAITING, RUNNING, OK, ERROR). Once all the jobs are complete, you can check their status:
Code Block |
---|
cd <simulations_dir>/<case_name>/post/scripts
cat *.status # should be a list of "OK"
grep -v "OK" *.status # lists files without "OK" |
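A handy complement to the `grep -v` line above is `grep -L`, which prints only the names of files that do not contain "OK". The sketch below demonstrates it on made-up status files in a scratch directory; in real use, run just the `grep -L` line in `post/scripts`.

```shell
# Demo on made-up status files; the file names here are hypothetical.
cd "$(mktemp -d)"
echo "OK" > climo_0001-0020.status
echo "ERROR (1)" > ts_0001-0080.status
grep -L "OK" *.status    # -> ts_0001-0080.status
```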
If you re-run zppy
, it will check status of tasks and will skip any task if its status is “OK”.
As your simulation progresses, you can update the post-processing years in the configuration file and re-run zppy
. Newly added task will be submitted, while previously completed ones will be skipped.
Tasks
If you run `ls` you’ll probably see files like `e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status`. This is one `e3sm_diags` job. Parts of the file name are explained below:
Meaning | Part of File Name |
---|---|
Task | e3sm_diags |
Grid | 180x360_aave |
Model/obs v. model/obs | model_vs_obs |
First and last years | 0001-0020 |
There is also a corresponding output file. It will have the same name but end with `.o<job ID>` instead of `.status`.
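The year range, for instance, can be recovered from such a file name with shell parameter expansion (the file name below is just the example from above):

```shell
f="e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status"
name="${f%.status}"    # drop the .status suffix
years="${name##*_}"    # last underscore-delimited field: first and last years
echo "${years}"        # -> 0001-0020
```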
Output
The post-processing output is organized hierarchically as follows (where the exact year ranges are those defined in the configuration file):
`<e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr` has the time series files – one variable per file, in 10-year chunks as defined in `<run_scripts_dir>/post.<case_name>.cfg`.
`<e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/clim/20yr` similarly has climatology files for 20-year periods, as defined in `<run_scripts_dir>/post.<case_name>.cfg`.
`<e3sm_simulations_dir>/<case_name>/post/atm/glb/ts/monthly/10yr` has globally averaged files for 10-year periods, as defined in `<run_scripts_dir>/post.<case_name>.cfg`.
Documenting the Model Run
You should create a Confluence page for your model run in /wiki/spaces/EWCG/pages/2126938167. Use /wiki/spaces/EWCG/pages/2297299190 as a template. See below for how to fill out this template.
Code
`code_root_dir` and `tag_name` are defined in `<run_scripts_dir>/run.<case_name>.csh`.
Code Block |
---|
cd <code_root_dir>/<tag_name>
git log |
The commit hash at the top is the most recent commit.
Add “<branch name>, <commit hash>” to this section of your page.
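The hash and branch name can also be printed directly with git plumbing commands. The snippet below creates a throwaway repository only so it runs anywhere; in practice, run just the last two commands inside `<code_root_dir>/<tag_name>`.

```shell
# Throwaway demo repo; skip this setup in real use.
cd "$(mktemp -d)"
git init -q .
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "demo"

git log -1 --format="%H"           # full hash of the most recent commit
git rev-parse --abbrev-ref HEAD    # current branch name
```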
Configuration
`Compset` and `Res` are specified in the PACE “Experiment Details” section. See “Performance Information” for how to access PACE. Choose the latest job and list these settings on your page.
Custom parameters should also be listed. Find these by running:
Code Block |
---|
cd <run_scripts_dir>
grep -n "EOF >> user_nl" run.<case_name>.csh # Find the line numbers to look at |
Copy the code blocks after `cat <<EOF >> user_nl_eam`, `cat <<EOF >> user_nl_elm`, and `cat <<EOF >> user_nl_mosart` to your page.
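Those blocks can also be pulled out mechanically. The sketch below generates a stand-in run script (the namelist values are made up) and extracts the `user_nl_eam` block with awk:

```shell
cd "$(mktemp -d)"
# Stand-in for run.<case_name>.csh, with a made-up namelist block
cat > run_example.csh <<'CSH'
cat <<EOF >> user_nl_eam
 nhtfrq = 0,-24
 mfilt  = 1,30
EOF
CSH
# Print the lines between the heredoc start marker and the closing EOF
awk '/<<EOF >> user_nl_eam/{flag=1; next} /^EOF$/{flag=0} flag' run_example.csh
```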
Scripts
Push your <run_scripts_dir>/run.<case_name>.csh
to https://github.com/E3SM-Project/SimulationScripts , in the archive/v2/beta/coupled
directory. Then link it to this section of your page.
Output files
Specify the path to your output files: <simulations_dir>/<case_name>
.
Jobs
Fill out a table with columns for “Job”, “Years”, “Nodes”, “SYPD”, and “Notes”.
Log file names will give you the job IDs. Logs are found in <simulations_dir>/<case_name>/run
. If you have done short term archiving, then they will instead be in <simulations_dir>/<case_name>/archive/logs
. Use ls
to see what logs are in the directory. The job ID will be the two-part (period-separated) number after .log.
.
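For example, with a hypothetical log name (the exact pattern varies by component and machine), the two-part number can be extracted with parameter expansion:

```shell
f="atm.log.445410.210504-110503.gz"    # hypothetical log file name
jobid="${f#*.log.}"    # strip everything through ".log."
jobid="${jobid%.gz}"   # drop the compression suffix
echo "${jobid}"        # -> 445410.210504-110503
```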
PACE’s “Experiment Details” section shows JobID
as well. In the table, link each job ID to its corresponding PACE web page. Note that failed jobs will not have a web page on PACE, but you should still list them in the table.
Use zgrep "DATE=" <log> | head -n 1
to find the start date. Use zgrep "DATE=" <log> | tail -n 1
to find the end date. If you like, you can write a `bash` function to make this easier:
Code Block |
---|
get_dates()
{
# Print the first and last simulated dates recorded in each atmosphere log
for f in atm.log.*.gz; do
echo $f
zgrep "DATE=" $f | head -n 1  # start date
zgrep "DATE=" $f | tail -n 1  # end date
echo ""
done
} |
(If zgrep
is unavailable, use less <log>
to look at a gzipped log file. Scroll down a decent amount to DATE=
to find the start date. Use SHIFT+g
to go to the end of the file. Scroll up to DATE=
to find the end date.)
In the “Years” column specify <start> - <end>
, with each in year-month-day
format.
To find the number of nodes, first look at the Processor # / Simulation Time chart on PACE. The x-axis lists the highest MPI rank used, with base-0 numbering of ranks. (PE layouts often don’t fit exactly N nodes but instead fill N-1 nodes and have some number of ranks left over on the final node, leaving some cores on that node unused.) Then find MPI tasks/node
in the “Experiment Details” section. The number of nodes can then be calculated as `ceil((highest MPI rank + 1) / (MPI tasks/node))`.
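As a worked example with made-up numbers (highest MPI rank 1349, i.e., 1350 ranks, and 64 tasks per node), the ceiling can be computed with shell integer arithmetic:

```shell
highest_rank=1349    # highest MPI rank on the PACE chart (base-0)
tasks_per_node=64    # "MPI tasks/node" from Experiment Details
# Integer ceiling division: ceil(a/b) == (a + b - 1) / b
nodes=$(( (highest_rank + 1 + tasks_per_node - 1) / tasks_per_node ))
echo "${nodes}"      # -> 22
```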
The SYPD (simulated years per day) is listed in PACE’s “Experiment Details” section as Model Throughput
.
In the “Notes” section, mention if a job failed or if you changed anything before re-running a job.
Global time series
Be sure to have set the [global_time_series]
task in zppy
.
Example configuration:
Code Block |
---|
# Global time series plots
[global_time_series]
active = True
years = "1-100",
ts_num_years = 10
figstr=coupled_v2_test01
moc_file=mocTimeSeries_0001-0100.nc
experiment_name=20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis
ts_years = "1-50", "1-100", |
That will produce <figstr>.pdf
and <figstr>.png
. They will be available automatically at <www>/<case_name>/<figstr>.png
. You can download the image from the website and then upload it to your Confluence page.
E3SM Diags
The template page already includes baseline diagnostics. Add your own diagnostics links labeled as <start_year>-<end_year>
.
Your diagnostics are located at the web address corresponding to the `www` path in `<run_scripts_dir>/post.<case_name>.cfg`.
See https://e3sm-project.github.io/e3sm_diags/_build/html/master/quickguides/quick-guide-general.html for finding the URLs for the web portals on each E3SM machine (listed as <web_address>
).
Fill the table with the specific web links: e.g., https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/e3sm_diags/180x360_aave/model_vs_obs_0001-0020/viewer/
.
MPAS Analysis
See https://e3sm-project.github.io/e3sm_diags/_build/html/master/quickguides/quick-guide-general.html for finding the URLs for the web portals on each E3SM machine (listed as <web_address>
)
Make a bulleted list of links, e.g., for https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/mpas_analysis/ts_0001-0050_climo_0021-0050/
, create a bullet “1-50 (time series), 21-50 (climatology)”.
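The bullet text can be read straight off the directory name; for instance, with a hypothetical `ts_0001-0050_climo_0021-0050` directory (zero-padding kept as-is here, unlike the shortened labels above):

```shell
d="ts_0001-0050_climo_0021-0050"    # hypothetical mpas_analysis output directory
ts="${d#ts_}"; ts="${ts%%_*}"       # -> 0001-0050
climo="${d##*climo_}"               # -> 0021-0050
echo "${ts} (time series), ${climo} (climatology)"
```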
Long Term Archiving with zstash
Simulations that are deemed sufficiently valuable should be archived using zstash
for long-term preservation.
You can ask questions about zstash
on https://github.com/E3SM-Project/zstash/discussions/categories/general
Compy, Anvil and Chrysalis do not have local HPSS. We rely on NERSC HPSS for long-term archiving.
If you are archiving a simulation run on Compy or LCRC (Chrysalis/Anvil), do all of the following steps. If you are archiving a simulation run on NERSC (Cori), skip to step 4.
1. Clean up directory
Log into the machine that you ran the simulation on.
Remove all eam.i
files except the latest one. Dates are of the form <YYYY-MM-DD>
.
Code Block |
---|
$ cd <simulations_dir>/<case_name>/run
$ ls | wc -l # See how many items are in this directory
$ mv <case_name>.eam.i.<YYYY-MM-DD>-00000.nc tmp.nc
$ rm <case_name>.eam.i.*.nc
$ mv tmp.nc <case_name>.eam.i.<YYYY-MM-DD>-00000.nc
$ ls | wc -l # See how many items are in this directory |
There may still be more files than necessary to archive. You can probably remove `*.err`, `*.lock`, `*debug_block*`, and `*ocean_block_stats*` files.
2. zstash create
& Transfer to NERSC HPSS
2.a. E3SM Unified v1.6.0 / zstash v1.2.0 or greater
If you are using E3SM Unified v1.6.0
or greater, https://github.com/E3SM-Project/zstash/pull/154 has enabled Globus (https://www.globus.org/ ) transfer with zstash create
.
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=globus://nersc/home/<first letter>/<username>/E3SMv2/${EXP} --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
Then, do the following:
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# `index.db`, and a `zstash_create` log should be present
# No tar files should be listed
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
Then, on NERSC/Cori:
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Tar files and `index.db` should be listed.
# Note `| wc -l` doesn't work on hsi
$ exit |
2.b. Earlier releases
2.b.i. zstash create
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=none --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
Then, do the following:
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
# (On Chrysalis, for 165 years of data, this takes ~14 hours)
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# tar files, `index.db`, and a `zstash_create` log should be present
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
2.b.ii. Transfer to NERSC
On a NERSC machine (Cori):
Code Block |
---|
$ mkdir -p /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should be empty |
Log into Globus, using your NERSC credentials: https://www.globus.org/
(Left hand side): Transfer from
<the machine's DTN> <simulations_dir>/<case_name>/zstash
Click enter on path and "select all" on left-hand side
Transfer to
NERSC DTN /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
Notice we're using `cfs` rather than `scratch` on Cori
Click enter on the path
Click “Transfer & Sync Options” in the center.
Choose:
“sync - only transfer new or changed files” (choose “modification time is newer” in the dropdown box)
“preserve source file modification times”
“verify file integrity after transfer”
For “Label This Transfer”: “zstash <case_name> <machine name> to NERSC”
Click "Start" on the left hand side.
You should get an email from Globus when the transfer is completed. (On Chrysalis, for 165 years of data, this transfer takes ~13 hours).
2.b.iii. Transfer to HPSS
On a NERSC machine (Cori):
Code Block |
---|
Log in to Cori
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls | wc -l
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ ls *.tar | wc -l
# Should be two less than the previous result,
# since `index.db` and the `zstash_create` log are also present.
$ hsi
$ pwd
# Should be /home/<first letter>/<username>
$ ls E3SMv2
# Check what you already have in the directory
# You don't want to accidentally overwrite a directory already in HPSS.
$ exit
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
$ screen
$ screen -ls # Output should say "Attached"
# https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss/managing-files-hsi
# cput will not transfer file if it exists.
$ hsi "cd /home/<first letter>/<username>/E3SMv2/; cput -R <case_name>"
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the `hsi` command to finish
# (On Chrysalis, for 165 years of data, this takes ~2 hours)
$ screen -r # Return to screen
# Check output for any errors
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ exit |
3. zstash check
On a NERSC machine (Cori):
Code Block |
---|
$ cd /global/homes/<first letter>/<username>
$ emacs batch_zstash_check.bash |
Paste the following in that file, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on NERSC dtn
# Load environment that includes zstash
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Checking ${EXP} ===
cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
mkdir -p ${EXP}/zstash
cd ${EXP}
stamp=`date +%Y%m%d`
time zstash check -v --hpss=/home/<first letter>/<username>/E3SMv2/${EXP} --workers 2 2>&1 | tee zstash/zstash_check_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
If you’re using E3SM Unified v1.6.0 / zstash v1.2.0 (or greater) and you want to check a long simulation, you can use the --tars
option introduced in https://github.com/E3SM-Project/zstash/pull/170 to split the checking into more manageable pieces:
Code Block |
---|
# Starting at 00005a until the end
zstash check --tars=00005a-
# Starting from the beginning to 00005a (included)
zstash check --tars=-00005a
# Specific range
zstash check --tars=00005a-00005c
# Selected tar files
zstash check --tars=00003e,00004e,000059
# Mix and match
zstash check --tars=000030-00003e,00004e,00005a- |
Then, do the following:
Code Block |
---|
$ ssh dtn01.nersc.gov
$ screen
$ screen -ls # Output should say "Attached"
$ cd /global/homes/<first letter>/<username>
$ ./batch_zstash_check.bash
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the script to finish
# (On Chrysalis, for 165 years of data, this takes ~5 hours)
$ screen -r # Return to screen
# Check that output ends with `INFO: No failures detected when checking the files.`
# as well as listing `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ exit # exit data transfer node
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>/zstash
$ tail zstash_check_<stamp>.log
# Output should match the output from the screen (without the time information) |
Note |
---|
Because of https://github.com/E3SM-Project/zstash/issues/167 , for now it is a good idea to run |
4. Document
On a NERSC machine (Cori):
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2
# Check that the simulation case is now listed in this directory
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ exit
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>/zstash
$ ls
# `index.db` and `zstash_check` log should be the only items listed
# https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss/managing-files-hsi
# cput will not transfer file if it exists.
$ hsi "cd /home/<first letter>/<username>/E3SMv2/<case_name>; cput -R <zstash_check log>"
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# tar files, `index.db`, `zstash_create` log, and `zstash_check` log should be present
$ exit |
Update simulation Confluence page with information regarding this simulation (For Water Cycle’s v2 work, that page is /wiki/spaces/ED/pages/2766340117 ) . In the zstash archive
column, specify:
/home/<first letter>/<username>/E3SMv2/<case_name>
zstash_create_<stamp>.log
zstash_check_<stamp>.log
5. Delete files
On a NERSC machine (Cori):
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# tar files, `index.db`, `zstash_create` log, and `zstash_check` log should be present
# So, we can safely delete these items on cfs
$ exit
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should match output from the `ls` above
$ cd /global/cfs/cdirs/e3sm/<username>
$ ls E3SMv2
# Only the <case_name> you just transferred to HPSS should be listed
$ rm -rf E3SMv2 |
On the machine that you ran the simulation on:
Code Block |
---|
$ cd <simulations_dir>/<case_name>
$ ls zstash
# tar files, index.db, `zstash_create` log should be present
$ rm -rf zstash # Remove the zstash directory, keeping original files
$ cd <simulations_dir> |
More info
Refer to zstash's best practices for E3SM for details.
Publishing the simulation data (Optional)
The E3SM project has a policy of publishing all official simulation campaigns once those simulations are documented in publications. Refer to step 3 in Simulation Data Management for guidance on requesting data publication.