...
Useful Aliases
...
alias sqa='squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %.8Q %j" --sort=P,-t,-p'
To check on your batch jobs:
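For example, a minimal Slurm query (illustrative; not one of the aliases defined on this page):
Code Block |
---|
squeue -u <username> |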
...
<simulations_dir>: /lcrc/group/e3sm/<username>/E3SMv2
So, it may be useful to set the following aliases:
Code Block |
---|
# Model running
alias run_scripts="cd ${HOME}/E3SM/scripts/"
alias simulations="cd /lcrc/group/e3sm/<username>/E3SMv2/" |
On other machines, the paths are the same, except for the <simulations_dir>.
...
<simulations_dir>: /compyfs/<username>/E3SMv2
On Cori (NERSC):
<simulations_dir>: ${CSCRATCH}/E3SMv2
Configuring the Model Run – Run Script
Start with an example of a run script for a low-resolution coupled simulation:
The template script in the E3SM repository uses cori-knl: https://github.com/E3SM-Project/E3SM/blob/master/run_e3sm.template.sh
This guide uses a similar script for chrysalis: https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/run.v2.LR.piControl.sh (Note: this is a private repo).
Create a new run script or copy an existing one. The path to it should be <run_scripts_dir>/run.<case_name>.sh
...
readonly COMPSET="WCYCL1850"
: compset (configuration).
readonly RESOLUTION="ne30pg2_EC30to60E2r2"
: resolution (a low-resolution coupled simulation in this case). ne30 is the number of spectral elements for the atmospheric dynamics grid, while pg2 refers to the physics grid option; this mesh grid spacing is approximately 110 km. EC30to60E2r2 is the ocean and sea-ice resolution; the grid spacing varies between 30 and 60 km. For simulations with regionally refined meshes, such as the North American atmosphere grid coupled to the WC14 ocean and sea-ice, replace with northamericax4v1pg2_WC14to60E2r3.
readonly CASE_NAME="v2.LR.piControl"
: v2.LR is a short custom description to help identify the simulation, and piControl is the type of simulation. Other options include, but are not limited to: amip, F2010.
readonly CASE_GROUP="v2.LR"
: this will let you mark multiple cases as part of the same group for later processing (e.g., with PACE).
Note: If this is part of a simulation campaign, ask your group lead about using a case_group label. Otherwise, please use a unique name to distinguish it from existing case_group labels, e.g., "v2.LR".
# Code and compilation
readonly CHECKOUT="20210702"
: date the code was checked out on, in the form {year}{month}{day}. The source code will be checked out in a sub-directory named {year}{month}{day} under <code_source_dir>.
readonly BRANCH="master"
: branch the code was checked out from. Valid options include "master", a branch name, or a git hash. For provenance purposes, it is best to specify the git hash.
readonly DEBUG_COMPILE=false
: option to compile with the DEBUG flag (leave set to false).
...
readonly MODEL_START_TYPE="hybrid"
: specify how the model should start – valid options are initial (start from initial conditions), continue (continue from existing restart files), branch, and hybrid.
readonly START_DATE="0001-01-01"
: model start date. Typically year 1 for simulations with perpetual (time-invariant) forcing, or a real year for simulations with transient forcings.
# Set paths
readonly CODE_ROOT="${HOME}/E3SMv2/code/${CHECKOUT}"
: where the E3SM code will be checked out.
readonly CASE_ROOT="/lcrc/group/e3sm/${USER}/E3SMv2/${CASE_NAME}"
: where the results will go. The directory ${CASE_NAME} will be in <simulations_dir>.
# Sub-directories (leave unchanged)
readonly CASE_BUILD_DIR=${CASE_ROOT}/build
: all the compilation files, including the executable.
readonly CASE_ARCHIVE_DIR=${CASE_ROOT}/archive
: where short-term archived files will reside.
...
This section controls which operations the script performs. The run_e3sm script can be invoked multiple times, with the user having the option to bypass certain steps by toggling true/false:
do_fetch_code=false
: fetch the source code from GitHub.
do_create_newcase=true
: create a new case.
do_case_setup=true
: case setup.
do_case_build=false
: compile.
do_case_submit=true
: submit the simulation.
The first time the script is called, all the flags should be set to true. Subsequently, the user may decide to bypass the code checkout (do_fetch_code=false) or compilation (do_case_build=false). A user may also prefer to manually submit the job by setting do_case_submit=false and then invoking ./case.submit.
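For instance, a later invocation that reuses the existing checkout and executable might set the flags as follows (a sketch using the same flags as above):
Code Block |
---|
do_fetch_code=false    # Reuse the existing code checkout
do_create_newcase=true
do_case_setup=true
do_case_build=false    # Reuse the existing executable
do_case_submit=true    # Submit the simulation |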
Note |
---|
A case is tied to one code base and one executable. That is, if you change |
Running the Model
Short tests
Before starting a long production run, it is highly recommended to perform a few short tests to verify:
The model starts without errors.
The model produces BFB (bit-for-bit) results after a restart.
The model produces BFB results when changing PE layout.
(1) can spare you from a considerable amount of frustration. Imagine submitting a large job on a Friday afternoon, only to discover Monday morning that the job started to run on Friday evening and died within seconds because of a typo in a namelist variable or input file.
...
Code Block |
---|
cd <simulations_dir>/<case_name>/tests
for test in *
do
zgrep -h '^ nstep, te ' ${test}/run/atm.log.*.gz | uniq > atm_${test}.txt
done
md5sum *.txt
5bdfee6da8433cde08f33f3c046653c6 atm_M_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6 atm_M80_1x10_ndays.txt
5bdfee6da8433cde08f33f3c046653c6 atm_S_2x5_ndays.txt |
...
Each component of the model has a subdirectory in archive. There are also two additional subdirectories: logs holds the gzipped log files, and rest holds the restart files.
Component | Subdirectory | Files in the Subdirectory |
---|---|---|
Atmosphere (Earth Atmospheric Model) | | |
Coupler | | |
Sea Ice (MPAS-Sea-Ice) | | |
Land (Earth Land Model) | | |
Ocean (MPAS-Ocean) | | |
River Runoff (MOSART) | | |
Performance Information
Model throughput is the number of simulated years per day. You can find this with:
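One quick way to check from the run directory is to search the coupler log (a sketch; the exact wording of the throughput line in cpl.log can vary between model versions):
Code Block |
---|
cd <simulations_dir>/<case_name>/run
zgrep -h "simulated years" cpl.log.* | tail -n 1 |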
...
Code Block |
---|
cd <simulations_dir>/<case_name>/case_scripts
# Make any changes necessary to avoid the crash
./case.submit |
If you need to change an XML value, the following commands in the case_scripts directory are useful:
Code Block |
---|
> ./xmlquery <variable> # Get value of a variable
> ./xmlchange -id <variable> -val <value> # Set value of a variable |
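For example, to lengthen each job segment and resubmit automatically (STOP_OPTION, STOP_N, and RESUBMIT are standard CIME variables):
Code Block |
---|
> ./xmlquery STOP_OPTION,STOP_N,RESUBMIT # Current segment length and resubmit count
> ./xmlchange -id STOP_N -val 50 # Run 50 STOP_OPTION units (e.g., years) per segment
> ./xmlchange -id RESUBMIT -val 1 # Automatically resubmit one more segment |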
Before re-submitting:
- Check that the rpointer files all point to the last restart; run head -n 1 rpointer.* to see the restart date. On very rare occasions, there might be some inconsistency if the model crashed at the end of a segment.
- gzip all the *.log files from the faulty segment so that they get moved during the next short-term archiving. To gzip log files from failed jobs, run gzip *.log.<job ID>* (where <job ID> has no periods/dots in it).
- Delete core or error files if there are any. MPAS components will sometimes produce a large number of them. The following commands are useful for checking for these files: ls | grep -in core and ls | grep -in err
- If you are re-submitting the initial job, you will need to run ./xmlchange -id CONTINUE_RUN -val TRUE
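Putting the checklist together (a sketch to run from the case run directory; review before deleting anything):
Code Block |
---|
cd <simulations_dir>/<case_name>/run
head -n 1 rpointer.* # All components should point to the same restart date
gzip *.log.<job ID>* # Compress logs from the faulty segment
ls | grep -in core # Check for core files; delete any found
ls | grep -in err # Check for error files; delete any found |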
Post-Processing with zppy
To post-process a model run, do the following steps. Note that to post-process up to year n, you must have short-term archived up to year n.
You can ask questions about zppy
on https://github.com/E3SM-Project/zppy/discussions/categories/general
Install zppy
Load the E3SM unified environment. For Chrysalis, this is: source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh. Commands for other machines, and the installation guide for the development version of zppy, can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html
Configuration File
Create a new post-processing configuration file or copy an existing one. A good starting point is the configuration file corresponding to the simulation above: https://github.com/E3SM-Project/SimulationScripts/blob/master/archive/v2/post.v2.LR.piControl.cfg (Note: this is a private repo)
Call it post.<case_name>.cfg
and place it in your <run_scripts_dir>
.
Edit the file and customize as needed. The file is structured with [section] and [[sub-sections]]. There is a [default] section, followed by additional sections for each available zppy task (climo, ts, e3sm_diags, mpas_analysis, …). Sub-sections can be used to have multiple instances of a particular task, for example regridded monthly or globally averaged time series files. Refer to the zppy documentation for more details.
The key sections of the configuration file are:
[default]
- input, output, and www paths may need to be edited.
[climo]
- mapping_file path may need to be edited.
- Climatology files are typically generated every 20 or 50 years: years = begin_year:end_yr:averaging_period – e.g., years = "1:80:20", "1:50:50",
[ts]
- mapping_file path may need to be edited.
- Time series are typically done in chunks of 10 years – e.g., years = "1:80:10"
[e3sm_diags]
- reference_data_path may need to be edited.
- short_name is a shortened version of the case_name.
- years should match the [climo] section years.
[mpas_analysis]
- Years can be specified separately for time series, climatology, and ENSO plots. The lists must have the same lengths, and each entry will be mapped to a realization of mpas_analysis:
ts_years = "1-50", "1-100",
enso_years = "11-50", "11-100",
climo_years = "21-50", "51-100",
In this particular example, MPAS Analysis will be run twice. The first realization will produce time series plots covering years 1 to 50, ENSO plots for years 11 to 50, and climatology plots averaged over years 21-50. The second realization will cover years 1-100 for time series, 11-100 for ENSO, and 51-100 for climatologies.
[global_time_series]
- ts_years should match the [mpas_analysis] section ts_years.
- climo_years should match the [mpas_analysis] section climo_years.
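A minimal illustrative configuration tying these sections together (paths and values are placeholders and assumptions, not a complete working file):
Code Block |
---|
[default]
input = <simulations_dir>/<case_name>
output = <simulations_dir>/<case_name>/post
www = <www_dir>/<username>

[climo]
mapping_file = <path to mapping file>
years = "1:80:20", "1:50:50",

[ts]
mapping_file = <path to mapping file>
years = "1:80:10",

[e3sm_diags]
reference_data_path = <path to reference data>
short_name = 'v2.LR.piControl'
years = "1:80:20", "1:50:50",

[mpas_analysis]
ts_years = "1-50", "1-100",
enso_years = "11-50", "11-100",
climo_years = "21-50", "51-100",

[global_time_series]
ts_years = "1-50", "1-100",
climo_years = "21-50", "51-100", |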
See the zppy tutorial for complete configuration file examples.
Launch zppy
Make sure you load the E3SM unified environment.
Run zppy -c post.<case_name>.cfg. This will submit a number of jobs. Run sq to see what jobs are running.
e3sm_diags jobs are dependent on climo and ts jobs, so they wait for those to finish. MPAS Analysis jobs re-use computations, so they are chained. Most jobs run quickly, though MPAS Analysis may take several hours.
These jobs create a new directory <simulations_dir>/<case_name>/post. Each realization has a shell script (typically bash); this is the actual file submitted to the batch system. There will also be a log file *.o<job ID> as well as a *.status file. The status file indicates the state (WAITING, RUNNING, OK, ERROR). Once all the jobs are complete, you can check their status:
Code Block |
---|
cd <simulations_dir>/<case_name>/post/scripts
cat *.status # should be a list of "OK"
grep -v "OK" *.status # lists files without "OK" |
If you re-run zppy, it will check the status of each task and skip any task whose status is "OK". As your simulation progresses, you can update the post-processing years in the configuration file and re-run zppy. Newly added tasks will be submitted, while previously completed ones will be skipped.
Tasks
If you run ls you'll probably see files like e3sm_diags_180x360_aave_model_vs_obs_0001-0020.status. This is one e3sm_diags job. Parts of the file name are explained below:
Meaning | Part of File Name |
---|---|
Task | e3sm_diags |
Grid | 180x360_aave |
Model/obs v. model/obs | model_vs_obs |
First and last years | 0001-0020 |
There is also a corresponding output file. It will have the same name but end with .o<job ID> instead of .status.
Output
The post-processing output is organized hierarchically as follows (the exact year ranges are those defined in the configuration file):
- <e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr has the time series files – one variable per file, in 10-year chunks as defined in <run_scripts_dir>/post.<case_name>.cfg.
- <e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/clim/20yr similarly has climatology files for 20-year periods, as defined in <run_scripts_dir>/post.<case_name>.cfg.
- <e3sm_simulations_dir>/<case_name>/post/atm/glb/ts/monthly/10yr has globally averaged files for 10-year periods, as defined in <run_scripts_dir>/post.<case_name>.cfg.
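For example, listing the time-series directory should show one file per variable per chunk (the variable names and the exact file-name pattern below are illustrative):
Code Block |
---|
$ ls <e3sm_simulations_dir>/<case_name>/post/atm/180x360_aave/ts/monthly/10yr
FLNT_000101_001012.nc
FSNT_000101_001012.nc
PRECT_000101_001012.nc
... |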
Documenting the Model Run
You should create a Confluence page for your model run in /wiki/spaces/EWCG/pages/2126938167. Use /wiki/spaces/EWCG/pages/2297299190 as a template. See below for how to fill out this template.
Code
code_root_dir and tag_name are defined in <run_scripts_dir>/run.<case_name>.sh.
Code Block |
---|
cd <code_root_dir>/<tag_name>
git log |
The commit hash at the top is the most recent commit.
Add “<branch name>, <commit hash>” to this section of your page.
Configuration
Compset and Res are specified in the PACE "Experiment Details" section. See "Performance Information" for how to access PACE. Choose the latest job and list these settings on your page.
Custom parameters should also be listed. Find these by running:
Code Block |
---|
cd <run_scripts_dir>
grep -n "EOF >> user_nl" run.<case_name>.sh # Find the line numbers to look at |
Copy the code blocks after cat <<EOF >> user_nl_eam, cat << EOF >> user_nl_elm, and cat << EOF >> user_nl_mosart to your page.
Scripts
Push your <run_scripts_dir>/run.<case_name>.sh to https://github.com/E3SM-Project/SimulationScripts, in the archive/v2/beta/coupled directory. Then link it to this section of your page.
Output files
Specify the path to your output files: <simulations_dir>/<case_name>
.
Jobs
Fill out a table with columns for “Job”, “Years”, “Nodes”, “SYPD”, and “Notes”.
Log file names will give you the job IDs. Logs are found in <simulations_dir>/<case_name>/run. If you have done short-term archiving, they will instead be in <simulations_dir>/<case_name>/archive/logs. Use ls to see what logs are in the directory. The job ID will be the two-part (period-separated) number after .log.
PACE’s “Experiment Details” section shows JobID
as well. In the table, link each job ID to its corresponding PACE web page. Note that failed jobs will not have a web page on PACE, but you should still list them in the table.
Use zgrep "DATE=" <log> | head -n 1 to find the start date, and zgrep "DATE=" <log> | tail -n 1 to find the end date. If you would like, you can write a bash function to make this easier:
Code Block |
---|
get_dates()
{
for f in atm.log.*.gz; do
echo $f
zgrep "DATE=" $f | head -n 1
zgrep "DATE=" $f | tail -n 1
echo ""
done
} |
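To use it, define the function in your shell (or paste it into a small script) and run get_dates from the directory containing the atm.log.*.gz files, e.g., <simulations_dir>/<case_name>/archive/logs.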
(If zgrep is unavailable, use less <log> to look at a gzipped log file. Scroll down a decent amount to DATE= to find the start date. Use SHIFT+g to go to the end of the file. Scroll up to DATE= to find the end date.)
In the "Years" column specify <start> - <end>, with each in year-month-day format.
To find the number of nodes, first look at the Processor # / Simulation Time chart on PACE. The x-axis lists the highest MPI rank used, with base-0 numbering of ranks. (PE layouts often don't fit exactly N nodes but instead fill N-1 nodes and have some number of ranks left over on the final node, leaving some cores on that node unused.) Then find MPI tasks/node in the "Experiment Details" section. The number of nodes can then be calculated as ceil((highest MPI rank + 1) / (MPI tasks/node)).
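For example (hypothetical numbers): if the highest MPI rank shown is 5399 and PACE reports 64 MPI tasks/node, the node count is ceil(5400 / 64) = 85. In bash, the ceiling can be computed with integer arithmetic:
Code Block |
---|
highest_rank=5399 # Highest MPI rank from the PACE chart (base-0); hypothetical value
tasks_per_node=64 # "MPI tasks/node" from PACE "Experiment Details"; hypothetical value
nodes=$(( (highest_rank + 1 + tasks_per_node - 1) / tasks_per_node ))
echo ${nodes} # 85 |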
The SYPD (simulated years per day) is listed in PACE’s “Experiment Details” section as Model Throughput
.
In the “Notes” section, mention if a job failed or if you changed anything before re-running a job.
Global time series
Be sure to have set the [global_time_series]
task in zppy
.
Example configuration:
Code Block |
---|
# Global time series plots
[global_time_series]
active = True
years = "1-100",
ts_num_years = 10
figstr=coupled_v2_test01
moc_file=mocTimeSeries_0001-0100.nc
experiment_name=20210409.v2beta4.piControl.ne30pg2_EC30to60E2r2.chrysalis
ts_years = "1-50", "1-100", |
That will produce <figstr>.pdf
and <figstr>.png
. They will be available automatically at <www>/<case_name>/<figstr>.png
. You can download the image from the website and then upload it to your Confluence page.
E3SM Diags
The template page already includes baseline diagnostics. Add your own diagnostics links labeled as <start_year>-<end_year>
.
Your diagnostics are located at the web address corresponding to the www path in <run_scripts_dir>/post.<case_name>.cfg.
See https://e3sm-project.github.io/e3sm_diags/_build/html/master/quickguides/quick-guide-general.html for finding the URLs for the web portals on each E3SM machine (listed as <web_address>
).
Fill the table with the specific web links: e.g., https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/e3sm_diags/180x360_aave/model_vs_obs_0001-0020/viewer/
.
MPAS Analysis
See https://e3sm-project.github.io/e3sm_diags/_build/html/master/quickguides/quick-guide-general.html for finding the URLs for the web portals on each E3SM machine (listed as <web_address>
)
Make a bulleted list of links, e.g., for https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/<username>/E3SM/v2/beta/<case_name>/mpas_analysis/ts_0001-0050_climo_0021-0050/
, create a bullet “1-50 (time series), 21-50 (climatology)”.
Long Term Archiving with zstash
Simulations that are deemed sufficiently valuable should be archived using zstash
for long-term preservation.
You can ask questions about zstash
on https://github.com/E3SM-Project/zstash/discussions/categories/general
Compy, Anvil and Chrysalis do not have local HPSS. We rely on NERSC HPSS for long-term archiving.
If you are archiving a simulation run on Compy or LCRC (Chrysalis/Anvil), do all of the following steps. If you are archiving a simulation run on NERSC (Cori), skip to step 4.
1. Clean up directory
Log into the machine that you ran the simulation on.
Remove all eam.i files except the latest one. Dates are of the form <YYYY-MM-DD>.
Code Block |
---|
$ cd <simulations_dir>/<case_name>/run
$ ls | wc -l # See how many items are in this directory
$ mv <case_name>.eam.i.<YYYY-MM-DD>-00000.nc tmp.nc # <YYYY-MM-DD> is the latest date
$ rm <case_name>.eam.i.*.nc
$ mv tmp.nc <case_name>.eam.i.<YYYY-MM-DD>-00000.nc
$ ls | wc -l # See how many items are in this directory |
There may still be more files than are necessary to archive. You can probably remove *.err, *.lock, *debug_block*, and *ocean_block_stats* files.
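For example (a sketch; review the matches before deleting, since removal is permanent):
Code Block |
---|
cd <simulations_dir>/<case_name>/run
ls *.err *.lock *debug_block* *ocean_block_stats* # Review what would be removed
rm -f *.err *.lock *debug_block* *ocean_block_stats* |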
2. zstash create
& Transfer to NERSC HPSS
2.a. E3SM Unified v1.6.0 / zstash v1.2.0 or greater
If you are using E3SM Unified v1.6.0 or greater, https://github.com/E3SM-Project/zstash/pull/154 has enabled Globus (https://www.globus.org/) transfer with zstash create.
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=globus://nersc/home/<first letter>/<username>/E3SMv2/${EXP} --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
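Note that --maxsize 128 caps each tar file at 128 GB (zstash's --maxsize option is specified in GB); smaller archives are easier to transfer and to retrieve selectively.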
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
Then, do the following:
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# `index.db`, and a `zstash_create` log should be present
# No tar files should be listed
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
Then, on NERSC/Cori:
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Tar files and `index.db` should be listed.
# Note `| wc -l` doesn't work on hsi
$ exit |
2.b. Earlier releases
2.b.i. zstash create
On the machine that you ran the simulation on:
If you don’t have one already, create a directory for utilities, e.g., utils
. Then open a file in that directory called batch_zstash_create.bash
and paste the following in it, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on <machine name>
# Load E3SM Unified
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Archiving ${EXP} ===
cd <simulations_dir>/${EXP}
mkdir -p zstash
stamp=`date +%Y%m%d`
time zstash create -v --hpss=none --maxsize 128 . 2>&1 | tee zstash/zstash_create_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
Then, do the following:
Code Block |
---|
$ screen # Enter screen
$ screen -ls # Output should say "Attached"
$ ./batch_zstash_create.bash 2>&1 | tee batch_zstash_create.log
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
$ tail -f batch_zstash_create.log # Check log without going into screen
# Wait for this to finish
# (On Chrysalis, for 165 years of data, this takes ~14 hours)
$ screen -r # Return to screen
# Check that output ends with `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ ls <simulations_dir>/<case_name>/zstash
# tar files, `index.db`, and a `zstash_create` log should be present
# If you'd like to know how much space the archive or entire simulation use, run:
$ du -sh <simulations_dir>/<case_name>/zstash
$ du -sh <simulations_dir>/<case_name> |
2.b.ii. Transfer to NERSC
On a NERSC machine (Cori):
Code Block |
---|
$ mkdir -p /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should be empty |
Log into Globus, using your NERSC credentials: https://www.globus.org/
(Left-hand side) "Transfer from": <the machine's DTN>, path <simulations_dir>/<case_name>/zstash
- Click enter on the path and "select all" on the left-hand side.
(Right-hand side) "Transfer to": NERSC DTN, path /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
- Notice we're using cfs rather than scratch on Cori.
- Click enter on the path.
Click "Transfer & Sync Options" in the center. Choose:
- "sync - only transfer new or changed files" (choose "modification time is newer" in the dropdown box)
- "preserve source file modification times"
- "verify file integrity after transfer"
For "Label This Transfer": "zstash <case_name> <machine name> to NERSC"
Click "Start" on the left-hand side.
You should get an email from Globus when the transfer is completed. (On Chrysalis, for 165 years of data, this transfer takes ~13 hours).
2.b.iii. Transfer to HPSS
On a NERSC machine (Cori):
Code Block |
---|
# Log in to Cori
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
$ ls | wc -l
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ ls *.tar | wc -l
# Should be two less than the previous result,
# since `index.db` and the `zstash_create` log are also present.
$ hsi
$ pwd
# Should be /home/<first letter>/<username>
$ ls E3SMv2
# Check what you already have in the directory
# You don't want to accidentally overwrite a directory already in HPSS.
$ exit
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
$ screen
$ screen -ls # Output should say "Attached"
# https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss/managing-files-hsi
# cput will not transfer file if it exists.
$ hsi "cd /home/<first letter>/<username>/E3SMv2/; cput -R <case_name>"
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the `hsi` command to finish
# (On Chrysalis, for 165 years of data, this takes ~2 hours)
$ screen -r # Return to screen
# Check output for any errors
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ exit |
3. zstash check
On a NERSC machine (Cori):
Code Block |
---|
$ cd /global/homes/<first letter>/<username>
$ emacs batch_zstash_check.bash |
Paste the following in that file, making relevant edits:
Code Block |
---|
#!/bin/bash
# Run on NERSC dtn
# Load environment that includes zstash
<Command to load the E3SM Unified environment>
# List of experiments to archive with zstash
EXPS=(\
<case_name> \
)
# Loop over simulations
for EXP in "${EXPS[@]}"
do
echo === Checking ${EXP} ===
cd /global/cfs/cdirs/e3sm/<username>/E3SMv2
mkdir -p ${EXP}/zstash
cd ${EXP}
stamp=`date +%Y%m%d`
time zstash check -v --hpss=/home/<first letter>/<username>/E3SMv2/${EXP} --workers 2 2>&1 | tee zstash/zstash_check_${stamp}.log
done |
Commands to load the E3SM Unified environment for each machine can be found at https://e3sm-project.github.io/zppy/_build/html/main/getting_started.html .
If you’re using E3SM Unified v1.6.0 / zstash v1.2.0 (or greater) and you want to check a long simulation, you can use the --tars
option introduced in https://github.com/E3SM-Project/zstash/pull/170 to split the checking into more manageable pieces:
Code Block |
---|
# Starting at 00005a until the end
zstash check --tars=00005a-
# Starting from the beginning to 00005a (included)
zstash check --tars=-00005a
# Specific range
zstash check --tars=00005a-00005c
# Selected tar files
zstash check --tars=00003e,00004e,000059
# Mix and match
zstash check --tars=000030-00003e,00004e,00005a- |
Then, do the following:
Code Block |
---|
$ ssh dtn01.nersc.gov
$ screen
$ screen -ls # Output should say "Attached"
$ cd /global/homes/<first letter>/<username>
$ ./batch_zstash_check.bash
# Control A D to exit screen
# DO NOT CONTROL X / CONTROL C (as for emacs). This will terminate the task running in screen!!!
$ screen -ls # Output should say "Detached"
$ hostname
# If you log in on another login node,
# then you will need to ssh to this one to get back to the screen session.
# Wait for the script to finish
# (On Chrysalis, for 165 years of data, this takes ~5 hours)
$ screen -r # Return to screen
# Check that output ends with `INFO: No failures detected when checking the files.`
# as well as listing `real`, `user`, `sys` time information
$ exit # Terminate screen
$ screen -ls # The screen should no longer be listed
$ exit # exit data transfer node
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>/zstash
$ tail zstash_check_<stamp>.log
# Output should match the output from the screen (without the time information) |
Note |
---|
Because of https://github.com/E3SM-Project/zstash/issues/167 , for now it is a good idea to run |
4. Document
On a NERSC machine (Cori):
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2
# Check that the simulation case is now listed in this directory
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# Should match the number of files in the other machine's `<simulations_dir>/<case_name>/zstash`
$ exit
$ cd /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>/zstash
$ ls
# `index.db` and `zstash_check` log should be the only items listed
# https://www2.cisl.ucar.edu/resources/storage-and-file-systems/hpss/managing-files-hsi
# cput will not transfer file if it exists.
$ hsi "cd /home/<first letter>/<username>/E3SMv2/<case_name>; cput -R <zstash_check log>"
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# tar files, `index.db`, `zstash_create` log, and `zstash_check` log should be present
$ exit |
Update the simulation Confluence page with information regarding this simulation (for Water Cycle's v2 work, that page is /wiki/spaces/ED/pages/2766340117). In the zstash archive column, specify:
/home/<first letter>/<username>/E3SMv2/<case_name>
zstash_create_<stamp>.log
zstash_check_<stamp>.log
5. Delete files
On a NERSC machine (Cori):
Code Block |
---|
$ hsi
$ ls /home/<first letter>/<username>/E3SMv2/<case_name>
# tar files, `index.db`, `zstash_create` log, and `zstash_check` log should be present
# So, we can safely delete these items on cfs
$ exit
$ ls /global/cfs/cdirs/e3sm/<username>/E3SMv2/<case_name>
# Should match output from the `ls` above
$ cd /global/cfs/cdirs/e3sm/<username>
$ ls E3SMv2
# Only the <case_name> you just transferred to HPSS should be listed
$ rm -rf E3SMv2 |
On the machine that you ran the simulation on:
Code Block |
---|
$ cd <simulations_dir>/<case_name>
$ ls zstash
# tar files, index.db, `zstash_create` log should be present
$ rm -rf zstash # Remove the zstash directory, keeping original files
$ cd <simulations_dir> |
More info
Refer to zstash's best practices for E3SM for details.
Publishing the simulation data (Optional)
The E3SM project has a policy of publishing all official simulation campaigns once those simulations are documented in publications. Refer to step 3 in Simulation Data Management for guidance on requesting data publication.