Climate reproducibility (non-bit-for-bit) testing

Introduction

Requiring model changes to pass stringent tests before being accepted as part of E3SM’s main development branch is critical for quickly and efficiently producing a trustworthy model. Depending on their impacts on model output, code modifications can be classified into three types:

  1. Technical changes that continue to produce bit-for-bit identical solutions
  2. Changes that cause the model solution to differ, yet produce a statistically identical climate when averaged over a sufficiently long time
  3. Changes that lead to a different model climate

Only (3) impacts model climate, and changes of this type should only be implemented within the code after an in-depth demonstration of improvement. However, distinguishing between (2) and (3) requires a comprehensive analysis of both a baseline climate and the currently produced climate.

The MVK, PGN, and TSC tests, contained in the e3sm_atm_nbfb test suite, are used to determine whether or not non-bit-for-bit (nb4b) model changes are also climate changing. The e3sm_atm_nbfb test suite is currently run nightly on NERSC's Cori and reports to CDash (https://my.cdash.org/index.php?project=E3SM) under the E3SM_Custom_Tests section.


If you have questions about the tests or how to run them, a point-of-contact for each test is listed below.


The tests

  • MVK: The Multivariate Kolmogorov-Smirnov Test     POC: Salil Mahajan

    This tests the null hypothesis that the reference (n) and modified (m) model Short Independent Simulation Ensembles (SISE) represent the same climate state, based on the equality of distribution of each variable's annual global average in the standard monthly model output between the two simulations.

    The (per variable) null hypothesis is tested using the non-parametric, two-sample (n and m) Kolmogorov-Smirnov test as the univariate test of equality of distribution of global means. The test statistic (t) is the number of variables that reject the (per variable) null hypothesis of equality of distribution at a 95% confidence level. The (overall) null hypothesis is rejected if t > α, where α is some critical number of rejecting variables. The critical value, α, is obtained from an empirically derived approximate null distribution of t using resampling techniques (a minimal sketch of this procedure is given after the list of tests below).

    For more information, see:
    • Salil Mahajan, Katherine J. Evans, Joseph H. Kennedy, Min Xu, Matthew R. Norman, and Marcia L. Branstetter. Ongoing solution reproducibility of earth system models as they progress toward exascale computing. The International Journal of High Performance Computing Applications, 2019. doi:10.1177/1094342019837341.
    • Salil Mahajan, Abigail L. Gaddis, Katherine J. Evans, and Matthew R. Norman. Exploring an ensemble-based approach to atmospheric climate modeling and testing at scale. Procedia Computer Science, 108:735–744, 2017. International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland. doi:10.1016/j.procs.2017.05.259.

  • PGN: The Perturbation Growth Test   POC: Balwinder Singh

    This tests the null hypothesis that the reference (n) and modified (m) model ensembles represent the same atmospheric state after each physics parameterization is applied within a single time step, using the two-sample (n and m) t-test for equal averages at a 95% confidence level. Ensembles are generated by repeating the simulation for many initial conditions, with each initial condition subject to multiple perturbations.

    For more information, see:
    • B. Singh, P. J. Rasch, H. Wan, W. Ma, P. H. Worley, and J. Edwards. A verification strategy for atmospheric model codes using initial condition perturbations. Journal of Geophysical Research: Atmospheres, in preparation.

  • TSC: The Time Step Convergence Test   POC: Hui Wan

    This tests the null hypothesis that the convergence of the time-stepping error for a set of key atmospheric variables is the same for a reference ensemble and a test ensemble. Both the reference and test ensembles are generated with a two-second time step, and for each variable the RMSD between each ensemble and a truth ensemble, generated with a one-second time step, is calculated. The RMSD is calculated globally and over two domains, the land and the ocean. The land/ocean domains contain just the atmosphere points that are over land/ocean cells.

    At each 10-second interval during the 10-minute-long simulations, the differences between the reference and test RMSDs for each variable, each ensemble member, and each domain are calculated; these ΔRMSDs should be zero for identical climates. A one-sided (due to self-convergence) Student's t-test is used to test the null hypothesis that the ensemble-mean ΔRMSD is statistically zero at the 99.5% confidence level. A rejection of the null hypothesis (mean ΔRMSD is not statistically zero) at any time step for any variable will cause this test to fail.

    For more information, see:
    • H. Wan, K. Zhang, P. J. Rasch, B. Singh, X. Chen, and J. Edwards. A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (tsc1.0). Geoscientific Model Development, 10(2):537–552, 2017. doi:10.5194/gmd-10-537-2017.
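
All three tests follow the same basic recipe: compute a per-variable statistic for a reference ensemble and a test ensemble, then decide whether the two ensembles differ at a chosen confidence level. Below is a minimal, illustrative sketch of the MVK decision procedure described above. It is not the actual implementation (which lives in EVV); the array names, shapes, and resampling details are assumptions made for the example, with ref and test holding annual global means of shape (n_variables, n_ensemble_members).

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def n_rejections(ref, test, alpha=0.05):
    """Count variables whose two-sample Kolmogorov-Smirnov test rejects
    equality of distribution at the (1 - alpha) confidence level."""
    return sum(ks_2samp(r, t).pvalue < alpha for r, t in zip(ref, test))

def critical_value(ref, test, n_resamples=1000, quantile=0.95):
    """Approximate the null distribution of the test statistic t by
    randomly re-partitioning the pooled ensemble members (this assumes
    equal ensemble sizes), and return an empirical critical value."""
    pooled = np.concatenate([ref, test], axis=1)
    n_members = ref.shape[1]
    stats = []
    for _ in range(n_resamples):
        idx = rng.permutation(pooled.shape[1])
        stats.append(n_rejections(pooled[:, idx[:n_members]],
                                  pooled[:, idx[n_members:]]))
    return np.quantile(stats, quantile)

# Overall decision: the null hypothesis of an identical climate is
# rejected (the test FAILs) if t exceeds the critical value:
#   fail = n_rejections(ref, test) > critical_value(ref, test)

Conceptually, PGN and TSC follow the same ensemble-comparison pattern but use two-sample and one-sided Student's t-tests, respectively, in place of the Kolmogorov-Smirnov test.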

Interpreting the test results

Briefly, a PASS for these tests indicates either that the changes are bit-for-bit (type 1 above) or that the non-bit-for-bit changes produce a statistically identical climate (type 2 above), while a FAIL indicates that the non-bit-for-bit changes produce a statistically different climate (type 3 above).

When used in conjunction with the e3sm_integration test suite, code modifications can be classified into the three types above as shown in this table:

Change | Description | e3sm_integration | e3sm_atm_nbfb
Type 1 | Technical changes that continue to produce bit-for-bit identical solutions | PASS | PASS
Type 2 | Changes that cause the model solution to differ, yet produce a statistically identical climate when averaged over a sufficiently long time | FAIL | PASS
Type 3 | Changes that lead to a different model climate | FAIL | FAIL

Test pass/fail and extended output

Tests will be reported as pass/fail to the CDash dashboard. When a test fails, you can view the test output on CDash by first clicking on the fail box, and then clicking on the test you're interested in.

At the end of the test output, EVV will report the pass/fail status and provide a link to the output website (or to a portable website directory if the machine doesn't have a project website directory; see the manually running the tests section below).

That link to the EVV website can be copy-pasted into your preferred browser to see the extended results, which include figures and statistical analyses. An example website, with passes and fails for all tests, is provided here:

http://livvkit.github.io/evv4esm/

Generating a new baseline for CDash

Unlike the other E3SM/CIME system tests, the e3sm_atm_nbfb tests must generate a new baseline by re-running, because they require more than the final history and restart files to perform their statistical analyses (you can't just copy files as is normally done when blessing a failed test).

For Anvil, the following commands would generate (-g) and overwrite (-o) the baselines for all three tests:

# export E3SM=/PATH/TO/E3SM
cd ${E3SM}/cime/scripts
source /lcrc/soft/climate/e3sm-unified/load_latest_cime_env.sh
./create_test e3sm_atm_nbfb -g -o --walltime 00:20:00

Generating a new baseline for an individual test follows the same procedure:

# export E3SM=/PATH/TO/E3SM
cd ${E3SM}/cime/scripts
source /lcrc/soft/climate/e3sm-unified/load_latest_cime_env.sh
./create_test MVK_PS.ne4_ne4.FC5AV1C-L -g -o --walltime 00:20:00

This process will need to be repeated for every machine the tests are being run on (currently only Anvil). For more detailed information on how to run these tests, see the manually running the tests section below.

For cori-knl, the procedure is similar, but with an increased walltime and a different path to the CIME environment script:

# export E3SM=/PATH/TO/E3SM
cd ${E3SM}/cime/scripts
source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_cime_env.sh
./create_test e3sm_atm_nbfb -g -o --walltime 01:00:00


Manually running the tests

The e3sm_atm_nbfb tests should be available to run on any E3SM supported machine that provides the cime_env conda environment, which works similarly to the e3sm_unified analysis conda environment (see: Analysis Anaconda Metapackage). Note: the cime_env conda environment is required because these tests perform high-level statistical analyses and therefore have additional Python dependencies, which need to be installed on your system and be accessible from the compute nodes (if you're on a batch machine). Primarily, the statistical analysis of the climates is done through EVV, which generates a portable test website describing the results (pass or fail) in detail (see the test pass/fail and extended output section above).

First, activate cime_env:

source <activate_path>/load_latest_cime_env.sh

Where <activate_path> is the machine-specific location of the activation script as described in /wiki/spaces/ED/pages/780271950. Then, from the ${E3SM}/cime/scripts directory (where ${E3SM} is the path to your E3SM clone), use the create_test script as you would for any other E3SM/CIME system test (see: Testing). For example, to run the MVK test and generate a new baseline, you would run a command like:

./create_test MVK_PS.ne4_ne4.F2010 -g --baseline-root "/PATH/TO/BASELINE" 

NOTE: the default baseline-root is the E3SM project baseline directory and should be manually specified so you don't overwrite the baselines used for nightly testing!

And to compare against the above baseline, you would run a command like:

./create_test MVK_PS.ne4_ne4.F2010 -c --baseline-root "/PATH/TO/BASELINE" 

NOTE: The MVK test runs a 20-member ensemble for at least 13 months (using the last 12 for the statistical tests) and, depending on the machine, may take some fiddling to execute within a particular queue's wallclock time limit. You may want to override the requested walltime using the --walltime HH:MM:SS option to create_test.
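
For example (the walltime value here is purely illustrative; choose one appropriate for your machine and queue):

./create_test MVK_PS.ne4_ne4.F2010 -c --baseline-root "/PATH/TO/BASELINE" --walltime 03:00:00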

Likewise, the entire e3sm_atm_nbfb test suite can be run:

# Generate a baseline
./create_test e3sm_atm_nbfb -g --baseline-root "/PATH/TO/BASELINE"
# Compare to the generated baseline
./create_test e3sm_atm_nbfb -c --baseline-root "/PATH/TO/BASELINE"

When the comparison tests are done, EVV will report the results on the command line like:

2019-08-14  22:09:02: BASELINE PASS for test 'YYYYMMDD_HHMMSS_RANDOMID'. 
    Case: YYYYMMDD_HHMMSS_RANDOMID; Test status: pass; ...
    EVV results can be viewed at: <EVV_RESULTS_DIR>

Where <EVV_RESULTS_DIR> is a subdirectory of the case run directory (which we'll call ${CASE_RUN_DIR}) containing a portable output website, which includes figures and statistical analyses. An example website, with passes and fails for all tests, is provided here:

http://livvkit.github.io/evv4esm/

To view the website, you can either tunnel the website to your local machine through ssh, or copy the website directory to your machine and view it using EVV.

View via SSH

For this example, we'll assume the tests were run on Cori at NERSC, but these instructions should be easily adaptable to any E3SM supported machine. First, log into Cori via ssh and connect your local 8080 port to the 8080 port on Cori:

ssh -L 8080:localhost:8080 [USER]@cori.nersc.gov 

Then, activate the cime_env conda environment:

source /global/cfs/cdirs/e3sm/software/anaconda_envs/load_latest_cime_env.sh

Navigate to ${CASE_RUN_DIR} as described above and use EVV to serve the website over port 8080:

pushd ${CASE_RUN_DIR}
evv -o <CASE_SPECIFIER>.cori-knl_intel.C.YYYYMMDD_HHMMSS_RANDOMID.evv -s 8080

where <CASE_SPECIFIER> is the case string you would have passed to create_test (e.g., MVK_PS.ne4_ne4.FC5AV1C-L). EVV will then report a URL where you can view the website:

--------------------------------------------------------------------
                   ______  __      __ __      __                    
                  |  ____| \ \    / / \ \    / /                    
                  | |__     \ \  / /   \ \  / /                     
                  |  __|     \ \/ /     \ \/ /                      
                  | |____     \  /       \  /                       
                  |______|     \/         \/                        
                                                                    
    Extended Verification and Validation for Earth System Models    
--------------------------------------------------------------------

  Current run: YYYY-MM-DD HH:MM:SS
  User: ${USER}
  OS Type: Linux 4.12.14-150.27-default
  Machine: cori07
  

Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/)

View the generated website by navigating to:

    http://0.0.0.0:8080/<CASE_SPECIFIER>.cori-knl_intel.C.YYYYMMDD_HHMMSS_RANDOMID.evv/index.html

Exit by pressing `ctrl+c` to send a keyboard interrupt.

You can now either click that link or copy-paste it into your favorite web browser for viewing.

View output from tests run on Anvil

The sshd server on blues (the frontend for Anvil) blocks port forwarding (as of 2022/06), so the ssh -L command above will not work.
However, the EVV output is automatically copied to the LCRC web server at /lcrc/group/e3sm/public_html/$USER/evv and will be viewable (with $USER replaced by your user name) at:

https://web.lcrc.anl.gov/public/e3sm/$USER/evv/


View a local copy

For this example, we'll assume the tests were run on Cori at NERSC, but these instructions should be easily adaptable to any E3SM supported machine. First, on your local machine, install the cime_env conda env and activate it:

conda create -n cime_env -c conda-forge -c e3sm cime_env
conda activate cime_env

Then, copy the website to your local machine, and view it:

scp -r [USER]@cori.nersc.gov:${CASE_RUN_DIR}/<CASE_SPECIFIER>.cori-knl_intel.C.YYYYMMDD_HHMMSS_RANDOMID.evv .
evv -o <CASE_SPECIFIER>.cori-knl_intel.C.YYYYMMDD_HHMMSS_RANDOMID.evv -s
    --------------------------------------------------------------------
                       ______  __      __ __      __                    
                      |  ____| \ \    / / \ \    / /                    
                      | |__     \ \  / /   \ \  / /                     
                      |  __|     \ \/ /     \ \/ /                      
                      | |____     \  /       \  /                       
                      |______|     \/         \/                        
                                                                        
        Extended Verification and Validation for Earth System Models    
    --------------------------------------------------------------------
    
      Current run: YYYY-MM-DD HH:MM:SS
      User: ${USER}
      OS Type: Linux 4.12.14-150.27-default
      Machine: ${HOST}
      
    
    Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/)
    
    View the generated website by navigating to:
    
        http://0.0.0.0:8000/<CASE_SPECIFIER>.cori-knl_intel.C.YYYYMMDD_HHMMSS_RANDOMID.evv/index.html
    
    Exit by pressing `ctrl+c` to send a keyboard interrupt.

where ${CASE_RUN_DIR} is as described above and <CASE_SPECIFIER> is the case string you would have passed to create_test (e.g., MVK_PS.ne4_ne4.FC5AV1C-L). You can now either click that link or copy-paste it into your favorite web browser to view the output website.