
Requiring model changes to pass stringent tests before being accepted as part of E3SM’s main development branch is critical for quickly and efficiently producing a trustworthy model. Depending on their impacts on model output, code modifications can be classified into three types:

  1. Technical changes that continue to produce bit-for-bit identical solutions
  2. Changes that cause the model solution to differ, yet produce a statistically identical climate when averaged over a sufficiently long time
  3. Changes that lead to a different model climate

Only type (3) changes impact the model climate, and they should be accepted into the code only after an in-depth demonstration of improvement. Distinguishing between (2) and (3), however, requires a comprehensive analysis of both a baseline climate and the climate produced by the modified code.

The MVK, PGN, and TSC tests, contained in the e3sm_atm_nbfb test suite, are used to determine whether or not non-bit-for-bit (nb4b) model changes are also climate changing. The e3sm_atm_nbfb test suite is currently run nightly on NERSC's Cori and reports to CDash (https://my.cdash.org/index.php?project=ACME_Climate) under the E3SM_Custom_Tests section.

The tests

  • MVK: The Multivariate Kolmogorov-Smirnov Test (POC: Salil Mahajan)

    This tests the null hypothesis that the reference (n) and modified (m) model Short Independent Simulation Ensembles (SISE) represent the same climate state, based on the equality of distribution of each variable's annual global average, computed from the standard monthly model output, between the two ensembles.

    The (per variable) null hypothesis uses the non-parametric, two-sample (n and m) Kolmogorov-Smirnov test as the univariate test of equality of distribution of global means. The test statistic (t) is the number of variables that reject the (per variable) null hypothesis of equality of distribution at a 95% confidence level. The (overall) null hypothesis is rejected if t > α, where α is a critical number of rejecting variables, obtained from an empirically derived approximate null distribution of t using resampling techniques. A minimal sketch of this procedure follows the references below.

    For more information, see:
    • Salil Mahajan, Katherine J. Evans, Joseph H. Kennedy, Min Xu, Matthew R. Norman, and Marcia L. Branstetter. Ongoing solution reproducibility of earth system models as they progress toward exascale computing. The International Journal of High Performance Computing Applications, 2019. doi:10.1177/1094342019837341.
    • Salil Mahajan, Abigail L. Gaddis, Katherine J. Evans, and Matthew R. Norman. Exploring an ensemble-based approach to atmospheric climate modeling and testing at scale. Procedia Computer Science, 108:735 – 744, 2017. International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland. doi:10.1016/j.procs.2017.05.259.
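
    The decision procedure can be sketched in a few lines of Python. This is a minimal illustration only, not the E3SM implementation: the ensemble dictionaries, variable list, and resampling scheme are assumptions made for the example.

        # Minimal sketch of the MVK decision procedure (illustrative only).
        # `baseline` and `modified` are assumed to be dicts mapping each output
        # variable name to a 1-D array of annual global means, one per member.
        import numpy as np
        from scipy.stats import ks_2samp

        def count_rejections(ens_a, ens_b, variables, alpha=0.05):
            """Count variables rejecting the per-variable KS null hypothesis."""
            t = 0
            for var in variables:
                _, p_value = ks_2samp(ens_a[var], ens_b[var])
                if p_value < alpha:  # reject equality of distributions (95%)
                    t += 1
            return t

        def critical_value(ens_a, variables, n_resamples=1000, quantile=0.95):
            """Approximate the null distribution of t by splitting the baseline
            ensemble into random halves and re-running the per-variable tests."""
            rng = np.random.default_rng()
            n = len(next(iter(ens_a.values())))
            stats = []
            for _ in range(n_resamples):
                idx = rng.permutation(n)
                half_a = {v: ens_a[v][idx[:n // 2]] for v in variables}
                half_b = {v: ens_a[v][idx[n // 2:]] for v in variables}
                stats.append(count_rejections(half_a, half_b, variables))
            return np.quantile(stats, quantile)

        # Overall null hypothesis (same climate state) is rejected if t exceeds
        # the critical value:
        # t = count_rejections(baseline, modified, variables)
        # alpha_crit = critical_value(baseline, variables)
        # mvk_fails = t > alpha_crit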

  • PGN: The Perturbation Growth Test (POC: Balwinder Singh)

    This tests the null hypothesis that the reference (n) and modified (m) model ensembles represent the same atmospheric state after each physics parameterization is applied within a single time step, using the two-sample (n and m) t-test for equal averages at a 95% confidence level. Ensembles are generated by repeating the simulation for many initial conditions, with each initial condition subject to multiple perturbations. A minimal sketch of this comparison follows the reference below.

    For more information, see:
    • B. Singh, P. J. Rasch, H. Wan, W. Ma, P. H. Worley, and J. Edwards. A verification strategy for atmospheric model codes using initial condition perturbations. Journal of Geophysical Research: Atmospheres, in prep.
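
    As an illustrative sketch (not the E3SM implementation), the per-field comparison might look like the following; the array layout and the choice of per-member field summary are assumptions made for the example.

        # Minimal sketch of the PGN comparison (illustrative only).
        # `ref` and `mod` are assumed to be arrays of shape (n_members, n_fields)
        # holding, for each ensemble member, a scalar summary of the atmospheric
        # state after a given parameterization within the first time step.
        from scipy.stats import ttest_ind

        def pgn_fails(ref, mod, alpha=0.05):
            """Reject equal means for any field at 95% confidence -> FAIL."""
            for field in range(ref.shape[1]):
                _, p_value = ttest_ind(ref[:, field], mod[:, field])
                if p_value < alpha:
                    return True  # the two ensembles differ for this field
            return False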

  • TSC: The Time Step Convergence Test (POC: Hui Wan)

    This tests the null hypothesis that the convergence of the time stepping error for a set of key atmospheric variables is the same for a reference ensemble and a test ensemble. Both the reference and test ensembles are generated with a two-second time step, and for each variable the root-mean-square difference (RMSD) between each ensemble and a truth ensemble, generated with a one-second time step, is calculated. RMSD is calculated globally and over two domains, land and ocean; the land/ocean domains contain just the atmosphere points that are over land/ocean cells.

    At 10-second intervals during the 10-minute simulations, the difference between the reference and test RMSDs (ΔRMSD) is calculated for each variable, each ensemble member, and each domain; these ΔRMSDs should be zero for identical climates. A one-sided (due to self-convergence) Student's t-test is used to test the null hypothesis that the ensemble-mean ΔRMSD is statistically zero at the 99.5% confidence level. A rejection of the null hypothesis (the mean ΔRMSD is not statistically zero) at any interval for any variable causes the test to fail. A minimal sketch of this procedure follows the reference below.

    For more information, see:
    • H. Wan, K. Zhang, P. J. Rasch, B. Singh, X. Chen, and J. Edwards. A new and inexpensive non-bit-for-bit solution reproducibility test based on time step convergence (tsc1.0). Geoscientific Model Development, 10(2):537–552, 2017. doi:10.5194/gmd-10-537-2017.
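
    A minimal sketch of this procedure (illustrative only, not the E3SM implementation; the array layout is an assumption made for the example):

        # Minimal sketch of the TSC decision procedure (illustrative only).
        # `rmsd_ref` and `rmsd_test` are assumed to be arrays of shape
        # (n_members, n_variables, n_times, n_domains): each member's RMSD
        # against the 1-second truth ensemble at each 10-second interval.
        from scipy.stats import ttest_1samp

        def tsc_fails(rmsd_ref, rmsd_test, alpha=0.005):
            """One-sided test that the ensemble-mean dRMSD is zero (99.5%)."""
            d_rmsd = rmsd_test - rmsd_ref  # zero for identical climates
            n_vars, n_times, n_doms = d_rmsd.shape[1:]
            for var in range(n_vars):
                for t in range(n_times):
                    for dom in range(n_doms):
                        stat, p_two = ttest_1samp(d_rmsd[:, var, t, dom], 0.0)
                        # one-sided p-value for H1: mean dRMSD > 0
                        p_one = p_two / 2 if stat > 0 else 1 - p_two / 2
                        if p_one < alpha:
                            return True  # rejection anywhere fails the test
            return False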

Interpreting the test results

Briefly, a PASS for these tests indicates either that the changes are bit-for-bit (type 1 above) or that the non-bit-for-bit changes produce a statistically identical climate (type 2 above), while a FAIL indicates that the non-bit-for-bit changes produce a statistically different climate (type 3 above).

When used in conjunction with the e3sm_integration test suite, code modifications can be classified into the three types above as shown in this table:

Change | Description | e3sm_integration | e3sm_atm_nbfb
Type 1 | Technical changes that continue to produce bit-for-bit identical solutions | PASS | PASS
Type 2 | Changes that cause the model solution to differ, yet produce a statistically identical climate when averaged over a sufficiently long time | FAIL | PASS
Type 3 | Changes that lead to a different model climate | FAIL | FAIL
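
As a hypothetical helper (not part of either test suite), the table's mapping from suite outcomes to change type can be expressed directly:

    # Hypothetical helper expressing the table above (not part of E3SM).
    def classify_change(integration_passes: bool, nbfb_passes: bool) -> int:
        """Map test-suite outcomes to the change types defined above."""
        if integration_passes and nbfb_passes:
            return 1  # bit-for-bit technical change
        if nbfb_passes:
            return 2  # non-bit-for-bit, statistically identical climate
        # (PASS/FAIL is not expected: bit-for-bit changes also pass nbfb)
        return 3  # non-bit-for-bit, climate-changing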


On CDash...

Manually running the tests

On Cori...

