E3SM Code Review and New Feature Process

2022/4/8: Original
2022/6/17: Revised to stress importance of not bundling features in a single PR
2022/8/31: Revised to remove duplicate terminology section
2023/4/18: Revised to clarify that modifications to an existing stealth feature is considered a stealth feature.
2023/5/18: Clarify that breaking the addition of a single feature into multiple PRs is welcome

 

This document describes the policy for bringing in new features and changing existing code in the E3SM model.

(Implementing this new policy requires a significant amount of work improving: the E3SM diagnostics, CIME test suites and documentation. See collected https://acme-climate.atlassian.net/wiki/spaces/CNCL/pages/3315138566)

 

Terminology

Code changes are divided into 4 categories

  • BFB (bit-for-bit)

    • New code produces results which are bit-for-bit identical with old code

    • Does not contain stealth features

    • github PR label: BFB

  • Roundoff

    • New code would match old code in exact arithmetic, but will diverge exponentially fast when using finite precision

    • Climate of new and old code will converge as the averaging time goes to infinity

    • github PR label: non-BFB

  • Climate changing

    • New feature is brought into the model turned on

    • Other model configuration change (parameters, resolution, forcing data)

    • github PR label: CC

  • Stealth Feature

    • BFB code change that includes a new feature, turned off by default, or modifications to an existing stealth feature.

    • github PR labels Stealth

 Note: multiple PR labels are possible. A Stealth feature is likely also BFB when the feature is off. CC, BFB and non-BFB are mutually exclusive.

Reference Solutions

The E3SM project will maintain reference solutions reflecting the current state of the ‘main’ branch

  • Coupled model: B-case runs (~100 years) for WC, cryosphere and BGC configurations

  • Component models: F, I and G cases, typically much shorter (e.g. 5 year F case with cyclic year 2010 forcing)

  • Updated periodically when climate changes are integrated, or monthly ( to check for unintended changes)

  • New features will be evaluated with respect to these reference solutions, with documentation on the import metrics

Note: E3SM Reference solutions and related documentation not yet ready!

 

Section 1: New Feature Request Process and Documentation Requirements

For new features or other major code changes, prepare a New Feature Overview Document.

(To determine when a code change becomes a major code change, consult with the component lead.)

E3SM-SFA funded work: feature should appear in that group’s roadmap before starting any work on a new feature including documentation prep. (E3SM pre-processing code (compsets, cime) and infrastructure are also covered by these procedures - infrastructure group is the “component group” for this work)

Externally funded work: developers are encouraged to submit this document early in their development process. This will assist with later E3SM acceptance.

  1. Developer: Before writing code: Proposer completes the New Feature Overview Document. Include, as appropriate:

    1. High-level description of code changes and/or new code, an overview of the design, infrastructure changes

    2. Expected improvements and how these will be demonstrated

    3. Describe needed updates to E3SM documentation

    4. Expected impacts on computational performance and mass/energy budgets.

    5. If relevant: describe papers that will be published

  2. E3SM Staff: Before writing code: Reviewer within E3SM (component lead or their delegate) should:

    1. Review document for completeness

    2. Determine if there is sufficient benefit to E3SM to justify the E3SM integration and future maintenance costs.

      1. For features not needed by the E3SM SFA, but which may be needed to support other DOE BER missions, the component lead should consult with E3SM leadership.

    3. Determine if review by performance and infrastructure groups are needed and ensure they are done.

      1. discourage and/or flag use of advanced language features or unnecessary complexity that may not be supported well by compilers or may cause performance degradation

    4. Create a confluence page for discussion during the review process

      1. Confluence space for all new feature overview documents and their approval status.

  3. Developer: After new feature is approved, begin work in an E3SM fork, following Development Getting Started Guide

    1. Features which are not approved but are needed for other reasons can be maintained by the developer on their E3SM fork.

    2. Large new features should be developed through a series of PRs rather than a single massive PR (which would be hard to thoroughly test and scrutinize). One New Feature Overview Document can be used for all PRs associated with that feature.

  4. Developer: When work is completed:

    1. Revise overview document to show improvements as originally proposed in overview document, and document simulation results from new feature PR guidelines. Include links to results archiving system (once it is ready)

    2. Submit PRs with links to documentation, follow new feature PR process.

 

Section 2: Collect documentation and simulation results needed for the Pull Request (PR)

Each of the four PR categories defined above require different levels of documentation and simulation results before a PR can be submitted.

 

BFB: no additional documentation is needed.

Roundoff. The developer must:

  1. Provide evidence that the changes are roundoff.

    1. e.g. subcomponent tests which have analytic solutions and error metrics. errors should change only at the roundoff level

    2. e.g. subcomponent tests which do not have exponential growth of roundoff error: in standalone dycore tests roundoff level changes often result in only roundoff level differences in the output.

    3. If roundoff-inducing changes can be isolated to a small amount of code, it might be possible to verify by inspection (e.g. changing order of operations)

  2. Run nbfb tests. If any of them fail, additional scrutiny is warranted.

  3. Strong evidence of roundoff level changes is sufficient in order to not needlessly delay our development process. In the rare case that the change is in fact not roundoff, this will be caught by our monthly reference B case update.

 

Stealth Features. The developer needs to run several tests (with the feature turned on if its a stealth feature) and note in the PR that they have:

  1. Include link to New Feature Overview Document (Section 1)

  2. Demonstrated that it’s possible to determine if feature is on/off via log file / namelists

  3. Ensured that it’s covered by a suitable timer

  4. Verified that it passes the “super-BFB” test suite.

  5. Documented performance changes via “HPC” test suite with a link to PACE results (For features with potential performance impact)

  6. Performed a component simulation test (when relevant). Provide E3SM diags output, comparison with baseline maintained by each component group, and determine if results in the acceptable range as documented by each component group. This will typically be shorter (e.g. a 5 year F compset) simulations used by component developers.

  7. Performed a B compset, e.g. a 10 year run for a “sanity check”. Compare with appropriate “reference B-case” using E3SM diags, and determine if results are in the acceptable range as documented by the simulation teams.

 

Climate changing. The developer must perform:

  1. Include link to New Feature Overview Document (Section 1)

  2. All tests for Stealth Features, except that instead of a “10 year sanity check B compset”, a longer simulation reproducing the “reference B-case” needs to be performed and compared with the official reference B-case.

 

Section 3: Pull Request (PR) Review

  1. New and stealth features: don't submit PR until Section 1 approval of overview document

  2. Submit the PR to the E3SM github site following Development Getting Started Guide . Links to the material outlined in Section 2 above should be in the first comment after the PR description.

  3. For code with potential performance impacts: Performance group lead or delegate reviews supplied PACE data.

  4. It is important not to bundle multiple features or fixes into a single PR.

  5. Climate Changing PRs (new features and bug fixes)

    1. Simulation group lead approves based on climate changes

    2. Simulation group lead (or delegate) will be assigned as a reviewer as well as appropriate component lead and then verifies:

      1. source code review was completed

      2. component lead also approves

      3. PR includes test (or is covered by existing tests)

      4. climate changing evaluation process (from Section 2) completed and results archived

      5. Updates main branch reference B case as needed.

      6. PR contains a single feature ( to ensure multiple features are tested independently)

  6. Stealth features

    1. Component lead (or delegate) is assigned as a reviewer and verifies:

      1. source code review

      2. PR includes appropriate test

      3. stealth-feature evaluation process (from Section 2) completed and results archived

      4. Expanded BFB tests were run with feature turned on

      5. PR contains a single feature ( to ensure multiple features are tested independently)

  7. Roundoff level PRs

    1. Component lead (or delegate) verifies:

      1. source code review was completed

      2. evidence to support roundoff level tests is sufficient, if not, follow component new feature climate evaluation procedures

      3. no stealth features

  8. BFB

    1. component lead (or delegate) be assigned as a reviewer verifies:

      1. source code review was completed

      2. no stealth features

Further Definitions

Super-BFB suite

The purpose of this suite is to make sure your new code maintains bit-for-bit properties we demand of the model. These properties include:

  • exact restart: the model should be able to restart during a run with bit-for-bit identical results compared to a model which ran for the same total time without restarting.

  • MPI-task count invariance: The model should give the same answers between two runs that only differed by the number of MPI tasks used.

  • OpenMP thread count invariance: The model should give the same answers between two run that only differ by number of OpenMP threads used.

There is a suite for each major component model ( atm, ocnicelndrof, and fully-coupled wcycl). They cover the standand-resolution grids in both optimization and debug configurations.

See tests.py for details on the included tests.