Issues with TWS_MONTH_BEGIN variable

This variable causes problems in some versions of E3SM v2.1 and in pre-release versions of E3SMv3. This page summarizes how the problem appears and how to fix it.

Past versions of E3SM initialized some variables to NaN. Simulations that use older restart files therefore still initialize variables such as TWS_MONTH_BEGIN to NaN, which may cause runtime errors (examples of how these errors appear in e3sm.log.* files are included below).

Solution

  • Ensure that the initial fix is in your branch (here: commit with fix ).

  • If your simulation uses an initial condition file for land (finidat), replace the NaNs in TWS_MONTH_BEGIN with the fill value of 1.e+36. Either of the methods below performs the conversion. Both overwrite the input file, so make a copy first.

    • Using NCO functions:

      ncatted -a _FillValue,TWS_MONTH_BEGIN,o,f,NaN infile.nc
      ncatted -a _FillValue,TWS_MONTH_BEGIN,m,f,1.0e36 infile.nc

    • Using a Python script:

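      from netCDF4 import Dataset
      import numpy as np

      # Open the file in read/write mode (this modifies infile.nc in place)
      ofile = Dataset('infile.nc', 'r+')
      var = ofile.variables['TWS_MONTH_BEGIN']
      var.set_auto_mask(False)       # work with raw values, not masked arrays
      data = var[:]                  # read the values into memory
      data[np.isnan(data)] = 1.e+36  # replace NaNs with the fill value
      var[:] = data                  # write the modified values back to the file
      ofile.close()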
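Before and after the conversion it can be useful to confirm whether the file actually contains NaNs in TWS_MONTH_BEGIN. The following is a minimal sketch, not part of the original instructions: the function names `count_nans` and `count_nans_in_file` are hypothetical, and it assumes the netCDF4 Python package is installed and that the variable holds floating-point data.

```python
import math


def count_nans(values):
    """Return the number of NaN entries in an iterable of floats."""
    return sum(1 for v in values if math.isnan(v))


def count_nans_in_file(path, varname="TWS_MONTH_BEGIN"):
    """Count NaNs in one variable of a netCDF file (the file is opened read-only)."""
    # Imported here so that count_nans can be used without netCDF4 installed.
    from netCDF4 import Dataset

    with Dataset(path, "r") as ds:
        var = ds.variables[varname]
        var.set_auto_mask(False)  # read raw values instead of masked arrays
        # Flatten to a plain Python list; simple, but slow for very large variables.
        return count_nans(var[:].ravel().tolist())
```

Calling `count_nans_in_file('infile.nc')` should return a positive number for an affected file and 0 once the NaNs have been replaced.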

E3SM v2.1

E3SMv3

Example backtrace due to floating point exception

  12: forrtl: error (65): floating invalid
  12: Image              PC                Routine            Line     Source
  12: libpthread-2.31.s  000014E799803910  Unknown            Unknown  Unknown
  12: e3sm.exe           0000000004F6368D  subgridavemod_mp_  1045     subgridAveMod.F90
  12: e3sm.exe           000000000663C0DC  waterbudgetmod_mp  719      WaterBudgetMod.F90
  12: e3sm.exe           0000000004A4E29E  elm_driver_mp_elm  576      elm_driver.F90
  12: e3sm.exe           00000000049F29F0  lnd_comp_mct_mp_l  506      lnd_comp_mct.F90
  12: e3sm.exe           0000000000496175  component_mod_mp_  751      component_mod.F90
  12: e3sm.exe           00000000004583F7  cime_comp_mod_mp_  2876     cime_comp_mod.F90
  12: e3sm.exe           000000000047EB62  MAIN__             153      cime_driver.F90
  12: e3sm.exe           000000000042342D  Unknown            Unknown  Unknown
  12: libc-2.31.so       000014E79923E24D  __libc_start_main  Unknown  Unknown

Example runtime fail while writing history file

896: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 180) to file (./E3SM.2023-SCIDAC.ne30pg2_EC30to60E2r2.AMIP.EF_0.13.CF_22.HD_0.56.elm.h0.1984-01.nc, ncid=150) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (TWS_MONTH_BEGIN, varid=206) failed (Number of subarray requests/regions=1, Size of data local to this process = 982). NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u1/w/whannah/E3SM/E3SM_SRC2/externals/scorpio/src/clib/pio_darray_int.c: 395)
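To check whether a failed run shows either of the two failure modes above, the log text can be searched for their characteristic strings. The sketch below is illustrative only: the helper names `find_signatures` and `scan_logs` are hypothetical, the patterns are taken from the two example messages on this page (other compilers or PIO versions may word them differently), and it assumes the e3sm.log.* files are uncompressed text in the run directory.

```python
import re
from pathlib import Path

# Each failure mode is identified by patterns that must ALL appear in the log.
# The patterns come from the two example messages on this page.
SIGNATURES = {
    "floating point exception": [
        re.compile(r"forrtl: error \(65\): floating invalid"),
        re.compile(r"WaterBudgetMod\.F90"),
    ],
    "history write failure": [
        re.compile(r"Non blocking write for variable \(TWS_MONTH_BEGIN,[^)]*\) failed"),
    ],
}


def find_signatures(log_text):
    """Return the sorted names of the failure modes whose patterns all match."""
    return sorted(
        name
        for name, patterns in SIGNATURES.items()
        if all(p.search(log_text) for p in patterns)
    )


def scan_logs(run_dir="."):
    """Map each e3sm.log.* file in run_dir to the failure modes found in it."""
    results = {}
    for path in sorted(Path(run_dir).glob("e3sm.log.*")):
        # Assumes plain text; compressed logs would need to be unpacked first.
        found = find_signatures(path.read_text(errors="replace"))
        if found:
            results[path.name] = found
    return results
```

Running `scan_logs()` from the run directory returns a dictionary keyed by log file name. A match is only a hint that the run hit this problem, since a floating invalid error in WaterBudgetMod.F90 could in principle have other causes.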

 

History

  • Tests failed restart comparison due to missing TWS_MONTH_BEGIN restart variable (Issue #4649)

  • Longer tests that restart at beginning of the month failed restart comparison due to col_ws%endwb not being on the restart file. (Issue #5079)

  • The initial condition for a production test had to be converted from NaNs (PR #5811)