
This page is devoted to instruction in ncremap. It describes steps necessary to create grids, and to regrid datasets between different grids with ncremap. Some of the simpler regridding options supported by ncclimo are also described at Generate, Regrid, and Split Climatologies (climo files) with ncclimo. This page describes those features in more detail, and other, more boutique features often useful for custom regridding solutions.

The Zen of Regridding

Most modern climate/weather-related research requires a regridding step in its workflow. The plethora of geometric and spectral grids on which model and observational data are stored ensures that regridding is usually necessary to scientific insight, especially the focused and variable resolution studies that E3SM models conduct. Why does such a common procedure seem so complex? Because a mind-boggling number of options are required to support advanced regridding features that many users never need. To defer that complexity, this HOWTO begins with solutions to the prototypical regridding problem, without mentioning any other options. It demonstrates how to solve that problem simply, including the minimal software installation required. Once the basic regridding vocabulary has been introduced, we solve the prototype problem when one or more inputs are "missing", or need to be created. The HOWTO ends with descriptions of different regridding modes and workflows that use features customized to particular models, observational datasets, and formats. The overall organization, including TBD sections (suggest others, or vote for prioritizing, below), is:

...

Background and distributed node parallelism (as described above in the Parallelism section) of MWF-mode are possible though not yet implemented. Please let us know if this feature is desired.

Advanced Regridding V:

...

CMIP6 Timeseries

This section describes the recommended procedures to construct and regrid E3SM timeseries data to CMIP6 specifications. Most models provide data to CMIP6 in timeseries format, meaning one variable per file with multiple years per file. These timeseries must be regridded to at least one of the CMIP6 standard grids. The E3SM project chose to supply its v1 experiments to CMIP6 archived on rectangular, uniform (i.e., equiangular in latitude and longitude), one-degree (for standard-resolution) and quarter-degree (for high-resolution) grids. Generating these timeseries from experiments as lengthy as 500 model years, formatted to CMIP6 specifications, requires many non-standard options to both ncclimo (to construct the timeseries) and to ncremap (to regrid timeseries), and is a natural capstone exercise in using both together. This section is arranged in reverse order: it first presents the final commands, then describes the meanings of, and reasons for, particular options.

...

Take a moment to compare the methods for EAM/ELM and for MPAS. They are identical except for the variable names and the additional MPAS options in ${mpas_opt}. We will return to discuss those soon. Each command-set begins by setting experiment-dependent I/O directories and a map-file. Other experiments will require changing these to the appropriate I/O directories, yet the map-file remains the same unless the native or destination grid changes. The next three or four lines in each command-set configure the splitter and regridder with options many ncclimo/ncremap users have never tried before. Finally, the list of input files and all the configuration options are sent to ncclimo. The entire procedure for the user boils down to creating then executing this one splitter command for each desired variable.

Regridding is performed only if the splitter (i.e., ncclimo) is invoked with the --map option that supplies a suitable mapfile from the native to the desired destination grid; ncclimo will itself call ncremap as necessary. CMIP6 will only distribute data on 2D structured grids, yet E3SM will itself distribute the timeseries on native (unstructured) grids. Hence the commands above construct both the native timeseries (stored in ${drc_out}) and the regridded timeseries for CMIP6 (stored in ${drc_rgr}). Internally, the splitter constructs the native grid timeseries for the same time segment for all requested variables in parallel, waits for completion, and then calls the regridder (ncremap) in parallel with all timeseries for that segment, waits, then continues to the next segment.
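The split, wait, regrid, wait sequence described above can be pictured with a small shell sketch. It is only an illustration of the control flow, not the actual ncclimo implementation: split_var and regrid_var are hypothetical stand-ins that merely log what they would do, and the two segments and two variables are arbitrary.

```shell
# Schematic of the per-segment control flow; split_var and regrid_var are
# hypothetical placeholders, not ncclimo or ncremap functions.
log=$(mktemp)
split_var()  { echo "split $1 $2"  >> "${log}"; }   # stand-in for native-grid timeseries creation
regrid_var() { echo "regrid $1 $2" >> "${log}"; }   # stand-in for the regridder call

for sgm in 1 2; do
    for var in FSDS FLDS; do split_var "${sgm}" "${var}" & done
    wait    # all native-grid timeseries for this segment are complete
    for var in FSDS FLDS; do regrid_var "${sgm}" "${var}" & done
    wait    # all regridded timeseries are complete before the next segment starts
done
cat "${log}"
```

Within a segment the order of the two variables may vary from run to run, but every split line for a segment precedes every regrid line for that segment.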

The first configuration options to discuss are the MPAS-specific options. In order to automatically trigger a number of MPAS-specific behaviors, the regridder must first know that the model type is MPAS. When invoked with '-m mpas' the splitter will pass the MPAS-flag to the regridder. The splitter itself simply creates timeseries and does nothing different for MPAS files other than pass options through to the regridder. MPAS outputs its native grid data in double precision, not single precision like EAM/ELM/CAM. Thus raw MPAS datasets are twice the size needed by most analyses. The --d2f flag tells the regridder to demote doubles (unless they are coordinate variables) to floats in an additional pre-processing step. Otherwise, regridded MPAS output would be twice the size with no appreciable benefits for analysis.

The CMIP6-specific options (${cmip6_opt}) collectively ensure that the timeseries are compliant, compact, and concise. CMIP6 requires datasets be in netCDF4-Classic format, i.e., netCDF4 storage constrained to the netCDF3 API. This is achieved with the -7 switch (mnemonic: 7=4+3). Additionally, CMIP6 requires datasets use netCDF4's internal compression, the DEFLATE algorithm (same as in gzip). We recommend deflation level 1 (i.e., --dfl_lvl=1) since higher levels compress only marginally better yet require significantly more wallclock time.

The next three CMIP6 options trim the timeseries to exclude variables that could otherwise be included. The --no_cll_msr (no-cell-measures) switch excludes variables typically listed in the CF cell_measures attribute such as gridcell area and volume. The --no_frm_trm (no-formula-terms) switch excludes variables that appear in the CF formula_terms attribute, notably the 2D surface pressure (PS) for EAM. The --no_stg_grd (no-staggered-grid) switch excludes the offset (aka staggered) grid that ncremap normally adds to rectangular output grids. The specific variables excluded are slat, slon, and w_stag. There is no downside to this option for MPAS data, although it can cause problems for older versions of AMWG diagnostics. Thus timeseries processed with these options include no "extras" that might inflate their size or, alas, their convenience.

The splitter options (${spl_opt}) configure the timeseries length and number of segments. The splitter expects the number of (monthly) input files to equal the number of years (between ${yr_srt} and ${yr_end}, inclusive) times twelve. This sanity check helps prevent inadvertent omissions/inclusions of unwanted months. Each timeseries is split, if necessary, into a number of segments of equal length and possibly one shorter tail segment. The --ypf option specifies the number of years per file (i.e., segment). CMIP6 recommends file sizes be no greater than a few gigabytes. Factors that influence the segment filesize include the segment length, the variable rank and number of layers if 3D, and, for regridded timeseries, the grid resolution and the presence of missing values (e.g., due to ocean bathymetry). A compromise that meets these criteria is segment lengths of up to 500 years for 2D variables, and 25 years for 3D variables. For consistency, these same segment length limits are used for all E3SM v1 models and experiments in CMIP6. Note that because the segment lengths differ for 2D and 3D variables, it is necessary to call the splitter at least twice per experiment, once with 2D variables (supplied with --var) and segment size, and likewise for 3D variables and segment size.
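The bookkeeping described above is simple enough to check by hand before launching a long job. The following sketch reproduces the arithmetic only; expected_file_count and segment_bounds are hypothetical helper names, not part of ncclimo, and the segment boundaries assume equal-length segments counted from the first year, with any remainder in a shorter tail.

```shell
# Hypothetical helpers that mirror the splitter's bookkeeping.
# Pass years as plain integers without leading zeros (e.g., 1 not 0001),
# because shell arithmetic treats a leading zero as octal.

# Number of monthly input files expected for an inclusive year range
expected_file_count() {
    yr_srt=$1; yr_end=$2
    echo $(( (yr_end - yr_srt + 1) * 12 ))
}

# Print "first_year last_year" for each segment of at most ypf years
segment_bounds() {
    yr=$1; yr_end=$2; ypf=$3
    while [ "${yr}" -le "${yr_end}" ]; do
        sgm_end=$(( yr + ypf - 1 ))
        if [ "${sgm_end}" -gt "${yr_end}" ]; then sgm_end=${yr_end}; fi   # shorter tail segment
        echo "${yr} ${sgm_end}"
        yr=$(( sgm_end + 1 ))
    done
}

expected_file_count 1850 2014   # 165 years of monthly files
segment_bounds 1850 2014 25     # six 25-year segments plus a 15-year tail
```

With --ypf=500 the same year range fits in a single segment, which is why 2D and 3D variables need separate splitter invocations.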

These 500-year and 25-year segment lengths yield native-grid files of ~800 MB and 2.3 GB for v1 EAM 2D and 3D variables, respectively, with regridded (to 1x1 degree) sizes of ~1.0 GB and 3.0 GB. For v1 MPAS-Ocean data, the same segment lengths yield native-grid files of ~9.7 GB and 22 GB for 2D and 3D variables, respectively, with regridded (to 1x1 degree) sizes of ~900 MB and 1.5 GB. Hence all the regridded E3SM data distributed by CMIP6 will be in files of roughly 1-3 GB. Note that regridded MPAS data consume ~90% less space than native-grid data. This is due to two factors: 1. Raw MPAS data do not utilize the netCDF _FillValue attribute (which would substantially improve compression), and 2. Raw MPAS data are double precision, not single precision.

The RAM overhead of timeseries generation can also be a factor on small nodes. Splitting does most of its work on disk and so requires only as much RAM as required to store a single timestep of a variable. Regridding is a different kettle of fish, a bird of another feather, and potentially a can of worms. The maximum RAM usage is about three times the uncompressed size of the entire timeseries. For the 500-yr 2D and 25-yr 3D segments considered here, expect peak RAM usage of 20 GB and 64 GB, respectively, for MPAS data. If the regridder exhausts available memory when called with multiple variables, then reduce the parallelization over variables using the --job=${job_nbr} option (not shown). This is unlikely to occur on beefy nodes because job_nbr defaults to 2 (i.e., variables are split and regridded in groups of two).
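One way to choose a job count is to divide the node's RAM by the expected peak per variable. The sketch below does only that arithmetic; max_job_nbr is a hypothetical helper, the 128 GB node is an arbitrary example, and it assumes the peak figures quoted above apply to each variable processed concurrently.

```shell
# Hypothetical helper (not part of ncclimo): largest number of variables to
# process concurrently, given node RAM and peak RAM per variable, both in GB
max_job_nbr() {
    ram_node=$1; ram_per_var=$2
    job_nbr=$(( ram_node / ram_per_var ))   # integer division rounds down
    if [ "${job_nbr}" -lt 1 ]; then
        echo "WARNING: a single variable may exceed node RAM" >&2
        job_nbr=1
    fi
    echo "${job_nbr}"
}

# Example: a 128 GB node and the ~64 GB peak quoted above for 25-yr 3D MPAS segments
job_nbr=$(max_job_nbr 128 64)
printf '%s\n' "--job=${job_nbr}"   # option string to append to the splitter command
```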

Now that the content of the rather lengthy CMIP6 splitter/regridder commands has been explained, it is worthwhile describing the method of invocation. The splitter accepts filenames supplied in numerous ways (command-line arguments, pipes to stdin, directory contents, redirection operators) as described above. For large numbers of input files typical of CMIP6 experiments, piping filenames as output by ls from the input file directory into the splitter is preferred for two reasons. First, ls automatically sorts files into alphanumeric order. This is equivalent to timeseries order because of the filename conventions employed by E3SM. Thus ls ensures that timeseries monotonically increase. Moreover, ls understands command-line globbing to simplify culling only required time periods from directories with longer simulations. Second, issuing ls from the input file directory removes the lengthy path component of each filename received by the splitter. For a 500-year pre-industrial DECK simulation, this removes 500*12=6000 copies of the same ~100-character directory path from the provenance metadata maintained in the history attribute of each downstream file.
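The effect of listing files from inside the input directory can be tried safely with empty placeholder files. In this sketch the casename demo_case and the scratch directory are hypothetical; only the filename pattern (a YYYY-MM suffix) follows the convention described above.

```shell
# Create three years of empty monthly placeholder files in a scratch directory
drc_demo=$(mktemp -d)
for yr in 0001 0002 0003; do
    for mth in 01 02 03 04 05 06 07 08 09 10 11 12; do
        : > "${drc_demo}/demo_case.cam.h0.${yr}-${mth}.nc"
    done
done

# ls from inside the directory yields sorted, path-free names;
# the glob keeps only years 0002-0003
fl_lst=$(cd "${drc_demo}" && ls demo_case.cam.h0.000[23]-??.nc)
fl_nbr=$(printf '%s\n' "${fl_lst}" | wc -l | tr -d ' ')

# Same sanity check the splitter applies: years times twelve
yr_srt=2; yr_end=3
if [ "${fl_nbr}" -ne $(( (yr_end - yr_srt + 1) * 12 )) ]; then
    echo "ERROR: expected $(( (yr_end - yr_srt + 1) * 12 )) files, found ${fl_nbr}" >&2
fi
printf '%s\n' "${fl_lst}" | head -n 1   # first file of the selected period, no directory path
rm -r "${drc_demo}"
```

In a real invocation the output of ls would be piped directly into ncclimo instead of being captured in a variable.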

Finally, note that we explicitly set ${TMPDIR} to a capacious writable directory prior to execution. The regridder writes all intermediate files to this directory, and removes them only upon successful completion. (The splitter itself never writes to ${TMPDIR}). For MPAS files, the regridder may write as many as three or four intermediate files per output file to ${TMPDIR}. Since some 3D MPAS-Ocean DECK PI files are tens of GB in size, it is best to ensure the intermediate files are written to volatile storage. Should the splitter or regridder fail for any reason, the files will remain in ${TMPDIR} to assist in debugging. Thus it is best if ${TMPDIR} is automatically scrubbed every so often, e.g., on reboots as with most Linux and macOS workstations.
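A small pre-flight check can catch an unusable ${TMPDIR} before hours of regridding begin. The following is a minimal sketch; set_tmpdir is a hypothetical helper rather than an ncclimo or ncremap feature, and the scratch path in the usage line is a throwaway example.

```shell
# Hypothetical pre-flight check: create the scratch directory if needed,
# verify that it is writable, export TMPDIR, and report free space
set_tmpdir() {
    drc_tmp=$1
    mkdir -p "${drc_tmp}" || return 1
    if [ ! -w "${drc_tmp}" ]; then
        echo "ERROR: ${drc_tmp} is not writable" >&2
        return 1
    fi
    export TMPDIR="${drc_tmp}"
    # Free space, to compare against the expected size of the intermediate files
    df -h "${drc_tmp}"
}

# Usage with a throwaway directory
set_tmpdir "$(mktemp -d)/rgr_tmp" && echo "TMPDIR=${TMPDIR}"
```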

This discussion of splitting and regridding has focused on "one-off" experiments such as the DECK 500-yr pre-industrial simulation. The above methods with minor modifications also apply to ensemble experiments such as those with historical forcing since 1850. For example, this generates CMIP6 timeseries to analyze historical cloud radiative effects in the ensemble of five E3SM v1 simulations designated H1-H5:

drc_out="${DATA}/ne30/clm" # Native grid output directory
drc_rgr="${DATA}/ne30/rgr" # Regridded output directory
drc_tmp='/p/cscratch/acme/zender1/tmp' # Temporary/intermediate-file directory
cmip6_opt='-7 --dfl_lvl=1 --no_cll_msr --no_frm_trm --no_stg_grd' # CMIP6-specific options
spl_opt='--yr_srt=1850 --yr_end=2014 --ypf=500' # 2D Splitter options
vars='CLDLOW,CLDTOT,FSDS,FSDSC,FSNS,FSNSC,FLDS,FLNS,FLNSC,PS,TGCLDIWP,TGCLDLWP' # 2D
for nsm_nm in H1 H2 H3 H4 H5 ; do
  drc_in="/p/user_pub/work/E3SM/1_0/historical_${nsm_nm}/1deg_atm_60-30km_ocean/atmos/native/model-output/mon/ens1/v1" # Input directory
  export TMPDIR=${drc_tmp}
  cd ${drc_in}
  /bin/ls 2018????.DECKv1b_${nsm_nm}.ne30_oEC.edison.cam.h0.????-??.nc | ncclimo --fml_nm=${nsm_nm} --var=${vars} ${cmip6_opt} ${spl_opt} --map=${DATA}/maps/map_ne30np4_to_cmip6_180x360_aave.20181001.nc --drc_out=${drc_out} --drc_rgr=${drc_rgr} > ~/ncclimo.atm.${nsm_nm} 2>&1
done

The main difference between generating the timeseries for the Historical ensemble and the DECK PI experiment is the need to loop over the ensemble. Here the splitter command is not backgrounded so that one member experiment is processed at a time (to avoid overwhelming nodes with demands for I/O and RAM). Set the input directory in the ensemble loop, and ensure the globbing pattern for filenames matches the naming convention used for all five members. It is important to consider whether to output to member-specific directories or to a single, ensemble-wide directory. If the former, then nothing special need be done. If the latter, use the --fml_nm (family-name) option as above to avoid members overwriting one another's timeseries by creating member-specific timeseries names like

CLDLOW_H1_185001_201412.nc
CLDLOW_H2_185001_201412.nc
...
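Before writing all members to one directory, it can be reassuring to list the names that will result and confirm none collide. This sketch infers the naming pattern from the two examples above (variable, family name, first and last YYYYMM); tms_fl_nm is a hypothetical helper, not an ncclimo function.

```shell
# Hypothetical helper: predict a timeseries filename from the pattern
# var_family_YYYYMM_YYYYMM.nc seen in the examples above.
# Pass years as plain integers without leading zeros.
tms_fl_nm() {
    var_nm=$1; fml_nm=$2; yr_srt=$3; yr_end=$4
    printf '%s_%s_%04d01_%04d12.nc\n' "${var_nm}" "${fml_nm}" "${yr_srt}" "${yr_end}"
}

# List the expected names for two variables across the five members
fl_all=$(for nsm_nm in H1 H2 H3 H4 H5; do
    for var in CLDLOW CLDTOT; do
        tms_fl_nm "${var}" "${nsm_nm}" 1850 2014
    done
done)
printf '%s\n' "${fl_all}"

# Any name printed here would be written by more than one member
fl_dpl=$(printf '%s\n' "${fl_all}" | sort | uniq -d)
if [ -n "${fl_dpl}" ]; then echo "Name collisions: ${fl_dpl}" >&2; fi
```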
