Overview of E3SM Publication Process
It is resolved that E3SM will maintain an LLNL-local copy of all production E3SM simulation run archives in zstash format, from which (near) immediate extraction of all production datasets will proceed to a pre-publication “warehouse” supporting faceted access to the data. The warehouse will have a means to indicate the “state” of the data, drawn from {Assessment, Mediation (being cleaned, regridded, etc.), Publishable}. Future publication activities will proceed from the warehouse, obviating the need to retain the corresponding archives “on disk” if space becomes an issue. Specifically,
The LLNL E3SM Archives will retain the state of simulation output as released by the corresponding project groups (transferred “as is” from NERSC HPSS) in “zstash archive” format. Certain “mapping functions” will be conducted to identify the publishable content within the (often variably structured) archives, in order to facilitate survey and automated extraction of individual native-output datasets. The “maps” produced can be retained with the archives in the event that future access to the archives is required; however, the goal is to avoid such need. Although initially kept on the local file system, archives that have been “exhausted” (all production materials warehoused or published) can be pushed out to long-term tape storage, or eliminated, as policy dictates.
The LLNL E3SM Warehouse: Data not already published[*] will be extracted from Archive to the “faceted” pre-publication Warehouse “v0” paths. These paths mirror the facets employed in the actual publication directories, and allow internal access and management “as if” published. The leaf directory is labeled “v0” to indicate that it holds the raw archive extraction, which has yet to undergo data validation checks, corrections for occasional data irregularities (missing data, overlapping data from unusual restarts), and default post-processing (regridding for selected time-series, climatology generation). These post-processing steps eventually result in a “v1” path, indicating a status of “Publishable”. Publishable data that is not yet requested or authorized for publication will reside indefinitely in the warehouse.
[*] Data already published, for which publication errors are discovered, will be treated as unpublished data (pulled from archive, cleaned, repaired), and the most effective path to a re-publication update will be engaged.
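For illustration, here is a minimal sketch of how a faceted warehouse path might be composed and promoted from “v0” to “v1”. The facet ordering, the example dataset_id handling, and the rename-based promotion are assumptions for illustration only, not the authoritative facet specification:

    import os

    WAREHOUSE_ROOT = "/p/user_pub/e3sm/warehouse/E3SM"

    def warehouse_path(dataset_id, version="v0"):
        """Compose a faceted warehouse path; the leaf directory carries the version."""
        facets = dataset_id.split(".")        # assumed: dot-delimited facets, leading "E3SM"
        return os.path.join(WAREHOUSE_ROOT, *facets[1:], version)

    def promote_to_publishable(dataset_id):
        """Give the dataset a 'v1' leaf once validation and post-processing are complete.
        Whether v1 replaces v0 or sits alongside it is a policy choice; a rename is shown."""
        v0 = warehouse_path(dataset_id, "v0")
        v1 = warehouse_path(dataset_id, "v1")
        os.rename(v0, v1)
        return v1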
Publication: Warehouse datasets that are authorized for publication need only be moved (relinked) to the corresponding publication directory path, have mapfiles generated, and have the formal ESGF publication process engaged.
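A minimal sketch of that relinking step, assuming the warehouse and publication trees share a file system so files can be hard-linked rather than copied; the helper name is hypothetical, and mapfile generation plus the ESGF publish itself remain the province of the existing publication tooling:

    import os

    PUBLICATION_ROOT = "/p/user_pub/work/E3SM"

    def relink_to_publication(warehouse_v1_dir, faceted_dataset_dir):
        """Hard-link a Publishable (v1) warehouse dataset into the publication tree."""
        pub_dir = os.path.join(PUBLICATION_ROOT, faceted_dataset_dir)
        os.makedirs(pub_dir, exist_ok=True)
        for fname in sorted(os.listdir(warehouse_v1_dir)):
            os.link(os.path.join(warehouse_v1_dir, fname), os.path.join(pub_dir, fname))
        return pub_dir   # mapfile generation and the ESGF publish step follow from here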
Under this deployment scheme, almost all queries regarding E3SM production runs can be answered by tools that operate (transparently, if desired) on either the warehouse or the publication location, largely obviating continued dependence upon the “archive mapping functions” except where very archive-specific questions must be addressed. The data footprint does not increase, since each dataset pulled from archive will reside in exactly one of the two locations (warehouse or publication).
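For example, a tool can answer “where does this dataset live?” by probing the two roots in order, with no recourse to the archive maps. A sketch, assuming the faceted directory is identical under both trees:

    import os

    WAREHOUSE_ROOT = "/p/user_pub/e3sm/warehouse/E3SM"
    PUBLICATION_ROOT = "/p/user_pub/work/E3SM"

    def locate(faceted_dataset_dir):
        """Return (label, path) for whichever tree currently holds the dataset."""
        for label, root in (("publication", PUBLICATION_ROOT), ("warehouse", WAREHOUSE_ROOT)):
            path = os.path.join(root, faceted_dataset_dir)
            if os.path.isdir(path):
                return label, path
        return None, None   # not yet extracted from archive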
Automation: We have developed a suite of utilities that assist in moving datasets from archive arrival through to publication. Among these are:
Archive mapping utilities that assist the manual effort of “mapping” the archive content for identifiable datasets (a bit unfortunate that this has to be a “thing” at all).
A utility for conveniently extracting selected datasets to the warehouse (see the extraction sketch following this list).
A variety of individual data validation utilities for assessing the suitability of datasets for publication (and for making corrections to flawed datasets where possible).
Utilities for selected post-processing of datasets, including the creation of derivative datasets (climos, timeseries), along with integrity-preserving mapfile generation.
A multi-stage publication process that moves qualifying datasets from the warehouse to the official publication directories, and engages a publish-to-ESGF operation.
An independent utility that confirms that published datasets are accessible as expected.
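The extraction sketch referenced above, assuming the archive already resides on the local file system and that the extraction pattern for the requested dataset comes from the Archive_Map / Standard_Dataset_Extraction_Patterns configuration; the pattern handling and file gathering shown here are simplified placeholders:

    import os
    import subprocess

    def extract_to_warehouse(local_archive_dir, extraction_pattern, warehouse_v0_dir):
        """Extract files matching one dataset's pattern from a local zstash archive,
        then move the resulting netCDF files into the warehouse v0 directory."""
        os.makedirs(warehouse_v0_dir, exist_ok=True)
        # --hpss=none instructs zstash to read the tar files already on local disk.
        subprocess.run(
            ["zstash", "extract", "--hpss=none", extraction_pattern],
            cwd=local_archive_dir, check=True,
        )
        # A real tool would restrict this walk to the freshly extracted paths.
        for root, _dirs, files in os.walk(local_archive_dir):
            for fname in files:
                if fname.endswith(".nc"):
                    os.replace(os.path.join(root, fname),
                               os.path.join(warehouse_v0_dir, fname))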
Despite this raft of automated elements, the overall flow of datasets to publication currently requires manual shepherding at multiple and varied points. Full automation is on the horizon, but two important hurdles must be overcome:
(1) Dataset Status: Due to the multi-stage nature of dataset processing for publication, hundreds of datasets in the archive or warehouse may reside in various stages of processing. When these operations are handled manually by different individuals, especially in an environment where new requirements and new procedures are often being introduced, it is hard enough for one person to keep track of the status of individual datasets, and almost impossible for one person to hand off processing to another in any consistent and reliable manner. Consider the following questions:
“Which of these specified[*] datasets have been extracted to the warehouse?”
“Which datasets have been assessed for correctness, and to what degree corrected where possible?”
“For which native datasets have the required climos or timeseries been produced, or mapfiles generated?”
Presently, these questions cannot be answered except by hoping that someone involved remembers. This is simply unworkable. We MUST have a uniform and consistent way to determine the precise state of any dataset, and to locate datasets that have qualified to a specific level of processing. To address this shortcoming, we are introducing the notion of “per-dataset status files”, in which compliant processing utilities will automatically record the process-point of each dataset.
[*] “specified datasets”: We lack even a consistently rational way to convey a collection of datasets. I challenge anyone to locate, in the archives or the warehouse, the set of “BGC CTC ssp585” datasets. NONE of the dataset IDs or facet paths include the terms “BGC” or “CTC”, and not all of the relevant experiments have “ssp585” in the name. Magic is required.
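As for the status files themselves, a minimal sketch of what one might contain and how a compliant utility would read and append to it. The file name, record layout, and status vocabulary shown here are assumptions for illustration, not the final specification:

    import os
    import time

    STATUS_FILENAME = ".status"   # assumed: one status file per dataset directory

    def record_status(dataset_dir, stage, state):
        """Append a timestamped 'STAGE:STATE' record, e.g. VALIDATION:Pass."""
        with open(os.path.join(dataset_dir, STATUS_FILENAME), "a") as f:
            f.write("{}:{}:{}\n".format(time.strftime("%Y%m%d_%H%M%S"), stage, state))

    def latest_status(dataset_dir):
        """Return (stage, state) from the most recent record, or None if never recorded."""
        path = os.path.join(dataset_dir, STATUS_FILENAME)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            records = [line.strip() for line in f if line.strip()]
        if not records:
            return None
        _ts, stage, state = records[-1].split(":", 2)
        return stage, state

Each compliant utility would call record_status as its final act, which is what makes the hand-off questions above answerable by inspection rather than by memory.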
(2) Process Orchestration: We intend to process each dataset by assigning it a “process orchestrator” that can conduct each of the many and varied processing steps outlined above. This would not be possible except for the existence of machine-readable dataset status files, and a state transition graph detailing how to proceed for each dataset, given its status. We are engineering an orchestrator capable of negotiating a path across any conditional sequence of processing operations; it will read and update a dataset’s status file and consult the appropriate transition graph to conduct dataset-specific processing.
This “orchestration scheme”, which we alternatively refer to as the “Warehouse State Machine” (or General Processing State Machine), is outlined here: Automated Management of Dataset Publication - A General Process State-Machine Approach
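A skeletal sketch of the idea, reusing the hypothetical status-file helpers sketched above and expressing the transition graph as a simple dictionary; the stage names and the stub handler are placeholders, not the actual Warehouse State Machine specification:

    # Transition graph: (last stage, last state) -> next stage to attempt.
    TRANSITIONS = {
        ("NEW", "Ready"):        "EXTRACTION",
        ("EXTRACTION", "Pass"):  "VALIDATION",
        ("VALIDATION", "Pass"):  "POSTPROCESS",
        ("POSTPROCESS", "Pass"): "PUBLICATION",
    }

    def run_stage(dataset_dir, stage):
        """Placeholder for invoking the real per-stage utility; returns 'Pass' or 'Fail'."""
        return "Pass"

    def orchestrate(dataset_dir):
        """Advance a dataset as far as its status and the transition graph allow."""
        current = latest_status(dataset_dir) or ("NEW", "Ready")
        while current in TRANSITIONS:
            next_stage = TRANSITIONS[current]
            state = run_stage(dataset_dir, next_stage)
            record_status(dataset_dir, next_stage, state)
            if state != "Pass":
                break            # leave the dataset for manual mediation
            current = (next_stage, state)

Because each step is recorded in the status file, such an orchestrator can be stopped and restarted, or handed to a different operator, without losing track of where any dataset stands.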
LLNL E3SM Archives
Data Location: /p/user_pub/e3sm/archives/<model>/<campaign>/<archive_directory>/
Config Location: /p/user_pub/e3sm/archives/.cfg/ (contains Archive_Locator, Archive_Map, and Standard_Dataset_Extraction_Patterns)
Operational details (Guides and Tools): See E3SM Long-Term Archive at LLNL
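A sketch of how a tool might consult the Archive_Map to find the archive location and extraction pattern for a requested dataset. The column layout assumed here (trailing dataset_id, archive path, and extraction-pattern fields in a comma-separated file) is an illustrative guess, not the documented format:

    import csv

    ARCHIVE_MAP = "/p/user_pub/e3sm/archives/.cfg/Archive_Map"

    def lookup_archive(dataset_id):
        """Return (archive_path, extraction_pattern) pairs for a requested dataset_id."""
        hits = []
        with open(ARCHIVE_MAP, newline="") as f:
            for row in csv.reader(f):
                # assumed layout: ..., dataset_id, archive_path, extraction_pattern
                if len(row) >= 3 and row[-3] == dataset_id:
                    hits.append((row[-2], row[-1]))
        return hits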
LLNL E3SM Staging “Warehouse”
Location: /p/user_pub/e3sm/warehouse/E3SM/<faceted_dataset_directory>/
Guides and Tools: . . .
LLNL E3SM Publication
Location: /p/user_pub/work/E3SM/<faceted_dataset_directory>/
Guides and Tools: . . .