It is resolved that E3SM will maintain an LLNL-local copy of all production E3SM simulation run archives in zstash format, from which all production datasets can be extracted (near) immediately into a pre-publication “warehouse” supporting faceted access to the data. The warehouse will provide a means to indicate the “state” of each dataset, drawn from {Assessment, Mediation (being cleaned, regridded, etc.), Publishable}. Future publication activities will proceed from the warehouse, obviating the need to retain the corresponding archives “on disk” if space becomes an issue.
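
As a rough illustration of the intended state tracking, the sketch below models the three warehouse states as a Python enumeration. The enum itself (names, values, comments) is an illustrative assumption, not an agreed-upon API.

```python
from enum import Enum

class WarehouseState(Enum):
    """Coarse per-dataset warehouse states named in the resolution above.
    (This enum is a sketch for illustration, not a settled interface.)"""
    ASSESSMENT = "Assessment"    # pulled from archive, awaiting evaluation
    MEDIATION = "Mediation"      # being cleaned, regridded, renamed, etc.
    PUBLISHABLE = "Publishable"  # cleared to move on to publication
```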

Under this deployment scheme, almost all queries regarding E3SM production runs can be answered by tools that operate (transparently, if desired) on either the warehouse or the publication location of a dataset, largely eliminating the continued dependence upon the “archive mapping functions” except where distinctly archive-specific questions must be addressed. The data footprint does not increase, since a dataset pulled from archive resides in exactly one of the two locations (warehouse or publication).
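
A minimal sketch of such a location-transparent lookup is shown below, assuming a facet-style dataset ID that maps directly onto a directory path. The two root paths and the ID-to-path convention are placeholders, not the actual site layout.

```python
from pathlib import Path

# Hypothetical roots; the real warehouse and publication paths are site-specific.
WAREHOUSE_ROOT = Path("/p/user_pub/e3sm/warehouse")
PUBLICATION_ROOT = Path("/p/user_pub/e3sm/publication")

def locate_dataset(dataset_id: str) -> Path | None:
    """Return the on-disk location of a dataset, checking the warehouse first
    and then the publication area.  A dataset pulled from archive lives in
    exactly one of the two, so the first hit wins."""
    rel = Path(*dataset_id.split("."))   # facet-style ID -> relative path
    for root in (WAREHOUSE_ROOT, PUBLICATION_ROOT):
        candidate = root / rel
        if candidate.is_dir():
            return candidate
    return None   # still archive-only, or the ID is wrong
```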

Despite this raft of automated elements, the overall flow of datasets to publication currently requires manual shepherding at multiple and varied points. Full automation is on the horizon, but two important hurdles must be overcome:

(1) Dataset Status: Due to the multi-stage nature of dataset processing for publication, hundreds of datasets in the archive or warehouse may sit at various stages of processing at any given time. When these operations are handled manually by different individuals, especially in an environment where new requirements and procedures are frequently introduced, it is hard enough for one person to keep track of the status of individual datasets, and almost impossible to hand off processing to another in any consistent and reliable manner. Consider the following questions:

Presently, these questions cannot be answered except by hoping that someone involved remembers. This is simply unworkable. We MUST have a uniform and consistent way to determine the precise state of any dataset, and to locate datasets that have reached a specific level of processing. To address this shortcoming, we are introducing the notion of “per-dataset status files,” to which compliant processing utilities will automatically record the process-point of each dataset.
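
The sketch below shows one way a compliant utility might append to and read such a status file. The file name (“.status”), the line layout (timestamp:stage:state), and the status vocabulary are illustrative assumptions, not the agreed format.

```python
from datetime import datetime, timezone
from pathlib import Path

def record_status(dataset_dir: Path, stage: str, state: str) -> None:
    """Append a timestamped status line to the dataset's status file.
    Example line:  20240101_120000:EXTRACTION:Pass
    (File name and line format are assumptions for illustration.)"""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    status_file = dataset_dir / ".status"          # hypothetical file name
    with status_file.open("a") as f:
        f.write(f"{stamp}:{stage}:{state}\n")

def latest_status(dataset_dir: Path) -> str | None:
    """Return the most recent status line, or None if never processed."""
    status_file = dataset_dir / ".status"
    if not status_file.exists():
        return None
    lines = [ln.strip() for ln in status_file.read_text().splitlines() if ln.strip()]
    return lines[-1] if lines else None
```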

[*] “specified datasets”: We lack even a consistent and rational way to specify a collection of datasets. I challenge anyone to locate, in the archives or the warehouse, the set of “BGC CTC ssp585” datasets. NONE of the dataset IDs or facet paths include the terms “BGC” or “CTC”, and not all of the relevant experiments have “ssp585” in the name. Magic is required.

(2) Process Orchestration: We intend to process each dataset by assigning it a “process orchestrator” that can conduct each of the many and varied processing steps outlined above. This is possible only because of the machine-readable dataset status files and a state-transition graph that details how to proceed with each dataset, given its status. We are engineering an orchestrator capable of negotiating a path through any conditional sequence of processing operations: it reads and updates a dataset’s status file and consults the appropriate transition graph to conduct dataset-specific processing.
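
The following is a bare-bones sketch of such an orchestrator under stated assumptions: a per-dataset “.status” file in the timestamp:stage:state format sketched earlier, a hard-coded transition graph, and placeholder workers. The stage names, the Pass/Fail convention, and the halt-on-failure policy are all illustrative, not the engineered design.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical transition graph: stage just completed -> next stage to run.
TRANSITIONS = {
    None:          "EXTRACTION",    # nothing recorded yet
    "EXTRACTION":  "VALIDATION",
    "VALIDATION":  "POSTPROCESS",   # e.g. cleaning, regridding, time series
    "POSTPROCESS": "PUBLICATION",
    "PUBLICATION": None,            # terminal: nothing left to do
}

# Hypothetical worker registry; each worker returns True on success.
WORKERS = {
    "EXTRACTION":  lambda ds: True,
    "VALIDATION":  lambda ds: True,
    "POSTPROCESS": lambda ds: True,
    "PUBLICATION": lambda ds: True,
}

def _last_record(status_file: Path) -> tuple[str, str] | None:
    """Return (stage, state) from the last status line, or None if empty/missing."""
    if not status_file.exists():
        return None
    lines = [ln for ln in status_file.read_text().splitlines() if ln.strip()]
    if not lines:
        return None
    _, stage, state = lines[-1].split(":")
    return stage, state

def orchestrate(dataset_dir: Path) -> None:
    """Advance one dataset through the transition graph, recording each step
    in its (assumed) per-dataset status file."""
    status_file = dataset_dir / ".status"   # hypothetical file name
    while True:
        last = _last_record(status_file)
        if last and last[1] == "Fail":
            break                           # halt: a failed stage needs human mediation
        next_stage = TRANSITIONS.get(last[0] if last else None)
        if next_stage is None:
            break                           # terminal state reached; fully processed
        ok = WORKERS[next_stage](dataset_dir)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        with status_file.open("a") as f:
            f.write(f"{stamp}:{next_stage}:{'Pass' if ok else 'Fail'}\n")
```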

This “orchestration scheme”, which we alternatively refer to as the “Warehouse State Machine” (or General Processing State Machine), is outlined here: Automated Management of Dataset Publication - A General Process State-Machine Approach

LLNL E3SM Archives → LLNL E3SM Staging “Warehouse” → LLNL E3SM Publication