Automated Management of Dataset Publication - A General Process State-Machine Approach

Dataset publication involves many varied steps. Some are demanding of storage space and CPU time, and some must be run only conditionally, depending upon the result of previous operations. Although these steps are largely “automated” individually, they require manual intervention at each stage, both to launch a process and to assess its result in order to decide how to proceed. To support the desired “full publication automation”, these individual steps need to be fully codified as to naming conventions, inputs, outputs, conditional processing and more.

One could attempt to construct a single monolithic executable process, embodying all of the possible processing functions and locking in the conditional branching rules, but such an approach would suffer on multiple fronts. It would require a full code update whenever a subordinate function is modified, a processing step is added, or the conditional flow of operations is altered. Moreover, the failure of a subordinate function (or a general transient system failure) would likely leave the processed data in a state that would not easily accept a check-pointed restart, and might require a full reprocessing “from the top”, with the consequent loss of hours (or days) of otherwise successful processing work.

To address these issues, we are designing and implementing a flexible, externally configurable, decoupled processing state machine. This approach will allow subordinate process functions to be added, removed, or replaced, and the sequencing or conditional branching among functions to be altered, without changes to, or redeployment of, the controlling state machine codebase. It will also record sufficiently detailed processing status, on a dataset-by-dataset basis, that an interruption in process execution for any reason can be followed by a restart without losing the results of previous successful processing.

Requisite Elements

The following elements enable the envisioned General Process State Machine:

The Domain “Dataset Spec”: This global specification contains the static configuration information that “anchors” the process to a specific domain (e.g. “E3SM datasets”). This document (/p/user_pub/e3sm/staging/resource/dataset_spec.yaml) details the metadata that defines each (E3SM) dataset, experiment, model version(s), resolutions, realms, grids, frequencies, etc. By “walking” the branches of this document, the complete list of E3SM dataset_ids (as reflected in the ESGF “master_id”) may be generated. Subsets of these dataset_ids are passed as tokens to those processes intended to operate upon the corresponding datasets.
The Process “Transition Graph”: This global specification contains the transition rules that define the path(s) of conditional processing.

This file is read once by the state machine, and its entries are matched against a dataset’s “current status” to determine the next appropriate processing step. It generally consists of entries of the form

currentProcess:currentState: (leads to) nextProcess:nextState
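
The lookup described above can be sketched as a simple table keyed on the (process, state) pair. This is a minimal illustration only: the on-disk syntax of the transition graph is not shown here, and the state names ("Pass", "Fail", "Ready", "Blocked") and the function name next_step are assumptions made for the example. The process names are borrowed from the warehouse stages (Validate, PostProcess, Publish).

```python
# Hypothetical transition table: (currentProcess, currentState) -> (nextProcess, nextState).
# State names are illustrative assumptions, not the real transition graph.
TRANSITIONS = {
    ("VALIDATE", "Pass"): ("POSTPROCESS", "Ready"),
    ("VALIDATE", "Fail"): ("VALIDATE", "Blocked"),
    ("POSTPROCESS", "Pass"): ("PUBLISH", "Ready"),
    ("POSTPROCESS", "Fail"): ("POSTPROCESS", "Blocked"),
}


def next_step(process, state, transitions=TRANSITIONS):
    """Return the (process, state) pair that follows, or None if no rule matches."""
    return transitions.get((process, state))
```

Because the rules live in data rather than in code, adding or re-routing a step in this sketch means editing the table, not the function that consults it.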

The Per-Dataset “Status File”: These files record the detailed status of each dataset and, coupled with the transition graph, determine which processing step to engage next.

This file is maintained “with the dataset” (in the faceted dataset directory). It is an “append_only” object in terms of changes, thereby recording the timestamped history of processing applied to the dataset. The file consists generally of entries of the form

STAT:<timestamp>:<ProcessName>:<ProcessStatus>:<parameter-details>
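
As a rough sketch of how such a file might be written and read back, the helpers below append entries in the form shown above and recover the most recent one. The function names, the file handling, and the timestamp format (one without colons, such as 20210301_120000) are assumptions for illustration, not the actual implementation.

```python
from typing import NamedTuple, Optional


class StatusEntry(NamedTuple):
    timestamp: str
    process: str
    status: str
    details: str


def parse_status_line(line: str) -> Optional[StatusEntry]:
    """Parse one 'STAT:...' line; return None for anything else."""
    # Split at most 4 times so that colons inside the details field survive.
    parts = line.rstrip("\n").split(":", 4)
    if len(parts) < 4 or parts[0] != "STAT":
        return None
    details = parts[4] if len(parts) == 5 else ""
    return StatusEntry(parts[1], parts[2], parts[3], details)


def append_status(path, timestamp, process, status, details=""):
    """Append a new entry; existing lines are never rewritten."""
    # Assumes the timestamp contains no ':' characters.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"STAT:{timestamp}:{process}:{status}:{details}\n")


def latest_status(path) -> Optional[StatusEntry]:
    """Return the last well-formed STAT entry in the file, or None."""
    latest = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            entry = parse_status_line(line)
            if entry is not None:
                latest = entry
    return latest
```

After an interruption, a restart in this sketch would call latest_status on the dataset's file and resume from whatever step the transition rules give for that entry, leaving earlier successful results untouched.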

Beyond just serving to check-point and condition the state of future processing, these files can be broadly surveyed to determine and report upon the status of the entire dataset warehouse (which datasets are at a particular stage of processing), and to study things like “How often was process X engaged” or “How much time was spent in a particular processing stage”, or “What fraction of time is spent per stage”, etc.
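
A warehouse-wide survey of this kind could be sketched as a tally of each dataset's most recent entry. The example works on in-memory lines to stay self-contained. The dataset identifiers, the function names, and the colon-free timestamp format are hypothetical.

```python
from collections import Counter


def last_stage(lines):
    """Return (process, status) from the last well-formed STAT line, or None."""
    stage = None
    for line in lines:
        parts = line.strip().split(":", 4)
        if len(parts) >= 4 and parts[0] == "STAT":
            stage = (parts[2], parts[3])
    return stage


def survey(status_by_dataset):
    """Count datasets by the (process, status) of their most recent entry."""
    counts = Counter()
    for lines in status_by_dataset.values():
        stage = last_stage(lines)
        if stage is not None:
            counts[stage] += 1
    return counts


# Hypothetical status-file contents, keyed by made-up dataset identifiers.
example = {
    "dataset.a": ["STAT:20210301_120000:VALIDATE:Pass:", "STAT:20210301_130000:PUBLISH:Pass:"],
    "dataset.b": ["STAT:20210302_090000:VALIDATE:Fail:missing_files"],
    "dataset.c": ["STAT:20210302_100000:VALIDATE:Fail:bad_time_axis"],
}
stage_counts = survey(example)
```

Questions such as “How often was process X engaged” follow the same pattern, counting every matching entry rather than only the last one per dataset.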

For a detailed exposition, see: https://acme-climate.atlassian.net/wiki/spaces/EIDMG/pages/2907766794

Operational State Machine

To install and operate the existing warehouse state machine (Validate, PostProcess, Publish), see: https://github.com/E3SM-Project/esgfpub/blob/master/docs/3_warehouse.md