...

Despite this raft of automated elements, the overall flow from dataset to publication currently requires manual shepherding at multiple and varied points. Full automation is on the horizon, but two important hurdles must be overcome:

(1) Dataset Status: Due to the multi-stage nature of dataset processing for publication, hundreds of datasets in the archive or warehouse may reside in various stages of processing. When these operations are handled manually by different individuals, especially in an environment where new requirements and procedures are frequently introduced, it is hard enough for one person to keep track of the status of individual datasets, and almost impossible for one person to hand off processing to another in any consistent and reliable manner.

...

Consider the following questions:

  • “Which of these specified[*] datasets have been extracted to the warehouse?”

...

  • “Which datasets have been assessed for correctness, and to what degree have they been corrected where possible?”

...

  • “For which native datasets have the required climos or timeseries been produced, or mapfiles generated?”

Presently, these questions cannot be answered except by hoping that someone involved remembers. This is simply unworkable. We MUST have a uniform and consistent way to determine the precise state of any dataset, and to locate datasets that have qualified to a specific level of processing. To address this shortcoming, we are introducing the notion of “per-dataset status files”, in which compliant processing utilities will automatically record the process-point of each dataset.
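The idea can be sketched as follows. This is a minimal illustration, not the actual status-file specification: the one-line-per-transition format, the file naming, and the state names here are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical status-file format (an assumption for illustration):
# one append-only line per transition, "TIMESTAMP:PROCESS:STATE",
# stored as <dataset_id>.status in a common status directory.

def latest_status(status_file: Path) -> str:
    """Return the most recent state recorded in a dataset's status file."""
    lines = [ln.strip() for ln in status_file.read_text().splitlines() if ln.strip()]
    # Entries are appended in order, so the last line is the current state.
    return lines[-1].split(":", 2)[2] if lines else "UNKNOWN"

def datasets_at_state(status_dir: Path, state: str) -> list[str]:
    """List dataset IDs whose current state matches the query."""
    return [p.stem for p in sorted(status_dir.glob("*.status"))
            if latest_status(p) == state]
```

With something like this in place, the questions above reduce to simple queries over the status directory rather than institutional memory.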

(2) Process Orchestration: We intend to process each dataset by assigning it a “process orchestrator” that can conduct each of the many and varied processing steps outlined above. This would not be possible except for the existence of machine-readable dataset status files, and a state transition graph detailing how to proceed for each dataset, given its status. We are engineering an orchestrator, capable of negotiating a path across any conditional sequence of processing operations that would read and update a dataset status file, and consult the appropriate transition graph to conduct dataset-specific processing.
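The orchestrator's core loop can be sketched as below. The handlers and state names are illustrative assumptions, not the project's actual transition graph; the point is only that, given a machine-readable current state, the orchestrator can look up and conduct the next processing step until no further transition applies.

```python
# Illustrative processing steps (assumed names; real handlers would run
# extraction, validation, climo/timeseries generation, publication, etc.).
def extract(ds: str) -> str: return "EXTRACTED"
def validate(ds: str) -> str: return "VALIDATED"
def publish(ds: str) -> str: return "PUBLISHED"

# The transition graph: maps a dataset's current state to the handler
# that advances it. Terminal states have no entry.
TRANSITIONS = {
    "NEW": extract,
    "EXTRACTED": validate,
    "VALIDATED": publish,
}

def orchestrate(dataset_id: str, state: str, status_log: list[str]) -> str:
    """Advance one dataset through the transition graph until no handler
    applies, recording each state change as a status-file entry would."""
    while state in TRANSITIONS:
        state = TRANSITIONS[state](dataset_id)
        status_log.append(f"{dataset_id}:{state}")
    return state
```

A real orchestrator would additionally handle conditional branches (e.g. a failed validation routing to a correction step) and would read and write the per-dataset status files described above rather than an in-memory log.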

This “orchestration scheme”, which we alternatively refer to as the “Warehouse State Machine” (or General Processing State Machine), is outlined here: Automated Management of Dataset Publication - A General Process State-Machine Approach

...