...

  1. Dataset Status: Due to the multi-stage nature of dataset processing for publication, hundreds of datasets in the archive or warehouse may reside in various stages of processing. When these operations are handled manually by different individuals, especially in an environment where new requirements and new procedures are frequently introduced, it is hard enough for one person to keep track of the status of individual datasets, and almost impossible to hand off processing to another person in any consistent and reliable manner. Questions such as “Which of these specified[*] datasets have been extracted to the warehouse?”, “Which datasets have been assessed for correctness, and to what degree corrected where possible?”, and “For which native datasets have the required climos or timeseries been produced, or mapfiles generated?” cannot be answered except by hoping that someone involved remembers. This is simply unworkable. We MUST have a uniform and consistent way to determine the precise state of any dataset, and to locate datasets that have qualified to a specific level of processing. To address this shortcoming, we are introducing the notion of “per-dataset status files” that automatically record the process-point of each dataset (a minimal sketch of one possible format follows this list).

  2. Process Orchestration: We intend to process each dataset by assigning it a “process orchestrator” that can conduct each of the many and varied processing steps outlined above. This is possible only because of the machine-readable dataset status files, together with a state-transition graph detailing how to proceed for each dataset, given its status. We are engineering an orchestrator that reads and updates a dataset’s status file and consults the appropriate transition graph to conduct dataset-specific processing, negotiating a path across any conditional sequence of processing operations.
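To make the status-file idea concrete, the following is a minimal sketch in Python. The `.status` file name, the `timestamp:STAGE:Result` line format, and the helper names are illustrative assumptions, not the warehouse’s actual conventions.

```python
# Minimal sketch of per-dataset status files. The ".status" file name,
# the "timestamp:STAGE:Result" line format, and the stage names are
# illustrative assumptions, not the warehouse's actual conventions.
from datetime import datetime, timezone
from pathlib import Path


def append_status(dataset_dir: Path, stage: str, result: str) -> None:
    """Append a record that `stage` finished with `result` for this dataset."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    with (dataset_dir / ".status").open("a") as f:
        f.write(f"{stamp}:{stage}:{result}\n")


def latest_status(dataset_dir: Path) -> tuple[str, str] | None:
    """Return (stage, result) from the most recent entry, or None if unrecorded."""
    status_file = dataset_dir / ".status"
    if not status_file.exists():
        return None
    lines = status_file.read_text().strip().splitlines()
    if not lines:
        return None
    _stamp, stage, result = lines[-1].split(":", 2)
    return stage, result
```

Because every processing step appends a line, a question like “which datasets have been extracted” becomes a simple scan of status files rather than an appeal to someone’s memory.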

This “orchestration scheme”, which we alternatively refer to as the “Warehouse State Machine” (or General Processing State Machine), is outlined on the (page to be determined…)
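To illustrate the state-machine idea, here is a minimal sketch of an orchestrator loop, reusing the status-file helpers sketched above. The transition graph, stage names, and step functions are hypothetical placeholders; the real graph would encode the full conditional sequence of warehouse operations.

```python
# Minimal sketch of the orchestrator loop. TRANSITIONS maps a
# (stage, result) state to the next stage and the operation that
# performs it. Stage names and step bodies are hypothetical.
from pathlib import Path
from typing import Callable

Step = Callable[[Path], str]  # runs one operation, returns "Pass" or "Fail"


def extract(ds: Path) -> str:
    return "Pass"  # placeholder: real extraction to the warehouse goes here


def validate(ds: Path) -> str:
    return "Pass"  # placeholder: real correctness assessment goes here


def publish(ds: Path) -> str:
    return "Pass"  # placeholder: real publication step goes here


TRANSITIONS: dict[tuple[str, str], tuple[str, Step]] = {
    ("NEW", "Ready"):     ("EXTRACT", extract),
    ("EXTRACT", "Pass"):  ("VALIDATE", validate),
    ("VALIDATE", "Pass"): ("PUBLISH", publish),
}


def orchestrate(dataset_dir: Path) -> None:
    """Advance one dataset from its recorded status as far as the graph allows."""
    state = latest_status(dataset_dir) or ("NEW", "Ready")
    while state in TRANSITIONS:
        next_stage, step = TRANSITIONS[state]
        result = step(dataset_dir)                      # perform the operation
        append_status(dataset_dir, next_stage, result)  # record the outcome
        if result != "Pass":
            break  # halt on failure so a human can intervene
        state = (next_stage, result)
```

Since the loop’s only inputs are the status file and the graph, any operator (or scheduled job) can resume processing exactly where the last one left off.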

[*] “specified datasets”: We lack even a consistent, rational way to designate a collection of datasets. I challenge anyone to locate, in the archives or the warehouse, the set of “BGC CTC ssp585” datasets. NONE of the dataset IDs or facet paths include the terms “BGC” or “CTC”, and not all of the relevant experiments have “ssp585” in the name. Magic is required.

...