
The E3SM Long-Term Archive at LLNL is intended to be a permanent repository that faithfully represents the output of all E3SM simulations and thereafter serves as the single source of data for all future LLNL E3SM publications. The prevailing opinion is to keep the archive “as is”, for the forensic value it may hold in understanding the actual behavior of the simulation models and their operation. The archives are in zstash format. Datasets slated for publication are zstash-extracted to a (faceted) warehouse location for pre-publication assessment, potential repairs, validation, post-processing, and publication.

It is an unfortunate fact that many of these archives, although nominally in tar-zstash format, employ non-standard filenames and tar-paths, some specific to a campaign. Within a single archive, a “standard datatype pattern” (e.g. “*cam.h0.*nc”) may match multiple lists of files, none beginning with the recommended “atm/hist/”, requiring manual inspection to infer which list of files is intended for publication. Also, due in part to major unscheduled restarts, many datasets are spread across multiple separate zstash archives. Any attempt to automate the extraction (or re-extraction) of an archived dataset is therefore stymied without a “map” leading to the various paths required to collect the files for publication.
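
To illustrate the ambiguity (a hypothetical listing; the case name and tar-paths are illustrative, not drawn from an actual archive), the single pattern “*cam.h0.*nc” might match files under several tar-paths within one archive:

    atm/hist/CASE.cam.h0.0001-01.nc                  (files intended for publication)
    run/CASE.cam.h0.0001-01.nc                       (stray copies left in the run directory)
    rest/0002-01-01-00000/CASE.cam.h0.0001-12.nc     (restart companions, not for publication)

Only manual inspection can determine which of these lists is the publication target.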

Here is the latest Archive_Map:

(The Archive_Map is embedded here as an Excellentable spreadsheet.)

Motivation

Prior publication operations involved, for each dataset to be published:

  • Remote login to NERSC HPSS, or an alternative data-source location, to set up an environment for issuing zstash tape-archive extraction commands (sketched after this list)

    • zstash commands would first activate (slow) tape-archive extraction of the necessary zstash tar-files determined to contain the desired dataset files.

    • the commands would then locate and (slowly) extract the desired datasets from the tar-files

  • Conducting Globus transfer of the extracted dataset files to LLNL “staging” in preparation for publication.

  • Discovering, only at publication time, data irregularities that might require considerable remediation.
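
A minimal sketch of those remote steps, assuming a hypothetical HPSS archive path (the zstash and Globus commands are real; every path, case name, and endpoint shown is illustrative):

    # on a NERSC login node, from an empty working directory:
    zstash extract --hpss=/home/<user>/zstash_archives/CASE "*cam.h0.*nc"

    # tape staging and tar extraction are both slow; the extracted files
    # are then transferred to LLNL staging, e.g. with the Globus CLI:
    globus transfer --recursive <nersc-endpoint>:/scratch/<user>/CASE <llnl-endpoint>:/p/user_pub/e3sm/staging/CASE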

Among the archive irregularities encountered:

  • Simulation runs that are partitioned across two or more separate archives, due to restarts conducted on different machines, for different sets of years

  • Simulation runs that are partitioned across two or more separate internal tar-paths, due to restarts conducted on different machines, for different sets of years

  • Datasets archived under variant tar-path names (“atm/hist” versus “archive/atm/hist”, versus “atm/hist/2007-2014/”, and even the occasional “run/”)

  • Files that pattern-match the intended dataset files but are not intended for extraction or publication, present in variously-named tar-paths (e.g. “rest/”, “try/”, …)

This is where our desire to keep the archives a “faithful copy” of their sources necessitates that additional measures be taken at archive assimilation. These measures involve the generation and maintenance of certain “Archive Map” files that capture the idiosyncrasies of the archives, in support of automated access for review and extraction. (A child page will document the issues encountered in archive assimilation processing and content.) Although certain codes help to automate this initial “archive discovery” process, it will always involve manual assessment and selection of Archive Map entries, as long as we must accept archives with novel and unexpected structures such as those depicted above.

Archive Path Mapping

The following page, Archive Acquisition and Path-Mapping Operations, documents the procedures and scripts that facilitate the discovery and codification of archive content. The final products are the files summarized here:

  • /p/user_pub/e3sm/archive/.cfg/Archive_Locator

    • contains lines of the form: campaign,model,experiment,ensemble,full_path_to_archive_directory

    • mostly used internally to help produce the Archive Map.

  • /p/user_pub/e3sm/archive/.cfg/Archive_Map

    • contains lines of the form: campaign,model,experiment,ensemble,dataset_type,full_path_to_archive_directory,tarfile_extraction_pattern

    • dataset_type has the form <realm>_nat_<freq>, e.g. “atm_nat_mon”, “atm_nat_6hr_snap”, “ocn_nat_5day”, etc.

Note that even when campaign,model,experiment,ensemble[,dataset_type] are fully specified, more than one line may be matched due to datasets that are “split” across multiple archives or archive tar-paths.
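
For concreteness, here is a hypothetical entry for each file (the facet values, paths, and extraction pattern are illustrative, not drawn from the actual files):

    Archive_Locator:
    DECK,1_0,piControl,ens1,/p/user_pub/e3sm/archive/1_0/DECK/piControl/ens1

    Archive_Map:
    DECK,1_0,piControl,ens1,atm_nat_mon,/p/user_pub/e3sm/archive/1_0/DECK/piControl/ens1,atm/hist/*cam.h0.*nc

A simple grep then retrieves the entries for a given dataset; a “split” dataset simply returns more than one line:

    grep "DECK,1_0,piControl,ens1,atm_nat_mon" /p/user_pub/e3sm/archive/.cfg/Archive_Map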

Archive Review and Dataset Extraction

Armed with the above Archive Map, one can reliably review or extract specific dataset files without having to employ zstash directly. The following apps (currently bash shell scripts) facilitate these activities.

~/bartoletti1/outbox/extract_archive_files_to.sh file_with_archive_map_line [destination_directory]

If “destination_directory” is not supplied, only the list of matching dataset files is produced and streamed to stdout. Otherwise, the matching files are extracted from the archive and written to the destination directory, which must already exist and be given as a fully-qualified path.
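
A hypothetical session (the grep selector and destination path are illustrative; the script expects a file holding a single Archive_Map line):

    # select one Archive_Map line into a file
    grep "DECK,1_0,piControl,ens1,atm_nat_mon" /p/user_pub/e3sm/archive/.cfg/Archive_Map | head -1 > AM_line

    # list the matching dataset files (no destination directory given)
    ~/bartoletti1/outbox/extract_archive_files_to.sh AM_line

    # extract the files to an existing, fully-qualified destination
    mkdir -p /p/user_pub/e3sm/staging/extraction/piControl_atm_nat_mon
    ~/bartoletti1/outbox/extract_archive_files_to.sh AM_line /p/user_pub/e3sm/staging/extraction/piControl_atm_nat_mon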

(more to come - work in progress)

Child Pages

(A listing of child pages, to depth 2, appears here.)