This is a placeholder.

It will introduce the archive, its motivation, objectives and layout

A child page will document the data sources and transfer status

Introduction

The E3SM Long-Term Archive at LLNL is intended to be a permanent repository that faithfully represents the output of all E3SM simulations, and thereafter to serve as the single source of data for all future publications. A “plurality” of opinion favors keeping the archive “as is”, for the forensic value it may hold in understanding the actual behavior of the simulation models and their operations.

Motivation

Previously, publication operations involved the following steps for each dataset to be published:

  • Remote login to NERSC HPSS, or alternative data source location, in order to set up an environment to issue zstash tape-archive extraction commands

    • zstash commands would first activate (slow) tape-archive extraction of the necessary zstash tar-files determined to contain the desired dataset files.

    • The commands would then locate and (slowly) extract the desired dataset files from the tar-files.

  • Conducting Globus transfer of the extracted dataset files to LLNL “staging” in preparation for publication

  • Discovering, only at publication time, data irregularities that might require considerable remediation.

These operations were conducted as publication requests arrived, and were prone to hung remote logins and to downtime of the relevant systems. Because a single simulation comprises many different datasets, the very same (slow) tape archives would be opened separately on many different occasions. Reliability, availability, and the ability to assess quality in advance all suffered. It was therefore seen as advantageous to defer the extraction of any specific dataset files, and instead to retrieve the entire archive of tar-files just once and transfer it to a local LLNL archive. The resulting local archives are then readily available for coverage and quality assessment up front, so that issues can be addressed long before time-sensitive publication requests appear.

Operations

The LLNL Archives are currently located in: /p/user_pub/e3sm/archive/<model>/<campaign>/<original_source_archive_name(s)>/
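
As an illustration of the layout above, the following minimal Python sketch composes and parses an archive directory path. The function names and the example component values used with them are hypothetical, chosen only for illustration; only the root path and the three-level <model>/<campaign>/<source archive> structure come from this page.

```python
from pathlib import PurePosixPath

# Root of the LLNL archive, as given on this page.
ARCHIVE_ROOT = PurePosixPath("/p/user_pub/e3sm/archive")


def archive_dir(model: str, campaign: str, source_archive: str) -> PurePosixPath:
    """Compose <root>/<model>/<campaign>/<source_archive> (hypothetical helper)."""
    for part in (model, campaign, source_archive):
        # Each component must be a single, non-empty path element.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return ARCHIVE_ROOT / model / campaign / source_archive


def split_archive_dir(path: str) -> tuple:
    """Recover (model, campaign, source_archive) from an archive directory path."""
    # relative_to raises ValueError if the path is not under ARCHIVE_ROOT.
    rel = PurePosixPath(path).relative_to(ARCHIVE_ROOT)
    if len(rel.parts) != 3:
        raise ValueError(f"expected <model>/<campaign>/<archive>, got {rel}")
    model, campaign, source_archive = rel.parts
    return model, campaign, source_archive
```

For example, `archive_dir("1_0", "DECK", "some_archive_name")` yields a path under the archive root with exactly three further components, and `split_archive_dir` reverses that mapping. The values "1_0", "DECK" and "some_archive_name" are placeholders, not actual archive entries.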

Although having the archives present (in zstash archive format) directly in the local filesystem is a great improvement, much work is still required to make the materials “simply accessible” on any “per dataset” basis. Except to the trained eye, the “original source archive names” hardly lend themselves to human parsing, and we want to fully automate “extraction-by-dataset” for archive access. This is a real challenge, because across the campaigns we find:

  • Simulation runs that are partitioned across two or more separate archives, due to restarts conducted on different machines, for different sets of years

  • Simulation runs that are partitioned across two or more separate internal tar-paths, due to restarts conducted on different machines, for different sets of years

  • Datasets archived under variant tar-path names (“atm/hist” versus “archive/atm/hist”, versus “atm/hist/2007-2014/”, and even the occasional “run/”)

This is where our desire to keep the archives a “faithful copy” of their sources requires that additional measures be taken at archive assimilation. These measures involve generating and maintaining certain “Archive Map” files that capture the idiosyncrasies of the archives, in support of automated access for review and extraction.
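
To make the idea concrete, here is a minimal Python sketch of what an Archive Map lookup could look like. The record layout (dataset identifier, archive directory, tar-path pattern), the sample entries, and all function names are assumptions made for illustration; the actual Archive Map format is not described on this page. The sketch shows how one dataset can map to several archives with different internal tar-paths, which is the situation described in the list above.

```python
import csv
import io
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class MapEntry:
    dataset_id: str   # hypothetical dataset key
    archive_dir: str  # directory holding one source archive
    tar_pattern: str  # glob matching the dataset's files inside the archive


# Hypothetical sample map: one run split across two archives with
# variant tar-paths, and another run archived under "run/".
SAMPLE_MAP = """\
exp_A.atm.mon,/archive/exp_A_part1,archive/atm/hist/*.cam.h0.*
exp_A.atm.mon,/archive/exp_A_part2,atm/hist/2007-2014/*.cam.h0.*
exp_B.atm.mon,/archive/exp_B,run/*.cam.h0.*
"""


def load_map(text: str) -> list:
    """Parse comma-separated map lines into MapEntry records."""
    return [MapEntry(*row) for row in csv.reader(io.StringIO(text)) if row]


def entries_for(entries: list, dataset_id: str) -> list:
    """Return every (archive, pattern) entry recorded for one dataset."""
    return [e for e in entries if e.dataset_id == dataset_id]


def select_members(entry: MapEntry, members: list) -> list:
    """Filter archive member paths down to those matching the entry's pattern."""
    return [m for m in members if fnmatchcase(m, entry.tar_pattern)]
```

With such a map, extraction-by-dataset reduces to looking up the dataset's entries and extracting the matching members from each listed archive, without the caller needing to know which tar-path variant a given archive uses.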

(A child page will document the issues with archive assimilation processing and content, and will provide the procedures and scripts that facilitate the discovery and codification of archive variations.)
