Archive Acquisition and Path-Mapping Operations

There are two major parts to the proper assimilation of simulation run data to the LLNL E3SM local archives. The first part involves the physical (Globus) transfer of the zstash-archived data to its destination directory. The second part is the discovery and codification of the contained material, enabling subsequent automated review and extraction as needed.

Archive Acquisition

By a process often shrouded in mystery, you may learn of the need to acquire a simulation archive. You may hear voices indicating a Confluence page, managed by a simulation project/group, detailing the runs conducted in a cloud of arcane verbiage, from which you must locate and decipher the following key treasure items:

Campaign

Presently one of { BGC-v1, CRYO-v1, DECK-v1, HR-v1 }, although new campaigns are expected to appear.

Model

Presently one of { 1_0, 1_1, 1_1_ECA, 1_2, 1_2_1, 1_3 }, although new models are expected to appear

Experiment(s)

Consulting the principals involved in conducting the simulations is recommended here. Experiment names tend to favor compactness over human comprehension. Besides alphanumerics, allow only the special characters { _ - } (underscore and hyphen).

Ensemble

You may find “ensemble” cleverly embedded in the experiment name (e.g. Historical-H3, Projection-P5). These are generally excised from the experiment name in favor of an explicit “ensemble” variable assignment.

Archive Path(s)

This should be the NERSC HPSS path(s) to the zstash-formatted archive(s) associated with the four elements above. Take note of the archive_path_leaf_directory_name(s).

With the above 5 elements in hand, you can craft the receiving archive directory or directories: “/p/user_pub/e3sm/archive/model/campaign/archive_path_leaf_directory_name(s)”.
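
As a minimal sketch of the path assembly just described, the receiving directory can be derived from the leaf name of the HPSS archive path. The function name receiving_dir and the sample arguments are hypothetical. Only the root /p/user_pub/e3sm/archive and the model/campaign/leaf ordering come from the text.

```shell
# Sketch only: build the receiving directory path for one archive.
# ARCHIVE_ROOT comes from the text; the function name is illustrative.
ARCHIVE_ROOT=/p/user_pub/e3sm/archive

receiving_dir() {
    # $1 = model, $2 = campaign, $3 = NERSC HPSS archive path
    leaf=$(basename "$3")    # the archive_path_leaf_directory_name
    printf '%s/%s/%s/%s\n' "$ARCHIVE_ROOT" "$1" "$2" "$leaf"
}

# Hypothetical usage: compute the path, then create it before the transfer.
# dest=$(receiving_dir 1_1 BGC-v1 /some/hpss/path/my_archive_leaf)
# mkdir -p "$dest"
```

Deriving the leaf with basename keeps the local directory name identical to the HPSS one, which is what the Globus multi-directory transfer produces on its own.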

You may then log in to Globus, activate the credentials for access to NERSC HPSS and to (currently) acme1.llnl.gov, open these in the corresponding Globus “File_Manager” transfer panes, navigate to the given NERSC HPSS archive path in one pane, and to the newly-created receiving archive path in the other pane, and launch the transfer(s). If there are multiple archives to transfer, and fortune shines upon you (the source directories all have a common parent directory), you can simply highlight multiple source directories in the NERSC pane, indicate the receiving parent directory in the acme1 pane ( /p/user_pub/e3sm/archive/model/campaign/ ) and all of the archives can be transferred in one pass, each creating the proper archive leaf-path directory. Otherwise you will need to conduct each transfer individually.

As soon as a new archive is in place, manually update the file ‘/p/user_pub/e3sm/archive/.cfg/Archive_Locator’ with lines of the form

<campaign>,<model>,<experiment>,<ensemble>,<full_archive_path>

for every experiment contained in the archive. Usually there is only one experiment per archive, so only one line is added.
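
A small helper can reduce typing errors when composing these lines. The following is a sketch under assumptions: the function name locator_line is hypothetical, and the only validation applied is the experiment-name character rule stated earlier (alphanumerics, underscore, hyphen).

```shell
# Sketch only: print one Archive_Locator line after checking the experiment name.
locator_line() {
    # $1=campaign $2=model $3=experiment $4=ensemble $5=full_archive_path
    case "$3" in
        ''|*[!A-Za-z0-9_-]*)
            echo "invalid experiment name: $3" >&2
            return 1 ;;
    esac
    printf '%s,%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical usage (appends to the locator file named in the text):
# locator_line BGC-v1 1_1 hist-BCRD ens1 "$dest" \
#     >> /p/user_pub/e3sm/archive/.cfg/Archive_Locator
```

Reviewing the printed line before redirecting it into the file is still advisable, since nothing here verifies that the archive path exists.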

Archive Path-Mapping

With each new archive (/p/user_pub/e3sm/archive/model/campaign/archive_path_leaf_directory_name) securely in hand, the real work of assimilating the archive begins. What exactly does this archive contain? Where is the material “intended” to be the publishable content, and what is not?

The page “Default Set of Model Output for ESGF publication” serves as a guide, but it also includes material that is not “model output”. Regridded time-series and climatologies are generated post-archive, and are not in the archives themselves. I have produced the following table (CSV file: /p/user_pub/e3sm/archive/.cfg/Standard_Datatype_Extraction_Patterns) to facilitate archive catalog generation:

Dataset Type       Core File Pattern                             Comment
------------       -----------------                             -------
atm nat mon        *cam.h0*
atm nat day        *cam.h1*
atm nat 6hr_snap   *cam.h2*
atm nat 6hr        *cam.h3*
atm nat 3hr_snap   *cam.h2*                                      BGC-only
atm nat 3hr        *cam.h4*
atm nat 3hr        *cam.h3*                                      BGC-only
atm nat day_cosp   *cam.h5*
lnd nat mon        *clm2.h0*
river nat mon      *mosart.h0*
ocn nat mon        *mpaso.hist.am.timeSeriesStatsMonthly.*       these are not “time-series” in the same sense as the post-process “regridded time series”
ocn nat 5day       *mpaso.hist.am.highFrequencyOutput.*
sea-ice nat mon    *mpascice.hist.am.timeSeriesStatsMonthly.*    these are not “time-series” in the same sense as the post-process “regridded time series”
sea-ice nat day    *mpascice.hist.am.timeSeriesStatsDaily.*      these are not “time-series” in the same sense as the post-process “regridded time series”

STEP 1: The first step in producing the archive catalog is to run:

~/bartoletti1/outbox/archive_path_mapper_stage1.sh file_of_lines_from_Archive_Locator

The “file_of_lines_from_Archive_Locator” would ordinarily contain the Archive_Locator lines pertaining to the new archive being assimilated. The output will be a directory filled with files, one for each “experiment,ensemble,dataset_type” found in that archive, each containing the sorted listing of ALL files in the archive manifest whose filenames matched the corresponding Core File Pattern for that dataset_type. For example, when seeking “BGC-v1,ens1,hist-BCRD” for a small (2007-2014) archive, the “PathsFound” directory was filled with these files:

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr_snap
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_day
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:lnd_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:river_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon

and the first and last 3 lines of the file “1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon” were

atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-01.nc
atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-02.nc
atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-03.nc
. . .
rest/2014-09-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-08.nc
rest/2014-11-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-10.nc
rest/2015-01-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-12.nc

Here, it seems clear that we want to add “atm/hist” to the file search pattern. But in many cases there are multiple plausible internal tar-paths to matching files. In order to help automate the necessary disambiguation …

STEP 2: We run

~/bartoletti1/outbox/archive_path_mapper_stage2.sh

This script will seek to trim, for each of the files in the “PathsFound” directory, all lines that begin with or contain known-to-avoid tar-paths. For your amusement, those avoided tar-paths (currently) include

"rest/", "post/", "test*", "init*", "run/try*", "run/bench*", "old/run*", "pp/remap*", "a-prime*", "lnd_rerun*", "atm/ncdiff*", "archive/rest*", "*fullD*", "*photic*"

For the filenames that remain after trimming, output is produced (named “headset_list_first_last”) that lists the first and last filename found in the residual file-lists. For the set of seven BGC-v1,ens1,hist-BCRD files listed above, that output was

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr

HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h3.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h3.2014-12-28-10800.nc

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr_snap

HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h2.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h2.2014-12-28-10800.nc

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_day

HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h1.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h1.2014-07-02-00000.nc

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon

HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-01.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-12.nc

1:BGC-v1:1_1:hist-BCRD:ens1:lnd_nat_mon

HEADF:lnd/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.clm2.h0.2007-01.nc
HEADL:lnd/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.clm2.h0.2014-12.nc

1:BGC-v1:1_1:hist-BCRD:ens1:river_nat_mon

HEADF:rof/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.mosart.h0.2007-01.nc
HEADL:rof/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.mosart.h0.2014-12.nc

1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon

HEADF:ice/hist/mpascice.hist.am.timeSeriesStatsMonthly.2007-01-01.nc
HEADL:ice/hist/mpascice.hist.am.timeSeriesStatsMonthly.2014-12-01.nc

This is what we want to see. The remaining “first and last” file in each set has a consistent and reasonable tar path, and the filenames match except for the “sim-date” field. For instance, consider the last 3 lines of the output file above. Edit the categorical line

1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon

by appending everything after “HEADF” of the “first file found” line (including the separating colon), but wild-carding the sim-date field “2007-01-01” to obtain the line

1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon:ice/hist/mpascice.hist.am.timeSeriesStatsMonthly.*.nc

(In truth, the pattern “ice/hist/*.nc” would suffice, as it seems only the correct files appear in the listing. Including “mpascice.hist.am.timeSeriesStatsMonthly” adds assurance.)
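
The hand edit described above can be sketched as a pair of small functions. This is an illustration, not part of the stage scripts: the names wildcard_simdate and map_line are hypothetical, and the sim-date forms handled (YYYY-MM, YYYY-MM-DD, YYYY-MM-DD-SSSSS, immediately before “.nc”) are assumed from the example filenames in this section.

```shell
# Sketch only: replace the trailing sim-date field of a path with "*".
wildcard_simdate() {
    printf '%s\n' "$1" |
        sed -E 's/\.[0-9]{4}-[0-9]{2}(-[0-9]{2})?(-[0-9]{5})?\.nc$/.*.nc/'
}

# Append the wildcarded path from a HEADF line to its categorical line.
map_line() {
    # $1 = categorical line, $2 = "HEADF:<path>" line
    printf '%s:%s\n' "$1" "$(wildcard_simdate "${2#HEADF:}")"
}
```

Because the substitution is anchored at the end of the name, a leading generation date such as “20181217” is left untouched. Any filename with a differently shaped date field would pass through unchanged and need manual attention.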

When things are not so smooth:

If instead we discovered that “first” and “last” did not match, we have two possible courses of action. If it is determined that one or the other involves the wrong tar-path, then the list of “tar-path elements to avoid” in the archive_path_mapper_stage2.sh script must be updated to eliminate the incorrect path, and we rerun that script to obtain a new output file. On some occasions this may be needed more than once, in order to determine the intended tar-path to the finalized run.
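
To make the effect of the avoid-list concrete, here is a rough sketch of the trimming and first/last reporting. It is not the code of archive_path_mapper_stage2.sh, whose internals are not shown here. The variable AVOID holds an illustrative subset of the fragments listed earlier, and the matching rule (fragment at the start of the path, or directly after a “/”) is an assumption.

```shell
# Sketch only: AVOID is an illustrative subset, not the script's own list.
AVOID='rest/ post/ run/try old/run archive/rest'

is_avoided() {
    # Succeeds if path $1 begins with a fragment or contains it after a "/".
    for frag in $AVOID; do
        case "$1" in
            "$frag"*|*/"$frag"*) return 0 ;;
        esac
    done
    return 1
}

first_last() {
    # Reads sorted paths on stdin; prints HEADF:/HEADL: for the survivors.
    kept=$(while IFS= read -r path; do
               is_avoided "$path" || printf '%s\n' "$path"
           done)
    [ -n "$kept" ] || return 0
    printf 'HEADF:%s\n' "$(printf '%s\n' "$kept" | head -n 1)"
    printf 'HEADL:%s\n' "$(printf '%s\n' "$kept" | tail -n 1)"
}
```

In this sketch, adding one more fragment to AVOID and re-running is the analogue of updating the list in the stage 2 script.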

If, on the other hand, the tar-paths are equal but variant filenames are found, there are both benign and difficult cases. A benign case exists where the files may have differing generation-dates, but are intended to be part of a single finalized run, as in

HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-01.nc
HEADL:atm/hist/20190530.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-12.nc

If it is determined that the sequence of files is properly ordered and contiguous, one can elide the variant generation dates with the first '*' in the pattern

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon:atm/hist/*BCRD*.cam.h0.*.nc

Alternately, one can explicitly call for both sets by editing the file to contain two categorical lines for the same dataset “atm_nat_mon”:

1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon:atm/hist/20181217.BCRD*.cam.h0.*.nc
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon:atm/hist/20190530.BCRD*.cam.h0.*.nc

A more difficult case occurs when the first and last names are not intended to be part of the same run, as in

HEADF:ice/hist/practice-mpascice.hist.am.timeSeriesStatsMonthly.2007-01-01.nc
HEADL:ice/hist/theta-mpascice.hist.am.timeSeriesStatsMonthly.2014-12-01.nc

then you must look at the full filename listing from “PathsFound” in stage 1 to determine whether the “practice” set or the “theta” set gives the full and proper collection of output files. Suppose that this is determined to be “theta”. You then have a choice (with different implications for robustness). Either ensure that the chosen pattern contains “theta”, as in “ice/hist/theta-*.nc”, or add “ice/hist/practice” to the list of “tar-path elements to avoid” in the archive_path_mapper_stage2.sh script.
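
When deciding between competing sets such as “practice” and “theta”, a quick count of files per name variant can help. The sketch below is hypothetical (the function name count_variants is not from the stage scripts) and assumes the sim-date sits immediately before “.nc”, as in the examples above.

```shell
# Sketch only: collapse sim-dates, then count files per distinct name variant.
count_variants() {
    # $1 = one listing file from the "PathsFound" directory
    sed -E 's/\.[0-9]{4}-[0-9]{2}(-[0-9]{2})?(-[0-9]{5})?\.nc$/.*.nc/' "$1" |
        sort | uniq -c | sort -rn
}

# Hypothetical usage:
# count_variants PathsFound/1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon
```

The variant with the largest count is only a candidate. A full count does not prove the set is contiguous or is the finalized run, so the listing itself should still be inspected.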

Once you have completed this operation for the entire file (appended to each categorical line the working file-match pattern), issue the command

grep -v HEAD headset_list_first_last > archive_dataset_map_prelim

STEP 3: Then run

~/bartoletti1/outbox/archive_path_mapper_stage3.sh > update_Archive_Map

You can test the correctness of the “update_Archive_Map” by invoking

~/bartoletti1/outbox/extract_archive_files_to.sh file_with_update_archive_map_line > file_list

For each line of the “update_Archive_Map” a file-list can be produced and manually examined to ensure that only the intended files are being addressed.
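
Running that check once per line can be looped. The following sketch assumes the extractor accepts a file holding a single map line and writes the matching file list to standard output, as the invocation above suggests. The function name check_each_map_line and the output file naming are hypothetical.

```shell
# Sketch only: produce one file list per line of an update map.
check_each_map_line() {
    # $1 = update_Archive_Map, $2 = extractor command, $3 = output directory
    map_file=$1; extractor=$2; out_dir=$3
    mkdir -p "$out_dir" || return 1
    n=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue          # skip blank lines
        n=$((n + 1))
        printf '%s\n' "$line" > "$out_dir/map_line_$n"
        # </dev/null keeps the extractor from consuming the loop's input
        "$extractor" "$out_dir/map_line_$n" > "$out_dir/file_list_$n" < /dev/null
    done < "$map_file"
    echo "$n"                               # number of lines processed
}

# Hypothetical usage:
# check_each_map_line update_Archive_Map \
#     ~/bartoletti1/outbox/extract_archive_files_to.sh ./map_check
```

Each file_list_N can then be examined next to its map_line_N.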

Once you are satisfied that update lines are correct, issue these commands to “install” the updated Archive_Map:

cat /p/user_pub/e3sm/archive/.cfg/Archive_Map update_Archive_Map | sort | uniq > temp_Archive_Map

mv temp_Archive_Map /p/user_pub/e3sm/archive/.cfg/Archive_Map
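
Before running the two install commands, it may be worth previewing which lines would actually be new. This is a sketch with a hypothetical function name (new_map_lines). It compares whole lines literally, so the “*” and “.” characters in the file patterns are not treated as wildcards.

```shell
# Sketch only: print lines of the update that are not already in the map.
new_map_lines() {
    # $1 = current Archive_Map, $2 = update_Archive_Map
    grep -F -x -v -f "$1" "$2"
}

# Hypothetical usage:
# new_map_lines /p/user_pub/e3sm/archive/.cfg/Archive_Map update_Archive_Map
```

An empty result means the install would leave the Archive_Map unchanged apart from sort order.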

CAVEAT: The above Archive Path-Mapping operations do not ensure that a dataset is necessarily complete or clean. The dataset may still be missing files, or contain hidden restarts with extra and overlapping files. The path-mapping only intends to ensure that all of the files belonging to an intended “finalized run” are identified, for further analysis of coverage and cleanliness. These latter activities are covered by the publication (or pre-publication) “Staging” activities.