...
You may then log in to Globus, activate the credentials for access to NERSC HPSS and to (currently) acme1.llnl.gov, open these in the corresponding Globus “File_Manager” transfer panes, navigate to the given NERSC HPSS archive path in one pane, and to the newly-created receiving archive path in the other pane, and launch the transfer(s). If there are multiple archives to transfer, and fortune shines upon you (the source directories all have a common parent directory), you can simply highlight multiple source directories in the NERSC pane, indicate the receiving parent directory in the acme1 pane ( /p/user_pub/e3sm/archive/model/campaign/ ) and all of the archives can be transferred in one pass, each creating the proper archive leaf-path directory. Otherwise you will need to conduct each transfer individually.
As soon as a new archive is in place, manually update the file ‘/p/user_pub/e3sm/archive/.cfg/Archive_Locator’ with lines of the form
<campaign>,<model>,<experiment>,<ensemble>,<full_archive_path>
for every experiment contained in the archive. Usually there is only one experiment per archive, so only one line is added.
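As an illustration of the line format above, here is a minimal Python sketch that parses and re-emits an Archive_Locator line. The names `LocatorEntry`, `parse_locator_line` and `format_locator_line` are hypothetical and are not part of the existing tooling, and the sketch assumes that the first four fields never contain commas.

```python
from typing import NamedTuple


class LocatorEntry(NamedTuple):
    """One line of the Archive_Locator file (hypothetical helper type)."""
    campaign: str
    model: str
    experiment: str
    ensemble: str
    archive_path: str


def parse_locator_line(line: str) -> LocatorEntry:
    """Split '<campaign>,<model>,<experiment>,<ensemble>,<full_archive_path>'."""
    # maxsplit=4 keeps any commas inside the archive path intact.
    fields = [field.strip() for field in line.strip().split(",", 4)]
    if len(fields) != 5 or not all(fields):
        raise ValueError(f"expected 5 non-empty comma-separated fields: {line!r}")
    return LocatorEntry(*fields)


def format_locator_line(entry: LocatorEntry) -> str:
    """Render an entry back into the one-line Archive_Locator form."""
    return ",".join(entry)
```

A line could be validated before being appended to the file by round-tripping it through `parse_locator_line` and `format_locator_line`; a malformed line raises `ValueError` instead of being written.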
Archive Path-Mapping
With each new archive (/p/user_pub/e3sm/archive/model/campaign/archive_path_leaf_directory_name) securely in hand, the real work of assimilating the archive begins. What exactly does this archive contain? Where is the material “intended” to be the publishable content, and what is not?
The page Default Set of Model Output for ESGF publication serves as a guide, but it also includes items that are not “model output”. Regridded time-series and climatologies are generated post-archive, and are not in the archives themselves. I have produced the following table (CSV file: /p/user_pub/e3sm/archive/.cfg/Standard_Datatype_Extraction_Patterns) to facilitate archive catalog generation:
Dataset Type | Core File Pattern | Comment |
atm nat mon | *cam.h0* | |
atm nat day | *cam.h1* | |
atm nat 6hr_snap | *cam.h2* | |
atm nat 6hr | *cam.h3* | |
atm nat 3hr_snap | *cam.h2* | BGC-only |
atm nat 3hr | *cam.h4* | |
atm nat 3hr | *cam.h3* | BGC-only |
atm nat day_cosp | *cam.h5* | |
lnd nat mon | *clm2.h0* | |
river nat mon | *mosart.h0* | |
ocn nat mon | *mpaso.hist.am.timeSeriesStatsMonthly.* | |
ocn nat globalStats | *mpaso.hist.am.globalStats.* | |
ocn nat 5day | *mpaso.hist.am.highFrequencyOutput.* | |
sea-ice nat mon | *mpascice.hist.am.timeSeriesStatsMonthly.* | |
sea-ice nat day | *mpascice.hist.am.timeSeriesStatsDaily.* | |
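To show how the Core File Patterns in this table can be applied, here is a minimal Python sketch that matches a filename against each pattern using shell-style globbing. The table is transcribed by hand rather than read from the CSV file (whose exact layout is not shown here), and the function name `candidate_dataset_types` is hypothetical. Because some patterns appear in more than one row, the function returns every matching row along with its comment rather than choosing one.

```python
from fnmatch import fnmatchcase
from os.path import basename

# (dataset type, core file pattern, comment), transcribed from the table above.
EXTRACTION_PATTERNS = [
    ("atm nat mon", "*cam.h0*", ""),
    ("atm nat day", "*cam.h1*", ""),
    ("atm nat 6hr_snap", "*cam.h2*", ""),
    ("atm nat 6hr", "*cam.h3*", ""),
    ("atm nat 3hr_snap", "*cam.h2*", "BGC-only"),
    ("atm nat 3hr", "*cam.h4*", ""),
    ("atm nat 3hr", "*cam.h3*", "BGC-only"),
    ("atm nat day_cosp", "*cam.h5*", ""),
    ("lnd nat mon", "*clm2.h0*", ""),
    ("river nat mon", "*mosart.h0*", ""),
    ("ocn nat mon", "*mpaso.hist.am.timeSeriesStatsMonthly.*", ""),
    ("ocn nat globalStats", "*mpaso.hist.am.globalStats.*", ""),
    ("ocn nat 5day", "*mpaso.hist.am.highFrequencyOutput.*", ""),
    ("sea-ice nat mon", "*mpascice.hist.am.timeSeriesStatsMonthly.*", ""),
    ("sea-ice nat day", "*mpascice.hist.am.timeSeriesStatsDaily.*", ""),
]


def candidate_dataset_types(tar_path):
    """Return every (dataset type, comment) whose pattern matches the filename."""
    name = basename(tar_path)  # match on the filename only, not the tar-path
    return [
        (dataset_type, comment)
        for dataset_type, pattern, comment in EXTRACTION_PATTERNS
        if fnmatchcase(name, pattern)
    ]
```

A `cam.h2` or `cam.h3` file yields two candidates, one of them marked BGC-only, so deciding between them still requires knowledge of the campaign.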
STEP 1: The first step in producing the archive catalog is to run:
~/bartoletti1/outbox/archive_path_mapper_stage1.sh file_of_lines_from_Archive_Locator
The “file_of_lines_from_Archive_Locator” would ordinarily be the lines pertaining to the new archive being assimilated. The output will be a directory filled with files, one for each “experiment,ensemble,dataset_type” found in that archive, containing the sorted listing of ALL files in the archive manifest whose filenames matched the corresponding Core File Pattern for that dataset_type. For example, when seeking “BGC-v1,ens1,hist_BCRD” for a small (2007-2014) archive, the “PathsFound” directory was filled with these files
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr_snap
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_day
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:lnd_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:river_nat_mon
1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon
and the first and last 3 lines of the file “1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon” were
atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-01.nc
atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-02.nc
atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-03.nc
. . .
rest/2014-09-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-08.nc
rest/2014-11-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-10.nc
rest/2015-01-01-00000/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-12.nc
Here, it seems clear that we want to add “atm/hist” to the file search pattern. But in many cases there are multiple plausible internal tar-paths to matching files. In order to help automate the necessary disambiguation …
STEP 2: We run
~/bartoletti1/outbox/archive_path_mapper_stage2.sh
This script will seek to trim, for each of the files in the “PathsFound” directory, all lines that begin with or contain known-to-avoid tar-paths. For your amusement, those avoided tar-paths (currently) include
"rest/", "post/", "test*", "init*", "run/try*", "run/bench*", "old/run*", "pp/remap*", "a-prime*", "lnd_rerun*", "atm/ncdiff*", "archive/rest*", "*fullD*", "*photic*"
For the filenames that remain, output is produced that lists the first and last filename found in each residual file list. For the set of seven BGC-v1,ens1,hist_BCRD files listed above, that output was
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr
HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h3.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h3.2014-12-28-10800.nc
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_6hr_snap
HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h2.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h2.2014-12-28-10800.nc
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_day
HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h1.2007-01-01-00000.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h1.2014-07-02-00000.nc
1:BGC-v1:1_1:hist-BCRD:ens1:atm_nat_mon
HEADF:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2007-01.nc
HEADL:atm/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.cam.h0.2014-12.nc
1:BGC-v1:1_1:hist-BCRD:ens1:lnd_nat_mon
HEADF:lnd/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.clm2.h0.2007-01.nc
HEADL:lnd/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.clm2.h0.2014-12.nc
1:BGC-v1:1_1:hist-BCRD:ens1:river_nat_mon
HEADF:rof/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.mosart.h0.2007-01.nc
HEADL:rof/hist/20181217.BCRD_CNPCTC20TR_OIBGC.ne30_oECv3.edison.mosart.h0.2014-12.nc
1:BGC-v1:1_1:hist-BCRD:ens1:sea-ice_nat_mon
HEADF:ice/hist/mpascice.hist.am.timeSeriesStatsMonthly.2007-01-01.nc
HEADL:ice/hist/mpascice.hist.am.timeSeriesStatsMonthly.2014-12-01.nc
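The trimming and first/last reporting described in STEP 2 can be sketched in Python as follows. This is not the code of archive_path_mapper_stage2.sh: the function names are hypothetical, and the matching rule is an assumption (each avoided tar-path is treated as a glob that may match anywhere in the path, which is one reading of “begin with or contain”).

```python
from fnmatch import fnmatchcase

# Avoided tar-paths as listed above.
AVOIDED_TAR_PATHS = [
    "rest/", "post/", "test*", "init*", "run/try*", "run/bench*",
    "old/run*", "pp/remap*", "a-prime*", "lnd_rerun*", "atm/ncdiff*",
    "archive/rest*", "*fullD*", "*photic*",
]


def is_avoided(tar_path):
    """True if the path begins with or contains an avoided tar-path.

    Assumption: wrapping each pattern in '*' lets it match anywhere.
    """
    return any(fnmatchcase(tar_path, f"*{pat}*") for pat in AVOIDED_TAR_PATHS)


def first_and_last(tar_paths):
    """Drop avoided paths, sort the rest, and return (first, last) or None."""
    kept = sorted(path for path in tar_paths if not is_avoided(path))
    if not kept:
        return None
    return kept[0], kept[-1]


def format_report(list_name, tar_paths):
    """Build report lines in the HEADF/HEADL layout shown above."""
    result = first_and_last(tar_paths)
    if result is None:
        return [list_name]
    first, last = result
    return [list_name, f"HEADF:{first}", f"HEADL:{last}"]
```

Under this reading, the “rest/…” entries in the earlier atm_nat_mon listing are dropped and only the “atm/hist/…” entries remain, which is consistent with the report shown for that file. Substring matching is deliberately broad, so a stricter rule (for example, matching only at the start of the path) may be closer to what the script actually does.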