ESGF publication for E3SM model output
This guide covers ESGF publication of E3SM data. It is not intended for publication to CMIP6.
Step 1: Data Selection
The first step is to decide what data you want to publish. The following were published for all the DECK experiments:
ATM: Native cam.h0
Native cam.h1
Regridded climos at frequencies 5 year, 50 year, 100 year, total run
Regridded time series (for selected variables*)
LND: Native clm2.h0
Native mosart.h0
OCN: Native mpaso.timeSeriesStatsMonthly
ICE: Native mpassi.timeSeriesStatsMonthly
*Time series variables are: FLNS, FLNT, FLUT, FSNS, FSNTOA, FSNT, PRECC, PRECL, PRECSC, PRECSL, QFLX, SHFLX, TREFHT, TS
Once you've decided what data to publish, generate the post-processed data to meet your requirements (time series, regridded data, climos, etc.).
ESGF requires that the data facets be stored in an "ini" file, which is used by the publisher to discover and validate the files. The E3SM ini file can be found here: /wiki/spaces/WORKFLOW/pages/650707592 and on GitHub: https://github.com/ESGF/config/blob/devel/publisher-configs/ini/esg.e3sm.ini
The format of this file is fairly straightforward: the first section defines what the data facet options are, the second section implements the facet options, and the final section uses those facets to lay out the directory format and dataset ID format.
For each <category> listed in the "categories" option, there should be a "<category>_options" option that defines what the allowed values are. Each option can have one or more comma-separated values.
categories =
    project | string | false | true | 0
    experiment | enum | true | true | 1
    realm | enum | true | true | 2
experiment_options = e3sm, piControl, Pre-industrial Control
realm_options = atmos, land, ocean, sea-ice
The two most important options are directory_format and dataset_id. Both can include hardcoded strings to build directory names, as long as each facet reference is written in the form %(some_option)s. The %(root)s option is the path given to the esgprep command used later.
directory_format = %(root)s/%(source)s/%(model_version)s/%(experiment)s/%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean/%(realm)s/%(regridding)s/%(data_type)s/%(time_frequency)s/%(ensemble_member)s
dataset_id = %(source)s.%(model_version)s.%(experiment)s.%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean.%(realm)s.%(regridding)s.%(data_type)s.%(time_frequency)s.%(ensemble_member)s
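Because the ini file uses Python-style %(name)s placeholders, you can preview how a set of facet values expands with ordinary string formatting. This is a minimal sketch, not what the publisher itself runs: the facets dictionary is illustrative and its values are taken from the piControl example used elsewhere in this guide, with /p/user_pub/work assumed as the root.

```python
# Templates copied from the directory_format and dataset_id options above.
DIRECTORY_FORMAT = (
    "%(root)s/%(source)s/%(model_version)s/%(experiment)s/"
    "%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean/"
    "%(realm)s/%(regridding)s/%(data_type)s/%(time_frequency)s/"
    "%(ensemble_member)s"
)
DATASET_ID = (
    "%(source)s.%(model_version)s.%(experiment)s."
    "%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean."
    "%(realm)s.%(regridding)s.%(data_type)s.%(time_frequency)s."
    "%(ensemble_member)s"
)

# Illustrative facet values (assumed, based on the piControl example).
facets = {
    "root": "/p/user_pub/work",
    "source": "E3SM",
    "model_version": "1_0",
    "experiment": "piControl",
    "atmos_grid_resolution": "1deg",
    "ocean_grid_resolution": "60-30km",
    "realm": "atmos",
    "regridding": "129x256",
    "data_type": "climo",
    "time_frequency": "monClim",
    "ensemble_member": "ens1",
}

# The % operator fills each %(name)s placeholder from the dictionary;
# extra keys (such as "root" for dataset_id) are simply ignored.
directory = DIRECTORY_FORMAT % facets
dataset_id = DATASET_ID % facets

print(directory)
print(dataset_id)
```

A missing facet raises a KeyError, which is a quick way to spot a placeholder you forgot to define.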
Step 2: Data formatting
Once you've defined the structure of the dataset, the next step is fairly straightforward. Simply create the directories in the structure you defined, and place the correct data where it belongs.
For example, the esg.e3sm.ini file linked above produces the following layout. A representative file name for each leaf directory is shown in square brackets.
$ tree --filelimit 5 /p/user_pub/work/E3SM/
1_0
└── piControl
    └── 1deg_atm_60-30km_ocean
        ├── atmos
        │   ├── 129x256
        │   │   ├── climo
        │   │   │   ├── monClim
        │   │   │   │   └── ens1
        │   │   │   │       └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison_01_000101_000501_climo.nc]
        │   │   │   └── seasonClim
        │   │   │       └── ens1
        │   │   │           └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison_01_AAN_climo.nc]
        │   │   ├── model-output
        │   │   │   └── mon
        │   │   │       └── ens1
        │   │   │           └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc]
        │   │   └── time-series
        │   │       └── mon
        │   │           └── ens1
        │   │               └── v1 [FSNTOA_000101_050012.nc]
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc]
        ├── land
        │   ├── 129x256
        │   │   └── model-output
        │   │       └── mon
        │   │           └── ens1
        │   │               └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.clm2.h0.0001-01.nc]
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.clm2.h0.0001-01.nc]
        ├── ocean
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [mpaso.hist.am.timeSeriesStatsMonthly.0001-01-01.nc]
        └── sea-ice
            └── native
                └── model-output
                    └── mon
                        └── ens1
                            └── v1 [mpascice.hist.am.timeSeriesStatsMonthly.0001-01-01.nc]
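If you have many files to place, a small script can build the leaf directories for you. The following is a rough sketch under a few assumptions: stage_file is a hypothetical helper (it is not part of esgprep), the facet values are passed in directory order, and the trailing v1 version directory mirrors the tree above. It runs against a temporary directory so nothing real is touched.

```python
import os
import shutil
import tempfile


def stage_file(src, root, facet_dirs, version="v1"):
    """Copy src into <root>/<facet dirs...>/<version>/ and return the new path.

    Hypothetical helper for illustration; facet_dirs must already be in the
    order required by directory_format.
    """
    leaf = os.path.join(root, *facet_dirs, version)
    os.makedirs(leaf, exist_ok=True)  # creates all intermediate directories
    return shutil.copy2(src, leaf)    # copy2 also preserves timestamps


# Demonstration in a throwaway directory with an empty stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "FSNTOA_000101_050012.nc")
    open(src, "w").close()

    root = os.path.join(tmp, "pub")
    staged = stage_file(
        src,
        root,
        ["E3SM", "1_0", "piControl", "1deg_atm_60-30km_ocean",
         "atmos", "129x256", "time-series", "mon", "ens1"],
    )
    relative = os.path.relpath(staged, root)
    print(relative)
```

Copying rather than moving keeps the original post-processing output intact until you have confirmed the published layout is correct.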
Step 3: Mapfile generation
For this step you will need the esgprep utility. You can either install it with pip:
```
pip install -e git+https://github.com/ESGF/esgf-prepare.git@master#egg=esgprep
```
or, if you're working on any of the LLNL machines, it should be available after running ```source /usr/local/conda/bin/activate esgf-pub```
- First get the ESGF ini files from the GitHub repo. Run ```esgprep fetch-ini -i /path/to/my/ini/files```. Note that the path /path/to/my/ini/files must already exist.
- Next, place your ini file with your specified facets into that directory under the name esg.e3sm.ini.
- Next, run the following command. I suggest using nohup or Slurm, since this process can take several hours.
esgmapfile make --outdir <path_to_put_mapfiles> -i <path_to_ini_directory> --project e3sm --max-processes <number_of_cores> <path_to_your_new_data_set>
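If you would rather drive this step from a script than type the command by hand, one possible approach is sketched below. It only uses the flags shown in the command above; mapfile_command and launch_detached are hypothetical helper names, the paths are placeholders, and detaching the process with start_new_session is offered as a rough stand-in for nohup on POSIX systems.

```python
import shlex
import subprocess


def mapfile_command(outdir, ini_dir, dataset_path, max_processes=1):
    """Build the esgmapfile argument list using only the flags shown above."""
    return [
        "esgmapfile", "make",
        "--outdir", outdir,
        "-i", ini_dir,
        "--project", "e3sm",
        "--max-processes", str(max_processes),
        dataset_path,
    ]


def launch_detached(cmd, log_path):
    """Start cmd in its own session so it survives logout; output goes to log_path."""
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            cmd, stdout=log, stderr=subprocess.STDOUT, start_new_session=True
        )


# Placeholder paths; replace them with your own before launching.
cmd = mapfile_command(
    "/path/to/mapfiles",
    "/path/to/my/ini/files",
    "/p/user_pub/work/E3SM/1_0/piControl",
    max_processes=8,
)
print(shlex.join(cmd))  # inspect the command before running it
```

Calling launch_detached(cmd, "mapfile.log") would then start the job; it is left out of the example so that nothing runs until you have checked the printed command.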
Step 4: Indexing and publication
- Next, log into the data node you're going to be publishing to. Make sure you have an OpenID account for this node, and that your account has the "publisher" attribute. Note that this server needs access to both the mapfiles and the data directories. If you staged the data on another server, you'll need to copy it over, in the correct structure, to the ESGF node you're publishing to.
- Ensure that /esg/config/esgcet/esg.e3sm.ini is correct and there is an entry in /esg/config/esgcet/esg.ini for e3sm in the projects table
- Source the conda environment.
- Store your myproxy credentials locally (this will store your credentials at ~/.globus/certificate-file and activate them for the next 72 hours)
myproxy-logon -s <your_esgf_identity_node_hostname> -l <your_myproxy_username> -o ~/.globus/certificate-file -t 72
- If you are publishing an experiment for the first time run "esginitialize -c"
- Run the following commands in the given order
- esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice
- esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice --noscan --thredds
- esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice --noscan --publish
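Since the three passes must run in order for every mapfile, it can help to script them. Here is a minimal sketch, assuming exactly the flags listed above; publish_commands and publish are hypothetical names, and the demonstration uses a stand-in runner that records the commands instead of executing them.

```python
import subprocess

# Extra flags for each pass, in the required order:
# database scan, THREDDS catalog generation, then index publication.
PASSES = ([], ["--noscan", "--thredds"], ["--noscan", "--publish"])


def publish_commands(mapfile, project="e3sm"):
    """Return the three esgpublish argument lists for one mapfile."""
    base = ["esgpublish", "--project", project,
            "--map", mapfile, "--service", "fileservice"]
    return [base + extra for extra in PASSES]


def publish(mapfile, run=subprocess.run):
    """Run the passes in order; check=True stops at the first failure."""
    for cmd in publish_commands(mapfile):
        run(cmd, check=True)


# Dry run: record the commands rather than executing them.
recorded = []
publish("/path/to/mapfiles/example.map",
        run=lambda cmd, check: recorded.append(cmd))
for cmd in recorded:
    print(" ".join(cmd))
```

Stopping at the first failure matters here, because a later pass assumes the earlier one succeeded.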
Step 5: Verification
Your data should now be available on the given node. You can verify by opening a browser window and going to https://<your_esgf_node>/esg-search/search?project=e3sm
This should give you an XML file with all the datasets with project=e3sm that are available on the given node. Check that your new dataset is listed.
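The same check can be scripted. The sketch below makes several assumptions that you should confirm against your node: that the search service accepts the type, format, and limit query parameters, that format=application/solr+json returns a Solr-style JSON document, and that each dataset record carries a master_id field holding the dataset ID without its version. The payload shown is a hand-written stand-in, not real output from a node, and esgf.example.org is a placeholder hostname.

```python
import json
from urllib.parse import urlencode


def search_url(node, project="e3sm", limit=100):
    """Build a dataset search URL (query parameters are assumptions, see above)."""
    query = urlencode({
        "project": project,
        "type": "Dataset",
        "format": "application/solr+json",
        "limit": limit,
    })
    return f"https://{node}/esg-search/search?{query}"


def listed_ids(payload):
    """Collect the master_id of every dataset record in a parsed response."""
    return {doc.get("master_id") for doc in payload["response"]["docs"]}


# Hand-written stand-in for a response body, for illustration only.
sample = json.loads("""
{"response": {"numFound": 1, "docs": [
  {"master_id": "E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1"}
]}}
""")

url = search_url("esgf.example.org")
wanted = "E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1"
print(url)
print(wanted in listed_ids(sample))
```

To query a live node, fetch the URL with urllib.request.urlopen and pass the decoded JSON to listed_ids.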
Step 6: Additional facets
Optional data facet values must be added after the initial publication step. A new mapfile must be generated, with one line per dataset in the format
<dataset_id> | optional_key1=value1 | optional_key2=value2 | ...
For example, the first round of publication used the following:
E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1 | science_driver=Water Cycle | land_grid_resolution=1deg | seaice_grid_resolution=60-30km
E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.time-series.mon.ens1 | science_driver=Water Cycle | land_grid_resolution=1deg | seaice_grid_resolution=60-30km | period=Perpetual 1850
E3SM.1_0.piControl.1deg_atm_60-30km_ocean.land.native.model-output.mon.ens1 | science_driver=Water Cycle | land_grid_resolution=1deg | seaice_grid_resolution=60-30km | period=Perpetual 1850
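With many datasets, writing these lines by hand gets error-prone, so you may prefer to generate them. A minimal sketch follows, assuming the line format described above; facet_line is a hypothetical helper and the facet values are the ones from the first example line.

```python
def facet_line(dataset_id, facets):
    """Format one mapfile line: <dataset_id> | key1=value1 | key2=value2 ..."""
    pairs = [f"{key}={value}" for key, value in facets.items()]
    return " | ".join([dataset_id] + pairs)


# Facets shared by the datasets in the example above.
common = {
    "science_driver": "Water Cycle",
    "land_grid_resolution": "1deg",
    "seaice_grid_resolution": "60-30km",
}

line = facet_line(
    "E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1",
    common,
)
print(line)
```

Dictionaries keep insertion order in Python 3.7 and later, so the facets appear in the order you define them.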
After the new mapfile has been created, invoke the esgadd_facetvalues command:
esgadd_facetvalues --project e3sm --map /path/to/your/additional/facets/map --noscan --thredds --service fileservice
And finally, publish the newly updated dataset
esgpublish --project e3sm --map /path/to/your/map/files/ --noscan --publish --service fileservice