ESGF publication for E3SM model output

This is a guide to publishing E3SM data to ESGF. It does not cover publication to CMIP6.

This guide assumes you have your model data ready.

Step 1: Data Facets

The first step is to decide what data you want to publish. For the piControl run (the first large E3SM publication) we decided to publish the following files:

Atmosphere

Native cam.h0
Regridded cam.h0
Regridded climatologies
Regridded time series (of select variables)

Land

Native clm2.h0
Regridded clm2.h0

Ocean

Native mpaso.hist.am.timeSeriesStatsMonthly

Sea-Ice

Native mpascice.hist.am.timeSeriesStatsMonthly

Once you've decided what data to publish, generate the post-processed data to meet your requirements (time series, regridded data, climatologies, etc.).

The E3SM ini file can be found here: /wiki/spaces/WORKFLOW/pages/650707592 and on GitHub: https://github.com/ESGF/config/blob/devel/publisher-configs/ini/esg.e3sm.ini

The format of this file is fairly straightforward: the first section defines what the data facet options are, the second section implements the facet options, and the final section uses those facets to lay out the directory format and dataset ID format.

For each <category> listed in the "categories" section, there should be a "<category>_options" section that defines what the allowed values are. Each option can have one or more comma-separated values.

categories =
        project | string | false | true | 0
        experiment | enum | true | true | 1
        realm | enum | true | true | 2

experiment_options =
        e3sm, piControl, Pre-industrial Control

realm_options =
        atmos, land, ocean, sea-ice
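
As a rough illustration of the rule above, the following Python sketch checks that every enum category has a matching <category>_options entry. It assumes the fragment sits under a [project:e3sm] section header (not shown in the excerpt), and the function name is hypothetical; it is not part of esgprep.

```python
import configparser

# Fragment from the excerpt above; the section header is an assumption.
INI_TEXT = """
[project:e3sm]
categories =
    project | string | false | true | 0
    experiment | enum | true | true | 1
    realm | enum | true | true | 2

experiment_options =
    e3sm, piControl, Pre-industrial Control

realm_options =
    atmos, land, ocean, sea-ice
"""


def enum_categories_missing_options(ini_text, section="project:e3sm"):
    """Return the enum categories that have no <category>_options entry."""
    # RawConfigParser leaves %(name)s placeholders alone.
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    missing = []
    for line in parser.get(section, "categories").splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue  # skip the empty first line of the multi-line value
        name, kind = fields[0], fields[1]
        if kind == "enum" and not parser.has_option(section, name + "_options"):
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(enum_categories_missing_options(INI_TEXT))
```

An empty list means every enum category in the fragment has its options defined.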

The two most important options are directory_format and dataset_id. Note that both can include hardcoded strings to build directory names, as long as each option is referenced in the format %(some_option)s. The %(root)s option is the path given to the esgprep command used later.

directory_format = %(root)s/%(source)s/%(model_version)s/%(experiment)s/%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean/%(realm)s/%(regridding)s/%(data_type)s/%(time_frequency)s/%(ensemble_member)s
dataset_id = %(source)s.%(model_version)s.%(experiment)s.%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean.%(realm)s.%(regridding)s.%(data_type)s.%(time_frequency)s.%(ensemble_member)s
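
To see what these two templates expand to, here is a small Python sketch that fills them using Python's own % string formatting. This only mimics the substitution; it is an assumption that esgprep resolves the placeholders the same way. The facet values are taken from the piControl layout used in this guide, and the root path is illustrative.

```python
# Templates copied from the ini excerpt above.
DIRECTORY_FORMAT = (
    "%(root)s/%(source)s/%(model_version)s/%(experiment)s/"
    "%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean/"
    "%(realm)s/%(regridding)s/%(data_type)s/%(time_frequency)s/%(ensemble_member)s"
)
DATASET_ID = (
    "%(source)s.%(model_version)s.%(experiment)s."
    "%(atmos_grid_resolution)s_atm_%(ocean_grid_resolution)s_ocean."
    "%(realm)s.%(regridding)s.%(data_type)s.%(time_frequency)s.%(ensemble_member)s"
)

# Example facet values for one dataset (root is an illustrative path).
FACETS = {
    "root": "/p/user_pub/work",
    "source": "E3SM",
    "model_version": "1_0",
    "experiment": "piControl",
    "atmos_grid_resolution": "1deg",
    "ocean_grid_resolution": "60-30km",
    "realm": "atmos",
    "regridding": "129x256",
    "data_type": "climo",
    "time_frequency": "monClim",
    "ensemble_member": "ens1",
}


def render(template, facets):
    """Fill a %(name)s template; raises KeyError if a facet is missing."""
    return template % facets


if __name__ == "__main__":
    print(render(DIRECTORY_FORMAT, FACETS))
    print(render(DATASET_ID, FACETS))
```

A missing facet raises KeyError, which is a quick way to spot a template that references an option you have not defined.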

========= BUG WARNING ===========

DO NOT include any option values that contain a substring matching a 'v' followed by any digits. For example, no 'fv129x256', and no 'v1_0' for model version.

This will break the version parser in esgprep and cause it to not match any of your data.

=================================
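
A quick pre-flight check for this problem could look like the Python sketch below. The regular expression is an assumption that approximates "a 'v' followed by digits"; the exact pattern used by the esgprep version parser is not given in this guide, and the function name is hypothetical.

```python
import re

# Assumption: a lowercase 'v' followed by one or more digits triggers the bug.
VERSION_LIKE = re.compile(r"v\d+")


def risky_values(facets):
    """Return the (facet, value) pairs whose value contains a version-like substring."""
    return [(name, value) for name, value in facets.items()
            if VERSION_LIKE.search(value)]


if __name__ == "__main__":
    candidate = {"model_version": "v1_0", "regridding": "fv129x256", "realm": "atmos"}
    for name, value in risky_values(candidate):
        print("rename %s=%s before running esgprep" % (name, value))
```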

Step 2: Data formatting

Once you've defined the structure of the dataset, the next step is fairly straightforward. Simply create the directories in the structure you defined, and place the correct data into the structure where it should go.

For example, using the esg.e3sm.ini file linked above, the layout looks like the following. Note that a representative file name for each leaf directory is shown in brackets.

$ tree --filelimit 5 /p/user_pub/work/E3SM/
1_0
└── piControl
    └── 1deg_atm_60-30km_ocean
        ├── atmos
        │   ├── 129x256
        │   │   ├── climo
        │   │   │   ├── monClim
        │   │   │   │   └── ens1
        │   │   │   │       └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison_01_000101_000501_climo.nc]
        │   │   │   └── seasonClim
        │   │   │       └── ens1
        │   │   │           └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison_01_AAN_climo.nc]
        │   │   ├── model-output
        │   │   │   └── mon
        │   │   │       └── ens1
        │   │   │           └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc]
        │   │   └── time-series
        │   │       └── mon
        │   │           └── ens1
        │   │               └── v1 [FSNTOA_000101_050012.nc]
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.cam.h0.0001-01.nc]
        ├── land
        │   ├── 129x256
        │   │   └── model-output
        │   │       └── mon
        │   │           └── ens1
        │   │               └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.clm2.h0.0001-01.nc]
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [20180129.DECKv1b_piControl.ne30_oEC.edison.clm2.h0.0001-01.nc]
        ├── ocean
        │   └── native
        │       └── model-output
        │           └── mon
        │               └── ens1
        │                   └── v1 [mpaso.hist.am.timeSeriesStatsMonthly.0001-01-01.nc]
        └── sea-ice
            └── native
                └── model-output
                    └── mon
                        └── ens1
                            └── v1 [mpascice.hist.am.timeSeriesStatsMonthly.0001-01-01.nc]
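
Creating this layout by hand is error-prone, so a small helper can be useful. The following Python sketch builds one leaf directory from a list of path components and copies files into it. The function name and the default "v1" version directory are assumptions based on the tree above; the demo runs in a temporary directory so it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path


def stage_files(root, components, files, version="v1"):
    """Create root/<components...>/<version> and copy the given files into it."""
    leaf = Path(root).joinpath(*components, version)
    leaf.mkdir(parents=True, exist_ok=True)
    for src in files:
        shutil.copy2(src, leaf / Path(src).name)
    return leaf


if __name__ == "__main__":
    # Demo in a throwaway directory with an empty stand-in file.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mpaso.hist.am.timeSeriesStatsMonthly.0001-01-01.nc"
        src.write_bytes(b"")
        components = ["1_0", "piControl", "1deg_atm_60-30km_ocean",
                      "ocean", "native", "model-output", "mon", "ens1"]
        leaf = stage_files(Path(tmp) / "E3SM", components, [src])
        print(leaf)
```

For large model output you may prefer moving or hard-linking files instead of copying them; the copy here is only for illustration.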

Step 3: Mapfile generation

For this step you will need the esgprep utility. You can either install it with pip:

pip install -e git://github.com/ESGF/esgf-prepare.git@master#egg=esgprep

or, if you're running on any of the LLNL machines, it should be available after running:

source /usr/local/conda/bin/activate esgf-pub

First, get the ESGF ini files from the GitHub repo by running the command below. Note that the path /path/to/my/ini/files must already exist.

esgprep fetch-ini -i /path/to/my/ini/files

Next, place your ini file with your specified facets into that directory under the name esg.e3sm.ini.
Next, run the following command. I suggest using nohup or slurm, since this process can take several hours.

esgmapfile make --outdir <path_to_your_output> -i <path_to_ini_directory> --project e3sm --max-processes <number_of_cores> <path_to_your_new_data_set>
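
If you script this step, it can help to assemble the command as an argument list so that paths with spaces are handled safely. The Python sketch below only builds and prints the command using the flags shown above; the function name, the example paths, and the default of using all available cores are assumptions.

```python
import os
import shlex


def mapfile_command(outdir, ini_dir, dataset_path, project="e3sm", max_processes=None):
    """Build the esgmapfile argument list using only the flags shown in this guide."""
    if max_processes is None:
        max_processes = os.cpu_count() or 1  # assumption: default to all cores
    return [
        "esgmapfile", "make",
        "--outdir", outdir,
        "-i", ini_dir,
        "--project", project,
        "--max-processes", str(max_processes),
        dataset_path,
    ]


if __name__ == "__main__":
    argv = mapfile_command("/path/to/mapfiles", "/path/to/my/ini/files",
                           "/p/user_pub/work/E3SM", max_processes=8)
    # Print a shell-safe line that can be pasted after nohup or into a slurm script.
    print(" ".join(shlex.quote(arg) for arg in argv))
```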

Step 4: Indexing and publication

Next, log into the data node you're going to publish to. Make sure you have an OpenID account for this node and that your account has the "publisher" attribute. Note that this server needs access to both the mapfiles and the data directories. If you staged the data on another server, you'll need to copy it over, in the correct structure, to the ESGF node you're publishing to.
Ensure that /esg/config/esgcet/esg.e3sm.ini is correct and that there is an entry for e3sm in the projects table in /esg/config/esgcet/esg.ini.
Source the conda environment.
Store your myproxy credentials locally (this will store your credentials at ~/.globus/certificate-file and activate them for the next 72 hours):

myproxy-logon -s <your_esgf_identity_node_hostname> -l <your_myproxy_username> -o ~/.globus/certificate-file -t 72

If you are publishing an experiment for the first time, run "esginitialize -c".
Run the following commands in the given order:

esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice
esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice --noscan --thredds
esgpublish --project e3sm --map /path/to/where/you/want/your/mapfiles/<your_first_mapfile>.map --service fileservice --noscan --publish
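
When there are many mapfiles, the three-stage sequence above can be scripted. The Python sketch below generates the same three esgpublish invocations per mapfile and runs them in order, stopping at the first failure. The function names and the assumption that mapfiles end in .map are illustrative; only flags shown above are used.

```python
import subprocess
from pathlib import Path

# Extra flags for each stage, in the order given above.
STAGES = ([], ["--noscan", "--thredds"], ["--noscan", "--publish"])


def publish_commands(mapfile, project="e3sm", service="fileservice"):
    """Return the three esgpublish argument lists for one mapfile, in order."""
    base = ["esgpublish", "--project", project,
            "--map", str(mapfile), "--service", service]
    return [base + extra for extra in STAGES]


def publish_all(mapfile_dir):
    """Run all three stages for every *.map file; abort on the first error."""
    for mapfile in sorted(Path(mapfile_dir).glob("*.map")):
        for argv in publish_commands(mapfile):
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for argv in publish_commands("/path/to/mapfiles/example.map"):
        print(" ".join(argv))
```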

Step 5: Verification

Your data should now be available on the given node. You can verify by opening a browser window and going to https://<your_esgf_node>/esg-search/search?project=e3sm

This should give you an XML file with all the datasets with project=e3sm that are available on the given node. Check that your new dataset is listed.
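
The same check can be done from a script. The Python sketch below builds the search URL given above and tests whether a dataset ID appears anywhere in the returned XML. It deliberately scans every element's text rather than assuming a particular XML layout; the function names and the sample response are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def search_url(node, project="e3sm"):
    """Build the esg-search URL for a node, as described above."""
    return "https://%s/esg-search/search?%s" % (node, urlencode({"project": project}))


def dataset_listed(xml_text, dataset_id):
    """Return True if any element's text in the response contains dataset_id."""
    root = ET.fromstring(xml_text)
    return any(el.text and dataset_id in el.text for el in root.iter())


if __name__ == "__main__":
    # Hypothetical, heavily trimmed response used only to exercise the helper.
    sample = "<response><doc><str>E3SM.1_0.piControl.example</str></doc></response>"
    print(search_url("your.esgf.node"))
    print(dataset_listed(sample, "E3SM.1_0.piControl.example"))
```

To use it against a real node, fetch the URL (for example with urllib.request.urlopen) and pass the response body to dataset_listed.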

Step 6: Additional facets

Optional data facet values must be added after the initial publication step. A new mapfile must be generated, with one line per dataset in the format:

<dataset_id> | optional_key1=value1 | optional_key2=value2 | ...

For example, the first round of publication used the following:

E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1 | science_driver="Water Cycle" | land_grid_resolution="1deg" | seaice_grid_resolution="60-30km"
E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.time-series.mon.ens1 | science_driver="Water Cycle" | land_grid_resolution="1deg" | seaice_grid_resolution="60-30km" | period="Perpetual 1850"
E3SM.1_0.piControl.1deg_atm_60-30km_ocean.land.native.model-output.mon.ens1 | science_driver="Water Cycle" | land_grid_resolution="1deg" | seaice_grid_resolution="60-30km" | period="Perpetual 1850"
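
Lines in this format are easy to generate programmatically. The following Python sketch formats one line per dataset from a dictionary of optional facets, quoting each value as in the examples above. The function name is hypothetical.

```python
def facet_line(dataset_id, facets):
    """Format one '<dataset_id> | key="value" | ...' line for the facet mapfile."""
    parts = [dataset_id] + ['%s="%s"' % (key, value) for key, value in facets.items()]
    return " | ".join(parts)


if __name__ == "__main__":
    line = facet_line(
        "E3SM.1_0.piControl.1deg_atm_60-30km_ocean.atmos.129x256.climo.monClim.ens1",
        {
            "science_driver": "Water Cycle",
            "land_grid_resolution": "1deg",
            "seaice_grid_resolution": "60-30km",
        },
    )
    print(line)
```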

After the new mapfile has been created, invoke the esgadd_facetvalues command:

esgadd_facetvalues --project e3sm --map /path/to/your/additional/facets/map --noscan --thredds --service fileservice

Finally, publish the newly updated dataset:

esgpublish --project e3sm --map /path/to/your/map/files/ --noscan --publish --service fileservice