Parallel Ensemble Simulations for ACME Performance and Verification

1. Poster Title: Parallel Ensemble Simulations for ACME Performance and Verification
2. Authors: Abigail Gaddis, Matthew Norman, Kate Evans, Salil Mahajan, Mark Taylor
3. Group: Atmosphere, Performance
4. Experiment:
5. Poster Category: Problem/Solution
6. Submission Type: Poster
7. Poster Link: https://acme-climate.atlassian.net/wiki/download/attachments/31130387/ACME_Problem_Poster_48x48.pptx?api=v2

Abstract

 

The status quo for high-resolution climate simulation is to perform a very small ensemble (on the order of five members) of long simulations (roughly a century each) for various scenarios arising from IPCC specifications. To finish in a feasible time, a throughput of five Simulated model Years Per wallclock Day (SYPD) is generally accepted as necessary. To achieve this, CAM-SE is scaled over many Processing Elements (PEs), which leaves very little work per node. At this scale, parallel data transfer overheads account for 40% or more of the total runtime, and there are very few threadable indices to use on an accelerator (e.g., a Graphics Processing Unit, GPU). Moreover, even at these scaling limits, ACME barely reaches a “capability-scale” portion of Titan (i.e., more than 25% of the machine), and throughput is still only around one SYPD for the 28km-mesh water cycle experiment targeted by ACME. This, in turn, means (1) little benefit from using GPUs, (2) poor use of computer allocations, and (3) a lower likelihood of receiving large computing allocations in the future.
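
The throughput figures above translate into wall-clock time by simple division. The sketch below is illustrative only: the helper name `wallclock_days` is hypothetical, and the estimate covers compute time alone, ignoring queue waits and job or node failures.

```python
def wallclock_days(simulated_years: float, sypd: float) -> float:
    """Wall-clock days of pure compute needed at a given throughput.

    sypd is Simulated model Years Per wallclock Day. Queue wait time and
    resubmission delays are NOT included (an assumption of this sketch).
    """
    if sypd <= 0:
        raise ValueError("sypd must be positive")
    return simulated_years / sypd


# A century-long simulation at the generally accepted 5 SYPD target.
century_at_target = wallclock_days(100, 5.0)

# The same century at the roughly 1 SYPD quoted for the 28km-mesh experiment.
century_at_one_sypd = wallclock_days(100, 1.0)

print(f"100 years at 5 SYPD: {century_at_target:.0f} days")
print(f"100 years at 1 SYPD: {century_at_one_sypd:.0f} days")
```

At one SYPD, a century of simulation needs about 100 days of uninterrupted compute, five times longer than at the five-SYPD target.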

This pilot study investigates the merits of an ensemble-based approach to climate science and model evaluation, in place of the traditional approach of a single long simulation. Along with a single 100-year atmospheric simulation with annually cycled ocean conditions, we ran two additional experiments with the same configuration: five 20-year runs and 100 one-year runs (98 of which completed successfully). The goal was to discover and quantify the statistical differences between the two approaches and to begin understanding which science questions can be answered in this manner. The one-year ensemble runs can be executed in parallel; they completed in only 12 hours from job submission, whereas the single 100-year and five 20-year simulations took roughly five weeks apiece end to end because of queue wait times and job and node failures that inhibited automatic resubmission.

Two outcomes have emerged from this initial study: (1) we have established that the single long simulation and the ensemble simulations produce statistically similar probability distribution functions for many climate variables, justifying the use of the ensemble simulation method, and (2) our testing procedure has uncovered a major bug in the optional inline interpolation routine within the model.
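
One simple way to quantify how similar two distributions are is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The abstract does not say which statistical test the study used, so the choice here is an assumption. The function name `ks_statistic` and the randomly generated samples are placeholders for illustration, not model output.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples (0.0 = identical, 1.0 = completely separated).
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs are step functions, so the largest gap is
    # attained at one of the observed data points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Synthetic placeholder data (NOT model output): 100 annual means standing in
# for a single long run, and 98 values standing in for completed ensemble members.
rng = random.Random(0)
long_run = [rng.gauss(288.0, 0.3) for _ in range(100)]
ensemble = [rng.gauss(288.0, 0.3) for _ in range(98)]

d = ks_statistic(long_run, ensemble)
print(f"KS statistic: {d:.3f}")
```

A small statistic indicates that the two samples are consistent with a common distribution; turning it into a formal accept or reject decision additionally requires a significance threshold that depends on the two sample sizes.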