SE/CPL + Perf Breakout, All Hands June 2017

All: We have 1.5 hours for this Bi-Group breakout.

The goal is to have the discussions needed to give everyone as much clarity as possible on their path forward and responsibilities in the months ahead.

To use this time optimally, we will gather candidate discussion topics before and at the start of the meeting, then do a prioritization, and then have time-boxed discussions on the high-priority topics.

Priority | Candidate Discussion Topic | Requestor | More Details; Others can chime in here to modify scope, show support or not, etc. | Notes:
6 | Test Suite Revamp: What do we need to do to finish this off?

Balwinder Singh (thank you, thank you) is leading this critical Epic to define a test suite with mostly ultra-low-res grids to get both improved code coverage and faster test throughput. What can the team do to finish this off?

We also need a high-res, large-case test suite to test basic functionality at scale and prevent performance regression.

  • the production low-res model at production processor counts
  • the high-res model at whatever count works.

Can SCM replace some global configs for tests?

Need faster test suites for the Edison debug nightly. Expand our coverage. Water-cycle based. Could run a 5-day test at true resolution in the Edison debug queue every day.

Test suite that makes it through all LCFs in one day.

Test the PIO→netCDF call stack. Jon will define it; Jim will implement it.

1 | How to best support Critical Path Team and Coupled Sim Group? | Michael Deakin, Jon Wolfe | What should the rest of the SE/CPL + Perf groups know to better support ACME scientists?

The run_acme script is being tested, but maybe not across the full combination of options. Need more runs with varied options.

Formation of the Critical Path team has helped with communication problems.

build_namelist is the expensive part; it is run in a lot of places for legacy reasons.

A lot of the infrastructure has to have an understanding of how the tools are used

2 | Is there a better way to solicit/determine requirements for CIME and associated infrastructure? | Philip Jones, Patrick Worley (in absentia) | In the recent past, some capabilities have been (inadvertently?) dropped. Related to testing too.

Need more tests. That prevents things from being dropped. It's always inadvertent.

Going back: can always check out old version of master. Hard to mix new master with old features.

Whole body of people doing model runs; tap into them to find out what model runs entail, then we can catch a lot more

9 | How can we provide a mechanism for work to move forward without requiring a CIME mod to propagate into the repo? | Philip Jones, Patrick Worley (in absentia) | As CIME has become more opaque, it is more difficult to make quick/simple workarounds to keep simulations going while the final fix propagates through the system. Makes James Foucar a single point of failure.
8 | Upcoming Issues to be ready for: | Andy Salinger

Things we need to be ready for:

  1. Multi-Repo Discussion Wednesday
  2. Infrastructure Review – Summer
  3. v1 Support – next year

5 | Open discussion on ACME and SE/CPL Group Management and Cross-Group relations. | Andy Salinger

Any suggestions on improved management? Inter-group management

machine POCs? overworked.

Integrators: atm, lnd: pretty calm. ocn: very busy.

7 | Coupler Documentation | Noel Keen | How to modify for performance? What are the user-modifiable options?
3 | Performance plans in advance of release. | Peter Caldwell | What are the plans?

Manpower limited. Our limited performance people are chasing performance bugs. Before v1: Hope to do a little more threading performance, not much new GPU work. Not a huge performance gain expected by release. Focusing on high-res case and advanced architectures.

Peter: but lots of work will be done with low-res.

At scaling limit with low-res so not much that can be done.

Everyone needs to own performance and be aware of when their decision may influence it.

Pat is training someone to help with PE layouts. A tool is also being revived.

4 | Test Process | Wade Burgess | Can we enumerate what must be done to bless a baseline? To generate a baseline?

"generate" means do a new run and promote that to baseline.

"bless" means take an old run and promote. Most cases are here.
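For the glossary, the distinction between the two terms can be sketched in a few lines. This is only an illustration under assumed names: `promote`, `generate`, `bless`, the `run_test` callable, and the directory layout are all hypothetical and are not CIME's actual tooling or API.

```python
import shutil
from pathlib import Path


def promote(run_dir: Path, baseline_dir: Path) -> None:
    """Replace the stored baseline with the output of a run (hypothetical layout)."""
    if baseline_dir.exists():
        shutil.rmtree(baseline_dir)
    shutil.copytree(run_dir, baseline_dir)


def generate(run_test, work_dir: Path, baseline_dir: Path) -> None:
    """'generate': do a new run, then promote that run to baseline."""
    new_run = work_dir / "new_run"
    new_run.mkdir(parents=True, exist_ok=True)
    run_test(new_run)  # hypothetical callable that writes the test output files
    promote(new_run, baseline_dir)


def bless(old_run_dir: Path, baseline_dir: Path) -> None:
    """'bless': promote an already-completed run; nothing is re-run."""
    promote(old_run_dir, baseline_dir)
```

In this sketch the only difference is whether a fresh run happens before promotion, which matches the notes above: "bless" reuses an old run, "generate" creates a new one.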

Need glossary. More help desk options.


- | Slow init times | Andy Salinger | What is the cause?

Lots of ACME's component buildnml scripts are in Perl.

Slow map reads in coupler.

Make CIME more Pythonic with Python 3.


- | Slowing down CIME updates while Critical Path team is working | Jon Wolfe
Can treat ACME's CIME as a maint branch. Only make fixes there while the Critical Path team is working.