SE/CPL + Perf Breakout, All Hands June 2017
All: We have 1.5 hours for this Bi-Group breakout.
The goal is to have the discussions needed to give everyone as much clarity as possible on their path forward and responsibilities in the months ahead.
To use this time well, we will gather candidate discussion topics before and at the start of the meeting, prioritize them, and then hold time-boxed discussions on the high-priority topics.
Priority | Candidate Discussion Topic | Requestor | More Details; Others can chime in here to modify scope, show support or not, etc. | Notes: |
---|---|---|---|---|
6 | Test Suite Revamp: What do we need to do to finish this off? | | Balwinder Singh (thank you, thank you) is leading this critical Epic to define a test suite with mostly ultra-low-res grids to get both improved code coverage and faster test throughput. What can the team do to finish this off? We also need a high-res, large-case test suite to test basic functionality at scale and prevent performance regression. | Can SCM replace some global configs for tests? Need faster test suites for Edison debug nightly. Expand our coverage. Water-cycle based. Could run a 5-day test at true resolution in the Edison debug queue every day. Test suite that makes it through all LCFs in one day. Test PIO→NetCDF call stack. Jon will define. Jim will implement. |
1 | How to best support Critical Path Team and Coupled Sim Group? | Michael Deakin (Unlicensed), Jon Wolfe | What should the rest of the SE/CPL + Perf groups know to better support ACME scientists? | run_acme script is being tested, but maybe not with the full combinatorial set of options. Need more runs with options. Formation of Critical Path team has helped communication problems. build_namelist is the expensive part; it is run in a lot of places for legacy reasons. A lot of the infrastructure has to have an understanding of how the tools are used. |
2 | Is there a better way to solicit/determine requirements for CIME and associated infrastructure? | Philip Jones, Patrick Worley (Unlicensed) (in absentia) | In the recent past, some capabilities have been (inadvertently?) dropped. Related to testing too. | Need more tests. That prevents things from being dropped. It's always inadvertent. Going back: can always check out an old version of master. Hard to mix new master with old features. There is a whole body of people doing model runs; tap into them to find out what model runs entail, then we can catch a lot more. |
9 | How can we provide a mechanism for work to move forward without requiring a CIME mod to propagate into the repo? | Philip Jones, Patrick Worley (Unlicensed) (in absentia) | As CIME has become more opaque, it is more difficult to make quick/simple workarounds to keep simulations going while the final fix propagates through the system. Makes James Foucar a single point of failure. | |
8 | Upcoming issues to be ready for | Andy Salinger | Things we need to be ready for: | |
5 | Open discussion on ACME and SE/CPL Group Management and Cross-Group relations. | Andy Salinger | Any suggestions on improved management? Inter-group management | machine POCs? overworked. Integrators: atm, lnd: pretty calm. ocn: very busy. |
7 | Coupler Documentation | Noel Keen | How to modify for performance? | What are the user-modifiable options? |
3 | Performance plans in advance of release. | Peter Caldwell | What are the plans? | Manpower limited. Our limited performance people are chasing performance bugs. Before v1: Hope to do a little more threading performance, not much new GPU work. Not a huge performance gain expected by release. Focusing on high-res case and advanced architectures. Peter: but lots of work will be done with low-res. At scaling limit with low-res so not much that can be done. Everyone needs to own performance and be aware of when their decision may influence it. Pat is training someone to help with PE layouts. Also tool being revived. |
4 | Test Process | Wade Burgess | Can we enumerate what must be done to bless? generate a baseline? | "generate" means do a new run and promote that to baseline. "bless" means take an old run and promote. Most cases are here. Need glossary. More help desk options. |
| | Slow init times | Andy Salinger | What is the cause? | Lots of ACME's component buildnml scripts are in Perl. Slow map reads in coupler. Make CIME more Pythonic with Python 3. |
| | Slowing down CIME updates while Critical Path team is working | Jon Wolfe | Can treat ACME's CIME as a maint branch. Only make fixes there while the Critical Path team is working. | |