SE/CPL Breakout Notes
Topics for discussion:
JIRA:
Different groups use it differently and all badly. Better to abandon?
Don't want to use it for time tracking at the per-task level.
Does anyone look at the time or tasks?
JIRA (or other task tool) is good to keep focused.
Tasks often wind up changing after they are defined.
Making a task takes too much time.
Group leads like to use JIRA for tracking what is going on. Meeting minutes can also be used for that.
If we keep using JIRA, need to standardize across groups. Simplify task creation.
If we don't, need something else (also standardized)
Traceability is important. Who made a request? Is it done?
CIME5 progress/merge
Merges with ACME are going fine. config_grid is a problem because format has changed.
ACME and CESM will share a common CIME repo and major development will occur there. https://github.com/ESMCI/cime
The merge of CIME5 into ACME will be less disruptive than the initial adoption of CIME2. There won't be a third big merge as long as joint development continues.
We are getting everything working faster and better in Python. Python is already more maintainable.
CIME is designed to help end-user scientists do their work and keep them from making mistakes. Need "power-user" mode.
SE team wants new version in v1.
Speed Dating Topics
How do you use JIRA?
Are we testing what you want?
What are your plans for keeping v1 features on master? Tests and v2?
Role of SE/CPL Group
Tracking computational resources. What machines have time? Should we do that? Performance group (Mark T) has been doing it.
If I need to do 1000 years of simulation, where do those resources come from? CSG is working on the model science. Exec Comm writes proposals. No one to prioritize usage. Who uses the rest of our NERSC time? Tracking usage over all ACME accounts.
Management
When will we get these questions answered and decide on actions? Discuss in next telecon.
Messages to ACME management
Are you using JIRA for anything?
Can we get everyone who is on Confluence onto GitHub?
Need a weekly report of ACME usage by user across all machines.
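A weekly per-user report like the one requested above could be built by summing per-job accounting records gathered from each machine. The following is a minimal sketch only: the record fields (`user`, `machine`, `core_hours`) and the function names are hypothetical, and real data would have to come from each center's own accounting tools.

```python
from collections import defaultdict


def weekly_usage_by_user(records):
    """Sum core-hours per user and per machine.

    Each record is a dict with the hypothetical keys
    'user', 'machine', and 'core_hours'.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        totals[rec["user"]][rec["machine"]] += rec["core_hours"]
    # Convert nested defaultdicts to plain dicts for easier printing/testing.
    return {user: dict(machines) for user, machines in totals.items()}


def format_report(totals):
    """Render one line per user: total core-hours plus a per-machine breakdown."""
    lines = []
    for user in sorted(totals):
        per_machine = totals[user]
        total = sum(per_machine.values())
        detail = ", ".join(
            f"{machine}: {hours:.1f}" for machine, hours in sorted(per_machine.items())
        )
        lines.append(f"{user}: {total:.1f} core-hours ({detail})")
    return "\n".join(lines)
```

Sorting users and machines keeps the report stable from week to week, which makes changes in usage easier to spot.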
Test Suite:
When a test fails, the developer should find that out themselves. Every developer should check CDash daily. Wade should not have to tell developers they broke a test.
A "penalty box" group exists that enforces read-only access. Use it for developers/integrators who break code?
Machine slowness still impacts testing. We don't always have fresh results every morning. "Red" + build time has to be looked at.
Lots of work to look at logs and figure out what happened, especially when there's a failure on Sandia machines.
Slow tests: need to modify requested time per test (CIME development). Can run each test in parallel. Request a pool of nodes with one request and run all tests.
What is testing for? Cover all compsets, grids, functionality (threading, restart), machines, compilers.
Better pointers to documentation of tests and cdash.
Want: cluster to do all our tests overnight guaranteed. Use HPC just for HPC-class tests.
Need a frequency attribute for tests. Big ones done with known period.
When we have an ne120 test on Mira, do we always wait for it?
Answer-changing commits made by integrators from two models may hide each other. JIRA can help with that. Also, integrators should check the log.
Machine POC role
POCs are scattered across different groups. Do the POCs see the usage reports for ACME accounts on those machines?
Merge process:
Can we merge faster than 2 days?
No, not without skipping testing under regular conditions.
What about an emergency? Can we start the test suite right away? Yes. Run acme_developer on the machine of your choice (no baseline compares) and include the results in the PR. Needed for the case where you find a problem on Friday morning and want it fixed by Friday afternoon.