Debugging with Totalview on NERSC

(by Peter Caldwell) A parallel debugger is useful for poking around in code without having to recompile a bunch of times after adding print statements. It's also useful for figuring out why your code is crashing. NERSC does a good job of supporting the totalview debugger, and E3SM does a good job of making sure the model runs on NERSC machines, so taken together we should have a pretty robust debugging capability. The NERSC totalview webpage is: http://www.nersc.gov/users/software/performance-and-debugging-tools/totalview/ . The steps below mostly replicate what that page says.

Steps:

Compile the code:
1. Both totalview and the debug queue limit the number of cores you can ask for. I'm not sure what the upper limit is (edit this page to include this info if you know it!). I used 2 nodes of edison for 48 total cores with 1 thread each. This run_acme script should build a 2-node F-compset job using the ACME master: https://github.com/ACME-Climate/SimulationScripts/blob/master/archive/F1850/ne30/run_acme.debug.csh
2. Note that this script compiles in 'debug mode', which is critical for having the debugger work correctly.
3. The script is also set to compile the code but not actually run it.
4. Note also that I'm running in ne4 resolution because coarse resolution works better on low core counts.
5. ** In my experience, once you "module load totalview" in a shell, it overrides parts of your environment in a way that breaks git. You don't need to load totalview except inside an interactive session, and I recommend you don't. **
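One way to act on the warning in item 5 is to check whether totalview is loaded in the current shell before doing any git work. This is a minimal sketch, not something taken from the NERSC page. The function name totalview_loaded is made up for illustration, and the check assumes the module system exports a colon-separated LOADEDMODULES variable (Environment Modules does; this has not been verified on every NERSC machine).

```shell
#!/bin/sh
# Return success (0) if any loaded module name starts with "totalview".
# Assumption: the module system exports LOADEDMODULES as a colon-separated
# list such as "git/2.9:totalview/2017.1".
totalview_loaded() {
    case ":${LOADEDMODULES:-}" in
        *:totalview*) return 0 ;;
        *) return 1 ;;
    esac
}

if totalview_loaded; then
    echo "totalview is loaded in this shell; use a fresh shell for git work"
else
    echo "totalview is not loaded in this shell"
fi
```

If the first message appears, the simplest fix is probably to open a new login shell rather than try to unload modules by hand.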
Start an interactive debug session with: salloc -N 2 -t 30:00 -p debug (here -N 2 reserves 2 nodes, -t 30:00 asks for 30 minutes, and -p debug selects the debug queue). In my experience this request is met immediately, but you may have to wait 30 minutes or so if the machine is really busy.
Once the prompt returns, you'll be dropped into a new shell on the nodes you were allocated. Your environment won't include the changes you made to the shell you were in before, so you probably need to type: module load totalview
You also need to cd to the run directory for the ACME instance you just built. If you used run_acme.debug.csh to create that executable, the command is: cd $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/run/
To start totalview, issue: totalview srun -a --label -n 48 -c 2 --cpu_bind=cores ../build/acme.exe . Note that this differs from NERSC's advice, which causes crashes because of bad MPI bindings. I found the line I'm using in $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/case_scripts/batch_output/<file>.o<run_id>.
1. Several windows will pop up. I usually just select "ok" on the top window, which asks if you want to turn various options (like the replay engine) on.
2. On the main window, click on the green "go" triangle on the top left. This will initialize the model then stop and ask you if you want to continue or not.
  1. If you are debugging a model crash, just click on continue.
  2. If you want to poke around in code that isn't necessarily causing a crash, click on stop and add a breakpoint or watchpoint at the location you want to inspect. To get to that location, click on "View" → "Lookup Function"; the resulting box accepts file names as well as function names. The function you asked for should show up in the main pane of the main totalview window. To add a breakpoint, click on any line number with a box around it (lines without boxes are comments or continuation lines). You can also set breakpoints and watchpoints from the "Action Point" menu at the top of the main totalview window.
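
The command-line part of the steps above can be collected into a dry run that only prints the commands, so it is safe to try on any machine. This is a sketch under stated assumptions: the variable names and the total_tasks helper are made up for illustration, and the 24 cores per node figure is inferred from the "2 nodes for 48 total cores" setup described earlier.

```shell
#!/bin/sh
# Dry run: print the commands for one debug session without executing them.
NODES=2              # nodes requested with salloc -N
CORES_PER_NODE=24    # assumption: 48 cores / 2 nodes, as in the run above
WALLTIME=30:00       # minutes:seconds, as passed to salloc -t

# One MPI task per core: tasks = nodes * cores per node.
total_tasks() {
    echo $(( $1 * $2 ))
}

NTASKS=$(total_tasks "$NODES" "$CORES_PER_NODE")

echo "salloc -N $NODES -t $WALLTIME -p debug"
echo "module load totalview"
# Single quotes keep $CSCRATCH literal so it expands on the compute node.
echo 'cd $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/run/'
echo "totalview srun -a --label -n $NTASKS -c 2 --cpu_bind=cores ../build/acme.exe"
```

With the defaults shown, the last line printed uses -n 48, matching the totalview command given above. If you change NODES, the task count follows, but whether the debug queue and the totalview license allow the larger request is something you would have to check.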

There are obviously books' worth of details about how to use totalview and about debugging in general, but the info above should get you started.