Debugging with Totalview on NERSC

(by Peter Caldwell; last updated  ) A parallel debugger is useful for poking around in code without having to recompile the code a bunch of times after adding print statements. It's also useful for figuring out why your code is crashing. NERSC does a good job of supporting the totalview debugger and E3SM does a good job of making sure the model runs on NERSC machines. Taken together, we should have a pretty robust debugging capability. The NERSC totalview webpage is: https://docs.nersc.gov/programming/performance-debugging-tools/totalview/. Basically I'm just replicating what they say below.

Steps:

  1. Compile the code. Here's a simple run script which does that: run_debug.sh. Helpful hints:
    1. Make sure to compile in 'debug mode'. Otherwise the debugger won't work.
    2. Only ask for a few nodes. Both totalview and the debug queue are limited in the number of cores you can ask for.
    3. Coarse resolution is much easier to run on a small number of cores and much easier to debug. Can you use ne4?
    4. Configure and build, but don't actually submit the simulation.
  2. From the case directory for your run, issue ./preview_run. One of the first things returned by this call is the number of nodes needed. Note this - you'll need it in step 4.
  3. Open the hidden file .case.run.sh in the case directory for your run. This script creates the environment needed to run e3sm and includes the raw submission command that would be needed to run. Do/note the following:
    1. Find the "srun" command near the bottom of the file. Just before this command, add the line "module load totalview". Loading totalview needs to be done after ./.env_mach_specific.sh and perhaps ./preview_namelists are called in this file, because these commands seem to screw up the environment totalview depends on.
    2. Add "totalview -nomrnet " to the front of the srun command in .case.run.sh
  4. Execute "salloc -N <number of nodes from step 2> -t 30:00 --qos interactive -C knl". Once you're given an interactive allocation, you'll see a new terminal window.
  5. If you're not still in your case directory, cd back to it now.
  6. Execute your modified .case.run.sh script. This should start a totalview window. After clicking through the intro pages, click the green "go" button near the top of the console screen to start your debugging session.
    1. This will initialize the model then stop and ask you if you want to continue or not.
      1. If you are debugging a model crash, just click on continue.
      2. If you want to poke around in a file that isn't necessarily causing a crash, click on stop and add a breakpoint or watchpoint at the location you want to query. You can go to the location you want by clicking on "View" → "Lookup Function". The resulting box accepts file names as well as function names. The desired function should show up in the main box of the main totalview window. To add a breakpoint, click on any line number with a box around it. Lines without boxes are comments or continuation statements. You can also set breakpoints and watchpoints from the "Action Point" menu at the top of the main totalview window.
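
The two edits from step 3 can also be applied with a small script instead of by hand. The following is only a sketch: patch_case_run is a hypothetical helper name (not part of E3SM or CIME), and it assumes the launch line in .case.run.sh begins with "srun", possibly after some leading whitespace. It prints the patched script to standard output and leaves the original file untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of E3SM/CIME): print a copy of a case run
# script with the two edits from step 3 applied.
# Assumption: the launch line starts with "srun" (leading whitespace allowed).
patch_case_run() {
    in_file=$1                             # e.g. .case.run.sh
    prefix=${2:-"totalview -nomrnet "}     # text to put in front of srun
    awk -v prefix="$prefix" '
        # Only the first srun line is changed; later ones pass through as-is.
        !patched && /^[[:space:]]*srun[[:space:]]/ {
            print "module load totalview"  # load right before launching
            print prefix $0                # prepend the debugger command
            patched = 1
            next
        }
        { print }
    ' "$in_file"
}
```

One way to use it is to write the result to a new file (the name here is arbitrary), check it by eye, and then run that file in step 6: `patch_case_run .case.run.sh > case.run.totalview.sh`. A different prefix can be passed as the second argument, for example `patch_case_run .case.run.sh "totalview -args "` if that is the prefix intended by the October 2023 note at the end of this page.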

Notes:

  1. ** In my experience, once you "module load totalview" in some shell, stuff is overloaded, which screws up git. You don't need to load totalview except inside an interactive session, and I recommend you don't. **
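
The reason this works is general shell behavior: environment changes made in a child shell or a subshell do not leak back into the shell that started it. The sketch below illustrates that with a stand-in. fake_module_load is a hypothetical function that only mimics the kind of PATH change a module load might make; it is not the real module command, and the directory name is made up.

```shell
#!/bin/sh
# Hypothetical stand-in for "module load totalview": it changes PATH the way
# a module load might. This is NOT the real module command.
fake_module_load() {
    PATH="/hypothetical/totalview/bin:$PATH"
    export PATH
}

path_before=$PATH

# Parentheses run the commands in a subshell, so the PATH change made by the
# stand-in disappears when the subshell exits.
( fake_module_load; echo "first PATH entry inside subshell: ${PATH%%:*}" )

# Back in the original shell, PATH is the same as before.
if [ "$PATH" = "$path_before" ]; then
    echo "PATH in the original shell is unchanged"
fi
```

Running .case.run.sh as a script has the same effect, since the "module load totalview" line inside it only changes the environment of that script's own process.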


Notes from Eva Sinha:

  • October 5, 2023 - For step 3b (the second sub-step of step 3), add "totalview -args " instead of "totalview -nomrnet " to the front of the srun command in .case.run.sh