(by Peter Caldwell) A parallel debugger is useful for poking around in code without having to recompile a bunch of times after adding print statements. It's also useful for figuring out why your code is crashing. NERSC does a good job of supporting the totalview debugger, and E3SM does a good job of making sure the model runs on NERSC machines... taken together, we should have a pretty robust debugging capability. The NERSC totalview webpage is: https://docs.nersc.gov/programming/performance-and-debugging-tools/totalview/ . Basically I'm just replicating what they say below.

Steps:

  1. Compile the code. Here's a simple run script which does that: run_debug.sh. Helpful hints (a sketch of the key build commands appears after this list):
    1. Make sure to compile in 'debug mode'. Otherwise the debugger won't work.
    2. Only ask for a few nodes. Both totalview and the debug queue limit the number of cores you can use. I'm not sure what the upper limit is (edit this page to include this info if you do!). I used 2 nodes of Edison for 48 total cores with 1 thread each. This run_acme script should build a 2-node F-compset job using the ACME master: https://github.com/ACME-Climate/SimulationScripts/blob/master/archive/F1850/ne30/run_acme.debug.csh
    3. Note that this script compiles in 'debug mode', which is critical for having the debugger work correctly.
    4. This script is also set to compile but not actually execute the code.
    5. Note also that I'm running at ne4 resolution because coarse resolution works better on low core counts.
    6. ** In my experience, once you "module load totalview" in a shell, things get overloaded in a way that screws up git. You don't need to load totalview except inside an interactive session, and I recommend you don't load it anywhere else. **
  2. Start an interactive debug session: salloc -N 2 -t 30:00 -p debug, where -N 2 means reserve 2 nodes. In my experience this request is met immediately, but you may have to wait 30 minutes or so if the machine is really busy.
  3. Once the prompt returns, you'll be dropped into a new shell on the nodes you were allocated. Your environment won't include the changes you made in your previous shell, so you'll probably need to type module load totalview
  4. You also need to cd to the run directory for the ACME instance you just built. If you used run_acme.debug.csh to create that executable, cd $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/run/
  5. Issue totalview srun -a --label -n 48 -c 2 --cpu_bind=cores ../build/acme.exe to start totalview. Note that this differs from NERSC's advice (which caused crashes for me because of bad MPI bindings). I found the line I'm using in $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/case_scripts/batch_output/<file>.o<run_id>. The whole interactive session is sketched after this list.
    1. Several windows will pop up. I usually just select "OK" on the top window, which asks if you want to turn on various options (like the Replay Engine).
    2. On the main window, click on the green "go" triangle on the top left.
  6. From the case directory for your run, issue ./preview_run. One of the first things this prints is the number of nodes needed. Note this number - you'll need it for the salloc command below.
  7. Open the hidden file .case.run.sh in the case directory for your run. This script creates the environment needed to run E3SM and includes the raw submission command that would be needed to run the model. Do/note the following:
    1. Find the "srun" command near the bottom of the file. Just before this command, add the line "module load totalview". Loading totalview needs to happen after ./.env_mach_specific.sh (and perhaps ./preview_namelists) is called in this file, because those commands seem to screw up the environment totalview depends on.
    2. Add "totalview -nomrnet " to the front of the srun command in .case.run.sh
  8. Execute "salloc -N <number of nodes from ./preview_run> -t 30:00 --qos interactive -C knl". Once you're given an interactive allocation, you'll be dropped into a new shell on the allocated nodes.
  9. If you're not still in your case directory, cd back to it now.
  10. Execute your modified .case.run.sh script. This should start a totalview window. After clicking through the intro pages, click the green "go" button near the top of the console screen to start your debugging session.
    1. This will initialize the model then stop and ask you if you want to continue or not.
      1. if you are debugging a model crash, just click on continue
      2. if you want to poke around in a file that isn't necessarily causing a crash, click stop and add a breakpoint or watchpoint at the location you want to query. You can get to that location via "View" → "Lookup Function"; the resulting box accepts file names as well as function names. The desired function should show up in the main pane of the main totalview window. To add a breakpoint, click on any line number with a box around it; lines without boxes are comments or continuation statements. You can also set break/watch points from the "Action Point" menu at the top of the main totalview window.
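
For reference, here's a minimal sketch of the commands a debug-mode build script like run_acme.debug.csh boils down to, assuming a standard CIME workflow; the create_newcase flags and case name below are illustrative placeholders, not copied from that script:

    # create a small, coarse-resolution case (placeholder names; adjust to taste)
    ./create_newcase --case master.test1.ne4_ne4 --compset F1850 --res ne4_ne4
    cd master.test1.ne4_ne4
    ./xmlchange DEBUG=TRUE   # compile in debug mode; the debugger won't work without this
    ./case.setup
    ./case.build             # build only - don't submit the run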
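
Putting the interactive-session steps together, the whole old-style sequence looks like this (node and core counts from the 2-node Edison example above):

    salloc -N 2 -t 30:00 -p debug     # wait here until the allocation is granted
    module load totalview             # only load totalview inside the session
    cd $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/run/
    totalview srun -a --label -n 48 -c 2 --cpu_bind=cores ../build/acme.exe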
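
And here's a rough sketch of what the bottom of a modified .case.run.sh might look like. The srun arguments and executable path below are placeholders borrowed from the example above; keep whatever srun line your own .case.run.sh already contains:

    # ... ./.env_mach_specific.sh and friends have already run higher up in the file ...
    module load totalview   # must come after the environment setup lines
    totalview -nomrnet srun --label -n 48 -c 2 --cpu_bind=cores ../build/e3sm.exe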

...

Notes from Eva Sinha:

  • October 5, 2023 - For the step where you prefix the srun command in .case.run.sh, add "totalview -args" instead of "totalview -nomrnet" to the front of the srun command (see the sketch below).
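
In other words, using the same placeholder srun arguments as in the .case.run.sh sketch above, the modified line would look something like:

    # -args tells totalview the rest of the line is the program plus its arguments
    totalview -args srun --label -n 48 -c 2 --cpu_bind=cores ../build/e3sm.exe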