(by Peter Caldwell; last updated  ) A parallel debugger lets you poke around in running code without having to recompile every time you add a print statement. It's also useful for figuring out why your code is crashing. NERSC does a good job of supporting the totalview debugger, and E3SM does a good job of making sure the model runs on NERSC machines, so taken together we should have a pretty robust debugging capability. The NERSC totalview webpage is: https://docs.nersc.gov/programming/performance-debugging-tools/totalview/ . The steps below mostly replicate what that page says.

Steps:

  1. Compile the code. Here's a simple run script that does that: XXX. Helpful hints:
    1. Make sure to compile in 'debug mode'. Otherwise the debugger won't work.
    2. Only ask for a few nodes. Both totalview and the debug queue are limited in the number of cores you can ask for.
    3. Coarse resolution is much easier to run on a small number of cores and much easier to debug. Can you use ne4?
    4. Configure and build, but don't actually submit the simulation.
  2. Figure out what run command to use. In the case directory for your run, issue ./preview_run . The last line of the output will be (basically) the arguments you pass to totalview. Note in particular the "-n" value: this is the number of MPI tasks srun will launch, and the number of nodes you request in the next step has to be large enough to hold them. Note as well that there's a hidden file called .case.run.sh in this directory which provides the actual commands needed to set up your environment.
  3. Start an interactive debug session: salloc -N 4 -t 30:00 --qos interactive -C knl (here -N 4 is the number of nodes you worked out from preview_run). In my experience this request is met immediately, but you may have to wait 30 minutes or so if the machine is really busy.
  4. Once the prompt returns, you'll be dropped into a new shell on the nodes you were allocated. Your environment won't include the changes you made to the shell you were in before, so you probably need to type module load totalview .
  5. You also need to cd to the run directory for the ACME instance you just built. If you used run_acme.debug.csh to create that executable, cd $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/run/
  6. Issue totalview srun -a --label -n 48 -c 2 --cpu_bind=cores ../build/acme.exe to start totalview. Note that this differs from NERSC's advice (which causes crashes because of bad MPI bindings). I found the line I'm using in $CSCRATCH/ACME_simulations/master.test1.ne4_ne4/case_scripts/batch_output/<file>.o<run_id> .
    1. Several windows will pop up. I usually just select "ok" on the top window, which asks if you want to turn various options (like the replay engine) on.
    2. On the main window, click on the green "go" triangle on the top left. This will initialize the model, then stop and ask whether you want to continue.
      1. If you are debugging a model crash, just click on continue.
      2. If you want to poke around in a file that isn't necessarily causing a crash, click on stop and add a breakpoint or watchpoint at the location you want to query. You can go to the location you want by clicking on "View" → "Lookup Function". The resulting box accepts file names as well as function names. The desired function should show up in the main box of the main totalview window. To add a breakpoint, click on any line number with a box around it. Lines without boxes are comments or continuation statements. You can also set break/watch points from the "Action Point" menu at the top of the main totalview window.
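
The launch line in step 6 is just the srun line from the batch output with "totalview" put in front and "-a" inserted after "srun". Here is a minimal sketch of that rewrite, in case you want to script it. The function name totalview_cmd and the variable launch_line are made up for illustration; the sample line is the one from step 6, and yours will differ depending on your case.

```shell
# Sketch only: turn an srun launch line into a totalview command line.
# totalview_cmd is a hypothetical helper, not part of CIME or totalview.
totalview_cmd() {
    case "$1" in
        "srun "*)
            # Drop the leading "srun " and re-add it with "-a", so that
            # everything after "-a" is passed through as srun's arguments.
            printf 'totalview srun -a %s\n' "${1#srun }"
            ;;
        *)
            echo "expected a launch line starting with 'srun'" >&2
            return 1
            ;;
    esac
}

# Sample launch line taken from step 6 above (an example, not a default).
launch_line='srun --label -n 48 -c 2 --cpu_bind=cores ../build/acme.exe'
totalview_cmd "$launch_line"
```

This only prints the command so you can check it before pasting it into the interactive session; it doesn't start totalview itself.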

There are obviously books' worth of details about how to use totalview and about debugging in general, but the above info should get you started...


  1. ** in my experience, once you "module load totalview" in some shell, stuff is overloaded which screws up git. You don't need to load totalview except inside an interactive session and I recommend you don't. **
  2. https://github.com/ACME-Climate/SimulationScripts/blob/master/archive/F1850/ne30/run_acme.debug.csh
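
One way to follow the advice about only loading totalview inside an interactive session is to guard the module load behind a check for a Slurm allocation. This is a sketch under assumptions: load_totalview_if_allocated is a hypothetical helper name, and it relies on Slurm setting SLURM_JOB_ID inside a job or salloc session, and on the "module" command being available in your shell as it is on NERSC.

```shell
# Sketch only: load totalview when inside a Slurm allocation, otherwise
# leave the environment alone (so git etc. keep working in login shells).
# load_totalview_if_allocated is a hypothetical helper name.
load_totalview_if_allocated() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # SLURM_JOB_ID is set, so we are inside a job or salloc session.
        module load totalview
    else
        echo "not in a Slurm allocation; leaving totalview unloaded" >&2
        return 1
    fi
}
```

After salloc returns, you would call load_totalview_if_allocated in the new shell instead of typing module load totalview by hand.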

