Interactive Debugging with ACME on Titan

Dave Norton of PGI asked me to see whether I could get an interactive debugger working with ACME. He asked specifically for pgdbg, but he also wanted to know whether DDT would work. This page documents my efforts. I'm starting with a small test case:

./create_newcase -compset FC5AV1C-L -case enable_debugger -res ne4_ne4 -project stf006 --output-root /lustre/atlas1/stf006/scratch/imn
cd enable_debugger
./xmlchange DEBUG=TRUE

To have an actual error to debug, I also set the PGI compiler version to 16.10.0 in env_mach_specific.xml before setting up and building the case:

./case.setup
./case.build

I then changed the "run_exe" value in env_mach_specific.xml from "${EXEROOT}/acme.exe" to "gdb ${EXEROOT}/acme.exe" so that the run launches the debugger.

EXEROOT

On Titan, EXEROOT matters: the executable may normally live in NFS file space, but when it is passed as an argument to a debugger it must live in Lustre space. EXEROOT in CIME currently defaults to $CIME_OUTPUT_ROOT/$CASE/bld. That is convenient, because you can move the executable onto Lustre simply by passing --output-root to ./create_newcase.
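As a quick sanity check before building, the default EXEROOT can be derived from the output root and tested against the Lustre mount point. This is only a minimal sketch: the helper names are hypothetical (they are not part of CIME), and it assumes Lustre space on Titan is everything under /lustre/, as in the --output-root path used on this page.

```python
import posixpath

# Assumption: Lustre scratch is mounted under /lustre/, as in the
# --output-root path used on this page.
LUSTRE_PREFIX = "/lustre/"


def default_exeroot(cime_output_root, case):
    """Mirror the default described above: $CIME_OUTPUT_ROOT/$CASE/bld."""
    return posixpath.join(cime_output_root, case, "bld")


def is_on_lustre(path):
    """Return True if the normalized path sits under the assumed Lustre mount."""
    return posixpath.normpath(path).startswith(LUSTRE_PREFIX)


exeroot = default_exeroot("/lustre/atlas1/stf006/scratch/imn", "enable_debugger")
print(exeroot)
print(is_on_lustre(exeroot))
```

If the check fails for your case, re-creating the case with an --output-root under Lustre is probably the simplest fix.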

Running Interactively

Start an interactive job:

qsub -I -A stf006 -lwalltime=01:00:00,nodes=1 -q debug

Inside the job, cd to the case directory and run:

source .env_mach_specific.sh
./case.submit --no-batch

If you're using a visual debugger, this may be enough. A command-line debugger such as "gdb" also needs interactive access to stdin. You can keep the output from being redirected to a file by removing the contents of "run_misc_suffix" in env_mach_specific.xml, but I don't know how to enable interactive stdin input.

Allinea Forge DDT

I think the best way to run DDT on ACME is with its reverse connect feature.

  1. Start an interactive job.
  2. Source .env_mach_specific.sh.
  3. Load the forge module (module load forge).
  4. Change the "aprun" command in env_mach_specific.xml to "ddt --connect aprun".
    It should look like this:
       <mpirun mpilib="default">
          <executable>ddt --connect aprun</executable>
          <arguments>
             <arg name="aprun"> {{ aprun }}</arg>
          </arguments>
       </mpirun>
  5. Before you run ./case.submit --no-batch, make sure a remote client is already connected to Titan, following these instructions.
    (The DDT path given on that webpage is outdated. Run "module show forge" to find the current path and use that when connecting.)
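After editing the file by hand, it can be worth confirming that the mpirun entry still parses and that the executable is what you expect. The following is a minimal sketch using only the Python standard library. It works on a copy of the fragment shown in step 4 rather than on the real env_mach_specific.xml, and read_mpirun is a hypothetical helper, not a CIME function.

```python
import xml.etree.ElementTree as ET

# Copy of the fragment from step 4; in practice you would read it
# out of env_mach_specific.xml yourself.
MPIRUN_XML = """
<mpirun mpilib="default">
  <executable>ddt --connect aprun</executable>
  <arguments>
    <arg name="aprun"> {{ aprun }}</arg>
  </arguments>
</mpirun>
"""


def read_mpirun(xml_text):
    """Return (executable, {arg name: arg text}) from an <mpirun> fragment."""
    node = ET.fromstring(xml_text.strip())
    executable = node.findtext("executable").strip()
    args = {arg.get("name"): (arg.text or "").strip() for arg in node.iter("arg")}
    return executable, args


executable, args = read_mpirun(MPIRUN_XML)
print(executable)
print(args)
# The executable must still contain "aprun" for the launcher to be recognized
# once the executable check is relaxed to a substring match.
print("aprun" in executable)
```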

For CIME to run the "ddt --connect aprun" command with the correct aprun arguments, I use the following patch against CIME (an older version, I think):

diff --git a/cime/utils/python/CIME/aprun.py b/cime/utils/python/CIME/aprun.py
index a01d6fb..95acda3 100755
--- a/cime/utils/python/CIME/aprun.py
+++ b/cime/utils/python/CIME/aprun.py
@@ -64,7 +64,7 @@ def _get_aprun_cmd_for_case_impl(ntasks, nthreads, rootpes, pstrids,
 
     # Compute task and thread settings for batch commands
     tasks_per_node, task_count, thread_count, max_thread_count, total_node_count, aprun = \
-        0, 1, maxt[0], maxt[0], 0, "aprun"
+        0, 1, maxt[0], maxt[0], 0, ""
     for c1 in xrange(1, total_tasks):
         if maxt[c1] != thread_count:
             tasks_per_node = min(pes_per_node, max_tasks_per_node / thread_count)
diff --git a/cime/utils/python/CIME/case.py b/cime/utils/python/CIME/case.py
index 33d589e..b078349 100644
--- a/cime/utils/python/CIME/case.py
+++ b/cime/utils/python/CIME/case.py
@@ -1098,8 +1098,8 @@ class Case(object):
         executable, args = env_mach_specific.get_mpirun(self, mpi_attribs, job=job)
 
         # special case for aprun
-        if executable == "aprun":
-            return get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
+        if "aprun" in executable:
+            return executable + " " + get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
         else:
             mpi_arg_string = " ".join(args.values())
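
The effect of the two hunks together can be illustrated outside CIME. With the aprun.py change, the generated string no longer begins with the word "aprun", so the case.py change can prepend whatever is in the executable field. The sketch below is a simplified stand-in, not CIME code: assemble_run_command is a hypothetical name, the aprun arguments and log suffix are made-up sample values, and it normalizes whitespace where the real code concatenates strings directly.

```python
def assemble_run_command(executable, aprun_args, run_misc_suffix):
    """Simplified stand-in for the patched aprun branch in case.py.

    executable      -- contents of the <executable> field, e.g. "ddt --connect aprun"
    aprun_args      -- what the patched aprun.py would return: the arguments and
                       the run executable, WITHOUT a leading "aprun"
    run_misc_suffix -- optional redirection suffix (may be empty)
    """
    if "aprun" not in executable:
        raise ValueError("this sketch only covers the aprun branch")
    # Join the non-empty pieces with single spaces.
    parts = (executable, aprun_args, run_misc_suffix)
    return " ".join(p.strip() for p in parts if p.strip())


# Illustrative values only; real arguments depend on the case's PE layout.
sample_args = "-n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger/bld/acme.exe"

# Plain aprun still works because the executable field supplies the word "aprun".
print(assemble_run_command("aprun", sample_args, ">> acme.log 2>&1"))

# With DDT reverse connect, the debugger wraps the same aprun invocation.
print(assemble_run_command("ddt --connect aprun", sample_args, ""))
```

Note that the substring test also matches any other wrapper whose command line contains "aprun", which is what makes the same mechanism usable for other debuggers.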


GitHub hash

The commit hash I tested the above steps with is 8d4835459cf84ce4311ae63edb2f6f0560214bb0