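  1. Minimize entrances and exits to parallel regions. Each entry forks (or wakes) the thread team, and each exit ends in an implicit barrier, so short regions inside a loop pay that overhead on every iteration. Setting OMP_WAIT_POLICY=ACTIVE reduces the wake-up cost by keeping idle threads spinning instead of sleeping, at the price of burning CPU while they wait. It is more robust to make parallel regions as long as possible.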
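  2. Parallelize nested loops over their combined iteration space. Use either the collapse clause or a single flattened index from which the loop indices are recovered with division and modulo. On x86 one DIV/IDIV instruction produces both the quotient and the remainder, so the manual version is cheaper than it looks.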
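A minimal C sketch of tip 1. It assumes a hypothetical three-point smoothing workload; the function names and the workload are illustrative and do not come from the original page.

The first function opens two parallel regions on every step. The second opens one region for the whole computation. It relies on the implicit barrier at the end of each `omp for` to keep the passes ordered.

```c
#include <assert.h>

/* Slow pattern: two fresh parallel regions (fork/join + barrier) per step. */
void smooth_many_regions(double *a, double *tmp, int n, int steps)
{
    assert(a && tmp);
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i)
            a[i] = tmp[i];
    }
}

/* Preferred pattern: one parallel region for the whole computation.
 * Every thread runs the step loop; each `omp for` shares out its
 * iterations and ends with an implicit barrier, so passes stay ordered. */
void smooth_one_region(double *a, double *tmp, int n, int steps)
{
    assert(a && tmp);
    #pragma omp parallel
    {
        for (int s = 0; s < steps; ++s) {
            #pragma omp for
            for (int i = 1; i < n - 1; ++i)
                tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            /* implicit barrier: tmp is complete before anyone copies it */
            #pragma omp for
            for (int i = 1; i < n - 1; ++i)
                a[i] = tmp[i];
            /* implicit barrier: a is updated before the next step reads it */
        }
    }
}
```

Compile with `-fopenmp` (GCC/Clang). Without that flag the pragmas are ignored, and both functions run serially and produce the same results.

The OMP_WAIT_POLICY=ACTIVE alternative is set in the environment at launch, for example `OMP_WAIT_POLICY=ACTIVE ./a.out`. It needs no code changes.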
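A minimal C sketch of tip 2. It assumes a hypothetical kernel that fills an `n_rows` x `n_cols` row-major matrix; the names and the `r * 1000.0 + c` value are illustrative only.

Both versions hand the scheduler `n_rows * n_cols` iterations instead of `n_rows`. This helps when the outer trip count is small relative to the number of threads. The second version shows the manual flattening with `/` and `%`.

```c
#include <assert.h>

/* Option A: let OpenMP flatten the two perfectly nested loops. */
void fill_collapse(double *m, int n_rows, int n_cols)
{
    assert(m && n_rows >= 0 && n_cols >= 0);
    #pragma omp parallel for collapse(2)
    for (int r = 0; r < n_rows; ++r)
        for (int c = 0; c < n_cols; ++c)
            m[(long)r * n_cols + c] = r * 1000.0 + c;  /* illustrative value */
}

/* Option B: flatten by hand and recover (r, c) with division and modulo. */
void fill_divmod(double *m, int n_rows, int n_cols)
{
    assert(m && n_rows >= 0 && n_cols >= 0);
    const long total = (long)n_rows * n_cols;
    #pragma omp parallel for
    for (long k = 0; k < total; ++k) {
        const long r = k / n_cols;  /* compilers typically emit one DIV/IDIV */
        const long c = k % n_cols;  /* on x86 for this quotient/remainder pair */
        m[k] = r * 1000.0 + c;
    }
}
```

`collapse(2)` requires the loops to be perfectly nested with rectangular bounds. The manual form builds the same flattened iteration space explicitly.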