
...

Minimize entries into and exits from parallel regions. Setting OMP_WAIT_POLICY=ACTIVE can mitigate the overhead, since waiting threads then spin rather than sleep between regions, but it is more robust to make parallel regions as long as possible. One caveat: placing parallel regions very high in the call stack can create a large memory footprint (replication of temporaries, etc.) and can obscure the fact that developers are coding inside a parallel region, which leads to errors. Wrapping every loop in OMP PARALLEL DO creates too many entries and exits, while wrapping the entire time step loop in OMP PARALLEL results in large memory use and fragile code development. The optimum lies somewhere between these extremes and takes some experimentation to find.
Thread over nested loops using the collapse clause or explicit division/modulo index arithmetic (the x86 integer divide instruction returns the quotient and remainder together, so this looks more expensive than it is).
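
The two recommendations above can be sketched in C as follows; the Fortran directives are analogous (`!$omp parallel`, `!$omp do`, `!$omp parallel do collapse(2)`). This is a minimal illustration under stated assumptions, not code from any particular model: the function names, array layout, and loop bodies are all hypothetical. The first function enters one parallel region and runs two worksharing loops inside it instead of paying for two separate entries and exits. The other two functions thread a doubly nested loop, once with `collapse(2)` and once by flattening the index by hand.

```c
#include <assert.h>

/* Hypothetical kernel: two dependent loops share ONE parallel region,
 * so the thread team is entered and exited once rather than twice. */
void step_one_region(double *a, double *b, int n)
{
    assert(n > 0);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * i;
        /* The implicit barrier at the end of the loop above guarantees
         * that a[] is complete before any thread reads it below. */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + a[n - 1 - i];
    }
}

/* Thread over both loop levels with the collapse clause. */
void fill_collapse(double *grid, int ni, int nj)
{
    assert(ni > 0 && nj > 0);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < ni; ++i)
        for (int j = 0; j < nj; ++j)
            grid[i * nj + j] = 10.0 * i + j;
}

/* Same iteration space, flattened by hand: recover (i, j) from the
 * single index k with one division and one modulo.
 * Assumes ni * nj fits in an int. */
void fill_divmod(double *grid, int ni, int nj)
{
    assert(ni > 0 && nj > 0);
    #pragma omp parallel for
    for (int k = 0; k < ni * nj; ++k) {
        int i = k / nj;   /* outer index */
        int j = k % nj;   /* inner index */
        grid[k] = 10.0 * i + j;
    }
}
```

Both fill routines write the same values, so either can stand in for the other. Built without OpenMP support, the pragmas are ignored and the code runs serially with identical results, which is a convenient way to check correctness before tuning.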

...
