...

Vectorization is a little harder: it relies on uncoupled data, all undergoing the same operations, being laid out in memory the same way as our vectors. The variables held at the Gauss points of different elements in Homme's dycore are a good example, as they rarely depend on each other. This suggests putting the element index last, which allows direct reads from memory into vector registers. That can conflict with the cache locality requirement, especially when threading over elements or using MPI, so an alternative is to "tile" or "stripe" our data with values from different elements (levels might also work).
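The tiling idea can be sketched in Fortran as follows. This is only an illustration under stated assumptions, not Homme's actual data structures: the names `flat`, `tiled`, `np`, `nelem` and `vlen` are hypothetical, and the vector length of 8 is an arbitrary choice. Note that Fortran arrays are column-major, so the contiguous index is the first one; that is the position that plays the role of the "last" index in a row-major language such as C or C++.

```fortran
program layout_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: np = 4               ! Gauss points per direction (assumed)
  integer, parameter :: nelem = 64           ! number of elements (assumed)
  integer, parameter :: vlen = 8             ! hypothetical vector length, in reals
  integer, parameter :: ntile = nelem / vlen ! nelem is a multiple of vlen here
  real(dp) :: flat(nelem, np, np)            ! element index contiguous over ALL elements
  real(dp) :: tiled(vlen, np, np, ntile)     ! element index contiguous within a tile
  integer :: ie, i, j, it, iv

  ! Fill the flat layout with distinct values.
  do j = 1, np
    do i = 1, np
      do ie = 1, nelem
        flat(ie, i, j) = real(ie + 100 * i + 10000 * j, dp)
      end do
    end do
  end do

  ! Copy into the tiled layout: each tile holds vlen consecutive elements.
  do it = 1, ntile
    do j = 1, np
      do i = 1, np
        do iv = 1, vlen
          tiled(iv, i, j, it) = flat((it - 1) * vlen + iv, i, j)
        end do
      end do
    end do
  end do

  ! Apply the same operation to every element. The innermost loop is
  ! unit-stride over vlen independent values, and all Gauss points of one
  ! tile stay close together in memory.
  do it = 1, ntile
    do j = 1, np
      do i = 1, np
        do iv = 1, vlen
          tiled(iv, i, j, it) = 2.0_dp * tiled(iv, i, j, it)
        end do
      end do
    end do
  end do

  ! Check against the same operation on the flat layout.
  flat = 2.0_dp * flat
  do it = 1, ntile
    do iv = 1, vlen
      if (any(tiled(iv, :, :, it) /= flat((it - 1) * vlen + iv, :, :))) then
        error stop 'layouts disagree'
      end if
    end do
  end do
  print *, 'tiled and flat layouts agree'
end program layout_sketch
```

With the flat layout, the data for one element is spread across the whole array, which hurts locality when a thread owns a subset of elements. The tiled layout keeps a thread's working set compact while still giving the compiler a unit-stride inner loop. Whether the compiler actually vectorizes that loop should be checked with its optimization report or by measuring.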

...

Do and don't: register spilling
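! Don't do this: value stays live across unrelated code and may spill from registers
value = a + b ** 2 + c * d
! ... lots of code not using value ...
result = value + other_stuff

! Do this: compute value right before it is used
! ... lots of code not using value ...
value = a + b ** 2 + c * d
result = value + other_stuff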

Assembly

People familiar with assembly might consider inspecting the output produced by different compilers. A convenient way to do that is to strip the code you're interested in of external dependencies and put it through a tool like Compiler Explorer (note that Compiler Explorer doesn't currently officially support Fortran, though according to this GitHub issue you can enable it on GCC with the flag "-x f95"). Of course, reading the assembly can be misleading, so unless it is obvious which version is better, always measure!

Floating Point Issues

There are a couple of floating point issues to be aware of when coding. First, floating point division is far more expensive than floating point multiplication. If you divide repeatedly by the same value, compute the reciprocal of that value once and multiply by it inside loops. The exception is when the division happens only infrequently, at scattered places in the code; in that case the likelihood of a cache miss (when fetching the stored reciprocal) outweighs the gain from multiplication being faster than division. For instance:
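A minimal sketch of hoisting the reciprocal out of a loop is given here as an illustration. The names `dx`, `inv_dx`, `y_div` and `y_mul` are hypothetical and the values are arbitrary. The final check reflects the fact that multiplying by a reciprocal is not guaranteed to give bit-identical results to dividing, so the two are compared to within a few units of machine epsilon.

```fortran
program reciprocal_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000
  real(dp) :: x(n), y_div(n), y_mul(n)
  real(dp) :: dx, inv_dx, max_rel_diff
  integer :: i

  dx = 0.3_dp
  do i = 1, n
    x(i) = real(i, dp)
  end do

  ! Slower form: one division per iteration.
  do i = 1, n
    y_div(i) = x(i) / dx
  end do

  ! Faster form: one division in total, then a multiplication per iteration.
  inv_dx = 1.0_dp / dx
  do i = 1, n
    y_mul(i) = x(i) * inv_dx
  end do

  ! The results can differ by rounding, so compare with a tolerance.
  max_rel_diff = maxval(abs(y_div - y_mul) / abs(y_div))
  if (max_rel_diff > 10.0_dp * epsilon(1.0_dp)) error stop 'results differ too much'
  print *, 'max relative difference:', max_rel_diff
end program reciprocal_sketch
```

Because the rewrite changes rounding, compilers generally do not make this substitution on their own unless relaxed floating point options are enabled, so it usually has to be done by hand where the small difference is acceptable.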

...