
This seems to be relevant only to PGI: in procedures, assumed-shape dummy arguments like 'real :: x(:,:)' need to be stride-1 in the leftmost dimension (contiguous in the first dimension). Using pointers may therefore create overhead.

Vectorization

Vector units apply the same operation on multiple pieces of data at a time, or Single Instruction Multiple Data (SIMD). Using vector units is crucial for efficiency on all modern processors. If you want the compiler to automatically vectorize your loops, you need to make it easy for the compiler to know that the actions of that loop are indeed data parallel. The main inhibitors of vectorization for a loop are having a loop bound that isn't a simple integer, having a function call within the loop (that needs to be inlined), using print statements, and having if-statements. It's best for CPUs and KNLs if you can get if-statements out of the innermost loops altogether when possible. GPUs are actually very good at handling these by virtue of the fact that the moment you're on a GPU you're inherently already vectorized (the question now is just how efficiently).

There are cases when you'll have what's called a loop-carried dependence. The most common examples are prefix-sum (accumulation) operations like the one below:

Code Block
titleLoop-Carried Dependence
linenumberstrue
do i = 1 , n
  array(i) = array(i-1) + dat1(i)**b
  dat2(i) = array(i)**c
enddo

...

Array Layout

There are two things to consider when deciding on the layout of an array: cache and vectorization.

Modern CPU caches have line widths of 64 consecutive bytes (an array of 8 doubles), which is the size of a single read (Nvidia GPUs read 128 consecutive bytes). To maximize cache performance, you should use an entire cache line before it's evicted (this can be hard to measure, but is usually easy to arrange), and you should help the hardware prefetcher by accessing memory in a linear fashion. This translates to the well-known rule of laying out memory so that consecutive quantities in memory correspond to the fastest index of your loops.

Vectorization is a little harder: it relies on uncoupled data that we are doing the same things with being laid out in the same way that our vector registers are. The variables maintained at the Gauss points of different elements in Homme's dycore are good examples of this, as they are rarely dependent on each other. This suggests putting the element index last, as this allows direct reads from memory to vector registers. This can conflict with the cache locality requirement, especially when threading over elements or using MPI, so an alternative is to "tile" or "stripe" our data with values from different elements (levels might also work).



Returning to the loop-carried dependence example above: that accumulation loop cannot do SIMD parallelism over the i-loop. The best solution is to precompute all of the main computations (like the exponentiation shown) into a temporary array in a SIMD manner, do the prefix sum in serial with as little work as possible, and then return to parallel work from there. This way the serial work is isolated to the smallest workload possible. This is shown below:

Code Block
titleLoop-Carried Dependence
linenumberstrue
do i = 1 , n
  tmp(i) = dat1(i)**b
enddo
do i = 1 , n
  array(i) = array(i-1) + tmp(i)
enddo
do i = 1 , n
  dat2(i) = array(i)**c
enddo

What this will do is SIMD vectorize the first and third loops but not the center loop. But the center loop now has very little work to do in serial.

Whenever there is something that will inhibit vectorization, it's often best to fission the loop and isolate that operation by itself, and group the majority of the work together in loops that have work that all vectorizes.

Miscellaneous

Array Allocation

Modern languages can allocate memory for arrays on the "heap" and the "stack".  Global arrays (arrays with the "save" attribute, or arrays defined in a module outside of a subroutine) are allocated on the heap when the model initializes, and these exist (taking up memory) for the entire simulation.  Arrays can also be allocated while the code is running with the allocate() and deallocate() statements. These arrays are likewise placed on the heap, but the allocate() statement requires a system call and takes significant time (tens of thousands of instructions) as the system looks for available memory in the heap.  Allocate can be even slower when running in a threaded region, as it usually requires thread synchronization.  The allocate statement should never be used inside frequently called subroutines.

Automatic arrays on the stack:  The most efficient way to create temporary arrays is to use local arrays declared in the subroutine.  These arrays will be placed on the stack (although sometimes compiler options are needed to force larger automatic arrays onto the stack).  Creating an array on the stack is nearly instant, as it just requires incrementing the stack pointer.  One drawback of stack variables is that each OpenMP thread must have its own stack, and this memory must be set aside ahead of time with the OMP_STACKSIZE environment variable.  Make it too small and the code will segfault when it runs out of stack memory; make it too large and reserving all of that stack space when the code starts can leave insufficient memory for the rest of the model.

Stack arrays created inside a threaded region will be thread-private.  Stack arrays created outside a threaded region can be shared by threads started later in the subroutine, and marking such arrays "private" in OpenMP directives will create a separate copy of the array for each thread, which may or may not be what you want.

Heap arrays are in the global memory space and accessible by all threads.  Sharing these arrays across threads requires synchronization if threads will be changing them independently.  Marking these arrays as private in an OpenMP directive should never be done.

In C++, the stack and heap work the same as in Fortran; though it might be more difficult to tell when something uses the heap or the stack (I don't know enough Fortran to compare). Any data pointer initialized with 'malloc()' or 'new' (for C++, use new, as it will call constructors) will point to the heap. These should be deallocated with 'free()' or 'delete', respectively. C++ containers, such as list, vector, set, and map, will use the heap to store their data. Kokkos views will also be allocated on the heap unless initialized with a pointer to the stack. Note that initializing a view with a pointer to an array local to a kernel is broken in CUDA and should not be done.

Recommendation:

  1. Allocate dynamic arrays only at a high level and infrequently.
  2. Use stack arrays for all subroutine array temporaries.
  3. Understand the size of these arrays in order to have a good estimate of the needed stack size.
  4. See below for similar problems with array slicing causing hidden allocations and array copies.

Array Slicing

In my opinion, there are very few cases when array slicing is wise to do. I know it's a convenient feature of Fortran, but it's the cause of some of our worst performance degradations. The only time I think array slicing is OK is when you're moving a contiguous chunk of data, not computing things. For instance:

Code Block
titledo and don't array slicing
linenumberstrue
!Not wise because it's computing
a(:,i) = b(:,i) * c(:,i) + d(:)
g(:,i) = a(:,i) ** f(:)

!This is OK because it's moving a contiguous chunk of data
a(:,i) = b(:,i)

The problem is that when you have multiple successive array-slice notations, what this technically means is to put a loop around that one line by itself. For instance, the first two instructions above would translate into:

Code Block
titledo and don't array slicing
linenumberstrue
do ii = 1 , n
  a(ii,i) = b(ii,i) * c(ii,i) + d(ii)
enddo
do ii = 1 , n
  g(ii,i) = a(ii,i) ** f(ii)
enddo

This is a problem because these two loops would be more efficient if they were fused together into the same loop, since the value a(ii,i) is reused. However, as the code runs through the first loop, a(1,i) from the first iteration is likely already kicked out of cache by the time it's needed again by the second loop. Sometimes compilers will automatically fuse these loops together, and sometimes they will not. To ensure they perform well, you should have explicitly coded:

Code Block
titledo and don't array slicing
linenumberstrue
do ii = 1 , n
  a(ii,i) = b(ii,i) * c(ii,i) + d(ii)
  g(ii,i) = a(ii,i) ** f(ii)
enddo


Array Temporaries

But the potential caching problems of array slicing are nothing by comparison to the infamous "array temporary," which most Fortran compilers will tell you about explicitly when you turn debugging on. Take the following example for instance:

Code Block
titledo and don't array slicing
linenumberstrue
do j = 1 , n
  do i = 1 , n
    call my_subroutine( a(i,:,j) , b )
  enddo
enddo

Because the array slice you passed to the subroutine is not contiguous, what the compiler internally does is create an "array temporary." The problem is that it nearly always allocates this array every time this function is called, copies the data, passes the allocated array to the subroutine, and deallocates afterward. It's the allocation during runtime for every iteration of that loop that degrades performance so badly. The best way to avoid this, again, is simply not to array slice. The better option is:

Code Block
titledo and don't array slicing
linenumberstrue
do j = 1 , n
  do i = 1 , n
    do k = 1 , n2
      tmp(k) = a(i,k,j)
    enddo
    call my_subroutine( tmp , b )
  enddo
enddo

The reason this is more efficient is because you're not allocating "tmp" during runtime. Rather, you declare it as an automatic Fortran array, which most compilers have an option to place on the stack for efficiency.

Register Spilling

Register spilling occurs when the compiler cannot keep all of the relevant local variables stored in registers and must "spill" one or more onto the stack until they are needed again. To reduce register spilling, minimize the scope of variables and the distance between their uses. Note this is also considered good code design, as you're less likely to incorrectly use a variable in code that doesn't need it (https://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ref=pd_bxgy_14_img_3/134-1054583-1071544?_encoding=UTF8&psc=1&refRID=VT7CJN8DXPJ9EMTFRXY3).

Code Block
titledo and don't register spilling
linenumberstrue
! Don't do this
value = a + b ** 2 + c * d
... lots of code not using value ...
result = value + other_stuff


! Do this
... lots of code not using value ...
value = a + b ** 2 + c * d
result = value + other_stuff

Assembly

People familiar with assembly might consider inspecting the various versions produced by different compilers. A nice way to do that is to strip the code you're interested in of its external dependencies and put it through a tool like Compiler Explorer (note that Compiler Explorer doesn't currently officially support Fortran, though according to this github issue you can enable it on GCC with the flag "-x f95"). Of course, reading the assembly can be misleading, so if it's not extremely obvious which version is better, always measure!

Floating Point Issues

There are a couple of floating point issues to be aware of when coding. First, floating point division is far more expensive than floating point multiplication. What this means is that if you have a repeated division by a given value, you should compute the reciprocal of that value and multiply by the reciprocal inside loops. The exception to this is when you do this division fairly infrequently at different places in the code, in which case the likelihood of a cache miss outweighs the gain of multiplication being faster than division. For instance:

...