Heap arrays live in the global memory space and are accessible by all threads. Sharing these arrays across threads requires synchronization if threads will be changing them independently. These arrays should never be marked as private in an OpenMP directive.
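As a minimal sketch of the points above: the heap array stays shared and each iteration writes its own element, the shared accumulator goes through a reduction clause, and only a small fixed-size stack array is made private. This assumes a compiler with OpenMP enabled (without it, the directives are treated as comments and the loop runs serially). The module, function, and variable names are hypothetical and chosen only for illustration.

```fortran
module shared_heap_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Returns 1**2 + 2**2 + ... + n**2 (hypothetical example routine).
  function sum_of_squares(n) result(total)
    integer, intent(in) :: n
    real(dp) :: total
    real(dp), allocatable :: field(:)  ! heap array: allocated once, kept shared
    real(dp) :: work(2)                ! small stack array: safe to make private
    integer :: i

    allocate(field(n))
    total = 0.0_dp

    !$omp parallel do default(none) shared(field, n) private(i, work) reduction(+:total)
    do i = 1, n
      work(1) = real(i, dp)
      work(2) = work(1) * work(1)
      field(i) = work(2)               ! each iteration writes its own element: no race
      total = total + field(i)         ! shared accumulator: synchronized by the reduction
    enddo
    !$omp end parallel do

    deallocate(field)
  end function sum_of_squares
end module shared_heap_demo

program demo_shared_heap
  use shared_heap_demo
  implicit none
  ! The sum of squares from 1 to 10 is 385.
  if (abs(sum_of_squares(10) - 385.0_dp) > 1.0e-12_dp) error stop "unexpected sum"
  print *, "sum of squares:", sum_of_squares(10)
end program demo_shared_heap
```

If threads updated the same element of field instead of distinct ones, that update would need its own synchronization (for example an atomic or critical construct).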


In C++, the stack and heap work the same as in Fortran, though it might be more difficult to tell when something uses the heap or the stack (I don't know enough Fortran to compare). Any data pointer initialized with 'malloc()' or 'new' will point to the heap (in C++, prefer 'new', as it calls constructors). These should be deallocated with 'free()' or 'delete', respectively. C++ containers, such as list, vector, set, and map, use the heap to store their data. Kokkos views are also allocated on the heap unless initialized with a pointer to the stack. Note that initializing a view with a pointer to an array local to a kernel is broken in CUDA and should not be done.

Recommendation:

  1. Allocate dynamic arrays only at a high level and infrequently.
  2. Use stack arrays for all subroutine array temporaries.
  3. Understand the size of these arrays in order to have a good estimate of the needed stack size.
  4. See below for similar problems with array slicing causing hidden allocations and array copies.
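The second recommendation can be sketched as follows. This is only an illustration under stated assumptions: the module and the routines smooth_heap and smooth_stack are hypothetical, and whether the automatic array actually lands on the stack depends on the compiler and its options.

```fortran
module column_ops
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Avoid: the temporary is allocated on the heap and freed on every call.
  subroutine smooth_heap(n, col)
    integer, intent(in) :: n
    real(dp), intent(inout) :: col(n)
    real(dp), allocatable :: tmp(:)
    integer :: k
    allocate(tmp(n))
    do k = 1, n
      tmp(k) = col(k)
    enddo
    do k = 2, n - 1
      col(k) = 0.5_dp * (tmp(k-1) + tmp(k+1))
    enddo
    deallocate(tmp)
  end subroutine smooth_heap

  ! Prefer: tmp is an automatic array sized by the dummy argument n.
  subroutine smooth_stack(n, col)
    integer, intent(in) :: n
    real(dp), intent(inout) :: col(n)
    real(dp) :: tmp(n)
    integer :: k
    do k = 1, n
      tmp(k) = col(k)
    enddo
    do k = 2, n - 1
      col(k) = 0.5_dp * (tmp(k-1) + tmp(k+1))
    enddo
  end subroutine smooth_stack
end module column_ops

program demo_column_ops
  use column_ops
  implicit none
  real(dp) :: a(5), b(5)
  a = [1.0_dp, 4.0_dp, 9.0_dp, 16.0_dp, 25.0_dp]
  b = a
  call smooth_heap(5, a)
  call smooth_stack(5, b)
  ! Both versions compute the same result; only the storage of tmp differs.
  if (any(a /= b)) error stop "results differ"
  print *, b
end program demo_column_ops
```

In line with the third recommendation, tmp here needs roughly 8*n bytes of stack per call, so the stack limit (and the per-thread stack size when running with OpenMP threads) has to be large enough for the biggest n the routine will see.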

Array Slicing

In my opinion, there are very few cases when array slicing is wise to do. I know it's a convenient feature of Fortran, but it's the cause of some of our worst performance degradations. The only time I think array slicing is OK is when you're moving a contiguous chunk of data, not computing things. For instance:

Code Block: do and don't array slicing
!Not wise because it's computing
a(:,i) = b(:,i) * c(:,i) + d(:)
g(:,i) = a(:,i) ** f(:)

!This is OK because it's moving a contiguous chunk of data
a(:,i) = b(:,i)

The problem is that each array-slice statement technically means a loop around that one line by itself, so multiple successive array-slice statements become multiple separate loops. For instance, the first two instructions above would translate into:

Code Block: do and don't array slicing
do ii = 1 , n
  a(ii,i) = b(ii,i) * c(ii,i) + d(ii)
enddo
do ii = 1 , n
  g(ii,i) = a(ii,i) ** f(ii)
enddo

This is a problem because it would be more efficient to fuse these two loops into the same loop, since the value a(ii,i) is reused. As the code runs through the first loop, a(1,i) from the first iteration is likely already kicked out of cache by the time it's needed again by the second loop. Sometimes compilers will automatically fuse these loops, and sometimes they will not. To ensure they perform well, you should have explicitly coded:

Code Block: do and don't array slicing
do ii = 1 , n
  a(ii,i) = b(ii,i) * c(ii,i) + d(ii)
  g(ii,i) = a(ii,i) ** f(ii)
enddo

Array Temporaries

But the potential caching problems of array slicing are nothing by comparison to the infamous "array temporary," which most Fortran compilers will tell you about explicitly when you turn debugging on. Take the following example for instance:

Code Block: do and don't array slicing
do j = 1 , n
  do i = 1 , n
    call my_subroutine( a(i,:,j) , b )
  enddo
enddo

Because the array slice you passed to the subroutine is not contiguous, the compiler internally creates an "array temporary." The problem is that it nearly always allocates this array every time the subroutine is called, copies the data, passes the allocated array to the subroutine, and deallocates it afterward. It's the allocation during runtime for every iteration of that loop that degrades performance so badly. The best way to avoid this, again, is simply not to array slice. The better option is:

Code Block: do and don't array slicing
do j = 1 , n
  do i = 1 , n
    do k = 1 , n2
      tmp(k) = a(i,k,j)
    enddo
    call my_subroutine( tmp , b )
  enddo
enddo

This is more efficient because you're not allocating "tmp" during runtime. Rather, you declare it as an automatic Fortran array, which most compilers have an option to place on the stack for efficiency.

Register Spilling

Register spilling occurs when the compiler cannot keep all of the relevant local variables stored in registers and must "spill" one or more onto the stack until needed again. To reduce register spilling, minimize the scope of variables and the distance between their uses. Note this is also considered good code design, as you're less likely to incorrectly use a variable in code that doesn't need it (https://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ref=pd_bxgy_14_img_3/134-1054583-1071544?_encoding=UTF8&psc=1&refRID=VT7CJN8DXPJ9EMTFRXY3).

Code Block: register spilling
! Don't do this
value = a + b ** 2 + c * d
... lots of code not using value ...
result = value + other_stuff

! Do this
... lots of code not using value ...
value = a + b ** 2 + c * d
result = value + other_stuff

Floating Point Issues

There are a couple of floating-point issues to be aware of when coding. First, floating-point division is far more expensive than floating-point multiplication. This means that if you repeatedly divide by a given value, you should compute the reciprocal of that value once and multiply by the reciprocal inside loops. The exception is when the division happens fairly infrequently at different places in the code, in which case the likelihood of a cache miss when loading the stored reciprocal outweighs the gain of multiplication being faster than division. For instance:

...