...

Now, you don't have to worry about this if you use CUDA managed memory, but in my experience, for small kernel runtimes (which we have), turning on managed memory adds significant overhead to kernel launch times. Therefore, managed memory is not an option for us. You also have to be careful in how you use it. First, you can't alternate work between the CPU and the GPU; the data needs to stay on the GPU continuously to keep from thrashing it back and forth. Also, if you have CUDA Fortran available, you can use the cudaMemPrefetchAsync() routine to pre-pin the memory on the GPU, which improves performance significantly. Summit does have a unified memory address space, and while this is nice, we shouldn't rely on it, because our code would then not be portable to most other machines, which don't have this feature. So I'm afraid it makes the most sense to manage data ourselves for now. It's cumbersome, but in my opinion it's necessary.
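
For reference, a minimal sketch of that prefetch in CUDA Fortran is below. The array name state and its size are hypothetical, and the count argument follows the CUDA Fortran convention of element counts for typed arrays; check your compiler's documentation to confirm.

    program prefetch_demo
      use cudafor
      implicit none
      real, managed, allocatable :: state(:)
      integer :: istat, dev, i
      integer(kind=cuda_stream_kind) :: stream

      allocate(state(1024*1024))
      state = 0.0
      stream = 0

      istat = cudaGetDevice(dev)
      ! Pre-pin the managed pages on the current GPU before launching work,
      ! so the kernel doesn't page-fault the data over on first touch.
      istat = cudaMemPrefetchAsync(state, size(state), dev, stream)

      ! A trivial kernel touching the prefetched data
      !$cuf kernel do <<<*,*>>>
      do i = 1, size(state)
        state(i) = state(i) + 1.0
      end do

      istat = cudaDeviceSynchronize()
    end program prefetch_demo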

Managing Small Kernels and Memory Allocations

There is one final annoyance that's important to consider. In climate, since we strong-scale significantly, we often end up in situations where we simply don't have much threading to work with. This is a big problem because it leads to small kernel runtimes, and kernels have significant overhead to launch on the GPU. The launch overhead is often 5-10 microseconds, yet many of our kernels take less than 10 microseconds to complete! Therefore, we need a way to hide launch overheads. Further, we often have to allocate data on the GPU within subroutines. The CPU uses a pool allocator to manage allocations in main memory, so allocation there is pretty cheap. However, cudaMalloc() on the GPU is very expensive (especially since our kernels take so little time). Therefore, we also need a way to hide the cost of GPU memory allocations. The PGI compiler has a pool allocator for managed memory to reduce the cost of allocations (which are much more expensive in that context), but we don't use managed memory.
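
One way to sidestep repeated device allocations is to allocate scratch space once up front and reuse it across calls. Below is a minimal sketch of that pattern in OpenACC Fortran; the module and variable names are hypothetical, and this is a simple reusable buffer rather than a full pool allocator.

    module scratch_pool
      implicit none
      real, allocatable :: scratch(:)
    contains
      subroutine init_pool(n)
        integer, intent(in) :: n
        allocate(scratch(n))
        ! One-time device allocation; persists across subroutine calls
        !$acc enter data create(scratch)
      end subroutine init_pool

      subroutine compute(n)
        integer, intent(in) :: n
        integer :: i
        ! Reuses the device buffer; no cudaMalloc() on this call path
        !$acc parallel loop present(scratch)
        do i = 1, n
          scratch(i) = real(i)
        end do
      end subroutine compute

      subroutine free_pool()
        !$acc exit data delete(scratch)
        deallocate(scratch)
      end subroutine free_pool
    end module scratch_pool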

The way to handle these situations is to use asynchronous kernel launches. This launches all the kernels from the host at once, so that the launch overheads are overlapped with kernel execution. While many of our kernels are small, some are large, and this way the larger kernels overlap with both the launch overheads of the smaller kernels and the GPU memory allocations. This is done with the "async(stream)" clause in OpenACC, where operations within a stream execute in order, while operations in different streams may run in parallel. You specify a stream with an integer value. In OpenMP 4.5, the situation is more complicated, and I'm not going to cover it here. The unified directives approach described here might make asynchronicity easier in OpenMP, especially when transitioning from OpenACC. An example of OpenACC streams is below:
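
In the sketch below, the array names a and b and the enter/exit data directives are assumptions for illustration; it assumes the arrays are not already resident on the GPU.

    subroutine run_async(a, b, n)
      implicit none
      integer, intent(in) :: n
      real, intent(inout) :: a(n), b(n)
      integer :: i

      !$acc enter data copyin(a, b)

      ! Queued on stream 1; the host returns immediately after launch
      !$acc parallel loop present(a) async(1)
      do i = 1, n
        a(i) = 2.0 * a(i)
      end do

      ! Queued on stream 2; may overlap with the stream 1 kernel
      !$acc parallel loop present(b) async(2)
      do i = 1, n
        b(i) = b(i) + 1.0
      end do

      ! Block the host until all async queues have drained
      !$acc wait

      !$acc exit data copyout(a, b)
    end subroutine run_async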

...