#22 Lightweight threading and vectorization with OpenMP in ACME

1. Poster Title: Lightweight threading and vectorization with OpenMP in ACME
2. Authors: Azamat Mametjanov, Robert Jacob, Mark Taylor
3. Group: Performance
4. Experiment: Watercycle
5. Poster Category: Early results
6. Submission Type: Poster
7. Poster Link: Lightweight_ACME
8. Lightning Talk Slide: Lightweight LT

    

Abstract

Next-generation HPC machines Cori II (October 2016) and Aurora (2018) are expected to provide 3-5x more cores per node with Knights Landing (KNL) and Knights Hill (KNH) Xeon Phi many-core chips. To fully utilize roughly 60 cores, each with vector units eight doubles (512 bits) wide, applications must expose more fine-grained parallelism through threading and vectorization. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source code refactoring and fine-grained parallelization are needed to achieve further performance gains. ACME is moving toward this goal with OpenMP-based parallelization in atmosphere dynamics: nested loops are threaded with OpenMP parallel-do regions, and innermost loops are vectorized with OpenMP SIMD directives. Additional optimizations such as loop reordering, unrolling, and fusion are introduced to extract peak performance from computational hot spots. In addition, the lightweight threading runtime BOLT is expected to significantly reduce threading overheads. We present initial results of nested threading in ACME coupled runs on Cori I and Mira, along with BOLT-enabled runs of the HOMME transport_se mini-app on early-access hardware. Substantial performance gains with lightweight nested threading support continued expansion of fine-grained parallelism in compute-intensive kernels.
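
The combination of threaded outer loops and vectorized innermost loops described in the abstract can be illustrated with a minimal sketch. It is written in C for brevity, whereas the parallel-do regions mentioned above are the Fortran form of the same OpenMP construct. The function name update_tracer, the array layout, and the sizes NLEV and NPTS are hypothetical and chosen only for illustration; this is not code from ACME or HOMME.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { NLEV = 4, NPTS = 16 };

/* Advance a field by one forward-Euler step: q += dt * tend.
   Both arrays are flattened as [element][level][point]. */
void update_tracer(int nelem, double *q, const double *tend, double dt)
{
    assert(q != NULL && tend != NULL && nelem >= 0);

    /* Outer loop over elements is shared among OpenMP threads. */
    #pragma omp parallel for schedule(static)
    for (int ie = 0; ie < nelem; ++ie) {
        for (int k = 0; k < NLEV; ++k) {
            /* Contiguous row of NPTS points for this element and level. */
            double *qk = q + ((size_t)ie * NLEV + k) * NPTS;
            const double *tk = tend + ((size_t)ie * NLEV + k) * NPTS;

            /* Innermost loop is marked for vectorization. */
            #pragma omp simd
            for (int i = 0; i < NPTS; ++i) {
                qk[i] += dt * tk[i];
            }
        }
    }
}
```

With GCC or Clang, the pragmas take effect when the file is compiled with `-fopenmp`; without that flag they are ignored and the same code runs serially, which makes it easy to compare results. The unit-stride innermost loop is the design choice that matters here: it gives the compiler a contiguous range of doubles to pack into vector registers, while the independent outer iterations are split across threads.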
