#22 Lightweight threading and vectorization with OpenMP in ACME

1.Poster Title	Lightweight threading and vectorization with OpenMP in ACME
2.Authors	Azamat Mametjanov, Robert Jacob, Mark Taylor
3.Group	Performance
4.Experiment	Watercycle
5.Poster Category	Early results
6.Submission Type	Poster
7.Poster Link	Lightweight_ACME
8.Lightning Talk Slide	Lightweight LT

Abstract

Next-generation HPC machines Cori II (October 2016) and Aurora (2018) are expected to provide 3-5x more cores per node with Knights Landing (KNL) and Knights Hill (KNH) Xeon Phi many-core chips. To fully utilize ~60 cores, each with 8-double-wide vector units, applications must expose more fine-grained parallelism through threading and vectorization. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, significant source code refactoring and fine-grained parallelization are needed to achieve further performance gains. ACME is moving towards this goal with OpenMP-based parallelization in atmosphere dynamics. Nested loops are threaded with OpenMP parallel-do regions, and innermost loops are vectorized with OpenMP SIMD directives. Additional optimizations such as loop reordering, unrolling, and fusion are introduced to extract peak performance from computational hot-spots. In addition, the lightweight threading runtime BOLT is expected to significantly reduce threading overheads. We present initial results of nested threading in ACME coupled runs on Cori I and Mira, along with BOLT-enabled runs of the HOMME transport_se mini-app on early access hardware. Substantial performance gains with lightweight nested threading support continued expansion of fine-grained parallelism in compute-intensive kernels.
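
The threading and vectorization pattern described above can be illustrated with a minimal sketch. This is not ACME or HOMME code: ACME expresses the pattern with Fortran `!$omp parallel do` and `!$omp simd` directives, and the C analogue below is given only to show the structure. The function name `scaled_add`, the array sizes, and the loop body are hypothetical. The two outer loops are shared among threads, and the innermost loop is marked for vectorization.

```c
#include <stddef.h>

/* Hypothetical sizes chosen for illustration only:
   elements, vertical levels, and points per level. */
enum { NELEM = 4, NLEV = 8, NP = 16 };

/* Computes y = y + a * x over a flattened 3D field.
   The outer two loops are threaded; the innermost loop is vectorized.
   Without OpenMP enabled, the pragmas are ignored and the loops run serially. */
void scaled_add(double a, const double *restrict x, double *restrict y)
{
    #pragma omp parallel for collapse(2)
    for (int ie = 0; ie < NELEM; ie++) {
        for (int k = 0; k < NLEV; k++) {
            /* Offset of the contiguous innermost row for (ie, k). */
            const size_t base = ((size_t)ie * NLEV + (size_t)k) * NP;

            #pragma omp simd
            for (int i = 0; i < NP; i++) {
                y[base + i] += a * x[base + i];
            }
        }
    }
}
```

With GCC or Clang, the directives take effect when the file is compiled with `-fopenmp`. Keeping the innermost loop over contiguous memory is what lets the SIMD directive map cleanly onto wide vector units. The sketch uses `collapse(2)` on a single parallel region as a simplification; it does not reproduce the nested parallel regions used in the runs reported here.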