Page Properties
1. Poster Title: Lightweight threading and vectorization with OpenMP in ACME
2. Authors: Azamat Mametjanov, Robert Jacob, Mark Taylor
3. Group: Performance
4. Experiment: Watercycle
5. Poster Category: Early results
6. Submission Type: Poster
7. Poster Link: Lightweight_ACME
8. Lightning Talk Slide: Lightweight LT

     

File: 22 Lightweight_ACME.pdf
Page: 2016-06-07 ACME Project All-Hands Meeting Posters

Abstract

Next-generation HPC machines Cori II (October 2016) and Aurora (2018) are expected to provide 3-5x more cores per node through KNL and KNH Xeon Phi many-core chips. To fully utilize ~60 cores, each with 8-double-wide vector units, applications must expose more fine-grained parallelism through threading and vectorization. While the new on-package high-bandwidth memory is expected to provide a 5x speedup to existing executables, further performance gains require significant source code refactoring and fine-grained parallelization. ACME is moving toward this goal with OpenMP-based parallelization in atmosphere dynamics: nested loops are threaded with OpenMP parallel-do regions, and innermost loops are vectorized with OpenMP SIMD directives. Additional optimizations such as loop reordering, unrolling, and fusion are introduced to extract peak performance from computational hot spots. In addition, the lightweight threading runtime BOLT is expected to significantly reduce threading overheads. We present initial results of nested threading in ACME coupled runs on Cori I and Mira, along with BOLT-enabled runs of the HOMME transport_se mini-app on early-access hardware. Substantial performance gains with lightweight nested threading argue for continued expansion of fine-grained parallelism in compute-intensive kernels.
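The threading and vectorization pattern named in the abstract can be illustrated with a small sketch. The example below is written in C for brevity; the "parallel-do" wording in the abstract corresponds to the Fortran forms `!$omp parallel do` and `!$omp simd`. The kernel `update_tracer`, the array layout, and the sizes `NLEV` and `NPTS` are hypothetical and are not taken from ACME or HOMME. They only show one way an outer loop over elements, a nested loop over levels, and a vectorized innermost loop might be arranged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration. */
#define NLEV 8   /* vertical levels per element */
#define NPTS 16  /* grid points per level (a multiple of the 8-double vector width) */

/*
 * Hypothetical kernel: q += dt * tend for every element, level, and point.
 * Arrays are laid out as [element][level][point], with points contiguous.
 */
void update_tracer(int nelem, double dt,
                   const double *restrict tend, double *restrict q)
{
    assert(nelem >= 0 && tend != NULL && q != NULL);

    /* Outer threading level: one team of threads shares the elements. */
    #pragma omp parallel for schedule(static)
    for (int ie = 0; ie < nelem; ie++) {
        /* Nested threading level: each outer thread may start an inner
         * team over levels, if nested parallelism is enabled at run time. */
        #pragma omp parallel for schedule(static)
        for (int k = 0; k < NLEV; k++) {
            const int base = (ie * NLEV + k) * NPTS;
            /* Innermost loop: unit stride and no loop-carried dependence,
             * so it is marked for vectorization. */
            #pragma omp simd
            for (int i = 0; i < NPTS; i++) {
                q[base + i] += dt * tend[base + i];
            }
        }
    }
}
```

With GCC or Clang, the directives take effect when the file is compiled with `-fopenmp`; without that flag they are ignored and the loops run serially with the same result. Inner parallel regions are typically serialized unless nesting is enabled, for example by setting the environment variable `OMP_MAX_ACTIVE_LEVELS=2`. Starting an inner team for every element is the sort of threading overhead that the abstract expects a lightweight runtime such as BOLT to reduce.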