Autotuning Memory-intensive Software for Node Architectures
Master of Science
Today, scientific computing plays an important role in scientific research. Supercomputers are built to support the computational needs of large-scale scientific applications. Achieving high performance on today's supercomputers is difficult, in large part due to the complexity of node architectures, which include wide-issue instruction-level parallelism, SIMD operations, multiple cores, multiple hardware threads per core, and a deep memory hierarchy. In addition, the growth of compute performance has outpaced the growth of memory bandwidth, making memory bandwidth a scarce resource. Various optimization methods, including tiling and prefetching, have been proposed to make better use of the memory hierarchy. However, due to architectural differences, code hand-tuned for one architecture is not necessarily efficient on others. For that reason, autotuning is often used to tailor high-performance code to different architectures. Common practice is to develop a parametric code generator that produces code according to different optimization parameters and then to pick the best among the resulting implementation alternatives for a given architecture. In this thesis, we use tensor transposition, a generalization of matrix transposition, as a motivating example to study the problem of autotuning memory-intensive codes for complex memory hierarchies. We developed a framework that produces optimized parallel tensor transposition code for node architectures. The framework has two components: a rule-based code generation and transformation system that generates code according to specified optimization parameters, and an autotuner that combines static analysis with empirical autotuning to pick the best implementation scheme. In this work, we studied how to prune the autotuning search space and how to perform run-time code selection using hardware performance counters.
Despite the complex memory access patterns of tensor transposition, experiments on two very different architectures show that our approach achieves more than 80% of the bandwidth of optimized memory copies when transposing most tensors. Our results show that autotuning is key to achieving peak performance across different node architectures for memory-intensive codes.
autotuning; memory-intensive software; tensor transposition; hardware performance counters; memory hierarchy