Performance Analysis of Program Executions on Modern Parallel Architectures
Doctor of Philosophy
Parallel architectures have become common in supercomputers, data centers, and mobile chips. These architectures typically have complex features: many hardware threads, deep memory hierarchies, and non-uniform memory access (NUMA). Programs designed without careful consideration of these features may perform poorly on such architectures. First, multi-threaded programs can suffer performance degradation caused by imbalanced workloads, overuse of synchronization, and parallel overhead. Second, parallel programs may suffer from long latency to main memory. Third, in a NUMA system, memory accesses can be remote rather than local. Without a NUMA-aware design, a threaded program may incur many costly remote accesses and imbalanced memory requests across NUMA domains. Performance tools can help us take full advantage of the power of parallel architectures by providing insight into where and why a program fails to obtain top performance. This dissertation addresses the difficulty of obtaining insights about performance bottlenecks in parallel programs using lightweight measurement techniques. It makes four contributions. First, it describes a novel performance analysis method for OpenMP programs that can identify root causes of performance losses. Second, it presents a data-centric analysis method that associates performance metrics with data objects. This data-centric analysis can identify both a program's problematic memory accesses and the associated variables; this information can help an application developer optimize programs for better locality. Third, it discusses the development of a lightweight method that collects memory reuse distance to guide cache locality optimization. Finally, it describes a lightweight profiling method that can help pinpoint performance losses in programs on NUMA architectures and provide guidance about how to transform the program to improve performance.
To validate the utility of these methods, I implemented them in HPCToolkit, a state-of-the-art profiler developed at Rice University. I used the extended HPCToolkit to study several parallel programs. Guided by the performance insights provided by the new techniques introduced in this dissertation, I optimized each of these programs and obtained non-trivial performance improvements. The measurement overhead incurred by these new analysis methods is very small in both runtime and memory.
OpenMP; data locality; performance