Performance analysis for parallel programs from multicore to petascale
Tallent, Nathan Russell
Doctor of Philosophy
Cutting-edge science and engineering applications require petascale computing. Petascale computing platforms are characterized by both extreme parallelism (systems of hundreds of thousands to millions of cores) and hybrid parallelism (nodes with multicore chips). Consequently, to effectively use petascale resources, applications must exploit concurrency at both the node and system level --- a difficult problem. The challenge of developing scalable petascale applications is only partially aided by existing languages and compilers. As a result, manual performance tuning is often necessary to identify and resolve poor parallel and serial efficiency. Our thesis is that it is possible to achieve unique, accurate, and actionable insight into the performance of fully optimized parallel programs by measuring them with asynchronous-sampling-based call path profiles; attributing the resulting binary-level measurements to source code structure; analyzing measurements on-the-fly and postmortem to highlight performance inefficiencies; and presenting the resulting context- sensitive metrics in three complementary views. To support this thesis, we have developed several techniques for identifying performance problems in fully optimized serial, multithreaded and petascale programs. First, we describe how to attribute very precise (instruction-level) measurements to source-level static and dynamic contexts in fully optimized applications --- all for an average run-time overhead of a few percent. We then generalize this work with the development of logical call path profiling and apply it to work-stealing-based applications. Second, we describe techniques for pinpointing and quantifying parallel inefficiencies such as parallel idleness, parallel overhead and lock contention in multithreaded executions. Third, we show how to diagnose scalability bottlenecks in petascale applications by scaling our our measurement, analysis and presentation tools to support large-scale executions. Finally, we provide a coherent framework for these techniques by sketching a unique and comprehensive performance analysis methodology. This work forms the basis of Rice University's HPCTOOLKIT performance tools.
Computer science; Applied sciences