Application insight through performance modeling
Doctor of Philosophy
Tuning the performance of applications requires understanding the interactions between code and target architecture. Hardware counters, present in all modern processors, can identify possible causes of performance problems and can pinpoint sections of code that execute at a low fraction of machine peak performance. However, the information provided by hardware counters is often insufficient to understand the causes of poor performance or to realistically estimate the potential for performance improvement. This thesis presents techniques to measure and model application characteristics independent of the target architecture. Using information gathered from both static and dynamic analysis, this approach not only makes accurate predictions about the behavior of an application on a target architecture for different inputs, but also provides guidance for tuning by highlighting the factors that limit performance at different points in a program. We introduce several new performance analysis metrics that estimate the maximum gain expected from tuning different aspects of an application, or from using hardware accelerator coprocessors. Our approach models the most important factors affecting application performance and provides estimates of unfulfilled performance potential due to a mismatch between an application's characteristics and the resources present on the target architecture. We model an application's instruction execution cost and memory hierarchy utilization, and we identify performance problems arising from insufficient instruction-level parallelism or poor data locality. To demonstrate the utility of this approach, this thesis presents the results of analyzing and tuning two scientific applications. For Sweep3D, a three-dimensional Cartesian geometry neutron transport code benchmark from the DOE's Accelerated Strategic Computing Initiative, our analysis identified opportunities for improvement that shortened execution time by 66% on an Itanium2-based system.
For the Gyrokinetic Toroidal Code (GTC) from Princeton Plasma Physics Laboratory, a particle-in-cell code that simulates turbulent transport of particles and energy in burning plasma, our techniques identified opportunities for improvement that shortened execution time by 33% on the same Itanium2 system.