Automated Diagnosis of Scalability Losses in Parallel Applications
Doctor of Philosophy
Each generation of supercomputers is more powerful than the last in an attempt to keep up with the growing ambition of scientific inquiry. Despite improvements in computational power, however, the performance of many parallel applications has failed to scale. Many factors degrade the parallel performance of applications. The need to understand application behavior and pinpoint causes of inefficiency has led to the development of a broad array of tools for measuring and analyzing application performance. These performance analysis tools generally focus on collecting measurements, attributing them to program source code, and presenting them; responsibility for analyzing and interpreting performance measurement data falls to application developers. Profiles generated by performance tools can usually identify the presence of scalability losses, while time series data are generally necessary to pinpoint the root causes of such losses. However, manual analysis of time series data can be difficult for executions with a large number of processes, long running times, and deep call chains. To address this problem, we developed an automated framework that analyzes time series of call path samples to present users with a performance diagnosis of parallel executions. Our automated framework incurs much lower overhead in time and space than prior tools that analyze performance using instrumentation-based traces. The framework's automated diagnosis indicates the symptoms, severity, and causes of scalability losses found in a parallel execution. To support a broad array of parallel applications, our automated analysis is applicable to both SPMD and MPMD programs under both flat and hierarchical parallel models. We demonstrate the effectiveness of our framework by applying it to time-series measurements of three scientific codes.
performance; automated diagnosis; scalability losses; sample-based time series data