Portable high performance and scalability of partitioned global address space languages
Doctor of Philosophy
Large-scale parallel simulations are fundamental tools for engineers and scientists. Consequently, it is critical to develop programming models and tools that enhance development-time productivity, enable harnessing of massively parallel systems, and guide the diagnosis of poorly scaling programs. This thesis addresses this challenge in two ways. First, we show that Co-array Fortran (CAF), a shared-memory parallel programming model, can be used to write scientific codes that exhibit high performance on modern parallel systems. Second, we describe a novel technique for analyzing parallel program performance and identifying scalability bottlenecks, and apply it across multiple programming models.
Although the message passing parallel programming model provides both portability and high performance, it is cumbersome to program. CAF eases this burden by providing a partitioned global address space, but until now it has been implemented only on shared-memory machines. To significantly broaden CAF's appeal, we show that CAF programs can deliver high performance on commodity cluster platforms. We designed and implemented cafc, the first multiplatform CAF compiler, which transforms CAF programs into Fortran 90 plus communication primitives. Our studies show that CAF applications matched or exceeded the performance of the corresponding message passing programs. For good node performance, cafc employs an automatic transformation called procedure splitting; for high performance on clusters, we vectorize and aggregate communication at the source level. We extend CAF with hints enabling overlap of communication with computation. Overall, our experiments show that CAF versions of NAS benchmarks match the performance of their MPI counterparts on multiple platforms.
The increasing scale of parallel systems makes it critical to pinpoint and fix scalability bottlenecks in parallel programs. To automate this process, we present a novel analysis technique that uses parallel scaling expectations to compute scalability scores for calling contexts, and then guides an analyst to hot spots using an interactive viewer. Our technique is general and may thus be applied to several programming models; in particular, we used it to analyze CAF and MPI codes, among others. Applying our analysis to CAF programs highlighted the need for language-level collective operations, which we both propose and evaluate.