Code Generation for Extreme Scale Parallel Systems
Srinivasa Murthy, Karthik
Mellor-Crummey, John M
Doctor of Philosophy
Power consumption and fabrication limitations are increasingly playing significant roles in the design of extreme scale parallel systems. These factors are influencing system designers to support higher on-node computing capability via throughput-optimized processors instead of latency-optimized processors. However, the inter- and intra-processor communication capabilities on such systems are not increasing at the same rate as the on-node computing capability. Consequently, achieving high performance requires careful orchestration of both single- and multiprocessor parallelism. This thesis shows that compiler technology and expressive programming model constructs can help applications more effectively exploit both forms of parallelism. Compilers play an important role in harnessing short vector parallelism supported by cores in modern processors. Over last ten years, vector widths have increased dramatically from the 64-bit vectors supported by Intel's Pentium MMX processor to the 512-bit vectors supported by Intel's Knights Corner processor. However, the vectorization capabilities of state-of-the-art compilers are still immature, failing in the presence of complex control flow and data dependencies. This thesis presents compiler transformations that enable efficient vector parallelism in the presence of common kinds of complex dependencies. To enable efficient multiprocessor parallelism, this thesis develops compiler technology to support sophisticated algorithms that minimize interprocessor communication. The class of .5D communication-avoiding algorithms was developed to address the inter-processor communication bottleneck. Mapping these algorithms to complex architectures efficiently is tedious for even expert programmers. To address this issue, this thesis presents the Maunam compiler, which generates efficient parallel code from a high-level, global-view sketch of a .5D algorithm that is expressed using symbolic data sizes and numbers of processors. To mitigate the cost of communication for multiprocessor parallelism, this thesis develops a novel compiler transformation to overlap communication with computation for systolic computations. Additionally, to aid effective management of the completion of non-blocking communication, this thesis presents two synchronization constructs, cofence and distributed phasers.