Advanced Data-Parallel Compilation
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18615
Over the past few decades, scientific research has come to rely increasingly on simulation and other computational techniques, a strategy known as in silico research. Computation is increasingly important for testing theories and obtaining results in fields where experimentation is not currently possible (e.g., astrophysics, cosmology, climate modeling) or where traditional experimental methods cannot provide the required detail and resolution (e.g., computational biology, materials science). The quantities of data manipulated and produced by such scientific programs are often too large to be processed on a uniprocessor computer, so parallel computers have been used to obtain such results.
Creating software for parallel machines has been a very difficult task. Most parallel applications have been written using low-level communication libraries based on message passing, which require application programmers to deal with all aspects of programming the machine. Compilers for data-parallel languages have been proposed as an alternative. High-Performance Fortran (HPF) was designed to simplify the construction of data-parallel programs operating on dense distributed arrays. However, HPF compilers have not implemented the transformations and mapping strategies needed to translate complex high-level data-parallel codes into low-level high-performance applications.
This thesis demonstrates that it is possible to generate scalable high-performance code for a range of complex, regular applications written in high-level data-parallel languages, through the design and implementation of several analysis, compilation and runtime techniques in the dHPF compiler. The major contributions of this thesis are:
- Analysis, code generation and data distribution strategies (multipartitioning) for tightly-coupled codes.
- Compiler and runtime support based on generalized multipartitioning.
- Communication scheduling to hide latency, through the overlap of embarrassingly parallel loops.
- Advanced static analysis for communication coalescing and simplification.
- Support for efficient single-image executables running on a parameterized number of processors.
- Strategies for generating large-scale scalable executables for up to hundreds of processors.
Experiments with the NAS SP, BT and LU application benchmarks show that these techniques enable the dHPF compiler to generate code that scales efficiently up to hundreds of processors with only a few percent overhead relative to high-quality hand-coded implementations. The techniques are necessary but not sufficient to produce efficient code for the NAS MG multigrid benchmark, which still exhibits large overhead compared to its hand-coded counterpart.
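
As a rough illustration of the multipartitioning idea named among the contributions, the sketch below builds a simplified two-dimensional diagonal tile-to-processor mapping. It is not taken from the thesis and is not the generalized multipartitioning algorithm used in dHPF; the function names and the (i + j) mod p assignment are assumptions chosen for illustration. The property it demonstrates is that every processor owns exactly one tile in each row and each column of tiles, so a line sweep along either dimension keeps all processors busy at every step.

```python
# Hypothetical sketch of a 2D diagonal multipartitioning on p processors.
# The p x p tile grid and the (i + j) % p mapping are illustrative
# assumptions, not the generalized scheme implemented in dHPF.

def diagonal_multipartition(p):
    """Return a p x p grid whose entry [i][j] is the processor owning tile (i, j)."""
    return [[(i + j) % p for j in range(p)] for i in range(p)]


def is_balanced_for_sweeps(owners):
    """Check that each processor appears exactly once in every tile row and column."""
    p = len(owners)
    everyone = set(range(p))
    rows_ok = all(set(row) == everyone for row in owners)
    cols_ok = all({owners[i][j] for i in range(p)} == everyone for j in range(p))
    return rows_ok and cols_ok


if __name__ == "__main__":
    # Print the owner of each tile for a 4-processor example.
    for row in diagonal_multipartition(4):
        print(row)
    print("balanced for sweeps:", is_balanced_for_sweeps(diagonal_multipartition(4)))
```

In contrast, a plain block distribution along one dimension would leave all but one processor idle at each step of a sweep along that dimension, which is the kind of serialization that tightly-coupled line-sweep codes need to avoid.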