An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/16677
Massively parallel MIMD distributed-memory machines can provide enormous computational power; however, the difficulty of developing parallel programs for these machines has limited their use. Our thesis is that an advanced compiler can generate efficient parallel programs if data decompositions are provided. To validate this thesis, we have implemented a compiler for Fortran D, a version of Fortran that provides data decomposition specifications at two levels: problem mapping using sophisticated array alignments, and machine mapping through a rich set of data distribution functions. The Fortran D compiler is organized around three major functions: program analysis, program optimization, and code generation. Its compilation strategy is based on the "owner computes" rule, under which each processor computes only the values of data it owns. Data decomposition specifications are translated into mathematical distribution functions that determine the ownership of local data. By composing these with subscript functions or their inverses, the compiler can efficiently partition computation and determine nonlocal accesses at compile-time. Fortran D optimizations are guided by the concept of data dependence. Program transformations modify the program execution order to enable optimizations. Communication optimizations reduce the number of messages and overlap communication with computation. Parallelism optimizations detect reductions and optimize pipelined computations to increase the amount of useful computation that may be performed in parallel. Empirical evaluations show that exploiting parallelism is vital, while message vectorization, coarse-grain pipelining, and collective communication are the key communication optimizations. A simple model is constructed to guide compiler optimizations. Loop indices, bounds, and nonlocal storage are managed by the compiler during code generation.
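To make the "owner computes" rule concrete, the following is a minimal sketch in Python, not the compiler's actual implementation. It assumes a one-dimensional array of n elements, a simple block distribution over p processors, 0-based indices, and a loop of the form A(i + lhs_shift) = B(i + rhs_shift) for i in [lo, hi). All function and parameter names are hypothetical. The iteration set of each processor is obtained by applying the inverse of the left-hand-side subscript to the index range it owns, and nonlocal reads are found by applying the distribution function to the right-hand-side subscript.

```python
def block_size(n, p):
    """Elements per processor for a block distribution (ceiling of n / p)."""
    return -(-n // p)


def owner(j, n, p):
    """Distribution function: processor that owns global index j (assumed block layout)."""
    return j // block_size(n, p)


def local_iterations(me, n, p, lo, hi, lhs_shift=0):
    """Iterations i in [lo, hi) for which processor `me` owns A(i + lhs_shift).

    The owned index range is mapped through the inverse of the subscript
    function i -> i + lhs_shift, then intersected with the loop bounds.
    """
    b = block_size(n, p)
    first_owned = me * b
    end_owned = min((me + 1) * b, n)
    start = max(lo, first_owned - lhs_shift)
    stop = min(hi, end_owned - lhs_shift)
    return range(start, stop)  # empty when stop <= start


def nonlocal_reads(me, n, p, lo, hi, lhs_shift=0, rhs_shift=0):
    """Global indices of B read by `me` that are owned by another processor."""
    return [
        i + rhs_shift
        for i in local_iterations(me, n, p, lo, hi, lhs_shift)
        if owner(i + rhs_shift, n, p) != me
    ]


if __name__ == "__main__":
    # A(i + 1) = B(i) for i in [0, 15), with 16 elements on 4 processors.
    for proc in range(4):
        its = local_iterations(proc, 16, 4, 0, 15, lhs_shift=1)
        print(proc, list(its), nonlocal_reads(proc, 16, 4, 0, 15, lhs_shift=1))
```

In this sketch every quantity is computed from n, p, and the subscripts alone, which is the sense in which the partitioning and the nonlocal accesses can be determined before the program runs.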
Interprocedural analysis, optimization, and code generation algorithms limit compilation to only one pass over each procedure by collecting summary information after edits, then compiling procedures in reverse topological order to propagate necessary information. Delaying instantiation of the work partition, communication, and dynamic data decomposition enables interprocedural optimization. Interactions between the compiler and other elements of the programming system are discussed. Empirical measurements show that the output of the prototype Fortran D compiler is comparable to hand-written codes on the Intel iPSC/860 and significantly outperforms the CM Fortran compiler on the Thinking Machines CM-5.
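The single-pass, reverse-topological compilation order can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the call graph is acyclic, its edges run from caller to callee (so reverse topological order visits callees before their callers), and the "summary" recorded per procedure is a placeholder set of reachable procedure names rather than the information the Fortran D compiler actually propagates. The names compile_order and compile_program are hypothetical.

```python
from graphlib import TopologicalSorter


def compile_order(calls):
    """Return procedures with every callee placed before its callers.

    `calls` maps each caller to the procedures it calls. Passing it to
    TopologicalSorter treats callees as predecessors, so static_order()
    yields the reverse topological order of the caller-to-callee graph.
    """
    return list(TopologicalSorter(calls).static_order())


def compile_program(calls):
    """Visit each procedure exactly once, reusing the summaries of its callees."""
    summaries = {}
    for proc in compile_order(calls):
        callee_summaries = [summaries[c] for c in calls.get(proc, ())]
        # Placeholder summary: the set of procedures reachable from `proc`.
        summaries[proc] = {proc}.union(*callee_summaries)
    return summaries


if __name__ == "__main__":
    calls = {"main": ["solve", "init"], "solve": ["exchange"], "init": []}
    print(compile_order(calls))
    print(compile_program(calls))
```

Because a callee is always processed before any procedure that calls it, its summary is available when the caller is compiled, so no procedure needs a second pass. A recursive call graph would make TopologicalSorter raise graphlib.CycleError; handling recursion is outside the scope of this sketch.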