Transforming Complex Loop Nests for Locality
This work was also published as a Rice University thesis/dissertation: http://hdl.handle.net/1911/18153
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. This thesis develops the following novel techniques to optimize complex loop structures both effectively and inexpensively for better locality. Extended dependence representation: a matrix representation that incorporates dependence relations between iterations of arbitrarily nested loops. Transitive dependence analysis algorithm: a new algorithm that improves the time complexity of existing transitive dependence analysis algorithms. Dependence hoisting: a new loop transformation technique that permits the direct fusion and interchange of arbitrarily nested loops. The transformation is inexpensive and can be incorporated into most commercial compilers. Computation slicing: a framework that systematically applies dependence hoisting to optimize arbitrary loop structures for better locality. Recursion transformation: the first compiler work that automatically trans- forms loop structures into recursive form to exploit locality simultaneously at multiple levels of the memory hierarchy. Both the computation slicing framework and recursion transformation have been implemented and applied to successfully optimize a collection of benchmarks. In particular, the slicing framework has successfully blocked four linear algebra kernels: Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The auto-blocked versions have achieved performance improvements similar to those attained by manually blocked programs in LAPACK . The automatic blocking of QR and pivoting LU is a notable achievement because these kernels include loop nests that are considered difficult — to our knowledge, few previous compiler implementations have completely automated the blocking of the loop nests in these kernels. These facts indicate that although with a cost much lower than that of existing more general transformation frameworks [34, 42, 2, 36, 49], the computation slicing framework can in practice match or exceed the effectiveness of these general frameworks.