Transforming Complex Loop Nests For Locality
Date: February 19, 2002
Because of the increasing gap between the speeds of processors and standard memory chips, many compiler techniques have been developed to enhance the locality of applications. This paper focuses on automatically optimizing complicated loop structures, for which existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. Building on traditional unimodular transformations on perfectly nested loops, we have developed a novel transformation called dependence hoisting. This transformation facilitates fusion of a set of arbitrarily nested loops at the outermost position of a code segment containing these loops. It is especially useful when the loops to be fused are nested inside one another and when some loops cannot be legally distributed before fusion. We have also developed a transformation framework called computation slicing, which applies dependence hoisting to block arbitrary loop nests for better locality. In terms of both asymptotic complexity, which is comparable to that of standard unimodular loop transformations, and actual running time, as measured in our experimental results, computation slicing should be efficient enough for inclusion in commercial production compilers. We have implemented the framework as a Fortran source-to-source translator. Our implementation has successfully blocked four numerical benchmark kernels: Cholesky, QR, LU factorization without pivoting, and LU factorization with partial pivoting. The automatically blocked benchmarks achieved performance improvements similar to those attained by manually blocked programs in LAPACK. The automatic blocking of QR and LU with partial pivoting is a notable achievement because these benchmarks include loop nests that are considered difficult; to our knowledge, no previous compiler implementation has completely automated the blocking of QR and LU with pivoting.
This fact indicates that our technique can in practice match or exceed the effectiveness of many general loop transformation frameworks.
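To make the notion of blocking concrete, the following sketch contrasts an unblocked LU factorization without pivoting with a hand-blocked version of the same computation. It is an illustration only: it is written in Python rather than Fortran for brevity, the function names `lu_unblocked` and `lu_blocked` and the block-size parameter `b` are hypothetical, and the blocked code was derived by hand. It is not the output of the translator and does not show how dependence hoisting or computation slicing work internally. Note that the `k` loop of the unblocked version contains two separate inner loop nests, so the nest is not perfectly nested.

```python
def lu_unblocked(a):
    """In-place LU factorization without pivoting (unit lower-triangular L)."""
    n = len(a)
    for k in range(n - 1):
        # Scale column k below the diagonal to form the multipliers of L.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Rank-1 update of the trailing submatrix.
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


def lu_blocked(a, b):
    """Hand-blocked variant: same result, processed in column panels of width b."""
    n = len(a)
    for kb in range(0, n, b):
        ke = min(kb + b, n)
        # 1. Factor the panel (columns kb..ke-1), updating only panel columns.
        for k in range(kb, ke):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
            for j in range(k + 1, ke):
                for i in range(k + 1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2. Finish rows kb..ke-1 of U to the right of the panel.
        for j in range(ke, n):
            for k in range(kb, ke):
                for i in range(k + 1, ke):
                    a[i][j] -= a[i][k] * a[k][j]
        # 3. Apply the whole panel to the trailing submatrix at once.
        for j in range(ke, n):
            for i in range(ke, n):
                for k in range(kb, ke):
                    a[i][j] -= a[i][k] * a[k][j]
    return a
```

Both functions overwrite the input with the factors, storing L below the diagonal and U on and above it. In the blocked version, step 3 reuses the same `b` panel columns for every trailing element, which is the source of the improved locality; choosing `b` so that a panel fits in cache is left to the caller in this sketch.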