Locality Transformations of Computation and Data for Portable Performance
Sharma, Kamal Gopal
Doctor of Philosophy
Recently, multi-cores chips have become omnipresent in computer systems ranging from high-end servers to mobile phones. A variety of multi-core architectures have been developed which vary in the number of cores on a single die, cache hierarchies within a chip and interconnect across chips. This diversity of architectures is likely to continue for the foreseeable future. With these architectural variations, performance tuning of an application becomes a major challenge for the programmer. Code optimizations developed for one architecture may have a different impact, even negative, across other architectures due to the differences in tuning parameters. This problem is compounded when scaling from a single node to multiple nodes in a cluster. Our thesis is that significant performance benefits can be obtained from locality transformations of computation and data that require minimal programmer effort. We establish this thesis by addressing three specific problems — tile size selection, data layout optimization and automatic selection of distribution functions. Tile size selection and data layout optimization improve intra-node performance, whereas automatic selection of distribution functions also enhances inter-node performance. Loop tiling is a well-known locality optimization technique that can increase cache reuse at different levels of a memory hierarchy. Choosing tile sizes for different loop nests is a non-trivial task for a programmer due to the large parameter search space and different architectural parameters for different platforms. In this dissertation, we present analytical bounds for tile size selection. Our approach uses different architectural parameters such as cache size and TLB size to limit the tile size search space, and also includes an automated algorithm to select optimized tile sizes. Another important locality optimization is data layout transformation, which improves cache performance by restructuring program data. An important challenge for a programmer is to select an efficient data layout for a given application and target platform. In the past, selecting a data layout has been limited by the cost of rewriting applications for different layouts and by the effort required to find an optimized layout for a given platform. In this work, we present an automated tool for implementing different layouts for a given program. Using our approach, a programmer can experiment with different layouts across varying platforms for efficient performance. We also provide an automatic algorithm to select an optimized data layout for a given program and its representative executions. When mapping an application onto multiple nodes, one challenge for scalable performance is how to distribute computation and data across the nodes. In this work, we introduce an algorithm to automatically select a task/data distribution function for a set of nodes. We build this solution on Intel CnC’s distributed runtime system. Overall, our approaches for the three problems provide automated methods for locality optimization that enable portable performance while requiring minimal programmer effort.