Compiling for Software Distributed-Shared Memory Systems
In this thesis, we explore the use of software distributed shared memory (SDSM) as a target communication layer for parallelizing compilers. For SDSM to be effective for this purpose, it must efficiently support both regular and irregular communication patterns. Previous studies have demonstrated techniques that enable SDSM to achieve performance competitive with hand-coded message passing for irregular applications. Here, we explore how to exploit compiler-derived knowledge of sharing and communication patterns to improve the performance of regular access patterns on SDSM systems. We introduce two novel optimization techniques: compiler-restricted consistency, which reduces the cost of false sharing, and compiler-managed communication buffers, which, when used together with compiler-restricted consistency, reduce the cost of fragmentation. We focus on regular applications with wavefront computation and tightly-coupled sharing due to carried data dependence. Previous studies of regular applications all focus on loosely-coupled parallelism, for which it is easier to achieve good performance. We describe point-to-point synchronization primitives that we have developed to facilitate the parallelization of this type of application on SDSM. Together with other compiler-assisted SDSM optimizations, such as compiler-controlled eager update, our integrated compiler and run-time support provides speedups for wavefront computations on SDSM that rival those previously achieved only for loosely synchronous applications. For example, we achieve a speedup of 11 out of 16 for the SOR benchmark, a tightly-coupled wavefront computation, at a problem size of 4Kx4K. This compares favorably with the speedup of 14 out of 16 that we obtain for Red-Black SOR, a loosely-coupled computation, at the same problem size and in the same hardware and software environment.
With the NAS-BT application benchmark at the Class A problem size, the compiler and run-time optimizations described here improve the speedup on SDSM from 4 out of 16 to 10 out of 16.
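To make the distinction between the two SOR variants concrete, the following is a minimal sequential sketch in Python, not the code used in this thesis. It assumes a four-point stencil and omits the over-relaxation factor for brevity; all function names (`make_grid`, `relax`, `sor_sweep`, `sor_sweep_wavefront`, `red_black_sweep`) are hypothetical. In the in-place SOR sweep, each point reads the already-updated values of its north and west neighbors, so the only independent work lies along anti-diagonals (the wavefront). In Red-Black SOR, points of one color read only points of the other color, so each half-sweep is fully independent.

```python
def make_grid(n):
    """Build an n x n grid with arbitrary but deterministic values (illustrative)."""
    return [[float((7 * i + 3 * j) % 5) for j in range(n)] for i in range(n)]


def relax(grid, i, j):
    """Replace one interior point by the average of its four neighbors."""
    grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1])


def sor_sweep(grid):
    """In-place row-major sweep: point (i, j) reads the updated (i-1, j) and (i, j-1)."""
    n = len(grid)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            relax(grid, i, j)
    return grid


def sor_sweep_wavefront(grid):
    """Same sweep reordered by anti-diagonal d = i + j.

    Points on one diagonal depend only on diagonals d-1 (already updated)
    and d+1 (not yet updated), so they could be computed in parallel.
    """
    m = len(grid) - 2  # number of interior points per dimension
    for d in range(2, 2 * m + 1):
        for i in range(max(1, d - m), min(m, d - 1) + 1):
            relax(grid, i, d - i)
    return grid


def red_black_sweep(grid):
    """Update all points with even i + j, then all points with odd i + j.

    Within one color no point reads another point of the same color,
    so each half-sweep carries no dependence.
    """
    n = len(grid)
    for color in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == color:
                    relax(grid, i, j)
    return grid
```

In this sketch, `sor_sweep` and `sor_sweep_wavefront` produce identical grids, because the wavefront order respects every dependence of the row-major order. A parallel version of the wavefront needs a producer to signal each consumer as boundary values become ready, which is the kind of pattern that point-to-point synchronization serves, whereas the red-black version needs only a barrier between the two colors.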