Memory and Communication Optimizations for Macro-dataflow Programs
Sbirlea, Dragos Dumitru
Doctor of Philosophy
It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, the memory-per-core ratio is already low and, as the number of cores increases, it is expected to further decrease, making per-core memory efficiency of parallel programs an even more important concern in future systems. Further, the memory requirements of parallel applications can be significantly larger than for their sequential counterparts and their memory utilization also depends critically on the schedule used when running them. This thesis proposes techniques that enable awareness and control of the tradeoff between a program’s memory usage and resulting performance. It does so by taking advantage of the computation structure that is made explicit in macro-dataflow programs which is one of the benefits of macro-dataflow as a programming model for modern multicore applications. To address this challenge, we first introduce folding - a memory management technique that enables programmers to map multiple data values to the same memory slot. This reduces the memory requirement of the program while still preserving its macro-dataflow execution semantics. We then propose an approach that allows dynamic macro-dataflow programs running on shared-memory multicore systems to obey a user-desired memory bound. Using the inspector/executor model, we tailor the set of allowable schedules to either guarantee that the program can be executed within the given memory bound, or throw an error during the inspector phase without running the computation if no feasible schedule can be found. We show that our technique can gracefully span the spectrum (with decreasing memory bounds) from fully parallel to fully serial execution, with several intermediate points between the two. Comparison with OpenMP shows that it can execute in 53% of the memory required by OpenMP while running at 90% (or more) of OpenMP’s performance. Finally, we turn our attention to distributed systems where often the memory size is not a limiting factor, but communication and load balancing are. For these systems, we show that data and task distributions can be selected automatically even for applications expressed as dynamic task graphs, freeing the programmer from the cumbersome selection process .We show that optimal selection can be achieved for certain classes of distributions and cost functions that capture the trade-off between communication and load balance.
macro dataflow; inspector/executor; task parallelism