Synchronization and Pipelining on Multicore: Shaping Parallelism for a New Generation of Processors
Two trends are leading the microprocessor industry to embrace chip multiprocessors as a cost-effective solution for the general-purpose computing market. On the one hand, increasing on-chip transistor densities offer the potential for higher performance. On the other, sequential applications expose limited instruction-level parallelism, and increasingly complicated superscalar and multithreaded architectures scale poorly. Multicore processors allow manufacturers to integrate larger numbers of simpler processing cores onto the same chip, thereby reducing design time and cost. They provide higher throughput for multiprogrammed workloads by enabling simultaneous processing of independent jobs, and can improve the performance of parallel applications by exploiting thread-level parallelism. Additionally, the individual cores might themselves be superscalar or multithreaded, thereby exploiting finer-grained levels of parallelism as well.
While many design alternatives exist for multicore processors, one common choice is to share the lower levels of the on-chip memory hierarchy among multiple processing cores. Although larger shared caches incur higher access latencies and require more complex logic, they provide a larger aggregate pool of capacity and reduce duplicate cache lines, thereby generally reducing capacity misses. However, sharing the cache can also hurt performance when the cache usage patterns of concurrent processes interfere with each other. A good balance can therefore be achieved by combining small, private first-level caches for fast, contention-free access with large, shared lower-level on-chip caches for flexible workload tolerance.
Performance on multicore processors is affected by many of the same factors that affect performance on other shared-memory parallel architectures. However, the tighter coupling of on-chip resources changes some of the cost ratios that influence the design of parallel algorithms.
Shared-cache multicore architectures introduce the potential for cheap inter-core communication, synchronization, and data sharing. They also introduce greater potential for cache contention. One alternative to the data-parallel programming model is to pipeline a computation across multiple processors, effectively treating the processors as high-level vector units. Allen and Kennedy discuss pipelined parallelism in the context of the doacross loop. Vadlamani and Jenks refer to this method as the Synchronized Pipelined Parallelism Model.
In this paper, we examine the opportunities a shared-cache multicore processor presents for pipelined parallelism. Using the dual-core, shared-cache Intel Core Duo architecture as our experimental setting, we first analyze inter-core synchronization costs using a simple synchronization microbenchmark. We then evaluate a pipelined parallel version of Recursive Prismatic Time Skewing (RPTS) on a 2D Gauss-Seidel kernel benchmark. RPTS is an optimization technique for iterative stencil computations that increases temporal locality by skewing spatial domains across a time domain and blocking in both domains.
In the next subsection, we introduce our experimental setting. In Section 2, we discuss background issues, including factors that affect performance in shared-cache architectures, the effects of cache-sharing contention on application performance, and the synchronized pipelined parallelism approach to programming for shared-cache environments. In Section 3, we present a simple synchronization microbenchmark and analyze its performance on the Intel Core Duo. In Section 4, we discuss optimization techniques for iterative stencil computations, then analyze their parallelization in a shared-cache multicore context. Finally, in Section 5, we present experimental results from a pipelined parallel implementation of RPTS.
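To make the idea of pipelining a stencil computation across cores concrete, the following is a minimal sketch of synchronized pipelined parallelism on a 2D Gauss-Seidel kernel. It is not the RPTS implementation evaluated in this paper: it performs no time skewing or cache blocking, and all names (sweep_row, gauss_seidel_pipelined, the done progress array) are hypothetical, chosen only for illustration. It assumes a five-point stencil swept in row-major order, with successive sweeps dealt out to workers round-robin. A sweep may update row i only after the previous sweep has finished row i + 1, so each worker trails the one ahead of it by one row. Because CPython threads share a global interpreter lock, the sketch demonstrates the dependence structure and synchronization pattern only, not a speedup.

```python
import threading


def sweep_row(u, i):
    """In-place Gauss-Seidel update of interior row i (five-point stencil)."""
    row, up, down = u[i], u[i - 1], u[i + 1]
    for j in range(1, len(row) - 1):
        row[j] = 0.25 * (up[j] + down[j] + row[j - 1] + row[j + 1])


def gauss_seidel_sequential(u, sweeps):
    """Reference version: one full sweep after another."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            sweep_row(u, i)
    return u


def gauss_seidel_pipelined(u, sweeps, workers=2):
    """Pipelined version: sweep s runs one row behind sweep s - 1.

    Assumption for illustration: sweeps are assigned to workers
    round-robin (worker w handles sweeps w, w + workers, ...).
    """
    last = len(u) - 2          # index of the last interior row
    done = [0] * sweeps        # done[s] = last row completed in sweep s
    cond = threading.Condition()

    def worker(w):
        for s in range(w, sweeps, workers):
            for i in range(1, last + 1):
                if s > 0:
                    # Row i reads row i + 1, which must already hold
                    # the values produced by the previous sweep.
                    need = min(i + 1, last)
                    with cond:
                        cond.wait_for(lambda: done[s - 1] >= need)
                sweep_row(u, i)
                with cond:
                    done[s] = i
                    cond.notify_all()

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return u
```

Since the pipelined version performs the same floating-point operations in the same dependence order as the sequential one, both should produce identical grids; only the interleaving of sweeps in time differs. The per-row condition-variable handshake is the kind of inter-core synchronization whose cost determines how fine-grained such a pipeline can profitably be.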
Technical Report Number