dc.contributor.author Youssefi, Annahita
dc.date.accessioned 2017-08-02T22:03:07Z
dc.date.available 2017-08-02T22:03:07Z
dc.date.issued 2009-11-14
dc.identifier.uri https://hdl.handle.net/1911/96380
dc.description.abstract The potential for higher performance from increasing on-chip transistor densities, on the one hand, and the limitations in the instruction-level parallelism of sequential applications and in the scalability of increasingly complicated superscalar and multithreaded architectures, on the other, are leading the microprocessor industry to embrace chip multiprocessors as a cost-effective solution for the general-purpose computing market. Multicore processors allow manufacturers to integrate larger numbers of simpler processing cores onto the same chip, thereby shortening design time and reducing costs. They provide higher throughput for multiprogrammed workloads by enabling simultaneous processing of independent jobs, and they can improve the performance of parallel applications by exploiting thread-level parallelism. Additionally, the individual cores may themselves be superscalar or multithreaded, exploiting finer-grained parallelism as well.

While many design alternatives exist for multicore processors, one common choice is sharing the lower levels of the on-chip memory hierarchy among multiple processing cores. Although larger shared caches incur higher access latencies and require more complex logic, they provide a larger aggregate capacity and reduce duplicate cache lines, thereby generally reducing capacity misses. However, sharing the cache can also hurt performance when the cache-use behaviors of concurrent processes interfere with each other. A good balance can therefore be achieved by combining small, private first-level caches, for fast, contention-free access, with large, shared lower-level on-chip caches, for flexible workload tolerance.

Performance on multicore processors is affected by many of the same factors that shape performance on other shared-memory parallel architectures. However, the tighter coupling of on-chip resources changes some of the cost ratios that influence the design of parallel algorithms. Shared-cache multicore architectures introduce the potential for cheap inter-core communication, synchronization, and data sharing; they also introduce greater potential for cache contention. One alternative to the data-parallel programming model is pipelining a computation across multiple processors, effectively treating the processors as high-level vector units. Allen and Kennedy discuss pipelined parallelism in the context of the doacross loop [7]. Vadlamani and Jenks refer to this method as the Synchronized Pipelined Parallelism Model [12].

In this paper, we examine the opportunities a shared-cache multicore processor presents for pipelined parallelism. Using the dual-core, shared-cache Intel Core Duo architecture as our experimental setting, we first analyze inter-core synchronization costs using a simple synchronization microbenchmark. We then evaluate a pipelined parallel version of Recursive Prismatic Time Skewing (RPTS) [6] on a 2D Gauss-Seidel kernel benchmark. RPTS is an optimization technique for iterative stencil computations that increases temporal locality by skewing spatial domains across the time domain and blocking in both domains.
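For concreteness, the sketch below shows a minimal, untiled 2D Gauss-Seidel kernel of the kind RPTS optimizes. It is an illustrative baseline only, not the report's skewed and blocked implementation; the grid size and sweep count are assumptions chosen for illustration.

/* Minimal untiled 2D Gauss-Seidel sweep (illustrative sketch; the
 * grid dimension N and sweep count T are assumptions, not the
 * report's benchmark configuration). */
#include <stdio.h>

#define N 1024   /* interior grid dimension (assumed) */
#define T 100    /* number of time steps, i.e., sweeps (assumed) */

static double a[N + 2][N + 2];  /* grid with a one-cell boundary halo */

int main(void) {
    for (int t = 0; t < T; t++)
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                /* In-place update: a[i-1][j] and a[i][j-1] were already
                 * updated in this sweep, the Gauss-Seidel dependence. */
                a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                + a[i][j - 1] + a[i][j + 1]);
    printf("%f\n", a[N / 2][N / 2]);
    return 0;
}

Because each sweep reads values already written in the same sweep, the loop nest carries dependences in both spatial dimensions, which is why pipelining it across cores requires skewing rather than naive loop-level parallelization.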
In the next subsection, we introduce our experimental setting. In Section 2, we discuss background issues, including factors impacting performance in shared-cache architectures, the effects of cache-sharing contention on application performance, and the synchronized pipelined parallelism approach to programming for shared-cache environments. In Section 3, we present a simple synchronization microbenchmark and analyze its performance on the Intel Core Duo. In Section 4, we discuss optimization techniques for iterative stencil computations and analyze their parallelization for a shared-cache multicore context. Finally, in Section 5, we present experimental results from a pipelined parallel implementation of RPTS.
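As a rough illustration of the kind of inter-core synchronization cost measurement the abstract describes, the sketch below ping-pongs a token between two threads through shared flags and reports the mean round-trip time. The use of C11 atomics and POSIX threads, the round count, and the timing method are assumptions for illustration, not the report's actual microbenchmark.

/* Illustrative sketch (not the report's code): two threads bounce a
 * token through shared atomic flags to estimate the inter-core
 * round-trip signaling latency. Build with: cc -O2 -pthread sync.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000  /* number of round trips to average (assumed) */

static atomic_int ping = 0, pong = 0;

static void *partner(void *arg) {
    (void)arg;
    for (int i = 1; i <= ROUNDS; i++) {
        /* Spin until the main thread signals round i, then reply. */
        while (atomic_load_explicit(&ping, memory_order_acquire) < i)
            ;
        atomic_store_explicit(&pong, i, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, partner, NULL);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 1; i <= ROUNDS; i++) {
        atomic_store_explicit(&ping, i, memory_order_release);
        /* Spin until the partner thread replies for round i. */
        while (atomic_load_explicit(&pong, memory_order_acquire) < i)
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("mean round-trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

On a shared-cache part such as the Core Duo, the flags stay in the shared L2, so a measurement like this probes exactly the cheap inter-core signaling path that makes fine-grained pipelined parallelism attractive.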
dc.format.extent 15 pp
dc.language.iso eng
dc.rights You are granted permission for the noncommercial reproduction, distribution, display, and performance of this technical report in any format, but this permission is only for a period of forty-five (45) days from the most recent time that you verified that this technical report is still available from the Computer Science Department of Rice University under terms that include this permission. All other rights are reserved by the author(s).
dc.title Synchronization and Pipelining on Multicore: Shaping Parallelism for a New Generation of Processors
dc.type Technical report
dc.date.note November 14, 2009
dc.identifier.digital TR09-8
dc.type.dcmi Text
dc.identifier.citation Youssefi, Annahita. "Synchronization and Pipelining on Multicore: Shaping Parallelism for a New Generation of Processors." (2009) https://hdl.handle.net/1911/96380.

