The interaction of architecture and operating system in the designing of a scalable shared memory multiprocessor
Bennett, John K.
Doctor of Philosophy thesis
This dissertation describes the implementation and evaluation of operating system design techniques that can be used to achieve scalability and to improve performance in large-scale shared memory multiprocessors with non-uniform memory hierarchies. We describe the implementation of SALSA, an operating system that incorporates these techniques and that executes on a commercially available processor. The contributions of this dissertation include the implementation of a technique that masks memory latency and increases processor utilization via rapid context switching, and a detailed study of the effects of cache organization and caching policy on latency hiding. The dissertation presents the relative performance of several alternatives for context caching on a register window architecture and shows that write-back, set-associative caches provide the best latency-hiding performance, especially with constructive cache interference. We demonstrate significant improvements in program performance (120%) with latency hiding when cache miss latency is high, even with low cache miss rates (1-2%). We show that direct-mapped caches are unsuitable when operating system code is highly sensitive to cache misses, as in the case of context-switching trap code. We also show that increased processor utilization can significantly increase contention on the underlying network. In architectures with non-uniform memory access behavior, the operating system must exploit thread and data placement to improve performance. The organization of the SALSA kernel exploits the underlying memory architecture. We describe a programming model that takes into account the clustering in the system and provides primitives for hierarchical data placement and hierarchical thread scheduling. We show that proper data placement can double performance in a three-level memory hierarchy such as Willow.
SALSA also provides user control over memory allocation for fine-tuning a program's memory requirements, which we show improves program performance by up to 20%. Although the techniques described in this dissertation have been evaluated on a hierarchical bus-based architecture similar to Willow, they are applicable to any large-scale multiprocessor characterized by non-uniform memory access behavior and large memory access latency.
Computer science; Electronics; Electrical engineering