Software Support for Efficient Use of Modern Computer Architectures
Chabbi, Milind Mohan
Doctor of Philosophy
Parallelism is ubiquitous in modern computer architectures. Heterogeneous CPU cores and deep memory hierarchies make modern architectures difficult to program efficiently, and achieving top performance on supercomputers is difficult due to the complexity of hardware, software, and their interactions. Production software systems fail to achieve top performance on modern architectures broadly due to three main causes: resource idleness, parallel overhead, and data-movement overhead. This dissertation presents novel and effective performance analysis tools, adaptive runtime systems, and architecture-aware algorithms to understand and address these problems.

Many future high-performance systems will employ traditional multicore CPUs augmented with accelerators such as GPUs. One of the biggest concerns for accelerated systems is how to make the best use of both CPU and GPU resources. Resource idleness arises in a parallel program from causes such as insufficient parallelism and load imbalance. To assess systemic resource idleness arising in GPU-accelerated architectures, we developed efficient profiling and tracing capabilities. We introduce CPU-GPU blame shifting, a novel technique to pinpoint and quantify the causes of resource idleness in GPU-accelerated architectures.

Parallel overheads arise from synchronization constructs such as barriers and locks used in parallel programs. We developed a new technique to identify and eliminate redundant barriers at runtime in Partitioned Global Address Space (PGAS) programs. In addition, we developed a set of novel mutual exclusion algorithms that exploit locality in the memory hierarchy to improve performance on Non-Uniform Memory Access (NUMA) architectures.

In modern architectures, inefficient or unnecessary memory accesses can severely degrade program performance. To pinpoint and quantify wasteful memory operations, we developed a fine-grained execution-monitoring framework. We extended this framework and demonstrated the feasibility of attributing fine-grained execution metrics to source code and data in their contexts for long-running programs, a task previously thought to be infeasible.

Together, the solutions described in this dissertation were employed to gain insights into the performance of a collection of important programs, both parallel and serial. The insights we gained enabled us to improve the performance of many of these programs by a significant margin. Software for future systems will benefit from the techniques described in this dissertation.
performance analysis; resource idleness; blame shifting; heterogeneous architectures; GPU; dynamic analysis; barrier elision; PGAS; NWChem; mutual exclusion; locks; MCS lock; hierarchical MCS lock; HMCS lock; Power7; Power8; SGI UV 1000; Adaptive HMCS; AHMCS; fast path; contention management; hysteresis; hardware transactional memory; DeadSpy; dead writes; Pin; fine-grained monitoring; CCTLib; fine-grained execution monitoring