Hiding memory latency is critical in modern machines. Typically, machines have used caches and addressed the ensuing cache coherence problem with hardware or VM-based strategies that rely on global inter-cache communication. However, global communication limits scalability. "Local knowledge" coherence strategies, which use compile-time information to avoid run-time global communication, offer better scalability but suffer additional cache misses. We develop a framework for understanding the relationships among coherence strategies, both previous and newly proposed. Within this framework, it is possible to define, independent of implementation considerations, an "ideal" local strategy with respect to cache hit rate: no local strategy could ever do better. For Fortran programs with readily analyzable subscripts, ideal local strategies achieve the same hit rates as global strategies.
We develop three new local coherence strategies, CTV, TS1, and TS$\sp\prime$, designed to exploit minimal, aggressive, and reasonable hardware support, respectively. CTV is suitable for machines with no hardware assistance for cache coherence beyond the bare minimum of an exposed invalidate instruction. TS1 implements the abstract theorems of ideal local coherence as a concrete algorithm. Though TS1 is probably too expensive to build in a real machine, it is a vehicle for studying the limits of local coherence. TS$\sp\prime$ treats coherence over array sections as a graph coloring problem. So long as there are sufficient colors (realized as bits per cache line), TS$\sp\prime$ is an ideal local strategy. We found that four colors are adequate for many programs. When more colors are needed, TS$\sp\prime$ degrades gracefully. Its execution overheads are negligible and its hardware implementation costs are moderate.
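To make the graph coloring view concrete, the following is a minimal sketch of coloring under a fixed color budget. It is not the TS$\sp\prime$ algorithm itself. It assumes a hypothetical conflict graph whose nodes stand for array sections and whose edges join sections that must not share a color, and it uses a generic greedy heuristic. The function name `color_sections`, the graph representation, and the fallback rule when colors run out are all illustrative assumptions.

```python
def color_sections(conflicts, num_colors):
    """Greedily color a conflict graph using at most num_colors colors.

    conflicts: dict mapping each node (a hypothetical array section) to the
    set of nodes it conflicts with. The graph is assumed to be symmetric.
    Returns (coloring, overflow), where overflow lists the nodes that could
    not be given a conflict-free color within the budget.
    """
    coloring = {}
    overflow = []
    # Visit high-degree nodes first (a common greedy heuristic); ties are
    # broken by name so the result is deterministic.
    for node in sorted(conflicts, key=lambda n: (-len(conflicts[n]), n)):
        used = [coloring[m] for m in conflicts[node] if m in coloring]
        free = [c for c in range(num_colors) if c not in used]
        if free:
            coloring[node] = free[0]
        else:
            # Out of colors: fall back to the color shared with the fewest
            # already-colored neighbors, and record the node as overflow.
            coloring[node] = min(range(num_colors), key=used.count)
            overflow.append(node)
    return coloring, overflow


# Hypothetical conflict graph over four array sections.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
coloring, overflow = color_sections(graph, num_colors=3)
```

In this sketch, a sufficient budget yields a coloring with no conflicts, while a smaller budget still produces an assignment and simply reports which nodes had to share a color. That loosely mirrors the idea of degrading gracefully rather than failing when colors are scarce.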
Our data show that TS$\sp\prime$ has better hit rates than the best previous local strategy, time-stamping, for nearly all programs, and thus better expected performance. Our data also show that TS$\sp\prime$ achieves hit rates equal to those of global strategies for analyzable programs, and nearly equal for partially analyzable programs. We indirectly compared the performance of TS$\sp\prime$ with that of a particular VM-style global strategy and found that TS$\sp\prime$ has better expected performance on our test suite. For machines without global coherence hardware, local strategies are an effective approach for an important class of programs.