The traditional CPU-bound applications of the past have been replaced by multiple concurrent data-driven applications that use lots of memory. These applications, including databases and virtualization, put high stress on the virtual memory system which can have up to a 50% performance overhead for some applications. Virtualization compounds this problem, where the overhead can be upwards of 90%. While much research has been done on reducing the number of TLB misses, they can not be eliminated entirely. This thesis examines three techniques for reducing the cost of TLB miss handling. We test each against real-world workloads and find that the techniques that exploit course-grained locality in virtual address use and contiguity found in page tables show the best performance.
The first technique reduces the overhead of multi-level page tables, such as those used in x86-64, with a dedicated MMU cache. We show that the most effective MMU caches are translation caches , which store partial translations and allow the page walk hardware to skip one or more levels of the page table. In recent years, both AMD and Intel processors have implemented MMU caches. However, their implementations are quite different and represent distinct points in the design space. This thesis introduces three new MMU cache structures that round out the design space and directly compares the effectiveness of all five organizations. This comparison shows that two of the newly introduced structures, both of which are translation cache variants, are better than existing structures in many situations.
Secondly, this thesis examines the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based upon radix trees for supporting large virtual address spaces. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This work shows that any of the five MMU cache structures will reduce radix tree page table DRAM accesses far below an inverted page table.
Finally, we present a novel device, the SpecTLB, that is able to exploit alignment in the mapping from virtual address to physical address to interpolate translations without any memory accesses at all. Operating system support for automatic page size selection leaves many small pages aligned within large page "reservations". While large pages improve TLB coverage, they limit the control the operating system has over memory allocation and protection. Our device allows the latency penalty of small pages to be avoided while maintaining fine-grained allocation and protection.