Server processors are designed to deliver high throughput at low latency. As a result, these processors are usually equipped with a few tens of latency-optimized compute engines, or processing cores. Meeting the instruction and data demand of these cores requires a deep cache hierarchy between the compute engines and the main memory. The last level of the on-die SRAM cache hierarchy and the memory-side DRAM cache have recently attracted significant attention from researchers as well as industry practitioners. In this talk, I will discuss some of the important performance issues in the design of the on-die last-level SRAM cache and the memory-side DRAM cache, along with solutions arising from our research. In particular, I will discuss the optimization of hit latency and miss count for the on-die last-level SRAM cache, as well as the efficient maintenance of coherence information across the on-die cache hierarchy. For the memory-side DRAM cache, I will touch upon the fundamental trade-offs between hit rate and bandwidth optimization and discuss a few techniques to improve bandwidth delivery in systems equipped with such caches. This talk will present a sampling of my research contributions from roughly the past decade, carried out in collaboration with my students at IITK and external collaborators, primarily from the Intel Microarchitecture Research Lab in Bangalore and the Intel Architecture Group in Bangalore and Haifa.