Title: Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-level Cache Hierarchies

Abstract:
On-die caches are a popular way to hide main memory latency. However, it
is difficult to build large caches without substantially increasing their access latency, which
in turn hurts performance. To overcome this difficulty, on-die caches are typically built as a
multi-level cache hierarchy. One popular hierarchy adopted by modern
microprocessors is the three-level cache hierarchy. Building a three-level cache hierarchy
enables a low average hit latency since most requests are serviced from faster inner-level
caches. This has motivated recent microprocessors to deploy large level-2 (L2) caches that
can help further reduce the average hit latency. In this paper, we perform a fundamental analysis
of the popular three-level cache hierarchy and examine how it delivers performance through the
lens of program criticality. Through our detailed analysis, we show that the current trend of increasing
L2 cache sizes to reduce average hit latency is, in fact, an inefficient design choice. We instead
propose the Criticality Aware Tiered Cache Hierarchy (CATCH), which accurately detects program
criticality in hardware and, using a novel set of inter-cache prefetchers, ensures that on-die
data accesses that lie on the critical path of execution are served at the latency of the fastest
level-1 (L1) cache. The last-level cache (LLC) then serves to reduce slow memory
accesses, making the large L2 cache redundant for most applications. The area saved
by eliminating the L2 cache can then be used to create more efficient processor configurations.
Our simulation results show that CATCH outperforms the three-level cache hierarchy with a large
1 MB L2 and exclusive LLC by an average of 8.4%, and a baseline with 256 KB L2 and inclusive
LLC by 10.3%. We also show that CATCH enables a powerful framework to explore broad chip-level
area, performance and power tradeoffs in cache hierarchy design. Supported by CATCH, we evaluate
radical architecture directions such as eliminating the L2 altogether and show that such architectures
can yield a 4.5% performance gain over the baseline with nearly 30% less area, or improve performance
by 7.3% at the same area while reducing energy consumption by 11%.
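
The "average hit latency" that motivates larger L2 caches can be illustrated with a minimal sketch. The function name, the hit fractions, and the cycle counts below are all hypothetical assumptions chosen for illustration; none of them are taken from the paper or its evaluation.

```python
def average_hit_latency(levels):
    """Average latency of requests that hit somewhere in the on-die hierarchy.

    levels: list of (hit_fraction, latency_cycles) pairs, innermost level first,
    where hit_fraction is the share of all on-die hits serviced by that level.
    """
    total = sum(frac for frac, _ in levels)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(frac * lat for frac, lat in levels)


# Hypothetical three-level hierarchies (L1, L2, LLC); illustrative values only.
small_l2 = [(0.90, 5), (0.05, 15), (0.05, 40)]
large_l2 = [(0.90, 5), (0.08, 15), (0.02, 40)]  # larger L2 absorbs more LLC hits

print(average_hit_latency(small_l2))
print(average_hit_latency(large_l2))
```

Under these assumed numbers, the larger L2 lowers the average. Note that this average weights every hit equally, whereas the argument above concerns only those accesses that lie on the critical path of execution, which a simple average does not distinguish.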