General-purpose cache-coherent many-core server processors are usually designed with a per-core private cache hierarchy and a large shared multi-banked last-level cache (LLC). The round-trip latency and the volume of traffic through the on-die interconnect between the per-core private cache hierarchy and the shared LLC banks can be significant. As a result, optimized private caching is important in such architectures. Traditionally, the private cache hierarchy in these processors treats private and shared blocks equally. We, however, observe that eliminating all non-compulsory, non-coherence core cache misses to a small subset of shared code and data blocks can save a large fraction of the core requests to the LLC, indicating a large potential for reducing the interconnect traffic in such architectures. We architect a specialized exclusive per-core private L2 cache that serves as a victim cache for the per-core private L1 cache. The proposed victim cache selectively captures a subset of the L1 cache victims. Our best selective victim caching proposal is driven by an online partitioning of the L1 cache victims based on two distinct features, namely, an estimate of the sharing degree and a simple indirect estimate of the reuse distance. Our proposal learns the collective reuse probability of the blocks in each partition on-the-fly and decides the victim caching candidates based on these probability estimates. Detailed simulation results on a 128-core system running a selected set of multi-threaded commercial and scientific computing applications show that our best victim cache design proposal at 64 KB capacity, on average, saves 44.1% of the core cache miss requests sent to the LLC and 10.6% of the execution cycles compared to a baseline system that has no private L2 cache.
In contrast, a traditional 128 KB non-inclusive LRU L2 cache saves 42.2% of the core cache misses sent to the LLC compared to the same baseline while performing slightly worse than the proposed 64 KB victim cache. In summary, our proposal outperforms the traditional design and generates less interconnect traffic while halving the space investment for the per-core private L2 cache. Further, the savings in core cache misses achieved by the proposed victim cache are observed to be only 8% less than those of an optimal victim cache design at the 32 KB and 64 KB capacity points.
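The admission policy outlined above (partition L1 victims by a sharing-degree estimate and a reuse-distance estimate, learn each partition's reuse probability online, and admit only victims from partitions likely to be reused) can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the partitioning buckets, the admission threshold, the counter-decay scheme, and all names here are hypothetical.

```python
# Illustrative sketch of selective victim cache admission (all parameters
# and names are assumptions, not the paper's actual design).
from collections import OrderedDict

ADMIT_THRESHOLD = 0.5   # hypothetical admission cutoff on reuse probability
DECAY_PERIOD = 1024     # periodically halve counters so estimates stay fresh


class PartitionStats:
    def __init__(self):
        self.inserted = 0   # victims admitted from this partition
        self.reused = 0     # admitted victims later hit in the victim cache

    def reuse_probability(self):
        # Optimistic prior until the partition accumulates evidence.
        return self.reused / self.inserted if self.inserted else 1.0


class SelectiveVictimCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block address -> partition id, LRU order
        self.stats = {}               # partition id -> PartitionStats
        self.events = 0

    def _partition(self, sharing_degree, reuse_bucket):
        # Partition key combines the two features; the bucketing is arbitrary.
        return (min(sharing_degree, 3), reuse_bucket)

    def on_l1_victim(self, addr, sharing_degree, reuse_bucket):
        """Decide whether an L1 victim should be captured by the victim cache."""
        pid = self._partition(sharing_degree, reuse_bucket)
        st = self.stats.setdefault(pid, PartitionStats())
        self._maybe_decay()
        if st.reuse_probability() < ADMIT_THRESHOLD:
            return False                      # bypass: partition rarely reused
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # evict the LRU block
        self.blocks[addr] = pid
        st.inserted += 1
        return True

    def lookup(self, addr):
        """An L1 miss probes the victim cache; a hit rewards the partition."""
        pid = self.blocks.pop(addr, None)     # exclusive: a hit moves the block to L1
        if pid is None:
            return False
        self.stats[pid].reused += 1           # evidence that admission paid off
        return True

    def _maybe_decay(self):
        self.events += 1
        if self.events % DECAY_PERIOD == 0:
            for st in self.stats.values():
                st.inserted //= 2
                st.reused //= 2
```

Partitions whose admitted blocks are rarely probed again see their reuse probability fall below the threshold, so their future victims bypass the victim cache; this is one way the limited 32 KB or 64 KB capacity could be reserved for the small subset of blocks with high collective reuse.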