Title: IMP: Indirect Memory Prefetcher


Abstract:
Machine learning, graph analytics and sparse linear algebra-based applications are
dominated by irregular memory accesses resulting from following edges in a graph or
non-zero elements in a sparse matrix. These accesses have little temporal or spatial
locality, and thus incur long memory stalls and large bandwidth requirements. A traditional
streaming or striding prefetcher cannot capture these irregular access patterns. A majority
of these irregular accesses come from indirect patterns of the form A[B[j]]. We propose an
efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and
hide latency. We also propose a partial cacheline accessing mechanism for these prefetches
to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated
on seven applications, IMP achieves a 56% speedup on average (up to 2.3×) over a baseline 64-core
system with streaming prefetchers. This is within 23% of an idealized system. With partial
cacheline accessing, we see an additional 9.4% speedup on average (up to 46.6%).
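
To make the A[B[j]] pattern concrete, the following is a minimal sketch of a sparse matrix-vector product. It is illustrative only and not taken from the paper: the CSR layout and the names spmv_csr, row_ptr, col_idx, values, and x are assumptions chosen for the example. The index array col_idx plays the role of B and is read sequentially, while the dense vector x plays the role of A and is read at data-dependent positions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = M @ x for a matrix M stored in CSR form (hypothetical helper)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # values[j] and col_idx[j] are read in order, so a streaming
        # prefetcher can follow them.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # x[col_idx[j]] is the indirect access A[B[j]]: its address
            # depends on the value just loaded from col_idx.
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y


# Example matrix (assumed for illustration):
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

In this sketch, the reads of x jump around according to the contents of col_idx, which is the kind of access that a streaming or striding prefetcher cannot follow.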