Published In

Frontiers in High Performance Computing

Document Type

Article

Publication Date

5-19-2026

Subjects

Cache modeling, hardware-software co-design, memory tracing, memory-centric analysis, performance optimization, processor tracing, trace analysis

Abstract

Introduction: The memory wall phenomenon, where advances in processor performance significantly outpace those in memory subsystems, poses a fundamental challenge for contemporary computing systems. In memory-bound applications, memory subsystem behavior dominates performance, yet existing analysis approaches present significant limitations: detailed microarchitectural simulators require days to weeks to simulate modest workloads; hardware performance counters provide only aggregate statistics that obscure temporal and spatial access patterns; and scaled simulation approaches face challenges in capturing contention effects, bandwidth saturation, and interference patterns that emerge at larger scales. These limitations reflect a processor-centric design philosophy, in both performance analysis tools and system co-design methodologies, that is increasingly misaligned with memory-bound workloads, where a detailed understanding of memory access patterns, cache hierarchy interactions, and contention is critical for effective optimization.

Methods: This paper presents an integrated framework for memory-centric analysis that enables effective hardware-software co-design. We describe practical trace collection techniques, including hardware-assisted processor tracing with minimal overhead and portable software-based instrumentation with statistical sampling. We present multi-perspective analysis methods that examine memory behavior from temporal, sequential, spatial, and relational viewpoints, revealing distinct optimization opportunities invisible in aggregate metrics. For example, in the miniVite graph application, a data structure switch from an open to a closed hash table, guided by spatial anticipation metrics, improved hardware prefetcher utilization and delivered a 1.8× runtime improvement, a benefit invisible to aggregate performance counters. We detail an architectural modeling framework that uses sampled traces with temporal interpolation and confidence-based filtering to evaluate cache and memory configurations.

Results: Evaluation on representative benchmarks demonstrates that this framework achieves practical accuracy (L2 cache errors of 2.64%, confidence-filtered L3 errors of 9.92%, bandwidth errors of 7.33%) while providing substantial speedup (26.8×) over cycle-accurate simulation, enabling rapid design space exploration.

Discussion: We demonstrate how this integrated framework enables systematic identification of both hardware optimizations (memory controller tuning, bank partitioning, NUMA configuration) and software optimizations (data layout restructuring, prefetching strategies, and memory-aware scheduling). Through this comprehensive treatment of the memory-centric analysis pipeline (from trace collection through architectural modeling to co-design application), we provide researchers and practitioners with practical techniques for addressing memory bottlenecks in contemporary computing systems.

Rights

© 2026 Gajaria, Challa, Suriyakumar, Manzano, Tallent and Márquez.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

DOI

10.3389/fhpcp.2026.1801169

Persistent Identifier

https://archives.pdx.edu/ds/psu/44728

Publisher

Frontiers Media SA
