🖥️ Multiple Workloads: Implemented and evaluated scatter, gather, and convolution kernels on 256×256 input tensors.
⚙️ Configurable Cache Parameters: Studied the impact of cache hierarchy depth, block count, and associativity, with L2 associativity swept from 2-way to 8-way and L2 block size varied from 32 to 128 bytes.
🔄 Trace-Driven Evaluation: Generated read/write memory traces using Intel PIN and fed them into a high-level cache simulator for architectural analysis.
📊 Performance Metrics: Compared hit rate, miss rate, number of hits, and AMAT across different workloads and cache settings.
Identified how deeper cache hierarchies substantially reduce L1 AMAT across all three workloads. For scatter, AMAT drops from 198.48 with only L1 to 38.6 with L1+L2+L3; for gather, it drops from 205.99 to 37.7; and for convolution, it falls from 114.8 to 10.09.
Showed that increasing L2 associativity reduces conflict misses and improves the L2 hit rate, while in some cases the L3 hit rate decreases because more useful data is already captured in L2. This trend was observed in both the scatter and gather analyses.
Demonstrated that increasing L2 block size consistently reduces miss rate by exploiting spatial locality. In scatter, the miss rate decreases from 64% to 17.7% as the block size grows from 32 to 128 bytes; in gather, from 52.9% to 16.3%; and in convolution, from 28.4% to 8.5%.
Developed C++ kernels for scatter, gather, and convolution, then integrated them with Intel PIN trace generation and the Python-based cache-simulation workflow required by the assignment.
Used configurable YAML-based simulator settings to sweep cache hierarchy depth, block count, and associativity, enabling systematic architectural comparison across workloads.
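A sweep point in such a YAML configuration might look like the fragment below (the field names and values are illustrative; the provided framework's actual schema may differ):

```yaml
# Hypothetical sweep point -- field names illustrative only
caches:
  - level: L1
    size_kb: 32
    block_size: 64
    associativity: 4
  - level: L2
    size_kb: 256
    block_size: 64      # swept over 32, 64, 128
    associativity: 4    # swept over 2, 4, 8
  - level: L3
    size_kb: 2048
    block_size: 64
    associativity: 8
```

Keeping each configuration in a separate file makes the sweep reproducible: one simulator run per file, with results keyed by filename.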
Captured workload-specific cache trends rather than treating all kernels the same, showing that scatter, gather, and convolution stress the memory hierarchy differently.
C++ for implementing the scatter, gather, and convolution kernels.
Intel PIN for dynamic instrumentation and memory-trace generation.
Python for running the cache simulator and utility scripts.
Python-based Cache Simulator from the provided simulation framework.
YAML configuration files for modifying multilevel cache parameters.
GCC / g++ for compiling the workload kernels.
Linux / W135 lab machines or local x86 systems as the execution environment.
Computer Architecture Research: Understanding how workload behavior interacts with cache hierarchy design.
Performance Optimization: Identifying better cache configurations for memory-intensive kernels using trace-driven simulation.
Hardware-Software Co-Design: Connecting application access patterns with architectural decisions such as associativity, capacity, and hierarchy depth.
This project provided a practical study of how real workload access patterns influence cache performance. The results showed that deeper cache hierarchies significantly reduce AMAT, higher associativity can improve upper-level cache effectiveness, and larger block sizes can lower miss rate by exploiting spatial locality. Overall, the work strengthened understanding of workload-aware cache design and demonstrated a complete trace-driven architecture evaluation flow from kernel development to simulator-based analysis.