Key Features
🖥️ Multiple Topologies: Tested on YOLO Tiny, OCR (DeepBenchConv), and a modified MobileNet
⚙️ Configurable Array Sizes: From compact 12×12 arrays to large 256×256 arrays
🔄 Dataflow Comparison: Input Stationary (IS), Output Stationary (OS), and Weight Stationary (WS)
📊 Performance Metrics: Overall utilization, total cycles, and mapping efficiency
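The sweep implied by the features above (array sizes × dataflows) can be sketched as a small experiment grid. This is an illustrative sketch only; the names and structure are assumptions, not the project's actual harness:

```python
# Hypothetical sweep over the array sizes and dataflows listed above.
ARRAY_DIMS = (12, 32, 64, 256)   # square systolic array sizes evaluated
DATAFLOWS = ("IS", "OS", "WS")   # Input / Output / Weight Stationary

# One configuration per (array size, dataflow) pair, run per workload.
configs = [
    {"array": (dim, dim), "dataflow": df}
    for dim in ARRAY_DIMS
    for df in DATAFLOWS
]
# 4 array sizes x 3 dataflows = 12 configurations per workload
```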
Research Contributions
Identified 32×32 Output Stationary as optimal for YOLO Tiny due to its balance of utilization and cycle count
Demonstrated 64×64 Output Stationary as best for MobileNet, aligning with depth-wise separable convolution characteristics
Conducted a sensitivity study across filter sizes (1×1 to 7×7), showing weak correlation with utilization in depth-wise separable convolutions
Technical Achievements
Achieved ~20% higher utilization with a 32×32 OS array than with larger arrays on YOLO Tiny
Verified trade-offs between parallelism, memory bandwidth, and resource efficiency
Established Output Stationary as the most effective dataflow across multiple workloads
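The utilization gap between compact and oversized arrays follows directly from tiling: when a layer's output tile is smaller than the array, most PEs sit idle. A minimal first-order sketch of this effect, using a standard tiling estimate rather than the project's actual simulator (the layer shape below is illustrative):

```python
import math

def os_utilization(m, n, rows, cols):
    """First-order PE utilization when an M x N output is tiled
    onto a rows x cols output-stationary array."""
    folds = math.ceil(m / rows) * math.ceil(n / cols)  # passes needed to cover the output
    return (m * n) / (folds * rows * cols)

# Illustrative 100 x 64 output layer:
#   32x32 array:  4 x 2 = 8 folds -> ~78% utilization
#   256x256 array: 1 fold, but most PEs idle -> ~10% utilization
```

This is why a compact array can beat a much larger one on small layers: the larger array finishes in fewer folds but wastes the majority of its PEs on every pass.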
Applications
AI Accelerators: Custom hardware design for efficient CNN execution
Edge AI Systems: Optimized architectures for low-power devices
Hardware-Software Co-Design: Exploring DNN mapping strategies on systolic arrays
Impact and Recognition
This project provided insights into area-efficient systolic array configurations, demonstrating that compact architectures can deliver high utilization without requiring arbitrarily large arrays. The results highlight 32×32 Output Stationary as a strong default for lightweight neural networks, with 64×64 OS better suited to MobileNet-style workloads.