Publications

 

Impact

In this paper, we propose a Reinforcement Learning (RL) based automated mapping approach to find optimal schedules of DNN layers for a given architecture model without violating the specified energy and latency constraints. The learned policies adapt easily to a wide range of DNN models and hardware configurations, facilitating transfer learning and reducing training time.
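As a rough illustration of the idea, the sketch below scores candidate layer schedules with a constraint-penalized reward, which an RL agent could optimize. The tiling space, cost model, budgets, and names are made up for the example and are not the paper's actual formulation.

```python
# Hypothetical sketch of constraint-aware reward shaping for an RL-based
# DNN-layer mapper; the cost model and budgets are illustrative assumptions.
CANDIDATE_TILINGS = [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]  # toy schedule space
ENERGY_BUDGET, LATENCY_BUDGET = 50.0, 30.0                      # assumed constraints


def cost_model(tiling):
    """Toy analytical model returning (energy, latency) for a tiling choice."""
    px, py = tiling
    latency = 100.0 / (px * py) + 2.0 * px      # compute vs. data-reload trade-off
    energy = 20.0 + 1.5 * py                    # buffering cost grows with py
    return energy, latency


def reward(tiling):
    """Negative latency, heavily penalized when a constraint is violated."""
    energy, latency = cost_model(tiling)
    penalty = 0.0
    if energy > ENERGY_BUDGET:
        penalty += 10.0 * (energy - ENERGY_BUDGET)
    if latency > LATENCY_BUDGET:
        penalty += 10.0 * (latency - LATENCY_BUDGET)
    return -latency - penalty


# Exhaustive pick over the toy space; a learned policy would replace this step.
best = max(CANDIDATE_TILINGS, key=reward)
print("selected tiling:", best, "reward:", round(reward(best), 2))
```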

Impact

This research paper introduces a fixed-partition compaction method that exploits runs of consecutive zero and non-zero weights (parameters) within sparse DNN models. The paper is the collective work of our MAGIC cluster team at IIT Palakkad and was published by the Association for Computing Machinery (ACM).
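The sketch below gives the general flavour of such a scheme: weights are split into fixed-size partitions, and each partition stores its zero/non-zero run lengths plus only the non-zero values. The partition size and encoding are assumptions made for illustration, not the paper's exact format.

```python
# Minimal illustrative sketch of compacting sparse weights inside fixed-size
# partitions; the per-partition run-length encoding below is an assumption.
PARTITION = 8  # assumed fixed partition size


def compact(weights):
    """Split into fixed partitions and run-length encode zero/non-zero runs."""
    compacted = []
    for base in range(0, len(weights), PARTITION):
        part = weights[base:base + PARTITION]
        runs, values = [], []
        i = 0
        while i < len(part):
            is_zero = part[i] == 0
            j = i
            while j < len(part) and (part[j] == 0) == is_zero:
                j += 1
            runs.append((is_zero, j - i))        # (run type, run length)
            if not is_zero:
                values.extend(part[i:j])         # only non-zero values are stored
            i = j
        compacted.append((runs, values))
    return compacted


weights = [0, 0, 3, 5, 0, 0, 0, 7, 1, 0, 0, 0, 0, 0, 2, 4]
for runs, values in compact(weights):
    print(runs, values)
```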

Impact

In this paper, we propose a compilation flow that supports efficient standalone execution of nested loops on CGRAs. Experiments show that the standalone execution model yields a maximum of 12.33× and an average of 6.75× performance improvement over the existing hosted execution model.

Impact

This paper introduces a hardware-based loop control mechanism that supports arbitrarily nested loops up to four levels deep. The design achieves a maximum of 1.9× and an average of 1.5× speed-up over the conventional approach of software-based loop implementation. The total number of instructions executed is reduced to half for almost all the kernels considered, with area and power overheads of only 2.6% and 0.8%, respectively.
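A toy software model of the idea is sketched below: a small controller keeps one counter per nesting level and carries wrap-around from the innermost level outward, so the kernel body needs no explicit increment/compare/branch instructions. The counter organisation and naming are illustrative assumptions, not the paper's micro-architecture.

```python
# Toy model of a cascaded hardware loop controller for up to four nesting
# levels; behaviour and naming are assumptions made purely for illustration.
class HardwareLoopController:
    def __init__(self, trip_counts):
        assert 1 <= len(trip_counts) <= 4, "up to four nesting levels supported"
        self.trip_counts = trip_counts          # outermost ... innermost
        self.counters = [0] * len(trip_counts)  # one counter per level
        self.done = False

    def step(self):
        """Advance the innermost counter; carry into outer levels on wrap."""
        for level in reversed(range(len(self.counters))):
            self.counters[level] += 1
            if self.counters[level] < self.trip_counts[level]:
                return                           # no carry, iteration continues
            self.counters[level] = 0             # wrap and carry outward
        self.done = True                         # outermost counter wrapped


# Iterate a 2 x 3 x 2 loop nest; the controller supplies all iteration state,
# so the kernel body contains no loop-control "instructions" of its own.
ctrl = HardwareLoopController([2, 3, 2])
while not ctrl.done:
    print("iteration", tuple(ctrl.counters))     # kernel body would run here
    ctrl.step()
```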

Impact

In this paper, we propose a centralized hardware-based loop optimization technique that achieves better area and energy results than the previously implemented distributed version. Without incurring any performance degradation, the area overhead against the reference architecture is reduced to 1.5% for a 4 × 2 CGRA configuration. The centralized hardware loop attains a maximum of 47.3% and an average of 27.2% reduction in energy consumption compared to the baseline model employing a software loop.

Impact

This paper presents Spectral-Blaze, a novel FFT-based CNN accelerator that addresses the computational and energy bottlenecks of spatial-domain acceleration. The proposed architecture introduces Intra-Patch parallelization during the Hadamard product phase, which optimizes Complex MAC (CMAC) unit utilization and maintains consistent reuse patterns across multiple input feature map patches. The Spectral-Blaze prototype, implemented on the Zynq MPSoC (ZU7CG), achieves a speedup of 4.98× for VGG-16 and 1.64× for AlexNet over the baseline.
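For context, the numpy sketch below shows the frequency-domain identity this class of accelerator builds on: a spatial convolution of an input-feature-map patch reduces to a Hadamard (element-wise) product of FFTs. The patch and filter sizes are arbitrary illustrative values, and the code says nothing about Spectral-Blaze's dataflow or CMAC scheduling.

```python
# Generic sketch of FFT-based convolution: pad, FFT, Hadamard product, inverse
# FFT.  Sizes are arbitrary; this is not the accelerator's actual dataflow.
import numpy as np

patch = np.random.rand(8, 8)       # one input feature map patch
kernel = np.random.rand(3, 3)      # one convolution filter

# Frequency-domain path: zero-pad the kernel, FFT both, multiply element-wise.
K = np.fft.fft2(kernel, s=patch.shape)
P = np.fft.fft2(patch)
freq_result = np.real(np.fft.ifft2(P * K))   # the Hadamard-product phase

# Spatial-domain reference: direct sliding-window convolution (kernel flipped,
# since correlation with a flipped kernel equals convolution).
kflip = kernel[::-1, ::-1]
direct = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        direct[i, j] = np.sum(patch[i:i + 3, j:j + 3] * kflip)

# Away from the wrap-around border, the circular FFT result matches the
# direct convolution (offset by the kernel radius).
print(np.allclose(freq_result[2:8, 2:8], direct))
```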

Efficient FFT-Based CNN Acceleration with Intra-Patch Parallelization and Flex-Stationary Dataflow

Impact

This paper presents a novel Hadamard Product Generator (HPG) for FFT-based CNN acceleration. The proposed block uses Intra-Patch parallelization to optimize Complex Multiply and Accumulate (CMAC) unit utilization and maintains identical reuse behavior across patch elements. The scheme also offers multiple spatial unrolling options to increase resource reuse. The prototype, implemented on the Zynq MPSoC (XCZU7CG), showcases throughput gains of 8.16× for VGG-16 and 9.30× for AlexNet compared to the state-of-the-art frequency-domain accelerator.