Md. Rownak Chowdhury
  • About
  • Research
  • Teaching
1. RECONFIGURABLE AI ACCELERATOR

Modern accelerators deliver vast arithmetic throughput yet often fall short in end-to-end performance because control, locality, and orchestration are fragmented. We frame our approach around three questions that govern throughput: who sequences execution (host vs. fabric), where operands live between layers (on-chip vs. off-chip), and whether data and control can co-stream. Our answer is MAVeC, a messaging-based, self-programmable accelerator in which compact packets carry both operation and operands, enabling the fabric, not just the host, to sequence work. A hierarchical memory organization keeps weights and partial sums resident on chip to minimize DRAM traffic, while co-streamed data and control unify computation and communication to reduce orchestration overhead. The research combines microarchitecture development, mapping-algorithm design, and performance modeling and benchmarking across deep-learning workloads to study locality, utilization, and scalability. Together, these choices translate device-level efficiency into system-level gains and chart a principled path to reconfigurable, low-power, high-throughput architectures for next-generation computing.
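As a rough illustration of the messaging idea, the sketch below packs an operation and its operands into a single word that the fabric can route and decode without host intervention. The field names and bit widths are hypothetical assumptions for illustration, not the actual MAVeC packet format.

```python
from dataclasses import dataclass

# Hypothetical 64-bit packet layout (widths are illustrative assumptions):
#   [63:56] opcode | [55:40] destination PE | [39:20] operand A | [19:0] operand B
@dataclass
class Packet:
    opcode: int  # operation the receiving processing element should perform
    dest: int    # processing element the packet is routed to
    op_a: int    # first operand (or an on-chip weight address)
    op_b: int    # second operand (or a partial-sum address)

def pack(p: Packet) -> int:
    """Serialize a packet into one 64-bit word for streaming through the fabric."""
    return (p.opcode << 56) | (p.dest << 40) | (p.op_a << 20) | p.op_b

def unpack(word: int) -> Packet:
    """Decode a streamed word back into operation + operands at the PE."""
    return Packet((word >> 56) & 0xFF, (word >> 40) & 0xFFFF,
                  (word >> 20) & 0xFFFFF, word & 0xFFFFF)
```

Because operation and operands travel together, a PE that receives such a word has everything it needs to act, which is what lets the fabric rather than the host sequence work.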

MAVeC Microarchitecture
On-Chip Memory Organization
Hardware-Software Interaction
Mapping Framework
Dataflow and Data Reuse
Simulation Result (Matrix Multiplication)
Simulation Result (Convolution Operation)
Less reliance on host/Off-Chip Memory
Design Space Exploration
Benchmark Results
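The dataflow and data-reuse studies above revolve around the canonical seven-deep convolution loop nest. A naive Python reference (stride 1, no padding assumed) makes the seven dimensions explicit; any mapping onto the accelerator is a reordering and tiling of these loops:

```python
def conv2d_7loop(inp, wgt, N, K, C, H, W, R, S):
    """Naive 7-deep convolution loop nest: batch, output channel,
    output row/column, input channel, filter row/column."""
    OH, OW = H - R + 1, W - S + 1  # output size for stride 1, no padding
    out = [[[[0] * OW for _ in range(OH)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                      # 1: batch
        for k in range(K):                  # 2: output channel
            for oh in range(OH):            # 3: output row
                for ow in range(OW):        # 4: output column
                    for c in range(C):      # 5: input channel
                        for r in range(R):  # 6: filter row
                            for s in range(S):  # 7: filter column
                                out[n][k][oh][ow] += inp[n][c][oh + r][ow + s] * wgt[k][c][r][s]
    return out
```

Which of these loops is tiled on chip and which is streamed determines how often weights and partial sums are reused before touching off-chip memory.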
Relevant Publications:
  • Accelerating PageRank Algorithmic Tasks With a New Programmable Hardware Architecture.
  • OFFLOAD: An Open-Source Framework and System Design Approach for Data Analytics on Distributed Compute Units.
  • Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs.
  • Messaging-Based Intelligent Processing Unit (m-IPU) for Next-Generation AI Computing.
  • Implications of Memory Embedding and Hierarchy on the Performance of MAVeC AI Accelerators.
  • Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators.
  • InTuit: A Novel Algorithmic Approach for Neural Network Mapping onto a Data and Instruction Streamable AI Accelerator (In Progress).
  • High-Speed Drug Response Modeling from Pharmacogenomic Data via Hardware-Accelerated Matrix Factorization (In Progress).

2. CMOS TRANSCEIVER FRONT-END DESIGN

Reliable wireless communication underpins modern IoT and healthcare systems, yet front-end RF blocks (oscillators, frequency dividers, amplifiers, and on-chip power management) face tight trade-offs among frequency range, noise, power, and area. As protocols move from sub-GHz to millimeter-wave bands, our goal is compact, low-power CMOS circuits that hold performance across diverse applications. We design ring-oscillator topologies for ultra-low-power transceivers; explore injection-locked frequency dividers (ring- and LC-based) to widen locking range and lower phase noise; develop low-noise amplifier architectures for mm-wave IoT sensors with balanced gain, bandwidth, linearity, and noise figure; and engineer LDO regulators that deliver high PSRR and fast transient response with minimal quiescent current to stabilize the RF front end. Together, these efforts advance CMOS RF front ends toward energy-efficient, cost-effective solutions for next-generation 5G/6G and large-scale IoT deployments.
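For the ring-oscillator work, a useful first-order design yardstick is that an N-stage ring oscillates at f = 1 / (2 N t_pd), since one period is a full rising-plus-falling traversal of the ring. A small numeric sketch (textbook estimate, not a measurement of our circuits):

```python
def ring_osc_freq(n_stages: int, stage_delay_s: float) -> float:
    """First-order ring-oscillator frequency: one period is a full
    rising + falling traversal of the ring, i.e. 2 * N * t_pd."""
    return 1.0 / (2 * n_stages * stage_delay_s)

# e.g. a 5-stage ring with 20 ps per-stage delay oscillates near 5 GHz
```

The formula shows the core trade-off: fewer, faster stages raise frequency but shrink tuning margin, while added stages cost power and area.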

CMOS LDO
CMOS LNA
CMOS ILFD
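For the ILFD studies, the classical first-order yardstick for how far an injected signal can pull the oscillator is Adler's small-injection approximation, with single-sided locking range roughly (f0 / 2Q) * (I_inj / I_osc). The sketch below is that textbook estimate only, not a model of our specific dividers:

```python
def adler_locking_range(f0_hz: float, q: float, inj_ratio: float) -> float:
    """Single-sided locking range of an injection-locked oscillator,
    per Adler's small-injection approximation (inj_ratio = I_inj / I_osc)."""
    return f0_hz / (2 * q) * inj_ratio

# e.g. a 10 GHz LC tank with Q = 10 and 10% injection locks over ~50 MHz
```

It also explains the ring- vs. LC-based trade-off studied above: the low effective Q of a ring stage widens locking range at the cost of phase noise, while a high-Q LC tank gives the opposite balance.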
Relevant Publications:
  • Ring Oscillator Design in 50 nm CMOS Technology for IoT-Based Remote Infectious Disease Monitoring System.
  • CMOS Low-Dropout Voltage Regulator Design Trends: An Overview.
  • CMOS Low Noise Amplifier Design Trends Towards Millimeter-Wave IoT Sensors.
  • Design Trends of LC-Tank Based CMOS ILFD for SHF and EHF Transceiver Applications.

3. FPGA BASED ECC-CRYPTO ENGINE

Secure communication at IoT scale demands public-key cryptography that is both fast and energy-efficient. Elliptic-curve cryptography (ECC) offers strong security per bit, but practical deployment hinges on a hardware design that minimizes modular-arithmetic cost, avoids control bottlenecks, and resists side-channel leakage. Our work builds a low-latency, low-power ECC engine on FPGA around three ideas: (1) a unified point-operation block that performs point addition and doubling in one module, improving side-channel resilience and reducing control overhead; (2) projective/Jacobian coordinates that eliminate expensive inversions in the main loop; and (3) optimized modular arithmetic, combining Booth radix-4 multiplication with a fast P-256-style reduction and a combined add/subtract unit to cut cycles and area. On a Virtex-5 device, the twisted Edwards (Ed25519) point-multiplication engine reaches ~1.4 ms per 256-bit scalar multiplication at ~118 MHz, with a unified group-operation latency of 646 cycles and a total of ~164.7 k cycles per scalar multiply, demonstrating competitive throughput/area.
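The unified-operation idea rests on a property of twisted Edwards curves: one complete addition formula also handles doubling, so a scalar multiply executes an indistinguishable sequence of identical group operations. The affine Python sketch below illustrates this on Ed25519; it is illustrative only, since the FPGA engine works in projective coordinates precisely to avoid the per-step modular inversions shown here.

```python
# Ed25519: twisted Edwards curve a*x^2 + y^2 = 1 + d*x^2*y^2 over GF(2^255 - 19)
p = 2**255 - 19
a = p - 1                                # a = -1
d = (-121665 * pow(121666, -1, p)) % p   # d = -121665/121666

def unified_op(P, Q):
    """One complete formula for both point addition and doubling (P == Q)."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, -1, p) % p
    y3 = (y1 * y2 - a * x1 * x2) * pow(1 - t, -1, p) % p
    return x3, y3

def scalar_mult(k, P):
    """Left-to-right double-and-add built entirely from unified_op, so every
    step issues the same operation regardless of the secret bit pattern."""
    R = (0, 1)  # identity element of the Edwards group
    for bit in bin(k)[2:]:
        R = unified_op(R, R)       # doubling via the same unified formula
        if bit == '1':
            R = unified_op(R, P)
    return R
```

Because the identity (0, 1) is an ordinary input to the same formula, no special cases leak through timing or control flow, which is the side-channel benefit the unified hardware block targets.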

ECC Protocols
Modular Adder-Subtractor
Booth Radix-4 Multiplication
Modular Reduction
Unified Point Operation
Point Multiplication
Benchmarking
Relevant Publications:
  • Efficient FPGA Implementation of Modular Arithmetic for Elliptic Curve Cryptography.
  • Efficient FPGA Implementation of Unified Point Operation for Twisted Edward Curve Cryptography.
  • Low Latency FPGA Implementation of Twisted Edward Curve Cryptography Hardware Accelerator Over Prime Field.
