Md. Rownak Chowdhury
  • About
  • Research
  • Teaching
1. RECONFIGURABLE AI ACCELERATOR

Modern accelerators deliver vast arithmetic throughput yet often fall short in end-to-end performance because control, locality, and orchestration are fragmented. We frame our approach around three questions that govern throughput: who sequences execution (host vs. fabric), where operands live between layers (on-chip vs. off-chip), and whether data and control can co-stream. Our answer is MAVeC, a messaging-based, self-programmable accelerator in which compact packets carry both operation and operands, enabling the fabric, not just the host, to sequence work. A hierarchical memory organization keeps weights and partial sums resident on chip to minimize DRAM traffic, while co-streamed data and control unify computation and communication to reduce orchestration overhead. The research combines microarchitecture development, mapping-algorithm design, and performance modeling and benchmarking across deep-learning workloads to study locality, utilization, and scalability. Together, these choices translate device-level efficiency into system-level gains and chart a principled path to reconfigurable, low-power, high-throughput architectures for next-generation computing.
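As a rough illustration of the messaging idea, the sketch below packs an operation and its operands into a single word that the fabric can route and decode without host intervention. The field names and bit widths are hypothetical assumptions for illustration, not the actual MAVeC packet format.

```python
from dataclasses import dataclass

# Hypothetical 64-bit packet layout (widths are illustrative assumptions):
#   [63:56] opcode | [55:40] destination PE | [39:20] operand A | [19:0] operand B
@dataclass
class Packet:
    opcode: int  # operation the receiving processing element should perform
    dest: int    # processing element the packet is routed to
    op_a: int    # first operand (or an on-chip weight address)
    op_b: int    # second operand (or a partial-sum address)

def pack(p: Packet) -> int:
    """Serialize a packet into one 64-bit word for streaming through the fabric."""
    return (p.opcode << 56) | (p.dest << 40) | (p.op_a << 20) | p.op_b

def unpack(word: int) -> Packet:
    """Decode a streamed word back into operation + operands at the PE."""
    return Packet((word >> 56) & 0xFF, (word >> 40) & 0xFFFF,
                  (word >> 20) & 0xFFFFF, word & 0xFFFFF)
```

Because operation and operands travel together, a PE that receives such a word has everything it needs to act, which is what lets the fabric rather than the host sequence work.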

MAVeC Microarchitecture
On-Chip Memory Organization
Hardware-Software Interaction
Mapping Framework
Dataflow and Data Reuse
Simulation Result (Matrix Multiplication)
Simulation Result (Convolution Operation)
Less reliance on host/Off-Chip Memory
Design Space Exploration
Benchmark Results
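The dataflow and data-reuse studies above revolve around the canonical seven-deep convolution loop nest. A naive Python reference (stride 1, no padding assumed) makes the seven dimensions explicit; any mapping onto the accelerator is a reordering and tiling of these loops:

```python
def conv2d_7loop(inp, wgt, N, K, C, H, W, R, S):
    """Naive 7-deep convolution loop nest: batch, output channel,
    output row/column, input channel, filter row/column."""
    OH, OW = H - R + 1, W - S + 1  # output size for stride 1, no padding
    out = [[[[0] * OW for _ in range(OH)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                      # 1: batch
        for k in range(K):                  # 2: output channel
            for oh in range(OH):            # 3: output row
                for ow in range(OW):        # 4: output column
                    for c in range(C):      # 5: input channel
                        for r in range(R):  # 6: filter row
                            for s in range(S):  # 7: filter column
                                out[n][k][oh][ow] += inp[n][c][oh + r][ow + s] * wgt[k][c][r][s]
    return out
```

Which of these loops is tiled on chip and which is streamed determines how often weights and partial sums are reused before touching off-chip memory.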
Relevant Publications:
  • Accelerating PageRank Algorithmic Tasks With a New Programmable Hardware Architecture.
  • OFFLOAD: An Open-Source Framework and System Design Approach for Data Analytics on Distributed Compute Units.
  • Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs.
  • Messaging-Based Intelligent Processing Unit (m-IPU) for Next-Generation AI Computing.
  • Implications of Memory Embedding and Hierarchy on the Performance of MAVeC AI Accelerators.
  • Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators.
  • InTuit: A Novel Algorithmic Approach for Neural Network Mapping onto a Data and Instruction Streamable AI Accelerator (In Progress).
  • High-Speed Drug Response Modeling from Pharmacogenomic Data via Hardware-Accelerated Matrix Factorization (In Progress).

2. CMOS TRANSCEIVER FRONT-END DESIGN

Reliable wireless communication underpins modern IoT and healthcare systems, yet front-end RF blocks (oscillators, frequency dividers, amplifiers, and on-chip power management) face tight trade-offs among frequency range, noise, power, and area. As protocols move from sub-GHz to millimeter-wave bands, our goal is compact, low-power CMOS circuits that hold performance across diverse applications. We design ring-oscillator topologies for ultra-low-power transceivers; explore injection-locked frequency dividers (ring- and LC-based) to widen locking range and lower phase noise; develop low-noise amplifier architectures for mm-wave IoT sensors with balanced gain, bandwidth, linearity, and noise figure; and engineer LDO regulators that deliver high PSRR and fast transient response with minimal quiescent current to stabilize the RF front end. Together, these efforts advance CMOS RF front ends toward energy-efficient, cost-effective solutions for next-generation 5G/6G and large-scale IoT deployments.
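For the ring-oscillator work, a useful first-order design yardstick is that an N-stage ring oscillates at f = 1 / (2 N t_pd), since one period is a full rising-plus-falling traversal of the ring. A small numeric sketch (textbook estimate, not a measurement of our circuits):

```python
def ring_osc_freq(n_stages: int, stage_delay_s: float) -> float:
    """First-order ring-oscillator frequency: one period is a full
    rising + falling traversal of the ring, i.e. 2 * N * t_pd."""
    return 1.0 / (2 * n_stages * stage_delay_s)

# e.g. a 5-stage ring with 20 ps per-stage delay oscillates near 5 GHz
```

The formula shows the core trade-off: fewer, faster stages raise frequency but shrink tuning margin, while added stages cost power and area.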

CMOS LDO
CMOS LNA
CMOS ILFD
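For the ILFD studies, the classical first-order yardstick for how far an injected signal can pull the oscillator is Adler's small-injection approximation, with single-sided locking range roughly (f0 / 2Q) * (I_inj / I_osc). The sketch below is that textbook estimate only, not a model of our specific dividers:

```python
def adler_locking_range(f0_hz: float, q: float, inj_ratio: float) -> float:
    """Single-sided locking range of an injection-locked oscillator,
    per Adler's small-injection approximation (inj_ratio = I_inj / I_osc)."""
    return f0_hz / (2 * q) * inj_ratio

# e.g. a 10 GHz LC tank with Q = 10 and 10% injection locks over ~50 MHz
```

It also explains the ring- vs. LC-based trade-off studied above: the low effective Q of a ring stage widens locking range at the cost of phase noise, while a high-Q LC tank gives the opposite balance.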
Relevant Publications:
  • Ring Oscillator Design in 50 nm CMOS Technology for IoT-Based Remote Infectious Disease Monitoring System.
  • CMOS Low-Dropout Voltage Regulator Design Trends: An Overview.
  • CMOS Low Noise Amplifier Design Trends Towards Millimeter-Wave IoT Sensors.
  • Design Trends of LC-Tank Based CMOS ILFD for SHF and EHF Transceiver Applications.

3. FPGA BASED ECC-CRYPTO ENGINE

Secure communication at IoT scale demands public-key cryptography that is both fast and energy-efficient. Elliptic-curve cryptography (ECC) offers strong security per bit, but practical deployment hinges on a hardware design that minimizes modular-arithmetic cost, avoids control bottlenecks, and resists side-channel leakage. Our work builds a low-latency, low-power ECC engine on FPGA around three ideas: (1) a unified point-operation block that performs point addition and doubling in one module, improving side-channel resilience and reducing control overhead; (2) projective/Jacobian coordinates that eliminate expensive inversions in the main loop; and (3) optimized modular arithmetic, combining Booth radix-4 multiplication with a fast P-256-style reduction and a combined add/subtract unit to cut cycles and area. On a Virtex-5 device, the twisted Edwards (Ed25519) point-multiplication engine reaches ~1.4 ms per 256-bit scalar multiplication at ~118 MHz, with a unified group-operation latency of 646 cycles and a total of ~164.7 k cycles per scalar multiply, demonstrating competitive throughput/area.
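The unified-operation idea rests on a property of twisted Edwards curves: one complete addition formula also handles doubling, so a scalar multiply executes an indistinguishable sequence of identical group operations. The affine Python sketch below illustrates this on Ed25519; it is illustrative only, since the FPGA engine works in projective coordinates precisely to avoid the per-step modular inversions shown here.

```python
# Ed25519: twisted Edwards curve a*x^2 + y^2 = 1 + d*x^2*y^2 over GF(2^255 - 19)
p = 2**255 - 19
a = p - 1                                # a = -1
d = (-121665 * pow(121666, -1, p)) % p   # d = -121665/121666

def unified_op(P, Q):
    """One complete formula for both point addition and doubling (P == Q)."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, -1, p) % p
    y3 = (y1 * y2 - a * x1 * x2) * pow(1 - t, -1, p) % p
    return x3, y3

def scalar_mult(k, P):
    """Left-to-right double-and-add built entirely from unified_op, so every
    step issues the same operation regardless of the secret bit pattern."""
    R = (0, 1)  # identity element of the Edwards group
    for bit in bin(k)[2:]:
        R = unified_op(R, R)       # doubling via the same unified formula
        if bit == '1':
            R = unified_op(R, P)
    return R
```

Because the identity (0, 1) is an ordinary input to the same formula, no special cases leak through timing or control flow, which is the side-channel benefit the unified hardware block targets.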

ECC Protocols
Modular Adder-Subtractor
Booth Radix-4 Multiplication
Modular Reduction
Unified Point Operation
Point Multiplication
Benchmarking
Relevant Publications:
  • Efficient FPGA Implementation of Modular Arithmetic for Elliptic Curve Cryptography.
  • Efficient FPGA Implementation of Unified Point Operation for Twisted Edward Curve Cryptography.
  • Low Latency FPGA Implementation of Twisted Edward Curve Cryptography Hardware Accelerator Over Prime Field.
