Hi, I'm Sharath Pendyala, a PhD student at NC State University, Raleigh.

About Me

I am passionate about pushing the boundaries of hardware design and development. My expertise spans FPGA architecture, real-time signal processing, and high-speed data communication. Over the years, I have worked on innovative projects in 5G, Wi-Fi, satellite communication, and post-quantum cryptography. I thrive on solving complex design and architectural challenges, optimizing performance, and creating scalable solutions. My journey has taken me from designing satellite-based 3G base stations to developing advanced 5G-NR and Wi-Fi physical layers, and now to exploring the intersection of cryptography and hardware security.

Resume_Sharath_Pawan_Pendyala.pdf

Resume

Outrunning the Millennium FALCON: Speed Records for FALCON on FPGAs

FALCON is a NIST-selected post-quantum digital signature scheme whose performance bottleneck lies in SamplerZ, the subroutine for discrete Gaussian sampling. We present a throughput-optimized, full hardware implementation of SamplerZ that introduces several architectural and algorithmic innovations to accelerate signature generation. Our design incorporates a datapath-aware floating-point arithmetic pipeline that balances latency against resource utilization. We introduce a novel polynomial evaluator based on Estrin's scheme to accelerate exponential approximation, and we implement a constant-latency BerExp routine using floating-point exponentiation IP, thereby eliminating the critical-path logic associated with fixed-point decomposition. Additionally, we optimize rejection handling through parallel sampling loops, full loop unrolling, and a speed-optimized flooring circuit, which together enable high-throughput discrete Gaussian sampling. The resulting FPGA implementations of SamplerZ achieve a 55%-71% reduction in sampling latency, leading to a 36%-46% reduction in overall FALCON signature generation latency compared to the current state of the art. Furthermore, our design achieves up to a 48% reduction in the Area-Time Product (ATP) of SamplerZ. This sets a new benchmark for high-throughput, efficient discrete Gaussian sampling and advances the practical deployment of post-quantum lattice-based signatures in high-performance cryptographic hardware.
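For readers unfamiliar with Estrin's scheme, the software sketch below may help show why it suits a pipelined datapath: the multiply-adds within each level are independent of one another, so the dependency chain grows with the logarithm of the polynomial degree rather than linearly as in Horner's rule. This is only an illustration, not the hardware design from the paper. The function names and the coefficient table (a truncated Taylor series for exp(-x)) are assumptions chosen for the example and are not the coefficients used in FALCON or in our implementation.

```python
import math


def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Estrin's scheme.

    coeffs is in ascending order (coeffs[0] is the constant term).
    """
    terms = list(coeffs)
    if not terms:
        return 0.0
    power = x  # holds x, then x^2, then x^4, ... one squaring per level
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        # Each pair (a, b) becomes a + b * power. The pairs do not depend
        # on each other, so hardware could compute a whole level in parallel.
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power
    return terms[0]


def horner(coeffs, x):
    """Reference evaluation with Horner's rule (fully sequential)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


# Illustrative coefficients only: degree-12 Taylor series of exp(-x).
EXP_NEG_COEFFS = [(-1) ** i / math.factorial(i) for i in range(13)]

if __name__ == "__main__":
    x = 0.5
    print(estrin(EXP_NEG_COEFFS, x), horner(EXP_NEG_COEFFS, x), math.exp(-x))
```

For the 13 coefficients above, Estrin's scheme needs four levels of multiply-adds, whereas Horner's rule needs twelve sequential steps.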

Deus_Ex_LLMs.pdf

Deus Ex LLMs

Accepted at VLSI-SOC 2026

Emerging Post-Quantum Cryptographic (PQC) schemes such as FALCON require highly optimized hardware implementations to meet strict area and execution time constraints on embedded devices. Traditional hardware designs depend heavily on expert-crafted Register Transfer Level (RTL) or High-Level Synthesis (HLS) code, which is both time-consuming and prone to errors. This work explores the use of large language models (LLMs) to accelerate the development of cryptographic hardware, focusing on FALCON's performance-critical SamplerZ subroutine. We propose a design flow that iteratively uses LLMs to generate, refine, and evaluate synthesizable C code with HLS tools. We analyze the generated designs across various models (e.g., GPT-4, Claude, Gemini, Grok), compare them with prior hand-crafted RTL designs, and report implementation metrics such as Area-Delay Product (ADP) and synthesis convergence.

Our results show that LLM-generated implementations can come within 4% in execution time and 30% in area of expert-tuned code, while also discovering novel hardware optimizations. Additionally, we identify key challenges in prompt engineering, numerical stability, and testbench overfitting, and we offer actionable recommendations for future AI-assisted hardware design frameworks.

Comparison_of_unified_multiplier_designs_for_FALCON_PQC.pdf

A Comparison of Unified Multiplier Designs for the Falcon Post-Quantum Digital Signature

Accepted at LightSec 2026

This paper evaluates four multiplier architectures—Baseline, Tiling, Comba, and Karatsuba—designed for Falcon, a post-quantum digital signature scheme. The results show that the 4-split Karatsuba multiplier is the most area-efficient on both FPGA and ASIC platforms, while the Tiling approach achieves the best energy efficiency for FPGA designs. Comba offers the highest energy efficiency for ASIC implementations.

The study also includes FPGA-specific architectural optimizations, such as efficient DSP utilization and pipelining, which enhance performance and energy efficiency. These findings provide valuable insights for optimizing Falcon hardware for various design goals, helping designers choose the best multiplier architecture based on the target platform's requirements.
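As background on the Karatsuba approach named above, here is a minimal software sketch of the idea: each split replaces four half-width multiplications with three, at the cost of a few extra additions and subtractions. It is a generic two-way recursive version on non-negative integers, not the 4-split unified multiplier evaluated in the paper. The function name and the `base_bits` threshold are assumptions made for the illustration.

```python
def karatsuba(a, b, base_bits=16):
    """Multiply two non-negative integers using recursive Karatsuba.

    Operands of at most base_bits bits are multiplied directly, standing
    in for a small native multiplier. base_bits must be at least 3 so
    that every recursive call works on strictly narrower operands.
    """
    if a < 0 or b < 0:
        raise ValueError("operands must be non-negative")
    if base_bits < 3:
        raise ValueError("base_bits must be at least 3")

    n = max(a.bit_length(), b.bit_length())
    if n <= base_bits:
        return a * b

    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    lo = karatsuba(a_lo, b_lo, base_bits)  # a_lo * b_lo
    hi = karatsuba(a_hi, b_hi, base_bits)  # a_hi * b_hi
    # One extra multiplication recovers the cross term:
    # (a_lo + a_hi)(b_lo + b_hi) - lo - hi = a_lo*b_hi + a_hi*b_lo
    mid = karatsuba(a_lo + a_hi, b_lo + b_hi, base_bits) - lo - hi

    return (hi << (2 * half)) + (mid << half) + lo


if __name__ == "__main__":
    x, y = 0x123456789ABCDEF, 0xFEDCBA987654321
    print(karatsuba(x, y) == x * y)
```

In hardware the trade-off differs from software, since the saved multiplier is paid for with wider adders and extra routing, which is part of why the comparison in the paper is made per platform.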

Feel free to reach out to me for research collaborations, professional opportunities, or just to connect!

spendya@ncsu.edu

sharathpawan@gmail.com

+1-919-904-8878
