FALCON is a NIST-selected post-quantum digital signature scheme whose performance bottleneck lies in the SamplerZ subroutine for discrete Gaussian sampling. We present a throughput-optimized, full hardware implementation of SamplerZ that introduces several architectural and algorithmic innovations to accelerate signature generation. Our design incorporates a datapath-aware floating-point arithmetic pipeline that balances latency against resource utilization. We introduce a novel polynomial evaluator based on Estrin's scheme to accelerate exponential approximation, and implement a constant-latency BerExp routine using floating-point exponentiation IP, thereby eliminating the critical-path logic associated with fixed-point decomposition. Additionally, we optimize rejection handling through parallel sampling loops, full loop unrolling, and a speed-optimized flooring circuit, which together enable high-throughput discrete Gaussian sampling. The resulting FPGA implementations of SamplerZ reduce sampling latency by 55%-71%, which lowers overall FALCON signature generation latency by 36%-46% compared to the current state of the art. Our design also reduces the Area-Time Product (ATP) of SamplerZ by up to 48%, setting a new benchmark for high-throughput, efficient discrete Gaussian sampling and advancing the practical deployment of post-quantum lattice-based signatures in high-performance cryptographic hardware.
Accepted at VLSI-SOC 2026
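For readers unfamiliar with Estrin's scheme, the following is a minimal software sketch of the general technique, not the paper's hardware evaluator. It pairs adjacent coefficients at each level, so the multiply-adds within a level are independent of each other and could run in parallel, whereas Horner's rule forms one long dependent chain. The function names and the Taylor coefficients in the usage note are illustrative assumptions; they are not the constants used by FALCON's exponential approximation.

```python
import math


def estrin_eval(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) using Estrin's scheme.

    coeffs[0] is the constant term. Each pass combines neighbouring
    pairs (c0 + c1*p, c2 + c3*p, ...) and then squares p, so the number
    of passes grows with log2 of the coefficient count.
    """
    if not coeffs:
        return 0.0
    level = list(coeffs)
    power = x
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad so every term has a partner
        # The pair combinations below do not depend on one another.
        level = [level[i] + level[i + 1] * power
                 for i in range(0, len(level), 2)]
        power = power * power
    return level[0]


def horner_eval(coeffs, x):
    """Reference evaluation with Horner's rule (a sequential chain)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def exp_taylor_coeffs(degree):
    """Illustrative coefficients 1/k! for exp(x); NOT FALCON's constants."""
    return [1.0 / math.factorial(k) for k in range(degree + 1)]
```

As a usage example, `estrin_eval(exp_taylor_coeffs(12), 0.5)` should agree with `math.exp(0.5)` and with `horner_eval` up to floating-point rounding. The two methods perform a similar amount of arithmetic; the difference is the shorter dependency chain, which is the property a pipelined datapath can exploit.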
Emerging Post-Quantum Cryptographic (PQC) schemes such as FALCON require highly optimized hardware implementations to meet strict area and execution time constraints on embedded devices. Traditional hardware designs depend heavily on expert-crafted Register Transfer Level (RTL) or High-Level Synthesis (HLS) code, which is time-consuming to write and prone to errors. This work explores the use of large language models (LLMs) to accelerate the development of cryptographic hardware, focusing on FALCON's performance-critical SamplerZ subroutine. We propose a design flow that iteratively uses LLMs to generate, refine, and evaluate synthesizable C code with HLS tools. We analyze the designs generated by various models (e.g., GPT-4, Claude, Gemini, Grok), compare them with prior hand-crafted RTL designs, and report implementation metrics such as Area-Delay Product (ADP) and synthesis convergence.
Our results show that LLMs can achieve implementations within 4% of the execution time and 30% of the area of expert-tuned code while also discovering novel hardware optimizations. Additionally, we identify key challenges in prompt engineering, numerical stability, and testbench overfitting, and offer actionable recommendations for future AI-assisted hardware design frameworks.
Accepted at LightSec 2026
This paper evaluates four multiplier architectures—Baseline, Tiling, Comba, and Karatsuba—designed for Falcon, a post-quantum digital signature scheme. The results show that the 4-split Karatsuba multiplier is the most area-efficient on both FPGA and ASIC platforms, while the Tiling approach achieves the best energy efficiency for FPGA designs. Comba offers the highest energy efficiency for ASIC implementations.
The study also includes FPGA-specific architectural optimizations, such as efficient DSP utilization and pipelining, which enhance performance and energy efficiency. These findings provide valuable insights for optimizing Falcon hardware for various design goals, helping designers choose the best multiplier architecture based on the target platform's requirements.
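As background on one of the four architectures, the sketch below shows the basic two-way Karatsuba identity in software, assuming non-negative integer operands. It illustrates the general technique only and does not model the 4-split hardware multiplier evaluated in the paper; the function name, the `bits` parameter, and the `BASE_BITS` cutoff are hypothetical choices made for this example.

```python
BASE_BITS = 16  # hypothetical cutoff below which a direct multiply is used


def karatsuba(a, b, bits=64):
    """Multiply non-negative integers a and b with two-way Karatsuba.

    `bits` is a nominal operand width used only to choose the split
    point; the result is exact even if an operand is wider than that.
    """
    if bits <= BASE_BITS:
        return a * b  # small enough: one direct multiplication
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Three half-size products instead of the four a schoolbook split needs.
    lo = karatsuba(a_lo, b_lo, half)
    hi = karatsuba(a_hi, b_hi, bits - half)
    # The sums can carry one extra bit, hence the +1 on the width.
    cross = karatsuba(a_lo + a_hi, b_lo + b_hi, bits - half + 1)
    mid = cross - lo - hi  # equals a_lo*b_hi + a_hi*b_lo
    return (hi << (2 * half)) + (mid << half) + lo
```

For instance, `karatsuba(x, y)` should return the same value as `x * y` for any non-negative integers `x` and `y`. The saving of one sub-multiplication per split comes at the cost of extra additions and subtractions, which is the kind of trade-off that makes the preferred architecture depend on the target platform.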