In this work, we define the syntactic robustness of LLM-based code generation as a model’s ability to produce correct code when given syntactically varied but semantically equivalent prompts, particularly prompts involving mathematical constraints (e.g., rewriting x + y = b as x - b + y = 0). Our findings show that current LLMs often fail to maintain correctness under such variations, indicating a lack of syntactic robustness, and that this weakness can be exploited to induce incorrect or misleading code generation. To mitigate the issue, we propose a reduction-based pre-processing step that normalizes the constraint representation in a prompt before generation.
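To make the normalization step concrete, here is a minimal sketch, assuming sympy and reusing the x + y = b example from above: both syntactic variants of the constraint are reduced to one canonical equation before being placed in the prompt. The helper name and the canonical form are illustrative, not the exact reduction used in the paper.

```python
# Illustrative sketch of a reduction-style normalization pass: semantically
# equivalent constraint phrasings collapse to one canonical form before they
# are embedded in the code-generation prompt.
import sympy as sp

x, y, b = sp.symbols("x y b")

def canonicalize(constraint: sp.Eq) -> sp.Eq:
    """Move all terms to the left-hand side and simplify, so that
    syntactic variants of the same constraint become identical."""
    expr = sp.simplify(sp.expand(constraint.lhs - constraint.rhs))
    return sp.Eq(expr, 0)

# Two syntactic variants of the same constraint...
v1 = sp.Eq(x + y, b)          # "x + y = b"
v2 = sp.Eq(x - b + y, 0)      # "x - b + y = 0"

# ...reduce to the same canonical equation before prompting the LLM.
assert canonicalize(v1) == canonicalize(v2)
print(canonicalize(v1))       # Eq(-b + x + y, 0)
```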
In this project, I introduce partial equivalence analysis for a more refined assessment of non-equivalence between two programs. I propose a quantitative symbolic equivalence analysis technique that identifies the input conditions under which two non-equivalent programs agree or diverge, together with a quantitative measure of how much of the input space they agree on. To improve the scalability of existing model-counting-based techniques, I develop a range-search-based approach that explores intervals of the input space and identifies regions of equivalence and non-equivalence through SAT solver queries. Overall, the approach improves both precision and performance compared to state-of-the-art techniques.
GitHub Link
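As a concrete illustration of the range search, the sketch below (assuming the Z3 Python bindings, z3-solver) bisects a bounded input range and issues one solver query per interval: if no input in the interval makes the two programs disagree, the interval is an equivalence region; otherwise it is split until single diverging points remain. The two toy programs, the bounded range, and the bisection scheme are illustrative assumptions, not the project’s implementation.

```python
# Illustrative range-search sketch: label input intervals as "equivalent"
# (no disagreeing input exists) or "diverging" by querying a solver per interval.
from z3 import BitVec, Solver, If, And, unsat

x = BitVec("x", 32)
p1 = If(x < 10, x * x, 2 * x)    # toy program 1
p2 = If(x <= 10, x * x, 2 * x)   # toy program 2: disagrees only at x == 10

def search(lo, hi, regions):
    """Bisect [lo, hi] until each sub-interval is proven equivalent or
    narrowed down to a single diverging input."""
    s = Solver()
    s.add(And(x >= lo, x <= hi), p1 != p2)
    if s.check() == unsat:
        regions.append((lo, hi, "equivalent"))
    elif lo == hi:
        regions.append((lo, hi, "diverging"))
    else:
        mid = (lo + hi) // 2
        search(lo, mid, regions)
        search(mid + 1, hi, regions)

regions = []
search(-1000, 1000, regions)

# Quantitative measure: how much of the explored range is provably equivalent.
agree = sum(hi - lo + 1 for lo, hi, tag in regions if tag == "equivalent")
print([r for r in regions if r[2] == "diverging"])   # [(10, 10, 'diverging')]
print(agree / 2001)                                   # fraction of agreeing inputs
```

In this sketch, summing the widths of the equivalence intervals replaces a single expensive model-counting call on the full disagreement formula with many cheap per-interval satisfiability queries, which is the intuition behind the scalability gain described above.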
Figure: Best hybrid techniques (Pre-Post and Pre-Guided) vs. state-of-the-art symbolic approaches
In this work, I combined symbolic execution and fuzzing to check program equivalence and non-equivalence. Symbolic execution is precise and can formally prove equivalence, but it is expensive; fuzzing, on the other hand, is scalable but can only demonstrate non-equivalence. To get the best of both worlds, I proposed several hybrid heuristics that differ in where symbolic execution and fuzzing are positioned in the analysis pipeline. One of the key ideas is to use symbolic execution to guide fuzzing toward inputs that are more likely to expose non-equivalence between the two programs (see the sketch below). The evaluation shows that the hybrid approach outperforms state-of-the-art techniques.
GitHub Link
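A minimal sketch of the guidance idea, assuming z3-solver and two toy programs: the path conditions of one program are solved symbolically to obtain seed inputs that sit near branch boundaries, and the fuzzer then mutates around those seeds while comparing the two programs’ outputs, so testing effort is concentrated where divergence is most likely. The programs, seed selection, and mutation strategy are illustrative only, not the exact pipeline from the project.

```python
# Illustrative sketch of symbolic-execution-guided fuzzing: Z3 solves the path
# conditions of one program to seed the fuzzer near branch boundaries, and
# random mutation around those seeds searches for disagreeing inputs.
import random
from z3 import Int, Solver, sat

def prog_a(x):            # reference implementation
    return x * 2 if x > 100 else x + 1

def prog_b(x):            # variant that differs only at the branch boundary
    return x * 2 if x >= 100 else x + 1

# Symbolic phase: solve each path condition of prog_a for one concrete input;
# these inputs form the fuzzer's seed corpus.
xs = Int("x")
seeds = []
for path_cond in (xs > 100, xs <= 100):
    s = Solver()
    s.add(path_cond)
    if s.check() == sat:
        seeds.append(s.model()[xs].as_long())

# Fuzzing phase: mutate around the symbolic seeds, which concentrates testing
# near branch boundaries where divergence is most likely.
random.seed(0)
for seed in seeds:
    for _ in range(1000):
        candidate = seed + random.randint(-5, 5)
        if prog_a(candidate) != prog_b(candidate):
            print("non-equivalence witness:", candidate)
            break
```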