Data Availability
Abstract
Binary reverse engineering is foundational to various tasks such as malware analysis and vulnerability detection. Traditional binary analysis tools mainly operate at the function level. However, modern software has grown significantly in size, with binaries often containing thousands of functions. Without understanding how these functions are organized into higher-level structures, it becomes difficult to effectively support downstream analysis tasks. Analysts must examine thousands of functions separately, making the process time-consuming and error-prone. Despite these challenges, current research on recovering the higher-level structure of binaries remains limited.
To bridge this gap, we propose BinStruct, a novel binary structure recovery framework that recovers both file and module structures from binaries. BinStruct first identifies the file structure by combining data reference patterns, function calls, and semantic understanding from Large Language Models. Then, inspired by software architecture recovery in source code analysis, BinStruct identifies modules by clustering the recovered files using consensus between structural dependency and semantic similarity. Evaluation on 121 real-world stripped binaries demonstrates that BinStruct outperforms state-of-the-art techniques in both file and module recovery accuracy, while requiring only 7.42s and 34.46s on average to recover file and module structures, respectively. Case studies on Libxml2 and PredatorTheStealer demonstrate BinStruct's effectiveness on security tasks like attack surface analysis and malware investigation.
Methodology Overview
Research Questions
RQ1: What is the quality of the recovered file structure?
RQ2: What is the quality of the recovered module structure?
RQ3: How does each part of our design contribute to BinStruct?
RQ4: What is the time and token cost of BinStruct?
RQ5: What is the real-world application of BinStruct?
Main Contributions
We propose a novel framework, BinStruct, that integrates structural analysis with LLM-guided refinement to recover file and module structures from stripped binaries.
We develop a comprehensive recovery approach that first recovers files by combining data reference patterns, function calls, and an LLM's semantic understanding, and then recovers modules by combining structural dependency and semantic similarity.
We evaluate BinStruct on 121 real-world stripped binaries and show that it outperforms existing binary structure recovery tools on both file and module recovery.
We demonstrate BinStruct's practical value through case studies on attack surface analysis and malware investigation.
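To make the module recovery idea in the second contribution more concrete, the following is a minimal sketch of clustering recovered files only where structural dependency and semantic similarity agree. It is not BinStruct's actual algorithm: the function name `consensus_modules`, the thresholds `dep_min` and `sim_min`, the union-find merging strategy, and the toy file names, dependency weights, and embeddings are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consensus_modules(files, dep_weight, embedding, dep_min=0.5, sim_min=0.8):
    """Group files into modules when BOTH signals agree (hypothetical rule).

    files:      list of recovered file identifiers
    dep_weight: dict mapping frozenset({a, b}) to a dependency strength in [0, 1]
    embedding:  dict mapping each file to a semantic vector
    """
    parent = {f: f for f in files}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(files, 2):
        dep = dep_weight.get(frozenset((a, b)), 0.0)
        sim = cosine(embedding[a], embedding[b])
        # Merge only if the structural and semantic views both say "related".
        if dep >= dep_min and sim >= sim_min:
            parent[find(a)] = find(b)

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


# Toy input (illustrative values only, not taken from the evaluation).
files = ["parse", "tree", "http", "ftp"]
deps = {
    frozenset(("parse", "tree")): 0.9,
    frozenset(("http", "ftp")): 0.7,
    frozenset(("tree", "http")): 0.8,  # strong dependency, weak semantic match
}
emb = {
    "parse": [1.0, 0.0],
    "tree": [0.9, 0.1],
    "http": [0.0, 1.0],
    "ftp": [0.1, 0.9],
}
modules = consensus_modules(files, deps, emb)
print(modules)
```

In this toy input, `tree` and `http` have a strong dependency but dissimilar embeddings, so they stay in separate modules. Requiring agreement between the two signals is one simple way to read "consensus"; a real implementation could weight or combine them differently.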