Design Objectives
Variation in Compilation Settings: To enable evaluation under diverse real-world settings such as cross-architecture and cross-optimization level, the dataset should be compiled under various compilation settings.
Diversity in Composition: Upon investigation, we found that the Cisco dataset, proposed in Marcelli et al.'s study and used by HermesSim and CEBin, draws 87% of its test-set functions from the Z3 project, leading to biased results. To better reflect real-world applications, the dataset should therefore include a variety of projects to prevent bias in the results.
Correctness: To provide reliable results, the dataset must be correctly compiled and labeled. In our previous investigation, we found several problems in the Cisco dataset: ten binaries were not compiled as intended (e.g., a binary declared to be MIPS64 was actually compiled for the ARM32 architecture), binaries declared to be non-inlined still contained many inlined functions (caused by the misused compile option '-fno-inline-functions', which in GCC only turns off the function inlining introduced at O3), and some homologous functions were incorrectly labeled as negative pairs due to function renaming by compilers. Therefore, ensuring the correctness of BinAtlas is crucial.
Compilation Settings
2 compilers, 2 versions each: gcc-6 and gcc-10, clang-5 and clang-7
5 optimization levels: O0, O1, O2, O3, Os
2 inline options: with '-fno-inline' or without
4 architectures, 2 bit-widths each: x86 and x64, ARM32 and ARM64, MIPS32 and MIPS64, PPC32 and PPC64
This process resulted in 12,453 binaries and 27,757,965 binary functions. After filtering out functions with fewer than five basic blocks, the dataset comprises 7,339,256 binary functions.
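Multiplying out the settings above gives 320 distinct configurations per project. A minimal sketch of the enumeration (the identifiers are ours, not BinAtlas's actual build scripts):

```python
from itertools import product

# Hypothetical enumeration of the BinAtlas compilation matrix.
COMPILERS = ["gcc-6", "gcc-10", "clang-5", "clang-7"]
OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os"]
INLINE_OPTS = ["", "-fno-inline"]
ARCHS = ["x86", "x64", "arm32", "arm64",
         "mips32", "mips64", "ppc32", "ppc64"]

def all_settings():
    """Yield every (compiler, opt level, inline option, arch) combination."""
    yield from product(COMPILERS, OPT_LEVELS, INLINE_OPTS, ARCHS)

settings = list(all_settings())
print(len(settings))  # 4 * 5 * 2 * 8 = 320
```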
Compositions
To ensure the diversity in BinAtlas’s composition, we considered three key factors:
Functionality: The binaries were compiled from popular projects with different functionalities, including compression, networking, text parsing, databases, image processing, and others.
Size: The size of each project is taken into account to ensure that no individual project overwhelms the dataset, which could lead to biased results.
Language: BFSD tools may also behave differently on C and C++ code. Thus, the selected projects include both C and C++ projects.
Correctness
We found many mistakes in the Cisco dataset used in Marcelli et al.'s study, which raises concerns about the reliability of their results; the details are given in the design objectives section of this page. We therefore further ensure the correctness of BinAtlas using the following methods:
Mitigating mis-compiled architectures: We use the file command to confirm each binary’s architecture.
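A check along these lines can be sketched by matching each label against substrings of `file`'s output. The mapping and helper below are illustrative, and `file`'s exact wording varies across versions:

```python
# Sketch of an architecture sanity check based on `file` output.
# EXPECTED maps our (hypothetical) architecture labels to substrings
# that `file` typically prints for ELF binaries of that architecture.
EXPECTED = {
    "arm32": "ARM,",
    "arm64": "ARM aarch64",
    "mips32": "MIPS32",
    "mips64": "MIPS64",
    "x86": "Intel 80386",
    "x64": "x86-64",
}

def arch_matches(label: str, file_output: str) -> bool:
    """Return True if the `file` output is consistent with the label."""
    return EXPECTED[label] in file_output

# A mislabeled binary like the Cisco dataset's unrar case fails the check:
out = "ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), statically linked"
print(arch_matches("mips64", out))  # False: the binary is actually ARM32
```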
Mitigating unexpected inlining in the non-inlined binaries: To confirm that binaries in the non-inlined subset were compiled without inlining, we analyze debug information to parse inline relationships. Despite these efforts, some binaries in the non-inlined subset contain inlined functions, mostly due to dependencies not inheriting the specified compilation flags. Overall, 24.1% of functions in the inlined subset contain inlined functions, compared to 2.5% in the non-inlined subset. To mitigate the impact of unexpected inlining in the non-inlined subset, we exclude functions containing inlined functions when selecting pairs from the non-inlined subset in the experiment.
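The debug-information check can be sketched as a scan of `objdump --dwarf=info` output for `DW_TAG_inlined_subroutine` entries appearing under a `DW_TAG_subprogram`. This is a simplification of real DWARF traversal (a library such as pyelftools would be more robust), and the helper name is ours:

```python
# Flag functions that contain inlined callees by scanning a textual
# DWARF dump. A subprogram DIE is followed by its DW_AT_name attribute;
# any inlined_subroutine DIE seen afterwards belongs to that function.
def functions_with_inlining(dwarf_dump: str) -> set:
    hosts, current, awaiting_name = set(), None, False
    for line in dwarf_dump.splitlines():
        if "DW_TAG_subprogram" in line:
            awaiting_name = True          # name arrives on a later line
        elif awaiting_name and "DW_AT_name" in line:
            current = line.split(":")[-1].strip()
            awaiting_name = False
        elif "DW_TAG_inlined_subroutine" in line and current:
            hosts.add(current)
    return hosts

dump = """\
 <1><2d>: Abbrev Number: 2 (DW_TAG_subprogram)
    <2e>   DW_AT_name        : main
 <2><50>: Abbrev Number: 5 (DW_TAG_inlined_subroutine)
 <1><90>: Abbrev Number: 2 (DW_TAG_subprogram)
    <91>   DW_AT_name        : helper
"""
print(sorted(functions_with_inlining(dump)))  # ['main']
```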
Improving labeling qualities: We label two binary functions with the same name and source file as a positive pair. Conversely, if two functions have different names and source files, we label them as negative pairs. Including source files in labeling helps reduce false labeling due to compiler-generated variants of a source function.
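The labeling rule above can be sketched as follows. The tuple representation and the handling of the ambiguous middle case (same name but different file, or vice versa) are our assumptions:

```python
# Pair labeling on (function_name, source_file), as described above.
def label_pair(f1, f2):
    """Return 'positive', 'negative', or None when only one attribute
    matches (such pairs are ambiguous and best excluded)."""
    same_name = f1[0] == f2[0]
    same_file = f1[1] == f2[1]
    if same_name and same_file:
        return "positive"
    if not same_name and not same_file:
        return "negative"
    return None  # e.g. two unrelated static functions both named 'init'

print(label_pair(("inflate", "zlib/inflate.c"), ("inflate", "zlib/inflate.c")))  # positive
print(label_pair(("inflate", "zlib/inflate.c"), ("deflate", "zlib/deflate.c")))  # negative
print(label_pair(("init", "a.c"), ("init", "b.c")))  # None
```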
Comparison with Cisco Dataset
54 vulnerable functions
In 9 Source Projects: cJSON, libexpat, libpng, libxml2, lighttpd1.4, nginx, openssl, sqlite, zlib.
All are compiled with default compilation settings on a 64-bit x86 platform.
58 IoT firmware images
13 Manufacturers: ASUS, Belkin, Cisco, D-Link, H3C, Linksys, Motorola, NetGear, Tenda, TOTOLink, TPLink, TRENDnet, Wavlink.
3 Architectures: MIPS32, MIPS64, ARM32.
Containing a total of 12,291 binaries and 3,676,973 functions.
Tool Evaluation Dataset
Our investigation of existing tool evaluation datasets shows that they suffer from several issues, limiting comprehensive evaluation and leading to potentially biased and unreliable results.
Biased Project Selection
Biased project selection can lead to skewed evaluation results. The Cisco dataset includes seven projects, one of which is Z3, a C++ constraint solver whose programming style differs significantly from that of the other projects.
The dataset comprises 789,364 binary functions in total, with 520,003 functions in the test set. Notably, 454,368 of these functions come from Z3, accounting for 57.6% of the entire dataset and 87.4% of the test set. This disproportionate representation introduces a significant imbalance, which may result in substantial bias.
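The quoted percentages follow directly from the counts:

```python
# Reproducing the Z3 share figures from the raw counts quoted above.
total, test_set, z3 = 789_364, 520_003, 454_368
print(f"{z3 / total:.1%}")     # 57.6% of the entire dataset
print(f"{z3 / test_set:.1%}")  # 87.4% of the test set
```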
Misused Compilation Options
The authors of the Cisco dataset claim that they disabled function inlining when evaluating BFSD tools. However, according to our investigation, a large portion of the functions in their dataset are still inlined.
Further analysis indicates that they misused a compilation option when disabling function inlining in GCC: they used '-fno-inline-functions' to disable function inlining in both GCC and Clang. In GCC, however, this option only disables the function-inlining algorithm introduced at O3; the function inlining introduced at O2 is not disabled. The documentation for this compilation option can be found in the GCC documentation:
Binaries are not in the Expected Architecture
Some binaries in the Cisco dataset are not in the architecture they are labeled as. For example, a binary of unrar (mips64-gcc-9-O3_unrar) is actually an ARM32 executable.
Improper Function Pair Labeling Strategy
The Cisco dataset labels function pairs based solely on their names: if two functions share the same name, they are labeled as a positive pair; otherwise, they are labeled as a negative pair. However, this strategy can mislabel homologous functions as negative pairs. For instance, C++ template functions often result in multiple binary instantiations generated by the compiler. Although these instantiations are functionally equivalent—or even byte-for-byte identical in some cases—they are incorrectly labeled as negative pairs in the Cisco dataset.
This mislabeling introduces noise into the ground truth and leads to a systematic underestimation of the performance of Binary Function Similarity Detection (BFSD) tools.
As a result, the evaluation outcomes in studies using the Cisco dataset may not accurately reflect tool capabilities. For example, HermesSim achieved only 43.8% Recall@1 on the Cisco dataset under the noinline XM task (with a pool size of 10,000), whereas it achieved 90% under the same setting on our dataset. This discrepancy does not indicate that our dataset is biased in the opposite direction. Notably, jTrans and CLAP demonstrated similar performance on both our dataset and the BinaryCorp dataset under equivalent evaluation conditions, suggesting that our dataset provides a more balanced and realistic assessment.
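For reference, Recall@1 is the fraction of queries whose true match is ranked first among the pool candidates returned by a tool. A toy sketch with illustrative data:

```python
# Recall@k: a query counts as a hit when its ground-truth match
# appears within the top-k entries of the tool's ranked candidate list.
def recall_at_k(rankings, ground_truth, k=1):
    hits = sum(1 for q, ranked in rankings.items()
               if ground_truth[q] in ranked[:k])
    return hits / len(rankings)

rankings = {"q1": ["f3", "f1"], "q2": ["f7", "f2"], "q3": ["f9", "f4"]}
ground_truth = {"q1": "f3", "q2": "f2", "q3": "f9"}
print(recall_at_k(rankings, ground_truth, k=1))  # 2 of 3 queries hit at rank 1
```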
Other datasets are limited in comprehensiveness:
BinaryCorp (used in jTrans, CLAP): only x86.
Understanding the AI-powered binary code similarity detection (Fu et al.): only 33 ELFs.
Vulnerability Detection Dataset
Vulnerability datasets used in previous studies are often limited in both scale and label quality. Typically, these datasets:
use binaries from only a few devices as the search pool, and
include around ten 1-day vulnerabilities as query functions.
Furthermore, ground-truth labels are often constructed by manually inspecting only the top-10 results returned by each tool.
Such practices result in incomplete and potentially biased ground-truth data, ultimately leading to insufficient and unreliable evaluation of vulnerability detection tools.