RQs

Table of Content

RQ2. Inconsistency Analysis and Tool Combinations

RQ1. Effectiveness Evaluation

A Deep Dive to Explore the Reason for the Performance Decline with the Growth of Pool Size

In this RQ, we observed that the decline in BFSD tool performance slows as the function pool size increases. This finding contradicts our initial intuition, which expected a sharp performance drop with larger pool sizes. However, upon further investigation, we find that this phenomenon can be explained from a probabilistic perspective.

For a randomly selected negative sample, let P represent the probability that it has a higher similarity to the query than the ground truth. Let N denote the total number of negative samples, and n represent the number of negative samples that are ranked higher than the ground truth. Then, n follows a binomial distribution: n ∼ B(P, N).

The rank of the positive sample, denoted as R, is given by n+1. According to the properties of the binomial distribution, we have the following equation:

Since P lies between 0 and 1, the above function is an exponential function that decreases and asymptotically approaches 0 as N increases. However, the rate of decline slows down as N becomes larger. Moreover, when we query Q times, the Recall@1 can be calculated as:

where Pi denotes the probability for the i-th query. The result represents average sum of multiple decreasing exponential curves. For some query functions where 1 − P has a smaller value, the corresponding value decreases rapidly toward zero as the pool size increases. Meanwhile, for other query functions where 1 − P is closer to 1, the decrease occurs more gradually as the pool size grows. This results in the observed pattern in Figure 2, where the curves initially drop sharply and then level off. The more query functions there are with smaller 1 − P values, the faster the initial decline. Conversely, when there are more query functions with larger 1 − P values, the curve stabilizes at a higher and more stable level.

RQ2. Inconsistency Analysis and Tool Combinations

Case Study for the Inconsistent Failure Patterns between HermesSim and DeJina

The key difference between HermesSim and DeJina lies in their underlying representations, which lead them to capture different features from the same binary functions.

HermesSim is based on intra-function dependency relationships and leverages a graph-based model to learn function representations. As a result, it is highly sensitive to structural variations in the code. This sensitivity enables HermesSim to effectively filter out negative samples by detecting differences in their dependency graphs. However, this same characteristic makes it more prone to errors when the dependency graphs of positive pairs diverge significantly due to compiler optimizations.
In contrast, DeJina learns function representations from decompiled code. Its performance relies heavily on high-level syntactic features such as strings and symbol names present in the decompiled output. These features provide robustness to certain structural changes but also introduce dependencies on the quality and consistency of the decompiler's output.

A Case when HermesSim failed while DeJina Succeeded

Query: from x64-clang-15-O0-libcrypto.so.3

Ground Truth: from x86-clang-15-O3-libcrypto.so.3

In this case, the dataflow between the function pair changes significantly across different optimization levels. Consequently, HermesSim assigns a low similarity score of only 0.25, leading to a failure in correctly identifying the match. In contrast, DeJina leverages high-level features such as function names and strings extracted from the decompiled code. These features remain consistent across compilation settings, allowing DeJina to assign a high similarity score of 0.89 and successfully rank the ground truth at top-1.

A Case when HermesSim Succeeded while DeJina Failed

Query: From x86-gcc-6-O0-test_conversions

Ground Truth: x64-gcc-10-Os-test_conversions

In this case, both functions are relatively simple. However, the decompiled code of the query function contains symbol names such as needle, src, and dest, whereas the ground-truth function does not. This inconsistency in symbolic information leads to DeJina assigning a moderate similarity score of 0.63, resulting in a failure to retrieve the correct match. In contrast, HermesSim succeeds in this case by assigning a similarity score of 0.86, as the underlying dependency graphs remain highly similar despite the absence of symbolic cues.

A Simple Ablation Study

To validate whether DeJina's failure was caused by the inconsistency in symbolic information, we conducted a simple ablation study by manually modifying the variable names in the ground-truth decompiled code to match those in the query. After this modification, the similarity score assigned by DeJina increased to 0.75, resulting in the ground-truth function being ranked at the top. This observation highlights DeJina’s sensitivity to symbolic cues in the decompiled code.

Full Results for Combinations

Results for all pairwise combinations:

Results for all triple-tool combinations:

Page updated

Google Sites

Report abuse