However, performing binary semantic analysis directly on assembly code features or control flow graph (CFG) features is challenging, because different architectures use different assembly languages and obfuscation changes the CFG of functions. Both factors hinder deep learning models from understanding program semantics (Haq and Caballero 2021). Therefore, to eliminate differences in assembly code between architectures, existing approaches (Peng et al. 2021; Luo et al. 2019) use deep learning techniques to learn function semantics from intermediate representation (IR) features, which are platform-independent and more abstract than assembly code. Furthermore, Singh (2021) found that compiling source code into a binary file with specific compiler optimization options and then extracting the corresponding binary pseudo-code was more beneficial for code classification and code clone detection. The reason is that deep learning-based approaches have achieved impressive success in source code clone detection (Fang et al. 2020; Zhang et al. 2019; Alon et al. 2019), and pseudo-code, which decompiler tools can extract from a binary executable, is similar to source code. However, to the best of our knowledge, no existing BCSA work uses pseudo-code to extract features and match functions.
In contrast, we can use a decompiler tool to obtain binary pseudo-code, which is very similar to the source code; pseudo-code snippets #2 and #3 correspond to Fig. 8a, b respectively. As shown in Fig. 2, the pseudo-code has a more uniform style than the binary code, and the pseudo-code for both 64-bit and 32-bit programs is similar to the source code (#1 in Fig. 2). In addition, the pseudo-code retains more semantic features and is more syntactically uniform. Thus, if we had access to the corresponding pseudo-code, we would not need to consider the challenges posed by different compilers, compilation optimization options, and instruction architectures. This observation led us to explore the feasibility of extracting binary pseudo-code for binary code similarity analysis.
As shown in Fig. 3, we use Txl to parse the pseudo-code extracted by the decompiler and to extract the corresponding text information and string information from it. Since pseudo-code, like source code, has natural language properties (Hindle et al. 2016), we treat the pseudo-code as a text sequence without considering the structural features in the code; our experiments showed that the overall structural features of the code could be learned by a global deep convolutional network. We also did not normalize the pseudo-code, because previous work (Singh 2021) has shown that some features in the source code are smoothed out once the source code has been compiled, and features such as variable names and variable definitions are already normalized by the decompilation tool. Finally, we found that string features in the decompiled code are also important for understanding function semantics, so we extracted the string features separately, converted them into token sequences, and used a deep learning network to determine the similarity between two strings.
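The preprocessing described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used in this work: it replaces the Txl grammar with simple regular expressions, and the function names (`extract_strings`, `tokenize`, `build_vocab`, `encode`), the special tokens `<pad>` and `<unk>`, and the sample pseudo-code are all hypothetical.

```python
import re

# Assumption: regular expressions stand in for the Txl-based parsing step.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')
TOKEN_RE = re.compile(
    r'[A-Za-z_]\w*'            # identifiers and keywords
    r'|0[xX][0-9A-Fa-f]+'      # hexadecimal constants
    r'|\d+'                    # decimal constants
    r'|"(?:\\.|[^"\\])*"'      # string literals kept as one token
    r'|[^\sA-Za-z0-9_]'        # any other single non-space character
)

def extract_strings(pseudo_code):
    """Return the contents of all string literals, without the quotes."""
    return [m.group(0)[1:-1] for m in STRING_RE.finditer(pseudo_code)]

def tokenize(pseudo_code):
    """Treat the pseudo-code as a flat text sequence of tokens."""
    return TOKEN_RE.findall(pseudo_code)

def build_vocab(token_lists):
    """Map each distinct token to an integer id (0 = padding, 1 = unknown)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    """Convert tokens to a fixed-length id sequence (truncate, then pad)."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Hypothetical decompiler output, used only for illustration.
sample = ('int __cdecl check(char *a1) { if ( !a1 ) '
          'return puts("null input"); return strlen(a1) > 0x10; }')

code_tokens = tokenize(sample)
string_features = extract_strings(sample)
vocab = build_vocab([code_tokens])
code_ids = encode(code_tokens, vocab, max_len=64)
```

Here `code_ids` plays the role of the text-sequence input and `string_features` the separately extracted string input; in a full system each would be fed to its own network, which this sketch does not attempt to reproduce.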
Singh (2021) proposes a technique for clone detection using compiler optimization. They compiled the source code into a binary executable with compiler optimization options enabled and then converted it into decompiled code with a decompiler tool. They found that compilation optimization smoothed out high-level differences between source codes written for the same task, making the programs more similar in structure and therefore more amenable to code classification and code clone detection. Our work differs from theirs in that we extract the decompiled code of binaries for binary code similarity detection.