XBA

Improving Cross-Platform Binary Analysis using Representation Learning via Graph Alignment


Geunwoo Kim¹, Sanghyun Hong², Michael Franz¹, Dokyung Song³

¹UCI, ²Oregon State University, ³Yonsei University

Summary

Platform-agnostic binary code embeddings are useful for conducting binary analysis across different platforms. However, prior learning-based approaches generate binary code embeddings either in a supervised manner that relies on high-quality labeled data or in an unsupervised manner that does not generalize across platforms. We present XBA, which takes a semi-supervised approach to generate binary code embeddings that are semantically richer and better aligned across platforms than those of prior work. The key idea is to collect peripheral information about a piece of binary code by passing messages from its neighbors in our proposed graph representation of a binary, which we call a binary disassembly graph. Our use of graph convolutional networks enables this message-passing process with only a limited amount of labeled data and unifies the embedding dimension across different platforms. Our evaluation shows that XBA can generate semantically rich binary code embeddings that are well aligned across platforms.

Motivation

This paper considers the problem of learning platform-agnostic binary code embeddings, which can greatly facilitate cross-platform binary analysis. Software today runs on a variety of platforms (e.g., IoT firmware), spanning multiple operating systems (OSs) and instruction set architectures (ISAs), and each binary compiled for a different platform contains many platform-specific components, such as invocations of system calls and assembly code handwritten for specific ISAs. This platform heterogeneity requires platform-specific knowledge for analysis, making it difficult to transfer one's analysis efforts from one platform to another. Platform-agnostic embeddings, therefore, are useful in that they enable downstream binary analysis tasks to be performed in a cross-platform manner.

An example pair of platform-specific binary code components having the same semantics is shown in Figure 1. For this pair, state-of-the-art embedding approaches do not produce embeddings well-aligned across platforms.


XBA overview

Following prior supervised approaches, we use a Siamese architecture, where the outputs of multiple identical neural networks, sharing their parameters and configuration, are fed into a final comparator network. Disassembled binaries compiled for multiple different platforms (here, we illustrate the two-platform case) are first converted into typed graphs that we call binary disassembly graphs (BDGs), an example of which is shown in Fig. 2. A BDG takes the entities found in a disassembled binary as its nodes, i.e., ➀ basic blocks, ➁ external functions, and ➂ string literals, and the relations between them as its edges, i.e., ❶ intra-procedural control-flow transfers, ❷ direct calls, ❸ address-taking code-to-code references, and ❹ code-to-string references.
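
Below is a minimal sketch of how such a typed graph could be assembled. It uses networkx for illustration only; the node and edge identifiers are hypothetical, not XBA's actual naming scheme.

import networkx as nx

# A BDG as a typed multigraph (illustrative; names are our own).
G = nx.MultiDiGraph()

# Nodes: typed entities recovered from the disassembled binary.
G.add_node("bb_0x401000", type="basic_block")         # ➀ basic block
G.add_node("bb_0x401020", type="basic_block")
G.add_node("memcpy@plt", type="external_function")    # ➁ external function
G.add_node("str_fmt", type="string_literal")          # ➂ string literal

# Edges: typed relations between entities.
G.add_edge("bb_0x401000", "bb_0x401020", rel="cf_transfer")     # ❶ intra-procedural control flow
G.add_edge("bb_0x401020", "memcpy@plt", rel="direct_call")      # ❷ direct call
G.add_edge("bb_0x401000", "bb_0x401020", rel="addr_taken")      # ❸ address-taking code-to-code ref
G.add_edge("bb_0x401020", "str_fmt", rel="code_to_string")      # ❹ code-to-string reference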

The encoded contextual information is fed into a Graph Convolutional Network (GCN) as an adjacency matrix, along with each entity's attribute features, and the GCN outputs a fixed-size numeric vector, i.e., an embedding, for each entity. Cross-platform binary similarity analysis can then be performed in this unified embedding space, using measures such as Manhattan distance and cosine similarity.
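
The sketch below illustrates this stage. It assumes the standard Kipf-Welling GCN propagation rule with toy dimensions and random weights, so it is a sketch of the idea rather than XBA's actual implementation.

import numpy as np

def gcn_layer(A, X, W):
    # One GCN layer: aggregate each entity's neighbor features through the
    # normalized adjacency matrix, then apply a learned projection and ReLU.
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy BDG: 4 entities, 8-dim attribute features, 16-dim embeddings.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)       # symmetrized adjacency
X = rng.normal(size=(4, 8))                     # per-entity attribute features
W = rng.normal(size=(8, 16))                    # projection weights (learned in practice)
Z = gcn_layer(A, X, W)                          # one embedding per entity

# Cross-platform comparison in the unified embedding space.
def manhattan(u, v):
    return np.abs(u - v).sum()

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))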


Evaluation

Benchmark Datasets

RQ1. Does XBA produce similar embeddings for basic blocks that are not pre-aligned, i.e., not included in the seed alignments?

In all the cross-OS and cross-ISA evaluations, XBA outperforms the baseline approaches.

RQ2. Do the binary code embedding vectors learned with XBA encode useful information for cross-OS binary analysis?

We then examine how accurate XBA is in predicting new alignments unseen during training. The experiments shown in Table 6 were designed to validate XBA, whereas those shown in this section reflect actual testing scenarios. Here, we hypothesize that, because XBA is a semi-supervised learning approach and uses BDGs that contain rich contextual information between entities, it can accurately predict alignments for unseen samples.

RQ3. Can an XBA model generalize across binaries unseen during training?

Across the board, we observe that XBA improves on the BoW-Encoding baseline in out-of-distribution (OOD) alignment predictions.

RQ4. How much supervision, i.e., how many seed alignments, is required to train an XBA model?

We observe that, in all four directions of our cross-platform alignment tasks, the prediction accuracy of XBA is consistent across the board. The accuracy difference between the 10% and 50% cases is less than 6%. The results suggest that XBA effectively captures contextual information between entities in BDGs even with a small amount of pre-aligned data.

RQ5. How much does each type of relation contribute to the performance of XBA?

We investigate the importance of each type of relation in generating semantically rich and aligned binary code embeddings. We ablate each relation type by setting the weights of all relations of that type in the adjacency matrix to zero during alignment inference. It turns out that the control-flow transfer relations (❶ and ❷) contribute most of the performance, followed by code-to-string references (❹) and address-taking code-to-code references (❸).
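
The following sketch shows how this ablation can be expressed, reusing the toy A, X, W, and gcn_layer from the sketch above; the edges_by_rel bookkeeping (relation type to node-index pairs) is our own hypothetical addition.

# Zero out every adjacency entry of one relation type at inference time.
edges_by_rel = {
    "cf_transfer":    [(0, 1)],   # ❶
    "direct_call":    [(1, 2)],   # ❷
    "addr_taken":     [(0, 3)],   # ❸
    "code_to_string": [(1, 3)],   # ❹
}

def ablate_relation(A, edges_by_rel, rel):
    # Return a copy of A with all edges of the given relation type removed.
    A_abl = A.copy()
    for i, j in edges_by_rel[rel]:
        A_abl[i, j] = A_abl[j, i] = 0.0          # symmetrized adjacency
    return A_abl

# e.g., alignment inference without intra-procedural control-flow edges:
Z_abl = gcn_layer(ablate_relation(A, edges_by_rel, "cf_transfer"), X, W)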

Cite

[pdf] [code] [talk]

Team

Acknowledgments