A Tool To Boost Embedding Quality With IR-Code
IRGen is a search-based framework that finds customized optimization sequences which can substantially improve embedding quality compared with learning only from source code or from IR generated by default optimization levels.
IRGen is generic, lightweight, and extensible, and is built on top of the LLVM framework. You can find more in LLVM details.
In this part, we introduce the details of:
1) How a common neural program embedding pipeline works
2) How IRGen works, using a genetic algorithm (GA)
3) How we use LLVM IR to help this process
Common neural program embedding pipeline
Neural program embedding, as illustrated on the left, converts discrete source code into continuous, numerical embedding vectors, with the end goal of facilitating a variety of learning-based program analysis tasks.
You can refer to this for more details about Data augmentation.
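To make the pipeline concrete, below is a minimal, self-contained sketch in PyTorch (not the actual CodeCMR or IRGen implementation) of how discrete token ids of a program can be turned into one continuous embedding vector; the encoder architecture, vocabulary size, and dimensions here are illustrative placeholders.

# Hypothetical toy encoder: discrete code tokens -> one embedding vector per program.
import torch
import torch.nn as nn

class TinyProgramEncoder(nn.Module):
    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)         # token id -> vector
        self.encoder = nn.GRU(dim, dim, batch_first=True)  # sequence encoder
        self.proj = nn.Linear(dim, dim)                    # final program embedding

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embed(token_ids)
        _, h = self.encoder(x)                             # (1, batch, dim)
        return self.proj(h.squeeze(0))                     # (batch, dim)

# Two toy "programs" as integer token sequences become two continuous vectors,
# usable for clone detection, code search, and other learning-based analyses.
tokens = torch.randint(0, 5000, (2, 256))
embeddings = TinyProgramEncoder()(tokens)                  # shape: (2, 128)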
The workflow of IRGen.
Under the guidance of our fitness function, IRGen begins a metaheuristic search process to iteratively mutate and select optimization sequences with high fitness scores.
You can find more about GA in An example of GA with code.
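For readers who prefer code to prose, here is a minimal GA sketch. The flag list, population size, and the placeholder fitness function are illustrative assumptions only; IRGen's real fitness is computed from the embedding quality obtained on IR compiled with each candidate sequence.

# Minimal GA over optimization-flag sequences (illustrative sketch, not IRGen's code).
import random

FLAGS = ["-mem2reg", "-reassociate", "-loop-rotate", "-gvn", "-sccp", "-simplifycfg"]

def fitness(seq):
    # Placeholder: stands in for "compile IR with `seq`, evaluate the embedding
    # model on it, and return a quality score (e.g., MAP)".
    return len(set(seq)) + random.random()

def mutate(seq):
    seq = list(seq)
    op = random.choice(["replace", "insert", "delete"])
    if op == "replace" and seq:
        seq[random.randrange(len(seq))] = random.choice(FLAGS)
    elif op == "insert":
        seq.insert(random.randrange(len(seq) + 1), random.choice(FLAGS))
    elif op == "delete" and len(seq) > 1:
        seq.pop(random.randrange(len(seq)))
    return seq

def crossover(a, b):
    cut_a, cut_b = random.randrange(len(a) + 1), random.randrange(len(b) + 1)
    return a[:cut_a] + b[cut_b:]

def search(pop_size=20, iterations=100):
    population = [[random.choice(FLAGS) for _ in range(8)] for _ in range(pop_size)]
    for _ in range(iterations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: pop_size // 2]                 # keep high-fitness sequences
        children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children                   # next generation
    return max(population, key=fitness)

print(search())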
Learning from multiple collections of IR code with Triplet Loss
As depicted in the procedure, the source code and a set of IR code compiled with the surviving optimization sequences are fed to our embedding model for learning and fine-tuning w.r.t. the triplet loss.
More about CodeCMR, from Tencent KEEN LAB.
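The following is a minimal sketch of the triplet loss itself. How anchors, positives, and negatives are actually paired, and which encoders produce the vectors, follow the paper; the tensors below are random stand-ins for source-code and IR embeddings.

# Triplet loss: pull the source embedding toward IR of the same program,
# push it away from IR of a different program.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

anchor   = torch.randn(32, 128)  # source-code embeddings (stand-ins)
positive = torch.randn(32, 128)  # IR embeddings of the same programs
negative = torch.randn(32, 128)  # IR embeddings of different programs
print(triplet_loss(anchor, positive, negative))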
DPCNN - Key structure in CodeCMR
CodeCMR is a variation of DPCNN, which has been shown to efficiently represent long-range associations in text. As shown on the left, the key building block, a word-level convolutional neural network (CNN), can be duplicated until the network is sufficiently deep to capture global representations of the input text.
More details about DPCNN can be found here.
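Below is a minimal sketch of the repeatable DPCNN building block, following the original DPCNN design rather than CodeCMR's exact code: two word-level convolutions with a residual shortcut, plus stride-2 pooling so that stacking blocks keeps enlarging the receptive field until it covers the whole input.

# One repeatable DPCNN block (illustrative sketch, not CodeCMR's implementation).
import torch
import torch.nn as nn

class DPCNNBlock(nn.Module):
    def __init__(self, channels=250):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)  # halve sequence length
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):                  # x: (batch, channels, seq_len)
        x = self.pool(x)
        residual = x
        x = self.conv1(self.act(x))        # pre-activation word-level convolutions
        x = self.conv2(self.act(x))
        return x + residual                # shortcut keeps gradients flowing when stacked deep

x = torch.randn(4, 250, 256)               # token features of 4 toy programs
blocks = nn.Sequential(*[DPCNNBlock() for _ in range(4)])
print(blocks(x).shape)                      # sequence length shrinks with every block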
IRGen uses the POJ-104 and GCJ datasets for evaluation. The table below reports the statistics of these two datasets. As mentioned in the paper, the POJ-104 dataset contains 44,912 C/C++ programs that implement entry-level programming assignments for 104 different tasks (e.g., merge sort vs. two sum). Google Code Jam (GCJ) is an international programming competition run by Google since 2008; the GCJ dataset is another popular dataset, containing 260,901 C programs that solve GCJ programming problems. Compared to POJ-104, GCJ has more files, and each individual C source file is longer. We also find that GCJ contains complex usage of C macros.
Special tips: POJ-104 ships its programs as .txt files; we provide the source files in compilable '.c/.cpp' format. The same applies to GCJ.
You can find more in Anonymized Github repo
Code can be found in Anonymized Code on Github
Complementary Materials for RQ1: Efficiency of IRGen
Table 4 reports the evaluation results of the baseline models in lines 3–11. For neural embedding models, we also report the min/max values in parentheses. In accordance with our research motivation (Sec. 3), we also report results using CodeCMR augmented with LLVM IR code optimized by the standard compiler optimization bundles (-O0, -O1, -O2, -O3, -Os). The last row reports CodeCMR augmented with six collections of LLVM IR optimized by sequences formed by IRGen. For both the POJ-104 and GCJ datasets, we use MAP, AP, and AUPRG as the metrics to assess the performance of different models. It is shown that de facto neural code embedding tools, including CodeBERT, CodeCMR, and MISIM-GNN, largely outperform conventional code matching tools like Aroma-Dot and Aroma-Cos. When learning over C source code, we find that CodeBERT is the best-performing model on the POJ-104 dataset, while MISIM-GNN performs best on the GCJ dataset. In contrast, CodeCMR performs worse than both SOTA models on all metrics. On the other hand, we find that CodeCMR, when learning from LLVM IR optimized by standard compiler optimization bundles, already outperforms the SOTA models on the GCJ dataset by over 10% in MAP score. Evaluation on the POJ-104 dataset shows consistently promising enhancement over SOTA models in most settings.
Accuracy of all (augmented) models. For each metric, we mark the best model when training with C source code. We also mark the best model when training with C source code and LLVM IR code optimized following different schemes. IRGen denotes training CodeCMR using source code and six collections of LLVM IR optimized by sequences formed by IRGen.
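As a reference point, the sketch below shows one common way to compute a retrieval-style MAP score from embeddings and task labels, with programs of the same task counted as relevant. The exact MAP/AP/AUPRG definitions used in the evaluation follow the paper and its references; this code is only an illustration.

# Retrieval-style mean average precision over toy embeddings (illustrative sketch).
import numpy as np

def mean_average_precision(embeddings, labels):
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)             # never retrieve the query itself
    aps = []
    for i, label in enumerate(labels):
        order = np.argsort(-sims[i])             # candidates ranked by similarity
        relevant = (labels[order] == label)
        if relevant.sum() == 0:
            continue
        hits = np.cumsum(relevant)
        precision_at_k = hits / (np.arange(len(order)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

emb = np.random.randn(100, 64)                   # toy embeddings
lab = np.random.randint(0, 10, size=100)         # toy task labels
print(mean_average_precision(emb, lab))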
Complementary Materials for RQ2: Fitness Function of IRGen
We benchmark the efficiency of our fitness function. Recall that we run IRGen for 800 iterations. Fig. 6 reports the fitness score increase across the entire GA campaigns. Despite their diverse functionality, the test cases manifest encouraging and consistent trends during the search: the fitness score keeps increasing and reaches a saturation point after about 230 to 600 iterations. Indeed, we report that with a sufficient number of iterations, IRGen can find well-performing sequences for all the test cases.
Smoothed fitness score increases over 800 iterations
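The curve in the figure is a smoothed version of the per-iteration fitness log. As an illustration only (the exact smoothing used for the figure is not specified here), a moving average plus a simple plateau check can reproduce this kind of plot:

# Smooth a noisy fitness log and locate where it saturates (illustrative assumption).
import numpy as np

def smooth(scores, window=25):
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="valid")       # moving average

def saturation_point(smoothed):
    target = smoothed.max() * 0.99                          # within 1% of the plateau
    return int(np.argmax(smoothed >= target))

# Toy fitness log shaped like the figure: fast early gains, then a plateau.
scores = np.minimum(np.arange(800) / 400.0, 1.0) + np.random.normal(0, 0.02, 800)
print(saturation_point(smooth(scores)))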
Complementary Materials for RQ3: Potency of Optimization Flags
This section answers RQ3 by measuring the potency of selected optimization flags. We report that the top-1 optimization sequence S contains in total 49 optimization flags.
To measure their contribution to boosting representation learning, we first train CodeCMR using C source code and LLVM IR optimized with sequence S, and record the baseline accuracy as acc. Then, we iteratively discard one optimization pass p from S and measure the augmentation effectiveness of the remaining 48-flag sequence. The accuracy drop reveals the contribution of flag p.
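A sketch of this leave-one-out procedure is given below. Here train_and_eval is a hypothetical placeholder for "train CodeCMR with source code plus IR compiled under the given flag sequence and return its accuracy"; it is not part of IRGen's API.

# Leave-one-out contribution of each optimization flag (illustrative sketch).
def flag_contributions(sequence, train_and_eval):
    baseline = train_and_eval(sequence)                     # accuracy with all flags (acc)
    contributions = {}
    for i, flag in enumerate(sequence):
        reduced = sequence[:i] + sequence[i + 1:]           # drop one pass p from S
        contributions[(i, flag)] = baseline - train_and_eval(reduced)
    # Larger accuracy drop means a larger contribution of the removed flag.
    return dict(sorted(contributions.items(), key=lambda kv: -kv[1]))

# Toy usage with a dummy evaluator (real usage would retrain and re-evaluate the model):
demo = flag_contributions(["-mem2reg", "-gvn", "-sccp"],
                          lambda seq: 0.5 + 0.01 * len(seq))
print(demo)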
Ordered contributions of each optimization flag.
We also give the detailed contributions in the table on the right, in addition to the paper. Unlike BinTuner, our top-10 flags account for only 34.38% of the whole contribution. Instead of identifying one or a few dominating flags that significantly enhance code embedding, it is rather the sequence of optimization flags that really matters.
We put these flags into three categories as follows:
1) Simplify an IR Statement.
-reassociate: This pass reassociates commutative expressions in an order that is designed to promote better constant propagation. For example: 4 + (x + 5) ⇒ x + (4 + 5). More details can be found in the code example below.
2) Simplify IR Statement Sequences.
-mem2reg: This pass promotes alloca instructions which only have loads and stores as uses.
Example source code:
int foo(int x, int cond)
{
    if (cond > 0)
        x = 1;
    else
        x = -1;
    return x;
}
LLVM-IR code without -mem2reg
define dso_local i32 @foo(i32 %x, i32 %cond) #0 {
entry:
  %x.addr = alloca i32, align 4
  %cond.addr = alloca i32, align 4
  store i32 %x, i32* %x.addr, align 4
  store i32 %cond, i32* %cond.addr, align 4
  %0 = load i32, i32* %cond.addr, align 4
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %if.then, label %if.else

if.then:                                ; preds = %entry
  store i32 1, i32* %x.addr, align 4
  br label %if.end

if.else:                                ; preds = %entry
  store i32 -1, i32* %x.addr, align 4
  br label %if.end

if.end:                                 ; preds = %if.else, %if.then
  %1 = load i32, i32* %x.addr, align 4
  ret i32 %1
}
LLVM-IR code with -mem2reg
define dso_local i32 @foo(i32 %x, i32 %cond) #0 {
entry:
  %cmp = icmp sgt i32 %cond, 0
  br i1 %cmp, label %if.then, label %if.else

if.then:                                ; preds = %entry
  br label %if.end

if.else:                                ; preds = %entry
  br label %if.end

if.end:                                 ; preds = %if.else, %if.then
  %x.addr.0 = phi i32 [ 1, %if.then ], [ -1, %if.else ]
  ret i32 %x.addr.0
}
We can see that the load/store/alloca instructions are all turned into real SSA form, which removes useless instructions and boosts the quality of the IR code.
3) Simplify CFG.
-loop-rotate: converts loops into do/while style loops.
A before/after example of this conversion (a loop converted into its rotated do/while form) is taken from LLVM_LOOP_ROTATE.
Optimization Flags on GCJ and POJ
You can find the 49 passes selected on POJ, the 6 optimal sequences selected on POJ, and the 50 passes selected on GCJ; all the information is shown in the online document below:
Based on the optimization flags we selected, the figure on the right shows the size of the intersection set (number of shared flags) as the number of sequences increases.
We also show the MAP@R scores when training with IR from the top-K sequences.
Complementary Materials for RQ4: Generalization
As aforementioned, the augmentation delivered by IRGen is independent of any particular embedding model design. We empirically demonstrate the generalization of IRGen by augmenting another popular neural embedding model, ncc. As introduced before, ncc performs representation learning directly over LLVM IR code; therefore, we did not need to change the implementation of ncc.
Complementary Materials for RQ5: Augmentation with Small Training Data
We consider a practical scenario where representation learning is hampered by a limited amount of training data. We argue that this is indeed a common situation, e.g., in vulnerability analysis, where we only have very limited samples of vulnerable code.
We find that by employing compiler optimization flags, it is feasible to create a considerable amount of extra training data, in IR code format, to augment embedding models. Compared with standard compiler optimization bundles, the optimization sequences selected by IRGen result in even higher enhancement.
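As an illustration of how such extra IR-format samples can be produced, the sketch below compiles one C file into several IR variants with clang and opt. The sequences listed are illustrative only (not IRGen's searched sequences), and the flag syntax assumes an LLVM release that still accepts legacy pass names (newer releases use opt -passes=...).

# Produce multiple IR training samples from one C file (illustrative sketch).
import subprocess
from pathlib import Path

SEQUENCES = [
    ["-mem2reg", "-reassociate", "-simplifycfg"],   # illustrative sequences only
    ["-mem2reg", "-loop-rotate", "-gvn"],
]

def emit_ir_variants(c_file, out_dir="ir_variants"):
    Path(out_dir).mkdir(exist_ok=True)
    base_ll = Path(out_dir) / "base.ll"
    # Unoptimized IR; -Xclang -disable-O0-optnone keeps functions optimizable by opt.
    subprocess.run(["clang", "-S", "-emit-llvm", "-O0", "-Xclang",
                    "-disable-O0-optnone", c_file, "-o", str(base_ll)], check=True)
    for i, seq in enumerate(SEQUENCES):
        out_ll = Path(out_dir) / f"variant_{i}.ll"
        subprocess.run(["opt", "-S", *seq, str(base_ll), "-o", str(out_ll)], check=True)
    return sorted(Path(out_dir).glob("variant_*.ll"))

# Each emitted .ll file is an extra training sample describing the same program:
# emit_ir_variants("example.c")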
Performance on POJ-104 with different sizes of training data.
If you want to see more cases about the flags we select, click the button.
If you want to see the full version of the code examples in the paper, click the button.