We performed an empirical study of 100 C/C++ submissions from Google Code Jam (GCJ) to understand the difficulties in rule-based automatic recompilation. We extended the rules from [1] and then investigated the root causes of failures when recompiling the (pseudo-)source code derived from decompiler outputs. During our analysis, we observed three common types of error when recompiling with DecRule, listed as follows:
1. Specification error
This type of error usually arises in the code generation phase of the decompiler. While the decompiler may correctly identify function calls, the decompiled code does not match the API specifications. A typical error arises in calls involving stdin and stdout. This can potentially be attributed to errors in function signature recovery. A typical example is shown as follows:
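The original figure is not reproduced in this text. As a stand-in, the sketch below shows one way such a mismatch could look; it is a hypothetical illustration, not the decompiler output examined in the study. The commented-out call, the `__int64` cast, and the helper name `write_case` are all assumptions introduced here.

```cpp
#include <cstdio>

// Hypothetical decompiler-style call (kept as a comment because it does not
// compile as C++): the stream argument is typed as an integer, so the call
// no longer matches the signature int fputs(const char *, FILE *).
//
//   fputs("Case #1: 42\n", (__int64)stdout);
//
// Form that matches the API specification: the stream is passed as FILE *.
// write_case is an illustrative helper, not a function from the study.
int write_case(std::FILE *out, int case_no, int answer) {
    // fprintf returns the number of characters written on success.
    return std::fprintf(out, "Case #%d: %d\n", case_no, answer);
}
```

Calling `write_case(stdout, 1, 42)` writes one result line in the GCJ output format. A rule that repairs this kind of error would need to restore the `FILE *` type of the stream argument.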
2. Inference error
During compilation, some information crucial for humans to understand the program is discarded. Decompilers have to infer such information from the assembly instructions. However, this process is error-prone; decompilers may fail to infer syntactical information that is necessary for recompilation. We illustrate two types of inference error as follows:
Type inference error:
As shown in (a), the actual type of the array in the source code is const int. However, the decompiler infers it as uint32 (shown in (b)). Because the array contains negative numbers, this triggers a “narrowing conversion” error when the output is compiled with default compilation options.
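The narrowing problem can be reproduced in a few lines. The sketch below is a minimal illustration under assumed names and values (`dx`, `dx_u32`, `step`, and the array contents are not taken from the study). The failing declaration is kept in a comment so that the block still compiles.

```cpp
#include <cstdint>

// Decompiler-style declaration (comment only). In C++11 and later, list
// initialization forbids narrowing, so a negative constant in an unsigned
// array is rejected:
//
//   std::uint32_t dx[4] = {1, -1, 0, 0};   // narrowing conversion of -1
//
// Source-style declaration: the signed element type accepts -1 directly.
const int dx[4] = {1, -1, 0, 0};

// One possible repair when the unsigned type must be kept: spell out the
// wrapped value explicitly so that no implicit narrowing takes place.
const std::uint32_t dx_u32[4] = {1u, 0xFFFFFFFFu, 0u, 0u};

// Illustrative use of the table: move x by the i-th offset.
int step(int x, int i) { return x + dx[i]; }
```

Restoring the signed type keeps the recompiled code closest to the source; the explicit unsigned constants are an alternative when a rule cannot recover signedness.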
Buffer size inference error:
Another common error originates from the inference of array size. As shown in (a), the array size is hardcoded to 20 in the source code. However, the decompiler cannot infer the array size (see (b), line 1), which triggers the “storage size of isn’t known” error. Because the array access index i is determined by the user input “N” (dword_202288 in the decompiled output), the decompiler fails to infer the size of the array with range analysis. Hence, the array size is left blank in the decompilation output.
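A minimal sketch of this situation follows. Only the size 20 comes from the text above; the names `arr`, `kMaxN`, and `fill_and_sum`, as well as the bounds check, are assumptions added for illustration. The failing declaration is again shown as a comment.

```cpp
// Decompiler-style declaration (comment only): the size is left blank, so a
// C++ compiler cannot determine how much storage to reserve for the array.
//
//   int arr[];
//
// Source-style declaration: the size is hardcoded, as in the original program.
const int kMaxN = 20;
int arr[kMaxN];

// The loop bound n plays the role of the user input N. Because n is only
// known at run time, the loop itself gives no static upper bound on the
// index i, which is why range analysis cannot recover the value 20.
// The explicit bounds check is an addition for safety in this sketch.
int fill_and_sum(int n) {
    if (n < 0 || n > kMaxN) return -1;
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        arr[i] = i + 1;
        sum += arr[i];
    }
    return sum;
}
```

Any repair rule for this error has to supply a size from outside the loop, for example from the layout of neighbouring globals, since the access pattern alone does not determine it.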
3. Error from decompiler templates
As outlined in the background of C decompilation, decompilers’ output is generated based on control flow templates. We observe that some register loading patterns can induce false positives in the template matching process, resulting in syntactical problems in the generated pseudocode. An example is shown as follows:
As shown in (c), the decompiler generates lines (3, 6) that do not exist in the source code in (b). By examining the corresponding assembly instructions (a), we identify that the decompiler incorrectly interprets certain stack variable movements (i.e., mov [rbp+OFFSET], <REGISTER> in lines 2-9 in (a)) as array accesses. Consequently, lines (3, 6) are erroneously generated in (c), leading to invalid conversion errors when recompiling with the decompiler’s output.
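Since panels (a) to (c) are not reproduced in this text, the sketch below gives a hypothetical picture of how adjacent stack stores might surface as array writes. The variable names, the `__int64` array, and the helper `label_score` are assumptions and do not come from the study's sample.

```cpp
#include <cstring>

// Hypothetical decompiler-style output (comment only; it does not compile
// as C++): two adjacent stack slots are merged into a single integer array,
// and each plain store to a local is rendered as an array write.
//
//   __int64 v4[2];
//   v4[0] = 3;
//   v4[1] = "abc";   // invalid conversion: a pointer stored in an integer slot
//
// Source-style equivalent: each stack slot is a separate, correctly typed local.
long long label_score() {
    long long count = 3;          // first stack slot
    const char *label = "abc";    // second stack slot, a pointer
    return count + static_cast<long long>(std::strlen(label));
}
```

In this sketch the conversion error disappears once the slots are treated as independent variables, which suggests that a repair rule needs to undo the spurious array grouping rather than cast the offending values.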