Hawkeye contains two major parts: the static analysis and the fuzzing loop. The static part calculates the function-level and basic-block-level distances, and instruments the binaries with the relevant information during compilation. The fuzzing loop performs the actual fuzzing, guided by the execution traces of the current seed, the resulting basic block trace distance and covered function distance, and the target sites.
The static part is implemented on top of AFL's LLVM mode. The distance generation is inspired by Böhme et al.'s CCS 2017 paper "Directed Greybox Fuzzing", with our own augmentations in several respects. It consists of the following tools:
-he-conf=/path/to/llvm.toml
which takes a configuration file (e.g., llvm.toml) used throughout the static analysis.
For 1), the analysis is based on the results of our forked SVF pointer analysis; the output is a YAML file, "callgraph.yaml" (thanks to LLVM's YAML serialization/deserialization module). In addition to the call relationship (direct or indirect) between two adjacent functions, we also keep the number of occurrences of a specific callee inside the caller (C_N in our paper) and the number of basic blocks in the caller in which that callee resides (C_B in our paper). This part may take some time for big projects with many function pointers (e.g., for cxxfilt it takes 12.5 min on average to calculate the call graph).
For 2), the CFG of each function is generated directly by traversing the basic blocks of all functions; the output is a YAML file whose file name is determined by the function name.
For 3), we keep all the call site locations from the debug information; these are used to link the function-level distance with the basic-block-level distance. The generated file is "calls.txt".
For 4), the result is derived by matching the specified target sites (lines); the output files are "tgt_funcs.txt" and "tgt_bbs.txt".
For 5), the result is collected by traversing the functions and keeping their metadata (function names, locations); it is used to filter out functions that are in the project but not in the binary. The output file is "funcs.txt".
Note that the results of 1), 2), 3), and 5) can be reused multiple times for the same binary.
-he-conf=/path/to/llvm.toml
This option can be directly appended to the CFLAGS and CXXFLAGS environment variables. For most projects, it is a drop-in replacement for Clang/Clang++ without any other manual work.
Additional notes:
The dynamic fuzzing framework of Hawkeye is a Rust implementation of AFL; it follows AFL's practice of using a forkserver, shared-memory communication, loop bucketing, etc.
The major differences include:
Specific to the directed fuzzer, the more detailed flow is shown below.
he-fuzz (the dynamic fuzzer of Hawkeye) is invoked as follows:
he-fuzz -c ./Config.toml -- /path/to/target arg1 arg2 ... @@ ... argn
Here "@@" is a placeholder that is replaced with the path of the actual seed file. For programs that read from standard input, no "@@" is needed.
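The "@@" convention can be illustrated with a small Python sketch. This is not he-fuzz's actual (Rust) code; the function name build_command and the choice to substitute "@@" inside any argument are assumptions made for illustration, modeled on AFL's usual behavior.

```python
from typing import List, Optional, Tuple

PLACEHOLDER = "@@"


def build_command(target_argv: List[str], seed_path: str) -> Tuple[List[str], Optional[str]]:
    """Return (argv, stdin_path) for one execution of the target.

    Hypothetical helper: if "@@" appears in the target arguments, each
    occurrence is replaced by the seed path and stdin_path is None.
    Otherwise argv is returned unchanged and stdin_path is the seed path,
    meaning the caller should feed that file to the target on standard input.
    """
    if any(PLACEHOLDER in arg for arg in target_argv):
        return [arg.replace(PLACEHOLDER, seed_path) for arg in target_argv], None
    return list(target_argv), seed_path


if __name__ == "__main__":
    # Target that takes the input file as an argument.
    print(build_command(["/path/to/target", "-d", "@@"], "/tmp/cur_input"))
    # Target that reads from standard input.
    print(build_command(["/path/to/target", "-d"], "/tmp/cur_input"))
```

In the first call the seed path takes the place of "@@"; in the second the argument list is untouched and the seed would be supplied on standard input.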
Note that the configuration file can specify that no function-level information is used, in which case the fuzzer uses only the "basic block trace distance" for power scheduling. If the function-level information is used, the "target function trace closure" is calculated first; its output is the set of function IDs from which the target functions are considered reachable. During fuzzing, the fuzzer tracks the function IDs covered by the current seed and checks them against the function information of the program under test (PUT), which is generated from the input file "trace_funcs.json", to obtain the covered function similarity c_s. The basic block distance d_s is determined from the shared memory, which holds the accumulated distance value of the executed basic blocks and the number of these basic blocks; d_s is the former divided by the latter. The power function is therefore p = c_s * (1 - d_s). The power is used as a multiplier on the "performance score" of the seed, a notion that originates from AFL's practice during havoc mutations.
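The computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Rust code of he-fuzz: the function names, the callers_of mapping, and the reverse breadth-first search used for the closure are hypothetical, and d_s is assumed to be already normalized to [0, 1] so that 1 - d_s is non-negative. How c_s itself is derived from "trace_funcs.json" is not shown.

```python
from collections import deque
from typing import Dict, Iterable, Set


def target_closure(callers_of: Dict[int, Set[int]], targets: Iterable[int]) -> Set[int]:
    """Function IDs from which some target function is reachable.

    Assumption: callers_of maps a function ID to the IDs of its callers,
    and the closure is obtained by a reverse BFS from the targets.
    """
    closure = set(targets)
    queue = deque(closure)
    while queue:
        callee = queue.popleft()
        for caller in callers_of.get(callee, ()):
            if caller not in closure:
                closure.add(caller)
                queue.append(caller)
    return closure


def bb_trace_distance(accumulated: float, count: int) -> float:
    """d_s: accumulated basic block distance divided by the block count."""
    if count <= 0:
        raise ValueError("no basic block with distance information was executed")
    return accumulated / count


def power(c_s: float, d_s: float) -> float:
    """p = c_s * (1 - d_s); d_s is assumed to lie in [0, 1]."""
    return c_s * (1.0 - d_s)


def scaled_score(perf_score: float, c_s: float, accumulated: float, count: int) -> float:
    """Use the power as a multiplier on the seed's performance score."""
    return perf_score * power(c_s, bb_trace_distance(accumulated, count))


if __name__ == "__main__":
    callers = {3: {2}, 2: {1}, 5: {4}}   # 1 -> 2 -> 3 and 4 -> 5
    print(target_closure(callers, [3]))  # functions that can reach target 3
    print(scaled_score(100.0, 0.5, 6.0, 8))
```

With function-level information disabled, only the d_s term would remain in play; how the fuzzer combines the terms in that case is not specified here.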