Implementation Details

Overview

Hawkeye contains two major parts: the static analysis and the fuzzing loop. The static part calculates the function-level and basic-block-level distances and instruments the relevant information during compilation of the binaries. The fuzzing loop performs the actual fuzzing according to the execution traces of the current seed, the generated basic block trace distance, the covered function distance, and the target sites.

Static Analysis

The static part is implemented on top of AFL's LLVM mode; the distance generation is inspired by Böhme et al.'s CCS 2017 paper "Directed Greybox Fuzzing", with our augmentations in multiple aspects. It consists of the following tools:

  1. Difference Generator (he-diffs): This is used to generate the target sites (lines) for patch testing between two commits. For other scenarios, the target sites can be manually specified.
  2. Preprocessor (he-pp/he-pp++): This is a wrapper around the Clang/Clang++ compilers. It utilizes the LLVM Gold linker to generate LLVM Bitcode for the target (Program Under Test, or PUT for short) binaries (executables, shared libraries); in addition, it guarantees that the binaries are compiled with debug information. Other options (specified by environment variables) ensure that this compilation is compatible with the compilation done during instrumentation. This procedure will succeed as long as the project can be compiled with Clang/Clang++ and linked with the Gold linker. The output is the set of binaries embedded with debug information.
  3. Analyzer (libhe-tgt.so): This is a regular LLVM analysis pass that analyzes the LLVM Bitcode generated by Preprocessor. It produces: 1) the call graph of the whole binary; 2) the CFGs of all functions in the Bitcode; 3) the call site locations; 4) the target functions and target basic blocks; 5) the list of functions in the binary. It is implemented as an LLVM opt plugin; the only option passed is -he-conf=/path/to/llvm.toml, which takes a configuration file (e.g., llvm.toml) used throughout the static analysis. For 1), the analysis is based on our forked SVF pointer analysis; the output is a YAML file, "callgraph.yaml" (thanks to LLVM's YAML serialization/deserialization module). In addition to the caller/callee relationship of each (direct or indirect) call, we also keep the number of occurrences of a specific callee inside the caller (C_N in our paper) and the number of the caller's basic blocks in which a specific callee resides (C_B in our paper). This part may take some time for big projects with many function pointers (e.g., for cxxfilt it takes 12.5 minutes on average to calculate the call graph). For 2), the CFG of each function is generated directly by traversing its basic blocks; the output is a YAML file for each function, named after the function. For 3), we keep all call site locations from the debug information; these will later be used to link the function level distance with the basic block distance. The generated file is "calls.txt". For 4), the targets are derived by matching the specified target sites (lines); the output files are "tgt_funcs.txt" and "tgt_bbs.txt". For 5), the list is collected by traversing the functions and keeping their metadata (function names, locations); it will be used to filter out functions that exist in the project but not in the binary. The output file is "funcs.txt". Note that the outputs of 1), 2), 3), and 5) can be reused across multiple runs for the same binary.
  4. Distance Calculator (he-dists): This takes the Bitcode and the outputs 1)-4) from Analyzer to generate both the function level distance and the basic block distance. The weights of adjacent functions are calculated based on C_N and C_B; the function level distance is based on Dijkstra's shortest path algorithm; the basic block distance is based on the call sites (collected in 3) from Analyzer) and the function level distance. The methodology is fully explained in the paper. The outputs are two files: "funcs.dist" and "bbs.dist".
  5. Instrumentor (he-clang/he-clang++): This is the part that does the actual instrumentation; it is again a wrapper for Clang/Clang++. It performs three different categories of instrumentation: 1) basic block transitions; 2) basic block distances; 3) function IDs. For 1), this is exactly the same as AFL's LLVM mode instrumentation: it writes to the shared memory at an index derived arithmetically from the previous and current basic block IDs (see AFL's technical details). For 2), it accumulates the distances of the executed basic blocks and the number of executed basic blocks, and writes both to the shared memory; this is similar to AFLGo's approach. For 3), it writes each function ID to another shared memory region that stores only function IDs. Unlike the other shared memory used for basic blocks, these IDs are maintained in Redis and keep increasing. Each piece of instrumentation information (function name, location, ID) is also kept in Redis as a hash table. The output of the Instrumentor is the instrumented binary. It accepts one option, -he-conf=/path/to/llvm.toml, which can be appended directly to the CFLAGS and CXXFLAGS environment variables. For most projects it is a drop-in replacement for Clang/Clang++ without any other manual work.
  6. Function ID Mapping (he-funcs): This maps functions to their IDs and distances. It does two things: 1) extracts the project function metadata; 2) maps the function information in the binary to the project function metadata. For 1), the metadata all comes from the Redis entries generated by Instrumentor; the output is a JSON file representing a list of project functions with their function IDs, names, and locations, usually named "proj_trace_funcs.json". For 2), it takes the file "funcs.txt" (output 5) of Analyzer), "funcs.dist" (output of Distance Calculator), and the metadata from 1) to filter out the functions that are not in our target binary (PUT). The final output is a JSON file representing a list of binary functions with their function IDs, names, locations, and distances, usually named "trace_funcs.json". Note that the output of 1) can be reused as long as the project is unchanged and compiled with the same "regular" compilation options (excluding those added by the he-clang/he-clang++ wrapper).
  7. Semantic Generator (he-sem): This can be used either as a standalone tool or as a library that generates mutated variants of semantic-aware seed files. It is based on ANTLR4 and presently provides semantic generators for XML/JSON/CSS files, but it can easily be extended to support other file types such as C/C++/Python, etc. (see a full list at antlr/antlr4-grammars) by providing the "g4" grammar and a mutation visitor against the corresponding AST structure. When used by he-fuzz, it is invoked via JNI; when used as a standalone tool, it works like radamsa.
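The function level distance computed by the Distance Calculator in 4. above can be sketched as follows. This is a minimal Python illustration, not the actual implementation: it assumes the edge weights (derived from C_N and C_B as described in the paper) have already been computed and are positive, and it combines per-target Dijkstra distances with a simple harmonic mean, in the style of AFLGo.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` over a weighted call graph.
    `graph` maps a function name to a list of (successor, weight) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for succ, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(succ, float("inf")):
                dist[succ] = nd
                heapq.heappush(heap, (nd, succ))
    return dist

def function_distances(graph, targets):
    """Distance of every function towards the set of target functions,
    using the harmonic mean of the per-target shortest-path distances.
    `graph` is the call graph (caller -> [(callee, weight), ...])."""
    # Reverse the edges so Dijkstra from a target yields the distance
    # *towards* that target for every caller.
    reversed_graph = {}
    for caller, edges in graph.items():
        for callee, w in edges:
            reversed_graph.setdefault(callee, []).append((caller, w))
    per_target = [dijkstra(reversed_graph, t) for t in targets]
    funcs = set(graph) | {c for es in graph.values() for c, _ in es}
    result = {}
    for f in funcs:
        if f in targets:
            result[f] = 0.0
            continue
        ds = [d[f] for d in per_target if f in d]
        if ds:  # unreachable functions get no distance at all
            result[f] = len(ds) / sum(1.0 / d for d in ds)
    return result
```

With a call chain main -> parse -> check and target {check}, main gets distance 2 and a helper only reachable *from* parse (but never reaching check) gets no distance.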

Additional notes:

  1. We also provide a script "he-sdriver" to automate the whole static analysis procedure.
  2. In our implementation, although "Target Function Trace Closure" belongs to the static analysis, it is actually calculated during the pre-processing procedure of the fuzzer.
  3. We can reproduce "AFL" by applying only 1) in Instrumentor and feeding the result to the original AFL fuzzer; likewise, we can build "HE-Go" (in our paper) and feed it to AFLGo's fuzzer.
  4. "he-sem" is not used in our experiments in the paper.

Dynamic Fuzzing

The dynamic fuzzing framework of Hawkeye is a Rust implementation of AFL; it follows AFL's practice of using a forkserver, shared memory communication, hit-count bucketing, etc.
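As a rough illustration of the AFL practices mentioned above, the following Python sketch simulates the shared memory coverage map (indexed by an arithmetic combination of the previous and current basic block IDs, as in AFL's LLVM mode) and AFL's hit-count bucketing. The real implementation reads a shared memory region written by the instrumented PUT; this is only a self-contained model of the bookkeeping.

```python
MAP_SIZE = 1 << 16  # AFL's default 64 KiB coverage map

def record_trace(block_ids, bitmap=None):
    """Simulate AFL-style edge coverage: each transition between two
    basic blocks hashes into a fixed-size bitmap via XOR of the current
    block ID with the right-shifted previous one."""
    bitmap = bitmap if bitmap is not None else bytearray(MAP_SIZE)
    prev = 0
    for cur in block_ids:
        idx = (cur ^ prev) % MAP_SIZE
        bitmap[idx] = min(bitmap[idx] + 1, 255)  # saturating hit count
        prev = cur >> 1  # shift so A->B and B->A map to different edges
    return bitmap

def bucket(hits):
    """AFL's hit-count bucketing: collapse raw counts into the classes
    1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+ so that small loop-count
    changes do not register as new coverage."""
    if hits == 0:
        return 0
    for i, hi in enumerate((1, 2, 3, 7, 15, 31, 127), start=1):
        if hits <= hi:
            return i
    return 8
```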

The major differences include:

  1. We provide a flexible configuration file for all possible advanced options that can be specified by the users. These options include the input/output directories, the semantic mutation rules, sanitizer-relevant dynamic options, etc.
  2. We embed a semantic mutation strategy for programs whose input files are known to follow certain grammars, for example JavaScript, XML, CSS, etc. This is built on top of ANTLR4's Java APIs and is invoked adaptively from Rust via JNI.
  3. We deliberately separate the granularity of the mutators, so that we can apply fine-grained or coarse-grained mutations.
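A toy sketch of what fine-grained versus coarse-grained mutators might look like. The operators and probabilities here are illustrative only, not Hawkeye's actual mutator set: fine-grained edits are small and local (bit flips, byte overwrites), while coarse-grained edits restructure whole chunks.

```python
import random

def fine_grained(data, rng):
    """Small, local edit: flip one bit or overwrite one byte.
    Never changes the input length."""
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    pos = rng.randrange(len(buf))
    if rng.random() < 0.5:
        buf[pos] ^= 1 << rng.randrange(8)  # bit flip
    else:
        buf[pos] = rng.randrange(256)      # byte overwrite
    return bytes(buf)

def coarse_grained(data, rng):
    """Large, structural edit: delete or duplicate a whole chunk.
    Always changes the input length (the chunk is non-empty)."""
    buf = bytearray(data)
    if len(buf) < 2:
        return bytes(buf)
    lo = rng.randrange(len(buf) - 1)
    hi = rng.randrange(lo + 1, len(buf))
    if rng.random() < 0.5:
        del buf[lo:hi]          # chunk deletion
    else:
        buf[lo:lo] = buf[lo:hi]  # chunk duplication
    return bytes(buf)

def mutate(data, coarse_ratio, rng=None):
    """Pick a granularity according to a tunable ratio in [0, 1]."""
    rng = rng or random.Random()
    op = coarse_grained if rng.random() < coarse_ratio else fine_grained
    return op(data, rng)
```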

Specific to the directed fuzzer, the more detailed flow is shown below.

The usage of he-fuzz (the dynamic fuzzer of Hawkeye) is:

he-fuzz -c ./Config.toml -- /path/to/target arg1 arg2 ... @@ ... argn

where "@@" is a stub that will be replaced with the path of the actual seed file. For programs that read from standard input, no "@@" is needed.

Note that the configuration file can specify that no function level information be used, in which case the fuzzer will only use the "basic block trace distance" for power scheduling. If the function level information is used, the "target function trace closure" is calculated first; its output is the set of function IDs considered able to reach the target functions. During fuzzing, the fuzzer tracks the function IDs covered by the current seed and checks the PUT's function information (generated from the input file "trace_funcs.json") to get the covered function similarity c_s. The basic block trace distance d_s is determined by reading from the shared memory the accumulated distance value of the executed basic blocks and the number of those basic blocks, then dividing the former by the latter. The power function is then p = c_s * (1 - d_s). The power is used as a multiplier on the seed's "performance score", which originates from AFL's practice during havoc mutations.
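The scheduling described above can be condensed into a small Python sketch. Two simplifications are assumed for illustration: c_s is computed as a plain overlap ratio between the covered function IDs and the trace closure (the paper defines the exact similarity metric), and the per-block distances are taken to be normalized to [0, 1] so that 1 - d_s is meaningful.

```python
def power(covered_funcs, closure_funcs, acc_distance, bb_count):
    """Compute p = c_s * (1 - d_s) as described above.

    covered_funcs -- set of function IDs hit by the current seed
    closure_funcs -- target function trace closure (IDs able to reach a target)
    acc_distance  -- accumulated basic block distance read from shared memory
    bb_count      -- number of distance-instrumented basic blocks executed
    """
    # Covered function similarity: here, the fraction of the closure the
    # seed touches (a simplification of the paper's metric).
    c_s = len(covered_funcs & closure_funcs) / len(closure_funcs) if closure_funcs else 0.0
    # Mean normalized basic block distance of the trace (assumed in [0, 1]).
    d_s = acc_distance / bb_count if bb_count else 1.0
    return c_s * (1.0 - d_s)

def scheduled_score(perf_score, p):
    """The power acts as a multiplier on AFL's performance score,
    which drives how many havoc mutations the seed receives."""
    return perf_score * p
```

For example, a seed covering half of a 4-function closure with a mean block distance of 0.25 gets p = 0.5 * 0.75 = 0.375.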