To evaluate the generated fuzz drivers accurately, we first applied a general, automatic criterion to roughly validate a fuzz driver's effectiveness: an effective fuzz driver should compile successfully, run for one minute without crashing when given an empty initial seed, and show an increase in coverage. Then, the "Target Bugs Filtering" and "Semantic Test" techniques (depicted in Figure 3, page 5) are introduced to correct the initial validation outcomes. The former removes crashes that are in fact bugs in the target API, while the latter filters out invalid or ineffective API usage that might bypass the automatic criterion.
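A minimal sketch of how this criterion could be checked for one compiled driver is shown below. It assumes a libFuzzer-style binary at `./driver`, an empty seed-corpus directory `./corpus`, the `-max_total_time` flag, and parsing of libFuzzer's `cov:` counters; these details are assumptions of the sketch, not the paper's exact procedure.

```c
/* Sketch: automatic validity check for one compiled libFuzzer-style driver.
 * Paths, flags, and the coverage-parsing heuristic are illustrative only. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Run the driver for 60 seconds on an empty seed corpus; libFuzzer's
     * status lines (which carry "cov:" counters) go to stderr, so merge
     * them into the stream we read. */
    FILE *run = popen("./driver -max_total_time=60 ./corpus 2>&1", "r");
    if (!run) return 1;

    long first_cov = -1, last_cov = -1;
    char line[4096];
    while (fgets(line, sizeof(line), run)) {
        const char *cov = strstr(line, "cov: ");
        if (cov) {
            last_cov = strtol(cov + 5, NULL, 10);
            if (first_cov < 0) first_cov = last_cov;
        }
    }
    int status = pclose(run);

    /* Pass: the driver compiled (we got this far), survived the full minute
     * without crashing (clean exit), and its coverage grew during the run. */
    if (status == 0 && last_cov > first_cov) {
        printf("driver passes the automatic criterion (cov %ld -> %ld)\n",
               first_cov, last_cov);
        return 0;
    }
    printf("driver rejected (exit status %d, cov %ld -> %ld)\n",
           status, first_cov, last_cov);
    return 1;
}
```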
In terms of the “Target Bugs Filtering”, the stack signatures of true-bug crashes are collected from the 24-hour fuzzing results (10 fuzzing instances in parallel) of the OSS-Fuzz drivers; a crash from a generated driver whose signature matches this set is attributed to the target API itself rather than to a flaw in the driver.
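A minimal sketch of such a signature check is shown below; the signature format, the placeholder entries, and the helper name `is_known_target_bug` are all assumptions made for illustration.

```c
/* Sketch: filter out crashes that match a known true bug in the target API.
 * Assumes each crash has already been reduced to a signature string (e.g.
 * the top frames of its sanitizer stack trace); the format and the entries
 * below are placeholders, not data from the study. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Signatures collected from the 24-hour OSS-Fuzz driver runs. */
static const char *const known_bug_signatures[] = {
    "frame_a-frame_b-frame_c",
    "frame_x-frame_y-frame_z",
};

/* A crash matching this set is attributed to the target API itself, not to
 * a mistake in the generated driver, so it does not invalidate the driver. */
bool is_known_target_bug(const char *crash_signature) {
    size_t n = sizeof(known_bug_signatures) / sizeof(known_bug_signatures[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(crash_signature, known_bug_signatures[i]) == 0)
            return true;
    }
    return false;
}
```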
Regarding the “Semantic Test”, we wrote checks covering four aspects; the last three account for the 29 questions that needed tests beyond the basic target-API check:
1) verifying the target API is called for all 86 questions;
2) checking that fuzzing data is correctly fed to the key arguments, for 8 questions (e.g., fuzzing file contents instead of file names);
3) ensuring critical dependent APIs are invoked in 16 questions;
4) confirming necessary execution contexts are prepared for 5 questions (for instance, having a standby server process available for testing client APIs).
We implement these tests by injecting hooking code into the driver (a hedged sketch of such hooks is given after this list):
For 1), this is done by checking whether the target API is invoked in the generated driver code;
For 2), this is done by hooking the target API function and checking whether the argument values satisfy certain properties. For file-name arguments, we check whether the value is a valid file name and whether the file's content can be mutated by mutating the fuzzer's input;
For 3), the technique used for this check is similar to 1)'s, but the functions checked are the APIs that the target API depends on. For example, to effectively test the `parse_headers` API in the kamailio project, `parse_msg` has to be called first; in this case, 3)'s validation checks for `parse_msg`;
For 4), the technique is similar to 1) and 3), but the APIs we check are contextual dependencies. For example, to test `input_parse_buffer` in the tmux project, sessions, windows, and panes have to be initialized first, so we check whether the driver invokes the corresponding initialization APIs. To test `mg_get_response` for the libmodbus project, a standby server process has to be established for effective testing, so we check for the multi-process/multi-thread APIs needed to set up such a server.
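As a concrete illustration of such injected hooks, the sketch below shows one possible shape of the checking code for a hypothetical target API `target_api` that takes a file-name argument and depends on a hypothetical `dep_api`. The identifiers, the linker-level `--wrap` interposition, and the reporting format are all assumptions of this sketch rather than the exact instrumentation used in the study.

```c
/* Sketch of semantic-test hooks injected alongside a generated driver.
 * target_api, dep_api, and the flags below are illustrative names. */
#include <stdbool.h>
#include <stdio.h>

static bool g_target_called = false; /* check 1): target API invoked         */
static bool g_dep_called    = false; /* checks 3)/4): dependency prepared    */
static bool g_arg_usable    = false; /* check 2): fuzz data reaches argument */

/* With `-Wl,--wrap=target_api -Wl,--wrap=dep_api`, the driver's calls are
 * redirected to the __wrap_* functions while the originals remain reachable
 * as __real_*, so the generated driver code itself stays untouched. */
int __real_target_api(const char *path);
int __real_dep_api(void);

int __wrap_dep_api(void) {
    g_dep_called = true;
    return __real_dep_api();
}

int __wrap_target_api(const char *path) {
    g_target_called = true;

    /* Check 2): the argument should name a readable file, i.e. the driver
     * fuzzes the file's content rather than the file name itself (the
     * content-mutation comparison is elided for brevity). */
    FILE *f = path ? fopen(path, "rb") : NULL;
    if (f) {
        g_arg_usable = true;
        fclose(f);
    }

    /* Check 3): the dependent API must already have been called. */
    if (!g_dep_called)
        fprintf(stderr, "SEMANTIC-TEST: dep_api not called before target_api\n");

    return __real_target_api(path);
}

/* Emit a machine-readable verdict when the driver process exits; an outer
 * checker can parse this line to accept or reject the driver. */
__attribute__((destructor))
static void semantic_test_report(void) {
    fprintf(stderr, "SEMANTIC-TEST: target=%d dep=%d arg=%d\n",
            g_target_called, g_dep_called, g_arg_usable);
}
```

Linker-level wrapping is only one way to interpose on these calls; compile-time macros or source rewriting in the generated driver would serve the same purpose.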