FAQs

Q: What's the rationale for instrumenting more feedback points in multithreading segments?

A: The simple answer is that not all instrumentation sites are equal and not all seeds are equal; instrumenting multithreading segments more heavily is likely to provide more high-quality seeds. The effectiveness can be observed from 1) the evaluation in the paper and 2) the 0-day vulnerabilities we newly detected (after the paper submission) in well-known projects such as the GNU Gold linker and the myhtml multithreaded HTML parser.

UPDATE: A new NDSS 2020 paper also supports our viewpoint; see https://ajax4sec.github.io/papers/ndss20-fall-paper422.pdf .

Q: What's the runtime overhead of MUZZ?

A: Here are the results of MUZZ, MAFL, and their comparisons with the baseline fuzzer AFL. Comparing MUZZ with MAFL particularly demonstrates the overhead of "coverage-oriented instrumentation".

Q: Can you tell us more about the rationale behind the replaying strategies?

A: Suppose that after the fuzzing procedure, the seed queue has three test seeds (with their N_{c} values in parentheses): S1 (3), S2 (1), S3 (2), and that S2 is a seed that does not enter the multithreading code segments at all -- it is rejected by an early validation check and the program exits immediately. These seeds will be replayed against the programs hardened with ThreadSanitizer.

  • Vanilla round-robin replaying strategy (P1). The replaying sequence: S1 -> S2 -> S3 -> S1 -> S2 -> S3 -> S1 -> S2 -> S3 -> S1 -> S2 ...

  • Weighted round-robin replaying strategy (P2). The replaying sequence: S1 -> S1 -> S1 -> S2 -> S3 -> S3 -> S1 -> S1 -> S1-> S2 -> S3 ...

Suppose that within a time budget, a total of 8 executions are replayed. For P1, S1 is executed 3 times, S2 3 times, and S3 twice; for P2, S1 is executed 5 times, S2 once, and S3 twice. Since S2 does not execute multithreading code at all, it is not beneficial to run it multiple times; P1 is therefore inefficient in replaying S2 3 times. On the other hand, some seeds may be more prone to exhibit thread interleavings during actual execution (since they execute deeper into the multithreading environment). Since N_{c}, recorded during the calibration executions of the fuzzing procedure, counts the number of different traces, it heuristically indicates a seed's tendency to exhibit different thread interleavings during practical executions. Therefore, P2's additional executions of S1 are expected to be more effective at revealing concurrency bugs (if any).

N_{c} approximates a seed's tendency to exhibit different interleavings. It is worth noting that some seeds exhibit fewer thread interleavings than others, so replaying them is less beneficial; we heuristically estimate this based on the N_{c} value recorded during the fuzzing procedure. Table 3 shows that the weighted strategy can indeed expose concurrency bugs earlier and more stably. The utilization of N_{c} is also a novel aspect of our approach.

Q: Can you explain further why MUZZ instruments the multithreading segments more heavily?

A: The simple answer is that not all seeds are equal and not all instrumentations are equal; instrumenting multithreading segments more heavily is likely to provide more high-quality seeds.

  1. Fuzzing is effective at detecting vulnerabilities that result from incomplete validation checks. This means that the seeds generated by fuzzing are prone to be "corrupted input files" and the detected vulnerabilities are usually "shallow".

  2. If a seed is "corrupted" (categorized as S-bad), its mutants are more or less likely to still be "corrupted" (it is much more difficult to go from bad to good). On the other hand, if a seed is mostly "valid" (categorized as S-good), there is a much higher chance that its mutants remain valid (it is easier to generate good seeds from intact seeds). For example, if the target program only accepts input files with a certain file header, it is extremely difficult to generate a valid seed from S-bad seeds whose file headers do not satisfy the requirement. However, if a seed belongs to S-good in that its file header matches, it will at least pass this check and can exercise "deeper" parts of the target program. Mutations on this seed may not change the file header at all, so its variants are likely to pass the file-header check as well.

  3. If the percentage of S-bad seeds in the seed queue is high, the newly generated seeds are more likely to still belong to S-bad. If we provide "sufficient" instrumentation on the corresponding segments, we will likely keep quite a few S-bad seeds. In the setting of multithreaded programs, this means that more seeds cannot even execute the multithreading segments at all. In this sense, we should actually provide less instrumentation on the "prologue" of the program, which in multithreaded programs typically belongs to the single-threaded code segments.

  4. Keeping more S-good seeds in fact probably does not reduce the capability of detecting vulnerabilities that have nothing to do with multithreading (Vs). This is because S-bad seeds are usually of lower quality and are unlikely to be accidentally mutated into seeds that fail a different validation check, while S-good seeds can be mutated into various kinds of S-bad seeds. Moreover, since most of the projects we tested have already been fuzzed by other fuzzers such as AFL and libFuzzer, it is less likely that they still contain "shallow" vulnerabilities. Therefore, MUZZ applies less instrumentation on these program segments.

  5. More instrumentation on thread interleavings indeed helps improve seed quality. Seeds that exhibit thread interleavings have at least passed certain fundamental validation checks and therefore tend to be of high quality (S-good). Further, since "new transitions/coverage" is discovered across different seeds, this inheritance increases the diversity of the seeds themselves: there will not be many pairs of seeds whose interleavings turn out to be the same given enough executions. Therefore, the kept seeds are "essentially" different.

  6. With more distinct S-good seeds, more seeds belonging to S-good can be mutated. Although many of the resulting seeds still belong to S-bad, they will not be "kept", because the transitions observed by the fuzzer are not new.

Q: Why not use ConAFL's benchmarks during evaluation?

A: For most of the benchmarks (namely boundedbuf, swarm, pfscan, ctrace, and qsort), we are unsure where exactly the benchmarks come from and which versions they are, since there is no particular repository for these projects. The only benchmark we are sure about is bzip2smp, which is available on SourceForge, so we could only test this benchmark.

According to the ConAFL paper, the detected vulnerabilities all belong to the double-free category: the static analysis detected three double-free vulnerabilities, and the "targeted priority" strategy in ConAFL detected one. First, ConAFL mentions that the static analysis is based on LOCKSMITH; after some searching, we noticed that it is not publicly available, so we could not try it. Second, the ConAFL paper seems to emphasize that the dynamic fuzzing detects one crash --- this suggests that the vulnerabilities detected by static analysis may include false alarms. We then tried the following steps.

  1. Fuzzing with AFL. We fuzzed with AFL on 4 different machines, each for 168 hours (7 days); no crashes/vulnerabilities were detected. This is consistent with the ConAFL paper's statement that "AFL did not report any crash after running for 2 days".

  2. Intensive testing on bzip2smp with randomly generated inputs. We used the Linux "dd" command to generate test inputs on 4 different machines for 360 hours (15 days), with files ranging from 1KB to 10MB. No crashes/vulnerabilities were detected.

We think that the crash may not be easily triggered in reality.

Further, based on the "double-free" category, we manually reviewed the code for the memory allocation and free flows. The relevant APIs are the standard libc free and the customized default_bzfree. We do not believe that these two APIs will actually cause a double-free, since they appear to be well protected by mutex primitives.

Moreover, we notice that bzip2smp does not contain any relevant fixes made after ConAFL detected the issues, and there are no bug reports available about them.

Therefore, we believe that the detected double-free can hardly happen in the real world, and we did not use bzip2smp as our benchmark.

Q: Why not compare with libFuzzer during evaluation?

A: Unlike AFL or MUZZ, libFuzzer has to be fed with fuzzing drivers for the target projects.

Although libFuzzer is widely deployed in Google's OSS-Fuzz infrastructure for continuous testing, the drivers focus only on library functionalities rather than multithreading. For example, OSS-Fuzz contains the four projects that we used as our benchmarks.

However, a deeper investigation of these projects tells us that they only deal with inputs in single-threaded mode under their OSS-Fuzz settings. For example, the libvpx repository contains a fuzzing driver (harness) used by libFuzzer to test decoding functionalities; however, it tests without consideration of multithreading APIs such as pthread. Therefore, we cannot use these drivers directly.
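For reference, an OSS-Fuzz-style harness typically has the following shape. LLVMFuzzerTestOneInput is the real libFuzzer entry point; decode_buffer is a hypothetical stand-in, not an actual API of any of these projects:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-threaded library API, stubbed so the sketch compiles. */
static int decode_buffer(const uint8_t *data, size_t size) {
    return size > 0 ? (int)data[0] : -1;
}

/* libFuzzer calls this once per generated input, in-process. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    decode_buffer(data, size);  /* one API call; no pthreads involved */
    return 0;                   /* libFuzzer expects 0 */
}
```

Nothing in such a driver spawns or configures worker threads, which is why it cannot reproduce the behavior of a multithreaded invocation like vpxdec with 4 threads.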

Meanwhile, it is not easy to write a fuzzing driver that matches the functionality of `vpxdec -t 4 -o out.y4m FILE` exactly --- we would have to implement the multithreading prologue code of vpxdec all by ourselves. Worse still, for some projects such as pbzip2, implementing the driver essentially means re-implementing the target project ourselves, which requires huge effort.

It is also probably unfair to write these drivers to fit the functionalities targeted by AFL or MUZZ --- certifying that a libFuzzer driver matches the target binary exactly requires a deep understanding of the underlying project.

Q: Why not compare with other fuzzers during evaluation?

A: Apart from the reasons for which we compare with MOpt, we also found from Google FuzzBench's latest fuzzing samples that MOpt indeed works better than the other fuzzers on several (single-threaded) benchmarks: https://www.fuzzbench.com/reports/sample/index.html . This supports our rationale for using MOpt as the state-of-the-art comparison tool in the evaluation.

As we have shown in the evaluation with MOpt, 1) thread-aware strategies do help significantly when fuzzing multithreaded programs, and 2) the thread-unaware MOpt demonstrates performance similar to AFL, despite its great improvements when fuzzing single-threaded programs.

The other grey-box fuzzers, as far as we know, are all thread-unaware. We provide a list of them below to show the differences.

Fuzzers

Q: ThreadSanitizer may report false positives on OpenMP, how is that resolved during replaying in the evaluation?

A: We use the OpenMP runtime supplied by the LLVM/Clang project. When building OpenMP, we passed `-DLIBOMP_TSAN_SUPPORT=ON`, which enables ThreadSanitizer support; we referred to this post to resolve the issue. Fortunately, after manually inspecting the detected concurrency bugs, we found that ThreadSanitizer reports the violations (mostly data races) correctly.

In this answer on Stack Overflow, it is said that there are still false positives. We will try Archer, a data race detector specialized for OpenMP programs.

Q: Does AFL perform poorly when specifying multiple threads rather than only 1 thread?

A: From our observation, it usually does, "for fuzzing purposes".

On the one hand, there might be some slowdown due to the intrinsic overhead of threading. This, however, may not be significant, since fuzzing usually generates "invalid seeds" that cannot pass the validation checks and therefore cannot even trigger the multithreading segments.

On the other hand, due to its unawareness of threading, even with enough feedback (which is what MUZZ tries to add), AFL still has no idea how many times it needs to calibrate, or how to capture the crashes reasonably.


Below are the results for x264 when fuzzing with the command lines `afl-fuzz -d -i x264-inputs -o /media/fuzz/x264 -t 4000+ -- x264-fuzz/install/bin/x264 --threads 1 --quiet --output /dev/null @@` (single thread) and `afl-fuzz -d -i x264-inputs -o /media/fuzz/x264 -t 4000+ -- x264-fuzz/install/bin/x264 --threads 3 --quiet --output /dev/null @@` (3 threads), respectively.

Note that the total executions are 941k and 770k, respectively. As to total paths, the single-thread run found 1912 while the 3-thread run found 1725, even though the 3-thread case has non-deterministic behaviors that should introduce more paths. The major difference is in unique crashes: the single-thread run found 23, while the 3-thread run found only 2 (which, however, may not correspond to unique vulnerabilities/security bugs).

Q: Can you reveal more about the static analysis?

A: Currently the static analysis for native pthread programs is purely based on SVF's thread-aware value-flow analysis. As far as we know, this is the state-of-the-art toolkit that integrates multiple classic static program analysis techniques and is suitable for academic or industrial purposes. On some target programs, the static analyses took a relatively long time (e.g., vpxdec). This may result from limitations of the lockset analysis itself, or from the actual implementation. We will further investigate whether these results can be improved.

The analysis on OpenMP is much more lightweight, as we mostly match the LLVM IR of OpenMP pragmas such as ".omp_outlined*" and apply some heuristic analyses to locate the suspicious scope quickly. This may not work in general but is mostly fine for our evaluated projects (ImageMagick convert, GraphicsMagick, and pxz). We are working on integrating more precise analyses for OpenMP functions, which may involve both Clang Frontend Analysis (CFA) and LLVM pointer analysis.

Q: What's the relationship between OpenMP and Pthread on Linux platform?

A: OpenMP is a general-purpose API that supports multi-platform shared-memory multiprocessing programming in C and C++ on many platforms, including Linux, macOS, and Windows. It is typically built on top of native multithreading APIs such as Windows threads or Pthreads. Thanks to its encapsulation and unified API, it is widely used by many projects with cross-platform multithreading support (see the using list for details). On Linux, it encapsulates Pthreads and only provides high-level interfaces to multithreading users.

In the experiments, we limit the number of threads to 4 whenever possible, except for cwebp, for which there is no easy way to set the exact thread number.

Q: Why is V_m defined as multithreading-relevant vulnerability?

A: Consider the following example:

```c
void *compute(void *s_var) {
    ...
    if (validity_check_2(s_var)) {
        real_compute(s_var);
    } else {
        error_handle(s_var);
    }
    ...
    return NULL;
}

int main(int argc, char **argv) {
    u8 *input = read_file(argv[1]);
    validity_check_1(input);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, compute, &input[0]);
    pthread_create(&t2, NULL, compute, &input[128]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

  1. Both real_compute and error_handle may themselves have vulnerabilities, whether they are executed in a multithreading or single-threading environment. These vulnerabilities may or may not result from concurrency bugs, despite being triggered in a multithreading environment. In Doublade's setting, all of these vulnerabilities are marked as V_m.

  2. The initial seeds may only enter the real_compute function; however, the vulnerabilities may lie only inside error_handle. Therefore, without mutation (fuzzing), it is impossible to generate an invalid seed that exercises the error_handle logic (one that passes validity_check_1 but fails validity_check_2).

Now consider a variant of compute:

```c
void *compute(void *s_var) {
    ...
    if (strlen(s_var) % 8 == 0) {
        real_compute_1(s_var);
    } else {
        real_compute_2(s_var);
    }
    ...
    return NULL;
}

int main(int argc, char **argv) {
    u8 *input = read_file(argv[1]);
    validity_check_1(input);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, compute, &input[0]);
    pthread_create(&t2, NULL, compute, &input[128]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

  1. Suppose that only real_compute_2 has vulnerabilities that have nothing to do with concurrency bugs; for example, an out-of-bound write when handling an s_var of a certain size. This still belongs to V_m because it happens in a multithreading environment, despite not being a V_c.

  2. Further, suppose that real_compute_2 has logic that is much more complicated than real_compute_1; it is then highly expected that there will be many more interleavings when executing real_compute_2. That is why we replay with more executions the seeds that exhibit more interleavings (measured by N_{c} during fuzzing), in hope of exposing more concurrency bugs (if any).

Q: What are the roles of the applied techniques in MUZZ?

A: The main novelty is to enhance fuzzing effectiveness on multithreaded programs by utilizing static and dynamic thread-aware analyses to preserve more valuable seeds, which can reveal more multithreading-relevant vulnerabilities and concurrency bugs.

  1. Stratified instrumentation keeps more seeds that enter multithreading contexts by providing more feedback for the corresponding executions.

  2. Thread-context instrumentation distinguishes more interleaving contexts; its only purpose is to introduce more feedback for fuzzing, so that more "unique multithreading-relevant seeds" will be distinguished.

  3. Schedule-intervention instrumentation increases the diversity of repeated executions.

  4. Dynamic fuzzing strategies aim to guide fuzzing to generate seeds that exercise different multithreading-relevant paths that cannot be covered by the initial seeds.