Extra Explanations

Here we provide some extra explanations for an easier understanding of the idea behind Cerebro.

We hope that readers can implement the strategies in Cerebro and build into their own fuzzers after reading the paper and these explanations.

Intuition of using a posteriori method for seed prioritization

Multi-objective optimization problems (https://en.wikipedia.org/wiki/Multi-objective_optimization) can be solved in many different ways.

One of them is to use scalarization (to give weights to different objectives and merge them as a single objective).

Another popular approach is to use the a posteriori methods (finding the Pareto Front/Frontier) (https://en.wikipedia.org/wiki/Pareto_efficiency).

Because the seed inputs maintained by the greybox fuzzers often hold many different properties, so it is naturally a multi-objective optimization problem to evaluate the quality of seed inputs.

Previous fuzzers mostly use scalarization to solve this problem.

But the weight assigned for each property is often decided empirically and using a posteriori methods can avoid this issue.

However, the challenge is that a posteriori methods often require more costly to calculate.

So here we propose a cost-effective algorithm for seed prioritization in fuzzing based on NSGA-II (https://www.iitk.ac.in/kangal/Deb_NSGA-II.pdf) to quickly calculate the seeds which are on the Pareto Frontier.

Intuition of using the dynamic input potential score

The greybox fuzzer mutates the seed inputs to generate new inputs for testing the target program.

The number of new inputs to be generated with a seed input is called "energy".

We can see energy represents the "importance" of seeds.

Here we propose to use "input potential" to evaluate the importance.

Input potential refers to the potential benefits (coverage/crashes) the fuzzer can get through mutating that input.

The intuition of this input potential score is that the importance of a seed drops if the more complex codes near its execution are covered by the fuzzer.

So in Cerebro, we proposed an algorithm to quickly evaluated this input potential score as the context changes dynamically.

Roles of the 4 different queues

Unlike traditional greybox fuzzers which only have 1 queue, Cerebro maintains 4 different queues to quickly calculate the Pareto Frontier on the fly.

Below is the relationship between the 4 queues:

Before explaining the roles of the 4 queues, we need to explain the term "cycle".

If the greybox finished fuzzing(mutating) all the seeds in the queue once, we say that the fuzzer finished a cycle.

In one cycle:

"Frontier" contains the seeds on the Pareto Frontier.

"Dominion" contains the seeds dominated by the seeds on the Pareto Frontier.

Once the fuzzer finished fuzzing the seeds in the "Frontier", the seeds with the lowest ranks in the "Dominion" automatically become the new "Frontier".

Here the concept of rank is from NSGA-II (https://www.iitk.ac.in/kangal/Deb_NSGA-II.pdf).

Basically, the process can be treated peeling an onion.

The out most layer is considered as the "Pareto Frontier".

After peeling the out most layer, another layer will automatically become the new out most layer.

So we always have the out most layer as the new Pareto Frontier of the rest of the onion.

Because coverage-based greybox fuzzers always keep new inputs as seeds if they can bring in new coverage.

So the "Newly Added" queue works as a buffer keeping all the new seeds generated based on the seeds on the current Pareto Frontier.

The seeds in "Newly Added" will also be given ranks and they will be merged into the "Dominion" and form the new "Frontier" once the fuzzer finished fuzzing the current "Frontier".

If a seed is fuzzed in current cycle, it will be put into the queue "Recycled".

Once the fuzzer finished a cycle, all the seeds will be transferred from "Recycled" to "Dominion", and the new "Frontier" will be calculated based on the current "Dominion".