REINAM: Reinforcement Learning for Input-Grammar Inference

Project Summary:

Program-input grammars (i.e., grammars encoding the language of valid program inputs) facilitate a wide range of applications in software engineering, such as symbolic execution and delta debugging. Grammars synthesized by existing approaches can cover only a small part of the valid input space, mainly due to unanalyzable code (e.g., native code) in programs and the lack of high-quality, high-variety, and high-quantity seed inputs. To address these challenges, we present REINAM, a reinforcement-learning approach for synthesizing a probabilistic context-free program-input grammar without any seed inputs. REINAM uses an industrial symbolic execution engine to generate an initial set of inputs for the given target program, and includes an iterative process of grammar generalization that proactively generates additional inputs in order to infer grammars generalized beyond the initial seed inputs. To efficiently search the huge space of candidate generalization operators, REINAM includes a novel formulation of the search problem as a reinforcement learning problem. Our evaluation results on five real-world subjects show that REINAM outperforms an existing state-of-the-art approach in the precision and recall of the synthesized grammars, and that fuzz testing based on REINAM substantially increases the coverage of the valid input space. For some subjects, REINAM is able to synthesize a grammar covering the whole valid input space without decreasing the accuracy of the grammar.






Implementation Details:

In this section, we briefly describe the implementation details of REINAM. REINAM's grammar synthesis process has two phases. The purpose of phase 1 is to generate an initial CFG for our RL procedure, since the quality of the grammar synthesized by Glade highly depends on the quality of its seed inputs.


Language Inference and Symbolic Execution:

We first symbolically execute the parsing function with Pex in order to generate a set of representative strings with which to seed Glade. The intuition is that a set of test strings with higher code coverage should be more representative. We restrict the set of seeds such that every seed is generated by a unique path through the parser. We prepare the subject programs by creating a simple test harness around the parser that loops and reports whether each input parses successfully. We use the subset of the generated strings that are accepted by the subject program as seeds for Glade. The number of seeds we feed to Glade is mainly constrained by how many strings Pex outputs; depending on the program under test, this is anywhere from 50 to 1,000 seeds (we cap the number of seeds at 1,000). Glade then outputs a representation of the synthesized CFG to the reinforcement learning stage.
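As a rough, hedged sketch of this seeding step (not REINAM's actual code), the snippet below assumes two hypothetical helpers, run_pex and parser_accepts, standing in for the Pex invocation and the parsing test harness; it keeps only accepted strings and caps the seed set at 1,000.

```python
# Minimal sketch of the seeding step, assuming hypothetical helpers:
#   run_pex(parser_entry)  -> candidate strings, one per explored parser path
#   parser_accepts(s)      -> True if the subject parser accepts string s
MAX_SEEDS = 1000  # upper bound on the number of seeds passed to Glade

def collect_seeds(parser_entry, run_pex, parser_accepts):
    candidates = run_pex(parser_entry)                        # strings from unique parser paths
    accepted = [s for s in candidates if parser_accepts(s)]   # keep only valid inputs
    return accepted[:MAX_SEEDS]                               # cap at 1,000 seeds
```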

Figure: REINAM flow diagram

Reinforcement Learning Phase:

Phase 2, which uses RL, has three steps. The first step is to mutate the PCFG by applying generalization operators. An important detail is the order in which the different operators are applied. The character generalization operator is independent of the other operators since it only affects terminal symbols, so it can be considered before all other operators. Based on observations from the ideal grammars of XML and URI, we accelerate this step by not only adding one character at a time but also directly trying some predefined terminal symbols, such as lowerletter = a | b | ... | z; we directly try to generalize the terminal symbols to these predefined symbols and check whether doing so improves the grammar.

We prioritize the repetition and alternation operators over the merging operator: as pointed out in the second example in Figures 3 and 4, Glade is likely to miss the generalization obtained by a repetition/alternation followed by a merging. We randomly choose between applying the repetition and alternation operators. In repetition, we randomly choose the production rule to generalize according to its probability in the PCFG. In alternation, we also randomly select the production rule according to its probability and randomly pick a substring of length 2 to alternate. The reason we pick the production rule randomly but weighted by its probability in the PCFG is that, as mentioned earlier, the probability in the PCFG reflects the "correctness" of the rule; we therefore want to be more likely to generalize from a production rule in which we have higher confidence.
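The mutation step can be sketched roughly as follows; the grammar representation (a dict from non-terminals to weighted productions) and the operator helpers apply_repetition and apply_alternation are simplifying assumptions for illustration, not REINAM's actual data structures.

```python
import random

# Assumed PCFG representation: dict mapping non-terminal -> list of (rhs_symbols, probability).
# apply_repetition / apply_alternation are hypothetical stand-ins for the generalization operators.

def pick_rule_by_probability(pcfg):
    """Pick a production rule with probability proportional to its PCFG probability."""
    rules = [(lhs, rhs, p) for lhs, prods in pcfg.items() for rhs, p in prods]
    lhs, rhs, _ = random.choices(rules, weights=[p for _, _, p in rules], k=1)[0]
    return lhs, rhs

def mutate(pcfg, apply_repetition, apply_alternation):
    lhs, rhs = pick_rule_by_probability(pcfg)
    if random.random() < 0.5:
        # Repetition: generalize the chosen rule so part of it may repeat.
        return apply_repetition(pcfg, lhs, rhs)
    # Alternation: pick a substring of length 2 in the right-hand side to alternate.
    if len(rhs) < 2:
        return pcfg  # nothing to alternate in a too-short right-hand side
    i = random.randrange(len(rhs) - 1)
    return apply_alternation(pcfg, lhs, rhs, (i, i + 2))
```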


Input Sampling:

The second step is input sampling. We sample 1,000 strings from the whole PCFG. We also sample 500 additional strings related to the production rule added in the generalization step: in this sampling phase, we filter out those strings whose derivation does not contain the non-terminal symbol on the left-hand side of the new production rule. The reason for this step is that the main concern in the current iteration of RL is to estimate the probability of the new production rule and thus verify it, so we need sufficiently many input strings to calculate the reward. Since we only filter out strings that do not involve the non-terminal symbol of the new rule, the proportions of the inputs generated in this way still follow the probability distribution under that symbol. Although in theory sampling every possible derivation order from the initial non-terminal symbol would require an exponential number of strings, each sampled string involves some production rules in its derivation, and we only care about how many times each production rule is applied; the order of different rules is not important in our model. We therefore find the current number of sampled strings sufficient in our evaluation. Note that the number of strings we sample can be adjusted based on the number of production rules and symbols in the initial grammar.
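A minimal sampling sketch under the same assumed grammar representation is shown below; it expands non-terminals according to the rule probabilities, records which non-terminals were used in the derivation, and keeps only strings whose derivation involves the new rule's left-hand side.

```python
import random

def sample_string(pcfg, start, max_depth=50):
    """Sample one string from the PCFG; return (string, set of non-terminals expanded)."""
    used = set()

    def expand(symbol, depth):
        if symbol not in pcfg:                 # terminal symbol: emit it directly
            return symbol
        if depth > max_depth:                  # depth cut-off to avoid unbounded recursion
            return ""
        used.add(symbol)
        productions = pcfg[symbol]             # list of (rhs_symbols, probability)
        rhs, _ = random.choices(productions, weights=[p for _, p in productions], k=1)[0]
        return "".join(expand(s, depth + 1) for s in rhs)

    return expand(start, 0), used

def sample_related(pcfg, start, new_lhs, n=500, max_attempts=100000):
    """Sample strings whose derivation involves the new rule's left-hand side."""
    samples = []
    for _ in range(max_attempts):
        if len(samples) >= n:
            break
        s, used = sample_string(pcfg, start)
        if new_lhs in used:                    # keep only derivations that use new_lhs
            samples.append(s)
    return samples
```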


Probability Adjustment and Tuning:

The third step is probability adjustment. For one generalization, we run the iterative process of sampling and adjustment multiple times and wait for the probabilities to converge. We set the threshold that determines whether the probabilities are still changing to 1/(10 × number of production rules of P), which is 10 times smaller than the probability initialized from the uniform distribution, 1/(number of production rules of P). The threshold means that if no production rule's probability changes by more than 1/(10 × number of production rules of P) after a new round of sampling, we declare convergence. This is also the threshold for eliminating production rules whose probability is too small. The tuning of the learning rate α here is trivial; α mainly affects the convergence speed. The difference in the gradient across the multiple rounds before the probabilities converge comes from the randomness in sampling, which can result in slight differences in the reward function of each production rule. We find that setting α anywhere between 0.01 and 0.2 makes hardly any difference in performance; we set α to 0.1 in our evaluation.
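The sketch below shows one plausible form of this adjustment loop under the same assumed grammar representation; compute_rewards is a hypothetical helper returning one reward per production rule, and the update and pruning details are simplifications of REINAM's procedure rather than a faithful reimplementation.

```python
ALPHA = 0.1  # learning rate used in the evaluation

def adjust_probabilities(pcfg, compute_rewards, max_rounds=100):
    """Iteratively update rule probabilities until all changes fall below the threshold."""
    for _ in range(max_rounds):
        converged = True
        for lhs, productions in pcfg.items():
            threshold = 1.0 / (10 * len(productions))       # 1/(10 * number of rules of lhs)
            rewards = compute_rewards(lhs, productions)     # one reward per production rule
            # Gradient-style step of the probability toward the reward signal.
            updated = [(rhs, p + ALPHA * (r - p))
                       for (rhs, p), r in zip(productions, rewards)]
            total = sum(p for _, p in updated)
            updated = [(rhs, p / total) for rhs, p in updated]  # renormalize per non-terminal
            if any(abs(p_new - p_old) > threshold
                   for (_, p_new), (_, p_old) in zip(updated, productions)):
                converged = False
            # Eliminate production rules whose probability falls below the same threshold.
            pcfg[lhs] = [(rhs, p) for rhs, p in updated if p >= threshold]
        if converged:
            break
    return pcfg
```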