Evaluating Program Analysis And Testing Tools With the RUGRAT Random Benchmark Application Generator

Summary

Benchmarks are heavily used in different areas of computer science to evaluate algorithms and tools. In program analysis and testing, open-source and commercial programs are routinely used as benchmarks to evaluate different aspects of the algorithms and tools. Unfortunately, many of these programs are written by programmers who introduce different biases, and it is very difficult to find programs that can serve as benchmarks with highly reproducible results.

We propose a novel approach for generating random benchmarks for evaluating program analysis and testing tools. Our approach uses stochastic parse trees, where language grammar production rules are randomly instantiated to generate programs that meet overall program configuration goals. We implemented our tool for Java and applied it to generate benchmarks with which we evaluated different program analysis and testing tools.
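To illustrate the idea of randomly instantiating grammar production rules, here is a minimal sketch. It is not RUGRAT's implementation: the toy grammar, the class name `StochasticGrammarSketch`, the `expand` method, and the depth budget standing in for a "configuration goal" are all hypothetical and chosen only for illustration. The sketch assumes Java 9 or later (for `Map.of` and `List.of`).

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StochasticGrammarSketch {
    // Hypothetical toy grammar: each nonterminal maps to its alternative productions.
    // The first alternative of every nonterminal is chosen so that always taking it
    // terminates; this lets the depth budget force the expansion to finish.
    static final Map<String, List<List<String>>> RULES = Map.of(
        "Stmt", List.of(
            List.of("Assign"),
            List.of("if (", "Expr", " > 0) { ", "Stmt", " }")),
        "Assign", List.of(
            List.of("x = ", "Expr", ";")),
        "Expr", List.of(
            List.of("x"),
            List.of("1"),
            List.of("(", "Expr", " + ", "Expr", ")")));

    // Expands a symbol into program text by picking a random production
    // at each nonterminal. Symbols without rules are treated as terminals.
    static String expand(String symbol, Random rng, int depth) {
        List<List<String>> alternatives = RULES.get(symbol);
        if (alternatives == null) {
            return symbol; // terminal: emit as-is
        }
        // Once the depth budget is spent, fall back to the first alternative.
        List<String> production = depth <= 0
            ? alternatives.get(0)
            : alternatives.get(rng.nextInt(alternatives.size()));
        StringBuilder out = new StringBuilder();
        for (String part : production) {
            out.append(expand(part, rng, depth - 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A fixed seed makes the generated text reproducible across runs.
        System.out.println(expand("Stmt", new Random(42), 4));
    }
}
```

In this sketch the depth budget plays the role of a single, very simple configuration goal. A real generator would track many more constraints (for example, numbers of classes and methods) and would need to emit programs that compile.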

Technical paper: In the proceedings of the 10th International Workshop on Dynamic Analysis (WODA), 2012 [PDF]

Slides: From the talk given at the 10th International Workshop on Dynamic Analysis (WODA), 2012 [PPTX] [PDF]

Research Questions

Our experiments were done in two different setups:

RUGRAT: RUGRAT generates single-threaded programs that interact with the environment only via standard output. The goal of the experimentation was to determine whether RUGRAT-generated applications can be useful for finding bugs or shortcomings in program analysis and testing (RAT) tools. In the context of RAT tools we also refer to RUGRAT-generated applications as applications under test (AUTs). Specifically, we addressed the following research questions:
- RQ1. How similar are RUGRAT-generated applications to third-party applications?
- RQ2. How do program analysis tools behave while analyzing RUGRAT-generated applications?
- RQ3. Can the RUGRAT-generated applications find defects in the program analysis tools?
RUGRAT4Load: This is an extended version of RUGRAT that allows generated programs to use multiple threads and to access the network and disk. The goal of the experimentation was to determine whether RUGRAT4Load-generated applications show dynamic characteristics similar to those of widely used benchmark applications. Specifically, we addressed the following research question:
- RQ4. How do the dynamic characteristics (i.e., memory and CPU usage, disk I/O, etc.) of the generated programs compare with widely used benchmark applications?

Experimental Results for RUGRAT:

We used a Sun HotSpot 1.6.0_24 JVM running on a Windows XP system with a 2.33 GHz 64-bit Intel Xeon processor and 32 GB RAM to run the experiments.

RQ1. How similar are RUGRAT-generated applications to third-party applications?

We collected 78 different software metrics for the generated programs and for 33 open-source applications that we selected from SourceForge (the list of these applications can be found here). We ran statistical tests to determine whether the generated applications differ from these open-source applications in terms of these software metrics. Our experimental results show that it is statistically impossible to tell whether an application is generated or written by programmers using such software metrics.

RQ2. How do program analysis tools behave while analyzing RUGRAT-generated applications?

We performed two different sets of experiments. In experiment 1, we used standard ranges for the RUGRAT configuration parameters (see the paper for details); in experiment 2, we allowed larger ranges. In total, we performed 616 experiments on 77 RUGRAT-generated programs by invoking four program analysis tools (FindBugs, PMD, JLint, and Randoop), each in two different configuration setups (77 × 4 × 2 = 616). We found interesting behaviors of these tools and report them in the paper.

RQ3. Can the RUGRAT-generated applications find defects in the program analysis tools?

Yes. Among other findings, we found two limitations of FindBugs in which it skips the analysis and may miss bugs (Figure 2(b) below). We also independently discovered an issue with Randoop that had previously been reported as Issue 14.

Preliminary Experimental Results for RUGRAT4Load:

We used a Sun HotSpot 1.6.0_24 64-bit JVM running on a Windows 7 system with a 2.4 GHz 64-bit Intel i5 processor and 4 GB RAM to run the experiments.

RQ4. How do the dynamic characteristics (i.e., memory and CPU usage, disk I/O, etc.) of the generated programs compare with widely used benchmark applications?

Preliminary empirical experiments show that generated programs can produce significant I/O and computation loads compared to a benchmark application (JPetStore). For example, Figure 1 shows the memory usage of the generated programs and the benchmark application.

Download

Here is a list of the programs we used in our experiments:

1. RUGRAT (Mandatory)

Description: RUGRAT is a stochastic benchmark application generator for Java, itself written in Java. We used RUGRAT to generate the more than 70 programs used in the paper. See README.txt for details.
Download (source code + executable Jar file): Folder in Google Drive
Contact: Ishtiaque Hussain (ishtiaque.hussain [at] mavs [dot] uta [dot] edu)

2. RUGRAT4Load (Optional)

Description: Special implementation of RUGRAT that produces multi-threaded programs and generates network access and disk I/O code in the program body.
Download: This is not available yet.
Contact: Mark Grechanik (drmark [at] uic [dot] edu)

3. Benchmark programs (Optional, can be generated using RUGRAT and configuration files)

Description: We generated dozens of Java applications that range from 300 LOC to 5 million LOC to evaluate popular program analysis and testing tools. The total size of these generated programs is over 90 GB. Because of space limitations, we provide a few sample programs and the scripts to generate them. Please note that, because of the random nature of the tool, newly generated programs may vary from the ones actually used in the experiments.
Download:
- Sample generated benchmark programs: ZIP file (141 MB)
- Scripts to generate and run experiments (please read the README.txt first): ZIP file (200 KB)
Contact: Ishtiaque Hussain (ishtiaque.hussain [at] mavs [dot] uta [dot] edu)

4. Tools and Libraries (Mandatory)

The following program analysis tools and libraries must be installed before conducting experiments:

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grants No. 0916139, 1017633, 1017305, and 1117369, as well as Accenture. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contact

RUGRAT is a collaboration between the University of Texas at Arlington, Accenture Technology Labs, the University of Illinois at Chicago, Georgia Tech, and North Carolina State University.

Ishtiaque Hussain

E-mail: ishtiaque.hussain [at] mavs [dot] uta [dot] edu

Affiliation: University of Texas at Arlington

Christoph Csallner

E-mail: csallner [at] uta [dot] edu

Affiliation: University of Texas at Arlington

Mark Grechanik

E-mail: drmark [at] uic [dot] edu

Affiliation: Accenture Technology Labs and University of Illinois at Chicago

Chen Fu

E-mail: chen.fu [at] accenture [dot] com

Affiliation: Accenture Technology Labs