Reggae: Automated Test Generation for Programs using Complex Regular Expressions


[Summary] [People] [Evaluations] [Publications]

PROJECT SUMMARY

Test coverage such as branch coverage is commonly measured to assess the sufficiency of test inputs. To reduce tedious manual efforts in generating high-covering test inputs, various automated techniques have been proposed. Some recent effective techniques include Dynamic Symbolic Execution (DSE) based on path exploration. However, these existing DSE techniques cannot generate high-covering test inputs for programs using complex regular expressions due to a large exploration space; these complex regular expressions are commonly used for input validation and information extraction functionalities. Furthermore,

the existing test coverage metrics such as branch coverage may not reflect the sufficiency of test inputs for such programs. To address these issues, we present an approach named Reggae for improving test generation and coverage measurement based on a notion of specialization such as transforming a generic regular-expression-matching operation into a specialized one. Specifically, our approach includes a technique to reduce the exploration space of DSE in test generation, and a technique for measuring test coverage of regular expressions based on their corresponding automata, including state coverage and transition coverage. In our evaluation, we apply our approach on various input validation programs that use complex regular expressions. Empirical results show that, in comparison with existing approaches, our

approach generates test inputs to achieve 24% higher branch coverage on average and improve the test coverage of the regular expressions with 39% higher state coverage and 26% higher transition coverage on average.

PEOPLE

Faculty

Tao Xie (Principal Investigator)

Students

Nuo Li (PhD Student)

Collaborators

Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte (Microsoft Research)

EVALUATIONS

To evaluate Reggae, we need subject programs that use complex RegExs for matching input strings. One common type of such programs is input validators. In our evaluation, we use the input validation code of an open source web application and a set of input validation programs synthesized for using complex RegExs from a popular RegEx library, RegExLib.com.

Empirical Study of an Open Source Application

The open source web application used in our empirical study is an Email extraction component in a job board and recruiting system named FlashRecruit. FlashRecruit is implemented in 915 files with 62,007 Lines Of Code (excluding comments and blank lines). The files include Java, XML, Javascript, JSP, CSS, DTD, and HTML files. In the Java files, which include 42,114 Lines Of Code (excluding comments and blank lines), FlashRecruit uses an Email RegEx to find Email addresses in text (a class named EmailExtracter). In the findEmailInText method of EmailExtracter, Statements 28 and 39 invoke a RegEx-matching operation to check whether a string is a valid email address. If the string is a valid email address, the email address is added into a set. As our DSE engine is designed for .NET programs but FlashRecruit is implemented in Java, we first used Microsoft Visual Studio to translate the program under test from Java to C#.

We first use Pex to generate test inputs for the findEmailInText method with a time limit of 14 minutes. Pex generates 324 test inputs and achieve 80% branch coverage of the findEmailInText method. The 324 test inputs cannot cover Statements 28 or 39, as Pex cannot genetate a Word object with a value that can make the RegEx-matching operation return true. We next use Reggae to replace REGULAR_EXPRESSION in Statements 11 of EmailExtracter with its corresponding automaton, and then use Pex to generate test inputs for the findEmailInText method with a time limit of 14 minutes again. Pex generates 320 test inputs and achieve 90% branch coverage of the findEmailInText method. The 320 test inputs include 3 WordList objects containing a Word object with a value that can make the RegEx-matching operation return true. These WordList objects are: wordList1("dH_X_Z@0-20.0k", "word"), wordList2("L8-0-0@0B.080P", "word"), wordList3("BS_9_0@9-c.a8", "word"). ("word" is the value of Word.TYPE_WORD), which can cover Statement 28.

Synthesized Validators

We collect the RegExs from a popular RegEx library, RegExLib.com, to synthesize subject programs. RegExLib.com includes various commonly used RegExs. Among the RegExs collected from RegExLib.com, we select 14 RegExs with complex structures. There are three main principles for selecting these RegExs: (1) these RegExs represent different data types, such as email, vehicle number, and phone number, (2) these RegExs include selectable characters in a repeatable unit, such as (\w|-)+, where the \w and - are selectable characters in the brackets, and (3) these RegExs include repeatable characters within a repeatable unit, such as (@\w+)*, where the \w are repeatable both within and outside the brackets. In addition, as our implementation is based on the Brics Automaton and the Brics Automaton cannot deal with the RegExs that include some special characters such as < and >, we select the RegExs that do not include such characters and can be processed by the Brics Automaton.

The synthesized source code in our evaluation can be downloaded here .

The source code of our example:

    • InputsChecker.cs
    • EmailAutomaton.cs
    • CreditAutomaton.cs
    • RegexIsMatch_CreditAutomaton.cs

PUBLICATIONS

    1. Nuo Li, Tao Xie, Nikolai Tillmann, Jonathan de Halleux, and Wolfram Schulte. Reggae: Automated Test Generation for Programs using Complex Regular Expressions. In Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering (ASE 2009), Short Paper, Auckland, New Zealand, November 2009. [PDF][BibTeX]

SPONSORS

National Science Foundation Award CCF-0725190, Science of Design Program (01/01/2008-12/31/2010)

National Science Foundation Award CNS-0716579, Cyber Trust Program (08/01/2007-07/31/2010)

NIST Supplement to National Science Foundation Award CNS-0716579