PROJECT SUMMARY Test coverage such as branch coverage is commonly measured to assess the sufficiency of test inputs. To reduce tedious manual efforts in generating high-covering test inputs, various automated techniques have been proposed. Some recent effective techniques include Dynamic Symbolic Execution (DSE) based on path exploration. However, these existing DSE techniques cannot generate high-covering test inputs for programs using complex regular expressions due to a large exploration space; these complex regular expressions are commonly used for input validation and information extraction functionalities. Furthermore, PEOPLE Faculty Students Collaborators
EVALUATIONS To evaluate Reggae, we need subject programs that use complex RegExs for matching input strings. One common type of such programs is input validators. In our evaluation, we use the input validation code of an open source web application and a set of input validation programs synthesized for using complex RegExs from a popular RegEx library, RegExLib.com. The open source web application used in our empirical study is an Email extraction component in a job board and recruiting system named FlashRecruit. FlashRecruit is implemented in 915 files with 62,007 Lines Of Code (excluding comments and blank lines). The files include Java, XML, Javascript, JSP, CSS, DTD, and HTML files. In the Java files, which include 42,114 Lines Of Code (excluding comments and blank lines), FlashRecruit uses an Email RegEx to find Email addresses in text (a class named EmailExtracter). In the findEmailInText method of EmailExtracter, Statements 28 and 39 invoke a RegEx-matching operation to check whether a string is a valid email address. If the string is a valid email address, the email address is added into a set. As our DSE engine is designed for .NET programs but FlashRecruit is implemented in Java, we first used Microsoft Visual Studio to translate the program under test from Java to C#. We first use Pex to generate test inputs for the findEmailInText method with a time limit of 14 minutes. Pex generates 324 test inputs and achieve 80% branch coverage of the findEmailInText method. The 324 test inputs cannot cover Statements 28 or 39, as Pex cannot genetate a Word object with a value that can make the RegEx-matching operation return true. We next use Reggae to replace REGULAR_EXPRESSION in Statements 11 of EmailExtracter with its corresponding automaton, and then use Pex to generate test inputs for the findEmailInText method with a time limit of 14 minutes again. Pex generates 320 test inputs and achieve 90% branch coverage of the findEmailInText method. The 320 test inputs include 3 WordList objects containing a Word object with a value that can make the RegEx-matching operation return true. These WordList objects are: wordList1("dH_X_Z@0-20.0k", "word"), wordList2("L8-0-0@0B.080P", "word"), wordList3("BS_9_0@9-c.a8", "word"). ("word" is the value of Word.TYPE_WORD), which can cover Statement 28. Synthesized Validators We collect the RegExs from a popular RegEx library, RegExLib.com, to synthesize subject programs. RegExLib.com includes various commonly used RegExs. Among the RegExs collected from RegExLib.com, we select 14 RegExs with complex structures. There are three main principles for selecting these RegExs: (1) these RegExs represent different data types, such as email, vehicle number, and phone number, (2) these RegExs include selectable characters in a repeatable unit, such as (\w|-)+, where the \w and - are selectable characters in the brackets, and (3) these RegExs include repeatable characters within a repeatable unit, such as (@\w+)*, where the \w are repeatable both within and outside the brackets. In addition, as our implementation is based on the Brics Automaton and the Brics Automaton cannot deal with the RegExs that include some special characters such as < and >, we select the RegExs that do not include such characters and can be processed by the Brics Automaton.
The synthesized source code in our evaluation can be downloaded
here . The source code of our example:
PUBLICATIONS
SPONSORS
|



