Johnny Lee
Microsoft Corporation
typo_pl@hotmail.com
Abstract
Software development has changed greatly in the past twenty years. However, the programmer’s tools have not changed nearly as much over the same period. Programmers keep running into the same types of bugs that others have encountered, such as buffer overruns. Most software debugging is a slow manual process that does not scale well. I have developed a Perl script to locate various typographical errors in C or C++ source code. This paper describes the development of the script, the types of bugs that the script will report, auxiliary scripts and applications, and my experience developing my first Perl script.
Introduction
Typographical errors (typos) can be very frustrating and time-consuming to locate. One famous example of a typo is a mistyped period instead of a comma in a FORTRAN program, i.e.
DO 10 I = 1.10
instead of
DO 10 I = 1,10
The FORTRAN compiler interprets the line as an assignment statement instead of a loop. See <http://www.best.com/~wilson/faq/inicio.html#queIII1> for a full explanation. Another example of a typographical error is from comp.risks, Volume 20, Issue 18 <http://catless.ncl.ac.uk/Risks/20.18.html#subj9>, where there were too few equal signs in an if statement, which turned a comparison into an assignment. It took a programmer two days to find three of these typographical errors.
Locating bugs in an application can be a tedious process. A programmer can do a time-consuming review of the source code or, if the bug is easily reproducible, use a source code debugger to locate the bug. Otherwise, there are several computerized methods for locating bugs. Runtime analyzers like Rational Software’s Purify or NuMega’s BoundsChecker help locate bugs, but they often increase the execution time of the application and the amount of memory used. The more popular runtime analyzers detect many problems with memory accesses since they can easily detect when an application accesses invalid memory. Some runtime analyzers, such as BoundsChecker, also check API usage, e.g. illegal parameters passed to functions. Debug versions of memory allocation functions are also popular and effective. Source code analyzers are another category of debugging tools. They usually find a different class of bugs than runtime analyzers. Compiling at a high warning level and running lint on your source are examples of source code analysis. Intrinsa’s PREfix is a source code analyzer that can detect many of the bugs that runtime analyzers detect but its price is generally prohibitive except for individuals.
Having programmers review source code for typographical errors is a time-consuming, boring, and inefficient process. It is similar to looking for a needle in a haystack. It is very easy to overlook an errant semicolon or equal sign. This is a perfect job for a computer. Source code analyzers are better tools for locating typographical errors than runtime analyzers, as runtime analyzers usually detect only the symptoms of a typographical error, if they detect anything at all.
Development of an automated bug finder
One such experience with a typographical error made me search for a better solution. I spent a long time tracking down a bug to a line in the source code that had one too many equal signs. The extra equal sign turned an assignment statement into a comparison. My first solution involved using the Win32 application findstr.exe, similar to the UNIX grep command, to scan source code for the particular typo. The solution worked but was inflexible when handling more than one case. Debugging consisted of trial and error. My second solution involved using a batch file to control scanning the source code. This slightly increased the flexibility but was not a great improvement. I had heard that Perl was good at processing text so I cracked open the Pink Camel book and read a few chapters. I used some of the sample code as a framework for my third solution. Once I was satisfied with my typo.pl Perl script, I released the script to my group at Microsoft for their use. A feedback loop developed. I would release a new and improved script. People would find problems or make suggestions for new features or different classes of bugs. Then I would make more changes and the loop would start over again. After several iterations, I released the script companywide, which resulted in an even greater feedback loop.
Description of the Typo.pl Perl script
The main loop of the script scans files one line at a time. The script removes all C or C++ comments, string constants, and whitespace from the line before checking for possible errors. It searches the line for any keywords or user-specified functions. For keywords that have an associated expression, e.g. if statements, for-loops, and while-loops, the script takes the expression and scans it for any possible errors. For user-specified functions, the script takes the function parameters and any return code and scans them for possible errors. This may require specific Perl code for a particular function. The script keeps track of various statistics such as the number of lines scanned, number of comments, number of functions, and the number of statements as measured by semicolons.
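The clean-up step described above can be sketched in C, the language the script scans. This is only an illustration under stated assumptions, not the script's Perl code: strip_line and its parameters are hypothetical names, and the choice to keep character constants is a guess based on the sample output, which still shows '@'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from typo.pl): copies `line` into `out`
 * without comments, string constants, or whitespace. `in_comment` carries
 * the "inside a block comment" state from one line to the next, because
 * the scan works one line at a time. Output is truncated to fit `out`. */
static void strip_line(const char *line, char *out, size_t out_size,
                       int *in_comment)
{
    size_t n = 0;
    const char *p = line;

    assert(line != NULL && out != NULL && out_size > 0 && in_comment != NULL);

    while (*p != '\0') {
        if (*in_comment) {
            if (p[0] == '*' && p[1] == '/') {   /* end of block comment */
                *in_comment = 0;
                p += 2;
            } else {
                p++;
            }
        } else if (p[0] == '/' && p[1] == '*') { /* start of block comment */
            *in_comment = 1;
            p += 2;
        } else if (p[0] == '/' && p[1] == '/') { /* C++ comment: rest of line */
            break;
        } else if (*p == '"' || *p == '\'') {
            const char quote = *p;
            const char *start = p++;
            while (*p != '\0' && *p != quote) {
                if (*p == '\\' && p[1] != '\0')  /* skip escaped character */
                    p++;
                p++;
            }
            if (*p == quote)
                p++;
            if (quote == '\'') {                 /* assumption: keep 'c' constants */
                for (; start < p; start++)
                    if (n + 1 < out_size)
                        out[n++] = *start;
            }                                    /* string constants are dropped */
        } else if (isspace((unsigned char)*p)) {
            p++;                                 /* drop whitespace */
        } else {
            if (n + 1 < out_size)
                out[n++] = *p;
            p++;
        }
    }
    out[n] = '\0';
}
```

Called on each line in turn with the same in_comment variable, the helper reduces a line to the compact text that the later checks inspect. Removing all whitespace also merges neighbouring tokens such as "int x", so a real implementation has to take that into account when it looks for keywords.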
The script emits a warning when it thinks that the code may have an error. In most cases it cannot know for sure, so it errs on the side of caution and generates the warning. It is up to the programmer to decide whether the bug is real.
At the end of the scan, the script displays statistics about the scanned files including number of lines, bytes, comments, semicolons, and functions as well as the amount of time that the script took to scan the files.
The script can display statistics about each file that it scans.
Users can direct the script to extract all the strings in the scanned files. Users can spellcheck the list of extracted strings to locate all the misspelled words in the source code.
The possible errors that the script generates can be organized into several categories:
- Typographical errors, e.g.
X == Y;
instead of
X = Y;
- Incorrect API usage, e.g.
memset(buf, nCount, 0);
sets 0 bytes to the value nCount
- Incorrect logic, e.g.
if (x & 3 == 2)
is interpreted as
if (x & (3 == 2))
- Miscellaneous, e.g. assigning a value inside an Assert
See the List of possible errors at the end of this paper for a complete list of possible errors that the script generates.
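The incorrect-logic category can be checked directly in C. The two functions below are a small sketch with hypothetical names: the first spells out what the compiler evaluates for the unparenthesized test, the second is what the programmer most likely intended.

```c
#include <assert.h>

/* What the compiler evaluates for `x & 3 == 2`: == binds tighter than &,
 * so the expression is x & (3 == 2), which is x & 0 and therefore always 0.
 * The parentheses are written out here so the behaviour is explicit. */
static int low_bits_are_2_as_parsed(unsigned x)
{
    return (x & (3 == 2)) != 0;
}

/* What the programmer probably meant: mask first, then compare. */
static int low_bits_are_2_intended(unsigned x)
{
    return (x & 3) == 2;
}
```

For x = 2 or x = 6 the intended form is true while the form as parsed is false for every input, so the body of the if statement would never run.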
Perl was easy to learn. It has the fast turnaround time of interpreted BASIC. Its string processing capabilities are very rich. Regular expressions are a feature lacking from most other procedural languages. Arrays and hashes are easy to use and require much less management than in C or C++. Since there are so many ways to accomplish a given task in Perl, you need to profile each solution if you are worried about performance. Solutions that require few characters may or may not consume a lot of time. The Camel book offers several performance tips. I found that I needed to investigate whether each of these was valid for my script. Sometimes they were valid and sometimes they were not, e.g. following the tip to avoid $&, $', and $` did not make a measurable difference in the script’s runtime. There were the typical problems, e.g. confusing the string and numeric equality operators. Some of the error messages could be cryptic: using #$ARGV instead of the correct $#ARGV resulted in a confusing syntax error. It takes some time to get used to the suggested Perl programming style. I tend to write Perl code as if I were still writing C code. However, the problems are minimal compared to the time and effort that the script has saved.
Auxiliary Perl scripts and applications
I have developed several auxiliary Perl scripts and applications, which make handling the script’s output easier.
- Typosum.pl generates a summary of typo.pl’s output so the user can tell how many errors there are of a given type. Here’s an example of the typosum.pl output:
D:\src\zip22>perl c:\typo\typosum.pl -c <typo.txt
0: 11   1: 1   2: 0   3: 2   4: 0   5: 0   6: 0   7: 0   8: 0   9: 0
10: 0   11: 1   12: 0   13: 0   14: 0   15: 0   16: 0   17: 8   18: 0   19: 1
20: 2   21: 1   22: 0   23: 0   24: 0   25: 0   26: 3   27: 35   28: 0   29: 0
30: 24   31: 0   32: 6   33: 0   34: 0   35: 0   36: 0   37: 0   38: 0   39: 0
40: 0   41: 0   42: 0   43: 0   44: 1   45: 0   46: 14   47: 0   48: 0   49: 0
50: 0   51: 0   52: 0
Total= 110
- Denum.pl removes the line number information from the script’s output so one can easily generate a difference file between separate invocations of the script that scanned the same source code.
- TV.EXE parses the script’s output and displays the results in a cleaner, more organized manner. The application allows the user to sort the output based on different categories and displays each possible error in context. The user can quickly browse the script’s output to determine whether any of the possible errors are real. You could use Emacs or Vim to perform a similar function.
Usage
- The user specifies the behaviour of private functions in a text file.
- The user runs the script from the topmost directory of the source code, directing the output to a file.
- The user browses the file with TV.EXE or a text editor to check for any valid bugs.
Example:
Scan the source code of Info-Zip’s zip2.2 archiver.
We will use a predefined option file that specifies the behaviour of most Win32 functions.
D:\src\zip22>perl c:\typo\typo.pl -optionfile:c:\typo\win32.txt c
// Perl version: 5.001
// TYPO.PL Version 2.45 Jun 15 1999 by Johnny Lee (johnnyl)
// OPTIONS: '-optionfile:c:\typo\win32.txt c'
// START: Tue Jun 15 17:29:14 1999
D:\src\zip22\fileio.c (280): no immediate strchr check 27: =strchr(q,'@') [q]
.
.
D:\src\zip22\windll\windll.c (106): using malloc result w/no check 30: *zcomment = 0; [zcomment]
// FUNCS: 545
// SEMIS: 10,760
// COMMS: 5,615
// LINES: 34,020
// CHARS: 991,749
// START: Tue Jun 15 17:29:14 1999
// STOP: Tue Jun 15 17:29:32 1999
If the user redirects the script output to a file, then the user can browse the output using the TV.EXE application. See Figure 1.
Figure 1. TV.EXE displaying the script output from scanning Info-Zip’s zip2.2 source code.
You can use typosum.pl to generate a listing that displays the number of errors of each type found in the source code. If the user wants to compare the output of separate invocations of the script on the same group of files, then the line numbers have to be removed because modifications to the files may shift code around. Denum.pl removes the line numbers from the script output so you can use a diff-like tool to determine if there are any changes.
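The idea behind removing line numbers can be sketched in C. This is not denum.pl itself: remove_line_number is a hypothetical helper, and both the "(number):" pattern it looks for and the shape of the line it leaves behind are assumptions based on the sample output shown earlier.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: deletes the first " (digits)" group that is
 * immediately followed by ':' from a warning line, in place.
 * Lines without such a group, e.g. the "// ..." header lines, are
 * left unchanged. */
static void remove_line_number(char *line)
{
    char *open;

    assert(line != NULL);

    for (open = strchr(line, '('); open != NULL; open = strchr(open + 1, '(')) {
        char *p = open + 1;

        while (isdigit((unsigned char)*p))
            p++;
        if (p > open + 1 && p[0] == ')' && p[1] == ':') {
            char *from = p + 1;               /* the ':' after ')' */
            char *to = open;

            if (to > line && to[-1] == ' ')   /* also drop the space before '(' */
                to--;
            memmove(to, from, strlen(from) + 1);
            return;
        }
    }
}
```

After every line of two output files has been passed through a filter like this, a warning that merely moved to a different line number no longer shows up as a difference.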
Pros and Cons of the Perl script
The Perl script is not a silver bullet. It does not parse C or C++ correctly. The script does not handle #include files or macros. Macros or complex code can fool the script and generate false positives. The script does not handle if-else statement control flow correctly. This failure generates more false positive warnings. The script has evolved over several years. Conditions that were once valid may not be valid any longer. My main job is not developing and maintaining the Perl script. I work on the script in my spare time; when I was recuperating from a running injury, I had plenty of it. I have not had the time to document all these assumptions or revisit them. The script cannot determine if the programmer designed the code to execute in a certain manner, e.g. falling through from one case statement to the following case statement.
However, the script can scan source code written for different operating systems. When I ran the script on my PC, I was able to find real bugs in Macintosh and VMS source code. The script is easy to use, runs quickly, and does not require the modification of any makefiles to work. The time required to investigate all the reported warnings is much less than the time required for one or more programmers to review the source code. The script does not get tired, suffer from eyestrain or repetitive-stress injuries, or whine about scanning more source code. Programmers can run the script on their code before they check in to see whether they have introduced any of the bugs it detects.
Where to get the Typo.pl Perl script
The homepage for the typo.pl Perl script is http://www.geocities.com/typopl/. I will also submit the typo.pl script to CPAN after the 1999 Perl Conference.
List of possible errors
- Semicolon appended to an if statement. VC98 emits a warning for this.
if (x == y);
exit(1);
- Use of == instead of = in assignment statements. VC98 emits a warning for this. Handles single + and - characters too.
X == Y;
X - NULL;
- Assignment of a number in an if statement, probably meant a comparison. VC98 emits a warning for this.
if (x = 3)
- Assignment within an Assert
ASSERT(Z = 4);
- Increment/decrement of ptr, ptr's contents not modified. Programmer may have meant to modify ptr's contents.
*ptr++;
- Logical AND with a number
x = y && 1;
- Logical OR with a number
x = y || 1;
- Bitwise-AND/OR/XOR of number compared to another value. This may have an undesired result due to C precedence rules since bitwise-AND/OR/XOR has lower precedence than the comparison operators.
if (x & 1 == 0) ==> if (x & (1 == 0))
- Referencing Release/AddRef instead of invoking them. MSVC 5+ can detect this case.
pFoo->Release;
- Whitespace following a line-continuation character
- Shift operator (<<, >>) followed by +, -, *, / may have undesired result due to C precedence rules. The shift operator has lower precedence. VC98 emits a warning for this.
x = y << 8 + 12; ==> x = y << (8 + 12);
- Very basic check for uninitialized variables in for-loops
- Misspelling the word Microsoft
- Swapping the last two args of memset may set 0 bytes
memset(buf, nCount, 0);
- Swapping the last two args of FillMemory may set 0 bytes
FillMemory(pAction, 0, sizeof(Action));
- LocalReAlloc/GlobalReAlloc may fail without MOVEABLE flag
- Assigning result of realloc function to same variable that's realloc'ed may result in leaked memory if realloc fails since NULL will overwrite the original value
pch = (char *)realloc(pch, cch+20);
- ReAlloc flags in wrong place or using ReAlloc flags for a different realloc API, e.g. passing GMEM_MOVEABLE to LocalReAlloc. It's not an error to the compiler, but I'd say you were playing with fire.
- case statement without a break/return/goto/exit
case 2: Foo(); case 3: Bar(); break;
If you add a comment with the text fall through or no break before the next case statement, then the script will not emit a warning.
- Comparing CreateFile's return value vs NULL for failure. Problem is that CreateFile returns INVALID_HANDLE_VALUE on failure.
- Casting a 32-bit number (may not be 64-bit safe)
- Casting a 7-digit hex number with high-bit set in first digit. Programmer may have meant to add an extra digit.
- Comparing functions that return handles to INVALID_HANDLE_VALUE for failure. Problem is that these functions return NULL on failure.
- Comparing OpenFile/_lopen/_lclose/_lcreat return value to anything other than HFILE_ERROR, which is the documented return value on failure.
- Comparing _alloca result to NULL is wrong since _alloca fails by throwing an exception, not returning NULL.
- MSVC's _alloca fails by throwing an exception, so check to see if _alloca is within a try {}
- Check to see if the result from functions that return a value like CreateWindow or CreateThread is checked at the first if-stmt.
- Check for multiple inequality comparisons of the same var separated by ||, e.g.
if ((x != 0) || (x != 2))
In this case, if x == 0, the second comparison will succeed and the code will enter the body of the if-statement. Programmer probably meant && instead of ||.
- Similar to the previous check, look for cases of the form:
if ((x == 0) && (x == 1))
- If a function result is used before it has been checked for success
- Check for use of lstrcpy/strcpy and other functions that can overflow buffers.
- Check to see if function result was stored somewhere
- Trying to take the logical inverse of a number.
x = !3;
- If the result from the new operator is used before it has been checked for success
- Function that throws exception on error is not in a try {}.
- Check for misspelled defined symbols. User must do most of the investigative work. The script will note all the symbols used in #ifdef, #ifndef, #if, #elif statements and print them out at the end.
- Check for bitwise-XORing one number with another number
x = y / (10 ^ 7);
- Wrong flags used with MapViewOfFile.
- Wrong flags used with CreateFile
- Duplicate flags passed to CreateFile
- Complain about returning unchecked function results
- Using HRESULT function result w/no check
- Double semicolon at the end of a statement
- Incorrectly calculating memory needed by using strlen(X+1) instead of strlen(X)+1
- Assigning TRUE to boolVal field of VARIANT, should use VARIANT_TRUE (= -1)
- Empty statement after while/for loop
- Use of (!x & Y), probably meant (!(x & Y)); C/C++ precedence rules have '!' before '&'
- Testing a #define for a value instead of existence
- Test a char for '0' instead of '\0', i.e. user meant to test for null terminator instead of number 0
- Use of a disallowed function
- Use of a disallowed string
- Filling an object with zeros, i.e.
memset(this, 0, sizeof this);
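Two of the checks in this list, assigning the result of realloc to the variable being reallocated and computing a size with strlen(X+1), have simple corrected forms. The sketch below uses hypothetical helper names (grow_buffer, copy_string) and standard C library calls only. It illustrates the corrected patterns and is not code from the script.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: grows *ppch to new_size bytes.
 * Returns 1 on success. On failure it returns 0 and leaves *ppch pointing
 * at the original block, so the caller can still use or free it. */
static int grow_buffer(char **ppch, size_t new_size)
{
    char *tmp;

    assert(ppch != NULL);

    tmp = realloc(*ppch, new_size);   /* keep the result in a temporary ... */
    if (tmp == NULL)
        return 0;                     /* ... so a failure does not overwrite *ppch */
    *ppch = tmp;
    return 1;
}

/* Hypothetical helper: returns a heap copy of s, or NULL if malloc fails. */
static char *copy_string(const char *s)
{
    size_t cb;
    char *copy;

    assert(s != NULL);

    cb = strlen(s) + 1;               /* length plus terminator; strlen(s + 1)
                                         would measure from the second character */
    copy = malloc(cb);
    if (copy != NULL)
        memcpy(copy, s, cb);
    return copy;
}
```

With the temporary pointer, a failed realloc no longer leaks the original block, and the strlen(s) + 1 form always reserves room for the terminating null character.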
Acknowledgments
I would like to thank the many people at Microsoft who have written to me with suggestions or reported bugs.