Programming‎ > ‎

Suppress Small Cell Counts in SAS Reports

Introduction

When reporting public health statistics it is common to suppress small cell counts by replacing them with a symbol. Doing so protects the anonymity of small subgroups of the population. SAS provides no obvious way to do this so I suspect many people end up doing this manually which can be tedious and error prone. Given the lack of literature on this topic and my own history of forgetting to suppress small cell counts, I have decided there must be a better way to automatically suppress small cell counts without leaving SAS.

I use the following fictitious data just in case you want to try some of these examples yourself:

DATA sample;
INPUT ID gender $ age grade $;
CARDS;
8821  M  20  A
8959  F  21  A
1219  F  21  A
5684  M  20  A
9891  M  22  B
2668  F  21  B
2802  M  21  B
5557  F  22  B
2231  M  21  B
5793  F  20  B
6158  F  21  B
1233  M  19  B
4493  M  21  B
3038  M  22  B
8227  M  20  B
6033  M  21  B
9130  M  23  B
1705  M  22  C
5079  F  21  C
2009  M  21  C
7639  M  23  C
2438  M  21  C
1339  M  20  C
4175  F  21  C
2117  M  22  D
1905  F  22  D
9525  M  21  D
5405  F  22  E
 151  F  21  F
1350  M  21  F
3350  F  22  U
RUN;

Apply Formats to Reports

SAS formats can often be applied directly to reports produced by SAS, and not just the underlying data, and without some of the constraints SAS usually enforces when using a format on data. When you use a format on a report the format is applied temporarily and has no permanent effect on the way the underlying data is stored or presented.

The following odd looking format statement creates a format which suppresses frequencies of 1 or 2 when applied to a report.

PROC FORMAT;
VALUE mask
.     = 0
1,2   = '*'
OTHER = [16.];
RUN;
. = 0
Displays missing frequencies as zero. SAS shows missing numbers as a decimal, which can happen in cross tabulation reports. We can safely substitute zero for missing in this case since this wont compromise anyone’s anonymity, and a zero looks better in a report than a decimal.
1,2 = '*'
Displays frequencies of 1 or 2 as a '*' symbol. You can list any values, or range of values, you would like to suppress in output here, just like you normally would when you define a format. Likewise, the label can be any symbol you like, but I would not recommend numbers or missing values.
OTHER = [16.]
Format any other value using a built-in or previously defined format. In this case I have chosen an integer format '16.', but you can choose any sensible number format you want.

PROC TABULATE

PROC TABULATE is the most versatile of the reporting procedures when it comes to applying a format. With PROC TABULATE a format can be applied to specific rows or columns, or to the whole table.

The following code produces a table with two frequency distributions, the first one with the default number format applied, and the second one with the custom small number suppression format applied. This technique is useful if you only want to suppress small frequencies in some but not all columns.

PROC TABULATE DATA=sample MISSING;
CLASS grade;
TABLE grade,N='Frequency' N='Frequency Masked'*F=mask.;
RUN;

   Frequency
 Frequency Masked
 grade    
 A  4 4
 B  13 13
 C  7 7
 D  3 3
 E  1  *
 F  2  *
 U  1  *

When you are making a table which contains only frequency distributions you can use a useful shortcut to suppress all small numbers in the table by specifying the format at the start of the procedure.

PROC TABULATE DATA=sample MISSING F=mask.;
CLASS grade age;
TABLE grade ALL='Total', age*N=' ' ALL='Total'*N=' ';
RUN;

Notice how small frequencies and totals are suppressed while the totals still add up propperly.

   age  Total
   19  20
 21
 22  23  
 grade            
 A  0  *  *  0  0  4
 B  *  *  6  3  *  13
 C  0  *  4  *  *  7
 D  0  0  *  *  0  3
 E  0  0  0  *  0  *
 F  0  0  *  0  0  *
 U  0  0  0  *  0  *
 Total  *  5  15  8  *  31

PROC FREQ

PROC FREQ also lets you apply formats to output, but only to cross tabulations or n-way tables, NOT to one-way frequency distributions! If you like the convenience of making one-way frequency distributions with PROC FREQ as much as I do then this will hurt, but there are alternatives.

PROC FREQ DATA=sample;
TABLE grade*age / NOPCT NOCUM NOCOL NOROW MISSING FORMAT=mask.;
RUN;

 Table of grade by age
   age
 grade  19  20
 21
 22  23  Total
 A  0  *  *  0  0  4
 B  *  *  6  3  *  13
 C  0  *  4  *  *  7
 D  0  0  *  *  0  3
 E  0  0  0  *  0  *
 F  0  0  *  0  0  *
 U  0  0  0  *  0  *
 Total  *  5  15  8  *  31

PROC REPORT

PROC REPORT lets you apply a format to any single column, but not the whole table at once.

PROC REPORT DATA=sample NOFS;
COLUMN grade N;
DEFINE grade / GROUP MISSING;
DEFINE N / FORMAT=mask.;
RUN;

 grade  N
 A  4
 B  13
 C  7
 D  3
 E  *
 F  *
 U  *

Using Regular Expressions

REGular EXpressions, or regex to code gurus, is a very powerful and complex tool for matching patterns in strings of text. A regular expression is a pattern which is interpreted by a parser and used to identify parts of text which match a given pattern. Regular expression parsers are available in many programing languages and software, and the syntax, although apparently cryptic, is well worth learning if you need to match patterns in text from time to time. You also win major geek cred amongst your peers if you understand regex ;-)

Regular expressions can be made to alter the text they match, which is what I am going to show you in this section. Such a regular expression is also known as a substitution regular expression. A substitution regex has the following basic structure, 's/.../.../' , where 's' tells the parser that a substitution regex follows, the pattern between the first pair of forward slashes is the text to find, and the pattern between the second pair of forward slashes is the text to substitute for the matching text. I will elaborate on the match and substitute patterns in the following subsections.

A regular expressions ability to detect and alter patterns in text comes in handy when we look at HTML output produced by the SAS Output Delivery System (ODS). If you open a SAS HTML report in a text editor and look at how table cell values are written you will see that every value is surrounded by something similar to '<td class="r Data">' and '</td>', which is called a 'table data tagset' in HTML speak. The '<td...>' part is an 'open tag', which may or may not have any 'attributes'.  In this case the open tag happens to have an attribute called 'class' which has the value 'r Data', but could be almost anything. The '</td>' part is the 'close tag' for the previous open tag and represents the end of the cell content. The close tag must always be present for table data tags, and never has attributes. Between the open and close tags may be a value. We can capitalize on this predictability by writing a regular expression pattern to find such table data tags, look at the cell value inside the tag, and replace any small numbers with a symbol.

If you want to learn more about regular expressions then check out the Recommended Reading section at the end of this document.

Using an External Regex Parser via an Unnamed Pipe

Unnamed pipes allow you to perform one-way interactions with external programs from within SAS. In this case I am going to use an unnamed pipe to send ODS HTML output to an external program called sed, which processes the output before sending it to a file. Sed (Stream EDitor) is a Unix command line non-interactive program which parses text and uses a regular expression pattern to manipulate the text it parses. The GNU/Linux version of sed is open source and has also been ported to Windows, however few Windows systems will have sed installed by default.

The code below creates a file reference called 'report' which is actually an unnamed pipe that sends output to a web-page via sed. It then tells ODS to send output to the 'report', and then creates a frequency distribution with PROC FREQ.

FILENAME report PIPE "sed 's/<td\(.*\)>[1,2]<\/td>/<td\1>*<\/td>/g' - > ~/masked_output.html";

ODS HTML FILE=report;

PROC FREQ DATA=sample;
TABLE grade / NOPCT NOCUM NOCOL NOROW MISSING;
QUIT;

ODS HTML CLOSE;
<td\(.*\)>[1,2]<\/td>
This is the match portion of the regex, i.e. find text that looks like '<td class="r Data">1</td>'.
<td
This matches start of the table data open tag.
\(.*\)
This matches any characters between the start and end of the table data open tag. The parentheses are a 'capturing group' which remembers any matching text so we can use it later, like in the substitution pattern for example. The decimal tells the parser to match any character, and the asterisk tells the parser to match the preceding pattern zero or more times. The parentheses must be escaped with a backslash in the sed regex dialect because parentheses have a special meaning.
>
This matches end of the table data open tag.
[1,2]
This means match any one of this set of characters, in this case either the character '1' or '2'. Regex has no concept of numbers and sees them merely as characters. This can be problematic when trying to create a pattern to match bigger ranges of numbers, like 1 to 16 for example. A pattern to match numbers between 1 and 16 would look like '([1-9]|[1][0-6])', which reads like "match any single character between 1 and 9, or two consecutive characters where the first character is 1 and the second character is between 0 and 6".
<\/td>
This matches the table data close tag. HTML close tags always have a forward slash, which is another special character in regex, so the forward slash must be escaped with a backward slash.
<td\1>*<\/td>
This is the substitute portion of the regex, i.e. replace the found text with something like '<td class="r Data">*</td>'.
<td\1>
This is the table data start tag. The '\1' is a back reference to the group of characters that where captured by the match pattern '\(.*\)' which results in the captured text being inserted into this position of the substitution. If you have multiple capturing groups you can address each group by its order from left to right, so group 2 is '\2' etc.
*
This is the symbol to replace the small numbers with. The asterisk has no special meaning in the substitution part of the regex pattern so we don't need to escape it.
<\/td>
This is the table data close tag. The forward slash still has a special meaning in the substitution part of the regex so we must escape it with a backward slash.

The final 'g' tells the regex parser to repeat the match-and-substitute operation as many times as possible, otherwise the parser stops after the first match.

Using a Built-in SAS Regex Function

This method is a hack for those on a Windows system without sed. SAS has two built in regex flavours, the older regular expression functions such as RXMATCH and RXCHANGE with an unusual syntax, and the newer perl style regular expression functions such as PRXMATCH and PRXCHANGE with a much more common syntax. I fully recommend using the perl style regular expression functions in SAS since their syntax is similar to that used in other programming languages and there is much more help on the internet.

Using SAS regex functions takes two steps. First, output to a HTML file...

FILENAME step_one '~/masked_output_1.html';

ODS HTML FILE=step_one;

PROC FREQ DATA=sample;
TABLE grade / NOPCT NOCUM NOCOL NOROW MISSING;
QUIT;

ODS HTML CLOSE;

Second, read the file back into SAS again, process it line-by-line with the PRXCHANGE function in a _NULL_ data step, and then output it to another file.

FILENAME step_two '~/masked_output_2.html';

DATA _NULL_;
INFILE step_one MISSOVER LENGTH=input_length;
FILE step_two;

LENGTH line $500;

INPUT line $varying. input_length;

line = PRXCHANGE("s/<td(.*)>[1,2]<\/td>/<td$1>*<\/td>/",-1,line);

PUT line;
RUN;

The regex pattern used here is almost identical to the one used in the previous subsection except for a slightly different regex dialect. The parentheses of the capturing group '(.*)' should not be escaped with back slashes, and the  back reference to the previously captured group of characters is made with '$1'. The final '-1' parameter to the PRXCHANGE function tells the regex parser to repeat the match-and-substitute operation as many times as necessary.

Limitations

The methods I present can simplify the suppression of small cell counts, but they are not smart enough to prevent someone from re-calculating the content of suppressed cells by comparing surrounding cells. If someone is determined enough they may be able to calculate the value of a suppressed cell by subtracting the sum of all known cells on the same row or column from the row or column total. This is especially easy to do if there is only one suppressed cell on any given row or column, and column and row totals are shown.

   A  B  Total
 1  ?  5  6
 2  3  ?  5
 Total  4  7  22

A1 = 6-5 or 4-3
B2 = 5-3 or 7-5

   A  B  Total
 1  1
 5  6
 2  3  2  5
 Total  4  7  22

To prevent this problem, you must review your tables to ensure at least two cells are suppressed on any given column or row, if any. If only one cell is suppressed in a column or row, then either raise the mask threshold until at least two cells are suppressed in each column and row, or manually suppress the second lowest value in that column or row. Doing this manually will be an iterative process between columns and rows, since the second lowest cell in a column may be deduced from it's row total, for example.

UPDATE: Dayne Batten has found a novel solution to automatically overcome this problem, even for large multi-dimensional tables.

Conclusion

There are multiple ways to automatically suppress small cell counts with SAS. Use whatever techniques you are most comfortable with and whatever tools you have access to. Beware the limitations.

Recommended Reading

http://www.regular-expressions.info/tutorial.html by Jan Goyvaerts. By far my favorite place to learn about regular expressions.

http://www.cdphe.state.co.us/cohid/smnumguidelines.html for an explanation of why suppressing small cell counts is important.

Contact Information

This is still a work in progress so if you think I have missed something or got a fact wrong then please contact me and let me know.

Name: Brandon de Graaf
Enterprise: Injury Prevention Research Unit, http://www.otago.ac.nz/ipru/
Email: brandon dot degraaf at otago dot ac dot nz

Legal Stuff

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
Comments