SAS Where= Option

If you want to filter records from the input dataset or external file there are various options available, like;

using IF statement (sub-setting If), WHERE statement and WHERE= dataset option etc.

But most useful and better out of those is using WHERE= dataset option.

Below are reasons behind it.

  • It has got the ability to eliminate the disqualifying records before even bringing them into PDV. Unlike Sub-setting IF statement which filters the record after taking into PDV i.e. while program execution.

  • Where= dataset option gives flexibility over its counterpartst; as it can be used to filter individual dataset in case if someone is having multiple datasets in SET statement. Whereas WHERE and IF statements are applied to all the datasets in SET statement.

Lets observe these benefits of WHERE= dataset option over its counter parts with the example below;

Example:

Consider that for some business need one need to find out foreign nationals accounts from the banks Current and Saving account databases.

  • In Current Account database those accounts are denoted by code 'FRGN' in ACT_TYPE column.

  • In Savings Account database those accounts are denoted by code 'F' in ACCT_CODE column.

Now let's see how to code this.

Difference Between Sub-setting IF and WHERE:

First of all we will consider the case of difference between Sub-setting IF and WHERE with demonstration of current_account dataset explained in the example above.

SAS Code: Difference Between IF and WHERE

/* Sample Current Accounts Dataset */

data current_accounts;

input acc_no open_dt date9. act_type $5. min_bal;

cards;

110011 12JUL1999 DMST 0

110012 12JUL1999 FRGN 0

110013 13AUG1999 FRGN 0

110014 17SEP2000 DMST 0

;

/* Sample Current Accounts Dataset */

data saving_accounts;

input acc_no open_dt date9. acct_code $2. min_bal;

cards;

100011 11MAY2000 D 200

100012 11JUN2000 F 200

100013 13AUG2000 F 200

100014 21SEP2000 F 200

;

/* Lets filter the dataset current_accounts using Sub-setting IF statement */

data temp;

set current_accounts;

/*examine the contents of PDV */

put _all_;

if act_type ='FRGN';

run;

/* Now lets see how it works with WHERE dataset option */

data temp;

set current_accounts (where=(act_type='FRGN'));

/*examine the contents of PDV */

put _all_;

run;

Now lets examine the results using execution LOG of the above code.

LOG:

74 data temp;

75 set current_accounts;

76 /*examine the contents of PDV */

77 put _all_;

78 if act_type ='FRGN';

79 run;

acc_no=110011 open_dt=14437 act_type=DMST min_bal=0 _ERROR_=0 _N_=1

acc_no=110012 open_dt=14437 act_type=FRGN min_bal=0 _ERROR_=0 _N_=2

acc_no=110013 open_dt=14469 act_type=FRGN min_bal=0 _ERROR_=0 _N_=3

acc_no=110014 open_dt=14870 act_type=DMST min_bal=0 _ERROR_=0 _N_=4

NOTE: There were 4 observations read from the data set WORK.CURRENT_ACCOUNTS.

NOTE: The data set WORK.TEMP has 2 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.02 seconds

cpu time 0.03 seconds

80

81 data temp;

82 set current_accounts (where=(act_type='FRGN'));;

83 /*examine the contents of PDV */

84 put _all_;

85 run;

acc_no=110012 open_dt=14437 act_type=FRGN min_bal=0 _ERROR_=0 _N_=1

acc_no=110013 open_dt=14469 act_type=FRGN min_bal=0 _ERROR_=0 _N_=2

NOTE: There were 2 observations read from the data set WORK.CURRENT_ACCOUNTS.

WHERE act_type='FRGN';

NOTE: The data set WORK.TEMP has 2 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.04 seconds

cpu time 0.04 seconds

From the LOG above; its very clear that, while processing the Sub-setting IF statement SAS takes all the records into PDV (all the 4 records in this case) and then filters the record depending on given condition.

While for processing WHERE statement SAS filters the records depending on given condition before taking into PDV (Just taking only 2 records into PDV) and hence it is clear that using WHERE (statement or dataset option) is more efficient than using sub-setting IF.

Moving ahead; lets see how WHERE= dataset option is more flexible to use than WHERE statement using the above scenario of current_acoount and saving_account datasets.

SAS Code:

data foreign_nationals (DROP= act_type acct_code);

set current_accounts (WHERE=(act_type='FRGN')) saving_accounts (WHERE=(acct_code='F'));

put _ALL_;

run;

proc print data=foreign_nationals;

run;

As we know the column names on which filter condition has to be applied has got different names and different data in both the datasets; so its not possible to use the WHERE statement here; otherwise it will give error saying corresponding column not found.

And hence the use WHERE= dataset option is more convincing in this case. Below is the output dataset generated with the code above.

Output:

The SAS System 19:30 Thursday, November 5, 2009 1

Obs acc_no open_dt min_bal

1 110012 14437 0

2 110013 14469 0

3 100012 14772 200

4 100013 14835 200

5 100014 14874 200