SAS KEEP= option, SAS DROP= option

Many a times while working with multiple SAS datasets we just want to have only few columns from source dataset and need to drop all other columns.

It is very flexible with SAS; if the list of columns to DROPPED is large as compared to the list to columns to be KEPT then just use KEEP= option and vice verca.

In addition to these dataset options SAS has also provided KEEP and DROP statements.

However it is always better to use dataset options wherever possible, instead of data step statements because it is more efficient way than its counterpart. It's because

1. If we are dealing more than one input datasets in SET statement and if we need to operate only on one of those datasets then we have no other way. In other words Unlike KEEP and DROP statements KEEP= or DROP= dataset options work only on the specified dataset offering more flexibility to the programmer.

2. If we use data step KEEP or DROP statement; it works while execution (i.e. after compilation of the code) when SAS has already brought the columns in to the memory, so its not efficient memory management, on the other hand if we use dataset options then those columns get dropped before even taken into memory hence more efficient.

Lets get this clear by example.

Suppose there is dataset as shown in the example below called "monthly_deposites" which has got the amount deposited monthly into each account.

Now we need to have a dataset which has got the total amount deposited during the year into each account. which can be easily calculated using SUM() function. however suppose if we need only account_number and total amount in output dataset then we need to drop all the 12 columns.

Its obvious choice to give only those columns which needs to be kept using KEEP= dataset option instead of providing a long list of 12 columns names which needs to be dropped.

(This can be used alternatively in case of exactly opposite requirement)

Sample SAS Code:

/* creating input dataset */

data monthly_deposites;

input acc_no jan_dep feb_dep mar_dep apr_dep may_dep jun_dep jul_dep aug_dep sep_dep oct_dep nov_dep dec_dep;

total_amount = sum(jan_dep, feb_dep, mar_dep, apr_dep, may_dep, jun_dep, jul_dep, aug_dep, sep_dep, oct_dep, nov_dep, dec_dep);

cards;

11011 500 500 500 500 500 500 500 500 500 500 500 500

11012 600 600 600 600 600 600 600 600 600 600 600 600

;

/* Now, lets see the two ways to get the same output

1. Using KEEP= dataset option

2. Using KEEP data step statement

*/

/* using KEEP= data set option */

data tot_deposites_1;

set monthly_deposites (keep=acc_no total_amount);

put _all_;

run;

/* using KEEP datastep statement */

data tot_deposites_2;

set monthly_deposites;

put _all_;

keep acc_no total_amount;

run;

To demonstrate how the PDV gets shaped during the execution of both of the methods; we used

PUT _ALL_; statement to see the contents of the PDV during the execution.

We will get to see this in LOG; so lets examine the log carefully

LOG

197 data tot_deposites_1;

198 set monthly_deposites (keep=acc_no total_amount);

199 put _all_;

200 run;

acc_no=11011 total_amount=6000 _ERROR_=0 _N_=1

acc_no=11012 total_amount=7200 _ERROR_=0 _N_=2

NOTE: There were 2 observations read from the data set WORK.MONTHLY_DEPOSITES.

NOTE: The data set WORK.TOT_DEPOSITES_1 has 2 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.03 seconds

cpu time 0.01 seconds

201

202 data tot_deposites_2;

203 set monthly_deposites;

204 put _all_;

205 keep acc_no total_amount;

206 run;

acc_no=11011 jan_dep=500 feb_dep=500 mar_dep=500 apr_dep=500 may_dep=500 jun_dep=500 jul_dep=500

aug_dep=500 sep_dep=500 oct_dep=500 nov_dep=500 dec_dep=500 total_amount=6000 _ERROR_=0 _N_=1

acc_no=11012 jan_dep=600 feb_dep=600 mar_dep=600 apr_dep=600 may_dep=600 jun_dep=600 jul_dep=600

aug_dep=600 sep_dep=600 oct_dep=600 nov_dep=600 dec_dep=600 total_amount=7200 _ERROR_=0 _N_=2

NOTE: There were 2 observations read from the data set WORK.MONTHLY_DEPOSITES.

NOTE: The data set WORK.TOT_DEPOSITES_2 has 2 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 0.04 seconds

cpu time 0.01 seconds

One can note from the above log that; KEEP statement unnecessarily carries the unwanted variables through out the datastep execution. While KEEP= dataset option wont take them into the PDV at all hence better memory management.

Another thing to note is that; the Real & CPU time required to carry out both the methods, which again show a favor for KEEP= dataset option.