summarizing-data-proc-summary-proc-means

Introduction:

It is very common in SAS programming to summarize the data for reporting or otherwise for further processing. For example if we consider a sashelp.cars dataset and want to find out what is average price of each type of car for each type of MAKE?

The obvious solution one can think of is PROC SQL way or using PROC SUMMARY/PROC MEANS.

Let me tell you frankly PROC SUMMARY and PROC MEANS are siblings with all their features commonly inherited. Only the difference is that PROC MEANS will give the output on console or other medium while by default and for PROC SUMMARY you need to provide the output option either as SAS dataset or PRINT. So we can use these procedures interchangeably as per requirement.

By default PROC SUMMARY and PROC MEANS provides summary stats like N(frequency of non-missing records), MEAN, MAXIMUM, MINIMUM, STD Deviation etc. for given set of attributes.

Here we will compare between PROC SQL and PROC SUMMARY for the example of sashelp.cars mentioned above.

Text Box

16 proc sql;

17 create table sql_cars as

18 select make, type, avg(INVOICE)

format = comman23.2 as sql_avg_price

19 from mycars

20 group by make, type;

NOTE: Compressing data set WORK.SQL_CARS

increased size by 100.00 percent.

Compressed is 2 pages; un-compressed would

require 1 pages.

NOTE: Table WORK.SQL_CARS created,

with 114 rows and 3 columns.

21 quit;

NOTE: PROCEDURE SQL used

(Total process time):

real time 1.27 seconds

user cpu time 0.53 seconds

system cpu time 0.03 seconds

Memory 1060589k

OS Memory 1068092k

LOG: PROC Summary

23 proc summary data=mycars

missing nway;

24 class make type;

25 var Invoice;

26 output out=sumry_cars

mean=sumry_avg_price;

27 run;

NOTE: There were 856000 observations

read from the dataset WORK.MYCARS.

NOTE: The data set WORK.SUMRY_CARS

has 114 observations and 5 variables.

NOTE: Compressing data set

WORK.SUMRY_CARS increased

size by 200.00 percent.

Compressed is 3 pages; un-compressed would

require 1 pages.

NOTE: PROCEDURE SUMMARY used

(Total process time):

real time 0.46 seconds

user cpu time 0.36 seconds

system cpu time 0.02 seconds

Memory 1060589k

OS Memory 1068092k

We can see that PROC SUMMARY is finishing the task at half of the time required by PROC SQL. You can specify MEAN=, SUM=, MINIMUM=, MAXIMUM= etc. options in OUTPUT statement to get respective stats.

One more thing to note here is use of CLASS statement instead of BY statement.

We know if use BY statement in any PROC the source data needs to in SORTED order; and that is the advantage of using CLASS statement; with CLASS statement you don’t need to carry a burden of using SORT step before PROC SUMMARY step.

Use of CLASS, NWAY and MISSING options:

As we discussed above if we use CLASS then you don’t need to have the source data in sorted order.

Let’s see what is NWAY?

If you observe the output below;

Output: NWAY

_TYPE_ is having value as 3; this is because of NWAY option. NWAY provides the records with maximum value of _TYPE_ only. If we haven’t used it; output would have different values of _TYPE_ ranging from 0 to 3. This is because PROC SUMMARY by default considers all cross- combinations of CLASS attributes. That is

  • At first it will produce the output for required summary stat without considering any CLASS variable and it will have 0 as value of _TYPE_.

  • Then at second step it will display all combinations of right most attribute in CLASS statement that it TYPE only, without considering MAKE. This will have 1 as value of _TYPE_.

  • Then it will consider MAKE attribute and will represent all of its combinations representing value 2 for _TYPE_.

  • And then finally will consider combinations of MAKE and TYPE and records corresponding to this will have _TYPE_ value 3.

This can be shown in the screen shot below.

Output: Without NWAY

MISSING Option: by default PROC SUMMARY/PROC MEANS do not consider missing values of given BY/CLASS attributes in the calculations; but MISSING option will override this default behavior asking the procedure to consider the MISSING values.

What is _FREQ_?

There is summary stat available in PROC SUMMARY/PROC MEANS represented by N which represents only count of non-missing values of given attribute while _FREQ_ will count all the values.