SAS PROC FORMAT

While working with variety of type of data; many of the times we need to represent data in various formats. A common example of it could be DATE; a single date can be represented in number of ways e.g. DD/MM/YY, DD/MON/YY, MM/DD/YYYY and much more.

PROC FORMAT is basically used to create user defined formats, as it has been illustrated with following examples.

Using SAS built-in formats:

SAS has provided hundreds of built in formats for different entities like dates, strings, currencies and much more. Once you apply format to any variable of the dataset; the original value will not change but while displaying the value of the variable it will be displayed in specified format. For example if you consider date; any date is stored as number in SAS but once you apply any date format to that variable lets say DATE9. then though internally it will be saved at date only it will displayed in DDMONYYYY format (e.g. 17JUL2009).

Formatting Data Values

Use of Format statement :

DATA EMP;

INPUT EMP_ID DOB:DATE9. BASIC_SAL;

CARDS;

1011 30JUL1982 5700

1012 17FEB1987 4700

;

RUN;

TITLE 'PROC PRINT BEFORE APPLYING FORMATS';

PROC PRINT DATA=EMP;RUN;

TITLE 'PROC PRINT AFTER APPLYING FORMATS';

PROC PRINT DATA=EMP;

FORMAT DOB DDMMYY10. BASIC_SAL DOLLAR8.;

RUN;

User Defined Formats:

PROC FORMAT;

value formatname value1 = range1

value2 = range2

value 3 = range3;

run;

Output:

Please note that if you use FORMAT statement within PROC step then formats get applied temporarily; while if you apply FORMAT statement in DATA step then the formats get applied permanently and the data values gets displayed in specified format itself until you remove or change the formats again.

Creating user defined formats (Numeric/Character/Picture Formats):

Apart from built-in formats provided user might need some different formats as well. We can use PROC FORMAT procedure to cater with this. In example below you can see three different formats created respectively for numeric variables, character variables and picture formats.

Numeric Format

PROC FORMAT;

/*Numeric Format*/

Value amount low-<5000 = "Below Average"

5000-<10000 = "Average"

10000-high = "Above Average";

/*Character Format*/

value $ country 'ENG' = "ENGLAND"

'USA' = "UNITED STATES"

'IND' = "INDIA"

'UK' = "ENGLAND"

other = "OTHERS";

/*Picture Format*/

picture discount low-high = '99.99%' (FILL='0');

run;

data customer;

input custid country:$3. trans_amt disc;

cards;

9091 USA 3800 3.2

9092 ENG 5800 8.2

9093 IND 10000 11

9094 BAN 2300 6.0

;

run;

title 'PROC PRINT WITHOUT FORMATS';

proc print data=customer;

run;

title 'PROC PRINT WITH FORMATS';

proc print data=customer;

format country $country. trans_amt amount. disc discount.;

run;

Output:

Important points to Note in user defined formats:

  • low and high are keywords in PROC FORMAT.

  • character formats need $ while creating and applying formats.

  • Picture formats can be used format the number in desired way it has may options like FILL, PREFIX etc.

  • By default user defined formats gets saved into formats catalog in WORK area to create permanent formats you can specify LIB= option in PROC FORMAT statement. Then you can use these formats into other sessions as well.

  • FORMATS can be applied in PROC STEP or DATA STEP; When applied in PROC STEP (except PROC SQL while table creation) it gets applied temporarily just for display purpose in the output of that particular datastep. However in DATA STEP it gets applied permanently and data gets displayed in the provided format by default.

Creating FORMATS at RUN Time from dataset using CNTLIN option:

In previous section we saw how we can create formats for set of predefined values; but how if I want to create Formats on the fly?

Let's say I want to divide my customers between low-medium-high income groups; but the definition of low-medium-high income is not fixed. Rather I will sum up the income of all customers and then divide it into 3 parts and will call it as low-medium-high. Then I have to create the FORMAT on the fly. In this case we can derive the FORMATS in following way.

SAS Code: CNTLIN Option

data cust_list;

input custid income;

cards;

1011 7600

1012 3434

1013 5656

1014 7600

1015 9844

;

run;

proc sql noprint;

select avg(income),

max(income)

into :factor, :high

from cust_list;

quit;

/*Preparing dataset to be used

to create FORMAT*/

data incm_grp;

fmtname = 'incm_grp';

start = 0;

end = &factor *0.8333;

label = 'low-income';

output;

fmtname = 'incm_grp';

start =(&factor *0.8333) + 0.01;

end = &factor*1.1666;

label = 'mid-income';

output;

fmtname = 'incm_grp';

start =(&factor*1.1666) + 0.01;

end = &high;

label = 'high-income';

output;

run;

/*Creating FORMAT incm_grp from

dataset incm_grp*/

PROC FORMAT cntlin=incm_grp;

RUN;

/*Applying FOMRAT to the

cust_list dataset*/

proc print data=cust_list;

format income incm_grp.;

run;

Now if you observe the code given above closely; except the first step where cust_list data is getting generated all other code is going to remain same. Hence you can always apply the same logic on newly acquired group of customers by this way. And this code will then tag the customer in low-mid-high income group dynamically.

Using PROC FOMRAT for Lookup (replacing SQL JOIN/ Datastep Merge):

We know there are different methods available to do look up in SAS. PROC FORMAT could be very effective in some scenarios instead of using SQL JOIN or DATA-step MERGE.

Lets consider a scenario; I got a world wide customer list with 100 Million customer which has got country name, cell numbers. Now my automated program wants to send a SMS to those customers but only problem is I do not have country dialing code. Now to resolve this issue I want to populate the country dialing code into customer list using country name. I have country name along with dialing code in another table.

Now natural choice would be to JOIN those tables using SQL and get the work done; otherwise one may think of using datastep MERGE. However there drawbacks with both of those methods.

1. PROC SQL creates Cartesian product for each JOIN; so if you consider 150 countries in other table; then in total it will create a interim dataset with 100 million X 150 = 15000 million records. which will eat huge memory and time required will be high.

2. Second method of using datastep merge needs the datasets to be already sorted; again sorting a dataset with 100 million records is going take lot of time and resources.

On the other side if you create a format of country_code and apply it to country field from customer list then the work will be done in single datastep execution with comparatively very less time. SAS code for this will look like below.

Using PROC FOMRAT for Lookup (replacing SQL JOIN/ Datastep Merge):

We know there are different methods available to do look up in SAS. PROC FORMAT could be very effective in some scenarios instead of using SQL JOIN or DATA-step MERGE.

Lets consider a scenario; I got a world wide customer list with 100 Million customer which has got country name, cell numbers. Now my automated program wants to send a SMS to those customers but only problem is I do not have country dialing code. Now to resolve this issue I want to populate the country dialing code into customer list using country name. I have country name along with dialing code in another table.

Now natural choice would be to JOIN those tables using SQL and get the work done; otherwise one may think of using datastep MERGE. However there drawbacks with both of those methods.

1. PROC SQL creates Cartesian product for each JOIN; so if you consider 150 countries in other table; then in total it will create a interim dataset with 100 million X 150 = 15000 million records. which will eat huge memory and time required will be high.

2. Second method of using datastep merge needs the datasets to be already sorted; again sorting a dataset with 100 million records is going take lot of time and resources.

On the other side if you create a format of country_code and apply it to country field from customer list then the work will be done in single datastep execution with comparatively very less time. SAS code for this will look like below.

SAS Code: using PROC FORMAT for LookUp

data cust_list;

input cust_id country:$3. cell:$10.;

cards;

192345 IND 9856894899

192346 USA 8787874

198233 PAK 4578159

198233 SA 98221544

;

run;

data country_dialing_code;

input country:$3. dialing_code $;

cards;

IND +91

USA +44

PAK +92

SA +27

;

run;

data cntry_code;

set country_dialing_code (keep=country dialing_code);

fmtname = 'cntry_code';

rename country = start dialing_code=label;

type = 'c';

run;

proc format cntlin=cntry_code;

run;

data cust_list;

set cust_list;

dialing_code = put(country,$cntry_code.);

run;

title 'Using PROC FORMAT as LOOK UP method';

proc print data=cust_list;

run;

Output: using PROC FORMAT for LookUp

In conclusion using PROC FORMAT could be very efficient in some scenarios where we need a join between a very large dataset with a small dataset.