Reading Delimited files (Reading CSV files) - SAS DSD Option

Many of the times data files coming out of conventional DBMS systems like Oracle, Teradata etc. contains the list of variables separated by each other by specific delimiting character like ',' OR '|' OR ' ' Or any such specific character.

By default SAS treats ' ' (single blank space) as delimiting character. But one can specify any other character or group of characters as delimiting character. This has been demonstrated below;

Suppose the data is in CSV format like given below,

External File: AccDetails.txt

111573691345,10JUL2010,C,0.00

111456234567|23JUN2011|O|1000.00

101345456734 21SEP1999 O 1000.00

101456234561|30OCT2011 C, 0.00

111234345566,18APR2010,,0.00

For demonstration purpose 3 different characters has been used in each line of external files and in 4th line of the external file all the three characters are used in single line.

We can have the following code snippet to work for;

SAS Code

data temp;

infile "D:\AccDetails.txt" dlm=", | ";

input acc_no

acc_open_date $9.

acc_type_code $

acc_min_bal_limit

;

run;

Readers should make a note of DLM option declared in the above code along with infile statement; where it specifies all the three delimiting characters used in the external files.

Also note that for reading acc_open_date $9. format is specified instead of just specifying $ to denote the character variable. This has been done because specifying just $ would read just 8 characters and the variables in our interest is of 9 characters length.

The output of the code above will show that SAS can work on all the three delimiting characters at a time, and hence the following output.

Output:

12:50 Sunday, June 21, 2009

acc_ acc_

open_ type_ acc_min_

Obs acc_no date code bal_limit

1 111573691345 10JUL2010 C 0

2 111456234567 23JUN2011 O 1000

3 101345456734 21SEP1999 O 1000

4 101456234561 30OCT2011 C 0

OOPS !!!!! Smart readers must have observed that the external file contains FIVE records and we can see only FOUR records; there must be something wrong with it.

Lets analyse the problem with the help of LOG.

LOG:

NOTE: The infile "D:\AccDetails.txt" is:

File Name=D:\AccDetails.txt,

RECFM=V,LRECL=256

NOTE: LOST CARD.

acc_no=111234345566 acc_open_date=18APR2010 acc_type_code=0.00 acc_min_bal_limit=. _ERROR_=1

_N_=5

NOTE: 5 records were read from the infile "D:\AccDetails.txt".

The minimum record length was 28.

The maximum record length was 32.

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

NOTE: The data set WORK.TEMP has 4 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.04 seconds

cpu time 0.00 seconds

One can conclude from the LOG that, It seems that there was some problem while reading the Fifth record of the external file. But the reason behind it is quite obvious.

If one can observe the fifth record of the external file more closely; it can be seen that, acc_type_code is missing within two consecutive delimiters. And SAS by default considers them as single and skips the missing variable in between, this causes data error in the input statement.

Use of DSD option helps to overcome this problem.

Lets see use of the DSD option with SAS code.

SAS Code:

data temp;

infile "D:\AccDetails.txt" dlm=", | " DSD;

input acc_no

acc_open_date:$9.

acc_type_code:$1.

acc_min_bal_limit

;

run;

proc print data=temp;

run;

*note the above code has been compiled on SAS 9.1 platform

Please remember the following highlights of the code.

DLM : This option specifies the character by which the fields in the source file have been separated, as ',' in the given case.

DSD : By Default SAS considers two consecutive delimiters as single and skips the missing variable in between, this causes data error in the input statement. Use of DSD option avoids this problem.

‘:’ used in between name of the variable and format name specifies that SAS should consider the delimiter as the boundary between two variables being read, though the length specified by the informats is lesser than the actual data.

And hence the following output: with all 5 records as expected.

Output:

12:50 Sunday, June 21, 2009

acc_

acc_open_ type_ acc_min_

Obs acc_no date code bal_limit

1 111573691345 10JUL2010 C 0

2 111456234567 23JUN2011 O 1000

3 101345456734 21SEP1999 O 1000

4 101456234561 30OCT2011 C 0

5 111234345566 18APR2010 0