Conceptually, a SAS data set is a file that consists of two parts: a descriptor portion and a data portion. Sometimes a SAS data set also points to one or more indexes, which enable SAS to locate records in the data set more efficiently. (The data sets that you see in this chapter do not contain indexes.)
The descriptor portion of a SAS data set contains information about the data set, including
§ the name of the data set
§ the date and time that the data set was created
§ the number of observations
§ the number of variables.
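One way to look at a descriptor portion yourself is PROC CONTENTS. The short sketch below assumes that the sample data set sashelp.class, which ships with SAS, is available in your session; any other data set name would work the same way.

```sas
/* PROC CONTENTS reports the descriptor portion of a data set:    */
/* its name, creation date and time, number of observations,      */
/* number of variables, and the attributes of each variable.      */
/* sashelp.class is a sample data set supplied with SAS; it is    */
/* used here only for illustration.                                */
proc contents data=sashelp.class;
run;
```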
The data portion of a SAS data set is a collection of data values that are arranged in a rectangular table. In the example below, the name Jones is a data value, the weight 158.3 is a data value, and so on.
Rows (called observations) in the data set are collections of data values that usually relate to a single object. The values Jones, M, 48, and 128.6 constitute a single observation in the data set shown below.
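As a minimal sketch, the observation described above could be created and displayed as follows. The data set name work.patients is hypothetical, and the variable names are taken from later in this chapter.

```sas
/* Hypothetical data set name; the single observation uses the   */
/* values mentioned in the text (Jones, M, 48, 128.6).            */
data work.patients;
   input Name $ Sex $ Age Weight;
   datalines;
Jones M 48 128.6
;
run;

/* PROC PRINT displays the data portion: one row per observation */
proc print data=work.patients;
run;
```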
In addition to general information about the data set, the descriptor portion contains information about the attributes of each variable in the data set. The attribute information includes the variable's name, type, length, format, informat, and label.
When you write SAS programs, it's important to understand the attributes of the variables that you use. For example, you might need to combine SAS data sets that contain same-named variables. In this case, the variables must be the same type (character or numeric).
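The following sketch illustrates the type requirement. The data set names and values are hypothetical. Policy is numeric in one data set and character in the other, so the character version is converted with the INPUT function before the two are combined.

```sas
/* Hypothetical data sets for illustration only */
data work.a;
   Policy = 1001;          /* numeric */
run;

data work.b;
   Policy = '1002';        /* character */
run;

/* Combining work.a and work.b directly would fail, because
   Policy would be defined as both character and numeric:

   data work.both;
      set work.a work.b;
   run;
*/

/* One possible fix: convert the character values to numeric first */
data work.b_fixed;
   set work.b(rename=(Policy=PolicyChar));
   Policy = input(PolicyChar, 8.);   /* character to numeric */
   drop PolicyChar;
run;

data work.both;
   set work.a work.b_fixed;          /* Policy is now numeric in both */
run;
```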
Variable   Type   Length   Format      Informat   Label
Policy     Num    8                               Policy Number
Total      Num    8        DOLLAR8.2   COMMA10.   Total Balance
Name       Char   20                              Patient Name
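Attributes like the ones in this table could be assigned in a DATA step along the following lines. This is a sketch, not the program that produced the table: the data set name work.insurance and the policy number are made up for illustration.

```sas
/* Hypothetical data set; the policy number is invented */
data work.insurance;
   length   Policy 8 Total 8 Name $ 20;   /* type and length      */
   format   Total dollar8.2;               /* how Total is written */
   informat Total comma10.;                /* how Total is read    */
   label    Policy = 'Policy Number'
            Total  = 'Total Balance'
            Name   = 'Patient Name';
   infile datalines dlm='|';
   input Policy Total Name $;
   datalines;
1001|$1,234.00|Jones
;
run;

/* The variable list in the PROC CONTENTS output shows each      */
/* variable's type, length, format, informat, and label.          */
proc contents data=work.insurance;
run;
```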
Each variable has a name that conforms to SAS naming conventions. Variable names follow exactly the same rules as SAS data set names. Like data set names, variable names
§ can be 1 to 32 characters long
§ must begin with a letter (A–Z, either uppercase or lowercase) or an underscore (_)
§ can continue with any combination of numbers, letters, or underscores.
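The naming rules above can be tried out in a small DATA step. The variable names here are hypothetical, and the sketch assumes the default naming rules are in effect (that is, the VALIDVARNAME= system option has not been changed).

```sas
data work.names;
   Policy        = 1;    /* valid: begins with a letter          */
   _total2       = 2;    /* valid: begins with an underscore     */
   Total_Balance = 3;    /* valid: letters and an underscore     */
   /* 2ndTotal   = 4;       invalid: begins with a digit         */
   /* Total-Bal  = 5;       invalid: contains a hyphen           */
run;
```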
A variable's type is either character or numeric.
§ Character variables, such as Name (shown below), can contain any values.
§ Numeric variables, such as Policy and Total (shown below), can contain only numeric values (the digits 0 through 9, +, -, ., and E for scientific notation).
A variable's type determines how missing values for a variable are displayed. In the following data set, Name and Sex are character variables, and Age and Weight are numeric variables.
§ For character variables such as Name, a blank represents a missing value.
§ For numeric variables such as Age, a period represents a missing value.
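A brief sketch of how missing values are assigned and displayed follows. The data set name is hypothetical. In the PROC PRINT output, the first observation shows a blank for Name and a period for Age.

```sas
/* Hypothetical data set with one missing value of each type */
data work.missing_demo;
   length Name $ 20;
   Name = ' ';      Age = .;    output;   /* both values missing */
   Name = 'Jones';  Age = 48;   output;   /* no missing values   */
run;

proc print data=work.missing_demo;
run;
```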
A variable's length (the number of bytes used to store it) is related to its type.
§ Character variables can be up to 32,767 bytes long. In the example below, Name has a length of 20 characters and uses 20 bytes of storage.
§ All numeric variables have a default length of 8. Numeric values (no matter how many digits they contain) are stored as floating-point numbers in 8 bytes of storage, unless you specify a different length.
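The LENGTH statement is one way to set these lengths explicitly. In the sketch below, the data set name and the variable Visits are hypothetical. Keep in mind that a numeric variable stored in fewer than 8 bytes can represent fewer significant digits exactly, so shortening numeric lengths is an assumption to apply with care.

```sas
data work.lengths;
   length Name $ 20;    /* character variable: 20 bytes           */
   length Visits 4;     /* numeric variable shortened to 4 bytes  */
   Name   = 'Jones';
   Visits = 3;
   Total  = 1234;       /* no LENGTH given: default length of 8   */
run;

/* The Len column in the output shows 20, 4, and 8 */
proc contents data=work.lengths;
run;
```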
You've seen that each SAS variable has a name, type, and length. In addition, you can define format, informat, and label attributes for variables. Let's look briefly at these optional attributes. You'll learn more about them in later chapters as you need to use them.
Formats are variable attributes that affect the way data values are written. SAS software offers a variety of character, numeric, and date and time formats. You can also create and store your own formats. To write values out using a particular form, you select the appropriate format.
For example, to display the value 1234 as $1234.00 in a report, you can use the DOLLAR8.2 format, as shown for Total below.
Usually you have to specify the maximum width (w) of the value to be written. Depending on the particular format, you might also need to specify the number of decimal places (d) to be written. For example, to display the value 5678 as 5,678.00 in a report, you can use the COMMA8.2 format, which specifies a total width of 8 characters (counting the comma and the decimal point), 2 of which are decimal places.
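To see the two formats described above in action, you could write the values to the SAS log with a PUT statement. This is a minimal sketch; the variable names are hypothetical, and the expected results in the comments are the ones stated in the text.

```sas
/* DATA _NULL_ runs a DATA step without creating a data set */
data _null_;
   Balance = 1234;
   Amount  = 5678;
   put Balance dollar8.2;   /* the text gives $1234.00 for this */
   put Amount  comma8.2;    /* the text gives 5,678.00 for this */
run;
```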
Note
You can permanently assign a format to a variable in a SAS data set, or you can temporarily specify a format in a PROC step to determine the way the data values appear in output.
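The difference between the two approaches can be sketched as follows. The data set name work.balances is hypothetical. A FORMAT statement in the DATA step stores the format in the descriptor portion, whereas a FORMAT statement in a PROC step applies only to that step's output.

```sas
/* Permanent: the format is stored with the data set */
data work.balances;
   Total = 1234;
   format Total dollar8.2;
run;

/* Temporary: COMMA8.2 is used for this step's output only; */
/* the stored format is still DOLLAR8.2.                     */
proc print data=work.balances;
   format Total comma8.2;
run;

/* No FORMAT statement here, so the stored format is used */
proc print data=work.balances;
run;
```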
Whereas formats write values out in a particular form, informats read data values that appear in a particular form and convert them to standard SAS values. Informats determine how data values are read into a SAS data set. You must use informats to read numeric values that contain letters or other special characters.
For example, the numeric value $1,234.00 contains two special characters, a dollar sign ($) and a comma (,). You can use an informat to read the value while removing the dollar sign and comma, and then store the resulting value as a standard numeric value. For Total below, the COMMA10. informat is specified.
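As a rough sketch of this behavior, the DATA step below reads the value $1,234.00 with the COMMA10. informat. The data set name work.totals is hypothetical. Because no format is assigned, PROC PRINT shows the stored numeric value without the dollar sign or comma.

```sas
/* Read a raw value that contains a dollar sign and a comma */
data work.totals;
   input Total comma10.;
   datalines;
$1,234.00
;
run;

/* Total is stored as the standard numeric value 1234 */
proc print data=work.totals;
run;
```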