The Process of Analyzing a Complex Database
You can easily assume that you know a lot about the data that you have acquired in a study. After all, you have painstakingly boosted your confidence in your research as you have progressed. You have justified your study and promoted its importance. You have found literature reviews that show that your investigation will provide the "missing link" that is needed for greater understanding of an important problem. Even the collection of the data values themselves has given you some intimacy with them.
At the end of all this elaborate preparation and hard work, data analysis should be a quick coup de grace. You should perform a deft stroke that formally ends the otherwise extensive research process.
Is this the way that it really works? Generally not.
To the beginner especially, the process of data analysis can be a long, frustrating experience. Goals that were once so clear become vague. Crisply formulated ideas turn into unresolvable questions. Hypotheses that formed the basis of the entire data collection scheme now seem untestable. Answers that were once sought with fervor are returned by the analysis in a way that makes them seem relatively unimportant.
This need not be the result of the analysis process.
There are many causes of analytical misfortune. For example, your data may be structured improperly. This can make a desired analysis difficult to run or even obscure the availability of a good technique. Analysis runs can mire you in numbers, many of which are unneeded for the analysis at hand. Time-consuming searches are then needed to separate the numbers that are important from those that are unnecessary.
This discussion is intended as a review of the detailed steps that should be taken in the analysis of most types of data. It is meant to help you avoid the common misfortunes encountered in the analysis process. In getting started, note that these guidelines are based on several premises. They assume that you have some tangible goals of the analysis that correspond to the purpose of your study. However ill-stated these goals may be, they give direction to the analysis procedure. You must also know something about the data. For example, are these data similar to those expected in such a study? Are similar data available for comparison? What are the expected measurement characteristics? Finally, you must be curious about the data. In particular, you should be willing to accept the unexpected.
The analysis procedures are based on a progression of steps that build a firm understanding of the data matrix and the relationships that it contains. The analysis of a simple data matrix may not require the attention to detail proposed here. Complex data matrices will always require this care.
The general analysis procedure is seen from the broad goals of its four major phases:
Define the analysis objectives & organize the data.
Examine the basic data relationships.
Test hypothesized relationships.
Perform follow-up analyses and reporting.
It is worthwhile to examine the intent of each of these phases before looking in detail at the analysis procedures. Note that each analysis phase has been structured so that it stands alone. This makes the phases relatively independent, so that you can temporarily suspend your analysis at the end of any phase. This is often important, since few people push through the complete analysis of a data matrix in a single effort.
Define Objectives & Organize Data
Defining the intended goal of the analysis and organizing the data are intimately tied together. The analysis goal was undoubtedly the guide to the data collection effort and determined what variables were recorded. Organizing the data involves the identification of the variables and their characteristics.
Often, a single data matrix will be very complex and require a number of analyses. In the discussion that follows, each such analysis is treated as though it were independent of the others. This is not true, of course, but it is a helpful orientation in doing the initial analysis work.
Once you have defined your objectives and organized your data, it is possible to make preliminary decisions regarding what analyses are possible.
There are two facets in need of explicit consideration: (1) whether the analysis will lead to a description of the data, whether it will be used to test for significant differences, or whether it is to examine the strength of association between several variables; and (2) how many of the variables will be involved in the particular analysis.
1. Define the broad objectives of the analysis.
Analyses proceed toward goals. It is important that you note what you are looking for in each phase of your analysis. This may take the form of brief phrases or various diagrams. It helps if you record what you are looking for and how you will decide if you have found it.
Such notes will clarify which variables need to be used in the analyses and in which combination. This list of analysis goals will serve as a handy reference if you think you are drifting aimlessly in the analysis.
Don't view this as a static list of objectives. Rather, set aside a place to record the progressive development of your analysis goals.
Incidentally, if you follow a strict interpretation of the use of many statistical tests of significance, you must declare your intended comparisons before the analyses (these are called "planned comparisons"). Recording your goals in advance therefore also promotes good statistical habits.
2. Design a data matrix or verify the form of the data matrix.
Most often, you will need to design a data matrix to be used in your analysis. In some cases, you already have a data matrix because you obtained your data some other way, such as from a recording instrument. Whatever the source of your data matrix, you must examine it critically. You must confirm that it is in the proper form to be analyzed easily in the subsequent steps. This is a critical process, for an inappropriate design can preclude the use of some analysis procedures or obscure the possible utility of an analysis procedure.
If your data matrix has "structural" problems, you should probably take the time to change its form to something more appropriate.
Examine your data matrix for the following condition: is each of the "columns" of data a proper variable? Often, you will find that a variable has been split between columns in your data matrix to accommodate a problem of classification. For example, there may be two columns of data relating to HEIGHT, one for SEX=F and the other for SEX=M. This is an error, and your data matrix should be changed to have a single column of data for HEIGHT and another for SEX. This type of error probably comes from the mistaken impression that it will be easier to do data entry in the first form. Also, there is a tendency to make tables "square" instead of long and skinny, since we are used to presenting data in a relatively compact form.
Correcting such a structural error requires that you add extra statements to your DATA step.
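As a minimal sketch of such a correction, assume a hypothetical data set named WIDE in which the female and male heights were entered in two columns, HTF and HTM (the data set name, variable names, and values are invented for illustration). Two OUTPUT statements in a second DATA step write one observation per individual, each with a single HEIGHT value and a SEX value.

```sas
/* Hypothetical "square" layout: each row holds one female and one male height. */
data wide;
   input htf htm;            /* htf = height where SEX=F, htm = height where SEX=M */
   datalines;
162 178
158 183
170 175
;
run;

/* Restructure to the "long and skinny" form: one HEIGHT column, one SEX column. */
data long;
   set wide;
   length sex $ 1;
   sex = 'F'; height = htf; output;   /* first observation from this row  */
   sex = 'M'; height = htm; output;   /* second observation from this row */
   drop htf htm;                      /* the split columns are no longer needed */
run;

proc print data=long;
   title 'Heights restructured to one observation per individual';
run;
```

If one of the two height columns can be missing on a row, you would probably want to skip the corresponding OUTPUT statement so that no empty observation is created.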
3. Identify the variables and their characteristics.
Build a "codebook" for your data matrix. This consists of a table in which all the characteristics of your data matrix are recorded. First, list the descriptive names of each variable in the order in which it appears in the data matrix. Following the words identifying each variable, make up appropriate SAS variable names. The limit is eight characters or fewer, starting with a letter and without special characters except the "break" symbol (_).
Is there a special format associated with any variable?
If so, it should be recorded in the codebook table.
Dates: check on the variety of date formats (not all arrangements are possible, so check as soon as possible). It is possible to synthesize a date from an inappropriate field or multiple fields, but it is a bit of extra work.
Times: just like dates, there are a variety of formats available, but you must check first.
Character strings: strings eight or fewer characters long are easy to use if they have no embedded blanks. More complicated strings are those that are longer or that have single embedded blanks (this carries other restrictions that you should check).
Data without blanks separating the variables: in these cases, you will need to be sure that the values are lined up for each variable, then figure out the proper format that will allow reading of each variable. You should check the SAS User's Manual for details of reading formatted data.
Data that do not fit on a single input line: it is possible to skip to a new line in the middle of an INPUT line, but this takes some special handling. Check the SAS User's Manual.
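The following sketch shows one way several of these cases might be handled in a single INPUT statement. The data set name VISITS, the variable names, the column positions, and the values are all assumptions made up for the illustration; confirm the informats against the manual for your SAS release before relying on them.

```sas
/* Hypothetical records: two input lines per observation.              */
/* Line 1: ID in columns 1-3, SEX in column 4, HEIGHT in columns 5-7,  */
/*         with no blanks separating the values.                       */
/* Line 2: the visit date written as mm/dd/yy.                         */
data visits;
   input id 1-3 sex $ 4 height 5-7    /* column input locates values by position */
       / visit mmddyy8.;              /* "/" skips to the next input line        */
   format visit date9.;               /* print the stored date in a readable form */
   datalines;
001F162
03/15/88
002M178
04/02/88
;
run;

proc print data=visits;
   title 'Check of formatted input';
run;
```

Printing the result immediately, as here, is a quick way to confirm that each value landed in the intended variable.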
Make up a label for each variable and record this in the codebook. This is generally more descriptive than the variable name. It is suggested that you specify the measurement units (if any) in these labels.
You can print multiple line labels in PROC PRINT if you have properly designed labels and you specify that you want to use this option. The suggestion is that you use eight or fewer characters per line segment, up to a maximum of three lines.
Determine the measurement scale of each variable. The three possibilities are measurement (or interval), ordinal (or ranked), and classification (category or nominal) data.
Define any new variables that need to be calculated.
Changing an existing variable's value: this is done for unit conversion, for example.
Creating a new variable: this is the combination of values from other variables. It may be the sum of several variables, for example.
Be careful to use the SAS functions (such as SUM) when you have missing data.
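The difference between the SUM function and the plus operator shows up only when a value is missing. In the sketch below, the data set SCORES and its variables are hypothetical; the second data line has a missing value for TEST2.

```sas
data scores;
   input test1 test2 test3;
   totplus  = test1 + test2 + test3;       /* missing if any operand is missing   */
   totsum   = sum(test1, test2, test3);    /* adds only the nonmissing values     */
   npresent = n(test1, test2, test3);      /* number of nonmissing values summed  */
   datalines;
10 20 30
10 .  30
;
run;

proc print data=scores;
   title 'Plus operator versus SUM function with a missing value';
run;
```

Keeping a count such as NPRESENT alongside the total is one way to tell a complete sum from one built on fewer values; whether a partial sum is acceptable depends on your analysis.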
4. Print a data verification report.
Print the entire data matrix using a PROC PRINT. This will allow a verification of the original data values, after they have been entered into the first DATA step.
At this stage, do not bother with formatting or any fancy table enhancements, other than variable labels.
This original table should not be sorted.
Make sure to use a TITLE with this and all subsequent procedures that print results.
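A verification listing along these lines might look like the following sketch. The GROWTH data set, its variables, and its values are hypothetical and stand in for your own first DATA step.

```sas
/* Hypothetical data: plot number, treatment group, and plant height. */
data growth;
   input plot treat $ height;
   label plot   = 'Plot number'
         treat  = 'Treatment group'
         height = 'Plant height (cm)';     /* units recorded in the label */
   datalines;
1 control 12.4
2 fert    15.1
3 control 11.8
4 fert    16.3
5 control 13.0
6 fert    14.7
;
run;

/* Unsorted, unformatted listing; the LABEL option prints the variable labels. */
proc print data=growth label;
   title 'Verification listing of GROWTH in original entry order';
run;
```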
5. Define formats.
Formats serve two functions: (1) they can be used to change the way a value is printed, or (2) they can be used to classify measurement variables into categories.
At this stage, any desired classification functions should be established for variables that will be grouped into categories. Generally, the categorization is not actually applied until the format is specified in an analysis procedure.
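As a sketch of the classification function, the following defines a format that groups a measurement variable into three classes and then applies it in PROC FREQ. The format name HTCLASS, the cut points, and the GROWTH data are assumptions chosen only for illustration.

```sas
/* Hypothetical data: plot number, treatment group, and plant height (cm). */
data growth;
   input plot treat $ height;
   datalines;
1 control 12.4
2 fert    15.1
3 control 11.8
4 fert    16.3
5 control 13.0
6 fert    14.7
;
run;

/* Define the classes once; "-<" excludes the upper end point of a range. */
proc format;
   value htclass low -< 12   = 'Short'
                 12  -< 15   = 'Medium'
                 15  -  high = 'Tall';
run;

/* The stored heights are unchanged; the grouping happens only in this procedure. */
proc freq data=growth;
   tables height;
   format height htclass.;
   title 'Number of plants per height class';
run;
```

Because the format is attached in the procedure rather than in the DATA step, the same variable can still be analyzed as a measurement elsewhere.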
6. Print a "neat" report.
A well-designed report can be useful as a supplement to an analysis. You have several tools that can be helpful:
Sorting: choose an appropriate variable, or variables, to establish the order of the observations printed.
BY statements: this will produce a subdivision of the table into units. This should be done only with a classification-type variable, never a measurement variable. Make sure that the input data matrix has been sorted prior to using a BY statement in a PROC PRINT.
ID statement: allows a variable to be substituted for the default OBS numbering. This is useful in cases where you have unique observation identification in one of your variables.
FORMAT statements: this provides the substitution of something else for data values, or the changing of the column width and decimal precision specifications. The latter is useful if a measurement variable has a large number of decimal places stored, perhaps as a result of a division, but the report should only have a few decimal places.
TITLE statements: use titles to clearly identify your report.
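The tools above might be combined as in the following sketch. The GROWTH data set and its variables are hypothetical, and the asterisk used as the split character for the multiple-line label is an arbitrary choice.

```sas
/* Hypothetical data: plot number, treatment group, and plant height (cm). */
data growth;
   input plot treat $ height;
   datalines;
1 control 12.4
2 fert    15.1
3 control 11.8
4 fert    16.3
5 control 13.0
6 fert    14.7
;
run;

/* Sort first: a BY statement in PROC PRINT expects data ordered by the BY variable. */
proc sort data=growth;
   by treat plot;
run;

proc print data=growth split='*';        /* SPLIT= breaks labels at each asterisk  */
   by treat;                             /* one sub-table per treatment group      */
   id plot;                              /* plot number replaces the OBS column    */
   var height;
   label height = 'Plant*height*(cm)';   /* three label lines, each under 8 characters */
   format height 6.1;                    /* width 6, one decimal place             */
   title 'Plant height by treatment group';
run;
```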
Examine the Data
There are three basic activities in this phase.
1. Examine the characteristics of the values for each variable: the distributional properties are of primary concern. You are trying to answer the question of whether they conform to the expected relationships. For example, are the measurement variables normally distributed and do the classification variables have the expected number of categories? This will be important later in determining the relevance of many of the analysis procedures. This is also a time when the descriptive characteristics of each variable should be recorded.
Measurement variables: check for normality, check for outliers, determine the central tendency (mean) and dispersion (standard deviation).
If the test for normality fails, make sure that it was not due to an outlier. Will a transformation help? If it is not possible to make the distribution of values normal, then consider using a non-parametric statistical approach.
Ordinal variables: check the central tendency measure (generally the median value), and the dispersion values (the interquartile ranges); it should be possible to gain a general idea of the distributional characteristics of the data at this point. Outliers or highly skewed distributions are particularly to be noted.
Classification variables: look carefully at each variable's values in the printed listing. If a variable has a "reasonable" number of different values, probably not more than 20 or so, then you should get a one-way frequency table to see the relative distribution of values. You might consider recording the mode as the measure of central tendency.
2. Identify exceptional cases: there are sometimes values in the data matrix that require some kind of special attention. For example, values that are far from other values often require some type of special handling, depending on the particular problem being examined.
3. Report the descriptive statistics: a "neat" table summarizing the descriptive statistics for the data matrix provides an important backdrop for other analyses.
If the goal of the analysis is to be a description of the data, the next phase may be skipped.
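The checks described for measurement and classification variables could be run along the following lines. This is a sketch under the assumption of a hypothetical GROWTH data set with a measurement variable HEIGHT and a classification variable TREAT.

```sas
/* Hypothetical data: plot number, treatment group, and plant height (cm). */
data growth;
   input plot treat $ height;
   datalines;
1 control 12.4
2 fert    15.1
3 control 11.8
4 fert    16.3
5 control 13.0
6 fert    14.7
;
run;

/* Measurement variable: mean, standard deviation, quantiles, extreme values. */
/* NORMAL requests tests of normality; PLOT requests simple diagnostic plots. */
proc univariate data=growth normal plot;
   var height;
   title 'Distribution check for HEIGHT';
run;

/* Classification variable: one-way frequency table of the categories. */
proc freq data=growth;
   tables treat;
   title 'One-way frequency table for TREAT';
run;
```

The extreme values listed by PROC UNIVARIATE are a convenient starting point for identifying the exceptional cases mentioned in activity 2.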
Test Relationships
There are two broad classes of relationships that are tested in most studies: (1) those that attempt to distinguish between groups based on the measurement of one or more of their characteristics, or (2) those that attempt to establish some measure of association between several measurement variables. The statistical tests that are used follow these two classes and answer two types of questions: (1) are there groups that are truly separable, and (2) is there a corresponding trend in the variable values and how much of the variation can be predicted by this relationship?
Simple tests are usually easier to interpret than ones that are more complex. For example, it is usually easier to distinguish between two groups than among many groups. Likewise, it is usually easier to see the relationship between two variables than among many variables. Therefore, it is useful to start testing relationships with the simple cases before moving on to those that are more complex.
1. Select the appropriate analysis technique.
Approach the selection of the technique as follows (read this discussion with the table at the end of the chapter at hand):
(1) Which of the broad analysis categories is appropriate? For example, when testing whether a fertilizer is effective in producing greater growth than an unfertilized control, the objective is to differentiate between two groups, i.e., treatment versus control.
(2) What specific techniques are available within the broad analysis category? In the fertilizer example, you are comparing measurement values for the two values of a classification variable, i.e., treatment versus control, so the t-test procedure would be appropriate.
(3) Do the data to be tested meet the analysis criteria? If not, can they be modified to be appropriate? For example, are the two sets of measurement values normally distributed?
(4) What ancillary analysis techniques will be most useful in describing the conclusions? For example, perhaps a frequency histogram showing the categories of measurement values, grouped by treatment, might be an effective way to present the results.
In the development of the analytical procedure, remember that often you are searching for "patterns." These are sometimes seen outside the realm of statistical tests. To neglect such a search is to ignore an important part of the data analysis process.
2. Run the analyses.
Check the constraints of each analysis. Are the data distribution assumptions met? (Refer to the univariate analysis runs to answer this question.) Are there special considerations, such as not using highly correlated variables together in a multivariate analysis run, that should be checked?
Examine the output from the analysis run. Identify the parameters of interest and make the decisions necessary to test the hypotheses of interest. Refer back to your original analysis objectives to make sure you are examining the intended questions.
Extract the necessary values for reporting, such as probabilities, equation parameters, etc.
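For the fertilizer example, the analysis run might be sketched as follows. The GROWTH data set, its variables, and its values are hypothetical; the nonparametric run is included only as one possible fallback when the normality assumption is doubtful, not as a required step.

```sas
/* Hypothetical data: plot number, treatment group, and plant height (cm). */
data growth;
   input plot treat $ height;
   datalines;
1 control 12.4
2 fert    15.1
3 control 11.8
4 fert    16.3
5 control 13.0
6 fert    14.7
;
run;

/* Two-sample t-test: the CLASS variable must have exactly two levels. */
proc ttest data=growth;
   class treat;
   var height;
   title 'Fertilizer versus control: t-test on HEIGHT';
run;

/* Possible nonparametric alternative: Wilcoxon rank-sum test. */
proc npar1way data=growth wilcoxon;
   class treat;
   var height;
   title 'Fertilizer versus control: Wilcoxon rank-sum test on HEIGHT';
run;
```

From the PROC TTEST output, the values usually extracted for reporting are the group means, the t statistic, its degrees of freedom, and the associated probability.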
Follow-up Questions & Reporting
In many respects, the last phase of an analysis program is the most dangerous. New, often very interesting, questions emerge from the analyses. They are important, but distracting. Results have to be put into proper form for reporting. That involves careful design and (generally) a set of trial reports until a satisfactory design is achieved. It is equally important to make sure that your data are properly saved so that they can be used again, just in case they are needed.
These are demanding tasks that come when there is the greatest pressure to be done with the work. Don't underestimate how long they will take.
1. Handling new questions: the analysis procedures often provoke new questions. These must be examined in the way that any hypothesis is handled, by following through these same analysis phases.
Save these new questions for follow-up. Be careful to not let secondary questions detract from the main line of the analysis.
2. Reporting the results: data analysis programs produce pages and pages of results. These must be reduced to essential values that are then recorded in well designed tables and charts.
In all cases, the goal of this step is to faithfully record the essence of the data and the analysis results. Accuracy, intelligibility, and conciseness are the guiding factors. Consistency with past practice must also be considered, for the value of any analysis comes in part from the possibility of comparing its results to those of other, similar studies.
3. Archiving the data and analysis procedures: a well-designed study will make it easy to save a few sets of data and analysis procedures (the SAS statements). These should be kept in such a way that it is convenient to move them back into the analysis environment. It is not unusual to have to get back to data for a quick check or to do a follow-up analysis. Storage techniques vary. Currently, floppy disks are the best storage medium available. Soon CD-ROM (similar to the compact disks you buy for audio recordings) will offer a more permanent storage medium.