Bivariate Data Analysis
The task of bivariate data analysis is to produce descriptions of the relationships between two variables. Some of the descriptions are numerical. Other descriptions are graphical. Both kinds are equally important, and you will produce both.
In a thorough analysis, you will graphically portray the relationship between your two variables, explore how well a particular equation can be used to describe this relationship, estimate how much of the variability in the data can be explained by the trend described by the equation, and examine the possibility that other equations are more appropriate as a means of description.
An Appropriate Data Matrix
The basic data matrix for bivariate data analysis consists of two variables, both of which are measurement variables. Both variables are related such that each observation has paired values. You can see this pairing by noting that if the two variables were sorted independently into another order, the sense of the data would be destroyed.
A printed listing of a typical bivariate data matrix would appear as:
OBS    HEIGHT    WEIGHT
---    ------    ------
  1       6.7     247.1
  2      17.3     451.8
  3      15.2     361.9
  4       7.0     304.2
It is often stated that each of these variables must be normally distributed. This is true in some cases, such as correlation analyses, but it is not always a requirement. Some regression analyses may be done if both distributions are similar, even if they are not normally distributed.
Analysis Tools for Bivariate Data
Plotting:
The general bivariate plotting procedure is called PROC PLOT. It is quite flexible and creates as acceptable a plot as possible given the limitations of a typewriter-like printer. Such plots are generally called "printer plots" in contrast to those plots that are drawn as lines on high-resolution devices.
PROC PLOT establishes a Cartesian coordinate system and draws the two axes for reference. These are labelled with variable names and measurement units. Data points are then placed in this coordinate structure. This procedure does not perform any curve fitting, nor does it calculate any regression relationships. These analyses must be performed separately, and their results are then supplied to PROC PLOT to be plotted.
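The idea behind a printer plot can be sketched outside SAS. The following Python example is an illustration only, not PROC PLOT itself: it scales each data point to a cell in a character grid and marks that cell. The function name printer_plot, the grid size, and the "*" plotting symbol are assumptions made for this sketch. It also assumes that each variable takes at least two distinct values, since the scaling divides by the range. The data are the four observations from the listing shown earlier.

```python
def printer_plot(xs, ys, width=40, height=12):
    """Return a character-grid scatter plot as a list of text rows.

    Hypothetical helper for illustration; assumes xs and ys each
    contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column or row index within the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # The first row is printed at the top, so flip the vertical axis.
        grid[height - 1 - row][col] = "*"
    # Draw the vertical axis on the left and the horizontal axis below.
    rows = ["|" + "".join(line) for line in grid]
    rows.append("+" + "-" * width)
    return rows


# The four observations from the data matrix listing.
heights = [6.7, 17.3, 15.2, 7.0]
weights = [247.1, 451.8, 361.9, 304.2]

# WEIGHT on the vertical axis, HEIGHT on the horizontal axis.
plot_rows = printer_plot(heights, weights)
print("\n".join(plot_rows))
```

Because every point is rounded to the nearest character cell, two nearby observations can land in the same cell. This loss of resolution is the limitation that distinguishes printer plots from plots drawn on high-resolution devices.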
Regression and Correlation Analysis:
Correlation analysis determines the degree of "relatedness" of the two measurement variables. Regression analysis fits a model (generally a straight line) to the observations and reports how confident you can be in the results of this fit.
There are a number of similarities between correlation and regression analyses, although they should ordinarily be used for two distinctly different types of data. That is, data for which a correlation analysis is appropriate would not ordinarily have a regression performed on them as well.
Regression analyses are done when one variable (called the independent variable) is a causal factor that determines the value of the other variable (called the dependent variable). One of the results of the regression analysis is a statement of this relationship in the form of an equation. This equation may then be solved for any particular value of the independent variable to predict the most likely value of the dependent variable.
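As a minimal sketch of how such an equation is obtained and used, the following Python example fits a straight line by ordinary least squares. The text does not name a fitting method, so least squares is an assumption here, as are the function names fit_line, predict and r_squared. Treating HEIGHT as the independent variable is also done purely for illustration. The last function estimates the proportion of the variability in the dependent variable that is explained by the fitted line.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Solve the fitted equation for one value of the independent variable."""
    return intercept + slope * x


def r_squared(xs, ys, slope, intercept):
    """Proportion of the variability in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - predict(slope, intercept, x)) ** 2
                   for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total


# The four observations from the data matrix listing
# (HEIGHT taken as the independent variable for illustration only).
heights = [6.7, 17.3, 15.2, 7.0]
weights = [247.1, 451.8, 361.9, 304.2]

slope, intercept = fit_line(heights, weights)
explained = r_squared(heights, weights, slope, intercept)
print("WEIGHT =", intercept, "+", slope, "* HEIGHT")
print("Predicted WEIGHT at HEIGHT 10.0:", predict(slope, intercept, 10.0))
print("Proportion of variability explained:", explained)
```

With only four observations, the fitted values are useful as an illustration of the arithmetic and should not be read as a reliable description of any real relationship.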
There are several assumptions regarding the data that must be met to have a valid analysis.
It is generally assumed that each of the values of the independent variable is known without error. This would happen if, for example, a precise measurement value could be established by the investigator. This is quite often the case. Where it is not, you should make sure that there is no systematic pattern to the error. That is, there shouldn't be a small error in measurements at one end of the scale and a large error at the other, with a uniform increase in the error in between. As an example, if your independent variable is WEIGHT, then you should be able to determine small WEIGHTs with the same precision as you would large WEIGHTs (say, to the nearest 0.1 g).
It is also assumed that the data meet the criterion of homoscedasticity. This describes the pattern of variation in the other direction, that is, in the measurements of the dependent variable. Here is how to picture this situation. If the measurements were grouped as a series of observations for each value of the independent variable, we could imagine a small frequency histogram being constructed for each such independent variable value. If the variances of these frequency distributions are similar, then we have met the conditions of homoscedasticity. This is hard to evaluate precisely, especially with few observations. Common sense will have to be used in making sure that this assumption is met.
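The picture described above can be sketched in Python by grouping the dependent-variable measurements by each value of the independent variable and computing a variance for each group. The function name variance_by_x and the data values are hypothetical, invented only to show the mechanics. The ratio of the largest to the smallest group variance is offered as a rough summary to support common-sense judgment, not as a formal test.

```python
from collections import defaultdict
from statistics import variance


def variance_by_x(xs, ys):
    """Sample variance of the y measurements at each distinct x value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    # A sample variance needs at least two observations in the group.
    return {x: variance(vals) for x, vals in groups.items() if len(vals) >= 2}


# Hypothetical measurements, invented for illustration only:
# three observations of the dependent variable at each of three x values.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0, 29.0, 32.0, 35.0]

group_variances = variance_by_x(xs, ys)
# Rough summary: how much larger is the largest variance than the smallest?
ratio = max(group_variances.values()) / min(group_variances.values())
print(group_variances)
print("Largest / smallest variance:", ratio)
```

A ratio near 1 suggests similar spread at every value of the independent variable. With so few observations per group, the individual variances are themselves imprecise, which is why this assumption is hard to evaluate exactly.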
In correlation analyses, you are examining how the two measurement variables are related. Stated even more precisely, you are examining how two measurement variables vary together. There is no implied causality in this relationship. As a result, you are not interested in producing a statement of this relationship (i.e., an equation) but only a measure of the strength of the relationship. Ordinarily you use Pearson's product-moment correlation coefficient as your measure of such strength.
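As a minimal sketch of this measure, the following Python example computes the product-moment correlation from the sums of squares and cross-products about the means. The function name pearson_r is an assumption made for this illustration, and the sketch assumes that neither variable is constant. The data are the four observations from the listing shown earlier.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two paired measurement variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Cross-products and sums of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# The four observations from the data matrix listing.
heights = [6.7, 17.3, 15.2, 7.0]
weights = [247.1, 451.8, 361.9, 304.2]

r = pearson_r(heights, weights)
print("Correlation of HEIGHT and WEIGHT:", r)
```

The calculation treats the two variables symmetrically, so exchanging them gives the same value. This reflects the absence of any implied independent or dependent variable in a correlation analysis.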
The primary assumption that must be met by the data for a correlation analysis is that they come from a bivariate normal distribution. As an approximate check, confirm that each of your two variables is normally distributed.