This tutorial uses the PULSE data. If you are not already familiar with it, you should read a description of the data. You can find the entire dataset at Hayden's site as a plain text file or as an Excel spreadsheet.
Loading Data
Two ways to load the data are:
1. Get the URL of the PULSE datafile on your clipboard (from my machine I right-click the link and select “copy link address”), which will be: http://statland.org/Software_Help/R/pulse.txt
Then, from RStudio, upper right window, Workspace tab
> Import Dataset > From Web URL, paste URL and follow directions
2. Download the text data file. If necessary, follow the directions on my LoadData video tutorials. The summary is: From RStudio, upper right window, Workspace tab
> Import Dataset > From text file, and follow directions
When importing, be sure to pay attention to the “Heading” checkbox. If the data comes with headings, as the PULSE data does, then be sure it is checked. If it has no headings, but is only the raw data, be sure it is NOT checked.
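The effect of the “Heading” checkbox can also be seen from the R console. The sketch below is only an illustration: the two-row sample text is made up so that it runs without a network connection, `demo` and `wrong` are hypothetical names, and the commented-out call at the end assumes the real file is whitespace-delimited with a heading row.

```r
# Made-up two-row sample standing in for a data file with a heading row
# (values are illustrative, not taken from PULSE).
sample_text <- "PuBefor PuAfter Sex
70 82 Female
64 66 Male"

# header = TRUE plays the role of the checked "Heading" box:
# the first line becomes the column names.
demo <- read.table(text = sample_text, header = TRUE)
names(demo)
nrow(demo)

# header = FALSE treats the heading line as data, so there is one extra
# row, the columns get default names V1, V2, V3, and the numbers are
# read as text.
wrong <- read.table(text = sample_text, header = FALSE)
nrow(wrong)

# For the real file the call would presumably be (assumption:
# whitespace-separated columns with a heading row):
# PULSE <- read.table("http://statland.org/Software_Help/R/pulse.txt", header = TRUE)
```

If the imported columns show up as V1, V2, ... and the first data row contains the variable names, the heading setting was probably wrong.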
Summary Statistics
Now, the pulse dataset (in R it's called a data.frame) may have appeared in the upper left window of RStudio. If so, click the "x" on its tab to close it. In the upper right window of RStudio, in the Workspace tab, find the pulse dataset and click on it. You should see it appear in the upper left window. The data.frame PULSE is in the case-variable format. (R is case-sensitive, so the name must be typed exactly as it appears in the Workspace tab.) The eight columns represent the eight variables. The names can be viewed above row 1 of the data.frame. They may also be viewed with the names() command:
> names(PULSE)
[1] "PuBefor"   "PuAfter"   "Ran."      "Smokes."   "Sex"       "Height"
[7] "Weight"    "ActivityL"
Type
> PuBefor
Notice the error? Although the variables are part of the data.frame, which is loaded, we cannot access them directly. To do so, we need to attach() them to our R workspace.
> attach(PULSE)
Requiring an explicit attach() avoids conflicts if several tables include variables with the same name. As good practice, we attach just one data.frame at a time. All of the commands from the Getting Started With R tutorial may be used on the variables. As a shortcut, you can get an assortment of summary statistics for all your variables by using the summary() command.
> summary(PULSE)
Note that for the summaries reported above, R gives different summaries depending on whether the column contains numbers or text. This seems right except possibly for the ActivityL variable which is an ordered category and so might better be treated as qualitative data. For now, just remember that anything in a data file that looks like a number to R will be treated as such and you may need to take special steps if you have a column of numbers that are actually labels -- medical diagnosis codes, for example.
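One such special step is to convert the column to a factor. The sketch below uses made-up activity codes; the meaning of the codes and the labels are assumptions for illustration, not taken from the PULSE documentation.

```r
# Made-up activity codes; the labels below are assumed for illustration.
activity <- c(2, 1, 3, 2, 2, 1)

summary(activity)      # treated as numbers: min, quartiles, mean, max

# Declare the codes to be ordered category labels instead of numbers.
activity_f <- factor(activity,
                     levels = c(1, 2, 3),
                     labels = c("Slight", "Moderate", "A lot"),
                     ordered = TRUE)

summary(activity_f)    # now a count for each category
table(activity_f)
```

After the conversion, summary() reports how many cases fall in each category rather than a mean and quartiles.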
The one summary you might miss from the above is the standard deviation. You can easily get that for the variable of your choice.
> sd(PuAfter)
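To get the standard deviation of every numeric variable at once, one possibility is sapply(). This is a sketch on a small made-up data frame (`toy` is a hypothetical stand-in for PULSE), not part of the original tutorial.

```r
# Small made-up data frame standing in for PULSE (illustrative values).
toy <- data.frame(PuBefor = c(60, 72, 68, 80),
                  PuAfter = c(64, 90, 70, 118),
                  Sex     = c("F", "M", "F", "M"))

# Find the numeric columns, then apply sd() to each of them.
numeric_cols <- sapply(toy, is.numeric)
sds <- sapply(toy[, numeric_cols, drop = FALSE], sd)
sds
```

The text column Sex is skipped because sd() is only meaningful for numbers.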
Multiple Boxplots
We can compare the before and after pulse rates with multiple boxplots. R wants to see the data in a format that has all the measurements (both before and after) in one column with another column labelling the two sets of data. The R command c (for concatenate) combines a bunch of things into a single thing.
> rate = c(PuBefor,PuAfter)
> rate
We used the name rate for the stacked variable. We also need a variable to keep track of which group each measurement came from. The R rep() (for repeat) command is useful for generating repetitive data.
> B = rep("Before",92)
> A = rep("After",92)
> BA = c(B,A)
> BA
This creates a variable B that is just 92 instances of the word "Before" and similarly for A. Then the two are concatenated into BA, the categorical variable that keeps track of before and after. Now you can type
> boxplot(rate ~ BA)
to get boxplots.
The tilde "~" is used often in R. Here you can think of it as saying we are going to see how rate depends on BA. You can see that the "after" data show a higher median and more variability along with high outliers not present in the "before" data. The fact that both center and variability have changed makes it hard to give a simple comparison here. If we looked only at the mean, we might say that pulse rates went up by about seven points, and they did on average. However, the lowest rates have not gone up much at all (the minimum went up by 2) while the highest have changed considerably (the maximum went up by 40!). A simple average change does not describe what happened here very accurately, which is one reason we always need to make a picture!
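The stacking steps can be written without hard-coding the sample size. The sketch below uses made-up pulse rates (`before` and `after` are hypothetical stand-ins for PuBefor and PuAfter) and makes the grouping variable a factor so that "Before" is plotted first; by default R orders the groups alphabetically.

```r
before <- c(62, 70, 74, 80)      # made-up pulse rates
after  <- c(64, 72, 90, 110)

n <- length(before)              # used in place of a hard-coded 92
rate <- c(before, after)

# each = n repeats "Before" n times, then "After" n times.
BA <- factor(rep(c("Before", "After"), each = n),
             levels = c("Before", "After"))

boxplot(rate ~ BA, main = "Pulse Rates")

# The mean change alone hides how unevenly the rates shifted.
mean(after) - mean(before)
min(after) - min(before)
max(after) - max(before)
```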
The following command gives the same boxplot without the labels
> boxplot(PuBefor, PuAfter)
The labels can be added to boxplots in the following way
> boxplot(PuBefor, PuAfter, names=c("Before", "After"), main="Pulse Rates")
The reason I took you through the first method is that its syntax is very important: data will usually come in the stacked format that the first form expects, not in the side-by-side format of the second.
Histograms
Because the class contained both men and women, we might expect some bimodality in the heights and weights. Histograms are the tool for assessing the shape of a distribution. (Boxplots tend to hide bimodality.) The number of bins used controls the "bandwidth" of the histogram. With too few bins there is not enough detail. With too many, the picture is dominated by random variation in the data rather than real signal. A good rule of thumb for the number of bins is the square root of the sample size, n. The breaks argument allows some control over the number of bars, sort of:
> hist(Height)
> hist(Height, breaks=3)
> hist(Height, breaks=5)
> hist(Height, breaks=10)
Are the Heights trimodal or is that just random scatter? The default histogram is probably the best, which is unimodal and does not detect the presence of male and female populations. What do you find with Weight?
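The square-root rule and the "sort of" behaviour of breaks can be checked directly. The sketch below uses simulated values as a stand-in for Height (the mean and standard deviation are arbitrary assumptions); with a single number, hist() treats breaks as a suggestion and picks "pretty" cut points, while a vector of cut points is used exactly.

```r
set.seed(1)
heights <- rnorm(92, mean = 68, sd = 4)   # simulated stand-in for Height

# Square-root rule of thumb for the number of bins.
k <- ceiling(sqrt(length(heights)))
k

# A single number is only a suggestion; the bin count may differ from k.
h1 <- hist(heights, breaks = k)
length(h1$counts)

# Supplying the cut points themselves forces exactly k equal-width bins.
cuts <- seq(min(heights), max(heights), length.out = k + 1)
h2 <- hist(heights, breaks = cuts)
length(h2$counts)
```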
Now make boxplots as you did above but with Sex as the factor and Height as the measurement.
> boxplot(Height ~ Sex)
Both sexes show reasonably compact, symmetric distributions with no outliers, but the men are consistently about 5 inches taller than the women. Because the two distributions have similar shapes and variabilities, we can reasonably say that the men as a group are about 5 inches taller than the women. Compare this to the situation with the pulse rates above, where such a simple description was an oversimplification. Note that although the two sexes are clearly different here, there is enough overlap that the histogram does not clearly show the two groups.
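A numeric check of a comparison like this can be made with tapply(), which applies a function separately within each group. The values below are made up for illustration and are not the PULSE heights.

```r
height <- c(63, 65, 66, 64, 70, 69, 72, 71)        # made-up heights in inches
sex    <- c("F", "F", "F", "F", "M", "M", "M", "M")

medians <- tapply(height, sex, median)   # center of each group
spreads <- tapply(height, sex, IQR)      # variability of each group
medians
spreads

# When the spreads are similar, the difference in medians is a fair summary.
medians["M"] - medians["F"]
```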
Transformations
We saw in the earlier boxplots of the pulse rates that the After rates were skewed toward high values with several possible outliers. When we see most of the alleged outliers on one side of a boxplot it may well be that we have a skewness problem rather than an outlier problem. Another sign is outliers that start close to the whiskers and gradually thin out. Another is a boxplot that looks skewed in the direction of the outliers even if we erase them. It is important to recognize these two different situations because the remedies are quite different. If we have an outlier problem, we need to investigate the individual points to see if we can find a reason for their unusual behavior. If we have a skewness problem we might use the median rather than the mean to describe the center, or we might re-express (transform) the data to make it more symmetric. Here are a number of common transformations and their effect on the After pulse readings. First, here is a histogram of the original data.
> hist(PuAfter)
Let's see if a square root transformation makes this less skewed.
> SquareRoot = sqrt(PuAfter)
> hist(SquareRoot)
This is better. Repeat with
ln = log(PuAfter)
to get natural logarithms (base e).
> ln = log(PuAfter)
> hist(ln)
Going from the original data to the square roots to the logarithms there is successively less skewness. There is still some skewness present in the logs and so we will try two stronger transformations: negative reciprocal roots and negative reciprocals. The reason for the negatives here is that when you take reciprocals you reverse up and down in the graph. This is a nuisance when comparing several graphs, so we introduce the minus sign to switch things back again. For most purposes, however, we do not do that. For example, if you measure fuel economy in miles per gallon and take the reciprocal you get gallons per mile -- a perfectly good, but different, way to measure fuel consumption. No one would normally put a negative sign in front of either measurement. However, we do have to mentally remember that high MPG is good while high GPM is bad, even though in some sense they measure the same thing.
You can use the techniques described above to get these transformations.
> nrr = -1/sqrt(PuAfter)
> hist(nrr)
> nr = -1/PuAfter
> hist(nr)
Here four common transformations were applied in order of increasing strength for reducing skewness toward high values: square roots, logarithms (here to base e), negative reciprocal roots, and negative reciprocals. The last two transformations seem to have made the distribution most symmetric.
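The ordering by strength can also be seen numerically. The sketch below is an addition to the tutorial's graphical approach: it computes a moment-based skewness coefficient (the helper `skewness` is hypothetical, written here for illustration) on made-up right-skewed values, so the numbers say nothing about the PULSE data themselves.

```r
# Made-up, right-skewed pulse rates (illustrative values only).
x <- c(60, 62, 64, 66, 68, 70, 72, 76, 80, 88, 100, 120, 140)

# Hypothetical helper: moment-based sample skewness.
# Positive values mean a longer tail toward high values.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Apply the transformations in order of increasing strength.
skews <- c(raw            = skewness(x),
           sqrt           = skewness(sqrt(x)),
           log            = skewness(log(x)),
           neg_recip_root = skewness(-1 / sqrt(x)),
           neg_recip      = skewness(-1 / x))
round(skews, 2)
```

Each step down the list should give a smaller coefficient than the one before it. A value near zero suggests rough symmetry, and a negative value would mean the transformation was too strong for the data at hand.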
detach()
When you are finished with a dataset you attached, it is best to detach it. Use the same name you gave to attach():
> detach(PULSE)
If you attach multiple datasets that contain variables with the same name, then you may inadvertently refer to one variable while the computer thinks you mean another by the same name and the results will be wrong.
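Base R also lets you reach the variables without attaching anything, which sidesteps this kind of name clash. This is a sketch on a small made-up data frame (`toy` is a hypothetical stand-in for PULSE), offered as an alternative rather than as the tutorial's method.

```r
# Small made-up data frame standing in for PULSE (illustrative values).
toy <- data.frame(PuBefor = c(60, 72, 68, 80),
                  PuAfter = c(64, 90, 70, 118))

# The $ operator picks one column out of a named data.frame.
mean(toy$PuAfter)

# with() looks up the variable names inside toy for one expression only.
with(toy, mean(PuAfter - PuBefor))
```

Because nothing is attached, there is nothing to detach afterwards.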
Exercises
These exercises use the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.
1. Select two numeric variables and thoroughly explore them using the techniques covered in this tutorial. What are your two primary observations?
2. The command below is a more advanced R command that is useful for comparing more than two variables at once:
> install.packages("car") #Install the "car" package
> library(car) #Need to have the "car" package installed
> scatterplotMatrix(States95[,3:8],smoother=FALSE,groups=region,diagonal="hist",col=2:5)
3. Write one additional observation you made from your scatterplot matrix.