For this tutorial, we will use data comparing the lifetimes of generic and brand name batteries. It is the main example at the start of Chapter 24 of De Veaux, Velleman and Bock, Stats: Data and Models, 2nd ed., 2008, Addison Wesley, Boston. You may either enter the data below manually or use the following procedure to read it in:
Copy the data and paste it into an Excel or Google spreadsheet. Save it as a .csv file. ("csv" means "comma-separated values".)
In RStudio, go to the upper right window, Workspace tab, and choose Import Dataset > From text file.
Follow the directions, and be sure to pay attention to the “Heading” checkbox. For further help with this technique, watch my LoadData video.
   Times Battery.Type
1  194.0   Brand Name
2  205.5   Brand Name
3  199.2   Brand Name
4  172.4   Brand Name
5  184.0   Brand Name
6  169.5   Brand Name
7  190.7      Generic
8  203.5      Generic
9  203.5      Generic
10 206.5      Generic
11 222.5      Generic
12 209.4      Generic
Note that the lifetimes are in one variable and the type of battery in another. This is standard database format.
If you had looked at the data file in a text editor you might have noted that Battery Type got changed to Battery.Type. R does not like names with spaces in them. The spaces in "Brand Name" could also cause problems. Generally speaking, if you are setting up your own data, do not use names for files, values, variables, etc., that include spaces.
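To watch the renaming happen without leaving R, here is a minimal sketch that enters the twelve values by hand, writes them to a temporary .csv file with the heading "Battery Type", and reads the file back. The names raw and csv_path are placeholders invented for this illustration; myfile matches the name used later in this tutorial.

```r
# Enter the twelve lifetimes by hand (values from the table above).
# check.names = FALSE keeps the space in "Battery Type" for now.
raw <- data.frame(
  Times = c(194.0, 205.5, 199.2, 172.4, 184.0, 169.5,
            190.7, 203.5, 203.5, 206.5, 222.5, 209.4),
  "Battery Type" = rep(c("Brand Name", "Generic"), each = 6),
  check.names = FALSE
)

# Write to a temporary file (a stand-in for your own myfile.csv).
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)

# read.csv() repairs the column names by default, replacing the space with a dot.
myfile <- read.csv(csv_path)
names(raw)
names(myfile)
```

The first names() call should still show the space, while the second should show Battery.Type. The values themselves ("Brand Name") keep their spaces, which is why they have to be quoted whenever they are used in a command.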
The import step above just shows you what is in the file (and whether R can make any sense out of it). To do anything with the data, you have to have it in a data frame and then attach it to your workspace. (You do not need to know exactly what that means in order to do it.) Suppose the file you read the data from was called "myfile.csv". The data frame would then be called "myfile", and you should be able to click it in RStudio's Workspace tab. For this tutorial, we will name the data frame "bat". To copy it under that name and attach it, type
> bat = myfile
> attach(bat)
> Times
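If you would rather not attach the data frame, here is a minimal sketch of two other ways to reach the same column, using only base R. It re-enters the data so that it runs on its own; bat is the name used in this tutorial, while times_dollar and overall_mean are names made up for the example.

```r
# Re-enter the data so this block runs on its own.
bat <- data.frame(
  Times = c(194.0, 205.5, 199.2, 172.4, 184.0, 169.5,
            190.7, 203.5, 203.5, 206.5, 222.5, 209.4),
  Battery.Type = rep(c("Brand Name", "Generic"), each = 6)
)

# Show the structure: 12 observations of 2 variables.
str(bat)

# Option 1: name the data frame explicitly with $.
times_dollar <- bat$Times

# Option 2: evaluate an expression "inside" the data frame with with().
overall_mean <- with(bat, mean(Times))
overall_mean
```

Both options leave the workspace untouched, so there is nothing to detach afterwards. The rest of this tutorial assumes the attach(bat) approach.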
Next, we look at the data.
> boxplot(Times ~ Battery.Type)
> summary(Times[Battery.Type=="Brand Name"])
> summary(Times[Battery.Type=="Generic"])
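The two summary commands can also be condensed into one table per statistic with tapply(), which applies a function to Times separately for each battery type. This is a sketch rather than a required step; group_n, group_mean and group_sd are names chosen for the example, and the data are re-entered so the block runs on its own.

```r
# Re-enter the data so this block runs on its own.
bat <- data.frame(
  Times = c(194.0, 205.5, 199.2, 172.4, 184.0, 169.5,
            190.7, 203.5, 203.5, 206.5, 222.5, 209.4),
  Battery.Type = rep(c("Brand Name", "Generic"), each = 6)
)

# Apply length, mean and sd to Times within each battery type.
group_n    <- tapply(bat$Times, bat$Battery.Type, length)
group_mean <- tapply(bat$Times, bat$Battery.Type, mean)
group_sd   <- tapply(bat$Times, bat$Battery.Type, sd)

# Put the three results side by side, rounded for display.
round(cbind(n = group_n, mean = group_mean, sd = group_sd), 2)
```

The standard deviations are worth a glance here because the spread in each group feeds directly into the t-test that follows.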
The sample sizes for these two groups are relatively small, so there is not enough data to tell whether the populations are normal or not. The boxplot suggests that the Generic batteries are lasting longer than the Brand Name ones. Is this due to the randomness of this particular sample, or does the difference reflect an actual signal from the population? That's where the t-test comes in.
> ?t.test
Typing Times earlier caused R to list the times, verifying that we can now access them. Typing ?t.test causes a help window to pop up with technical information on the t.test command.
> t.test(formula = Times ~ Battery.Type)
We suspect that battery lifetimes may depend on the type of battery. After formula = we type the dependent variable, a tilde "~", and the independent variable. (The tilde separates dependent from independent variables in R.) The small p-value of 0.03143 suggests there may well be a difference, and the 95% confidence interval contains the surprising news that it is not in the direction we might have expected! R takes the groups in alphabetical order, so the interval is for Brand Name minus Generic, and it lies entirely below zero: the generics do last statistically significantly longer. How much?
> mean(Times[Battery.Type=="Brand Name"]) - mean(Times[Battery.Type=="Generic"])
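The difference in means is also the numerator of the t statistic. As a rough check on what t.test is doing, the sketch below rebuilds the statistic, the degrees of freedom, the p-value and the 95% confidence interval by hand. It assumes the default settings of t.test, which do not pool the two variances (the Welch version of the test). All variable names are invented for this example, and the data are re-entered so the block runs on its own.

```r
# Re-enter the data so this block runs on its own.
bat <- data.frame(
  Times = c(194.0, 205.5, 199.2, 172.4, 184.0, 169.5,
            190.7, 203.5, 203.5, 206.5, 222.5, 209.4),
  Battery.Type = rep(c("Brand Name", "Generic"), each = 6)
)
brand   <- bat$Times[bat$Battery.Type == "Brand Name"]
generic <- bat$Times[bat$Battery.Type == "Generic"]

# Difference in sample means (Brand Name minus Generic).
diff_means <- mean(brand) - mean(generic)

# Each group's contribution to the squared standard error: s^2 / n.
v_brand   <- var(brand) / length(brand)
v_generic <- var(generic) / length(generic)
std_error <- sqrt(v_brand + v_generic)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_by_hand  <- diff_means / std_error
df_by_hand <- (v_brand + v_generic)^2 /
  (v_brand^2 / (length(brand) - 1) + v_generic^2 / (length(generic) - 1))

# Two-sided p-value and 95% confidence interval.
p_by_hand  <- 2 * pt(-abs(t_by_hand), df_by_hand)
ci_by_hand <- diff_means + c(-1, 1) * qt(0.975, df_by_hand) * std_error

# Compare with what t.test reports.
res <- t.test(Times ~ Battery.Type, data = bat)
c(t = t_by_hand, df = df_by_hand, p = p_by_hand)
ci_by_hand
res$conf.int
```

The hand-built numbers should agree with the t.test output. Note that the degrees of freedom are not a whole number; that is normal for the unpooled test.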
> detach(bat)
> rm(bat)
Exercises
1. Is there a difference between SAT performance in Red vs. Blue States? Test SAT overall, as well as math and verbal separately.
2. If it has been covered in class, check that the assumptions for the independent t-test for question 1 are satisfied.