This example uses the speeds on Triphammer Road from De Veaux, Velleman and Bock, Stats: Data and Models, 2nd ed., 2008, Addison Wesley, Boston. (It's the first example in Chapter 23.) Police measured traffic speeds (in miles per hour) on a road where speeding was a concern. Here are the results.
29 34 34 28 30 29 38 31 29 34 32 31 27 37 29 26 24 34 36 31 34 36 21
Copy the above numbers onto your clipboard, then type the line below.
> speeds = scan()
The R scan function allows you to enter data without typing commas. Here the values were not actually typed but pasted from the clipboard at the "1:" prompt. The "24:" prompt means R has received 23 numbers and is waiting for the 24th. Hit RETURN on the empty line to end data entry.
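If you would rather not depend on the clipboard, the same vector can be built non-interactively. The sketch below assumes the text argument of scan, which reads values from a character string instead of the keyboard; the name speeds_text is ours and purely illustrative.

```r
# The 23 speeds listed above, as one whitespace-separated string
# (speeds_text is a hypothetical helper name, not part of the original example)
speeds_text = "29 34 34 28 30 29 38 31 29 34 32 31 27 37 29 26 24 34 36 31 34 36 21"
speeds = scan(text=speeds_text)  # same values as pasting at the scan() prompt
length(speeds)                   # 23, matching the "24:" prompt described above
```

This form is handy in a script, where there is no interactive prompt to paste into.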
> hist(speeds) #Simple histogram
> hist(speeds, breaks=5) #Auto-plotter won't do 5 bins
> bins = seq(20,40,by=4) #Manually create bins
> hist(speeds, breaks=bins) #Now we have our own 5 bars!
> bins = seq(20,40,by=1) #Super-fine bins
> hist(speeds, breaks=bins) #Histogram with too many bins
> boxplot(speeds) #Checking for outliers
In the above R commands, I have transitioned to using comments following the lines. The number symbol, "#", is the comment character: R ignores everything after it on the line, so that text is for human readers only. When breaks is a single number, R treats it only as a suggestion, which is why asking for 5 bins did not produce 5 bars. If you instead give a vector of every endpoint (including the left edge of the first bar and the right edge of the last bar), R will plot exactly the bins you specify. The data is clearly mound-shaped.
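To see what the breaks argument did without reading bar heights off a plot, you can ask hist for its bookkeeping instead of a picture. This is a minimal sketch, assuming the plot=FALSE argument and the breaks and counts components of the object hist returns; the names h_auto and h_own are ours.

```r
speeds = c(29, 34, 34, 28, 30, 29, 38, 31, 29, 34, 32, 31,
           27, 37, 29, 26, 24, 34, 36, 31, 34, 36, 21)

h_auto = hist(speeds, breaks=5, plot=FALSE)  # breaks=5 is only a suggestion
h_auto$breaks                                # endpoints R actually chose
length(h_auto$counts)                        # number of bars R actually drew

h_own = hist(speeds, breaks=seq(20,40,by=4), plot=FALSE)  # explicit endpoints
h_own$breaks                                 # 20 24 28 32 36 40
h_own$counts                                 # observations per bin, 5 bins
```

By default each bin includes its right endpoint, so a speed of 24 is counted in the 20 to 24 bin rather than the 24 to 28 bin.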
> summary(speeds)
> sd(speeds)
Whenever we analyze data, we always want to look at the three key features of a variable first: shape, center, and spread.
> t.test(speeds, mu=30)
> t.test(speeds, mu=30, alternative="greater", conf.level=0.90)
Note that a single command returns both a hypothesis test and a confidence interval, and that one-sided tests return one-sided confidence intervals (as they should). The confidence level must be specified as a number between 0 and 1. The other option for alternative is "less"; leaving the argument out gives a two-sided test and interval. The default mu is 0 and the default confidence level is 0.95 (95%). In the textbook example, the question was whether the average speed exceeded 30 miles per hour, so that's what we tested.
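It can help to reproduce the one-sided test by hand once, to see where the numbers come from. The following is a sketch using the standard one-sample t formulas (statistic = (sample mean - 30) / standard error, with n - 1 degrees of freedom); the variable names are ours.

```r
speeds = c(29, 34, 34, 28, 30, 29, 38, 31, 29, 34, 32, 31,
           27, 37, 29, 26, 24, 34, 36, 31, 34, 36, 21)

n = length(speeds)
se = sd(speeds) / sqrt(n)                  # standard error of the mean
t_stat = (mean(speeds) - 30) / se          # test statistic for H0: mu = 30
p_greater = pt(t_stat, df=n-1, lower.tail=FALSE)  # one-sided p-value ("greater")
lower_90 = mean(speeds) - qt(0.90, df=n-1) * se   # one-sided 90% lower bound

# Compare with what t.test reports
fit = t.test(speeds, mu=30, alternative="greater", conf.level=0.90)
all.equal(unname(fit$statistic), t_stat)   # TRUE
all.equal(fit$p.value, p_greater)          # TRUE
all.equal(fit$conf.int[1], lower_90)       # TRUE; the upper limit is Inf
```

The one-sided interval runs from the lower bound to infinity, which is why t.test prints Inf as its upper limit.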
Commenting on these results, Hayden wrote, "We could question the result on two grounds. First, the stem and leaf shows the data bimodal and skewed toward low values (or is that an outlier?), and checking the mean may not be the appropriate tool here. Second, even though the average speed was close to 30, we can note that a majority of the vehicles were exceeding the 30 MPH speed limit." Regarding the first point, it is true that a super-fine stem-and-leaf plot (equivalent to the histogram with super-fine bins, above) is bimodal. However, this is likely due to sampling variation, as can be demonstrated by randomly sampling 23 points from a normal distribution (the commands below should be run several times):
> out = rnorm(23,mean=31,sd=4) #23 random datapoints, normally dist.
> hist(out,breaks=18:43) #histogram
> length(out[out>31])
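Rather than rerunning the three commands by hand, the count can be collected over many simulated samples at once. This is a sketch using replicate; the seed, the 1000 repetitions, and the name n_above are our arbitrary choices, and the normal distribution with mean 31 and sd 4 is the same assumption as in the commands above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_above = replicate(1000, {
  out = rnorm(23, mean=31, sd=4)  # one simulated sample of 23 speeds
  sum(out > 31)                   # how many points exceed the true mean
})

table(n_above)        # how often each count occurred across the 1000 samples
mean(n_above <= 10)   # share of samples with 10 or fewer points above 31
```

The table shows how widely the count varies from sample to sample even when the population is exactly normal.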
Regarding the second point, there are only 23 observations, so it is not surprising that 13 of them (a bare majority) exceed 30 when the sample mean is about 31. Relative to the mean itself, only 10 points are greater, which is what we might expect if the population were really symmetric. Similar behavior may be observed in the random samples above.
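The same point can be checked exactly. The sketch below assumes a symmetric population, so that each of the 23 observations independently lands above the population mean with probability 0.5 and the count follows a binomial distribution; note that the text compares against the sample mean, so this is only an approximation to that situation.

```r
# Assumption: count above the mean ~ Binomial(23, 0.5) for a symmetric population
p_10_or_fewer = pbinom(10, size=23, prob=0.5)                   # P(at most 10 above)
p_13_or_more = pbinom(12, size=23, prob=0.5, lower.tail=FALSE)  # P(at least 13 above)
p_10_or_fewer
p_13_or_more   # equal to p_10_or_fewer by symmetry
```

Each probability is roughly one in three, so neither a 10/13 nor a 13/10 split is unusual for a sample of this size.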
> rm(bins, out, speeds) #Clean up workspace
Exercises
1. Assume that the mean SAT math and verbal scores reported for each state are a random sample from the state performances for other years and similar conditions. The average SAT math and verbal score is supposed to be 500. For each variable, run a one-sample t-test with the null hypothesis that the overall mean is 500.
2. There are at least two major flaws with this analysis. What is one of them? Could it be fixed?