Go to the Installing R and RStudio page to download and install the software for your platform.
Entering Data
For our first example, we will work with the data below on annual rainfall in inches for various cities throughout the world:
City      Rainfall    City      Rainfall
Algiers   30          Lagos     72
Athens    16          La Paz    23
Beirut    35          Lima       2
Berlin    23          London    23
Bogota    42          Madrid    17
Bombay    71          Moscow    25
Cairo      1          Oslo      27
Dublin    30          Paris     22
Geneva    34          Rome      30
Havana    48          Vienna    26
We will enter the numbers using the R function "c". The standard file format for statistical data is that each column is a variable and each row is a case. This is called the case-variable format. Here the variable is rainfall and the cases are the cities. Sometimes I like to think of the name of the "c" function as representing "column". You can use it to enter a column of data, i.e., a single variable. Use it like this: at the R prompt ">" type
> rainfall = c(30, 16, 35, 23,...,26)
Of course, you must type in the rest of the data where I typed "...". Hit RETURN at the end of the line. Nothing happens. If in doubt, R is silent. To check whether you succeeded, just type rainfall at the R prompt. R should tell you what is in the variable "rainfall". You should see this:
> rainfall
[1] 30 16 35 23 42 71  1 30 34 48 72 23  2 23 17 25 27 22 30 26
For these labs I will not normally display the R output. You should follow along by entering the commands into R and viewing the output yourself. This will slowly develop your skills with R, and give you the background you need to complete the exercises.
Back to the output. The [1] is the index of the first value printed on that line. If the RStudio window were narrow, it might show the following output, where the [11] tells you that the number 72 is the 11th number in the dataset:
> rainfall
 [1] 30 16 35 23 42 71  1 30 34 48
[11] 72 23  2 23 17 25 27 22 30 26
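If you would rather check your typing against a complete command, here is a sketch of the full entry, with the values in the same order as the output above. The options(width = ...) line is only my way of imitating a narrow window; the value 40 is an arbitrary choice for illustration, and how many numbers land on each line will depend on it.

```r
# Enter all 20 rainfall values (inches), in the same order as the output above
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

length(rainfall)   # 20 values, one per city
rainfall[11]       # 72, the 11th value

# Imitate a narrow window (40 is an arbitrary width chosen for illustration)
old = options(width = 40)
print(rainfall)    # each line now starts with the index of its first value
options(old)       # restore the previous width
```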
There are many slick ways to get data into R but for now just typing it in will do. Another option is a simple data editor available in some versions of R. Type
> data.entry(rainfall)
to see if your version includes this feature. If it does, this is a good time to edit any typos in your data entry. The following worked in the Windows version. Double click on a cell to edit it. Hit RETURN when done with that cell. When done with all cells, right click on the Data Editor and choose Close. The Data Editor only edits data already entered into R. You can trick it into creating a new column of data. Let's say we also have snowfall data. Type
> snowfall = c(1)
> data.entry(snowfall)
The "1" in the first command is just a placeholder. Any number will do. When the Data Editor opens, replace the 1 with actual data.
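To make the case-variable format described earlier concrete, here is a minimal sketch that stores the cities and their rainfall together in a data frame, one row per case and one column per variable. The names cities and rain.data are my own, chosen only for this illustration; the values come from the table at the top of this section.

```r
# One entry per city, in the same order as the rainfall values
cities = c("Algiers", "Athens", "Beirut", "Berlin", "Bogota",
           "Bombay", "Cairo", "Dublin", "Geneva", "Havana",
           "Lagos", "La Paz", "Lima", "London", "Madrid",
           "Moscow", "Oslo", "Paris", "Rome", "Vienna")
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Case-variable format: each row is a case (city), each column a variable
rain.data = data.frame(city = cities, rainfall = rainfall)

head(rain.data)                                # first six cases
rain.data$rainfall[rain.data$city == "Lima"]   # 2
```

For the rest of this tutorial a plain vector like rainfall is all we need, but most datasets you meet later will arrive in this row-and-column shape.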
Simple Summary Statistics
Once you have the data in R, you can get a variety of summary statistics and displays. Try some of the following.
> mean(rainfall)
> median(rainfall)
The mean should be 29.85. If you get a different number, proofread your data for typos. If you do not have a Data Editor, you can fix one number at a time from the command line. Let's say that for the sixth city, Bombay, you typed 17 instead of 71.
> rainfall[6]=71
will fix this. The use of brackets accesses the values in the object rainfall. Parentheses, on the other hand, are used for commands. Experiment with the brackets a little bit. Change a number in the dataset, look at it, and then change it back again. When finished, check that the mean is still correct.
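Here is one way that experiment might go, sketched with the Bombay typo from above. I retype the full vector first only so the example stands on its own.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

rainfall[6]        # 71, the sixth value (Bombay)

rainfall[6] = 17   # introduce the typo on purpose
rainfall[6]        # 17
mean(rainfall)     # 27.15, no longer the correct mean

rainfall[6] = 71   # change it back
mean(rainfall)     # 29.85 again
```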
> mode(rainfall)
This is probably not what we had in mind for the mode. R is telling us that numerical data is what is stored in rainfall. If you really want the mode, construct a histogram and report the value(s) of the variable which gives the peak.
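If you want the most frequent raw value rather than the peak of a histogram, one possible approach (not the histogram method described above, but a different technique) is to tabulate the data with table() and pick out the largest count. The name counts is my own.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# Count how many times each distinct value occurs
counts = table(rainfall)

# Keep only the value(s) with the highest count
counts[counts == max(counts)]

# The same values as plain numbers
as.numeric(names(counts)[counts == max(counts)])   # 23 30
```

With so few repeated values this kind of mode is not very informative, which is one reason to prefer the histogram.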
> hist(rainfall)
The most frequently occurring bin is clearly 20 to 30 inches. How many data points are in this range? The y (frequency) axis reads 10. By default, each bin in R's hist() command contains the values greater than its left endpoint and less than or equal to its right endpoint.
> rainfall[rainfall>20 & rainfall <=30]
gives the specific values. We're using the bracket notation, again, to ask for some of the data. In particular, we want only the values of rainfall greater than 20 AND rainfall less than or equal to 30.
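To count the values instead of listing them, you can sum the same logical condition, since R treats TRUE as 1 and FALSE as 0. The sketch below also asks hist() for its bin boundaries and counts without drawing anything; the name h is my own.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# How many values are greater than 20 and at most 30?
sum(rainfall > 20 & rainfall <= 30)   # 10

# Ask hist() for its bins without plotting
h = hist(rainfall, plot = FALSE)
h$breaks   # the bin endpoints R chose
h$counts   # the number of values in each bin
```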
> sd(rainfall)
> max(rainfall)
> min(rainfall)
> range(rainfall)
Oh, my. I work so hard to convince my students that the range is one number!
> fivenum(rainfall)
[1]  1.0 22.5 26.5 34.5 72.0
> lentgth(rainfall)
Error: couldn't find function "lentgth"
> length(rainfall)
[1] 20
So, the range is 71. Many people include a sixth number with the five number summary: the number of observations, n. This is returned by the length function in R. (If you mistype something, R will give an error message. Most are more cryptic than this one. See me or the TAs for errors you get stuck on. Also, typos happen all the time. Press your up arrow to scroll through your previous commands. When the one you want comes up, edit it and hit RETURN.)
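If you want R to report the range as the single number I keep insisting on, here is a small sketch. It also glues n onto the five number summary; the label n is my own choice.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

range(rainfall)                  # two numbers: 1 72
diff(range(rainfall))            # one number: 71
max(rainfall) - min(rainfall)    # the same thing, 71

# Five number summary with the number of observations added
c(fivenum(rainfall), n = length(rainfall))
```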
I like every tutorial to include an analysis of a dataset that shows how the software can actually be used to find out something useful about the data. Let's go back to the histogram. You can re-type
> hist(rainfall)
Or, from the RStudio Plots tab in the lower right, use the arrows to return to your histogram. The distribution is skewed right, with two potential outliers on the right. This is confirmed by the mean of 29.85 being greater than the median of 26.5. Given the apparently extreme skewness of the histogram, one wonders why the mean is not farther to the right of the median than it is. Why?
Data Transformations
The plot for outlier detection is the boxplot:
> boxplot(rainfall)
Now there are outliers at both the high and low ends of the data! They are counter-balancing one another in the computation of the mean and median. For data such as this, which appear skewed toward high values, a transformation is often appropriate:
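To see which points the boxplot is flagging, you can ask for the numbers behind it. This sketch assumes the default rule that boxplot() uses (points more than 1.5 times the box length beyond the box), and the name b is my own. The last line shows the counter-balancing: dropping only the two low values pulls the mean up.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)

# The numbers behind the default boxplot
b = boxplot.stats(rainfall)
b$stats        # whisker ends, box edges, and median
sort(b$out)    # the flagged points: 1 2 71 72

mean(rainfall)                 # 29.85 with all 20 cities
mean(rainfall[rainfall > 2])   # 33 once the two low values are dropped
```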
> ln.rainfall = log(rainfall)
There is nothing special about the dot in ln.rainfall; it just makes the variable name easier to read. The log() command takes the natural logarithm of the data, which "pulls in" the long right tail. This will be used later in the Biostatistics course.
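If you want to convince yourself that log() really is the natural logarithm, and that nothing is lost by transforming, here is a quick sketch. The exp() function undoes log(), and log10() is available if you ever want base 10 instead.

```r
rainfall = c(30, 16, 35, 23, 42, 71, 1, 30, 34, 48,
             72, 23, 2, 23, 17, 25, 27, 22, 30, 26)
ln.rainfall = log(rainfall)

log(exp(1))                             # 1, so log() is base e
log10(100)                              # 2, base 10 for comparison
all.equal(exp(ln.rainfall), rainfall)   # TRUE: exp() recovers the original data

# The spread before and after the transformation
range(rainfall)
range(ln.rainfall)
```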
> hist(ln.rainfall)
> boxplot(ln.rainfall)
On the natural log scale, only the low outliers appear on the histogram! If you were in international agribusiness, which rainfall would be more of an outlier for your purposes, one inch per year or seventy-two inches per year?
Lesson
If we were to examine this data by hand, we might be tempted to do a single display simply because making more is extra work. However, using the computer we were able to use different graphs, and even transform the data for ease of exploration. In the process we discovered not only the two high outliers, but also the two harder-to-observe low outliers, along with confirmation of their significance.
Exercises
For this tutorial, we will use the States95 dataset, which should have been loaded along with the rest of the datasets for the course (for Biostats it's in WS_Text_Examples.RData and for IntroStat it's in 210_LoadedData3.RData). To view it, click on its name in the Workspace. To see the meaning of the main variables, you would normally type "?States95", but this is my own private dataset. I obtained the key variables from the "SAT" dataset in the mosaic package. To view them, first install the mosaic package, then call the mosaic library, and finally view the dataset. Once the package is installed, you will never need to install it again. Each time you restart RStudio, however, you will need to load the library again. To do this, type
> install.packages("mosaic") #Only need to do one time
> library(mosaic) #Do every time you want to use "mosaic", e.g. to view SAT variables
> ?SAT #Show the help file for SAT
1. Using the commands learned in this tutorial, explore the dataset on your own. Generate statistics of some of the variables of interest. Create some graphs. To get started, access the data.frame using the attach() command:
> attach(States95)
Then you can run the commands, for example:
> mean(sat)
> hist(salary)
2. Based on your exploration in question 1, write down at least three questions you would like to answer using this data. For example: Are the SAT scores higher, on average, in either the Red or the Blue states?
Note: the States95 dataset is the dataset for all of the RTutorials exercises, so you are taking your first glimpse of it here, and will do successive exploration as you learn more statistics.
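As you explore in Exercise 1, you may find it handy to reach into a data frame without attach(). Since States95 does not ship with R, this sketch uses mtcars, a dataset built into R, purely as a stand-in; swap in States95 and its variable names (such as sat or salary) when you run it on the course data.

```r
# mtcars ships with R; it stands in here for States95
names(mtcars)              # list the variables in the data frame

# Two ways to use a variable without attach()
mean(mtcars$mpg)           # dataset$variable
with(mtcars, mean(mpg))    # with(dataset, command)

# The commands from this tutorial work the same way
with(mtcars, fivenum(hp))
with(mtcars, length(hp))
```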