Transformations in R

Logarithmic Transformation

For this example we will use data on the number of electronic academic journals in each year of a seven-year period. The counts are listed below. Note the shortcut (1991:1997) for entering consecutive years.

27 36 45 181 306 1093 2459

> year = 1991:1997
> year
[1] 1991 1992 1993 1994 1995 1996 1997
> Journals <- scan()
1: 27
2: 36
3: 45
4: 181
5: 306
6: 1093
7: 2459
8:
Read 7 items
> plot(year,Journals)

The first variable mentioned in the plot command is plotted on the horizontal axis. Not surprisingly, the number of electronic journals really took off during this period. Sometimes "exponential growth" is used to describe any kind of rapid growth, but technically it refers to a specific mathematical pattern. If we have true exponential growth, then plotting the logarithms of the growing variable versus time should give a straight line. First take the logarithms, then make the plot.

> logJ=log(Journals)
> plot(year,logJ)

The original graph shows strong curvature. The logarithms of the journal counts fall much closer to a straight line when plotted against year, so we might say that the growth is approximately exponential.
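One way to back up the visual impression with numbers is sketched below. It re-enters the counts with c() so that the block runs without interactive input (a convenience assumed here; scan() as above works equally well), compares the correlation with year before and after taking logs, and fits a straight line to the logged counts with lm(). The names fit and growth are only illustrative.

```r
# Same data as above, entered with c() instead of scan()
year <- 1991:1997
Journals <- c(27, 36, 45, 181, 306, 1093, 2459)
logJ <- log(Journals)

# Correlation with year measures how close each plot is to a straight line
cor(year, Journals)
cor(year, logJ)

# Least-squares line through the logged counts
fit <- lm(logJ ~ year)
coef(fit)

# If log(count) = a + b*year, the count is multiplied by exp(b) each year
growth <- exp(coef(fit)["year"])
growth
```

If the growth were exactly exponential, the logged counts would lie exactly on a line and their correlation with year would be 1. The slope of the fitted line is on the natural-log scale, which is why exp() is used to turn it back into a yearly growth factor.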

It might be interesting to see the effect of the transformation on the journal counts considered by themselves.

> hist(Journals)
> hist(logJ)

Here the transformation makes the data much less skewed.
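A rough numerical companion to the histograms, offered as a sketch rather than a formal test, is to compare the mean with the median. A long right tail pulls the mean well above the median, while for a roughly symmetric distribution the two are close. The counts are entered with c() here so the block runs on its own.

```r
# Same journal counts as above, entered with c() instead of scan()
Journals <- c(27, 36, 45, 181, 306, 1093, 2459)
logJ <- log(Journals)

# Raw counts: compare mean and median
c(mean = mean(Journals), median = median(Journals))

# Logged counts: compare mean and median
c(mean = mean(logJ), median = median(logJ))
```

With only seven observations this is just an informal check, but it points the same way as the histograms.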

Reciprocal Transformation

Logarithms are a common transformation but certainly not the only one. We can do simple arithmetic transformations at the command line. For example, it is not clear whether fuel efficiency should be measured in miles per gallon or gallons per mile. If we have data in one form in a variable MPG, a reciprocal transformation takes us to the other.

> GPM=1/MPG
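As a small illustration, the sketch below applies the reciprocal to a few made-up MPG values (hypothetical numbers, not from any dataset used in this tutorial). Arithmetic in R is applied element by element, so one line transforms the whole variable.

```r
# Hypothetical fuel efficiencies in miles per gallon
MPG <- c(20, 25, 40)

# Reciprocal transformation: gallons per mile
GPM <- 1 / MPG
GPM

# Taking the reciprocal again recovers the original values
1 / GPM
```

Note that the reciprocal reverses the ordering: the car with the highest MPG has the lowest GPM.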

Exercises

These exercises use the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.

1. Make a histogram of 'area'. What is the shape of the distribution? Why?

> hist(area)

#The following commands convert the area variable to the log_10 scale and show the histogram. Log base 10 was chosen over the natural log because we use the base-10 system and tend to think in multiples of 10 (e.g. the left endpoint of the log-10 histogram is 3, which corresponds to 10^3, so the x-axis is easy to interpret).

> log.area = log10(area)

> hist(log.area)

2. What is the shape of the distribution of the log_10 data?

#What is really going on? To find out, use

> sort(area)

> hist(area[area<200000]) #Histogram without the two largest areas

#or

> index = order(area, decreasing=TRUE) #order() gives the indices that put the data in sorted order

> data.frame(state[index],area[index]) #view state and area according to the ordered indices

3. Perform the same type of analysis on the "pop" variable. What is the shape of the distribution? What is the effect of doing a log-transformation on it? What do you think is going on?
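If the difference between sort() and order() in the exercises is unclear, the sketch below may help. It uses a small made-up set of labels and values (the names name and size and all the numbers are hypothetical and not taken from States95), so it does not give away any answers.

```r
# Hypothetical labels and values, not from the States95 data
name <- c("A", "B", "C", "D")
size <- c(1200, 570000, 45000, 9800)

# sort() returns the values themselves, smallest first
sort(size)

# order() returns the positions that would sort the values;
# decreasing = TRUE puts the largest first
index <- order(size, decreasing = TRUE)
index

# Indexing both variables by the same positions keeps labels and values paired
data.frame(name[index], size[index])

# On the log10 scale each unit is a factor of 10, so 1000 becomes 3
log10(c(1000, 10000, 100000))
```

Because order() returns positions rather than values, it can be used to rearrange several variables at once, which sort() alone cannot do.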
