05. Basic statistics with R

Descriptive Statistics - Basic Functions

A great number of users use R solely for statistical calculations and analyses. Although there is much more to R than statistics, the existence of an extensive (and ever-expanding) statistical toolbox remains one of its main strong points compared to other programming languages. R has a plethora of both elementary and advanced statistical functions, covering descriptive measures, statistical tests, simulation of data and more. Below we start discussing some of the most basic functions that will help you work through some initial statistics with your data. (In case you are bored, stay tuned for more advanced stuff in later chapters!)

sum(x)

The sum() function returns the sum of a numerical vector.

min(x), max(x)

The min() and max() functions return the minimum and maximum values of a numerical vector.

length(x)

The length() function, applied to vectors (regardless of class), returns their size in number of elements. length() is thus the easiest way of counting elements.
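A quick illustration of the four functions above (the numbers are arbitrary):

x <- c(4, 8, 15, 16, 23, 42)
sum(x)     # 108
min(x)     # 4
max(x)     # 42
length(x)  # 6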

mean(x, trim=0, na.rm=FALSE)

mean() is a function for calculating the mean (surprise!) of a vector of numerical values (keep in mind both "vector" and "numeric"). For the completely uninitiated, the mean equals the sum of a vector divided by the number of its elements. It is thus equivalent to sum(x)/length(x).

It is called directly on the vector but allows for some additional options, such as trim, which refers to a fraction of values to be left out of the data on either side (around the max and min values). trim thus scales from 0 to 0.5, defaulting to 0 (meaning that there is no trimming unless otherwise specified). Although trimming can be very valuable in the presence of outliers (values that fall outside the expected distribution), it is better left alone until a thorough examination of the distribution has been conducted. mean() also allows the user to remove NA values before the calculation with na.rm=TRUE. na.rm defaults to FALSE, which means that if there are NA values in the vector, the function will return NA.
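A small sketch of both options on a made-up vector:

x <- c(1, 2, 3, 4, 100)  # note the outlier
mean(x)                  # 22
mean(x, trim = 0.2)      # 3, after dropping the lowest and highest 20% of values
y <- c(1, 2, NA, 4)
mean(y)                  # NA
mean(y, na.rm = TRUE)    # 2.333333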

sd(x), var(x)

These return the standard deviation (sd()) and the variance (var()) of a numeric vector. You may remember from high school statistics that the standard deviation is the square root of the variance and that the variance is the mean squared distance of the vector's values from their mean (strictly speaking, R computes the sample variance, dividing by n-1 instead of n). In general, both var() and sd() give a very good idea of how "spread out" the data are around the mean value (which may be seen as the most probable, or expected, value for a given dataset).
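For example, on an arbitrary vector:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(x)                 # 4.571429
sd(x)                  # 2.13809
sd(x) == sqrt(var(x))  # TRUE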

median(x)

The median of a numerical vector is the value that lies midway in a ranking of the values from lowest to highest. Its difference from the mean is very often a good way to "guesstimate" the existence of outliers and asymmetric distributions.
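A single outlier illustrates the difference:

x <- c(1, 2, 3, 4, 5)
mean(x); median(x)  # both 3
x <- c(1, 2, 3, 4, 500)
mean(x)             # 102, dragged up by the outlier
median(x)           # still 3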

quantile(x, probs)

Quantiles are the cut points that divide a ranked dataset (its values ordered from smallest to largest) into equally populated "slices" of the data. The best-known example are the quartiles: the minimum (1st cut point, at 0% of the values), the value below which 25% of the data lie (2nd), the median (3rd, at 50%), the value at 75% of the total (4th) and the maximum (5th, at 100%). The four equally populated slices bounded by these cut points (the 1st-2nd, 2nd-3rd, 3rd-4th and 4th-5th value ranges) are the quarters that give the quartiles their name.

The quantile() function is called on a numerical vector with the option probs, which accepts a vector of probabilities between 0 and 1. Thus, supposing x is a numerical vector,

quantile(x, probs=c(0,0.25,0.50,0.75,1))

produces the five quartile cut points (bounding 4 intervals), while if we only cared about specific values and not the whole set, we could for instance try

quantile(x, probs=c(0.3,0.7))

which would give us the two cut points at 30% and 70% of the ranked values, i.e. the bounds of the value range between them. Later on, we will see how we can combine clever vector-forming functions with quantile() to avoid writing down explicit probability vectors; a first taste follows below.
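As that first taste, seq() can generate the probability vector for us; assuming x is a numerical vector:

quantile(x, probs = seq(0, 1, by = 0.25))  # same as probs=c(0,0.25,0.50,0.75,1)
quantile(x, probs = seq(0, 1, by = 0.1))   # deciles: cut points every 10%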

range(x)

range() returns the full range of values of a vector. It is thus by definition equivalent to quantile(x, c(0,1)). The only difference is that range() produces a plain vector, while quantile() produces a named vector.
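For instance:

x <- c(3, 1, 4, 1, 5, 9)
range(x)              # 1 9
quantile(x, c(0, 1))  # the same two values, but named "0%" and "100%"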

diff(x, lag=n)

The diff() function returns lagged differences in a series of values. It thus needs to be run on a numerical vector, and it takes the order of the vector as it stands, without any transformation. By default it assumes lag=1, which means that diff() will transform a vector x of size N into a vector of the N-1 differences between x[i] and x[i+1]. In general, given lag=n, diff() transforms an N-sized vector of values into an (N-n)-sized vector of differences between x[i] and x[i+n]. Too complicated? It comes in very handy when trying to smooth curves or simulate derivatives.
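A small example makes the bookkeeping clear:

x <- c(1, 4, 9, 16, 25)
diff(x)           # 3 5 7 9  (differences between consecutive values)
diff(x, lag = 2)  # 8 12 16  (x[i+2] - x[i])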

scale(x)

The scale() function performs a basic normalization of a numeric vector. For the uninitiated, normalization is the process by which we adjust values of different scales to a common scale. A basic normalization scheme is adjusting all values to the scale of a normal distribution with mean=0 and sd=1. This, also known as the z-transformation, is what scale() performs. It is essentially equivalent to taking the difference of each value of x from the mean and dividing by the standard deviation, or in simple math:

y1<-(x-mean(x))/sd(x)
y2<-scale(x)

where y1 and y2 are numerically identical (strictly speaking, scale() returns a one-column matrix rather than a plain vector). Normalization schemes are very important for a number of reasons already discussed in data analysis classes, or to be touched upon later on. In any case, a concept that you should keep in mind.
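A quick sanity check of the equivalence (values arbitrary; as.vector() drops the matrix structure of the scale() output):

x <- c(10, 20, 30, 40, 50)
all.equal((x - mean(x)) / sd(x), as.vector(scale(x)))  # TRUE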

describe(m)

describe() is a combined function that acts on data frames and returns a number of descriptive statistics for each column, including the mean, standard deviation, skew, kurtosis etc. describe() is part of the psych package, so you need to pre-install psych (more on that later; for now let's assume that you have it). If we run describe() on (part of) a pre-defined data frame (mtcars), it returns:

describe(mtcars[, c(1, 2, 5)])
     vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
mpg     1 32 20.09 6.03   19.2   19.70 5.41 10.40 33.90 23.50  0.61    -0.37 1.07
cyl     2 32  6.19 1.79    6.0    6.23 2.97  4.00  8.00  4.00 -0.17    -1.76 0.32
drat    3 32  3.60 0.53    3.7    3.58 0.70  2.76  4.93  2.17  0.27    -0.71 0.09

Later on, in the chapter dedicated to programming R, we will see how to download (and load) R packages.
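For the impatient, the two commands needed are sketched below (the first requires an internet connection and only has to be run once):

install.packages("psych")  # download and install the package
library(psych)             # load it into the current session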

Statistical Inference - (not so) Basic Functions

Correlation Analysis

cor(x,y, method = c("pearson", "kendall", "spearman"))

Data correlation is very important when trying to estimate relationships between datasets (careful: we are always talking about correlation, not causation). When comparing two equally sized vectors x and y, the correlation can be measured with the linear, numerical Pearson correlation or with the Kendall and Spearman rank correlations. You may want to refer back to your statistics textbook for the differences between the three methods.

cor() can also be called on a matrix m, in which case it produces a symmetric square matrix of pairwise correlations.
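For example, on the built-in mtcars data frame (used extensively below):

cor(mtcars$wt, mtcars$hp)                      # ~0.659, Pearson by default (its square, ~0.434, will reappear as R-squared below)
cor(mtcars$wt, mtcars$hp, method = "spearman") # the rank-based alternative
cor(mtcars[, c("mpg", "wt", "hp")])            # symmetric 3x3 correlation matrix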

Contingency Tables

table(x, y, ...)

The table() function uses factors to build a contingency table of the counts at each combination of factor levels. In simpler words, table() takes the elements of a vector, checks the factor levels these can be assigned to and returns a table in which each distinct level is matched to the count of elements carrying it in the initial vector. It is easier to see this with an example. Take the built-in data frame mtcars, loaded into your R environment simply by typing

mtcars

This will return a short data frame containing types of cars with their corresponding specifications in terms of number of cylinders, horsepower, gears in the gearbox etc. To get an idea of how table() works, try

table(mtcars$cyl)

What this command does is let table() know that it has to act on the column of mtcars carrying the name cyl. Remember, we can use the $ operator to subset a data frame with named columns. The output of the command above is

table(mtcars$cyl)

 4  6  8
11  7 14

which means that all cars had either 4, 6 or 8 cylinders (the top row corresponding to the factor levels) and that 11 of them had 4, 7 had 6 and 14 had 8 cylinders. This is a simple table that only summarizes the cylinder data. What if we asked a more interesting and complex question? For instance, what if we wanted to know at the same time how many cars with 4 cylinders also had 5 gears? We can use table() in two dimensions simply by introducing one additional vector

table(mtcars$cyl, mtcars$gear)

      3  4  5
  4   1  8  2
  6   2  4  1
  8  12  0  2

The result is a two-dimensional table that tells us that there are 2 cars with 4 cylinders and 5 gears, 12 with 8 cylinders and 3 gears, and so on.

table() doesn't stop there: you can create even 3- or higher-dimensional tables by adding more vectors to the call. The output, however, starts to become quite complex and needs to be handled with extra care.

Contingency tables are extremely useful in a number of statistical tests (to be covered later) but we can also use them for better representation of data in plots, rough classifications of data etc.
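As a small preview of such uses, prop.table() turns the counts of a contingency table into proportions:

tab <- table(mtcars$cyl, mtcars$gear)
prop.table(tab)     # each cell divided by the total number of cars
prop.table(tab, 1)  # proportions within each row (cylinder class)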

Linear Regression

Linear regression in R is as easy as simply typing in the two variables you want to model, separated by a tilde "~" (dependent ~ independent), with the use of the lm() (linear modeling) function. Let's first compare two variables (the car's weight and the car's horsepower) in a simple scatterplot:

plot(mtcars$wt,mtcars$hp, xlab="Weight", ylab="Horse Power", pch=19, cex=1.5, col="red4", type="p", cex.axis=1.2, cex.lab=1.4)

The plot above has been "embellished" with some additional features but it remains a scatterplot of "wt" versus "hp". Now suppose we want to model the linear function that best explains the relationship between the weight and the horsepower. "hp" will in this case be the dependent variable, explained by a linear model of "wt". We can simply create the linear model like this:

model<-lm(mtcars$hp~mtcars$wt)

model # will print

Call:
lm(formula = mtcars$hp ~ mtcars$wt)

Coefficients:
(Intercept)    mtcars$wt
     -1.821       46.160

The model above prints two values that have to be read as the a and b of a linear function of the general type:

f(x)=ax+b

in which case x is the weight of the car and f(x) is the expected hp based on the data we "trained" the model on; here a = 46.160 (the mtcars$wt coefficient) and b = -1.821 (the intercept). We can see how this model behaves simply by plotting it against the real data like this:

lines(mtcars$wt,46.160*mtcars$wt-1.821, type="l", col="blue4", lwd=3)
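Incidentally, instead of typing the coefficients by hand, we can let R extract them from the model itself; abline() accepts an lm object directly and draws the same line:

abline(model, col="blue4", lwd=3)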

Either way, we have asked R to draw a line over the data: the function f(x)=ax+b with a and b being the coefficients of the model. We see that the fit between the model and the data is somewhat satisfactory given their variance. But how good exactly? We can get a quantitative measurement of the goodness of fit (as it is called) through a carefully designed statistic, the R-squared value. R-squared, a measure of the quantity called the "coefficient of determination", is basically a measure of how closely the model line passes through the data cloud, compared to the simplest line running parallel to the x-axis at the height of the mean of f(x). The R-squared value of a perfect fit is 1 and that of the null fit (the one expected under purely random modeling) is 0. R-squared is pre-calculated by the lm() function and can be directly visualized by running:

summary(model) # prints

Call:
lm(formula = mtcars$hp ~ mtcars$wt)

Residuals:
    Min      1Q  Median      3Q     Max
-83.430 -33.596 -13.587   7.913 172.030

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.821     32.325  -0.056    0.955
mtcars$wt     46.160      9.625   4.796 4.15e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 52.44 on 30 degrees of freedom
Multiple R-squared:  0.4339,    Adjusted R-squared:  0.4151
F-statistic: 23 on 1 and 30 DF,  p-value: 4.146e-05

You see that everything that needs to be known about the model may be found here.
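The individual pieces can also be pulled out programmatically, which is handy inside scripts:

coef(model)               # the intercept (b) and slope (a)
summary(model)$r.squared  # 0.4339, the R-squared discussed above
residuals(model)          # the residuals summarized in the printout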

Probability Functions and Simulation

The basic format for probability functions in R is

[dpqr] + type of distribution

where d, p, q and r stand for

d = density, p = (cumulative) probability function, q = quantile function and r = random generation

Out of the four, r is the one used most often, providing us with a numerical vector of random values based on the type of distribution and the chosen parameters.

The supported types of distributions are:

  • rnorm() for the normal distribution which can be called with

rnorm(n, mean=0, sd=1)

which returns n values sampled from a normal distribution with mean=0 and sd=1

  • rbinom() for the binomial distribution

which is called as

rbinom(n, size=N, prob=p)

which returns n values, each drawn from a binomial distribution of size (number of trials) N and success probability p

  • rpois() for the Poisson distribution

which is called with

rpois(n, lambda)

returning n values from a Poisson distribution with parameter lambda.

  • runif() for the uniform distribution

which you can use with

runif(n, min=a, max=b)

to obtain n random elements sampled from a uniform distribution between a and b. runif() is very often used for the generation of random real numbers.

Other functions support less common distributions such as the geometric, the hypergeometric, the Weibull etc.
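The four prefixes fit together neatly; here is a small demonstration with the normal distribution (set.seed() makes the "random" draws reproducible):

set.seed(42)      # fix the random number generator for reproducibility
x <- rnorm(1000)  # 1000 random draws from N(0, 1)
dnorm(0)          # density at 0: ~0.399
pnorm(1.96)       # cumulative probability up to 1.96: ~0.975
qnorm(0.975)      # the inverse of the above: ~1.96
mean(x); sd(x)    # should be close to 0 and 1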