Statistical distributions in R
R has a large number of built-in functions (in the R stats package) that are very convenient for visualizing and analyzing common statistical distributions. The standard functions provide the density (or probability mass) function, the cumulative distribution function, the quantile function, and random number generation, using the prefixes d, p, q, and r, respectively.
We'll illustrate these functions with examples from some familiar continuous and discrete distributions; the script file will contain these plus a few more.
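As a quick preview of the naming pattern (this snippet is illustrative and not part of the attached script), here are the four functions for the normal distribution with its default mean 0 and sd 1:
> dnorm(0)     #density at x=0
> pnorm(0)     #cumulative probability P(X<=0)
> qnorm(0.5)   #quantile for probability 0.5 (the median)
> rnorm(3)     #three random draws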
Normal distribution
The normal distribution has 2 parameters: the mean mu and the standard deviation sigma. We'll pick an example with mu=5 and sigma=15.
> mu<-5
> sigma<-15
The density function requires one or more values of x. We could give it a single value, but it is more interesting to supply a range of values over which the density is appreciable (i.e., values of x that are reasonably likely). This should do it:
> #generate equally spaced values between -50 and 50 and add 5
> x<-seq(-50,50,.5)+5
The corresponding values of the density are obtained by passing the objects x, mu, and sigma (all assigned values above) to the dnorm function.
> density<-dnorm(x,mu,sigma)
The plot function can be used to specify a nice line plot of the density.
> plot(x,density,type="l")
A similar process is used to generate the cdf plot, using pnorm:
> distrib<-pnorm(x,mu,sigma)
> plot(x,distrib,type="l")
Quantiles of this distribution are generated using qnorm
> #quantiles at standard probability levels
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> qnorm(prob_levels,mu,sigma)
[1] -41.353485 -19.672804 -5.117346 5.000000 15.117346 29.672804 51.353485
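As a quick check (not part of the original output), qnorm and pnorm are inverses of one another, so applying pnorm to these quantiles recovers the probability levels:
> #cdf of the quantiles returns the original probability levels
> pnorm(qnorm(prob_levels,mu,sigma),mu,sigma)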
Likelihood
The density function can be transformed into a likelihood by treating the data as fixed and one or more parameters as the variable. This is usually done on the log scale.
Take a simple example with a single value of x, assuming that sigma is fixed at the value above (15), and compute the likelihood for candidate values of mu:
> #single value
> x<-4.5
> #possible values for mu
> mu.x<-seq(0,10,0.5)
> #log-likelihood for each candidate value of mu
> loglike<-dnorm(x,mu.x,sigma,log=TRUE)
We can find the value of mu that maximizes the likelihood; not surprisingly, this is the same as the data value (4.5).
> mu.x[loglike==max(loglike)]
[1] 4.5
Ordinarily data will come as a sample of values, e.g., n=2, in which case the log-likelihood for the data is simply the sum of the log-likelihoods for the individual data points:
> ### 2 values of x
> x<-c(3,8)
> loglike<-dnorm(x[1],mu.x,sigma,log=TRUE)+dnorm(x[2],mu.x,sigma,log=TRUE)
> mu.x[loglike==max(loglike)]
[1] 5.5
In fact, for the normal it's easy to show that substituting the sample mean for the data value in a single likelihood expression gives the same maximizing value of mu:
> #using the mean
> x<-c(3,8)
> loglike<-dnorm(mean(x),mu.x,sigma,log=TRUE)
> mu.x[loglike==max(loglike)]
[1] 5.5
Plotting the values of the log-likelihood versus the values of mu, we can confirm that mu=5.5 provides the maximum, and is therefore the maximum likelihood estimate.
> #plot log-likelihood vs values of mu
> plot(mu.x,loglike,type="l",xlab="mu",ylab="Log-Likelihood")
Note that in this example, there was only one parameter in play (mu); ordinarily both mu and sigma must be estimated. The basic approach is the same, but now involves finding the combination of mu and sigma that maximizes the likelihood as a function of both parameters.
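A minimal sketch of that two-parameter case is below; it is not part of the attached script, and the use of optim(), the starting values, and the log(sigma) transformation (to keep sigma positive) are choices made here for illustration:
> #negative log-likelihood with par[1]=mu and par[2]=log(sigma)
> x<-c(3,8)
> negloglike<-function(par) -sum(dnorm(x,par[1],exp(par[2]),log=TRUE))
> #minimize the negative log-likelihood (i.e., maximize the log-likelihood)
> fit<-optim(c(0,0),negloglike)
> fit$par[1]       #estimate of mu, close to mean(x)=5.5
> exp(fit$par[2])  #estimate of sigma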
Random number generation
Generating random values of "data" from a specified distribution is easy. Let's take a slightly different case: 100 values from the normal(0, 1) distribution, a.k.a. the standard normal distribution.
> n<-100
> mu<-0
> sigma<-1
> x<-rnorm(n,mu,sigma)
Note that the expression
> x<-rnorm(n)
would have worked just as well, since the default values of mu and sigma are 0 and 1. Note also that the sample mean is slightly above the true mean, and the sample sd slightly below the true value:
> mean(x)
[1] 0.07019897
> sd(x)
[1] 0.942415
This is to be expected due to random (sampling) variation; if we repeated the process a number of times, we would get sample means both above and below 0 (the true mean). Increasing n (say, to 10,000) results in a closer match between the sample and true means.
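For example (this check is not in the original script; the object x.big is introduced here just for illustration):
> #with a larger sample, the estimates are much closer to the true values
> x.big<-rnorm(10000)
> mean(x.big)
> sd(x.big)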
We can also compare the empirical quantiles of the sample to the theoretical quantiles of the distribution:
> x.q<-quantile(x,prob_levels)
> #compare to theoretical quantiles
> t.q<-qnorm(prob_levels,mu,sigma)
> quant.compare<-data.frame(emp=x.q,theor=t.q)
> with(quant.compare,plot(theor,emp))
> abline(a=0,b=1)
A plot comparing these confirms that most points lie close to the 1:1 ("equals") line, indicating close agreement between the data and the nominal model.
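R also provides built-in functions for this kind of comparison; qqnorm() and qqline() produce a full normal quantile-quantile plot of the sample (they are not used in the script above):
> qqnorm(x)
> qqline(x)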
Other distributions
The normal is a good jumping-off point for other common continuous and discrete distributions. Examples of the density, distribution, quantile, and random number functions for each of these distributions are in the attached R script (along with a couple of additional distributions of interest, such as the beta and the gamma).
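To illustrate how the naming pattern carries over to a discrete distribution, here is the binomial; this particular example, with 10 trials and success probability 0.3, is chosen here for illustration and is not taken from the attached script:
> dbinom(3,10,0.3)    #P(X=3)
> pbinom(3,10,0.3)    #P(X<=3)
> qbinom(0.5,10,0.3)  #median
> rbinom(5,10,0.3)    #five random values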