Statistical distributions in R
R has a large number of built-in functions (in the R stats package) that are very convenient for visualizing and analyzing common statistical distributions. The standard functions provide the density (or probability mass) function, the cumulative distribution function, the quantile function, and random number generation, using the prefixes d, p, q, and r, respectively.
We'll illustrate these functions with examples from some familiar continuous and discrete distributions; the script file will contain these plus a few more.
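As a quick preview of the naming pattern (this snippet is illustrative and not part of the attached script), here are the four functions for the normal distribution with its default mean 0 and sd 1:
> dnorm(0)     #density at x=0
> pnorm(0)     #cumulative probability P(X<=0)
> qnorm(0.5)   #quantile for probability 0.5 (the median)
> rnorm(3)     #three random draws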
Normal distribution
The normal distribution has 2 parameters: the mean mu and the standard deviation sigma. We'll pick an example with mu=5 and sigma=15.
> mu<-5
> sigma<-15
The density function requires one or more values of x. We could give it a single value, but it is more interesting to supply a range of values over which the density is appreciable (i.e., values of x that are reasonably likely). This should do it:
> #generate equally spaced values between -50 and 50 and add 5
> x<-seq(-50,50,.5)+5
The corresponding values of the density are obtained by passing the objects x, mu, and sigma (all assigned values above) to the dnorm function.
> density<-dnorm(x,mu,sigma)
The plot function can be used to specify a nice line plot of the density.
> plot(x,density,type="l")
A similar process is used to generate the cdf plot, using pnorm:
> distrib<-pnorm(x,mu,sigma)
> plot(x,distrib,type="l")
Quantiles of this distribution are generated using qnorm
> #quantiles at standard probability levels
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> qnorm(prob_levels,mu,sigma)
[1] -41.353485 -19.672804 -5.117346 5.000000 15.117346 29.672804 51.353485
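As a quick check (not part of the original output), qnorm and pnorm are inverses of one another, so applying pnorm to these quantiles recovers the probability levels:
> #cdf of the quantiles returns the original probability levels
> pnorm(qnorm(prob_levels,mu,sigma),mu,sigma)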
Likelihood
The density function can be transformed into a likelihood by treating the data as fixed and one or more parameters as the variable. This is usually done on the log scale.
Take a simple example with a single value of x, assuming that sigma is fixed at the value above (15), and compute the likelihood for candidate values of mu:
> #single value
> x<-4.5
> #possible values for mu
> mu.x<-seq(0,10,0.5)
> #log-likelihood for each candidate value of mu
> loglike<-dnorm(x,mu.x,sigma,log=TRUE)
We can find the value of mu that maximizes the likelihood; not surprisingly, this is the same as the data value (4.5).
> mu.x[loglike==max(loglike)]
[1] 4.5
Ordinarily data will come as a sample of values, e.g., n=2, in which case the log-likelihood for the data is simply the sum of the log-likelihoods for the individual data points:
> ### 2 values of x
> x<-c(3,8)
> loglike<-dnorm(x[1],mu.x,sigma,log=TRUE)+dnorm(x[2],mu.x,sigma,log=TRUE)
> mu.x[loglike==max(loglike)]
[1] 5.5
In fact, for the normal it's easy to show that substituting the sample mean for the data value in a single likelihood expression gives the same maximizing value of mu:
> #using the mean
> x<-c(3,8)
> loglike<-dnorm(mean(x),mu.x,sigma,log=TRUE)
> mu.x[loglike==max(loglike)]
[1] 5.5
Plotting the values of the log-likelihood versus the values of mu, we can confirm that mu=5.5 provides the maximum, and is therefore the maximum likelihood estimate.
> #plot log-likelihood vs values of mu
> plot(mu.x,loglike,type="l",xlab="mu",ylab="Log-Likelihood")
Note that in this example, there was only one parameter in play (mu); ordinarily both mu and sigma must be estimated. The basic approach is the same, but now involves finding the combination of mu and sigma that maximizes the likelihood as a function of both parameters.
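A minimal sketch of that two-parameter case is below; it is not part of the attached script, and the use of optim(), the starting values, and the log(sigma) transformation (to keep sigma positive) are choices made here for illustration:
> #negative log-likelihood with par[1]=mu and par[2]=log(sigma)
> x<-c(3,8)
> negloglike<-function(par) -sum(dnorm(x,par[1],exp(par[2]),log=TRUE))
> #minimize the negative log-likelihood (i.e., maximize the log-likelihood)
> fit<-optim(c(0,0),negloglike)
> fit$par[1]       #estimate of mu, close to mean(x)=5.5
> exp(fit$par[2])  #estimate of sigma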
Random number generation
Generating random values of "data" from a specified distribution is easy. Let's take a slightly different case: 100 values from the normal(0, 1) distribution, a.k.a. the standard normal distribution.
> n<-100
> mu<-0
> sigma<-1
> x<-rnorm(n,mu,sigma)
Note that the expression
> x<-rnorm(n)
would have worked just as well, since the default values of mu and sigma are 0 and 1. Note also that the sample mean is slightly above the true mean, and the sample sd slightly below the true value:
> mean(x)
[1] 0.07019897
> sd(x)
[1] 0.942415
This is to be expected due to random (sampling) variation; if we repeated the process a number of times, we would get sample means both above and below 0 (the true mean). Increasing n (say, to 10,000) results in a closer match between the sample and true means.
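For example (this check is not in the original script; the object x.big is introduced here just for illustration):
> #with a larger sample, the estimates are much closer to the true values
> x.big<-rnorm(10000)
> mean(x.big)
> sd(x.big)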
We can also compare the empirical quantiles of the sample to the theoretical quantiles of the distribution:
> x.q<-quantile(x,prob_levels)
> #compare to theoretical quantiles
> t.q<-qnorm(prob_levels,mu,sigma)
> quant.compare<-data.frame(emp=x.q,theor=t.q)
> with(quant.compare,plot(theor,emp))
> abline(a=0,b=1)
A plot comparing these confirms that most points lie close to the 1:1 ("equals") line, indicating close agreement between the data and the nominal model.
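R also provides built-in functions for this kind of comparison; qqnorm() and qqline() produce a full normal quantile-quantile plot of the sample (they are not used in the script above):
> qqnorm(x)
> qqline(x)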
Other distributions
The normal is a good jumping-off point for other common continuous and discrete distributions. Examples of the density, distribution, quantile, and random number functions for each of these distributions are in the attached R script (along with a couple of additional distributions of interest, such as the beta and the gamma).
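To illustrate how the naming pattern carries over to a discrete distribution, here is the binomial; this particular example, with 10 trials and success probability 0.3, is chosen here for illustration and is not taken from the attached script:
> dbinom(3,10,0.3)    #P(X=3)
> pbinom(3,10,0.3)    #P(X<=3)
> qbinom(0.5,10,0.3)  #median
> rbinom(5,10,0.3)    #five random values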