Stats text 06

06 Probability Distributions

6.1 Types of probabilities and distributions

Outcomes that are mathematically equally likely usually produce symmetric distributions. The probabilities for a single coin or a single die are uniform in shape. The probabilities for the number of heads on multiple coins form a symmetric heap that is called a binomial distribution, and the sums of multiple dice form a similar symmetric heap. As the number of dice or pennies increases, the distribution approaches a shape we will later learn to call the "normal" distribution.
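To make the "symmetric heap" concrete, the following is a minimal sketch that computes the exact probabilities for the number of heads when tossing n fair coins. The function name coin_distribution is hypothetical and chosen only for illustration.

```python
from math import comb

def coin_distribution(n):
    """Return P(k heads) for k = 0..n when tossing n fair coins."""
    total = 2 ** n  # number of equally likely outcomes
    return [comb(n, k) / total for k in range(n + 1)]

# One coin is uniform; more coins pile up into a symmetric heap.
for n in (1, 2, 4, 8):
    probs = coin_distribution(n)
    print(n, [round(p, 3) for p in probs])
```

Reading down the printed rows, the probabilities stay symmetric about the middle while the center grows taller relative to the tails.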

Distributions based on relative frequencies can have a variety of shapes, symmetrical or non-symmetrical.

The shape of the distribution of a sample often reflects the shape of the distribution of the population. If the sample is a good, random sample, then the shape of the sample distribution is a good predictor of the shape of the population distribution.

Probability Distributions

In this text, a probability distribution usually refers to a relative frequency histogram drawn as a line chart.

Both discrete and continuous variables can have a probability distribution. Classes (or bins or intervals) can be constructed, relative frequencies (or probabilities) can be calculated, and a relative frequency histogram can be drawn. If the data is continuous, then a mean can be calculated directly from the original data. The mean can also be recovered from the class values and the probabilities, although this depends on treating the class values as part of a continuous distribution. In later chapters the columns of the histogram chart will be replaced by a line, specifically a "heap" or "mound" shaped line. The diagrams further below show how one might move from a column chart representation of data to a line chart representation.
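The construction described above (build classes, count, divide by the sample size) can be sketched as follows. This is only an illustration under stated assumptions: the measurements are hypothetical, not the body fat data used in this chapter, and the helper relative_frequencies is a made-up name that uses equal-width classes.

```python
def relative_frequencies(data, num_classes=5):
    """Split data into equal-width classes; return (upper_limits, rel_freqs)."""
    low, high = min(data), max(data)
    width = (high - low) / num_classes
    upper_limits = [low + width * (i + 1) for i in range(num_classes)]
    counts = [0] * num_classes
    for x in data:
        # Place x in the first class whose upper limit is >= x;
        # the last class catches anything left over from rounding.
        for i, upper in enumerate(upper_limits):
            if x <= upper or i == num_classes - 1:
                counts[i] += 1
                break
    n = len(data)
    return upper_limits, [c / n for c in counts]

# Hypothetical measurements, for illustration only
sample = [15.6, 18.2, 20.1, 22.4, 22.9, 24.0, 25.3, 26.8, 27.5, 28.1,
          29.9, 31.2, 33.0, 35.7, 40.5]
limits, rel = relative_frequencies(sample)
print(limits)
print(rel, sum(rel))
```

The relative frequencies always add to one, which is what lets the histogram be read as a probability distribution.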

The following data consists of 39 body fat measurements for female students at the College of Micronesia-FSM Summer 2001 and Fall 2001. Following the table is a relative frequency histogram, the probability distribution for this data.

If each bar is taken to have a width of one, the area under the bars is equal to one, the sum of the relative frequencies. The above diagram consists of five discrete classes. Later we will look at continuous probability distributions using lines to depict the probability distribution. Imagine a line connecting the tops of the columns:

If the columns are removed and the class upper limits are shifted to where the right side of each column used to be:

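The orange vertical line has been drawn at the value of the mean. If the distribution is symmetric, this line splits the area under the "curve" in half: half of the females have a body fat measurement less than this value and half have a body fat measurement greater than this value. Strictly speaking, the value that splits the area in half is the median, which equals the mean only when the distribution is symmetric.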

We could also draw a vertical line that splits the area under the curve so that ten percent of the area lies to the left of the line and ninety percent lies to the right of the line. This line would be at the value below which only ten percent of the measurements occur.
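As a rough sketch of locating such a line from raw data, the function below returns the smallest data value that has at least a given fraction of the data at or below it. The name value_at_fraction and the data are hypothetical, and this is only one of several conventions for computing such cut points; other conventions interpolate between data values.

```python
def value_at_fraction(data, fraction):
    """Smallest data value with at least `fraction` of the data at or below it."""
    ordered = sorted(data)
    n = len(ordered)
    for i, x in enumerate(ordered, start=1):
        if i / n >= fraction:
            return x
    return ordered[-1]

# Hypothetical measurements: the whole numbers 1 through 20
data = list(range(1, 21))
print(value_at_fraction(data, 0.10))  # ten percent of the data lies at or below this value
print(value_at_fraction(data, 0.50))  # half of the data lies at or below this value
```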

6.2 Calculations of the mean and the standard deviation

In some situations we have only the intervals and the frequencies but we do not have the original data. In these situations it would be useful to still be able to calculate a mean and a standard deviation for our data.

In that case, both the mean and the standard deviation can be calculated from the class upper limits and the relative frequencies.
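A minimal sketch of that calculation follows. It assumes the mean is the sum of each class value times its relative frequency, and that the standard deviation is the square root of the relative-frequency-weighted squared deviations from that mean (the population form; the text does not say which form was used for the census figures). The function name grouped_mean_sd and the small frequency table are hypothetical.

```python
from math import sqrt

def grouped_mean_sd(class_values, frequencies):
    """Mean and standard deviation from class values and their frequencies."""
    total = sum(frequencies)
    probs = [f / total for f in frequencies]  # relative frequencies
    mean = sum(x * p for x, p in zip(class_values, probs))
    # Assumption: population form, weighting squared deviations by probability
    variance = sum((x - mean) ** 2 * p for x, p in zip(class_values, probs))
    return mean, sqrt(variance)

# Hypothetical table: class upper limits and frequencies (not census data)
upper_limits = [5, 10, 15, 20]
frequencies = [40, 30, 20, 10]
mean, sd = grouped_mean_sd(upper_limits, frequencies)
print(round(mean, 2), round(sd, 2))  # approximately 10.0 and 5.0
```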

The following table was taken from the 1994 FSM census. The data has already been tallied into intervals, so we do not have access to the original data.

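The result is an average age of 24.12 years for a resident of the FSM in 1994 and a standard deviation of 18.10 years. The mean by itself does not tell us what share of the population is younger than 24.12 years; that is the role of the median. In a young, growing population the age distribution typically has a long tail toward the older ages, which pulls the mean above the median, so it is plausible that more than half the population of the nation was under 24.12 years old.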

Note that we used the class upper limits to calculate the average age. This potentially inflates the national average by as much as half a class width, or 2.5 years. Taking this into account would yield an average age of 21.62 years.
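The half-class-width adjustment can be written out as a short derivation. The symbols are introduced here only for illustration: x_i is the upper limit of class i, P(x_i) is its relative frequency, and w is the class width, assumed to be the same five years for every class.

```latex
\begin{aligned}
\mu_{\text{mid}}
  &= \sum_i \left(x_i - \tfrac{w}{2}\right) P(x_i)
   = \sum_i x_i \, P(x_i) - \tfrac{w}{2} \sum_i P(x_i)
   = \mu_{\text{upper}} - \tfrac{w}{2} \\
  &= 24.12 - 2.5 = 21.62 \text{ years}
\end{aligned}
```

The step from the second to the third expression uses the fact that the relative frequencies sum to one. Shifting every class value by the same constant moves the mean but leaves the standard deviation unchanged.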

There is one more small complication to consider. Since the population of the FSM is growing, the number of people at each age in years is different across the five-year span of the class. The age groups at the bottom of the class (near the class lower limit) are going to be larger than the age groups at the top of the class (near the class upper limit). This would act to further reduce the average age.
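The effect can be illustrated with a small sketch. The head counts below are hypothetical, not census figures; they simply decline with age inside a single five-year class, as they would in a growing population.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts `weight` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical counts for single years of age within one five-year class
ages = [0, 1, 2, 3, 4]
counts = [3000, 2900, 2800, 2700, 2600]  # younger ages are more numerous

even_spread = weighted_mean(ages, [1] * len(ages))  # same count at every age
declining = weighted_mean(ages, counts)
print(even_spread, round(declining, 3))
```

With equal counts at every age the class mean sits at the middle age of the class; with declining counts it falls below the middle, which is the further reduction described above.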