Sampling an entire population is usually impossible because of expense, limited time, and other practical constraints. For that reason, ecologists take a sample, describe it with statistics, and use it to estimate population traits, or parameters. Today in lab we'll apply this idea to determine the weight distribution of bean beetles.
Students should be able to obtain or take a dataset and
Produce histograms to visualize the distribution of the data
Produce estimates of the minimum, maximum, and range
Produce estimates of the mean, median, and mode
Produce estimates of variance and confidence intervals
Explain the concept of confidence intervals
Explain how sample size influences the width of confidence intervals
Explain how the level of confidence influences the width of confidence intervals
Before we can describe a population, we have to collect data! In general we assume the data represent the population well. This means the sample is unbiased. We also assume a larger sample is better! However, we also have to realize that no matter how large or unbiased our sample is, it's still random. This means every sample will be slightly different due to inherent sampling error.
Assuming we have data, let's consider how we summarize it to describe a population. Two main types of numerical summaries are often used - measures of central tendency, and measures of dispersion (variation).
Measures of central tendency (mean, median, and mode) tell us about the "average" data point but in different ways. To get the mean, if we have n pieces of data in our sample, we add them together and divide by n. We call the result "Y bar" because it is written as a Y with a bar over it: Y bar = (Y1 + Y2 + ... + Yn) / n.
The mean isn't always the best number to describe a sample because it can be influenced by extremely low or extremely high values (outliers). The median, on the other hand, is far less sensitive to outliers. The median is the piece of data that falls in the center of the dataset when the data are arranged in order (note: for samples with an even number of data points, the median is the average of the two pieces closest to the center). Since the median only considers the ranking of the data, not their values, a single extreme value barely moves it. Likewise, the mode, or most common data point in your sample, is usually unaffected by outliers.
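To see how differently the three measures react to an outlier, here is a minimal Python sketch. The weights are made-up values for illustration, not lab data, and the helper name summarize is hypothetical.

```python
import statistics

# Hypothetical beetle weights in milligrams (made up for illustration).
weights = [4.1, 4.3, 4.3, 4.5, 4.8]
with_outlier = weights + [12.0]  # the same sample plus one extreme value

def summarize(data):
    """Return the mean, median, and mode of a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

mean_a, median_a, mode_a = summarize(weights)
mean_b, median_b, mode_b = summarize(with_outlier)

print(f"without outlier: mean={mean_a:.2f} median={median_a:.2f} mode={mode_a}")
print(f"with outlier:    mean={mean_b:.2f} median={median_b:.2f} mode={mode_b}")
```

In this example the single extreme value pulls the mean up by more than a milligram, while the median shifts only slightly and the mode does not change.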
Compared to measures of central tendency, measures of dispersion tell us how much variation is in our sample. There are two ways we'll calculate variation - variance and standard deviation. Let's start with the standard deviation - it's simply the square root of the variance. Easy enough, right?! So now let's examine the variance. The variance of the data is the average squared distance from the mean. In other words, calculate the mean of the data (Y bar), then subtract it from each data point (Yi) and square your answer. If you take the average of these numbers, you get the variance. For a sample, the sum is divided by n - 1 rather than n (this is what the STDEV function in Sheets does), so the formula for the sample variance is: s^2 = [(Y1 - Y bar)^2 + (Y2 - Y bar)^2 + ... + (Yn - Y bar)^2] / (n - 1).
Why is it important to square the distance from the mean? To get rid of negative values. Let's explore this. Let's say you sample fish length and your sample mean is 10. If one of your observations (Yi) is 8, that means 8 - 10 = -2. If we averaged the raw distances, the negative and positive values would cancel each other out (distances from the mean always sum to zero), so we square each one to get rid of the negatives.
Once we calculate the variance, remember that the square root of the variance is the standard deviation. Taking the square root puts the measure of spread back into the same units as the data and the mean. As a result, the mean and standard deviation are often graphed together when reporting sample statistics.
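The steps above can be sketched in Python. The fish lengths are made-up values chosen so the mean is 10, as in the example. The sketch divides by n - 1 (the sample variance), which is an assumption about which version of the formula is intended, and checks the result against Python's built-in statistics functions.

```python
import math
import statistics

# Hypothetical fish lengths (made up so that the mean is 10).
lengths = [8, 9, 10, 11, 12]

n = len(lengths)
mean = sum(lengths) / n                      # Y bar
deviations = [y - mean for y in lengths]     # Yi - Y bar; these sum to zero
squared = [d ** 2 for d in deviations]       # squaring removes the signs

sample_variance = sum(squared) / (n - 1)     # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)       # back in the units of the data

print("sum of deviations:", sum(deviations))
print("variance:", sample_variance, "built-in:", statistics.variance(lengths))
print("standard deviation:", sample_sd, "built-in:", statistics.stdev(lengths))
```

The raw deviations sum to zero, which is why they cannot be averaged directly, while the squared deviations give a usable measure of spread.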
It's also important to note that, for data sets that are "bell-shaped" (normally distributed), about 95% of the observations fall within 2 standard deviations of the mean. Thus, an individual that falls more than 2 standard deviations from the mean belongs to the most extreme 5% or so of the population, and one that falls more than 3 standard deviations away is rarer still (about 0.3% of a normal population) and is often treated as an outlier.
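These percentages can be checked with Python's standard library. This is a small sketch that assumes a perfectly normal population; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

Real data are never perfectly normal, so treat these values as a guide rather than an exact rule.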
You can graphically explore these concepts by developing a histogram. A histogram is a graph that puts all the data into a designated number of groups, or bins, and then counts how many data points are in each bin. These graphical summaries make it easy to spot outliers (points far from the mean), bimodal data (data that have 2 distinct peaks, or modes), and the skew of the data (whether one side has a longer tail of points than the other). First, we'll explore this idea using simulations of fish capture. This uses information and code provided by UBC. Make sure under topics, you've selected Sampling for a normally distributed population.
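The binning step behind a histogram can be sketched in a few lines of Python. The weights, the bin width, and the starting edge below are all made-up choices for illustration, and the helper histogram_counts is hypothetical. Charting tools normally pick the bins for you.

```python
import math
from collections import Counter

# Hypothetical weights in milligrams (made up for illustration).
weights = [3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.6, 4.7, 5.1, 6.8]

def histogram_counts(data, bin_width, start):
    """Return (left edge, count) for each bin; assumes start <= min(data)."""
    indices = [math.floor((x - start) / bin_width) for x in data]
    counts = Counter(indices)
    # Include empty bins so gaps in the data stay visible.
    return [(start + k * bin_width, counts[k]) for k in range(max(indices) + 1)]

for left_edge, count in histogram_counts(weights, bin_width=0.5, start=3.5):
    print(f"{left_edge:4.1f} | {'#' * count}")
```

In the printed text histogram, the empty bins between the main cluster and the last value make the possible outlier easy to spot.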
Although we can describe the sample as noted above, we really want to know about the population! Fortunately statistical theory has been developed to help us do this.
For example, our best estimate of the mean of the population is the mean of the sample, and the less spread (standard deviation) we have in our sample, the better we should feel about using it to estimate something about a population! While this makes sense, we have to consider how confident we are in our estimate. We know each time we sample we have some sampling error. However, we also know that if we sampled the population multiple times and plotted the mean of each sample, we would see an approximately normal distribution. This is known as the Central Limit Theorem.
This means we can describe the spread of means we would see if we sampled multiple times. Since the means are normally distributed, about 95% of them fall within about 2 standard deviations of the true mean, where the standard deviation of the sample means is called the standard error of the mean. A range of 2 standard errors on either side of a sample mean is called the 95% Confidence Interval. Simulations (such as the one you can do below) show that the true mean of the population falls within this range for 95% of the confidence intervals you construct (remember each sample would give a slightly different confidence interval due to sampling error). For more on confidence intervals check out the Confidence Intervals! notes under the Resources tab.
Notice that this whole theory rests on the idea of multiple samples, but we typically only have one. Fortunately we can use the standard deviation of our sample (s) to estimate the standard deviation of the sample means (the standard error of the mean) by dividing it by the square root of the sample size: s / sqrt(n).
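A short simulation can make this less abstract. The sketch below builds a hypothetical population (the mean of 5 and standard deviation of 1 are invented for illustration), draws many samples, and compares the spread of the sample means with the standard error predicted from the population and with the one estimated from a single sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 weights (parameters are made up).
population = [random.gauss(5.0, 1.0) for _ in range(10_000)]
sigma = statistics.pstdev(population)

n = 25
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

sd_of_means = statistics.stdev(sample_means)    # spread of the sample means
predicted_se = sigma / n ** 0.5                 # population SD / sqrt(n)

one_sample = random.sample(population, n)
estimated_se = statistics.stdev(one_sample) / n ** 0.5   # s / sqrt(n), one sample only

print(f"SD of 2,000 sample means:  {sd_of_means:.3f}")
print(f"predicted standard error:  {predicted_se:.3f}")
print(f"estimate from one sample:  {estimated_se:.3f}")
```

The first two numbers should agree closely. The third comes from a single sample, so it will be in the same neighborhood but will change from sample to sample.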
The 95% confidence interval for normally distributed data is then defined as Y bar ± 2 × s / sqrt(n).
Note the 2 here is an approximation but it's fairly close for sample sizes that are larger than 10. Notice this means that:
It's a range! It has a minimum and maximum.
For normally distributed data, the range is centered around the mean estimate! This is an easy way to check your math!
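As a sketch of the calculation, the Python function below computes the approximate interval (mean plus or minus 2 standard errors) and checks that it is centered on the mean. The function name confidence_interval_95 and the weights are made up for illustration.

```python
import math
import statistics

def confidence_interval_95(data, multiplier=2):
    """Approximate 95% CI for the mean: Y bar +/- multiplier * s / sqrt(n)."""
    mean = statistics.mean(data)
    standard_error = statistics.stdev(data) / math.sqrt(len(data))
    margin = multiplier * standard_error
    return mean - margin, mean + margin

# Hypothetical beetle weights in milligrams (made up for illustration).
weights = [4.1, 4.3, 4.3, 4.5, 4.8, 3.9, 4.6, 4.2, 4.4, 4.7, 4.0, 4.5]

lower, upper = confidence_interval_95(weights)
midpoint = (lower + upper) / 2

print(f"mean = {statistics.mean(weights):.2f}")
print(f"95% CI: {lower:.2f} to {upper:.2f} (midpoint {midpoint:.2f})")
```

If the midpoint of your interval does not match your mean, something went wrong in the arithmetic.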
We can simulate this idea using information and code provided by UBC. Make sure under topics, you've selected 95% Confidence Intervals for the Mean.
First, enter your data in a new Google Sheet. For a histogram you will only have data in one column!
Next, select your data, insert a chart, and select histogram. Edit the resulting chart labels as needed. Note that if your values are very small (or large), you may need to change the units manually so the x-axis bin labels make sense, because the default format only shows 2 digits after the decimal point. For example, if you enter your bean beetle weights in grams, all bin labels may be .00. If you manually convert these to milligrams (multiply by 1000) the bin labels will be more useful.
Key points to remember
Confidence intervals are a range that surrounds the estimated mean of your data; theory and simulations show that 95% of the time you construct one of these intervals, the true mean of the population will fall in this range.
Any single interval either contains the true mean or it doesn't, so on average, if you made confidence intervals 20 times, the true mean would be outside the interval once (see the simulation for more help!)
Confidence intervals, therefore, must have an upper and lower bound that are equally spaced around your estimate of the mean! This is an easy way to check your calculations.
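The "19 out of 20" idea in the key points above can be checked with a small simulation. This sketch assumes a normally distributed population whose true mean and standard deviation are invented for illustration, builds an interval from each of many samples, and counts how often the interval contains the true mean.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 5.0, 1.0   # hypothetical population parameters
N, TRIALS = 30, 2_000

def interval(sample, multiplier=2):
    """Approximate 95% CI: mean +/- multiplier * s / sqrt(n)."""
    mean = statistics.mean(sample)
    margin = multiplier * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - margin, mean + margin

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    lower, upper = interval(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of {TRIALS} intervals contained the true mean")
```

The result should land close to 95%, though not exactly on it, because the multiplier of 2 is an approximation and the simulation itself is random.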
To produce confidence intervals, first remember what we need:
the mean of the data
the standard deviation of the data
the size (n) of the sample
We can use these respective functions in Sheets to find these values:
AVERAGE
STDEV
COUNT
So to get the lower bound (remember, confidence intervals are a range), update the following formula to use your data:
=AVERAGE(data) - 2*STDEV(data)/SQRT(COUNT(data))
Similarly, the upper bound can be found with
=AVERAGE(data) + 2*STDEV(data)/SQRT(COUNT(data))
Note we can also automate some of this using pivot tables. See Data Summaries in Google Sheets for more information!
You'll now apply these ideas about describing populations by sampling a bean beetle population. Make sure you are comfortable with the ideas of sampling, describing a sample, and confidence intervals before continuing with this exercise. If you need more help, check out the Confidence Intervals! review under the resources tab.
For this lab we'll collect samples of bean beetles to determine the distribution of weights. You can read more about bean beetles here. NOTE: This lab can be used alone or combined with the Inducing Evolution in Bean Beetles lab and/or the Mark-Recapture lab; see here for more details.
Collect a bean beetle culture (petri dish filled with mung beans and adult beetles) for your group.
Mark the culture with your group's name so you can identify it later if needed.
Place the culture on a white piece of paper (to aid in collecting any jumping beetles) near a balance.
Open the lid and carefully remove a beetle using a brush.
Weigh the beetle on the analytical balance. Remember to place a piece of paper on the balance and tare it before weighing the beetle.
Once weighed, place the beetle in a new petri dish (with a lid).
Use a dissecting scope (if needed) to determine the sex of the beetle. See page 8 of the Bean Beetle Handbook for help in determining the sex of beetles.
Repeat for at least 50 beetles.
If time and culture permit, measure all beetles in the culture.
Once completed, all beetles may be returned to the original culture.
Now, use the data you collected to see how sample size impacts your estimates.
Create a histogram of your first 10 data points and calculate the average mass and a 95% confidence interval for the mean (see the Google Sheets instructions in this handout).
Repeat this exercise for the first 25 and all 50 data points.
Compare your estimated average masses and confidence intervals for each sample size. What differences do you observe?
If you sampled all beetles, calculate the actual average weight. Do your confidence intervals capture this true value?
If time permits (or following your instructor's note), construct histograms and confidence intervals for male and female mass separately. Do these figures or intervals overlap? What does that mean?
Review all your calculations to make sure you are comfortable with the material.
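When you compare your three sample sizes, it may help to know what the theory predicts. The sketch below holds the sample standard deviation fixed at a made-up value and shows how the width of the interval changes with sample size and with the level of confidence. It uses multipliers from the normal distribution (about 1.96 for 95%) rather than the rounded value of 2, and the function names are hypothetical. In your own data the standard deviation will also change a little as you add beetles, so your widths will not follow this pattern exactly.

```python
import math
from statistics import NormalDist

def multiplier_for(confidence):
    """Normal-distribution multiplier for a two-sided interval (about 1.96 for 95%)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def ci_width(s, n, confidence=0.95):
    """Full width of the interval: 2 * multiplier * s / sqrt(n)."""
    return 2 * multiplier_for(confidence) * s / math.sqrt(n)

S = 1.0  # hypothetical sample standard deviation, held fixed for comparison

# Larger samples give narrower intervals.
for n in (10, 25, 50):
    print(f"n={n:2d}: width at 95% confidence = {ci_width(S, n):.2f}")

# Higher confidence levels give wider intervals.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence, n=25: width = {ci_width(S, 25, level):.2f}")
```

Because the width depends on the square root of n, you need about four times as many beetles to cut the width of the interval in half.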