Population Statistics (simulation)

Overview

Sampling an entire population is usually impossible due to expense, limited time, and overall practicality. For that reason ecologists use statistics to take a sample, describe it, and use it to estimate population traits, or parameters. Today in lab we'll consider this idea using a simulation focused on fish sizes.

Objectives

Students should be able to

explain the scientific process and how collecting and analyzing data from populations relates to it
obtain or take a dataset and
- Produce histograms to visualize the distribution of the data
- Produce estimates of the minimum, maximum, and range
- Produce estimates of the mean, median, and mode
- Produce estimates of variance and confidence intervals
- Explain the concept of confidence intervals
- Explain how sample size influences the width of confidence intervals
- Explain how the level of confidence influences the width of confidence intervals

Introduction

Sampling a Population

Before we can describe a population, we have to collect data. Typically, we can't collect data on every individual in a population because there are too many or they're too hard to find or capture. So, we sample the population, which means collecting data on a subset of individuals chosen so that they represent the whole population.

Scientists spend a lot of time on sampling design, or planning how they'll collect data in the most efficient and representative way. We'll spend more time investigating sampling methods later this semester with a mark-recapture activity with frogs. Ultimately, though, scientists want the data they collect to represent the population well, which means making a large and unbiased sample.

In general, the larger the sample size the better. This is because we learn more about a population with every new individual we collect data from. However, we also have to realize that no matter how large or unbiased our sample is, it's still a random draw from the whole population. This means every sample will be slightly different due to inherent sampling error. This error does not refer to a mistake that scientists made in their sampling design! The error here means that, due to random chance, each sample will not be a perfect replicate of any other sample.

Numerical Summaries: Measures of Central Tendency & Measures of Dispersion

Assuming we have data, let's consider how we summarize it to describe a population. Two main types of numerical summaries are often used - measures of central tendency and measures of dispersion (variation).

Measures of central tendency (mean, median, and mode) tell us about the "average" data point but in different ways. To get the mean, if we have n pieces of data in our sample, we can add them together and divide by n (we call this "Y bar" in the formula below, due to the bar over the letter).

Formula for the mean: the estimate is the sum of the data divided by the number of data points.

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

The mean isn't always the best number to describe a sample because it can be influenced by extremely low or extremely high values (outliers). The median, on the other hand, is much less influenced by outliers. The median is the piece of data that falls in the center of the dataset when the data are arranged in order (note: for samples with an even number of data points, the median is the average of the two pieces closest to the center). Since the median only considers the ranking of the data, not their values, an extreme value has little effect on it. Likewise, the mode, or most common data point in your sample, is also not impacted by outliers.
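To see how an outlier affects each measure, here is a minimal Python sketch. The fish lengths are made up for illustration (they do not come from the lab simulation), and the sketch uses only Python's built-in statistics module.

```python
import statistics

# Hypothetical fish lengths; the last value is a deliberate outlier.
lengths = [8, 9, 10, 10, 11, 12, 40]

mean_len = statistics.mean(lengths)      # sum of the values divided by n
median_len = statistics.median(lengths)  # middle value once the data are sorted
mode_len = statistics.mode(lengths)      # most common value

print(round(mean_len, 1), median_len, mode_len)  # 14.3 10 10

# Drop the outlier and recalculate.
trimmed = lengths[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(trimmed_mean, trimmed_median)
```

Removing the one very large fish drops the mean from about 14.3 to 10, while the median and mode stay at 10.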

Compared to measures of central tendency, measures of dispersion tell us how much variation is in our sample. There are two ways we'll calculate variation - variance and standard deviation. Let's start with the standard deviation - it's simply the square root of the variance. Easy enough, right?! So now let's examine the variance. The variance of the data is (roughly) the average squared distance from the mean. In other words, calculate the mean of the data (Y bar), then subtract it from each data point (Yi) and square your answer. If you add up these squared distances and divide by n - 1 (one less than the number of data points), you get the sample variance. The formula for variance is given below:

Formula for variance: the estimate is the sum of the squared distance of each data point from the mean, divided by the number of data points minus 1.

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

Why is it important to square the distance from the mean? To get rid of negative values. Let's explore this. Let's say you sample fish length and your sample mean is 10. If one of your observations (Yi) is 8, its distance from the mean is 8 - 10 = -2. If we simply added up the raw distances, the negative and positive values would cancel each other out (they always sum to zero), so we square each distance to get rid of the negatives.
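The steps for calculating variance can be sketched in Python as follows. The five fish lengths are hypothetical values chosen so that the sample mean is 10, as in the example above.

```python
import math
import statistics

# Hypothetical fish lengths chosen so that the sample mean is 10.
lengths = [8, 9, 10, 11, 12]
n = len(lengths)
y_bar = sum(lengths) / n  # sample mean (Y bar)

# Distance of each data point (Yi) from the mean; some are negative.
deviations = [y - y_bar for y in lengths]
sum_of_deviations = sum(deviations)  # negatives and positives cancel out

# Squaring removes the negative signs before the distances are added up.
squared_deviations = [d ** 2 for d in deviations]
variance = sum(squared_deviations) / (n - 1)
std_dev = math.sqrt(variance)  # back in the same units as the data

print(deviations)         # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(sum_of_deviations)  # 0.0
print(variance)           # 2.5

# The standard library gives the same answers.
print(statistics.variance(lengths), statistics.stdev(lengths))
```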

Once we calculate the variance, remember that the square root of the variance is the standard deviation. Taking the square root puts our measure of spread back into the same units as the data and the mean, so the standard deviation describes the typical distance of a data point from the mean. As a result, the mean and standard deviation are often graphed together when reporting sample statistics.

It's also important to note that, for data sets that are "bell-shaped" (normally distributed), about 95% of the observations fall within 2 standard deviations of the mean. Thus, if an individual falls more than 2 standard deviations away from the mean, it belongs to the most extreme 5% or so of the population and may be considered an outlier. Individuals more than 3 standard deviations from the mean are rarer still (less than 1% of the population).
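This rule of thumb for bell-shaped data can be checked numerically. The sketch below assumes a hypothetical normally distributed population with a mean of 100 and a standard deviation of 20, and uses NormalDist from Python's statistics module (Python 3.8 or later).

```python
from statistics import NormalDist

# Hypothetical bell-shaped population: mean 100, standard deviation 20.
population = NormalDist(mu=100, sigma=20)

def share_within(k):
    """Fraction of the population within k standard deviations of the mean."""
    low = population.mean - k * population.stdev
    high = population.mean + k * population.stdev
    return population.cdf(high) - population.cdf(low)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

About 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.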

Visual Summaries: Histograms

You can graphically explore these concepts by developing a histogram. A histogram is a graph that puts all the data into a designated number of groups, or bins, and then counts how many data points are in each bin. These graphical summaries make it easy to spot outliers (points far from the mean), bimodal data (data that have 2 distinct peaks, or modes), and the skew of the data (whether one side of the distribution has a longer tail of values than the other). You can see instructions for making a histogram in Google Sheets toward the bottom of the summaries help page. See the lecture slides (further down this webpage) for examples of histogram shapes and how to interpret histograms.
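If you are curious what a spreadsheet does behind the scenes when it builds a histogram, the binning step can be sketched in a few lines of Python. This is only an illustration: the function name bin_counts and the fish lengths are made up, and it assumes equal-width bins and data that are not all the same value.

```python
def bin_counts(data, n_bins):
    """Split the data range into n_bins equal-width bins and count each bin."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins  # assumes high > low
    counts = [0] * n_bins
    for value in data:
        # The largest value would land one past the last bin, so clamp it.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return low, width, counts

# Hypothetical fish lengths.
lengths = [62, 75, 81, 88, 93, 97, 101, 104, 110, 118, 125, 141]
low, width, counts = bin_counts(lengths, n_bins=4)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:6.1f} to {start + width:6.1f} | {'#' * count}")
```

Changing n_bins changes how smooth or jagged the histogram looks, which is why the same data can appear to have different shapes with different bin settings.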

In today's lab activity, we'll explore numerical and visual summaries of samples using a simulation of fish capture. We'll see how histogram shape changes as we change the values (parameters) for sample size, population mean, and standard deviation. This helpful tool uses information and code provided by scientists at the University of British Columbia (make sure you don't confuse this tool, "Sampling from a normally distributed population", with the other tool further below, "Confidence intervals for the mean").

Making Inferences About the Population from the Sample

Although we can describe the sample as noted above, we really want to know about the population. Fortunately, statistical theory has been developed to help us do this.

When we calculate the mean of a sample, it represents our best estimate of what's going on in the entire population. But how well does our sample represent the population? Generally, the less dispersion we have in our sample (i.e., lower variance and standard deviation), the better we should feel about how accurately our sample represents the entire population. Statistics allow us to quantify how confident we are in our estimate of population information.

We know each time we sample the population we have some random sampling error. As a result, the mean and standard deviation of sample 1 are going to be slightly different from the mean and standard deviation of sample 2, and so on. Let's say we sampled the population a total of 50 times. We could make a histogram of the means of all 50 samples, and we would see an approximately normal distribution - this pattern is described by the Central Limit Theorem. We would also be able to calculate the dispersion of our 50 sample means, but instead of calling it variance or standard deviation, we call it standard error because it's a measure of dispersion due to sampling error.
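The thought experiment of sampling the population 50 times can be tried out in Python. This is a rough sketch, not the code behind the lab simulator: it assumes a normally distributed population with a mean of 100 and a standard deviation of 20, samples of 10 fish, and a fixed random seed so the results are repeatable.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the example is repeatable
MU, SIGMA = 100, 20     # assumed population mean and standard deviation
N = 10                  # fish per sample

def sample_mean(n):
    """Draw n fish lengths from the population and return their mean."""
    return statistics.mean([rng.gauss(MU, SIGMA) for _ in range(n)])

# Take 50 separate samples and keep only the mean of each one.
sample_means = [sample_mean(N) for _ in range(50)]

mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 1))    # close to MU
print(round(spread_of_means, 1))  # noticeably smaller than SIGMA
```

The sample means cluster around the population mean, and they vary much less than individual fish do. That smaller spread is what the standard error describes.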

In reality, scientists rarely have the time or funding to take multiple samples of a population. Fortunately, we can calculate the standard error using only the standard deviation of one sample:

Formula for standard error: the estimate is the standard deviation divided by the square root of the sample size.

$$SE_{\bar{Y}} = \frac{s}{\sqrt{n}}$$

Here, the standard error of the mean (Y bar) is equal to the standard deviation (s) divided by the square root of the sample size (n). This equation allows us to describe the dispersion of means we would see if we sampled a population multiple times.
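As a quick worked example, the standard error can be computed from a single sample like this. The five fish lengths are hypothetical.

```python
import math
import statistics

lengths = [90, 95, 100, 105, 110]  # hypothetical sample of fish lengths
n = len(lengths)

s = statistics.stdev(lengths)      # sample standard deviation (divides by n - 1)
standard_error = s / math.sqrt(n)  # standard deviation over the square root of n

print(round(s, 2), round(standard_error, 2))  # 7.91 3.54
```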

Recall our previous statement that with a normal distribution about 95% of observations fall within 2 standard deviations of the mean. This same rule applies to the means of our samples, and it is the basis of the 95% Confidence Interval. In other words, using our previous example, if we took 50 samples, we would expect about 95% of the sample means to fall within 2 standard errors of the population mean.

The 95% confidence interval for normally distributed data is then defined as:

Formula for the 95% confidence interval: the estimate is the mean plus or minus two times the standard error.

$$\bar{Y} \pm 2 \times SE_{\bar{Y}}$$
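Here is a small sketch of that calculation in Python. The helper name ci95 and the summary numbers (a mean of 100, a standard deviation of 20, and 25 fish) are made up for illustration.

```python
import math

def ci95(mean, s, n):
    """Approximate 95% confidence interval: mean plus or minus 2 standard errors."""
    standard_error = s / math.sqrt(n)
    margin = 2 * standard_error
    return mean - margin, mean + margin

# Hypothetical summary of one sample: mean 100, standard deviation 20, 25 fish.
low, high = ci95(mean=100, s=20, n=25)
print(low, high)  # 92.0 108.0
```

Here the standard error is 20 / 5 = 4, so the interval runs from 100 - 8 to 100 + 8.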

Ultimately we can use the 95% confidence interval to say we're 95% confident that the population mean lies within a range of numbers calculated from our sample. The narrower this range, the more precise our estimate of the population mean.

Use the simulation tool below to visualize these concepts and their links with sampling populations. Be sure not to confuse this tool, "Confidence intervals for the mean", with the one above, "Sampling from a normally distributed population".

It can be challenging to grasp these concepts your first, or even second, time through. Keep working at it, ask questions, and read outside resources including the Confidence Intervals! notes under the Resources tab. You can also get help calculating them on the summaries help page.

Population Statistics (virtual)

Methods

Virtual Fish Sampling

You'll now apply these ideas about describing populations by running a virtual activity where you sample a fish population. Make sure you are comfortable with the ideas of sampling, describing a sample, and confidence intervals before continuing with this exercise. If you need more help, check out the Confidence Intervals! review under the Resources tab.

Simulation 1: Mean & variation

Click here to open the virtual fish sampling simulator in a new window. Or, go to this page and select “Sample means with a normal distribution”. Click on the “Tutorial” button, read the text prompts they give and move to the next prompt by clicking “Next”. Once you've completed the tutorial, set the model parameters to the following values:

N = 10
μ = 100
σ = 20

Sample one fish and record its length into a blank spreadsheet. Sample a second fish and record its length. Repeat this process for a total of 10 fish.

1. What is the mean value of your sample in this particular simulation?

2. What is the median value of your sample?

3. What is the mode of your sample?

4. What type of distribution do the data in your simulation have? Produce a histogram from your data and include it here. Use the lecture slides to help you decide which distribution type most closely matches your data.

5. After having sampled 10 fish and observed the distribution of the data, click “Means for Many Samples”. You should see the lower graph fill up - this is a secondary sample of our initial samples. Click on this button multiple times. What distribution do these data take? Is this the same as or different from your first distribution in Question 4? Is the amount of variation in the secondary sample the same as in the first sample?

6. Continue to follow the prompts. What happens when you increase the sample size? Why does this happen?

7. Reset the fish sampling visualizer by clicking on "Return to start" located at the bottom of the simulator. Enter the following model parameters:

N = 100
μ = 100
σ = 20

Go ahead and make a “complete sample of 100” and “show sampling distribution” to see the smoothed distribution line. Now, click on μ and drag it back and forth. Does the position of the distribution change? What about the shape of the distribution? Briefly describe what’s going on here.

8. Set the same starting values as question 1. Now click on σ and drag it back and forth, ranging from 5 to 40. Watch what happens to the distribution. Does the position of the distribution change? What about the shape of the distribution? Briefly describe what’s going on here.

9. Time to role play. Let’s say you own a fishing company and your goal is to catch the highest number of the largest possible fish. Compare scenarios A & B below. One scenario isn’t necessarily better for your business than the other, just different strategies for success with different outcomes for your business.

Scenario A:
N = 100
μ = 200
σ = 10

Scenario B:
N = 100
μ = 175
σ = 40

Why would A benefit your business over B?

Why would B benefit your business over A?

Simulation 2: 95% confidence intervals

Now that you've sampled the mean and variation of the simulated fish population, let's explore the 95% confidence intervals around the estimated mean fish size. Click here to open the second virtual fish sampling simulator in a new window. Or, go to this page and select “Confidence intervals of the mean”. Follow the prompts on the second tutorial. After you have run 300 “Repeated Samples” simulations, pause sampling (if you increase the sampling speed this should only take about 1 minute). Look in the top right corner at the number of “Successes” (times that the confidence interval included the population mean), “Failures” (times that the confidence interval did not include the population mean), and “Success Rate” (percentage of confidence intervals that included the population mean).

10. Record those numbers here.

Number of Successes:
Number of Failures:
Success Rate:
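If you'd like to see the logic behind the success and failure counts, the repeated-sampling experiment can be sketched in Python. This is not the simulator's actual code. It assumes a normally distributed population with a mean of 100 and a standard deviation of 20, samples of 10 fish, and intervals built with the "mean plus or minus 2 standard errors" rule from the Introduction. With samples this small, that rule may give a success rate a little below 95%.

```python
import math
import random
import statistics

rng = random.Random(42)     # fixed seed so the example is repeatable
MU, SIGMA, N = 100, 20, 10  # assumed population parameters and sample size
REPEATS = 300

successes = 0
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(N)
    low = mean - 2 * standard_error
    high = mean + 2 * standard_error
    if low <= MU <= high:   # this interval captured the population mean
        successes += 1

failures = REPEATS - successes
success_rate = successes / REPEATS
print(successes, failures, round(success_rate, 3))
```

Each pass through the loop is one "repeated sample": a new set of 10 fish, a new mean, and a new interval that either does or does not contain the true population mean.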

11. Now try using 99% confidence intervals. What happens to the ranges when you switch to 99% confidence intervals? Do they get larger or smaller? Why?

If we decrease our confidence, we end up with a smaller interval. Think about it this way - you are 100% sure the mean size of fish in the population is between 0 and infinity, but that's not a very useful guess. We have less confidence that the population mean is between any two specific numbers, and as those numbers get closer together our confidence level continues to decrease.
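The link between confidence level and interval width can be illustrated with a short Python sketch. It assumes the normal curve applies and uses a made-up standard error of 5. The helper name multiplier is hypothetical, and small samples would call for slightly larger multipliers than the ones shown here.

```python
from statistics import NormalDist

def multiplier(confidence):
    """Standard errors needed on each side of the mean for a confidence level."""
    tail = (1 - confidence) / 2  # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)

standard_error = 5  # hypothetical standard error of the mean

for confidence in (0.90, 0.95, 0.99):
    m = multiplier(confidence)
    width = 2 * m * standard_error  # full width of the interval
    print(confidence, round(m, 2), round(width, 1))
```

The multiplier for 95% confidence is about 1.96 (the "2" used in the Introduction is a rounded version), and it grows to about 2.58 for 99% confidence, so the interval gets wider as the confidence level goes up.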

12. Increase the sample size by dragging the “n 10” button to the right. Are the confidence intervals wider or narrower with n = 40 than with n = 10? Why do you think that is?

13. Finally, decrease the standard deviation - did your confidence intervals get wider or narrower? Why?
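Once you have answered questions 12 and 13 from the simulator, you can check your reasoning against the formulas with a sketch like the one below. It assumes the "mean plus or minus 2 standard errors" rule from the Introduction. The helper name ci_width and the values of s and n are chosen only for illustration.

```python
import math

def ci_width(s, n):
    """Full width of an approximate 95% confidence interval (mean +/- 2 SE)."""
    standard_error = s / math.sqrt(n)
    return 2 * 2 * standard_error

# Compare a baseline, a larger sample, and a smaller standard deviation.
for s, n in [(20, 10), (20, 40), (10, 10)]:
    print(f"s = {s}, n = {n}: width = {ci_width(s, n):.1f}")
```

Try other values of s and n and compare the widths with what you saw in the simulator.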