Confidence interval introduction

From the Big Picture of Statistics, we know that our goal in statistical inference is to infer from the sample data some conclusion about the wider population the sample represents. In the first section, “Distribution of Sample Proportions,” we investigated the obvious fact that random samples vary. Because different samples may lead to different conclusions, we cannot be certain that our conclusions are correct. Statistical inference uses the language of probability to say how trustworthy our conclusions are.

We learn two types of inference: confidence intervals and hypothesis tests. We construct a confidence interval when our goal is to estimate a population parameter (or a difference between population parameters). We conduct a hypothesis test when our goal is to test a claim about a population parameter (or a difference between population parameters). Both types of inference are based on the sampling distribution of sample statistics. For both, we report probabilities that state what would happen if we used the inference method repeatedly.

In this section, we build on the ideas in “Distribution of Sample Proportions” to reason as we do in inference, but we do not do formal inference procedures now. Instead, we focus on the logic of inference. We use categorical data and proportions to investigate the logic of inference. But all of the ideas we discuss here apply to quantitative variables and means.

Confidence Intervals

When our goal is to estimate a population proportion, we select a random sample from the population and use the sample proportion as an estimate. Of course, random samples vary, so we want to include a statement about the amount of error that may be present. Because sample proportions vary in a predictable way, we can also make a probability statement about how confident we are in the process we used to estimate the population proportion.

We can find many examples of confidence intervals reported in the media. Here is an example.

__________________________________________

EXAMPLE 1

The National Sleep Foundation sponsors an annual poll. In 2011, the poll found that “43% of Americans between the ages of 13 and 64 say they rarely or never get a good night’s sleep on weeknights. More than half (60%) say that they experience a sleep problem every night or almost every night (i.e., snoring, waking in the night, waking up too early, or feeling unrefreshed when they get up in the morning)” (as reported at www.sleepfoundation.org).

Are these percentages sample statistics or population parameters? These statistics describe the responses of a sample of Americans.

Let’s focus on the 60% who say they experience a sleep problem every night or almost every night. Does this mean that 60% of all Americans have this same experience? Well, no. This is a sample statistic from a poll. But from this sample, we want to infer what percentage of the population does have sleep problems. Since the percentage with sleep problems will differ from one sample to the next, we need to make a statement about how much error we might expect between a sample percentage and the population percentage.

In the “Poll Methodology and Definitions” section of the article, we find more detailed information about the poll. According to the Sleep Foundation website, “The 2011 Sleep in America® annual poll was conducted for the National Sleep Foundation by WB&A Market Research, using a random sample of 1,508 adults between the ages of 13 and 64. The margin of error is 2.5 percentage points at the 95% confidence level.”

There is a lot of important information here:

  • The sample is random.

  • The sample size is 1,508.

  • The margin of error is 2.5 percentage points.

  • The confidence level is 95%.

From this information, we can construct an interval that we are reasonably confident contains the population proportion.

  • Sample statistic ± margin of error

  • 60% ± 2.5%

  • 57.5% to 62.5%

This interval is an example of a confidence interval. We interpret the interval this way: We are 95% confident that between 57.5% and 62.5% of all Americans between the ages of 13 and 64 experience a sleep problem every night or almost every night.

How confident are we that this interval contains the population proportion? In this case, we are 95% confident. This means that 95% of the time, a random sample of this size will have an error of at most 2.5 percentage points. So 95% of the intervals constructed this way will contain the true population proportion. Another way to say this is that this method accurately estimates the population proportion 95% of the time.
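The arithmetic in this example can be sketched in a few lines of Python. The function names are ours, not the poll's. The second function assumes the usual simple-random-sample approximation (a margin of two standard errors, with standard error sqrt(p(1 − p)/n)); the pollster's actual calculation may differ.

```python
import math

def interval_from_margin(estimate, margin):
    """Return (low, high) for estimate ± margin, in the same units."""
    return (estimate - margin, estimate + margin)

def approx_margin(p_hat, n, multiplier=2.0):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample; multiplier=2 is the rough
    95% rule of thumb, not the pollster's stated method.
    """
    return multiplier * math.sqrt(p_hat * (1 - p_hat) / n)

# Interval reported in the example, in percentages.
low, high = interval_from_margin(60.0, 2.5)
print(f"{low}% to {high}%")

# Rough check of the reported margin from the sample size alone.
margin = approx_margin(0.60, 1508)
print(f"approximate margin: {100 * margin:.1f} percentage points")
```

Under these assumptions, the approximate margin comes out close to the 2.5 percentage points reported for the poll.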

Note: The sample in this poll is a random sample. We can construct a confidence interval only with a random sample.

__________________________________________

EXAMPLE 2

A Gallup poll conducted in November of 2011 asked the following question, “What would you say is the most urgent health problem facing this country at the present time?” The choices were access, cost, obesity, cancer, government interference, or the flu. The responses were access (27%), cost (20%), obesity (14%), cancer (13%), government interference (3%), and the flu (less than 0.5%).

The following is an excerpt from the Survey Methods section. “Results for this Gallup poll are based on telephone interviews conducted Nov. 3-6, 2011, with a random sample of 1,012 adults ages 18 and older, living in all 50 U.S. states and the District of Columbia. For results based on a total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ±4 percentage points.”

Based on this poll, find a 95% confidence interval to estimate the percentage of U.S. adults who feel that access to health care is the most urgent health problem facing this country. Think about how you might interpret your interval in context.

__________________________________________

EXAMPLE 3

A Gallup poll conducted between January and June of 2011 found 21% of Americans saying they smoke.

The following is an excerpt from the Survey Methods section. “Results are based on telephone interviews conducted as part of the Gallup-Healthways Well-Being Index survey Jan. 2 - June 30, 2011, with a random sample of 177,600 adults, aged 18 and older, living in all 50 U.S. states and the District of Columbia, selected using random-digit-dial sampling. For results based on a total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ± 0.2 percentage points.”

Based on this poll, find a 95% confidence interval to estimate the percentage of U.S. adults who smoke. Think about how you might interpret your interval in context.
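One way to see why the two Gallup polls report such different margins of error is to compare their sample sizes. The sketch below uses the worst-case proportion p = 0.5 and a multiplier of 2, both assumptions of ours. The margins Gallup reports may also reflect adjustments that this simple formula ignores, so the results are only expected to be in the same ballpark.

```python
import math

def worst_case_margin(n, multiplier=2.0):
    """Largest two-standard-error margin for a proportion (uses p = 0.5).

    Assumes a simple random sample of size n.
    """
    return multiplier * math.sqrt(0.5 * 0.5 / n)

for label, n in [("Example 2", 1012), ("Example 3", 177_600)]:
    m = worst_case_margin(n)
    print(f"{label}: n = {n}, rough margin = {100 * m:.2f} percentage points")
```

Because the margin shrinks with the square root of the sample size, a sample four times as large gives a margin half as wide.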

__________________________________________

Summary

A sample proportion from a random sample provides a reasonable estimate of the population proportion. We do not expect the sample proportion to be exactly equal to the population proportion, but we expect the population proportion to be somewhat close to the sample proportion. The purpose of confidence intervals is to use the sample proportion to construct an interval of values that we can be reasonably confident contains the true population proportion.

What Is the Connection to the Sampling Distribution?

Sample proportions are estimates for the population proportion, so each sample proportion has error. For an individual sample, we will not know the exact amount of error, so we report a margin of error based on the standard error. Recall that the standard error is the standard deviation of the sampling distribution. We can view the standard error as the typical or average error in the sample proportions. To see how this works, let’s return to a familiar sampling distribution.

Recall our previous investigation of gender in the population of part-time college students. We investigated these questions: What proportion of part-time college students are female? If we predict that the proportion is 0.60, how much error should we expect in our prediction?

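We predicted the population proportion was 0.60 and ran a simulation to examine the variability in sample proportions for samples of 100 part-time college students. Here is the sampling distribution from the simulation.

[Figure: simulated sampling distribution of sample proportions for samples of 100 part-time college students.]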

We see that we can be very confident that most samples of this size will have proportions that differ from 0.60 by at most 2 standard errors. For this simulation, the standard error in sample proportions was about 0.049. About 95% of the samples have an error less than 2(0.049) = 0.098.
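A simulation of this kind can be sketched as follows. This is not the simulation used in the original investigation: the number of simulated samples, the random seed, and the use of the formula sqrt(p(1 − p)/n) for the standard error are our assumptions.

```python
import math
import random

P = 0.60        # predicted population proportion (from the text)
N = 100         # students per sample (from the text)
SAMPLES = 5000  # number of simulated samples (our choice)

def sample_proportion(p, n, rng):
    """Proportion of 'successes' in one random sample of size n."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(1)  # fixed seed so the run is repeatable
props = [sample_proportion(P, N, rng) for _ in range(SAMPLES)]

# Standard error from the formula, then the share of simulated
# sample proportions that land within two standard errors of P.
se_formula = math.sqrt(P * (1 - P) / N)
within = sum(abs(ph - P) <= 2 * se_formula for ph in props) / SAMPLES

print(f"standard error (formula): {se_formula:.3f}")
print(f"share of samples within 2 SE of {P}: {within:.3f}")
```

The formula gives a standard error of about 0.049, and the share of simulated samples within two standard errors should land near 95%.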

If we use two standard errors as the margin of error, we can rewrite the confidence interval.

  • sample statistic ± margin of error

  • sample proportion ± 2(standard errors)

  • sample proportion ± 2(0.049)

  • sample proportion ± 0.098

Different sample proportions give different intervals. For example, if the sample proportion is 0.57, the confidence interval is 0.472 to 0.668. Here are our calculations.

  • sample proportion ± margin of error

  • 0.57 ± 2(0.049)

  • 0.57 ± 0.098

The endpoints of the interval are 0.57 ‑ 0.098 = 0.472 and 0.57 + 0.098 = 0.668. The confidence interval is 0.472 to 0.668.

Since about 95% of the samples have at most 9.8% error, we have a 95% confidence interval. Based on this sample, we say we are 95% confident that the percentage of part-time college students who are female is between 47.2% and 66.8%.
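The calculation above can be repeated for any sample proportion. In this minimal sketch, the function name and the extra sample proportions in the loop are illustrative choices of ours; only 0.57 and the standard error of 0.049 come from the text.

```python
def two_se_interval(p_hat, se):
    """Interval p_hat ± 2 standard errors, rounded to 3 decimals."""
    margin = 2 * se
    return (round(p_hat - margin, 3), round(p_hat + margin, 3))

SE = 0.049  # standard error from the simulation in the text

print(two_se_interval(0.57, SE))   # (0.472, 0.668)

# Other sample proportions (illustrative values) give other intervals.
for p_hat in (0.52, 0.60, 0.66):
    print(p_hat, two_se_interval(p_hat, SE))
```

Every interval has the same width, because the margin of error is fixed at 0.098; only the center moves with the sample proportion.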

__________________________________________

Have you ever wondered what the average number of M&Ms in a bag at the grocery store is? You can use confidence intervals to answer this question.


Suppose you were trying to determine the mean rent of a two-bedroom apartment in your town. You might look in the classified section of the newspaper, write down several rents listed, and average them together. You would have obtained a point estimate of the true mean. If you are trying to determine the percentage of times you make a basket when shooting a basketball, you might count the number of shots you make and divide that by the number of shots you attempted. In this case, you would have obtained a point estimate for the true proportion.

We use sample data to make generalizations about an unknown population. This part of statistics is called inferential statistics. The sample data help us to make an estimate of a population parameter. We realize that the point estimate is most likely not the exact value of the population parameter, but close to it. After calculating point estimates, we construct interval estimates, called confidence intervals.

In this chapter, you will learn to construct and interpret confidence intervals. You will also learn a new distribution, Student’s t-distribution, and how it is used with these intervals. Throughout the chapter, it is important to keep in mind that the confidence interval is a random variable. It is the population parameter that is fixed.



If you worked in the marketing department of an entertainment company, you might be interested in the mean number of songs a consumer downloads a month from iTunes. If so, you could conduct a survey and calculate the sample mean, x̄, and the sample standard deviation, s. You would use x̄ to estimate the population mean and s to estimate the population standard deviation. The sample mean, x̄, is the point estimate for the population mean, μ. The sample standard deviation, s, is the point estimate for the population standard deviation, σ.
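As a small illustration, both point estimates can be computed with Python's standard library. The download counts below are made up for the sketch and do not come from any survey.

```python
import statistics

# Made-up monthly download counts for seven surveyed consumers.
downloads = [0, 2, 3, 1, 4, 2, 2]

x_bar = statistics.mean(downloads)   # point estimate of the population mean
s = statistics.stdev(downloads)      # point estimate of sigma (n - 1 divisor)

print(f"sample mean = {x_bar}, sample standard deviation = {s:.3f}")
```

Note that `statistics.stdev` divides by n − 1, which is the usual definition of the sample standard deviation.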


A confidence interval is another type of estimate but, instead of being just one number, it is an interval of numbers. The interval of numbers is a range of values calculated from a given set of sample data. The confidence interval is likely to include an unknown population parameter.

Suppose, for the iTunes example, we do not know the population mean μ, but we do know that the population standard deviation is σ = 1 and our sample size is n = 100. Then, by the central limit theorem, the standard deviation for the sample mean is

σ/√n = 1/√100 = 0.1.

The empirical rule, which applies to bell-shaped distributions, says that in approximately 95% of the samples, the sample mean, x̄, will be within two standard deviations of the population mean μ. Here, the standard deviation is that of the sample mean, 0.1. For our iTunes example, two standard deviations is (2)(0.1) = 0.2. The sample mean x̄ is likely to be within 0.2 units of μ.


Because x̄ is within 0.2 units of μ, which is unknown, then μ is likely to be within 0.2 units of x̄ in 95% of the samples. The population mean μ is contained in an interval whose lower number is calculated by taking the sample mean and subtracting two standard deviations (2)(0.1) and whose upper number is calculated by taking the sample mean and adding two standard deviations. In other words, μ is between x̄ − 0.2 and x̄ + 0.2 in 95% of all the samples.


For the iTunes example, suppose that a sample produced a sample mean x̄ = 2. Then the unknown population mean μ is between x̄ − 0.2 = 2 − 0.2 = 1.8 and x̄ + 0.2 = 2 + 0.2 = 2.2.

We say that we are 95% confident that the unknown population mean number of songs downloaded from iTunes per month is between 1.8 and 2.2. The 95% confidence interval is (1.8, 2.2).
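The steps of the iTunes example can be written out as a short calculation. The variable names are ours; the values of σ, n, and the sample mean are the ones given in the text, and the multiplier of 2 is the empirical-rule approximation used above.

```python
import math

sigma = 1.0    # known population standard deviation (from the text)
n = 100        # sample size (from the text)
x_bar = 2.0    # sample mean from the example

se = sigma / math.sqrt(n)        # standard deviation of the sample mean
margin = 2 * se                  # two standard deviations of the sample mean
low, high = x_bar - margin, x_bar + margin

print(f"standard deviation of the sample mean: {se}")
print(f"95% confidence interval: ({low:.1f}, {high:.1f})")   # (1.8, 2.2)
```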

The 95% confidence interval implies two possibilities. Either the interval (1.8, 2.2) contains the true mean μ or our sample produced an x̄ that is not within 0.2 units of the true mean μ. The second possibility happens for only 5% of all the samples (100% − 95%).
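The "95% of all the samples" claim can be checked by simulation. The sketch below rests on assumptions of ours: it picks a "true" mean of 2.0 (in practice μ is unknown), draws from a normal population, and uses an arbitrary seed and number of trials.

```python
import random
import statistics

MU = 2.0       # "true" mean, known only because we are simulating
SIGMA = 1.0    # population standard deviation (from the text)
N = 100        # sample size (from the text)
MARGIN = 0.2   # two standard deviations of the sample mean
TRIALS = 2000  # number of simulated samples (our choice)

rng = random.Random(7)  # fixed seed so the run is repeatable
hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    # Does this sample's interval contain the true mean?
    if x_bar - MARGIN <= MU <= x_bar + MARGIN:
        hits += 1

coverage = hits / TRIALS
print(f"intervals containing the true mean: {coverage:.1%}")
```

Each simulated sample produces a different interval, and roughly 95% of them should contain the fixed value of μ. This is the sense in which the interval, not the parameter, is random.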

Remember that a confidence interval is created for an unknown population parameter like the population mean, μ. Confidence intervals for some parameters have the form:

(point estimate – margin of error, point estimate + margin of error)


The margin of error depends on the confidence level or percentage of confidence and the standard error of the mean.
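To illustrate how the confidence level enters, the sketch below computes a margin of error from a standard normal multiplier. This assumes the sample mean is approximately normal and the standard error is known, as in the iTunes example; the function name is ours. The multiplier of 2 used earlier is a rounded version of the 95% value, which is about 1.96.

```python
from statistics import NormalDist

def margin_of_error(confidence, standard_error):
    """Margin of error using a standard normal multiplier.

    Assumes an approximately normal sample mean and a known
    standard error, as in the iTunes example.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * standard_error

# Standard error of 0.1 is the value from the iTunes example.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: margin = {margin_of_error(level, 0.1):.3f}")
```

A higher confidence level requires a larger multiplier, so the interval widens as the confidence level increases.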

When you read newspapers and journals, some reports will use the phrase “margin of error.” Other reports will not use that phrase, but include a confidence interval as the point estimate plus or minus the margin of error. These are two ways of expressing the same concept.

Note

Although the text only covers symmetrical confidence intervals, there are non-symmetrical confidence intervals (for example, a confidence interval for the standard deviation).

References:

  1. https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/introduction-to-statistical-inference-1-of-3/

CC LICENSED CONTENT, SHARED PREVIOUSLY

Concepts in Statistics. Provided by: Open Learning Initiative. Located at: http://oli.cmu.edu. License: CC BY: Attribution


  2. https://courses.lumenlearning.com/introstats1/chapter/introduction-confidence-intervals/
