Confidence Intervals

Quantifying the sampling variability of a statistic.

class date

March 8, 2023

Discuss Reading Questions PDF


The process of generalizing from a statistic of a sample to a parameter of a population is known as statistical inference. The parameter of interest could be a mean, median, proportion, correlation coefficient, the coefficient of a linear model . . . the list goes on. In the scenario that unfolded in Pimentel Hall, the parameter was the mean year of the 527 students in the class. The process of estimating that parameter by calculating the sample mean of the 18 students who decided to sit in the front row that day induces a sampling distribution.

pop_eager <- data.frame(year = rep(c(1, 2, 3, 4),
                                   times = c(245, 210, 47, 25)),
                        eagerness = rep(c(10, 6, 3, 1),
                                        times = c(245, 210, 47, 25)))
set.seed(55)
pop_eager <- pop_eager %>%
  slice_sample(n = nrow(pop_eager))

mu <- mean(pop_eager$year)

p_pop <- pop_eager %>%
  ggplot(aes(x = year)) +
  geom_bar(fill = "purple") +
  labs(title = "Population Distribution",
       y = "") +
  annotate("point", x = mu, y = -20, shape = 17, size = 5, color = "goldenrod2") +
  theme_gray(base_size = 10) +
  theme(plot.title = element_text(size=10))

library(infer)
samp_1 <- pop_eager %>%
  slice_sample(n = 18,
                   replace = FALSE,
                   weight_by = eagerness)

many_samps <- samp_1 %>%
  mutate(Sample = 1)

for (i in 2:500) {
  many_samps <- pop_eager %>%
    slice_sample(n = 18,
                 replace = FALSE,
                 weight_by = eagerness) %>%
    mutate(Sample = i ) %>%
    bind_rows(many_samps)
}

many_xbars <- many_samps %>%
  group_by(Sample) %>%
  summarize(xbar = mean(as.numeric(year)))

xlims_from_pop <- ggplot_build(p_pop)$layout$panel_scales_x[[1]]$range$range

p_samp_dist <- many_xbars %>%
  ggplot(aes(x = xbar)) +
  geom_bar(fill = "steelblue") +
  lims(x = c(0, 4)) +
  labs(title = "Sampling Distribution",
       y = "",
       x = expression(paste(bar(x), " (mean year)"))) +
  annotate("point", x = mu, y = -10, shape = 17, size = 5, color = "goldenrod2") +
  theme_gray(base_size = 10) +
  theme(plot.title = element_text(size=10)) +
  lims(x = xlims_from_pop)
p_samp_dist

This sampling distribution captures the two sources of error that creep in while generalizing. The horizontal offset from the middle of the sampling distribution to the population parameter (the gold triangle) represents the bias. The spread of the sampling distribution represents the variation. In these lecture notes you’ll how to quantify sampling variability using two common tools.

Standard Error (SE)
The standard deviation of the sampling distribution of a statistic.
Confidence Interval
An interval of two values that represent lower and upper bounds on the statistic that capture most of the sampling distribution.

To focus on the variation, let’s introduce a second example, one in which we will not need to worry about bias.

A Simple Random Sample

Restaurants in San Francisco

Every year, the city of San Francisco’s health department visits all the restaurants in the city and inspects them for food safety. Each restaurant is given an inspection score; these range from 100 (perfectly clean) to 48 (serious potential for contamination). We have these scores from 2016. Let’s build up to the sampling distribution bit by bit.

The Population Distribution

Our population consists of the restaurants in San Francisco. Since the data are published online for all restaurants, we have a census1 of scores for every restaurant in the city.

load(url("https://www.dropbox.com/s/nbgw1uzj5ccwce2/fs_scores.rda?dl=1"))

pop_dist <- ggplot(as.data.frame(fs_scores), aes(x=fs_scores, y=..density..)) + 
      geom_histogram(breaks=seq(47.5, 100.5, by = 1), 
                 color="black", fill="gray") + 
      xlab("Food Safety Scores") +
      ylab("Proportion") +
      ggtitle("Population Distribution")

pop_dist

The population distribution is skewed left with a long left tail. The highest possible score is 100. It appears that even scores are more popular than odd scores for scores in the 90s; in fact there are no scores of 99, 97, and 95.

We can calculate two parameters of this population:

  • The population mean is 87.6.
  • The population SD (aka the box SD) is 8.9.

The Empirical Distribution

Although we have data on all of the restaurants in the city, imagine that you’re an inspector who has visited a simple random sample of 100 restaurants. That is, you draw 100 times without replacement from the population, with each unit equally likely to be selected. This leads to a representative sample that will have no selection bias.

The distribution of this sample (an empirical distribution) looks like:

set.seed(103122)
samp_scores <- sample(fs_scores, 100)

emp_dist <- ggplot(as.data.frame(samp_scores), aes(x=samp_scores)) + 
      geom_histogram(breaks=seq(47.5, 100.5, by = 1), 
                 color="black", fill="gray") + 
      xlab("Food Safety Scores") +
      ylab("Proportion") +
      ggtitle("Empirical Dist. (Sample 1)")

emp_dist

The sample statistics here are:

  • The sample mean is 86.27.
  • The sample SD is 9.9.

Observe that the empirical distribution resembles the population distribution because we are using a sampling method without with selection bias. It’s not a perfect match but the shape is similar. The sample average and the sample SD are also close to but not the same as the population average and SD.

The Sampling Distribution

If you compared your sample to that of another inspector who took visited 100 restaurants, their sample would not be identical to yours, but it would still resemble the population distribution, and its mean and SD would be close to those of all the restaurants in the city.

The distribution of the possible values of the sample mean of a simple random sample of 100 restaurants is the sampling distribution of the mean (of the sample). We can use it to, for example, find the chance that the sample mean will be over 88, or the chance that the sample mean will be between 85 and 95.

We use simulation to approximate this distribution. We repeat 100,000 times the process of drawing a simple random sample of 100 restaurants. The full distribution looks like:

set.seed(10312022)
samp_means <- replicate(100000, mean(sample(fs_scores, 100)))

sampling_dist <- ggplot(as.data.frame(samp_means), 
                        aes(x=samp_means, y=..density..)) + 
      geom_histogram(bins = 45, 
                 color="black", fill="gray") + 
      xlab("Average Food Safety Scores") +
      ylab("Proportion") +
      ggtitle("Sampling Distribution")

sampling_dist

We can consider numerical summaries of this distribution:

  • The sampling distribution mean is 87.6.
  • The sampling distribution SD, which is called the Standard Error, is 0.9. This convention of using a different name for the SD for the distribution of a statistic helps keep straight which kind of standard deviation we’re talking about.

Observe that the sampling distribution of the sample mean doesn’t look anything like the population or sample. Instead, it’s roughly symmetric in shape with a center that matches the population mean, and a small SE. The size of the SE reflects the fact that the sample mean tends to be quite close to the population.

Again, the sampling distribution provides the distribution for the possible values of the sample mean. From this distribution, we find that the chance the sample mean is over 88 is about 0.33, and the chance the sample mean is between 85 and 95 is roughly, 1.

Putting the Three Panels Together

Let’s look at these three aspects of this process side-by-side.

pop_dist + emp_dist + sampling_dist

Population Empirical Sampling
Shape left skew left skew bell-shaped / normal
Mean 87.6 86.27 87.6
SD 8.9 9.9 0.89

Observe that:

  1. The population mean and the expected sample mean are roughly the same.
  2. The SD of the population and the SE of the sample averages are related in the following way2:

\[SE(sample~mean) \approx SD(pop)/\sqrt{n}\]

  1. The histogram of the sample averages is not skewed like the histogram of the box, on the contrary, it is symmetric about the middle and bell-shaped, like the normal curve.

  2. The histogram of our sample of 100 resembles the population histogram.

  3. Since 100 is a pretty large sample,

\[ \begin{aligned} mean(pop) &\approx mean(sample) \\ SD(pop) &\approx SD(sample) \\ \end{aligned} \]

Now we’re ready to make inferences in a setting where we don’t know the population.

Inference for a Population Average

Drawing on our understanding of the thought-experiment, we ask:

What happens when you don’t see the population, you just have your sample, and you want to make an inference about the population?

We have serious gaps in our procedure for learning about the sampling distribution!

To start, we know we can use the sample average to infer the population average. This is called a point estimate for the population parameter.

But can we do better than that? Can we bring in more of the information that we have learned from the thought-experiment? For example, can we accompany our point estimate with a sense of its accuracy? Ideally, this would be the SE of the sample mean. Unfortunately, we don’t know the SE because it depends on the population SD. So now what do we do?

Standard Error

The thought-experiment tells us that the sample SD is close to the population SD (when you have a SRS). So we can substitute the sample SD into the formula for the SE.

\[ SE(sample~mean) \approx \frac{SD(sample)}{\sqrt{n}}\]

When presenting our findings, you might say, that based on a SRS of 100 restaurants in San Francisco, the average food safety score is estimated to be 86 with a standard error of about 1.

Suppose someone took a sample of 25 restaurants and provided an estimate of the average food safety score. Is that only 1/4 as accurate because the sample is 1/4 the size of ours?

Suppose someone took a sample of 100 restaurants in New York City where there are 50,000 restaurants (this is a made up number). Is their estimate only 1/10 a accurate because the number of units in the population is 10 times yours?

We can use the formula for the SE to answer these questions. In the table below, we have calculated these SEs for a generic value of \(SD(pop)\) and various choices of the population size and sample size.

Population Size Sample Size
25 100 400
500 \(SE = SD(pop)/5\) \(SE = SD(pop)/10\) \(SE = SD(pop)/20\)
5,000 \(SE = SD(pop)/5\) \(SE = SD(pop)/10\) \(SE = SD(pop)/20\)
50,000 \(SE = SD(pop)/5\) \(SE = SD(pop)/ 10\) \(SE = SD(pop)/ 20\)

What do you notice about the relationship between sample size and population size and SE?

  • The absolute size of the population doesn’t enter into the accuracy of the estimate, as long as the sample size is small relative to the population.
  • A sample of 400 is twice as accurate as a sample of 100, which in turn is twice as accurate as a sample of 25 (assuming the population is relatively much larger than the sample). The accuracy improves according to the square root of the sample size.

Confidence Intervals

Confidence intervals bring in more information from the thought-experiment. The confidence interval provides an interval estimate, instead of a point estimate, that is based on spread of the sampling distribution of the statistic.

We have seen that the sampling distribution is takes a familiar shape: that of the normal curve (also called the bell curve)3. Therefore we can fill in some of the holes in the thought-experiment with approximations.

The sketch that we have added in the middle panel shows a sampling distribution that looks like the bell curve. To understand how this will be helpful, let’s make a quick digression to talk about the normal distribution.

The Normal Distribution

The normal distribution describes a continuous random variable that has a density function with a familiar bell shape4.

data.frame(x = seq(-3, 3, .01)) %>%
  mutate(y = dnorm(x)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  labs(x = "x",
       y = "f(x)") +
  scale_y_continuous(breaks = NULL) +
  theme_bw()

If a random variable \(X\) follows a normal distribution, we write

\[X \sim \textrm{N}(\mu, \sigma)\]

where \(\mu\) is the mean of the distribution and \(\sigma\) is its standard deviation. The particular normal distribution shown above is the standard normal distribution, where \(\mu = 0\) and \(\sigma = 1)\).

We can calculate the probability of any event related to \(X\) by finding the area under the curve corresponding to that event. That includes the probability that \(X\) falls within a particular interval. In the table below, we record the probabilities of three such intervals.

Interval Area under the normal curve
Between -1 and 1 0.68
Between -1.96 and 1.96 0.95
Between -2.58 and 2.58 0.99

So if we know a particular distribution is similar in shape to the normal distribution, we’re able to calculate the probabilities that the random variable falls within a particular interval. With this observation in hand, let’s return to the task of making an interval estimate.

Normal Confidence Intervals

When the sampling distribution is roughly normal in shape, then we can construct an interval that expresses exactly how much sampling variability there is. Using our single sample of data and the properties of the normal distribution, we can be 95% confident that the population parameter is within the following interval.

\[[\bar{x} - 1.96 SE, \bar{x} + 1.96SE]\]

We call this a confidence interval because 95% of the time, an interval constructed in this way will contain the population mean. For the particular interval that you have created, you don’t know if it contains the population mean or not. This is why we use the term confidence to describe it instead of probability. Probability comes into play when taking the sample, after that our confidence interval is a known observed value with nothing left to chance.

Confidence not Probability

To be more precise about what is meant by “confidence”, let’s take 100 samples of size 25 from the restaurant scores, and calculate a 95% confidence interval for each of our 100 samples. How many of these 100 confidence intervals do you think will include the population mean?

Let’s simulate it!

set.seed(10312022)
samp25_means <- replicate(100, mean(sample(fs_scores, 25)))

lower <- samp25_means - 1.96 * sd(fs_scores)/5
upper <- samp25_means + 1.96 * sd(fs_scores)/5
trial <- 1:100
cover <- (mean(fs_scores) >= lower) & (mean(fs_scores) <= upper)
CIs <- data.frame(trial, lower, upper, cover)

ci100 <- ggplot(CIs, aes(y = trial)) +
  geom_segment(aes(x=lower, y=trial, xend=upper, yend=trial, color= cover), 
               show.legend=FALSE) +
  annotate("segment", x=mean(fs_scores), xend=mean(fs_scores),
                y=0, yend=101, color="black") +
  labs(x=expression(bar(x)), y = "Iteration") +
  theme_minimal()

ci100

We have found that 95 of the 100 confidence intervals cover the population parameter. This is by design. If we simulate another 100 times, we may get a different number, but it is likely to be close to 95.

Inference for a Population Proportion

To gain practice with making confidence intervals, we turn to another example. This time we sample from a populations where the values are 0s and 1s (0-1 box). You will see that the process is very much the same, although there are a few simplifications that arise due to the nature of the box.

Suppose we only want to eat at restaurants with food safety scores above 95. Let’s make a confidence interval for the proportion of restaurants in San Francisco with scores that are “excellent” (scores over 95). To tackle this problem, we can modify our box. Since we need only to keep track of whether a score is excellent, we can replace the scores on the tickets with 0s and 1s, where 1 indicates an excellent score. Of the 5766 restaurants in San Francisco, 1240 are excellent. So our box has 5766 tickets in it, and 1240 are marked 1, and 4526 are marked 0. This time let’s take a SRS of 25.

The thought-experiment appears as

hi_score_dist <- ggplot(as.data.frame(hi_scores), 
                        aes(x=hi_scores, y=..density..)) + 
      geom_histogram(breaks=seq(-0.5, 1.5, by = 1), 
                 color="black", fill="gray") + 
      xlab("Excellent score?") +
      ylab("Proportion") +
      ggtitle("Population")

hi_samp_dist <- ggplot(as.data.frame(hi_score_samp), aes(x=hi_score_samp, y=..density..)) + 
      geom_histogram(breaks=seq(-0.5, 1.5, by = 1), 
                 color="black", fill="gray") + 
      xlab("Excellent scores?") +
      ylab("Proportion") +
      ggtitle("Empirical Distribution")

hi_sampling_dist <- ggplot(as.data.frame(hi_score_means), 
                        aes(x=hi_score_means, y=..density..)) + 
      geom_histogram(bins = 20, 
                 color="black", fill="gray") + 
      xlab("Proportion excellent") +
      ylab("Proportion") +
      ggtitle("Sampling Distribution")

hi_score_dist + hi_samp_dist + hi_sampling_dist

Population Empirical Sampling
Shape left skew left skew bell-shaped / normal
Mean 0.22 0.2 0.22
SD 0.4 0.4 0.08

In the special case of a 0-1 box:

  • The population average is the proportion of 1s in the box, let’s call this parameter \(p\).
  • The taking a draw from the population distribution taking a draw from a Bernoulli random variable, so \(SD(pop) = \sqrt{p(1-p)}\).
  • The sampling distribution has mean \(p\).
  • The sampling proportion, \(\hat{p}\), is similar to \(p\).
  • The SE of the sample proportion5 is approximately \(SE(\hat{p}) = \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\).

With an equation to estimate \(SE\) from our data in hand, we can form a 95% confidence interval.

\[\left[\hat{p} - 1.96 \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}, \hat{p} + 1.96 \frac{\sqrt{\hat{p}(1-\hat{p})}}{\sqrt{n}}\right]\]

Summary

In these notes, we have restricted ourselves to the simple random sample, where the only source of error that we’re concerned with is sampling variability. We outlined two tools for estimating that variability: the standard error (SE) and the confidence interval.

We saw how the size of the sample impacts the standard error of the estimate. The larger the sample, the more accurate our estimates are and in particular the accuracy improves according to \(1/\sqrt{n}\). We also found that the size of the population doesn’t impact the accuracy, as long as the sample is small compared to the population.

We made confidence intervals for population averages and proportions using the normal distribution. This approach can be extended to other properties of a population, such as the median of a population, or the coefficient in a regression equation.

The confidence intervals that we have made are approximate in the following sense:

  • We’re approximating the shape of the unknown sampling distribution with the normal curve.
  • The SD of the sample is used in place of the SD of the population in calculating the SE of the statistic.

There are times when we are unwilling to make the assumption of normality. This is the topic of the next set of notes.

Materials from class

Slides

Footnotes

  1. The terms census refers to a setting where you have access to the entire population.↩︎

  2. This approximation becomes equality for a random sample with replacement. When we have a SRS, the exact formula is \(SE(sample~mean) = \sqrt{\frac{N-n}{N-1}} SD(pop)/\sqrt{n}\).

    This additional term, called the finite population correction factor, adjusts for the fact that we are drawing without replacement. Here \(N\) is the number of tickets in the box (the size of the population) and \(n\) is the number of tickets drawn from the box (the size of the sample).

    To help make sense of this correction factor, think about the following two cases:

    • Draw \(N\) tickets from the box (that is, \(n = N\)).
    • Draw only one ticket from the box.

    What happens to the SE in these two extreme cases?

    In the first case, you will always see the entire population if you are drawing without replacement. So, the sample mean will exactly match the population mean. The sampling distribution has no variation, so \(SE = 0\).

    In the second case, since you take only one draw from the box, it doesn’t matter if you replace it or not. So the SE for a SRS should match the SE when sampling with replacement in this special case. In settings when \(N\) is large relative to \(n\), it effectively behaves as if you are sampling with replacement.↩︎

  3. This is not always the case. We’ll come back to this point later.↩︎

  4. For a normally distributed random variable, \(f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\). Read more on Wikipedia: https://en.wikipedia.org/wiki/Normal_distribution↩︎

  5. This calculation results from casting the total number of 1’s in a sample of size \(n\) as a binomial random variable with success probability \(p\). Call that random variable \(Y\). The variance of a binomial random variable is \(Var(Y) = np(1-p)\). Observing that sample proportion can be considered a binomial count divided by \(n\), and applying the properties of variance, we can find the variance of \(\hat{p}\) as,

    \[\begin{eqnarray} Var(\hat{p}) &= Var(\frac{1}{n}Y) \\ &= \frac{1}{n^2}Var(Y) \\ &= \frac{1}{n^2}np(1-p) \\ &= \frac{p(1-p)}{n} \end{eqnarray}\]

    So the standard error can be calculated as:

    \[\begin{eqnarray} SE(\hat{p}) &= \sqrt{Var(\hat{p})} = \sqrt{\frac{p(1-p)}{n}} \end{eqnarray}\]

    When estimating the SE from data, we plug in \(\hat{p}\) for \(p\).↩︎