# From Samples to Populations

The bias and variance of moving from sample to population.

class date

March 6, 2023

Generalization is the process of using a subset of information to draw conclusions about some broader set or phenomenon. It is powerful - it allows us to draw conclusions about things we have not observed - but it is tricky to do well. In these notes, you will learn the sources of error that can creep in when making a generalization.

There are four terms that you will see come up again and again as we discuss generalization. They are familiar terms that have tightly coupled meanings, so we present them together.

Sample
The subset of units that are observed, measured, and analyzed. Commonly referred to as a data set. The size of the sample is indicated by $$n$$.
Population
The set of units from which your sample is drawn. The size of the population is indicated by $$N$$.
Statistic
A numerical summary of a sample. Examples include a sample mean, a sample median, a sample proportion, a sample correlation coefficient, and an estimated coefficient of a linear model.
Population Parameter
A numerical summary of a population. Every statistic of a sample has an analog in the population (population mean, population proportion, etc).

To see how these terms interrelate and to introduce the sources of error that can creep in while generalizing from a sample to a population, let’s look a scenario.

### What year are my students?

On the first day of class, the professor strides into Pimentel Hall to present a lecture to a new crop of students. If you have not yet had the pleasure of having a class in Pimentel, it looks like this.

Pimentel1 is the second largest lecture hall at Cal, with exactly 527 seats. On this first day of class, the room is packed and all 527 of the students registered for Stat 20 are in the hall.

Eager to ensure that the lecture is calibrated to the interest and experience of the students, the professor seeks to learn how many years they have been at Cal. The professor calls on a student sitting in the middle of the front row and asks, “What year are you at Cal?” The student replies, “sophomore”, so the professor write 2 on the board. The professor proceeds to repeat this question to all of the students sitting in the 18 seats at the front row of Pimentel. By the end, there is a data frame on the board with one column and 18 rows, the first three of which read

Year
2
1
1

Ever the statistician, the professor then visualizes the distribution of data by sketching on the board a bar chart and jots down the sample mean, which is 1.77.

library(tidyverse)

pimentel <- data.frame(year = c(2, 1, 1, 1, 2, 2,
4, 1, 1, 3, 3, 1,
2, 2, 2, 1, 2, 1))

pimentel %>%
ggplot(aes(x = year)) +
geom_bar()

How good of an answer is 1.77 to the professors question of, “What year are my students?”

#### Identifying the components

In this setting, the population is the set of all $$N = 527$$ students in Pimentel. That is the set of units (students) that the professor seeks to understand and from which the sample is drawn. The sample is the subset of $$n = 18$$ students who were sitting in the front row and who were asked their year at Cal. The sample mean, $$\bar{x} = 1.77$$, is an example of a statistic. Other statistics that could be calculated from the sample include the median (2) and the mode (1). These statistics be used to estimate their analogs in the population, such as the population mean, $$\mu$$. The value of $$\mu$$ is what the professor seeks to learn, but at this point is still unknown.

Using this terminology, we can ask the question again: how good of an estimate of the mean year of the entire class is the sample mean, 1.77? How much error did we incur when generalizing and where did that error come from?

## Sources of Error

To understand the forms that estimation error can take, consider the analogy of darts thrown at a dart board. The center of the dart board represents the parameter that we are trying to hit and each dart that we throw represents a statistic calculated from a sample. There are two ways in which a dart throw can miss the bullseye. One way is that we could systematically tend to throw above and to the left of the bullseye. This form of error is called bias. Another way that our dart-throwing could miss the bullseye is if we are an erratic thrower and one throw tends to be very different from this next. This form of error is called variation.

If a sample is representative of the population, there is no bias present. That is represented by the top row of bullseyes.

### Types of Statistical Bias

Statistical bias comes in many forms. Here we describe two of the most important types.

Selection bias
When the mechanism used to choose units for the sample tends to select certain units more often than they should.

As an example, a convenience sample chooses the units that are most easily available. Problems can arise when those who are easy to reach differ in important ways from those harder to reach. Another example of selection bias can happen with observational studies and experiments. These studies often rely on volunteers (people who choose to participate), and this self-selection has the potential for bias if the volunteers differ from the target population in important ways.

Measurement bias
When your process of measuring a variable systematically misses the target in one direction.

For example, low humidity can systematically give us incorrectly high measurements for air pollution. In addition, measurement devices can become unstable and drift over time and so produce systematic errors. In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.

Both of these types of bias can lead to situations where the data are not centered on the unknown targeted value. A common method to address selection bias is to draw a simple random sample, where each unit from the population is equally likely to be drawn. A pilot survey can improve question wording and so reduce measurement bias. Procedures to calibrate instruments and protocols to take measurements in, say, random order can reduce measurement bias.

Bias does not need to be avoided under all circumstances. If an instrument is highly precise (low variance) and has a small bias, then that instrument might be preferable to another that has high variance and little to no bias. Biased studies can be useful to pilot a survey instrument or to capture useful information for the design of a larger study.

#### Statistical Bias in Pimentel

The method used by the professor likely suffers from statistical bias. The units that were drawn into the sample (the 18 students in the front row who were called on) constitute a convenience sample: they were sampled because they were easy to sample. That is not necessarily a problem, but there is good reason to think that students of all years are not equally likely to sit in the front row. First year students, bright-eyed and bushy-tailed with enthusiasm, will be more likely to sit in the front row. If that is true, the professor incurred selection bias that will lead to an estimate of the population mean that is too low.

What about measurement bias? In this case, the process of measuring a student’s year involves the act of asking them a question, hearing their answer, and writing it on the board. How could this be systematically in error? One way would be if first year students occasionally lie about their year when asked, for fear of being thought of as an over-eager over-achiever. If this were true, then lots of the 2s that we recorded were in fact 1s, and the estimate of 1.77 would be too high.

### Types of Variation

Whether or not bias is present, data typically also exhibit variation.

Sampling variability
If the sample is drawn from the population with some amount of randomness, the sampling variability describes the variability from one sample to the next.
Measurement variability
When we take multiple measurements on the same object, we get variations in measurements from one sample to the next.

#### Variability in Pimentel

There is some amount of randomness that plays into which units made their way into the professor’s sample of size 18. At the start of a semester, students tend to sit in a different seat each day they come to class. Even in a world were all students are equally likely to sit in the first row (therefore there is no bias), our sample of 1.77 might be too low because that day an unusally high number of first year students happened to sit in the front row, purely due to chance. The next day, perhaps an unually high number of juniors would sit in the front row and the estimate would leap up to, say, 2.7. This is sampling variability.

Variability in measurement here would refer to a process by which the recorded year for a student would differ from one measurement to the next. Imagine a student who sits in the front row on the first day and is recorded as a two That same student sits in the front row on the 10th day and is recorded as a one. It’s hard to imagine there being much measurement variability here but you can imagine a very absent-minded professor who, upon hearing the year, will sometimes immediately forget what had been said and will record a random year.

## The Sampling Distribution

One of the key concepts in understanding estimation errors when using statistics is to understand the shape of the sampling distribution of that statistic.

Sampling Distribution

The distribution of a statistic upon repeated sampling.

Even though this distribution has an innocuous name, something big is happening with this definition. No longer are we considering the (usually unknowable) distribution of the population or the observed distribution of the data (an empirical distribution). The sampling distribution is a distribution of a statistic, illustrating the different values that it could take, along with the probability of getting each of those values in a given sample.

Usually the sampling distribution is a hypothetical thing: what would our statistic have looked like if we had taken a different sample of data? We can make it concrete by working in a setting where we can actually do just that.

## Drawing Samples from Population

To understand the role that bias and variation play in estimates, we will for the moment assume that we have access to the entire population. This is almost never the case: if we had the population, we wouldn’t need to bother with estimating it from a sample! It is a very useful thought experiment, however. It allows us to see the different ways that samples and statistics can be drawn under different scenarios.

For this thought-experiment, we will will be drawing from a population that has 527 observations / rows, one for every student in Pimentel, with a population distribution of year that looks like this.

pop_eager <- data.frame(year = factor(rep(c(1, 2, 3, 4),
times = c(245, 210, 47, 25))),
eagerness = rep(c(10, 6, 3, 1),
times = c(245, 210, 47, 25)))
set.seed(34)
pop_eager <- pop_eager %>%
slice_sample(n = nrow(pop_eager)) # shuffle the rows to make them look natural
pop_eager %>%
ggplot(aes(x = year)) +
geom_bar(fill = "purple") +
labs(title = "Population Distribution")

#### Scenario 1: Calling on the front row

In the data collection scheme used by the professor, the students who happened to be sitting in the 18 seats at the front of the class are the ones who will make it into the sample. As discussed above, it is likely that this process will result in selection bias since first year students tend to be more eager and eager students tend to sit at the font of the class.

We can do our best to envision what this selection bias would look like by selecting each student out of the population with a probability that is proportional to their eagerness. Let’s say first year students have an eagerness score of 10 out of 10, sophomores have a 6, juniors a 3, and seniors a 1. Here are the first 5 rows of the population data frame with the eagerness scores right next to the year of the student.

slice(pop_eager, 1, 2, 3, 4, 5)
  year eagerness
1    3         3
2    2         6
3    2         6
4    1        10
5    2         6

When simulating the process that the professor used to draw 18 students, we can select the first student, a junior, with probability of 3 divided by the total eagerness of all of the students (the sum of the eagerness column), which is 4900. The probability of that student (or any junior) is $$3/4900 = 0.0006$$. That is small, but that’s not surprising: there are 527 students in the class, so the probability of selecting just one of them should be small.

The probability of selecting the fourth student from the population data frame above, a first year, can be calculated as $$10/4900 = .002$$. That’s also small, but it is 3.33 (10/3) times the chance of selecting a junior. In this setting, we’re using eagerness as a sampling weight to determine the relative probability of selecting more eager students into the sample.

Let’s now simulate the process of drawing 18 rows out of the data frame of 527 rows, where each row is being selected with a probability proportional to it’s eagerness. The three plots along the top row of the plot below illustrate what the empirical distribution of three samples might look like. They help us envision what the professor’s plot on the board would look like on three different days where the 18 students in the front row were called on, and in each of those days, it is the first year students who are most likely to end up the front row.

library(infer)
samp_1 <- pop_eager %>%
slice_sample(n = 18,
replace = FALSE,
weight_by = eagerness)

many_samps <- samp_1 %>%
mutate(replicate = 1)

set.seed(40211)

for (i in 2:500) {
many_samps <- pop_eager %>%
slice_sample(n = 18,
replace = FALSE,
weight_by = eagerness) %>%
mutate(replicate = i ) %>%
bind_rows(many_samps)
}

p1 <- many_samps %>%
filter(replicate == 1) %>%
ggplot(aes(x = year)) +
geom_bar() +
labs(title = "Sample 1")

p2 <- many_samps %>%
filter(replicate == 2) %>%
ggplot(aes(x = year)) +
geom_bar() +
labs(title = "Sample 2")

p3 <- many_samps %>%
filter(replicate == 3) %>%
ggplot(aes(x = year)) +
geom_bar() +
labs(title = "Sample 3")

many_xbars <- many_samps %>%
group_by(replicate) %>%
summarize(xbar = mean(as.numeric(year)))

p4 <- many_xbars %>%
ggplot(aes(x = xbar)) +
geom_bar(fill = "purple") +
lims(x = c(0, 4)) +
labs(title = "Sampling Distribution")

library(patchwork)

(p1 + p2 + p3) / p4

As you look from Sample 1 to Sample 3, you notice that the distribution of year varies from sample to sample. This is sampling variability. If you compare these plots to the population distribution above, you’ll find these three samples on balance seem to systematically have more first year students that you might expect. This is an effect of selection bias.

Both of these notions are captured in the sampling distribution. To understand the notion of a sampling distribution, imagine that you:

1. Calculate the sample mean for sample 1 and store it away as $$x_1 =$$ 1.4.
2. Calculate the sample mean for sample 2 and store it away as $$x_2 =$$ 1.2.
3. Calculate the sample mean for sample 3 and store it away as $$x_3 =$$ 1.5.
4. Repeat this process 500 times then
5. Plot the distribution of those 500 $$\bar{x}$$s.

This distribution is shown in the bottom row. We can see that in this scenario, it’s possible to get $$\bar{x}$$s that are as low as around 1 and as high as around 1.8. The $$\bar{x}$$ that we would expect from this process is around 1.4.

## Summary

In these notes we laid out a common goal of making generalizations: estimating the value of population parameters using statistics calculated from samples. The process of generalization is subject to several sources of error that are lumped into statistical bias and variation. Two of the central forms of statistical bias are selection bias and measurement bias. Two common forms of variation are sampling variability and measurement variability. With these notions of error in mind, we learned about the sampling distribution, the distribution of statistics that we would observe if we were to sample from the sample population many times and compute many statistics.

The sampling distribution is the central concept in making generalizations so we will revisit it throughout the coming weeks.

## The Ideas in Code

#### slice_sample()

The workhorse function to generate samples of data from a data frame is slice_sample(). This function takes a random sample of rows from a data frame. This function is in the tidyverse package. Useful arguments include:

• n: the sample size.
• replace: TRUE or FALSE. whether to sample rows with or without replacement.
• weight_by: a column from the original data frame to use as sampling weights for each row.
##### Example 1

Treat the penguins data set as your population. Take a random sample of three penguins, without replacement, from the penguins data frame where each of the penguins is equally likely.

library(stat20data)

set.seed(35)
penguins %>%
slice_sample(n = 3,
replace = FALSE)
# A tibble: 3 × 8
species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
<fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
1 Chinstrap Dream            43.5          18.1          202    3400 fema…  2009
2 Gentoo    Biscoe           45.1          14.4          210    4400 fema…  2008
3 Chinstrap Dream            52.2          18.8          197    3450 male   2009
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Observe:

• If you do not specify the sampling weights, the function will select each row with equal probability.
• We set a seed so that everything this code is run, it will come up with the same sample of 12 rows. Try running this code yourself. Do you get the same sample?

#### Example 2

Take a random sample of three penguins without replacement using sampling weights proportional to the bill lengths; that is, a penguin with a bill length that is twice is long as another penguin should be twice as likely to be selected for the sample.

penguins %>%
slice_sample(n = 3,
replace = FALSE,
weight_by = bill_length_mm)
# A tibble: 3 × 8
species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
<fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
1 Chinstrap Dream            45.6          19.4          194    3525 fema…  2009
2 Chinstrap Dream            51.3          19.9          198    3700 male   2007
3 Adelie    Biscoe           45.6          20.3          191    4600 male   2009
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Observe:

• While it is still possible for every penguin to be selected into the sample, the ones with longer bill lengths are more likely. The precise probability that the first penguin was selected was its bill length (43.5) divided by the sum of the bill lengths of all of the penguins in the population.

#### rep_slice_sample()

An extension of slice_sample() that doesn’t take just a single sample of size n, but repeats the process many times and draws many samples of size n. The useful new argument is reps, the number of replicates (or separate samples) you wish to generate. This function is in the infer package.

##### Example 3

Take two separate sample samples from the penguins data frame without replacement, and each of sample size three.

library(infer)
penguins %>%
rep_slice_sample(n = 3,
replace = FALSE,
reps = 2)
# A tibble: 6 × 9
# Groups:   replicate [2]
replicate species   island bill_length_mm bill_d…¹ flipp…² body_…³ sex    year
<int> <fct>     <fct>           <dbl>    <dbl>   <int>   <int> <fct> <int>
1         1 Chinstrap Dream            52.7     19.8     197    3725 male   2007
2         1 Adelie    Dream            41.5     18.5     201    4000 male   2009
3         1 Gentoo    Biscoe           52.5     15.6     221    5450 male   2009
4         2 Chinstrap Dream            43.2     16.6     187    2900 fema…  2007
5         2 Gentoo    Biscoe           45       15.4     220    5050 male   2008
6         2 Adelie    Biscoe           45.6     20.3     191    4600 male   2009
# … with abbreviated variable names ¹​bill_depth_mm, ²​flipper_length_mm,
#   ³​body_mass_g

Observe:

• The data frame produced by this function has a new column that has been appended to the left side: replicate. This column keeps track of which replicate the row belongs to. The first three rows constitute the first sample of size 3. The second three rows constitute the second sample of size 3.
• The dimensions of the output data frame are 6 rows by 9 columns. Compare that to the dimensions of the original penguins data frame: 366 by 8. This function added one new column and contains n $$\times$$ reps rows.

## Footnotes

1. Photo of Pimentel Hall by flickr user TheRealMichaelMoore.↩︎