# Bootstrapping

STAT 20: Introduction to Probability and Statistics

## While you’re waiting

If you’ve been given an index card, please write on it:

1. Your first name
2. Your year at Cal (1 is first year, 2 is second year, etc.)
3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

## Agenda

- Concept Question
- Activity: The Bootstrap
- Bootstrapping with infer

# Concept Question

Which of these is a valid bootstrap sample?

Time: 1 minute

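Original Sample

| name | species   | length |
|------|-----------|--------|
| Gus  | Chinstrap | 50.7   |
| Luz  | Gentoo    | 48.5   |
| Ida  | Chinstrap | 52.8   |
| Ola  | Gentoo    | 44.5   |
| Abe  | Adelie    | 42.0   |

BS A

| name | species   | length |
|------|-----------|--------|
| Ida  | Chinstrap | 52.8   |
| Luz  | Gentoo    | 48.5   |
| Abe  | Adelie    | 42.0   |
| Ola  | Gentoo    | 44.5   |
| Ida  | Chinstrap | 52.8   |

BS B

| name | species   | length |
|------|-----------|--------|
| Ola  | Gentoo    | 44.5   |
| Gus  | Chinstrap | 50.7   |
| Ida  | Chinstrap | 52.8   |
| Luz  | Gentoo    | 48.5   |
| Gus  | Chinstrap | 50.7   |
| Gus  | Chinstrap | 50.7   |

BS C

| name | species   | length |
|------|-----------|--------|
| Gus  | Chinstrap | 50.7   |
| Ola  | Gentoo    | 48.5   |
| Ola  | Chinstrap | 52.8   |
| Ida  | Gentoo    | 44.5   |
| Ida  | Adelie    | 42.0   |

BS D

| name | species   | length |
|------|-----------|--------|
| Gus  | Chinstrap | 50.7   |
| Abe  | Adelie    | 42.0   |
| Gus  | Chinstrap | 50.7   |
| Gus  | Chinstrap | 50.7   |
| Gus  | Chinstrap | 50.7   |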

# The Bootstrap

## Parameters and Statistics

Our Goal: Assess the sampling variability in our estimates of two parameters: the median year at Cal and the proportion of students interested in a business- or econ-related field.

Our Tool: The Bootstrap

## Collecting a sample of data

If you’ve been given an index card, please write on it:

1. Your first name
2. Your year at Cal (1 is first year, 2 is second year, etc.)
3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork
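
To make the goal above concrete, here is a minimal base R sketch of the resampling idea. The `cards` data frame is entirely made up for illustration (it is not the class's index cards), and the helper name `one_replicate()` and the choice of 1,000 replicates are assumptions of this sketch.

```r
# Hypothetical index-card data (made up for illustration; not the class's data)
cards <- data.frame(
  name = c("Ana", "Ben", "Cy", "Dee", "Eli", "Fay", "Gil", "Hana"),
  year = c(1, 1, 2, 1, 3, 2, 1, 4),
  econ = c(1, 0, 0, 1, 1, 0, 0, 1)
)

set.seed(20)  # arbitrary seed so the resamples are reproducible

# One bootstrap replicate: draw n rows with replacement,
# then recompute both statistics on the resampled rows
one_replicate <- function(df) {
  rows <- sample(nrow(df), size = nrow(df), replace = TRUE)
  resampled <- df[rows, ]
  c(median_year = median(resampled$year), p_econ = mean(resampled$econ))
}

# Repeat many times; each column of boot_stats is one replicate
boot_stats <- replicate(1000, one_replicate(cards))

# The spread of the replicate statistics estimates the sampling variability
apply(boot_stats, 1, sd)
```

Each replicate keeps whole cards together (name, year, and econ stay on the same row) and has the same number of cards as the original sample.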

# Bootstrapping with Infer

## Example: Penguins

Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?

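```r
library(tidyverse)  # provides %>%, mutate(), and ggplot()
# Assumes `penguins` (344 rows) is already loaded, e.g. from the palmerpenguins package.

# Flag each penguin as Adelie (TRUE) or not (FALSE)
penguins <- penguins %>%
  mutate(is_adelie = species == "Adelie")

penguins %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar()
```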

## Point estimate

```r
obs_stat <- penguins %>%
  summarize(p_adelie = mean(is_adelie))
obs_stat
```

```
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442
```

## Generating one bootstrap sample

```r
library(infer)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
```

```
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 FALSE
 3         1 TRUE
 4         1 FALSE
 5         1 TRUE
 6         1 TRUE
 7         1 FALSE
 8         1 TRUE
 9         1 TRUE
10         1 TRUE
# … with 334 more rows
```
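
As a rough illustration of the resampling idea behind a bootstrap sample, here is a base R sketch. It does not use infer or reproduce its internals. It rebuilds the `is_adelie` column from counts: 152 `TRUE` values out of 344 is an assumption inferred from the point estimate (0.442 × 344 ≈ 152), and the seed is arbitrary.

```r
# Assumed counts: 152 of the 344 penguins are Adelie (152 / 344 = 0.442)
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(1)  # arbitrary seed for reproducibility

# A bootstrap sample: same size as the original, drawn with replacement
boot_sample <- sample(is_adelie, size = length(is_adelie), replace = TRUE)

length(boot_sample)  # still 344 observations
mean(boot_sample)    # this replicate's proportion of Adelie penguins
```

Because the draw is with replacement, some penguins appear more than once and others not at all, which is why each bootstrap sample gives a slightly different proportion.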

## Two more bootstrap samples

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
```

```
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 TRUE
 3         1 FALSE
 4         1 FALSE
 5         1 FALSE
 6         1 TRUE
 7         1 TRUE
 8         1 FALSE
 9         1 FALSE
10         1 FALSE
# … with 334 more rows
```

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
```

```
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 TRUE
 3         1 TRUE
 4         1 FALSE
 5         1 FALSE
 6         1 TRUE
 7         1 TRUE
 8         1 FALSE
 9         1 FALSE
10         1 FALSE
# … with 334 more rows
```

## Visualizing 9 bootstrap samples

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9,
           type = "bootstrap") %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)
```

## Calculating 9 $\hat{p}$

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9,
           type = "bootstrap") %>%
  calculate(stat = "prop")
```

```
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474
```

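Note the change in data frame size: calculate() collapses the 9 × 344 = 3,096 resampled rows into one row per replicate, so the result has 9 rows.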
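
The same collapse can be seen in a small base R sketch that stores each bootstrap sample as a matrix column. This is only an illustration of the bookkeeping, not how infer stores replicates. The `is_adelie` vector is rebuilt from assumed counts (152 `TRUE` out of 344, consistent with the point estimate), and the seed is arbitrary.

```r
# Assumed counts: 152 of the 344 penguins are Adelie
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(9)  # arbitrary seed for reproducibility

# 9 bootstrap samples stored as columns: a 344 x 9 matrix (3,096 resampled values)
boot_samples <- replicate(9, sample(is_adelie, replace = TRUE))
dim(boot_samples)

# Summarizing each column leaves one proportion per replicate: 9 values
p_hats <- colMeans(boot_samples)
length(p_hats)
```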

## The bootstrap distribution (reps = 500)

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500,
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  ggplot(aes(x = stat)) +
  geom_histogram()
```

## Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

```r
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500,
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)
```

```
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
```
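
The "middle 95%" idea can also be sketched by hand in base R, using `quantile()` on 500 bootstrap proportions. This assumes get_ci() is using a percentile-type interval, as the description above suggests. The `is_adelie` vector is rebuilt from assumed counts (152 `TRUE` out of 344), and the seed is arbitrary, so the endpoints will be close to, but not identical to, the output shown above.

```r
# Assumed counts: 152 of the 344 penguins are Adelie
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(500)  # arbitrary seed for reproducibility

# 500 bootstrap proportions: resample with replacement, then take the mean
boot_props <- replicate(500, mean(sample(is_adelie, replace = TRUE)))

# Percentile interval: the 0.025 and 0.975 quantiles of the bootstrap distribution
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

Because the bootstrap is random, rerunning with a different seed (or more replicates) shifts the endpoints slightly.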

## Your Turn

Create a 95% confidence interval for the median bill length of penguins.

Time: 5 minutes