Bootstrapping

STAT 20: Introduction to Probability and Statistics

While you’re waiting

If you’ve been given an index card, please write on it:

2. Your year at Cal (1 is first year, 2 is second year, etc.)
3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

Agenda

• Concept Question
• Activity: The Bootstrap
• Bootstrapping with infer

Concept Question

Which of these is a valid bootstrap sample?


Original Sample
name  species    length
Gus   Chinstrap  50.7
Luz   Gentoo     48.5
Ida   Chinstrap  52.8
Ola   Gentoo     44.5

BS A
name  species    length
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Ola   Gentoo     44.5
Ida   Chinstrap  52.8

BS B
name  species    length
Ola   Gentoo     44.5
Gus   Chinstrap  50.7
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7

BS C
name  species    length
Gus   Chinstrap  50.7
Ola   Gentoo     48.5
Ola   Chinstrap  52.8
Ida   Gentoo     44.5

BS D
name  species    length
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
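
One way to reason about the question is to generate a bootstrap sample by hand. The base R sketch below re-enters the four-penguin original sample and draws whole rows with replacement, keeping the resample the same size as the original. The object names (original, idx, boot_sample) and the seed are illustrative choices, not part of the slides.

```r
# Original sample from the slide (four penguins).
original <- data.frame(
  name    = c("Gus", "Luz", "Ida", "Ola"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo"),
  length  = c(50.7, 48.5, 52.8, 44.5)
)

set.seed(20)  # arbitrary seed, used only so the draw is reproducible

# Draw n row indices WITH replacement, where n is the original sample size.
n   <- nrow(original)
idx <- sample(n, size = n, replace = TRUE)

# Whole rows are resampled, so each name keeps its own species and length.
boot_sample <- original[idx, ]
boot_sample
```

A candidate table can be checked against the same three properties: it has as many rows as the original, every row appears in the original, and no row mixes values from different penguins. Repeated rows are allowed.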

The Bootstrap

Parameters and Statistics

Our Goal: Assess the sampling variability in our estimates of the median year at Cal and the proportion of students interested in a business- or econ-related major.

Our Tool: The Bootstrap

Collecting a sample of data

If you’ve been given an index card, please write on it:

2. Your year at Cal (1 is first year, 2 is second year, etc.)
3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork

Bootstrapping with infer

Example: Penguins

Let’s consider our 344 penguins to be an SRS (simple random sample) from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?

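library(tidyverse)
# Flag each penguin as Adelie (TRUE) or not (FALSE)
penguins <- penguins %>%
  mutate(is_adelie = species == "Adelie")

penguins %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar()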

Point estimate

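# Sample proportion of Adelie penguins (p_hat is an illustrative column name)
obs_stat <- penguins %>%
  summarize(p_hat = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_hat
  <dbl>
1 0.442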

Generating one bootstrap sample

library(infer)
penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 FALSE
 3         1 TRUE
 4         1 FALSE
 5         1 TRUE
 6         1 TRUE
 7         1 FALSE
 8         1 TRUE
 9         1 TRUE
10         1 TRUE
# ℹ 334 more rows

Two more bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 TRUE
 3         1 FALSE
 4         1 FALSE
 5         1 FALSE
 6         1 TRUE
 7         1 TRUE
 8         1 FALSE
 9         1 FALSE
10         1 FALSE
# ℹ 334 more rows

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 1,
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>
 1         1 FALSE
 2         1 TRUE
 3         1 TRUE
 4         1 FALSE
 5         1 FALSE
 6         1 TRUE
 7         1 TRUE
 8         1 FALSE
 9         1 FALSE
10         1 FALSE
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9,
           type = "bootstrap") %>%
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 $\hat{p}$

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 9,
           type = "bootstrap") %>%
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

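Note the change in data frame size: generate() returns one row per resampled penguin (9 × 344 = 3,096 rows), while calculate() collapses each replicate to a single row, leaving 9 rows.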
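
To see where the size change comes from, here is a rough base R sketch of the same two steps, not the infer implementation. It assumes a logical vector with 152 TRUE values out of 344, which reproduces the observed proportion of 0.442; the vector, the seed, and the object names are illustrative.

```r
# Assumed stand-in for the is_adelie column: 152 TRUE out of 344 (prop = 0.442).
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))
n <- length(is_adelie)

set.seed(20)  # arbitrary seed, used only so the draws are reproducible

# "generate" step: 9 resamples of size n, stored as a 344 x 9 matrix.
boot_samples <- replicate(9, sample(is_adelie, size = n, replace = TRUE))
dim(boot_samples)

# "calculate" step: one proportion per resample, so 9 numbers remain.
p_hats <- colMeans(boot_samples)
length(p_hats)
p_hats
```

Each column of boot_samples plays the role of one replicate, and colMeans() reduces every column of 344 values to a single proportion.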

The bootstrap distribution (reps = 500)

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500,
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% of the bootstrap distribution by finding its 0.025 and 0.975 quantiles with get_ci().

penguins %>%
  specify(response = is_adelie,
          success = "TRUE") %>%
  generate(reps = 500,
           type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
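
The percentile idea behind this interval can be sketched in base R with quantile(). This is a sketch of the idea rather than the infer internals, and it assumes the same illustrative stand-in vector (152 TRUE out of 344) and an arbitrary seed, so the endpoints will differ slightly from the output above.

```r
# Assumed stand-in for the is_adelie column: 152 TRUE out of 344 (prop = 0.442).
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

set.seed(20)  # arbitrary seed, used only so the draws are reproducible

# 500 bootstrap proportions: resample with replacement, then take the mean.
boot_props <- replicate(500, mean(sample(is_adelie, replace = TRUE)))

# Middle 95% of the bootstrap distribution: the 0.025 and 0.975 quantiles.
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

Because the resamples are random, rerunning without a fixed seed gives slightly different endpoints each time, which is also true of the get_ci() output.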

Documentation: infer.tidymodels.org
