Logistic Regression

STAT 20: Introduction to Probability and Statistics

Agenda

  • Announcements
  • Concept Questions
  • Break
  • Lab 6.2
  • Appendix: Logistic Regression with the penguins dataset

Announcements

  • Quiz 4:
    • Monday in class.
    • Wrong By Design through Logistic Regression
  • Problem Sets:
    • PS 18 (Overfitting) due next Tuesday at 9am
    • Extra Practice (Logistic Regression)
  • Lab 6:
    • both parts due Tuesday at 9am

Concept Questions

A logistic regression model was fit in an attempt to predict the sex of a penguin "male" or "female" based on its body mass (grams).

Assuming that no change to the penguins dataset was made, will the model be predicting the probability of the penguin being male or the probability of the penguin being female?

01:00

m1 <- glm(sex ~ body_mass_g, data = penguins, family = "binomial")
 (Intercept)  body_mass_g 
-5.162541644  0.001239819 

Which of the expressions given in the poll (math or code) will correctly calculate the predicted probability that a penguin that weighs 4000 g is a female? Select all that apply

01:00

What is the misclassification rate of this model?

# A tibble: 4 × 3
# Groups:   sex [2]
  sex    y_hat      n
  <fct>  <chr>  <int>
1 female female   109
2 female male      56
3 male   female    74
4 male   male      94
01:00

m2 <- glm(sex ~ body_mass_g + bill_length_mm, 
          data = penguins, family = "binomial")
   (Intercept)    body_mass_g bill_length_mm 
   -6.91208086     0.00101530     0.06112808 

Open up RStudio and fit the model here in the slides. What are the predicted sexes of these two penguins?

  1. body mass = 3900 g, bill length = 50
  2. body mass = 4100 g, bill length = 35
01:00

Break

05:00

Lab

45:00

End of Lecture

Misclassification: Appendix

Building a predictive model

  1. Decide on the mathematical form of the model: logistic linear regression
  1. Select a metric that defines the “best” fit: the coefficients in logistic regression are the ones that minimize not the RSS function but a function called log-loss (which we don’t have time to cover)
  1. Estimating the coefficients of the model that are best using the training data: we know how to do this: test + train + glm()!
  1. Evaluating predictive accuracy using a test data set:\(R^2\) isn’t relevant here. We need a new metric!

Example: penguins

set.seed(132)

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'), 
                   size = nrow(penguins), 
                   replace = TRUE, 
                   prob = c(0.8, 0.2))

Example: penguins

set.seed(132)

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'), 
                   size = nrow(penguins), 
                   replace = TRUE, 
                   prob = c(0.8, 0.2))

train <- penguins |>
  filter(set_type == "train")

test <- penguins |>
  filter(set_type == "test")

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test |>
  select(sex)
# A tibble: 70 × 1
   sex   
   <fct> 
 1 female
 2 male  
 3 female
 4 male  
 5 male  
 6 female
 7 male  
 8 female
 9 male  
10 male  
# ℹ 60 more rows

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test |>
  select(sex) |>
  mutate(p_hat = p_hat)
# A tibble: 70 × 2
   sex    p_hat
   <fct>  <dbl>
 1 female 0.345
 2 male   0.566
 3 female 0.259
 4 male   0.280
 5 male   0.365
 6 female 0.196
 7 male   0.428
 8 female 0.220
 9 male   0.559
10 male   0.279
# ℹ 60 more rows

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")

test |>
  select(sex) |>
  mutate(p_hat = predict(m2, test, type = "response"),
         y_hat = ifelse(p_hat > .5, "male", "female"))
# A tibble: 70 × 3
   sex    p_hat y_hat 
   <fct>  <dbl> <chr> 
 1 female 0.345 female
 2 male   0.566 male  
 3 female 0.259 female
 4 male   0.280 female
 5 male   0.365 female
 6 female 0.196 female
 7 male   0.428 female
 8 female 0.220 female
 9 male   0.559 male  
10 male   0.279 female
# ℹ 60 more rows

Classification errors

test |>
  select(sex) |>
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"),
         FP = sex == "female" & y_hat == "male",
         FN = sex == "male" & y_hat == "female")
# A tibble: 70 × 5
   sex    p_hat y_hat  FP    FN   
   <fct>  <dbl> <chr>  <lgl> <lgl>
 1 female 0.345 female FALSE FALSE
 2 male   0.566 male   FALSE FALSE
 3 female 0.259 female FALSE FALSE
 4 male   0.280 female FALSE TRUE 
 5 male   0.365 female FALSE TRUE 
 6 female 0.196 female FALSE FALSE
 7 male   0.428 female FALSE TRUE 
 8 female 0.220 female FALSE FALSE
 9 male   0.559 male   FALSE FALSE
10 male   0.279 female FALSE TRUE 
# ℹ 60 more rows

Misclassification Rate

test |>
  select(sex) |>
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female")) |>
  summarize(MCR = mean(sex != y_hat))
# A tibble: 1 × 1
    MCR
  <dbl>
1 0.371