Logistic Regression

STAT 20: Introduction to Probability and Statistics

Agenda

  1. CQ
  2. Lecture: Misclassification
  3. Lab 8

Concept Questions

Which of the following is an example of a classification task?

01:00

m1 <- glm(sex ~ body_mass_g, data = penguins, family = "binomial")
 (Intercept)  body_mass_g 
-5.162541644  0.001239819 

What is the predicted probability that probability that a penguin that weighs 4000 g is a female?

(As a bonus, try sketching this function on a scatterplot!)

01:00

m2 <- glm(sex ~ body_mass_g + bill_length_mm, data = penguins, family = "binomial")
   (Intercept)    body_mass_g bill_length_mm 
   -6.91208086     0.00101530     0.06112808 

What are the predicted sexes of these two penguins?

  1. body mass = 3900 g, bill length = 50
  2. body mass = 4100 g, bill length = 35
01:00

Misclassification

Building a predictive model

  1. Decide on the mathematical form of the model: logistic linear regression
  1. Select a metric that defines the “best” fit: the coefficients in logistic regression are the ones that minimize not the RSS function but a function called log-loss (which we don’t have time to cover)
  1. Estimating the coefficients of the model that are best using the training data: we know how to do this: test + train + glm()!
  1. Evaluating predictive accuracy using a test data set:\(R^2\) isn’t relevant here. We need a new metric!

Example: penguins

set.seed(132)

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'), 
                   size = nrow(penguins), 
                   replace = TRUE, 
                   prob = c(0.8, 0.2))

Example: penguins

set.seed(132)

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'), 
                   size = nrow(penguins), 
                   replace = TRUE, 
                   prob = c(0.8, 0.2))

train <- penguins %>%
  filter(set_type == "train")

test <- penguins %>%
  filter(set_type == "test")

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test %>%
  select(sex)
# A tibble: 70 × 1
   sex   
   <fct> 
 1 female
 2 male  
 3 female
 4 male  
 5 male  
 6 female
 7 male  
 8 female
 9 male  
10 male  
# … with 60 more rows

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat)
# A tibble: 70 × 2
   sex    p_hat
   <fct>  <dbl>
 1 female 0.345
 2 male   0.566
 3 female 0.259
 4 male   0.280
 5 male   0.365
 6 female 0.196
 7 male   0.428
 8 female 0.220
 9 male   0.559
10 male   0.279
# … with 60 more rows

Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"))
# A tibble: 70 × 3
   sex    p_hat y_hat 
   <fct>  <dbl> <chr> 
 1 female 0.345 female
 2 male   0.566 male  
 3 female 0.259 female
 4 male   0.280 female
 5 male   0.365 female
 6 female 0.196 female
 7 male   0.428 female
 8 female 0.220 female
 9 male   0.559 male  
10 male   0.279 female
# … with 60 more rows

Classification errors

False Positives: Predicting a 1 that is in fact a 0

False Negatives: Predicting a 0 that is in fact a 1

Misclassification Rate:

\[ \frac{FP + FN}{total \, number \, of \, predictions} \]

Classification errors

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"),
         FP = sex == "female" & y_hat == "male",
         FN = sex == "male" & y_hat == "female")
# A tibble: 70 × 5
   sex    p_hat y_hat  FP    FN   
   <fct>  <dbl> <chr>  <lgl> <lgl>
 1 female 0.345 female FALSE FALSE
 2 male   0.566 male   FALSE FALSE
 3 female 0.259 female FALSE FALSE
 4 male   0.280 female FALSE TRUE 
 5 male   0.365 female FALSE TRUE 
 6 female 0.196 female FALSE FALSE
 7 male   0.428 female FALSE TRUE 
 8 female 0.220 female FALSE FALSE
 9 male   0.559 male   FALSE FALSE
10 male   0.279 female FALSE TRUE 
# … with 60 more rows

Misclassification Rate

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female")) %>%
  summarize(misclas = mean(sex != y_hat))
# A tibble: 1 × 1
  misclas
    <dbl>
1   0.371

Lab