# Logistic Regression

STAT 20: Introduction to Probability and Statistics

## Agenda

1. Concept Questions
2. Lecture: Misclassification
3. Lab 8

# Concept Questions

Which of the following is an example of a classification task?

m1 <- glm(sex ~ body_mass_g, data = penguins, family = "binomial")
 (Intercept)  body_mass_g
-5.162541644  0.001239819 

What is the predicted probability that a penguin that weighs 4000 g is female?

(As a bonus, try sketching this function on a scatterplot!)
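One way to work this out by hand is sketched below in base R. It assumes R's default factor ordering, in which `female` is the reference level, so the model estimates the probability of `male` (this matches the `ifelse(p_hat > .5, "male", "female")` rule used later in these slides). The names `b0`, `b1`, `p_male`, and `p_female` are illustrative.

```r
# Coefficients copied from the m1 output above
b0 <- -5.162541644
b1 <- 0.001239819

# Linear predictor: the log-odds of "male" for a 4000 g penguin
# (assumes "female" is the reference level of the sex factor)
log_odds <- b0 + b1 * 4000

# plogis() is the logistic function 1 / (1 + exp(-x))
p_male <- plogis(log_odds)
p_female <- 1 - p_male

round(c(male = p_male, female = p_female), 3)
```

Under that assumption, the predicted probability of female comes out to roughly 0.55.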

m2 <- glm(sex ~ body_mass_g + bill_length_mm, data = penguins, family = "binomial")
   (Intercept)    body_mass_g bill_length_mm
-6.91208086     0.00101530     0.06112808 

What are the predicted sexes of these two penguins?

1. body mass = 3900 g, bill length = 50
2. body mass = 4100 g, bill length = 35
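The two predictions can be checked with a short base-R sketch. As before, it assumes the model estimates the probability of `male` and that a penguin is classified as male when that probability exceeds 0.5; the object names (`b`, `new_penguins`, `p_male`, `y_hat`) are illustrative.

```r
# Coefficients copied from the m2 output above
b <- c(intercept = -6.91208086,
       body_mass_g = 0.00101530,
       bill_length_mm = 0.06112808)

# The two penguins from the question
new_penguins <- data.frame(body_mass_g = c(3900, 4100),
                           bill_length_mm = c(50, 35))

# Log-odds of "male" (assumes "female" is the reference level)
log_odds <- b[["intercept"]] +
  b[["body_mass_g"]] * new_penguins$body_mass_g +
  b[["bill_length_mm"]] * new_penguins$bill_length_mm

# Convert log-odds to probabilities, then apply a 0.5 threshold
p_male <- plogis(log_odds)
y_hat <- ifelse(p_male > 0.5, "male", "female")

data.frame(new_penguins, p_male = round(p_male, 3), y_hat)
```

Under these assumptions the first penguin is predicted male (probability just above 0.5) and the second female.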

# Misclassification

## Building a predictive model

1. Decide on the mathematical form of the model: logistic regression.
2. Select a metric that defines the “best” fit: the coefficients in logistic regression minimize not the RSS but a function called log-loss (which we don’t have time to cover).
3. Estimate the coefficients that are best using the training data: we know how to do this with a train/test split and glm()!
4. Evaluate predictive accuracy using a test data set: $R^2$ isn’t relevant here. We need a new metric!
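Although log-loss is not covered in these slides, a minimal sketch of its usual definition may make the second step more concrete. Here `y` is coded 1 for male and 0 for female (an assumed coding), and `p` is the predicted probability of male. The function name `log_loss` and the toy values are illustrative and do not come from the course materials.

```r
# Average log-loss: small when predicted probabilities sit close to
# the observed 0/1 outcomes, large when they are confidently wrong
log_loss <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Toy outcomes (1 = male, 0 = female) and two sets of predictions
y <- c(0, 1, 0, 1)
p_good <- c(0.1, 0.9, 0.2, 0.8)  # mostly agrees with y
p_bad <- c(0.6, 0.4, 0.7, 0.3)   # mostly disagrees with y

log_loss(y, p_good)
log_loss(y, p_bad)
```

The predictions that agree with the outcomes get the lower log-loss, which is the sense in which minimizing it defines the "best" coefficients.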

## Example: penguins

set.seed(132)

# randomly sample train/test set split
set_type <- sample(x = c('train', 'test'),
                   size = nrow(penguins),
                   replace = TRUE,
                   prob = c(0.8, 0.2))

train <- penguins %>%
  filter(set_type == "train")

test <- penguins %>%
  filter(set_type == "test")

## Predicting into test set

m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")

test %>%
  select(sex)
# A tibble: 70 × 1
sex
<fct>
1 female
2 male
3 female
4 male
5 male
6 female
7 male
8 female
9 male
10 male
# … with 60 more rows

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat)
# A tibble: 70 × 2
sex    p_hat
<fct>  <dbl>
1 female 0.345
2 male   0.566
3 female 0.259
4 male   0.280
5 male   0.365
6 female 0.196
7 male   0.428
8 female 0.220
9 male   0.559
10 male   0.279
# … with 60 more rows

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"))
# A tibble: 70 × 3
sex    p_hat y_hat
<fct>  <dbl> <chr>
1 female 0.345 female
2 male   0.566 male
3 female 0.259 female
4 male   0.280 female
5 male   0.365 female
6 female 0.196 female
7 male   0.428 female
8 female 0.220 female
9 male   0.559 male
10 male   0.279 female
# … with 60 more rows

## Classification errors

False Positives: Predicting a 1 that is in fact a 0

False Negatives: Predicting a 0 that is in fact a 1

Misclassification Rate:

$\frac{\text{FP} + \text{FN}}{\text{total number of predictions}}$
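To make these definitions concrete, here is a small base-R sketch that uses only the ten test-set rows displayed on the earlier slides (not the full 70-row test set), treating `male` as 1 and `female` as 0. The vector names are illustrative.

```r
# The ten rows shown in the printed tibble (actual sex and predicted
# probability of male); the remaining 60 rows are not reproduced here
sex <- c("female", "male", "female", "male", "male",
         "female", "male", "female", "male", "male")
p_hat <- c(0.345, 0.566, 0.259, 0.280, 0.365,
           0.196, 0.428, 0.220, 0.559, 0.279)

# Classify as male when the predicted probability exceeds 0.5
y_hat <- ifelse(p_hat > 0.5, "male", "female")

# Count each kind of error, with "male" playing the role of 1
FP <- sum(sex == "female" & y_hat == "male")
FN <- sum(sex == "male" & y_hat == "female")

# Misclassification rate on these ten rows only
misclass_rate <- (FP + FN) / length(sex)

# Cross-tabulate actual against predicted classes
table(actual = sex, predicted = y_hat)
```

On these ten rows there are no false positives and four false negatives, so the rate is 0.4. The rate on the full test set will generally differ.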

## Classification errors

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"),
         FP = sex == "female" & y_hat == "male",
         FN = sex == "male" & y_hat == "female")
# A tibble: 70 × 5
sex    p_hat y_hat  FP    FN
<fct>  <dbl> <chr>  <lgl> <lgl>
1 female 0.345 female FALSE FALSE
2 male   0.566 male   FALSE FALSE
3 female 0.259 female FALSE FALSE
4 male   0.280 female FALSE TRUE
5 male   0.365 female FALSE TRUE
6 female 0.196 female FALSE FALSE
7 male   0.428 female FALSE TRUE
8 female 0.220 female FALSE FALSE
9 male   0.559 male   FALSE FALSE
10 male   0.279 female FALSE TRUE
# … with 60 more rows

## Misclassification Rate

test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female")) %>%
  summarize(misclas = mean(sex != y_hat))
# A tibble: 1 × 1
misclas
<dbl>
1   0.371