`01:00`

STAT 20: Introduction to Probability and Statistics

- Announcements
- Concept Questions
- Break
- Lab 6.2
- Appendix: Logistic Regression with the
`penguins`

dataset

- Quiz 4:
- Monday in class.
*Wrong By Design*through*Logistic Regression*

- Problem Sets:
- PS 18 (Overfitting) due next Tuesday at 9am
- Extra Practice (Logistic Regression)

- Lab 6:
- both parts due Tuesday at 9am

A logistic regression model was fit in an attempt to predict the sex of a penguin `"male"`

or `"female"`

based on its body mass (grams).

Assuming that no change to the `penguins`

dataset was made, will the model be predicting the probability of the penguin being male or the probability of the penguin being female?

`01:00`

```
(Intercept) body_mass_g
-5.162541644 0.001239819
```

Which of the expressions given in the poll (math or code) will correctly calculate the predicted probability that a penguin that weighs 4000 g is a female? *Select all that apply*

`01:00`

What is the misclassification rate of this model?

```
# A tibble: 4 × 3
# Groups: sex [2]
sex y_hat n
<fct> <chr> <int>
1 female female 109
2 female male 56
3 male female 74
4 male male 94
```

`01:00`

```
(Intercept) body_mass_g bill_length_mm
-6.91208086 0.00101530 0.06112808
```

Open up RStudio and fit the model here in the slides. What are the predicted sexes of these two penguins?

- body mass = 3900 g, bill length = 50
- body mass = 4100 g, bill length = 35

`01:00`

`05:00`

`45:00`

**Decide on the mathematical form of the model**: logistic linear regression

**Select a metric that defines the “best” fit**: the coefficients in logistic regression are the ones that minimize not the RSS function but a function called log-loss (which we don’t have time to cover)

**Estimating the coefficients of the model that are best using the training data**: we know how to do this: test + train +`glm()`

!

**Evaluating predictive accuracy using a test data set**:\(R^2\) isn’t relevant here. We need a new metric!

```
m2 <- glm(sex ~ body_mass_g + bill_length_mm,
data = train, family = "binomial")
p_hat <- predict(m2, test, type = "response")
test |>
select(sex) |>
mutate(p_hat = p_hat)
```

```
# A tibble: 70 × 2
sex p_hat
<fct> <dbl>
1 female 0.345
2 male 0.566
3 female 0.259
4 male 0.280
5 male 0.365
6 female 0.196
7 male 0.428
8 female 0.220
9 male 0.559
10 male 0.279
# ℹ 60 more rows
```

```
m2 <- glm(sex ~ body_mass_g + bill_length_mm,
data = train, family = "binomial")
test |>
select(sex) |>
mutate(p_hat = predict(m2, test, type = "response"),
y_hat = ifelse(p_hat > .5, "male", "female"))
```

```
# A tibble: 70 × 3
sex p_hat y_hat
<fct> <dbl> <chr>
1 female 0.345 female
2 male 0.566 male
3 female 0.259 female
4 male 0.280 female
5 male 0.365 female
6 female 0.196 female
7 male 0.428 female
8 female 0.220 female
9 male 0.559 male
10 male 0.279 female
# ℹ 60 more rows
```

```
test |>
select(sex) |>
mutate(p_hat = p_hat,
y_hat = ifelse(p_hat > .5, "male", "female"),
FP = sex == "female" & y_hat == "male",
FN = sex == "male" & y_hat == "female")
```

```
# A tibble: 70 × 5
sex p_hat y_hat FP FN
<fct> <dbl> <chr> <lgl> <lgl>
1 female 0.345 female FALSE FALSE
2 male 0.566 male FALSE FALSE
3 female 0.259 female FALSE FALSE
4 male 0.280 female FALSE TRUE
5 male 0.365 female FALSE TRUE
6 female 0.196 female FALSE FALSE
7 male 0.428 female FALSE TRUE
8 female 0.220 female FALSE FALSE
9 male 0.559 male FALSE FALSE
10 male 0.279 female FALSE TRUE
# ℹ 60 more rows
```

```
test |>
select(sex) |>
mutate(p_hat = p_hat,
y_hat = ifelse(p_hat > .5, "male", "female")) |>
summarize(MCR = mean(sex != y_hat))
```

```
# A tibble: 1 × 1
MCR
<dbl>
1 0.371
```