STAT 20: Introduction to Probability and Statistics

- CQ
- Lecture: Misclassification
- Lab 8

Which of the following is an example of a classification task?

```
(Intercept) body_mass_g
-5.162541644 0.001239819
```

What is the predicted probability that a penguin that weighs 4000 g is female?

(As a bonus, try sketching this function on a scatterplot!)
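One way to check an answer by hand is to plug the coefficients into the inverse logit function. The sketch below assumes that the model codes `male` as 1 and `female` as 0, which is R's default when `sex` is a factor with alphabetical levels and matches the thresholding used later in these slides. If the coding is reversed, the two probabilities swap. The variable names are illustrative only.

```r
# Coefficients copied from the output above
b0 <- -5.162541644   # (Intercept)
b1 <-  0.001239819   # body_mass_g

# Linear predictor (log-odds) for a 4000 g penguin
eta <- b0 + b1 * 4000

# Inverse logit: plogis(x) is 1 / (1 + exp(-x))
p_one <- plogis(eta)     # probability of the level coded as 1, about 0.449

# ASSUMPTION: "male" is the level coded as 1, so female is the complement
p_female <- 1 - p_one    # about 0.551

# Bonus: sketch the fitted curve over a plausible range of body masses
curve(plogis(b0 + b1 * x), from = 2500, to = 6500,
      xlab = "body_mass_g", ylab = "predicted probability of level 1")
```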

```
(Intercept) body_mass_g bill_length_mm
-6.91208086 0.00101530 0.06112808
```

What are the predicted sexes of these two penguins?

- body mass = 3900 g, bill length = 50
- body mass = 4100 g, bill length = 35
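A hedged sketch of how these two predictions could be worked out from the coefficients. It assumes bill length is in millimeters, that `male` is the level coded as 1, and that a penguin is classified as male when its predicted probability exceeds 0.5 (the rule used later in these slides). The data frame `new_penguins` is a name made up for this illustration.

```r
# Coefficients copied from the output above
b <- c(-6.91208086, 0.00101530, 0.06112808)

# The two penguins from the question (hypothetical data frame name)
new_penguins <- data.frame(body_mass_g    = c(3900, 4100),
                           bill_length_mm = c(50, 35))

# Log-odds, then predicted probability of the level coded as 1
eta   <- b[1] + b[2] * new_penguins$body_mass_g +
                b[3] * new_penguins$bill_length_mm
p_hat <- plogis(eta)   # about 0.526 and 0.352

# ASSUMPTION: 1 = "male", classified with a 0.5 threshold
y_hat <- ifelse(p_hat > 0.5, "male", "female")
y_hat
```

Under these assumptions the first penguin is predicted to be male and the second female, even though the second is heavier: its short bill pulls the log-odds below zero.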

**Decide on the mathematical form of the model**: logistic linear regression

**Select a metric that defines the “best” fit**: the coefficients in logistic regression are the ones that minimize not the RSS function but a function called log-loss (which we don’t have time to cover)

**Estimating the coefficients of the model that are best using the training data**: we know how to do this: test + train + `glm()`!

**Evaluating predictive accuracy using a test data set**: \(R^2\) isn’t relevant here. We need a new metric!
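The code that follows uses data frames named `train` and `test` without showing how they were made. As a rough sketch of one way such a split could be produced, the block below simulates a stand-in data set and holds out 20% of the rows at random. The simulated values, the seed, the 80/20 proportion, and the name `penguins_sim` are all assumptions made for illustration, not the split actually used in these slides.

```r
set.seed(20)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the penguin data (simulated, not real measurements)
n   <- 350
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
penguins_sim <- data.frame(
  sex            = sex,
  body_mass_g    = rnorm(n, mean = ifelse(sex == "male", 4500, 3900), sd = 600),
  bill_length_mm = rnorm(n, mean = ifelse(sex == "male", 46, 42), sd = 5)
)

# Hold out 20% of the rows at random as the test set
test_rows <- sample(n, size = round(0.2 * n))
test  <- penguins_sim[test_rows, ]
train <- penguins_sim[-test_rows, ]

# Check the sizes of the two pieces
nrow(train)
nrow(test)
```

The model is then fit on `train` only, and `test` is kept aside until it is time to evaluate predictions.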

```
# Fit the logistic regression on the training data
m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
# Predicted probabilities for the test data
p_hat <- predict(m2, test, type = "response")
test %>%
  select(sex) %>%
  mutate(p_hat = p_hat)
```

```
# A tibble: 70 × 2
sex p_hat
<fct> <dbl>
1 female 0.345
2 male 0.566
3 female 0.259
4 male 0.280
5 male 0.365
6 female 0.196
7 male 0.428
8 female 0.220
9 male 0.559
10 male 0.279
# … with 60 more rows
```

```
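# Fit the logistic regression on the training data
m2 <- glm(sex ~ body_mass_g + bill_length_mm,
          data = train, family = "binomial")
# Predicted probabilities for the test data
p_hat <- predict(m2, test, type = "response")
# Turn each probability into a predicted class using a 0.5 threshold
test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"))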
```

```
# A tibble: 70 × 3
sex p_hat y_hat
<fct> <dbl> <chr>
1 female 0.345 female
2 male 0.566 male
3 female 0.259 female
4 male 0.280 female
5 male 0.365 female
6 female 0.196 female
7 male 0.428 female
8 female 0.220 female
9 male 0.559 male
10 male 0.279 female
# … with 60 more rows
```

**False Positives**: Predicting a 1 that is in fact a 0

**False Negatives**: Predicting a 0 that is in fact a 1

**Misclassification Rate**:

\[ \frac{\text{FP} + \text{FN}}{\text{total number of predictions}} \]

```
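# Flag each test-set prediction as a false positive or a false negative
# (here "male" plays the role of 1 and "female" the role of 0)
test %>%
  select(sex) %>%
  mutate(p_hat = p_hat,
         y_hat = ifelse(p_hat > .5, "male", "female"),
         FP = sex == "female" & y_hat == "male",
         FN = sex == "male" & y_hat == "female")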
```

```
# A tibble: 70 × 5
sex p_hat y_hat FP FN
<fct> <dbl> <chr> <lgl> <lgl>
1 female 0.345 female FALSE FALSE
2 male 0.566 male FALSE FALSE
3 female 0.259 female FALSE FALSE
4 male 0.280 female FALSE TRUE
5 male 0.365 female FALSE TRUE
6 female 0.196 female FALSE FALSE
7 male 0.428 female FALSE TRUE
8 female 0.220 female FALSE FALSE
9 male 0.559 male FALSE FALSE
10 male 0.279 female FALSE TRUE
# … with 60 more rows
```
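To get from the `FP` and `FN` columns to a misclassification rate, the flags can be counted and averaged. The sketch below does this for only the ten rows displayed above, retyped by hand into a tibble called `shown` (a name invented for this illustration), so the result describes those ten rows and not the full 70-row test set.

```r
library(dplyr)

# The ten rows displayed above (the full test set has 70 rows)
shown <- tibble(
  sex   = c("female", "male", "female", "male", "male",
            "female", "male", "female", "male", "male"),
  p_hat = c(0.345, 0.566, 0.259, 0.280, 0.365,
            0.196, 0.428, 0.220, 0.559, 0.279)
)

result <- shown %>%
  mutate(y_hat = ifelse(p_hat > .5, "male", "female"),
         FP = sex == "female" & y_hat == "male",
         FN = sex == "male" & y_hat == "female") %>%
  # Count each kind of error; mean() of a logical gives a proportion
  summarize(n_FP = sum(FP),
            n_FN = sum(FN),
            misclass_rate = mean(FP | FN))

result
```

For these ten rows there are 0 false positives and 4 false negatives, so the misclassification rate is 4/10 = 0.4. Applying the same `summarize()` step to the full `test` pipeline would give the rate for the whole test set.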