Lab 6: Diagnosing Cancer

Slides

Part I: Understanding the Context of the Data

Part II: Computing on the Data

You can load in the biopsies data frame using the code below:

library(tidyverse)
biopsies <- 
  read_csv("https://www.dropbox.com/s/0rbzonyrzramdgl/cells.csv?dl=1") |>
  mutate(diagnosis = factor(diagnosis, levels = c("B", "M")))

The diagnosis is stored in the column named diagnosis, with levels B (benign) and M (malignant); the other columns serve as predictors of the diagnosis.

  1. Make a single plot that examines the association between radius_mean and radius_sd separately for each diagnosis (hint: aes() should have three arguments).

  2. Calculate the correlation between these two variables for each diagnosis.

  3. Interpret the results of the previous two questions in at least two sentences. In particular, comment on:

  • Is the relationship between radius_mean and radius_sd different for benign biopsies vs. malignant biopsies?

  • If so, can you give an explanation for this difference?
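The grouped plot and per-group correlation asked for above follow a common tidyverse pattern. The sketch below illustrates that pattern on simulated data; the data frame toy and its columns group, x, and y are hypothetical stand-ins, not columns of biopsies, and the numbers carry no meaning about real biopsies.

```r
library(tidyverse)

# Simulated stand-in data (assumption: NOT the biopsies data).
set.seed(1)
toy <- tibble(
  group = factor(rep(c("B", "M"), each = 50), levels = c("B", "M")),
  x = rnorm(100, mean = if_else(group == "B", 12, 17), sd = 2),
  y = 0.1 * x + rnorm(100, sd = 0.3)
)

# aes() takes three arguments: x, y, and a color mapped to the grouping variable,
# so both groups appear on the same plot in different colors.
p <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point(alpha = 0.6)

# One correlation per group: group first, then summarize.
cors <- toy |>
  group_by(group) |>
  summarize(r = cor(x, y))

cors
```

Printing p draws the plot. Swapping in the real data frame and column names is left to the reader.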

  1. Split the data set into a roughly 80-20 train-test set split.

  2. Using the training data, fit a simple logistic regression model that predicts the diagnosis using the mean of the texture index.

  3. Using a threshold of 0.5, what would your model predict for a biopsy with a mean texture of 15? What probability does it assign to that outcome?

  4. Calculate and report two misclassification rates for your simple model: first on the training data and then on the testing data.

  5. Build a more complex model to predict the diagnosis using five predictors of your choosing.

  6. Calculate and report two misclassification rates for your complex model: first on the training data and then on the testing data.

  7. Is there any evidence that your model is overfitting? Explain in at least two sentences.

  8. Move back to your simple model for the next few questions. Report the total number of false negatives in the test data set.

  9. What can you change about your classification rule to lower the number of false negatives?

  10. Make the change you identified in the previous question and calculate the new number of false negatives.

  11. Calculate the testing misclassification rate using your new classification rule.

  12. Did your misclassification rate go up or down? Answer this question and explain why it went up or down in at least two sentences.
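The mechanics behind several of the questions above (splitting the data, fitting a logistic regression, classifying with a threshold, and counting errors) can be sketched in base R. This is a minimal illustration on simulated data, not a solution to the lab: the data frame toy, its columns x and y, the helper functions classify, misclass_rate, and false_negatives, and the alternative threshold of 0.3 are all assumptions made for the example.

```r
# Simulated stand-in data (assumption: NOT the biopsies data).
set.seed(2)
n <- 200
toy <- data.frame(x = rnorm(n, mean = 19, sd = 4))
# Larger x makes the outcome "M" more likely.
toy$y <- factor(ifelse(runif(n) < plogis(-6 + 0.3 * toy$x), "M", "B"),
                levels = c("B", "M"))

# Roughly 80-20 split: sample 80% of the row indices for training.
train_rows <- sample(n, size = round(0.8 * n))
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Logistic regression. With levels c("B", "M"), glm() treats the first
# level as failure, so the fitted probabilities are P(y == "M").
fit <- glm(y ~ x, data = train, family = binomial)

# Predicted probability of "M" for one new observation.
p_new <- predict(fit, newdata = data.frame(x = 15), type = "response")

# Hypothetical helper: predict "M" when the probability meets the threshold.
classify <- function(model, newdata, threshold = 0.5) {
  p <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(p >= threshold, "M", "B"), levels = c("B", "M"))
}

# Hypothetical helpers for the two error summaries.
misclass_rate   <- function(truth, pred) mean(truth != pred)
false_negatives <- function(truth, pred) sum(truth == "M" & pred == "B")

pred_50 <- classify(fit, test, threshold = 0.5)
pred_30 <- classify(fit, test, threshold = 0.3)  # 0.3 is an arbitrary choice

results <- data.frame(
  threshold     = c(0.5, 0.3),
  test_misclass = c(misclass_rate(test$y, pred_50),
                    misclass_rate(test$y, pred_30)),
  false_neg     = c(false_negatives(test$y, pred_50),
                    false_negatives(test$y, pred_30))
)
results
```

Training-set rates come from the same helpers with train in place of test. Lowering the threshold can only move predictions from "B" to "M", so the false-negative count cannot increase; what happens to the overall misclassification rate depends on the data.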