Lab 7: Baseball


Part I: Understanding the Context of the Data

Part II: Computing on the Data

The goal of this lab is to build three different regression models to predict the number of wins of a Major League Baseball team.

Use the following code to load the Teams data set from the Lahman database. Recall that you can view the help file for a data set by running ?Teams at the console.


library(Lahman)  # provides the Teams data frame
library(tibble)  # provides as_tibble()
# Convert the Teams object from a base R data frame to a tibble. This improves the formatting when the data frame is printed.
Teams <- as_tibble(Teams)
# A tibble: 3,015 × 48
   yearID lgID  teamID franchID divID  Rank     G Ghome     W     L DivWin WCWin
    <int> <fct> <fct>  <fct>    <chr> <int> <int> <int> <int> <int> <chr>  <chr>
 1   1871 NA    BS1    BNA      <NA>      3    31    NA    20    10 <NA>   <NA> 
 2   1871 NA    CH1    CNA      <NA>      2    28    NA    19     9 <NA>   <NA> 
 3   1871 NA    CL1    CFC      <NA>      8    29    NA    10    19 <NA>   <NA> 
 4   1871 NA    FW1    KEK      <NA>      7    19    NA     7    12 <NA>   <NA> 
 5   1871 NA    NY2    NNA      <NA>      5    33    NA    16    17 <NA>   <NA> 
 6   1871 NA    PH1    PNA      <NA>      1    28    NA    21     7 <NA>   <NA> 
 7   1871 NA    RC1    ROK      <NA>      9    25    NA     4    21 <NA>   <NA> 
 8   1871 NA    TRO    TRO      <NA>      6    29    NA    13    15 <NA>   <NA> 
 9   1871 NA    WS3    OLY      <NA>      4    32    NA    15    15 <NA>   <NA> 
10   1872 NA    BL1    BLC      <NA>      2    58    NA    35    19 <NA>   <NA> 
# ℹ 3,005 more rows
# ℹ 36 more variables: LgWin <chr>, WSWin <chr>, R <int>, AB <int>, H <int>,
#   X2B <int>, X3B <int>, HR <int>, BB <int>, SO <int>, SB <int>, CS <int>,
#   HBP <int>, SF <int>, RA <int>, ER <int>, ERA <dbl>, CG <int>, SHO <int>,
#   SV <int>, IPouts <int>, HA <int>, HRA <int>, BBA <int>, SOA <int>, E <int>,
#   DP <int>, FP <dbl>, name <chr>, park <chr>, attendance <int>, BPF <int>,
#   PPF <int>, teamIDBR <chr>, teamIDlahman45 <chr>, teamIDretro <chr>
  1. Subset the Teams data set to only include years from 2000 to present day (this is the data set that you’ll use for the remainder of this lab). What are the dimensions of this filtered data set?

  2. Plot the distribution of wins. Describe the shape of the distribution and compare it to your speculations from Part I of the lab.

  3. Plot the relationship between runs and wins. Describe the relationship (form, direction, strength of association, presence of outliers) and compare it to your speculations from Part I of the lab.

  4. Plot the relationship between runs allowed and wins. Describe the relationship. How does it compare to the relationship between runs and wins?

  5. Split your filtered version of the Teams data set into training and testing sets using the guidance provided in the notes (reserve at least 20% for the test set). Save them as teams_train and teams_test.

  6. Using the training data, fit a simple linear model to predict wins by runs and call it model_1. Write out the equation for the linear model (using the estimated coefficients) and report the training \(R^2\) as well as the testing \(R^2\).

  7. What is the average number of season runs and wins? Based on the previous model, how many games would you predict a team that scored the average number of runs would win? What about a team that scored 600 runs? What about 850 runs?

  8. Using the training data again, fit a multiple linear regression model to predict wins by runs and runs allowed and save it as model_2. Write out the equation for the linear model and report the training and testing \(R^2\). How does this model compare to the simple linear regression from the previous question?

  9. Fit a third, more complex model to predict wins and call it model_3. This model should use

    1. at least three variables from this data set,
    2. at least one non-linear transformation or polynomial term.

    Write out the equation for the resulting linear model and report the training and testing \(R^2\).

  10. Looking across all three models, how did the values of training \(R^2\) generally compare to the values of testing \(R^2\)? Which is the better metric when deciding how well a model will perform on new data? By that metric, which of your three models is best?

  11. Revisit the definition of causation. If your predictive model has a positive coefficient on one of the predictors, is that evidence that increasing that predictor for a given observation will cause the response variable to increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not?
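The mechanics behind steps 5 and 6 (splitting the data, then computing \(R^2\) on both the training and the testing set) can be sketched as follows. This is only one possible approach, not necessarily the one given in the notes: the simple random 80/20 split, the seed value, the object name teams_recent, and the helper r_squared() are all assumptions introduced for illustration. Testing \(R^2\) is computed here as \(1 - SSE/SST\) using the mean of the observed test wins, which is one common convention.

```r
library(Lahman)  # Teams data
library(dplyr)   # filter(), %>%, as_tibble()

# Keep only the seasons used in this lab (2000 onward).
# teams_recent is a hypothetical name chosen for this sketch.
teams_recent <- as_tibble(Teams) %>%
  filter(yearID >= 2000)

# Assumption: a simple random split with 80% of rows for training.
# The course notes may prescribe a different splitting method.
set.seed(7)  # arbitrary seed so the split is reproducible
train_rows  <- sample(nrow(teams_recent), size = floor(0.8 * nrow(teams_recent)))
teams_train <- teams_recent[train_rows, ]
teams_test  <- teams_recent[-train_rows, ]

# Hypothetical helper: R^2 = 1 - SSE / SST on whichever data set is supplied.
r_squared <- function(model, data, response) {
  observed  <- data[[response]]
  predicted <- predict(model, newdata = data)
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

# Simple linear regression of wins (W) on runs (R), fit on the training set only.
model_1 <- lm(W ~ R, data = teams_train)

r_squared(model_1, teams_train, "W")  # training R^2
r_squared(model_1, teams_test,  "W")  # testing R^2
```

On the training set, this helper gives the same value that summary(model_1)$r.squared reports. The same helper can be applied to model_2 and model_3 without modification, since it only needs a fitted model and a data set to predict on.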