Lab 7: Sports

release date

April 14, 2023

Discuss Slides

Part I: Understanding the Context of the Data

In this lab, you will have the opportunity to work with datasets from a few different sports. Each sport has a teams and players dataset associated with it,

Part II: Computing on the Data

You can download these questions into a .qmd file by running the following line of code at the console in R.

usethis::use_course("https://tinyurl.com/lab-7-sports")

The following code snippet cell will load into your R session the data sets, which are stored in a comma separated values file (.csv) online. Make sure you load the tidyverse library at the top of your .qmd document as well, as read_csv() is a function within this library.

library(tidyverse)
teams <- read_csv("insert data set link here", show_col_types = FALSE)
players <- read_csv("insert data set link here", show_col_types = FALSE)

The questions throughout this Part (questions 9-20) apply solely to the teams dataset.


  1. What are the dimensions of your teams dataset?

  2. Using your teams dataset, form and save a training and testing set using the guidance provided in the notes (reserve at least 20% for the test set).

Model One: Simple Linear Model

  1. Create a plot of the relationship between a predictor and a response variable of your choice, selecting a response that would be useful to predict and a predictor that you expect will have a strong relationship with the response. State why you chose each of your variables, then describe the relationship in the plot in terms of its shape (linear or non-linear), strength, and direction (generally positive or negative).

  2. Using the training data, fit a simple linear model using this pair of variables. Write out the equation for the linear model (using the estimated coefficients) and report your training \(R^2\).

  3. What is the average value of your predictor variable and the average value of your response variable? Using your model, what would you predict the value of the response to be for a new observation that takes as its predictor value the average predictor value from th training data?

  4. Calculate and report your testing \(R^2\) and compare it to your training \(R^2\) in one or two sentences.

Model Two: “Everything but the kitchen sink”

Be sure to read this footnote2.

  1. Using the same train-test approach, fit a second linear model to predict your response variable which satisfies the following:
  • includes all other variables in the data set as explanatory variables
  • at least two non-linear transformations
  • at least one polynomial term (up to degree 2), involving a variable you expect is non-linearly associated with your response variable.
  1. Report your training and testing \(R^2\)s. Compare the two values and explain the reasoning behind the relationship you notice.

Model Three: Your best choice

  1. Using the same train-test approach, fit a linear model that you think best predicts your response variable. You may use any number of variables you wish and employ non-linear transformations and polynomials if you would like, too. Report your training \(R^2\).

  2. Report your testing \(R^2\) and compare it to your training \(R^2\) in one or two sentences.

  3. Reviewing all three of your models, what general guidance would you give to someone who seeks to build predictive models for the sport you chose?

  4. Revisit the definition of causation. If your predictive model has a positive coefficient between one of the predictors and the response, is that evidence that if you increase that predictor variable for a given observation, the response variable will increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not?

Footnotes

  1. Both the teams and the player datasets were created using the r package worldfootballR, and the data was obtained from the website FBRef.↩︎

  2. Note about questions 15 and 16: As these are real data sets, they have characteristics that make them challenging to fit a model to out-of-the-box (e.g. missing values). These challenges become more visible the more variables that you include in your model. Using the suggestions from the Ed post on this lab, do your best to fit a very complex model, but we don’t want you too much time getting this model to run.

    As an alternative, toggle the code cells for 15 and 16 to eval: false as below.

    ```{r}
    #| eval: false
    
    # Here is my code that fits my complex model...
    ```

    This will cause your code to be printed, but not run, so that we can see what you tried.

    In place of a fitted model and corresponding R^2, please describe in a paragraph the different techniques that you tried for building your model, the errors / challenges that you encountered, and how you tried to fix those errors.↩︎