usethis::use_course("https://tinyurl.com/lab7sports")
Lab 7: Sports
Part I: Understanding the Context of the Data
In this lab, you will have the opportunity to work with datasets from a few different sports. Each sport has a teams
and players
dataset associated with it,

Baseball

teams
dataset preview 
teams
dataset https://www.dropbox.com/s/qcyu24rcldfd8u9/baseball_teams_init.csv?dl=1 
players
dataset preview 
players
dataset https://www.dropbox.com/s/oipadwa5lc2rp8v/baseball_players_init.csv?dl=1


Football

teams
dataset preview 
teams
dataset https://www.dropbox.com/s/a0i9mpfrjghhk5t/football_teams_init.csv?dl=1 
players
dataset preview 
players
dataset https://www.dropbox.com/s/uyn69av0x0jnbnu/football_players_init.csv?dl=1


Soccer^{1}

teams
dataset preview 
teams
dataset https://www.dropbox.com/s/835bho6ijmyx5am/soccer_teams_init.csv?dl=1 
players
dataset preview 
players
dataset https://www.dropbox.com/s/1pyqn1fxiwcf5bb/soccer_players_init.csv?dl=1

Part II: Computing on the Data
You can download these questions into a .qmd file by running the following line of code at the console in R.
The following code snippet cell will load into your R session the data sets, which are stored in a comma separated values file (.csv) online. Make sure you load the tidyverse
library at the top of your .qmd document as well, as read_csv()
is a function within this library.
The questions throughout this Part (questions 920) apply solely to the teams
dataset.
What are the dimensions of your
teams
dataset?Using your
teams
dataset, form and save a training and testing set using the guidance provided in the notes (reserve at least 20% for the test set).
Model One: Simple Linear Model
Create a plot of the relationship between a predictor and a response variable of your choice, selecting a response that would be useful to predict and a predictor that you expect will have a strong relationship with the response. State why you chose each of your variables, then describe the relationship in the plot in terms of its shape (linear or nonlinear), strength, and direction (generally positive or negative).
Using the training data, fit a simple linear model using this pair of variables. Write out the equation for the linear model (using the estimated coefficients) and report your training \(R^2\).
What is the average value of your predictor variable and the average value of your response variable? Using your model, what would you predict the value of the response to be for a new observation that takes as its predictor value the average predictor value from th training data?
Calculate and report your testing \(R^2\) and compare it to your training \(R^2\) in one or two sentences.
Model Two: “Everything but the kitchen sink”
Be sure to read this footnote^{2}.
 Using the same traintest approach, fit a second linear model to predict your response variable which satisfies the following:
 includes all other variables in the data set as explanatory variables
 at least two nonlinear transformations
 at least one polynomial term (up to degree 2), involving a variable you expect is nonlinearly associated with your response variable.
 Report your training and testing \(R^2\)s. Compare the two values and explain the reasoning behind the relationship you notice.
Model Three: Your best choice
Using the same traintest approach, fit a linear model that you think best predicts your response variable. You may use any number of variables you wish and employ nonlinear transformations and polynomials if you would like, too. Report your training \(R^2\).
Report your testing \(R^2\) and compare it to your training \(R^2\) in one or two sentences.
Reviewing all three of your models, what general guidance would you give to someone who seeks to build predictive models for the sport you chose?
Revisit the definition of causation. If your predictive model has a positive coefficient between one of the predictors and the response, is that evidence that if you increase that predictor variable for a given observation, the response variable will increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not?
Footnotes
Both the teams and the player datasets were created using the
r
packageworldfootballR
, and the data was obtained from the website FBRef.↩︎
Note about questions 15 and 16: As these are real data sets, they have characteristics that make them challenging to fit a model to outofthebox (e.g. missing values). These challenges become more visible the more variables that you include in your model. Using the suggestions from the Ed post on this lab, do your best to fit a very complex model, but we don’t want you too much time getting this model to run.
As an alternative, toggle the code cells for 15 and 16 to
eval: false
as below.```{r} # eval: false # Here is my code that fits my complex model... ```
This will cause your code to be printed, but not run, so that we can see what you tried.
In place of a fitted model and corresponding R^2, please describe in a paragraph the different techniques that you tried for building your model, the errors / challenges that you encountered, and how you tried to fix those errors.↩︎