# Lab 5: Baseball

## Part II: Computing on the Data

The goal of this lab is to build three different regression models to predict the number of wins of a Major League Baseball team.

Use the following code to load in the Teams dataset from the Lahman database (library). Recall that you can query the help file for a data set by running ?Teams at the console.

1. Subset the Teams data set to only include years from 2000 to present day (this is the data set that you’ll use for the remainder of this lab. There might be another year post-2000 that you might want to filter out: why?). What are the dimensions of this filtered data set?

2. Plot the distribution of wins. Describe the shape of the distribution and compare it to your speculations from part 1 of the lab.

3. Plot the relationship between runs and wins. Describe the relationship (form, direction, strength of association, presence of outliers) and compare it to your speculations from part 1 of the lab.

4. Plot the relationship between runs allowed and wins. Describe the relationship. How does it compare to the relationship between runs and wins?

5. Fit a simple linear model to predict wins by runs and call it model_1. Write out the equation for the linear model (using the estimated coefficients) and interpret the $$R^2$$ in the context of the problem in at least one sentence.

6. What is the average number of season runs and wins? Based on the previous model, how many games would you predict a team that scored the average number of runs would win?

7. What about a team that scored 600 runs? What about 850 runs? What about 10,000 runs? Would any of these predictions be inaccurate and why?

8. Fit a multiple linear regression model to predict wins by runs and runs allowed (RA) and save it as model_2. Write out the equation for the linear model and report the $$R^2$$. How does this model compare to the simple linear regression from the previous question?

9. Fit a third, more complex model to predict wins and call it model_3. This model should use

1. at least three variables from this data set,
2. at least one non-linear transformation or polynomial term.

Write out the equation for the resulting linear model and report the $$R^2$$.

10. Revisit the definition of causation. If your predictive model has a positive coefficient between one of the predictors and the response, is that evidence that if you increase that predictor variable for a given observation, the response variable will increase? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not? Answer in at least three sentences.