Case Study: Pricing Homes

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Questions
  • Problem Set 7.2
  • Break
  • Lab 7.2

Concept Questions

Consider two houses for sale, both 1,100 sqft, 2 bedroom, 1 bathroom, with a small garage, but one is in Santa Monica and the other is in Westwood. Which is true of the predicted sale prices of these two homes?


Call:
lm(formula = log_price ~ log_sqft + city, data = LA)

Coefficients:
     (Intercept)          log_sqft    cityLong Beach  citySanta Monica  
         5.46554           1.15119          -0.89345          -0.09301  
    cityWestwood  
        -0.45846  
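
One way to check this by hand, sketched under the assumption that log_price and log_sqft are natural logs (the output does not say). Both homes have the same square footage, so the intercept and log_sqft terms cancel and only the two city coefficients matter. The coefficients are copied from the output above; the variable names are illustrative.

```r
# Coefficients copied from the lm() output above
coefs <- c(
  "(Intercept)"      = 5.46554,
  "log_sqft"         = 1.15119,
  "cityLong Beach"   = -0.89345,
  "citySanta Monica" = -0.09301,
  "cityWestwood"     = -0.45846
)

# Assumption: natural logs were used to build log_sqft and log_price
log_sqft <- log(1100)

# Predicted log price for each home (same size, different city)
base_part   <- coefs[["(Intercept)"]] + coefs[["log_sqft"]] * log_sqft
pred_log_sm <- base_part + coefs[["citySanta Monica"]]
pred_log_ww <- base_part + coefs[["cityWestwood"]]

# The difference in predicted log price depends only on the city coefficients
diff_log <- pred_log_sm - pred_log_ww   # 0.36545

# Back on the price scale, a difference in logs is a ratio of prices
price_ratio <- exp(diff_log)            # about 1.44
price_ratio
```

Under that assumption, the model predicts the Santa Monica home to sell for roughly 1.44 times the price of the Westwood home, whatever the shared square footage is.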

A simple model for price


m4 <- lm(log_price ~ bed, data = LA)


What do you expect the sign of the coefficient for bed to be?

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   11.8      0.0436     271.  0        
2 bed            0.532    0.0142      37.3 9.77e-220

A less simple model for price


m5 <- lm(log_price ~ bed + log_sqft, data = LA)


What do you expect the signs of the coefficients for bed and log_sqft to be?

# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    1.47     0.218       6.73 2.28e- 11
2 bed           -0.123    0.0164     -7.46 1.46e- 13
3 log_sqft       1.66     0.0346     47.8  2.60e-310

What is the relationship between bed and log_price?

What is the relationship between log_sqft and log_price?

What is the relationship between log_sqft and log_price, controlling for bed?

What is the relationship between bed and log_price, controlling for log_sqft?
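
These four questions can have different answers. A minimal sketch with simulated data (not the LA data set; every number in the simulation is made up for illustration): larger homes tend to have more bedrooms, so the coefficient for bed changes sign once log_sqft enters the model.

```r
set.seed(20)
n <- 500

# Hypothetical data: home size on the log scale
log_sqft <- rnorm(n, mean = 7.3, sd = 0.5)

# Bedrooms increase with size (rounded, at least 1)
bed <- pmax(1, round(-18 + 3 * log_sqft + rnorm(n, sd = 0.8)))

# Price rises with size; at a fixed size, more bedrooms lowers it slightly
log_price <- 1.5 + 1.7 * log_sqft - 0.12 * bed + rnorm(n, sd = 0.3)

sim <- data.frame(log_price, bed, log_sqft)

# Relationship between bed and log_price, ignoring size
b_simple <- coef(lm(log_price ~ bed, data = sim))[["bed"]]

# Relationship between bed and log_price, controlling for log_sqft
b_ctrl <- coef(lm(log_price ~ bed + log_sqft, data = sim))[["bed"]]

c(simple = b_simple, controlled = b_ctrl)
```

In this simulation the first coefficient should be positive and the second negative, the same pattern as in the two tables above.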

Simpson’s Paradox

Simpson’s paradox, which also goes by several other names, is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined.

Source: Wikipedia
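
A toy illustration with made-up numbers (the names x, y, and group are hypothetical): within each group the slope is negative, but the slope is positive when the groups are combined.

```r
# Two groups; group B sits higher and further right than group A
toy <- data.frame(
  group = rep(c("A", "B"), each = 4),
  x     = c(1, 2, 3, 4, 6, 7, 8, 9)
)

# Within each group, y falls by 0.5 for every unit increase in x
toy$y <- ifelse(toy$group == "A", 5, 12) - 0.5 * toy$x

# Slope within each group
slope_A <- coef(lm(y ~ x, data = subset(toy, group == "A")))[["x"]]
slope_B <- coef(lm(y ~ x, data = subset(toy, group == "B")))[["x"]]

# Slope when the groups are pooled
slope_pooled <- coef(lm(y ~ x, data = toy))[["x"]]

c(A = slope_A, B = slope_B, pooled = slope_pooled)
```

The within-group slopes are both -0.5, while the pooled slope is 2/3.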

What can we build with data?

A prediction machine.

A summary of a data set.

A generalization to a population.

A causal explanation.

Model Interpretation

Question 1: What is the relationship between the number of bedrooms in a house and its price?

\[ \widehat{\textrm{log(price)}} = 11.8 + 0.53 \textrm{bed}\]

Question 2: After controlling for the size of a house, what is the relationship between the number of bedrooms in a house and its price?

\[ \widehat{\textrm{log(price)}} = 1.47 - 0.12 \textrm{bed} + 1.66 \textrm{log(sqft)}\]
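
Because the response is on the log scale, the coefficients are easier to read after transforming them back. A rough sketch, assuming natural logs were used (not stated in the output); the estimates are copied from the two coefficient tables and the variable names are illustrative.

```r
# Estimates copied from the coefficient tables
b_bed_simple <- 0.532    # bed, in log_price ~ bed
b_bed_ctrl   <- -0.123   # bed, in log_price ~ bed + log_sqft
b_log_sqft   <- 1.66     # log_sqft, in log_price ~ bed + log_sqft

# Question 1: multiplicative change in predicted price per extra bedroom
ratio_simple <- exp(b_bed_simple)   # about 1.70

# Question 2: the same, holding log_sqft fixed
ratio_ctrl <- exp(b_bed_ctrl)       # about 0.88

# Multiplicative change in predicted price for a 1% larger home, holding bed fixed
ratio_sqft <- 1.01^b_log_sqft       # about 1.017

c(ratio_simple, ratio_ctrl, ratio_sqft)
```

Read this way, one more bedroom goes with a predicted price about 70% higher when size is ignored, but about 12% lower when comparing homes of the same size.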

The Tradeoff between flexibility and interpretability

Fig 2.7 from An Introduction to Statistical Learning with R by James, Witten, Hastie, and Tibshirani.

Problem Set 7.2

This address is the property that UC acquired to serve as the home of the UC system's president (you can Google it to find news articles).

Lab 7.2 Work
