Causal Effects in Observational Studies

STAT 20: Introduction to Probability and Statistics

Agenda

Announcements
Concept Questions
Problem Set 19

Announcements

PS 19 and PS 20 both due Tuesday 4/29 at 9:00 AM
Final exam review sessions:
- Summarization: 12pm-1pm Monday 4/29, Stanley 105
- Causality: 3pm-4pm Monday 4/29, Stanley 105
- Generalization: 3pm-4pm Wednesday 5/1, VLSB 2050
- Probability: 4pm-5pm Wednesday 5/1, VLSB 2050
- Prediction: 3pm-4pm Friday 5/3, Stanley 105
Final exam: 7pm-10pm, Thursday 5/9, room TBA.
Please fill out course evals!

Concept Questions

To study the impact of receiving permanent resident status on mental health, we compare answers to a psychiatric survey from people who entered and won the US green card lottery to answers from others who entered but did not win.

What kind of study is this?

A randomized trial.
A natural experiment.
An observational study that requires matching.
None of the above.

01:00

To study the impact of childhood trauma on later academic performance, we compare GRE scores for students who lost a close family member in an automobile accident before the age of 8 to GRE scores for students who did not lose a close family member before age 8.

What kind of study is this?

A randomized trial.
A natural experiment.
An observational study that requires matching.
None of the above.

01:00

To study the effectiveness of a blood pressure medication, we enroll 500 patients. We take the blood pressure of all patients before anyone receives medication. We assign the 200 patients with the highest blood pressure readings to get the medication, assigning the others to be controls.

What kind of study is this?

A randomized trial.
A natural experiment.
An observational study that requires matching.
None of the above.

01:00
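
The assignment rule in this scenario can be contrasted with random assignment in a short simulation. This is only a sketch: the baseline readings are simulated (the mean of 140 and SD of 15 are arbitrary assumptions, not data from any study), and the names `bp`, `by_rule`, and `at_random` are made up for illustration.

```r
set.seed(20)  # arbitrary seed so the sketch is reproducible

# Hypothetical baseline readings for 500 patients (simulated, not real data).
bp <- rnorm(500, mean = 140, sd = 15)

# Rule from the slide: the 200 highest readings get the medication.
by_rule <- rank(-bp, ties.method = "first") <= 200

# For contrast: 200 patients chosen completely at random.
at_random <- sample(rep(c(TRUE, FALSE), times = c(200, 300)))

# Mean baseline blood pressure by group under each assignment scheme.
baseline <- rbind(
  by_rule   = tapply(bp, by_rule, mean),
  at_random = tapply(bp, at_random, mean)
)
colnames(baseline) <- c("control", "medication")
baseline

# Under the rule, the two groups do not overlap at baseline at all.
min(bp[by_rule]) > max(bp[!by_rule])
```

Under the rule, every medicated patient starts with a higher reading than every control, so there is no control patient with a comparable baseline. Under random assignment, the two groups' baseline means should be close.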

On the next slide, you will see the first few rows of a dataset containing demographic information on California counties. Scroll to see all of the rows.

We want to use matching to determine whether median_edu has a causal effect on homeownership. Which county serves as the best counterfactual match for Fresno County?

Kern County
Alameda County
Contra Costa County
Shasta County
Del Norte County

02:00

name	homeownership	median_edu	metro	smoking_ban
Fresno County	55.0	some_college	yes	none
Colusa County	64.4	hs_diploma	no	none
Del Norte County	60.9	hs_diploma	no	none
Alameda County	55.1	some_college	yes	none
Contra Costa County	69.5	some_college	yes	partial
Glenn County	67.5	hs_diploma	no	none
Shasta County	66.0	some_college	yes	none
Kern County	61.4	hs_diploma	yes	none
San Luis Obispo County	61.4	some_college	yes	none
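
One way to look for Fresno County's closest counterfactual is to count how many covariates differ between it and each county with the other median_edu value. The sketch below enters the table by hand in base R. The mismatch count is an assumed, simple distance chosen for illustration, and the `mismatches` column is not part of the original data.

```r
# County table from the slide, entered by hand.
county <- data.frame(
  name = c("Fresno County", "Colusa County", "Del Norte County",
           "Alameda County", "Contra Costa County", "Glenn County",
           "Shasta County", "Kern County", "San Luis Obispo County"),
  homeownership = c(55.0, 64.4, 60.9, 55.1, 69.5, 67.5, 66.0, 61.4, 61.4),
  median_edu = c("some_college", "hs_diploma", "hs_diploma", "some_college",
                 "some_college", "hs_diploma", "some_college", "hs_diploma",
                 "some_college"),
  metro = c("yes", "no", "no", "yes", "yes", "no", "yes", "yes", "yes"),
  smoking_ban = c("none", "none", "none", "none", "partial", "none",
                  "none", "none", "none"),
  stringsAsFactors = FALSE
)

fresno <- county[county$name == "Fresno County", ]

# A counterfactual must have the other median_edu value.
candidates <- county[county$median_edu != fresno$median_edu, ]

# Assumed distance: number of covariates (metro, smoking_ban) that differ.
candidates$mismatches <- (candidates$metro != fresno$metro) +
  (candidates$smoking_ban != fresno$smoking_ban)

# Closest candidates first.
candidates[order(candidates$mismatches),
           c("name", "metro", "smoking_ban", "mismatches")]
```

Only candidates with a different median_edu value are considered, because a county with the same education level cannot stand in for what Fresno County would look like under the other level.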

In this table there are nine counties, five with some_college values for median_edu and four with hs_diploma values.

How many counties of each type will remain after we conduct optimal matching on metro and smoking_ban?

some_college: 4, hs_diploma: 4.
some_college: 5, hs_diploma: 4.
some_college: 2, hs_diploma: 2.
some_college: 2, hs_diploma: 4.
Can’t tell without more information.

01:00

This question is designed to shift students away from thinking about individual matched pairs toward thinking about how matching reshapes an entire dataset. The correct answer is the first option (some_college: 4, hs_diploma: 4), since every hs_diploma county is matched to exactly one some_college county, which leaves one some_college county unmatched.

This question is also a good jumping-off point for a mini-lecture about matching. The county example is not ideal: there are few close matches, and there are many ties among the distances, so the best match is not unique (although Contra Costa County is probably the one that will get dropped, since it looks the least like any of the hs_diploma counties).
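
The group counts and the likely dropped county described in these notes can be checked by brute force, since the table is tiny. The sketch below is not how MatchIt computes optimal matches. It simply enumerates every way to pair each hs_diploma county with a distinct some_college county and keeps a pairing with the smallest total distance. The distance (number of mismatched covariates) is an assumption made for illustration.

```r
# Covariates from the slide's county table, entered by hand.
county <- data.frame(
  name = c("Fresno County", "Colusa County", "Del Norte County",
           "Alameda County", "Contra Costa County", "Glenn County",
           "Shasta County", "Kern County", "San Luis Obispo County"),
  median_edu = c("some_college", "hs_diploma", "hs_diploma", "some_college",
                 "some_college", "hs_diploma", "some_college", "hs_diploma",
                 "some_college"),
  metro = c("yes", "no", "no", "yes", "yes", "no", "yes", "yes", "yes"),
  smoking_ban = c("none", "none", "none", "none", "partial", "none",
                  "none", "none", "none"),
  stringsAsFactors = FALSE
)

treated <- county[county$median_edu == "hs_diploma", ]
control <- county[county$median_edu == "some_college", ]

# Distance matrix: rows are hs_diploma counties, columns are some_college
# counties. Assumed distance: number of covariates that differ.
d <- outer(seq_len(nrow(treated)), seq_len(nrow(control)),
           function(i, j) (treated$metro[i] != control$metro[j]) +
                          (treated$smoking_ban[i] != control$smoking_ban[j]))

# Every assignment of a distinct some_college county to each hs_diploma county.
assignments <- expand.grid(rep(list(seq_len(nrow(control))), nrow(treated)))
assignments <- assignments[apply(assignments, 1,
                                 function(a) !anyDuplicated(a)), ]

# Total distance of each assignment; keep the first one with the minimum.
total <- apply(assignments, 1, function(a) sum(d[cbind(seq_along(a), a)]))
best <- unlist(assignments[which.min(total), ])

matched <- data.frame(hs_diploma = treated$name,
                      some_college = control$name[best])
unmatched <- setdiff(control$name, matched$some_college)

matched
unmatched
```

Because of the ties, many pairings share the minimum total distance, so the specific pairs printed are somewhat arbitrary. What is stable under this distance is which counties remain: four of each type.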

A better source of material for the mini-lecture is the “matching_mini_lecture.docx” file, which covers final exam scores and review session attendance. Eventually this example might be worth incorporating into the notes.

Which R command correctly performs matching on covariates to measure the impact of median_edu on homeownership?

matchit(homeownership ~ median_edu, data = county, method = 'optimal', distance = 'euclidean')
matchit(median_edu ~ homeownership, data = county, method = 'optimal', distance = 'euclidean')
matchit(median_edu ~ metro + smoking_ban, data = county, method = 'optimal', distance = 'euclidean')
matchit(homeownership ~ median_edu + metro + smoking_ban, data = county, method = 'optimal', distance = 'euclidean')

01:00
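
In MatchIt, the formula passed to `matchit()` has the form treatment ~ covariates, and the outcome is left out of the matching step. The sketch below applies that form to the county table. It rests on a few assumptions: the MatchIt and optmatch packages are installed (the call is skipped otherwise), the table is entered by hand, and the treatment is recoded as a hypothetical 0/1 column `hs_diploma` so that the smaller group is the one being matched.

```r
# County table from the slide, entered by hand.
county <- data.frame(
  name = c("Fresno County", "Colusa County", "Del Norte County",
           "Alameda County", "Contra Costa County", "Glenn County",
           "Shasta County", "Kern County", "San Luis Obispo County"),
  homeownership = c(55.0, 64.4, 60.9, 55.1, 69.5, 67.5, 66.0, 61.4, 61.4),
  median_edu = c("some_college", "hs_diploma", "hs_diploma", "some_college",
                 "some_college", "hs_diploma", "some_college", "hs_diploma",
                 "some_college"),
  metro = c("yes", "no", "no", "yes", "yes", "no", "yes", "yes", "yes"),
  smoking_ban = c("none", "none", "none", "none", "partial", "none",
                  "none", "none", "none"),
  stringsAsFactors = FALSE
)

# Assumption: code hs_diploma as the treated group (1), some_college as 0.
county$hs_diploma <- as.integer(county$median_edu == "hs_diploma")
county$metro <- factor(county$metro)
county$smoking_ban <- factor(county$smoking_ban)

# method = "optimal" relies on the optmatch package, so check for both.
if (requireNamespace("MatchIt", quietly = TRUE) &&
    requireNamespace("optmatch", quietly = TRUE)) {
  # Treatment on the left, matching covariates on the right; no outcome.
  m <- MatchIt::matchit(hs_diploma ~ metro + smoking_ban, data = county,
                        method = "optimal", distance = "euclidean")

  # match.data() returns only the counties that were kept by the matching.
  matched <- MatchIt::match.data(m)
  print(table(matched$median_edu))
  print(matched[order(matched$subclass), c("name", "median_edu", "subclass")])
}
```

The outcome, homeownership, would only enter afterwards, when comparing the matched groups.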

Assuming that metro and smoking_ban variables are the only ones we have measured, name an unmeasured variable that could introduce confounding between median_edu and homeownership.

02:00

Break

05:00

Problem Set

60:00