# Data Pipelines

STAT 20: Introduction to Probability and Statistics

## Agenda

• Concept Questions
• Problem Set 4
• Break
• Lab 2.1: Flights

## Announcements

• RQ: A Grammar of Graphics due Wednesday at 11:59pm
• Problem Set 4 (paper, max. 3) due next Tuesday at 9am
• Lab 2.1 (paper, max. 2) due next Tuesday at 9am
• Quiz 1 next Monday at 11:59pm (direct logistical and content questions to the syllabus and megathread on Ed).

# Concept Questions

## Question 1

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

Respond at pollev.com.

01:00

## Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"
[1]  TRUE  TRUE FALSE
c("fruit", "fruit", "vegetable") != "fruit"
[1] FALSE FALSE  TRUE
c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")
[1]  TRUE  TRUE FALSE
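
The element-wise behavior above can be tried on any vector. Here is a minimal sketch in base R; the vector foods and its values are made up for illustration and are not part of the class data.

```r
# Hypothetical vector of food categories (made up for illustration).
foods <- c("fruit", "vegetable", "boba", "fruit")

# == compares each element of foods to the single value "fruit".
is_fruit <- foods == "fruit"

# %in% asks, for each element of foods, whether it appears anywhere in the set.
is_produce <- foods %in% c("fruit", "vegetable")

# Each result has one logical value per element of the input.
length(is_fruit) == length(foods)
is_fruit
is_produce
```

When checking against several allowed values, %in% is usually the safer choice: == would recycle the right-hand vector and compare position by position rather than test set membership.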

## Question 2

Which observations will be included in the following data frame?

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE)

Please respond at pollev.com.

01:00

## Question 3

Which data frame will have fewer rows?

# this one
filter(class_survey, year == "This is my first semester!")

# or this one
class_survey |>
  mutate(first_sem = (year == "This is my first semester!")) |>
  filter(first_sem)

01:00

# Concept Question 2 Redux - Building data pipelines

How do we compute the average, across these students, of the chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

## Nesting

filter(class_survey,
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE)

## Nesting

select(filter(class_survey,
              coding_exp_scale < 3,
              olympic_sport %in% c("Ice skating", "Speed skating"),
              entrepreneur == TRUE),
       coding_exp_scale,
       olympic_sport,
       entrepreneur,
       new_COVID_variant)

## Nesting

summarize(select(filter(class_survey,
                        coding_exp_scale < 3,
                        olympic_sport %in% c("Ice skating", "Speed skating"),
                        entrepreneur == TRUE),
                 coding_exp_scale,
                 olympic_sport,
                 entrepreneur,
                 new_COVID_variant),
          covid_avg = mean(new_COVID_variant))
# A tibble: 1 × 1
  covid_avg
      <dbl>
1      0.52

Cons

• Must be read from inside out 👎
• Hard to keep track of arguments 👎

Pros

• All in one statement 👍
• Only refer to one data frame 👍

## Step-by-step

df1 <- filter(class_survey,
              coding_exp_scale < 3,
              olympic_sport %in% c("Ice skating", "Speed skating"),
              entrepreneur == TRUE)
df2 <- select(df1,
              coding_exp_scale,
              olympic_sport,
              entrepreneur,
              new_COVID_variant)
summarize(df2,
          covid_avg = mean(new_COVID_variant))

Cons

• Have to repeat data frame names 👎
• Creates unnecessary objects 👎

Pros

• Stores intermediate objects 👍
• Can be read top to bottom 👍

## Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |>
  summarize(covid_avg = mean(new_COVID_variant))

Cons

• 🤷

Pros

• Can be read like an English paragraph 👍
• Only type the data once 👍
• No leftover objects 👍
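
To see that the three styles are interchangeable, here is a minimal sketch that runs all of them on the same data and compares the results. It assumes dplyr is installed; toy_survey and its values are made up for illustration and only borrow column names from class_survey.

```r
library(dplyr)

# Hypothetical toy data standing in for class_survey (values are made up).
toy_survey <- tibble(
  coding_exp_scale  = c(1, 2, 5, 2),
  entrepreneur      = c(TRUE, TRUE, TRUE, FALSE),
  new_COVID_variant = c(0.2, 0.4, 0.9, 0.1)
)

# 1. Nesting: read from the inside out.
nested <- summarize(
  filter(toy_survey, coding_exp_scale < 3, entrepreneur == TRUE),
  covid_avg = mean(new_COVID_variant)
)

# 2. Step-by-step: stores an intermediate object.
df1 <- filter(toy_survey, coding_exp_scale < 3, entrepreneur == TRUE)
stepwise <- summarize(df1, covid_avg = mean(new_COVID_variant))

# 3. Pipe: the left-hand side becomes the first argument of the next call.
piped <- toy_survey |>
  filter(coding_exp_scale < 3, entrepreneur == TRUE) |>
  summarize(covid_avg = mean(new_COVID_variant))

# All three approaches compute the same average.
identical(nested$covid_avg, stepwise$covid_avg)
identical(stepwise$covid_avg, piped$covid_avg)
```

The pipe changes only how the code is written, not what it computes: x |> f(y) is another way of writing f(x, y).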

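It’s good practice to understand the output of each line of code by breaking the pipe: run it one step at a time and inspect the result. In the example below, select() keeps only new_COVID_variant, so the year column is no longer in the data frame when filter() runs.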

class_survey |>
  select(new_COVID_variant) |>
  filter(year == "It's my first year.")
Error in filter():
ℹ In argument: year == "It's my first year.".
Caused by error in year == "It's my first year.":
! comparison (==) is possible only for atomic and list types

class_survey |>
  select(new_COVID_variant)
# A tibble: 619 × 1
new_COVID_variant
<dbl>
1           0.25
2           0.1
3           0
4           0.2
5           0.9
6           0.2
7           0.4
8           0.00005
9           0.2
10           0.3
# ℹ 609 more rows
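
The same check can be practiced on a small example. This is a minimal sketch, assuming dplyr is installed; toy_survey and its values are made up, and reordering the steps is one possible fix rather than the only one.

```r
library(dplyr)

# Hypothetical toy data standing in for class_survey (values are made up).
toy_survey <- tibble(
  year              = c("first", "second", "first"),
  new_COVID_variant = c(0.25, 0.1, 0.4)
)

# Break the pipe: run only the first step and inspect what comes out.
step1 <- toy_survey |>
  select(new_COVID_variant)
names(step1)  # year is gone, so a later filter() on year cannot work

# One possible fix: filter while year is still available, then select.
fixed <- toy_survey |>
  filter(year == "first") |>
  select(new_COVID_variant)
nrow(fixed)
```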

## Concept Question 2 Redux

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating",
                              "Speed skating"),
         entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |> # C
  summarize(covid_avg = mean(new_COVID_variant)) # D

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

## Concept Question 4

summarize(class_survey, mean(year == "I'm in my first year."))

What will this line of code return?

Respond at pollev.com.

## Boolean Algebra

Logical vectors have a dual representation as TRUE/FALSE and 1/0, so you can do arithmetic on them accordingly.

TRUE + TRUE
[1] 2
TRUE * TRUE
[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of its elements that are TRUE (i.e. the proportion of rows that meet the condition).
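
The link between the mean and the proportion can be checked directly. Here is a minimal sketch in base R; the vector year and its values are made up for illustration.

```r
# Hypothetical responses (made up for illustration).
year <- c("first", "second", "first", "third", "first")

# The comparison gives a logical vector...
is_first <- year == "first"

# ...which R treats as 1s and 0s in arithmetic.
n_true    <- sum(is_first)   # count of TRUE values
prop_true <- mean(is_first)  # proportion of TRUE values

# The same proportion computed by hand: count of TRUEs over total length.
by_hand <- n_true / length(is_first)

prop_true
by_hand
```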

# Problem Set 4: Data Pipelines

25:00

# Break

05:00

# Lab 2.1: Flights

Let’s move to the lab slides on the course website!

25:00