Data Pipelines

STAT 20: Introduction to Probability and Statistics

Agenda

Concept Questions
Problem Set 4
Break
Lab 2.1: Flights

Announcements

RQ: A Grammar of Graphics due Wednesday at 11:59pm

Problem Set 4 (paper, max. 3) due next Tuesday at 9am

Lab 2.1 (paper, max. 2) due next Tuesday at 9am

Quiz 1 next Monday at 11:59pm (direct logistical and content questions to the syllabus and megathread on Ed).

Concept Questions

Question 1

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

Respond at pollev.com.

01:00

Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"

[1]  TRUE  TRUE FALSE

c("fruit", "fruit", "vegetable") != "fruit"

[1] FALSE FALSE  TRUE

c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")

[1]  TRUE  TRUE FALSE
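A minimal sketch of the difference between `==` and `%in%`, using a made-up vector named `foods` (a hypothetical name, not part of the class survey). `==` compares each element against a single value, while `%in%` asks whether each element appears anywhere in a set of values.

```r
# Hypothetical example vector; not part of the class survey data.
foods <- c("fruit", "vegetable", "boba")

# `==` compares every element to the single value on the right.
is_fruit <- foods == "fruit"

# `%in%` checks each element for membership in the set on the right.
is_produce <- foods %in% c("fruit", "vegetable")

print(is_fruit)
print(is_produce)
```

Both results are logical vectors with one entry per element of `foods`, which is what lets them be used as conditions inside `filter()`.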

Question 2

Which observations will be included in the following data frame?

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE)

Please respond at pollev.com.

01:00

Question 3

Which data frame will have fewer rows?

# this one
filter(class_survey, year == "This is my first semester!")

# or this one
class_survey |>
  mutate(first_sem = (year == "This is my first semester!")) |>
  filter(first_sem)

01:00

Concept Question 2 Redux - Building data pipelines

How do we compute the average, among these students, of the estimated chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

Nesting

filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE)

Nesting

select(filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE),
       coding_exp_scale,
       olympic_sport,
       entrepreneur,
       new_COVID_variant)

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE),
       coding_exp_scale,
       olympic_sport,
       entrepreneur,
       new_COVID_variant),
       covid_avg = mean(new_COVID_variant))

# A tibble: 1 × 1
  covid_avg
      <dbl>
1      0.52

Cons

Must be read from inside out 👎
Hard to keep track of arguments 👎

Pros

All in one line of code 👍
Only refer to one data frame 👍

Step-by-step

df1 <- filter(class_survey, 
              coding_exp_scale < 3,
              olympic_sport %in% c("Ice skating", "Speed skating"),
              entrepreneur == TRUE)
df2 <- select(df1, 
              coding_exp_scale,
              olympic_sport,
              entrepreneur,
              new_COVID_variant)
summarize(df2,
          covid_avg = mean(new_COVID_variant))

Cons

Have to repeat data frame names 👎
Creates unnecessary objects 👎

Pros

Stores intermediate objects 👍
Can be read top to bottom 👍

Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |>
  summarize(covid_avg = mean(new_COVID_variant))

Cons

🤷

Pros

Can be read like an English paragraph 👍
Only type the data frame name once 👍
No leftover objects 👍
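The three styles can be compared side by side on a small example. This is a sketch using only base R functions and a made-up numeric vector `x` (hypothetical, standing in for a column of responses); it assumes R 4.1 or later, where the native pipe `|>` is available.

```r
# Hypothetical numeric vector; values are made up for illustration.
x <- c(0.25, 0.1, 0, 0.2, 0.9)

# Nesting: read from the inside out.
nested <- round(mean(sqrt(x)), 2)

# Step-by-step: each intermediate result gets its own name.
roots <- sqrt(x)
avg <- mean(roots)
stepwise <- round(avg, 2)

# Pipe: the left-hand side becomes the first argument of the next call.
piped <- x |> sqrt() |> mean() |> round(2)

# All three approaches compute the same value.
print(identical(nested, stepwise))
print(identical(nested, piped))
```

The pipe version lists the steps in the order they happen and leaves no intermediate objects such as `roots` or `avg` behind.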

Understanding your pipeline

It’s good practice to understand the output of each line of code by breaking the pipe.

class_survey |>
  select(new_COVID_variant) |>
  filter(year == "It's my first year.")

Error in `filter()`:
ℹ In argument: `year == "It's my first year."`.
Caused by error in `year == "It's my first year."`:
! comparison (==) is possible only for atomic and list types

class_survey |>
  select(new_COVID_variant)

# A tibble: 619 × 1
   new_COVID_variant
               <dbl>
 1           0.25   
 2           0.1    
 3           0      
 4           0.2    
 5           0.9    
 6           0.2    
 7           0.4    
 8           0.00005
 9           0.2    
10           0.3    
# ℹ 609 more rows
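Breaking the pipe can be sketched on a toy data frame. The tibble `toy_survey` below is hypothetical: its column names mirror the survey, but the values are made up. The sketch assumes the dplyr package is installed. Running only the first step shows that `year` has been dropped by `select()`, so one possible fix is to filter before selecting.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the survey, values are made up.
toy_survey <- tibble(
  year = c("It's my first year.", "It's my second year.", "It's my first year."),
  new_COVID_variant = c(0.25, 0.1, 0.9)
)

# Run only the first step of the pipe and check which columns survive.
after_select <- toy_survey |>
  select(new_COVID_variant)
print(names(after_select))  # `year` is no longer available to a later filter()

# One possible fix: filter while `year` still exists, then select.
fixed <- toy_survey |>
  filter(year == "It's my first year.") |>
  select(new_COVID_variant)
print(dim(fixed))
```

Checking `names()` or `dim()` after each step is a quick way to confirm that the next verb still has the columns and rows it expects.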

Concept Question 2 Redux

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |> # C
  summarize(covid_avg = mean(new_COVID_variant)) # D

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

Concept Question 4

summarize(class_survey, mean(year == "I'm in my first year."))

What will this line of code return?

Respond at pollev.com.

Boolean Algebra

Logical vectors have a dual representation: TRUE and FALSE are also treated as 1 and 0, so you can do arithmetic on logicals.

TRUE + TRUE

[1] 2

TRUE * TRUE

[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of elements that are TRUE (i.e., the proportion of rows that meet the condition).
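As a small illustration of this idea, the sketch below uses a made-up character vector named `year` (hypothetical responses, not the real survey data) and only base R.

```r
# Hypothetical responses; made up for illustration.
year <- c("I'm in my first year.", "I'm in my second year.",
          "I'm in my first year.", "I'm in my third year.")

# The comparison produces a logical vector, one entry per response.
is_first_year <- year == "I'm in my first year."

# sum() counts the TRUEs; mean() gives the proportion of TRUEs.
n_first_year <- sum(is_first_year)
prop_first_year <- mean(is_first_year)

print(n_first_year)
print(prop_first_year)
```

Here two of the four made-up responses match, so the count is 2 and the proportion is 0.5.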

Break

Problem Set 5: Data Pipelines

25:00

Break

05:00

Lab 2.1: Flights

Let’s move to the lab slides on the course website!

25:00


End of Lecture