Data Pipelines

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Questions
  • Problem Set 4
  • Break
  • Lab 2.1: Flights

Announcements

  • RQ: A Grammar of Graphics due Wednesday at 11:59pm
  • Problem Set 4 (paper, max. 3) due next Tuesday at 9am
  • Lab 2.1 (paper, max. 2) due next Tuesday at 9am
  • Quiz 1 next Monday at 11:59pm (direct logistical and content questions to the syllabus and megathread on Ed).

Concept Questions

Question 1

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

Respond at pollev.com.

01:00

Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"
[1]  TRUE  TRUE FALSE
c("fruit", "fruit", "vegetable") != "fruit"
[1] FALSE FALSE  TRUE
c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")
[1]  TRUE  TRUE FALSE
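A small sketch of how `==` and `%in%` differ, using a made-up vector `foods` (not part of the class survey data). `==` compares position by position and recycles the shorter vector, while `%in%` asks whether each element on the left appears anywhere on the right.

```r
# Hypothetical vector, invented for illustration.
foods <- c("vegetable", "fruit", "boba")

# == with a single value: that value is compared with every element.
eq_single <- foods == "fruit"

# == with a length-2 vector: compared by position, recycling the shorter
# vector (R warns here because 3 is not a multiple of 2).
eq_pair <- suppressWarnings(foods == c("fruit", "vegetable"))

# %in%: is each element of foods found anywhere in the right-hand vector?
in_pair <- foods %in% c("fruit", "vegetable")

print(eq_single)
print(eq_pair)
print(in_pair)
```

When checking membership in a set of values, `%in%` is usually the safer choice, since the position-by-position comparison can silently give a different answer.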

Question 2

Which observations will be included in the following data frame?

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE)

Please respond at pollev.com.

01:00

Question 3

Which data frame will have fewer rows?

# this one
filter(class_survey, year == "This is my first semester!")

# or this one
class_survey |>
  mutate(first_sem = (year == "This is my first semester!")) |>
  filter(first_sem)

01:00

Concept Question 2 Redux - Building data pipelines

Among these students, how do we compute the average estimated chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

Nesting

filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE)

Nesting

select(filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE),
       coding_exp_scale,
       olympic_sport,
       entrepreneur,
       new_COVID_variant)

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympic_sport %in% c("Ice skating", "Speed skating"),
       entrepreneur == TRUE),
       coding_exp_scale,
       olympic_sport,
       entrepreneur,
       new_COVID_variant),
       covid_avg = mean(new_COVID_variant))
# A tibble: 1 × 1
  covid_avg
      <dbl>
1      0.52

Cons

  • Must be read from inside out 👎
  • Hard to keep track of arguments 👎

Pros

  • All in one expression 👍
  • Only refer to one data frame 👍

Step-by-step

df1 <- filter(class_survey, 
              coding_exp_scale < 3,
              olympic_sport %in% c("Ice skating", "Speed skating"),
              entrepreneur == TRUE)
df2 <- select(df1, 
              coding_exp_scale,
              olympic_sport,
              entrepreneur,
              new_COVID_variant)
summarize(df2,
          covid_avg = mean(new_COVID_variant))

Cons

  • Have to repeat data frame names 👎
  • Creates unnecessary objects 👎

Pros

  • Stores intermediate objects 👍
  • Can be read top to bottom 👍

Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |>
  summarize(covid_avg = mean(new_COVID_variant))

Cons

  • 🤷

Pros

  • Can be read like an English paragraph 👍
  • Only type the data frame name once 👍
  • No leftover objects 👍
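As a minimal sketch of what the pipe does, the example below uses a made-up numeric vector `chances` (not the survey data) and base R functions only. It assumes R 4.1 or later, where the native pipe `|>` is available. `x |> f(y)` is another way of writing `f(x, y)`.

```r
# Hypothetical values, invented for illustration.
chances <- c(0.25, 0.1, 0.9)

# Nested call, read from the inside out.
nested <- round(mean(chances), 1)

# Same computation with the pipe, read top to bottom.
piped <- chances |>
  mean() |>
  round(1)

# Both approaches give the same value.
print(identical(nested, piped))

# R rewrites the piped expression into the nested call when parsing it.
print(quote(chances |> mean() |> round(1)))
```

Because the rewrite happens when the code is parsed, the piped and nested versions are the same computation; the pipe only changes how the code reads.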

Understanding your pipeline

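It’s good practice to understand the output of each line of code by breaking the pipe and running it one step at a time. In the pipeline below, select() keeps only new_COVID_variant, so the year column is no longer available when filter() runs.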

class_survey |>
  select(new_COVID_variant) |>
  filter(year == "It's my first year.")
Error in `filter()`:
ℹ In argument: `year == "It's my first year."`.
Caused by error in `year == "It's my first year."`:
! comparison (==) is possible only for atomic and list types

class_survey |>
  select(new_COVID_variant)
# A tibble: 619 × 1
   new_COVID_variant
               <dbl>
 1           0.25   
 2           0.1    
 3           0      
 4           0.2    
 5           0.9    
 6           0.2    
 7           0.4    
 8           0.00005
 9           0.2    
10           0.3    
# ℹ 609 more rows
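One way to see this is a sketch on a tiny stand-in data frame. `toy_survey` and its values are hypothetical, invented for illustration, and the example assumes the dplyr package is installed. Running only the first step shows which columns survive, and swapping the order of the two verbs lets filter() see `year` before it is dropped.

```r
library(dplyr)

# Hypothetical stand-in for class_survey; values are made up.
toy_survey <- data.frame(
  year = c("first", "second", "first"),
  new_COVID_variant = c(0.2, 0.5, 0.4)
)

# Break the pipe: run only the first step and inspect what is left.
after_select <- toy_survey |>
  select(new_COVID_variant)
print(names(after_select))

# Filter while year still exists, then keep the column of interest.
fixed <- toy_survey |>
  filter(year == "first") |>
  select(new_COVID_variant)
print(fixed)
```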

Concept Question 2 Redux

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympic_sport %in% c("Ice skating", "Speed skating"),
         entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympic_sport,
         entrepreneur,
         new_COVID_variant) |> # C
  summarize(covid_avg = mean(new_COVID_variant)) # D

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

Concept Question 4

summarize(class_survey, mean(year == "I'm in my first year."))

What will this line of code return?

Respond at pollev.com.

Boolean Algebra

Logical vectors have a dual representation: TRUE and FALSE can also be treated as 1 and 0, so you can do math on logicals accordingly.

TRUE + TRUE
[1] 2
TRUE * TRUE
[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of elements that are TRUE (i.e., the proportion of rows that meet the condition).
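A short sketch with a made-up logical vector (`is_first_year` is hypothetical, not a survey column): summing counts the TRUE values, and the mean is that count divided by the length.

```r
# Hypothetical logical vector, invented for illustration.
is_first_year <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of TRUEs.
n_true <- sum(is_first_year)

# The mean is the proportion of TRUEs: n_true / length(is_first_year).
prop_true <- mean(is_first_year)

print(n_true)
print(prop_true)
```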

Break

Problem Set 4: Data Pipelines

25:00

Break

05:00

Lab 2.1: Flights

Let’s move to the lab slides on the course website!

25:00

End of Lecture