Conditioning

Filtering, groupwise operations, and data pipelines.

class date

09/14/23

In the world of data, bigger is not always better. Sometimes there are real benefits to working with a subset of your observations that meet some particular condition. One use of conditioning is to add specificity to a claim. Another use of conditioning is to illuminate the relationship between variables.

To gain practice with conditioning, let’s turn to a data set that begins with a very general focus. In 2007, Savage and West published A quantitative, theoretical framework for understanding mammalian sleep¹, wherein they “develop a general, quantitative theory for mammalian sleep that relates many of its fundamental parameters to metabolic rate and body size”. Characterizing the sleep patterns of all mammals is a broad task and their data set is correspondingly diverse. Take a look at the first ten rows of their data below.

library(tidyverse)

msleep <- msleep %>%
    mutate(log_bodywt = log(bodywt * 1000)) %>%
    select(name, sleep_total, log_bodywt, 
           vore, conservation)

msleep
# A tibble: 83 × 5
   name                       sleep_total log_bodywt vore  conservation
   <chr>                            <dbl>      <dbl> <chr> <chr>       
 1 Cheetah                           12.1      10.8  carni lc          
 2 Owl monkey                        17         6.17 omni  <NA>        
 3 Mountain beaver                   14.4       7.21 herbi nt          
 4 Greater short-tailed shrew        14.9       2.94 omni  lc          
 5 Cow                                4        13.3  herbi domesticated
 6 Three-toed sloth                  14.4       8.26 herbi <NA>        
 7 Northern fur seal                  8.7       9.93 carni vu          
 8 Vesper mouse                       7         3.81 <NA>  <NA>        
 9 Dog                               10.1       9.55 carni domesticated
10 Roe deer                           3         9.60 herbi lc          
# ℹ 73 more rows

In this data set, the unit of observation is a single species and the variables observed on each are its name, the average length of sleep each day, the natural log of the average weight, its dietary pattern, and its conservation status. We can visualize the relationship between sleep and body size in all 83 species using a scatter plot.

df_labels <- filter(msleep, 
                    name %in% c("Little brown bat",
                                "African elephant"))

library(ggrepel)
p_sleep <- msleep %>%
    ggplot(aes(x = log_bodywt,
                   y = sleep_total)) +
    geom_point() +
    geom_text_repel(data = df_labels, 
                    aes(label = name), 
                    min.segment.length = 0,
                    box.padding = 2.5) +
    labs(x = "body weight (in log grams)",
         y = "total sleep per day (in hrs)") +
    theme_bw()

p_sleep

The mammals vary from the wee brown bat, slumbering for nearly 20 hours a day, to the massive African elephant, nodding off for less than five. That is quite a range! Let’s drill down to smaller subsets of this data frame to gain a more nuanced sense of what is going on.

Filtering

If you think about the shape of a data frame, there are two basic ways you might go about slicing and dicing it into smaller subsets.

One way to go at it is column-by-column. The act of selecting a subset of the columns of a data frame is called, well, selecting. When you select a column, you can do so either by its name or by its column number (or index). Selecting columns by name is more useful because their order tends to be arbitrary and might change over the course of an analysis.

The other way to go at it is row-by-row. The act of subsetting the rows of the data frame based on their row number is called slicing. As with columns, the order of the rows is also often arbitrary, so this is of limited use. Much more useful is filtering.

In the tidyverse, these functions are named select(), slice(), and filter().
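To make the three verbs concrete, here is a minimal sketch on a toy data frame. The name animals is made up for illustration; its three rows are copied from the msleep table above.

```r
library(dplyr)

# Toy data frame: three rows copied from the msleep table above
animals <- tibble(
    name        = c("Little brown bat", "Cow", "Dog"),
    sleep_total = c(19.9, 4, 10.1),
    vore        = c("insecti", "herbi", "carni")
)

# Column-wise: keep columns by name
select(animals, name, sleep_total)

# Row-wise by position: keep the first two rows
slice(animals, 1:2)

# Row-wise by value: keep the rows that meet a condition
filter(animals, sleep_total > 5)
```

Only the last call looks at the values stored in the data frame, which is why filtering gets most of the attention in these notes.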

Filtering

The act of subsetting the rows of a data frame based on the values of one or more variables to extract the observations of interest.

Filters are powerful because they comb through the values of the data frame, which is where most of the information is. The key part of any filter is the condition that you assert for the rows that are retained in your data frame. Let’s set up a filter to return only the little brown bat.

filter(msleep, name == "Little brown bat")
# A tibble: 1 × 5
  name             sleep_total log_bodywt vore    conservation
  <chr>                  <dbl>      <dbl> <chr>   <chr>       
1 Little brown bat        19.9       2.30 insecti <NA>        

Here name == "Little brown bat" is the condition that must be met by any row in the data set to be retained. The syntax used to set up the condition is a comparison between a column in the data frame on the left and a possible value of that column on the right.

Comparison Operators

The filter above uses the most direct condition: it retains the rows that have a value in the name variable that is precisely "Little brown bat". In this case, there is only one such row. There is a range of different comparisons that can be made, though, and each has its own operator.

Operator   Translation
==         equal to
!=         not equal to
<          less than
>          greater than
<=         less than or equal to
>=         greater than or equal to
At first, the == operator looks like a typo. Why don’t we use =? The reason is that a single equals sign is already busy at work in R: it sets the values of arguments inside a function. Instead of assignment, we want to determine whether the thing on the left holds the same value as the thing on the right, so we use ==. It might help you keep things straight if you read it in your head as “is exactly equal to”.
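If you slip and type a single equals sign inside filter(), dplyr stops with an error rather than guessing what you meant. The sketch below shows both spellings on a toy data frame (animals is a made-up name, and the text returned by the error handler is our own, not dplyr's message).

```r
library(dplyr)

# Toy data frame: two rows copied from the msleep table above
animals <- tibble(
    name        = c("Cow", "Dog"),
    sleep_total = c(4, 10.1)
)

# == asks a question of every row and keeps the rows where the answer is TRUE
filter(animals, name == "Cow")

# A single = is read as naming an argument, so filter() raises an error.
# tryCatch() catches it so the script can keep running.
result <- tryCatch(
    filter(animals, name = "Cow"),
    error = function(e) "error: used = instead of =="
)
result
```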

Let’s return only the rows with large animals, defined as those with a log body weight greater than 12.

filter(msleep, log_bodywt > 12)
# A tibble: 9 × 5
  name                 sleep_total log_bodywt vore  conservation
  <chr>                      <dbl>      <dbl> <chr> <chr>       
1 Cow                          4         13.3 herbi domesticated
2 Asian elephant               3.9       14.8 herbi en          
3 Horse                        2.9       13.2 herbi domesticated
4 Donkey                       3.1       12.1 herbi domesticated
5 Giraffe                      1.9       13.7 herbi cd          
6 Pilot whale                  2.7       13.6 carni cd          
7 African elephant             3.3       15.7 herbi vu          
8 Brazilian tapir              4.4       12.2 herbi vu          
9 Bottle-nosed dolphin         5.2       12.1 carni <NA>        

There were 9 such animals and you can see all of them are large.

Logical Operators

What if you want both the little brown bat and the African elephant? What if you want both the large creatures as well as those that sleep only briefly? These are tasks that call for multiple comparisons composed together with the logical operators & and |, or with the matching operator %in%.

This filter returns the creatures who are large and who sleep little.

filter(msleep, log_bodywt > 12 & sleep_total < 5)
# A tibble: 8 × 5
  name             sleep_total log_bodywt vore  conservation
  <chr>                  <dbl>      <dbl> <chr> <chr>       
1 Cow                      4         13.3 herbi domesticated
2 Asian elephant           3.9       14.8 herbi en          
3 Horse                    2.9       13.2 herbi domesticated
4 Donkey                   3.1       12.1 herbi domesticated
5 Giraffe                  1.9       13.7 herbi cd          
6 Pilot whale              2.7       13.6 carni cd          
7 African elephant         3.3       15.7 herbi vu          
8 Brazilian tapir          4.4       12.2 herbi vu          

This can be read as “filter the msleep data frame to return the rows where both the log body weight is greater than 12 and the sleep total is less than 5”. We see that there are 8 such creatures, one fewer than the data frame with only the body weight filter (bottle-nosed dolphins sleep, on average, 5.2 hrs).

Using & to represent “and” is common across most computer languages but you can alternatively use the somewhat more compact syntax of simply adding the second filter after a comma.

filter(msleep, log_bodywt > 12, sleep_total < 5)
# A tibble: 8 × 5
  name             sleep_total log_bodywt vore  conservation
  <chr>                  <dbl>      <dbl> <chr> <chr>       
1 Cow                      4         13.3 herbi domesticated
2 Asian elephant           3.9       14.8 herbi en          
3 Horse                    2.9       13.2 herbi domesticated
4 Donkey                   3.1       12.1 herbi domesticated
5 Giraffe                  1.9       13.7 herbi cd          
6 Pilot whale              2.7       13.6 carni cd          
7 African elephant         3.3       15.7 herbi vu          
8 Brazilian tapir          4.4       12.2 herbi vu          

These two methods are equivalent.

To return all rows that either have a high body weight or low sleep time or both, use the | operator (sometimes called “vertical bar”).

filter(msleep, log_bodywt > 12 | sleep_total < 5)
# A tibble: 12 × 5
   name                 sleep_total log_bodywt vore  conservation
   <chr>                      <dbl>      <dbl> <chr> <chr>       
 1 Cow                          4        13.3  herbi domesticated
 2 Roe deer                     3         9.60 herbi lc          
 3 Asian elephant               3.9      14.8  herbi en          
 4 Horse                        2.9      13.2  herbi domesticated
 5 Donkey                       3.1      12.1  herbi domesticated
 6 Giraffe                      1.9      13.7  herbi cd          
 7 Pilot whale                  2.7      13.6  carni cd          
 8 African elephant             3.3      15.7  herbi vu          
 9 Sheep                        3.8      10.9  herbi domesticated
10 Caspian seal                 3.5      11.4  carni vu          
11 Brazilian tapir              4.4      12.2  herbi vu          
12 Bottle-nosed dolphin         5.2      12.1  carni <NA>        

Be cautious in deciding whether you want to use & or |. While | is generally read as “or”, we could also describe the above filter as one that returns the rows that have a high body weight and the rows that have low sleep times.

One way to keep them straight is to keep an eye on the number of observations that are returned. The intersection of multiple conditions (using &) should result in the same or fewer rows (the orange area) than the union of multiple conditions (using |) (the blue area).

library(patchwork)

p_and <- p_sleep +
    annotate("rect", xmin = 12, xmax = 16,
             ymin = 1.5, ymax = 5, 
             fill = "orange", alpha = .4)

p_or <- p_sleep +
    annotate("rect", xmin = 12, xmax = 16,
             ymin = 1.5, ymax = 20, 
             fill = "blue", alpha = .4) +
    annotate("rect", xmin = 2, xmax = 12,
             ymin = 1.5, ymax = 5, 
             fill = "blue", alpha = .4)

p_and + p_or
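You can run the row-count check yourself. The sketch below rebuilds the data frame (assuming msleep is taken from the ggplot2 package, which is where the tidyverse gets it) and counts the rows returned by each version of the filter. The names n_and, n_comma, and n_or are made up for illustration.

```r
library(dplyr)

# Rebuild the data frame used in these notes; msleep ships with ggplot2
msleep <- ggplot2::msleep %>%
    mutate(log_bodywt = log(bodywt * 1000))

# Intersection, written two equivalent ways
n_and   <- nrow(filter(msleep, log_bodywt > 12 & sleep_total < 5))
n_comma <- nrow(filter(msleep, log_bodywt > 12, sleep_total < 5))

# Union
n_or <- nrow(filter(msleep, log_bodywt > 12 | sleep_total < 5))

# The two "and" counts agree, and neither can exceed the "or" count
c(and = n_and, comma = n_comma, or = n_or)
```

If the count from & ever comes out larger than the count from |, that is a sign the two operators have been swapped somewhere.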

When working with nominal categorical variables, the only comparison operators that make sense are == and !=. You can return a union like normal using |:

filter(msleep, name == "Little brown bat" | name == "African elephant")
# A tibble: 2 × 5
  name             sleep_total log_bodywt vore    conservation
  <chr>                  <dbl>      <dbl> <chr>   <chr>       
1 African elephant         3.3      15.7  herbi   vu          
2 Little brown bat        19.9       2.30 insecti <NA>        

Or you can save some typing (and craft more readable code) by using %in% instead:

filter(msleep, name %in% c("Little brown bat", "African elephant"))
# A tibble: 2 × 5
  name             sleep_total log_bodywt vore    conservation
  <chr>                  <dbl>      <dbl> <chr>   <chr>       
1 African elephant         3.3      15.7  herbi   vu          
2 Little brown bat        19.9       2.30 insecti <NA>        
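One tempting shortcut deserves a warning: writing name == c(...) is not the same as name %in% c(...). The sketch below uses a toy data frame (animals and wanted are made-up names, with rows copied from the tables above) to show that == compares position by position, so it can silently drop rows you wanted.

```r
library(dplyr)

# Toy data frame: four rows copied from the msleep tables above
animals <- tibble(
    name        = c("Little brown bat", "Cow", "African elephant", "Dog"),
    sleep_total = c(19.9, 4, 3.3, 10.1)
)
wanted <- c("Little brown bat", "African elephant")

# These two filters return the same two rows
with_or <- filter(animals, name == "Little brown bat" | name == "African elephant")
with_in <- filter(animals, name %in% wanted)
identical(with_or, with_in)

# Pitfall: == recycles `wanted` down the column (row 1 vs. bat, row 2 vs.
# elephant, row 3 vs. bat, ...), so the African elephant in row 3 is missed
with_eq <- filter(animals, name == wanted)
with_eq
```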

Taxonomy of Data: Logicals

It is useful to pause here to look under the hood of this code. Once you get accustomed to the comparison operators and the syntax, the R code reads very similarly to the equivalent English command. But how are those comparisons being represented in terms of data?

To answer this question, consider a simple numeric vector of four integers.

a <- c(2, 4, 6, 8)

We can apply a comparison operator to this vector using the same syntax as above. Let’s compare each value in this vector to see if it’s less than 5.

a < 5
[1]  TRUE  TRUE FALSE FALSE

The result is a vector of the same length as a where each value indicates whether the comparison to each element was true or false. While TRUE and FALSE look like the values of a factor or a character vector, this is actually our newest entry into the Taxonomy of Data: the logical vector.

class(a < 5)
[1] "logical"

A logical vector can only take two values, TRUE and FALSE (R also recognizes T and F but not True or true). While it might seem like a categorical variable with only two levels, a logical vector has an important property that makes it behave like a numerical variable.

sum(a < 5)
[1] 2

In a logical vector, a value of true is represented both by TRUE and by the number 1 and false by FALSE and the number 0. This integer representation is why TRUE + TRUE will work (it’s 2!) but "TRUE" + "TRUE" will not.

This dual representation is very useful because it allows us to compute a proportion using, paradoxically, the mean() function.

mean(a < 5)
[1] 0.5

a < 5 results in a vector with two 1s and two 0s. When you take the mean like this, you’re really finding the proportion of the elements that meet the condition that you laid out in your comparison. This is a very handy trick. We’ll use it more in a moment.
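To see why the mean of a logical vector is a proportion, it can help to make the hidden 0s and 1s visible. This short sketch uses only base R; is_small is a made-up name.

```r
a <- c(2, 4, 6, 8)
is_small <- a < 5            # TRUE TRUE FALSE FALSE

# The numbers hiding behind the logicals
as.integer(is_small)         # 1 1 0 0
TRUE + TRUE                  # 2

# mean() of a logical vector is (number of TRUEs) / (number of elements)
sum(is_small) / length(is_small)   # 0.5
mean(is_small)                     # 0.5
```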

Data Pipelines

At this stage in the course, the number of functions that you are familiar with has grown dramatically. To do truly powerful things with data, you need to not just call one of these functions, but string together many of them in a thoughtful and organized manner.

As an example, to create a sorted data frame containing just the large animals, we need to take the original data frame and

  1. filter() such that log_bodywt > 12 and then
  2. arrange() in descending order of weight (desc(log_bodywt)).

A conventional approach breaks this process into two distinct lines of code and saves the output mid-way through.

msleep_large <- filter(msleep, log_bodywt > 12)
arrange(msleep_large, desc(log_bodywt))
# A tibble: 9 × 5
  name                 sleep_total log_bodywt vore  conservation
  <chr>                      <dbl>      <dbl> <chr> <chr>       
1 African elephant             3.3       15.7 herbi vu          
2 Asian elephant               3.9       14.8 herbi en          
3 Giraffe                      1.9       13.7 herbi cd          
4 Pilot whale                  2.7       13.6 carni cd          
5 Cow                          4         13.3 herbi domesticated
6 Horse                        2.9       13.2 herbi domesticated
7 Brazilian tapir              4.4       12.2 herbi vu          
8 Donkey                       3.1       12.1 herbi domesticated
9 Bottle-nosed dolphin         5.2       12.1 carni <NA>        

An approach that is more concise and easier to read, and that avoids saving throwaway intermediate objects, is to compose these functions together with “the pipe”. The pipe, written %>%, is an operator that you have access to when you load the tidyverse package. If you have two functions, f1 and f2, both of which take a data frame as the first argument, you can pipe the output of f1 directly into f2 using:

f1(DF) %>% f2()

Let’s use the pipe to rewrite the code shown above.

filter(msleep, log_bodywt > 12) %>% arrange(desc(log_bodywt))
# A tibble: 9 × 5
  name                 sleep_total log_bodywt vore  conservation
  <chr>                      <dbl>      <dbl> <chr> <chr>       
1 African elephant             3.3       15.7 herbi vu          
2 Asian elephant               3.9       14.8 herbi en          
3 Giraffe                      1.9       13.7 herbi cd          
4 Pilot whale                  2.7       13.6 carni cd          
5 Cow                          4         13.3 herbi domesticated
6 Horse                        2.9       13.2 herbi domesticated
7 Brazilian tapir              4.4       12.2 herbi vu          
8 Donkey                       3.1       12.1 herbi domesticated
9 Bottle-nosed dolphin         5.2       12.1 carni <NA>        

What has changed? Most immediately, we have reduced two lines of code to one. The first function, filter(), is unchanged; however, the second function, arrange(), is now missing its first argument, the data frame. That is because it is being piped directly in from the output of the first function.
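Put another way, x %>% f(y) is just a different way of writing f(x, y). The sketch below checks that claim, first with ordinary numbers and then with a toy data frame (animals, nested, and piped are made-up names, with rows copied from the tables above).

```r
library(dplyr)

# The pipe places the left-hand side into the first argument of the function
c(1, 4, 9) %>% sqrt()        # same as sqrt(c(1, 4, 9)): 1 2 3
16 %>% log(base = 2)         # same as log(16, base = 2): 4

# Toy data frame: four rows copied from the msleep tables above
animals <- tibble(
    name       = c("Little brown bat", "Cow", "African elephant", "Dog"),
    log_bodywt = c(2.30, 13.3, 15.7, 9.55)
)

# Nested call versus piped call
nested <- arrange(filter(animals, log_bodywt > 12), desc(log_bodywt))
piped  <- filter(animals, log_bodywt > 12) %>% arrange(desc(log_bodywt))

identical(nested, piped)     # TRUE
```

The nested version has to be read from the inside out, while the piped version reads left to right in the order the steps happen.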

While this is a fine way to use the pipe, your code is made much more readable if you format it like this:

msleep %>%
    filter(log_bodywt > 12) %>% 
    arrange(desc(log_bodywt))
# A tibble: 9 × 5
  name                 sleep_total log_bodywt vore  conservation
  <chr>                      <dbl>      <dbl> <chr> <chr>       
1 African elephant             3.3       15.7 herbi vu          
2 Asian elephant               3.9       14.8 herbi en          
3 Giraffe                      1.9       13.7 herbi cd          
4 Pilot whale                  2.7       13.6 carni cd          
5 Cow                          4         13.3 herbi domesticated
6 Horse                        2.9       13.2 herbi domesticated
7 Brazilian tapir              4.4       12.2 herbi vu          
8 Donkey                       3.1       12.1 herbi domesticated
9 Bottle-nosed dolphin         5.2       12.1 carni <NA>        

This code results in the same output as the first version, but it now reads a bit like a poem: “Take the msleep data frame then filter it such that the log body weight is greater than twelve then arrange it in descending order by log body weight”.

This poem is admittedly not particularly poetic.

Let’s look at a few examples to understand the power of such a simple piece of syntax.

Examples

What year had the greatest total number of christenings?

In Lab 1, this question was tackled with two or three separate lines of code, one to mutate() and the other to arrange() (and possibly select()). As one pipeline, it is:

library(stat20data)

arbuthnot %>%
    mutate(total = boys + girls) %>%
    arrange(desc(total)) %>%
    select(year, total)
# A tibble: 82 × 2
    year total
   <int> <int>
 1  1705 16145
 2  1707 16066
 3  1698 16052
 4  1708 15862
 5  1697 15829
 6  1702 15687
 7  1701 15616
 8  1703 15448
 9  1706 15369
10  1699 15363
# ℹ 72 more rows

What is the trend in the total number of christenings over time?

arbuthnot %>%
    mutate(total = boys + girls) %>%
    ggplot(aes(x = year, y = total)) +
    geom_line()

This demonstrates that you can pipe a data frame directly into a ggplot - the first argument is a data frame after all! The main thing to note is that when moving into a ggplot, the layers are added with the + operator instead of the pipe, %>%.

What proportion of carnivores sleep more than 8 hours per day?

Answering this requires two steps: filter()ing to focus on carnivores and summarize()ing with the proportion that meets a condition (recall that a comparison results in a logical vector of 0s and 1s). It is often a good idea to record the number of observations that go into a summary statistic, which we do here with n().

msleep %>%
    filter(vore == "carni") %>%
    summarize(p_gt_8hrs = mean(sleep_total > 8),
              n = n())
# A tibble: 1 × 2
  p_gt_8hrs     n
      <dbl> <int>
1     0.684    19

Groupwise Operations

The last example above demonstrates a very common scenario: you want to perform some calculations on one particular group of observations in your data set. But what if you want to do that same calculation for every group?

The vore variable has four levels: carni, herbi, insecti, and omni. It would not be too difficult to copy and paste the above pipeline four times and modify each filter function to focus on a different group. But what if there were a dozen different levels?

This task - performing an operation on all groups of a data set one-by-one - is such a common data science task that nearly every software tool has a good solution. In the tidyverse, the solution is the group_by() function. Let’s see it in action.

msleep %>%
    group_by(vore) %>%
    summarize(p_gt_8hrs = mean(sleep_total > 8),
              n = n())
# A tibble: 5 × 3
  vore    p_gt_8hrs     n
  <chr>       <dbl> <int>
1 carni       0.684    19
2 herbi       0.594    32
3 insecti     1         5
4 omni        0.95     20
5 <NA>        0.714     7

Like most tidyverse functions, the first argument to group_by() is a data frame, so it can be slotted directly into the pipeline. The second argument, the one that shows up in the code above, is the name of the variable that you want to use to delineate the groups. This is generally a factor, character, or logical vector.

group_by() is an incredibly powerful function because it changes the behavior of downstream functions. Let’s break our pipeline and inspect the data frame that comes out of it.

msleep %>%
    group_by(vore)
# A tibble: 83 × 5
# Groups:   vore [5]
   name                       sleep_total log_bodywt vore  conservation
   <chr>                            <dbl>      <dbl> <chr> <chr>       
 1 Cheetah                           12.1      10.8  carni lc          
 2 Owl monkey                        17         6.17 omni  <NA>        
 3 Mountain beaver                   14.4       7.21 herbi nt          
 4 Greater short-tailed shrew        14.9       2.94 omni  lc          
 5 Cow                                4        13.3  herbi domesticated
 6 Three-toed sloth                  14.4       8.26 herbi <NA>        
 7 Northern fur seal                  8.7       9.93 carni vu          
 8 Vesper mouse                       7         3.81 <NA>  <NA>        
 9 Dog                               10.1       9.55 carni domesticated
10 Roe deer                           3         9.60 herbi lc          
# ℹ 73 more rows

This looks . . . exactly like the original data frame.

Well, not exactly like it: there is now a note at the top saying that the data frame has the notion of groups based on vore. In effect, group_by() has taken the generic data frame and returned the same data frame but with rows now flagged as belonging to one group or another. When we pipe this grouped data frame into summarize(), summarize() collapses that data frame down into a single row for each group and creates a new column for each new summary statistic.
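If you want to inspect those group flags directly, dplyr offers a few helper functions. This is a minimal sketch on a toy data frame (animals and grouped are made-up names, with rows copied from the tables above); it also shows that ungroup() removes the flags again.

```r
library(dplyr)

# Toy data frame: five rows copied from the msleep tables above
animals <- tibble(
    name        = c("Little brown bat", "Cow", "Horse", "Dog", "Cheetah"),
    vore        = c("insecti", "herbi", "herbi", "carni", "carni"),
    sleep_total = c(19.9, 4, 2.9, 10.1, 12.1)
)

grouped <- animals %>% group_by(vore)

# The rows and columns are untouched; only the grouping metadata is new
is_grouped_df(grouped)       # TRUE
group_vars(grouped)          # "vore"
n_groups(grouped)            # 3

# summarize() returns one row per group ...
grouped %>% summarize(n = n())

# ... and a single row once the groups are removed
grouped %>% ungroup() %>% summarize(n = n())
```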

Summary

There are several ways to subset a data frame but the most important for data analysis is filtering: subsetting the rows according to a condition. In R, that condition is framed in terms of a comparison between a variable and a value (or set of values). Comparisons take many forms and can be combined using logical operators. The result is a logical vector that can be used for filtering or computing summary statistics. You can perform simultaneous analyses on multiple subsets by doing groupwise operations with group_by().

As we begin to do analyses that require multiple operations, the pipe operator, %>%, can be used to stitch the functions together into a single pipeline.

If you’re thinking, 😬 , yikes there was a lot of coding in these notes, you’re right. Don’t worry. We’ll have plenty of time to practice in class.

—————————

The Ideas in Code

Some notes rely heavily on code to augment your learning and understanding of the main concepts. This “Ideas in Code” section is meant to expand more on concepts and functions that the notes utilize but may not fully explain.

This specific set of notes contains references to many functions from the tidyverse library such as mutate(), select(), filter(), arrange(), ggplot(), group_by(), and summarize(). We delve more into some of these functions here.

mutate()

This function allows you to create a new column in a data frame. In typical tidyverse fashion, the first argument is a data frame. The second argument names and defines how that new column is created. Above, we saw:

arbuthnot %>%
    mutate(total = boys + girls) %>%
    arrange(desc(total)) %>%
    select(year, total)
# A tibble: 82 × 2
    year total
   <int> <int>
 1  1705 16145
 2  1707 16066
 3  1698 16052
 4  1708 15862
 5  1697 15829
 6  1702 15687
 7  1701 15616
 8  1703 15448
 9  1706 15369
10  1699 15363
# ℹ 72 more rows

Here, the first argument, arbuthnot, is piped to mutate() and the second argument, total = boys + girls, creates a new column named total by adding together the columns boys and girls. You can use mutate() to create multiple columns at the same time:

arbuthnot %>%
    mutate(total = boys + girls,
           girl_proportion = girls / total) %>%
    arrange(desc(total)) %>%
    select(year, total, girl_proportion)
# A tibble: 82 × 3
    year total girl_proportion
   <int> <int>           <dbl>
 1  1705 16145           0.482
 2  1707 16066           0.478
 3  1698 16052           0.475
 4  1708 15862           0.481
 5  1697 15829           0.491
 6  1702 15687           0.488
 7  1701 15616           0.481
 8  1703 15448           0.497
 9  1706 15369           0.483
10  1699 15363           0.485
# ℹ 72 more rows

Note that switching the order of the two new columns created above such that girl_proportion = girls / total comes before total = boys + girls will produce an error because total is used before it is created.
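Here is a small sketch of that ordering rule. The data frame births and its values are hypothetical (they are not taken from arbuthnot), and tryCatch() is used only so that the error does not halt the script.

```r
library(dplyr)

# Hypothetical counts, made up for illustration
births <- tibble(
    year  = c(1, 2),
    boys  = c(60, 55),
    girls = c(40, 45)
)

# total is defined first, so it can be used in the next column
works <- births %>%
    mutate(total = boys + girls,
           girl_proportion = girls / total)
works

# total is used before it exists, so mutate() raises an error
fails <- tryCatch(
    births %>%
        mutate(girl_proportion = girls / total,
               total = boys + girls),
    error = function(e) "error: total not found yet"
)
fails
```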

select()

This function is defined above as “selecting a subset of the columns of a data frame.” You’ve seen how to use select() to select or “grab” certain columns, but you can also use select() to omit certain columns. The last block of code can be rewritten to produce the same output by placing a minus sign, -, in front of the columns to omit:

arbuthnot %>%
    mutate(total = boys + girls,
           girl_proportion = girls / total) %>%
    arrange(desc(total)) %>%
    select(-c(boys, girls))
# A tibble: 82 × 3
    year total girl_proportion
   <int> <int>           <dbl>
 1  1705 16145           0.482
 2  1707 16066           0.478
 3  1698 16052           0.475
 4  1708 15862           0.481
 5  1697 15829           0.491
 6  1702 15687           0.488
 7  1701 15616           0.481
 8  1703 15448           0.497
 9  1706 15369           0.483
10  1699 15363           0.485
# ℹ 72 more rows

arrange()

This function arranges the rows of a data frame according to some logical ordering of a column. This ordering is straightforward for numeric columns; the smallest numbers are placed first and ascend to the larger ones. That is, unless you use desc() (which stands for descending).

But what if you pass a column of categories to arrange()? Let’s take a look:

penguins %>%
  arrange(species) %>%
  select(species, island, bill_length_mm)
# A tibble: 333 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           36.7
 5 Adelie  Torgersen           39.3
 6 Adelie  Torgersen           38.9
 7 Adelie  Torgersen           39.2
 8 Adelie  Torgersen           41.1
 9 Adelie  Torgersen           38.6
10 Adelie  Torgersen           34.6
# ℹ 323 more rows

When arranged by species, Adelie penguins come first, followed by Chinstrap, then Gentoo. Character columns are sorted alphabetically and factors, like species here, by the order of their levels, which in this case is also alphabetical. The penguins aren’t arranged in any specific order within a species, but we can change that by passing another column to arrange(). Passing additional columns to arrange() will systematically break ties. The below code arranges the data frame first by species (alphabetically) and then breaks ties by (ascending) bill length:

penguins %>%
  arrange(species, bill_length_mm) %>%
  select(species, island, bill_length_mm)
# A tibble: 333 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Dream               32.1
 2 Adelie  Dream               33.1
 3 Adelie  Torgersen           33.5
 4 Adelie  Dream               34  
 5 Adelie  Torgersen           34.4
 6 Adelie  Biscoe              34.5
 7 Adelie  Torgersen           34.6
 8 Adelie  Torgersen           34.6
 9 Adelie  Biscoe              35  
10 Adelie  Biscoe              35  
# ℹ 323 more rows
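The tie-breaking column can be wrapped in desc() on its own, leaving the first column in ascending order. A minimal sketch, using a toy data frame named birds whose values are hypothetical rather than taken from penguins:

```r
library(dplyr)

# Hypothetical measurements, made up for illustration
birds <- tibble(
    species        = c("Gentoo", "Adelie", "Chinstrap", "Adelie"),
    bill_length_mm = c(47.5, 39.1, 49.0, 36.7)
)

# Species A to Z; within a species, longest bill first
sorted <- birds %>%
    arrange(species, desc(bill_length_mm))
sorted
```

The two Adelie rows come first, with the 39.1 mm bill ahead of the 36.7 mm bill, followed by Chinstrap and then Gentoo.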

summarize()

This function summarizes a data frame into a single row. We can summarize a data frame by taking means or calculating the number of rows as above. We can also do other calculations like taking a median or calculating the variance of a column:

msleep %>%
    summarize(median_sleep = median(sleep_total),
              variance_sleep = var(sleep_total),
              n = n())
# A tibble: 1 × 3
  median_sleep variance_sleep     n
         <dbl>          <dbl> <int>
1         10.1           19.8    83

However, if summarize() is preceded by group_by(), then it will output multiple rows according to groups specified by group_by():

msleep %>%
    group_by(vore) %>%
    summarize(median_sleep = median(sleep_total),
              variance_sleep = var(sleep_total),
              n = n())
# A tibble: 5 × 4
  vore    median_sleep variance_sleep     n
  <chr>          <dbl>          <dbl> <int>
1 carni           10.4          21.8     19
2 herbi           10.3          23.8     32
3 insecti         18.1          35.1      5
4 omni             9.9           8.70    20
5 <NA>            10.6           9.02     7

This syntax looks a lot like the syntax used for mutate()! Like in mutate(), we name and define new columns: new_column = formula. The difference is that summarize() returns a brand new data frame that does not contain the columns of the original data frame (apart from any grouping variables), whereas mutate() returns a data frame with all columns of the original data frame in addition to the newly defined ones.
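A side-by-side comparison makes the difference visible. The sketch below applies the same formula with mutate() and with summarize() to a grouped toy data frame (animals, by_vore, mutated, and summarized are made-up names, with rows copied from the tables above).

```r
library(dplyr)

# Toy data frame: five rows copied from the msleep tables above
animals <- tibble(
    name        = c("Little brown bat", "Cow", "Horse", "Dog", "Cheetah"),
    vore        = c("insecti", "herbi", "herbi", "carni", "carni"),
    sleep_total = c(19.9, 4, 2.9, 10.1, 12.1)
)

by_vore <- animals %>% group_by(vore)

# mutate(): every original row and column is kept; the group mean is repeated
mutated <- by_vore %>% mutate(mean_sleep = mean(sleep_total))
dim(mutated)                 # 5 rows, 4 columns

# summarize(): one row per group; only vore and the new column remain
summarized <- by_vore %>% summarize(mean_sleep = mean(sleep_total))
dim(summarized)              # 3 rows, 2 columns
summarized
```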

Footnotes

  1. V. M. Savage and G. B. West. A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.