A Tool for Computing with Data

An introduction to the R language for statistical computing

class date

January 25, 2023

Discuss Reading Questions PDF

The concepts of a variable, its type, and the structure of a data frame are useful because they help guide our thinking about the nature of a data. But we need more than definitions. If our goal is to construct a claim with data, we need to a tool to aid in the construction. Our tool must be able to do two things: it must be able to store the data and it must be able to perform computations on the data.

In high school, you gained experience with one such tool: the graphing calculator. This fits our needs: you can enter a list of number into a graphic calculator like the Ti-84, it can store that list and it can execute computations on it, such as taking its sum. But the types of data that a calculator can store are very limited, as is the volume, as are the options for computation.

In this class, we will use a tool that is far more powerful: the computer language called R. The Ti-84 is to R what a tricycle is to the space ship. One of these tools can bring you to the end of the block; the other to the moon.

R and RStudio

R is one of the most powerful languages for doing statistics data science. One of the reasons for its power and popularity is that it is both free and open-source. This turns languages like R into something that resembles Wikipedia: a collaborative effort that is constantly evolving. Extensions to the R language have been authored by professional programmers1, people working in industry and government2, professors3, and students like you4.

You’ll be writing and running code through an app called RStudio. Beyond writing R code, RStudio allows you to manage your files and author polished documents that weave together code and text. RStudio can be run through a browser and we have set up an account for you that you can access by sending a browser tab to https://stat20.datahub.berkeley.edu/ or clicking the link in the upper right corner of the course website.

When you log into RStudio, the place where you can type and run R code is called the console and it’s located right here:

Screenshot showing RStudio in the browser with the console circled in green.

Figure 1: The R console in RStudio.

Code along

As you read through these notes, keep RStudio open in another window to code along at the console.

R as a Calculator

Although R is like a space ship capable of going to the moon, it’s also more than able to go to the end of the block. Type the sum 1 + 2 into the console (the area to the right of the >) and press Enter. What you should see is this:

1 + 2
[1] 3

All of the arithmetic operations work in R.

1 - 2
[1] -1
1 * 2
[1] 2
1 / 2
[1] 0.5

Each of these four lines of code is called a command and the response from R is the output. The [1] at the beginning of the output is there just to indicate that it is the first element of the output. This helps you keep track of things when the output spans many lines.

Although it is easiest to read code when the numbers are separated from the operator by a single space, it’s not necessary. R ignores all spaces when it runs your code, so each of the following also work.

1/2
[1] 0.5
1   /         2
[1] 0.5

You can add exponents by using ^, but don’t forget about the order of operations. If you want an alternative ordering, use parentheses.

2 ^ 3 + 1
[1] 9
2 ^ (3 + 1)
[1] 16

Saving Objects

Whenever you want to save the output of an R command, add an assignment arrow <- (less than, minus) as well as a name, such as “answer”.

answer <- 2 ^ (3 + 1)

When you run this command, there are two things to notice.

  1. The word answer appears in the upper right hand corner of RStudio, in the “Environment” tab.
  2. No output is returned at the console.

Every time you run a command, you can ask yourself: do I want to just see the output at the console or do I want to save it for later? If the latter, you can always see the contents of what you saved by just typing its name at the console and pressing Enter.

answer
[1] 16

There are a few rules around the names that R will allow for the objects that you’re saving. First, while all letters are fair game, special characters like +, -, /, !, $, are off-limits. Second, names can contain numbers, but not as the first character. That means names like answer, a, a12, my_pony, and FOO will all work. 12a and my_pony! will not.

But just because I’ve told you that those names won’t work doesn’t mean you shouldn’t give it a try…

my_pony! <- 2 ^ (3 + 1)
Error: <text>:1:8: unexpected '!'
1: my_pony!
           ^

This is an example of an error message and, though they can be alarming, they’re also helpful in coaching you how to correct your code. Here, it’s telling you that you had an “unexpected !” and then it points out where in your code that character popped up.

Creating Vectors

While it is helpful to be able to store a single number as an R object, to store data sets we’ll need to store a series of numbers. You can combine multiple values by putting them inside c() separated by commas.

my_fav_numbers <- c(9, 11, 19, 28)
my_fav_numbers
[1]  9 11 19 28

This is object is called a vector.

Vector (in R)

A set of contiguous data values that are of the same type.

As the definition suggests, you can create vectors out of many different types of data. To store words as data, use the following:

my_fav_colors <- c("green", "orange", "purple")
my_fav_colors
[1] "green"  "orange" "purple"

As this example shows, R can store more than just numbers as data. "green", "orange“, and "purple" are each called character strings and when combined together with c() they form a character vector. You can identify a string because it is wrapped in quotation marks and gets highlighted a different color in RStudio.

Vectors are often called atomic vectors because, like atoms, they are the simplest building blocks in the R language. Most of the objects in R are, at the end of the day, constructed from a series of vectors.

Functions

While the vector will serve as our atomic method of storing data in R, how do we perform computations on it? That is the role of functions.

Let’s use a function to find the arithmetic mean of the vector my_fav_numbers.

mean(my_fav_numbers)
[1] 16.75

A function in R operates in a very similar manner to functions that you’re familiar with from mathematics.

A diagram with the input x pointing into a box labelled function f and then an arrow pointing out of the box to the output y.

Figure 2: A mathematical function as a box with inputs and outputs.

In math, you can think of a function, \(f()\) as a black box that takes the input, \(x\), and transforms it to the output, \(y\). You can think of R functions in a very similar way. For our example above, we have:

  • Input: the vector of four numbers that serves as the input to the function, my_fav_numbers.
  • Function: the function name, mean, followed by parentheses.
  • Output: the number 16.75.

Help and Arguments

Every function in R has a built-in help file that tells you about how it works. It can be accessed using ?.

?mean

This will pop up the help file in a tab next to your Files tab in the lower right hand corner of RStudio. In addition to describing what the function does, the help file lists out its arguments. Arguments are the separate pieces of input that you can supply to a function and they can be named or unnamed.

In the command that we entered above, we used a single unnamed argument, my_fav_numbers. We could have alternatively written this with a named argument:

mean(x = my_fav_numbers)
[1] 16.75

As the help file suggests, x is the R object (here a vector of numbers) that you want to take the mean of. You can always pass objects to a function as named arguments, or if you want to be more concise, you can pass it unnamed and R will rely on the order to figure things out.

Screenshot of the helpfile from the mean function.

Figure 3: Help file for mean().

The test how this actually works, let’s add a second unnamed argument to our function. From reading the help file, you learn that you can supply it a trim argument to trim off some percent of the highest and lowest values before computing the mean. Let’s trim off 25% from the lower end and 25% from the upper end.

mean(my_fav_numbers, .25)
[1] 15

It worked! We trim off the 9 and the 28, then take \((11 + 19) / 2 = 15\). We can write the command using named arguments. The code will be a bit more verbose but the answer will be the same.

mean(x = my_fav_numbers, trim = .25)
[1] 15

What happens if we use unnamed arguments but change the order? Let’s find out.

mean(.25, my_fav_numbers)
Error in mean.default(0.25, my_fav_numbers): 'trim' must be numeric of length one

Since there are no names, R looks at the second argument and expects it to be the a proportion between 0 and .5 that it will use to trim. You have passed it a vector of three integers instead, so it’s justified in complaining.

Functions on Vectors

mean() is just one of thousands of different functions that are available in R. Most of them are sensibly named, like the following, which compute square roots and natural logarithms.

By default, log() computes the natural log. To use other bases, see ?log.

sqrt(my_fav_numbers)
[1] 3.000000 3.316625 4.358899 5.291503
log(my_fav_numbers)
[1] 2.197225 2.397895 2.944439 3.332205

Note that with these two functions, the input was a vector of length four and the output is a vector of length four. This is a distinctive aspect of the R language and it is helpful because it allows you to perform many separate operations (taking the square root of four numbers, one by one) with just a single command.

The Taxonomy of Data in R

In the last lecture notes, we introduced the Taxonomy of Data as a broad system to classify the different types of variables on which we can collect data. If you recall, a variable is a characteristic of an object that you can measure and record. When Dr. Gorman walked up to her first penguin (the unit of observation) and measured its bill length, she collected a single observation of the variable bill_length_mm. You could record that in R using,

bill_length_mm <- 50.7

She continued on to measure the next penguin, then the next, then the next… Instead of recording these as separate objects, it is more efficient to store them as a vector.

bill_length_mm <- c(50.7, 48.5, 52.8, 44.5, 42.0, 46.9, 50.2, 37.9)

This example shows that

A vector in R is a natural way to store observations on a variable.

so in the same way that we have asked, “what is the type of that variable?” we can now ask “what is the class of that variable in R?”.

Class (R)

A collection of objects, often vectors, that share similar attributes and behaviors.

While there are many classes in R, you can get a long way only knowing three. The first is represented by our vector my_fav_numbers. Let’s check it’s class using the class() function.

class(my_fav_numbers)
[1] "numeric"

Here we learn that my_fav_numbers is a numeric vector. Numeric vectors, as the name suggests, are composed only of numbers and can include measurements from both discrete and continuous numerical variables.

What about my_fav_colors?

class(my_fav_colors)
[1] "character"

R stores that as a character vector. This is a very flexible class that can be used to store text as data. But what if there are only a few fixed values that a variable can take? In that case, you can do better than a character vector, you can use a factor. Factor is a very useful class in R because it encodes the notion of levels discussed in the last notes.

To illustrate the difference, let’s make a character vector but then enrich it by turning it into a factor using factor().

char_vec <- c("cat", "cat", "dog")
fac <- factor(char_vec)
char_vec
[1] "cat" "cat" "dog"
fac
[1] cat cat dog
Levels: cat dog

The original character vector stores the same three strings that we used as input. The factor adds some additional information: the possible values that this vector can take.

This is particularly useful when you want to let R know that these levels have a natural ordering. If you have strong opinions about the relative merit of dogs over cats, you could specify that using:

ordered_fac <- factor(char_vec, levels = c("dog", "cat"))
ordered_fac
[1] cat cat dog
Levels: dog cat

This example also demonstrates that you can create a (character) vector inside a function.

While this doesn’t change the way the levels are ordered in the vector itself, it will effect the way they behave when we use them to create plots, as we’ll do in the next set of notes.

These three vector classes do a good job of putting into flesh and bone (or at least silicon) the abstract types captured in the Taxonomy of Data.

The taxonomy of data modified to show the R analogs of each of the data types.

Figure 4: The Taxonomy of Data with equivalent classes in R.

Data Frames in R

While vectors in R do a great job of capturing the notion of a variable, we will need more than that if we’re going to represent something like a data frame. Conveniently enough, R has a structure well-suited to this task called…(drumroll…)

Dataframe (R)
A two dimensional data structure used to store vectors of the same length. A direct analog of the data frame defined previously5.

Let’s use R to recreate the penguins data frame collected by Dr. Gorman.

bill_length_mm bill_depth_mm species
43.5 18.1 Chinstrap
48.1 15.1 Gentoo
49.0 19.5 Chinstrap
45.4 18.7 Chinstrap
34.6 21.1 Adelie
49.8 17.3 Chinstrap
40.9 18.9 Adelie
45.3 13.7 Gentoo

Creating a data frame

In the data frame above, there are three variables; the first two numeric continuous, the last one categorical nominal. Since R stores variables as vectors, we’ll need to create three vectors.

bill_length_mm <- c(50.7, 48.5, 52.8, 44.5, 42.0, 46.9, 50.2, 37.9)
bill_depth_mm <- c(19.7, 15.0, 20.0, 15.7, 20.2, 16.6, 18.7, 18.6)
species <- factor(c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie", 
             "Chinstrap", "Chinstrap", "Adelie"))

Check the class of these vectors by using the as input to class().

While bill_length_mm and bill_depth_mm are both being stored as numeric vectors, species was first collected into a character vector, then passed directly to the factor() function. This is an example of nesting one function inside of another and it combined two lines of code into one.

With the three vectors stored in the Environment, all you need to do is staple them together with data.frame().

penguins_df <- data.frame(bill_length_mm, bill_depth_mm, species)
penguins_df
  bill_length_mm bill_depth_mm   species
1           50.7          19.7 Chinstrap
2           48.5          15.0    Gentoo
3           52.8          20.0 Chinstrap
4           44.5          15.7    Gentoo
5           42.0          20.2    Adelie
6           46.9          16.6 Chinstrap
7           50.2          18.7 Chinstrap
8           37.9          18.6    Adelie

Summary

This was our first introduction to R, a supercharged calculator for storing and computing on data. We learned how to do basic arithmetic, construct and save a vector, call functions, query the class of an object, and construct a data frame. This forms the foundation of our use of R. If that foundation feels shakey, don’t fret. Next class will be dedicated to a workshop on R.

A playful sketch of first impressions with R with dark clouds and scary R and second impression of sunny skies and happy R.

Figure 5: The arc of learning R6.

References and further reading

Materials from class

Slides

Footnotes

  1. The googlesheets4 package, which reads spreadsheet data into R was authored by Jenny Bryan, a developer at Posit: :https://googlesheets4.tidyverse.org/.↩︎

  2. The statistics office of the province of British Columbia maintains a public R package with all of their data: https://bcgov.github.io/bcdata/↩︎

  3. Dr. Christopher Paciorek in the Department of Statistics at UC Berkeley maintains a package to fit a very broad class of statistical models called Bayesian Models: https://r-nimble.org/.↩︎

  4. Simon Couch wrote the stacks package for model ensembling while an undergraduate https://stacks.tidymodels.org/index.html.↩︎

  5. R is an unusual language in that the data frame has been for decades a core structure of the language. The analogous structure in Python is the data frame found in the Pandas library.↩︎

  6. R monster artwork by @allison_horst.↩︎