A Tool for Computing with Data

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Questions: Intro to Computing with R
  • R Workshop
    • Functions and Vectors
    • Data Frames
  • Break
  • Lab 1: Arbuthnot

Concept Questions

Educated Guess 1

What will happen here?


Answer at pollev.com/<name>


1 + "one"

Educated Guess 2

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 2, 3, 4)
sqrt(log(a))

Educated Guess 3

What will happen here?


Answer at pollev.com/<name>


a <- 1 + 2
a + 1

Educated Guess 4

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 3.14, "seven")
class(a)

Reading Questions

R Workshop

Components of RStudio

  1. Console

  2. Environment

  3. Editor

  4. File Directory

Now we are going to switch over to RStudio to understand these 4 components a bit better.

Components of RStudio

  1. Console: Where the live R session lives. Type commands into the prompt > and press enter/return to run them. The Console is in the lower-left pane.

  2. Environment: The space that keeps track of all of the data and objects that you have created or loaded and have access to. Found in the upper right pane.

  3. Editor: Used to compose and edit text (.qmd files) and R code (.r files). Found in the upper left pane.

  4. File Directory: Used to navigate between the files and folders on your RStudio account. You can move, copy, rename, and delete files here. Found in the lower right pane.

R as a calculator

R allows all of the standard arithmetic operations.

Addition

1 + 2
[1] 3

Subtraction

1 - 2
[1] -1

Multiplication

1 * 2 
[1] 2

Division

1 / 2
[1] 0.5

R as a calculator, cont.

R allows all of the standard arithmetic operations.

Exponents

2 ^ 3
[1] 8

Parentheses for Order of Ops.

2 ^ 3 + 1
[1] 9
2 ^ (3 + 1)
[1] 16

Your turn

What is three times one point two raised to the quantity thirteen divided by six?


Object assignment

You can create and save objects using the assignment operator <-. It plays the role that = plays in many other programming languages.

my_fav_num <- 11

To be recognized as a valid object name, a name has to follow certain conventions; namely, it should begin with a letter and should not contain spaces or operator symbols such as $ or ^.

good names    names that cause errors
a             1trial
b             $
FOO           ^mean
my_var        my var
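
One way to check the names in the table above without triggering errors is sketched below. It relies on base R's make.names(), which rewrites a string into a syntactically valid name; a candidate that comes back unchanged was already valid. The object names candidates and is_valid are made up for this illustration.

```r
# Candidate names taken from the table above
candidates <- c("a", "b", "FOO", "my_var", "1trial", "$", "^mean", "my var")

# make.names() returns a syntactically valid version of each string;
# if nothing had to change, the original string was already a valid name
is_valid <- candidates == make.names(candidates)

# Show each candidate next to the result of the check
data.frame(candidates, is_valid)
```

The first four candidates should come back as TRUE and the last four as FALSE, matching the two columns of the table.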

Functions on vectors

A vector is the simplest structure used in R to store data. It can be created using the function c().

my_vector <- c(1, 3, 4)
my_vector
[1] 1 3 4

A function operates on an R object and produces output. R has many of the mathematical functions that you would expect.

sum(my_vector)
[1] 8
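
A few more base R functions applied to the same vector, as a small sketch. Some functions, like length() and min(), collapse the vector to a single number, while others, like sqrt(), work element by element and return a vector of the same length.

```r
my_vector <- c(1, 3, 4)

length(my_vector)  # number of elements: 3
min(my_vector)     # smallest element: 1
sqrt(my_vector)    # element by element: 1.000000 1.732051 2.000000
```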

Your Turn

  1. Create a vector named vec with the even integers between 1 and 10 as well as the number 99 (six elements total).

  2. Find the sum of that vector.

  3. Find the max of that vector.

  4. Take the mean of that vector and round it to the nearest integer.

These should all be solved with R code. If you don’t know the name of a function to use, hazard a guess and look for its help file (e.g. ?sum), or google it.


Demo of:

  1. Creating an R script
  2. Saving it
  3. Typing in code that answers previous question
  4. How to run code from a script
    • Put cursor on line and click “Run”.
    • Put cursor on line and type command+return.
    • Copy and paste to the console.

Building a data frame

You can combine vectors into a data frame using data.frame().

bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")


penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)
penguins_df
  bill_depth_mm bill_length_mm species
1          15.0           47.5  Gentoo
2          17.1           40.2  Adelie
3          18.7           39.0  Adelie
4          18.9           35.3  Adelie
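
Once a data frame exists, a few base R functions can be used to check its shape and contents. The sketch below rebuilds the same penguins_df so that it runs on its own; nrow(), ncol(), names(), and str() are not covered on this slide and are shown only as one possible way to inspect the result.

```r
# Rebuild the data frame from the slide
bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")
penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)

nrow(penguins_df)   # number of rows (observations): 4
ncol(penguins_df)   # number of columns (variables): 3
names(penguins_df)  # the column names, taken from the vector names
str(penguins_df)    # compact summary of each column and its type
```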

Your Turn

  1. Create an .r script, name it, and save it.

  2. Create three vectors, name, favorite_color, and favorite_number that contain observations on those variables from 5 people in this class.

  3. Combine them into a data frame called my_classmates.


Loading Packages

R has a vast ecosystem of packages that add new functions. Any installed package can be loaded with the library() function.

Our two main packages:

  • tidyverse
  • stat20data

Load them with:

library(tidyverse)
library(stat20data)
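
Because library() only works for packages that are already installed, it can help to check first. The following is a minimal sketch using base R's requireNamespace(), which returns TRUE or FALSE instead of stopping with an error; the object names pkgs and installed are made up for this illustration.

```r
# The two course packages named above
pkgs <- c("tidyverse", "stat20data")

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)

# A named logical vector with one entry per package
installed
```

A FALSE entry means the package still has to be installed before library() will load it.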

Loading data from a package

Most data you will not be creating by hand. You will either be

  1. Loading it in from a separate file.

  2. Loading it from within an R package (most of ours are in stat20data).

To load data from a package,

  1. Load that package with library().
  2. Print the data to the console by typing its name and pressing enter, or open it in the viewer with View(<df name>).

library(stat20data)
penguins
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Functions on data frames

3 functions from the tidyverse

The tidyverse package contains several functions used to manipulate data frames:

  • select() : subset columns
  • arrange() : sort rows
  • mutate() : create a new column from existing column(s)

Selecting columns

select(penguins, species, island)
# A tibble: 333 × 2
   species island   
   <fct>   <fct>    
 1 Adelie  Torgersen
 2 Adelie  Torgersen
 3 Adelie  Torgersen
 4 Adelie  Torgersen
 5 Adelie  Torgersen
 6 Adelie  Torgersen
 7 Adelie  Torgersen
 8 Adelie  Torgersen
 9 Adelie  Torgersen
10 Adelie  Torgersen
# ℹ 323 more rows

Arranging the rows of a data frame

arrange(penguins, bill_length_mm)
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.4          18.4               184        3325
 6 Adelie  Biscoe              34.5          18.1               187        2900
 7 Adelie  Torgersen           34.6          21.1               198        4400
 8 Adelie  Torgersen           34.6          17.2               189        3200
 9 Adelie  Biscoe              35            17.9               190        3450
10 Adelie  Biscoe              35            17.9               192        3725
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>

You can sort in descending order by wrapping the variable name in desc().
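
A small sketch of desc() in use. To keep the example self-contained it uses the four-row data frame built by hand earlier rather than the full penguins data, and it loads dplyr, the tidyverse package that provides arrange() and desc(). The name sorted_df is made up for this illustration.

```r
library(dplyr)

# The small hand-built data frame from earlier in these slides
penguins_df <- data.frame(
  bill_depth_mm  = c(15.0, 17.1, 18.7, 18.9),
  bill_length_mm = c(47.5, 40.2, 39.0, 35.3),
  species        = c("Gentoo", "Adelie", "Adelie", "Adelie")
)

# Sort rows from the largest bill depth to the smallest
sorted_df <- arrange(penguins_df, desc(bill_depth_mm))
sorted_df
```

The row with bill depth 18.9 should now come first and the Gentoo row (15.0) last.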

Mutating a new column

mutate(penguins, bill_index = bill_depth_mm + bill_length_mm)
# A tibble: 333 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_index <dbl>

Remember that you can nest functions.

Nesting functions

select(mutate(penguins, bill_index = bill_depth_mm + bill_length_mm), bill_index)
# A tibble: 333 × 1
   bill_index
        <dbl>
 1       57.8
 2       56.9
 3       58.3
 4       56  
 5       59.9
 6       56.7
 7       58.8
 8       58.7
 9       59.8
10       55.7
# ℹ 323 more rows
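
A nested call is evaluated from the inside out, so it gives the same result as running the inner function first, saving its output, and then passing that to the outer function. The sketch below checks this on the small hand-built data frame from earlier; the names with_index, two_step, and nested are made up for this illustration.

```r
library(dplyr)

# The small hand-built data frame from earlier in these slides
penguins_df <- data.frame(
  bill_depth_mm  = c(15.0, 17.1, 18.7, 18.9),
  bill_length_mm = c(47.5, 40.2, 39.0, 35.3),
  species        = c("Gentoo", "Adelie", "Adelie", "Adelie")
)

# Two steps: run mutate() first, save the result, then select() from it
with_index <- mutate(penguins_df, bill_index = bill_depth_mm + bill_length_mm)
two_step <- select(with_index, bill_index)

# One step: the mutate() call is nested inside select()
nested <- select(mutate(penguins_df, bill_index = bill_depth_mm + bill_length_mm), bill_index)

# Both approaches produce the same one-column data frame
identical(two_step, nested)
```

The two-step version is longer but can be easier to read and debug, since the intermediate object can be printed on its own.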

Your turn

R has a built-in data set called mtcars that has information on cars that appeared in Motor Trend magazine. It’s already loaded and can be accessed as mtcars.

  1. Create a slimmer data frame that only contains the columns hp and wt and save it to mtcars_slim.

  2. Create a new column called power_to_weight that is the ratio of hp to wt. Save the three-column data frame back over mtcars_slim.

  3. Sort the data frame in descending order by the power-to-weight ratio.

Hint: look up help files!



Break


Lab 1: Arbuthnot