A Tool for Computing with Data

STAT 20: Introduction to Probability and Statistics

Agenda

  • Concept Questions: Intro to Computing with R
  • R Workshop
    • Functions and Vectors
    • Data Frames
  • Break
  • Lab 1: Arbuthnot

Concept Questions

Educated Guess 1

What will happen here?


Answer at pollev.com/<name>


1 + "one"
01:00

Educated Guess 2

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 2, 3, 4)
sqrt(log(a))
01:00

Educated Guess 3

What will happen here?


Answer at pollev.com/<name>


a <- 1 + 2
a + 1
01:00

Educated Guess 4

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 3.14, "seven")
class(a)
01:00

Reading Questions

R Workshop

Components of RStudio

  1. Console

  2. Environment

  3. Editor

  4. File Directory

Now we are going to switch over to RStudio to understand these 4 components a bit better.

Components of RStudio

  1. Console: Where the live R session lives. Type commands into the prompt > and press enter/return to run them. The Console is in the lower-left pane.

  2. Environment: The space that keeps track of all of the data and objects that you have created or loaded and have access to. Found in the upper right pane.

  3. Editor: Used to compose and edit text (.qmd files) and R code (.r files). Found in the upper left pane.

  4. File Directory: Used to navigate between the files and folders on your RStudio account. You can move, copy, rename, and delete them here. Found in the lower right pane.

R as a calculator

R allows all of the standard arithmetic operations.

Addition

1 + 2
[1] 3

Subtraction

1 - 2
[1] -1

Multiplication

1 * 2 
[1] 2

Division

1 / 2
[1] 0.5

R as a calculator, cont.

R allows all of the standard arithmetic operations.

Exponents

2 ^ 3
[1] 8

Parentheses for Order of Ops.

2 ^ 3 + 1
[1] 9
2 ^ (3 + 1)
[1] 16

Your turn

What is three times one point two raised to the quantity thirteen divided by six?

01:00

Object assignment

You can create/save objects using the assignment operator <-. This is the equivalent of = in other programming languages.

my_fav_num <- 11

To be recognized as a valid object name, a name has to follow certain conventions: it should begin with a letter and contain only letters, numbers, periods, and underscores (no spaces).

good names    names that cause errors
a             1trial
b             $
FOO           ^mean
my_var        my var
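
A minimal sketch of assignment in action, reusing my_fav_num from above. The name doubled is made up for this illustration.

```r
# Store a value under a name with the assignment operator
my_fav_num <- 11

# The name can now stand in for the value in later calculations
doubled <- my_fav_num * 2
doubled
```

Typing the name of an object on its own line prints its value to the console.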

Functions on vectors

A vector is the simplest structure used in R to store data. It can be created using the function c().

my_vector <- c(1, 3, 4)
my_vector
[1] 1 3 4

A function operates on an R object and produces output. R has many of the mathematical functions that you would expect.

sum(my_vector)
[1] 8
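
Many functions and operators work on every element of a vector at once. A small sketch using my_vector from above; length() and the doubling step are illustrations that go beyond the slide, and n_elements and doubled_vector are hypothetical names.

```r
my_vector <- c(1, 3, 4)

# How many elements does the vector hold?
n_elements <- length(my_vector)
n_elements

# Arithmetic is applied to each element in turn
doubled_vector <- my_vector * 2
doubled_vector
```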

Your Turn

  1. Create a vector named vec with the even integers between 1 and 10 as well as the number 99 (six elements total).

  2. Find the sum of that vector.

  3. Find the max of that vector.

  4. Take the mean of that vector and round it to the nearest integer.

These should all be solved with R code. If you don’t know the name of a function to use, hazard a guess and look for its help file (e.g. ?sum), or google it.

05:00

Demo of:

  1. Creating an R script
  2. Saving it
  3. Typing in code that answers previous question
  4. How to run code from a script
    • Put cursor on line and click “Run”.
    • Put cursor on line and type command+return.
    • Copy and paste to the console.

Building a data frame

You can combine vectors into a data frame using data.frame().

bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")


penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)
penguins_df
  bill_depth_mm bill_length_mm species
1          15.0           47.5  Gentoo
2          17.1           40.2  Adelie
3          18.7           39.0  Adelie
4          18.9           35.3  Adelie
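
Once a data frame exists, a few base R functions can confirm its shape. This is a sketch that rebuilds penguins_df from above; nrow(), ncol(), and names() are not covered on the slide and are shown only as an illustration.

```r
bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")
penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)

# Number of rows (one per penguin) and columns (one per vector)
nrow(penguins_df)
ncol(penguins_df)

# Column names are taken from the names of the vectors
names(penguins_df)
```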

Your Turn

  1. Create an .r script, name it, and save it.

  2. Create three vectors, name, favorite_color, and favorite_number, that contain observations on those variables from 5 people in this class.

  3. Combine them into a data frame called my_classmates.

06:00

Loading Packages

R has a vast ecosystem of packages that add new functions. Any installed package can be loaded with the library() function.

Our two main packages:

  • tidyverse
  • stat20data

Load them with:

library(tidyverse)
library(stat20data)
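
library() stops with an error if the package has not been installed yet. A hedged sketch of checking first; has_tidyverse is a name made up for this example, and the packages may already be installed on your RStudio account.

```r
# TRUE if the package is installed, FALSE otherwise (nothing is loaded yet)
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  library(tidyverse)
} else {
  # install.packages("tidyverse") would download it from CRAN
  message("tidyverse is not installed yet.")
}
```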

Loading data from a package

Most data you will not be creating by hand. You will either be

  1. Loading it in from a separate file.

  2. Loading it from within an R package (most of ours are in stat20data).
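
For the first option, a minimal sketch using base R's read.csv(). The file here is a temporary one written on the spot so the example runs on its own; csv_path, from_file, and the two example rows are all hypothetical. With a real file you would pass its path instead.

```r
# Write a tiny CSV to a temporary file (stand-in for a real data file)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(name = c("Ana", "Ben"), favorite_number = c(7, 42)),
  csv_path,
  row.names = FALSE
)

# Read the file back in as a data frame
from_file <- read.csv(csv_path)
from_file
```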

To load data from a package:

  1. Load that package with library().
  2. Print the data to the console by typing its name and pressing enter, or see it in the viewer with View(<df name>).

library(stat20data)
penguins
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
 9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Functions on data frames

3 functions from the tidyverse

The tidyverse (through its dplyr package) provides several functions used to manipulate data frames:

  • select() : subset columns
  • arrange() : sort rows
  • mutate() : create a new column from existing column(s)

Selecting columns

select(penguins, species, island)
# A tibble: 333 × 2
   species island   
   <fct>   <fct>    
 1 Adelie  Torgersen
 2 Adelie  Torgersen
 3 Adelie  Torgersen
 4 Adelie  Torgersen
 5 Adelie  Torgersen
 6 Adelie  Torgersen
 7 Adelie  Torgersen
 8 Adelie  Torgersen
 9 Adelie  Torgersen
10 Adelie  Torgersen
# … with 323 more rows

Arranging the rows of a data frame

arrange(penguins, bill_length_mm)
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Dream               32.1          15.5        188    3050 fema…  2009
 2 Adelie  Dream               33.1          16.1        178    2900 fema…  2008
 3 Adelie  Torgersen           33.5          19          190    3600 fema…  2008
 4 Adelie  Dream               34            17.1        185    3400 fema…  2008
 5 Adelie  Torgersen           34.4          18.4        184    3325 fema…  2007
 6 Adelie  Biscoe              34.5          18.1        187    2900 fema…  2008
 7 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
 8 Adelie  Torgersen           34.6          17.2        189    3200 fema…  2008
 9 Adelie  Biscoe              35            17.9        190    3450 fema…  2008
10 Adelie  Biscoe              35            17.9        192    3725 fema…  2009
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

You can sort in descending order by wrapping the variable name in desc().
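
A sketch of desc() on the small hand-built data frame from earlier, so it runs without stat20data. It loads dplyr directly, which library(tidyverse) also loads; deepest_first is a hypothetical name.

```r
library(dplyr)

penguins_df <- data.frame(
  bill_depth_mm  = c(15.0, 17.1, 18.7, 18.9),
  bill_length_mm = c(47.5, 40.2, 39.0, 35.3),
  species        = c("Gentoo", "Adelie", "Adelie", "Adelie")
)

# Deepest bills first: wrap the sorting variable in desc()
deepest_first <- arrange(penguins_df, desc(bill_depth_mm))
deepest_first
```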

Mutating a new column

mutate(penguins, bill_index = bill_depth_mm + bill_length_mm)
# A tibble: 333 × 9
   species island    bill_length_mm bill_d…¹ flipp…² body_…³ sex    year bill_…⁴
   <fct>   <fct>              <dbl>    <dbl>   <int>   <int> <fct> <int>   <dbl>
 1 Adelie  Torgersen           39.1     18.7     181    3750 male   2007    57.8
 2 Adelie  Torgersen           39.5     17.4     186    3800 fema…  2007    56.9
 3 Adelie  Torgersen           40.3     18       195    3250 fema…  2007    58.3
 4 Adelie  Torgersen           36.7     19.3     193    3450 fema…  2007    56  
 5 Adelie  Torgersen           39.3     20.6     190    3650 male   2007    59.9
 6 Adelie  Torgersen           38.9     17.8     181    3625 fema…  2007    56.7
 7 Adelie  Torgersen           39.2     19.6     195    4675 male   2007    58.8
 8 Adelie  Torgersen           41.1     17.6     182    3200 fema…  2007    58.7
 9 Adelie  Torgersen           38.6     21.2     191    3800 male   2007    59.8
10 Adelie  Torgersen           34.6     21.1     198    4400 male   2007    55.7
# … with 323 more rows, and abbreviated variable names ¹​bill_depth_mm,
#   ²​flipper_length_mm, ³​body_mass_g, ⁴​bill_index

Remember that you can nest functions.

Nesting functions

select(mutate(penguins, bill_index = bill_depth_mm + bill_length_mm), bill_index)
# A tibble: 333 × 1
   bill_index
        <dbl>
 1       57.8
 2       56.9
 3       58.3
 4       56  
 5       59.9
 6       56.7
 7       58.8
 8       58.7
 9       59.8
10       55.7
# … with 323 more rows
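
In a nested call, R evaluates the inner function first and hands its result to the outer one. A sketch that spells this out on the small hand-built data frame, so it runs without stat20data; with_index, step_by_step, and nested are hypothetical names.

```r
library(dplyr)

penguins_df <- data.frame(
  bill_depth_mm  = c(15.0, 17.1, 18.7, 18.9),
  bill_length_mm = c(47.5, 40.2, 39.0, 35.3),
  species        = c("Gentoo", "Adelie", "Adelie", "Adelie")
)

# Two steps: save the mutated data frame, then select from it
with_index <- mutate(penguins_df, bill_index = bill_depth_mm + bill_length_mm)
step_by_step <- select(with_index, bill_index)

# One step: the mutate() call sits where the data frame argument goes
nested <- select(
  mutate(penguins_df, bill_index = bill_depth_mm + bill_length_mm),
  bill_index
)

# Both routes give the same one-column data frame
all.equal(step_by_step, nested)
```

The two-step version is longer but leaves an intermediate object in the Environment pane that you can inspect.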

Your turn

There is a data set built into R called mtcars that has information on cars that appeared in Motor Trend magazine. It’s already loaded and can be accessed as mtcars.

  1. Create a slimmer data frame that only contains the columns hp and wt and save it to mtcars_slim.

  2. Create a new column called power_to_weight that is the ratio of hp to wt. Save the three-column data frame back over mtcars_slim.

  3. Sort the data frame in descending order by the power-to-weight ratio.

Hint: look up help files!

08:00


Break

05:00

Lab 1: Arbuthnot