Introducing Probability

Definitions, examples, and axioms

class date

February 22, 2023

Discuss Reading Questions PDF

In this new unit of generalization, we are changing gear. After exploring datasets by creating graphical and numerical summaries, now we want to generalize from the data to the world. To do this, we need an understanding of the language and notation of probability.

For example, if the data are from a clinical trial for a new Covid-19 treatment, the design of the trial will decide who gets the treatment, and who doesn’t. Or perhaps we want to poll voters to estimate the support for decriminalizing marijuana possession federally (President Biden took a first step towards this goal by pardoning all Americans convicted of simple marijuana possession under federal law). Who is selected for such an opinion poll can make a big difference in the estimate of the proportion of voters who support decriminalization of marijuana.

You may have observed that some polling companies are more successful than others in their estimates and predictions, and consequently they are taken more seriously. Below is a snapshot of rankings of polling organizations from the well-known website FiveThirtyEight1. According to the website, the rankings are based on the polling organization’s ``historical accuracy and methodology’’

In order to make estimates as these polling organizations are doing, or understand the results of a clinical trial or such questions in which we generalize from our data sample to a larger group, we have to understand the variations in data introduced by randomness in our sampling methods. It is very important when we estimate a quantity (such as the proportion of voters who support a particular policy), that we compute the error of our estimate. The error is determined by the variability of the estimate. (Each time we poll a different group of voters, for example, we will get a different estimate of the proportion of voters that support decriminalizing marijuana possession.) To understand variation, we first have to understand how probability was used to collect the data. Before we do this, let’s make sure we understand what we mean by probability or chance, with two simple examples.

Coins and Dice

Example 1: Tossing a fair coin and counting the number of heads

Suppose I told you that I am going to toss a fair coin, and ask you what is the chance of the coin landing heads. Like most people, you reply 50%. Where does this number come from? Try tossing a coin twice and see how it lands. You might have it land heads once and tails once, or you may get two heads, or two tails. Is this surprising? Okay, let’s toss it ten times and count the number of heads and tails. I am going to simulate this in R (we will explore the code later in the notes):

# this ensures that each time we sample, we will get the same result
coin1 <- c("Heads", "Tails")

tosses <- sample(coin1, 10, replace = TRUE)

data.frame(tosses) %>%
  group_by(tosses) %>% 
  summarise(n = n())
# A tibble: 2 × 2
  tosses     n
  <chr>  <int>
1 Heads      3
2 Tails      7

What about if I toss it fifty times?


tosses <- sample(coin1, 50, replace = TRUE)

data.frame(tosses) %>%
  group_by(tosses) %>% 
  summarise(n = n())
# A tibble: 2 × 2
  tosses     n
  <chr>  <int>
1 Heads     15
2 Tails     35

This doesn’t look so good… Maybe tossing one hundred times will improve the split:


tosses <- sample(coin1, 100, replace = TRUE)

data.frame(tosses) %>%
  group_by(tosses) %>% 
  summarise(n = n())
# A tibble: 2 × 2
  tosses     n
  <chr>  <int>
1 Heads     46
2 Tails     54

We see that as the number of tosses increases, the distribution of tosses begins to look closer to 50 heads and 50 tails.

Now, instead of counting the number of heads and tails, let’s look at the proportion of tosses that land heads when we toss a coin \(n\) times, where \(n\) varies from \(1\) to \(1000\). We can plot the proportion of the tosses that land heads against the number of tosses:

coin1 <- c("H", "T")
toss_num <- 1000 # total number of tosses 
tosses <- sample(coin1, size = toss_num, replace = TRUE) 
# # simulating tossing a coin 1000 times

prop_heads <- cumsum(tosses == "H") / 1:toss_num
prop_heads <- data.frame(cbind(prop_heads, 1:toss_num))
prop_heads <- rename(prop_heads, toss_num = V2)
#  computing the fraction of tosses that land heads for each n and 
#  making the resultant vector into a data frame and renaming the columns

prop_heads %>% 
  ggplot(aes(y = prop_heads, x = toss_num)) +
  geom_line() + 
  annotate("segment", x = 0, xend = max(toss_num) + 50, y = 0.5, yend = 0.5,
           color = "red", linetype = 2, size = 0.9) + 
  xlab("Number of tosses") +
  ylab("Proportion of tosses that land heads") +
  ggtitle("As we increase the number of tosses,
           the proportion of heads settles at about 50% ")

We see that at the beginning, for a small number of tosses, the proportion of times that the coin lands heads is all over the place, but it eventually settles down to be around 0.5. This verifies our intuition that if we toss a fair coin, the proportion of times that the coin lands heads should be about 0.5. This idea of the probability of a particular outcome as a long run proportion of times that we see that outcome is called the frequentist theory of probability, and we will be using this theory in our class. (A different theory of probability uses a subjective notion of probability, but we won’t get into that at this time.) We will think about the probability of heads as the long-run relative frequency, or the proportion of times the coin will land heads if we toss it many, many times. This fits with our intuition that if we have a fair coin, that means that each of the two possible outcomes will occur roughly the same number of times when we toss it over and over again. We can then call the two outcomes equally likely and can define the probability of heads to be 1/2.


An action, involving chance, that can result in a finite number of possible outcomes (results of the experiment). For example, a coin toss is an experiment, and the possible outcomes are the coin landing heads or tails.
Outcome space
The collection of all the possible outcomes of an experiment is called an outcome space, and we denote it by \(\Omega\). For example, if we toss a coin, then the corresponding \(\Omega = \{\text{Heads, Tails}\}\). If we roll a die, then the corresponding \(\Omega = \{1, 2, 3, 4, 5, 6\}\)
The chance or probability of a particular outcome is the fraction of times that this outcome is expected to occur, when the experiment that results in the outcome is repeated over and over again (long run relative frequency).
A collection of outcomes as a result of the experiment being performed, perhaps more than once. For example, we could toss a coin twice, and consider the event of both tosses landing heads. We usually denote events by upper case letters from the beginning of the alphabet: \(A, B, C, \ldots\)
Example 2: Tossing a fair six-sided die2

Consider rolling a fair six-sided die: six outcomes are possible, and since the die is fair, each of them will have the same long run relative frequency. We can say that the six faces are equally likely: in the long run we expect that each of the faces will occur (roughly) the same number of times. We can then define the probability of each of the faces to be 1/6. This is called the probability distribution of the number of times we see each face, and can be listed in a table:

Face \(1\) \(2\) \(3\) \(4\) \(5\) \(6\)
Probability \(\displaystyle \frac{1}{6}\) \(\displaystyle \frac{1}{6}\) \(\displaystyle \frac{1}{6}\) \(\displaystyle \frac{1}{6}\) \(\displaystyle \frac{1}{6}\) \(\displaystyle \frac{1}{6}\)
Probability of equally likely outcomes
If all the outcomes of an experiment with \(n\) possible outcomes are equally likely, then the probability or chance of each outcome is \(\displaystyle \frac{1}{n}\).
Probability distribution
A listing of all the possible outcomes of an experiment and their probabilities.
Probability of an event
If the event \(A\) has \(k\) outcomes, and \(\Omega\) has \(n\) outcomes, and all the outcomes are equally likely, then the probability of \(A\) is \(\displaystyle \frac{k}{n}\). For example, consider tossing a coin twice. Then \(\Omega = \{\text{HH, HT, TH, TT}\}\), where H denotes the coin landing heads, and T the coin landing tails. Let \(A\) be the event that the coin lands the same way on both tosses, then \(A = \{\text{HH, TT}\}\). Therefore the probability of \(A = \displaystyle \frac{2}{4}\).

Let’s simulate rolling a die and counting how many times we see each face.

If we roll it 6 times, we don’t really expect to see each face once, and as you can see below, in this particular instance of rolling the die six times, we didn’t see the face with one spot, but saw two spots twice.

die <- 1:6

die_rolls <- sample(die, 6, replace = TRUE)

data.frame(die_rolls) %>%
  group_by(die_rolls) %>% 
  summarise(n = n())
# A tibble: 5 × 2
  die_rolls     n
      <int> <int>
1         2     2
2         3     1
3         4     1
4         5     1
5         6     1

What about if we roll the die 60 times? We should see each face about ten times:


die_rolls <- sample(die, 60, replace = TRUE)

data.frame(die_rolls) %>%
  group_by(die_rolls) %>% 
  summarise(n = n())
# A tibble: 6 × 2
  die_rolls     n
      <int> <int>
1         1    10
2         2    11
3         3    13
4         4    10
5         5     6
6         6    10

Not so great, but recalling the fair coin, let’s roll the die 600 times:


die_rolls <- sample(die, 600, replace = TRUE)

data.frame(die_rolls) %>%
  group_by(die_rolls) %>% 
  summarise(n = n())
# A tibble: 6 × 2
  die_rolls     n
      <int> <int>
1         1   114
2         2   112
3         3   101
4         4    95
5         5    98
6         6    80

It might be better to visualize it. We will draw the probability histogram, which shows the probabilities of each possible outcome (the probability distribution), and compare it to the empirical histograms (plotting the data distribution, not the probability distribution) of the results of rolling the die \(6, 60\), and \(600\) times. Note that the probability histogram is a theoretical histogram, where the area of the bars represent probabilities, and the total area of the histogram is \(1\).

prob_die <- rep(1/6, 6)

p1 <- data.frame(die) %>% 
  ggplot(aes(x = factor(die), y=prob_die)) +
  geom_col(width = 0.98, fill = "goldenrod2") +
  xlab("number of spots") +
  ylab("probability") +
  ggtitle("Probability distribution of the \n outcome of a die roll") +
  lims(y = c(0, .35))

roll_6 <- sample(die, 6, replace = TRUE)
roll_6 <- data.frame(table(roll_6)) %>%
  add_row(roll_6 = factor(1), Freq = 0, .before = 1) %>%
  mutate(spots = factor(1:6),
         prop_rolls = Freq/6)
# adding in a new row at the top that lists the number of times
# 1 was rolled in 6 rolls
# then adding in the values for the x-axis

p2 <- roll_6 %>%
  ggplot(aes(x = spots, y= prop_rolls)) +
  geom_col(width = 0.98, fill = "blue") +
  xlab("number of spots") +
  ylab("proportion of rolls") +
  ggtitle("Empirical distribution for 6 rolls") +
  lims(y = c(0, .35))

roll_60 <- sample(die, 60, replace = TRUE)

roll_60 <- data.frame(table(roll_60)) %>%
    mutate(spots = factor(1:6),
         prop_rolls = Freq/60)

p3 <- roll_60 %>%
  ggplot(aes(x = spots, y= prop_rolls)) +
  geom_col(width = 0.98, fill = "blue") +
  xlab("number of spots") +
  ylab("proportion of rolls") +
  ggtitle("Empirical distribution for 60 rolls") +
  lims(y = c(0, .35))

roll_600 <- sample(die, 600, replace = TRUE)

roll_600 <- data.frame(table(roll_600)) %>%
    mutate(spots = factor(1:6),
         prop_rolls = Freq/600)

p4 <- data.frame(roll_600) %>%
  ggplot(aes(x = spots, y=prop_rolls)) +
  geom_col(width = 0.98, fill = "blue") +
  xlab("number of spots") +
  ylab("proportion of rolls") +
  ggtitle("Empirical distribution for 600 rolls") +
  lims(y = c(0, .35))

(p1 + p2)/(p3+p4)

The important takeaway here is that we have a theoretical probability distribution of the outcomes, and we have what actually happens when we perform the experiment over and over. Eventually, the empirical distribution begins to look like the theoretical distribution.

Rules of probability

What we have done here is estimate chances using the idea that the chance (or probability) of an outcome is the proportion of times it occurs when we simulate over and over again the basic process that results in the outcome we are counting. We have used the definition of the probability of an outcome to be the proportion of times we expect it to occur when the experiment that generates the event is independently repeated many times 3. We denote the probability of an event \(A\) by \(P(A)\).

The chance of an event (where an event might be one outcome, such as “rolling a die and seeing two spots”, or a collection of outcomes, such as “rolling an even number”) needs to satisfy some basic mathematical rules called axioms (which are intuitively clear if you think of the chance of an event as its relative frequency):

Rule 1
The chance of any event is at least \(0\) (\(P(A) \ge 0\) for any event \(A\)).
Rule 2
We call the collection of all possible outcomes the outcome space \(\Omega\). The chance of an outcome being in \(\Omega\) is \(1\) (\(P(\Omega) = 1\)).

Note that the empty event (with no outcomes in it) is called the impossible event and denoted by the symbol for the empty set \(\emptyset\). The probability of the impossible event is \(0\).

Mutually exclusive events
If two events \(A\) and \(B\) do not overlap, that is, they have no outcomes in common, we say that the events are mutually exclusive. Note that if \(A\) and \(B\) are mutually exclusive, if we know that one of them happens, the other one cannot. We denote this by writing \(A \cap B = \emptyset\) and read this as \(A\) intersect \(B\) is empty. Therefore, we have that \(P(A \cap B) = P(\emptyset) = 0\).

For example: the event that a coin lands heads on a single toss and that it lands tails are mutually exclusive. Another example is that if we roll a die, the events that we roll an even number and the event that we roll an odd number are mutually exclusive, but the events that we roll an even number and the event that we roll a prime number are not mutually exclusive. The event that Manchester City win the English Premier League in 2023, and the event that Arsenal win the EPL in 2023 are mutually exclusive, but the events that Manchester City are EPL champions in 2023 and Manchester City are EPL champions in 2022 are not mutually exclusive.

In general, the collection of outcomes that are in both \(A\) and \(B\) is written as \(A \cap B\) or \(A\) and \(B\), regardless of whether their intersection (overlap) is empty or not.

If we want to consider the event consisting of all the outcomes that are either in \(A\) or in \(B\), we call this event the union of \(A\) and \(B\), and write it as \(A \cup B\). In general, the collection of all outcomes that are in \(A\) or \(B\) is written as \(A \cup B\) and can be read as \(A\) union \(B\) or \(A\) or \(B\).

Rule 3
If \(A\) and \(B\) are mutually exclusive (\(A \cap B = \emptyset\)), then \(P(A \cup B) = P(A) + P(B)\). That is, for two mutually exclusive events, the probability that either of the two events might occur is the sum of their probabilities. This is called the addition rule.

For example, consider rolling a fair six-sided die, and the two events \(A\) and \(B\), where \(A\) is the event of rolling a multiple of \(5\), and \(B\) is the event that we roll a multiple of \(2\). The only outcome in \(A\) is \(\{5\}\), while \(B\) consists of \(\{2, 4, 6\}\). Since all the sides of a fair die are equally likely, the chance of \(A\) is the chance of rolling a \(5\) which is \(1/6\). The chance of rolling an even number is \(1/6 + 1/6 +1/6 = 3/6\) (applying the same idea of adding up the chances, since we can’t have the die show two sides at the same time). This implies that the probability of either \(A\) or \(B = P(A \cup B) = 1/6 + 3/6 = 4/6\).

Note that both events cannot happen - using a six-sided die, you cannot roll a multiple of 5 that is also an even number. That is, the chance of \(A\) and \(B\) is \(0\).

An important consequence of rule 3: Let \(A\) be an event in \(\Omega\). The complement of \(A\), written as \(A^C\) consists of all those outcomes in \(\Omega\) that are not in \(A\). Note that \(A \cup A^C = \Omega\), so \(P(A) + P(A^C) = 1\).

Venn Diagrams

We often represent events using Venn diagrams. The outcome space \(\Omega\) is usually represented as a rectangle, and events are represented as circles inside \(\Omega\). Here is a Venn diagram showing two events \(A\) and \(B\), their intersection, and their union:

Here is a Venn diagram showing two mutually exclusive events:

The Box Model

One of the simplest ways of understanding chance (probability), and later, variations resulting from sampling, is through the box model4. A box model consists of a box with numbered tickets, from which tickets are drawn. To specify a box model, we have to say what tickets go in the box (that is, what are the numbers on the tickets, and how many of each) , and whether the tickets will be drawn with or without replacement (after drawing a ticket from the box, do we put it back in the box or not, before we draw another ticket).

A box of tickets

Consider the box above which has five almost identical tickets. The only difference is the value written on them. Imagine that we shake the box to mix the tickets up, and then draw one ticket without looking. This is to ensure that all the tickets are equally likely to be drawn5. For example, the chance of drawing a ticket labeled “4” is one in five, as there are five tickets to choose from and only one is labeled “4” etc.

Here is the probability histogram of the value of the drawn ticket, and you see that the bar over 2 is twice as tall. This is because there are twice as many tickets marked 2 as any other numbers. The counts or frequencies of the tickets in the box is called the population distribution or the box distribution, and you can see that this gives us the probability distribution for the value of the ticket drawn at random from the box.

tkts_box <- c(1, 2, 3, 4)
prob_box <- c(1/5, 2/5, 1/5, 1/5)

box <- c(1, 2, 2, 3, 4)

p1 <- data.frame(box) %>% 
  ggplot(aes(x=box)) +
  geom_bar(width = 0.98, fill = "darkorchid") +
  xlab("ticket value")  +
  ggtitle("Ticket distribution")

p2 <- data.frame(tkts_box) %>% 
  ggplot(aes(x=tkts_box, y=prob_box)) +
  geom_col(width = 0.98, fill = "goldenrod2") +
  xlab("ticket value") +


Notice that the only difference between these two plots is the vertical scale. When the tickets are equally likely, the box distribution (the purple plot) completely determines the probability distribution (the golden plot).

How about if we draw 50 times at random with replacement from this box and see what our sample looks like (we will explore the code later in the notes). That is, we will see what proportion of the draws are 1’s, 2’s etc.

box <- c(1,2,2,3,4)
sample_size <- 50
sample_box <- sample(box, size = sample_size, replace = TRUE)
data.frame(sample_box) %>%
  ggplot(aes(sample_box)) + 
  geom_bar(aes(y = ..prop..), width = 0.98, fill = "blue") +
  xlab("values of draws") + 
  ylab("proportion of draws")

We can see that the (blue) sample proportions look similar to the (gold) population (box) proportions (the probability distribution above), but are somewhat different. It turns out that the counts of the drawn tickets are:

Ticket Number of times drawn Proportion of times drawn
\(\fbox{1}\) 10 0.2
\(\fbox{2}\) 24 0.48
\(\fbox{3}\) 10 0.2
\(\fbox{4}\) 6 0.12

What we have seen here is how when we draw at random, we get a sample that resembles the population, that is, a representative sample.

The Box Model

The purpose of creating a box model is to understand the variability in a chance process such as tossing a coin over and over and counting the number of heads, or sampling voters to ask them if they will vote for President Biden in the 2024 elections. We do this by having the tickets represent the possible outcomes associated with the chance process, and using the variability in the tickets that are drawn to understand the variability of the chance process.

In order to create a box model that will model a process in which we have a repeated action (such as tossing a coin or drawing a ticket from a box), we must specify:

  1. The box and tickets (what tickets go in the box, and the numbers written on them)
  2. The number of draws
  3. Whether or not we replace tickets between draws
  4. How we will summarize the values on the drawn tickets (we will sum the draws or compute their average)

Note that these tickets should be identical in every way, except the value written on them, and assume that the tickets in the box are equally likely to be drawn (we draw without looking into the box).

Setting up a box model: examples

Let’s practice creating boxes to model coin tosses and die rolls. For each of the cases below, specify the four items above.

Example 1: Tossing a coin

Tossing a fair coin once

What box and which tickets would you use to simulate one coin toss?

Check your answer

We could draw one ticket from the box but these are not numbered.

Or we could draw one ticket from the box with the ticket \(\fbox{1}\) representing the coin landing heads, and \(\fbox{0}\) representing the coin landing tails. This change of numbering the ticket representing the outcome we are counting with a 1 and all other tickets with a 0 is very important, since we are classifying the outcomes into two categories, and counting the instances of one of them (in this case, the coin landing heads).
Ask yourself

What box would we use if the coin is not a fair coin, but lands heads \(5\) out of \(6\) times?

Tossing a fair coin twice

Check your answer

We could use the same box as above:

Remember that the ticket \(\fbox{1}\) represents the outcome of a toss landing heads and \(\fbox{0}\) represents a toss landing tails (because we are counting the number of heads). We let the box be as above, and draw two tickets at random with replacement from this box. To count the number of heads in two tosses, we add the draws.

Tossing an unfair coin

Set up the box model for tossing a coin which has chance of 5/6 landing heads, and counting the number of heads.

Check your answer

We do the same trick, of representing the outcome of heads with \(\fbox{1}\), and tails with \(\fbox{0}\). Since the coin lands heads 5 times out of 6, the box is given by:

We draw as many times as the number of tosses, and sum up the values on the tickets drawn.
Example 2: Rolling a fair die and summing the spots


Rolling a fair die once

Check your answer

The box will have six tickets as shown below, and we would draw one ticket from this box. The chance of any one of the tickets is \(1/6\).

Rolling a fair die twice

Check your answer

We would use the same box as in the previous example, draw twice at random with replacement from this box, then add the draws.

Example 3: Betting on red in roulette


Expand to read this example

An American roulette wheel has 38 pockets, of which 18 are red, 18 black, and 2 are green. The wheel is spun, and a small ball is thrown on the wheel so that it is equally likely to land in any of the 38 pockets. Players bet on which colored or numbered pocket the ball will come to rest in. If you bet one dollar that the ball will land on red, and it does, you get your dollar back, and you win one more dollar, so your net gain is $1. If it doesn’t, and lands on a black or green number, you lose your dollar, and your net “gain” is -$1. What would be the box model for your net gain from a single spin?

Net gain from a single spin

Check your answer

We would use the box shown below marked “Gain”. In the top box, there are 38 tickets, each representing a numbered pocket on the wheel. Draw one ticket to represent the ball landing in a particular pocket. In the second row each ticket represents a colored pocket, and we see that there are 18 red pockets, 18 black, and 2 green. We only care about the outcome ball lands in red pocket, and our gain from this. We would draw once from it, with the ticket marked \(\fbox{+1}\) representing our gain if the ball lands on a red pocket, and the ticket \(\fbox{-1}\) representing our loss if the ball doesn’t land on red. We can see that the chance of drawing a \(\fbox{+1}\) is 18 out of 38.

Net gain from 10 spins

How would we model our net gain or winnings from 10 spins (we bet $1 on each spin, and either lose it or get it back plus $1)?

Check your answer Use the same box as above, with 18 tickets marked \(\fbox{+1}\) and 20 marked \(\fbox{-1}\), draw 10 times at random with replacement, and sum the draws.
Example 4: Donkeys in Kenya


Expand to read this example

Donkeys play important roles in rural Kenya. People need donkeys for moving crops and water, for personal transportation, and for plowing fields. When a donkey gets sick, the vet needs to figure out how much the donkey weighs in order to prescribe the right amount of medicine. But, many vets in rural Kenya don’t have a scale big enough to weigh a donkey. Too little medicine can cause a sickness to re-emerge, and too much medicine can cause a harmful overdose. There are over 1.8 million donkeys in Kenya, so it’s important to have a simple, accurate way to estimate the weight of a donkey. The question is: how can we accurately estimate the weight of a donkey using other easy to get measurements? To address this question, The UK Donkey Sanctuary carried out a study at 17 mobile de-worming sites in rural Kenya. All 544 donkeys brought to these sites between July 23 and August 11, 2010, that were not pregnant or visibly diseased, were entered into the study.9.

Four body measurements were made for each donkey: liveweight (kg), heart girth (cm), height (cm), and length (cm), and in addition, sex and body condition score were recorded (the body condition score categorized the donkeys according to their condition, ranging from “emaciated” to “obese”). Suppose I want to sample 50 donkeys from this dataset, to compute their average weight and average girth (girth or heart girth is the circumference of the body, measured just behind the front legs.)

Set up the box model for this process (sampling donkeys).

Check your answer

The box has 544 tickets, one for each donkey. The ticket has information from all 8 variables recorded (3 categorical and 5 numerical). We only need 2 variables, so the tickets will have those two variables recorded. We draw 50 tickets from this box at random without replacement, and compute 2 averages from the draws.

Example 5: Penguins!

Expand to read this example

You have seen the penguins data, which measured 8 variables on a sample of 333. Here the box will contain 344 tickets, and we will draw 50 tickets at random without replacement. Note that each ticket in the box represents one penguin in the sample, and has information for multiple variables measured on that penguin. We would consider only the information we are interested in. Suppose we wanted sex, species, and flipper length, to compare the flipper lengths of males and females, grouped by species. Set up the box model.

Check your answer

The box would have 344 tickets, one for each penguin. Each ticket would have 3 things written on it: the species, sex, and flipper length in mm of the penguin. We would draw 50 tickets at random without replacement, and

Simulating the box model

Code along

As you read through these notes, keep RStudio open in another window to code along at the console.

The box model is easily simulated in R, since there is a convenient function that does exactly what we need: draw tickets from a “box” (a vector). We will use the function sample(x, size = n, replace = FALSE), where x is the vector you want to sample from, size is the number of draws (with the default value being the length of x), and replace specifies whether the draws are with or without replacement. The default in sample() is to draw without replacement. (Note: The set.seed() function below ensures that we will get the same random sample each time we run the code.)10

box <- c(1, 2, 2, 3, 4)
[1] 2

We can use sample() to estimate the chance of a particular outcome when we aren’t sure of what that chance might be. We would do this by repeatedly sampling from the “box” with replacement (many times), then computing the proportion of times we drew each ticket. For example, say we consider our first example (the simple box), and want to estimate the chance of each ticket.

In the code below, another new function is introduced: replicate(). The function replicate(reps, expr) is a very useful function that takes as input an expression expr and evaluates it reps times, returning a vector.

box <- c(1, 2, 2, 3, 4)
draws <- replicate(2000, sample(box, 1, replace = TRUE))
ggplot(data.frame(draws), aes(x=draws)) + 
  geom_bar(aes(y=..prop..), fill="blue", width = 0.98) + 
  ylab("proportion of draws") + 
  xlab("ticket drawn")

We see that the estimated chance of drawing a \(\fbox{2}\) is about 0.4, and this is about twice the estimated chance of drawing any other ticket. Of course, we knew this already, without needing to code it in R. Let’s think of a more complicated situation:

What if we wanted to wanted to draw five tickets with replacement from this box, and sum the draws? What would be the possible values that we would get? What could their chances be? We can visualize this in R:

box <- c(1, 2, 2, 3, 4)
draws <- replicate(5000, sum(sample(box, size = 5, replace = TRUE)))
ggplot(data.frame(draws), aes(x=draws)) + 
  geom_bar(aes(y=..prop..), fill="blue", width = 0.98) + 
  ylab("proportion") + 
  xlab("sum of draws") + 
  scale_x_continuous(breaks = seq(min(draws), max(draws), by = 1))

We can see that there is a lot more variation in the values taken by the sum of 5 draws.

Tossing a fair coin

We can estimate the chances of various outcomes related to coin tossing, using sampling from a box.

Suppose, for example, that we would like to figure out the chance of exactly 2 heads if we toss a coin 4 times. Think about how you would use the functions sample() and replicate() to model this, using the 0-1 box we defined earlier, for tossing a coin.

coin <- c(0, 1) #1 represents the toss landing heads
two_heads <-replicate(50000, sum(sample(coin, 4, replace = TRUE)) == 2)
cat("The proportion of times we see 2 heads out of 4 tosses is", mean(two_heads))
The proportion of times we see 2 heads out of 4 tosses is 0.37816

Rolling a pair of dice and summing the spots

This is something that we could use if we wanted to play Monopoly and couldn’t find the dice. Recall the box we used to simulate a die roll. Now we are going to define a vector in R to represent a die, and sample twice with replacement, from this vector, and add the spots.

die <- 1:6
# to simulate rolling a die twice and summing the spots
draws <- sample(die, size = 2, replace = TRUE)
[1] 6

We could also repeat it many times and estimate the chance of each of the possible outcomes.

many_draws <- replicate(5000, sum(sample(die, size = 2, replace = TRUE)))
ggplot(data.frame(many_draws), aes(x=many_draws)) + 
  geom_bar(aes(y=..prop..), fill="blue", width = 0.98) + 
  ylab("proportion") + 
  xlab("sum of two draws") + 
  scale_x_continuous(breaks = seq(min(many_draws), 
                                  max(many_draws), by = 1))

Ask yourself

We know all the possible outcomes of summing a pair of dice (between \(2\) and \(12\) spots). Why not make a box with tickets labeled \(2, 3, 4, \ldots, 12\) and draw once from that box? If we did indeed want a box from which we would only draw once, what would the box be? (Hint: How many possible outcomes would there be if you rolled a pair of dice. )

Sampling donkeys

In the donkeys dataset, the box has 544 tickets, with each ticket representing one of the donkeys that was measured at the mobile de-worming sites. Each ticket contains the values of 8 variables, of which we are interested in 2. We set up the box model for drawing 50 tickets at random without replacement, and we wanted to compute the average girth and the average weight.

We will use a function called slice_sample. The function slice was introduced in the notes on conditioning, and slice_sample is similar, except that it selects a random sample from the rows of a data frame. We will compute the average girth and the average weight for the donkeys in this sample.

# read in the donkey data
donkeys <- read.csv('donkeys.csv')

# take a quick look at the dataset
Rows: 544
Columns: 8
$ BCS       <dbl> 3.0, 2.5, 1.5, 3.0, 2.5, 1.5, 2.5, 2.0, 3.0, 3.0, 3.0, 3.0, …
$ Age       <chr> "<2", "<2", "<2", "<2", "<2", "<2", "<2", "<2", "<2", "<2", …
$ Sex       <chr> "stallion", "stallion", "stallion", "female", "female", "fem…
$ Length    <int> 78, 91, 74, 87, 79, 86, 83, 77, 46, 92, 86, 81, 84, 85, 87, …
$ Girth     <int> 90, 97, 93, 109, 98, 102, 106, 95, 66, 110, 105, 106, 106, 1…
$ Height    <int> 90, 94, 95, 96, 91, 98, 96, 89, 71, 99, 92, 92, 96, 96, 97, …
$ Weight    <int> 77, 100, 74, 116, 91, 105, 108, 86, 27, 141, 100, 95, 115, 1…
$ WeightAlt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# draw a sample of 50 donkeys and compute their mean girth and weight
donkeys %>%
  select(Girth, Weight) %>% #changing tickets to include variables of interest
  slice_sample(n = 50) %>% #drawing 50 times without replacement
  summarise(average_girth_cm = mean(Girth), average_weight_kg = mean(Weight))
  average_girth_cm average_weight_kg
1           115.94            151.86

Sampling penguins

Recall that in the penguins example, when the box has 344 tickets, with each ticket representing one of the observations that we have access to from the Palmer penguins data, each ticket contains the values of multiple variables. We set up the box model for sampling 50 penguins without replacement, and we wanted to compare flipper lengths.

# one sample
drop_na(penguins,sex) %>%
  select(species, sex, flipper_length_mm) %>% #changing tickets
  slice_sample(n = 50) %>% #drawing 50 times without replacement
  group_by(species, sex) %>%
  summarise(mean_flipper_length_mm = mean(flipper_length_mm))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    mean_flipper_length_mm
  <fct>     <fct>                   <dbl>
1 Adelie    female                   190.
2 Adelie    male                     194.
3 Chinstrap female                   194.
4 Chinstrap male                     197 
5 Gentoo    female                   211 
6 Gentoo    male                     223 

Drawing with replacement and without replacement

What’s the difference?

Consider our first box again, the one with five tickets. When we draw at random with replacement, we draw one ticket, and put it back before drawing another ticket. For the box in this section, if we draw twice with replacement, both times we will draw from the following box:

If we draw twice at random without replacement, then our second draw is from a different box, with an example illustrated below:

In the next set of notes, we are going to use R to simulate what happens when we draw twice from this box, once with replacement, and once without. We are also going to talk about the vocabulary of sampling.


  • In this lecture, we introduced the ideas of probability via the frequentist theory of probability. We used this to justify the notion of equally likely outcomes, and defined the outcome space of an experiment.

  • Then, using equally likely outcomes, we defined the probability of an event as the ratio of the number of outcomes in the event to the number of total outcomes in the outcome space.

  • After defining the probability of an event, we wrote down the mathematical rules of probability and introduced Venn diagrams.

  • The box model was introduced as a way to represent a process in which we repeat an experiment, recording the outcomes each time. Because the experiments have an element of chance (such as a coin toss), we call them chance processes. Later, we will look at statistics computed from the samples and use the box model to understand the variability of the statistics. Box models are helpful to help us understand chance processes, and to generalize from samples that have been collected using methods that use probability, to the populations that the samples were drawn from. We saw that the draws from a box give us a representative sample of the tickets in the box.

  • Finally, we simulated drawing from the box and computing summary statistics in R.

Materials from class




  1. This website was begun as poll aggregation site, by the statistician Nate Silver.↩︎

  2. The singular is die and the plural is dice. If we use the word “die” without any qualifiers, we will mean a fair, six-sided die.↩︎

  3. This is known as the frequentist theory of probability: the chance of something is the relative frequency with which it occurs in repeated trials (that are assumed to be independent).↩︎

  4. The box model was introduced by Freedman, Pisani, and Purves in their textbook Statistics↩︎

  5. We call the tickets equally likely when each ticket has the same chance of being drawn. That is, if there are \(n\) tickets in the box, each has a chance of \(1/n\) to be drawn. We also refer to this as drawing a ticket uniformly at random, because the chance of drawing the tickets are the same, or uniform.↩︎

  6. Photo via↩︎

  7. Photo via↩︎


  9. K. Milner and J.C. Rougier, 2014. How to weigh a donkey in the Kenyan countryside. Significance, 11(4), 40–43. 74, 115↩︎

  10. The random number generator in R is called a “Pseudo Random Number Generator”, because the process can be controlled by a “seed”. These are algorithmic random number generators, which means that if you provide the same seed (a starting number), R will generate the same sequence of random numbers. This makes it easier to debug your code, and reproduce your results if needed.↩︎