Measuring the center and spread of a distribution
10/10/23
We are often interested in the average value of a random variable. We might repeat the action that generates a value of the random variable over and over again, and consider the long-run average. For example, we might bet on red in roulette, and think about what our average gain would be if we played hundreds of games. Or we might roll a die four times, record a success if we see at least one six, repeat this process many times, and take the average. We did exactly this last week when we simulated de Méré’s paradox and computed the proportion of times we rolled at least one six in four rolls. A proportion is just a special case of an average, when the random variable takes only the values \(0\) or \(1\). So we can think of the average as the value we would predict for the random variable: some sort of typical or expected value.
Note that \(E(X)\) is a weighted average of the possible values taken by the random variable, where each possible value is weighted by its probability.
If \(X\) is the number of spots when we roll a fair six-sided die, then \(f(x) = P(X = x) = 1/6\) for \(x = 1, 2, \ldots, 6\). In this case, \(E(X) = \displaystyle \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5\).
In general, suppose \(X\) is a discrete uniform random variable on the values \(1, 2, 3, \ldots, n\), so that \(f(x) =\displaystyle 1/n\) for all \(x = 1, 2, 3, \ldots, n\). If we had to predict \(X\), we would predict the middle of the distribution, since all the values are equally likely. Therefore the expected value of a discrete uniform random variable is \(E(X) = \displaystyle\frac{n+1}{2},\) which is just the usual average of \(1, 2, 3, \ldots, n\).
What happens if the faces are not all equally likely? Consider the following scenario: let \(X\) be the result of rolling a weird six-sided die, on which a 4 and a 6 are each twice as likely as any single odd number, and a 2 is three times as likely as any single odd number. The probability mass function of \(X\) is as shown below:
\[ P(X=x) = f(x) = \begin{cases} 0.1, \: \text{for}\, x = 1, 3, 5 \\ 0.2, \: \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \]
Using the definition,
\[ \begin{aligned} E(X) &= \sum_x x \times f(x) \\ &= 1 \times 0.1 + 2 \times 0.3 + 3 \times 0.1 + 4 \times 0.2 + 5 \times 0.1 + 6 \times 0.2\\ &= 3.5 \\ \end{aligned} \]
Notice that the expected value was the same for both dice, the fair die as well as the unfair one. This illustrates an important way in which the expected value of a random variable is just like the average of a list of numbers. It gives us some information about the distribution, since we see the “typical” value, but not that much information. Two random variables with very different distributions can have the same expected value.
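As a quick check of the two dice calculations, here is a minimal sketch in R. The helper name `expected_value` is our own illustrative choice (it is not a base R function); it simply computes \(\sum_x x \times f(x)\).

```r
# Expected value of a discrete random variable: the sum of x * f(x).
# `expected_value` is a hypothetical helper written for this example.
expected_value <- function(x, probs) {
  stopifnot(length(x) == length(probs), abs(sum(probs) - 1) < 1e-12)
  sum(x * probs)
}

faces <- 1:6
fair_probs  <- rep(1 / 6, 6)                     # fair die
weird_probs <- c(0.1, 0.3, 0.1, 0.2, 0.1, 0.2)   # f(1), f(2), ..., f(6) for the weird die

ev_fair  <- expected_value(faces, fair_probs)
ev_weird <- expected_value(faces, weird_probs)

print(c(fair = ev_fair, weird = ev_weird))
```

Both entries should agree with the hand computations, even though the two probability vectors are quite different.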
Recall that a Bernoulli (\(p\)) random variable is the special case of a binomial random variable when the parameter \(n=1\). This random variable takes the value \(1\) with probability \(p\) and \(0\) with probability \(1-p\). Then: \[ E(X) = 1 \times p + 0 \times (1-p) = p\] The expected value of a Bernoulli(\(p\)) random variable is therefore just \(p\). In particular, if we toss a coin and define \(X\) to be the number of heads, then the expected value of \(X\) is the probability that the coin lands heads.
In all the figures above, note the red triangle marking the expected value. If you imagine the probability histogram to be a thin lamina - like a thin sheet of metal cut in the shape of the probability histogram, you can imagine the expected value as a “balancing point” - the point where the lamina would balance. It is the center of mass for the probability distribution. The expected value for a random variable is analogous to the average for sample data. Other terms that we use for the expected value of a random variable are expectation and mean. These can be used interchangeably.
Let \(X\) be a random variable such that
\[ X = \begin{cases} 1\: \text{with prob}\, 4/15\\ 2 \: \text{with prob}\, 7/30 \\ 0 \: \text{with prob}\, 1/3 \\ -1 \: \text{with prob} \, 1/6 \end{cases} \]
\[ \begin{aligned} E(X) &= 1 \times \frac{4}{15} + 2 \times \frac{7}{30} + 0 \times \frac{1}{3} + (-1) \times \frac{1}{6} \\ &= \frac{1 \times 8 + 2 \times 7 + 0 \times 10 + (-1) \times 5}{30} \\ &= \frac{8 + 14 + 0 - 5}{30} \\ &= \frac{17}{30} = 0.5666667 \end{aligned} \]
Let \(X\) be a random variable with the following probability distribution:
\(x\) | \(P(X = x)\) |
---|---|
\(-2\) | \(0.2\) |
\(-1\) | \(0.1\) |
\(0\) | \(0.2\) |
\(1\) | \(0.3\) |
\(3\) | \(0.2\) |
Let’s first add a column with the product \(x\times P(X=x)\):
\(x\) | \(P(X = x)\) | \(x\times P(X=x)\) |
---|---|---|
\(-2\) | \(0.2\) | \(-0.4\) |
\(-1\) | \(0.1\) | \(-0.1\) |
\(0\) | \(0.2\) | \(0\) |
\(1\) | \(0.3\) | \(0.3\) |
\(3\) | \(0.2\) | \(0.6\) |
Then we sum the third column to get \(E(X) = -0.4 -0.1 + 0 + 0.3 + 0.6 =\) 0.4.
Let’s do the same for the random variable \(Y = g(X) = X^2\). Add two columns to the original table, one with the values of \(y = g(x)\), and one with \(g(x)f(x) = g(x)P(X=x)\):
\(x\) | \(P(X = x)\) | \(y = x^2\) | \(g(x)\times P(X=x)\) |
---|---|---|---|
\(-2\) | \(0.2\) | \(4\) | \(0.8\) |
\(-1\) | \(0.1\) | \(1\) | \(0.1\) |
\(0\) | \(0.2\) | \(0\) | \(0\) |
\(1\) | \(0.3\) | \(1\) | \(0.3\) |
\(3\) | \(0.2\) | \(9\) | \(1.8\) |
Summing the last column we get \(E(Y) = 0.8 + 0.1 + 0 + 0.3 + 1.8 =\) 3.
Do not apply the function to \(f(x)\)! The probability distribution remains the same, only the variable values change - instead of using \(x\), we use \(g(x)\).
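The table calculation can be sketched in R as follows. The vectors `x` and `f` are copied from the table; the names `g`, `e_x` and `e_y` are just labels we chose for this example.

```r
# Values and probabilities from the table
x <- c(-2, -1, 0, 1, 3)
f <- c(0.2, 0.1, 0.2, 0.3, 0.2)

g <- function(x) x^2

e_x <- sum(x * f)      # E(X): weight each x by its probability
e_y <- sum(g(x) * f)   # E(g(X)): transform the values, keep the same probabilities

# Applying g to the probabilities instead is a mistake:
# the transformed "probabilities" no longer sum to 1.
sum_of_g_f <- sum(g(f))

print(c(E_X = e_x, E_Y = e_y, sum_g_f = sum_of_g_f))
```

The last quantity is included only to show why applying \(g\) to \(f(x)\) cannot give a valid distribution.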
Let \(X\sim Bin(n, p)\). Recall that \(X\) counts the number of successes in \(n\) trials, where the probability of success on each trial is \(p\). We can define \(n\) Bernoulli (\(p\)) random variables \(X_1, X_2, \ldots, X_n\) where \(X_k = 1\) with probability \(p\), that is, \(X_k\) is 1 if the \(k\)th trial is a success. We see that the binomial random variable \(X\) can be written as a sum of these \(n\) Bernoulli random variables:
\[ X = X_1 + X_2 + \ldots + X_n \]
The expected value of \(X_k\) is \(E(X_k) = p\) for each \(k\), so using the additivity of expectation, we get
\[ \begin{aligned} E(X) &= E( X_1 + X_2 + \ldots + X_n) \\ &= E(X_1) + E(X_2) + \ldots + E(X_n) \\ &= p + p + \ldots + p\\ &= np \end{aligned} \]
Therefore, if \(X\sim Bin(n, p)\), then \(E(X) = np\). This intuitively makes sense: if I toss a fair coin 100 times, I expect to see about \(50\) heads. This is a very neat trick to compute the expected value of a binomial random variable because you can imagine that computing the expected value using the formula \(\displaystyle \sum_x x \cdot f(x)\) would be very messy and difficult. Using Bernoulli random variables allowed us to easily calculate the expected value of a binomial random variable.
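To see that the shortcut agrees with the definition, we can let R do the messy sum for us. This is only a numerical check for one choice of \(n\) and \(p\) (the fair coin tossed 100 times), not a proof; `dbinom` is the binomial pmf in base R.

```r
n <- 100
p <- 0.5
x <- 0:n

# Direct computation from the definition: sum of x * f(x)
ev_direct <- sum(x * dbinom(x, size = n, prob = p))

# Shortcut from writing X as a sum of n Bernoulli(p) random variables
ev_shortcut <- n * p

print(c(direct = ev_direct, shortcut = ev_shortcut))
```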
When we looked at averages of data, we realized that computing measures of center was insufficient to give us a picture of the distribution. We needed to know how the data distribution spread out about its center, and this idea holds true for probability distributions as well. We want a number that describes how far from \(E(X)\) the values of \(X\) typically fall, similar to the standard deviation for a list of numbers.
The problem with \(Var(X)\) is that the units are squared, so just as we did for the sample variance, we will take the square root of the variance.
\[ SD(X) = \sqrt{Var(X)} \]
\(SD(X)\) is a “give or take” number attached to the mean \(E(X)\), so we can say that a “typical” value of \(X\) is about \(\mu\), give or take one standard deviation (the value of \(SD(X)\)). Note that \(SD(X)\) is a non-negative number.
Recall the example in which we roll an unfair die, and the probability mass function \(f\) was given by: \[ P(X=x) = f(x) = \begin{cases} 0.1, \: \text{for}\, x = 1, 3, 5 \\ 0.2, \: \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \]
We had computed \(E(X)=\mu = 3.5\). What about \(Var(X)\)? Let’s write out the table, and add a column for \(g(x) = (x-3.5)^2\).
\(x\) | \(P(X = x)\) | \(g(x) = (x - 3.5)^2\) | \(g(x)\times P(X=x)\) |
---|---|---|---|
\(1\) | \(0.1\) | \(6.25\) | \(0.625\) |
\(2\) | \(0.3\) | \(2.25\) | \(0.675\) |
\(3\) | \(0.1\) | \(0.25\) | \(0.025\) |
\(4\) | \(0.2\) | \(0.25\) | \(0.05\) |
\(5\) | \(0.1\) | \(2.25\) | \(0.225\) |
\(6\) | \(0.2\) | \(6.25\) | \(1.250\) |
\(E(g(X)) = \sum_x g(x)\cdot P(X=x) =\) 2.85
Therefore the variance is \(Var(X) = 2.85\), and the standard deviation \(SD(X)\) is the square root of the variance, \(\sqrt{2.85} \approx 1.688\).
We have already computed that \(E(X) = 3.5\) when \(X\) is the result of rolling a fair die. What are \(Var(X)\) and \(SD(X)\)?
\(Var(X) = \sum_x (x-3.5)^2 \cdot P(X=x) =\) 2.917. \(SD(X) =\) 1.708.
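The variance tables for the two dice can be reproduced with a few lines of R. The helper `discrete_var` is a name we made up for this sketch; it computes \(\sum_x (x-\mu)^2 \cdot P(X=x)\) directly from the definition.

```r
# Variance of a discrete random variable from its pmf.
# `discrete_var` is a hypothetical helper written for this example.
discrete_var <- function(x, probs) {
  mu <- sum(x * probs)          # expected value
  sum((x - mu)^2 * probs)       # expected squared deviation from mu
}

faces <- 1:6
var_weird <- discrete_var(faces, c(0.1, 0.3, 0.1, 0.2, 0.1, 0.2))
var_fair  <- discrete_var(faces, rep(1 / 6, 6))

sd_weird <- sqrt(var_weird)
sd_fair  <- sqrt(var_fair)

print(round(c(var_weird = var_weird, sd_weird = sd_weird,
              var_fair = var_fair, sd_fair = sd_fair), 3))
```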
Why do you think that \(Var(X)\) and \(SD(X)\) are greater when \(X\) is the result of rolling the fair die than when \(X\) is the result of rolling the unfair die? (Hint: think about the probability distributions.)
To compute variance, we usually use a nice shortcut formula: \[Var(X) = E\left(X^2\right) - \mu^2\]
Let’s confirm this. Let \(X\) be a random variable such that \(X = c\) with \(f(c) = 1\) for some real number \(c\). (This means that \(X\) takes the value \(c\) with probability 1, that is, with certainty.)
\[ Var(X) = E(X^2) - \mu^2 = E(c^2) - c^2 = c^2 - c^2 = 0 \] Note that \(E(c^2) = c^2\) as \(c^2\) is a constant. Thus we have that \(SD(c) = 0\).
\[ \begin{aligned} Var(cX) &= E((cX)^2) - (c\mu)^2 \\ &= E(c^2 X^2) - c^2 \mu^2 \\ &= c^2E(X^2) - c^2\mu^2\\ &= c^2\left(E(X^2) - \mu^2 \right) \\ &= c^2 Var(X) \end{aligned} \]
Thus, \(SD(cX) = \sqrt{Var(cX)} = \sqrt{c^2 Var(X)} = \lvert c \rvert SD(X).\) (\(SD(X) \ge 0\))
\[ Var(X+c) = Var(X) \]
\[ Var(X + Y) = Var(X) + Var(Y) \text{ and } Var(X - Y) = Var(X) + Var(Y) \]
Note that in this case, \(Var(X + Y) = Var(X) + Var(Y)\) implies that \(SD(X +Y) = \sqrt{Var(X) + Var(Y)}\) - square roots and therefore SD’s are not additive.
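These properties can be checked numerically. The sketch below is an illustration under our own assumptions: \(X\) is the weird die, \(Y\) is a fair die rolled independently of \(X\), and the constant \(c = -3\) is arbitrary. Independence is what lets us build the joint distribution as a product of the two pmfs.

```r
x <- 1:6; fx <- c(0.1, 0.3, 0.1, 0.2, 0.1, 0.2)   # weird die
y <- 1:6; fy <- rep(1 / 6, 6)                     # fair die

# Variance via the shortcut formula E(V^2) - (E(V))^2
pvar <- function(v, p) sum(v^2 * p) - sum(v * p)^2

c0 <- -3                              # arbitrary constant
var_x      <- pvar(x, fx)
var_scaled <- pvar(c0 * x, fx)        # compare with c0^2 * Var(X)
var_shift  <- pvar(x + c0, fx)        # compare with Var(X)

# Joint distribution of (X, Y), assuming independence
joint <- outer(fx, fy)                # P(X = x, Y = y) = f_X(x) * f_Y(y)
sums  <- outer(x, y, "+")
diffs <- outer(x, y, "-")
var_sum  <- pvar(as.vector(sums),  as.vector(joint))
var_diff <- pvar(as.vector(diffs), as.vector(joint))

# Variances add, but SDs do not
sd_sum        <- sqrt(var_sum)
sum_of_the_sd <- sqrt(var_x) + sqrt(pvar(y, fy))
print(c(sd_sum = sd_sum, sum_of_sds = sum_of_the_sd))
```

The printed comparison shows that \(SD(X+Y)\) is smaller than \(SD(X) + SD(Y)\) in this example.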
Consider the box with the following \(30\) tickets:
\(8\) tickets marked \(\fbox{1}\), \(7\) tickets marked \(\fbox{2}\), \(10\) tickets marked \(\fbox{0}\), and \(5\) tickets marked \(\fbox{-1}\).
Let \(X\) be the value of a single draw from this box, if we shuffle the tickets and draw one ticket at random. What is the probability distribution of \(X\)? What is \(E(X)\), rounded to 3 decimal places?
Notice that the average of the tickets in the box is 0.567 which is the same as \(E(X)\)!
What about the variance of \(X\)?
\[ \begin{aligned} Var(X) &= E\left[ ( X - \mu)^2 \right] \\ &= \left( \frac{8}{30}\times (1-0.567)^2 + \frac{7}{30} \times (2-0.567)^2 + \frac{10}{30} \times (0-0.567)^2 + \frac{5}{30} \times (-1 - 0.567)^2 \right) \\ &\approx 1.046\\ \end{aligned} \] The sample variance of the tickets in the box is a bit more than 1.046. This is because we use \(n-1\) in the denominator of sample variance and sample sd, rather than \(n\).
The standard deviation \(SD(X) = \sqrt{Var(X)} =\) 1.023.
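The box calculations can be sketched in R by building the box as a vector of 30 tickets. Note that base R's `var()` divides by \(n-1\), so for the variance of a single draw we compute the average squared deviation ourselves; the variable names here are our own.

```r
# The 30 tickets in the box
box <- c(rep(1, 8), rep(2, 7), rep(0, 10), rep(-1, 5))

mu <- mean(box)                      # E(X) = average of the tickets

pop_var  <- mean((box - mu)^2)       # Var(X): divides by n = 30
shortcut <- mean(box^2) - mu^2       # shortcut formula E(X^2) - mu^2
samp_var <- var(box)                 # sample variance: divides by n - 1 = 29

pop_sd <- sqrt(pop_var)

print(round(c(mean = mu, pop_var = pop_var, shortcut = shortcut,
              samp_var = samp_var, pop_sd = pop_sd), 3))
```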
Let \(X\) be a Bernoulli (\(p\)) random variable. We know that \(E(X) = \mu = p\). If we compute \(E(X^2)\), we get that \(E(X^2) = p\). (Make sure you know how to compute \(E(X^2)\).) Therefore we have that: \[ Var(X) = E(X^2) - \mu^2 = p - p^2 = p(1-p). \]
We use the same method as we did to compute the expectation of \(X\sim Bin(n,p)\). We will write \(X\) as a sum of independent Bernoulli random variables: \[ X = X_1 + X_2 + \ldots + X_n\] where each \(X_k \sim\) Bernoulli(\(p\)). Since the \(X_k\) are results of independent trials (by the definition of the binomial distribution), we have: \[Var(X) = Var(X_1) + Var(X_2) + \ldots + Var(X_n) = np(1-p).\] Therefore, \(SD(X) = \sqrt{np(1-p)}\)
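As with the expectation, we can compare the formula with a direct computation from the binomial pmf. The values \(n = 25\) and \(p = 0.2\) below are arbitrary choices for illustration.

```r
n <- 25
p <- 0.2
x <- 0:n
probs <- dbinom(x, size = n, prob = p)   # binomial pmf

mu <- sum(x * probs)                     # E(X), should match n * p
var_direct  <- sum((x - mu)^2 * probs)   # definition of variance
var_formula <- n * p * (1 - p)           # sum of n Bernoulli variances

print(c(direct = var_direct, formula = var_formula, sd = sqrt(var_formula)))
```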
This is a very important concept that we have already used to compute the expected value and variance of a binomial random variable by writing it as a sum of iid Bernoulli random variables.
A common example is when we toss a coin \(n\) times and count the number of heads - each coin toss can be considered a Bernoulli random variable, and the total number of heads is a sum of \(n\) iid Bernoulli random variables.
Consider the box shown below:
Say I draw \(25\) tickets with replacement from this box, and let \(X_k\) be the value of the \(k\)th ticket. Then each of the \(X_k\) has the same distribution, and they are independent since we draw the tickets with replacement. Therefore \(X_1, X_2, \ldots, X_{25}\) are iid random variables, and they each have a distribution defined by the following pmf: \[ f(x) = \begin{cases} 0.2, \; x = 0, 3, 4 \\ 0.1, \; x = 2 \\ 0.3, \; x = 1 \end{cases} \]
Suppose we make \(n\) draws at random with replacement from the box above, and sum the drawn tickets. The sum \(S_n = X_1 + X_2 + \ldots + X_{n}\) is also a random variable. Let’s simulate this with \(n=25\): we will sample \(25\) tickets with replacement, sum them, and then repeat this process. Note that when \(n = 25\), the smallest sum we can get is \(S_n = 0\) and the largest is \(100\). (Why?)
[1] "The sum of 25 draws is 56"
Now we will repeat this process 10 times:
[1] 56 50 64 51 49 41 64 44 45 51
It is clear that the sum \(S_n\) is random (because the \(X_k\) are random), and we can see that the sum of the draws changes with each iteration of the process.
Since we know the distribution of the \(X_k\), we can compute \(E(X_k)\) and \(Var(X_k)\). Note that since the \(X_1, X_2, \ldots, X_{n}\) are iid, all the \(X_k\) have the same mean and variance. What about their sum \(S_n\)? What are \(E(S_n)\) and \(Var(S_n)\), when \(n = 25\)?
\(E(X_k) = 0.2\times 0 + 0.3 \times 1 + 0.1 \times 2 + 0.2 \times 3 + 0.2 \times 4 = 1.9\) (Note that you could also have just computed the average of the tickets in the box.)
\(Var(X_k) = \sum_x (x-1.9)^2 \times P(X_k=x) = 2.09\)
\(E(S_{25}) = E(X_1 + X_2 + \ldots +X_{25}) = 25 \times E(X_1) = 25 \times 1.9\).
(We just use \(X_1\) since all the \(X_k\) have the same distribution.)
Since the \(X_k\) are independent, we can write that
\[ \begin{aligned} Var(S_{25}) &= Var(X_1 + X_2 + \ldots +X_{25})\\ &= Var(X_1) + Var(X_2) + \ldots +Var(X_{25})\\ &= 25 \times 2.09 \end{aligned} \]
We can see that the expectation and variance of the sum scale with \(n\), so that if \(S_n\) is the sum of \(n\) iid random variables \(X_1, X_2, \ldots, X_n\), then:
\[ \begin{aligned} E(S_n) &= n \times E(X_1) \\ Var(S_n) &= n \times Var(X_1)\\ \end{aligned} \] This does not hold for \(SD(S_n)\), though. For the SD, we have the following “law” for the standard deviation of the sum, which we get by taking the square root of \(Var(S_n)\): \[ SD(S_n) = \sqrt{n} \times SD(X_1) \]
Since all the \(X_k\) have the same distribution, we can use \(X_1\) to compute the mean and SD of the sum. This law says that as the number of random variables \(n\) increases, the expected value of the sum scales as \(n\), but the standard deviation of the sum increases more slowly, scaling as \(\sqrt{n}\). In other words, if you increase the number of random variables you are summing, the spread of your sum about its expected value increases, but not as fast as the expectation of the sum.
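Here is a small sketch in R of how these quantities scale for the box above, using the pmf of a single draw. The data frame `summary_sum` and the other names are ours; the sample sizes 25 and 100 are chosen to match the simulations in these notes.

```r
# pmf of a single draw X_1 from the box
x <- 0:4
f <- c(0.2, 0.3, 0.1, 0.2, 0.2)

mu1  <- sum(x * f)               # E(X_1)
var1 <- sum((x - mu1)^2 * f)     # Var(X_1)
sd1  <- sqrt(var1)               # SD(X_1)

n <- c(25, 100)
summary_sum <- data.frame(
  n       = n,
  E_sum   = n * mu1,             # scales as n
  Var_sum = n * var1,            # scales as n
  SD_sum  = sqrt(n) * sd1        # scales as sqrt(n)
)
print(summary_sum)
```

Going from \(n = 25\) to \(n = 100\) multiplies the expected value of the sum by 4 but its SD only by 2.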
We denote the average of the random variables \(X_1, X_2, \ldots, X_n\) by \(\displaystyle \bar{X} =\frac{S_n}{n}\).
\(\displaystyle \bar{X}\) is called the sample mean (where the “sample” consists of \(X_1, X_2, \ldots, X_n\)).
\[ E(\bar{X}) = E(\frac{S_n}{n}) = \frac{1}{n} E(S_n) = E(X_1) \]
This means that the expected value of an average does not scale as \(n\), but \(E(\bar{X})\) is the same as the expected value of a single random variable. Let’s check the variance now:
\[ Var(\bar{X}) = Var(\frac{S_n}{n}) = \frac{1}{n^2} Var(S_n) = \frac{n}{n^2} Var(X_1) \]
Therefore \(Var(\bar{X}) = \displaystyle \frac{1}{n} Var(X_1)\)
Note that, just like the sample sum \(S_n\), the sample mean \(\displaystyle \bar{X}\) is a random variable, and its variance scales as \(\displaystyle \frac{1}{n}\), which implies that \(SD(\bar{X})\) will scale as \(\displaystyle \frac{1}{\sqrt{n}}\).
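A rough simulation check of the \(1/\sqrt{n}\) scaling is sketched below. The seed, the number of repetitions (5000), and the helper name `sim_sd_mean` are our own choices for illustration; the simulated values will only approximately match the theoretical ones.

```r
set.seed(1)   # arbitrary seed, for reproducibility of this sketch

box <- c(0, 0, 1, 1, 1, 2, 3, 3, 4, 4)
sd1 <- sqrt(mean((box - mean(box))^2))   # SD of a single draw

# Simulate many sample means of n draws and return their SD.
# `sim_sd_mean` is a hypothetical helper written for this example.
sim_sd_mean <- function(n, reps = 5000) {
  sd(replicate(reps, mean(sample(box, size = n, replace = TRUE))))
}

n <- c(25, 100)
theory    <- sd1 / sqrt(n)                       # SD(X_1) / sqrt(n)
simulated <- c(sim_sd_mean(25), sim_sd_mean(100))

print(data.frame(n = n, theory = theory, simulated = simulated))
```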
Let’s go back to the box of colored tickets, draw from this box \(n\) times, and then compute the sum and average of the draws. We will simulate the distribution of the sum and the average of 25 draws (and of 100 draws) to see what the distribution of the statistics looks like. Note that when \(n=25\), \(E(S_n) = 25\times 1.9 = 47.5\) and \(SE(S_n) = \sqrt{n} \times SD(X_1) \approx 5 \times 1.45 = 7.25\).
# Packages this code appears to rely on (assumed; they are not loaded in the code as shown):
# dplyr for %>%, ggplot2 for plotting, patchwork for combining plots.
library(dplyr)
library(ggplot2)
library(patchwork)

set.seed(12345)

# The box of tickets
box <- c(0,0,1,1,1,2,3,3,4,4)

# One sum of 25 draws with replacement
s1 <- sum(sample(box, size = 25, replace = TRUE))

# 1000 simulated sums of 25 draws
sum_draws_25 <- replicate(1000, sum(sample(box, size = 25, replace = TRUE)))
p1 <- data.frame(sum_draws_25) %>%
  ggplot(aes(x = sum_draws_25, y = after_stat(density))) +
  geom_histogram(fill = "darkolivegreen2", color = "white") +
  xlab("sample sum") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample sum, n = 25") +
  geom_vline(xintercept = 47.5, color = "black", lwd = 1.1)  # E(S_25) = 25 * 1.9

# 1000 simulated sums of 100 draws
sum_draws_100 <- replicate(1000, sum(sample(box, size = 100, replace = TRUE)))
p2 <- data.frame(sum_draws_100) %>%
  ggplot(aes(x = sum_draws_100, y = after_stat(density))) +
  geom_histogram(fill = "darkolivegreen3", color = "white") +
  xlab("sample sum") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample sum, n = 100") +
  geom_vline(xintercept = 190, color = "black", lwd = 1.1)   # E(S_100) = 100 * 1.9

# 1000 simulated means of 25 draws
mean_draws_25 <- replicate(1000, mean(sample(box, size = 25, replace = TRUE)))
p3 <- data.frame(mean_draws_25) %>%
  ggplot(aes(x = mean_draws_25, y = after_stat(density))) +
  geom_histogram(fill = "cadetblue2", color = "white") +
  xlab("sample mean") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample mean, n = 25") +
  geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)   # E(sample mean) = 1.9

# 1000 simulated means of 100 draws
mean_draws_100 <- replicate(1000, mean(sample(box, size = 100, replace = TRUE)))
p4 <- data.frame(mean_draws_100) %>%
  ggplot(aes(x = mean_draws_100, y = after_stat(density))) +
  geom_histogram(fill = "deepskyblue", color = "white") +
  xlab("sample mean") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample mean, n = 100") +
  geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)

# Arrange the four plots in a 2 x 2 grid (patchwork syntax)
(p1+p2)/(p3+p4)
What do we notice in these figures? The black line is the expected value. We see that the center of the distribution for the sample sum grows as the sample size increases (look at the x-axis), but this does not happen for the distribution of the sample mean. You can also see that the spread of the data for the sample sum is much greater when n = 100, but this does not happen for the distribution of the sample mean. We will explore the sample sum and sample mean next week. Now, the \(y\) axis has neither counts nor proportion, but it has “density”. This makes the histogram have a total area of one, similar to a probability histogram. Now we can think of this density histogram as an approximation of the probability histogram.
So far, we have talked about discrete distributions, and the probability mass functions for such distributions. Consider a random variable that takes any value in a given interval. Recall that we call such random variables continuous. In this situation, we cannot think about discrete bits of probability mass which are non-zero for certain numbers, but rather we imagine that our total probability mass of \(1\) is smeared over the interval, giving us a smooth density curve, rather than a histogram. To define the probabilities associated with a continuous random variable, we define a probability density function (pdf) rather than a probability mass function.
If \(X\) is a continuous random variable, we don’t talk about \(P(X = x\)), that is, the probability that \(X\) takes a particular value. Rather, we ask what is the probability that \(X\) lies in an interval around \(x\). Since there are infinitely many outcomes in any interval on the real line, no single outcome can have positive probability, so \(P(X=x) =0\) for any particular \(x\) in the interval where \(X\) is defined. To find the probability that \(X\) lies in an interval \((a,b)\), we integrate \(f(x)\) over the interval \((a,b)\). That is, we find the area under the curve \(f(x)\) over the interval \((a, b)\).
Let \(X\) be a random variable that takes values in the interval \((0,1)\) with probability density function \(f(x) = 1\) for \(x\) in \((0,1)\) and \(f(x) = 0\) outside of this interval.
Because \(f(x)\) is flat, all intervals of the same length will have the same area, so the distribution defined by \(f\) is called the Uniform\((0,1)\) distribution. If a random variable \(X\) has this distribution, we denote this by \(X \sim U(0,1)\). The probability that \(X\) is in any interval \((a, b)\) which is a subinterval of \((0,1)\) is given by the area of the rectangle formed by the interval and \(y=f(x)\), and so is just the width of the interval.
The difference is in how we compute \(P(X \le x)\). There are no discrete bits of probability mass for \(F(x)\) to collect. Instead we have that \(F(x)\) is the area under the curve \(y = f(x)\) all the way up to the point \(x\).
Let \(X \sim U(0,1)\). What is \(F(0.3)\)?
\[ F(0.3) = P(X \le 0.3) = \int_{-\infty}^{0.3} f(t)\, dt = \int_0^{0.3} 1\, dt = 0.3 \] In general, for the \(U(0,1)\) distribution, \(F(x) = x\) for \(x\) in \((0,1)\).
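In R, the Uniform\((0,1)\) density and cdf are available as `dunif` and `punif`, so we can check this value both from the cdf and by numerically integrating the density. The interval \((0.2, 0.7)\) in the last line is an arbitrary example of our own.

```r
# cdf of U(0,1) evaluated at 0.3
F_03 <- punif(0.3, min = 0, max = 1)

# The same probability as the area under the density curve from 0 to 0.3
area <- integrate(dunif, lower = 0, upper = 0.3)$value

# Probability of an interval is just its width: P(0.2 < X < 0.7)
p_interval <- punif(0.7) - punif(0.2)

print(c(F_03 = F_03, area = area, p_interval = p_interval))
```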
We defined the expected value or the mean of a discrete random variable and listed the properties of expectation including linearity and additivity.
We defined the variance and standard deviation of a random variable. Both expectation and variance (and therefore standard deviation) are constants associated to the distribution of the random variable. The variance is more convenient than the sd for computation because it doesn’t have square roots. However, the units are squared, so you have to be careful while interpreting the variance. We discussed the properties of variance and standard deviation.
We defined the expectation and variance of sums and averages of an iid sample of random variables, and introduced the term standard error. We recognized how the mean and variance scale with \(n\) and defined the square root law for the standard error of the sum or mean of an iid sample.
We considered the probability distributions of sums and averages.
Finally, we introduced continuous distributions. In subsequent notes, we will introduce the most celebrated continuous distribution, the normal distribution.