Expected value and variance of a random variable
Measuring the center and spread of a distribution
We are often interested in the average value of a random variable. We might repeat the action that generates a value of a random variable over and over again, and consider the long term average. For example, we might bet on red in roulette, and think about what our average gain would be if we play hundreds of games. Maybe we roll a die four times, record a success if we see at least one six, and repeat this process and take the average - note that we did this last week when we computed the proportion of times we had a success in this game (while investigating de Méré’s paradox). The proportion is just a special case of an average, when a random variable takes the values \(0\) or \(1\). The average can be thought of the value we would predict for the random variable - some sort of typical or expected value.
We have seen the correspondence between discrete random variables and boxes of tickets, where the tickets in the box describing the random variable \(X\) are in the proportions specified by the probability distribution of \(X\). We will see that we can use the associated box to compute the average value of a random variable.
Example: Rolls of a weighted die
Let \(X\) be the face observed when we roll a six-sided die, where the probabilities are as follows: \[ P(X=x) = f(x) = \begin{cases} 0.1,\: \text{for}\, x = 1, 3, 5 \\ 0.2, \:, \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \]
The box for this random variable (when \(X\) is the value of a ticket drawn) has the following tickets: \(\fbox{1},\, \fbox{2}, \, \fbox{2},\, \fbox{2},\, \fbox{3}, \, \fbox{4},\, \fbox{4}, \, \fbox{5}, \, \fbox{6},\, \fbox{6}\)
When we make a box, the tickets reflect the pmf of the random variable. Now we can compute the average of the tickets in the box: \[ \frac{1}{10}(1 + 2 + 2 + 2 + 3 + 4 + 4 + 5 + 6 + 6) = 3.5 \] Rewriting the expression above, we get: \[ 1 \times \frac{1}{10} + 2 \times \frac{3}{10} + 3 \times \frac{1}{10} + 4 \times \frac{2}{10} + 5 \times \frac{1}{10} + 6 \times \frac{2}{10} = 3.5 \]
As you can see, the box average is nothing but a weighted average of the possible values of \(X\), weighted by their probability.
Expected value of a random variable
The expected value or expectation of a random variable \(X\) with pmf \(f(x)\) is denoted by \(E(X)\). It is a constant associated with the distribution, and is defined by \[ E(X) = \sum_x x \cdot P(X = x) = \sum_x x \cdot f(x) \] \(E(X)\) is a weighted average of the values taken by the random variable.
Example: Rolls of a fair die
If \(X\) is the spots when we roll a fair six-sided die, then \(f(x) = P(x = x) = 1/6\) for \(x = 1, 2, \ldots, 6\). In this case, \(E(X) = \displaystyle \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5\).
In general, if we have a discrete uniform random variable on the values \(1, 2, 3, \ldots, n\), we expect to be in the middle, since all the values are equally likely (\(f(x) = 1/n\) for all \(x\)). Therefore the expected value of a discrete uniform random variable is \(E(X) = \frac{n+1}{2}\).
Example: Bernoulli random variables
Consider a Bernoulli(\(p\)) random variable which takes the value \(1\) with probability \(p\) and \(0\) with probability \(1-p\). Then: \[ E(X) = 1 \times p + 0 \times (1-p) = p\] The expected value of a Bernoulli(\(p\)) random variable is therefore just \(p\).
In all the figures above, note the red triangle marking the expected value. If you imagine the probability histogram to be a thin lamina - like a thin sheet of cardboard, you can imagine the expected value as a “balancing point”, or the point where the lamina would balance. It is the “center of mass” for the probability distribution. The expected value for a random variable is analogous to the average for data, and if you look at equally likely values, it is the same. Other terms that we use for the expected value of a random variable are expectation and mean. These can be used interchangeably.
Example: Computing expectation in two ways
Let \(X\) be a random variable such that \[X = \begin{cases} 1\: \text{with prob}\, 4/15\\ 2 \: \text{with prob}\, 7/30 \\ 0 \: \text{with prob}\, 1/3 \\ -1 \: \text{with prob} \, 1/6 \end{cases} \] Find the expected value of \(X\) in two ways:
- use the formula for expected value of \(X\)
Check your answer
\[ \begin{align} E(X) &= 1 \times \frac{4}{15} + 2 \times \frac{7}{30} + 0 \times \frac{1}{3} + (-1) \times \frac{1}{6} \\ &= \frac{1 \times 8 + 2 \times 7 + 0 \times 10 + (-1) \times 5}{30} \\ &= \frac{8 + 14 + 0 - 5}{30} \\ &= \frac{17}{30} = 0.5666667 \end{align} \]- make the box of equally likely tickets that corresponds to \(X\) and find the average of the tickets in the box.
Check your answer
The box will have \(30\) tickets: \(8\) \(\fbox{1}\), \(7\) \(\fbox{2}\), \(10\) \(\fbox{0}\), and \(5\) \(\fbox{-1}\). The average of these tickets is 0.5666667.
Properties of expectation
- Expected value of a constant
- The expected value of a constant is just the constant. Let \(c\) be any constant. The probability mass function can be considered to be \(f(c) = 1\). Then \[E(c) = c \cdot f(c) = c \]
- Constant multiple of a random variable
- If \(c\) is a constant, and \(X\) is a random variable with pmf \(f(x)\), then we can consider \(cX\) to be the random variable that takes the value \(cx\) when \(X\) takes the value \(x\), so \(f(cx) = f(x)\). Therefore, \[ E(cX) = \sum_x cx \cdot f(x) = c \sum_x x \cdot f(x) = c\cdot E(X) \]
- Additivity of expectation
- Let \(X\) and \(Y\) be two random variables, and consider their sum \(X + Y\). Then, \[ E(X + Y) = E(X) + E(Y)\] This is true regardless of whether \(X\) and \(Y\) are independent or not.
- Linearity of expectation
- Putting the previous two properties together, we get \[ E(aX + bY) = E(aX) + E(bY) = a\cdot E(X) +b \cdot E(Y)\]
- Expectation of a function of a random variable
- Suppose \(Y = g(X)\) is a function of the random variable \(X\). Then \(Y\) is also a random variable taking values \(y = g(x)\), and the probabilities are just the probabilities associated with \(x\). Then the expected value of \(Y\) is given by: \[ E(Y) = E(g(X)) = \sum_x g(x)f(x) = \sum_x g(x)P(X=x) \]
Example: Expectation of the square of a random variable
Let \(X\) be a random variable with the following probability distribution:
\(x\) | \(P(X = x)\) |
---|---|
\(-2\) | \(0.2\) |
\(-1\) | \(0.1\) |
\(0\) | \(0.2\) |
\(1\) | \(0.3\) |
\(3\) | \(0.2\) |
Let’s first compute \(E(X)\) by adding a column with the product \(x\cdot P(X=x)\):
\(x\) | \(P(X = x)\) | \(x\cdot P(X=x)\) |
---|---|---|
\(-2\) | \(0.2\) | \(-0.4\) |
\(-1\) | \(0.1\) | \(-0.1\) |
\(0\) | \(0.2\) | \(0\) |
\(1\) | \(0.3\) | \(0.3\) |
\(3\) | \(0.2\) | \(0.6\) |
Then we sum the third column to get \(E(X) = -0.4 -0.1 + 0 + 0.3 + 0.6 =\) 0.4.
Let’s do the same for the random variable \(Y = g(X) = X^2\). Add two columns to the original table, one with the values of \(y = g(x)\), and one with \(g(x)f(x) = g(x)P(X=x)\):
\(x\) | \(P(X = x)\) | \(y = x^2\) | \(g(x)\cdot P(X=x)\) |
---|---|---|---|
\(-2\) | \(0.2\) | \(4\) | \(0.8\) |
\(-1\) | \(0.1\) | \(1\) | \(0.1\) |
\(0\) | \(0.2\) | \(0\) | \(0\) |
\(1\) | \(0.3\) | \(1\) | \(0.3\) |
\(3\) | \(0.2\) | \(9\) | \(1.8\) |
Summing the last column we get \(E(Y) = 0.8 + 0.1 + 0 + 0.3 + 1.8 =\) 3.
Example: Binomial random variables
Let \(X\sim Bin(n, p)\). Recall that \(X\) counts the number of successes in \(n\) trials, where the probability of success on each trial is \(p\). We can define \(n\) Bernoulli(\(p\)) random variables \(X_1, X_2, \ldots, X_n\) where \(X_k = 1\) with probability \(p\), that is, \(X_k\) is 1 if the \(k\)th trial is a success. We see that our binomial random variable \(X\) can be written as a sum of these random variables: \[ X = X_1 + X_2 + \ldots + X_n \] The expected value of \(X_k\) is \(E(X_k) = p\) for each \(k\), so using the additivity of expectation, we get \[ \begin{align} E(X) &= E( X_1 + X_2 + \ldots + X_n) \\ &= E(X_1) + E(X_2) + \ldots + E(X_n) \\ &= p + p + \ldots + p\\ &= np \end{align} \] Therefore, if \(X\sim Bin(n, p)\), then \(E(X) = np\). This intuitively makes sense: if I toss a fair coin 100 times, I expect to see about \(50\) heads. This is a very neat trick to compute the expected value of a binomial random variable because you can imagine that computing the expected value using the formula \(\displaystyle \sum_x x \cdot f(x)\) would be very messy and difficult. Using Bernoulli random variables allowed us to easily calculate the expected value of a binomial random variable.
Variance and standard deviation of a random variable
When we looked at averages of data, we realized that computing measures of center was insufficient to give us a picture of the distribution. We needed to know how the data distribution spread out about its center, and this idea holds true for probability distributions as well.We want a number that describes how far from \(E(X)\) the values of \(X\) typically fall. We will look at the variance of the random variable \(X\), which is its expected squared deviation from the mean, and the square root of variance of \(X\), which is called the standard deviation of \(X\).
Variance and standard deviation of \(X\)
Let’s give the expected value of \(X\) a name - we will call it \(\mu\). From now on, if we say \(\mu\), we mean \(E(X)\). Note that \(\mu\) is a constant related to the probability distribution of \(X\). Consider the function \(g(X) = (X-\mu)^2\), which describes the square of the distance of each value of \(X\) from its mean. The expected value of \(g(X)\), that is, \(E(g(X)) = E\left((X-\mu)^2\right)\) is called the variance of \(X\) and denoted by \(Var(X)\). The problem with using \(Var(X)\) is that the units are squared, so as we did for data, we will take the square root of the variance: \(\sqrt{Var(X)}\). This is called the standard deviation of \(X\), and is a “give or take” number attached to the mean, so we can say that a “typical” value of \(X\) is about \(\mu\), give or take one standard deviation.
Example: Rolls of a weighted die
Recall our first example, where the probability mass function \(f\) is given by: \[ P(X=x) = f(x) = \begin{cases} 0.1,\: \text{for}\, x = 1, 3, 5 \\ 0.2, \:, \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \] We had computed \(E(X)\) to be 3.5. What about \(Var(X)\)? Let’s write out the table, and add a column for \(g(x) = (x-3.5)^2\).
\(x\) | \(P(X = x)\) | \(g(x) = (x - 3.5)^2\) | \(g(x)\cdot P(X=x)\) |
---|---|---|---|
\(1\) | \(0.1\) | \(6.25\) | \(0.625\) |
\(2\) | \(0.3\) | \(2.25\) | \(0.675\) |
\(3\) | \(0.1\) | \(0.25\) | \(0.025\) |
\(4\) | \(0.2\) | \(0.25\) | \(0.05\) |
\(5\) | \(0.1\) | \(2.25\) | \(0.225\) |
\(6\) | \(0.2\) | \(6.25\) | \(1.250\) |
\(E(g(X)) = \sum_x g(x)\cdot P(X=x) =\) 2.85
Therefore the standard deviation of \(X = SD(X)\) is the square root of the variance, so about 1.688.
Example: Rolls of a fair die
We have already computed that the expected value of the random variable \(X = 3.5\) where \(X\) is the result of rolling a fair die. What are \(Var(X)\) and \(SD(X)\)?
Check your answer
\(Var(X) = \sum_x (x-3.5)^2 \cdot P(X=x) =\) 2.917. \(SD(X) =\) 1.708.
Why do you think that the variance and sd are greater for the fair die than the weighted die? (Hint: think about the probability distributions.)
Computational formula
When we want to compute variance, there is a nice shortcut formula that you can use: \[Var(X) = E(X^2) - \mu^2\]
Properties of variance and standard deviation
- The variance of a constant is 0
- Remember that the variance measures how spread out the values of \(X\) are about the mean. A constant doesn’t spread out at all, so it makes sense that the variance is \(0\). Let’s confirm this. Let \(X\) be a random variable such that \(X = c\) with \(f(c) = 1\) for some real number \(c\).
\[ Var(X) = E(X^2) - \mu^2 = E(c^2) - c^2 = c^2 - c^2 = 0\]. Note that \(E(c^2) = c^2\) as \(c^2\) is a constant. Thus we have that \(SD(c) = 0\).
- The variance of a constant multiple of a random variable \(Var(cX)\)
- Note that \(E(cX) = c\mu\) by the properties of expectation. Thus we have (using the short cut formula):
\[ \begin{align} Var(cX) &= E((cX)^2) - (c\mu)^2 \\ &= E(c^2 X^2) - c^2 \mu^2 \\ &= c^2E(X^2) - c^2\mu^2\\ &= c^2\left(E(X^2) - \mu^2 \right) \\ &= c^2 Var(X) \end{align}\]
Thus, \(SD(cX) = \sqrt{Var(cX)} = \sqrt{c^2 Var(X)} = \lvert c \rvert SD(X).\) (\(SD(X) \ge 0\))
- The variance of \(X\) is unchanged by adding a constant
- If we add a constant to \(X\), the spread of the distribution does not change, only its center changes. Let \(c\) be a real number. Then we have:
\[Var(X+c) = Var(X)\]
- Additivity of variance
- In certain situations (when we know that \(X\) and \(Y\) are independent random variables - the events \(X = x\) and \(Y = y\) are independent for all values of \(x\) and \(y\)), we have that:
\[ Var(X + Y) = Var(X) + Var(Y) \text{ and } Var(X - Y) = Var(X) + Var(Y)\] If we have that \(Var(X + Y) = Var(X) + Var(Y)\), then \(SD(X +Y) = \sqrt{Var(X) + Var(Y)}\).
Relationship with the box model
If a random variable \(X\) is represented as the value of a ticket drawn at random from a box, we know that \(E(X)\) is the average of the ticket values in the box. It turns out that \(Var(X)\) is the variance of the numbers in the box: Let \(x_1, x_2, \ldots, x_n\) be the numbers on the tickets in the box, then the average of these is \(E(X)\), and \(Var(X)\) is given by (\(\bar{x}\) is the box average): \[ Var(X) = \frac{1}{n} \left((x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \ldots + (x_n - \bar{x})^2\right) \]
Example: Box model
Consider the box with the following \(30\) tickets: \(8\) \(\fbox{1}\), \(7\) \(\fbox{2}\), \(10\) \(\fbox{0}\), and \(5\) \(\fbox{-1}\). The average of these tickets, \(\bar{x} =\) 0.567. Let’s compute the variance of the random variable \(X\) using the formula above:
\[Var(x) = \frac{1}{30}\left(8 \times (1-0.567)^2 + 7 \times (2-0.567)^2 + 10 \times (0-0.567)^2 + 5 \times (-1 - 0.567)^2 \right) = 1.045.\] The standard deviation is 1.023.
Example: Bernoulli random variables
Let \(X\) be a Bernoulli(\(p\)) random variable. We know that \(E(X) = \mu = p\). If we compute \(E(X^2)\), we get that \(E(X^2) = p\). (Make sure you can compute \(E(X^2)\).) Therefore we have that: \[ Var(X) = E(X^2) - \mu^2 = p - p^2 = p(1-p). \]
Example: Binomial random variables
We use the same method as we did to compute the expectation of \(X\sim Bin(n,p)\). We will write \(X\) as a sum of independent Bernoulli random variables: \[ X = X_1 + X_2 + \ldots + X_n\] where each \(X_k \sim\) Bernoulli(\(p\)). Since the \(X_k\) are results of independent trials (by the definition of the binomial distribution), we have: \[Var(X) = Var(X_1) + Var(X_2) + \ldots + Var(X_n) = np(1-p).\] Therefore, \(SD(X) = \sqrt{np(1-p)}\)
Independent and identically distributed random variables
Suppose we have \(n\) independent random variables \(X_1, X_2, \ldots, X_n\) such that they all have the same pmf \(f(x)\) and the same cdf \(F(x)\), we call the random variables \(X_1, X_2, \ldots, X_n\) independent and identically distributed, usually abbreviated i.i.d or iid. This is a very important concept that we have already used when we computed the expected value and variance of a binomial random variable by writing it as a sum of iid Bernoulli random variables. A common example is when we toss a coin \(n\) times and count the number of heads - each coin toss can be considered a Bernoulli random variable, and the total number of heads is a sum of \(n\) iid Bernoulli random variables.
Example: iid random variables
Consider the box shown below:
Say I draw \(25\) tickets with replacement from this box, and let \(X_k\) be the value of the \(k\)th ticket. Then each of the \(X_k\) has the same distribution, and they are independent since we draw the tickets with replacement. Therefore \(X_1, X_2, \ldots, X_{25}\) are iid random variables with pmf given by: \[ f(x) = \begin{cases} 0.2, \; x = 0, 3, 4 \\ 0.1, \; x = 2 \\ 0.3, \; x = 1 \end{cases} \]
Sums and averages of random variables
Sums
Suppose we make \(n\) draws at random (each ticket is equally likely) from the box above, with replacement, and sum the drawn tickets. The sum \(S_n = X_1, X_2, \ldots, X_{n}\) is also a random variable. Let’s check this by letting \(n=25\) and we will draw \(25\) tickets with replacement and sum them, and then repeat that a few times. Note that the smallest sum we can get is \(S_n = 0\) and the largest is \(100\). (How?)
[1] 56
Now we will repeat this process a few times:
It is clear that the sum is random (because the \(X_k\) are random), and it keeps changing. Since we know the distribution of the \(X_k\), we can compute \(E(X_k)\) and \(Var(X_k)\). Note that since the \(X_1, X_2, \ldots, X_{n}\) are iid, all the \(X_k\) have the same mean and variance. What about \(S\)? What are \(E(S_n)\) and \(Var(S_n)\), when \(n = 25\)?
\(E(X_k) = 0.2\times 0 + 0.3 \times 1 + 0.1 \times 2 + 0.2 \times 3 + 0.2 \times 4 = 1.9\) (Note that you could also have just computed the average of the tickets in the box.)
\(Var(X_k) = \sum_x (x-1.9)^2 P(X=x) = 2.09\)
\(E(S_{25}) = E(X_1 + X_2 + \ldots +X_{25}) = 25 \times E(X_1) = 25 \times 1.9\) We can use \(X_1\) since all the \(X_k\) have the same distribution.
Since the \(X_k\) are independent, we can write that \[ \begin{align} Var(S_{25}) &= Var(X_1 + X_2 + \ldots +X_{25})\\ &= Var(X_1) + Var(X_2) + \ldots +Var(X_{25})\\ &= 25 \times 2.09 \end{align} \] We can see that the expectation and variance of the sum scale with \(n\), and the expected value of the sum of \(n\) i.i.d random variables is \(n\) times the expected value of any one of them, and the variance of the sum is also \(n\) times the variance of any one of the iid random variables. This gives rise to the following “law” for the standard deviation of the sum.
- Square root law for sums of iid random variables
- The standard deviation of the sum of \(n\) iid random variables is given by: \[ SD(S_n) = \sqrt{n}SD(X_1) \]
Since all the \(X_k\) have the same distribution, we can use \(X_1\) to compute the mean and SD. This law says that if the sample size increases as \(n\), the expected value scales as number of random variables, but the standard deviation of the sum increases more slowly, scaling as \(\sqrt{n}\). In other words, if you increase the number of random variables you are summing, the spread of your sum about its expected value increases, but not as fast as the expectation of the sum.
Averages
We denote the average of the random variables \(X_1, X_2, \ldots, X_n\) by \(\displaystyle \overline{X} =\frac{S_n}{n}\).
\(\displaystyle \overline{X}\) is called the sample mean (where the “sample” consists of \(X_1, X_2, \ldots, X_n\)).
\[ E(\overline{X}) = E(\frac{S_n}{n}) = \frac{1}{n} E(S_n) = E(X_1) \]
This means that the expected value of an average does not scale as \(n\), but is the same as the expected value of a single random variable. Let’s check the variance now:
\[ Var(\overline{X}) = Var(\frac{S_n}{n}) = \frac{1}{n^2} Var(S_n) = \frac{n}{n^2} Var(X_1)\]
Therefore \(Var(\overline{X}) = \displaystyle \frac{1}{n} Var(X_1)\)
Note that, just like the sample sum \(S_n\), the sample mean \(\displaystyle \overline{X}\) is a random variable, and its variance scales as \(\frac{1}{n}\), which implies that \(SD(\overline{X})\) will scale as \(\displaystyle \frac{1}{\sqrt{n}}\).
- Square root law for averages of iid random variables
- The standard deviation of the average of \(n\) iid random variables is given by: \[ SD(\overline{X}) = \frac{1}{\sqrt{n}}SD(X_1) \]
- Standard error
- Since \(S_n\) and \(\overline{X}\) are numbers computed from the sample \(X_1, X_2, \ldots, X_n\), they are called statistics. We use the term standard error to denote the standard deviation of a statistic: \(SE(S_n)\) and \(SE(\overline{X})\) to distinguish it from the standard deviation of the ticket values in the box.
Example: Probability distributions for sums and averages
Let’s go back to the box of colored tickets, draw from this box \(n\) times, and then compute the sum and average of the draws. We will simulate the distribution of the sum and the average of 25 draws to see what the distribution of the statistics looks like. Note that when \(n=25\), \(E(S_n) = 25\times 1.9 = 47.5\) and \(SE(S_n) = \sqrt{n} \times SD(X_1) = 5 \times 1.45 = 7.25\)
set.seed(12345)
box = c(0,0,1,1,1,2,3,3,4,4)
s1 <- sum(sample(box, size = 25, replace = TRUE))
sum_draws_25 = replicate(1000, sum(sample(box, size = 25, replace = TRUE)))
p1 <- data.frame(sum_draws_25) %>%
ggplot(aes(x = sum_draws_25, y=..density..)) +
geom_histogram(fill = "deeppink3", color = "white") +
xlab("sample sum") +
ylab("density") +
ggtitle("Empirical distribution of the sample sum, n = 25") +
geom_vline(xintercept = 47.5, color = "black", lwd = 1.1)
sum_draws_100 = replicate(1000, sum(sample(box, size = 100, replace = TRUE)))
p2 <- data.frame(sum_draws_100) %>%
ggplot(aes(x = sum_draws_100, y=..density..)) +
geom_histogram(fill = "deeppink3", color = "white") +
xlab("sample sum") +
ylab("density") +
ggtitle("Empirical distribution of the sample sum, n = 100") +
geom_vline(xintercept = 190, color = "black", lwd = 1.1)
mean_draws_25 = replicate(1000, mean(sample(box, size = 25, replace = TRUE)))
p3 <- data.frame(mean_draws_25) %>%
ggplot(aes(x = mean_draws_25, y=..density..)) +
geom_histogram(fill = "cadetblue3", color = "white") +
xlab("sample mean") +
ylab("density") +
ggtitle("Empirical distribution of the sample mean, n = 25") +
geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)
mean_draws_100 = replicate(1000, mean(sample(box, size = 100, replace = TRUE)))
p4 <- data.frame(mean_draws_100) %>%
ggplot(aes(x = mean_draws_100, y=..density..)) +
geom_histogram(fill = "cadetblue3", color = "white") +
xlab("sample mean") +
ylab("density") +
ggtitle("Empirical distribution of the sample mean, n = 100") +
geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)
(p1+p2)/(p3+p4)
What do we notice in these figures? The black line is the expected value. We see that the center of the distribution for the sample sum grows as the sample size increases, but this does not happen for the distribution of the sample mean. You can also see that the spread of the data for the sample sum is much greater when n = 100, but this does not happen for the distribution of the sample mean. We will explore the sample sum and sample mean next week. Now, the \(y\) axis has neither counts nor proportion, but it has “density”. This makes the histogram have a total area of one, similar to a probability histogram. Now we can think of this density histogram as an approximation of the probability histogram.
Continuous distributions
So far, we have talked about discrete distributions, and the pmf for such distributions. If we have a random variable that takes any value in a given interval, then we call such random variables continuous. In this situation, we cannot think about discrete bits of probability mass, but rather we imagine that our total probability mass of \(1\) is smeared over the interval, giving us a smooth density curve, rather than a histogram. To define the probabilities associated with a continuous random variable, we define a probability density function (pdf) rather than a probability mass function.
- Probability density function \(f(x)\)
- This is a function \(f(x)\) that satisfies two conditions:
- it is non-negative (\(f(x) \ge 0\)) and
- the total area under the curve \(y = f(x)\) is 1. That is, \[ \int_{-\infty}^\infty f(x) dx = 1 \] Now we don’t talk about \(P(X = x\)), but rather that probability that \(X\) lies in an interval around \(x\). Since there are infinitely many outcomes in any interval on the real line, no single outcome can have positive probability, so \(P(X=x) =0\) for any \(x\) in the interval where \(X\) takes values. To find the probability that \(X\) lies in an interval \((a,b)\), we integrate \(f(x)\) over the interval \((a,b)\). That is, we find the area under the curve \(f(x)\) over the interval \((a, b)\).
Example: Uniform(0,1) distribution
If \(X\) is a random variable taking values in the interval \((0,1)\) with probability density function \(f(x) = 1\) for \(x\) in \((0,1)\) and \(f(x) = 0\) outside of this interval. Because \(f(x)\) is flat, all intervals of the same length will have the same area under the curve, so the distribution defined by \(f\) is called the Uniform\((0,1)\) distribution. If a random variable \(X\) has this distribution, we say that \(X \sim U(0,1)\). The probability that \(X\) is in any interval \((a, b)\) which is a subinterval of \((0,1)\) is given by the area of the rectangle formed by the interval and \(y=f(x)\), and so is just the width of the interval.
- Cumulative distribution function \(F(x)\)
- The cdf is still defined as before: \(F(x) = P(X \le x)\). The difference is in how we compute it. There are no discrete bits of mass for \(F(x)\) to collect. Now we have that \(F(x)\) is the area under the curve upto the point \(x\). \[ F(x) = P(X \le x) = \int_{-\infty}^x f(t) dt \]
Cdf for the Uniform(0,1) distribution
Let \(X \sim U(0,1)\). What is \(F(0.3)\)?
Check your answer
\[ F(0.3) = P(X \le 0.3) = \int_{-\infty}^0.3 f(t) dt = \int_0^0.3 1 dt = 0.3 \] In general, for the \(U(0,1)\) distribution, \(F(x) = x\).
Summary
We defined the expected value or the mean of a discrete random variable, and related it to the average of the tickets in a box model that corresponds to the random variable. We listed the properties of expectation including linearity and additivity.
We also defined the variance and standard deviation of a random variable. All these are constants of the distribution. The variance is more convenient for computation because it doesn’t have square roots. However, the units are squared, so you have to be careful while interpreting the variance. We discussed the properties of variance and standard deviation.
We defined the expectation and variance of sums and averages of an iid sample of random variables, and introduced the term standard error. We recognized how the mean and variance scale with \(n\) and defined the square root law for the standard error of the sum or mean of an iid sample.
We considered the probability distributions of sums and averages.
Finally, we introduced continuous distributions. Next time we will introduce the most celebrated continuous distribution, the normal distribution.