Expected value and variance of a random variable

Measuring the center and spread of a distribution

class date

10/10/23

We are often interested in the average value of a random variable. We might repeat the action that generates a value of the random variable over and over again, and consider the long-run average. For example, we might bet on red in roulette and think about what our average gain would be if we played hundreds of games. Or we might roll a die four times, record a success if we see at least one six, repeat this process, and take the average. We did this last week when we simulated de Méré’s paradox and computed the proportion of times we rolled at least one six in four rolls. A proportion is just a special case of an average, when the random variable takes only the values \(0\) or \(1\). So we can think of the average as the value we would predict for the random variable: some sort of typical or expected value.

Expectation of a random variable

Expected value of a random variable
The expectation or expected value of a random variable \(X\) with pmf \(f(x)\) is denoted by \(E(X)\). It is a constant associated with the distribution, and is defined by \[ E(X) = \sum_x x \times P(X = x) = \sum_x x \times f(x) \]

Note that \(E(X)\) is a weighted average of the possible values taken by the random variable, where each possible value is weighted by its probability.

Example: Rolls of a fair die

If \(X\) is the number of spots showing when we roll a fair six-sided die, then \(f(x) = P(X = x) = 1/6\) for \(x = 1, 2, \ldots, 6\). In this case, \(E(X) = \displaystyle \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5\).

In general, suppose \(X\) is a discrete uniform random variable on the values \(1, 2, 3, \ldots, n\), so that \(f(x) =\displaystyle 1/n\) for all \(x = 1, 2, 3, \ldots, n\). Since all the values are equally likely, if we had to predict \(X\) we would predict the middle of the distribution. The expected value of a discrete uniform random variable is therefore \(E(X) = \displaystyle\frac{n+1}{2}\), which is just the usual average of \(1, 2, 3, \ldots, n\).


Example: Rolls of an unfair die

What happens if the faces are not all equally likely? Consider the following scenario: let \(X\) be the result of rolling a weird six-sided die, for which each of 4 and 6 is twice as likely as any particular odd number, and 2 is three times as likely as any particular odd number. The probability mass function of \(X\) is shown below:

\[ P(X=x) = f(x) = \begin{cases} 0.1, \: \text{for}\, x = 1, 3, 5 \\ 0.2, \: \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \]

Using the definition,

\[ \begin{aligned} E(X) &= \sum_x x \times f(x) \\ &= 1 \times 0.1 + 2 \times 0.3 + 3 \times 0.1 + 4 \times 0.2 + 5 \times 0.1 + 6 \times 0.2\\ &= 3.5 \end{aligned} \]


Notice that the expected value is the same for both dice, the fair one and the unfair one. This illustrates an important way in which the expected value of a random variable is just like the average of a list of numbers. It gives us some information about the distribution, since it tells us the “typical” value, but not that much information. Two random variables with very different distributions can have the same expected value.
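The two dice can be compared directly in R. The following is a minimal sketch, not part of the original notes: the helper name `expected_value` is made up for illustration, and the two pmfs are copied from the examples above.

```r
# Possible values of a six-sided die
x <- 1:6

# pmf of the fair die and of the weighted die from the examples above
f_fair   <- rep(1/6, 6)
f_unfair <- c(0.1, 0.3, 0.1, 0.2, 0.1, 0.2)   # P(X = 1), ..., P(X = 6)

# Hypothetical helper: E(X) is the sum over x of x * f(x)
expected_value <- function(x, f) sum(x * f)

ev_fair   <- expected_value(x, f_fair)
ev_unfair <- expected_value(x, f_unfair)
print(c(fair = ev_fair, unfair = ev_unfair))
```

Both calls should return 3.5, even though the two pmfs are quite different.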

Example: Bernoulli random variables

Recall that a Bernoulli (\(p\)) random variable is the special case of a binomial random variable when the parameter \(n=1\). This random variable takes the value \(1\) with probability \(p\) and \(0\) with probability \(1-p\). Then: \[ E(X) = 1 \times p + 0 \times (1-p) = p\] The expected value of a Bernoulli(\(p\)) random variable is therefore just \(p\). In particular, if we toss a coin and define \(X\) to be the number of heads, then the expected value of \(X\) is the probability that the coin lands heads.

In all the figures above, note the red triangle marking the expected value. If you imagine the probability histogram as a thin lamina (a thin sheet of metal cut in the shape of the histogram), the expected value is the “balancing point”: the point where the lamina would balance. It is the center of mass of the probability distribution. The expected value of a random variable is analogous to the average of sample data. Other terms that we use for the expected value of a random variable are expectation and mean. These can be used interchangeably.


Exercise: Compute the expectation of the random variable whose distribution is shown below:

Let \(X\) be a random variable such that

\[ X = \begin{cases} 1\: \text{with prob}\, 4/15\\ 2 \: \text{with prob}\, 7/30 \\ 0 \: \text{with prob}\, 1/3 \\ -1 \: \text{with prob} \, 1/6 \end{cases} \]

Check your answer

\[ \begin{aligned} E(X) &= 1 \times \frac{4}{15} + 2 \times \frac{7}{30} + 0 \times \frac{1}{3} + (-1) \times \frac{1}{6} \\ &= \frac{1 \times 8 + 2 \times 7 + 0 \times 10 + (-1) \times 5}{30} \\ &= \frac{8 + 14 + 0 - 5}{30} \\ &= \frac{17}{30} \approx 0.567 \end{aligned} \]

Properties of expectation

  1. Expected value of a constant
    The expected value of a constant is just the constant itself. Let \(c\) be any constant. The probability mass function can be considered to be \(f(c) = 1\). Then \[E(c) = c \cdot f(c) = c \]
  2. Constant multiple of a random variable
    If \(c\) is a constant, and \(X\) is a random variable with pmf \(f(x)\), then we can consider \(cX\) to be the random variable that takes the value \(cx\) when \(X\) takes the value \(x\), so (for \(c \ne 0\)) \(P(cX = cx) = P(X = x) = f(x)\). Therefore, \[ E(cX) = \sum_x cx \cdot f(x) = c \sum_x x \cdot f(x) = c\cdot E(X) \]
  3. Additivity of expectation
    Let \(X\) and \(Y\) be two random variables, and consider their sum \(X + Y\). Then, \[ E(X + Y) = E(X) + E(Y)\] This is true regardless of whether \(X\) and \(Y\) are independent or not.
  4. Linearity of expectation
    Putting the previous two properties together, we get \[ E(aX + bY) = E(aX) + E(bY) = a\cdot E(X) +b \cdot E(Y)\]
  5. Expectation of a function of a random variable
    Suppose \(Y = g(X)\) is a function of the random variable \(X\). Then \(Y\) is also a random variable taking values \(y = g(x)\), and the probabilities are just the probabilities associated with \(x\). Then the expected value of \(Y\) is given by: \[ E(Y) = E(g(X)) = \sum_x g(x) \times f(x) = \sum_x g(x) \times P(X=x) \]


Example: Expectation of the square of a random variable

Let \(X\) be a random variable with the following probability distribution:

| \(x\) | \(P(X = x)\) |
| --- | --- |
| \(-2\) | \(0.2\) |
| \(-1\) | \(0.1\) |
| \(0\) | \(0.2\) |
| \(1\) | \(0.3\) |
| \(3\) | \(0.2\) |

Let’s first add a column with the product \(x\times P(X=x)\):

| \(x\) | \(P(X = x)\) | \(x\times P(X=x)\) |
| --- | --- | --- |
| \(-2\) | \(0.2\) | \(-0.4\) |
| \(-1\) | \(0.1\) | \(-0.1\) |
| \(0\) | \(0.2\) | \(0\) |
| \(1\) | \(0.3\) | \(0.3\) |
| \(3\) | \(0.2\) | \(0.6\) |

Then we sum the third column to get \(E(X) = -0.4 -0.1 + 0 + 0.3 + 0.6 =\) 0.4.

Let’s do the same for the random variable \(Y = g(X) = X^2\). Add two columns to the original table, one with the values of \(y = g(x)\), and one with \(g(x)f(x) = g(x)P(X=x)\):

| \(x\) | \(P(X = x)\) | \(y = x^2\) | \(g(x)\times P(X=x)\) |
| --- | --- | --- | --- |
| \(-2\) | \(0.2\) | \(4\) | \(0.8\) |
| \(-1\) | \(0.1\) | \(1\) | \(0.1\) |
| \(0\) | \(0.2\) | \(0\) | \(0\) |
| \(1\) | \(0.3\) | \(1\) | \(0.3\) |
| \(3\) | \(0.2\) | \(9\) | \(1.8\) |

Summing the last column we get \(E(Y) = 0.8 + 0.1 + 0 + 0.3 + 1.8 =\) 3.

Warning

Do not apply the function to \(f(x)\)! The probability distribution remains the same, only the variable values change - instead of using \(x\), we use \(g(x)\).
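The table calculation translates almost line for line into R. This is a small sketch under the assumption that the distribution is stored as two parallel vectors; the names `x`, `f`, `e_x` and `e_x2` are chosen here for illustration.

```r
# Values and probabilities from the table above
x <- c(-2, -1, 0, 1, 3)
f <- c(0.2, 0.1, 0.2, 0.3, 0.2)

# E(X): weight each value by its probability
e_x <- sum(x * f)

# E(X^2): apply g(x) = x^2 to the VALUES only; the probabilities stay the same
e_x2 <- sum(x^2 * f)

print(c(E_X = e_x, E_X2 = e_x2))
```

Note that `e_x2` is not the same as `e_x^2`: in general \(E(g(X))\) is not \(g(E(X))\).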

Example: Binomial random variables

Let \(X\sim Bin(n, p)\). Recall that \(X\) counts the number of successes in \(n\) trials, where the probability of success on each trial is \(p\). We can define \(n\) Bernoulli (\(p\)) random variables \(X_1, X_2, \ldots, X_n\), where \(X_k = 1\) with probability \(p\) and \(X_k = 0\) with probability \(1-p\); that is, \(X_k\) is 1 if the \(k\)th trial is a success and 0 otherwise. We see that the binomial random variable \(X\) can be written as a sum of these \(n\) Bernoulli random variables:

\[ X = X_1 + X_2 + \ldots + X_n \]

The expected value of \(X_k\) is \(E(X_k) = p\) for each \(k\), so using the additivity of expectation, we get

\[ \begin{aligned} E(X) &= E( X_1 + X_2 + \ldots + X_n) \\ &= E(X_1) + E(X_2) + \ldots + E(X_n) \\ &= p + p + \ldots + p\\ &= np \end{aligned} \]

Therefore, if \(X\sim Bin(n, p)\), then \(E(X) = np\). This intuitively makes sense: if I toss a fair coin 100 times, I expect to see about \(50\) heads. This is a very neat trick to compute the expected value of a binomial random variable because you can imagine that computing the expected value using the formula \(\displaystyle \sum_x x \cdot f(x)\) would be very messy and difficult. Using Bernoulli random variables allowed us to easily calculate the expected value of a binomial random variable.
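We can let R do the messy sum for us and compare it with \(np\). This is only a numerical check for one choice of \(n\) and \(p\) (picked here to match the coin example), not a proof; `dbinom` is the binomial pmf in base R.

```r
n <- 100   # number of tosses
p <- 0.5   # probability of heads on each toss

# Direct definition: sum over x of x * f(x), with f the Bin(n, p) pmf
x <- 0:n
e_direct <- sum(x * dbinom(x, size = n, prob = p))

# Shortcut from additivity of expectation
e_shortcut <- n * p

print(c(direct = e_direct, shortcut = e_shortcut))
```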

Variance and standard deviation of a random variable

When we looked at averages of data, we realized that computing measures of center was insufficient to give us a picture of the distribution. We needed to know how the data distribution spread out about its center, and this idea holds true for probability distributions as well. We want a number that describes how far from \(E(X)\) the values of \(X\) typically fall, similar to the standard deviation for a list of numbers.

Variance of a random variable
Define \(\mu = E(X)\), and define the function \(g(X) = (X-\mu)^2\), which describes the square of the distance of each value of \(X\) from its mean, or the squared deviation of \(X\) from its mean. Then the variance of \(X\), written as \(Var(X)\) is given by: \[ Var(X) = E(g(X)) = E\left[(X-\mu)^2\right] \]

The problem with \(Var(X)\) is that the units are squared, so just as we did for the sample variance, we will take the square root of the variance.

Standard deviation of a random variable
The square root of \(Var(X)\) is called the standard deviation of \(X\):

\[ SD(X) = \sqrt{Var(X)} \]

\(SD(X)\) is a “give or take” number attached to the mean \(E(X)\), so we can say that a “typical” value of \(X\) is about \(\mu\), give or take one standard deviation (the value of \(SD(X)\)). Note that \(SD(X)\) is a non-negative number.


Example: Rolls of a weighted die

Recall the example in which we roll an unfair die, and the probability mass function \(f\) was given by: \[ P(X=x) = f(x) = \begin{cases} 0.1, \: \text{for}\, x = 1, 3, 5 \\ 0.2, \: \text{for}\, x = 4, 6 \\ 0.3, \: \text{for}\, x = 2 \end{cases} \]

We had computed \(E(X)=\mu = 3.5\). What about \(Var(X)\)? Let’s write out the table, and add a column for \(g(x) = (x-3.5)^2\).

| \(x\) | \(P(X = x)\) | \(g(x) = (x - 3.5)^2\) | \(g(x)\times P(X=x)\) |
| --- | --- | --- | --- |
| \(1\) | \(0.1\) | \(6.25\) | \(0.625\) |
| \(2\) | \(0.3\) | \(2.25\) | \(0.675\) |
| \(3\) | \(0.1\) | \(0.25\) | \(0.025\) |
| \(4\) | \(0.2\) | \(0.25\) | \(0.05\) |
| \(5\) | \(0.1\) | \(2.25\) | \(0.225\) |
| \(6\) | \(0.2\) | \(6.25\) | \(1.250\) |

\(Var(X) = E(g(X)) = \sum_x g(x)\cdot P(X=x) =\) 2.85

Therefore the standard deviation \(SD(X)\), which is the square root of the variance, is about 1.688.
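The same table can be reproduced in a few lines of R. This is a sketch with illustrative variable names (`mu`, `var_x`, `sd_x`), using the definition \(Var(X) = E\left[(X-\mu)^2\right]\) directly.

```r
# Weighted die: values and pmf
x <- 1:6
f <- c(0.1, 0.3, 0.1, 0.2, 0.1, 0.2)

mu    <- sum(x * f)            # E(X)
var_x <- sum((x - mu)^2 * f)   # E[(X - mu)^2]
sd_x  <- sqrt(var_x)           # SD(X)

print(c(mean = mu, variance = var_x, sd = sd_x))
```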


Example: Rolls of a fair die

We have already computed that \(E(X) = 3.5\), where \(X\) is the result of rolling a fair die. What are \(Var(X)\) and \(SD(X)\)?

Check your answer

\(Var(X) = \sum_x (x-3.5)^2 \cdot P(X=x) =\) 2.917. \(SD(X) =\) 1.708.

Why do you think that \(Var(X)\) and \(SD(X)\) are greater when \(X\) is the result of rolling the fair die than when \(X\) is the result of rolling the unfair die? (Hint: think about the probability distributions.)

Computational formula for \(Var(X)\)

To compute variance, we usually use a nice shortcut formula: \[Var(X) = E\left(X^2\right) - \mu^2\]
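The shortcut follows from expanding the square and using linearity of expectation, together with the fact that \(\mu = E(X)\) is a constant:

\[ \begin{aligned} Var(X) &= E\left[(X-\mu)^2\right] \\ &= E\left(X^2 - 2\mu X + \mu^2\right) \\ &= E\left(X^2\right) - 2\mu E(X) + \mu^2 \\ &= E\left(X^2\right) - 2\mu^2 + \mu^2 \\ &= E\left(X^2\right) - \mu^2 \end{aligned} \]

As a check, for the weighted die above we have \(E(X^2) = 1 \times 0.1 + 4 \times 0.3 + 9 \times 0.1 + 16 \times 0.2 + 25 \times 0.1 + 36 \times 0.2 = 15.1\), so \(Var(X) = 15.1 - 3.5^2 = 2.85\), which agrees with the value computed from the definition.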

Properties of variance and standard deviation

  1. The variance of a constant is 0
    Remember that the variance measures how spread out the values of \(X\) are about the mean. A constant doesn’t spread out at all, so it makes sense that the variance is \(0\).

    Let’s confirm this. Let \(X\) be a random variable such that \(X = c\) with \(f(c) = 1\) for some real number \(c\). (This means that \(X\) takes the value \(c\) with probability 1, that is, with certainty.) Then \(\mu = E(X) = c\), and

    \[ Var(X) = E(X^2) - \mu^2 = E(c^2) - c^2 = c^2 - c^2 = 0 \]

    Note that \(E(c^2) = c^2\) as \(c^2\) is a constant. Thus we also have \(SD(c) = 0\).

  2. The variance of a constant multiple of \(X\): \(Var(cX) = c^2 Var(X)\)
    Note that \(E(cX) = c\mu\) by the properties of expectation. Thus we have (using the shortcut formula):

    \[ \begin{aligned} Var(cX) &= E((cX)^2) - (c\mu)^2 \\ &= E(c^2 X^2) - c^2 \mu^2 \\ &= c^2E(X^2) - c^2\mu^2\\ &= c^2\left(E(X^2) - \mu^2 \right) \\ &= c^2 Var(X) \end{aligned} \]

    Thus, \(SD(cX) = \sqrt{Var(cX)} = \sqrt{c^2 Var(X)} = \lvert c \rvert SD(X).\) The absolute value appears because \(\sqrt{c^2} = \lvert c \rvert\) and a standard deviation is never negative.

  3. The variance of \(X\) is unchanged by adding a constant
    If we add a constant to \(X\), the spread of the distribution does not change, only its center changes. Let \(c\) be a real number. Since \(E(X + c) = \mu + c\), the deviations \((X + c) - (\mu + c) = X - \mu\) are unchanged, and so:

    \[ Var(X+c) = Var(X) \]

  4. Additivity of variance
    If \(X\) and \(Y\) are independent random variables, that is, the events \(X = x\) and \(Y = y\) are independent for all values of \(x\) and \(y\), we have that:

    \[ Var(X + Y) = Var(X) + Var(Y) \text{ and } Var(X - Y) = Var(X) + Var(Y) \]

    Note that in this case \(SD(X +Y) = \sqrt{Var(X) + Var(Y)}\), which is not the same as \(SD(X) + SD(Y)\). Variances of independent random variables add, but standard deviations do not.
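These properties are easy to check numerically. The sketch below uses a hypothetical helper `pmf_var` that computes the variance from a list of values and their probabilities; the distribution of \(X\) and the constant 3 are arbitrary choices for illustration. For additivity, the joint distribution of two independent fair dice is built with `outer`, so that each pair \((i, j)\) has probability \(1/36\).

```r
# Hypothetical helper: variance of a discrete distribution with values x, probabilities f
pmf_var <- function(x, f) {
  mu <- sum(x * f)
  sum((x - mu)^2 * f)
}

x <- c(-2, -1, 0, 1, 3)
f <- c(0.2, 0.1, 0.2, 0.3, 0.2)
c_const <- 3

v         <- pmf_var(x, f)             # Var(X)
v_scaled  <- pmf_var(c_const * x, f)   # Var(cX), should equal c^2 * Var(X)
v_shifted <- pmf_var(x + c_const, f)   # Var(X + c), should equal Var(X)

# Two independent fair dice: P(X = i, Y = j) = 1/6 * 1/6
d     <- 1:6
joint <- outer(rep(1/6, 6), rep(1/6, 6))
v_die  <- pmf_var(d, rep(1/6, 6))
v_sum  <- pmf_var(as.vector(outer(d, d, "+")), as.vector(joint))  # Var(X + Y)
v_diff <- pmf_var(as.vector(outer(d, d, "-")), as.vector(joint))  # Var(X - Y)

print(c(v = v, v_scaled = v_scaled, v_shifted = v_shifted,
        v_sum = v_sum, v_diff = v_diff))
```

The last two numbers should both equal twice the variance of a single die.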


Example: Box of tickets

Consider the box with the following \(30\) tickets:

\(8\) tickets marked \(\fbox{1}\), \(7\) tickets marked \(\fbox{2}\), \(10\) tickets marked \(\fbox{0}\), and \(5\) tickets marked \(\fbox{-1}\).

Let \(X\) be the value of a single draw from this box, if we shuffle the tickets and draw one ticket at random. What is the probability distribution of \(X\)? What is \(E(X)\), rounded to 3 decimal places?

Notice that the average of the tickets in the box is 0.567 which is the same as \(E(X)\)!

What about the variance of \(X\)?

\[ \begin{aligned} Var(X) &= E\left[ ( X - \mu)^2 \right] \\ &= \frac{8}{30}\times (1-0.567)^2 + \frac{7}{30} \times (2-0.567)^2 + \frac{10}{30} \times (0-0.567)^2 + \frac{5}{30} \times (-1 - 0.567)^2 \\ &\approx 1.046 \end{aligned} \]

The sample variance of the tickets in the box is a bit larger than this (about 1.082). This is because we use \(n-1\) in the denominator of the sample variance and sample sd, rather than \(n\).

The standard deviation \(SD(X) = \sqrt{Var(X)} =\) 1.023.
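The difference between the two denominators can be seen by treating the box as a vector in R. This is a sketch with illustrative names; base R's `var()` divides by \(n-1\), so the variance of a single draw (which divides by \(n\)) is computed by hand.

```r
# The 30 tickets in the box
box <- c(rep(1, 8), rep(2, 7), rep(0, 10), rep(-1, 5))

mu       <- mean(box)              # average of the tickets = E(X)
pop_var  <- mean((box - mu)^2)     # divides by n: this is Var(X) for one draw
samp_var <- var(box)               # divides by n - 1: the sample variance

print(round(c(mean = mu, pop_var = pop_var, samp_var = samp_var,
              sd = sqrt(pop_var)), 3))
```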

Example: Bernoulli random variables

Let \(X\) be a Bernoulli (\(p\)) random variable. We know that \(E(X) = \mu = p\). If we compute \(E(X^2)\), we get that \(E(X^2) = p\). (Make sure you know how to compute \(E(X^2)\).) Therefore we have that: \[ Var(X) = E(X^2) - \mu^2 = p - p^2 = p(1-p). \]

Example: Binomial random variables

We use the same method as we did to compute the expectation of \(X\sim Bin(n,p)\). We will write \(X\) as a sum of independent Bernoulli random variables: \[ X = X_1 + X_2 + \ldots + X_n\] where each \(X_k \sim\) Bernoulli(\(p\)). Since the \(X_k\) are results of independent trials (by the definition of the binomial distribution), we have: \[Var(X) = Var(X_1) + Var(X_2) + \ldots + Var(X_n) = np(1-p).\] Therefore, \(SD(X) = \sqrt{np(1-p)}\).
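As with the expectation, we can compare the formula against a direct computation from the binomial pmf. The values \(n = 10\) and \(p = 0.3\) below are arbitrary choices for illustration, so this is a spot check rather than a proof.

```r
n <- 10
p <- 0.3

x <- 0:n
f <- dbinom(x, size = n, prob = p)   # Bin(n, p) pmf

mu         <- sum(x * f)             # should be n * p
var_direct <- sum((x - mu)^2 * f)    # definition of variance
var_formula <- n * p * (1 - p)       # sum of n Bernoulli variances

print(c(direct = var_direct, formula = var_formula))
```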

Independent and identically distributed random variables

IID random variables
If we have \(n\) independent random variables \(X_1, X_2, \ldots, X_n\) such that they all have the same pmf \(f(x)\) and the same cdf \(F(x)\), we call the random variables \(X_1, X_2, \ldots, X_n\) independent and identically distributed random variables. We usually use the abbreviation i.i.d. or iid for “independent and identically distributed”.

This is a very important concept that we have already used to compute the expected value and variance of a binomial random variable by writing it as a sum of iid Bernoulli random variables.

A common example is when we toss a coin \(n\) times and count the number of heads - each coin toss can be considered a Bernoulli random variable, and the total number of heads is a sum of \(n\) iid Bernoulli random variables.

Example: iid random variables

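Consider the box shown below, which holds ten tickets: two marked \(\fbox{0}\), three marked \(\fbox{1}\), one marked \(\fbox{2}\), two marked \(\fbox{3}\), and two marked \(\fbox{4}\).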

Say I draw \(25\) tickets with replacement from this box, and let \(X_k\) be the value of the \(k\)th ticket. Then each of the \(X_k\) has the same distribution, and they are independent since we draw the tickets with replacement. Therefore \(X_1, X_2, \ldots, X_{25}\) are iid random variables, and they each have a distribution defined by the following pmf: \[ f(x) = \begin{cases} 0.2, \; x = 0, 3, 4 \\ 0.1, \; x = 2 \\ 0.3, \; x = 1 \end{cases} \]


Sums and averages of random variables

Sums

Suppose we make \(n\) draws at random with replacement from the box above, and sum the drawn tickets. The sum \(S_n = X_1 + X_2 + \ldots + X_{n}\) is also a random variable. Let’s simulate this with \(n=25\): we will sample \(25\) tickets with replacement, sum them, and then repeat this process. Note that the smallest sum we can get is \(S_{25} = 0\) and the largest is \(100\). (Why?)

Code
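set.seed(12345)  # make the random draws reproducible
box <- c(0,0,1,1,1,2,3,3,4,4)  # the ten tickets in the box
# draw 25 tickets with replacement and add them up
s1 <- sum(sample(box, size = 25, replace = TRUE))
paste("The sum of 25 draws is", s1)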
[1] "The sum of 25 draws is 56"

Now we will repeat this process 10 times:

Code
set.seed(12345)
replicate(10, sum(sample(box, size = 25, replace = TRUE)))
 [1] 56 50 64 51 49 41 64 44 45 51

It is clear that the sum \(S_n\) is random (because the \(X_k\) are random), and we can see that the sum of the draws changes with each iteration of the process.

Since we know the distribution of the \(X_k\), we can compute \(E(X_k)\) and \(Var(X_k)\). Note that since the \(X_1, X_2, \ldots, X_{n}\) are iid, all the \(X_k\) have the same mean and variance. What about their sum \(S_n\)? What are \(E(S_n)\) and \(Var(S_n)\), when \(n = 25\)?

\(E(X_k) = 0.2\times 0 + 0.3 \times 1 + 0.1 \times 2 + 0.2 \times 3 + 0.2 \times 4 = 1.9\) (Note that you could also have just computed the average of the tickets in the box.)

\(Var(X_k) = \sum_x (x-1.9)^2 \times P(X=x) = 2.09\)

\(E(S_{25}) = E(X_1 + X_2 + \ldots +X_{25}) = 25 \times E(X_1) = 25 \times 1.9 = 47.5\).

(We just use \(X_1\) since all the \(X_k\) have the same distribution.)

Since the \(X_k\) are independent, we can write that

\[ \begin{aligned} Var(S_{25}) &= Var(X_1 + X_2 + \ldots +X_{25})\\ &= Var(X_1) + Var(X_2) + \ldots +Var(X_{25})\\ &= 25 \times 2.09 = 52.25 \end{aligned} \]

We can see that the expectation and variance of the sum scale with \(n\), so that if \(S_n\) is the sum of \(n\) iid random variables \(X_1, X_2, \ldots, X_n\), then:

\[ \begin{aligned} E(S_n) &= n \times E(X_1) \\ Var(S_n) &= n \times Var(X_1) \end{aligned} \]

This does not hold for \(SD(S_n)\), though. For the SD, we have the following “law” for the standard deviation of the sum.

Square root law for sums of iid random variables
The standard deviation of the sum of \(n\) iid random variables is given by: \[ SD(S_n) = \sqrt{n} \times SD(X_1) \]

Since all the \(X_k\) have the same distribution, we can use \(X_1\) to compute the mean and SD of the sum. This law says that as the number \(n\) of random variables in the sum increases, the expected value of the sum scales as \(n\), but the standard deviation of the sum increases more slowly, scaling as \(\sqrt{n}\). In other words, if you increase the number of random variables you are summing, the spread of your sum about its expected value increases, but not as fast as the expectation of the sum.
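A simulation gives a rough check of the square root law for the box with tickets 0, 0, 1, 1, 1, 2, 3, 3, 4, 4. This is a sketch, not part of the original notes: the seed and the 10,000 repetitions are arbitrary choices, and the simulated mean and SD will only be close to the theoretical values, not equal to them.

```r
box <- c(0, 0, 1, 1, 1, 2, 3, 3, 4, 4)

mu_1 <- mean(box)                    # E(X_1)
sd_1 <- sqrt(mean((box - mu_1)^2))   # SD(X_1), dividing by n rather than n - 1

n <- 25
e_sum  <- n * mu_1          # E(S_n) = n * E(X_1)
sd_sum <- sqrt(n) * sd_1    # SD(S_n) = sqrt(n) * SD(X_1)

# Simulate many sums of n draws with replacement
set.seed(1)
sums <- replicate(10000, sum(sample(box, size = n, replace = TRUE)))

print(round(c(theory_mean = e_sum, sim_mean = mean(sums),
              theory_sd = sd_sum, sim_sd = sd(sums)), 2))
```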

Averages

We denote the average of the random variables \(X_1, X_2, \ldots, X_n\) by \(\displaystyle \bar{X} =\frac{S_n}{n}\).

\(\displaystyle \bar{X}\) is called the sample mean (where the “sample” consists of \(X_1, X_2, \ldots, X_n\)).

\[ E(\bar{X}) = E\left(\frac{S_n}{n}\right) = \frac{1}{n} E(S_n) = E(X_1) \]

This means that the expected value of an average does not scale as \(n\), but \(E(\bar{X})\) is the same as the expected value of a single random variable. Let’s check the variance now:

\[ Var(\bar{X}) = Var\left(\frac{S_n}{n}\right) = \frac{1}{n^2} Var(S_n) = \frac{n}{n^2} Var(X_1) \]

Therefore \(Var(\bar{X}) = \displaystyle \frac{1}{n} Var(X_1)\)

Note that, just like the sample sum \(S_n\), the sample mean \(\displaystyle \bar{X}\) is a random variable, and its variance scales as \(\displaystyle \frac{1}{n}\), which implies that \(SD(\bar{X})\) will scale as \(\displaystyle \frac{1}{\sqrt{n}}\).

Square root law for averages of iid random variables
The standard deviation of the average of \(n\) iid random variables is given by: \[ SD(\bar{X}) = \frac{1}{\sqrt{n}}SD(X_1) \]
Standard error
Since \(S_n\) and \(\bar{X}\) are numbers computed from the sample \(X_1, X_2, \ldots, X_n\), they are called statistics. We use the term standard error for the standard deviation of a statistic, and write \(SE(S_n)\) and \(SE(\bar{X})\). This distinguishes it from the standard deviation of a random variable that does not arise as a statistic computed from \(X_1, X_2, \ldots, X_n\).
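The two square root laws can be put side by side for the box with tickets 0, 0, 1, 1, 1, 2, 3, 3, 4, 4. The sketch below (variable names are illustrative) computes both standard errors for \(n = 25\) and \(n = 100\): quadrupling the sample size doubles \(SE(S_n)\) and halves \(SE(\bar{X})\).

```r
box <- c(0, 0, 1, 1, 1, 2, 3, 3, 4, 4)

# SD of a single draw (divide by n, not n - 1)
sd_1 <- sqrt(mean((box - mean(box))^2))

n <- c(25, 100)
se_sum  <- sqrt(n) * sd_1   # SE of the sample sum grows like sqrt(n)
se_mean <- sd_1 / sqrt(n)   # SE of the sample mean shrinks like 1 / sqrt(n)

print(data.frame(n = n, se_sum = round(se_sum, 3), se_mean = round(se_mean, 3)))
```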

Example: Probability distributions for sums and averages

Let’s go back to the box of colored tickets, draw from this box \(n\) times, and then compute the sum and average of the draws. We will simulate the distribution of the sum and the average of 25 draws to see what the distribution of the statistics looks like. Note that when \(n=25\), \(E(S_n) = 25\times 1.9 = 47.5\) and \(SE(S_n) = \sqrt{n} \times SD(X_1) \approx 5 \times 1.45 = 7.25\).

Code
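library(ggplot2)    # plotting
library(dplyr)      # provides the %>% pipe used below
library(patchwork)  # lets us combine plots with + and /

set.seed(12345)
box <- c(0,0,1,1,1,2,3,3,4,4)
s1 <- sum(sample(box, size = 25, replace = TRUE))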

sum_draws_25 = replicate(1000, sum(sample(box, size = 25, replace = TRUE)))

p1 <- data.frame(sum_draws_25) %>%
  ggplot(aes(x = sum_draws_25, y = after_stat(density))) + 
  geom_histogram(fill = "darkolivegreen2", color = "white") + 
  xlab("sample sum") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample sum, n = 25") + 
  geom_vline(xintercept = 47.5, color = "black", lwd = 1.1)


sum_draws_100 = replicate(1000, sum(sample(box, size = 100, replace = TRUE)))

p2 <- data.frame(sum_draws_100) %>%
  ggplot(aes(x = sum_draws_100, y = after_stat(density))) + 
  geom_histogram(fill = "darkolivegreen3", color = "white") + 
  xlab("sample sum") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample sum, n = 100") + 
  geom_vline(xintercept = 190, color = "black", lwd = 1.1)

mean_draws_25 = replicate(1000, mean(sample(box, size = 25, replace = TRUE)))

p3 <- data.frame(mean_draws_25) %>%
  ggplot(aes(x = mean_draws_25, y = after_stat(density))) + 
  geom_histogram(fill = "cadetblue2", color = "white") + 
  xlab("sample mean") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample mean, n = 25")  + 
  geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)


mean_draws_100 = replicate(1000, mean(sample(box, size = 100, replace = TRUE)))

p4 <- data.frame(mean_draws_100) %>%
  ggplot(aes(x = mean_draws_100, y = after_stat(density))) + 
  geom_histogram(fill = "deepskyblue", color = "white") + 
  xlab("sample mean") +
  ylab("density") +
  ggtitle("Empirical distribution of the sample mean, n = 100") +
  geom_vline(xintercept = 1.9, color = "black", lwd = 1.1)


(p1+p2)/(p3+p4)

What do we notice in these figures? The black line is the expected value. We see that the center of the distribution of the sample sum grows as the sample size increases (look at the x-axis), but this does not happen for the distribution of the sample mean. You can also see that the spread of the sample sum is much greater when \(n = 100\) than when \(n = 25\), whereas the spread of the sample mean does not grow (by the square root law, it shrinks). We will explore the sample sum and sample mean next week. Note that the \(y\)-axis shows neither counts nor proportions, but “density”. This scaling gives the histogram a total area of one, just like a probability histogram, so we can think of the density histogram as an approximation of the probability histogram.

Continuous distributions

So far, we have talked about discrete distributions, and the probability mass functions for such distributions. Consider a random variable that takes any value in a given interval. Recall that we call such random variables continuous. In this situation, we cannot think about discrete bits of probability mass which are non-zero for certain numbers, but rather we imagine that our total probability mass of \(1\) is smeared over the interval, giving us a smooth density curve, rather than a histogram. To define the probabilities associated with a continuous random variable, we define a probability density function (pdf) rather than a probability mass function.

Probability density function of a distribution

Probability density function
This is a function \(f(x)\) that satisfies two conditions:
    1. it is non-negative (\(f(x) \ge 0\)) and
    2. the total area under the curve \(y = f(x)\) is 1. That is, \[ \int_{-\infty}^\infty f(x) dx = 1 \]

If \(X\) is a continuous random variable, we don’t talk about \(P(X = x)\), that is, the probability that \(X\) takes a particular value. Rather, we ask what is the probability that \(X\) lies in an interval around \(x\). Since there are infinitely many outcomes in any interval on the real line, no single outcome can have positive probability, so \(P(X=x) =0\) for any particular \(x\) in the interval where \(X\) is defined. To find the probability that \(X\) lies in an interval \((a,b)\), we integrate \(f(x)\) over the interval \((a,b)\). That is, we find the area under the curve \(f(x)\) over the interval \((a, b)\).

The probability that a continuous random variable lies in an interval
The probability that \(X\) is in the interval \((a,b)\) is given by \[ P(a < X < b) = P(X \text{ is in the interval }(a,b)) = \int_{a}^b f(x) dx \]

Example: Uniform(0,1) distribution

Let \(X\) be a random variable that takes values in the interval \((0,1)\) with probability density function \(f(x) = 1\) for \(x\) in \((0,1)\) and \(f(x) = 0\) outside of this interval.

Because \(f(x)\) is flat, all intervals of the same length will have the same area, so the distribution defined by \(f\) is called the Uniform\((0,1)\) distribution. If a random variable \(X\) has this distribution, we denote this by \(X \sim U(0,1)\). The probability that \(X\) is in any interval \((a, b)\) which is a subinterval of \((0,1)\) is given by the area of the rectangle formed by the interval and \(y=f(x)\), and so is just the width of the interval.
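In R, the Uniform\((0,1)\) density is available as `dunif` (with its default arguments) and its cdf as `punif`, and `integrate` performs numerical integration. The following sketch computes \(P(a < X < b)\) both ways; the endpoints \(0.2\) and \(0.5\) are arbitrary choices for illustration.

```r
a <- 0.2
b <- 0.5

# Area under the density f(x) = 1 between a and b, by numerical integration
p_integrate <- integrate(dunif, lower = a, upper = b)$value

# The same probability as a difference of cdf values
p_cdf <- punif(b) - punif(a)

# Total area under the density on (0, 1)
total_area <- integrate(dunif, lower = 0, upper = 1)$value

print(c(integrate = p_integrate, cdf = p_cdf, width = b - a, total = total_area))
```

Both probabilities should equal the width of the interval, \(b - a = 0.3\).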

Cumulative distribution function \(F(x)\)

Cumulative distribution function (cdf)
The cdf is defined the same way as for discrete random variables: \[ F(x) = P(X \le x) = \int_{-\infty}^x f(t) dt \]

The difference is in how we compute \(P(X \le x)\). There are no discrete bits of probability mass for \(F(x)\) to collect. Instead we have that \(F(x)\) is the area under the curve \(y = f(x)\) all the way up to the point \(x\).

Example: cdf for the Uniform (0,1) distribution

Let \(X \sim U(0,1)\). What is \(F(0.3)\)?

Check your answer

\[ F(0.3) = P(X \le 0.3) = \int_{-\infty}^{0.3} f(t) dt = \int_0^{0.3} 1 dt = 0.3 \] In general, for the \(U(0,1)\) distribution, \(F(x) = x\) for \(0 \le x \le 1\) (with \(F(x) = 0\) for \(x < 0\) and \(F(x) = 1\) for \(x > 1\)).
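The cdf of the Uniform\((0,1)\) distribution is built into R as `punif`, so the answer can be checked in one line. The grid of points `xs` below is an arbitrary choice for illustration.

```r
# F(0.3) for X ~ U(0, 1)
F_03 <- punif(0.3)

# On [0, 1] the cdf is the identity: F(x) = x
xs   <- c(0, 0.25, 0.5, 0.75, 1)
F_xs <- punif(xs)

# Outside (0, 1) the cdf is flat: 0 to the left, 1 to the right
F_outside <- punif(c(-1, 2))

print(F_03)
print(rbind(x = xs, F = F_xs))
print(F_outside)
```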

Summary

We defined the expected value or the mean of a discrete random variable and listed the properties of expectation including linearity and additivity.

We defined the variance and standard deviation of a random variable. Both expectation and variance (and therefore standard deviation) are constants associated to the distribution of the random variable. The variance is more convenient than the sd for computation because it doesn’t have square roots. However, the units are squared, so you have to be careful while interpreting the variance. We discussed the properties of variance and standard deviation.

We defined the expectation and variance of sums and averages of an iid sample of random variables, and introduced the term standard error. We recognized how the mean and variance scale with \(n\) and defined the square root law for the standard error of the sum or mean of an iid sample.

We considered the probability distributions of sums and averages.

Finally, we introduced continuous distributions. In subsequent notes, we will introduce the most celebrated continuous distribution, the normal distribution.

Materials from class

Slides