Chapter 4 Expectation and Variance

4.1 Expectation (Mean)

Given a list of numbers \(x_1, x_2,\ldots, x_n\), the familiar way to average them is to add them up and divide by \(n\). This is called the arithmetic mean, and is defined by \[ \bar{x}=\frac{1}{n}\sum_{j=1}^{n} x_j. \] More generally, we can define a weighted mean of \(x_1,\ldots,x_n\) as

\[ \mbox{weighted-mean}(x)=\sum_{j=1}^{n} x_j p_j, \] where the weights \(p_1,\ldots,p_n\) are pre-specified nonnegative numbers that add up to 1 (so the unweighted mean is obtained when \(p_j = 1/n\) for all \(j\)).

Definition 4.1.1 (Expectation of a discrete r.v.).

The expected value (also called the expectation or mean) of a discrete r.v. \(X\) whose distinct possible values are \(x_1,x_2,\ldots\) is defined by \[ E(X)=\sum_{j=1}^{\infty} x_j P(X=x_j). \]

  • If the support is finite, then this is replaced by a finite sum.

We can also write \[ E(X)=\sum_x x P(X=x), \] where the sum is over the support of \(X\) (in any case, \(xP(X = x)\) is 0 for any \(x\) not in the support).
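As a quick computational aside, the definition translates directly into a sum over the support. A minimal sketch (our own helper, not from the text), with the PMF stored as a Python dict:

```python
def expectation(pmf):
    """Expected value of a discrete r.v., given its PMF as a dict {x: P(X = x)}."""
    return sum(x * p for x, p in pmf.items())

print(expectation({1: 0.5, 3: 0.5}))  # 2.0
```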

The mean may not exist for a probability distribution!

The expectation is undefined if \(\sum_{j=1}^{\infty} |x_j| P(X=x_j)\) diverges, since then the series for \(E(X)\) diverges or its value depends on the order in which the \(x_j\) are listed (conditionally convergent series).

Examples:

  1. Let \(X\) be the result of rolling a fair 6-sided die, so \(X\) takes on the values 1, 2, 3, 4, 5, 6, with equal probabilities. Intuitively, we should be able to get the average by adding up these values and dividing by 6. Using the definition, the expected value is

\[ E(X)=\frac{1}{6}(1+2+\ldots+6)=3.5. \]

  2. Let \(X\sim\mbox{Bern}(p)\) and \(q=1-p\). Then

\[ E(X)=1\cdot p+0\cdot q=p. \]

  3. Let \(X\sim\mbox{Bin}(4, 1/2)\). Then its PMF can be written in tabular form:

| \(x\) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| \(p_X(x)\) | 1/16 | 4/16 | 6/16 | 4/16 | 1/16 |

\[ E(X)=0\cdot\frac{1}{16}+1\cdot\frac{4}{16}+2\cdot\frac{6}{16}+3\cdot\frac{4}{16}+4\cdot\frac{1}{16}=2=np. \]

The interpretation is to consider a large number of independent Bernoulli trials, each with probability \(p\) of success. Writing 1 for “success” and 0 for “failure”, in the long run a proportion \(p\) of the trials are successes, so among \(n\) trials we expect about \(np\) successes.
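The three examples above can be confirmed numerically with the `expectation` sketch (redefined here so the snippet is self-contained; exact fractions avoid floating-point noise):

```python
from fractions import Fraction

def expectation(pmf):  # as in the sketch in Section 4.1
    return sum(x * p for x, p in pmf.items())

die = {k: Fraction(1, 6) for k in range(1, 7)}                  # fair 6-sided die
bern = {1: Fraction(3, 10), 0: Fraction(7, 10)}                 # Bern(p) with p = 3/10
bin4 = {k: Fraction([1, 4, 6, 4, 1][k], 16) for k in range(5)}  # Bin(4, 1/2)

print(expectation(die))   # 7/2, i.e., 3.5
print(expectation(bern))  # 3/10, i.e., p
print(expectation(bin4))  # 2, i.e., np with n = 4, p = 1/2
```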

Proposition 4.1.2.

If \(X\) and \(Y\) are discrete r.v.s with the same distribution, then \(E(X) = E(Y)\) (if either side exists).

The converse of the above proposition is false: the expected value is just a one-number summary, not nearly enough to specify the entire distribution. It measures where the “center” of the distribution is, but it does not determine, for example, how spread out the distribution is or how likely the r.v. is to be positive.

Figure 4.1: Two different distributions with identical mean.

4.2 Expectation of a function of an r.v.

For example, given an r.v. \(X\), a function of it, \[ Y=g(X), \] is also an r.v.

We can also take a function of two different r.v.s, such as \(Z=X+Y\).

How to obtain the expectation for a function of r.v.?

Can we simply use \[ E(Y)=g(E(X))? \] It depends on the function: this works when \(g\) is linear, but fails in general!

4.3 Properties of expectation - Linearity and Monotonicity

Theorem 4.2.1 (Linearity of expectation).

For any r.v.s \(X\), \(Y\) and any constant \(c\), \[ E(X+Y) = E(X)+E(Y) \]

\[ E(cX) = cE(X). \]

The expectation of a linear combination of r.v.s is the linear combination of their expectations, regardless of whether \(X\) and \(Y\) are independent!

Example 4.2.2 (Binomial expectation).

For \(X\sim\mbox{Bin}(n, p)\), let's find \(E(X)\). One way is to compute \(\sum_{k=0}^{n} k\binom{n}{k}p^k q^{n-k}\) directly from the definition; a much quicker way uses linearity.

Recall: Bernoulli and Binomial relationship

A Binomial r.v. can be written as a sum of \(n\) independent Bernoulli r.v.s, \[ X=X_1+\cdots+X_n, \] where \(X_j \sim \mbox{Bern}(p), \quad j=1,\ldots, n\). Hence \[ E(X)=E(X_1+\cdots+X_n)=E(X_1)+\cdots+E(X_n)=p+\cdots+p=np. \]
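A small simulation sketch (assuming NumPy is available; the parameters \(n=10\), \(p=0.3\) are arbitrary) illustrates the decomposition: simulate \(X\) as a sum of \(n\) Bernoulli trials and compare the sample mean to \(np\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 0.3
# Each row is one realization of X = X_1 + ... + X_n with X_j ~ Bern(p).
trials = rng.random((100_000, n)) < p
x = trials.sum(axis=1)
print(x.mean(), n * p)  # the sample mean should be close to np = 3.0
```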

Example 4.2.3 (Hypergeometric expectation).

Let \(X\sim\mbox{HGeom}(w,b,n)\), interpreted as the number of white balls in a sample of size \(n\) drawn without replacement from an urn with \(w\) white and \(b\) black balls. As in the Binomial case, we can write \(X\) as a sum of Bernoulli random variables, \[ X= I_1+\cdots+I_n, \] where \(I_j\) equals 1 if the \(j\)th ball in the sample is white and 0 otherwise.

  • \(I_j\sim\mbox{Bern}(p)\) with \(p=w/(w+b)\), since unconditionally the \(j\)th ball drawn is equally likely to be any of the balls.
  • Unlike in the Binomial case, the \(I_j\) are not independent, since the sampling is without replacement: given that a ball in the sample is white, there is a lower chance that another ball in the sample is white. However, linearity still holds for dependent random variables! Thus,

\[ E(X)=n\frac{w}{w+b}. \]
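A simulation sketch of sampling without replacement (again assuming NumPy; the values of \(w\), \(b\), \(n\) are arbitrary) shows the sample mean agreeing with \(nw/(w+b)\) even though the indicators are dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, n = 6, 4, 5                        # 6 white, 4 black, sample 5 without replacement
urn = np.array([1] * w + [0] * b)        # 1 = white ball, 0 = black ball
whites = [rng.permutation(urn)[:n].sum() for _ in range(50_000)]
print(np.mean(whites), n * w / (w + b))  # both close to 3.0
```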

Proposition 4.2.4 (Monotonicity of expectation).

Let \(X\) and \(Y\) be r.v.s such that \(X\geq Y\) with probability 1. Then \(E(X) \geq E(Y)\), with equality holding if and only if \(X=Y\) with probability 1.

Proof.

This result holds for all r.v.s, but we will prove it only for discrete r.v.s, since this chapter focuses on them. The r.v. \(Z=X-Y\) is nonnegative (with probability 1), so \(E(Z)\geq 0\), since \(E(Z)\) is defined as a sum of nonnegative terms. By linearity, \[ E(X)-E(Y)=E(X-Y) \geq 0, \] as desired.

If \(E(X)=E(Y)\), then by linearity we also have \(E(Z)=0\), which implies that \(P(X = Y) = P(Z = 0) = 1\) since if even one term in the sum defining \(E(Z)\) is positive, then the whole sum is positive.

4.4 Geometric distribution

Story 4.3.1 (Geometric distribution).

  • Consider a sequence of independent Bernoulli trials, each with the same success probability \(p\in (0, 1)\), with trials performed until a success occurs. Let \(X\) be the number of failures before the first successful trial. Then \(X\) has the Geometric distribution with parameter \(p\); we denote this by \(X\sim\mbox{Geom}(p)\).

  • For example, if we flip a fair coin until it lands Heads for the first time, then the number of Tails before the first occurrence of Heads is distributed as \(\mbox{Geom}(1/2)\).

To get the Geometric PMF from the story, imagine the Bernoulli trials as a string of 0’s (failures) ending in a single 1 (success). Each 0 has probability \(q=1-p\) and the final 1 has probability \(p\), so a string of \(k\) failures followed by one success has probability \(q^k p\).

Theorem 4.3.2 (Geometric PMF).

If \(X\sim\mbox{Geom}(p)\), then the PMF of \(X\) is \[ P(X=k)=q^k p \] for \(k=0,1,2,\ldots\), where \(q=1-p\).

  • This is a valid PMF because

\[ \sum_{k=0}^\infty q^k p=p\sum_{k=0}^\infty q^k=p\cdot \frac{1}{(1-q)}=1. \]

Figure 4.2: PMF and CDF of a geometric distribution.

There are differing conventions for the definition of the Geometric distribution; some sources define the Geometric as the total number of trials, including the success. In this book, the Geometric distribution excludes the success.

Example 4.3.5 (Geometric expectation).

\[ P (X=k)=q^k p,\quad k=0,1,2,\ldots \]

\[ E(X)=\sum_{k=0}^\infty kq^kp=0+qp+2q^2p+3q^3p+\cdots=p(0+q+2q^2+3q^3+\cdots) \]

The series in parentheses is no longer a geometric series, so we cannot sum it directly.

However, the term \(kq^{k-1}\) looks like the derivative of \(q^k\) with respect to \(q\).

We know that \[ \frac{1}{1-q}=\sum_{k=0}^\infty q^k. \] Taking the derivative of both sides with respect to \(q\), we have \[ \frac{1}{(1-q)^2}=\sum_{k=0}^\infty kq^{k-1}=\frac{\sum_{k=0}^\infty kq^{k}}{q}. \]

Finally we have \[ E(X)=\sum_{k=0}^\infty kq^kp=0+qp+2q^2p+3q^3p+\cdots=p(0+q+2q^2+3q^3+\cdots)=p\cdot \frac{q}{(1-q)^2}=\frac{q}{p}. \] Recall the odds of success, \[ \text{odds}=\frac{p}{q}, \] so the expected number of failures before the first success is the reciprocal of the odds.
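As a numerical sanity check (a sketch; truncating the infinite series at \(k=1000\) is arbitrary but more than enough for \(p=0.25\)), we can verify both that the PMF sums to 1 and that \(E(X)=q/p\):

```python
p = 0.25
q = 1 - p
total = sum(q**k * p for k in range(1000))     # PMF normalization: close to 1
mean = sum(k * q**k * p for k in range(1000))  # E(X): close to q/p
print(total, mean, q / p)                      # 1.0, 3.0, 3.0
```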

4.5 Memoryless Property of Geometric Distribution

Let \(X\) be a random variable that follows a Geometric distribution with success probability \(p\):

\[ P(X = k) = q^{k} p, \quad k = 0, 1, 2, 3, \dots \]

Then \(X\) has the memoryless property, which means

\[ P(X \geq m + n \mid X \geq m) = P(X \geq n) \]

for all nonnegative integers \(m, n\).

If \(X\) represents the number of failures before the first success, then given that you have already gone through \(m\) failures without a success, the probability that you will need at least \(n\) additional failures is the same as if you were starting over.

Proof (hint: use the CDF of the Geometric distribution). The survival probabilities are \(P(X \geq k)=P(\text{first } k \text{ trials are all failures})=q^k\), so \[ P(X \geq m+n \mid X \geq m)=\frac{P(X \geq m+n)}{P(X \geq m)}=\frac{q^{m+n}}{q^{m}}=q^{n}=P(X \geq n). \]
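A simulation sketch of the memoryless property (assuming NumPy; note that NumPy's geometric generator counts trials, so we subtract 1 to count failures): among runs with at least \(m\) failures, the number of additional failures has mean \(q/p\), as if starting fresh.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 0.3, 2
x = rng.geometric(p, size=200_000) - 1      # failures before the first success
survivors = x[x >= m]                       # condition on X >= m
print((survivors - m).mean(), (1 - p) / p)  # both close to q/p = 2.33
```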

4.6 Negative Binomial distribution

Story 4.3.7 (Negative Binomial distribution).

In a sequence of independent Bernoulli trials with success probability \(p\), if \(X\) is the number of failures before the \(r\)th success, then \(X\) is said to have the Negative Binomial distribution with parameters \(r\) and \(p\), denoted \(X\sim\mbox{NBin}(r, p)\).

  • Both the Binomial and the Negative Binomial distributions are based on independent Bernoulli trials; they differ in the stopping rule and in what they are counting: the Binomial counts the number of successes in a fixed number of trials, while the Negative Binomial counts the number of failures until a fixed number of successes.

Theorem 4.3.8 (Negative Binomial PMF).

If \(X\sim\mbox{NBin}(r,p)\), then the PMF of \(X\) is \[ P(X=n)={n+r-1\choose r-1}p^r q^n \] for \(n=0,1,2,\ldots\), where \(q=1-p\).

Theorem 4.3.9.

Let \(X\sim \mbox{NBin}(r, p)\), viewed as the number of failures before the \(r\)th success in a sequence of independent Bernoulli trials with success probability \(p\). Then we can write \(X = X_1+\cdots+X_r\), where the \(X_i\) are i.i.d. \(\mbox{Geom}(p)\).

  • \(X_1\) is the number of failures before the 1st success
  • \(X_2\) is the number of failures between the 1st and 2nd success
  • …
  • \(X_r\) is the number of failures between the \((r-1)\)th success and the \(r\)th success

Example 4.3.10 (Negative Binomial expectation).

By Theorem 4.3.9 and linearity of expectation, \[ E(X)=E(X_1)+\cdots+E(X_r)=r\cdot\frac{q}{p}. \]
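A simulation sketch of Theorem 4.3.9 (assuming NumPy; \(r\) and \(p\) are arbitrary): simulate \(X\) as a sum of \(r\) i.i.d. \(\mbox{Geom}(p)\) failure counts and compare the sample mean to \(r\cdot q/p\).

```python
import numpy as np

rng = np.random.default_rng(0)
r, p = 5, 0.4
# Failures before the r-th success = sum of r i.i.d. Geom(p) failure counts.
x = (rng.geometric(p, size=(100_000, r)) - 1).sum(axis=1)
print(x.mean(), r * (1 - p) / p)  # both close to 7.5
```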

4.7 Expectation of a nonlinear function of an r.v.

Suppose \(Y=g(X)\) is a function of an r.v. \(X\). If \(g\) is a linear function, we already know from linearity that \[ E(Y)=g(E(X)). \] However, if \(g\) is not linear, we do not have \(E(Y)=g(E(X))\) in general!

Example 4.3.14 (St. Petersburg paradox).

Suppose a wealthy stranger offers to play the following game with you. You will flip a fair coin until it lands Heads for the first time, and you will receive $2 if the game lasts for 1 round, $4 if the game lasts for 2 rounds, $8 if the game lasts for 3 rounds, and in general, $\(2^n\) if the game lasts for \(n\) rounds.

What is the expectation of the reward?

(Use the alternative parameterization of Geometric distribution)

In the generic definition of the Geometric distribution, the r.v. is the number of failures that we go through before the 1st success.

The Geometric distribution has an alternative formulation: we can define the r.v. \(X\) as the number of trials needed to obtain the 1st success. \[ P(X=n)=q^{n-1}p=(1/2)^n,\quad n=1,2,\ldots \] The value of the reward is a function of \(X\), namely \(2^X\). Denote the value of the reward by \(Y\); then \(Y=2^X\).

What is the expectation of \(Y\)? Is it simply \(2^{E(X)}\)?

Let's find \(E(X)\) first.

From the generic parameterization of the Geometric distribution, we know the expected number of failures before the first success is \(q/p=1\). Hence the expected number of trials (the failures plus the final success) is \(E(X)=1+1=2\).

Can we say \(E(Y)=2^2=4\)?

We can check this using the definition of expectation. \[ \begin{split} P(Y=2)&=P(X=1)=\frac{1}{2} \\ P(Y=4)&=P(X=2)=(1/2)^2=\frac{1}{4} \\ P(Y=8)&=P(X=3)=(1/2)^3=\frac{1}{8} \\ &\dots \end{split} \]

By the definition of expectation, \[ E(Y)=2\cdot\frac{1}{2}+4\cdot\frac{1}{4}+\cdots=1+1+\cdots=\infty. \] So \(E(Y)\) is infinite rather than \(2^{E(X)}=4\); this is the St. Petersburg paradox.
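A two-line sketch makes the divergence concrete: every term of the series contributes exactly 1, so the truncated expectation equals the truncation point and grows without bound.

```python
# Truncated expectation of the reward: sum_{n=1}^{N} 2^n * (1/2)^n = N.
for N in (10, 100, 1000):
    print(N, sum(2**n * (1 / 2)**n for n in range(1, N + 1)))  # grows without bound
```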

4.8 Law of the unconscious statistician (LOTUS)

For a general function \(g(x)\), which is not necessarily linear, the expectation of \(g(X)\) can be obtained by the so-called law of the unconscious statistician (LOTUS).

Theorem 4.5.1 (LOTUS).

If \(X\) is a discrete r.v. and \(g\) is a function from \(\mathbb{R}\) to \(\mathbb{R}\), then \[ E(g(X))=\sum_x g(x) P(X=x), \] where the sum is taken over all possible values of \(X\).

Suppose we have the PMF of an r.v. \(X\):

| \(X\) | \(x_1\) | \(x_2\) |
|---|---|---|
| \(p_X\) | \(P(X=x_1)\) | \(P(X=x_2)\) |

Then the PMF of \(Y=g(X)\) is given by \[ P(Y=g(x))=P(X=x)=p_X(x) \] (assuming the values \(g(x_1), g(x_2)\) are distinct):

| \(Y=g(X)\) | \(g(x_1)\) | \(g(x_2)\) |
|---|---|---|
| \(p_Y\) | \(P(X=x_1)\) | \(P(X=x_2)\) |

Of course, the expectation of \(Y\) is \[ E(Y)=E(g(X))=\sum_x g(x) P(X=x). \]
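A sketch of LOTUS on a small, arbitrarily chosen PMF, which also shows that \(E(g(X))\neq g(E(X))\) for a nonlinear \(g\):

```python
from fractions import Fraction

pmf = {-1: Fraction(1, 4), 0: Fraction(1, 4), 1: Fraction(1, 2)}  # an arbitrary PMF

def g(x):  # a nonlinear function
    return x ** 2

e_g = sum(g(x) * p for x, p in pmf.items())  # LOTUS: E(g(X)) = 3/4
g_e = g(sum(x * p for x, p in pmf.items()))  # g(E(X)) = (1/4)^2 = 1/16
print(e_g, g_e)                              # 3/4 vs 1/16: not equal
```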

4.9 Indicator r.v.s and the fundamental bridge

An indicator r.v. \(I_A\) (or \(I(A)\)) for an event \(A\) is defined to be 1 if \(A\) occurs and 0 otherwise, \[ I_A = \left\{ \begin{array}{rl} 1 & \mbox{if A occurs} \\ 0 & \mbox{if A does not occur.} \end{array}\right. \]

A Bernoulli r.v. is a typical indicator r.v.: it indicates the occurrence of a “success”.

Theorem 4.4.1 (Properties of Indicator r.v.)

Let \(A\) and \(B\) be events. Then the following properties hold.

  1. \((I_A)^k=I_A\) for any positive integer \(k\).
  2. \(I_{A^c}=1-I_A\).
  3. \(I_{A\cap B}=I_A I_B\).
  4. \(I_{A\cup B}=I_A+I_B-I_A I_B\).
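Since indicators take only the values 0 and 1, all four properties can be checked exhaustively; here is a brute-force sketch over a small, arbitrary sample space:

```python
omega = range(12)                     # a small sample space
A = {x for x in omega if x % 2 == 0}  # an arbitrary event
B = {x for x in omega if x < 5}       # another arbitrary event

def I(event, x):  # indicator of an event, evaluated at outcome x
    return 1 if x in event else 0

for x in omega:
    ia, ib = I(A, x), I(B, x)
    assert ia**3 == ia                       # property 1 (with k = 3)
    assert I(set(omega) - A, x) == 1 - ia    # property 2
    assert I(A & B, x) == ia * ib            # property 3
    assert I(A | B, x) == ia + ib - ia * ib  # property 4
print("all four identities hold on this sample space")
```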

Theorem 4.4.2 (Fundamental bridge between probability and expectation).

The expectation of the indicator r.v. \(I_A\) is the probability of the event \(A\), \[ E(I_A)=P(A). \]

For example, the expectation of a \(\mbox{Bern}(p)\) r.v. is \(p\).

Example 4.4.3 (Boole, Bonferroni, and inclusion-exclusion).

Let \(A_1,A_2,\ldots,A_n\) be events. Note that \[ I(A_1\cup\ldots\cup A_n)\leq I(A_1)+\ldots +I(A_n). \] Taking the expectation of both sides and using linearity and the fundamental bridge, we have \[ P(A_1\cup\ldots\cup A_n)\leq P(A_1)+\ldots +P(A_n), \] which is called Boole’s inequality or Bonferroni’s inequality.

Proof of Inclusion-exclusion theorem:

To prove inclusion-exclusion for \(n=2\), we can take the expectation of both sides in Property 4 of Theorem 4.4.1.

For general \(n\), we can use properties of indicator r.v.s as follows: \[ \begin{split} 1-I(A_1\cup\ldots \cup A_n) & = I(A_1^c\cap\ldots\cap A_n^c) \\ \\ & = (1-I(A_1))\ldots (1-I(A_n)) \\ \\ & = 1-\sum_i I(A_i)+\sum_{i<j} I(A_i)I(A_j)-\ldots \\ \\ & + (-1)^n I(A_1)\ldots I(A_n). \end{split} \] Taking the expectation of both sides, by the fundamental bridge we have inclusion-exclusion.

In more detail: we are trying to find \(P(A_1\cup \cdots\cup A_n)\).

By de Morgan’s Law, \[ (A_1\cup \cdots\cup A_n)^c=A_1^c\cap \cdots\cap A_n^c \] Thus \[ I((A_1\cup \cdots\cup A_n)^c)=I(A_1^c\cap \cdots\cap A_n^c) \]

\[ 1-I(A_1\cup \cdots\cup A_n)=I(A_1^c\cap \cdots\cap A_n^c)=(1-I(A_1))\cdots(1-I(A_n)) \]

Example 4.4.4 (Matching continued).

We have a well-shuffled deck of \(n\) cards, labeled 1 through \(n\). A card is a match if the card’s position in the deck matches the card’s label.

Let \(X\) be the number of matches. Define \(I_j\) as the indicator r.v. of the event that the \(j\)th card is a match, so \[ X=I_1+\cdots+I_n \] and \[ E(X)=E(I_1)+E(I_2)+\cdots+E(I_n). \] Now, let's find \(E(I_j)\).

Since \(I_j\) is an indicator r.v., \[ E(I_j)=P(I_j=1)=\frac{(n-1)!}{n!}=\frac{1}{n}. \] Thus, by symmetry and linearity, \[ E(X)=nE(I_j)=1. \]
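A simulation sketch of the matching problem (assuming NumPy; \(n=52\) is arbitrary): shuffle the deck many times, count fixed points, and check that the sample mean is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 52
# Position j holds a match when the shuffled deck has label j there (0-indexed).
matches = [(rng.permutation(n) == np.arange(n)).sum() for _ in range(50_000)]
print(np.mean(matches))  # close to 1, whatever n is
```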

Example 4.4.5 (Distinct birthdays, birthday matches).

In a group of \(n\) people, under the usual assumptions about birthdays, what is the expected number of distinct birthdays among the \(n\) people, i.e., the expected number of days on which at least one of the people was born? (Use the property of indicator r.v.)

Define the indicator r.v. \(I_j\), which indicates that at least one person was born on day \(j\), for \(j=1,2,\ldots, 365\).

Define the r.v. \(X\) as the number of days on which at least one person was born; then \[ X=I_1+\cdots+I_{365}. \] We have \[ E(X)=E(I_1)+\cdots+E(I_{365}). \]

Now, let's find \(E(I_j)\). \[ E(I_j)=P(I_j=1)=1-\left(\frac{364}{365}\right)^{n} \] Thus, \[ E(X)=365\left(1-\left(\frac{364}{365}\right)^{n}\right). \]
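A closing simulation sketch (assuming NumPy; \(n=50\) people is arbitrary) compares the sample mean of the number of distinct birthdays to the formula \(365(1-(364/365)^n)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Draw n birthdays uniformly from {1, ..., 365} and count the distinct days.
distinct = [len(set(rng.integers(1, 366, size=n))) for _ in range(20_000)]
print(np.mean(distinct), 365 * (1 - (364 / 365) ** n))  # both close to 46.7
```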