3.3. Expectation#

In this section, we’ll learn about a particular type of integral. Remember that an integral is defined on a measurable space and with respect to a measure; e.g., the measure space \((\Omega, \mathcal F, \mu)\). As we learned, when we have simple, bounded, non-negative, or integrable measurable functions \(f\), the integral is the operation that is denoted symbolically as \(\int f \,\text d \mu\).

When this measure is further a probability measure (Definition 2.35), we call this integral something special: we call it an expectation, or expected value. Let’s start with a definition.

Definition 3.17 (Expected value)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(X \in m\mathcal F : \Omega \rightarrow \mathbb R\) is a random variable. The expected value is defined as:

\[ \mathbb EX = \int X\,\text d \mathbb P.\]

Notationally, you might see this quantity written in several different ways in this book. When we are only talking about a single variable, we’ll usually just write \(\mathbb EX\) since it’s nice and simple. When the random variable we are taking the expected value of is a bit more complicated, we might add parentheses or square brackets, like this: \(\mathbb E[X]\) or \(\mathbb E(X)\).

In any previous courses you might have taken on probability or statistics, you might have seen more complicated notations for an expected value (things like subscripts, which get extraordinarily taxing because nearly every statistician/probabilist seems to have a different understanding of what goes in the subscript). For now, we’re going to do our best to just omit this cumbersome notation and keep it simple: when you see an expectation of a random variable, you can always just think of it as an integral with respect to the corresponding probability space for that random variable. When we learn about product spaces, we’ll recap again what it means when we talk about expectations involving multiple random variables.
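Since \(\mathbb EX\) is just an integral, it can be approximated numerically by averaging independent samples (a fact we will make precise much later, when we study laws of large numbers). Here is a minimal numerical sketch; the use of numpy and the choice of an exponential distribution are illustrative assumptions, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: X ~ Exponential(1), for which EX = 1.
# Averaging many independent draws of X approximates the integral
# EX = "integral of X with respect to P".
samples = rng.exponential(scale=1.0, size=1_000_000)
print(samples.mean())  # approximately 1.0
```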

Now, let’s get some lingo under our belt for expectations. Let’s start with the term “existence of an expectation”:

Definition 3.18 (Existence of expected value)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(X \in m\mathcal F : \Omega \rightarrow \mathbb R\) is a random variable. Let \(X^+(\omega) = X(\omega) \vee 0\), and let \(X^-(\omega) = (-X(\omega)) \vee 0\). We say that \(\mathbb EX\) exists and:

\[ \mathbb EX = \mathbb EX^+ - \mathbb EX^-\]

whenever at least one of \(\mathbb EX^+\) or \(\mathbb EX^-\) is \(< \infty\).

In words, what we are saying with this wording is just that the expectation of a random variable \(X\) is said to exist when at least one of its positive portion (\(X^+\)) or its negative portion (\(X^-\)) is \(\mathbb P\)-integrable (has a finite integral). Intuitively, the idea here is that, even if one of these is infinite, as long as the other is finite, we can still make sense of subtracting a finite number from infinity (it is still just infinity) or subtracting infinity from a finite number (it is just negative infinity).

Now that we have some of the basics under our belt, I’d recommend you check back to our study of \(\mu\)-integrable functions from Lemma 3.3, and just try to reconcile the nuances of that definition with the intuition you are learning here. One thing that you should take away is that, if \(X\) is \(\mathbb P\)-integrable, then its expected value exists. However, the converse does not necessarily hold: just because an expected value exists does not mean the corresponding random variable is \(\mathbb P\)-integrable. The condition for existence of an expectation only requires one of the positive or negative portions to be finite, not both!
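To see the positive and negative portions in action, here is a small numerical sketch (the distributions are illustrative assumptions): for a standard Cauchy random variable, both \(\mathbb EX^+\) and \(\mathbb EX^-\) are infinite, so the expectation does not exist, whereas for a standard normal both portions are finite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard Cauchy: both E[X^+] and E[X^-] are infinite, so EX does not
# exist; the sample means of the two portions never stabilize.
x = rng.standard_cauchy(size=1_000_000)
x_pos = np.maximum(x, 0.0)   # X^+ = X v 0
x_neg = np.maximum(-x, 0.0)  # X^- = (-X) v 0
print(x_pos.mean(), x_neg.mean())  # unstable; varies wildly across seeds

# Standard normal: both portions are P-integrable, each with mean
# 1/sqrt(2*pi) ~ 0.3989, so EX exists (and equals 0).
z = rng.standard_normal(size=1_000_000)
print(np.maximum(z, 0).mean(), np.maximum(-z, 0).mean())
```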

Throughout this section, we are basically going to restate, nearly ad nauseam, properties, theorems, lemmas, etc. directly from Section 3.1 and Section 3.2. This is because, as you can see, expectations are just integrals, so all of the properties of integrals that we learned about there will apply here, too. Further, since the measure in this case is a probability measure, we will be able to be even more specific about some aspects of the integral, and will be able to derive several bigger results here.

3.3.1. Expectation corollaries#

To start off, we need to introduce a brief piece of new notation. You are already familiar with this concept, but under a slightly different name:

Definition 3.19 (Relation holds almost surely)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space. Suppose that \(X, Y \in m\mathcal F\) are random variables. We say that a relation \(\rho : \mathbb R \times \mathbb R \rightarrow \{0, 1\}\) holds almost surely if:

\[ \mathbb P\left(\left\{\omega \in \Omega : \rho(X(\omega), Y(\omega)) = 1\right\}\right) = 1.\]

We write that \(\rho(X, Y)\) holds a.s.

You will notice that this statement is basically exactly the same as the statement that we made when it came to a relation holding almost everywhere back in Definition 3.3, except for the fact that the domain in this case is a probability space. Again, if a relation holds almost surely, it also holds almost everywhere, but not necessarily the reverse (for the reason we gave back when we first introduced almost sure statements in Definition 2.37).

Property 3.13 (Expectation basics)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and that \(X, Y \in m\mathcal F\) are random variables where either:

  1. \(X, Y \geq 0\) (they are non-negative), or

  2. \(\mathbb E|X|, \mathbb E|Y| < \infty\) (they are \(\mathbb P\)-integrable).

Then:

  1. \(\mathbb E[X + Y] = \mathbb E[X] + \mathbb E[Y]\),

  2. for any \(a, b \in \mathbb R\), \(\mathbb E[aX + b] = a\mathbb EX + b\), and

  3. If \(X \overset{a.s.}{\geq} Y\), then \(\mathbb E[X] \geq \mathbb E[Y]\).

The proofs of these statements are extremely easy: we’ve already done them! Remembering that expectations are just integrals, we can directly borrow our results from Section 3.1:

Proof. If \(X, Y \geq 0\) are non-negative:

  1. Direct application of Property 3.9.

  2. Direct application of Property 3.8, and Remark 3.1.

  3. Direct application of Corollary 3.1 for non-negative functions.

If \(X, Y\) are \(\mathbb P\)-integrable:

  1. Direct application of Property 3.12.

  2. Direct application of Property 3.11, and Remark 3.1.

  3. Direct application of Corollary 3.1 for \(\mathbb P\)-integrable functions.

That was pretty easy, right?
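If you’d like an empirical sanity check of Property 3.13, here is a minimal Monte Carlo sketch (the gamma and exponential distributions are just assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.gamma(shape=2.0, scale=1.0, size=500_000)  # EX = 2
y = rng.exponential(scale=3.0, size=500_000)       # EY = 3

# 1. Linearity: E[X + Y] = EX + EY (both ~ 5).
print(np.mean(x + y), x.mean() + y.mean())

# 2. Affine maps: E[aX + b] = a*EX + b (both ~ 3).
a, b = 2.0, -1.0
print(np.mean(a * x + b), a * x.mean() + b)

# 3. Monotonicity: X + Y >= Y everywhere, so E[X + Y] >= EY.
print(np.mean(x + y) >= np.mean(y))
```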

3.3.2. Norms and Convexity#

All of the things we learned about norms and convexity hold over to expectations, too:

Theorem 3.7 (Jensen’s inequality for random variables)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(X \in m\mathcal F\). Suppose further that \(\varphi: \mathbb R \rightarrow \mathbb R\) is convex. Then if \(\mathbb E|X|, \mathbb E\left|\varphi(X)\right| < \infty\):

\[ \varphi\left(\mathbb EX\right) \leq \mathbb E\left[\varphi(X)\right].\]

This is just a less generic restatement of what we described when we first saw Jensen’s inequality in Theorem 3.2. The fine point here is that we asserted that \(\mathbb E|X|, \mathbb E\left|\varphi(X)\right| < \infty\). Remember that \(\mathbb E\) just denotes an integral with respect to the probability measure \(\mathbb P\), so this just means that \(X\) and \(\varphi \circ X = \varphi(X)\) are \(\mathbb P\)-integrable!
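Here is a minimal Monte Carlo sketch of Jensen’s inequality, assuming the convex function \(\varphi(x) = e^x\) and \(X \sim N(0, 1)\) (both illustrative choices): \(\varphi(\mathbb EX) = e^0 = 1\), while \(\mathbb E[\varphi(X)] = e^{1/2} \approx 1.6487\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen with phi(x) = exp(x), which is convex, and X ~ N(0, 1).
x = rng.standard_normal(size=1_000_000)
print(np.exp(x.mean()))    # phi(EX) ~ 1.0
print(np.mean(np.exp(x)))  # E[phi(X)] ~ 1.6487, larger, as Jensen predicts
```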

There are two special cases that will come up again and again in this book, so we just want to draw your attention to them:

Corollary 3.6 (Corollaries of Jensen’s inequality)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(X \in m\mathcal F\). Then if \(\mathbb E|X| < \infty\):

  1. \(\left|\mathbb EX\right| \leq \mathbb E|X|\).

  2. and further if \(\mathbb E|X|^2 = \mathbb EX^2 < \infty\), then \((\mathbb EX)^2 \leq \mathbb EX^2\).

In the second item, notice that I got a little bit ambitious with my notation: to some readers, it might be unclear when I write \(\mathbb EX^2\) whether I mean \((\mathbb EX)^2\) or \(\mathbb E[X^2]\). When I don’t explicitly use brackets or parentheses, assume that I mean \(\mathbb E[X^2]\) when I write \(\mathbb EX^2\) (or any exponent, for that matter). This tends to be fairly standard in a lot of future work you will read, so I want to get you accustomed to as relaxed a notation as possible (to the extent that things are still clear and obvious).

Next, let’s investigate the concept of a norm:

Definition 3.20 (Norm of a random variable)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(X \in m\mathcal F\), and \(p \in [1, \infty]\). Then the \(p\)-norm of \(X\) is:

\[\begin{split} ||X||_p \triangleq \begin{cases} \left(\mathbb E|X|^p\right)^{\frac{1}{p}} & p \neq \infty \\ \inf\left\{M : \mathbb P(|X| > M) = 0\right\}& p = \infty \end{cases}\end{split}\]

which is just about exactly the definition you got accustomed to in Definition 3.12 but for random variables defined on a probability space.

In this case, we introduced a piece of new terminology: the infinity norm \(||X||_\infty\). Conceptually, all we are saying is that the infinity norm of a random variable is just the smallest number \(M\) such that \(|X|\) is almost surely \(\leq M\). This is because if we take the probability statement \(\mathbb P(|X| > M) = 0\), remember that since we have a probability space, this means that \(\mathbb P(|X| \leq M) = 1\).
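As a quick numerical sketch of Definition 3.20 (assuming, for illustration, \(X \sim \text{Uniform}(-2, 2)\), which is bounded with \(||X||_\infty = 2\)), the finite \(p\)-norms increase toward the infinity norm as \(p\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_norm(samples, p):
    """Monte Carlo estimate of ||X||_p = (E|X|^p)^(1/p) for finite p."""
    return np.mean(np.abs(samples) ** p) ** (1.0 / p)

# X ~ Uniform(-2, 2): ||X||_p = 2 / (p + 1)^(1/p), which increases to
# ||X||_inf = 2 as p grows.
x = rng.uniform(-2.0, 2.0, size=1_000_000)
for p in (1, 2, 4, 16, 64):
    print(p, p_norm(x, p))
print("inf", np.abs(x).max())  # sample proxy for the essential supremum
```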

Theorem 3.8 (Hölder’s inequality)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(X, Y \in m\mathcal F\), and \(p, q \in [1, \infty]\) are s.t. \(\frac{1}{p} + \frac{1}{q} = 1\). Then:

\[ \mathbb E|X Y| \leq ||X||_p ||Y||_q.\]

which is basically just Theorem 3.3.
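A quick Monte Carlo sanity check of Hölder’s inequality in the \(p = q = 2\) case (the Cauchy–Schwarz case), with two dependent random variables chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two dependent illustrative variables built from a shared normal Z.
z = rng.standard_normal(size=1_000_000)
x = z + 0.5 * rng.standard_normal(size=1_000_000)
y = z ** 2

lhs = np.mean(np.abs(x * y))                               # E|XY|
rhs = np.sqrt(np.mean(x ** 2)) * np.sqrt(np.mean(y ** 2))  # ||X||_2 ||Y||_2
print(lhs, rhs, lhs <= rhs)
```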

3.3.3. Subsets of the domain#

When we are considering subsets of the domain, we have a special notation in probability theory:

Definition 3.21 (Expectation over a subset of the domain)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(X \in m\mathcal F\) is a random variable. Then if \(F \in \mathcal F\), we write:

\[ \mathbb E[X; F] \triangleq \int_F X \,\text d \mathbb P.\]

The basic idea is that \(\mathbb E[X; F]\) is the expectation of \(X\), but limited to the subset of the domain given by the event (measurable set) \(F\). Stated another way, you can think of \(\mathbb E[X; F]\) as \(\mathbb E[X\mathbb 1_{\{F\}}]\); that is, we are thinking conceptually about functions of the form \(X(\omega) \mathbb 1_{\{F\}}(\omega)\) (the values that the function \(X\) takes for elements \(\omega\) of the domain \(\Omega\), but only looking at these values on \(F\)).

Further, when \(X\) is just an indicator random variable, we obtain another interesting and common piece of notation you will see a lot:

Definition 3.22 (Expectation of an indicator)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and \(F \in \mathcal F\). Then:

\[ \mathbb E\left[\mathbb 1_{\{F\}}\right] = \mathbb P(F) = \int_F 1\,\text d \mathbb P.\]

Remember that indicators were how we defined simple functions back in Definition 3.1. Therefore, if we have a measure \(\mathbb P\) and a measurable set \(F \in \mathcal F\), we know how to integrate its indicator as per Definition 3.2: the integral is just the measure of the set being indicated. Here, since the measure is a probability measure, the integral is just the probability of \(F\).
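Numerically, \(\mathbb E[X; F]\) is just a masked average, and \(\mathbb E[\mathbb 1_{\{F\}}] = \mathbb P(F)\) is just the fraction of samples landing in \(F\). A minimal sketch, assuming \(X \sim N(0, 1)\) and the event \(F = \{X > 1\}\) for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(size=1_000_000)
indicator = (x > 1.0).astype(float)  # 1_F(omega) for F = {X > 1}

print(np.mean(x * indicator))  # Monte Carlo estimate of E[X; F]
print(np.mean(indicator))      # E[1_F] = P(F) = P(X > 1) ~ 0.1587
```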

Next up, we can see how we can use this notation to derive a neat result known as Markov’s inequality:

Theorem 3.9 (Markov’s inequality)

Suppose that:

  1. \((\Omega, \mathcal F, \mathbb P)\) is a probability space,

  2. \(X \in m\mathcal F\) is a random variable,

  3. \(\varphi \in m\mathcal R: \mathbb R \rightarrow \mathbb R\) is s.t. \(\varphi \geq 0\) (\(\varphi\) is a non-negative function), and

  4. \(B \in \mathcal R\) is a measurable set of the codomain \((\mathbb R, \mathcal R)\).

Define \(i_B \triangleq \inf\{\varphi(b) : b \in B\}\). Then:

\[i_B \mathbb P(X \in B) \leq \mathbb E[\varphi(X); X \in B] \leq \mathbb E\varphi(X).\]

There’s quite a bit going on in that statement, so let’s break it down. In this statement, remember that \(B \in \mathcal R\) just means that \(B\) is an element of the \(\sigma\)-algebra of the codomain for \(X\) (remember that \(X \in m\mathcal F\) is short-hand for \(X \in m(\mathcal F, \mathcal R)\)). Here, \(i_B\) is the infimum of the values that \(\varphi\) takes on elements of \(B\). Let’s try our hands at proving this statement.

Proof. Note that since \(\varphi \geq 0\), the definition of \(i_B\) as an infimum ensures that \(i_B \leq \varphi(X(\omega))\), for any \(\omega\) s.t. \(X(\omega) \in B\). Then:

\[ i_B \mathbb 1_{\{\omega \in \Omega : X(\omega) \in B\}}(\omega) \leq \varphi(X(\omega))\mathbb 1_{\{\omega \in \Omega : X(\omega) \in B\}}(\omega).\]

Further, notice that \(\mathbb 1_{\{\cdot\}} \leq 1\), so using the shorthand \(X \in B \equiv \{\omega \in \Omega : X(\omega) \in B\}\):

\[\varphi(X(\omega))\mathbb 1_{X \in B}(\omega) \leq \varphi(X(\omega)),\]

which is because \(\varphi \geq 0\), so \(\varphi \circ X \geq 0\).

Then with Property 3.13(3), taking the expectation preserves the inequalities, and by Property 3.13(2) rescaling by \(i_B\) is preserved, and by Lemma 2.5 \(\varphi \circ X \in m\mathcal F\), so:

\[\begin{split} \mathbb E\left[i_B \mathbb 1_{X \in B}\right] \leq \mathbb E[\varphi(X)\mathbb 1_{X \in B}] \leq \mathbb E\left[\varphi(X)\right] \\ \Rightarrow i_B \mathbb P(X \in B) \leq \mathbb E[\varphi(X); X \in B] \leq \mathbb E\varphi(X).\end{split}\]

Pretty easy, right? Next up, we’ll see a corollary of Markov’s inequality, called Chebyshev’s inequality:

Corollary 3.7 (Chebyshev’s inequality)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space and \(X \in m\mathcal F\) is a random variable. For \(b > 0\), denote \(B = \{x : |x| \geq b\}\). Then:

\[ b^2 \mathbb P(|X| \geq b) \leq \mathbb EX^2.\]

What we are saying here is that we can find a direct relationship between the probability of the portion of the domain where \(|X| \geq b\); i.e., the measurable set \(\{\omega \in \Omega : |X(\omega)| \geq b\}\), and \(\mathbb EX^2\), which is quite a powerful result if you think about it. Even if you don’t think about it now, you will in a few more chapters!

Proof. Define \(\varphi(x) \triangleq x^2\), which is non-negative.

With \(B = \{x : |x| \geq b\}\), note that \(i_B = \inf\{x^2 : |x| \geq b\} = b^2\).

Applying Theorem 3.9 then gives \(b^2 \mathbb P(|X| \geq b) = i_B \mathbb P(X \in B) \leq \mathbb E\varphi(X) = \mathbb EX^2\), the desired result.

In practice, you might see Markov’s inequality called Chebyshev’s inequality, but we’ll stick to the nomenclature we described so that we can be explicit.
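Before moving on, here is a minimal Monte Carlo sanity check of Chebyshev’s inequality (and hence of the Markov inequality it follows from), assuming an Exponential(1) variable for illustration, for which \(\mathbb EX^2 = 2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Chebyshev: b^2 * P(|X| >= b) <= E[X^2], checked for several b.
x = rng.exponential(scale=1.0, size=1_000_000)
ex2 = np.mean(x ** 2)  # ~ 2 for Exponential(1)
for b in (1.0, 2.0, 4.0):
    lhs = b ** 2 * np.mean(np.abs(x) >= b)
    print(b, lhs, "<=", ex2, lhs <= ex2)
```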

Another powerful corollary of Markov’s inequality is that if the expectation of the absolute value of a random variable is finite, then the random variable is finite almost surely:

Lemma 3.9 (Bounded absolute expectation)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(X \in m\mathcal F\) is a random variable where \(\mathbb E|X| = M < \infty\). Then \(\mathbb P(|X| = \infty) = 0\).

Notice that here, we slightly abused the fact that if an event (here, finiteness) happens almost surely, it also occurs almost everywhere (the space on which finiteness does not hold has probability \(0\)).

Proof. Let \(\varphi(x) = |x|\), which is non-negative, and for \(b > 0\) let \(B = \{x : |x| \geq b\}\), so that \(i_B = \inf\{|x| : |x| \geq b\} = b\).

Then by Theorem 3.9:

\[\begin{split} \mathbb P(X \in B) \equiv \mathbb P(|X| \geq b) &\leq \frac{\mathbb E|X|}{b} = \frac{M}{b} \\ &\xrightarrow[b \rightarrow \infty]{} 0.\end{split}\]

Since \(\{|X| = \infty\} \subseteq \{|X| \geq b\}\) for every \(b > 0\), it follows that \(\mathbb P(|X| = \infty) \leq \frac{M}{b}\) for every \(b > 0\), and hence \(\mathbb P(|X| = \infty) = 0\).

3.3.4. Convergence concepts#

Just like we were able to directly extend results on the basics of integration and the results on norms and convexity, we can do the same thing with convergence concepts. Let’s see how this works:

Lemma 3.10 (Fatou)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. If \(X_n \overset{a.s.}{\geq} 0\) for all \(n \in \mathbb N\), then:

\[ \liminf_{n \rightarrow \infty}\mathbb EX_n \geq \mathbb E\left[\liminf_{n \rightarrow \infty}X_n\right].\]

Proof. Direct application of Lemma 3.8.

3.3.4.1. Convergence almost surely and convergence in probability#

To understand the monotone convergence theorem and some successive results, we’ll first make the term almost sure convergence unambiguously clear; it is basically Definition 3.14, specialized to a probability space:

Definition 3.23 (Convergence almost surely)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. We say that \(X_n \xrightarrow[n \rightarrow \infty]{a.s.} X\) (the sequence converges almost surely) if:

\[ \mathbb P\left(\left\{\omega \in \Omega : \lim_{n \rightarrow \infty}X_n(\omega) = X(\omega)\right\}\right) \equiv \mathbb P\left(\lim_{n \rightarrow \infty}X_n = X\right) = 1.\]

So, the idea here is that a sequence of random variables converges almost surely if the set of points \(\omega\) for which \(X_n(\omega)\) has the limit \(X(\omega)\) has a probability of \(1\). We can also use the fact that the probability of the complement of a set is \(1 -\) the probability of the set to deduce that, equivalently, \(\mathbb P\left(\lim_{n \rightarrow \infty}X_n \neq X\right) = 0\).

We can also understand this definition using the concept of the \(\limsup\) of a set, in an \(\epsilon\) sort of way like you are used to in real analysis:

Definition 3.24 (Equivalent definition for convergence almost surely)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. We say that \(X_n \xrightarrow[n \rightarrow \infty]{a.s.} X\) if for every \(\epsilon > 0\):

\[ \mathbb P\left(\limsup_{n \rightarrow \infty}\left\{\omega \in \Omega : |X_n(\omega) - X(\omega)| > \epsilon\right\}\right) = 0.\]

The interpretation of this quantity is the same as the one you saw in Definition 3.14.

We have another term for the special case of convergence in measure when we are dealing with a probability space, too:

Definition 3.25 (Convergence in probability)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. We say that \(X_n \xrightarrow[n \rightarrow \infty]{\mathcal P} X\) (the sequence converges in probability) if for every \(\epsilon > 0\):

\[ \lim_{n \rightarrow \infty}\mathbb P\left(\left\{\omega \in \Omega : |X_n(\omega) - X(\omega)| \leq \epsilon\right\}\right) \equiv \lim_{n \rightarrow \infty}\mathbb P\left(|X_n - X| \leq \epsilon\right) = 1.\]

So, the idea here is that a sequence of random variables converges in probability if, for every \(\epsilon > 0\), the probability of the set of points \(\omega\) where \(X_n(\omega)\) is \(\epsilon\)-close to \(X(\omega)\) converges to \(1\). We can also use the fact that the probability of the complement of a set is \(1 -\) the probability of the set to deduce that, equivalently, \(\lim_{n \rightarrow \infty}\mathbb P\left(|X_n - X| > \epsilon\right) = 0\).
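Here is a minimal numerical sketch of convergence in probability, assuming the illustrative sequence \(X_n = X + Z_n/\sqrt n\) with standard normal noise \(Z_n\): for a fixed \(\epsilon\), the probability of being \(\epsilon\)-close to \(X\) climbs to \(1\) as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

eps = 0.1
x = rng.uniform(size=1_000_000)  # the limiting random variable X
for n in (1, 10, 100, 10_000):
    x_n = x + rng.standard_normal(1_000_000) / np.sqrt(n)
    # Monte Carlo estimate of P(|X_n - X| <= eps); approaches 1.
    print(n, np.mean(np.abs(x_n - x) <= eps))
```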

Further, we have the same relationship as we did with convergence almost everywhere and convergence in measure: convergence almost surely implies convergence in probability:

Lemma 3.11 (Convergence almost surely implies convergence in probability)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. Then if \(X_n \xrightarrow[n \rightarrow \infty]{a.s.} X\), \(X_n \xrightarrow[n\rightarrow\infty]{\mathcal P} X\).

Proof. Direct application of Lemma 3.7, noting that a probability space is a finite measure space, and using the equivalent forms of the two definitions: convergence almost surely is \(\mathbb P\left(\lim_{n \rightarrow \infty}X_n \neq X\right) = 0\), and convergence in probability is \(\lim_{n \rightarrow \infty}\mathbb P\left(|X_n - X| > \epsilon\right) = 0\) for every \(\epsilon > 0\).

3.3.4.2. The rest of the convergence concepts#

Finally, we’ll just rattle off the convergence concepts from the last section for posterity:

Theorem 3.10 (Monotone Convergence)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable, where \(0 \leq X_n \uparrow X\) a.s. Then:

\[\begin{split} \mathbb E X_n &\uparrow \mathbb EX\text{ as }n \rightarrow \infty \\ \Rightarrow \lim_{n \rightarrow \infty}\mathbb E X_n &= \mathbb E\left[\lim_{n \rightarrow \infty}X_n\right] = \mathbb EX\text{ from below.}\end{split}\]

Proof. Direct application of Theorem 3.5.

Theorem 3.11 (Dominated Convergence)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, \(\{X_n\}_{n \in \mathbb N} \subseteq m\mathcal F\) is a sequence of random variables, and \(X \in m\mathcal F\) is a random variable. If:

  1. \(X_n \xrightarrow[n \rightarrow \infty]{a.s.} X\),

  2. There exists \(Y \in m\mathcal F\) s.t. \(|X_n| \leq Y\) for all \(n \in \mathbb N\), and

  3. \(\mathbb E|Y| < \infty\),

Then:

\[\begin{split} \mathbb EX_n &\xrightarrow[n \rightarrow \infty]{} \mathbb EX \\ \Rightarrow \lim_{n \rightarrow \infty}\mathbb EX_n &= \mathbb E\left[\lim_{n \rightarrow \infty}X_n\right] = \mathbb EX.\end{split}\]

If \(Y\) is a constant function, i.e., \(Y(\omega) = r\) for some \(r \in \mathbb R\), then we call Theorem 3.11 the bounded convergence theorem.
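As a numerical sketch of the bounded convergence theorem, assume (for illustration) \(X \sim \text{Uniform}(0, 1)\) and \(X_n = \min(X + 1/n, 1)\); then \(X_n \rightarrow X\) pointwise, every \(X_n\) is dominated by the constant \(Y = 1\), and \(\mathbb EX_n \rightarrow \mathbb EX = 1/2\):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(size=1_000_000)  # X ~ Uniform(0, 1), EX = 1/2
for n in (1, 10, 100, 10_000):
    x_n = np.minimum(x + 1.0 / n, 1.0)  # dominated by Y = 1
    print(n, x_n.mean())                # decreases toward 0.5
print("EX =", x.mean())
```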

3.3.5. Change of variables#

Obviously, the change of variables formula applies here, too:

Lemma 3.12 (Change of variables)

Suppose that \((\Omega, \mathcal F, \mathbb P)\) is a probability space, and that \((S, \Sigma)\) and \((\mathbb R, \mathcal R)\) are measurable spaces, where \(X \in m(\mathcal F, \Sigma)\) is a random variable and \(f \in m(\Sigma, \mathcal R)\) is a measurable function, and either:

  1. \(f \geq 0\), or

  2. \(\mathbb E|f(X)|< \infty\).

Then with \(X_*\mathbb P = \mathbb P \circ X^{-1}\) the pushforward measure of \(\mathbb P\), and \(X(\Omega) \triangleq \{X(\omega) : \omega \in \Omega\}\):

\[ \mathbb E f(X) \triangleq \int_\Omega f(X)\,\text d \mathbb P = \int_{S} f\,\text d X_*\mathbb P.\]

Proof. Direct application of the change of variables property for integrals.
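To close, a numerical sketch of the change of variables formula, assuming \(X \sim N(0, 1)\) and \(f(s) = s^2\) for illustration: \(\mathbb E f(X)\) can be estimated on the domain side (averaging \(f(X(\omega))\) over draws of \(X\)), or computed on the codomain side, by integrating \(f\) against the pushforward measure (here, the standard normal density):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s):
    return s ** 2

# Domain side: average f(X(omega)) over simulated draws of X.
x = rng.standard_normal(size=1_000_000)
print(np.mean(f(x)))  # ~ 1.0 = E[X^2]

# Codomain side: integrate f against the pushforward X_* P, which for
# X ~ N(0, 1) has the standard normal density (simple Riemann sum).
s = np.linspace(-10.0, 10.0, 200_001)
density = np.exp(-s ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
ds = s[1] - s[0]
print(np.sum(f(s) * density) * ds)  # ~ 1.0, matching the domain side
```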