Martingales

Stochastic Calculus
Author

Quasar

Published

July 12, 2024

Martingales.

Elementary conditional expectation.

In elementary probability, the conditional expectation of a variable \(Y\) given another random variable \(X\) refers to the expectation of \(Y\) given the conditional distribution \(f_{Y|X}(y|x)\) of \(Y\) given \(X\). To illustrate this, let’s go through a simple example. Consider \(\mathcal{B}_{1}\), \(\mathcal{B}_{2}\) to be two independent Bernoulli-distributed random variables with \(p=1/2\). Then, construct:

\[\begin{aligned} X=\mathcal{B}_{1}, & \quad Y=\mathcal{B}_{1}+\mathcal{B}_{2} \end{aligned}\]

It is easy to compute \(\mathbb{E}[Y|X=0]\) and \(\mathbb{E}[Y|X=1]\). By definition, they are given by:

\[\begin{aligned} \mathbb{E}[Y|X=0] & =\sum_{j=0}^{2}j\mathbb{P}(Y=j|X=0)\\ & =\sum_{j=0}^{2}j\cdot\frac{\mathbb{P}(Y=j,X=0)}{P(X=0)}\\ & =0+1\cdot\frac{(1/4)}{(1/2)}+2\cdot\frac{0}{(1/2)}\\ & =\frac{1}{2} \end{aligned}\]

and

\[\begin{aligned} \mathbb{E}[Y|X=1] & =\sum_{j=0}^{2}j\mathbb{P}(Y=j|X=1)\\ & =\sum_{j=0}^{2}j\cdot\frac{\mathbb{P}(Y=j,X=1)}{P(X=1)}\\ & =0+1\cdot\frac{(1/4)}{(1/2)}+2\cdot\frac{(1/4)}{(1/2)}\\ & =\frac{3}{2} \end{aligned}\]

With this point of view, the conditional expectation is computed given the information that the event \(\{X=0\}\) occurred or the event \(\{X=1\}\) occurred. It is possible to regroup both conditional expectations in a single object, if we think of the conditional expectation as a random variable and denote it by \(\mathbb{E}[Y|X]\). Namely, we take:

\[\begin{aligned} \mathbb{E}[Y|X](\omega) & =\begin{cases} \frac{1}{2} & \text{if }X(\omega)=0\\ \frac{3}{2} & \text{if }X(\omega)=1 \end{cases}\label{eq:elementary-conditional-expectation-example} \end{aligned}\]

This random variable is called the conditional expectation of \(Y\) given \(X\). We make two important observations:

(i) If the value of \(X\) is known, then the value of \(\mathbb{E}[Y|X]\) is determined.

(ii) If we have another random variable \(g(X)\) constructed from \(X\), then we have:

\[\begin{aligned} \mathbb{E}[g(X)Y] & =\mathbb{E}[g(X)\mathbb{E}[Y|X]] \end{aligned}\]

In other words, as far as \(X\) is concerned, the conditional expectation \(\mathbb{E}[Y|X]\) is a proxy for \(Y\) in the expectation. We sometimes say that \(\mathbb{E}[Y|X]\) is the best estimate of \(Y\) given the information of \(X\).

The last observation is easy to verify since:

\[\begin{aligned} \mathbb{E}[g(X)Y] & =\sum_{i=0}^{1}\sum_{j=0}^{2}g(i)\cdot j\cdot\mathbb{P}(X=i,Y=j)\\ & =\sum_{i=0}^{1}\mathbb{P}(X=i)g(i)\left\{ \sum_{j=0}^{2}j\cdot\frac{\mathbb{P}(X=i,Y=j)}{\mathbb{P}(X=i)}\right\} \\ & =\mathbb{E}[g(X)\mathbb{E}[Y|X]] \end{aligned}\]
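To make the two observations concrete, here is a minimal numerical sketch (assuming NumPy is available; the sample size and the test function \(g(x)=2x-1\) are arbitrary choices of ours) that estimates \(\mathbb{E}[Y|X=0]\), \(\mathbb{E}[Y|X=1]\) and checks the proxy property \(\mathbb{E}[g(X)Y]=\mathbb{E}[g(X)\mathbb{E}[Y|X]]\) by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Two independent Bernoulli(1/2) variables; X = B1 and Y = B1 + B2.
B1 = rng.integers(0, 2, n)
B2 = rng.integers(0, 2, n)
X, Y = B1, B1 + B2

# Conditional expectations as sample averages over the events {X=0}, {X=1}.
print(Y[X == 0].mean())   # ~ 0.5
print(Y[X == 1].mean())   # ~ 1.5

# Proxy property E[g(X)Y] = E[g(X)E[Y|X]] for the test function g(x) = 2x - 1.
g = 2 * X - 1
E_Y_given_X = np.where(X == 0, 0.5, 1.5)
print((g * Y).mean(), (g * E_Y_given_X).mean())   # both ~ 0.5
```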

(Elementary Definitions of Conditional Expectation).

(1) \((X,Y)\) discrete. The treatment is similar to the above. If a random variable \(X\) takes values \((x_{i},i\geq1)\) and \(Y\) takes values \((y_{j},j\geq1)\), we have by definition that the conditional expectation as a random variable is:

\[\begin{aligned} \mathbb{E}[Y|X](\omega) & =\sum_{j\geq1}y_{j}\mathbb{P}(Y=y_{j}|X=x_{i})\quad\text{for }\omega\text{ such that }X(\omega)=x_{i} \end{aligned}\]

(2) \((X,Y)\) continuous with joint PDF \(f_{X,Y}(x,y)\): In this case, the conditional expectation is the random variable given by

\[\begin{aligned} \mathbb{E}[Y|X] & =h(X) \end{aligned}\]

where

\[\begin{aligned} h(x) & =\int_{\mathbf{R}}yf_{Y|X}(y|x)dy=\int_{\mathbf{R}}y\frac{f_{X,Y}(x,y)}{f_{X}(x)}dy=\frac{\int_{\mathbf{R}}yf_{X,Y}(x,y)dy}{\int_{\mathbf{R}}f_{X,Y}(x,y)dy} \end{aligned}\]

In the two examples above, the expectation of the random variable \(\mathbb{E}[Y|X]\) is equal to \(\mathbb{E}[Y]\). Indeed in the discrete case, we have:

\[\begin{aligned} \mathbb{E}[\mathbb{E}[Y|X]] & =\sum_{i\geq1}\mathbb{P}(X=x_{i})\cdot\sum_{j\geq1}y_{j}\mathbb{P}(Y=y_{j}|X=x_{i})\\ & =\sum_{i\geq1}\sum_{j\geq1}y_{j}\mathbb{P}(Y=y_{j},X=x_{i})\\ & =\sum_{j\geq1}y_{j}\mathbb{P}(Y=y_{j})\\ & =\mathbb{E}[Y] \end{aligned}\]

(Conditional Probability vs Conditional expectation). The conditional probability of the event \(A\) given \(B\) can be recast in terms of conditional expectation using indicator functions. If \(0<\mathbb{P}(B)<1\), it is not hard to check that: \(\mathbb{P}(A|B)=\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}=1]\) and \(\mathbb{P}(A|B^{C})=\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}=0]\). Indeed the random variables \(\mathbf{1}_{A}\) and \(\mathbf{1}_{B}\) are discrete. If we proceed as in the discrete case above, we have:

\[\begin{aligned} \mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}=1] & =1\cdot\mathbb{P}(\mathbf{1}_{A}=1|\mathbf{1}_{B}=1)\\ & =\frac{\mathbb{P}(\mathbf{1}_{A}=1,\mathbf{1}_{B}=1)}{\mathbb{P}(\mathbf{1}_{B}=1)}\\ & =\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}\\ & =\mathbb{P}(A|B) \end{aligned}\]

A similar calculation gives \(\mathbb{P}(A|B^{C})\). In particular, the formula for total probability for \(A\) is a rewriting of the expectation of the random variable \(\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}]\):

\[\begin{aligned} \mathbb{E}[\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}]] & =\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}=1]\mathbb{P}(\mathbf{1}_{B}=1)+\mathbb{E}[\mathbf{1}_{A}|\mathbf{1}_{B}=0]\mathbb{P}(\mathbf{1}_{B}=0)\\ & =\mathbb{P}(A|B)\cdot\mathbb{P}(B)+\mathbb{P}(A|B^{C})\cdot\mathbb{P}(B^{C})\\ & =\mathbb{P}(A) \end{aligned}\]
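As a quick sanity check of this identity, the following sketch (with an arbitrary illustrative choice of events \(A=\{U<0.3\}\) and \(B=\{U<0.6\}\) driven by a uniform variable \(U\); NumPy assumed) recovers \(\mathbb{P}(A)\) from the two conditional expectations of \(\mathbf{1}_{A}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Illustrative events on a sample space driven by a uniform variable U.
U = rng.random(n)
ind_A = (U < 0.3).astype(float)   # indicator of A
ind_B = (U < 0.6).astype(float)   # indicator of B

# E[1_A | 1_B = 1] ~ P(A|B) and E[1_A | 1_B = 0] ~ P(A|B^c).
p_A_given_B = ind_A[ind_B == 1].mean()
p_A_given_Bc = ind_A[ind_B == 0].mean()

# Total probability: P(A|B)P(B) + P(A|B^c)P(B^c) ~ P(A).
p_B = ind_B.mean()
print(p_A_given_B * p_B + p_A_given_Bc * (1 - p_B), ind_A.mean())  # both ~ 0.3
```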

Conditional Expectation as a projection.

Conditioning on one variable.

We start by giving the definition of conditional expectation given a single random variable. This relates to the two observations (i) and (ii) made previously. We assume that the random variables are integrable so that the expectations are well-defined.

Let \(X\) and \(Y\) be integrable random variables on \((\Omega,\mathcal{F},\mathbb{P})\). The conditional expectation of \(Y\) given \(X\) is the random variable denoted by \(\mathbb{E}[Y|X]\) with the following two properties:

(A) There exists a function \(h:\mathbf{R}\to\mathbf{R}\) such that \(\mathbb{E}[Y|X]=h(X)\).

(B) For any bounded random variable of the form \(g(X)\) for some function \(g\),

\[\mathbb{E}[g(X)Y]=\mathbb{E}[g(X)\mathbb{E}[Y|X]]\label{eq:definition-conditional-expectation}\]

We can interpret the second property as follows. The conditional expectation \(\mathbb{E}[Y|X]\) serves as a proxy for \(Y\) as far as \(X\) is concerned. Note that in equation ([eq:definition-conditional-expectation]), the expectation on the left can be seen as an average over the joint values of \((X,Y)\), whereas the one on the right is an average over the values of \(X\) only! Another way to see this property is to write it as:

\[\mathbb{E}[g(X)(Y-\mathbb{E}[Y|X])]=0\]

In other words, the random variable \(Y-\mathbb{E}[Y|X]\) is orthogonal to any random variable constructed from \(X\).

Finally, it is important to notice that if we take \(g(X)=1\), then the second property implies :

\[\begin{aligned} \mathbb{E}[Y] & =\mathbb{E}[\mathbb{E}[Y|X]] \end{aligned}\]

In other words, the expectation of the conditional expectation of \(Y\) is simply the expectation of \(Y\).

The existence of the conditional expectation \(\mathbb{E}[Y|X]\) is not obvious. We know it exists in the particular cases given in example ([ex:elementary-definitions-of-conditional-expectation]). We will show more generally that it exists and is unique whenever \(Y\) is in \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\) (in fact, it can be shown to exist whenever \(Y\) is integrable). Before doing so, let’s warm up by looking at the case of Gaussian vectors.

(Conditional expectation of Gaussian vectors - I). Let \((X,Y)\) be a Gaussian vector of mean \(0\). Then:

\[\mathbb{E}[Y|X]=\frac{\mathbb{E}[XY]}{\mathbb{E}[X^{2}]}X\label{eq:conditional-expectation-of-gaussian-vector}\]

This candidate satisfies the two defining properties of conditional expectation: (A) It is clearly a function of \(X\); in fact it is a simple multiple of \(X\). (B) The random variable \(\left(Y-\frac{\mathbb{E}[XY]}{\mathbb{E}[X^{2}]}X\right)\) is orthogonal to \(X\) and thus independent of \(X\). This is a consequence of proposition ([prop:diagonal-cov-matrix-implies-independence-of-gaussians]), since:

\[\begin{aligned} \mathbb{E}\left[X\left(Y-\frac{\mathbb{E}[XY]}{\mathbb{E}[X^{2}]}X\right)\right] & =\mathbb{E}XY-\frac{\mathbb{E}[XY]}{\mathbb{E}[X^{2}]}\mathbb{E}X^{2}\\ & =\mathbb{E}XY-\frac{\mathbb{E}[XY]}{\cancel{\mathbb{E}[X^{2}]}}\cancel{\mathbb{E}X^{2}}\\ & =0 \end{aligned}\]

Therefore, we have for any bounded function \(g(X)\) of \(X\):

\[\begin{aligned} \mathbb{E}[g(X)(Y-\mathbb{E}(Y|X))] & =\mathbb{E}[g(X)]\mathbb{E}[Y-\mathbb{E}[Y|X]]=0 \end{aligned}\]

(Brownian conditioning-I) Let \((B_{t},t\geq0)\) be a standard Brownian motion. Consider the Gaussian vector \((B_{1/2},B_{1})\). Its covariance matrix is:

\[\begin{aligned} C & =\left[\begin{array}{cc} 1/2 & 1/2\\ 1/2 & 1 \end{array}\right] \end{aligned}\]

Let’s compute \(\mathbb{E}[B_{1}|B_{1/2}]\) and \(\mathbb{E}[B_{1/2}|B_{1}]\). This is easy using the equation ([eq:conditional-expectation-of-gaussian-vector]). We have:

\[\begin{aligned} \mathbb{E}[B_{1}|B_{1/2}] & =\frac{\mathbb{E}[B_{1}B_{1/2}]}{\mathbb{E}[B_{1/2}^{2}]}B_{1/2}\\ & =\frac{(1/2)}{(1/2)}B_{1/2}\\ & =B_{1/2} \end{aligned}\]

In other words, the best approximation of \(B_{1}\) given the information of \(B_{1/2}\) is \(B_{1/2}\). There is no problem in computing \(\mathbb{E}[B_{1/2}|B_{1}]\), even though we are conditioning on a future position. Indeed the same formula gives

\[\begin{aligned} \mathbb{E}[B_{1/2}|B_{1}] & =\frac{\mathbb{E}[B_{1}B_{1/2}]}{\mathbb{E}[B_{1}^{2}]}B_{1}=\frac{1}{2}B_{1} \end{aligned}\]

This means that the best approximation of \(B_{1/2}\) given the position at time \(1\), is \(\frac{1}{2}B_{1}\) which makes a whole lot of sense!
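These two formulas are easy to check by simulation. The sketch below (assuming NumPy; the slab width \(0.05\) and the conditioning value \(1.0\) are arbitrary choices) approximates each conditional expectation by averaging over samples whose conditioning variable falls in a thin slab:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6

# Sample (B_{1/2}, B_1) from independent Gaussian increments.
B_half = rng.normal(0.0, np.sqrt(0.5), n)
B_one = B_half + rng.normal(0.0, np.sqrt(0.5), n)

# E[B_{1/2} | B_1 ~ 1.0] should be ~ (1/2) * 1.0.
mask = np.abs(B_one - 1.0) < 0.05
print(B_half[mask].mean())   # ~ 0.5

# E[B_1 | B_{1/2} ~ 1.0] should be ~ 1.0, i.e. B_{1/2} itself.
mask = np.abs(B_half - 1.0) < 0.05
print(B_one[mask].mean())    # ~ 1.0
```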

In example ([eq:conditional-expectation-of-gaussian-vector]) for the Gaussian vector \((X,Y)\), the conditional expectation was equal to the orthogonal projection of \(Y\) onto \(X\) in \(L^{2}\). In particular, the conditional expectation was a multiple of \(X\). Is this always the case? Unfortunately, it is not. For example, in the equation ([eq:elementary-conditional-expectation-example]), the conditional expectation is clearly not a multiple of the random variable \(X\). However, it is a function of \(X\), as is always the case by definition ([def:conditional-expectation]).

The idea to construct the conditional expectation \(\mathbb{E}[Y|X]\) in general is to project \(Y\) on the space of all random variables that can be constructed from \(X\). To make this precise, consider the following subspace of \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\) :

Let \((\Omega,\mathcal{F},\mathbb{P})\) be a probability space and \(X\) a random variable defined on it. The space \(L^{2}(\Omega,\sigma(X),\mathbb{P})\) is the linear subspace of \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\) consisting of the square-integrable random variables of the form \(g(X)\) for some function \(g:\mathbf{R}\to\mathbf{R}\).

This is a linear subspace of \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\): It contains the random variable \(0\), and any linear combination of random variables of this kind is also a function of \(X\) and must have a finite second moment. We note the following:

\(L^{2}(\Omega,\sigma(X),\mathbb{P})\) is a subspace of \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\), very much how a plane or line (going through the origin) is a subspace of \(\mathbf{R}^{3}\).

In particular, as in the case of a line or a plane, we can project an element \(Y\) of \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\) onto \(L^{2}(\Omega,\sigma(X),\mathbb{P})\). The resulting projection is an element of \(L^{2}(\Omega,\sigma(X),\mathbb{P})\), a square-integrable random variable that is a function of \(X\). For a subspace \(\mathcal{S}\) of \(\mathbf{R}^{3}\) (e.g. a line or a plane), the projection of a vector \(\mathbf{v}\in\mathbf{R}^{3}\) onto the subspace \(\mathcal{S}\), denoted \(\text{Proj}_{\mathcal{S}}(\mathbf{v})\), is the closest point to \(\mathbf{v}\) lying in the subspace \(\mathcal{S}\). Moreover, \(\mathbf{v}-\text{Proj}_{\mathcal{S}}(\mathbf{v})\) is orthogonal to the subspace. This picture of orthogonal projection also holds in \(L^{2}\). Let \(Y\) be a random variable in \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\) and let \(L^{2}(\Omega,\sigma(X),\mathbb{P})\) be the subspace of those random variables that are functions of \(X\). We write \(Y^{\star}\) for the random variable in \(L^{2}(\Omega,\sigma(X),\mathbb{P})\) that is closest to \(Y\). In other words, we have (using the squared \(L^{2}\)-distance):

\[\inf_{Z\in L^{2}(\Omega,\sigma(X),\mathbb{P})}\mathbb{E}[(Y-Z)^{2}]=\mathbb{E}[(Y-Y^{\star})^{2}]\label{eq:Y-star-is-the-closest-to-Y-in-L2-sense}\]

It turns out that \(Y^{\star}\) is the right candidate for the conditional expectation.

Figure. An illustration of the conditional expectation \(\mathbb{E}[Y|X]\) as an orthogonal projection of \(Y\) onto the subspace \(L^2(\Omega,\sigma(X),\mathbb{P})\).

(Existence and uniqueness of the conditional expectation) Let \(X\) be a random variable on \((\Omega,\mathcal{F},\mathbb{P})\). Let \(Y\) be a random variable in \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\). Then the conditional expectation \(\mathbb{E}[Y|X]\) is the random variable \(Y^{\star}\) given in the equation ([eq:Y-star-is-the-closest-to-Y-in-L2-sense]). Namely, it is the random variable in \(L^{2}(\Omega,\sigma(X),\mathbb{P})\) that is closest to \(Y\) in the \(L^{2}\)-distance.

In particular we have the following:

1) It is the orthogonal projection of \(Y\) onto \(L^{2}(\Omega,\sigma(X),\mathbb{P})\), that is \(Y-Y^{\star}\) is orthogonal to any random variables in the subspace \(L^{2}(\Omega,\sigma(X),\mathbb{P})\).

2) It is unique.

This result reinforces the meaning of the conditional expectation \(\mathbb{E}[Y|X]\) as the best estimation of \(Y\) given the information of \(X\): it is the closest random variable to \(Y\) among all the functions of \(X\) in the sense of \(L^{2}\).

Proof. We write for short \(L^{2}(X)\) for the subspace \(L^{2}(\Omega,\sigma(X),\mathbb{P})\). Let \(Y^{\star}\) be as in equation ([eq:Y-star-is-the-closest-to-Y-in-L2-sense]). We show successively that (1) \(Y-Y^{\star}\) is orthogonal to any element of \(L^{2}(X)\), so it is the orthogonal projection, (2) \(Y^{\star}\) has the properties of conditional expectation in definition ([eq:definition-conditional-expectation]), and (3) \(Y^{\star}\) is unique.

(1) Let \(W=g(X)\) be a random variable in \(L^{2}(X)\). We show that \(W\) is orthogonal to \(Y-Y^{\star}\); that is, \(\mathbb{E}[(Y-Y^{\star})W]=0\). This should be intuitively clear from the figure above. On the one hand, we have by expanding the square:

\[\begin{aligned} \mathbb{E}[(W-(Y-Y^{\star}))^{2}] & =\mathbb{E}[W^{2}-2W(Y-Y^{\star})+(Y-Y^{\star})^{2}]\nonumber \\ & =\mathbb{E}[W^{2}]-2\mathbb{E}[W(Y-Y^{\star})]+\mathbb{E}[(Y-Y^{\star})^{2}]\label{eq:developing-the-square} \end{aligned}\]

On the other hand, since \(Y^{\star}+W\) is an arbitrary vector in \(L^{2}(X)\) (it is a linear combination of elements of \(L^{2}(X)\)), we must have from equation ([eq:Y-star-is-the-closest-to-Y-in-L2-sense]):

\[\begin{aligned} \mathbb{E}[(W-(Y-Y^{\star}))^{2}] & =\mathbb{E}[(Y-(Y^{\star}+W))^{2}]\nonumber \\ & \geq\inf_{Z\in L^{2}(X)}\mathbb{E}[(Y-Z)^{2}]\nonumber \\ & =\mathbb{E}[(Y-Y^{\star})^{2}]\label{eq:lower-bound} \end{aligned}\]

Putting the last two equations ([eq:developing-the-square]), ([eq:lower-bound]) together, we get that for any \(W\in L^{2}(X)\):

\[\begin{aligned} \mathbb{E}[W^{2}]-2\mathbb{E}[W(Y-Y^{\star})] & \geq0 \end{aligned}\]

In particular, this also holds for \(aW\), in which case we get:

\[\begin{aligned} a^{2}\mathbb{E}[W^{2}]-2a\mathbb{E}[W(Y-Y^{\star})] & \geq0\\ \implies a\left\{ a\mathbb{E}[W^{2}]-2\mathbb{E}[W(Y-Y^{\star})]\right\} & \geq0 \end{aligned}\]

If \(a>0\), then:

\[a\mathbb{E}[W^{2}]-2\mathbb{E}[W(Y-Y^{\star})]\geq0\label{eq:case-when-a-gt-zero}\]

whereas if \(a<0\), then the sign changes upon dividing throughout by \(a\), and we have:

\[a\mathbb{E}[W^{2}]-2\mathbb{E}[W(Y-Y^{\star})]\leq0\label{eq:case-when-a-lt-zero}\]

Rearranging ([eq:case-when-a-gt-zero]) yields:

\[\mathbb{E}[W(Y-Y^{\star})]\leq a\mathbb{E}[W^{2}]/2\label{eq:case-when-a-gt-zero-rearranged}\]

Rearranging ([eq:case-when-a-lt-zero]) yields:

\[\mathbb{E}[W(Y-Y^{\star})]\geq a\mathbb{E}[W^{2}]/2\label{eq:case-when-a-lt-zero-rearranged}\]

Since ([eq:case-when-a-gt-zero-rearranged]) holds for all \(a>0\), letting \(a\to0^{+}\) forces \(\mathbb{E}[W(Y-Y^{\star})]\leq0\). Since ([eq:case-when-a-lt-zero-rearranged]) holds for all \(a<0\), letting \(a\to0^{-}\) forces \(\mathbb{E}[W(Y-Y^{\star})]\geq0\). Consequently,

\[\mathbb{E}[W(Y-Y^{\star})]=0\]

(2) It is clear that \(Y^{\star}\) is a function of \(X\) by construction, since it is in \(L^{2}(X)\). Moreover, for any \(W\in L^{2}(X)\), we have from (1) that:

\[\begin{aligned} \mathbb{E}[W(Y-Y^{\star})] & =0 \end{aligned}\]

which is the second defining property of conditional expectations.

(3) Lastly, suppose there is another element \(Y'\) of \(L^{2}(X)\) that minimizes the distance to \(Y\). Then we would get:

\[\begin{aligned} \mathbb{E}[(Y-Y')^{2}] & =\mathbb{E}[(Y-Y^{\star}+Y^{\star}-Y')^{2}]\\ & =\mathbb{E}[(Y-Y^{\star})^{2}]+2\mathbb{E}[(Y-Y^{\star})(Y^{\star}-Y')]+\mathbb{E}[(Y^{\star}-Y')^{2}]\\ & =\mathbb{E}[(Y-Y^{\star})^{2}]+0+\mathbb{E}[(Y^{\star}-Y')^{2}]\\ & \quad\left\{ (Y^{\star}-Y')\in L^{2}(X)\perp(Y-Y^{\star})\right\} \end{aligned}\]

where we used the fact, that \(Y^{\star}-Y'\) is a vector in \(L^{2}(X)\) and the orthogonality of \(Y-Y^{\star}\) with \(L^{2}(X)\) as in (1). But, this implies that:

\[\begin{aligned} \cancel{\mathbb{E}[(Y-Y')^{2}]} & =\cancel{\mathbb{E}[(Y-Y^{\star})^{2}]}+\mathbb{E}[(Y^{\star}-Y')^{2}]\\ \mathbb{E}[(Y^{\star}-Y')^{2}] & =0 \end{aligned}\]

So, \(Y^{\star}=Y'\) almost surely. ◻
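The least-squares characterization can also be seen numerically. In the sketch below (a toy model of our own choosing, \(Y=X^{2}+\varepsilon\) with \(\varepsilon\) independent of \(X\), so that \(\mathbb{E}[Y|X]=X^{2}\) is known in closed form; NumPy assumed), the conditional expectation attains the smallest value of \(\mathbb{E}[(Y-Z)^{2}]\) among a few candidate elements \(Z=g(X)\) of \(L^{2}(X)\):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6

# Toy model with a known conditional expectation: E[Y|X] = X^2.
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

# Compare E[(Y - Z)^2] for several candidates Z = g(X) in L^2(X).
candidates = {
    "g(X) = X^2 (the conditional expectation)": X**2,
    "g(X) = X": X,
    "g(X) = E[Y] (a constant)": np.full(n, Y.mean()),
    "g(X) = X^2 + 0.3 X": X**2 + 0.3 * X,
}
for name, Z in candidates.items():
    print(f"{name:42s} E[(Y-Z)^2] ~ {np.mean((Y - Z)**2):.3f}")
# The first candidate gives the smallest value (~1, the variance of the noise).
```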

Conditional Expectation of continuous random variables. Let \((X,Y)\) be two random variables with joint density \(f_{X,Y}(x,y)\) on \(\mathbf{R}^{2}\). Suppose for simplicity that \(\int_{\mathbf{R}}f_{X,Y}(x,y)dy>0\) for every \(x\in\mathbf{R}\). Show that the conditional expectation \(\mathbf{E}[Y|X]\) equals \(h(X)\) where \(h\) is the function:

\[\begin{aligned} h(x) & =\frac{\int_{\mathbf{R}}yf_{X,Y}(x,y)dy}{\int_{\mathbf{R}}f_{X,Y}(x,y)dy}\label{eq:conditional-expectation-of-continuous-random-variables} \end{aligned}\]

In particular, verify that \(\mathbf{E}[\mathbf{E}[Y|X]]=\mathbf{E}[Y]\).

Hint: To prove this, verify that the above formula satisfies both the properties of conditional expectations; then invoke uniqueness to finish it off.

(i) The density function \(f_{X,Y}(x,y)\) is a map \(f:\mathbf{R}^{2}\to\mathbf{R}\). The integral \(\int_{y=-\infty}^{y=+\infty}yf_{X,Y}(x_{0},y)dy\) is the area under the curve \(yf(x,y)\) at the point \(x=x_{0}\). Let’s call it \(A(x_{0})\). If instead, we have an arbitrary \(x\), \(\int_{y=-\infty}^{y=+\infty}yf_{X,Y}(x,y)dy\) represents the area \(A(x)\) of an arbitrary slice of the surface \(yf_{X,Y}\) at the point \(x\). Hence, it is a function of \(x\). The denominator \(\int_{\mathbf{R}}f_{X,Y}(x,y)dy=f_{X}(x)\), the density of \(X\), which is a function of \(x\). Hence, the ratio is a function of \(x\).

(ii) Let \(g(X)\) be a bounded random variable. We have:

\[\begin{aligned} \mathbf{E}[g(X)(Y-h(X))] & =\mathbf{E}[Yg(X)]-\mathbf{E}[g(X)h(X)]\\ & =\iint_{\mathbf{R}^{2}}yg(x)f_{X,Y}(x,y)\,dy\,dx-\int_{\mathbf{R}}g(x)h(x)f_{X}(x)\,dx\\ & =\iint_{\mathbf{R}^{2}}yg(x)f_{X,Y}(x,y)\,dy\,dx-\int_{\mathbf{R}}g(x)\cdot\frac{\int_{\mathbf{R}}yf_{X,Y}(x,y)\,dy}{\cancel{\int_{\mathbf{R}}f_{X,Y}(x,y)\,dy}}\cdot\cancel{\int_{\mathbf{R}}f_{X,Y}(x,y)\,dy}\ dx\\ & =\iint_{\mathbf{R}^{2}}yg(x)f_{X,Y}(x,y)\,dy\,dx-\iint_{\mathbf{R}^{2}}yg(x)f_{X,Y}(x,y)\,dy\,dx\\ & =0 \end{aligned}\]

Thus, \(h(X)\) is a valid candidate for the conditional expectation \(\mathbf{E}[Y|X]\). Moreover, by the existence and uniqueness theorem ([th:existence-and-uniqueness-of-the-conditional-expectation]), \(\mathbf{E}[Y|X]\) is unique and equals \(h(X)\).

Conditioning on several random variables.

We would like to generalize the conditional expectation to the case when we condition on the information of more than one random variable. Taking the \(L^{2}\) point of view, we should expect that the conditional expectation is the orthogonal projection of the given random variable on the subspace generated by square integrable functions of all the variables on which we condition.

It is now useful to study sigma-fields, an object that was defined in chapter 1.

(Sigma-Field) A sigma-field or sigma-algebra \(\mathcal{F}\) on a sample space \(\Omega\) is a collection of subsets of \(\Omega\) (the measurable events) with the following properties:

(1) \(\Omega\) is in \(\mathcal{F}\).

(2) Closure under complement. If \(A\in\mathcal{F}\), then \(A^{C}\in\mathcal{F}\).

(3) Closure under countable unions. If \(A_{1},A_{2},\ldots,\in\mathcal{F}\), then \(\bigcup_{n=1}^{\infty}A_{n}\in\mathcal{F}\).

Such objects play a fundamental role in the rigorous study of probability and real analysis in general. We will focus on the intuition behind them. First let’s mention some examples of sigma-fields of a given sample space \(\Omega\) to get acquainted with the concept.

(Examples of sigma-fields).

(1) The trivial sigma-field. Note that the collection of events \(\{\emptyset,\Omega\}\) is a sigma-field of \(\Omega\). We generally denote it by \(\mathcal{F}_{0}\).

(2) The \(\sigma\)-field generated by an event \(A\). Let \(A\) be an event that is not \(\emptyset\) and not the entire \(\Omega\). Then the smallest sigma-field containing \(A\) ought to be:

\[\begin{aligned} \mathcal{F}_{1} & =\{\emptyset,A,A^{C},\Omega\} \end{aligned}\]

This sigma-field is denoted by \(\sigma(A)\).

(3) The sigma-field generated by a random variable \(X\).

We now define \(\mathcal{F}_{X}\) as follows:

\[\begin{aligned} \mathcal{F}_{X} & =\{X^{-1}(B):B\in\mathcal{B}(\mathbf{R})\}=\{\{\omega:X(\omega)\in B\}:B\in\mathcal{B}(\mathbf{R})\} \end{aligned}\]

where \(\mathcal{B}(\mathbf{R})\) is the Borel \(\sigma\)-algebra on \(\mathbf{R}\). \(\mathcal{F}_{X}\) is sometimes denoted by \(\sigma(X)\). \(\mathcal{F}_{X}\) is the set of all events pertaining to \(X\). It is a sigma-algebra because:

(i) \(\Omega\in\sigma(X)\) because \(\Omega=\{\omega:X(\omega)\in\mathbf{R}\}\) and \(\mathbf{R}\in\mathcal{B}(\mathbf{R})\).

(ii) Let any event \(C\in\sigma(X)\). We need to show that \(\Omega\setminus C\in\sigma(X)\).

Since \(C\in\sigma(X)\), there exists \(A\in\mathcal{B}(\mathbf{R})\), such that:

\[\begin{aligned} C & =\{\omega\in\Omega:X(\omega)\in A\} \end{aligned}\]

Now, we calculate:

\[\begin{aligned} \Omega\setminus C & =\{\omega\in\Omega:X(\omega)\in\mathbf{R}\setminus A\} \end{aligned}\]

Since \(\mathcal{B}(\mathbf{R})\) is a sigma-algebra, it is closed under complementation. Hence, if \(A\in\mathcal{B}(\mathbf{R})\), it implies that \(\mathbf{R}\setminus A\in\mathcal{B}(\mathbf{R})\). So, \(\Omega\setminus C\in\sigma(X)\).

(iii) Consider a sequence of events \(C_{1},C_{2},\ldots,C_{n},\ldots\in\sigma(X)\). We need to prove that \(\bigcup_{n=1}^{\infty}C_{n}\in\sigma(X)\).

Since \(C_{n}\in\sigma(X)\), there exists \(A_{n}\in\mathcal{B}(\mathbf{R})\) such that:

\[\begin{aligned} C_{n} & =\{\omega\in\Omega:X(\omega)\in A_{n}\} \end{aligned}\]

Now, we calculate:

\[\begin{aligned} \bigcup_{n=1}^{\infty}C_{n} & =\{\omega\in\Omega:X(\omega)\in\bigcup_{n=1}^{\infty}A_{n}\} \end{aligned}\]

But, \(\bigcup_{n=1}^{\infty}A_{n}\in\mathcal{B}(\mathbf{R})\). So, \(\bigcup_{n=1}^{\infty}C_{n}\in\sigma(X)\).

Consequently, \(\sigma(X)\) is indeed a \(\sigma\)-algebra.

Intuitively, we think of \(\sigma(X)\) as containing all information about \(X\).

(4) The sigma-field generated by a stochastic process \((X_{s},s\leq t)\). Let \((X_{s},s\geq0)\) be a stochastic process. Consider the process restricted to \([0,t]\), \((X_{s},s\leq t)\). We consider the smallest sigma-field containing all events pertaining to the random variables \(X_{s},s\leq t\). We denote it by \(\sigma(X_{s},s\leq t)\) or \(\mathcal{F}_{t}\).

The sigma-fields on \(\Omega\) have a natural (partial) ordering: two sigma-fields \(\mathcal{G}\) and \(\mathcal{F}\) of \(\Omega\) are such that \(\mathcal{G}\subseteq\mathcal{F}\) if all the events in \(\mathcal{G}\) are in \(\mathcal{F}\). For example, the trivial \(\sigma\)-field \(\mathcal{F}_{0}=\{\emptyset,\Omega\}\) is contained in all the \(\sigma\)-fields of \(\Omega\). Clearly, the \(\sigma\)-field \(\mathcal{F}_{t}=\sigma(X_{s},s\leq t)\) is contained in \(\mathcal{F}_{t'}\) if \(t\leq t'\).

If all the events pertaining to a random variable \(X\) are in the \(\sigma\)-field \(\mathcal{G}\) (and thus we can compute \(\mathbb{P}(X^{-1}((a,b]))\)), we will say that \(X\) is \(\mathcal{G}\)-measurable. This means that all information about \(X\) is contained in \(\mathcal{G}\).

Let \(X\) be a random variable defined on \((\Omega,\mathcal{F},\mathbb{P})\). Consider another sigma-field \(\mathcal{G}\subseteq\mathcal{F}\). Then \(X\) is said to be \(\mathcal{G}\)-measurable if and only if:

\[\begin{aligned} \{\omega:X(\omega)\in(a,b]\} & \in\mathcal{G}\text{ for all intervals }(a,b]\subseteq\mathbf{R} \end{aligned}\]

(\(\mathcal{F}_{0}\)-measurable random variables). Consider the trivial sigma-field \(\mathcal{F}_{0}=\{\emptyset,\Omega\}\). A random variable that is \(\mathcal{F}_{0}\)-measurable must be a constant. Indeed, we have that for any interval \((a,b]\), \(\{\omega:X(\omega)\in(a,b]\}=\emptyset\) or \(\{\omega:X(\omega)\in(a,b]\}=\Omega\). This can only hold if \(X\) takes a single value.

(\(\sigma(X)\)-measurable random variables). Let \(X\) be a given random variable on \((\Omega,\mathcal{F},\mathbb{P})\). Roughly speaking, a \(\sigma(X)\)-measurable random variable is determined by the information of \(X\) only. Here is the simplest example of a \(\sigma(X)\)-measurable random variable. Take the indicator function \(Y=\mathbf{1}_{\{X\in B\}}\) for some event \(\{X\in B\}\) pertaining to \(X\). Then the pre-images \(\{\omega:Y(\omega)\in(a,b]\}\) are either \(\emptyset\), \(\{X\in B\}\), \(\{X\in B^{C}\}\) or \(\Omega\) depending on whether \(0,1\) are in \((a,b]\) or not. All of these events are in \(\sigma(X)\). More generally, one can construct a \(\sigma(X)\)-measurable random variable by taking linear combinations of indicator functions of events of the form \(\{X\in B\}\).

It turns out that any (Borel measurable) function of \(X\) can be approximated by taking limits of such simple functions.

Concretely, this translates to the following statement:

\[\text{If }Y\text{ is }\sigma(X)\text{-measurable, then }Y=g(X)\text{ for some function }g.\]

In the same way, if \(Z\) is \(\sigma(X,Y)\)-measurable, then \(Z=h(X,Y)\) for some \(h\). These facts can be proved rigorously using measure theory.

We are ready to give the general definition of conditional expectation.

(Coin-Tossing Space). Suppose a coin is tossed infinitely many times. Let \(\Omega\) be the set of all infinite sequences of \(H\)s and \(T\)s. A generic element of \(\Omega\) is denoted by \(\omega_{1}\omega_{2}\ldots\), where \(\omega_{n}\) indicates the result of the \(n\)th coin toss. \(\Omega\) is an uncountable sample space. The trivial sigma-field is \(\mathcal{F}_{0}=\{\emptyset,\Omega\}\). Assume that we don’t know anything about the outcome of the experiment. Even without any information, we know that the true \(\omega\) belongs to \(\Omega\) and does not belong to \(\emptyset\). This is the information learned at time \(0\).

Next, assume that we know the outcome of the first coin toss. Define \(A_{H}=\{\omega:\omega_{1}=H\}\), the set of all sequences beginning with \(H\), and \(A_{T}=\{\omega:\omega_{1}=T\}\), the set of all sequences beginning with \(T\). The four sets resolved by the first coin toss form the \(\sigma\)-field \(\mathcal{F}_{1}=\{\emptyset,A_{H},A_{T},\Omega\}\). We shall think of this \(\sigma\)-field as containing the information learned by knowing the outcome of the first coin toss. More precisely, if instead of being told about the first coin toss, we are told for each set in \(\mathcal{F}_{1}\) whether or not the true \(\omega\) belongs to that set, then we know the outcome of the first coin toss and nothing more.

If we are told the first two coin tosses, we obtain a finer resolution. In particular, the four sets:

\[\begin{aligned} A_{HH} & =\{\omega:\omega_{1}=H,\omega_{2}=H\}\\ A_{HT} & =\{\omega:\omega_{1}=H,\omega_{2}=T\}\\ A_{TH} & =\{\omega:\omega_{1}=T,\omega_{2}=H\}\\ A_{TT} & =\{\omega:\omega_{1}=T,\omega_{2}=T\} \end{aligned}\]

are resolved. Of course, the sets in \(\mathcal{F}_{1}\) are also resolved. Whenever a set is resolved, so is its complement, which means that \(A_{HH}^{C}\), \(A_{HT}^{C}\), \(A_{TH}^{C}\) and \(A_{TT}^{C}\) are resolved. Whenever two sets are resolved, so is their union, which means that \(A_{HH}\cup A_{TH}\), \(A_{HH}\cup A_{TT}\), \(A_{HT}\cup A_{TH}\) and \(A_{HT}\cup A_{TT}\) are resolved. The other two pairwise unions \(A_{HH}\cup A_{HT}=A_{H}\) and \(A_{TH}\cup A_{TT}=A_{T}\) are already resolved. Finally, the triple unions are also resolved, because \(A_{HH}\cup A_{HT}\cup A_{TH}=A_{TT}^{C}\) and so forth. Hence, the information pertaining to the first two coin tosses is contained in:

\[\begin{aligned} \mathcal{F}_{2} & =\{\emptyset,\Omega,\\ & A_{H},A_{T},\\ & A_{HH},A_{HT},A_{TH},A_{TT},\\ & A_{HH}^{C},A_{HT}^{C},A_{TH}^{C},A_{TT}^{C},\\ & A_{HH}\cup A_{TH},A_{HH}\cup A_{TT},A_{HT}\cup A_{TH},A_{HT}\cup A_{TT}\} \end{aligned}\]

Hence, if the outcome of the first two coin tosses is known, all of the events in \(\mathcal{F}_{2}\) are resolved: we know exactly whether each event has occurred or not. \(\mathcal{F}_{2}\) is the information learned by observing the first two coin tosses.
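A finite \(\sigma\)-field like \(\mathcal{F}_{2}\) is easy to enumerate by machine: it consists of all unions of the four atoms \(A_{HH},A_{HT},A_{TH},A_{TT}\). The short sketch below (plain Python; representing the atoms by the strings "HH", "HT", "TH", "TT" is just our encoding) confirms that this gives \(2^{4}=16\) events, matching the list above:

```python
from itertools import combinations

# The four atoms resolved by the first two tosses.
atoms = [frozenset({"HH"}), frozenset({"HT"}), frozenset({"TH"}), frozenset({"TT"})]

# A sigma-field generated by a finite partition is the collection of all
# unions of atoms, including the empty union (emptyset) and the full union (Omega).
F2 = {frozenset().union(*combo)
      for k in range(len(atoms) + 1)
      for combo in combinations(atoms, k)}

print(len(F2))                 # 16 events
A_H = frozenset({"HH", "HT"})  # the event "first toss is H"
print(A_H in F2)               # True: every event of F_1 is also in F_2
```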

(Exercises on sigma-fields).

(a) Let \(A\), \(B\) be two proper subsets of \(\Omega\) such that \(A\cap B\neq\emptyset\) and \(A\cup B\neq\Omega\). Write down \(\sigma(\{A,B\})\), the smallest sigma-field containing \(A\) and \(B\) explicitly. What if \(A\cap B=\emptyset\)?

(b) The Borel sigma-field is the smallest sigma-field containing intervals of the form \((a,b]\) in \(\mathbf{R}\). Show that all singletons \(\{b\}\) are in \(\mathcal{B}(\mathbf{R})\) by writing \(\{b\}\) as a countable intersection of intervals \((a,b]\). Conclude that all open intervals \((a,b)\) and all closed intervals \([a,b]\) are in \(\mathcal{B}(\mathbf{R})\). Is the subset \(\mathbf{Q}\) of rational numbers a Borel set?

Proof. (a) The sigma-field generated by the two events \(A\), \(B\) is given by:

\[\begin{aligned} \sigma(\{A,B\}) & =\{\emptyset,\Omega,\\ & A,B,A^{C},B^{C},\\ & A\cup B,A\cap B,\\ & A\cup B^{C},A^{C}\cup B,A^{C}\cup B^{C},\\ & A\cap B^{C},A^{C}\cap B,A^{C}\cap B^{C},\\ & (A\cup B)\cap(A\cap B)^{C},\\ & (A\cup B)^{C}\cup(A\cap B)\} \end{aligned}\]

(b) Firstly, recall that \(\mathcal{B}(\mathbf{R})\) is the smallest sigma-field containing the intervals \((a,b]\), i.e. the intersection of all sigma-fields on \(\mathbf{R}\) that contain them:

\[\begin{aligned} \mathcal{B}(\mathbf{R}) & =\sigma(\{(a,b]:a<b\})=\bigcap_{\substack{\mathcal{F}\text{ a }\sigma\text{-field on }\mathbf{R}\\ \{(a,b]:a<b\}\subseteq\mathcal{F}}}\mathcal{F} \end{aligned}\]

We can write:

\[\begin{aligned} \{b\} & =\bigcap_{n=1}^{\infty}\left(b-\frac{1}{n},b\right] \end{aligned}\]

As \(\mathcal{B}(\mathbf{R})\) is a sigma-field, it is closed under countable intersections. Hence, the singleton set \(\{b\}\) is a Borel set.

Similarly, we can write, any open interval as the countable union:

\[\begin{aligned} (a,b) & =\bigcup_{n=1}^{\infty}\left(a,b-\frac{1}{n}\right] \end{aligned}\]

We can convince ourselves that equality indeed holds. Let \(x\in(a,b)\) and choose \(N\) such that \(\frac{1}{N}<b-x\). Then, for all \(n\geq N\), \(x\in(a,b-1/n]\), so \(x\) belongs to the right-hand side. In the reverse direction, let \(x\) belong to \(\bigcup_{n=1}^{\infty}\left(a,b-\frac{1}{n}\right]\). Then \(x\) belongs to at least one of these sets, and each \(\left(a,b-\frac{1}{n}\right]\subseteq(a,b)\), so \(x\in(a,b)\). Hence, the two sets are equal.

Hence, open intervals are Borel sets.

Similarly, we may write:

\[\begin{aligned} [a,b] & =\bigcap_{n=1}^{\infty}\left(a-\frac{1}{n},b+\frac{1}{n}\right) \end{aligned}\]

Consequently, closed intervals are Borel sets. Since \(\mathbf{Q}\) is countable, it is a Borel set. Moreover, the empty set \(\emptyset\) and \(\mathbf{R}\) are Borel sets. So, \(\mathbf{R}\backslash\mathbf{Q}\) is also a Borel set. ◻

Let \((X,Y)\) be a Gaussian vector with mean \(0\) and covariance matrix

\[\begin{aligned} C & =\left[\begin{array}{cc} 1 & \rho\\ \rho & 1 \end{array}\right] \end{aligned}\]

for \(\rho\in(-1,1)\). We verify that the example ([ex:conditional-expectation-of-gaussian-vectors]) and the exercise above on the conditional expectation of continuous random variables yield the same conditional expectation.

(a) Use equation ([eq:conditional-expectation-of-gaussian-vector]) to show that \(\mathbf{E}[Y|X]=\rho X\).

(b) Write down the joint PDF \(f(x,y)\) of \((X,Y)\).

(c) Show that \(\int_{\mathbf{R}}yf(x,y)dy=\rho x\,f_{X}(x)\) and that \(\int_{\mathbf{R}}f(x,y)dy=f_{X}(x)\), where \(f_{X}(x)=\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}\) is the standard Gaussian density.

(d) Deduce that \(\mathbf{E}[Y|X]=\rho X\) using the equation ([eq:conditional-expectation-of-continuous-random-variables]).

Proof. (a) Since \(X\) and \(Y\) have mean \(0\) and variance \(1\), the covariance and the product of the standard deviations are:

\[\begin{aligned} \mathbf{E}[(X-\mathbf{E}X)(Y-\mathbf{E}Y)] & =\mathbf{E}[XY],\\ \sqrt{\mathbf{E}[X^{2}]-(\mathbf{E}X)^{2}}\cdot\sqrt{\mathbf{E}[Y^{2}]-(\mathbf{E}Y)^{2}} & =\sqrt{(1-0)(1-0)}=1 \end{aligned}\]

and therefore,

\[\begin{aligned} \rho & =\frac{\mathbf{E}(XY)}{1}=\frac{\mathbf{E}[XY]}{\mathbf{E}[X^{2}]} \end{aligned}\]

Since \((X,Y)\) is a Gaussian vector, using ([eq:conditional-expectation-of-gaussian-vector]), we have:

\[\begin{aligned} \mathbf{E}[Y|X] & =\frac{\mathbf{E}[XY]}{\mathbf{E}[X^{2}]}X=\rho X \end{aligned}\]

(b) Consider the augmented matrix \([C|I]\). We have:

\[\begin{aligned} [C|I] & =\left[\left.\begin{array}{cc} 1 & \rho\\ \rho & 1 \end{array}\right|\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\right] \end{aligned}\]

Performing \(R_{2}=R_{2}-\rho R_{1}\), the above system is row-equivalent to:

\[\left[\left.\begin{array}{cc} 1 & \rho\\ 0 & 1-\rho^{2} \end{array}\right|\begin{array}{cc} 1 & 0\\ -\rho & 1 \end{array}\right]\]

Performing \(R_{2}=\frac{1}{1-\rho^{2}}R_{2}\), the above system is row-equivalent to:

\[\left[\begin{array}{cc} 1 & \rho\\ 0 & 1 \end{array}\left|\begin{array}{cc} 1 & 0\\ \frac{-\rho}{1-\rho^{2}} & \frac{1}{1-\rho^{2}} \end{array}\right.\right]\]

Performing \(R_{1}=R_{1}-\rho R_{2}\), we have:

\[\left[\begin{array}{cc} 1 & 0\\ 0 & 1 \end{array}\left|\begin{array}{cc} \frac{1}{1-\rho^{2}} & -\frac{\rho}{1-\rho^{2}}\\ \frac{-\rho}{1-\rho^{2}} & \frac{1}{1-\rho^{2}} \end{array}\right.\right]\]

Thus, \[\begin{aligned} C^{-1} & =\frac{1}{1-\rho^{2}}\left[\begin{array}{cc} 1 & -\rho\\ -\rho & 1 \end{array}\right] \end{aligned}\]

Moreover, \(\det C=1-\rho^{2}.\)

Therefore, the joint density of \((X,Y)\) is given by:

\[\begin{aligned} f(x,y) & =\frac{1}{2\pi\sqrt{1-\rho^{2}}}\exp\left[-\frac{1}{2(1-\rho^{2})}\left[\begin{array}{cc} x & y\end{array}\right]\left[\begin{array}{cc} 1 & -\rho\\ -\rho & 1 \end{array}\right]\left[\begin{array}{c} x\\ y \end{array}\right]\right]\\ & =\frac{1}{2\pi\sqrt{1-\rho^{2}}}\exp\left[-\frac{1}{2(1-\rho^{2})}\left[\begin{array}{cc} x-\rho y & -\rho x+y\end{array}\right]\left[\begin{array}{c} x\\ y \end{array}\right]\right]\\ & =\frac{1}{2\pi\sqrt{1-\rho^{2}}}\exp\left[-\frac{1}{2(1-\rho^{2})}(x^{2}-2\rho xy+y^{2})\right] \end{aligned}\]

(c) Claim I. \(\int_{\mathbf{R}}yf(x,y)dy=\rho x\).

Completing the square, we have:

\[\begin{aligned} (x^{2}-2\rho xy+y^{2}) & =(y-\rho x)^{2}+x^{2}(1-\rho^{2}) \end{aligned}\]

Thus, we can write:

\[\begin{aligned} \int_{\mathbf{R}}yf(x,y)dy & =\frac{1}{2\pi\sqrt{1-\rho^{2}}}e^{-\frac{1}{2}x^{2}}\int_{\mathbf{R}}ye^{-\frac{1}{2}\frac{(y-\rho x)^{2}}{(1-\rho^{2})}}dy \end{aligned}\]

Let’s substitute

\[\begin{aligned} z & =\frac{(y-\rho x)}{\sqrt{1-\rho^{2}}}\\ dz & =\frac{dy}{\sqrt{1-\rho^{2}}} \end{aligned}\]

Therefore,

\[\begin{aligned} \int_{\mathbf{R}}ye^{-\frac{1}{2}\frac{(y-\rho x)^{2}}{(1-\rho^{2})}}dy & =\sqrt{1-\rho^{2}}\int_{\mathbf{R}}(\rho x+\sqrt{1-\rho^{2}}z)e^{-\frac{z^{2}}{2}}dz\\ & =\rho x\cdot\sqrt{1-\rho^{2}}\int_{\mathbf{R}}e^{-\frac{z^{2}}{2}}dz+(1-\rho^{2})\int_{\mathbf{R}}ze^{-\frac{z^{2}}{2}}dz\\ & =\rho x\cdot\sqrt{1-\rho^{2}}\cdot\sqrt{2\pi}+(1-\rho^{2})\cdot0\\ & =\rho x\cdot\sqrt{1-\rho^{2}}\cdot\sqrt{2\pi} \end{aligned}\]

Consequently,

\[\begin{aligned} \int_{\mathbf{R}}yf(x,y)dy & =\frac{1}{2\pi\cancel{\sqrt{1-\rho^{2}}}}e^{-\frac{1}{2}x^{2}}\rho x\cdot\cancel{\sqrt{1-\rho^{2}}}\cdot\sqrt{2\pi}\\ & =\rho x\cdot\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^{2}}\\ & =\rho x\cdot f_{X}(x)\\ \frac{\int_{\mathbf{R}}yf(x,y)dy}{f_{X}(x)} & =\frac{\int_{\mathbf{R}}yf(x,y)dy}{\int_{\mathbf{R}}f(x,y)dy}=\rho x \end{aligned}\]

(d) By the exercise on continuous random variables, the conditional expectation is \(\mathbf{E}[Y|X]=h(X)\) with \(h(x)=\frac{\int_{\mathbf{R}}yf(x,y)dy}{\int_{\mathbf{R}}f(x,y)dy}=\rho x\). Hence, \(\mathbf{E}[Y|X]=\rho X\), in agreement with (a). ◻
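The identity \(\mathbf{E}[Y|X]=\rho X\) is also easy to check by simulation. A minimal sketch (NumPy assumed; the value \(\rho=0.7\), the slab width and the test points are arbitrary illustrative choices) conditions on thin slabs of \(X\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 10**6, 0.7

# (X, Y) standard Gaussian with correlation rho:
# Y = rho*X + sqrt(1 - rho^2)*Z with Z independent of X.
X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = rho * X + np.sqrt(1 - rho**2) * Z

# Estimate E[Y | X ~ x0] by averaging Y over a thin slab around x0.
for x0 in (-1.0, 0.5, 1.5):
    mask = np.abs(X - x0) < 0.05
    print(x0, Y[mask].mean(), rho * x0)   # the last two columns agree
```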

(Conditional Expectation) Let \(Y\) be an integrable random variable on \((\Omega,\mathcal{F},\mathbb{P})\) and let \(\mathcal{G}\subseteq\mathcal{F}\) be a sigma-field of \(\Omega\). The conditional expectation of \(Y\) given \(\mathcal{G}\) is the random variable denoted by \(\mathbb{E}[Y|\mathcal{G}]\) such that the following hold:

(A) \(\mathbb{E}[Y|\mathcal{G}]\) is \(\mathcal{G}\)-measurable.

In other words, all events pertaining to the random variable \(\mathbb{E}[Y|\mathcal{G}]\) are in \(\mathcal{G}\).

(B) For any bounded random variable \(W\) that is \(\mathcal{G}\)-measurable,

\[\begin{aligned} \mathbb{E}[WY] & =\mathbb{E}[W\mathbb{E}[Y|\mathcal{G}]] \end{aligned}\]

In other words, \(\mathbf{E}[Y|\mathcal{G}]\) is a proxy for \(Y\) as far as the events in \(\mathcal{G}\) are concerned.

Note that, by taking \(W=1\) in the property (B), we recover:

\[\begin{aligned} \mathbf{E}[\mathbf{E}[Y|\mathcal{G}]] & =\mathbf{E}[Y] \end{aligned}\]

Beware of the notation! If \(\mathcal{G}=\sigma(X)\), then the conditional expectation \(\mathbf{E}[Y|\sigma(X)]\) is usually denoted by \(\mathbf{E}[Y|X]\) for short. However, one should always keep in mind that conditioning on \(X\) is in fact projecting on the linear subspace generated by all variables constructed from \(X\), and not on the linear space generated by \(X\) alone. In the same way, the conditional expectation \(\mathbf{E}[Z|\sigma(X,Y)]\) is often written \(\mathbf{E}[Z|X,Y]\) for short.

As expected, if \(Y\) is in \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\), then \(\mathbf{E}[Y|\mathcal{G}]\) is given by the orthogonal projection of \(Y\) onto \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\), the subspace of square-integrable random variables that are \(\mathcal{G}\)-measurable. We write \(Y^{\star}\) for the random variable in \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\) that is closest to \(Y\), that is:

\[\begin{aligned} \min_{Z\in L^{2}(\Omega,\mathcal{G},\mathbb{P})}\mathbf{E}[(Y-Z)^{2}] & =\mathbf{E}[(Y-Y^{\star})^{2}]\label{eq:conditional-expectation} \end{aligned}\]

(Existence and Uniqueness of Conditional Expectations) Let \(\mathcal{G}\subset\mathcal{F}\) be a sigma-field of \(\Omega\). Let \(Y\) be a random variable in \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\). Then, the conditional expectation \(\mathbf{E}[Y|\mathcal{G}]\) is the random variable \(Y^{\star}\) given in the equation ([eq:conditional-expectation]). Namely, it is the random variable in \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\) that is closest to \(Y\) in the \(L^{2}\)-distance. In particular we have the following:

  • It is the orthogonal projection of \(Y\) onto \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\), that is, \(Y-Y^{\star}\) is orthogonal to the random variables in \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\).

  • It is unique.

Again, the result should be interpreted as follows: The conditional expectation \(\mathbf{E}[Y|\mathcal{G}]\) is the best approximation of \(Y\) given the information included in \(\mathcal{G}\).

The conditional expectation in fact exists and is unique for any integrable random variable \(Y\) (i.e. \(Y\in L^{1}(\Omega,\mathcal{F},\mathbb{P})\)), as the definition suggests. However, there is no orthogonal projection in \(L^{1}\), so the intuitive geometric picture is lost.

Figure. An illustration of the conditional expectation \(\mathbb{E}[Y|\mathcal{G}]\) as an orthogonal projection of \(Y\) onto the subspace \(L^2(\Omega,\mathcal{G},\mathbb{P})\).

(Conditional Expectation for Gaussian Vectors. II.) Consider the Gaussian vector \((X_{1},\ldots,X_{n})\). Without loss of generality, suppose it has mean \(0\) and is non-degenerate. What is the best approximation of \(X_{n}\) given the information \(X_{1},\ldots,X_{n-1}\)? In other words, what is:

\[\mathbf{E}[X_{n}|\sigma(X_{1},\ldots,X_{n-1})]\]

With example ([ex:sigma(X)-measurable-random-variables-example]) in mind, let’s write \(\mathbf{E}[X_{n}|X_{1}\ldots X_{n-1}]\) for short. From the Gaussian examples above, we know that if \((X,Y)\) is a Gaussian vector with mean \(0\), then \(\mathbf{E}[Y|X]\) is a multiple of \(X\). Thus, we expect that \(\mathbf{E}[X_{n}|X_{1}X_{2}\ldots X_{n-1}]\) is a linear combination of \(X_{1},X_{2},\ldots,X_{n-1}\). That is, there exist \(a_{1},\ldots,a_{n-1}\) such that:

\[\begin{aligned} \mathbf{E}[X_{n}|X_{1}X_{2}\ldots X_{n-1}] & =a_{1}X_{1}+a_{2}X_{2}+\ldots+a_{n-1}X_{n-1} \end{aligned}\] In particular, since the conditional expectation is a linear combination of the \(X\)’s, it is itself a Gaussian random variable. The best way to find the coefficients \(a_{i}\) is to go back to the IID decomposition of Gaussian vectors.

Let \((Z_{1},Z_{2},\ldots,Z_{n-1})\) be IID standard Gaussians constructed as linear combinations of \((X_{1},X_{2},\ldots,X_{n-1})\). Then, we can write:

\[\begin{aligned} \mathbf{E}[X_{n}|X_{1}X_{2}\ldots X_{n-1}] & =b_{1}Z_{1}+\ldots+b_{n-1}Z_{n-1} \end{aligned}\]

Now, recall, that we construct the random variables \(Z_{1}\), \(Z_{2}\), \(\ldots\), \(Z_{n}\) using Gram-Schmidt orthogonalization:

\[\begin{aligned} \tilde{Z}_{1} & =X_{1}, & Z_{1} & =\frac{\tilde{Z}_{1}}{\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}}\\ \tilde{Z}_{2} & =X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1}, & Z_{2} & =\frac{\tilde{Z}_{2}}{\sqrt{\mathbf{E}[\tilde{Z}_{2}^{2}]}}\\ \tilde{Z}_{3} & =X_{3}-\sum_{i=1}^{2}\mathbf{E}(X_{3}Z_{i})Z_{i}, & Z_{3} & =\frac{\tilde{Z}_{3}}{\sqrt{\mathbf{E}[\tilde{Z}_{3}^{2}]}}\\ & \vdots \end{aligned}\]

The simple case for \(n=2\) random variables.

We have already seen before:

\[\begin{aligned} \mathbf{E}[X_{1}(X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1})] & =\mathbf{E}[\tilde{Z}_{1}(X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1})]\\ & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\mathbf{E}\left[\frac{\tilde{Z}_{1}}{\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}}(X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1})\right]\\ & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\mathbf{E}[Z_{1}(X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1})]\\ & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\left(\mathbf{E}[Z_{1}X_{2}]-\mathbf{E}(X_{2}Z_{1})\mathbf{E}[Z_{1}^{2}]\right)\\ & =0\\ & \quad\{\text{since }\mathbf{E}[Z_{1}^{2}]=1\} \end{aligned}\]

So, \(X_{2}-\mathbf{E}(X_{2}Z_{1})Z_{1}\) is orthogonal to \(X_{1}\).

Moreover, \(\mathbf{E}(X_{2}Z_{1})Z_{1}\) is a function of \(X_{1}\). Thus, both the properties of conditional expectation are satisfied. Since conditional expectations are unique, we must have, \(\mathbf{E}[X_{2}|X_{1}]=\mathbf{E}(X_{2}Z_{1})Z_{1}\).

The case for \(n=3\) random variables.

We have seen that:

\[\begin{aligned} \mathbf{E}[X_{1}(X_{3}-\mathbf{E}(X_{3}Z_{1})Z_{1}-\mathbf{E}(X_{3}Z_{2})Z_{2})] & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\mathbf{E}\left[\frac{\tilde{Z}_{1}}{\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}}(X_{3}-\mathbf{E}(X_{3}Z_{1})Z_{1}-\mathbf{E}(X_{3}Z_{2})Z_{2})\right]\\ & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\mathbf{E}\left[Z_{1}(X_{3}-\mathbf{E}(X_{3}Z_{1})Z_{1}-\mathbf{E}(X_{3}Z_{2})Z_{2})\right]\\ & =\sqrt{\mathbf{E}[\tilde{Z}_{1}^{2}]}\times\left(\mathbf{E}[X_{3}Z_{1}]-\mathbf{E}[X_{3}Z_{1}]\mathbf{E}[Z_{1}^{2}]-\mathbf{E}[X_{3}Z_{2}]\mathbf{E}[Z_{1}Z_{2}]\right)\\ & =0 \end{aligned}\]

It is an easy exercise to show that it is orthogonal to \(X_{2}\).

Hence, \(X_{3}-\mathbf{E}(X_{3}Z_{1})Z_{1}-\mathbf{E}(X_{3}Z_{2})Z_{2}\) is orthogonal to \(X_{1}\) and \(X_{2}\). Moreover, \(\mathbf{E}(X_{3}Z_{1})Z_{1}+\mathbf{E}(X_{3}Z_{2})Z_{2}\) is a function of \(X_{1}\), \(X_{2}\). Thus, we must have:

\[\begin{aligned} \mathbf{E}[X_{3}|X_{1}X_{2}] & =\mathbf{E}(X_{3}Z_{1})Z_{1}+\mathbf{E}(X_{3}Z_{2})Z_{2} \end{aligned}\]

In general, \(X_{n}-\sum_{i=1}^{n-1}\mathbf{E}(X_{n}Z_{i})Z_{i}\) is orthogonal to \(X_{1}\), \(X_{2}\), \(\ldots\), \(X_{n-1}\). Hence,

\[\begin{aligned} \mathbf{E}[X_{n}|X_{1}X_{2}\ldots X_{n-1}] & =\sum_{i=1}^{n-1}\mathbf{E}(X_{n}Z_{i})Z_{i} \end{aligned}\]
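The Gram–Schmidt recipe is straightforward to test numerically. In the sketch below (NumPy assumed; the particular mean-zero Gaussian vector \((X_{1},X_{2},X_{3})\) is an arbitrary choice of ours), we build \(Z_{1},Z_{2}\) from sample moments and check that the residual \(X_{3}-\sum_{i}\mathbf{E}(X_{3}Z_{i})Z_{i}\) is (empirically) orthogonal to \(X_{1}\) and \(X_{2}\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10**6

# An illustrative mean-zero Gaussian vector (X1, X2, X3) built from IID normals.
W = rng.normal(size=(3, n))
X1 = W[0]
X2 = 0.6 * W[0] + 0.8 * W[1]
X3 = 0.3 * W[0] + 0.5 * W[1] + 0.4 * W[2]

# Gram-Schmidt: orthonormalize (X1, X2) into (approximately) standard Gaussians.
Z1 = X1 / np.sqrt(np.mean(X1**2))
Zt2 = X2 - np.mean(X2 * Z1) * Z1
Z2 = Zt2 / np.sqrt(np.mean(Zt2**2))

# Candidate conditional expectation E[X3 | X1, X2] = E[X3 Z1] Z1 + E[X3 Z2] Z2.
cond = np.mean(X3 * Z1) * Z1 + np.mean(X3 * Z2) * Z2

# The residual is (empirically) orthogonal to X1 and X2, as the argument requires.
resid = X3 - cond
print(np.mean(resid * X1), np.mean(resid * X2))   # both ~ 0
```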

Properties of Conditional Expectation.

We now list the properties of conditional expectation that follow from the two defining properties (A), (B) in the definition. They are extremely useful when doing explicit computations on martingales. A good way to remember them is to understand how they relate to the interpretation of conditional expectation as an orthogonal projection onto a subspace or, equivalently, as the best approximation of the variable given the information available.

Let \(Y\) be an integrable random variable on \((\Omega,\mathcal{F},\mathbb{P})\). Let \(\mathcal{G}\subseteq\mathcal{F}\) be another sigma-field of \(\Omega\). Then, the conditional expectation \(\mathbf{E}[Y|\mathcal{G}]\) has the following properties:

(1) If \(Y\) is \(\mathcal{G}\)-measurable, then :

\[\begin{aligned} \mathbf{E}[Y|\mathcal{G}] & =Y \end{aligned}\]

(2) Taking out what is known. More generally, if \(Y\) is \(\mathcal{G}\)-measurable and \(X\) is another integrable random variable (with \(XY\) also integrable), then:

\[\begin{aligned} \mathbf{E}[XY|\mathcal{G}] & =Y\mathbf{E}[X|\mathcal{G}] \end{aligned}\]

This makes sense, since \(Y\) is determined by \(\mathcal{G}\), so we can take out what is known; it can be treated as a constant for the conditional expectation.

(3) Independence. If \(Y\) is independent of \(\mathcal{G}\), that is, for any event \(\{Y\in(a,b]\}\) and any \(A\in\mathcal{G}\):

\[\begin{aligned} \mathbb{P}(\{Y\in(a,b]\}\cap A) & =\mathbb{P}(\{Y\in(a,b]\})\cdot\mathbb{P}(A) \end{aligned}\]

then

\[\begin{aligned} \mathbf{E}[Y|\mathcal{G}] & =\mathbf{E}[Y] \end{aligned}\]

In other words, if you have no information on \(Y\), your best guess for its value is simply its plain expectation.

(4) Linearity of conditional expectations. Let \(X\) be another integrable random variable on \((\Omega,\mathcal{F},\mathbb{P})\). Then,

\[\begin{aligned} \mathbf{E}[aX+bY|\mathcal{G}] & =a\mathbf{E}[X|\mathcal{G}]+b\mathbf{E}[Y|\mathcal{G}],\quad\text{for any }a,b\in\mathbf{R} \end{aligned}\]

The linearity justifies the cumbersome choice of notation \(\mathbf{E}[Y|\mathcal{G}]\) for the random variable.

(5) Tower Property : If \(\mathcal{H}\subseteq\mathcal{G}\) is another sigma-field of \(\Omega\), then:

\[\begin{aligned} \mathbf{E}[Y|\mathcal{H}] & =\mathbf{E}[\mathbf{E}[Y|\mathcal{G}]|\mathcal{H}] \end{aligned}\]

Think in terms of two successive projections: first on a plane, then on a line in the plane.

(6) Pythagoras Theorem. We have:

\[\begin{aligned} \mathbf{E}[Y^{2}] & =\mathbf{E}\left[\left(\mathbf{E}[Y|\mathcal{G}]\right)^{2}\right]+\mathbf{E}\left[\left(Y-\mathbf{E}[Y|\mathcal{G}]\right)^{2}\right] \end{aligned}\]

In particular:

\[\begin{aligned} \mathbf{E}\left[\left(\mathbf{E}\left[Y|\mathcal{G}\right]\right)^{2}\right] & \leq\mathbf{E}[Y^{2}] \end{aligned}\]

In words, the \(L^{2}\)-norm of \(\mathbf{E}[Y|\mathcal{G}]\) is at most that of \(Y\), which is clear if you think in terms of orthogonal projection.

(7) Expectation of the conditional expectation.

\[\begin{aligned} \mathbf{E}\left[\mathbf{E}[Y|\mathcal{G}]\right] & =\mathbf{E}[Y] \end{aligned}\]
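Before turning to the proofs, here is a quick numerical sanity check of the tower property (5) and of Pythagoras (6) on a finite example (three IID coin flips; the helper `cond_exp`, which averages over the atoms of the conditioning variables, is our own construction; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10**6

# Three IID coin flips; G = sigma(C1, C2) and H = sigma(C1), so H is contained in G.
C1, C2, C3 = rng.integers(0, 2, (3, n))
Y = (C1 + C2 + C3).astype(float)

def cond_exp(Y, *conditioning):
    """Empirical E[Y | conditioning]: average Y over each atom of the conditioners."""
    out = np.empty_like(Y, dtype=float)
    keys = np.stack(conditioning, axis=1)
    for atom in np.unique(keys, axis=0):
        mask = (keys == atom).all(axis=1)
        out[mask] = Y[mask].mean()
    return out

E_Y_G = cond_exp(Y, C1, C2)
E_Y_H = cond_exp(Y, C1)

# Tower property: E[ E[Y|G] | H ] = E[Y|H].
print(np.max(np.abs(cond_exp(E_Y_G, C1) - E_Y_H)))                  # ~ 0

# Pythagoras: E[Y^2] = E[(E[Y|G])^2] + E[(Y - E[Y|G])^2].
print(np.mean(Y**2), np.mean(E_Y_G**2) + np.mean((Y - E_Y_G)**2))   # equal
```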

Proof.

The uniqueness property of conditional expectations in theorem ([th:existence-and-uniqueness-of-conditional-expectations-II]) might appear to be an academic curiosity. On the contrary, it is very practical, since it ensures that if we find a candidate for the conditional expectation that has the two properties in Definition ([def:conditional-expectation]), then it must be the conditional expectation. To see this, let’s prove property (1).

If \(Y\) is \(\mathcal{G}\)-measurable, then \(\mathbf{E}[Y|\mathcal{G}]=Y\).

It suffices to show that \(Y\) has the two defining properties of conditional expectation.

(1) We are given that, \(Y\) is \(\mathcal{G}\)-measurable. So, property (A) is satisfied.

(2) For any bounded random variable \(W\) that is \(\mathcal{G}\)-measurable, we have:

\[\begin{aligned} \mathbf{E}[W(Y-Y)] & =\mathbf{E}[0]=0 \end{aligned}\]

So, property (B) also holds trivially.

(Taking out what is known.) If \(Y\) is \(\mathcal{G}\)-measurable and \(X\) is another integrable random variable, then:

\[\begin{aligned} \mathbf{E}[XY|\mathcal{G}] & =Y\mathbf{E}[X|\mathcal{G}] \end{aligned}\]

In a similar vein, it suffices to show that, \(Y\mathbf{E}[X|\mathcal{G}]\) has the two defining properties of conditional expectation.

(1) We are given that \(Y\) is \(\mathcal{G}\)-measurable; from property (1), \(\mathbf{E}[X|\mathcal{G}]\) is \(\mathcal{G}\)-measurable. It follows that, \(Y\mathbf{E}[X|\mathcal{G}]\) is \(\mathcal{G}\)-measurable.

(2) From theorem ([th:existence-and-uniqueness-of-conditional-expectations-II]), \(X-\mathbf{E}[X|\mathcal{G}]\) is orthogonal to the random variables in \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\). So, if \(W\) is any bounded \(\mathcal{G}\)-measurable random variable, then \(WY\) is also \(\mathcal{G}\)-measurable, and it follows that:

\[\begin{aligned} \mathbf{E}[WY(X-\mathbf{E}[X|\mathcal{G}])] & =0\\ \implies\mathbf{E}[W\cdot XY] & =\mathbf{E}[WY\mathbf{E}[X|\mathcal{G}]] \end{aligned}\]

This completes the proof.

(Independence.) If \(Y\) is independent of \(\mathcal{G}\), that is, for all events \(\{Y\in(a,b]\}\) and \(A\in\mathcal{G}\),

\[\begin{aligned} \mathbb{P}(\{Y\in(a,b]\}\cap A) & =\mathbb{P}(\{Y\in(a,b]\})\cdot\mathbb{P}(A) \end{aligned}\]

then

\[\begin{aligned} \mathbf{E}[Y|\mathcal{G}] & =\mathbf{E}[Y] \end{aligned}\]

Let us show that \(\mathbf{E}[Y]\) has the two defining properties of conditional expectations.

(1) \(\mathbf{E}[Y]\) is a constant and so it is \(\mathcal{F}_{0}\) measurable. Hence, it is \(\mathcal{G}\) measurable.

(2) If \(W\) is any bounded \(\mathcal{G}\)-measurable random variable,

\[\begin{aligned} \mathbf{E}[WY] & =\mathbf{E}[W]\cdot\mathbf{E}[Y] \end{aligned}\]

since \(Y\) is independent of \(\mathcal{G}\) and therefore independent of \(W\). Hence,

\[\begin{aligned} \mathbf{E}[W(Y-\mathbf{E}[Y])] & =0 \end{aligned}\]

Consequently, \(\mathbf{E}[Y|\mathcal{G}]=\mathbf{E}[Y]\).

(Linearity of conditional expectations) Let \(X\) be another integrable random variable on \((\Omega,\mathcal{F},\mathbb{P})\). Then,

\[\begin{aligned} \mathbf{E}[aX+bY|\mathcal{G}] & =a\mathbf{E}[X|\mathcal{G}]+b\mathbf{E}[Y|\mathcal{G}],\quad\text{for any }a,b\in\mathbf{R} \end{aligned}\]

Since \(\mathbf{E}[X|\mathcal{G}]\) and \(\mathbf{E}[Y|\mathcal{G}]\) are \(\mathcal{G}-\)measurable, any linear combination of these two random variables is also \(\mathcal{G}\)-measurable.

Also, if \(W\) is any bounded \(\mathcal{G}-\)measurable random variable, we have:

\[\begin{aligned} \mathbf{E}[W(aX+bY-(a\mathbf{E}[X|\mathcal{G}]+b\mathbf{E}[Y|\mathcal{G}]))] & =a\mathbf{E}[W(X-\mathbf{E}[X|\mathcal{G}])]\\ & +b\mathbf{E}[W(Y-\mathbf{E}[Y|\mathcal{G}])] \end{aligned}\]

By definition, \(X-\mathbf{E}(X|\mathcal{G})\) is orthogonal to the subspace \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\) and hence to all bounded \(\mathcal{G}\)-measurable random variables, and similarly for \(Y-\mathbf{E}(Y|\mathcal{G})\). Hence, the two expectations on the right-hand side of the above expression are \(0\). Since conditional expectations are unique, we have the desired result.

If \(\mathcal{H}\subseteq\mathcal{G}\) is another sigma-field of \(\Omega\), then

\[\begin{aligned} \mathbf{E}[Y|\mathcal{H}] & =\mathbf{E}[\mathbf{E}[Y|\mathcal{G}]|\mathcal{H}] \end{aligned}\]

Define \(U:=\mathbf{E}[Y|\mathcal{G}]\). By definition, \(\mathbf{E}[U|\mathcal{H}]\) is \(\mathcal{H}\)-measurable.

Let \(W\) be any bounded \(\mathcal{H}\)-measurable random variable. We have:

\[\begin{aligned} \mathbf{E}[W\{\mathbf{E}(Y|\mathcal{G})-\mathbf{E}(\mathbf{E}(Y|\mathcal{G})|\mathcal{H})\}] & =\mathbf{E}[W(U-\mathbf{E}(U|\mathcal{H}))] \end{aligned}\]

But, by definition, \(U-\mathbf{E}(U|\mathcal{H})\) is orthogonal to the subspace \(L^{2}(\Omega,\mathcal{H},\mathbb{P})\) and hence \(\mathbf{E}[W(U-\mathbf{E}(U|\mathcal{H}))]=0\). Since conditional expectations are unique, we have the desired result.

Pythagoras’s theorem. We have:

\[\begin{aligned} \mathbf{E}[Y^{2}] & =\mathbf{E}[(\mathbf{E}[Y|\mathcal{G}])^{2}]+\mathbf{E}[(Y-\mathbf{E}(Y|\mathcal{G}))^{2}] \end{aligned}\]

In particular,

\[\begin{aligned} \mathbf{E}[(\mathbf{E}[Y|\mathcal{G}])^{2}] & \leq\mathbf{E}[Y^{2}] \end{aligned}\]

Consider the orthogonal decomposition:

\[\begin{aligned} Y & =\mathbf{E}[Y|\mathcal{G}]+(Y-\mathbf{E}[Y|\mathcal{G}]) \end{aligned}\]

Squaring on both sides and taking expectations, we have:

\[\begin{aligned} \mathbf{E}[Y^{2}] & =\mathbf{E}[(\mathbf{E}(Y|\mathcal{G}))^{2}]+\mathbf{E}[(Y-\mathbf{E}[Y|\mathcal{G}])^{2}]+2\mathbf{E}\left[\mathbf{E}[Y|\mathcal{G}](Y-\mathbf{E}[Y|\mathcal{G}])\right] \end{aligned}\]

By definition of conditional expectation, \((Y-\mathbf{E}[Y|\mathcal{G}])\) is orthogonal to the subspace \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\). By the properties of conditional expectation, \(\mathbf{E}[Y|\mathcal{G}]\) is \(\mathcal{G}\)-measurable, so it belongs to \(L^{2}(\Omega,\mathcal{G},\mathbb{P})\). Hence, the inner product (the cross term) on the right-hand side is \(0\). Consequently, we have the desired result.

Moreover, since \((Y-\mathbf{E}[Y|\mathcal{G}])^{2}\) is a non-negative random variable, \(\mathbf{E}[(Y-\mathbf{E}[Y|\mathcal{G}])^{2}]\geq0\). It follows that: \(\mathbf{E}[Y^{2}]\geq\mathbf{E}[(\mathbf{E}(Y|\mathcal{G}))^{2}]\).

Our claim is:

\[\begin{aligned} \mathbf{E}\left[\mathbf{E}[Y|\mathcal{G}]\right] & =\mathbf{E}[Y] \end{aligned}\]

We know that, if \(W\) is any bounded \(\mathcal{G}\)-measurable random variable:

\[\begin{aligned} \mathbf{E}\left[WY\right] & =\mathbf{E}[W\mathbf{E}[Y|\mathcal{G}]] \end{aligned}\]

Taking \(W=1\), we have:

\[\begin{aligned} \mathbf{E}\left[Y\right] & =\mathbf{E}[\mathbf{E}[Y|\mathcal{G}]] \end{aligned}\]

(Brownian Conditioning II). We continue the example ([ex:brownian-conditioning-I]). Let’s now compute the conditional expectations \(\mathbf{E}[e^{aB_{1}}|B_{1/2}]\) and \(\mathbf{E}[e^{aB_{1/2}}|B_{1}]\) for some parameter \(a\). We shall need the properties of conditional expectation in proposition ([prop:properties-of-conditional-expectation]). For the first one we use the fact that \(B_{1/2}\) is independent of \(B_{1}-B_{1/2}\) to get:

\[\begin{aligned} \mathbf{E}[e^{aB_{1}}|B_{1/2}] & =\mathbf{E}[e^{a((B_{1}-B_{1/2})+B_{1/2})}|B_{1/2}]\\ & =\mathbf{E}[e^{a(B_{1}-B_{1/2})}\cdot e^{aB_{1/2}}|B_{1/2}]\\ & =e^{aB_{1/2}}\mathbf{E}[e^{a(B_{1}-B_{1/2})}|B_{1/2}]\\ & \quad\left\{ \text{Taking out what is known}\right\} \\ & =e^{aB_{1/2}}\cdot\mathbf{E}[e^{a(B_{1}-B_{1/2})}]\\ & \quad\{\text{Independence}\} \end{aligned}\]

We know that \(a(B_{1}-B_{1/2})\) is a Gaussian random variable with mean \(0\) and variance \(a^{2}/2\). We also know that \(\mathbf{E}[e^{tZ}]=e^{t^{2}/2}\) for a standard Gaussian \(Z\). So, \(\mathbf{E}[e^{a(B_{1}-B_{1/2})}]=e^{a^{2}/4}\). Consequently, \(\mathbf{E}[e^{aB_{1}}|B_{1/2}]=e^{aB_{1/2}+a^{2}/4}\).

The result itself has the form of the MGF of a Gaussian with mean \(B_{1/2}\) and variance \(1/2\). (The MGF of \(X=\mu+\sigma Z\), \(Z\sim N(0,1)\), is \(M_{X}(a)=\exp\left[\mu a+\frac{1}{2}\sigma^{2}a^{2}\right]\).) In fact, this shows that the conditional distribution of \(B_{1}\) given \(B_{1/2}\) is Gaussian with mean \(B_{1/2}\) and variance \(1/2\).

For the other expectation, note that \(B_{1/2}-\frac{1}{2}B_{1}\) is independent of \(B_{1}\). We have: \[\begin{aligned} \mathbf{E}\left[\left(B_{1/2}-\frac{1}{2}B_{1}\right)B_{1}\right] & =\mathbf{E}(B_{1/2}B_{1})-\frac{1}{2}\mathbf{E}[B_{1}^{2}]\\ & =\frac{1}{2}-\frac{1}{2}\cdot1\\ & =0 \end{aligned}\]

Therefore, we have:

\[\begin{aligned} \mathbf{E}[e^{aB_{1/2}}|B_{1}] & =\mathbf{E}[e^{a(B_{1/2}-\frac{1}{2}B_{1})+\frac{a}{2}B_{1}}|B_{1}]\\ & =\mathbf{E}[e^{a(B_{1/2}-\frac{1}{2}B_{1})}\cdot e^{\frac{a}{2}B_{1}}|B_{1}]\\ & =e^{\frac{a}{2}B_{1}}\mathbf{E}[e^{a(B_{1/2}-\frac{1}{2}B_{1})}|B_{1}]\\ & \quad\{\text{Taking out what is known }\}\\ & =e^{\frac{a}{2}B_{1}}\mathbf{E}[e^{a(B_{1/2}-\frac{1}{2}B_{1})}]\\ & \quad\{\text{Independence}\} \end{aligned}\]

Now, \(a(B_{1/2}-\frac{1}{2}B_{1})\) is a Gaussian random variable with mean \(0\) and variance \(a^{2}(\frac{1}{2}-\frac{1}{4})=\frac{a^{2}}{4}\), so it has the same distribution as \((a/2)Z\). Consequently, \(\mathbf{E}[e^{a(B_{1/2}-\frac{1}{2}B_{1})}]=\mathbf{E}[e^{(a/2)Z}]=e^{\frac{a^{2}}{8}}\). Thus, \(\mathbf{E}[e^{aB_{1/2}}|B_{1}]=e^{\frac{a}{2}B_{1}+\frac{a^{2}}{8}}\).

(Brownian bridge is conditioned Brownian motion). We know that the Brownian bridge \(M_{t}=B_{t}-tB_{1}\), \(t\in[0,1]\), is independent of \(B_{1}\). We use this to show that the conditional distribution of the Brownian motion given the value at the end-point \(B_{1}\) is that of a Brownian bridge shifted by the straight line going from \(0\) to \(B_{1}\). To see this, we compute the conditional MGF of \((B_{t_{1}},B_{t_{2}},\ldots,B_{t_{n}})\) given \(B_{1}\) for some arbitrary choices of \(t_{1},t_{2},\ldots,t_{n}\) in \([0,1]\). We get the following by adding and subtracting \(t_{j}B_{1}\):

\[\begin{aligned} \mathbf{E}[e^{a_{1}B_{t_{1}}+\ldots+a_{n}B_{t_{n}}}|B_{1}] & =\mathbf{E}[e^{a_{1}(B_{t_{1}}-t_{1}B_{1})+\ldots+a_{n}(B_{t_{n}}-t_{n}B_{1})}\cdot e^{(a_{1}t_{1}B_{1}+\ldots+a_{n}t_{n}B_{1})}|B_{1}]\\ & =e^{(a_{1}t_{1}B_{1}+\ldots+a_{n}t_{n}B_{1})}\mathbf{E}[e^{a_{1}M_{t_{1}}+\ldots+a_{n}M_{t_{n}}}|B_{1}]\\ & \quad\{\text{Taking out what is known}\}\\ & =e^{(a_{1}t_{1}B_{1}+\ldots+a_{n}t_{n}B_{1})}\mathbf{E}[e^{a_{1}M_{t_{1}}+\ldots+a_{n}M_{t_{n}}}]\\ & \quad\{\text{Independence}\} \end{aligned}\]

The right side is exactly the MGF of the process \(M_{t}+tB_{1},t\in[0,1]\) (for a fixed value \(B_{1})\), where \((M_{t},t\in[0,1])\) is a Brownian bridge. This proves the claim.

(Conditional Jensen’s Inequality) If \(c\) is a convex function on \(\mathbf{R}\) and \(X\) is a random variable on \((\Omega,\mathcal{F},\mathbb{P})\), then:

\[\begin{aligned} \mathbf{E}[c(X)] & \geq c(\mathbf{E}[X]) \end{aligned}\]

More generally, if \(\mathcal{G}\subseteq\mathcal{F}\) is a sigma-field, then:

\[\mathbf{E}[c(X)|\mathcal{G}]\geq c(\mathbf{E}[X|\mathcal{G}])\]

Proof. We know that if \(c(x)\) is a convex function, the tangent line to the curve at any point lies below the curve. The tangent at the point \((t,c(t))\) is the straight line:

\[\begin{aligned} y & =c(t)+m(t)(x-t) \end{aligned}\]

where \(m(t)=c'(t)\) is the slope at \(t\) (if \(c\) is not differentiable there, any supporting slope works). This holds for all \(t\in\mathbf{R}\). Since the curve lies above the tangent line, at an arbitrary point \(x\) we have:

\[\begin{aligned} c(x) & \geq c(t)+m(t)(x-t) \end{aligned}\]

Therefore, we have:

\[\begin{aligned} c(x)-c(t) & \geq m(t)(x-t) \end{aligned}\]

for any \(x\) and any point of tangency \(t\). Since this inequality holds pointwise, we may substitute the random variables \(X\) for \(x\) and \(Y\) for \(t\):

\[\begin{aligned} c(X)-c(Y) & \geq m(Y)(X-Y) \end{aligned}\]

Substituting \(Y=\mathbf{E}[X|\mathcal{G}]\), we get:

\[\begin{aligned} c(X)-c(\mathbf{E}[X|\mathcal{G}]) & \geq m(\mathbf{E}[X|\mathcal{G}])(X-\mathbf{E}[X|\mathcal{G}]) \end{aligned}\]

Taking conditional expectations given \(\mathcal{G}\) on both sides, we get:

\[\begin{aligned} \mathbf{E}[(c(X)-c(\mathbf{E}[X|\mathcal{G}]))|\mathcal{G}] & \geq\mathbf{E}[m(\mathbf{E}[X|\mathcal{G}])(X-\mathbf{E}[X|\mathcal{G}])|\mathcal{G}] \end{aligned}\]

The left-hand side simplifies as:

\[\begin{aligned} \mathbf{E}[(c(X)-c(\mathbf{E}[X|\mathcal{G}]))|\mathcal{G}] & =\mathbf{E}[c(X)|\mathcal{G}]-\mathbf{E}[c(\mathbf{E}[X|\mathcal{G}])|\mathcal{G}]\\ & \quad\{\text{Linearity}\}\\ & =\mathbf{E}[c(X)|\mathcal{G}]-c(\mathbf{E}[X|\mathcal{G}])\\ & \quad\{c(\mathbf{E}[X|\mathcal{G}])\text{ is }\mathcal{G}\text{-measurable}\} \end{aligned}\]

On the right hand side, we have:

\[\begin{aligned} \mathbf{E}[m(\mathbf{E}[X|\mathcal{G}])(X-\mathbf{E}[X|\mathcal{G}])|\mathcal{G}] & =\mathbf{E}[m(\mathbf{E}[X|\mathcal{G}])\cdot X|\mathcal{G}]-\mathbf{E}[m(\mathbf{E}[X|\mathcal{G}])\cdot\mathbf{E}[X|\mathcal{G}]|\mathcal{G}]\\ & =\mathbf{E}[X|\mathcal{G}]m(\mathbf{E}[X|\mathcal{G}])-m(\mathbf{E}[X|\mathcal{G}])\cdot\mathbf{E}[X|\mathcal{G}]\\ & =0 \end{aligned}\]

Consequently, it follows that \(\mathbf{E}[c(X)|\mathcal{G}]\geq c(\mathbf{E}[X|\mathcal{G}])\). ◻

(Embeddings of \(L^{p}\) spaces) Square-integrable random variables are in fact integrable. In other words, there is always the inclusion \(L^{2}(\Omega,\mathcal{F},\mathbb{P})\subseteq L^{1}(\Omega,\mathcal{F},\mathbb{P})\). In particular, square integrable random variables always have a well-defined variance. This embedding is a simple consequence of Jensen’s inequality since:

\[\begin{aligned} \left(\mathbf{E}[|X|]\right)^{2} & \leq\mathbf{E}[|X|^{2}] \end{aligned}\]

as \(f(x)=x^{2}\) is convex (Jensen’s inequality applied to \(|X|\)). By taking the square root on both sides, we get:

\[\begin{aligned} \left\Vert X\right\Vert _{1} & \leq\left\Vert X\right\Vert _{2} \end{aligned}\]

More generally, for any \(1<p<\infty\), we can define \(L^{p}(\Omega,\mathcal{F},\mathbb{P})\) to be the linear space of random variables such that \(\mathbf{E}[|X|^{p}]<\infty\). Then for \(p<q\), since \(x^{q/p}\) is convex, we get by Jensen’s inequality :

\[\begin{aligned} \mathbf{E}[|X|^{q}] & =\mathbf{E}[(|X|^{p})^{\frac{q}{p}}]\geq\left(\mathbf{E}[|X|^{p}]\right)^{\frac{q}{p}} \end{aligned}\]

Taking the \(q\)-th root on both sides:

\[\begin{aligned} \mathbf{E}[|X|^{p}]^{1/p} & \leq\mathbf{E}[|X|^{q}]^{1/q} \end{aligned}\]

So, if \(X\in L^{q}\), then it must also be in \(L^{p}\). Concretely, this means that any random variable with a finite \(q\)-th moment also has a finite \(p\)-th moment, for \(q>p\).

Martingales.

We now have all the tools to define martingales.

(Filtration). A filtration \((\mathcal{F}_{t}:t\geq0)\) of \(\Omega\) is an increasing family of \(\sigma\)-fields on \(\Omega\). That is,

\[\begin{aligned} \mathcal{F}_{s} & \subseteq\mathcal{F}_{t},\quad\forall s\leq t \end{aligned}\]

We will usually take \(\mathcal{F}_{0}=\{\emptyset,\Omega\}\). The canonical example of a filtration is the natural filtration of a given process \((M_{s}:s\geq0)\). This is the filtration given by \(\mathcal{F}_{t}=\sigma(M_{s},s\leq t)\). The inclusions of the \(\sigma\)-fields are then clear. For a given Brownian motion \((B_{t},t\geq0)\), the filtration \(\mathcal{F}_{t}=\sigma(B_{s},s\leq t)\) is sometimes called the Brownian filtration. We think of the filtration as the flow of information of the process.

A stochastic process \((X_{t}:t\geq0)\) is said to be adapted to \((\mathcal{F}_{t}:t\geq0)\), if for each \(t\), the random variable \(X_{t}\) is \(\mathcal{F}_{t}-\)measurable.

(Martingale). A process \((M_{t}:t\geq0)\) is a martingale for the filtration \((\mathcal{F}_{t}:t\geq0)\) if the following hold:

(1) The process is adapted, that is \(M_{t}\) is \(\mathcal{F}_{t}-\)measurable for all \(t\geq0\).

(2) \(\mathbf{E}[|M_{t}|]<\infty\) for all \(t\geq0\). (This ensures that the conditional expectation is well defined.)

(3) Martingale property:

\[\begin{aligned} \mathbf{E}[M_{t}|\mathcal{F}_{s}] & =M_{s}\quad\forall s\leq t \end{aligned}\]

Roughly speaking, this means that the best approximation of the process at a future time \(t\), given the information up to the present time \(s\), is its value at the present.

In particular, the martingale property implies that:

\[\begin{aligned} \mathbf{E}[M_{t}|\mathcal{F}_{0}] & =M_{0}\nonumber \\ \mathbf{E}[\mathbf{E}[M_{t}|\mathcal{F}_{0}]] & =\mathbf{E}[M_{0}]\nonumber \\ \mathbf{E}[M_{t}] & =\mathbf{E}[M_{0}]\label{eq:expected-value-of-martingale-at-any-time-is-constant}\\ & \quad\{\text{Tower Property}\}\nonumber \end{aligned}\]

Usually, we take \(\mathcal{F}_{0}\) to be the trivial sigma-field \(\{\emptyset,\Omega\}\). A random variable that is \(\mathcal{F}_{0}\)-measurable must be a constant, so \(M_{0}\) is a constant. In this case, \(\mathbf{E}[M_{t}]=M_{0}\) for all \(t\). If properties (1) and (2) are satisfied, but the best approximation is larger on average, \(\mathbf{E}[M_{t}|\mathcal{F}_{s}]\geq M_{s}\), the process is called a submartingale. If it is smaller on average, \(\mathbf{E}[M_{t}|\mathcal{F}_{s}]\leq M_{s}\), we say it is a supermartingale.

We will be mostly interested in martingales that are continuous and square-integrable. Continuous martingales are martingales whose paths \(t\mapsto M_{t}(\omega)\) are continuous almost surely. Square-integrable martingales are such that \(\mathbf{E}[|M_{t}|^{2}]<\infty\) for all \(t\). This condition is stronger than \(\mathbf{E}[|M_{t}|]<\infty\) by Jensen’s inequality.

(Martingales in Discrete-time). Martingales can be defined the same way if the index set of the process is discrete. For example, the filtration \((\mathcal{F}_{n}:n\in\mathbf{N})\) is indexed by a countable set, and the martingale property is then replaced by \(\mathbf{E}[M_{n+1}|\mathcal{F}_{n}]=M_{n}\), as expected. The tower property then yields the martingale property \(\mathbf{E}[M_{n+k}|\mathcal{F}_{n}]=M_{n}\) for \(k\geq1\).

(Continuous Filtrations). Filtrations in continuous time can be tricky to handle rigorously. For example, one has to make sense of the limit of \(\mathcal{F}_{s}\) as \(s\) approaches \(t\) from the left. Is it equal to \(\mathcal{F}_{t}\)? Or is there actually less information in \(\lim_{s\to t^{-}}\mathcal{F}_{s}\) than in \(\mathcal{F}_{t}\)? This is a bit of a headache when dealing with processes with jumps, like the Poisson process. However, if the paths are continuous, the technical problems are not as heavy.

Let’s look at some of the important examples of martingales constructed from Brownian Motion.

(Examples of Brownian Martingales)

(i) Standard Brownian Motion. Let \((B_{t}:t\geq0)\) be a standard Brownian motion and let \((\mathcal{F}_{t}:t\geq0)\) be the Brownian filtration. Then \((B_{t}:t\geq0)\) is a square-integrable martingale for the filtration \((\mathcal{F}_{t}:t\geq0)\). Property (1) is obvious, since \(B_{t}\) is \(\mathcal{F}_{t}-\)measurable by construction of the filtration. Property (2) holds since \(\mathbf{E}[|B_{t}|]=\sqrt{2t/\pi}<\infty\) (and in fact \(\mathbf{E}[B_{t}^{2}]=t<\infty\)). As for the martingale property, note that, by the properties of conditional expectation in proposition ([prop:properties-of-conditional-expectation]), we have:

\[\begin{aligned} \mathbf{E}[B_{t}|\mathcal{F}_{s}] & =\mathbf{E}[B_{t}-B_{s}+B_{s}|\mathcal{F}_{s}]\\ & =\mathbf{E}[B_{t}-B_{s}|\mathcal{F}_{s}]+\mathbf{E}[B_{s}|\mathcal{F}_{s}]\\ & \quad\{\text{Linearity}\}\\ & =\mathbf{E}[B_{t}-B_{s}]+B_{s}\\ & \quad\left\{ \begin{array}{c} \text{\ensuremath{B_{t}-B_{s}} is independent of \ensuremath{\mathcal{F}_{s}}}\\ \text{\ensuremath{B_{s}} is \ensuremath{\mathcal{F}_{s}}-measurable} \end{array}\right\} \\ & =B_{s} \end{aligned}\]

(ii) Geometric Brownian Motion. Let \((B_{t},t\ge0)\) be a standard brownian motion, and \(\mathcal{F}_{t}=\sigma(B_{s},s\leq t)\). A geometric brownian motion is a process \((S_{t},t\geq0)\) defined by:

\[\begin{aligned} S_{t} & =S_{0}\exp\left(\sigma B_{t}+\mu t\right) \end{aligned}\]

for some parameter \(\sigma>0\) and \(\mu\in\mathbf{R}\). This is simply the exponential of the Brownian motion with drift. This is not a martingale for most choices of \(\mu\)! In fact, one must take

\[\begin{aligned} \mu & =-\frac{1}{2}\sigma^{2} \end{aligned}\] for the process to be a martingale for the Brownian filtration. Let’s verify this. Property (1) is obvious since \(S_{t}\) is a function of \(B_{t}\) for each \(t\), so it is \(\mathcal{F}_{t}-\)measurable. Moreover, property (2) is clear: \(\mathbf{E}[\exp(\sigma B_{t}+\mu t)]=\mathbf{E}[\exp(\sigma\sqrt{t}Z+\mu t)]=\exp(\mu t+\frac{1}{2}\sigma^{2}t)\), which is a finite quantity. As for the martingale property, note that by the properties of conditional expectation, and the MGF of Gaussians, we have for \(s\leq t\):

\[\begin{aligned} \mathbf{E}[S_{t}|\mathcal{F}_{s}] & =\mathbf{E}\left[S_{0}\exp\left(\sigma B_{t}-\frac{1}{2}\sigma^{2}t\right)|\mathcal{F}_{s}\right]\\ & =S_{0}\exp(-\frac{1}{2}\sigma^{2}t)\mathbf{E}[\exp(\sigma(B_{t}-B_{s}+B_{s}))|\mathcal{F}_{s}]\\ & =S_{0}\exp(-\frac{1}{2}\sigma^{2}t)\exp(\sigma B_{s})\mathbf{E}[\exp(\sigma(B_{t}-B_{s}))|\mathcal{F}_{s}]\\ & \quad\{\text{Taking out what is known}\}\\ & =S_{0}\exp\left(\sigma B_{s}-\frac{1}{2}\sigma^{2}t\right)\mathbf{E}\left[\exp\left(\sigma(B_{t}-B_{s})\right)\right]\\ & \quad\{\text{Independence}\}\\ & =S_{0}\exp\left(\sigma B_{s}-\frac{1}{2}\sigma^{2}t+\frac{1}{2}\sigma^{2}(t-s)\right)\\ & =S_{0}\exp(\sigma B_{s}-\frac{1}{2}\sigma^{2}s)\\ & =S_{s} \end{aligned}\]

We will sometimes abuse terminology and refer to the martingale case of geometric brownian motion simply as geometric Brownian Motion when the context is clear.
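As a quick numerical sanity check (an illustration with arbitrary parameter values, not part of the text), we can verify by simulation that \(\mathbf{E}[S_{t}]=S_{0}\) in the martingale case \(\mu=-\frac{1}{2}\sigma^{2}\):

import numpy as np

# Sample S_t = S_0 * exp(sigma*B_t - sigma^2*t/2) directly from B_t ~ N(0, t)
# and check that the sample mean is close to S_0.
rng = np.random.default_rng(0)
S0, sigma, t = 1.0, 0.5, 2.0
B_t = np.sqrt(t) * rng.standard_normal(1_000_000)
S_t = S0 * np.exp(sigma * B_t - 0.5 * sigma**2 * t)
print(S_t.mean())   # should be close to S0 = 1.0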

(iii) The square of the Brownian motion, compensated. It is easy to check that \((B_{t}^{2},t\geq0)\) is a submartingale, either by a direct computation using increments or by Jensen’s inequality: \(\mathbf{E}[B_{t}^{2}|\mathcal{F}_{s}]\geq(\mathbf{E}[B_{t}|\mathcal{F}_{s}])^{2}=B_{s}^{2}\), \(s<t\). It is nevertheless possible to compensate to get a martingale:

\[\begin{aligned} M_{t} & =B_{t}^{2}-t \end{aligned}\]

It is an easy exercise to verify that \((M_{t}:t\geq0)\) is a martingale for the Brownian filtration \((\mathcal{F}_{t}:t\geq0)\). Indeed, for \(s\leq t\):

\[\begin{aligned} \mathbf{E}[M_{t}|\mathcal{F}_{s}] & =\mathbf{E}[B_{t}^{2}-t|\mathcal{F}_{s}]\\ & =\mathbf{E}[B_{t}^{2}|\mathcal{F}_{s}]-t\\ & =\mathbf{E}[(B_{t}-B_{s}+B_{s})^{2}|\mathcal{F}_{s}]-t\\ & =\mathbf{E}[(B_{t}-B_{s})^{2}|\mathcal{F}_{s}]+2\mathbf{E}[(B_{t}-B_{s})B_{s}|\mathcal{F}_{s}]+\mathbf{E}[B_{s}^{2}|\mathcal{F}_{s}]-t\\ & =\mathbf{E}[(B_{t}-B_{s})^{2}]+2B_{s}\mathbf{E}[(B_{t}-B_{s})|\mathcal{F}_{s}]+B_{s}^{2}-t\\ & =\mathbf{E}[(B_{t}-B_{s})^{2}]+2B_{s}\mathbf{E}[(B_{t}-B_{s})]+B_{s}^{2}-t\\ & \left\{ \begin{array}{c} \text{\ensuremath{(B_{t}-B_{s})} is independent of \ensuremath{\mathcal{F}_{s}}}\\ \text{Also, \ensuremath{B_{s}} is known at time \ensuremath{s}} \end{array}\right\} \\ & =(t-s)+2B_{s}\cdot0+B_{s}^{2}-t\\ & =B_{s}^{2}-s\\ & =M_{s} \end{aligned}\]

(Other important martingales).

(1) Symmetric random walks. This is an example of a martingale in discrete time. Take \((X_{i}:i\in\mathbf{N})\) to be IID random variables with \(\mathbf{E}[X_{i}]=0\) and \(\mathbf{E}[|X_{i}|]<\infty\). Take \(\mathcal{F}_{n}=\sigma(X_{i},i\leq n)\) and

\[\begin{aligned} S_{n} & =X_{1}+X_{2}+\ldots+X_{n},\quad S_{0}=0 \end{aligned}\]

Firstly, the information learned by observing the outcomes of \(X_{1}\),\(\ldots\),\(X_{n}\) is enough to completely determine \(S_{n}\). Hence, \(S_{n}\) is \(\mathcal{F}_{n}-\)measurable.

Next, \[\begin{aligned} |S_{n}| & =\left|\sum_{i=1}^{n}X_{i}\right|\\ & \leq\sum_{i=1}^{n}|X_{i}| \end{aligned}\]

Consequently, by the monotonicity and linearity of expectation, we have:

\[\begin{aligned} \mathbf{E}[|S_{n}|] & \leq\sum_{i=1}^{n}\mathbf{E}[|X_{i}|]<\infty \end{aligned}\]

The martingale property is also satisfied. We have:

\[\begin{aligned} \mathbf{E}[S_{n+1}|\mathcal{F}_{n}] & =\mathbf{E}[S_{n}+X_{n+1}|\mathcal{F}_{n}]\\ & =\mathbf{E}[S_{n}|\mathcal{F}_{n}]+\mathbf{E}[X_{n+1}|\mathcal{F}_{n}]\\ & =S_{n}+\mathbf{E}[X_{n+1}]\\ & \left\{ \begin{array}{c} \text{\ensuremath{S_{n}} is \ensuremath{\mathcal{F}_{n}}-measurable}\\ \text{\ensuremath{X_{n+1}} is independent of \ensuremath{\mathcal{F}_{n}}} \end{array}\right\} \\ & =S_{n}+0\\ & =S_{n} \end{aligned}\]
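A small simulation (illustrative, not part of the text) confirming that the sample mean of \(S_{n}\) stays near \(0\), consistent with \(\mathbf{E}[S_{n}]=\mathbf{E}[S_{0}]=0\), in the special case of \(\pm1\) steps:

import numpy as np

# Simulate 100,000 symmetric +-1 random walks of length 50; the sample mean
# of S_n should be close to 0 for every n.
rng = np.random.default_rng(0)
steps = rng.choice([-1, 1], size=(100_000, 50))
S = steps.cumsum(axis=1)
print(np.abs(S.mean(axis=0)).max())   # largest deviation of the sample means from 0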

(2) Compensated Poisson process. Let \((N_{t}:t\geq0)\) be a Poisson process with rate \(\lambda\) and \(\mathcal{F}_{t}=\sigma(N_{s},s\leq t)\). Then, \(N_{t}\) is a submartingale for its natural filtration. Again, properties (1) and (2) are easily checked: \(N_{t}\) is \(\mathcal{F}_{t}-\)measurable, and \(\mathbf{E}[|N_{t}|]=\mathbf{E}[N_{t}]=\lambda t<\infty\). The submartingale property follows by the independence of increments: for \(s\leq t\),

\[\begin{aligned} \mathbf{E}[N_{t}|\mathcal{F}_{s}] & =\mathbf{E}[N_{t}-N_{s}+N_{s}|\mathcal{F}_{s}]\\ & =\mathbf{E}[N_{t}-N_{s}|\mathcal{F}_{s}]+\mathbf{E}[N_{s}|\mathcal{F}_{s}]\\ & =\mathbf{E}[N_{t}-N_{s}]+N_{s}\\ & =\lambda(t-s)+N_{s}\\ & \left\{ \because\mathbf{E}[N_{t}]=\lambda t\right\} \end{aligned}\]

More importantly, we get a martingale by slightly modifying the process. Indeed, if we subtract \(\lambda t\), we have that the process :

\[\begin{aligned} M_{t} & =N_{t}-\lambda t \end{aligned}\]

is a martingale. We have:

\[\begin{aligned} \mathbf{E}[M_{t}|\mathcal{F}_{s}] & =\mathbf{E}[N_{t}-\lambda t|\mathcal{F}_{s}]\\ & =\lambda t-\lambda s+N_{s}-\lambda t\\ & =N_{s}-\lambda s\\ & =M_{s} \end{aligned}\]

This is called the compensated Poisson process. Let us simulate \(10\) paths of the compensated Poisson process on \([0,10]\).

import numpy as np
import matplotlib.pyplot as plt

# Generates a sample path of a compensated Poisson process
# with rate `lambda_` per unit time on the interval [0,T],
# discretized into subintervals of size `stepSize`.

def generateCompensatedPoissonPath(lambda_, T, stepSize):
    N = int(T/stepSize)

    # Number of arrivals in each subinterval is Poisson(lambda_ * stepSize)
    poissonParam = lambda_ * stepSize

    x = np.random.poisson(lam=poissonParam, size=N)
    x = np.concatenate([[0.0], x])
    N_t = np.cumsum(x)                              # Poisson process N_t on the grid
    t = np.linspace(start=0.0, stop=T, num=N+1)

    M_t = N_t - lambda_ * t                         # compensated process M_t = N_t - lambda * t
    return M_t


T, stepSize = 10.0, 0.01
t = np.linspace(0, T, int(T/stepSize) + 1)

plt.xlabel(r'Time $t$')
plt.ylabel(r'Compensated Poisson process $M(t)$')
plt.grid(True)
plt.title(r'$10$ paths of the compensated Poisson process on $[0,10]$')

for i in range(10):
    # Generate a compensated Poisson path with rate 1 per unit time
    m_t = generateCompensatedPoissonPath(lambda_=1.0, T=T, stepSize=stepSize)
    plt.plot(t, m_t)


plt.show()
plt.close()
plt.close()

We saw in the two examples that, even though a process is not itself a martingale, we can sometimes compensate it to obtain a martingale! Ito Calculus will greatly extend this perspective. We will have systematic rules that tell us when a function of Brownian motion is a martingale and, if not, how to modify it to get one.

For now, we observe that a convex function of a martingale is always a submartingale by Jensen’s inequality.

If \(c\) is a convex function on \(\mathbf{R}\) and \((M_{t}:t\geq0)\) is a martingale for \((\mathcal{F}_{t}:t\geq0)\), then the process \((c(M_{t}):t\geq0)\) is a submartingale for the same filtration, granted that \(\mathbf{E}[|c(M_{t})|]<\infty\).

Proof. The fact that \(c(M_{t})\) is adapted to the filtration is clear, since it is an explicit function of \(M_{t}\). The integrability holds by assumption. The submartingale property is checked using conditional Jensen’s inequality: for \(s\leq t\),

\[\begin{aligned} \mathbf{E}[c(M_{t})|\mathcal{F}_{s}] & \geq c(\mathbf{E}[M_{t}|\mathcal{F}_{s}])=c(M_{s}) \end{aligned}\]

(The Doob-Meyer Decomposition Theorem). Let \((X_{n}:n\in\mathbf{N})\) be a submartingale with respect to a filtration \((\mathcal{F}_{n}:n\in\mathbf{N})\). Define a sequence of random variables \((A_{n}:n\in\mathbf{N})\) by \(A_{0}=0\) and

\[\begin{aligned} A_{n} & =\sum_{i=1}^{n}(\mathbf{E}[X_{i}|\mathcal{F}_{i-1}]-X_{i-1}),\quad n\geq1 \end{aligned}\]

Note that \(A_{n}\) is \(\mathcal{F}_{n-1}\)-measurable. Moreover, since \((X_{n}:n\in\mathbf{N})\) is a submartingale, we have \(\mathbf{E}[X_{i}|\mathcal{F}_{i-1}]-X_{i-1}\geq0\) almost surely. Hence, \((A_{n}:n\in\mathbf{N})\) is an increasing sequence almost surely. Let \(M_{n}=X_{n}-A_{n}\).

We have:

\[\begin{aligned} \mathbf{E}[M_{n}|\mathcal{F}_{n-1}] & =\mathbf{E}[X_{n}-A_{n}|\mathcal{F}_{n-1}]\\ & =\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]-\mathbf{E}[A_{n}|\mathcal{F}_{n-1}]\\ & =\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]-\mathbf{E}\left[\left.\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]-X_{n-1}+A_{n-1}\right|\mathcal{F}_{n-1}\right]\\ & =\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]-\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]+\mathbf{E}[X_{n-1}|\mathcal{F}_{n-1}]-\mathbf{E}[A_{n-1}|\mathcal{F}_{n-1}]\\ & =\cancel{\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]}-\cancel{\mathbf{E}[X_{n}|\mathcal{F}_{n-1}]}+X_{n-1}-A_{n-1}\\ & =M_{n-1} \end{aligned}\]

Thus, \((M_{n}:n\in\mathbf{N})\) is a martingale. Thus, we have obtained the Doob decomposition:

\[\begin{aligned} X_{n} & =M_{n}+A_{n}\label{eq:doob-decomposition} \end{aligned}\]

This decomposition of a submartingale as a sum of a martingale and an adapted increasing sequence is unique, if we require that \(A_{0}=0\) and that \(A_{n}\) is \(\mathcal{F}_{n-1}\)-measurable.
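As a quick worked illustration, take the submartingale \(S_{n}^{2}\), where \(S_{n}\) is a symmetric \(\pm1\) random walk with \(S_{0}=0\) and increments \(X_{n}\) (a special case of the random walk example above). Since

\[\begin{aligned} \mathbf{E}[S_{n}^{2}|\mathcal{F}_{n-1}] & =\mathbf{E}[(S_{n-1}+X_{n})^{2}|\mathcal{F}_{n-1}]=S_{n-1}^{2}+2S_{n-1}\mathbf{E}[X_{n}]+\mathbf{E}[X_{n}^{2}]=S_{n-1}^{2}+1 \end{aligned}\]

the increments of the compensator are \(\mathbf{E}[S_{i}^{2}|\mathcal{F}_{i-1}]-S_{i-1}^{2}=1\), so \(A_{n}=n\) and the Doob decomposition reads \(S_{n}^{2}=(S_{n}^{2}-n)+n\), the discrete analogue of the compensated square of Brownian motion.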

For the continuous-time case, the situation is much more complicated. The analogue of equation ([eq:doob-decomposition]) is called the Doob-Meyer decomposition. We briefly describe this decomposition and avoid the technical details. All stochastic processes \(X(t)\) are assumed to be right-continuous with left-hand limits \(X(t-)\).

Let \(X(t)\), \(a\leq t\leq b\) be a submartingale with respect to a right-continuous filtration \((\mathcal{F}_{t}:a\leq t\leq b)\). If \(X(t)\) satisfies certain conditions, then it can be uniquely decomposed as:

\[\begin{aligned} X(t) & =M(t)+C(t),\quad a\leq t\leq b \end{aligned}\]

where \(M(t)\), \(a\leq t\leq b\) is a martingale with respect to \((\mathcal{F}_{t};a\leq t\leq b)\), \(C(t)\) is right-continuous and increasing almost surely with \(\mathbf{E}[C(t)]<\infty\).

(Square of a Poisson Process). Let \((N_{t}:t\geq0)\) be a Poisson process with rate \(\lambda\). We consider the compensated process \(M_{t}=N_{t}-\lambda t\). By ([corollary:the-convex-function-of-martingale-is-a-submartingale]), the process \((M_{t}^{2}:t\geq0)\) is a submartingale for the filtration \((\mathcal{F}_{t}:t\geq0)\) of the Poisson process. How should we compensate \(M_{t}^{2}\) to get a martingale? A direct computation using the properties of conditional expectation yields:

\[\begin{aligned} \mathbf{E}[M_{t}^{2}|\mathcal{F}_{s}] & =\mathbf{E}[(M_{t}-M_{s}+M_{s})^{2}|\mathcal{F}_{s}]\\ & =\mathbf{E}[(M_{t}-M_{s})^{2}+2(M_{t}-M_{s})M_{s}+M_{s}^{2}|\mathcal{F}_{s}]\\ & =\mathbf{E}[(M_{t}-M_{s})^{2}|\mathcal{F}_{s}]+2\mathbf{E}[(M_{t}-M_{s})M_{s}|\mathcal{F}_{s}]+\mathbf{E}[M_{s}^{2}|\mathcal{F}_{s}]\\ & =\mathbf{E}[(M_{t}-M_{s})^{2}]+2M_{s}\underbrace{\mathbf{E}[M_{t}-M_{s}]}_{\text{equals \ensuremath{0}}}+M_{s}^{2}\\ & =\mathbf{E}[(M_{t}-M_{s})^{2}]+M_{s}^{2} \end{aligned}\]

Now, if \(X\sim\text{Poisson}(\lambda t)\), then \(\mathbf{E}[X]=\lambda t\) and \(\mathbf{E}[X^{2}]=\lambda t(\lambda t+1)\).

\[\begin{aligned} \mathbf{E}[(M_{t}-M_{s})^{2}] & =\mathbf{E}\left[\left\{ (N_{t}-N_{s})-\lambda(t-s)\right\} ^{2}\right]\\ & =\mathbf{E}\left[(N_{t}-N_{s})^{2}\right]-2\lambda(t-s)\mathbf{E}\left[(N_{t}-N_{s})\right]+\lambda^{2}(t-s)^{2}\\ & =\lambda^{2}(t-s)^{2}+\lambda(t-s)-2\lambda(t-s)\cdot\lambda(t-s)+\lambda^{2}(t-s)^{2}\\ & =\lambda(t-s) \end{aligned}\]

Thus,

\[\begin{aligned} \mathbf{E}[M_{t}^{2}-\lambda t|\mathcal{F}_{s}] & =M_{s}^{2}-\lambda s \end{aligned}\]

We conclude that the process \((M_{t}^{2}-\lambda t:t\geq0)\) is a martingale. The Doob-Meyer decomposition of the submartingale \(M_{t}^{2}\) is then:

\[\begin{aligned} M_{t}^{2} & =(M_{t}^{2}-\lambda t)+\lambda t \end{aligned}\]
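A small Monte Carlo check (illustrative parameter values, not part of the text) that \(\mathbf{E}[M_{t}^{2}]=\lambda t\) for the compensated Poisson process:

import numpy as np

# Sample N_t ~ Poisson(lambda*t), form M_t = N_t - lambda*t, and compare the
# sample second moment of M_t with lambda*t.
rng = np.random.default_rng(0)
lam, t = 2.0, 3.0
N_t = rng.poisson(lam * t, size=1_000_000)
M_t = N_t - lam * t
print((M_t**2).mean(), lam * t)   # both should be close to 6.0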

Consider a Brownian motion \(B(t)\). The quadratic variation of the process \((B(t):t\geq0)\) over the interval \([0,t]\) is given by \([B]_{t}=t\). On the other hand, we saw that the compensated square of Brownian motion, \((B_{t}^{2}-t:t\geq0)\), is a martingale. Hence, the Doob-Meyer decomposition of \(B(t)^{2}\) is given by:

\[\begin{aligned} B(t)^{2} & =(B(t)^{2}-t)+t \end{aligned}\]

Computations with Martingales.

Martingales are not only conceptually interesting, they are also formidable tools to compute probabilities and expectations of processes. For example, in this section, we will solve the gambler’s ruin problem for Brownian motion. For convenience, we introduce the notion of stopping time before doing so.

A random variable \(\tau:\Omega\to[0,\infty]\) is said to be a stopping time for the filtration \((\mathcal{F}_{t}:t\geq0)\) if and only if:

\[\begin{aligned} \{\omega:\tau(\omega)\leq t\} & \in\mathcal{F}_{t},\quad\forall t\geq0 \end{aligned}\] Note that since \(\mathcal{F}_{t}\) is a sigma-field, if \(\tau\) is a stopping time, then we must also have that \(\{\omega:\tau(\omega)>t\}\in\mathcal{F}_{t}\).

In other words, \(\tau\) is a stopping time, if we can decide if the events \(\{\tau\leq t\}\) occurred or not based on the information available at time \(t\).

The term stopping time comes from gambling: a gambler can decide to stop playing at a random time (depending, for example, on previous gains or losses), but the decision to stop must be based solely upon the knowledge of what happened before, and cannot depend on future outcomes. In other words, the stopping policy can only depend on past outcomes. Otherwise, it would mean that the gambler has a crystal ball.

(Examples of stopping times).

(i) First passage time. This is the first time when a process reaches a certain value. To be precise, let \(X=(X_{t}:t\geq0)\) be a process and \((\mathcal{F}_{t}:t\geq0)\) be its natural filtration. For \(a>0\), we define the first passage time at \(a\) to be:

\[\begin{aligned} \tau(\omega) & =\inf\{s\geq0:X_{s}(\omega)\geq a\} \end{aligned}\]

If the path \(\omega\) never reaches \(a\), we set \(\tau(\omega)=\infty\). Now, for \(t\) fixed and for a given path \(X(\omega)\), it is possible to know if \(\{\tau(\omega)\leq t\}\) (the path has reached \(a\) before time \(t\)) or \(\{\tau(\omega)>t\}\) (the path has not reached \(a\) before time \(t\)) with the information available at time \(t\), since we are looking at the first time the process reaches \(a\). Hence, we conclude that \(\tau\) is a stopping time.

(ii) Hitting time. More generally, we can consider the first time (if ever) that the path of a process \((X_{t}:t\geq0)\) enters or hits a subset \(B\) of \(\mathbf{R}\):

\[\begin{aligned} \tau(\omega) & =\min\{s\geq0:X_{s}(\omega)\in B\} \end{aligned}\]

The first passage time is the particular case in which \(B=[a,\infty)\).

(iii) Minimum of two stopping times. If \(\tau\) and \(\tau'\) are two stopping times for the same filtration \((\mathcal{F}_{t}:t\geq0)\), then so is the minimum \(\tau\land\tau'\) between the two, where

\[\begin{aligned} (\tau\land\tau')(\omega) & =\min\{\tau(\omega),\tau'(\omega)\} \end{aligned}\]

This is because for any \(t\geq0\):

\[\begin{aligned} \{\omega: & (\tau\land\tau')(\omega)\leq t\}=\{\omega:\tau(\omega)\leq t\}\cup\{\omega:\tau'(\omega)\leq t\} \end{aligned}\]

Since the right hand side is the union of two events in \(\mathcal{F}_{t}\), it must also be in \(\mathcal{F}_{t}\) by the properties of a sigma-field. We conclude that \(\tau\land\tau'\) is a stopping time. Is it also the case that the maximum \(\tau\lor\tau'\) is a stopping time?

For any fixed \(t\geq0\), we have:

\[\begin{aligned} \{\omega:(\tau\lor\tau')(\omega)\leq t\} & =\{\omega:\tau(\omega)\leq t\}\cap\{\omega:\tau'(\omega)\leq t\} \end{aligned}\]

Since the right hand side is the intersection of two events in \(\mathcal{F}_{t}\), it must also be in \(\mathcal{F}_{t}\) by the properties of a sigma-field. We conclude that \(\tau\lor\tau'\) is a stopping time.

(Last passage time is not a stopping time). What if we look at the last time the process reaches \(a\), that is:

\[\begin{aligned} \rho(\omega) & =\sup\{t\geq0:X_{t}(\omega)\geq a\} \end{aligned}\]

This is a well-defined random variable, but it is not a stopping time. Based on the information available at time \(t\), we are not able to decide whether \(\{\rho(\omega)\leq t\}\) occurred, as the path can always reach \(a\) one more time after \(t\).

It turns out that a martingale that is stopped when the stopping time is attained remains a martingale.

(Stopped Martingale). If \((M_{t}:t\geq0)\) is a continuous martingale for the filtration \((\mathcal{F}_{t}:t\geq0)\) and \(\tau\) is a stopping time for the same filtration, then the stopped process defined by \[\begin{aligned} M_{t\land\tau} & =\begin{cases} M_{t} & t\leq\tau\\ M_{\tau} & t>\tau \end{cases} \end{aligned}\]

is also a continuous martingale for the same filtration.

[]{#th:doob’s-optional-sampling-theorem label=“th:doob’s-optional-sampling-theorem”}(Doob’s Optional sampling theorem). If \((M_{t}:t\geq0)\) is a continuous martingale for the filtration \((\mathcal{F}_{t}:t\geq0)\) and \(\tau\) is a stopping time such that \(\tau<\infty\) almost surely and the stopped process \((M_{t\land\tau}:t\geq0)\) is bounded, then:

\[\begin{aligned} \mathbf{E}[M_{\tau}] & =M_{0} \end{aligned}\]

Proof. Since \((M_{\tau\land t}:t\geq0)\) is a martingale, we always have:

\[\begin{aligned} \mathbf{E}[M_{\tau\land t}] & =M_{0} \end{aligned}\]

Now, since \(\tau(\omega)<\infty\) almost surely, we have that \(\lim_{t\to\infty}M_{\tau\land t}=M_{\tau}\) almost surely. In particular, we have:

\[\begin{aligned} \mathbf{E}[M_{\tau}] & =\mathbf{E}\left[\lim_{t\to\infty}M_{\tau\land t}\right]=\lim_{t\to\infty}\mathbf{E}[M_{\tau\land t}]=M_{0} \end{aligned}\]

where we passed the limit inside the expectation using the dominated convergence theorem ([th:dominated-convergence-theorem]), which applies since the stopped process is bounded. ◻

(Gambler’s ruin with Brownian motion). The gambler’s ruin problem is known in different forms. Roughly speaking, it refers to the problem of computing the probability of a gambler making a series of bets reaching a certain amount before going broke. In terms of Brownian motion (and stochastic processes in general), it translates to the following questions: Let \((B_{t}:t\geq0)\) be a standard brownian motion starting at \(B_{0}=0\) and \(a,b>0\).

(1) What is the probability that a Brownian path reaches \(a\) before \(-b\)?

(2) What is the expected waiting time for the path to reach \(a\) or \(-b\)?

For the first question, it is a simple computation using stopping time and martingale properties. Define the hitting time:

\[\begin{aligned} \tau(\omega) & =\inf\{t\geq0:B_{t}(\omega)\geq a\text{ or }B_{t}(\omega)\leq-b\} \end{aligned}\]

Note that \(\tau\) is the minimum between the first passage time at \(a\) and the one at \(-b\).

We first show that \(\tau<\infty\) almost surely. In other words, almost all Brownian paths reach \(a\) or \(-b\) eventually. To see this, consider the event \(E_{n}\) that the \(n\)-th unit-time increment exceeds \(a+b\) in absolute value:

\[\begin{aligned} E_{n} & :=\left\{ |B_{n}-B_{n-1}|>a+b\right\} \end{aligned}\]

Note that if \(E_{n}\) occurs, then the Brownian path must exit the interval \([-b,a]\) by time \(n\): the interval has length \(a+b\), so \(B_{n-1}\) and \(B_{n}\) cannot both lie in it. Moreover, we have \(\mathbb{P}(E_{n})=\mathbb{P}(E_{1})>0\) for all \(n\). Call this probability \(p\).

Since the events \(E_{n}\) are independent, we have:

\[\begin{aligned} \mathbb{P}(E_{1}^{C}\cap E_{2}^{C}\cap\ldots\cap E_{n}^{C}) & =(1-p)^{n} \end{aligned}\]

As \(n\to\infty\) we have:

\[\begin{aligned} \lim_{n\to\infty}\mathbb{P}(E_{1}^{C}\cap E_{2}^{C}\cap\ldots\cap E_{n}^{C}) & =0 \end{aligned}\]

The sequence of events \((F_{n})\) where \(F_{n}=E_{1}^{C}\cap E_{2}^{C}\cap\ldots\cap E_{n}^{C}\) is a decreasing sequence of events. By the continuity of probability measure lemma ([th:continuity-property-of-lebesgue-measure]), we conclude that:

\[\begin{aligned} \lim_{n\to\infty}\mathbb{P}\left(F_{n}\right) & =\mathbb{P}\left(\bigcap_{n=1}^{\infty}F_{n}\right)=0 \end{aligned}\]

Therefore, it must be the case that \(\mathbb{P}(\cup_{n=1}^{\infty}E_{n})=1\). So some \(E_{n}\) occurs almost surely, and hence almost every Brownian path exits \([-b,a]\), that is, reaches \(a\) or \(-b\).

Since \(\tau<\infty\) with probability one, the random variable \(B_{\tau}\) is well-defined : \(B_{\tau}(\omega)=B_{t}(\omega)\) if \(\tau(\omega)=t\). It can only take two values: \(a\) or \(-b\). Question (1) above translates into computing \(\mathbb{P}(B_{\tau}=a)\). On one hand, we have:

\[\begin{aligned} \mathbf{E}[B_{\tau}] & =a\mathbb{P}(B_{\tau}=a)+(-b)(1-\mathbb{P}(B_{\tau}=a)) \end{aligned}\]

On the other hand, by corollary ([th:doob's-optional-sampling-theorem]), we have \(\mathbf{E}[B_{\tau}]=\mathbf{E}[B_{0}]=0\). (Note that the stopped process \((B_{t\land\tau}:t\geq0)\) is bounded above by \(a\) and by \(-b\) below). Putting these two observations together, we get:

\[\begin{aligned} \mathbb{P}(B_{\tau}=a) & =\frac{b}{a+b} \end{aligned}\]

A very simple and elegant answer!
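A minimal Monte Carlo sketch of this result (an illustration, not part of the text): we discretize Brownian motion with a small time step and count the proportion of paths that reach \(a\) before \(-b\). The step size, horizon, and path count are arbitrary choices, and paths that have not exited by the horizon are counted as not reaching \(a\), which introduces a small bias.

import numpy as np

# Estimate P(B_tau = a) by simulating discretized Brownian paths until they
# exit [-b, a] (or until an arbitrary time horizon is reached).
def estimateRuinProbability(a, b, dt=0.01, numPaths=2000, maxTime=100.0, seed=0):
    rng = np.random.default_rng(seed)
    numSteps = int(maxTime / dt)
    hits_a = 0
    for _ in range(numPaths):
        B = np.cumsum(np.sqrt(dt) * rng.standard_normal(numSteps))
        first_above = np.argmax(B >= a) if np.any(B >= a) else numSteps
        first_below = np.argmax(B <= -b) if np.any(B <= -b) else numSteps
        if first_above < first_below:
            hits_a += 1
    return hits_a / numPaths

# With a = 1 and b = 2, the estimate should be close to b/(a+b) = 2/3.
print(estimateRuinProbability(a=1.0, b=2.0))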

We will revisit this problem again and again. In particular, we will answer the question above for Brownian motion with drift later on.

(Expected Waiting Time). Let \(\tau\) be as in the last example. We now answer question (2) of the gambler’s ruin problem:

\[\begin{aligned} \mathbf{E}[\tau] & =ab \end{aligned}\]

Note that the expected waiting time is consistent with the rough heuristic that Brownian motion travels a distance \(\sqrt{t}\) by time \(t\). We now use the martingale \(M_{t}=B_{t}^{2}-t\). On the one hand, if we apply optional stopping in corollary ([th:doob's-optional-sampling-theorem]), we get:

\[\begin{aligned} \mathbf{E}[M_{\tau}] & =M_{0}=0 \end{aligned}\]

Moreover, we know the distribution of \(B_{\tau}\), thanks to the probability calculated in the last example. We can therefore compute \(\mathbf{E}[M_{\tau}]\) directly:

\[\begin{aligned} 0 & =\mathbf{E}[M_{\tau}]\\ & =\mathbf{E}[B_{\tau}^{2}-\tau]\\ & =\mathbf{E}[B_{\tau}^{2}]-\mathbf{E}[\tau]\\ & =a^{2}\cdot\frac{b}{a+b}+b^{2}\cdot\frac{a}{a+b}-\mathbf{E}[\tau]\\ \mathbf{E}[\tau] & =\frac{a^{2}b+b^{2}a}{a+b}\\ & =\frac{ab\cancel{(a+b)}}{\cancel{(a+b)}}=ab \end{aligned}\]

Why can we apply optional stopping here? The random variable \(\tau\) is finite with probability \(1\) as before. However, the stopped martingale \(M_{t\land\tau}=B_{t\land\tau}^{2}-(t\land\tau)\) is not necessarily bounded: \(B_{t\land\tau}\) is bounded, but \(t\land\tau\) is not. Nevertheless, the conclusion of optional stopping still holds. Indeed, we have:

\[\begin{aligned} \mathbf{E}[M_{t\land\tau}] & =\mathbf{E}[B_{t\land\tau}^{2}]-\mathbf{E}[t\land\tau] \end{aligned}\]

By the bounded convergence theorem, we get \(\lim_{t\to\infty}\mathbf{E}[B_{t\land\tau}^{2}]=\mathbf{E}[\lim_{t\to\infty}B_{t\land\tau}^{2}]=\mathbf{E}[B_{\tau}^{2}]\). Moreover, \(t\land\tau\) is non-decreasing in \(t\) and \(t\land\tau\to\tau\) almost surely as \(t\to\infty\) (since \(\tau<\infty\)), so by the monotone convergence theorem, \(\lim_{t\to\infty}\mathbf{E}[t\land\tau]=\mathbf{E}[\tau]\). Letting \(t\to\infty\) on both sides then justifies the computation above.

(First passage time of Brownian Motion.) We can use the previous two examples to get some very interesting information on the first passage time:

\[\begin{aligned} \tau_{a} & =\inf\{t\geq0:B_{t}\geq a\} \end{aligned}\]

Let \(\tau=\tau_{a}\land\tau_{-b}\) be as in the previous examples with \(\tau_{-b}=\inf\{t\geq0:B_{t}\leq-b\}\). Note that \((\tau_{-b}:b>0)\) is a family of random variables that is increasing in \(b\): a Brownian path must cross \(-1\) before it hits \(-2\) for the first time, and in general \(\tau_{-n}(\omega)\leq\tau_{-(n+1)}(\omega)\). Moreover, we have \(\tau_{-b}\to\infty\) almost surely as \(b\to\infty\), since a continuous path is bounded on any finite time interval, so it cannot reach \(-b\) by a given time once \(b\) is large enough. Recall also that \(\mathbb{P}\{\tau<\infty\}=1\), and that the event \(\{B_{\tau}=a\}\) is the same as \(\{\tau_{a}<\tau_{-b}\}\). Now, the events \(\{\tau_{a}<\tau_{-b}\}\) are increasing in \(b\), since if a path reaches \(a\) before \(-b\), it will do so as well for a more negative value of \(-b\). On one hand, this means by the continuity of probability measure lemma ([th:continuity-property-of-lebesgue-measure]) that:

\[\begin{aligned} \lim_{b\to\infty}\mathbb{P}\left\{ \tau_{a}<\tau_{-b}\right\} & =\mathbb{P}\left(\bigcup_{b>0}\left\{ \tau_{a}<\tau_{-b}\right\} \right)\\ & =\mathbb{P}\{\tau_{a}<\infty\} \end{aligned}\]

On the other hand, we have by example ([example:probability-of-hitting-times])

\[\begin{aligned} \lim_{b\to\infty}\mathbb{P}\left\{ \tau_{a}<\tau_{-b}\right\} & =\lim_{b\to\infty}\mathbb{P}\{B_{\tau}=a\}\\ & =\lim_{b\to\infty}\frac{b}{b+a}\\ & =1 \end{aligned}\]

We just showed that:

\[\begin{aligned} \mathbb{P}\left\{ \tau_{a}<\infty\right\} & =1\label{eq:first-passage-time-to-a-is-finite-almost-surely} \end{aligned}\]

In other words, every Brownian path will reach \(a\), no matter how large \(a\) is!

How long will it take to reach \(a\) on average? Well, we know from example ([ex:expected-waiting-times]) that \(\mathbf{E}[\tau_{a}\land\tau_{-b}]=ab\). On one hand this means,

\[\begin{aligned} \lim_{b\to\infty}\mathbf{E}[\tau_{a}\land\tau_{-b}] & =\lim_{b\to\infty}ab=\infty \end{aligned}\]

On the other hand, since the random variables \(\tau_{-b}\) are increasing,

\[\begin{aligned} \lim_{b\to\infty}\mathbf{E}[\tau_{a}\land\tau_{-b}] & =\mathbf{E}\left[\lim_{b\to\infty}\tau_{a}\land\tau_{-b}\right]=\mathbf{E}[\tau_{a}] \end{aligned}\]

by the monotone convergence theorem ([th:monotone-convergence-theorem]). We just proved that:

\[\begin{aligned} \mathbf{E}[\tau_{a}] & =\infty \end{aligned}\]

In other words, any Brownian motion path will reach \(a\), but the expected waiting time for this to occur is infinite, no matter how small \(a\) is! What is happening here? No matter how small \(a\) is, there are always paths that reach very large negative values before hitting \(a\). These paths might be unlikely, but their first passage times are so large that they affect the value of the expectation substantially. In other words, \(\tau_{a}\) is a heavy-tailed random variable. We look at the distribution of \(\tau_{a}\) in more detail in the next section.

(When optional stopping fails). Consider \(\tau_{a}\), the first passage time at \(a>0\). The random variable \(B_{\tau_{a}}\) is well-defined since \(\tau_{a}<\infty\). In fact, we have \(B_{\tau_{a}}=a\) with probability one. Therefore, the following must hold:

\[\begin{aligned} \mathbf{E}[B_{\tau_{a}}] & =a\neq B_{0} \end{aligned}\]

The optional stopping theorem, corollary ([th:doob’s-optional-sampling-theorem]), does not apply here, since the stopped process \((B_{t\land\tau_{a}}:t\geq0)\) is not bounded: \(B_{t\land\tau_{a}}\) can become arbitrarily negative before hitting \(a\).

Reflection principle for Brownian motion.

(Bachelier’s formula). Let \((B_{t}:t\leq T)\) be a standard brownian motion on \([0,T].\) Then, the CDF of the random variable \(\sup_{0\leq t\leq T}B_{t}\) is:

\[\begin{aligned} \mathbb{P}\left(\sup_{0\leq t\leq T}B_{t}\leq a\right) & =\mathbb{P}\left(|B_{T}|\leq a\right) \end{aligned}\]

In particular, its PDF is:

\[\begin{aligned} f_{\max}(a) & =\frac{2}{\sqrt{2\pi T}}e^{-\frac{a^{2}}{2T}} \end{aligned}\]

We can verify these results empirically. Note that, for a given \(\omega\), the paths \(t\mapsto\max_{0\leq s\leq t}B_{s}\) and \(t\mapsto|B_{t}|\) are very different: one is increasing and the other is not. The equality holds in distribution for each fixed \(t\). As a bonus corollary, we get the distribution of the first passage time at \(a\).

Let \(a\geq0\) and \(\tau_{a}=\inf\{t\geq0:B_{t}\geq a\}\). Then:

\[\begin{aligned} \mathbb{P}\left(\tau_{a}\leq T\right) & =\mathbb{P}\left(\max_{0\leq t\leq T}B_{t}\geq a\right)=\int_{a}^{\infty}\frac{2}{\sqrt{2\pi T}}e^{-\frac{x^{2}}{2T}}dx \end{aligned}\]

In particular, the random variable \(\tau_{a}\) has the PDF:

\[\begin{aligned} f_{\tau_{a}}(t) & =\frac{a}{\sqrt{2\pi}}\frac{e^{-\frac{a^{2}}{2t}}}{t^{3/2}},\quad t>0 \end{aligned}\]

This implies that it is heavy-tailed with \(\mathbf{E}[\tau_{a}]=\infty\).

Proof. The maximum on \([0,T]\) is larger than or equal to \(a\) if and only if \(\tau_{a}\leq T\). Therefore, the events \(\{\max_{0\leq t\leq T}B_{t}\geq a\}\) and \(\{\tau_{a}\leq T\}\) are the same. So, by proposition ([prop:bacheliers-formula]), the CDF of \(\tau_{a}\) is \(\mathbb{P}(\tau_{a}\leq t)=\int_{a}^{\infty}f_{\max}(x)dx=\int_{a}^{\infty}\frac{2}{\sqrt{2\pi t}}e^{-\frac{x^{2}}{2t}}dx=2\left(1-\Phi(a/\sqrt{t})\right)\). Differentiating with respect to \(t\), we get:

\[\begin{aligned} f_{\tau_{a}}(t) & =-2\phi(a/\sqrt{t})\cdot a\cdot\left(-\frac{1}{2t^{3/2}}\right)\\ & =\frac{a}{t^{3/2}}\phi\left(\frac{a}{\sqrt{t}}\right)\\ & =\frac{a}{t^{3/2}}\cdot\frac{1}{\sqrt{2\pi}}e^{-\frac{a^{2}}{2t}} \end{aligned}\]

To estimate the expectation, it suffices to realize that for \(t\geq1\), \(e^{-\frac{a^{2}}{2t}}\) is larger than \(e^{-\frac{a^{2}}{2}}\). Therefore, we have:

\[\begin{aligned} \mathbf{E}[\tau_{a}] & =\int_{0}^{\infty}t\frac{a}{\sqrt{2\pi}}\frac{e^{-a^{2}/2t}}{t^{3/2}}dt\geq\frac{ae^{-a^{2}/2}}{\sqrt{2\pi}}\int_{1}^{\infty}t^{-1/2}dt \end{aligned}\]

The integral \(\int_{1}^{\infty}t^{-1/2}dt\) is improper and diverges (its antiderivative grows like \(\sqrt{t}\)), so \(\mathbf{E}[\tau_{a}]=\infty\) as claimed. ◻

To prove proposition ([prop:bacheliers-formula]), we will need an important property of Brownian motion called the reflection principle. To motivate it, recall the reflection symmetry of Brownian motion at time \(s\) in proposition ([prop:brownian-motion-symmetry-of-reflection-at-time-s]). It turns out that this reflection property also holds if \(s\) is replaced by a stopping time.

(Reflection principle). Let \((B_{t}:t\geq0)\) be a standard Brownian motion and let \(\tau\) be a stopping time for its filtration. Then, the process \((\tilde{B}_{t}:t\geq0)\) defined by the reflection at time \(\tau\):

\[\begin{aligned} \tilde{B}_{t} & =\begin{cases} B_{t} & \text{if \ensuremath{t\leq\tau}}\\ B_{\tau}-(B_{t}-B_{\tau}) & \text{if \ensuremath{t>\tau}} \end{cases} \end{aligned}\]

is also a standard brownian motion.

We defer the proof of the reflection property of Brownian motion to a later section. It is intuitive and instructive to quickly picture it in the discrete-time setting. I adopt the approach in Shreve-I.

We repeatedly toss a fair coin (\(p\), the probability of \(H\) on each toss, and \(q=1-p\), the probability of \(T\) on each toss, are both equal to \(\frac{1}{2}\)). We denote the successive outcomes of the tosses by \(\omega_{1}\omega_{2}\omega_{3}\ldots\). Let

\[\begin{aligned} X_{j} & =\begin{cases} -1 & \text{if \ensuremath{\omega_{j}=H}}\\ +1 & \text{if \ensuremath{\omega_{j}=T}} \end{cases} \end{aligned}\]

and define \(M_{0}=0\), \(M_{n}=\sum_{j=1}^{n}X_{j}\). The process \((M_{n}:n\in\mathbf{N})\) is a symmetric random walk.

Suppose we toss a coin an odd number \((2j-1)\) of times. Some of the paths will reach level \(1\) in the first \(2j-1\) steps and others will not. In the case of \(3\) tosses, there are \(2^{3}=8\) possible paths and \(5\) of these reach level \(1\) at some time \(\tau_{1}\leq2j-1\). From that moment on, we can create a reflected path, which steps up each time the original path steps down and steps down each time the original path steps up. If the original path ends above \(1\) at the final time \(2j-1\), the reflected path ends below \(1\), and vice versa. If the original path ends at \(1\), the reflected path does also. In fact, the walk reflected at the first hitting time has the same distribution as the original random walk.

The key here is, out of the \(5\) paths that reach level \(1\) at some time, there are as many reflected paths that exceed \(1\) at time \((2j-1)\) as there are original paths that exceed \(1\) at time \((2j-1)\). So, to count the total number of paths that reach level \(1\) by time \((2j-1)\), we can count the paths that are at \(1\) at time \((2j-1)\) and then add on twice the number of paths that exceed \(1\) at time \((2j-1)\).

With this new tool, we can now prove proposition ([prop:bacheliers-formula]).

Proof. Consider \(\mathbb{P}(\max_{t\leq T}B_{t}\geq a)\). By splitting this probability according to the position of the endpoint \(B_{T}\), we have:

\[\begin{aligned} \mathbb{P}\left(\max_{t\leq T}B_{t}\geq a\right) & =\mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,B_{T}>a\right)+\mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,B_{T}\leq a\right) \end{aligned}\]

Note also that \(\mathbb{P}(B_{T}=a)=0\). Hence, the first probability equals \(\mathbb{P}(B_{T}\geq a)\). As for the second, consider the time \(\tau_{a}\). On the event considered, we have \(\tau_{a}\leq T\), and using lemma ([lemma:BM-reflection-principle]) with the reflection at time \(\tau_{a}\), we get

\[\begin{aligned} \mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,B_{T}\leq a\right) & =\mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,\tilde{B}_{T}\geq a\right) \end{aligned}\]

Observe that the event \(\{\max_{t\leq T}B_{t}\geq a\}\) is the same as \(\{\max_{t\leq T}\tilde{B}_{t}\geq a\}\). (A rough picture might help here.) Therefore, the above probability is

\[\begin{aligned} \mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,B_{T}\leq a\right) & =\mathbb{P}\left(\max_{t\leq T}\tilde{B}_{t}\geq a,\tilde{B}_{T}\geq a\right)=\mathbb{P}\left(\max_{t\leq T}B_{t}\geq a,B_{T}\geq a\right) \end{aligned}\]

where the last equality follows from the reflection principle: \((\tilde{B}_{t})\) is also a standard Brownian motion, so the pair \((\max_{t\leq T}\tilde{B}_{t},\tilde{B}_{T})\) has the same joint distribution as \((\max_{t\leq T}B_{t},B_{T})\). But, as above, the last probability is equal to \(\mathbb{P}(B_{T}\geq a)\). We conclude that:

\[\begin{aligned} \mathbb{P}\left(\max_{t\leq T}B_{t}\geq a\right) & =2\mathbb{P}(B_{T}\geq a)=\frac{2}{\sqrt{2\pi T}}\int_{a}^{\infty}e^{-\frac{x^{2}}{2T}}dx=\mathbb{P}(|B_{T}|\geq a) \end{aligned}\]

This implies in particular that \(\mathbb{P}\left(\max_{t\leq T}B_{t}=a\right)=0\). Thus, we also have \(\mathbb{P}(\max_{t\leq T}B_{t}\leq a)=\mathbb{P}(|B_{T}|\leq a)\) as claimed. ◻

(Simulating Martingales) Sample \(10\) paths of the following process with a step-size of \(0.01\):

(a) \(B_{t}^{2}-t\), \(t\in[0,1]\)

(b) Geometric Brownian motion : \(S_{t}=\exp(B_{t}-t/2)\), \(t\in[0,1]\).

Let’s write a simple \(\texttt{BrownianMotion}\) class, that we shall use to generate sample paths.

import numpy as np
import matplotlib.pyplot as plt

import attrs
from attrs import define, field

@define
class BrownianMotion:
    _step_size = field(validator=attrs.validators.and_(attrs.validators.instance_of(float),
                                                       attrs.validators.ge(0.0)))
    # Time T
    _T = field(validator=attrs.validators.and_(attrs.validators.instance_of(float),
                                               attrs.validators.ge(0.0)))
    # number of paths
    _N = field(validator=attrs.validators.and_(attrs.validators.instance_of(int),
                                               attrs.validators.gt(0)))

    _num_steps = field(init=False)

    def __attrs_post_init__(self):
        self._num_steps = int(self._T/self._step_size)

    def covariance_matrix(self):
        C = np.zeros((self._num_steps,self._num_steps))

        for i in range(self._num_steps):
            for j in range(self._num_steps):
                s = (i+1) * self._step_size
                t = (j+1) * self._step_size
                C[i,j] = min(s,t)
        return C

    # Each row of the returned array (shape (N, num_steps + 1)) is a sample path
    def generate_paths(self):
        C = self.covariance_matrix()
        A = np.linalg.cholesky(C)
        Z = np.random.standard_normal((self._num_steps, self._N))
        X = np.matmul(A,Z)
        X = np.concatenate((np.zeros((1,self._N)),X),axis=0)
        return X.transpose()

Now, the process \(B_{t}^{2}-t\) can be sampled as follows:


def generateSquareOfBMCompensated(numOfPaths, stepSize, T):
    N = int(T/stepSize)

    # Generate `numOfPaths` Brownian paths at once; each row is a path.
    brownianMotion = BrownianMotion(step_size=float(stepSize), T=float(T), N=numOfPaths)
    paths = brownianMotion.generate_paths()

    t = np.linspace(start=0.0, stop=T, num=N+1)

    X = []
    for B_t in paths:
        M_t = np.square(B_t) - t        # M_t = B_t^2 - t
        X.append(M_t)

    return X

The geometric Brownian motion in part (b) can be sampled similarly, with \(\texttt{M\_t = np.exp(np.subtract(B\_t, t/2))}\).

(Maximum of Brownian Motion.) Consider the maximum of Brownian motion on \([0,1]\): \(\max_{s\leq1}B_{s}\).

(a) Draw the histogram of the random variable \(\max_{s\leq1}B_{s}\) using \(10{,}000\) sampled Brownian paths with a step size of \(0.01\).

(b) Compare this to the PDF of the random variable \(|B_{1}|\).

Solution.

I use the \(\texttt{itertools}\) Python library to compute the running maximum of a Brownian motion path.


import itertools

brownianMotion = BrownianMotion(step_size=0.01, T=1.0, N=10000)
paths = brownianMotion.generate_paths()
data = []

for B_t in paths:
    # Running maximum along the path; the last entry is max over s in [0,1]
    max_B_t = list(itertools.accumulate(B_t, max))
    data.append(max_B_t[-1])

Analytically, we know that \(B_{1}\) is a Gaussian random variable with mean \(0\) and variance \(1\).

\[\begin{aligned} \mathbb{P}(|B_{1}|\leq z) & =\mathbb{P}(|Z|\leq z)\\ & =\mathbb{P}(-z\leq Z\leq z)\\ & =\mathbb{P}(Z\leq z)-\mathbb{P}(Z\leq-z)\\ & =\mathbb{P}(Z\leq z)-(1-\mathbb{P}(Z\leq z))\\ F_{|B_{1}|}(z) & =2\Phi(z)-1 \end{aligned}\]

Differentiating on both sides, we get:

\[\begin{aligned} f_{|B_{1}|}(z) & =2\phi(z)=\frac{2}{\sqrt{2\pi}}e^{-\frac{z^{2}}{2}},\quad z\in[0,\infty) \end{aligned}\]
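A short plotting sketch for part (b) (the variable \(\texttt{data}\) is the list of sampled maxima from the code above), overlaying the histogram with the density \(2\phi(z)\) of \(|B_{1}|\):

import numpy as np
import matplotlib.pyplot as plt

# Compare the histogram of the sampled running maxima with the PDF of |B_1|.
z = np.linspace(0, 4, 400)
pdf = 2.0 / np.sqrt(2 * np.pi) * np.exp(-z**2 / 2)

plt.hist(data, bins=50, density=True, alpha=0.5, label=r'Sampled $\max_{s\leq 1} B_s$')
plt.plot(z, pdf, label=r'PDF of $|B_1|$')
plt.legend()
plt.show()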

(First passage time.) Let \((B_{t}:t\geq0)\) be a standard brownian motion. Consider the random variable:

\[\begin{aligned} \tau & =\min\{t\geq0:B_{t}\geq1\} \end{aligned}\]

This is the first time that \(B_{t}\) reaches \(1\).

(a) Draw a histogram for the distribution of \(\tau\land10\) on the time-interval \([0,10]\) using \(10,000\) brownian motion paths on \([0,10]\) with discretization \(0.01\).

The notation \(\tau\land10\) means that if the path does not reach \(1\) on \([0,10]\), then give the value \(10\) to the stopping time.

(b) Estimate \(\mathbf{E}[\tau\land10]\).

(c) What proportion of paths never reach \(1\) in the time interval \([0,10]\)?

Solution.

To compute the expectation, we classify the hitting times of all paths into \(50\) bins. I simply did

\(\texttt{frequency, bins = np.histogram(firstPassageTimes,bins=50,range=(0,10))}\)

and then computed

\(\texttt{expectation=np.dot(frequency,bins[1:])/10000}\).

This expectation estimate on my machine is \(\mathbf{E}[\tau\land10]=4.34\) secs. There were approximately \(2600\) paths out of \(10,000\) that did not reach \(1\).
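For completeness, here is a sketch of how the first passage times might have been sampled using the \(\texttt{BrownianMotion}\) class above; the variable names are illustrative, and the full \(10{,}000\)-path simulation can take a little while:

import numpy as np

# Sample 10,000 Brownian paths on [0,10] with step 0.01 and record, for each
# path, the first time it reaches 1, capped at 10 if it never does.
brownianMotion = BrownianMotion(step_size=0.01, T=10.0, N=10000)
paths = brownianMotion.generate_paths()

firstPassageTimes = []
for B_t in paths:
    hits = np.where(B_t >= 1.0)[0]
    firstPassageTimes.append(hits[0] * 0.01 if hits.size > 0 else 10.0)

firstPassageTimes = np.array(firstPassageTimes)
print(firstPassageTimes.mean())               # estimate of E[tau ^ 10]
print(np.mean(firstPassageTimes >= 10.0))     # proportion of paths never reaching 1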

(Gambler’s ruin at the French Roulette). Consider the scenario in which you gamble \(\$1\) at the French roulette on red: you gain \(\$1\) with probability \(18/38\) and you lose \(\$1\) with probability \(20/38\). We estimate the probability of your fortune reaching \(\$200\) before it reaches \(\$0\).

(a) Write a function that samples the simple random walk path from time \(0\) to time \(5,000\) with a given starting point.

(b) Use the above to estimate the probability of reaching \(\$200\) before \(\$0\) on a sample of \(100\) paths if you start with \(\$100\).
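A hedged sketch of parts (a) and (b), assuming the betting probabilities stated above; the function names, seed, and path count below are illustrative choices. Since the walk drifts downward, the estimate from only \(100\) paths will typically be \(0\): the true probability is extremely small.

import numpy as np

# (a) Sample a biased +-1 random walk of 5,000 steps starting at `start`.
def sampleRouletteWalk(start, numSteps=5000, pWin=18/38, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    steps = rng.choice([1, -1], size=numSteps, p=[pWin, 1 - pWin])
    return start + np.concatenate([[0], np.cumsum(steps)])

# (b) Estimate the probability of reaching $200 before $0, starting from $100.
def estimateReach200Before0(numPaths=100, start=100, seed=1):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(numPaths):
        path = sampleRouletteWalk(start, rng=rng)
        above = np.where(path >= 200)[0]
        below = np.where(path <= 0)[0]
        first_above = above[0] if above.size > 0 else np.inf
        first_below = below[0] if below.size > 0 else np.inf
        if first_above < first_below:
            hits += 1
    return hits / numPaths

print(estimateReach200Before0())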

[]{#ex:doob’s-maximal-inequality label=“ex:doob’s-maximal-inequality”}Doob’s maximal inequalities. We prove the following: Let \((M_{k}:k\geq1)\) be a positive submartingale for the filtration \((\mathcal{F}_{k}:k\in\mathbf{N})\). Then, for any \(1\leq p<\infty\) and \(a>0\)

\[\begin{aligned} \mathbb{P}\left(\max_{k\leq n}M_{k}>a\right) & \leq\frac{1}{a^{p}}\mathbf{E}[M_{n}^{p}] \end{aligned}\]

(a) Use Jensen’s inequality to show that if \((M_{k}:k\geq1)\) is a positive submartingale, then so is \((M_{k}^{p}:k\geq1)\) for \(1\leq p<\infty\). Conclude that it suffices to prove the statement for \(p=1\).

Solution.

The function \(f(x)=x^{p}\) is convex. By conditional Jensen’s inequality,

\[\begin{aligned} \left(\mathbf{E}[M_{k+1}|\mathcal{F}_{k}]\right)^{p} & \leq\mathbf{E}[M_{k+1}^{p}|\mathcal{F}_{k}] \end{aligned}\]

Thus,

\[\begin{aligned} \mathbf{E}[M_{k+1}^{p}|\mathcal{F}_{k}] & \geq\left(\mathbf{E}[M_{k+1}|\mathcal{F}_{k}]\right)^{p}\geq M_{k}^{p} \end{aligned}\]

where the last inequality follows from the fact that \((M_{k}:k\geq1)\) is a positive submartingale, so \(\mathbf{E}[M_{k+1}|\mathcal{F}_{k}]\geq M_{k}\). Consequently, \((M_{k}^{p}:k\geq1)\) is also a positive submartingale. Moreover, since \(\{\max_{k\leq n}M_{k}>a\}=\{\max_{k\leq n}M_{k}^{p}>a^{p}\}\), applying the case \(p=1\) to the submartingale \((M_{k}^{p}:k\geq1)\) with threshold \(a^{p}\) yields the general bound. So it suffices to prove the statement for \(p=1\).

(b) Consider the events

\[\begin{aligned} B_{k} & =\bigcap_{j<k}\{\omega:M_{j}(\omega)\leq a\}\cap\{\omega:M_{k}(\omega)>a\} \end{aligned}\]

Argue that the \(B_{k}\)’s are disjoint and that \(\bigcup_{k\leq n}B_{k}=\{\max_{k\leq n}M_{k}>a\}=B\).

Solution.

Clearly, \(B_{k}\) is the event that the first time the process exceeds \(a\) is \(k\). If \(B_{k}\) occurs, then \(B_{j}\) fails to occur for every \(j\neq k\), so the \(B_{k}\)'s are pairwise disjoint. The event \(\bigcup_{k\leq n}B_{k}\) is the event that \(M_{k}>a\) for some \(k\leq n\), that is, the running maximum \(\max_{k\leq n}M_{k}\) exceeds \(a\).

(c) Show that

\[\begin{aligned} \mathbf{E}[M_{n}]\geq\mathbf{E}[M_{n}\mathbf{1}_{B}] & \geq a\sum_{k\leq n}\mathbb{P}(B_{k})=a\mathbb{P}(B) \end{aligned}\]

by decomposing \(B\) in \(B_{k}\)’s and by using the properties of expectations, as well as the submartingale property.

Solution.

Since \(M_{n}\geq0\), we have \(\mathbf{E}[M_{n}]\geq\mathbf{E}[M_{n}\mathbf{1}_{B}]\). Decomposing \(B\) into the disjoint events \(B_{k}\), and using that \(B_{k}\in\mathcal{F}_{k}\) together with the submartingale property \(\mathbf{E}[M_{n}|\mathcal{F}_{k}]\geq M_{k}\), we get:

\[\begin{aligned} \mathbf{E}[M_{n}\mathbf{1}_{B}] & =\sum_{k\leq n}\mathbf{E}[M_{n}\mathbf{1}_{B_{k}}]=\sum_{k\leq n}\mathbf{E}\left[\mathbf{E}[M_{n}|\mathcal{F}_{k}]\,\mathbf{1}_{B_{k}}\right]\geq\sum_{k\leq n}\mathbf{E}[M_{k}\mathbf{1}_{B_{k}}]\geq a\sum_{k\leq n}\mathbb{P}(B_{k})=a\mathbb{P}(B) \end{aligned}\]

where we used that \(M_{k}>a\) on \(B_{k}\), and that the \(B_{k}\)'s are disjoint.

(d) Argue that the inequality holds for continuous paths by discretizing time and using convergence theorems : If \((M_{t}:t\geq0)\) is a positive submartingale with continuous paths for the filtration \((\mathcal{F}_{t}:t\geq0)\), then for any \(1\leq p<\infty\) and \(a>0\):

\[\begin{aligned} \mathbb{P}\left(\max_{s\leq t}M_{s}>a\right) & \leq\frac{1}{a^{p}}\mathbf{E}[M_{t}^{p}] \end{aligned}\]

Solution.

Let \((M_{t}:t\geq0)\) be a positive submartingale with continuous paths for the filtration \((\mathcal{F}_{t}:t\geq0)\). Consider a sequence of partitions of the interval \([0,t]\) into \(2^{r}\) subintervals :

\[\begin{aligned} D_{r} & =\left\{ \frac{kt}{2^{r}}:k=0,1,2,\ldots,2^{r}\right\} \end{aligned}\]

And consider a sequence of discrete positive sub-martingales:

\[\begin{aligned} M_{kt/2^{r}}^{(r)} & =M_{kt/2^{r}},\quad k\in\mathbf{N},0\leq k\leq2^{r} \end{aligned}\]

Next, we define for \(r=1,2,3,\ldots\)

\[\begin{aligned} A_{r} & =\left\{ \sup_{s\in D_{r}}|M_{s}^{(r)}|>a\right\} \end{aligned}\]

Applying the maximal inequality in discrete time gives us:

\[\begin{aligned} \mathbb{P}(A_{r})=\mathbb{P}\left\{ \sup_{s\in D_{r}}|M_{s}^{(r)}|>a\right\} & \leq\frac{1}{a^{p}}\mathbf{E}\left[\left(M_{t}^{(r)}\right)^{p}\right]=\frac{1}{a^{p}}\mathbf{E}\left[M_{t}^{p}\right] \end{aligned}\]

Since \(D_{r}\subseteq D_{r+1}\), the events \(A_{r}\) are increasing, and by the continuity of the paths, \(\bigcup_{r=1}^{\infty}A_{r}=\{\max_{s\leq t}M_{s}>a\}\). Therefore:

\[\begin{aligned} \mathbb{P}\left(\max_{s\leq t}M_{s}>a\right) & =\mathbb{P}\left(\bigcup_{r=1}^{\infty}A_{r}\right)\\ & =\lim_{r\to\infty}\mathbb{P}\left(A_{r}\right)\\ & \quad\left\{ \text{Continuity of probability measure}\right\} \\ & \leq\frac{1}{a^{p}}\mathbf{E}\left[M_{t}^{p}\right] \end{aligned}\]