Gaussian Processes

Stochastic Calculus
Author

Quasar

Published

October 19, 2025

Gaussian Processes.

Random Vectors.

Consider a probability space \((\Omega,\mathcal{F},\mathbb{P})\). We can define several random variables on \(\Omega\). An \(n\)-tuple of random variables on this space is called a random vector. For example, if \(X_{1},X_{2},\ldots,X_{n}\) are random variables on \((\Omega,\mathcal{F},\mathbb{P})\), then the \(n\)-tuple \((X_{1},X_{2},\ldots,X_{n})\) is a random vector on \((\Omega,\mathcal{F},\mathbb{P})\). The vector is said to be \(n\)-dimensional because it contains \(n\) variables. We will sometimes denote a random vector by \(X\).

A good point of view is to think of a random vector \(X=(X_{1},\ldots,X_{n})\) as a random variable (point) in \(\mathbf{R}^{n}\). In other words, for an outcome \(\omega\in\Omega\), \(X(\omega)\) is a point sampled in \(\mathbf{R}^{n}\), where \(X_{j}(\omega)\) represents the \(j\)-th coordinate of the point. The distribution of \(X\), denoted \(\mu_{X}\), is the probability measure on \(\mathbf{R}^{n}\) defined by the events related to the values of \(X\):

\[\mathbb{P}\{X\in A\}=\mu_{X}(A)\quad\text{for a subset }A\text{ in }\mathbf{R}^{n}\]

In other words, \(\mathbb{P}(X\in A)=\mu_{X}(A)\) is the probability that the random point \(X\) falls in \(A\). The distribution of the vector \(X\) is also called the joint distribution of \((X_{1},\ldots,X_{n})\).

In the two-dimensional case, the joint distribution function of \(\mathbf{X}=(X,Y)\) is the function \(F_{\mathbf{X}}:\mathbf{R}^{2}\to[0,1]\) given by:

\[F_{\mathbf{X}}(x,y)=\mathbb{P}(X\leq x,Y\leq y)\]

::: defn The joint PDF \(f_{\mathbf{X}}(x_{1},\ldots,x_{n})\) of a random vector \(\mathbf{X}\) is a function \(f_{\mathbf{X}}:\mathbf{R}^{n}\to\mathbf{R}\) such that the probability that \(\mathbf{X}\) falls in a subset \(A\) of \(\mathbf{R}^{n}\) is expressed as the multiple integral of \(f(x_{1},x_{2},\ldots,x_{n})\) over \(A\): :::

\[\mathbb{P}(X\in A)=\int_{A}f(x_{1},x_{2},\ldots,x_{n})dx_{1}dx_{2}\ldots dx_{n}\]

Note that the integral of \(f\) over the whole of \(\mathbf{R}^{n}\) must equal \(1\).

If \(F\) is differentiable at the point \((x,y)\), then we usually specify:

\[f(x,y)=\frac{\partial^{2}}{\partial x\partial y}F(x,y)\]

::: thm Let \((X,Y)\) be a random vector with joint density function \(f_{(X,Y)}(x,y)\). The marginal density functions \(f_{X}(x)\) and \(f_{Y}(y)\) of the random variables \(X\) and \(Y\), respectively, are given by:

\[\begin{aligned} f_{X}(x) & =\int_{-\infty}^{+\infty}f_{(X,Y)}(x,y)dy\\ f_{Y}(y) & =\int_{-\infty}^{+\infty}f_{(X,Y)}(x,y)dx\end{aligned}\] :::

::: proof Proof. We have:

\[\begin{aligned} F_{X}(x) & =\mathbb{P}(X\leq x)\\ & =\int_{u=-\infty}^{u=x}\int_{y=-\infty}^{y=+\infty}f(u,y)\,dy\,du\end{aligned}\]

Differentiating both sides with respect to \(x\),

\[\begin{aligned} f_{X}(x) & =\int_{y=-\infty}^{y=+\infty}f(x,y)\,dy\end{aligned}\] ◻ :::
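As a quick numerical sanity check of the marginalization formula, here is a small Python sketch that integrates a correlated bivariate normal density over \(y\) and compares the result with the known standard normal marginal. The correlation value and the test points are arbitrary choices made only for this illustration.

```python
# Numerical check of f_X(x) = integral over y of f_{X,Y}(x, y) dy,
# assuming a standard bivariate normal joint density with correlation rho.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rho = 0.6  # illustrative correlation parameter

def joint_pdf(x, y, rho=rho):
    """Density of a standard bivariate normal with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

for x in [-1.5, 0.0, 0.7]:
    marginal, _ = quad(lambda y: joint_pdf(x, y), -np.inf, np.inf)
    # The marginal of a standard bivariate normal is the standard normal.
    print(x, marginal, norm.pdf(x))
```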

::: defn For continuous random variables \(X\) and \(Y\) with the joint density function \(f_{(X,Y)}\), the conditional density of \(Y\) given \(X=x\) is:

\[\begin{aligned} f_{Y|X}(y|x) & =\frac{f_{(X,Y)}(x,y)}{f_{X}(x)}\end{aligned}\]

for all \(x\) with \(f_{X}(x)>0\). This is considered as a function of \(y\) for a fixed \(x\). As a convention, in order to make \(f_{Y|X}(y|x)\) well-defined for all real \(x\), let \(f_{Y|X}(y|x)=0\) for all \(x\) with \(f_{X}(x)=0\). :::

We are essentially slicing the joint density function \(f_{(X,Y)}(x,y)\) by a thin plane \(X=x\). How can we speak of conditioning on \(X=x\) for a continuous random variable \(X\), considering that this event has probability zero? Rigorously speaking, we are actually conditioning on the event that \(X\) falls within a small interval containing \(x\), say \(X\in(x-\epsilon,x+\epsilon)\), and then taking the limit as \(\epsilon\) approaches zero from the right.

We can recover the joint PDF \(f_{(X,Y)}\) if we have the conditional PDF \(f_{Y|X}\) and the corresponding marginal \(f_{X}\):

\[\begin{aligned} f_{(X,Y)}(x,y) & =f_{Y|X}(y|x)\cdot f_{X}(x)\end{aligned}\]

::: thm (Bayes rule and LOTP) Let \((X,Y)\) be continuous random variables. We have the following continuous form of the Bayes rule:

\[f_{Y|X}(y|x)=\frac{f_{X|Y}(x|y)\cdot f_{Y}(y)}{f_{X}(x)}\]

And we have the following continuous form of the law of total probability:

\[\begin{aligned} f_{X}(x) & =\int_{y=-\infty}^{y=+\infty}f_{X|Y}(x|y)\cdot f_{Y}(y)dy\end{aligned}\] :::

::: proof Proof. By the definition of conditional PDFs, we have:

\[\begin{aligned} f_{X|Y}(x|y)\cdot f_{Y}(y) & =f_{(X,Y)}(x,y)=f_{Y|X}(y|x)\cdot f_{X}(x)\end{aligned}\]

Dividing throughout by \(f_{X}(x)\), we have:

\[\begin{aligned} f_{Y|X}(y|x) & =\frac{f_{X|Y}(x|y)\cdot f_{Y}(y)}{f_{X}(x)}=\frac{f_{(X,Y)}(x,y)}{f_{X}(x)}\end{aligned}\] ◻ :::
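The continuous law of total probability can also be checked numerically. The sketch below again uses a standard bivariate normal with correlation \(\rho\) as a toy joint distribution, together with the standard fact that \(X\mid Y=y\sim N(\rho y,\,1-\rho^{2})\); the specific numbers are illustrative choices only.

```python
# Check f_X(x) = integral over y of f_{X|Y}(x|y) * f_Y(y) dy for a
# standard bivariate normal with correlation rho (an illustrative choice).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rho = 0.6

def f_x_given_y(x, y):
    # For a standard bivariate normal, X | Y = y is N(rho * y, 1 - rho^2).
    return norm.pdf(x, loc=rho * y, scale=np.sqrt(1 - rho**2))

x = 0.7
lotp, _ = quad(lambda y: f_x_given_y(x, y) * norm.pdf(y), -np.inf, np.inf)
print(lotp, norm.pdf(x))  # the two numbers should agree closely
```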

Example 1 (Sampling uniformly in the unit disc). Consider the random vector \(\mathbf{X}=(X,Y)\) corresponding to a random point chosen uniformly in the unit disc \(\{(x,y):x^{2}+y^{2}\leq1\}\). \(\mathbf{X}\) is said to have the uniform distribution on the unit disc. In this case the PDF is \(0\) outside the disc and \(\frac{1}{\pi}\) inside the disc:

\[\begin{aligned} f(x,y) & =\frac{1}{\pi}\quad\text{ if }x^{2}+y^{2}\leq1\end{aligned}\]

The random point \((X,Y)\) has \(x\)-coordinate \(X\) and \(y\)-coordinate \(Y\). Each of these is a random variable, and their PDFs and CDFs can be computed. This is a valid PDF, because:

\[\begin{aligned} \int\int_{D}f(x,y)dydx & =\int_{-1}^{1}\int_{-\sqrt{1-x^{2}}}^{\sqrt{1-x^{2}}}\frac{1}{\pi}dydx\\ & =\frac{1}{\pi}\int_{-1}^{1}\left[y\right]_{-\sqrt{1-x^{2}}}^{+\sqrt{1-x^{2}}}dx\\ & =\frac{2}{\pi}\int_{-1}^{1}\sqrt{1-x^{2}}dx\end{aligned}\]

Substituting \(x=\sin\theta\), we have: \(dx=\cos\theta d\theta\) and \(\sqrt{1-x^{2}}=\cos\theta\). The limits of integration are \(\theta=-\pi/2\) to \(\theta=\pi/2\). Thus,

\[\begin{aligned} \int\int_{D}f(x,y)dydx & =\frac{2}{\pi}\int_{-\pi/2}^{\pi/2}\cos^{2}\theta d\theta\\ & =\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}(1+\cos2\theta)d\theta\\ & =\frac{1}{\pi}\left[\theta+\frac{1}{2}\sin2\theta\right]_{-\pi/2}^{\pi/2}\\ & =\frac{1}{\pi}\cdot\pi\\ & =1\end{aligned}\]

The CDF of \(X\) is given by:

\[\begin{aligned} F_{X}(a) & =\int_{-1}^{a}\int_{-\sqrt{1-x^{2}}}^{\sqrt{1-x^{2}}}\frac{1}{\pi}dydx\\ & =\frac{2}{\pi}\int_{-1}^{a}\sqrt{1-x^{2}}dx\end{aligned}\]

I leave it in this integral form. The PDF of \(X\) is obtained by differentiating the CDF, so it is:

\[f_{X}(x)=\frac{2}{\pi}\sqrt{1-x^{2}}\label{eq:marginal-pdf-of-X}\]

Let’s quickly plot the density of \(X\) over its domain of definition \(-1\leq x\leq1\).
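Here is a minimal plotting sketch (using matplotlib; the grid resolution and styling are my own choices) that produces the figure below.

```python
# Plot the marginal density f_X(x) = (2 / pi) * sqrt(1 - x^2) on [-1, 1].
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 400)
f_x = (2 / np.pi) * np.sqrt(1 - x**2)  # marginal PDF of X

plt.plot(x, f_x)
plt.xlabel("x")
plt.ylabel(r"$f_X(x)$")
plt.title("Marginal PDF of the x-coordinate of a uniform point in the unit disc")
plt.show()
```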

::: center Figure. The PDF of the random variable \(X\). :::

Not surprisingly, the distribution of the \(x\)-coordinate is no longer uniform!

If \((X_{1},X_{2},\ldots,X_{n})\) is a random vector, the distribution of a single coordinate, say \(X_{1}\), is called a marginal distribution. In Example 1 (the uniform distribution on the unit disc), the marginal distribution of \(X\) is determined by the PDF \(f_{X}(x)=\frac{2}{\pi}\sqrt{1-x^{2}}\) derived above.
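As a cross-check, we can sample uniform points in the disc by rejection from the square \([-1,1]^{2}\) and compare a histogram of the \(x\)-coordinates against this analytic marginal. This is a rough sketch; the sample size and seed are arbitrary.

```python
# Monte Carlo check of the marginal of X: rejection-sample uniform points
# in the unit disc and compare the histogram of x-coordinates with
# (2 / pi) * sqrt(1 - x^2).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200_000, 2))
inside = pts[np.sum(pts**2, axis=1) <= 1.0]  # keep points inside the disc

x_grid = np.linspace(-1, 1, 400)
plt.hist(inside[:, 0], bins=100, density=True, alpha=0.5, label="empirical")
plt.plot(x_grid, (2 / np.pi) * np.sqrt(1 - x_grid**2), label="analytic")
plt.legend()
plt.show()
```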

Random variables \(X_{1},X_{2},\ldots,X_{n}\) defined on the same probability space are said to be independent if for any intervals \(A_{1},A_{2},\ldots,A_{n}\) in \(\mathbf{R}\), the probability factors:

\[\mathbb{P}(X_{1}\in A_{1},X_{2}\in A_{2},\ldots,X_{n}\in A_{n})=\mathbb{P}(X_{1}\in A_{1})\times\mathbb{P}(X_{2}\in A_{2})\times\ldots\times\mathbb{P}(X_{n}\in A_{n})\] We say that the random variables are independent and identically distributed (IID) if they are independent and their marginal distributions are the same.

When the random vector \((X_{1},X_{2},\ldots,X_{n})\) has a joint PDF \(f(x_{1},x_{2},\ldots,x_{n})\), the independence of random variables is equivalent to saying that the joint PDF is given by the product of the marginal PDFs:

\[f(x_{1},x_{2},\ldots,x_{n})=f_{1}(x_{1})\times f_{2}(x_{2})\times\ldots\times f_{n}(x_{n})\]
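For the unit-disc example, the coordinates \(X\) and \(Y\) are not independent: the joint PDF does not factor into the product of the marginals. A tiny numerical illustration (the test point is an arbitrary choice):

```python
# Independence fails for the unit-disc example: at a point just outside
# the disc the joint density is 0, yet the product of the marginals is not.
import numpy as np

def joint(x, y):
    return 1 / np.pi if x**2 + y**2 <= 1 else 0.0

def marginal(t):
    return (2 / np.pi) * np.sqrt(1 - t**2) if abs(t) <= 1 else 0.0

x, y = 0.8, 0.8
print(joint(x, y))                 # 0.0 -- the point lies outside the disc
print(marginal(x) * marginal(y))   # positive, so X and Y are not independent
```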

Basic Probabilistic Inequalities.

Inequalities are extremely useful tools in the theoretical development of probability theory.

Jensen’s inequality.

If \(g\) is a convex function, then for any points \(a,b\) in its domain and any \(p\in[0,1]\), it follows that:

\[g(pa+(1-p)b)\leq pg(a)+(1-p)g(b)\]

::: proof Proof. This directly follows from the definition of convex functions. ◻ :::

Jensen’s inequality for Random variables.

If \(g\) is a convex function, then it follows that:

\[\mathbb{E}(g(X))\geq g(\mathbb{E}X)\]

::: proof Proof. Another way to express the idea that a function is convex is to observe that the tangent line at an arbitrary point \((t,g(t))\) always lies below the curve. Let \(y=a+bx\) be the tangent to \(g\) at the point \(t\). Then it follows that:

\[\begin{aligned} a+bt & =g(t)\\ a+bx & \leq g(x)\end{aligned}\]

for all \(x\).

Thus, it follows that, for any point \(t\), there exists \(b\) such that:

\[\begin{aligned} g(x)-g(t) & \geq b(x-t)\end{aligned}\]

for all \(x\). Set \(t=\mathbb{E}X\) and \(x=X\). Then,

\[\begin{aligned} g(X)-g(\mathbb{E}X) & \geq b(X-\mathbb{E}X)\end{aligned}\]

Taking expectations on both sides and simplifying:

\[\begin{aligned} \mathbb{E}\left(g(X)\right)-g(\mathbb{E}X) & \geq b(\mathbb{E}X-\mathbb{E}X)=0\\ \mathbb{E}g(X) & \geq g(\mathbb{E}X)\end{aligned}\] ◻ :::
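A quick Monte Carlo illustration of Jensen’s inequality, using the convex function \(g(x)=e^{x}\) and a standard normal \(X\) (both arbitrary choices for illustration); here \(\mathbb{E}e^{X}=e^{1/2}\approx1.65\) while \(e^{\mathbb{E}X}=1\).

```python
# Empirical check of E[g(X)] >= g(E[X]) for the convex function g(x) = exp(x)
# and a standard normal sample (illustrative choices).
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)

lhs = np.exp(x).mean()   # E[g(X)], close to exp(1/2) ~ 1.6487
rhs = np.exp(x.mean())   # g(E[X]), close to exp(0) = 1
print(lhs, rhs, lhs >= rhs)
```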

Young’s Inequality.

If \(a\geq0\) and \(b\geq0\) are non-negative real numbers and if \(p>1\) and \(q>1\) are real numbers such that \(\frac{1}{p}+\frac{1}{q}=1\), then:

\[ab\leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\]

::: proof Proof. Consider \(g(x)=\log x\). Since \(\log\) is concave, Jensen’s inequality is reversed. If \(a=0\) or \(b=0\), the inequality is immediate, so assume \(a,b>0\). We have:

\[\begin{aligned} g\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}g(a^{p})+\frac{1}{q}g(b^{q})\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}\log(a^{p})+\frac{1}{q}\log(b^{q})\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}\cdot p\log(a)+\frac{1}{q}\cdot q\log(b)\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\log ab\end{aligned}\]

By the monotonicity of \(\log x\), it follows that:

\[\begin{aligned} ab & \leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\end{aligned}\] ◻ :::
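A brute-force numerical check of Young’s inequality on randomly sampled \(a\), \(b\) and \(p\) (the sampling ranges below are arbitrary choices):

```python
# Check a*b <= a^p / p + b^q / q for random nonnegative a, b and conjugate
# exponents p, q with 1/p + 1/q = 1.
import numpy as np

rng = np.random.default_rng(2)
for _ in range(5):
    a, b = rng.uniform(0, 10, size=2)
    p = rng.uniform(1.1, 5.0)
    q = p / (p - 1)  # conjugate exponent
    assert a * b <= a**p / p + b**q / q + 1e-12
print("Young's inequality held on all sampled triples (a, b, p).")
```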

Chebyshev’s inequality.

One of the simplest and most useful probabilistic inequalities is a tail bound in terms of the expectation: the so-called Chebyshev inequality (often also called Markov’s inequality).

(Chebyshev’s inequality) If \(X\) is a non-negative random variable, then for every \(t>0\):

\[\mathbb{P}(X\geq t)\leq\frac{1}{t}\mathbb{E}X\]

::: proof Proof. We have:

\[\begin{aligned} t\cdot\mathbf{1}_{\{X\geq t\}} & \leq X\cdot\mathbf{1}_{\{X\geq t\}}\leq X\end{aligned}\]

By the monotonicity of expectations, we have:

\[\begin{aligned} t\cdot\mathbb{P}\{X\geq t\}=\mathbb{E}\left[t\cdot\mathbf{1}_{\{X\geq t\}}\right] & \leq\mathbb{E}X\\ \implies\mathbb{P}\{X\geq t\} & \leq\frac{1}{t}\mathbb{E}X\end{aligned}\]

This completes the proof. ◻ :::

There are several variants, easily deduced from Chebyshev’s inequality using the monotonicity of various functions. For a non-negative random variable \(X\) and \(t>0\), using the power function \(x^{p}\), \(p>0\), we get:

\[\mathbb{P}(X\geq t)=\mathbb{P}(X^{p}\geq t^{p})\leq\frac{1}{t^{p}}\mathbb{E}X^{p}\]

For a real-valued random variable \(X\) and every \(t>0\), using the square function \(x^{2}\) and the variance, we have:

\[\mathbb{P}(|X-\mathbb{E}X|\geq t)\leq\frac{1}{t^{2}}\mathbb{E}|X-\mathbb{E}X|^{2}=\frac{1}{t^{2}}Var(X)\]

For a real-valued random variable \(X\), every \(t\in\mathbf{R}\) and \(\lambda>0\), using the exponential function \(e^{\lambda x}\) (which is monotonic), we have:

\[\mathbb{P}(X\geq t)=\mathbb{P}(\lambda X\geq\lambda t)=\mathbb{P}(e^{\lambda X}\geq e^{\lambda t})\leq\frac{1}{e^{\lambda t}}\mathbb{E}e^{\lambda X}\]
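The sketch below compares empirical tail probabilities with the first-moment and variance bounds above, using an exponential random variable with mean \(1\) as a toy example (the distribution and thresholds are arbitrary choices). The variance bound is applied via the inclusion \(\{X\geq t\}\subseteq\{|X-\mathbb{E}X|\geq t-\mathbb{E}X\}\), valid for \(t>\mathbb{E}X\).

```python
# Empirical tails versus the first-moment and variance tail bounds,
# for an exponential(1) sample (E[X] = 1, Var(X) = 1).
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)

for t in [2.0, 4.0, 8.0]:
    tail = (x >= t).mean()                        # P(X >= t), empirically
    first_moment_bound = x.mean() / t             # E[X] / t
    variance_bound = x.var() / (t - x.mean())**2  # Var(X) / (t - E[X])^2
    print(t, tail, first_moment_bound, variance_bound)
```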

Our next inequality, Hölder’s inequality, is a very effective tool for bounding the expectation of a product of random variables.

Hölder’s inequality.

::: thm Let \(p,q>1\) be such that \(\frac{1}{p}+\frac{1}{q}=1\). For random variables \(X\) and \(Y\), we have:

\[\begin{aligned} \mathbb{E}|XY| & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}\end{aligned}\] :::

::: proof Proof. From Young’s inequality, for any \(a,b\geq0\) and \(p,q>1\) with \(\frac{1}{p}+\frac{1}{q}=1\), we have:

\[\begin{aligned} ab & \leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\end{aligned}\]

Setting \(a=\frac{|X|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}}\) and \(b=\frac{|Y|}{\left(\mathbb{E}|Y|^{q}\right)^{1/q}}\) (assuming both denominators are finite and nonzero; otherwise the inequality is trivial), we get:

\[\begin{aligned} \frac{|XY|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}} & \leq\frac{1}{p}\cdot\frac{|X|^{p}}{\mathbb{E}|X|^{p}}+\frac{1}{q}\cdot\frac{|Y|^{q}}{\mathbb{E}|Y|^{q}}\end{aligned}\]

Taking expectations on both sides and using the linearity and monotonicity of expectation, we get:

\[\begin{aligned} \frac{\mathbb{E}|XY|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}} & \leq\frac{1}{p}\cdot\frac{\mathbb{E}|X|^{p}}{\mathbb{E}|X|^{p}}+\frac{1}{q}\cdot\frac{\mathbb{E}|Y|^{q}}{\mathbb{E}|Y|^{q}}=\frac{1}{p}+\frac{1}{q}=1\end{aligned}\]

Consequently,

\[\begin{aligned} \mathbb{E}|XY| & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}\end{aligned}\] ◻ :::

Let \(p=2\) and \(q=2\). Then, we get the Cauchy-Schwarz inequality:

\[\begin{aligned} \mathbb{E}|XY| & \leq\left[\mathbb{E}(X^{2})\right]^{1/2}\left[\mathbb{E}(Y^{2})\right]^{1/2}\end{aligned}\]

In some ways, the quantity \(\left(\mathbb{E}|X|^{p}\right)^{1/p}\) can be thought of as the length, or \(p\)-norm, of a random variable.

Define:

\[\left\Vert X\right\Vert _{p}=\left(\mathbb{E}|X|^{p}\right)^{1/p}\]
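A Monte Carlo sanity check of Hölder’s inequality (and of its special case \(p=q=2\), the Cauchy-Schwarz inequality) for two correlated samples; the distribution and exponents below are arbitrary illustrative choices.

```python
# Check E|XY| <= (E|X|^p)^(1/p) * (E|Y|^q)^(1/q) on a correlated sample.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
y = 0.5 * x + rng.standard_normal(1_000_000)   # correlated with x

for p in [1.5, 2.0, 3.0]:
    q = p / (p - 1)
    lhs = np.mean(np.abs(x * y))                                      # E|XY|
    rhs = np.mean(np.abs(x)**p)**(1/p) * np.mean(np.abs(y)**q)**(1/q)
    print(p, lhs, rhs, lhs <= rhs)
```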

Minkowski’s Inequality.

For random variables \(X\) and \(Y\), and for all \(p\geq1\) we have:

\[\left\Vert X+Y\right\Vert _{p}\leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\]

::: proof Proof. The basic idea of the proof is to use Hölder’s inequality. Assume \(p>1\) (the case \(p=1\) is just the triangle inequality together with the monotonicity of expectation), and let \(\frac{1}{q}=1-\frac{1}{p}\), that is, \(q=\frac{p}{p-1}\). We have:

\[\begin{aligned} \mathbb{E}|X||X+Y|^{p-1} & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q} & (a)\\ \mathbb{E}|Y||X+Y|^{p-1} & \leq\left(\mathbb{E}|Y|^{p}\right)^{1/p}\left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q} & (b)\end{aligned}\]

Adding the above two inequalities and using \((p-1)q=p\), we get:

\[\begin{aligned} \mathbb{E}(|X+Y||X+Y|^{p-1})\leq\mathbb{E}(|X|+|Y|)(|X+Y|^{p-1}) & \leq\left\{ \left(\mathbb{E}|X|^{p}\right)^{1/p}+\left(\mathbb{E}|Y|^{p}\right)^{1/p}\right\} \left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q}\\ \mathbb{E}|X+Y|^{p} & \leq\left\{ \left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\right\} \left(\mathbb{E}|X+Y|^{p}\right)^{1/q}\\ \left(\mathbb{E}|X+Y|^{p}\right)^{1/p} & \leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\\ \left\Vert X+Y\right\Vert _{p} & \leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\end{aligned}\] ◻ :::
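Minkowski’s inequality is exactly the triangle inequality for the \(p\)-norm defined above, and it is easy to check empirically. The samples below are arbitrary choices for illustration.

```python
# Empirical check of ||X + Y||_p <= ||X||_p + ||Y||_p with
# ||Z||_p = (E|Z|^p)^(1/p) estimated from samples.
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(1_000_000)
y = rng.exponential(scale=2.0, size=1_000_000)

def p_norm(z, p):
    return np.mean(np.abs(z)**p)**(1/p)

for p in [1.0, 2.0, 3.5]:
    print(p, p_norm(x + y, p), p_norm(x, p) + p_norm(y, p))
```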

A quick refresher on linear algebra.

Many of the concepts in this chapter have very elegant interpretations, if we think of real-valued random variables on a probability space as vectors in a vector space. In particular, variance is related to the concept of norm and distance, while covariance is related to inner-products. These concepts can help unify some of the ideas in this chapter from a geometric point of view. Of course, real-valued random variables are simply measurable, real-valued functions on the abstract space \(\Omega.\)

::: defn (Vector Space).

By a vector space over a field \(F\) (for our purposes, \(F=\mathbf{R}\)), we mean a non-empty set \(V\) with two operations:

  • Vector addition: \(+:(\mathbf{x},\mathbf{y})\to\mathbf{x}+\mathbf{y}\)

  • Scalar multiplication: \(\cdot:(\alpha,\mathbf{x})\to\alpha\mathbf{x}\)

such that the following conditions are satisfied:

(A1) Commutativity: \(\mathbf{x}+\mathbf{y}=\mathbf{y}+\mathbf{x}\) for all \(\mathbf{x},\mathbf{y}\in V\)

(A2) Associativity: \((\mathbf{x}+\mathbf{y})+\mathbf{z}=\mathbf{x}+(\mathbf{y}+\mathbf{z})\) for all \(\mathbf{x},\mathbf{y},\mathbf{z}\in V\)

(A3) Zero Element: There exists a zero element, denoted \(\mathbf{0}\), in \(V\) such that \(\mathbf{x}+\mathbf{0}=\mathbf{x}\) for all \(\mathbf{x}\in V\).

(A4) Additive Inverse: For all \(\mathbf{x}\in V\), there exists an additive inverse (negative element), denoted \(-\mathbf{x}\), in \(V\) such that \(\mathbf{x}+(-\mathbf{x})=\mathbf{0}\).

(M1) Scalar multiplication by identity element in \(F\): For all \(\mathbf{x}\in V\), \(1\cdot\mathbf{x}=\mathbf{x}\), where \(1\) denotes the multiplicative identity in \(F\).

(M2) Scalar multiplication and field multiplication mix well: For all \(\alpha,\beta\in F\) and \(\mathbf{v}\in V\), \((\alpha\beta)\mathbf{v}=\alpha(\beta\mathbf{v})\).

(D1) Distribution of scalar multiplication over vector addition: For all \(\alpha\in F\), and \(\mathbf{u},\mathbf{v}\in V\), \(\alpha(\mathbf{u}+\mathbf{v})=\alpha\mathbf{u}+\alpha\mathbf{v}\).

(D2) Distribution of field addition over scalar multiplication: For all \(\alpha,\beta\in F\), and \(\mathbf{v}\in V\), \((\alpha+\beta)\mathbf{v}=\alpha\mathbf{v}+\beta\mathbf{v}\).

:::

As usual, our starting point is a random experiment modeled by a probability space \((\Omega,\mathcal{F},\mathbb{P})\), so that \(\Omega\) is the set of outcomes, \(\mathcal{F}\) is the \(\sigma\)-algebra of events and \(\mathbb{P}\) is the probability measure on the measurable space \((\Omega,\mathcal{F})\). Our basic vector space \(V\) consists of all real-valued random variables defined on \((\Omega,\mathcal{F},\mathbb{P})\). We define vector addition and scalar multiplication point-wise, in the usual way:

  • Vector addition: \((X+Y)(\omega)=X(\omega)+Y(\omega)\).

  • Scalar multiplication: \((\alpha X)(\omega)=\alpha X(\omega)\)

Clearly, any function \(g\) of a random variable \(X(\omega)\) is also a random variable on the same probability space, and any linear combination of random variables on \((\Omega,\mathcal{F},\mathbb{P})\) also defines a new random variable on the same probability space. Thus, \(V\) is closed under vector addition and scalar multiplication. Since vector addition and scalar multiplication are defined point-wise, it is easy to see that all the axioms of a vector space (A1)-(A4), (M1)-(M2), (D1)-(D2) are satisfied. The identically zero random variable \(0(\omega)=0\) serves as the zero vector, and the indicator random variable \(I_{\Omega}(\omega)=1\) serves as the multiplicative identity for point-wise multiplication.
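To make this point of view concrete, here is a small sketch on a finite sample space: a random variable is just the vector of its values on the outcomes, expectation is a \(\mathbb{P}\)-weighted sum, and covariance is the weighted inner product of the centered vectors. The particular outcomes, probabilities and values below are illustrative assumptions.

```python
# Random variables on a finite Omega as vectors; covariance as a
# P-weighted inner product of centered vectors (illustrative numbers).
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # probability of each outcome in Omega
X = np.array([1.0, -2.0, 4.0])  # a random variable: one value per outcome
Y = np.array([0.5, 1.0, -1.0])  # another random variable on the same Omega

EX, EY = p @ X, p @ Y                 # expectations as weighted sums
cov = p @ ((X - EX) * (Y - EY))       # covariance = weighted inner product of centered vectors
var_X = p @ (X - EX)**2               # variance = squared "length" of the centered vector
print(EX, EY, cov, var_X)
```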