Gaussian Processes.
Random Vectors.
Consider a probability space \((\Omega,\mathcal{F},\mathbb{P})\). We can define several random variables on \(\Omega\). An \(n\)-tuple of random variables on this space is called a random vector. For example, if \(X_{1},X_{2},\ldots,X_{n}\) are random variables on \((\Omega,\mathcal{F},\mathbb{P})\), then the \(n\)-tuple \((X_{1},X_{2},\ldots,X_{n})\) is a random vector on \((\Omega,\mathcal{F},\mathbb{P})\). The vector is said to be \(n\)-dimensional because it contains \(n\) variables. We will sometimes denote a random vector by \(X\).
A good point of view is to think of a random vector \(X=(X_{1},\ldots,X_{n})\) as a random point in \(\mathbf{R}^{n}\). In other words, for an outcome \(\omega\in\Omega\), \(X(\omega)\) is a point sampled in \(\mathbf{R}^{n}\), where \(X_{j}(\omega)\) represents the \(j\)-th coordinate of the point. The distribution of \(X\), denoted \(\mu_{X}\), is the probability measure on \(\mathbf{R}^{n}\) defined by the events related to the values of \(X\):
\[\mathbb{P}\{X\in A\}=\mu_{X}(A)\quad\text{for a subset }A\text{ in }\mathbf{R}^{n}\]
In other words, \(\mathbb{P}(X\in A)=\mu_{X}(A)\) is the probability that the random point \(X\) falls in \(A\). The distribution of the vector \(X\) is also called the joint distribution of \((X_{1},\ldots,X_{n})\).
Definition 1 (Joint Distribution) The joint distribution function of \(\mathbf{X}=(X,Y)\) is the function \(F_{\mathbf{X}}:\mathbf{R}^{2}\to[0,1]\) given by:
\[F_{\mathbf{X}}(x,y)=\mathbb{P}(X\leq x,Y\leq y)\]
Definition 2 (Joint density) The joint PDF \(f_{\mathbf{X}}(x_{1},\ldots,x_{n})\) of a random vector \(\mathbf{X}\) is a function \(f_{\mathbf{X}}:\mathbf{R}^{n}\to\mathbf{R}\) such that the probability that \(\mathbf{X}\) falls in a subset \(A\) of \(\mathbf{R}^{n}\) is given by the multiple integral of \(f(x_{1},x_{2},\ldots,x_{n})\) over \(A\):
\[\mathbb{P}(X\in A)=\int_{A}f(x_{1},x_{2},\ldots,x_{n})dx_{1}dx_{2}\ldots dx_{n}\]
Note that the integral of \(f\) over the whole of \(\mathbf{R}^{n}\) must equal \(1\).
If \(F\) is differentiable at the point \((x,y)\), then the joint density can be recovered as:
\[f(x,y)=\frac{\partial^{2}}{\partial x\partial y}F(x,y)\]
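To make this relationship concrete, here is a minimal sympy sketch. The joint CDF used, \(F(x,y)=(1-e^{-x})(1-e^{-y})\) for two independent Exp(1) variables, is an illustrative assumption rather than something from the text; differentiating once in each variable recovers the joint density.

```python
# Recover a joint PDF from a joint CDF via the mixed partial derivative.
# The CDF below (two independent Exp(1) variables) is a hypothetical example.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
F = (1 - sp.exp(-x)) * (1 - sp.exp(-y))   # joint CDF F(x, y)
f = sp.simplify(sp.diff(F, x, y))         # f(x, y) = d^2 F / (dx dy)
print(f)                                  # exp(-x - y), the product of the two densities
```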
Theorem 1 (Law of total probability) Let \((X,Y)\) be random variables with joint density function \(f_{(X,Y)}(x,y)\). The marginal density functions \(f_{X}(x)\) and \(f_{Y}(y)\) of the random variables \(X\) and \(Y\), respectively, are given by:
\[\begin{aligned} f_{X}(x) & =\int_{-\infty}^{+\infty}f_{(X,Y)}(x,y)dy\\ f_{Y}(y) & =\int_{-\infty}^{+\infty}f_{(X,Y)}(x,y)dx\end{aligned}\]
Proof.
We have:
\[\begin{aligned} F_{X}(x) & =P(X\leq x)\\ & =\int_{-\infty}^{x}\int_{y=-\infty}^{y=+\infty}f(x,y)dydx\end{aligned}\]
Differentiating both sides with respect to \(x\),
\[\begin{aligned} f_{X}(x) & =\int_{y=-\infty}^{y=+\infty}f(x,y)dy\end{aligned}\]
This closes the proof. \(\blacksquare\)
Definition 3 (Conditional density function) For continuous random variables \(X\) and \(Y\) with the joint density function \(f_{(X,Y)}\), the conditional density of \(Y\) given \(X=x\) is:
\[\begin{aligned} f_{Y|X}(y|x) & =\frac{f_{(X,Y)}(x,y)}{f_{X}(x)}\end{aligned}\]
for all \(x\) with \(f_{X}(x)>0\). This is considered as a function of \(y\) for a fixed \(x\). As a convention, in order to make \(f_{Y|X}(y|x)\) well-defined for all real \(x\), let \(f_{Y|X}(y|x)=0\) for all \(x\) with \(f_{X}(x)=0\).
We are essentially slicing the joint density function \(f_{(X,Y)}(x,y)\) by a thin plane \(X=x\). How can we speak of conditioning on \(X=x\) when \(X\) is a continuous random variable, considering that this event has probability zero? Rigorously speaking, we are actually conditioning on the event that \(X\) falls within a small interval containing \(x\), say \(X\in(x-\epsilon,x+\epsilon)\), and then taking the limit as \(\epsilon\) approaches zero from the right.
We can recover the joint PDF \(f_{(X,Y)}\) if we have the conditional PDF \(f_{Y|X}\) and the corresponding marginal \(f_{X}\):
\[ \begin{aligned} f_{(X,Y)}(x,y) & =f_{Y|X}(y|x)\cdot f_{X}(x) \end{aligned} \]
Theorem 2 (Bayes rule and LOTP) Let \((X,Y)\) be continuous random variables. We have the following continuous form of the Bayes rule:
\[f_{Y|X}(y|x)=\frac{f_{X|Y}(x|y)\cdot f_{Y}(y)}{f_{X}(x)}\]
And we have the following continuous form of the law of total probability:
\[\begin{aligned} f_{X}(x) & =\int_{y=-\infty}^{y=+\infty}f_{X|Y}(x|y)\cdot f_{Y}(y)dy\end{aligned}\]
Proof.
By the definition of conditional PDFs, we have:
\[ \begin{aligned} f_{X|Y}(x|y)\cdot f_{Y}(y) & =f_{(X,Y)}(x,y)=f_{Y|X}(y|x)\cdot f_{X}(x)\end{aligned} \]
Dividing throughout by \(f_{X}(x)\), we have:
\[ \begin{aligned} f_{Y|X}(y|x) & =\frac{f_{X|Y}(x|y)\cdot f_{Y}(y)}{f_{X}(x)}=\frac{f_{(X,Y)}(x,y)}{f_{X}(x)}\end{aligned} \]
This closes the proof. \(\blacksquare\)
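As a quick sanity check of the continuous law of total probability, the sketch below uses a hypothetical model that is not part of the text: \(Y\sim N(0,1)\) and \(X\mid Y=y\sim N(y,1)\), so that the marginal of \(X\) is \(N(0,2)\). The integral \(\int f_{X|Y}(x|y)f_{Y}(y)\,dy\) is approximated by a Riemann sum and compared with the known marginal density.

```python
# Numerical check of the continuous LOTP under the hypothetical model
# Y ~ N(0, 1), X | Y = y ~ N(y, 1), for which X ~ N(0, 2).
import numpy as np
from scipy.stats import norm

ys = np.linspace(-10.0, 10.0, 4001)   # integration grid for y
dy = ys[1] - ys[0]
x = 0.7                               # arbitrary point at which to evaluate f_X

# f_X(x) = integral of f_{X|Y}(x|y) * f_Y(y) dy, approximated by a Riemann sum
integrand = norm.pdf(x, loc=ys, scale=1.0) * norm.pdf(ys, loc=0.0, scale=1.0)
fx_lotp = np.sum(integrand) * dy

fx_exact = norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))   # known N(0, 2) marginal density
print(fx_lotp, fx_exact)              # the two values agree to several decimals
```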
Example 1 (Sampling uniformly in the unit disc). Consider the random vector \(\mathbf{X}=(X,Y)\) corresponding to a random point chosen uniformly in the unit disc \(\{(x,y):x^{2}+y^{2}\leq1\}\). \(\mathbf{X}\) is said to be uniformly distributed on the unit disc. In this case the PDF is \(0\) outside the disc and \(\frac{1}{\pi}\) inside the disc:
\[\begin{aligned} f(x,y) & =\frac{1}{\pi}\quad\text{ if }x^{2}+y^{2}\leq1\end{aligned}\]
The random point \((X,Y)\) has \(x\)-coordinate \(X\) and \(y\)-coordinate \(Y\). Each of these is a random variable, and their PDFs and CDFs can be computed. The density above is a valid PDF, because:
\[\begin{aligned} \int\int_{D}f(x,y)dydx & =\int_{-1}^{1}\int_{-\sqrt{1-x^{2}}}^{\sqrt{1-x^{2}}}\frac{1}{\pi}dydx\\ & =\frac{1}{\pi}\int_{-1}^{1}\left[y\right]_{-\sqrt{1-x^{2}}}^{+\sqrt{1-x^{2}}}dx\\ & =\frac{2}{\pi}\int_{-1}^{1}\sqrt{1-x^{2}}dx\end{aligned}\]
Substituting \(x=\sin\theta\), we have: \(dx=\cos\theta d\theta\) and \(\sqrt{1-x^{2}}=\cos\theta\). The limits of integration are \(\theta=-\pi/2\) to \(\theta=\pi/2\). Thus,
\[\begin{aligned} \int\int_{D}f(x,y)dydx & =\frac{2}{\pi}\int_{-\pi/2}^{\pi/2}\cos^{2}\theta d\theta\\ & =\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}(1+\cos2\theta)d\theta\\ & =\frac{1}{\pi}\left[\theta+\frac{1}{2}\sin2\theta\right]_{-\pi/2}^{\pi/2}\\ & =\frac{1}{\pi}\cdot\pi\\ & =1\end{aligned}\]
The CDF of \(X\) is given by:
\[\begin{aligned} F_{X}(a) & =\int_{-1}^{a}\int_{-\sqrt{1-x^{2}}}^{\sqrt{1-x^{2}}}\frac{1}{\pi}dydx\\ & =\frac{2}{\pi}\int_{-1}^{a}\sqrt{1-x^{2}}dx\end{aligned}\]
I leave it in this integral form. The PDF of \(X\) is obtained by differentiating the CDF, so it is:
\[f_{X}(x)=\frac{2}{\pi}\sqrt{1-x^{2}}\label{eq:marginal-pdf-of-X}\]
Let’s quickly plot the density of \(X\) over the domain of the definition \(-1\leq x\leq1\).
::: center
Figure: The PDF of the random variable \(X\).
:::
Not surprisingly, the distribution of the \(x\)-coordinate is no longer uniform!
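For readers who like to verify such computations numerically, here is a minimal Monte Carlo sketch (the rejection-sampling approach and the chosen intervals are illustrative assumptions): sample uniformly in the disc and compare empirical interval probabilities for the \(x\)-coordinate with the marginal PDF \(f_{X}(x)=\frac{2}{\pi}\sqrt{1-x^{2}}\).

```python
# Monte Carlo check of the marginal of X for a point uniform in the unit disc.
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(200_000, 2))
disc = pts[(pts ** 2).sum(axis=1) <= 1.0]   # rejection step: keep points inside the disc
x = disc[:, 0]

def F_X(t):
    # antiderivative of (2/pi) * sqrt(1 - t^2), used to get exact interval probabilities
    return (t * np.sqrt(1.0 - t * t) + np.arcsin(t)) / np.pi

for a, b in [(-0.5, 0.5), (0.8, 1.0)]:
    empirical = np.mean((x >= a) & (x <= b))
    exact = F_X(b) - F_X(a)
    print(f"P({a} <= X <= {b}): empirical {empirical:.4f}, exact {exact:.4f}")
```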
If \((X_{1},X_{2},\ldots,X_{n})\) is a random vector, the distribution of a single coordinate, say \(X_{1}\), is called the marginal distribution of \(X_{1}\). In Example 1, the marginal distribution of \(X\) is determined by the PDF [eq:marginal-pdf-of-X].
Random variables \(X_{1},X_{2},\ldots,X_{n}\) defined on the same probability space are said to be independent if for any intervals \(A_{1},A_{2},\ldots,A_{n}\) in \(\mathbf{R}\), the probability factors:
\[\mathbb{P}(X_{1}\in A_{1},X_{2}\in A_{2},\ldots,X_{n}\in A_{n})=\mathbb{P}(X_{1}\in A_{1})\times\mathbb{P}(X_{2}\in A_{2})\times\ldots\times\mathbb{P}(X_{n}\in A_{n})\] We say that the random variables are independent and identically distributed (IID) if they are independent and their marginal distributions are the same.
When the random vector \((X_{1},X_{2},\ldots,X_{n})\) has a joint PDF \(f(x_{1},x_{2},\ldots,x_{n})\), the independence of random variables is equivalent to saying that the joint PDF is given by the product of the marginal PDFs:
\[f(x_{1},x_{2},\ldots,x_{n})=f_{1}(x_{1})\times f_{2}(x_{2})\times\ldots\times f_{n}(x_{n})\]
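The factorization can be illustrated numerically; the example below (two independent standard normal variables and two arbitrary events) is an assumption made for the sketch, not part of the text.

```python
# For independent X and Y, P(X in A, Y in B) should match P(X in A) * P(Y in B).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(500_000)
Y = rng.standard_normal(500_000)

in_A = X > 0.5              # A = (0.5, infinity)
in_B = np.abs(Y) < 1.0      # B = (-1, 1)

print(np.mean(in_A & in_B))            # joint probability
print(np.mean(in_A) * np.mean(in_B))   # product of marginals: approximately equal
```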
Basic Probabilistic Inequalities.
Inequalities are extremely useful tools in the theoretical development of probability theory.
Jensen’s inequality.
Theorem 3 If \(g\) is a convex function, and \(a>0\), \(b>0\), with \(p\in[0,1]\), it follows that:
\[g(pa+(1-p)b)\leq pg(a)+(1-p)g(b)\]
Proof. This directly follows from the definition of convex functions. \(\blacksquare\)
Jensen’s inequality for Random variables.
Theorem 4 If \(g\) is a convex function, then it follows that:
\[\mathbb{E}(g(X))\geq g(\mathbb{E}X)\]
Proof.
Another way to express the idea that a function is convex is to observe that the tangent line (more generally, a supporting line) at an arbitrary point \((t,g(t))\) always lies below the curve. Let \(y=a+bx\) be the tangent to \(g\) at the point \(t\). Then, it follows that:
\[\begin{aligned} a+bt & =g(t)\\ a+bx & \leq g(x)\end{aligned}\]
for all \(x\).
Thus, it follows that, for any point \(t\), there exists \(b\) such that:
\[\begin{aligned} g(x)-g(t) & \geq b(x-t)\end{aligned}\]
for all \(x\). Set \(t=\mathbb{E}X\) and \(x=X\). Then,
\[\begin{aligned} g(X)-g(\mathbb{E}X) & \geq b(X-\mathbb{E}X)\end{aligned}\]
Taking expectations on both sides and simplifying:
\[\begin{aligned} \mathbb{E}\left(g(X)\right)-g(\mathbb{E}X) & \geq b(\mathbb{E}X-\mathbb{E}X)=0\\ \mathbb{E}g(X) & \geq g(\mathbb{E}X)\end{aligned}\]
This closes the proof. \(\blacksquare\)
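A quick numerical illustration, under the assumption (not from the text) that \(X\sim\text{Exp}(1)\) and \(g(x)=x^{2}\): here \(\mathbb{E}g(X)=2\) while \(g(\mathbb{E}X)=1\), consistent with the inequality.

```python
# Jensen's inequality for the convex function g(x) = x^2 on Exp(1) samples.
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1.0, size=1_000_000)

lhs = np.mean(X ** 2)     # E[g(X)], close to 2 for Exp(1)
rhs = np.mean(X) ** 2     # g(E[X]), close to 1
print(lhs, rhs, lhs >= rhs)
```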
Young’s Inequality.
Theorem 5 If \(a\geq0\) and \(b\geq0\) are non-negative real numbers and if \(p>1\) and \(q>1\) are real numbers such that \(\frac{1}{p}+\frac{1}{q}=1\), then:
\[ab\leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\]
Proof.
Consider \(g(x)=\log x\). Since \(\log x\) is concave, Jensen’s inequality is reversed. Assume \(a,b>0\); if \(a=0\) or \(b=0\), the inequality is trivial. We have:
\[\begin{aligned} g\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}g(a^{p})+\frac{1}{q}g(b^{q})\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}\log(a^{p})+\frac{1}{q}\log(b^{q})\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\frac{1}{p}\cdot p\log(a)+\frac{1}{q}\cdot q\log(b)\\ \log\left(\frac{1}{p}a^{p}+\frac{1}{q}b^{q}\right) & \geq\log ab\end{aligned}\]
By the monotonicity of the \(\log x\) function, it follows that:
\[\begin{aligned} ab & \leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\end{aligned}\]
This closes the proof. \(\blacksquare\)
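A brute-force numerical check, with the ranges of \(a\), \(b\) and \(p\) chosen arbitrarily for the sketch:

```python
# Check Young's inequality ab <= a^p/p + b^q/q on random inputs, with q = p/(p-1).
import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(0.0, 10.0, size=100_000)
b = rng.uniform(0.0, 10.0, size=100_000)
p = rng.uniform(1.01, 5.0, size=100_000)
q = p / (p - 1.0)                          # conjugate exponent: 1/p + 1/q = 1

lhs = a * b
rhs = a ** p / p + b ** q / q
assert np.all(lhs <= rhs * (1.0 + 1e-9))   # tiny multiplicative slack for rounding
print("Young's inequality held on all sampled triples (a, b, p).")
```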
Chebyshev’s inequality.
One of the simplest and most useful probabilistic inequalities is a tail bound in terms of the expectation: the so-called Chebyshev’s inequality (in this basic form it is often also called Markov’s inequality).
Theorem 6 (Chebyshev’s inequality) If \(X\) is a non-negative random variable, then for every \(t>0\):
\[\mathbb{P}(X\geq t)\leq\frac{1}{t}\mathbb{E}X\]
Proof.
We have:
\[\begin{aligned} t\cdot\mathbf{1}_{\{X\geq t\}} & \leq X\cdot\mathbf{1}_{\{X\geq t\}}\end{aligned}\]
By the monotonicity of expectations, we have:
\[\begin{aligned} \mathbb{E}\mathbf{1}_{\{X\geq t\}} & \leq\frac{1}{t}\mathbb{E}X\\ \implies\mathbb{P}\{X\geq t\} & \leq\frac{1}{t}\mathbb{E}X\end{aligned}\]
This closes the proof. \(\blacksquare\)
There are several variants, easily deduced from Chebyshev’s inequality using the monotonicity of various functions. For a non-negative random variable \(X\) and \(t>0\), using the power function \(x^{p}\), \(p>0\), we get:
\[\mathbb{P}(X\geq t)=\mathbb{P}(X^{p}\geq t^{p})\leq\frac{1}{t^{p}}\mathbb{E}X^{p}\]
For a real-valued random variable \(X\) and every \(t>0\), using the square function \(x^{2}\) and the variance, we have:
\[\mathbb{P}(|X-\mathbb{E}X|\geq t)\leq\frac{1}{t^{2}}\mathbb{E}|X-\mathbb{E}X|^{2}=\frac{1}{t^{2}}Var(X)\]
For a real-valued random variable \(X\), every \(t\in\mathbf{R}\) and \(\lambda>0\), using the exponential function \(e^{\lambda x}\) (which is monotonic), we have:
\[\mathbb{P}(X\geq t)=\mathbb{P}(\lambda X\geq\lambda t)=\mathbb{P}(e^{\lambda X}\geq e^{\lambda t})\leq\frac{1}{e^{\lambda t}}\mathbb{E}e^{\lambda X}\]
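To get a feel for how loose or tight these bounds are, the sketch below compares them for \(X\sim\text{Exp}(1)\) (a hypothetical choice), whose exact tail is \(\mathbb{P}(X\geq t)=e^{-t}\); the value \(\lambda=0.5\) in the exponential bound is likewise an arbitrary choice, not an optimized one.

```python
# Exact tail of Exp(1) versus the expectation, second-moment and exponential bounds.
import numpy as np

t = np.array([2.0, 5.0, 10.0])
exact = np.exp(-t)                 # P(X >= t) for X ~ Exp(1)
markov = 1.0 / t                   # E[X]/t, with E[X] = 1
second_moment = 2.0 / t ** 2       # E[X^2]/t^2, with E[X^2] = 2
lam = 0.5
exp_bound = np.exp(-lam * t) / (1.0 - lam)   # e^{-lam t} E[e^{lam X}], E[e^{lam X}] = 1/(1 - lam)

for row in zip(t, exact, markov, second_moment, exp_bound):
    print("t=%4.1f  exact=%.2e  E[X]/t=%.2e  E[X^2]/t^2=%.2e  exp=%.2e" % row)
```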
Our next inequality, the so-called Hölder’s inequality, is a very effective tool for bounding the expectation of a product.
Hölder’s inequality.
Theorem 7 Let \(p,q>1\) be such that \(\frac{1}{p}+\frac{1}{q}=1\). For random variables \(X\) and \(Y\), we have:
\[\begin{aligned} \mathbb{E}|XY| & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}\end{aligned}\]
Proof. From Young’s inequality, for any \(a,b\geq0\) and \(p,q>1\) with \(\frac{1}{p}+\frac{1}{q}=1\), we have:
\[\begin{aligned} ab & \leq\frac{a^{p}}{p}+\frac{b^{q}}{q}\end{aligned}\]
Setting \(a=\frac{|X|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}}\) and \(b=\frac{|Y|}{\left(\mathbb{E}|Y|^{q}\right)^{1/q}}\), we get:
\[\begin{aligned} \frac{|XY|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}} & \leq\frac{1}{p}\cdot\frac{|X|^{p}}{\mathbb{E}|X|^{p}}+\frac{1}{q}\cdot\frac{|Y|^{q}}{\mathbb{E}|Y|^{q}}\end{aligned}\]
Taking expectations on both sides, and using the monotonicity of expectation property, we get:
\[\begin{aligned} \frac{\mathbb{E}|XY|}{\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}} & \leq\frac{1}{p}\cdot\frac{\mathbb{E}|X|^{p}}{\mathbb{E}|X|^{p}}+\frac{1}{q}\cdot\frac{\mathbb{E}|Y|^{q}}{\mathbb{E}|Y|^{q}}=\frac{1}{p}+\frac{1}{q}=1\end{aligned}\]
Consequently,
\[\begin{aligned} \mathbb{E}|XY| & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|Y|^{q}\right)^{1/q}\end{aligned}\]
This closes the proof. \(\blacksquare\)

Setting \(p=2\) and \(q=2\), we get the Cauchy-Schwarz inequality:
\[\begin{aligned} \mathbb{E}|XY| & \leq\left[\mathbb{E}(X^{2})\right]^{1/2}\left[\mathbb{E}(Y^{2})\right]^{1/2}\end{aligned}\]
In some ways, the \(p\)-th moment of a random variable can be thought of as its length or \(p\)-norm.
Define:
\[\left\Vert X\right\Vert _{p}=\left(\mathbb{E}|X|^{p}\right)^{1/p}\]
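A Monte Carlo sanity check of Hölder’s inequality; the distributions of \(X\) and \(Y\) and the exponents \(p=3\), \(q=3/2\) are arbitrary choices for the sketch.

```python
# Check E|XY| <= ||X||_p ||Y||_q for p = 3, q = 3/2 on simulated data.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal(1_000_000)
Y = rng.exponential(scale=2.0, size=1_000_000)
p, q = 3.0, 1.5                              # conjugate exponents: 1/3 + 2/3 = 1

lhs = np.mean(np.abs(X * Y))
rhs = np.mean(np.abs(X) ** p) ** (1 / p) * np.mean(np.abs(Y) ** q) ** (1 / q)
print(lhs, rhs, lhs <= rhs)
```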
Minkowski’s Inequality.
Theorem 8 For random variables \(X\) and \(Y\), and for all \(p\geq1\) we have:
\[\left\Vert X+Y\right\Vert _{p}\leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\]
Proof.
The basic idea of the proof is to use Hölder’s inequality. Assume \(p>1\) (for \(p=1\) the claim is just the triangle inequality \(|X+Y|\leq|X|+|Y|\) together with linearity of expectation). Let \(\frac{1}{q}=1-\frac{1}{p}\), or in other words, \(q=\frac{p}{p-1}\). We have:
\[\begin{aligned} \mathbb{E}|X||X+Y|^{p-1} & \leq\left(\mathbb{E}|X|^{p}\right)^{1/p}\left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q} & (a)\\ \mathbb{E}|Y||X+Y|^{p-1} & \leq\left(\mathbb{E}|Y|^{p}\right)^{1/p}\left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q} & (b)\end{aligned}\]
Adding the above two inequalities and using \(|X+Y|\leq|X|+|Y|\) together with \((p-1)q=p\), we get:
\[\begin{aligned} \mathbb{E}(|X+Y||X+Y|^{p-1})\leq\mathbb{E}(|X|+|Y|)(|X+Y|^{p-1}) & \leq\left\{ \left(\mathbb{E}|X|^{p}\right)^{1/p}+\left(\mathbb{E}|Y|^{p}\right)^{1/p}\right\} \left(\mathbb{E}|X+Y|^{(p-1)q}\right)^{1/q}\\ \mathbb{E}|X+Y|^{p} & \leq\left\{ \left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\right\} \left(\mathbb{E}|X+Y|^{p}\right)^{1/q}\\ \left(\mathbb{E}|X+Y|^{p}\right)^{1/p} & \leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\\ \left\Vert X+Y\right\Vert _{p} & \leq\left\Vert X\right\Vert _{p}+\left\Vert Y\right\Vert _{p}\end{aligned}\]
This closes the proof. \(\blacksquare\)
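And a similar check of Minkowski’s inequality, with \(p=3\) and a deliberately dependent pair \((X,Y)\) chosen for the sketch:

```python
# Check ||X + Y||_p <= ||X||_p + ||Y||_p for p = 3 on simulated, correlated data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal(1_000_000)
Y = 0.5 * X + rng.standard_normal(1_000_000)   # correlated with X on purpose

def p_norm(Z, p):
    return np.mean(np.abs(Z) ** p) ** (1.0 / p)

p = 3.0
print(p_norm(X + Y, p), p_norm(X, p) + p_norm(Y, p))   # first value <= second
```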
A quick refresher on linear algebra.
Many of the concepts in this chapter have very elegant interpretations, if we think of real-valued random variables on a probability space as vectors in a vector space. In particular, variance is related to the concept of norm and distance, while covariance is related to inner-products. These concepts can help unify some of the ideas in this chapter from a geometric point of view. Of course, real-valued random variables are simply measurable, real-valued functions on the abstract space \(\Omega.\)
Definition 4 (Vector Space.) By a vector space over a field \(F\) (here \(F=\mathbf{R}\)), we mean a non-empty set \(V\) with two operations:
Vector addition: \(+:(\mathbf{x},\mathbf{y})\to\mathbf{x}+\mathbf{y}\)
Scalar multiplication: \(\cdot:(\alpha,\mathbf{x})\to\alpha\mathbf{x}\)
such that the following conditions are satisfied:
(A1) Commutativity. \(\mathbf{x}+\mathbf{y}=\mathbf{y}+\mathbf{x}\) for all \(\mathbf{x},\mathbf{y}\in V\)
(A2) Associativity: \((\mathbf{x}+\mathbf{y})+\mathbf{z}=\mathbf{x}+(\mathbf{y}+\mathbf{z})\) for all \(\mathbf{x},\mathbf{y},\mathbf{z}\in V\)
(A3) Zero Element: There exists a zero element, denoted \(\mathbf{0}\), in \(V\) such that \(\mathbf{x}+\mathbf{0}=\mathbf{x}\) for all \(\mathbf{x}\in V\).
(A4) Additive Inverse: For all \(\mathbf{x}\in V\), there exists an additive inverse (negative element), denoted \(-\mathbf{x}\), in \(V\) such that \(\mathbf{x}+(-\mathbf{x})=\mathbf{0}\).
(M1) Scalar multiplication by identity element in \(F\): For all \(\mathbf{x}\in V\), \(1\cdot\mathbf{x}=\mathbf{x}\), where \(1\) denotes the multiplicative identity in \(F\).
(M2) Scalar multiplication and field multiplication mix well: For all \(\alpha,\beta\in F\) and \(\mathbf{v}\in V\), \((\alpha\beta)\mathbf{v}=\alpha(\beta\mathbf{v})\).
(D1) Distribution of scalar multiplication over vector addition: For all \(\alpha\in F\), and \(\mathbf{u},\mathbf{v}\in V\), \(\alpha(\mathbf{u}+\mathbf{v})=\alpha\mathbf{u}+\alpha\mathbf{v}\).
(D2) Distribution of field addition over scalar multiplication: For all \(\alpha,\beta\in F\), and \(\mathbf{v}\in V\), \((\alpha+\beta)\mathbf{v}=\alpha\mathbf{v}+\beta\mathbf{v}\).
As usual, our starting point is a random experiment modeled by a probability space \((\Omega,\mathcal{F},\mathbb{P})\), so that \(\Omega\) is the set of outcomes, \(\mathcal{F}\) is the \(\sigma\)-algebra of events and \(\mathbb{P}\) is the probability measure on the measurable space \((\Omega,\mathcal{F})\). Our basic vector space \(V\) consists of all real-valued random variables defined on \((\Omega,\mathcal{F},\mathbb{P})\). We define vector addition and scalar multiplication in the usual way, point-wise.
Vector addition: \((X+Y)(\omega)=X(\omega)+Y(\omega)\).
Scalar multiplication: \((\alpha X)(\omega)=\alpha X(\omega)\)
Clearly, any function \(g\) of a random variable \(X(\omega)\) is also a random variable on the same probability space, and any linear combination of random variables on \((\Omega,\mathcal{F},\mathbb{P})\) also defines a new random variable on the same probability space. Thus, \(V\) is closed under vector addition and scalar multiplication. Since vector addition and scalar multiplication are defined point-wise, it is easy to see that all the axioms of a vector space (A1)-(A4), (M1)-(M2), (D1), (D2) are satisfied. The identically zero random variable \(0(\omega)=0\) and the indicator random variable \(I_{\Omega}(\omega)\) can be thought of as the zero and identity elements in this vector space.
Inner Products.
In Euclidean geometry, the angle between two vectors is specified by their dot product, which is itself formalized by the abstract concept of inner products.
Definition 5 (Inner Product.) An inner product on the real vector space \(V\) is a pairing that takes two vectors \(\mathbf{v},\mathbf{w}\in V\) and produces a real number \(\left\langle \mathbf{v},\mathbf{w}\right\rangle \in\mathbf{R}\). The inner product is required to satisfy the following three axioms for all \(\mathbf{u},\mathbf{v},\mathbf{w}\in V\) and scalars \(c,d\in\mathbf{R}\).
- Bilinearity: \[ \left\langle c\mathbf{u}+d\mathbf{v},\mathbf{w}\right\rangle =c\left\langle \mathbf{u},\mathbf{w}\right\rangle +d\left\langle \mathbf{v},\mathbf{w}\right\rangle \]
\[ \left\langle \mathbf{u},c\mathbf{v}+d\mathbf{w}\right\rangle =c\left\langle \mathbf{u},\mathbf{v}\right\rangle +d\left\langle \mathbf{u},\mathbf{w}\right\rangle \]
- Symmetry:
\[ \left\langle \mathbf{v},\mathbf{w}\right\rangle =\left\langle \mathbf{w},\mathbf{v}\right\rangle \]
- Positive Definiteness:
\[ \left\langle \mathbf{v},\mathbf{v}\right\rangle >0\quad\text{ whenever }\mathbf{v\neq\mathbf{0}} \]
\[ \left\langle \mathbf{v},\mathbf{v}\right\rangle =0\quad\text{ whenever }\mathbf{v=0} \]
Definition 6 (Norm). A norm on a real vector space \(V\) is a function \(\left\Vert \cdot\right\Vert :V\to\mathbf{R}\) satisfying :
(i) Positive Definiteness.
\[\left\Vert \mathbf{v}\right\Vert \geq0\]
and \[\left\Vert \mathbf{v}\right\Vert =0\quad\text{if and only if }\mathbf{v}=\mathbf{0}\]
(ii) Scalar multiplication.
\[\left\Vert \alpha\mathbf{v}\right\Vert =|\alpha|\left\Vert \mathbf{v}\right\Vert\]
(iii) Triangle Inequality.
\[\left\Vert \mathbf{x+y}\right\Vert \leq\left\Vert \mathbf{x}\right\Vert +\left\Vert \mathbf{y}\right\Vert\]
As mentioned earlier, we can define the \(p\)-norm of a random variable as:
\[ \begin{aligned} \left\Vert X\right\Vert _{p} & =\left(\mathbb{E}|X|^{p}\right)^{1/p} \end{aligned} \]
(i) Positive definiteness: Since \(|X|\) is a non-negative random variable, \(|X|^{p}\geq0\) and the expectation of a non-negative random variable is also non-negative. Hence, \((\mathbb{E}|X|^{p})^{1/p}\geq0\). Moreover, \(\left\Vert X\right\Vert _{p}=0\) implies that \(\mathbb{E}|X|^{p}=0\). From property (iv) of expectations, \(X=0\) (almost surely).
(ii) Scalar-multiplication: We have:
\[ \begin{aligned} \left\Vert cX\right\Vert _{p} & =\left(\mathbb{E}|cX|^{p}\right)^{1/p}\\ & =\left(|c|^{p}\right)^{1/p}\left(\mathbb{E}|X|^{p}\right)^{1/p}\\ & =|c|\cdot\left\Vert X\right\Vert _{p} \end{aligned} \]
(iii) Triangle Inequality. This follows from Minkowski’s inequality.
The space of all random variables defined on \((\Omega,\mathcal{F},\mathbb{P})\) such that \(\left\Vert X\right\Vert _{p}<\infty\) is called the \(L^{p}\) space.
Orthogonal Matrices.
Definition 7 (Orthogonal Matrix.) Let \(A\) be an \(n\times n\) square matrix. We say that the matrix \(A\) is orthogonal if its transpose is equal to its inverse.
\[ \begin{aligned} A' & =A^{-1} \end{aligned} \]
This may seem like an odd property to study, but the following theorem explains why it is so useful. Essentially, an orthogonal matrix rotates (or reflects) vectors without distorting angles or distances.
Proposition 1 (Properties of an orthogonal matrix) For an \(n\times n\) square matrix \(A\), the following are equivalent:
(1) \(A\) is orthogonal. That is, \(A'A=I\).
(2) \(A\) preserves norms. That is, for all \(\mathbf{x}\),
\[\begin{aligned} \left\Vert A\mathbf{x}\right\Vert &=\left\Vert \mathbf{x}\right\Vert \end{aligned}\]
(3) \(A\) preserves inner products. That is, for every \(\mathbf{x},\mathbf{y}\in\mathbf{R}^{n}\):
\[\begin{aligned} (A\mathbf{x})\cdot(A\mathbf{y}) &=\mathbf{x}\cdot\mathbf{y}\end{aligned}\]
Proof.
We have:
\[\begin{aligned} \left\Vert A\mathbf{x}\right\Vert ^{2} & =\left(A\mathbf{x}\right)'(A\mathbf{x})\\ & =\mathbf{x}'(A'A)\mathbf{x}\\ & =\mathbf{x}'I\mathbf{x}\\ & =\mathbf{x}'\mathbf{x}\\ & =||\mathbf{x}||^{2}\end{aligned}\]
Consequently, \(||A\mathbf{x}||=||\mathbf{x}||\). The matrix \(A\) preserves norms. Thus, (1) implies (2).
Moreover, consider
\[\begin{aligned} ||A(\mathbf{x}+\mathbf{y})||^{2} & =\left(A\mathbf{x}+A\mathbf{y}\right)\cdot\left(A\mathbf{x}+A\mathbf{y}\right)\\ & =(A\mathbf{x})\cdot(A\mathbf{x})+(A\mathbf{x})\cdot(A\mathbf{y})+(A\mathbf{y})\cdot(A\mathbf{x})+(A\mathbf{y})\cdot(A\mathbf{y})\\ & =||A\mathbf{x}||^{2}+2(A\mathbf{x})\cdot(A\mathbf{y})+||A\mathbf{y}||^{2} & \{\mathbf{x}\cdot\mathbf{y}=\mathbf{y}\cdot\mathbf{x}\}\\ & =||\mathbf{x}||^{2}+2(A\mathbf{x})\cdot(A\mathbf{y})+||\mathbf{y}||^{2} & \{A\text{ preserves norms}\}\end{aligned}\]
But, \(||A(\mathbf{x}+\mathbf{y})||^{2}=||\mathbf{x}+\mathbf{y}||^{2}=||\mathbf{x}||^{2}+2\mathbf{x}\cdot\mathbf{y}+||\mathbf{y}||^{2}\).
Equating the two expressions, we have the desired result. Hence, (2) implies (3).
Lastly, suppose \(A\) preserves inner products. Then for every \(\mathbf{x},\mathbf{y}\in\mathbf{R}^{n}\) we may write:
\[\begin{aligned} \left\langle A\mathbf{x},A\mathbf{y}\right\rangle & =\left\langle \mathbf{x},\mathbf{y}\right\rangle \\ \left(A\mathbf{x}\right)'(A\mathbf{y}) & =\mathbf{x}'\mathbf{y}\\ \mathbf{x}'(A'A-I)\mathbf{y} & =0\end{aligned}\]
Choosing \(\mathbf{x}=\mathbf{e}_{i}\) and \(\mathbf{y}=\mathbf{e}_{j}\) shows that the \((i,j)\) entry of \(A'A-I\) is zero, for every \(i\) and \(j\). Hence \(A'A=I\), that is, \(A\) is orthogonal. Thus, (3) implies (1). \(\blacksquare\)
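A minimal numpy sketch of Proposition 1: the orthogonal matrix is obtained from a QR decomposition of a random matrix (an arbitrary construction for the example), and the three equivalent properties are verified numerically.

```python
# Build an orthogonal matrix Q via QR and check: Q'Q = I, norm preservation,
# and inner-product preservation.
import numpy as np

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # Q has orthonormal columns

x = rng.standard_normal(5)
y = rng.standard_normal(5)

print(np.allclose(Q.T @ Q, np.eye(5)))                        # (1) A'A = I
print(np.allclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))  # (2) ||Ax|| = ||x||
print(np.allclose((Q @ x) @ (Q @ y), x @ y))                  # (3) (Ax).(Ay) = x.y
```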
Theorem 9 (Linear Independence of orthogonal vectors) If \(\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{k}\in V\) are mutually orthogonal elements such that \(\mathbf{q}_{i}\neq\mathbf{0}\) for all \(i\), then \(\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{k}\) are linearly independent.
Proof.
Let
\[\begin{aligned} c_{1}\mathbf{q}_{1}+c_{2}\mathbf{q}_{2}+\ldots+c_{k}\mathbf{q}_{k} & =\mathbf{0}\end{aligned}\]
Since \(\left\langle \mathbf{q}_{i},\mathbf{q}_{j}\right\rangle =0\) whenever \(i\neq j\), we can take the inner product of the vector \(c_{1}\mathbf{q}_{1}+c_{2}\mathbf{q}_{2}+\ldots+c_{i}\mathbf{q}_{i}+\ldots+c_{k}\mathbf{q}_{k}\) with \(\mathbf{q}_{i}\) for each \(i=1,2,\ldots,k\). This results in \(c_{i}\left\langle \mathbf{q}_{i},\mathbf{q}_{i}\right\rangle =c_{i}||\mathbf{q}_{i}||^{2}=0\). Since \(\mathbf{q}_{i}\neq\mathbf{0}\), \(||\mathbf{q}_{i}||^{2}>0\), so \(c_{i}=0\). We conclude that \(c_{1}=c_{2}=\ldots=c_{k}=0\). Consequently, \(\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{k}\) are linearly independent. \(\blacksquare\)
Theorem 10 (Orthogonal vectors form a basis) Let \(Q=\left[\begin{array}{cccc} \mathbf{q}_{1} & \mathbf{q}_{2} & \ldots & \mathbf{q}_{n}\end{array}\right]\) be an \(n\times n\) orthogonal matrix. Then, \(\{\mathbf{q}_{1},\ldots,\mathbf{q}_{n}\}\) form an orthonormal basis for \(\mathbf{R}^{n}\).
Proof.
We have \(Q\mathbf{e}_{i}=\mathbf{q}_{i}\). Consequently,
\[\begin{aligned} \left\langle \mathbf{q}_{i},\mathbf{q}_{i}\right\rangle & =\mathbf{q}_{i}'\mathbf{q}_{i}\\ & =(Q\mathbf{e}_{i})'(Q\mathbf{e}_{i})\\ & =\mathbf{e}_{i}'Q'Q\mathbf{e}_{i}\\ & =\mathbf{e}_{i}'I\mathbf{e}_{i}\\ & =\mathbf{e}_{i}'\mathbf{e}_{i}\\ & =1\end{aligned}\]
Assume that \(i\neq j\). We have:
\[\begin{aligned} \left\langle \mathbf{q}_{i},\mathbf{q}_{j}\right\rangle & =\mathbf{q}_{i}'\mathbf{q}_{j}\\ & =\mathbf{e}_{i}'Q'Q\mathbf{e}_{j}\\ & =\mathbf{e}_{i}'\mathbf{e}_{j}\\ & =0\end{aligned}\]
From Theorem 9, \(\{\mathbf{q}_{1},\ldots,\mathbf{q}_{n}\}\) are linearly independent; since they are \(n\) mutually orthogonal unit vectors in \(\mathbf{R}^{n}\), they form an orthonormal basis for \(\mathbf{R}^{n}\). \(\blacksquare\)
Quadratic Forms.
An expression of the form:
\[\mathbf{x}'A\mathbf{x}\]
where \(\mathbf{x}\) is an \(n\times1\) column vector and \(A\) is an \(n\times n\) matrix, is called a quadratic form in \(\mathbf{x}\), and
\[\begin{aligned} \mathbf{x}'A\mathbf{x} & =\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}x_{i}x_{j}\end{aligned}\]
If \(A\) and \(B\) are \(n\times n\) and \(\mathbf{x},\mathbf{y}\) are \(n\)-vectors, then
\[\begin{aligned} \mathbf{x}'(A+B)\mathbf{y} & =\mathbf{x}'A\mathbf{y}+\mathbf{x}'B\mathbf{y}\end{aligned}\]
The quadratic form of the matrix \(A\) is called positive definite if:
\[\begin{aligned} \mathbf{x}'A\mathbf{x} & >0\quad\text{whenever }\mathbf{x}\neq\mathbf{0}\end{aligned}\]
and positive semidefinite if:
\[\begin{aligned} \mathbf{x}'A\mathbf{x} & \geq0\quad\text{for all }\mathbf{x}\end{aligned}\]
Letting \(\mathbf{e}_{i}\) be the unit vector with its \(i\)th coordinate equal to \(1\), we have:
\[ \begin{aligned} \mathbf{e}_{i}'A\mathbf{e}_{i} & =\left[a_{i1}a_{i2}\ldots a_{ii}\ldots a_{in}\right]\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right]=a_{ii} \end{aligned} \]
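A short numpy illustration of these facts; the matrix \(A=B'B+I\) is a hypothetical construction that is symmetric and positive definite by design.

```python
# Quadratic forms: x'Ax as a double sum, e_i'Ae_i = a_ii, and a positive-definiteness check.
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = B.T @ B + np.eye(4)          # symmetric, positive definite by construction

x = rng.standard_normal(4)
quad = x @ A @ x
double_sum = sum(A[i, j] * x[i] * x[j] for i in range(4) for j in range(4))
print(np.isclose(quad, double_sum))          # x'Ax equals the double sum

e2 = np.zeros(4)
e2[2] = 1.0
print(np.isclose(e2 @ A @ e2, A[2, 2]))      # e_i'Ae_i picks out the diagonal entry a_ii

print(np.all(np.linalg.eigvalsh(A) > 0))     # all eigenvalues positive, so x'Ax > 0 for x != 0
```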
Eigenthingies and diagonalizability.
Let \(V\) and \(W\) be finite dimensional vector spaces with \(\dim(V)=n\) and \(\dim(W)=m\), with ordered bases \(\mathcal{B}_{V}=\{\mathbf{v}_{1},\ldots,\mathbf{v}_{n}\}\) and \(\mathcal{B}_{W}=\{\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\}\). A linear transformation \(T:V\to W\) is determined by its action on the basis vectors. Suppose:
\[ \begin{aligned} T(\mathbf{v}_{j}) & =\sum_{i=1}^{m}a_{ij}\mathbf{w}_{i} \end{aligned} \]
for all \(1\leq j\leq n\).
Then, the matrix \(A=[T]_{\mathcal{B}_{V}}^{\mathcal{B}_{W}}\) of the linear transformation is defined as:
\[\begin{aligned} A & =\left[\begin{array}{cccc} a_{11} & a_{12} & \ldots & a_{1n}\\ a_{21} & a_{22} & \ldots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{array}\right]\end{aligned}\]
Definition 8 A linear transformation \(T:V\to V\) is said to be diagonalizable if there exists an ordered basis \(\mathcal{B}=\{\mathbf{v}_{1},\ldots,\mathbf{v}_{n}\}\) for \(V\) so that the matrix for \(T\) with respect to \(\mathcal{B}\) is diagonal. This means precisely that, for some scalars \(\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\), we have:
\[ \begin{aligned} T(\mathbf{v}_{1}) & =\lambda_{1}\mathbf{v}_{1}\\ T(\mathbf{v}_{2}) & =\lambda_{2}\mathbf{v}_{2}\\ \vdots\\ T(\mathbf{v}_{n}) & =\lambda_{n}\mathbf{v}_{n} \end{aligned} \]
In other words, if \(A=[T]_{\mathcal{B}}\), then we have:
\[\begin{aligned} A\mathbf{v}_{i} & =\lambda_{i}\mathbf{v}_{i}\end{aligned}\]
Thus, if we let \(P\) be the \(n\times n\) matrix whose columns are the vectors \(\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{n}\) and \(\Lambda\) be the \(n\times n\) diagonal matrix with diagonal entries \(\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\), then we have:
\[\begin{aligned} A\left[\begin{array}{cccc} \mathbf{v}_{1} & \mathbf{v}_{2} & \ldots & \mathbf{v}_{n}\end{array}\right] & =\left[\begin{array}{cccc} \mathbf{v}_{1} & \mathbf{v}_{2} & \ldots & \mathbf{v}_{n}\end{array}\right]\left[\begin{array}{cccc} \lambda_{1}\\ & \lambda_{2}\\ & & \ddots\\ & & & \lambda_{n} \end{array}\right]\\ AP & =P\Lambda\\ A & =P\Lambda P^{-1}\end{aligned}\]
There exists a large class of diagonalizable matrices: the symmetric matrices. A square matrix \(A\) is symmetric if \(A=A'\).
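A minimal numpy sketch of diagonalization for a symmetric matrix (the matrix itself is a random, hypothetical example): `numpy.linalg.eigh` returns the eigenvalues and an orthogonal matrix of eigenvectors, so that \(A=P\Lambda P^{-1}\) with \(P^{-1}=P'\).

```python
# Diagonalize a symmetric matrix: A P = P Lambda and A = P Lambda P'.
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2.0                      # symmetrize so that A = A'

eigvals, P = np.linalg.eigh(A)           # columns of P are orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A @ P, P @ Lam))       # A v_i = lambda_i v_i, column by column
print(np.allclose(A, P @ Lam @ P.T))     # A = P Lambda P^{-1}, with P^{-1} = P'
```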
Definition 9 (Eigenvectors) Let \(T:V\to V\) be a linear transformation. A non-zero vector \(\mathbf{v}\in V\) is called an eigenvector of \(T\) if there is a scalar \(\lambda\) so that \(T(\mathbf{v})=\lambda\mathbf{v}\). The scalar \(\lambda\) is called an eigenvalue of \(T\).