Discriminative versus Generative models
Let’s say that you have input data \(\mathcal{D}=\{(\mathbf{x}_i,y_i): i=1,2,\ldots,N\}\), and suppose that \(y_i \in \{0,1\}\). Each input has \(D\) features, so \(\mathbf{x}_i \in \mathbf{R}^D\). You want to classify an arbitrary feature vector \(\mathbf{x}\) into \(y=0\) or \(y=1\).
One way to build a classifier is to learn the joint probability distribution \(p(\mathbf{x},y)\) and then to condition on \(\mathbf{x}\), thereby deriving \(p(y=c|\mathbf{x})\). This is called the generative approach. An alternative approach is to fit a model of the form \(p(y|\mathbf{x})\) directly. This is called the discriminative approach.
Logistic Regression
The sigmoid function \(sigm(x)\) is defined as:
\[\begin{align*} sigm(x) = \frac{e^x}{1+e^x} \tag{1} \end{align*}\]
Logistic regression models the class posterior probability as:
\[\begin{align*} p(y=1|\mathbf{x}) =sigm(\mathbf{w}^T \mathbf{x}) = \frac{e^{\mathbf{w}^T \mathbf{x}}}{1 + e^{\mathbf{w}^T \mathbf{x}}} \tag{2} \end{align*}\]
Re-arranging, we can write:
\[\begin{align*} \frac{p(y=1|\mathbf{x})}{1 - p(y=1|\mathbf{x})} &= e^{\mathbf{w}^T \mathbf{x}}\\ \log \left(\frac{p(y=1|\mathbf{x})}{1 - p(y=1|\mathbf{x})}\right) &= \mathbf{w}^T \mathbf{x} \tag{3} \end{align*}\]
The quantity \(\frac{p(y=1|\mathbf{x})}{1 - p(y=1|\mathbf{x})}\) is called the odds and can take on any value between \(0\) and \(\infty\). Odds are traditionally used instead of probabilities to express chances of winning in horse-racing and casino games such as roulette.
The left-hand side is called the log odds or logit. In the simplest case of \(D=1\) predictor, equation (3) becomes:
\[\begin{align*} \log \left(\frac{p(y_i = 1|x_i,\mathbf{w})}{1 - p(y_i = 1|x_i,\mathbf{w})}\right) &= w_0 + w_1 x_i \tag{4} \end{align*}\]
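As a quick numerical illustration (with made-up coefficients \(w_0 = -1\) and \(w_1 = 2\)), here is a short Python sketch that computes the log odds, the odds, and the resulting probability for a single predictor value:

```python
import numpy as np

def sigm(z):
    """Sigmoid function, equation (1)."""
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical coefficients and a single predictor value
w0, w1 = -1.0, 2.0
x = 1.5

logit = w0 + w1 * x      # log odds, equation (4): -1 + 2*1.5 = 2.0
odds = np.exp(logit)     # odds = e^2, roughly 7.39
p = sigm(logit)          # p(y=1|x) = 7.39 / 8.39, roughly 0.88

print(logit, odds, p)
```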
Likelihood
The likelihood of all the data is:
\[\begin{align*} L(w_0,w_1) &= \prod_{i=1}^{N} p(y_i|\mathbf{x}_i) \\ &= \prod_{i=1}^{N} p(y_i=1|\mathbf{x}_i)^{I(y_i=1)} p(y_i=0|\mathbf{x}_i)^{I(y_i=0)} \\ &= \prod_{i=1}^{N} p(y_i=1|\mathbf{x}_i)^{I(y_i=1)} \cdot [1 - p(y_i=1|\mathbf{x}_i)]^{I(y_i=0)} \tag{5} \end{align*}\]
We seek estimates for \(w_0\) and \(w_1\) such that the predicted class probabilities \(\hat{p}(y_i = 1|x_i)\) and \(\hat{p}(y_i = 0|x_i)\) are as close as possible to the observed class labels. So, we try to maximize the likelihood function \(L\).
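As a sketch of how this maximization can be carried out in practice (illustrative code, not a reference implementation), one can run gradient ascent on the log-likelihood; its gradient with respect to \(\mathbf{w}\) is \(\sum_{i=1}^{N} (y_i - sigm(\mathbf{w}^T\mathbf{x}_i))\,\mathbf{x}_i\):

```python
import numpy as np

def sigm(z):
    # Equivalent to equation (1), written in a numerically friendlier form
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    """Maximize the likelihood (5) by gradient ascent on the log-likelihood.

    X : (N, D) feature matrix; a column of ones is prepended for the intercept w0.
    y : (N,) array of 0/1 labels.
    """
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigm(Xb @ w)                  # predicted p(y=1 | x_i)
        grad = Xb.T @ (y - p)             # gradient of the log-likelihood
        w += lr * grad / len(y)
    return w                              # [w0, w1, ..., wD]

# Example with synthetic data generated from known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (rng.random(200) < sigm(-1.0 + 2.0 * X[:, 0])).astype(float)
print(fit_logistic_regression(X, y))      # roughly recovers [-1, 2]
```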
Linear Discriminant Analysis
Let \(c\) be an arbitrary class label. By the Bayes formula,
\[\begin{align*} p(y=c|\mathbf{x}) &= \frac{p(\mathbf{x},y=c)}{p(\mathbf{x})} \\ &= \frac{p(\mathbf{x}|y=c) \cdot p(y=c)}{\sum_{c'=1}^{C} p(\mathbf{x}|y=c') \cdot p(y=c')} \tag{6} \end{align*}\]
LDA is a generative classifier: it models the class-conditional distribution \(p(\mathbf{x}|y=c)\) and the class prior \(p(y=c)\), and applies the Bayes rule to derive \(p(y=c|\mathbf{x})\).
LDA makes the following assumptions:
- The class prior follows a Bernoulli distribution with parameter \(\phi = p(y=1)\):
\[\begin{align*} p(y=y_i) = \phi^{y_i} (1 - \phi)^{(1-y_i)} \tag{7} \end{align*}\]
- The data from class \(c\) follow a \(D\)-dimensional multivariate Gaussian distribution with class-specific mean \(\mathbf{\mu}_c\) and a covariance matrix \(\mathbf{\Sigma}\) shared by all classes. We have:
\[\begin{align*} p(\mathbf{x}|y=c) = \mathcal{N}(\mathbf{\mu}_c,\mathbf{\Sigma}) \tag{8} \end{align*}\]
Thus,
\[\begin{align*} p(\mathbf{x}|y=c) &= \frac{1}{(2\pi)^{D/2} |\det \mathbf{\Sigma}|^{1/2}} \exp \left[-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_c)^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_c) \right] \tag{9} \end{align*}\]
Likelihood
The likelihood of all the data is:
\[\begin{align*} L(\phi,\mathbf{\mu}_1,\ldots,\mathbf{\mu}_C,\mathbf{\Sigma}) &= \prod_{i=1}^{N} p(\mathbf{x}_i,y_i)\\ &=\prod_{i=1}^{N} p(\mathbf{x}_i|y_i)\cdot p(y=y_i) \tag{10} \end{align*}\]
Log-Likelihood
The log-likelihood function \(l\) is:
\[\begin{align*} l(\phi,\mathbf{\mu}_1,\ldots,\mathbf{\mu}_C,\mathbf{\Sigma}) = \log L &= \sum_{i=1}^{N} \log p(\mathbf{x}_i|y_i) + \sum_{i=1}^{N} \log p(y=y_i) \tag{11} \end{align*}\]
For simplicity let’s assume we have \(C=2\) classes. Then, the above sum can be written as:
\[\begin{align*} l(\phi,\mathbf{\mu}_0,\mathbf{\mu}_1,\mathbf{\Sigma}) &= \sum_{i=1}^{N} I(y_i=1)\log p(\mathbf{x}_i|y=1) + \sum_{i=1}^{N} I(y_i = 0)\log p(\mathbf{x}_i|y=0) \\ &+ \sum_{i=1}^{N} I(y_i=1) \log p(y=y_i) + \sum_{i=1}^{N} I(y_i=0) \log p(y=y_i) \tag{12} \end{align*}\]
MLE Estimate for \(\phi\)
The first two terms of the log-likelihood function \(l\) do not depend on \(\phi\). Substituting \(p(y=y_i) = \phi^{y_i}(1-\phi)^{1-y_i}\) into the last two terms and taking the partial derivative of \(l\) with respect to \(\phi\), we are left with:
\[\begin{align*} \frac{\partial l}{\partial \phi} &= \frac{\partial}{\partial \phi}\left[\sum_{i=1}^{N}I(y_i = 1) y_i\log \phi + \sum_{i=1}^{N} I(y_i=0)(1-y_i)\log(1-\phi)\right]\\ &= \sum_{i=1}^{N} I(y_i = 1) \frac{y_i}{\phi} + \sum_{i=1}^{N} I(y_i=0) (1-y_i)\frac{-1}{1-\phi}\\ &= \sum_{i=1}^{N} I(y_i = 1) \frac{1}{\phi} - \sum_{i=1}^{N} I(y_i=0) \frac{1}{1-\phi} \tag{13} \end{align*}\]
Equating \(\frac{\partial l}{\partial \phi}\) to zero:
\[\begin{align*} \sum_{i=1}^{N} I(y_i = 1) \frac{1}{\phi} &= \sum_{i=1}^{N} I(y_i=0) \frac{1}{1-\phi}\\ (1-\phi)\sum_{i=1}^{N} I(y_i = 1) &= \phi\sum_{i=1}^{N} I(y_i=0)\\ \sum_{i=1}^{N} I(y_i = 1) &= \phi\sum_{i=1}^{N} I(y_i=0) + \phi\sum_{i=1}^{N} I(y_i=1)\\ \sum_{i=1}^{N} I(y_i = 1) &= \phi \cdot N \\ \hat{\phi} &= \frac{\sum_{i=1}^{N} I(y_i = 1)}{N} \tag{14} \end{align*}\]
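In other words, \(\hat{\phi}\) is simply the fraction of training points with label \(1\). A minimal numpy check on hypothetical labels:

```python
import numpy as np

y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])  # hypothetical 0/1 labels
phi_hat = np.mean(y == 1)                      # equation (14): 4/10 = 0.4
print(phi_hat)
```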
MLE Estimate for \(\mu_c\)
First, note that:
\[\begin{align*} \log p(\mathbf{x}_i|y=1) = -\frac{D}{2}\log(2\pi) - \frac{1}{2} \log(|\det \mathbf{\Sigma}|) - \frac{1}{2}(\mathbf{x}_i - \mathbf{\mu}_1)^T \mathbf{\Sigma}^{-1}(\mathbf{x}_i - \mathbf{\mu}_1) \tag{15} \end{align*}\]
Thus,
\[\begin{align*} \frac{\partial l}{\partial \mu_1} &= -\frac{1}{2}\sum_{i=1}^{N} I(y_i = 1)\frac{\partial}{\partial \mu_1}[(\mathbf{x}_i - \mu_1)^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \mu_1)] \tag{16} \end{align*}\]
We know that, for a symmetric matrix \(A\), \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = 2A \mathbf{x}\); this applies here with \(A = \mathbf{\Sigma}^{-1}\), which is symmetric.
Consequently, applying the chain rule (the derivative of \(\mathbf{x}_i - \mu_1\) with respect to \(\mu_1\) contributes a factor of \(-1\)),
\[\begin{align*} \frac{\partial l}{\partial \mu_1} &= \mathbf{\Sigma}^{-1}\sum_{i=1}^{N} I(y_i = 1) (\mathbf{x}_i - \mu_1) \tag{17} \end{align*}\]
Equating \(\frac{\partial l}{\partial \mu_1} = 0\), we have:
\[\begin{align*} \hat{\mu}_1 &= \frac{\sum_{i=1}^{N}I(y_i = 1) \mathbf{x}_i}{\sum_{i=1}^{N}I(y_i = 1)} \tag{18} \end{align*}\]
In general, for a class \(c\), we have:
\[\begin{align*} \hat{\mu}_c &= \frac{\sum_{i=1}^{N}I(y_i = c) \mathbf{x}_i}{\sum_{i=1}^{N}I(y_i = c)} \tag{19} \end{align*}\]
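So \(\hat{\mu}_c\) is just the sample mean of the training points that belong to class \(c\). A small numpy sketch on hypothetical data:

```python
import numpy as np

def class_means(X, y):
    """MLE means (19): per-class averages of the feature vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

# Hypothetical data: N=6 points with D=2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 1.0],
              [5.0, 6.0], [6.0, 5.0], [7.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(class_means(X, y))   # {0: [1., 1.33], 1: [6., 6.]}
```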
Traces and Determinants
Definition. The trace of a square matrix \(A\) is defined to be the sum of the diagonal elements \(a_{ii}\) of \(A\):
\[\begin{align*} tr(A) = \sum_i a_{ii} \tag{20} \end{align*}\]
Claim. (Cyclic property) Let \(A,B,C\) be arbitrary matrices whose dimensions are conformal and are such that the product \(ABC\) (and therefore the other two products) is a square matrix. Then, the trace is invariant under cyclic permutations of matrix products:
\[\begin{align*} tr(ABC) = tr(BCA) = tr(CAB) \tag{21} \end{align*}\]
Proof.
We have:
\[\begin{align*} tr(ABC) &= \sum_{i} (ABC)_{ii} \tag{22} \end{align*}\]
The \((i,i)\) element of \(ABC\) must be the inner product of the \(i\)-th row of \(A\) and the \(i\)-th column of \(BC\). So:
\[\begin{align*} tr(ABC) &= \sum_{i} \sum_j A_{ij} (BC)_{ji} \tag{23} \end{align*}\]
The \((j,i)\) element of \(BC\) must be the inner product of the \(j\)-th row of \(B\) and the \(i\)-th column of \(C\). So:
\[\begin{align*} tr(ABC) &= \sum_{i} \sum_j A_{ij} \sum_{k} B_{jk} C_{ki} \\ &= \sum_{i} \sum_j \sum_{k} A_{ij} B_{jk} C_{ki} \tag{24} \end{align*}\]
But, this can be re-written as
\[\begin{align*} tr(ABC) &= \sum_{i} \sum_j \sum_{k} A_{ij} B_{jk} C_{ki} \\ &= \sum_j \sum_k B_{jk} \sum_i C_{ki} A_{ij} \\ &= \sum_j \sum_k B_{jk} (CA)_{kj} \\ &= \sum_j (BCA)_{jj} \\ &= tr(BCA) \tag{25} \end{align*}\]
Similarly, it can be shown that \(tr(BCA) = tr(CAB)\). This closes the proof.
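A quick numerical sanity check of the cyclic property (21) with random matrices (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))   # dimensions chosen so that ABC is a 2x2 matrix

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
print(np.allclose(t1, t2) and np.allclose(t2, t3))   # True
```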
Claim. Let \(A\) and \(B\) be matrices whose dimensions are conformal, so that \(BA\) is a square matrix. Then,
\[\begin{align*} \frac{\partial}{\partial A} tr(BA) = B^T \tag{26} \end{align*}\]
Proof.
We have:
\[\begin{align*} tr(BA) &= \sum_i (BA)_{ii} \\ &= \sum_i \sum_j B_{ij} A_{ji} \end{align*}\]
Consequently,
\[\begin{align*} \left[\frac{\partial}{\partial A} tr(BA)\right]_{(i,j)} = \frac{\partial}{\partial a_{ij}} tr(BA) = B_{ji} \end{align*}\]
This closes the proof.
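Identity (26) can be checked numerically with a finite-difference approximation (again, just a sanity check):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
eps = 1e-6

# Finite-difference gradient of tr(BA) with respect to each entry A[i, j]
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        A_pert = A.copy()
        A_pert[i, j] += eps
        grad[i, j] = (np.trace(B @ A_pert) - np.trace(B @ A)) / eps

print(np.allclose(grad, B.T, atol=1e-4))   # True: d tr(BA)/dA = B^T
```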
Claim. Let \(A\) be a square matrix with \(\det A > 0\). Then:
\[\begin{align*} \frac{\partial}{\partial A} \log (\det A) = (A^{-1})^T \tag{27} \end{align*}\]
Proof.
Recall that:
\[\begin{align*} \det A = \sum_{j} a_{ij} C_{ij} \end{align*}\]
where \(C_{ij}\) is the cofactor of \(a_{ij}\), that is, \((-1)^{i+j}\) times the minor obtained by removing the \(i\)-th row and \(j\)-th column of \(A\) (this is the cofactor expansion along the \(i\)-th row). Since \(C_{ij}\) does not involve \(a_{ij}\), we have:
\[\begin{align*} \frac{\partial}{\partial a_{ij}}\det A = C_{ij} \end{align*}\]
Consequently,
\[\begin{align*} \frac{\partial}{\partial A}\det A = C \end{align*}\]
where \(C\) is the cofactor matrix of \(A\). We know that \(C = (adj A)^T\), where \(adj A\) is the adjugate of \(A\). Moreover, \(A^{-1} = \frac{1}{\det A} adj (A)\). Therefore,
\[\begin{align*} \frac{\partial}{\partial A} \log (\det A) &= \frac{1}{\det A} \frac{\partial}{\partial A}\det A \\ &= \frac{1}{\det A} C \\ &= \frac{1}{\det A} (adj A)^T \\ &= \left(\frac{1}{\det A} adj A\right)^T \\ &= (A^{-1})^T \end{align*}\]
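Identity (27) can be verified in the same way on a random matrix with positive determinant:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
A = A @ A.T + 3 * np.eye(3)   # symmetric positive definite, so det A > 0
eps = 1e-6

# Finite-difference gradient of log(det A) with respect to each entry A[i, j]
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        A_pert = A.copy()
        A_pert[i, j] += eps
        grad[i, j] = (np.log(np.linalg.det(A_pert)) - np.log(np.linalg.det(A))) / eps

print(np.allclose(grad, np.linalg.inv(A).T, atol=1e-4))   # True: d log(det A)/dA = (A^{-1})^T
```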
MLE Estimate for the covariance matrix \(\mathbf{\Sigma}\)
Since \(\mathbf{x}^T A \mathbf{x}\) is a scalar, \(\mathbf{x}^T A \mathbf{x} = tr(\mathbf{x}^T A \mathbf{x})\). We have:
\[\begin{align*} \mathbf{x}^T A \mathbf{x} &= tr(\mathbf{x}^T A \mathbf{x}) = tr(A \mathbf{x} \mathbf{x}^T) = tr(\mathbf{x} \mathbf{x}^T A) \end{align*}\]
We have:
\[\begin{align*} l(\phi,\mu_c,\mathbf{\Sigma}) &= \sum_{i=1}^{N} \log p(\mathbf{x}_i|y_i) + \sum_{i=1}^{N} \log p(y=y_i) \\ &= -\frac{ND}{2} \log(2\pi) - \frac{N}{2} \log(|\det \mathbf{\Sigma}|) \\ &- \frac{1}{2}\sum_{i=1}^{N} (\mathbf{x}_i - \mu_{y_i})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \mu_{y_i}) \\ &+ \sum_{i=1}^{N} \log p(y=y_i)\\ &= -\frac{ND}{2} \log(2\pi) + \frac{N}{2} \log(|\det \mathbf{\Sigma}^{-1}|) \\ &- \frac{1}{2}\sum_{i=1}^{N} tr[(\mathbf{x}_i - \mu_{y_i}) (\mathbf{x}_i - \mu_{y_i})^T \mathbf{\Sigma}^{-1}] \\ &+ \sum_{i=1}^{N} \log p(y=y_i) \end{align*}\]
Differentiating both sides with respect to \(\mathbf{\Sigma}^{-1}\), using the results (26) and (27), we get:
\[\begin{align*} \frac{\partial l}{\partial \mathbf{\Sigma}^{-1}} &= \frac{N}{2} \mathbf{\Sigma} - \frac{1}{2}\sum_{i=1}^{N} (\mathbf{x}_i - \mu_{y_i}) (\mathbf{x}_i - \mu_{y_i})^T \end{align*}\]
Consequently, we have:
\[\begin{align*} \hat{\mathbf{\Sigma}}_{mle} &= \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \mu_{y_i}) (\mathbf{x}_i - \mu_{y_i})^T \tag{28} \end{align*}\]
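Putting the estimates together, \(\hat{\mathbf{\Sigma}}\) is the pooled covariance obtained by centering each point with its own class mean. A minimal numpy sketch on synthetic data (function and variable names are illustrative):

```python
import numpy as np

def lda_mle(X, y):
    """MLE estimates for LDA: class priors, class means (19), pooled covariance (28)."""
    N, D = X.shape
    classes = np.unique(y)
    phi = {c: np.mean(y == c) for c in classes}
    mu = {c: X[y == c].mean(axis=0) for c in classes}
    # Center each point by its own class mean, then average the outer products
    centered = X - np.stack([mu[c] for c in y])
    Sigma = centered.T @ centered / N
    return phi, mu, Sigma

# Hypothetical data: two Gaussian classes with unit covariance
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
phi, mu, Sigma = lda_mle(X, y)
print(phi, mu, Sigma)   # Sigma should be close to the identity matrix
```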
Decision boundary
Let’s again consider the binary classification problem with \(C=2\) classes. The decision boundary is the line or hyperplane that separates the part of the space where the probability that a point belongs to class \(1\) is larger than \(50\) percent from the part where the probability that it belongs to class \(0\) is larger than \(50\) percent.
The decision boundary is given by \(p(y=1|\mathbf{x}) = p(y=0|\mathbf{x})\), which by the Bayes rule is equivalent to \(p(\mathbf{x}|y=1)\, p(y=1) = p(\mathbf{x}|y=0)\, p(y=0)\). Assuming equal class priors, \(\phi = 1/2\), the prior terms cancel (otherwise a constant \(\log\frac{\phi}{1-\phi}\) is carried along). Since the class-conditional densities involve an exponent, it’s convenient to take logarithms on both sides. This results in:
\[\begin{align*} \cancel{-\frac{D}{2}\log(2\pi)} - \cancel{\frac{1}{2} \log(|\det \mathbf{\Sigma}|)} - \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_1)^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_1) = \\ \cancel{-\frac{D}{2}\log(2\pi)} - \cancel{\frac{1}{2} \log(|\det \mathbf{\Sigma}|)} - \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_0)^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_0) \tag{29} \end{align*}\]
Simplifying, we have:
\[\begin{align*} (\mathbf{x} - \mathbf{\mu}_1)^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_1) &= (\mathbf{x} - \mathbf{\mu}_0)^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_0) \tag{30} \end{align*}\]
\[\begin{align*} (\mathbf{x}^T - \mathbf{\mu}_1^T) \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_1) &= (\mathbf{x}^T - \mathbf{\mu}_0^T) \mathbf{\Sigma}^{-1}(\mathbf{x} - \mathbf{\mu}_0)\\ \cancel{\mathbf{x}^T \mathbf{\Sigma}^{-1} \mathbf{x}} - \mathbf{x}^T \mathbf{\Sigma}^{-1}\mathbf{\mu}_1 - \mathbf{\mu}_1^T \mathbf{\Sigma}^{-1} \mathbf{x} + \mathbf{\mu}_1^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_1 &= \cancel{\mathbf{x}^T \mathbf{\Sigma}^{-1} \mathbf{x}} - \mathbf{x}^T \mathbf{\Sigma}^{-1}\mathbf{\mu}_0 - \mathbf{\mu}_0^T \mathbf{\Sigma}^{-1} \mathbf{x} + \mathbf{\mu}_0^T \mathbf{\Sigma}^{-1} \mathbf{\mu}_0 \end{align*}\]
Note that \(\mathbf{x}^T \mathbf{\Sigma}^{-1} (\mu_1 - \mu_0)\) is a scalar, so \(\mathbf{x}^T \mathbf{\Sigma}^{-1} (\mu_1 - \mu_0) = (\mathbf{x}^T \mathbf{\Sigma}^{-1} (\mu_1 - \mu_0))^T = (\mu_1 - \mu_0)^T \mathbf{\Sigma}^{-1} \mathbf{x}\), since \(\mathbf{\Sigma}^{-1}\) is symmetric. So, we get:
\[\begin{align*} 2\mathbf{x}^T \mathbf{\Sigma}^{-1} (\mu_1 - \mu_0) = \underbrace{\mu_1^T \mathbf{\Sigma}^{-1} \mu_1 - \mu_0^T \mathbf{\Sigma}^{-1} \mu_0}_{\text{constant}} \tag{31} \end{align*}\]
This is the equation of the decision boundary. The left-hand side is a linear projection of the vector \(\mathbf{x}\) onto the direction \(\mathbf{\Sigma}^{-1} (\mu_1 - \mu_0)\). Whenever this projection equals the constant on the right-hand side (divided by \(2\)), we are on the decision boundary; when it’s larger than this threshold, we predict class \(1\), and when it’s smaller, class \(0\). So, the decision boundary is just a line perpendicular to this vector, crossing it at the point that corresponds to this threshold.
To be clear, the fact that the decision boundary is linear follows from our assumption that all classes share the same covariance matrix \(\mathbf{\Sigma}\).
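Here is a minimal sketch of how the boundary (31) translates into a classification rule, assuming equal class priors as in the derivation above (names are illustrative):

```python
import numpy as np

def lda_predict(X, mu0, mu1, Sigma):
    """Classify points with the linear rule (31), assuming equal class priors.

    A point is assigned to class 1 when its projection onto
    Sigma^{-1} (mu1 - mu0) exceeds the threshold, and to class 0 otherwise.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)                              # projection direction
    threshold = 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
    return (X @ w > threshold).astype(int)

# Hypothetical parameters and a few test points
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)
X_test = np.array([[0.5, 0.5], [2.5, 2.8], [1.6, 1.6]])
print(lda_predict(X_test, mu0, mu1, Sigma))   # [0, 1, 1]
```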
Quadratic Discriminant Analysis (QDA)
LDA assumes that the data within each class \(c\) are drawn from a multivariate Gaussian distribution with a class-specific mean vector \(\mathbf{\mu}_c\) and a covariance matrix that is common to all \(C\) classes. The Quadratic Discriminant Analysis (QDA) classifier also assumes that the observations from each class are drawn from a Gaussian distribution, but each class has its own mean vector \(\mathbf{\mu}_c\) and its own covariance matrix \(\mathbf{\Sigma}_c\):
\[\begin{align*} p(\mathbf{x}|y=c) = \frac{1}{(2\pi)^{D/2} |\det \mathbf{\Sigma}_c|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x} - \mu_c)^T \mathbf{\Sigma}_c^{-1}(\mathbf{x} - \mu_c)\right] \end{align*}\]
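A minimal sketch of the resulting QDA classification rule, which picks the class maximizing \(\log p(\mathbf{x}|y=c) + \log p(y=c)\) (illustrative names and data):

```python
import numpy as np

def qda_log_posterior(x, mus, Sigmas, priors):
    """Unnormalized log posterior log p(x|y=c) + log p(y=c) for each class c.

    mus, Sigmas, priors are dicts keyed by class label; x is a (D,) vector.
    The common -D/2 log(2*pi) term is dropped since it does not affect the argmax;
    the class-specific covariance is what makes the boundary quadratic in x.
    """
    scores = {}
    for c in mus:
        diff = x - mus[c]
        _, logdet = np.linalg.slogdet(Sigmas[c])
        scores[c] = (-0.5 * logdet
                     - 0.5 * diff @ np.linalg.inv(Sigmas[c]) @ diff
                     + np.log(priors[c]))
    return scores

# Hypothetical two-class model with different covariances
mus = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
Sigmas = {0: np.eye(2), 1: np.array([[2.0, 0.5], [0.5, 1.0]])}
priors = {0: 0.5, 1: 0.5}
scores = qda_log_posterior(np.array([2.0, 2.5]), mus, Sigmas, priors)
print(max(scores, key=scores.get))   # predicted class
```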