Introduction
This post derives the least squares estimate of the coefficients of linear regression. Our data consists of \(p\) predictors or features \(X_1,\ldots,X_p\) and a response \(Y\), and there are \(n\) observations in our dataset. Assume that the data arises from the real-world model:
\[\begin{align} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \tag{1} \end{align}\]
or in matrix notation,
\[Y = X \beta + \epsilon \tag{2}\]
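As a concrete illustration, here is a minimal sketch of simulating data from the model in equation (2) (Python with NumPy is assumed; the dimensions, coefficient values, and noise scale are arbitrary choices of mine, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                          # n observations, p predictors
X = rng.normal(size=(n, p))            # design matrix
beta = np.array([1.5, -2.0, 0.7])      # "true" coefficients, unknown in practice
eps = rng.normal(scale=0.5, size=n)    # noise term

Y = X @ beta + eps                     # the model Y = X beta + eps
```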
Population regression line versus least squares estimates
The real-world model in equation (1) is called the population regression line.
In statistics, we quite often do not know the population mean \(\mu\), but we try to estimate it using the sample mean \(\hat{\mu}\).
In a similar vein, we do not know the true values of the regression coefficients \(\beta_1,\beta_2,\ldots,\beta_p\). Instead, we estimate them from the data as \(\hat{\beta_1},\hat{\beta_2},\ldots,\hat{\beta_p}\).
So, our linear regression model would predict an outcome:
\[\hat{Y} = \hat{\beta_1}X_1 + \hat{\beta_2} X_2 + \ldots +\hat{\beta_p} X_p \tag{3}\]
Residual sum of squares
The difference between the observed response value and the predicted response value is called the residual.
We define the residual sum of squares (RSS) as:
\[\begin{align*} (Y - X\hat{\beta})'(Y - X\hat{\beta})&= (Y' - \hat{\beta}' X')(Y - X\hat{\beta})\\ &= Y'Y - Y'X \hat{\beta} - \hat{\beta}' X' Y + \hat{\beta}'X'X\hat{\beta} \end{align*}\]
The \(j\)-th entry of the row vector \(Y'X\) is \(\sum_{i=1}^{n}y_i x_{ij}\), and therefore the product \(Y'X\hat{\beta}\) equals \(\sum_{j=1}^{p}\sum_{i=1}^{n}y_i x_{ij}\hat{\beta_j}\). Similarly, \(\hat{\beta}' X' Y = \sum_{j=1}^{p}\hat{\beta_j}\sum_{i=1}^{n}x_{ij} y_i\), which is the same double sum. (Equivalently, \(Y'X\hat{\beta}\) is a scalar, so it equals its own transpose.) Thus, \(\hat{\beta}' X' Y = Y' X \hat{\beta}\).
Consequently,
\[\begin{align*} (Y - X\hat{\beta})'(Y - X\hat{\beta})&= Y'Y - 2Y'X \hat{\beta} + \hat{\beta}'X'X\hat{\beta} \tag{4} \end{align*}\]
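To sanity-check the algebra, here is a small numerical sketch (again assuming NumPy; \(X\) and \(Y\) are simulated as above and \(\hat{\beta}\) is an arbitrary candidate vector) verifying both the scalar identity \(\hat{\beta}'X'Y = Y'X\hat{\beta}\) and the expansion in equation (4):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=n)

beta_hat = np.array([1.0, -1.0, 0.5])          # any candidate coefficient vector

resid = Y - X @ beta_hat
rss_direct = resid @ resid                     # (Y - X beta_hat)'(Y - X beta_hat)

# beta_hat' X' Y and Y' X beta_hat are the same scalar
assert np.isclose(beta_hat @ X.T @ Y, Y @ X @ beta_hat)

# the expansion from equation (4)
rss_expanded = Y @ Y - 2 * (Y @ X @ beta_hat) + beta_hat @ X.T @ X @ beta_hat
assert np.isclose(rss_direct, rss_expanded)
```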
Aside proof I
Claim. Let \(A \in \mathbf{R}^{m \times n}\) be a rectangular matrix, let \(\vec{x}\) be a vector of \(n\) elements, and let \(\vec{y}\) be the matrix-vector product:
\[\vec{y} = A \vec{x}\]
Then,
\[\frac{\partial \vec{y}}{\partial \vec{x}} = A\]
Proof.
Let \(A_1,\ldots,A_n\) be the columns of \(A\). Then,
\[\begin{align*} \vec{y} &= [A_1, A_2, \ldots, A_n] \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \\ &= A_1 x_1 + A_2 x_2 + \ldots + A_n x_n \end{align*}\]
Thus,
\[\frac{\partial \vec{y}}{\partial x_i} = A_i\]
Consequently,
\[\frac{\partial \vec{y}}{\partial \vec{x}} = A\]
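The claim can also be checked numerically with finite differences. Below is a minimal sketch (NumPy assumed; the matrix shape and step size \(h\) are arbitrary choices) that approximates each column \(\partial \vec{y} / \partial x_i\) and compares the result with \(A\):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.normal(size=(m, n))            # arbitrary rectangular matrix
x = rng.normal(size=n)
h = 1e-6

# Build the Jacobian of y = A x one column at a time via central differences.
jac = np.empty((m, n))
for i in range(n):
    e = np.zeros(n)
    e[i] = h
    jac[:, i] = (A @ (x + e) - A @ (x - e)) / (2 * h)   # approximates dy/dx_i

assert np.allclose(jac, A, atol=1e-6)  # the Jacobian recovers A
```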
Aside proof II
Claim. Consider the quadratic form \(Q(\vec{x}) = \vec{x}^T A^T A \vec{x}\). Then, we have:
\[\frac{\partial Q}{\partial \vec{x}} = 2A^T A\vec{x}\]
Proof.
The matrix \(K = A^T A\) is symmetric, since \((A^T A)^T = A^T (A^T)^T = A^T A\). So \(Q = \vec{x}^T K \vec{x}\). Now write \(A = (A_1, A_2, \ldots, A_n)\) in block form, where \(A_j\) denotes the \(j\)-th column of \(A\). Then \(A \vec{x} = \sum_j A_j x_j\) and \(\vec{x}^T A^T = \left(\sum_j A_j x_j\right)^T\), so that
\[Q = \left(\sum_j A_j x_j\right)^T\left(\sum_j A_j x_j\right) = \Big\lVert \sum_j A_j x_j \Big\rVert^2\]
Differentiating with respect to \(x_j\) (writing the sum over a fresh index \(k\) to avoid a clash), the product rule gives
\[\begin{align} \frac{\partial Q}{\partial x_j} &= 2 A_j^T \left(\sum_{k} A_k x_k\right) = 2 A_j^T A \vec{x} \end{align}\]
Stacking these \(n\) partial derivatives into a column vector,
\[\begin{align} \frac{\partial Q}{\partial \vec{x}} &= 2 \begin{bmatrix}A_1^T \\ A_2^T \\ \vdots \\ A_n^T\end{bmatrix} A \vec{x} \\ &= 2 A^T A \vec{x} \end{align}\]
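As with the first aside, a quick finite-difference sketch (NumPy assumed; the sizes and step size are arbitrary choices) confirms that the numerical gradient of \(Q\) matches \(2A^TA\vec{x}\):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
h = 1e-6

def Q(v):
    """The quadratic form Q(v) = v' A' A v."""
    return v @ A.T @ A @ v

# Central-difference gradient, one coordinate at a time.
grad = np.empty(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = h
    grad[j] = (Q(x + e) - Q(x - e)) / (2 * h)

assert np.allclose(grad, 2 * A.T @ A @ x, atol=1e-5)   # matches 2 A' A x
```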
Least squares estimate
We now minimize the RSS expression in equation (4) by taking the derivative with respect to the vector \(\hat{\beta}\) and setting it to zero. By aside proof I, the derivative of the linear term \(-2Y'X\hat{\beta}\) is the row vector \(-2Y'X\), which we transpose to the column vector \(-2X'Y\) so that it matches the derivative \(2X'X\hat{\beta}\) of the quadratic term from aside proof II. Thus:
\[\begin{align*} \frac{\partial (RSS)}{\partial \hat{\beta}}&= - 2X'Y + 2X'X\hat{\beta} = 0 \\ X'X \hat{\beta} &= X'Y \\ \hat{\beta} &= (X'X)^{-1} X'Y \end{align*}\]
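To close the loop, here is a minimal sketch (NumPy assumed; the data are simulated as in the first snippet) that computes \(\hat{\beta}\) by solving the normal equations \(X'X\hat{\beta} = X'Y\) and cross-checks it against NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.7])       # true coefficients used to simulate Y
Y = X @ beta + rng.normal(scale=0.5, size=n)

# Solve the normal equations X'X beta_hat = X'Y
# (np.linalg.solve is preferred to forming an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

print(beta_hat)   # close to the true beta = [1.5, -2.0, 0.7]
```

Solving the normal equations directly is fine for illustration, but in practice SVD- or QR-based solvers (such as `np.linalg.lstsq`) are numerically more stable than forming \((X'X)^{-1}\).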