The Levenberg-Marquardt Algorithm
Algorithm Description
The Levenberg-Marquardt (LM) method is an iterative least-squares minimization algorithm based on a damped modification of the Gauss-Newton method. It’s a super-intuitive algorithm, and a generic implementation can be coded up very quickly. I state the problem formally before describing the algorithm. We’ll use finite differences to approximate the first- and second-order derivatives of the function.
Let \(\mathbf{x}\in\mathbf{R}^n\) be the parameter vector to be optimized. We want to find the optimal \(\mathbf{x}^*\) that minimizes the scalar error function:
\[ \begin{align*} F(\mathbf{x}) = \frac{1}{2}||\mathbf{r}(\mathbf{x})||^2 = \frac{1}{2}\mathbf{r}(\mathbf{x})^T \mathbf{r}(\mathbf{x}) \end{align*} \]
The residual error function \(\mathbf{r}:\mathbf{R}^n \to \mathbf{R}^m\) may sometimes include a comparison to reference or observed data. A very simple linear example would be \(\mathbf{r}(\mathbf{x}) = \mathbf{b} - \mathbf{Ax}\). However, in the following, I assume that \(\mathbf{r}(\cdot)\) is an arbitrary vector-valued function:
\[ \begin{align*} \mathbf{r}(\mathbf{x}) = (r_1(\mathbf{x}),r_2(\mathbf{x}),\ldots,r_m(\mathbf{x})) \end{align*} \]
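As a concrete toy example (my own, not from the original), consider fitting a two-parameter exponential model \(y \approx x_1 e^{x_2 t}\) to observed samples \((t_i, y_i)\); each residual entry is the misfit at one sample. In Julia, such a residual function might look like:

```julia
# Hypothetical residual for an exponential model y ≈ x[1] * exp(x[2] * t),
# compared against observed data vectors (t, y). Returns an m-vector.
residual(x, t, y) = x[1] .* exp.(x[2] .* t) .- y
```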
We can define the Jacobian of the residual vector as the \(m \times n\) matrix with entries:
\[ \mathbf{J}_{ij}(\mathbf{x}) = \frac{\partial r_i}{\partial x_j}(\mathbf{x}) \]
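Since we approximate derivatives with finite differences, here is a minimal forward-difference Jacobian sketch in Julia (the function name `fd_jacobian` and the step size are my own choices):

```julia
# Forward-difference approximation of the m×n Jacobian J_ij = ∂r_i/∂x_j.
function fd_jacobian(r, x; eps=1e-6)
    r0 = r(x)                          # residual at the current point (length m)
    J = zeros(length(r0), length(x))
    for j in eachindex(x)
        xp = copy(x)
        xp[j] += eps                   # perturb the j-th parameter
        J[:, j] = (r(xp) .- r0) ./ eps
    end
    return J
end
```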
We can also define the Hessian of each residual \(r_k\) as the \(n \times n\) matrix with entries:
\[ \begin{align*} \mathbf{H}^{(k)}_{ij}(\mathbf{x}) = \frac{\partial^2 r_k}{\partial x_i \partial x_j} (\mathbf{x}) \end{align*} \]
The gradient of the scalar-valued function \(F\), by the product rule, is:
\[ \begin{align*} \nabla F(\mathbf{x}) = D\mathbf{r}(\mathbf{x})^T \mathbf{r}(\mathbf{x}) = \mathbf{J}(\mathbf{x})^T \mathbf{r}(\mathbf{x}) \end{align*} \]
The Hessian of the function \(F\) is:
\[ \begin{align*} \nabla^2 F(\mathbf{x}) &= D\left\{\sum_{j=1}^{m} \nabla r_j(\mathbf{x}) \, r_j(\mathbf{x})\right\}\\ &= \sum_{j=1}^m \nabla^2 r_j(\mathbf{x}) \, r_j(\mathbf{x}) + \nabla r_j(\mathbf{x}) \nabla r_j(\mathbf{x})^T \end{align*} \]
If the second-order terms \(\nabla^2 r_j(\mathbf{x}) \, r_j(\mathbf{x})\) are small (for example, when the residuals are small or nearly linear near the solution), they can be dropped, and the Hessian in this case simply becomes:
\[ \nabla^2 F(\mathbf{x}) \approx \nabla \mathbf{r}(\mathbf{x})^T \nabla \mathbf{r}(\mathbf{x}) = \mathbf{J}(\mathbf{x})^T \mathbf{J}(\mathbf{x}) \]
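In code, both the gradient and this Gauss-Newton approximation of the Hessian follow directly from the Jacobian and the residual. A quick sketch, assuming `r` is a single-argument residual (e.g. a closure over the data) and `x` is the current parameter vector:

```julia
# Gradient and Gauss-Newton Hessian approximation of F(x) = 0.5 * ||r(x)||².
J = fd_jacobian(r, x)   # m×n Jacobian via finite differences
g = J' * r(x)           # ∇F(x)  = Jᵀ r(x)
H = J' * J              # ∇²F(x) ≈ Jᵀ J
```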
The LM method then minimizes the following second-order Taylor expansion of the actual error function:
\[ F(\mathbf{x}^{(k)} + \mathbf{h}) - F(\mathbf{x}^{(k)}) \approx \mathbf{h}^T \nabla F(\mathbf{x}^{(k)}) + \frac{1}{2}\mathbf{h}^T \nabla^2 F(\mathbf{x}^{(k)}) \mathbf{h} \tag{1}\]
Descent methods like gradient descent can place too much trust in their first- or second-order information, which can result in excessively large steps or premature convergence.
So, in LM, we add a penalty term
\[ \frac{1}{2} \lambda^{(k)} \mathbf{h}^T \mathbf{h} = \frac{1}{2} \lambda^{(k)} ||\mathbf{x} - \mathbf{x}^{(k)}||^2 \tag{2}\]
to the above Equation 1, which we then minimize. The reason is that we don’t want to move too far away from \(\mathbf{x}^{(k)}\). It’s not that we think the solution is nearby; the actual solution could be far away. It’s a question of trust, and \(\lambda^{(k)}\) essentially encodes your level of distrust in the quadratic model. If \(\lambda^{(k)}\) is very large, it means you don’t trust the model very much, or you trust it only very close to \(\mathbf{x}^{(k)}\). When \(\lambda^{(k)}\) gets really small, it means you really trust the model, and the new \(\mathbf{x}\) can end up far from \(\mathbf{x}^{(k)}\). That’s the gist. Putting it all together,
\[ E(\mathbf{h}) = \mathbf{h}^T \nabla F(\mathbf{x}^{(k)}) + \frac{1}{2}\mathbf{h}^T \left( \nabla^2 F(\mathbf{x}^{(k)}) + \lambda^{(k)} I \right)\mathbf{h} \tag{3}\]
We can solve for the optimal step \(\mathbf{h}_{lm}\) analytically. Taking the first derivative with respect to the step \(\mathbf{h}\) and setting it equal to zero:
\[ \nabla E(\mathbf{h}) = \nabla F(\mathbf{x}^{(k)}) + \left( \nabla^2 F(\mathbf{x}^{(k)}) + \lambda^{(k)}I \right)\mathbf{h}_{lm} = 0 \tag{4}\]
Consequently,
\[ \begin{align*} \mathbf{h}_{lm} &= -\left(\nabla^2 F(\mathbf{x}^{(k)}) + \lambda^{(k)}I\right)^{-1} \nabla F(\mathbf{x}^{(k)})\\ &=-\left(\mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{J}(\mathbf{x}^{(k)}) + \lambda^{(k)}I\right)^{-1} \mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{r}(\mathbf{x}^{(k)}) \end{align*} \tag{5}\]
Our best estimate of the minimum is consequently:
\[ \begin{align*} \mathbf{x}^{(k+1)} &= \mathbf{x}^{(k)} + \mathbf{h}_{lm}\\ &= \mathbf{x}^{(k)} -\left(\mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{J}(\mathbf{x}^{(k)}) + \lambda^{(k)}I\right)^{-1} \mathbf{J}(\mathbf{x}^{(k)})^T \mathbf{r}(\mathbf{x}^{(k)}) \end{align*} \tag{6}\]
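In Julia, the step in Equation 6 is a single linear solve; rather than forming the explicit inverse, one would typically use the backslash operator. A sketch, assuming `J` and `rk` hold the Jacobian and residual at \(\mathbf{x}^{(k)}\) and `λ` is the current damping value:

```julia
using LinearAlgebra   # provides the identity scaling I and norm

# Solve (JᵀJ + λI) h = -Jᵀ r instead of inverting the matrix explicitly.
h = -(J' * J + λ * I) \ (J' * rk)
x = x + h             # candidate value for x^(k+1)
```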
Updating \(\lambda^{(k)}\)
A trust-region method, or restricted-step method, maintains a local region in which the model is trusted, and the size of this region depends on the success of the previous step. If the step \(\mathbf{h}_{lm}\) results in a decrease in \(F(\mathbf{x})\), then we reduce \(\lambda^{(k)}\); otherwise we increase the value of this parameter.
So, we can use the following update mechanism:
- If \(F(\mathbf{x}^{(k+1)}) < F(\mathbf{x}^{(k)})\), accept the new \(\mathbf{x}\) and reduce \(\lambda\):
\[ \lambda^{(k+1)} = 0.8 \lambda^{(k)}\]
- otherwise, increase \(\lambda\) and do not update \(\mathbf{x}\):
\[ \lambda^{(k+1)} = 2 \lambda^{(k)}, \quad \mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}\]
Generic implementation in Julia
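Below is a minimal sketch that ties the previous pieces together: the finite-difference Jacobian `fd_jacobian` defined earlier, the damped step of Equation 6, and the \(\lambda\) update rule above. The initial \(\lambda\), the multipliers, and the stopping tolerance are illustrative choices, not prescriptions.

```julia
using LinearAlgebra

# Generic Levenberg-Marquardt loop minimizing F(x) = 0.5 * ||r(x)||²,
# using the finite-difference Jacobian fd_jacobian defined earlier.
function levenberg_marquardt(r, x0; λ=1.0, maxiter=200, tol=1e-8)
    x = copy(x0)
    F(z) = 0.5 * sum(abs2, r(z))
    for _ in 1:maxiter
        J  = fd_jacobian(r, x)
        rk = r(x)
        h  = -(J' * J + λ * I) \ (J' * rk)   # damped step (Equation 6)
        if F(x + h) < F(x)
            x = x + h        # accept the step and trust the model more
            λ *= 0.8
        else
            λ *= 2.0         # reject the step and shrink the trusted region
        end
        norm(h) < tol && break
    end
    return x
end

# Example usage with the hypothetical exponential-model residual from earlier:
# r = z -> residual(z, t, y)
# x_opt = levenberg_marquardt(r, [1.0, -0.5])
```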
References
- Levenberg Marquardt Iteration, Professor Stephen Boyd, Stanford ENGR108