2 Vector and matrix calculus
2.1 First order vector derivatives
Derivative and gradient
The derivative of a scalar-valued function \(h(\boldsymbol x)\) with regard to its vector argument \(\boldsymbol x= (x_1, \ldots, x_d)^T\) is the row vector \[ D h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = \begin{pmatrix} \frac{\partial h(\boldsymbol x)}{\partial x_1} & \cdots & \frac{\partial h(\boldsymbol x)}{\partial x_d} \\ \end{pmatrix} \]
The above notation follows the numerator layout convention, where the dimension of the numerator (here 1) determines the number of rows and the dimension of the denominator (here \(d\)) the number of columns of the resulting matrix (see https://en.wikipedia.org/wiki/Matrix_calculus for details). For a scalar-valued function this results in a row vector of the same dimension as \(\boldsymbol x^T\).
The gradient of \(h(\boldsymbol x)\) is a column vector and the transpose of the derivative \[ \text{grad } h(\boldsymbol x) = \left( \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} \right)^T \] It is often written using the nabla operator \(\nabla\) as \[ \nabla h(\boldsymbol x) = \begin{pmatrix} \frac{\partial h(\boldsymbol x)}{\partial x_1} \\ \vdots\\ \frac{\partial h(\boldsymbol x)}{\partial x_d} \end{pmatrix} \] with \[ \nabla = \begin{pmatrix} \frac{\partial }{\partial x_1} \\ \vdots\\ \frac{\partial }{\partial x_d} \end{pmatrix} \]
Note that \[ (\nabla h(\boldsymbol x))^T = \nabla^T h(\boldsymbol x) = D h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} \]
Example 2.1 Examples for the gradient and derivative:
- \(h(\boldsymbol x)=\boldsymbol a^T \boldsymbol x+ b\). Then \(\nabla h(\boldsymbol x) = \boldsymbol a\) and \(D h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol a^T\).
- \(h(\boldsymbol x)=\boldsymbol x^T \boldsymbol x\). Then \(\nabla h(\boldsymbol x) = 2 \boldsymbol x\) and \(D h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = 2 \boldsymbol x^T\).
- \(h(\boldsymbol x)=\boldsymbol x^T \boldsymbol A\boldsymbol x\). Then \(\nabla h(\boldsymbol x) = (\boldsymbol A+ \boldsymbol A^T) \boldsymbol x\) and \(D h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol x^T (\boldsymbol A+ \boldsymbol A^T)\).
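These identities are easy to check numerically by comparing the closed-form gradient with a finite-difference approximation. Below is a minimal sketch in Python, assuming NumPy is available; the helper `numerical_gradient` and the chosen test vectors and matrices are purely illustrative.

```python
import numpy as np

def numerical_gradient(h, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of h at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (h(x + e) - h(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
a = rng.normal(size=3)
x = rng.normal(size=3)

# gradient of a^T x + b is a (the constant b does not matter)
print(np.allclose(numerical_gradient(lambda x: a @ x + 1.0, x), a))
# gradient of x^T x is 2 x
print(np.allclose(numerical_gradient(lambda x: x @ x, x), 2 * x))
# gradient of x^T A x is (A + A^T) x
print(np.allclose(numerical_gradient(lambda x: x @ A @ x, x), (A + A.T) @ x))
```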
Jacobian matrix
Similarly, we can also compute the derivative of a vector-valued function \[ \boldsymbol h(\boldsymbol x) = ( h_1(\boldsymbol x), \ldots, h_m(\boldsymbol x) )^T \] with regard to \(\boldsymbol x\). This yields (again in numerator layout convention) a matrix with \(m\) rows and \(d\) columns whose rows contain the derivatives of the components of \(\boldsymbol h(\boldsymbol x)\): \[ \begin{split} D\boldsymbol h(\boldsymbol x) &= \frac{\partial \boldsymbol h(\boldsymbol x)}{\partial \boldsymbol x} = \left(\frac{\partial h_i(\boldsymbol x)}{\partial x_j}\right) \\ &=\begin{pmatrix} \frac{\partial h_1(\boldsymbol x)}{\partial x_1} & \cdots & \frac{\partial h_1(\boldsymbol x)}{\partial x_d} \\ \vdots &\ddots & \vdots \\ \frac{\partial h_m(\boldsymbol x)}{\partial x_1} & \cdots & \frac{\partial h_m(\boldsymbol x)}{\partial x_d} \\ \end{pmatrix} \\ &= \left( {\begin{array}{c} \nabla^T h_1(\boldsymbol x) \\ \vdots \\ \nabla^T h_m(\boldsymbol x) \\ \end{array} } \right) \\ &= \boldsymbol J_{\boldsymbol h}(\boldsymbol x) \end{split} \]
This matrix is also called the Jacobian matrix.
Example 2.2 \(\boldsymbol h(\boldsymbol x)=\boldsymbol A^T \boldsymbol x+ \boldsymbol b\). Then \(D \boldsymbol h(\boldsymbol x) = \frac{\partial \boldsymbol h(\boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol J_{\boldsymbol h}(\boldsymbol x) =\boldsymbol A^T\).
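A finite-difference version of the Jacobian matrix provides a similar numerical check. The sketch below (again assuming NumPy, with illustrative names) confirms that the Jacobian matrix of the affine map \(\boldsymbol h(\boldsymbol x) = \boldsymbol A^T \boldsymbol x + \boldsymbol b\) is \(\boldsymbol A^T\).

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian in numerator layout (m rows, d columns)."""
    m, d = h(x).size, x.size
    J = np.zeros((m, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (h(x + e) - h(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))      # h maps R^3 to R^2
b = rng.normal(size=2)
h = lambda x: A.T @ x + b

x = rng.normal(size=3)
print(np.allclose(numerical_jacobian(h, x), A.T))   # True
```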
If \(m=d\) then the Jacobian matrix is square, and its determinant can be computed. Both the Jacobian matrix and the Jacobian determinant are often called “the Jacobian”, so one needs to determine from the context whether the matrix or the determinant is meant.
If \(\boldsymbol y= \boldsymbol h(\boldsymbol x)\) is an invertible function with \(\boldsymbol x= \boldsymbol h^{-1}(\boldsymbol y)\) then the Jacobian matrix is invertible and its inverse is the Jacobian matrix of the inverse function: \[ D\boldsymbol x(\boldsymbol y) = \left( D\boldsymbol y(\boldsymbol x)\right)^{-1} \rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \] or in alternative notation \[ \frac{\partial \boldsymbol x(\boldsymbol y)}{\partial \boldsymbol y} = \left. \left( \frac{\partial \boldsymbol y(\boldsymbol x)}{\partial \boldsymbol x} \right)^{-1} \right\rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \]
In this case the Jacobian determinant of the back-transformation can be computed as the inverse of the Jacobian determinant of the original function: \[ \det D\boldsymbol x(\boldsymbol y) = \det \left( D\boldsymbol y(\boldsymbol x)\right)^{-1} \rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \] and \[ \det \left(\frac{\partial \boldsymbol x(\boldsymbol y)}{\partial \boldsymbol y}\right) = \left. \det \left( \frac{\partial \boldsymbol y(\boldsymbol x)}{\partial \boldsymbol x} \right)^{-1} \right\rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \]
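As a concrete illustration, consider the polar-to-Cartesian map \(\boldsymbol y(r, \varphi) = (r \cos\varphi, r \sin\varphi)^T\) and its inverse. The sketch below (assuming NumPy; the finite-difference helper is illustrative only) checks numerically that the Jacobian matrices and determinants of the map and of its inverse are indeed inverses of each other.

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Finite-difference Jacobian in numerator layout."""
    m, d = h(x).size, x.size
    J = np.zeros((m, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (h(x + e) - h(x - e)) / (2 * eps)
    return J

# forward map y(x): polar coordinates (r, phi) -> Cartesian (y1, y2)
forward = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])
# inverse map x(y): Cartesian (y1, y2) -> polar (r, phi)
inverse = lambda y: np.array([np.hypot(y[0], y[1]), np.arctan2(y[1], y[0])])

x0 = np.array([2.0, 0.7])               # a point (r, phi)
y0 = forward(x0)

J_y = numerical_jacobian(forward, x0)   # D y(x) evaluated at x0
J_x = numerical_jacobian(inverse, y0)   # D x(y) evaluated at y0 = y(x0)

print(np.allclose(J_x, np.linalg.inv(J_y), atol=1e-4))                    # True
print(np.isclose(np.linalg.det(J_x), 1 / np.linalg.det(J_y), atol=1e-4))  # True
```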
2.2 Second order vector derivatives
The matrix of all second order partial derivatives of a scalar-valued function with vector-valued argument is called the Hessian matrix: \[ \begin{split} \nabla \nabla^T h(\boldsymbol x) &= \begin{pmatrix} \frac{\partial^2 h(\boldsymbol x)}{\partial x_1^2} & \frac{\partial^2 h(\boldsymbol x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 h(\boldsymbol x)}{\partial x_1 \partial x_d} \\ \frac{\partial^2 h(\boldsymbol x)}{\partial x_2 \partial x_1} & \frac{\partial^2 h(\boldsymbol x)}{\partial x_2^2} & \cdots & \frac{\partial^2 h(\boldsymbol x)}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 h(\boldsymbol x)}{\partial x_d \partial x_1} & \frac{\partial^2 h(\boldsymbol x)}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 h(\boldsymbol x)}{\partial x_d^2} \end{pmatrix} \\ &= \left(\frac{\partial^2 h(\boldsymbol x)}{\partial x_i \partial x_j}\right) \\ & = \frac{\partial }{\partial \boldsymbol x} \left( \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x}\right)^T \\ & = D (Dh(\boldsymbol x))^T \\ \end{split} \] By construction the Hessian matrix is square, and for twice continuously differentiable functions it is also symmetric.
Example 2.3 \(h(\boldsymbol x)=\boldsymbol x^T \boldsymbol A\boldsymbol x\). Then \(\nabla \nabla^T h(\boldsymbol x) = (\boldsymbol A+ \boldsymbol A^T)\).
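The Hessian matrix can also be approximated by finite differences, which gives a quick check of Example 2.3. The sketch below assumes NumPy; the helper `numerical_hessian` uses a standard central-difference formula for second order partial derivatives and is meant only as an illustration.

```python
import numpy as np

def numerical_hessian(h, x, eps=1e-4):
    """Central finite-difference approximation of the Hessian of h at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (h(x + ei + ej) - h(x + ei - ej)
                       - h(x - ei + ej) + h(x - ei - ej)) / (4 * eps**2)
    return H

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

H = numerical_hessian(lambda x: x @ A @ x, x)
print(np.allclose(H, A + A.T, atol=1e-6))   # Hessian of x^T A x is A + A^T
print(np.allclose(H, H.T, atol=1e-6))       # and it is symmetric
```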
2.3 Chain rules for gradient vector and Hessian matrix
Suppose \(h(\boldsymbol x)\) is a scalar-valued function and \(g(\boldsymbol y) = h(\boldsymbol x(\boldsymbol y))\) is a composite scalar-valued function, where \(\boldsymbol x(\boldsymbol y)\) is a map from \(\boldsymbol y\) to \(\boldsymbol x\).
The gradient of the composite function \(g(\boldsymbol y) = h(\boldsymbol x(\boldsymbol y))\) can be computed from the gradient of \(h(\boldsymbol x)\) and the Jacobian matrix for \(\boldsymbol x(\boldsymbol y)\) as follows: \[ \nabla g(\boldsymbol y) = (D\boldsymbol x(\boldsymbol y))^T\, \nabla h(\boldsymbol x) \rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \] and \[ \nabla g(\boldsymbol y) = \left(\frac{\partial \boldsymbol x(\boldsymbol y)}{\partial \boldsymbol y}\right)^T\, \nabla h(\boldsymbol x) \rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \]
Similarly, the Hessian matrix of \(g(\boldsymbol y)\) can be computed from the Hessian of \(h(\boldsymbol x)\) and the Jacobian matrix for \(\boldsymbol x(\boldsymbol y)\): \[ \nabla \nabla^T g(\boldsymbol y) = (D \boldsymbol x(\boldsymbol y))^T\, \nabla \nabla^T h(\boldsymbol x)\rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \, D\boldsymbol x(\boldsymbol y) \] and \[ \nabla \nabla^T g(\boldsymbol y) = \left(\frac{\partial \boldsymbol x(\boldsymbol y)}{\partial \boldsymbol y}\right)^T\, \nabla \nabla^T h(\boldsymbol x)\rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)} \, \frac{\partial \boldsymbol x(\boldsymbol y)}{\partial \boldsymbol y} \] Note that this identity for the Hessian is exact if \(\boldsymbol x(\boldsymbol y)\) is an affine map; for a general map there is an additional term involving the second derivatives of the components of \(\boldsymbol x(\boldsymbol y)\) weighted by the gradient \(\nabla h(\boldsymbol x)\), which vanishes wherever \(\nabla h(\boldsymbol x(\boldsymbol y)) = 0\), for example at a local extremum of \(h\).
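A small numerical check of the chain rules, using the affine inner map \(\boldsymbol x(\boldsymbol y) = \boldsymbol A \boldsymbol y + \boldsymbol b\) (so that \(D\boldsymbol x(\boldsymbol y) = \boldsymbol A\) and the identities above hold exactly) and the quadratic function \(h(\boldsymbol x) = \boldsymbol x^T \boldsymbol x\). The sketch assumes NumPy; the finite-difference helper is illustrative only.

```python
import numpy as np

def numerical_gradient(f, z, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        grad[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))          # x(y) = A y + b maps R^3 to R^4
b = rng.normal(size=4)

x_of_y = lambda y: A @ y + b         # Jacobian D x(y) = A
h = lambda x: x @ x                  # h(x) = x^T x: gradient 2 x, Hessian 2 I
g = lambda y: h(x_of_y(y))           # composite function g(y) = h(x(y))

y0 = rng.normal(size=3)
x0 = x_of_y(y0)

# chain rule for the gradient: grad g(y) = (D x(y))^T grad h(x) at x = x(y)
grad_g = A.T @ (2 * x0)
print(np.allclose(numerical_gradient(g, y0), grad_g))   # True

# chain rule for the Hessian: (D x(y))^T Hess h(x) D x(y) = 2 A^T A
hess_g = A.T @ (2 * np.eye(4)) @ A
print(np.allclose(hess_g, hess_g.T))                    # symmetric, as expected
```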
2.4 First order matrix derivatives
The derivative of a scalar-valued function \(h(\boldsymbol X)\) with regard to a matrix argument \(\boldsymbol X= (x_{ij})\) is defined as below and results (in numerator layout convention) in a matrix of the same dimension as \(\boldsymbol X^T\): \[ D h(\boldsymbol X) = \frac{\partial h(\boldsymbol X)}{\partial \boldsymbol X} = \left(\frac{\partial h(\boldsymbol X)}{\partial x_{ji}}\right) \]
Example 2.4 Examples for first order matrix derivatives:
- \(\frac{\partial \text{Tr}(\boldsymbol A^T \boldsymbol X)}{\partial \boldsymbol X} = \boldsymbol A^T\)
- \(\frac{\partial \text{Tr}(\boldsymbol A^T \boldsymbol X\boldsymbol B)}{\partial \boldsymbol X} = \boldsymbol B\boldsymbol A^T\)
- \(\frac{\partial \text{Tr}(\boldsymbol X^T \boldsymbol A\boldsymbol X)}{\partial \boldsymbol X} = \boldsymbol X^T (\boldsymbol A+ \boldsymbol A^T)\)
- \(\frac{\partial \log \det(\boldsymbol X)}{\partial \boldsymbol X} = \frac{\partial \text{Tr}(\log \boldsymbol X)}{\partial \boldsymbol X} = \boldsymbol X^{-1}\)
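The last identity can be checked numerically as well. In numerator layout the derivative \(\partial h/\partial \boldsymbol X\) has entries \(\partial h/\partial x_{ji}\), i.e. it is the transpose of the matrix of elementwise partial derivatives. The sketch below assumes NumPy and uses a fixed, diagonally dominant matrix with positive determinant; all names are illustrative.

```python
import numpy as np

def elementwise_grad(h, X, eps=1e-6):
    """Matrix of partial derivatives dh(X)/dx_ij (same shape as X)."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (h(X + E) - h(X - E)) / (2 * eps)
    return G

# a non-symmetric matrix with positive determinant
X = np.array([[2.0, 0.5, 0.1],
              [0.3, 1.5, 0.2],
              [0.0, 0.4, 1.8]])

log_det = lambda M: np.log(np.linalg.det(M))

# numerator layout: the derivative is the transpose of the elementwise gradient
D = elementwise_grad(log_det, X).T
print(np.allclose(D, np.linalg.inv(X), atol=1e-6))   # d log det(X) / dX = X^{-1}
```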
2.5 Linear and quadratic approximation
Linear and quadratic approximations of a differentiable function are given by Taylor series of first and second order, respectively.
Linear and quadratic approximation of a scalar-valued function of a scalar: \[ h(x) \approx h(x_0) + h'(x_0) \, (x-x_0) + \frac{1}{2} h''(x_0) \, (x-x_0)^2 \] Note that \(h'(x_0) = h'(x) \rvert_{x = x_0}\) is the first derivative of \(h(x)\) evaluated at \(x_0\) and \(h''(x_0) = h''(x) \rvert_{x = x_0}\) is the second derivative of \(h(x)\) evaluated at \(x_0\).
With \(x = x_0+ \varepsilon\) the approximation can also be written as \[ h(x_0+ \varepsilon) \approx h(x_0) + h'(x_0) \, \varepsilon + \frac{1}{2} h''(x_0)\, \varepsilon^2 \] The first two terms on the right comprise the linear approximation, all three terms the quadratic approximation.

Linear and quadratic approximation of a scalar-valued function of a vector: \[ h(\boldsymbol x) \approx h(\boldsymbol x_0) + \nabla^T h(\boldsymbol x_0)\, (\boldsymbol x-\boldsymbol x_0) + \frac{1}{2} (\boldsymbol x-\boldsymbol x_0)^T \, \nabla \nabla^T h(\boldsymbol x_0) \, (\boldsymbol x-\boldsymbol x_0) \] Note that \(\nabla^T h(\boldsymbol x_0)\) is the transposed gradient (i.e. the vector derivative) of \(h(\boldsymbol x)\) evaluated at \(\boldsymbol x_0\) and \(\nabla \nabla^T h(\boldsymbol x_0)\) is the Hessian matrix of \(h(\boldsymbol x)\) evaluated at \(\boldsymbol x_0\). With \(\boldsymbol x= \boldsymbol x_0+ \boldsymbol \varepsilon\) this approximation can also be written as \[ h(\boldsymbol x_0+ \boldsymbol \varepsilon) \approx h(\boldsymbol x_0) + \nabla^T h(\boldsymbol x_0)\, \boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol x_0) \,\boldsymbol \varepsilon \] The first two terms on the right comprise the linear approximation, all three terms the quadratic approximation.
Linear approximation of a vector-valued function of a vector: \[ \boldsymbol h(\boldsymbol x) \approx \boldsymbol h(\boldsymbol x_0) + D \boldsymbol h(\boldsymbol x_0) \, (\boldsymbol x-\boldsymbol x_0) \] Note that \(D \boldsymbol h(\boldsymbol x_0)\) is the Jacobian matrix (i.e. the vector derivative) of \(\boldsymbol h(\boldsymbol x)\) evaluated at \(\boldsymbol x_0\). With \(\boldsymbol x= \boldsymbol x_0+ \boldsymbol \varepsilon\) this approximation can also be written as \[ \boldsymbol h(\boldsymbol x_0+ \boldsymbol \varepsilon) \approx \boldsymbol h(\boldsymbol x_0) + D \boldsymbol h(\boldsymbol x_0) \, \boldsymbol \varepsilon \]
Example 2.5 Examples of Taylor series approximations of second order:
- \(\log(x_0+\varepsilon) \approx \log(x_0) + \frac{\varepsilon}{x_0} - \frac{\varepsilon^2}{2 x_0^2}\)
- \(\frac{x_0}{x_0+\varepsilon} \approx 1 - \frac{\varepsilon}{x_0} + \frac{\varepsilon^2}{ x_0^2}\)
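The quality of these approximations is easy to inspect numerically, as in the following sketch for the logarithm (assuming NumPy; the chosen expansion point \(x_0 = 2\) is arbitrary).

```python
import numpy as np

x0 = 2.0
for eps in (0.5, 0.1, 0.01):
    exact = np.log(x0 + eps)
    linear = np.log(x0) + eps / x0                # first order Taylor series
    quadratic = linear - eps**2 / (2 * x0**2)     # second order Taylor series
    print(f"eps={eps:5.2f}  exact={exact:.8f}  "
          f"linear={linear:.8f}  quadratic={quadratic:.8f}")
# the error of the quadratic approximation shrinks like eps^3 as eps -> 0
```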
Example 2.6 Around a local extremum \(\boldsymbol x_0\) (maximum or minimum) where the gradient vanishes (\(\nabla h(\boldsymbol x_0) = 0\)) the quadratic approximation of the function \(h(\boldsymbol x)\) simplifies to
\[
h(\boldsymbol x_0+ \boldsymbol \varepsilon) \approx h(\boldsymbol x_0) + \frac{1}{2} \boldsymbol \varepsilon^T \nabla \nabla^T h(\boldsymbol x_0) \boldsymbol \varepsilon
\]
2.6 Conditions for a local extremum of a function
To check if \(x_0\) or \(\boldsymbol x_0\) is a local extremum, i.e. a local maximum or a local minimum, of a differentiable function \(h(x)\) or \(h(\boldsymbol x)\) we can use the following conditions:
For a function of a single variable:
i) The first derivative is zero at the extremum: \(h'(x_0) = 0\).
ii) If the second derivative at the extremum is negative, \(h''(x_0) < 0\), then it is a maximum.
iii) If the second derivative at the extremum is positive, \(h''(x_0) > 0\), then it is a minimum.
Note that conditions ii) and iii) are sufficient but not necessary. For a minimum, it is necessary that the second derivative is non-negative, and for a maximum that the second derivative is non-positive.
For a function of several variables:
i) The gradient vanishes at the extremum: \(\nabla h(\boldsymbol x_0)=0\).
ii) If the Hessian \(\nabla \nabla^T h(\boldsymbol x_0)\) is negative definite (i.e. all eigenvalues of the Hessian matrix are negative) then the extremum is a maximum.
iii) If the Hessian is positive definite (i.e. all eigenvalues of the Hessian matrix are positive) then the extremum is a minimum.
Again, conditions ii) and iii) are sufficient but not necessary. For a minimum it is necessary that the Hessian is positive semi-definite, and for a maximum that the Hessian is negative semi-definite.
Example 2.7 Minimum with vanishing second derivative:
The function \(h(x) = x^4\) clearly has a minimum at \(x_0=0\). As required, the first derivative \(h'(x) = 4 x^3\) vanishes at \(x_0=0\). However, the second derivative \(h''(x) = 12 x^2\) also vanishes at \(x_0=0\), showing that a positive second derivative is not necessary for a minimum.
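For functions of several variables the second order conditions can be checked by inspecting the eigenvalues of the Hessian at a stationary point, as in the following sketch (assuming NumPy; the classification helper and the example Hessians are illustrative).

```python
import numpy as np

def classify_stationary_point(hessian):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > 0):
        return "local minimum"
    if np.all(eigenvalues < 0):
        return "local maximum"
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return "saddle point"
    return "inconclusive (semi-definite Hessian)"

# h(x) = x1^2 + 2 x2^2 has Hessian diag(2, 4) at its stationary point (0, 0)
print(classify_stationary_point(np.diag([2.0, 4.0])))    # local minimum
# h(x) = x1^2 - x2^2 has Hessian diag(2, -2) at (0, 0): neither max nor min
print(classify_stationary_point(np.diag([2.0, -2.0])))   # saddle point
# h(x) = x1^4 + x2^4 has a zero Hessian at (0, 0), analogous to Example 2.7
print(classify_stationary_point(np.zeros((2, 2))))       # inconclusive
```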
2.7 Convex and concave functions
A function \(h(\boldsymbol x)\) is convex if for all \(\boldsymbol x_1\) and \(\boldsymbol x_2\) the line segment from the point \((\boldsymbol x_1, h(\boldsymbol x_1))\) to the point \((\boldsymbol x_2, h(\boldsymbol x_2))\) never lies below the function. Moreover, the function is strictly convex if the line segment always lies above the curve, apart from the two end points. In terms of function values convexity means
\[
\lambda h(\boldsymbol x_1) + (1-\lambda) h(\boldsymbol x_2) \geq h(\lambda \boldsymbol x_1 + (1-\lambda) \boldsymbol x_2)
\]
for all \(\lambda \in [0, 1]\), and strict convexity means that the inequality is strict whenever \(\boldsymbol x_1 \neq \boldsymbol x_2\) and \(\lambda \in (0, 1)\).
Equivalently, a differentiable function \(h(\boldsymbol x)\) is convex (strictly convex) if for all \(\boldsymbol x_0\) the function \(h(\boldsymbol x)\) never lies below (always lies above, except at \(\boldsymbol x_0\)) the linear approximation through the point \((\boldsymbol x_0, h(\boldsymbol x_0))\): \[ h(\boldsymbol x) \geq h(\boldsymbol x_0) + \nabla^T h(\boldsymbol x_0)\, (\boldsymbol x-\boldsymbol x_0) \]
For a convex function a vanishing gradient at \(\boldsymbol x_0\) indicates a minimum at \(\boldsymbol x_0\). Furthermore, any local minimum must also be a global minimum (for a differentiable function this follows directly from the last inequality). For a strictly convex function the minimum is unique so there is at most one local/global minimum in that case.
If \(h(\boldsymbol x)\) is convex, then \(-h(\boldsymbol x)\) is concave, and the criteria above can be adapted accordingly to check for concavity and strict concavity, as well as to identify local/global maxima.
(Strictly) convex and concave functions are convenient objective functions in optimisation as it is straightforward to find their local/global extrema, both analytically and numerically.
As the graph of a convex function resembles a valley, one way to memorise this is to remember that a valley is convex.
Example 2.8 Convex functions:
This is a convex function but not a strictly convex function:
- \(\max(x^2, | x | )\)
The following are strictly convex functions:
- \(x^2\),
- \(x^4\),
- \(e^x\),
- \(x \log(x)\) for \(x>0\).
On the other hand, this is not a convex function:
- \(\frac{1}{x^2}\) for all \(x \neq 0\).
However, the function in the last example is strictly convex if its domain is restricted to either \(x > 0\) or \(x < 0\).
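The defining inequality for convexity can also be checked numerically for specific points, as in the following sketch (assuming NumPy; the chosen points and \(\lambda\) are arbitrary). For the convex but not strictly convex function \(\max(x^2, |x|)\) the chord between two points in \([0, 1]\) lies exactly on the curve, whereas for the strictly convex \(x \log(x)\) the gap is strictly positive.

```python
import numpy as np

def chord_gap(h, x1, x2, lam):
    """lam*h(x1) + (1-lam)*h(x2) - h(lam*x1 + (1-lam)*x2); >= 0 for a convex h."""
    return lam * h(x1) + (1 - lam) * h(x2) - h(lam * x1 + (1 - lam) * x2)

h1 = lambda x: np.maximum(x**2, np.abs(x))   # convex, but not strictly convex
h2 = lambda x: x * np.log(x)                 # strictly convex for x > 0

# on [0.2, 0.8] the function h1 equals |x|, so the chord lies on the curve
print(np.isclose(chord_gap(h1, 0.2, 0.8, 0.5), 0.0))   # True: inequality not strict
# for h2 the gap is strictly positive for two distinct points
print(chord_gap(h2, 0.5, 2.0, 0.5) > 0)                # True: inequality strict
```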
Example 2.9 Concave functions:
The following are strictly concave functions:
- \(-x^2\),
- \(\log(x)\) for \(x>0\),
- \(\sqrt{x}\) for \(x>0\).