5 Fisher information
This chapter introduces the Fisher information matrix as the local curvature (the Hessian matrix) of the Kullback-Leibler divergence, serving as the local second-order sensitivity matrix for model parameters.
5.1 Fisher information
Local quadratic approximation of KL divergence
The Kullback-Leibler (KL) divergence measures the discrepancy between two distributions. We now study the KL divergence between two distributions within a parametric family that differ only by some small \(\boldsymbol \varepsilon\).
Specifically, we consider \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon)) &= \operatorname{E}_{P(\boldsymbol \theta)}\left( \log p(\boldsymbol x| \boldsymbol \theta) - \log p(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right)\\ & = h(\boldsymbol \theta+\boldsymbol \varepsilon) \\ \end{split} \] where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. Assuming that the pdmf \(p(\boldsymbol x| \boldsymbol \theta)\) is twice differentiable with regard to \(\boldsymbol \theta\) we can approximate the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) quadratically by \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx h(\boldsymbol \theta) + \nabla h(\boldsymbol \theta)^T\boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
From the familiar properties of the KL divergence we conclude
- \(D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\geq 0\) and
- with equality only if \(\boldsymbol \varepsilon=0\).
Thus, by construction the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) assumes a minimum at \(\boldsymbol \varepsilon=0\) with \(h(\boldsymbol \theta)=0\) and a vanishing gradient \(\nabla h(\boldsymbol \theta) = 0\). Therefore, in the quadratic approximation of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) around \(\boldsymbol \theta\) the first two terms (constant and linear) vanish and only the quadratic term remains: \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
Furthermore, the Hessian matrix of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) evaluated at \(\boldsymbol \varepsilon=0\) is given by \[ \begin{split} \nabla \nabla^T h(\boldsymbol \theta) &= -\operatorname{E}_{P(\boldsymbol \theta)} \nabla \nabla^T \log p(\boldsymbol x| \boldsymbol \theta) \\ &= \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \end{split} \]
This matrix is known as the Fisher information at \(\boldsymbol \theta\), denoted by \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). It is also called expected Fisher information to emphasise that it is computed as the mean Hessian of the negative log-pdmf. The Fisher information matrix is always symmetric and positive semidefinite.
With its help the KL divergence can be locally approximated by \[ D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \boldsymbol \varepsilon \]
We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta+\boldsymbol \varepsilon), P(\boldsymbol \theta)) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]
Hence, although the KL divergence is not symmetric in its arguments in general, it is symmetric to second order (locally symmetric).
Parameter identifiability
For a regular model the Fisher information is positive definite (all eigenvalues are strictly positive) and hence the parameters are locally identifiable. Recall that a positive definite Hessian implies that \(h(\boldsymbol \theta+ \boldsymbol \varepsilon)\) has a strict minimum at \(\boldsymbol \varepsilon=0\).
Conversely, for a singular statistical model the Fisher information matrix is singular (some or all of its eigenvalues vanish) at some parameter values. This indicates local non-identifiability arising, e.g., from overparametrisation, parameters linked by exact constraints, lower dimensional latent structure, parameters on boundaries or other regularity failures.
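The following toy sketch illustrates the singular case; the model \(N(\theta_1+\theta_2, 1)\), in which the two parameters enter only through their sum, and all numerical settings are choices made for this illustration and not taken from the text.

```python
import numpy as np

# Toy illustration (model chosen for this sketch): if two parameters enter the
# model only through their sum, as in N(theta1 + theta2, 1), the Fisher information
# matrix is singular and the parameters are not locally identifiable.

def logpdf(x, t):
    # normal log-density with mean theta1 + theta2 and unit variance
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - t[0] - t[1])**2

t0 = np.array([0.4, 0.6])
h = 1e-4
hess = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        # the Hessian does not depend on x here, so a single evaluation point suffices
        hess[i, j] = (logpdf(0.0, t0 + ei + ej) - logpdf(0.0, t0 + ei - ej)
                      - logpdf(0.0, t0 - ei + ej) + logpdf(0.0, t0 - ei - ej)) / (4 * h**2)

fisher = -hess
print(fisher)                        # approx. [[1, 1], [1, 1]]
print(np.linalg.eigvalsh(fisher))    # eigenvalues approx. 0 and 2: one flat direction
```

The zero eigenvalue corresponds to the direction \((1, -1)\): shifting \(\theta_1\) up and \(\theta_2\) down by the same amount leaves the model unchanged.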
Additivity of Fisher information
We may wish to compute the Fisher information based on a set of independent identically distributed (iid) random variables.
Assume that a random variable \(x \sim P(\boldsymbol \theta)\) has log-pdmf \(\log p(x| \boldsymbol \theta)\) and Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). The Fisher information \(\boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim P(\boldsymbol \theta)\) is computed from the joint log-pdmf \(\log p(x_1, \ldots, x_n| \boldsymbol \theta) = \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\). This yields \[ \begin{split} \boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta) &= -\operatorname{E}_{P(\boldsymbol \theta)} \nabla \nabla^T \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\\ &= \sum_{i=1}^n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) =n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \\ \end{split} \] Hence, the Fisher information for a set of \(n\) iid random variables equals \(n\) times the Fisher information of a single variable.
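A minimal numerical check of the additivity result, assuming a Bernoulli model (treated in Example 5.1 below) with arbitrary choices of \(\theta\), \(n\) and the finite-difference step size: averaging the negative second derivative of the joint log-likelihood over all \(2^n\) outcomes reproduces \(n \, I^{\text{Fisher}}(\theta)\).

```python
import itertools
import numpy as np

theta = 0.3
n = 4
h = 1e-4  # step size for the finite-difference second derivative

def joint_loglik(xs, th):
    xs = np.asarray(xs)
    return np.sum(xs * np.log(th) + (1 - xs) * np.log(1 - th))

fisher_n = 0.0
for xs in itertools.product([0, 1], repeat=n):
    prob = theta ** sum(xs) * (1 - theta) ** (n - sum(xs))  # joint pmf of the outcome
    # central finite-difference approximation of the second derivative in theta
    d2 = (joint_loglik(xs, theta + h) - 2 * joint_loglik(xs, theta)
          + joint_loglik(xs, theta - h)) / h**2
    fisher_n += -prob * d2  # expected negative Hessian

print(fisher_n)                      # approx. 19.0476
print(n / (theta * (1 - theta)))     # n * I_Fisher(theta) = 19.0476...
```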
Invariance property of Fisher information
Like the KL divergence, the Fisher information is invariant under a change of variables in the sample space, say from \(x\) to \(y\) with corresponding distributions \(F_x\) and \(F_y\). This is easy to see: the KL divergence itself is invariant under such a transformation, hence so is its local curvature, and therefore also the Fisher information.
More specifically, when the sample space is changed the density gains a factor given by the Jacobian determinant of the transformation. However, since this factor does not depend on the model parameters, the first and second derivatives of the log-density with regard to the model parameters are not affected by it.
See also Section 7.4 for the related sample space invariance of the gradient and curvature of the log-likelihood and Chapter 9 for the sample space invariance of the observed Fisher information.
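A concrete sketch of this argument; the transformation \(y = e^x\), the normal model and the step size are choices made for this illustration. The log-density of \(y\) only gains the log-Jacobian term \(-\log y\), which does not involve \(\mu\), so the second derivative with respect to \(\mu\) (and hence the Fisher information) is unchanged.

```python
import numpy as np

mu, var = 1.0, 2.0
h = 1e-4

def logpdf_x(x, m):
    # normal log-density in the original sample space
    return -0.5 * np.log(2 * np.pi * var) - (x - m)**2 / (2 * var)

def logpdf_y(y, m):
    # log-density of y = exp(x): normal log-density at log(y) plus log-Jacobian -log(y)
    return logpdf_x(np.log(y), m) - np.log(y)

for x in [-1.0, 0.5, 2.0]:
    y = np.exp(x)
    d2_x = (logpdf_x(x, mu + h) - 2 * logpdf_x(x, mu) + logpdf_x(x, mu - h)) / h**2
    d2_y = (logpdf_y(y, mu + h) - 2 * logpdf_y(y, mu) + logpdf_y(y, mu - h)) / h**2
    print(d2_x, d2_y)   # both approx. -1/var = -0.5 at every x
```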
\(\color{Red} \blacktriangleright\) Transformation of Fisher information under change of parameters
The Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\) depends on the parameter \(\boldsymbol \theta\). If we use a different parametrisation of the underlying distribution family, say \(\boldsymbol \zeta\) with a map \(\boldsymbol \theta(\boldsymbol \zeta)\) from \(\boldsymbol \zeta\) to \(\boldsymbol \theta\), then the Fisher information changes according to the chain rule in calculus.
To find the resulting Fisher information in terms of the new parameter \(\boldsymbol \zeta\) we need to use the Jacobian matrix \(D \boldsymbol \theta(\boldsymbol \zeta)\). This matrix contains the gradients for each component of the map \(\boldsymbol \theta(\boldsymbol \zeta)\) in its rows: \[ D \boldsymbol \theta(\boldsymbol \zeta) = \begin{pmatrix}\nabla^T \theta_1(\boldsymbol \zeta)\\ \nabla^T \theta_2(\boldsymbol \zeta) \\ \vdots \\ \end{pmatrix} \]
With the above the Fisher information for \(\boldsymbol \theta\) is then transformed to the Fisher information for \(\boldsymbol \zeta\) applying the chain rule for the Hessian matrix: \[ \boldsymbol I^{\text{Fisher}}(\boldsymbol \zeta) = (D \boldsymbol \theta(\boldsymbol \zeta))^T \, \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \, D \boldsymbol \theta(\boldsymbol \zeta) \] This type of transformation is also known as covariant transformation.
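A small numerical sketch of this covariant transformation rule for a one-parameter example; the Bernoulli model in its log-odds parametrisation \(\zeta = \log(\theta/(1-\theta))\), the value of \(\zeta\) and the step size are all choices made here, and the Fisher information \(1/(\theta(1-\theta))\) is taken from Example 5.1 below. The transformed Fisher information agrees with the Fisher information computed directly in \(\zeta\).

```python
import numpy as np

zeta = 0.7
theta = 1.0 / (1.0 + np.exp(-zeta))   # theta(zeta), inverse logit

# covariant transformation (scalar case): I(zeta) = Dtheta(zeta)^T I(theta) Dtheta(zeta)
dtheta_dzeta = theta * (1.0 - theta)           # Jacobian of theta(zeta)
fisher_theta = 1.0 / (theta * (1.0 - theta))   # Fisher information in theta
fisher_zeta_transformed = dtheta_dzeta * fisher_theta * dtheta_dzeta

# direct computation: expected negative second derivative of log p(x | zeta)
h = 1e-4
def loglik(x, z):
    return x * z - np.log1p(np.exp(z))  # Bernoulli log-pmf in the zeta parametrisation

fisher_zeta_direct = 0.0
for x, prob in [(0, 1 - theta), (1, theta)]:
    d2 = (loglik(x, zeta + h) - 2 * loglik(x, zeta) + loglik(x, zeta - h)) / h**2
    fisher_zeta_direct += -prob * d2

print(fisher_zeta_transformed)  # theta * (1 - theta), approx. 0.2217
print(fisher_zeta_direct)       # agrees up to discretisation error
```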
In information geometry probability distributions are studied using tools from differential geometry. From this geometric perspective, smoothly parametrised distribution families \(P(\boldsymbol \theta)\) are viewed as manifolds. In the geometry induced by the KL divergence the Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\) serves as the metric tensor, measuring local distances between nearby distributions.
Other types of divergences among distributions induce related geometries, with local metrics similarly obtained by quadratic approximation.
5.2 Fisher information examples
Models with a single parameter
Example 5.1 Fisher information for the Bernoulli distribution:
The log-pmf for the Bernoulli distribution \(\operatorname{Ber}(\theta)\) is \[ \log p(x | \theta) = x \log \theta + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\operatorname{E}(x) = \theta\) we get as Fisher information \[ \begin{split} I^{\text{Fisher}}(\theta) & = -\operatorname{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]
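A quick numerical cross-check of this expectation step (the value \(\theta = 0.3\) is an arbitrary choice): averaging the negative second derivative over the two outcomes \(x \in \{0, 1\}\) with weights \(\theta\) and \(1-\theta\) reproduces \(1/(\theta(1-\theta))\).

```python
theta = 0.3
d2 = lambda x: -x / theta**2 - (1 - x) / (1 - theta)**2   # second derivative of the log-pmf
fisher = -(theta * d2(1) + (1 - theta) * d2(0))           # expected negative second derivative
print(fisher, 1 / (theta * (1 - theta)))                  # both 4.7619...
```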
Example 5.2 Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 4.3 we have as KL divergence \[ D_{\text{KL}}\left (\operatorname{Ber}(\theta_1), \operatorname{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 5.1 the corresponding Fisher information.
The quadratic approximation implies that \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta), \operatorname{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta+\varepsilon), \operatorname{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]
In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
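The agreement can also be inspected numerically. The sketch below (with an arbitrary \(\theta\) and a few shrinking values of \(\varepsilon\)) prints the exact forward and reverse KL divergences next to the shared quadratic approximation; all three agree increasingly well as \(\varepsilon\) shrinks.

```python
import numpy as np

def kl_bernoulli(t1, t2):
    # D_KL(Ber(t1), Ber(t2)), see Example 4.3
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

theta = 0.3
for eps in [0.1, 0.01, 0.001]:
    approx = eps**2 / (2 * theta * (1 - theta))
    print(eps,
          kl_bernoulli(theta, theta + eps),   # forward KL
          kl_bernoulli(theta + eps, theta),   # reverse KL
          approx)                             # shared quadratic approximation
```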
Example 5.3 Fisher information for the normal distribution with known variance:
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log p(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the Fisher information is \[ \boldsymbol I^{\text{Fisher}}\left(\mu\right) = \frac{1}{\sigma^2} \]
Models with multiple parameters
Example 5.4 Fisher information for the normal distribution:
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log p(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.
The corresponding Hessian matrix is \[ \nabla \nabla^T \log p(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\operatorname{E}(x) = \mu\) we have \(\operatorname{E}(x-\mu) =0\). Furthermore, with \(\operatorname{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\operatorname{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol I^{\text{Fisher}}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
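As a sanity check one can estimate the expected negative Hessian by simulation. In this sketch the sample size, the step size and the parameter values \(\mu = 1\), \(\sigma^2 = 2\) are arbitrary choices; the Monte Carlo average of a finite-difference Hessian of the log-density is compared with the closed-form Fisher information matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.0, 2.0
x = rng.normal(mu, np.sqrt(var), size=200_000)

def logpdf(x, params):
    m, v = params
    return -0.5 * np.log(v) - (x - m)**2 / (2 * v) - 0.5 * np.log(2 * np.pi)

h = 1e-3
params = np.array([mu, var])
hess = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        # central finite-difference approximation of the (i, j) second partial derivative
        d2 = (logpdf(x, params + ei + ej) - logpdf(x, params + ei - ej)
              - logpdf(x, params - ei + ej) + logpdf(x, params - ei - ej)) / (4 * h**2)
        hess[i, j] = np.mean(d2)  # Monte Carlo average over the sample

print(-hess)                                         # approx. [[0.5, 0], [0, 0.125]]
print(np.array([[1/var, 0], [0, 1/(2*var**2)]]))     # exact Fisher information
```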
Example 5.5 \(\color{Red} \blacktriangleright\) Fisher information of the categorical distribution:
The log-pmf function for the categorical distribution \(\operatorname{Cat}(\boldsymbol \theta)\) with \(K\) classes and \(K-1\) free parameters \(\theta_1, \ldots, \theta_{K-1}\) is \[ \begin{split} \log p(\boldsymbol x| \theta_1, \ldots, \theta_{K-1} ) & =\sum_{k=1}^{K-1} x_k \log \theta_k + x_K \log \theta_K \\ & =\sum_{k=1}^{K-1} x_k \log \theta_k + \left( 1 - \sum_{k=1}^{K-1} x_k \right) \log \left( 1 - \sum_{k=1}^{K-1} \theta_k \right) \\ \end{split} \]
From the log-pmf we compute the Hessian matrix of second order partial derivatives \(\nabla \nabla^T \log p(\boldsymbol x| \theta_1, \ldots, \theta_{K-1} )\) with regard to \(\theta_1, \ldots, \theta_{K-1}\):
The diagonal entries of the Hessian matrix (with \(i=1, \ldots, K-1\)) are \[ \frac{\partial^2}{\partial \theta_i^2} \log p(\boldsymbol x|\theta_1, \ldots, \theta_{K-1}) = -\frac{x_i}{\theta_i^2}-\frac{x_K}{\theta_K^2} \]
the off-diagonal entries are (with \(j=1, \ldots, K-1\) and \(j \neq i\)) \[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(\boldsymbol x|\theta_1, \ldots, \theta_{K-1}) = -\frac{ x_K}{\theta_K^2} \]
Recalling that \(\operatorname{E}(x_i) = \theta_i\) we obtain the Fisher information matrix for a categorical distribution as the \((K-1) \times (K-1)\) dimensional matrix \[ \begin{split} \boldsymbol I^{\text{Fisher}}\left( \theta_1, \ldots, \theta_{K-1} \right) &= -\operatorname{E}\left( \nabla \nabla^T \log p(\boldsymbol x| \theta_1, \ldots, \theta_{K-1}) \right) \\ & = \begin{pmatrix} \frac{1}{\theta_1} + \frac{1}{\theta_K} & \cdots & \frac{1}{\theta_K} \\ \vdots & \ddots & \vdots \\ \frac{1}{\theta_K} & \cdots & \frac{1}{\theta_{K-1}} + \frac{1}{\theta_K} \\ \end{pmatrix}\\ & = \operatorname{Diag}\left( \frac{1}{\theta_1} , \ldots, \frac{1}{\theta_{K-1}} \right) + \frac{1}{\theta_K} \mathbf 1 \mathbf 1^T\\ \end{split} \] where \(\mathbf 1\) denotes the \(K-1\) dimensional vector of ones.
For \(K=2\) and \(\theta_1=\theta\) this reduces to the Fisher information of the Bernoulli distribution, see Example 5.1, with \[ \begin{split} I^{\text{Fisher}}(\theta) & = \left(\frac{1}{\theta} + \frac{1}{1-\theta} \right) \\ &= \frac{1}{\theta (1-\theta)} \\ \end{split} \]
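A short numerical cross-check of this matrix; the probabilities \(\boldsymbol \theta = (0.2, 0.3, 0.1, 0.4)^T\) and the step size are arbitrary choices for this sketch. The closed form \(\operatorname{Diag}(1/\theta_1, \ldots, 1/\theta_{K-1}) + (1/\theta_K)\, \mathbf 1 \mathbf 1^T\) is compared against the expected negative Hessian of the log-pmf obtained by finite differences over the \(K\) one-hot outcomes.

```python
import numpy as np

theta = np.array([0.2, 0.3, 0.1, 0.4])   # K = 4 classes, theta_K = 0.4
K = len(theta)
free = theta[:-1]                        # the K-1 free parameters

closed_form = np.diag(1 / free) + np.ones((K - 1, K - 1)) / theta[-1]

def logpmf(x, free_params):
    # log-pmf in terms of the K-1 free parameters (theta_K = 1 - sum of the rest)
    th = np.append(free_params, 1 - np.sum(free_params))
    return np.sum(x * np.log(th))

h = 1e-4
expected_neg_hess = np.zeros((K - 1, K - 1))
for k in range(K):                       # one-hot outcomes x with E(x_i) = theta_i
    x = np.eye(K)[k]
    for i in range(K - 1):
        for j in range(K - 1):
            ei, ej = np.eye(K - 1)[i] * h, np.eye(K - 1)[j] * h
            d2 = (logpmf(x, free + ei + ej) - logpmf(x, free + ei - ej)
                  - logpmf(x, free - ei + ej) + logpmf(x, free - ei - ej)) / (4 * h**2)
            expected_neg_hess[i, j] += -theta[k] * d2

print(np.round(closed_form, 3))
print(np.round(expected_neg_hess, 3))    # the two matrices agree
```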
Example 5.6 \(\color{Red} \blacktriangleright\) Quadratic approximation of KL divergence of the categorical distribution and the Neyman and Pearson divergence:
We now consider the local approximation of the KL divergence \(D_{\text{KL}}(Q, P)\) between the categorical distribution \(Q=\operatorname{Cat}(\boldsymbol q)\) with probabilities \(\boldsymbol q=(q_1, \ldots, q_K)^T\) and the categorical distribution \(P=\operatorname{Cat}(\boldsymbol p)\) with probabilities \(\boldsymbol p= (p_1, \ldots, p_K)^T\).
From Example 4.6 we already know the KL divergence and from Example 5.5 the corresponding Fisher information.
First, we keep \(Q\) fixed and assume that \(P\) is a perturbed version of \(Q\) with \(\boldsymbol p= \boldsymbol q+\boldsymbol \varepsilon\). The perturbations \(\boldsymbol \varepsilon=(\varepsilon_1, \ldots, \varepsilon_K)^T\) satisfy \(\sum_{k=1}^K \varepsilon_k = 0\) because \(\sum_{k=1}^K q_k=1\) and \(\sum_{k=1}^K p_k=1\). Thus \(\varepsilon_K = -\sum_{k=1}^{K-1} \varepsilon_k\). Then \[ \begin{split} D_{\text{KL}}(\operatorname{Cat}(\boldsymbol q), \operatorname{Cat}(\boldsymbol q+\boldsymbol \varepsilon)) & \approx \frac{1}{2} (\varepsilon_1, \ldots, \varepsilon_{K-1}) \, \boldsymbol I^{\text{Fisher}}\left( q_1, \ldots, q_{K-1} \right) \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{K-1}\\ \end{pmatrix} \\ &= \frac{1}{2} \left( \sum_{k=1}^{K-1} \frac{\varepsilon_k^2}{q_k} + \frac{ \left(\sum_{k=1}^{K-1} \varepsilon_k\right)^2}{q_K} \right) \\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{\varepsilon_k^2}{q_k}\\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{q_k}\\ & = \frac{1}{2} D_{\text{Neyman}}(Q, P)\\ \end{split} \] Second, keeping \(P\) fixed and with \(Q\) a perturbation of \(P\) we get \[ \begin{split} D_{\text{KL}}(\operatorname{Cat}(\boldsymbol p+\boldsymbol \varepsilon), \operatorname{Cat}(\boldsymbol p)) &\approx \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{p_k}\\ &= \frac{1}{2} D_{\text{Pearson}}(Q, P) \end{split} \] Note that in both approximations we divide by the probabilities of the distribution that is kept fixed.
In the above we encounter the Pearson \(\chi^2\) divergence and the Neyman \(\chi^2\) divergence. Both are, like the KL divergence, part of the family of \(f\)-divergences. The Neyman \(\chi^2\) divergence is also known as the reverse Pearson divergence as \(D_{\text{Neyman}}(Q, P) = D_{\text{Pearson}}(P, Q)\).
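A brief numerical illustration of the two local approximations; the probabilities \(\boldsymbol q\) and the perturbation \(\boldsymbol \varepsilon\) below are arbitrary choices. For two nearby categorical distributions \(Q\) and \(P\) the exact KL divergence lies close to both half the Neyman divergence (dividing by the probabilities of \(Q\)) and half the Pearson divergence (dividing by the probabilities of \(P\)), as both expansions are accurate to second order.

```python
import numpy as np

q = np.array([0.2, 0.3, 0.5])
eps = np.array([0.01, -0.02, 0.01])   # perturbation, sums to zero
p = q + eps

kl = np.sum(q * np.log(q / p))        # D_KL(Cat(q), Cat(p))
neyman = np.sum((q - p)**2 / q)       # divides by the probabilities of Q
pearson = np.sum((q - p)**2 / p)      # divides by the probabilities of P

print(kl, 0.5 * neyman, 0.5 * pearson)   # all three agree to second order
```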
5.3 Further reading
Amari (2016) is a recent book and standard reference on information geometry.
For metrics associated with proper scoring rules see Dawid and Musio (2014).
The Fisher information matrix was originally introduced by Ronald A. Fisher (1890–1962) in the 1920s to measure the precision of an estimator.
Calyampudi R. Rao (1920–2023) showed in 1945 that the Fisher information matrix defines a local metric tensor on the parameter space and established its interpretation as a local sensitivity measure.
This insight later helped lead to the development of information geometry and to the study of singular or non-regular models in statistics.