5 Local divergence
This chapter introduces the Fisher information matrix as the local curvature (the Hessian matrix) of the Kullback-Leibler divergence, serving as the local second-order sensitivity matrix for model parameters.
5.1 Fisher information
Local quadratic approximation of KL divergence
The Kullback-Leibler (KL) divergence measures the discrepancy between two distributions. We now study the KL divergence between two distributions within a parametric family that are separated only by some small \(\boldsymbol \varepsilon\).
Specifically, we consider \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon)) &= \operatorname{E}_{P(\boldsymbol \theta)}\left( \log p(\boldsymbol x| \boldsymbol \theta) - \log p(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right)\\ & = h(\boldsymbol \theta+\boldsymbol \varepsilon) \\ \end{split} \] where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. Assuming that the pdmf \(p(\boldsymbol x| \boldsymbol \theta)\) is twice differentiable with regard to \(\boldsymbol \theta\) we can approximate the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) quadratically by \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx h(\boldsymbol \theta) + \nabla h(\boldsymbol \theta)^T\boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
From the familiar properties of the KL divergence we conclude
- \(D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\geq 0\) and
- with equality only if \(\boldsymbol \varepsilon=0\).
Thus, by construction the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) assumes a minimum at \(\boldsymbol \varepsilon=0\) with \(h(\boldsymbol \theta)=0\) and a vanishing gradient \(\nabla h(\boldsymbol \theta) = 0\). Therefore, in the quadratic approximation of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) around \(\boldsymbol \theta\) the first two terms (constant and linear) vanish and only the quadratic term remains: \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
Furthermore, the Hessian matrix of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) evaluated at \(\boldsymbol \varepsilon=0\) is given by \[ \begin{split} \nabla \nabla^T h(\boldsymbol \theta) &= -\operatorname{E}_{P(\boldsymbol \theta)} \nabla \nabla^T \log p(\boldsymbol x| \boldsymbol \theta) \\ &= \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta) \end{split} \]
This matrix \(\boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\) is known as the Fisher information. The index \(P\) serves as a reminder of the underlying model. It is also called expected Fisher information to emphasise that it is computed as the mean Hessian of the negative log-pdmf. The Fisher information matrix is always symmetric and positive semidefinite.
With its help the KL divergence can be locally approximated by \[ D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\,\boldsymbol \varepsilon \]
We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta+\boldsymbol \varepsilon), P(\boldsymbol \theta)) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]
Hence, although the KL divergence is not symmetric in its arguments in general, it is symmetric to second order (locally symmetric).
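The following numerical sketch illustrates these statements for the normal family \(N(\mu, \sigma^2)\) with \(\boldsymbol \theta = (\mu, \sigma^2)^T\). It uses the well-known closed form of the KL divergence between two univariate normals and a finite-difference Hessian at \(\boldsymbol \varepsilon = 0\); the parameter values and step size are illustrative choices, and the reference matrix \(\operatorname{diag}(1/\sigma^2, 1/(2\sigma^4))\) anticipates the Fisher information derived in Example 5.4 below.

```python
import numpy as np

mu, s2 = 1.0, 2.0                           # reference parameter theta = (mu, sigma^2)

def kl_normal(m0, v0, m1, v1):
    """Closed-form KL( N(m0, v0), N(m1, v1) )."""
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def h_forward(eps):                         # D_KL( P(theta), P(theta + eps) )
    return kl_normal(mu, s2, mu + eps[0], s2 + eps[1])

def h_reverse(eps):                         # D_KL( P(theta + eps), P(theta) )
    return kl_normal(mu + eps[0], s2 + eps[1], mu, s2)

def hessian_at_zero(f, h=1e-3):
    """Central finite-difference Hessian of f(eps) at eps = 0 (two dimensions)."""
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
            H[i, j] = (f(ei + ej) - f(ei - ej) - f(-ei + ej) + f(-ei - ej)) / (4 * h**2)
    return H

print(hessian_at_zero(h_forward))           # approx [[1/s2, 0], [0, 1/(2 s2^2)]]
print(hessian_at_zero(h_reverse))           # same matrix: the KL divergence is locally symmetric
print(np.diag([1 / s2, 1 / (2 * s2**2)]))   # Fisher information (see Example 5.4)
```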
In information geometry probability distributions are studied using tools from differential geometry. From this geometric perspective, smoothly parametrised distribution families \(P(\boldsymbol \theta)\) are viewed as manifolds. In the geometry induced by the KL divergence the Fisher information \(\boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\) serves as the metric tensor, measuring local distances between nearby distributions.
Other types of divergences among distributions induce related geometries, with local metrics similarly obtained by quadratic approximation.
Parameter identifiability
For a regular model the Fisher information is positive definite (with only positive eigenvalues) and hence the parameters are locally identifiable. Recall that a positive definite Hessian implies that \(h(\boldsymbol \theta+ \boldsymbol \varepsilon)\) has a strict minimum at \(\boldsymbol \varepsilon=0\).
Conversely, for a singular statistical model the Fisher information matrix is singular (some or all of its eigenvalues vanish) at some parameter values. This indicates local non-identifiability arising, e.g., from overparametrisation, parameters linked by exact constraints, lower dimensional latent structure, parameters on boundaries or other regularity failures.
Additivity of Fisher information
We may wish to compute the Fisher information based on a set of independent identically distributed (iid) random variables.
Assume that a random variable \(x \sim P(\boldsymbol \theta)\) has log-pdmf \(\log p(x| \boldsymbol \theta)\) and Fisher information \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta)\). The Fisher information \(\boldsymbol{\mathcal{I}}_{P_{x_1} \ldots P_{x_n}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim P(\boldsymbol \theta)\) is computed from the joint log-pdmf \(\log p(x_1, \ldots, x_n| \boldsymbol \theta) = \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\). This yields \[ \begin{split} \boldsymbol{\mathcal{I}}_{P_{x_1} \ldots P_{x_n}}(\boldsymbol \theta) &= -\operatorname{E}_{P_{x_1}(\boldsymbol \theta) \ldots P_{x_n}(\boldsymbol \theta)} \nabla \nabla^T \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\\ &= \sum_{i=1}^n \boldsymbol{\mathcal{I}}_{P_{x_i}}(\boldsymbol \theta) =n \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \\ \end{split} \] Hence, the Fisher information for a set of \(n\) iid random variables equals \(n\) times the Fisher information of a single variable.
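As a quick sanity check, the sketch below estimates the joint Fisher information for \(n\) iid Bernoulli observations by Monte Carlo, using the second derivative of the Bernoulli log-pmf derived in Example 5.1 below; the values of \(\theta\), \(n\) and the number of replicates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 7, 200_000

# negative second derivative of log p(x | theta) for a single Bernoulli observation
def neg_d2_loglik(x, theta):
    return x / theta**2 + (1 - x) / (1 - theta)**2

x = rng.binomial(1, theta, size=(reps, n))     # reps independent samples of n iid Bernoulli draws
joint = neg_d2_loglik(x, theta).sum(axis=1)    # negative second derivative of the joint log-pmf
print(joint.mean())                            # Monte Carlo estimate of the joint Fisher information
print(n / (theta * (1 - theta)))               # n times the single-observation value
```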
Invariance under a change of variables
Fisher information is invariant with regard to reparametrisation of the sample space. Specifically, \[ \boldsymbol{\mathcal{I}}_{P_y}(\boldsymbol \theta) = \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \] under a general invertible variable transformation of the random variable from \(x\) to \(y\) with corresponding change of distribution from \(P_x\) to \(P_y\).
This corresponds to the invariance of KL divergence under a change of variables.
When the random variable is changed from \(x\) to \(y\) the density will gain a factor in the form of a Jacobian determinant associated with this transformation. However, since this factor does not depend on the model parameters, it does not affect the first and second derivatives of the log-density with regard to the model parameters.
See also Section 7.4 for the related sample space invariance of the gradient and curvature of the log-likelihood, and Chapter 9 for the sample space invariance of the observed Fisher information.
Data-processing inequality
More generally, Fisher information obeys the data-processing inequality. This states that Fisher information cannot increase under a data-processing map from \(x\) to \(y\), so that \[ \boldsymbol{\mathcal{I}}_{P_y}(\boldsymbol \theta) \leq \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \] For a lossless transformation, such as an invertible change of variables, the inequality becomes an identity. Note that for dimension \(d>1\) the above is a matrix inequality of the type \(\boldsymbol A\leq \boldsymbol B\), with matrices \(\boldsymbol A\), \(\boldsymbol B\) and \(\boldsymbol B-\boldsymbol A\) all symmetric and positive semidefinite.
The data-processing inequality for the Fisher information follows from the corresponding data-processing inequality for the KL divergence (Section 4.2). A related matrix inequality is the information inequality providing a lower bound on the variance of an estimator (Section 10.1).
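To make the inequality concrete, the sketch below compares the Fisher information carried by a Bernoulli variable \(x \sim \operatorname{Ber}(\theta)\) with that carried by a noisy copy \(y\) obtained by flipping \(x\) with probability \(\alpha\). The noise level \(\alpha\) and the value of \(\theta\) are illustrative choices, and the second derivative is taken by finite differences rather than analytically.

```python
import numpy as np

def bernoulli_fisher_info(success_prob, theta, h=1e-4):
    """Fisher information about theta carried by y ~ Ber(success_prob(theta)),
    computed as the exact expectation over y in {0, 1} of a finite-difference
    negative second derivative of the log-pmf."""
    def logp(y, th):
        p = success_prob(th)
        return y * np.log(p) + (1 - y) * np.log(1 - p)
    p0 = success_prob(theta)
    info = 0.0
    for y, w in [(1, p0), (0, 1 - p0)]:
        d2 = (logp(y, theta + h) - 2 * logp(y, theta) + logp(y, theta - h)) / h**2
        info -= w * d2
    return info

theta, alpha = 0.3, 0.1
I_x = bernoulli_fisher_info(lambda th: th, theta)                             # observe x directly
I_y = bernoulli_fisher_info(lambda th: (1 - 2 * alpha) * th + alpha, theta)   # observe the noisy y
print(I_x, I_y, I_y <= I_x)    # the noisy observation carries less Fisher information
```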
Scalar examples — single parameter models
Example 5.1 Fisher information for the Bernoulli distribution:
The log-pmf for the Bernoulli distribution \(\operatorname{Ber}(\theta)\) is \[ \log p(x | \theta) = x \log \theta + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\operatorname{E}(x) = \theta\) we get as Fisher information \[ \begin{split} \mathcal{I}_{P}(\theta) & = -\operatorname{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]
Example 5.2 Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 4.5 we have as KL divergence \[ D_{\text{KL}}\left (\operatorname{Ber}(\theta_1), \operatorname{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 5.1 the corresponding Fisher information.
The quadratic approximation implies that \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta), \operatorname{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} \mathcal{I}_{\operatorname{Ber}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta+\varepsilon), \operatorname{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} \mathcal{I}_{\operatorname{Ber}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]
In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
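Independently of Worksheet E1, the approximation can also be checked numerically. The sketch below compares the exact KL divergence from Example 4.5 with \(\varepsilon^2/(2 \theta (1-\theta))\) for a few illustrative values of \(\varepsilon\).

```python
import numpy as np

def kl_bernoulli(t1, t2):
    """Exact KL divergence D_KL( Ber(t1), Ber(t2) )."""
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

theta = 0.3
for eps in [0.1, 0.01, 0.001]:
    approx = eps**2 / (2 * theta * (1 - theta))
    print(eps,
          kl_bernoulli(theta, theta + eps),   # D_KL( Ber(theta), Ber(theta + eps) )
          kl_bernoulli(theta + eps, theta),   # D_KL( Ber(theta + eps), Ber(theta) )
          approx)
# both directions approach the quadratic approximation as eps shrinks
```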
Example 5.3 Fisher information for the normal distribution with known variance.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log p(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the Fisher information is \[ \mathcal{I}_{P}(\mu) = \frac{1}{\sigma^2} \]
Matrix examples — multiple parameter models
Example 5.4 Fisher information for the normal distribution.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log p(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.
The corresponding Hessian matrix is \[ \nabla \nabla^T \log p(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\operatorname{E}(x) = \mu\) we have \(\operatorname{E}(x-\mu) =0\). Furthermore, with \(\operatorname{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\operatorname{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol{\mathcal{I}}_{P}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
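A minimal Monte Carlo sketch of this result, averaging the negative analytic Hessian above over simulated observations; the values of \(\mu\) and \(\sigma^2\) and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(s2), size=1_000_000)

# entries of the Hessian of log p(x | mu, sigma^2) with respect to (mu, sigma^2)
h11 = -1 / s2 * np.ones_like(x)
h12 = -(x - mu) / s2**2
h22 = 1 / (2 * s2**2) - (x - mu)**2 / s2**3

neg_expected_hessian = -np.array([[h11.mean(), h12.mean()],
                                  [h12.mean(), h22.mean()]])
print(neg_expected_hessian)                            # approx the Fisher information
print(np.array([[1 / s2, 0], [0, 1 / (2 * s2**2)]]))   # [[1/sigma^2, 0], [0, 1/(2 sigma^4)]]
```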
Example 5.5 \(\color{Red} \blacktriangleright\) Fisher information for the canonical parameter of an exponential family:
Review Section 2.4 before studying this example.
Assume \(P(\boldsymbol \eta)\) is an exponential family with canonical parameter vector \(\boldsymbol \eta\), canonical statistics \(\boldsymbol t(x)\) and log-partition function \(a(\boldsymbol \eta)\) with log-pdmf \(\log p(x|\boldsymbol \eta) = \boldsymbol \eta^T \boldsymbol t(x) + \log h(x) - a(\boldsymbol \eta)\).
If we take second derivatives with regard to \(\boldsymbol \eta\) all terms except for the last vanish: \[ \nabla \nabla^T \log p(x | \boldsymbol \eta) = - \nabla \nabla^T a(\boldsymbol \eta) = -\boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta) \] Then the Fisher information is \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) &= -\operatorname{E}_{P(\boldsymbol \eta)} \nabla \nabla^T \log p(x | \boldsymbol \eta)\\ & = \operatorname{E}_{P(\boldsymbol \eta)} \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta)\\ &= \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta) \end{split} \]
Hence, the Fisher information for the canonical parameter in an exponential family is the variance of the canonical statistics.
See Example 5.8 for a related but different result for the expectation parameters.
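For the Bernoulli family this identity is easy to check numerically. The sketch below assumes the canonical statistic is \(t(x) = x\) and uses \(a''(\eta)\) as given in Example 5.6 below; the value of \(\eta\) and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.4
theta = np.exp(eta) / (1 + np.exp(eta))             # mean parameter of Ber(theta)

a_second = np.exp(eta) / (1 + np.exp(eta))**2       # second derivative of the log-partition function
x = rng.binomial(1, theta, size=1_000_000)
print(a_second, theta * (1 - theta), x.var())       # all three approximately equal
```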
5.2 \(\color{Red} \blacktriangleright\) Reparametrisation
\(\color{Red} \blacktriangleright\) Fisher information under reparametrisation
The Fisher information \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta)\) depends on the parameter \(\boldsymbol \theta\). If we use a different parametrisation of the underlying distribution family, say \(P(\boldsymbol \zeta)\) instead of \(P(\boldsymbol \theta)\), with a map \(\boldsymbol \theta(\boldsymbol \zeta)\) from \(\boldsymbol \zeta\) to \(\boldsymbol \theta\), then the Fisher information changes according to the chain rule in calculus.
To find the resulting Fisher information in terms of the new parameter \(\boldsymbol \zeta\) we need to use the Jacobian matrix \(D \boldsymbol \theta(\boldsymbol \zeta)\). This matrix contains the gradients for each component of the map \(\boldsymbol \theta(\boldsymbol \zeta)\) in its rows: \[ D \boldsymbol \theta(\boldsymbol \zeta) = \begin{pmatrix}\nabla^T \theta_1(\boldsymbol \zeta)\\ \nabla^T \theta_2(\boldsymbol \zeta) \\ \vdots \\ \end{pmatrix} \]
With the above the Fisher information for \(\boldsymbol \theta\) is then transformed to the Fisher information for \(\boldsymbol \zeta\) applying the chain rule for the Hessian matrix: \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \zeta) = (D \boldsymbol \theta(\boldsymbol \zeta))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \, D \boldsymbol \theta(\boldsymbol \zeta) \] This type of transformation is also known as covariant transformation.
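A minimal sketch of this rule, using a finite-difference Jacobian; the helper functions and parameter values are illustrative, and the usage example anticipates the Bernoulli computation in Example 5.6 below.

```python
import numpy as np

def jacobian(theta_of_zeta, zeta, h=1e-6):
    """Finite-difference Jacobian D theta(zeta), with the gradient of each component in its rows."""
    zeta = np.asarray(zeta, dtype=float)
    f0 = np.asarray(theta_of_zeta(zeta), dtype=float)
    J = np.zeros((f0.size, zeta.size))
    for j in range(zeta.size):
        dz = np.zeros_like(zeta)
        dz[j] = h
        J[:, j] = (np.asarray(theta_of_zeta(zeta + dz), dtype=float) - f0) / h
    return J

def reparametrised_fisher_info(fisher_old, theta_of_zeta, zeta):
    """Covariant transformation: I(zeta) = D theta(zeta)^T  I(theta(zeta))  D theta(zeta)."""
    J = jacobian(theta_of_zeta, zeta)
    return J.T @ fisher_old(theta_of_zeta(zeta)) @ J

# Usage (Bernoulli): old parameter eta (canonical), new parameter theta (mean),
# with the map theta -> eta given by the logit function; cf. Example 5.6 below.
fisher_eta = lambda eta: np.array([[np.exp(eta[0]) / (1 + np.exp(eta[0]))**2]])
logit = lambda theta: np.array([np.log(theta[0] / (1 - theta[0]))])
theta = 0.3
print(reparametrised_fisher_info(fisher_eta, logit, np.array([theta])))   # approx 1/(theta (1-theta))
print(1 / (theta * (1 - theta)))
```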
Examples
Example 5.6 \(\color{Red} \blacktriangleright\) Fisher information for the Bernoulli distribution from canonical to conventional parametrisation by change of variables:
From Example 2.2 and Example 5.5 the Fisher information for the Bernoulli distribution using the canonical parameter \(\eta\) is \[ \mathcal{I}_{P}(\eta)=a''(\eta) = \frac{ e^{\eta}}{(e^{\eta}+1)^2} \] The map to the canonical parameter \(\eta\) from the conventional parameter \(\theta\) is the logit function \(\eta(\theta) = \log\left( \frac{\theta}{1-\theta}\right)\) with Jacobian \[ D \eta(\theta) = \eta'(\theta)= \frac{1}{\theta (1-\theta)} \] Using the chain rule to obtain the Fisher information for \(\theta\) with \[ \mathcal{I}_{P}(\eta) \rvert_{\eta = \eta(\theta)} =\theta (1-\theta) \] yields \[ \mathcal{I}_{P}(\theta) = (D\eta(\theta))^2 \, \mathcal{I}_{P}(\eta) \rvert_{\eta = \eta(\theta)} = \frac{1}{\theta (1-\theta)} \] which agrees with the result obtained by direct calculation in Example 5.1.
Note that the Fisher information for the mean parameter \(\theta\) is the inverse of the Fisher information for the canonical parameter \(\eta\).
Example 5.7 \(\color{Red} \blacktriangleright\) Fisher information for the normal distribution from canonical to conventional parametrisation by change of variables:
From Example 2.3 and Example 5.5 the Fisher information matrix for the normal distribution using canonical parameters \(\boldsymbol \eta\) is \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta)= \nabla \nabla^T a(\boldsymbol \eta) = \begin{pmatrix} \frac{1}{1-2\eta_2} & \frac{2 \eta_1}{(1-2 \eta_2)^2} \\ \frac{2 \eta_1}{(1-2 \eta_2)^2} & \frac{4\eta_1^2 -4 \eta_2 +2 }{(1-2 \eta_2)^3} \\ \end{pmatrix} \] The map to the canonical parameter \(\boldsymbol \eta\) from the conventional parameters \(\boldsymbol \theta=(\mu, \sigma^2)^T\) is \(\boldsymbol \eta= (\eta_1, \eta_2)^T = (\frac{\mu}{\sigma^2}, \frac{1}{2} - \frac{1}{2 \sigma^2} )^T\) with Jacobian matrix \[ D\boldsymbol \eta(\boldsymbol \theta)= \begin{pmatrix} \frac{1}{\sigma^2} & -\frac{\mu}{\sigma^4} \\ 0 & \frac{1 }{2\sigma^4} \\ \end{pmatrix} \] With \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \theta)} = \begin{pmatrix} \sigma^2 & 2 \mu \sigma^2 \\ 2 \mu \sigma^2 & 4 \mu^2 \sigma^2 + 2 \sigma^4 \\ \end{pmatrix} \] this yields the Fisher information for \(\boldsymbol \theta=(\mu, \sigma^2)^T\) as \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta) & = (D \boldsymbol \eta(\boldsymbol \theta))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \theta)} \, D \boldsymbol \eta(\boldsymbol \theta)\\ &= \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \end{split} \] which agrees with the result obtained by direct calculation in Example 5.4.
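The matrix product can be verified numerically; the sketch below plugs the matrices above into the transformation rule for the illustrative values \(\mu = 1\) and \(\sigma^2 = 2\).

```python
import numpy as np

mu, s2 = 1.0, 2.0
D_eta = np.array([[1 / s2, -mu / s2**2],
                  [0.0,     1 / (2 * s2**2)]])                 # Jacobian D eta(theta)
I_eta = np.array([[s2,          2 * mu * s2],
                  [2 * mu * s2, 4 * mu**2 * s2 + 2 * s2**2]])  # I_P(eta) evaluated at eta(theta)
print(D_eta.T @ I_eta @ D_eta)                                 # equals diag(1/s2, 1/(2 s2^2))
print(np.array([[1 / s2, 0], [0, 1 / (2 * s2**2)]]))
```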
Example 5.8 \(\color{Red} \blacktriangleright\) Fisher information for the expectation parameter of an exponential family:
An alternative parametrisation of an exponential family is provided by the means \(\boldsymbol \mu_{\boldsymbol t}\) of the canonical statistics \(\boldsymbol t(\boldsymbol x)\). These expectation parameters are given by \(\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)= \nabla a(\boldsymbol \eta)\).
The Jacobian for the transformation \(\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)\) is \(D \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta) = D \nabla a(\boldsymbol \eta) = \nabla \nabla^T a(\boldsymbol \eta) = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta)\). Hence, the Jacobian for the inverse transformation \(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})\) is \(D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}) = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^{-1}\).
The Fisher information transforms covariantly under change of parameter. With \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})} = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})) \] this yields the Fisher information for the expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \mu_{\boldsymbol t}) & = (D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})} \, D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})\\ &= \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^{-1} \end{split} \] Hence the Fisher information for the expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) is the inverse of the variance of the canonical statistics, and therefore the inverse of the Fisher information for the canonical parameters \(\boldsymbol \eta\).
This relationship has been encountered before in Example 5.6 for the special case of the Bernoulli distribution.
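The sketch below illustrates this for the normal family, assuming the canonical statistics are \(\boldsymbol t(x) = (x, x^2)^T\) so that \(\boldsymbol \Sigma_{\boldsymbol t}\) is the matrix displayed in Example 5.7; the parameter values and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, s2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(s2), size=1_000_000)

t = np.stack([x, x**2])                     # canonical statistics t(x) = (x, x^2)
Sigma_t_mc = np.cov(t)                      # Monte Carlo estimate of Sigma_t
Sigma_t = np.array([[s2,          2 * mu * s2],
                    [2 * mu * s2, 4 * mu**2 * s2 + 2 * s2**2]])   # matrix from Example 5.7
print(Sigma_t_mc)                           # approx Sigma_t
print(np.linalg.inv(Sigma_t))               # Fisher information for the expectation parameters
```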
5.3 Further reading
Amari (2016) is a recent book and standard reference on information geometry.
For metrics associated with proper scoring rules see Dawid and Musio (2014).
The Fisher information matrix was originally introduced by Ronald A. Fisher (1890–1962) in 1925 to measure the precision of an estimator.
C. Radhakrishna Rao (1920–2023) showed in 1945 that the Fisher information matrix defines a local metric tensor on the parameter space and established its interpretation as a local sensitivity measure.
This insight later helped lead to the development of information geometry and to the study of singular or non-regular models in statistics.