5 Expected Fisher information
This chapter introduces the expected Fisher information matrix as the local curvature (the Hessian matrix) of the Kullback-Leibler divergence, serving as the local second-order sensitivity matrix for model parameters.
5.1 Expected Fisher information
Local quadratic approximation of KL divergence
The Kullback-Leibler (KL) divergence measures the divergence between two distributions. We now study the KL divergence between two distributions within a parametric family that are separated only by a small perturbation \(\boldsymbol \varepsilon\).
Specifically, we consider \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon)) &= \text{E}_{P(\boldsymbol \theta)}\left( \log p(\boldsymbol x| \boldsymbol \theta) - \log p(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right)\\ & = h(\boldsymbol \theta+\boldsymbol \varepsilon) \\ \end{split} \] where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. Assuming that the pdmf \(p(\boldsymbol x| \boldsymbol \theta)\) is twice differentiable with regard to \(\boldsymbol \theta\) we can approximate the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) quadratically by \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx h(\boldsymbol \theta) + \nabla h(\boldsymbol \theta)^T\boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
From the familiar properties of the KL divergence we conclude
- \(D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\geq 0\) and
- with equality only if \(\boldsymbol \varepsilon=0\).
Thus, by construction the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) assumes a minimum at \(\boldsymbol \varepsilon=0\) with \(h(\boldsymbol \theta)=0\) and a vanishing gradient \(\nabla h(\boldsymbol \theta) = 0\). Therefore, in the quadratic approximation of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) around \(\boldsymbol \theta\) the first two terms (constant and linear) vanish and only the quadratic term remains: \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
Furthermore, the Hessian matrix of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) evaluated at \(\boldsymbol \varepsilon=0\) is given by \[ \begin{split} \nabla \nabla^T h(\boldsymbol \theta) &= -\text{E}_{P(\boldsymbol \theta)} \nabla \nabla^T \log p(\boldsymbol x| \boldsymbol \theta) \\ &= \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \end{split} \]
This matrix is known as the Fisher information at \(\boldsymbol \theta\), denoted by \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). It is also called expected Fisher information to emphasise that it is computed as the mean Hessian of the negative log-pdmf. The Fisher information matrix is always symmetric and positive semidefinite.
With its help the KL divergence can be locally approximated by \[ D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \boldsymbol \varepsilon \]
We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta+\boldsymbol \varepsilon), P(\boldsymbol \theta)) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]
Hence, although the KL divergence is not symmetric in its arguments in general, it is symmetric to second order (locally symmetric).
Parameter identifiability
For a regular model the Fisher information is positive definite (with only positive eigenvalues) and hence the parameters are locally identifiable. Recall that a positive definite Hessian implies that \(h(\boldsymbol \theta+ \boldsymbol \varepsilon)\) has a strict minimum at \(\boldsymbol \varepsilon=0\).
Conversely, for a singular statistical model the Fisher information matrix is singular (some or all of its eigenvalues vanish) at some parameter values. This indicates local non-identifiability arising, e.g., from overparametrisation, latent structure, boundary conditions or other regularity failures.
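As a minimal numerical illustration (an added example, not taken from the text above), consider the deliberately overparametrised model \(x \sim N(\theta_1 + \theta_2, 1)\): only the sum \(\theta_1+\theta_2\) enters the log-density, so the expected Fisher information matrix has a zero eigenvalue and the two parameters are not locally identifiable. A short check, assuming NumPy is available:

```python
import numpy as np

# Overparametrised model: x ~ N(theta1 + theta2, 1).
# log p(x | theta) = -0.5 * (x - theta1 - theta2)**2 + const, so the Hessian
# with respect to (theta1, theta2) is the constant matrix -[[1, 1], [1, 1]].
I_fisher = np.array([[1.0, 1.0],
                     [1.0, 1.0]])   # negative expected Hessian

eigenvalues = np.linalg.eigvalsh(I_fisher)
print(eigenvalues)   # [0., 2.]: one zero eigenvalue -> singular model,
                     # the direction theta1 - theta2 is not identifiable
```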
Additivity of Fisher information
We may wish to compute the expected Fisher information based on a set of independent identically distributed (iid) random variables.
Assume that a random variable \(x \sim P(\boldsymbol \theta)\) has log-pdmf \(\log p(x| \boldsymbol \theta)\) and expected Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). The expected Fisher information \(\boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim P(\boldsymbol \theta)\) is computed from the joint log-pdmf \(\log p(x_1, \ldots, x_n| \boldsymbol \theta) = \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\). This yields \[ \begin{split} \boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta) &= -\text{E}_{P(\boldsymbol \theta)} \nabla \nabla^T \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\\ &= \sum_{i=1}^n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) =n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \\ \end{split} \] Hence, the expected Fisher information for a set of \(n\) iid random variables is \(n\) times the Fisher information of a single variable.
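The additivity can be checked numerically. The sketch below (illustrative only, assuming NumPy) estimates the expected Fisher information of \(n\) iid Bernoulli variables by Monte Carlo and compares it with \(n\) times the single-observation value \(1/(\theta(1-\theta))\) derived in Example 5.1 below:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 10
reps = 200_000

# Monte Carlo estimate of the expected Fisher information for n iid Bernoulli(theta)
# draws: average the negative second derivative of the joint log-pmf,
#   d^2/dtheta^2 log p(x_1,...,x_n | theta) = -sum(x_i)/theta^2 - sum(1 - x_i)/(1 - theta)^2
x = rng.binomial(1, theta, size=(reps, n))
neg_hessian = x.sum(axis=1) / theta**2 + (n - x.sum(axis=1)) / (1 - theta)**2

print(neg_hessian.mean())            # approx n / (theta * (1 - theta))
print(n / (theta * (1 - theta)))     # = n times the single-observation Fisher information
```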
Invariance property of Fisher information
Like the KL divergence, the expected Fisher information is invariant under a change of variables in the sample space, say from \(x\) to \(y\) with corresponding distributions \(F_x\) and \(F_y\). This is easy to see: the KL divergence itself is invariant under such a transformation of the sample space, and therefore so is its curvature, and hence the expected Fisher information.
More specifically, when the sample space is changed the density gains a factor given by the Jacobian determinant of the transformation. However, since this factor does not depend on the model parameters, the first and second derivatives of the log-density with regard to the model parameters are unaffected by it.
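As a concrete illustration (an added example, not part of the original argument), let \(x \sim \text{Exp}(\lambda)\) with \(\log p(x|\lambda) = \log \lambda - \lambda x\) and consider the change of variables \(y = \log x\). The transformed density is \(p(y|\lambda) = \lambda \, e^{y} e^{-\lambda e^{y}}\), so \[ \log p(y|\lambda) = \log \lambda + y - \lambda e^{y} \] where the additive term \(y\) is the log-Jacobian of the transformation and does not involve \(\lambda\). In both versions of the sample space \(\frac{d^2}{d\lambda^2} \log p = -\frac{1}{\lambda^2}\), hence \(I^{\text{Fisher}}(\lambda) = \frac{1}{\lambda^2}\) before and after the change of variables.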
See also Section 7.4 for related sample space invariance of the gradient and curvature of the log-likelihood and Chapter 9 for the sample invariance of observed Fisher information.
\(\color{Red} \blacktriangleright\) Transformation of Fisher information under change of parameters
The Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\) depends on the parameter \(\boldsymbol \theta\). If we use a different parametrisation of the underlying distribution family, say \(\boldsymbol \zeta\) with a map \(\boldsymbol \theta(\boldsymbol \zeta)\) from \(\boldsymbol \zeta\) to \(\boldsymbol \theta\), then the Fisher information changes according to the chain rule in calculus.
To find the resulting Fisher information in terms of the new parameter \(\boldsymbol \zeta\) we need to use the Jacobian matrix \(D \boldsymbol \theta(\boldsymbol \zeta)\). This matrix contains the gradients for each component of the map \(\boldsymbol \theta(\boldsymbol \zeta)\) in its rows: \[ D \boldsymbol \theta(\boldsymbol \zeta) = \begin{pmatrix}\nabla^T \theta_1(\boldsymbol \zeta)\\ \nabla^T \theta_2(\boldsymbol \zeta) \\ \vdots \\ \end{pmatrix} \]
With the above the Fisher information for \(\boldsymbol \theta\) is then transformed to the Fisher information for \(\boldsymbol \zeta\) applying the chain rule for the Hessian matrix: \[ \boldsymbol I^{\text{Fisher}}(\boldsymbol \zeta) = (D \boldsymbol \theta(\boldsymbol \zeta))^T \, \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \, D \boldsymbol \theta(\boldsymbol \zeta) \] This type of transformation is also known as covariant transformation.
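A minimal one-parameter sketch of this transformation (illustrative, assuming NumPy): the Bernoulli model with success probability \(\theta\) reparametrised by the log-odds \(\zeta\), with \(\theta(\zeta) = 1/(1+e^{-\zeta})\). The Jacobian is the scalar \(d\theta/d\zeta = \theta(1-\theta)\), and the covariant transformation yields \(I^{\text{Fisher}}(\zeta) = \theta(\zeta)\,(1-\theta(\zeta))\):

```python
import numpy as np

# Illustrative one-parameter example: Bernoulli model reparametrised by the
# log-odds zeta, with theta(zeta) = 1 / (1 + exp(-zeta)).
zeta = 0.7
theta = 1.0 / (1.0 + np.exp(-zeta))

I_theta = 1.0 / (theta * (1.0 - theta))   # Fisher information in the theta parametrisation
D = theta * (1.0 - theta)                 # Jacobian d theta / d zeta of the map theta(zeta)

I_zeta = D * I_theta * D                  # covariant transformation (scalar case)
print(I_zeta, theta * (1.0 - theta))      # both equal theta * (1 - theta)
```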
In information geometry (Amari 2016) probability distributions are studied using tools from differential geometry. From this geometric perspective, smoothly parametrised distribution families \(P(\boldsymbol \theta)\) are viewed as manifolds. In KL divergence geometry the Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\) serves as metric tensor, measuring local distances between nearby distributions.
Other divergences induce related geometries, with local metrics similarly obtained by quadratic approximation of the divergence (for metrics associated with proper scoring rules see Dawid and Musio (2014)).
5.2 Expected Fisher information examples
Models with a single parameter
Example 5.1 Expected Fisher information for the Bernoulli distribution:
The log-pmf for the Bernoulli distribution \(\text{Ber}(\theta)\) is \[ \log p(x | \theta) = x \log(\theta) + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\text{E}(x) = \theta\) we get as Fisher information \[ \begin{split} I^{\text{Fisher}}(\theta) & = -\text{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]
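This result can be verified by taking the expectation directly as a sum over the two outcomes \(x \in \{0, 1\}\); a minimal sketch in Python (illustrative only):

```python
# Check of Example 5.1: compute -E( d^2/dtheta^2 log p(x | theta) ) for the
# Bernoulli distribution by summing over the two outcomes x in {0, 1}.
theta = 0.3

def neg_second_derivative(x, theta):
    # -d^2/dtheta^2 log p(x | theta) = x / theta^2 + (1 - x) / (1 - theta)^2
    return x / theta**2 + (1 - x) / (1 - theta)**2

I_fisher = sum(neg_second_derivative(x, theta) * (theta if x == 1 else 1 - theta)
               for x in (0, 1))
print(I_fisher, 1 / (theta * (1 - theta)))   # both equal 1 / (theta * (1 - theta))
```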
Example 5.2 Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 4.3 we have as KL divergence \[ D_{\text{KL}}\left (\text{Ber}(\theta_1), \text{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 5.1 the corresponding expected Fisher information.
The quadratic approximation implies that \[ D_{\text{KL}}\left( \text{Ber}(\theta), \text{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \text{Ber}(\theta+\varepsilon), \text{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]
In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
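For a purely numerical illustration (not the worksheet solution), the sketch below compares the exact KL divergence with the quadratic approximation for shrinking \(\varepsilon\), assuming NumPy:

```python
import numpy as np

# Numerical illustration of Example 5.2: exact KL divergence between Ber(theta)
# and Ber(theta + eps) versus the quadratic approximation eps^2 / (2 theta (1 - theta)).
theta = 0.3

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

for eps in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(theta, theta + eps)
    approx = eps**2 / (2 * theta * (1 - theta))
    print(eps, exact, approx)   # the two values agree increasingly well as eps shrinks
```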
Example 5.3 Expected Fisher information for the normal distribution with known variance.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log p(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the expected Fisher information is \[ I^{\text{Fisher}}\left(\mu\right) = \frac{1}{\sigma^2} \]
Models with multiple parameters
Example 5.4 Expected Fisher information for the normal distribution.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log p(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.
The corresponding Hessian matrix is \[ \nabla \nabla^T \log p(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\text{E}(x) = \mu\) we have \(\text{E}(x-\mu) =0\). Furthermore, with \(\text{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\text{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the expected Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol I^{\text{Fisher}}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
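A Monte Carlo sketch (illustrative, assuming NumPy) that averages the entries of the negative Hessian over simulated data and recovers the matrix above:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Monte Carlo average of the negative Hessian of log p(x | mu, sigma^2)
# from Example 5.4, evaluated at the true parameters.
h11 = np.full_like(x, 1.0 / sigma2)                      # -d^2/dmu^2
h12 = (x - mu) / sigma2**2                               # -d^2/(dmu dsigma^2)
h22 = -1.0 / (2 * sigma2**2) + (x - mu)**2 / sigma2**3   # -d^2/(dsigma^2)^2

print(np.mean(h11), np.mean(h12), np.mean(h22))
print(1 / sigma2, 0.0, 1 / (2 * sigma2**2))   # expected Fisher information entries
```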
Example 5.5 \(\color{Red} \blacktriangleright\) Expected Fisher information of the categorical distribution:
The log-pmf function for the categorical distribution \(\text{Cat}(\boldsymbol \pi)\) with \(K\) classes and \(K-1\) free parameters \(\pi_1, \ldots, \pi_{K-1}\) is \[ \begin{split} \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1} ) & =\sum_{k=1}^{K-1} x_k \log \pi_k + x_K \log \pi_K \\ & =\sum_{k=1}^{K-1} x_k \log \pi_k + \left( 1 - \sum_{k=1}^{K-1} x_k \right) \log \left( 1 - \sum_{k=1}^{K-1} \pi_k \right) \\ \end{split} \]
From the log-pmf we compute the Hessian matrix of second order partial derivatives \(\nabla \nabla^T \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1} )\) with regard to \(\pi_1, \ldots, \pi_{K-1}\):
The diagonal entries of the Hessian matrix (with \(i=1, \ldots, K-1\)) are \[ \frac{\partial^2}{\partial \pi_i^2} \log p(\boldsymbol x|\pi_1, \ldots, \pi_{K-1}) = -\frac{x_i}{\pi_i^2}-\frac{x_K}{\pi_K^2} \]
the off-diagonal entries are (with \(j=1, \ldots, K-1\) and \(j \neq i\)) \[ \frac{\partial^2}{\partial \pi_i \partial \pi_j} \log p(\boldsymbol x|\pi_1, \ldots, \pi_{K-1}) = -\frac{ x_K}{\pi_K^2} \]
Recalling that \(\text{E}(x_i) = \pi_i\) we obtain the expected Fisher information matrix for a categorical distribution as the \((K-1) \times (K-1)\) matrix \[ \begin{split} \boldsymbol I^{\text{Fisher}}\left( \pi_1, \ldots, \pi_{K-1} \right) &= -\text{E}\left( \nabla \nabla^T \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1}) \right) \\ & = \begin{pmatrix} \frac{1}{\pi_1} + \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_K} \\ \vdots & \ddots & \vdots \\ \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_{K-1}} + \frac{1}{\pi_K} \\ \end{pmatrix}\\ & = \text{Diag}\left( \frac{1}{\pi_1} , \ldots, \frac{1}{\pi_{K-1}} \right) + \frac{1}{\pi_K} \mathbf 1 \mathbf 1^T\\ \end{split} \] where \(\mathbf 1\) is the vector of ones of length \(K-1\).
For \(K=2\) and \(\pi_1=\theta\) this reduces to the expected Fisher information of the Bernoulli distribution, see Example 5.1. \[ \begin{split} I^{\text{Fisher}}(\theta) & = \left(\frac{1}{\theta} + \frac{1}{1-\theta} \right) \\ &= \frac{1}{\theta (1-\theta)} \\ \end{split} \]
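A short sketch (illustrative, assuming NumPy; the helper fisher_categorical is introduced here only for this example) that constructs the matrix of Example 5.5 from the class probabilities and confirms the reduction to the Bernoulli case for \(K=2\):

```python
import numpy as np

# Build the (K-1) x (K-1) expected Fisher information matrix of Example 5.5
# from the class probabilities pi_1, ..., pi_K.
def fisher_categorical(pi):
    pi = np.asarray(pi, dtype=float)
    pi_free, pi_K = pi[:-1], pi[-1]
    return np.diag(1.0 / pi_free) + np.ones((len(pi_free), len(pi_free))) / pi_K

print(fisher_categorical([0.2, 0.3, 0.5]))
# For K = 2 the matrix reduces to the Bernoulli Fisher information 1 / (theta (1 - theta)):
print(fisher_categorical([0.3, 0.7]), 1 / (0.3 * 0.7))
```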
Example 5.6 \(\color{Red} \blacktriangleright\) Quadratic approximation of KL divergence of the categorical distribution and the Neyman and Pearson divergence:
We now consider the local approximation of the KL divergence \(D_{\text{KL}}(Q, P)\) between the categorical distribution \(Q=\text{Cat}(\boldsymbol q)\) with probabilities \(\boldsymbol q=(q_1, \ldots, q_K)^T\) and the categorical distribution \(P=\text{Cat}(\boldsymbol p)\) with probabilities \(\boldsymbol p= (p_1, \ldots, p_K)^T\).
From Example 4.6 we already know the KL divergence and from Example 5.5 the corresponding expected Fisher information.
First, we keep \(Q\) fixed and assume that \(P\) is a perturbed version of \(Q\) with \(\boldsymbol p= \boldsymbol q+\boldsymbol \varepsilon\). The perturbations \(\boldsymbol \varepsilon=(\varepsilon_1, \ldots, \varepsilon_K)^T\) satisfy \(\sum_{k=1}^K \varepsilon_k = 0\) because \(\sum_{k=1}^K q_k=1\) and \(\sum_{k=1}^K p_k=1\). Thus \(\varepsilon_K = -\sum_{k=1}^{K-1} \varepsilon_k\). Then \[ \begin{split} D_{\text{KL}}(\text{Cat}(\boldsymbol q), \text{Cat}(\boldsymbol q+\boldsymbol \varepsilon)) & \approx \frac{1}{2} (\varepsilon_1, \ldots, \varepsilon_{K-1}) \, \boldsymbol I^{\text{Fisher}}\left( q_1, \ldots, q_{K-1} \right) \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{K-1}\\ \end{pmatrix} \\ &= \frac{1}{2} \left( \sum_{k=1}^{K-1} \frac{\varepsilon_k^2}{q_k} + \frac{ \left(\sum_{k=1}^{K-1} \varepsilon_k\right)^2}{q_K} \right) \\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{\varepsilon_k^2}{q_k}\\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{q_k}\\ & = \frac{1}{2} D_{\text{Neyman}}(Q, P)\\ \end{split} \] Second, keeping \(P\) fixed and with \(Q=\text{Cat}(\boldsymbol p+\boldsymbol \varepsilon)\) a perturbation of \(P\) we get \[ \begin{split} D_{\text{KL}}(\text{Cat}(\boldsymbol p+\boldsymbol \varepsilon), \text{Cat}(\boldsymbol p)) &\approx \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{p_k}\\ &= \frac{1}{2} D_{\text{Pearson}}(Q, P) \end{split} \] Note that in both approximations we divide by the probabilities of the distribution that is kept fixed.
In the above we encounter the Pearson \(\chi^2\) divergence and the Neyman \(\chi^2\) divergence. Both are, like the KL divergence, part of the family of \(f\)-divergences. The Neyman \(\chi^2\) divergence is also known as the reverse Pearson divergence as \(D_{\text{Neyman}}(Q, P) = D_{\text{Pearson}}(P, Q)\).
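As a numerical check of Example 5.6 (a minimal sketch, assuming NumPy), the KL divergence between two nearby categorical distributions is compared with half the Neyman and half the Pearson divergence:

```python
import numpy as np

# For two nearby categorical distributions the KL divergence is approximately
# half the Neyman resp. Pearson chi-squared divergence.
q = np.array([0.20, 0.30, 0.50])
eps = np.array([0.01, -0.02, 0.01])        # perturbation summing to zero
p = q + eps

kl = np.sum(q * np.log(q / p))             # D_KL(Q, P) = D_KL(Cat(q), Cat(p))
neyman = np.sum((q - p)**2 / q)            # D_Neyman(Q, P)
pearson = np.sum((q - p)**2 / p)           # D_Pearson(Q, P)
print(kl, 0.5 * neyman, 0.5 * pearson)     # all three values nearly coincide
```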