5 Local divergence
This chapter introduces the Fisher information matrix as the local curvature (the Hessian matrix) of the Kullback-Leibler divergence, serving as the local second-order sensitivity matrix for model parameters.
5.1 Fisher information
Local quadratic approximation of KL divergence
The Kullback-Leibler (KL) divergence measures the discrepancy between two distributions. We now study the KL divergence of two distributions within a parametric family separated only by some small \(\boldsymbol \varepsilon\).
Specifically, we consider \[ D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon)) = h(\boldsymbol \theta+\boldsymbol \varepsilon) \] where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. Assuming that the pdmf \(p(\boldsymbol x| \boldsymbol \theta)\) is twice differentiable with regard to \(\boldsymbol \theta\) we can approximate the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) quadratically by \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) = h(\boldsymbol \theta) + \nabla h(\boldsymbol \theta)^T\boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon+ \mathcal{O}\!\left(||\boldsymbol \varepsilon||^3 \right) \]
Recalling the properties of the KL divergence it follows that
- \(h(\boldsymbol \theta)=0\) since \(D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta)) =0\) and
- \(\nabla h(\boldsymbol \theta) = 0\) as \(D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\) achieves a minimum at \(\boldsymbol \varepsilon=0\).
Therefore, in the quadratic approximation of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) around \(\boldsymbol \theta\) the first two terms (constant and linear) vanish and only the quadratic term remains: \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
To compute the Hessian matrix \(\nabla \nabla^T h(\boldsymbol \theta)\) we expand \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) = \operatorname{E}_{P(\boldsymbol \theta)}\left( \log p(\boldsymbol x| \boldsymbol \theta) \right) -\operatorname{E}_{P(\boldsymbol \theta)}\left( \log p(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right) \] The first term does not depend on \(\boldsymbol \varepsilon\) and thus vanishes when taking derivatives. Since differentiation (a linear operator) and expectation (a weighted average) can be interchanged here, the Hessian matrix of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) evaluated at \(\boldsymbol \varepsilon=0\) is \[ \begin{split} \nabla \nabla^T h(\boldsymbol \theta) &= -\operatorname{E}_{P(\boldsymbol \theta)} \left( \nabla \nabla^T \log p(\boldsymbol x| \boldsymbol \theta) \right) \\ &= \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta) \end{split} \]
This matrix \(\boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\) is called the Fisher information or expected Fisher information. The index \(P\) serves as a reminder of the underlying model. The Fisher information matrix is always symmetric and positive semidefinite. It becomes a scalar if there is only a single parameter \(\theta\).
With its help the KL divergence can be locally approximated by \[ D_{\text{KL}}(P(\boldsymbol \theta), P(\boldsymbol \theta+\boldsymbol \varepsilon))\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\,\boldsymbol \varepsilon \]
We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \theta+\boldsymbol \varepsilon), P(\boldsymbol \theta)) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]
Hence, although the KL divergence is not symmetric in its arguments in general, it is locally symmetric to second order.
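This local symmetry is easy to check numerically. The following minimal Python sketch (illustrative only; the function name `kl_normal` is ours) compares both directions of the KL divergence between two univariate normal distributions whose mean and variance are both perturbed by \(\varepsilon\), using the standard closed-form expression for the normal KL divergence:

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """Exact KL divergence D_KL(N(mu1, var1), N(mu2, var2))."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2)**2) / var2 - 1.0)

mu, var = 1.0, 2.0
for eps in [0.1, 0.01, 0.001]:
    kl_fwd = kl_normal(mu, var, mu + eps, var + eps)  # perturb second argument
    kl_rev = kl_normal(mu + eps, var + eps, mu, var)  # perturb first argument
    print(eps, kl_fwd, kl_rev, kl_fwd / kl_rev)       # ratio tends to 1
```

As \(\varepsilon\) shrinks the two directions disagree only at third order and their ratio tends to 1.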
In information geometry probability distributions are studied using tools from differential geometry. From this geometric perspective, smoothly parametrised distribution families \(P(\boldsymbol \theta)\) are viewed as manifolds. In the geometry induced by the KL divergence the Fisher information \(\boldsymbol{\mathcal{I}}_P(\boldsymbol \theta)\) serves as the metric tensor, measuring local distances between nearby distributions.
Other types of divergences (such as Bregman and \(f\)-divergences) induce related geometries, with metrics similarly obtained by quadratic approximation.
Parameter identifiability
For a regular model the Fisher information is positive definite (with only positive eigenvalues) and hence parameters are locally identifiable. Recall that a positive definite Hessian implies that \(h(\boldsymbol \theta+ \boldsymbol \varepsilon)\) has a strict minimum at \(\boldsymbol \varepsilon= 0\).
Conversely, for a singular statistical model the Fisher information matrix is singular (some or all of its eigenvalues vanish) at some parameter values. This indicates local nonidentifiability arising, e.g., from overparametrisation, parameters linked by exact constraints, lower dimensional latent structure, parameters on boundaries or other regularity failures.
Additivity of Fisher information
We may wish to compute the Fisher information for a parameter based on a set of independent identically distributed (iid) random variables.
Assume that a random variable \(x \sim P(\boldsymbol \theta)\) has log-pdmf \(\log p(x| \boldsymbol \theta)\) and Fisher information \(\boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta)\). The Fisher information \(\boldsymbol{\mathcal{I}}_{P_{x_1, \ldots, x_n}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim P(\boldsymbol \theta)\) is computed from the joint log-pdmf \(\log p(x_1, \ldots, x_n| \boldsymbol \theta) = \sum_{i=1}^n \log p(x_i| \boldsymbol \theta)\). This yields \[ \begin{split} \boldsymbol{\mathcal{I}}_{P_{x_1, \ldots, x_n}}(\boldsymbol \theta) &= -\operatorname{E}_{P_{x_1, \ldots, x_n}(\boldsymbol \theta)} \left( \nabla \nabla^T \sum_{i=1}^n \log p(x_i| \boldsymbol \theta) \right)\\ &= -\operatorname{E}_{P_{x_1}(\boldsymbol \theta)} \ldots \operatorname{E}_{P_{x_n}(\boldsymbol \theta)} \left( \nabla \nabla^T \sum_{i=1}^n \log p(x_i| \boldsymbol \theta) \right)\\ &= \sum_{i=1}^n \boldsymbol{\mathcal{I}}_{P_{x}}(\boldsymbol \theta) =n \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \\ \end{split} \] Hence, the total Fisher information for a parameter based on a set of \(n\) iid random variables equals \(n\) times the Fisher information of a single variable.
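The additivity property is easy to confirm by simulation. The sketch below (illustrative; the Bernoulli model is chosen as a concrete case, anticipating Example 5.1) estimates the negative expected Hessian of the joint Bernoulli log-likelihood by Monte Carlo and compares it with \(n/(\theta(1-\theta))\):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 5, 200_000

# negative second derivative of the joint Bernoulli log-likelihood,
# summed over the sample of size n
x = rng.binomial(1, theta, size=(reps, n))
neg_hess = (x / theta**2 + (1 - x) / (1 - theta)**2).sum(axis=1)

print(neg_hess.mean())            # Monte Carlo estimate
print(n / (theta * (1 - theta)))  # n times single-observation Fisher information
```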
Invariance under a change of variables
Fisher information is invariant with regard to reparametrisation of the sample space. Specifically, \[ \boldsymbol{\mathcal{I}}_{P_y}(\boldsymbol \theta) = \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \] under a general invertible variable transformation of the random variable from \(x\) to \(y\) with corresponding change of distribution from \(P_x\) to \(P_y\).
This corresponds to the invariance of KL divergence under a change of variables.
When the random variable is changed from \(x\) to \(y\) the density will gain a factor in the form of a Jacobian determinant associated with this transformation. However, since this factor does not depend on the model parameters, it does not affect the first and second derivatives of the log-density with regard to the model parameters.
See also Section 7.4 for the related sample space invariance of the gradient and curvature of the log-likelihood and Chapter 9 for the sample space invariance of the observed Fisher information.
Data-processing inequality
More generally, Fisher information obeys the data-processing inequality. This states that Fisher information cannot increase under a data-processing map from \(x\) to \(y\), so that \[ \boldsymbol{\mathcal{I}}_{P_x}(\boldsymbol \theta) \geq \boldsymbol{\mathcal{I}}_{P_y}(\boldsymbol \theta) \] For a lossless transformation, such as an invertible change of variables, the inequality becomes an identity. Note that for dimension \(d>1\) the above is a matrix inequality of the type \(\boldsymbol A\geq \boldsymbol B\), with matrices \(\boldsymbol A\), \(\boldsymbol B\) and \(\boldsymbol A-\boldsymbol B\) all symmetric and positive semidefinite.
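As a concrete illustration (our own example, not from the main text): let \(x \sim N(\theta, 1)\), so \(\mathcal{I}_{P_x}(\theta) = 1\) by Example 5.3, and let \(y = 1_{\{x > 0\}}\). Then \(y \sim \operatorname{Ber}(\Phi(\theta))\) and, combining Example 5.1 with the chain rule of Section 5.2, \(\mathcal{I}_{P_y}(\theta) = \phi(\theta)^2 / \left(\Phi(\theta)(1-\Phi(\theta))\right)\), which is strictly less than 1:

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-3, 3, 7)
p = norm.cdf(theta)                      # y ~ Ber(p) after thresholding x at 0
info_y = norm.pdf(theta)**2 / (p * (1 - p))

print(info_y)              # Fisher information carried by y, always below 1
print(np.all(info_y < 1))  # x ~ N(theta, 1) carries Fisher information 1
```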
The data-processing inequality for the Fisher information follows from the corresponding data-processing inequality for the KL divergence (Section 4.2). A related matrix inequality is the information inequality providing a lower bound on the variance of an estimator (Section 10.1).
Scalar examples — single parameter models
Example 5.1 Fisher information for the Bernoulli distribution:
The log-pmf for the Bernoulli distribution \(\operatorname{Ber}(\theta)\) is \[ \log p(x | \theta) = x \log \theta + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\operatorname{E}(x) = \theta\) we get as Fisher information \[ \begin{split} \mathcal{I}_{P}(\theta) & = -\operatorname{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \] Hence, the Fisher information for the expectation parameter \(\operatorname{E}(x)=\theta\) equals the inverse of the variance \(\operatorname{Var}(x)=\theta(1-\theta)\). Consequently, high Fisher information corresponds to low variance and concentrated probability mass, indicating an informative distribution. Conversely, low Fisher information corresponds to high variance and dispersed probability mass, hence to a less informative distribution.
More generally, this inverse relationship between the Fisher information for expectation parameters (i.e. the mean of canonical statistics) and the variance (of canonical statistics) applies to all exponential families, see Example 5.8.
Example 5.2 Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 4.5 we have as KL divergence \[ D_{\text{KL}}\left (\operatorname{Ber}(\theta_1), \operatorname{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 5.1 the corresponding Fisher information \[ \mathcal{I}_{P}(\theta) = \frac{1}{\theta(1-\theta)} \]
The quadratic approximation implies that for small \(\varepsilon\) \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta), \operatorname{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} \mathcal{I}_{\operatorname{Ber}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and similarly that \[ D_{\text{KL}}\left( \operatorname{Ber}(\theta+\varepsilon), \operatorname{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} \mathcal{I}_{\operatorname{Ber}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]
In Worksheet E1 this is verified by computing the second-order Taylor series of the Bernoulli KL divergence.
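Numerically the agreement is already close for moderately small \(\varepsilon\), as the following sketch (complementary to the worksheet; the function name `kl_bernoulli` is ours) shows:

```python
import numpy as np

def kl_bernoulli(t1, t2):
    """Exact KL divergence D_KL(Ber(t1), Ber(t2)) from Example 4.5."""
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

theta, eps = 0.3, 0.01
print(kl_bernoulli(theta, theta + eps))       # vary second argument
print(kl_bernoulli(theta + eps, theta))       # vary first argument
print(eps**2 / (2 * theta * (1 - theta)))     # shared quadratic approximation
```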
Example 5.3 Fisher information for the normal distribution with known variance.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log p(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the Fisher information is \[ \mathcal{I}_{P}(\mu) = \frac{1}{\sigma^2} \]
As in Example 5.1 the Fisher information for the mean \(\operatorname{E}(x)=\mu\) equals the inverse of the variance \(\operatorname{Var}(x)=\sigma^2\).
Matrix examples — multiple parameter models
Example 5.4 Fisher information for the normal distribution.
The log-pdf for \(N(\mu, \sigma^2)\) is \[ \log p(x | \mu, \sigma^2) = -\frac{1}{2} \log \sigma^2 -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log p(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\), take the partial derivative with regard to \(v\), and then substitute back.
The corresponding Hessian matrix is \[ \nabla \nabla^T \log p(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\operatorname{E}(x) = \mu\) we have \(\operatorname{E}(x-\mu) =0\). Furthermore, with \(\operatorname{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\operatorname{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol{\mathcal{I}}_{P}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
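The result can be double-checked by Monte Carlo: averaging the entries of the negative Hessian over simulated draws recovers the Fisher information matrix. A minimal sketch (illustrative, with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, var = 1.0, 2.0
x = rng.normal(mu, np.sqrt(var), size=1_000_000)

# average the entries of minus the Hessian of the log-pdf over the sample;
# the (1,1) entry is constant in x, the other entries are not
i11 = 1 / var
i12 = np.mean((x - mu) / var**2)
i22 = np.mean((x - mu)**2 / var**3 - 1 / (2 * var**2))

print(np.array([[i11, i12], [i12, i22]]))               # Monte Carlo estimate
print(np.array([[1 / var, 0], [0, 1 / (2 * var**2)]]))  # exact result
```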
Example 5.5 \(\color{Red} \blacktriangleright\) Fisher information for the canonical parameter of an exponential family:
Review Section 2.4 before studying this example (and all other examples in this chapter concerning exponential families).
Assume \(P(\boldsymbol \eta)\) is an exponential family with canonical parameter vector \(\boldsymbol \eta\), canonical statistics \(\boldsymbol t(x)\) and log-partition function \(a(\boldsymbol \eta)\) with log-pdmf \(\log p(x|\boldsymbol \eta) = \langle \boldsymbol \eta, \boldsymbol t(x)\rangle + \log h(x) - a(\boldsymbol \eta)\).
If we take second derivatives with regard to \(\boldsymbol \eta\) all terms except for the last vanish: \[ \nabla \nabla^T \log p(x | \boldsymbol \eta) = - \nabla \nabla^T a(\boldsymbol \eta) = -\boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta) \] Then the Fisher information for \(\boldsymbol \eta\) is \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) &= -\operatorname{E}_{P(\boldsymbol \eta)} \nabla \nabla^T \log p(x | \boldsymbol \eta)\\ & = \operatorname{E}_{P(\boldsymbol \eta)} \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta)\\ &= \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta)\\ &= \operatorname{Var}(\boldsymbol t(x)) \end{split} \] Hence, the Fisher information for the canonical parameter \(\boldsymbol \eta\) in an exponential family equals the variance of the canonical statistics.
Note that here there is no inverse relationship between the Fisher information and the variance. This applies only to expectation parameters, see Example 5.8 for details.
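A concrete instance (our example, not part of the text): for the Poisson family \(\operatorname{Pois}(\lambda)\) with canonical parameter \(\eta = \log \lambda\), canonical statistic \(t(x) = x\) and log-partition function \(a(\eta) = e^{\eta}\), the second derivative of the log-pmf with regard to \(\eta\) is the constant \(-e^{\eta}\), so \(\mathcal{I}_P(\eta) = e^{\eta} = \lambda = \operatorname{Var}(x)\). A symbolic check:

```python
import sympy as sp

x, eta = sp.symbols('x eta', real=True)
# Poisson log-pmf in canonical form, with eta = log(lambda)
log_p = eta * x - sp.exp(eta) - sp.log(sp.factorial(x))

neg_hess = -sp.diff(log_p, eta, 2)
print(neg_hess)  # exp(eta) = lambda, i.e. Var(x); no expectation needed here
```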
5.2 \(\color{Red} \blacktriangleright\) Reparametrisation
\(\color{Red} \blacktriangleright\) Fisher information under reparametrisation
The Fisher information matrix \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta)\) depends on the specific parametrisation of the underlying distribution family. If we use a different parametrisation, say \(P(\boldsymbol \zeta)\) instead of \(P(\boldsymbol \theta)\), then the Fisher information changes accordingly. Given a map between the two sets of parameters \(\boldsymbol \zeta\) and \(\boldsymbol \theta\) we can transform the Fisher information from one parametrisation to the other according to the chain rule of calculus for the Hessian matrix.
Firstly, the map \[\boldsymbol \theta(\boldsymbol \zeta)\] from \(\boldsymbol \zeta\) to \(\boldsymbol \theta\) allows us to express the Fisher information matrix for \(\boldsymbol \theta\) in terms of \(\boldsymbol \zeta\) as \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \] However, note that this is not the Fisher information \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \zeta)\) for \(\boldsymbol \zeta\).
Secondly, we need to compute the Jacobian matrix \(D \boldsymbol \theta(\boldsymbol \zeta)\) containing the gradients for each component of \(\boldsymbol \theta(\boldsymbol \zeta)\) in its rows: \[ D \boldsymbol \theta(\boldsymbol \zeta) = \begin{pmatrix}\nabla^T \theta_1(\boldsymbol \zeta)\\ \nabla^T \theta_2(\boldsymbol \zeta) \\ \vdots \\ \end{pmatrix} \]
Finally, to find the Fisher information matrix \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \zeta)\) in terms of the new parameter \(\boldsymbol \zeta\) the Fisher information matrix \(\boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta)\) for \(\boldsymbol \theta\) expressed using \(\boldsymbol \zeta\) is multiplied on both sides with the Jacobian matrix, yielding \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \zeta) = (D \boldsymbol \theta(\boldsymbol \zeta))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \, D \boldsymbol \theta(\boldsymbol \zeta) \] This type of transformation is also known as covariant transformation.
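In code the covariant transformation is a single congruence product. The sketch below (illustrative; the helper names are ours, and the Jacobian is approximated by central differences when no analytic form is at hand) implements it generically:

```python
import numpy as np

def fisher_reparam(jacobian, fisher_theta):
    """Covariant transformation: I(zeta) = J^T I(theta(zeta)) J."""
    return jacobian.T @ fisher_theta @ jacobian

def numerical_jacobian(theta_of_zeta, zeta, h=1e-6):
    """Central-difference Jacobian of the map zeta -> theta(zeta)."""
    zeta = np.asarray(zeta, dtype=float)
    cols = []
    for j in range(zeta.size):
        dz = np.zeros_like(zeta)
        dz[j] = h
        cols.append((theta_of_zeta(zeta + dz) - theta_of_zeta(zeta - dz)) / (2 * h))
    return np.column_stack(cols)
```

Example 5.7 below provides a concrete check of this congruence product.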
Examples
Example 5.6 \(\color{Red} \blacktriangleright\) Transformation of Fisher information for the Bernoulli distribution:
From Example 2.2 and Example 5.5 the Fisher information for the canonical parameter \(\eta\) in the Bernoulli distribution is \[ \mathcal{I}_{P}(\eta)= \sigma^2_t = \frac{ e^{\eta}}{(e^{\eta}+1)^2} \] The map to the canonical parameter \(\eta\) from the conventional parameter \(\theta\) is the logit function \[ \eta(\theta) = \log\left( \frac{\theta}{1-\theta}\right) \] Thus the Fisher information for \(\eta\) expressed in terms of \(\theta\) is \[ \mathcal{I}_{P}(\eta) \rvert_{\eta = \eta(\theta)} =\theta (1-\theta) \] With the Jacobian \[ D \eta(\theta) = \eta'(\theta)= \frac{1}{\theta (1-\theta)} \] we then get the Fisher information for \(\theta\) as \[ \begin{split} \mathcal{I}_{P}(\theta) & = (D\eta(\theta))^2 \, \mathcal{I}_{P}(\eta) \rvert_{\eta = \eta(\theta)} \\ & = \frac{1}{\theta (1-\theta)}\\ \end{split} \] which agrees with the result obtained by direct calculation in Example 5.1.
Example 5.7 \(\color{Red} \blacktriangleright\) Transformation of Fisher information for the normal distribution:
From Example 2.3 and Example 5.5 the Fisher information for the canonical parameters \(\boldsymbol \eta\) in the normal distribution is \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta)= \boldsymbol \Sigma_{\boldsymbol t} = \begin{pmatrix} \frac{1}{1-2\eta_2} & \frac{2 \eta_1}{(1-2 \eta_2)^2} \\ \frac{2 \eta_1}{(1-2 \eta_2)^2} & \frac{4\eta_1^2 -4 \eta_2 +2 }{(1-2 \eta_2)^3} \\ \end{pmatrix} \] The map to the canonical parameter \(\boldsymbol \eta\) from the conventional parameters \(\boldsymbol \theta=(\mu, \sigma^2)^T\) is \[ \boldsymbol \eta= (\eta_1, \eta_2)^T = (\frac{\mu}{\sigma^2}, \frac{1}{2} - \frac{1}{2 \sigma^2} )^T \] Thus the Fisher information for \(\boldsymbol \eta\) expressed in terms of \(\boldsymbol \theta=(\mu, \sigma^2)^T\) is \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \theta)} = \begin{pmatrix} \sigma^2 & 2 \mu \sigma^2 \\ 2 \mu \sigma^2 & 4 \mu^2 \sigma^2 + 2 \sigma^4 \\ \end{pmatrix} \] With the Jacobian matrix \[ D\boldsymbol \eta(\boldsymbol \theta)= \begin{pmatrix} \frac{1}{\sigma^2} & -\frac{\mu}{\sigma^4} \\ 0 & \frac{1 }{2\sigma^4} \\ \end{pmatrix} \] we then get the Fisher information for \(\boldsymbol \theta=(\mu, \sigma^2)^T\) as \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \theta) & = (D \boldsymbol \eta(\boldsymbol \theta))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \theta)} \, D \boldsymbol \eta(\boldsymbol \theta)\\ &= \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \end{split} \] which agrees with the result obtained by direct calculation in Example 5.4.
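The matrix algebra of this example is easily verified numerically, for instance with the generic helper sketched in Section 5.2 or directly as below (illustrative values for \(\mu\) and \(\sigma^2\)):

```python
import numpy as np

mu, var = 1.5, 2.0  # conventional parameters (mu, sigma^2)

# Fisher information for eta, expressed via (mu, sigma^2), and the Jacobian
fisher_eta = np.array([[var, 2 * mu * var],
                       [2 * mu * var, 4 * mu**2 * var + 2 * var**2]])
jacobian = np.array([[1 / var, -mu / var**2],
                     [0.0, 1 / (2 * var**2)]])

print(jacobian.T @ fisher_eta @ jacobian)               # covariant transform
print(np.array([[1 / var, 0], [0, 1 / (2 * var**2)]]))  # Example 5.4 result
```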
Example 5.8 \(\color{Red} \blacktriangleright\) Fisher information for the expectation parameter of an exponential family:
From Example 5.5 the Fisher information matrix for the canonical parameters \(\boldsymbol \eta\) of an exponential family with canonical statistics \(\boldsymbol t(x)\) is \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta) = \operatorname{Var}(\boldsymbol t(x)) \] with the variance \(\boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta) = \nabla \nabla^T a(\boldsymbol \eta)\) obtained by computing the Hessian matrix of the log-partition function \(a(\boldsymbol \eta)\) (cf. Section 2.4).
An alternative parametrisation of an exponential family is given by the expectation \(\boldsymbol \mu_{\boldsymbol t} = \operatorname{E}(\boldsymbol t(x ))\). The parameters \(\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)= \nabla a(\boldsymbol \eta)\) are obtained via the gradient of the log-partition function \(a(\boldsymbol \eta)\). For an exponential family with minimal representation the relationship between expectation and canonical parameters is one-to-one, so that the inverse map \[ \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}) \] from \(\boldsymbol \mu_{\boldsymbol t}\) to \(\boldsymbol \eta\) exists and is unique.
Thus, the Fisher information for \(\boldsymbol \eta\) expressed in terms of \(\boldsymbol \mu_{\boldsymbol t}\) is \[ \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})} = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})) \]
The Jacobian for the transformation \(\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)\) is \(D \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta) = \nabla \nabla^T a(\boldsymbol \eta) = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta)\). Hence, the Jacobian for the inverse transformation \(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})\) is \[ D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}) = \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^{-1} \] Note that for an exponential family with minimal representation the variance is positive definite and thus invertible.
This yields the Fisher information for the expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) as \[ \begin{split} \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \mu_{\boldsymbol t}) & = (D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^T \, \boldsymbol{\mathcal{I}}_{P}(\boldsymbol \eta) \rvert_{\boldsymbol \eta= \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})} \, D \boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t})\\ &= \boldsymbol \Sigma_{\boldsymbol t}(\boldsymbol \eta(\boldsymbol \mu_{\boldsymbol t}))^{-1} \end{split} \] Hence the Fisher information for the expectation parameter \(\boldsymbol \mu_{\boldsymbol t}\) is the inverse of the variance of the canonical statistics.
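For the Bernoulli family this recovers Example 5.1: the expectation parameter is \(\mu_{t} = \theta\) and \(\operatorname{Var}(t(x)) = \theta(1-\theta)\), so \(\mathcal{I}_P(\mu_{t}) = 1/(\theta(1-\theta))\). A symbolic sketch (our own check, using the logistic link of Example 5.6):

```python
import sympy as sp

eta = sp.symbols('eta', real=True)
theta = sp.symbols('theta', positive=True)
a = sp.log(1 + sp.exp(eta))   # Bernoulli log-partition function

mu_t = sp.diff(a, eta)        # expectation parameter: E(x)
var_t = sp.diff(a, eta, 2)    # variance of the canonical statistic

logit = sp.log(theta / (1 - theta))
print(sp.simplify(mu_t.subs(eta, logit)))         # theta
print(sp.simplify((1 / var_t).subs(eta, logit)))  # equals 1/(theta*(1 - theta))
```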
5.3 Further reading
Amari (2016) is a recent book and standard reference on information geometry.
For metrics associated with proper scoring rules see Dawid and Musio (2014).
Fisher information was originally introduced by Ronald A. Fisher (1890–1962) in Fisher (1925) under the term intrinsic accuracy.
C. Radhakrishna Rao (1920–2023) showed in 1945 that the Fisher information matrix defines a local metric tensor on the parameter space and established its interpretation as a local sensitivity measure.
This insight later helped lead to the development of information geometry and to the study of singular or nonregular models in statistics.