5 Expected Fisher information
5.1 Expected Fisher information
Definition of expected Fisher information
KL information measures the divergence of two distributions. Previously we have seen examples of KL divergence between two distributions belonging to the same family. We now consider the KL divergence of two such distributions separated in parameter space only by some small \(\boldsymbol \varepsilon\).
Specifically, we consider the function \[ \begin{split} h(\boldsymbol \theta+\boldsymbol \varepsilon) & = D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon}) \\ &= \text{E}_{F_{\boldsymbol \theta}}\left( \log f(\boldsymbol x| \boldsymbol \theta) - \log f(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right)\\ \end{split} \] where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. Assuming that \(f(\boldsymbol x| \boldsymbol \theta)\) is twice differentiable with regard to \(\boldsymbol \theta\) we can approximate \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) quadratically by \[ h(\boldsymbol \theta+\boldsymbol \varepsilon) \approx h(\boldsymbol \theta) + \nabla h(\boldsymbol \theta)^T\boldsymbol \varepsilon+ \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(\boldsymbol \theta) \,\boldsymbol \varepsilon \]
From the properties of the KL divergence we know that \(D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon})\geq 0\) and that it equals zero only for \(\boldsymbol \varepsilon=0\). Thus, by construction, at \(\boldsymbol \varepsilon=0\) the function \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) has
- a true minimum with \(h(\boldsymbol \theta)=0\),
- a vanishing gradient with \(\nabla h(\boldsymbol \theta) = 0\), and
- a positive definite Hessian matrix with \(\nabla \nabla^T h(\boldsymbol \theta) = -\text{E}_{F_{\boldsymbol \theta}} \nabla \nabla^T \log f(\boldsymbol x| \boldsymbol \theta)\).
Therefore, in the quadratic approximation of \(h(\boldsymbol \theta+\boldsymbol \varepsilon)\) around \(\boldsymbol \theta\) above, the first two terms (constant and linear) vanish and only the quadratic term remains. The Hessian matrix evaluated at \(\boldsymbol \theta\) \[ \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) = -\text{E}_{F_{\boldsymbol \theta}} \nabla \nabla^T \log f(\boldsymbol x| \boldsymbol \theta) \] is called the expected Fisher information for \(\boldsymbol \theta\), or Fisher information for short. Hence, the KL divergence can be locally approximated by \[ D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon})\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \boldsymbol \varepsilon \]
We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(F_{\boldsymbol \theta+\boldsymbol \varepsilon}, F_{\boldsymbol \theta}) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]
Hence, the KL divergence, while generally not symmetric in its arguments, is still locally symmetric.
Computing the expected Fisher information involves no observed data; it is purely a property of the model family \(F_{\boldsymbol \theta}\). In Chapter 9 we will study a related quantity, the observed Fisher information, which in contrast to the expected Fisher information is a function of the observed data.
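Since the expectation is taken over the model distribution itself, the Fisher information can be approximated purely by simulation from the model. The following is a minimal numerical sketch for the Bernoulli model (treated analytically in Example 5.2 below), using a Monte Carlo average and a central finite difference for the second derivative; the value of \(\theta\), the sample size, the step size and the seed are arbitrary choices.

```python
# Minimal sketch: estimate I_Fisher(theta) = -E[ d^2/dtheta^2 log f(x|theta) ]
# for the Bernoulli model by Monte Carlo, using a central finite difference
# for the second derivative, and compare with the analytic 1/(theta*(1-theta)).
import numpy as np

rng = np.random.default_rng(0)

def log_pmf(x, theta):
    # Bernoulli log-probability mass function
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

def d2_log_pmf(x, theta, h=1e-4):
    # central finite-difference approximation of the second derivative
    return (log_pmf(x, theta + h) - 2 * log_pmf(x, theta) + log_pmf(x, theta - h)) / h**2

theta = 0.3
x = rng.binomial(1, theta, size=200_000)     # draws from F_theta
print(-np.mean(d2_log_pmf(x, theta)))        # Monte Carlo estimate, approx. 4.76
print(1 / (theta * (1 - theta)))             # analytic Fisher information, 4.76...
```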
Example 5.1 \(\color{Red} \blacktriangleright\) Fisher information as metric tensor:
In the field of information geometry (see Nielsen 2020 for a recent review) sets of distributions are studied using tools from differential geometry. It turns out that distribution families are manifolds and that the expected Fisher information matrix plays the role of the (symmetric!) metric tensor on this manifold.
Additivity of Fisher information
We may wish to compute the expected Fisher information based on a set of independent identically distributed (iid) random variables.
Assume that a random variable \(x \sim F_{\boldsymbol \theta}\) has log-density \(\log f(x| \boldsymbol \theta)\) and expected Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). The expected Fisher information \(\boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim F_{\boldsymbol \theta}\) is computed from the joint log-density \(\log f(x_1, \ldots, x_n| \boldsymbol \theta) = \sum_{i=1}^n \log f(x_i| \boldsymbol \theta)\). This yields \[ \begin{split} \boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta) &= -\text{E}_{F_{\boldsymbol \theta}} \nabla \nabla^T \sum_{i=1}^n \log f(x_i| \boldsymbol \theta)\\ &= \sum_{i=1}^n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) =n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \\ \end{split} \] Hence, the expected Fisher information for a set of \(n\) iid random variables is \(n\) times the Fisher information of a single variable.
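As a short worked illustration (anticipating Example 5.2 below), for \(x_1, \ldots, x_n \sim \text{Ber}(\theta)\) we obtain \[ I_{x_1, \ldots, x_n}^{\text{Fisher}}(\theta) = -\text{E}\left( \sum_{i=1}^n \frac{d^2}{d\theta^2} \log p(x_i| \theta) \right) = \sum_{i=1}^n \frac{1}{\theta(1-\theta)} = \frac{n}{\theta(1-\theta)} \]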
Invariance property of the Fisher information
Like KL divergence the expected Fisher information is invariant against change of parametrisation of the sample space, say from variable \(x\) to \(y\) and from distribution \(F_x\) to \(F_y\). This is easy to see as the KL divergence itself is invariant against such reparametrisation, and thus also its curvature, and hence the expected Fisher information.
More specifically, when the sample space is changed the density gains a factor in the form of the Jacobian determinant of the transformation. Since this factor does not depend on the model parameters it does not change the first and second derivatives of the log-density with regard to the model parameters.
See also Section 7.4 for the related sample space invariance of the gradient and curvature of the log-likelihood and Chapter 9 for the sample space invariance of the observed Fisher information.
\(\color{Red} \blacktriangleright\) Transformation of Fisher information when model parameters change
The Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\) depends on the parameter \(\boldsymbol \theta\). If we use a different parameterisation of the underlying parametric distribution family, say \(\boldsymbol \zeta\) with a map \(\boldsymbol \theta(\boldsymbol \zeta)\) from \(\boldsymbol \zeta\) to \(\boldsymbol \theta\), then the Fisher information changes according to the chain rule in calculus.
To find the resulting Fisher information in terms of the new parameter \(\boldsymbol \zeta\) we need to use the Jacobian matrix \(D \boldsymbol \theta(\boldsymbol \zeta)\). This matrix contains the gradients for each component of the map \(\boldsymbol \theta(\boldsymbol \zeta)\) in its rows: \[ D \boldsymbol \theta(\boldsymbol \zeta) = \begin{pmatrix}\nabla^T \theta_1(\boldsymbol \zeta)\\ \nabla^T \theta_2(\boldsymbol \zeta) \\ \vdots \\ \end{pmatrix} \]
With the above the Fisher information for \(\boldsymbol \theta\) is transformed into the Fisher information for \(\boldsymbol \zeta\) by applying the chain rule for the Hessian matrix: \[ \boldsymbol I^{\text{Fisher}}(\boldsymbol \zeta) = (D \boldsymbol \theta(\boldsymbol \zeta))^T \, \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \rvert_{\boldsymbol \theta= \boldsymbol \theta(\boldsymbol \zeta)} \, D \boldsymbol \theta(\boldsymbol \zeta) \] Note that the additional term in the exact chain rule for the Hessian, involving the second derivatives of the map \(\boldsymbol \theta(\boldsymbol \zeta)\), vanishes after taking the expectation because the expected gradient of the log-density is zero. This type of transformation is also known as covariant transformation, in this case for the Fisher information metric tensor.
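As an illustration, the following minimal sketch checks the transformation rule in the simplest one-dimensional case, a Bernoulli model reparametrised by the log-odds \(\zeta\) with \(\theta(\zeta) = 1/(1+e^{-\zeta})\). The Jacobian matrix then reduces to the scalar derivative \(d\theta/d\zeta\), computed here by a finite difference, and the transformed Fisher information equals \(\theta(1-\theta)\); the chosen value of \(\zeta\) is arbitrary.

```python
# Minimal sketch of the covariant transformation rule for a single parameter:
# Bernoulli model reparametrised by the log-odds zeta, theta(zeta) = 1/(1+exp(-zeta)).
# In one dimension the rule reads I(zeta) = (dtheta/dzeta)^2 * I(theta),
# which here simplifies to theta*(1-theta).
import numpy as np

def fisher_theta(theta):
    # Fisher information of Ber(theta) in the theta parametrisation (Example 5.2)
    return 1 / (theta * (1 - theta))

def theta_of_zeta(zeta):
    # map from the new parameter zeta (log-odds) to theta
    return 1 / (1 + np.exp(-zeta))

zeta = 0.7
h = 1e-6
# numerical Jacobian (here a scalar derivative dtheta/dzeta)
dtheta_dzeta = (theta_of_zeta(zeta + h) - theta_of_zeta(zeta - h)) / (2 * h)

theta = theta_of_zeta(zeta)
fisher_zeta = dtheta_dzeta**2 * fisher_theta(theta)   # covariant transformation
print(fisher_zeta, theta * (1 - theta))               # both approx. 0.2217
```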
5.2 Expected Fisher information examples
Models with a single parameter
Example 5.2 Expected Fisher information for the Bernoulli distribution:
The log-probability mass function of the Bernoulli \(\text{Ber}(\theta)\) distribution is \[ \log p(x | \theta) = x \log(\theta) + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\text{E}(x) = \theta\) we get as Fisher information \[ \begin{split} I^{\text{Fisher}}(\theta) & = -\text{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]
Example 5.3 Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 4.4 we have as KL divergence \[ D_{\text{KL}}\left (\text{Ber}(\theta_1), \text{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 5.2 the corresponding expected Fisher information.
The quadratic approximation implies that \[ D_{\text{KL}}\left( \text{Ber}(\theta), \text{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \text{Ber}(\theta+\varepsilon), \text{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]
In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
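A quick numerical sketch of this check (with arbitrarily chosen \(\theta\) and \(\varepsilon\)) compares the exact KL divergence from Example 4.4 with the quadratic approximation in both argument orders:

```python
# Minimal sketch: compare the exact KL divergence between two nearby Bernoulli
# distributions with the quadratic approximation eps^2 / (2*theta*(1-theta)).
import numpy as np

def kl_bernoulli(a, b):
    # D_KL(Ber(a), Ber(b)), see Example 4.4
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, eps = 0.3, 0.01
approx = eps**2 / (2 * theta * (1 - theta))

print(kl_bernoulli(theta, theta + eps), approx)   # approx. 2.35e-4 vs 2.38e-4
print(kl_bernoulli(theta + eps, theta), approx)   # approx. 2.37e-4 vs 2.38e-4
```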
Example 5.4 Expected Fisher information for the normal distribution \(N(\mu, \sigma^2)\) with known variance.
The log-density is \[ \log f(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log f(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the expected Fisher information is \[ \boldsymbol I^{\text{Fisher}}\left(\mu\right) = \frac{1}{\sigma^2} \]
Models with multiple parameters
Example 5.5 Expected Fisher information for the normal distribution \(N(\mu, \sigma^2)\).
The log-density is \[ \log f(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log f(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.
The corresponding Hessian matrix is \[ \nabla \nabla^T \log f(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\text{E}(x) = \mu\) we have \(\text{E}(x-\mu) =0\). Furthermore, with \(\text{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\text{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the expected Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol I^{\text{Fisher}}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
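This result can be checked by Monte Carlo. The following minimal sketch (with arbitrarily chosen parameter values, sample size and seed) averages the negative of the per-observation Hessian above over samples from the model and compares the result with the exact matrix:

```python
# Minimal sketch: Monte Carlo estimate of the expected Fisher information of
# N(mu, sigma^2) as the negative expected Hessian, compared with the exact
# matrix diag(1/sigma^2, 1/(2*sigma^4)).
import numpy as np

rng = np.random.default_rng(1)
mu, s2 = 1.0, 2.0                                  # sigma^2 = 2
x = rng.normal(mu, np.sqrt(s2), size=200_000)

# entries of the Hessian of log f(x | mu, sigma^2) from the display above
h11 = np.full_like(x, -1 / s2)
h12 = -(x - mu) / s2**2
h22 = 1 / (2 * s2**2) - (x - mu)**2 / s2**3

fisher_mc = -np.array([[h11.mean(), h12.mean()],
                       [h12.mean(), h22.mean()]])
print(np.round(fisher_mc, 3))                      # approx. [[0.5, 0], [0, 0.125]]
print(np.diag([1 / s2, 1 / (2 * s2**2)]))          # exact expected Fisher information
```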
Example 5.6 \(\color{Red} \blacktriangleright\) Expected Fisher information of the categorical distribution:
The log-probability mass function for the categorical distribution with \(K\) classes and \(K-1\) free parameters \(\pi_1, \ldots, \pi_{K-1}\) is \[ \begin{split} \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1} ) & =\sum_{k=1}^{K-1} x_k \log \pi_k + x_K \log \pi_K \\ & =\sum_{k=1}^{K-1} x_k \log \pi_k + \left( 1 - \sum_{k=1}^{K-1} x_k \right) \log \left( 1 - \sum_{k=1}^{K-1} \pi_k \right) \\ \end{split} \]
From the log-probability mass function we compute the Hessian matrix of second order partial derivatives \(\nabla \nabla^T \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1} )\) with regard to \(\pi_1, \ldots, \pi_{K-1}\):
The diagonal entries of the Hessian matrix (with \(i=1, \ldots, K-1\)) are \[ \frac{\partial^2}{\partial \pi_i^2} \log p(\boldsymbol x|\pi_1, \ldots, \pi_{K-1}) = -\frac{x_i}{\pi_i^2}-\frac{x_K}{\pi_K^2} \]
the off-diagonal entries are (with \(j=1, \ldots, K-1\) and \(j \neq i\)) \[ \frac{\partial^2}{\partial \pi_i \partial \pi_j} \log p(\boldsymbol x|\pi_1, \ldots, \pi_{K-1}) = -\frac{ x_K}{\pi_K^2} \]
Recalling that \(\text{E}(x_i) = \pi_i\) we obtain the expected Fisher information matrix for a categorical distribution as a \((K-1) \times (K-1)\) dimensional matrix \[ \begin{split} \boldsymbol I^{\text{Fisher}}\left( \pi_1, \ldots, \pi_{K-1} \right) &= -\text{E}\left( \nabla \nabla^T \log p(\boldsymbol x| \pi_1, \ldots, \pi_{K-1}) \right) \\ & = \begin{pmatrix} \frac{1}{\pi_1} + \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_K} \\ \vdots & \ddots & \vdots \\ \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_{K-1}} + \frac{1}{\pi_K} \\ \end{pmatrix}\\ & = \text{Diag}\left( \frac{1}{\pi_1} , \ldots, \frac{1}{\pi_{K-1}} \right) + \frac{1}{\pi_K} \mathbf 1 \mathbf 1^T\\ \end{split} \] where \(\mathbf 1\) is the vector of ones of length \(K-1\).
For \(K=2\) and \(\pi_1=\theta\) this reduces to the expected Fisher information of a Bernoulli variable, see Example 5.2. \[ \begin{split} I^{\text{Fisher}}(\theta) & = \left(\frac{1}{\theta} + \frac{1}{1-\theta} \right) \\ &= \frac{1}{\theta (1-\theta)} \\ \end{split} \]
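The structure of the expected Fisher information matrix of the categorical distribution is easy to verify numerically. The following minimal sketch (for \(K=3\), with arbitrarily chosen class probabilities, sample size and seed) builds the closed-form matrix and compares it with a Monte Carlo average of the negative Hessian:

```python
# Minimal sketch: closed-form expected Fisher information of a categorical
# distribution, Diag(1/pi_1,...,1/pi_{K-1}) + (1/pi_K) * ones matrix, checked
# against a Monte Carlo average of the negative Hessian given above.
import numpy as np

rng = np.random.default_rng(2)
pi = np.array([0.2, 0.3, 0.5])                      # K = 3 class probabilities
K = len(pi)

# closed form for the K-1 free parameters pi_1, ..., pi_{K-1}
fisher = np.diag(1 / pi[:-1]) + np.ones((K - 1, K - 1)) / pi[-1]

# Monte Carlo: x is a one-hot vector with E(x_i) = pi_i, so average the Hessian
# -Diag(x_i/pi_i^2) - (x_K/pi_K^2) * ones over samples x ~ Cat(pi)
x = rng.multinomial(1, pi, size=200_000)
hess_mean = -(np.diag(x[:, :-1].mean(axis=0) / pi[:-1]**2)
              + x[:, -1].mean() / pi[-1]**2 * np.ones((K - 1, K - 1)))

print(np.round(-hess_mean, 2))                      # approx. equal to `fisher`
print(fisher)                                       # [[7., 2.], [2., 5.33...]]
```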
Example 5.7 \(\color{Red} \blacktriangleright\) Quadratic approximation of KL divergence of the categorical distribution and the Neyman and Pearson divergence:
We now consider the local approximation of the KL divergence \(D_{\text{KL}}(Q, P)\) between the categorical distribution \(Q=\text{Cat}(\boldsymbol q)\) with probabilities \(\boldsymbol q=(q_1, \ldots, q_K)^T\) and the categorical distribution \(P=\text{Cat}(\boldsymbol p)\) with probabilities \(\boldsymbol p= (p_1, \ldots, p_K)^T\).
From Example 4.6 we already know the KL divergence and from Example 5.6 the corresponding expected Fisher information.
First, we keep the first argument \(Q\) fixed and assume that \(P\) is a perturbed version of \(Q\) with \(\boldsymbol p= \boldsymbol q+\boldsymbol \varepsilon\). Note that the perturbations \(\boldsymbol \varepsilon=(\varepsilon_1, \ldots, \varepsilon_K)^T\) satisfy \(\sum_{k=1}^K \varepsilon_k = 0\) because \(\sum_{k=1}^K q_k=1\) and \(\sum_{k=1}^K p_k=1\). Thus \(\varepsilon_K = -\sum_{k=1}^{K-1} \varepsilon_k\). Then \[ \begin{split} D_{\text{KL}}(\text{Cat}(\boldsymbol q), \text{Cat}(\boldsymbol q+\boldsymbol \varepsilon)) & \approx \frac{1}{2} (\varepsilon_1, \ldots, \varepsilon_{K-1}) \, \boldsymbol I^{\text{Fisher}}\left( q_1, \ldots, q_{K-1} \right) \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{K-1}\\ \end{pmatrix} \\ &= \frac{1}{2} \left( \sum_{k=1}^{K-1} \frac{\varepsilon_k^2}{q_k} + \frac{ \left(\sum_{k=1}^{K-1} \varepsilon_k\right)^2}{q_K} \right) \\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{\varepsilon_k^2}{q_k}\\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{q_k}\\ & = \frac{1}{2} D_{\text{Neyman}}(Q, P)\\ \end{split} \] Similarly, if we keep \(P\) fixed and consider \(Q\) as a perturbed version of \(P\) we get \[ \begin{split} D_{\text{KL}}(\text{Cat}(\boldsymbol p+\boldsymbol \varepsilon), \text{Cat}(\boldsymbol p)) &\approx \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{p_k}\\ &= \frac{1}{2} D_{\text{Pearson}}(Q, P) \end{split} \] Note that in both approximations we divide by the probabilities of the distribution that is kept fixed.
Note the appearance of the Pearson \(\chi^2\) divergence and the Neyman \(\chi^2\) divergence in the above. Both are, like the KL divergence, part of the family of \(f\)-divergences. The Neyman \(\chi^2\) divergence is also known as the reverse Pearson divergence as \(D_{\text{Neyman}}(Q, P) = D_{\text{Pearson}}(P, Q)\).
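Both local approximations can be checked numerically. The following minimal sketch (with an arbitrarily chosen probability vector and a small perturbation that sums to zero) compares the exact KL divergence with half the Neyman divergence and half the Pearson divergence:

```python
# Minimal sketch: for a small perturbation the KL divergence between two
# categorical distributions is close to half the Neyman chi^2 divergence
# (dividing by q) and also to half the Pearson chi^2 divergence (dividing by p).
import numpy as np

q = np.array([0.2, 0.3, 0.5])
eps = np.array([0.01, -0.005, -0.005])        # perturbation, sums to zero
p = q + eps

kl = np.sum(q * np.log(q / p))                # D_KL(Cat(q), Cat(p))
neyman = np.sum((q - p)**2 / q)               # Neyman chi^2 divergence
pearson = np.sum((q - p)**2 / p)              # Pearson chi^2 divergence

print(kl, neyman / 2, pearson / 2)            # all approx. 3.1e-4
```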
Nielsen, F. 2020. An elementary introduction to information geometry. Entropy 22:1100. https://doi.org/10.3390/e22101100