6  Exponential families

6.1 Definition of an exponential family

Exponential tilting

A distribution family \(P(\boldsymbol \eta)\) for a random variable \(x\) is an exponential family if it is generated by exponential tilting of a base distribution \(B\), resulting in a pdmf of the form \[ \begin{split} p(x|\boldsymbol \eta) &= \underbrace{e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle }}_{\text{exponential tilt }} \underbrace{ h(x)}_{\text{base function}} / \underbrace{z(\boldsymbol \eta)}_{\text{normaliser}}\\ & = h(x)\, e^{\langle \boldsymbol \eta, \boldsymbol t(x) \rangle -a(\boldsymbol \eta)}\\ \end{split} \] where

  • \(\boldsymbol t(x)\) are the canonical statistics,
  • \(\boldsymbol \eta\) are the canonical parameters,
  • \(h(x)\) is a positive base function (typically unnormalised),
  • \(z(\boldsymbol \eta)\) is the partition function and
  • \(a(\boldsymbol \eta) = \log z(\boldsymbol \eta)\) the corresponding log-partition function.

The base pdmf is obtained at \(\boldsymbol \eta=0\) yielding \(b(x) = p(x | \boldsymbol \eta=0) = h(x) / z(0)\). If \(h(x)\) is already a normalised pdmf then \(z(0)=1\) and \(b(x)=h(x)\).
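As a concrete illustration of exponential tilting, the following minimal sketch (assuming numpy and scipy are available) tilts a standard normal base with \(t(x)=x\); the tilted distribution is then \(N(\eta, 1)\) and \(z(\eta)=e^{\eta^2/2}\).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

h = stats.norm(0, 1).pdf          # base function (here already a normalised pdf)
t = lambda x: x                   # canonical statistic
eta = 1.5                         # canonical parameter (arbitrary choice)

# partition function z(eta) = integral of exp(eta * t(x)) h(x) dx
z, _ = quad(lambda x: np.exp(eta * t(x)) * h(x), -np.inf, np.inf)

# tilted pdmf p(x | eta) = exp(eta * t(x)) h(x) / z(eta)
p = lambda x: np.exp(eta * t(x)) * h(x) / z

x = np.linspace(-3, 6, 7)
print(np.allclose(p(x), stats.norm(eta, 1).pdf(x)))  # True: the tilt shifts the mean
print(np.isclose(z, np.exp(eta**2 / 2)))             # True: z(eta) = exp(eta^2 / 2)
```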

The above presentation of exponential families assumes a univariate random variable (scalar \(x\)) but also applies to multivariate random variables (vector \(\boldsymbol x\) or matrix \(\boldsymbol X\)).

Likewise, canonical statistics and parameters are written as vectors but these may also be scalars or matrices (or a combination of both). The inner product notation \(\langle \cdot, \cdot \rangle\) covers all these cases: for scalars \(\langle a, b \rangle = ab\), for vectors \(\langle \boldsymbol a, \boldsymbol b\rangle = \boldsymbol a^T \boldsymbol b\) and for matrices \(\langle \boldsymbol A, \boldsymbol B\rangle = \operatorname{Tr}( \boldsymbol A^T \boldsymbol B) = \operatorname{Vec}(\boldsymbol A)^T \operatorname{Vec}(\boldsymbol B)\).
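A quick numerical check of these inner product conventions (a minimal sketch assuming numpy is available):

```python
import numpy as np

a, b = 2.0, 3.0
print(a * b)                          # scalar case: <a, b> = a b

u = np.array([1., 2.])
v = np.array([3., 4.])
print(u @ v)                          # vector case: <u, v> = u^T v

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
print(np.trace(A.T @ B))              # matrix case: Tr(A^T B)
print(A.ravel() @ B.ravel())          # vec(A)^T vec(B); any common flattening order gives the same sum
```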

Canonical statistics

The canonical statistics \(\boldsymbol t(x)\) are transformations of \(x\), usually simple functions such as the identity (\(x\)), the square (\(x^2\)), the inverse (\(1/x\)) or the logarithm (\(\log x\)).

Typically, the number of canonical statistics and hence the dimension of \(\boldsymbol t\) is small.

The canonical statistics \(\boldsymbol t(x)\) may be affinely dependent. If this is the case there is a vector \(\boldsymbol \eta_0\) for which
\[ \langle \boldsymbol \eta_0, \boldsymbol t(x) \rangle = \text{const.} \] A common example is when \(\boldsymbol x\) is a vector of counts \((n_1, \ldots, n_K)^T\) for \(K\) classes with a fixed total count \(n = \sum_{k=1}^K n_k = \boldsymbol x^T \mathbf 1_K\) and the canonical statistics \(\boldsymbol t(\boldsymbol x) = \boldsymbol x\) thus include all \(K\) counts.
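In this example the affine dependence is made explicit by choosing \(\boldsymbol \eta_0 = \mathbf 1_K\), a vector of ones, since \[ \langle \mathbf 1_K, \boldsymbol t(\boldsymbol x) \rangle = \sum_{k=1}^K n_k = n = \text{const.} \]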

If the elements in \(\boldsymbol t(x)\) are affinely independent the representation of the exponential family is minimal or complete, otherwise the representation is non-minimal or overcomplete.

Canonical parameters and identifiability

For each canonical statistic there is a corresponding canonical parameter so the dimensions and shape of \(\boldsymbol t(x)\) and \(\boldsymbol \eta\) match.

In a minimal representation the canonical parameters of the exponential family are identifiable and hence distinct parameter settings for \(\boldsymbol \eta\) yield distinct distributions.

Conversely, in a non-minimal or overcomplete representation there are redundant elements in the canonical parameters \(\boldsymbol \eta\) and the parametrisation is not identifiable: multiple values of \(\boldsymbol \eta\) yield the same underlying distribution.

In the example above, where \(\boldsymbol x\) is a vector of counts with a fixed total count, \(\boldsymbol t(\boldsymbol x) = \boldsymbol x\) and corresponding canonical parameters \(\boldsymbol \eta\), the effective number of parameters is \(K-1\) rather than \(K\), so there is one redundant parameter in \(\boldsymbol \eta\).

Moment and cumulant generating functions

The moment generating function for the canonical statistics \(\boldsymbol t(x)\) is \[ \begin{split} M(\boldsymbol \tau) & = \operatorname{E}\left( e^{\langle \boldsymbol \tau, \boldsymbol t(x)\rangle} \right)\\ &= \int_x e^{\langle \boldsymbol \tau, \boldsymbol t(x)\rangle} p(x |\boldsymbol \eta) dx \\ &= \int_x e^{\langle \boldsymbol \tau, \boldsymbol t(x)\rangle}\, e^{\langle \boldsymbol \eta, \boldsymbol t(x)\rangle}\, h(x)\, / z(\boldsymbol \eta) dx \\ & = \left( \int_x e^{\langle \boldsymbol \tau+\boldsymbol \eta,\boldsymbol t(x)\rangle}\, h(x) \, dx\right) /z(\boldsymbol \eta) \\ & = z(\boldsymbol \tau+\boldsymbol \eta)/z(\boldsymbol \eta)\\ \end{split} \] Correspondingly, the cumulant generating function is \[ \begin{split} K(\boldsymbol \tau) &= \log M(\boldsymbol \tau) \\ &= a(\boldsymbol \tau+\boldsymbol \eta)-a(\boldsymbol \eta)\\ \end{split} \] Thus, the moment and cumulant generating functions for the canonical statistics are closely linked to the partition and log-partition functions, respectively.
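As a numerical illustration (a sketch assuming numpy and scipy), for the exponential distribution \(\operatorname{Exp}(\theta)\) with mean \(\theta\), canonical parameter \(\eta = -1/\theta\) and \(z(\eta) = -\eta^{-1}\) (see Table 6.1), the directly computed moment generating function of \(t(x)=x\) agrees with \(z(\boldsymbol \tau + \boldsymbol \eta)/z(\boldsymbol \eta)\):

```python
import numpy as np
from scipy.integrate import quad

theta = 2.0
eta = -1.0 / theta
z = lambda e: -1.0 / e                 # partition function, valid for e < 0

def M_direct(tau):
    """E(exp(tau * x)) by direct integration against the pdf."""
    pdf = lambda x: np.exp(eta * x) / z(eta)       # h(x) = 1, t(x) = x, support x > 0
    val, _ = quad(lambda x: np.exp(tau * x) * pdf(x), 0, np.inf)
    return val

for tau in [0.1, 0.2, 0.3]:            # requires tau + eta < 0
    print(np.isclose(M_direct(tau), z(tau + eta) / z(eta)))   # True
```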

6.2 Roles of the partition function

Normalising factor

The pdmf \(p(x|\boldsymbol \eta)\) must integrate to one. Therefore, given \(h(x)\) and \(\boldsymbol t(x)\) the partition function \(z(\boldsymbol \eta)\) is obtained by \[ z(\boldsymbol \eta)= \int_x e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle } \, h(x) \, dx \] For discrete \(x\) replace the integral by a sum.
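For instance, for the Bernoulli family with \(t(x)=x\), \(h(x)=1\) and support \(\{0,1\}\), the sum gives \(z(\eta)=1+e^\eta\); a small sketch (assuming numpy):

```python
import numpy as np

eta = 0.7
support = np.array([0, 1])                      # discrete support of Ber(theta)
z = np.sum(np.exp(eta * support))               # z(eta) = 1 + e^eta
p = np.exp(eta * support) / z                   # pdmf, with theta = p(x=1)

print(np.isclose(z, 1 + np.exp(eta)))           # True
print(np.isclose(p[1], 1 / (1 + np.exp(-eta)))) # p(x=1) is the logistic function of eta
print(np.isclose(p.sum(), 1.0))                 # normalises to one
```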

Consequently, the partition function \(z(\boldsymbol \eta)\) is also called the normaliser and the log-partition function \(a(\boldsymbol \eta)\) the log-normaliser.

Being a weighted integral of exponentials of linear functions of \(\boldsymbol \eta\), the partition function \(z(\boldsymbol \eta)\) is convex with respect to \(\boldsymbol \eta\); the log-partition function \(a(\boldsymbol \eta)\) is convex as well (this follows from Hölder's inequality).

For an exponential family in minimal representation the (log)-partition function(s) are strictly convex.

Definition of parameter space

The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which the pdmf \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(h(x)\) and \(\boldsymbol t(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence these cannot be used to form an exponential family.

Moments of canonical statistics

The first cumulant (the mean) and second cumulant (the variance) are obtained as the first and second derivatives of the cumulant generating function \(K(\boldsymbol \tau) = a(\boldsymbol \tau+\boldsymbol \eta)-a(\boldsymbol \eta)\) evaluated at \(\boldsymbol \tau=0\), respectively. As a result, the log-partition function \(a(\boldsymbol \eta)\) provides a practical way to obtain the mean and variance of the canonical statistics \(\boldsymbol t(x)\).

Specifically, computing its gradient yields the mean \[ \begin{split} \operatorname{E}( \boldsymbol t(x) ) = \boldsymbol \mu_{\boldsymbol t} & = \nabla a(\boldsymbol \eta)\\ &= \frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)} \end{split} \] and computing the Hessian matrix the covariance matrix \[ \begin{split} \operatorname{Var}( \boldsymbol t(x) ) = \boldsymbol \Sigma_{\boldsymbol t} & = \nabla \nabla^T a(\boldsymbol \eta)\\ &= \frac{\nabla \nabla^T z(\boldsymbol \eta)}{z(\boldsymbol \eta)} - \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right) \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right)^T \end{split} \]
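The following sketch (assuming numpy and scipy) recovers the mean and variance of the canonical statistics of \(\operatorname{Gam}(\alpha, \theta)\) from finite-difference derivatives of its log-partition function \(a(\boldsymbol \eta) = \log \Gamma(\eta_2+1) - (\eta_2+1)\log(-\eta_1)\) (see Table 6.1) and compares them with the known values \(\operatorname{E}(x) = \alpha\theta\), \(\operatorname{E}(\log x) = \psi^{(0)}(\alpha) + \log\theta\) and \(\operatorname{Var}(x) = \alpha\theta^2\):

```python
import numpy as np
from scipy.special import gammaln, digamma

alpha, theta = 3.0, 2.0                          # shape and scale
eta = np.array([-1.0 / theta, alpha - 1.0])      # canonical parameters (Table 6.1)

def a(e):
    # log-partition function of the gamma family
    return gammaln(e[1] + 1) - (e[1] + 1) * np.log(-e[0])

def grad(f, x, h=1e-5):
    # central finite-difference gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

mu_t = grad(a, eta)                              # E(t(x)) = gradient of a(eta)
print(np.allclose(mu_t, [alpha * theta, digamma(alpha) + np.log(theta)]))   # True

# (1,1) entry of the Hessian of a(eta) gives Var(x) = alpha * theta^2
var_x = grad(lambda e: grad(a, e)[0], eta)[0]
print(np.isclose(var_x, alpha * theta**2, rtol=1e-4))                       # True
```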

For an exponential family with minimal representation the log-partition function is strictly convex, hence the variance \(\boldsymbol \Sigma_{\boldsymbol t}\) is a positive definite matrix and invertible.

For overcomplete representations the log-partition function is convex but not strictly convex, hence the covariance matrix is positive semi-definite and not invertible.

By construction, the log-partition function \(a(\boldsymbol \eta)\) is finite in the interior of its parameter space. Therefore all moments and cumulants of the canonical statistics \(\boldsymbol t(x)\) exist (are finite) and are given by the derivatives of \(a(\boldsymbol \eta)\) at \(\boldsymbol \eta\).

6.3 Further properties

Equivalent representations

An exponential family admits many equivalent representations such that different specifications of canonical statistics \(\boldsymbol t(x)\) and base function \(h(x)\) describe the same family.

First, any invertible linear transformation of the canonical statistics \(\boldsymbol t(x)\) (with the canonical parameters \(\boldsymbol \eta\) transformed accordingly) yields the same distribution family.

Second, any member of the family, say \(P(\boldsymbol \eta_{0})\), can serve as its base distribution. Specifically, with \(p(x | \boldsymbol \eta_{0})\) used as base the pdmf for the exponential family \(P(\boldsymbol \eta)\) is \[ p(x| \boldsymbol \eta) = e^{ \langle \boldsymbol \eta- \boldsymbol \eta_{0}, \boldsymbol t(x)\rangle - (a(\boldsymbol \eta) -a(\boldsymbol \eta_{0}))}\, p(x| \boldsymbol \eta_{0}) \] which is in exponential family form.

Third, the base function \(h(x)\) can be left unnormalised, so there are infinitely many positive base functions \(h(x)\) that yield the same base pdmf \(b(x)\) after normalisation.

Fourth, any factors in \(h(x)\) of the form \(e^{ \langle \boldsymbol \eta_{0}, \boldsymbol t(x)\rangle}\) for constant \(\boldsymbol \eta_{0}\) can be removed from \(h(x)\). As a result, many commonly used exponential families set the base function to \(h(x)=1\).

Alternative parametrisations

An exponential family can be parametrised by three different sets of parameters:

  1. canonical parameters \(\boldsymbol \eta\),
  2. expectation parameters \(\boldsymbol \mu_{\boldsymbol t} = \operatorname{E}(\boldsymbol t(x))\) (the mean of the canonical statistics \(\boldsymbol t(x)\)), as well as
  3. conventional parameters \(\boldsymbol \theta\) (such as mean and variance of \(x\)).

If the exponential family is minimal then there is a one-to-one map between the canonical parameters \(\boldsymbol \eta\) and the expectation parameters  \(\boldsymbol \mu_{\boldsymbol t}\).

The canonical and the expectation parameters can be expressed as a function of the conventional parameters \(\boldsymbol \theta\).

Often, some expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) correspond to conventional parameters \(\boldsymbol \theta\) (e.g., if one of the canonical statistics is \(x\), then the corresponding expectation parameter is the mean of \(x\)).
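As an illustration (a sketch assuming numpy), the three parametrisations of the normal distribution \(N(\mu, \sigma^2)\), with canonical parameters \(\boldsymbol \eta = (\mu/\sigma^2, -1/(2\sigma^2))^T\) and expectation parameters \(\boldsymbol \mu_{\boldsymbol t} = (\mu, \sigma^2+\mu^2)^T\) (cf. Table 6.1), can be converted into each other:

```python
import numpy as np

mu, sigma2 = 1.5, 4.0                                # conventional parameters theta

# conventional -> canonical: eta = (mu / sigma^2, -1 / (2 sigma^2))
eta = np.array([mu / sigma2, -0.5 / sigma2])

# conventional -> expectation parameters: mu_t = (E(x), E(x^2))
mu_t = np.array([mu, sigma2 + mu**2])

# canonical -> conventional (inverse map)
sigma2_back = -0.5 / eta[1]
mu_back = eta[0] * sigma2_back
print(np.allclose([mu_back, sigma2_back], [mu, sigma2]))            # True

# expectation -> conventional (inverse map)
print(np.allclose([mu_t[0], mu_t[1] - mu_t[0]**2], [mu, sigma2]))   # True
```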

Special types of exponential families

If the pdmf depends on \(x\) only through the canonical statistics \(\boldsymbol t(x)\) (and therefore \(h(x)\) can be set to a constant) the exponential family is called a Gibbs family.

If \(t(x) = x\) (univariate) or \(\boldsymbol t(\boldsymbol x) = \boldsymbol x\) (multivariate), so that the canonical statistics correspond directly to \(x\) or \(\boldsymbol x\), the family is called a natural exponential family (NEF). Consequently, a univariate natural exponential family has only a single canonical parameter.

6.4 Univariate exponential families

Table 6.1 lists common univariate exponential families; more details about these distributions can be found in Chapter 4.

For \(\operatorname{Bin}(n, \theta)\) and \(\operatorname{Ber}(\theta)\) the conventional parameter is \[ \theta = \operatorname{logit}^{-1}(\eta)=\frac{e^\eta}{1+e^\eta} \] (logistic function). Conversely, since this is a one-to-one map, the canonical parameter equals \[ \eta = \operatorname{logit}(\theta)= \log\left(\frac{\theta}{1-\theta}\right) \]
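A quick check of this pair of maps (a sketch assuming numpy and scipy, whose expit function is the logistic function):

```python
import numpy as np
from scipy.special import logit, expit    # expit is the logistic function

theta = 0.3
eta = logit(theta)                                     # canonical parameter
print(np.isclose(eta, np.log(theta / (1 - theta))))    # True
print(np.isclose(expit(eta), theta))                   # logistic(logit(theta)) = theta
```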

Apart from \(\operatorname{Bin}(n, \theta)\) all families listed in Table 6.1 are Gibbs families (\(h(x)=1\)).

Furthermore, \(\operatorname{Bin}(n, \theta)\), \(\operatorname{Ber}(\theta)\) and \(\operatorname{Exp}(\theta)\) are NEFs. \(N(\mu,\sigma^2)\) with fixed \(\sigma^2\) (variance), \(\operatorname{Gam}(\alpha, \theta)\) with fixed \(\alpha\) (shape) and \(\operatorname{Wis}(s^2, k)\) with fixed \(k\) (shape) are also NEFs.

Table 6.1: Common univariate exponential families
| Distribution | \(h(x)\) | \(z(\boldsymbol \eta)\) | \(\boldsymbol \eta\) | \(\boldsymbol t(x)\) | \(\boldsymbol \mu_{\boldsymbol t}\) |
|---|---|---|---|---|---|
| \(\operatorname{Bin}(n, \theta)\) | \(W_2\) | \((1+e^\eta)^n\) | \(\operatorname{logit}(\theta)\) | \(x\) | \(n\theta\) |
| \(\operatorname{Ber}(\theta)\) | \(1\) | \(1+e^\eta\) | \(\operatorname{logit}(\theta)\) | \(x\) | \(\theta\) |
| \(\operatorname{Beta}(\alpha_1, \alpha_2)\) | \(1\) | \(B(\eta_1+1, \eta_2+1)\) | \(\begin{pmatrix}\alpha_1-1 \\ \alpha_2-1\end{pmatrix}\) | \(\begin{pmatrix} \log x \\ \log(1-x) \end{pmatrix}\) | \(\begin{pmatrix} \psi^{(0)}(\alpha_1)-\psi^{(0)}(m) \\ \psi^{(0)}(\alpha_2)-\psi^{(0)}(m)\end{pmatrix}\) |
| \(N(\mu,\sigma^2)\) | \(1\) | \((-\pi \eta_2^{-1})^{1/2}\, \exp(-\frac{1}{4}\eta_1^2 \eta_2^{-1})\) | \(\begin{pmatrix} \sigma^{-2} \mu \\ -\frac{1}{2}\sigma^{-2}\end{pmatrix}\) | \(\begin{pmatrix} x \\ x^2\end{pmatrix}\) | \(\begin{pmatrix} \mu \\ \sigma^2 + \mu^2 \end{pmatrix}\) |
| \(\operatorname{Gam}(\alpha, \theta)\) | \(1\) | \((-\eta_1)^{-\eta_2-1}\, \Gamma(\eta_2+1)\) | \(\begin{pmatrix} -1/\theta \\ \alpha-1 \end{pmatrix}\) | \(\begin{pmatrix} x \\ \log x \end{pmatrix}\) | \(\begin{pmatrix} \alpha \theta \\ \psi^{(0)}(\alpha) +\log\theta\end{pmatrix}\) |
| \(\operatorname{Exp}(\theta)\) | \(1\) | \(-\eta^{-1}\) | \(-1/\theta\) | \(x\) | \(\theta\) |
| \(\operatorname{Wis}(s^2, k)\) | \(1\) | \((-\eta_1)^{-\eta_2-1}\, \Gamma(\eta_2+1)\) | \(\begin{pmatrix} -\frac{1}{2} s^{-2} \\ \frac{k}{2} -1 \end{pmatrix}\) | \(\begin{pmatrix} x \\ \log x \end{pmatrix}\) | \(\begin{pmatrix} k s^2 \\ \psi^{(0)}(\frac{k}{2}) +\log(2 s^2)\end{pmatrix}\) |
| \(\operatorname{IG}(\alpha, \beta)\) | \(1\) | \((-\eta_1)^{\eta_2+1}\, \Gamma(-\eta_2-1)\) | \(\begin{pmatrix} -\beta \\ -\alpha-1 \end{pmatrix}\) | \(\begin{pmatrix} x^{-1} \\ \log x \end{pmatrix}\) | \(\begin{pmatrix} \alpha/\beta \\ -\psi^{(0)}(\alpha) +\log\beta \end{pmatrix}\) |
| \(\operatorname{IW}(\psi, k)\) | \(1\) | \((-\eta_1)^{\eta_2+1}\, \Gamma(-\eta_2-1)\) | \(\begin{pmatrix} -\frac{\psi}{2} \\ -\frac{k}{2}-1 \end{pmatrix}\) | \(\begin{pmatrix} x^{-1} \\ \log x \end{pmatrix}\) | \(\begin{pmatrix} k / \psi \\-\psi^{(0)}(\frac{k}{2}) +\log(\frac{\psi}{2}) \end{pmatrix}\) |

Notes:

  • \(W_2 = \binom{n}{x}\) is the binomial coefficient.
  • \(B(\alpha_1, \alpha_2)\) is the beta function.
  • \(m=\alpha_1 + \alpha_2\).
  • \(\psi^{(0)}(x) =\frac{d}{dx} \log \Gamma(x)\) is the digamma function.
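As a quick numerical check of one of the rows above (a sketch assuming scipy), the expectation parameters of \(\operatorname{Gam}(\alpha, \theta)\) in Table 6.1, \(\operatorname{E}(x) = \alpha\theta\) and \(\operatorname{E}(\log x) = \psi^{(0)}(\alpha) + \log\theta\), can be verified by numerical integration:

```python
import numpy as np
from scipy import stats
from scipy.special import digamma

alpha, theta = 2.5, 1.7                      # shape and scale
g = stats.gamma(a=alpha, scale=theta)

print(np.isclose(g.expect(lambda x: x), alpha * theta))              # E(x)
print(np.isclose(g.expect(np.log), digamma(alpha) + np.log(theta)))  # E(log x)
```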

6.5 Multivariate exponential families

Table 6.2 lists common multivariate exponential families; more details about these distributions can be found in Chapter 5.

For \(\operatorname{Mult}(n, \boldsymbol \theta)\) and \(\operatorname{Cat}(\boldsymbol \theta)\) the conventional parameters are given by \[ \boldsymbol \theta= \operatorname{softmax}(\boldsymbol \eta) = \left(\frac{\exp \eta_k}{\sum_{i=1}^K \exp \eta_i}\right) \] As the softmax function is invariant under adding a constant to all its arguments and hence is a many-to-one map, its inverse is not unique and the canonical parameters \[ \boldsymbol \eta= (c + \log \theta_k) \] are determined by the conventional parameters \(\boldsymbol \theta\) only up to a constant \(c\). This representation using \(K\) canonical parameters \(\boldsymbol \eta\) is non-minimal, hence \(\boldsymbol \eta\) is not identifiable and different values of \(\boldsymbol \eta\) can represent the same distribution.

A minimal representation with \(K-1\) parameters \(\eta_1, \ldots, \eta_{K-1}\) and \(\eta_K=0\) corresponds to \(c=-\log \theta_K\) and \(\eta_k = \log (\theta_k/ \theta_K )\). For \(K=2\) this yields the minimal representations of \(\operatorname{Bin}(n, \theta)\) and \(\operatorname{Ber}(\theta)\) shown in Table 6.1.
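A small sketch (assuming numpy and scipy) of the overcomplete and minimal representations for \(\operatorname{Cat}(\boldsymbol \theta)\):

```python
import numpy as np
from scipy.special import softmax

theta = np.array([0.2, 0.3, 0.5])

# Overcomplete: eta = c + log(theta) represents the same theta for any constant c
for c in [0.0, 1.0, -3.7]:
    eta = c + np.log(theta)
    print(np.allclose(softmax(eta), theta))      # True for every c

# Minimal representation: eta_k = log(theta_k / theta_K), so that eta_K = 0
eta_min = np.log(theta / theta[-1])
print(eta_min[-1] == 0.0)                        # last canonical parameter pinned to zero
print(np.allclose(softmax(eta_min), theta))      # still the same distribution
```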

Apart from \(\operatorname{Mult}(n, \boldsymbol \theta)\) all families listed in Table 6.2 are Gibbs families (\(h(\boldsymbol x)=1\)).

Furthermore, \(\operatorname{Mult}(n, \boldsymbol \theta)\) and \(\operatorname{Cat}(\boldsymbol \theta)\) are NEFs. \(N(\boldsymbol \mu, \boldsymbol \Sigma)\) with fixed \(\boldsymbol \Sigma\) (variance) and \(\operatorname{Wis}(\boldsymbol S, k)\) with fixed \(k\) (shape) are NEFs as well.

Table 6.2: Common multivariate exponential families
| Distribution | \(h(\boldsymbol x)\) | \(z(\boldsymbol \eta)\) | \(\boldsymbol \eta\) | \(\boldsymbol t(\boldsymbol x)\) | \(\boldsymbol \mu_{\boldsymbol t}\) |
|---|---|---|---|---|---|
| \(\operatorname{Mult}(n, \boldsymbol \theta)\) | \(W_K\) | \((\sum_{k=1}^K \exp \eta_k)^n\) | \((c+\log \theta_k)\) | \(\boldsymbol x\) | \(n \boldsymbol \theta\) |
| \(\operatorname{Cat}(\boldsymbol \theta)\) | \(1\) | \(\sum_{k=1}^K \exp \eta_k\) | \((c+ \log \theta_k)\) | \(\boldsymbol x\) | \(\boldsymbol \theta\) |
| \(\operatorname{Dir}(\boldsymbol \alpha)\) | \(1\) | \(B(\boldsymbol \eta+1)\) | \(\begin{pmatrix}\alpha_k-1 \end{pmatrix}\) | \(\begin{pmatrix} \log x_k \end{pmatrix}\) | \(\begin{pmatrix} \psi^{(0)}(\alpha_k)-\psi^{(0)}(m)\end{pmatrix}\) |
| \(N(\boldsymbol \mu, \boldsymbol \Sigma)\) | \(1\) | \(\det(- \pi \boldsymbol \eta_2^{-1})^{1/2}\, \exp(-\frac{1}{4} \boldsymbol \eta_1^T \boldsymbol \eta_2^{-1} \boldsymbol \eta_1)\) | \(\begin{pmatrix}\boldsymbol \Sigma^{-1}\boldsymbol \mu\\ -\frac{1}{2}\boldsymbol \Sigma^{-1}\end{pmatrix}\) | \(\begin{pmatrix} \boldsymbol x\\ \boldsymbol x\boldsymbol x^T\end{pmatrix}\) | \(\begin{pmatrix} \boldsymbol \mu\\ \boldsymbol \Sigma+ \boldsymbol \mu\boldsymbol \mu^T \end{pmatrix}\) |
| \(\operatorname{Wis}(\boldsymbol S, k)\) | \(1\) | \(\det(-\boldsymbol \eta_1)^{-\eta_2-\frac{d+1}{2}}\, \Gamma_d(\eta_2+\frac{d+1}{2})\) | \(\begin{pmatrix} -\frac{1}{2}\boldsymbol S^{-1} \\ \frac{k}{2} - \frac{d+1}{2} \end{pmatrix}\) | \(\begin{pmatrix} \boldsymbol X\\ \log \det(\boldsymbol X) \end{pmatrix}\) | \(\begin{pmatrix} k \boldsymbol S\\ \psi^{(0)}_d(\frac{k}{2}) + \log \det(2 \boldsymbol S)\end{pmatrix}\) |
| \(\operatorname{IW}\left(\boldsymbol \Psi, k\right)\) | \(1\) | \(\det(-\boldsymbol \eta_1)^{\eta_2+\frac{d+1}{2}}\, \Gamma_d(-\eta_2-\frac{d+1}{2})\) | \(\begin{pmatrix} -\frac{1}{2}\boldsymbol \Psi\\ -\frac{k}{2} - \frac{d+1}{2} \end{pmatrix}\) | \(\begin{pmatrix} \boldsymbol X^{-1} \\ \log \det(\boldsymbol X) \end{pmatrix}\) | \(\begin{pmatrix} k \boldsymbol \Psi^{-1} \\-\psi^{(0)}_d(\frac{k}{2}) +\log \det(\frac{\boldsymbol \Psi}{2}) \end{pmatrix}\) |

Notes:

  • \(W_K = \binom{n}{x_1, \ldots, x_K}\) is the multinomial coefficient.
  • \(B(\boldsymbol \alpha)\) is the multivariate beta function.
  • \(m=\sum_{k=1}^K \alpha_k\).
  • \(\psi^{(0)}_d(x) = \frac{d}{dx} \log \Gamma_d(x) = \sum_{i=1}^d \psi^{(0)}(x - (i-1)/2)\) is the multivariate digamma function.
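As a check of the \(\operatorname{Dir}(\boldsymbol \alpha)\) row of Table 6.2 (a Monte-Carlo sketch assuming numpy and scipy), the sample mean of \(\log x_k\) approaches \(\psi^{(0)}(\alpha_k)-\psi^{(0)}(m)\):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 4.0])

samples = rng.dirichlet(alpha, size=200_000)       # draws from Dir(alpha)
mc = np.log(samples).mean(axis=0)                  # Monte-Carlo estimate of E(log x_k)
exact = digamma(alpha) - digamma(alpha.sum())

print(np.allclose(mc, exact, atol=1e-2))           # True up to Monte-Carlo error
```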

See also: Exponential family (Wikipedia) and Efron (2022).