# 5 Multivariate dependencies

## 5.1 Measuring the linear association between two sets of random variables

### 5.1.1 Outline

In this section we discuss how to measure the total linear association between two sets of random variables $$\boldsymbol x= (x_1, \ldots, x_p)^T$$ and $$\boldsymbol y= (y_1, \ldots, y_q)^T$$. We assume a joint correlation matrix $\boldsymbol P= \begin{pmatrix} \boldsymbol P_{\boldsymbol x} & \boldsymbol P_{\boldsymbol x\boldsymbol y} \\ \boldsymbol P_{\boldsymbol y\boldsymbol x} & \boldsymbol P_{\boldsymbol y} \\ \end{pmatrix}$ with cross-correlation matrix $$\boldsymbol P_{\boldsymbol x\boldsymbol y} = \boldsymbol P_{\boldsymbol y\boldsymbol x}^T$$ and the within-group group correlations $$\boldsymbol P_{\boldsymbol x}$$ and $$\boldsymbol P_{\boldsymbol y}$$. If the cross-correlations vanish, $$\boldsymbol P_{\boldsymbol x\boldsymbol y} =0$$, then the two random vectors are uncorrelated, and the joint correlation matrix becomes a diagonal block matrix $\boldsymbol P_{\text{indep}} = \begin{pmatrix} \boldsymbol P_{\boldsymbol x} & 0 \\ 0 & \boldsymbol P_{\boldsymbol y} \\ \end{pmatrix} \, .$

### 5.1.2 Special cases

In linear regression model the squared multiple correlation or coefficient of determination $\text{Cor}(\boldsymbol x, y)^2 = \boldsymbol P_{y\boldsymbol x} \boldsymbol P_{\boldsymbol x}^{-1} \boldsymbol P_{\boldsymbol xy}$ is a standard measure to describe the strength of total linear association between the predictors $$\boldsymbol x$$ and the response $$y$$. If $$\boldsymbol P_{\boldsymbol xy} =0$$ then $$\text{Cor}(\boldsymbol x, y)^2=0$$.

If there is only a single predictor $$x$$ then $$\boldsymbol P_{xy}=\rho$$ and $$\boldsymbol P_{x} = 1$$ and therefore the squared multiple correlation reduces to the squared Pearson correlation $\text{Cor}(x, y)^2 = \rho^2 \, .$

### 5.1.3 Rozeboom vector correlation

For the general case with two random vectors we are looking for a scalar quantity that quantifies the divergence of the general joint correlation matrix $$\boldsymbol P$$ from the joint correlation matrix $$\boldsymbol P_{\text{indep}}$$ assuming uncorrelated $$\boldsymbol x$$ and $$\boldsymbol y$$.

One such quantity is Hotelling’s vector alienation coefficient (given in his 1936 CCA paper10) $\begin{split} a(\boldsymbol x, \boldsymbol y) &= \frac{\det(\boldsymbol P)}{\det(\boldsymbol P_{\text{indep}}) } \\ & = \frac{\det( \boldsymbol P) }{ \det(\boldsymbol P_{\boldsymbol x}) \, \det(\boldsymbol P_{\boldsymbol y}) } \end{split}$ With $$\boldsymbol K= \boldsymbol P_{\boldsymbol x}^{-1/2} \boldsymbol P_{\boldsymbol x\boldsymbol y} \boldsymbol P_{\boldsymbol y}^{-1/2}$$ (cf. Chapter 2, Canonical correlation analysis) the vector alienation coefficient can be written (using the Weinstein-Aronszajn determinant identity and the formula for the determinant for block-structured matrices, see Appendix) as $\begin{split} a(\boldsymbol x, \boldsymbol y) & = \det \left( \boldsymbol I_p - \boldsymbol K\boldsymbol K^T \right) \\ & = \det \left( \boldsymbol I_q - \boldsymbol K^T \boldsymbol K\right) \\ &= \prod_{i=1}^m (1-\lambda_i^2) \\ \end{split}$ where the $$\lambda_i$$ are the singular values of $$\boldsymbol K$$, i.e. the canonical correlations for the pair $$\boldsymbol x$$ and $$\boldsymbol y$$.

If $$\boldsymbol P_{\boldsymbol x\boldsymbol y} = 0$$ und thus $$\boldsymbol x$$ and $$\boldsymbol y$$ are uncorrelated then $$\boldsymbol P= \boldsymbol P_{\text{indep}}$$ and thus by construction the vector alienation coefficient $$a(\boldsymbol x, \boldsymbol y)=1$$. Hence, it is itself not a generalisation of the squared multiple correlation.

Instead, Rozeboom (1965) proposed to use as squared vector correlation the complement $\begin{split} \text{Cor}(\boldsymbol x, \boldsymbol y)^2 &= \rho^2_{\boldsymbol x\boldsymbol y} = 1 - a(\boldsymbol x, \boldsymbol y) \\ & = 1- \det\left( \boldsymbol I_p - \boldsymbol K\boldsymbol K^T \right) \\ & = 1- \det\left( \boldsymbol I_q - \boldsymbol K^T \boldsymbol K\right) \\ & =1- \prod_{i=1}^m (1-\lambda_i^2) \\ \end{split}$ If $$\boldsymbol P_{\boldsymbol x\boldsymbol y} = 0$$ then $$\text{Cor}(\boldsymbol x, \boldsymbol y)^2 = 0$$. Moreover, if either $$p=1$$ or $$q=1$$ the squared vector correlation reduces to the corresponding squared multiple correlation, and for both $$p=1$$ and $$q=1$$ it becomes squared Pearson correlation.

A more general way to measure multivariate association is mutual information (MI) which not only covers linear but also non-linear association. As we will discuss in a subsequent section the Rozeboom vector correlation arises naturally in MI for the multivariate normal distribution.

### 5.1.4 RV coefficient

Another common approach to measure association between two random vectors is the RV coefficient introduced by Robert and Escoufier (1976) as $RV(\boldsymbol x, \boldsymbol y) = \frac{ \text{Tr}(\boldsymbol \Sigma_{\boldsymbol x\boldsymbol y} \boldsymbol \Sigma_{\boldsymbol y\boldsymbol x} )}{ \sqrt{ \text{Tr}(\boldsymbol \Sigma_{\boldsymbol x}^2) \text{Tr}(\boldsymbol \Sigma_{\boldsymbol y}^2) } }$ It is easier to compute than the Rozeboom vector correlation since it is based on the matrix trace rather than matrix determinant.

For $$q=p=1$$ the RV coefficient reduces to the squared correlation. However, the RV coefficient does not reduce to the multiple correlation coefficient for $$q=1$$ and $$p > 1$$, and therefore the RV coefficient cannot be considered a coherent generalisation of Pearson and multiple correlation to the case of two random vectors.

## 5.2 Mutual information as generalisation of correlation

### 5.2.1 Definition of mutual information

Recall the definition of Kullback-Leibler divergence, or relative entropy, between two distributions: $D_{\text{KL}}(F, G) := \text{E}_F \log \biggl( \frac{f(\boldsymbol x)}{g(\boldsymbol x)} \biggr)$ Here $$F$$ plays the role of the reference distribution and $$G$$ is an approximating distribution, with $$f$$ and $$g$$ being the corresponding density functions (see MATH20802 Statistical Methods).

The Mutual Information (MI) between two random variables $$\boldsymbol x$$ and $$\boldsymbol y$$ is defined as the KL divergence between the corresponding joint distribution and the product distribution: $\text{MI}(\boldsymbol x, \boldsymbol y) = D_{\text{KL}}(F_{\boldsymbol x,\boldsymbol y}, F_{\boldsymbol x} F_{\boldsymbol y}) = \text{E}_{F_{\boldsymbol x,\boldsymbol y}} \log \biggl( \frac{f(\boldsymbol x, \boldsymbol y)}{f(\boldsymbol x) \, f(\boldsymbol y)} \biggr) .$ Thus, MI measures how well the joint distribution can be approximated by the product distribution (which would be the appropriate joint distribution if $$\boldsymbol x$$ and $$\boldsymbol y$$ are independent). Since MI is an application of KL divergence is shares all its properties. In particular, $$\text{MI}(\boldsymbol x, \boldsymbol y)=0$$ implies that the joint distribution and product distributions are the same. Hence the two random variables $$\boldsymbol x$$ and $$\boldsymbol y$$ are independent if the mutual information vanishes.

### 5.2.2 Mutual information between two normal variables

The KL divergence between two multivariate normal distributions $$F_{\text{ref}}$$ and $$F$$ is $D_{\text{KL}}(F_{\text{ref}}, F) = \frac{1}{2} \biggl\{ (\boldsymbol \mu-\boldsymbol \mu_{\text{ref}})^T \boldsymbol \Sigma^{-1} (\boldsymbol \mu-\boldsymbol \mu_{\text{ref}}) + \text{Tr}\biggl(\boldsymbol \Sigma^{-1} \boldsymbol \Sigma_{\text{ref}} \biggr) - \log \det \biggl( \boldsymbol \Sigma^{-1} \boldsymbol \Sigma_{\text{ref}} \biggr) - d \biggr\}$ This allows compute the mutual information $$\text{MI}_{\text{norm}}(x,y)$$ between two univariate random variables $$x$$ and $$y$$ that are are correlated and assumed to be jointly bivariate normal. Let $$\boldsymbol z= (x, y)^T$$. The joint bivariate normal distribution is characterised by the mean $$\text{E}(\boldsymbol z) = \boldsymbol \mu= (\mu_x, \mu_y)^T$$ and the covariance matrix $\boldsymbol \Sigma= \begin{pmatrix} \sigma^2_x & \rho \, \sigma_x \sigma_y \\ \rho \, \sigma_x \sigma_y & \sigma^2_y \\ \end{pmatrix}$ where $$\text{Cor}(x,y)= \rho$$. If $$x$$ and $$y$$ are independent then $$\rho=0$$ and $\boldsymbol \Sigma_{\text{indep}} = \begin{pmatrix} \sigma^2_x & 0 \\ 0 & \sigma^2_y \\ \end{pmatrix} \,.$ The product $\boldsymbol A= \boldsymbol \Sigma_{\text{indep}}^{-1} \boldsymbol \Sigma= \begin{pmatrix} 1 & \rho \frac{\sigma_y}{\sigma_x} \\ \rho \frac{\sigma_x}{\sigma_y} & 1 \\ \end{pmatrix}$ has trace $$\text{Tr}(\boldsymbol A) = 2$$ and determinant $$\det(\boldsymbol A) = 1-\rho^2$$.

With this the mutual information between $$x$$ and $$y$$ can be computed as $\begin{split} \text{MI}_{\text{norm}}(x, y) &= D_{\text{KL}}(N(\boldsymbol \mu, \boldsymbol \Sigma), N(\boldsymbol \mu,\boldsymbol \Sigma_{\text{indep}} )) \\ & = \frac{1}{2} \biggl\{ \text{Tr}\biggl(\boldsymbol \Sigma^{-1}_{\text{indep}} \boldsymbol \Sigma\biggr) - \log \det \biggl( \boldsymbol \Sigma^{-1}_{\text{indep}} \boldsymbol \Sigma\biggr) - 2 \biggr\} \\ & = \frac{1}{2} \biggl\{ \text{Tr}( \boldsymbol A) - \log \det( \boldsymbol A) - 2 \biggr\} \\ &= -\frac{1}{2} \log(1-\rho^2) \\ & \approx \frac{\rho^2}{2} \\ \end{split}$

Thus $$\text{MI}_{\text{norm}}(x,y)$$ is is a one-to-one function of the squared correlation $$\rho^2$$ between $$x$$ and $$y$$: For small values of correlation $$2 \, \text{MI}_{\text{norm}}(x,y) \approx \rho^2$$.

### 5.2.3 Mutual information between two normally distributed random vectors

The mutual information $$\text{MI}_{\text{norm}}(\boldsymbol x,\boldsymbol y)$$ between two multivariate normal random vector $$\boldsymbol x$$ and $$\boldsymbol y$$ can be computed in a similar fashion as in the bivariate case.

Let $$\boldsymbol z= (\boldsymbol x, \boldsymbol y)^T$$ with dimension $$d=p+q$$. The joint multivariate normal distribution is characterised by the mean $$\text{E}(\boldsymbol z) = \boldsymbol \mu= (\boldsymbol \mu_x^T, \boldsymbol \mu_y^T)^T$$ and the covariance matrix $\boldsymbol \Sigma= \begin{pmatrix} \boldsymbol \Sigma_{\boldsymbol x} & \boldsymbol \Sigma_{\boldsymbol x\boldsymbol y} \\ \boldsymbol \Sigma_{\boldsymbol x\boldsymbol y}^T & \boldsymbol \Sigma_{\boldsymbol y} \\ \end{pmatrix} \,.$ If $$\boldsymbol x$$ and $$\boldsymbol y$$ are independent then $$\boldsymbol \Sigma_{\boldsymbol x\boldsymbol y} = 0$$ and $\boldsymbol \Sigma_{\text{indep}} = \begin{pmatrix} \boldsymbol \Sigma_{\boldsymbol x} & 0 \\ 0 & \boldsymbol \Sigma_{\boldsymbol y} \\ \end{pmatrix} \, .$ The product $\begin{split} \boldsymbol A& = \boldsymbol \Sigma_{\text{indep}}^{-1} \boldsymbol \Sigma= \begin{pmatrix} \boldsymbol I_p & \boldsymbol \Sigma_{\boldsymbol x}^{-1} \boldsymbol \Sigma_{\boldsymbol x\boldsymbol y} \\ \boldsymbol \Sigma_{\boldsymbol y}^{-1} \boldsymbol \Sigma_{\boldsymbol y\boldsymbol x} & \boldsymbol I_q \\ \end{pmatrix} \\ & = \begin{pmatrix} \boldsymbol I_p & \boldsymbol V_{\boldsymbol x}^{-1/2} \boldsymbol P_{\boldsymbol x}^{-1} \boldsymbol P_{\boldsymbol x\boldsymbol y} \boldsymbol V_{\boldsymbol y}^{1/2} \\ \boldsymbol V_{\boldsymbol y}^{-1/2} \boldsymbol P_{\boldsymbol y}^{-1} \boldsymbol P_{\boldsymbol y\boldsymbol x} \boldsymbol V_{\boldsymbol x}^{1/2} & \boldsymbol I_q \\ \end{pmatrix}\\ \end{split}$ has trace $$\text{Tr}(\boldsymbol A) = d$$ and determinant $\begin{split} \det(\boldsymbol A) & = \det( \boldsymbol I_p - \boldsymbol K\boldsymbol K^T ) \\ &= \det( \boldsymbol I_q - \boldsymbol K^T \boldsymbol K) \\ \end{split}$ with $$\boldsymbol K= \boldsymbol P_{\boldsymbol x}^{-1/2} \boldsymbol P_{\boldsymbol x\boldsymbol y} \boldsymbol P_{\boldsymbol y}^{-1/2}$$. With $$\lambda_1, \ldots, \lambda_m$$ the singular values of $$\boldsymbol K$$ (i.e. the canonical correlations between $$\boldsymbol x$$ and $$\boldsymbol y$$) we get $\det(\boldsymbol A) = \prod_{i=1}^m (1-\lambda_i^2)$

The mutual information between $$\boldsymbol x$$ and $$\boldsymbol y$$ is then $\begin{split} \text{MI}_{\text{norm}}(\boldsymbol x, \boldsymbol y) &= D_{\text{KL}}(N(\boldsymbol \mu, \boldsymbol \Sigma), N(\boldsymbol \mu,\boldsymbol \Sigma_{\text{indep}} )) \\ & = \frac{1}{2} \biggl\{ \text{Tr}\biggl( \boldsymbol \Sigma^{-1}_{\text{indep}} \boldsymbol \Sigma\biggr) - \log \det \biggl( \boldsymbol \Sigma^{-1}_{\text{indep}} \boldsymbol \Sigma\biggr) - d \biggr\} \\ & = \frac{1}{2} \biggl\{ \text{Tr}( \boldsymbol A) - \log \det( \boldsymbol A) - d \biggr\} \\ &=-\frac{1}{2} \sum_{i=1}^m \log(1-\lambda_i^2)\\ \end{split}$

Note that $$\text{MI}_{\text{norm}}(\boldsymbol x, \boldsymbol y)$$ is the sum of the MIs resulting from the individual canonical correlations $$\lambda_i$$ with the same functional form as in the bivariate normal case.

By comparison with the squared Rozeboom vector correlation coefficient $$\rho^2_{\boldsymbol x\boldsymbol y}$$ we recognize that $\text{MI}_{\text{norm}}(\boldsymbol x,\boldsymbol y) = -\frac{1}{2} \log(1 - \rho^2_{\boldsymbol x\boldsymbol y} ) \approx \frac{1}{2} \rho^2_{\boldsymbol x\boldsymbol y}$ Thus, in the multivariate case $$\text{MI}_{\text{norm}}(\boldsymbol x\boldsymbol y)$$ has again exactly the same functional relationship with the vector correlation $$\rho^2_{\boldsymbol x, \boldsymbol y}$$ as the $$\text{MI}_{\text{norm}}(x, y)$$ for two univariate variables with squared Pearson correlation $$\rho^2$$.

Thus, Rozebooms vector correlation is directly linked to mutual information for jointly multivariate normally distributed variables.

### 5.2.4 Using MI for variable selection

In principle, MI can be computed for any distribution and model and thus applies to both normal and non-normal models, and to both linear and nonlinear relationships.

In very general way we may denote by $$F_{\boldsymbol y|\boldsymbol x}$$ we denote a predictive model for $$\boldsymbol y$$ conditioned on $$\boldsymbol x$$ and $$F_{\boldsymbol y}$$ is the marginal distribution of $$\boldsymbol y$$ without predictors. Note that the predictive model can assume any form (incl. nonlinear). Typically $$F_{\boldsymbol y|\boldsymbol x}$$ is a complex model and $$F_{\boldsymbol y}$$ a simple model (no predictors).

Then mutual information between $$\boldsymbol x$$ and $$\boldsymbol y$$ can be also understood as expectated KL divergence between the conditional and marginal distributions: $\text{E}_{F_{\boldsymbol x}}\, D_{\text{KL}}(F_{\boldsymbol y|\boldsymbol x}, F_{\boldsymbol y} ) = \text{MI}(\boldsymbol x, \boldsymbol y)$

This can be shown as follows. The KL divergence between $$F_{\boldsymbol y|\boldsymbol x}$$ and $$F_{\boldsymbol y}$$ is given by $D_{\text{KL}}(F_{\boldsymbol y|\boldsymbol x}, F_{\boldsymbol y} ) = \text{E}_{F_{\boldsymbol y|\boldsymbol x}} \log\biggl( \frac{f(\boldsymbol y|\boldsymbol x) }{ f(\boldsymbol y)} \biggr) \, ,$ which is a random variable since it depends on $$\boldsymbol x$$. Taking the expectation with regard to $$F_{\boldsymbol x}$$ (the distribution of $$\boldsymbol x$$) we get $\text{E}_{F_{\boldsymbol x}} D_{\text{KL}}(F_{\boldsymbol y|\boldsymbol x}, F_{\boldsymbol y} ) = \text{E}_{F_{\boldsymbol x}} \text{E}_{F_{\boldsymbol y|\boldsymbol x}} \log \biggl(\frac{ f(\boldsymbol y|\boldsymbol x) f(\boldsymbol x) }{ f(\boldsymbol y) f(\boldsymbol x) } \biggr) = \text{E}_{F_{\boldsymbol x,\boldsymbol y}} \log \biggl(\frac{ f(\boldsymbol x,\boldsymbol y) }{ f(\boldsymbol y) f(\boldsymbol x) } \biggr) = \text{MI}(\boldsymbol x,\boldsymbol y) \,.$

Because of this link of MI with conditioning the MI between response and predictor variables is often used for variable and feature selection in general models.

### 5.2.5 Other measures of general dependence

Besides mutual information there are others measures of general dependence between multivariate random variables.

The two most important ones that have been proposed in the recent literature are i) distance correlation and ii) the maximal information coefficient (MIC and $$\text{MIC}_e$$).

## 5.3 Graphical models

### 5.3.1 Purpose

Graphical models combine features from

• graph theory
• probability
• statistical inference

The literature on graphical models is huge, we focus here only on two commonly used models:

• DAGs (directed acyclic graphs), all edges are directed, no directed loops (i.e. no cycles, hence “acyclic”)
• GGM (Gaussian graphical models), all edges are undirected

Graphical models provide probabilistic models for trees and for networks, with random variables represented by nodes in the graphs, and branches representing conditional dependencies. In this regard they generalise both the tree-based clustering approaches as well as the probabilistic non-hierarchical methods (GMMs).

However, the class of graphical models goes much beyond simple unsupervised learning models. It also includes regression, classification, time series models etc. See e.g. the reference book by Murphy (2012).

### 5.3.2 Basic notions from graph theory

• Mathematically, a graph $$G = (V, E)$$ consists of a a set of vertices or nodes $$V = \{v_1, v_2, \ldots\}$$ and a set of branches or edges $$E = \{ e_1, e_2, \ldots \}$$.
• Edges can be undirected or directed.
• Graphs containing only directed edges are directed graphs, and likewise graphs containing only undirected edges are called undirected graphs. Graphs containing both directed and undirected edges are called partially directed graphs.
• A path is a sequence of of vertices such that from each of its vertices there is an edge to the next vertex in the sequence.
• A graph is connected when there is a path between every pair of vertices.
• A cycle is a path in a graph that connects a node with itself.
• A connected graph with no cycles is a called a tree.
• The degree of a node is the number of edges it connects with. If edges are all directed the degree of a node is the sum of the in-degree and out-degree, which counts the incoming and outgoing edges, respectively.
• External nodes are nodes with degree 1. In a tree-structed graph these are also called leafs.

Some notions are only relevant for graphs with directed edges:

• In a directed graph the parent node(s) of vertex $$v$$ is the set of nodes $$\text{pa}(v)$$ directly connected to $$v$$ via edges directed from the parent node(s) towards $$v$$.
• Conversely, $$v$$ is called a child node of $$\text{pa}(v)$$. Note that a parent node can have several child nodes, so $$v$$ may not be the only child of $$\text{pa}(v)$$.
• In a directed tree graph, each node has only a single parent, except for one particular node that has no parent at all (this node is called the root node).
• A DAG, or directed acyclic graph, is a directed graph with no directed cycles. A (directed) tree is a special version of a DAG.

### 5.3.3 Probabilistic graphical models

A graphical model uses a graph to describe the relationship between random variables $$x_1, \ldots, x_d$$. The variables are assumed to have a joint distribution with density/mass function $$\text{Pr}(x_1, x_2, \ldots, x_d)$$. Each random variable is placed in a node of the graph.

The structure of the graph and the type of the edges connecting (or not connecting) any pair of nodes/variables is used to describe the conditional dependencies, and to simplify the joint distribution.

Thus, a graphical model is in essence a visualisation of the joint distribution using structural information from the graph helping to understand the mutual relationship among the variables.

### 5.3.4 Directed graphical models

In a directed graphical model the graph structure is assumed to be a DAG (or a directed tree, which is also a DAG).

Then the joint probability distribution can be factorised into a product of conditional probabilities as follows: $\text{Pr}(x_1, x_2, \ldots, x_d) = \prod_i \text{Pr}(x_i | \text{pa}(x_i))$ Thus, the overall joint probability distribution is specified by local conditional distributions and the graph structure, with the directions of the edges providing the information about parent-child node relationships.

Probabilistic DAGs are also known as “Bayesian networks”.

Idea: by trying out all possible trees/graphs and fitting them to the data using maximum likelihood (or Bayesian inference) we hope to be able identify the graph structure of the data-generating process.

Challenges

1. in the tree/network the internal nodes are usually not known, and thus have to be treated as latent variables.

Answer: To impute the states at these nodes we may use the EM algorithm as in GMMs (which in fact can be viewed as graphical models, too!).

1. If we treat the internal nodes as unknowns we need to marginalise over the internal nodes, i.e. we need to sum / integrate over all possible set of states of the internal nodes!

Answer: This can be handled very effectively using the Viterbi algorithm which is essentially an application of the generalised distributive law. In particular for tree graphs this means that the summations occurs locally at each nodes and propagates recursively accross the tree.

1. In order to infer the tree or network structure the space of all trees or networks need to be explored. This is not possible in an exhaustive fashion unless the number of variables in the tree is very small.

Answer: Solution: use heuristic approaches for tree and network search!

1. Furthermore, there exist so-called “equivalence classes” of graphical models, i.e. sets of graphical models that share the same joint probability distribution. Thus, all graphical models within the same equivalence class cannot be distinguished from observational data, even with infinite sample size!

Answer: this is a fundamental mathematical problem of identifiability so there is now way around this issue. However, on the positive side, this also implies that the search through all graphical models can be restricted to finding the so-called “essential graph” (e.g. https://projecteuclid.org/euclid.aos/1031833662 )

Conclusion: using directed graphical models for structure discovery is very time consuming and computationally demanding for anything but small toy data sets.

This also explains why heuristic and non-model based approaches (such as hierarchical clustering) are so popular even though full statistical modelling is in principle possible.

### 5.3.5 Undirected graphical models

Another class of graphical models are models that contain only undirected edges. These undirected graphical models are used to represent the pairwise conditional (in)dependencies among the variables in the graph, and the resulting model is therefore also called conditional independence graph.

If $$x_i$$ and $$x_j$$ two selected random variables/nodes, and the set $$\{x_k\}$$ represents all other variables/nodes with $$k\neq i$$ and $$k \neq j$$. We say that variables $$x_i$$ and $$x_j$$ are conditionally independent given all the other variables $$\{x_k\}$$ $x_i \perp\!\!\!\perp x_j | \{x_k\}$ if the joint probability density of $$x_i, x_j$$ and $$x_k$$ factorises as $\text{Pr}(x_1, x_2, \ldots, x_d) = \text{Pr}(x_i | \{x_k\}) \text{Pr}(x_j | \{x_k\}) \text{Pr}(\{x_k\}) \,.$ or equivalently $\text{Pr}(x_i, x_j | \{x_k\}) = \text{Pr}(x_i | \{x_k\}) \text{Pr}(x_j | \{x_k\}) \,.$

In a corresponding conditional independence graph, there is no edge between $$x_i$$ and $$x_j$$, as in such a graph missing edges correspond to conditional independence between the respective non-connected nodes.

#### 5.3.5.1 Gaussian graphical model

Assuming that $$x_1, \ldots, x_d$$ are jointly normally distributed, i.e. $$\boldsymbol x\sim N(\boldsymbol \mu, \boldsymbol \Sigma)$$, it turns out that it is straightforward to identify the pairwise conditional independencies. From $$\boldsymbol \Sigma$$ we first obtain the precision matrix $\boldsymbol \Omega= (\omega_{ij}) = \boldsymbol \Sigma^{-1} \,.$ Crucially, it can be shown that $$\omega_{ij} = 0$$ implies $$x_i \perp\!\!\!\perp x_j | \{ x_k \}$$! Hence, from the precision matrix $$\boldsymbol \Omega$$ we can directly read off all the pairwise conditional independencies among the variables $$x_1, x_2, \ldots, x_d$$!

Often, the covariance matrix $$\boldsymbol \Sigma$$ is dense (few zeros) but the corresponding precision matrix $$\boldsymbol \Omega$$ is sparse (many zeros).

The conditional independence graph computed for normally distributed variables is called a Gaussian graphical model, or GGM. A further alternative name is covariance selection model.

### 5.3.6 Algorithm for learning GGMs

From the above we can devise a simple algorithm to to learn Gaussian graphical model (GGM) from data:

1. Estimate covariance $$\hat{\boldsymbol \Sigma}$$ (in such a way that it is invertible!)
2. Compute corresponding partial correlations
3. If $$\hat{\rho}_{ij|\text{rest}} \approx 0$$ then there is (approx). conditional independence between $$x_i$$ and $$x_j$$.
In practise this is done by statistical testing for vanishing partial correlations. If there are many edges we also need adjustment for simultaneous multiple testing since all edges are tested in parallel.

### 5.3.7 Example: exam score data (Mardia et al 1979:)

Correlations (rounded to 2 digits):

##            mechanics vectors algebra analysis statistics
## mechanics       1.00    0.55    0.55     0.41       0.39
## vectors         0.55    1.00    0.61     0.49       0.44
## algebra         0.55    0.61    1.00     0.71       0.66
## analysis        0.41    0.49    0.71     1.00       0.61
## statistics      0.39    0.44    0.66     0.61       1.00

Partial correlations (rounded to 2 digits):

##            mechanics vectors algebra analysis statistics
## mechanics       1.00    0.33    0.23     0.00       0.02
## vectors         0.33    1.00    0.28     0.08       0.02
## algebra         0.23    0.28    1.00     0.43       0.36
## analysis        0.00    0.08    0.43     1.00       0.25
## statistics      0.02    0.02    0.36     0.25       1.00

Note that that there are no zero correlations but there are four partial correlations close to 0, indicating conditional independence between:

• analysis and mechanics,
• statistics and mechanics,
• analysis and vectors, and
• statistics and vectors.

Thus, of 10 possible edges four are missing, and thus the conditional independence graph looks as follows:

Mechanics      Analysis
|     \    /    |
|    Algebra    |
|     /   \     |
Vectors      Statistics