A Gibbs sampler for multivariate linear regression

Adam B. Mantz

Monthly Notices of the Royal Astronomical Society, Volume 457, Issue 2, 01 April 2016, Pages 1279–1288, https://doi.org/10.1093/mnras/stv3008
Abstract
Kelly described an efficient algorithm, using Gibbs sampling, for performing linear regression in the fairly general case where non-zero measurement errors exist for both the covariates and response variables, where these measurements may be correlated (for the same data point), where the response variable is affected by intrinsic scatter in addition to measurement error, and where the prior distribution of covariates is modelled by a flexible mixture of Gaussians rather than assumed to be uniform. Here, I extend the Kelly algorithm in two ways. First, the procedure is generalized to the case of multiple response variables. Secondly, I describe how to model the prior distribution of covariates using a Dirichlet process, which can be thought of as a Gaussian mixture where the number of mixture components is learned from the data. I present an example of multivariate regression using the extended algorithm, namely fitting scaling relations of the gas mass, temperature, and luminosity of dynamically relaxed galaxy clusters as a function of their mass and redshift. An implementation of the Gibbs sampler in the r language, called lrgs, is provided.
1 INTRODUCTION
Linear regression is perhaps the most widely used example of parameter fitting throughout the sciences. Yet, the traditional ordinary least-squares (or weighted least-squares) approach to regression neglects some features that are practically ubiquitous in astrophysical data, namely the existence of measurement errors, often correlated with one another, on all quantities of interest, and the presence of residual, intrinsic scatter (i.e. physical scatter, not the result of measurement errors) about the best fit. Kelly (2007, hereafter K07) takes on this problem (see that work for a more extensive overview of the prior literature) by devising an efficient algorithm for simultaneously constraining the parameters of a linear model and the intrinsic scatter in the presence of such heteroscedastic and correlated measurement errors. In addition, the K07 approach corrects a bias that exists when the underlying distribution of covariates in a regression is assumed to be uniform, by modelling this distribution as a flexible mixture of Gaussian (normal) distributions and marginalizing over it.
The K07 model is considerably more complex, in terms of the number of free parameters, than traditional regression. Nevertheless, it can be efficiently constrained using a fully conjugate Gibbs sampler, as described in that work. Briefly, the approach takes advantage of the fact that, for a suitable model, the fully conditional posterior of certain parameters (or blocks of parameters)1 may be expressible as a known distribution which can be sampled from directly using standard numerical techniques. If all model parameters can be sampled this way, then a Gibbs sampler, which simply cycles through the list of parameters, updating or block-updating them in turn, can move efficiently through the parameter space. By repeatedly Gibbs sampling, a Markov chain that converges to the joint posterior distribution of all model parameters is generated (see e.g. Gelman et al. 2004 for theoretical background). The individual pieces (e.g. the model distributions of measurement error, intrinsic scatter, and the covariate prior distribution) of the K07 model are conjugate, making it suitable for this type of efficient Gibbs sampling. This is a key advantage in terms of making the resulting algorithm widely accessible to the community, since conjugate Gibbs samplers, unlike more general and powerful Markov Chain Monte Carlo samplers, require no a priori tuning by the user.
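To make the mechanics concrete, the sketch below (in r, the language of lrgs) Gibbs samples the simplest possible case: one covariate, one response, no measurement errors or latent variables, a uniform prior on the coefficients, and a Jeffreys prior on the scatter. The conditional posteriors are then a bivariate normal and a scaled inverse-χ2, and the sampler just alternates direct draws from the two. This is an illustration of the idea only, not the K07 or lrgs algorithm.

```r
## Minimal conjugate Gibbs sketch for y = alpha + beta*x + N(0, sigma2).
## Illustrative only; the K07/lrgs algorithm additionally handles measurement
## errors, multiple responses, and the covariate distribution.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)       # synthetic data
X <- cbind(1, x)                          # design matrix (intercept + slope)
XtX.inv <- solve(t(X) %*% X)
ab.hat <- XtX.inv %*% t(X) %*% y          # ordinary least-squares estimate

nsteps <- 2000
chain <- matrix(NA, nsteps, 3, dimnames = list(NULL, c("alpha", "beta", "sigma2")))
sigma2 <- var(y)                          # crude starting value
for (step in 1:nsteps) {
  ## draw (alpha, beta) | sigma2, data: normal, centred on the OLS solution
  ab <- ab.hat + t(chol(sigma2 * XtX.inv)) %*% rnorm(2)
  ## draw sigma2 | (alpha, beta), data: scaled inverse-chi-square from the residuals
  rss <- sum((y - X %*% ab)^2)
  sigma2 <- rss / rchisq(1, df = n)
  chain[step, ] <- c(ab, sigma2)
}
colMeans(chain)                           # posterior means; should lie near the input values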
While K07 argue against the assumption of a uniform prior for covariates, it should be noted that the alternative of a Gaussian mixture model (or the Dirichlet process generalization introduced below) is not necessarily applicable in every situation either. When a well-motivated physical model of the distribution of covariates exists, it may well be preferable to use it, even at the expense of computational efficiency. In the general case, we can hope that a flexible parametrization like the Gaussian mixture is adequate, although it is always worth checking a posteriori that the model distribution of covariates provides a good description of the data. K07 and Sereno & Ettori (2015) discuss real applications in which a Gaussian distribution of covariates turns out to be adequate, despite the underlying physics being non-Gaussian.
This work describes two useful generalizations to the K07 algorithm. First, the number of response variables is allowed to be greater than one. Secondly, the prior distribution of covariates may be modelled using a Dirichlet process rather than as a mixture of Gaussians with a fixed number of components. A Dirichlet process describes a probability distribution over the space of probability distributions, and (in contrast to the many parameters required to specify a large mixture model) is described only by a concentration parameter and a base distribution. For the choice of a Gaussian base distribution, used here, the Dirichlet process can be thought of as a Gaussian mixture in which the number of mixture components is learned from the data and marginalized over as the fit progresses (see more discussion, in a different astrophysical context, by Schneider et al. 2015). This makes it a very general and powerful alternative to the standard fixed-size Gaussian mixture, as well as one that requires even less tuning by the user, since the number of mixture components need not be specified. Crucially, both of these generalizations preserve the conjugacy of the model, so that posterior samples can still be easily obtained by Gibbs sampling.
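A toy illustration of the last point: drawing cluster assignments for n points from a Chinese restaurant process (one standard way to realize a Dirichlet process) shows that the number of occupied clusters, K, varies from realization to realization rather than being fixed in advance, and grows only slowly with n. This sketch is purely illustrative and is not part of lrgs.

```r
## Draw cluster labels for n points from a Chinese restaurant process with
## concentration kappa; the number of distinct labels, K, varies between draws.
crp.draw <- function(n, kappa) {
  G <- integer(n)
  G[1] <- 1
  for (i in 2:n) {
    counts <- tabulate(G[1:(i - 1)])               # occupancy of existing clusters
    G[i] <- sample(length(counts) + 1, 1, prob = c(counts, kappa))
  }
  G
}
table(replicate(1000, max(crp.draw(100, kappa = 1))))   # distribution of K for n = 100
```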
Of course, K07 (or this paper) does not provide the only implementation of conjugate Gibbs sampling, nor is that approach the only one possible for linear regression in the Bayesian context. Indeed, there exist more general statistical packages capable of identifying conjugate sampling strategies (where possible) based on an abstract model definition (e.g. bugs, jags, and stan). The use of more general Markov chain sampling techniques naturally allows for more general (non-conjugate) models and/or parametrizations (e.g. Maughan 2014; Robotham & Obreschkow 2015). Nevertheless, there is something appealing in the relative simplicity of implementation and use of the conjugate Gibbs approach, particularly as it applies so readily to the commonly used linear model with Gaussian scatter.
Section 2 describes the model employed in this work in more detail, and introduces notation. Section 3 outlines the changes to the K07 sampling algorithm needed to accommodate the generalizations above. Since this work is intended to extend that of K07, I confine this discussion only to steps which differ from that algorithm, and do not review the Gibbs sampling procedure in its entirety. However, the level of detail is intentionally high; between this document and K07, it should be straightforward for the interested reader to create his or her own implementation of the entire algorithm. Section 4 provides some example analyses, including one with real astrophysical data, and discusses some practical aspects of the approach.
The complete algorithm described here (with both Gaussian mixture and Dirichlet process models) has been implemented in the r language. The package is named Linear Regression by Gibbs Sampling (lrgs), the better to sow confusion among extragalactic astronomers. The code can be obtained from GitHub or the Comprehensive r Archive Network.
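For completeness, the CRAN route is a one-liner; installation from GitHub also works but requires the repository address, which is not reproduced here.

```r
## Install from CRAN and load; see the GitHub repository for the development version.
install.packages("lrgs")
library(lrgs)
```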
2 MODEL AND NOTATION
Here, I review the model described by K07, introducing the generalization to multiple response variables (Section 2.1) and the use of the Dirichlet process to describe the prior distribution of the covariates (Section 2.2). The notation used here is summarized in Table 1; it differs slightly from that of K07, as noted. In this document, A ∼ B denotes a stochastic relationship in which a random variable A is drawn from the probability distribution B, and boldface distinguishes vector- or matrix-valued variables.
Table 1. Summary of the notation used in this work. The final column gives the corresponding K07 notation where it differs.

| | Symbol | Meaning | K07 |
|---|---|---|---|
| General notation | $\mathcal{N}_\nu(\boldsymbol{\mu},\boldsymbol{\Sigma})$ | ν-dimensional normal distribution (mean $\boldsymbol{\mu}$, covariance $\boldsymbol{\Sigma}$) | |
| | $\mathcal{W}(\mathbf{V},\nu)$ | Wishart distribution (scale matrix $\mathbf{V}$, ν degrees of freedom) | |
| | $A_{ij}$ | Single element of matrix $\mathbf{A}$ | |
| | $\boldsymbol{A}_{j\cdot},\boldsymbol{A}_{\cdot j}$ | jth row or column of $\mathbf{A}$ | |
| | $\boldsymbol{A}_{\bar{j}\cdot},\boldsymbol{A}_{\cdot\bar{j}}$ | $\mathbf{A}$ with the jth row or column removed | $\boldsymbol{A}_{-j\cdot},\boldsymbol{A}_{\cdot-j}$ |
| | $\mathbf{1}_n$ | n × n identity matrix | |
| Common parameters | n | Number of data points | |
| | p | Number of covariates | |
| | m | Number of responses | 1 |
| | K | Number of Gaussian mixture components or clusters | |
| | $\boldsymbol{x}_i,\boldsymbol{y}_i$ | Measured covariates and responses for data point i | |
| | $\boldsymbol{M}_i$ | Measurement covariance matrix for data point i | $\boldsymbol{\Sigma}_i$ |
| | $\boldsymbol{\xi}_i,\boldsymbol{\eta}_i$ | True covariates and responses for data point i | |
| | $\boldsymbol{\alpha}$ | Intercepts of the linear model | |
| | $\boldsymbol{\beta}$ | Slopes of the linear model | |
| | $\boldsymbol{\Sigma}$ | Intrinsic covariance about the linear model | $\sigma^2$ |
| | $\boldsymbol{G}$ | Mixture component/cluster identification for each data point | |
| Gaussian mixture | $\boldsymbol{\pi}$ | Weight of each mixture component | |
| | $\boldsymbol{\mu}_k$ | Mean of the kth component | |
| | $\boldsymbol{T}_k$ | Covariance of the kth component | |
| | $\boldsymbol{\mu}_0$ | Mean of the prior distribution of each $\boldsymbol{\mu}_k$ | |
| | $\boldsymbol{U}$ | Covariance of the prior distribution of each $\boldsymbol{\mu}_k$ | |
| | $\boldsymbol{W}$ | Scale matrix of the prior distribution of each $\boldsymbol{T}_k$ | |
| Dirichlet process | $\boldsymbol{\mu}$ | Mean of the normal base distribution | |
| | $\boldsymbol{T}$ | Covariance of the normal base distribution | |
| | κ | Concentration parameter of the process | |
| | a | Shape parameter of the prior of κ | |
| | b | Rate parameter of the prior of κ | |
2.1 Gaussian mixture model
2.2 Dirichlet process model
In practice, for a given realization of the model parameters, the algorithm for realizing the Dirichlet process divides the data set into a finite number of clusters, with points in each cluster having identical values of |$\boldsymbol {\xi }$|.12 The vector of labels |$\boldsymbol {G}$| will identify which cluster each data point belongs to, similarly to its use in Section 2.1. A vector of cluster proportions, |$\boldsymbol {\pi }$|, could also be defined analogously. However, in practice, the procedure for Gibbs sampling the Dirichlet process model implicitly marginalizes over it, and so |$\boldsymbol {\pi }$| never explicitly appears in the calculations (Section 3.2).
3 THE GIBBS SAMPLER
Both of the models described above can be efficiently Gibbs sampled because they are fully conjugate. Recall that, in this situation, the sampling algorithm can be entirely specified as the set of conditional distributions used to sequentially update each parameter or block of parameters. Section 3.1 summarizes the changes to the K07 procedure needed to sample the Gaussian mixture model when there are multiple response variables, and Section 3.2 describes the procedure for sampling the Dirichlet process model.
In either case, an initial guess is needed for most of the free parameters, but this need not be very sophisticated. For example, it is generally acceptable to begin with the values of |$\lbrace \boldsymbol {\xi} _i\rbrace$| and |$\lbrace \boldsymbol {\eta} _i\rbrace$|, respectively, initialized to the measured values |$\lbrace \boldsymbol {x}_i\rbrace$| and |$\lbrace \boldsymbol {y}_i\rbrace$|, the intercepts |$\boldsymbol {\alpha }$| set to the average value of each column of |$\boldsymbol {Y}$|, and the slopes |$\boldsymbol {\beta }$| set to zero.13 Of course, more intelligent guesses will decrease the ‘burn-in’ time of the resulting Markov chain, but generally this is relatively short.
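A minimal sketch of such a starting point, using assumed variable names rather than the actual lrgs internals (here x is an n × p matrix of measured covariates and y an n × m matrix of measured responses, filled with random numbers only so that the snippet runs on its own):

```r
## Simple initialization: latent values start at the measurements, intercepts at
## the column means of Y, slopes at zero.
n <- 100; p <- 2; m <- 3
x <- matrix(rnorm(n * p), n, p)   # stand-in measured covariates
y <- matrix(rnorm(n * m), n, m)   # stand-in measured responses

xi    <- x                        # initial true covariates = measured covariates
eta   <- y                        # initial true responses = measured responses
alpha <- colMeans(y)              # intercepts: mean of each response column
beta  <- matrix(0, p, m)          # slopes: zero
```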
3.1 Sampling the Gaussian mixture model
The procedure for updating the parameters governing the distribution of covariates (|$\boldsymbol {G}$|, |$\boldsymbol {\pi }$|, |$\lbrace \boldsymbol {\mu} _k\rbrace$|, |$\lbrace \boldsymbol {T}_k\rbrace$|, |$\boldsymbol {\mu _0}$|, |$\boldsymbol {U}$| and |$\boldsymbol {W}$|, for which I adopt the same priors as K07) is not affected by the generalization to multiple responses, and the reader is referred to K07 for the details of those updates. Here, I review the procedure for Gibbs sampling the true values of the covariates and responses for each data point, |$\lbrace \boldsymbol {\xi} _i\rbrace$| and |$\lbrace \boldsymbol {\eta} _i\rbrace$|, the regression coefficients, |$\boldsymbol {\alpha }$| and |$\boldsymbol {\beta }$|, and the intrinsic covariance matrix, Σ.
3.1.1 Updating the covariates
3.1.2 Updating the responses
3.1.3 Updating the coefficients
Note that it is straightforward to sample from the product of equation (19) and a normal prior for |$\boldsymbol {\widetilde{B}}$|; this option is implemented in lrgs, although the default is a uniform prior.
3.1.4 Updating the intrinsic covariance
In practice, a sample can be generated by setting |$\boldsymbol {\Sigma }$| equal to |$(\boldsymbol {A}^{\mathrm{T}}\boldsymbol {A})^{-1}$|, where |$\boldsymbol {A}$| is (n − 1) × m and each row of |$\boldsymbol {A}$| is generated as |$\boldsymbol {A}_{i\cdot }\sim \mathcal {N}_m[\boldsymbol {0},(\boldsymbol {E}^{\mathrm{T}}\boldsymbol {E})^{-1}]$|.
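In r, that prescription might look like the following sketch (assumed variable names, not the lrgs code; E stands for the n × m matrix of residuals of the responses about the current linear model, and mvrnorm from the MASS package supplies the multivariate normal draws):

```r
## One draw of the intrinsic covariance from its inverse-Wishart conditional,
## following the construction described above.
library(MASS)                               # for mvrnorm
n <- 100; m <- 3
E <- matrix(rnorm(n * m), n, m)             # placeholder for the residuals eta_i - alpha - beta' xi_i
EtE.inv <- solve(t(E) %*% E)
A <- mvrnorm(n - 1, mu = rep(0, m), Sigma = EtE.inv)   # (n - 1) x m, rows ~ N_m(0, (E'E)^{-1})
Sigma <- solve(t(A) %*% A)                  # Sigma ~ inverse-Wishart(E'E, n - 1)
```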
3.2 Sampling the Dirichlet process model
If a Dirichlet process rather than a Gaussian mixture is used to describe the prior distribution of covariates, the procedure to update the |$\lbrace \boldsymbol {\xi} _i\rbrace$| differs from that given above. This step implicitly updates |$\boldsymbol {G}$|, which now identifies membership in one of a variable number of clusters (data points with identical values of |$\boldsymbol {\xi }$|). In addition, Gibbs updates to the hyperparameters of the Dirichlet process and its base distribution, κ, |$\boldsymbol {\mu }$| and |$\boldsymbol {T}$|, are possible. These are described below. Note that the updates to |$\boldsymbol {G}$|, |$\boldsymbol {\pi }$|, |$\lbrace \boldsymbol {\mu} _k\rbrace$|, |$\lbrace \boldsymbol {T}_k\rbrace$|, |$\boldsymbol {\mu _0}$|, |$\boldsymbol {U}$| and |$\boldsymbol {W}$| given in K07 are no longer applicable (of these, only |$\boldsymbol {G}$| and |$\boldsymbol {\pi }$| exist in the model).
3.2.1 Updating the covariates
Let K be the number of clusters (i.e. distinct labels in |$\boldsymbol {G}$|) at a given time. I follow the second algorithm given by Neal (2000), which first updates the cluster membership for each data point, and then draws new values of |$\boldsymbol {\xi }$| for each cluster.
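The following sketch illustrates the bookkeeping of this type of update in a deliberately stripped-down setting: one dimension, no regression structure, data z scattered about per-cluster values ξ with known variance s2, and a normal base distribution N(mu, Tb). The variable names and the simplifications are mine; the real update in lrgs acts on the multivariate latent covariates described above.

```r
## One sweep of a Neal (2000, algorithm 2)-style update in a simplified 1-D model:
## first reassign each point's cluster, then redraw each cluster's value xi.
set.seed(2)
z <- c(rnorm(20, -5), rnorm(20, 0), rnorm(20, 5))   # toy data with three groups
n <- length(z)
s2 <- 1.0; mu <- 0; Tb <- 25; kappa <- 1.0          # known scatter, base distribution, concentration

G  <- rep(1, n)                                     # start with a single cluster...
xi <- mean(z)                                       # ...whose value is the sample mean

for (i in 1:n) {
  ## remove point i from its cluster; drop the cluster if it is now empty
  counts <- tabulate(G[-i], nbins = max(G))
  keep   <- which(counts > 0)
  xi     <- xi[keep]
  G      <- match(G, keep)
  G[i]   <- NA
  counts <- counts[keep]
  ## membership weights: existing clusters use their current xi; a new cluster
  ## integrates xi over the base distribution
  p.old <- counts * dnorm(z[i], xi, sqrt(s2))
  p.new <- kappa * dnorm(z[i], mu, sqrt(s2 + Tb))
  G[i] <- sample(length(xi) + 1, 1, prob = c(p.old, p.new))
  if (G[i] > length(xi)) {                          # new cluster: draw its xi from the
    v <- 1 / (1 / Tb + 1 / s2)                      # posterior given this point alone
    xi <- c(xi, rnorm(1, v * (mu / Tb + z[i] / s2), sqrt(v)))
  }
}
## finally, redraw xi for every cluster given all of its current members
for (cl in seq_along(xi)) {
  members <- which(G == cl)
  v <- 1 / (1 / Tb + length(members) / s2)
  xi[cl] <- rnorm(1, v * (mu / Tb + sum(z[members]) / s2), sqrt(v))
}
```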
3.2.2 Updating the Dirichlet process concentration
3.2.3 Updating the base distribution hyperparameters
4 EXAMPLES
This section provides two example applications of the methods discussed above: one using a toy model and one using a real astrophysical data set.
4.1 Toy model
Consider the case of a single covariate, generated by three distinct Gaussian components, and a single response variable. Table 2 shows the specific model parameters used to generate the data, and the toy data set is shown in the left-hand panel of Fig. 1. Because the Gaussians generating the covariates are not especially well separated compared to their widths, the presence of three populations is not striking, although a histogram of the measured covariates is suggestive of the underlying structure (centre panel of Fig. 1).

Figure 1. Left: the simulated data used to fit the toy model in Section 4.1. Centre: histogram of the simulated covariates, along with the three-Gaussian-mixture distribution from which they are drawn. Right: the simulated data, with colours/symbols reflecting the cluster assignments of the Dirichlet model at one step in the fit.
Table 2. Model parameters used to generate the toy data set in Section 4.1. The distribution of covariates is taken to be a mixture of 3 Gaussians.

| Parameter | Value |
|---|---|
| n, p, m, K | 100, 1, 1, 3 |
| All $\boldsymbol{M}_i$ | $\mathbf{1}_2$ |
| α | 0 |
| β | 1 |
| Σ | 9 |
| All $\pi_k$ | 1/3 |
| $\mu_k$ | 5(k − 2) |
| All $T_k$ | 1 |
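For concreteness, data with the parameters in Table 2 could be simulated along the following lines (the actual data set used here is provided as toy.dat in the supporting information; reading the measurement covariances as the 2 × 2 identity, i.e. unit, uncorrelated errors on x and y, is my interpretation of the table):

```r
## Simulate a data set with the Table 2 parameters: three equally weighted Gaussian
## components for the covariate, zero intercept, unit slope, intrinsic variance 9,
## and unit measurement variance on both coordinates.
set.seed(3)
n <- 100
k   <- sample(3, n, replace = TRUE)             # mixture component, pi_k = 1/3
xi  <- rnorm(n, mean = 5 * (k - 2), sd = 1)     # true covariates: mu_k = 5(k - 2), T_k = 1
eta <- 0 + 1 * xi + rnorm(n, sd = 3)            # alpha = 0, beta = 1, Sigma = 3^2
x <- xi  + rnorm(n, sd = 1)                     # measured values with M_i = identity
y <- eta + rnorm(n, sd = 1)
```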
Suppose we had a physical basis for a three-component model (or suspected three components, by inspection), but wanted to allow for the possibility of more or less structure than a strict Gaussian mixture provides. The Dirichlet process supplies a way to do this. For a given κ and n, the distribution of K is known,15 so in principle a prior expectation for the number of clusters, say 3 ± 1, can be roughly translated into a Gamma prior on κ. Here, I instead adopt an uninformative prior on κ (Dorazio 2009), and compare the results to those of a Gaussian mixture model with K = 3.
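The translation from a target number of clusters to κ can also be done with the simpler, standard result for the expected number of clusters, E(K | κ, n) = Σ_{i=1}^{n} κ/(κ + i − 1), rather than the full distribution quoted in the footnote; a sketch:

```r
## Expected number of Dirichlet-process clusters as a function of kappa, used to
## find the range of kappa consistent with a prior guess of roughly 2-4 clusters.
expected.K <- function(kappa, n) sum(kappa / (kappa + seq_len(n) - 1))
kappas <- 10^seq(-2, 1, length.out = 200)
EK <- sapply(kappas, expected.K, n = 100)
range(kappas[EK > 2 & EK < 4])     # kappa values giving E(K) between 2 and 4 for n = 100
```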
Using the Dirichlet process model, results from a chain of 1000 Gibbs samples (discarding the first 10) are shown as shaded histograms in Fig. 2.16 The results are consistent with the input model values (vertical, dashed lines) for the parameters of interest (α, β, and Σ). The latent parameters describing the base distribution of the Dirichlet process are also consistent with the toy model, although they are poorly constrained. The right-hand panel of Fig. 1 shows the cluster assignments for a sample with K = 6 (the median of the chain); the clustered nature of the data is recognized, although the number of clusters tends to exceed the number of components in the input model.17 An equivalent analysis using a mixture of three Gaussians rather than a Dirichlet process model produces very similar constraints on the parameters of interest (hatched histograms in Fig. 2).

Figure 2. Histograms of parameter samples using a Dirichlet process (blue shaded) or three-Gaussian mixture (hatched) prior on the distribution of covariates, for the toy-model analysis of Section 4.1. Dashed vertical lines indicate the input values used to generate the data set. The hyperparameters of the Gaussian mixture model are not shown; these correctly converge to the mean and width of the three mixture components.
4.2 Scaling relations of relaxed galaxy clusters
As a real-life astrophysical example, I consider the scaling relations of dynamically relaxed galaxy clusters, using measurements presented by Mantz et al. (2015). Note that there are a number of subtleties in the interpretation of these results that will be discussed elsewhere; here the problem is considered only as an application of the method presented in this work.

Figure 3. Scatter plots showing the distribution of measured covariates and responses for the p = 2, m = 3 problem of fitting galaxy cluster scaling relations described in Section 4.2. An ellipse illustrates the measurement covariance for the most massive (largest x2) cluster in each panel. The particular combinations of x and y plotted are conventional (cf. equation 34).
Because the redshifts are measured with very small uncertainties, this problem is not well suited to the Dirichlet process prior; intuitively, the number of clusters in the Dirichlet process must approach the number of data points because the data are strongly inconsistent with one another (i.e. are not exchangeable). Instead, I use a Gaussian mixture prior with K = 3, and verify that in practice the results are not sensitive to K (the parameters of interest differ negligibly from an analysis with K = 1).
Marginalized two-dimensional constraints on the power-law slopes of each scaling relation are shown in the top row of Fig. 4 (68.3 and 95.4 per cent confidence). On inspection, only the luminosity scaling relation appears to be in any tension with the expectation in equation (34), having a preference for a weaker dependence on E(z) and a stronger dependence on M. These conclusions are in good agreement with a variety of earlier work (e.g. Reiprich & Böhringer 2002; Zhang et al. 2007, 2008; Rykoff et al. 2008; Pratt et al. 2009; Vikhlinin et al. 2009; Leauthaud et al. 2010; Mantz et al. 2010; Reichert et al. 2011; Sereno & Ettori 2015; see also the review of Giodini et al. 2013).

Figure 4. Top row: joint 68.3 and 95.4 per cent confidence regions on the slope parameters of each of the scaling relations in equation (34). Black circles indicate the self-similar expectation given in equation (34). Bottom left: posterior distributions for the marginal intrinsic scatter parameters of the model. Bottom right: posteriors for the off-diagonal intrinsic scatter terms, expressed as correlation coefficients.
The posterior distributions of the elements of the multidimensional intrinsic covariance matrix are shown in the bottom row of Fig. 4, after transforming to marginal scatter (square root of the diagonal) and correlation coefficients (for the off-diagonal elements). The intrinsic scatters of Mgas and kT at fixed M and z are in good agreement with other measurements in the literature (see Allen, Evrard & Mantz 2011; Giodini et al. 2013, and references therein); the scatter of L is lower than the ∼40 per cent typically found, likely because this analysis uses a special set of morphologically similar clusters rather than a more representative sample. The correlation coefficients are particularly challenging to measure, and the constraints are relatively poor. Nevertheless, the ability to efficiently place constraints on the full intrinsic covariance matrix is an important feature of this analysis. Within uncertainties, these results agree well with the few previous constraints on these correlation coefficients in the literature (Mantz et al. 2010; Maughan 2014). The best-fitting intrinsic covariance matrix is illustrated visually in Fig. 5, which compares it to the residuals of |$\boldsymbol {y}$| with respect to the best-fitting values of |$\boldsymbol {x}$| and the best-fitting scaling relations.
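The transformation referred to here is just the usual covariance-to-correlation conversion, e.g. (with a purely illustrative matrix):

```r
## Convert a sampled intrinsic covariance matrix into the quantities plotted in
## Fig. 4: marginal scatters (sqrt of the diagonal) and correlation coefficients.
Sigma <- matrix(c(0.010, 0.005, 0.020,
                  0.005, 0.020, 0.010,
                  0.020, 0.010, 0.090), 3, 3)   # illustrative values only
scatter <- sqrt(diag(Sigma))                     # marginal intrinsic scatter of each response
rho     <- cov2cor(Sigma)                        # off-diagonal elements are the correlations
```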

Figure 5. Residuals of |$\boldsymbol {y}$| with respect to the best-fitting values of |$\boldsymbol {x}$| and the best-fitting scaling relations. For clarity, measurement errors are shown as orthogonal bars, even though the measurement covariances are non-trivial. In particular, the ‘by-eye’ positive correlation in the left-hand panel is due to a positive correlation in the measurement uncertainties. Shaded ellipses correspond to 1σ, 2σ, and 3σ intrinsic scatter (in the two dimensions shown in each panel), according to the best-fitting intrinsic covariance matrix.
5 SUMMARY
I have generalized the Bayesian linear regression method described by K07 to the case of multiple response variables, and included a Dirichlet process model of the distribution of covariates (equivalent to a Gaussian mixture whose complexity is learned from the data). The algorithm described here is implemented independently of the linmix_err idl code of K07 as an r package called lrgs, which is publicly available. Two examples, respectively using a toy data set and real astrophysical data, are presented.
A number of further generalizations are possible. In principle, significant complexity can be added to the model of the intrinsic scatter in the form of a Gaussian mixture or Dirichlet process model (with a Gaussian base distribution) while maintaining conjugacy of the conditional posteriors, and thereby the efficiency of the Gibbs sampler. The case of censored data (upper limits on some measured responses) is discussed by K07. This situation, or, more generally, non-Gaussian measurement errors, can be handled by rejection sampling (at the expense of efficiency) but is not yet implemented in lrgs. Also of interest is the case of truncated data, where the selection of the data set depends on one of the response variables, and the data are consequently an incomplete and biased subset of a larger population. This case can in principle be handled by modelling the selection function and imputing the missing data (Gelman et al. 2004). lrgs is shared publicly on GitHub, and I hope that users who want more functionality will be interested in helping develop the code further.
ACKNOWLEDGEMENTS

The addition of Dirichlet process modelling to this work was inspired by extensive discussions with Michael Schneider and Phil Marshall. Anja von der Linden did some very helpful beta testing. I acknowledge support from the National Science Foundation under grant AST-1140019.
FOOTNOTES

1. i.e. the posterior distribution for certain parameters conditional on the (fixed) values of all other parameters.

8. Note that in K07, y preceded x, as they correspond to rows and columns of M. The reverse convention is followed here.

9. In the notation used here, Gi is simply a label 1, 2, …, K, whereas K07 describe each Gi as a vector with all but one element zero. This distinction makes no practical difference.

10. While the inverse-Wishart distribution is conjugate in this context, and therefore computationally convenient, it has the generic disadvantage of imposing a particular structure on the marginal prior distributions of the variances and correlation coefficients that make up the resulting covariance matrix. In particular, large absolute values of the correlation coefficients preferentially correspond to large variances. This is sometimes undesirable, and from a practical standpoint it can occasionally result in the generation of computationally singular matrices. An alternative approach is to decompose a given covariance matrix, |$\boldsymbol {\Lambda }$|, as |$\bf{SRS}$|, where |$\bf{S}$| is diagonal and |$\bf{R}$| is a correlation matrix. This allows independent priors to be adopted for individual variances and correlation coefficients, which is usually more intuitive than using the inverse-Wishart distribution, but at the expense that the resulting model is no longer conjugate. More extensive discussion of these options can be found in Barnard, McCulloch & Meng (2000), Gelman et al. (2004), and O'Malley & Zaslavsky (2008).

11. The conjugate prior for |$\boldsymbol {\mu _0}$| is normal, |$\mathcal {N}_p(\boldsymbol {u},\boldsymbol {V})$|, which is uniform in the limit |$\boldsymbol {V^{-1}}=\boldsymbol {0}$|. For |$\boldsymbol {W}$|, the conjugate prior is inverse-Wishart, |$\mathcal {W}^{-1}(\boldsymbol {\Psi }, \nu )$|, and the equivalent of the uniform distribution is realized by taking |$\boldsymbol {\Psi }=\boldsymbol {0}$| and ν = −(1 + d), where d is the size of |$\boldsymbol {\Psi }$| (in this case, d = p).

12. A pitfall of this approach occurs when few of the measured covariates are consistent with any others within their measurement errors. In that case, the number of clusters is necessarily similar in size to the number of data points, which is not generally the desired result.

13. Specifically, this simpleminded guess for the intercepts and slopes works reasonably well when the covariates have been approximately centred. More generally, estimates from an ordinary least-squares regression should provide a good starting point.

14. This expression assumes a Jeffreys (i.e. minimally informative) prior on Σ. More generally, one could use a prior |$\boldsymbol {\Sigma } \sim \mathcal {W}^{-1}(\boldsymbol {\Psi }, \nu _0)$|, in which case the conditional posterior becomes |$\boldsymbol {\Sigma } | \ldots \sim \mathcal {W}^{-1}\left(\boldsymbol {E}^{\mathrm{T}}\boldsymbol {E}+\boldsymbol {\Psi },\, n+\nu _0\right)$|. The Jeffreys prior corresponds to |$\boldsymbol {\Psi }=\boldsymbol {0}$| and ν0 = −1, while |$\boldsymbol {\Psi }=\boldsymbol {0}$| and ν0 = −(1 + m) corresponds to a prior that is uniform in |$\left|\boldsymbol {\Sigma }\right|$| (see e.g. Gelman et al. 2004). The K07 algorithm makes the latter assumption. The default in lrgs is the Jeffreys prior, but any inverse-Wishart prior can optionally be specified.

15. Specifically, K|n, κ ∼ s(n, K) κ^K Γ(κ)/Γ(κ + n), where s is an unsigned Stirling number of the first kind (Antoniak 1974).

16. For this particularly simple problem, the chain converges to the target distribution almost immediately. Comparing the first and second halves of the chain (or multiple independent chains), the Gelman–Rubin R statistic is <1.01 for every parameter. The autocorrelation length is also very short, ≲10 steps for every parameter.

17. We should generically expect this, since it is entirely possible for a mixture of many Gaussians to closely resemble a mixture with fewer components; e.g. in Fig. 2, we see that two of the six clusters are populated by single data points that are not outliers. The reverse is not true, and here it is interesting that the Dirichlet process cannot fit the data using fewer than K = 3 clusters (Fig. 2).

18. Galaxy clusters, not to be confused with the clusters of data points arising in the sampling of the Dirichlet process.

19. Here, I take L to be measured in a soft X-ray band, in practice 0.1–2.4 keV. Since the emissivity in this band is weakly dependent on temperature for hot clusters such as those in the data set, the resulting scaling relation has a shallower dependence on mass than the more familiar bolometric luminosity–mass relation, Lbol ∝ E(z)[E(z)M]^{4/3}. The exponents in the L scaling of equation (34) are specific to the chosen energy band.
SUPPORTING INFORMATION
Additional Supporting Information may be found in the online version of this article:
toy.dat