Abstract

This paper presents a model-based method for clustering multivariate binary observations that incorporates constraints consistent with the scientific context. The approach is motivated by the precision medicine problem of identifying autoimmune disease patient subsets, or classes, who may require different treatments. We start with a family of restricted latent class models, or RLCMs. However, in the motivating example and many others like it, the unknown number of classes and the definition of classes using binary states are among the targets of inference. We take a Bayesian approach to RLCMs, placing informative prior assumptions on the number and definitions of the latent classes that are consistent with scientific knowledge, so that the posterior distribution tends to concentrate on smaller numbers of clusters and sparser binary patterns. The paper derives a posterior sampling algorithm based on Markov chain Monte Carlo with split-merge updates to efficiently explore the space of clustering allocations. Through simulations under the assumed model and under realistic deviations from it, we demonstrate greater interpretability of results and superior finite-sample clustering performance for our method compared to common alternatives. The methods are illustrated with an analysis of protein data to detect clusters representing autoantibody classes among scleroderma patients.

1 Introduction

Autoantibodies are the immune system's response to specific cellular protein complexes or "machines" (e.g., Rosen and Casciola-Rosen, 2016). Differential immune responses mounted toward the machines may lead to strikingly different clinical trajectories between autoimmune disease patients. Using autoantibodies as probes, a precision medicine goal is to discover novel autoimmune disease patient subsets that require different treatments before major clinical manifestations arise. This paper is motivated by the need, first, to formulate a model that respects the scientific constraint that the immune system responds to machines, that is, to all of a machine's component proteins rather than to individual proteins, and second, based on the model, to perform clustering in a Bayesian framework using autoantibody test data.

To illustrate the scientific constraint, Figure 1 shows a hypothetical patient with binary state vector $\eta_i = (1, 0, 1)$ in the middle panel, indicating that her immune system produced autoantibodies to the proteins (autoantigens) in Machines 1 and 3 but not Machine 2. The right panel of Figure 1 shows an illustrative example of three different machines ($M = 3$) with orthogonal machine profiles, though the proposed approach will cover general nonorthogonal cases. The binary matrix Q specifies which proteins constitute each machine. We refer to the rows of Q as "machine profiles," where $Q_{m\ell} = 1$ if protein ℓ is a component of machine m. The left panel shows $\Gamma_{i\ell}$, which represents the error-free true presence or absence of autoantibody against protein $\ell = 1, \ldots, L$ for subject $i = 1, \ldots, N$. The entire row $\Gamma_{i\cdot}$ is highlighted for subject i. The multivariate binary data for subject i ($Y_{i\cdot}$, not shown in the figure) are $\Gamma_{i\cdot}$ measured with error. The statistical goal is to incorporate the scientific constraint when clustering patients into subsets with distinct responses to the machines. The choice of binary rather than continuous measures of autoantibodies is specific to the gel electrophoresis and autoradiogram (GEA) technology that produced the data, where the amount of protein is not informative about clustering patients (Wu et al., 2019).

Figure 1. The scientific constraint means that immune responses are mounted toward all protein components in each machine rather than toward individual proteins. For illustrative purposes, the machine profiles here are orthogonal, so the true presence or absence of proteins can be represented by binary matrix factorization; the proposed approach covers general nonorthogonal cases. (This figure appears in color in the electronic version of this article, and any mention of color refers to that version.)

The knowledge that the immune system attacks machines of multiple proteins rather than single proteins is why we refer to this approach as scientifically structured clustering (SSC). We use "scientific structure" to refer to latent binary variables that may be interpreted as having a scientific meaning. For example, in the protein data example, each element of the latent binary random vector $\eta_i$ represents the presence or absence of a machine. The scientific meaning of a machine is defined by its component proteins indicated by the corresponding row of Q. SSC for multivariate binary data has a number of potential applications beyond the example here. For example, in cognitive diagnosis, the latent states represent basic cognitive abilities and the 1s in Q represent the test items requiring each ability (e.g., Junker and Sijtsma, 2001); in epidemiology, the latent states can indicate unobserved disease-causing agents and the 1s in Q represent the molecular targets tied to each causative agent (e.g., Wu et al., 2016). Most importantly, in all these applications, the resulting clusters conform to the existing scientific context and therefore can be used to address relevant questions.

As detailed below, the model proposed to address the motivating patient clustering problem is a member of the family of restricted latent class models, or RLCMs (e.g., Xu and Shang, 2018). In our motivating example, however, the possible combinations of machines that the immune systems can target in the population of patients are often unknown. This means not knowing the set of distinct population-level latent state patterns $\mathcal{A} \subseteq \{0,1\}^M$: neither its cardinality $|\mathcal{A}|$ nor its elements. In addition, the definitions of the machine profiles Q are often unknown.

In addressing the motivating problem, this paper makes a primary contribution to the literature on RLCMs. In a Bayesian framework, we allow an unknown number of classes in a mixture of finite mixtures (MFM; Miller and Harrison, 2018) framework where each component has multivariate discrete parameters, for example, the binary states in our context. We then derive a posterior sampling algorithm based on Markov chain Monte Carlo (MCMC) featuring split-merge updates that directly and efficiently sample the clustering allocations. The algorithm conveniently overcomes the hard constraint that distinct classes carry distinct state vectors ($\eta^*_k \neq \eta^*_{k'}$ for $k \neq k'$) by postprocessing: sampled clusters having the same sampled state vectors are merged. The model works with an unknown number of classes, an unknown set of state patterns $\mathcal{A}$, and an unknown Q-matrix. We use informative prior assumptions on the number and definitions of latent classes that are consistent with scientific knowledge, so that the posterior distribution tends to concentrate on fewer clusters with sparser latent state patterns.

The rest of the paper is organized as follows: Section 2 specifies the proposed model and prior distribution. Section 3 derives the MCMC algorithm for posterior inference. Section 4.1 presents simulation studies demonstrating the superior clustering performance of the proposed method compared to three common alternatives. Section 4.2 illustrates the methods with an analysis of protein data for subsetting scleroderma patients. The paper concludes with a discussion of model extensions and limitations.

2 Models

First formulated by Lazarsfeld (1950), latent class models (LCMs) have become an important tool for modeling multivariate discrete responses (e.g., Goodman, 1974; Dunson and Xing, 2009) and model-based clustering (e.g., Vermunt and Magidson, 2002). Alternative parametric approaches focus on underlying continuous variable specifications (e.g., Albert and Chib, 1993), which incorporate latent multivariate Gaussian random variables that are linked to the categorical observations through thresholding. However, these models do not directly perform clustering. In this paper, we focus on extensions of the classical LCMs to incorporate scientific constraints when clustering multivariate binary responses.

Notation. Let $Y_{i\cdot} = (Y_{i1}, \ldots, Y_{iL})^\top \in \{0,1\}^L$ be the vector of binary measurements for subject $i = 1, \ldots, N$ and feature $\ell = 1, \ldots, L$. Let $Y = \{Y_{i\ell}\}$ collect all the observations into an $N \times L$ binary data matrix. Let $\Gamma_{i\ell} \in \{0,1\}$ indicate the true presence/absence of feature ℓ for observation i. Let Γ be an $N \times L$ binary matrix with $(i, \ell)$-th element $\Gamma_{i\ell}$, referred to as the design matrix. We use $\Gamma_{i\cdot}$ and $\Gamma_{\cdot\ell}$ to represent the i-th row and the ℓ-th column, respectively. Let $\eta_i = (\eta_{i1}, \ldots, \eta_{iM})^\top \in \{0,1\}^M$ represent the unobserved latent states for subject $i = 1, \ldots, N$. Let $\eta_i$ lie in an unknown population state space $\mathcal{A} \subseteq \{0,1\}^M$. We use $|\mathcal{A}|$ to denote the cardinality of $\mathcal{A}$. As a result, there are $|\mathcal{A}| \leq 2^M$ distinct patterns of latent states. We refer to $\mathcal{A} = \{0,1\}^M$, which contains all binary patterns, as "saturated" and otherwise "unsaturated." We collect the latent states into an $N \times M$ matrix H with $(i, m)$-th entry $\eta_{im}$. Let Q be an unknown $M \times L$ binary matrix with each row specifying the scientific meaning of each latent state. Let $Q_{m\cdot}$ and $Q_{\cdot\ell}$ represent row m and column ℓ of Q, respectively. Let $[A \mid B]$ represent the conditional distribution of a random quantity A given another random quantity B; when B is omitted, $[A]$ represents the marginal distribution of A. Let $\mathbb{1}\{A\}$ be an indicator function that equals 1 if statement A is true and 0 otherwise.

2.1 Proposed Model

We specify the model for the motivating example in two steps: (i) impose scientific structure upon the actual presence or absence of proteins ($\Gamma_{i\ell}$) and (ii) parameterize the measurement likelihood $[Y_{i\ell} \mid \Gamma_{i\ell}]$. The first step is needed to respect existing biological knowledge in the scientific context, and the second step characterizes the measurement process.

Let the true presence or absence of feature ℓ for subject $i = 1, \ldots, N$ be

$\Gamma_{i\ell} = \mathbb{1}\left\{\sum_{m=1}^{M} \eta_{im} Q_{m\ell} > 0\right\}. \qquad (1)$

To emphasize the dependence of $\Gamma_{i\ell}$ on Q and $\eta_i$ for $\ell = 1, \ldots, L$, we write the row vector $\Gamma_{i\cdot} = \Gamma(\eta_i, Q)$ so that $\Gamma_{i\ell} = \Gamma(\eta_i, Q_{\cdot\ell})$. In addition, let $\Gamma^{\mathcal{A}} = \{\Gamma(\eta, Q) : \eta \in \mathcal{A}\}$ represent an $|\mathcal{A}| \times L$ binary matrix with each row representing the true presence or absence of features $\ell = 1, \ldots, L$ for a subject having a particular latent state pattern in $\mathcal{A}$.
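To make the structure in (1) concrete, the following minimal R sketch (function and variable names are ours for illustration, not the interface of the accompanying rewind package) generates Γ from a latent state matrix H and a machine profile matrix Q:

# Minimal sketch of model (1): Gamma_{il} = 1{ sum_m eta_{im} * Q_{ml} > 0 }.
# H: N x M binary latent states; Q: M x L binary machine profiles.
gamma_from_HQ <- function(H, Q) {
  1L * ((H %*% Q) > 0)  # logical OR over machines via the matrix product
}

# Toy example mirroring Figure 1: M = 3 orthogonal machines, L = 9 proteins.
Q <- rbind(c(1, 1, 1, 0, 0, 0, 0, 0, 0),
           c(0, 0, 0, 1, 1, 1, 0, 0, 0),
           c(0, 0, 0, 0, 0, 0, 1, 1, 1))
H <- rbind(c(1, 0, 1),  # a subject with Machines 1 and 3
           c(0, 1, 0))  # a subject with Machine 2 only
gamma_from_HQ(H, Q)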

Next, we specify the measurement error model:

$\theta_\ell = P(Y_{i\ell} = 1 \mid \Gamma_{i\ell} = 1), \qquad (2)$
$\psi_\ell = P(Y_{i\ell} = 1 \mid \Gamma_{i\ell} = 0), \qquad (3)$

independently for feature $\ell = 1, \ldots, L$ and subject $i = 1, \ldots, N$. Borrowing terminology from the statistical literature on diagnostic testing, we characterize the stochastic discrepancies between the actual ($\Gamma_{i\ell}$) and observed ($Y_{i\ell}$) presence/absence of autoantibodies by sensitivity ($\theta_\ell$) and specificity ($1 - \psi_\ell$). Let $\theta = (\theta_1, \ldots, \theta_L)^\top$ and $\psi = (\psi_1, \ldots, \psi_L)^\top$. In this paper, we assume $\theta_\ell > \psi_\ell$, because the diagnostic test based on immunoprecipitation (IP) is known to be both sensitive and specific. The sensitivity for a given protein is assumed to be the same regardless of which machine(s) it comes from. Importantly, both the sensitivities and specificities can vary across proteins.
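Continuing the toy example above, a minimal R sketch of the measurement process (2) to (3), with hypothetical sensitivity and specificity values chosen only for illustration:

# Simulate Y from Gamma under (2)-(3): feature-specific sensitivities
# theta and false positive rates psi (specificity = 1 - psi).
simulate_Y <- function(Gamma, theta, psi) {
  N <- nrow(Gamma); L <- ncol(Gamma)
  prob <- Gamma * matrix(theta, N, L, byrow = TRUE) +
    (1 - Gamma) * matrix(psi, N, L, byrow = TRUE)
  matrix(rbinom(N * L, size = 1, prob = prob), N, L)
}

set.seed(1)
theta <- rep(0.95, 9)  # sensitive assay, as assumed for IP
psi   <- rep(0.05, 9)  # specific assay
Y <- simulate_Y(gamma_from_HQ(H, Q), theta, psi)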

Equations (1) to (3) assume that the probability of observing a protein, given that it is present, does not depend on how many machines it is present in; that is, it depends only on $\Gamma_{i\ell}$ and otherwise not on the value of $\sum_m \eta_{im} Q_{m\ell}$. In our application, this choice is driven by the technology used: the ability to detect a protein does not depend on how many machines produced it, and the amount is irrelevant for clustering patients. Further, the model assumes a binary response/nonresponse to each protein, rather than quantifying the response as a real value. This modeling choice follows the convention that clinicians interpret the presence or absence of bands corresponding to each protein on the images obtained from GEA (Wu et al., 2019).

Equations (1) to (3) are related to some existing models proposed in cognitive diagnosis (e.g., Templin and Henson, 2006, with known Q) and epidemiology (e.g., Wu et al., 2016), which are examples of restricted LCMs (RLCMs; Xu, 2017). In addition, imposing symmetric error rates, $1 - \theta_\ell = \psi_\ell$ for all ℓ, results in the one-layer model of Rukat et al. (2017). The model can also be viewed as Boolean matrix factorization (BMF; Miettinen et al., 2008), because model (1) is equivalent to $\Gamma_{i\ell} = \vee_{m=1}^{M} (\eta_{im} \wedge Q_{m\ell})$, where the logical "OR" operator "∨" outputs one if any argument equals one. The rows in Q are basis patterns for compactly encoding the L-dimensional vector $\Gamma_{i\cdot}$ by the $M$ ($\ll L$) bits in $\eta_i$. When Q has orthogonal rows, BMF further reduces to nonnegative matrix factorization (NMF), $\Gamma = HQ$ (Lee and Seung, 1999). Zhang et al. (2007) proposed optimization methods to obtain the best factorization, while Meeds et al. (2007) and Ni et al. (2019) took nonparametric Bayes approaches. In contrast to these works, we add a layer of mixture model in (5) to (10) that provides additional parsimony and posterior inference on the number of classes.

2.2 Identifiability Considerations for Posterior Algorithm Design

In general, identifiability conditions based on the model likelihood can guide the design of simulation studies by informing the choice of simulation truths that are statistically identifiable. In real data analyses, where the truth is not expected to be known even if the model is a reasonable approximation, identifiability conditions can still dramatically reduce the size of the parameter space.

Specifically, under a fixed M and a possibly unsaturated $\mathcal{A}$, Gu and Xu (2019a) provided sufficient conditions on $\Gamma^{\mathcal{A}}$ for strictly identifying the mixing weights, the measurement error parameters $(\theta, \psi)$, and $\Gamma^{\mathcal{A}}$ itself. Although $\Gamma^{\mathcal{A}}$ is identifiable, Q may only be identified up to its equivalence classes, where $Q_1$ and $Q_2$ are equivalent if they produce the same $\Gamma^{\mathcal{A}}$, with equality holding elementwise.

In our simulations and data application, we restrict the inference algorithm to the set of $M \times L$ binary Q-matrices:

$\mathcal{Q} = \left\{Q \in \{0,1\}^{M \times L} : Q = P_1 \left[I_M \mid I_M \mid Q^*\right] P_2 \ \text{for some binary } Q^*\right\}, \qquad (4)$

where $P_1$ and $P_2$ are M- and L-dimensional permutation matrices and $I_M$ is the $M \times M$ identity matrix. First, under a saturated $\mathcal{A}$, the conditions in (4) are based only on Q, are easy to check, and are necessary and sufficient for statistical identification of $\theta$, $\psi$, the mixing weights, and Q (Gu and Xu, 2019b). Second, the constraint $Q \in \mathcal{Q}$ also greatly facilitates posterior sampling by focusing on a small subset of binary matrices. In fact, among all M by L binary matrices, the fraction belonging to $\mathcal{Q}$ is small and quickly decays as the number of machines M increases.
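Membership in $\mathcal{Q}$ is inexpensive to verify. The R sketch below checks the two-identity-submatrix structure in our reconstruction of (4), namely that each machine has at least two proteins belonging to it and to no other machine; it is a simplified illustration, not the check used by the rewind package:

# Check whether Q (M x L) contains two disjoint I_M submatrices among
# its columns, up to row/column permutations: every machine must own
# at least two "anchor" proteins that load on no other machine.
has_two_identity_blocks <- function(Q) {
  anchor <- colSums(Q) == 1                        # columns with a single 1
  owner  <- apply(Q[, anchor, drop = FALSE], 2, which.max)
  all(tabulate(owner, nbins = nrow(Q)) >= 2)
}

has_two_identity_blocks(Q)            # TRUE for the toy Q above
has_two_identity_blocks(Q[, -(1:2)])  # FALSE: Machine 1 keeps one anchor only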

We now turn to inferring individual latent states based on the complete-data likelihood. Even given Q, conditions for identifying H exist but may fall short of ensuring consistent estimation of H, because the number of unknowns in H diverges as the sample size increases. Consistent estimation requires extra conditions, for example, that the number of measurements L increases with the sample size (Chiu et al., 2009). In finite samples and dimensions, we address this issue in a Bayesian framework by encouraging H to be a priori of low complexity, that is, to have few classes of distinct and sparse latent state vectors.

Finally, in real data analysis, incorporating informative priors, for example, a partially known Q or high sensitivities and specificities, may alleviate theoretical identifiability issues (e.g., Wu et al., 2016). The prior uncertainty for nonidentified parameters propagates into the posterior and does not vanish even as the sample size approaches infinity (e.g., Kadane, 1975).

2.3 Priors

We first specify a prior for allocating observations into clusters and then a prior for the multivariate binary state vectors in the clusters.

Prior with an unknown number of classes. Though the terms are used interchangeably by many authors, we first distinguish between a "component," which represents one of the true mixture components in the specification of a mixture model (referred to as "classes" in LCMs and this paper), and a "cluster," which represents a block of observations grouped together. Let K be the number of mixture components in the population and T the number of clusters in the sample (Miller and Harrison, 2018). We have $T \leq K$; that is, there may exist $K - T$ empty components not realized in a random sample.

Additional notation is needed for our prior specification. Let $Z_i \in \{1, \ldots, K\}$, $i = 1, \ldots, N$, be the component indicators and $Z = (Z_1, \ldots, Z_N)^\top$. Let $C_k = \{i : Z_i = k\}$ be the subjects in component k, and let $\mathcal{C} = \{C_k : |C_k| > 0\}$ be the T observed clusters. Note that the partition $\mathcal{C}$ is invariant to component relabeling. Let $\mathcal{C}_{-i}$ be the clusters obtained after excluding subject i. Let $Y_C = \{Y_{i\cdot} : i \in C\}$ be the data in cluster $C \in \mathcal{C}$. Finally, let $\eta^*_k = (\eta^*_{k1}, \ldots, \eta^*_{kM})^\top$ be the latent state vector for component $k = 1, \ldots, K$.

We specify a prior distribution of clustering allocations $\mathcal{C}$ as follows:

$K \sim p_K, \qquad (5)$
$(\pi_1, \ldots, \pi_K) \mid K \sim \text{Dirichlet}(\gamma, \ldots, \gamma), \qquad (6)$
$Z_i \mid \pi, K \overset{\text{iid}}{\sim} \text{Categorical}(\pi_1, \ldots, \pi_K), \quad i = 1, \ldots, N, \qquad (7)$

where $p_K$ is a distribution over the number of components $\{1, 2, \ldots\}$, e.g., a Geometric distribution. The induced prior for $\mathcal{C}$ is

$p(\mathcal{C}) = V_N(T) \prod_{C \in \mathcal{C}} \gamma^{(|C|)},$

where $V_N(t) = \sum_{k=1}^{\infty} \frac{k_{(t)}}{(\gamma k)^{(N)}}\, p_K(k)$, and $x^{(m)} = x(x+1)\cdots(x+m-1)$ and $x_{(m)} = x(x-1)\cdots(x-m+1)$ denote ascending and descending factorials, with $x^{(0)} = x_{(0)} = 1$ and $k_{(t)} = 0$ if $t > k$ (Miller and Harrison, 2018).
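For readers implementing this prior, the coefficients $V_N(t)$ can be computed by truncating the infinite series in a numerically stable way. A minimal R sketch, assuming for illustration a Geometric distribution for $p_K$:

# log V_N(t) = log sum_k { k_(t) / (gamma*k)^(N) } p_K(k), truncated at kmax.
# k_(t) is a descending factorial (zero for k < t); x^(N) is ascending.
log_VN <- function(N, t, gamma = 1, kmax = 500,
                   log_pK = function(k) dgeom(k - 1, prob = 0.1, log = TRUE)) {
  k <- t:kmax                                        # terms with k < t vanish
  log_terms <- lgamma(k + 1) - lgamma(k - t + 1) +   # log k_(t)
    lgamma(gamma * k) - lgamma(gamma * k + N) +      # -log (gamma*k)^(N)
    log_pK(k)
  m <- max(log_terms)
  m + log(sum(exp(log_terms - m)))                   # log-sum-exp
}

log_VN(N = 100, t = 5)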

Prior for $\eta^*_k$ in each component. We specify a prior distribution for the component-specific parameters $\eta^*_k$ to encourage sparser binary patterns:

$\eta^*_{km} \mid p_m \overset{\text{ind}}{\sim} \text{Bernoulli}(p_m), \quad k = 1, \ldots, K, \qquad (8)$
$p_m \mid c_1 \overset{\text{iid}}{\sim} \text{Beta}(c_1/M, 1), \qquad (9)$
$\eta_i = \eta^*_{Z_i}, \quad i = 1, \ldots, N, \qquad (10)$

independently for $m = 1, \ldots, M$. Equations (8) to (10) induce a marginal prior on $H^* = \{\eta^*_{km}\}$ upon integrating over the $p_m$ (Web Appendix A1.1), which is a truncated Indian Buffet Process (Ghahramani and Griffiths, 2006). In what follows, we set the remaining hyperparameters at values that offer good clustering results in simulations and data analysis. We reparameterize $c_1$ in terms of a quantity $\beta \in (0, 1)$ and specify a prior for β (Web Appendix A1.1).
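A draw from (8) to (9) is straightforward; the sketch below uses the Beta(c1/M, 1) form written above, which is our reading of the truncated Indian Buffet Process construction:

# Draw K component-specific state vectors eta*_k; small c1 relative to M
# favors sparse patterns (few active machines per class).
draw_H_star <- function(K, M, c1 = 1) {
  p <- rbeta(M, c1 / M, 1)                           # per-machine probabilities
  matrix(rbinom(K * M, size = 1, prob = rep(p, each = K)), K, M)
}

set.seed(2)
draw_H_star(K = 5, M = 3)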

Remark 1. A reviewer raised a question about model specification: given a finite M, if one follows (8) to (9) to sample the $\eta_i$'s directly instead of the $\eta^*_k$'s, there is already positive prior mass on the equality of two binary vectors of finite length M (hence subjects i and i′ belonging to the same cluster); why is it necessary to have an additional layer of clustering structure through the mixture model (5) to (7)? In this paper, we choose the mixture model specification because it has a prior on the number of classes, which allows additional parsimony by a priori encouraging few classes as well as posterior inference on the number of classes. Although $T \leq K$, K in the prior is not upper bounded. With an unbounded K, we can build on the algorithm of Miller and Harrison (2018) for inferring the number of classes. Because the component parameters are multivariate binary, we do need an additional step of merging clusters with the same latent states via post hoc processing at each MCMC iteration. The algorithm with merging is applicable to general MFM models with discrete component parameters. Web Appendix A1.2 discusses the merge operation.

For the rest of the parameters, let the prior for Q be the uniform distribution over $\mathcal{Q}$ in (4). For the measurement error parameters, we specify $\theta_\ell \sim \text{Beta}(a_\theta, b_\theta)$ and $\psi_\ell \sim \text{Beta}(a_\psi, b_\psi)$, independently for $\ell = 1, \ldots, L$. Taken together, the joint distribution of the data, latent variables, and parameters factorizes as

$[Y, \mathcal{C}, H^*, Q, \theta, \psi] = \left\{\prod_{i=1}^{N} [Y_{i\cdot} \mid \Gamma(\eta_i, Q), \theta, \psi]\right\} [\mathcal{C}]\,[H^*]\,[Q]\,[\theta]\,[\psi]. \qquad (11)$

Figure 2 shows a schematic representation of the hierarchical model via a directed acyclic graph (DAG) describing the relationships between the parameters and observations.

Figure 2. The directed acyclic graph representing the model structure. The quantities in squares are either data or hyperparameters; the unknown quantities are shown in circles. The double-stroke circle represents a multiplexer variable, which copies the value of one of its parent nodes chosen by the selector variable (shaded). The arrows connecting variables indicate that the parent parameterizes the distribution of the child node (solid lines) or completely determines the value of the child node (dotted arrows). The rectangular "plates" enclosing the variables indicate that a similar graphical structure is repeated over the index; the index in a plate runs over subjects, clusters, machines, or features. (This figure appears in color in the electronic version of this article, and any mention of color refers to that version.)

3 Posterior Inference

We develop inferential procedures to address three questions: (1) how many scientific clusters are in the data, (2) what are the unique latent states present in the sample, and (3) what is the latent state vector for each observation. To approximate the posterior distribution, we use an MCMC algorithm that enables inference for any function of the unknown parameters and latent variables (Gelfand and Smith, 1990). We focus on presenting the posterior sampling algorithm for a finite M, which is effective in our application when M is treated as an upper bound; the posterior algorithm for an infinite M is readily derived following Teh et al. (2007).

3.1 MCMC Algorithm

When the number of components K is unknown, one class of techniques updates component-specific parameters along with K. For example, reversible-jump MCMC (RJ-MCMC; Green, 1995) proposes an update to K along with updates to the model parameters, which are then jointly accepted or rejected. However, designing good proposals for high-dimensional component parameters can be nontrivial. Alternative approaches include direct sampling of K (e.g., Nobile and Fearnside, 2007; McCullagh and Yang, 2008). Here we build on the algorithm of Miller and Harrison (2018) for sampling clusters under a prior on the number of mixture components.

  • Step 0:

Initialize all parameters by draws from their priors, except for Q and the clustering allocations $\mathcal{C}$. For Q, we set all elements to zero except in the columns ℓ whose observed feature prevalence exceeds a threshold τ1; we initialize the elements in those columns by independent Bernoulli(p) draws, where p and τ1 are prespecified at values that work well in our simulations and data analysis. For $\mathcal{C}$, we initialize with one cluster comprised of all subjects.

For iterations $b = 1, \ldots, B$, repeat Steps 1–6:

  • Step 1:

Update $\mathcal{C}$ by directly sampling from its posterior, without the need to consider component parameters or empty components. For subject $i = 1, \ldots, N$, we assign subject i to an existing cluster $C \in \mathcal{C}_{-i}$ or to a new cluster of its own with probabilities

$P(i \text{ joins } C \mid \text{others}) \propto (|C| + \gamma)\, \frac{m(Y_{C \cup \{i\}})}{m(Y_C)}, \quad P(i \text{ forms a new cluster} \mid \text{others}) \propto \gamma\, \frac{V_N(|\mathcal{C}_{-i}| + 1)}{V_N(|\mathcal{C}_{-i}|)}\, m(Y_{\{i\}}), \qquad (12)$

where $m(Y_C)$ is the marginal likelihood of the data $Y_C$ and is computed in Web Appendix A2. The update (12) favors forming a cluster of its own if adding the i-th observation to any existing cluster fits poorly with the data in that cluster. Following the split-merge recipe of Jain and Neal (2004), which efficiently explores the large space of clustering allocations (see details in Web Appendix A3), we build on (12) to update $\mathcal{C}$ globally, creating or removing clusters for multiple subjects at a time.

  • Step 2:

Update $H^* = \{\eta^*_C\}$ by sampling from the full conditional distribution over $\{0,1\}^M$:

$P(\eta^*_C = \eta \mid \text{others}) \propto [\eta \mid \{p_m\}] \prod_{i \in C} [Y_{i\cdot} \mid \eta, Q, \theta, \psi],$

for cluster $C \in \mathcal{C}$, where $[\eta \mid \{p_m\}]$ is the prior (8) evaluated at η. Set $\eta_i = \eta^*_C$ for every subject $i \in C$.

At the end of the iteration, if some of the discrete mixture component parameters $\{\eta^*_C\}$ are duplicated, we postprocess the posterior samples by merging the clusters in $\mathcal{C}$ associated with identical latent states to obtain scientific clusters with distinct latent states; denote the scientific clusters by $\widetilde{\mathcal{C}}$ and their number by $\widetilde{T}$. A minimal sketch of this merge appears after the step list.

  • Step 3:

Update the measurement error parameters from their conjugate Beta full conditionals, for $\ell = 1, \ldots, L$:

$\theta_\ell \mid \text{others} \sim \text{Beta}\big(a_\theta + \sum_i \mathbb{1}\{Y_{i\ell} = 1, \Gamma_{i\ell} = 1\},\; b_\theta + \sum_i \mathbb{1}\{Y_{i\ell} = 0, \Gamma_{i\ell} = 1\}\big),$
$\psi_\ell \mid \text{others} \sim \text{Beta}\big(a_\psi + \sum_i \mathbb{1}\{Y_{i\ell} = 1, \Gamma_{i\ell} = 0\},\; b_\psi + \sum_i \mathbb{1}\{Y_{i\ell} = 0, \Gamma_{i\ell} = 0\}\big).$

  • Step 4:

Update the hyperparameter $c_1$ by updating its reparameterization β: the full conditional of β depends on $s_m$, the number of observed clusters with the m-th state activated, $m = 1, \ldots, M$. The update can be done based on a dense grid over (0,1) (Hoff, 2005).

  • Step 5:

Update $p_m$ from its Beta full conditional, independently for $m = 1, \ldots, M$.

  • Step 6:

Update Q via a constrained Gibbs sampler. For $m = 1, \ldots, M$ and $\ell = 1, \ldots, L$, we do not update $Q_{m\ell}$ if flipping it would move the current Q outside the identifiability set $\mathcal{Q}$ in (4); for example, we keep $Q_{m\ell}$ fixed when $Q_{\cdot\ell} = e_m$, where $e_m$ is a column vector with the m-th element being 1 and 0 for the other elements, and $e_m$ appears only once among the columns of the current Q. Otherwise, we follow the efficient scheme of Liu (1996) and flip $Q_{m\ell}$ from $1 - z$ to z with a probability determined by the full conditional distribution of $Q_{m\ell}$ (a Metropolized Gibbs step).

In this step, we also identify "partner latent states" and merge the corresponding rows in Q. Specifically, we collapse two states m and m′ that are present or absent together among all subjects ($\eta_{im} = \eta_{im'}$, $i = 1, \ldots, N$): we combine rows m and m′ of Q by their elementwise maximum and deactivate the redundant state. It is easy to verify that this scheme preserves conditions (4) for Q truncated to the rows corresponding to active states. The merging scheme readily generalizes to cases with more than two partner states. Finally, at each iteration we reset to zero the rows of Q corresponding to unused latent states.
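The following R sketch illustrates the post hoc merge referenced in Step 2: clusters whose sampled state vectors coincide are collapsed into one scientific cluster (the data structures are hypothetical, not the rewind internals):

# z: length-N cluster labels; H_star: one sampled state vector per cluster.
merge_duplicate_states <- function(z, H_star) {
  key <- apply(H_star, 1, paste, collapse = "")   # state pattern per cluster
  new_label <- match(key, unique(key))            # identical patterns share a label
  list(z = new_label[z],
       H_star = H_star[!duplicated(key), , drop = FALSE])
}

z      <- c(1, 1, 2, 3, 3)                           # three sampled clusters
H_star <- rbind(c(1, 0, 1), c(1, 0, 1), c(0, 1, 0))  # clusters 1, 2 share a state
merge_duplicate_states(z, H_star)                    # two scientific clusters remain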

We have used a few practical strategies to improve the sampling of the many discrete parameters, for example, $H^*$ and Q. First, the prior that encourages fewer clusters propagates into the posterior, so large values of T are visited less frequently. Second, we put an upper bound on M in real applications, followed by sensitivity analyses. Third, Q is restricted to lie in the subset $\mathcal{Q}$ of binary matrices informed by condition (4). Finally, in our experience, more efficient exploration of the clustering allocations via global split-merge updates helps the sampling of Q. Convergence checks are presented in Web Appendix A4.

3.2 Posterior Summaries

Let $P_{ii'} = P(Z_i = Z_{i'} \mid Y)$ be the posterior coclustering probability for any two subjects i and i′. We estimate $P_{ii'}$ by the empirical frequency $\hat{P}_{ii'}$ with which subjects i and i′ are clustered together across MCMC iterations. For point estimation, we compute the least squares (LS) clustering $\hat{\mathcal{C}}^{LS}$ by minimizing the squared distance to the $\hat{P}_{ii'}$: among the sampled partitions, we select the one minimizing $\sum_{i,i'} (\delta_{ii'}(\mathcal{C}) - \hat{P}_{ii'})^2$, where $\delta_{ii'}(\mathcal{C}) = 1$ if i and i′ are in the same cluster in $\mathcal{C}$ and zero otherwise (Dahl, 2006). In addition, because the interpretation of $\eta_i$ depends on Q, we choose an optimal Q from the posterior samples: we select the iteration whose sampled Q minimizes a squared loss defined through the matrix Frobenius norm and denote the selection by $\hat{Q}$. Turning to the inference of the $\eta_i$, we rerun the algorithm with Q fixed at $\hat{Q}$, which produces additional posterior samples to reduce the Monte Carlo error in approximating $[\eta_i \mid Y]$.
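Both summaries are simple to compute from saved partitions. A minimal R sketch of the coclustering frequencies and Dahl's (2006) least squares selection, assuming a B × N matrix Z_draws of sampled cluster labels (variable names hypothetical):

# Empirical coclustering frequencies and the least squares (LS) partition.
ls_clustering <- function(Z_draws) {
  B <- nrow(Z_draws); N <- ncol(Z_draws)
  P_hat <- matrix(0, N, N)
  for (b in seq_len(B))
    P_hat <- P_hat + outer(Z_draws[b, ], Z_draws[b, ], "==")
  P_hat <- P_hat / B
  loss <- sapply(seq_len(B), function(b) {
    delta <- outer(Z_draws[b, ], Z_draws[b, ], "==")
    sum((delta - P_hat)^2)                  # squared distance to P_hat
  })
  list(P_hat = P_hat, ls_partition = Z_draws[which.min(loss), ])
}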

4 Results

We illustrate the utility of the Bayesian RLCM on both simulated and real data. First, we assess the performance of the Bayesian RLCM on cluster estimation under simulation scenarios with varying levels of measurement error, dimension, sparsity level of each machine, sample size, and mixing weights. Using data simulated under the assumed RLCM and under realistic deviations from it, the proposed inference algorithm performs clustering as well as or better than common alternative binary-data clustering methods. We first analyze a single randomly generated data set to highlight differences among the methods. We then use independent replications to evaluate the frequentist performance of the Bayesian RLCM in cluster estimation and contrast it with the alternatives. Web Appendix A5.1 details the simulation scenarios. To assess the chance-corrected agreement between the true and estimated clustering allocations of the same observations, we use the adjusted Rand index (aRI; Hubert and Arabie, 1985); see Web Appendix A5.2 for details. The aRI lies between −1 and 1, with values closer to 1 indicating better agreement. Finally, data from scleroderma patients are analyzed.
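For reference, the aRI of Hubert and Arabie (1985) can be computed from the contingency table of two partitions; a compact base-R implementation:

# Adjusted Rand index between two clusterings z1, z2 of the same subjects.
adjusted_rand <- function(z1, z2) {
  tab   <- table(z1, z2)
  comb2 <- function(x) sum(choose(x, 2))
  a <- comb2(rowSums(tab)); b <- comb2(colSums(tab)); n <- comb2(sum(tab))
  index    <- comb2(tab)        # pairs co-clustered in both partitions
  expected <- a * b / n         # expectation under chance
  (index - expected) / ((a + b) / 2 - expected)
}

adjusted_rand(c(1, 1, 2, 2, 3), c(1, 1, 2, 3, 3))  # partial agreement, < 1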

4.1 Simulated Examples to Study Model Performance

Simulation 1: More accurate clustering through scientifically structured classes. Figure 3 shows a random data set Y, the design matrix Γ, as well as the clusters obtained using complete-linkage hierarchical clustering with Hamming distance (HC), standard eight-class Bayesian latent class analysis (LCA; e.g., Garrett and Zeger, 2000), subset clustering analysis (SC; Hoff, 2005), and our Bayesian RLCM with an unknown number of clusters, fitted with a truncation level on the number of machines. In this setting, HC is sensitive to noise and tends to split a true cluster or to group observations from different true clusters. Unlike the others, the Bayesian RLCM automatically selects and filters the subsets of features that distinguish the eight classes (through the scientific structure in (1)) and hence has superior clustering performance, producing clusters that agree quite well with the truth. For illustrative purposes, we show an extreme example to highlight the different performances on a single random data set. Although the proposed model-based approach does not always perfectly reconstruct the clusters, this relative advantage of the Bayesian RLCM persists under data replications.

Figure 3. In the 100-dimension multivariate binary data example (a), the eight classes differ with respect to subsets of measured features (b). In (c) HC, we indicate coclustering by filled cells; the true clusters are separated (dashed grids) and ordered according to the truth. (d, e, f): For Bayesian LCA, RLCM, and subset clustering (SC), we plot the posterior coclustering probability matrix $\{\hat{P}_{ii'}\}$ for the N observations. Filled blocks on the main diagonal only indicate perfect recovery of the true clusters; blank cells within the main diagonal blocks indicate a true cluster being split, and blue cells in the off-diagonal blocks indicate two observations being incorrectly coclustered. Bayesian RLCM accounts for measurement errors, selects the relevant feature subsets, and filters the subsets by the low-dimensional model (1), and therefore yields superior clustering results. The clustering results are based on a randomly generated data set for illustration; cluster recovery by RLCM is not always perfect, and the competing methods may show different relative performances under model misspecifications (see Section 4.1). (This figure appears in color in the electronic version of this article, and any mention of color refers to that version.)

Simulation 2: Assessing clustering performance under various parameter settings with replications. The ability of the Bayesian RLCM to recover the true clusters varies by the sparsity level (s) of each machine, the level of measurement error, the mixing weights, and the sample size N (the leftmost boxes in the groups of four in Figure 4). First, clustering performance improves when the sparsity level in each machine increases to s = 20% (compare the 1st and 3rd, and the 2nd and 4th, RLCM boxplots in each panel of Figure 4). In the context of our motivating example, given a fixed number of protein landmarks L, patients will be more accurately clustered if each machine comprises more component proteins. This observation is also consistent with simulation studies conducted in a special case of the model (Hoff, 2005, table 1). For a given s, a larger L means a larger number of relevant features per machine and leads to better cluster recovery: increasing L from 50 to 400, the mean aRI (averaged over replications) increases from 0.7 to 0.98 at the lower sparsity level and from 0.88 to 0.99 at s = 20%. Second, clustering is more accurate under larger discrepancies between the sensitivities $\theta_\ell$ and the false positive rates $\psi_\ell$: the aRI averaged over replications is higher in the lower-error setting over all combinations of the other parameters. Finally, we observe mixed relative performances at distinct sample sizes as a result of two competing factors as the sample size increases: more precise estimation of the measurement error parameters, which improves clustering, and a larger space of clusterings. Additional comparisons are in Figure S1 of the Supporting Information.

Figure 4. Based on independent replications for each parameter setting, from the left to the right in each group of four boxplots, Bayesian RLCM (boxplots with solid lines) most accurately recovers the true clusters compared to subset clustering (SC), hierarchical clustering (HC), and traditional Bayesian latent class analysis (LCA). (This figure appears in color in the electronic version of this article, and any mention of color refers to that version.)

Compared to the three common alternatives, the Bayesian RLCM on average most accurately recovers the clusters, producing the highest aRIs (boxes with solid lines) compared to the others (boxes with dotted lines). For example, in the lower-error setting, the ratio of the mean aRIs (averaged over replications) for the Bayesian RLCM relative to subset clustering is 2.06, 2.04, 1.88, and 1.71 at increasing sample-size-to-dimension ratios, respectively. As another example, under the higher sparsity level, the relative advantage of the Bayesian RLCM over HC narrows, as shown by the smaller aRI ratios 1.23, 1.62, 1.49, and 1.16. Web Appendix A5.3 provides further discussion of the three alternative methods.

Simulation 3: Competing methods are evaluated under small and large degrees of model misspecification for fairer comparisons. We consider two simple representative scenarios: (1) randomly perturbing only the final layer of the model to allow more than two levels of response probabilities; and (2) removing any restriction on how the clusters are connected to the observables by assuming a general LCM. See Web Appendix A5.1 for the specific parameter values. In the first setting, with relatively minor misspecifications, the proposed model remains competitive (Figure S2 in the Supporting Information); the proposed approach can be viewed as regularization that shrinks class-specific response probabilities toward the assumed constraint. As expected, the RLCM performs less well under larger magnitudes of random perturbation. In the second setting, the proposed model is not as flexible as a general LCM, resulting in less competitive clustering performance (Figure S3 in the Supporting Information). Large degrees of model misspecification may hurt clustering performance; our practical suggestion is to perform careful model checking, which we illustrate in Section 4.2.

4.2 Protein Data Application for Estimating Scleroderma Patient Subsets

The applied goal is to estimate autoimmune disease patient clusters via reconstructing machine components. We seek to identify the components of the machines, to quantify the variation in their occurrence among individuals, and to estimate patient subsets. The binary responses $Y_{i\ell}$ indicate the observed presence or absence of proteins at equi-spaced molecular weight landmarks, as produced by a preprocessing method (Wu et al., 2019) applied to GEA data. We ran four gels, each loaded with IPs performed using sera from 19 different patients, and one reference lane. All sera were from scleroderma patients with cancer and were all negative for the three most common autoantibodies found in scleroderma (anti-RNA polymerase III, anti-topoisomerase I, and anti-centromere). The IPs were loaded in random order on each gel; the reference sample, comprised of known molecules of defined sizes (molecular weights), was always loaded in the first lane. The left panel of Figure 5 shows, for each sample lane (labeled in the left margin; excluding the reference lanes), the binary responses indicating the observed presence or absence of autoantibodies at the L protein landmarks.

Figure 5. Results for GEA data. Left: aligned data matrix for band presence or absence; rows are the 76 serum lanes, reordered into the optimal estimated clusters (not merged) and separated by gray horizontal lines; columns are the L protein landmarks, with a blue vertical line indicating a band. Middle: lane-machine matrix for the probability of a lane (serum sample) having a particular machine; blue cells correspond to high probability of having the machine in that column, with smaller probabilities shown in lighter blue. Right: the estimated machine profiles; seven estimated machines are shown, each with its component proteins indicated by a blue bar "|". (This figure appears in color in the electronic version of this article, and any mention of color refers to that version.)

Patients differ in their protein presence or absence patterns at the protein landmarks. Eleven of the aligned landmarks are absent among the patients tested; the rest of the landmarks are observed with prevalences between 1.3% and 94.7%. The GEA technologies are known to be highly specific and sensitive for nearly all proteins studied in this assay, so we specify the hyperparameters in the Beta priors to favor high sensitivities and specificities, and we conduct sensitivity analyses varying these hyperparameter values.

In this application, the scientists had previously identified, and independently verified through additional protein chemistry, the importance of a small subset of protein bands in determining clusters among a subset of subjects, and they proposed that these subjects should be grouped together. We therefore fitted the Bayesian RLCM without further splitting these partial clusters, so that the number of scientific clusters visited by the MCMC chain is upper bounded by $N - \sum_j (n_{0j} - 1)$, where $n_{0j}$ counts the number of observations in the initial cluster j. We fitted models and compared the results under multiple working truncation levels for M and obtained identical clustering results.

Figure 5 shows the observations grouped by the RLCM-estimated clusters (not merged) $\hat{\mathcal{C}}$ (left), the estimated Q-matrix $\hat{Q}$ (right), and the conditional posterior probabilities of the machines, $P(\eta_{im} = 1 \mid Y, \hat{Q})$ (middle).

The information for estimating the matrix Q comes from the observed marginal associations (positive or negative) among the protein landmarks. Landmark protein pairs observed with positive association tend to be placed in the same estimated machine. For example, Landmarks 4, 7, and 8 appear together in Machine 5: subjects either have all three landmarks or none at all, which induces strong positive pairwise associations among these landmarks. Indeed, the estimated log odds ratio (LOR) is 3.13 (standard error 1.16) for Landmark 4 versus 7, 2.21 (s.e. 0.98) for Landmark 4 versus 8, and 2.92 (s.e. 1.2) for Landmark 7 versus 8. Observed negative marginal associations between two landmarks suggest the existence of machines with discordant landmarks. For example, Landmarks 10 and 27 are rarely estimated to be present or absent together in a subject, as a result of (1) estimated machines with discordant landmarks and (2) subject-specific machine assignments. First, the model estimated that Landmark 10 (in Machine Set A: Machines 1, 3, and 4) belongs to machines not having Landmark 27 (in Machine Set B: Machine 2). Second, with high posterior probability, most observations have machines from only one of Sets A and B, not both, hence creating discordance. In the presence of measurement errors, a strong negative marginal association results (observed LOR for Landmark 10 versus 27: −1.98, s.e. 0.8).
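The quoted LORs and standard errors are computed from 2 × 2 tables of observed band presence; a minimal R sketch (with the 0.5 zero-cell adjustment also used in the model checking below):

# Observed pairwise log odds ratio between two binary landmark columns;
# 0.5 is added to each cell, and the s.e. follows the usual delta method.
pairwise_lor <- function(y1, y2) {
  tab <- table(factor(y1, levels = 0:1), factor(y2, levels = 0:1)) + 0.5
  lor <- log(tab[2, 2] * tab[1, 1] / (tab[2, 1] * tab[1, 2]))
  se  <- sqrt(sum(1 / tab))
  c(lor = lor, se = se)
}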

Our algorithm also directly infers the number of scientific clusters in the data given an initial partial clustering. The marginal posterior of the number of scientific clusters $\widetilde{T}$ can be approximated by its empirical posterior samples, which yield a posterior median of 12 (95% credible interval: (8, 16); Figure S4 in the Supporting Information). The advantage of the Bayesian RLCM is posterior inference about both the clusters and the distinct latent state vectors, interpreted based on the inferred Q matrix. The middle panel of Figure 5 shows that clusters differ in their posterior probabilities of having each of the estimated machines. Among the 76 subjects analyzed, 23 have greater than 95% posterior probability of having both Machine 4 and Machine 6. Finally, a group of seven subjects is estimated to be enriched with Machines 4 and 7. This is expected because the two machines have one or more of Landmarks 35, 40, and 49 (33, 27, and 18 kDa bands, respectively) and together explain the distinctive observed combination of raw bands in the seven subjects. Such inference about $\eta_i$ is not available from hierarchical clustering or traditional latent class analysis.

We performed posterior predictive checking to assess model fit (Gelman et al., 1996). At each MCMC iteration, given the posterior sample of model parameters, we simulated a data set of the same size as the original. For each replicated data set, we computed the marginal means and the marginal pairwise log odds ratios (with a 0.5 adjustment for zero counts). Across all replications, we computed the 95% posterior predictive confidence intervals (PPCIs), defined by the 2.5% and 97.5% quantiles of the posterior predictive distribution. All the observed marginal means are covered by their respective PPCIs, and the 95% PPCIs cover the observed pairwise log odds ratios for all but 24 of the landmark pairs (see Figures S5 and S6 in the Supporting Information). The proposed model therefore adequately fits the GEA data.
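A sketch of this check in R, reusing pairwise_lor from above and assuming a list Y_rep of replicated data sets simulated at saved MCMC iterations (names hypothetical):

# Flag landmark pairs whose observed LOR falls outside the 95% posterior
# predictive interval computed across replicated data sets.
ppc_lor <- function(Y_obs, Y_rep) {
  pairs <- t(combn(ncol(Y_obs), 2))
  lor_of <- function(Y) apply(pairs, 1, function(p)
    pairwise_lor(Y[, p[1]], Y[, p[2]])["lor"])
  obs   <- lor_of(Y_obs)
  draws <- sapply(Y_rep, lor_of)            # (#pairs) x B matrix
  ci <- apply(draws, 1, quantile, probs = c(0.025, 0.975))
  data.frame(pairs, obs, lower = ci[1, ], upper = ci[2, ],
             covered = obs >= ci[1, ] & obs <= ci[2, ])
}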

5 Discussion

Modern scientific technologies give rise to measurements of varying precision and accuracy that are better targeted at the underlying state variables than ever before. In this paper we have focused on finite-sample Bayesian inference of an RLCM for analyzing dependent binary data. The primary advantage of such models lies in their expressive characterization of the between-class differential errors structured to respect specific scientific context about the biological and measurement processes. Using simulations and real data analysis, we studied the clustering of observations with an unknown number of clusters, uncertainty assessment of the clustering and the prediction of individual latent states. We develop and apply an MCMC algorithm for Bayesian RLCMs. The proposed method addresses inferential issues unique to mixture models with discrete component parameters and jointly infers the number of clusters, the design matrix Γ and other model parameters. We have illustrated its advantage through simulations relative to three commonly used binary-data clustering methods. Finally, viewed from a regularization perspective, the inferential procedure automatically selects subsets of features for each latent class and filters them through a low-dimensional model that shrinks class-specific response probability estimates toward one that represents the scientific structure and improves our ability to accurately estimate clusters.

RLCMs decompose the variation among multivariate binary responses into structure that reflects constraints in particular scientific context and stochastic variation without a known explanation. In our motivating example, it is certainly likely that there is some variability related to measurement errors. However, it is also highly likely that there are systematic biological and biochemical processes not included in the structural part because they are unknown to us today. RLCM analyses can be a useful tool in the effort to uncover the unknown structure. One approach would be to show that the latent classes are diagnostic of specific diseases. Another is that we might uncover a novel mechanism by defining distinct patterns of the same autoantigen machine in patients with the same disease or potentially in patients with different diseases that target the same machines. Though the present paper focused on an example in medicine, the developed method and algorithms apply to many problems in psychology and epidemiology.

We are currently studying a few potentially useful extensions. First, nested partially latent class models (Wu et al., 2017) incorporate local dependence and multiple sensitivity parameters that would improve the utility of Bayesian RLCMs. Second, because the algorithm involves iterating over subjects to find clusters, the computational time increases with the number of subjects N. Divide-Cluster-Combine schemes that estimate clusters in subsamples, which are then combined, may improve the computational speed at the expense of the approximation introduced by the multistage clustering (Ni et al., 2020). Third, in applications where the clustering of multivariate binary data comprises one component of a larger hierarchical Bayesian model, the posterior uncertainty in clustering propagates into other parts of the model and can be integrated into posterior inference for the other model parameters (e.g., Jacob et al., 2017). Finally, under significant deviations from the model assumptions, the constraints on Γ may lead to less competitive clustering performance; in real data applications, scientific context for clustering and careful model checking are therefore critical. The degree to which the actual data-generating mechanism deviates from the assumed model can be characterized by general RLCM formulations such as the all-effect model (Equation (5); Gu and Xu, 2019a). Model selection in such settings may help with clustering performance. We leave this for future work.

Open Research Badges

This article has earned Open Data and Open Materials badges. Data and materials are available at https://github.com/zhenkewu/rewind and https://github.com/zhenkewu/spotgear.

Data Availability Statement

An R package “rewind” is freely available at https://github.com/zhenkewu/rewind. Another R package “spotgear” (https://github.com/zhenkewu/spotgear) is used for data preprocessing in Section 4.2. The data that support the findings in this paper are available from the corresponding author upon reasonable request.

Acknowledgments

The research is supported in part by a gift from the Jerome L. Greene Foundation and by the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318), National Institutes of Health (NIH) grants R01AR073208, P30AR070254, and P30CA046592. We are grateful for comments from the editor, associate editor, and three reviewers that greatly improved the presentation. We also thank Gongjun Xu, Peter Hoff, and Jian Kang for helpful discussions.

References

Albert, J.H. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Chiu, C.-Y., Douglas, J.A. and Li, X. (2009) Cluster analysis for cognitive diagnosis: theory and applications. Psychometrika, 74, 633–665.
Dahl, D.B. (2006) Model-based clustering for expression data via a Dirichlet process mixture model. In: Do, K.A., Müller, P. and Vannucci, M. (Eds.) Bayesian Inference for Gene Expression and Proteomics. New York: Cambridge University Press, pp. 201–218.
Dunson, D. and Xing, C. (2009) Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042–1051.
Garrett, E. and Zeger, S. (2000) Latent class model diagnosis. Biometrics, 56, 1055–1067.
Gelfand, A.E. and Smith, A.F. (1990) Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Gelman, A., Meng, X.-L. and Stern, H. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–760.
Ghahramani, Z. and Griffiths, T.L. (2006) Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems, 475–482.
Goodman, L. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
Green, P.J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Gu, Y. and Xu, G. (2019a) Learning attribute patterns in high-dimensional structured latent attribute models. Journal of Machine Learning Research, 20, 1–58.
Gu, Y. and Xu, G. (2019b) The sufficient and necessary condition for the identifiability and estimability of the DINA model. Psychometrika, 84, 468–483.
Hoff, P.D. (2005) Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics, 61, 1027–1036.
Hubert, L. and Arabie, P. (1985) Comparing partitions. Journal of Classification, 2, 193–218.
Jacob, P.E., Murray, L.M., Holmes, C.C. and Robert, C.P. (2017) Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719.
Jain, S. and Neal, R.M. (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13, 158–182.
Junker, B.W. and Sijtsma, K. (2001) Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
Kadane, J. (1975) The role of identification in Bayesian theory. In: Fienberg, S. and Zellner, A. (Eds.) Studies in Bayesian Econometrics and Statistics, chapter 5.2. Amsterdam: North-Holland, pp. 175–191.
Lazarsfeld, P.F. (1950) The logical and mathematical foundations of latent structure analysis. In: Stouffer, S. (Ed.) The American Soldier: Studies in Social Psychology in World War II, volume IV. Princeton, NJ: Princeton University Press, pp. 362–412.
Lee, D.D. and Seung, H.S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Liu, J.S. (1996) Peskun's theorem and a modified discrete-state Gibbs sampler. Biometrika, 83, 681–682.
McCullagh, P. and Yang, J. (2008) How many clusters? Bayesian Analysis, 3, 101–120.
Meeds, E., Ghahramani, Z., Neal, R.M. and Roweis, S.T. (2007) Modeling dyadic data with binary latent factors. Advances in Neural Information Processing Systems, 977–984.
Miettinen, P., Mielikäinen, T., Gionis, A., Das, G. and Mannila, H. (2008) The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering, 20, 1348–1362.
Miller, J.W. and Harrison, M.T. (2018) Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113, 340–356.
Ni, Y., Müller, P., Diesendruck, M., Williamson, S., Zhu, Y. and Ji, Y. (2020) Scalable Bayesian nonparametric clustering and classification. Journal of Computational and Graphical Statistics, 29, 53–65.
Ni, Y., Müller, P. and Ji, Y. (2019) Bayesian double feature allocation for phenotyping with electronic health records. Journal of the American Statistical Association. To appear.
Nobile, A. and Fearnside, A.T. (2007) Bayesian finite mixtures with an unknown number of components: the allocation sampler. Statistics and Computing, 17, 147–162.
Rosen, A. and Casciola-Rosen, L. (2016) Autoantigens as partners in initiation and propagation of autoimmune rheumatic diseases. Annual Review of Immunology, 34, 395–420.
Rukat, T., Holmes, C.C., Titsias, M.K. and Yau, C. (2017) Bayesian Boolean matrix factorisation. In: Proceedings of the 34th International Conference on Machine Learning, 70, pp. 2969–2978.
Teh, Y.W., Görür, D. and Ghahramani, Z. (2007) Stick-breaking construction for the Indian buffet process. Artificial Intelligence and Statistics, 556–563.
Templin, J.L. and Henson, R.A. (2006) Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305.
Vermunt, J.K. and Magidson, J. (2002) Latent class cluster analysis. Applied Latent Class Analysis, 11, 89–106.
Wu, Z., Casciola-Rosen, L., Shah, A., Rosen, A. and Zeger, S.L. (2019) Estimating autoantibody signatures to detect autoimmune disease patient subsets. Biostatistics, 20, 30–47.
Wu, Z., Deloria-Knoll, M., Hammitt, L.L. and Zeger, S.L. (2016) Partially latent class models for case–control studies of childhood pneumonia aetiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65, 97–114.
Wu, Z., Deloria-Knoll, M. and Zeger, S.L. (2017) Nested partially latent class models for dependent binary data; estimating disease etiology. Biostatistics, 18, 200–213.
Xu, G. (2017) Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45, 675–707.
Xu, G. and Shang, Z. (2018) Identifying latent structures in restricted latent class models. Journal of the American Statistical Association, 113, 1284–1295.
Zhang, Z., Li, T., Ding, C. and Zhang, X. (2007) Binary matrix factorization with applications. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 391–400.

