Abstract

High-dimensional categorical data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable parsimonious models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet is challenging to address when there are complex latent structures. In this article, we propose a class of identifiable multilayer (potentially deep) discrete latent structure models for discrete data, termed Bayesian Pyramids. We establish the identifiability of Bayesian Pyramids by developing novel transparent conditions on the pyramid-shaped deep latent directed graph. The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate the identifiability and estimability of model parameters. Applications of the methodology to DNA nucleotide sequence data uncover useful discrete latent features that are highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data and can be a useful alternative to popular machine learning methods.

1 Introduction

High-dimensional unordered categorical data are ubiquitous in many scientific disciplines, including the DNA nucleotides of A, G, C, T in genetics (Nguyen et al., 2016; Pokholok et al., 2005), occurrences of various species in ecological studies of biodiversity (Ovaskainen & Abrego, 2020; Ovaskainen et al., 2016), responses from psychological and educational assessments or social science surveys (Eysenck et al., 2020; Skinner, 2019), and document data gathered from huge text corpora or publications (Blei et al., 2003; Erosheva et al., 2004). Modeling and extracting information from multivariate discrete data require different statistical methods and theoretical understanding from those for continuous data. In an unsupervised setting, it is an important task to uncover reliable and meaningful latent patterns from the potentially high-dimensional and heterogeneous discrete observations. Ideally, the inferred lower-dimensional latent representations should not only provide scientific insights on their own, but also aid downstream statistical analyses through effective dimension reduction.

Recently, there has been a surge of interest in interpretable machine learning, see Doshi-Velez and Kim (2017), Rudin (2019), and Murdoch et al. (2019), among others. Latent variable approaches, however, have received limited attention in this emerging literature, likely due to the associated complexities. Indeed, deep learning models with many layers of latent variables are usually considered as uninterpretable black boxes. For example, the deep belief network (Hinton et al., 2006; Lee et al., 2009) is a very popular deep learning architecture, but it is generally not reliable to interpret the inferred latent structure. However, for high-dimensional data, it is highly desirable to perform dimension reduction to extract the key signals in the form of lower-dimensional latent representations. If the latent representations are themselves reliable, then they can be viewed as surrogate features of the data and then passed along to existing interpretable machine learning methods for downstream tasks. A key to success in such modeling and analysis processes is the interpretability of the latent structure. This in turn relies crucially on the identifiability of the statistical latent variable model being used.

In statistical terms, a set of parameters for a family of statistical models are said to be identifiable if distinct values of the parameters correspond to distinct distributions of the observed data. Studies under such an identifiability notion date back to Koopmans and Reiersol (1950), Teicher (1961), and Goodman (1974). Model identifiability is a fundamental prerequisite for valid statistical estimation and inference. In the latent variable context, if one wishes to interpret the parameters and latent representations learned using a latent variable model, then identifiability is necessary for making such interpretation meaningful and reproducible. Early considerations of identifiability in the latent variable context can be traced to the seminal work of Anderson and Rubin (1956) for traditional factor analysis. For modern latent variable models potentially containing nonlinear, non-continuous, and even deep layers of latent variables, identifiability issues can be challenging to address.

Recent developments on the identifiability of continuous latent variable models include Drton et al. (2011), Anandkumar et al. (2013), and Chen, Li, et al. (2020). Discrete latent variable models are an appealing alternative to their continuous counterparts in terms of the combination of interpretability and flexibility. Finite mixture models (McLachlan & Basford, 1988) routinely used for model-based clustering are a canonical example involving a single discrete latent variable. Such relatively simple approaches are insufficiently flexible for complex data sets. Extensions with multiple latent variables and/or multilayer structure have distinct advantages in such settings but come with increasingly complex identifiability issues. In this work, we are motivated to build identifiable deep latent variable models, which are flexible enough to capture the complex dependencies in real-world data, yet also with appropriate restrictions and parsimony to yield identifiability.

We propose a family of multilayer, potentially deep, discrete latent variable models and develop novel identifiability conditions for them. We establish identifiability for hierarchical latent structures organized in a 'pyramid'-shaped Bayesian network. In such a Bayesian Pyramid, observed variables are at the bottom and multilayer latent variables above them describe the data generating process. Sparse graphical connections occur between layers, and our identifiability conditions impose structural and size constraints on these between-layer graphs. Technically, we tackle identifiability by first reformulating the Bayesian Pyramid as a constrained latent class model (LCM; Goodman, 1974) in a layerwise manner. Then, we derive a nontrivial algebraic property of LCMs under such parameter constraints (Proposition 2) and combine it with Kruskal's theorem (Allman et al., 2009; Kruskal, 1977) on tensor decompositions to establish identifiability. Our identifiability results are not only technically novel, but also provide insights into methodology development. Indeed, the identifiability theory directly inspires the specification of the deep latent architecture in Bayesian Pyramids, which features fewer latent variables deeper up the hierarchy. The identifiability results offer a theoretical basis for learning potentially deep and interpretable latent structures from high-dimensional discrete data.

A nice consequence of the identifiability results is the posterior consistency of Bayesian procedures under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. We develop a Gibbs sampler with data augmentation for computation. Simulation studies corroborate identifiability and show good performance of Bayes estimators. We apply the proposed approach to two DNA nucleotide sequence datasets (Dua & Graff, 2017). For the splice junction data, when using latent representations learned from our two-latent-layer model in downstream classification of nucleotide sequence types, we achieve a remarkable accuracy comparable to fine-tuned convolutional neural networks (i.e., in Nguyen et al., 2016). This suggests that the developed recipe of unsupervised learning of discrete data may serve as a useful alternative to popular machine learning methods.

The rest of this paper is organized as follows. Section 2 proposes Bayesian Pyramids, a new family of pyramid-shaped deep latent variable models for discrete data and reformulates a Bayesian Pyramid into constrained latent class models (CLCMs) in a layerwise manner. Section 3 first considers the identifiability of the general CLCMs and then proposes identifiability conditions for the multilayer deep Bayesian Pyramids. To illustrate the proposed framework, Section 4 focuses on a two-latent-layer Bayesian Pyramid and proposes a Bayesian estimation approach. Section 5 provides simulation studies that examine the performance of the proposed methodology and corroborate the identifiability theory. Section 6 applies the method to real data on nucleotide sequences. Finally, Section 7 discusses implications and future directions. Technical proofs, posterior computation details, and additional data analyses are included in the Online Supplementary Material. Matlab code implementing the proposed method is available at https://github.com/yuqigu/BayesianPyramids.

2 Bayesian Pyramids: multilayer latent structure models

This section proposes Bayesian Pyramids to model the joint distribution of multivariate unordered categorical data. For an integer m, denote [m] = {1, 2, …, m}. Suppose for each subject there are p observed variables y = (y_1, …, y_p), where y_j ∈ [d_j] for each variable j ∈ [p]; here d_j is the number of categories that the jth observed variable can potentially take. We mainly consider multiple (potentially deep) layers of binary latent variables, motivated by better computational tractability and also simpler interpretability, with each variable encoding presence or absence of a certain latent construct. The stack of multiple layers of binary latent variables induces a model resembling deep belief networks (Hinton et al., 2006). However, our proposed class of models is more general in terms of distributional assumptions. In the following, we first describe in detail our proposed Bayesian Pyramids in Section 2.1 and then connect them to a latent class model (Goodman, 1974) subject to certain constraints.

2.1 Multilayer Bayesian Pyramids

The proposed models belong to the broader family of Bayesian networks (Pearl, 2014), which are directed acyclic graphical models that can encode rich conditional independence information. We propose a 'pyramid'-like Bayesian network with one latent variable at the root and more and more latent variables in downward layers, where the bottom layer consists of the p observed variables y_1, …, y_p. Denote the number of latent layers in this Bayesian network by D, which can be viewed as the depth. Specifically, let the layer of latent variables consecutive to the observed y be α^(1) = (α_1, …, α_{K_1}) ∈ {0,1}^{K_1} with K_1 variables, and let the deeper layer of latent variables consecutive to α^(m) ∈ {0,1}^{K_m} be α^(m+1) ∈ {0,1}^{K_{m+1}} for m = 1, 2, …, D−1. Finally, at the top and deepest layer of the pyramid, we specify a single discrete latent variable z, or equivalently z^(D) ∈ {1, …, B}. In this Bayesian network, all the directed edges point in the top-down direction and only between two consecutive layers, and there are no edges between variables within a layer. This gives the following factorization of the joint distribution of y and the latent variables, where the subscript i denotes the index of a random subject,

(1)
(2)

In the above display, for each j ∈ [p], pa(j) ⊆ [K_1] is the collection of first-latent-layer variables that are parents of y_j. Similarly, for each k ∈ [K_m], pa(k^(m)) ⊆ [K_{m+1}] is the collection of (m+1)th-latent-layer variables that are parents of the mth-latent-layer variable α_k^(m).

Figure 1 gives a graphical visualization of the proposed model. To make clear the sparsity structure of this graph, we introduce binary matrices G^(1), G^(2), …, G^(D−1), termed graphical matrices, to encode the connecting patterns between consecutive layers. That is, G^(d) summarizes the parent–child graphical patterns between the two consecutive layers d and d+1. Specifically, matrix G^(1) = (g_{j,k}^(1)) has size p × K_1 and matrix G^(m) = (g_{k,k'}^(m)) has size K_{m−1} × K_m for m = 2, …, D−1, with entries equal to one if and only if the corresponding directed edge is present in the graph (e.g., g_{j,k}^(1) = 1 if and only if α_k^(1) is a parent of y_j).

Each variable in the graph is subject-specific, meaning that all the circles in Figure 1 represent subject-specific quantities. Namely, if there are n subjects in the sample, each of them has its own realizations of y and the α^(d)'s. The proposed directed acyclic graph is not necessarily a tree, as shown in Figure 1; that is, each variable can have multiple parent variables in the layer above it, whereas in a directed tree each variable can have only one parent. As a simpler illustration, we also provide a two-latent-layer Bayesian Pyramid in Figure 2, which features a simpler yet still quite expressive architecture. For example, we can specify the conditional distribution of each observed y_j using a multinomial or binomial logistic model with its parent variables as linear predictors, and specify the distribution of α^(1) given z using a latent class model. Namely, for a random subject i,

(3)  P(y_{i,j} = c | α_i^(1)) = exp(β_{j,c,0} + Σ_{k=1}^{K_1} β_{j,c,k} g_{j,k}^(1) α_{i,k}^(1)) / Σ_{c'=1}^{d_j} exp(β_{j,c',0} + Σ_{k=1}^{K_1} β_{j,c',k} g_{j,k}^(1) α_{i,k}^(1)),  c ∈ [d_j];
     P(α_{i,k}^(1) = 1 | z_i = b) = η_{k,b},  P(z_i = b) = τ_b,  b ∈ [B].

Later in Section 4, we will focus on the two-latent-layer model when describing the estimation methodology; please refer to that part for further details.

Figure 1. Multiple layers of binary latent traits α_i^(d)'s model the distribution of observed y_i. Binary matrices G^(1), G^(2), … encode the sparse connection patterns between consecutive layers. Dotted arrows emanating from the root variable z_i summarize omitted layers {α_i^(d)}.

Figure 2. Two-latent-layer model. Latent variables z_i, α_{i,1}, …, α_{i,K_1} and observed variables y_{i,1}, …, y_{i,p} are subject-specific, and model parameters G^(1), β, τ, η are population quantities.

We provide some discussion of the conceptual motivation for the proposed multilayer model. Intuitively, for each subject i, the variable z_i ∈ [B] in the deepest layer of the pyramid can encode coarse-grained latent categories of subjects, while the vector α_i^(1) ∈ {0,1}^{K_1} encodes more fine-grained latent features of subjects. The hierarchical multilayer structure is conceptually appealing as it provides latent representations of the data at multiple resolutions, with nonlinear compositions between different latent resolutions that boost model flexibility. Model (3) for generating α^(1) given z is a nonlinear latent class model (LCM). Hence, z cannot be viewed as a linear projection of the α^(1)-layer; rather, the nonlinear LCM (with a deeper variable z and conditional independence of the α_k^(1)'s given z) can encode and explain rich and complex dependencies among the variables α_1^(1), …, α_{K_1}^(1) using a parsimonious parametrization.
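To make the data generating process concrete, the following minimal sketch simulates from a two-latent-layer Bayesian Pyramid under the multinomial-logit specification in (3). It is written in Python/NumPy purely for illustration (the released implementation is in Matlab), and all dimensions and parameter values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_two_layer_pyramid(n, tau, eta, G, beta0, beta):
    """Simulate n subjects from a two-latent-layer Bayesian Pyramid:
    z ~ Categorical(tau); alpha_k | z = b ~ Bernoulli(eta[k, b]);
    y_j | alpha ~ multinomial logit with active coefficients G[j, k] * beta[j, c, k]."""
    B = len(tau)
    p, K = G.shape
    z = rng.choice(B, size=n, p=tau)                          # deepest latent variable
    alpha = (rng.random((n, K)) < eta[:, z].T).astype(int)    # binary latent layer
    y = np.zeros((n, p), dtype=int)
    for j in range(p):
        # linear predictors for the d-1 non-baseline categories; last category is the baseline
        logits = beta0[j] + alpha @ (G[j] * beta[j]).T        # shape (n, d-1)
        logits = np.column_stack([logits, np.zeros(n)])
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        y[:, j] = (probs.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
    return y, alpha, z

# hypothetical toy sizes: p=6 items, K_1=3 binary traits, B=2 deep classes, d=3 categories
p, K1, B, d = 6, 3, 2, 3
G1 = np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,0],[0,1,1],[1,0,1]])
tau = np.array([0.4, 0.6])
eta = rng.uniform(0.2, 0.8, size=(K1, B))
beta0 = rng.normal(0.0, 1.0, size=(p, d - 1))
beta = rng.normal(2.0, 0.5, size=(p, d - 1, K1))
y, alpha, z = simulate_two_layer_pyramid(500, tau, eta, G1, beta0, beta)
print(y.shape, alpha.mean(axis=0))
```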

2.2 Reformulating the Bayesian Pyramid as a constrained latent class model

In this subsection, we reveal an interesting equivalence relationship between our Bayesian Pyramid models and latent class models (LCMs, Goodman, 1974) with certain equality constraints. Such equivalence will pave the way for investigating the identifiability of Bayesian Pyramids. The traditional latent class model (LCM) (Goodman, 1974; Lazarsfeld, 1950) posits the following joint distribution of yi for a random subject i:

(4)  P(y_{i,1} = c_1, …, y_{i,p} = c_p) = Σ_{h=1}^{H} ν_h ∏_{j=1}^{p} λ_{c_j,h}^(j),  for all (c_1, …, c_p) ∈ [d_1] × ⋯ × [d_p].

The above equation specifies a finite mixture model with one discrete latent variable having H categories (i.e., H latent classes). In particular, given a latent variable z ∈ [H], ν_h = P(z = h) denotes the probability of belonging to the hth latent class, and λ_{c_j,h}^(j) = P(y_j = c_j | z = h) denotes the conditional probability of response c_j for variable y_j given latent class membership h. Expression (4) implies that the p observed variables y_1, …, y_p are conditionally independent given the latent z. In the Bayesian literature, the model in Dunson and Xing (2009) is a nonparametric generalization of (4), which allows an infinite number of latent classes.

Latent class modeling under (4) is widely used in the social and biomedical sciences, where researchers often hope to infer subpopulations of individuals having different profiles (Collins & Lanza, 2009). However, the overly simplistic form of (4) can lead to poor performance in inferring distinct and interpretable subpopulations. In particular, the model assumes that individuals in different subpopulations have completely different probabilities λ_{c_j,h}^(j) for all c_j ∈ [d_j] and j ∈ [p], and that conditionally on subpopulation membership all the variables are independent. These restrictions can force the number of classes to increase in order to provide an adequate fit to the data, which can degrade the interpretability of a plain latent class model.

We introduce some notation before proceeding. Denote the p × H all-one matrix by 1_{p×H}. For a matrix S with p rows and a set A ⊆ [p], denote by S_{A,:} the submatrix of S consisting of the rows indexed by A. Consider a family of constrained latent class models, which enable learning of a potentially large number of interpretable, identifiable, and diverse latent classes. A key idea is the sharing of parameters within certain latent classes for each observed variable. We introduce a binary constraint matrix S = (S_{j,h}) of size p × H, which has rows indexed by the p observed variables and columns indexed by the H latent classes. The binary entry S_{j,h} indicates whether the conditional probability table λ_{1:d_j,h}^(j) = (λ_{1,h}^(j), …, λ_{d_j,h}^(j)) is free or instead equal to some unknown baseline. Specifically, if S_{j,h} = 1 then λ_{1:d_j,h}^(j) is free; while for those latent classes h ∈ [H] with S_{j,h} = 0, the conditional probability tables λ_{1:d_j,h}^(j) are constrained to be equal. Hence, S puts the following equality constraints on the parameters of (4):

(5)  λ_{1:d_j,h_1}^(j) = λ_{1:d_j,h_2}^(j)  for any j ∈ [p] and any h_1, h_2 ∈ [H] such that S_{j,h_1} = S_{j,h_2} = 0.

We also enforce a natural inequality constraint for identifiability,

(6)  λ_{1:d_j,h}^(j) ≠ λ_{1:d_j,h'}^(j)  for any j ∈ [p] and any h, h' ∈ [H] such that S_{j,h} = 1 and S_{j,h'} = 0.

If S = 1_{p×H}, then there are no active constraints and the original latent class model (4) is recovered. We generally denote the conditional probability parameters by Λ = {λ_{c_j,h}^(j): j ∈ [p], c_j ∈ [d_j], h ∈ [H]}, where λ_{c_j,h}^(j) := λ_{c_j,0}^(j) denotes the baseline if S_{j,h} = 0. As will be revealed soon, such constrained latent class models are related to our proposed Bayesian Pyramids via a neat mathematical transformation.

Viewed from a different perspective, a latent class model (4) specifies a decomposition of the p-way probability tensor Π = (π_{c_1 ⋯ c_p}), where π_{c_1 ⋯ c_p} = P(y_1 = c_1, …, y_p = c_p | Λ, S, ν). This corresponds to the PARAFAC/CANDECOMP (CP) decomposition in the tensor literature (Kolda & Bader, 2009), which can be used to factorize general real-valued tensors, whereas our focus is on probability tensors. The proposed equality constraint (5) induces a family of constrained CP decompositions.

This family is connected with the sparse CP decomposition of Zhou et al. (2015), with both having equality constraints summarized by a p×H binary matrix. However, Zhou et al. (2015) encourage different observed variables to share parameters, while our proposed model encourages different latent classes to share parameters through (5).
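As a small illustration of this tensor viewpoint, the sketch below (Python/NumPy, with made-up dimensions and values) assembles the p-way probability tensor of the latent class model (4) as a sum of H rank-one tensors, i.e., a nonnegative CP decomposition.

```python
import numpy as np
from functools import reduce

def lcm_probability_tensor(nu, Lams):
    """Assemble the p-way probability tensor of latent class model (4):
    pi[c_1, ..., c_p] = sum_h nu[h] * prod_j Lams[j][c_j, h]  (a CP decomposition)."""
    components = []
    for h in range(len(nu)):
        cols = [Lam[:, h] for Lam in Lams]
        rank_one = reduce(np.multiply.outer, cols)   # outer product of the h-th columns
        components.append(nu[h] * rank_one)
    return np.sum(components, axis=0)

rng = np.random.default_rng(3)
nu = np.array([0.3, 0.7])                                       # H = 2 latent classes
Lams = [rng.dirichlet(np.ones(3), size=2).T for _ in range(4)]  # p = 4 items, 3 categories each
Pi = lcm_probability_tensor(nu, Lams)
print(Pi.shape, Pi.sum())   # (3, 3, 3, 3) and total probability mass approximately 1
```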

We have the following proposition linking the proposed multilayer Bayesian Pyramid to the constrained latent class model under equality constraint (5). For two vectors a, b of the same length M, write a ⪰ b if a_i ≥ b_i for all i ∈ [M]. Denote by 1(·) the binary indicator function, which equals one if the statement inside is true and zero otherwise.

 
Proposition 1

Consider the multilayer Bayesian Pyramid with binary graphical matrices G(1),G(2),,G(D1).

  • After marginalizing out all the latent variables except α^(1), the distribution of y is a constrained latent class model with 2^{K_1} latent classes, where each latent class can be characterized as one configuration of the K_1-dimensional binary vector α^(1) ∈ {0,1}^{K_1}. The corresponding p × 2^{K_1} constraint matrix S^(1) is determined by the bipartite graph structure between the α^(1)-layer and the y-layer, with entries

    (7)  S_{j,α}^(1) = 1 − 1(α ⪰ G_{j,:}^(1)),  for j ∈ [p] and α ∈ {0,1}^{K_1}.
  • Further, considering the distribution of α^(m) ∈ {0,1}^{K_m} and marginalizing out all the latent variables deeper than α^(m) except α^(m+1), the distribution of α^(m) is also a constrained latent class model with 2^{K_{m+1}} latent classes, where each latent class is characterized as one configuration of the K_{m+1}-dimensional binary vector α^(m+1) ∈ {0,1}^{K_{m+1}}. Its corresponding constraint matrix S^(m) is determined by the bipartite graph structure between the mth and the (m+1)th latent layers, with entries

    (8)  S_{k,α}^(m) = 1 − 1(α ⪰ G_{k,:}^(m+1)),  for k ∈ [K_m] and α ∈ {0,1}^{K_{m+1}}.

We present a toy example to illustrate Proposition 1 and discuss its implications.

 
Example 1

Consider a multilayer Bayesian Pyramid with p = 6 and K_1 = 3, with the graph between α^(1) and y displayed in Figure 2. The 6 × 3 graphical matrix G^(1) is presented below. Proposition 1(a) states that there is a p × 2^{K_1} constraint matrix S^(1) taking the form

Entries of S^(1) are determined according to (7); for example, S_{1,(000)}^(1) = 1 − 1((0,0,0) ⪰ G_{1,:}^(1)) = 1 − 1((0,0,0) ⪰ (1,0,0)) = 1, and S_{6,(111)}^(1) = 1 − 1((1,1,1) ⪰ G_{6,:}^(1)) = 1 − 1((1,1,1) ⪰ (0,1,1)) = 0. Each column of the constraint matrix S^(1) is indexed by a latent class characterized by a configuration of a K_1-dimensional binary vector. This implies that, if we only consider the first latent layer of variables α^(1), all subjects are naturally divided into 2^{K_1} latent classes, each endowed with a binary pattern.
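The construction in (7) is easy to automate. The following Python sketch builds S^(1) from a graphical matrix; the 6 × 3 matrix used below is hypothetical except for rows 1 and 6, which match the worked computations above.

```python
import itertools
import numpy as np

def constraint_matrix(G):
    """Build the p x 2^K constraint matrix S from a p x K graphical matrix G via (7):
    S[j, alpha] = 1 - 1(alpha >= G[j, :] componentwise)."""
    p, K = G.shape
    patterns = np.array(list(itertools.product([0, 1], repeat=K)))   # the 2^K latent classes
    # S[j, h] = 0 exactly when latent class pattern h dominates row j of G
    S = 1 - np.all(patterns[None, :, :] >= G[:, None, :], axis=2).astype(int)
    return S, patterns

# hypothetical 6 x 3 graphical matrix; rows 1 and 6 match the worked example above
G1 = np.array([[1,0,0],[0,1,0],[0,0,1],[1,1,0],[1,0,1],[0,1,1]])
S1, patterns = constraint_matrix(G1)
print(S1.shape)             # (6, 8)
print(S1[0, 0], S1[5, 7])   # entries for classes (0,0,0) and (1,1,1): 1 and 0
```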

Proposition 1 gives a nice structural characterization of multilayer Bayesian Pyramids. This characterization is achieved by relating the multilayer sparse graph to a constrained latent class model in a layerwise manner. The proposition provides a basis for investigating the identifiability of multilayer Bayesian Pyramids; see the details in Section 3.

2.3 Connections to existing models and studies

We next briefly review connections between Bayesian Pyramids and existing models. In educational measurement research, Haertel (1989) first used the term restricted latent class models. Further developments along this line in the psychometrics literature led to a popular family of cognitive diagnosis models (de la Torre, 2011; Rupp & Templin, 2008; von Davier & Lee, 2019). These models are essentially binary latent skill models where each subject is endowed with a K-dimensional latent skill vector α ∈ {0,1}^K indicating the mastery/deficiency statuses of K skills, and each test item j is designed to measure a certain configuration of skills, summarized by a loading vector q_j ∈ {0,1}^K. The matrix Q ∈ {0,1}^{J×K} collecting all J skill loading vectors q_1, …, q_J as rows is often pre-specified by educational experts. The observed data for each subject form a J-dimensional binary vector r ∈ {0,1}^J, indicating the correct/wrong answers to the J test questions in the assessment. Recently, there have been emerging studies on the identifiability and estimation of such cognitive diagnosis models (e.g., Chen, Culpepper, et al., 2020; Chen et al., 2015; Fang et al., 2019; Gu & Xu, 2019, 2020; Xu, 2017). However, to our knowledge, no existing works model multilayer (i.e., deep) latent structure behind the data and investigate identifiability in such scenarios.

Bayesian Pyramids are also related to deep belief networks (DBNs, Hinton et al., 2006), sum-product networks (SPNs, Poon & Domingos, 2011), and latent tree graphical models (LTMs, Mourad et al., 2013) in the machine learning literature. DBNs have undirected edges between the deepest two layers designed based on computational considerations (Hinton, 2009), while a Bayesian Pyramid is a fully generative directed graphical model (Bayesian network) with all the edges pointing top down. Such generative modeling is naturally motivated by the identifiability considerations and also provides a full description of the data generating process. Also, DBNs in their popular form are models for multivariate binary data, feature a fully connected graph structure, and use logistic link functions between layers. In contrast, Bayesian Pyramids accommodate general multivariate categorical data and allow flexible forms of layerwise conditional distributions and sparse connections between consecutive layers. An SPN is a rooted directed acyclic graph consisting of sum nodes and product nodes. Zhao et al. (2015) show that an SPN can be converted to a bipartite Bayesian network, where the number of discrete latent variables equals the number of sum nodes in the SPN. Our model is more general in that in addition to modeling a bipartite network between the latent layer and the observed layer, we further model the dependence of the latent variables instead of assuming them independent. LTMs are a special case of our model (3) because, while all the variables in an LTM form a tree, we allow for more general DAGs beyond trees. Although the above models are extremely popular, identifiability has received essentially no attention; an exception is Zwiernik (2018), which discussed identifiability for relatively simple LTMs.

Our two-latent-layer Bayesian Pyramid shares a similar structure with the nonparametric latent feature models in Doshi-Velez and Ghahramani (2009). Both works consider a mixture of binary latent feature models, with each data point associated with both a deep latent cluster (z in our notation) and a binary vector of latent features (α^(1) in our notation). One distinction is that we adopt very flexible probabilistic distributions (see Examples 2 and 3) as conditional distributions in the DAG, while Doshi-Velez and Ghahramani (2009) posit that the latent features are a deterministic function of the products of z and α^(1). In addition, Bayesian Pyramids are directly inspired by identifiability considerations, and in the next section, we provide explicit conditions that guarantee the model is identifiable, not only in the two-latent-layer architecture similar to Doshi-Velez and Ghahramani (2009) but also in general deep latent hierarchies with α^(2), …, α^(D−1), z. Interestingly, Doshi-Velez and Ghahramani (2009) conjecture heuristically that their specification 'likely resolves the identifiability issues'.

3 Identifiability and constrained latent class structure behind Bayesian Pyramids

3.1 Identifiability of the constrained latent class model and posterior consistency

In Section 2, we proposed a new class of multilayer latent variable models termed Bayesian Pyramids and showed that these models can be formulated as a type of constrained latent class model (CLCM) defined in (4)–(5). In this section, we study theoretical properties of model (4)–(5).

The classic latent class model in (4) was shown by Gyllenberg et al. (1994) not to be strictly identifiable. Strict identifiability generally requires establishing a parameterization in which the parameters can be expressed as one-to-one functions of observables. As a weaker notion, generic identifiability requires that the map is one-to-one except on a Lebesgue measure zero subset of the parameter space. In a seminal paper, Allman et al. (2009) leveraged Kruskal's theorem (Kruskal, 1977) to show generic identifiability for an unconstrained latent class model. However, Allman et al. (2009)'s approach is not sufficient for establishing identifiability in constrained LCMs or Bayesian Pyramids, due to the complex parameter constraints in these models. Indeed, Allman et al. (2009)'s generic identifiability results imply that in the latent class model with an unconstrained parameter space, there exists a measure-zero subset of parameters where identifiability breaks down. In constrained LCMs, the equality constraints (5) exactly force the parameters to fall into such a measure-zero subset. In Bayesian Pyramids, these equality constraints arise from the potentially sparse between-layer graphs. So without a careful and thorough investigation into the relationship between the parameter constraints and the graphical structure, Kruskal's theorem is not directly applicable to investigating the identifiability of constrained LCMs or Bayesian Pyramids.

Below, we establish strict identifiability of model (4)–(5) by carefully examining the algebraic structure imposed by the constraint matrix S. We first introduce some notation. Denote by ⊗ the Kronecker product of matrices and by ⊙ the Khatri–Rao product of matrices. In particular, consider matrices A = (a_{i,j}) ∈ R^{m×r} and B = (b_{i,j}) ∈ R^{s×t}, and matrices C = (c_{i,j}) = (c_{:,1} | ⋯ | c_{:,k}) ∈ R^{n×k} and D = (d_{i,j}) = (d_{:,1} | ⋯ | d_{:,k}) ∈ R^{ℓ×k}; then A ⊗ B ∈ R^{ms×rt} is the block matrix whose (i,j)th block equals a_{i,j} B, and C ⊙ D ∈ R^{nℓ×k} = (c_{:,1} ⊗ d_{:,1} | ⋯ | c_{:,k} ⊗ d_{:,k}).

The above definitions show the Khatri–Rao product is a column-wise Kronecker product; see more in Kolda and Bader (2009). We first establish the following technical proposition, which is useful for the later theorem on identifiability.

 
Proposition 2

For the constrained latent class model, define the following p parameter matrices, subject to constraints (5) and (6) with some constraint matrix S: for each j ∈ [p], let Λ^(j) = (λ_{c,h}^(j)) denote the d_j × H matrix whose (c, h)th entry is the conditional probability λ_{c,h}^(j).

Denote the Khatri–Rao product of the above p matrices by K = ⊙_{j=1}^{p} Λ^(j), which has size (∏_{j=1}^{p} d_j) × H. The following two conclusions hold.

  • If the H column vectors of the constraint matrix S are distinct, then K must have full column rank H.

  • If S contains identical column vectors, then K can be rank-deficient.

 
Remark 1

Proposition 2 implies that S having distinct columns is sufficient [in part (a)] and almost necessary [in part (b)] for the Khatri–Rao product K = ⊙_{j=1}^{p} Λ^(j) to have full column rank. To see the 'almost necessary' part, consider a special case where, besides constraint (5) that λ_{1:d_j,h_1}^(j) = λ_{1:d_j,h_2}^(j) if S_{j,h_1} = S_{j,h_2} = 0, the parameters also satisfy λ_{1:d_j,h_1}^(j) = λ_{1:d_j,h_2}^(j) if S_{j,h_1} = S_{j,h_2} = 1. In this case, our proof shows that whenever the binary matrix S contains identical column vectors in columns h_1 and h_2, the matrix K also contains identical column vectors in columns h_1 and h_2 and hence is surely rank-deficient.

In the Khatri–Rao product matrix K defined in Proposition 2, each column characterizes the conditional distribution of the vector y given a particular latent class. Therefore, Proposition 2 reveals a nontrivial algebraic property: whether S ∈ {0,1}^{p×H} has distinct column vectors is linked to whether the H conditional distributions of y given each latent class are linearly independent. The matrix S does not need to have full column rank in order to have distinct column vectors. For example, a 3 × 3 matrix S with three columns (1,0,0), (0,1,1), and (1,1,1) is rank-deficient but has distinct column vectors. Indeed, it is not hard to see that a binary matrix S with m rows can have as many as 2^m distinct column vectors.
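A quick numerical check of Proposition 2(a) can be done as follows. This Python sketch uses the rank-deficient-but-distinct-column S from the example above, draws random conditional probability tables obeying constraint (5), and verifies that their Khatri–Rao product has full column rank; all numerical values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices sharing the same number of columns."""
    K = mats[0]
    for M in mats[1:]:
        # column h of the result is the Kronecker product of column h of K and column h of M
        K = np.einsum('ih,jh->ijh', K, M).reshape(-1, K.shape[1])
    return K

def constrained_tables(S, d=3):
    """Random conditional probability matrices Lambda^(j) obeying (5):
    every class h with S[j, h] = 0 shares one baseline column."""
    mats = []
    for row in S:
        baseline = rng.dirichlet(np.ones(d))
        Lam = np.column_stack([rng.dirichlet(np.ones(d)) if s == 1 else baseline for s in row])
        mats.append(Lam)
    return mats

S = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 1, 1]])   # columns (1,0,0), (0,1,1), (1,1,1): distinct but linearly dependent
K = khatri_rao(constrained_tables(S))
print(K.shape, np.linalg.matrix_rank(K))   # (27, 3) 3: full column rank, as Proposition 2(a) predicts
```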

Proposition 1 reformulates a Bayesian Pyramid into a constrained LCM with a constraint matrix S, and then Proposition 2 establishes a nontrivial algebraic property of such constrained LCMs. Propositions 1 and 2 together pave the way for the development of the identifiability theory of Bayesian Pyramids. In particular, the sufficiency part of Proposition 2 uncovers a nontrivial inherent algebraic structure of the considered models. To prove Proposition 2, we leveraged a novel proof technique, the marginal probability matrix, in order to find a sufficient and almost necessary condition for the Khatri–Rao product of conditional probability matrices to have full rank. We anticipate that the conclusion in Proposition 2 can be useful in other graphical models with discrete latent variables, even beyond the Bayesian Pyramids considered in this paper. This is because graphical models involving discrete latent structure can often be formulated as a latent class model with equality constraints determined by the graph. Therefore, the conclusion of Proposition 2 might be of independent interest.

Although the linear independence of K’s columns itself does not lead to identifiability, it provides a basis for investigating strict identifiability of our model. We introduce the definition of strict identifiability under the current setup and then give the strict identifiability result.

Definition 1 (Strict Identifiability)

The constrained latent class model with (5) and (6) is said to be strictly identifiable if, for any valid parameters (Λ, S, ν) and any alternative valid parameters (Λ̄, S̄, ν̄), the following equality holds if and only if (Λ̄, S̄, ν̄) and (Λ, S, ν) are identical up to a latent class permutation:

(9)  P(y_1 = c_1, …, y_p = c_p | Λ, S, ν) = P(y_1 = c_1, …, y_p = c_p | Λ̄, S̄, ν̄)  for all (c_1, …, c_p) ∈ [d_1] × ⋯ × [d_p].
 
Remark 2

When the constraint matrix S is unknown and needs to be identified together with the unknown continuous parameters Λ, there is a trivial nonidentifiability issue that needs to be resolved. To see this, continue to consider the special case mentioned in Remark 1 where λ_{1:d_j,h_1}^(j) = λ_{1:d_j,h_2}^(j) whenever S_{j,h_1} = S_{j,h_2}; then given a matrix S we can generally denote λ_{1:d_j,h}^(j) =: λ_{1:d_j,+}^(j) if S_{j,h} = 1 and λ_{1:d_j,h}^(j) =: λ_{1:d_j,−}^(j) if S_{j,h} = 0. Without further restrictions, the following alternative (S̃, Λ̃) is indistinguishable from the true (S, Λ): S̃ = 1_{p×H} − S, λ̃_{1:d_j,+}^(j) = λ_{1:d_j,−}^(j), and λ̃_{1:d_j,−}^(j) = λ_{1:d_j,+}^(j). One straightforward way to resolve such trivial nonidentifiability of S is to simply enforce that whenever S_{j,h_1} > S_{j,h_2}, the order of λ_{j,c,h_1} and λ_{j,c,h_2} is fixed for every possible category c ∈ [d_j]. In the following studies of identifiability, we always assume such an ordering of Λ with respect to S has been fixed.

For an arbitrary subset A ⊆ [p], denote by S_{A,:} the submatrix of S consisting of the rows indexed by the variables in A; S_{A,:} has size |A| × H.

Theorem 1 (Strict Identifiability)

Consider the proposed constrained latent class model under (5) and (6) with true parameters ν = {ν_h}_{h ∈ [H]}, Λ = {Λ^(j)}_{j ∈ [p]}, and S. Suppose there exists a partition of the p variables [p] = A_1 ∪ A_2 ∪ A_3 such that

  • the submatrix S_{A_i,:} has distinct column vectors for i = 1 and 2; and

  • for any h_1 ≠ h_2 ∈ [H], there is λ_{c,h_1}^(j) ≠ λ_{c,h_2}^(j) for some j ∈ A_3 and some c ∈ [d_j].

Also suppose ν_h > 0 for each h ∈ [H]. Then (Λ, S, ν) are strictly identifiable up to a latent class permutation.

In the above theorem, the constraint matrix S is not assumed to be fixed and known. This implies that both the matrix S and the parameters can be uniquely identified from data. We next give a corollary of Theorem 1, which replaces the condition on parameters Λ in part (b) with a slightly more stringent but also more transparent condition on the binary matrix S.

 
Corollary 1

Consider the proposed constrained latent class model under (5) and (6). Suppose ν_h > 0 for each h ∈ [H]. If there is a partition of the p variables [p] = A_1 ∪ A_2 ∪ A_3 such that each submatrix S_{A_i,:} has distinct column vectors for i = 1, 2, 3, then the parameters (ν, Λ, S) are strictly identifiable up to a latent class permutation.

The conditions of Corollary 1 are more easily checkable than those in Theorem 1 because they only depend on the structure of the constraint matrix S. They require that, after some row rearrangement, the matrix S vertically stacks three submatrices, each of which has distinct column vectors.
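Checking the condition of Corollary 1 for a candidate row partition is a purely combinatorial task on S. A minimal Python sketch, with a hypothetical constraint matrix and partition:

```python
import numpy as np

def has_distinct_columns(M):
    """True if no two columns of M coincide."""
    return np.unique(M.T, axis=0).shape[0] == M.shape[1]

def check_corollary1(S, A1, A2, A3):
    """Check whether the row partition [p] = A1 | A2 | A3 satisfies Corollary 1:
    each submatrix S[Ai, :] must have distinct column vectors."""
    return all(has_distinct_columns(S[list(A), :]) for A in (A1, A2, A3))

# hypothetical 6 x 3 constraint matrix and a partition of its rows
S = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 1, 1]])
print(check_corollary1(S, A1=(0, 1), A2=(2, 3), A3=(4, 5)))   # True
```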

The conclusions of Theorem 1 and Corollary 1 both concern strict identifiability, which is the strongest possible conclusion on parameter identifiability up to label permutation. If we consider the slightly weaker notion of generic identifiability as proposed in Allman et al. (2009), the conditions in Theorem 1 and Corollary 1 can be relaxed. Given a constraint matrix S, denote the constrained parameter space for (ν, Λ) by

(10)

and define the following subset of T_S as

(11)

With the above notation, the generic identifiability of the proposed constrained latent class model is defined as follows.

Definition 2 (Generic Identifiability)

Parameters (Λ, S, ν) are said to be generically identifiable if N_S defined in (11) has measure zero with respect to the Lebesgue measure on T_S defined in (10).

Theorem 2 (Generic Identifiability)

Consider the constrained latent class model under (5) and (6), and a partition of the p variables [p] = A_1 ∪ A_2 ∪ A_3. Suppose for i = 1, 2, changing some entries of S_{A_i,:} from '1' to '0' yields a matrix S̃_{A_i,:} having distinct columns. Also suppose for any h_1 ≠ h_2 ∈ [H], there is λ_{c,h_1}^(j) ≠ λ_{c,h_2}^(j) for some j ∈ A_3 and some c ∈ [d_j]. Then (Λ, S, ν) are generically identifiable.

 
Remark 3

Note that altering some S_{j,h} from one to zero corresponds to adding one more equality constraint, namely that the distribution of the jth variable given the hth latent class is set to the baseline through (5). Therefore, Theorem 2 intuitively implies that if enforcing more parameters in T_S to be equal can give rise to a strictly identifiable model, then the parameters that make the original model unidentifiable occupy only a negligible subset of T_S.

Theorem 2 relaxes the conditions on S for strict identifiability presented earlier. In particular, here the submatrices S_{A_i,:} need not have distinct column vectors; rather, it suffices that altering some entries of S_{A_i,:} from one to zero yields distinct column vectors. As pointed out by Allman et al. (2009), generic identifiability is often sufficient for real data analyses.

So far, we have focused on model identifiability. Next, we show that our identifiability results guarantee Bayesian posterior consistency under suitable priors. Given a sample of size n, denote the observations by y_1, …, y_n, which are n vectors each of dimension p. Recall that under (4), the distribution of the vector y under the considered model can be represented by a p-way probability tensor Π = (π_{c_1 ⋯ c_p}). When adopting a Bayesian approach, one can specify prior distributions for the parameters Λ, S, and ν, which induce a prior distribution for the probability tensor Π. Within this context, we are now ready to state the following theorem.

Theorem 3 (Posterior Consistency)

Denote the collection of model parameters by Θ = (Λ, S, ν). Suppose the prior distributions for the parameters Λ, S, and ν all have full support around the true values. If the true latent structure S_0 and model parameters Λ_0 satisfy the proposed strict identifiability conditions in Theorem 1 or Corollary 1, then the posterior satisfies

Π(Θ ∈ N_ϵ^c(Θ_0) | y_1, …, y_n) → 0 as n → ∞, for any ϵ > 0,

where N_ϵ^c(Θ_0) is the complement of an ϵ-neighborhood of the true parameters Θ_0 in the parameter space.

Theorem 3 implies that under an identifiable model and with appropriately specified priors, the posterior distribution places increasing probability in arbitrarily small neighborhoods of the true parameters of the constrained latent class model as sample size increases. These parameters include the mixture proportions and the class specific conditional probabilities.

3.2 Identifiability of multilayer Bayesian Pyramids

According to Proposition 1, for a multilayer Bayesian Pyramid with a p × K_1 binary graphical matrix G^(1) between the two bottom layers, one can follow (7) to construct a p × 2^{K_1} constraint matrix S^(1), as illustrated in Example 1. We next provide transparent identifiability conditions that directly depend on the binary graphical matrices G^(m). With the next theorem, one only needs to examine the structure of the between-layer connecting graphs to establish identifiability.

 
Theorem 4

Consider the multilayer latent variable model specified in (1)–(2). Suppose the numbers K_1, …, K_{D−1} are known. Suppose each binary graphical matrix G^(m) of size K_{m−1} × K_m (of size p × K_1 if m = 1) takes the following form after some row permutation:

(12)  G^(m) = ( I_{K_m} ; I_{K_m} ; I_{K_m} ; G^(m),* )  (blocks stacked as rows),

where I_{K_m} denotes the K_m × K_m identity matrix and G^(m),* denotes a submatrix of G^(m) that can take an arbitrary form. Further suppose that the conditional distributions of the variables satisfy the inequality constraint in (6). Then the following parameters are identifiable up to a latent variable permutation within each layer: the probability distribution τ of the deepest latent variable z^(D), the conditional probability table of each variable (observed or latent) given its parents, and the binary graphical matrices {G^(m); m = 1, …, D−1}.

 
Remark 4

The proof of Theorem 4 provides a nice layerwise argument for identifiability; that is, one can examine the structure of the Bayesian Pyramid in the bottom-up direction. As long as G^(1), …, G^(ℓ) take the form of (12) for some ℓ, the parameters associated with the conditional distributions of y and α^(1), …, α^(ℓ−1) are identifiable, and the marginal distribution of α^(ℓ) is also identifiable.

Theorem 4 implies the requirement that p ≥ 3K_1 and that K_{m−1} ≥ 3K_m for every m = 2, …, D−1, through the form of G^(m) in (12). That is, the number of latent variables per layer decreases as the layer goes deeper up the pyramid. Condition (12) in Theorem 4 requires that each latent variable have at least three children in the layer immediately below it that do not have any other parents. Our identifiability conditions hold regardless of the specific models chosen for the conditional distributions y_j | α^(1) and each α_k^(m) | α^(m+1), as long as the graphical structure is enforced and these component models are not over-parameterized in a naive manner. We next give two concrete examples that model the distribution of α^(m) | α^(m+1) differently but both respect the graph given by G^(m+1).
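In practice, condition (12) is easy to verify for a candidate graphical matrix: it amounts to each latent variable (column) having at least three 'pure' children, i.e., at least three rows equal to the corresponding standard basis vector. A small Python sketch, with a hypothetical matrix:

```python
import numpy as np

def satisfies_condition_12(G):
    """Check the structure in (12): up to row permutation, G stacks three identity matrices,
    i.e., every latent variable has at least three children with no other parent."""
    pure_rows = G[G.sum(axis=1) == 1]      # children that have exactly one parent
    return bool(np.all(pure_rows.sum(axis=0) >= 3))

G = np.vstack([np.eye(4, dtype=int)] * 3 + [np.ones((2, 4), dtype=int)])  # hypothetical 14 x 4 graph
print(satisfies_condition_12(G))   # True
```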

 
Example 2

We first consider modeling the effects of the parent variables of each α_k^(m) as

(13)

where f: R → (0,1) is a link function. The number of β-parameters in (13) equals Σ_{k'=1}^{K_{m+1}} g_{k,k'}^(m+1), which is the number of edges pointing to α_k^(m). Choosing f(x) = 1/(1 + exp(−x)) leads to a model similar to sparse deep belief networks (Hinton et al., 2006; Lee et al., 2007).

 
Example 3

To obtain a more parsimonious alternative to (13), let

(14)

Model (14) satisfies the conditional independence encoded by G^(m+1), since 1(α_{k'}^(m+1) = g_{k,k'}^(m+1) = 1 for at least one k' ∈ [K_{m+1}]) = 1 − ∏_{k': g_{k,k'}^(m+1) = 1} (1 − α_{k'}^(m+1)), implying that the distribution of α_k^(m) only depends on its parents in the (m+1)th latent layer. This model provides a probabilistic version of Boolean matrix factorization (Miettinen & Vreeken, 2014). The binary indicator equals the Boolean product of the two binary vectors G_{k,:}^(m+1) and α^(m+1), and 1 − θ_{k,1}^(m+1) and θ_{k,0}^(m+1) quantify the two probabilities that the entry α_k^(m) does not equal the Boolean product.
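To make Example 3 concrete, the sketch below computes conditional probabilities P(α_k^(m) = 1 | α^(m+1)) under a probabilistic Boolean-product parametrization: probability theta1[k] when child k has at least one active parent, and theta0[k] otherwise. The graph, parameter values, and exact parametrization of the θ's are illustrative assumptions.

```python
import numpy as np

def boolean_layer_probs(alpha_deep, G, theta1, theta0):
    """P(alpha_k^{(m)} = 1 | alpha^{(m+1)}) under a probabilistic Boolean product model:
    theta1[k] if the Boolean product of G[k, :] and alpha_deep equals 1, theta0[k] otherwise."""
    boolean = (G * alpha_deep[None, :]).max(axis=1)   # Boolean product for each child k
    return np.where(boolean == 1, theta1, theta0)

G_deep = np.array([[1, 0], [1, 1], [0, 1]])   # hypothetical graph: 3 children, 2 parents
alpha_deep = np.array([1, 0])                 # only the first parent is active
print(boolean_layer_probs(alpha_deep, G_deep, theta1=np.full(3, 0.9), theta0=np.full(3, 0.1)))
# -> [0.9 0.9 0.1]: children 1 and 2 have an active parent; child 3 does not
```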

Since Examples 2 and 3 satisfy the conditional independence constraints encoded by the graphical matrices G^(m+1), they satisfy the equality constraint in (5) with the constraint matrix S^(m+1). Therefore, our identifiability conclusion in Theorem 4 applies to both examples under appropriate inequality constraints on the β-parameters or the θ-parameters; for example, see Proposition 3 in Section 4.

Besides Examples 2 and 3, there are many other models that respect the graphical structure. For example, (13) can be extended to include interaction effects of the parents of α_k^(m) as follows:

(15)

In (15), if α_k^(m) has M := Σ_{k'=1}^{K_{m+1}} g_{k,k'}^(m+1) parents, then the number of β-parameters equals 2^M.

Anandkumar et al. (2013) considered the identifiability of linear Bayesian networks. Although both Anandkumar et al. (2013) and this work address identifiability issues of Bayesian networks, their results are not applicable to our highly nonlinear models. The nonlinearity requires techniques that look into the inherent tensor decomposition structures (in particular, a constrained CP decomposition) caused by the discrete latent variables and graphical constraints imposed on the discrete latent distribution. Such inherent constrained tensor structures are specific to discrete and graphical latent structures and are not present in the settings considered in Anandkumar et al. (2013).

4 Bayesian inference for two-layer Bayesian Pyramids

As discussed earlier, the identifiability theory for the multilayer Bayesian Pyramid proposed in Section 2 applies under many different specifications of the component models. In this section, as an important special case, we focus on a two-latent-layer model with depth D = 2 and use a Bayesian approach to infer the latent structure and model parameters. Recall that our two-latent-layer Bayesian Pyramid was specified earlier in (3).

Here, we assume β_{j,d_j,0} = β_{j,d_j,k} = 0 for all k ∈ [K_1], as conventionally done in logistic models, so that the last category d_j serves as the reference. The parameter β_{j,c,k} can be viewed as the weight associated with the potential directed edge from latent variable α_k^(1) to observed variable y_j, for response category c; β_{j,c,k} only impacts the likelihood if there is an edge from α_k^(1) to y_j, that is, if g_{j,k}^(1) = 1. Denote the collection of β-parameters by β. Equation (3) further specifies a usual latent class model in the second latent layer, with B latent classes for the K_1-dimensional latent vector α^(1); this layer of the model has latent class proportion parameters τ = (τ_1, …, τ_B) and conditional probability parameters η = (η_{k,b})_{K_1 × B}. The full set of model parameters across layers is (G^(1), β, τ, η), and the model structure is shown in Figure 2. We denote the conditional probability P(y_j = c | α^(1) = α) in (3) by λ_{j,c,α}.

4.1 Identifiability theory adapted to two-layer Bayesian Pyramids defined in (3)

For the two-latent-layer model in (3), the following proposition presents strict identifiability conditions in terms of explicit inequality constraints for the β parameters.

 
Proposition 3

Consider model (3) with true parameters (G(1),β,τ,η).

  • Suppose G^(1) = (I_{K_1}; I_{K_1}; I_{K_1}; G^*) (blocks stacked as rows), and that the coefficients linking each of the items j, j + K_1, and j + 2K_1 to their single parent α_j^(1) are nonzero, i.e., β_{j,c,j} ≠ 0, β_{j+K_1,c,j} ≠ 0, and β_{j+2K_1,c,j} ≠ 0 for all j ∈ [K_1] and c ∈ [d_j − 1]. Then G^(1), β, and the probability tensor ν^(1) of α^(1) ∈ {0,1}^{K_1} are strictly identifiable.

  • Under the conditions of part (a), if further K_1 ≥ 2 log_2 B + 1, then the parameters τ and η are generically identifiable from ν^(1).

As mentioned in the last paragraph of Section 2, the identifiability results in that section apply to general distributional assumptions for variables organized in a sparse multilayer Bayesian Pyramid. When considering specific models, properties of the model can be leveraged to weaken the identifiability conditions. The next proposition illustrates this, establishing generic identifiability for the model in (3). Before stating the identifiability result, we formally define the allowable constrained parameter space for β under a graphical matrix G(1) as

(16)
 
Proposition 4

Consider model (3) with β belonging to Ω(β; G^(1)) in (16). Suppose the graphical matrix G^(1) can be written as G^(1) = (G_1; G_2; G_3; G^*) (blocks stacked as rows), where each G_m has size K_1 × K_1 and

that is, each of G_1, G_2, G_3 has all diagonal entries equal to one while any off-diagonal entry is free to be either one or zero. Also suppose K_1 ≥ 2 log_2 B + 1. Then (G^(1), β, τ, η) are generically identifiable.

The two different identifiability conditions on the binary graphical matrix G(1) stated in Propositions 3 and 4 correspond to different identifiability notions—strict and generic identifiability, respectively. The generic identifiability notion is slightly less stringent than strict identifiability, by allowing a Lebesgue-measure-zero subset N of the parameter space T where identifiability does not hold. Our sufficient generic identifiability conditions in Proposition 4 are much less stringent than conditions in Proposition 3.

4.2 Bayesian inference for the latent sparse graph and number of binary latent traits

We propose a Bayesian inference procedure for two-latent-layer Bayesian Pyramids. We apply a Gibbs sampler by employing the Polya-Gamma data augmentation in Polson et al. (2013) together with the auxiliary variable method for multinomial logit models in Holmes and Held (2006). Such Gibbs sampling steps can handle general multivariate categorical data.

4.2.1 Inference with a Fixed K1

First consider the case where the true number K_1 of binary latent variables in the middle layer is known and fixed. Inferring the latent sparse graph G^(1) is equivalent to inferring the sparsity structure of the continuous parameters β_{j,c,k}. Let the prior for the intercepts β_{j,c,0} be N(μ_0, σ_0^2), where the hyperparameters μ_0, σ_0^2 can be set to give weakly informative priors for the logistic regression coefficients (Gelman et al., 2008). Specify priors for the other parameters as

(17)

Here σ_0 is a small positive number specifying the 'pseudo'-variance, and we take σ_0 = 0.1 in the numerical studies. Our adoption of a prior variance for β_{j,c,k} when g_{j,k} = 0 follows a similar spirit to the 'pseudo-prior' approach in the Bayesian variable selection literature. In Bayesian variable selection for regression analysis, Dellaportas et al. (2002) first proposed using a pseudo-prior for the variance when a variable is not included in the model, to facilitate convenient Gibbs sampling steps. Specifically, a binary variable ρ_j encodes whether or not the jth predictor is included in the regression, and the regression coefficient is β_j ρ_j, with β_j ∼ π_j N(0, σ^2) + (1 − π_j) N(0, σ_0^2) and P(ρ_j = 1) = π_j. The pseudo-prior variance σ_0^2 does not affect the posterior but may influence the mixing of the MCMC algorithm.

The γ in (17) is further given a noninformative prior γ ∼ Beta(1, 1). The hyperparameters of the inverse-gamma distribution can be set to b_1^σ = b_2^σ = 2. In the data augmentation part, for each subject i ∈ [n], each observed variable j ∈ [p], and each non-baseline category c = 1, …, d_j − 1, we introduce a Polya-Gamma random variable w_{ijc} ∼ PG(1, 0). We use the auxiliary variable approach in Holmes and Held (2006) for multinomial regression to derive the conditional distribution of each β_{j,c,k}. Given data y_{ij} ∈ [d_j], introduce binary indicators y*_{ijc} = 1(y_{ij} = c). The posteriors of β satisfy

and we introduce notation

(18)

Next by the property of the Polya-Gamma random variables (Polson et al., 2013), we have

where p_PG(w_{ijc} | 1, 0) denotes the density function of PG(1, 0). Based on the above identity, the conditional posteriors of the β_{j,c,0}'s and β_{j,c,k}'s are still Gaussian, and the conditional posterior of each w_{ijc} is still Polya-Gamma with (w_{ijc} | −) ∼ PG(1, β_{j,c,0} + Σ_{k=1}^{K_1} β_{j,c,k} g_{j,k} α_{i,k} − C_{ij}^(c)); these full conditional distributions are easy to sample from. As for the binary entries g_{j,k} indicating the presence or absence of edges in the Bayesian Pyramid, we sample each g_{j,k} individually from its posterior Bernoulli distribution. The detailed steps of this Gibbs sampler with known K_1 are presented in the Online Supplementary Material.
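For intuition about these conjugate updates, the following Python sketch shows the binary-logistic version of the Polson et al. (2013) step: given Polya-Gamma draws w, the coefficient vector has a Gaussian full conditional. In the multinomial setting used here, the linear predictor additionally involves the offsets C_ij^(c) from (18), and drawing w itself requires a Polya-Gamma sampler (e.g., a dedicated package); those details are omitted, and all names and values below are placeholders.

```python
import numpy as np

def update_beta_given_pg(y_star, X, w, prior_var, rng):
    """One Gibbs update for logistic regression coefficients with a N(0, diag(prior_var)) prior,
    conditional on Polya-Gamma auxiliary variables w (Polson et al., 2013).
    y_star: binary responses (n,); X: design matrix of latent traits (n, q); w: PG draws (n,)."""
    kappa = y_star - 0.5                                            # PG 'pseudo-data'
    precision = X.T @ (w[:, None] * X) + np.diag(1.0 / prior_var)   # posterior precision
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ kappa)
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(4)
n, q = 200, 3
X = rng.integers(0, 2, size=(n, q)).astype(float)   # e.g., currently sampled binary latent traits
y_star = rng.integers(0, 2, size=n)
w = rng.gamma(1.0, 0.25, size=n)                     # stand-in for PG(1, psi) draws from a PG sampler
beta_draw = update_beta_given_pg(y_star, X, w, prior_var=np.full(q, 4.0), rng=rng)
print(beta_draw.shape)   # (3,)
```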

4.2.2 Inferring an Unknown K1

On top of the sampling algorithm described above for fixed K_1, we propose a method for simultaneously inferring K_1 and the other parameters. In the context of mixture models, Rousseau and Mengersen (2011) defined over-fitted mixtures with more than enough latent components and used shrinkage priors to effectively delete the unnecessary ones. In a similar spirit but motivated by Gaussian linear factor models, Legramanti et al. (2020) proposed the cumulative shrinkage process (CSP) prior, which has a spike-and-slab structure. We use a CSP prior on the variances {σ_{c,k}^2} of the {β_{j,c,k}} to infer the number of binary latent variables K_1 in a two-layer Bayesian Pyramid. The rationale for using such an increasing shrinkage prior for latent dimension selection is that it is natural to expect additional latent dimensions to play a progressively less important role in characterizing the data, so the associated parameters should have a stochastically decreasing effect. Specifically, under the CSP prior, the kth latent dimension is controlled by a scalar θ_k that follows a spike-and-slab distribution. Redundant dimensions are essentially deleted by progressively shrinking the sequence {θ_1, θ_2, …} towards an appropriate value θ_∞ (the spike). In particular, Legramanti et al. (2020) considered the continuous factor model where θ_k > 0 denotes the variance of the factor loadings for the kth factor and θ_∞ is a small positive number indicating the variance of redundant latent factors.

We next describe in detail the prior specification with an unknown number of binary latent variables K_1. Consider an upper bound K_upper for K_1, with K_1 < K_upper. Based on the identifiability conditions in Theorem 4 about the shape of the p × K_1 matrix G^(1), K_1 is naturally constrained to be at most p/3, so K_upper can be set to p/3 or smaller in practice. We adopt a prior that detects redundant binary latent variables by increasingly shrinking the variance of the β_{j,c,k} as k grows from 1 to K_upper. Specifically, letting β_{j,c,k} ∼ N(0, σ_{c,k}^2), we put a CSP prior on the variances {σ_{c,1}^2, σ_{c,2}^2, …, σ_{c,K_upper}^2} for each non-baseline category c, where σ_∞^2 is a prespecified small positive number indicating the spike variance for redundant binary latent variables:

(19)  σ_{c,k}^2 ∼ (1 − π_k) InvGa(b_1^σ, b_2^σ) + π_k δ_{σ_∞^2},  k = 1, …, K_upper,
(20)  π_k = Σ_{ℓ=1}^{k} v_ℓ ∏_{m=1}^{ℓ−1} (1 − v_m),

where InvGa(b_1^σ, b_2^σ) refers to the inverse-gamma distribution with shape b_1^σ and scale b_2^σ, and π_k has the stick-breaking representation in (20), with v_1, v_2, …, v_{K_upper−1} independently following the Beta(1, α_0) distribution. We set v_{K_upper} = 1 to truncate the stick-breaking representation at K_upper, similarly to Legramanti et al. (2020). The δ_{σ_∞^2} represents the Dirac spike distribution, with σ_∞^2 serving as the variance of redundant latent variables, while InvGa(b_1^σ, b_2^σ) represents the more diffuse slab distribution for the variances corresponding to active latent variables. The 'increasing shrinkage' comes from the fact that, as the latent variable index k increases, the probability π_k of α_k belonging to the spike stochastically increases, because π_k = Σ_{ℓ=1}^{k} v_ℓ ∏_{m=1}^{ℓ−1} (1 − v_m) increases with k. Therefore, the CSP prior features an increasing amount of shrinkage for larger k. We introduce auxiliary variables {h_k; k = 1, …, K_upper} with h_k ∈ [K_upper] to aid understanding and facilitate posterior computation. Specifically, the prior in (19) can be obtained by marginalizing out a discrete auxiliary variable h_k with P(h_k = ℓ) = v_ℓ ∏_{m=1}^{ℓ−1} (1 − v_m), so (19)–(20) can be reformulated in terms of h_k as

σ_{c,k}^2 | h_k ∼ 1(h_k ≤ k) δ_{σ_∞^2} + 1(h_k > k) InvGa(b_1^σ, b_2^σ),  P(h_k = ℓ) = v_ℓ ∏_{m=1}^{ℓ−1} (1 − v_m),  ℓ ∈ [K_upper].

Therefore, the auxiliary variables h_k determine whether the kth latent dimension is in the spike and hence corresponds to a redundant latent variable α_k: if h_k > k then σ_{c,k}^2 follows the slab distribution InvGa(b_1^σ, b_2^σ) and α_k is active; otherwise σ_{c,k}^2 = σ_∞^2 lies in the spike and α_k is redundant. Given (19), the largest possible number of active latent variables is K_upper − 1, because π_{K_upper} = P(h_{K_upper} ≤ K_upper) = 1, so the last latent variable is always redundant. Since the event 1(h_k > k) indicates that the kth component is in the slab and hence active, we can write the total number of active latent dimensions K as

(21)  K = Σ_{k=1}^{K_upper} 1(h_k > k).

Tracking the posterior samples of all the hk can give a posterior estimate of K. The above data augmentation leads to Gibbs updating steps. We present the details of our Gibbs sampler in the Online Supplementary Material.
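The following Python sketch illustrates this increasing-shrinkage mechanism: it draws the truncated stick-breaking weights, computes the increasing spike probabilities π_k, and evaluates (21) for one draw of the auxiliary indicators h_k. The hyperparameter values mirror those used later in the simulations, but the snippet is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def csp_active_count(h):
    """Number of active latent dimensions implied by auxiliary indicators h, as in (21)."""
    k = np.arange(1, len(h) + 1)
    return int(np.sum(h > k))

K_upper, alpha0 = 7, 5.0
v = np.append(rng.beta(1.0, alpha0, size=K_upper - 1), 1.0)        # truncated stick-breaking
omega = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))      # omega_l = P(h_k = l)
pi = np.cumsum(omega)                                              # pi_k = P(h_k <= k), increasing in k
h = rng.choice(np.arange(1, K_upper + 1), size=K_upper, p=omega)   # one draw of (h_1, ..., h_Kupper)
print(np.round(pi, 3), csp_active_count(h))
```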

 
Remark 5

It is methodologically straightforward to extend our Gibbs sampler to deep Bayesian Pyramids with more than two latent layers. To see this, note that in a deep Bayesian Pyramid with D ≥ 3, for each m, the conditional distribution of α^(m) given α^(m+1) in Example 2 is a special case of the conditional distribution of y given α^(1): both conditionals follow generalized linear models with the (multinomial) logit link, with the parent variables serving as predictors for the child. Under such a formulation, introducing additional Polya-Gamma auxiliary variables for the α^(1)-layer, similar to those for the y-layer, would allow Gibbs updates in a three-latent-layer Bayesian Pyramid. In this work, we focus on two-latent-layer models for computational efficiency.

 
Remark 6

We also remark that it is not hard to derive and implement a Gibbs sampler for the constrained latent class models (CLCMs) discussed in Section 3.1. Performing Gibbs sampling for CLCMs is a relatively straightforward extension of the current Gibbs sampler for Bayesian Pyramids: one can similarly use the data augmentation strategies in Holmes and Held (2006) and Polson et al. (2013) to deal with y_j ∈ [d_j], and one can adopt priors for the binary constraint matrix S ∈ {0,1}^{p×H} similar to those adopted for the graphical matrix G ∈ {0,1}^{p×K_1} in a Bayesian Pyramid. Our preliminary simulations with such a Gibbs sampler under CLCMs showed that the recovery of the parameters and the constraint matrix is not as stable and accurate as for Bayesian Pyramids (shown in the later Figures 3–5). One explanation is that, compared to multilayer Bayesian Pyramids, the parametrization of a CLCM is less parsimonious and requires many more mixture proportion parameters to describe the same joint distribution of the observed variables. Therefore, we focus on the method for Bayesian Pyramids in this work and treat CLCMs mainly as intermediate tools that help establish the identifiability of Bayesian Pyramids.

Figure 3. Posterior means of the main-effect parameters {β_{j,1,k}} for response category c = 1, averaged across 50 independent replications, for n = 500 or 2,000. K_upper = 7 is used in the CSP prior for posterior inference, while the true K_1 is 4, as shown in the rightmost plot.

Figure 4. Mean estimation errors of the binary graphical matrix G^(1) at the matrix level (a), row level (b), and entry level (c). The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.

Figure 5. Average root-mean-square errors (RMSE) of the posterior means of the main effects $\beta$, intercepts $\beta_0$, and deeper tensor arms $\eta$ versus sample size $n = 250i$ for $i \in \{1, 2, \ldots, 8\}$. The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.

5 Simulation studies

We conducted replicated simulation studies to assess the Bayesian procedure proposed in Section 4 and examine whether the model parameters are indeed estimable, as implied by Theorem 3. Consider a two-latent-layer Bayesian Pyramid with $p = 20$ observed variables, $d = 4$ response categories for each observed variable, $K_1 = 4$ binary latent variables in the middle layer, and one binary latent variable in the deep layer. Let the true $p \times K_1$ binary graphical matrix be $G^{(1)} = (I_4;\ I_4;\ I_4;\ 1100;\ 0110;\ 0011;\ 1001;\ 1010;\ 1001;\ 0101;\ 1110;\ 0111)$, where each four-digit string denotes a row of $G^{(1)}$. Such a $G^{(1)}$ satisfies the conditions for strict identifiability in Theorem 4, since it contains three copies of the identity matrix $I_4$ as submatrices. Let the true intercept parameters for categories $c = 1, 2, 3$ be $(\beta_{j,1,0}, \beta_{j,2,0}, \beta_{j,3,0}) = (3, 2, 1)$ for each $j$; and for any $g_{j,k}^{(1)} = 1$, let the corresponding true main-effect parameters of the binary latent variables be $\beta_{j,c,k} = 3$ if variable $y_j$ has a single parent and $\beta_{j,c,k} = 2$ if $y_j$ has multiple parents. See Figure 3c for a heatmap of the sparse matrix of the main-effect parameters $(\beta_{j1k};\ j \in [p],\ k \in [K_1])$ for category $c = 1$.

We use the method developed in Section 4 with the CSP prior for posterior inference under an unknown $K_1$. We specify an upper bound for $K_1$ as $K_{\mathrm{upper}} = 7$, since $\lceil p/3 \rceil = 7$ is a natural upper bound here based on the identifiability considerations mentioned before. As for the hyperparameters in the CSP prior, we mainly follow the default settings and suggestions in Legramanti et al. (2020). Specifically, we set $\alpha_0$ to the same value as in Legramanti et al. (2020), $\alpha_0 = 5$; we also follow their suggestion that the spike variance $\sigma^2$ should be a small positive number that nonetheless avoids excessively low values, and set $\sigma^2 = 0.07$. Our simulations adopting these choices produce accurate estimation results under different settings (see the later Figures 3–5). In our preliminary simulations, we also varied these hyperparameters around the above values and did not observe the algorithmic performance to be sensitive to them. Throughout the Gibbs sampling iterations, we enforce the identifiability constraint on $\beta$ that $\beta_{jck} > 0$ whenever $g_{j,k} = 1$. For each sample size $n$ we conduct 50 independent simulation replications. Since the model is identifiable up to a permutation of the latent variables in each layer, we post-process the posterior samples to find a column permutation of $G^{(1)}$ that best matches the simulation truth; the columns of the other parameter matrices are then permuted accordingly.
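This post-processing step can be implemented by a small search over column permutations. Below is a minimal sketch under hypothetical array names (not our actual code); with $K_1 = 4$ the search covers only $4! = 24$ candidates.

```python
import numpy as np
from itertools import permutations

def best_column_permutation(G_est, G_true):
    """Return the column order of G_est that best matches G_true (fewest entry disagreements)."""
    K = G_true.shape[1]
    best_perm, best_err = None, np.inf
    for perm in permutations(range(K)):
        err = np.sum(G_est[:, perm] != G_true)   # Hamming distance between the two binary matrices
        if err < best_err:
            best_perm, best_err = perm, err
    return np.array(best_perm)

# the same permutation is then applied to the columns of the other parameter matrices
```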

We have used the Gelman–Rubin convergence diagnostic (Gelman et al., 2013; Gelman & Rubin, 1992) to assess the convergence of the MCMC output from multiple random initializations. In particular, for the simulation setting with $n = 1{,}000$, we randomly initialize the parameters from their prior distributions $J = 5$ times and run five MCMC chains for each simulated dataset. For each chain, we run 15,000 iterations, discard the first 10,000 iterations as burn-in, and retain every fifth sample post burn-in to thin the chain. After collecting the posterior samples, we calculate the potential scale reduction factor $\hat{R}$ (i.e., the Gelman–Rubin statistic) of the model parameters. The median Gelman–Rubin statistics across the five chains for the deep conditional probabilities $\eta = (\eta_{kb})_{K \times B}$ and for the deep latent class proportions $\tau = (\tau_b)_{B \times 1}$ are both very close to 1.

The Gelman–Rubin statistics for the other parameters are similarly well controlled and are omitted. We also inspected the traceplots of the MCMC output and observed fast mixing of the chains after convergence. These observations justify running the MCMC for 15,000 iterations and discarding the first 10,000 as burn-in; we therefore adopt these settings in all the numerical experiments. When applying the proposed method to other datasets, we also recommend calculating the Gelman–Rubin statistics and inspecting the traceplots to determine the appropriate total number of MCMC iterations and burn-in iterations.
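For completeness, a minimal sketch of the potential scale reduction factor for a scalar parameter, computed from the standard between/within-chain variance decomposition described in Gelman et al. (2013), is given below; chain_draws is a hypothetical array of post-burn-in draws.

```python
import numpy as np

def gelman_rubin(chain_draws):
    """Potential scale reduction factor (R-hat) for a scalar parameter.

    chain_draws: array of shape (J, T) holding T post-burn-in draws from each of J chains.
    """
    J, T = chain_draws.shape
    chain_means = chain_draws.mean(axis=1)
    B = T * chain_means.var(ddof=1)               # between-chain variance
    W = chain_draws.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (T - 1) / T * W + B / T             # pooled estimate of the posterior variance
    return np.sqrt(var_hat / W)                   # values close to 1 indicate convergence
```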

Our Bayesian modeling of $G$ adopts rather uninformative priors with each entry $g_{j,k} \sim \mathrm{Bernoulli}(\gamma)$, and we do not force $G$ to take a specific form (such as containing an identity submatrix, as described in Proposition 3 for strict identifiability) but rather let the sampler freely explore the space of all binary matrices to estimate $G$. Our theory on identifiability and posterior consistency implies that the posteriors of both $G$ and the other parameters concentrate around their true values as the sample size grows. This is empirically verified in our simulation studies: Figures 3–5 show that as the sample size $n$ grows, both the discrete $G$ and the continuous parameters are estimated consistently. We next elaborate on these findings from the simulations.

Figure 3 presents posterior means of the main-effect parameters $(\beta_{j1k};\ j \in [p],\ k \in [K_1])$ for category $c = 1$, averaged across the 50 independent replications. The leftmost plot of Figure 3 shows that for a relatively small sample size $n = 500$, the posterior means of $(\beta_{j1k})$ already exhibit a similar structure to the ground truth in the rightmost plot. Also, the true number of $K_1 = 4$ binary latent variables in the middle latent layer is revealed in Figures 3a and b. For the sample size $n = 2{,}000$, Figure 3b shows that the posterior means of $(\beta_{j1k})$ are very close to the ground truth. The posterior means for the other categories $c = 2, 3, 4$ show similar patterns to those for $c = 1$. The estimated $(\beta_{j1k})$ are slightly biased toward zero for the smaller sample size ($n = 500$), with less bias for the larger sample size ($n = 2{,}000$). Our results suggest that the binary graphical matrix underlying $\{\beta_{jck}\}$ (that is, the sparsity structure of the continuous parameters) is easier to estimate than the nonzero $\{\beta_{jck}\}$ themselves. That is, with finite samples, the existence of a link between each observed variable $y_j$ and each binary latent variable $\alpha_k$ is easier to estimate than the strength of such a link (if it exists).

To assess how the approach performs with an increasing sample size, we consider eight sample sizes $n = 250i$ for $i = 1, 2, \ldots, 8$ under the same settings as above. To compare the estimated structures to the simulation truth with $K_1 = 4$, we retain exactly four binary latent variables $k \in [K_{\mathrm{upper}}]$, chosen as those with the largest average posterior variance $\frac{1}{d}\sum_{c=1}^{d}\sigma_{ck}^2$. For the discrete structure of the binary graphical matrix $G^{(1)}$, we present mean estimation errors in Figure 4. Specifically, Figure 4a plots the errors in recovering the entire matrix $G^{(1)}$, Figure 4b the errors in recovering the row vectors of $G^{(1)}$, and Figure 4c the errors in recovering the individual entries of $G^{(1)}$. For a sample size as small as $n = 750$, estimation of $G^{(1)}$ is already very accurate across the simulation replications. Notably, $n = 750$ is much smaller than the total number of cells $d^p = 4^{20} \approx 1.1 \times 10^{12}$ in the contingency table for the observed variables $y$. For the continuous parameters $\beta_0 = \{\beta_{jc0}\}$, $\beta = \{\beta_{jck}\}$, and $\eta = \{\eta_{kb}\}$, Figure 5 plots the average root-mean-square errors (RMSE) of the posterior means versus the sample size $n$. For each of $\beta_0$, $\beta$, and $\eta$, Figure 5 shows that the RMSE decreases with the sample size, as expected given our posterior consistency result in Theorem 3.
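The three levels of estimation error for $G^{(1)}$ in Figure 4 are simple misclassification rates at different granularities, and the RMSE is the usual root-mean-square error; a short sketch under hypothetical array names (after the permutation matching described above) is as follows.

```python
import numpy as np

def graph_errors(G_hat, G_true):
    """Matrix-, row-, and entry-level recovery errors for a binary graphical matrix."""
    matrix_err = float(not np.array_equal(G_hat, G_true))   # 1 if any entry differs, else 0
    row_err = np.mean(np.any(G_hat != G_true, axis=1))      # fraction of rows with at least one error
    entry_err = np.mean(G_hat != G_true)                    # fraction of incorrect entries
    return matrix_err, row_err, entry_err

def rmse(est, truth):
    """Root-mean-square error between a posterior-mean estimate and the simulation truth."""
    return np.sqrt(np.mean((np.asarray(est) - np.asarray(truth)) ** 2))
```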

In the simulations (and also in the later real data analysis), we have checked the posterior samples collected from the MCMC algorithm and examined the traceplots of various parameters in the model. Based on these checks, we have not observed label-switching issues in our numerical studies. In general, we suggest using the traceplots of the mixture-component-specific parameters to examine whether there is a label-switching issue. If such issues arise, we recommend post-processing the MCMC output with the R package label.switching (Papastamoulis, 2016).

6 Application to DNA nucleotide sequence data

We apply our proposed two-latent-layer Bayesian Pyramid with the CSP prior to the Splice Junction dataset of DNA nucleotide sequences; the data are available from the UCI machine learning repository (Dua & Graff, 2017). We also analyze another dataset of nucleotide sequences, the Promoter data, and present the results in the Online Supplementary Material. The Splice Junction data consist of the A, C, G, T nucleotides ($d = 4$) at $p = 60$ positions for $n = 3{,}175$ DNA sequences. There are two types of genomic regions called introns and exons; junctions between these two are called splice junctions (Nguyen et al., 2016). Splice junctions can be further categorized as (a) exon-intron junctions and (b) intron-exon junctions. The $n = 3{,}175$ samples in the Splice Junction dataset each belong to one of three types: exon-intron junction ('EI', 762 samples); intron-exon junction ('IE', 765 samples); and neither EI nor IE ('N', 1,648 samples). Previous studies have used supervised learning methods for predicting the sequence type (e.g., Li & Wong, 2003; Nguyen et al., 2016). Here we fit the proposed two-latent-layer Bayesian Pyramid to the data in a completely unsupervised manner, with the sequence type information held out. We use the nucleotide sequences to learn discrete latent representations of each sequence, and then investigate whether the learned lower-dimensional discrete latent features are interpretable and predictive of the sequence type.

We let the variable $z$ in the deepest latent layer have $B = 3$ categories, inspired by the fact that there are three types of sequences, EI, IE, and N; but we do not use any information about which sequence belongs to which type when fitting the model to data. As mentioned earlier, the upper bound $K_{\mathrm{upper}}$ for the number of binary latent variables can be set to $p/3$ or smaller in order to yield an identifiable Bayesian Pyramid model. In practice, when $p$ is large, we recommend starting with a relatively small $K_{\mathrm{upper}}$, inspecting the estimated active/redundant latent dimensions, and only increasing $K_{\mathrm{upper}}$ if all the latent dimensions are estimated to be active a posteriori (that is, if the posterior mode of $K$ defined in (21) equals $K_{\mathrm{upper}} - 1$). For this splice junction dataset with $p = 60$, we start with $K_{\mathrm{upper}} = 7$ for better computational efficiency; this $K_{\mathrm{upper}}$ is the same as that used in the simulations and already allows for $2^{K_{\mathrm{upper}} - 1} = 64$ distinct binary latent profiles. We find that the posterior selects only $K^{\star} = 5$ active latent dimensions, suggesting that it is not necessary to increase $K_{\mathrm{upper}}$ for this dataset.

We again run the Gibbs sampler for 15,000 iterations, discarding the first 10,000 as burn-in and retaining every fifth sample post burn-in to thin the chain. Based on examining the traceplots, the sampler has good mixing behavior. As mentioned in the last paragraph, our method selects $K^{\star} = 5$ binary latent variables. We index the saved posterior samples by $r \in \{1, \ldots, R = 2{,}000\}$; for each $r$, denote the samples of $G$ by $(g_{j,k}^{(r)};\ j \in [60],\ k \in [5])$. Similarly, for each $r$, denote the posterior samples of the nucleotide sequences' latent binary traits by $(a_{i,k}^{(r)};\ i \in [3{,}175],\ k \in [5])$, and denote those of the nucleotide sequences' deep latent category by $(z_i^{(r)};\ i \in [3{,}175])$, where $z_i^{(r)} \in \{1, 2, 3\}$. Define our final estimators $\hat{G} = (\hat{g}_{j,k})$, $\hat{A} = (\hat{\alpha}_{i,k})$, and $\hat{Z} = (\hat{z}_{i,b})$ to be the element-wise posterior modes of these samples.
The quantities $\hat{g}_{j,k}$, $\hat{\alpha}_{i,k}$, and $\hat{z}_{i,b}$ summarize the element-wise posterior modes of the discrete latent structures in our model. The $60 \times 5$ matrix $\hat{G}$ depicts how the loci load on the binary latent traits, the $3{,}175 \times 5$ matrix $\hat{A}$ depicts the presence or absence of each binary latent trait in each nucleotide sequence, and the $3{,}175 \times 3$ matrix $\hat{Z}$ depicts which deep latent group each nucleotide sequence belongs to. $\hat{G}$, $\hat{A}$, and $\hat{Z}$ are all binary matrices, but the first two are binary feature matrices, while each row of $\hat{Z}$ has exactly one entry equal to 1, indicating group membership. In Figure 6, the first three plots display the three estimated matrices $\hat{G}$, $\hat{A}$, and $\hat{Z}$, respectively, and the last plot shows the held-out gene type information for reference. As for the estimated loci-loading matrix $\hat{G}$, Figure 6a provides information on how the $p = 60$ loci depend on the five binary latent traits. Specifically, we find that the first 27 loci show somewhat similar loading patterns and mainly load on the first four binary traits. Also, the middle 10 loci (from locus 28 to locus 37) are similar in loading on all five traits, and the last 23 loci (from locus 38 to locus 60) are similar in exhibiting sparser loading structures. Figure 6b–d shows that the two matrices $\hat{A}$ and $\hat{Z}$ corresponding to the $n = 3{,}175$ nucleotide sequences exhibit a clear clustering pattern, which aligns well with the known but held-out junction types EI, IE, and N.
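Since all entries are binary, the element-wise posterior mode amounts to a majority vote over the retained draws; a minimal sketch with hypothetical array names is given below.

```python
import numpy as np

# G_draws: (R, 60, 5) posterior draws of G;  A_draws: (R, 3175, 5);  z_draws: (R, 3175)
def majority_vote(draws):
    """Element-wise posterior mode of binary draws stacked along axis 0."""
    return (draws.mean(axis=0) > 0.5).astype(int)

def class_mode(z_draws, n_classes=3):
    """Most frequent deep latent class for each sequence, returned as one-hot rows of Z_hat."""
    R, n = z_draws.shape
    counts = np.stack([(z_draws == b + 1).sum(axis=0) for b in range(n_classes)], axis=1)
    Z_hat = np.zeros((n, n_classes), dtype=int)
    Z_hat[np.arange(n), counts.argmax(axis=1)] = 1
    return Z_hat
```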

Figure 6. Splice junction data analysis under the CSP prior with $K_{\mathrm{upper}} = 7$. Plots are presented with the $K^{\star} = 5$ binary latent traits selected by our proposed method a posteriori. After applying the rule-lists approach to deterministically match the latent features to the gene types as in (22), the accuracies for predicting the gene types EI, IE, N are all above 95%.

To formally assess how the latent discrete features learned by the proposed method perform in downstream prediction, we apply the 'rule lists' classification approach of Angelino et al. (2017) to the estimated latent features in $\hat{A}$ and $\hat{Z}$ for the $n = 3{,}175$ nucleotide sequences. The rule-lists approach is an interpretable classification method for a categorical feature space, which finds simple, deterministic rules on the categorical features for predicting a (binary) class label. For each instance $i \in \{1, \ldots, n = 3{,}175\}$, we define the categorical features to be the eight-dimensional binary vector $\hat{x}_i = (\hat{z}_{i,1}, \hat{z}_{i,2}, \hat{z}_{i,3}, \hat{a}_{i,1}, \ldots, \hat{a}_{i,5})$, obtained by concatenating $\hat{z}_i$ and $\hat{a}_i$. Denote the ground-truth nucleotide sequence types by $t = (t_i;\ 1 \le i \le 3{,}175)$, where $t_i = \mathrm{EI}$ for $i = 1, \ldots, 762$, $t_i = \mathrm{IE}$ for $i = 763, \ldots, 1{,}527$, and $t_i = \mathrm{N}$ for $i = 1{,}528, \ldots, 3{,}175$. Recall that the information in the $t_i$'s is not used in fitting our Bayesian Pyramid to obtain the latent features $\hat{x}_i$'s. We use the Python package CORELS for the rule-lists method with the $\hat{x}_i$'s and $t_i$'s as input, and find rules that match the $\hat{x}_i$'s to the $t_i$'s. Specifically, the deterministic rules given by CORELS are:

(22)

Equation (22) gives a very simple and explicit rule depending on $\hat{z}_{i,2}$, $\hat{a}_{i,1}$, and $\hat{a}_{i,5}$ for each nucleotide sequence $i$. This simple rule can be empirically verified by comparing the middle two plots to the rightmost plot in Figure 6. In (22), the rules for EI and IE are not mutually exclusive, but those for EI and N are mutually exclusive, and the same holds for IE and N. Therefore, to obtain mutually exclusive rules for the three class labels EI, IE, and N based on (22), we can simply define the following two types of labels $\hat{t}^{\star} = (\hat{t}^{\star}_i;\ i \in [n])$ and $\hat{t}^{\dagger} = (\hat{t}^{\dagger}_i;\ i \in [n])$:

(23)

For the two types of labels $\hat{t}^{\star}$ and $\hat{t}^{\dagger}$ in (23), we provide the corresponding normalized confusion matrices in Figure 7 as heatmaps. Figure 7 shows that both $\hat{t}^{\star}$ and $\hat{t}^{\dagger}$ predict the sequence types $t$ very accurately, with all the diagonal entries above 95%. The strong prediction performance shown in Figure 7 implies that the learned discrete latent features of the nucleotide sequences are interpretable and very useful in the downstream classification task. For the Splice Junction dataset, such accuracies are even comparable to the state-of-the-art prediction performance of the fine-tuned convolutional neural networks in Nguyen et al. (2016), which is also between 95% and 97%.
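For readers wishing to reproduce this downstream step, below is a minimal one-vs-rest sketch using the CorelsClassifier interface of the pycorels package (installed via pip install corels); the feature arrays and names are hypothetical placeholders, and the actual rules our analysis obtained are those reported in (22).

```python
import numpy as np
from corels import CorelsClassifier  # pip install corels

# X_hat: (3175, 8) binary features concatenating Z_hat (3 columns) and A_hat (5 columns)
# t: (3175,) ground-truth sequence types in {"EI", "IE", "N"}, held out during model fitting
feature_names = ["z1", "z2", "z3", "a1", "a2", "a3", "a4", "a5"]

def fit_rule_list(X_hat, t, label):
    """Fit a certifiably optimal rule list predicting one sequence type versus the rest."""
    clf = CorelsClassifier(max_card=3)                      # at most three rules, as in the text
    y = (t == label).astype(int)                            # one-vs-rest binary target
    clf.fit(X_hat, y, features=feature_names, prediction_name=label)
    return clf

# e.g., rules_EI = fit_rule_list(X_hat, t, "EI"); predictions = rules_EI.predict(X_hat)
```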

Figure 7. Downstream interpretable classification results for the splice junction data after fitting the two-latent-layer Bayesian Pyramid. The prediction performance of the learned lower-dimensional discrete latent features is summarized in two confusion matrices, corresponding to $\hat{t}^{\star}$ and $\hat{t}^{\dagger}$ defined in (23).

In the above data analysis, we have fixed $B = 3$, inspired by the fact that there are three types of splice junctions. Alternatively, we can use the overfitted mixture framework of Rousseau and Mengersen (2011) to accommodate an unknown $B$. Overfitted mixtures are finite mixture models with an unknown number of mixture components $B$, where only an upper bound $B_{\mathrm{upper}}$ of $B$ is known. In such scenarios, Rousseau and Mengersen (2011) proposed to use shrinkage priors for the mixture proportion parameters, such as Dirichlet priors with small Dirichlet hyperparameters. More specifically, denote the mixture proportion parameters corresponding to the $B_{\mathrm{upper}}$ 'overfitted' mixture components by $\pi := (\pi_1, \ldots, \pi_{B_{\mathrm{upper}}})$; then $\pi$ lives on the $(B_{\mathrm{upper}} - 1)$-dimensional probability simplex. The overfitted mixture framework in Rousseau and Mengersen (2011) guarantees that, with the prior $\pi \sim \mathrm{Dirichlet}(a_1, \ldots, a_{B_{\mathrm{upper}}})$ where the Dirichlet parameters are small enough relative to the dimension of the observations, the redundant mixture components will be 'emptied out' in the posterior asymptotically. Hence, the $B$ true mixture components underlying the data will be identified from the posterior distribution. In the Online Supplementary Material, we reanalyze the real data specifying $B = 5$ and conduct simulation studies with an overfitted $B$, achieving favorable results.
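To illustrate the shrinkage effect of small Dirichlet hyperparameters, the following sketch draws mixture proportions from a symmetric Dirichlet prior with a small concentration parameter and shows that most prior mass concentrates on a few components; this is only a prior illustration under assumed values, not the posterior emptying-out behavior established by Rousseau and Mengersen (2011).

```python
import numpy as np

rng = np.random.default_rng(1)
B_upper, a = 5, 0.1                         # assumed overfitted bound and a small Dirichlet hyperparameter
pi_draws = rng.dirichlet(np.full(B_upper, a), size=5)
print(np.round(pi_draws, 3))                # most rows put nearly all mass on one or two components
```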

For comparison, we also applied or implemented two alternative discrete latent structure models: (a) the latent class model, with a single univariate discrete latent variable behind the observed data layer; and (b) a single-latent-layer model, with one layer of independent binary latent variables behind the data layer. For model (a), we applied the nonparametric Bayesian method in Dunson and Xing (2009) to the splice junction dataset and extracted $k_{\mathrm{DX}} = 10$ latent classes (the default in Dunson & Xing, 2009; increasing $k_{\mathrm{DX}}$ beyond 10 is not considered because the current 10 latent classes already contain empty classes not occupied by any nucleotide sequence). For model (b), we implemented a new Gibbs sampler (see the description in Section 9.3 of the Online Supplementary Material), still employing the cumulative shrinkage process prior as used for Bayesian Pyramids. Then, based on the learned lower-dimensional latent representations, we again apply the 'rule-lists' classification approach of Angelino et al. (2017) (via the Python package CORELS) to each of the three labels EI, IE, and N via binary classification. In addition to the above two competitors (abbreviated 'DX2009' and 'IndepBinary', respectively), we also consider a benchmark approach of directly applying the rule-lists classifier to the raw DNA nucleotide sequences. Since the rule-lists classifier in the CORELS package only applies to binary predictors, we first convert the DNA nucleotide sequences of A, G, C, T into longer sequences containing binary indicators of whether each locus is 'A', 'G', or 'C'; after this conversion, each original 60-dimensional nucleotide sequence becomes a binary vector of length 180, as sketched below. We then apply the rule-lists classifier using the same setting of learning at most three 'rules' (i.e., max_card=3 in the CORELS package) as for the three latent variable methods: DX2009, IndepBinary, and BayesPyramid. The resulting classification accuracies are reported in Table 1.
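The conversion of raw sequences into binary predictors can be done with three indicators per locus (whether the nucleotide is A, G, or C; T is then implied when all three indicators are zero). A short sketch under hypothetical variable names:

```python
import numpy as np

def binarize_sequences(seqs):
    """Convert length-60 nucleotide strings into length-180 binary vectors of A/G/C indicators."""
    out = np.zeros((len(seqs), 3 * len(seqs[0])), dtype=int)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq):
            for m, letter in enumerate("AGC"):   # 'T' is encoded as (0, 0, 0) at its locus
                out[i, 3 * j + m] = int(base == letter)
    return out

# example: binarize_sequences(["ACGT" * 15]).shape == (1, 180)
```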

Table 1. Downstream classification accuracy for splice data

Method                     EI (train)  IE (train)  N (train)  EI (test)  IE (test)  N (test)
Raw nucleotide sequences   0.909       0.854       0.840      0.916      0.866      0.839
DX2009                     0.946       0.955       0.918      0.950      0.965      0.929
IndepBinary                0.964       0.955       0.956      0.965      0.957      0.965
BayesPyramid               0.971       0.975       0.970      0.984      0.983      0.976

The left three columns of Table 1 list accuracies on the training set containing 80% of the samples, and the right three columns on the test set containing the remaining 20%. The test accuracies are very close to the training ones, indicating satisfactory generalizability. The reason is that the maximum number of rules learned for each model is limited to at most three, i.e., max_card=3 in the CORELS package. Such a small number of rules does not overfit the training data and hence generalizes well to the test data. We remark that when increasing max_card beyond three, the downstream classification accuracy of BayesPyramid remains the same as in Table 1; however, for the raw nucleotide sequences, the CORELS package cannot complete the execution of the classifier with max_card=4, likely due to the high dimensionality of the search space of rules. Therefore, we present the CORELS classification results with max_card=3 in Table 1.

Remarkably, Table 1 shows that all three unsupervised learning methods learn latent features that are more predictive of the sequence type than the raw sequences themselves, with the Bayesian Pyramid the clear top performer. This observation indicates that latent variable approaches, especially our multilayer Bayesian Pyramids, can 'denoise' the raw sequence data and return more useful features for downstream tasks. The model in Dunson and Xing (2009) ('DX2009') and the single-latent-layer independent model ('IndepBinary') are special cases of Bayesian Pyramids, and both perform well on the splice data. However, Bayesian Pyramids reduce the test-set misclassification rates of IndepBinary by 31%–60%, of DX2009 by 51%–68%, and of the raw nucleotide sequences by 81%–87%.

7 Discussion

In this article, we have proposed Bayesian Pyramids, a general family of multilayer discrete latent variable models for multivariate categorical data. Bayesian Pyramids cover multiple existing statistical and machine learning models as special cases, while allowing for a variety of assumptions on the conditional distributions. Our identifiability theory is key in providing reassurance that one can reliably and reproducibly learn the latent structure, a guarantee that is lacking from almost all of the related literature. The proposed Bayesian approach has excellent performance in the simulations and data analyses, showing good computational performance and a surprising ability to accurately infer meaningful and useful latent structure. There are immediate applications of the proposed Bayesian Pyramid approach in many other disciplines. For example, in ecology and biodiversity research, joint species distribution modeling is a very active area (see the recent book by Ovaskainen & Abrego, 2020). The data consist of high-dimensional binary indicators of the presence or absence of different species at each sampling location. In this context, Bayesian Pyramids provide a useful approach for inferring latent community structure of biological and ecological interest.

This work focuses on the unsupervised setting and uses latent variable approaches. In unsupervised cases, there is not a specific outcome/response associated with each data point, and the general goal is to discover ‘interesting’ patterns in data (Murphy, 2012). Such unsupervised learning problems are generally considered more challenging and less well studied than supervised ones. Latent variables can capture semantically interpretable constructs, and marginalizing out latents induces great flexibility in the resulting marginal distribution of observables. Bayesian Pyramids provide unsupervised feature discovery tools for learning latent architectures underlying multivariate data, yielding insight into data generating processes, performing nonlinear dimension reduction, and extracting useful latent representations.

In order to realize the great potential of latent variable approaches to unsupervised learning, identifiability issues must be addressed so that reliable latent constructs can be reproducibly extracted from data. In particular, if one wishes to interpret the latent representations and parameters estimated from a latent variable model and use them in downstream analyses, then identifiability is a prerequisite for making such interpretation meaningful and reproducible. In this sense, identifiability is necessary for interpretability of the latent structures. We remark that such interpretability considerations differ from the goals of learning interpretable treatment rules for some outcome in complex supervised learning settings (see, e.g., Kitagawa & Tetenov, 2018; Semenova & Chernozhukov, 2021). In modern machine learning, although powerful deep latent variable models have been proposed, identifiability issues have rarely been considered and the design of the latent architecture is often guided by heuristics. In this work, our identifiability theory directly informs how to specify the deep latent architecture in a Bayesian Pyramid: the identifiability conditions on the matrices $G^{(m)}$ (in Theorem 4) inspire us to specify a 'pyramid' structure featuring fewer and fewer latent variables deeper up the hierarchy ($J \ge 3K_1$, $K_1 \ge 3K_2, \ldots$) and sparse graphical connections between layers.

In the future, it is worth further advancing and refining scalable algorithms for Bayesian Pyramids beyond the two-latent-layer case. Methodologically, our current Gibbs sampling procedure can be readily extended to multiple latent layers, because non-adjacent layers in a Bayesian Pyramid are conditionally independent and the current Gibbs updates for shallower layers can be similarly applied to deeper ones. In terms of computational performance, however, efficiently sampling discrete latent variables via MCMC is generally more challenging than sampling continuous ones. In a Bayesian Pyramid, the binary entries of the graphical matrices $G^{(1)}, G^{(2)}, \ldots$ act similarly to covariate selection indicators in (logistic) regression problems, yet are more difficult to estimate because the 'covariates' $\alpha^{(1)}, \alpha^{(2)}, \ldots$ are themselves latent and discrete. This fact can cause unsatisfactory mixing behavior in naive extensions of the Gibbs sampler to deeper models. For example, updating an entry in a deeper graphical matrix $G^{(m)}$ with $m \ge 2$ can cause a substantial shift of the posterior space for the associated continuous parameters in the layers below. Future research is warranted to advance the MCMC methodology or explore complementary methods for estimating the discrete latent structures in deeper models.

Acknowledgments

The authors thank the Editor, Associate Editor, and three referees for helpful and constructive comments.

Funding

This work was partially supported by the National Science Foundation grant DMS-2210796. This work was also partially supported by grants R01ES027498 and R01ES028804 from the National Institute of Environmental Health Sciences of the National Institutes of Health, and received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement No 856506).

Data availability

Matlab code implementing the proposed method and data are available at this GitHub repository: https://github.com/yuqigu/BayesianPyramids.

Supplementary material

Supplementary material is available at Journal of the Royal Statistical Society: Series B online.

References

Allman, E. S., Matias, C., & Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A), 3099–3132. https://doi.org/10.1214/09-AOS689

Anandkumar, A., Hsu, D., Javanmard, A., & Kakade, S. (2013). Learning linear Bayesian networks with latent variables. In International Conference on Machine Learning (pp. 249–257). PMLR.

Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5, pp. 111–150). Univ of California Press.

Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., & Rudin, C. (2017). Learning certifiably optimal rule lists for categorical data. The Journal of Machine Learning Research, 18(1), 8753–8830.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Chen, Y., Culpepper, S., & Liang, F. (2020). A sparse latent class model for cognitive diagnosis. Psychometrika, 85(1), 121–153. https://doi.org/10.1007/s11336-019-09693-2

Chen, Y., Li, X., & Zhang, S. (2020). Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115(532), 1756–1770. https://doi.org/10.1080/01621459.2019.1635485

Chen, Y., Liu, J., Xu, G., & Ying, Z. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110(510), 850–866. https://doi.org/10.1080/01621459.2014.934827

Collins, L. M., & Lanza, S. T. (2009). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences (Vol. 718). John Wiley & Sons.

de la Torre, J. (2011). The generalized DINA model framework. Psychometrika, 76(2), 179–199. https://doi.org/10.1007/s11336-011-9207-7

Dellaportas, P., Forster, J. J., & Ntzoufras, I. (2002). On Bayesian model and variable selection using MCMC. Statistics and Computing, 12(1), 27–36. https://doi.org/10.1023/A:1013164120801

Doshi-Velez, F., & Ghahramani, Z. (2009). Correlated non-parametric latent feature models. In Uncertainty in Artificial Intelligence. PMLR.

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv, arXiv:1702.08608, preprint: not peer reviewed.

Drton, M., Foygel, R., & Sullivant, S. (2011). Global identifiability of linear structural equation models. The Annals of Statistics, 39(2), 865–886. https://doi.org/10.1214/10-AOS859

Dua, D., & Graff, C. (2017). UCI machine learning repository.

Dunson, D. B., & Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487), 1042–1051. https://doi.org/10.1198/jasa.2009.tm08439

Erosheva, E., Fienberg, S., & Lafferty, J. (2004). Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101(suppl_1), 5220–5227. https://doi.org/10.1073/pnas.0307760101

Eysenck, S. B., Barrett, P. T., & Saklofske, D. H. (2020). The junior Eysenck personality questionnaire. Personality and Individual Differences, 169(5), 109974. https://doi.org/10.1016/j.paid.2020.109974

Fang, G., Liu, J., & Ying, Z. (2019). On the identifiability of diagnostic classification models. Psychometrika, 84(1), 19–40. https://doi.org/10.1007/s11336-018-09658-x

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.

Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383. https://doi.org/10.1214/08-AOAS191

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. https://doi.org/10.1214/ss/1177011136

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2), 215–231. https://doi.org/10.1093/biomet/61.2.215

Gu, Y., & Xu, G. (2019). Learning attribute patterns in high-dimensional structured latent attribute models. Journal of Machine Learning Research, 20(115), 1–58. https://jmlr.org/papers/v20/19-197.html

Gu, Y., & Xu, G. (2020). Partial identifiability of restricted latent class models. Annals of Statistics, 48(4), 2082–2107. https://doi.org/10.1214/19-AOS1878

Gyllenberg, M., Koski, T., Reilink, E., & Verlaan, M. (1994). Non-uniqueness in probabilistic numerical identification of bacteria. Journal of Applied Probability, 31(2), 542–548. https://doi.org/10.2307/3215044

Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26(4), 301–321. https://doi.org/10.1111/j.1745-3984.1989.tb00336.x

Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5), 5947. https://doi.org/10.4249/scholarpedia.5947

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554. https://doi.org/10.1162/neco.2006.18.7.1527

Holmes, C. C., & Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1), 145–168. https://doi.org/10.1214/06-BA105

Kitagawa, T., & Tetenov, A. (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2), 591–616. https://doi.org/10.3982/ECTA13288

Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. https://doi.org/10.1137/07070111X

Koopmans, T. C., & Reiersol, O. (1950). The identification of structural characteristics. The Annals of Mathematical Statistics, 21(2), 165–181. https://doi.org/10.1214/aoms/1177729837

Kruskal, J. B. (1977). Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2), 95–138. https://doi.org/10.1016/0024-3795(77)90069-6

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In Studies in social psychology in World War II Vol. IV: Measurement and prediction (pp. 362–412). Princeton University Press.

Lee, H., Ekanadham, C., & Ng, A. (2007). Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems, 20, 873–880.

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 609–616).

Legramanti, S., Durante, D., & Dunson, D. B. (2020). Bayesian cumulative shrinkage for infinite factorizations. Biometrika, 107(3), 745–752. https://doi.org/10.1093/biomet/asaa008

Li, J., & Wong, L. (2003). Using rules to analyse bio-medical data: A comparison between C4.5 and PCL. In International Conference on Web-Age Information Management (pp. 254–265). Springer.

McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering (Vol. 38). M. Dekker.

Miettinen, P., & Vreeken, J. (2014). MDL4BMF: Minimum description length for Boolean matrix factorization. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(4), 1–31. https://doi.org/10.1145/2601437

Mourad, R., Sinoquet, C., Zhang, N. L., Liu, T., & Leray, P. (2013). A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47(1), 157–203. https://doi.org/10.1613/jair.3879

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Interpretable machine learning: Definitions, methods, and applications. Proceedings of the National Academy of Sciences, 116(44), 22071–22080. https://doi.org/10.1073/pnas.1900654116

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.

Nguyen, N. G., Tran, V. A., Ngo, D. L., Phan, D., Lumbanraja, F. R., Faisal, M. R., Abapihi, B., Kubo, M., & Satou, K. (2016). DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering, 9(5), 280–286. https://doi.org/10.4236/jbise.2016.95021

Ovaskainen, O., & Abrego, N. (2020). Joint species distribution modelling: With applications in R. Ecology, Biodiversity and Conservation. Cambridge University Press.

Ovaskainen, O., Abrego, N., Halme, P., & Dunson, D. B. (2016). Using latent variable models to identify large networks of species-to-species associations at different spatial scales. Methods in Ecology and Evolution, 7(5), 549–555. https://doi.org/10.1111/2041-210X.12501

Papastamoulis, P. (2016). label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69(1), 1–24. https://doi.org/10.18637/jss.v069.c01

Pearl, J. (2014). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Elsevier.

Pokholok, D. K., Harbison, C. T., Levine, S., Cole, M., Hannett, N. M., Lee, T. I., Bell, G. W., Walker, K., Rolfe, P. A., Herbolsheimer, E., Zeitlinger, J., Lewitter, F., Gifford, D. K., & Young, R. A. (2005). Genome-wide map of nucleosome acetylation and methylation in yeast. Cell, 122(4), 517–527. https://doi.org/10.1016/j.cell.2005.06.026

Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349. https://doi.org/10.1080/01621459.2013.829001

Poon, H., & Domingos, P. (2011). Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (pp. 689–690). IEEE.

Rousseau, J., & Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5), 689–710. https://doi.org/10.1111/j.1467-9868.2011.00781.x

Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215. https://doi.org/10.1038/s42256-019-0048-x

Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement, 6(4), 219–262.

Semenova, V., & Chernozhukov, V. (2021). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2), 264–289. https://doi.org/10.1093/ectj/utaa027

Skinner, C. (2019). Analysis of categorical data for complex surveys. International Statistical Review, 87(S1), S64–S78. https://doi.org/10.1111/insr.12285

Teicher, H. (1961). Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1), 244–248. https://doi.org/10.1214/aoms/1177705155

von Davier, M., & Lee, Y.-S. (2019). Handbook of diagnostic classification models. Springer.

Xu, G. (2017). Identifiability of restricted latent class models with binary responses. The Annals of Statistics, 45(2), 675–707. https://doi.org/10.1214/16-AOS1464

Zhao, H., Melibari, M., & Poupart, P. (2015). On the relationship between sum-product networks and Bayesian networks. In International Conference on Machine Learning (pp. 116–124). PMLR.

Zhou, J., Bhattacharya, A., Herring, A. H., & Dunson, D. B. (2015). Bayesian factorizations of big sparse tensors. Journal of the American Statistical Association, 110(512), 1562–1576. https://doi.org/10.1080/01621459.2014.983233

Zwiernik, P. (2018). Latent tree models. In Handbook of graphical models (pp. 283–306). CRC Press.

Author notes

Conflict of interest None declared.

