Yuqi Gu, David B Dunson, Bayesian Pyramids: identifiable multilayer discrete latent structure models for discrete data, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 85, Issue 2, April 2023, Pages 399–426, https://doi.org/10.1093/jrsssb/qkad010
Abstract
High-dimensional categorical data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable parsimonious models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet is challenging to address when there are complex latent structures. In this article, we propose a class of identifiable multilayer (potentially deep) discrete latent structure models for discrete data, termed Bayesian Pyramids. We establish the identifiability of Bayesian Pyramids by developing novel transparent conditions on the pyramid-shaped deep latent directed graph. The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate the identifiability and estimability of model parameters. Applications of the methodology to DNA nucleotide sequence data uncover useful discrete latent features that are highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data and can be a useful alternative to popular machine learning methods.
1 Introduction
High-dimensional unordered categorical data are ubiquitous in many scientific disciplines, including the DNA nucleotides of A, G, C, T in genetics (Nguyen et al., 2016; Pokholok et al., 2005), occurrences of various species in ecological studies of biodiversity (Ovaskainen & Abrego, 2020; Ovaskainen et al., 2016), responses from psychological and educational assessments or social science surveys (Eysenck et al., 2020; Skinner, 2019), and document data gathered from huge text corpora or publications (Blei et al., 2003; Erosheva et al., 2004). Modeling and extracting information from multivariate discrete data require different statistical methods and theoretical understanding from those for continuous data. In an unsupervised setting, it is an important task to uncover reliable and meaningful latent patterns from the potentially high-dimensional and heterogeneous discrete observations. Ideally, the inferred lower-dimensional latent representations should not only provide scientific insights on their own, but also aid downstream statistical analyses through effective dimension reduction.
Recently, there has been a surge of interest in interpretable machine learning, see Doshi-Velez and Kim (2017), Rudin (2019), and Murdoch et al. (2019), among others. Latent variable approaches, however, have received limited attention in this emerging literature, likely due to the associated complexities. Indeed, deep learning models with many layers of latent variables are usually considered as uninterpretable black boxes. For example, the deep belief network (Hinton et al., 2006; Lee et al., 2009) is a very popular deep learning architecture, but it is generally not reliable to interpret the inferred latent structure. However, for high-dimensional data, it is highly desirable to perform dimension reduction to extract the key signals in the form of lower-dimensional latent representations. If the latent representations are themselves reliable, then they can be viewed as surrogate features of the data and then passed along to existing interpretable machine learning methods for downstream tasks. A key to success in such modeling and analysis processes is the interpretability of the latent structure. This in turn relies crucially on the identifiability of the statistical latent variable model being used.
In statistical terms, a set of parameters for a family of statistical models are said to be identifiable if distinct values of the parameters correspond to distinct distributions of the observed data. Studies under such an identifiability notion date back to Koopmans and Reiersol (1950), Teicher (1961), and Goodman (1974). Model identifiability is a fundamental prerequisite for valid statistical estimation and inference. In the latent variable context, if one wishes to interpret the parameters and latent representations learned using a latent variable model, then identifiability is necessary for making such interpretation meaningful and reproducible. Early considerations of identifiability in the latent variable context can be traced to the seminal work of Anderson and Rubin (1956) for traditional factor analysis. For modern latent variable models potentially containing nonlinear, non-continuous, and even deep layers of latent variables, identifiability issues can be challenging to address.
Recent developments on the identifiability of continuous latent variable models include Drton et al. (2011), Anandkumar et al. (2013), and Chen, Li, et al. (2020). Discrete latent variable models are an appealing alternative to their continuous counterparts in terms of the combination of interpretability and flexibility. Finite mixture models (McLachlan & Basford, 1988) routinely used for model-based clustering are a canonical example involving a single discrete latent variable. Such relatively simple approaches are insufficiently flexible for complex data sets. Extensions with multiple latent variables and/or multilayer structure have distinct advantages in such settings but come with increasingly complex identifiability issues. In this work, we are motivated to build identifiable deep latent variable models, which are flexible enough to capture the complex dependencies in real-world data, yet also with appropriate restrictions and parsimony to yield identifiability.
We propose a family of multilayer, potentially deep, discrete latent variable models and propose novel identifiability conditions for them. We establish identifiability for hierarchical latent structures organized in a 'pyramid'-shaped Bayesian network. In such a Bayesian Pyramid, observed variables are at the bottom and multilayer latent variables above them describe the data generating process. Sparse graphical connections occur between layers, and our identifiability conditions impose structural and size constraints on these between-layer graphs. Technically, we tackle identifiability by first reformulating the Bayesian Pyramid as a constrained latent class model (LCM; Goodman, 1974) in a layerwise manner. Then, we derive a nontrivial algebraic property of LCMs under such parameter constraints (Proposition 2) and combine it with Kruskal's theorem (Allman et al., 2009; Kruskal, 1977) on tensor decompositions to establish identifiability. Our identifiability results are not only technically novel, but also provide insights into methodology development. Indeed, the identifiability theory directly inspires the specification of deep latent architecture in Bayesian Pyramids, which features fewer latent variables deeper up the hierarchy. The identifiability results offer a theoretical basis for learning potentially deep and interpretable latent structures from high-dimensional discrete data.
A nice consequence of the identifiability results is the posterior consistency of Bayesian procedures under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. We develop a Gibbs sampler with data augmentation for computation. Simulation studies corroborate identifiability and show good performance of Bayes estimators. We apply the proposed approach to two DNA nucleotide sequence datasets (Dua & Graff, 2017). For the splice junction data, when using latent representations learned from our two-latent-layer model in downstream classification of nucleotide sequence types, we achieve a remarkable accuracy comparable to fine-tuned convolutional neural networks (i.e., in Nguyen et al., 2016). This suggests that the developed recipe of unsupervised learning of discrete data may serve as a useful alternative to popular machine learning methods.
The rest of this paper is organized as follows. Section 2 proposes Bayesian Pyramids, a new family of pyramid-shaped deep latent variable models for discrete data and reformulates a Bayesian Pyramid into constrained latent class models (CLCMs) in a layerwise manner. Section 3 first considers the identifiability of the general CLCMs and then proposes identifiability conditions for the multilayer deep Bayesian Pyramids. To illustrate the proposed framework, Section 4 focuses on a two-latent-layer Bayesian Pyramid and proposes a Bayesian estimation approach. Section 5 provides simulation studies that examine the performance of the proposed methodology and corroborate the identifiability theory. Section 6 applies the method to real data on nucleotide sequences. Finally, Section 7 discusses implications and future directions. Technical proofs, posterior computation details, and additional data analyses are included in the Online Supplementary Material. Matlab code implementing the proposed method is available at https://github.com/yuqigu/BayesianPyramids.
2 Bayesian Pyramids: multilayer latent structure models
This section proposes Bayesian Pyramids to model the joint distribution of multivariate unordered categorical data. For an integer $M$, denote $[M] = \{1, 2, \ldots, M\}$. Suppose for each subject, there are $p$ observed variables $\boldsymbol{y} = (y_1, \ldots, y_p)$, where $y_j \in [d_j]$ for each variable $j \in [p]$; here $d_j$ is the number of categories that the $j$th observed variable can potentially take. We mainly consider multiple (potentially deep) layers of binary latent variables, motivated by better computational tractability and also simpler interpretability, with each variable encoding presence or absence of a certain latent construct. The stack of multiple layers of binary latent variables induces a model resembling deep belief networks (Hinton et al., 2006). However, our proposed class of models is more general in terms of distributional assumptions. In the following, we first describe in detail our proposed Bayesian Pyramids in Section 2.1 and then connect them to a latent class model (Goodman, 1974) subject to certain constraints.
2.1 Multilayer Bayesian Pyramids
The proposed models belong to the broader family of Bayesian networks (Pearl, 2014), which are directed acyclic graphical models that can encode rich conditional independence information. We propose a 'pyramid'-like Bayesian network with one latent variable at the root and more and more latent variables in downward layers, where the bottom layer consists of the observed variables $\boldsymbol{y} = (y_1, \ldots, y_p)$. Denote the number of latent layers in this Bayesian network by $D$, which can be viewed as the depth. Specifically, let the layer of latent variables consecutive to the observed $\boldsymbol{y}$ be $\boldsymbol{a}^{(1)} = (a^{(1)}_1, \ldots, a^{(1)}_{K_1})$ with $K_1$ variables, and let a deeper layer of latent variables consecutive to $\boldsymbol{a}^{(m-1)}$ be $\boldsymbol{a}^{(m)} = (a^{(m)}_1, \ldots, a^{(m)}_{K_m})$ for $m = 2, \ldots, D-1$. Finally, at the top and the deepest layer of the pyramid, we specify a single discrete latent variable $z \in [B]$, or equivalently $\boldsymbol{a}^{(D)} \equiv z$. In this Bayesian network, all the directed edges point in the top-down direction only between two consecutive layers, and there are no edges between variables within a layer. This gives the following factorization of the joint distribution of $\boldsymbol{y}$ and the latent variables, where the subscript $i$ denotes the index of a random subject,

$$P\big(\boldsymbol{y}_i, \boldsymbol{a}^{(1)}_i, \ldots, \boldsymbol{a}^{(D-1)}_i, z_i\big) = P(z_i)\, \prod_{m=1}^{D-1}\prod_{k=1}^{K_m} P\big(a^{(m)}_{ik} \mid \boldsymbol{a}^{(m+1)}_{i,\mathrm{pa}(k,m)}\big)\, \prod_{j=1}^{p} P\big(y_{ij} \mid \boldsymbol{a}^{(1)}_{i,\mathrm{pa}(j)}\big). \tag{1}$$

In the above display, for each $j \in [p]$, $\boldsymbol{a}^{(1)}_{i,\mathrm{pa}(j)}$ is a collection of first-latent-layer variables, which are the parents of $y_{ij}$. Similarly, for each $k \in [K_m]$, $\boldsymbol{a}^{(m+1)}_{i,\mathrm{pa}(k,m)}$ is a collection of $(m+1)$th-latent-layer variables, which are the parents of the $m$th-latent-layer variable $a^{(m)}_{ik}$.
Figure 1 gives a graphical visualization of the proposed model. To make clear the sparsity structure of this graph, we introduce binary matrices $\boldsymbol{G}^{(1)}, \ldots, \boldsymbol{G}^{(D-1)}$, termed graphical matrices, to encode the connecting patterns between consecutive layers. That is, $\boldsymbol{G}^{(m)}$ summarizes the parent-child graphical patterns between the two consecutive layers $\boldsymbol{a}^{(m-1)}$ and $\boldsymbol{a}^{(m)}$, with the convention $\boldsymbol{a}^{(0)} \equiv \boldsymbol{y}$. Specifically, matrix $\boldsymbol{G}^{(1)}$ has size $p \times K_1$ and matrix $\boldsymbol{G}^{(m)}$ has size $K_{m-1} \times K_m$ for $m = 2, \ldots, D-1$, with entries

$$g^{(1)}_{jk} = 1\{\text{there is an edge from } a^{(1)}_{k} \text{ to } y_{j}\}, \qquad g^{(m)}_{kk'} = 1\{\text{there is an edge from } a^{(m)}_{k'} \text{ to } a^{(m-1)}_{k}\}, \; m \geq 2. \tag{2}$$
Each variable in the graph is subject-specific, implying that all the circles in Figure 1 represent subject-specific quantities. Namely, if there are $n$ subjects in the sample, each of them would have its own realizations of $z_i$ and the $\boldsymbol{a}^{(m)}_i$'s. The proposed directed acyclic graph is not necessarily a tree, as shown in Figure 1. That is, each variable can have multiple parent variables in the layer above it, while in a directed tree, each variable can only have one parent. As a simpler illustration, we also provide a two-latent-layer Bayesian Pyramid in Figure 2, which features a simpler yet still quite expressive architecture. For example, we can specify the conditional distribution of each observed $y_j$ using a multinomial or binomial logistic model with its parent variables as linear predictors and specify the distribution of $\boldsymbol{a}^{(1)}$ given $z$ using a latent class model. Namely, for a random subject $i$ (writing $K = K_1$, $\boldsymbol{G} = \boldsymbol{G}^{(1)} = (g_{jk})$, and $\boldsymbol{a}_i = \boldsymbol{a}^{(1)}_i$ for brevity),

$$P(y_{ij} = c \mid \boldsymbol{a}_i) = \frac{\exp\big(\beta_{jc0} + \sum_{k=1}^{K} g_{jk}\beta_{jck} a_{ik}\big)}{\sum_{c'=1}^{d_j}\exp\big(\beta_{jc'0} + \sum_{k=1}^{K} g_{jk}\beta_{jc'k} a_{ik}\big)}, \qquad P(\boldsymbol{a}_i = \boldsymbol{\alpha} \mid z_i = b) = \prod_{k=1}^{K}\eta_{bk}^{\alpha_k}(1-\eta_{bk})^{1-\alpha_k}, \qquad P(z_i = b) = \tau_b. \tag{3}$$
Later in Section 4, we will focus on the two-latent-layer model when describing the estimation methodology; please refer to that part for further details.

Figure 1. Multiple layers of binary latent traits $\boldsymbol{a}^{(m)}$ model the distribution of the observed $\boldsymbol{y}$. Binary matrices $\boldsymbol{G}^{(m)}$ encode the sparse connection patterns between consecutive layers. Dotted arrows emanating from the root variable $z$ summarize omitted intermediate layers.

Figure 2. Two-latent-layer model. Latent variables $(z_i, \boldsymbol{a}_i)$ and observed variables $\boldsymbol{y}_i$ are subject-specific, and model parameters are population quantities.
We provide some discussion of the conceptual motivation for the proposed multilayer model. Intuitively, for each subject $i$, the variable $z_i$ in the deepest layer of the pyramid can encode some coarse-grained latent categories of subjects, while the vector $\boldsymbol{a}^{(1)}_i$ encodes more fine-grained latent features of subjects. The hierarchical multilayer structure is conceptually appealing as it can provide latent representations of data at multiple resolutions, where the nonlinear compositions between different latent resolutions boost model flexibility. Model (3) for generating $\boldsymbol{a}^{(1)}_i$ given $z_i$ is a nonlinear latent class model (LCM). Hence, $z_i$ cannot be viewed as a linear projection of the $\boldsymbol{a}^{(1)}$-layer; rather, the nonlinear LCM (with a deeper variable $z$ and conditional independence of the $a^{(1)}_k$'s given $z$) can encode and explain rich and complex dependencies between the $a^{(1)}_k$ variables using parsimonious parametrizations.
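To make the two-latent-layer generative process concrete, the following minimal sketch simulates data from a model of the form (3), assuming the multinomial-logistic and latent class components written above and a common number of categories across variables; the function and variable names are illustrative rather than part of the released implementation.

```python
import numpy as np

def simulate_two_layer_pyramid(n, G, beta0, beta, tau, eta, rng):
    """Simulate from a two-latent-layer pyramid of the form (3).

    G     : (p, K) binary graphical matrix between y and a.
    beta0 : (p, d) intercepts per variable and response category (common d assumed).
    beta  : (p, d, K) main-effect coefficients; only entries with G[j, k] = 1 matter.
    tau   : (B,) proportions of the deep latent classes z.
    eta   : (B, K) Bernoulli probabilities P(a_k = 1 | z = b).
    """
    p, K = G.shape
    d = beta0.shape[1]
    z = rng.choice(len(tau), size=n, p=tau)            # deep latent class
    a = (rng.random((n, K)) < eta[z]).astype(int)      # binary latent traits
    # multinomial-logistic layer with logits masked by the graph G
    logits = beta0[None, :, :] + np.einsum('nk,jck,jk->njc', a, beta, G)
    probs = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs /= probs.sum(axis=2, keepdims=True)
    y = np.array([[rng.choice(d, p=probs[i, j]) for j in range(p)] for i in range(n)])
    return y, a, z
```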
2.2 Reformulating the Bayesian Pyramid as a constrained latent class model
In this subsection, we reveal an interesting equivalence relationship between our Bayesian Pyramid models and latent class models (LCMs, Goodman, 1974) with certain equality constraints. Such equivalence will pave the way for investigating the identifiability of Bayesian Pyramids. The traditional latent class model (Goodman, 1974; Lazarsfeld, 1950) posits the following joint distribution of $\boldsymbol{y} = (y_1, \ldots, y_p)$ for a random subject $i$:

$$P\big(y_{i1} = c_1, \ldots, y_{ip} = c_p\big) = \sum_{\ell=1}^{L} \nu_{\ell} \prod_{j=1}^{p} \lambda_{j, c_j, \ell}, \qquad c_j \in [d_j]. \tag{4}$$
The above equation specifies a finite mixture model with one discrete latent variable $s$ having $L$ categories (i.e., latent classes). In particular, $\nu_{\ell} = P(s_i = \ell)$ denotes the probability of belonging to the $\ell$th latent class, and $\lambda_{j,c,\ell} = P(y_{ij} = c \mid s_i = \ell)$ denotes the conditional probability of response $c$ for variable $j$ given the latent class membership $s_i = \ell$. Expression (4) implies that the observed variables $y_1, \ldots, y_p$ are conditionally independent given the latent $s$. In the Bayesian literature, the model in Dunson and Xing (2009) is a nonparametric generalization of (4), which allows an infinite number of latent classes.
Latent class modeling under (4) is widely used in the social and biomedical sciences, where researchers often hope to infer subpopulations of individuals having different profiles (Collins & Lanza, 2009). However, the overly simplistic form of (4) can lead to poor performance in inferring distinct and interpretable subpopulations. In particular, the model assumes that individuals in different subpopulations have completely different response probabilities $\lambda_{j,c,\ell}$ for all $j$ and $c$, and that conditionally on subpopulation membership all the variables are independent. These restrictions can force the number of classes to increase in order to provide an adequate fit to the data, which can degrade the interpretability of a plain latent class model.
We introduce some notation before proceeding. Denote a $p \times L$ all-one matrix by $\mathbf{1}_{p \times L}$. For a matrix $\boldsymbol{M}$ with $p$ rows and a set $S \subseteq [p]$, denote by $\boldsymbol{M}_{S}$ the submatrix of $\boldsymbol{M}$ consisting of the rows indexed by $S$. Consider a family of constrained latent class models, which enable learning of a potentially large number of interpretable, identifiable, and diverse latent classes. A key idea is sharing of parameters within certain latent classes for each observed variable. We introduce a binary constraint matrix $\boldsymbol{\Gamma} = (\gamma_{j,\ell})$ of size $p \times L$, which has rows indexed by the observed variables and columns indexed by the latent classes. The binary entry $\gamma_{j,\ell}$ indicates whether the conditional probability table $\boldsymbol{\lambda}_{j,\cdot,\ell} = (\lambda_{j,1,\ell}, \ldots, \lambda_{j,d_j,\ell})$ is free or instead equal to some unknown baseline. Specifically, if $\gamma_{j,\ell} = 1$ then $\boldsymbol{\lambda}_{j,\cdot,\ell}$ is free; while for those latent classes $\ell$ such that $\gamma_{j,\ell} = 0$, their conditional probability tables $\boldsymbol{\lambda}_{j,\cdot,\ell}$'s are constrained to be equal. Hence, $\boldsymbol{\Gamma}$ puts the following equality constraints on the parameters of (4):

$$\lambda_{j,c,\ell} = \lambda_{j,c,\ell'} \quad \text{for all } c \in [d_j], \text{ whenever } \gamma_{j,\ell} = \gamma_{j,\ell'} = 0. \tag{5}$$
We also enforce a natural inequality constraint for identifiability,

$$\boldsymbol{\lambda}_{j,\cdot,\ell} \neq \boldsymbol{\lambda}_{j,\cdot,\ell'} \quad \text{whenever } \gamma_{j,\ell} = 1 \text{ and } \gamma_{j,\ell'} = 0. \tag{6}$$
If $\boldsymbol{\Gamma} = \mathbf{1}_{p \times L}$, then there are no active constraints and the original latent class model (4) is recovered. We generally denote the conditional probability parameters by $\boldsymbol{\Lambda} = \{\lambda_{j,c,\ell}\}$, where $\lambda_{j,c,\ell} = \lambda_{j,c,\ell'}$ if $\gamma_{j,\ell} = \gamma_{j,\ell'} = 0$. As will be revealed soon, such constrained latent class models are related to our proposed Bayesian Pyramids via a neat mathematical transformation.
Viewed from a different perspective, a latent class model (4) specifies a decomposition of the $d_1 \times d_2 \times \cdots \times d_p$ probability tensor $\boldsymbol{\Pi} = (\pi_{c_1 \cdots c_p})$, where $\pi_{c_1 \cdots c_p} = P(y_1 = c_1, \ldots, y_p = c_p)$. This corresponds to the PARAFAC/CANDECOMP (CP) decomposition in the tensor literature (Kolda & Bader, 2009), which can be used to factorize general real-valued tensors, while our focus is on probability tensors. The proposed equality constraint (5) induces a family of constrained CP decompositions.
This family is connected with the sparse CP decomposition of Zhou et al. (2015), with both having equality constraints summarized by a binary matrix. However, Zhou et al. (2015) encourage different observed variables to share parameters, while our proposed model encourages different latent classes to share parameters through (5).
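As a concrete numerical illustration of this constrained CP structure, the short sketch below assembles the probability tensor in (4) from class proportions and per-variable conditional probability matrices; it is a generic illustration under the stated assumptions, not the paper's implementation.

```python
import numpy as np

def probability_tensor(nu, lambdas):
    """Assemble the p-way probability tensor of (4) from class proportions nu
    (length L) and per-variable conditional probability matrices lambdas[j]
    of shape (d_j, L); equality constraints such as (5) are assumed to have
    been imposed on the columns of each lambdas[j] beforehand."""
    L = len(nu)
    p = len(lambdas)
    Pi = 0.0
    for ell in range(L):
        component = nu[ell]
        for j in range(p):
            # outer product across variables: one rank-1 term per latent class
            component = np.multiply.outer(component, lambdas[j][:, ell])
        Pi = Pi + component
    return Pi  # shape (d_1, ..., d_p), entries sum to one
```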
We have the following proposition linking the proposed multilayer Bayesian Pyramid to the constrained latent class model under equality constraint (5). For two vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ of the same length $K$, denote $\boldsymbol{a} \succeq \boldsymbol{b}$ if $a_k \geq b_k$ for all $k \in [K]$. Denote by $1\{\cdot\}$ the binary indicator function, which equals one if the statement inside is true and zero otherwise.
Proposition 1. Consider the multilayer Bayesian Pyramid with binary graphical matrices $\boldsymbol{G}^{(1)}, \boldsymbol{G}^{(2)}, \ldots, \boldsymbol{G}^{(D-1)}$.
(a) Marginalizing out all the latent variables except $\boldsymbol{a}^{(1)}$, the distribution of $\boldsymbol{y}$ is a constrained latent class model with $2^{K_1}$ latent classes, where each latent class can be characterized as one configuration of the $K_1$-dimensional binary vector $\boldsymbol{a}^{(1)} \in \{0,1\}^{K_1}$. The corresponding constraint matrix $\boldsymbol{\Gamma}^{(1)}$ is determined by the bipartite graph structure between the $\boldsymbol{y}$-layer and the $\boldsymbol{a}^{(1)}$-layer, with entries being

$$\gamma^{(1)}_{j,\boldsymbol{\alpha}} = 1\Big\{\textstyle\sum_{k=1}^{K_1} g^{(1)}_{jk}\,\alpha_{k} \geq 1\Big\}, \qquad j \in [p],\; \boldsymbol{\alpha} \in \{0,1\}^{K_1}. \tag{7}$$

(b) Further, considering the distribution of $\boldsymbol{a}^{(m-1)}$ and marginalizing out all the latent variables deeper than the $m$th latent layer except $\boldsymbol{a}^{(m)}$, the distribution of $\boldsymbol{a}^{(m-1)}$ is also a constrained latent class model with $2^{K_m}$ latent classes, where each latent class is characterized as one configuration of the $K_m$-dimensional binary vector $\boldsymbol{a}^{(m)} \in \{0,1\}^{K_m}$. Its corresponding constraint matrix $\boldsymbol{\Gamma}^{(m)}$ is determined by the bipartite graph structure between the $(m-1)$th and the $m$th latent layers, with entries being

$$\gamma^{(m)}_{k,\boldsymbol{\alpha}} = 1\Big\{\textstyle\sum_{k'=1}^{K_m} g^{(m)}_{kk'}\,\alpha_{k'} \geq 1\Big\}, \qquad k \in [K_{m-1}],\; \boldsymbol{\alpha} \in \{0,1\}^{K_m}. \tag{8}$$
We present a toy example to illustrate Proposition 1 and discuss its implications.
Example 1. Consider a multilayer Bayesian Pyramid whose bottom two layers have the graph between $\boldsymbol{y}$ and $\boldsymbol{a}^{(1)}$ displayed in Figure 2, with graphical matrix $\boldsymbol{G}^{(1)}$. Proposition 1(a) states that there is a corresponding constraint matrix $\boldsymbol{\Gamma}^{(1)}$ of size $p \times 2^{K_1}$.

Entries of $\boldsymbol{\Gamma}^{(1)}$ are determined according to (7): for a latent class $\boldsymbol{\alpha} \in \{0,1\}^{K_1}$ and a variable $y_j$, the entry equals one exactly when at least one parent of $y_j$ is active in $\boldsymbol{\alpha}$. Each column of the constraint matrix is indexed by a latent class characterized by a configuration of a $K_1$-dimensional binary vector. This implies that, if only considering the first latent layer of variables $\boldsymbol{a}^{(1)}$, all the subjects are naturally divided into $2^{K_1}$ latent classes, each endowed with a binary pattern.
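For readers who wish to reproduce this construction numerically, the sketch below builds a constraint matrix from a graphical matrix under the reading of (7) used above (an entry is one exactly when at least one parent of the variable is active in the latent pattern); the function name is illustrative.

```python
import itertools
import numpy as np

def constraint_matrix(G):
    """Build the p x 2^K constraint matrix Gamma from the graphical matrix G,
    assuming gamma[j, alpha] = 1 iff at least one parent of variable j is
    active in the binary pattern alpha (the reading of (7) used above)."""
    p, K = G.shape
    patterns = list(itertools.product([0, 1], repeat=K))   # the 2^K latent classes
    Gamma = np.zeros((p, len(patterns)), dtype=int)
    for col, alpha in enumerate(patterns):
        Gamma[:, col] = (G @ np.array(alpha) >= 1).astype(int)
    return Gamma, patterns
```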
Proposition 1 gives a nice structural characterization of multilayer Bayesian Pyramids. This characterization is achieved by relating the multilayer sparse graph to the constrained latent class model in a layer-wise manner. This proposition provides a basis for investigating the identifiability of multilayer Bayesian Pyramids; see details in Section 3.
2.3 Connections to existing models and studies
We next briefly review connections between Bayesian Pyramids and existing models. In educational measurement research, Haertel (1989) first used the term restricted latent class models. Further developments along this line in the psychometrics literature led to a popular family of cognitive diagnosis models (de la Torre, 2011; Rupp & Templin, 2008; von Davier & Lee, 2019). These models are essentially binary latent skill models where each subject is endowed with a multivariate binary latent skill vector indicating the mastery/deficiency statuses of the skills, and each test item is designed to measure a certain configuration of skills, summarized by a binary loading vector. The matrix collecting all of the item-level skill loading vectors as row vectors is often pre-specified by educational experts. The observed data for each subject are a binary vector indicating the correct/wrong answers to the test questions in the assessment. Recently, there have been emerging studies on the identifiability and estimation of such cognitive diagnosis models (e.g., Chen, Culpepper, et al., 2020; Chen et al., 2015; Fang et al., 2019; Gu & Xu, 2019, 2020; Xu, 2017). However, to our knowledge, there have not been works that model multilayer (i.e., deep) latent structure behind the data and investigate identifiability in such scenarios.
Bayesian Pyramids are also related to deep belief networks (DBNs, Hinton et al., 2006), sum-product networks (SPNs, Poon & Domingos, 2011), and latent tree graphical models (LTMs, Mourad et al., 2013) in the machine learning literature. DBNs have undirected edges between the deepest two layers designed based on computational considerations (Hinton, 2009), while a Bayesian Pyramid is a fully generative directed graphical model (Bayesian network) with all the edges pointing top down. Such generative modeling is naturally motivated by identifiability considerations and also provides a full description of the data generating process. Also, DBNs in their popular form are models for multivariate binary data, feature a fully connected graph structure, and use logistic link functions between layers. In contrast, Bayesian Pyramids accommodate general multivariate categorical data and allow flexible forms of layerwise conditional distributions and sparse connections between consecutive layers. An SPN is a rooted directed acyclic graph consisting of sum nodes and product nodes. Zhao et al. (2015) show that an SPN can be converted to a bipartite Bayesian network, where the number of discrete latent variables equals the number of sum nodes in the SPN. Our model is more general in that, in addition to modeling a bipartite network between the latent layer and the observed layer, we further model the dependence of the latent variables instead of assuming them independent. LTMs are a special case of our models: all the variables in an LTM form a tree, whereas we allow for more general DAGs beyond trees. Although the above models are extremely popular, identifiability has received essentially no attention; an exception is Zwiernik (2018), which discussed identifiability for relatively simple LTMs.
Our two-latent-layer Bayesian Pyramid shares a similar structure with the nonparametric latent feature models in Doshi-Velez and Ghahramani (2009). Both works consider a mixture of binary latent feature models, with each data point associated with both a deep latent cluster ($z$ in our notation) and a binary vector of latent features ($\boldsymbol{a}^{(1)}$ in our notation). One distinction is that we adopt very flexible probabilistic distributions (see Examples 2 and 3) as the conditional distributions in the DAG, while Doshi-Velez and Ghahramani (2009) posit a deterministic relationship between these latent quantities. In addition, Bayesian Pyramids are directly inspired by identifiability considerations, and in the next section, we provide explicit conditions that guarantee the model is identifiable, not only in the two-latent-layer architecture similar to Doshi-Velez and Ghahramani (2009) but also in general deep latent hierarchies with more than two latent layers. Interestingly, Doshi-Velez and Ghahramani (2009) conjecture heuristically that their specification 'likely resolves the identifiability issues'.
3 Identifiability and constrained latent class structure behind Bayesian Pyramids
3.1 Identifiability of the constrained latent class model and posterior consistency
In Section 2, we proposed a new class of multilayer latent variable models termed Bayesian Pyramids and showed that these models can be formulated as a type of constrained latent class model (CLCM) defined in (4)-(5). In this section, we study theoretical properties of model (4)-(5).
The classic latent class model in (4) was shown by Gyllenberg et al. (1994) to be not strictly identifiable. Strict identifiability requires that distinct parameter values correspond to distinct distributions of the observables, so that the parameters can be uniquely recovered from the data distribution. As a weaker notion, generic identifiability requires that the map from parameters to the distribution of the data is one-to-one except on a Lebesgue measure zero subset of the parameter space. In a seminal paper, Allman et al. (2009) leveraged Kruskal's theorem (Kruskal, 1977) to show generic identifiability for an unconstrained latent class model. However, Allman et al. (2009)'s approach is not sufficient for establishing identifiability in constrained LCMs or Bayesian Pyramids, due to the complex parameter constraints in these models. Indeed, Allman et al. (2009)'s generic identifiability results imply that in the latent class model with an unconstrained parameter space, there exists a measure-zero subset of parameters where identifiability breaks down. In constrained LCMs, the equality constraints (5) exactly enforce the parameters to fall into a measure-zero subset. In Bayesian Pyramids, these equality constraints arise from potentially sparse between-layer graphs. So without a careful and thorough investigation into the relationship between the parameter constraints and the graphical structure, Kruskal's theorem is not directly applicable to investigating the identifiability of constrained LCMs or Bayesian Pyramids.
Below, we establish strict identifiability of model (4)-(5) by carefully examining the algebraic structure imposed by the constraint matrix $\boldsymbol{\Gamma}$. We first introduce some notation. Denote by $\otimes$ the Kronecker product of matrices and by $\odot$ the Khatri–Rao product of matrices. In particular, consider matrices $\boldsymbol{A} = (a_{ij}) \in \mathbb{R}^{m \times n}$ and $\boldsymbol{B} \in \mathbb{R}^{s \times t}$, and matrices $\boldsymbol{C} \in \mathbb{R}^{m \times r}$ and $\boldsymbol{D} \in \mathbb{R}^{s \times r}$ with columns $\boldsymbol{c}_{:1}, \ldots, \boldsymbol{c}_{:r}$ and $\boldsymbol{d}_{:1}, \ldots, \boldsymbol{d}_{:r}$; then $\boldsymbol{A} \otimes \boldsymbol{B} \in \mathbb{R}^{ms \times nt}$ and $\boldsymbol{C} \odot \boldsymbol{D} \in \mathbb{R}^{ms \times r}$ with

$$\boldsymbol{A} \otimes \boldsymbol{B} = \begin{pmatrix} a_{11}\boldsymbol{B} & \cdots & a_{1n}\boldsymbol{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\boldsymbol{B} & \cdots & a_{mn}\boldsymbol{B} \end{pmatrix}, \qquad \boldsymbol{C} \odot \boldsymbol{D} = \big(\boldsymbol{c}_{:1} \otimes \boldsymbol{d}_{:1}, \ldots, \boldsymbol{c}_{:r} \otimes \boldsymbol{d}_{:r}\big).$$
The above definitions show the Khatri–Rao product is a column-wise Kronecker product; see more in Kolda and Bader (2009). We first establish the following technical proposition, which is useful for the later theorem on identifiability.
Proposition 2. For the constrained latent class model, define the following parameter matrices subject to constraints (5) and (6) with some constraint matrix $\boldsymbol{\Gamma}$,

$$\boldsymbol{\Lambda}^{(j)} = \big(\lambda_{j,c,\ell}\big)_{c \in [d_j],\, \ell \in [L]} \in \mathbb{R}^{d_j \times L}, \qquad j = 1, \ldots, p.$$

Denote the Khatri–Rao product of the above matrices by $\bigodot_{j=1}^{p}\boldsymbol{\Lambda}^{(j)} = \boldsymbol{\Lambda}^{(1)} \odot \cdots \odot \boldsymbol{\Lambda}^{(p)}$, which has size $\big(\prod_{j=1}^{p} d_j\big) \times L$. The following two conclusions hold.

(a) If the column vectors of the constraint matrix $\boldsymbol{\Gamma}$ are distinct, then $\bigodot_{j=1}^{p}\boldsymbol{\Lambda}^{(j)}$ must have full column rank $L$.

(b) If $\boldsymbol{\Gamma}$ contains identical column vectors, then $\bigodot_{j=1}^{p}\boldsymbol{\Lambda}^{(j)}$ can be rank-deficient.
Remark 1. Proposition 2 implies that $\boldsymbol{\Gamma}$ having distinct columns is sufficient [in part (a)] and almost necessary [in part (b)] for the Khatri–Rao product $\bigodot_{j}\boldsymbol{\Lambda}^{(j)}$ to be full rank. To see the 'almost necessary' part, consider a special case where, besides constraint (5) that $\lambda_{j,c,\ell} = \lambda_{j,c,\ell'}$ if $\gamma_{j,\ell} = \gamma_{j,\ell'} = 0$, the parameters also satisfy $\lambda_{j,c,\ell} = \lambda_{j,c,\ell'}$ if $\gamma_{j,\ell} = \gamma_{j,\ell'} = 1$. In this case, our proof shows that whenever the binary matrix $\boldsymbol{\Gamma}$ contains identical column vectors in columns $\ell$ and $\ell'$, the matrix $\bigodot_{j}\boldsymbol{\Lambda}^{(j)}$ also contains identical column vectors in columns $\ell$ and $\ell'$ and hence is surely rank-deficient.
In the Khatri–Rao product matrix $\bigodot_{j}\boldsymbol{\Lambda}^{(j)}$ defined in Proposition 2, each column characterizes the conditional distribution of the vector $\boldsymbol{y}$ given a particular latent class. Therefore, Proposition 2 reveals a nontrivial algebraic property: whether $\boldsymbol{\Gamma}$ has distinct column vectors is linked to whether the conditional distributions of $\boldsymbol{y}$ given each latent class are linearly independent. The matrix $\boldsymbol{\Gamma}$ does not need to have full column rank in order to have distinct column vectors. For example, a matrix with the three columns $(1,0)^{\top}$, $(0,1)^{\top}$, and $(1,1)^{\top}$ is rank-deficient but has distinct column vectors. Indeed, it is not hard to see that a binary matrix with $p$ rows can have as many as $2^{p}$ distinct column vectors.
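A small numerical check can make Proposition 2(a) tangible: the sketch below forms conditional probability matrices obeying the equality constraint (5) under a constraint matrix with distinct columns and inspects the rank of their Khatri–Rao product. It is a toy illustration under randomly drawn parameters, not part of the formal argument.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of matrices sharing a column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, M.shape[1])
    return out

rng = np.random.default_rng(1)
Gamma = np.array([[1, 0, 1],
                  [0, 1, 1]])              # two variables, three classes, distinct columns
lambdas = []
for j in range(Gamma.shape[0]):
    baseline = rng.dirichlet(np.ones(3))   # shared conditional for classes with gamma = 0
    cols = [rng.dirichlet(np.ones(3)) if Gamma[j, ell] else baseline
            for ell in range(Gamma.shape[1])]
    lambdas.append(np.column_stack(cols))  # d_j x L matrix obeying (5)
print(np.linalg.matrix_rank(khatri_rao(lambdas)))   # typically 3, i.e., full column rank
```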
Proposition 1 reformulates a Bayesian Pyramid into a constrained LCM with a constraint matrix , and then Proposition 2 establishes a nontrivial algebraic property of such constrained LCMs. Propositions 1 and 2 together pave the way for the development of the identifiability theory of Bayesian Pyramids. In particular, the sufficiency part of Proposition 2 uncovers a nontrivial inherent algebraic structure of the considered models. To prove Proposition 2, we leveraged a novel proof technique, the marginal probability matrix, in order to find a sufficient and almost necessary condition for the Khatri–Rao product of conditional probability matrices to have full rank. We anticipate that the conclusion in Proposition 2 can be useful in other graphical models with discrete latent variables, even beyond the Bayesian Pyramids considered in this paper. This is because graphical models involving discrete latent structure can often be formulated as a latent class model with equality constraints determined by the graph. Therefore, the conclusion of Proposition 2 might be of independent interest.
Although the linear independence of the columns of $\bigodot_{j}\boldsymbol{\Lambda}^{(j)}$ itself does not lead to identifiability, it provides a basis for investigating strict identifiability of our model. We introduce the definition of strict identifiability under the current setup and then give the strict identifiability result.
Definition 1 (Strict identifiability)
The constrained latent class model with (5) and (6) is said to be strictly identifiable if, for any valid parameters $(\boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda})$, the following equality holds if and only if $(\bar{\boldsymbol{\Gamma}}, \bar{\boldsymbol{\nu}}, \bar{\boldsymbol{\Lambda}})$ and $(\boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda})$ are identical up to a latent class permutation:

$$P\big(y_1 = c_1, \ldots, y_p = c_p \mid \boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda}\big) = P\big(y_1 = c_1, \ldots, y_p = c_p \mid \bar{\boldsymbol{\Gamma}}, \bar{\boldsymbol{\nu}}, \bar{\boldsymbol{\Lambda}}\big) \quad \text{for all } (c_1, \ldots, c_p).$$
When the constraint matrix $\boldsymbol{\Gamma}$ is unknown and needs to be identified together with the unknown continuous parameters $(\boldsymbol{\nu}, \boldsymbol{\Lambda})$, there is a trivial nonidentifiability issue that needs to be resolved. To see this, continue to consider the special case mentioned in Remark 1 where $\boldsymbol{\lambda}_{j,\cdot,\ell} = \boldsymbol{\lambda}_{j,\cdot,\ell'}$ whenever $\gamma_{j,\ell} = \gamma_{j,\ell'} = 1$; then given a matrix $\boldsymbol{\Gamma}$ we can generally denote the shared conditional probability table of variable $j$ by $\boldsymbol{\lambda}^{(1)}_{j}$ if $\gamma_{j,\ell} = 1$ and by $\boldsymbol{\lambda}^{(0)}_{j}$ if $\gamma_{j,\ell} = 0$. Then without further restrictions, the following alternative will be indistinguishable from the truth: $\bar{\boldsymbol{\Gamma}} = \mathbf{1}_{p \times L} - \boldsymbol{\Gamma}$, $\bar{\boldsymbol{\lambda}}^{(1)}_{j} = \boldsymbol{\lambda}^{(0)}_{j}$, and $\bar{\boldsymbol{\lambda}}^{(0)}_{j} = \boldsymbol{\lambda}^{(1)}_{j}$. One straightforward way to resolve such trivial nonidentifiability of $\boldsymbol{\Gamma}$ is to simply enforce that, whenever $\gamma_{j,\ell} \neq \gamma_{j,\ell'}$, the order of $\lambda_{j,c,\ell}$ and $\lambda_{j,c,\ell'}$ is fixed for every possible category $c$. In the following studies of identifiability, we always assume such an ordering of $\boldsymbol{\Lambda}$ with respect to $\boldsymbol{\Gamma}$ has been fixed.
For an arbitrary subset $S \subseteq [p]$, denote by $\boldsymbol{\Gamma}_{S}$ the submatrix of $\boldsymbol{\Gamma}$ that consists of those rows indexed by variables belonging to $S$. The matrix $\boldsymbol{\Gamma}_{S}$ has size $|S| \times L$.
Theorem 1 (Strict identifiability)
Consider the proposed constrained latent class model under (5) and (6) with true parameters $\boldsymbol{\Gamma}$, $\boldsymbol{\nu}$, and $\boldsymbol{\Lambda}$. Suppose there exists a partition of the variables $[p] = S_1 \cup S_2 \cup S_3$ such that

(a) the submatrix $\boldsymbol{\Gamma}_{S_i}$ has distinct column vectors for $i = 1$ and $i = 2$; and

(b) for any pair of latent classes $\ell \neq \ell'$, there is $\lambda_{j,c,\ell} \neq \lambda_{j,c,\ell'}$ for some $j \in S_3$ and some $c \in [d_j]$.

Then the parameters $(\boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda})$ are strictly identifiable up to a latent class permutation.
In the above theorem, the constraint matrix $\boldsymbol{\Gamma}$ is not assumed to be fixed and known. This implies that both the matrix $\boldsymbol{\Gamma}$ and the parameters $(\boldsymbol{\nu}, \boldsymbol{\Lambda})$ can be uniquely identified from data. We next give a corollary of Theorem 1, which replaces the condition on parameters in part (b) with a slightly more stringent but also more transparent condition on the binary matrix $\boldsymbol{\Gamma}$.
Corollary 1. Consider the proposed constrained latent class model under (5) and (6). Suppose the inequality constraint (6) holds for each $j \in [p]$. If there is a partition of the variables $[p] = S_1 \cup S_2 \cup S_3$ such that each submatrix $\boldsymbol{\Gamma}_{S_i}$ has distinct column vectors for $i = 1, 2, 3$, then the parameters $(\boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda})$ are strictly identifiable up to a latent class permutation.
The conditions of Corollary 1 are more easily checkable than those in Theorem 1 because they only depend on the structure of the constraint matrix $\boldsymbol{\Gamma}$. They require that, after some row rearrangement, the matrix $\boldsymbol{\Gamma}$ should vertically stack three submatrices, each of which has distinct column vectors.
The conclusions of Theorem 1 and Corollary 1 both regard strict identifiability, which is the strongest possible conclusion on parameter identifiability up to label permutation. If we consider the slightly weaker notion of generic identifiability as proposed in Allman et al. (2009), the conditions in Theorem 1 and Corollary 1 can be relaxed. Given a constraint matrix $\boldsymbol{\Gamma}$, denote the constrained parameter space for $(\boldsymbol{\nu}, \boldsymbol{\Lambda})$ by

$$\mathcal{T}(\boldsymbol{\Gamma}) = \big\{(\boldsymbol{\nu}, \boldsymbol{\Lambda}) : \text{the equality constraints (5) and inequality constraints (6) hold under } \boldsymbol{\Gamma}\big\}, \tag{10}$$

and define the following subset of $\mathcal{T}(\boldsymbol{\Gamma})$ as

$$\mathcal{N}(\boldsymbol{\Gamma}) = \big\{(\boldsymbol{\nu}, \boldsymbol{\Lambda}) \in \mathcal{T}(\boldsymbol{\Gamma}) : (\boldsymbol{\nu}, \boldsymbol{\Lambda}) \text{ are not identifiable up to a latent class permutation}\big\}. \tag{11}$$
With the above notation, the generic identifiability of the proposed constrained latent class model is defined as follows.
Definition 2 (Generic identifiability)
Parameters $(\boldsymbol{\nu}, \boldsymbol{\Lambda})$ are said to be generically identifiable if the set $\mathcal{N}(\boldsymbol{\Gamma})$ defined in (11) has measure zero with respect to the Lebesgue measure on $\mathcal{T}(\boldsymbol{\Gamma})$ defined in (10).
Theorem 2 (Generic identifiability)
Consider the constrained latent class model under (5) and (6), and a partition of the variables $[p] = S_1 \cup S_2 \cup S_3$. Suppose for $i = 1$ and $2$, changing some entries of $\boldsymbol{\Gamma}_{S_i}$ from '1' to '0' yields a matrix having distinct columns. Also suppose for any $\ell \neq \ell'$, there is $\lambda_{j,c,\ell} \neq \lambda_{j,c,\ell'}$ for some $j \in S_3$ and some $c$. Then $(\boldsymbol{\nu}, \boldsymbol{\Lambda})$ are generically identifiable.
Note that altering some $\gamma_{j,\ell}$ from one to zero corresponds to adding one more equality constraint that the distribution of the $j$th variable given the $\ell$th latent class is set to the baseline through (5). Therefore, Theorem 2 intuitively implies that if enforcing more parameters in $\boldsymbol{\Lambda}$ to be equal can give rise to a strictly identifiable model, then the parameters that make the original model unidentifiable only occupy a negligible set in $\mathcal{T}(\boldsymbol{\Gamma})$.
Theorem 2 relaxes the conditions on $\boldsymbol{\Gamma}$ for strict identifiability presented earlier. In particular, here the submatrices $\boldsymbol{\Gamma}_{S_1}$ and $\boldsymbol{\Gamma}_{S_2}$ need not have distinct column vectors; rather, it suffices if altering some of their entries from one to zero yields distinct column vectors. As pointed out by Allman et al. (2009), generic identifiability is often sufficient for real data analyses.
So far, we have focused on discussing model identifiability. Next, we show that our identifiability results guarantee Bayesian posterior consistency under suitable priors. Given a sample of size $n$, denote the observations by $\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n$, which are vectors each of dimension $p$. Recall that under (4), the distribution of the vector $\boldsymbol{y}$ under the considered model can be denoted by a $p$-way probability tensor $\boldsymbol{\Pi} = (\pi_{c_1 \cdots c_p})$. When adopting a Bayesian approach, one can specify prior distributions for the parameters $\boldsymbol{\Gamma}$, $\boldsymbol{\nu}$, and $\boldsymbol{\Lambda}$, which induce a prior distribution for the probability tensor $\boldsymbol{\Pi}$. Within this context, we are now ready to state the following theorem.
Theorem 3 (Posterior consistency)
Denote the collection of model parameters by $\boldsymbol{\theta} = (\boldsymbol{\Gamma}, \boldsymbol{\nu}, \boldsymbol{\Lambda})$. Suppose the prior distributions for the parameters $\boldsymbol{\Gamma}$, $\boldsymbol{\nu}$, and $\boldsymbol{\Lambda}$ all have full support around the true values. If the true latent structure and model parameters satisfy the proposed strict identifiability conditions in Theorem 1 or Corollary 1, we have, for any $\epsilon > 0$,

$$\Pi\big(\boldsymbol{\theta} \in \mathcal{B}_{\epsilon}(\boldsymbol{\theta}_0)^{c} \mid \boldsymbol{y}_1, \ldots, \boldsymbol{y}_n\big) \longrightarrow 0 \quad \text{almost surely as } n \to \infty,$$

where $\mathcal{B}_{\epsilon}(\boldsymbol{\theta}_0)^{c}$ is the complement of an $\epsilon$-neighborhood (up to latent class permutations) of the true parameters $\boldsymbol{\theta}_0$ in the parameter space.
Theorem 3 implies that under an identifiable model and with appropriately specified priors, the posterior distribution places increasing probability in arbitrarily small neighborhoods of the true parameters of the constrained latent class model as sample size increases. These parameters include the mixture proportions and the class specific conditional probabilities.
3.2 Identifiability of multilayer Bayesian Pyramids
According to Proposition 1, for a multilayer Bayesian Pyramid with a binary graphical matrix $\boldsymbol{G}^{(1)}$ between the two bottom layers, one can follow (7) to construct a constraint matrix $\boldsymbol{\Gamma}^{(1)}$, as illustrated in Example 1. We next provide transparent identifiability conditions that directly depend on the binary graphical matrices $\boldsymbol{G}^{(m)}$'s. With the next theorem, one only needs to examine the structure of the between-layer connecting graphs to establish identifiability.
Theorem 4. Consider the multilayer latent variable model specified in (1)-(2). Suppose the numbers of latent variables $K_1, \ldots, K_{D-1}$ and $B$ are known. Suppose each binary graphical matrix $\boldsymbol{G}^{(m)}$ of size $K_{m-1} \times K_m$ (size $p \times K_1$ if $m = 1$) takes the following form after some row permutation:

$$\boldsymbol{G}^{(m)} = \begin{pmatrix} \boldsymbol{I}_{K_m} \\ \boldsymbol{I}_{K_m} \\ \boldsymbol{I}_{K_m} \\ \boldsymbol{G}^{(m)\star} \end{pmatrix}, \tag{12}$$

where $\boldsymbol{G}^{(m)\star}$ generally denotes a submatrix of $\boldsymbol{G}^{(m)}$ that can take an arbitrary form. Further suppose that the conditional distributions of the variables satisfy the inequality constraint in (6). Then, the following parameters are identifiable up to a latent variable permutation within each layer: the probability distribution tensor for the deepest latent variable $z$, the conditional probability table of each variable (observed or latent) given its parents, and also the binary graphical matrices $\boldsymbol{G}^{(1)}, \ldots, \boldsymbol{G}^{(D-1)}$.
The proof of Theorem 4 provides a nice layer-wise argument on identifiability; that is, one can examine the structure of the Bayesian Pyramid in the bottom-up direction. As long as the graphical matrices up to some layer $m$ take the form of (12), the parameters associated with the conditional distributions in the bottom $m$ layers are identifiable, and the marginal distribution of the $m$th latent layer $\boldsymbol{a}^{(m)}$ is also identifiable.
Theorem 4 implies a requirement that $p \geq 3K_1$ and that $K_{m-1} \geq 3K_m$ for every $m \geq 2$, through the form of $\boldsymbol{G}^{(m)}$ in (12). That is, the number of latent variables per layer decreases as the layer goes deeper up the pyramid. Condition (12) in Theorem 4 requires that each latent variable in the $m$th latent layer has at least three children in the $(m-1)$th layer (the observed layer when $m = 1$) that do not have any other parents. Our identifiability conditions hold regardless of the specific models chosen for the conditional distributions $P(y_j \mid \boldsymbol{a}^{(1)}_{\mathrm{pa}(j)})$ and each $P(a^{(m)}_{k} \mid \boldsymbol{a}^{(m+1)}_{\mathrm{pa}(k,m)})$, as long as the graphical structure is enforced and these component models are not over-parameterized in a naive manner. We next give two concrete examples which model the conditional distribution of a latent variable given its parents differently but both respect the graph given by the graphical matrices.
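In practice one may want to check the structural condition (12) for a candidate graphical matrix. The sketch below implements one such check under the 'three pure children per parent' reading stated above; the function name and the example matrix are illustrative and not taken from the paper's simulation settings.

```python
import numpy as np

def has_three_pure_children(G):
    """Check the structural condition (12): after row permutation, G stacks
    three identity matrices, i.e., every parent (column) has at least three
    children (rows) that load on it and on no other parent."""
    G = np.asarray(G)
    pure_rows = G.sum(axis=1) == 1            # children with exactly one parent
    counts = G[pure_rows].sum(axis=0)         # number of pure children per parent
    return bool(np.all(counts >= 3))

G_example = np.vstack([np.eye(4), np.eye(4), np.eye(4), np.ones((8, 4))]).astype(int)
print(has_three_pure_children(G_example))    # True: three identity blocks are present
```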
Example 2. We first consider modeling the effects of the parent variables of each latent variable $a^{(m)}_{k}$ as

$$P\big(a^{(m)}_{ik} = 1 \mid \boldsymbol{a}^{(m+1)}_{i}\big) = f\Big(\sum_{k'=1}^{K_{m+1}} g^{(m+1)}_{kk'}\,\beta^{(m)}_{kk'}\, a^{(m+1)}_{ik'}\Big), \tag{13}$$

where $f(\cdot)$ is a link function. The number of $\beta$-parameters in (13) equals $\sum_{k'} g^{(m+1)}_{kk'}$, which is the number of edges pointing to $a^{(m)}_{k}$. Choosing the logistic link $f(x) = 1/(1 + e^{-x})$ leads to a model similar to sparse deep belief networks (Hinton et al., 2006; Lee et al., 2007).
Example 3. To obtain a more parsimonious alternative to (13), let $s^{(m)}_{ik} = \vee_{k'=1}^{K_{m+1}} \big(g^{(m+1)}_{kk'}\, a^{(m+1)}_{ik'}\big)$ denote the Boolean product of the $k$th row of $\boldsymbol{G}^{(m+1)}$ and $\boldsymbol{a}^{(m+1)}_{i}$, and let

$$P\big(a^{(m)}_{ik} = 1 \mid \boldsymbol{a}^{(m+1)}_{i}\big) = \begin{cases} 1 - \theta^{-}_{k}, & \text{if } s^{(m)}_{ik} = 1, \\ \theta^{+}_{k}, & \text{if } s^{(m)}_{ik} = 0. \end{cases} \tag{14}$$

Model (14) satisfies the conditional independence encoded by $\boldsymbol{G}^{(m+1)}$, since $s^{(m)}_{ik}$ only involves the parents of $a^{(m)}_{k}$, implying that the distribution of $a^{(m)}_{ik}$ only depends on its parents in the $(m+1)$th latent layer. This model provides a probabilistic version of Boolean matrix factorization (Miettinen & Vreeken, 2014). The binary indicator $s^{(m)}_{ik}$ equals the Boolean product of two binary vectors, and the $\theta^{+}_{k}$ and $\theta^{-}_{k}$ quantify the two probabilities that the entry $a^{(m)}_{ik}$ does not equal the Boolean product.
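The following minimal sketch simulates one child variable under the noisy Boolean-product mechanism described in (14), assuming the flip-probability parameterization written above; all names are illustrative.

```python
import numpy as np

def sample_child_boolean(a_parent, g_row, theta_plus, theta_minus, rng):
    """Sample a child binary variable under a probabilistic Boolean-product model
    in the spirit of (14): the child equals the Boolean product of its parent
    pattern and the corresponding row of the graphical matrix, flipped with
    probability theta_plus (product = 0) or theta_minus (product = 1)."""
    boolean_product = int(np.any(g_row * a_parent))
    flip_prob = theta_minus if boolean_product == 1 else theta_plus
    flip = rng.random() < flip_prob
    return boolean_product ^ int(flip)

rng = np.random.default_rng(2)
child = sample_child_boolean(np.array([1, 0, 1]), np.array([0, 1, 1]),
                             theta_plus=0.05, theta_minus=0.10, rng=rng)
```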
Since Examples 2 and 3 satisfy the conditional independence constraints encoded by the graphical matrices $\boldsymbol{G}^{(m)}$'s, they satisfy the equality constraint in (5) with the constraint matrix given in Proposition 1. Therefore, our identifiability conclusion in Theorem 4 applies to both examples with appropriate inequality constraints on the $\beta$-parameters or the $\theta$-parameters; for example, see Proposition 3 in Section 4.
Besides Examples 2 and 3, there are many other models that respect the graphical structure. For example, (13) can be extended to include interaction effects of the parents of $a^{(m)}_{k}$ as follows:

$$P\big(a^{(m)}_{ik} = 1 \mid \boldsymbol{a}^{(m+1)}_{i}\big) = f\Big(\sum_{\emptyset \neq S \subseteq \mathrm{pa}(k,m)} \beta^{(m)}_{k,S} \prod_{k' \in S} a^{(m+1)}_{ik'}\Big). \tag{15}$$

In (15), if $a^{(m)}_{k}$ has $q$ parents, then the number of $\beta$-parameters equals $2^{q} - 1$.
Anandkumar et al. (2013) considered the identifiability of linear Bayesian networks. Although both Anandkumar et al. (2013) and this work address identifiability issues of Bayesian networks, their results are not applicable to our highly nonlinear models. The nonlinearity requires techniques that look into the inherent tensor decomposition structures (in particular, a constrained CP decomposition) caused by the discrete latent variables and graphical constraints imposed on the discrete latent distribution. Such inherent constrained tensor structures are specific to discrete and graphical latent structures and are not present in the settings considered in Anandkumar et al. (2013).
4 Bayesian inference for two-layer Bayesian Pyramids
As discussed earlier, the proposed multilayer Bayesian Pyramid in Section 2 has universal identifiability arguments for many different model structures. In this section, as an important special case, we focus on a two-latent-layer model with depth $D = 2$ and use a Bayesian approach to infer the latent structure and model parameters. Recall our two-latent-layer Bayesian Pyramid specified earlier in (3), which posits a multinomial-logistic model for each $y_{ij}$ given the binary latent traits $\boldsymbol{a}_i = (a_{i1}, \ldots, a_{iK})$ and a latent class model for $\boldsymbol{a}_i$ given the deep latent variable $z_i \in [B]$.

Here, we fix the parameters of one reference response category to zero for each variable $j$, as conventionally done in multinomial logistic models. The parameter $\beta_{jck}$ can be viewed as the weight associated with the potential directed edge from latent variable $a_k$ to observed variable $y_j$, for response category $c$. The coefficient $\beta_{jck}$ only impacts the likelihood if there is an edge from $a_k$ to $y_j$, that is, if $g_{jk} = 1$. Denote the collection of these logistic coefficients by $\boldsymbol{\beta}$. Model (3) specifies a usual latent class model in the second latent layer, with $B$ latent classes for the $K$-dimensional latent vector $\boldsymbol{a}$. This layer of the model has latent class proportion parameters $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_B)$ and conditional probability parameters $\eta_{bk} = P(a_{ik} = 1 \mid z_i = b)$. The full set of model parameters across layers is $(\boldsymbol{G}, \boldsymbol{\beta}, \boldsymbol{\eta}, \boldsymbol{\tau})$, and the model structure is shown in Figure 2. We can denote the conditional probability in (3) by $\lambda_{j,c}(\boldsymbol{a}) = P(y_{ij} = c \mid \boldsymbol{a}_i = \boldsymbol{a})$.
4.1 Identifiability theory adapted to two-layer Bayesian Pyramids defined in (3)
For the two-latent-layer model in (3), the following proposition presents strict identifiability conditions in terms of explicit inequality constraints for the parameters.
Proposition 3. Consider model (3) with true parameters $(\boldsymbol{G}, \boldsymbol{\beta}, \boldsymbol{\eta}, \boldsymbol{\tau})$.

(a) Suppose that, after some row permutation, $\boldsymbol{G}$ contains three copies of the identity matrix $\boldsymbol{I}_{K}$ as submatrices, and that the main-effect parameters satisfy $\beta_{jck} \neq 0$ for every $(j, k)$ with $g_{jk} = 1$ and every non-baseline category $c$. Then $\boldsymbol{G}$, $\boldsymbol{\beta}$, and the probability tensor of $\boldsymbol{a}$ are strictly identifiable.

(b) Under the conditions of part (a), if furthermore $K$ and $B$ are such that the latent class model for $\boldsymbol{a}$ given $z$ is itself generically identifiable, then the parameters $\boldsymbol{\tau}$ and $\boldsymbol{\eta}$ are generically identifiable from the probability tensor of $\boldsymbol{a}$.
As mentioned earlier, the identifiability results in Section 3 apply to general distributional assumptions for variables organized in a sparse multilayer Bayesian Pyramid. When considering specific models, properties of the model can be leveraged to weaken the identifiability conditions. The next proposition illustrates this, establishing generic identifiability for the model in (3). Before stating the identifiability result, we formally define the allowable constrained parameter space for $\boldsymbol{\beta}$ under a graphical matrix $\boldsymbol{G}$ as

$$\mathcal{B}(\boldsymbol{G}) = \big\{\boldsymbol{\beta} : \beta_{jck} = 0 \text{ for all } c \text{ whenever } g_{jk} = 0\big\}. \tag{16}$$
Proposition 4. Consider model (3) with $\boldsymbol{\beta}$ belonging to the constrained space $\mathcal{B}(\boldsymbol{G})$ in (16). Suppose that, after some row permutation, the graphical matrix can be written as $\boldsymbol{G} = \big((\boldsymbol{G}_1)^{\top}, (\boldsymbol{G}_2)^{\top}, (\boldsymbol{G}_3)^{\top}\big)^{\top}$, where $\boldsymbol{G}_1$ and $\boldsymbol{G}_2$ each have size $K \times K$ and have all diagonal entries equal to one, while any off-diagonal entry is free to be either one or zero. Then, under one additional mild condition, the parameters $(\boldsymbol{G}, \boldsymbol{\beta}, \boldsymbol{\eta}, \boldsymbol{\tau})$ are generically identifiable.
The two different identifiability conditions on the binary graphical matrix stated in Propositions 3 and 4 correspond to different identifiability notions—strict and generic identifiability, respectively. The generic identifiability notion is slightly less stringent than strict identifiability, by allowing a Lebesgue-measure-zero subset of the parameter space where identifiability does not hold. Our sufficient generic identifiability conditions in Proposition 4 are much less stringent than conditions in Proposition 3.
4.2 Bayesian inference for the latent sparse graph and number of binary latent traits
We propose a Bayesian inference procedure for two-latent-layer Bayesian Pyramids. We apply a Gibbs sampler by employing the Polya-Gamma data augmentation in Polson et al. (2013) together with the auxiliary variable method for multinomial logit models in Holmes and Held (2006). Such Gibbs sampling steps can handle general multivariate categorical data.
4.2.1 Inference with a Fixed $K$
First consider the case where the true number $K$ of binary latent variables in the middle layer is fixed. Inferring the latent sparse graph $\boldsymbol{G}$ is equivalent to inferring the sparsity structure of the continuous parameters $\beta_{jck}$'s. Let the prior for $\beta_{jck}$ given $g_{jk} = 1$ be a mean-zero normal with variance $\sigma^{2}_{\beta}$, where the hyperparameters can be set to give weakly informative priors for the logistic regression coefficients (Gelman et al., 2008). Specify priors for the other parameters as

$$g_{jk} \sim \mathrm{Bernoulli}(1/2), \qquad \beta_{jck} \mid g_{jk} = 0 \sim N(0, \sigma^{2}_{0}), \qquad \boldsymbol{\tau} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}_{\tau}), \qquad \eta_{bk} \sim \mathrm{Beta}(a_{\eta}, b_{\eta}). \tag{17}$$

Here $\sigma^{2}_{0}$ is a small positive number specifying the 'pseudo'-variance, and we take a small fixed value for it in the numerical studies. Our adoption of a small prior variance for $\beta_{jck}$ when $g_{jk} = 0$ follows a similar spirit as the 'pseudo-prior' approach in the Bayesian variable selection literature. In Bayesian variable selection for regression analysis, Dellaportas et al. (2002) first proposed using a pseudo-prior for the regression coefficient of a variable that is not included in the model, to facilitate convenient Gibbs sampling steps. Specifically, a binary variable encodes whether or not each predictor is included in the regression, and the corresponding regression coefficient is multiplied by this indicator; the coefficient receives its usual prior when the predictor is included and the pseudo-prior otherwise. The pseudo-prior variance does not affect the posterior but may influence mixing of the MCMC algorithm.
The slab variance $\sigma^{2}_{\beta}$ in (17) is further given a noninformative Inverse-Gamma prior, with hyperparameters that can be set to small values. In the data augmentation part, for each subject $i$, each observed variable $j$, and each non-baseline category $c$, we introduce a Polya-Gamma random variable $\omega_{ijc}$. We use the auxiliary variable approach in Holmes and Held (2006) for multinomial regression to derive the conditional distribution of the coefficients for each category. Given data $\boldsymbol{y}_{1}, \ldots, \boldsymbol{y}_{n}$, introduce binary indicators $u_{ijc} = 1\{y_{ij} = c\}$. Conditionally on the parameters for the other categories, the posterior of $u_{ijc}$ satisfies

$$P(u_{ijc} = 1 \mid -) = \frac{\exp(\psi_{ijc})}{1 + \exp(\psi_{ijc})},$$

and we introduce the notation

$$\psi_{ijc} = \beta_{jc0} + \sum_{k=1}^{K} g_{jk}\beta_{jck}a_{ik} - C_{ijc}, \qquad C_{ijc} = \log\sum_{c' \neq c}\exp\Big(\beta_{jc'0} + \sum_{k=1}^{K} g_{jk}\beta_{jc'k}a_{ik}\Big).$$

Next, by the property of the Polya-Gamma random variables (Polson et al., 2013), we have

$$\frac{\exp(\psi_{ijc})^{u_{ijc}}}{1 + \exp(\psi_{ijc})} = \frac{1}{2}\exp\big\{(u_{ijc} - 1/2)\,\psi_{ijc}\big\}\int_{0}^{\infty}\exp\big(-\omega\,\psi_{ijc}^{2}/2\big)\, p_{\mathrm{PG}}(\omega \mid 1, 0)\, d\omega,$$

where $p_{\mathrm{PG}}(\cdot \mid 1, 0)$ denotes the density function of the PG$(1, 0)$ distribution. Based on the above identity, the conditional posteriors of the intercept and main-effect coefficients are still Gaussian, and the conditional posterior of each $\omega_{ijc}$ is still Polya-Gamma, namely PG$(1, \psi_{ijc})$; these full conditional distributions are easy to sample from. As for the binary entries $g_{jk}$'s indicating the presence or absence of edges in the Bayesian Pyramid, we sample each individually from its posterior Bernoulli distribution. The detailed steps of such a Gibbs sampler with known $K$ are presented in the Online Supplementary Material.
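To illustrate the building block behind these updates, the sketch below performs one Polya-Gamma-augmented Gaussian update for a vector of logistic coefficients, following the identity of Polson et al. (2013); the Polya-Gamma draw is left as a generic placeholder `sample_pg` rather than a specific library call, and the multinomial (Holmes–Held) bookkeeping of the full sampler is omitted.

```python
import numpy as np

def gibbs_update_beta(X, u, beta, prior_var, sample_pg, rng):
    """One Polya-Gamma-augmented update of logistic coefficients (Polson et al., 2013),
    written for a single binary outcome u given design X; sample_pg(b, c) stands in
    for any Polya-Gamma PG(b, c) sampler.

    X         : (n, q) design matrix (e.g., columns of the latent traits a).
    u         : (n,) binary responses such as the indicators defined above.
    prior_var : (q,) prior variances of the coefficients (spike or slab).
    """
    psi = X @ beta
    omega = np.array([sample_pg(1.0, c) for c in psi])          # augmentation draws
    kappa = u - 0.5
    precision = X.T @ (omega[:, None] * X) + np.diag(1.0 / prior_var)
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ kappa)
    return rng.multivariate_normal(mean, cov)                    # Gaussian full conditional
```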
4.2.2 Inferring an Unknown $K$
On top of the sampling algorithm described above for a fixed $K$, we propose a method for simultaneously inferring $K$ and the other parameters. In the context of mixture models, Rousseau and Mengersen (2011) defined over-fitted mixtures with more than enough latent components and used shrinkage priors to effectively delete the unnecessary ones. In a similar spirit but motivated by Gaussian linear factor models, Legramanti et al. (2020) proposed the cumulative shrinkage process (CSP) prior, which has a spike-and-slab structure. We use a CSP prior on the variances of the main-effect coefficients $\beta_{jck}$ to infer the number of latent binary variables in a two-layer Bayesian Pyramid. The rationale for using such an increasing shrinkage prior for latent dimension selection is that it is natural to expect additional latent dimensions to play a progressively less important role in characterizing the data, so the associated parameters should have a stochastically decreasing effect. Specifically, under the CSP prior, the $k$th latent dimension is controlled by a scalar variance parameter that follows a spike-and-slab distribution. Redundant dimensions will be essentially deleted by progressively shrinking this sequence of variances towards an appropriate value (the spike). In particular, Legramanti et al. (2020) considered the continuous factor model where this parameter denotes the variance of the factor loadings for the $k$th factor and the spike is a small positive number indicating the variance of redundant latent factors.
We next describe in detail the prior specifications with an unknown number of binary latent variables $K$. Consider an upper bound $\bar{K}$ for $K$, with $K \leq \bar{K}$. Based on the identifiability conditions in Theorem 4 about the shape of $\boldsymbol{G}$, $K$ is naturally constrained to be at most $\lfloor p/3 \rfloor$, therefore $\bar{K}$ can be set to $\lfloor p/3 \rfloor$ or smaller in practice. We adopt a prior that detects redundant binary latent variables by increasingly shrinking the variances of the $\beta_{jck}$'s as $k$ grows from 1 to $\bar{K}$. Specifically, we put a CSP prior on the variances $\theta_{ck}$ of the main-effect coefficients for each category $c$, where $\theta_{\infty}$ is a prespecified small positive number indicating the spike variance for redundant binary latent variables:

$$\beta_{jck} \mid \theta_{ck} \sim N(0, \theta_{ck}), \qquad \theta_{ck} \sim (1 - \pi_{k})\,\mathrm{Inv\text{-}Gamma}(a_{\theta}, b_{\theta}) + \pi_{k}\,\delta_{\theta_{\infty}}, \tag{19}$$

$$\pi_{k} = \sum_{l=1}^{k} w_{l}, \qquad w_{l} = v_{l}\prod_{m=1}^{l-1}(1 - v_{m}), \qquad v_{l} \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha). \tag{20}$$
In (19), $\mathrm{Inv\text{-}Gamma}(a_{\theta}, b_{\theta})$ refers to the inverse gamma distribution with shape $a_{\theta}$ and scale $b_{\theta}$, and $\pi_{k}$ has the stick-breaking representation in (20) with the $v_{l}$ independently following the $\mathrm{Beta}(1, \alpha)$ distribution. We set $v_{\bar{K}} = 1$ to truncate the stick-breaking representation at $\bar{K}$, similarly to Legramanti et al. (2020). The $\delta_{\theta_{\infty}}$ represents the Dirac spike distribution with $\theta_{\infty}$ serving as the variance of redundant latent variables, while $\mathrm{Inv\text{-}Gamma}(a_{\theta}, b_{\theta})$ represents the more diffuse slab distribution for the variances corresponding to active latent variables. The 'increasing shrinkage' comes from the fact that, as the latent variable index $k$ increases, the probability $\pi_{k}$ of $\theta_{ck}$ belonging to the spike stochastically increases, because $\pi_{k} = \sum_{l=1}^{k} w_{l}$ increases as the index $k$ increases. Therefore, the CSP prior features an increasing amount of shrinkage for larger $k$. We introduce auxiliary variables $\zeta_{1}, \ldots, \zeta_{\bar{K}}$ to help with understanding and facilitate posterior computation. Specifically, the prior in (19) can be obtained by marginalizing out a discrete auxiliary variable $\zeta_{k}$ with $P(\zeta_{k} = l) = w_{l}$ for $l \in [\bar{K}]$, so (19)-(20) can be reformulated in terms of $\zeta_{k}$ as

$$\theta_{ck} \mid \zeta_{k} \sim \begin{cases} \mathrm{Inv\text{-}Gamma}(a_{\theta}, b_{\theta}), & \text{if } \zeta_{k} > k, \\ \delta_{\theta_{\infty}}, & \text{if } \zeta_{k} \leq k. \end{cases}$$

Therefore, the auxiliary variables determine whether the $k$th latent dimension is in the spike and hence corresponds to a redundant latent variable; specifically, if $\zeta_{k} > k$ then $\theta_{ck}$ follows the slab distribution and the $k$th latent variable is active, otherwise $\theta_{ck}$ is in the spike and the latent variable is redundant. Given (19), the largest possible number of active latent variables is $\bar{K} - 1$, because $P(\zeta_{\bar{K}} > \bar{K}) = 0$ and the last latent variable is always redundant. Since the event $\{\zeta_{k} > k\}$ indicates that the $k$th component is in the slab and hence active, we can write the total number of active latent dimensions as

$$K^{*} = \sum_{k=1}^{\bar{K}} 1\{\zeta_{k} > k\}.$$
Tracking the posterior samples of all the $\zeta_{k}$'s gives a posterior estimate of $K^{*}$. The above data augmentation leads to simple Gibbs updating steps. We present the details of our Gibbs sampler in the Online Supplementary Material.
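As a self-contained illustration of the increasing shrinkage mechanism, the sketch below draws one set of spike/slab variances from a CSP-style prior and counts the active dimensions via the auxiliary indicators; the hyperparameter names (alpha, theta_inf, a_sig, b_sig) are illustrative stand-ins rather than the values used in the paper.

```python
import numpy as np

def draw_csp_variances(K_bar, alpha, theta_inf, a_sig, b_sig, rng):
    """Draw slab/spike variances from a cumulative-shrinkage-process-style prior
    in the spirit of Legramanti et al. (2020): stick-breaking weights give the
    spike probabilities pi_k, and an auxiliary indicator zeta_k places dimension k
    in the slab (zeta_k > k) or the spike (zeta_k <= k)."""
    v = rng.beta(1.0, alpha, size=K_bar)
    v[-1] = 1.0                                       # truncate stick-breaking at K_bar
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    w = v * sticks
    w = w / w.sum()                                   # weights w_l, l = 1..K_bar
    pi = np.cumsum(w)                                 # pi_k = P(spike), increasing in k
    zeta = np.array([rng.choice(K_bar, p=w) + 1 for _ in range(K_bar)])
    active = zeta > np.arange(1, K_bar + 1)
    variances = np.where(active,
                         1.0 / rng.gamma(a_sig, 1.0 / b_sig, size=K_bar),  # slab: Inv-Gamma
                         theta_inf)                                         # spike value
    K_star = int(active.sum())                        # number of active latent dimensions
    return variances, K_star

variances, K_star = draw_csp_variances(K_bar=7, alpha=5.0, theta_inf=0.05,
                                       a_sig=2.0, b_sig=2.0,
                                       rng=np.random.default_rng(3))
```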
It is methodologically straightforward to extend our Gibbs sampler to deep Bayesian Pyramids with more than two latent layers. To see this, note that in deep Bayesian Pyramids with $D > 2$, the conditional distribution of each latent variable given its parents in Example 2 is a special case of the conditional distribution of $y_{ij}$ given $\boldsymbol{a}^{(1)}_{i}$. Indeed, both of these conditionals follow generalized linear models with the (multinomial) logit link, with the parent variables serving as predictors for the child. Under such a formulation, introducing additional Polya-Gamma auxiliary variables for the $\boldsymbol{a}^{(1)}$-layer, similar to those for the $\boldsymbol{y}$-layer, would allow Gibbs updates in a three-latent-layer Bayesian Pyramid. In this work, we focus on two-latent-layer models for computational efficiency.
We also remark that it is not hard to derive and implement a Gibbs sampler for the constrained latent class models (CLCMs) mentioned in Section 3.1. Performing Gibbs sampling for CLCMs is a relatively straightforward extension of the current Gibbs sampler for Bayesian Pyramids. The reason is that one can similarly use the data augmentation strategies in Holmes and Held (2006) and Polson et al. (2013) to deal with the categorical likelihood, and one can also adopt similar priors for the binary constraint matrix $\boldsymbol{\Gamma}$ as those adopted for the graphical matrix $\boldsymbol{G}$ in a Bayesian Pyramid. Our preliminary simulations for such a Gibbs sampler under CLCMs showed that the recovery of the parameters and the constraint matrix is not as stable and accurate as that for Bayesian Pyramids (shown in the later Figures 3-5). One explanation is that, compared to multilayer Bayesian Pyramids, the parametrization of a CLCM is less parsimonious and requires many more mixture proportion parameters in order to describe the same joint distribution of the observed variables. Therefore, we have chosen to focus on the method for Bayesian Pyramids in this work, and treat CLCMs mainly as intermediate tools to help establish identifiability of Bayesian Pyramids.

Figure 3. Posterior means of the main-effect parameters for one response category, averaged across 50 independent replications, for the two sample sizes considered. An upper bound $\bar{K}$ is taken in the CSP prior for posterior inference, while the true $K$ is 4, as shown in the rightmost plot.

Figure 4. Mean estimation errors of the binary graphical matrix $\boldsymbol{G}$ at the matrix level (a), row level (b), and entry level (c). The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.

Figure 5. Average root-mean-square errors (RMSE) of the posterior means of the main effects, intercepts, and deeper tensor arms $\boldsymbol{\eta}$ versus the sample size $n$. The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.
5 Simulation studies
We conducted replicated simulation studies to assess the Bayesian procedure proposed in Section 4 and examine whether the model parameters are indeed estimable as implied by Theorem 3. Consider a two-latent-layer Bayesian Pyramid with $p$ observed variables, the same number of response categories for each observed variable, $K = 4$ binary latent variables in the middle layer, and one binary latent variable in the deep layer. Let the true binary graphical matrix $\boldsymbol{G}$ satisfy the conditions for strict identifiability in Theorem 4; in particular, it contains three copies of the identity matrix $\boldsymbol{I}_{4}$ as submatrices. Let the true intercept parameters take a common value across the non-baseline categories for each variable $j$; and for any $(j, k)$ with $g_{jk} = 1$, let the corresponding true main-effect parameters of the binary latent variables take one common value if variable $y_{j}$ has a single parent and another common value if $y_{j}$ has multiple parents. See Figure 3c for a heatmap of the sparse matrix of the main-effect parameters for one response category.
We use the method developed in Section 4 with the CSP prior for posterior inference under an unknown $K$. We specify an upper bound $\bar{K}$ for $K$ guided by the fact that $\lfloor p/3 \rfloor$ is a natural upper bound here, based on the identifiability considerations mentioned before. As for the hyperparameters in the CSP prior, we mainly follow the default settings and suggestions in Legramanti et al. (2020). Specifically, we set the stick-breaking parameter $\alpha$ to the same value as in Legramanti et al. (2020); we also follow the suggestion in Legramanti et al. (2020) that the spike variance $\theta_{\infty}$ should be a small positive number but should avoid taking excessively low values. Our simulations adopting these choices turn out to produce accurate estimation results under different settings (see the later Figures 3-5). In our preliminary simulations, we also varied these hyperparameters around the adopted values and did not observe the algorithmic performance to be sensitive to them. Throughout the Gibbs sampling iterations, we enforce an identifiability constraint on the main-effect parameters whenever the corresponding edge is present, that is, whenever $g_{jk} = 1$. For each sample size we conduct 50 independent simulation replications. Since the model is identifiable up to a permutation of the latent variables in each layer, we post-process the posterior samples to find a column permutation of the estimated $\boldsymbol{G}$ that best matches the simulation truth; then the columns of other parameter matrices are permuted accordingly.
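One simple way to carry out this post-processing is to match estimated and true columns with the Hungarian algorithm; the sketch below does so for a generic loading-type matrix and is an illustrative alternative to, not a transcription of, the exact procedure used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_columns(est, truth):
    """Find the column permutation of an estimated loading-type matrix that best
    matches the simulation truth, resolving the latent-variable permutation
    invariance discussed above."""
    K = truth.shape[1]
    cost = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            cost[k, l] = np.sum((est[:, l] - truth[:, k]) ** 2)
    row_ind, col_ind = linear_sum_assignment(cost)   # minimise total mismatch
    return est[:, col_ind]                           # columns reordered to match truth
```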
We have used the Gelman-Rubin convergence diagnostic (Gelman et al., 2013; Gelman & Rubin, 1992) to assess the convergence of the MCMC output from multiple random initializations. In particular, for one of the simulation settings, we randomly initialize the parameters from their prior distributions five times and then run five MCMC chains for each simulated dataset. For each MCMC chain, we run the chain for 15,000 iterations, discarding the first 10,000 iterations as burn-in and retaining every fifth sample post burn-in to thin the chain. After collecting the posterior samples, we calculate the potential scale reduction factor (i.e., the Gelman-Rubin statistic) of the model parameters. The median Gelman-Rubin statistics for the deep conditional probabilities $\boldsymbol{\eta}$ and for the deep latent class proportions $\boldsymbol{\tau}$ across the five chains are both close to one.
The Gelman–Rubin statistics for other parameters are similarly well controlled and are omitted. We also inspected the traceplots of the MCMC outputs and observed fast mixing of the MCMC chain after convergence. These observations justify running MCMC for 15,000 iterations and discarding the first 10,000 as burn-in, therefore we adopt these settings in all the numerical experiments. When applying the proposed method to other datasets, we also recommend calculating the Gelman–Rubin statistics and inspecting the traceplots to determine the appropriate number of overall MCMC iterations and burn-in iterations.
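For completeness, the potential scale reduction factor used above can be computed for a scalar parameter as in the following minimal sketch of the classical Gelman-Rubin formula (without chain splitting).

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter; chains is an
    array of shape (n_chains, n_draws) of post burn-in, thinned samples."""
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()              # within-chain variance
    B = n_draws * chain_means.var(ddof=1)              # between-chain variance
    var_hat = (n_draws - 1) / n_draws * W + B / n_draws
    return np.sqrt(var_hat / W)
```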
Our Bayesian modeling of $\boldsymbol{G}$ adopts rather uninformative priors, with each entry $g_{jk} \sim \mathrm{Bernoulli}(1/2)$ a priori, and we do not force $\boldsymbol{G}$ to take a specific form (such as including an identity submatrix as described in Proposition 3 for strict identifiability), but rather let the sampler freely explore the space of all binary matrices to estimate $\boldsymbol{G}$. Our theory on identifiability and posterior consistency implies that the posteriors of both $\boldsymbol{G}$ and the other parameters concentrate around their true values as the sample size grows. This is empirically verified in our simulation studies; Figures 3-5 show that as the sample size grows, both the discrete and the continuous parameters are estimated consistently. We next elaborate on these findings from the simulations.
Figure 3 presents posterior means of the main-effect parameters for category , averaged across the 50 independent replications. The leftmost plot of Figure 3 shows that for a relatively small sample size , the posterior means of already exhibit a structure similar to the ground truth in the rightmost plot. Also, the true number of binary latent variables in the middle latent layer is revealed in Figures 3a and b. For the sample size , Figure 3b shows that the posterior means of are very close to the ground truth. The posterior means for the other categories show similar patterns to those for . The estimated are slightly biased toward zero for the smaller sample size (), with less bias for the larger sample size (). Our results suggest that the binary graphical matrix underlying (that is, the sparsity structure of the continuous parameters) is easier to estimate than the nonzero themselves. In other words, with finite samples, the existence of a link between an observed variable and a binary latent variable is easier to infer than the strength of such a link (if it exists).
To assess how the approach performs with increasing sample size, we consider eight sample sizes for under the same settings as above. To compare the estimated structures to the simulation truth with , we retain exactly four binary latent variables, choosing those having the largest average posterior variance . For the discrete structure of the binary graphical matrix , we present mean estimation errors in Figure 4. Specifically, Figure 4a plots the errors of recovering the entire matrix , Figure 4b plots the errors of recovering the row vectors of , and Figure 4c plots the errors of recovering the individual entries of . For sample sizes as small as , estimation of is very accurate across the simulation replications. Notably, is much smaller than the total number of cells in the contingency table for the observed variables . For the continuous parameters , , and , Figure 5 plots the average root-mean-square error (RMSE) of the posterior means against the sample size . For each of , , and , the RMSE decreases with sample size, as expected given our posterior consistency result in Theorem 3.
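The error metrics summarized in Figures 4 and 5 can be computed along the lines of the sketch below (after the permutation alignment described earlier); the exact definitions used in our experiments may differ slightly, and the function and array names are hypothetical.

```python
import numpy as np

def structure_errors(G_hat, G_true):
    """Recovery errors for the binary graphical matrix after column alignment."""
    matrix_error = float(np.any(G_hat != G_true))         # 1 if any entry is wrong
    row_error = np.mean(np.any(G_hat != G_true, axis=1))  # proportion of misrecovered rows
    entry_error = np.mean(G_hat != G_true)                # proportion of misrecovered entries
    return matrix_error, row_error, entry_error

def rmse(est, truth):
    """Root-mean-square error of a posterior-mean estimate of continuous parameters."""
    return float(np.sqrt(np.mean((np.asarray(est) - np.asarray(truth)) ** 2)))
```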
In the simulations (and also in the later real data analysis), we examined the posterior samples collected from the MCMC algorithm and the traceplots of various model parameters, and did not observe label-switching issues in our numerical studies. In general, we suggest inspecting the traceplots of mixture-component-specific parameters to check for label switching. If such issues arise, we recommend applying the R package label.switching (Papastamoulis, 2016) to the MCMC output to address them.
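As a quick diagnostic along these lines, one can overlay traceplots of the component-specific parameters; sudden swaps between trajectories would indicate label switching. A minimal matplotlib sketch with hypothetical inputs:

```python
import matplotlib.pyplot as plt

def traceplot(draws, names):
    """Overlay traceplots of mixture-component-specific parameters.

    draws: (T, K) array of T saved posterior draws for K components.
    Trajectories that suddenly swap values would indicate label switching.
    """
    fig, ax = plt.subplots(figsize=(7, 3))
    for k in range(draws.shape[1]):
        ax.plot(draws[:, k], lw=0.8, label=names[k])
    ax.set_xlabel("saved MCMC iteration")
    ax.set_ylabel("parameter value")
    ax.legend(loc="upper right", fontsize=8)
    fig.tight_layout()
    return fig
```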
6 Application to DNA nucleotide sequence data
We apply our proposed two-latent-layer Bayesian Pyramid with the CSP prior to the Splice Junction dataset of DNA nucleotide sequences; the data are available from the UCI machine learning repository. We also analyze another dataset of nucleotide sequences, the Promoter data, and present the results in the Online Supplementary Material. The Splice Junction data consist of A, C, G, T nucleotides () at positions for DNA sequences. There are two types of genomic regions, introns and exons; junctions between the two are called splice junctions (Nguyen et al., 2016). Splice junctions can be further categorized as (a) exon-intron junctions and (b) intron-exon junctions. The samples in the Splice dataset each belong to one of three types: exon-intron junction (‘EI’, 762 samples); intron-exon junction (‘IE’, 765 samples); and neither EI nor IE (‘N’, 1,648 samples). Previous studies have used supervised learning methods for predicting sequence type (e.g., Li & Wong, 2003; Nguyen et al., 2016). Here we fit the proposed two-latent-layer Bayesian Pyramid to the data in a completely unsupervised manner, with the sequence type information held out. We use the nucleotide sequences to learn discrete latent representations of each sequence, and then investigate whether the learned lower-dimensional discrete latent features are interpretable and predictive of the sequence type.
We let the variable in the deepest latent layer have three categories, inspired by the fact that there are three types of sequences: EI, IE, and N; but we do not use any information about which sequence belongs to which type when fitting the model to the data. As mentioned earlier, the upper bound for the binary latent variables can be set to or smaller in order to yield an identifiable Bayesian Pyramid model. In practice, when is large, we recommend starting with a relatively small , inspecting the estimated active/redundant latent dimensions, and increasing only if all the latent dimensions are estimated to be active a posteriori (that is, if the posterior mode of defined in (21) equals ). For this splice junction dataset with , we start with for better computational efficiency; this is the same value as used in the simulations and already allows for distinct latent binary profiles. We find that the posterior selects only five active latent dimensions, which suggests that it is not necessary to increase to a larger number for this dataset.
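This check can be mechanized as in the sketch below, which records the number of active (slab) dimensions at each saved Gibbs iteration and flags whether the posterior mode of that count hits the current upper bound; the inputs are hypothetical placeholders for the corresponding CSP-prior indicators.

```python
from collections import Counter

def needs_larger_upper_bound(active_counts, K_upper):
    """Check whether the upper bound on the number of binary latent traits
    should be enlarged.

    active_counts: number of active (slab) latent dimensions recorded at each
    saved Gibbs iteration. If the posterior mode of this count equals the
    current upper bound K_upper, all dimensions appear active and we would
    refit with a larger upper bound; otherwise the bound is large enough.
    """
    mode_count = Counter(int(c) for c in active_counts).most_common(1)[0][0]
    return mode_count == K_upper
```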
We again run the Gibbs sampler for 15,000 iterations, discarding the first 10,000 as burn-in and retaining every fifth sample thereafter to thin the chain. Based on examination of traceplots, the sampler has good mixing behavior. As mentioned in the previous paragraph, our method selects five binary latent variables. We index the saved posterior samples by and, for each , denote the samples of by . Similarly, for each , denote the posterior samples of the nucleotide sequences’ latent binary traits by , and those of the sequences’ deep latent categories by , where . We define our final estimators , , and to be
, , and summarize the element-wise posterior modes of the discrete latent structures in our model. The matrix depicts how the loci load on the binary latent traits, the matrix depicts the presence or absence of each binary latent trait in each nucleotide sequence, and the matrix depicts which deep latent group each nucleotide sequence belongs to. , , and are all binary matrices; the first two are binary feature matrices, whereas each row of the last contains exactly one entry equal to ‘1’, indicating group membership. In Figure 6, the first three plots display the three estimated matrices , , and , respectively, and the last plot shows the held-out gene type information for reference. Regarding the estimated loci loading matrix , Figure 6a shows how the loci depend on the five binary latent traits. Specifically, the first 27 loci show somewhat similar loading patterns and mainly load on the first four binary traits. The middle 10 loci (from locus 28 to locus 37) are similar in loading on all five traits, and the last 23 loci (from locus 38 to locus 60) are similar in exhibiting sparser loading structures. Figure 6b–d shows that the two matrices and corresponding to the nucleotide sequences exhibit a clear clustering pattern, which aligns well with the known but held-out junction types EI, IE, and N.
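A minimal sketch of how such element-wise posterior modes can be computed from the saved samples is given below; the array names are hypothetical, and the final estimators in our analysis are as defined above.

```python
import numpy as np

def elementwise_mode_binary(samples):
    """Element-wise posterior mode of binary matrices.

    samples: (T, n, K) array of T saved draws of an n-by-K binary matrix;
    an entry is set to 1 if it equals 1 in more than half of the draws.
    """
    return (samples.mean(axis=0) > 0.5).astype(int)

def elementwise_mode_categorical(samples, n_categories):
    """Posterior mode of each sequence's deep latent category, as one-hot rows.

    samples: (T, n) array of T saved draws of n categorical labels
    taking values in {0, ..., n_categories - 1}.
    """
    counts = np.stack([(samples == c).sum(axis=0) for c in range(n_categories)], axis=1)
    return np.eye(n_categories, dtype=int)[counts.argmax(axis=1)]
```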

Splice junction data analysis under the CSP prior with . Plots are presented with the binary latent traits selected by our proposed method a posteriori. After applying the rule-lists approach to deterministically match the latent features to the gene types as in (22), the accuracies for predicting the gene types EI, IE, N are all above 95%.
To formally assess how the latent discrete features learned by the proposed method perform in downstream prediction, we apply the ‘rule lists’ classification approach of Angelino et al. (2017) to the estimated latent features in and for the nucleotide sequences. The rule-lists approach is an interpretable classification method based on a categorical feature space, and it finds simple, deterministic rules of the categorical features for predicting a (binary) class label. For each instance , we define the categorical features to be the eight-dimensional vector , which is concatenated from and and is a feature vector with binary entries. Denote the ground-truth nucleotide sequence types by ; then for , for , and for . Recall that the ’s are not used in fitting our Bayesian Pyramid to obtain the latent features ’s. We use the Python package CORELS for the rule-lists method with the ’s and ’s as input, and find rules that match the ’s to the ’s. Specifically, the deterministic rules given by CORELS are:
Equation (22) gives a very simple and explicit rule depending on , , and for each nucleotide sequence . This simple rule can be empirically verified by comparing the middle two plots to the rightmost plot in Figure 6. In (22), the rules for EI and IE are not mutually exclusive, but the rules for EI and N are mutually exclusive, as are those for IE and N. Therefore, to obtain mutually exclusive rules for the three class labels EI, IE, and N based on (22), we can simply define the following two types of labels and :
For the two types of labels and in (23), we provide the corresponding normalized confusion matrices in Figure 7, along with their heatmaps. Figure 7 shows that both and have very high prediction accuracies for the sequence types , with all the diagonal entries above . The strong prediction performance shown in Figure 7 implies that the learned discrete latent features of the nucleotide sequences are interpretable and very useful in the downstream classification task. For the Splice Junction dataset, such accuracies are even comparable to the state-of-the-art prediction performance of the fine-tuned convolutional neural networks in Nguyen et al. (2016), which is also between 95% and 97%.
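To make the downstream pipeline concrete, the sketch below illustrates fitting one rule list per sequence type with the pycorels package on the eight binary latent features and summarizing the combined predictions in a normalized confusion matrix. The feature and variable names are hypothetical, the interface shown is the CorelsClassifier API as we recall it (consult the package documentation), and the exact settings used in our analysis may differ.

```python
import numpy as np
from corels import CorelsClassifier   # pycorels package

def fit_rule_list(Z, y, label, feature_names):
    """Fit one rule list for the binary task 'is this sequence of type `label`?'.

    Z: (n, 8) binary latent features from the fitted Bayesian Pyramid.
    y: length-n array of held-out sequence types in {'EI', 'IE', 'N'}.
    """
    target = (np.asarray(y) == label).astype(np.uint8)
    clf = CorelsClassifier(max_card=3, n_iter=10000)   # the setting referred to in the text
    clf.fit(Z, target, features=feature_names, prediction_name=label)
    return clf

# One rule list is fitted per sequence type; the three sets of predictions are
# then combined into mutually exclusive labels and summarized via
# sklearn.metrics.confusion_matrix(y, y_pred, labels=["EI", "IE", "N"], normalize="true").
```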

Downstream interpretable classification results for the splice junction data after fitting the two-latent-layer Bayesian Pyramid. The prediction performance of the learned lower-dimensional discrete latent features is summarized in two confusion matrices, corresponding to and defined in (23).
In the above data analysis, we fixed the number of deep latent categories at three, inspired by the fact that there are three types of splice junctions. Alternatively, we can use the overfitted mixture framework of Rousseau and Mengersen (2011) to accommodate an unknown . Overfitted mixtures are finite mixture models with an unknown number of mixture components , where only an upper bound on is known. In such scenarios, Rousseau and Mengersen (2011) proposed using shrinkage priors for the mixture proportion parameters, such as Dirichlet priors with small Dirichlet hyperparameters. More specifically, denoting the mixture proportion parameters corresponding to the ‘overfitted’ mixture components by , lives on the -dimensional probability simplex. The overfitted mixture framework of Rousseau and Mengersen (2011) guarantees that, with the prior where the Dirichlet parameters are small enough relative to the dimension of the observations, the redundant mixture components are ‘emptied out’ in the posterior asymptotically. Hence, the true mixture components underlying the data can be identified from the posterior distribution. In the Online Supplementary Material, we reanalyze the real data and conduct simulation studies with an overfitted , achieving favorable results.
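The ‘emptying out’ behavior can be monitored directly from the sampled class labels, as in the following minimal sketch; the array names are hypothetical and reading off the occupied components is only one simple summary.

```python
import numpy as np

def occupied_components(z_samples, K_upper):
    """Average occupancy of each 'overfitted' mixture component.

    z_samples: (T, n) array of sampled deep-layer class labels in {0, ..., K_upper - 1}.
    Components whose average occupancy stays near zero have been emptied out,
    so the number of clearly occupied components estimates the true number of
    deep latent classes.
    """
    avg_counts = np.stack(
        [(z_samples == k).sum(axis=1) for k in range(K_upper)], axis=1
    ).mean(axis=0)
    return avg_counts
```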
For comparison, we also considered two alternative discrete latent structure models: (a) the latent class model, with a single univariate discrete latent variable behind the observed data layer; and (b) a single-latent-layer model, with one layer of independent binary latent variables behind the data layer. For model (a), we applied the nonparametric Bayesian method of Dunson and Xing (2009) to the splice junction dataset and extracted latent classes (the default in Dunson & Xing, 2009; increasing beyond 10 is not considered because the current 10 latent classes already include empty classes not occupied by any nucleotide sequence). For model (b), we implemented a new Gibbs sampler (described in Section 9.3 of the Online Supplementary Material), again employing the cumulative shrinkage process prior used for Bayesian Pyramids. Based on the learned lower-dimensional latent representations, we then apply the ‘rule-lists’ classification approach of Angelino et al. (2017) (via the Python package CORELS) to each of the three labels EI, IE, and N as binary classification tasks. In addition to these two competitors (abbreviated ‘DX2009’ and ‘IndepBinary’, respectively), we also consider a benchmark approach of directly applying the rule-lists classifier to the raw DNA nucleotide sequences. Since the rule-lists classifier in the CORELS package only accepts binary predictors, we first convert the DNA nucleotide sequences of A, G, C, T into longer sequences containing binary indicators of whether each locus is ‘A’, ‘G’, or ‘C’; after this conversion, each original 60-dimensional nucleotide sequence becomes a binary vector of length 180. We then apply the rule-lists classifier with the same setting of learning at most three ‘rules’ (i.e., max_card = 3 in the CORELS package) as for the three latent variable methods: DX2009, IndepBinary, and BayesPyramid. The resulting classification accuracies are reported in Table 1.
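The conversion of raw sequences into binary predictors described above could be implemented as in the following sketch (hypothetical names; Table 1 follows below).

```python
import numpy as np

def binarize_sequences(seqs, nucleotides=("A", "G", "C")):
    """Convert nucleotide sequences into binary indicator vectors for CORELS.

    seqs: list of equal-length strings over {A, C, G, T}. Each locus contributes
    one indicator per nucleotide in `nucleotides`; the 'T' indicator is omitted
    because it is determined by the other three.
    """
    n, p = len(seqs), len(seqs[0])
    X = np.zeros((n, p * len(nucleotides)), dtype=int)
    for i, s in enumerate(seqs):
        for j, base in enumerate(s):
            for b, nuc in enumerate(nucleotides):
                if base == nuc:
                    X[i, j * len(nucleotides) + b] = 1
    return X
# For 60-locus sequences this yields binary vectors of length 180.
```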
| Method | EI (train) | IE (train) | N (train) | EI (test) | IE (test) | N (test) |
|---|---|---|---|---|---|---|
| Raw nucleotide sequences | 0.909 | 0.854 | 0.840 | 0.916 | 0.866 | 0.839 |
| DX2009 | 0.946 | 0.955 | 0.918 | 0.950 | 0.965 | 0.929 |
| IndepBinary | 0.964 | 0.955 | 0.956 | 0.965 | 0.957 | 0.965 |
| BayesPyramid | 0.971 | 0.975 | 0.970 | 0.984 | 0.983 | 0.976 |
The left three columns of Table 1 list accuracies on the training set containing 80% of the samples, and the right three columns on the test set containing the remaining 20%. The test accuracies are very close to the training ones, indicating satisfactory generalizability. The reason is that the maximum number of rules learned for each model is limited to at most three (i.e., max_card = 3 in the CORELS package). Such a small number of rules does not overfit the training data and hence generalizes well to the test data. We remark that when increasing max_card beyond three, the downstream classification accuracy of BayesPyramid remains the same as in Table 1; however, for the raw nucleotide sequences, the CORELS package cannot complete the execution of the classifier with max_card larger than three, likely due to the high dimensionality of the search space of rules. Therefore, we report the CORELS classification results with max_card = 3 in Table 1.
Remarkably, Table 1 shows that all three unsupervised learning methods learn latent features that are more predictive of the sequence type than the raw sequences themselves, with the Bayesian Pyramid the clear top performer. This observation indicates that latent variable approaches, especially our multilayer Bayesian Pyramids, can ‘denoise’ the raw sequence data and return more useful features for downstream tasks. The model of Dunson and Xing (2009) (‘DX2009’) and the single-layer independent-latent model (‘IndepBinary’) are special cases of Bayesian Pyramids, and both perform well on the splice data. Nevertheless, the Bayesian Pyramid reduces the test-set misclassification rate relative to IndepBinary by 31%–60%, relative to DX2009 by 51%–68%, and relative to the raw nucleotide sequences by 81%–87%.
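These percentage reductions follow directly from the test-set accuracies in Table 1; the short script below reproduces them.

```python
# Relative reduction in test-set misclassification rate of BayesPyramid versus
# each competitor, computed from the test accuracies in Table 1.
bp = {"EI": 0.984, "IE": 0.983, "N": 0.976}
competitors = {
    "IndepBinary": {"EI": 0.965, "IE": 0.957, "N": 0.965},
    "DX2009": {"EI": 0.950, "IE": 0.965, "N": 0.929},
    "Raw sequences": {"EI": 0.916, "IE": 0.866, "N": 0.839},
}
for name, acc in competitors.items():
    reduction = {
        lab: round(((1 - acc[lab]) - (1 - bp[lab])) / (1 - acc[lab]), 2) for lab in bp
    }
    print(name, reduction)
# IndepBinary: about 0.31-0.60; DX2009: about 0.51-0.68; raw sequences: about 0.81-0.87.
```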
7 Discussion
In this article, we have proposed Bayesian Pyramids, a general family of discrete multilayer latent variable models for multivariate categorical data. Bayesian Pyramids cover multiple existing statistical and machine learning models as special cases, while allowing for various assumptions on the conditional distributions. Our identifiability theory is key in providing reassurance that one can reliably and reproducibly learn the latent structure, a guarantee that is lacking in almost all of the related literature. The proposed Bayesian approach has excellent performance in the simulations and data analyses, showing good computational performance and a surprising ability to accurately infer meaningful and useful latent structure. There are immediate applications of the proposed Bayesian Pyramid approach in many other disciplines. For example, in ecology and biodiversity research, joint species distribution modeling is a very active area (see the recent book by Ovaskainen & Abrego, 2020). The data consist of high-dimensional binary indicators of presence or absence of different species at each sampling location. In this context, Bayesian Pyramids provide a useful approach for inferring latent community structure of biological and ecological interest.
This work focuses on the unsupervised setting and uses latent variable approaches. In unsupervised problems, there is no specific outcome or response associated with each data point, and the general goal is to discover ‘interesting’ patterns in the data (Murphy, 2012). Such unsupervised learning problems are generally considered more challenging and less well studied than supervised ones. Latent variables can capture semantically interpretable constructs, and marginalizing out the latent variables induces great flexibility in the resulting marginal distribution of the observables. Bayesian Pyramids provide unsupervised feature discovery tools for learning latent architectures underlying multivariate data, yielding insight into data-generating processes, performing nonlinear dimension reduction, and extracting useful latent representations.
In order to realize the great potential of latent variable approaches to unsupervised learning, identifiability issues must be addressed so that reliable latent constructs can be reproducibly extracted from data. In particular, if one wishes to interpret the latent representations and parameters estimated from a latent variable model and use them in downstream analyses, then identifiability is a prerequisite for making such interpretations meaningful and reproducible. In this sense, identifiability is necessary for interpretability of the latent structures. We remark that such interpretability considerations differ from the goal of learning interpretable treatment rules for some outcome in complex supervised learning settings (see, e.g., Kitagawa & Tetenov, 2018; Semenova & Chernozhukov, 2021). In modern machine learning, although powerful deep latent variable models have been proposed, identifiability issues have rarely been considered and the design of the latent architecture is often guided by heuristics. In this work, our identifiability theory directly informs how to specify the deep latent architecture in a Bayesian Pyramid: the identifiability conditions on the matrices (in Theorem 4) inspire us to specify a ‘pyramid’ structure featuring fewer and fewer latent variables deeper up the hierarchy () and sparse graphical connections between layers.
In the future, it is worth further advancing and refining scalable algorithms for Bayesian Pyramids beyond the two-latent-layer case. Methodologically, our current Gibbs sampling procedure readily extends to multiple latent layers, because non-adjacent layers in a Bayesian Pyramid are conditionally independent and the current Gibbs updates for shallower layers can be applied analogously to deeper ones. In terms of computational performance, however, efficiently sampling discrete latent variables via MCMC is generally more challenging than sampling continuous ones. In a Bayesian Pyramid, the binary entries of the graphical matrices act similarly to covariate selection indicators in (logistic) regression problems, yet are more difficult to estimate because the ‘covariates’ themselves are latent and discrete. This can cause unsatisfactory mixing of naive extensions of the Gibbs sampler to deeper models. For example, updating an entry of a deeper graphical matrix can cause a substantial shift in the posterior of the associated continuous parameters in the layers below. Future research is warranted to advance the MCMC methodology or to explore complementary methods for estimating the discrete latent structures in deeper models.
Acknowledgments
The authors thank the Editor, Associate Editor, and three referees for helpful and constructive comments.
Funding
This work was partially supported by the National Science Foundation grant DMS-2210796. This work was also partially supported by grants R01ES027498 and R01ES028804 from the National Institute of Environmental Health Sciences of the National Institutes of Health, and received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 856506).
Data availability
Matlab code implementing the proposed method and data are available at this GitHub repository: https://github.com/yuqigu/BayesianPyramids.
Supplementary material
Supplementary material is available at Journal of the Royal Statistical Society: Series B online.
References
Author notes
Conflict of interest None declared.