Towards multi-omics characterization of tumor heterogeneity: a comprehensive review of statistical and machine learning approaches

SNV-driven tools for subclonal inference

Name	Year	Feature\|$^{\ast }$\|	Model\|$^{\dagger }$\|	Evolutionary constraint	Multi-sample analysis\|$^{\ddagger }$\|	Description^§	Implementation	Ref
PurBayes	2013	AC	Bin	N	N	Bayesian finite mixture model evaluated with penalized expected deviance	R	[21]
TrAp	2013	CVAF	N/A	Y	Y	Enumerates possible tree structures by iteratively merging first-generation trees	Java	[22]
PyClone	2014	AC	Bin, BeBin	N	Y	Bayesian nonparametric clustering using Dirichlet process	Python	[23]
SciClone	2014	AC, VAF	Beta, Gaussian, Bin	N	Y	Variational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clusters	R	[24]
PhyloSub	2014	AC	Bin	Y	Y	Bayesian nonparametric model using tree-structured stick-breaking process	C++, Python	[25]
EXPANDS	2014	VAF	N/A	N	N	Heuristic clustering of variants based on KL-divergence.	R	[26]
TITAN	2014	AC	Bin	N	N	Factorial HMM with joint emission model for allele counts and RDs	R	[27]
Clomial	2014	AC	Bin	N	Y	Constrained matrix factorization using multi-region sequencing data	R	[28]
Rec-BTP	2014	CVAF	N/A	Y	N	Recursive approximation algorithm for binary tree partition	MATLAB, Python	[29]
PhyloWGS	2015	AC	Bin	Y	Y	Extends PhyloSub by incorporating SCNA information from WGS	C++, Python	[30]
AncesTree	2015	AC	Bin	Y	Y	Solves VAF factorization problem with ILP	C++	[31]
LICHeE	2015	VAF	Gaussian	Y	Y	Multi-sample tumor phylogeny reconstruction by solving a set cover problem	Java	[32]
Bayclone	2015	AC	Bin	N	Y	Bayesian nonparametric clustering using categorial Indian buffet process	R	[33]
Cloe	2016	AC	BeBin	Y	Y	Phylogenetic latent feature model that reflects hierarchical relationship between subclonal features	R	[34]
Canopy	2016	AC	Bin	Y	Y	Multi-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNA	R	[35]
ddClone	2017	AC	Bin, BeBin	N	N	Bayesian nonparametric clustering using distance-dependent Chinese restaurant process	R	[36]
ClonEvol	2017	CVAF	N/A	Y	Y	Uses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderings	R	[37]

Name	Year	Feature\|$^{\ast }$\|	Model\|$^{\dagger }$\|	Evolutionary constraint	Multi-sample analysis\|$^{\ddagger }$\|	Description^§	Implementation	Ref
PurBayes	2013	AC	Bin	N	N	Bayesian finite mixture model evaluated with penalized expected deviance	R	[21]
TrAp	2013	CVAF	N/A	Y	Y	Enumerates possible tree structures by iteratively merging first-generation trees	Java	[22]
PyClone	2014	AC	Bin, BeBin	N	Y	Bayesian nonparametric clustering using Dirichlet process	Python	[23]
SciClone	2014	AC, VAF	Beta, Gaussian, Bin	N	Y	Variational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clusters	R	[24]
PhyloSub	2014	AC	Bin	Y	Y	Bayesian nonparametric model using tree-structured stick-breaking process	C++, Python	[25]
EXPANDS	2014	VAF	N/A	N	N	Heuristic clustering of variants based on KL-divergence.	R	[26]
TITAN	2014	AC	Bin	N	N	Factorial HMM with joint emission model for allele counts and RDs	R	[27]
Clomial	2014	AC	Bin	N	Y	Constrained matrix factorization using multi-region sequencing data	R	[28]
Rec-BTP	2014	CVAF	N/A	Y	N	Recursive approximation algorithm for binary tree partition	MATLAB, Python	[29]
PhyloWGS	2015	AC	Bin	Y	Y	Extends PhyloSub by incorporating SCNA information from WGS	C++, Python	[30]
AncesTree	2015	AC	Bin	Y	Y	Solves VAF factorization problem with ILP	C++	[31]
LICHeE	2015	VAF	Gaussian	Y	Y	Multi-sample tumor phylogeny reconstruction by solving a set cover problem	Java	[32]
Bayclone	2015	AC	Bin	N	Y	Bayesian nonparametric clustering using categorial Indian buffet process	R	[33]
Cloe	2016	AC	BeBin	Y	Y	Phylogenetic latent feature model that reflects hierarchical relationship between subclonal features	R	[34]
Canopy	2016	AC	Bin	Y	Y	Multi-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNA	R	[35]
ddClone	2017	AC	Bin, BeBin	N	N	Bayesian nonparametric clustering using distance-dependent Chinese restaurant process	R	[36]
ClonEvol	2017	CVAF	N/A	Y	Y	Uses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderings	R	[37]

|$^{\ddagger }$|Denotes whether the tool is able to conduct joint analysis of several spatially or longitudinally distinct samples collected from a single tumor. Abbreviations: |$^{*}$|AC, allele count; VAF, variant allele frequency; CVAF, clustered variant allele frequency; |$^{\dagger }$|Bin, binomial; BeBin, beta-binomial; ^§ILP, integer linear programming; CI, confidence interval.

Table 1

Open in new tab Download slide

SNV-driven tools for subclonal inference

Name	Year	Feature\|$^{\ast }$\|	Model\|$^{\dagger }$\|	Evolutionary constraint	Multi-sample analysis\|$^{\ddagger }$\|	Description^§	Implementation	Ref
PurBayes	2013	AC	Bin	N	N	Bayesian finite mixture model evaluated with penalized expected deviance	R	[21]
TrAp	2013	CVAF	N/A	Y	Y	Enumerates possible tree structures by iteratively merging first-generation trees	Java	[22]
PyClone	2014	AC	Bin, BeBin	N	Y	Bayesian nonparametric clustering using Dirichlet process	Python	[23]
SciClone	2014	AC, VAF	Beta, Gaussian, Bin	N	Y	Variational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clusters	R	[24]
PhyloSub	2014	AC	Bin	Y	Y	Bayesian nonparametric model using tree-structured stick-breaking process	C++, Python	[25]
EXPANDS	2014	VAF	N/A	N	N	Heuristic clustering of variants based on KL-divergence.	R	[26]
TITAN	2014	AC	Bin	N	N	Factorial HMM with joint emission model for allele counts and RDs	R	[27]
Clomial	2014	AC	Bin	N	Y	Constrained matrix factorization using multi-region sequencing data	R	[28]
Rec-BTP	2014	CVAF	N/A	Y	N	Recursive approximation algorithm for binary tree partition	MATLAB, Python	[29]
PhyloWGS	2015	AC	Bin	Y	Y	Extends PhyloSub by incorporating SCNA information from WGS	C++, Python	[30]
AncesTree	2015	AC	Bin	Y	Y	Solves VAF factorization problem with ILP	C++	[31]
LICHeE	2015	VAF	Gaussian	Y	Y	Multi-sample tumor phylogeny reconstruction by solving a set cover problem	Java	[32]
Bayclone	2015	AC	Bin	N	Y	Bayesian nonparametric clustering using categorial Indian buffet process	R	[33]
Cloe	2016	AC	BeBin	Y	Y	Phylogenetic latent feature model that reflects hierarchical relationship between subclonal features	R	[34]
Canopy	2016	AC	Bin	Y	Y	Multi-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNA	R	[35]
ddClone	2017	AC	Bin, BeBin	N	N	Bayesian nonparametric clustering using distance-dependent Chinese restaurant process	R	[36]
ClonEvol	2017	CVAF	N/A	Y	Y	Uses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderings	R	[37]

Name	Year	Feature\|$^{\ast }$\|	Model\|$^{\dagger }$\|	Evolutionary constraint	Multi-sample analysis\|$^{\ddagger }$\|	Description^§	Implementation	Ref
PurBayes	2013	AC	Bin	N	N	Bayesian finite mixture model evaluated with penalized expected deviance	R	[21]
TrAp	2013	CVAF	N/A	Y	Y	Enumerates possible tree structures by iteratively merging first-generation trees	Java	[22]
PyClone	2014	AC	Bin, BeBin	N	Y	Bayesian nonparametric clustering using Dirichlet process	Python	[23]
SciClone	2014	AC, VAF	Beta, Gaussian, Bin	N	Y	Variational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clusters	R	[24]
PhyloSub	2014	AC	Bin	Y	Y	Bayesian nonparametric model using tree-structured stick-breaking process	C++, Python	[25]
EXPANDS	2014	VAF	N/A	N	N	Heuristic clustering of variants based on KL-divergence.	R	[26]
TITAN	2014	AC	Bin	N	N	Factorial HMM with joint emission model for allele counts and RDs	R	[27]
Clomial	2014	AC	Bin	N	Y	Constrained matrix factorization using multi-region sequencing data	R	[28]
Rec-BTP	2014	CVAF	N/A	Y	N	Recursive approximation algorithm for binary tree partition	MATLAB, Python	[29]
PhyloWGS	2015	AC	Bin	Y	Y	Extends PhyloSub by incorporating SCNA information from WGS	C++, Python	[30]
AncesTree	2015	AC	Bin	Y	Y	Solves VAF factorization problem with ILP	C++	[31]
LICHeE	2015	VAF	Gaussian	Y	Y	Multi-sample tumor phylogeny reconstruction by solving a set cover problem	Java	[32]
Bayclone	2015	AC	Bin	N	Y	Bayesian nonparametric clustering using categorial Indian buffet process	R	[33]
Cloe	2016	AC	BeBin	Y	Y	Phylogenetic latent feature model that reflects hierarchical relationship between subclonal features	R	[34]
Canopy	2016	AC	Bin	Y	Y	Multi-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNA	R	[35]
ddClone	2017	AC	Bin, BeBin	N	N	Bayesian nonparametric clustering using distance-dependent Chinese restaurant process	R	[36]
ClonEvol	2017	CVAF	N/A	Y	Y	Uses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderings	R	[37]

|$^{\ddagger }$|Denotes whether the tool is able to conduct joint analysis of several spatially or longitudinally distinct samples collected from a single tumor. Abbreviations: |$^{*}$|AC, allele count; VAF, variant allele frequency; CVAF, clustered variant allele frequency; |$^{\dagger }$|Bin, binomial; BeBin, beta-binomial; ^§ILP, integer linear programming; CI, confidence interval.

Figure 1

Genome-based approaches for TH characterization. The left panel illustrates the viewpoint of SNV-driven subclonal inference algorithms. Different cancer subclones (denoted by the blue, red and green population of cells) harbor different sets of SNVs at their genomic loci (illustrated as squares, triangles and stars). The variant alleles at these genomic loci are identified and quantified by NGS, and their counts or VAFs are statistically modeled to obtain an optimal estimation of subclonal structure or evolutionary history of subclones. The right panel illustrates the viewpoint of SCNA-driven tumor purity and absolute copy number estimation approaches. Thick white lines inside the cells symbolize the copy numbers of genomic segments. While being quantified by SNP array or NGS platforms, the copy number values are averaged, and therefore resulting copy numbers are not necessarily integers. With the average copy number profiles, SCNA-driven methods infer tumor purity and absolute copy numbers of cancer cells. Note that the extent of contaminating normal cells (smooth circles) are taken into account by both of the approaches.

where |$n^{\textrm{ref}}_{l}$| and |$n^{\textrm{alt}}_{l}$| represents the number of reads supporting reference and alternative alleles, respectively. To identify VAF clusters, two different modeling strategies have been proposed: modeling (1) |$n^{\textrm{alt}}_l$| as a discrete random variable (Equation (3)), or (2) |$f_l$| as a continuous random variable (Equation (4)).

$$\begin{align}& n^{\textrm{alt}}_{l} \sim D_{\textrm{disc}}(d_{l}, p_{i}) \end{align}$$

(3)

$$\begin{align}& f_l \sim D_{\textrm{cont}}(d_{l}, p_{i}), \end{align}$$

(4)

where |$p_{i}$| represents the probability of sampling reads containing variant alleles, which is intuitively the half of the true prevalence of subclone |$i$| (⁠|$\phi _i$|⁠) for heterozygous variants. |$D_{\textrm{disc}}$| and |$D_{\textrm{cont}}$| are discrete and continuous probability distribution, respectively, which vary across different approaches. While most of the tools [21, 23–25, 27, 28, 30, 31, 33, 36, 38, 39] adopt binomial distribution as |$D_{\textrm{disc}}$|⁠, beta-binomial is another possible option to account for the overdispersion of allelic frequencies [23, 34, 36]. On the other hand, for |$D_{\textrm{cont}}$|⁠, beta [24] and Gaussian distribution [24, 32, 40] are the most favorable choices.

In reality, |$p_i$| does not simply correspond to |$\phi _{i} / 2$| for heterozygous variants due to the impurity of tumor samples and SCNAs. Therefore, several methods adjust |$p_i$| with extra parameters such as tumor purity (⁠|$\pi $|⁠), total ploidy (⁠|$\omega $|⁠) and copy number of variant allele (⁠|$\nu $|⁠). Let |$\omega _l^n$|⁠, |$\omega _l^c$| and |$\omega _l^v$| be the total copy number at locus |$l$| for normal cells, cancer cells without variants and cancer cells with variants, respectively, and let a similar definition applied to |$\nu _l^n$|⁠, |$\nu _l^c$| and |$\nu _l^v$|⁠. Then, |$p_i$| can be adjusted as below.

$$\begin{align}& p_i = \frac{(1-\pi)\nu_l^n + (1 - p_i)\pi\nu_l^c + p_i\pi\nu_l^v}{(1-\pi)\omega_l^n + (1 - p_i)\pi\omega_l^c + p_i\pi\omega_l^v}. \end{align}$$

(5)

Here, tumor purity |$\pi $| can be inferred from the data [21] or can be obtained by external measurements [23]. For |$\omega _l$| and |$\nu _l$|⁠, some approaches propose unified models to infer all the parameters simultaneously [27], while others circumvent the problem by providing external copy number estimates [24] or by marginalizing them out under the prior assumptions of subclonal genotypes [23].

Deciding the optimal number of inferred subclones (⁠|$K^{*}$|⁠) is another critical issue. The advantage of imposing nonparametric priors such as Dirichlet process [23, 41, 42] or categorical Indian buffet process [33] for subclonal structures is that |$K^{*}$| is inferred in an adaptive and data-driven manner, which does not require predefined |$K$| given to the model a priori. On the other hand, parametric priors, including Dirichlet distribution [34], are still a reasonable option. The model parameters should be repeatedly inferred for varying predefined |$K$|s. Thus, the lightweight inference algorithms such as expectation-maximization (EM) [27, 28, 38] or variational EM [24] are often preferred. Subsequently, the best |$K$| is selected by appropriate evaluation metric such as log likelihood [38], penalized expected deviance [21] and |$S\_Dbw$| [27].

The most realistic prior for the subclonal structures is the distribution over the topologies of phylogenetic trees, as it is likely to produce biologically reasonable phylogenies. PhyloSub [25] is one of the first methods to incorporate phylogenetic priors for subclonal inference, which uses a tree-structured stick-breaking process. Since then, many relevant approaches have been developed, and their common rationale is to integrate various phylogenetic constraints into their inference framework, such as sum and crossing rule [25], ancestry condition [31, 32] or children-sum-to-parents condition [22].

SCNA-driven approaches for tumor purity and absolute copy numbers

The primary goal of the most SCNA-driven approaches is to obtain the tumor purity and (allele-specific) absolute copy numbers (Figure 1, right panel; Table 2). While the early studies mostly used log R ratios (LRRs) and B allele frequencies (BAFs) generated from an SNP array, recent algorithms utilize whole-genome, or whole-exome, sequencing read depths (RDs) [or read depth ratios (RDRs)] for each genomic window and fractions of B-allele read counts, which correspond to LRRs and BAFs, respectively. In general, the algorithms need two values to infer the nonnegative integer copy numbers of A and B alleles (⁠|$\omega ^{A}_{l}$| and |$\omega ^{B}_{l}$|⁠, respectively) and purity (⁠|$\pi $|⁠): (1) LRRs or RDs/RDRs to infer the copy numbers and (2) BAFs to infer the genotypes. Formally, for each SNP locus |$l$|⁠, expected value of LRRs or RDRs (⁠|$r_l$|⁠) and BAF (⁠|$f_l$|⁠) can be written as below.

$$\begin{align}& r_l = \frac{2(1 - \pi) + (\omega^{A}_{l} + \omega^{B}_{l})\pi}{D} \end{align}$$

(6)

$$\begin{align}& f_l = \frac{(1 - \pi) + \omega^{B}_{l}\pi}{2(1 - \pi) + (\omega^{A}_{l} + \omega^{B}_{l})\pi}. \end{align}$$

(7)

Table 2

SCNA-driven tools for tumor purity and absolute copy number estimation

Name	Year	Data	Feature\|$^{*}$\|	Type\|$^{\dagger }$\|	Description	Implementation	Ref
ASCAT	2010	SNP array	LRR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Copy number estimates are evaluated by squared differences to nearest integers.	R	[17]
OncoSNP	2010	SNP array	RD, BAF	M	Student t-mixture solved by EM	MATLAB	[43]
ABSOLUTE	2012	SNP array	LRR, VAF	M	Gaussian mixture using grid search through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by SCNA-fit likelihoods.	R	[18]
APOLLOH	2012	NGS	RD, BAF	H	HMM with 18 genotype states and binomial emission. Parameters are estimated by EM.	MATLAB	[44]
CNAnorm	2012	NGS	RDR	M	Gaussian mixture fitted by EM. Number of components are chosen using AIC.	R	[45]
CLImAT	2014	NGS	RD, AC	H	HMM with binomial and negative binomial emission model for ACs and RDs, respectively	MATLAB	[46]
THetA2	2014	NGS	RD, BAF	M	Solves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbers	Java, MATLAB, Python	[40]
TITAN	2014	NGS	RDR, AC	H	Subclone-aware factorial HMM with joint emission model for ACs and RDs	R	[27]
PyLOH	2014	NGS	RD, AC	M	Joint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.	Python	[47]
AbsCN-seq	2014	NGS	RD, VAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that minimizes least squares objective.	R	[48]
Mixclone	2015	NGS	RD, BAF	M	Unified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.	Python	[38]
Sequenza	2015	NGS	RDR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that maximizes log-posterior probability.	Python, R	[49]
FACETS	2017	NGS	RD, BAF	M	Employs Gaussian noncentral \|$\chi ^2$\| model to call absolute copy numbers	R	[50]
CLImAT-HET	2017	NGS	RD, AC	H	Subclone-aware factorial HMM which extends CLImAT	MATLAB	[51]
ichorCNA	2017	NGS	RDR	H	HMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.	R	[52]
Sclust	2018	NGS	RD, RDR, BAF, VAF	O	\|$\omega $\| and \|$\pi $\| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.	C++	[53]
ACE	2019	NGS	RD	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.	R	[54]

Name	Year	Data	Feature\|$^{*}$\|	Type\|$^{\dagger }$\|	Description	Implementation	Ref
ASCAT	2010	SNP array	LRR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Copy number estimates are evaluated by squared differences to nearest integers.	R	[17]
OncoSNP	2010	SNP array	RD, BAF	M	Student t-mixture solved by EM	MATLAB	[43]
ABSOLUTE	2012	SNP array	LRR, VAF	M	Gaussian mixture using grid search through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by SCNA-fit likelihoods.	R	[18]
APOLLOH	2012	NGS	RD, BAF	H	HMM with 18 genotype states and binomial emission. Parameters are estimated by EM.	MATLAB	[44]
CNAnorm	2012	NGS	RDR	M	Gaussian mixture fitted by EM. Number of components are chosen using AIC.	R	[45]
CLImAT	2014	NGS	RD, AC	H	HMM with binomial and negative binomial emission model for ACs and RDs, respectively	MATLAB	[46]
THetA2	2014	NGS	RD, BAF	M	Solves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbers	Java, MATLAB, Python	[40]
TITAN	2014	NGS	RDR, AC	H	Subclone-aware factorial HMM with joint emission model for ACs and RDs	R	[27]
PyLOH	2014	NGS	RD, AC	M	Joint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.	Python	[47]
AbsCN-seq	2014	NGS	RD, VAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that minimizes least squares objective.	R	[48]
Mixclone	2015	NGS	RD, BAF	M	Unified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.	Python	[38]
Sequenza	2015	NGS	RDR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that maximizes log-posterior probability.	Python, R	[49]
FACETS	2017	NGS	RD, BAF	M	Employs Gaussian noncentral \|$\chi ^2$\| model to call absolute copy numbers	R	[50]
CLImAT-HET	2017	NGS	RD, AC	H	Subclone-aware factorial HMM which extends CLImAT	MATLAB	[51]
ichorCNA	2017	NGS	RDR	H	HMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.	R	[52]
Sclust	2018	NGS	RD, RDR, BAF, VAF	O	\|$\omega $\| and \|$\pi $\| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.	C++	[53]
ACE	2019	NGS	RD	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.	R	[54]

Abbreviations:|$^{*}$|LRR, log R ratio; RD, read depth; RDR, read depth ratio; BAF, B-allele frequency; AC, allele count; VAF, variant allele frequency; |$^{\dagger }$|E, heuristics-based method; M, mixture-based method; H, HMM-based method; O, Other.

Table 2

Open in new tab Download slide

SCNA-driven tools for tumor purity and absolute copy number estimation

Name	Year	Data	Feature\|$^{*}$\|	Type\|$^{\dagger }$\|	Description	Implementation	Ref
ASCAT	2010	SNP array	LRR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Copy number estimates are evaluated by squared differences to nearest integers.	R	[17]
OncoSNP	2010	SNP array	RD, BAF	M	Student t-mixture solved by EM	MATLAB	[43]
ABSOLUTE	2012	SNP array	LRR, VAF	M	Gaussian mixture using grid search through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by SCNA-fit likelihoods.	R	[18]
APOLLOH	2012	NGS	RD, BAF	H	HMM with 18 genotype states and binomial emission. Parameters are estimated by EM.	MATLAB	[44]
CNAnorm	2012	NGS	RDR	M	Gaussian mixture fitted by EM. Number of components are chosen using AIC.	R	[45]
CLImAT	2014	NGS	RD, AC	H	HMM with binomial and negative binomial emission model for ACs and RDs, respectively	MATLAB	[46]
THetA2	2014	NGS	RD, BAF	M	Solves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbers	Java, MATLAB, Python	[40]
TITAN	2014	NGS	RDR, AC	H	Subclone-aware factorial HMM with joint emission model for ACs and RDs	R	[27]
PyLOH	2014	NGS	RD, AC	M	Joint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.	Python	[47]
AbsCN-seq	2014	NGS	RD, VAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that minimizes least squares objective.	R	[48]
Mixclone	2015	NGS	RD, BAF	M	Unified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.	Python	[38]
Sequenza	2015	NGS	RDR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that maximizes log-posterior probability.	Python, R	[49]
FACETS	2017	NGS	RD, BAF	M	Employs Gaussian noncentral \|$\chi ^2$\| model to call absolute copy numbers	R	[50]
CLImAT-HET	2017	NGS	RD, AC	H	Subclone-aware factorial HMM which extends CLImAT	MATLAB	[51]
ichorCNA	2017	NGS	RDR	H	HMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.	R	[52]
Sclust	2018	NGS	RD, RDR, BAF, VAF	O	\|$\omega $\| and \|$\pi $\| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.	C++	[53]
ACE	2019	NGS	RD	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.	R	[54]

Name	Year	Data	Feature\|$^{*}$\|	Type\|$^{\dagger }$\|	Description	Implementation	Ref
ASCAT	2010	SNP array	LRR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Copy number estimates are evaluated by squared differences to nearest integers.	R	[17]
OncoSNP	2010	SNP array	RD, BAF	M	Student t-mixture solved by EM	MATLAB	[43]
ABSOLUTE	2012	SNP array	LRR, VAF	M	Gaussian mixture using grid search through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by SCNA-fit likelihoods.	R	[18]
APOLLOH	2012	NGS	RD, BAF	H	HMM with 18 genotype states and binomial emission. Parameters are estimated by EM.	MATLAB	[44]
CNAnorm	2012	NGS	RDR	M	Gaussian mixture fitted by EM. Number of components are chosen using AIC.	R	[45]
CLImAT	2014	NGS	RD, AC	H	HMM with binomial and negative binomial emission model for ACs and RDs, respectively	MATLAB	[46]
THetA2	2014	NGS	RD, BAF	M	Solves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbers	Java, MATLAB, Python	[40]
TITAN	2014	NGS	RDR, AC	H	Subclone-aware factorial HMM with joint emission model for ACs and RDs	R	[27]
PyLOH	2014	NGS	RD, AC	M	Joint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.	Python	[47]
AbsCN-seq	2014	NGS	RD, VAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that minimizes least squares objective.	R	[48]
Mixclone	2015	NGS	RD, BAF	M	Unified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.	Python	[38]
Sequenza	2015	NGS	RDR, BAF	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Finds optimal solution that maximizes log-posterior probability.	Python, R	[49]
FACETS	2017	NGS	RD, BAF	M	Employs Gaussian noncentral \|$\chi ^2$\| model to call absolute copy numbers	R	[50]
CLImAT-HET	2017	NGS	RD, AC	H	Subclone-aware factorial HMM which extends CLImAT	MATLAB	[51]
ichorCNA	2017	NGS	RDR	H	HMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.	R	[52]
Sclust	2018	NGS	RD, RDR, BAF, VAF	O	\|$\omega $\| and \|$\pi $\| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.	C++	[53]
ACE	2019	NGS	RD	E	Grid searches through \|$\omega $\| and \|$\pi $\|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.	R	[54]

Abbreviations:|$^{*}$|LRR, log R ratio; RD, read depth; RDR, read depth ratio; BAF, B-allele frequency; AC, allele count; VAF, variant allele frequency; |$^{\dagger }$|E, heuristics-based method; M, mixture-based method; H, HMM-based method; O, Other.

The denomiator |$D$| is given as

$$\begin{align}& D = \begin{cases} 2 & \textrm{if there is matched normal sample} \\ 2(1 - \pi) + \omega\pi & \textrm{otherwise}, \end{cases} \end{align}$$

(8)

where |$\omega $| represents genomewide average copy number. Upon this formulation, we describe four representative types of SCNA-driven approaches in this section.

Heuristics-based methods Absolute copy numbers should be nonnegative integers. Therefore, a simple assessment is possible for a candidate solution by evaluating how much the solution is close to nonnegative integers. Since a simple analytic solution (⁠|$\widehat{\omega ^{A}_l}$|⁠, |$\widehat{\omega ^{B}_l}$|⁠) for any fixed (⁠|$\omega $|⁠, |$\pi $|⁠) can be derived by Equations (6) and (7), and biologically plausible range of (⁠|$\omega $|⁠, |$\pi $|⁠) can be specified, we can easily think of grid-searching through the space of (⁠|$\omega $|⁠, |$\pi $|⁠) to find optimal solution. ASCAT [17] evaluates each grid-search solution by summing the squared difference between inferred copy numbers (⁠|$\widehat{\omega ^{A}_l}$|⁠, |$\widehat{\omega ^{B}_l}$|⁠) and the corresponding nearest nonnegative integers. ACE [54] first computes expected RDR signal for all plausible integer copy numbers by Equation (6) and evaluates each solution using the difference between the true RDR signal and nearest expected RDR signal. Similarly, Sequenza [49] finds the optimal (⁠|$\omega $|⁠, |$\pi $|⁠) that maximizes the log-posterior probability of the fit by grid-search.

Mixture-based methods LRR, RD or RDR values of genomic segments with the same copy state would be clustered together, which led to mixture-based methods. Solving the mixture model gives a set of putative copy states, and the states are further annotated by absolute copy numbers. For example, ABSOLUTE [18] is a prominent Gaussian mixture-based approach applied for pan-cancer absolute copy number analysis. Since expected LRR value can be parameterized by (⁠|$\omega $|⁠, |$\pi $|⁠) (Equation (6)), we have a different mixture model for each configuration of (⁠|$\omega $|⁠, |$\pi $|⁠). Through grid-search, ABSOLUTE obtains the optimal (⁠|$\omega $|⁠, |$\pi $|⁠) that shows the best SCNA-fit log-likelihood scores.

BAF values can also be incorporated in the mixture model since SNP loci sharing the same genotype will produce similar BAF values. PyLOH [47] jointly models RD and BAF values with Poisson and binomial mixture models, respectively.

HMM-based methods Since an SCNA event usually spans a large genomic stretch spanning multiple SNPs, those SNPs will share the same genotypes. This spatial correlation can be exploited to get more precise signals of allelic frequencies. While simple aggregation of read counts along the segment [55] was shown to be effective, more elegant approaches using hidden Markov model (HMM) have been developed.

In general, such HMM models regard the empirically plausible set of genotypes as a set of hidden states. For example, APOLLOH [44], CLIMAT [46] and TITAN [27] propose an 18-, 20- and 21-state model, respectively. Given a genotype as a hidden state, emission probabilities are defined with probability distributions. For example, RDs and log copy ratios are often modeled with Poisson and Gaussian distribution, respectively, and B-allelic read counts are modeled with a binomial distribution. The transition matrix can be inferred from the data [46], or a fixed matrix is used where the transition probabilities are defined to exponentially decay as distance increases [27, 44].

Subclone-aware methods Few SCNA-driven methods, namely TITAN [27], Sclust [53] and THetA2 [40], takes subclones into consideration. TITAN is a factorial HMM-based approach that explicitly models three populations with respect to each genomic segment as in SNV-driven approaches: (1) normal cells, (2) tumor cells without subclonal SCNA and (3) tumor cells with subclonal SCNA.

$$\begin{align}& r_l = \frac{2(1-\pi) + 2(1-p_l)\pi + (\omega^{A}_l + \omega^{B}_l)p_l\pi}{D} \end{align}$$

(9)

$$\begin{align}& f_l = \frac{(1-\pi) + (1-p_l)\pi + \omega^{B}_lp_l\pi}{2(1-\pi) + 2(1-p_l)\pi + (\omega^{A}_l + \omega^{B}_l)p_l\pi}, \end{align}$$

(10)

where |$p_l$| is the proportion of tumor cells harboring SCNA at locus or segment |$l$|⁠. Similar formulation has been adopted by Sclust, a unified framework that estimates absolute (subclonal) copy numbers and subclonal prevalences. On the other hand, THetA2 directly solves a mixture decomposition problem, by letting an observed RD vector expressed as a product of integer-valued interval count matrix |$C$| and real-valued genome mixing vector|$\mu $|⁠, which denote copy numbers of genomic segments in each subclone and subclonal prevalences, respectively. The decomposition is done by an efficient enumeration of the solution space of (⁠|$C$|⁠, |$\mu $|⁠).

Perspectives

The strength of SNV-driven approaches is that subclone-wise functional analyses can be readily conducted since an SNV can usually be translated into a functional alteration of a specific gene. However, the high variance of allele frequencies, due to insufficient sequencing depth or experimental bias, is one of the major hurdles. Moreover, discrepancies of SNVs captured from various callers also undermine the robustness of SNV-driven approaches [56]. On the contrary, SCNA-driven approaches can work well with shallow sequencing depth because an SCNA usually spans a large genomic interval rather than a point, resulting in much more reads supporting each event. They still have shortcomings in terms of analytic resolution, as most of their application is restricted to analyzing tumor purity or absolute copy number while not accounting for tumor subclones.

Since most cells of an individual share almost identical genomic blueprints, one of the fundamental limitations of genome-based approaches is that we cannot distinguish distinct cell types using genomic features. Therefore, for a full characterization of TH including microenvironment surrounding a tumor, epigenomic and transcriptomic features are necessary since they are known to show cell-type specificity.

Transcriptome-based tumor heterogeneity

A growing body of evidence suggests that epigenetic aberrations are nearly ubiquitous in a wide variety of cancer types [57]. Subjects of the epigenetic alterations commonly include DNA methylation, histone modification and, more broadly, chromatin state. Among them, aberrant DNA methylation exemplified by focal hypermethylation [58–62] and widespread hypomethylation [63–65] have been the most comprehensively characterized, yet not fully understood, phenomenon in cancer. Although histone modifications and chromatin states also have been of great interest in early studies, the extent of their heterogeneity across tumor cells largely remains to be explored. Thus, in this section, we mainly discuss how the heterogeneity of DNA methylation can be exploited to uncover the heterogeneity of tumor samples.

Underlying biological features

Investigating the ITH of DNA methylation

The earliest report of ITH of DNA methylation (mITH) was by Feinberg and Vogelstein [63], along with an insightful conjecture that mITH arises from a certain portion of cancer cells having defects in methylation maintenance. In the past decades, this hypothesis has gradually been consolidated in diverse types of cancers [66, 67]. The active introduction of the bisulfite sequencing techniques allowed the methylation state of individual CpGs to be analyzed in an unprecedented resolution and coverage. Most importantly, analyzing a single sequencing read harboring multiple CpGs enabled haplotype-like analysis of DNA methylation, and the phased pattern of DNA methylation is called `epiallele'. The techniques other than sequencing-based methods exploited by the early DNA methylation studies could only inform average methylation levels for each CpG, limiting the discussion of mITH at bulk-cell level. On the contrary, since each sequencing read is originated from a single cell, we can consider an epiallele as a surrogate of each cell. Therefore, analyzing methylation patterns of sequencing reads is conceptually equivalent to analyzing the local epigenomic state of a collection of single cells.

Dual perspectives on epigenetic heterogeneity in cancer

The two apparently contradictive hallmarks of epigenomic modification—their stable inheritance and dynamic regulation—make it one of the most intriguing clues for TH.

Somatic heritability The first aspect relies on the somatic heritability of epigenomic modifications, which is most analogous to the view of genomic TH because it mainly focuses on subclonal alterations of epigenomic marks, which corresponds to subclonal SNVs or SCNAs. In this viewpoint, a subclonal epigenomic mark can be either of passenger or driver of tumor progression as its genomic counterpart. The selective growth advantage given by `driver' epigenomic marks leads to a more aggressive proliferation, which finally results in an epigenetic subclone that possesses identical (or nearly identical) epigenetic alterations. The experimental evidence for epigenetic subclones was reported in diffuse large B-cell lymphomas by showing that the diversity of methylation patterns were greatly reduced (i.e. patterns ‘converged’) in relapsed samples after chemotherapy [68].

Stochasticity Even under appropriate regulation, the average DNA methylation level of a bulk sample at a specific CpG site seems to show slight fluctuations over time. It can be compared with a ball traveling back and forth at the bottom of the valley [69]. Often, the machinery is perturbed in a transformed or precancerous cell, giving flatter regulatory potential landscape that imposes only a weak restoring force on the methylation level. Thus, the methylation level becomes more unstable and follows a trajectory rather resembling a random-walk trajectory.

Tumor purity, infiltrating immune cells and microenvironment

Epigenomic features have far more advantages than genomic features for the computational microdissection of noncancerous cells in a sample. Since the cells of an individual contain almost identical genomic blueprints, genomic regions showing distinctive differences between tumoral and non-tumoral cells are restricted to particular regions with somatic alterations.

Although numerous successful methodologies based on such scarce genomic features have been developed, it can be complemented with epigenome-based algorithms exploiting the widespread difference of epigenomic marks between tumoral and non-tumoral cells, especially when compared with that of genomic features. Furthermore, it is known that the epigenomes of different cell types, such as immune cells at different hematopoietic lineage, are strikingly different. We hereafter broadly refer to the estimation of the composition of cell types comprising tumor microenvironment as `cell-type deconvolution'. In the following subsection, we discuss computational tumor purity estimation and cell-type deconvolution algorithms utilizing epigenomic features, mainly DNA methylation.

Computational problem formulation

Characterization of epigenetic subclones

There are a handful of tools that utilize somatic heritability of epialleles to seek the evidence of tumor subclones (Figure 2, left panel). The underlying assumption of these tools is that a subclone harbors (nearly) identical epialleles so that we can infer epigenetic subclonal structures based on them.

Figure 2

Epigenome-based approaches for TH characterization. Black and white circles denote methylated and unmethylated CpGs, respectively. The left panel illustrates the viewpoint of epiallele-driven approaches. An epiallele is defined as a set of consecutive binary methylation states that spans a short genomic stretch, which usually corresponds to a methylation pattern appearing in a single sequencing read. In this point of view, epialleles show dual characteristics at the same time: somatic heritability and stochasticity. Note that epialleles in subclonal cells are similar enough but not always identical. These patterns of methylation states can be obtained by various sequencing methods, and the frequencies of each pattern are used to analyze epigenetic subclones or measure the extent of epigenomic stochasticity in the tumor biopsy. The right panel depicts the viewpoint of epigenome-based approaches for tumor purity estimation and cell-type deconvolution. The basis of these approaches is the selection of informative CpG loci whose methylation level distinguishes particular cell types from the others. Using the methylation levels of informative CpG loci, the purity of the sample can be inferred by clustering-based approaches, and the composition of constituent cell types can be inferred by reference-based or reference-free methods.

Methclone [70] detects the compositional change of epialleles (epiallele shift) between two related samples (e.g. primary and recurrent tumor samples), in order to identify epigenomic subclonal evolution. Bayesian epiallele detection (BED) [71] is the first model to tolerate the mild stochasticity of epialleles by Bayesian inference. Given a locus, BED treats a true epiallele as a latent variable and assumes that an observed set of epialleles was generated from it according to a stochastic process involving the erroneous inheritance of methylation states. PRISM [72] is a more direct approach to deal with epigenetic subclones by applying the idea of SNV-based subclonal inference methods for epialleles. To simplify the approach, it only makes use of fully unmethylated and fully methylated epialleles as a `fingerprint' epialleles that distinctly mark a subclone, then the proportions of the fingerprints are clustered as in SNV-based methods.

Stochasticity measures for DNA methylation

As an initial attempt to standardize the quantification of mITH with sequencing data, Xie et al. [73] proposed an information-theoretic measure called methylation entropy (ME), which is calculated as the entropy of epialleles originating from a single genomic locus. In this section, let |$\mathcal{X} = \{\textbf{x}_i\}$| denote a set of unique epialleles appearing at particular genomic region. Also, let |$N(\textbf{x}_i)$| and |$P(\textbf{x}_i)$| denote the number and fraction of sequencing reads supporting epiallele |$\textbf{x}_i$|⁠.

$$\begin{align}& \textrm{ME} = -\frac{1}{|\mathcal{X}|}\sum\limits_{\textbf{x} \in \mathcal{X}} P(\textbf{x})\log_2{P(\textbf{x}).} \end{align}$$

(11)

Landan et al. [67] developed another measure named epipolymorphism (PM), which also captures the amount of heterogeneity of DNA methylation for a given genomic region.

$$\begin{align}& \textrm{PM} = 1 - \sum\limits_{\textbf{x}} {P(\textbf{x})^2}. \end{align}$$

(12)

The relationship between ME and PM is analogous to that between entropy and Gini index used for decision trees, and they can be considered as similar measures of mITH in general.

The importance of local disorder of DNA methylation was highlighted in CLL, and a new measure of mITH, proportion of discordant reads PDR, was introduced accordingly [74]. PDR is defined as a fraction of reads carrying CpGs in discordant methylation states (i.e. containing both methylated and unmethylated CpGs in a single read) with respect to all reads mapped to the region. Let |$\textbf{x}_u$| and |$\textbf{x}_m$| denote fully unmethylated and fully methylated epialleles, respectively, then PDR can be written as

$$\begin{align}& \textrm{PDR} = 1 - (P(\textbf{x}_u) + P(\textbf{x}_m)). \end{align}$$

(13)

A high PDR implies an admixture of cells with disordered DNA methylation states and is shown to be associated with adverse clinical outcomes, presumably through epigenetic diversification of subclones that finally leads to a higher adaptive potential of cancer.

Methylation haplotype load (MHL) is a metric measuring the co-methylation level within a genomic region [75]. It is first designed to identify genomic blocks with multiple CpG sites showing high concordance of methylation, called methylation haplotype blocks. MHL has unique characteristics compared with the other metrics in that it accounts for every sub-epialleles presented, as it is defined as a weighted average of the fraction of fully methylated epialleles of length |$i$| as below.

$$\begin{align}& \textrm{MHL} = \frac{\sum\limits_{i=1}^{l} w_i \times P(\textbf{x}_m^i)}{\sum\limits_{i=1}^{l} w_i}, \end{align}$$

(14)

where |$w_i$| is defined as a function of length |$i$| that governs how we weight the fractions of varying length and |$\textbf{x}_m^i$| denotes the fully methylated epiallele of length |$i$|⁠. For example, |$w_i = i$| or |$w_i = i^2$| are general options to give more weight to the concordance of longer epialleles.

Methylation combinatorial entropy (MCE) is defined as a natural logarithm of the number of possible configurations of given epialleles [70].

$$\begin{align}& \textrm{MCE} = \ln\frac{\Big(\sum\limits_{\textbf{x}} N(\textbf{x})\Big)!}{\prod\limits_{\textbf{x}} N(\textbf{x})!}. \end{align}$$

(15)

For example, when all the epialleles are the same, MCE is 0 since there is only one possible permutation of epialleles. On the contrary, when there exist all possible epialleles with same amount, MCE becomes maximal.

DNA methylation-based tumor purity estimation

The main idea of DNA methylation-based tumor purity estimation is to determine `informative' CpG loci that distinguish cancer cells from adjacent normal cells (Figure 2, right panel). The simplest measure is called leukocyte unmethylation for purity (LUMP), which is computed using the average methylation level of predefined 44 CpG sites that are known a priori to be consistently unmethylated in leukocytes [76]. Other ways to select informative CpG sites is to carry out statistical decision on each CpG locus to check if cancer and normal samples have a significant and stable difference of DNA methylation in that locus, based on Wilcoxon rank-sum test [77], AUC [78] or Z-score [79]. Once informative CpG loci are chosen, downstream purity estimation is rather straightforward since the methylation levels of each informative locus directly indicate the mixing proportion of cancer and normal cells. A point estimate of tumor purity can be derived by simple averaging of mixing proportions [78], or by clustering them with Gaussian kernel density estimation [77] or beta mixture model [79].

Since the majority of the methods only focus on pointwise average methylation levels, the development of NGS-based methods that fully utilize the phasing information of DNA methylation on a single sequencing read should be encouraged. MethylPurify [80] is an excellent pioneering example that selects informative genomic bins based on the EM-based splitting of bisulfite sequencing reads.

DNA methylation-based cell-type deconvolution

Cell-type deconvolution algorithms account for the detailed composition of constituent normal cell types, such as diverse types of immune cells and cancer-associated fibroblasts (Figure 2, right panel). The algorithms can be divided into two groups: (1) reference-based methods utilizing a catalog of reference methylomes from sorted cells and (2) reference-free methods that decompose observed methylation profiles into a linear combination of putative reference methylation profiles in an unsupervised manner. One of the first reference-based methods was proposed by Houseman et al. [81], which involves a constrained projection of observed methylation profiles onto reference profiles to obtain the estimates of cell-type proportions. Another popular approach is called MethylCIBERSORT [82]. It selects informative CpG sites that distinguish cell types (as in tumor purity estimation) and feeds their methylation levels into CIBERSORT [83]. Reference-free methods widely adopt matrix factorization techniques, by which the methylation levels are decomposed into latent reference methylation profiles and corresponding cell type proportions with some constraints. For example, RefFreeCellMix [84] uses a simple form of constraint which enforces methylation levels to lie between 0 and 1, and the sum of cell-type proportions to be less than 1. On the other hand, MeDeCom [85] adds a constraint that makes methylation levels close to 0 and 1.

Perspectives

Although the duality of epigenomic marks offers the most compelling possibilities for TH characterization, we still lack a clear conceptual background. It remains to be tested whether epigenetic subclones harboring somatically heritable epigenomic marks are prevalent throughout various cancer types. Meanwhile, investigating the stochasticity of epigenomic marks is not straightforward due to its probabilistic nature, which hinders the experimental reproducibility at the cellular level. Moreover, its quantification needs careful consideration since the variability vanishes and thus cannot be captured with simple `averaging-out' measures such as DNA methylation level. Nevertheless, cancer epigenomics is a vibrant field of research, and we are gradually being equipped with experimental and analytic methodologies, which will give us a clearer view of epigenetic TH in the near future.

Epigenomic features other than DNA methylation, such as histone modification, chromatin structure or even histone protein itself, may also constitute orthogonal layers of epigenomic TH. Notably, heterogeneous levels of the linker histone H1.0 were reported in several cancer types, where its deprivation was associated with increased self-renewal of tumorigenic cells [86].

Characterizing epigenomic TH is crucial as it serves as a glue that connects genomic TH and transcriptomic TH. We may discover hints of the mechanistic link between genomic and epigenomic TH by investigating the association of mutated epigenetic regulators and aberrant epigenomic marks. Similarly, the relationship between transcriptomic and epigenomic TH may be revealed by investigating the heterogeneity of the epigenomic landscape of regulatory regions.

Transcriptome-based tumor heterogeneity

Underlying biological features

Tumor heterogeneity, transcriptome and marker genes

Since the gene expression profile (GEP) of a tumor can be readily characterized by well-established platforms such as microarray and RNA-seq, it has been extensively studied for its potential as a biomarker for diseases. One successful application of transcriptomic biomarker is PAM50 signatures [87], a breast cancer subtype classification system based on the expression levels of 50 genes. Interestingly, the subtypes solely based on those genes are associated with prognosis, making it possible to design therapeutic regimens. However, selecting an optimal set of genes is not always straightforward.

What we observe from bulk cell sequencing is a crude admixture of reads from heterogeneous cell population [88], thereby it can be viewed as a blind source separation problem to seek for the independent effect of constituent cell types [89–91]. Depending on experimental design and sample quality, these compound signals may ruin the hypothesis testing or obscure possible novel findings. Unlike genomic and epigenomic features, transcriptomic features are basically continuous and it complicates the signal deconvolution using bulk transcriptome data. Nevertheless, utilizing marker genes would be an intuitive approach for the estimation of the composition of the non-tumoral part of a tumor biopsy [92].

In early studies, traditional markers for immunohistochemistry or reverse transcriptase-polymerase chain reaction were widely used. Recently, GEPs of purified cells are rapidly accumulating in public databases, which accelerate the discovery of marker genes and the development of algorithms utilizing them by various statistical approaches [93–95]. Although marker gene-based cell-type deconvolution methods can be easily applied with simple statistical approaches, their limitations are clear; the performance of such algorithm is limited by our prior knowledge.

Protein–protein interaction networks and pathways

Well-coordinated interactions between proteins are required to maintain normal cellular function. Efforts in identifying interaction partners of proteins in a high-throughput manner [96, 97] led to the compilation of valuable databases of protein–protein interaction networks [98, 99]. In bioinformatic analyses, these template networks are often used to define functional modules, which represent sets of proteins interacting together [100, 101]. Similarly, biological pathways denote the cellular mechanisms to convey the intra/extracellular information through the interactions of proteins to elicit an appropriate response of a cell upon various stimuli. To obtain a more direct functional portrait of a cell, several methods are proposed to infer the activity of biological pathways by aggregating individual gene expression values [102]. One advantage of handling biological pathways or modules is that their measured activity is relatively robust compared to the individual expression levels of its components. In terms of TH, we can also conceive the heterogeneity at levels of biological modules or pathways, which can be measured by information-theoretic approaches discussed in the next section.

Computational problem formulation

Purity estimation and cell-type deconvolution

Transcriptome-based purity estimation and computational cell-type deconvolution are advanced fields of TH studies (Figure 3, left panel; Table 3). Accordingly, there have been lots of methodological reviews and cross-comparison benchmark studies [116–120], including recent evaluation with single cell-based simulation [121]. In this section, we provide a brief overview of relevant computational methods.

Table 3

Transcriptome-based tools for tumor purity estimation and microenvironmental cell-type deconvolution

Name	Year	Application\|$^{*}$\|	Type\|$^{\dagger }$\|	Description\|$^{\ddagger }$\|	Implementation	Ref
PERT	2012	CTD	P	Nonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPs	Octave	[103]
ISOpure	2013	TPE	P	Models normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.	MATLAB, R	[104, 105]
DeMix	2013	TPE	P	Log-normal mixture model for tumor purity estimation using array-based expression data	C, R	[106]
ESTIMATE	2013	TPE	S	Computes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genes	R	[107]
UNDO	2015	TPE	O	Linear latent variable model to estimate tumor-stroma proportions	R	[108]
CIBERSORT	2015	CTD	R	\|$\nu $\|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypes	Java, R, Web	[83, 109]
contamDE	2016	TPE	P	Employs negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.	R	[110]
TIMER	2016	CTD	R	Constrained least squares regression to infer the abundance of six immune cell types	R, Web	[111]
MCP-counter	2016	CTD	S	Log2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populations	R	[112]
EPIC	2017	CTD	R	Constrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.	R	[113]
xCell	2017	CTD	S	Average of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scores	R	[114]
quanTIseq	2019	CTD	R	Constrained least squares regression with nonnegative and less-than-one constraints for cell type proportion	R	[115]
CIBERSORTx	2019	CTD	R	Extends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq data	Java, R, Web	[109]

Name	Year	Application\|$^{*}$\|	Type\|$^{\dagger }$\|	Description\|$^{\ddagger }$\|	Implementation	Ref
PERT	2012	CTD	P	Nonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPs	Octave	[103]
ISOpure	2013	TPE	P	Models normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.	MATLAB, R	[104, 105]
DeMix	2013	TPE	P	Log-normal mixture model for tumor purity estimation using array-based expression data	C, R	[106]
ESTIMATE	2013	TPE	S	Computes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genes	R	[107]
UNDO	2015	TPE	O	Linear latent variable model to estimate tumor-stroma proportions	R	[108]
CIBERSORT	2015	CTD	R	\|$\nu $\|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypes	Java, R, Web	[83, 109]
contamDE	2016	TPE	P	Employs negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.	R	[110]
TIMER	2016	CTD	R	Constrained least squares regression to infer the abundance of six immune cell types	R, Web	[111]
MCP-counter	2016	CTD	S	Log2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populations	R	[112]
EPIC	2017	CTD	R	Constrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.	R	[113]
xCell	2017	CTD	S	Average of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scores	R	[114]
quanTIseq	2019	CTD	R	Constrained least squares regression with nonnegative and less-than-one constraints for cell type proportion	R	[115]
CIBERSORTx	2019	CTD	R	Extends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq data	Java, R, Web	[109]

Abbreviations.|$^{*}$|CTD, cell-type deconvolution; TPE, tumor purity estimation. |$^{\dagger }$|P, probabilistic approach; R, regression-based approach; S, signature score-based approach; O, other. |$^{\ddagger }$|GEP, gene expression profile; CCLE, Cancer Cell Line Encyclopedia.

Table 3

Open in new tab Download slide

Transcriptome-based tools for tumor purity estimation and microenvironmental cell-type deconvolution

Name	Year	Application\|$^{*}$\|	Type\|$^{\dagger }$\|	Description\|$^{\ddagger }$\|	Implementation	Ref
PERT	2012	CTD	P	Nonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPs	Octave	[103]
ISOpure	2013	TPE	P	Models normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.	MATLAB, R	[104, 105]
DeMix	2013	TPE	P	Log-normal mixture model for tumor purity estimation using array-based expression data	C, R	[106]
ESTIMATE	2013	TPE	S	Computes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genes	R	[107]
UNDO	2015	TPE	O	Linear latent variable model to estimate tumor-stroma proportions	R	[108]
CIBERSORT	2015	CTD	R	\|$\nu $\|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypes	Java, R, Web	[83, 109]
contamDE	2016	TPE	P	Employs negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.	R	[110]
TIMER	2016	CTD	R	Constrained least squares regression to infer the abundance of six immune cell types	R, Web	[111]
MCP-counter	2016	CTD	S	Log2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populations	R	[112]
EPIC	2017	CTD	R	Constrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.	R	[113]
xCell	2017	CTD	S	Average of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scores	R	[114]
quanTIseq	2019	CTD	R	Constrained least squares regression with nonnegative and less-than-one constraints for cell type proportion	R	[115]
CIBERSORTx	2019	CTD	R	Extends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq data	Java, R, Web	[109]

Name	Year	Application\|$^{*}$\|	Type\|$^{\dagger }$\|	Description\|$^{\ddagger }$\|	Implementation	Ref
PERT	2012	CTD	P	Nonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPs	Octave	[103]
ISOpure	2013	TPE	P	Models normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.	MATLAB, R	[104, 105]
DeMix	2013	TPE	P	Log-normal mixture model for tumor purity estimation using array-based expression data	C, R	[106]
ESTIMATE	2013	TPE	S	Computes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genes	R	[107]
UNDO	2015	TPE	O	Linear latent variable model to estimate tumor-stroma proportions	R	[108]
CIBERSORT	2015	CTD	R	\|$\nu $\|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypes	Java, R, Web	[83, 109]
contamDE	2016	TPE	P	Employs negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.	R	[110]
TIMER	2016	CTD	R	Constrained least squares regression to infer the abundance of six immune cell types	R, Web	[111]
MCP-counter	2016	CTD	S	Log2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populations	R	[112]
EPIC	2017	CTD	R	Constrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.	R	[113]
xCell	2017	CTD	S	Average of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scores	R	[114]
quanTIseq	2019	CTD	R	Constrained least squares regression with nonnegative and less-than-one constraints for cell type proportion	R	[115]
CIBERSORTx	2019	CTD	R	Extends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq data	Java, R, Web	[109]

Abbreviations.|$^{*}$|CTD, cell-type deconvolution; TPE, tumor purity estimation. |$^{\dagger }$|P, probabilistic approach; R, regression-based approach; S, signature score-based approach; O, other. |$^{\ddagger }$|GEP, gene expression profile; CCLE, Cancer Cell Line Encyclopedia.

Figure 3

Transcriptome-based approaches for tumor heterogeneity characterization. The left panel illustrates the viewpoint of transcriptome-based computational deconvolution algorithms. The transcriptome data we observe is a mixture of the individual GEP of cells. Note that the expression levels of cell-type marker genes are represented by red bars in tumor microenvironmental cells. One approach to estimate the abundance of a particular cell type is to compute a signature score based on the marker genes exclusively expressed in such cells. Regression-based methods exploit reference GEPs to estimate the weight of linear combination across different cell types. Probabilistic approaches assume the expression levels are probabilistically distributed around reference GEPs and infer reference profiles and cell type abundances. The right panel illustrates the viewpoint of information-theoretic approaches. Each node of the network inside the cells denote each gene, and the redder circle represent the higher expression of the gene. These approaches measure the diversity of active signaling paths (paths with red circles) throughout the population of cells based on the protein–protein interaction network.

Signature score-based approaches To get a signature score for a cell type, these approaches aggregate the expression levels of corresponding marker genes. MCP-counter [112] simply uses geometric mean of marker gene expressions as the signature score for each cell type, while ESTIMATE [107] and xCell [114] utilize ssGSEA [122] scores to measure the collective enrichment of marker genes.

Regression-based approaches Given reference expression profiles of |$N_c$| constituent cell types, a bulk transcriptome profile |$\textbf{y}$| with |$N_g$| genes can be expressed as a linear combination of reference profiles as below:

$$\begin{align}& \textbf{y} = \textbf{C} \times \textbf{p}, \end{align}$$

(16)

where |$\textbf{C}$| is |$N_g \times N_c$| matrix of reference profile and |$\textbf{p}$| denotes cell type proportions. Here, |$\textbf{p}$| can be estimated through either of least squares optimization [111, 113, 115] or |$\nu $|-support vector regression [83, 109]. For the least squares methods, whose goal is to minimize |$||\textbf{y} - \textbf{C} \times \hat{\textbf{p}}||^2$|⁠, it is generally assumed that the elements of |$\hat{\textbf{p}}$| are nonnegative and also sum to one as they denote the proportions of cell types.

Probabilistic approaches Unlike regression-based approaches, these approaches explicitly deal with the likelihood of the probabilistic models for observed expression values or read counts and let the inference framework jointly infer the expression profiles and cell-type proportions. For example, DeMix and DeMixT formulate the observed array-based expression as a linear combination of log-normal random variables and conducts iterative optimization [106, 123]. Moreover, there have been methods that account for the perturbation of reference expression profiles arising from various microenvironmental factors. ISOpure posits an expression profile of normal cells as a linear combination of reference expression profiles of normal cells [104]. On the other hand, PERT introduces a perturbation vector, and it is multiplied to the reference profile [103].

Information-theoretic approaches

Aggregation of individual gene expression levels using pathway or network information has been widely applied for a robust systematic analysis of transcriptome throughout the past decades [100]. Interestingly, increased entropy at a level of signaling pathway was suggested to have arisen from the admixture of the heterogeneous and undifferentiated cell population, and it was shown to be a marker for cancer metastasis or drug sensitivity [124, 125]. Such measure of uncertainties in pathways, called `signaling entropy', is formulated as below (Figure 3, right panel). Given a well-established PPI network, an edge weight |$w_{ij}$| between gene |$i$| and |$j$| is defined as the product of (normalized) expression values of both genes according to the mass action principle [126]. The weight matrix |$W$| comprising values of |$w_{ij}$| is subsequently row-normalized to produce stochastic matrix |$P$|⁠. Let |$\pi $| the stationary distribution of |$P$| (i.e. |$\pi = P\pi $|⁠). Then, signaling entropy |$S$| is defined as

$$\begin{align}& S = \sum\limits_{i=1}^{n}\Bigg(\pi_i\sum\limits_{k \in \mathcal{N}_i}p_{ik}\log{p_{ik}}\Bigg), \end{align}$$

(17)

where |$\mathcal{N}_i$| denotes the neighbors of gene |$i$|⁠. By definition, signaling entropy is a transcriptome-level measure of the average promiscuity of PPI, where proteins at the core of the network are overweighed.

A similar measure, transcriptome-based ITH (tITH), captures the transcriptomic TH on a relative basis compared with normal samples [127]. Given stochastic matrices |$P^{T}$| and |$P^{N}$| for tumor and normal samples, respectively, network-based Jensen–Shannon divergence (nJSD) is defined as below.

$$\begin{align}& nJSD(N, T) = \frac{1}{n} \sum\limits_{i = 1}^{n} JSD(P^{N}_i, P^{T}_i), \end{align}$$

(18)

where |$JSD$| denotes conventional Jensen–Shannon divergence. Suppose a state |$A$| with a maximally ambiguous transcriptome state where the expressions of all genes are the same. Accordingly, tITH is computed as the relative distance between normal and tumor samples.

$$\begin{align}& tITH = \frac{nJSD(N, T)}{nJSD(N, T) + nJSD(T, A)}. \end{align}$$

(19)

Apart from the expression levels of genes, their alternative splicing patterns are also heterogeneous across cells [128]. In this regard, the concept of spliceome ITH (sITH) was proposed recently [129]. In brief, sITH captures the difference of uncertainties in splice junction usage with JSD, and the pan-cancer analysis revealed that increased sITH was shown to be associated with cancer progression and unfavorable outcomes.

Perspectives

Since the transcriptome serves as the most immediate and comprehensive surrogate of the functionality of a cell, characterizing transcriptomic TH is crucial to draw clear functional consequences of TH. However, the fundamental limitation of transcriptomic data is that gene expression levels are not discrete by nature, unlike genomic or epigenomic features. To separate the observed continuum of gene expressions into latent constituent distributions, one promising approach is to exploit information of upstream regulatory mechanisms. We might be able to use relevant genomic and epigenomic features to build a prior distribution of expression levels of genes, even though we still lack a systematic understanding of the gene regulatory mechanism. Conversely, it may be feasible to filter out nonfunctional alterations by integrating transcriptome-based TH characterization with other omics-based methods, facilitating the clear explanation of the driver event of the subclones.

Applications of single-cell sequencing

Although numerous approaches based on bulk sequencing data have been proposed, it is worth noting that the most straightforward way to investigate TH is separate molecular characterization of every single cell in a tumor. Understandably, single-cell DNA sequencing have showed its wide application in characterization of clonal diversity and evolution [130, 131] for various cancer types including breast cancer [132], clear cell renal carcinoma [133] and leukemia [134, 135]. Relevant computational tools include Single Cell Genotyper [136], SCITE [137], OncoNEM [138], SiFit [139], B-SCITE [140], SiCloneFit [141] and PhISCS [142]. Single-cell RNA sequencing has also been increasingly used. Since the pioneering study on primary glioblastoma cells (without microenvironmental cells) that reported cell-to-cell variability of gene expression and splicing patterns [143], the field has been broadened to characterize interaction among cells constituting microenvironment of metastatic melanoma [144], or to suggest higher-resolution subtype classifications of colorectal cancer using single cell-based transcriptomic profiles [145]. Meanwhile, some of the recent studies have been utilizing methods for single-cell trajectory inference for TH characterization. For example, Monocle [146, 147] was used to reveal the evolutionary trajectory in glioblastoma [148, 149], liver cancer [150], cutaneous T cell lymphoma [151] and uveal melanoma [152]. On the other hand, Wanderlust [153] was used to uncover the dynamics of single-cell transcriptomes in melanoma [154] over the pseudotime dimension.

Single-cell approaches still have critical technical limitations such as allelic dropout or doublets, which result in missing values in genotype calls and thus hamper robust analyses of TH. One promising approach to deal with these experimental artifacts is the integration of single-cell and bulk sequencing data. For example, ddClone [36] infers subclonal structures by a typical SNV-based scheme, but the clustering of VAF values is informed with single-cell genotypes. B-SCITE [140] and PhISCS [142] are subclonal phylogeny-aware methods that also combine the single-cell and bulk sequencing methods. Moreover, such integrative approaches are also effective for transcriptome-based cell-type deconvolution [155–157].

Recently, intriguing observations are emerging along with the technological advances that enabled the simultaneous measurement of multiple omics from a single cell. For example, G&T-seq [158] and TARGET-seq [159] revealed that the genomic and transcriptomic TH were tightly linked within a cell. In addition, joint profiling of single-cell transcriptome and chromatin states with scCAT-seq enabled the characterization of cellular subpopulations with different regulatory states [160]. Collectively, these findings recapitulate the necessity of integrative multi-omics characterization of TH. Currently, the single-cell multimodal sequencing methods are inadequate to replace bulk sequencing in terms of scalability and completeness of the measurements. Therefore, based on the conceptual background established by single-cell multi-omics studies, it is essential to implement multi-omics integrative methods based on bulk data.

Towards multi-omics and machine learning approaches for tumor heterogeneity chracterization

Integrative analysis of parallel evolution of genomes and epigenomes in cancer

The striking similarities and differences between genetic evolution (inferred from SNV or SCNA) and epigenetic evolution (inferred from DNA methylation) have offered an exciting avenue for TH studies: taking advantage of the complementarity between the two omics layers. The similarity between phylogenetic and phyloepigenetic trees have been reported in a wide variety of cancer types [161–166], opening up the possibility of an integrative analysis of phylogenetic and phyloepigenetic evolutionary histories. However, even though it has been shown that the epigenomic-genomic co-evolution [167] occurs, it remains questionable whether the co-evolution is ubiquitous across cancer types. Interestingly, in AML, the two layers of heterogeneity were found to be not necessarily linked to each other [168]. It implied that the genetic and epigenetic evolution in cancer might occur independently in some cases, yet not denying the co-evolution of those two omics layers in other cases. The results collectively imply that the two omics layers evolve `in parallel', which in turn can show convergent or distinct evolutionary histories depending on cancer types.

To our knowledge, computational methods utilizing the complementarity of epigenetic and genetic cancer evolution are not yet developed. However, since high-throughput data for cancer genomes and DNA methylomes have been rapidly accumulating, it would be an attractive field of research to conduct a joint inference of phylogenetic and phyloepigenetic trees. Although the details must be elaborated in further researches, it may be implemented by iterative and alternative inferences of phylogenetic and phyloepigenetic trees while constraining one tree by the other tree based on the candidate subclone that is likely to be appearing in both of the trees.

Integrating epigenomes and transcriptomes for cell-type deconvolution

We have already discussed various approaches using epigenome or transcriptome to infer the proportions of diverse cell types comprising tumor microenvironment. Despite those successful applications of individual-omics approaches, we still have room to improve the accuracy and robustness through integrating multi-omics molecular profiles. EDec [169] is a cell-type deconvolution approach that combines transcriptome and DNA methylome. It first selects a set of informative CpG loci with a catalog of reference methylomes of known cell types and infers underlying cell-type proportions. Subsequently, cell-type-specific transcriptome profiles are inferred using the proportions, which enables differential expression analysis done for each cell type separately.

ML approaches and applications

Most TH characterization approaches discussed above are based on explicit statistical modeling and manual feature selection, and their performances largely depend on our prior knowledge. Fortunately, as many large-scale studies of TH have been conducted by various international consortia [170, 171], we are gradually reaching a sufficient amount of relevant data to train a model to automatically learn the common characteristics of TH. In this section, we provide some insights on several possible approaches and applications of ML frameworks for future TH studies and introduce relevant works.

Data-driven feature extraction Feature extraction is the core principle behind how a machine learns and generalizes, where only the features relevant to the task are extracted from the noisy data. Clinical molecular profile data suffer from extreme noise arising from biological and technical variability. However, as the number of data grows, there will be a greater chance that an ML model captures biologically meaningful and common modules of TH, hopefully suggesting novel insights about cancer evolution. Recent approaches called REVOLVER [172] and Hintra [39] adopt the transfer learning paradigm to leverage the shared characteristics across a cohort of tumor samples. Both algorithms jointly optimize multiple tumor phylogenies across patients to capture repeated evolutionary trajectories. By penalizing unusual orderings of driver events that do not correlate well with other orderings in the cohort, the model is guided to find the recurrent evolutionary orderings. Since the method can extract recurrent snippets of driver events in a particular order, it allows us to dissect and arrange driver events in the correct order even if they were ambiguously intermingled inside a single subclone and could not be segregated by an inference with a single sample alone. This application of ML for fine-grained phylogenetic reconstruction of driver events is comparable to a super-resolution in the field of computer vision in a sense that the model extracts recurring patterns from a collection of data and exploit the patterns to get a higher-resolution view of a sample.

Ensemble learning Ensemble learning is an ML technique that improves the performance and generalization of models by combining multiple models trained to capture different aspects of data. As many different statistical approaches have been proposed for various TH characterization tasks, it will be interesting to see if ensembling those approaches improves the performance. Pioneering examples of ensemble learning methods include consensus purity estimation, defined as a median of normalized purity estimates from ESTIMATE, ABSOLUTE, LUMP and immunohistochemistry [76], and RF_Purify, a meta-learning approach that trains a random forest model predicting ABSOLUTE and ESTIMATE tumor purity estimates [173].

Multi-omics TH as a covariate We expect that information about TH extracted from multi-omics data (e.g. tumor purity, number of subclones, evolutionary histories or microenvironment composition) may serve as a novel and robust features for various clinical tasks such as cancer subtyping, prognosis prediction and optimal therapy design. Conventional molecular subtypes [87, 174] are usually defined using individual-omics measurements of bulk tumor samples, and thus the robustness of subtyping is predominantly confounded by the composition of cells. Moreover, it has been reported that the characteristics of cancer subtypes can be determined at a multi-omics level [175], which implies the existence of intrinsic subtypes that might have been neglected by individual-omics subtypes. Incorporating extracted multi-omics TH as a covariate in such tasks will provide more precise and robust results in clinical cancer researches. For example, conventional marker gene-based subtyping strategies may be refined to be dynamically adjusted according to the microenvironmental cell-type composition of the biopsy.

Conclusion and perspective

A comprehensive understanding of multi-omics TH will lead to a more precise therapeutic strategy that precisely targets every subclone at the same time. Furthermore, computationally deconvolving the mixed effects of various cancer-associated molecular aberrations may facilitate the experimental reproducibility and unveil hidden association between an aberration and cancer. Due to the complexity and large size of the molecular data, this problem can only be tackled by bioinformatic approaches.

Development of computational methods is possible only with a solid conceptual basis of multi-omics TH, and thus we need to continuously seek for the experimental evidence in parallel with algorithmic advancement. Fortunately, thanks to the development of single-cell sequencing technologies, we are gaining confidence in the existence of interplay between multiple omics layers that comprise TH [176]. While single-cell approaches have shown their great potential for exploratory analyses, we still need to establish systematic strategies based on bulk-cell multi-omics data for scalable and clinically applicable practice of cancer therapies. As we have discussed above, ideas proposed by numerous methods exploiting individual-omics layers may be good starting points for the integrative methods.

One another way to enhance our understanding of multi-omics TH is to employ data-driven ML approaches. Conventional TH characterization is usually sample-specific, which does not account for common characteristics of each cancer type. By utilizing accumulating molecular profiles of tumor samples, novel ML approaches may demonstrate the underlying patterns and similarities of TH within each cancer type, leading to a significant breakthrough in cancer therapeutics.

Key Points

A tumor sample consists of heterogeneous cellular populations, including diverse subclonal cancer cells, stromal cells and infiltrating immune cells. This TH harbors significant clinical potential. Although TH often exists across multi-omics layers, most computational approaches for TH characterization only rely on individual omics data.
To facilitate the development of multi-omics integrative methods for TH characterization, we provide a comprehensive review of existing computational methods to characterize TH based on genome, epigenome and transcriptome. Furthermore, we introduce some pioneering multi-omics integrative methods.
Multi-omics molecular profiles of cancers are rapidly accumulating. Therefore, we suggest some potential applications of data-driven ML approaches for TH characterization.

Funding

Collaborative Genome Program for Fostering New Post-Genome Industry of the National Research Foundation (NRF) funded by the Ministry of Science and ICT (NRF-2014M3C9A3063541); Next-Generation Information Computing Development Program through the NRF funded by the Ministry of Science, ICT (NRF-2017M3C4A7065887); Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224); Bio & Medical Technology Development Program of the NRF (NRF-2019M3E5D4065965).

Dohoon Lee is a PhD student of Interdisciplinary Program in Bioinformatics at Seoul National University. His research interests include tumor heterogeneity, computational epigenomics and machine learning.

Youngjune Park is a PhD student of Interdisciplinary Program in Bioinformatics at Seoul National University. His research interests include tumor evolution and machine learning.

Sun Kim is a professor of the department of computer science and engineering at Seoul National University. His research group focuses on bioinformatics and machine learning.

References

1.

Jamal-Hanjani

M

,

Quezada

SA

,

Larkin

J

, et al.

Translational implications of tumor heterogeneity

.

Clin Cancer Res

2015

;

21

(

6

):

1258

–

66

.

2.

Joung

JG

,

Oh

BY

,

Hong

HK

, et al.

Tumor heterogeneity predicts metastatic potential in colorectal cancer

.

Clin Cancer Res

2017

;

23

(

23

):

7209

–

16

.

3.

Dagogo-Jack

I

,

Shaw

AT

.

Tumour heterogeneity and resistance to cancer therapies

.

Nat Rev Clin Oncol

2018

;

15

(

2

):

81

–

94

.

4.

Nowell

P

.

The clonal evolution of tumor cell populations

.

Science

1976

;

194

(

4260

):

23

–

8

.

5.

Greaves

M

,

Maley

CC

.

Clonal evolution in cancer

.

Nature

2012

;

481

(

7381

):

306

–

13

.

6.

Brady

SW

,

McQuerry

JA

,

Qiao

Y

, et al.

Combating subclonal evolution of resistant cancer phenotypes

.

Nat Commun

2017

;

8

(

1

):

1231

.

7.

Mroz

EA

,

Rocco

JW

.

MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma

.

Oral Oncol

2013

;

49

(

3

):

211

–

5

.

8.

Hou

Y

,

Li

T

,

Gan

W

, et al.

Prognostic significance of mutant-allele tumor heterogeneity in uterine corpus endometrial carcinoma

.

Ann Transl Med

2020

;

8

(

6

):

339

–

9

.

9.

Rajput

A

,

Bocklage

T

,

Greenbaum

A

, et al.

Mutant-allele tumor heterogeneity scores correlate with risk of metastases in colon cancer

.

Clin Colorectal Cancer

2017

;

16

(

3

):

e165

–

70

.

10.

McGranahan

N

,

Furness

AJ

,

Rosenthal

R

, et al.

Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade

.

Science

2016

;

351

(

6280

):

1463

–

9

.

11.

Bozic

I

,

Antal

T

,

Ohtsuki

H

, et al.

Accumulation of driver and passenger mutations during tumor progression

.

Proc Natl Acad Sci U S A

2010

;

107

(

43

):

18545

–

50

.

12.

Vogelstein

B

,

Papadopoulos

N

,

Velculescu

VE

, et al.

Cancer genome landscapes

.

Science

2013

;

339

(

6127

):

1546

–

58

.

13.

Davies

JJ

,

Wilson

IM

,

Lam

WL

.

Array CGH technologies and their applications to cancer genomes

.

Chromosome Res

2005

;

13

(

3

):

237

–

48

.

14.

Pinkel

D

,

Albertson

DG

.

Array comparative genomic hybridization and its applications in cancer

.

Nat Genet

2005

;

37

(

6S

):

S11

–

7

.

15.

Huang

J

,

Wei

W

,

Zhang

J

, et al.

Whole genome DNA copy number changes identified by high density oligonucleotide arrays

.

Hum Genomics

2004

;

1

(

4

):

287

–

99

.

16.

Zhao

X

,

Li

C

,

Paez

JG

, et al.

An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays

.

Cancer Res

2004

;

64

(

9

):

3060

–

71

.

17.

Van Loo

P

,

Nordgard

SH

,

Lingjærde

OC

, et al.

Allele-specific copy number analysis of tumors

.

Proc Natl Acad Sci U S A

2010

;

107

(

39

):

16910

–

5

.

18.

Carter

SL

,

Cibulskis

K

,

Helman

E

, et al.

Absolute quantification of somatic DNA alterations in human cancer

.

Nat Biotechnol

2012

;

30

(

5

):

413

–

21

.

19.

Zack

TI

,

Schumacher

SE

,

Carter

SL

, et al.

Pan-cancer patterns of somatic copy number alteration

.

Nat Genet

2013

;

45

(

10

):

1134

–

40

.

20.

Davis

A

,

Gao

R

,

Navin

N

.

Tumor evolution: linear, branching, neutral or punctuated?

Biochim Biophys Acta Rev Cancer

2017

;

1867

(

2

):

151

–

61

.

21.

Larson

NB

,

Fridley

BL

.

PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data

.

Bioinformatics

2013

;

29

(

15

):

1888

–

9

.

22.

Strino

F

,

Parisi

F

,

Micsinai

M

, et al.

TrAp: a tree approach for fingerprinting subclonal tumor composition

.

Nucleic Acids Res

2013

;

41

(

17

):

e165

–

5

.

23.

Roth

A

,

Khattra

J

,

Yap

D

, et al.

PyClone: statistical inference of clonal population structure in cancer

.

Nat Methods

2014

;

11

(

4

):

396

–

8

.

24.

Miller

CA

,

White

BS

,

Dees

ND

, et al.

SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution

.

PLoS Comput Biol

2014

;

10

(

8

):

e1003665

.

25.

Jiao

W

,

Vembu

S

,

Deshwar

AG

, et al.

Inferring clonal evolution of tumors from single nucleotide somatic mutations

.

BMC Bioinformatics

2014

;

15

(

1

):

35

.

26.

Andor

N

,

Harness

JV

,

Müller

S

, et al.

Expands: expanding ploidy and allele frequency on nested subpopulations

.

Bioinformatics

2014

;

30

(

1

):

50

–

60

.

27.

Ha

G

,

Roth

A

,

Khattra

J

, et al.

TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data

.

Genome Res

2014

;

24

(

11

):

1881

–

93

.

28.

Zare

H

,

Wang

J

,

Hu

A

, et al.

Inferring clonal composition from multiple sections of a breast cancer

.

PLoS Comput Biol

2014

;

10

(

7

):

e1003703

.

29.

Hajirasouliha

I

,

Mahmoody

A

,

Raphael

BJ

.

A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data

.

Bioinformatics

2014

;

30

(

12

):

78

–

86

.

10.1142/9789814644730_0044

30.

Deshwar

AG

,

Vembu

S

,

Yung

CK

, et al.

PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors

.

Genome Biol

2015

;

16

(

1

):

1

–

20

.

31.

El-Kebir

M

,

Oesper

L

,

Acheson-Field

H

, et al.

Reconstruction of clonal trees and tumor composition from multi-sample sequencing data

.

Bioinformatics

2015

;

31

(

12

):

i62

–

70

.

32.

Popic

V

,

Salari

R

,

Hajirasouliha

I

, et al.

Fast and scalable inference of multi-sample cancer lineages

.

Genome Biol

2015

;

16

(

1

):

91

.

33.

Sengupta

S

,

Wang

J

,

Lee

J

, et al.

Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data

.

Pac Symp Biocomput

2015

;

20

:

467

–

78

. doi:

.

34.

Marass

F

,

Mouliere

F

,

Yuan

K

, et al.

A phylogenetic latent feature model for clonal deconvolution

.

Ann Appl Stat

2016

;

10

(

4

):

2377

–

404

.

35.

Jiang

Y

,

Qiu

Y

,

Minn

AJ

, et al.

Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing

.

Proc Natl Acad Sci U S A

2016

;

113

(

37

):

E5528

–

37

.

36.

Salehi

S

,

Steif

A

,

Roth

A

, et al.

ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data

.

Genome Biol

2017

;

18

(

1

):

1

–

18

.

37.

Dang

H

,

White

B

,

Foltz

S

, et al.

ClonEvol: clonal ordering and visualization in cancer sequencing

.

Ann Oncol

2017

;

28

(

12

):

3076

–

82

.

38.

Li

Y

,

Xie

X

.

MixClone: a mixture model for inferring tumor subclonal populations

.

BMC Genomics

2015

;

16

(

Suppl 2

):

1

–

9

.

39.

Khakabimamaghani

S

,

Malikic

S

,

Tang

J

, et al.

Collaborative intra-tumor heterogeneity detection

.

Bioinformatics

2019

;

35

(

14

):

i379

–

88

.

40.

Oesper

L

,

Satas

G

,

Raphael

BJ

.

Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data

.

Bioinformatics

2014

;

30

(

24

):

3532

–

40

.

41.

El-Kebir

M

,

Satas

G

,

Oesper

L

, et al.

Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures

.

Cell Syst

2016

;

3

:

43

–

53

.

42.

Nik-Zainal

S

,

Van Loo

P

,

Wedge

DC

, et al.

The life history of 21 breast cancers

.

Cell

2012

;

149

(

5

):

994

–

1007

.

43.

Yau

C

,

Mouradov

D

,

Jorissen

RN

, et al.

A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data

.

Genome Biol

2010

;

11

(

9

):

R92

.

44.

Ha

G

,

Roth

A

,

Lai

D

, et al.

Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer

.

Genome Res

2012

;

22

(

10

):

1995

–

2007

.

45.

Gusnanto

A

,

Wood

HM

,

Pawitan

Y

, et al.

Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data

.

Bioinformatics

2012

;

28

(

1

):

40

–

7

.

46.

Yu

Z

,

Liu

Y

,

Shen

Y

, et al.

CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data

.

Bioinformatics

2014

;

30

(

18

):

2576

–

83

.

47.

Li

Y

,

Xie

X

.

Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity

.

Bioinformatics

2014

;

30

(

15

):

2121

–

9

.

48.

Bao

L

,

Pu

M

,

Messer

K

.

AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data

.

Bioinformatics

2014

;

30

(

8

):

1056

–

63

.

49.

Favero

F

,

Joshi

T

,

Marquard

A

, et al.

Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data

.

Ann Oncol

2015

;

26

(

1

):

64

–

70

.

50.

Shen

R

,

Seshan

VE

.

FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing

.

Nucleic Acids Res

2016

;

44

(

16

):

e131

–

1

.

51.

Yu

Z

,

Li

A

,

Wang

M

.

CLImAT-HET: detecting subclonal copy number alterations and loss of heterozygosity in heterogeneous tumor samples from whole-genome sequencing data

.

BMC Med Genomics

2017

;

10

(

1

):

15

.

52.

Adalsteinsson

VA

,

Ha

G

,

Freeman

SS

, et al.

Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors

.

Nat Commun

2017

;

8

(

1

):

1324

.

53.

Cun

Y

,

Yang

TP

,

Achter

V

, et al.

Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust

.

Nat Protoc

2018

;

13

(

6

):

1488

–

501

.

54.

Poell

JB

,

Mendeville

M

,

Sie

D

, et al.

ACE: absolute copy number estimation from low-coverage whole-genome sequencing data

.

Bioinformatics

2019

;

35

(

16

):

2847

–

9

.

55.

Mayrhofer

M

,

DiLorenzo

S

,

Isaksson

A

.

Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue

.

Genome Biol

2013

;

14

(

3

):

R24

.

56.

Noorbakhsh

J

,

Kim

H

,

Namburi

S

, et al.

Distribution-based measures of tumor heterogeneity are sensitive to mutation calling and lack strong clinical predictive power

.

Sci Rep

2018

;

8

(

1

):

1

–

12

.

57.

Fraga

MF

,

Ballestar

E

,

Villar-Garea

A

, et al.

Loss of acetylation at Lys16 and trimethylation at Lys20 of histone H4 is a common hallmark of human cancer

.

Nat Genet

2005

;

37

(

4

):

391

–

400

.

58.

Jones

PA

,

Laird

PW

.

Cancer epigenetics comes of age

.

Nat Genet

1999

;

21

(

2

):

163

–

7

.

59.

Baylin

SB

,

Herman

JG

.

DNA hypermethylation in tumorigenesis: epigenetics joins genetics

.

Trends Genet

2000

;

16

(

4

):

168

–

74

.

60.

Esteller

M

,

Corn

PG

,

Baylin

SB

, et al.

A gene hypermethylation profile of human cancer

.

Cancer Res

2001

;

3

(

61

):

3225

–

9

.

61.

Jones

PA

,

Baylin

SB

.

The fundamental role of epigenetic events in cancer

.

Nat Rev Genet

2002

;

3

(

6

):

415

–

28

.

62.

Herman

JG

,

Baylin

SB

.

Gene silencing in cancer in association with promoter hypermethylation

.

N Engl J Med

2003

;

349

(

21

):

2042

–

54

.

63.

Feinberg

AP

,

Vogelstein

B

.

Hypomethylation distinguishes genes of some human cancers from their normal counterparts

.

Nature

1983

;

301

(

5895

):

89

–

92

.

64.

Feinberg

AP

,

Kuo

KC

.

Reduced genomic 5-methylcytosine content in human colonic neoplasia

.

Cancer Res

1988

;

1159

–

61

.

65.

Ehrlich

M

.

DNA methylation in cancer: too much, but also too little

.

Oncogene

2002

;

21

(

35

):

5400

–

13

.

66.

Aggerholm

A

,

Guldberg

P

,

Hokland

M

, et al.

Extensive intra- and interindividual heterogeneity of p15INK4B methylation in acute myeloid leukemia

.

Cancer Res

1999

;

18

:

436

–

41

.

67.

Landan

G

,

Cohen

NM

,

Mukamel

Z

, et al.

Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues

.

Nat Genet

2012

;

44

(

11

):

1207

–

14

.

68.

Pan

H

,

Jiang

Y

,

Boi

M

, et al.

Epigenomic evolution in diffuse large B-cell lymphomas

.

Nat Commun

2015

;

6

(

1

):

6921

.

69.

Timp

W

,

Feinberg

AP

.

Cancer as a dysregulated epigenome allowing cellular growth advantage at the expense of the host

.

Nat Rev Cancer

2013

;

13

(

7

):

497

–

510

.

70.

Li

S

,

Garrett-Bakelman

F

,

Perl

AE

, et al.

Dynamic evolution of clonal epialleles revealed by methclone

.

Genome Biol

2014

;

15

(

9

):

472

.

71.

Barrett

JE

,

Feber

A

,

Herrero

J

, et al.

Quantification of tumour evolution and heterogeneity via Bayesian epiallele detection

.

BMC Bioinformatics

2017

;

18

(

1

):

354

.

72.

Lee

D

,

Lee

S

,

Kim

S

.

PRISM: methylation pattern-based, reference-free inference of subclonal makeup

.

Bioinformatics

2019

;

35

(

14

):

i520

–

9

.

73.

Xie

H

,

Wang

M

,

de Andrade

A

, et al.

Genome-wide quantitative assessment of variation in DNA methylation patterns

.

Nucleic Acids Res

2011

;

39

(

10

):

4099

–

108

.

74.

Landau

DA

,

Clement

K

,

Ziller

MJ

, et al.

Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia

.

Cancer Cell

2014

;

26

(

6

):

813

–

25

.

75.

Guo

S

,

Diep

D

,

Plongthongkum

N

, et al.

Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA

.

Nat Genet

2017

;

49

(

4

):

635

–

42

.

76.

Aran

D

,

Sirota

M

,

Butte

AJ

.

Systematic pan-cancer analysis of tumour purity

.

Nat Commun

2015

;

6

:

1

–

11

.

77.

Zhang

N

,

Wu

HJ

,

Zhang

W

, et al.

Predicting tumor purity from methylation microarray data

.

Bioinformatics

2015

;

31

(

21

):

3401

–

5

.

78.

Benelli

M

,

Romagnoli

D

,

Demichelis

F

.

Tumor purity quantification by clonal DNA methylation signatures

.

Bioinformatics

2018

;

34

(

10

):

1642

–

9

.

79.

Liu

B

,

Yang

X

,

Wang

T

, et al.

MEpurity: estimating tumor purity using DNA methylation data

.

Bioinformatics

2019a

;

35

(

24

):

5298

–

300

.

80.

Zheng

X

,

Zhao

Q

,

Wu

HJ

, et al.

MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes

.

Genome Biol

2014

;

15

(

8

):

419

.

81.

Houseman

EA

,

Accomando

WP

,

Koestler

DC

, et al.

DNA methylation arrays as surrogate measures of cell mixture distribution

.

BMC Bioinformatics

2012

;

13

(

1

):

86

.

82.

Chakravarthy

A

,

Furness

A

,

Joshi

K

, et al.

Pan-cancer deconvolution of tumour composition using DNA methylation

.

Nat Commun

2018

;

9

(

1

):

3220

.

83.

Newman

AM

,

Liu

CL

,

Green

MR

, et al.

Robust enumeration of cell subsets from tissue expression profiles

.

Nat Methods

2015

;

12

(

5

):

453

–

7

.

84.

Houseman

EA

,

Kile

ML

,

Christiani

DC

, et al.

Reference-free deconvolution of DNA methylation data and mediation by cell composition effects

.

BMC Bioinformatics

2016

;

17

(

1

):

259

.

85.

Lutsik

P

,

Slawski

M

,

Gasparoni

G

, et al.

MeDeCom: discovery and quantification of latent components of heterogeneous methylomes

.

Genome Biol

2017

;

18

(

1

):

55

.

86.

Torres

CM

,

Biran

A

,

Burney

MJ

, et al.

The linker histone H1.0 generates epigenetic and functional intratumor heterogeneity

.

Science

2016

;

353

(

6307

):

aaf1644

.

87.

Parker

JS

,

Mullins

M

,

Cheang

MCU

, et al.

Supervised risk predictor of breast cancer based on intrinsic subtypes

.

J Clin Oncol

2009

;

27

:

1160

–

7

.

88.

Whitney

AR

,

Diehn

M

,

Popper

SJ

, et al.

Individuality and variation in gene expression patterns in human blood

.

Proc Natl Acad Sci U S A

2003

;

100

(

4

):

1896

–

901

.

89.

Venet

D

,

Pecasse

F

,

Maenhaut

C

, et al.

Separation of samples into their constituents using gene expression data

.

Bioinformatics

2001

;

17

(

Suppl. 1

):

279

–

87

.

90.

de Ridder

D

,

van der Linden

CE

,

Schonewille

T

, et al.

Purity for clarity: the need for purification of tumor cells in DNA microarray studies

.

Leukemia

2005

;

19

(

4

):

618

–

27

.

91.

de Bruin

EC

,

van de Pas

S

,

Lips

EH

, et al.

Macrodissection versus microdissection of rectal carcinoma: minor influence of stroma cells to tumor cell gene expression profiles

.

BMC Genomics

2005

;

6

(

1

):

142

.

92.

Zhao

Y

,

Simon

R

.

Gene expression deconvolution in clinical samples

.

Genome Med

2010

;

2

(

12

):

93

.

93.

Chikina

M

,

Zaslavsky

E

,

Sealfon

SC

.

CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations

.

Bioinformatics

2015

;

31

(

10

):

1584

–

91

.

94.

Xiao

SJ

,

Zhang

C

,

Zou

Q

, et al.

TiSGeD: a database for tissue-specific genes

.

Bioinformatics

2010

;

26

(

9

):

1273

–

5

.

95.

Yoshihara

K

,

Shahmoradgoli

M

,

Martínez

E

, et al.

Inferring tumour purity and stromal and immune cell admixture from expression data

.

Nat Commun

2013a

;

4

(

1

):

1

–

11

.

96.

Walhout

AJ

,

Vidal

M

.

High-throughput yeast two-hybrid assays for large-scale protein interaction mapping

.

Methods

2001

;

24

(

3

):

297

–

306

.

97.

Parrish

JR

,

Gulyas

KD

,

Finley

RL

.

Yeast two-hybrid contributions to interactome mapping

.

Curr Opin Biotechnol

2006

;

17

(

4

):

387

–

93

.

98.

Szklarczyk

D

,

Gable

AL

,

Lyon

D

, et al.

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets

.

Nucleic Acids Res

2018

;

47

:

607

–

13

.

99.

Oughtred

R

,

Stark

C

,

Breitkreutz

BJ

, et al.

The BioGRID interaction database: 2019 update

.

Nucleic Acids Res

2018

;

47

:

529

–

41

.

100.

Barabási

AL

,

Oltvai

ZN

.

Network biology: understanding the cell’s functional organization

.

Nat Rev Genet

2004

;

5

(

2

):

101

–

13

.

101.

Barabási

AL

,

Gulbahce

N

,

Loscalzo

J

.

Network medicine: a network-based approach to human disease

.

Nat Rev Genet

2011

;

12

(

1

):

56

–

68

.

102.

Lim

S

,

Lee

S

,

Jung

I

, et al.

Comprehensive and critical evaluation of individualized pathway activity measurement tools on pan-cancer data

.

Brief Bioinform

2020

;

21

(

1

):

36

–

46

.

103.

Qiao

W

,

Quon

G

,

Csaszar

E

, et al.

PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions

.

PLoS Comput Biol

2012

;

8

(

12

):

e1002838

.

104.

Quon

G

,

Haider

S

,

Deshwar

AG

, et al.

Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction

.

Genome Med

2013

;

5

(

3

):

29

.

105.

Anghel

CV

,

Quon

G

,

Haider

S

, et al.

Isopurer: an r implementation of a computational purification algorithm of mixed tumour profiles

.

BMC Bioinformatics

2015

;

16

(

1

):

156

.

106.

Ahn

J

,

Yuan

Y

,

Parmigiani

G

, et al.

DeMix: deconvolution for mixed cancer transcriptomes using raw measured data

.

Bioinformatics

2013

;

29

(

15

):

1865

–

71

.

107.

Yoshihara

K

,

Shahmoradgoli

M

,

Martínez

E

, et al.

Inferring tumour purity and stromal and immune cell admixture from expression data

.

Nat Commun

2013b

;

4

(

1

):

2612

.

108.

Wang

N

,

Gong

T

,

Clarke

R

, et al.

Undo: a bioconductor r package for unsupervised deconvolution of mixed gene expressions in tumor samples

.

Bioinformatics

2015

;

31

(

1

):

137

–

9

.

109.

Newman

AM

,

Steen

CB

,

Liu

CL

, et al.

Determining cell type abundance and expression from bulk tissues with digital cytometry

.

Nat Biotechnol

2019

;

37

:

773

–

82

.

110.

Shen

Q

,

Hu

J

,

Jiang

N

, et al.

contamDE: differential expression analysis of RNA-seq data for contaminated tumor samples

.

Bioinformatics

2016

;

32

(

5

):

705

–

12

.

111.

Li

B

,

Severson

E

,

Pignon

JC

, et al.

Comprehensive analyses of tumor immunity: implications for cancer immunotherapy

.

Genome Biol

2016a

;

17

(

1

):

174

.

112.

Becht

E

,

Giraldo

NA

,

Lacroix

L

, et al.

Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression

.

Genome Biol

2016

;

17

(

1

):

218

.

113.

Racle

J

,

de Jonge

K

,

Baumgaertner

P

, et al.

Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data

.

Elife

2017

;

6

:

e26476

.

114.

Aran

D

,

Hu

Z

,

Butte

AJ

.

xCell: digitally portraying the tissue cellular heterogeneity landscape

.

Genome Biol

2017

;

18

(

1

):

220

.

115.

Finotello

F

,

Mayer

C

,

Plattner

C

, et al.

Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data

.

Genome Med

2019

;

11

(

1

):

34

.

116.

Shen-Orr

SS

,

Gaujoux

R

.

Computational deconvolution: extracting cell type-specific information from heterogeneous samples

.

Curr Opin Immunol

2013

;

25

(

5

):

571

–

8

.

117.

Yadav

VK

,

De

S

.

An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples

.

Brief Bioinform

2015

;

16

(

2

):

232

–

41

.

118.

Mohammadi

S

,

Zuckerman

N

,

Goldsmith

A

, et al.

A critical survey of deconvolution methods for separating cell types in complex tissues

.

Proc IEEE

2017

;

105

(

2

):

340

–

66

.

119.

Avila Cobos

F

,

Vandesompele

J

,

Mestdagh

P

, et al.

Computational deconvolution of transcriptomics data from mixed cell populations

.

Bioinformatics

2018

;

34

(

11

):

1969

–

79

.

120.

Finotello

F

,

Trajanoski

Z

.

Quantifying tumor-infiltrating immune cells from transcriptomics data

.

Cancer Immunol Immunother

2018

;

67

(

7

):

1031

–

40

.

121.

Sturm

G

,

Finotello

F

,

Petitprez

F

, et al.

Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology

.

Bioinformatics

2019

;

35

:

436

–

45

.

122.

Barbie

DA

,

Tamayo

P

,

Boehm

JS

, et al.

Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1

.

Nature

2009

;

462

(

7269

):

108

–

12

.

123.

Wang

Z

,

Cao

S

,

Morris

JS

, et al.

Transcriptome deconvolution of heterogeneous tumor samples with immune infiltration

.

iScience

2018

;

9

:

451

–

60

.

124.

Teschendorff

AE

,

Severini

S

.

Increased entropy of signal transduction in the cancer metastasis phenotype

.

BMC Syst Biol

2010

;

4

(

1

):

104

.

125.

Teschendorff

AE

,

Sollich

P

,

Kuehn

R

.

Signalling entropy: a novel network-theoretical framework for systems analysis and interpretation of functional omic data

.

Methods

2014

;

67

(

3

). doi:

10.1093/nsr/nwaa099

.

126.

Banerji

CRS

,

Miranda-Saavedra

D

,

Severini

S

, et al.

Cellular network entropy as the energy potential in Waddington’s differentiation landscape

.

Sci Rep

2013

;

3

(

1

):

3039

.

127.

Park

Y

,

Lim

S

,

Nam

JW

, et al.

Measuring intratumor heterogeneity by network entropy using RNA-seq data

.

Sci Rep

2016

;

6

(

1

):

1

–

12

.

128.

Wan

Y

,

Larson

DR

.

Splicing heterogeneity: separating signal from noise

.

Genome Biol

2018

;

19

(

1

):

86

.

129.

Kim

M

,

Lee

S

,

Lim

S

, et al.

Splicehetero: an information theoretic approach for measuring spliceomic intratumor heterogeneity from bulk tumor RNA-seq

.

PLoS One

2019

;

14

(

10

):

e0223520

.

130.

Navin

NE

.

The first five years of single-cell cancer genomics and beyond

.

Genome Res

2015

;

25

(

10

):

1499

–

507

.

131.

Kuipers

J

,

Jahn

K

,

Beerenwinkel

N

.

Advances in understanding tumour evolution through single-cell sequencing

.

Biochim Biophys Acta Rev Cancer

2017

;

1867

(

2

):

127

–

38

.

132.

Navin

N

,

Kendall

J

,

Troge

J

, et al.

Tumour evolution inferred by single-cell sequencing

.

Nature

2011

;

472

(

7341

):

90

–

4

.

133.

Xu

X

,

Hou

Y

,

Yin

X

, et al.

Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor

.

Cell

2012

;

148

(

5

):

886

–

95

.

134.

Jan

M

,

Snyder

TM

,

Corces-Zimmerman

MR

, et al.

Clonal evolution of preleukemic hematopoietic stem cells precedes human acute myeloid leukemia

.

Sci Transl Med

2012

;

4

(

149

):

149ra118

.

135.

Potter

NE

,

Ermini

L

,

Papaemmanuil

E

, et al.

Single-cell mutational profiling and clonal phylogeny in cancer

.

Genome Res

2013

;

23

(

12

):

2115

–

25

.

136.

Roth

A

,

McPherson

A

,

Laks

E

, et al.

Clonal genotype and population structure inference from single-cell tumor sequencing

.

Nat Methods

2016

;

13

(

7

):

573

–

6

.

137.

Jahn

K

,

Kuipers

J

,

Beerenwinkel

N

.

Tree inference for single-cell data

.

Genome Biol

2016

;

17

(

1

):

86

.

138.

Ross

EM

,

Markowetz

F

.

OncoNEM: inferring tumor evolution from single-cell sequencing data

.

Genome Biol

2016

;

17

(

1

):

69

.

139.

Zafar

H

,

Tzen

A

,

Navin

N

, et al.

SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models

.

Genome Biol

2017

;

18

(

1

):

178

.

140.

Malikic

S

,

Jahn

K

,

Kuipers

J

, et al.

Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data

.

Nat Commun

2019a

;

10

(

1

):

2750

.

141.

Zafar

H

,

Navin

N

,

Chen

K

, et al.

SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data

.

Genome Res

2019

;

29

(

11

):

1847

–

59

.

142.

Malikic

S

,

Mehrabadi

FR

,

Ciccolella

S

, et al.

PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

.

Genome Res

2019b

;

29

(

11

):

1860

–

77

.

143.

Patel

AP

,

Tirosh

I

,

Trombetta

JJ

, et al.

Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma

.

Science

2014

;

344

(

6190

):

1396

–

401

.

144.

Tirosh

I

,

Izar

B

,

Prakadan

SM

, et al.

Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

.

Science

2016

;

352

(

6282

):

189

–

96

.

145.

Li

H

,

Courtois

ET

,

Sengupta

D

, et al.

Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors

.

Nat Genet

2017

;

49

(

5

):

708

–

18

.

146.

Trapnell

C

,

Cacchiarelli

D

,

Grimsby

J

, et al.

The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells

.

Nat Biotechnol

2014

;

32

(

4

):

381

–

6

.

147.

Qiu

X

,

Mao

Q

,

Tang

Y

, et al.

Reversed graph embedding resolves complex single-cell trajectories

.

Nat Methods

2017

;

14

(

10

):

979

–

82

.

148.

Pang

B

,

Xu

J

,

Hu

J

, et al.

Single-cell RNA-seq reveals the invasive trajectory and molecular cascades underlying glioblastoma progression

.

Mol Oncol

2019

;

13

(

12

):

2588

–

603

.

149.

Yu

K

,

Hu

Y

,

Wu

F

, et al.

Surveying brain tumor heterogeneity by single-cell RNA sequencing of multi-sector biopsies

.

Natl Sci Rev

2020

.

150.

Ma

L

,

Hernandez

MO

,

Zhao

Y

, et al.

Tumor cell biodiversity drives microenvironmental reprogramming in liver cancer

.

Cancer Cell

2019

;

36

(

4

):

418

–

430.e6

.

151.

Borcherding

N

,

Voigt

AP

,

Liu

V

, et al.

Single-cell profiling of cutaneous T-cell lymphoma reveals underlying heterogeneity associated with disease progression

.

Clin Cancer Res

2019

;

25

(

10

):

2996

–

3005

.

152.

Durante

MA

,

Rodriguez

DA

,

Kurtenbach

S

, et al.

Single-cell analysis reveals new evolutionary complexity in uveal melanoma

.

Nat Commun

2020

;

11

(

1

):

496

.

153.

Bendall

SC

,

Davis

KL

,

Amir

EAD

, et al.

Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development

.

Cell

2014

;

157

(

3

):

714

–

25

.

154.

Loeffler-Wirth

H

,

Binder

H

,

Willscher

E

, et al.

Pseudotime dynamics in melanoma single-cell transcriptomes reveals different mechanisms of tumor progression

.

Biology

2018

;

7

(

2

):

23

.

155.

Baron

M

,

Veres

A

,

Wolock

SL

, et al.

A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure

.

Cell Syst

2016

;

3

(

4

):

346

–

360.e4

.

156.

Wang

X

,

Park

J

,

Susztak

K

, et al.

Bulk tissue cell type deconvolution with multi-subject single-cell expression reference

.

Nat Commun

2019

;

10

(

1

):

380

.

157.

Jew

B

,

Alvarez

M

,

Rahmani

E

, et al.

Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

.

Nat Commun

2020

;

11

(

1

):

1971

.

158.

Macaulay

IC

,

Haerty

W

,

Kumar

P

, et al.

G&T-seq: parallel sequencing of single-cell genomes and transcriptomes

.

Nat Methods

2015

;

12

(

6

):

519

–

22

.

159.

Rodriguez-Meira

A

,

Buck

G

,

Clark

SA

, et al.

Unravelling intratumoral heterogeneity through high-sensitivity single-cell mutational analysis and parallel RNA sequencing

.

Mol Cell

2019

;

73

(

6

):

1292

–

1305.e8

.

160.

Liu

L

,

Liu

C

,

Quintero

A

, et al.

Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity

.

Nat Commun

2019b

;

10

(

1

):

470

.

161.

Aryee

MJ

,

Liu

W

,

Engelmann

JC

, et al.

DNA methylation alterations exhibit intraindividual stability and interindividual heterogeneity in prostate cancer metastases

.

Sci Transl Med

2013

;

5

(

169

):

169ra10

.

162.

Brocks

D

,

Assenov

Y

,

Minner

S

, et al.

Intratumor DNA methylation heterogeneity reflects clonal evolution in aggressive prostate cancer

.

Cell Rep

2014

;

8

(

3

):

798

–

806

.

163.

Loeffler

M

,

Kreuz

M

,

Haake

A

, et al.

Genomic and epigenomic co-evolution in follicular lymphomas

.

Leukemia

2015

;

29

(

2

):

456

–

63

.

164.

Mazor

T

,

Pankov

A

,

Johnson

BE

, et al.

DNA methylation and somatic mutations converge on the cell cycle and define similar evolutionary histories in brain tumors

.

Cancer Cell

2015

;

28

(

3

):

307

–

17

.

165.

Hao

JJ

,

Lin

DC

,

Dinh

HQ

, et al.

Spatial intratumoral heterogeneity and temporal clonal evolution in esophageal squamous cell carcinoma

.

Nat Genet

2016

;

48

(

12

):

1500

–

7

.

166.

Dietz

S

,

Lifshitz

A

,

Kazdal

D

, et al.

Global DNA methylation reflects spatial heterogeneity and molecular evolution of lung adenocarcinomas

.

Int J Cancer

2019

;

144

(

5

):

1061

–

72

.

167.

Oakes

CC

,

Claus

R

,

Gu

L

, et al.

Evolution of DNA methylation is linked to genetic aberrations in chronic lymphocytic leukemia

.

Cancer Discov

2014

;

4

(

3

):

348

–

61

.

168.

Li

S

,

Garrett-Bakelman

FE

,

Chung

SS

, et al.

Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia

.

Nat Med

2016b

;

22

(

7

):

792

–

9

.

169.

Onuchic

V

,

Hartmaier

RJ

,

Boone

DN

, et al.

Epigenomic deconvolution of breast tumors reveals metabolic coupling between constituent cell types

.

Cell Rep

2016

;

17

(

8

):

2075

–

86

.

170.

Gerstung

M

,

Jolly

C

,

Leshchiner

I

, et al.

The evolutionary history of 2,658 cancers

.

Nature

2020

;

578

(

7793

):

122

–

8

.

171.

Campbell

PJ

,

Getz

G

,

Korbel

JO

, et al.

Pan-cancer analysis of whole genomes

.

Nature

2020

;

578

(

7793

):

82

–

93

.