Abstract

The multi-omics molecular characterization of cancer opened a new horizon for our understanding of cancer biology and therapeutic strategies. However, a tumor biopsy comprises diverse types of cells limited not only to cancerous cells but also to tumor microenvironmental cells and adjacent normal cells. This heterogeneity is a major confounding factor that hampers a robust and reproducible bioinformatic analysis for biomarker identification using multi-omics profiles. Besides, the heterogeneity itself has been recognized over the years for its significant prognostic values in some cancer types, thus offering another promising avenue for therapeutic intervention. A number of computational approaches to unravel such heterogeneity from high-throughput molecular profiles of a tumor sample have been proposed, but most of them rely on the data from an individual omics layer. Since the heterogeneity of cells is widely distributed across multi-omics layers, methods based on an individual layer can only partially characterize the heterogeneous admixture of cells. To help facilitate further development of the methodologies that synchronously account for several multi-omics profiles, we wrote a comprehensive review of diverse approaches to characterize tumor heterogeneity based on three different omics layers: genome, epigenome and transcriptome. As a result, this review can be useful for the analysis of multi-omics profiles produced by many large-scale consortia. Contact:[email protected]

Introduction

A tumor sample is not a homogeneous mass of malignant cells but a heterogeneous cellular ecosystem comprising populations of diverse cell types, including cancer cells, stromal cells or immune cells. Over the decades, such intrinsic heterogeneity of a tumor sample, called tumor heterogeneity (TH), has been acknowledged for its implication in every step of cancer progression from initiation to metastasis and relapse [1–3]. TH resides even within a cancer cell population, and it is specifically referred to as intratumor heterogeneity (ITH). ITH accompanies several distinct populations of identical cancer cells called subclones. According to the clonal theory of cancer evolution, ITH arises from the Darwinian serial evolution of cancer cells [4, 5]. TH and ITH are major obstacles for cancer therapeutics, but at the same time harbor a potential for a novel therapeutic strategy manipulating TH and ITH themselves [6].

Accordingly, a number of recent translational researches have been focusing on the predictive power of TH for predicting various clinical outcomes, including patient survival, therapeutic response and metastatic potential of a tumor. In such studies, mutant-allele tumor heterogeneity (MATH) is one of the most widely used measures of genomic TH [7]. Given a distribution of variant allele frequencies (VAFs), a MATH score is defined as the ratio of the median absolute deviation to the median of the distribution. Interestingly, MATH scores have been shown to harbor significant translational implications, highlighting the vast utility of TH in the clinics. For example, the higher MATH scores (i.e. more extensive genomic TH) were shown to be associated with poor clinical outcomes in head and neck squamous cell carcinoma [7] and uterine corpus endometrial carcinoma [8], as well as increased metastatic potential in colorectal cancer [2, 9]. Besides, the increased heterogeneity of subclonal tumor neoantigens arising from ITH was suggested to be associated with the poor response to immune checkpoint blockade therapies [10], which emphasizes that TH should be taken into consideration when designing novel therapeutic strategies.

To investigate the clinical potential of TH, it is crucial to resolve the extent of TH from the molecular data generated from a tumor specimen. The main goals of such studies are to characterize (1) the proportion of cancerous cells in the sample, or tumor purity; (2) the composition and evolutionary history of cancer subclones; and (3) the proportion of each noncancerous cell type comprising the tumor microenvironment. A number of relevant approaches to tackle such goals have been developed so far, but still, most of them rely on individual omics. Since TH is disseminated across different omics layers exemplified by genome, epigenome and transcriptome, we cannot fully characterize TH from the data capturing only an individual omics layer. Thus, a systematic multi-omics TH characterization is crucial for another therapeutic breakthrough in cancer clinics. Nevertheless, a thorough understanding of the individual characteristics of each omics data is necessary for the integrative analysis throughout different omics layers; also, sufficient discussions about the computational approaches to model the data and to infer hidden biological states of a tumor are needed. Therefore, in this article, we present a structured summary of the biological and statistical properties of cancer multi-omics data and comprehensive reviews of existing bioinformatic approaches in terms of how they exploit such data to reveal the extent of TH in the sample. We also introduce recent pioneering approaches for the joint analyses of multiple omics data, which can inspire the further development of algorithms based on multiple omics data. Finally, regarding the recent technological advances and the massive amount of relevant data, we provide some insights on the feasibility of machine learning (ML)-based approaches.

Genome-based tumor heterogeneity

Underlying biological features

Single-nucleotide variation

The conventional theory of tumor evolution defines a subclone as a group of cells with identical genetic aberrations. In this point of view, a subclone arises after a cell acquires a driver mutation that confers significant selective growth advantage, even if dozens of other passenger mutations have already been accumulated in the cell [11, 12]. Therefore, the general assumption behind the genome-based approaches is that cells constituting a subclone have the same assortment of genetic aberrations, including single-nucleotide variation (SNV), and thus such aberrations have similar frequencies. The advent of next-generation sequencing technologies enabled high-throughput quantification of DNA bases, which are the atomic units of genetic information. Accordingly, a plethora of statistical approaches have been developed in the past decade, driven by the massive SNV allelic count data.

Somatic copy number alteration

Somatic copy number alteration (SCNA) is another genomic feature that leaves a large trace of TH. In early studies of SCNAs, the primary sources of data were array-based technologies such as comparative genomic hybridization [13, 14] and SNP arrays [15, 16]. While both methods produce segmental copy numbers on a relative basis, the main difference is that only SNP arrays allow the analysis of allele-specific SCNAs [17] and therefore can be used for studies of loss of heterozygosity (LOH), or even copy number-neutral LOH. Due to its affordability, SNP array is still a reasonable ancillary method for validation of next-generation sequencing (NGS)-based methods, although NGS-based approaches tend to yield a finer resolution of SCNA detection.

Given relative quantities of copy numbers, it is not trivial to infer how many absolute integer copies of genomic segments are present per cell, due to the genomic instability, the impurity of tumor samples and the existence of tumor subclones with different SCNA profiles [18]. For instance, whole-genome duplication events often occur in cancer, but virtually they cannot be detected with relative copy numbers alone [19]. SCNAs also complicate SNV-driven approaches since the VAFs within copy number-altered regions will deviate from the representative VAFs of the subclone, which are derived from copy number-neutral regions. Therefore, in general, the major goal of SCNA-driven approaches is to estimate tumor purity and absolute copy numbers of genomic regions accurately, which in turn assist several other TH characterization methods.

Tumor phylogenies

Incorporating phylogenetic relationships among subclones into the model may provide enhanced plausibility and explainability of the inferred evolutionary history of a tumor. To date, four different models for tumor evolution have been proposed [20]: (1) linear, (2) branching, (3) neutral and (4) punctuated evolution. Although various mathematical models for each of the models were developed, the vast majority of subclonal inference algorithms do not directly employ those models due to their complexity. Instead, the models are built on some basic premises such as perfect phylogeny of tumor evolution in which convergent evolution never occurs and infinite site assumption, which assumes that there are no shared genomic aberrations between any two subclones in distinct lineage and no reverse mutations.

Tumor purity

Due to the inevitable inclusion of adjacent noncancerous cells during surgical resection and the nature of tumor microenvironment involving the infiltration of lymphocytes, practically, we cannot obtain a pure tumor sample. Thus, we define tumor purity, or cellularity, as a proportion of truly cancerous cells in a biopsy and introduce it as an essential parameter in the model that governs the accurate and nonambiguous estimation of subclonal prevalences, as illustrated below. Assume we observe a cluster of heterozygous VAFs averaged at 0.1, by which we can infer the existence of a subclone occupying 20% of tumor cells in case of 100% tumor purity. However, in reality, a sample is not perfectly pure, so that we have numerous equivalent solutions, e.g. 25% subclone with 80% tumor purity, or alternatively 40% subclone with 50% tumor purity. Accordingly, inferring tumor purity is one of the crucial challenges in TH characterization.

Computational problem formulation

SNV-driven approaches for subclonal inference

SNV-driven approaches treat an SNV as evidence of the subclonal population. Thus, the identification of subclones becomes equivalent to the inference of optimal assignment of SNVs to clusters representing candidate subclones (Figure 1, left panel; Table 1). Formally, VAF |$f_{l}$| of a variant at locus |$l$|⁠, covered by |$d_{l}$| reads is defined as below.
(1)
(2)
Table 1

SNV-driven tools for subclonal inference

NameYearFeature|$^{\ast }$|Model|$^{\dagger }$|Evolutionary constraintMulti-sample analysis|$^{\ddagger }$|Description§ImplementationRef
PurBayes2013ACBinNNBayesian finite mixture model evaluated with penalized expected devianceR[21]
TrAp2013CVAFN/AYYEnumerates possible tree structures by iteratively merging first-generation treesJava[22]
PyClone2014ACBin, BeBinNYBayesian nonparametric clustering using Dirichlet processPython[23]
SciClone2014AC, VAFBeta, Gaussian, BinNYVariational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clustersR[24]
PhyloSub2014ACBinYYBayesian nonparametric model using tree-structured stick-breaking processC++, Python[25]
EXPANDS2014VAFN/ANNHeuristic clustering of variants based on KL-divergence.R[26]
TITAN2014ACBinNNFactorial HMM with joint emission model for allele counts and RDsR[27]
Clomial2014ACBinNYConstrained matrix factorization using multi-region sequencing dataR[28]
Rec-BTP2014CVAFN/AYNRecursive approximation algorithm for binary tree partitionMATLAB, Python[29]
PhyloWGS2015ACBinYYExtends PhyloSub by incorporating SCNA information from WGSC++, Python[30]
AncesTree2015ACBinYYSolves VAF factorization problem with ILPC++[31]
LICHeE2015VAFGaussianYYMulti-sample tumor phylogeny reconstruction by solving a set cover problemJava[32]
Bayclone2015ACBinNYBayesian nonparametric clustering using categorial Indian buffet processR[33]
Cloe2016ACBeBinYYPhylogenetic latent feature model that reflects hierarchical relationship between subclonal featuresR[34]
Canopy2016ACBinYYMulti-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNAR[35]
ddClone2017ACBin, BeBinNNBayesian nonparametric clustering using distance-dependent Chinese restaurant processR[36]
ClonEvol2017CVAFN/AYYUses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderingsR[37]
NameYearFeature|$^{\ast }$|Model|$^{\dagger }$|Evolutionary constraintMulti-sample analysis|$^{\ddagger }$|Description§ImplementationRef
PurBayes2013ACBinNNBayesian finite mixture model evaluated with penalized expected devianceR[21]
TrAp2013CVAFN/AYYEnumerates possible tree structures by iteratively merging first-generation treesJava[22]
PyClone2014ACBin, BeBinNYBayesian nonparametric clustering using Dirichlet processPython[23]
SciClone2014AC, VAFBeta, Gaussian, BinNYVariational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clustersR[24]
PhyloSub2014ACBinYYBayesian nonparametric model using tree-structured stick-breaking processC++, Python[25]
EXPANDS2014VAFN/ANNHeuristic clustering of variants based on KL-divergence.R[26]
TITAN2014ACBinNNFactorial HMM with joint emission model for allele counts and RDsR[27]
Clomial2014ACBinNYConstrained matrix factorization using multi-region sequencing dataR[28]
Rec-BTP2014CVAFN/AYNRecursive approximation algorithm for binary tree partitionMATLAB, Python[29]
PhyloWGS2015ACBinYYExtends PhyloSub by incorporating SCNA information from WGSC++, Python[30]
AncesTree2015ACBinYYSolves VAF factorization problem with ILPC++[31]
LICHeE2015VAFGaussianYYMulti-sample tumor phylogeny reconstruction by solving a set cover problemJava[32]
Bayclone2015ACBinNYBayesian nonparametric clustering using categorial Indian buffet processR[33]
Cloe2016ACBeBinYYPhylogenetic latent feature model that reflects hierarchical relationship between subclonal featuresR[34]
Canopy2016ACBinYYMulti-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNAR[35]
ddClone2017ACBin, BeBinNNBayesian nonparametric clustering using distance-dependent Chinese restaurant processR[36]
ClonEvol2017CVAFN/AYYUses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderingsR[37]

|$^{\ddagger }$|Denotes whether the tool is able to conduct joint analysis of several spatially or longitudinally distinct samples collected from a single tumor. Abbreviations: |$^{*}$|AC, allele count; VAF, variant allele frequency; CVAF, clustered variant allele frequency; |$^{\dagger }$|Bin, binomial; BeBin, beta-binomial; §ILP, integer linear programming; CI, confidence interval.

Table 1

SNV-driven tools for subclonal inference

NameYearFeature|$^{\ast }$|Model|$^{\dagger }$|Evolutionary constraintMulti-sample analysis|$^{\ddagger }$|Description§ImplementationRef
PurBayes2013ACBinNNBayesian finite mixture model evaluated with penalized expected devianceR[21]
TrAp2013CVAFN/AYYEnumerates possible tree structures by iteratively merging first-generation treesJava[22]
PyClone2014ACBin, BeBinNYBayesian nonparametric clustering using Dirichlet processPython[23]
SciClone2014AC, VAFBeta, Gaussian, BinNYVariational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clustersR[24]
PhyloSub2014ACBinYYBayesian nonparametric model using tree-structured stick-breaking processC++, Python[25]
EXPANDS2014VAFN/ANNHeuristic clustering of variants based on KL-divergence.R[26]
TITAN2014ACBinNNFactorial HMM with joint emission model for allele counts and RDsR[27]
Clomial2014ACBinNYConstrained matrix factorization using multi-region sequencing dataR[28]
Rec-BTP2014CVAFN/AYNRecursive approximation algorithm for binary tree partitionMATLAB, Python[29]
PhyloWGS2015ACBinYYExtends PhyloSub by incorporating SCNA information from WGSC++, Python[30]
AncesTree2015ACBinYYSolves VAF factorization problem with ILPC++[31]
LICHeE2015VAFGaussianYYMulti-sample tumor phylogeny reconstruction by solving a set cover problemJava[32]
Bayclone2015ACBinNYBayesian nonparametric clustering using categorial Indian buffet processR[33]
Cloe2016ACBeBinYYPhylogenetic latent feature model that reflects hierarchical relationship between subclonal featuresR[34]
Canopy2016ACBinYYMulti-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNAR[35]
ddClone2017ACBin, BeBinNNBayesian nonparametric clustering using distance-dependent Chinese restaurant processR[36]
ClonEvol2017CVAFN/AYYUses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderingsR[37]
NameYearFeature|$^{\ast }$|Model|$^{\dagger }$|Evolutionary constraintMulti-sample analysis|$^{\ddagger }$|Description§ImplementationRef
PurBayes2013ACBinNNBayesian finite mixture model evaluated with penalized expected devianceR[21]
TrAp2013CVAFN/AYYEnumerates possible tree structures by iteratively merging first-generation treesJava[22]
PyClone2014ACBin, BeBinNYBayesian nonparametric clustering using Dirichlet processPython[23]
SciClone2014AC, VAFBeta, Gaussian, BinNYVariational Bayesian mixture model that prunes marginal clusters to obtain optimal number of clustersR[24]
PhyloSub2014ACBinYYBayesian nonparametric model using tree-structured stick-breaking processC++, Python[25]
EXPANDS2014VAFN/ANNHeuristic clustering of variants based on KL-divergence.R[26]
TITAN2014ACBinNNFactorial HMM with joint emission model for allele counts and RDsR[27]
Clomial2014ACBinNYConstrained matrix factorization using multi-region sequencing dataR[28]
Rec-BTP2014CVAFN/AYNRecursive approximation algorithm for binary tree partitionMATLAB, Python[29]
PhyloWGS2015ACBinYYExtends PhyloSub by incorporating SCNA information from WGSC++, Python[30]
AncesTree2015ACBinYYSolves VAF factorization problem with ILPC++[31]
LICHeE2015VAFGaussianYYMulti-sample tumor phylogeny reconstruction by solving a set cover problemJava[32]
Bayclone2015ACBinNYBayesian nonparametric clustering using categorial Indian buffet processR[33]
Cloe2016ACBeBinYYPhylogenetic latent feature model that reflects hierarchical relationship between subclonal featuresR[34]
Canopy2016ACBinYYMulti-sample tumor phylogeny reconstruction by joint modeling of SNV and SCNAR[35]
ddClone2017ACBin, BeBinNNBayesian nonparametric clustering using distance-dependent Chinese restaurant processR[36]
ClonEvol2017CVAFN/AYYUses bootstrapping to get CIs of subclonal prevalences generated from other methods, then enumerates and evaluates possible subclonal orderingsR[37]

|$^{\ddagger }$|Denotes whether the tool is able to conduct joint analysis of several spatially or longitudinally distinct samples collected from a single tumor. Abbreviations: |$^{*}$|AC, allele count; VAF, variant allele frequency; CVAF, clustered variant allele frequency; |$^{\dagger }$|Bin, binomial; BeBin, beta-binomial; §ILP, integer linear programming; CI, confidence interval.

Genome-based approaches for TH characterization. The left panel illustrates the viewpoint of SNV-driven subclonal inference algorithms. Different cancer subclones (denoted by the blue, red and green population of cells) harbor different sets of SNVs at their genomic loci (illustrated as squares, triangles and stars). The variant alleles at these genomic loci are identified and quantified by NGS, and their counts or VAFs are statistically modeled to obtain an optimal estimation of subclonal structure or evolutionary history of subclones. The right panel illustrates the viewpoint of SCNA-driven tumor purity and absolute copy number estimation approaches. Thick white lines inside the cells symbolize the copy numbers of genomic segments. While being quantified by SNP array or NGS platforms, the copy number values are averaged, and therefore resulting copy numbers are not necessarily integers. With the average copy number profiles, SCNA-driven methods infer tumor purity and absolute copy numbers of cancer cells. Note that the extent of contaminating normal cells (smooth circles) are taken into account by both of the approaches.
Figure 1

Genome-based approaches for TH characterization. The left panel illustrates the viewpoint of SNV-driven subclonal inference algorithms. Different cancer subclones (denoted by the blue, red and green population of cells) harbor different sets of SNVs at their genomic loci (illustrated as squares, triangles and stars). The variant alleles at these genomic loci are identified and quantified by NGS, and their counts or VAFs are statistically modeled to obtain an optimal estimation of subclonal structure or evolutionary history of subclones. The right panel illustrates the viewpoint of SCNA-driven tumor purity and absolute copy number estimation approaches. Thick white lines inside the cells symbolize the copy numbers of genomic segments. While being quantified by SNP array or NGS platforms, the copy number values are averaged, and therefore resulting copy numbers are not necessarily integers. With the average copy number profiles, SCNA-driven methods infer tumor purity and absolute copy numbers of cancer cells. Note that the extent of contaminating normal cells (smooth circles) are taken into account by both of the approaches.

where |$n^{\textrm{ref}}_{l}$| and |$n^{\textrm{alt}}_{l}$| represents the number of reads supporting reference and alternative alleles, respectively. To identify VAF clusters, two different modeling strategies have been proposed: modeling (1) |$n^{\textrm{alt}}_l$| as a discrete random variable (Equation (3)), or (2) |$f_l$| as a continuous random variable (Equation (4)).
(3)
(4)
where |$p_{i}$| represents the probability of sampling reads containing variant alleles, which is intuitively the half of the true prevalence of subclone |$i$| (⁠|$\phi _i$|⁠) for heterozygous variants. |$D_{\textrm{disc}}$| and |$D_{\textrm{cont}}$| are discrete and continuous probability distribution, respectively, which vary across different approaches. While most of the tools [21, 23–25, 27, 28, 30, 31, 33, 36, 38, 39] adopt binomial distribution as |$D_{\textrm{disc}}$|⁠, beta-binomial is another possible option to account for the overdispersion of allelic frequencies [23, 34, 36]. On the other hand, for |$D_{\textrm{cont}}$|⁠, beta [24] and Gaussian distribution [24, 32, 40] are the most favorable choices.
In reality, |$p_i$| does not simply correspond to |$\phi _{i} / 2$| for heterozygous variants due to the impurity of tumor samples and SCNAs. Therefore, several methods adjust |$p_i$| with extra parameters such as tumor purity (⁠|$\pi $|⁠), total ploidy (⁠|$\omega $|⁠) and copy number of variant allele (⁠|$\nu $|⁠). Let |$\omega _l^n$|⁠, |$\omega _l^c$| and |$\omega _l^v$| be the total copy number at locus |$l$| for normal cells, cancer cells without variants and cancer cells with variants, respectively, and let a similar definition applied to |$\nu _l^n$|⁠, |$\nu _l^c$| and |$\nu _l^v$|⁠. Then, |$p_i$| can be adjusted as below.
(5)
Here, tumor purity |$\pi $| can be inferred from the data [21] or can be obtained by external measurements [23]. For |$\omega _l$| and |$\nu _l$|⁠, some approaches propose unified models to infer all the parameters simultaneously [27], while others circumvent the problem by providing external copy number estimates [24] or by marginalizing them out under the prior assumptions of subclonal genotypes [23].

Deciding the optimal number of inferred subclones (⁠|$K^{*}$|⁠) is another critical issue. The advantage of imposing nonparametric priors such as Dirichlet process [23, 41, 42] or categorical Indian buffet process [33] for subclonal structures is that |$K^{*}$| is inferred in an adaptive and data-driven manner, which does not require predefined |$K$| given to the model a priori. On the other hand, parametric priors, including Dirichlet distribution [34], are still a reasonable option. The model parameters should be repeatedly inferred for varying predefined |$K$|s. Thus, the lightweight inference algorithms such as expectation-maximization (EM) [27, 28, 38] or variational EM [24] are often preferred. Subsequently, the best |$K$| is selected by appropriate evaluation metric such as log likelihood [38], penalized expected deviance [21] and |$S\_Dbw$| [27].

The most realistic prior for the subclonal structures is the distribution over the topologies of phylogenetic trees, as it is likely to produce biologically reasonable phylogenies. PhyloSub [25] is one of the first methods to incorporate phylogenetic priors for subclonal inference, which uses a tree-structured stick-breaking process. Since then, many relevant approaches have been developed, and their common rationale is to integrate various phylogenetic constraints into their inference framework, such as sum and crossing rule [25], ancestry condition [31, 32] or children-sum-to-parents condition [22].

SCNA-driven approaches for tumor purity and absolute copy numbers

The primary goal of the most SCNA-driven approaches is to obtain the tumor purity and (allele-specific) absolute copy numbers (Figure 1, right panel; Table 2). While the early studies mostly used log R ratios (LRRs) and B allele frequencies (BAFs) generated from an SNP array, recent algorithms utilize whole-genome, or whole-exome, sequencing read depths (RDs) [or read depth ratios (RDRs)] for each genomic window and fractions of B-allele read counts, which correspond to LRRs and BAFs, respectively. In general, the algorithms need two values to infer the nonnegative integer copy numbers of A and B alleles (⁠|$\omega ^{A}_{l}$| and |$\omega ^{B}_{l}$|⁠, respectively) and purity (⁠|$\pi $|⁠): (1) LRRs or RDs/RDRs to infer the copy numbers and (2) BAFs to infer the genotypes. Formally, for each SNP locus |$l$|⁠, expected value of LRRs or RDRs (⁠|$r_l$|⁠) and BAF (⁠|$f_l$|⁠) can be written as below.
(6)
(7)
Table 2

SCNA-driven tools for tumor purity and absolute copy number estimation

NameYearDataFeature|$^{*}$|Type|$^{\dagger }$|DescriptionImplementationRef
ASCAT2010SNP arrayLRR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Copy number estimates are evaluated by squared differences to nearest integers.R[17]
OncoSNP2010SNP arrayRD, BAFMStudent t-mixture solved by EMMATLAB[43]
ABSOLUTE2012SNP arrayLRR, VAFMGaussian mixture using grid search through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by SCNA-fit likelihoods.R[18]
APOLLOH2012NGSRD, BAFHHMM with 18 genotype states and binomial emission. Parameters are estimated by EM.MATLAB[44]
CNAnorm2012NGSRDRMGaussian mixture fitted by EM. Number of components are chosen using AIC.R[45]
CLImAT2014NGSRD, ACHHMM with binomial and negative binomial emission model for ACs and RDs, respectivelyMATLAB[46]
THetA22014NGSRD, BAFMSolves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbersJava, MATLAB, Python[40]
TITAN2014NGSRDR, ACHSubclone-aware factorial HMM with joint emission model for ACs and RDsR[27]
PyLOH2014NGSRD, ACMJoint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.Python[47]
AbsCN-seq2014NGSRD, VAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that minimizes least squares objective.R[48]
Mixclone2015NGSRD, BAFMUnified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.Python[38]
Sequenza2015NGSRDR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that maximizes log-posterior probability.Python, R[49]
FACETS2017NGSRD, BAFMEmploys Gaussian noncentral |$\chi ^2$| model to call absolute copy numbersR[50]
CLImAT-HET2017NGSRD, ACHSubclone-aware factorial HMM which extends CLImATMATLAB[51]
ichorCNA2017NGSRDRHHMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.R[52]
Sclust2018NGSRD, RDR, BAF, VAFO|$\omega $| and |$\pi $| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.C++[53]
ACE2019NGSRDEGrid searches through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.R[54]
NameYearDataFeature|$^{*}$|Type|$^{\dagger }$|DescriptionImplementationRef
ASCAT2010SNP arrayLRR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Copy number estimates are evaluated by squared differences to nearest integers.R[17]
OncoSNP2010SNP arrayRD, BAFMStudent t-mixture solved by EMMATLAB[43]
ABSOLUTE2012SNP arrayLRR, VAFMGaussian mixture using grid search through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by SCNA-fit likelihoods.R[18]
APOLLOH2012NGSRD, BAFHHMM with 18 genotype states and binomial emission. Parameters are estimated by EM.MATLAB[44]
CNAnorm2012NGSRDRMGaussian mixture fitted by EM. Number of components are chosen using AIC.R[45]
CLImAT2014NGSRD, ACHHMM with binomial and negative binomial emission model for ACs and RDs, respectivelyMATLAB[46]
THetA22014NGSRD, BAFMSolves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbersJava, MATLAB, Python[40]
TITAN2014NGSRDR, ACHSubclone-aware factorial HMM with joint emission model for ACs and RDsR[27]
PyLOH2014NGSRD, ACMJoint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.Python[47]
AbsCN-seq2014NGSRD, VAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that minimizes least squares objective.R[48]
Mixclone2015NGSRD, BAFMUnified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.Python[38]
Sequenza2015NGSRDR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that maximizes log-posterior probability.Python, R[49]
FACETS2017NGSRD, BAFMEmploys Gaussian noncentral |$\chi ^2$| model to call absolute copy numbersR[50]
CLImAT-HET2017NGSRD, ACHSubclone-aware factorial HMM which extends CLImATMATLAB[51]
ichorCNA2017NGSRDRHHMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.R[52]
Sclust2018NGSRD, RDR, BAF, VAFO|$\omega $| and |$\pi $| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.C++[53]
ACE2019NGSRDEGrid searches through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.R[54]

Abbreviations:|$^{*}$|LRR, log R ratio; RD, read depth; RDR, read depth ratio; BAF, B-allele frequency; AC, allele count; VAF, variant allele frequency; |$^{\dagger }$|E, heuristics-based method; M, mixture-based method; H, HMM-based method; O, Other.

Table 2

SCNA-driven tools for tumor purity and absolute copy number estimation

NameYearDataFeature|$^{*}$|Type|$^{\dagger }$|DescriptionImplementationRef
ASCAT2010SNP arrayLRR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Copy number estimates are evaluated by squared differences to nearest integers.R[17]
OncoSNP2010SNP arrayRD, BAFMStudent t-mixture solved by EMMATLAB[43]
ABSOLUTE2012SNP arrayLRR, VAFMGaussian mixture using grid search through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by SCNA-fit likelihoods.R[18]
APOLLOH2012NGSRD, BAFHHMM with 18 genotype states and binomial emission. Parameters are estimated by EM.MATLAB[44]
CNAnorm2012NGSRDRMGaussian mixture fitted by EM. Number of components are chosen using AIC.R[45]
CLImAT2014NGSRD, ACHHMM with binomial and negative binomial emission model for ACs and RDs, respectivelyMATLAB[46]
THetA22014NGSRD, BAFMSolves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbersJava, MATLAB, Python[40]
TITAN2014NGSRDR, ACHSubclone-aware factorial HMM with joint emission model for ACs and RDsR[27]
PyLOH2014NGSRD, ACMJoint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.Python[47]
AbsCN-seq2014NGSRD, VAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that minimizes least squares objective.R[48]
Mixclone2015NGSRD, BAFMUnified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.Python[38]
Sequenza2015NGSRDR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that maximizes log-posterior probability.Python, R[49]
FACETS2017NGSRD, BAFMEmploys Gaussian noncentral |$\chi ^2$| model to call absolute copy numbersR[50]
CLImAT-HET2017NGSRD, ACHSubclone-aware factorial HMM which extends CLImATMATLAB[51]
ichorCNA2017NGSRDRHHMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.R[52]
Sclust2018NGSRD, RDR, BAF, VAFO|$\omega $| and |$\pi $| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.C++[53]
ACE2019NGSRDEGrid searches through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.R[54]
NameYearDataFeature|$^{*}$|Type|$^{\dagger }$|DescriptionImplementationRef
ASCAT2010SNP arrayLRR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Copy number estimates are evaluated by squared differences to nearest integers.R[17]
OncoSNP2010SNP arrayRD, BAFMStudent t-mixture solved by EMMATLAB[43]
ABSOLUTE2012SNP arrayLRR, VAFMGaussian mixture using grid search through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by SCNA-fit likelihoods.R[18]
APOLLOH2012NGSRD, BAFHHMM with 18 genotype states and binomial emission. Parameters are estimated by EM.MATLAB[44]
CNAnorm2012NGSRDRMGaussian mixture fitted by EM. Number of components are chosen using AIC.R[45]
CLImAT2014NGSRD, ACHHMM with binomial and negative binomial emission model for ACs and RDs, respectivelyMATLAB[46]
THetA22014NGSRD, BAFMSolves maximum likelihood mixture decomposition problem by efficient enumeration of interval count matrix denoting copy numbersJava, MATLAB, Python[40]
TITAN2014NGSRDR, ACHSubclone-aware factorial HMM with joint emission model for ACs and RDsR[27]
PyLOH2014NGSRD, ACMJoint modeling of RD and B-allele count with Poisson and binomial distribution, respectively. Parameters are estimated by EM.Python[47]
AbsCN-seq2014NGSRD, VAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that minimizes least squares objective.R[48]
Mixclone2015NGSRD, BAFMUnified probabilistic framework using Poisson mixture for RD and binomial for BAF. Parameters are estimated by EM.Python[38]
Sequenza2015NGSRDR, BAFEGrid searches through |$\omega $| and |$\pi $|⁠. Finds optimal solution that maximizes log-posterior probability.Python, R[49]
FACETS2017NGSRD, BAFMEmploys Gaussian noncentral |$\chi ^2$| model to call absolute copy numbersR[50]
CLImAT-HET2017NGSRD, ACHSubclone-aware factorial HMM which extends CLImATMATLAB[51]
ichorCNA2017NGSRDRHHMM using Student t-mixture as an emission model for RDRs. Parameters are estimated by EM.R[52]
Sclust2018NGSRD, RDR, BAF, VAFO|$\omega $| and |$\pi $| are estimated through iterative conditional optimization. Allows estimation of segment-wise subclonal copy numbers. Subclonal prevalences are also inferred by mutational clustering based on smoothing splines.C++[53]
ACE2019NGSRDEGrid searches through |$\omega $| and |$\pi $|⁠. Solutions are evaluated by the discrepancy between observed and expected RDRs.R[54]

Abbreviations:|$^{*}$|LRR, log R ratio; RD, read depth; RDR, read depth ratio; BAF, B-allele frequency; AC, allele count; VAF, variant allele frequency; |$^{\dagger }$|E, heuristics-based method; M, mixture-based method; H, HMM-based method; O, Other.

The denomiator |$D$| is given as
(8)
where |$\omega $| represents genomewide average copy number. Upon this formulation, we describe four representative types of SCNA-driven approaches in this section.

Heuristics-based methods Absolute copy numbers should be nonnegative integers. Therefore, a simple assessment is possible for a candidate solution by evaluating how much the solution is close to nonnegative integers. Since a simple analytic solution (⁠|$\widehat{\omega ^{A}_l}$|⁠, |$\widehat{\omega ^{B}_l}$|⁠) for any fixed (⁠|$\omega $|⁠, |$\pi $|⁠) can be derived by Equations (6) and (7), and biologically plausible range of (⁠|$\omega $|⁠, |$\pi $|⁠) can be specified, we can easily think of grid-searching through the space of (⁠|$\omega $|⁠, |$\pi $|⁠) to find optimal solution. ASCAT [17] evaluates each grid-search solution by summing the squared difference between inferred copy numbers (⁠|$\widehat{\omega ^{A}_l}$|⁠, |$\widehat{\omega ^{B}_l}$|⁠) and the corresponding nearest nonnegative integers. ACE [54] first computes expected RDR signal for all plausible integer copy numbers by Equation (6) and evaluates each solution using the difference between the true RDR signal and nearest expected RDR signal. Similarly, Sequenza [49] finds the optimal (⁠|$\omega $|⁠, |$\pi $|⁠) that maximizes the log-posterior probability of the fit by grid-search.

Mixture-based methods LRR, RD or RDR values of genomic segments with the same copy state would be clustered together, which led to mixture-based methods. Solving the mixture model gives a set of putative copy states, and the states are further annotated by absolute copy numbers. For example, ABSOLUTE [18] is a prominent Gaussian mixture-based approach applied for pan-cancer absolute copy number analysis. Since expected LRR value can be parameterized by (⁠|$\omega $|⁠, |$\pi $|⁠) (Equation (6)), we have a different mixture model for each configuration of (⁠|$\omega $|⁠, |$\pi $|⁠). Through grid-search, ABSOLUTE obtains the optimal (⁠|$\omega $|⁠, |$\pi $|⁠) that shows the best SCNA-fit log-likelihood scores.

BAF values can also be incorporated in the mixture model since SNP loci sharing the same genotype will produce similar BAF values. PyLOH [47] jointly models RD and BAF values with Poisson and binomial mixture models, respectively.

HMM-based methods Since an SCNA event usually spans a large genomic stretch spanning multiple SNPs, those SNPs will share the same genotypes. This spatial correlation can be exploited to get more precise signals of allelic frequencies. While simple aggregation of read counts along the segment [55] was shown to be effective, more elegant approaches using hidden Markov model (HMM) have been developed.

In general, such HMM models regard the empirically plausible set of genotypes as a set of hidden states. For example, APOLLOH [44], CLIMAT [46] and TITAN [27] propose an 18-, 20- and 21-state model, respectively. Given a genotype as a hidden state, emission probabilities are defined with probability distributions. For example, RDs and log copy ratios are often modeled with Poisson and Gaussian distribution, respectively, and B-allelic read counts are modeled with a binomial distribution. The transition matrix can be inferred from the data [46], or a fixed matrix is used where the transition probabilities are defined to exponentially decay as distance increases [27, 44].

Subclone-aware methods Few SCNA-driven methods, namely TITAN [27], Sclust [53] and THetA2 [40], takes subclones into consideration. TITAN is a factorial HMM-based approach that explicitly models three populations with respect to each genomic segment as in SNV-driven approaches: (1) normal cells, (2) tumor cells without subclonal SCNA and (3) tumor cells with subclonal SCNA.
(9)
(10)
where |$p_l$| is the proportion of tumor cells harboring SCNA at locus or segment |$l$|⁠. Similar formulation has been adopted by Sclust, a unified framework that estimates absolute (subclonal) copy numbers and subclonal prevalences. On the other hand, THetA2 directly solves a mixture decomposition problem, by letting an observed RD vector expressed as a product of integer-valued interval count matrix |$C$| and real-valued genome mixing vector|$\mu $|⁠, which denote copy numbers of genomic segments in each subclone and subclonal prevalences, respectively. The decomposition is done by an efficient enumeration of the solution space of (⁠|$C$|⁠, |$\mu $|⁠).

Perspectives

The strength of SNV-driven approaches is that subclone-wise functional analyses can be readily conducted since an SNV can usually be translated into a functional alteration of a specific gene. However, the high variance of allele frequencies, due to insufficient sequencing depth or experimental bias, is one of the major hurdles. Moreover, discrepancies of SNVs captured from various callers also undermine the robustness of SNV-driven approaches [56]. On the contrary, SCNA-driven approaches can work well with shallow sequencing depth because an SCNA usually spans a large genomic interval rather than a point, resulting in much more reads supporting each event. They still have shortcomings in terms of analytic resolution, as most of their application is restricted to analyzing tumor purity or absolute copy number while not accounting for tumor subclones.

Since most cells of an individual share almost identical genomic blueprints, one of the fundamental limitations of genome-based approaches is that we cannot distinguish distinct cell types using genomic features. Therefore, for a full characterization of TH including microenvironment surrounding a tumor, epigenomic and transcriptomic features are necessary since they are known to show cell-type specificity.

Transcriptome-based tumor heterogeneity

A growing body of evidence suggests that epigenetic aberrations are nearly ubiquitous in a wide variety of cancer types [57]. Subjects of the epigenetic alterations commonly include DNA methylation, histone modification and, more broadly, chromatin state. Among them, aberrant DNA methylation exemplified by focal hypermethylation [58–62] and widespread hypomethylation [63–65] have been the most comprehensively characterized, yet not fully understood, phenomenon in cancer. Although histone modifications and chromatin states also have been of great interest in early studies, the extent of their heterogeneity across tumor cells largely remains to be explored. Thus, in this section, we mainly discuss how the heterogeneity of DNA methylation can be exploited to uncover the heterogeneity of tumor samples.

Underlying biological features

Investigating the ITH of DNA methylation

The earliest report of ITH of DNA methylation (mITH) was by Feinberg and Vogelstein [63], along with an insightful conjecture that mITH arises from a certain portion of cancer cells having defects in methylation maintenance. In the past decades, this hypothesis has gradually been consolidated in diverse types of cancers [66, 67]. The active introduction of the bisulfite sequencing techniques allowed the methylation state of individual CpGs to be analyzed in an unprecedented resolution and coverage. Most importantly, analyzing a single sequencing read harboring multiple CpGs enabled haplotype-like analysis of DNA methylation, and the phased pattern of DNA methylation is called `epiallele'. The techniques other than sequencing-based methods exploited by the early DNA methylation studies could only inform average methylation levels for each CpG, limiting the discussion of mITH at bulk-cell level. On the contrary, since each sequencing read is originated from a single cell, we can consider an epiallele as a surrogate of each cell. Therefore, analyzing methylation patterns of sequencing reads is conceptually equivalent to analyzing the local epigenomic state of a collection of single cells.

Dual perspectives on epigenetic heterogeneity in cancer

The two apparently contradictive hallmarks of epigenomic modification—their stable inheritance and dynamic regulation—make it one of the most intriguing clues for TH.

Somatic heritability The first aspect relies on the somatic heritability of epigenomic modifications, which is most analogous to the view of genomic TH because it mainly focuses on subclonal alterations of epigenomic marks, which corresponds to subclonal SNVs or SCNAs. In this viewpoint, a subclonal epigenomic mark can be either of passenger or driver of tumor progression as its genomic counterpart. The selective growth advantage given by `driver' epigenomic marks leads to a more aggressive proliferation, which finally results in an epigenetic subclone that possesses identical (or nearly identical) epigenetic alterations. The experimental evidence for epigenetic subclones was reported in diffuse large B-cell lymphomas by showing that the diversity of methylation patterns were greatly reduced (i.e. patterns ‘converged’) in relapsed samples after chemotherapy [68].

Stochasticity Even under appropriate regulation, the average DNA methylation level of a bulk sample at a specific CpG site seems to show slight fluctuations over time. It can be compared with a ball traveling back and forth at the bottom of the valley [69]. Often, the machinery is perturbed in a transformed or precancerous cell, giving flatter regulatory potential landscape that imposes only a weak restoring force on the methylation level. Thus, the methylation level becomes more unstable and follows a trajectory rather resembling a random-walk trajectory.

Tumor purity, infiltrating immune cells and microenvironment

Epigenomic features have far more advantages than genomic features for the computational microdissection of noncancerous cells in a sample. Since the cells of an individual contain almost identical genomic blueprints, genomic regions showing distinctive differences between tumoral and non-tumoral cells are restricted to particular regions with somatic alterations.

Although numerous successful methodologies based on such scarce genomic features have been developed, it can be complemented with epigenome-based algorithms exploiting the widespread difference of epigenomic marks between tumoral and non-tumoral cells, especially when compared with that of genomic features. Furthermore, it is known that the epigenomes of different cell types, such as immune cells at different hematopoietic lineage, are strikingly different. We hereafter broadly refer to the estimation of the composition of cell types comprising tumor microenvironment as `cell-type deconvolution'. In the following subsection, we discuss computational tumor purity estimation and cell-type deconvolution algorithms utilizing epigenomic features, mainly DNA methylation.

Computational problem formulation

Characterization of epigenetic subclones

There are a handful of tools that utilize somatic heritability of epialleles to seek the evidence of tumor subclones (Figure 2, left panel). The underlying assumption of these tools is that a subclone harbors (nearly) identical epialleles so that we can infer epigenetic subclonal structures based on them.

Epigenome-based approaches for TH characterization. Black and white circles denote methylated and unmethylated CpGs, respectively. The left panel illustrates the viewpoint of epiallele-driven approaches. An epiallele is defined as a set of consecutive binary methylation states that spans a short genomic stretch, which usually corresponds to a methylation pattern appearing in a single sequencing read. In this point of view, epialleles show dual characteristics at the same time: somatic heritability and stochasticity. Note that epialleles in subclonal cells are similar enough but not always identical. These patterns of methylation states can be obtained by various sequencing methods, and the frequencies of each pattern are used to analyze epigenetic subclones or measure the extent of epigenomic stochasticity in the tumor biopsy. The right panel depicts the viewpoint of epigenome-based approaches for tumor purity estimation and cell-type deconvolution. The basis of these approaches is the selection of informative CpG loci whose methylation level distinguishes particular cell types from the others. Using the methylation levels of informative CpG loci, the purity of the sample can be inferred by clustering-based approaches, and the composition of constituent cell types can be inferred by reference-based or reference-free methods.
Figure 2

Epigenome-based approaches for TH characterization. Black and white circles denote methylated and unmethylated CpGs, respectively. The left panel illustrates the viewpoint of epiallele-driven approaches. An epiallele is defined as a set of consecutive binary methylation states that spans a short genomic stretch, which usually corresponds to a methylation pattern appearing in a single sequencing read. In this point of view, epialleles show dual characteristics at the same time: somatic heritability and stochasticity. Note that epialleles in subclonal cells are similar enough but not always identical. These patterns of methylation states can be obtained by various sequencing methods, and the frequencies of each pattern are used to analyze epigenetic subclones or measure the extent of epigenomic stochasticity in the tumor biopsy. The right panel depicts the viewpoint of epigenome-based approaches for tumor purity estimation and cell-type deconvolution. The basis of these approaches is the selection of informative CpG loci whose methylation level distinguishes particular cell types from the others. Using the methylation levels of informative CpG loci, the purity of the sample can be inferred by clustering-based approaches, and the composition of constituent cell types can be inferred by reference-based or reference-free methods.

Methclone [70] detects the compositional change of epialleles (epiallele shift) between two related samples (e.g. primary and recurrent tumor samples), in order to identify epigenomic subclonal evolution. Bayesian epiallele detection (BED) [71] is the first model to tolerate the mild stochasticity of epialleles by Bayesian inference. Given a locus, BED treats a true epiallele as a latent variable and assumes that an observed set of epialleles was generated from it according to a stochastic process involving the erroneous inheritance of methylation states. PRISM [72] is a more direct approach to deal with epigenetic subclones by applying the idea of SNV-based subclonal inference methods for epialleles. To simplify the approach, it only makes use of fully unmethylated and fully methylated epialleles as a `fingerprint' epialleles that distinctly mark a subclone, then the proportions of the fingerprints are clustered as in SNV-based methods.

Stochasticity measures for DNA methylation

As an initial attempt to standardize the quantification of mITH with sequencing data, Xie et al. [73] proposed an information-theoretic measure called methylation entropy (ME), which is calculated as the entropy of epialleles originating from a single genomic locus. In this section, let |$\mathcal{X} = \{\textbf{x}_i\}$| denote a set of unique epialleles appearing at particular genomic region. Also, let |$N(\textbf{x}_i)$| and |$P(\textbf{x}_i)$| denote the number and fraction of sequencing reads supporting epiallele |$\textbf{x}_i$|⁠.
(11)
Landan et al. [67] developed another measure named epipolymorphism (PM), which also captures the amount of heterogeneity of DNA methylation for a given genomic region.
(12)
The relationship between ME and PM is analogous to that between entropy and Gini index used for decision trees, and they can be considered as similar measures of mITH in general.
The importance of local disorder of DNA methylation was highlighted in CLL, and a new measure of mITH, proportion of discordant reads PDR, was introduced accordingly [74]. PDR is defined as a fraction of reads carrying CpGs in discordant methylation states (i.e. containing both methylated and unmethylated CpGs in a single read) with respect to all reads mapped to the region. Let |$\textbf{x}_u$| and |$\textbf{x}_m$| denote fully unmethylated and fully methylated epialleles, respectively, then PDR can be written as
(13)
A high PDR implies an admixture of cells with disordered DNA methylation states and is shown to be associated with adverse clinical outcomes, presumably through epigenetic diversification of subclones that finally leads to a higher adaptive potential of cancer.
Methylation haplotype load (MHL) is a metric measuring the co-methylation level within a genomic region [75]. It is first designed to identify genomic blocks with multiple CpG sites showing high concordance of methylation, called methylation haplotype blocks. MHL has unique characteristics compared with the other metrics in that it accounts for every sub-epialleles presented, as it is defined as a weighted average of the fraction of fully methylated epialleles of length |$i$| as below.
(14)
where |$w_i$| is defined as a function of length |$i$| that governs how we weight the fractions of varying length and |$\textbf{x}_m^i$| denotes the fully methylated epiallele of length |$i$|⁠. For example, |$w_i = i$| or |$w_i = i^2$| are general options to give more weight to the concordance of longer epialleles.
Methylation combinatorial entropy (MCE) is defined as a natural logarithm of the number of possible configurations of given epialleles [70].
(15)
For example, when all the epialleles are the same, MCE is 0 since there is only one possible permutation of epialleles. On the contrary, when there exist all possible epialleles with same amount, MCE becomes maximal.

DNA methylation-based tumor purity estimation

The main idea of DNA methylation-based tumor purity estimation is to determine `informative' CpG loci that distinguish cancer cells from adjacent normal cells (Figure 2, right panel). The simplest measure is called leukocyte unmethylation for purity (LUMP), which is computed using the average methylation level of predefined 44 CpG sites that are known a priori to be consistently unmethylated in leukocytes [76]. Other ways to select informative CpG sites is to carry out statistical decision on each CpG locus to check if cancer and normal samples have a significant and stable difference of DNA methylation in that locus, based on Wilcoxon rank-sum test [77], AUC [78] or Z-score [79]. Once informative CpG loci are chosen, downstream purity estimation is rather straightforward since the methylation levels of each informative locus directly indicate the mixing proportion of cancer and normal cells. A point estimate of tumor purity can be derived by simple averaging of mixing proportions [78], or by clustering them with Gaussian kernel density estimation [77] or beta mixture model [79].

Since the majority of the methods only focus on pointwise average methylation levels, the development of NGS-based methods that fully utilize the phasing information of DNA methylation on a single sequencing read should be encouraged. MethylPurify [80] is an excellent pioneering example that selects informative genomic bins based on the EM-based splitting of bisulfite sequencing reads.

DNA methylation-based cell-type deconvolution

Cell-type deconvolution algorithms account for the detailed composition of constituent normal cell types, such as diverse types of immune cells and cancer-associated fibroblasts (Figure 2, right panel). The algorithms can be divided into two groups: (1) reference-based methods utilizing a catalog of reference methylomes from sorted cells and (2) reference-free methods that decompose observed methylation profiles into a linear combination of putative reference methylation profiles in an unsupervised manner. One of the first reference-based methods was proposed by Houseman et al. [81], which involves a constrained projection of observed methylation profiles onto reference profiles to obtain the estimates of cell-type proportions. Another popular approach is called MethylCIBERSORT [82]. It selects informative CpG sites that distinguish cell types (as in tumor purity estimation) and feeds their methylation levels into CIBERSORT [83]. Reference-free methods widely adopt matrix factorization techniques, by which the methylation levels are decomposed into latent reference methylation profiles and corresponding cell type proportions with some constraints. For example, RefFreeCellMix [84] uses a simple form of constraint which enforces methylation levels to lie between 0 and 1, and the sum of cell-type proportions to be less than 1. On the other hand, MeDeCom [85] adds a constraint that makes methylation levels close to 0 and 1.

Perspectives

Although the duality of epigenomic marks offers the most compelling possibilities for TH characterization, we still lack a clear conceptual background. It remains to be tested whether epigenetic subclones harboring somatically heritable epigenomic marks are prevalent throughout various cancer types. Meanwhile, investigating the stochasticity of epigenomic marks is not straightforward due to its probabilistic nature, which hinders the experimental reproducibility at the cellular level. Moreover, its quantification needs careful consideration since the variability vanishes and thus cannot be captured with simple `averaging-out' measures such as DNA methylation level. Nevertheless, cancer epigenomics is a vibrant field of research, and we are gradually being equipped with experimental and analytic methodologies, which will give us a clearer view of epigenetic TH in the near future.

Epigenomic features other than DNA methylation, such as histone modification, chromatin structure or even histone protein itself, may also constitute orthogonal layers of epigenomic TH. Notably, heterogeneous levels of the linker histone H1.0 were reported in several cancer types, where its deprivation was associated with increased self-renewal of tumorigenic cells [86].

Characterizing epigenomic TH is crucial as it serves as a glue that connects genomic TH and transcriptomic TH. We may discover hints of the mechanistic link between genomic and epigenomic TH by investigating the association of mutated epigenetic regulators and aberrant epigenomic marks. Similarly, the relationship between transcriptomic and epigenomic TH may be revealed by investigating the heterogeneity of the epigenomic landscape of regulatory regions.

Transcriptome-based tumor heterogeneity

Underlying biological features

Tumor heterogeneity, transcriptome and marker genes

Since the gene expression profile (GEP) of a tumor can be readily characterized by well-established platforms such as microarray and RNA-seq, it has been extensively studied for its potential as a biomarker for diseases. One successful application of transcriptomic biomarker is PAM50 signatures [87], a breast cancer subtype classification system based on the expression levels of 50 genes. Interestingly, the subtypes solely based on those genes are associated with prognosis, making it possible to design therapeutic regimens. However, selecting an optimal set of genes is not always straightforward.

What we observe from bulk cell sequencing is a crude admixture of reads from heterogeneous cell population [88], thereby it can be viewed as a blind source separation problem to seek for the independent effect of constituent cell types [89–91]. Depending on experimental design and sample quality, these compound signals may ruin the hypothesis testing or obscure possible novel findings. Unlike genomic and epigenomic features, transcriptomic features are basically continuous and it complicates the signal deconvolution using bulk transcriptome data. Nevertheless, utilizing marker genes would be an intuitive approach for the estimation of the composition of the non-tumoral part of a tumor biopsy [92].

In early studies, traditional markers for immunohistochemistry or reverse transcriptase-polymerase chain reaction were widely used. Recently, GEPs of purified cells are rapidly accumulating in public databases, which accelerate the discovery of marker genes and the development of algorithms utilizing them by various statistical approaches [93–95]. Although marker gene-based cell-type deconvolution methods can be easily applied with simple statistical approaches, their limitations are clear; the performance of such algorithm is limited by our prior knowledge.

Protein–protein interaction networks and pathways

Well-coordinated interactions between proteins are required to maintain normal cellular function. Efforts in identifying interaction partners of proteins in a high-throughput manner [96, 97] led to the compilation of valuable databases of protein–protein interaction networks [98, 99]. In bioinformatic analyses, these template networks are often used to define functional modules, which represent sets of proteins interacting together [100, 101]. Similarly, biological pathways denote the cellular mechanisms to convey the intra/extracellular information through the interactions of proteins to elicit an appropriate response of a cell upon various stimuli. To obtain a more direct functional portrait of a cell, several methods are proposed to infer the activity of biological pathways by aggregating individual gene expression values [102]. One advantage of handling biological pathways or modules is that their measured activity is relatively robust compared to the individual expression levels of its components. In terms of TH, we can also conceive the heterogeneity at levels of biological modules or pathways, which can be measured by information-theoretic approaches discussed in the next section.

Computational problem formulation

Purity estimation and cell-type deconvolution

Transcriptome-based purity estimation and computational cell-type deconvolution are advanced fields of TH studies (Figure 3, left panel; Table 3). Accordingly, there have been lots of methodological reviews and cross-comparison benchmark studies [116–120], including recent evaluation with single cell-based simulation [121]. In this section, we provide a brief overview of relevant computational methods.

Table 3

Transcriptome-based tools for tumor purity estimation and microenvironmental cell-type deconvolution

NameYearApplication|$^{*}$|Type|$^{\dagger }$|Description|$^{\ddagger }$|ImplementationRef
PERT2012CTDPNonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPsOctave[103]
ISOpure2013TPEPModels normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.MATLAB, R[104, 105]
DeMix2013TPEPLog-normal mixture model for tumor purity estimation using array-based expression dataC, R[106]
ESTIMATE2013TPESComputes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genesR[107]
UNDO2015TPEOLinear latent variable model to estimate tumor-stroma proportionsR[108]
CIBERSORT2015CTDR|$\nu $|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypesJava, R, Web[83, 109]
contamDE2016TPEPEmploys negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.R[110]
TIMER2016CTDRConstrained least squares regression to infer the abundance of six immune cell typesR, Web[111]
MCP-counter2016CTDSLog2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populationsR[112]
EPIC2017CTDRConstrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.R[113]
xCell2017CTDSAverage of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scoresR[114]
quanTIseq2019CTDRConstrained least squares regression with nonnegative and less-than-one constraints for cell type proportionR[115]
CIBERSORTx2019CTDRExtends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq dataJava, R, Web[109]
NameYearApplication|$^{*}$|Type|$^{\dagger }$|Description|$^{\ddagger }$|ImplementationRef
PERT2012CTDPNonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPsOctave[103]
ISOpure2013TPEPModels normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.MATLAB, R[104, 105]
DeMix2013TPEPLog-normal mixture model for tumor purity estimation using array-based expression dataC, R[106]
ESTIMATE2013TPESComputes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genesR[107]
UNDO2015TPEOLinear latent variable model to estimate tumor-stroma proportionsR[108]
CIBERSORT2015CTDR|$\nu $|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypesJava, R, Web[83, 109]
contamDE2016TPEPEmploys negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.R[110]
TIMER2016CTDRConstrained least squares regression to infer the abundance of six immune cell typesR, Web[111]
MCP-counter2016CTDSLog2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populationsR[112]
EPIC2017CTDRConstrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.R[113]
xCell2017CTDSAverage of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scoresR[114]
quanTIseq2019CTDRConstrained least squares regression with nonnegative and less-than-one constraints for cell type proportionR[115]
CIBERSORTx2019CTDRExtends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq dataJava, R, Web[109]

Abbreviations.|$^{*}$|CTD, cell-type deconvolution; TPE, tumor purity estimation. |$^{\dagger }$|P, probabilistic approach; R, regression-based approach; S, signature score-based approach; O, other. |$^{\ddagger }$|GEP, gene expression profile; CCLE, Cancer Cell Line Encyclopedia.

Table 3

Transcriptome-based tools for tumor purity estimation and microenvironmental cell-type deconvolution

NameYearApplication|$^{*}$|Type|$^{\dagger }$|Description|$^{\ddagger }$|ImplementationRef
PERT2012CTDPNonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPsOctave[103]
ISOpure2013TPEPModels normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.MATLAB, R[104, 105]
DeMix2013TPEPLog-normal mixture model for tumor purity estimation using array-based expression dataC, R[106]
ESTIMATE2013TPESComputes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genesR[107]
UNDO2015TPEOLinear latent variable model to estimate tumor-stroma proportionsR[108]
CIBERSORT2015CTDR|$\nu $|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypesJava, R, Web[83, 109]
contamDE2016TPEPEmploys negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.R[110]
TIMER2016CTDRConstrained least squares regression to infer the abundance of six immune cell typesR, Web[111]
MCP-counter2016CTDSLog2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populationsR[112]
EPIC2017CTDRConstrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.R[113]
xCell2017CTDSAverage of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scoresR[114]
quanTIseq2019CTDRConstrained least squares regression with nonnegative and less-than-one constraints for cell type proportionR[115]
CIBERSORTx2019CTDRExtends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq dataJava, R, Web[109]
NameYearApplication|$^{*}$|Type|$^{\dagger }$|Description|$^{\ddagger }$|ImplementationRef
PERT2012CTDPNonnegative maximum likelihood estimation based on latent dirichlet allocation framework. Introduces multiplicative perturbation vectors to model the perturbation of reference GEPsOctave[103]
ISOpure2013TPEPModels normal GEP as linear combination of known profiles of normal tissue sample. Enforces GEP of cancer cells to be similar to each other.MATLAB, R[104, 105]
DeMix2013TPEPLog-normal mixture model for tumor purity estimation using array-based expression dataC, R[106]
ESTIMATE2013TPESComputes stromal score, immune score and purity by ssGSEA using stromal/immune gene set with 141 genesR[107]
UNDO2015TPEOLinear latent variable model to estimate tumor-stroma proportionsR[108]
CIBERSORT2015CTDR|$\nu $|-support vector regression using LM22 leukocyte gene signature matrix, which contains 547 genes that distinguish 22 human hematopoietic cell phenotypesJava, R, Web[83, 109]
contamDE2016TPEPEmploys negative binomial model for RNA-seq read counts. Iteratively updates tumor cell proportions and differentially expressed gene sets.R[110]
TIMER2016CTDRConstrained least squares regression to infer the abundance of six immune cell typesR, Web[111]
MCP-counter2016CTDSLog2 geometric mean of marker gene expression values for 10 distinct microenvironmental cell populationsR[112]
EPIC2017CTDRConstrained least squares regression that gives more weight to marker genes having lower variance. Expressions are normalized by total mRNA content.R[113]
xCell2017CTDSAverage of transformed ssGSEA scores for cell-type specific gene sets from CCLE. Adopts spillover compensation to remove unwanted correlation between cell type scoresR[114]
quanTIseq2019CTDRConstrained least squares regression with nonnegative and less-than-one constraints for cell type proportionR[115]
CIBERSORTx2019CTDRExtends CIBERSORT to utilize GEPs from diverse sources and platforms, including single-cell RNA-seq dataJava, R, Web[109]

Abbreviations.|$^{*}$|CTD, cell-type deconvolution; TPE, tumor purity estimation. |$^{\dagger }$|P, probabilistic approach; R, regression-based approach; S, signature score-based approach; O, other. |$^{\ddagger }$|GEP, gene expression profile; CCLE, Cancer Cell Line Encyclopedia.

Transcriptome-based approaches for tumor heterogeneity characterization. The left panel illustrates the viewpoint of transcriptome-based computational deconvolution algorithms. The transcriptome data we observe is a mixture of the individual GEP of cells. Note that the expression levels of cell-type marker genes are represented by red bars in tumor microenvironmental cells. One approach to estimate the abundance of a particular cell type is to compute a signature score based on the marker genes exclusively expressed in such cells. Regression-based methods exploit reference GEPs to estimate the weight of linear combination across different cell types. Probabilistic approaches assume the expression levels are probabilistically distributed around reference GEPs and infer reference profiles and cell type abundances. The right panel illustrates the viewpoint of information-theoretic approaches. Each node of the network inside the cells denote each gene, and the redder circle represent the higher expression of the gene. These approaches measure the diversity of active signaling paths (paths with red circles) throughout the population of cells based on the protein–protein interaction network.
Figure 3

Transcriptome-based approaches for tumor heterogeneity characterization. The left panel illustrates the viewpoint of transcriptome-based computational deconvolution algorithms. The transcriptome data we observe is a mixture of the individual GEP of cells. Note that the expression levels of cell-type marker genes are represented by red bars in tumor microenvironmental cells. One approach to estimate the abundance of a particular cell type is to compute a signature score based on the marker genes exclusively expressed in such cells. Regression-based methods exploit reference GEPs to estimate the weight of linear combination across different cell types. Probabilistic approaches assume the expression levels are probabilistically distributed around reference GEPs and infer reference profiles and cell type abundances. The right panel illustrates the viewpoint of information-theoretic approaches. Each node of the network inside the cells denote each gene, and the redder circle represent the higher expression of the gene. These approaches measure the diversity of active signaling paths (paths with red circles) throughout the population of cells based on the protein–protein interaction network.

Signature score-based approaches To get a signature score for a cell type, these approaches aggregate the expression levels of corresponding marker genes. MCP-counter [112] simply uses geometric mean of marker gene expressions as the signature score for each cell type, while ESTIMATE [107] and xCell [114] utilize ssGSEA [122] scores to measure the collective enrichment of marker genes.

Regression-based approaches Given reference expression profiles of |$N_c$| constituent cell types, a bulk transcriptome profile |$\textbf{y}$| with |$N_g$| genes can be expressed as a linear combination of reference profiles as below:
(16)
where |$\textbf{C}$| is |$N_g \times N_c$| matrix of reference profile and |$\textbf{p}$| denotes cell type proportions. Here, |$\textbf{p}$| can be estimated through either of least squares optimization [111, 113, 115] or |$\nu $|-support vector regression [83, 109]. For the least squares methods, whose goal is to minimize |$||\textbf{y} - \textbf{C} \times \hat{\textbf{p}}||^2$|⁠, it is generally assumed that the elements of |$\hat{\textbf{p}}$| are nonnegative and also sum to one as they denote the proportions of cell types.

Probabilistic approaches Unlike regression-based approaches, these approaches explicitly deal with the likelihood of the probabilistic models for observed expression values or read counts and let the inference framework jointly infer the expression profiles and cell-type proportions. For example, DeMix and DeMixT formulate the observed array-based expression as a linear combination of log-normal random variables and conducts iterative optimization [106, 123]. Moreover, there have been methods that account for the perturbation of reference expression profiles arising from various microenvironmental factors. ISOpure posits an expression profile of normal cells as a linear combination of reference expression profiles of normal cells [104]. On the other hand, PERT introduces a perturbation vector, and it is multiplied to the reference profile [103].

Information-theoretic approaches

Aggregation of individual gene expression levels using pathway or network information has been widely applied for a robust systematic analysis of transcriptome throughout the past decades [100]. Interestingly, increased entropy at a level of signaling pathway was suggested to have arisen from the admixture of the heterogeneous and undifferentiated cell population, and it was shown to be a marker for cancer metastasis or drug sensitivity [124, 125]. Such measure of uncertainties in pathways, called `signaling entropy', is formulated as below (Figure 3, right panel). Given a well-established PPI network, an edge weight |$w_{ij}$| between gene |$i$| and |$j$| is defined as the product of (normalized) expression values of both genes according to the mass action principle [126]. The weight matrix |$W$| comprising values of |$w_{ij}$| is subsequently row-normalized to produce stochastic matrix |$P$|⁠. Let |$\pi $| the stationary distribution of |$P$| (i.e. |$\pi = P\pi $|⁠). Then, signaling entropy |$S$| is defined as
(17)
where |$\mathcal{N}_i$| denotes the neighbors of gene |$i$|⁠. By definition, signaling entropy is a transcriptome-level measure of the average promiscuity of PPI, where proteins at the core of the network are overweighed.
A similar measure, transcriptome-based ITH (tITH), captures the transcriptomic TH on a relative basis compared with normal samples [127]. Given stochastic matrices |$P^{T}$| and |$P^{N}$| for tumor and normal samples, respectively, network-based Jensen–Shannon divergence (nJSD) is defined as below.
(18)
where |$JSD$| denotes conventional Jensen–Shannon divergence. Suppose a state |$A$| with a maximally ambiguous transcriptome state where the expressions of all genes are the same. Accordingly, tITH is computed as the relative distance between normal and tumor samples.
(19)

Apart from the expression levels of genes, their alternative splicing patterns are also heterogeneous across cells [128]. In this regard, the concept of spliceome ITH (sITH) was proposed recently [129]. In brief, sITH captures the difference of uncertainties in splice junction usage with JSD, and the pan-cancer analysis revealed that increased sITH was shown to be associated with cancer progression and unfavorable outcomes.

Perspectives

Since the transcriptome serves as the most immediate and comprehensive surrogate of the functionality of a cell, characterizing transcriptomic TH is crucial to draw clear functional consequences of TH. However, the fundamental limitation of transcriptomic data is that gene expression levels are not discrete by nature, unlike genomic or epigenomic features. To separate the observed continuum of gene expressions into latent constituent distributions, one promising approach is to exploit information of upstream regulatory mechanisms. We might be able to use relevant genomic and epigenomic features to build a prior distribution of expression levels of genes, even though we still lack a systematic understanding of the gene regulatory mechanism. Conversely, it may be feasible to filter out nonfunctional alterations by integrating transcriptome-based TH characterization with other omics-based methods, facilitating the clear explanation of the driver event of the subclones.

Applications of single-cell sequencing

Although numerous approaches based on bulk sequencing data have been proposed, it is worth noting that the most straightforward way to investigate TH is separate molecular characterization of every single cell in a tumor. Understandably, single-cell DNA sequencing have showed its wide application in characterization of clonal diversity and evolution [130, 131] for various cancer types including breast cancer [132], clear cell renal carcinoma [133] and leukemia [134, 135]. Relevant computational tools include Single Cell Genotyper [136], SCITE [137], OncoNEM [138], SiFit [139], B-SCITE [140], SiCloneFit [141] and PhISCS [142]. Single-cell RNA sequencing has also been increasingly used. Since the pioneering study on primary glioblastoma cells (without microenvironmental cells) that reported cell-to-cell variability of gene expression and splicing patterns [143], the field has been broadened to characterize interaction among cells constituting microenvironment of metastatic melanoma [144], or to suggest higher-resolution subtype classifications of colorectal cancer using single cell-based transcriptomic profiles [145]. Meanwhile, some of the recent studies have been utilizing methods for single-cell trajectory inference for TH characterization. For example, Monocle [146, 147] was used to reveal the evolutionary trajectory in glioblastoma [148, 149], liver cancer [150], cutaneous T cell lymphoma [151] and uveal melanoma [152]. On the other hand, Wanderlust [153] was used to uncover the dynamics of single-cell transcriptomes in melanoma [154] over the pseudotime dimension.

Single-cell approaches still have critical technical limitations such as allelic dropout or doublets, which result in missing values in genotype calls and thus hamper robust analyses of TH. One promising approach to deal with these experimental artifacts is the integration of single-cell and bulk sequencing data. For example, ddClone [36] infers subclonal structures by a typical SNV-based scheme, but the clustering of VAF values is informed with single-cell genotypes. B-SCITE [140] and PhISCS [142] are subclonal phylogeny-aware methods that also combine the single-cell and bulk sequencing methods. Moreover, such integrative approaches are also effective for transcriptome-based cell-type deconvolution [155–157].

Recently, intriguing observations are emerging along with the technological advances that enabled the simultaneous measurement of multiple omics from a single cell. For example, G&T-seq [158] and TARGET-seq [159] revealed that the genomic and transcriptomic TH were tightly linked within a cell. In addition, joint profiling of single-cell transcriptome and chromatin states with scCAT-seq enabled the characterization of cellular subpopulations with different regulatory states [160]. Collectively, these findings recapitulate the necessity of integrative multi-omics characterization of TH. Currently, the single-cell multimodal sequencing methods are inadequate to replace bulk sequencing in terms of scalability and completeness of the measurements. Therefore, based on the conceptual background established by single-cell multi-omics studies, it is essential to implement multi-omics integrative methods based on bulk data.

Towards multi-omics and machine learning approaches for tumor heterogeneity chracterization

Integrative analysis of parallel evolution of genomes and epigenomes in cancer

The striking similarities and differences between genetic evolution (inferred from SNV or SCNA) and epigenetic evolution (inferred from DNA methylation) have offered an exciting avenue for TH studies: taking advantage of the complementarity between the two omics layers. The similarity between phylogenetic and phyloepigenetic trees have been reported in a wide variety of cancer types [161–166], opening up the possibility of an integrative analysis of phylogenetic and phyloepigenetic evolutionary histories. However, even though it has been shown that the epigenomic-genomic co-evolution [167] occurs, it remains questionable whether the co-evolution is ubiquitous across cancer types. Interestingly, in AML, the two layers of heterogeneity were found to be not necessarily linked to each other [168]. It implied that the genetic and epigenetic evolution in cancer might occur independently in some cases, yet not denying the co-evolution of those two omics layers in other cases. The results collectively imply that the two omics layers evolve `in parallel', which in turn can show convergent or distinct evolutionary histories depending on cancer types.

To our knowledge, computational methods utilizing the complementarity of epigenetic and genetic cancer evolution are not yet developed. However, since high-throughput data for cancer genomes and DNA methylomes have been rapidly accumulating, it would be an attractive field of research to conduct a joint inference of phylogenetic and phyloepigenetic trees. Although the details must be elaborated in further researches, it may be implemented by iterative and alternative inferences of phylogenetic and phyloepigenetic trees while constraining one tree by the other tree based on the candidate subclone that is likely to be appearing in both of the trees.

Integrating epigenomes and transcriptomes for cell-type deconvolution

We have already discussed various approaches using epigenome or transcriptome to infer the proportions of diverse cell types comprising tumor microenvironment. Despite those successful applications of individual-omics approaches, we still have room to improve the accuracy and robustness through integrating multi-omics molecular profiles. EDec [169] is a cell-type deconvolution approach that combines transcriptome and DNA methylome. It first selects a set of informative CpG loci with a catalog of reference methylomes of known cell types and infers underlying cell-type proportions. Subsequently, cell-type-specific transcriptome profiles are inferred using the proportions, which enables differential expression analysis done for each cell type separately.

ML approaches and applications

Most TH characterization approaches discussed above are based on explicit statistical modeling and manual feature selection, and their performances largely depend on our prior knowledge. Fortunately, as many large-scale studies of TH have been conducted by various international consortia [170, 171], we are gradually reaching a sufficient amount of relevant data to train a model to automatically learn the common characteristics of TH. In this section, we provide some insights on several possible approaches and applications of ML frameworks for future TH studies and introduce relevant works.

Data-driven feature extraction Feature extraction is the core principle behind how a machine learns and generalizes, where only the features relevant to the task are extracted from the noisy data. Clinical molecular profile data suffer from extreme noise arising from biological and technical variability. However, as the number of data grows, there will be a greater chance that an ML model captures biologically meaningful and common modules of TH, hopefully suggesting novel insights about cancer evolution. Recent approaches called REVOLVER [172] and Hintra [39] adopt the transfer learning paradigm to leverage the shared characteristics across a cohort of tumor samples. Both algorithms jointly optimize multiple tumor phylogenies across patients to capture repeated evolutionary trajectories. By penalizing unusual orderings of driver events that do not correlate well with other orderings in the cohort, the model is guided to find the recurrent evolutionary orderings. Since the method can extract recurrent snippets of driver events in a particular order, it allows us to dissect and arrange driver events in the correct order even if they were ambiguously intermingled inside a single subclone and could not be segregated by an inference with a single sample alone. This application of ML for fine-grained phylogenetic reconstruction of driver events is comparable to a super-resolution in the field of computer vision in a sense that the model extracts recurring patterns from a collection of data and exploit the patterns to get a higher-resolution view of a sample.

Ensemble learning Ensemble learning is an ML technique that improves the performance and generalization of models by combining multiple models trained to capture different aspects of data. As many different statistical approaches have been proposed for various TH characterization tasks, it will be interesting to see if ensembling those approaches improves the performance. Pioneering examples of ensemble learning methods include consensus purity estimation, defined as a median of normalized purity estimates from ESTIMATE, ABSOLUTE, LUMP and immunohistochemistry [76], and RF_Purify, a meta-learning approach that trains a random forest model predicting ABSOLUTE and ESTIMATE tumor purity estimates [173].

Multi-omics TH as a covariate We expect that information about TH extracted from multi-omics data (e.g. tumor purity, number of subclones, evolutionary histories or microenvironment composition) may serve as a novel and robust features for various clinical tasks such as cancer subtyping, prognosis prediction and optimal therapy design. Conventional molecular subtypes [87, 174] are usually defined using individual-omics measurements of bulk tumor samples, and thus the robustness of subtyping is predominantly confounded by the composition of cells. Moreover, it has been reported that the characteristics of cancer subtypes can be determined at a multi-omics level [175], which implies the existence of intrinsic subtypes that might have been neglected by individual-omics subtypes. Incorporating extracted multi-omics TH as a covariate in such tasks will provide more precise and robust results in clinical cancer researches. For example, conventional marker gene-based subtyping strategies may be refined to be dynamically adjusted according to the microenvironmental cell-type composition of the biopsy.

Conclusion and perspective

A comprehensive understanding of multi-omics TH will lead to a more precise therapeutic strategy that precisely targets every subclone at the same time. Furthermore, computationally deconvolving the mixed effects of various cancer-associated molecular aberrations may facilitate the experimental reproducibility and unveil hidden association between an aberration and cancer. Due to the complexity and large size of the molecular data, this problem can only be tackled by bioinformatic approaches.

Development of computational methods is possible only with a solid conceptual basis of multi-omics TH, and thus we need to continuously seek for the experimental evidence in parallel with algorithmic advancement. Fortunately, thanks to the development of single-cell sequencing technologies, we are gaining confidence in the existence of interplay between multiple omics layers that comprise TH [176]. While single-cell approaches have shown their great potential for exploratory analyses, we still need to establish systematic strategies based on bulk-cell multi-omics data for scalable and clinically applicable practice of cancer therapies. As we have discussed above, ideas proposed by numerous methods exploiting individual-omics layers may be good starting points for the integrative methods.

One another way to enhance our understanding of multi-omics TH is to employ data-driven ML approaches. Conventional TH characterization is usually sample-specific, which does not account for common characteristics of each cancer type. By utilizing accumulating molecular profiles of tumor samples, novel ML approaches may demonstrate the underlying patterns and similarities of TH within each cancer type, leading to a significant breakthrough in cancer therapeutics.

Key Points
  • A tumor sample consists of heterogeneous cellular populations, including diverse subclonal cancer cells, stromal cells and infiltrating immune cells. This TH harbors significant clinical potential. Although TH often exists across multi-omics layers, most computational approaches for TH characterization only rely on individual omics data.

  • To facilitate the development of multi-omics integrative methods for TH characterization, we provide a comprehensive review of existing computational methods to characterize TH based on genome, epigenome and transcriptome. Furthermore, we introduce some pioneering multi-omics integrative methods.

  • Multi-omics molecular profiles of cancers are rapidly accumulating. Therefore, we suggest some potential applications of data-driven ML approaches for TH characterization.

Funding

Collaborative Genome Program for Fostering New Post-Genome Industry of the National Research Foundation (NRF) funded by the Ministry of Science and ICT (NRF-2014M3C9A3063541); Next-Generation Information Computing Development Program through the NRF funded by the Ministry of Science, ICT (NRF-2017M3C4A7065887); Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224); Bio & Medical Technology Development Program of the NRF (NRF-2019M3E5D4065965).

Dohoon Lee is a PhD student of Interdisciplinary Program in Bioinformatics at Seoul National University. His research interests include tumor heterogeneity, computational epigenomics and machine learning.

Youngjune Park is a PhD student of Interdisciplinary Program in Bioinformatics at Seoul National University. His research interests include tumor evolution and machine learning.

Sun Kim is a professor of the department of computer science and engineering at Seoul National University. His research group focuses on bioinformatics and machine learning.

References

1.

Jamal-Hanjani
M
,
Quezada
SA
,
Larkin
J
, et al.
Translational implications of tumor heterogeneity
.
Clin Cancer Res
2015
;
21
(
6
):
1258
66
.

2.

Joung
JG
,
Oh
BY
,
Hong
HK
, et al.
Tumor heterogeneity predicts metastatic potential in colorectal cancer
.
Clin Cancer Res
2017
;
23
(
23
):
7209
16
.

3.

Dagogo-Jack
I
,
Shaw
AT
.
Tumour heterogeneity and resistance to cancer therapies
.
Nat Rev Clin Oncol
2018
;
15
(
2
):
81
94
.

4.

Nowell
P
.
The clonal evolution of tumor cell populations
.
Science
1976
;
194
(
4260
):
23
8
.

5.

Greaves
M
,
Maley
CC
.
Clonal evolution in cancer
.
Nature
2012
;
481
(
7381
):
306
13
.

6.

Brady
SW
,
McQuerry
JA
,
Qiao
Y
, et al.
Combating subclonal evolution of resistant cancer phenotypes
.
Nat Commun
2017
;
8
(
1
):
1231
.

7.

Mroz
EA
,
Rocco
JW
.
MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma
.
Oral Oncol
2013
;
49
(
3
):
211
5
.

8.

Hou
Y
,
Li
T
,
Gan
W
, et al.
Prognostic significance of mutant-allele tumor heterogeneity in uterine corpus endometrial carcinoma
.
Ann Transl Med
2020
;
8
(
6
):
339
9
.

9.

Rajput
A
,
Bocklage
T
,
Greenbaum
A
, et al.
Mutant-allele tumor heterogeneity scores correlate with risk of metastases in colon cancer
.
Clin Colorectal Cancer
2017
;
16
(
3
):
e165
70
.

10.

McGranahan
N
,
Furness
AJ
,
Rosenthal
R
, et al.
Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade
.
Science
2016
;
351
(
6280
):
1463
9
.

11.

Bozic
I
,
Antal
T
,
Ohtsuki
H
, et al.
Accumulation of driver and passenger mutations during tumor progression
.
Proc Natl Acad Sci U S A
2010
;
107
(
43
):
18545
50
.

12.

Vogelstein
B
,
Papadopoulos
N
,
Velculescu
VE
, et al.
Cancer genome landscapes
.
Science
2013
;
339
(
6127
):
1546
58
.

13.

Davies
JJ
,
Wilson
IM
,
Lam
WL
.
Array CGH technologies and their applications to cancer genomes
.
Chromosome Res
2005
;
13
(
3
):
237
48
.

14.

Pinkel
D
,
Albertson
DG
.
Array comparative genomic hybridization and its applications in cancer
.
Nat Genet
2005
;
37
(
6S
):
S11
7
.

15.

Huang
J
,
Wei
W
,
Zhang
J
, et al.
Whole genome DNA copy number changes identified by high density oligonucleotide arrays
.
Hum Genomics
2004
;
1
(
4
):
287
99
.

16.

Zhao
X
,
Li
C
,
Paez
JG
, et al.
An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays
.
Cancer Res
2004
;
64
(
9
):
3060
71
.

17.

Van Loo
P
,
Nordgard
SH
,
Lingjærde
OC
, et al.
Allele-specific copy number analysis of tumors
.
Proc Natl Acad Sci U S A
2010
;
107
(
39
):
16910
5
.

18.

Carter
SL
,
Cibulskis
K
,
Helman
E
, et al.
Absolute quantification of somatic DNA alterations in human cancer
.
Nat Biotechnol
2012
;
30
(
5
):
413
21
.

19.

Zack
TI
,
Schumacher
SE
,
Carter
SL
, et al.
Pan-cancer patterns of somatic copy number alteration
.
Nat Genet
2013
;
45
(
10
):
1134
40
.

20.

Davis
A
,
Gao
R
,
Navin
N
.
Tumor evolution: linear, branching, neutral or punctuated?
Biochim Biophys Acta Rev Cancer
2017
;
1867
(
2
):
151
61
.

21.

Larson
NB
,
Fridley
BL
.
PurBayes: estimating tumor cellularity and subclonality in next-generation sequencing data
.
Bioinformatics
2013
;
29
(
15
):
1888
9
.

22.

Strino
F
,
Parisi
F
,
Micsinai
M
, et al.
TrAp: a tree approach for fingerprinting subclonal tumor composition
.
Nucleic Acids Res
2013
;
41
(
17
):
e165
5
.

23.

Roth
A
,
Khattra
J
,
Yap
D
, et al.
PyClone: statistical inference of clonal population structure in cancer
.
Nat Methods
2014
;
11
(
4
):
396
8
.

24.

Miller
CA
,
White
BS
,
Dees
ND
, et al.
SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution
.
PLoS Comput Biol
2014
;
10
(
8
):
e1003665
.

25.

Jiao
W
,
Vembu
S
,
Deshwar
AG
, et al.
Inferring clonal evolution of tumors from single nucleotide somatic mutations
.
BMC Bioinformatics
2014
;
15
(
1
):
35
.

26.

Andor
N
,
Harness
JV
,
Müller
S
, et al.
Expands: expanding ploidy and allele frequency on nested subpopulations
.
Bioinformatics
2014
;
30
(
1
):
50
60
.

27.

Ha
G
,
Roth
A
,
Khattra
J
, et al.
TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data
.
Genome Res
2014
;
24
(
11
):
1881
93
.

28.

Zare
H
,
Wang
J
,
Hu
A
, et al.
Inferring clonal composition from multiple sections of a breast cancer
.
PLoS Comput Biol
2014
;
10
(
7
):
e1003703
.

29.

Hajirasouliha
I
,
Mahmoody
A
,
Raphael
BJ
.
A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data
.
Bioinformatics
2014
;
30
(
12
):
78
86
.

30.

Deshwar
AG
,
Vembu
S
,
Yung
CK
, et al.
PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors
.
Genome Biol
2015
;
16
(
1
):
1
20
.

31.

El-Kebir
M
,
Oesper
L
,
Acheson-Field
H
, et al.
Reconstruction of clonal trees and tumor composition from multi-sample sequencing data
.
Bioinformatics
2015
;
31
(
12
):
i62
70
.

32.

Popic
V
,
Salari
R
,
Hajirasouliha
I
, et al.
Fast and scalable inference of multi-sample cancer lineages
.
Genome Biol
2015
;
16
(
1
):
91
.

33.

Sengupta
S
,
Wang
J
,
Lee
J
, et al.
Bayclone: Bayesian nonparametric inference of tumor subclones using NGS data
.
Pac Symp Biocomput
2015
;
20
:
467
78
. doi: .

34.

Marass
F
,
Mouliere
F
,
Yuan
K
, et al.
A phylogenetic latent feature model for clonal deconvolution
.
Ann Appl Stat
2016
;
10
(
4
):
2377
404
.

35.

Jiang
Y
,
Qiu
Y
,
Minn
AJ
, et al.
Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing
.
Proc Natl Acad Sci U S A
2016
;
113
(
37
):
E5528
37
.

36.

Salehi
S
,
Steif
A
,
Roth
A
, et al.
ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data
.
Genome Biol
2017
;
18
(
1
):
1
18
.

37.

Dang
H
,
White
B
,
Foltz
S
, et al.
ClonEvol: clonal ordering and visualization in cancer sequencing
.
Ann Oncol
2017
;
28
(
12
):
3076
82
.

38.

Li
Y
,
Xie
X
.
MixClone: a mixture model for inferring tumor subclonal populations
.
BMC Genomics
2015
;
16
(
Suppl 2
):
1
9
.

39.

Khakabimamaghani
S
,
Malikic
S
,
Tang
J
, et al.
Collaborative intra-tumor heterogeneity detection
.
Bioinformatics
2019
;
35
(
14
):
i379
88
.

40.

Oesper
L
,
Satas
G
,
Raphael
BJ
.
Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data
.
Bioinformatics
2014
;
30
(
24
):
3532
40
.

41.

El-Kebir
M
,
Satas
G
,
Oesper
L
, et al.
Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures
.
Cell Syst
2016
;
3
:
43
53
.

42.

Nik-Zainal
S
,
Van Loo
P
,
Wedge
DC
, et al.
The life history of 21 breast cancers
.
Cell
2012
;
149
(
5
):
994
1007
.

43.

Yau
C
,
Mouradov
D
,
Jorissen
RN
, et al.
A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data
.
Genome Biol
2010
;
11
(
9
):
R92
.

44.

Ha
G
,
Roth
A
,
Lai
D
, et al.
Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer
.
Genome Res
2012
;
22
(
10
):
1995
2007
.

45.

Gusnanto
A
,
Wood
HM
,
Pawitan
Y
, et al.
Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data
.
Bioinformatics
2012
;
28
(
1
):
40
7
.

46.

Yu
Z
,
Liu
Y
,
Shen
Y
, et al.
CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data
.
Bioinformatics
2014
;
30
(
18
):
2576
83
.

47.

Li
Y
,
Xie
X
.
Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity
.
Bioinformatics
2014
;
30
(
15
):
2121
9
.

48.

Bao
L
,
Pu
M
,
Messer
K
.
AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data
.
Bioinformatics
2014
;
30
(
8
):
1056
63
.

49.

Favero
F
,
Joshi
T
,
Marquard
A
, et al.
Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data
.
Ann Oncol
2015
;
26
(
1
):
64
70
.

50.

Shen
R
,
Seshan
VE
.
FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing
.
Nucleic Acids Res
2016
;
44
(
16
):
e131
1
.

51.

Yu
Z
,
Li
A
,
Wang
M
.
CLImAT-HET: detecting subclonal copy number alterations and loss of heterozygosity in heterogeneous tumor samples from whole-genome sequencing data
.
BMC Med Genomics
2017
;
10
(
1
):
15
.

52.

Adalsteinsson
VA
,
Ha
G
,
Freeman
SS
, et al.
Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors
.
Nat Commun
2017
;
8
(
1
):
1324
.

53.

Cun
Y
,
Yang
TP
,
Achter
V
, et al.
Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust
.
Nat Protoc
2018
;
13
(
6
):
1488
501
.

54.

Poell
JB
,
Mendeville
M
,
Sie
D
, et al.
ACE: absolute copy number estimation from low-coverage whole-genome sequencing data
.
Bioinformatics
2019
;
35
(
16
):
2847
9
.

55.

Mayrhofer
M
,
DiLorenzo
S
,
Isaksson
A
.
Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue
.
Genome Biol
2013
;
14
(
3
):
R24
.

56.

Noorbakhsh
J
,
Kim
H
,
Namburi
S
, et al.
Distribution-based measures of tumor heterogeneity are sensitive to mutation calling and lack strong clinical predictive power
.
Sci Rep
2018
;
8
(
1
):
1
12
.

57.

Fraga
MF
,
Ballestar
E
,
Villar-Garea
A
, et al.
Loss of acetylation at Lys16 and trimethylation at Lys20 of histone H4 is a common hallmark of human cancer
.
Nat Genet
2005
;
37
(
4
):
391
400
.

58.

Jones
PA
,
Laird
PW
.
Cancer epigenetics comes of age
.
Nat Genet
1999
;
21
(
2
):
163
7
.

59.

Baylin
SB
,
Herman
JG
.
DNA hypermethylation in tumorigenesis: epigenetics joins genetics
.
Trends Genet
2000
;
16
(
4
):
168
74
.

60.

Esteller
M
,
Corn
PG
,
Baylin
SB
, et al.
A gene hypermethylation profile of human cancer
.
Cancer Res
2001
;
3
(
61
):
3225
9
.

61.

Jones
PA
,
Baylin
SB
.
The fundamental role of epigenetic events in cancer
.
Nat Rev Genet
2002
;
3
(
6
):
415
28
.

62.

Herman
JG
,
Baylin
SB
.
Gene silencing in cancer in association with promoter hypermethylation
.
N Engl J Med
2003
;
349
(
21
):
2042
54
.

63.

Feinberg
AP
,
Vogelstein
B
.
Hypomethylation distinguishes genes of some human cancers from their normal counterparts
.
Nature
1983
;
301
(
5895
):
89
92
.

64.

Feinberg
AP
,
Kuo
KC
.
Reduced genomic 5-methylcytosine content in human colonic neoplasia
.
Cancer Res
1988
;
1159
61
.

65.

Ehrlich
M
.
DNA methylation in cancer: too much, but also too little
.
Oncogene
2002
;
21
(
35
):
5400
13
.

66.

Aggerholm
A
,
Guldberg
P
,
Hokland
M
, et al.
Extensive intra- and interindividual heterogeneity of p15INK4B methylation in acute myeloid leukemia
.
Cancer Res
1999
;
18
:
436
41
.

67.

Landan
G
,
Cohen
NM
,
Mukamel
Z
, et al.
Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues
.
Nat Genet
2012
;
44
(
11
):
1207
14
.

68.

Pan
H
,
Jiang
Y
,
Boi
M
, et al.
Epigenomic evolution in diffuse large B-cell lymphomas
.
Nat Commun
2015
;
6
(
1
):
6921
.

69.

Timp
W
,
Feinberg
AP
.
Cancer as a dysregulated epigenome allowing cellular growth advantage at the expense of the host
.
Nat Rev Cancer
2013
;
13
(
7
):
497
510
.

70.

Li
S
,
Garrett-Bakelman
F
,
Perl
AE
, et al.
Dynamic evolution of clonal epialleles revealed by methclone
.
Genome Biol
2014
;
15
(
9
):
472
.

71.

Barrett
JE
,
Feber
A
,
Herrero
J
, et al.
Quantification of tumour evolution and heterogeneity via Bayesian epiallele detection
.
BMC Bioinformatics
2017
;
18
(
1
):
354
.

72.

Lee
D
,
Lee
S
,
Kim
S
.
PRISM: methylation pattern-based, reference-free inference of subclonal makeup
.
Bioinformatics
2019
;
35
(
14
):
i520
9
.

73.

Xie
H
,
Wang
M
,
de Andrade
A
, et al.
Genome-wide quantitative assessment of variation in DNA methylation patterns
.
Nucleic Acids Res
2011
;
39
(
10
):
4099
108
.

74.

Landau
DA
,
Clement
K
,
Ziller
MJ
, et al.
Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia
.
Cancer Cell
2014
;
26
(
6
):
813
25
.

75.

Guo
S
,
Diep
D
,
Plongthongkum
N
, et al.
Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA
.
Nat Genet
2017
;
49
(
4
):
635
42
.

76.

Aran
D
,
Sirota
M
,
Butte
AJ
.
Systematic pan-cancer analysis of tumour purity
.
Nat Commun
2015
;
6
:
1
11
.

77.

Zhang
N
,
Wu
HJ
,
Zhang
W
, et al.
Predicting tumor purity from methylation microarray data
.
Bioinformatics
2015
;
31
(
21
):
3401
5
.

78.

Benelli
M
,
Romagnoli
D
,
Demichelis
F
.
Tumor purity quantification by clonal DNA methylation signatures
.
Bioinformatics
2018
;
34
(
10
):
1642
9
.

79.

Liu
B
,
Yang
X
,
Wang
T
, et al.
MEpurity: estimating tumor purity using DNA methylation data
.
Bioinformatics
2019a
;
35
(
24
):
5298
300
.

80.

Zheng
X
,
Zhao
Q
,
Wu
HJ
, et al.
MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes
.
Genome Biol
2014
;
15
(
8
):
419
.

81.

Houseman
EA
,
Accomando
WP
,
Koestler
DC
, et al.
DNA methylation arrays as surrogate measures of cell mixture distribution
.
BMC Bioinformatics
2012
;
13
(
1
):
86
.

82.

Chakravarthy
A
,
Furness
A
,
Joshi
K
, et al.
Pan-cancer deconvolution of tumour composition using DNA methylation
.
Nat Commun
2018
;
9
(
1
):
3220
.

83.

Newman
AM
,
Liu
CL
,
Green
MR
, et al.
Robust enumeration of cell subsets from tissue expression profiles
.
Nat Methods
2015
;
12
(
5
):
453
7
.

84.

Houseman
EA
,
Kile
ML
,
Christiani
DC
, et al.
Reference-free deconvolution of DNA methylation data and mediation by cell composition effects
.
BMC Bioinformatics
2016
;
17
(
1
):
259
.

85.

Lutsik
P
,
Slawski
M
,
Gasparoni
G
, et al.
MeDeCom: discovery and quantification of latent components of heterogeneous methylomes
.
Genome Biol
2017
;
18
(
1
):
55
.

86.

Torres
CM
,
Biran
A
,
Burney
MJ
, et al.
The linker histone H1.0 generates epigenetic and functional intratumor heterogeneity
.
Science
2016
;
353
(
6307
):
aaf1644
.

87.

Parker
JS
,
Mullins
M
,
Cheang
MCU
, et al.
Supervised risk predictor of breast cancer based on intrinsic subtypes
.
J Clin Oncol
2009
;
27
:
1160
7
.

88.

Whitney
AR
,
Diehn
M
,
Popper
SJ
, et al.
Individuality and variation in gene expression patterns in human blood
.
Proc Natl Acad Sci U S A
2003
;
100
(
4
):
1896
901
.

89.

Venet
D
,
Pecasse
F
,
Maenhaut
C
, et al.
Separation of samples into their constituents using gene expression data
.
Bioinformatics
2001
;
17
(
Suppl. 1
):
279
87
.

90.

de Ridder
D
,
van der Linden
CE
,
Schonewille
T
, et al.
Purity for clarity: the need for purification of tumor cells in DNA microarray studies
.
Leukemia
2005
;
19
(
4
):
618
27
.

91.

de Bruin
EC
,
van de Pas
S
,
Lips
EH
, et al.
Macrodissection versus microdissection of rectal carcinoma: minor influence of stroma cells to tumor cell gene expression profiles
.
BMC Genomics
2005
;
6
(
1
):
142
.

92.

Zhao
Y
,
Simon
R
.
Gene expression deconvolution in clinical samples
.
Genome Med
2010
;
2
(
12
):
93
.

93.

Chikina
M
,
Zaslavsky
E
,
Sealfon
SC
.
CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations
.
Bioinformatics
2015
;
31
(
10
):
1584
91
.

94.

Xiao
SJ
,
Zhang
C
,
Zou
Q
, et al.
TiSGeD: a database for tissue-specific genes
.
Bioinformatics
2010
;
26
(
9
):
1273
5
.

95.

Yoshihara
K
,
Shahmoradgoli
M
,
Martínez
E
, et al.
Inferring tumour purity and stromal and immune cell admixture from expression data
.
Nat Commun
2013a
;
4
(
1
):
1
11
.

96.

Walhout
AJ
,
Vidal
M
.
High-throughput yeast two-hybrid assays for large-scale protein interaction mapping
.
Methods
2001
;
24
(
3
):
297
306
.

97.

Parrish
JR
,
Gulyas
KD
,
Finley
RL
.
Yeast two-hybrid contributions to interactome mapping
.
Curr Opin Biotechnol
2006
;
17
(
4
):
387
93
.

98.

Szklarczyk
D
,
Gable
AL
,
Lyon
D
, et al.
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
.
Nucleic Acids Res
2018
;
47
:
607
13
.

99.

Oughtred
R
,
Stark
C
,
Breitkreutz
BJ
, et al.
The BioGRID interaction database: 2019 update
.
Nucleic Acids Res
2018
;
47
:
529
41
.

100.

Barabási
AL
,
Oltvai
ZN
.
Network biology: understanding the cell’s functional organization
.
Nat Rev Genet
2004
;
5
(
2
):
101
13
.

101.

Barabási
AL
,
Gulbahce
N
,
Loscalzo
J
.
Network medicine: a network-based approach to human disease
.
Nat Rev Genet
2011
;
12
(
1
):
56
68
.

102.

Lim
S
,
Lee
S
,
Jung
I
, et al.
Comprehensive and critical evaluation of individualized pathway activity measurement tools on pan-cancer data
.
Brief Bioinform
2020
;
21
(
1
):
36
46
.

103.

Qiao
W
,
Quon
G
,
Csaszar
E
, et al.
PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions
.
PLoS Comput Biol
2012
;
8
(
12
):
e1002838
.

104.

Quon
G
,
Haider
S
,
Deshwar
AG
, et al.
Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction
.
Genome Med
2013
;
5
(
3
):
29
.

105.

Anghel
CV
,
Quon
G
,
Haider
S
, et al.
Isopurer: an r implementation of a computational purification algorithm of mixed tumour profiles
.
BMC Bioinformatics
2015
;
16
(
1
):
156
.

106.

Ahn
J
,
Yuan
Y
,
Parmigiani
G
, et al.
DeMix: deconvolution for mixed cancer transcriptomes using raw measured data
.
Bioinformatics
2013
;
29
(
15
):
1865
71
.

107.

Yoshihara
K
,
Shahmoradgoli
M
,
Martínez
E
, et al.
Inferring tumour purity and stromal and immune cell admixture from expression data
.
Nat Commun
2013b
;
4
(
1
):
2612
.

108.

Wang
N
,
Gong
T
,
Clarke
R
, et al.
Undo: a bioconductor r package for unsupervised deconvolution of mixed gene expressions in tumor samples
.
Bioinformatics
2015
;
31
(
1
):
137
9
.

109.

Newman
AM
,
Steen
CB
,
Liu
CL
, et al.
Determining cell type abundance and expression from bulk tissues with digital cytometry
.
Nat Biotechnol
2019
;
37
:
773
82
.

110.

Shen
Q
,
Hu
J
,
Jiang
N
, et al.
contamDE: differential expression analysis of RNA-seq data for contaminated tumor samples
.
Bioinformatics
2016
;
32
(
5
):
705
12
.

111.

Li
B
,
Severson
E
,
Pignon
JC
, et al.
Comprehensive analyses of tumor immunity: implications for cancer immunotherapy
.
Genome Biol
2016a
;
17
(
1
):
174
.

112.

Becht
E
,
Giraldo
NA
,
Lacroix
L
, et al.
Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression
.
Genome Biol
2016
;
17
(
1
):
218
.

113.

Racle
J
,
de Jonge
K
,
Baumgaertner
P
, et al.
Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data
.
Elife
2017
;
6
:
e26476
.

114.

Aran
D
,
Hu
Z
,
Butte
AJ
.
xCell: digitally portraying the tissue cellular heterogeneity landscape
.
Genome Biol
2017
;
18
(
1
):
220
.

115.

Finotello
F
,
Mayer
C
,
Plattner
C
, et al.
Molecular and pharmacological modulators of the tumor immune contexture revealed by deconvolution of RNA-seq data
.
Genome Med
2019
;
11
(
1
):
34
.

116.

Shen-Orr
SS
,
Gaujoux
R
.
Computational deconvolution: extracting cell type-specific information from heterogeneous samples
.
Curr Opin Immunol
2013
;
25
(
5
):
571
8
.

117.

Yadav
VK
,
De
S
.
An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples
.
Brief Bioinform
2015
;
16
(
2
):
232
41
.

118.

Mohammadi
S
,
Zuckerman
N
,
Goldsmith
A
, et al.
A critical survey of deconvolution methods for separating cell types in complex tissues
.
Proc IEEE
2017
;
105
(
2
):
340
66
.

119.

Avila Cobos
F
,
Vandesompele
J
,
Mestdagh
P
, et al.
Computational deconvolution of transcriptomics data from mixed cell populations
.
Bioinformatics
2018
;
34
(
11
):
1969
79
.

120.

Finotello
F
,
Trajanoski
Z
.
Quantifying tumor-infiltrating immune cells from transcriptomics data
.
Cancer Immunol Immunother
2018
;
67
(
7
):
1031
40
.

121.

Sturm
G
,
Finotello
F
,
Petitprez
F
, et al.
Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology
.
Bioinformatics
2019
;
35
:
436
45
.

122.

Barbie
DA
,
Tamayo
P
,
Boehm
JS
, et al.
Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1
.
Nature
2009
;
462
(
7269
):
108
12
.

123.

Wang
Z
,
Cao
S
,
Morris
JS
, et al.
Transcriptome deconvolution of heterogeneous tumor samples with immune infiltration
.
iScience
2018
;
9
:
451
60
.

124.

Teschendorff
AE
,
Severini
S
.
Increased entropy of signal transduction in the cancer metastasis phenotype
.
BMC Syst Biol
2010
;
4
(
1
):
104
.

125.

Teschendorff
AE
,
Sollich
P
,
Kuehn
R
.
Signalling entropy: a novel network-theoretical framework for systems analysis and interpretation of functional omic data
.
Methods
2014
;
67
(
3
). doi: .

126.

Banerji
CRS
,
Miranda-Saavedra
D
,
Severini
S
, et al.
Cellular network entropy as the energy potential in Waddington’s differentiation landscape
.
Sci Rep
2013
;
3
(
1
):
3039
.

127.

Park
Y
,
Lim
S
,
Nam
JW
, et al.
Measuring intratumor heterogeneity by network entropy using RNA-seq data
.
Sci Rep
2016
;
6
(
1
):
1
12
.

128.

Wan
Y
,
Larson
DR
.
Splicing heterogeneity: separating signal from noise
.
Genome Biol
2018
;
19
(
1
):
86
.

129.

Kim
M
,
Lee
S
,
Lim
S
, et al.
Splicehetero: an information theoretic approach for measuring spliceomic intratumor heterogeneity from bulk tumor RNA-seq
.
PLoS One
2019
;
14
(
10
):
e0223520
.

130.

Navin
NE
.
The first five years of single-cell cancer genomics and beyond
.
Genome Res
2015
;
25
(
10
):
1499
507
.

131.

Kuipers
J
,
Jahn
K
,
Beerenwinkel
N
.
Advances in understanding tumour evolution through single-cell sequencing
.
Biochim Biophys Acta Rev Cancer
2017
;
1867
(
2
):
127
38
.

132.

Navin
N
,
Kendall
J
,
Troge
J
, et al.
Tumour evolution inferred by single-cell sequencing
.
Nature
2011
;
472
(
7341
):
90
4
.

133.

Xu
X
,
Hou
Y
,
Yin
X
, et al.
Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor
.
Cell
2012
;
148
(
5
):
886
95
.

134.

Jan
M
,
Snyder
TM
,
Corces-Zimmerman
MR
, et al.
Clonal evolution of preleukemic hematopoietic stem cells precedes human acute myeloid leukemia
.
Sci Transl Med
2012
;
4
(
149
):
149ra118
.

135.

Potter
NE
,
Ermini
L
,
Papaemmanuil
E
, et al.
Single-cell mutational profiling and clonal phylogeny in cancer
.
Genome Res
2013
;
23
(
12
):
2115
25
.

136.

Roth
A
,
McPherson
A
,
Laks
E
, et al.
Clonal genotype and population structure inference from single-cell tumor sequencing
.
Nat Methods
2016
;
13
(
7
):
573
6
.

137.

Jahn
K
,
Kuipers
J
,
Beerenwinkel
N
.
Tree inference for single-cell data
.
Genome Biol
2016
;
17
(
1
):
86
.

138.

Ross
EM
,
Markowetz
F
.
OncoNEM: inferring tumor evolution from single-cell sequencing data
.
Genome Biol
2016
;
17
(
1
):
69
.

139.

Zafar
H
,
Tzen
A
,
Navin
N
, et al.
SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models
.
Genome Biol
2017
;
18
(
1
):
178
.

140.

Malikic
S
,
Jahn
K
,
Kuipers
J
, et al.
Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data
.
Nat Commun
2019a
;
10
(
1
):
2750
.

141.

Zafar
H
,
Navin
N
,
Chen
K
, et al.
SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data
.
Genome Res
2019
;
29
(
11
):
1847
59
.

142.

Malikic
S
,
Mehrabadi
FR
,
Ciccolella
S
, et al.
PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
.
Genome Res
2019b
;
29
(
11
):
1860
77
.

143.

Patel
AP
,
Tirosh
I
,
Trombetta
JJ
, et al.
Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma
.
Science
2014
;
344
(
6190
):
1396
401
.

144.

Tirosh
I
,
Izar
B
,
Prakadan
SM
, et al.
Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq
.
Science
2016
;
352
(
6282
):
189
96
.

145.

Li
H
,
Courtois
ET
,
Sengupta
D
, et al.
Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors
.
Nat Genet
2017
;
49
(
5
):
708
18
.

146.

Trapnell
C
,
Cacchiarelli
D
,
Grimsby
J
, et al.
The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells
.
Nat Biotechnol
2014
;
32
(
4
):
381
6
.

147.

Qiu
X
,
Mao
Q
,
Tang
Y
, et al.
Reversed graph embedding resolves complex single-cell trajectories
.
Nat Methods
2017
;
14
(
10
):
979
82
.

148.

Pang
B
,
Xu
J
,
Hu
J
, et al.
Single-cell RNA-seq reveals the invasive trajectory and molecular cascades underlying glioblastoma progression
.
Mol Oncol
2019
;
13
(
12
):
2588
603
.

149.

Yu
K
,
Hu
Y
,
Wu
F
, et al.
Surveying brain tumor heterogeneity by single-cell RNA sequencing of multi-sector biopsies
.
Natl Sci Rev
2020
.

150.

Ma
L
,
Hernandez
MO
,
Zhao
Y
, et al.
Tumor cell biodiversity drives microenvironmental reprogramming in liver cancer
.
Cancer Cell
2019
;
36
(
4
):
418
430.e6
.

151.

Borcherding
N
,
Voigt
AP
,
Liu
V
, et al.
Single-cell profiling of cutaneous T-cell lymphoma reveals underlying heterogeneity associated with disease progression
.
Clin Cancer Res
2019
;
25
(
10
):
2996
3005
.

152.

Durante
MA
,
Rodriguez
DA
,
Kurtenbach
S
, et al.
Single-cell analysis reveals new evolutionary complexity in uveal melanoma
.
Nat Commun
2020
;
11
(
1
):
496
.

153.

Bendall
SC
,
Davis
KL
,
Amir
EAD
, et al.
Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development
.
Cell
2014
;
157
(
3
):
714
25
.

154.

Loeffler-Wirth
H
,
Binder
H
,
Willscher
E
, et al.
Pseudotime dynamics in melanoma single-cell transcriptomes reveals different mechanisms of tumor progression
.
Biology
2018
;
7
(
2
):
23
.

155.

Baron
M
,
Veres
A
,
Wolock
SL
, et al.
A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure
.
Cell Syst
2016
;
3
(
4
):
346
360.e4
.

156.

Wang
X
,
Park
J
,
Susztak
K
, et al.
Bulk tissue cell type deconvolution with multi-subject single-cell expression reference
.
Nat Commun
2019
;
10
(
1
):
380
.

157.

Jew
B
,
Alvarez
M
,
Rahmani
E
, et al.
Accurate estimation of cell composition in bulk expression through robust integration of single-cell information
.
Nat Commun
2020
;
11
(
1
):
1971
.

158.

Macaulay
IC
,
Haerty
W
,
Kumar
P
, et al.
G&T-seq: parallel sequencing of single-cell genomes and transcriptomes
.
Nat Methods
2015
;
12
(
6
):
519
22
.

159.

Rodriguez-Meira
A
,
Buck
G
,
Clark
SA
, et al.
Unravelling intratumoral heterogeneity through high-sensitivity single-cell mutational analysis and parallel RNA sequencing
.
Mol Cell
2019
;
73
(
6
):
1292
1305.e8
.

160.

Liu
L
,
Liu
C
,
Quintero
A
, et al.
Deconvolution of single-cell multi-omics layers reveals regulatory heterogeneity
.
Nat Commun
2019b
;
10
(
1
):
470
.

161.

Aryee
MJ
,
Liu
W
,
Engelmann
JC
, et al.
DNA methylation alterations exhibit intraindividual stability and interindividual heterogeneity in prostate cancer metastases
.
Sci Transl Med
2013
;
5
(
169
):
169ra10
.

162.

Brocks
D
,
Assenov
Y
,
Minner
S
, et al.
Intratumor DNA methylation heterogeneity reflects clonal evolution in aggressive prostate cancer
.
Cell Rep
2014
;
8
(
3
):
798
806
.

163.

Loeffler
M
,
Kreuz
M
,
Haake
A
, et al.
Genomic and epigenomic co-evolution in follicular lymphomas
.
Leukemia
2015
;
29
(
2
):
456
63
.

164.

Mazor
T
,
Pankov
A
,
Johnson
BE
, et al.
DNA methylation and somatic mutations converge on the cell cycle and define similar evolutionary histories in brain tumors
.
Cancer Cell
2015
;
28
(
3
):
307
17
.

165.

Hao
JJ
,
Lin
DC
,
Dinh
HQ
, et al.
Spatial intratumoral heterogeneity and temporal clonal evolution in esophageal squamous cell carcinoma
.
Nat Genet
2016
;
48
(
12
):
1500
7
.

166.

Dietz
S
,
Lifshitz
A
,
Kazdal
D
, et al.
Global DNA methylation reflects spatial heterogeneity and molecular evolution of lung adenocarcinomas
.
Int J Cancer
2019
;
144
(
5
):
1061
72
.

167.

Oakes
CC
,
Claus
R
,
Gu
L
, et al.
Evolution of DNA methylation is linked to genetic aberrations in chronic lymphocytic leukemia
.
Cancer Discov
2014
;
4
(
3
):
348
61
.

168.

Li
S
,
Garrett-Bakelman
FE
,
Chung
SS
, et al.
Distinct evolution and dynamics of epigenetic and genetic heterogeneity in acute myeloid leukemia
.
Nat Med
2016b
;
22
(
7
):
792
9
.

169.

Onuchic
V
,
Hartmaier
RJ
,
Boone
DN
, et al.
Epigenomic deconvolution of breast tumors reveals metabolic coupling between constituent cell types
.
Cell Rep
2016
;
17
(
8
):
2075
86
.

170.

Gerstung
M
,
Jolly
C
,
Leshchiner
I
, et al.
The evolutionary history of 2,658 cancers
.
Nature
2020
;
578
(
7793
):
122
8
.

171.

Campbell
PJ
,
Getz
G
,
Korbel
JO
, et al.
Pan-cancer analysis of whole genomes
.
Nature
2020
;
578
(
7793
):
82
93
.

172.

Caravagna
G
,
Giarratano
Y
,
Ramazzotti
D
, et al.
Detecting repeated cancer evolution from multi-region tumor sequencing data
.
Nat Methods
2018
;
15
(
9
):
707
14
.

173.

Johann
PD
,
Jager
N
,
Pfister
SM
, et al.
RF_Purify: a novel tool for comprehensive analysis of tumor-purity in methylation array data based on random forest regression
.
BMC Bioinformatics
2019
;
20
(
1
):
428
.

174.

Budinska
E
,
Popovici
V
,
Tejpar
S
, et al.
Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer
.
J Pathol
2013
;
231
(
1
):
63
76
.

175.

Guinney
J
,
Dienstmann
R
,
Wang
X
, et al.
The consensus molecular subtypes of colorectal cancer
.
Nat Med
2015
;
21
(
11
):
1350
6
.

176.

Bian
S
,
Hou
Y
,
Zhou
X
, et al.
Single-cell multiomics sequencing and analyses of human colorectal cancer
.
Science
2018
;
362
(
6418
):
1060
3
.

Author notes

Dohoon Lee and Youngjune Park are co-first authors and have contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)