Abstract

Cancer is a collection of genetic diseases, with large phenotypic differences and genetic heterogeneity between different types of cancers and even within the same cancer type. Recent advances in genome-wide profiling provide an opportunity to investigate global molecular changes during the development and progression of cancer. Meanwhile, numerous statistical and machine learning algorithms have been designed for the processing and interpretation of high-throughput molecular data. Molecular subtyping studies have allowed the allocation of cancer into homogeneous groups that are considered to harbor similar molecular and clinical characteristics. Furthermore, this has helped researchers to identify both actionable targets for drug design as well as biomarkers for response prediction. In this review, we introduce five frequently applied techniques for generating molecular data, which are microarray, RNA sequencing, quantitative polymerase chain reaction, NanoString and tissue microarray. Commonly used molecular data for cancer subtyping and clinical applications are discussed. Next, we summarize a workflow for molecular subtyping of cancer, including data preprocessing, cluster analysis, supervised classification and subtype characterizations. Finally, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation. We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and build robust classifiers that pave the way toward stratified treatment of cancer patients.

Introduction

Cancer is a large group of genetic diseases that are currently classified by their primary site of origin, such as brain cancer, breast cancer and lung cancer. However, not all cancers of an organ are the same, and genetic heterogeneity exists between and within cancers [1–6]. A major cause of this heterogeneity is genomic instability [7], which can act at the single-nucleotide level or at much larger scales [8]. This heterogeneity poses significant challenges to the efficacy of currently applicable targeted therapies and complicates the development of future treatment strategies [7]. There is therefore a great need to classify cancer into homogeneous groups that are associated with distinct molecular features and clinical outcomes and that allow the development of subgroup-specific therapies.

The traditional classification of cancer has been carried out by pathologists based on histological appearance and site of growth. This only partially reflects the truly heterogeneous character of cancer. Recent advances in genome-wide profiling techniques [9, 10] have allowed researchers to generate large-scale genomic data and classify cancer into more homogeneous groups [11, 12]. Genomic data have been used in many cancer subtyping studies, including studies of leukemia [13], lymphoma [14], nasopharyngeal carcinoma (NPC) [15], breast [16], lung [17], liver [18], pancreas [19] and colon [20] cancers and soft tissue sarcomas [21]. Various machine learning algorithms [22–26] have also been developed for better prediction of cancer subtypes. Molecular subtyping studies have allowed the classification of cancer into uniform groups that correlate better with clinical outcomes than the traditional classifications of cancer [27]. In summary, molecular classification can provide diagnostic, prognostic and therapeutic options for the treatment of cancers.

This review is organized as follows. In the ‘Molecular subtyping of cancer’ section, we first introduce two techniques that are used for cancer subtyping: microarray and RNA sequencing (RNA-Seq). Next, we introduce three other frequently applied techniques for generating low- and medium-throughput molecular data, namely quantitative polymerase chain reaction (qPCR), NanoString and tissue microarray (TMA), and we discuss their applications in clinical tests. Subtype identifications and characterizations, which are the two important aspects of the subtyping process, are also discussed. In the ‘Moving toward clinical applications’ section, we illustrate potential clinical applications of cancer subtyping studies for diagnosis, prognosis, predicting therapy response and drug design. In the ‘Challenges’ section, we identify and describe four major challenges in the molecular subtyping of cancer that may preclude clinical implementation. Finally, in the ‘Conclusions and outlook’ section, we provide concluding remarks and recommendations.

Molecular subtyping of cancer

Recent advances in genome-wide profiling techniques have allowed the generation of large-scale genomic data, and various statistical and machine learning algorithms have been developed for processing and interpretation of such data [23–25, 28–31]. Molecular subtyping of cancer, as its name suggests, is a new way to classify cancers into different groups based on molecular data and classification models. In contrast to the traditional histological classification of cancer, molecular classifications rely on biomarkers and classifiers. Biomarkers can be informative genes, microRNAs (miRNAs), DNA methylation markers and others [32]. Classifiers can be built by machine learning algorithms, such as Prediction Analysis for Microarrays (PAM), Support Vector Machines (SVMs) and more [33]. In the following, we introduce the different molecular data types and their applications, and then describe a workflow for unsupervised classification of cancer.

High-throughput molecular data for cancer subtyping

Gene expression profiling data for cancer subtyping

Microarray and RNA-Seq are two common profiling techniques for generating high-throughput gene expression data. Microarrays are capable of profiling expression patterns for tens of thousands of selected genes in a single assay [9]. RNA-Seq is a sequencing-based method that measures the abundance of transcripts from across the entire genome. RNA-Seq has numerous advantages over microarray [34]. First, unlike hybridization-based microarrays, RNA-Seq provides more accurate detection of gene expression. Second, RNA-Seq can detect novel transcripts, single-nucleotide variants and other (as yet) unknown changes that microarray cannot detect. Finally, RNA-Seq has a low background signal and consequently a large dynamic range. Microarray has been the most commonly used technique to generate large-scale molecular data for several decades [35]. With the rapid development of sequencing and analysis techniques, sequencing costs are expected to decrease dramatically and more statistical tools will be developed for RNA-Seq, so RNA-Seq will likely replace microarray [36]. Compared with other molecular profiling techniques, microarray and RNA-Seq are the most accurate, reliable and robust, but they are also expensive, time-consuming and dependent on sample quality (Table 1). They are commonly used in the initial biomarker identification process. Once biomarkers have been identified, other techniques are preferred.

Table 1

A comparison of different techniques for molecular profiling of cancer

| Characteristic / Platform | Microarray | RNA sequencing | qPCR | NanoString | Tissue microarray |
|---|---|---|---|---|---|
| Accuracy [37, 166, 167] | Median | Median | High | High | Low |
| Sensitivity [36, 167, 39] | Median | High | High | High | Low |
| Specificity [38, 167, 168] | Median | Median | High | High | Low |
| Speed [36, 167] | Slow | Slow | Fast | Median | Slow |
| Cost (per sample) | $300 [163] | $1000 [163] | $280 [164] | $800 [39] | $100 [165] |
| Sample requirement [167] | FFPE/fresh-frozen | FFPE/fresh-frozen | Fresh-frozen | FFPE/fresh-frozen | FFPE |
| Genome-wide coverage | Yes | Yes | No | No | No |
| Quantitative | Yes | Yes | Yes | Yes | No |
| Single-base resolution | No | Yes | No | No | No |
| Low sample input | No | No | Yes | No | Yes |
| Reproducibility [168] | Median | Median | High | High | Low |

Gene expression-based subtyping of cancer was first proposed by Golub et al. [13] in leukemia. The expression pattern of the 50 most informative genes was measured, and a two-cluster self-organizing map (SOM) clustering method [40] was applied to group 38 samples into two classes, acute myeloid leukemia and acute lymphoblastic leukemia, with an accuracy of 100%. This demonstrated the fidelity of cancer subtyping based solely on gene expression patterns [13]. Gene expression-based subtyping has now been extended to many cancer types [11, 14, 16, 17, 19, 21, 41].

Multi-platform profiling data for cancer subtyping

In addition to gene expression profiling, there are many other molecular profiling data types, such as mutation, miRNA expression, copy number variation (CNV) and DNA methylation, which can be used to identify and characterize cancer subtypes (Table 2) [43, 44, 50, 52–55]. As all cancers arise as a result of DNA sequence changes [56], gene mutation patterns are informative and provide a likely basis on which to stratify cancer patients into homogeneous groups [57, 58]. MiRNAs are small noncoding RNAs, about 20–22 nucleotides in length, that play key roles in the regulation of gene expression. Alterations of miRNA expression are involved in the initiation and progression of human cancer [59–61]. MiRNA expression profiling has now been used as a tool for studying cancer onset and for subtyping [15, 62]. Compared with mRNAs, miRNAs are more stable, and only a small number of miRNAs (∼200 in total) is sufficient to classify human cancers [63]. CNVs are structural variations and genomic alterations that affect DNA sequence lengths ranging from approximately 1 Kb to 3 Mb [64]. CNVs are associated with many complex diseases such as neuropsychiatric disorders [65], HIV [66], familial pancreatitis [67] and cancers [68, 69]. Comparative genomic hybridization (CGH) can be used to detect CNVs at the genome-wide level, and array-based CGH can increase the resolution for better genomic studies. Epigenetic changes such as DNA methylation also play a significant role in the development and progression of cancer [70]. Bisulfite sequencing [71] and differential methylation hybridization [72] can be used to scan gene methylation status at the genome-wide level.

Table 2

Molecular subtyping studies mentioned in the review

| Cancer type | Discovery sample size | Molecular data type | Clustering method | Determinative score | Number of subtypes | Classification method | Reference |
|---|---|---|---|---|---|---|---|
| Breast cancer | 65 | mRNA | Hierarchical clustering | NA | 4 | NA | Perou et al. [16] |
| Breast cancer | 85 | mRNA | Hierarchical clustering | NA | 5 | NA | Sorlie et al. [42] |
| Breast cancer | 825 | Five platforms | Cluster of clusters | NA | 4 | NA | TCGA [43] |
| Breast cancer | 2,000 | mRNA + CNV | iCluster | ARI | 10 | PAM | Curtis et al. [44] |
| CRC | 62 | mRNA | Iterative NMF | Cophenetic coefficient | 5 | NA | Schlicker et al. [45] |
| CRC | 443 | mRNA | Orig. cons. clustering | CDF area | 6 | Centroid-based | Marisa et al. [46] |
| CRC | 90 | mRNA | Orig. cons. clustering | Gap statistic | 3 | PAM | De Sousa E Melo et al. [20] |
| CRC | 445 | mRNA | NMF cons. clustering | Cophenetic coefficient | 5 | PAM | Sadanandam et al. [47] |
| CRC | 1,113 | mRNA | Orig. cons. clustering | Dynamic cut tree | 5 | Multiclass LDA | Budinska et al. [48] |
| CRC | 188 | mRNA | k-means | NA | 3 | Single-sample centroid based | Roepman et al. [49] |
| CRC | 4,151 | mRNA | Markov Cluster Algorithm | Inflation factor | 4 | Random Forest | Guinney et al. [11] |
| PDAC | 185 | miRNA | Hierarchical clustering | CDF area | 2 | SVM | Bauer et al. [50] |
| PDAC | 66 | mRNA | NMF cons. clustering | Cophenetic coefficient | 3 | NTP | Collisson et al. [19] |
| PDAC | 223 | mRNA | NMF cons. clustering | Cophenetic coefficient | 2 | Rank-based classifier | Moffitt et al. [51] |
| Pancreatic cancer | 96 | mRNA | NMF cons. clustering | Cophenetic coefficient | 4 | NA | Bailey et al. [12] |
| Leukemia | 38 | mRNA | SOM | NA | 2 | NA | Golub et al. [13] |
| Leukemia | 200 | Methylation | PCA | NA | 16 | NA | Figueroa et al. [169] |
| Lymphoma | 42 | mRNA | Hierarchical clustering | NA | 2 | NA | Alizadeh et al. [14] |
| GBM | 35 | miRNA | PCA | Ratio of intracluster to intercluster correlation | 2 | LDA | Marziali et al. [170] |
| Lung | 67 | mRNA | Hierarchical clustering | NA | 4 | NA | Garber et al. [17] |
| 12 cancer types | 3,527 | Five platforms | COCA | NA | 11 | NA | Hoadley et al. [32] |

Note: ARI, adjusted Rand index; No., number; COCA, Cluster-Of-Cluster-Assignments; iCluster, integrative clustering framework; LDA, linear discriminant analysis; NTP, nearest template prediction; Orig. cons., original consensus; PCA, principal component analysis.

Integrated analysis of multiple genomic data types, as in studies that combine gene expression with CNV [44], miRNA with gene expression [73] or five platforms [32], can provide better insights into tumor biology, and more accurate predictions, than analysis at a single molecular level [74]. With the advances in high-throughput profiling technologies, the cost per sample is decreasing; thus, multi-platform identification and characterization of cancer is likely to become the norm.

Low- and medium-throughput molecular data for clinical test

Biomarkers identified from subtyping studies can be used in clinical practice. In typical clinical settings, only up to several dozen of these predefined biomarkers are measured, to minimize the time and expense of the tests [75]. In addition, most cancer specimens are formalin-fixed paraffin-embedded (FFPE), and only a few are freshly prepared or snap frozen [76]. In contrast to the above-mentioned high-throughput approaches, some low- and medium-throughput profiling techniques (such as qPCR, NanoString and TMA) allow meaningful analysis of clinical specimens and are well suited to clinical biomarker assays. These techniques are frequently used when fast detection is required and when sample volume and cost must be kept low. Sensitivity and specificity are two terms used to evaluate a clinical test. Sensitivity refers to the ability of a test to correctly identify an individual with the disease; specificity refers to the ability of a test to correctly identify an individual without the disease [77]. Another important term in the evaluation of a clinical test is accuracy, which describes the errors that a test produces when differentiating between individuals with and without the disease [78]. In the following, we compare these three techniques (qPCR, NanoString and TMA) in terms of accuracy, sensitivity, specificity and other aspects of concern in a clinical test. Researchers can choose appropriate techniques for their clinical assays based on the comparisons provided in Table 1.
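
As a minimal illustration of these three terms, the sketch below computes them from the four counts of a two-by-two confusion table. The function name and the counts are hypothetical and chosen only for illustration, and accuracy is taken here to mean the fraction of all individuals that the test classifies correctly, which is one common reading of the definition above.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-table counts.

    tp/fn: diseased individuals called positive/negative by the test
    tn/fp: disease-free individuals called negative/positive by the test
    """
    sensitivity = tp / (tp + fn)                 # diseased correctly identified
    specificity = tn / (tn + fp)                 # disease-free correctly identified
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct calls (assumed definition)
    return sensitivity, specificity, accuracy


# Hypothetical counts, not taken from any study cited in this review.
sens, spec, acc = diagnostic_metrics(tp=45, fp=10, tn=90, fn=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

Note that accuracy depends on the proportion of diseased individuals in the tested cohort, whereas sensitivity and specificity do not, which is one reason all three are usually reported together.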

qPCR is commonly used to determine biomarker expression levels or to assess CNVs. Because the PCR amplification step greatly increases the amount of nucleic acid, only a limited sample quantity is needed. Other advantages of qPCR include speed and high sensitivity, specificity and accuracy, which make it the routine method for validating results initially obtained from high-throughput methods such as microarray and RNA-Seq [79]. Compared with other techniques, which can assay hundreds to thousands of biomarkers, qPCR-based assays can handle only a limited number of biomarkers in a single test. qPCR-based tests also require high-quality nucleic acids in the sampled material, so fresh-frozen tissues are typically required.

The NanoString nCounter analysis system can be used to measure expression levels of up to 800 genes [80]. Developed by Geiss et al. [39], the nCounter system is more sensitive than microarrays, and similar in sensitivity to qPCR [39]. This technology uses digital molecular barcoding and microscopic imaging to detect and quantify the expression levels of genes in a single assay without enzymatic reactions [39, 81]. Other advantages of this technique include high accuracy and specificity [38]. Disadvantages include the high cost of the required reagents and instruments [80].

TMA is a histology-based test, developed by Kononen et al. [82], which allows the analysis of up to 1000 tumor specimens simultaneously in a single paraffin block [37]. Analysis of molecular targets at the DNA, mRNA and protein levels is possible. Once constructed, a TMA block can be sectioned hundreds of times (provided the depth of all cores is sufficient), with each section amenable to biomarker analysis. The most significant advantage of TMA is that all samples on the array are treated in an identical fashion [83]. Another advantage of TMA is that it is cost-effective (Table 1): only a small amount of reagent is required to analyze all the samples on one slide [83]. Unlike qPCR, which requires fresh-frozen tissues, TMA uses FFPE tissues, which are the major source of material in the clinic. TMA also has limitations. For instance, low sensitivity, specificity and accuracy are typical features of a TMA test [84]. Other disadvantages are that it usually takes several days to obtain the analysis results [85], that only a limited number of analytes can be tested and that the analyzed specimen volume is too small to represent the entire tumor [83]. In addition, the amount of available tissue progressively decreases over the course of TMA staining [86].

Subtype identifications and characterizations

Molecular subtyping (or molecular classification) is the process of assigning data objects to clusters, so that objects in the same cluster are more similar to each other than to those in other clusters. There are two kinds of classification strategies: supervised (with class labels, such as tumor or normal tissue, known beforehand) and unsupervised (with unlabeled data). Subtyping is the more general term and can involve both supervised and unsupervised classification. Unsupervised classification is increasingly popular in biomedical research [87], and has been successfully used in many cancer subtyping studies [11, 13, 15, 17, 41, 51, 88, 89]. From these studies, we summarize a workflow for molecular subtyping of cancer that comprises data preprocessing, cluster analysis, supervised classification and subtype characterizations (Figure 1). In the following, we focus on subtype identifications and characterizations, which are the two important aspects of the workflow.

Figure 1

Molecular subtyping of cancer workflow. The workflow consists of four major steps: (A) Data preprocessing. Array data preprocessing includes image analysis, data normalization and transformation. Next-generation sequencing data preprocessing comprises quality control, read alignment, expression quantification, data normalization and transformation. (B) Cluster analysis. A first feature selection is performed with a cutoff on the standard deviation (SD) (e.g. SD > 0.8) or the median absolute deviation (MAD) (e.g. MAD > 0.5). Clustering is usually applied to either the feature dimension or the sample dimension, biclustering to both dimensions and triclustering to three dimensions (feature, sample and time). After (bi/tri)clustering, the optimal number of (bi/tri)clusters is determined by measures such as the gap statistic, the cophenetic coefficient and the cumulative distribution function (CDF). Ensemble and consensus clustering have also been proposed to enhance the robustness of (bi/tri)clustering. (C) Supervised classification. To build the best possible classifier, a sample selection step (silhouette width > 0) and a second feature selection step (SAM/Limma) are applied. Various algorithms such as PAM, SVM, Random Forests (RF) and k-nearest neighbors can be used to build classifiers. (D) Subtype characterizations. A heatmap is used to represent the molecular characterizations, in which rows are features (genes, miRNAs, pathways, etc.) and columns are samples. Here, the features are subtype-specific, and the samples are sorted according to their subtype numbers. A Kaplan–Meier survival plot is used to represent the clinical characterizations, in which the x-axis is the survival time and the y-axis is the estimated survival probability (i.e. the probability that the event, death, has not yet occurred).

Subtype identifications

High-throughput molecular data are usually arranged in matrix form, in which rows are features (genes, miRNAs or DNA methylation markers) and columns are samples. Molecular data matrices have largely been analyzed in two dimensions (2D): the feature dimension and the sample dimension [90]. Clustering is usually applied to either the feature dimension or the sample dimension. However, subsets of features may be active or suppressed only under certain experimental conditions and behave almost independently under other conditions. To identify such local patterns in the data matrix, biclustering (or subspace clustering), which clusters both dimensions simultaneously to discover biclusters, was introduced to gene expression analysis by Cheng and Church [91]. Various biclustering methods have since been developed to efficiently identify ‘homogeneous’ submatrices in data, such as singular value decomposition [22], nonnegative matrix factorization (NMF) [23] and geometric-based biclustering [92, 93]. With the rapid development of data profiling technologies, it is now possible to profile numerous features in a number of samples across multiple time points or experimental conditions. Such data can be arranged into three-dimensional (3D) matrices, with the first two dimensions representing the samples and features, respectively, and the third dimension representing time or experimental conditions [94]. To find feature groups along the feature–sample–time (or –condition) dimensions, triclustering has been proposed to mine triclusters in the data [95]. A tensor can be thought of as an organized multidimensional array of numerical values, and tensor-based triclustering [96, 97] has become a promising solution for analyzing such longitudinal and spatial data.
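
The feature-by-sample matrix layout described above, together with the variability filter that typically precedes clustering (for example, a cutoff of MAD > 0.5 as in the workflow of Figure 1), can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, gene names and expression values are hypothetical, and real pipelines would normally operate on normalized, log-transformed data.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of one feature across all samples."""
    center = median(values)
    return median([abs(v - center) for v in values])


def filter_by_mad(matrix, feature_names, cutoff=0.5):
    """Keep rows (features) whose MAD across columns (samples) exceeds cutoff."""
    kept_names, kept_rows = [], []
    for name, row in zip(feature_names, matrix):
        if mad(row) > cutoff:
            kept_names.append(name)
            kept_rows.append(row)
    return kept_names, kept_rows


# Toy matrix: rows are features, columns are four samples (hypothetical values).
names = ["GENE_A", "GENE_B", "GENE_C"]
expr = [
    [5.0, 5.1, 4.9, 5.0],  # nearly constant across samples
    [2.0, 8.0, 2.5, 7.5],  # strongly variable
    [1.0, 1.2, 3.0, 3.1],  # moderately variable
]
kept_names, kept_rows = filter_by_mad(expr, names, cutoff=0.5)
print(kept_names)
```

Features that barely vary across samples carry little information for separating subtypes, so removing them before clustering reduces noise and computation.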

The optimal number of clusters is determined by measurements such as gap statistics [98], cophenetic coefficients [99] and cumulative distribution function (CDF). Given that cluster analysis methods are based on different algorithms, they yield different results in terms of cluster numbers and assignments [100]. To enhance the robustness of clustering, a method called cluster ensemble has been proposed, which combines results from different runs of clustering methods into a single consensus result [100]. Another similar methodology is consensus clustering, which in conjunction with resampling techniques provides a method to reach consensus from multiple runs of the same clustering method [101]. The major difference between ensemble and consensus clustering is that ensemble clustering integrates results from multiple clustering methods, while consensus clustering provides resampling and performs a single type of clustering method multiple times. Ensemble and consensus clustering methods are also applicable to biclustering and triclustering, and have been widely used in cancer subtyping studies [19, 20, 46, 102].
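
The resampling idea behind consensus clustering can be sketched as below: the same clustering method is run on many random subsamples, and for each pair of samples the fraction of runs in which they fall into the same cluster is recorded. This is only an illustrative sketch, not the implementation used in the cited studies; the base method (a small k-means with farthest-first seeding), the 80% subsampling fraction, the number of resamples and the toy data are all assumptions made here.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, n_iter=20):
    """Lloyd's k-means with farthest-first seeding; returns one label per point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of samples shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(k, int(frac * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)          # subsample without replacement
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else float("nan")
             for j in range(n)] for i in range(n)]


# Toy data: two well-separated groups of three samples each (hypothetical).
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cm = consensus_matrix(toy, k=2)
print(cm[0][1], cm[0][3])
```

Consensus values close to 0 or 1 for all pairs indicate a stable partition; many intermediate values suggest that the chosen number of clusters is not well supported. An ensemble variant would instead pool the co-clustering counts from several different clustering methods.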

Subtype characterizations

Subtype characterizations rely heavily on genomic and clinical data, and one purpose of subtype characterizations is to investigate the associations between the identified subtypes and their molecular/clinical relevance [103]. Subtype characterizations can also help to identify consensus subtypes within and between cancers, which we will cover in detail in ‘Cancer consensus molecular subtypes’ section.

Pathways, mutations, structural variations and methylation patterns can be used as the molecular characteristics. Characterizations of cancer subtypes have implications for patient outcome and targeted therapies. Lex et al. [104] developed an integrative visualization tool called StratomeX, which can help researchers to explore the relationships between subtypes and multiple genomic data types such as gene expression, DNA methylation or copy number data. These genomic data types, discussed in the ‘High-throughput molecular data for cancer subtyping’ section, can be used not only to identify robust cancer subtypes but also to better understand and interpret the molecular characteristics of the subtypes. In addition, gene set enrichment analysis (GSEA) is usually performed to characterize the biology underlying the identified subtypes. GSEA interprets the expression data at the level of gene sets, that is, groups of genes that share the same biological function, chromosomal location or regulation [105]. Annotated gene sets with specific biological meanings can be obtained, for example, from the Gene Ontology (GO) [106] and KEGG [107] databases.
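
To make the gene-set-level view concrete, the sketch below computes a running-sum enrichment score of the kind used in GSEA: walking down a ranked gene list, the score increases at genes that belong to the set and decreases otherwise, and the largest deviation from zero is reported. This is a simplified illustration under stated assumptions; the function name, the gene names and the ranking values are hypothetical, and the permutation-based significance testing that a full analysis requires is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score of gene_set along a ranked gene list.

    ranked_genes: gene names sorted by decreasing ranking metric
    scores:       ranking metric of each gene, in the same order
    p:            weight exponent (p=0 gives an unweighted score)
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(hits) - sum(hits)
    norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if n_miss == 0 or norm == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up (weighted) at a set member, step down (uniform) otherwise.
        running += abs(s) ** p / norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Hypothetical ranked list, e.g. genes ordered by differential expression
# between one subtype and all others.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_up = enrichment_score(genes, metric, {"G1", "G2"})
es_down = enrichment_score(genes, metric, {"G5", "G6"})
print(es_up, es_down)
```

A positive score indicates that the set is concentrated at the top of the ranking (relatively high in the subtype of interest), and a negative score that it is concentrated at the bottom.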

Clinical data include patient information such as age, gender, race, tumor grade, tumor size, time of diagnosis, smoking history, treatment strategies, relapse information and follow-up time, which should be well preserved and managed for clinical characterization of the identified subtypes. Moreover, survival analysis is widely used to compare survival times between subtypes. The Kaplan–Meier estimator [108] can be used to generate the survival curve, and the log-rank test provides a statistical comparison of two subtypes [109].
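As a rough illustration of these two tools, the following Python sketch computes a Kaplan–Meier curve and a two-group log-rank test from scratch. The follow-up times and event indicators are invented, the function names are hypothetical, and a real analysis would normally rely on an established survival analysis package rather than this minimal version.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time per patient; events: 1 = event observed, 0 = censored.
    """
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 d.f."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)] + \
             [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e, _ in pooled if e}):
        n = sum(1 for u, _, _ in pooled if u >= t)                # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)   # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e)          # events at time t
        d_a = sum(1 for u, e, g in pooled if u == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A.
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of the chi-square distribution with 1 d.f.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical follow-up data for two subtypes (all events observed).
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
```

Censored patients contribute to the number at risk until their last follow-up but never count as events, which is why complete and well-managed follow-up data matter for this step.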

Subtype characterizations are necessary and important. They not only help us understand the subtypes but also provide a subtype validation process. Ideally, the identified subtypes have distinct molecular and clinical characteristics. Often, however, subtypes differ statistically but not biologically. In such cases, reclustering and reclassification should be repeated until more interpretable results are obtained.

Moving toward clinical applications

From high-throughput molecular data and molecular subtyping of cancer to the development of marker panels using low- and medium-throughput methods, clinicians are beginning to embrace cancer subtyping studies and to base treatment decisions for cancer patients on them [110, 111]. In the following, we provide a few examples of subtyping studies that have been applied to cancer diagnosis, prognosis, response prediction and drug design. Specifically, we focus on biomarkers for diagnostic and prognostic purposes in the ‘Biomarkers identified from subtyping studies for cancer diagnosis and prognosis’ section, and on cancer subtypes for therapy response prediction and drug development in the ‘Cancer subtypes for predicting therapy response and drug design’ section.

Biomarkers identified from subtyping studies for cancer diagnosis and prognosis

Biomarkers identified from subtyping studies with specific indications for cancer diagnosis and prognosis are now widely applied in clinical research, and increasingly combined with conventional histology to improve diagnostic accuracy [112]. For example, TLE1 serves as a diagnostic marker for synovial sarcoma [113], and CD10, BCL6 and MUM1 serve as diagnostic markers for the germinal center B-cell-like (GCB) subtype of lymphoma [114]. Furthermore, biomarkers can be used directly to detect cancer. For instance, Bauer et al. [50] analyzed the complete miRNA repertoire of 136 pancreatic ductal adenocarcinoma (PDAC) samples, 27 pancreatitis samples and 22 normal controls. They used a hierarchical clustering method and an SVM classifier, and found that the analysis of only five miRNAs in blood and tissues can distinguish PDAC from pancreatitis and normal controls, possibly aiding PDAC diagnosis.

Several multigene predictors have been developed for breast cancer patients [115]. These include MammaPrint, Oncotype DX and simplified MapQuant Dx. These predictors are now widely used in the clinic to classify breast cancer patients and treat them accordingly. MammaPrint was the first successfully applied microarray-based prognostic test for breast cancer. MammaPrint uses a 70-gene signature. To identify these genes, hierarchical clustering was used to classify 98 breast cancer patients into good and poor prognosis groups. This was followed by a three-step supervised classification method to reliably stratify good and poor prognostic categories, which yielded 70 prognostic genes for breast cancer [116]. MammaPrint is a US Food and Drug Administration-approved molecular test to predict the risk of breast cancer metastasis. The result of the test can help physicians to determine the appropriate treatment strategy. Most early-stage breast cancer patients receive adjuvant chemotherapy, but only a subset of them benefit from the treatment. Paik et al. [117] thus developed a 21-gene qPCR assay called Oncotype DX. This is a diagnostic test that predicts the likelihood of chemotherapy benefit, and calculates the recurrence scores for early-stage breast cancer. Simplified MapQuant Dx is also a qPCR-based prognostic test for breast cancer. It was developed by Toussaint et al. [118], and is based on the expression patterns of four representative genes from the genomic grade index [119] and four reference genes. The prognostic information provided by the test is only applicable to estrogen receptor-positive breast cancer patients [120].

Cancer subtypes for predicting therapy response and drug design

Subtyping studies are potentially well suited to select a subset of patients that may benefit from certain drugs or therapies. For instance, Rouzier et al. [121] examined whether the four subtypes of breast cancer [16] respond differently to chemotherapy. Their results showed that the basal-like and ERBB2-overexpressing subtypes are more sensitive to paclitaxel- and doxorubicin-containing preoperative chemotherapy than the luminal and normal-like subtypes.

Tumor specimens for laboratory research are often limited in quantity, infiltrated with nontumor cells and sometimes subject to ethical constraints. Models for cancer, such as cell lines and patient-derived xenografts (PDXs), have been established as in vitro and in vivo platforms that can overcome these shortcomings of tumor specimens, and are now widely used by researchers. For instance, Ross et al. [122] provided molecular characterization of the NCI (National Cancer Institute)-60 cancer cell line panel, and demonstrated that these cell lines correspond to their tumors of origin. Gao et al. [123] established about 1000 PDXs, which provided excellent in vivo platforms to screen novel therapies for cancer patients. Cancer cell lines and PDXs can also be classified into different subtypes. For example, Kao et al. [124] classified 52 commonly used breast cancer cell lines into five subtypes [42], and defined the cell line subtypes that most faithfully capture the known heterogeneity of breast cancer. Moffitt et al. [51] sequenced 37 PDXs from PDAC and demonstrated that these models can recapitulate tumor-specific subtypes. Therefore, cell line and PDX models also provide a great opportunity to investigate subtype-specific therapies.

Recent developments in high-throughput technologies have allowed large-scale screening of chemicals and drugs on cell line panels [125]. For example, the abovementioned NCI-60 cancer cell line panel [126] has been used as a standard platform on which >40 000 chemicals were screened over the past few decades [125]. In addition, Garnett et al. [127] screened a panel of several hundred cancer cell lines with 130 drugs in clinical use and under preclinical investigation, which also provides a powerful strategy to identify subtype-specific cancer therapies and biomarkers to guide such strategies. Drug development is shifting away from cytotoxic agents toward drugs designed to target specific molecules that drive malignant progression [128]. This remains a challenging task, but subtype-specific biomarkers can become potential targets for drug design, and should be investigated and validated further [129, 130].

Challenges

We see four major challenges in cancer subtyping studies that preclude clinical implementation (Figure 2). The first is data acquisition, curation and management. The second challenge is tumor microenvironment (TME) heterogeneity. The remaining two challenges are the lack of consensus molecular subtypes, and problems with single-sample classification, respectively.

Figure 2

Four major challenges in the molecular subtyping of cancer and associated solutions/problems. The first challenge is data acquisition, curation and management. Data from publicly available data sets, such as ICGC, TCGA and GEO, can increase sample size or be used as validation data sets. Low tumor cellularity can be addressed by physical and virtual microdissection. The second challenge is TME heterogeneity. The TME includes immune cells, blood vessels, fibroblasts and ECM, which all exhibit heterogeneity at some level. The third challenge is the lack of consensus molecular subtypes. Currently, we only have three examples of consensus subtyping studies: colorectal cancer, breast cancer and the TCGA’s pan-cancer study. The last challenge is the problem with single-sample classification. Currently applied SSPs may yield inconsistent classification results.

Data acquisition, curation and management

Many cancer subtyping studies use a multiple random training-validation strategy [131], in which a training data set is used to identify molecular signatures, and validation data sets are used to validate the classification performance. Normally, researchers use their own data set for training and publicly available data sets for validation. Publicly available resources, such as the International Cancer Genome Consortium (ICGC, www.icgc.org) and The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/), contain coordinated large-scale cancer genomic data that can be accessed online. ICGC holds genomic, transcriptomic, epigenomic and clinical data from 50 different cancer types and subtypes. Currently, data from >25 000 tumor genomes are available on the ICGC website [132]. TCGA also contains a collection of cancer genomic data, and so far, >30 human tumor types have been analyzed through large-scale genome sequencing of 11 000 patient samples [133]. In addition, Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is a public repository that archives and freely distributes gene expression data from numerous studies [134]. Researchers can upload their own data to GEO or download data from GEO as validation data sets.

Subtyping study cohorts typically range from dozens to more than a few hundred tumors (Table 2). Identification of cancer subtypes has been hampered by a lack of tumor samples available for study [19]. For instance, because <20% of PDAC patients have resectable tumors at the time of diagnosis, material for profiling is typically limited [135]. Some studies have overcome this problem by integrating different sources of data to increase sample size [19, 136]. The batch effects (or nonbiological differences) that this introduces can be removed by methods such as empirical Bayes [137], surrogate variable analysis [138] or Distance Weighted Discrimination [139].
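To make the notion of a batch effect concrete, the sketch below applies per-batch mean-centering to a toy expression matrix. This is deliberately not one of the methods cited above (empirical Bayes, surrogate variable analysis or Distance Weighted Discrimination); it is the simplest location-only adjustment, shown as an assumption-laden illustration with invented values and a hypothetical function name.

```python
def center_by_batch(matrix, batches):
    """Subtract, for every feature, the mean of that feature within each batch.

    matrix: rows = features (genes), columns = samples.
    batches: one batch label per column.
    """
    corrected = [row[:] for row in matrix]
    for row_in, row_out in zip(matrix, corrected):
        for b in set(batches):
            cols = [j for j, lab in enumerate(batches) if lab == b]
            batch_mean = sum(row_in[j] for j in cols) / len(cols)
            for j in cols:
                row_out[j] = row_in[j] - batch_mean
    return corrected


# One gene measured in two hypothetical cohorts with a constant offset of 10.
expression = [[1.0, 3.0, 11.0, 13.0]]
cohort = ["a", "a", "b", "b"]
adjusted = center_by_batch(expression, cohort)
```

After centering, the two cohorts share the same scale for this gene. The approach fails when subtype composition differs strongly between batches, because the biological signal is then subtracted together with the batch offset; this is one reason the model-based methods cited above are generally preferred.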

Another common problem is the low tumor cellularity of patient samples, which makes the molecular data noisy. Capturing tumor-specific patterns in such data is difficult. Because of the tight connection and interaction between cancer cells and surrounding cells, conventional separation techniques, such as laser capture microdissection [140], cannot perfectly separate tumor cells from nontumor cells. Thus, statistical enrichment techniques such as virtual microdissection [51], and algorithms such as ESTIMATE [141] or qpure [142], can be used to assess tumor cellularity and deconvolve tumor-specific contributions.

In summary, to dissect the genetic heterogeneity of the tumor cell, molecular and clinical data should be well processed and managed. As there are abundant publicly available data sets and various data processing tools that may be useful for answering such questions, researchers should take full advantage of them.

TME heterogeneity

Heterogeneity exists not only in the tumor cell compartment but also in the TME. The TME is the sum of interactions between tumor cells and the surrounding environment, which plays an important role in tumor development, progression and therapy responses. The TME includes immune cells, blood vessels, fibroblasts and extracellular matrix (ECM). Stroma is part of the TME, and is a histological unit consisting of connective tissue, fat tissue, fibroblasts, ECM and immune cells within an extracellular scaffold [143]. Stroma, as a whole, can be classified into different subtypes with clinical implications. For instance, Moffitt et al. [51] applied NMF-based consensus clustering to hundreds of PDAC tumors and cell lines, and identified two stroma subtypes named normal and activated. The activated stroma subtype contributes to poor clinical outcome.

Heterogeneity has also been observed in other components of the TME, such as tumor-infiltrating immune cells, fibroblasts and ECM [144–148]. Solid tumors are infiltrated by various immune cells, for example, T and B lymphocytes and mast cells [149]. These immune cells either help inhibit cancer cell growth or are responsible for tumor-associated chronic inflammation. The presence of a T-cell-infiltrated TME can serve as a predictive biomarker for response to immunotherapies [144]. However, in many tumor types, only a subset of patients can generate a tumor antigen-specific T-cell response. The remaining patients lack an appropriate T-cell phenotype and resist immunotherapeutic interventions [144]. How to select patients who can potentially benefit from immunotherapies is a challenge. One way to address this problem is to identify T-cell response genes and build a binary gene expression classifier that distinguishes responders from nonresponders. The ECM is a collection of extracellular proteins present in all tissues that provides support to the tissue’s cells [150]. Recent studies have found that considerable heterogeneity exists in the ECM, and clinical outcome is often related to ECM characteristics. For instance, Bergamaschi et al. [147] identified 278 ECM-related genes to classify primary breast tumors into four groups (ECM1–4) with distinct clinical outcomes.

Although tumor and stromal cells interact closely with each other, stromal cells differ from tumor cells in terms of genetic architecture. Stromal cells are mostly genetically intact [143, 151], which suggests that the stroma could be a target of therapy. Heterogeneity in the characteristics of both tumor cells and the TME raises questions regarding future cancer treatment. Which of the two is easier to target? How do we interpret such two-dimensional heterogeneity, and how are the two dimensions related? Can we incorporate them into a single system? These questions remain to be answered.

Cancer consensus molecular subtypes

Currently, there are six subtyping systems for colorectal cancer (CRC) [20, 45–49], which classify CRC into three to six subtypes (Table 2). To identify robust consensus subtypes of CRC, a consensus subtyping effort was initiated. The Colorectal Cancer Subtyping Consortium (CRCSC) developed a network-based approach to investigate the associations between the six independent classification systems. A multi-class classifier was built that could classify CRC into four consensus molecular subtypes (CMS1–4) [152]. CMS1 tumors are highly mutated, microsatellite unstable and show strong immune activation. CMS2 tumors are characterized by marked Wnt and Myc signaling activation. CMS3 cancers are metabolically dysregulated. CMS4 cases feature transforming growth factor-β activation, stromal invasion and angiogenesis signatures. These consensus results will aid future clinical stratification and subtype-based targeted interventions for CRC, and such collaborations should serve as a role model for other cancer subtyping studies to accelerate our understanding of cancer biology [152] and to develop more efficient ways to cure cancers.

The use of different patient cohorts, platforms and clustering methods for a specific tumor type typically yields divergent subtyping results. Breast cancer (Table 2) was first classified by Perou et al. [16] into four subtypes: luminal, basal-like, normal-like and ERBB2-overexpressing. Then, Sørlie et al. [42] performed complementary DNA microarray analysis of 85 breast cancer patients and normal controls, and used hierarchical clustering to classify the patients into one of five subtypes, i.e. luminal A, luminal B, HER2 over-expression, basal and normal-like. The most recent breast cancer subtyping study by TCGA also suggested four subtypes, which are luminal A, luminal B, HER2-positive and triple-negative [43]. We can conclude that despite inconsistent naming and numbers of clusters across studies [16, 42, 43], breast tumors fall primarily into three major subtypes: luminal, HER2 overexpression and triple-negative breast cancer (TNBC) [89]. The luminal subtype is the most common one and carries a good prognosis. Tumors in this subset of patients express hormone receptors, which makes them responsive to hormone therapies. The HER2-overexpressing breast cancer subtype is more sensitive to herceptin (trastuzumab) and chemotherapy than the luminal subtype. The TNBC subtype is resistant to standard targeted therapies, and carries the worst prognosis.

The next important consideration is consensus subtyping between cancers. Although there are many cancer types based on their tissue of origin, we can observe similarities between them. The TCGA’s pan-cancer classification study [32] is a good example of this. Data from six different ‘omic’ platforms, covering 3527 tumor specimens across 12 cancer types, were analyzed integratively. A unified cancer classification system was constructed, and it identified 11 major subtypes. Among them, five subtypes were strongly associated with their tissue of origin, but the remaining subtypes were not. For instance, bladder cancers split into three pan-cancer subtypes. Lung squamous, head and neck and a subset of bladder cancers coalesced into a single subtype. This study not only provided a new classification system for multiple cancers but also demonstrated that general characteristics exist between cancers that were traditionally considered to be different entities.

Cancer is a complex disease. Without a systematic understanding of the characteristics of the disease, we cannot develop effective therapies against it. The general characteristics within and between cancers provide great opportunities to identify consensus molecular subtypes. For example, basal subtypes are defined in breast cancer [42], bladder cancer [88] and pancreatic cancer [51]. Mesenchymal subtypes are defined in glioblastoma (GBM) [41], NPC [15], breast [153], pancreatic [19] and colon cancers [20]. Basal subtypes usually express genes such as laminins and keratins, and have the worst prognosis compared with other subtypes. The characteristics of mesenchymal subtypes include a mesenchymal phenotype, high expression of proliferation genes, poor prognosis, high malignant potential and resistance to current therapies. Thus, devising treatments that are effective against multiple cancer types with shared characteristics may become a promising solution for future cancer treatment.

Single-sample classification

The abovementioned classifiers (or predictors) are mainly built based on a large number of training samples, and for this reason, we call them population-based predictors. In contrast, single-sample predictors (SSPs) are classification models that can classify a single sample into one of the molecular subtypes of a specific type of cancer [154, 155]. Traditionally, classifying a new sample into a specific subtype with a population-based predictor requires reanalysis of a large data set. In contrast, SSPs can assign a single sample to a specific subtype regardless of other samples, and are therefore more useful and practical for individual patients than population-based predictors. SSPs have been built for several types of cancer. For instance, Sørlie et al. [154] constructed the first SSP for breast cancer, Stratford et al. [136] developed an SSP for PDAC and Ringnér et al. [156] derived an SSP for lung adenocarcinoma.

SSPs are constructed based on tumor-intrinsic signatures and similarities between a given sample and molecular subtype centroids [154, 155]. Methods applied in population-based predictors, such as hierarchical clustering and the nearest centroid classification method [157], can be used in SSPs. One of the most important requirements for an SSP is that it cannot be built based on row-centered (mean-centered or median-centered) data [158]. Normally, molecular data matrices contain features in rows and samples in columns. Row-centering is a feature-centering process that can help to remove side effects caused by outlier features. The construction of SSPs includes no row-centering step, and studies have found that SSPs yield inconsistent classification results [158–160]. Sørlie et al. [161] accepted the conclusions and comments of Weigelt et al. [158], and explained the inconsistent classification results as follows: for the three one-channel-based data sets, most of the variation was caused by differences between genes rather than by differences between samples, so the correlation values vary over a much smaller range in the uncentered data. Therefore, for a sample to be correctly assigned to a subtype, it must be centered against an appropriately large and heterogeneous sample set. Sørlie et al. [161] highlighted the importance of performing row-centering in molecular data-processing steps.
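The centering problem can be made concrete with a small nearest-centroid sketch in Python. Everything in it is hypothetical: the two subtype centroids, the four genes, the reference-cohort gene means and the function names are invented for illustration, and Pearson correlation is assumed as the similarity measure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def classify_single_sample(sample, centroids):
    """Assign one profile to the subtype whose centroid correlates best with it."""
    correlations = {name: pearson(sample, c) for name, c in centroids.items()}
    return max(correlations, key=correlations.get), correlations


# Hypothetical centroids derived from row-centered training data (four genes).
centroids = {
    "subtype_1": [1.0, -1.0, 1.0, -1.0],
    "subtype_2": [-1.0, 1.0, -1.0, 1.0],
}

# Hypothetical per-gene means of a large, heterogeneous reference cohort.
gene_means = [10.0, 2.0, 8.0, 1.0]

# A new subtype_2 sample on the raw scale: gene baseline plus subtype pattern.
raw_sample = [9.0, 3.0, 7.0, 2.0]
centered_sample = [v - m for v, m in zip(raw_sample, gene_means)]

call_raw, _ = classify_single_sample(raw_sample, centroids)
call_centered, _ = classify_single_sample(centered_sample, centroids)
```

In this toy case the raw profile is dominated by baseline differences between genes and is assigned to subtype_1, whereas the same profile centered against the reference means is assigned to subtype_2. The assignment therefore depends on the sample set used for centering, which is the difficulty described above.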

In summary, building SSPs is a challenging but important task, and up to now, there are no effective ways to deal with the centering problem. Although current results are not encouraging, we hope that in the near future, applicable SSPs can be developed and applied in the clinic.

Conclusions and outlook

Heterogeneity renders cancer more than a single disease. This poses a significant challenge to the traditional management of cancer. With the advent of genome-wide molecular profiling of cancer, especially the advancements in high-throughput profiling technologies, researchers can now investigate the collection of genomic and epigenomic changes that exist in cancer. In contrast with traditional classification methods, molecular classification can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation.

The most important step in molecular subtyping of cancer is cluster analysis. Different clustering methods can produce different results, many cluster analyses are unstable and cluster analysis is a purely exploratory method [162]. It is hard to tell which algorithm is better, as this largely depends on the question asked. Thus, it is important to ensure proper preprocessing and normalization of the data, and ensemble and consensus clustering methods should be considered when performing cluster analysis. Another important step is subtype characterization. The identified subtypes should be both statistically significant and biologically relevant. This means that the collection of molecular as well as clinical data is mandatory to truly characterize the identified subtypes. Also, publicly available data sets can be used to evaluate the classification performance of the classifiers.

Although numerous molecular subtyping studies have identified subtypes for various cancer types, current cancer patient stratification still largely relies on traditional histopathological observation and assessment. We are facing several challenges (Figure 2). The gap between research findings (identified subtypes) and clinical applications can be bridged by the improvement of statistical methods and better interpretation of the results. When cancers are correctly separated into different subtypes, the next important step is to properly interpret these subtypes from a biological point of view, followed by a move toward clinical applications. We hope that the clinical tests successfully applied in breast cancer will be followed by similar tests in other cancer types.

In summary, cancer should not be treated as a single disease. Molecular subtyping can identify distinct cancer subtypes, which may shed new light on treatment strategies for cancer patients. Several challenges should be addressed before molecular subtyping can be successfully applied in the clinic.

Key Points

  • Heterogeneity renders cancer more than a single disease. Molecular subtyping can be used to assign cancers to subgroups with distinct molecular characteristics, tumor biology and clinical presentation.

  • Unsupervised classification schemes have been successfully applied to identify subtypes in a large number of malignancies. From these studies, we summarize a workflow for molecular subtyping of cancer, which includes data preprocessing, cluster analysis, supervised classification and subtype characterizations.

  • We identified and described four major challenges in cancer subtyping studies that preclude clinical implementation. The first is data acquisition, curation and management. The second challenge is TME heterogeneity. The remaining two challenges are the lack of consensus molecular subtypes, and problems with single-sample classification, respectively.

  • We suggest that standardized methods should be established to help identify intrinsic subgroup signatures and to build robust classifiers that pave the way toward stratified treatment of cancer patients.

Lan Zhao is a PhD candidate at the Department of Electronic Engineering, City University of Hong Kong. Her research interests are in the areas of machine learning, cancer genomics and computational biology.

Victor H. F. Lee is currently a Clinical Associate Professor of the Department of Clinical Oncology, the University of Hong Kong. His current interests include clinical and genetic studies on nasopharyngeal cancer, head and neck cancers, lung cancers, liver cancers and gastrointestinal cancers.

Michael K. Ng is the Head and Chair Professor of the Department of Mathematics, and Chair Professor (Affiliate) of Department of Computer Science at the Hong Kong Baptist University. As an applied mathematician, his main research areas include bioinformatics, data mining, operations research and scientific computing.

Hong Yan received his PhD degree from Yale University. He was a Professor of imaging science at the University of Sydney and currently is the chair professor of computer engineering at City University of Hong Kong. His research interests include bioinformatics, image processing and pattern recognition.

Maarten F. Bijlsma is an Associate Professor at the Academic Medical Center with the University of Amsterdam. His research focuses on pancreatic and esophageal cancer, from the most fundamental mechanisms that underlie aberrant signaling in these diseases, to the development of serum-borne markers in patient cohorts to predict treatment response and disease outcome. Furthermore, he is a Biomarker/Imaging Program leader for the AMC/VUmc Cancer Center Amsterdam.

Acknowledgement

The authors thank Xin Wang from the Department of Biomedical Sciences of the City University of Hong Kong for comments on an earlier version of the manuscript.

Funding

This work was supported by the Hong Kong Research Grants Council (RGC) (Project C1007-15G) and City University of Hong Kong (Project 7004862).

References

1

Campbell
PJ
,
Pleasance
ED
,
Stephens
PJ
, et al.
Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing
.
Proc Nat Acad Sci USA
2008
;
105
(
35
):
13081
6
.

2

Shipitsin
M
,
Campbell
LL
,
Argani
P
, et al.
Molecular definition of breast tumor heterogeneity
.
Cancer Cell
2007
;
11
(
3
):
259
73
.

3

Macintosh
CA
,
Stower
M
,
Reid
N
, et al.
Precise microdissection of human prostate cancers reveals genotypic heterogeneity
.
Cancer Res
1998
;
58
:
23
8
.

4

González-García
I
,
Solé
RV
,
Costa
J.
Metapopulation dynamics and spatial heterogeneity in cancer
.
Proc Natl Acad Sci USA
2002
;
99
(
20
):
13085
9
.

5

Iacobuzio-Donahue
CA.
Genetic evolution of pancreatic cancer: lessons learnt from the pancreatic cancer genome sequencing project
.
Gut
2012
;
61
(
7
):
1085
94
.

6

Penchev
VR
,
Rasheed
ZA
,
Maitra
A
, et al.
Heterogeneity and targeting of pancreatic cancer stem cells
.
Clin Cancer Res
2012
;
18
(
16
):
4277
84
.

7

Burrell
RA
,
McGranahan
N
,
Bartek
J
, et al.
The causes and consequences of genetic heterogeneity in cancer evolution
.
Nature
2013
;
501
(
7467
):
338
45
.

8

McGranahan
N
,
Swanton
C.
Biological and therapeutic impact of intratumor heterogeneity in cancer evolution
.
Cancer Cell
2015
;
27
(
1
):
15
26
.

9

Duggan
DJ
,
Bittner
M
,
Chen
Y
, et al.
Expression profiling using cDNA microarrays
.
Nat Genet
1999
;
21(Suppl 1)
:
10
14
.

10

Metzker
ML.
Sequencing technologies—the next generation
.
Nat Rev Genet
2010
;
11
(
1
):
31
46
.

11

Guinney
J
,
Dienstmann
R
,
Wang
X
, et al.
The consensus molecular subtypes of colorectal cancer
.
Nat Med
2015
;
21
(
11
):
1350
6
.

12

Bailey
P
,
Chang
DK
,
Nones
K
, et al.
Genomic analyses identify molecular subtypes of pancreatic cancer
.
Nature
2016
;
531
(
7592
):
47
52
.

13

Golub
TR
,
Slonim
DK
,
Tamayo
P
, et al.
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
.
Science
1999
;
286
(
5439
):
531
7
.

14

Alizadeh
AA
,
Eisen
MB
,
Davis
RE
, et al.
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
.
Nature
2000
;
403
(
6769
):
503
11
.

15

Zhao
L
,
Fong
AHW
,
Liu
N
, et al.
Molecular subtyping of nasopharyngeal carcinoma (NPC) and a microRNA-based prognostic model for distant metastasis
.
J Biomed Sci
2018
;
25
:
16
.

16

Perou
CM
,
Sørlie
T
,
Eisen
MB
, et al.
Molecular portraits of human breast tumours
.
Nature
2000
;
406
(
6797
):
747
52
.

17

Garber
ME
,
Troyanskaya
OG
,
Schluens
K
, et al.
Diversity of gene expression in adenocarcinoma of the lung
.
Proc Natl Acad Sci USA
2001
;
98
(
24
):
13784
9
.

18

Chen
X
,
Cheung
ST
,
So
S
, et al.
Gene expression patterns in human liver cancers
.
Mol Biol Cell
2002
;
13
(
6
):
1929
39
.

19

Collisson
EA
,
Sadanandam
A
,
Olson
P
, et al.
Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy
.
Nat Med
2011
;
17
:
500
3
.

20

Felipe De Sousa
EM
,
Wang
X
,
Jansen
M
, et al.
Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions
.
Nat Med
2013
;
19
:
614
18
.

21

Nielsen
TO
,
West
RB
,
Linn
SC
, et al.
Molecular characterisation of soft tissue tumours: a gene expression study
.
Lancet
2002
;
359
(
9314
):
1301
7
.

22

Kluger
Y
,
Basri
R
,
Chang
JT
, et al.
Spectral biclustering of microarray data: coclustering genes and conditions
.
Genome Res
2003
;
13
(
4
):
703
16
.

23

Lee
DD
,
Seung
HS.
Learning the parts of objects by non-negative matrix factorization
.
Nature
1999
;
401
(
6755
):
788
91
.

24

Tibshirani
R
,
Hastie
T
,
Narasimhan
B
, et al.
Diagnosis of multiple cancer types by shrunken centroids of gene expression
.
Proc Natl Acad Sci USA
2002
;
99
(
10
):
6567
72
.

25

Hearst
MA
,
Dumais
ST
,
Osuna
E
, et al.
Support vector machines
.
IEEE Intell Syst Their Appl
1998
;
13
(
4
):
18
28
.

26

Khan
J
,
Wei
JS
,
Ringner
M
, et al.
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
.
Nat Med
2001
;
7
:
673
9
.

27

Nutt
CL
,
Mani
DR
,
Betensky
RA
, et al.
Gene expression-based classification of malignant gliomas correlates better with survival than histological classification
.
Cancer Res
2003
;
63
:
1602
7
.

28

Eisen
MB
,
Spellman
PT
,
Brown
PO
, et al.
Cluster analysis and display of genome-wide expression patterns
.
Proc Natl Acad Sci USA
1998
;
95
(
25
):
14863
8
.

29

Pena
JM
,
Lozano
JA
,
Larranaga
P.
An empirical comparison of four initialization methods for the k-means algorithm
.
Pattern Recognit Lett
1999
;
20
:
1027
40
.

30

Breiman
L.
Random forests
.
Mach Learn
2001
;
45
:
5
32
.

31

Fukunaga
K
,
Narendra
PM.
A branch and bound algorithm for computing k-nearest neighbors
.
IEEE Trans Comput
1975
;
100
:
750
3
.

32

Hoadley
KA
,
Yau
C
,
Wolf
DM
, et al.
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
.
Cell
2014
;
158
(
4
):
929
44
.

33

Siang
TC
,
Soon
TW
,
Kasim
S
, et al.
A review of cancer classification software for gene expression data
.
Int J Biosci Biotechnol
2015
;
7
(
4
):
89
108
.

34

Wang
Z
,
Gerstein
M
,
Snyder
M.
RNA-seq: a revolutionary tool for transcriptomics
.
Nat Rev Genet
2009
;
10
:
57
63
.

35

Guo
Y
,
Sheng
Q
,
Li
J
, et al.
Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data
.
PLoS One
2013
;
8
(
8
):
e71462
.

36

Zhao
S
,
Fung-Leung
WP
,
Bittner
A
, et al.
Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells
.
PLoS One
2014
;
9
(
1
):
e78644
.

37

Shergill
IS
,
Shergill
NK
,
Arya
M
, et al.
Tissue microarrays: a current medical research tool
.
Curr Med Res Opin
2004
;
20
:
707
12
.

38

Veldman-Jones
MH
,
Brant
R
,
Rooney
C
, et al.
Evaluating robustness and sensitivity of the NanoString Technologies nCounter platform to enable multiplexed gene expression analysis of clinical samples
.
Cancer Res
2015
;
75
(
13
):
2587
93
.

39

Geiss
GK
,
Bumgarner
RE
,
Birditt
B
, et al.
Direct multiplexed measurement of gene expression with color-coded probe pairs
.
Nat Biotechnol
2008
;
26
:
317
25
.

40

Tamayo
P
,
Slonim
D
,
Mesirov
J
, et al.
Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation
.
Proc Natl Acad Sci USA
1999
;
96
(
6
):
2907
12
.

41

Verhaak
RGW
,
Hoadley
KA
,
Purdom
E
, et al.
Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1
.
Cancer Cell
2010
;
17
(
1
):
98
110
.

42

Sørlie
T
,
Perou
CM
,
Tibshirani
R
, et al.
Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications
.
Proc Natl Acad Sci USA
2001
;
98
(
19
):
10869
74
.

43

Cancer Genome Atlas Network
.
Comprehensive molecular portraits of human breast tumours
.
Nature
2012
;
490
:
61
70
.

44

Curtis
C
,
Shah
SP
,
Chin
SF
, et al.
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups
.
Nature
2012
;
486
(
7403
):
346
52
.

45

Schlicker
A
,
Beran
G
,
Chresta
CM
, et al.
Subtypes of primary colorectal tumors correlate with response to targeted treatment in colorectal cell lines
.
BMC Med Genomics
2012
;
5
:
66
.

46

Marisa
L
,
de Reyniès
A
,
Duval
A
, et al.
Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value
.
PLoS Med
2013
;
10
(
5
):
e1001453
.

47

Sadanandam
A
,
Lyssiotis
CA
,
Homicsko
K
, et al.
A colorectal cancer classification system that associates cellular phenotype and responses to therapy
.
Nat Med
2013
;
19
:
619
25
.

48

Budinska
E
,
Popovici
V
,
Tejpar
S
, et al.
Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer
.
J Pathol
2013
;
231
(
1
):
63
76
.

49

Roepman
P
,
Schlicker
A
,
Tabernero
J
, et al.
Colorectal cancer intrinsic subtypes predict chemotherapy benefit, deficient mismatch repair and epithelial-to-mesenchymal transition
.
Int J Cancer
2014
;
134
(
3
):
552
62
.

50

Bauer
AS
,
Keller
A
,
Costello
E
, et al.
Diagnosis of pancreatic ductal adenocarcinoma and chronic pancreatitis by measurement of microRNA abundance in blood and tissue
.
PLoS One
2012
;
7
(
4
):
e34151
.

51

Moffitt
RA
,
Marayati
R
,
Flate
EL
, et al.
Virtual microdissection identifies distinct tumor- and stroma-specific subtypes of pancreatic ductal adenocarcinoma
.
Nat Genet
2015
;
47
:
1168
78
.

52

Marcucci
G
,
Mrózek
K
,
Bloomfield
CD.
Molecular heterogeneity and prognostic biomarkers in adults with acute myeloid leukemia and normal cytogenetics
.
Curr Opin Hematol
2005
;
12
:
68
75
.

53

Nones
K
,
Waddell
N
,
Song
S
, et al.
Genome-wide DNA methylation patterns in pancreatic ductal adenocarcinoma reveal epigenetic deregulation of SLIT-ROBO, ITGA2 and MET signaling
.
Int J Cancer
2014
;
135
(
5
):
1110
18
.

54

Waddell
N
,
Pajic
M
,
Patch
AM
, et al.
Whole genomes redefine the mutational landscape of pancreatic cancer
.
Nature
2015
;
518
(
7540
):
495
501
.

55

Daemen
A
,
Peterson
D
,
Sahu
N
, et al.
Metabolite profiling stratifies pancreatic ductal adenocarcinomas into subtypes with distinct sensitivities to metabolic inhibitors
.
Proc Natl Acad Sci USA
2015
;
112
(
32
):
E4410
17
.

56

Stratton
MR
,
Campbell
PJ
,
Futreal
PA.
The cancer genome
.
Nature
2009
;
458
(
7239
):
719
24
.

57

Finkelstein
SD
,
Sayegh
R
,
Christensen
S
, et al.
Genotypic classification of colorectal adenocarcinoma. Biologic behavior correlates with K-ras-2 mutation type
.
Cancer
1993
;
71
(
12
):
3827
38
.

58

Vural
S
,
Wang
X
,
Guda
C.
Classification of breast cancer patients using somatic mutation profiles and machine learning approaches
.
BMC Syst Biol
2016
;
10(Suppl 3)
:
62
.

59

Calin
GA
,
Liu
CG
,
Sevignani
C
, et al.
MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias
.
Proc Natl Acad Sci USA
2004
;
101
:
11755
60
.

60

Calin
GA
,
Croce
CM.
MicroRNA signatures in human cancers
.
Nat Rev Cancer
2006
;
6
(
11
):
857
66
.

61

Calin
GA
,
Garzon
R
,
Cimmino
A
, et al.
MicroRNAs and leukemias: how strong is the connection?
Leuk Res
2006
;
30
(
6
):
653
5
.

62

Cantini
L
,
Caselle
M
,
Forget
A
, et al.
A review of computational approaches detecting microRNAs involved in cancer
.
Front Biosci
2017
;
22
:
1774
91
.

63

Lu
J
,
Getz
G
,
Miska
EA
, et al.
MicroRNA expression profiles classify human cancers
.
Nature
2005
;
435
(
7043
):
834
8
.

64

Feuk
L
,
Carson
AR
,
Scherer
SW.
Structural variation in the human genome
.
Nat Rev Genet
2006
;
7
(
2
):
85
97
.

65

Cook
EH
Jr,
Scherer
SW.
Copy-number variations associated with neuropsychiatric conditions
.
Nature
2008
;
455
(
7215
):
919
23
.

66

Gonzalez
E
,
Kulkarni
H
,
Bolivar
H
, et al.
The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility
.
Science
2005
;
307
(
5714
):
1434
40
.

67

Le Maréchal
C
,
Masson
E
,
Chen
JM
, et al.
Hereditary pancreatitis caused by triplication of the trypsinogen locus
.
Nat Genet
2006
;
38
(
12
):
1372
.

68

Kallioniemi
OP
,
Kallioniemi
A
,
Piper
J
, et al.
Optimizing comparative genomic hybridization for analysis of DNA sequence copy number changes in solid tumors
.
Genes Chromosomes Cancer
1994
;
10
(
4
):
231
43
.

69

Sebat
J
,
Lakshmi
B
,
Troge
J
, et al.
Large-scale copy number polymorphism in the human genome
.
Science
2004
;
305
(
5683
):
525
8
.

70

Baylin
SB.
DNA methylation and gene silencing in cancer
.
Nat Clin Pract Oncol
2005
;
2
:
S4
S11
.

71

Frommer
M
,
McDonald
LE
,
Millar
DS
, et al.
A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands
.
Proc Natl Acad Sci USA
1992
;
89
(
5
):
1827
31
.

72

Huang
THM
,
Perry
MR
,
Laux
DE.
Methylation profiling of CpG islands in human breast cancer cells
.
Hum Mol Genet
1999
;
8
:
459
70
.

73

Kwon
MS
,
Kim
Y
,
Lee
S
, et al.
Integrative analysis of multi-omics data for identifying multi-markers for diagnosing pancreatic cancer
.
BMC Genomics
2015
;
16
:
S4
.

74

Zhao
Q
,
Shi
X
,
Xie
Y
, et al.
Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA
.
Brief Bioinform
2015
;
16
:
291
303
.

75

Wang
Y.
Development of cancer diagnostics—from biomarkers to clinical tests
.
Transl Cancer Res
2015
;
4
:
270
9
.

76

Corless
CL
,
Spellman
PT.
Tackling formalin-fixed, paraffin-embedded tumor tissue with next-generation sequencing
.
Cancer Discov
2012
;
2
(
1
):
23
4
.

77

Lalkhen
AG
,
McCluskey
A.
Clinical tests: sensitivity and specificity
.
Contin Educ Anaesth Crit Care Pain
2008
;
8
(
6
):
221
3
.

78

Linnet
K
,
Bossuyt
PMM
,
Moons
KGM
, et al.
Quantifying the accuracy of a diagnostic test or marker
.
Clin Chem
2012
;
58
(
9
):
1292
301
.

79

Prokopec
SD
,
Watson
JD
,
Waggott
DM
, et al.
Systematic evaluation of medium-throughput mRNA abundance platforms
.
RNA
2013
;
19
(
1
):
51
62
.

80

Kulkarni
MM.
Digital multiplexed gene expression analysis using the NanoString nCounter system
.
Curr Protoc Mol Biol
2011
;
Chapter 25
:
Unit25B.10
.

81

Payton
JE
,
Grieselhuber
NR
,
Chang
LW
, et al.
High throughput digital quantification of mRNA abundance in primary human acute myeloid leukemia samples
.
J Clin Invest
2009
;
119
(
6
):
1714
26
.

82

Kononen
J
,
Bubendorf
L
,
Kallioniemi
A
, et al.
Tissue microarrays for high-throughput molecular profiling of tumor specimens
.
Nat Med
1998
;
4
:
844
7
.

83

Rimm
DL
,
Camp
RL
,
Charette
LA
, et al.
Amplification of tissue by construction of tissue microarrays
.
Exp Mol Pathol
2001
;
70
:
255
64
.

84

Schmidt
LH
,
Biesterfeld
S
,
Kümmel
A
, et al.
Tissue microarrays are reliable tools for the clinicopathological characterization of lung cancer tissue
.
Anticancer Res
2009
;
29
:
201
9
.

85

Camp
RL
,
Neumeister
V
,
Rimm
DL.
A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers
.
J Clin Oncol
2008
;
26
(
34
):
5630
7
.

86

Hoos
A
,
Cordon-Cardo
C.
Tissue microarray profiling of cancer specimens and cell lines: opportunities and limitations
.
Lab Invest
2001
;
81
:
1331
8
.

87

Xu
R
,
Wunsch
DC.
Clustering algorithms in biomedical research: a review
.
IEEE Rev Biomed Eng
2010
;
3
:
120
54
.

88

Cancer Genome Atlas Research Network
.
Comprehensive molecular characterization of urothelial bladder carcinoma
.
Nature
2014
;
507
:
315
22
.

89

Dai
X
,
Li
T
,
Bai
Z
, et al.
Breast cancer intrinsic subtype classification, clinical use and future trends
.
Am J Cancer Res
2015
;
5
:
2929
43
.

90

Madeira
SC
,
Oliveira
AL.
Biclustering algorithms for biological data analysis: a survey
.
IEEE/ACM Trans Comput Biol Bioinform
2004
;
1
(
1
):
24
45
.

91

Cheng
Y
,
Church
GM.
Biclustering of expression data
.
Proc Int Conf Intell Syst Mol Biol
2000
;
8
:
93
103
.

92

Gan
X
,
Liew
AW-C
,
Yan
H.
Discovering biclusters in gene expression data based on high-dimensional linear geometries
.
BMC Bioinformatics
2008
;
9
(
1
):
209
.

93

Zhao
H
,
Liew
AW-C
,
Xie
X
, et al.
A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data
.
J Theor Biol
2008
;
251
:
264
74
.

94

Mankad
S
,
Michailidis
G.
Biclustering three-dimensional data arrays with plaid models
.
J Comput Graph Stat
2014
;
23
:
943
65
.

95

Narmadha
N
,
Rathipriya
R.
Triclustering: an evolution of clustering. In:
2016 Online International Conference on Green Engineering and Technologies (IC-GET)
. IEEE, Coimbatore, India.
2016
, 1–4.

96

Li
Y
,
Ngom
A.
Classification of clinical gene-sample-time microarray expression data via tensor decomposition methods. In: Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer-Verlag Berlin, Heidelberg, Palermo, Italy, 2011, 275–86.

97

Luo
Y
,
Wang
F
,
Szolovits
P.
Tensor factorization toward precision medicine
.
Brief Bioinform
2017
;
18
:
511
4
.

98

Tibshirani
R
,
Walther
G
,
Hastie
T.
Estimating the number of clusters in a data set via the gap statistic
.
J R Stat Soc Series B Stat Methodol
2001
;
63
:
411
23
.

99

Brunet
J-P
,
Tamayo
P
,
Golub
TR
, et al.
Metagenes and molecular pattern discovery using matrix factorization
.
Proc Natl Acad Sci USA
2004
;
101
(
12
):
4164
9
.

100

Vega-Pons
S
,
Ruiz-Shulcloper
J.
A survey of clustering ensemble algorithms
.
Int J Pattern Recognit Artif Intell
2011
;
25
(
03
):
337
72
.

101

Monti
S
,
Tamayo
P
,
Mesirov
J
, et al.
Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data
.
Mach Learn
2003
;
52
:
91
118
.

102

Mukhopadhyay
A
,
Bandyopadhyay
S
,
Maulik
U.
Multi-class clustering of cancer subtypes through SVM based ensemble of pareto-optimal solutions for gene marker identification
.
PLoS One
2010
;
5
(
11
):
e13803
.

103

Wang
X
,
Markowetz
F
,
De Sousa E Melo
F
, et al.
Dissecting cancer heterogeneity–an unsupervised classification approach
.
Int J Biochem Cell Biol
2013
;
45
:
2574
9
.

104

Lex
A
,
Streit
M
,
Schulz
H-J
, et al.
StratomeX: visual analysis of large-scale heterogeneous genomics data for cancer subtype characterization
.
Comput Graph Forum
2012
;
31
:
1175
84
.

105

Subramanian
A
,
Tamayo
P
,
Mootha
VK
, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci USA
2005
;
102
:
15545
50
.

106

Ashburner
M
,
Ball
CA
,
Blake
JA
, et al.
Gene Ontology: tool for the unification of biology
.
Nat Genet
2000
;
25
:
25
9
.

107

Kanehisa
M
,
Goto
S
,
Hattori
M
, et al.
From genomics to chemical genomics: new developments in KEGG
.
Nucleic Acids Res
2006
;
34
(
90001
):
D354
7
.

108

Kaplan
EL
,
Meier
P.
Nonparametric estimation from incomplete observations
.
J Am Stat Assoc
1958
;
53
:
457
81
.

109

Mantel
N.
Evaluation of survival data and two new rank order statistics arising in its consideration
.
Cancer Chemother Rep
1966
;
50
:
163
70
.

110

Shen
T
,
Pajaro-Van de Stadt
SH
,
Yeat
NC
, et al.
Clinical applications of next generation sequencing in cancer: from panels, to exomes, to genomes
.
Front Genet
2015
;
6
:
215
.

111

Peyser
ND
,
Grandis
JR.
Cancer genomics: spot the difference
.
Nature
2017
;
541
(
7636
):
162
3
.

112

Voduc
D
,
Kenney
C
,
Nielsen
TO.
Tissue microarrays in clinical oncology
.
Semin Radiat Oncol
2008
;
18
(
2
):
89
97
.

113

Terry
J
,
Saito
T
,
Subramanian
S
, et al.
TLE1 as a diagnostic immunohistochemical marker for synovial sarcoma emerging from gene expression profiling studies
.
Am J Surg Pathol
2007
;
31
:
240
6
.

114

Hans
CP
,
Weisenburger
DD
,
Greiner
TC
, et al.
Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray
.
Blood
2004
;
103
(
1
):
275
82
.

115

Yersal
O
,
Barutca
S.
Biological subtypes of breast cancer: prognostic and therapeutic implications
.
World J Clin Oncol
2014
;
5
:
412
24
.

116

van 't Veer
LJ
,
Dai
H
,
van de Vijver
MJ
, et al.
Gene expression profiling predicts clinical outcome of breast cancer
.
Nature
2002
;
415
(
6871
):
530
6
.

117

Paik
S
,
Shak
S
,
Tang
G
, et al.
A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer
.
N Engl J Med
2004
;
351
(
27
):
2817
26
.

118

Toussaint
J
,
Sieuwerts
AM
,
Haibe-Kains
B
, et al.
Improvement of the clinical applicability of the Genomic Grade Index through a qRT-PCR test performed on frozen and formalin-fixed paraffin-embedded tissues
.
BMC Genomics
2009
;
10
:
424
.

119

Sotiriou
C
,
Wirapati
P
,
Loi
S
, et al.
Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis
.
J Natl Cancer Inst
2006
;
98
(
4
):
262
72
.

120

Wirapati
P
,
Sotiriou
C
,
Kunkel
S
, et al.
Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures
.
Breast Cancer Res
2008
;
10
:
R65
.

121

Rouzier
R
,
Perou
CM
,
Symmans
WF
, et al.
Breast cancer molecular subtypes respond differently to preoperative chemotherapy
.
Clin Cancer Res
2005
;
11
:
5678
85
.

122

Ross
DT
,
Scherf
U
,
Eisen
MB
, et al.
Systematic variation in gene expression patterns in human cancer cell lines
.
Nat Genet
2000
;
24
(
3
):
227
35
.

123

Gao
H
,
Korn
JM
,
Ferretti
S
, et al.
High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response
.
Nat Med
2015
;
21
:
1318
25
.

124

Kao
J
,
Salari
K
,
Bocanegra
M
, et al.
Molecular profiling of breast cancer cell lines defines relevant tumor models and provides a resource for cancer gene discovery
.
PLoS One
2009
;
4
(
7
):
e6146
.

125

Kim
N
,
He
N
,
Yoon
S.
Cell line modeling for systems medicine in cancers (Review)
.
Int J Oncol
2014
;
44
:
371
6
.

126

Shoemaker
RH
,
Monks
A
,
Alley
MC
, et al.
Development of human tumor cell line panels for use in disease-oriented drug screening
.
Prog Clin Biol Res
1987
;
276
:
265
86
.

127

Garnett
MJ
,
Edelman
EJ
,
Heidorn
SJ
, et al.
Systematic identification of genomic markers of drug sensitivity in cancer cells
.
Nature
2012
;
483
(
7391
):
570
5
.

128

Workman
P
,
Kaye
SB.
Translating basic cancer research into new cancer therapeutics
.
Trends Mol Med
2002
;
8
(
4
):
S1
9
.

129

Clarke
PA
,
te Poele
R
,
Workman
P.
Gene expression microarray technologies in the development of new therapeutic agents
.
Eur J Cancer
2004
;
40
:
2560
91
.

130

Hijazi
H
,
Wu
M
,
Nath
A
, et al.
Ensemble classification of cancer types and biomarker identification
.
Drug Dev Res
2012
;
73
:
414
19
.

131

Michiels
S
,
Koscielny
S
,
Hill
C.
Prediction of cancer outcome with microarrays: a multiple random validation strategy
.
Lancet
2005
;
365
(
9458
):
488
92
.

132

Hudson
TJ
,
Anderson
W
,
Aretz
A
, et al.
International network of cancer genome projects
.
Nature
2010
;
464
(
7291
):
993
8
.

133

Tomczak
K
,
Czerwińska
P
,
Wiznerowicz
M.
The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge
.
Contemp Oncol
2015
;
19
(
1A
):
A68
.

134

Barrett
T.
Gene Expression Omnibus (GEO).
2013
. https://www.ncbi.nlm.nih.gov/books/NBK159736/.

135

Neoptolemos
JP
,
Stocken
DD
,
Friess
H
, et al.
A randomized trial of chemoradiotherapy and chemotherapy after resection of pancreatic cancer
.
N Engl J Med
2004
;
350
:
1200
10
.

136

Stratford
JK
,
Bentrem
DJ
,
Anderson
JM
, et al.
A six-gene signature predicts survival of patients with localized pancreatic ductal adenocarcinoma
.
PLoS Med
2010
;
7
(
7
):
e1000307
.

137

Johnson
WE
,
Li
C
,
Rabinovic
A.
Adjusting batch effects in microarray expression data using empirical Bayes methods
.
Biostatistics
2007
;
8
(
1
):
118
27
.

138

Leek
JT
,
Storey
JD.
Capturing heterogeneity in gene expression studies by surrogate variable analysis
.
PLoS Genet
2007
;
3
(
9
):
e161
.

139

Benito
M
,
Parker
J
,
Du
Q
, et al.
Adjustment of systematic microarray data biases
.
Bioinformatics
2004
;
20
(
1
):
105
14
.

140

Emmert-Buck
MR
,
Bonner
RF
,
Smith
PD
, et al.
Laser capture microdissection
.
Science
1996
;
274
(
5289
):
998
1001
.

141

Yoshihara
K
,
Shahmoradgoli
M
,
Martínez
E
, et al.
Inferring tumour purity and stromal and immune cell admixture from expression data
.
Nat Commun
2013
;
4
:
2612
.

142

Song
S
,
Nones
K
,
Miller
D
, et al.
qpure: a tool to estimate tumor cellularity from genome-wide single-nucleotide polymorphism profiles
.
PLoS One
2012
;
7
(
9
):
e45835
.

143

Bhome
R
,
Bullock
MD
,
Al Saihati
HA
, et al.
A top-down view of the tumor microenvironment: structure, cells and signaling
.
Front Cell Dev Biol
2015
;
3
:
33
.

144

Gajewski
TF
,
Schreiber
H
,
Fu
Y-X.
Innate and adaptive immune cells in the tumor microenvironment
.
Nat Immunol
2013
;
14
:
1014
22
.

145

Jiménez-Sánchez
A
,
Memon
D
,
Pourpe
S
, et al.
Heterogeneous tumor-immune microenvironments among differentially growing metastases in an ovarian cancer patient
.
Cell
2017
;
170
:
927
38.e20
.

146

Orimo
A
,
Weinberg
RA.
Heterogeneity of stromal fibroblasts in tumor
.
Cancer Biol Ther
2007
;
6
(
4
):
618
9
.

147

Bergamaschi
A
,
Tagliabue
E
,
Sørlie
T
, et al.
Extracellular matrix signature identifies breast cancer subgroups with different clinical outcome
.
J Pathol
2008
;
214
(
3
):
357
67
.

148

Pickup
MW
,
Mouw
JK
,
Weaver
VM.
The extracellular matrix modulates the hallmarks of cancer
.
EMBO Rep
2014
;
15
(
12
):
1243
53
.

149

Pages
F
,
Galon
J
,
Dieu-Nosjean
MC
, et al.
Immune infiltration in human tumors: a prognostic factor that should not be ignored
.
Oncogene
2010
;
29
(
8
):
1093
102
.

150

Frantz
C
,
Stewart
KM
,
Weaver
VM.
The extracellular matrix at a glance
.
J Cell Sci
2010
;
123
(
Pt 24
):
4195
200
.

151

Allinen
M
,
Beroukhim
R
,
Cai
L
, et al.
Molecular characterization of the tumor microenvironment in breast cancer
.
Cancer Cell
2004
;
6
(
1
):
17
32
.

152

Guinney
J
,
Dienstmann
R
,
Wang
X
, et al.
The consensus molecular subtypes of colorectal cancer
.
Nat Med
2015
;
21
:
1350
6
.

153

Lehmann
BD
,
Bauer
JA
,
Chen
X
, et al.
Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies
.
J Clin Invest
2011
;
121
(
7
):
2750
.

154

Sørlie
T
,
Tibshirani
R
,
Parker
J
, et al.
Repeated observation of breast tumor subtypes in independent gene expression data sets
.
Proc Natl Acad Sci USA
2003
;
100
(
14
):
8418
23
.

155

Hu
Z
,
Fan
C
,
Oh
DS
, et al.
The molecular portraits of breast tumors are conserved across microarray platforms
.
BMC Genomics
2006
;
7
:
96
.

156

Ringnér
M
,
Jönsson
G
,
Staaf
J.
Prognostic and chemotherapy predictive value of gene-expression phenotypes in primary lung adenocarcinoma
.
Clin Cancer Res
2016
;
22
:
218
29
.

157

Haibe-Kains
B
,
Desmedt
C
,
Loi
S
, et al.
A three-gene model to robustly identify breast cancer molecular subtypes
.
J Natl Cancer Inst
2012
;
104
(
4
):
311
25
.

158

Weigelt
B
,
Mackay
A
,
A'hern
R
, et al.
Breast cancer molecular profiling with single sample predictors: a retrospective analysis
.
Lancet Oncol
2010
;
11
(
4
):
339
49
.

159

Lusa
L
,
McShane
LM
,
Reid
JF
, et al.
Challenges in projecting clustering results across gene expression–profiling datasets
.
J Natl Cancer Inst
2007
;
99
(
22
):
1715
23
.

160

Guiu
S
,
Michiels
S
,
Andre
F
, et al.
Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 Working Group Statement
.
Ann Oncol
2012
;
23
:
2997
3006
.

161

Sørlie
T
,
Borgan
E
,
Myhre
S
, et al.
The importance of gene-centring microarray data
.
Lancet Oncol
2010
;
11
:
719
20
.

163

Mantione
KJ
,
Kream
RM
,
Kuzelova
H
, et al.
Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq
.
Med Sci Monit Basic Res
2014
;
20
:
138
42
.

164

Khansarinejad
B
,
Soleimanjahi
H
,
Mirab Samiee
S
, et al.
Monitoring human cytomegalovirus infection in pediatric hematopoietic stem cell transplant recipients: using an affordable in-house qPCR assay for management of HCMV infection under limited resources
.
Transpl Int
2015
;
28
:
594
603
.

165

Pires
ARC
,
Andreiuolo
F da M
,
de Souza
SR.
TMA for all: a new method for the construction of tissue microarrays without recipient paraffin block using custom-built needles
.
Diagn Pathol
2006
;
1
:
14
.

166

SEQC/MAQC-III Consortium
.
A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium
.
Nat Biotechnol
2014
;
32
:
903
14
.

167

Singh
A
,
Sau
AK.
Tissue microarray: a powerful and rapidly evolving tool for high-throughput analysis of clinical specimens
.
IJCRI
2010
;
1
:
1
11
.

168

Łabaj
PP
,
Kreil
DP.
Sensitivity, specificity, and reproducibility of RNA-Seq differential expression calls
.
Biol Direct
2016
;
11
:
66
.

169

Figueroa
ME
,
Lugthart
S
,
Li
Y
, et al.
DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia
.
Cancer Cell
2010
;
17
:
13
27
.

170

Marziali
G
,
Buccarelli
M
,
Giuliani
A
, et al.
A three-microRNA signature identifies two subtypes of glioblastoma patients with different clinical outcomes
.
Mol Oncol
2017
;
11
:
1115
1129
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)