Platform-independent approach for cancer detection from gene expression profiles of peripheral blood cells

Yang, Yadong; Zhang, Tao; Xiao, Rudan; Hao, Xiaopeng; Zhang, Huiqiang; Qu, Hongzhu; Xie, Bingbing; Wang, Tao; Fang, Xiangdong

doi:10.1093/bib/bbz027

Abstract

Peripheral blood gene expression intensity-based methods for distinguishing healthy individuals from cancer patients are limited by their sensitivity to batch effects, data normalization and variability between expression profiling assays. To improve the robustness and precision of blood gene expression-based tumour detection, it is necessary to perform molecular diagnostic tests using a more stable approach. Taking breast cancer as an example, we propose a machine learning–based framework that distinguishes breast cancer patients from healthy subjects by pairwise rank transformation of gene expression intensity in each sample. We showed the diagnostic potential of the method by performing RNA-seq for 37 peripheral blood samples from breast cancer patients and by collecting RNA-seq data from healthy donors in the Genotype-Tissue Expression project and microarray mRNA expression datasets in Gene Expression Omnibus. The framework was insensitive to experimental batch effects and data normalization, and it can be simultaneously applied to new sample prediction.

cancer detection, framework, rank, expression, blood

Introduction

Breast cancer is typically diagnosed based on clinical features, medical imaging, biochemical tests, the pathological examination of clinical samples and molecular diagnostic tests. Early detection of breast cancer is critical for achieving good survival rates, with mammographic screening being the most reliable method. However, mammography is often ineffective when the tumour size is small or when breast tissue is dense. To overcome these limitations, several peripheral blood-derived materials are increasingly being used for liquid biopsies, including circulating DNA [1–4], circulating tumour cells (CTCs) [5] and exosomes [6, 7]. These materials—which can be obtained non-invasively—nonetheless have shortcomings, such as the heterogeneity of the CTC population and the inability to specifically isolate circulating tumour DNA [8]. Thus, more sensitive and specific analytical methods are needed to effectively detect trace amounts of these materials.

Plasma [9] and peripheral blood cell (PBC) gene expression profiling has been used to detect markers related to disease [10, 11] and drug response [12]. Cancer is a systemic disease associated with the perturbation of blood homeostasis resulting in detectable alterations in gene expression in erythrocytes [13], circulating leukocytes [14, 15] and tumour-educated platelets [16] that have potential applicability to cancer diagnostics. Previous studies have compared gene expression profiles in blood cells from cancer patients and healthy controls either by the direct detection of differentially expressed genes or by a machine learning approach, such as the support vector machine (SVM). However, current normalization approaches have limited capacity for batch-effect correction and may even distort biological signals [17, 18]. Therefore, signature genes or models established in a specific study cannot be directly transferred to other datasets, which hinders the applicability of public data as well as model cross-validation. To overcome this problem, the use of gene expression order has been proposed as an alternative to signal intensity since it is more stable against outliers, batch effects and different normalization algorithms [19–21] for tissue in a particular state. Given its superiority to absolute quantification methods, we propose a rank-based machine learning method to distinguish breast cancer and healthy donor blood samples and to investigate its potential for blood-based companion diagnostics (Figure 1).

Figure 1

Schematic overview of the relative expression-based model for liquid biopsies. First, expression intensity from either microarray or RNA-seq is pre-processed for pairwise comparison between any two genes. Then, the relative expression values (0/1) are recorded as new variables and subjected to dimensionality reduction. Lastly, a prediction model is constructed based on the relative expression values.

Methods

Sample collection

Peripheral blood samples were obtained from 37 women with breast cancer. Before blood sampling, signed consent forms were obtained from all participants and the protocol of the study was approved by the ethics committee of the Affiliated Hospital, Academy of Military Medical Sciences. For each case, 5 ml of peripheral blood was collected into a QIAGEN PAXgene™ blood RNA tube and immediately stored at −80°C. RNA was extracted with a QIAGEN PAXgene Blood RNA Kit using standard operating procedures.

Whole transcriptome sequencing and analysis

After total RNA extraction, mRNA libraries were constructed using the Illumina mRNA-Seq library preparation kit according to the manufacturer's protocol and 2 × 150 bp paired-end runs were performed on a Novaseq System. Sequencing quality analysis of the raw data was performed using FASTQC software (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The human GRCh37 reference genome was downloaded from iGenome (http://ccb.jhu.edu/software/tophat/igenomes.shtml), and the associated .GTF files were downloaded from the Ensembl website (http://asia.ensembl.org/index.html). Reads were then aligned to the reference genome using HISAT2 with default parameters, and the aligned reads were assembled and quantified by StringTie software [22]. The gene expression was represented by the fragments per kilobase of exon model per million mapped reads value of each sample. The RNA-seq raw data are deposited in the Genome Sequence Archive [8] with accession number PRJCA001108 (private link for review: http://bigd.big.ac.cn/gsa/s/qYELd91n).

Data and pre-processing

Multiple gene expression datasets generated from the ABI (ABI Human Genome Survey Microarray v.2) and Affymetrix (Affymetrix Human Exon 1.0 ST Array) platforms, Illumina HiSeq 2500 and Broad Institute Human L1000 epsilon were downloaded from Gene Expression Omnibus (GEO) [23] (Table 1). For microarray data, we used processed data from GEO. For RNA-seq data, GSE68086 read counts were normalized to reads per kilobase of transcript per million mapped reads (RPKM) values and GSE92743 RPKM values were downloaded from the GTEx website.

Table 1

Data from cancer patients and healthy subjects used in this study

GEO accession no.	Number of healthy subjects	Number of cancer patients	Platform	Source
GSE16443	63	67	ABI Human Genome Survey Microarray v. 2	Peripheral blood cells
GSE47862	163	158	Affymetrix Human Exon 1.0 ST Array	Peripheral blood cells
GSE11545	9	11	ABI Human Genome Survey Microarray v. 2	Peripheral blood cells
GSE92743	137		RNA-seq (Broad Institute Human L1000 epsilon)	Peripheral blood cells
PRJCA001108	0	37	RNA-seq (Illumina Novaseq)	Peripheral blood cells
GSE68086	54	40	RNA-seq (Illumina HiSeq 2500)	Platelets

Transformation of gene expression intensity to rank information

For each sample, we carried out pairwise comparisons of expression values of all genes. For each gene pair (G_i, G_j), the rank comparison, denoted as G_ij, should be 1 (G_i > G_j) or 0 (G_i ≤ G_j).

$$\begin{equation} {G}_{ij}=\left\{\begin{array}{c}1,\kern0.5em {G}_i>{G}_j\\ {}0,\kern0.5em {G}_i\le {G}_j\end{array}\right. \end{equation}$$

(1)

After the transformation, a large increase in the number of features is expected—i.e.

$$\begin{equation} n\to {C}_n^2=\frac{n\left(n-1\right)}{2} \end{equation}$$

(2)

where n is the total number of genes expressed in a sample.
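The transformation in Equations (1) and (2) can be sketched for a single sample as follows. This is a minimal illustration, not the authors' implementation; the function names `pairwise_rank_features` and `n_pairs` and the toy expression values are assumptions introduced here.

```python
import numpy as np


def pairwise_rank_features(expr):
    """Binary pairwise comparisons for one sample (Equation 1).

    expr: 1-D sequence of expression values for n genes.
    Returns (pairs, values) where pairs[k] = (i, j) with i < j and
    values[k] = 1 if expr[i] > expr[j], otherwise 0.
    """
    expr = np.asarray(expr, dtype=float)
    i_idx, j_idx = np.triu_indices(expr.size, k=1)  # all index pairs i < j
    values = (expr[i_idx] > expr[j_idx]).astype(np.uint8)
    return list(zip(i_idx.tolist(), j_idx.tolist())), values


def n_pairs(n):
    """Number of gene pairs for n genes (Equation 2)."""
    return n * (n - 1) // 2


# Toy sample with four genes (hypothetical values, including one tie).
sample = [5.2, 0.7, 3.1, 3.1]
pairs, values = pairwise_rank_features(sample)

# A strictly increasing transform (here log2(x + 1)) leaves the features unchanged.
_, values_log = pairwise_rank_features(np.log2(np.asarray(sample) + 1.0))
```

Because only the within-sample ordering is used, any strictly increasing rescaling of a sample gives identical features, which is the property the method relies on. Equation (2) also shows why feature reduction is needed: `n_pairs(6766)` is 22 885 995.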

Dimension reduction

Feature selection techniques do not alter the original representation of the variables but merely select an optimal subset of them. Pruning irrelevant and redundant features can improve the learning efficiency and avoid the curse of dimensionality.

We first treated the rank values (G_ij) as univariate features and selected the highest scoring percentage of features according to the one-way analysis of variance F-value. Univariate feature selection with the F-test examines each feature individually to determine the strength of the relationship of the feature with the target class. The advantages of this feature selection method are that it easily scales to very high-dimensional datasets and that it is computationally simple and fast. However, in this method, each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance. Therefore, we used it to pre-reduce the search space and then applied more complicated feature selection methods in the next two steps to select stronger features. We then used ElasticNet to further select important features by taking advantage of L1 and L2 regularization. The objective function is

$$\begin{equation} \underset{\beta }{\mathrm{argmin}}\frac{1}{2{n}_{samples}}{\left\Vert X\beta -y\right\Vert}_2^2+\lambda \alpha {\left\Vert \beta \right\Vert}_1+\frac{\lambda \left(1-\alpha \right)}{2}{\left\Vert \beta \right\Vert}_2^2 \end{equation}$$

(3)

The function |$\alpha {\Vert \beta \Vert}_1+(1-\alpha ){\Vert \beta \Vert}_2^2$| is the elastic net penalty, which is a convex combination of the lasso and ridge penalties. For all |$\alpha \in (0,1]$|, the elastic net penalty function is singular (without first derivative) at 0, and it is strictly convex for all α < 1, thus having the characteristics of both lasso and ridge regression. For α = 1 and α = 0, the penalty is L1 and L2, respectively. β is the coefficient vector, which is estimated by minimizing the objective function. The parameter λ is a fixed non-negative constant that multiplies the penalty terms. We optimized parameters λ and α by using 10-fold cross-validation on the training data. Then, we collected features whose coefficients were not zero. Finally, we used a randomized logistic regression method for the selection of stable features, which is suitable for classification tasks, especially where feature selection or model selection is unstable due to high dimensionality. The method is a stability selection technique, which works by fitting the L1-penalized logistic regression model hundreds of times with perturbed data (75% subsampling and a randomized regularization coefficient for each variable). Consider data (Xⁱ, Yⁱ), i = 1, …, N, with univariate response variable Y and p-dimensional covariates X. The model is defined as follows:

$$\begin{equation} \frac{1}{\mathrm{N}}\sum \limits_{\mathrm{i}=1}^Nf\big({w}^T{b}_i{X}^i+{vb}_i\big)+\lambda \sum \limits_{k=1}^p\left|{\beta}_k\right| \end{equation}$$

(4)

Here, f(z) = log(1 + exp(−z)) is the logistic loss function, |$b_i\in \{-1,1\}$|, w is the coefficient vector and v is the intercept. The parameter λ is a regularization parameter, and β is estimated by minimizing this objective function. Randomized logistic regression assigns high scores to features that are repeatedly selected across randomizations. The more times a feature is selected, the more likely it is to be a stable variable. The method requires many fits to subsamples of the data set and is therefore much more computationally demanding. However, few features remained after the previous two filtering steps, so the stable feature selection took little time.
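The three-step reduction described in this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the function name `three_step_selection`, the percentile, the `l1_ratio` grid, the selection threshold of 0.6 and the synthetic data are all hypothetical. The randomized logistic regression class of older scikit-learn releases is no longer available in current versions, so the third step is written out by hand as subsampled L1-penalized logistic regression with randomly rescaled columns, which is equivalent to a random per-feature penalty weight.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import ElasticNetCV, LogisticRegression


def three_step_selection(X, y, percentile=10, n_resamples=100,
                         sample_fraction=0.75, threshold=0.6, seed=0):
    """Return (selected column indices, selection frequencies of step-2 survivors)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: univariate filter, keep the top percentile by ANOVA F-value.
    f_filter = SelectPercentile(f_classif, percentile=percentile).fit(X, y)
    keep = np.flatnonzero(f_filter.get_support())

    # Step 2: elastic net, lambda and alpha chosen by 10-fold cross-validation
    # (scikit-learn calls them `alpha` and `l1_ratio`); keep non-zero coefficients.
    enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10, random_state=seed)
    enet.fit(X[:, keep], y)
    keep = keep[enet.coef_ != 0]
    if keep.size == 0:
        return keep, np.zeros(0)

    # Step 3: stability selection with 75% subsampling. Scaling column k by
    # s_k in [0.5, 1] is equivalent to an L1 penalty weight of 1 / s_k.
    n_sub = int(sample_fraction * X.shape[0])
    hits = np.zeros(keep.size)
    for _ in range(n_resamples):
        rows = rng.choice(X.shape[0], size=n_sub, replace=False)
        scale = rng.uniform(0.5, 1.0, size=keep.size)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        clf.fit(X[np.ix_(rows, keep)] * scale, y[rows])
        hits += clf.coef_.ravel() != 0
    freq = hits / n_resamples
    return keep[freq >= threshold], freq


# Synthetic binary features: column 0 tracks the label with about 10% noise.
demo_rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 40)
X_demo = demo_rng.integers(0, 2, size=(80, 200)).astype(float)
X_demo[:, 0] = np.where(demo_rng.random(80) < 0.1, 1 - y_demo, y_demo)
selected, freq = three_step_selection(X_demo, y_demo, n_resamples=50)
```

The cheap filter runs first so that the two more expensive steps only see a small fraction of the pairwise features, mirroring the ordering argued for in the text.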

Model selection

Rank-based features were input to predict the status of cases whose value was 1 (cancer) or 0 (healthy). We used stochastic gradient descent (SGD), random forest (RF), SVM, logistic regression (LR) and Gaussian Naive Bayes algorithms in the scikit-learn package (0.18.1) to construct classifiers; GridSearchCV package to adjust the hyper-parameters; and 10-fold cross-validation to construct models.
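A comparison of this kind might be set up as in the following sketch. The hyper-parameter grids, the helper name `compare_models` and the synthetic data are assumptions made for illustration; the text does not state which grids were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Candidate classifiers with small, hypothetical hyper-parameter grids.
CANDIDATES = {
    "SGD": (SGDClassifier(random_state=0), {"alpha": [1e-4, 1e-3, 1e-2]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "GNB": (GaussianNB(), {}),  # nothing to tune: a single cross-validated fit
}


def compare_models(X, y, cv=10):
    """Grid-search each candidate; return {name: (best CV accuracy, best params)}."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results


# Synthetic rank features (0/1): column 0 follows the label with about 10% noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)          # 0 = healthy, 1 = cancer
X = rng.integers(0, 2, size=(60, 8)).astype(float)
X[:, 0] = np.where(rng.random(60) < 0.1, 1 - y, y)
results = compare_models(X, y, cv=5)
best_name = max(results, key=lambda name: results[name][0])
```

Selecting the winner by cross-validated score on the training data only keeps the validation cohorts untouched until the final evaluation.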

Performance evaluation

To classify cancer and normal samples, the sensitivity, specificity and area under the receiver operating characteristic curve (AUC) of the classifier were estimated at different false discovery rate control levels.

$$\begin{equation} {\displaystyle \begin{array}{ll} Precision&=\frac{TP}{TP\ +\ FP}\\[4pt] {} Accuracy&=\frac{TP\ +\ TN}{TP\ +\ TN\ +\ FP\ + \ FN}\\[4pt] {} Sensitivity&=\frac{TP}{TP \ +\ FN}\end{array}} \end{equation}$$

(5)
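As a worked instance of Equation (5), the following sketch computes the metrics from confusion-matrix counts. Specificity is not written out in Equation (5); the standard definition TN / (TN + FP) is assumed here. The function name and the counts are hypothetical and are not taken from any dataset in this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts (cancer = positive class)."""
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        # Standard definition, assumed here because Equation (5) omits it.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for a 30-sample cohort.
metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
```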

Results

The schematic of breast cancer liquid biopsies using PBC gene expression data

We first used two types of single datasets (GSE68086 and GSE16443, which are RNA-seq and microarray data, respectively) to evaluate whether the rank-based model has the power to distinguish cancer and healthy subjects. Then, we validated the strategy on intra- and inter-platform datasets. Finally, we integrated much more microarray data in the model construction process and tested the applicability of the model in both microarray and RNA-seq data. For the validation process, we performed peripheral blood RNA-seq for 37 breast cancer patients and collected other public data (normal peripheral blood RNA-seq data from GTEx, normal and cancer peripheral blood microarray data from GSE11545 and GSE47862) (Figure 2).

Figure 2

Overview of the study design. We first used single datasets to confirm that a rank-based model has the power to distinguish cancer and healthy subjects. Then, we validated the strategy on intra- and inter-platform datasets. Finally, we concluded that the prediction performance was improved by including independent datasets and that a microarray-originated model can be applicable for RNA-seq data.

Rank-based features do not reduce the classification power of cancer patients versus healthy control subjects

To facilitate the comparison of datasets from different platforms and/or batches, we transformed gene expression intensities to relative information according to any two of the expressed genes. Since there was undoubtedly some information loss during the transformation, we selected two datasets to test the classification power of the transformed data: gene expression profiles from PBCs in healthy donors and breast cancer patients detected by microarray (GSE16443), and platelet expression profiles from breast cancer patients and healthy subjects detected by RNA-seq (GSE68086). For each dataset, 80% of samples were used for training and the remaining 20% were used for validation. After transforming the expression intensity of microarray data to pairwise gene order information, we obtained 22 885 995 features, of which 6 were retained after a three-step feature reduction approach. For RNA-seq data, we selected genes with RPKM > 1 in more than 95% of the samples. The RPKM value was then transformed to obtain pairwise gene order information. After the transformation, there were 1 103 355 features; among them, 16 were retained after a three-step feature reduction. A comparison of different models showed that the RF model had better predictive power in these two datasets (Supplementary Figure S1). The model distinguished cancer and healthy samples with an accuracy of 80.77% and 94.74% in GSE16443 and GSE68086, respectively (Figure 3A and C). The AUC reached 0.87 and 0.97 (Figure 3B and D), which is better than the intensity-based classification method [16, 24] and is significantly higher than the value for the randomly selected 6 and 16 features (Supplementary Figure S2). The performance of the prediction models remains stable when changing the rank-based features (Supplementary Figure S3).
We also tried to use fold change (FC) to describe the difference between genes, and we found that the FC follows an approximate negative binomial distribution in RNA-seq and a normal distribution in microarray data (Supplementary Figure S4), which is not suited for integration and training.

Figure 3

Prediction performance of the rank-based model in GSE16443 and GSE68086 datasets. (A, C) Confusion matrices of diagnostics of healthy donors (healthy) and breast cancer patients (cancer) from GSE16443 (A) and GSE68086 (C). (B, D) Receiver Operating Characteristic (ROC) curves of RF diagnostics of healthy donors and breast cancer patients from GSE16443 (B) and GSE68086 (D).

Performance of the model in cross-validations between intra-platform datasets from different batches

To test whether the rank-based model was sensitive to differences in laboratory conditions, reagent lots and personnel, we used the GSE16443 dataset as a training cohort (n = 130) and GSE11545 as an independent validation cohort. A total of 138 features were selected after a three-step dimension reduction. The LR and SVM models produced the same performance. Both outperformed the RF, SGD-trained linear classifier and Gaussian Naive Bayes models, and they completely discriminated cancer from healthy samples. They performed well when applied to an independent validation cohort (GSE11545), with a sensitivity of 81.82%, specificity of 66.67%, accuracy of 75.00% (Figure 4A) and an AUC of 0.80 (Figure 4B). To test whether there are batch effects between these two datasets, we randomly selected normal and cancer samples from them, normalized them using z-scores and performed correlation analysis. The result showed that the expression consistency between the two datasets was very poor (Supplementary Figure S5). Furthermore, we compared the performance of our method with the two most frequently used rank-based methods, the top scoring pair (TSP) [20] and k-top scoring pairs (k-TSP) [21], in selecting top gene pairs and distinguishing cancer from healthy subjects in an independent study. We used the GSE16443 data as a training cohort to build classifiers and the GSE11545 data as an independent validation cohort. The optimal value of k was determined by 5-fold cross-validation, yielding five top-scoring gene pairs. Two of the five gene pairs were replaced with the median value because LRRC37B and SRSF2 were not detected in the GSE11545 data. However, the k-TSP–based model performed poorly in predicting GSE11545 subjects, with a sensitivity of 100% and a specificity of 0%; that is, every sample was classified as cancer. When k = 1, the k-TSP algorithm is referred to simply as TSP. We obtained the same performance as with k-TSP when any one of the three remaining gene pairs was used to validate the model in the GSE11545 data.
Examining the gene pairs in detail, we found that in healthy subjects in the training cohort (GSE16443), the expression of ACBD6 is higher than that of RPL37, MARS is higher than NGLY1 and NGLY1 is higher than GMFG (Supplementary Figure S6A, C and E); however, this tendency did not exist in the validation cohort (GSE11545) (Supplementary Figure S6B, D and F).
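For orientation, the TSP baseline mentioned above can be sketched as follows, using the score as it is commonly described: for each gene pair (i, j), the absolute difference between the two classes in the frequency with which gene i is expressed below gene j. The function name and the toy matrix are assumptions for illustration; this is not the implementation used for the comparison reported here.

```python
import numpy as np


def top_scoring_pair(X, y):
    """Return the gene pair (i, j) with the largest TSP score and that score.

    X: samples x genes expression matrix; y: labels (0 = healthy, 1 = cancer).
    Score of a pair = |P(X_i < X_j | y = 1) - P(X_i < X_j | y = 0)|.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    i_idx, j_idx = np.triu_indices(X.shape[1], k=1)   # all gene pairs i < j
    less = X[:, i_idx] < X[:, j_idx]                   # samples x pairs, boolean
    score = np.abs(less[y == 1].mean(axis=0) - less[y == 0].mean(axis=0))
    best = int(np.argmax(score))
    return (int(i_idx[best]), int(j_idx[best])), float(score[best])


# Toy data: genes 0 and 1 swap order between the two classes.
X_toy = [[1, 5, 3], [2, 6, 1], [1, 4, 9],   # healthy: gene 0 < gene 1
         [5, 1, 3], [6, 2, 1], [4, 1, 9]]   # cancer:  gene 0 > gene 1
y_toy = [0, 0, 0, 1, 1, 1]
pair, score = top_scoring_pair(X_toy, y_toy)
```

A classifier built on a single pair, or on a handful of pairs, depends entirely on those orderings holding in a new cohort, which is consistent with the failure observed when the orderings did not carry over to GSE11545.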

Figure 4

Performance of the LR and SVM models in datasets from different batches. Performance of the models in the validation cohort in GSE11545 data (n = 20) (A). ROC curves of LR and SVM diagnostics of the validation cohort in GSE11545 data (B).

Performance of the model in cross-validations between inter-platform datasets

To determine whether the rank-based model could be used to predict datasets from different expression quantification platforms, we combined datasets from ABI Human Genome Survey Microarray v.2 (GSE16443) and Affymetrix Human Exon 1.0 ST Array (GSE47862). To equalize sample numbers across platforms, we randomly selected 50% (n = 160) of GSE47862 samples and 80% (n = 104) of GSE16443 samples as the training set, with the remaining samples and an independent dataset from ABI microarray platform (GSE11545) constituting the validation set. A total of 108 features were retained after the dimension-reduction step. The SVM model outperformed the others based on the training set. Validation was subsequently performed with independent validation cohorts that were not involved in feature selection or model training. In GSE16443 (n = 26 samples), sensitivity, specificity and accuracy were 80.00%, 81.82% and 80.77%, respectively (Figure 5A), with an AUC of 0.88 (Figure 5B), which is better than the model from the single ABI microarray platform (Figure 3A and B). In GSE47862 (n = 161 samples), sensitivity, specificity and accuracy were 70.59%, 80.26% and 75.16%, respectively (Figure 5C), with an AUC of 0.84 (Figure 5D). In contrast, random classifiers generated from multiple rounds of random selection of gene pairs during the SVM training process had no predictive power. In GSE11545 (n = 20 samples), sensitivity, specificity and accuracy were 81.82%, 77.78% and 80.00%, respectively (Figure 5E), with an AUC of 0.84 (Figure 5F).

Figure 5

Performance of the SVM model in datasets from different platforms. (A, C, E) Performance of the SVM model in the validation cohort in GSE16443 (n = 26) (A), GSE47862 (n = 161) (C) and GSE11545 (n = 20) (E) data. (B, D, F) ROC curves of SVM diagnostics of the validation cohort in GSE16443 (B), GSE47862 (D) and GSE11545 (F) data.

Improvement in the predictive performance of the model by integrating multicentre data

We integrated a larger number of PBC gene expression datasets from different platforms to assess whether the algorithm could improve predictive performance on an independent dataset in cancer detection. Due to the differences in sample size between the ABI Human Genome Survey Microarray and Affymetrix Human Exon Array training sets, we selected 100% of GSE16443 samples (n = 130) and 50% of GSE47862 samples (n = 160) to avoid bias introduced by platforms. A total of 52 features (Supplementary Table S1) were selected after a three-step dimension reduction. The RF model outperformed the other algorithms. The model performed well when applied to an independent validation cohort (GSE11545), with a sensitivity of 90.91%, specificity of 100%, accuracy of 95% (Figure 6A) and AUC of 0.93 (Figure 6B), which is better than the models trained with less data in Figures 4 and 5. In the remaining 50% of GSE47862 (n = 161), sensitivity, specificity and accuracy were 76.47%, 80.26% and 78.26%, respectively (Figure 6C), with an AUC of 0.84 (Figure 6D).

Figure 6

More samples from multicentre data from different platforms improve the detection of cancer samples. (A, C) Confusion matrices of the RF model generated from GSE16443 (100%) and GSE47862 (50%) integrated data in an independent validation cohort in GSE11545 data (A) and the remaining 50% of the GSE47862 (n = 161) data (C). (B, D) ROC curves of RF diagnostics of an independent validation cohort in GSE11545 data (n = 20) (B) and the remaining 50% of the GSE47862 data (D).

Application of the model generated from multiple microarray platforms to RNA-seq samples

To further test the generalizability of the model generated from ABI Human Genome Survey Microarray and Affymetrix Human Exon Array expression data, we applied the RF model to RNA-seq data. From the GTEx project, we selected the whole blood expression dataset from females whose RNA samples were prepared with a PAXgene kit and obtained 137 healthy samples. For the breast cancer samples, we performed RNA-seq for peripheral blood from 37 patients. In the validation process, we used the mode to represent the value of LOC100128076_MT1X because it exists in neither the GTEx data nor our RNA-seq data. In the combined RNA-seq data, sensitivity, specificity and accuracy were 83.78%, 67.15% and 70.69%, respectively (Figure 7A), with an AUC of 0.80 (Figure 7B).

Figure 7

Performance of the RF model generated from multiple microarray platforms in RNA-seq samples. Performance of the RF model in a validation cohort of combined RNA-seq data (n = 174) (A) and ROC curves of RF diagnostics of the validation cohort in combined RNA-seq data (B).

Detection of prognostic markers in tumour tissue based on surrogate blood mRNA profiles

For the 52 gene pairs (92 unique genes) identified with the RF model, we obtained the expression profiles of individual genes from The Cancer Genome Atlas (TCGA) breast cancer tissue samples (1091 cases). A multivariate analysis showed that 18 gene pairs were significantly associated with breast cancer patient survival (Figure 8). As a control, we randomly selected 92 genes to generate 52 gene pairs and repeated the process 60 times. The number of gene pairs significantly associated with patient survival approximately followed a Poisson distribution (λ = 2), under which the probability of k ≥ 18 is far below 0.001.
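The tail probability quoted for the random control can be checked directly from the Poisson probability mass function. The sketch below assumes only the stated λ = 2 and threshold k ≥ 18; the function name `poisson_tail` is introduced here for illustration.

```python
import math


def poisson_tail(k_min, lam, n_terms=200):
    """P(K >= k_min) for K ~ Poisson(lam), summing the upper tail directly.

    Summing the tail terms avoids the cancellation error of 1 - CDF when the
    tail is many orders of magnitude smaller than 1.
    """
    term = math.exp(-lam) * lam ** k_min / math.factorial(k_min)
    total = 0.0
    for k in range(k_min, k_min + n_terms):
        total += term
        term *= lam / (k + 1)   # P(K = k + 1) from P(K = k)
    return total


p_tail = poisson_tail(18, 2.0)
```

The resulting value is on the order of 10⁻¹², many orders of magnitude below the 0.001 bound.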

Figure 8

Kaplan–Meier overall survival analysis of breast cancer based on gene pairs using mRNA signatures from TCGA.

Web service implementation

To facilitate the implementation of the prediction model, we constructed a web interface that can be accessed at http://bigd.big.ac.cn/rankDetect. Users can simply upload their expression matrix file, and the server will give the predicted status (healthy/cancer) of each sample. In addition, we hope the users will help refine the model by submitting the actual labels (if known) of their samples on the results page.

Discussion

Blood-based liquid biopsies are a non-invasive and accessible method for cancer diagnosis, therapeutic decision-making, prognostic determination and monitoring of clinical progression and treatment response [25, 26]. To date, there are no expression-based methods that can extract multicentre data for cancer diagnostics [27, 28], hindering early cancer detection. In this study, we propose a machine learning–based method that distinguishes breast cancer patients from healthy individuals based on pairwise rank transformation of gene expression intensity in each sample. The rank-based self-learning model can offer valuable information for breast cancer diagnosis and is insensitive to batch effects and data normalization. Our results suggest that a relative ordering-based method can make direct use of microarray data from different sources, thereby expediting research on human diseases. Furthermore, the model trained from microarray data can also be applied to RNA-seq data, highlighting the clinical relevance of relative gene expression levels in blood.

Gene regulatory networks—and consequently, gene expression profiles [19]—differ between normal and disease states [29]. Cancer can alter the gene expression in blood. PBCs comprise erythrocyte, white blood cell and platelet populations that are dynamic throughout cancer initiation and progression [30]. Gene expression changes in circulating leukocytes can serve as an indicator of infection or diseases such as cancer [9]. Additionally, the number of monocytes with a typical myeloid-derived suppressor cell surface phenotype was increased during breast cancer progression and was correlated with metastasis to lymph nodes and visceral organs [31]. A blood transcriptome-based diagnosis may also be applicable to different species. One study established a classifier from genes identified as differentially expressed in mouse PBCs that showed good accuracy and high stability when applied to human breast tumour prediction based on gene expression in peripheral blood mononuclear cells [32].

We propose a human breast cancer predictor based on PBC mRNA signatures. Our approach overcomes many of the constraints of previous models by using the relative expression orders of genes to reduce noise and to identify genes associated with breast cancer over other potentially confounding factors. By integrating datasets from different platforms and/or batches, we identified 52 stable gene pairs that could be important markers representing different patterns between breast cancer patients and healthy controls. In addition, based on these markers, a refined model was constructed, which was applicable in both sequencing and microarray platforms. However, the results presented here have limited application in clinical settings for various reasons. First, the model with the best predictive power is not definitive; we cannot exclude the possibility that a sufficient number of expression datasets can lead to the generation of a convergent model. Second, due to the scarcity of expression data from cancer blood samples, our analysis used only microarray data for model training, although the model from microarray data also performed well in normal blood RNA-seq data. These results serve as a proof-of-concept for using a rank-based model to develop a liquid biopsy and provide a statistical framework for constructing predictors in the context of not only breast cancer but also other malignancies. The rank-based normalization methods could also be extended to other intensity-based features, such as miRNA expression and DNA methylation levels, to remove batch effects arising from different platforms or laboratories. Based on the principle of relative information, we plan to integrate more omics signatures such as blood DNA methylation to improve the algorithm.
In addition to potential early cancer detection, our methods of normalization, feature selection and model construction also indicated new application scenarios in multi-cancer subtypes determination or even other disease diagnosis in the future.

Key Points

Gene expression intensity order is relatively stable in specific tissues of specific status.
Given the pairwise gene expression ranks from peripheral blood transcriptome as training data, supervised machine learning methods can be used to distinguish breast cancer patients from healthy donors.
The prediction model is robust in tackling both microarray and RNA-seq data.
We constructed a web-based tool named rankDetect that allows users to upload peripheral blood expression data and returns the predicted status of cancer/health.

Acknowledgements

The authors would like to thank present and past members of Dr Fang’s laboratory for their support and advice, and the GEO database and TCGA Project Consortium for making their data publicly available.

Funding

National Key R&D Program of China (2016YFC0901701, 2016YFC0901603); ‘863 Projects’ of Ministry of Science and Technology of China (2015AA020101 and 2015AA020108); Key Research Program of the Chinese Academy of Sciences (KJZD-EW-L14).

Yadong Yang is a research assistant at the CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100049, China; he now works in GloriousMed Technology Co., Ltd., Beijing 100102. His research focuses on genomics data mining.

Tao Zhang is an MS student at the BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences. His research focuses on big data analysis and machine learning.

Rudan Xiao is an MS student at the CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. Her research focuses on disease omics analysis and translational medicine research.

Xiaopeng Hao is an associate professor in the Breast Oncology Department, The Fifth Center Affiliated PLA General Hospital. His research focuses on diagnosis and treatment of breast diseases.

Huiqiang Zhang is a doctor in the Breast Oncology Department, The Fifth Center Affiliated PLA General Hospital. His research focuses on clinical diagnosis and basic breast cancer research.

Hongzhu Qu is an associate professor at the CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. Her research focuses on genomics data mining.

Bingbing Xie is a PhD student at the CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. His research focuses on genomics data mining.

Tao Wang is a professor in the Breast Oncology Department, The Fifth Center Affiliated PLA General Hospital. Her research interests include comprehensive treatment and basic research of breast cancer.

Xiangdong Fang is a professor at the CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing 100049, China. His research interests are clinical omics and translational medicine for stem cells and complex diseases.

Author notes

Yadong Yang, Tao Zhang and Rudan Xiao contributed equally to this study.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
