Better diagnostic signatures from RNAseq data through use of auxiliary co-data

Abstract

Summary

Our aim is to improve omics-based prediction and feature selection using multiple sources of auxiliary information: co-data. Adaptive group regularized ridge regression (GRridge) was proposed to achieve this by estimating additional group-based penalty parameters through an empirical Bayes method at a low computational cost. We illustrate the GRridge method and software on RNA sequencing datasets. The method boosts the performance of ordinary ridge regression and outperforms other classifiers. Post-hoc feature selection maintains the predictive ability of the classifier with far fewer markers.

Availability and Implementation

GRridge is an R package that includes a vignette. It is freely available at https://bioconductor.org/packages/GRridge/. All information and R scripts used in this study, including those on retrieval and processing of the co-data, are available from http://github.com/markvdwiel/GRridgeCodata.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Adaptive group regularized ridge regression (GRridge, van de Wiel et al., 2016) was introduced to improve the predictive performance of logistic ridge regression by incorporating external and/or internal auxiliary information on the features: the co-data. The co-data is used to define groups of features, e.g. two groups of genes based on their presence or absence in a known gene signature, or groups of microRNAs (miRNAs) based on conservation status. GRridge then estimates group penalties using empirical Bayes. To date, GRridge is the only regularization method that can structurally and objectively account for multiple sources of information. Other similar group regularization methods, e.g. group lasso (Yuan et al., 2007) and sparse-group lasso (Simon et al., 2013), can include only one source of information for grouping of features and do not allow different weights for different groups.
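As a hedged illustration of what group-specific penalties mean, a group-regularized logistic ridge estimator for a single co-data partition can be written as below. The notation is assumed for this sketch and is not taken from the text above: ℓ is the logistic log-likelihood, λ is the global ridge penalty, 𝒢_g is the g-th group of features, and λ'_g is the penalty multiplier of that group.

```latex
\hat{\beta} = \arg\max_{\beta} \left\{ \ell(\beta; y, X)
  - \frac{\lambda}{2} \sum_{g=1}^{G} \lambda'_g \sum_{k \in \mathcal{G}_g} \beta_k^2 \right\}
```

Setting all multipliers λ'_g to 1 recovers ordinary ridge regression. A multiplier below 1 penalizes the features of that group less, so they can contribute more to the prediction.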

GRridge benefits from multidisciplinary input: it combines statistical summaries of the data with biological knowledge on the molecular features. The promise that GRridge can use co-data to improve classification comes with the obligation to explore what types of co-data may be of general use. We also present additional functionalities of the method, including better stability of the solutions, penalty estimation for overlapping groups such as gene signature pathways, automatic selection of relevant co-data sources, and post-hoc feature selection by adding an L1 penalty. The latter effectively reduces to a group-weighted elastic net, which accommodates selecting features with orthogonal effects. Stringent feature selection is often desirable for developing diagnostic assays. Therefore, we evaluate the classifiers using a dynamic number of features.

We present two applications of the GRridge method on count-based (mi)RNAseq data that were generated from difficult, impure samples, i.e. self-collected cervicovaginal specimens and blood platelets. Those samples are contaminated with non-disease-related cells. We hypothesize that the use of co-data can improve the performance of a classification model. We show the usage of different types of co-data from internal and external sources.

2 Co-data

We define co-data as nominal or quantitative feature-specific information, obtained independently of the response. Four types of co-data are distinguished, namely (1) response-independent summaries in the primary data (e.g. standard deviations); (2) feature-specific summaries from an independent study (e.g. p-values); (3) feature groupings from public databases (e.g. pathways); (4) genomic annotation (e.g. chromosome). Co-data types 1, 2 and 4 were demonstrated by van de Wiel et al. (2016) to improve methylation-based diagnostics of precancerous cervical lesions. Co-data type 1 is easily obtained from the data at hand. Co-data types 3 and 4, on the other hand, make use of publicly available information and biological knowledge related to the setting at hand.
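To make co-data type 1 concrete, the following minimal Python sketch turns per-feature standard deviations into feature groups. It is an illustration only and not the GRridge package's own routine: the function name `partition_by_sd`, the equal-size grouping rule and the toy data are assumptions introduced here.

```python
from statistics import stdev


def partition_by_sd(data, n_groups):
    """Group features by their standard deviation (hypothetical helper).

    data: list of samples, each a list of feature values (samples x features).
    Returns n_groups lists of feature indices, ordered from the
    lowest-variability group to the highest-variability group.
    """
    n_features = len(data[0])
    # Per-feature standard deviation; the response is never used,
    # which is what makes this a valid co-data source.
    sds = [stdev([row[j] for row in data]) for j in range(n_features)]
    # Feature indices sorted by increasing standard deviation.
    order = sorted(range(n_features), key=lambda j: sds[j])
    # Cut the ordered indices into contiguous groups of near-equal size.
    bounds = [round(g * n_features / n_groups) for g in range(n_groups + 1)]
    return [order[bounds[g]:bounds[g + 1]] for g in range(n_groups)]


# Toy data (made up): 4 samples, 4 features.
toy = [
    [1.0, 10.0, 5.0, 0.0],
    [1.1, 20.0, 5.5, 4.0],
    [0.9, 30.0, 4.5, 8.0],
    [1.0, 40.0, 5.0, 12.0],
]
groups = partition_by_sd(toy, 2)
print(groups)
```

Each resulting group would then receive its own penalty parameter, so that, for example, highly variable features can be penalized differently from nearly constant ones.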

3 Applications

3.1 Cervical cancer diagnostics using miRNAseq data

Deep sequencing of small non-coding RNA (miRNAseq) was performed on 56 samples (24 women with high-grade cervical intraepithelial neoplasia (CIN3) and 32 healthy women) to find relevant markers for cervical cancer screening. We provide more details of the study and the pre-processing procedures in the Supplementary Material. We applied GRridge to the pre-processed sequencing data to build a classification model, first incorporating co-data from internal sources, namely abundance and standard deviation (co-data type 1). This readily available information raises the AUC of the predictive model from 78.8% to 84.3%. We then added co-data from an external source, i.e. the conservation status of a miRNA (Agarwal et al., 2015). Although including more co-data does no significant harm, it does not guarantee a higher performance of the GRridge model. We therefore provide an approach to order and select multiple co-data sources to optimize model performance (Supplementary Material). Interpretability also improves, because the algorithm automatically determines the relevant co-data and ensures that the selected co-datasets have additional positive effects on model performance. The co-data selection procedure yielded conservation and standard deviation as significant co-data sources (AUC = 87.2%). We then applied the post-hoc L1 selection (Supplementary Material) to the GRridge model with the selected partitions. The feature selection rendered a very competitive model with 3 markers, which was clearly better than lasso and standard elastic net (Fig. 1).

Fig. 1.

The performance of classification models, assessed by leave-one-out cross-validation, for given numbers of selected features in the miRNA sequencing dataset. X-axis: the number of selected features, taking the values 3 to 25 in steps of 1, then 30, 40 and 50. Abbreviations. GRridge + selEN: GRridge model incorporating the selected partitions, with feature selection based on the group-weighted elastic net; SGL: sparse-group lasso; lasso: L1-penalized logistic regression. The AUCs of the non-selecting classification models are: ridge: 0.79; GRridge: 0.87; SVM: 0.75. Lasso models could not be estimated for 40 and 50 selected features due to a small value of λ₁. (Color version of this figure is available at Bioinformatics online.)
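The AUCs reported here are computed from leave-one-out cross-validated predictions. The Python sketch below shows one way such an evaluation can be set up; it is not the code used in the study. The helper names, the toy one-feature scorer (`fit_midpoint`, `predict_midpoint`) and the toy data are assumptions made for illustration. The AUC is computed as the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def loocv_scores(X, y, fit, predict):
    """Score each sample with a model trained on all other samples."""
    scores = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        scores.append(predict(model, X[i]))
    return scores


# Hypothetical stand-in classifier: midpoint between the class means of feature 0.
def fit_midpoint(X, y):
    cases = [x[0] for x, label in zip(X, y) if label == 1]
    controls = [x[0] for x, label in zip(X, y) if label == 0]
    return (sum(cases) / len(cases) + sum(controls) / len(controls)) / 2


def predict_midpoint(midpoint, x):
    return x[0] - midpoint  # larger value means more case-like


# Toy data (made up): one feature, three controls followed by three cases.
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
held_out = loocv_scores(X, y, fit_midpoint, predict_midpoint)
print(auc(y, held_out))
```

Any step that uses the response, such as feature selection or penalty tuning, would have to be repeated inside `fit` for each left-out sample; otherwise the cross-validated AUC is optimistically biased.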


3.2 Multi-organ classification using RNAseq data

Blood platelets extracted from patients with 6 cancer types were used to profile RNA markers for early cancer detection (Wurdinger et al., 2015). We focus on discriminating between patients with breast cancer and patients with colorectal cancer. Results for the other binary classifications are in the Supplementary Material. To assess the relevance of external information, we focus on co-data types 3 and 4. The GRridge model with co-data selection outperformed a ridge model and a support vector machine (SVM) (AUC = 89.2%, 77.8% and 77.9% for GRridge, ridge and SVM, respectively). The improvement persisted under feature selection and for most other binary classifications (Supplementary Material).

The external sources for the co-datasets include: (i) gene signature pathways, i.e. immunologic signature (Godec et al., 2016) and transcription factor based pathways (TRANSFAC version 7.4) from the Molecular Signatures Database (Subramanian et al., 2005); (ii) a list of expressed-platelet genes (Bugert et al., 2003; Gnatenko et al., 2003); and (iii) chromosomal location of the genes. The co-data selection incorporated chromosomal location and the immunologic signature pathway. Since pathways are overlapping groups of genes, we adjust GRridge to accommodate such overlap (mathematical details are in the Supplementary Material). In the Supplementary Material we also describe all co-data types, including those that did not improve the GRridge prediction.

4 Conclusions

We show the additional functionality and the applications of GRridge on (mi)RNAseq datasets. The method increases the predictive ability of an ordinary logistic ridge model with low computational effort. Further, post-hoc L1 selection reveals relevant markers while maintaining good performance of the model.

Funding

This work was supported by the European Research Council (ERC advanced 2012-AdG, proposal 322986; Molecular Self Screening for Cervical Cancer Prevention, MASS-CARE) and Cancer Center Amsterdam, VU University Medical Center (CCA 2014-5-20).

Conflict of Interest: none declared.

References

Agarwal et al. (2015) Predicting effective microRNA target sites in mammalian mRNAs. eLife, e05005.

Bugert et al. (2003) Messenger RNA profiling of human platelets by microarray hybridization. Thromb. Haemost., 738–748.

Gnatenko D.V. et al. (2003) Transcript profiling of human platelets using microarray and serial analysis of gene expression. Blood, 101, 2285–2293.

Godec et al. (2016) Compendium of immune signatures identifies conserved and species-specific biology in response to inflammation. Immunity, 194–206.

Simon et al. (2013) A sparse-group lasso. J. Comp. Graph. Stat., 231–245.

Subramanian et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A., 102, 15545–15550.

van de Wiel M.A. et al. (2016) Better prediction by use of co-data: adaptive group-regularized ridge regression. Stat. Med., 368–381.

Wurdinger et al. (2015) RNA-seq of tumor-educated platelets enables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer Cell, 666–676.

Yuan et al. (2007) Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/about_us/legal/notices)
