Abstract

Summary: SurvJamda (Survival prediction by joint analysis of microarray data) is an R package that uses joint analysis of microarray gene expression data to predict patient survival and assess risk. Joint analysis can be performed by merging datasets or by meta-analysis, in order to increase the sample size and improve survival prognosis. The prognostic performance derived from the combined datasets can then be assessed to determine which feature selection approach, joint analysis method and bias estimation technique provide the most robust prognosis for a given set of datasets.

Availability: The survJamda package is available at the Comprehensive R Archive Network, http://cran.r-project.org.

Contact:  [email protected]

1 INTRODUCTION

The survJamda package was developed for survival prediction and risk assessment based on microarray data. It allows datasets to be analyzed jointly through data merging or meta-analysis. Data merging combines the data into one set prior to analysis, whereas meta-analysis integrates only the results of the separate analyses. In addition to different joint analysis methods, survJamda contains various feature selection approaches and bias estimation techniques, which enable the user to determine which combination of methods provides the most robust prediction for a given set of datasets.

A few other R packages, such as ipdmeta (Broeze et al., 2009) and survcomp (Haibe-Kains et al., 2008), have been created for joint analysis; more specifically, they address meta-analysis of censored data with time-to-event outcomes.

2 DATA

The functions and algorithms developed in survJamda can be assessed on the datasets of the survJamda.data package. SurvJamda.data, created by the author, is an 18 MB data package containing three breast cancer datasets, GSE3143, GSE1992 and GSE4335, which were analyzed in Yasrebi et al. (2009). SurvJamda.data is also available on the Comprehensive R Archive Network.

3 METHODS

3.1 Feature selection

  • Top-ranking (Yasrebi et al., 2009). The multiple hypothesis testing correction implemented in the p.adjust function in the R stats package can also be applied to the top-ranking method.

  • User-defined method.
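
As an illustration of top-ranking combined with a multiple testing correction, the following is a minimal Python sketch, not the package's R implementation. It assumes that genes are ranked by a per-gene p-value and that the Benjamini-Hochberg correction (one of the methods offered by p.adjust) is the one chosen; the function names, gene names and p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def top_ranking(pvalues_by_gene, top_n):
    """Return the top_n gene names with the smallest adjusted p-values."""
    genes = list(pvalues_by_gene)
    adjusted = bh_adjust([pvalues_by_gene[g] for g in genes])
    ranked = sorted(zip(genes, adjusted), key=lambda pair: pair[1])
    return [gene for gene, _ in ranked[:top_n]]


# Hypothetical per-gene p-values, for illustration only.
example_pvalues = {"geneA": 0.001, "geneB": 0.01, "geneC": 0.03, "geneD": 0.5}
selected = top_ranking(example_pvalues, top_n=2)
```

The correction changes the p-values but preserves their order, so it mainly matters when a significance cut-off, rather than a fixed number of genes, decides how many features are kept.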

3.2 Joint analysis methods

  • Merging method:

    • ComBat (Johnson et al., 2007).

    • Z-score normalization. This method is applied in two ways:

      1. Z-score1 normalization: in this approach, all datasets are Z-score normalized (Larsen and Marx, 2000) before they are assigned to the training and testing sets and combined into one set (Yasrebi et al., 2009).

      2. Z-score2 normalization: in this method, the datasets are initially selected for the training and testing sets. Then, the datasets composing the training set are merged together and Z-score normalized. The testing set is also Z-score normalized independently and separately from the training set.

  • Meta-analysis. The inverse normal method (Hedges and Olkin, 1985) is used for meta-analysis.
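
The two Z-score variants and the inverse normal method listed above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the package's R code: normalization is assumed to be applied to one gene at a time across the samples of a dataset, the inverse normal method is shown in its unweighted form for one-sided p-values strictly between 0 and 1, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def zscore(values):
    """Centre one gene's expression values to mean 0 and scale to SD 1."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def zscore1_merge(datasets):
    """Z-score1: normalize every dataset on its own, then concatenate."""
    merged = []
    for values in datasets:
        merged.extend(zscore(values))
    return merged


def zscore2_split(train_datasets, test_dataset):
    """Z-score2: merge the training datasets, then normalize the merged set.

    The testing set is normalized separately from the training set.
    """
    merged_train = [v for values in train_datasets for v in values]
    return zscore(merged_train), zscore(test_dataset)


def inverse_normal(pvalues):
    """Combine one-sided p-values with the unweighted inverse normal method."""
    std_normal = NormalDist()
    # Map each p-value to a standard normal deviate and sum the deviates.
    z_sum = sum(std_normal.inv_cdf(1.0 - p) for p in pvalues)
    z_combined = z_sum / sqrt(len(pvalues))
    return 1.0 - std_normal.cdf(z_combined)
```

With Z-score1, each dataset contributes values on the same scale regardless of its original range, whereas with Z-score2 the training datasets keep their relative offsets until the merged set is normalized.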

3.3 Validation frameworks

  • Cross validation (CV) nested in 10 iterations.

  • Independent validation.

    • Pair-wise mode: two datasets are selected at a time, one of which is used as the training set and the other as the testing set. This process is iterated until all datasets are used as the training and testing sets (Yasrebi et al., 2009).

    • Leave one dataset out: all datasets except one are merged together to form the training set and the left-out set is used as the testing set. Similarly, this process is iterated until all datasets are used as the training and testing sets (Yasrebi et al., 2009).
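
The two independent validation modes can be illustrated by enumerating the training and testing splits they produce. The Python sketch below is an illustration only: it assumes that pair-wise mode visits every ordered pair of datasets, and the function names are hypothetical rather than part of survJamda.

```python
from itertools import permutations


def pairwise_splits(names):
    """Pair-wise mode: every ordered (training, testing) pair of datasets."""
    return [([train], test) for train, test in permutations(names, 2)]


def leave_one_dataset_out(names):
    """Leave one dataset out: train on all but one, test on the held-out set."""
    return [([n for n in names if n != held_out], held_out) for held_out in names]


# The three datasets shipped in survJamda.data.
datasets = ["GSE3143", "GSE1992", "GSE4335"]
pairs = pairwise_splits(datasets)
lodo = leave_one_dataset_out(datasets)
```

For the three datasets in survJamda.data, this yields six pair-wise splits and three leave-one-dataset-out splits.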

3.4 Performance measures

Conflict of Interest: none declared.

REFERENCES

Broeze K. et al. (2009) Individual patient data meta-analysis of diagnostic and prognostic studies in obstetrics, gynaecology and reproductive medicine. BMC Med. Res. Methodol., 9, 22.

Haibe-Kains B. et al. (2008) A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics, 24, 2200-2208.

Heagerty P.J. et al. (2000) Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56, 337-344.

Hedges L.V. and Olkin I. (1985) Statistical Methods for Meta-Analysis. Academic Press.

Ishwaran H. et al. (2008) Random survival forests. Ann. Appl. Statist., 2, 841-860.

Johnson W.E. et al. (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118-127.

Larsen R.J. and Marx M.L. (2000) An Introduction to Mathematical Statistics and Its Applications, 3rd edn. Prentice Hall.

Yasrebi H. et al. (2009) Can survival prediction be improved by merging gene expression data sets? PLoS ONE, 4, e7431.

Author notes

Associate Editor: Joaquin Dopazo