-
PDF
- Split View
-
Views
-
Cite
Cite
Zihuan Liu, Yan Sun, Xin Huang, BioPred: an R package for biomarkers analysis in precision medicine, Bioinformatics, Volume 40, Issue 10, October 2024, btae592, https://doi.org/10.1093/bioinformatics/btae592
- Share Icon Share
Abstract
The R package BioPred offers a suite of tools for subgroup and biomarker analysis in precision medicine. Leveraging Extreme Gradient Boosting (XGBoost) along with propensity score weighting and A-learning methods, BioPred facilitates the optimization of individualized treatment rules to streamline subgroup identification. BioPred also enables the identification of predictive biomarkers and obtaining their importance rankings. Moreover, the package provides graphical plots tailored for biomarker analysis. This tool enables clinical researchers seeking to enhance their understanding of biomarkers and patient population in drug development.
The package is available at CRAN and https://github.com/deeplearner0731/BioPred.
1 Introduction
Biomarkers are important tools for precision medicine in clinical trials development. Prognostic biomarkers provide an insights into patient disease outcomes, while predictive biomarkers predict treatment effects (Sechidis et al. 2018). Utilizing prognostic and predictive biomarkers, many subgroup identification methods are developed to explore patient populations heterogeneities in terms of disease progression and treatment response. For example, high PD-L1 IHC expression in patients with advanced non-small cell lung cancer is associated with improved efficacy of pembrolizumab (Garon et al. 2015), and BRCA1/2 mutations are instrumental in identifying patients likely to respond to PARP inhibitors (Ledermann et al. 2012). Numerous packages dedicated to subgroup analyses and predictive biomarker identification in clinical trials are available within the R statistical software environment. Notable examples include the MMMS (Li et al. 2014), GUIDE (Loh et al. 2015), ROWSi (Xu et al. 2015), subgroup (Schou and Marschner 2015), quint (Dusseldorp et al. 2016), SubgrpID (Huang et al. 2017), sparsereg (Ratkovic and Tingley 2017), TSDT (Battioui et al. 2018), FindIt (Egami et al. 2018), model4you (Seibold et al. 2019), SIDES (Riviere 2021), credsubs (Schnell et al. 2020), rlearner (Nie and Wager 2021), and CAPITAL (Cai et al. 2022).
In addition to the aforementioned packages, recent studies has demonstrated that a wide range of established statistical methodologies for subgroup identification fall under modified loss framework (Chen et al. 2017). Within this framework, the personalized R package have designed for optimal individualized treatment rule (ITR) estimation (Huling and Yu 2021). However, this package currently lacks integration with Extreme Gradient Boosting (XGBoost) within the modified loss function framework. Given XGBoost's widespread adoption in data mining communities, and its successful applications in biomedical domains, including prognostic biomarker identification (Li et al. 2022), there is a critical need for an R package specifically tailored to leverage XGBoost for the identification of subgroup and predictive biomarkers.
In this work, we introduce the R package BioPred, which integrates the modified loss function framework (Chen et al. 2017) with XGBoost to identify predictive biomarkers. In addition, BioPred provides a workflow for subgroup and predictive biomarker analysis. Specifically, the BioPred package offers analytical and visualization tools, including graphical representations of treatment effects in subgroups, biomarker response relationships, biomarker treatment interaction relationships in association with outcomes. Furthermore, BioPred is capable of handling commonly used endpoint types in clinical trials, including continuous, binary, and time-to-event endpoints. In addition, this package incorporates commonly used functions for biomarker analysis in clinical trials, such as cutoff analysis of biomarkers for responder enrichment. Overall, this platform allows researchers to conduct exploratory biomarker analyses effectively for the pharmaceutical industry.
2 BioPred R package
The general workflow of the package is shown in Fig. 1. BioPred combines A-learning (Murphy 2003) and Weight-learning techniques (Chen et al. 2017) with XGBoost to optimize ITR for both subgroup and predictive biomarker identification. The package accommodates different endpoint types, including continuous, binary, and time-to-event endpoints, making it versatile in clinical application. Results generated by BioPred can be presented in table or through visually graphical plots, facilitated by dedicated functions within the package. Furthermore, BioPred is poised to integrate simulation functions, enabling the generation of simulation data to augment analytical capabilities. We present a list of all BioPred functions along with their respective descriptions in Supplementary Table S1.

Analysis and visualization tools using the BioPred R package. (A) Subgroup and predictive biomarker identification: this panel illustrates the use of several key functions-XGBoostSub_con(), XGBoostSub_bin(), XGBoostSub_sur(), eval_metric_con(), eval_metric_bin(), eval_metric_sur(), get_subgroup_results(), and predictive_biomarker_imp()-to identify subgroups and predictive biomarkers. The input is an R DataFrame, with the output including predicted subgroup results and a ranked list of biomarker importance. (B) Performance evaluation and visualization: this component demonstrates performance evaluation and visualization, utilizing functions such as cat_summary(), subgrp_perf(), subgrp_perf_pred(), fixcut_bin(), fixcut_con(), fixcut_sur(), cdf.plot(), gam_plot(), gam_ctr_plot(), cut_perf(), roc_bin_plot(), roc_bin(), and scat_cont_plot(). The input for this component comprises results from (A), with these functions providing insights into the strength of the identified biomarkers' association with outcomes. “Biomarker +” denotes the biomarker-positive subpopulation, where the biomarker value meets the cutoff or other criteria, while “Biomarker –” signifies the biomarker-negative subpopulation, where the biomarker value does not meet the respective criteria.
2.1 The model
2.2 Subgroup identification
The XGBoostSub_con(), XGBoostSub_bin(), and XGBoostSub_sur() functions facilitate the development of subgroup models across various outcome types, with flexibility in selecting the appropriate function. These models are constructed using the xgb.train() function from the xgboost package (Chen and Guestrin 2016). Please see the documentation for the xgb.train() function of the xgboost package for more details on the possible arguments. The censors argument to denote the censor status of the data. The trt argument represents the integer vector of observed treatment statuses (e.g. for binary treatment, 1 or −1 in the th position indicates patient received the treatment or placebo). Additionally, the loss argument corresponds to the type of function, with options including loss=A-learning and loss=weight-learning. The propensity.score argument refers to a vector of values ranging between 0 and 1, representing the propensity scores. The output of get_subgroup_results() include subgroup results based on the optimal ITR, indicating whether patients will receive treatment or placebo. The value of the modified loss can be obtained using eval_metric_bin(), eval_metric_con(), and eval_metric_sur().
2.3 Predictive biomarker selection
The function predictive_biomarker_imp() can be used to access importance rankings for identified predictive biomarkers. Specifically, this function operates by taking the output of XGBoost-based subgroup functions as input and returns the importance ranking of the biomarkers along with importance plots. Of note, biomarker importance is evaluated using the xgb.importance() function within the xgboost package (Chen and Guestrin 2016).
2.4 Performance evaluation and visualization for subgroup and biomarker analysis
BioPred offers essential visualization and post-hoc analysis functions integral to biomarker analysis in clinical practice. The functions cdf.plot() empower users to visually explore the cumulative distribution of an individual biomarker in different subgroups. Typically employed after biomarker identification, the function aids in quantifying the percentage of subjects falling below or above specified cutoff values. Furthermore, the function roc_bin_plot() allows users to generate Receiver Operating Characteristic (ROC) plots for binary outcomes associated with individual biomarkers. Additionally, the functions fixcut_bin(), fixcut_con(), and fixcut_sur() facilitate the evaluation of various performance metrics across a range of candidate cutoffs for different outcome types. These functions assist users in selecting the cutoff of a biomarker based on metrics such as Youden index, Kappa agreement, P-value of subgroup testing specified by the user. This capability is particularly valuable for companion diagnostics (CDx) development with limited candidate cutoffs (e.g. IHC). Moreover, functions such as subgrp_perf_pred(), subgrp_perf (), and cut_perf () serve to evaluate subgroup performance. These functions are commonly utilized in both predictive and prognostic biomarker analyses, providing valuable insights into the strength of the identified biomarker's association with outcomes. This also aids clinical scientists in determining the optimal cutoff points for informed clinical decision-making.
2.5 Illustrative example
Following the BioPred workflow (Fig. 1), we demonstrate the application of binary outcomes using a tutorial dataset containing 10 biomarkers To develop the XGBoostSub_bin model with the A-learning loss, we employ the following approach:
XGBoostSub_bin(X, y, trt, pi,Loss_type = “A_learning”, params = list(learning_rate = 0.01, max_depth = 1, lambda = 5, tree_method = ‘hist’), nrounds = 300, disable_default_eval_metric = 0, verbose = FALSE)
Next, the subgroup results can be obtained using:
get_subgroup_results (model, X)
The A-learning loss metrics can be obtained as follows:
eval_metric_bin (model, X, y, pi, trt, Loss_type = “A_learning”)
The importance ranking for identified predictive biomarkers can be accessed through:
predictive_biomarker_imp(model)
If the output indicates that is the most important biomarker, the optimal cutoff for this biomarker can be determined by:
fixcut_bin (yvar=“y”, xvar=“x1”, dir=“>“, cutoffs=c(0.1,0.3,0.5), data=tutorial_data, method=“Fisher”, yvar.display=“y”, xvar.display=“Biomarker x1”, vert.x=F)
Here, the list 0.1,0.3,0.5 represents candidate optimal cutoff values. Readers can adjust this list based on their dataset. For instance, if an optimal cutoff of 0.5 is determined, the performance at this cutoff value can be assessed using the following code:
res=cut_perf (yvar=“y”, censorvar=NULL, xvar=“x1”, cutoff=c(0.5), dir=“>”, xvars.adj=NULL, data=tutorial_data, type=“c”, yvar.display=“y”, xvar.display=“Biomarker x1”).
Once the optimal cutoff is determined, the “biomarker-positive” group can be defined. This allows for the classification into four groups: placebo biomarker-positive, placebo biomarker-negative, treatment biomarker-positive, and treatment biomarker-negative. The predictive model performance based on these defined subgroups can then be assessed using:
res = subgrp_perf_pred(yvar=“y.time”, censorvar=“y.event”, grpvar=“biogroup”, grpname=c(“biomarker_positive”,‘biomarker_negative’),trtvar=“treatment_categorical”, trtname=c(“Placebo”, “Treatment”), xvars.adj=NULL,data=tutorial_data, type=“s”)
In addition, the following method can be employed to evaluate whether a selected biomarker is predictive:
gam_ctr_plot (yvar=“y.time”, censorvar=“y.event”, xvar= “x1”, xvars.adj=NULL,sxvars.adj=NULL,trtvar=“trt”,type=“s”,data=tutorial_data, k=5, title=“Group Contrast”, ybreaks=NULL, xbreaks=NULL, rugcol.var=NULL,link.scale=T, prt.sum=T, prt.chk=F, outlier.rm=F).
Finally, to determine whether the identified biomarker is prognostic, the association between the binary response variable and the biomarker can be examined using:
roc_bin_plot (yvar=“y”, xvars=“x1”, dirs=“auto”, data=tutorial_data, yvar.display=“y.bin”, xvars.display=“Biomarker x1”)
3 Conclusion
Traditionally, the effect of predictive biomarkers is estimated by fitting a regression model that includes an interaction between the biomarker and treatment. However, such parametric models with interaction terms often struggle to identify predictive biomarkers due to the “curse of dimensionality.” By incorporating XGBoost with modified loss function, our BioPred R package overcomes these limitations, enabling the optimization of ITRs for subgroups, identifying predictive biomarkers, selecting optimal biomarker cutoffs, and visualizing results graphically, even as the number of biomarkers increases with data from genomics and proteomics. Furthermore, BioPred holds practical clinical applications, including the development of CDx and enrichment designs for clinical trials.
Acknowledgements
This article is sponsored by AbbVie. AbbVie contributed to the design, research, and interpretation of data, writing reviewing, and approving the content. Z.L., Y.S., and X.H. are employees of AbbVie Inc. All authors may own AbbVie stock.
Data availability
No new data were generated or analysed in support of this research.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
None declared.