Fuyi Li, Shuangyu Dong, André Leier, Meiya Han, Xudong Guo, Jing Xu, Xiaoyu Wang, Shirui Pan, Cangzhi Jia, Yang Zhang, Geoffrey I Webb, Lachlan J M Coin, Chen Li, Jiangning Song, Positive-unlabeled learning in bioinformatics and computational biology: a brief review, Briefings in Bioinformatics, Volume 23, Issue 1, January 2022, bbab461, https://doi.org/10.1093/bib/bbab461
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive-unlabeled (PU) learning scheme was therefore proposed to enable classifiers to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications. We extensively discuss their important aspects, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives on the future development of PU learning applications. We anticipate that our work will serve as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for developing next-generation PU learning frameworks for critical biological applications.
Introduction
With advances in high-throughput sequencing techniques and large-scale biomedical experiments, an unprecedented volume of biological data has accumulated, allowing for data-driven computational analysis. Compared to wet laboratory-based research, which is often laborious and time-consuming, computational methods can predict and analyze potential patterns from the massive volume of biological data and shortlist candidates for follow-up experimental validation. To date, computational prediction and analysis have been successfully applied to address a broad spectrum of fundamental biological questions, such as homology sequence detection [1, 2], identification of genomic signals and regions [3–6], protein–protein interaction (PPI) and complex prediction [7–10], gene function prediction [11–14], protein/RNA/DNA functional site prediction [15–32] and biomedical image classification [33–36]. Among these computational analysis tasks, classification, which aims to assign test samples to different classes (e.g. positive and negative samples in binary classification), is of particular importance. For example, in protein phosphorylation site prediction, both phosphorylated peptides (i.e. positive samples) and non-phosphorylated peptides (i.e. negative samples) need to be provided to train the classification models [37, 38]. The performance of a classifier is therefore highly dependent on the quality of the training samples, the correctness of sample labeling and the ratio of positive to negative samples. In the research areas of bioinformatics and computational biology, a variety of supervised-learning algorithms have been applied to construct classification models [39], such as support vector machines (SVM) [40], random forest (RF) [41], naïve Bayesian (NB) classifier [42] and logistic regression (LR) [43].
However, in numerous biological applications, negative samples are either limited or uncertain because it is much more straightforward to confirm a property than to ascertain that it does not hold. A potential binding site is confirmed if it binds a target, but failure to bind only means that the conditions for binding were not satisfied under a given experimental condition. Further, technological advances often lead to improved identification of specific properties, meaning that biological samples previously not known to have a property can now be confidently classified. For example, in our recent study [44], we demonstrated the changes in protein glycosylation site labeling across four time points spanning 10 years. Another example is PPI prediction [45, 46], where experimentally validated PPIs and non-interacting protein pairs are used as positive and negative training samples, respectively. Nevertheless, selecting non-interacting protein pairs is challenging for two reasons: (i) novel PPIs are constantly being discovered over time, indicating that some non-interacting protein pairs (i.e. negative samples) might be mislabeled; and (ii) there is a large number of protein pairs for which no interactions have been identified, significantly outnumbering the positive samples. Similar situations can also be found in gene function prediction [47–50], biological sequence classification [51], small non-coding RNA detection [52] and drug–drug interaction identification [53]. To address these issues, the positive-unlabeled (PU) learning scheme, a special category of semi-supervised learning, has recently emerged as a useful approach [54]. The semi-supervised learning scheme generally refers to building prediction models with partially labeled training data (e.g. the labeled data can be either positives only, or both positives and negatives) [55].
In contrast, PU learning specifically refers to the semi-supervised scheme that builds classification models directly from a small number of labeled positive samples and a large volume of unlabeled samples (i.e. a mixture of both positive and negative samples) [54]. Due to the presence of unlabeled data, conventional binary classifiers that require both positive and negative samples, such as SVM and RF, are no longer directly applicable. One-class learning [56] is an alternative approach that trains a model based only on positive samples; however, this scheme cannot benefit from the large amount of information that might be present in unlabeled samples. Two major research directions have been proposed to enable PU learning, as summarized in previous studies [57, 58]: (i) converting PU learning problems into conventional classification tasks by identifying reliable negatives from the unlabeled dataset, and (ii) adapting conventional classification frameworks to learn directly from positive and unlabeled samples.
To the best of our knowledge, only one review [45] has been published to date that summarized PU learning applications in protein-interaction networks, and it compared eight PU learning algorithms in that single context. However, PU learning has been widely applied in many different bioinformatic fields. It is thus highly desirable to comprehensively survey the PU learning scheme across a wide spectrum of bioinformatics tasks and to explore its applicability to biological questions. In this study, we systematically reviewed and discussed the design and implementation of 29 PU learning-based bioinformatic applications covering a wide range of biological and biomedical topics, including sequence classification, interaction prediction, gene/protein function prediction and functional site prediction. In addition, we discussed performance evaluation, existing issues and future perspectives of PU learning algorithm development to offer valuable insights into the applicability of the PU learning scheme to important biological and biomedical questions.
PU learning scheme
The PU learning scheme aims to build classifiers with competitive prediction performance using limited positive samples and high volumes of unlabeled data. To date, various PU learning algorithms have been developed and applied to address a variety of biological classification tasks (Table 1). We categorized these algorithms into two major strategies, ‘selecting reliable negatives’ and ‘adapting the base classifier,’ following Kilic et al. [45]. A schematic overview of these two strategies is illustrated in Figure 1. A key difference between the two strategies lies in how they handle unlabeled samples in the training data. The ‘selecting reliable negatives’ strategy seeks to identify a subset of unlabeled samples that can be treated as negative samples; the positive samples together with these putative negatives are then used as training data for a conventional learning algorithm. The ‘adapting the base classifier’ strategy modifies the base classifiers (e.g. SVM) to estimate and correct for the expected ratio of negative samples in the unlabeled dataset. The ‘selecting reliable negatives’ strategy includes two sub-strategies, namely ‘negative expansion’ and ‘label propagation.’ Both sub-strategies reduce the number of mislabeled negative samples by calculating the likelihood of each unlabeled sample being negative: ‘negative expansion’ performs this calculation iteratively (e.g. via bagging) through the classifier training procedure, whereas ‘label propagation’ iteratively updates the likelihoods on a pre-generated similarity matrix until convergence is reached. Please note that Figure 1 and Table 1 provide a generic classification of the PU learning tools reviewed in this article. As some tools fit multiple categories, for simplicity we classified each tool by its main PU learning characteristic according to its algorithmic description.
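To make the ‘selecting reliable negatives’ strategy concrete, here is a minimal, pure-Python toy sketch. It is not the code of any surveyed tool: the 1-D data, the `keep` fraction and the nearest-centroid scorer (standing in for a real base classifier such as an SVM) are all illustrative assumptions.

```python
# Toy sketch of the two-step 'selecting reliable negatives' strategy:
# (1) pick unlabeled samples farthest from the positives as reliable
# negatives (RNs); (2) train a conventional binary classifier on P vs RNs.
# The nearest-centroid "classifier" is a stand-in for e.g. an SVM.

def centroid(xs):
    return sum(xs) / len(xs)

def select_reliable_negatives(P, U, keep=0.5):
    """Treat the unlabeled points farthest from the positive centroid
    as reliable negatives."""
    cP = centroid(P)
    ranked = sorted(U, key=lambda x: abs(x - cP), reverse=True)
    n_rn = max(1, int(len(ranked) * keep))
    return ranked[:n_rn]

def train_and_predict(P, N, x):
    # Conventional binary step: classify by the nearer class centroid.
    return 1 if abs(x - centroid(P)) < abs(x - centroid(N)) else 0

P = [9.0, 10.0, 11.0]            # labeled positives
U = [0.5, 1.0, 1.5, 9.5, 10.5]   # unlabeled: negatives + hidden positives
RN = select_reliable_negatives(P, U)
print(train_and_predict(P, RN, 10.2))  # near the positive centroid -> 1
print(train_and_predict(P, RN, 0.8))   # near the negative centroid -> 0
```

The key design point this illustrates is that the hidden positives in U (9.5 and 10.5) are never forced into the negative class, unlike in a naive scheme that treats all of U as N.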
The following abbreviations are used throughout this review: P, U and N stand for positive, unlabeled and negative data, respectively; RN and LN denote reliable and likely negatives; FN stands for false negative; and RP denotes reliable positive.
A comprehensive list of surveyed bioinformatic tools based on the PU learning scheme^a

| Category | Sub-strategy | Tool | Year | PU strategy | Application | Base classifier | Performance evaluation | Performance evaluation metrics | Software/Code availability^b |
|---|---|---|---|---|---|---|---|---|---|
| Selecting reliable negatives | Negative expansion | PSoL [52] | 2006 | None | ncRNA identification | SVM | 5-fold CV and independent test | AUC | No |
| | | LP-IC [51] | 2008 | Centroid distance evaluation | Biological sequence classification | SVM | Independent test | F-measure | No |
| | | AGPS [47] | 2008 | Spy positives, Select RN | Gene function prediction | SVM | 10-fold CV | Precision, Recall, F-measure, AUC | No |
| | | SPE_RNE [48] | 2010 | Positive enlargement | Gene function prediction | SVM | Random test | Precision, Recall, F-measure | No |
| | | Bhardwaj et al. [50] | 2010 | Spy positives | Peripheral protein identification | C4.5 decision tree | 5-fold CV and independent test | Recall, ACC | No |
| | | NOIT [59] | 2010 | Bagging | TF-target interaction | SVM with Platt scaling | 10-fold CV and independent test | AUC, AUPR | No |
| | | ProDiGe [49] | 2011 | Bagging | Disease gene identification | Weighted-SVM | LOO CV | Precision, Recall, CDF | Webserver |
| | | Patel and Wang [60] | 2015 | Spy positives, Bagging | Gene regulatory network | SVM, RF | Random test | ACC | No |
| | | PUL-PUP [61] | 2016 | None | Pupylation site prediction | SVM | 10-fold CV | Recall, ACC, TNR, MCC, AUC | No |
| | | LDAP [62] | 2017 | Bagging | lncRNA-disease association | Weighted-SVM | LOO CV | AUC | Webserver |
| | | EPuL [63] | 2017 | Select RN | Pupylation site prediction | SVM | 10-fold CV and independent test | Recall, TNR, ACC, MCC, AUC | Webserver |
| | | DeepDCR [64] | 2020 | Select RN | Disease-associated circular RNAs prediction | Deep forests [65] | 5-fold CV and independent test | Recall, Precision, AUC | Source code |
| | | iPiDi-PUL [66] | 2021 | None | ncRNA-disease association | RF | 5-fold CV and independent test | AUC | Webserver |
| | | EmptyNN [67] | 2021 | None | Single-cell RNA sequencing quality control | Neural network | 10-fold CV and independent test | AUC, Recall, Specificity | Source code |
| | Label propagation | PUDI [68] | 2012 | None | Disease gene identification | Multi-level weighted SVM | 10-fold CV, 3-fold CV and benchmark | F-measure | Software |
| | | EPU [69] | 2014 | Ensemble classifiers | Disease gene identification | KNN, NB, SVM | 3-fold CV, LOO CV | Precision, Recall, F-measure | No |
| | | PULSE [70] | 2015 | None | Stably folded isoform discovery | RF | 5-fold CV | AUC | Software |
| | | PUPre [71] | 2015 | Spy positives, Select RN | Conformational B-cell epitopes prediction | Weighted-SVM | 10-fold CV | F-measure | No |
| | | PUDT [72] | 2016 | Ensemble label propagation | Drug-target interaction prediction | Weighted-SVM | 5-fold CV | AUC | No |
| Adapting the base classifier | – | PosOnly [57] | 2010 | None | Gene regulatory networks | SVM with Platt scaling | 10-fold CV and independent tests | F-measure, AUC | No |
| | | SIRENE [73] | 2013 | None | TF-CRE interaction | SVM | 3-fold CV | Precision, Recall | No |
| | | HOCCLUS2 [74] | 2014 | Bagging | Gene regulatory networks | SVM | Independent tests | AUC | No |
| | | PRIPU [75] | 2015 | None | Protein-RNA interaction | Biased-SVM | 5-fold CV | Precision, Recall, ACC, EPR | Source code |
| | | PUEL [76] | 2016 | Bagging | Kinase substrate prediction | SVM | LOO CV | Recall, TNR, F-measure, MCC, GM | Software and source code |
| | | MutPred2 [77] | 2017 | Bagging, Prior probability | Pathogenic amino acid variants prioritization | Feed-forward neural networks | 10-fold CV | AUC | Software and webserver |
| | | PAnDE [78] | 2017 | Bagging | Glycosylation sites prediction | AnDE | 10-fold CV and independent tests | F-measure | No |
| | | GlycoMine_PU [44] | 2019 | Bagging, Prior probability | Glycosylation sites prediction | AnDE | 10-fold CV and independent tests | ACC, F-measure, AUC | Webserver |
| | | Topaz [79] | 2019 | None | Particle detection | CNN | 10-fold CV and independent tests | Precision | Software and source code |
| | | PU-HIV [80] | 2021 | None | HIV-1 protease cleavage site prediction | Biased-SVM | 10-fold CV and independent tests | Precision, Recall, F-measure | Source code |
^a Abbreviations: ACC – accuracy; AnDE – averaged n-dimensional estimators; AUC – area under the ROC curve; AUPR – area under the precision-recall curve; LOO – leave one out; CDF – cumulative distribution function [49]; CNN – convolutional neural network; CRE – cis-regulatory elements; CV – cross-validation; EPR – explicit positive recall [75]; GM – geometric mean; KNN – K nearest neighbors; lncRNA – long non-coding RNA; LR – logistic regression; MCC – Matthews correlation coefficient; NB – naive Bayesian; ncRNA – non-coding RNA; RF – random forest; RN – reliable negatives; ROC – receiver operating characteristic; SVM – support vector machine; TF – transcription factor; TNR – true negative rate.
^b The URL addresses for the listed tools are as follows: ProDiGe: http://cbio.ensmp.fr/prodige/; LDAP: http://bioinformatics.csu.edu.cn/ldap/; EPuL: http://59.73.198.144:8080/EPuL/; DeepDCR: https://github.com/xzenglab/DeepDCR/; iPiDi-PUL: http://bliulab.net/iPiDi-PUL/server/; EmptyNN: https://github.com/lkmklsmn/empty_nn/; PUDI: http://www1.i2r.a-star.edu.sg/~xlli/PUDI/PUDI.html/; PULSE: http://www.kimlab.org/software/pulse/; PRIPU: http://admis.fudan.edu.cn/projects/pripu.htm/; PUEL: https://github.com/PengyiYang/KSP-PUEL/; MutPred: http://mutpred.mutdb.org/; GlycoMine_PU: https://glycomine.erc.monash.edu/Lab/GlycoMine_PU/; Topaz: http://topaz.csail.mit.edu/; PU-HIV: https://github.com/allenv5/PU-HIV/.

A schematic illustration of the different types of PU learning algorithms used in the compared bioinformatic tools, including ‘selecting reliable negatives’: (A) negative expansion, (B) label propagation and (C) ‘adapting the base classifier.’ Panel (D) illustrates the four bioinformatics application domains of PU learning algorithms covered in this review: DNA/RNA/protein sequence classification, functional site prediction, protein/gene function prediction and interaction prediction. RNs: reliable negatives; LN: dataset of likely negatives; P: dataset of positives; U: dataset of unlabeled samples.
Selecting reliable negatives
The workflow of PU approaches within the ‘selecting reliable negatives’ strategy is illustrated in Figure 1A and B. Two main approaches for selecting reliable negative samples have been proposed: ‘negative expansion’ and ‘label propagation.’
Negative expansion
As illustrated in Figure 1A, ‘negative expansion’ approaches iterate through a simple process: selected unlabeled samples are first labeled as putative negatives, a model is built to discriminate positives from these putative negatives, and the model is then applied to select additional unlabeled samples as putative negatives. The putative negatives selected at each iteration are the samples for which the current model is most confident.
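The iterative loop above can be sketched in a few lines of pure Python. This is an illustrative toy, not any surveyed tool's implementation: a moving negative-centroid margin stands in for the per-iteration retraining of a real classifier, and the 1-D data are assumptions.

```python
# Toy 'negative expansion' loop (Figure 1A): seed N with the sample
# farthest from the positives, then repeatedly re-rank the remaining
# unlabeled samples and absorb the most confidently negative one.

def negative_expansion(P, U, n_iter=3):
    cP = sum(P) / len(P)
    # Seed N with the unlabeled sample farthest from the positive centroid.
    unlabeled = sorted(U, key=lambda x: abs(x - cP), reverse=True)
    N = [unlabeled.pop(0)]
    for _ in range(n_iter):
        if not unlabeled:
            break
        cN = sum(N) / len(N)
        # "Retrained model": a margin favouring the negative class. cN
        # moves as N grows, mimicking retraining at each iteration.
        unlabeled.sort(key=lambda x: abs(x - cP) - abs(x - cN), reverse=True)
        if abs(unlabeled[0] - cP) - abs(unlabeled[0] - cN) <= 0:
            break  # nothing left looks confidently negative
        N.append(unlabeled.pop(0))
    return N, unlabeled

N, rest = negative_expansion([10.0, 11.0], [0.0, 1.0, 2.0, 10.5])
print(N)     # expanded putative negatives: [0.0, 1.0, 2.0]
print(rest)  # the hidden positive 10.5 is never absorbed into N
```

The stopping condition is the point of the sketch: expansion halts once no remaining unlabeled sample looks more negative than positive, which is how these methods avoid swallowing hidden positives.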
To reduce the number of FNs in N, the NOIT [59] approach divides U into three large groups. Samples in each group are scored by a classifier trained on P and unlabeled samples randomly selected from the other two groups, and only the top-ranked negative candidates are added to the training set for the next iteration. ‘Simple bagging’ approaches provide the most straightforward strategy for handling potentially mislabeled negatives in U and have been shown to predict positives from the unlabeled dataset U more accurately than conventional supervised binary classifiers that simply treat U as N, as demonstrated in [51]. However, the prediction accuracy of bagging-based methods might be lower than that of other types of approaches (i.e. ‘label propagation’ and ‘adapting the base classifier’), since positive samples in U are still likely to be mislabeled as N and thereby mislead the classifiers.
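A generic ‘simple bagging’ PU scheme can be illustrated as follows. This sketch is a hedged toy, not the NOIT code: each round treats a random subset of U as N, "trains" a stand-in centroid classifier on P versus that subset, and averages the per-sample positive votes across rounds.

```python
import random

# Toy PU bagging: average each unlabeled sample's positive votes over
# many rounds, each using a different random subset of U as negatives.
# The centroid scorer stands in for a real base classifier (e.g. SVM).

def pu_bagging_scores(P, U, rounds=50, frac=0.5, seed=0):
    rng = random.Random(seed)
    cP = sum(P) / len(P)
    scores = {x: 0.0 for x in U}
    for _ in range(rounds):
        N = rng.sample(U, max(1, int(len(U) * frac)))  # bootstrap negatives
        cN = sum(N) / len(N)
        for x in U:
            # +1 vote when x sits nearer the positive centroid
            scores[x] += 1.0 if abs(x - cP) < abs(x - cN) else 0.0
    return {x: s / rounds for x, s in scores.items()}

scores = pu_bagging_scores([10.0, 11.0], [0.0, 1.0, 2.0, 10.5])
print(scores[10.5], scores[0.0])  # hidden positive scores 1.0; negative 0.0
```

Averaging over random negative subsets is what dampens the effect of any single round in which hidden positives were sampled into N.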
To evaluate the performance of the trained classifiers, some labeled positives are mixed into U to act as ‘spies.’ AGPS [47] and the method by Patel and Wang [60] chose the SVM with the best performance in identifying the ‘spies’ as the final classifier, while Bhardwaj et al. [50] adjusted the negative-selection threshold at each iteration to minimize the number of positive spies identified as negatives. It has been shown in [49] that, compared with a one-class classifier that uses only positive examples to train a model, a PU learning algorithm with negative expansion identifies hidden positives in U more consistently and accurately. Compared with the bagging-based approaches, distance-based approaches might achieve higher prediction accuracy, as they select more reliable negatives and thereby reduce the number of FNs in N.
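The spy technique can be sketched as follows. The distance-to-positive-mean score, the 30% spy fraction and the low-quantile cutoff are illustrative assumptions; real tools use a trained classifier's scores instead:

```python
import random

def spy_threshold(P, U, spy_frac=0.3, quantile=0.05, seed=1):
    """Move a fraction of P into U as 'spies', score every sample with
    a toy distance-to-positive-mean score, and pick the negative
    selection cutoff so that almost all spies stay above it."""
    rng = random.Random(seed)
    n_spies = max(1, int(len(P) * spy_frac))
    spies = rng.sample(P, n_spies)
    train_P = [x for x in P if x not in spies]
    mp = sum(train_P) / len(train_P)

    def score(x):
        return -abs(x - mp)   # higher = more positive-like

    spy_scores = sorted(score(s) for s in spies)
    cut = spy_scores[int(len(spy_scores) * quantile)]
    # Unlabeled samples scoring below the weakest retained spy are
    # taken as reliable negatives.
    return [x for x in U if score(x) < cut]
```

Because spies are genuine positives, any cutoff that would discard them would also discard hidden positives in U; anchoring the cutoff to the spy scores guards against that.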
Label propagation
Adapting the base classifier
Using this strategy, PUEL [76] randomly bags U into several groups, repeats the base classifier for each group in an ensemble manner, and averages different outcomes from the groups to obtain the final prediction result, adjusted using the Bayes rule and Equation 4. Similarly, HOCCLUS2 [74], MutPred2 [77], PAnDE [78] and GlycoMine_PU [44] subsample U when obtaining |${P}_x(s=1)$| to overcome the data imbalance (i.e. non-glycosylation sites significantly outnumbering glycosylation sites). SVM is used as the base classifier in PUEL [76], HOCCLUS2 [74] and MutPred2 [77], while AnDE [85] is used as the base classifier in PAnDE [78] and GlycoMine_PU [44].
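Equation 4 is not reproduced here, but the standard Elkan–Noto-style correction behind this family of methods divides the ‘labeled vs unlabeled’ classifier's output by the label frequency c = P(s = 1 | y = 1), estimated as the mean score the classifier assigns to held-out labeled positives. A minimal sketch, with made-up scores:

```python
def elkan_noto_correct(scores_unlabeled, scores_holdout_positives):
    """Turn a 'labeled vs unlabeled' classifier's raw scores into
    estimates of P(y=1|x) by dividing by c = P(s=1|y=1), estimated as
    the mean score of held-out labeled positives."""
    c = sum(scores_holdout_positives) / len(scores_holdout_positives)
    # Clip at 1.0 since a probability cannot exceed one.
    return [min(s / c, 1.0) for s in scores_unlabeled]

# Held-out positives average a score of 0.7, so c = 0.7 and a raw
# score of 0.35 is corrected to roughly 0.5.
probs = elkan_noto_correct([0.1, 0.35, 0.49], [0.72, 0.68, 0.70])
```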
As noted above, this baseline method assumes that positive samples are labeled at random. However, this assumption might not hold in real-world bioinformatics applications. To further improve prediction accuracy, an algorithm called AlphaMax [86, 87] was adopted in MutPred2 [77] and GlycoMine_PU [44] to estimate the prior probability of each sample being labeled as positive, and the reported results showed that prediction accuracy could be significantly improved.
Applying PU scheme to address specific biological questions
Table 1 summarizes a total of 29 bioinformatic applications developed under the PU learning scheme, which can be further categorized into the four classes shown in Figure 1D: sequence classification, interaction prediction, gene/protein function prediction and functional site prediction. These applications and their performance are discussed in detail in the following subsections.
DNA/RNA/protein sequence classification
‘Selecting reliable negatives’ algorithms have been widely applied to the identification of disease-associated genes, the classification of biological sequences and the identification of ncRNAs (non-coding RNAs). Among these applications, PU algorithms are most prevalent in the identification of disease-associated genes [49, 62, 68, 69], which aims to unravel the causative relationships between genes and diseases, thereby providing a better understanding of gene variation-driven pathogenicity and suggesting possible solutions to a variety of healthcare problems [68]. Current experimental efforts can generally produce a long list of potential candidate genes, of which only a few are genuinely disease-causative [49]. The proposed approach should therefore be able to uncover the positive samples (i.e. disease-causative genes) from a vast number of unknown genes [68]. PU learning algorithms are applied in this scenario based on the assumption that genes with similar phenotypes are likely to have similar biological functions [69]. In addition to the PU algorithms themselves, different types of biological data have been used for predicting disease-associated genes in [69], including human protein–protein interaction (PPI), gene expression, gene ontology (GO) and phenotype–gene association data. The causative genes of various common diseases have been explored under the PU learning scheme, including cancers and cardiovascular, endocrine, metabolic, neurological, psychiatric and ophthalmological disorders [49, 69]. Compared with one-class learning algorithms, PU learning algorithms have demonstrated better performance in identifying disease-causative genes [49]. Similarly, PU algorithms have been used to identify associations between lncRNAs (long non-coding RNAs) and diseases [62]. The PU learning scheme has also been successfully applied to biological sequence classification via the LP-IC framework [51].
Two scenarios of biological sequence classification using LP-IC were investigated: HLA-A2 binding and human–mouse alternative splicing. It was demonstrated in [51] that PU algorithms outperformed conventional supervised learning algorithms in identifying hidden positive samples, thereby achieving higher precision. Another promising application is the detection of ncRNAs. Experimental identification of ncRNAs using routine genetic and biochemical approaches is challenging because most ncRNAs are short and not susceptible to frameshift and nonsense mutations [52, 88]. This challenge highlights the need for PU learning algorithms that computationally identify potential ncRNAs from known ncRNAs and a large number of unknown sequences. In this regard, Wang et al. [52] proposed a PU learning method, PSoL, to detect ncRNAs in the Escherichia coli (E. coli) genome. Empirical studies demonstrated that PSoL achieved superior prediction performance compared to the benchmark approaches.
Interaction prediction
Both ‘selecting reliable negatives’ and ‘adapting the base classifier’ algorithms have been applied to identify TF (transcription factor)–CRE (cis-regulatory element), TF–target and protein–RNA interactions, as well as gene regulatory networks. TF–CRE interactions, which are part of gene regulatory networks and crucially important for understanding the regulatory mechanisms of cells, were investigated in [73]. However, identifying TF–CRE interactions experimentally is difficult even for well-studied organisms [73]. It was shown in [73] that PU learning methods outperformed algorithms that use only positive samples, including the one-class approach. Uncovering the target genes of transcription factors is another important topic in mining and understanding gene regulatory networks. Eight transcription factors with the largest numbers of known target genes from E. coli and Saccharomyces cerevisiae (S. cerevisiae) were investigated in [60] using a PU learning approach from the ‘negative expansion’ group. Additionally, using experimental datasets from E. coli, Cerulo et al. [59] predicted the target genes of BCL6 in normal germinal center human B cells with a PU learning algorithm that heuristically selects reliable negative samples to train the machine-learning model. Protein–RNA interactions are important in regulating many cellular processes. Using a PU biased SVM, Cheng et al. accurately predicted protein–RNA and protein–non-coding-RNA interactions, achieving a satisfactory EPR (explicit positive recall) value [75]. Gene regulatory networks were analyzed and predicted in [57, 74] using PU algorithms with probability estimation (i.e. the ‘adapting the base classifier’ group). Although positive data (i.e. experimentally validated gene-pair interactions) are available in public databases, negative samples (i.e. non-interacting gene pairs) are not readily available due to the lack of experimental validation. PU learning therefore provides a potential solution for learning and analyzing gene regulatory networks from experimentally validated gene-pair interactions together with a large number of unlabeled gene pairs (i.e. pairs whose interaction status is unknown).
Protein/gene function and functional site prediction
PU algorithms based on ‘negative expansion’ and ‘label propagation’ have been widely applied to predict protein/gene functions and protein functional sites. In [50], Bhardwaj et al. applied PU algorithms to predict peripheral (membrane-binding) proteins, which play an important role in biological processes such as cell signaling and are associated with serious diseases including cancers and acquired immunodeficiency syndrome. Additionally, a two-step PU algorithm from the ‘label propagation’ group, using a weighted SVM to select reliable negative samples, was successfully applied to predict conformational B-cell epitopes in [71]. Several studies predicting S. cerevisiae (i.e. yeast) gene functions from genome sequence data have also deployed PU algorithms [47, 48]. Most databases, such as GO, only provide positive annotations for proteins/genes, while negative annotations (i.e. indicating that there is no link between a gene and a GO term) are usually unavailable, necessitating and promoting the implementation and deployment of PU algorithms.
Both ‘selecting reliable negatives’ and ‘adapting the base classifier’ algorithms have also been employed to predict protein functional sites, with PU algorithms from the probability-estimate correction group being the most widely used. A number of studies have used PU algorithms to predict different types of protein functional and post-translational modification (PTM) sites, such as glycosylation sites [44, 78], kinase substrates [76], pupylation sites [61] and protease cleavage sites [80]. Other PU applications for functional sites include the prioritization of pathogenic amino acid variants [77] and gene site prediction for the identification of functional isoforms (i.e. alternative splicing) [70].
Performance evaluation metrics and strategies
Among these measures, AUPR is often used to evaluate imbalanced binary classifiers [89]. CDF measures the percentage of hidden positives in U that can be identified [49]. In addition to the above metrics, EPR is a newly introduced metric that measures the proportion of known positives that are accurately identified (refer to [75] for more details). Compared with traditional performance metrics, EPR is particularly suitable for assessing PU learning problems in which negative samples are completely unavailable.
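A simplified reading of EPR, computing the fraction of known positives recovered within the top-k predictions, can be sketched as follows; see [75] for the exact definition, and note that the gene IDs below are made up for illustration:

```python
def explicit_positive_recall(ranked_ids, known_positives, top_k):
    """Fraction of the known positives recovered within the top-k
    predictions; computable without any negative labels."""
    top = set(ranked_ids[:top_k])
    hits = sum(1 for p in known_positives if p in top)
    return hits / len(known_positives)

# Two of the three known positives appear in the top four predictions.
epr = explicit_positive_recall(["g3", "g1", "g7", "g2", "g5"],
                               ["g1", "g2", "g9"], top_k=4)
```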
As shown in Table 1, four validation strategies have been used to evaluate the prediction performance of PU algorithms: k-fold cross-validation (CV), independent tests, leave-one-out (LOO) CV and random tests. Among these, k-fold CV is the most widely adopted. In k-fold CV, the samples are divided into k subgroups. At each validation step, one subgroup is used as the validation set, while the others are combined as the training set; the k subgroups take turns serving as the validation set until training and testing have been repeated k times. The value of k is typically set to 3, 5 or 10, with 10 being the most common choice. As used in [49, 62, 69, 76], LOOCV can be regarded as an extreme case of k-fold CV in which k equals the total number of training samples: each time, a single sample is used for testing, while the rest of the data are used to train the model. On larger training sets, the performance estimate from LOOCV is usually more stable than that from k-fold CV.
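The k-fold splitting scheme described above can be sketched without any library support; the round-robin fold assignment is one simple choice among several:

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k-fold CV: sample i
    is assigned to fold i % k, and each fold serves once as the
    validation set while the others form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

With k = n this degenerates into LOOCV, since each fold then contains exactly one sample.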
Discussions and future perspectives
Imbalanced data and classifier overfitting are common problems in PU learning, since many classifiers are sensitive to the size difference between U and P. To address this, the unlabeled data are usually subsampled to a smaller size before training the classifiers. For example, in [52], the size of the unlabeled training set was limited to three times that of the positive training set to reduce the effect of class imbalance. Some algorithms further reduce the unlabeled training set to the same size as the positive set. However, as noted in [48], overfitting occurs when the unlabeled training set is too small. It is therefore essential to balance the sizes of P and U to generate reliable models. In addition, cross-validation should be applied routinely to avoid overestimating performance.
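The capped-ratio subsampling described above (e.g. the 3:1 limit used in [52]) can be sketched as follows; the function name and fixed seed are illustrative choices:

```python
import random

def subsample_unlabeled(P, U, ratio=3, seed=0):
    """Cap the unlabeled training set at ratio * |P| to limit the
    class imbalance between U and P; return U unchanged if it is
    already within the cap."""
    cap = ratio * len(P)
    if len(U) <= cap:
        return list(U)
    return random.Random(seed).sample(U, cap)
```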
As shown in Table 1, only a small number of base classifiers have been widely applied to date, predominantly SVM and RF. More PU learning methods, ensemble strategies and deep-learning approaches are expected to be integrated to improve the performance and robustness of PU learning algorithms. Apart from widely used base classifiers such as SVM, RF and KNN, other PU learning algorithms merit consideration, such as POSC4.5 [90], PURF [91], positive hidden naive Bayes (PHNB) and the positive full Bayesian network classifier (PFBC) [92], which extend the decision tree, RF and naïve Bayes classifiers, respectively, to learn directly from positive and unlabeled samples. Beyond the base classifiers, ensemble methods have been widely applied to improve prediction performance [93]. In addition to traditional ensemble methods for base classifiers, such as AdaBoost [94, 95] and stacking [96], integrating the prediction outcomes of different PU learning algorithms has also shown considerable promise. For example, Yang et al. [69] integrated different biological data sources and the prediction outputs of various PU learning schemes to significantly improve the prediction performance of disease-associated gene identification.
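Integrating the outputs of several PU learners, as in [69], often reduces to (weighted) score averaging. A generic sketch of that combination step, not any specific tool's method:

```python
def ensemble_pu(score_lists, weights=None):
    """Combine per-sample scores from several PU learners by
    (weighted) averaging; each inner list holds one learner's
    scores for the same ordered set of samples."""
    weights = weights or [1.0] * len(score_lists)
    total = sum(weights)
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists)) / total
            for i in range(n)]
```

Weights can reflect, for example, each learner's cross-validated AUPR, so that more reliable PU models contribute more to the final ranking.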
In recent years, deep-learning techniques have demonstrated superior prediction performance in a wide range of biomedical applications, including protein function prediction [97–99], DNA/RNA modification identification [25, 100], biomedical image classification and analysis [33, 101], and multi-omics data analysis [102–107]. Compared to conventional machine learning algorithms such as SVM, deep neural networks can improve prediction accuracy by discovering the inherent relationships among various features [89]. To address the PU learning scenario, several adapted deep-learning algorithms have been published, such as NNPU [108], PUCNN [79] and GenPU [109]. Notably, deep learning has not yet been widely applied in PU learning scenarios using biological and biomedical data. In Table 1, three approaches, PUCNN, MutPred2 and EmptyNN, used deep-learning algorithms for mining biological data. Bepler et al. [79] successfully applied a PU neural network (PUCNN) to particle picking in cryo-electron micrographs; feed-forward neural networks were adopted in MutPred2 [77] to predict the molecular and phenotypic impact of amino acid mutations; and EmptyNN [67] was developed as a positive-unlabeled learning neural network to remove cell-free droplets and recover lost cells in scRNA-seq data. Apart from generic deep neural networks, graph neural networks have also been adapted for PU learning. Wu et al. [110] proposed a novel long-short distance aggregation network (LSDAN) for the PU graph learning framework and demonstrated that LSDAN achieved an outstanding F1 score. LSDAN has great potential to be applied to interaction identification, for example using PPI and gene regulation data, although its performance on PU tasks with biological data will need to be carefully assessed. Another application of deep-learning algorithms is feature extraction.
Compared to traditional features such as DNA/protein sequence-derived features [111–114] and structural features [21, 23], deep-learning algorithms can automatically learn suitable feature representations for input data such as DNA/RNA/protein sequences, without the need for manually designed biological or physicochemical properties or hand-crafted features [115–120]. In addition, the feature representations learned by deep-learning algorithms can better characterize the input dataset and improve predictive performance.
Conclusion
In many applications, a lack of well-labeled negative examples has been shown to hinder the development of traditional bioinformatics tools. This is of particular importance in biological and biomedical applications, where it may be much easier to confidently identify positive cases than negative cases, or where data can be mislabeled as negative due to the low sensitivity of experimental equipment. Compared to conventional supervised learning, PU learning provides a scheme in which a limited number of well-labeled positive samples and a large number of unlabeled samples can be used together to train a classifier, avoiding the detrimental impact on prediction performance caused by absent or unreliably labeled negative samples. To date, a variety of PU learning algorithms have been developed. This study comprehensively summarizes the current research progress of PU learning using biological and biomedical data. We have surveyed 29 state-of-the-art PU learning-based bioinformatic applications in terms of their underlying PU learning methodology, algorithm implementation, performance evaluation strategy and biological application. We have further discussed current issues and possible future directions for PU learning in bioinformatics. We anticipate that our review and analysis will provide helpful insights into the applications of PU learning algorithms and thereby underpin the development of novel PU learning frameworks to address critical biomedical questions.
Positive-unlabeled (PU) learning overcomes the issues of limited positive samples and potentially mislabeled negative samples in biological and biomedical data, and has therefore been widely applied in bioinformatics.
We conducted a comprehensive review of 29 state-of-the-art PU bioinformatic applications regarding their PU learning approaches, performance evaluation strategies and application domains.
Based on our investigation and analysis, we further commented on the current issues of PU learning in bioinformatics and provided several novel perspectives to underpin the future development of PU frameworks for addressing biological and biomedical questions.
Acknowledgements
This work was supported by grants from the National Health and Medical Research Council of Australia (NHMRC) (APP1127948, APP1144652); Australian Research Council (ARC) (LP110200333, DP120104460); National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965); a Major Inter-Disciplinary Research (IDR) project awarded by Monash University. FL’s work is supported by the core funding of the Doherty Institute at the University of Melbourne. CL is currently a CJ Martin Early Career Research Fellow supported by the NHMRC (1143366). LC’s work is supported by NHMRC career development fellowship (APP1103384), as well as an NHMRC-EU project grant (GNT1195743).
Fuyi Li received his PhD degree in Bioinformatics from Monash University, Australia. He is currently a Research Fellow in the Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia. His research interests are bioinformatics, computational biology, machine learning and data mining.
Shuangyu Dong received her MEng degree in Electrical Engineering from the University of Melbourne, Australia. She is currently a PhD candidate in the Department of Electrical and Electronic Engineering, The University of Melbourne, Australia. Her research interests are communication systems, information theory and machine learning.
André Leier is currently an assistant professor in the Department of Genetics, UAB School of Medicine, USA. He is also an associate scientist in UAB’s O’Neal Comprehensive Cancer Center and the Gregory Fleming James Cystic Fibrosis Research Center. His research interests are computational biomedicine, bioengineering, bioinformatics and machine learning.
Meiya Han received her MSc degree in Data Science from Monash University, Australia. She is currently a research assistant in the Department of Biochemistry and Molecular Biology, Monash University, Australia. Her research interests are bioinformatics, computational biology, machine learning and data mining.
Xudong Guo received his MEng degree from Ningxia University, China. His research interests are bioinformatics and data mining.
Jing Xu received her MSc degree in Computer Science and Technology from Nankai University, China. She is currently a PhD candidate in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her research interests include bioinformatics, computational biology, machine learning and deep learning.
Xiaoyu Wang received her MSc degree in Information Technology from The University of Melbourne, Australia. She is currently a research assistant in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her research interests are bioinformatics, computational biology, machine learning and data mining.
Shirui Pan received his PhD in Computer Science from the University of Technology Sydney (UTS), Ultimo, NSW, Australia. He is currently a senior lecturer with the Faculty of Information Technology, Monash University, Australia. Prior to this, he was a lecturer in the School of Software, University of Technology Sydney. His research interests include data mining and machine learning.
Cangzhi Jia is an associate professor in the College of Science, Dalian Maritime University. She obtained her PhD degree in the School of Mathematical Sciences from the Dalian University of Technology in 2007. Her major research interests include mathematical modeling in bioinformatics and machine learning.
Yang Zhang received his PhD degree in Computer Software and Theory in 2005 from Northwestern Polytechnical University, China. He is a professor at College of Information Engineering, Northwest A&F University. His research interests include machine learning and data mining.
Geoffrey I. Webb received his PhD degree in Computer Science in 1987 from La Trobe University. He is research director of the Monash Data Futures Institute and professor in the Faculty of Information Technology at Monash University. His research interests include machine learning, data mining, computational biology and user modeling.
Lachlan J.M. Coin is a professor and group leader in the Department of Microbiology and Immunology at the University of Melbourne. He is also a member of the Department of Clinical Pathology, University of Melbourne. His research interests are bioinformatics, machine learning, transcriptomics and genomics.
Chen Li is a research fellow in the Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University. He is currently a CJ Martin Early Career Research Fellow, supported by the Australian National Health and Medical Research Council (NHMRC). His research interests include systems proteomics, immunopeptidomics, personalized medicine, experimental bioinformatics and data mining.
Jiangning Song is an associate professor and group leader in the Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia. He is also affiliated with the Monash Centre for Data Science, Faculty of Information Technology, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining and pattern recognition.