Yue Bi, Fuyi Li, Xudong Guo, Zhikang Wang, Tong Pan, Yuming Guo, Geoffrey I Webb, Jianhua Yao, Cangzhi Jia, Jiangning Song, Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac467, https://doi.org/10.1093/bib/bbac467
Abstract
Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing this performance. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracies of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.
Introduction
Asymmetric mRNA distribution was first reported in the early 1980s, when actin mRNA was found to be localized in the cytoplasm of ascidian embryos [1]. Later, Lawrence and Singer observed a similar phenomenon in chicken fibroblasts via in situ hybridization [2]. In the past decade, an increasing number of studies have suggested that transcript localization is a common and efficient way to target gene products to specific intracellular regions [3–6]. A number of regulatory processes, including cell polarity, cell motility, embryo development and asymmetric cell division, have been shown to be related to mRNA subcellular localization [7–9]. Abnormal mRNA subcellular localization has been associated with numerous diseases, such as fragile X syndrome, embryonal disorders, Alzheimer’s disease and cancer [10–14]. However, there is still a significant gap in the understanding of how exactly mRNAs are transported within cells. Characterization of mRNA subcellular localizations can help elucidate RNA localization patterns and human disease mechanisms. With the development of high-throughput RNA sequencing technologies, an increasing number of mRNA transcripts have recently been identified. Several popular online databases have been developed to provide annotations of RNA cellular localization data for public use, such as RNALocate [15], lncATLAS [16] and lncSLdb [17]. LncATLAS and lncSLdb mainly store localization information of long non-coding RNAs, while RNALocate collects subcellular localization data for almost all kinds of RNAs. With the advances in data curation, recent years have witnessed a proliferation of computational methods for predicting RNA localizations at lower cost and with higher efficiency than wet-lab experimental methods. Such computational methods are an important complement to wet-lab methods for the identification of RNA localizations.
In the present study, we comprehensively surveyed the state-of-the-art computational approaches for mRNA subcellular localization prediction across a wide range of aspects, including their data sources, benchmark datasets, sequence encoding schemes, feature selection methods, machine learning algorithms and performance evaluation strategies, which are listed in Table 1. The mRNA subcellular localization prediction problem can generally be regarded as a multi-class classification task. We categorized the existing computational predictors into two major types according to the machine learning scheme: (1) single-label multi-class predictors and (2) multi-label multi-class predictors. Five predictors of the first type address mRNA subcellular localization prediction as a single-label classification task: RNATracker [18], iLoc-mRNA [19], mRNALoc [20], mRNALocater [21] and SubLocEP [22]. In contrast to single-label multi-class classification, multi-label learning (MLL) has been attracting increasing attention in recent years. This is relevant because real-world objects can often have multiple semantic meanings simultaneously, so that an instance may be associated with multiple labels. Taking the transcriptome as an example, the mRNA Vg1 formed in the nucleus can be re-modelled in the cytoplasm during vegetal localization [23]; the mRNA bglG is localized in the membrane to form a pre-complex when co-transcribed with its sensor, whereas it is localized at the cell poles when expressed on its own [24]. This widespread multiple-localization phenomenon motivates formulating mRNA subcellular localization identification as a multi-label task. Recently, a predictor called DM3Loc was developed based on the multi-head self-attention mechanism, representing the first multi-label mRNA subcellular localization prediction model [25]. In another recent work, Wang et al. constructed multi-label prediction models with multiple kernel support vector machines for mRNA, lncRNA, miRNA and snoRNA, respectively [26].
Type | Year | Tool | Sources of mRNA subcellular localization/sequence | Subcellular localization | Benchmark dataset size | Encoding scheme | Feature selection | Algorithm | Evaluation strategy | Web server/GitHub availability | Reference |
---|---|---|---|---|---|---|---|---|---|---|---|
Single-label | 2019 | RNATracker | CeFra-Seq & APEX-RIP/Ensembl | Cytosol, Endoplasmic reticulum, Insoluble, Membranes, Mitochondrial, Nuclear | 11 373 (Dataset 1), 13 860 (Dataset 2) | One-hot, RNA secondary structure | None | CNN, LSTM, Attention | Tenfold cross-validation | https://www.github.com/HarveyYan/RNATracker | [18] |
Single-label | 2020 | iLoc-mRNA | RNALocate/GenBank | Cytoplasm, Cytosol, Dendrite, Endoplasmic reticulum, Exosome, Mitochondrion, Nucleus, Ribosome | 4901 | K-mer | Binomial distribution, ANOVA, IFS | SVM | Fivefold cross-validation | http://lin-group.cn/server/iLoc-mRNA/ | [19] |
Single-label | 2020 | mRNALoc | RNALocate | Cytoplasm, Endoplasmic reticulum, Extracellular region, Mitochondria, Nucleus | 14 909 | PseKNC | None | SVM | Fivefold cross-validation, Independent test | http://proteininformatics.org/mkumar/mrnaloc | [20] |
Single-label | 2021 | mRNALocater | RNALocate | Cytoplasm, Endoplasmic reticulum, Extracellular, Mitochondria, Nucleus | 14 909 | PseEIIP, PseKNC | Remove collinear features, SFS | CatBoost, XGBoost, LightGBM | Fivefold cross-validation, Independent test | http://bio-bigdata.cn/mRNALocater | [21] |
Single-label | 2021 | SubLocEP | RNALocate | Cytoplasm, Endoplasmic reticulum, Extracellular, Mitochondria, Nucleus | 14 909 | PseEIIP, TNC, DNC, CKSNAP, PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC, DACC | None | LightGBM | Fivefold cross-validation, Independent test | http://lab.malab.cn/~lijing/SubLocEP.html | [22] |
Multi-label | 2021 | DM3Loc | RNALocate/GenBank & NCBI | Cytosol, Endoplasmic reticulum, Exosome, Membrane, Nucleus, Ribosome | 17 870 | One-hot | None | CNN, Attention | Fivefold cross-validation, Independent test | http://dm3loc.lin-group.cn/ | [25] |
Multi-label | 2021 | Wang’s | RNALocate | Cytoplasm, Cytosol, Endoplasmic reticulum, Exosome, Mitochondrion, Nucleus, Posterior, Pseudopodium, Ribosome | 13 475 | K-mer, RCKmer, NAC, DNC, TNC, CKSNAP | None | SVM | Tenfold cross-validation | http://lbci.tju.edu.cn/Online_services.htm | [26] |
Abbreviations: PseKNC, pseudo k-tuple nucleotide composition; PseEIIP, electron-ion interaction pseudopotential values of trinucleotide; RCKmer, reverse complement k-mer; NAC, nucleic acid composition; DNC, dinucleotide composition; TNC, trinucleotide composition; CKSNAP, composition of k-spaced nucleic acid pairs; PCPseDNC, parallel correlation pseudo dinucleotide composition; PCPseTNC, parallel correlation pseudo trinucleotide composition; SCPseDNC, series correlation pseudo dinucleotide composition; SCPseTNC, series correlation pseudo trinucleotide composition; DACC, dinucleotide-based auto-cross covariance; ANOVA, analysis of variance; IFS, incremental feature selection; SFS, sequential forward search.
To deal with MLL tasks, problem transformation algorithms such as binary relevance (BR), label powerset (LP) and classifier chains (CC) [27] are popular strategies, which essentially transform a multi-label problem into one or more single-label tasks. Amongst these, BR is the most widely used problem transformation algorithm; it transforms a multi-label problem into one binary problem per label [28]. However, a weakness of BR is that it ignores the correlation between the labels, thereby limiting its utility. In contrast, LP and CC are designed with the awareness of correlations between different labels. The basic idea of LP is that each unique combination of labels present in a multi-label training set is regarded as one class of a new single-label multi-class task [29]. CC generates a chain of binary classifiers, one for each label. Each subsequent binary classifier in the chain is further augmented by all preceding binary relevance predictions in the chain [30]. However, these two methods also have certain limitations. For LP, infrequent label combinations lead to class imbalance, and LP cannot predict label combinations that do not appear in the training set. In the case of CC, the order of the input labels can affect the model quality and prediction performance [27]: if the first model of the chain predicts inaccurately, the error may propagate along the chain. To overcome this problem, Read et al. [30] proposed the ensembles of classifier chains (ECC) method. The core idea of ECC is to average the predictions of CC models over a set of random chain orderings; however, training several chains increases the computational and time cost.
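The difference between the BR and LP transformations described above can be made concrete with a small sketch. The label matrix below is a hypothetical toy example (not taken from the benchmark dataset), and the function names are illustrative only; the sketch shows how BR yields one binary target vector per label, while LP maps each observed label combination to a single class.

```python
from collections import Counter

def binary_relevance_targets(Y):
    """BR: one independent binary target vector per label (one column of Y each)."""
    q = len(Y[0])
    return [[row[j] for row in Y] for j in range(q)]

def label_powerset_targets(Y):
    """LP: every distinct label combination becomes one class of a multi-class task."""
    classes = {}                      # label combination -> class id
    targets = []
    for row in Y:
        combo = tuple(row)
        if combo not in classes:
            classes[combo] = len(classes)
        targets.append(classes[combo])
    return targets, classes

# Hypothetical toy label matrix: 5 samples, 3 labels.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]

br = binary_relevance_targets(Y)
lp_targets, lp_classes = label_powerset_targets(Y)
class_sizes = Counter(lp_targets)

print(br)                        # three binary problems, label correlations are lost
print(lp_targets)                # one multi-class problem
print(class_sizes)               # rare combinations give small classes
print((1, 0, 1) in lp_classes)   # an unseen combination has no class to be predicted as
```

The last two lines illustrate the two LP limitations mentioned in the text: classes built from rare combinations are small, and a combination absent from the training labels cannot be predicted at all.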
Herein, we introduced a novel computational method termed Clarion (subcellular localization predictor), which is capable of identifying multiple subcellular localizations of mRNAs simultaneously. Firstly, we established a multi-label benchmark dataset extracted from RNALocate, consisting of nine different compartments of mRNA subcellular localizations. The |$k$|-mer nucleotide composition scheme was used to encode mRNA sequences. Next, we applied the weighted series approach as the ensemble framework of Clarion, which is a problem transformation method proposed to tackle multi-label tasks. The weighted series algorithm incorporates the prior information of the labels during model training to improve the prediction performance. Then, after comparing the performance of several machine learning algorithms, we selected XGBoost as the base classifier of Clarion. We optimized the weight of the weighted series and the key parameters of XGBoost through 10-fold cross-validation. Additional independent tests illustrate that Clarion outperformed the existing state-of-the-art tools for identifying mRNA subcellular localizations. In addition, we employed the SHAP (SHapley Additive exPlanations) algorithm [31] to identify and interpret the |$k$|-mer features that made the most important contributions to the model predictions for each type of mRNA subcellular localization.
Material and methods
Benchmark dataset
In this study, all subcellular localization annotations and mRNA sequences were collected from the RNALocate database (version 2.0) [32]. The latest version of RNALocate (updated in June 2021) contained more than 210 000 RNA subcellular localization entries, encompassing more than 110 000 RNAs with 171 subcellular localizations across 104 different species. Version 2.0 provides more accurate localization annotations than the first version, facilitating the construction of a reliable benchmark dataset. More specifically, the benchmark dataset was constructed according to the following five major steps:
1) We downloaded all RNA subcellular localization annotation entries from RNALocate (version 2.0) and accordingly collected 84 792 mRNA subcellular localization entries as the initial dataset.
2) According to the statistics of the initial dataset, there were 150 different types of annotated subcellular localizations. However, some had few or incomplete entries; we therefore removed those subcellular localization types with fewer than 3000 entries. As a result, we obtained nine types of subcellular localizations with 152 887 unique transcripts, including exosome, nucleus, cytosol, chromatin, nucleoplasm, ribosome, nucleolus, cytoplasm and membrane.
3) Next, we redefined the mapping relationships between mRNAs and subcellular localizations based on multiple localizations of mRNAs in the transcriptome. In particular, an mRNA can be labelled with multiple subcellular localizations instead of being only labelled with one subcellular localization.
4) To reduce the effect of sequence redundancy on the performance of the classifier, we applied CD-HIT-EST [33] to remove the redundant sequences with the 80% sequence identity threshold to ensure the similarity between any two nucleotide sequences was less than 80%. Finally, 36 971 mRNAs were obtained and used as the benchmark dataset.
5) We analyzed the distribution of sequence length of these 36 971 mRNAs, which varied from 119 nt to 12 000 nt. In view of the computing complexity and limitation of the feature engineering algorithms, the mRNA lengths were adjusted to no longer than 6000 nt. Specifically, for those mRNAs with more than 6000 nt, the first 3000 nt and the last 3000 nt were extracted and merged.
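Step 5 can be sketched in a few lines. The function name `truncate_mrna` is hypothetical, and the sketch assumes that "merged" means simple concatenation of the first and last 3000 nt, which is the most direct reading of the description above.

```python
def truncate_mrna(seq, max_len=6000):
    """Return seq unchanged if it has at most max_len nt; otherwise
    concatenate its first and last max_len // 2 nucleotides."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    return seq[:half] + seq[-half:]

# Hypothetical 7000-nt sequence: the middle 1000 nt are dropped.
long_seq = "A" * 3000 + "C" * 1000 + "G" * 3000
short_seq = "AUGGCC"

print(len(truncate_mrna(long_seq)))   # 6000
print(truncate_mrna(short_seq))       # AUGGCC
```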
Sequence vectorization
mRNA sequences need to be encoded as numeric vectors prior to the training of machine learning models. The |$k$|-mer nucleotide composition is one of the most widely used sequence encoding methods and has been successfully applied in a variety of bioinformatics studies [34–38]. Consider an mRNA sequence |$S$| of length |$L$| nt, |$S={N}_1{N}_2{N}_3{N}_4{N}_5{N}_6\dots{N}_{L-1}{N}_L$|, where |$i\in \big[1,2,3,\dots, L\big]$| and |${N}_i$| represents the nucleotide at position |$i$|, which is one of adenine (A), cytosine (C), guanine (G) and uracil/thymine (U/T). Accordingly, when using the |$k$|-mer (|$k=3$|) method to encode features, the feature vector can be calculated as:
|$\big(f\big( AAA\big),f\big( AAC\big),f\big( AAG\big),\dots, f\big( TTT\big)\big)$|
where |$f\big(\bullet \big)$| represents the number of occurrences of the given |$k$|-mer along the sequence. From the equation, we can observe that the dimension of the |$k$|-mer vector (|${4}^k$|) increases exponentially with |$k$|. Such high-dimensional features may contain a great deal of redundancy and noise, which may increase training time and even have a negative influence on the model quality. In view of the dimension restriction and training effectiveness, we used 1-mer, 2-mer, 3-mer, 4-mer, 5-mer and 6-mer in this study.
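A minimal sketch of this encoding is given here, under two assumptions that the text does not settle: the features are raw counts (not length-normalized frequencies), and the vectors for |$k=1,\dots,6$| are simply concatenated. The function names `kmer_counts` and `encode` are illustrative only.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count each of the 4**k possible k-mers along the sequence,
    in lexicographic order (AAA, AAC, ..., TTT for k = 3)."""
    seq = seq.upper().replace("U", "T")          # treat U and T as the same symbol
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0] * len(index)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:                        # windows with ambiguous bases are skipped
            vec[index[kmer]] += 1
    return vec

def encode(seq, ks=(1, 2, 3, 4, 5, 6)):
    """Concatenate the k-mer count vectors for every k in ks (assumed design)."""
    features = []
    for k in ks:
        features.extend(kmer_counts(seq, k))
    return features

x = encode("AUGGCCAUGG")
print(len(x))                          # 4 + 16 + 64 + 256 + 1024 + 4096 = 5460
print(kmer_counts("AUGGCCAUGG", 1))    # counts of A, C, G, T: [2, 2, 4, 2]
```

The total dimension of 5460 makes the exponential growth visible: the 6-mer block alone contributes 4096 of the features.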
Weighted series
In this study, we proposed a novel problem transformation method named weighted series (WS) to tackle multi-label learning problems, designed specifically for the subcellular localization identification of mRNAs. The weighted series method involves two modules of binary classifiers: a non-label module and a fusion-label module. The non-label module trains its models only on the pre-extracted features, while the fusion-label module additionally incorporates prior information about the labels during training, so that the learned prior label distributions can contribute to the model predictions. The final predictions of the two modules are combined by a user-defined weight |$w$| (ranging from 0 to 1), which reflects the relevance among the labels. When there is a strong correlation between the labels, a smaller value of |$w$| will promote the prediction performance.
Suppose |${\mathbb{R}}^d$| represents the |$d$| dimensional instance space and |${\mathbb{Y}}^q$| represents the |$q$| dimensional label space. Consider a multi-label data set |$D=\big\{X,Y\big\}$| containing |$n$| samples, where |$X\subset{\mathbb{R}}^d$| and |$Y\subset{\mathbb{Y}}^q$|. Both the non-label module and the fusion-label module require training |$q$| binary classifiers. When training the |$j$|-th model |${C}_j^N$| (|$1\le j\le q$|) of the non-label module, the training label is |${Y}_j$|, and the training input is the feature vector of the |$n$| training samples, i.e., |$X$|. When training the |$j$|-th model |${C}_j^F$| (|$1\le j\le q$|) of the fusion-label module, the prior information on the other labels is added to the training process by combining it with the features |$X$| as input. That is to say, the label is still |${Y}_j$|, whereas the training input is the fusion of |$X$| and |${Y}_i\big(i\ne j\big)$|. After model training, the |$2q$| binary classifiers of weighted series, |${C}^N\big\{{C}_1^N,\dots, {C}_q^N\big\}$| and |${C}^F\big\{{C}_1^F,\dots, {C}_q^F\big\}$|, can be used to conduct predictions. The prediction process of weighted series includes three steps: |${C}^N$| module prediction, |${C}^F$| module prediction and integration. Given a query mRNA sequence with feature vector |$x$|, the non-label models |${C}^N$| are first used to conduct the prediction and generate the prediction probabilities |${y}^N$|; |${y}^N$| is then fused with |$x$| as the input of the fusion-label models |${C}^F$|, which output the probabilities |${y}^F$|; finally, |${y}^N$| and |${y}^F$| are integrated with a user-defined weight |$w$| between 0 and 1. The detailed training and prediction procedures of the weighted series are illustrated in Figure 1 and outlined in Algorithm 1.
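The training and prediction procedure described above can be sketched as follows. This is not the authors' implementation: `CentroidClassifier` is a toy stand-in for the base classifier, the toy data are hypothetical, and the integration rule |$w\cdot{y}^N+\big(1-w\big)\cdot{y}^F$| is an assumption, chosen because it agrees with the statement that a smaller |$w$| favours strongly correlated labels.

```python
import math

class CentroidClassifier:
    """Toy binary base learner (stand-in for XGBoost, not the authors' choice)."""

    def fit(self, X, y):
        dim = len(X[0])
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, t in zip(X, y) if t == cls]
            # Fall back to the origin if a class is absent from the toy data.
            self.centroids[cls] = [
                sum(r[k] for r in rows) / len(rows) if rows else 0.0
                for k in range(dim)
            ]
        return self

    def predict_proba(self, x):
        """Probability-like score for the positive class."""
        d = {c: sum((a - b) ** 2 for a, b in zip(x, m))
             for c, m in self.centroids.items()}
        z = max(-50.0, min(50.0, d[0] - d[1]))   # nearer the positive centroid -> larger z
        return 1.0 / (1.0 + math.exp(-z))

def others(values, j):
    """All entries of `values` except position j."""
    return [v for i, v in enumerate(values) if i != j]

def fit_weighted_series(X, Y, make_clf=CentroidClassifier):
    q = len(Y[0])
    C_N, C_F = [], []
    for j in range(q):
        y_j = [row[j] for row in Y]
        C_N.append(make_clf().fit(X, y_j))                        # non-label module: X only
        X_fused = [x + others(row, j) for x, row in zip(X, Y)]    # X fused with Y_i, i != j
        C_F.append(make_clf().fit(X_fused, y_j))                  # fusion-label module
    return C_N, C_F

def predict_weighted_series(x, C_N, C_F, w):
    y_N = [clf.predict_proba(x) for clf in C_N]
    # At prediction time the true labels are unknown, so y_N takes their place.
    y_F = [clf.predict_proba(x + others(y_N, j)) for j, clf in enumerate(C_F)]
    # Assumed integration rule: w weights the non-label module.
    return [w * n + (1.0 - w) * f for n, f in zip(y_N, y_F)]

# Hypothetical toy data: one feature, two perfectly correlated labels.
X = [[0.0], [0.1], [0.9], [1.0]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
C_N, C_F = fit_weighted_series(X, Y)
scores = predict_weighted_series([0.95], C_N, C_F, w=0.65)
print([round(s, 3) for s in scores])
```

Any classifier exposing `fit` and a positive-class probability could be passed as `make_clf`; the chaining of |${y}^N$| into the fusion-label models is the part specific to weighted series.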

Performance evaluation metrics
Model evaluation for MLL tasks is more complicated than for binary classification problems because the predictive performance for all labels should be taken into account. In this study, we employed six widely used MLL evaluation metrics [29, 39, 40] to evaluate the performance of Clarion, including example-based accuracy (|${Acc}_{exam}$|), average precision, coverage, one-error, ranking loss and Hamming loss. Let |$f\big(\bullet \big)$| be the learned multi-label classifier and |$\big({x}_i,{Y}_i\big)\ \big\{1\le i\le t\big\}$| be a multi-label instance, where |${Y}_i$| and |${P}_i$| represent the true and predicted label sets for the instance and |${\overline{Y}}_i$| represents the complementary set of |${Y}_i$|. Accordingly, the above metrics can be formulated as follows:
where |${Rank}_f\big(x,y\big)$| represents the rank of |$y$| among all labels when they are sorted in descending order of the predicted scores, |$\big|\bullet \big|$| represents the cardinality of a set, |$q$| is the number of labels, ∆ stands for the symmetric difference of two sets and |$I\big(\bullet \big)$| counts the number of times that the condition is met.
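For reference, the six metrics can be sketched in plain Python using their commonly used definitions from the multi-label learning literature. It is an assumption that the paper uses exactly these forms (for example, coverage is computed here with the usual "minus one" convention). The sketch also assumes that every instance has at least one true label, as is the case in the benchmark dataset; all names and toy values are hypothetical.

```python
def ranks(scores):
    """Rank of each label by descending score (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    r = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        r[j] = position
    return r

def evaluate(Y_true, Y_pred, scores):
    """Y_true, Y_pred: lists of sets of label indices; scores: per-label score lists."""
    t, q = len(Y_true), len(scores[0])
    acc = ap = cov = one_err = rl = hl = 0.0
    for Y, P, s in zip(Y_true, Y_pred, scores):
        r = ranks(s)
        Y_bar = set(range(q)) - Y
        acc += len(Y & P) / len(Y | P)             # |Y ∩ P| / |Y ∪ P|
        hl += len(Y ^ P) / q                       # |Y ∆ P| / q
        one_err += 0.0 if r.index(1) in Y else 1.0 # top-ranked label not in Y
        cov += max(r[y] for y in Y) - 1            # steps needed to cover all true labels
        ap += sum(sum(1 for y2 in Y if r[y2] <= r[y]) / r[y] for y in Y) / len(Y)
        if Y_bar:                                  # no mis-ordered pairs if Y has all labels
            wrong = sum(1 for y in Y for yb in Y_bar if s[y] <= s[yb])
            rl += wrong / (len(Y) * len(Y_bar))
    return {
        "acc_exam": acc / t, "average_precision": ap / t, "coverage": cov / t,
        "one_error": one_err / t, "ranking_loss": rl / t, "hamming_loss": hl / t,
    }

# Hypothetical toy example: 2 instances, 3 labels, predictions thresholded at 0.5.
Y_true = [{0, 1}, {2}]
scores = [[0.9, 0.2, 0.6], [0.1, 0.3, 0.8]]
Y_pred = [{j for j, v in enumerate(s) if v >= 0.5} for s in scores]
metrics = evaluate(Y_true, Y_pred, scores)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Higher values are better for example-based accuracy and average precision; lower values are better for the other four metrics.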
Results and discussion
Statistical analysis of the dataset
The benchmark dataset used in this study contained a total of 36 971 mRNA sequences, each of which might be localized to a single or multiple subcellular compartments. Specifically, 12 884 mRNAs had only one localization/compartment, 4060 mRNAs had two localizations, 3442 mRNAs had three localizations, 3165 mRNAs had four localizations, 3518 mRNAs had five localizations, 4258 mRNAs had six localizations, 4079 mRNAs had seven localizations, 1443 mRNAs had eight localizations and 122 mRNAs had nine localizations, as illustrated in Figure 2A. In addition, we plotted the distribution of the positive and negative samples in the nine compartments, as shown in Figure 2B. Compared with the nucleus, nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome, the data of exosome and cytoplasm were unevenly distributed. For the exosome, the number of positive samples was much larger than that of the negatives, while for cytoplasm, the number of negative samples was much larger than that of the positives. There were 31 448 mRNAs in exosome, 21 439 in nucleus, 14 237 in nucleoplasm, 14 328 in chromatin, 4016 in cytoplasm, 11 124 in nucleolus, 16 312 in cytosol, 6739 in membrane and 8680 in ribosome. The distribution of the label (i.e., subcellular localization) number in these nine compartments is shown in Figure 2C.
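The counts reported above can be cross-checked with a short calculation. The numbers are copied from the paragraph; the variable names and the derived quantities (average number of labels per mRNA and positive fraction per compartment) are added here for illustration.

```python
n_total = 36971

# Number of mRNAs having exactly k localizations.
by_label_count = {1: 12884, 2: 4060, 3: 3442, 4: 3165, 5: 3518,
                  6: 4258, 7: 4079, 8: 1443, 9: 122}

# Number of mRNAs annotated to each compartment (positive samples).
by_compartment = {"exosome": 31448, "nucleus": 21439, "nucleoplasm": 14237,
                  "chromatin": 14328, "cytoplasm": 4016, "nucleolus": 11124,
                  "cytosol": 16312, "membrane": 6739, "ribosome": 8680}

# Both views must describe the same set of mRNA-to-compartment assignments.
assignments = sum(k * n for k, n in by_label_count.items())
consistent = (sum(by_label_count.values()) == n_total
              and assignments == sum(by_compartment.values()))

label_cardinality = assignments / n_total
positive_fraction = {c: n / n_total for c, n in by_compartment.items()}

print(consistent)                                  # True
print(round(label_cardinality, 2))                 # about 3.47 labels per mRNA
print(round(positive_fraction["exosome"], 2))      # about 0.85
print(round(positive_fraction["cytoplasm"], 2))    # about 0.11
```

The two views agree (128 323 assignments in total), and the positive fractions of roughly 85% for exosome and 11% for cytoplasm quantify the imbalance noted for these two compartments.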

Figure 2. Statistical distributions of mRNA entries in the benchmark dataset curated in this study. (A) The relative percentages of mRNAs with different labels in the benchmark dataset. (B) The distribution of the positive and negative samples in the nine subcellular compartments. (C) The distribution of mRNAs with different labels in the nine compartments.
We noticed that mRNAs in the cytoplasm were mostly single-localized and mRNAs in the nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome were mostly multi-localized, while mRNAs in exosome and nucleus appeared to be uniformly localized. In this study, we separated the benchmark dataset into the training_validation dataset and the independent test dataset by random sampling. Accordingly, 33 274 mRNAs were included in the training_validation dataset (90% of the total) and 3697 mRNAs in the independent test set (10% of the total). The former was used to compare the algorithms and optimize the parameters by tenfold cross-validation, while the latter was used to evaluate and validate the model performance. Detailed localization distributions of the training_validation and independent test datasets can be found in Supplementary Table S1.
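A random 90/10 split followed by a tenfold partition of the training_validation part can be sketched as follows. The seed and the function name are hypothetical, and the sketch uses plain random sampling as stated in the text, with no stratification by label.

```python
import random

def split_indices(n, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction as the independent test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # the seed is an arbitrary illustrative choice
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

train_val, test = split_indices(36971)
folds = [train_val[i::10] for i in range(10)]   # ten disjoint folds for cross-validation

print(len(train_val), len(test))                # 33274 3697
print(sorted({len(f) for f in folds}))          # fold sizes differ by at most one
```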
Selection of the base classifier
In this section, we performed a preliminary analysis to select a suitable base classifier. To do this, we evaluated seven popular machine learning algorithms within the WS framework, including k-nearest neighbour (KNN), logistic regression (LR), random forest (RF), LightGBM, XGBoost, CatBoost and multilayer perceptron (MLP) [41, 42]. For the sake of comparison, the weight |$w$| of WS was set to one-quarter, two-quarters and three-quarters of its range (0.25, 0.5 and 0.75), respectively. This saves running time and accelerates the determination of the base classifier. We employed the default parameters of each algorithm in the Scikit-learn package and conducted a 10-fold cross-validation test on the training_validation dataset for performance comparison. As can be seen from Figure 3A, for w = 0.25, 0.5 and 0.75, XGBoost secured the best predictive performance among the seven algorithms in terms of accuracy. For w = 0.25, XGBoost achieved a slightly inferior performance to RF in terms of coverage and ranking loss but attained the best performance in terms of the other metrics (refer to Supplementary Table S2 for detailed results). For w = 0.5 and 0.75, XGBoost was the best-performing algorithm in terms of all the evaluation metrics. Consequently, the XGBoost algorithm was adopted as the base classifier of the weighted series for developing Clarion. The relationship between the model performance and the weight values is discussed in more detail in the following section.

Figure 3. (A) The example-based accuracies of seven different machine learning algorithms. (B) The performance comparison of the models trained with different weights in terms of |${Acc}_{exam}$|, average precision, coverage, one-error, ranking loss and Hamming loss. (C) The accuracies of the binary models in weighted series for predicting each mRNA subcellular localization.
Effect of weight w
In this section, we evaluated the effect of the weight |$w$| (ranging from 0 to 1) in the weighted series structure on the model performance. In particular, we evaluated the model performance on the training data through a tenfold cross-validation test with 19 different |$w$| candidates ranging from 0.05 to 0.95 with a step size of 0.05. Figure 3B illustrates the model performance with different candidate weights in terms of all six performance metrics. There is a clear peak/valley in each metric sub-figure, but the corresponding candidate weights are not consistent across metrics. As this makes it difficult to directly determine the best weight of weighted series for Clarion, we adopted a scoring scheme for the candidate weights. According to the performance ranking on each evaluation metric, we assigned 5, 4, 3, 2 and 1 points to the top five candidate weights, respectively, and 0 points to all remaining candidates. For instance, the candidate |$w(0.55)$| was assigned 2 points on |${Acc}_{exam}$| as its accuracy was the fourth highest out of the 19 candidates. Similarly, the candidate |$w(0.55)$| was assigned 5 points for the average precision, 0 points for the coverage, 5 points for the one-error, 0 points for the ranking loss and 4 points for the Hamming loss. We then obtained an overall score of 16 points for the candidate |$w(0.55)$| by summing up all the above scores. Using this procedure, we calculated the overall scores of the other 18 candidate weights, whose detailed results and score statistics are provided in Supplementary Table S3 and Supplementary Table S4. The candidate |$w(0.65)$| reached the best score of 22 points out of the 19 candidates and accordingly, it was adopted as the fixed weight of weighted series for Clarion.
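The scoring scheme can be sketched as follows. The metric values in the toy example are hypothetical (the real values are in the supplementary tables and are not reproduced here), only two metrics and six candidates are used for brevity, and ties are broken arbitrarily by sort order, which is an assumption.

```python
def score_candidates(results, higher_is_better, points=(5, 4, 3, 2, 1)):
    """results: {metric: {w: value}}. Returns {w: total points over all metrics}."""
    totals = {}
    for metric, by_w in results.items():
        for w in by_w:
            totals.setdefault(w, 0)
        # Best candidate first: descending for accuracy-like metrics, ascending for losses.
        ordered = sorted(by_w, key=by_w.get, reverse=higher_is_better[metric])
        for w, p in zip(ordered, points):    # only the top five receive points
            totals[w] += p
    return totals

# Hypothetical cross-validation results for six candidate weights.
results = {
    "acc_exam":     {0.1: 0.60, 0.2: 0.61, 0.3: 0.62, 0.4: 0.63, 0.5: 0.64, 0.6: 0.59},
    "hamming_loss": {0.1: 0.200, 0.2: 0.190, 0.3: 0.180, 0.4: 0.175, 0.5: 0.210, 0.6: 0.220},
}
higher_is_better = {"acc_exam": True, "hamming_loss": False}

totals = score_candidates(results, higher_is_better)
best_w = max(totals, key=totals.get)
print(totals)    # {0.1: 3, 0.2: 5, 0.3: 7, 0.4: 9, 0.5: 6, 0.6: 0}
print(best_w)    # 0.4

# Worked example from the text for w = 0.55: 2 + 5 + 0 + 5 + 0 + 4 points.
print(sum([2, 5, 0, 5, 0, 4]))   # 16
```

Because the scheme only uses ranks, it combines metrics with different scales and directions without normalizing their values.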
Performance comparison with other problem transformation strategies
To demonstrate the capacity of the weighted series (WS) strategy in dealing with the multi-label multi-class mRNA subcellular localization prediction task, we used XGBoost with default parameters as the base classifier and compared our proposed method with three well-known problem transformation methods, namely binary relevance (BR), classifier chains (CC) and label powerset (LP), using 10-fold cross-validation on the training_validation set. Because the input label order can directly affect the quality of a CC model, we ran 10 experiments for CC with 10 randomly generated label orders and used the best of these results for the comparison with BR, LP and WS. As shown in Table 2, the WS strategy performed best in terms of Accexam, average precision, coverage, ranking loss and hamming loss. Although WS was slightly worse than LP in terms of one-error (0.629 versus 0.601), it was superior on the other five metrics. To further improve the model performance, we optimized the key XGBoost parameters, including learning_rate, n_estimators and max_depth, based on 10-fold cross-validation. The specific hyperparameters are provided in Supplementary Table S5. Subsequently, we retrained the final model, named Clarion, on the whole training_validation set using the weighted series strategy with the optimized XGBoost hyperparameters.
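To make the direction of the loss-type metrics concrete, three of them can be sketched in plain Python. This is an illustrative sketch using common textbook definitions, which may differ in detail from the exact formulas used for Clarion. The function names, the toy labels and scores, and the convention that tied scores count as ranking errors are assumptions. Lower values are better for all three.

```python
from typing import List, Sequence


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of label slots predicted wrongly, over all samples and labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def one_error(y_true: Sequence[Sequence[int]], scores: Sequence[Sequence[float]]) -> float:
    """Fraction of samples whose top-scored label is not a true label."""
    misses = 0
    for truth, s in zip(y_true, scores):
        top = max(range(len(s)), key=s.__getitem__)
        misses += int(truth[top] == 0)
    return misses / len(y_true)


def ranking_loss(y_true: Sequence[Sequence[int]], scores: Sequence[Sequence[float]]) -> float:
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    losses: List[float] = []
    for truth, s in zip(y_true, scores):
        pos = [i for i, t in enumerate(truth) if t == 1]
        neg = [i for i, t in enumerate(truth) if t == 0]
        if not pos or not neg:
            continue  # undefined for this sample, so it is skipped here
        bad = sum(s[p] <= s[n] for p in pos for n in neg)  # ties count as errors
        losses.append(bad / (len(pos) * len(neg)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy data (made up): two samples, three labels.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
scores = [[0.9, 0.5, 0.4], [0.7, 0.6, 0.2]]

hl = hamming_loss(y_true, y_pred)
oe = one_error(y_true, scores)
rl = ranking_loss(y_true, scores)
```

Accexam, average precision and coverage are omitted here to keep the sketch short.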
Performance comparison of weighted series with binary relevance, classifier chains and label powerset
Strategy | Accexam | Average precision | Coverage | One-error | Ranking loss | Hamming loss |
---|---|---|---|---|---|---|
BR | 0.600 | 0.651 | 6.121 | 0.706 | 0.345 | 0.194 |
CC | 0.456 | 0.580 | 7.384 | 0.838 | 0.598 | 0.299 |
LP | 0.558 | 0.626 | 6.271 | 0.601 | 0.443 | 0.229 |
WS (|$w=0.65$|) | 0.627 | 0.670 | 6.029 | 0.629 | 0.344 | 0.182 |
Performance comparison with existing state-of-the-art tools
In this section, we benchmarked Clarion against several state-of-the-art approaches for predicting mRNA subcellular localizations. First, we compared Clarion with DM3Loc and Wang’s method, the only two multi-label predictors. Clarion shares only five predictable compartments with these two predictors: cytosol, exosome, membrane (cytoplasm for Wang’s method), nucleus and ribosome. Therefore, we compared their five-label prediction performance on the independent test dataset. Clarion and DM3Loc achieved, respectively, Accexam of 0.722/0.441, average precision of 0.769/0.618, coverage of 3.019/3.957, one-error of 0.463/0.869, ranking loss of 0.204/0.533 and hamming loss of 0.150/0.330. Similarly, Clarion and Wang’s method achieved, respectively, Accexam of 0.745/0.281, average precision of 0.767/0.679, coverage of 3.127/4.516, one-error of 0.445/0.882, ranking loss of 0.241/0.716 and hamming loss of 0.146/0.375. These results indicate that Clarion outperformed DM3Loc and Wang’s method in multi-label prediction tasks.
We also compared Clarion’s single-label prediction performance with that of other state-of-the-art methods. To facilitate the comparison, only methods with accessible webservers were used, including iLoc-mRNA [19], mRNALoc [20], mRNALocator [21], DM3Loc [25] and Wang’s method [26]. The RNA sequences of the independent set were uploaded to these webservers, which returned the predicted labels. As shown in Table 3, Clarion achieved higher accuracy than the other methods in predicting cytoplasm, cytosol, exosome, membrane, nucleus and ribosome. Clarion's accuracy exceeded 80% for all compartments except cytosol (79.77%) and nucleus (79.23%). Notably, the mRNAs of the independent set may appear in the training sets of other methods, which may account for Clarion’s slightly lower F1 scores than Wang’s method on the prediction of cytoplasm and ribosome (more details can be found in Supplementary Table S6 of the supplementary file). These comparison results further demonstrate Clarion’s superior prediction power on single-label tasks.
Performance comparison between Clarion and other state-of-the-art tools in terms of the prediction accuracy on the independent test dataset
Localization | iLoc-mRNA | mRNALoc | mRNALocator | Wang’s | DM3Loc | Clarion |
---|---|---|---|---|---|---|
Chromatin | N.A. | N.A. | N.A. | N.A. | N.A. | 81.47% |
Cytoplasm | N.A. | 54.88% | 38.90% | 87.10% | N.A. | 91.29% |
Cytosol | N.A. | N.A. | N.A. | 67.81% | 57.37% | 79.77% |
Exosome | N.A. | N.A. | N.A. | 16.18% | 70.00% | 92.10% |
Membrane | N.A. | N.A. | N.A. | N.A. | 70.92% | 89.15% |
Nucleolus | N.A. | N.A. | N.A. | N.A. | N.A. | 83.74% |
Nucleoplasm | N.A. | N.A. | N.A. | N.A. | N.A. | 80.74% |
Nucleus | N.A. | 55.18% | 57.42% | 60.13% | 69.52% | 79.23% |
Ribosome | 73.41% | N.A. | N.A. | 81.42% | 69.03% | 84.74% |
N.A.: not applicable
Effect of non-label and fusion-label modules in the weighted series structure
In this section, we further examined the effect of the non-label and fusion-label modules used in the WS framework. A total of 18 binary classifiers were trained in Clarion for predicting nine mRNA subcellular localizations respectively, including nine non-label models and nine fusion-label models. Here, we employed these binary models to predict the RNA sequences in the independent test dataset and accordingly evaluated the performance for each localization. Figure 3C illustrates the predictive performance of non-label models and fusion-label models in terms of accuracy. As a result, we found that the fusion-label models performed better than the non-label counterparts for all nine locations, highlighting the necessity and effectiveness of fusion-label models in the WS framework. However, it is noteworthy that the better performance of the fusion-label models originated from the outputs of non-label models because the fusion-label models used the prediction results of the non-label models as input features for the model training. Therefore, we conclude that the non-label and fusion-label modules are complementary and essential to the WS framework of Clarion.
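The data flow between the two modules can be sketched as follows. This is only an illustration under stated assumptions, not Clarion's implementation: the function name `fusion_inputs` is hypothetical, it assumes that the fusion-label model for one localization receives the |$k$|-mer features together with the non-label predictions for the other eight localizations, and it does not model the weight |$w$|.

```python
from typing import List, Sequence


def fusion_inputs(kmer_features: Sequence[float],
                  non_label_probs: Sequence[float],
                  target: int) -> List[float]:
    """Build the input of the fusion-label model for localization `target`.

    Assumption: the k-mer features are extended with the non-label model
    outputs of all localizations except `target` itself.
    """
    others = [p for j, p in enumerate(non_label_probs) if j != target]
    return list(kmer_features) + others


# Toy values (made up): 4 k-mer features and 9 non-label probabilities.
kmers = [0.10, 0.00, 0.25, 0.05]
probs = [0.9, 0.2, 0.4, 0.1, 0.7, 0.3, 0.6, 0.8, 0.5]

# One fusion input vector per localization: 4 features + 8 other predictions.
all_inputs = [fusion_inputs(kmers, probs, t) for t in range(len(probs))]
```

Under this reading, each fusion-label model depends on the quality of the non-label predictions it consumes, which is consistent with the observation that the two modules are complementary.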

Feature importance ranking based on the Shapley values. (A) Top 15 features for the nucleus, (B) top 15 features for the chromatin, (C) top 15 features for the ribosome and (D) top 15 features among all nine subcellular localizations.
Model interpretation
Shapley additive explanations (SHAP) is a method based on cooperative game theory for interpreting machine learning models [31]. It uses the Shapley value to rank and evaluate the importance of each feature and to explain individual predictions. SHAP has been successfully applied in a variety of bioinformatics tasks [43–46]. In this study, we used the Shapley value to assess which |$k$|-mer fragments of mRNAs are important for subcellular localization prediction. For the nine binary models of the non-label module, the Shapley value of each |$k$|-mer feature was calculated and ranked using the SHAP Python package (https://shap.readthedocs.io/en/latest/index.html). Figure 4 and Supplementary Figures S1-S3 show the top 15 important |$k$|-mer features of Clarion for predicting mRNA subcellular localizations. We found that some features are important for only one localization, such as ‘TTG’ and ‘GGGCGC’ in the nucleus, ‘GACGC’ and ‘GCGGCA’ in chromatin, and ‘GGATCT’ and ‘GGCCG’ in the ribosome. For example, ‘TTG’ was identified as the most important feature for nucleus prediction in terms of the SHAP value but did not appear among the top 15 features for any of the other eight localizations. For ‘TTG’ in Figure 4A, each point represents an instance, and the colour from blue to red corresponds to the feature value from small to large. When ‘TTG’ took high feature values, it pushed the model towards a positive prediction of nucleus, and vice versa. The effect of ‘TTG’ can therefore be summarized as follows: larger values promote the nucleus localization prediction. In contrast, larger values of ‘GGGCGC’ promote the non-nucleus localization prediction. These |$k$|-mer segments may be part of or related to protein recognition motifs for mRNA specific localization.
Interestingly, the repeat motif ‘UUCAC’ has been found to be crucial for localization via binding with Vg1PBP [47–49], which corresponded to ‘TTCACC’ that was ranked thirteenth in exosome prediction.
In addition, it was also found that certain |$k$|-mer features were important for the prediction of several subcellular localizations. For instance, ‘GCGGC’ was ranked second in nucleus prediction, tenth in chromatin prediction, and tenth in ribosome prediction, respectively, as shown in Figure 4A–C. This could be also observed in Figure 4D, which shows the top 15 features ranked according to the average of the absolute SHAP values and shows the important proportion in different mRNA subcellular localizations. The feature ‘GCGGC’ shown in Figure 4D is more important for the prediction of the nucleus, ribosome, chromatin, as well as nucleoplasm. Moreover, the feature ‘AAAAAA’ was important for almost all compartments of subcellular localizations, whose larger values drove the positive prediction of cytoplasm, but the negative predictions of exosome, chromatin, nucleoplasm, nucleolus, cytosol and ribosome.
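The ranking by average absolute SHAP value can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the SHAP values are taken as an already computed matrix (one row per sequence, one column per |$k$|-mer feature), the function name `rank_by_mean_abs_shap` is hypothetical, and the numbers are made up for illustration.

```python
from typing import List, Sequence, Tuple


def rank_by_mean_abs_shap(shap_values: Sequence[Sequence[float]],
                          feature_names: Sequence[str],
                          top_n: int = 15) -> List[Tuple[str, float]]:
    """Rank features by the mean absolute SHAP value over all samples."""
    n_samples = len(shap_values)
    means = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    order = sorted(range(len(feature_names)), key=lambda j: means[j], reverse=True)
    return [(feature_names[j], means[j]) for j in order[:top_n]]


# Made-up SHAP values for three sequences and three k-mer features.
names = ["TTG", "GCGGC", "AAAAAA"]
values = [
    [0.4, -0.1, 0.0],
    [-0.2, 0.3, 0.1],
    [0.6, -0.2, -0.1],
]

ranking = rank_by_mean_abs_shap(values, names)
```

Taking the absolute value before averaging means that a feature counts as important whether it pushes predictions towards or away from a localization. The direction of the effect has to be read from the signed values, as in Figure 4A–C.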
Webserver implementation
To help the wider research community make subcellular localization predictions, we developed a web server for Clarion that is freely available at http://monash.bioweb.cloud.edu.au/Clarion/. The web server is maintained by Nectar Research Cloud and configured on a Linux server equipped with a 4-core CPU, 8-GB memory and a 30-GB hard disk. The web page was implemented using PHP and has been tested on several popular web browsers, including Google Chrome, Microsoft Edge, Internet Explorer, Mozilla Firefox and Safari. Users can either copy and paste query mRNA sequences into the textbox or upload a sequence file in FASTA format via the file-selection dialogue box. Note that query sequences are truncated so that no mRNA is longer than 6000 nt. All generated prediction results are saved in a table containing detailed information on the sequences and the predicted subcellular localization types. The web server also provides a score in the range of 0–1 indicating the probability of each subcellular localization type. The prediction results can be exported to widely used file formats, including CSV, Excel, PDF and plain text. More detailed instructions for using the Clarion webserver can be found on its help page.
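Users who want to see how the length limit would affect their queries before submission could pre-process a FASTA file along the following lines. This is a hedged sketch, not the server's code: the function names are hypothetical, and it assumes that truncation keeps the first 6000 nt of each sequence, which the text does not specify.

```python
from typing import List, Tuple

MAX_LEN = 6000  # length limit stated for the web server, in nucleotides


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def truncate_records(records: List[Tuple[str, str]],
                     max_len: int = MAX_LEN) -> List[Tuple[str, str]]:
    """Keep at most `max_len` leading nucleotides (assumed truncation rule)."""
    return [(header, seq[:max_len]) for header, seq in records]


# Toy input: one 8000-nt sequence and one short sequence.
fasta_text = ">seq1 demo\n" + "ACGT" * 2000 + "\n>seq2\nacgu\n"
records = truncate_records(read_fasta(fasta_text))
```

Here the first record is cut from 8000 to 6000 nt, while the short second record is left unchanged apart from upper-casing.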
Conclusion
In this study, we have introduced a novel ensemble model termed Clarion for the simultaneous prediction of nine mRNA subcellular localizations: exosome, nucleus, nucleoplasm, chromatin, cytoplasm, nucleolus, cytosol, membrane and ribosome. Specifically, Clarion used the |$k$|-mer nucleotide composition scheme to encode mRNA sequences and employed our proposed strategy, weighted series, as the ensemble framework. We selected XGBoost as the base classifier for the weighted series and optimized its weight parameter. The cross-validation and independent tests illustrate the superiority of Clarion for predicting mRNA subcellular localizations: it outperformed several existing methods, including the single-label predictors iLoc-mRNA, mRNALoc and mRNALocator, and the multi-label predictors DM3Loc and Wang’s method. The performance improvement of Clarion can be attributed to two key factors. The first is the collection of more data than existing methods to construct the benchmark dataset, which facilitates high-quality model fitting. The second is the use of the XGBoost-based weighted series method, which incorporates multi-label prior information into model training and leverages this information to improve predictive power. Moreover, we analyzed the most important |$k$|-mer features for each localization prediction using the model interpretation algorithm SHAP and developed a user-friendly web server for Clarion. Clarion is expected to be a promising tool for multi-label mRNA subcellular localization prediction in the field of bioinformatics.
Nevertheless, the model performance can be further improved, especially for the cytosol and nucleus compartments. Clarion considers only nine cell compartments; it is desirable to expand the number of predictable compartments. In our future work, we plan to develop more advanced techniques for the prediction of mRNA subcellular localization. Apart from mRNAs, we will also develop methods for predicting the localizations of other types of RNAs, such as microRNAs and long non-coding RNAs. In addition, the co-localization of different types of RNAs will be another research direction in future work.
Characterization of mRNA subcellular localization can help elucidate gene regulatory networks and human disease mechanisms.
We proposed a novel ensemble method, termed Clarion, to predict nine subcellular localizations of mRNAs simultaneously.
Clarion achieved significantly better predictive performance in both single-label and multi-label predictions compared to state-of-the-art predictors.
The online webserver and local stand-alone tool of Clarion are publicly available at: http://monash.bioweb.cloud.edu.au/Clarion/.
Code and data availability
The source code and datasets of Clarion are publicly available at http://monash.bioweb.cloud.edu.au/Clarion/.
Funding
National Health and Medical Research Council of Australia (NHMRC) (APP1127948, APP1144652), Australian Research Council (ARC) (LP110200333, DP120104460), National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965), Major and Seed Inter-Disciplinary Research (IDR) projects awarded by Monash University.
Author Biographies
Yue Bi is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests include bioinformatics, computational biology, sequence analysis and machine learning.
Fuyi Li received his PhD in Bioinformatics from Monash University, Australia. He is currently a professor in the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics, computational biology, machine learning, and data mining.
Xudong Guo received his MEng degree from Ningxia University, China. He is currently a research assistant at the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics and data mining.
Zhikang Wang is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. His interests are bioinformatics, computational pathology, pattern recognition and deep learning.
Tong Pan is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests are bioinformatics, protein function analysis, deep learning and pattern recognition.
Yuming Guo is a professor of Global Environmental Health and Biostatistics & Head of the Monash Climate, Air Quality Research (CARE) Unit. His research focuses on environmental epidemiology, biostatistics, global environmental change, air pollution, climate change, urban design, residential environment, remote sensing modelling and infectious disease modelling.
Geoffrey I. Webb is a professor in the Faculty of Information Technology and a research director of the Monash Data Futures Institute at Monash University. His research interests include machine learning, data mining, computational biology and user modelling.
Jianhua Yao is a group leader in Tencent AI Lab, China. He received his PhD degree from Johns Hopkins University. His research interests include bioinformatics, computational biology, medical imaging and pattern recognition.
Cangzhi Jia is a professor in the college of science, Dalian Maritime University, China. She obtained her PhD degree in the school of Mathematical Sciences from the Dalian University of Technology in 2007. Her major research interests include mathematical modelling in bioinformatics and machine learning.
Jiangning Song is an associate professor and a group leader in the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biomedicine, machine learning, data mining and pattern recognition.