Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations

Table 1

Summary of the reviewed predictors for mRNA subcellular localization

Type	Year	Tool	The sources of mRNA subcellular localization/sequence	Subcellular localization	Benchmark dataset size	Encoding scheme	Feature selection	Algorithm	Evaluation strategy	Web server/Github availability	Reference
Single-label	2019	RNATracker	CeFra-Seq & APEX-RIP /Ensembl	Cytosol Endoplasmic reticulum Insoluble Membranes Mitochondrial Nuclear	11 373 (Dataset 1) 13 860 (Dataset 2)	One-hot RNA secondary structure	None	CNN LSTM Attention	Tenfold cross-validation	https://www.github.com/HarveyYan/RNATracker	[18]
	2020	iLoc-mRNA	RNALocate /GenBank	Cytoplasm Cytosol Dendrite Endoplasmic reticulum Exosome Mitochondrion Nucleus Ribosome	4901	K-mer	Binomial distribution ANOVA IFS	SVM	Fivefold cross-validation	http://lin-group.cn/server/iLoc-mRNA/	[19]
	2020	mRNALoc	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular region Mitochondria Nucleus	14 909	PseKNC	None	SVM	Fivefold cross-validation Independent test	http://proteininformatics.org/mkumar/mrnaloc	[20]
	2021	mRNALocater	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular Mitochondria Nucleus	14 909	PseEIIP PseKNC	Remove collinear features SFS	CatBoost XGBoost LightGBM	Fivefold cross-validation Independent test	http://bio-bigdata.cn/mRNALocater	[21]
	2021	SubLocEP	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular Mitochondria Nucleus	14 909	PseEIIP TNC DNC CKSNAP PCPseDNC PCPseTNC SCPseDNC SCPseTNC DACC	None	LightGBM	Fivefold cross-validation Independent test	http://lab.malab.cn/~lijing/SubLocEP.html	[22]
Multi-label	2021	DM3Loc	RNALocate /GenBank & NCBI	Cytosol Endoplasmic reticulum Exosome Membrane Nucleus Ribosome	17 870	One-hot	None	CNN Attention	Fivefold cross-validation Independent test	http://dm3loc.lin-group.cn/	[25]
Multi-label	2021	Wang’s	RNALocate	Cytoplasm Cytosol Endoplasmic reticulum Exosome Mitochondrion Nucleus Posterior Pseudopodium Ribosome	13 475	K-mer RCKmer NAC DNC TNC CKSNAP	None	SVM	Tenfold cross-validation	http://lbci.tju.edu.cn/Online_services.htm	[26]

Abbreviations: PseKNC, pseudo k-tuple nucleotide composition; PseEIIP, electron-ion interaction pseudopotential values of trinucleotide; RCKmer, reverse complement k-mer; NAC, nucleic acid composition; DNC, dinucleotide composition; TNC, trinucleotide composition; CKSNAP, composition of k-spaced nucleic acid pairs; PCPseDNC, parallel correlation pseudo dinucleotide composition; PCPseTNC, parallel correlation pseudo trinucleotide composition; SCPseDNC, series correlation pseudo dinucleotide composition; SCPseTNC, series correlation pseudo trinucleotide composition; DACC, dinucleotide-based auto-cross covariance; ANOVA, analysis of variance; IFS, incremental feature selection; SFS, sequential forward search.

To deal with MLL tasks, problem transformation algorithms such as binary relevance (BR), label powerset (LP) and classifier chains (CC) [27] are popular strategies; they transform a multi-label problem into one or more single-label tasks. Amongst these, BR is the most widely used and transforms a multi-label problem into one independent binary problem per label [28]. However, BR ignores the correlations between labels, which limits its utility. In contrast, LP and CC are designed to take label correlations into account. LP treats each unique combination of labels present in the multi-label training set as one class of a new single-label multi-class task [29]. CC generates a chain of binary classifiers, one for each label, in which the input of each classifier is augmented with the predictions of all preceding classifiers in the chain [30]. These two methods also have limitations. For LP, infrequent label combinations lead to class imbalance, and the model cannot predict label combinations that do not appear in the training set. For CC, the order of the labels in the chain can affect model quality and prediction performance [27]: if an early classifier in the chain predicts inaccurately, the error may propagate along the chain. To overcome this problem, Read et al. [30] proposed the ensembles of classifier chains (ECC) method, which averages the predictions of CC models over a set of random chain orderings; however, this may still be demanding in terms of computational capacity and time cost.
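The difference between the BR and LP transformations can be illustrated with a minimal sketch. The helper names and the toy label matrix below are hypothetical and only show how the training targets are built, not how any of the reviewed tools implement them.

```python
import numpy as np

def binary_relevance_targets(Y):
    """BR: split an (n, q) 0/1 label matrix into q independent binary targets."""
    return [Y[:, j] for j in range(Y.shape[1])]

def label_powerset_targets(Y):
    """LP: map every distinct label combination to a single class id."""
    combos, class_ids = np.unique(Y, axis=0, return_inverse=True)
    return combos, class_ids.ravel()

# Toy label matrix (hypothetical): 4 samples, 3 labels.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

br_targets = binary_relevance_targets(Y)      # 3 binary problems
combos, lp_targets = label_powerset_targets(Y)  # 1 multi-class problem
```

In this toy case LP yields three classes, one per observed combination. A combination such as `[0, 0, 1]` never occurs in `Y`, so an LP model trained on it could not output that combination, which is the limitation described above.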

Herein, we introduce a novel computational method termed Clarion (subcellular localization predictor), which is capable of identifying multiple subcellular localizations of mRNAs simultaneously. First, we established a multi-label benchmark dataset extracted from RNALocate, covering nine different compartments of mRNA subcellular localization. The |$k$|-mer nucleotide composition scheme was used to encode mRNA sequences. Next, we applied the weighted series approach, a problem transformation method proposed to tackle multi-label tasks, as the ensemble framework of Clarion. The weighted series algorithm incorporates prior information about the labels during model training to improve the prediction performance. Then, after comparing the performance of several machine learning algorithms, we selected XGBoost as the base classifier of Clarion. We optimized the weight of the weighted series and the key parameters of XGBoost through tenfold cross-validation. Additional independent tests illustrate that Clarion outperformed the existing state-of-the-art tools for identifying mRNA subcellular localizations. In addition, we employed the SHAP (SHapley Additive exPlanations) algorithm [31] to identify and interpret the |$k$|-mer features that made the most important contributions to the model predictions for each type of mRNA subcellular localization.

Material and methods

Benchmark dataset

In this study, all subcellular localization annotations and mRNA sequences were collected from the RNALocate database (version 2.0) [32]. The latest version of RNALocate (updated in June 2021) contained more than 210 000 RNA subcellular localization entries, encompassing more than 110 000 RNAs with 171 subcellular localizations across 104 different species. Version 2.0 provides more accurate localization annotations than the first version, facilitating the construction of a reliable benchmark dataset. More specifically, the benchmark dataset was constructed according to the following five major steps:

1) We downloaded all RNA subcellular localization annotation entries from RNALocate (version 2.0) and accordingly collected 84 792 mRNA subcellular localization entries as the initial dataset.
2) According to the statistics of the initial dataset, there were 150 different types of annotated subcellular localizations. However, some had few or incomplete entries; we therefore removed the subcellular localization types with fewer than 3000 corresponding entries. As a result, we obtained nine types of subcellular localizations (exosome, nucleus, cytosol, chromatin, nucleoplasm, ribosome, nucleolus, cytoplasm and membrane) with 152 887 unique transcripts.
3) Next, we redefined the mapping relationships between mRNAs and subcellular localizations based on multiple localizations of mRNAs in the transcriptome. In particular, an mRNA can be labelled with multiple subcellular localizations instead of being only labelled with one subcellular localization.
4) To reduce the effect of sequence redundancy on the performance of the classifier, we applied CD-HIT-EST [33] to remove the redundant sequences with the 80% sequence identity threshold to ensure the similarity between any two nucleotide sequences was less than 80%. Finally, 36 971 mRNAs were obtained and used as the benchmark dataset.
5) We analyzed the distribution of the sequence lengths of these 36 971 mRNAs, which varied from 119 nt to 12 000 nt. In view of the computational complexity and the limitations of the feature engineering algorithms, the mRNA lengths were restricted to no more than 6000 nt. Specifically, for mRNAs longer than 6000 nt, the first 3000 nt and the last 3000 nt were extracted and concatenated.
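The length adjustment in step 5 can be sketched as follows. The function name `truncate_mrna` and its `max_len` parameter are hypothetical names introduced for illustration; only the 6000 nt limit and the head/tail rule come from the text.

```python
def truncate_mrna(seq, max_len=6000):
    """Keep sequences of at most max_len nt unchanged; otherwise
    concatenate the first and last max_len // 2 nucleotides."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    return seq[:half] + seq[-half:]

# Toy 7000-nt sequence (hypothetical): the middle 1000 nt are dropped.
long_seq = "A" * 3000 + "C" * 1000 + "G" * 3000
short_seq = truncate_mrna(long_seq)
```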

Sequence vectorization

mRNA sequences need to be encoded as numeric vectors before machine learning models can be trained. The |$k$|-mer nucleotide composition is a widely used sequence encoding method that has been successfully applied in a variety of bioinformatics studies [34–38]. An mRNA sequence |$S$| of length |$L$| nt can be written as |$S={N}_1{N}_2{N}_3{N}_4{N}_5{N}_6\dots{N}_{L-1}{N}_L$|, where |${N}_i$| (|$1\le i\le L$|) represents the nucleotide at position |$i$|, which is one of adenine (A), cytosine (C), guanine (G) and uracil/thymine (U/T). Accordingly, when using the |$k$|-mer (|$k=3$|) method to encode features, the feature vector can be calculated as:

$$ {V}=\left[\frac{ {f}\left({AAA}\right)}{{L}-{k}+1},\frac{{f}\left({AAC}\right)}{{L}-{k}+1},\frac{{f}\left({AAG}\right)}{{L}-{k}+1},\dots, \frac{{f}\left({UUU}/{TTT}\right)}{{L}-{k}+1}\right] $$

where |$f\big(\bullet \big)$| represents the number of occurrences of the given |$k$|-mer along the sequence. Because there are |$4^k$| possible |$k$|-mers, the dimension of the |$k$|-mer vector increases exponentially with |$k$|. Such a large number of features may contain a great deal of redundancy and noise, which may increase the training time and even have a negative influence on model quality. In view of the dimension restriction and training effectiveness, we used 1-mer, 2-mer, 3-mer, 4-mer, 5-mer and 6-mer features in this study.
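A minimal sketch of this encoding is given below. The function names are hypothetical. Mapping T to U and skipping windows that contain ambiguous bases (such as N) are assumptions made here, since the text does not say how such cases were handled.

```python
from itertools import product

def kmer_frequencies(seq, k, alphabet="ACGU"):
    """Return the 4**k k-mer frequencies f(kmer) / (L - k + 1) in lexicographic order."""
    seq = seq.upper().replace("T", "U")      # assumption: treat T and U as the same base
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:                 # assumption: skip windows with ambiguous bases
            counts[window] += 1
    if n_windows <= 0:                       # sequence shorter than k
        return [0.0] * len(kmers)
    return [counts[m] / n_windows for m in kmers]

def encode_sequence(seq, ks=(1, 2, 3, 4, 5, 6)):
    """Concatenate the k-mer frequency vectors for every k in ks."""
    vector = []
    for k in ks:
        vector.extend(kmer_frequencies(seq, k))
    return vector

features = encode_sequence("ACGUACGUAGCUAGCUA")
```

With |$k$| from 1 to 6 the concatenated vector has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries, which shows how quickly the dimension grows with |$k$|.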

Weighted series

In this study, we proposed a novel problem transformation method named weighted series (WS) to tackle multi-label learning problems, designed specifically for identifying the subcellular localizations of mRNAs. The weighted series method involves two modules of binary classifiers: a non-label module and a fusion-label module. The non-label module trains its models from the pre-extracted features only, whereas the fusion-label module additionally incorporates prior information about the other labels during model training, so that the learned prior label distributions can contribute to the model predictions. The final predictions of the two modules are combined using a user-defined weight |$w$| (ranging from 0 to 1) that reflects the relevance between labels. When there is a strong correlation between the labels, a smaller value of |$w$| will promote the prediction performance.

Suppose |${\mathbb{R}}^d$| represents the |$d$|-dimensional instance space and |${\mathbb{Y}}^q$| represents the |$q$|-dimensional label space. Consider a multi-label data set |$D=\big\{X,Y\big\}$| containing |$n$| samples, where |$X\subset{\mathbb{R}}^d$| and |$Y\subset{\mathbb{Y}}^q$|. Both the non-label module and the fusion-label module require training |$q$| binary classifiers. When training the |$j$|-th model |${C}_j^N$| (|$1\le j\le q$|) of the non-label module, the training label is |${Y}_j$| and the training input is the feature matrix of the |$n$| training samples, i.e. |$X$|. When training the |$j$|-th model |${C}_j^F$| (|$1\le j\le q$|) of the fusion-label module, the prior information on the other labels is added to the training process by combining it with the features |$X$| as input. That is to say, the label is still |${Y}_j$|, whereas the training input is the fusion of |$X$| and |${Y}_i\big(i\ne j\big)$|. After model training, the |$2q$| binary classifiers of the weighted series, |${C}^N\big\{{C}_1^N,\dots, {C}_q^N\big\}$| and |${C}^F\big\{{C}_1^F,\dots, {C}_q^F\big\}$|, can be used to make predictions. The prediction process of the weighted series includes three steps: |${C}^N$| module prediction, |${C}^F$| module prediction and integration. Given a query mRNA sequence with feature vector |$x$|, the non-label models |${C}^N$| are first used to generate the prediction probabilities |${y}^N$|; |${y}^N$| is then fused with |$x$| as the input of the fusion-label models |${C}^F$|, which output the probabilities |${y}^F$|; finally, |${y}^N$| and |${y}^F$| are integrated with a user-defined weight |$w$| that ranges from 0 to 1. The detailed training and prediction procedures of the weighted series are illustrated in Figure 1 and outlined in Algorithm 1.
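The training and prediction procedure described above can be sketched in a few lines. This is an illustrative reading of the text under several assumptions, not the authors' implementation. The integration rule |$w{y}^N+(1-w){y}^F$| is assumed here because the exact formula is only given in Algorithm 1. For the |$j$|-th fusion-label model, the probabilities of the other |$q-1$| labels (not thresholded labels) are assumed to be appended to |$x$| at prediction time. A tiny NumPy logistic regression (the hypothetical `TinyLogistic`) stands in for the XGBoost base classifier so that the example is self-contained.

```python
import numpy as np

class TinyLogistic:
    """Minimal logistic regression; a stand-in for the XGBoost base classifier."""
    def __init__(self, lr=0.5, n_iter=300):
        self.lr, self.n_iter = lr, n_iter

    @staticmethod
    def _with_bias(X):
        return np.hstack([X, np.ones((X.shape[0], 1))])

    def fit(self, X, y):
        Xb = self._with_bias(X)
        self.coef_ = np.zeros(Xb.shape[1])
        for _ in range(self.n_iter):          # plain gradient descent on the log-loss
            p = 1.0 / (1.0 + np.exp(-Xb @ self.coef_))
            self.coef_ -= self.lr * Xb.T @ (p - y) / len(y)
        return self

    def predict_proba(self, X):
        """Probability of the positive class for each row of X."""
        return 1.0 / (1.0 + np.exp(-self._with_bias(X) @ self.coef_))

class WeightedSeriesSketch:
    def __init__(self, w=0.65):
        self.w = w

    def fit(self, X, Y):
        q = Y.shape[1]
        # Non-label module C^N: q classifiers trained on the features only.
        self.non_label_ = [TinyLogistic().fit(X, Y[:, j]) for j in range(q)]
        # Fusion-label module C^F: features fused with the true labels Y_i (i != j).
        self.fusion_ = [
            TinyLogistic().fit(np.hstack([X, np.delete(Y, j, axis=1)]), Y[:, j])
            for j in range(q)
        ]
        return self

    def predict_parts(self, X):
        q = len(self.non_label_)
        y_n = np.column_stack([m.predict_proba(X) for m in self.non_label_])
        # Assumption: model j receives x fused with the y^N of the other labels.
        y_f = np.column_stack([
            self.fusion_[j].predict_proba(np.hstack([X, np.delete(y_n, j, axis=1)]))
            for j in range(q)
        ])
        return y_n, y_f

    def predict_proba(self, X):
        y_n, y_f = self.predict_parts(X)
        # Assumed integration rule: convex combination controlled by w.
        return self.w * y_n + (1.0 - self.w) * y_f

# Toy data (hypothetical): 60 samples, 5 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = (X[:, :3] + 0.1 * rng.normal(size=(60, 3)) > 0).astype(int)
model = WeightedSeriesSketch(w=0.65).fit(X, Y)
P = model.predict_proba(X)
```

Under the assumed rule, a smaller |$w$| gives more weight to the fusion-label module, which agrees with the statement that smaller weights help when labels are strongly correlated.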

Figure 1

The workflow of the methodology of Clarion.

Performance evaluation metrics

Model evaluation for MLL tasks is more complicated than for binary classification problems because the predictive performance for all labels must be taken into account. In this study, we employed six widely used MLL evaluation metrics [29, 39, 40] to evaluate the performance of Clarion: example-based accuracy (|${Acc}_{exam}$|), average precision, coverage, one-error, ranking loss and Hamming loss. Let |$f\big(\bullet \big)$| be the learned multi-label classifier and |$\big({x}_i,{Y}_i\big)\ \big\{1\le i\le t\big\}$| be the multi-label instances, where |${Y}_i$| and |${P}_i$| represent the true and predicted label sets of the |$i$|-th instance and |${\overline{Y}}_i$| represents the complementary set of |${Y}_i$|. The above metrics can then be formulated as follows:

$$ {{Acc}}_{{exam}}=\frac{1}{{t}}\sum_{{i}=1}^{{t}}\frac{\left|{{P}}_{{i}}\cap{{Y}}_{{i}}\right|}{\left|{{P}}_{{i}}\cup{{Y}}_{{i}}\right|} $$

$$\begin{align*} &{Average}\ {Precision}=\frac{1}{{t}}\sum_{{i}=1}^{{t}}\frac{1}{\left|{{Y}}_{{i}}\right|}\\[6pt] &\times\sum_{{y}\in{{Y}}_{{i}}}\frac{\left|\left\{{{y}}^{\prime}\left|{{Rank}}_{{f}}\left({{x}}_{{i}},{{y}}^{\prime}\right)\le{{Rank}}_{{f}}\left({{x}}_{{i}},{y}\right),{{y}}^{\prime}\in{{Y}}_{{i}}\right.\right\}\right|}{{{Rank}}_{{f}}\left({{x}}_{{i}},{y}\right)} \end{align*}$$

$$ {Coverage}=\frac{1}{{t}}\sum_{{i}=1}^{{t}}{\max}_{{{y}}^{\prime}\in{{Y}}_{{i}}}{{Rank}}_{{f}}\left({{x}}_{{i}},{{y}}^{\prime}\right)-1 $$

$$ {One}\text{-}{error}=\frac{1}{t}\sum_{i=1}^{t}I\left({\arg\max}_{y^{\prime}\in{Y}_i\cup{\overline{Y}}_i}f\left({x}_i,{y}^{\prime}\right)\notin{Y}_i\right) $$

$$ {Ranking}\ {Loss}=\frac{1}{{t}}\!\sum_{{i}=1}^{{t}}\frac{1}{\left|{{Y}}_{{i}}\right|\ \left|{\overline{{Y}}}_{{i}}\right|}{I}\left({f}\!\left({{x}}_{{i}},{{y}}^{\prime}\right)\!\le{f}\!\left({{x}}_{{i}},{{y}}^{\prime \prime}\right)\!,{{y}}^{\prime}\in{{Y}}_{{i}},{{y}}^{\prime \prime}\in{\overline{{Y}}}_{{i}}\right) $$

$$ {Hamming}\ {Loss}=\frac{1}{{t}}\sum_{{i}=1}^{{t}}\frac{1}{{q}}\left|{{P}}_{{i}}\Delta{{Y}}_{{i}}\right| $$

where |${Rank}_f\big(x,y\big)$| represents the rank of label |$y$| when all labels are sorted in descending order of |$f\big(x,\bullet \big)$|, |$\big|\bullet \big|$| represents the cardinality of a set, |$q$| is the total number of labels, ∆ stands for the symmetric difference of two sets, and |$I\big(\bullet \big)$| counts the number of times the condition is met.
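The six metrics can be computed directly from the formulas above, as in the following sketch. The function name, the 0.5 threshold used to derive the predicted label sets |${P}_i$| and the toy inputs are assumptions for illustration. The code also assumes that every instance has at least one true label, which holds for the benchmark dataset described here.

```python
import numpy as np

def multilabel_metrics(Y, scores, threshold=0.5):
    """Y: (t, q) 0/1 true labels; scores: (t, q) values of f(x_i, y)."""
    Y = np.asarray(Y, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    P = scores >= threshold                       # predicted label sets (assumed threshold)
    t, q = Y.shape
    # Rank 1 is the label with the highest score.
    order = np.argsort(-scores, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(t)[:, None], order] = np.arange(1, q + 1)

    acc = avg_prec = coverage = one_error = rank_loss = 0.0
    for i in range(t):
        pos, neg = Y[i], ~Y[i]
        union = (P[i] | pos).sum()
        acc += (P[i] & pos).sum() / union if union else 1.0
        r_pos = ranks[i, pos]                     # ranks of the true labels
        avg_prec += np.mean([(r_pos <= r).sum() / r for r in r_pos])
        coverage += r_pos.max() - 1
        one_error += float(not pos[np.argmax(scores[i])])
        if neg.any():
            # Count (true, false) label pairs that are ordered incorrectly.
            bad = (scores[i, pos][:, None] <= scores[i, neg][None, :]).sum()
            rank_loss += bad / (pos.sum() * neg.sum())
    return {
        "acc_exam": acc / t,
        "average_precision": avg_prec / t,
        "coverage": coverage / t,
        "one_error": one_error / t,
        "ranking_loss": rank_loss / t,
        "hamming_loss": (P ^ Y).sum() / (t * q),
    }

# Toy example (hypothetical): 2 instances, 3 labels.
metrics = multilabel_metrics([[1, 0, 1], [0, 1, 0]],
                             [[0.9, 0.2, 0.6], [0.3, 0.4, 0.8]])
```

Higher values are better for example-based accuracy and average precision, whereas lower values are better for the other four metrics.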

Results and discussion

Statistical analysis of the dataset

The benchmark dataset used in this study contained a total of 36 971 mRNA sequences, each of which might be localized to a single or multiple subcellular compartments. Specifically, 12 884 mRNAs had only one localization/compartment, 4060 mRNAs had two localizations, 3442 mRNAs had three localizations, 3165 mRNAs had four localizations, 3518 mRNAs had five localizations, 4258 mRNAs had six localizations, 4079 mRNAs had seven localizations, 1443 mRNAs had eight localizations and 122 mRNAs had nine localizations, as illustrated in Figure 2A. In addition, we plotted the distribution of the positive and negative samples in the nine compartments, as shown in Figure 2B. Compared with the nucleus, nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome, the data of exosome and cytoplasm were unevenly distributed. For the exosome, the number of positive samples was much larger than that of the negatives, while for the cytoplasm, the number of negative samples was much larger than that of the positives. In total, there were 31 448 mRNAs in exosome, 21 439 in nucleus, 14 237 in nucleoplasm, 14 328 in chromatin, 4016 in cytoplasm, 11 124 in nucleolus, 16 312 in cytosol, 6739 in membrane and 8680 in ribosome. The distribution of the label (i.e., subcellular localization) number in these nine compartments is shown in Figure 2C.

Figure 2

Statistical distributions of mRNA entries in the benchmark dataset curated in this study. (A) The relative percentages of mRNAs with different labels in the benchmark dataset. (B) The distribution of the positive and negative samples in the nine subcellular compartments. (C) The distribution of mRNAs with different labels in the nine compartments.

We noticed that mRNAs in the cytoplasm were mostly single-localized and mRNAs in the nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome were mostly multi-localized, while mRNAs in the exosome and nucleus appeared to be uniformly localized. In this study, we separated the benchmark dataset into the training_validation dataset and the independent test dataset by random sampling. Accordingly, 33 274 mRNAs were included in the training_validation dataset (90% of the total) and 3697 mRNAs in the independent test dataset (10% of the total). The former was used to compare the algorithms and optimize the parameters by tenfold cross-validation, while the latter was used to evaluate and validate the model performance. Detailed localization distributions of the training_validation and independent test datasets can be found in Supplementary Table S1.

Selection of the base classifier

In this section, we performed a preliminary analysis to select a suitable base classifier. To do this, we evaluated seven popular machine learning algorithms within the WS framework: k-nearest neighbour (KNN), logistic regression (LR), random forest (RF), LightGBM, XGBoost, CatBoost and multilayer perceptron (MLP) [41, 42]. For the sake of comparison, the weight |$w$| of WS was set to one-quarter, two-quarters and three-quarters of its range, i.e. 0.25, 0.5 and 0.75. Restricting the comparison to three weights saves running time and accelerates the determination of the base classifier. We employed the default parameters of each algorithm in the Scikit-learn package and conducted a tenfold cross-validation test on the training_validation dataset for performance comparison. As can be seen from Figure 3A, for w = 0.25, 0.5 and 0.75, XGBoost secured the best predictive performance among the seven algorithms in terms of accuracy. For w = 0.25, XGBoost was slightly inferior to RF in terms of coverage and ranking loss but attained the best performance in terms of the other metrics (refer to Supplementary Table S2 for detailed results). For w = 0.5 and 0.75, XGBoost was the best-performing algorithm in terms of all the evaluation metrics. Consequently, the XGBoost algorithm was adopted as the base classifier of the weighted series for developing Clarion. The relationship between the model performance and the weight values is discussed in more detail in the following section.

Figure 3

(A) The example-based accuracies of seven different machine learning algorithms. (B) The performance comparison of the models trained with different weights in terms of Acc_exam, average precision, coverage, one-error, ranking loss and Hamming loss. (C) The accuracies of the binary models in weighted series for predicting each mRNA subcellular localization.

Effect of weight w

In this section, we evaluated the effect of the weight |$w$| (ranging from 0 to 1) of the weighted series structure on model performance. In particular, we evaluated the model on the training data through a tenfold cross-validation test with 19 candidate values of |$w$| ranging from 0.05 to 0.95 with a step size of 0.05. Figure 3B illustrates the model performance for the candidate weights in terms of all six performance metrics. There is a clear peak or valley in each metric sub-figure, but the corresponding candidate weights are not consistent across metrics. As this makes it difficult to directly determine the best weight of the weighted series for Clarion, we adopted a score-based selection procedure. According to the performance ranking on each evaluation metric, we assigned 5, 4, 3, 2 and 1 points to the top five candidate weights, respectively, and 0 points to all remaining candidates. For instance, the candidate |$w(0.55)$| was assigned 2 points on Acc_exam as its accuracy was the fourth highest out of the 19 candidates. Similarly, the candidate |$w(0.55)$| was assigned 5 points for the average precision, 0 points for the coverage, 5 points for the one-error, 0 points for the ranking loss and 4 points for the Hamming loss. Summing these scores gave an overall score of 16 points for the candidate |$w(0.55)$|. Using this procedure, we calculated the overall scores of the other 18 candidate weights; the detailed results and score statistics are provided in Supplementary Table S3 and Supplementary Table S4. The candidate |$w(0.65)$| reached the best score of 22 points out of the 19 candidates and was accordingly adopted as the fixed weight of the weighted series for Clarion.
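The scoring procedure can be sketched as follows. The function name, the dictionary layout and the toy metric values are hypothetical, and ties are broken by input order because the text does not specify a tie rule. For brevity the toy example uses six candidate weights and two metrics rather than the 19 candidates and six metrics of the study.

```python
# Direction of each metric: True if a higher value is better.
HIGHER_IS_BETTER = {
    "acc_exam": True, "average_precision": True,
    "coverage": False, "one_error": False,
    "ranking_loss": False, "hamming_loss": False,
}

def score_weights(results, points=(5, 4, 3, 2, 1)):
    """results: {metric: {w: value}}. Returns {w: total points over all metrics}."""
    totals = {}
    for metric, by_w in results.items():
        for w in by_w:
            totals.setdefault(w, 0)
        # Best candidate first; candidates beyond the top five get 0 points.
        ranked = sorted(by_w, key=by_w.get, reverse=HIGHER_IS_BETTER[metric])
        for pts, w in zip(points, ranked):
            totals[w] += pts
    return totals

# Toy cross-validation results (hypothetical values).
results = {
    "acc_exam":     {0.1: 0.60, 0.2: 0.61, 0.3: 0.62, 0.4: 0.63, 0.5: 0.64, 0.6: 0.65},
    "hamming_loss": {0.1: 0.20, 0.2: 0.19, 0.3: 0.18, 0.4: 0.21, 0.5: 0.22, 0.6: 0.23},
}
totals = score_weights(results)
best_w = max(totals, key=totals.get)
```

Summing rank-based points rather than raw metric values puts metrics with different scales and directions on a common footing.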

Performance comparison with other problem transformation strategies

To demonstrate the capacity of the weighted series strategy in dealing with the multi-label mRNA subcellular localization prediction task, we used XGBoost with the default parameters as the base classifier and compared our proposed method with three other well-known problem transformation methods (BR, CC and LP) by tenfold cross-validation on the training_validation set. Because the order of the input labels can directly affect the quality of the CC model, we ran 10 experiments for CC with 10 randomly generated label orders and used the best result for the comparison with BR, LP and WS. From the performance comparison results provided in Table 2, we found that the WS strategy displayed the best performance in terms of Acc_exam, average precision, coverage, ranking loss and Hamming loss. Although WS achieved a slightly lower performance than LP in terms of one-error, it showed a clear overall superiority in predicting mRNA subcellular localizations. To further improve the model performance, we optimized the key parameters of XGBoost, including learning_rate, n_estimators and max_depth, based on the tenfold cross-validation test. The specific hyperparameters are provided in Supplementary Table S5. Subsequently, we retrained the final model, named Clarion, on the whole training_validation set using the weighted series strategy with the optimized hyperparameters of XGBoost.

Table 2

Performance comparison of weighted series with binary relevance, classifier chains and label powerset

Strategy	Acc_exam	Average precision	Coverage	One-error	Ranking loss	Hamming loss
BR	0.600	0.651	6.121	0.706	0.345	0.194
CC	0.456	0.580	7.384	0.838	0.598	0.299
LP	0.558	0.626	6.271	0.601	0.443	0.229
WS (|$w=0.65$|)	0.627	0.670	6.029	0.629	0.344	0.182

Performance comparison with existing state-of-the-art tools

In this section, Clarion’s performance was benchmarked against several state-of-the-art approaches for predicting mRNA subcellular localizations. First, we compared Clarion with DM3Loc and Wang’s method, the only two multi-label predictors. Clarion shares only five predictable compartments with each of these two predictors: cytosol, exosome, membrane (cytoplasm for Wang’s method), nucleus and ribosome. Therefore, we compared their five-label prediction performance via an independent test. Clarion and DM3Loc achieved Acc_exam of 0.722/0.441, average precision of 0.769/0.618, coverage of 3.019/3.957, one-error of 0.463/0.869, ranking loss of 0.204/0.533 and Hamming loss of 0.150/0.330 on the independent dataset. Similarly, Clarion and Wang’s method achieved Acc_exam of 0.745/0.281, average precision of 0.767/0.679, coverage of 3.127/4.516, one-error of 0.445/0.882, ranking loss of 0.241/0.716 and Hamming loss of 0.146/0.375. These results indicate that Clarion outperformed DM3Loc and Wang’s method in multi-label prediction tasks.

We then compared Clarion’s single-label prediction performance with that of other state-of-the-art methods. Only methods with accessible web servers were included, namely iLoc-mRNA [19], mRNALoc [20], mRNALocater [21], DM3Loc [25] and Wang’s method [26]. The RNA sequences of the independent test set were uploaded to each web server, and the returned prediction labels were used for evaluation. As shown in Table 3, Clarion achieved the highest accuracy for cytoplasm, cytosol, exosome, membrane, nucleus and ribosome. Clarion also exceeded 80% accuracy for every compartment except cytosol (79.77%) and nucleus (79.23%). Notably, some mRNAs of the independent set may have appeared in the training sets of the other methods, which may account for Clarion’s slightly lower F1 scores than Wang’s method for cytoplasm and ribosome (more details can be found in Supplementary Table S6 of the supplementary file). These comparisons further demonstrate Clarion’s strong predictive power on single-label tasks.

Table 3

Performance comparison between Clarion and other state-of-the-art tools in terms of prediction accuracy on the independent test dataset

Localization	iLoc-mRNA	mRNALoc	mRNALocater	Wang’s	DM3Loc	Clarion
Chromatin	N.A.	N.A.	N.A.	N.A.	N.A.	81.47%
Cytoplasm	N.A.	54.88%	38.90%	87.10%	N.A.	91.29%
Cytosol	N.A.	N.A.	N.A.	67.81%	57.37%	79.77%
Exosome	N.A.	N.A.	N.A.	16.18%	70.00%	92.10%
Membrane	N.A.	N.A.	N.A.	N.A.	70.92%	89.15%
Nucleolus	N.A.	N.A.	N.A.	N.A.	N.A.	83.74%
Nucleoplasm	N.A.	N.A.	N.A.	N.A.	N.A.	80.74%
Nucleus	N.A.	55.18%	57.42%	60.13%	69.52%	79.23%
Ribosome	73.41%	N.A.	N.A.	81.42%	69.03%	84.74%

N.A.: not applicable
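The per-compartment accuracies in Table 3 can be illustrated with a short sketch. It assumes that accuracy for a compartment is the fraction of test sequences whose binary label for that compartment is predicted correctly; the function name `per_label_accuracy` and the toy inputs are hypothetical.

```python
from typing import Dict, Sequence


def per_label_accuracy(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Fraction of sequences with a correct 0/1 call, reported per label."""
    n = len(y_true)
    return {
        name: sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n
        for j, name in enumerate(labels)
    }


# Toy example with two compartments and four sequences (made-up labels).
toy_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_pred = [[1, 0], [1, 1], [1, 0], [0, 1]]
toy_acc = per_label_accuracy(toy_true, toy_pred, ["Nucleus", "Ribosome"])
```

Because each compartment is scored independently, a tool that does not predict a given compartment simply has no entry for it, which corresponds to the N.A. cells in the table.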

Effect of non-label and fusion-label modules in the weighted series structure

In this section, we further examined the effect of the non-label and fusion-label modules in the WS framework. Clarion trains 18 binary classifiers for the nine mRNA subcellular localizations: nine non-label models and nine fusion-label models. We applied these binary models to the RNA sequences in the independent test dataset and evaluated the performance for each localization. Figure 3C shows the accuracy of the non-label and fusion-label models. The fusion-label models performed better than their non-label counterparts for all nine localizations, highlighting the necessity and effectiveness of the fusion-label models in the WS framework. However, this improvement depends on the outputs of the non-label models, because the fusion-label models use the prediction results of the non-label models as input features for training. We therefore conclude that the non-label and fusion-label modules are complementary and that both are essential to the WS framework of Clarion.
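The dependency between the two modules can be sketched in code. This is not the Clarion implementation. It assumes that each fusion-label model receives the original feature vector concatenated with all non-label probabilities, and that the final score is a weighted sum in which |$w$| is applied to the fusion-label output. Both points, and the name `weighted_series_predict`, are assumptions for illustration. Models are passed as plain callables so that the sketch does not depend on XGBoost.

```python
from typing import Callable, List, Sequence

# A model maps a feature vector to a probability in [0, 1].
Model = Callable[[Sequence[float]], float]


def weighted_series_predict(
    x: Sequence[float],
    non_label_models: Sequence[Model],
    fusion_label_models: Sequence[Model],
    w: float = 0.65,  # weight value taken from the "WS" row of Table 2
) -> List[float]:
    """Two-stage prediction: non-label models first, fusion-label models second."""
    # Stage 1: one probability per localization from sequence features alone.
    p_non = [model(x) for model in non_label_models]
    # Stage 2: fusion-label models see the features plus the stage-1 outputs.
    fused_input = list(x) + p_non
    p_fusion = [model(fused_input) for model in fusion_label_models]
    # Assumed combination rule: convex mix of the two stages.
    return [w * pf + (1.0 - w) * pn for pf, pn in zip(p_fusion, p_non)]


# Toy stand-ins for trained classifiers (two localizations only).
toy_non = [lambda x: 0.2, lambda x: 0.8]
toy_fusion = [lambda z: z[-2], lambda z: 1.0]
toy_scores = weighted_series_predict([0.1, 0.3], toy_non, toy_fusion)
```

Under these assumptions the fusion-label stage cannot run without the non-label stage, which is why the two modules are described as complementary.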

Figure 4

Feature importance ranking based on the Shapley values. (A) Top 15 features for the nucleus, (B) top 15 features for the chromatin, (C) top 15 features for the ribosome and (D) top 15 features among all nine subcellular localizations.

Model interpretation

Shapley additive explanations (SHAP) is a method based on cooperative game theory for interpreting machine learning models [31]. It uses the Shapley value to rank and evaluate the importance of each feature and to explain individual predictions. SHAP has been successfully applied in a variety of bioinformatics tasks [43–46]. In this study, we used the Shapley value to identify the |$k$|-mer fragments of mRNAs that are important for subcellular localization prediction. For the nine binary models of the non-label module, the Shapley value of each |$k$|-mer feature was calculated and ranked using the SHAP Python package (https://shap.readthedocs.io/en/latest/index.html). Figure 4 and Supplementary Figures S1–S3 show the top 15 |$k$|-mer features of Clarion for predicting mRNA subcellular localizations. Some features were important for only one localization, such as ‘TTG’ and ‘GGGCGC’ in the nucleus, ‘GACGC’ and ‘GCGGCA’ in chromatin, and ‘GGATCT’ and ‘GGCCG’ in the ribosome. For example, ‘TTG’ was the most important feature for nucleus prediction in terms of the SHAP value but did not appear among the top 15 features for any of the other eight localizations. For ‘TTG’ in Figure 4A, each point represents an instance, and its colour runs from blue to red as the feature value goes from small to large. A high Shapley value for ‘TTG’ pushed the model towards a positive nucleus prediction, and vice versa. The effect of ‘TTG’ can therefore be summarized as follows: larger values promote the nucleus localization prediction. In contrast, larger values of ‘GGGCGC’ promote the non-nucleus prediction. These |$k$|-mer segments may be part of, or related to, protein recognition motifs for mRNA-specific localization. Interestingly, the repeat motif ‘UUCAC’ has been found to be crucial for localization via binding with Vg1PBP [47–49]; it corresponds to ‘TTCACC’, which was ranked thirteenth in exosome prediction.
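To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values for a small model by enumerating all feature subsets. It is a textbook illustration, not the tree-specific algorithm of the SHAP package used in this study. Absent features are replaced by a single baseline vector, which is an assumption of this sketch, and the name `shapley_values` is hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley value of every feature of x for the model f.

    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(present: frozenset) -> float:
        # Features outside the coalition are set to their baseline value.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z: Sequence[float]) -> float:
    # Hypothetical linear score over two k-mer frequencies.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5


toy_phi = shapley_values(toy_model, x=[1.0, 3.0], baseline=[0.0, 0.0])
```

For this linear toy model, the attribution of each feature equals its coefficient times its deviation from the baseline, and the attributions sum to the difference between the model output at `x` and at the baseline. A positive value pushes the prediction up and a negative value pushes it down.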

In addition, certain |$k$|-mer features were important for the prediction of several subcellular localizations. For instance, ‘GCGGC’ was ranked second in nucleus prediction and tenth in both chromatin and ribosome prediction, as shown in Figure 4A–C. This can also be observed in Figure 4D, which shows the top 15 features ranked by the average of the absolute SHAP values, together with the share of that importance attributable to each mRNA subcellular localization. In Figure 4D, ‘GCGGC’ is more important for the prediction of the nucleus, ribosome, chromatin and nucleoplasm than for the other localizations. Moreover, the feature ‘AAAAAA’ was important for almost all localizations: larger values drove positive predictions of cytoplasm but negative predictions of exosome, chromatin, nucleoplasm, nucleolus, cytosol and ribosome.
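Ranking features by the average of their absolute SHAP values can be sketched as follows. The input is assumed to be a matrix of SHAP values with one row per instance and one column per |$k$|-mer feature. The function name and the toy numbers are hypothetical and are not taken from Figure 4.

```python
from typing import List, Sequence, Tuple


def rank_by_mean_abs_shap(
    shap_matrix: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    top: int = 15,
) -> List[Tuple[str, float]]:
    """Return the `top` features sorted by mean absolute SHAP value."""
    n = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy SHAP values for two instances and three k-mers (made-up numbers).
toy_shap = [[0.2, -0.5, 0.0], [-0.6, 0.1, 0.05]]
toy_ranking = rank_by_mean_abs_shap(toy_shap, ["TTG", "GCGGC", "AAAAAA"], top=2)
```

Taking absolute values before averaging means that a feature with strong effects in both directions still ranks highly. The direction of the effect has to be read from the signed values, as in the per-localization panels.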

Webserver implementation

To help the wider research community make subcellular localization predictions, we developed a web server for Clarion that is freely available at http://monash.bioweb.cloud.edu.au/Clarion/. The web server is maintained by Nectar Research Cloud and runs on a Linux server equipped with a 4-core CPU, 8 GB of memory and a 30 GB hard disk. The web page was implemented using PHP and has been tested on several popular web browsers, including Google Chrome, Microsoft Edge, Internet Explorer, Mozilla Firefox and Safari. Users can paste query mRNA sequences into the text box or upload a sequence file in FASTA format via the file-selection dialogue box. Note that query sequences are truncated so that no mRNA exceeds 6000 nt. The prediction results are presented in a table containing detailed information on the sequences and their predicted subcellular localization types. For each localization type, the web server also provides a probability score in the range 0–1. The prediction results can be exported to widely used file formats, including CSV, Excel, PDF and plain text. More detailed instructions can be found on the help page of the web server.
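Users who prepare input locally may find it useful to apply the same length limit before submission. The sketch below parses FASTA text and truncates each sequence to 6000 nt. It assumes that truncation keeps the first 6000 nucleotides, which the description above does not specify. The function name `read_fasta` is hypothetical and the code is not part of the web server.

```python
from typing import Dict, List

MAX_LEN = 6000  # length limit (nt) stated for the web server


def read_fasta(text: str, max_len: int = MAX_LEN) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}, truncated to max_len."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first header token as identifier; fall back to a counter.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    # Assumption: keep the 5' end when a sequence exceeds the limit.
    return {key: "".join(parts)[:max_len] for key, parts in records.items()}


toy_fasta = ">a example\nACGU\nacg\n>b\n" + "A" * 10
toy_records = read_fasta(toy_fasta, max_len=5)
```

Sequences split across several lines are joined before truncation, so the limit applies to the full sequence rather than to individual lines.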

Conclusion

In this study, we introduced Clarion, a novel ensemble model for the simultaneous prediction of nine mRNA subcellular localizations: exosome, nucleus, nucleoplasm, chromatin, cytoplasm, nucleolus, cytosol, membrane and ribosome. Clarion encodes mRNA sequences using the |$k$|-mer nucleotide composition scheme and employs our proposed weighted series strategy as the ensemble framework. We selected XGBoost as the base classifier for the weighted series and optimized its weight parameter. Cross-validation and independent tests show that Clarion outperformed several existing methods, including the single-label predictors iLoc-mRNA, mRNALoc and mRNALocater, and the multi-label predictors DM3Loc and Wang’s method. The performance improvement of Clarion can be attributed to two key factors. First, we collected more data than the existing methods to construct the benchmark dataset, which facilitates high-quality model fitting. Second, the XGBoost-based weighted series method incorporates multi-label prior information into model training and leverages this information to improve predictive power. Moreover, we analyzed the most important |$k$|-mer features for each localization using the model interpretation algorithm SHAP and developed a user-friendly web server for Clarion. We expect Clarion to be a useful tool for multi-label mRNA subcellular localization prediction.

Nevertheless, model performance can be further improved, especially for compartments such as the cytosol and nucleus. Clarion currently considers only nine cell compartments, and it is desirable to increase the number of predictable compartments. In future work, we plan to develop more advanced methods for predicting mRNA subcellular localization. Beyond mRNAs, we will also develop methods for predicting the localization of other types of RNAs, such as microRNAs and long non-coding RNAs. In addition, the co-localization of different types of RNAs will be another direction of our future research.

Key Points

Characterization of mRNA subcellular localization can help elucidate gene regulatory networks and human disease mechanisms.
We proposed a novel ensemble method, termed Clarion, to predict nine subcellular localizations of mRNAs simultaneously.
Clarion achieved significantly better predictive performance in both single-label and multi-label predictions compared to state-of-the-art predictors.
The online webserver and local stand-alone tool of Clarion are publicly available at: http://monash.bioweb.cloud.edu.au/Clarion/.

Code and data availability

The source code and datasets of Clarion are publicly available at http://monash.bioweb.cloud.edu.au/Clarion/.

Funding

National Health and Medical Research Council of Australia (NHMRC) (APP1127948, APP1144652), Australian Research Council (ARC) (LP110200333, DP120104460), National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965), Major and Seed Inter-Disciplinary Research (IDR) projects awarded by Monash University.

Author Biographies

Yue Bi is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests include bioinformatics, computational biology, sequence analysis and machine learning.

Fuyi Li received his PhD in Bioinformatics from Monash University, Australia. He is currently a professor in the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics, computational biology, machine learning, and data mining.

Xudong Guo received his MEng degree from Ningxia University, China. He is currently a research assistant at the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics and data mining.

Zhikang Wang is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. His interests are bioinformatics, computational pathology, pattern recognition and deep learning.

Tong Pan is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests are bioinformatics, protein function analysis, deep learning and pattern recognition.

Yuming Guo is a professor of Global Environmental Health and Biostatistics & Head of the Monash Climate, Air Quality Research (CARE) Unit. His research focuses on environmental epidemiology, biostatistics, global environmental change, air pollution, climate change, urban design, residential environment, remote sensing modelling and infectious disease modelling.

Geoffrey I. Webb is a professor in the Faculty of Information Technology and a research director of the Monash Data Futures Institute at Monash University. His research interests include machine learning, data mining, computational biology and user modelling.

Jianhua Yao is a group leader in Tencent AI Lab, China. He received his PhD degree from Johns Hopkins University. His research interests include bioinformatics, computational biology, medical imaging and pattern recognition.

Cangzhi Jia is a professor in the College of Science, Dalian Maritime University, China. She obtained her PhD degree from the School of Mathematical Sciences, Dalian University of Technology, in 2007. Her major research interests include mathematical modelling in bioinformatics and machine learning.

Jiangning Song is an associate professor and a group leader in the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biomedicine, machine learning, data mining and pattern recognition.

References

1. Jeffery WR, Tomlinson CR, Brodeur RD. Localization of actin messenger RNA during early ascidian development. Dev Biol 1983;99:408–17.

2. Lawrence JB, Singer RH. Intracellular localization of messenger RNAs for cytoskeletal proteins. Cell 1986;45:407–15.

3. Meyer C, Garzia A, Tuschl T. Simultaneous detection of the subcellular localization of RNAs and proteins in cultured cells by combined multicolor RNA-FISH and IF. Methods 2017;118-119:101–10.

4. Chin A, Lécuyer E. RNA localization: Making its way to the center stage. Biochimica et Biophysica Acta (BBA)-General Subjects 2017;1861:2956–70.

5. Kloc M, Zearfoss NR, Etkin LD. Mechanisms of subcellular mRNA localization. Cell 2002;108:533–44.

6. Li X, Franceschi VR, Okita TW. Segregation of storage protein mRNAs on the rough endoplasmic reticulum membranes of rice endosperm cells. Cell 1993;72:869–79.

7. Katz ZB, Wells AL, Park HY, et al. beta-Actin mRNA compartmentalization enhances focal adhesion stability and directs cell migration. Genes Dev 2012;26:1885–90.

8. Kejiou NS, Palazzo AF. mRNA localization as a rheostat to regulate subcellular gene expression. Wiley Interdiscip Rev RNA 2017;8:e1416.

9. Liu D, Li G, Zuo Y. Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief Bioinform 2019;20:1826–35.

10. Cooper TA, Wan L, Dreyfuss G. RNA and disease. Cell 2009;136:777–93.

11. Liu-Yesucevitz L, Bassell GJ, Gitler AD, et al. Local RNA translation at the synapse and in disease. J Neurosci 2011;31:16086–93.

12. Sprenkle NT, Sims SG, Sanchez CL, et al. Endoplasmic reticulum stress and inflammation in the central nervous system. Mol Neurodegener 2017;12:42.

13. Dolezal JM, Dash AP, Prochownik EV. Diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. BMC Cancer 2018;18:275.

14. Engel KL, Arora A, Goering R, et al. Mechanisms and consequences of subcellular RNA localization across diverse cell types. Traffic 2020;21:404–18.

15. Zhang T, Tan P, Wang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res 2017;45:D135–8.

16. Mas-Ponte D, Carlevaro-Fita J, Palumbo E, et al. LncATLAS database for subcellular localization of long noncoding RNAs. RNA 2017;23:1080–7.

17. Wen X, Gao L, Guo X, et al. lncSLdb: a resource for long non-coding RNA subcellular localization. Database (Oxford) 2018;2018:1–6.

18. Yan Z, Lecuyer E, Blanchette M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics 2019;35:i333–42.

19. Zhang ZY, Yang YH, Ding H, et al. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform 2021;22:526–35.

20. Garg A, Singhal N, Kumar R, et al. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Res 2020;48:W239–43.

21. Tang Q, Nie F, Kang J, et al. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021;29:2617–23.

22. Li J, Zhang L, He S, et al. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021;22:bbaa401.

23. Lewis RA, Gagnon JA, Mowry KL. PTB/hnRNP I is required for RNP remodeling during RNA localization in Xenopus oocytes. Mol Cell Biol 2008;28:678–86.

24. Buskila AA, Kannaiah S, Amster-Choder O. RNA localization in bacteria. RNA Biol 2014;11:1051–60.

25. Wang D, Zhang Z, Jiang Y, et al. DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res 2021;49:e46.

26. Wang H, Ding Y, Tang J, et al. Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule. BMC Genomics 2021;22:56.

27. Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 2014;26:1819–37.

28. Boutell MR, Luo J, Shen X, et al. Learning multi-label scene classification. Pattern Recognition 2004;37:1757–71.

29. Tsoumakas G, Vlahavas I. Random k-labelsets: An ensemble method for multilabel classification. In: European conference on machine learning. 2007, p. 406–17. Springer.

30. Read J, Pfahringer B, Holmes G, et al. Classifier chains for multi-label classification. Machine Learning 2011;85:333–59.

31. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2:56–67.

32. Cui T, Dou Y, Tan P, et al. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res 2022;50:D333–9.

33. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.

34. Chen Z, Liu X, Zhao P, et al. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022;50(W1):W434–47.

35. Jiang P, Luo J, Wang Y, et al. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2019;35:4871–8.

36. Manavalan B, Basith S, Shin TH, et al. Computational prediction of species-specific yeast DNA replication origin via iterative feature representation. Brief Bioinform 2021;22(4):bbaa304.

37. Yan K, Lv H, Guo Y, et al. TPpred-ATMV: therapeutic peptide prediction by adaptive multi-view tensor learning model. Bioinformatics 2022;38:2712–8.

38. Wei L, Su R, Luan S, et al. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019;35:4930–7.

39. Ghamrawi N, McCallum A. Collective multi-label classification. In: Proceedings of the 14th ACM international conference on Information and knowledge management. 2005, p. 195–200.

40. Gopal S, Yang Y. Multilabel classification with meta-level features. In: Proceedings of the 33rd International ACM SIGIR conference on Research and development in information retrieval. 2010, p. 315–22.

41. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 2011;12:2825–30.

42. Wang X, Li F, Xu J, et al. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinform 2022;23(2):bbac031.

43. Bi Y, Xiang D, Ge Z, et al. An interpretable prediction model for identifying N(7)-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids 2020;22:362–72.

44. Li F, Chen J, Ge Z, et al. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2021;22:2126–40.

45. Li F, Guo X, Jin P, et al. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites. Brief Bioinform 2021;22(6):bbab245.