Abstract

N6-methyladenosine (m6A) modification is the most abundant co-transcriptional modification in eukaryotic RNA and plays important roles in cellular regulation. Traditional high-throughput sequencing experiments used to explore functional mechanisms are time-consuming and labor-intensive, and most existing computational methods focus on a limited number of species. To better understand the biological mechanisms shared by different species carrying the same RNA modification, a computational scheme that can be applied across species is needed. To this end, we propose an attention-based deep learning method, adaptive-m6A, which consists of a convolutional neural network, bi-directional long short-term memory and an attention mechanism, to identify m6A sites in multiple species. Three conventional machine learning (ML) methods, support vector machine, random forest and logistic regression classifiers, were also considered in this work. Beyond the performance of the ML methods for multi-species prediction, the optimal performance of adaptive-m6A reached an accuracy of 0.9832 and an area under the receiver operating characteristic curve of 0.98. Moreover, motif analysis and cross-species validation were conducted to test the robustness of a model trained on one species when applied to others, which helps improve our understanding of the sequence characteristics and biological functions of RNA modifications in different species.

Introduction

N6-methyladenosine (m6A) is the most prevalent, abundant and conserved internal co-transcriptional modification in eukaryotic RNAs and occurs in around 30% of transcripts, especially within eukaryotic cells [1]. Known since the 1970s, m6A has served as a layer of epigenetic modification, with demethylases additionally discovered in recent years [2]. m6A plays a broad and crucial role in almost all aspects of physiological behavior, including physiological regulation, and its disruption may impair gene expression and cellular function [3]. Moreover, m6A is involved in many diseases such as cancer, psychiatric disorders and metabolic disease [4]. Owing to advances in genomics and molecular biology, currently available high-throughput experimental techniques have generated data on tens of thousands of transcripts, which provides sufficient material for further understanding different RNA modifications. Several well-established machine learning (ML) techniques have been developed for m6A modification site prediction using this abundance of experimentally verified data [5]. Over 20 approaches have been developed to predict m6A sites [6]. These methods differ in many aspects, including the benchmark dataset, the sequence and/or structural descriptors used, the physicochemical properties and genomic features employed, the feature selection techniques and the targeted RNA modification types.

Table 1

Summary of representative m6A predictors in published work

| Tool | Method | Publication year | Feature selection method | Feature encoding scheme | Data scales | URL accessibility | Window size | Species | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S. cerevisiae | [7] |
| M6ATH | SVM | 2016 | None | NCP, ANF | 394 | Inaccessible | 25 | A. thaliana | [8] |
| RNA-MethylPred | SVM | 2016 | None | BPB, DNC, KNN score | 1307 | MATLAB package | 51 | S. cerevisiae | [9] |
| TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307 (Met2614) and 832 (Train1664) | Accessible | 51 (Met2614) and 21 (Train1664) | S. cerevisiae | [10] |
| pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S. cerevisiae | [11] |
| RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29 457 (H. sapiens) and 31 728 (Mus musculus) | Inaccessible | 101 | H. sapiens and Mus musculus | [12] |
| M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S. cerevisiae | [13] |
| SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55 706 (full transcript mode) and 46 992 (mature mRNA mode) | Accessible | 251 | H. sapiens and Mus musculus | [14] |
| RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S. cerevisiae | [15] |
| iRNA-PseColl | SVM | 2017 | None | NCP, ANF | 1130 | Inaccessible | 41 | H. sapiens | [16] |
| RAM-NPPS | SVM | 2017 | RFE, FSDI, MRMD | NPPS | 8366 (H. sapiens), 1307 (S. cerevisiae), 394 (A. thaliana) | Inaccessible | 51 | H. sapiens, S. cerevisiae and A. thaliana | [17] |
| M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S. cerevisiae | [18] |
| iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S. cerevisiae | [19] |
| BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53 000 (mammalian full transcript), 44 853 (mammalian mature mRNA), 1100 (S. cerevisiae) and 2100 (A. thaliana) | Inaccessible | 251 (mammalian), 51 (S. cerevisiae), 101 (A. thaliana) | H. sapiens, Mus musculus, S. cerevisiae and A. thaliana | [20] |
| M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S. cerevisiae), 1130 (H. sapiens), 725 (Mus musculus) and 1000 (A. thaliana) | | 51 (S. cerevisiae), 41 (H. sapiens), 41 (Mus musculus) and 25 (A. thaliana) | H. sapiens, Mus musculus, S. cerevisiae and A. thaliana | [21] |
| RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A. thaliana | [22] |
| DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S. cerevisiae | [23] |
| Gene2Vec | CNN | 2019 | None | one-hot, neighboring methylation state, RNA word embedding, Gene2Vec | 56 557 | Inaccessible | 1001 | H. sapiens and Mus musculus | [24] |
| WHISTLE | SVM | 2022 | Perturb method | NCP, genome-derived features | 20 516, 17 383 | Accessible | | H. sapiens | [25] |
| DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44 901, 11 656 and 5233 | Accessible | 1001 | H. sapiens and Mus musculus | [6] |

Abbreviations in Feature selection method: IFS, incremental feature selection; RFE, recursive feature elimination; FSDI, feature selection based on discernibility and independence of a feature; MRMD, maximal relevance and maximal distance; SFS, sequence forward search. Abbreviations in Feature encoding scheme: PseDNC, pseudo dinucleotide composition; NCP, nucleotide chemical property; ANF, accumulated nucleotide frequency; BPB, bi-profile Bayes; DNC, dinucleotide composition; KNN, k-nearest neighbour; PSNP, position-specific nucleotide propensity; PSDP, position-specific dinucleotide propensity; PCPM, physical-chemical property matrix; MFE, minimum free energy; AC, auto-covariance; CC, cross-covariance; NPPS, nucleotide pair position specificity; ENAC, enhanced nucleic acid composition; PSNSP, position-specific nucleotide sequence profile; PSDSP, position-specific dinucleotide sequence profile; KSNPF, k-spaced nucleotide pair frequencies.


Figure 1

Overall framework of the predictors used in this research. The process contains four main steps: data collection, feature investigation, model training and cross-validation, and independent testing. In the data collection step, datasets from six different species were collected from published research and databases for motif substrate discovery and feature investigation. In the feature investigation phase, primary sequence-derived features (NAC, DNC, TNC, ENAC and CKSNAP) and one nucleotide physicochemical property feature (NCP) were adopted. In the model training and cross-validation phase, three conventional ML methods (SVM, LR and RF) and a deep learning method using an adaptive learning network (adaptive-m6A) were applied to the datasets, with 5-fold cross-validation used for model assessment. Finally, independent testing was employed to assess the performance of the proposed method.

Recently, deep learning methods have been widely applied in different fields, including RNA modification prediction [5, 8–10, 12, 17, 21, 22, 26, 27]. Table 1 lists some representative m6A prediction methods and illustrates that the available tools still have limitations. First, most of them focus on only a few species, with no or limited information on others. Second, conventional ML is still the first choice for m6A site prediction, and when deep learning is used it is usually limited to a convolutional neural network (CNN) framework. In addition, constructing an experimentally verified m6A database for these methods is time-consuming and labor-intensive. Moreover, little work has examined the motif patterns around modified sites across different samples, which could reveal potential relationships among different species carrying the same type of RNA post-transcriptional modification [28]. To overcome these obstacles and investigate m6A modification across various species, we developed prediction models for m6A modification in multiple species with different frameworks. We propose an attention-based deep learning method called adaptive-m6A, which consists of a CNN, bi-directional long short-term memory (BLSTM) and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional ML methods, support vector machine, random forest and logistic regression classifiers, were investigated in detail. A cross-species validation was also conducted to evaluate whether a model trained on one species works well on data from other species [29, 30], and a comparison with other representative methods is reported. To the best of our knowledge, this is the first work to use data from several species, which could help reveal whether the same type of RNA modification shares similar sequence contexts among different species [28]. The overall flowchart of this research is shown in Figure 1.

Method and materials

Data collection and preparation

The datasets used for training and independent testing were collected from published work and the RMBase database [31]. Data with m6A modified sites were obtained for six species: Drosophila melanogaster (BDGP6), Danio rerio (zebrafish, danRer10), Escherichia coli str. K-12 substr. MG1655 (E. coli), Mus musculus (mm10), Saccharomyces cerevisiae S288C (sacCer3) and Arabidopsis thaliana (TAIR10). Dataset collection involved the following steps. First, the sequences containing m6A modification sites and their positions were downloaded from the RMBase database; because RMBase stores sequences in DNA form, they were converted to RNA by replacing T with U [32]. The datasets for M. musculus and A. thaliana were downloaded from previous studies [14, 22]. To obtain a high-quality dataset, CD-HIT with a 70% identity cut-off [33] was used to remove redundancy and homology bias [32, 34]. We then divided the datasets into training and testing sets at a ratio of 7:3. To avoid the effect of any imbalance between the numbers of positive and negative samples, a random under-sampler was employed to equalize the positive and negative sets [35]. The final number of samples for each species is listed in Table 2, and the data involved in this research can be accessed at https://github.com/Moretta1/Adaptive-m6A.
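For readers who want to reproduce this preparation pipeline, the following minimal Python sketch illustrates the 7:3 split and the balancing of the training set described above. The file name, column labels and the use of scikit-learn/imbalanced-learn are illustrative assumptions rather than the authors' exact scripts; CD-HIT filtering at 70% identity is assumed to have been run beforehand.

```python
# A minimal sketch of the data preparation described above, assuming the sequences
# were already de-duplicated with CD-HIT (70% identity). The file name and column
# labels are hypothetical; scikit-learn and imbalanced-learn stand in for the
# authors' actual scripts.
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv("m6a_sites_danRer10.csv")   # columns: "sequence", "label" (1 = m6A site)
X, y = df[["sequence"]], df["label"]

# 7:3 split into training and independent testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# balance the training set so positive and negative sample numbers are equal
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(len(X_train_bal), len(X_test))
```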

Table 2

Data distribution of each species in this research. The data come from two main sources: the published database RMBase [31] and previously published work [14, 22]

| Group | Species | Assembly | Training set (positive) | Training set (negative) | Testing set (positive) | Testing set (negative) |
| --- | --- | --- | --- | --- | --- | --- |
| Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
| Vertebrate | Danio rerio (zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
| Bacteria | Escherichia coli (E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
| Mammal | Mus musculus (house mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
| Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
| Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |

Feature encoding

In this research, primary sequence-derived features and nucleotide physicochemical properties [6] were selected to obtain sequence feature vectors, as these can be derived directly from the sequences and require no extra third-party tools. The encodings were implemented mainly with the toolkits iLearn [36] and iLearnplus [37].

Primary sequence-derived features

Nucleotide acid components Nucleotide acid components (NAC) indicate the frequency at which each nucleotide occurs in a sequence and are the most basic primary sequence-derived RNA feature. As there are four types of nucleotides (A, C, G, U/T) in an RNA sequence, the NAC feature has a dimension of 4. For a sequence $x$ of fixed length $n$ ($n = 21$ in this study), the probability $P_x(k)$ of nucleotide $k$ is
$$P_x(k) = \frac{n_x(k)}{n}, \quad k \in \{A, C, G, U\} \qquad (1)$$
where $n_x(k)$ is the number of occurrences of nucleotide $k$ in $x$.
Di-nucleotide components Di-nucleotide components (DNC) indicate the frequency at which each nucleotide pair occurs in the sequence. There are four types of nucleotides in an RNA sequence; therefore, $4 \times 4 = 16$ nucleotide pairs are possible, making the dimension of the DNC feature 16. The probability $P_x(k)$ of nucleotide pair $k$ in a sequence $x$ is
$$P_x(k) = \frac{n_x(k)}{n - 1}, \quad k \in \{AA, AC, \ldots, UU\} \qquad (2)$$
where $n_x(k)$ is the number of occurrences of nucleotide pair $k$. Figure S1 in the supplementary file compares the DNC features of positive and negative samples across all datasets.
Tri-nucleotide composition Similar to NAC and DNC, tri-nucleotide composition (TNC) gives 64 descriptors. It is defined as
$$f(r, s, t) = \frac{N_{rst}}{n - 2}, \quad r, s, t \in \{A, C, G, U\} \qquad (3)$$
where $N_{rst}$ is the number of tri-nucleotides formed by nucleotide types $r$, $s$ and $t$.

Composition of k-spaced nucleotide acid pair The composition of k-spaced nucleotide acid pairs (CKSNAP) is a descriptor analogous to one widely used for analyzing protein function. For a fixed integer k, a k-spaced nucleotide pair is a pair of nucleotides separated by a gap ranging from 0 to k, and CKSNAP records the frequency of each nucleotide pair at each gap. In this study, k was set to 5, so the gap ranges from 0 to 5. Because CKSNAP with k = 0 gives the same result as DNC, only gaps from 1 to 5 were considered, yielding 48 feature dimensions in total for a single sequence. Figure S2 in the supplementary file compares the CKSNAP features of positive and negative samples across all datasets.

Enhanced nucleotide acid composition Enhanced nucleotide acid composition (ENAC) calculates the NAC within a window of fixed length (default 5) that slides continuously from the 5' to the 3' terminus of each nucleotide sequence, and can be applied to encode nucleotide sequences of equal length. ENAC is calculated as
$$ENAC(t, w_i) = \frac{N_t(w_i)}{N(w_i)}, \quad t \in \{A, C, G, U/T\}, \; i = 1, 2, \ldots, 21 - 5 \qquad (4)$$
where $N_t(w_i)$ is the count of nucleotide $t$ in the $i$-th window $w_i$ and $N(w_i)$ is the window length. In this study, each 21-nt sequence yields $21 - 5 = 16$ windows, so ENAC gives a $4 \times (21 - 5) = 64$-dimensional feature for each sequence.
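To make these descriptors concrete, the short sketch below computes NAC, DNC and ENAC for one 21-nt window in plain Python; it illustrates the formulas above rather than the exact iLearn implementation, and the example sequence is made up.

```python
# Illustrative (not the authors' exact iLearn calls): computing NAC, DNC and ENAC
# descriptors for a 21-nt RNA window centred on the candidate adenosine.
from itertools import product

BASES = "ACGU"

def nac(seq):
    """Nucleotide composition: frequency of each base (4-dim)."""
    return [seq.count(b) / len(seq) for b in BASES]

def dnc(seq):
    """Di-nucleotide composition: frequency of each base pair (16-dim)."""
    pairs = ["".join(p) for p in product(BASES, repeat=2)]
    total = len(seq) - 1
    return [sum(seq[i:i + 2] == p for i in range(total)) / total for p in pairs]

def enac(seq, window=5):
    """Enhanced NAC: NAC within each sliding window of fixed length."""
    feats = []
    for start in range(len(seq) - window):
        feats.extend(nac(seq[start:start + window]))
    return feats

seq = "GGACUGGACUAAGGACUGGAC"          # hypothetical 21-nt example
print(len(nac(seq)), len(dnc(seq)), len(enac(seq)))   # 4 16 64
```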
Figure 2

The structure of the deep learning method contains two parts: the embedding module and the task-specific module. In the embedding module, RNA sequences of fixed length are converted into a numeric matrix and used as input for the task-specific module. The task-specific module contains a CNN layer for extracting features, a max-pooling layer for speeding up computation, BLSTM and attention layers, and a fully connected layer for the output.

Nucleotide physicochemical properties

Nucleotide chemical properties Each of the four nucleotides has a different chemical structure and binding properties (Table 3), and this allows for the nucleotide chemical property (NCP) encoding scheme. For instance, "A" is encoded as (1, 1, 1) and "G" as (1, 0, 0). NCP generates a feature of dimension 3 × 21 = 63.

Table 3

Chemical structure of each nucleotide

| Chemical property | Class | Label | Nucleotides |
| --- | --- | --- | --- |
| Ring structure | Purine | 1 | A, G |
| Ring structure | Pyrimidine | 0 | C, U |
| Functional group | Amino | 1 | A, C |
| Functional group | Keto | 0 | G, U |
| Hydrogen bond | Strong | 0 | C, G |
| Hydrogen bond | Weak | 1 | A, U |
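As an illustration of the NCP scheme in Table 3, the snippet below encodes a 21-nt sequence into the 63-dimensional vector described above; the mapping follows the table, while the helper function itself is a sketch rather than the authors' code.

```python
# A small sketch of the NCP encoding in Table 3: each nucleotide maps to a
# 3-bit vector (ring structure, functional group, hydrogen bond), giving a
# 3 x 21 = 63-dimensional feature for a 21-nt sequence.
NCP = {
    "A": (1, 1, 1),   # purine, amino, weak hydrogen bond
    "C": (0, 1, 0),   # pyrimidine, amino, strong hydrogen bond
    "G": (1, 0, 0),   # purine, keto, strong hydrogen bond
    "U": (0, 0, 1),   # pyrimidine, keto, weak hydrogen bond
}

def ncp_encode(seq):
    vec = []
    for base in seq:
        vec.extend(NCP[base])
    return vec

print(len(ncp_encode("GGACUGGACUAAGGACUGGAC")))   # 63
```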

RNA sequence embedding

Sequence embedding was used in the embedding layer of the deep learning framework. The embedding module maps each of the four nucleotide tokens in a sequence to a numeric vector. The embedding is described as
$$embed = embed_{token} \oplus embed_{position} \qquad (5)$$
where $\oplus$ denotes the fusion of the two vectors, $embed_{position}$ encodes the position of the nucleotide in the whole sequence and $embed_{token}$ is a randomly initialized vector identified using a lookup table [24]. Each fused vector is adjusted for the task through back-propagation when the model is trained [29].
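A minimal PyTorch sketch of such an embedding module is given below; the embedding dimension and the additive fusion of token and position vectors are assumptions for illustration, not the published adaptive-m6A configuration.

```python
# A minimal sketch of the adaptive embedding module: nucleotide tokens and their
# positions are looked up in learnable tables and fused into one vector.
# Dimensions and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SequenceEmbedding(nn.Module):
    def __init__(self, vocab_size=4, seq_len=21, embed_dim=32):
        super().__init__()
        self.token = nn.Embedding(vocab_size, embed_dim)      # one entry per nucleotide
        self.position = nn.Embedding(seq_len, embed_dim)      # one entry per position

    def forward(self, tokens):                                # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.token(tokens) + self.position(positions)  # fused (batch, seq_len, dim)

ids = torch.tensor([[0, 1, 2, 3] * 5 + [0]])                  # toy 21-nt index sequence
print(SequenceEmbedding()(ids).shape)                         # torch.Size([1, 21, 32])
```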

Model construction

Machine learning method

From previous surveys of RNA modification site prediction, the ML algorithms used in this area include SVM [38, 39], RF [40], bagging [41], naive Bayesian networks [42], light gradient boosting machine (LGBM) [43, 44], logistic regression (LR) [45, 46] and extreme gradient boosting (XGBoost) [47]. In this study, we selected three commonly used ML algorithms, RF, LR and SVM, implemented with the Scikit-Learn toolkit [48], to screen for the optimal algorithm. The three algorithms were first screened with default parameters to obtain the top two classifiers, which were then tuned to confirm the best classifier for the optimal model.
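The screening step can be sketched as follows with scikit-learn; the placeholder feature matrix and the default hyper-parameters are illustrative assumptions rather than the exact experimental setup.

```python
# A hedged sketch of screening the three conventional classifiers with default
# parameters under 5-fold cross-validation; X and y are placeholders standing in
# for the encoded features and labels described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = np.random.rand(200, 63)          # placeholder feature matrix (e.g. NCP features)
y = np.random.randint(0, 2, 200)     # placeholder binary labels

classifiers = {
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean 5-fold AUC = {auc:.3f}")
```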

Deep learning method

The deep learning framework contains two main modules. First, an adaptive embedding module maps the nucleotide composition onto numerical vectors according to position information and token values from a lookup table [24, 29]. Second, a task-specific module consists of CNN, BLSTM, attention and fully connected layers, in which the CNN layer extracts high-level features from the sequence. BLSTM is a specific recurrent neural network (RNN) that consists of a set of recurrently connected blocks and enhances the ability of LSTM to learn long-distance dependencies. Compared with unidirectional LSTM, BLSTM better captures sequence context [49]. In addition to CNN and BLSTM, the attention mechanism can also capture positional information. The attention mechanism was originally proposed for machine translation [50] and has been shown to distinguish between more and less important information [51]. In natural language processing and image recognition, a growing number of attention-based studies have demonstrated the interpretability of such models [52, 53]. The attention mechanism has also achieved competitive performance in a wide range of biological sequence analysis problems [54, 55], and it was therefore used in this study to identify the key information that affects m6A site prediction. Figure 2 shows the whole workflow of the adaptive learning model.
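The sketch below outlines a task-specific module of this kind in PyTorch (CNN, max pooling, BLSTM, attention, fully connected output); all layer sizes are illustrative assumptions rather than the published adaptive-m6A hyper-parameters.

```python
# A simplified sketch of the task-specific module (CNN -> max-pooling -> BLSTM
# -> attention -> fully connected); layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveM6A(nn.Module):
    def __init__(self, embed_dim=32, conv_channels=64, lstm_hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
        )
        self.blstm = nn.LSTM(conv_channels, lstm_hidden,
                             batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)        # attention scores per position
        self.fc = nn.Linear(2 * lstm_hidden, 1)          # binary output (m6A vs. not)

    def forward(self, x):                                # x: (batch, seq_len, embed_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2) # (batch, seq_len//2, channels)
        h, _ = self.blstm(h)                             # (batch, seq_len//2, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)     # attention over positions
        context = (weights * h).sum(dim=1)               # weighted sequence summary
        return torch.sigmoid(self.fc(context)).squeeze(-1)

model = AdaptiveM6A()
print(model(torch.randn(8, 21, 32)).shape)               # torch.Size([8])
```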

Evaluation metrics

Five-fold cross-validation and independent testing were used to evaluate the performance of the proposed model [56]. In addition, five criteria, specificity (Sp), sensitivity (Sn), accuracy (Acc), precision (Prec) and the Matthews correlation coefficient (MCC), were used to evaluate the performance and are defined as [57]:
$$Sp = \frac{TN}{TN + FP} \qquad (6)$$
$$Sn = \frac{TP}{TP + FN} \qquad (7)$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$
$$Prec = \frac{TP}{TP + FP} \qquad (9)$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (10)$$
where $TP$ denotes positive samples correctly predicted as true m6A modification sites, $TN$ denotes negative samples correctly predicted not to contain m6A modification sites, $FN$ is the number of true m6A modification sequences falsely predicted as non-m6A sequences and $FP$ is the number of non-m6A sequences falsely predicted as true m6A modification sequences.

Moreover, in addition to these evaluation indicators, the receiver operating characteristic (ROC) curve, generated by plotting 1-specificity on the x-axis and sensitivity on the y-axis, was used as another indicator for evaluating the performance of the proposed method [58]. The closer the area under the ROC curve is to 1, the better the prediction performance of the model and the higher the accuracy.
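These criteria can be computed directly from the confusion matrix and prediction scores; the short sketch below uses scikit-learn with made-up labels and scores purely for illustration.

```python
# A short sketch of computing the evaluation criteria above from predicted labels
# and scores with scikit-learn; y_true, y_score and the 0.5 threshold are made up.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, precision_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sn  =", tp / (tp + fn))
print("Sp  =", tn / (tn + fp))
print("Acc =", accuracy_score(y_true, y_pred))
print("Prec=", precision_score(y_true, y_pred))
print("MCC =", matthews_corrcoef(y_true, y_pred))
print("AUC =", roc_auc_score(y_true, y_score))
```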

Results

Performance comparison of machine learning classifiers

For the ML classifiers in this study, species-specific evaluation indicators for the three classifiers were summarized and visualized in a bar chart (Figure 3). Table S4 in the supplementary file provides the ML performances in table form, and the ROC and PRC curves of these models on independent testing are shown in Figures S3–S8. The results indicate that, for each species, RF tended to achieve higher performance than the other two classifiers, while SVM was efficient for most of the species tested. In some cases, such as mm10, all three classifiers performed at a similar level, with accuracy around 0.6–0.7 and AUC between 0.55 and 0.75. Because the performance of the ML methods on the sacCer3 and TAIR10 datasets was quite close and not as promising as on the other small-scale datasets with fewer than 10 thousand samples, these datasets were chosen for subsequent optimization and comparison with feature selection.

Figure 3

Performance of the ML classifiers (A) SVM, (B) RF and (C) LR on the individual datasets. Detailed values for each classifier are listed in table form in Table S1 in the supplementary files.

Figure 4

Performance of optimal subset selection with the CHI2 feature selection method and the optimal performance for the species (A) mm10, (B) TAIR10 and (C) sacCer3, respectively. The top 100 feature dimensions, i.e. those with the highest chi2 values, are shown, together with the contribution of each feature type among the total selected dimensions. The bar chart shows the criteria for the original hybrid features (before feature selection) and the selected features.

Comparing the three conventional ML classifiers, all achieved higher performance on the BDGP6 dataset, with the SVM classifier performing best, whereas the mm10 dataset did not show promising performance. This may be because the differences between positive and negative samples are much more obvious in the BDGP6 dataset. Overall, the RF classifier yielded a higher level of performance than the other two classifiers, although it showed a more biased performance on the danRer10 dataset than on the other datasets.

Performance with hybrid features on feature selection methods

To further determine the optimal classifier and select the optimal feature subset, we adopted a two-step strategy. First, the chi-squared (CHI2) feature selection technique was applied to calculate feature importance values and sort each dimension of the hybrid feature set accordingly (the chi2 values are shown in Figure 4). Second, from the sorted hybrid feature list, we took the top 100 feature dimensions (listed in Figure 4 for the three datasets, ranked from the highest to the lowest chi2 value) and then selected the dimensionality of the optimal feature subset with the RF classifier through the IFS strategy. To determine the optimal classifier, the AUC values of the RF and SVM classifiers were compared based on the feature selection results, shown in Figure 4 for TAIR10 (A. thaliana), mm10 (M. musculus) and sacCer3 (S. cerevisiae), respectively. Based on the feature selection on these three datasets, we compared the performance of the full hybrid features with that of the selected top 100 feature dimensions of highest importance; the performances are shown in the histogram of Figure 4.
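A simplified sketch of this two-step strategy (chi-squared ranking followed by an incremental scan of top-k subsets with an RF classifier) is shown below; the data are placeholders and the scan step size is an assumption.

```python
# An illustrative sketch: rank hybrid feature dimensions by chi-squared score,
# then evaluate growing top-k subsets with an RF classifier (a simple form of
# incremental feature selection). X and y are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 195)                  # hybrid feature matrix (non-negative)
y = np.random.randint(0, 2, 500)

scores, _ = chi2(X, y)                        # chi2 value of every feature dimension
ranked = np.argsort(scores)[::-1]             # dimensions sorted by importance

best_k, best_auc = 0, 0.0
for k in range(10, 101, 10):                  # scan top-10 ... top-100 subsets
    auc = cross_val_score(RandomForestClassifier(), X[:, ranked[:k]], y,
                          cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_k, best_auc = k, auc
print(f"optimal subset: top {best_k} dimensions, AUC = {best_auc:.3f}")
```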

Performance on deep learning method

In addition to the ML methods, we applied the adaptive learning network to the species datasets with more than 10 000 samples, which are suitable for a deep learning framework (Table 4). A box plot comparing the methods indicated that the deep learning method achieved better performance on average than the conventional ML classifiers and was particularly successful on the larger datasets, such as mm10 and danRer10 (Figure S9). For the relatively smaller datasets, such as sacCer3 and TAIR10, adaptive learning yielded a less pronounced advantage over the ML methods, which may be caused by the limited amount of data.

Table 4

Performance of the proposed adaptive learning method and comparison with the ML methods. The proposed adaptive learning method is based on the RNA embedding feature. For each ML classifier, the optimal performance over all conventional features is listed for comparison; the detailed performances of all conventional features for each individual ML method are listed in Table S1 of the supplementary file. The optimal performance is highlighted in bold.

| Species | Methods | Sn | Sp | ACC | MCC | AUC | Prec | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
| BDGP6 | RF | 0.96 | 0.92 | 0.92 | 0.70 | 0.96 | 0.59 | 0.68 |
| BDGP6 | LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 |
| BDGP6 | Ada | 0.99 | 1.00 | 1.00 | 0.97 | 1.00 | 0.95 | 0.97 |
| danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
| danRer10 | RF | 0.93 | 0.88 | 0.90 | 0.81 | 0.95 | 0.89 | 0.91 |
| danRer10 | LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 |
| danRer10 | Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 |
| E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
| E. coli | RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.70 | 0.78 |
| E. coli | LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 |
| E. coli | Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 |
| sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.80 | 0.76 |
| sacCer3 | RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.80 | 0.73 | 0.75 |
| sacCer3 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| sacCer3 | Ada | 0.77 | 0.73 | 0.75 | 0.50 | 0.81 | 0.74 | 0.75 |
| TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.50 | 0.82 | 0.75 | 0.75 |
| TAIR10 | RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 |
| TAIR10 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| TAIR10 | Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 |
| mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
| mm10 | RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 |
| mm10 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| mm10 | Ada | 0.97 | 0.13 | 0.20 | 0.09 | 0.79 | 0.49 | 0.14 |

Motif analysis of modified m6A sequences

Since NAC is the most widely used sequence-based feature for exploring the motif around m6A sites, comparing the NAC of positive and negative datasets can be useful for m6A site identification. Our results show the nucleotide composition of modified and non-modified m6A sequences in the six datasets, in which the nucleotide residues A and C appeared with the highest frequency around the substrate sites (Figure 5). When comparing the NAC of positive and negative datasets, the differences in each NAC feature were much larger in the BDGP6 dataset than in the other datasets (Figure 5), suggesting that in this species the positive and negative samples differ more obviously with respect to NAC. In contrast, mm10 showed the smallest NAC differences between positive and negative datasets, consistent with the performance on the mm10 dataset being less successful than on the other datasets tested.

Figure 5

Visualization of feature NAC, which reflects the frequency of each individual nucleotide component in the positive and negative datasets of different species. The orange and blue colors correspond to positive and negative samples, respectively.

Figure 6

Successive motif patterns found with the tool MotifX and experimentally verified motifs generated from the positive datasets and visualized with TwoSampleLogo. In each subfigure, the experimental panel shows the pattern visualization of positive sequences generated by TwoSampleLogo, the predicted panel shows the visualization of predicted substrates generated by MotifX, and the motifs panel lists the top five predicted substrates and their matched numbers.

Figure 7

Heatmap of the performance of the cross-species validation. The AUC values were chosen for visualization, with detailed criteria outlined in Table S2 of the supplementary file. White represents the average value, while pink and green indicate AUC values above and below the average, respectively, with darker colors corresponding to a larger discrepancy.

Figure 5 indicates the occurrence of each nucleotide in m6A modified and non-modified sequences in the six datasets; around m6A sites, the residues A and C appeared with the highest frequency near the substrate positions. Comparing the NAC of positive and negative datasets in Figure 5, the differences are much larger in the BDGP6 dataset, meaning that positive and negative samples in this dataset are more clearly separated with respect to the NAC feature than in the other species, whereas mm10 has the smallest NAC differences between positive and negative samples, which is the main reason that the performance on the mm10 dataset is not as outstanding as on the other datasets.

In addition to the visualization of m6A motifs, we asked whether any significant patterns occurred in different species. Hence, we visualized the position-specific nucleotide composition around the modified m6A sites in each species using frequency plots generated with TwoSampleLogo (Figure 6). We also used MotifX [59], an R package designed to extract over-represented patterns from sequence datasets and originally used for peptide sequence analysis, to predict the successive substrates that frequently occur near the modified sites in each individual dataset.

We found several successive motifs that occurred in different species, indicating that similar patterns tend to be important across species and contribute significantly to the same type of modification (Figure 6). For instance, the 'GAC' pattern could be found in almost every species. The motif patterns of different species can be similar for the same modification: BDGP6 and E. coli showed high similarity, and danRer10 and E. coli also shared a similar motif pattern even though one is an animal and the other is a microbe. The observation that different species share a similar pattern is very interesting and deserves further research.

Cross-species performance evaluation

Based on the previous analysis, we found that some substrates might be shared among m6A modified sites in different species; hence, we conducted a cross-validation between different species. For example, we used the danRer10 dataset as the training set to build a model and applied it to the remaining datasets in the independent testing stage. A heatmap illustrates the cross-species validation (Figure 7). The ENAC feature and the RF classifier were chosen for this analysis because they achieved the highest performance in our previous single-feature analysis.
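The cross-species evaluation can be sketched as a simple train-on-one, test-on-another loop; the load_enac_features helper below is hypothetical and stands in for the actual feature extraction and data loading used in this work.

```python
# A hedged sketch of the cross-species evaluation: a model trained on one species
# (ENAC features, RF classifier) is scored on every other species' testing set.
# load_enac_features is a hypothetical loader, not part of the published code.
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

species = ["BDGP6", "danRer10", "E_coli", "mm10", "sacCer3", "TAIR10"]
auc_matrix = {}
for train_sp, test_sp in product(species, repeat=2):
    X_tr, y_tr = load_enac_features(train_sp, split="train")   # hypothetical loader
    X_te, y_te = load_enac_features(test_sp, split="test")
    model = RandomForestClassifier().fit(X_tr, y_tr)
    auc_matrix[(train_sp, test_sp)] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_matrix[("danRer10", "TAIR10")])
```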

In Figure 7, pink indicates performance above the overall average, green indicates performance below it, and white indicates the average level. In theory, the best performance should lie on the diagonal, where the training and testing sets come from the same species, so the diagonal should appear in shades of pink. This matches our results: the diagonal cells are mostly pink, indicating above-average performance. The one exception is the mm10 dataset, whose diagonal cell is white (average performance), while the mm10 model performed best on the BDGP6 testing set. This is reasonable: the mm10 training set is of a larger scale than the original BDGP6 training set, and the differences between positive and negative samples are much more pronounced in BDGP6 than in mm10, so the mm10 model separated the BDGP6 testing samples more easily than its own testing set, in which the positive and negative samples differ less obviously.

It may also be noticed that the model trained on the danRer10 dataset worked well on most of the testing datasets. This may be because danRer10 is relatively large, which helps avoid under-fitting better than the other training sets; meanwhile, the typical substrate 'GAC' found in the danRer10 dataset was also found in the other testing datasets, indicating higher similarity with other species and leading to good performance when the testing dataset changed.

Another interesting finding was that the E. coli model achieved the best AUC on the TAIR10 dataset and very good AUC on BDGP6 and danRer10, which may have several explanations. First, the positive sample numbers of the E. coli and TAIR10 datasets involved in the cross-species test were similar, while the imbalance ratio was maintained in the negative datasets (Table 2). Moreover, these datasets share a similar 'G-ACU' motif pattern at positions [-2, 2] (Figure S10), which may be why the E. coli model also achieved high performance on the TAIR10 dataset: they are of similar sample size and motif substrate pattern. We hypothesize that the E. coli model worked well on the BDGP6 and danRer10 datasets because of the diversity of positive and negative samples in these two datasets; indeed, the differences between positive and negative samples in danRer10 and BDGP6 are very obvious compared with the other cases, which makes the positive and negative samples in these datasets easier to distinguish (Figure 5). Similarity in motif substrate pattern and dataset scale may also explain why the highest performance did not always occur on the diagonal. This also explains the results when a model trained on a smaller dataset was applied to a testing set obtained from a larger dataset: for instance, when the model from sacCer3, a dataset with around one thousand positive samples, was applied to testing datasets with around three or four thousand positive samples, the performance was lower than the average level. This is illustrated by the lower triangular part of the heatmap containing more green, while the upper triangular part contains more pink (Figure 7).

Comparison of existing methods

We chose some representative and currently accessible frameworks, SRAMP [14] and WHISTLE [25], to compare with our proposed methods. The deep learning methods DeepPromise [6], which uses only a CNN, and DeepOME [60], in which a CNN is combined with BLSTM, were also included in the comparison. Recently, capsule networks have become popular for disease-related prediction [61], PTM prediction [62, 63] and single-cell RNA-sequencing data analysis [64]; in this study, we therefore also applied a capsule network to the prediction of m6A sites as an updated deep learning method to compare with the adaptive learning method. The datasets for the chosen baseline methods were kept as their original species data, with the detailed comparison shown in Table 5. The results indicate that, in most cases, the Adaptive-m6A model achieves better performance than the conventional ML methods (SRAMP and WHISTLE), and it achieves a more outstanding performance than the method with a similar network structure (i.e. DeepPromise), as it shows a higher AUC value. The attention mechanism also provided good performance compared with DeepOME and performed well when larger data scales were involved.

Table 5

Performance comparison with the latest accessible methods

| Method name | Algorithm | Encoding scheme | Species | Data size | Sn | Sp | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46 992 (mature mRNA mode) | 0.44 | 0.90 | 0.29 | 0.794 |
| WHISTLE [25] | SVM | NCP, GNF, genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.90 | 0.49 | 0.83 |
| DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.90 | 0.252 | 0.76 |
| DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
| CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |

Conclusion

In this study, we conducted species-specific prediction of m6A modification in RNA sequences. ML methods such as SVM, RF and LR were adapted for m6A prediction, as well as a deep learning method based mainly on an adaptive learning model consisting of CNN, BLSTM and attention networks. This is the first time that data from so many different species have been included in one predictor, and it produced a promising performance. Moreover, we summarized the up-to-date predictor tools for m6A prediction, which is the most comprehensive survey currently available. Based on this multi-species prediction work, we have improved our understanding of the biological mechanisms of the RNA transcriptome and of mechanisms shared by different biological organisms (e.g. plant, animal and microbe). A motif analysis of different species was also conducted and identified similar patterns in different organisms, one interesting aspect of our results that is worth further analysis and research. Plenty of questions remain open; for instance, can the encoding schemes of RNA sequence-based features used in representative methods, which are similar to those used for peptide sequence features, reveal relationships between RNA modification and peptide post-translational modification? This is another topic that could benefit from the results presented here in future studies.

Key Points
  • We proposed an attention-based deep learning method, Adaptive-m6A, which consists of CNN, BLSTM and attention mechanisms, to identify m6A sites in multiple species.

  • Three conventional machine learning (ML) methods, support vector machine, random forest and logistic regression classifiers, were applied to datasets covering the most species categories to date.

  • Adaptive-m6A outperformed conventional ML methods in most of the species tested and achieved a promising performance.

  • Motif analysis and cross-validation among different species showed that the proposed method achieves high performance across various species, helping to clarify the sequence characteristics and biological mechanisms of the RNA transcriptome in different organisms (a minimal cross-species evaluation sketch follows this list).
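Below is a minimal sketch of how such a cross-species evaluation matrix can be organized (train on one species, test on every species). The loader returns synthetic placeholder data and the random-forest baseline is an illustrative assumption; the real encoded sequences and models are available from the repository listed under Data availability.

```python
# Minimal sketch of a cross-species evaluation matrix. `load_dataset` is a
# hypothetical placeholder returning synthetic features/labels; species names
# follow the genome assemblies used above.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

SPECIES = ["BDGP6", "danRer10", "E. coli", "mm10", "TAIR10", "sacCer3"]

def load_dataset(species, n_samples=500, n_features=164):
    """Placeholder loader: replace with real encoded sequences and 0/1 labels."""
    rng = np.random.default_rng(abs(hash(species)) % (2 ** 32))
    X = rng.normal(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)
    return X, y

def cross_species_auc():
    results = {}
    for train_sp, test_sp in product(SPECIES, repeat=2):
        X_tr, y_tr = load_dataset(train_sp)
        X_te, y_te = load_dataset(test_sp)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        results[(train_sp, test_sp)] = round(auc, 3)
    return results

if __name__ == "__main__":
    for pair, auc in cross_species_auc().items():
        print(pair, auc)
```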

Data availability

All the data involved in this research can be accessed at https://github.com/Moretta1/Adaptive-m6A.

Acknowledgments

The authors sincerely appreciate the funding provided to the Warshel Institute for Computational Biology by Shenzhen City and Longgang District, which financially supported this research.

Funding

This work was supported by the Ganghong Young Scholar Development Fund [2021E007], Guangdong Province Basic and Applied Basic Research Fund [2021A1515012447], National Natural Science Foundation of China [32070659], the Science, Technology and Innovation Commission of Shenzhen Municipality [JCYJ20200109150003938], and Shenzhen-Hong Kong Cooperation Zone for Technology and Innovation [HZQB-KCZYB-2020056].

Author Biographies

Rulan Wang is currently a PhD student at the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, under the supervision of Prof. Tzong-Yi Lee and Prof. Hsien-Da Huang. Her research interests include bioinformatics, big data analytics, and machine learning.

Chia-Ru Chung is a postdoctoral researcher at the Kobilka Institute of Innovative Drug Discovery, The Chinese University of Hong Kong, Shenzhen. Her research interests include bioinformatics, big data analytics, data mining, and machine learning.

Hsien-Da Huang is a presidential chair professor and associate dean of education at the School of Life and Health Sciences, and the executive deputy director of the Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research group focuses mainly on multi-disciplinary biological research topics, including bioinformatics, genomics, metagenomics and the microbiome, intelligent biomedical technologies (drug development, genetic testing and precision medicine), AI and machine learning, and biological database design and development.

Tzong-Yi Lee is currently a professor at the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University. His research interests include bioinformatics, multi-omics data analysis, single-cell sequencing and spatial transcriptomics.

References

1. Yang G, Sun Z, Zhang N. Reshaping the role of m6A modification in cancer transcriptome: a review. Cancer Cell Int 2020;20(1):1–10.
2. Tong J, Flavell RA, Li H-B. RNA m6A modification and its function in diseases. Front Med 2018;12(4):481–9.
3. Hao W, Zhang Y. Mechanisms and functions of TET protein-mediated 5-methylcytosine oxidation. Genes Dev 2011;25(23):2436–52.
4. Yang C, Yiyang H, Zhou B, et al. The role of m6A modification in physiology and disease. Cell Death Dis 2020;11(11):1–16.
5. Helm M, Motorin Y. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat Rev Genet 2017;18(5):275–91.
6. Chen Z, Zhao P, Li F, et al. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform 2020;21(5):1676–96.
7. Chen W, Feng P, Ding H, et al. iRNA-Methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal Biochem 2015;490:26–33.
8. Chen W, Feng P, Ding H, et al. Identifying N6-methyladenosine sites in the Arabidopsis thaliana transcriptome. Mol Genet Genomics 2016;291(6):2225–9.
9. Jia C-Z, Zhang J-J, Wei-Zhen G. RNA-MethylPred: a high-accuracy predictor to identify N6-methyladenosine in RNA. Anal Biochem 2016;510:72–5.
10. Li G-Q, Liu Z, Shen H-B, et al. TargetM6A: identifying N6-methyladenosine sites from RNA sequences via position-specific nucleotide propensities and a support vector machine. IEEE Trans Nanobioscience 2016;15(7):674–82.
11. Liu Z, Xiao X, Dong-Jun Y, et al. pRNAm-PC: predicting N6-methyladenosine sites in RNA sequences via physical–chemical properties. Anal Biochem 2016;497:60–7.
12. Xiang S, Liu K, Yan Z, et al. RNAMethPre: a web server for the prediction and query of mRNA m6A sites. PLoS One 2016;11(10):e0162707.
13. Zhang M, Sun J-W, Liu Z, et al. Improving N6-methyladenosine site prediction with heuristic selection of nucleotide physical–chemical properties. Anal Biochem 2016;508:104–13.
14. Zhou Y, Zeng P, Li Y-H, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res 2016;44(10):e91.
15. Chen W, Xing P, Zou Q. Detecting N6-methyladenosine sites from RNA transcriptomes using ensemble support vector machines. Sci Rep 2017;7(1):1–8.
16. Feng P, Ding H, Yang H, et al. iRNA-PseColl: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Mol Ther Nucleic Acids 2017;7:155–63.
17. Xing P, Ran S, Guo F, et al. Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine. Sci Rep 2017;7(1):1–7.
18. Wei L, Chen H, Ran S. M6APred-EL: a sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning. Mol Ther Nucleic Acids 2018;12:635–44.
19. Chen W, Ding H, Zhou X, et al. iRNA(m6A)-PseDNC: identifying N6-methyladenosine sites using pseudo dinucleotide composition. Anal Biochem 2018;561:59–65.
20. Huang Y, He N, Chen Y, et al. BERMP: a cross-species classifier for predicting m6A sites by integrating a deep learning algorithm and a random forest approach. Int J Biol Sci 2018;14(12):1669.
21. Qiang X, Chen H, Ye X, et al. M6AMRFS: robust prediction of N6-methyladenosine sites with sequence-based features in multiple species. Front Genet 2018;9:495.
22. Wang X, Yan R. RFAthM6A: a new tool for predicting m6A sites in Arabidopsis thaliana. Plant Mol Biol 2018;96(3):327–37.
23. Wei L, Ran S, Wang B, et al. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 2019;324:3–9.
24. Zou Q, Xing P, Wei L, et al. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA 2019;25(2):205–18.
25. Chen K, Wei Z, Zhang Q, et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res 2019;47(7):e41.
26. Chen W, Tang H, Lin H. MethyRNA: a web server for identification of N6-methyladenosine sites. J Biomol Struct Dyn 2017;35(3):683–7.
27. Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform 2019;20(4):1280–94.
28. Song Z, Huang D, Song B, et al. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun 2021;12(1):1–11.
29. Li Z, Fang J, Wang S, et al. Adapt-Kcr: a novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture. Brief Bioinform 2022;23(2):bbac037.
30. Wang R, Wang Z, Wang H, et al. Characterization and identification of lysine crotonylation sites based on machine learning method on both plant and mammalian. Sci Rep 2020;10(1):1–12.
31. Xuan J-J, Sun W-J, Lin P-H, et al. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res 2018;46(D1):D327–34.
32. Ao C, Zou Q, Liang Y. NmRF: identification of multispecies RNA 2′-O-methylation modification sites from RNA sequences. Brief Bioinform 2022;23(1):bbab480.
33. Limin F, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28(23):3150–2.
34. Zou Q, Lin G, Jiang X, et al. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform 2020;21(1):1–10.
35. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21(9):1263–84.
36. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21(3):1047–57.
37. Chen Z, Zhao P, Li C, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021;49(10):e60.
38. Vapnik VN, Lerner AYa. Recognition of patterns with help of generalized portraits. Avtomat i Telemekh 1963;24(6):774–80.
39. Vapnik VN. An overview of statistical learning theory. IEEE Trans Neural Netw 1999;10(5):988–99.
40. Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
41. Breiman L. Bagging predictors. Mach Learn 1996;24(2):123–40.
42. Langarizadeh M, Moghbeli F. Applying naive Bayesian networks to disease prediction: a systematic review. Acta Inform Med 2016;24(5):364.
43. Alzamzami F, Hoda M, El Saddik A. Light gradient boosting machine for general sentiment classification on short texts: a comparative evaluation. IEEE Access 2020;8:101840–58.
44. Bao W, Cui Q, Chen B, et al. Phage_UniR_LGBM: phage virion proteins classification with UniRep features and LightGBM model. Comput Math Methods Med 2022;2022:2022.
45. Chen Y-Z, Wang Z-Z, Wang Y, et al. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021;22(6):bbab146.
46. Wright RE. Logistic regression. 1995, 217–44.
47. Xiaojuan Y, Zhou J, Zhao M, et al. Exploiting XGBoost for predicting enhancer–promoter interactions. Curr Bioinform 2020;15(9):1036–45.
48. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
49. Greff K, Srivastava RK, Koutník J, et al. LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 2016;28(10):2222–32.
50. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst 2017;30.
51. Lin Z, Feng M, dos Santos CN, et al. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
52. Wiegreffe S, Pinter Y. Attention is not explanation. arXiv preprint arXiv:1908.04626, 2019.
53. Clark K, Khandelwal U, Levy O, et al. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.
54. Zou Z, Tian S, Gao X, et al. mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet 2019;9:714.
55. Hong Z, Zeng X, Wei L, et al. Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics 2020;36(4):1037–43.
56. Kao H-J, Huang C-H, Bretaña NA, et al. A two-layered machine learning method to identify protein O-GlcNAcylation sites with O-GlcNAc transferase substrate motifs. BMC Bioinformatics 2015;16(18):1–11.
57. Wang R, Wang Z, Li Z, et al. Residue–residue contact can be a potential feature for the prediction of lysine crotonylation sites. Front Genet 2021;12.
58. Li F, Li C, Marquez-Lago TT, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 2018;34(24):4223–31.
59. Wagih O, Sugiyama N, Ishihama Y, et al. Uncovering phosphorylation-based specificities through functional interaction networks. Mol Cell Proteomics 2016;15(1):236–45.
60. Li H, Chen L, Huang Z, et al. DeepOMe: a web server for the prediction of 2′-O-Me sites based on the hybrid CNN and BLSTM architecture. Front Cell Dev Biol 2021;9:1244.
61. Yang B, Bao W, Wang J. Active disease-related compound identification based on capsule network. Brief Bioinform 2022;23(1):bbab462.
62. Khanal J, Tayara H, Zou Q, et al. DeepCap-Kcr: accurate identification and investigation of protein lysine crotonylation sites based on capsule network. Brief Bioinform 2022;23(1):bbab492.
63. Bao W, Yang B, Chen B. 2-hydr_Ensemble: lysine 2-hydroxyisobutyrylation identification with ensemble method. Chemom Intel Lab Syst 2021;215:104351.
64. Bao S, Li K, Yan C, et al. Deep learning-based advances and applications for single-cell RNA-sequencing data analysis. Brief Bioinform 2022;23(1):bbab473.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)