Abstract

N6-methyladenosine (m6A) modification is the most abundant co-transcriptional modification in eukaryotic RNA and plays important roles in cellular regulation. Traditional high-throughput sequencing experiments used to explore functional mechanisms are time-consuming and labor-intensive, and most existing computational methods focus on a limited number of species. To better understand the biological mechanisms shared by different species carrying the same RNA modification, a computational scheme that can be applied across species is needed. To this end, we propose an attention-based deep learning method, adaptive-m6A, which consists of a convolutional neural network, bi-directional long short-term memory and an attention mechanism, to identify m6A sites in multiple species. Three conventional machine learning (ML) methods, support vector machine, random forest and logistic regression classifiers, were also considered in this work. Beyond the performance of the ML methods for multi-species prediction, the optimal performance of adaptive-m6A reached an accuracy of 0.9832 and an area under the receiver operating characteristic curve of 0.98. Moreover, motif analysis and cross-species validation were conducted to test the robustness of a model trained on one species when applied to others, which helps improve our understanding of the sequence characteristics and biological functions of RNA modifications in different species.

Introduction

N6-methyladenosine (m6A) is the most prevalent, abundant and conserved internal co-transcriptional modification in eukaryotic RNAs and occurs in around 30% of transcripts, especially within eukaryotic cells [1]. Known since the 1970s, m6A has served as a layer of epigenetic modification, with demethylases additionally discovered in recent years [2]. m6A plays a broad and crucial role in almost all aspects of physiological behavior, including physiological regulation, and its disruption may impair gene expression and cellular function [3]. Moreover, m6A is involved in many diseases such as cancer, psychiatric disorders and metabolic disease [4]. Owing to advances in genomics and molecular biology, currently available high-throughput experimental techniques have generated data on tens of thousands of transcripts, which provides sufficient material for further understanding different RNA modifications. Several well-established machine learning (ML) techniques have been developed for m6A modification site prediction using this abundance of experimentally verified data [5]. Over 20 approaches have been developed to predict m6A sites [6]. These methods differ in many aspects, including the benchmark dataset, the sequence and/or structural descriptors used, the physicochemical properties and genomic features employed, the feature selection techniques and the targeted RNA modification types.

Table 1

Summary of representative m6A predictors in published work

| Tool | Method | Publication year | Feature selection method | Feature encoding scheme | Data scales | URL accessibility | Window size | Species | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S. cerevisiae | [7] |
| M6ATH | SVM | 2016 | None | NCP, ANF | 394 | Inaccessible | 25 | A. thaliana | [8] |
| RNA-MethylPred | SVM | 2016 | None | BPB, DNC, KNN score | 1307 | MATLAB package | 51 | S. cerevisiae | [9] |
| TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307 (Met2614) and 832 (Train1664) | Accessible | 51 (Met2614) and 21 (Train1664) | S. cerevisiae | [10] |
| pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S. cerevisiae | [11] |
| RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29 457 (H. sapiens) and 31 728 (Mus musculus) | Inaccessible | 101 | H. sapiens and Mus musculus | [12] |
| M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S. cerevisiae | [13] |
| SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55 706 (full transcript mode) and 46 992 (mature mRNA mode) | Accessible | 251 | H. sapiens and Mus musculus | [14] |
| RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S. cerevisiae | [15] |
| iRNA-PseColl | SVM | 2017 | None | NCP, ANF | 1130 | Inaccessible | 41 | H. sapiens | [16] |
| RAM-NPPS | SVM | 2017 | RFE, FSDI, MRMD | NPPS | 8366 (H. sapiens), 1307 (S. cerevisiae), 394 (A. thaliana) | Inaccessible | 51 | H. sapiens, S. cerevisiae and A. thaliana | [17] |
| M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S. cerevisiae | [18] |
| iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S. cerevisiae | [19] |
| BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53 000 (mammalian full transcript), 44 853 (mammalian mature mRNA), 1100 (S. cerevisiae) and 2100 (A. thaliana) | Inaccessible | 251 (mammalian), 51 (S. cerevisiae), 101 (A. thaliana) | H. sapiens, Mus musculus, S. cerevisiae and A. thaliana | [20] |
| M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S. cerevisiae), 1130 (H. sapiens), 725 (Mus musculus) and 1000 (A. thaliana) | | 51 (S. cerevisiae), 41 (H. sapiens), 41 (Mus musculus) and 25 (A. thaliana) | H. sapiens, Mus musculus, S. cerevisiae and A. thaliana | [21] |
| RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A. thaliana | [22] |
| DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S. cerevisiae | [23] |
| Gene2Vec | CNN | 2019 | None | one-hot, neighboring methylation state, RNA word embedding, Gene2Vec | 56 557 | Inaccessible | 1001 | H. sapiens and Mus musculus | [24] |
| WHISTLE | SVM | 2022 | Perturb method | NCP, genome-derived features | 20 516, 17 383 | Accessible | | H. sapiens | [25] |
| DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44 901, 11 656 and 5233 | Accessible | 1001 | H. sapiens and Mus musculus | [6] |

Abbreviations in Feature selection method: IFS, incremental feature selection; RFE, recursive feature elimination; FSDI, feature selection based on discernibility and independence of a feature; MRMD, maximal relevance and maximal distance; SFS, sequence forward search. Abbreviations in Feature encoding scheme: PseDNC, pseudo dinucleotide composition; NCP, nucleotide chemical property; ANF, accumulated nucleotide frequency; BPB, bi-profile Bayes; DNC, dinucleotide composition; KNN, k-nearest neighbour; PSNP, position-specific nucleotide propensity; PSDP, position-specific dinucleotide propensity; PCPM, physical-chemical property matrix; MFE, minimum free energy; AC, auto-covariance; CC, cross-covariance; NPPS, nucleotide pair position specificity; ENAC, enhanced nucleic acid composition; PSNSP, position-specific nucleotide sequence profile; PSDSP, position-specific dinucleotide sequence profile; KSNPF, k-spaced nucleotide pair frequencies.


Figure 1

Overall framework of the predictors used in this research. The process contains four main steps: data collection, feature investigation, model training and cross-validation, and independent testing. In the data collection step, datasets from six different species were collected from published research and databases for motif substrate discovery and feature investigation. In the feature investigation phase, primary sequence-derived features (NAC, DNC, TNC, ENAC and CKSNAP) and one nucleotide physicochemical property feature (NCP) were adopted. In the model training and cross-validation phase, three conventional ML methods (SVM, LR and RF) and a deep learning method using an adaptive learning network (adaptive-m6A) were applied to the datasets, with 5-fold cross-validation used for model assessment. Finally, independent testing was employed to assess the performance of the proposed method.

Recently, deep learning methods have been widely applied in different fields, including RNA modification prediction [5, 8–10, 12, 17, 21, 22, 26, 27]. Table 1 lists some representative m6A prediction methods and illustrates that the available tools still have limitations. First, most of them focus on only a few species, with no or limited information on others. Second, conventional ML is still the first choice for m6A site prediction, and when deep learning is used it is usually limited to a convolutional neural network (CNN) framework. In addition, constructing an experimentally verified m6A database for these methods is time-consuming and labor-intensive. Moreover, little work has examined the motif patterns around modified sites across different samples, which could reveal potential relationships among different species carrying the same type of RNA post-transcriptional modification [28]. To overcome these obstacles and investigate m6A modification across various species, we developed prediction models for m6A modification in multiple species with different frameworks. We propose an attention-based deep learning method called adaptive-m6A, which consists of a CNN, bi-directional long short-term memory (BLSTM) and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional ML methods, support vector machine, random forest and logistic regression classifiers, were investigated in detail. A cross-species validation was also conducted to evaluate whether a model trained on one species works well on data from other species [29, 30], and a comparison with other representative methods is reported. To the best of our knowledge, this is the first work to use data from several species, which could help reveal whether the same type of RNA modification shares similar sequence contexts among different species [28]. The overall flowchart of this research is shown in Figure 1.

Method and materials

Data collection and preparation

The datasets used for training and independent testing were collected from published work and the RMBase database [31]. Data with m6A modified sites were obtained for six species: Drosophila melanogaster (BDGP6), Danio rerio (zebrafish, danRer10), Escherichia coli str. K-12 substr. MG1655 (E. coli), Mus musculus (mm10), Saccharomyces cerevisiae S288C (sacCer3) and Arabidopsis thaliana (TAIR10). Dataset collection involved the following steps. First, the sequences containing m6A modification sites and their positions were downloaded from the RMBase database; because RMBase stores sequences in DNA form, they were converted to RNA by replacing T with U [32]. The datasets for M. musculus and A. thaliana were downloaded from previous studies [14, 22]. To obtain a high-quality dataset, CD-HIT with a 70% identity cut-off [33] was used to remove redundancy and homology bias [32, 34]. We then divided the datasets into training and testing sets at a ratio of 7:3. To avoid the effect of any imbalance between the numbers of positive and negative samples, a random under-sampler was employed to equalize the positive and negative sets [35]. The final number of samples for each species is listed in Table 2, and the data involved in this research can be accessed at https://github.com/Moretta1/Adaptive-m6A.
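For readers who want to reproduce this preparation pipeline, the following minimal Python sketch illustrates the 7:3 split and the balancing of the training set described above. The file name, column labels and the use of scikit-learn/imbalanced-learn are illustrative assumptions rather than the authors' exact scripts; CD-HIT filtering at 70% identity is assumed to have been run beforehand.

```python
# A minimal sketch of the data preparation described above, assuming the sequences
# were already de-duplicated with CD-HIT (70% identity). The file name and column
# labels are hypothetical; scikit-learn and imbalanced-learn stand in for the
# authors' actual scripts.
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv("m6a_sites_danRer10.csv")   # columns: "sequence", "label" (1 = m6A site)
X, y = df[["sequence"]], df["label"]

# 7:3 split into training and independent testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# balance the training set so positive and negative sample numbers are equal
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(len(X_train_bal), len(X_test))
```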

Table 2

Data distribution of each species in this research. The data come from two main sources: the published database RMBase [31] and previously published work [14, 22]

| Group | Species | Assembly | Training set (positive) | Training set (negative) | Testing set (positive) | Testing set (negative) |
| --- | --- | --- | --- | --- | --- | --- |
| Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
| Vertebrate | Danio rerio (zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
| Bacteria | Escherichia coli (E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
| Mammal | Mus musculus (house mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
| Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
| Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |

Feature encoding

In this research, primary sequence-derived features and nucleotide physicochemical properties [6] were selected to obtain sequence feature vectors, as these can be derived directly from the sequences and require no extra third-party tools. The encodings were implemented mainly with the toolkits iLearn [36] and iLearnplus [37].

Primary sequence-derived features

Nucleotide acid components Nucleotide acid components (NAC) indicate the frequency at which each nucleotide occurs in a sequence and are the most basic primary sequence-derived RNA feature. As there are four types of nucleotides (A, C, G, U/T) in an RNA sequence, the NAC feature has a dimension of 4. For a sequence $x$ of fixed length $n$ ($n = 21$ in this study), the probability $P_x(k)$ of nucleotide $k$ is
$$P_x(k) = \frac{n_x(k)}{n}, \quad k \in \{A, C, G, U\} \qquad (1)$$
where $n_x(k)$ is the number of occurrences of nucleotide $k$ in $x$.
Di-nucleotide components Di-nucleotide components (DNC) indicate the frequency at which each nucleotide pair occurs in the sequence. There are four types of nucleotides in an RNA sequence; therefore, $4 \times 4 = 16$ nucleotide pairs are possible, making the dimension of the DNC feature 16. The probability $P_x(k)$ of nucleotide pair $k$ in a sequence $x$ is
$$P_x(k) = \frac{n_x(k)}{n - 1}, \quad k \in \{AA, AC, \ldots, UU\} \qquad (2)$$
where $n_x(k)$ is the number of occurrences of nucleotide pair $k$. Figure S1 in the supplementary file compares the DNC features of positive and negative samples across all datasets.
Tri-nucleotide composition Similar to NAC and DNC, tri-nucleotide composition (TNC) gives 64 descriptors. It is defined as
$$f(r, s, t) = \frac{N_{rst}}{n - 2}, \quad r, s, t \in \{A, C, G, U\} \qquad (3)$$
where $N_{rst}$ is the number of tri-nucleotides formed by nucleotide types $r$, $s$ and $t$.

Composition of k-spaced nucleotide acid pair The composition of k-spaced nucleotide acid pairs (CKSNAP) is a descriptor analogous to one widely used for analyzing protein function. For a fixed integer k, a k-spaced nucleotide pair is a pair of nucleotides separated by a gap ranging from 0 to k, and CKSNAP records the frequency of each nucleotide pair at each gap. In this study, k was set to 5, so the gap ranges from 0 to 5. Because CKSNAP with k = 0 gives the same result as DNC, only gaps from 1 to 5 were considered, yielding 48 feature dimensions in total for a single sequence. Figure S2 in the supplementary file compares the CKSNAP features of positive and negative samples across all datasets.

Enhanced nucleotide acid composition Enhanced nucleotide acid composition (ENAC) calculates the NAC within a window of fixed length (default 5) that slides continuously from the 5' to the 3' terminus of each nucleotide sequence, and can be applied to encode nucleotide sequences of equal length. ENAC is calculated as
$$ENAC(t, w_i) = \frac{N_t(w_i)}{N(w_i)}, \quad t \in \{A, C, G, U/T\}, \; i = 1, 2, \ldots, 21 - 5 \qquad (4)$$
where $N_t(w_i)$ is the count of nucleotide $t$ in the $i$-th window $w_i$ and $N(w_i)$ is the window length. In this study, each 21-nt sequence yields $21 - 5 = 16$ windows, so ENAC gives a $4 \times (21 - 5) = 64$-dimensional feature for each sequence.
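To make these descriptors concrete, the short sketch below computes NAC, DNC and ENAC for one 21-nt window in plain Python; it illustrates the formulas above rather than the exact iLearn implementation, and the example sequence is made up.

```python
# Illustrative (not the authors' exact iLearn calls): computing NAC, DNC and ENAC
# descriptors for a 21-nt RNA window centred on the candidate adenosine.
from itertools import product

BASES = "ACGU"

def nac(seq):
    """Nucleotide composition: frequency of each base (4-dim)."""
    return [seq.count(b) / len(seq) for b in BASES]

def dnc(seq):
    """Di-nucleotide composition: frequency of each base pair (16-dim)."""
    pairs = ["".join(p) for p in product(BASES, repeat=2)]
    total = len(seq) - 1
    return [sum(seq[i:i + 2] == p for i in range(total)) / total for p in pairs]

def enac(seq, window=5):
    """Enhanced NAC: NAC within each sliding window of fixed length."""
    feats = []
    for start in range(len(seq) - window):
        feats.extend(nac(seq[start:start + window]))
    return feats

seq = "GGACUGGACUAAGGACUGGAC"          # hypothetical 21-nt example
print(len(nac(seq)), len(dnc(seq)), len(enac(seq)))   # 4 16 64
```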
Figure 2

The structure of the deep learning method contains two parts: the embedding module and the task-specific module. In the embedding module, RNA sequences of fixed length are converted into a numeric matrix and used as input for the task-specific module. The task-specific module contains a CNN layer for extracting features, a max-pooling layer for speeding up computation, BLSTM and attention layers, and a fully connected layer for the output.

Nucleotide physicochemical properties

Nucleotide chemical properties Each of the four nucleotides has a different chemical structure and binding properties (Table 3), and this allows for the nucleotide chemical property (NCP) encoding scheme. For instance, "A" is encoded as (1, 1, 1) and "G" as (1, 0, 0). NCP generates a feature of dimension 3 × 21 = 63.

Table 3

Chemical structure of each nucleotide

| Chemical property | Class | Label | Nucleotides |
| --- | --- | --- | --- |
| Ring structure | Purine | 1 | A, G |
| Ring structure | Pyrimidine | 0 | C, U |
| Functional group | Amino | 1 | A, C |
| Functional group | Keto | 0 | G, U |
| Hydrogen bond | Strong | 0 | C, G |
| Hydrogen bond | Weak | 1 | A, U |
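As an illustration of the NCP scheme in Table 3, the snippet below encodes a 21-nt sequence into the 63-dimensional vector described above; the mapping follows the table, while the helper function itself is a sketch rather than the authors' code.

```python
# A small sketch of the NCP encoding in Table 3: each nucleotide maps to a
# 3-bit vector (ring structure, functional group, hydrogen bond), giving a
# 3 x 21 = 63-dimensional feature for a 21-nt sequence.
NCP = {
    "A": (1, 1, 1),   # purine, amino, weak hydrogen bond
    "C": (0, 1, 0),   # pyrimidine, amino, strong hydrogen bond
    "G": (1, 0, 0),   # purine, keto, strong hydrogen bond
    "U": (0, 0, 1),   # pyrimidine, keto, weak hydrogen bond
}

def ncp_encode(seq):
    vec = []
    for base in seq:
        vec.extend(NCP[base])
    return vec

print(len(ncp_encode("GGACUGGACUAAGGACUGGAC")))   # 63
```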

RNA sequence embedding

Sequence embedding was used in the embedding layer of the deep learning framework. The embedding module maps each of the four nucleotide tokens in a sequence to a numeric vector. The embedding is described as
$$embed = embed_{token} \oplus embed_{position} \qquad (5)$$
where $\oplus$ denotes the fusion of the two vectors, $embed_{position}$ encodes the position of the nucleotide in the whole sequence and $embed_{token}$ is a randomly initialized vector identified using a lookup table [24]. Each fused vector is adjusted for the task through back-propagation when the model is trained [29].
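A minimal PyTorch sketch of such an embedding module is given below; the embedding dimension and the additive fusion of token and position vectors are assumptions for illustration, not the published adaptive-m6A configuration.

```python
# A minimal sketch of the adaptive embedding module: nucleotide tokens and their
# positions are looked up in learnable tables and fused into one vector.
# Dimensions and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SequenceEmbedding(nn.Module):
    def __init__(self, vocab_size=4, seq_len=21, embed_dim=32):
        super().__init__()
        self.token = nn.Embedding(vocab_size, embed_dim)      # one entry per nucleotide
        self.position = nn.Embedding(seq_len, embed_dim)      # one entry per position

    def forward(self, tokens):                                # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.token(tokens) + self.position(positions)  # fused (batch, seq_len, dim)

ids = torch.tensor([[0, 1, 2, 3] * 5 + [0]])                  # toy 21-nt index sequence
print(SequenceEmbedding()(ids).shape)                         # torch.Size([1, 21, 32])
```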

Model construction

Machine learning method

From previous surveys of RNA modification site prediction, the ML algorithms used in this area include SVM [38, 39], RF [40], bagging [41], naive Bayesian networks [42], light gradient boosting machine (LGBM) [43, 44], logistic regression (LR) [45, 46] and extreme gradient boosting (XGBoost) [47]. In this study, we selected three commonly used ML algorithms, RF, LR and SVM, implemented with the Scikit-Learn toolkit [48], to screen for the optimal algorithm. The three algorithms were first screened with default parameters to obtain the top two classifiers, which were then tuned to confirm the best classifier for the optimal model.
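The screening step can be sketched as follows with scikit-learn; the placeholder feature matrix and the default hyper-parameters are illustrative assumptions rather than the exact experimental setup.

```python
# A hedged sketch of screening the three conventional classifiers with default
# parameters under 5-fold cross-validation; X and y are placeholders standing in
# for the encoded features and labels described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X = np.random.rand(200, 63)          # placeholder feature matrix (e.g. NCP features)
y = np.random.randint(0, 2, 200)     # placeholder binary labels

classifiers = {
    "SVM": SVC(probability=True),
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean 5-fold AUC = {auc:.3f}")
```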

Deep learning method

The deep learning framework contains two main modules. First, an adaptive embedding module maps the nucleotide composition onto numerical vectors according to position information and token values from a lookup table [24, 29]. Second, a task-specific module consists of CNN, BLSTM, attention and fully connected layers, in which the CNN layer extracts high-level features from the sequence. BLSTM is a specific recurrent neural network (RNN) that consists of a set of recurrently connected blocks and enhances the ability of LSTM to learn long-distance dependencies. Compared with unidirectional LSTM, BLSTM better captures sequence context [49]. In addition to CNN and BLSTM, the attention mechanism can also capture positional information. The attention mechanism was originally proposed for machine translation [50] and has been shown to distinguish between more and less important information [51]. In natural language processing and image recognition, a growing number of attention-based studies have demonstrated the interpretability of such models [52, 53]. The attention mechanism has also achieved competitive performance in a wide range of biological sequence analysis problems [54, 55], and it was therefore used in this study to identify the key information that affects m6A site prediction. Figure 2 shows the whole workflow of the adaptive learning model.
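The sketch below outlines a task-specific module of this kind in PyTorch (CNN, max pooling, BLSTM, attention, fully connected output); all layer sizes are illustrative assumptions rather than the published adaptive-m6A hyper-parameters.

```python
# A simplified sketch of the task-specific module (CNN -> max-pooling -> BLSTM
# -> attention -> fully connected); layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveM6A(nn.Module):
    def __init__(self, embed_dim=32, conv_channels=64, lstm_hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
        )
        self.blstm = nn.LSTM(conv_channels, lstm_hidden,
                             batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)        # attention scores per position
        self.fc = nn.Linear(2 * lstm_hidden, 1)          # binary output (m6A vs. not)

    def forward(self, x):                                # x: (batch, seq_len, embed_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2) # (batch, seq_len//2, channels)
        h, _ = self.blstm(h)                             # (batch, seq_len//2, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)     # attention over positions
        context = (weights * h).sum(dim=1)               # weighted sequence summary
        return torch.sigmoid(self.fc(context)).squeeze(-1)

model = AdaptiveM6A()
print(model(torch.randn(8, 21, 32)).shape)               # torch.Size([8])
```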

Evaluation metrics

Five-fold cross-validation and independent testing were used to evaluate the performance of the proposed model [56]. In addition, five criteria, specificity (Sp), sensitivity (Sn), accuracy (Acc), precision (Prec) and the Matthews correlation coefficient (MCC), were used to evaluate the performance and are defined as [57]:
$$Sp = \frac{TN}{TN + FP} \qquad (6)$$
$$Sn = \frac{TP}{TP + FN} \qquad (7)$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$
$$Prec = \frac{TP}{TP + FP} \qquad (9)$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (10)$$
where $TP$ denotes positive samples correctly predicted as true m6A modification sites, $TN$ denotes negative samples correctly predicted not to contain m6A modification sites, $FN$ is the number of true m6A modification sequences falsely predicted as non-m6A sequences and $FP$ is the number of non-m6A sequences falsely predicted as true m6A modification sequences.

Moreover, in addition to these evaluation indicators, the receiver operating characteristic (ROC) curve, generated by plotting 1-specificity on the x-axis and sensitivity on the y-axis, was used as another indicator for evaluating the performance of the proposed method [58]. The closer the area under the ROC curve is to 1, the better the prediction performance of the model and the higher the accuracy.
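These criteria can be computed directly from the confusion matrix and prediction scores; the short sketch below uses scikit-learn with made-up labels and scores purely for illustration.

```python
# A short sketch of computing the evaluation criteria above from predicted labels
# and scores with scikit-learn; y_true, y_score and the 0.5 threshold are made up.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, precision_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sn  =", tp / (tp + fn))
print("Sp  =", tn / (tn + fp))
print("Acc =", accuracy_score(y_true, y_pred))
print("Prec=", precision_score(y_true, y_pred))
print("MCC =", matthews_corrcoef(y_true, y_pred))
print("AUC =", roc_auc_score(y_true, y_score))
```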

Results

Performance comparison of machine learning classifiers

For the ML classifiers in this study, species-specific evaluation indicators for the three classifiers were summarized and visualized in a bar chart (Figure 3). Table S4 in the supplementary file provides the ML performances in table form, and the ROC and PRC curves of these models on independent testing are shown in Figures S3–S8. The results indicate that, for each species, RF tended to achieve higher performance than the other two classifiers, while SVM was efficient for most of the species tested. In some cases, such as mm10, all three classifiers performed at a similar level, with accuracy around 0.6–0.7 and AUC between 0.55 and 0.75. Because the performance of the ML methods on the sacCer3 and TAIR10 datasets was quite close and not as promising as on the other small-scale datasets with fewer than 10 thousand samples, these datasets were chosen for subsequent optimization and comparison with feature selection.

Figure 3

Performance of the ML classifiers (A) SVM, (B) RF and (C) LR on the individual datasets. Detailed values for each classifier are listed in table form in Table S1 in the supplementary files.

Figure 4

Performance of optimal subset selection with the CHI2 feature selection method and the optimal performance for the species (A) mm10, (B) TAIR10 and (C) sacCer3, respectively. The top 100 feature dimensions, i.e. those with the highest chi2 values, are shown, together with the contribution of each feature type among the total selected dimensions. The bar chart shows the criteria for the original hybrid features (before feature selection) and the selected features.

Comparing the three conventional ML classifiers, all achieved higher performance on the BDGP6 dataset, with the SVM classifier performing best, whereas the mm10 dataset did not show promising performance. This may be because the differences between positive and negative samples are much more obvious in the BDGP6 dataset. Overall, the RF classifier yielded a higher level of performance than the other two classifiers, although it showed a more biased performance on the danRer10 dataset than on the other datasets.

Performance with hybrid features on feature selection methods

To further determine the optimal classifier and select the optimal feature subset, we adopted a two-step strategy. First, the chi-squared (CHI2) feature selection technique was applied to calculate feature importance values and sort each dimension of the hybrid feature set accordingly (the chi2 values are shown in Figure 4). Second, from the sorted hybrid feature list, we took the top 100 feature dimensions (listed in Figure 4 for the three datasets, ranked from the highest to the lowest chi2 value) and then selected the dimensionality of the optimal feature subset with the RF classifier through the IFS strategy. To determine the optimal classifier, the AUC values of the RF and SVM classifiers were compared based on the feature selection results, shown in Figure 4 for TAIR10 (A. thaliana), mm10 (M. musculus) and sacCer3 (S. cerevisiae), respectively. Based on the feature selection on these three datasets, we compared the performance of the full hybrid features with that of the selected top 100 feature dimensions of highest importance; the performances are shown in the histogram of Figure 4.
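A simplified sketch of this two-step strategy (chi-squared ranking followed by an incremental scan of top-k subsets with an RF classifier) is shown below; the data are placeholders and the scan step size is an assumption.

```python
# An illustrative sketch: rank hybrid feature dimensions by chi-squared score,
# then evaluate growing top-k subsets with an RF classifier (a simple form of
# incremental feature selection). X and y are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 195)                  # hybrid feature matrix (non-negative)
y = np.random.randint(0, 2, 500)

scores, _ = chi2(X, y)                        # chi2 value of every feature dimension
ranked = np.argsort(scores)[::-1]             # dimensions sorted by importance

best_k, best_auc = 0, 0.0
for k in range(10, 101, 10):                  # scan top-10 ... top-100 subsets
    auc = cross_val_score(RandomForestClassifier(), X[:, ranked[:k]], y,
                          cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_k, best_auc = k, auc
print(f"optimal subset: top {best_k} dimensions, AUC = {best_auc:.3f}")
```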

Performance on deep learning method

In addition to the ML methods, we applied the adaptive learning network to the species datasets with more than 10 000 samples, which are suitable for a deep learning framework (Table 4). A box plot comparing the methods indicated that the deep learning method achieved better performance on average than the conventional ML classifiers and was particularly successful on the larger datasets, such as mm10 and danRer10 (Figure S9). For the relatively smaller datasets, such as sacCer3 and TAIR10, adaptive learning yielded a less pronounced advantage over the ML methods, which may be caused by the limited amount of data.

Table 4

Performance of the proposed adaptive learning method and comparison with the ML methods. The proposed adaptive learning method is based on the RNA embedding feature. For each ML classifier, the optimal performance over all conventional features is listed for comparison; the detailed performances of all conventional features for each individual ML method are listed in Table S1 of the supplementary file. The optimal performance is highlighted in bold.

| Species | Methods | Sn | Sp | ACC | MCC | AUC | Prec | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
| BDGP6 | RF | 0.96 | 0.92 | 0.92 | 0.70 | 0.96 | 0.59 | 0.68 |
| BDGP6 | LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 |
| BDGP6 | Ada | 0.99 | 1.00 | 1.00 | 0.97 | 1.00 | 0.95 | 0.97 |
| danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
| danRer10 | RF | 0.93 | 0.88 | 0.90 | 0.81 | 0.95 | 0.89 | 0.91 |
| danRer10 | LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 |
| danRer10 | Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 |
| E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
| E. coli | RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.70 | 0.78 |
| E. coli | LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 |
| E. coli | Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 |
| sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.80 | 0.76 |
| sacCer3 | RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.80 | 0.73 | 0.75 |
| sacCer3 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| sacCer3 | Ada | 0.77 | 0.73 | 0.75 | 0.50 | 0.81 | 0.74 | 0.75 |
| TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.50 | 0.82 | 0.75 | 0.75 |
| TAIR10 | RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 |
| TAIR10 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| TAIR10 | Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 |
| mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
| mm10 | RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 |
| mm10 | LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 |
| mm10 | Ada | 0.97 | 0.13 | 0.20 | 0.09 | 0.79 | 0.49 | 0.14 |

Motif analysis of modified m6A sequences

Since NAC is the most widely used sequence-based feature for exploring the motif around m6A sites, comparing the NAC of positive and negative datasets can be useful for m6A site identification. Our results show the nucleotide composition of modified and non-modified m6A sequences in the six datasets, in which the nucleotide residues A and C appeared with the highest frequency around the substrate sites (Figure 5). When comparing the NAC of positive and negative datasets, the differences in each NAC feature were much larger in the BDGP6 dataset than in the other datasets (Figure 5), suggesting that in this species the positive and negative samples differ more obviously with respect to NAC. In contrast, mm10 showed the smallest NAC differences between positive and negative datasets, consistent with the performance on the mm10 dataset being less successful than on the other datasets tested.

Figure 5

Visualization of feature NAC, which reflects the frequency of each individual nucleotide component in the positive and negative datasets of different species. The orange and blue colors correspond to positive and negative samples, respectively.

Figure 6

Successive motif patterns found with the tool MotifX and experimentally verified motifs generated from the positive datasets and visualized with TwoSampleLogo. In each subfigure, the experimental panel shows the pattern visualization of positive sequences generated by TwoSampleLogo, the predicted panel shows the visualization of predicted substrates generated by MotifX, and the motifs panel lists the top five predicted substrates and their matched numbers.

Figure 7

Heatmap of the performance of the cross-species validation. The AUC values were chosen for visualization, with detailed criteria outlined in Table S2 of the supplementary file. White represents the average value, while pink and green indicate AUC values above and below the average, respectively, with darker colors corresponding to a larger discrepancy.

Figure 5 indicates the occurrence of each nucleotide in m6A modified and non-modified sequences in the six datasets; around m6A sites, the residues A and C appeared with the highest frequency near the substrate positions. Comparing the NAC of positive and negative datasets in Figure 5, the differences are much larger in the BDGP6 dataset, meaning that positive and negative samples in this dataset are more clearly separated with respect to the NAC feature than in the other species, whereas mm10 has the smallest NAC differences between positive and negative samples, which is the main reason that the performance on the mm10 dataset is not as outstanding as on the other datasets.

In addition to the visualization of m6A motifs, we asked whether any significant patterns occurred in different species. Hence, we visualized the position-specific nucleotide composition around the modified m6A sites in each species using frequency plots generated with TwoSampleLogo (Figure 6). We also used MotifX [59], an R package designed to extract over-represented patterns from sequence datasets and originally used for peptide sequence analysis, to predict the successive substrates that frequently occur near the modified sites in each individual dataset.

We found several successive motifs that occurred in different species, indicating that similar patterns tend to be important across species and contribute significantly to the same type of modification (Figure 6). For instance, the 'GAC' pattern could be found in almost every species. The motif patterns of different species can be similar for the same modification: BDGP6 and E. coli showed high similarity, and danRer10 and E. coli also shared a similar motif pattern even though one is an animal and the other is a microbe. The observation that different species share a similar pattern is very interesting and deserves further research.

Cross-species performance evaluation

Based on the previous analysis, we found that some substrates might be shared among m6A modified sites in different species; hence, we conducted a cross-validation between different species. For example, we used the danRer10 dataset as the training set to build a model and applied it to the remaining datasets in the independent testing stage. A heatmap illustrates the cross-species validation (Figure 7). The ENAC feature and the RF classifier were chosen for this analysis because they achieved the highest performance in our previous single-feature analysis.
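The cross-species evaluation can be sketched as a simple train-on-one, test-on-another loop; the load_enac_features helper below is hypothetical and stands in for the actual feature extraction and data loading used in this work.

```python
# A hedged sketch of the cross-species evaluation: a model trained on one species
# (ENAC features, RF classifier) is scored on every other species' testing set.
# load_enac_features is a hypothetical loader, not part of the published code.
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

species = ["BDGP6", "danRer10", "E_coli", "mm10", "sacCer3", "TAIR10"]
auc_matrix = {}
for train_sp, test_sp in product(species, repeat=2):
    X_tr, y_tr = load_enac_features(train_sp, split="train")   # hypothetical loader
    X_te, y_te = load_enac_features(test_sp, split="test")
    model = RandomForestClassifier().fit(X_tr, y_tr)
    auc_matrix[(train_sp, test_sp)] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_matrix[("danRer10", "TAIR10")])
```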

In Figure 7, pink indicates performance above the overall average, green indicates performance below it, and white indicates the average level. In theory, the best performance should lie on the diagonal, where the training and testing sets come from the same species, so the diagonal should appear in shades of pink. This matches our results: the diagonal cells are mostly pink, indicating above-average performance. The one exception is the mm10 dataset, whose diagonal cell is white (average performance), while the mm10 model performed best on the BDGP6 testing set. This is reasonable: the mm10 training set is of a larger scale than the original BDGP6 training set, and the differences between positive and negative samples are much more pronounced in BDGP6 than in mm10, so the mm10 model separated the BDGP6 testing samples more easily than its own testing set, in which the positive and negative samples differ less obviously.

It may also be noticed that the model trained on the danRer10 dataset worked well on most of the testing datasets. This may be because danRer10 is relatively large, which helps avoid under-fitting better than the other training sets; meanwhile, the typical substrate 'GAC' found in the danRer10 dataset was also found in the other testing datasets, indicating higher similarity with other species and leading to good performance when the testing dataset changed.

Another interesting finding was that the E. coli model achieved the best AUC on the TAIR10 dataset and very good AUC on BDGP6 and danRer10, which may have several explanations. First, the positive sample numbers of the E. coli and TAIR10 datasets involved in the cross-species test were similar, while the imbalance ratio was maintained in the negative datasets (Table 2). Moreover, these datasets share a similar 'G-ACU' motif pattern at positions [-2, 2] (Figure S10), which may be why the E. coli model also achieved high performance on the TAIR10 dataset: they are of similar sample size and motif substrate pattern. We hypothesize that the E. coli model worked well on the BDGP6 and danRer10 datasets because of the diversity of positive and negative samples in these two datasets; indeed, the differences between positive and negative samples in danRer10 and BDGP6 are very obvious compared with the other cases, which makes the positive and negative samples in these datasets easier to distinguish (Figure 5). Similarity in motif substrate pattern and dataset scale may also explain why the highest performance did not always occur on the diagonal. This also explains the results when a model trained on a smaller dataset was applied to a testing set obtained from a larger dataset: for instance, when the model from sacCer3, a dataset with around one thousand positive samples, was applied to testing datasets with around three or four thousand positive samples, the performance was lower than the average level. This is illustrated by the lower triangular part of the heatmap containing more green, while the upper triangular part contains more pink (Figure 7).

Comparison of existing methods

We chose some representative and currently accessible frameworks, SRAMP [14] and WHISTLE [25], to compare with our proposed methods. The deep learning methods DeepPromise [6], which uses only a CNN, and DeepOME [60], in which a CNN is combined with BLSTM, were also included in the comparison. Recently, capsule networks have become popular for disease-related prediction [61], PTM prediction [62, 63] and single-cell RNA-sequencing data analysis [64]; in this study, we therefore also applied a capsule network to the prediction of m6A sites as an updated deep learning method to compare with the adaptive learning method. The datasets for the chosen baseline methods were kept as their original species data, with the detailed comparison shown in Table 5. The results indicate that, in most cases, the Adaptive-m6A model achieves better performance than the conventional ML methods (SRAMP and WHISTLE), and it achieves a more outstanding performance than the method with a similar network structure (i.e. DeepPromise), as it shows a higher AUC value. The attention mechanism also provided good performance compared with DeepOME and performed well when larger data scales were involved.

Table 5

Performance comparison with the latest accessible methods

| Method name | Algorithm | Encoding scheme | Species | Data size | Sn | Sp | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46 992 (mature mRNA mode) | 0.44 | 0.90 | 0.29 | 0.794 |
| WHISTLE [25] | SVM | NCP, GNF, genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.90 | 0.49 | 0.83 |
| DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.90 | 0.252 | 0.76 |
| DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
| CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 |
| Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |

Conclusion

In this study, we conducted species-specific prediction of m6A modification in RNA sequences. ML methods such as SVM, RF and LR were adapted for m6A prediction, as well as a deep learning method based mainly on an adaptive learning model consisting of CNN, BLSTM and attention networks. This is the first time that data from so many different species have been included in one predictor, and it produced a promising performance. Moreover, we summarized the up-to-date predictor tools for m6A prediction, which is the most comprehensive survey currently available. Based on this multi-species prediction work, we have improved our understanding of the biological mechanisms of the RNA transcriptome and of mechanisms shared by different biological organisms (e.g. plant, animal and microbe). A motif analysis of different species was also conducted and identified similar patterns in different organisms, one interesting aspect of our results that is worth further analysis and research. Plenty of questions remain open; for instance, can the encoding schemes of RNA sequence-based features used in representative methods, which are similar to those used for peptide sequence features, reveal relationships between RNA modification and peptide post-translational modification? This is another topic that could benefit from the results presented here in future studies.

Key Points
  • We proposed an attention-based deep learning method, Adaptive-m6A, which consists of CNN, BLSTM and attention mechanisms, to identify m6A sites in multiple species.

  • Three conventional machine learning (ML) methods, support vector machine, random forest and logistic regression classifiers, were applied to datasets covering the most species categories to date.

  • Adaptive-m6A outperformed conventional ML methods in most of the species tested and achieved a promising performance.

  • Motif analysis and cross-validation among different species showed that the proposed method achieves high performance across various species, helping to clarify the sequence characteristics and biological mechanisms of the RNA transcriptome in different organisms (a minimal cross-species evaluation sketch follows this list).
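Below is a minimal sketch of how such a cross-species evaluation matrix can be organized (train on one species, test on every species). The loader returns synthetic placeholder data and the random-forest baseline is an illustrative assumption; the real encoded sequences and models are available from the repository listed under Data availability.

```python
# Minimal sketch of a cross-species evaluation matrix. `load_dataset` is a
# hypothetical placeholder returning synthetic features/labels; species names
# follow the genome assemblies used above.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

SPECIES = ["BDGP6", "danRer10", "E. coli", "mm10", "TAIR10", "sacCer3"]

def load_dataset(species, n_samples=500, n_features=164):
    """Placeholder loader: replace with real encoded sequences and 0/1 labels."""
    rng = np.random.default_rng(abs(hash(species)) % (2 ** 32))
    X = rng.normal(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)
    return X, y

def cross_species_auc():
    results = {}
    for train_sp, test_sp in product(SPECIES, repeat=2):
        X_tr, y_tr = load_dataset(train_sp)
        X_te, y_te = load_dataset(test_sp)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        results[(train_sp, test_sp)] = round(auc, 3)
    return results

if __name__ == "__main__":
    for pair, auc in cross_species_auc().items():
        print(pair, auc)
```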

Data availability

All the data involved in this research can be accessed at https://github.com/Moretta1/Adaptive-m6A.

Acknowledgments

The authors sincerely appreciate the funding provided to the Warshel Institute for Computational Biology by Shenzhen City and Longgang District, which financially supported this research.

Funding

This work was supported by the Ganghong Young Scholar Development Fund [2021E007], Guangdong Province Basic and Applied Basic Research Fund [2021A1515012447], National Natural Science Foundation of China [32070659], the Science, Technology and Innovation Commission of Shenzhen Municipality [JCYJ20200109150003938], and Shenzhen-Hong Kong Cooperation Zone for Technology and Innovation [HZQB-KCZYB-2020056].

Author Biographies

Rulan Wang is currently a PhD student at the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, under the supervision of Prof. Tzong-Yi Lee and Prof. Hsien-Da Huang. Her research interests include bioinformatics, big data analytics, and machine learning.

Chia-Ru Chung is a postdoctoral researcher at the Kobilka Institute of Innovative Drug Discovery, The Chinese University of Hong Kong, Shenzhen. Her research interests include bioinformatics, big data analytics, data mining, and machine learning.

Hsien-Da Huang is a presidential chair professor and associate dean of education at the School of Life and Health Sciences, and the executive deputy director of the Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research group focuses mainly on multi-disciplinary biological research topics, including bioinformatics, genomics, metagenomics and the microbiome, intelligent biomedical technologies (drug development, genetic testing and precision medicine), AI and machine learning, and biological database design and development.

Tzong-Yi Lee is currently a professor at the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University. His research interests include bioinformatics, multi-omics data analysis, single-cell sequencing and spatial transcriptomics.

References

1. Yang G, Sun Z, Zhang N. Reshaping the role of m6A modification in cancer transcriptome: a review. Cancer Cell Int 2020;20(1):1–10.
2. Tong J, Flavell RA, Li H-B. RNA m6A modification and its function in diseases. Front Med 2018;12(4):481–9.
3. Hao W, Zhang Y. Mechanisms and functions of TET protein-mediated 5-methylcytosine oxidation. Genes Dev 2011;25(23):2436–52.
4. Yang C, Yiyang H, Zhou B, et al. The role of m6A modification in physiology and disease. Cell Death Dis 2020;11(11):1–16.
5. Helm M, Motorin Y. Detecting RNA modifications in the epitranscriptome: predict and validate. Nat Rev Genet 2017;18(5):275–91.
6. Chen Z, Zhao P, Li F, et al. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform 2020;21(5):1676–96.
7. Chen W, Feng P, Ding H, et al. iRNA-Methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal Biochem 2015;490:26–33.
8. Chen W, Feng P, Ding H, et al. Identifying N6-methyladenosine sites in the Arabidopsis thaliana transcriptome. Mol Genet Genomics 2016;291(6):2225–9.
9. Jia C-Z, Zhang J-J, Wei-Zhen G. RNA-MethylPred: a high-accuracy predictor to identify N6-methyladenosine in RNA. Anal Biochem 2016;510:72–5.
10. Li G-Q, Liu Z, Shen H-B, et al. TargetM6A: identifying N6-methyladenosine sites from RNA sequences via position-specific nucleotide propensities and a support vector machine. IEEE Trans Nanobioscience 2016;15(7):674–82.
11. Liu Z, Xiao X, Dong-Jun Y, et al. pRNAm-PC: predicting N6-methyladenosine sites in RNA sequences via physical–chemical properties. Anal Biochem 2016;497:60–7.
12. Xiang S, Liu K, Yan Z, et al. RNAMethPre: a web server for the prediction and query of mRNA m6A sites. PLoS One 2016;11(10):e0162707.
13. Zhang M, Sun J-W, Liu Z, et al. Improving N6-methyladenosine site prediction with heuristic selection of nucleotide physical–chemical properties. Anal Biochem 2016;508:104–13.
14. Zhou Y, Zeng P, Li Y-H, et al. SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res 2016;44(10):e91.
15. Chen W, Xing P, Zou Q. Detecting N6-methyladenosine sites from RNA transcriptomes using ensemble support vector machines. Sci Rep 2017;7(1):1–8.
16. Feng P, Ding H, Yang H, et al. iRNA-PseColl: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Mol Ther Nucleic Acids 2017;7:155–63.
17. Xing P, Ran S, Guo F, et al. Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine. Sci Rep 2017;7(1):1–7.
18. Wei L, Chen H, Ran S. M6APred-EL: a sequence-based predictor for identifying N6-methyladenosine sites using ensemble learning. Mol Ther Nucleic Acids 2018;12:635–44.
19. Chen W, Ding H, Zhou X, et al. iRNA(m6A)-PseDNC: identifying N6-methyladenosine sites using pseudo dinucleotide composition. Anal Biochem 2018;561:59–65.
20. Huang Y, He N, Chen Y, et al. BERMP: a cross-species classifier for predicting m6A sites by integrating a deep learning algorithm and a random forest approach. Int J Biol Sci 2018;14(12):1669.
21. Qiang X, Chen H, Ye X, et al. M6AMRFS: robust prediction of N6-methyladenosine sites with sequence-based features in multiple species. Front Genet 2018;9:495.
22. Wang X, Yan R. RFAthM6A: a new tool for predicting m6A sites in Arabidopsis thaliana. Plant Mol Biol 2018;96(3):327–37.
23. Wei L, Ran S, Wang B, et al. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing 2019;324:3–9.
24. Zou Q, Xing P, Wei L, et al. Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA. RNA 2019;25(2):205–18.
25. Chen K, Wei Z, Zhang Q, et al. WHISTLE: a high-accuracy map of the human N6-methyladenosine (m6A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res 2019;47(7):e41.
26. Chen W, Tang H, Lin H. MethyRNA: a web server for identification of N6-methyladenosine sites. J Biomol Struct Dyn 2017;35(3):683–7.
27. Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform 2019;20(4):1280–94.
28. Song Z, Huang D, Song B, et al. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun 2021;12(1):1–11.
29. Li Z, Fang J, Wang S, et al. Adapt-Kcr: a novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture. Brief Bioinform 2022;23(2):bbac037.
30. Wang R, Wang Z, Wang H, et al. Characterization and identification of lysine crotonylation sites based on machine learning method on both plant and mammalian. Sci Rep 2020;10(1):1–12.
31. Xuan J-J, Sun W-J, Lin P-H, et al. RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res 2018;46(D1):D327–34.
32. Ao C, Zou Q, Liang Y. NmRF: identification of multispecies RNA 2′-O-methylation modification sites from RNA sequences. Brief Bioinform 2022;23(1):bbab480.
33. Limin F, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28(23):3150–2.
34. Zou Q, Lin G, Jiang X, et al. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform 2020;21(1):1–10.
35. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21(9):1263–84.
36. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21(3):1047–57.
37. Chen Z, Zhao P, Li C, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021;49(10):e60.
38. Vapnik VN, Lerner AYa. Recognition of patterns with help of generalized portraits. Avtomat i Telemekh 1963;24(6):774–80.
39. Vapnik VN. An overview of statistical learning theory. IEEE Trans Neural Netw 1999;10(5):988–99.
40. Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
41. Breiman L. Bagging predictors. Mach Learn 1996;24(2):123–40.
42. Langarizadeh M, Moghbeli F. Applying naive Bayesian networks to disease prediction: a systematic review. Acta Inform Med 2016;24(5):364.
43. Alzamzami F, Hoda M, El Saddik A. Light gradient boosting machine for general sentiment classification on short texts: a comparative evaluation. IEEE Access 2020;8:101840–58.
44. Bao W, Cui Q, Chen B, et al. Phage_UniR_LGBM: phage virion proteins classification with UniRep features and LightGBM model. Comput Math Methods Med 2022;2022:2022.
45. Chen Y-Z, Wang Z-Z, Wang Y, et al. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021;22(6):bbab146.
46. Wright RE. Logistic regression. 1995, 217–44.
47. Xiaojuan Y, Zhou J, Zhao M, et al. Exploiting XGBoost for predicting enhancer–promoter interactions. Curr Bioinform 2020;15(9):1036–45.
48. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
49. Greff K, Srivastava RK, Koutník J, et al. LSTM: a search space odyssey. IEEE Trans Neural Netw Learn Syst 2016;28(10):2222–32.
50. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inform Process Syst 2017;30.
51. Lin Z, Feng M, dos Santos CN, et al. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
52. Wiegreffe S, Pinter Y. Attention is not explanation. arXiv preprint arXiv:1908.04626, 2019.
53. Clark K, Khandelwal U, Levy O, et al. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341, 2019.
54. Zou Z, Tian S, Gao X, et al. mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet 2019;9:714.
55. Hong Z, Zeng X, Wei L, et al. Identifying enhancer–promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics 2020;36(4):1037–43.
56. Kao H-J, Huang C-H, Bretaña NA, et al. A two-layered machine learning method to identify protein O-GlcNAcylation sites with O-GlcNAc transferase substrate motifs. BMC Bioinformatics 2015;16(18):1–11.
57. Wang R, Wang Z, Li Z, et al. Residue–residue contact can be a potential feature for the prediction of lysine crotonylation sites. Front Genet 2021;12.
58. Li F, Li C, Marquez-Lago TT, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 2018;34(24):4223–31.
59. Wagih O, Sugiyama N, Ishihama Y, et al. Uncovering phosphorylation-based specificities through functional interaction networks. Mol Cell Proteomics 2016;15(1):236–45.
60. Li H, Chen L, Huang Z, et al. DeepOMe: a web server for the prediction of 2′-O-Me sites based on the hybrid CNN and BLSTM architecture. Front Cell Dev Biol 2021;9:1244.
61. Yang B, Bao W, Wang J. Active disease-related compound identification based on capsule network. Brief Bioinform 2022;23(1):bbab462.
62. Khanal J, Tayara H, Zou Q, et al. DeepCap-Kcr: accurate identification and investigation of protein lysine crotonylation sites based on capsule network. Brief Bioinform 2022;23(1):bbab492.
63. Bao W, Yang B, Chen B. 2-hydr_Ensemble: lysine 2-hydroxyisobutyrylation identification with ensemble method. Chemom Intel Lab Syst 2021;215:104351.
64. Bao S, Li K, Yan C, et al. Deep learning-based advances and applications for single-cell RNA-sequencing data analysis. Brief Bioinform 2022;23(1):bbab473.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)