-
PDF
- Split View
-
Views
-
Cite
Cite
Rulan Wang, Chia-Ru Chung, Hsien-Da Huang, Tzong-Yi Lee, Identification of species-specific RNA N6-methyladinosine modification sites from RNA sequences, Briefings in Bioinformatics, Volume 24, Issue 2, March 2023, bbac573, https://doi.org/10.1093/bib/bbac573
- Share Icon Share
Abstract
N6-methyladinosine (m6A) modification is the most abundant co-transcriptional modification in eukaryotic RNA and plays important roles in cellular regulation. Traditional high-throughput sequencing experiments used to explore functional mechanisms are time-consuming and labor-intensive, and most of the proposed methods focused on limited species types. To further understand the relevant biological mechanisms among different species with the same RNA modification, it is necessary to develop a computational scheme that can be applied to different species. To achieve this, we proposed an attention-based deep learning method, adaptive-m6A, which consists of convolutional neural network, bi-directional long short-term memory and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional machine learning (ML) methods, including support vector machine, random forest and logistic regression classifiers, were considered in this work. In addition to the performance of ML methods for multi-species prediction, the optimal performance of adaptive-m6A yielded an accuracy of 0.9832 and the area under the receiver operating characteristic curve of 0.98. Moreover, the motif analysis and cross-validation among different species were conducted to test the robustness of one model towards multiple species, which helped improve our understanding about the sequence characteristics and biological functions of RNA modifications in different species.
Introduction
N6-methyladenosine (m6A) is the most prevalent, abundant and conserved internal co-transcriptional modification in eukaryotic RNAs, and occurs in around 30% of transcripts, especially within eukaryotic cells [1]. Since the 1970s, m6A has served as a layer of epigenetic modification, with the additional discovery of demethylases in recent years [2]. m6A plays a broad and crucial role in almost all aspects of physiological behavior, including physiological regulation, which may impair gene expression and cellular function if disrupted [3]. Moreover, m6A is involved in many diseases such as cancer, psychiatric disorders and metabolic disease [4]. Owing to the advances in genomics and molecular biology, most currently available high-throughput experimental techniques exhibit tens of thousands of transcriptomes, which provides sufficient data for further understanding of different RNA modifications. Several well-established machine learning (ML) techniques have been developed for m6A modification sites prediction using the abundance of experimentally verified data [5]. Over 20 approaches have been developed to predict m6A sites [6]. These methods differ in a variety of aspects, including the benchmark dataset and sequence and/or structural descriptors used, physicochemical properties and genomics features employed, feature selection techniques and targeted RNA modification types.
Tool . | Method . | Publication year . | Feature selection method . | Feature encoding scheme . | Data scales . | URL accessibility . | Window size . | Species . | Reference . |
---|---|---|---|---|---|---|---|---|---|
iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S.cerevisiae | [7] |
M6ATH | SVM | 2016 | None | NCP,ANF | 394 | Inaccessible | 25 | A.thaliana | [8] |
RNA-MethylPred | SVM | 2016 | None | BPB,DNC, KNN score | 1307 | MATLAB package | 51 | S.cerevisiae | [9] |
TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307(Met2614) and 832(Train1664) | Accessible | 51(Met2614) and 21(Train1664) | S.cerevisiae | [10] |
pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S.cerevisiae | [11] |
RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29457(H.sapiens) and 31728(Mus musculus) | Inaccessible | 101 | H.sapiens and Mus musculus | [12] |
M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S.cerevisiae | [13] |
SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55706 (full transcript mode) and 46992 (mature mRNA mode) | Accessible | 251 | H.sapiens and Mus musculus | [14] |
RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S.cerevisiae | [15] |
iRNA-PseColl | SVM | 2017 | None | NCP,ANF | 1130 | Inaccessible | 41 | H.sapiens | [16] |
RAM-NPPS | SVM | 2017 | RFE,FSDI, MRMD | NPPS | 8366(H.sapiens), 1307(S.cerevisiae), 394(A.thaliana) | Inaccessible | 51 | H.sapiens, S.cerevisiae and A.thaliana | [17] |
M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S.cerevisiae | [18] |
iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S.cerevisiae | [19] |
BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53000 (mammalian full transcript) 44853 (mammalian mature mRNA) 1100 (S.cerevisiae) and 2100 (A.thaliana) | Inaccessible | 251 (Mammalian) 51 (S.cerevisiae) 101 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae A.thaliana | [20] |
M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S.cerevisiae), 1130 (H.sapiens), 725(Mus musculus) and 1000 (A.thaliana) | − | 51 (S.cerevisiae), 41 (H.sapiens), 41 (Mus musculus) and 25 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae and A.thaliana | [21] |
RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A.thaliana | [22] |
DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S.cerevisiae | [23] |
Gene2Vec | CNN | 2019 | None | One-hot, Neighboring methylation state, RNA word embedding, Gene2Vec | 56557 | Inaccessible | 1001 | H.sapiens and Mus musculus | [24] |
WHISTLE | SVM | 2022 | Perturb method | NCP, Genome-derived features | 20516, 17383 | Accessible | − | H.sapiens | [25] |
DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44901, 11656 and 5233 | Accessible | 1001 | H.sapiens and Mus musculus | [6] |
Tool . | Method . | Publication year . | Feature selection method . | Feature encoding scheme . | Data scales . | URL accessibility . | Window size . | Species . | Reference . |
---|---|---|---|---|---|---|---|---|---|
iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S.cerevisiae | [7] |
M6ATH | SVM | 2016 | None | NCP,ANF | 394 | Inaccessible | 25 | A.thaliana | [8] |
RNA-MethylPred | SVM | 2016 | None | BPB,DNC, KNN score | 1307 | MATLAB package | 51 | S.cerevisiae | [9] |
TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307(Met2614) and 832(Train1664) | Accessible | 51(Met2614) and 21(Train1664) | S.cerevisiae | [10] |
pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S.cerevisiae | [11] |
RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29457(H.sapiens) and 31728(Mus musculus) | Inaccessible | 101 | H.sapiens and Mus musculus | [12] |
M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S.cerevisiae | [13] |
SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55706 (full transcript mode) and 46992 (mature mRNA mode) | Accessible | 251 | H.sapiens and Mus musculus | [14] |
RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S.cerevisiae | [15] |
iRNA-PseColl | SVM | 2017 | None | NCP,ANF | 1130 | Inaccessible | 41 | H.sapiens | [16] |
RAM-NPPS | SVM | 2017 | RFE,FSDI, MRMD | NPPS | 8366(H.sapiens), 1307(S.cerevisiae), 394(A.thaliana) | Inaccessible | 51 | H.sapiens, S.cerevisiae and A.thaliana | [17] |
M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S.cerevisiae | [18] |
iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S.cerevisiae | [19] |
BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53000 (mammalian full transcript) 44853 (mammalian mature mRNA) 1100 (S.cerevisiae) and 2100 (A.thaliana) | Inaccessible | 251 (Mammalian) 51 (S.cerevisiae) 101 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae A.thaliana | [20] |
M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S.cerevisiae), 1130 (H.sapiens), 725(Mus musculus) and 1000 (A.thaliana) | − | 51 (S.cerevisiae), 41 (H.sapiens), 41 (Mus musculus) and 25 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae and A.thaliana | [21] |
RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A.thaliana | [22] |
DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S.cerevisiae | [23] |
Gene2Vec | CNN | 2019 | None | One-hot, Neighboring methylation state, RNA word embedding, Gene2Vec | 56557 | Inaccessible | 1001 | H.sapiens and Mus musculus | [24] |
WHISTLE | SVM | 2022 | Perturb method | NCP, Genome-derived features | 20516, 17383 | Accessible | − | H.sapiens | [25] |
DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44901, 11656 and 5233 | Accessible | 1001 | H.sapiens and Mus musculus | [6] |
Abbreviation in Feature selection method: IFS, incremental feature selection; RFE, recursive feature elimination; FSDI, feature selection based on discernibility and independence of a feature; MRMD, maximal relevance and maximal distance; SFS, sequence forward search. Abbreviation in Feature encoding scheme: PseDNC, pseudo dinucleotide composition; NCP, nucleotide chemical property; ANF, accumulated nucleotide frequency; BPB, bi-profile bayes; DNC, dinucleotide composition; KNN, k-nearest neighbour; PSNP, position-specific nucleotide propensity; PSDP, position-specific dinucleotide propensity; PCPM, physical-chemical property matrix; MFE, minimum free energy; AC, auto-covariance; CC, cross-covariance; NPPS, nucleotide pair position specificity; ENAC, Enhanced nucleic acid composition; DNC, di-nucleotide composition; PSNSP, position-specific nucleotide sequence profile; PSDSP, position-specific dinucleotide sequence profile; KSNPF, K-spaced nucleotide pair frequencies.
Tool . | Method . | Publication year . | Feature selection method . | Feature encoding scheme . | Data scales . | URL accessibility . | Window size . | Species . | Reference . |
---|---|---|---|---|---|---|---|---|---|
iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S.cerevisiae | [7] |
M6ATH | SVM | 2016 | None | NCP,ANF | 394 | Inaccessible | 25 | A.thaliana | [8] |
RNA-MethylPred | SVM | 2016 | None | BPB,DNC, KNN score | 1307 | MATLAB package | 51 | S.cerevisiae | [9] |
TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307(Met2614) and 832(Train1664) | Accessible | 51(Met2614) and 21(Train1664) | S.cerevisiae | [10] |
pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S.cerevisiae | [11] |
RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29457(H.sapiens) and 31728(Mus musculus) | Inaccessible | 101 | H.sapiens and Mus musculus | [12] |
M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S.cerevisiae | [13] |
SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55706 (full transcript mode) and 46992 (mature mRNA mode) | Accessible | 251 | H.sapiens and Mus musculus | [14] |
RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S.cerevisiae | [15] |
iRNA-PseColl | SVM | 2017 | None | NCP,ANF | 1130 | Inaccessible | 41 | H.sapiens | [16] |
RAM-NPPS | SVM | 2017 | RFE,FSDI, MRMD | NPPS | 8366(H.sapiens), 1307(S.cerevisiae), 394(A.thaliana) | Inaccessible | 51 | H.sapiens, S.cerevisiae and A.thaliana | [17] |
M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S.cerevisiae | [18] |
iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S.cerevisiae | [19] |
BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53000 (mammalian full transcript) 44853 (mammalian mature mRNA) 1100 (S.cerevisiae) and 2100 (A.thaliana) | Inaccessible | 251 (Mammalian) 51 (S.cerevisiae) 101 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae A.thaliana | [20] |
M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S.cerevisiae), 1130 (H.sapiens), 725(Mus musculus) and 1000 (A.thaliana) | − | 51 (S.cerevisiae), 41 (H.sapiens), 41 (Mus musculus) and 25 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae and A.thaliana | [21] |
RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A.thaliana | [22] |
DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S.cerevisiae | [23] |
Gene2Vec | CNN | 2019 | None | One-hot, Neighboring methylation state, RNA word embedding, Gene2Vec | 56557 | Inaccessible | 1001 | H.sapiens and Mus musculus | [24] |
WHISTLE | SVM | 2022 | Perturb method | NCP, Genome-derived features | 20516, 17383 | Accessible | − | H.sapiens | [25] |
DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44901, 11656 and 5233 | Accessible | 1001 | H.sapiens and Mus musculus | [6] |
Tool . | Method . | Publication year . | Feature selection method . | Feature encoding scheme . | Data scales . | URL accessibility . | Window size . | Species . | Reference . |
---|---|---|---|---|---|---|---|---|---|
iRNA-Methyl | SVM | 2015 | None | PseDNC, RNA property parameters | 1307 | Inaccessible | 51 | S.cerevisiae | [7] |
M6ATH | SVM | 2016 | None | NCP,ANF | 394 | Inaccessible | 25 | A.thaliana | [8] |
RNA-MethylPred | SVM | 2016 | None | BPB,DNC, KNN score | 1307 | MATLAB package | 51 | S.cerevisiae | [9] |
TargetM6A | SVM | 2016 | IFS | PSNP, PSDP, k-mer | 1307(Met2614) and 832(Train1664) | Accessible | 51(Met2614) and 21(Train1664) | S.cerevisiae | [10] |
pRNAm-PC | SVM | 2016 | None | PCPM | 1307 | Inaccessible | 51 | S.cerevisiae | [11] |
RNAMethPre | SVM | 2016 | None | binary, k-mer, relative position value, MFE | 29457(H.sapiens) and 31728(Mus musculus) | Inaccessible | 101 | H.sapiens and Mus musculus | [12] |
M6A-HPCS | SVM | 2016 | None | PCPM, PseDNC, AC, CC | 1307 | Accessible | 51 | S.cerevisiae | [13] |
SRAMP | RF | 2016 | None | binary, KNN score spectrum | 55706 (full transcript mode) and 46992 (mature mRNA mode) | Accessible | 251 | H.sapiens and Mus musculus | [14] |
RAM-ESVM | SVM | 2017 | None | PseDNC, motif | 1307 | Inaccessible | 51 | S.cerevisiae | [15] |
iRNA-PseColl | SVM | 2017 | None | NCP,ANF | 1130 | Inaccessible | 41 | H.sapiens | [16] |
RAM-NPPS | SVM | 2017 | RFE,FSDI, MRMD | NPPS | 8366(H.sapiens), 1307(S.cerevisiae), 394(A.thaliana) | Inaccessible | 51 | H.sapiens, S.cerevisiae and A.thaliana | [17] |
M6APred-EL | Ensemble SVM | 2018 | None | position-specific information, physical-chemical properties, ring-function-hydrogen-chemical properties | 1307 | Inaccessible | 51 | S.cerevisiae | [18] |
iRNA(m6A)-PseDNC | SVM | 2018 | None | PseDNC | 1307 | Inaccessible | 51 | S.cerevisiae | [19] |
BERMP | BGRU | 2018 | None | ENAC, RNA word embedding | 53000 (mammalian full transcript) 44853 (mammalian mature mRNA) 1100 (S.cerevisiae) and 2100 (A.thaliana) | Inaccessible | 251 (Mammalian) 51 (S.cerevisiae) 101 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae A.thaliana | [20] |
M6AMRFS | XGBoost | 2018 | SFS | DNC, binary, local position-specific dinucleotide frequency | 1307 (S.cerevisiae), 1130 (H.sapiens), 725(Mus musculus) and 1000 (A.thaliana) | − | 51 (S.cerevisiae), 41 (H.sapiens), 41 (Mus musculus) and 25 (A.thaliana) | H.sapiens, Mus musculus, S.cerevisiae and A.thaliana | [21] |
RFAthM6A | RF | 2018 | None | PSNSP, PSDSP, KSNPF, k-mer | 2518 | Accessible | 101 | A.thaliana | [22] |
DeepM6APred | SVM | 2019 | None | deep features and NPPS | 1307 | Inaccessible | 51 | S.cerevisiae | [23] |
Gene2Vec | CNN | 2019 | None | One-hot, Neighboring methylation state, RNA word embedding, Gene2Vec | 56557 | Inaccessible | 1001 | H.sapiens and Mus musculus | [24] |
WHISTLE | SVM | 2022 | Perturb method | NCP, Genome-derived features | 20516, 17383 | Accessible | − | H.sapiens | [25] |
DeepPromise | CNN | 2022 | None | ENAC, one-hot and RNA word embedding | 44901, 11656 and 5233 | Accessible | 1001 | H.sapiens and Mus musculus | [6] |
Abbreviation in Feature selection method: IFS, incremental feature selection; RFE, recursive feature elimination; FSDI, feature selection based on discernibility and independence of a feature; MRMD, maximal relevance and maximal distance; SFS, sequence forward search. Abbreviation in Feature encoding scheme: PseDNC, pseudo dinucleotide composition; NCP, nucleotide chemical property; ANF, accumulated nucleotide frequency; BPB, bi-profile bayes; DNC, dinucleotide composition; KNN, k-nearest neighbour; PSNP, position-specific nucleotide propensity; PSDP, position-specific dinucleotide propensity; PCPM, physical-chemical property matrix; MFE, minimum free energy; AC, auto-covariance; CC, cross-covariance; NPPS, nucleotide pair position specificity; ENAC, Enhanced nucleic acid composition; DNC, di-nucleotide composition; PSNSP, position-specific nucleotide sequence profile; PSDSP, position-specific dinucleotide sequence profile; KSNPF, K-spaced nucleotide pair frequencies.

Overall framework of the predictors used in this research. The process contains four main steps: data collection, feature investigation, model training and cross-validation, and independent testing. In the data collection step, datasets from six different species were collected from published research and databases for motif substrate discovery and feature investigation. In the feature investigation phase, primary sequence-derived features (NAC, DNC, TNC, ENAC and CKSNAP) and one nucleotide physiochemical property features (NCP) were adopted. In the model training and cross-validation phase, three conventional ML methods (SVM, LR and RF) were applied to the datasets and a deep learning method using an adaptive learning network (adaptive-m6A), and 5-fold cross-validation were applied for model assessment. Finally, independent testing was employed to assess the performance of the proposed method.
Recently, deep learning methods have been widely applied in different fields, as well as in RNA modification prediction [5, 8–10, 12, 17, 21, 22, 26, 27]. Table 1 lists some representative m6A prediction methods and illustrates that there are still some limitations of the available tools. First, most of them focus on only a few types of species, with no or limited information on other species. Second, ML method is still the first choice for m6A site prediction, and the deep learning method for these purposes is usually limited to a convolutional neural network (CNN) framework. Also, these methods are time-consuming and labor-intensive for the construction of an experimentally verified m6A database. Moreover, little work has been conducted to discuss the motif patterns around modified sites among different samples to show the potential relationships among different species with the same type of RNA post-transcriptional modification [28]. To overcome these obstacles and investigate m6A modification among various categories of species, we have developed a prediction model for m6A modification in multi-species with different frameworks. We proposed an attention-based deep learning method called adaptive-m6A, which consists of CNN, bi-directional long short-term memory (BLSTM) and an attention mechanism, to identify m6A sites in multiple species. In addition, three conventional ML methods were also investigated: support vector machine, random forest and logistic regression classifiers in details. A cross-species validation was also conducted to evaluate whether one model from a certain species could work well using data from other species [29, 30]. A comparison with other representative methods was also listed as part of our report. To the best of our knowledge, this is the first work to use data from several species, which could help to reveal whether there are the same type of RNA modification from the perspective of their associated sequence contexts among different species [28]. The overall flowchart of this research has been included in Figure 1.
Method and materials
Data collection and preparation
The dataset used for training and independent testing was collected from published work and the RMBase database [31]. Data from six different species with m6A modified sites were obtained for Drosophila melanogaster (BDGP6), Danio rerio (zebrafish, danRer10), Escherichia coli str.K-12 substr.MG1655 (E. coli), Mus musculus (mm10), Saccharomyces cerevisiae S288C (sacCer3) and Arabidopsis thaliana (TAIR10). Collecting the datasets involved the following steps: first downloading the RNA sequence that contained the m6A modification sites and its position from the RMBase database. This first step transfers sequences from RMBase to the RNA sequence by replacing the T entries to U, as those sequences obtained from the RMBase database is in the form of DNA [32]. The datasets for species M. musculus and A. thaliana were downloaded from previous studies [14, 22]. To obtain a high-quality dataset, a CD-HIT with a 70% identity cut-off value [33] was used to remove redundancy and homology deviation [32, 34]. Then, we divided those datasets with a ratio of 7:3 for the training and testing sets, respectively. To avoid the effect caused by any imbalance in the sample number of positive and negative datasets, a random under sampler was employed to make the sample number equivalent in positive and negative sets [35]. The detailed number of final datasets for each individual species is listed in Table 2 and the data involved in this research could be accessed at https://github.com/Moretta1/Adaptive-m6A.
Group . | Species . | Assembly . | Training set . | Testing set . | ||
---|---|---|---|---|---|---|
. | . | . | Positive . | Negative . | Positive . | Negative . |
Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
Vertebrate | Danio rerio(zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
Bacteria | Escherichia coli(E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
Mammal | Mus musculus (House mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |
Group . | Species . | Assembly . | Training set . | Testing set . | ||
---|---|---|---|---|---|---|
. | . | . | Positive . | Negative . | Positive . | Negative . |
Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
Vertebrate | Danio rerio(zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
Bacteria | Escherichia coli(E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
Mammal | Mus musculus (House mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |
Group . | Species . | Assembly . | Training set . | Testing set . | ||
---|---|---|---|---|---|---|
. | . | . | Positive . | Negative . | Positive . | Negative . |
Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
Vertebrate | Danio rerio(zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
Bacteria | Escherichia coli(E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
Mammal | Mus musculus (House mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |
Group . | Species . | Assembly . | Training set . | Testing set . | ||
---|---|---|---|---|---|---|
. | . | . | Positive . | Negative . | Positive . | Negative . |
Insect | D. melanogaster | BDGP6 | 4714 | 4714 | 2014 | 20 504 |
Vertebrate | Danio rerio(zebrafish) | danRer10 | 30 109 | 30 109 | 12 916 | 15 645 |
Bacteria | Escherichia coli(E. coli) | ASM584v2 | 1522 | 1522 | 650 | 1546 |
Mammal | Mus musculus (House mouse) | mm10 | 31 120 | 31 120 | 13 325 | 131 418 |
Fungi | Saccharomyces cerevisiae S288C | sacCer3 | 1829 | 1829 | 785 | 1044 |
Plant | Arabidopsis thaliana | TAIR10 | 3523 | 3523 | 1510 | 2013 |
Feature encoding
In this research, primary sequence-derived features and nucleotide physicochemical properties [6] were selected to obtain sequence feature vectors as these can be directly obtained from the sequence textures and requires no extra third-party tools. The encoding methods were employed mainly based on toolkits iLearn [36] and iLearnplus [37].
Primary sequence-derived features
Composition of k-spaced nucleotide acid pair The composition of k-spaced nucleotide acid pairs (CKSNAP) is a criterion that is wildly used in the field for analyzing protein functions. When an integer k has been fixed, the number of k-spaced nucleotide acid pairs are determined as by the gap between two neighboring nucleotide ranges from 0 to the given integer k. The CKSNAP indicates the frequency of nucleotide pairs separated by any k composition. In this study, k was 5, which should range from 0 to 5. Because CKSNAP gives the same result as DNC when k = 0, only cases in which k ranges from 1 to 5 were considered in this study, with a total of 48 feature dimensions for a single sequence. Figure S2 in the supplementary file shows the comparison of CKSNAP in positive and negative samples of all datasets.

The structure of the deep learning method contains two parts: the embedding module and task specific module. In the embedding module, RNA sequences of fixed lengths were converted into a numeric matrix vector and used as an input for the task specific module. The task specific module contains a CNN layer for extracting features, a max-pooling layer for speeding up the computation, BLSTM and attention layers, and fully connect layer for output result.
Nucleotide physicochemical properties
Nucleotide chemical properties Each of the four different kinds of nucleotides has different chemical structure and binding properties (Table 3), and this allows for the encoding scheme of nucleotide chemical properties (NCP). For instance, “A” would be encoded as (111) and G would be encoded as (100). NCP would generate a dimension feature of number 3*21, which equals 63.
Chemical property . | Class . | Label . | Nucleotides . |
---|---|---|---|
Ring structure | Purine | 1 | A, G |
Pyrimidine | 0 | C, U | |
Functional group | Amino | 1 | A, C |
Keto | 0 | G, U | |
Hydrogen bond | Strong | 0 | C, G |
Weak | 1 | A, U |
Chemical property . | Class . | Label . | Nucleotides . |
---|---|---|---|
Ring structure | Purine | 1 | A, G |
Pyrimidine | 0 | C, U | |
Functional group | Amino | 1 | A, C |
Keto | 0 | G, U | |
Hydrogen bond | Strong | 0 | C, G |
Weak | 1 | A, U |
Chemical property . | Class . | Label . | Nucleotides . |
---|---|---|---|
Ring structure | Purine | 1 | A, G |
Pyrimidine | 0 | C, U | |
Functional group | Amino | 1 | A, C |
Keto | 0 | G, U | |
Hydrogen bond | Strong | 0 | C, G |
Weak | 1 | A, U |
Chemical property . | Class . | Label . | Nucleotides . |
---|---|---|---|
Ring structure | Purine | 1 | A, G |
Pyrimidine | 0 | C, U | |
Functional group | Amino | 1 | A, C |
Keto | 0 | G, U | |
Hydrogen bond | Strong | 0 | C, G |
Weak | 1 | A, U |
RNA sequence embedding
Model construction
Machine learning method
From the previous survey of RNA modification site prediction, the ML algorithms used for RNA modification site research included SVM [38, 39], RF [40], bagging [41], naive Bayesian network [42], light gradient boosting machine (LGBM) [43, 44], logistic regression (LR) [45, 46], extreme gradient boosting (XGBoost) [47], etc. In this study, we selected three commonly used ML algorithms to screen the optimal algorithm, including RF, LR and SVM, which were implemented using the Scikit-Learn toolkit [48]. The three ML algorithms are initially screened using the default parameters to obtain the top two optimal classifiers for parameter optimization and confirm the best classifier of the optimal model.
Deep learning method
The deep learning framework contains two main modules. First, an adaptive embedding module was used to map the nucleotide composition onto numerical vectors according to certain position information and token value from a lookup table [24, 29]. Second, a task-specific layer consisting of CNN, BLSTM, attention and full connection layers, in which the CNN layer extracted high-level features from the sequence. BLSTM is a specific recurrent neural network (RNN) that consists of a set of recurrently connected blocks and enhanced the learning ability of LSTM for long-distance dependency. Compared with unidirectional LSTM, BLSTM better captures the information of sequence context [49]. In addition to CNN and BLSTM, the attention mechanism can also capture positional information. The attention mechanism was originally proposed for solving machine translation tasks [50] and has shown to distinguish between more and less important information [51]. In the field of natural language processing and image identification, an increasing number of attention mechanism-related studies have shown the interpretability of the model [52, 53]. The attention mechanism has been shown to achieve a competitive performance in a wide range of biological sequence analysis problems [54, 55], and was therefore used in this study to identify the key information that affects m6A site prediction. A flowchart in Figure 2 shows the whole workflow of the adaptive learning model.
Evaluation metrics
Moreover, in addition to these evaluation indicators, by setting the sensitivity and 1-specificity to the x- and y-axes, respectively, the receiver operating characteristic curve (ROC curve) was generated, which was another indicator for evaluating the performance of the proposed method [58]. The closer the value of area under ROC is to 1, the better the prediction performance of the model and the higher the accuracy.
Results
Performance comparison of machine learning classifiers
For the ML classifier in this study, species-specific evaluation indicators for the three different classifiers were summarized and visualized in a bar chart (Figure 3) with the performance criteria of three ML methods used. Table S4 in the supplementary file provides table-formed details of the ML method performances, while the ROC and PRC curves for these models evaluated using independent testing are shown in Figure S3–S8. The performances indicated that in each species data, RF tended to achieve a higher performance when compared with two other classifiers. SVM was efficient in most of the species tested. In some cases, for instance, the species mm10, the performances of all the three classifiers tends had a similar level of accuracy around 0.6–0.7 and AUC between 0.55 and 0.75. Since the performance of the ML methods in sacCer3 and TAIR10 datasets were quite close and not as promising as the other small-scaled datasets that had less than 10 thousand sample numbers, they were chosen for subsequent optimization and comparison with feature selection.

Performance of ML method classifier (A) SVM, (B) RF and (C) LR in individual datasets. Detailed values in each individual classifier in table form have been listed in Table S1 in supplementary files.

Performance of optimal subset selection with feature selection method CHI2 and the optimal performance for the different species (A) mm10, (B) TAIR10 and (C) sacCer3, respectively. The TOP100 dimensions of features, which are the top 100 highest Chi2 values, correspond to feature dimensions. The weight of each feature was selected for among the total selected dimensions is presented. The bar chart shows the criteria for the original incorporated features (before feature selection) and the selected features.
Compared the three conventional ML classifiers, which achieved a higher performance in the BDGP6 species, the SVM classifier achieved the highest performance, while the mm10 dataset did not show a promising performance. This may have been caused by the differences between positive and negative samples that were much more obvious in the BDGP6 dataset. Overall, the performance of the RF classifier yielded a higher level compared with the other two classifiers. However, compared with other datasets, the RF classifier had a more biased performance in the danRer10 dataset.
Performance with hybrid features on feature selection methods
To further determine the optimal classifier and select the optimal feature subset, we adopted a two-step strategy. First, a feature selection technique, the chi-squared (CHI2) method was applied to calculate the feature importance values in different ways and then sort each dimension of the hybrid feature set according to these values (chi2 values were shown in Figure 4). Second, for the sorted hybrid feature list obtained by different feature selection methods, we selected the top 100 feature dimensions (listed in Figure 4 for the three datasets according to their chi2 values ranking from the highest to the lowest) and then selected for dimensionality of the optimal feature subset using the RF classifier through the IFS strategy. To determine the optimal classifier, the AUC of the RF and SVM classifiers were compared based on the feature selection methods, which shows the comparison of AUC values of TAIR10 (A. thaliana), mm10 (M. musculus) and sacCer3 (S. cerevisiae) respectively in Figure 4. Based on the feature selection method on three datasets for optimal subset selection, we compared the performance of the hybrid features and the selected top 100 feature dimensions with highest importance, the performances were shown in the histogram of Figure 4.
Performance on deep learning method
In addition to the ML method, we also applied an adaptive learning network using a species collection of more than 10 000 sample numbers, which was suitable for a deep learning framework (Table 4). A box-plot was constructed for the comparison of performances among the three ML methods, and indicated that the deep learning method achieved a better performance on average when compared with the conventional ML classifiers, and was extremely successful when larger datasets were used, such as for species mm10 and danRer10 (Figure S9). For the relatively smaller datasets, like sacCer3 and TAIR10 for instance, adaptive learning yielded a less significant performance when compared with the ML methods, which may have been caused by the limitation in data.
Performance of proposed adaptive learning method and comparison with ML methods. The proposed adaptive learning method is based on RNA embedding feature. In each ML classifier, the optimal performance of all the conventional features was listed for the comparison, the detailed performances of all the conventional features in each individual ML method have been listed in the supplementary file, Table S1. The optimal performance has been highlighted on bold
Species . | Methods . | Sn . | Sp . | ACC . | MCC . | AUC . | Prec . | F1 . |
---|---|---|---|---|---|---|---|---|
BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
RF | 0.96 | 0.92 | 0.92 | 0.7 | 0.96 | 0.59 | 0.68 | |
LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 | |
Ada | 0.99 | 1 | 1 | 0.97 | 1 | 0.95 | 0.97 | |
danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
RF | 0.93 | 0.88 | 0.9 | 0.81 | 0.95 | 0.89 | 0.91 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 | |
E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.7 | 0.78 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 | |
sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.8 | 0.76 |
RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.8 | 0.73 | 0.75 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.77 | 0.73 | 0.75 | 0.5 | 0.81 | 0.74 | 0.75 | |
TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.5 | 0.82 | 0.75 | 0.75 |
RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 | |
mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.97 | 0.13 | 0.2 | 0.09 | 0.79 | 0.49 | 0.14 |
Species . | Methods . | Sn . | Sp . | ACC . | MCC . | AUC . | Prec . | F1 . |
---|---|---|---|---|---|---|---|---|
BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
RF | 0.96 | 0.92 | 0.92 | 0.7 | 0.96 | 0.59 | 0.68 | |
LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 | |
Ada | 0.99 | 1 | 1 | 0.97 | 1 | 0.95 | 0.97 | |
danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
RF | 0.93 | 0.88 | 0.9 | 0.81 | 0.95 | 0.89 | 0.91 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 | |
E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.7 | 0.78 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 | |
sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.8 | 0.76 |
RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.8 | 0.73 | 0.75 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.77 | 0.73 | 0.75 | 0.5 | 0.81 | 0.74 | 0.75 | |
TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.5 | 0.82 | 0.75 | 0.75 |
RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 | |
mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.97 | 0.13 | 0.2 | 0.09 | 0.79 | 0.49 | 0.14 |
Performance of proposed adaptive learning method and comparison with ML methods. The proposed adaptive learning method is based on RNA embedding feature. In each ML classifier, the optimal performance of all the conventional features was listed for the comparison, the detailed performances of all the conventional features in each individual ML method have been listed in the supplementary file, Table S1. The optimal performance has been highlighted on bold
Species . | Methods . | Sn . | Sp . | ACC . | MCC . | AUC . | Prec . | F1 . |
---|---|---|---|---|---|---|---|---|
BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
RF | 0.96 | 0.92 | 0.92 | 0.7 | 0.96 | 0.59 | 0.68 | |
LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 | |
Ada | 0.99 | 1 | 1 | 0.97 | 1 | 0.95 | 0.97 | |
danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
RF | 0.93 | 0.88 | 0.9 | 0.81 | 0.95 | 0.89 | 0.91 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 | |
E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.7 | 0.78 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 | |
sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.8 | 0.76 |
RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.8 | 0.73 | 0.75 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.77 | 0.73 | 0.75 | 0.5 | 0.81 | 0.74 | 0.75 | |
TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.5 | 0.82 | 0.75 | 0.75 |
RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 | |
mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.97 | 0.13 | 0.2 | 0.09 | 0.79 | 0.49 | 0.14 |
Species . | Methods . | Sn . | Sp . | ACC . | MCC . | AUC . | Prec . | F1 . |
---|---|---|---|---|---|---|---|---|
BDGP6 | SVM | 0.92 | 0.92 | 0.91 | 0.71 | 0.96 | 0.66 | 0.73 |
RF | 0.96 | 0.92 | 0.92 | 0.7 | 0.96 | 0.59 | 0.68 | |
LR | 0.92 | 0.90 | 0.90 | 0.62 | 0.92 | 0.50 | 0.60 | |
Ada | 0.99 | 1 | 1 | 0.97 | 1 | 0.95 | 0.97 | |
danRer10 | SVM | 0.91 | 0.86 | 0.89 | 0.77 | 0.93 | 0.85 | 0.88 |
RF | 0.93 | 0.88 | 0.9 | 0.81 | 0.95 | 0.89 | 0.91 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.99 | 0.96 | 0.97 | 0.95 | 0.99 | 0.95 | 0.97 | |
E. coli | SVM | 0.86 | 0.85 | 0.85 | 0.68 | 0.89 | 0.72 | 0.78 |
RF | 0.89 | 0.82 | 0.84 | 0.67 | 0.89 | 0.7 | 0.78 | |
LR | 0.88 | 0.84 | 0.86 | 0.71 | 0.88 | 0.82 | 0.85 | |
Ada | 0.75 | 0.97 | 0.91 | 0.77 | 0.98 | 0.92 | 0.83 | |
sacCer3 | SVM | 0.72 | 0.82 | 0.77 | 0.54 | 0.82 | 0.8 | 0.76 |
RF | 0.77 | 0.72 | 0.74 | 0.49 | 0.8 | 0.73 | 0.75 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.77 | 0.73 | 0.75 | 0.5 | 0.81 | 0.74 | 0.75 | |
TAIR10 | SVM | 0.75 | 0.76 | 0.75 | 0.5 | 0.82 | 0.75 | 0.75 |
RF | 0.73 | 0.71 | 0.72 | 0.44 | 0.79 | 0.72 | 0.72 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.73 | 0.61 | 0.67 | 0.34 | 0.71 | 0.66 | 0.69 | |
mm10 | SVM | 0.68 | 0.69 | 0.69 | 0.29 | 0.75 | 0.31 | 0.42 |
RF | 0.70 | 0.70 | 0.70 | 0.25 | 0.76 | 0.19 | 0.30 | |
LR | 0.67 | 0.67 | 0.67 | 0.34 | 0.67 | 0.67 | 0.67 | |
Ada | 0.97 | 0.13 | 0.2 | 0.09 | 0.79 | 0.49 | 0.14 |
Motif analysis of modified m6A sequences
Since NAC was the most widely used sequence-based feature for exploring the motif around the m6A site, comparing the NAC between positive and negative datasets could be useful for m6A site identification. Our results indicated the occurrence of modified and non-modified m6A sequences in the six datasets, in which the positively charged nucleotide residues (A and C) appeared to have the highest frequency around the substrate sites (Figure 5). When we compared the NAC results between positive and negative datasets, the differences among each NAC feature seemed to be much larger in the BDGP6 dataset than other datasets (Figure 5), which suggested that in this species, the differences among positive and negative samples with respect to NAC features were more obvious than those in the other species. Additionally, the mm10 had NAC features with the least differences among positive and negative datasets, suggesting the performance in the mm10 dataset was not as successful when compared with the other datasets tested.

Visualization of feature NAC, which reflects the frequency of each individual nucleotide component in the positive and negative datasets of different species. The orange and blue colors correspond to positive and negative samples, respectively.

The successive motif patterns found using tool MotifX and the experimentally verified motifs generated from a positive dataset and motif generation by TwoSampleLogo. In each subfigure, the experimental panel shows the pattern visualization of positive sequences generated by TwoSampleLogo, the predicted panel shows the visualization of predicted substrates generated by MotifX, and the predicted top five substrates and their matched numbers are listed in the motifs panel.

Heatmap of the performance of the cross-species validation. The AUC values were chosen for visualization, with detailed criteria outlined in the supplementary fileTable S2. The white colored bar represents the average value, while the pink and green bars indicate higher and lower AUC values compared to the average, respectively, with the darker colors corresponding to a higher discrepancy.
Figure 5 has indicated the occurrence of each nucleotide composition in m6A modified and non-modified sequences in the six datasets, and for m6A sites, the positively charged nucleotide residue (A and C) appeared to have the highest frequency around the substrate sites. From comparing the NAC between positive and negative datasets illustrated in Figure 5, the differences among each NAC seem to be much larger in the BDGP6 dataset, which means that in this dataset, the differences among positive and negative samples with respect to NAC feature are more obvious than the other species, while mm10 tends to have the NAC feature with the least differences among positive and negative, that is the main reason that the performance in the mm10 dataset is not that outstanding compared with other datasets.
In addition to the visualization of m6A motifs, we wondered whether any significant patterns occurred in different species. Hence, we used position-specific NAC neighboring to visualize the nucleotide residue composition around the modified m6A sites in each species using frequency plots in TwoSampleLogo (Figure 6). We also used MotifX [59], an R-package tool designed to extract overrepresented patterns from sequence datasets that was originally used for peptide sequence analysis, to predict the successive substrate that frequently occurred near the modified sites in each individual dataset.
We found that there were some successive motifs that occurred in different species, indicating that similar patterns tended to be important in different species and significantly contributed to the same type of modification (Figure 6). For instance, the ‘GAC’ pattern could be found in almost every species. The motif pattern of different species might be similar for the same modification, in species BDGP6 and E. coli, which had showed high similarity, as well as danRer10 and E. coli, which also shared a similar motif pattern even though one is an animal and the other is a microbe. This result was very interesting that different species shared a similar pattern and should be further researched.
Cross-species performance evaluation
Based on the previous analysis, we have found that some substrates might be similar among different m6A modified sites in different species, hence we conducted a cross-validation between two different species. We applied the dataset danRer10 as the training set to obtain a model and employed it onto the rest categories of datasets in the independent testing stage. A heatmap was used to illustrate the cross-species validation (Figure 7). The feature ENAC and classifier RF were chosen for this cross-species analysis as they achieved the highest performance in our previous analysis of single features.
We use the color pink for performance higher than the average level, green for performance lower than the average level and white means the average level in the overall cases. Theoretically the best performance should be found at the diagonal line as it is the pairwise training model and testing set with the same species, that means that we should see different pink colors in the diagonal line. This matches the result that we have got, as it can be notice that the color of the diagonal line are mostly of pink color, which means performance above the average level. For one special occurred in the mm10 dataset, the diagonal color is white, which means the average performance, while the best performances on the mm10 model occurred in the BDGP6 testing set. This is reasonable because compared with the rest testing datasets, mm10 testing set has shown a good performance, the reason that best result achieved in BDGP6 dataset might own to the larger differences in BDGP6 dataset, and mm10 is of larger data scale compared with the original BDGP6 dataset, which then achieved a better performance when compared with mm10 testing dataset, which is of less obvious differences among positive and negative dataset.
It may be notice that when choosing danRer10 dataset as the training set, the model we have obtained could work well in most of the testing datasets. That may be caused by that danRer10 dataset is of relatively larger quantity, which could avoid the under-fitting case better thatn the other testing datasets, meanwhile the typical substrate ‘GAC’ in the danRer10 dataset was found in the other testing datasets, which means higher similarity with other species and leads to good performances when the testing dataset changed.
Another interesting finding was that the E. coli model achieved the best AUC for TAIR10 datasets and very good AUC for BDGP6 and danRer10, which may have been caused by several reasons. First, the sample number of E. coli and TAIR10 datasets involved in the cross-species test were similar in positive datasets, while in negative datasets the imbalance ratio was maintained (Table 2). Moreover, these datasets had similar motif pattern in position [–2,2] ‘G-ACU’ patterns (Figure S10), that may be why the E. coli model also achieved a high performance with the TAIR10 dataset, as they were of similar sample number level and motif substrate pattern. We hypothesized that the E. coli model worked well with the BDGP6 and danRer10 datasets because of the diversity of positive and negative samples in these two datasets. Indeed, we noticed that the differences among positive and negative samples in danRer10 and BDGP6 was very obvious compared with the other cases, which made it easier to distinguish the the positive and negative samples among these datasets (Figure 5). Also, similarity in the motif substrate pattern and dataset scale may have also been why the highest performance did not always occur in the diagonal line. Additionally, this could explain the results from when we applied a model from a smaller dataset onto a testing set obtained from a larger dataset. For instance, when the model from sacCer3, which is a dataset with around one thousand positive samples, was used with other testing datasets that had around three or four thousands positive samples. These tests produced a performance that was lower than the average level, illustrated by the lower triangular part of the graph that included more green, while the upper triangular part contained more pink (Figure 7).
Comparison of existing methods
We have chosen some representative and newly accessible frameworks to compare the SRAMP [14], and WHISTLE [25] tools with our proposed methods. The deep learning method DeepPromise [6], which is with only CNN, and DeepOME [60], in which CNN is incorporated with BLSTM, were also used in our comparison. Recently, a capsule network has been a popular method for disease-related prediction [61], PTM prediction [62, 63] and single-cell RNA-sequencing data analysis [64]. In this study, we have also tried to apply the capsule network to the prediction of m6A sites as an updated deep learning method to compare with an adaptive learning method. The datasets for these chosen baseline methods were maintained as original species data, with the detailed comparison shown in Table 5. These results indicated that in most cases, the Adaptive-m6A model could achieve a better performance compared with the conventional ML methods (SRAMP and WHISTLE), it could achieve a more outstanding performance when compared with the method of similar network structure (i.e. DeepPromise) as it shows higher AUC value. The attention mechanism has provided good performance with compared with method DeepOME, and provides high performance when larger data scale involved.
Method name . | Algorithm . | Encoding scheme . | Species . | Datasize . | Sn . | Sp . | MCC . | AUC . |
---|---|---|---|---|---|---|---|---|
SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46992 (mature mRNA mode) | 0.44 | 0.9 | 0.29 | 0.794 |
WHISTLE [25] | SVM | NCP,GNF, Genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.9 | 0.49 | 0.83 |
DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.9 | 0.252 | 0.76 |
DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 | |||
E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 | |||
mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 | |||
TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 | |||
sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |
Method name . | Algorithm . | Encoding scheme . | Species . | Datasize . | Sn . | Sp . | MCC . | AUC . |
---|---|---|---|---|---|---|---|---|
SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46992 (mature mRNA mode) | 0.44 | 0.9 | 0.29 | 0.794 |
WHISTLE [25] | SVM | NCP,GNF, Genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.9 | 0.49 | 0.83 |
DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.9 | 0.252 | 0.76 |
DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 | |||
E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 | |||
mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 | |||
TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 | |||
sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |
Method name . | Algorithm . | Encoding scheme . | Species . | Datasize . | Sn . | Sp . | MCC . | AUC . |
---|---|---|---|---|---|---|---|---|
SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46992 (mature mRNA mode) | 0.44 | 0.9 | 0.29 | 0.794 |
WHISTLE [25] | SVM | NCP,GNF, Genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.9 | 0.49 | 0.83 |
DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.9 | 0.252 | 0.76 |
DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 | |||
E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 | |||
mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 | |||
TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 | |||
sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |
Method name . | Algorithm . | Encoding scheme . | Species . | Datasize . | Sn . | Sp . | MCC . | AUC . |
---|---|---|---|---|---|---|---|---|
SRAMP [14] | RF | one-hot, KNN score spectrum | H. sapiens, mm10 | 55 706 (full transcript mode) and 46992 (mature mRNA mode) | 0.44 | 0.9 | 0.29 | 0.794 |
WHISTLE [25] | SVM | NCP,GNF, Genome-derived features | H. sapiens | 20 516, 17 383 | 0.62 | 0.9 | 0.49 | 0.83 |
DeepPromise [6] | CNN | ENAC, one-hot and RNA word embedding | H. sapiens and Mus musculus | 44 901, 11 656 and 5233 | 0.39 | 0.9 | 0.252 | 0.76 |
DeepOME [60] | CNN+BLSTM | one-hot | H. sapiens | 3052 | 0.97 | 1.00 | 0.93 | 0.99 |
CapNetwork | CNN+BLSTM+CapsuleNetwork | RNA word embedding | mm10 | 207 010 | 0.12 | 0.95 | 0.13 | 0.61 |
Adaptive-M6A | CNN+BLSTM+attention | RNA word embedding | BDGP6 | 31 952 | 0.99 | 1.00 | 0.97 | 1.00 |
danRer10 | 88 779 | 0.99 | 0.96 | 0.95 | 0.99 | |||
E. coli | 5240 | 0.75 | 0.97 | 0.77 | 0.98 | |||
mm10 | 207 010 | 0.97 | 0.13 | 0.09 | 0.79 | |||
TAIR10 | 10 569 | 0.73 | 0.61 | 0.34 | 0.71 | |||
sacCer3 | 5487 | 0.77 | 0.73 | 0.50 | 0.81 |
Conclusion
In this study, we have conducted species-specific prediction of m6A modification in RNA sequences. ML methods such as SVM, RF and LR for m6A prediction methods were adapted, as well as deep learning methods that mainly used adaptive learning model that consisted of CNN, BLSTM and attention network. This is the first time that such data from different species were included in one predictor, which produced a promising performance. Moreover, we have summarized the up-to-date predictor tools on m6A prediction, which is the most comprehensive work currently available. Based on this multi-species prediction work, we have improved our understanding of biological mechanisms of the RNA transcriptome and common mechanisms found in different biological organisms (e.g. plant, animal and microbe). Motif analysis of different species was also conducted, which identified similar patterns in different biological organisms. This is one interesting aspect of our results that is worth further analysis and research. There are still plenty of questions for discussion; for instance, can the encoding scheme of RNA sequence-based features used in representative methods similar to those used to identify features in peptide sequences, identify relationships between RNA modification and peptide post-transitional modification? This is another topic that could benefit from our presented results in future studies.
We proposed an attention-based deep learning method, Adaptive-m6A, which consists of CNN, BLSTM and attention mechanisms, to identify m6A sites in multiple species.
Three conventional machine learning (ML) methods, support vector machine, random forest and logistic regression classifiers, were used with datasets with the most categories of species data.
Adaptive-m6A outperformed conventional ML methods in most of the species tested and achieved a promising performance.
Motif analysis and a cross-validation among different species showed that our proposed method could achieve a high performance among various species, which helps gain a better understanding about the sequence characteristics and biological mechanisms of the RNA transcriptome across different species.
Data availability
All the data involved in this esearch could be accessed at https://github.com/Moretta1/Adaptive-m6A.
Acknowledgments
The authors sincerely appreciate the Warshel Institute for Computational Biology funding from Shenzhen City and Longgang District for financially supporting this research.
Funding
This work was supported by the Ganghong Young Scholar Development Fund [2021E007], Guangdong Province Basic and Applied Basic Research Fund [2021A1515012447], National Natural Science Foundation of China [32070659], the Science, Technology and Innovation Commission of Shenzhen Municipality [JCYJ20200109150003938], and Shenzhen-Hong Kong Cooperation Zone for Technology and Innovation [HZQB-KCZYB-2020056].
Author Biographies
Rulan Wang is currently a PhD student at School of Sciences and Engineering, The Chinese University of Hong Kong, Shenzhen under the supervision of Prof. Tzong-Yi Lee and Prof. Hsien-Da Huang. Her research interests include bioinformatics, big data analytics,and machine learning.
Chia-Ru Chung is a postdoc in Kobilka Institute of Innovative Drug Discovery, The Chinese University of Hong Kong, Shenzhen. Her research interests include bioinformatics, big data analytics, data mining, and machine learning.
Hsien-Da Huang is a presidential chair professor and associate dean of education at the School of Life and Health Sciences, and the executive deputy director of Warshel Institute of Computational Biology, The Chinese University of Hong Kong, Shenzhen. His research group majorly focuses on biological multi-disciplinary research topics, including bioinformatics, genomics, metagenomics and microbiome, intelligent biomedical Technologies (drug development, genetic test and precision medicine), AI and machine learning, and biological database design and development.
Tzong-Yi Lee is currently a professor at the Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University. His research interests emphasize on bioinformatics, multi-Omics data analysis, single cell sequencing and spatial transcriptomics.