Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning

Genome-wide datasets. Details on the number of labeled and unlabeled sequences. The IR is computed as the ratio between them.

Dataset	Positive sequences	Unlabeled hairpins	IR
CEL	249	1,737,349	1:6,977
DME	307	2,066,807	1:6,732
AGA	66	4,268,407	1:64,672
HSA	1,710	48,099,855	1:28,128
ATH	304	1,355,663	1:4,459

Table 1

Several features were extracted from these stem loops with miRNAfe [19], such as the ratio of each base in the sequence, the proportion of guanine-cytosine in the sequence, the ratio between guanine and cytosine, the length of the sequence, the number of stem loops and the number of nucleotides in the stem region, among many others. These features were used by OC-SVM, deepBN, deeSOM and miRNAss. In contrast, the bb-deepMiRGene and bb-DeepMir models do not use hand-engineered features but the raw sequence of each hairpin. bb-deepMiRGene additionally uses the secondary structure of each sequence, as predicted by RNAfold. Additional details of the feature extraction process can be found in [3], and a detailed description of the features themselves is provided in the Supplementary Material, Table S1.

Performance measures

The prediction quality of each model was assessed using the classical classification measures of precision (|$P$|), recall (|$R$|), |$F1$|-score and Matthews correlation coefficient (MCC), defined as

$$\begin{equation*} P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F_1 = 2\frac{P R}{P + R}, \end{equation*}$$

$$\begin{align*} MCC = {\frac{{\mathit{TP}}\times{\mathit{TN}}-{\mathit{FP}}\times{\mathit{FN}}}{\sqrt{({\mathit{TP}}+{\mathit{FP}})({\mathit{TP}}+{\mathit{FN}})({\mathit{TN}}+{\mathit{FP}})({\mathit{TN}}+{\mathit{FN}})}}} \end{align*}$$

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative predictions, respectively. Performance curves were drawn using the scores that each model assigned to each test sequence. The precision-recall curve (PRC) plot is a well-known performance indicator. A recent study [38] has shown that this representation is preferable to the receiver operating characteristic (ROC) plot for assessing binary classifiers on highly imbalanced data, where negatives greatly outnumber positives. Under high imbalance, a classifier can reach good specificity and still deliver poor-quality candidates, with a large number of false positives. PRC plots give a clearer assessment of performance because they evaluate the fraction of true positives among all positive predictions. Given the very large class imbalance of the datasets, |$F1$|, which combines precision and recall, and |$MCC$| provide the summary measures. The maxima of |$F1$| and |$MCC$| along the entire PR curve will be called |$F1_m$| and |$MCC_m$|, respectively. However, it should be noted that very low values of these measures can be expected in this scenario. For example, if a predictor has a false positive rate of only 1% on the AGA dataset, its precision is at most about |$P = 0.0015$|, even if it recovers every known pre-miRNA. As a consequence, very low values of |$F1$| and |$MCC$| will be observed.
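The definitions above, and the AGA example, can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the counts for AGA are taken from Table 1, and the predictor (perfect recall with 1% of the unlabeled hairpins flagged as positive) is an assumed best case rather than any of the evaluated models.

```python
import math

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# AGA counts from Table 1; unlabeled hairpins are treated as negatives here.
n_pos, n_unl = 66, 4_268_407

# Assumed best case: every known pre-miRNA is recovered and
# 1% of the unlabeled hairpins are (wrongly) predicted as positive.
tp, fn = n_pos, 0
fp = round(0.01 * n_unl)
tn = n_unl - fp

p_aga = precision(tp, fp)
f1_aga = f1_score(tp, fp, fn)
mcc_aga = mcc(tp, tn, fp, fn)
print(f"P = {p_aga:.4f}, F1 = {f1_aga:.4f}, MCC = {mcc_aga:.4f}")
```

Even under this optimistic assumption the precision stays near 0.0015, which is why the |$F1$| and |$MCC$| values reported for genome-wide data are so small in absolute terms.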

Figure 1

Precision-recall curves for all the methods and datasets. Precision is in log scale. Bold curves are the mean of cross-validation results while the shaded area is the 10–90 percentile range.

Overall model performance was compared objectively using the area under the precision-recall curve (|$AUC_{PR}$|). As genome-wide datasets are heavily imbalanced and precision varies over several orders of magnitude, a logarithmic ratio is defined as

$$\begin{equation} \hat{AUC}_{PR} = 1 - \frac{\log(AUC_{PR})}{\log(AUC_{b})}, \end{equation}$$

(1)

where |$AUC_{b}$| is the area under the baseline precision, that is, the precision of a classifier that assigns a positive label to all the test sequences. This ratio is more informative when comparing results on datasets with significantly different IRs, such as the ones evaluated here.
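A minimal sketch of Equation (1) follows. It assumes that the baseline curve is the horizontal line at the fraction of positives in the test set, so that |$AUC_{b}$| equals that fraction; the function names are hypothetical and the CEL counts come from Table 1.

```python
import math

def baseline_auc(n_pos, n_unlabeled):
    """Precision of a classifier that labels every test sequence as positive.
    Assumed here to equal AUC_b (a flat precision line over recall 0..1)."""
    return n_pos / (n_pos + n_unlabeled)

def log_ratio_auc(auc_pr, auc_b):
    """Logarithmic ratio of Equation (1): 0 at the baseline, 1 for a perfect curve."""
    return 1.0 - math.log(auc_pr) / math.log(auc_b)

auc_b_cel = baseline_auc(249, 1_737_349)         # CEL, Table 1
at_baseline = log_ratio_auc(auc_b_cel, auc_b_cel)
perfect = log_ratio_auc(1.0, auc_b_cel)
halfway = log_ratio_auc(math.sqrt(auc_b_cel), auc_b_cel)
print(auc_b_cel, at_baseline, perfect, halfway)
```

The ratio is 0 for a model no better than the baseline, 1 for a perfect model, and 0.5 when |$AUC_{PR}$| lies at the geometric midpoint between the two, which is what makes scores comparable across datasets with very different IRs.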

Experimental setup

The models with published source code were trained and tested using our genome-wide datasets. The experimental evaluation was designed taking into account the practical considerations of the genome-wide pre-miRNA discovery task. Given that the computational cost of the methods is very high with genome-wide data, hyper-parameter optimization strategies such as grid search can be prohibitive. Thus, the hyper-parameters used for each model are those published by the original authors. These are summarized in Table S2 of the Supplementary Material.

Each ML model was trained independently for each species in Table 1 and evaluated with an 8-fold cross-validation (CV) scheme, for each genome individually, to obtain an unbiased estimate of performance on unseen data. Each sequence from each genome was labeled either as positive class (known pre-miRNAs) or as unlabeled class. Each fold consisted of independent, non-overlapping training (7/8) and testing (1/8) partitions, and each testing partition had the same IR as the full genome. The pre-miRNA candidate scores obtained by the models for the samples in the test partitions were compared with the known labels to assess model performance. A Friedman test and a critical difference diagram with a post hoc Nemenyi test were used to assess the statistical significance of the differences in the |$\hat{AUC}_{PR}$| achieved by each model.
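The partitioning scheme described above can be sketched as follows. This is an assumed, simplified implementation in plain Python (the function name and the toy labels are hypothetical), not the code used in the study; it only illustrates how each test fold can keep the IR of the full set.

```python
import random

def stratified_folds(labels, k=8, seed=0):
    """Yield (train_idx, test_idx) pairs. Positives (1) and unlabeled (0)
    are shuffled and dealt into k folds separately, so every test fold
    keeps approximately the class ratio of the full dataset."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    unl = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(unl)
    folds = [pos[f::k] + unl[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# Toy data with IR 1:100 (16 positives, 1600 unlabeled).
labels = [1] * 16 + [0] * 1600
splits = list(stratified_folds(labels, k=8))
for train, test in splits:
    n_pos = sum(labels[i] for i in test)
    print(len(train), len(test), f"test IR 1:{(len(test) - n_pos) // n_pos}")
```

Dealing each class into folds separately is the simplest way to preserve the ratio; library routines for stratified cross-validation follow the same idea.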

Results

The precision-recall curves for the prediction of novel pre-miRNAs in the genome-wide datasets are shown in Figure 1 for OC-SVM, deepBN, bb-deepMiRGene, miRNAss, deeSOM and bb-DeepMir. These curves were generated using the scores provided by each method. In these plots, the higher the curve the better. As can be seen, precision is low when recall is high (bottom right corner of each sub-figure), where most of the candidates are, in fact, false positives. This is taken as the baseline: the model obtains R = 1.0 at the cost of classifying all test sequences as positive. As the score threshold is increased (from right to left), low-quality candidates are discarded, which rapidly improves precision at the cost of recall.
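How such a curve arises from a list of scores can be illustrated with a short sketch. The scores and labels below are invented toy values, and the function names are hypothetical; tied scores are not merged, which a production implementation would normally handle.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, one per cut-off, walking down the
    ranking from the highest-scoring candidate. Assumes at least one positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def max_f1(points):
    """Largest F1 along the curve (F1_m in the text)."""
    return max(2 * p * r / (p + r) for r, p in points if p + r > 0)

# Toy example: 2 known positives among 6 hairpins.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
points = pr_curve(scores, labels)
f1_m = max_f1(points)
print(points[-1], f1_m)
```

The last point of the curve, where every sequence is accepted, has recall 1 and a precision equal to the fraction of positives, which is the baseline described above.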

In the CEL dataset, bb-deepMiRGene has an outstanding performance, which may be because, unlike the other methods, it uses both sequence and secondary structure information. This suggests that taking both information sources into account is very important for finding good, precise pre-miRNA candidates. In contrast, although bb-DeepMir is also based on deep learning, it uses only sequence information and has the lowest score. As for deepBN, despite the very good results reported for pre-miRNA prediction in other scenarios, the large class imbalance of this genome-wide dataset seems to have affected its performance. At high recall, bb-deepMiRGene reaches the best values, followed by OC-SVM and miRNAss. At high precision, where recall is low, the performance of OC-SVM and deepBN drops: these models are not able to deliver a small number of candidates with few FP. deeSOM seems to be the method with the most balanced trade-off between recall and precision.

For the DME dataset, bb-deepMiRGene and OC-SVM again work better. On this dataset, only bb-deepMiRGene and deeSOM could reach good precision results. In the case of the AGA dataset, it should be noted that there is a very low number of known miRNAs, only 57, which seems to deeply affect most of the classifiers. Only bb-deepMiRGene and deeSOM could reach high precision values. The HSA dataset is the largest one, and miRNAss needed to build a very large adjacency matrix, which makes it inapplicable in practice for this number of sequences. In general, behaviour similar to that on the previous datasets can be observed, except for bb-DeepMir, which reaches good precision values here. Since this dataset has a relatively large number of known miRNAs (1,710) and bb-DeepMir uses only sequence information, it seems that there are many similar patterns that are easily found by this method. Here, again, bb-deepMiRGene is the best method. Finally, in the ATH dataset, almost all models behave similarly except for bb-DeepMir, which has a very low precision score at moderate recall but reaches good precision at low recall.

For a global comparison among all the methods, performance was assessed by measuring the maximum |$F_1$|, the maximum MCC score and the |$\hat{AUC}_{PR}$| for each model in each genome (Table 2). In the CEL genome, both the |$F1_m$| and |$MCC_m$| measures clearly indicate bb-deepMiRGene as the best method (in bold). In this genome, the next best methods are deeSOM, miRNAss and bb-DeepMir, although they are far behind the best one. Regarding |$\hat{AUC}_{PR}$|, bb-deepMiRGene is clearly the best one in this genome. In DME, the best method is deeSOM according to |$F1_m$| and |$MCC_m$|, closely followed by the deep models bb-DeepMir and bb-deepMiRGene. According to |$\hat{AUC}_{PR}$|, the latter is by far the best one. In the AGA dataset, bb-deepMiRGene is again the best one, followed by deeSOM and OC-SVM. According to |$\hat{AUC}_{PR}$|, the best one is deeSOM, although very close to bb-deepMiRGene. In the largest genome, HSA, bb-deepMiRGene, bb-DeepMir and deeSOM are again the best ones. Finally, in ATH the best method according to |$F1_m$| is miRNAss, and deeSOM is the best according to |$MCC_m$|. In |$\hat{AUC}_{PR}$|, bb-deepMiRGene is again the best one.

Table 2

Summarized performances for all methods and datasets. |$F1_m$| and |$MCC_m$| are the best F1 and MCC along the precision-recall curve. |$\hat{AUC}_{PR}$| is the logarithmic ratio of the area under the precision-recall curve.

Method	CEL			DME			AGA			HSA			ATH		
	|$F1_m$|	|$MCC_m$|	|$\hat{AUC}_{PR}$|	|$F1_m$|	|$MCC_m$|	|$\hat{AUC}_{PR}$|	|$F1_m$|	|$MCC_m$|	|$\hat{AUC}_{PR}$|	|$F1_m$|	|$MCC_m$|	|$\hat{AUC}_{PR}$|	|$F1_m$|	|$MCC_m$|	|$\hat{AUC}_{PR}$|
OC-SVM	0.012	0.047	0.342	0.002	0.015	0.292	0.013	0.033	0.240	0.004	0.014	0.197	0.153	0.121	0.625
deepBN	0.009	0.025	0.220	0.000	0.002	0.044	0.001	0.007	0.002	0.006	0.016	0.241	0.143	0.148	0.595
deeSOM	0.037	0.063	0.378	0.075	0.120	0.166	0.019	0.023	0.367	0.028	0.035	0.365	0.172	0.187	0.649
miRNAss	0.030	0.044	0.357	0.001	0.005	0.020	0.008	0.007	0.071	-	-	-	0.212	0.173	0.676
bb-DeepMir	0.028	0.023	0.190	0.053	0.015	0.048	0.009	0.009	0.109	0.038	0.050	0.457	0.085	0.060	0.475
bb-deepMiRGene	0.103	0.095	0.567	0.060	0.040	0.387	0.058	0.045	0.336	0.072	0.031	0.511	0.195	0.179	0.686

As can be seen in Table 2, the performance of the methods varies considerably with the size and imbalance of the genome data evaluated. Therefore, using one measure alone, it is very difficult to single out one best method for all cases. However, since precision spans several orders of magnitude and the |$\hat{AUC}_{PR}$| is defined on a logarithmic scale, this measure gives more weight to the methods that reach better precision scores. High precision is very desirable when searching for new pre-miRNA candidates, in order to have fewer false positives to test. According to |$\hat{AUC}_{PR}$|, bb-deepMiRGene reaches the highest values in most datasets, with the exception of AGA, in which deeSOM is the best one. The effect of the very large class imbalance can be seen especially in AGA, where deepBN, miRNAss and bb-DeepMir cannot reach good results, while on ATH the scores are high for all models. In order to provide a statistical analysis of the results, a Friedman test was performed, showing that the differences in the |$\hat{AUC}_{PR}$| results are statistically significant (p = 3.3e-15). The critical difference (CD) diagram (Fig. 2) shows that bb-deepMiRGene and deeSOM are the best methods for the genome-wide prediction of pre-miRNAs. OC-SVM can reach a good |$\hat{AUC}_{PR}$|, especially at high recall, but it cannot reach high precision values. These comparative results have shown that, while the genome-wide imbalance affected all the methods, the deep models (deeSOM and bb-deepMiRGene) were the most robust for all species.
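The ranking logic behind a Friedman test and a CD diagram can be illustrated with the |$\hat{AUC}_{PR}$| values of Table 2. The sketch below is not a reproduction of the reported analysis: it uses only the per-dataset values from the table (the reported p-value presumably relies on more observations, such as per-fold results), and it leaves out HSA because miRNAss has no result there. It computes the mean rank of each method and the Friedman chi-square statistic without tie correction.

```python
# AUC_PR log-ratios from Table 2, in the order CEL, DME, AGA, ATH.
# HSA is omitted in this sketch because miRNAss has no value for it.
auc = {
    "OC-SVM":         [0.342, 0.292, 0.240, 0.625],
    "deepBN":         [0.220, 0.044, 0.002, 0.595],
    "deeSOM":         [0.378, 0.166, 0.367, 0.649],
    "miRNAss":        [0.357, 0.020, 0.071, 0.676],
    "bb-DeepMir":     [0.190, 0.048, 0.109, 0.475],
    "bb-deepMiRGene": [0.567, 0.387, 0.336, 0.686],
}
methods = list(auc)
k = len(methods)   # number of methods
n = 4              # number of datasets (blocks)

# Rank the methods within each dataset (1 = best); these columns have no ties.
rank_sum = {m: 0 for m in methods}
for d in range(n):
    ordered = sorted(methods, key=lambda m: auc[m][d], reverse=True)
    for rank, m in enumerate(ordered, start=1):
        rank_sum[m] += rank
mean_rank = {m: rank_sum[m] / n for m in methods}

# Friedman chi-square statistic (no tie correction needed here).
chi2 = 12 * n / (k * (k + 1)) * (
    sum(r * r for r in mean_rank.values()) - k * (k + 1) ** 2 / 4
)

for m in sorted(methods, key=mean_rank.get):
    print(f"{m:15s} mean rank {mean_rank[m]:.2f}")
print(f"Friedman chi-square = {chi2:.2f} with {k - 1} degrees of freedom")
```

On these four datasets bb-deepMiRGene and deeSOM obtain the two lowest mean ranks, in line with the ordering described for the CD diagram.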

Figure 2

CD diagram for the pre-miRNA prediction methods along all genome-wide datasets.

Another relevant aspect of the methods reviewed, besides performance, is the computational cost. Using the same hardware specifications and the CEL dataset, OC-SVM took, on average, 7 s to train each fold. This was the fastest method, since it uses only the known positives for training and does not model the negative class. OC-SVM was followed by bb-DeepMir with 10 min and deeSOM with 20 min on average for each fold. However, miRNAss took 23 h to train one fold because the adjacency matrix must be calculated pairwise among all sequences. Similarly, bb-deepMiRGene took 37.5 h because of the conversion from sequences to embeddings for such large genome data. The cost of predicting new sequences was negligible in all cases once the models were trained.

Another important issue, from a very practical point of view, is how many wet experiments should be done in order to find true, high-quality novel pre-miRNAs among the large quantity of sequences of a full genome. In order to answer this question, a detailed analysis of the pre-miRNA candidates provided by the evaluated models is presented in Figure 3. Each sub-figure shows the number of sequences considered as candidates (C = TP + FP) at each score threshold, along with the number of TP from the testing partition. The upper right corner corresponds to the initial number of sequences from the test partition presented to each model, including the well-known pre-miRNAs labeled as positives. For example, in the case of the CEL dataset there are 32 TP for 217,138 candidates. The left of each sub-figure shows how many testing true positives remain at the highest score threshold; these are the top pre-miRNA candidates of each method. As the threshold is increased, from right to left in the figure, the slope of the curves shows that large quantities of low-quality candidates are discarded while TP are reduced very slowly, until only a few TP remain, with different numbers of candidates for each method. In summary, in these figures the lower the curve, the better the method.
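The candidates-TP analysis can be sketched in a few lines. The scores and labels are invented toy values and the function names are hypothetical; the code only illustrates how C = TP + FP and TP evolve as the candidate list grows from the top-scoring sequence downwards, and how to read off the number of candidates needed to retain a given number of TP.

```python
def candidates_vs_tp(scores, labels):
    """Return (C, TP) pairs as the candidate list grows from the
    top-scoring sequence downwards, where C = TP + FP."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, tp = [], 0
    for c, i in enumerate(order, start=1):
        tp += labels[i]
        curve.append((c, tp))
    return curve

def candidates_needed(curve, k):
    """Smallest candidate list that still contains k true positives,
    or None if the model never recovers that many."""
    for c, tp in curve:
        if tp >= k:
            return c
    return None

# Toy example: 2 known positives among 6 hairpins.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
curve = candidates_vs_tp(scores, labels)
print(curve)
print(candidates_needed(curve, 2))
```

Reading the curve this way gives a direct estimate of how many sequences would have to be tested experimentally to confirm a given number of true pre-miRNAs.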

Figure 3

Candidates-TP curves for the five datasets. Bold lines are the mean values for test partitions, and the shaded area is its 10–90 percentile range.

In the CEL dataset, bb-deepMiRGene reached the lowest number of candidates for each TP value. For example, if 10 TP are preserved, the output candidates will be on average 506 for bb-deepMiRGene, 2023 for OC-SVM, 2133 for deeSOM, 2515 for miRNAss, 7337 for deepBN and 17,937 for bb-DeepMir. Similarly, in order to have 2 TP, the numbers of candidates provided by each method will be 34 for bb-deepMiRGene, 520 for OC-SVM, 123 for deeSOM, 310 for miRNAss, 1110 for deepBN and 850 for bb-DeepMir. This means that bb-deepMiRGene provides between 3 and 25 times fewer candidates than the other methods to discover the same number of TP. As a direct consequence, fewer wet experiments would be needed to confirm the novel pre-miRNAs. In DME and AGA, it is clear that miRNAss, deepBN and bb-DeepMir produce at least one order of magnitude more FP than the other methods at low TP. In HSA, there seem to be two groups. First, OC-SVM and deepBN cannot reduce the number of candidates below 1000. In contrast, bb-DeepMir, bb-deepMiRGene and deeSOM reach a very low number of candidates, on the order of 100 sequences for 2 TP. In this case, bb-DeepMir seems to reach even lower values, but its variance is very high. For ATH, all methods but bb-DeepMir behave similarly. It is interesting to note that for bb-deepMiRGene and deeSOM, the best 10 candidates would include, on average, 2 TP, which is an outstanding result from a practical point of view.

Finally, all these comparative results illustrate a very important aspect that should be measured in all methods developed and used for pre-miRNA prediction. Drastically reducing the number of candidates is an important factor in reducing the costs of wet experimental confirmation. The most common case in a real genome-wide application would involve millions of hairpin-like sequences, while it is commonly expected that only a few hundred of them might contain true miRNAs. Thus, for a pre-miRNA classifier, the ability to predict a reasonable number of candidates to be tested in wet experiments is a characteristic of paramount importance. In this regard, the adapted version of bb-deepMiRGene, with balanced batches during training, and deeSOM (originally designed for high imbalance) are the best methods for genome-wide pre-miRNA prediction.

Conclusion

In this work we have compared several recent computational models for pre-miRNA discovery. For the first time, the extensive use of genome-wide data from five genomes (C. elegans, D. melanogaster, A. gambiae, H. sapiens and A. thaliana) allowed the models to be compared under the same experimental conditions and tested in a realistic scenario. Experimental results demonstrated that bb-deepMiRGene, a deep-learning network that uses both the sequential and the structural information of the sequences, outperforms other state-of-the-art methods. This indicates the importance of taking both sources of information into account when training deep learning models to find pre-miRNA candidates in genome-wide data. Additionally, deeSOM, a semi-supervised method that uses structural features as input, also reaches good performance, especially in terms of precision for a low number of candidates.

Key Points

Six novel pre-miRNA prediction models based on machine learning were tested on five genome-wide datasets.
The models based on deep learning showed the best performances in all datasets.
The deep model that used both sequence and secondary structure information obtained the best results for genome-wide data.
Further research on deep learning-based methods, with more realistic genome-wide datasets, is needed to improve current pre-miRNA prediction.

Acknowledgments

We acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Funding

This work was supported by Agencia Nacional de Promocion Cientifica y Tecnologica (ANPCyT) PICT 2018 3384.

L. A. Bugnon has held a postdoctoral position at sinc(i) since 2017. His research interests include automatic learning, pattern recognition and signal and image processing, with applications to bioinformatics, biomedical signals and affective computing.

C. Yones has held a postdoctoral position at sinc(i) since 2017. His research interests include machine learning, data mining and semi-supervised learning, with applications in bioinformatics.

D.H. Milone is a full professor in the Department of Informatics at Universidad Nacional del Litoral (UNL) and a principal research scientist at CONICET. He is the director of sinc(i). His research interests include statistical learning, signal processing and neural and evolutionary computing, with applications to biomedical signals and bioinformatics.

G. Stegmayer is an assistant professor in the CONICET, Argentina. Her current research interests involve machine learning, data mining and pattern recognition in bioinformatics.

sinc(i) - Research Institute for Signals, Systems and Computational Intelligence. Research at sinc(i) aims to develop new algorithms for machine learning, data mining, signal processing and complex systems, providing innovative technologies for advancing healthcare, bioinformatics, precision agriculture, autonomous systems and human–computer interfaces. sinc(i) was created and is supported by two major institutions of higher education and research in Argentina: the National University of Litoral (UNL) and the National Scientific and Technical Research Council (CONICET).

References

1. Lin S, Gregory RI. MicroRNA biogenesis pathways in cancer. Nat Rev Cancer 2015;6(15):321–33.
2. Croce CM, Peng Y. The role of MicroRNAs in human cancer. Signal Transduct Target Ther 2016;1(1):1–9.
3. Bertoli G, Cava C, Castiglioni I. MicroRNAs: new biomarkers for diagnosis, prognosis, therapy prediction and therapeutic tools for breast cancer. Theranostics 2015;5(10):1122–43.
4. Li L, Xu J, Yang D, et al. Computational approaches for microRNA studies: a review. Mamm Genome 2010;21(1–2):1–12.
5. Allmer J, Yousef M. Computational methods for ab initio detection of microRNAs. Front Genet 2012;3:1–5.
6. Friedländer MR, Chen W, Adamidi C, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008;26(4):407–15.
7. Hackenberg M, Sturm M, Langenberger D, et al. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res 2009;37(1):68–76.
8. Hendrix D, Levine M, Shi W. MiRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data. Genome Biol 2010;11(39). https://doi.org/10.1186/gb-2010-11-4-r39.
9. Hackenberg M, Rodríguez-Ezpeleta N, Aransay AM. MiRanalyzer: An update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res 2011;39(1):132–8.
10. Mathelier A, Carbone A, Hofacker I. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics 2010;26(18):2226–34.
11. Friedländer MR, MacKowiak SD, Li N, et al. MiRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res 2012;40(1):37–52.
12. An J, Lai J, Sajjanhar A, et al. MiRPlant: An integrated tool for identification of plant miRNA from RNA sequencing data. BMC Bioinformatics 2014;15(1):275.
13. Vitsios DM, Kentepozidou E, Quintais L, et al. Mirnovo: genome-free prediction of microRNAs from small RNA sequencing data and single-cells using decision forests. Nucleic Acids Res 2017;45(21):177.
14. Demirci MDS, Allmer J. Delineating the impact of machine learning elements in pre-microRNA detection. PeerJ 2017;5:e3131. https://doi.org/10.7717/peerj.3131.
15. Stegmayer G, Di Persia L, Rubiolo M, et al. Predicting novel microRNA: a comprehensive comparison of machine learning approaches. Brief Bioinform 2018;20(5):1607–20.
16. Morgado L, Johannes F. Computational tools for plant small RNA detection and categorization. Brief Bioinform 2019;20:1181–92.
17. Wei L, Liao M, Gao Y, et al. Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set. IEEE/ACM Trans Comput Biol Bioinform 2014;11(1):192–201.
18. Liu B, Li J, Cairns MJ. Identifying miRNAs, targets and functions. Brief Bioinform 2014;15(1):1–19.
19. Yones CA, Stegmayer G, Kamenetzky L, et al. miRNAfe: a comprehensive tool for feature extraction in microRNA prediction. Biosystems 2015;138:1–5.
20. Liang C, Heikkinen L, Wang C, et al. Trends in the development of miRNA bioinformatics tools. Brief Bioinform 2018;20(5):1836–52.
21. Bugnon LA, Yones CA, Milone DH, et al. Deep Neural Architectures for Highly Imbalanced Data in Bioinformatics. IEEE Trans Neural Netw Learn Syst 2020;31(8):2857–67.
22. Xue C, Li F, He T, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005;6:310.
23. Dang HT, Tho HP, Satou K, et al. Prediction of microRNA hairpins using one-class support vector machines. In: 2nd International Conference on Bioinformatics and Biomedical Engineering, iCBBE 2008, pages 33–36. IEEE Computer Society, 2008.
24. Yousef M, Najami N, Khalifav W. A comparison study between one-class and two-class machine learning for MicroRNA target detection. J Biomed Sci Eng 2010;03(03):247–52.
25. Stegmayer G, Yones C, Kamenetzky L, et al. High class-imbalance in pre-miRNA prediction: a novel approach based on deepSOM. IEEE/ACM Trans Comput Biol Bioinform 2016;14(6):1316–26.
26. Kohonen T, Schroeder MR, Huang TS. Self-organizing Maps. Springer, 2005.
27. Yones C, Stegmayer G, Milone DH. Genome-wide pre-miRNA discovery from few labeled examples. Bioinformatics 2018;34(4):541–9.
28. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436–44.
29. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 2017;18(5):851–69.
30. Fischer A, Igel C. An introduction to restricted Boltzmann machines. In: Lecture Notes in Computer Science 2012;14–36.
31. Thomas J, Thomas S, Sael L. DP-miRNA: An improved prediction of precursor microRNA using deep learning model. In: 2017 IEEE International Conference on Big Data and Smart Computing, BigComp 2017, pages 96–99, 2017.
32. Thomas J, Lee S. Deep neural network based precursor microRNA prediction on eleven species. arXiv 2017.
33. Tang X, Sun Y. Fast and accurate microRNA search using CNN. BMC Bioinformatics 2019;20(Suppl 23):1–14.
34. Zheng X, Xu S, Zhang Y, et al. Nucleotide-level convolutional neural networks for pre-miRNA classification. Sci Rep 2019;9(1):1–6.
35. Park S, Min S, Choi H, et al. deepMiRGene: Deep Neural Network based Precursor microRNA Prediction. In: NIPS, 2017.
36. Bugnon LA, Yones C, Milone DH, et al. Genome-wide hairpins datasets of animals and plants for novel miRNA prediction. Data Brief 2019;25:104209.
37. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004;116(2):281–97.
38. Saito T, Rehmsmeier M, Hood L, et al. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015;10(3).