Alexander W Clarke, Eirik Høye, Anju Angelina Hembrom, Vanessa Molin Paynter, Jakob Vinther, Łukasz Wyrożemski, Inna Biryukova, Alessandro Formaggioni, Vladimir Ovchinnikov, Holger Herlyn, Alexandra Pierce, Charles Wu, Morteza Aslanzadeh, Jeanne Cheneby, Pedro Martinez, Marc R Friedländer, Eivind Hovig, Michael Hackenberg, Sinan Uğur Umu, Morten Johansen, Kevin J Peterson, Bastian Fromm, MirGeneDB 3.0: improved taxonomic sampling, uniform nomenclature of novel conserved microRNA families and updated covariance models, Nucleic Acids Research, Volume 53, Issue D1, 6 January 2025, Pages D116–D128, https://doi.org/10.1093/nar/gkae1094
Abstract
We present a major update of MirGeneDB (3.0), the manually curated animal microRNA gene database. Beyond moving to a new server and the creation of a computational mirror, we have expanded the database with the addition of 33 invertebrate species, including representatives of 5 previously unsampled phyla, and 6 mammal species. MirGeneDB now contains entries for 21 822 microRNA genes (5160 of these from the new species) belonging to 1743 microRNA families. The inclusion of these new species allowed us to refine both the evolutionary node of appearance of a number of microRNA genes/families, as well as MirGeneDB’s phylogenetically informed nomenclature system. Updated covariance models of all microRNA families, along with all smallRNA read data are now downloadable. These enhanced annotations will allow researchers to analyze microRNA properties such as secondary structure and features of their biogenesis within a robust phylogenetic context and without the database plagued with numerous false positives and false negatives. In light of these improvements, MirGeneDB 3.0 will assume the responsibility for naming conserved novel metazoan microRNAs. MirGeneDB is part of RNAcentral and Elixir Norway and is publicly and freely available at mirgenedb.org.

Introduction
The last 5 years have seen the publication of >55 000 papers referencing microRNAs in their title or abstract, more than any other type of non-coding RNA. These studies, however, continue to be challenged by long-standing problems that diminish the accuracy of publicly available small RNA annotations (1–3). Previous studies have found that as many as two-thirds of the purported microRNAs recorded in public databases are likely false positives (4–18), and that existing nomenclatural schemes often failed to reflect the real evolutionary histories of a number of important microRNA families. These errors hinder comparative work and have led to a confusing proliferation of study-specific databases (5,19–24). Further, they limit the power of microRNA data and the replicability of studies that use them across all branches of the biological sciences, from taxonomy and phylogenetics (25) to biomedical research (26).
Fortunately, these issues arise, not from an inherent problem in the study of microRNAs per se, but from the fact that many databases simply catalog the microRNAs described in the published literature without independently curating them. Although the process of reviewing purported microRNAs by hand is labor-intensive and cannot guarantee the absence of false negatives, especially in the case of tissue-specific or poorly expressed genes, it can eliminate virtually all false-positive results, significantly improving the reliability of microRNA data. To this end, in 2015, we established MirGeneDB, the first publicly accessible, manually curated microRNA gene database (13). Expanding on the pre-Next Generation Sequencing (NGS) criteria for microRNA annotation (27), we established a rigorous and consistent set of criteria to annotate a high-confidence set of microRNAs across the metazoa. Over the last 9 years, MirGeneDB’s repertoire has grown from just 4 (13), to 45 (18), to 75 species (17), including model and non-model systems drawn from roughly two-thirds of all animal phyla. Recently, we (28) and others (29) demonstrated that this well-curated data can be used to train algorithms to predict conserved microRNAs from genomes only, highlighting MirGeneDB’s value and potential for both comparative genomics and phylogenetics (25).
Despite its regular expansions, major branches of the metazoan tree were either wholly unrepresented or undersampled in previous versions of MirGeneDB, and, as a result, we have not been able to fully capture the patterns of microRNA evolution within several clades. To improve MirGeneDB’s cross-section of the metazoa and address residual nomenclatural and modeling issues, we have completed a third major update, MirGeneDB 3.0. This update enhances MirGeneDB’s comparative data set with >5000 new genes and >200 new gene families from 39 new species, drawn largely from invertebrate clades. Altogether, >20 000 accurately annotated, consistently named and curated microRNA genes from 114 metazoan species can now be browsed, searched and compared in MirGeneDB. We show that this data is useful for identifying the structural and sequence features of metazoan microRNAs, with some noteworthy outliers, with implications for microRNA prediction and annotation. We further provide downloadable covariance models (CM) and processed read files for all species. To address the present lack of an institution that names novel microRNAs, and to provide a functional reference, MirGeneDB will now assume the responsibility of naming novel microRNAs that are conserved between at least two animal species. It is our hope that this expansion and our efforts to name novel genes and families will allow MirGeneDB to have an even wider user base and continue to stand as the ‘gold standard’ (30–37) for metazoan microRNA annotation across the biological sciences.
Improving phylogenetic representation across the metazoa
The new species included in this update extend MirGeneDB’s sample of the metazoans and refine its resolution within several clades. These improvements are most noticeable among the invertebrates, which account for 33 of the 39 new taxa (Figure 1, blue species). At the broadest scale, we have added members of five previously absent phyla: Acoela (represented by Hofstenia miamia and Symsagittifera roscoffensis), Nematomorpha (Gordionus sp.), Nemertea (Lineus longissimus), Phoronida (Phoronis australis) and Priapulida (Priapulus caudatus). More narrowly, we have expanded our existing sample of several phyla. We added four new annelid species: the tubeworm Owenia fusiformis, the progenic orbiniid Dimorphilus gyrociliatus, the ragworm Platynereis dumerilii and the peanut worm Sipunculus nudus. Similarly, we added eight new species of molluscs: the solenogasters Epimenia babai and Wirenia argentea, the monoplacophoran Laevipilina hyalina, the polyplacophorans Mopalia mucosa and Acanthopleura granulata, and the scaphopod Pictodentalium vernedei, along with second representatives for the gastropods (Haliotis rufescens) and the bivalves (Ruditapes philippinarum). We also expanded our representation of the platyhelminths with the free-living polyclad Prostheceraeus crozieri and three parasitic neodermatans: the monogenean Gyrodactylus salaris, the trematode Schistosoma mansoni, and the cestode Echinococcus granulosus. To the syndermatans (rotifers), we added the monogonont Brachionus koreanus and the bdelloid Adineta vaga (both free-living), as well as the ectoparasitic seisonid Seison nebaliae and the endoparasitic acanthocephalans Neoechinorhynchus agilis and Pomphorhynchus laevis. We also annotated four parasitic arthropods: the mosquito Anopheles gambiae, the tsetse fly Glossina pallidipes, the parasitic wasp Diachasmimorpha longicaudata and the spider mite Tetranychus urticae. Finally, to better understand the evolution of microRNAs in cnidarians, we added two additional anthozoans: the stony coral Acropora digitifera and the sea anemone Actinerus sp.

Figure 1. The evolution of 1743 microRNA families across the 114 metazoan species as annotated in MirGeneDB 3.0 with branch lengths corresponding to the total number of microRNA family-level gains plus family-level losses. New species added in this release are in blue (invertebrates) and red (vertebrates). Colored bars indicate some of the phylogenetic nodes associated with bursts of family-level microRNA innovation, with the number of new families at that node below. Inset: simplified metazoan phylogeny to highlight differences in MirGeneDB representation at the phylum level.
In addition to their phylogenetic significance, these new taxa enhance MirGeneDB’s coverage of several ecologies. For example, the secondarily miniaturized Dimorphilus and the three non-acanthocephalan rotifers double MirGeneDB’s suite of meiofaunal species (formerly consisting of only the rotifer Brachionus plicatilis and the two Caenorhabditis species). Among these taxa, Dimorphilus and Seison are especially interesting because they have the shortest genomes in our database—at only 70 Mbp (30) and 43 Mbp (38), respectively. This update also expands MirGeneDB’s repertoire of parasites with new micropredators (Anopheles and Glossina), trophically transmitted parasites (Schistosoma, Echinococcus and the acanthocephalans), and directly transmitted parasites (Gyrodactylus, Tetranychus and Seison). With the two parasitoids (Diachasmimorpha and Gordionus) in this update, MirGeneDB now contains representatives for four of the six canonical parasitism strategies (39).
Along with these invertebrates, MirGeneDB 3.0 now contains six additional mammalian species (Figure 1, red species). Previous studies have found that the rate of family-level microRNA innovation increased in the primates, relative to other mammals (29). To more precisely resolve the timing of this burst, this update includes three new species: Pongo abelii, Callithrix jacchus, and Microcebus murinus, representing a non-human hominoid, a platyrrhine, and a strepsirrhine, respectively. Together, these species indicate that the primate innovation pulse began within the Simiiformes and continued into the Haplorhini (Figure 1, dark blue bar). This update will also introduce the first members of Perissodactyla (Equus caballus), Paenungulata (Loxodonta africana) and Diprotodontia (Notomacropus eugenii) to MirGeneDB.
Overall, the new species included in MirGeneDB 3.0 increase its comparative power and improve our ability to resolve the deep-time evolutionary dynamics of microRNA families and genes (40,41). Indeed, by including these new species, we were able to reevaluate the evolutionary origins of several microRNA families and establish the conservation (and biological importance) of many lineage-specific families, especially among the Annelida, Mollusca, Syndermata, and Platyhelminthes. Altogether, the 720 microRNA families now known to be conserved between two or more species support the monophyly of 80 different higher-level taxa (Supplementary Table S1).
Furthermore, the continual inclusion of new species into MirGeneDB buttresses our original insights (42–44) that new microRNA families continually appear in most metazoan lineages, with gains outweighing losses in most taxa outside of a few parasitic and/or meiofaunal species. Indeed, these important patterns—found only with microRNAs and not with any other type of known regulatory molecules (41)—are robustly illustrated using banner-plots (Figure 2). By highlighting important phylogenetic nodes, the conservation of the number of microRNA families and paralogues across metazoans, as well as the potential usefulness of microRNAs as taxonomic and phylogenetic markers, is obvious.

Figure 2. Banner plot of 114 metazoan microRNA complements from MirGeneDB 3.0. Rows represent species sorted as in Figure 1 and columns represent microRNA families sorted by node of origin and abundance of overall paralogue number (heatmap function). Note that the colored bars on top depict selected phylogenetic nodes of origin that can also be found in Figure 1.
Revising MicroRNA cluster annotation
Starting with this update to MirGeneDB, we will institute a new procedure for naming the members of microRNA gene clusters. Because it relied on miRBase, which was not informed by evolutionary history, our original system for annotating paralogous microRNA genes could not be effectively applied to large gene clusters. One example, among many, of the problems facing this system can be found in the important eutherian-specific microRNAs of the imprinted MIR-154/MIR-376 cluster on human chromosome 14 (45). In humans, this cluster of loci consists of three paternally expressed protein-coding genes and at least three maternally expressed long non-coding RNAs (lncRNAs) that house at least 99 small RNAs including 49 small nucleolar RNAs (snoRNAs) (Figure 3).
Figure 3. The annotation and nomenclature system for clustered microRNA loci used in MirGeneDB 3.0. Exemplified is the imprinted MIR-154/MIR-376 cluster on human chromosome 14 in comparison to mouse and dog. Each unique microRNA family and the orientation of transcription is shown by the colored arrows. Protein-encoding genes are indicated with the gray boxes; where known, the maternally expressed lncRNA transcripts (‘Meg’, maternally expressed gene) are shown in pink and the paternally expressed protein-coding mRNA transcripts are shown in blue. When the miRNAs are clustered like that seen with the MIR-154 family (red) or the MIR-376 family (teal), the paralogues are named in the 5′ to 3′ direction according to the cluster present in the last common ancestor of eutherians. When coupled with the family/gene origins as detailed on the browse and gene pages in MirGeneDB, the ancestral cluster present in the eutherian last common ancestor is readily reconstructed, and lineage-specific gains [e.g. Mir-154-P16a and -P16b in human (#); Mir-154-P35 in dog (@)] and losses (e.g. Mir-154-P22 in human; Mir-154-P13 in mouse and Mir-154-P26 in dog) are clearly identified. Also seen are gains of unique microRNA families including MIR-770 in the two boreoeutherians (asterisks), and the multiple gains in the murid rodents (mouse and rat, underlines). The glires lineage (i.e. rabbit plus the rodents) lost one of the two Mir-154-P8 genes, but because it is now unclear which one was lost, the Mir-154-P8 gene is given an orphan (‘o’) designation.
Also housed are 53 microRNAs, 33 of which belong to the MIR-154 family (Figure 3, red triangles). In previous iterations of MirGeneDB, miRBase’s original Mir-154 gene was named Mir-154-P1, and subsequent genes in that family received paralogue IDs according to the order of their miRBase names (e.g. miRBase's Mir-299 and Mir-323 received the MirGeneDB names Mir-154-P2 and Mir-154-P3). Though this system was relatively simple, it meant that a brief review of the cluster would give no indication of which paralogues might be novel in a particular species and which ancestral paralogues might have been lost.
To address this issue, MirGeneDB 3.0 institutes a 5′ to 3′ system for paralogue annotation, with the 5′-most gene in a cluster receiving the ‘P1’ designation (for the MIR-154 family, this is miRBase's mir-379, our old Mir-154-P7), with the cluster present in the eutherian last common ancestor used as the reference taxon. This system, when applied consistently, can immediately reveal both a species’ novel gene copies through additions to the paralogue number (e.g. Mir-154-P6a and P6b in humans) and its secondary losses of ancestral copies through absences of expected paralogue numbers (e.g. both Mir-154-P21 and -P23 are present in human, but -P22 is not, indicating that this gene was present in the eutherian last common ancestor (LCA) and thus must be present in other eutherians like the dog; Figure 3, far right). When this cluster is compared across multiple eutherian species, the loss of Mir-154 genes in murid rodents, for example, is striking, as is the gain of novel murid-specific families (Figure 3, middle). Moreover, unique genes present in specific eutherian lineages that evolved after the eutherian LCA are now readily apparent by the fact that they have a higher paralogue number than their neighbors. For example, Mir-154-P35 sits immediately 3′ of Mir-154-P13 and 5′ of the Mir-154-P14 gene in dogs. Because the eutherian LCA is the reference taxon for the nomenclature, this indicates to the user that, instead of this gene being lost in humans, in which case the paralogue numbers would follow a normal ordinal sequence, it must have evolved somewhere in the lineage leading to dogs (in fact, it is present in the three laurasiatherians currently annotated in MirGeneDB: Canis, Bos, and Equus). We have applied this system to all known microRNA clusters, with the exception of a few ancestral clusters like the LET-7 and MIR-96 clusters, making the annotation and the evolutionary understanding of these large clusters much easier.
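The way paralogue numbers encode gains and losses can be illustrated with a minimal Python sketch. It assumes gene names of the form `Family-P<number><optional letter>` listed in 5′ to 3′ order, and it does not handle orphan (‘o’) designations. The function names and the toy cluster are hypothetical and are not taken from MirGeneDB; the toy list only mimics the patterns described above.

```python
import re

# Assumed name format: "<Family>-P<number><optional duplicate letter>",
# e.g. "Mir-154-P16a". Orphan ('o') designations are not handled here.
PARALOGUE_ID = re.compile(r"-P(\d+)([a-z]?)$")


def parse_paralogue(name):
    """Return (paralogue number, duplicate letter or '') for a gene name."""
    match = PARALOGUE_ID.search(name)
    if match is None:
        raise ValueError(f"no paralogue ID found in {name!r}")
    return int(match.group(1)), match.group(2)


def summarize_cluster(names_5p_to_3p):
    """Flag numbering gaps, lettered copies and out-of-order genes."""
    parsed = [parse_paralogue(name) for name in names_5p_to_3p]
    numbers = [number for number, _ in parsed]

    # A gene numbered higher than its 3' neighbour breaks the ordinal
    # sequence, which suggests it arose after the reference ancestor.
    out_of_order = [
        names_5p_to_3p[i]
        for i in range(len(numbers) - 1)
        if numbers[i] > numbers[i + 1]
    ]
    flagged = set(out_of_order)
    in_order = [
        number
        for name, number in zip(names_5p_to_3p, numbers)
        if name not in flagged
    ]

    # Numbers skipped within the ordered run point to lost ancestral copies.
    expected = set(range(min(in_order), max(in_order) + 1))
    missing = sorted(expected - set(in_order))

    # Letter suffixes mark additional copies of one ancestral paralogue.
    duplicated = sorted({number for number, letter in parsed if letter})
    return {
        "missing": missing,
        "duplicated": duplicated,
        "out_of_order": out_of_order,
    }


# Toy, made-up cluster in 5' to 3' order (not a real annotation).
toy_cluster = [
    "Mir-154-P13",
    "Mir-154-P35",
    "Mir-154-P14",
    "Mir-154-P16a",
    "Mir-154-P16b",
    "Mir-154-P17",
]
summary = summarize_cluster(toy_cluster)
print(summary)
```

In this toy list the skipped number 15 is reported as a candidate loss, the lettered P16 copies as a lineage-specific duplication, and P35 as a gene that postdates the reference ancestor. A gain at the 3′ end of a cluster would not be detected by this neighbour comparison alone.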
Nomenclatural responsibilities
Historically, MirGeneDB has deferred to miRBase in certain aspects of nomenclature. Though we have always categorized microRNAs according to shared evolutionary history, rather than shared seed sequences (leading, for example, to the inclusion of miRBase's mir-100 and mir-125 into the MirGeneDB MIR-10 family), we have previously avoided naming entirely novel microRNA genes or families ourselves. Instead, newly discovered sequences have generally received a provisional name containing their clade of origin, the word ‘novel,’ and an ordinal ID (e.g. CAT-Novel-1 for the first novel found in the Catarrhini) until at least one member of that family was added to miRBase and given a name there. Beginning with this update, however, we will depart from this policy.
At the time of writing, miRBase has not received a major update in 5 years (46), despite the recent release of a great number of small RNA datasets from dozens of previously unstudied species. In this time, MirGeneDB has been updated thrice (including this update) and, as a result, contains annotations for 52 metazoan species and several major clades, such as the cephalopods, afrotheres, and rotifers, that are not included in miRBase. With the enhanced phylogenetic resolution afforded by the 3.0 update, we have also been able to conduct a systematic review of gene-level relationships within several invertebrate microRNA families, refining our previous assessments of paralogy. For these reasons, MirGeneDB will, henceforth, assume primary responsibility for naming new microRNA families and genes in metazoans.
For a microRNA family to receive an official name in MirGeneDB, it will need to meet the rigorous biogenesis criteria described in earlier publications (13,17,18) and must be conserved between two or more species. Those novel families that appear to be unique to a single species will receive a provisional name in the format described above, with a three-letter species code as their clade-of-origin (e.g. Cte-Novel-1 for the first novel found uniquely in Capitella teleta), until their conservation is demonstrated through the sampling of additional species. The only exception to this general rule will be those species-specific microRNA families that have already received names from miRBase (e.g. Cel-Mir-792); such families will retain their current names.
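The naming rule described above can be summarized as a small decision function. The following Python sketch is only an illustration of the stated policy: the function name, its parameters and the status labels are hypothetical, and the actual assignment of official names remains a manual curation step that is not modeled here.

```python
def assign_family_name(meets_criteria, n_species, species_code, ordinal,
                       mirbase_name=None):
    """Return (status, name) for a candidate family (illustrative only).

    meets_criteria: True if the biogenesis criteria are satisfied.
    n_species:      number of species in which the family is found.
    species_code:   three-letter species code, e.g. "Cte".
    ordinal:        running number for novel families in that species.
    mirbase_name:   existing miRBase-derived name, if any.
    """
    if not meets_criteria:
        return ("rejected", None)
    if n_species >= 2:
        # Conserved families receive an official name from the curators;
        # that name is not generated by this sketch.
        return ("official", None)
    if mirbase_name is not None:
        # Species-specific families already named by miRBase keep that name.
        return ("retained", mirbase_name)
    # Otherwise a provisional name is built from the species code.
    return ("provisional", f"{species_code}-Novel-{ordinal}")


print(assign_family_name(True, 1, "Cte", 1))
print(assign_family_name(True, 1, "Cel", 1, mirbase_name="Cel-Mir-792"))
print(assign_family_name(True, 3, "Hsa", 1))
```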
To prevent discordances between previous and future microRNA nomenclature, MirGeneDB will continue to include miRBase names, when they exist, alongside our own nomenclature for microRNA genes and families. Furthermore, it will still be possible to search for miRBase IDs along with MirGeneDB IDs on our web server.
In light of our new responsibilities, we will also amend some of MirGeneDB’s internal policies. Prior to this release, gene names in MirGeneDB have been subject to continuous revision, both to address our own errors, as revealed by improved taxon sampling, and to facilitate the analysis of large microRNA clusters. In the long run, however, the instability that these revisions create would make it difficult to use MirGeneDB effectively. For this reason, MirGeneDB will institute a naming freeze on all known microRNA genes and families. Going forward, we will not rename genes in MirGeneDB, nor will we change the names of valid families that have received a name from miRBase. If future expansions of this database reveal that we have misunderstood the evolutionary histories of any known genes, we will make a note in the ‘Comments’ section of the gene's page, but will not otherwise alter its ID. In addition, we will not reuse the miRBase name of a rejected family for any valid one discovered in the future (e.g. the names MIR-68, MIR-69 and MIR-198 will never be used). The primary exception to these rules is the set of so-called orphan genes, which are members of a microRNA family whose paralogy is fundamentally uncertain at this time. These genes, designated with a ‘–o,’ will be renamed as the inclusion of new taxa refines our understanding of each major clade’s ancestral microRNA complement.
Covariance models
MirGeneDB’s rigorous curation criteria significantly improve the quality of microRNA annotation but require significant human and computational resources. Furthermore, they prevent us from assessing the microRNA complements of species with public genomes but no public small RNA sequencing data. To overcome these limitations, we developed MirMachine in 2023 (28). The first version of MirMachine is based on MirGeneDB 2.1 and uses CM of 508 conserved microRNA families to predict which of those families are present in a given genome. In effect, this program enables us to qualify a species’ repertoire of conserved microRNAs, even when no small RNA data are available. These predictions can also be used to uncover microRNA pseudogenes, determine the conserved complements of extinct species, assess the quality of a genome assembly and improve the performance of smallRNA-based microRNA prediction software. On average, these predictions have an accuracy of 97.5%, making MirMachine the first tool that can accurately predict microRNAs from metazoan genomes only [see also (29) for a tool focused on human microRNA families].
Despite MirMachine's high overall accuracy, we recognized a potential to significantly improve its performance. Most importantly, the CM underlying the program tend to favor the patterns of variation and conservation observed in well-sampled clades, relative to more sparsely sampled ones. As a result, the models are more likely to make Type II errors (i.e. spuriously fail to predict an existing gene) when faced with otherwise unimportant sequence or structural changes in groups that were poorly represented, or unrepresented entirely, in MirGeneDB 2.1.
To resolve these issues, we have retrained the MirMachine CM using MirGeneDB 3.0’s expanded complement of species. Most importantly, the many new protostomes included in this update will diminish the influence that any single species or clade in this group has on the performance of the CM, reducing their false-negative rate. Similarly, the thousands of new genes in the training set have increased the total sequence diversity for all microRNA families, which will make the CM more tolerant of any new variation which they may encounter in a future genome. The new CM are publicly available on MirGeneDB 3.0, are submitted to Rfam (47) and will be included in a later, fully overhauled release of MirMachine.
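For readers who want to try a downloaded covariance model directly, the sketch below assembles a command line for Infernal's `cmsearch`, a general-purpose tool for searching covariance models against sequence data. This is not MirMachine's interface and not a procedure described in this paper. The file names are placeholders, and the sketch assumes that Infernal is installed and that the model file is in Infernal format.

```python
def build_cmsearch_command(cm_path, genome_fasta, table_path, cpus=1):
    """Assemble an Infernal cmsearch call that writes a tabular hit list.

    cm_path:      covariance model file (placeholder name in this sketch).
    genome_fasta: genome assembly in FASTA format (placeholder name).
    table_path:   output path for the tabular hit summary (--tblout).
    cpus:         number of worker threads passed to --cpu.
    """
    return [
        "cmsearch",
        "--cpu", str(cpus),
        "--tblout", table_path,
        cm_path,
        genome_fasta,
    ]


# Placeholder file names; replace them with real paths before running.
command = build_cmsearch_command("family.cm", "genome.fa", "hits.tbl", cpus=4)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the files exist. Score or E-value thresholds are deliberately left at the program defaults here, since suitable cutoffs for these models are not stated in the text.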
Sequence and conservation propensities of microRNAs
Though they are a powerful tool, curated datasets are sometimes too small to reliably capture the general characteristics of complex biological entities and their evolutionary histories. With this in mind, we have made a concerted effort to rigorously and exhaustively curate the (conserved) microRNA complements of as broad a set of metazoans as possible. As of this update, MirGeneDB covers the majority of phyla in the metazoan tree of life, often with several representative species, and samples a wide range of lifestyles, ecologies, and genomic characteristics, including dramatic differences in genome size and content (41), and various numbers of whole genome duplication events (48). As a result, we now have the ability to systematically examine the characteristics that unite microRNAs across the metazoa.
At the most basic level, we can provide robust length criteria for the precursor microRNA sequence as a whole, as well as the lengths of the processed arms. Over the last several decades, authors have used a bewildering array of ranges to describe these sequences. For example, a brief review of the published estimates for the range of processed arm lengths returns at least 10 different possibilities: 18–22, 18–23, 19–22, 19–23, 19–24, 19–25, 20–22, 20–23, 20–24 and 20–25 nt. The data in MirGeneDB allow us to conclusively state that the average lengths of both the 5p and 3p arms are 22 ± 1 nt. Importantly, no known, robustly curated microRNA has an arm length of <20 nt or >27 nt (and the few known arm products that are 27 nt long all belong to vertebrate-specific paralogues of the MIR-34 family) (Table 1). This characteristic should be used to help discriminate potential novel microRNA sequences from the many other types of non-coding RNA molecules present in animal cells. Because the mean length of the inter-arm loop is 16 nt, the mean length of a metazoan pre-microRNA sequence is 60 nt. The lengths of canonical precursor sequences, however, vary widely between clades, with the cnidarians having some precursors as short as 49 nt and the arthropods having some as long as 250 nt (Supplementary Figure S1). Indeed, the level of taxonomic inclusion in MirGeneDB 3.0 makes it obvious that some species have tightly constrained pre-microRNA lengths of <80 nt (e.g. vertebrates, nematodes, molluscs) while others appear to have no upper limit to the length of the loop (e.g. arthropods, platyhelminthes, rotifers). Although Fromm et al. (13) speculated that there might be a relationship between loop lengths and the number of Dicer genes in a species’ genome such that only taxa with two or more Dicer genes had unconstrained loop lengths, it is now known that even some taxa with only a single Dicer gene [e.g. Annelida (49)] occasionally have extraordinarily long pre-microRNA sequences (e.g. Mir-1994 in the oligochaete annelid E. fetida that measures 137 nt in length). For this reason, the cause of the observed constraint in pre-microRNA length in vertebrates and other groups must remain an outstanding and enigmatic question.
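The length criteria above translate into a simple pre-filter for candidate sequences. The Python sketch below uses the 20 to 27 nt arm bounds reported in the text and reproduces the arithmetic behind the 60 nt mean precursor length (22 + 16 + 22). The function names and the example sequences are made up for illustration, and passing this filter alone says nothing about whether a candidate meets the full biogenesis criteria.

```python
MIN_ARM_NT = 20  # shortest curated arm length reported in the text
MAX_ARM_NT = 27  # longest curated arm length reported in the text


def arm_length_ok(arm_sequence):
    """True if a processed arm falls within the curated length range."""
    return MIN_ARM_NT <= len(arm_sequence) <= MAX_ARM_NT


def precursor_length(arm_5p, loop, arm_3p):
    """Pre-microRNA length taken as 5p arm + inter-arm loop + 3p arm."""
    return len(arm_5p) + len(loop) + len(arm_3p)


# Made-up sequences with the typical lengths given in the text.
arm_5p = "U" + "A" * 21  # 22 nt
loop = "G" * 16          # 16 nt
arm_3p = "C" * 22        # 22 nt

total_nt = precursor_length(arm_5p, loop, arm_3p)
print(total_nt)
print(arm_length_ok(arm_5p), arm_length_ok("A" * 19), arm_length_ok("A" * 28))
```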
Table 1. Nucleotide propensities of microRNAs in MirGeneDB 3.0 stratified by mature type
Mature type | 5p arm Min | 5p arm Median | 5p arm Max | 5p arm SD | Loop Min | Loop Median | Loop Max | Loop SD | 3p arm Min | 3p arm Median | 3p arm Max | 3p arm SD | Base-pairing |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5p Matures (N = 9449) | 20 | 22 | 27 | 0.89 | 5 | 16 | 235 | 7.58 | 20 | 22 | 26 | 0.72 | 19 ± 1.587 |
3p Matures (N = 10 572) | 20 | 22 | 27 | 0.99 | 7 | 15 | 238 | 8.27 | 20 | 22 | 26 | 0.81 | 19 ± 1.581 |
co-Matures (N = 1155) | 20 | 23 | 26 | 0.96 | 6 | 16 | 170 | 5.92 | 20 | 22 | 26 | 0.77 | 19 ± 1.647 |
Arm and loop lengths are in nucleotides (nt); SD, standard deviation.
Beyond the lengths of the precursor and processed arm sequences, we can now also discern sequence and structural propensities in metazoan microRNAs, as well as processing propensities that may differ between 5p-derived and 3p-derived mature sequences. Figure 4 shows heat maps of both for representative species, with sequence propensities in color and structural propensities in gray (see Supplementary Figure S2 for all species). These trends are determined from an analysis of all 21 176 microRNA sequences in MirGeneDB 3.0.
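A sequence propensity of this kind is a per-position nucleotide frequency across a set of aligned sequences. The Python sketch below shows one way to compute it for a single position. The function name and the four short input sequences are made up for illustration; they are not MirGeneDB entries, and the sketch is not the pipeline used to produce Figure 4.

```python
from collections import Counter


def position_frequencies(sequences, position):
    """Fraction of A, C, G and U at a 1-based position across sequences.

    Sequences shorter than the requested position are skipped.
    """
    counts = Counter(
        seq[position - 1] for seq in sequences if len(seq) >= position
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {nt: counts[nt] / total for nt in "ACGU"}


# Made-up RNA fragments, read 5' to 3' from the start of the mature arm.
toy_matures = [
    "UGAGGUAGUAGG",
    "UAAAGCUAUCGA",
    "CAGUGCAAUGAU",
    "UCCCUGAGUCCC",
]

freq_pos1 = position_frequencies(toy_matures, 1)
freq_pos9 = position_frequencies(toy_matures, 9)
print(freq_pos1)
print(freq_pos9)
```

Repeating the calculation for every position and plotting the fractions as color intensity gives a heat map of the kind described above.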

Figure 4. Global summary of the sequence and structure propensities of metazoan microRNAs. (A) Sequence and structural signatures for 9497 5p mature microRNAs, and at the bottom for 10 940 6 3p mature microRNAs for 114 bilaterian taxa (the number of considered microRNAs for the select species is given in parentheses). Shown across the top are the read percentages from MirGeneDB split into matures, stars, loops, as well as 5′ and 3′ Drosha-cleavage RNAs (DcRNAs). Below the sequence logos are sequence-signature heatmaps (A = green; C = blue; G = orange; U = red); the darker the shade, the higher the incidence of that particular nucleotide at that particular position. Below the sequence-signature heatmaps are the structural-signature heatmaps: the darker the shade, the more likely the structure is unpaired at that position. Loop sequences are truncated to the minimal required length of 8 nucleotides (Fromm et al. 2015) for all sequences and for all taxa, and only show the first four 5′ nucleotides and the terminal 3′ nucleotides of each loop sequence. Note that both 5p matures and 3p matures have a high propensity to have a uridine in positions 1 and 9 (see also the corresponding sequence logos). Interestingly, the stars belonging to 5p matures have a propensity to start with a cytosine at position 1, whereas stars belonging to 3p matures tend to start with an adenosine. Of note are the conspicuous unpaired regions immediately 5′ on the 5p arm and 3′ on the 3′ arm of the microRNAs found in Caenorhabditis elegans and Caenorhabditis briggsae, especially the ones whose mature gene products arise from the 5p arm, as well as the internal base-paired regions, in relation to all other analyzed bilaterian species. (B) The sequence and structural propensities and processing motifs of a representative microRNA (Hsa-Mir-26-P2). Six processing motifs are clearly seen in many pri-microRNAs (moving 5′–3′ along the primary RNA sequence) and most are readily apparent with the sequence and structural propensities shown in panel (A): (1) the basal or proximal UG motif (red) at the transition from the flanking unpaired sequence to base-paired stem structure; (2) the ‘bulged GWG’ motif (yellow) with a bulged uridine or adenosine (W) on the 3p arm flanked by 5′C-G-3′ base pairs at positions −10 and −11 (Baek et al. 2024); (3) the ‘CmR’ motif (green), also known as the ‘GHG’ motif (where ‘H’ is A, C or U) (Fang & Bartel 2015) or the ‘mGHG’ motif (where ‘m’ is mismatch) (Kwon et al. 2019), that shows a bias for cytidine at position −5 of the 5′ DROSHA cut, a purine (usually a guanosine) at position +5 of the 3′ DROSHA cut, and a base-pair mismatch between nucleotides −6 and +4 of the two DROSHA sites; (4) the ‘mid-BMW’ motif (purple) that reflects the propensity for bulges, wobbles and/or mismatches to occur at positions 10–12 on the mature arm relative to the star arm (Wolter et al. 2017; Li et al. 2020, 2021); (5) the apical or distal ‘UGUG’ motif (orange); and (6) the ‘CNNC’ motif (blue).
At the top of Figure 4A are the 9449 microRNA sequences annotated with a 5p mature gene product; below these are the 10 572 microRNA sequences annotated with a 3p mature gene product. Co-mature microRNAs (i.e. those whose read counts for the two arms are within 2× of one another) are not included in Figure 4 (see Supplementary Figure S2). The percentages indicate the proportion of reads for each region of the pre-microRNA, including the DcRNA (50), also known as microRNA offset reads [Mors (51)], both arms and the loop. Shown in color are the sequence propensities along the entirety of the pre-microRNA sequence, with sequence logos to highlight particularly relevant preferences in each of the two arms. As is well known, mature microRNA sequences are dramatically enriched in uridine (U) at positions 1 and 9, irrespective of whether they are 5p or 3p mature sequences. In contrast, the star sequences have different starting nucleotides, with 3p stars (i.e. those associated with 5p matures) typically possessing a cytosine (C) at position 1 and 5p stars typically starting with an adenosine (A). Regardless of this difference, both 3p and 5p stars show an underrepresentation of guanosine (G) and uridine. These differences in the primary sequence propensities of 5p versus 3p matures are accompanied by differences in their secondary structure. The starting uridine in position 1 of 5p matures is generally mismatched with the corresponding nucleotide of the 3p arm, whereas the starting uridine of 3p matures is generally base paired to its cognate partner on the 5p arm (M; the darker the gray in Figure 4A, the higher the mismatch rate between the specified nucleotide and its cognate on the other arm).
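The 2× criterion used above to set co-mature microRNAs aside can be written as a short rule. The following is a minimal sketch, not the MirGeneDB implementation: the function name `classify_arms`, the inclusive treatment of an exact 2× ratio, and the handling of loci without reads are all assumptions made for illustration.

```python
def classify_arms(reads_5p: int, reads_3p: int, fold: float = 2.0) -> str:
    """Label a locus as '5p', '3p' or 'co-mature' from arm read counts.

    Assumption: arms whose counts are within `fold` (inclusive) of one
    another are treated as co-mature; otherwise the arm with more reads
    is called the mature arm.
    """
    if reads_5p < 0 or reads_3p < 0:
        raise ValueError("read counts must be non-negative")
    if reads_5p == 0 and reads_3p == 0:
        # No expression evidence: the rule cannot be applied.
        return "undetermined"
    high, low = max(reads_5p, reads_3p), min(reads_5p, reads_3p)
    if high <= fold * low:
        return "co-mature"
    return "5p" if reads_5p > reads_3p else "3p"
```

For example, a locus with 100 reads on the 5p arm and 60 on the 3p arm would be called co-mature under this sketch, whereas 100 versus 10 reads would give a 5p mature.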
These sequence and structural propensities, coupled with clear signals in sequence evolution (13), highlight that there are indeed important evolutionary and mechanistic distinctions not only between mature and star sequences at large, but also between mature sequences derived from the 5p arm and those derived from the 3p arm. These observations are in striking contrast to the common assumption that the arms of a microRNA locus are effectively interchangeable and, as such, not worth delineating as either mature or star products (52). This, of course, in no way means that both arm products cannot be used as regulatory molecules. Instead, it simply reflects that, on average, most microRNA loci have an evolutionarily selected and mechanistically driven mature gene product that can be discriminated from its necessary but passive partner arm based on discrete sequence and structural propensities (Figure 4A).
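Sequence propensities of the kind summarized in the heat maps can be illustrated with a per-position nucleotide frequency table. The sketch below is an assumption-laden illustration rather than the code used to build Figure 4: `position_frequencies` is a hypothetical helper, sequences are simply aligned at their 5′ ends, and only the first `length` positions are counted.

```python
from collections import Counter


def position_frequencies(seqs, length=22):
    """Return, for each position, the fraction of A, C, G and U.

    Sequences are aligned at their 5' end; DNA input (T) is read as U.
    Positions beyond the end of a shorter sequence are not counted.
    """
    counts = [Counter() for _ in range(length)]
    for seq in seqs:
        rna = seq.upper().replace("T", "U")
        for i, nt in enumerate(rna[:length]):
            counts[i][nt] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append(
            {nt: (counter[nt] / total if total else 0.0) for nt in "ACGU"}
        )
    return freqs
```

Applied separately to 5p matures, 3p matures and their stars, a table like this would make preferences such as the uridine at positions 1 and 9 directly comparable between the groups.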
In addition to these characteristics of the fully processed arms, motifs in the primary and precursor microRNAs also help the microprocessor to distinguish bona fide pre-microRNA sequences from other transcribed (but non-microRNA) genomic hairpins and to position its two RNase III domains at the appropriate site of the pri-microRNA sequence (51,53). Six such motifs are currently recognized, each of which disrupts the overall structural symmetry of the prototypical pri-microRNA (54) (Figure 4B), albeit with only a moderate effect on processing individually, and diminishing returns collectively (Auyeung et al. 2013; Baek et al. 2024). Two motifs help position DROSHA and DGCR8 at the ends of the double-stranded pre-microRNA: the basal ‘UG’ motif (Figure 4B, pink), which guides DROSHA to the proximal end of the hairpin (36), and the apical ‘UGUG’ motif (Figure 4B, orange), which guides DGCR8 to the distal end of the hairpin. The ‘CNNC’ motif (where N = any nt), found on the single-stranded end of the 3p arm (Figure 4B, blue), appears to engage the splicing factors SRSF3 and SRSF7 to assist in pre-microRNA recognition and processing (55,56). The ‘mid-BMW’ motif, so named because of the propensity to find bulges, wobbles and/or mismatches at positions 10–12 on the mature arm relative to the star arm (Figure 4B, purple), also helps position DROSHA at the correct cutting site by preventing it from binding to the apical junction (57–59). The ‘bulged GWG’ motif with a bulged uridine or adenosine (W) on the 3p arm flanked by 5′C-G-3′ base pairs at positions −10 and −11 on the 5p arm (Figure 4B, yellow) dramatically increases microprocessing efficiency and homogeneity (36). Finally, a motif located a few nucleotides upstream of the 5p DROSHA cut and a few nucleotides downstream of the 3p DROSHA cut appears to be the key positioning motif for DROSHA.
This motif is called the ‘GHG’ motif (where ‘H’ is A, C or U) by Fang & Bartel (54) or the ‘mGHG’ motif (where ‘m’ is mismatch) by Kwon et al. (60). However, given the data in Figure 4, it appears that neither of these monikers reflects the actual sequence propensities found in this motif. Instead, the bias appears to be for a cytidine at position −5 of the 5′ DROSHA cut, a purine (usually a guanosine) at position +5 of the 3′ DROSHA cut, and a mismatch between nucleotides −6 and +4 of the two DROSHA sites. Hence, we refer to this motif as the ‘CmR’ motif (Figure 4A and B, green), where the ‘C’ highlights the propensity for the cytidine, the ‘m’ the mismatch, and the ‘R’ the purine. This motif is recognized by the dsRBD of DROSHA, which helps precisely position the two catalytic RNase III domains on the pri-microRNA relative to the single- to double-stranded junction (60).
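The three components of the CmR motif described above can be tested on a single pri-microRNA once the two DROSHA cut sites are known. The following is a minimal sketch under stated assumptions, not the procedure used for Figure 4: `has_cmr_motif` and its 0-based cut indices are hypothetical, position −1 is taken as the nucleotide immediately upstream of the 5′ cut and +1 as the nucleotide immediately downstream of the 3′ cut, and G-U wobbles are counted as paired rather than mismatched.

```python
# Watson-Crick pairs plus G-U wobbles; anything else counts as a mismatch
# (treating wobbles as paired is an assumption of this sketch).
PAIRED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def has_cmr_motif(pri_seq: str, cut5: int, cut3: int) -> bool:
    """Check for C at -5, a purine at +5 and a -6/+4 mismatch.

    cut5: 0-based index of the first nucleotide of the 5p arm
          (the 5' DROSHA cut lies immediately before it).
    cut3: 0-based index of the first nucleotide after the 3p arm
          (the 3' DROSHA cut lies immediately before it).
    """
    seq = pri_seq.upper().replace("T", "U")
    # Both flanks must be long enough to contain positions -6 and +5.
    if cut5 < 6 or cut3 + 5 > len(seq):
        return False
    minus6, minus5 = seq[cut5 - 6], seq[cut5 - 5]
    plus4, plus5 = seq[cut3 + 3], seq[cut3 + 4]
    return (
        minus5 == "C"
        and plus5 in ("A", "G")
        and (minus6, plus4) not in PAIRED
    )
```

With this numbering and a 2-nt 3′ overhang at the DROSHA cut, position −6 lies opposite position +4 in the basal stem, which is why those two nucleotides are compared for the mismatch.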
With so many annotated microRNAs from such a broad collection of metazoan taxa, it is unlikely that any additional sequence or structural propensities that could be delineated by simply comparing sequences and secondary structures among species remain to be found. However, the losses of the features described above are easily discerned, most dramatically in Caenorhabditis nematodes. Both C. elegans and C. briggsae have lost essentially all of these motifs, replacing them with a new and unique secondary structural signal 5′ and 3′ of the ancestral CmR motif (61,62). Because nematodes are not basal bilaterians, but are instead nested within the Ecdysozoa alongside animals like arthropods (63), the absence of these ancestral motifs and the appearance of this new structural signal long postdate the bilaterian LCA, highlighting that these motifs belong to the ancestral mode of microRNA processing, which evolved long before the appearance of nematodes, and were, indeed, likely present in the bilaterian LCA.
microRNA annotation workflow and modus operandi of MirGeneDB
The creation of MirGeneDB was prompted by our own efforts to conduct comparative microRNA studies (40,41,64,65) and the obvious lack of a well-curated microRNA reference database. For this reason, our past, current and future selection of new taxa for MirGeneDB has been, is and will be informed by the need to efficiently represent the broad diversity of the metazoan tree of life.
The incorporation of taxa generally requires the availability of a genome reference (either a complete assembly or raw genomic sequencing data) and smallRNA read data from the same or a very closely related species. Exceptions to this rule in MirGeneDB 3.0 primarily include evolutionarily and phylogenetically important species for which reads are currently not available and will not likely be available in the foreseeable future, such as the coelacanth, the tuatara, the chambered nautilus and the annelid Dimorphilus. For these genome-only species, all reads are clearly indicated as ‘predicted,’ and the determination of mature versus star arms was made based on the processing of the orthologous sequences in a closely related species. Assembled genomes are downloaded and preliminarily annotated by MirMachine (28), while unassembled genomic read data are processed along with smallRNA-seq data by MirCandRef (24). The acquisition of publicly available smallRNA sequencing data is less straightforward because, in our experience, many existing publications either have not deposited their data in the Sequence Read Archive (SRA) (66), or suffer from poor sample annotation. This may have historical causes because data deposition was not mandatory until recently, with the SRA only being launched in 2007. These difficulties are complicated even further by the issue of orphan SRA datasets, which lack PubMed IDs because they have never been formally described in a publication, even though they are publicly available. To meet these challenges, we use miSRA (67), a newly developed and publicly available tool to screen the SRA repository for existing sequencing data as a function of species and other metadata of interest. Frequently, SRA metadata are scarce or even incorrect, so we manually (re)annotate the origin, accession and publication information of as many smallRNA sequencing datasets as possible.
We provide all processed and annotated fasta files of smallRNA sequencing data for each species in our download section (https://mirgenedb.org/download) as a service to our users and the public in general (see Supplementary Table S2).
These data are either downloaded and processed with miSRA, or downloaded with SRA-toolkit (66) and processed by miRTrace (68). The processed reads, together with the MirMachine predictions of conserved microRNAs from the genome, are then subjected to a microRNA prediction run with MirMiner (43). The microRNA prediction results are then manually curated and annotated using synteny information (if available) and tree-based comparisons of sequence deviations. This rigorous process is time-consuming, but guarantees the highest possible quality for our microRNA annotations, which can be considered to be near-free of false positives.
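The order of steps described above can be summarized as a simple decision procedure. The sketch below only restates the workflow in code form: it does not call any of the named tools, and the function `annotation_plan`, its parameters and the step descriptions are hypothetical names chosen for illustration. Rejecting unassembled genomes without smallRNA reads is likewise an assumption, based on the statement that MirCandRef processes both data types together.

```python
def annotation_plan(genome_assembled: bool,
                    has_small_rna_reads: bool,
                    reads_via_misra: bool = True) -> list:
    """Return the ordered annotation steps for one species (descriptive only)."""
    if not genome_assembled and not has_small_rna_reads:
        # Assumption: raw genomic reads alone are not sufficient.
        raise ValueError("unassembled genomic reads require smallRNA read data")

    steps = []
    if genome_assembled:
        steps.append("MirMachine: preliminary annotation of the assembled genome")
    else:
        steps.append("MirCandRef: process genomic reads together with smallRNA reads")

    if has_small_rna_reads:
        if reads_via_misra:
            steps.append("miSRA: download and process smallRNA reads")
        else:
            steps.append("SRA-toolkit: download smallRNA reads")
            steps.append("miRTrace: process smallRNA reads")
        steps.append("MirMiner: microRNA prediction from reads and conserved candidates")
    else:
        # Genome-only species: no expression data to call mature and star arms.
        steps.append("Label entries as 'predicted' and assign arms from orthologs")

    steps.append("Manual curation: synteny and tree-based sequence comparisons")
    return steps
```

Calling `annotation_plan(True, False)` returns the shorter route taken for genome-only species such as the coelacanth, in which arm assignment relies on orthologous sequences rather than reads.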
To ensure that the annotation process is consistent for all species in MirGeneDB, we are not open to direct submissions of annotations, though we are always grateful for suggestions and pointers to new species and data. If we act on a suggestion for a new species, we reserve the right to re-annotate its microRNA complement before hosting it on MirGeneDB. In the near future, we will update MirGeneDB with additional species announced on mirgenedb.org.
Computational improvements
The most important and noteworthy change is that, as of this release, MirGeneDB is hosted on two independent servers in two different cities in Norway through the Norwegian Research and Education Cloud. These servers act as computational mirrors, so that the database remains available if one of them fails.
Integration with other services
MirGeneDB is part of RNAcentral (69), and version 3.0 will be included in its next release. Covariance models (CMs) of MirGeneDB 3.0 species have been submitted to Rfam (47) and their integration will be finalized in the next Rfam update. MirGeneDB annotations for overlapping species and MirMachine de novo annotation of conserved microRNAs are currently being implemented in Ensembl’s annotation pipeline (70).
Conclusion
More than 30 years and 150 000 publications after the discovery of microRNAs (71), the importance and relevance of these regulatory molecules for biomedical, evolutionary and biosystematic research questions are obvious. However, determining what is and what is not a microRNA and annotating them as such has remained a computational and intellectual challenge with major uncertainties about the validity and reliability of public microRNA repositories. With the advent of next-generation sequencing, high-performance computational pipelines and groundbreaking studies that deciphered the intricate mechanism of microRNA biogenesis [see (30,72)], as well as their unique mode of evolution [see (25,44)], the rules for annotation and nomenclature of microRNAs are now robustly established (13). Despite these advances, spurious annotations of thousands of questionable ‘microRNAs’ continue to be released in an apparently never-ending race for more, rather than better, data (73–76), while existing incorrect annotations have led to a flood of highly questionable research findings (77–81), with the potential to create serious harm, both for patients and for the field as a whole, when dealing with human diseases.
Just as natural history museums provide invaluable collections for the biological research community, repositories or databases of genetic elements should provide a reliable reference to the corresponding research community. To guarantee reliability and comparability, physical natural history museums do not allow the deposition of specimens to their collections without first passing them through a rigorous curatorial process by expert taxonomists who adhere to a strict set of rules for the description (and naming) of new species (82). Repositories of genes, coding or non-coding, should also apply such rigorous and institutionalized curation. We see MirGeneDB as this institution for microRNAs.
In the decade since its 2015 release, MirGeneDB has undergone three major expansions that have placed it at the forefront of microRNA annotation and curation and brought it closer to the goal of complete representation of all animal phyla. All expansions adhered to the annotation and nomenclature rules initially established in 2003 (27) and later expanded to the next-generation sequencing level (13), and each introduced new species and computational features, such as the integration of read data and evolutionary nodes of origin for families and genes. This release, besides its many novel species, makes read data we collected from various sources over the years directly available on our website, along with updated and downloadable CMs, as powerful tools for accurate microRNA discovery. Our new naming responsibilities will unite the curatorial and nomenclatural functions in the microRNA field and ensure that future studies of microRNA evolution and function can rely on a dynamic database that grows with the ever-increasing number of publicly available metazoan genomes and small RNA libraries.
We hope to both ignite and contribute to a renaissance in microRNA research by moving beyond an era dominated by technological advancements, which fueled a race to describe increasingly dubious microRNAs in a few model systems and diseases with limited biological significance. Instead, we aim to shift the focus toward biology, as has also been pointed out by others (26,34,35), utilizing comparative approaches based on solid curatorial foundations to deepen our understanding of the emergence of complexity in animal life and beyond.
Data availability
All data herein is publicly available on SRA and on MirGeneDB as detailed in Supplementary File S1.
Supplementary data
Supplementary Data are available at NAR Online.
Acknowledgements
We are grateful to Jose M Martin Duran (Queen Mary University London) for samples of Owenia fusiformis, Mansi Srivastava (Harvard University) for samples of Hofstenia miamia and Bernhard Egger (University of Innsbruck) for the Prostheceraeus crozieri specimen. We also thank Detlev Arendt (EMBL Heidelberg) for early access to the Platynereis dumerilii read data. We thank Blake Sweeney (EMBL-EBI, RNAcentral & Rfam) for support and help integrating MirGeneDB and covariance models into Rfam and RNAcentral. Leanne Haggerty (EMBL-EBI, Ensembl) and Jose M. G. Perez-Silva (EMBL-EBI, Ensembl) are acknowledged for their work with MirGeneDB and MirMachine integration into Ensembl. We thank David P. Bartel (HHMI, MIT, Whitehead Institute) and Victor Ambros (UMass Chan) for discussions and encouragement on the new naming responsibilities of novel conserved microRNAs.
Funding
Dartmouth James O. Freedman Presidential Scholars program (to A.W.C.); Tromsø Forskningsstiftelse [20_SG_BF ‘MIRevolution’ to B.F.].
Conflict of interest statement. None declared.
Notes
Present address: Charles Wu, University of Georgia, Athens, GA, USA.
References
Author notes
The first two authors should be regarded as Joint First Authors.