Li Huang, Li Zhang, Xing Chen, Updated review of advances in microRNAs and complex diseases: experimental results, databases, webservers and data fusion, Briefings in Bioinformatics, Volume 23, Issue 6, November 2022, bbac397, https://doi.org/10.1093/bib/bbac397
Abstract
MicroRNAs (miRNAs) are gene regulators involved in the pathogenesis of complex diseases such as cancers, and thus serve as potential diagnostic markers and therapeutic targets. The prerequisite for designing effective miRNA therapies is accurate discovery of miRNA-disease associations (MDAs), which has attracted substantial research interest during the last 15 years, as reflected by more than 55 000 related entries available on PubMed. Abundant experimental data gathered from this wealth of literature could effectively support the development of computational models for predicting novel associations. In 2017, Chen et al. published the first comprehensive review on MDA prediction, presenting various relevant databases, 20 representative computational models, and suggestions for building more powerful ones. In the current review, as a continuation of the previous study, we revisit miRNA biogenesis, detection techniques and functions; summarize recent experimental findings related to common miRNA-associated diseases; introduce recent updates of miRNA-relevant databases and novel database releases since 2017; present mainstream webservers and new webserver releases since 2017; and finally elaborate on how fusion of diverse data sources has contributed to accurate MDA prediction.
MicroRNA
The past two decades have witnessed significant advances in identifying microRNAs (miRNAs) and understanding their biogenesis, functions and roles in complex human diseases. Although the first miRNA, lin-4, was discovered by Ambros’s team in a 1993 study of Caenorhabditis elegans developmental timing [1], it was not until 2001 that researchers appreciated the gene regulatory functions of such endogenous noncoding RNAs of about 22 nucleotides, and defined them as a distinct class named ‘microRNAs’ [2–4]. So far there are 1917 annotated miRNAs within the human genome according to the latest release v22 of miRBase [5], while recent research estimated the total number to be ~2300 [6].
Introduction
Biogenesis
MiRNAs can be processed in one of two pathways. The canonical biogenesis pathway involves both nuclear and cytoplasmic cleavage events [7]. It begins with transcription of miRNA genes (about 30% from introns of protein-coding genes and most of the remainder from dedicated miRNA gene loci) by RNA polymerase II or III in the nucleus to generate the long primary miRNAs (pri-miRNAs), which may ultimately give rise to a cluster of miRNAs or a single one [8, 9]. The pri-miRNAs are subsequently cleaved by the Microprocessor complex, which contains the double-stranded RNase III enzyme (DROSHA) and the DiGeorge syndrome critical region 8 (DGCR8), to form the ~60–70-nucleotide precursor miRNAs (pre-miRNAs) [10, 11]. In the ensuing step, the pre-miRNAs are exported from the nucleus to the cytoplasm by the exportin 5 (XPO5) protein for further processing by the RNase III enzyme Dicer to produce the ~22-nucleotide miRNA duplexes [12]. Each duplex comprises a passenger strand and a guide strand: while the former is degraded by cellular machinery [13], the latter is the mature miRNA, which is loaded into a protein of the Argonaute (AGO) family to yield the miRNA-induced silencing complex (miRISC). The miRISC base-pairs with the target mRNA to regulate gene expression through mRNA cleavage or translational repression [7, 8]. Alternatively, the non-canonical biogenesis pathway processes miRNAs using different combinations of the proteins involved in the canonical pathway, and can be categorized into Drosha/DGCR8-independent and Dicer-independent [13, 14]. The former generates Dicer substrate-like pre-miRNAs (such as mirtrons from the introns of host mRNA genes during splicing [15, 16]), which are directly exported to the cytoplasm without Drosha-mediated cleavage. The latter, in contrast, cleaves the endogenous short hairpin RNAs (shRNAs) via Drosha to liberate pre-miRNAs, which are then loaded into the AGO2 protein to catalyze maturation in the cytoplasm [17, 18].
Like the product of the canonical pathway, both non-canonical pathway categories eventually result in a functional miRISC. While most miRNAs reside within the cellular microenvironment, extracellular miRNAs (ECmiRNAs), also known as circulating miRNAs, have also been found in plasma, serum and other bio-fluids [19–22]. During biogenesis, mature miRNAs can be selectively released from the cytoplasm into the extracellular environment to become circulating miRNAs, either by encapsulation in lipid vesicles including exosomes, micro-vesicles and apoptotic bodies, or by association with RNA binding proteins like AGO2 and high-density lipoprotein complexes (HDL) [23–27]. Circulating miRNAs exhibit impressive stability, even when exposed for a long time to harsh conditions like extremely high/low pH, repetitive boiling or freezing–thawing cycles [28]. There are ongoing research efforts to understand the mechanisms underlying this stability [23, 28, 29] and to take advantage of it by using circulating miRNAs as novel diagnostic and therapeutic biomarkers for severe diseases, particularly cancers [24, 30–34].
Detection
Techniques for detecting miRNAs broadly fall into two types: traditional methods and novel approaches based on signal amplification strategies [35]. Northern blotting [36] is the most classical technique, widely adopted to detect miRNAs through a workflow of RNA extraction, separation by agarose gel electrophoresis, denaturing, blotting and fixing onto a membrane, reaction with marker-labelled probes, membrane washing, and finally signal checking [35, 37–39]. This method requires no specialized equipment or technical knowledge to deploy but suffers from low sensitivity and high time consumption. Alternative traditional techniques are real-time quantitative polymerase chain reaction (qPCR) [40–42], which produces results of high sensitivity and specificity, and microarray analysis [43, 44], which supports fast and high-throughput miRNA detection. The former uses special primers to convert target miRNAs into synthesized complementary DNAs (cDNAs) via reverse transcription, and then quantifies the relationship between target miRNAs and PCR products [41, 45]. However, the use of real-time qPCR is limited by the risk of false-positive results and the need for careful primer design and specialized equipment [35, 37]. The latter technique prepares a miRNA microarray using oligonucleotide probes, then labels the miRNAs (isolated from samples) with fluorescent dye, next hybridizes them to the microarray and lastly analyses the fluorescence signal data to quantify the miRNAs’ expression levels [44]. Drawbacks of microarray profiling include high expense and relative ineffectiveness in detecting low copy number miRNAs or distinguishing miRNAs with similar sequences [35, 46]. To address the limitations of traditional methods, nanomaterial-based amplification [47–51] and nucleic acid amplification approaches [52–56] have been developed in recent years and are used with increasing frequency to obtain more sensitive outcomes.
Owing to their large surface area, electrical conductivity, chemical stability, cellular transfection versatility, photostability and low immunogenicity [35, 57], nanoparticles (such as nanostructured gold AuNPs [58], silver AgNCs [59] and copper CuNPs [51]) can be used to achieve highly selective detection of miRNAs with very short chains, similar sequences and low concentration [47] as well as to facilitate label-free detection [49]. Nucleic acid amplification, on the other hand, relies on various strategies to rapidly produce copies of target miRNAs at a constant temperature before analyzing the miRNA sequences, which is known as isothermal amplification [56]. Popular strategies include rolling circle amplification (RCA) [60], duplex-specific nuclease-based amplification [61], loop-mediated isothermal amplification (LAMP) [62], strand-displacement amplification (SDA) [63] and hybridization chain reaction (HCR)-based enzyme-free amplification [64]. In addition, researchers have attempted to combine different amplification methods to leverage their strengths, such as integrating SDA with RCA [65] and coupling HCR with gold nanoparticles [66]. Furthermore, DNA nanostructures [67] and nucleic acid amplification-free methods [68, 69] have also emerged with proven effectiveness in miRNA detection.
Function
MiRNAs’ functions can be determined by animal knockout models (for loss-of-function) and transgenic overexpression (for gain-of-function) experiments [70, 71]. For instance, deleting a miR-96 copy will lead to deafness in mice (and humans), while dropping a copy of the miR-17-92 cluster will result in deficiencies in skeletons, growth and learning [72]. Another example is that pancreas-specific overexpression of let-7 in mice reduces glucose tolerance, hence causing a decreased level of glucose-induced pancreatic insulin secretion [73]. Research [70] indicates that a third of genes in the human genome are collectively regulated by miRNAs. Indeed, their functional roles can be found in diverse biological contexts that, as reported by previous reviews [74, 75], include nervous systems, cell differentiation and development, viral infection, immunity, and more. Specifically, miRNA expression is associated with proliferation and differentiation of early embryonic stem cells as well as maintenance of mature cells in central nervous systems [70, 76, 77]. Similar functions in cell differentiation and development can also be found in limb development, adipogenesis, myogenesis, angiogenesis and haematopoiesis, neurogenesis, and epithelial morphogenesis [70, 71, 78]. Moreover, the actions of miRNAs on viral infection are of a dual nature: virus-encoded miRNAs can guide an AGO protein to a viral transcript, whereas host miRNAs can target a cellular mRNA to alter the host antiviral response [79–82]. In addition, viral suppressors of RNA silencing (VSRs) induced by infections can impede the biogenesis of host miRNAs that contribute to the antiviral response [83]; viral noncoding RNAs further induce degradation of host miRNAs via complementary sequences [84]. This suggests that miRNAs are ‘more friends than foes’ to viruses and, therefore, can be used as effective therapeutic targets [82].
As for immune systems, deregulated expression of specific miRNAs (such as miR-124 [85] and miR-146 [86]) is correlated with autoimmunity, immune tolerance, inflammation and cancer [74, 87]. While the aforementioned functions are mostly based on the paradigm that miRNAs post-transcriptionally modulate the expression levels of target protein-coding genes by inhibiting translation or enhancing mRNA decay [72], several unconventional miRNA functions are also worth noting [88], such as interactions with non-AGO proteins that interfere with the survival of leukaemic blast cells in chronic myelogenous leukaemia [89], activation of Toll-like receptors (TLRs) to induce neurodegeneration [90] and a pro-metastatic inflammatory response in cancers [91], and upregulation of target mRNA expression on cell cycle arrest [92].
MiRNA-disease associations
A deeper understanding of miRNA biogenesis and function helps uncover the molecular, physiological and pathological mechanisms of complex human diseases, thereby contributing to MDA discovery, which in turn benefits the diagnosis and treatment of diseases [75]. Abnormal expression of tissue-specific miRNAs can give rise to disorders in the corresponding tissues: upregulation of cardiac miRNAs miR-27a and miR-133, for instance, leads to contractility [93] and chronic heart failure [94], respectively; downregulated miR-541 expression is correlated with cardiac hypertrophy [95]; low expression of let-7g and miR-433 and high expression of miR-214 are found to promote tumorigenesis and progression of gastrointestinal cancers [96]; deregulated miR-34a, miR-122 and miR-192 are involved in the development of non-alcoholic fatty liver diseases and non-alcoholic steatohepatitis, with the potential of being biomarkers for determining different disease severity levels [97]. Two circulating miRNAs, miR-192 and miR-194, play a functional role in both type 1 and type 2 diabetes mellitus (T1DM and T2DM), and longitudinal studies suggest that they can serve as biomarkers for assessing the risk of T2DM [98]. As a further example of effects on viral pathogenesis, the liver-specific miRNA, miR-122, can promote the replication of hepatitis C virus (HCV) by stabilizing the viral genome RNA [70, 99].
These examples of diseases associated with tissue-specific, circulating, or gene-targeting miRNAs are only a fraction of the numerous MDAs discovered so far. The newest Version 3.2 of the Human MicroRNA Disease Database (HMDD v3.2) [100] records more than 35 000 association entries, classifying them into six generalized categories: circulation, tissue, genetics, epigenetics, target and other. Each category contains one or more distinct types of evidence to support assessment of the confidence level for the association. Many-to-many relationships exist between miRNAs and diseases [101] as well as between MDAs and their types [75]. The former means that a miRNA can be associated with multiple diseases, such as let-7 being involved in both prostate neoplasms (PN) development [102] and Hepatitis B Virus (HBV) infection [103]; and vice versa, a disease can be associated with multiple miRNAs, such as miR-103 [104] and miR-128 [105] also being associated with PN. The latter indicates that an association can be related to multiple types, as reflected by the fact that the association between let-7 and PN belongs to both circulation [102] and genetics [106]; and several associations between the same miRNA and different diseases can be of the same type, as exemplified by let-7’s target-type associations with human immunodeficiency virus (HIV) infection [107] and Perlman Syndrome [108].
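The many-to-many structure described above can be made concrete with a small data-structure sketch. It is only an illustration, not the HMDD schema: the dictionary layout and the function names `index_associations` and `pairs_of_type` are assumptions introduced here, and the entries are limited to the examples named in the text (an empty set marks an association whose type the text does not state).

```python
from collections import defaultdict

# (miRNA, disease) -> association types stated in the text above.
# An empty set means the text names the association but not its type.
associations = {
    ("let-7", "prostate neoplasms"): {"circulation", "genetics"},
    ("let-7", "HBV infection"): set(),
    ("let-7", "HIV infection"): {"target"},
    ("let-7", "Perlman syndrome"): {"target"},
    ("miR-103", "prostate neoplasms"): set(),
    ("miR-128", "prostate neoplasms"): set(),
}


def index_associations(assoc):
    """Build both one-to-many views of the many-to-many relation."""
    diseases_by_mirna = defaultdict(set)
    mirnas_by_disease = defaultdict(set)
    for mirna, disease in assoc:
        diseases_by_mirna[mirna].add(disease)
        mirnas_by_disease[disease].add(mirna)
    return diseases_by_mirna, mirnas_by_disease


def pairs_of_type(assoc, assoc_type):
    """Return the sorted (miRNA, disease) pairs annotated with a given type."""
    return sorted(pair for pair, types in assoc.items() if assoc_type in types)


diseases_by_mirna, mirnas_by_disease = index_associations(associations)
print(sorted(mirnas_by_disease["prostate neoplasms"]))
print(pairs_of_type(associations, "target"))
```

Keying the dictionary on the (miRNA, disease) pair keeps one association with several types as a single entry, which mirrors the second many-to-many relationship between MDAs and their types.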
Recent findings on five prevalent miRNA-related diseases
The previous MDA review by Chen et al. [75] discussed MDA research regarding five cancers of high incidence rate, namely, breast cancer, lung cancer, prostate cancer, colorectal cancer (CRC) and kidney cancer. As a continuation, we report their updated incidence rates based on Global Cancer Statistics 2020 [109] and summarize new research findings related to them since 2017.
Breast cancer
Early diagnosis, effective treatment and metastasis monitoring remain challenging for breast cancer (BC), the most common cancer in the world (over 2 million new cases and about 685 thousand new deaths according to Global Cancer Statistics 2020 [109]), owing to its heterogeneous histological patterns, diverse gene expression profiles and complex clinical features [110, 111]. Substantial research efforts have been made to explore the potential of miRNAs as diagnostic biomarkers and therapeutic targets and to uncover more BC-miRNA associations. Recent advances include: circulating miRNAs’ beneficial role in BC management based on their ease of isolation and stable structures [112], the roles of exosomal miRNAs (exo-miRNAs) in tumour microenvironmental reactions to the growth of cancer cells [113, 114], the correlation between defects in miRNA biogenesis and BC traits [115], preclinical and clinical investigations of tissue-specific miRNAs’ effects on prognosis [116], and miRNAs’ contribution to cancer invasiveness [117]. For instance, the miR-34 family targets multiple proto-oncogenes (such as CD24, SIRT1 and ZEB1) to regulate tumorigenesis, apoptosis, metastasis, invasion and chemoresistance of BC [111]; BC invasion, osteomimicry and bone destruction can be hindered by the miR-30 family through repressing several bone metastasis-related genes like DKK-1 and RUNX2 [118]; the miR-449 family elevates the effect of doxorubicin treatment on triple negative breast cancer (TNBC) by targeting cell cycle factors such as E2F1 and E2F3 [119].
Lung cancer
This malignancy is the leading cause of cancer-related mortality and the second most common cancer after BC, with nearly 1.8 million new deaths and 2.2 million new cases globally in 2020 [109]. The poor prognosis is due to the absence of obvious symptoms that would allow early diagnosis, inadequate screening techniques, ineffective treatments and late onset of severe symptoms [120, 121]. Recently, researchers have paid considerable attention to early detection of lung cancers (LCs) using exo-miRNAs and circulating miRNAs as liquid biopsy biomarkers [121–127]. Other studies focus on improving the understanding of LC biology, such as the investigation of miRNAs’ functions in metastasis, including the epithelial–mesenchymal transition (EMT) and migration [128], the systematic analysis of miRNA experiments for selecting the most effective biomarkers [129], and the updates to the existing regulatory network of circulating RNAs, miRNAs and mRNAs in LC [130]. Novel associated miRNAs include miR-19 that controls CBX7 expression to promote the proliferation of non-small-cell lung cancer (NSCLC) [131], miR-33a-5p and miR-128-3p that function as tumour inhibitors in whole blood [132], miR-320a that downregulates STAT4 to foster the immunosuppressive macrophages M2 phenotype for LCs [133], and miR-146b-3p that is negatively correlated with TRAF6 expression in NSCLC tissues [134].
Prostate cancer
This is the third most commonly diagnosed cancer worldwide (more than 1.4 million new cases and ~375 thousand new deaths in 2020) [109], with 5-year survival rates varying drastically from almost 100% at the localized tumour stage to less than 30% at the advanced metastasis stage [135]. Therefore, it is clinically important to develop effective diagnosis strategies to achieve early detection of prostate cancers (PCs), which has recently attracted significant research interest. A systematic analysis was carried out in 2018 on the expression of 84 miRNAs in two cohorts of healthy men and PC patients to identify sensitive miRNA pairs as potential diagnostic biomarkers [135]. Among all possible combinations of miRNA pairs, miR-107-miR-26b-5p and miR-375-3p-miR-26b-5p had the most diagnostically significant expression in the urine supernatant fraction of the two cohorts. Meanwhile, another study profiled 372 miRNAs in plasma from patients and healthy controls, indicating the regulatory roles of miR-4289, miR-326, miR-152-3p and miR-98-5p in PC pathogenesis and their potential for early diagnosis [136]. Moreover, a large-scale circulating miRNA profiling for PC liquid biopsy was performed on serum samples from more than 800 PC patients, 241 healthy controls and 500 patients with other cancer types, and demonstrated the robust, high diagnostic performance of miR-17-3p and miR-1185-2-3p [137]. Researchers also analyzed the expression of miRNAs in semen exosomes from men with a modest rise in prostate-specific antigen (PSA) levels [138]. The results showed that, when combined with the blood PSA concentration, miR-142-3p, miR-142-5p and miR-223-3p could be effective non-invasive biomarkers for distinguishing PCs.
Colorectal cancer
This is the fifth most frequently diagnosed cancer (over 1.1 million new cases and about 577 thousand new deaths globally in 2020) [109], which can be caused by a myriad of factors like genetic alterations, dietary habits and obesity [139]. CRC is treated by chemotherapy, radiotherapy, targeted therapy, immunotherapy or surgery. The current 5-year survival rate is about 65%, according to the Surveillance, Epidemiology and End Results database for cancer statistics [140], which leaves room for further improvement compared with the rates of other cancers, such as 90% for BC. Recent research findings include the co-regulatory functional relationships between tumour suppressor genes, oncogenes and miRNAs in CRC [141], the involvement of miRNAs in liver metastasis via controlling the EMT of CRC cells [142], the role of miRNAs as a potential molecular link between the metabolisms of obesity and CRC based on the similar dysregulated expression in these diseases [143], the discovery of exosome-encapsulated miRNAs as new CRC biomarkers [144–146], enhanced understanding of interactions between miRNAs, long non-coding RNAs (lncRNAs) and mRNAs in CRC [139], and the various functions of dysregulated miRNAs in molecular signalling pathways [147]. Newly uncovered CRC-associated miRNAs include miR-708 that hinders CRC development via regulating ZEB1 in the AKT/mTOR signalling pathway [148], miR-148a-3p that is underexpressed in CRC to increase the PD-L1 protein expression on tumour cells and suppress immunity [149], and miR-34a that modulates the Notch signalling pathway to inhibit CRC metastasis [150]. Systematic experiments have also been carried out to reveal novel regulatory functions of known CRC-associated miRNAs. For instance, miR-199 originally targeted IKKB in the NF-κB signalling pathway in CRC and was observed to be correlated with a seed-region match between TNFRSF11A (RANK) and miR-199a-5p [151].
Kidney cancer
Globally, more than 430 thousand new cases of kidney cancers (KCs) and nearly 180 thousand KC-caused deaths are reported in Global Cancer Statistics 2020 [109]. Most KCs can be classified into three main categories, renal cell cancer (RCC), transitional cell cancer (TCC) and Wilms tumour (WT), according to the NIH National Cancer Institute (https://www.cancer.gov/types/kidney). Identifying the KC subtypes in patients can aid clinical decision-making on the right treatments, decreasing adverse side-effects of drugs and hence improving the survival rate [152]. Recent research progress encompasses the development of machine learning models for KC subtype classification based on the miRNA genome [152], the identification of plasma exo-miRNAs as potential prognostic biomarkers for metastatic RCC (mRCC) [153], the role of urinary miRNAs in detecting RCC [154, 155], the enhanced understanding of miRNA molecular mechanisms in clear cell RCC (ccRCC) [156], and the selection of miRNAs as prognosis indicators for predicting the ccRCC survival rate [157]. Novel KC-associated miRNAs recorded in recent literature include: miR-21 and miR-223 that are involved in nodal and distant metastasis of ccRCC [158], four miRNAs (miRNA-21-5p, miRNA-9-5p, miR-149-5p and miRNA-30b-5p) that can serve as prognosis signatures in ccRCC [159], and another three-miRNA signature composed of miR-21, miR-584 and miR-155 for ccRCC diagnosis and prognosis [160].
Databases and webservers
As illustrated in Figure 1, bioinformatics tools have been historically closely coupled with the development of miRNA biology and experimental technologies [161]. Here, we take the following representative tools as examples to illustrate this trend. It should be noted that they are only a fraction of the numerous important miRNA-related resources.

Figure 1. Timeline of research on miRNA-related bioinformatics tools, closely coupled with the development of miRNA biology and experimental technologies.
The initial version of miRBase [162], the central online repository for miRNA genomics, was released in 2002, the same year as the first proposal of using miRNAs as potential biomarkers [163]. Since then, various bioinformatics tools have been introduced alongside the discovery of new miRNA-related biological mechanisms or experimental techniques, as depicted by the following examples. The year 2003 witnessed the identification of Drosha as the miRNA maturation initiator [164], the characterization of the RISC complex [165], the release of the MiRscan webserver [166] for uncovering miRNA genes conserved in multiple genomes, the publication of the miRanda webserver [167] for predicting miRNA targets, and the appearance of the ViennaRNA webserver [168] for inferring miRNA secondary structures. In 2004, the microarray technique was applied to profiling miRNAs [169], while RNAhybrid emerged as a flexible webserver for fast and easy miRNA target prediction [170]. Both the early application of Next-Generation Sequencing (NGS), miRNA-seq [171], and the database of experimentally verified miRNA targets, TarBase [172], were proposed in 2006. Subsequently in 2007, HMDD [173] was released as the first database that curated experimentally verified evidence for MDAs. Then in 2008, the term “isomiRs” (miRNA isoforms) was created to denote variations of miRNA sequences [174], together with the CID-miRNA webserver [175] for predicting pre-miRNAs in the human genome. The next year witnessed the birth of one of the earliest computational models for MDA prediction, namely, Jiang et al.’s hypergeometric distribution method [189]. Later in 2010, the presence of circulating miRNAs was first detected in 12 human bio-fluids [19], and TransmiR was released to record regulations between transcription factors (TFs) and miRNAs [176]. Two years later, Chen et al. proposed another classical MDA prediction model named RWRMDA [177].
In 2013, the technique of CLASH was devised to discover miRNA-target RNA duplexes associated with AGO [178]; meanwhile, a more specific database, miRCancer, was constructed by text mining on literature to store miRNA-cancer associations [179], the miRTarCLIP webserver [180] was proposed to uncover targets from high-throughput sequencing data, and the PHDcleav webserver [181] was developed to infer human Dicer cleavage sites based on sequences and secondary structures of miRNAs.
Thereafter, various novel MDA prediction models were developed to continuously improve the state-of-the-art performance, such as miRPD [182] and RLSMDA [183] in 2014, miRAI [184] in 2016, PBMDA [185] in 2017, IMCMDA [186], BNPMDA [187] and MDHGI [188] in 2018, MDA-CNN [189] in 2019, NIMCGCN [190] in 2020, as well as TDRC [191], MDA-GCNFTG [192] and GAEMDA [193] in 2021. Besides, diverse models for other prediction tasks were also introduced, such as MiRTDL [194] for inferring miRNAs’ targets in 2015 and MiRLoc [195] for predicting miRNA subcellular localization in 2022; and so were a number of new databases and webservers, such as OncomiR [196] for exploring pan-cancer microRNA dysregulation in 2018, Mirnovo [197] for identifying known and novel miRNAs in animals and plants from RNA-seq data, and wTAM [198] for offering annotations of weighted human miRNAs with importance scores for enrichment analysis. Across the timeline, both miRBase and HMDD were updated many times, arriving at the latest releases v22.1 [5] and v3.2 [100], respectively, in 2019.
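To give a feel for the statistic behind the "hypergeometric distribution method" mentioned in the timeline, the sketch below computes a generic hypergeometric tail probability. It is not a reconstruction of Jiang et al.'s model: the function name `hypergeom_tail` and the toy scenario (a background of N items of which K are disease-linked, with n items drawn and at least k of them disease-linked) are assumptions made purely for illustration.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from a
    population of N items that contains K 'successes'."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    upper = min(n, K)          # cannot observe more successes than this
    lower = max(k, 0)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)   # comb returns 0 for impossible draws
        for i in range(lower, upper + 1)
    )
    return favourable / comb(N, n)


# Toy numbers (hypothetical): 10 background items, 4 disease-linked,
# 3 drawn, at least 2 of them disease-linked.
p = hypergeom_tail(N=10, K=4, n=3, k=2)
print(p)  # (6*6 + 4*1) / 120 = 1/3
```

A small tail probability means the observed overlap would be unlikely under random sampling, which is the general sense in which hypergeometric tests are used to rank candidate associations.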
Databases
The previous review [75] classified 22 databases that supported MDA research and prediction into three categories: comprehensive miRNA information, miRNA-related interactions and miRNA-disease associations. Most of them were manually curated and/or completed by text mining from abundant literature, with diverse usage frequencies and citation counts. In this section, we revisit the mainstream and highly cited databases (most of which were actively maintained) within each category regarding their contents and recent updates (if there were any); we also introduce new relevant databases published since 2017. The database summary is shown in Table 1.
Database | Brief description | First release | Last update | Current version | URL |
---|---|---|---|---|---|
Comprehensive miRNA information databases | |||||
Mainstream databases | |||||
miRBase | The major online repository containing comprehensive miRNA information such as sequences, biogenesis precursors, deep sequence expressions, and annotations | 2002 | 2019 | v22.1 | http://mirbase.org/ |
miRGator | An integrated system for miRNA functional annotations | 2007 | 2012 | v3 | http://mirgator.kobic.re.kr/ |
miRGen | Provide cell type-specific TSSs for more than 1500 microRNAs | 2018 | 2020 | v4 | https://diana.e-ce.uth.gr/mirgenv4 |
Rfam | A collection of comprehensive non-coding RNA families, with recent emphasis on miRNA families | 2003 | 2022 | v14.8 | https://rfam.org |
Novel databases since 2017 | |||||
EVmiRNA | Store miRNA expression profiles and related information in extracellular vesicles (EVs) such as exosomes and microvesicles | 2018 | 2018 | – | http://bioinfo.life.hust.edu.cn/EVmiRNA |
miRmine | Facilitate a global view of miRNA expression profiles in tissues, cell-lines, and diseases | 2017 | 2017 | – | http://guanlab.ccmb.med.umich.edu/mirmine |
miRCarta | Serve as a central repository of miRNA candidates to complement miRBase v21 | 2017 | 2017 | – | https://mircarta.cs.uni-saarland.de/ |
MiRNA-related interaction databases | |||||
Mainstream databases | |||||
miRTarBase | One of the most comprehensive databases for annotated and experimentally confirmed miRNA-target interactions (MITs) | 2011 | 2021 | v9 | http://miRTarBase.cuhk.edu.cn/ |
TarBase | A major database of experimentally verified MITs | 2006 | 2017 | v8 | http://www.microrna.gr/tarbase |
miRWalk | Contain both experimentally validated and computationally predicted MITs in human, mouse, rat, dog and cow | 2011 | 2022 | v3 | http://mirwalk.umm.uni-heidelberg.de |
miRecords | An integrated resource for animal MITs | 2009 | 2013 | v4 | http://c1.accurascience.com/miRecords/ |
Novel database since 2017 | |||||
miRwayDB | The first repertoire for experimentally confirmed miRNA-pathway associations (MPAs) in various pathophysiological conditions | 2018 | 2018 | – | http://www.mirway.iitkgp.ac.in |
MiRNA-disease association databases | |||||
Mainstream databases | |||||
HMDD | The most widely adopted database on MDA prediction tasks, which documents 35 547 MDAs spanning six association types with 20 diverse evidences and causality information | 2007 | 2019 | v3.2 | http://www.cuilab.cn/hmdd |
MiR2Disease | A comprehensive repository for associations between deregulated miRNAs and diverse human diseases | 2008 | 2008 | – | http://www.mir2disease.org/ |
DbDEMC | A curated database for associations between differentially expressed miRNAs and various cancers | 2010 | 2022 | v3 | https://www.biosino.org/dbDEMC/index |
MiRCancer | Consolidate research findings on miRNA-cancer associations, automatically retrieved from publications on PubMed | 2013 | 2020 | – | http://mircancer.ecu.edu/ |
Novel databases since 2017 | |||||
Tumour IsomiR Encyclopedia (TIE) | Capture and annotate isomiRs, abundant in existing databases but difficult to map and often overlooked due to their short sequence length and high heterogeneity | 2021 | 2021 | – | https://isomir.ccr.cancer.gov |
MSDD | Document experimentally verified associations between diseases and SNPs in miRNAs | 2017 | 2017 | – | http://www.bio-bigdata.com/msdd/ |
HAHmiR.DB | Focus on miRNAs associated with high-altitude (HA) stress | 2020 | 2020 | – | http://www.hahmirdb.in |
OncomiR | Support exploration of miRNA dysregulation associated with clinical characteristics of cancers by combining a backend database and a dynamic web server | 2017 | 2017 | – | www.oncomir.org |
Comprehensive miRNA information databases
Mainstream databases
miRBase
Available at http://mirbase.org/, miRBase [5] is the major online repository containing comprehensive miRNA information such as sequences, biogenesis precursors, deep sequence expressions and annotations. It also functions as the formal system of miRNA gene nomenclature, defining names of newly discovered miRNA genes. Originally known as the microRNA registry in its first release (2002), miRBase has been actively maintained and was updated to the v22 release in 2018, holding 38 589 hairpin precursor miRNA entries across 271 organisms, with additional information on annotation quality and the biological functions of sequences.
miRGator
This database (http://mirgator.kobic.re.kr/) was initially released in 2007 as an integrated system for miRNA functional annotations [199], and subsequently updated in 2010 [200] and 2012 [201]. The latest miRGator v3.0 merged 73 deep sequencing datasets covering 4665 human samples from Gene Expression Omnibus (GEO) [202], Short Read Archive (SRA) [203] and The Cancer Genome Atlas (TCGA) archives [204]. The samples fell into 38 disease and anatomic categories. The database supported the following explorations of large-scale miRNA data: discovery of miRNA variants (iso-miRs) and differential expression across samples using a miR-seq browser, investigation of miRNA expression profiles by disease and anatomy, assessment of miRNA-target interactions (MTIs) and gene set analysis.
miRGen
The current release, miRGen v4 (https://diana.e-ce.uth.gr/mirgenv4) [205], published in 2020, was constructed with a research objective similar to that of previous versions [206–208], that is, to enhance the understanding of transcriptional-level miRNA biogenesis regulation. This time, however, the objective was pursued by consolidating the analytical results of over 1000 cap analysis of gene expression (CAGE) [209] samples in 133 tissues, cell lines and primary cells from the FANTOM Consortium [210], eventually offering cell type-specific transcription start sites (TSSs) for more than 1500 miRNAs. In addition, miRGen v4 extended the repertoire of associations between transcription factors (TFs) and miRNA gene promoters in miRGen v3 by including the ChIP-Seq and DNase-Seq datasets from the ENCODE repository [211].
Rfam
Since its initial establishment in 2003 as a collection of comprehensive non-coding RNA families [212], Rfam (https://rfam.org/) has undergone 13 major updates. The latest major version, Rfam 14 [213], focused on enlarging the coverage of miRNA families by synchronizing with miRBase [5] in multiple rounds according to the RNAcentral sequence identifiers [214]. The first round involved 1678 manually curated multiple sequence alignments from miRBase v22, resulting in 356 new miRNA families added to Rfam and updates to 40 existing families; the most recent round updated 121 existing families and added 4 new ones. As of Rfam 14.8, the total number of miRNA families has reached 4094.
Novel databases since 2017
EVmiRNA
Published in 2019, EVmiRNA (http://bioinfo.life.hust.edu.cn/EVmiRNA) [215] contained expression profiles, regulated pathways, functions, related drugs and publications of over 1000 miRNAs in extracellular vesicles (EVs) such as exosomes and microvesicles (MVs). These miRNAs in tumour-derived exosomes and MVs can be important pathological and therapeutic biomarkers [215–220]. The database was created by collecting 462 small RNA sequencing EV samples from 17 tissues (from either healthy controls or disease conditions), with an easy-to-use web interface for data inspection and download.
miRmine
The database (http://guanlab.ccmb.med.umich.edu/mirmine) [221] was released in early 2017 and adopted a robust data mining pipeline to retrieve and analyze human miRNA expression profiles from the vast publicly available miRNA data. Specifically, 304 high-quality miRNA-seq datasets were obtained from SRA [203]; then miRNA expression profiles in 16 different tissues and bio-fluids were measured and compared in a heatmap via hierarchical clustering with Pearson correlation as the distance metric. miRmine thus facilitated a global view of expression profiles in tissues/cell-lines/diseases and the comparison of individual miRNAs.
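The clustering step described above can be illustrated with a minimal sketch. It is not miRmine's actual pipeline: the distance 1 − r, the average-linkage rule, the function names and the toy expression values are all assumptions made for illustration, since the source only states that hierarchical clustering with a Pearson-correlation distance was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Assumes neither profile is constant (a constant profile would give
    a zero denominator).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def hierarchical_clustering(profiles):
    """Agglomerative clustering with distance = 1 - Pearson correlation.

    `profiles` maps a sample name to its list of miRNA expression values.
    Average linkage is an assumption of this sketch. Returns the merges
    in order as (cluster_a, cluster_b, distance) tuples.
    """
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression values for three samples (not real miRmine data).
toy_profiles = {
    "liver": [1.0, 2.0, 3.0, 4.0],
    "serum": [2.0, 4.0, 6.0, 8.5],
    "brain": [4.0, 3.0, 2.0, 1.0],
}
toy_merges = hierarchical_clustering(toy_profiles)
```

In this toy input the liver and serum profiles are almost perfectly correlated, so they are merged first, while the anti-correlated brain profile joins last. The merge order is what a dendrogram beside a heatmap would display.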
miRCarta
Developed in 2017, miRCarta (https://mircarta.cs.uni-saarland.de/) [222] was designed as a complement to the 2014 release of miRBase v21 [223]. miRCarta consolidated all miRBase entries and updates from Versions 1.1 to 21, and also incorporated additional information from external databases/sources, such as experimentally confirmed miRNA targets from miRTarBase v6.1 [224] and miRNA precursors from HMDD v2.0 [225]. In total, miRCarta contained 12 857 more miRNA precursors than miRBase v21 and emerged as the most comprehensive repository of human miRNA candidates at the time of its release.
MiRNA-related interaction databases
Mainstream databases
miRTarBase
Uncovering MTIs promotes the understanding of miRNAs’ roles in disease pathology and benefits the development of diagnostic/therapeutic tools. The latest version, miRTarBase 9.0 (http://miRTarBase.cuhk.edu.cn/) [226], released in 2021, served as one of the most comprehensive databases for annotated and experimentally confirmed MTIs. It included 2 200 449 manually curated MTIs between 4630 miRNAs and 27 172 targets from 13 389 relevant studies that were collected by a novel text-mining-based scoring system. Besides, datasets related to miRNA regulation and cell-free miRNAs were added into miRTarBase from integrated databases including TransMir [227], miRSponge [228], SomamiR [229, 230] and CMEP [231]. Lastly, a display of TF-MTIs was embedded in the original web interface.
TarBase
Another major MTI database is TarBase (http://www.microrna.gr/tarbase) [172]. Its latest eighth release in 2017 recorded more than one million entries for ~670 000 unique MTIs, which were verified by over 33 experimental methodologies on ~600 cell types/tissues under ~451 experimental conditions. These MTIs were obtained by manually curating and analyzing ~419 related publications and over 245 high-throughput sequencing datasets; data of each MTI included not only publications and experimental methodologies but also its associated tissues, cell types and regulation types (either positive or negative). As for updates to the web interface, filtering of regulation types, methodologies, cell types and tissues was implemented to support customized browsing, together with a ranking system to report the robustness of selected MTIs’ supporting methodologies. The database content could be presented by a series of effective visualizations in a specialized result page.
miRWalk
The latest miRWalk 3.0 (http://mirwalk.umm.uni-heidelberg.de) [232] stored both experimentally validated and computationally predicted MTIs in human, mouse, rat, dog and cow. The database emerged in 2017 to replace miRWalk 2.0 [233] and has been actively maintained with a twice-annual update strategy since then (though the version number has been kept as 3.0). The new version relied on a single random-forest-based model, TarPmiR, to predict MTIs, instead of the 12 different models used in miRWalk 2.0, which was less efficient. The third-party databases used to support prediction and verification were also reduced to TargetScan [234], miRDB [235] and miRTarBase [226], because most of the databases integrated by miRWalk 2.0 were no longer updated. Moreover, miRWalk 3.0 was presented in a completely redesigned interface to enable searching, dynamic filtering, saving and gene set enrichment analysis.
miRecords
This classical database (http://c1.accurascience.com/miRecords/) [236] was built as an integrated resource for animal MTIs, with a validated targets component holding a myriad of high-quality, manually curated, experimentally verified MTIs and a predicted targets component containing predicted miRNA targets from 11 MTI inference algorithms. Initially introduced in 2008, miRecords has undergone three main updates. Its current Version 4, released in 2013, comprised 2705 MTIs between 644 miRNAs and 1901 target genes in nine animal species.
Novel databases since 2017
miRwayDB
The database (http://www.mirway.iitkgp.ac.in) [237] was developed in 2018 as the first repertoire for experimentally confirmed miRNA-pathway associations (MPAs) in various pathophysiological conditions. A comprehensive collection of MPA entries was created for 76 human disease conditions involving 232 miRNAs, 122 pathways and 328 target genes, by consolidating 663 related publications. Each entry included the disease name, the associated miRNA, the experimental sample type, the miRNA’s up/down-regulation pattern, the associated pathway, the targeted member of the dysregulated pathway and a short description.
MiRNA-disease association databases
Mainstream databases
HMDD
This is the most widely adopted database in MDA prediction tasks: the abundant experimentally validated MDAs in HMDD are frequently used as the benchmark dataset for training and testing predictive models. The database evolved into Version 3.0 (http://www.cuilab.cn/hmdd) [100] in 2018, containing 200% more MDA entries than HMDD v2.0 [225] and extending the original four association types to six (circulation, tissue, genetics, epigenetics, target and other) with 20 diverse evidence codes. To date, HMDD v3.2 documents 35 547 MDA entries between 1206 miRNAs and 893 diseases from 19 280 papers on PubMed, with causality information appended to each entry.
miR2Disease
At the time of its creation in 2008, miR2Disease (http://www.mir2disease.org/) [238] was a comprehensive repository for associations between deregulated miRNAs and diverse human diseases. Initially, it recorded 1939 MDAs that were manually curated from over 600 studies; shortly after publication, the number of entries increased to 3273 between 349 miRNAs and 163 diseases. Each entry consisted of the microRNA ID, the disease name, a brief MDA description, the miRNA expression pattern, the detection method for miRNA expression, the miRNA’s experimentally verified target gene(s) and a literature reference. However, no further updates have been made since then.
dbDEMC
This is a database for associations between Differentially Expressed MiRNAs and various Cancers (https://www.biosino.org/dbDEMC/index) [239]. In 2010, by manual curation of 48 microarray datasets, the first version [240] was developed to store 607 miRNAs’ expression profiles in 14 cancer types. The second version dbDEMC 2.0 [241] was released in 2017, covering 209 expression profiling datasets that depicted associations between 2224 miRNAs and 36 cancer types (with 73 subtypes) in total. In 2022, the third version [239] was published to incorporate 403 microarray or miRNA-seq based miRNA expression datasets from public repositories including GEO [242], ArrayExpress [243], SRA [203] and TCGA [204], resulting in a total number of 3268 differentially expressed miRNAs for 40 cancer types with 149 subtypes.
miRCancer
Developed with the purpose of consolidating research findings on miRNA-cancer associations, miRCancer (http://mircancer.ecu.edu/) [179] initially included 878 associations between 236 miRNAs and 79 cancers, obtained by curating and text mining over 26 000 publications on PubMed. It was updated regularly to incorporate new research articles. By mid-2020, when the most recent update was carried out, the number of entries had reached 9080 concerning 57 984 miRNAs and 196 cancers, retrieved from 7288 publications.
Novel databases since 2017
Tumour IsomiR Encyclopedia
Published in 2021, TIE (https://isomir.ccr.cancer.gov) [244] aimed to capture those miRNAs with sequence variants or isoforms (known as isomiRs), which were abundant in existing databases but difficult to map and often overlooked due to their short sequence length and high heterogeneity. IsomiRs have been shown to play a unique role in tumorigenesis [245–247] and serve as potential diagnostic and prognostic biomarkers [248, 249]. TIE was constructed by analyzing isomiR profiles in ~97 billion reads and ~16 billion small RNA sequences from more than 11 667 patient samples covering 33 adult and three paediatric tumour types. Datasets for the samples were obtained from TCGA [204] and the Therapeutically Applicable Research to Generate Effective Treatment projects (TARGET, available at https://ocg.cancer.gov/programs/target).
MSDD
The MiRNA SNP Disease Database (http://www.bio-bigdata.com/msdd/) [250] was developed in 2017 to document experimentally verified associations between diseases and single nucleotide polymorphisms (SNPs) in miRNAs. These SNPs, known as miRSNPs, modulated the expression of miRNAs that in turn regulated target genes, leading to the pathogenesis of diseases. The data collection process involved text mining and manual curation of 2387 related publications on PubMed, and only highly credible associations with multiple strong lines of experimental evidence from genotyping, western blot, quantitative reverse transcriptase-polymerase chain reaction or luciferase reporter assays were fed into MSDD. This resulted in 525 final entries between 182 miRNAs, 197 SNPs, 153 genes and 164 diseases, which were supported by 397 publications.
HAHmiR.DB
The High-Altitude Human miRNA DataBase (http://www.hahmirdb.in) [251] was proposed in 2020, with a specific focus on miRNAs associated with high altitude (HA) stress. Entries in HAHmiR.DB were collected by manually curating publications on PubMed and Google Scholar. Each entry held comprehensive information including the miRNA expression profiles at different altitudes, fold change, experiment duration, biomarker associations, disease and drug associations, tissue-specific expression levels, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway associations. Furthermore, miRNA–TF–gene coregulatory networks and feed-forward loop (FFL) regulatory circuits were built to promote the understanding of complex regulatory mechanisms of HA stress.
OncomiR
Released in 2017, OncomiR (www.oncomir.org) [196] supported exploration of miRNA dysregulation in cancers via combining a backend database and a dynamic web server. The database contained statistically dysregulated miRNAs (among 1200 mature miRNA transcripts from TCGA [204]) that were associated with clinical characteristics of cancers (collected from 10 000 patients across 30 cancer types in TCGA). The server interface facilitated analyzing miRNA-target expression correlation, predicting MTIs in different cancer types, dynamically investigating miRNA-derived survival signatures and clustering cancer types.
Webservers
Webservers can be classified into three categories according to Chen et al. [161]: miRNA identification based on next generation sequencing (NGS) data or prediction from a genome, miRNA target prediction for uncovering the target genes of miRNAs, and other functional analysis for various aspects of miRNA-related research (such as secondary structure prediction and pathway analysis). This subsection presents the mainstream (highly cited) webservers and novel releases since 2017 in each category (see Table 2).
MiRNA identification
Mainstream webservers include: MiRscan [166] discovering conserved miRNAs in nematodes, RNAz [252] detecting miRNAs in comparative genomics data, miPred [253] distinguishing real pre-miRNAs from pseudo ones based on random forest, CID-miRNA [175] predicting pre-miRNAs in the human genome, miRanalyzer [254] uncovering miRNAs in high-throughput sequencing experiments, MicroPC [255] comparing and inferring plant miRNAs, matureBayes [256] finding mature miRNAs based on sequences and secondary structures of their pre-miRNAs, and miRNAFold [257] predicting pre-miRNAs in genomes quickly and with high sensitivity.
New webservers released since 2017 include: Mirnovo [197] detecting known and novel miRNAs in animals and plants from RNA-seq data, miRSwitch [258] revealing miRNA arm shift and switch events with an emphasis on potential changes in the distribution of mature miRNAs from the same precursor, and miRkwood [259] identifying miRNAs in plant genomes.
MiRNA target prediction
Mainstream webservers include: miRanda-mirSVR [167] ranking miRNA target sites by down-regulation scores, RNAhybrid [170] identifying targets via the minimum free energy hybridisation of miRNAs, TargetScan [234] predicting target sites in mammalian mRNAs, PicTar [260] offering genome-wide miRNA target predictions for nematodes, humans and flies, RNA22 [261] inferring targets and exploring their various attributes, miRDB [262] providing both target predictions and functional annotations, mirDIP [263] consolidating over 150 million target predictions from 30 different resources, psRNATarget [264] identifying plant miRNAs’ targets, and miRTarCLIP [180] uncovering targets from high-throughput sequencing data.
New webservers released since 2017 include: miRTar2GO [265] predicting cell line-specific miRNA targets, and FFLtool [266] uncovering feed-forward loops of transcription factor–miRNA–target regulation in human.
Other functional analysis
Mainstream webservers include: ViennaRNA [168] predicting miRNA secondary structures, mirPath [267] assessing miRNAs’ regulatory roles and analyzing their pathways, miTALOS [268] deciphering tissue specific miRNA regulation of pathways, isomiRex [269] discovering isomiR variations of miRNAs, PHDcleav [181] inferring human Dicer cleavage sites based on sequences and secondary structures of miRNAs, microTSS [270] detecting miRNA TSSs, Chimira [271] analyzing miRNA sequencing data and modifications, and miRNAme Converter [272] resolving inconsistencies of mature miRNA names.
New webservers released since 2017 include: miRViz [273] visualizing and interpreting large miRNA datasets, MISIM v2.0 [274] describing miRNA functional similarity based on MDAs (note that there is no webserver for MISIM v1.0), HumiR [275] offering an entry point to human miRNA research and enabling selection of right tools for research tasks, wTAM [198] annotating weighted human miRNAs with importance scores for enrichment analysis, NetInfer [276] predicting miRNA networks and miRNAs for anticancer drugs, mirnaQC [277] fostering comparative quality control of miRNA-seq data, miRNACancerMAP [278] constructing miRNA regulation network for cancers, TFmiR2 [279] building and analyzing integrated transcription factor and miRNA co-regulatory networks in human and mouse, and StructRNAfinder [280] finding and annotating miRNA families in transcript or genome sequences.
The data fusion paradigm
As depicted in Figure 1, in addition to databases and webservers, an increasing number of computational models for various tasks have emerged recently, most of which focus on MDA prediction and carry out multi-level omics data integration [161]. Since MDA prediction was defined as a problem in the late 2000s [173, 281], the task has been extensively carried out following a data fusion paradigm, where multi-source datasets are combined in various ways to support inference. During 2009–2022, increasingly heterogeneous datasets have been integrated to gain a comprehensive research perspective, which is a contributing factor to the gradual improvement of predictive performance. This section presents which datasets (from the databases/webservers introduced in the previous section) have been fused by MDA prediction models and how the data fusion paradigm has accompanied research progress over the period of 2009–2022.
The ground truth MDAs for most models were obtained from the HMDD database: the v1.0 release [173] contained 3700+ biologically confirmed associations and, after a myriad of updates spanning 2007–2019, the number reached 35 000+ in v3.2 [100]. Prediction based solely on this data is, however, insufficient to capture patterns of the bipartite miRNA-disease network. Incorporating auxiliary information regarding diseases/miRNAs (represented by nodes in the network), adding extra informative nodes (such as genes or lncRNAs) to the network, and using algorithms to exploit complex structures of the resulting heterogeneous graph can help better understand the relationship between miRNAs and diseases [177, 281]. By mining 50+ MDA prediction research papers published between 2009 and 2021, we summarize the following 11 different auxiliary datasets that have been involved in fusion: (1) known association types from HMDD v2.0 [225] or 3.0 [100]; (2) miRNA functional similarity, computed under the assumption that functionally similar miRNAs tend to be connected with phenotypically similar diseases [282] or directly obtained from the MISIM v2.0 webserver [274] that contained computed similarity scores based on this assumption; (3) MTIs, mostly retrieved from miRTarBase [226] or TarBase [172]; (4) miRNA family information from miRBase [5]; (5) miRNA-lncRNA interactions from lncRNASNP [283]; (6) miRNA sequences from miRBase [5]; (7) miRNA-word associations from the abstracts of related research papers, with weights calculated by the TF-IDF scheme [284]; (8) disease phenotype similarity from MimMiner [285] or semantic similarity computed by representing disease descriptors from Medical Subject Heading (MeSH) as directed acyclic graphs (DAGs) [282]; (9) disease–gene associations from DisGeNET [286]; (10) disease–lncRNA associations from LncRNADisease [287]; (11) gene–gene interactions depicted by a probabilistic functional gene network from HumanNet [288].
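To make item (8) more concrete, the following is a minimal sketch of one commonly used way to score semantic similarity between two diseases represented as DAGs. The specific formulation is an assumption of this sketch rather than something stated in the text above: each ancestor contributes a value that decays by a factor of 0.5 per edge, and similarity is the shared contribution divided by the total contribution. The function names and the toy hierarchy are hypothetical and are not real MeSH records.

```python
def semantic_contributions(term, parents, decay=0.5):
    """Contribution of `term` and each of its ancestors to the semantics of `term`.

    `parents` maps a term to the list of its direct parent terms.
    The term itself contributes 1.0; an ancestor contributes decay**k,
    where k is the length of the shortest path up to it (assumed rule).
    """
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            value = decay * contrib[child]
            # Keep the largest contribution when several paths reach a parent.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(a, b, parents, decay=0.5):
    """Similarity in [0, 1] based on the ancestors shared by the two DAGs."""
    ca = semantic_contributions(a, parents, decay)
    cb = semantic_contributions(b, parents, decay)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Hypothetical mini-hierarchy for illustration only.
toy_parents = {
    "breast neoplasms": ["neoplasms by site"],
    "lung neoplasms": ["neoplasms by site"],
    "neoplasms by site": ["neoplasms"],
    "neoplasms": [],
}
toy_similarity = semantic_similarity("breast neoplasms", "lung neoplasms", toy_parents)
```

Two diseases that share most of their ancestors score close to 1, and diseases with no common ancestor score 0. The resulting disease-disease similarity matrix is the kind of auxiliary input that can be fused with known MDAs.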
Fusing diverse biological datasets is challenging due to their heterogeneity in types and structures, as well as different levels of incompleteness, noise and redundancy [289, 290]. Over the past 12 years, significant research efforts have been made to investigate the effectiveness of using different dataset combinations for fusion, and to develop increasingly advanced algorithms to transform data into concise and consistent representations that can ultimately fulfil the goal of accurate MDA prediction. As a result, as new computational models were proposed, the performance metrics (mostly in terms of AUC, the Area Under the receiver operating characteristic Curve) gradually improved [101].
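Because AUC is the metric quoted throughout this section, a short illustration of what it measures may help. The sketch below computes AUC through its rank interpretation: the probability that a randomly chosen known association receives a higher score than a randomly chosen unlabelled pair, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration and are not taken from any reviewed model.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for known associations and 0 for the remaining pairs;
    `scores` holds the predicted association scores in the same order.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four miRNA-disease pairs.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the increases reported across the research phases are read as improved predictive performance.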
According to this performance trend, together with the HMDD versions, three research phases (2009–2012, 2013–2017 and 2018–present) for data fusion-driven MDA prediction can be identified (see Figure 2). The top-cited models that utilized various combinations of the aforementioned auxiliary datasets are listed for each phase, and all models used known MDAs from HMDD as the ground truth. Early representative works in Phase 1 included the hypergeometric distribution method [281] and RWRMDA [177], both considered ground-breaking research on MDA prediction and both fitted on HMDD v1.0 data. The former fused known MDAs, miRNA functional similarity and disease phenotype similarity to construct the heterogeneous network, then applied a score function (SF) to calculate association likelihood for MDPs, and achieved a cross validation (CV) AUC of 0.75. The latter implemented a complex network (CN) algorithm based on random walk with restart (RWR) to iterate over the miRNA nodes in the miRNA functional similarity network, with initial restart probabilities defined according to known MDAs, yielding a CV AUC of 0.86.
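The RWR idea described above can be illustrated with a minimal sketch. This is a generic random walk with restart on a small similarity matrix, not the published RWRMDA implementation: the restart probability of 0.2, the column normalization, the convergence tolerance and the toy similarity values are all assumptions made for illustration. The update rule is p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized miRNA functional similarity matrix and p(0) spreads probability uniformly over the miRNAs already known to be associated with the disease of interest.

```python
def column_normalize(sim):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(sim)
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    return [
        [sim[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(sim, seeds, restart=0.2, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every miRNA node.

    sim: symmetric miRNA functional similarity matrix (list of lists).
    seeds: set of indices of miRNAs with known associations to the disease.
    restart: probability of jumping back to the seeds (assumed value).
    """
    n = len(sim)
    w = column_normalize(sim)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        # Stop when the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical similarity matrix for four miRNAs; miRNA 0 is the only seed.
toy_sim = [
    [0.0, 0.9, 0.3, 0.1],
    [0.9, 0.0, 0.4, 0.2],
    [0.3, 0.4, 0.0, 0.5],
    [0.1, 0.2, 0.5, 0.0],
]
toy_p = random_walk_with_restart(toy_sim, seeds={0})
# Rank the candidate (non-seed) miRNAs by steady-state probability.
toy_ranking = sorted((i for i in range(4) if i != 0), key=lambda i: -toy_p[i])
```

Candidate miRNAs that are more similar to the seed accumulate more probability mass and are ranked higher, which is how a similarity network can propagate known MDAs to unlabelled miRNAs.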

Figure 2. Three phases for MDA prediction research, accompanied by the data fusion paradigm. Algorithm type abbreviations: SF, score function; CN, complex network; NN, neural network; MD, matrix decomposition; GBTs, gradient boosting trees.
Subsequently, Phase 2 witnessed three notable changes. First, models began to fuse datasets beyond miRNA/disease similarity, such as miRPD [182] incorporating MITs and disease–gene associations, RBMMMDA [291] utilizing known association types to predict not only MDAs but also their categories, and MiRAI [184] integrating miRNA family information, miRNA–word associations and miRNA–target associations. Second, models were evaluated with the 2013 release of HMDD v2.0, which held more MDA samples. Third, advanced algorithms such as matrix decomposition (MD; as in the heterogeneous network-based RWR model [292]) and neural networks (NN; as in RBMMMDA) emerged to address the increased complexity incurred by larger-scale fusion and more data samples. The maximum CV AUC reached 0.94, suggesting temporary success in tackling the data fusion challenge. Models in Phase 2 explored the possibilities of various dataset combinations, and the research outcomes could inform bioinformaticians about which datasets were more worth fusing than others.
Thereafter in Phase 3, miRNA sequences were the only new data introduced into fusion, whereas all other integrated datasets had already appeared in Phase 2. This indicated less room for improvement from the data perspective. Instead, more advanced machine learning algorithms were devised to learn better representations of the fused data, such as extreme gradient boosting trees (GBTs) in EGBMMDA [101], NNs in MDA-CNN [189] (as well as MVMTMDA [293] and MMGCN [294]), MD in M2LFL [295] and tensor decomposition in TDRC [191]. Such algorithms further raised the CV AUC to 0.97 with HMDD v2.0; when evaluated with v3.0 or later, which contains about three times as many examples, the highest value achieved was 0.93.
It can be concluded from Figure 2 that both the sample size used for model-learning and the number of data sources involved in fusion rose over the three phases, making MDA prediction an even more challenging task. Researchers have tested models fusing different combinations of datasets and continuously boosted the predictive performance in terms of AUC, with the help of increasingly advanced algorithms such as MD and deep learning. Understanding data fusion and algorithms in existing works can inspire the development of more powerful models in the future.
Discussion and conclusion
MiRNAs contribute to the pathogenesis of human complex diseases by post-transcriptionally regulating gene expression [13, 296–299], and thereby serve as potential biomarkers for the development of therapeutic agents [300, 301]. Preclinical studies in the past five years have generated promising outcomes of using synthetic miRNA inhibitors and mimics to hinder viral infection [302], promote wound healing [303, 304] and restore tumour suppression [305–307]. Furthermore, clinical trials on miRNA-based therapeutics have been carried out continuously since the trial of Miravirsen, the first-ever miRNA-targeting drug for treatment of hepatitis C virus (HCV) infection [300, 308, 309]. The premise of devising miRNA therapies is accurate identification of MDAs by experimental methods, which can however be time-consuming and costly given the numerous miRNA-disease combinations. Computational models address the problem by identifying the most likely associated MDPs for validation in biological experiments, hence speeding up novel MDA discovery.
Experimental data accumulation is a key factor for building accurate models. The past decade witnessed a nearly 10-fold increase in ground-truth MDAs, from 3700+ in HMDD v1.0 [173] to 35 000+ in the latest v3.2 [100]. Moreover, diverse auxiliary datasets have been integrated with known MDAs to support prediction. For example, fusing datasets of MITs, disease–gene associations and gene–gene interactions could introduce genes into the original bipartite MDA network and use them as intermediaries to represent how miRNAs regulate genes to influence disease development [75]. Models learning from a larger sample size and fusing informative datasets are expected to produce better prediction outcomes, as implied by a comparison of performance metrics for different HMDD versions: the best AUC with v1.0 data was 0.86, while with v2.0 the figure could be as high as 0.97. Following the trend, although the maximum achievable AUC is currently 0.93 with HMDD v3.2, we expect future models to improve the metric and move closer to perfect classification. This requires regular updates of related databases such as HMDD and miRBase; however, according to Table 1, neither has received new entries since 2019. Active maintenance of classical databases plays a crucial role in MDA research. As for new ones constructed after 2017, little research has been conducted to examine their potential for fusion. It may be beneficial to tap into the value of relevant repositories such as miRMine [221] holding miRNA expression profiles in diseases, miRwayDB [237] recording MPAs, and MSDD [250] containing associations between diseases and miRNA SNPs.
Advanced algorithms that generate meaningful representations for fused data also contribute to the success of MDA prediction. Over the three phases, the dominant model types have evolved from SFs and CN algorithms to advanced algorithms including (but not limited to) MD and deep learning, to meet the challenge of fusing heterogeneous and larger datasets. Most of the associated literature sheds light on how diverse data sources were algorithmically combined to foster learning, an issue that was raised in the previous MDA review [75] and is extensively discussed in another of our review articles, entitled ‘Updated review of advances in microRNAs and complex diseases: taxonomy, trends and challenges of computational models’. Moreover, an additional review article written by us, entitled ‘Updated review of advances in microRNAs and complex diseases: towards systematic evaluation of computational models’, will illustrate the influence of individual auxiliary datasets (features) and various model components on predictive performance via ablation studies. Nonetheless, most research overlooks the additional computational complexity incurred by fusing more datasets and by complicated heuristics. The contribution of these additions to MDA prediction can be better understood by including a cost–benefit analysis in future works. The impact of data sources can be further investigated by using each of them to infer a specific disease’s associated miRNAs and then comparing performance to conclude which dataset was more conducive to prediction for that disease [75].
Significant research efforts have been made to uncover miRNA-disease associations and reveal their biological mechanisms, which can contribute to diagnosis and treatment of diseases.
The current review summarized mainstream and highly cited miRNA-related databases and webservers and introduced novel relevant releases that emerged since 2017.
Computational models for MDA prediction have been developed following a data fusion paradigm, where multi-source datasets are combined in various ways to support inference.
The three phases spanning 2009–2022 witnessed continuous increases in the size of samples used for model-learning, the number of data sources emerging in fusion, the complexity of algorithms, and the predictive performance in terms of cross validation AUC.
Data availability
The source code and data of MDA-CNN are available at https://github.com/Issingjessica/MDA-CNN. The source code and data of MDA-GCNFTG are available at https://github.com/a96123155/MDA-GCNFTG. The source code and data of MVMTMDA are available at https://github.com/yahuang1991polyu/MVMTMDA/. The source code and data of NIMCGCN are available at https://github.com/ljatynu/NIMCGCN/. The source code and data of MMGCN are available at https://github.com/Txinru/MMGCN. The source code and data of TDRC are available at https://github.com/BioMedicalBigDataMiningLab/TDRC. The source code and dataset of IMCMDA are available at https://github.com/IMCMDAsourcecode/IMCMDA. The source code and data of GAEMDA are available at https://github.com/chimianbuhetang/GAEMDA.
Funding
National Natural Science Foundation of China (61972399 and 11931008 to XC).
Author Biographies
Li Huang is a PhD student at the Academy of Arts and Design, Tsinghua University. His research interests include bioinformatics, complex network algorithms, machine learning and visual analytics.
Li Zhang is a PhD student at the School of Information and Control Engineering, China University of Mining and Technology. His research interests include bioinformatics, drug discovery, neural networks and deep learning.
Xing Chen, PhD, is a professor at China University of Mining and Technology. He is the associate dean of the Artificial Intelligence Research Institute, China University of Mining and Technology. He is also the founding director of the Institute of Bioinformatics, China University of Mining and Technology, and the Big Data Research Center, China University of Mining and Technology. His research interests include complex disease-related non-coding RNA biomarker prediction, computational models for drug discovery, and early detection of human complex disease based on big data and artificial intelligence algorithms.