Abstract

Drug similarity studies are driven by the hypothesis that similar drugs should display similar therapeutic actions and thus can potentially treat a similar constellation of diseases. Drug–drug similarity has been derived from a variety of direct and indirect sources of evidence and has frequently shown high predictive power in discovering validated repositioning candidates as well as in other in-silico drug development applications. Yet, existing resources either have limited coverage or rely on an individual source of evidence, overlooking the wealth and diversity of drug-related data sources. Hence, there has been an unmet need for a comprehensive resource integrating diverse drug-related information to derive multi-evidenced drug–drug similarities. We addressed this resource gap by compiling heterogeneous information for an exhaustive set of small-molecule drugs (total of 10 317 in the current version) and systematically integrating multiple sources of evidence to derive a multi-modal drug–drug similarity network. The resulting database, ‘DrugSimDB’, currently includes 238 635 drug pairs with significant aggregated similarity and is complemented with an interactive, user-friendly web interface (http://vafaeelab.com/drugSimDB.html), which not only enables easy access, search, filtration and export, but also provides a variety of complementary information on queried drugs and interactions. The integration approach can flexibly incorporate further drug information into the similarity network, providing an easily extendable platform. The source code for database compilation and construction is well documented and semi-automated, so that the database can be upgraded at any time to account for new drugs and up-to-date drug information.

Introduction

Drug similarity studies rely on the assumption that drugs with similar pharmacological properties are similar in their mechanism of action, share similar side-effects and are indicated for the treatment of similar diseases [1, 2]. In-silico drug–drug similarity has been derived for a variety of applications including drug-target identification [3–7], side-effect prediction [8–10], drug–drug interaction prediction [11–15] and drug repositioning [1, 16–19]. The latter, i.e. repositioning existing drugs for new indications, has received growing interest from academic research and the pharmaceutical industry as an innovative drug development strategy that can reduce cost, time and risk, because several phases of de-novo drug discovery can be bypassed for repositioning candidates [20]. Drug similarity estimation can be directly incorporated into the repositioning pipeline to prioritize repositioning candidates based on the extent of their similarity with the drug of interest.

A variety of drug-related sources of evidence—e.g. chemical structure characteristics [7, 21], protein targets [22, 23], side-effect profiles [6, 24], gene expression profiles [17, 25] and clinical information [2]—have been previously applied in drug–drug similarity analytics. Heterogeneous data sources provide a multi-view perspective for predicting similar drugs and can compensate for missing data across individual data sources. Hence, incorporating diverse data sources can boost the coverage and accuracy of the prediction and provide new insights into drug repositioning and other applications. Despite the current availability of several drug-related data sources, there is a need for a comprehensive, contemporary knowledgebase integrating diverse information from a wide array of evidence sources to derive multi-modal drug–drug similarities.

We addressed this resource gap by developing ‘DrugSimDB’, which incorporates multiple sources of direct and indirect information, compiled on a comprehensive list of drugs, into similarity measures. DrugSimDB covers 10 317 small-molecule drugs (including 2466 approved and 7212 experimental, illicit or withdrawn) and provides 238 635 pairs of drugs with significant, multi-modal similarity. Chemical structure descriptors, drug-induced pathways, drug–protein and protein–protein relationships as well as protein sequences and their functional annotations were compiled from diverse public datasets and used to estimate structure-, pathway-, target- and function-based similarity between each pair of drugs. Similarity measures across modalities were aggregated and assessed for statistical significance. By comparing against a drug repositioning gold standard of approved and failed drugs, we have shown that diversifying sources of similarity evidence improves the specificity and sensitivity of candidate prioritization for repositioning, which corroborates the necessity of multi-modal approaches and the utility of DrugSimDB for drug development.

We implemented a web application (http://vafaeelab.com/drugSimDB.html) that enables users to browse DrugSimDB for a drug of interest or to download the full database and important intermediate files, e.g. individual pairwise similarity matrices. For each queried drug, in addition to a prioritized list of similar drugs, the web application provides information on the drug’s physicochemical and pharmacological properties as well as an interactive view of its 3D structure. More importantly, the web application provides an interactive visualization of an induced subnetwork of the drug–drug similarity network comprising the queried drug and its interacting partners. Users can select any node on the subnetwork to probe a drug’s side-effects or select any edge to explore PubMed articles with evidence of the association. A batch query is also supported, where users can upload a list of drugs (names/IDs) to retrieve their similarity information. For improved reusability and maintenance of data coverage, we implemented the whole framework as a well-documented, semi-automated and parallelized pipeline. Users can follow simple instructions to retrieve up-to-date data sources and update the database accordingly.

Overall, DrugSimDB and its web application provide an exhaustive and reusable resource for multi-modal drug similarity investigation, enriched with drug side-effects, indications and literature evidence, which together form a unique starting point for drug repositioning and beyond.

Materials and methods

Data sources

Drug names, identifiers, physicochemical and pharmacological properties and links to external databases were retrieved from ‘DrugBank’ [26], a comprehensive, frequently updated drug encyclopedia. Drug chemical structures in SDF format, protein targets and their primary structure in FASTA format were also retrieved from DrugBank. Drug-induced pathways and their constituent genes were obtained from ‘Kyoto Encyclopedia of Genes and Genomes’ (KEGG) [27]. Protein–protein interactions (PPIs) in humans were downloaded from ‘Interologous Interaction Database’ (I2D) [28], comprising validated and predicted PPIs compiled from over 35 databases and literature. Gene ontology (GO) annotations (cellular components, biological processes [BP] and molecular functions) of protein targets were obtained from the enrichR [29] web server, which provides up-to-date GO annotations for gene-set enrichment analyses. Drug indications, i.e. drug-to-disease mappings and their clinical status, were downloaded from the Drug Repositioning Database (repoDB) [30]. Information on recorded adverse reactions of marketed drugs was obtained from SIDER, a database of drugs and side effects [31].

System design and implementation

The whole pipeline (data retrieval, filtration and quality control, similarity estimation, validation and visualization) was implemented in R, providing a unified platform for ease of reuse and ongoing maintenance. Drug similarity matrix computation was implemented using parallel computing in R, enabling intensive and repetitive similarity computations to be run efficiently over multiple processors and cores on local and remote clusters. An interactive web interface was developed using R Shiny [32]. Three-dimensional visualization of a queried drug’s molecular structure was implemented using the MolView [33] API. An interactive network view of the induced subnetwork comprising the queried drug and its interacting partners (i.e. significantly similar drugs) was rendered using the visNetwork R package, which makes the features of the vis.js library available to Shiny R applications [34]. Records of drug-pair co-occurrence in PubMed abstracts were retrieved and processed using the easyPubMed R package. The pipeline implementation is publicly available, commented and documented with usage instructions. We recommend using a web browser that supports 3D graphics for MolView rendering. The web interface has been tested on Firefox, Google Chrome and Internet Explorer.

Drug similarity estimation

Chemical structure similarity

Chemical structures of small-molecule drugs were retrieved in SDF molecular format from DrugBank, release version 5.1.3 [35]. Invalid SDFs, i.e. those with NA values or with fewer than three columns in atom or bond blocks, were detected and removed. Atom pair descriptors were computed for valid compounds, and pairwise compound similarity, i.e. |${\delta}_c\big({d}_i,{d}_j\big)$|, was estimated from atom pairs using the Tanimoto coefficient, which is defined as the number of atom pairs shared by two compounds divided by the number of atom pairs in their union (Equation (1)).
|${\delta}_c\big({d}_i,{d}_j\big)=\frac{\big|{AP}_i\cap{AP}_j\big|}{\big|{AP}_i\cup{AP}_j\big|}$| (1)
where |${AP}_i$| and |${AP}_j$| represent the atom pairs of drugs |${d}_i$| and |${d}_j$|, respectively; therefore, the numerator is the number of atom pairs that are common to both compounds, and the denominator is the number of distinct atom pairs found in either compound. These analyses were performed using the ChemmineR cheminformatics package in R [36].
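
The Tanimoto coefficient described above can be illustrated with a minimal Python sketch. It is not the ChemmineR implementation used by the authors: it assumes each compound is represented as a plain set of atom-pair descriptors, and the descriptor strings and the function name `tanimoto` are hypothetical, chosen only for illustration.

```python
def tanimoto(ap_i, ap_j):
    """Tanimoto coefficient between two collections of atom-pair descriptors.

    Assumption: descriptors are treated as sets (duplicates are ignored).
    Returns None when neither compound has any descriptor, mirroring a
    missing value in the similarity matrix.
    """
    set_i, set_j = set(ap_i), set(ap_j)
    union = set_i | set_j
    if not union:
        return None
    return len(set_i & set_j) / len(union)


# Hypothetical atom-pair descriptors for two compounds.
ap_a = {"C-C:1", "C-N:2", "C-O:3", "N-O:4"}
ap_b = {"C-C:1", "C-N:2", "C-O:5"}

# 2 shared descriptors, 5 distinct descriptors overall.
print(tanimoto(ap_a, ap_b))  # 0.4
```

Identical descriptor sets give 1.0 and disjoint sets give 0.0, which matches the 0 to 1 range reported for the similarity matrices.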

Target protein sequence-based similarity

Target sequences in FASTA format were retrieved for all small-molecule drugs from DrugBank, release version 5.1.3 [37]. Pairwise protein sequence comparison was performed using the standard Needleman-Wunsch [38] dynamic programming algorithm for global alignment, and the percentage of pairwise sequence identity [39] was reported as the corresponding sequence similarity. Drug–drug similarity based on the sequence similarities of their targets was then estimated as per Equation (2):
|${\delta}_t\big({d}_i,{d}_j\big)=\frac{\sum_{x\in{T}_i}\max_{y\in{T}_j}S\big(x,y\big)+\sum_{y\in{T}_j}\max_{x\in{T}_i}S\big(x,y\big)}{\big|{T}_i\big|+\big|{T}_j\big|}$| (2)
where |${\delta}_t\big({d}_i,{d}_j\big)$| denotes the target-based similarity between drugs |${d}_i$| and |${d}_j$|, |${T}_i$| is the set of proteins targeted by drug |${d}_i$|, |${T}_j$| is the set of proteins targeted by drug |${d}_j$|, and |$S\big(x,y\big)$| is a symmetric sequence-based similarity measure between two protein targets, |$x\in{T}_i$| and |$y\in{T}_j$|. Overall, Equation (2) computes the ‘best-match average’, in which each target of the first drug is paired only with the most similar target of the second one and vice versa. Sequence alignment and percentage of sequence identity were estimated using the Biostrings package of R [40].
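
The ‘best-match average’ described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' R code: the pairwise similarity function is passed in as a parameter, and the protein identifiers and identity values in the toy lookup table are hypothetical.

```python
def best_match_average(targets_i, targets_j, sim):
    """Best-match average of a symmetric pairwise similarity `sim`.

    Each target of one drug is paired with its most similar target of the
    other drug, in both directions, and the best scores are averaged.
    Returns None if either drug has no known target (missing value).
    """
    if not targets_i or not targets_j:
        return None
    best_i = [max(sim(x, y) for y in targets_j) for x in targets_i]
    best_j = [max(sim(x, y) for x in targets_i) for y in targets_j]
    return (sum(best_i) + sum(best_j)) / (len(targets_i) + len(targets_j))


# Hypothetical sequence identities (as fractions) between protein targets.
toy_identity = {
    frozenset(("P1", "P3")): 0.8,
    frozenset(("P2", "P3")): 0.2,
}


def toy_sim(x, y):
    """Symmetric lookup: identical proteins score 1.0, unknown pairs 0.0."""
    if x == y:
        return 1.0
    return toy_identity.get(frozenset((x, y)), 0.0)


# Drug A targets P1 and P2; drug B targets P3.
# Best matches: P1 -> 0.8, P2 -> 0.2, P3 -> 0.8, so (0.8 + 0.2 + 0.8) / 3.
score = best_match_average(["P1", "P2"], ["P3"], toy_sim)
print(round(score, 3))  # 0.6
```

Averaging best matches in both directions keeps the measure symmetric in the two drugs even when they have different numbers of targets.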

Target protein functional similarity

In addition to sequence similarity, protein targets overrepresented in similar cellular functions would imply similarities in the drugs’ mechanisms and downstream effects [41]. To this end, sets of GO terms of all three categories, i.e. cellular components (CC), molecular functions (MF) and BP, associated with each protein were retrieved from enrichR [29] libraries, version 2018. GO terms that were very specific (with ≤15 associated genes) or very general (with ≥100 genes) were filtered out. The set of proteins associated with a drug was expanded to include its targets as well as their interacting proteins on the PPI network. The latter are functionally relevant proteins, the inclusion of which would enrich GO annotations and improve subsequent statistical analyses. The human PPI network was downloaded from I2D [28], version 2.9, and queried against the set of all protein targets; protein-to-gene mapping was performed using the AnnotationDbi package in R [42].

A GO term was then associated with a drug |${d}_i$| if ‘overrepresented’ by its protein targets and their immediate interacting partners. In other words, a term would be enriched if there were a high enough number of |${d}_i$|-related proteins annotated with the GO term implying that the functional association is statistically significant (P-value < 0.05 using Fisher’s exact test).
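
The over-representation test described above can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below is an assumption-laden illustration rather than the authors' code: the choice of background (all annotated proteins), the function name and the toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test P-value for over-representation.

    N: proteins in the background (assumed: all annotated proteins)
    K: background proteins annotated with the GO term
    n: proteins related to the drug (targets plus interacting partners)
    k: drug-related proteins annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total


# Toy counts: 20 background proteins, 5 carry the term, and 3 of the
# 4 drug-related proteins carry it.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(p < 0.05)  # True: the term would be associated with the drug
```

With these toy counts the tail probability is 155/4845, roughly 0.032, so the term passes the 0.05 threshold stated in the text.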

Once each drug was annotated with enriched GO terms, the functional similarity between any two drugs, i.e. |${\delta}_f\big({d}_i,{d}_j\big)$|, was determined by the semantic similarity of their associated GO terms as proposed by Wang et al. [43], using the topology of the GO graph structure. Pairwise semantic similarities between the GO terms associated with drugs |${d}_i$| and |${d}_j$| were combined into a single semantic similarity measure using a best-match average strategy [43] and recorded in a final similarity matrix. Semantic similarity estimation was performed using the mgoSim function from the GOSemSim R package [44].

Drug-induced pathway similarity

When two drugs induce identical or overlapping pathways, this implies similarities in their mechanisms of action and provides relevant information for the study of drug similarities and repositioning [45]. Pathways induced by each small-molecule drug were retrieved from KEGG, Release 91.0 [27]. The KEGGREST R package [46] (v 1.26.1) was used to invoke KEGG RESTful APIs for collecting the list of KEGG pathways induced by each drug; ID mapping between DrugBank and KEGG Drug identifiers was performed using DrugBank external links, version 5.1.3.

Pairwise similarity between any two pathways was estimated based on the similarity of their constituent genes using the Dice similarity coefficient. Then, for each drug pair |${d}_i$| and |${d}_j$|, a pathway-based similarity score, i.e. |${\delta}_p\big({d}_i,{d}_j\big)$|, was estimated as per Equation (3):
|${\delta}_p\big({d}_i,{d}_j\big)=\max_{x\in{P}_i,\,y\in{P}_j}\mathrm{DSC}\big(x,y\big)$| (3)
where |${P}_i$| and |${P}_j$| are the sets of pathways induced by drugs |${d}_i$| and |${d}_j$|, respectively; |$x$| and |$y$| are two pathways represented as sets of their constituent genes, and |$\mathrm{DSC}\big(x,y\big)=2\big|x\cap y\big|/\big(\big|x\big|+\big|y\big|\big)$| is the Dice similarity coefficient, which measures the relative overlap of the two pathways. The pathSim function from the R BioCor package [47] was used to estimate |$\mathrm{DSC}\big(.,.\big)$| measures ranging from 0 to 1. Overall, Equation (3) indicates that the maximum pathway-based similarity is attained if two drugs induce one or more identical pathways, and the minimum similarity of 0.0 occurs when no gene is shared between any two pathways induced by the drugs being compared.
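
The pathway-based score can be sketched in Python as follows. The text states that the maximum is reached when two drugs induce an identical pathway and that the score is 0.0 when no pathway pair shares a gene; the sketch therefore assumes the drug-level score is the largest Dice coefficient over all pathway pairs. That combination rule, the function names and the gene symbols are assumptions made for illustration, not the BioCor implementation.

```python
def dice(x, y):
    """Dice similarity coefficient between two gene sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # two empty pathways share nothing
    return 2 * len(x & y) / (len(x) + len(y))


def pathway_similarity(pathways_i, pathways_j):
    """Assumed drug-level score: best Dice overlap over all pathway pairs.

    Returns None if either drug has no known induced pathway (missing value).
    """
    if not pathways_i or not pathways_j:
        return None
    return max(dice(x, y) for x in pathways_i for y in pathways_j)


# Hypothetical pathways, each given as a set of constituent genes.
pw1 = {"G1", "G2", "G3", "G4"}
pw2 = {"G3", "G4", "G5", "G6"}
pw3 = {"G7", "G8"}

# Drug A induces pw1; drug B induces pw2 and pw3.
# dice(pw1, pw2) = 2 * 2 / (4 + 4) = 0.5 and dice(pw1, pw3) = 0.0.
print(pathway_similarity([pw1], [pw2, pw3]))  # 0.5
```

Under this assumption a single shared pathway is enough to give two drugs the maximum score of 1.0, regardless of how many other pathways they induce.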

Results and discussion

Database overview and statistics

Figure 1 shows the overall scheme and construction of DrugSimDB and the web application. Table 1 summarizes the data sources used to generate the database and web interface along with statistics on retrieved data. Overall, 10 317 small-molecule drugs available in DrugBank, version 5.1.3, were considered and six distinct drug–drug similarity matrices were generated, estimating measures based on similarities of chemical structures, target protein sequences, induced pathways and target protein function (cellular component, BP and molecular functions). Each similarity matrix has 10 317 × 10 317 = 106 440 489 entries with values ranging from 0 to 1. Missing values indicate that no relevant information is available about the drugs being compared and were retained for consistency in dimensions. The individual matrices were mean-aggregated to form a combined-score similarity matrix. To report relevant pairs, the combined matrix was filtered to exclude drugs with missing values across all individual matrices (496 out of 10 317) and those with no SMILES structure (639 out of 10 317). Additionally, drug pairs were excluded if neither of the two drugs was marketed/approved (resulting in 23 865 948 drug pairs), on the assumption that repurposing would make sense only if the candidate had not failed to be approved for the disease of interest. The final database was then organized as a data-table, where each row records a drug pair and columns correspond to the individual similarity measures (×6), the mean-aggregated score, its associated P-value (based on a standardized z-score) and the corresponding false discovery rate [48] adjusted P-value. The final data-table reports drug pairs with adjusted P-value < 0.05, yielding a total of 238 635 unique pairs.
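
The aggregation and significance steps described above can be sketched in Python. The text does not spell out the details, so the sketch makes several assumptions that are not confirmed by the source: missing values are skipped when averaging, the P-value is the upper tail of a normal distribution applied to the standardized combined scores, and the false discovery rate adjustment is the Benjamini-Hochberg procedure. All function names and toy values are hypothetical.

```python
from math import erfc, sqrt
from statistics import mean, pstdev


def mean_aggregate(scores):
    """Mean of the available similarity scores; None marks a missing value."""
    present = [s for s in scores if s is not None]
    return mean(present) if present else None


def upper_tail_p(values):
    """One-sided normal P-values of standardized (z-scored) values.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [0.5 * erfc((v - mu) / (sigma * sqrt(2))) for v in values]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this P-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy drug pairs with six similarity scores each (None = no information).
pairs = {
    ("drugA", "drugB"): [0.9, 0.8, None, 0.7, None, 0.8],
    ("drugA", "drugC"): [0.2, None, None, 0.1, 0.3, None],
    ("drugB", "drugC"): [0.3, 0.2, 0.1, None, None, None],
    ("drugC", "drugD"): [0.1, None, None, None, 0.3, 0.2],
}
combined = {pair: mean_aggregate(s) for pair, s in pairs.items()}
p_values = upper_tail_p(list(combined.values()))
adj_p = benjamini_hochberg(p_values)
for pair, p, q in zip(combined, p_values, adj_p):
    print(pair, round(combined[pair], 3), round(p, 4), round(q, 4))
```

In a full-size run the same steps would be applied to every retained drug pair, and pairs with an adjusted P-value below 0.05 would be kept.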

Database content and construction. For 10 317 small-molecule drugs, DrugSimDB collects information on (i) drug chemical structures to estimate drug pairwise chemical similarity, (ii) drug protein targets and protein sequences to estimate sequence-based target similarity, (iii) drug-induced pathways and their constituent genes to estimate pathway-based similarities, and (iv) GO annotations of protein targets and PPIs to identify functional similarities. The similarity scores are then mean-aggregated and filtered into a single matrix of combined similarities, i.e. DrugSimDB, which is made accessible and analyzable via a user-friendly and interactive graphical user interface and complemented with other information for in-place drug investigation. Abbreviations: GO: Gene Ontology, CC: Cellular Component, MF: Molecular Function, BP: Biological Process.
Figure 1


Database interface and access. (A) The navigation bar. (B) Users can query any drug name to retrieve its similarity to other approved drugs and can choose which combined statistic to view (i.e. mean-aggregated score, P-value or adjusted P-value). A batch query is also supported, where users can upload a list of drug names or DrugBank IDs to view similarities among them. (C) An interactive tabular view of a DrugSimDB induced sub-network comprising the query drug and its interacting pairs; users can filter, sort, export and print the table. An interactive network view of the induced sub-network of the queried drug is also rendered. (D) A tabular view of the PubMed literature list involving a drug-pair, shown when the user selects the corresponding edge in the network view. Panels describing or rendering the (E) physicochemical properties, (F) interactive 3D structure and (G) pharmacological properties of the queried drug are shown. In the Structure tab, users can also view a color-coded ‘periodic table’ of chemical elements to aid in understanding the drug’s chemical structure.
Figure 2


Table 1

Data types, statistics and details of data sources used to generate DrugSimDB and interface

| Data type | Statistics | Details | Data source |
|---|---|---|---|
| Drug identifiers, drug names and clinical status | 10 317 small-molecule drugs including 2466 approved drugs | | DrugBank [26] |
| Drug physicochemical properties | 16 distinct properties per drug | Molecular weight, hydrogen bond acceptors/donors, ring count, molecular refractivity and polarizability, CAS number, SMILES, InChI, IUPAC name, etc. | DrugBank [26] |
| Drug pharmacological properties | 16 distinct properties per drug | Description, indication, mechanism of action, target names, toxicity, pharmacodynamics, metabolism, half-life, route of elimination, etc. | DrugBank [26] |
| Drug chemical structures | 9678 structures | SDF format | DrugBank [26] |
| Drug protein targets and protein sequence | 4986 unique protein sequences and 20 061 drug-target pairs | FASTA format | DrugBank [26] |
| Drug-induced pathways | 243 pathways and 3888 drug-pathway associations | | KEGG [27] |
| GO terms and annotations | 446 CC, 1151 MF and 5103 BP terms, and a total of 250 734 protein-GO term associations | GO terms across categories of cellular components (CC), molecular functions (MF) and BP | Enrichr [29] |
| PPIs | 469 515 PPIs | Validated and computationally predicted human PPIs | I2D [28] |
| Drug indications and clinical status | 10 562 drug-indication associations including 6677 ‘approved’ and 3885 ‘non-approved’ | RepoDB was considered as the drug repositioning gold standard and used for technical validation | RepoDB [30] |
| Drug side effects | 139 756 drug-side effect associations | Information on marketed medicines and their recorded adverse drug reactions | SIDER [31] |

Database access and usage notes

A search interface for drug-similarity network

We have developed a web application (http://vafaeelab.com/drugSimDB.html) using Shiny, an RStudio project [32], to enable easy access to the DrugSimDB database and in-place investigation of drugs of interest (Figure 2A–G). With this application, users can query a drug (or list of drugs) and view similarity information on its interacting drugs retrieved from DrugSimDB (Figure 2B). The queried network, i.e. an induced sub-network comprising the queried drug and its interacting partners, is displayed in an exportable ‘tabular-view’ as well as an interactive ‘network-view’ (Figure 2C). For a batch query, users can upload a text file containing drug names or DrugBank IDs, and similarities among the queried drugs are shown in the tabular and network views. Some example files are also provided to assist users in preparing input files for a batch query. The tabular-view is sortable and includes the interacting drug names, clinical statuses, individual and combined similarity measures with the queried drug(s), and the P-values and adjusted P-values of the combined similarity scores. The induced sub-network of the queried drug(s) in the network-view is interactive and query-able; the edge width corresponds to the combined similarity score, and upon selecting an edge, a PubMed query is made with its incident drugs and the search results are displayed as a table in a modal window (Figure 2D). Additionally, when a drug node is selected, its side-effect information from the SIDER database is displayed. For any queried drug, in separate tabs, users can view its physicochemical properties (Figure 2E), its chemical structure in an interactive 3D view (Figure 2F) and its pharmacological properties (Figure 2G), providing an all-in-one view for further investigation of the drug of interest. For a multi-drug query, the structure view as well as the physicochemical and pharmacological properties of each drug are organized into a toggle list that expands upon clicking.

Data download and statistics

The interface enables users to bulk download the full DrugSimDB database as well as individual similarity matrices and other intermediately processed relevant files. Links to downloads are available in the ‘Download’ page. Users can also view summary statistics of the database in the ‘Statistics’ page and use the ‘Help’ and ‘Contact’ pages to get information on how to use the application and how to cite the database or contact producers for reporting any bugs/issues.

Technical validation and relevance

Drug–drug similarity network is scale-free

Despite the phenomenal diversity of networks in nature, their architecture is usually governed by a few simple principles common to most real networks [49]. One of the most informative properties of a network is the distribution of the degree, or connectivity, of its nodes. Networks with a ‘power-law degree distribution’ are called ‘scale-free’: most nodes have only a few links, while a few nodes, often called hubs, have huge numbers of links holding the network together. Remarkably, biological networks, among others, show a strong level of evidence for a scale-free structure [50].

We found that the DrugSimDB similarity network, in which nodes are drugs and links represent pairwise similarity, exhibits a scale-free topology (Figure 3A). The DrugSimDB network comprises 4141 unique drugs (nodes) and 238 635 similarity associations (edges) with adjusted P-value < 0.05. We performed a bootstrapping hypothesis test (using the poweRlaw package in R [51]) to determine statistically whether the degree distribution of the DrugSimDB network follows a power law and obtained P-value = 0.6. The null hypothesis that the degrees are drawn from a power-law distribution is therefore not rejected, i.e. the degree distribution is consistent with a power law.

Technical validation and relevance. (A) The drug–drug similarity network illustrates a scale-free topology, as observed in most biological networks. (B) Integration of heterogeneous data sources enhances information coverage, reducing the number of missing values (i.e. drugs with no information) when compared to individual data sources. (C, D) Validated against RepoDB [30], a database of drug repositioning successes and failures, the combined similarity score of DrugSimDB drug-pairs yields a competitive AUC value of 0.708, which outperforms the predictive power obtained from individual data sources. It retains a similar score compared to target-based similarity, yet with substantially improved coverage.
Figure 3


Comparison with the Jaccard index based on network properties. (A) The proportion of drug pairs whose similarity measure is equal to or less than the given thresholds. (B) The mean degree of nodes in the DrugSimDB networks and the corresponding Jaccard-based networks; the error bar shows the standard error. (C) The number of drug pairs that are connected within the given distances (i.e. the number of links/edges between the two drugs is equal to or less than the given threshold). Only the top 5% of similarity measures in the target sequence-based and functional similarity matrices were retained in the DrugSimDB network and used for the calculation of degrees and distances. For each comparison, the pale color corresponds to the Jaccard-based approach. For functional similarity, only the GO category of BP was included in this visualization; similar results were obtained using the other categories (i.e. MF and CC), as shown in Supplementary Figure S1, available online at https://dbpia.nl.go.kr/bib.
Figure 4

Comparison with Jaccard Index based on network-based properties. (A) The proportion of drug pairs whose similarity measure is ‘equal or less’ than the given thresholds. (B) The mean degree of nodes in the DrugSimDB networks and the corresponding Jaccard-based network; the error bar shows the standard error. (C) The number of drug pairs that are connected within the given distances (i.e. the number of links/edges between the two drugs is equal or less than the given threshold). Only the top 5% of similarity measures in target sequence-based and functional similarity matrices were retained in the DrugSimDB network and used for the calculation of degrees and distances. For each comparison, the pale color corresponds to the Jaccard-based approach. For functional similarity, only the GO category of BP was included in this visualization; similar results obtained using other categories (i.e. MF and CC) as visualized in Supplementary Figure S1, available online at https://dbpia.nl.go.kr/bib.

Aggregation of heterogeneous data improves the network coverage

Integrating heterogeneous multisource biomedical data on drugs compensates for information missing from individual data sources and increases data coverage. This can alleviate the sparsity problem and the difficulty of handling drugs with no information [52].

Figure 3B shows the proportion of drugs with no information in each individual data source and confirms that integration reduces data sparsity. Most drugs have known, valid chemical structures, resulting in the lowest rate of missing values (7.4% of 10 317 drugs) for chemical similarity. Other information sources, however, show substantial proportions of missing values, with drug-induced pathways at the extreme (90.3%). The latter can be improved by incorporating other databases as well as predictions on drug-pathway associations [45, 53], gene-expression profiles [54, 55] and protein interactions [28].
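The effect of integration on coverage can be illustrated with a minimal sketch. The drug names, source labels and values below are hypothetical toy data, not DrugSimDB content, and the rule that a drug counts as covered when at least one source has information on it is an assumption made for illustration.

```python
# Toy availability table: which (hypothetical) drugs have information in each source.
# Names and values are illustrative assumptions, not DrugSimDB data.
availability = {
    "chemical_structure": {"drugA", "drugB", "drugC", "drugD"},
    "targets": {"drugA", "drugB"},
    "induced_pathways": {"drugE"},
}
all_drugs = {"drugA", "drugB", "drugC", "drugD", "drugE"}


def missing_rate(covered, universe):
    """Fraction of drugs in `universe` with no information in `covered`."""
    return len(universe - covered) / len(universe)


per_source = {name: missing_rate(drugs, all_drugs) for name, drugs in availability.items()}

# Assumption: a drug is covered by the integrated resource if any source covers it.
integrated = set().union(*availability.values())
integrated_rate = missing_rate(integrated, all_drugs)

for name, rate in per_source.items():
    print(f"{name}: {rate:.0%} missing")
print(f"integrated: {integrated_rate:.0%} missing")
```

In this toy table every individual source misses at least one drug, whereas the union of sources misses none.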

Drug–drug similarity network predicts repositioning candidates

Drug similarity networks can be readily used for repositioning on the assumption that similar drugs are potentially repositionable for the same indication(s). To validate this assumption, we used repoDB [30], a standard database of drug repositioning successes and failures that contains 6677 approved drug-indication pairs and 4123 failed drug-indication pairs extracted from DrugCentral [56] and ClinicalTrials.gov [57]. DrugSimDB drug pairs (total of 238 635) were sorted ascendingly by their combined similarity scores; a pair was considered a ‘true positive’ (TP) when both drugs were approved for the same indication(s), and a ‘false positive’ (FP) if, for the same indication, one drug was approved and the other was not. We then plotted the ‘true positive rate’, TPR (sensitivity), against the ‘false positive rate’, FPR (1-specificity), at multiple cut-off values as implemented by the ROCit R package [58] and estimated the ‘area under the receiver operating characteristic (ROC) curve’ (AUC), as shown in Figure 3C. We obtained a competitive AUC of 0.708 using the combined similarity as the predicted score, which outperforms scoring based on individual similarity measures (Figure 3D). This corroborates previous observations that integrating heterogeneous data sources can improve repositioning performance [3, 59].
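The ROC analysis described above can be sketched in a few lines. This is a plain-Python illustration rather than the ROCit implementation used in the study. The scores and labels are hypothetical, labels are assumed to be already assigned (1 for a TP pair, 0 for an FP pair), and the sketch sweeps cut-offs from the highest score down, which is a choice made for this illustration.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists over all score cut-offs; labels are 1 (TP pair) or 0 (FP pair)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    # Rank pairs from highest to lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Pairs tied at the same score cross the cut-off together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr)))


# Toy example: hypothetical combined similarity scores and repoDB-style labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
toy_auc = auc(fpr, tpr)
print(round(toy_auc, 3))
```

The AUC equals the probability that a randomly chosen TP pair scores higher than a randomly chosen FP pair, so a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.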

Related works and comparison with Jaccard Index

Drug–drug similarity networks have been frequently used in a variety of in-silico drug development applications. Supplementary Table S1, available online at https://dbpia.nl.go.kr/bib, provides an illustrative list of recent studies in which drug similarities were adopted as part of a larger computational pipeline to predict drug targets, identify drug–drug interactions and reposition drugs for new indications, among others. Regardless of the application, a mainstream approach to deriving drug–drug similarities has been the Jaccard similarity coefficient, which compares the properties (e.g. side-effects, targets, pathways) associated with any two drugs. While Jaccard-based similarity is a standard approach for comparing drugs across well-annotated properties (e.g. structural fingerprints), it has limited capacity to derive similarities for new or poorly annotated compounds. Additionally, for drug properties with limited annotation coverage (e.g. induced pathways), drug pairs with overlapping properties are scarce, and the corresponding Jaccard-based similarity matrix is therefore extremely sparse when a comprehensive set of compounds is studied.
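The sparsity problem of the Jaccard coefficient can be made concrete with a small sketch. The drug names and pathway identifiers are hypothetical, and returning 0.0 for two empty annotation sets is a convention assumed here for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical pathway annotations per drug (illustrative identifiers only).
pathways = {
    "drugA": {"p1", "p2", "p3"},
    "drugB": {"p2", "p3", "p4"},
    "drugC": {"p5"},
    "drugD": set(),  # poorly annotated compound
}

drugs = sorted(pathways)
sims = {
    (d1, d2): jaccard(pathways[d1], pathways[d2])
    for i, d1 in enumerate(drugs)
    for d2 in drugs[i + 1:]
}
# Fraction of drug pairs with no overlapping annotation at all.
zero_fraction = sum(1 for s in sims.values() if s == 0.0) / len(sims)
print(sims[("drugA", "drugB")], zero_fraction)
```

Only one of the six toy pairs shares any pathway, so most entries of the similarity matrix are zero even though the drugs may act on related genes.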

DrugSimDB improves upon the baseline Jaccard similarity coefficient by comparing pathways at the gene level, by estimating the sequence similarities of targets and by integrating PPI information with GO semantic similarities. Figure 4 demonstrates that the adopted approaches enhance the coverage and connectivity of drug–drug similarity networks compared with Jaccard-driven alternatives. Figure 4A illustrates the distribution of similarity measures (after removing missing values) as the proportion of drug pairs whose similarities are equal to or less than the given cut-off. For instance, 80% of drug pairs have zero similarity under the pathway-level Jaccard index, whereas this proportion falls to 48% when pathways are compared at the gene level. Additionally, for functional similarity, 99% of drug pairs have a Jaccard similarity of less than 0.2 (i.e. the 99th percentile of similarity is 0.2), whereas in the DrugSimDB network the corresponding percentile rises to 0.8, indicating that the adopted approach not only increased the coverage but also improved the strength of the similarity evidence. Figure 4B shows the mean degree of nodes. Figure 4C shows the number of drug pairs that are connected within the given distances, where the shortest distance between any two nodes was estimated using the breadth-first search algorithm as implemented by the igraph package in R [60]. In the Jaccard-based pathway similarity network, for instance, nodes are reachable only from their immediate partners, forming several disconnected islands. Together, the plots show the improved connectivity of the DrugSimDB networks, which can enhance the subsequent network diffusion approaches frequently used in different drug development applications (see Supplementary Table S1, available online at https://dbpia.nl.go.kr/bib).
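The two connectivity measures in Figure 4B and 4C can be sketched as follows. This is a plain-Python breadth-first search on a hypothetical toy network, written for illustration; it is not the igraph code used in the study, and the node names are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from `source` to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def pairs_within(adj, max_dist):
    """Number of unordered node pairs connected by a path of at most `max_dist` edges."""
    count = 0
    for node in adj:
        dist = bfs_distances(adj, node)
        count += sum(1 for other, d in dist.items() if other != node and d <= max_dist)
    return count // 2  # each unordered pair was counted from both ends


def mean_degree(adj):
    """Average number of neighbours per node."""
    return sum(len(nbs) for nbs in adj.values()) / len(adj)


# Toy undirected network: a chain a-b-c-d plus a disconnected pair e-f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(mean_degree(adj), [pairs_within(adj, k) for k in (1, 2, 3)])
```

In the toy network the isolated pair e-f never gains partners as the distance threshold grows, which mirrors the disconnected islands described for the Jaccard-based pathway network.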

Code and data availability

To ensure the reproducibility of DrugSimDB, we have made the whole codebase (including all intermediate curation and processing steps and the web application) freely available for non-commercial use on GitHub (https://github.com/VafaeeLab/drugSimDB). The code and interface are well documented, and the database update is implemented as a semi-automated pipeline. This enables users to upgrade the database at any time to accommodate updates in the source databases. The pipeline has been implemented for parallel processing, and we recommend running it on high-performance computing platforms to accelerate computations on large similarity matrices.

Conclusions

The DrugSimDB repository and its interface provide a comprehensive and easy-to-use resource for probing drug–drug similarities in a variety of drug development studies including, but not limited to, drug repositioning. The interface not only facilitates access to pairwise similarities via autocomplete browsing, exportable tables and interactive network visualizations, but also provides complementary information on the physicochemical properties, side-effects and pharmacology of queried drugs, as well as PubMed evidence for any interacting (i.e. similar) drug pairs. Together, it provides an inclusive platform for similarity-based in-silico drug studies, all in one view. We have developed a semi-automated, well-commented upgrade pipeline that enables easy and periodic database upgrades, not only by developers but also by users who wish to access the latest version of the data sources at any time.

Multiple lines of evidence on drug-related information have been derived from heterogeneous data sources to improve coverage and prediction performance. Yet DrugSimDB’s score-based prioritization platform has the capacity to incorporate many other types of drug-related information (e.g. drug adverse effects, pharmacodynamics, drug-target secondary structures and drug-induced molecular omics), all of which we plan to add to further enhance the current resource. In contrast to supervised computational methods, the score-based, unsupervised prediction adopted by DrugSimDB is not biased by the composition of a training set, is not affected by an unbalanced training set and can readily incorporate rare and sparse features with substantial missing values. DrugSimDB is essentially a weighted, multi-modal, scale-free network of drug–drug associations, which offers scope for various network-based analyses [52], such as community detection, network-based inference and the computation of graph properties useful for drug repositioning and beyond.

Key Points
  • DrugSimDB provides a comprehensive, integrative and extendable resource of drug–drug similarities complemented with an interactive user-friendly web interface.

  • DrugSimDB networks and individual similarity matrices cover an exhaustive list of currently approved and investigational drugs. The platform is easily updatable (by users and developers) to account for new drugs and information.

  • DrugSimDB currently integrates information on drug chemical structures, protein targets and their primary structure, drug-induced pathways, gene ontology annotations of protein targets and protein–protein interactions.

  • The web interface facilitates access to further information on drugs’ pharmacology, physicochemical properties and side-effects as well as peer-reviewed evidence from the PubMed literature search engine on drug-pair co-occurrence.

Authors’ Contributions

F.V. conceived and supervised the project. F.V., A.A., M.D. and A.N. developed the methodological framework. F.V. and A.A. generated the results and figures. A.A. developed the online platform. F.V. and A.A. wrote the manuscript. J.S. and L.L.M. contributed to project conception. All authors reviewed and approved the final manuscript.

AKM Azad, PhD, is a postdoctoral research fellow in bioinformatics and computational biology at UNSW Sydney, the School of BABS.

Mojdeh Dinarvand, PhD, is a research associate in drug discovery and microbiology at UNSW Sydney, the School of BABS.

Alireza Nematollahi, PhD, has been a research associate in pharmacology at UNSW Sydney, the School of BABS.

Joshua Swift, PhD, received his PhD in molecular oncology (2018) from the School of BABS at UNSW Sydney and is the founder of ZiggyLabs, an animal health company which utilizes drug repositioning to identify novel anti-cancer therapies for dogs.

Louise Lutze-Mann, PhD, is an associate professor in molecular and cell biology at UNSW Sydney, the School of BABS.

Fatemeh Vafaee, PhD, is senior lecturer and team leader in bioinformatics and computational biomedicine at University of New South Wales (UNSW Sydney), the School of Biotechnology and Biomolecular Sciences (BABS).

References

1. Brown AS, Patel CJ. MeSHDD: literature-based drug-drug similarity for drug repositioning. J Am Med Inform Assoc 2017;24(3):614–8.

2. Zeng X, Jia Z, He Z, et al. Measure clinical drug–drug similarity using electronic medical records. Int J Med Inform 2019;124:97–103.

3. Luo Y, Zhao X, Zhao J, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun 2017;8(1):1–13.

4. Ding H, Takigawa I, Mamitsuka H, et al. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Brief Bioinform 2014;15(5):734–47.

5. Wu Z, Li W, Liu G, et al. Network-based methods for prediction of drug-target interactions. Front Pharmacol 2018;9:1134.

6. Campillos M, Kuhn M, Gavin A-C, et al. Drug target identification using side-effect similarity. Science 2008;321(5886):263–6.

7. Lu Y, Guo Y, Korhonen A. Link prediction in drug-target interactions network using similarity indices. BMC Bioinformatics 2017;18(1):39.

8. Zhao X, Chen L, Lu J. A similarity-based method for prediction of drug side effects with heterogeneous information. Math Biosci 2018;306:136–44.

9. Zhang W, Yue X, Liu F, et al. A unified frame of predicting side effects of drugs by using linear neighborhood similarity. BMC Syst Biol 2017;11(6):101.

10. Timilsina M, Tandan M, d'Aquin M, et al. Discovering links between side effects and drugs using a diffusion based method. Sci Rep 2019;9(1):1–10.

11. Ferdousi R, Safdari R, Omidi Y. Computational prediction of drug-drug interactions based on drugs functional similarities. J Biomed Inform 2017;70:54–64.

12. Sridhar D, Fakhraei S, Getoor L. A probabilistic approach for collective similarity-based drug–drug interaction prediction. Bioinformatics 2016;32(20):3175–82.

13. Kastrin A, Ferk P, Leskošek B. Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PLoS One 2018;13(5):e0196865.

14. Rohani N, Eslahchi C. Drug-drug interaction predicting by neural network using integrated similarity. Sci Rep 2019;9(1):1–11.

15. Ryu JY, Kim HU, Lee SY. Deep learning improves prediction of drug–drug and drug–food interactions. Proc Natl Acad Sci USA 2018;115(18):E4304–11.

16. Luo H, Wang J, Li M, et al. Drug repositioning based on comprehensive similarity measures and bi-random walk algorithm. Bioinformatics 2016;32(17):2664–71.

17. Huang C-T, Hsieh C-H, Oyang Y-J, et al. A large-scale gene expression intensity-based similarity metric for drug repositioning. iScience 2018;7:40–52.

18. Zheng Y, Peng H, Zhang X, et al. Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces. BMC Bioinformatics 2019;20(23):605.

19. Yan C, Feng L, Wang W, et al. A novel drug repositioning approach based on integrative multiple similarity measures. Curr Mol Med 2020;20(6):442–51.

20. Ashburn TT, Thor KB. Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov 2004;3(8):673–83.

21. O’Boyle NM, Sayle RA. Comparing structural fingerprints using a literature-based similarity benchmark. J Chem 2016;8(1):1–14.

22. Vilar S, Hripcsak G. Leveraging 3D chemical similarity, target and phenotypic data in the identification of drug-protein and drug-adverse effect associations. J Chem 2016;8(1):35.

23. Wang W, Yang S, Zhang X, et al. Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 2014;30(20):2923–30.

24. Tatonetti NP, Ye PP, Daneshjou R, et al. Data-driven prediction of drug effects and interactions. Sci Transl Med 2012;4(125):125ra31–1.

25. Iorio F, Bosotti R, Scacheri E, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci USA 2010;107(33):14621–6.

26. Wishart DS, Feunang YD, Guo AC, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 2018;46(D1):D1074–82.

27. Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 2017;45(D1):D353–61.

28. Brown KR, Jurisica I. Online predicted human interaction database. Bioinformatics 2005;21(9):2076–82.

29. Kuleshov MV, Jones MR, Rouillard AD, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res 2016;44(W1):W90–7.

30. Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data 2017;4(1):1–7.

31. Kuhn M, Letunic I, Jensen LJ, et al. The SIDER database of drugs and side effects. Nucleic Acids Res 2016;44(D1):D1075–9.

33. Smith TJ. MolView: a program for analyzing and displaying atomic structures on the Macintosh personal computer. J Mol Graph 1995;13(2):122–5.

34. Almende BV, Thieurmel B, Robert T. visNetwork: Network Visualization using vis.js Library. R package version 2.0.4. 2018.

35. DrugBank. DrugBank Release Version 5.1.3, Chemical Structures. 2019. https://www.drugbank.ca/releases/5-1-3#structures.

36. Cao Y, Charisi A, Cheng L-C, et al. ChemmineR: a compound mining framework for R. Bioinformatics 2008;24(15):1733–4.

37. DrugBank. DrugBank Release Version 5.1.3, Target Sequences. 2019. https://www.drugbank.ca/releases/5-1-3#target-sequences.

38. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48(3):443–53.

39. Raghava GP, Barton GJ. Quantification of the variation in percentage identity for protein sequence alignments. BMC Bioinformatics 2006;7(1):415.

40. Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: efficient manipulation of biological strings. R package version, 2017;2(0).

41. Passi A, Rajput NK, Wild DJ, et al. RepTB: a gene ontology based drug repurposing approach for tuberculosis. J Chem 2018;10(1):24.

42. Carlson MRJ, et al. Genomic annotation resources in R/Bioconductor. Statistical Genomics. Humana Press, New York, NY, 2016. 67–90.

43. Wang JZ, Du Z, Payattakool R, et al. A new method to measure the semantic similarity of GO terms. Bioinformatics 2007;23(10):1274–81.

44. Yu G, Li F, Qin Y, et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 2010;26(7):976–8.

45. Zeng H, Qiu C, Cui Q. Drug-path: a database for drug-induced pathways. Database (Oxford) 2015;2015:bav061.

46. Tenenbaum D. KEGGREST: Client-side REST access to KEGG. R package version 1.24.0. 2019.

47. Sancho LR. BioCor: Functional Similarities. R package version 1.10.0. 2019. https://llrs.github.io/BioCor/.

48. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 1995;57(1):289–300.

49. Barabasi A-L, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet 2004;5(2):101–13.

50. Broido AD, Clauset A. Scale-free networks are rare. Nat Commun 2019;10(1):1–10.

51. Gillespie CS. Fitting heavy tailed distributions: The poweRlaw package. R package version 0.20.5. 2015.

52. Luo H, Li M, Yang M, et al. Biomedical data and computational models for drug repositioning: a comprehensive review. Brief Bioinform 2020;bbz176. https://doi.org/10.1093/bib/bbz176.

53. Frolkis A, Knox C, Lim E, et al. SMPDB: the small molecule pathway database. Nucleic Acids Res 2010;38(suppl_1):D480–7.

54. Scherf U, Ross DT, Waltham M, et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000;24(3):236–44.

55. Musa A, Ghoraie LS, Zhang S-D, et al. A review of connectivity map and computational approaches in pharmacogenomics. Brief Bioinform 2018;19(3):506–23.

56. Ursu O, Holmes J, Knockel J, et al. DrugCentral: online drug compendium. Nucleic Acids Res 2017;4(45):D932–D939.

57. Cine N. ClinicalTrials.gov, 2018. https://clinicaltrials.gov/.

58. Khan MRA. ROCit-An R Package for Performance Assessment of Binary Classifier with Visualization. 2019.

59. Zeng X, Zhu S, Liu X, et al. deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 2019;35(24):5191–8.

60. Csardi G, Nepusz T. The igraph software package for complex network research. Computer Science 2006;1695(5):1–9.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data