BETA: a comprehensive benchmark for computational drug–target prediction

Abstract

Internal validation is the most popular evaluation strategy used for drug–target predictive models. The simple random shuffling in the cross-validation, however, is not always ideal to handle large, diverse and copious datasets as it could potentially introduce bias. Hence, these predictive models cannot be comprehensively evaluated to provide insight into their general performance on a variety of use-cases (e.g. permutations of different levels of connectiveness and categories in drug and target space, as well as validations based on different data sources). In this work, we introduce a benchmark, BETA, that aims to address this gap by (i) providing an extensive multipartite network consisting of 0.97 million biomedical concepts and 8.5 million associations, in addition to 62 million drug–drug and protein–protein similarities and (ii) presenting evaluation strategies that reflect seven cases (i.e. general, screening with different connectivity, target and drug screening based on categories, searching for specific drugs and targets and drug repurposing for specific diseases), a total of seven Tests (consisting of 344 Tasks in total) across multiple sampling and validation strategies. Six state-of-the-art methods covering two broad input data types (chemical structure- and gene sequence-based and network-based) were tested across all the developed Tasks. The best-worst performing cases have been analyzed to demonstrate the ability of the proposed benchmark to identify limitations of the tested methods for running over the benchmark tasks. The results highlight BETA as a benchmark in the selection of computational strategies for drug repurposing and target discovery.

computational benchmark, drug target prediction, computational drug development, deep learning

Introduction

Critical to the drug discovery process is the ability to define, identify, screen and understand potential candidate pairs among small molecules (i.e. drugs) and proteins (i.e. targets) [1–3]. Despite advancements in the use of biological assays to experimentally validate drug–target interactions (DTIs), these early steps of drug development remain expensive to accomplish [3, 4]. The use of current experimental screening (in vitro) methods to cover all the possible combinations of DTIs is infeasible, and the tendency to only focus on particular families of ‘druggable’ proteins or ‘preferable’ drugs greatly limits the systematic screening of the potentially larger number of compounds, small molecules and proteins available [5, 6]. The adoption of computational (in silico) methods has therefore been suggested to provide a more efficient means for prescreening [7–14].

Computational methods historically began with early attempts at docking simulations and ligand matching [2, 3, 15, 16] and have recently progressed to machine learning-based solutions [14, 17–20]. Although it is desirable to validate the discoveries with biological assays, known as external validation, this is infeasible for most computational labs. Therefore, internal validation, such as cross-validation, is the most popular validation strategy for the existing methods, where some of the drug–target associations are held out of training and reserved for testing. The datasets in internal validations are either small-scale datasets developed from very early attempts [21–25] or tailored sets generated from diverse biomedical databases that contain drug–target associations [14]. For example, among the 87 investigated computational papers published in a recent survey [14], 79 (91%) and 66 (77%) papers conducted the experiments based on the biomedical databases Drugbank [26] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [27], and 54 (62%) used small-scale drug–target associations based on a protein category in the target space developed for cross-validation in 2008 [28]. Bias may still exist in these experiments with simple random shuffling, as the patterns of connectiveness and categories in the drug and target space with a large number of associations will be favored. A gold standard that provides large datasets, as well as sophisticated validation methods with a minimized risk of bias (e.g. permutations of different levels of connectiveness and categories in the drug and target space, as well as validations based on different data sources), does not exist. The complexity in selecting suitable computational solutions during the drug development phase is an ongoing challenge [21, 23, 29, 30], and without such a standard benchmark to evaluate predictive models in an equitable and comprehensive manner, the adoptability of developed computational methods is hindered.

In this work, we fill this gap by providing a large-scale benchmark that enables a comprehensive evaluation of drug–target predictive models to facilitate a selection of computational strategies for drug and target prescreening. This benchmark provides an extensive multipartite network consisting of 0.97 million biomedical concepts including 59 000 drugs and 95 000 targets, and 8.5 million associations including 817 thousand drug–target associations, as well as 62 million drug–drug and protein–protein similarities based on drug chemical structures and gene sequences that can be used to comprehensively evaluate the prescreening strategies that reflect seven use-cases (i.e. general, screening with different connectivity, target and drug screening based on categories, searching for specific drugs and targets and drug repurposing for specific diseases), a total of seven Tests (consisting of 344 Tasks in total) that cover two types of training/testing sampling strategies based on drug–target space as well as six types of validation strategies. To demonstrate the use of our benchmark, six state-of-the-art predictive models have been selected and categorized based on the input types (i.e. structure- and sequence-based and network-based methods) and evaluated as use-cases. The best-worst performing diseases (e.g. spinal muscular atrophy versus obesity for a versioning-based Job and human immunodeficiency virus (HIV) versus myocardial infarction for a trial-based Job) have been analyzed. The results highlighted BETA as a benchmark in the selection of drug–target prediction methods for drug repurposing and target discovery applications when a pair of drugs and targets are given as the input.

Methods

Our proposed benchmark consists of two major components: (i) datasets and (ii) evaluation Tasks.

With respect to datasets, a multipartite network was constructed based on an integration of 11 existing biomedical repositories (Diseasome [31], Drugbank [26], Gene Ontology Annotation (GOA) [32], Interaction Reference Index (iRefindex) [33], KEGG [27], Linked Structured Product Label (Linkedspl) [34], Online Mendelian Inheritance in Man (OMIM) [35], Pharmacogenomics Knowledge Base (Pharmgkb) [36], Side Effect Resource (SIDER) [37] and STRING [38]), which incorporated 971 874 entities and 8 530 037 associations in total. We defined a common ‘drug–target–disease’ node space that consisted of the entities from Drugbank (6250 drugs and 2838 targets) and OMIM (52 187 diseases) (see Table 1(a) for details). This graph also incorporated 46 million drug–drug and 16 million protein–protein similarities computed based on the chemical structures and gene sequence obtained from Drugbank.

Table 1

Statistics of the benchmark dataset and evaluation Tasks

(a) Network
Repositories	Node (biomedical entities)				Edge (biomedical associations)				Drug mapped		Target mapped		Disease mapped
	All node	Drug	Target	Disease	All edge	Drug-associated	Target-associated	Disease-associated	Local	Drugbank	Local	Drugbank	Local	OMIM
Diseasome	9518	1362	3919	4213	25 091	8202	9744	268 918	1309	1309	508	515	445	557
DrugBank	24 655	6823	4037	0	56 245	56 245	14 744	0	0	0	0	0	0	0
GOA	48 624	0	43 830	0	163 498	0	163 715	0	0	0	2178	2180	0	0
Irefindex	230 136	0	14 860	0	59 809	0	975 287	0	0	0	2555	2562	0	0
KEGG	13 776	3848	7777	1284	975 287	58 476	45 465	35 725	2380	1676	2165	2159	1036	2751
Linkedspl	61 869	44 196	30	0	163 639	163 639	3613	0	37 807	1134	2131	2140	0	0
OMIM	104 540	0	0	31 468	137 682	0	0	137 682	0	0	2118	2120	0	0
Pharmgkb	6543	868	1442	486	33 508	23 322	17 610	25 460	1643	1638	2131	2140	1376	2208
Pharmgkb-offside	450 230	1332	0	10 097	877 604	438 802	0	438 802	7149	6138	0	0	50 905	52 159
Sider	2598	893	0	1705	68 424	68 424	0	68 424	858	868	0	0	814	1428
STRING	19 385	0	19 385	0	5 969 250	0	5 969 250	0	0	0	2131	2133	0	0
Total	971 874	59 322	95 280	49 253	8 530 037	817 110	7 199 428	975 011	51 146	6250	15 917	2838	54 576	52 187
(b) Evaluation tasks
Test	Purpose	Validation		Drug–Target Space		# Task	# Avg train pairs	# Avg drugs for training	# Avg targets for training	# Avg test pairs	# Avg test positive	# Avg test negative	# Avg drugs for testing	# Avg targets for testing
0	General	k-fold cross internal validation (k = 10)		All		10	11 016	5337	3087	1483	676	807	1190	1039
1	Screening with connectivity spaces (e.g. isolated drugs or targets)	k-fold cross internal validation (k = 10)		Connectivity space		90	11 769	5654	3221	929	471	458	637	551
2	Target screening when drugs are within/beyond category	k-fold cross internal validation (k = 3) + Drug categories		Category space (drugbank category)		36	12 164	5731	3252	152	76	76	32	73
				Category space (linkpl class)		36	12 108	5732	3201	263	132	132	32	192
3	Drug screening when targets are within/beyond category	k-fold cross internal validation (k = 3) + Protein categories		Category space (family)		36	12 045	5721	3252	389	195	195	109	120
				Category space (protein class)		36	11 965	5685	3247	549	275	275	165	194
4	Searching for a specific drug/target	Internal validation with m search (n = 10)		All targets (500)		10	12 207	5735	3239	534	33	501	1	534
		Internal validation with m search (n = 10)		All drugs (500)		10	12 196	5717	3253	545	44	501	545	1
5	Drug repurposing for a specific disease	Version-based external validation + Disease specific		Category space (10 disease)		40	12 240	5735	3253	44	22	22	24	14
6		Clinical trial-based external validation + Disease-specific		Category space (10 disease)		40	12 240	5735	3253	25	13	13	10	11

For the evaluation component, we designed seven main Tests (344 Tasks in total) based on the Perspectives (i.e. Perspectives of validation and data spaces) that were used for the generation of training and testing sets (see Table 1(b)). Specifically, Tests 0–4 (i.e. internally validated Jobs) generated existent associations (i.e. positives) based on internal validation, in which the random selection of the entire drug–target space was used to generate nonexistent (i.e. negative) testing associations for Test 0 (10 Tasks), connectivity spaces were used for Test 1 (90 Tasks), category spaces were used for Tests 2 and 3 (144 Tasks) and a search of drug–target space was used for Test 4 (20 Tasks). Tests 5–6 (i.e. externally validated Jobs) generated positive associations based on external validation, in which versioning- and clinical trial-based validation was used to generate the positive training and testing sets for Tests 5 (40 Tasks) and 6 (40 Tasks), respectively, and disease-based categories were used to generate negative associations for testing.
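As a quick consistency check, the Task counts reported per Test in Table 1(b) can be tallied against the stated total. The snippet below is only an illustrative sketch; the dictionary name is hypothetical and the counts are copied from the table.

```python
# Task counts per Test, copied from Table 1(b); the variable name is illustrative.
tasks_per_test = {
    0: 10,       # general, 10-fold cross-validation
    1: 90,       # connectivity spaces
    2: 36 + 36,  # drug categories: Drugbank category + Linkedspl class
    3: 36 + 36,  # protein categories: family + protein class
    4: 10 + 10,  # searching for a specific drug / target
    5: 40,       # version-based external validation
    6: 40,       # clinical trial-based external validation
}

total_tasks = sum(tasks_per_test.values())
category_tasks = tasks_per_test[2] + tasks_per_test[3]
print(total_tasks, category_tasks)
```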

Figure 1

Organization of the proposed benchmark. The evaluation Tasks were generated based on a multipartite network and publicly available third-party resources (e.g. Drugbank [26] and ClinicalTrials.gov [39]). Two Perspectives, validation and data spaces, were used to generate the training and testing associations. Specifically, there were two types of validation: internal validation (e.g. k-fold cross-validation, drug category- and target category-based) and external validation (e.g. versioning- and clinical trial-based), used to generate the positive training and testing associations. Additionally, two types of data spaces, connectivity space (e.g. Scenario 1: default, 2: semi-cold start and 3: cold start) and category space [e.g. Test End Only (TEO), Train End Only (TRO), Test End Excluded (TEE) and Train End Excluded (TRE)], were designed to generate the negative testing associations. In total, seven main Tests comprising 344 Tasks were provided in the benchmark based on the two Perspectives, in which Test 0 was for general drug–target prediction (10 Tasks), Test 1 for screening for drug–target associations with connectivity spaces (90 Tasks), Tests 2–3 for target and drug screening with category space (144 Tasks), Test 4 for drug and target searching (20 Tasks) and Tests 5–6 for drug repurposing (80 Tasks, 40 per Test).

Data collection and processing

To generate the full datasets, we collected the linked data version of the human disease network [31], Diseasome. Drugbank, GOA, iRefindex, KEGG [27], Linkedspl [34], OMIM [35], Pharmgkb [36] and SIDER [37] were collected from Bio2rdf release 4 [40]. STRING V11 [38] was downloaded directly. To integrate the databases, we defined the common node spaces for drugs, targets and diseases, in which Drugbank drugs and targets defined the drug and target spaces, and OMIM defined the disease space (see Supplementary Figure 1; Supplementary Data are available online at https://dbpia.nl.go.kr/bib). The common entity identifiers (IDs) were used for mapping. Specifically, the identifiers from Unified Medical Language System (UMLS) [41], DBpedia and Wikipedia [42], KEGG, PubChem [43] and Pharmgkb were used for mapping in the drug space; UniProt Knowledgebase [44], HUGO Gene Nomenclature Committee [45], GenAtlas [46] and OMIM were used for the target space; and DBpedia, UMLS and Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) [47] were used for the disease space. We utilized owl:sameAs to provide a mapping across different datasets and kept the original entities and the associations in each dataset without integrating similar concepts from different datasets into one data point (i.e. entity). We obtained the drug chemical structures formatted in the Simplified Molecular Input Line Entry System [48] and the gene sequences from Drugbank. Targeting the drugs and targets from Drugbank in the common node spaces, we generated the drug–drug similarity matrix based on the Tanimoto similarity computed with the Chemistry Development Kit [49] and the protein–protein similarity matrix based on the Smith–Waterman algorithm [50]. The quantitative values in other datasets were not incorporated or computed in the proposed benchmark, as it is challenging to normalize the quantitative values across the different datasets for the computation. In practice, a Resource Description Framework (RDF) triple store, GraphDB [51], was adopted to manage the network.
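The two similarity measures named above can be illustrated with a minimal Python sketch. This is not the benchmark's pipeline: the fingerprints here are plain sets of "on" bit positions rather than Chemistry Development Kit fingerprints, and the Smith–Waterman scoring parameters (match, mismatch and a linear gap penalty) are assumptions chosen only for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints are treated as having similarity 0 (a convention).
    return len(a & b) / union if union else 0.0


def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (assumed scoring)."""
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in seq_b
                current[j - 1] + gap,            # gap in seq_a
            )
            best = max(best, current[j])
        previous = current
    return best


print(tanimoto({1, 2, 3}, {2, 3, 4}))        # 2 shared bits out of 4 -> 0.5
print(smith_waterman_score("ACGT", "ACGT"))  # 4 matches x 2 -> 8
```

In a real setting the fingerprints would come from a cheminformatics toolkit and the alignment would typically use a substitution matrix for protein sequences; both are outside the scope of this sketch.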

Figure 2

Logistics of two Perspectives (i.e. data spaces and validation) in the proposed benchmark. For the Perspective of data space, connectivity space- and category space-based Jobs were proposed to generate the nonexistent associations (i.e. negatives) for testing. Specifically, three Scenarios reflected the connectivity for nodes used for training, which were Scenario 3, cold start (i.e. both nodes in a pair were isolated if the association was removed), Scenario 2, semi-cold start (i.e. one of the nodes in a pair was isolated if the association was removed) and Scenario 1, default (i.e. none of the nodes in a pair was isolated if the association was removed). There were nine connectivity space-based Jobs, which included TT space, TC space, TA node space, CC space, CA node space and AA node space for Scenario 1, SS and SU for Scenario 2 and DI in testing space for Scenario 3. Four category space-based Jobs were designed, which were TEO, TRO, TEE and TRE. For the Perspective of validation, both internal and external validation were used. Three Jobs were used for internal validation, which were k-fold cross-validation-based, drug category-based and target category-based Jobs. As a demonstration example, the heat maps in the two latter Jobs showed similarities among and between the different drug categories (e.g. dopamine antagonists and GABA modulators) and protein families (e.g. cell division protein kinase and sodium/chloride-dependent transporter). Two Jobs were used for external validation, which included versioning-based and clinical trial-based Jobs. As a demonstration example, six targets were allocated based on a query of breast cancers. The proteins phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform, androgen receptor and cellular tumor antigen p53 were associated with the drugs adenosine triphosphate, diethylstilbestrol and triethyl phosphate based on validation with a newer version of Drugbank. The proteins DNA repair protein RAD51 homolog 1 and serine/threonine-protein kinase Chk2 were associated with the drugs fulvestrant, capecitabine and cyclophosphamide based on validation with ClinicalTrials.gov. The structures of drugs and proteins were obtained from Drugbank and Protein Data Bank [59], respectively.

Benchmark design

In general, the purpose of the evaluation was to assess how well a model can predict drug–target associations by separating existent associations (i.e. positives) from a highly imbalanced large number of nonexistent associations (i.e. negatives). Conventionally, three characteristics of datasets were widely used for evaluations: (i) the connectivity pattern of the drugs and targets that underlie topological context and inherent connection profiles [21, 52, 53], (ii) the categories of drugs and targets in real scenarios [54–56] and (iii) the validation of the associations internally and externally [55, 57, 58]. As such, we designed the seven evaluation Tests that generated the training and testing associations based on the two Perspectives—validations and data spaces. It should be noted that to distinguish the hierarchical level of logic used for the evaluation tasks, we used Tests, Perspectives, Jobs, Scenarios and Tasks to represent the evaluation task in each logic layer, in which each child concept was considered to be the subtask of its respective parent (e.g. Perspective of validation contains internal validation-based Jobs and external validation-based Jobs).

Perspective of validation

Two types of validations, consisting of both internal and external validation-based Jobs, were designed to generate positive training and testing associations.

Internal validation-based jobs

Three validations were used: (i) k-fold cross-validation-based, in which the original drug–target associations were randomly partitioned into k equal-sized subsamples (without producing any isolated nodes in the network built from the remaining k – 1 subsamples), and k independent experiments were then conducted, each using one subsample for testing (i.e. positive testing set) and the remaining k – 1 subsamples for training (i.e. positive training set) [4]; (ii) drug category-based, in which the drug–target associations were partitioned for training and testing so as to include pairs with similar drugs, based on two different category systems for drugs, namely drug categories from Drugbank and pharmacologic class from DailyMed [60]; and (iii) target category-based, in which the drug–target associations were partitioned for training and testing based on two different category systems for targets, namely family and protein class from Panther [61].
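One way to satisfy the no-isolated-node constraint of the k-fold Job can be sketched in Python. The procedure below is an assumption made for illustration, not the benchmark's documented algorithm: it greedily assigns each pair to the smallest fold that still leaves both endpoints with at least one training pair, and keeps pairs that cannot be placed (e.g. those touching a degree-one node) permanently in training. The function name and return shape are hypothetical, and the folds are only approximately equal in size.

```python
import random
from collections import Counter


def kfold_without_isolation(pairs, k, seed=0):
    """Split (drug, target) pairs into k test folds without isolating any node.

    Returns (folds, always_train); always_train holds pairs that cannot be
    tested without isolating one of their endpoints.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)

    # Degree of every node; drugs and targets are kept in separate namespaces.
    degree = Counter()
    for drug, target in pairs:
        degree[("drug", drug)] += 1
        degree[("target", target)] += 1

    folds = [[] for _ in range(k)]
    in_fold = [Counter() for _ in range(k)]  # test pairs per node, per fold
    always_train = []

    for drug, target in pairs:
        nodes = (("drug", drug), ("target", target))
        # A fold is allowed only if, after adding this pair, each endpoint
        # still has at least one pair outside that fold (i.e. in training).
        allowed = [
            i for i in range(k)
            if all(in_fold[i][node] + 1 < degree[node] for node in nodes)
        ]
        if not allowed:
            always_train.append((drug, target))
            continue
        chosen = min(allowed, key=lambda i: len(folds[i]))  # keep folds balanced
        folds[chosen].append((drug, target))
        for node in nodes:
            in_fold[chosen][node] += 1

    return folds, always_train
```

For fold i, the positive testing set is `folds[i]` and the positive training set is every other pair, including those in `always_train`.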

External validation-based jobs

Two validations were used to evaluate the algorithms’ ability to predict new associations that existed in the newer version of training data and recently conducted clinical trials: (i) versioning-based, in which novel drug–target associations in the latest version of Drugbank were tested with the older version used as the training data and (ii) clinical trial-based, in which novel drug–target associations obtained from the latest clinical trials at ClinicalTrials.gov were tested with the entire datasets in the benchmark used as the training data. Specifically, for a particular target of a disease, each drug tested as a clinical trial intervention under ‘recruiting’ status was considered to form a novel drug–target pair for testing. For the two external validations, the targets were categorized by the associated diseases.

Perspective of data space

Two types of data spaces were designed to generate negative drug–target associations for testing: topological structures in the drug–target space and in the node category space.

Connectivity space-based jobs

The permutations of connectiveness levels between the nodes supplied to an algorithm for predictive purposes were tested in the evaluation. Specifically, the nodes associated with the drug–target associations were classified into three spaces: (i) Test Node Space (TNS), (ii) Connected Node Space (CNS) and (iii) All Node Space (ANS). TNS consisted of all the drug and target nodes used for testing. CNS consisted of all drug and target nodes with a drug–target association existing between them. ANS consisted of all drug and target nodes. Consequently, nine types of negative associations were designed based on three Scenarios, which reflected the connectivity of drugs and targets. Scenario 1 (default) was defined as ‘given a pair of drug and target, neither of the nodes (a drug or target) is isolated if the association is removed for testing’. In Scenario 1, testing nodes were those nodes in the CNS. Six Jobs can be generated, which included Test–Test space-based (TT, a pair of nodes both coming from TNS), Test–Connected space-based (TC, a pair of nodes coming from TNS and CNS), Test–All node space-based (TA, a pair of nodes coming from TNS and ANS), Connected–Connected space-based (CC, a pair of nodes both coming from CNS), Connected–All node space-based (CA, a pair of nodes coming from CNS and ANS) and All–All node space-based (AA, a pair of nodes both coming from ANS). Scenario 2 (semi-cold start) was defined as ‘given a pair of drug and target, one of the two nodes (either a drug or target) is isolated if the association is removed for testing’. In Scenario 2, one of the nodes was not in the CNS. Based on ‘guilt-by-association’ [4, 52], two nodes were considered similar if they connected to a common node. Therefore, two types of negative associations can be generated for testing, which were Semi-isolated with Similar nodes (SS) and Semi-isolated with Unsimilar nodes (SU). Scenario 3 (cold start) was defined as ‘given a pair of drug and target, both nodes (drug and target) are isolated if the association is removed for testing’. In Scenario 3, neither node was in the CNS; accordingly, a pair of Double Isolated (DI) nodes can be created for negative associations.
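The three Scenario definitions above reduce to counting how many endpoints of a pair would be left without any association once that pair is removed. The following Python sketch illustrates that rule; the function name and the representation of associations as a set of (drug, target) tuples are assumptions made for illustration.

```python
def connectivity_scenario(pair, associations):
    """Return 1 (default), 2 (semi-cold start) or 3 (cold start) for a pair.

    `associations` is an iterable of (drug, target) tuples that contains `pair`.
    """
    associations = set(associations)
    if pair not in associations:
        raise ValueError("pair must be an existing drug-target association")
    drug, target = pair
    drug_degree = sum(1 for d, _ in associations if d == drug)
    target_degree = sum(1 for _, t in associations if t == target)
    # A node becomes isolated when the removed pair was its only association.
    isolated_nodes = int(drug_degree == 1) + int(target_degree == 1)
    return 1 + isolated_nodes


known = {("d1", "t1"), ("d1", "t2"), ("d2", "t1"), ("d3", "t2"), ("d4", "t4")}
print(connectivity_scenario(("d1", "t1"), known))  # neither node isolated -> 1
print(connectivity_scenario(("d3", "t2"), known))  # only d3 isolated -> 2
print(connectivity_scenario(("d4", "t4"), known))  # both nodes isolated -> 3
```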

Category space-based jobs

Drug–target associations were selected for testing similar or dissimilar drugs or targets based on a category. Four types of negative associations were designed as follows: (i) TEO, in which testing negative pairs (a source node and an end node) were generated based on a source node of a testing positive and similar end nodes to the end node of testing positive in a category, (ii) TRO, in which testing negatives were generated based on a source node of testing positives and similar end nodes to the end node of testing positive in a category, (iii) TEE, in which testing negative pairs were generated based on a source node of a testing positive and nonsimilar end nodes to the end node of testing positive beyond a category and (iv) TRE, in which testing negatives were generated based on a source node of nontesting positives and nonsimilar end nodes to the end node of testing positive beyond a category.

Evaluation tasks

Based on the combinations of the Perspectives of data space and validation, seven evaluation Tests were generated, in which the positive associations for training and testing were generated based on the validation-based Perspective, and the negative associations for testing were generated based on the data space-based Perspective. Please note that the negative associations for training were not provided in the benchmark Tasks as those associations can be generated to improve the performances in different algorithms [21, 62, 63]. To conduct a fair comparison, we removed the drugs, targets and the corresponding drug–target associations from the evaluation tasks when the drugs and targets did not have chemical structure and gene sequence information in the benchmark as those entities and associations cannot be processed by the structure- and sequence-based methods.

To organize the Tasks, we classified them as follows.

Internally validated tests

Test 0 (10 Tasks): This Test was designed to conduct a general evaluation of the drug–target prediction. The k-fold cross-validation was used to generate the positive training and testing pairs. The negative testing pairs were randomly selected. In practice, k was set to 10.

Test 1 (90 Tasks): This Test was designed to evaluate the drug–target prediction when drugs and targets were at different connectivity spaces (e.g. isolated drugs or targets). The k-fold cross-validation was used to generate the positive training and testing pairs. Nine different connectivity spaces were used for the selection of negative testing pairs. k was set to 10.

Test 2 (72 Tasks): This Test was designed to evaluate the drug–target prediction when drugs were within or beyond two categories (i.e. drug categories from Drugbank and pharmacologic class from DailyMed [60]). The k-fold cross-validation was used to generate the positive training and testing pairs. Four kinds of category spaces for drugs were used for the selection of negative testing pairs. k was set to 3.

Test 3 (72 Tasks): This Test was designed to evaluate the drug–target prediction when targets were within or beyond two categories (i.e. family and protein class from Panther [61]). The k-fold cross-validation was used to generate the positive training and testing pairs. Four kinds of category spaces for targets were used for the selection of negative testing pairs. k was set to 3.

Test 4 (20 Tasks): This Test was designed to evaluate the searching for drugs or targets when a target or a drug is given. N searches were conducted with a specified search space (e.g. 500 drugs or targets). In practice, N was set to 10.

Externally validated tests

Test 5 (40 Tasks): This Test was designed to evaluate the drug–target prediction for M diseases. A versioning-based validation was used to generate the positive testing pairs with the whole benchmark data that was used as positive training pairs. Four kinds of category spaces for targets were used for the selection of negative testing pairs. In practice, M was set to 10.

Test 6 (40 Tasks): This Test was designed to evaluate the drug–target prediction for M diseases. A clinical trial-based validation was used to generate the positive testing pairs with the whole benchmark data that was used as positive training pairs. Four kinds of category spaces for targets were used for the selection of negative testing pairs. M was set to 10.
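As a quick consistency check, the Task counts listed above can be summed and compared with the 344 Tasks reported for the benchmark. The sketch below is only an arithmetic illustration. The factorizations for Tests 1, 5 and 6 (connectivity spaces times folds, diseases times category spaces) are our reading of the descriptions above rather than something the benchmark code is known to compute this way.

```python
# Task counts per Test, copied from the descriptions above.
TASKS_PER_TEST = {0: 10, 1: 90, 2: 72, 3: 72, 4: 20, 5: 40, 6: 40}

total_tasks = sum(TASKS_PER_TEST.values())
print("Total Tasks:", total_tasks)

# Assumed factorizations (our reading of the text).
test1_tasks = 9 * 10   # nine connectivity spaces x 10 folds
test5_tasks = 10 * 4   # 10 diseases x four category spaces
test6_tasks = 10 * 4   # 10 diseases x four category spaces
print(test1_tasks, test5_tasks, test6_tasks)
```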

Figure 3

Results of the six evaluated methods for Test 0.


Predictive models in evaluation

Based on the data sources used as the input, two types of algorithms were used: network-based methods and structure- and sequence-based methods: (i) Network-based methods are the methods that used any graphical information from the proposed benchmark as the input, which includes multiple types of biomedical entities, such as drugs, targets, diseases, side effects and pathways, and the corresponding information from multipartite (including drug–target bipartite) networks. In practice, we used three state-of-the-art network-based methods: DTINet [21], Bio-Linked Network Embeddings (bioLNE) [64] and NEural integration of neighbOr information for DTI prediction (NeoDTI) [65]. For DTINet and NeoDTI, we used drug–target, drug–disease, protein–disease, drug–side effect, protein–protein, drug–drug interaction as well as drug–drug similarity, and protein–protein similarity matrices as the input data. For bioLNE, we used drug, target, disease, side effect, chromosomal location, drug category, drug group, drug substance, food, module, pathway, variant location, haplotype, disease feature and disease symbol-related assertions. (ii) Structure- and sequence-based methods are the methods that primarily used the chemical structure of drugs and sequence of proteins as the input. The drug chemical structure and gene sequence were collected from Drugbank. In practice, we considered that DeepPurpose [66], DeepDTA [67] and GraphDTA [68] are state-of-the-art methods. For DeepPurpose, we used diverse encoding methods with the default setting, such as Convolutional Neural Network (CNN), Transformer, Daylight, AAC and Morgan for the structure and sequence information. For DeepDTA, we implemented it based on the DeepPurpose framework, in which structure and sequence information were both encoded with CNN. 
For NeoDTI, the parameter ranges for the grid search were specified as dimension of node embedding $d \in \{256, 512, 1024\}$, dimension of the projection matrices $k \in \{256, 512, 1024\}$ and repetition time of neighborhood information aggregation $p \in \{1, 2, 3\}$. For bioLNE, the parameter ranges for the grid search were specified as classification {J48, SVM, Random Forest, Logistic Regression}, Binary Operator {average, hadamard, weighted-L1 and L2}, weights for DBSI and classification from 0.0 to 0.9 in increments of 0.1, and the default parameters of Node2Vec [69]. For GraphDTA and DTINet, the default parameters were used.
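The size of the NeoDTI search space described above can be made concrete by enumerating the grid. The following is a minimal sketch using only the Python standard library. The dictionary keys (`embedding_dim`, `projection_dim`, `aggregation_rounds`) are illustrative names for d, k and p and are not NeoDTI's actual argument names.

```python
from itertools import product

# Parameter ranges quoted in the text; key names are hypothetical.
NEODTI_GRID = {
    "embedding_dim": [256, 512, 1024],      # d
    "projection_dim": [256, 512, 1024],     # k
    "aggregation_rounds": [1, 2, 3],        # p
}


def enumerate_grid(grid):
    """Return every combination of the grid as a list of dictionaries."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


configs = enumerate_grid(NEODTI_GRID)
print(len(configs), "configurations; first:", configs[0])
```

With three values for each of the three parameters, the grid contains 27 configurations, each of which would need to be trained and scored once per Task.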

Evaluation metrics

Three metrics were mainly used to assess the quality of the predictions: area under the receiver operating characteristic curve (AUC ROC) [56, 57], area under the precision–recall curve (PR AUC) and the F1 measure (including Precision and Recall). In practice, we averaged the three metrics, general score $= (\mathrm{AUCROC} + \mathrm{PRAUC} + F1)/3$, to obtain a balanced score to better identify the best-performing method in general. In addition, we also provide Precision, Recall and F1 measures at top k and mean average precision (MAP) at top k search results for Test 4. The AUC ROC, PR AUC and F1 scores were calculated by the ROC JAVA library (https://github.com/kboyd/Roc), the Weka evaluation package [58] and scikit-learn package [70].
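The balanced score can be reproduced with a few lines of code. The sketch below assumes the general score is the arithmetic mean of AUCROC, PRAUC and F1, which is consistent with the values reported for Test 0. The function name `general_score` is ours, and the inputs are the Test 0 averages reported for DeepPurpose (CNN and Transformer) and NeoDTI.

```python
def general_score(auc_roc, pr_auc, f1):
    """Arithmetic mean of the three metrics (all given in percent)."""
    return (auc_roc + pr_auc + f1) / 3.0


# Test 0 averages reported in the Evaluation Results.
deep_purpose = general_score(88.01, 84.44, 79.49)
neodti = general_score(86.52, 83.32, 74.19)

print(f"DeepPurpose: {deep_purpose:.2f}%")
print(f"NeoDTI: {neodti:.2f}%")
```

The two results agree, after rounding to two decimals, with the general scores of 83.98% and 81.34% reported for these methods.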

Evaluation Results

Existing methods were categorized into two distinct categories for the purpose of evaluating our benchmark based on the input data used: (i) network-based and (ii) structure- and sequence-based methods. For network-based methods, DTINet [21], bioLNE [64] and NeoDTI [65] were considered state-of-the-art for comparison. For structure- and sequence-based methods, DeepPurpose [66], DeepDTA [67] and GraphDTA [68] were adopted. We conducted a general evaluation (Test 0) to select the best-performing method as a representative for each category and show their results for the remaining Tasks (Tests 1–6) in the main manuscript. The complete results of all the experiments for all the evaluated methods are shown in the supplements.

Figure 4

Results of internally validated Tests (Tests 1–4) for DeepPurpose and NeoDTI.


Our results (see Figure 3) show that, in general, structure- and sequence-based methods performed better than network-based methods (average AUCROC: 85.72 versus 84.67%, PRAUC: 81.01 versus 83.55%, Precision: 73.00 versus 77.07%, Recall: 84.52 versus 68.24%, F1: 78.23 versus 67.35%, general score: 81.65 versus 78.52%). The best methods for the structure- and sequence-based and the network-based categories are DeepPurpose (CNN and Transformer) (average AUCROC: 88.01%, PRAUC: 84.44%, Precision: 73.40%, Recall: 86.73%, F1: 79.49%, general score: 83.98%) and NeoDTI (average AUCROC: 86.52%, PRAUC: 83.32%, Precision: 71.40%, Recall: 80.78%, F1: 74.19%, general score: 81.34%), respectively. For each metric, we observed that bioLNE performed the best in terms of AUCROC (88.65%), PRAUC (85.73%) and Precision (94.09%) whereas DeepPurpose (CNN and Transformer) performed the best in terms of Recall (86.73%) and F1 (79.49%). Notably, compared with other methods, NeoDTI had a large standard deviation for Precision (9.20%) and Recall (13.05%). For the following Tasks, we considered NeoDTI and DeepPurpose (CNN and Transformer), named DeepPurpose for short, as two representative methods and show their results here (the results for the remaining methods are shown in Supplementary Material).

Internally validated tests

We conducted Tests 1–4, which were internally validated Tests. Regarding Test 1, in general, DeepPurpose (average general score: 78.67%) outperformed NeoDTI (74.09%) in all six Jobs across all 90 Tasks constituting the connectivity space-based Jobs (see Figure 4A). For Scenario 1 (default; i.e. no node becomes isolated if an association is removed), the performance increased when the negatives were sampled from a broader connectivity space, and was best for type AA (e.g. general score of DeepPurpose: 93.00% versus NeoDTI: 83.57% for AA). According to the ‘guilt-by-association’ principle, nodes are more similar when they share more common neighbors [71]. The tested methods performed excellently when the positive and negative pairs were topologically distinct (e.g. AA) but performed much worse when they were more topologically connected (e.g. general score of DeepPurpose: 73.76% versus NeoDTI: 59.12% for TT).

For the cold start problem, methods performed the best in Scenario 1 (default) (AUCROC: 85.30%, PRAUC: 84.31%, Precision: 72.91%, Recall: 84.15%, F1: 75.98%, general score: 81.86%), followed by Scenario 2 (semi-cold start) (AUCROC: 77.75%, PRAUC: 77.87%, Precision: 74.48%, Recall: 66.70%, F1: 68.38%, general score: 74.67%), in which one node became isolated if an association was removed, and finally Scenario 3 (cold start) (AUCROC: 54.29%, PRAUC: 57.20%, Precision: 53.28%, Recall: 24.24%, F1: 29.21%, general score: 46.90%), in which both nodes became isolated if the association was removed. The poor performance in the cold start setting indicates the need for predictive models that can handle new drugs and new targets. In particular, we learned that, compared with other Jobs, DI had large standard deviations for the other evaluated methods, as shown in Supplementary Figure 2 (see Supplementary Data available online at https://dbpia.nl.go.kr/bib) (e.g. AUCROC: 8.11%, PRAUC: 6.96%, Precision: 23.04%, Recall: 21.00%, F1: 17.69%, general score: 9.36%).

Figure 5

Results of externally validated Tests (Tests 5–6) for DeepPurpose and NeoDTI.


Regarding Tests 2 and 3, in general, DeepPurpose (average general score: 82.69% for drug category-based Jobs and 83.05% for target category-based Jobs) outperformed NeoDTI (78.83% for drug category-based Jobs and 81.47% for target category-based Jobs) across all four category spaces and four categories covering the 144 Tasks (see Figure 4B, C). Although DeepPurpose was selected as a representative model, bioLNE (83.84% for drug category-based Jobs and 84.46% for target category-based Jobs) was the best performing model (see Supplementary Tables 1 and 2, see Supplementary Data available online at https://dbpia.nl.go.kr/bib). Regarding the category space-based Jobs, TRE is the best performing (85.16% for drug category-based Jobs and 87.49% for target category-based Jobs), and TEO is the worst performing (73.72% for drug category-based Jobs and 71.11% for target category-based Jobs), which was consistent with the design of the evaluation—it was more difficult to separate the positive and negative drug–target pairs if the drugs and targets were similar in the two pairs. We also noticed that different categories had similar prediction results for drug (e.g. ‘drugbank category’: 78.82% and ‘linkpl class’: 78.84% for NeoDTI, ‘drugbank category’: 81.42% and ‘linkpl class’: 83.96% for DeepPurpose) and target categories (e.g. ‘family’: 81.33% and ‘protein class’: 81.60% for NeoDTI, ‘family’: 83.50% and ‘protein class’: 82.59% for DeepPurpose). The complete results for category-based and connectivity space-based Jobs of all the evaluated methods can be found in Supplementary Tables 1 and 2 (see Supplementary Data available online at https://dbpia.nl.go.kr/bib).

For Test 4 (see Figure 4D), DeepPurpose outperformed NeoDTI (MAP@10: 8.57% for DeepPurpose versus 4.86% for NeoDTI, MAP@20: 9.96% versus 5.06%, MAP@50: 11.13% versus 6.32%, MAP@100: 12.26% versus 7.10%). Among all the methods (see Supplementary Figure 3, see Supplementary Data available online at https://dbpia.nl.go.kr/bib), we noticed that bioLNE performed the best (MAP@10: 23.37%, MAP@20: 53.68%, MAP@50: 28.30%, MAP@100: 16.76%). Regarding the queries, we learned that the target queries performed better than the drug queries (MAP@10: 11.35% for targets versus 4.72% for drugs, MAP@20: 21.26% versus 14.87%, MAP@50: 17.78% versus 9.62%, MAP@100: 16.72% versus 7.75%). Test 4 also suggested that the best k can be found among the top 20 to top 50 (MAP@20: 18.07% and MAP@50: 13.70%).
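For readers unfamiliar with MAP@k, the sketch below shows one common way to compute it for ranked search results. Several variants of average precision at k exist and the text does not state which one the benchmark uses, so the normalization by min(k, number of relevant items) is an assumption. The queries and identifiers are toy examples.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k ranked items for one query.

    Normalized by min(k, number of relevant items); other
    normalizations are also used in practice.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(queries, k):
    """Mean of the per-query average precision at k."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Toy queries: (ranked candidates, set of true partners).
queries = [
    (["t3", "t1", "t7", "t2"], {"t1", "t2"}),  # search targets for a drug
    (["d1", "d2"], {"d1"}),                    # search drugs for a target
]
map_at_4 = mean_average_precision_at_k(queries, 4)
print(f"MAP@4 = {map_at_4:.2f}")
```

In the first query the true targets appear at ranks 2 and 4, giving an average precision of 0.5. In the second the true drug is ranked first, giving 1.0. The mean over both queries is 0.75.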

Externally validated tests

We conducted two types of externally validated Tests based on versioning (Test 5) and trials (Test 6). Regarding Test 5, despite the great performance achieved by DeepPurpose in the previous Tasks, NeoDTI (average general score: 83.10%) achieved slightly better performance compared with DeepPurpose (average general score: 77.79%) across the 40 Tasks constituting the versioning-based Jobs. Similar to the internal validation conducted, TRE performed the best (92.81% for NeoDTI and 82.98% for DeepPurpose), and TEO performed worst comparatively (74.24% for NeoDTI and 71.34% for DeepPurpose) (see Supplementary Figure 4A, see Supplementary Data available online at https://dbpia.nl.go.kr/bib).

Among the 10 diseases of interest, breast cancer had the best predictive performance (92.56% for NeoDTI and 90.59% for DeepPurpose) and obesity had the worst predictive performance (69.32% for NeoDTI and 71.63% for DeepPurpose). There are four diseases for which both NeoDTI and DeepPurpose performed well: breast cancer (91.41% for NeoDTI and 87.32% for DeepPurpose), spinal muscular atrophy (92.56% for NeoDTI and 83.70% for DeepPurpose), obsessive–compulsive disorder (83.01% for NeoDTI and 81.13% for DeepPurpose), and insomnia (82.50% for NeoDTI and 80.23% for DeepPurpose). Moreover, the results show that NeoDTI (88.2% for HDL_cholesterol_level_QTL and 85.61% for migraine) could be a good complementary tool for DeepPurpose.

Regarding Test 6, the performance of DeepPurpose and NeoDTI was worse than in the previous Tasks, and NeoDTI (average general score: 68.24%) outperformed DeepPurpose (average general score: 63.29%) across the 40 Tasks constituting the trial-based Jobs. Similarly, among the four types, TRE performed the best (77.08% for NeoDTI and 63.38% for DeepPurpose) whereas TEO performed the worst (59.44% for NeoDTI and 57.53% for DeepPurpose) (see Supplementary Figure 4B, see Supplementary Data available online at https://dbpia.nl.go.kr/bib).

Among the 10 diseases, myocardial infarction had the best predictive performance (average general score: 97.2% for NeoDTI and average general score: 93.75% for DeepPurpose) whereas Alzheimer’s disease had the worst predictive performance (55.80% for NeoDTI and 43.67% for DeepPurpose). We also learned that NeoDTI (e.g. 84.58% for obesity) could be a good complementary tool for DeepPurpose, and vice versa for DeepPurpose (e.g. 86.77% for leukemia) in some cases. For the details of other methods for Tests 5 and 6, please refer to Supplementary Figure 5 (see Supplementary Data available online at https://dbpia.nl.go.kr/bib).

Analysis of failures in prediction

Investigating the predictions for Tests 5 and 6 helped identify situations in which existing methods could not perform well. We chose the best-performing (spinal muscular atrophy) and worst-performing disease (obesity) for NeoDTI in Test 5. We found that although NeoDTI successfully distinguished the positives from negatives in training data, it lacked the ability to predict when the targets were within the category (e.g. TEO and TRO). The results show that NeoDTI tends to predict all the associations related to a target as either positive or negative, which may result in good performance when the labels are all positive or all negative (see be0000132 in TEO for spinal muscular atrophy in Figure 6A). However, it may result in poor performance if both positives and negatives are among the testing pairs (see be0000215 in TEO and TRO for obesity in Figure 6B). Although NeoDTI performed well for some diseases, we found it cannot properly repurpose drugs for a given protein despite having the potential to predict the protein with a given drug (see TEE and TRE). A similar observation of myocardial infarction (best performing) versus HIV (worst) for a trial-based Job (Test 6) can be found in Supplementary Figure 6 (see Supplementary Data available online at https://dbpia.nl.go.kr/bib).

Figure 6

Best- and worst-performing predictions in Test 5 for NeoDTI.


Discussion

To design the benchmark, we extracted various biomedical associations from existing publicly accessible databases and knowledge bases and designed various evaluation Tasks to evaluate the ability of the predictive models to separate the positive and negative drug–target associations with similar drugs or targets defined for different purposes. We hope our work will provide a standardized and comprehensive way to evaluate the existing models, as well as substantial information (e.g. features and associations of/among biomedical entities) to facilitate the selection of the most suitable predictive models in the real-world developmental process, that is, the prediction of drug–target associations given a pair of drugs and targets. In this way, we aim to lay the foundation necessary for the successful development of robust computational drug–target prediction methods, similar to how the Text REtrieval Conference [72] has contributed to information retrieval and ImageNet [73] has contributed to visual recognition. Please note that our benchmark is designed for the methods requiring a pair of inputs and is not applicable to the methods similar to (Q)SAR [74] where only chemical structures are required.

The main advantages of our proposed benchmark include the creation of (i) a large-scale multidimensional network for prediction consisting of an extensive set of biomedical entities and diverse types of associations, and (ii) diverse prediction use-cases in which drugs and targets utilized for training and testing can be selected based on the difference of topological connectivity (connected versus isolated) or biomedical categories (e.g. drug category versus protein family). The datasets and evaluation Tasks are provided as off-the-shelf resources that can be easily adopted as a well-documented resource for drug development activities and scientific experiments. Although we have incorporated 11 repositories, more endeavors are needed to incorporate important repositories into the benchmark. The addition of datasets to our benchmark can be simply implemented by mapping to the common node space (i.e. the drugs and targets from Drugbank and the diseases from OMIM). Currently, we are planning to incorporate more repositories (e.g. SymMap [75]) when their data are publicly accessible (see https://github.com/bioIKEA/IKEA_BETA_Benchmark/tree/master/data_space/output/datasets/pending for the pending repositories).

With our experiment, we found two limitations/biases of existing state-of-the-art methods. Firstly, they are incapable of handling new drugs and new targets (e.g. the poor performance of SS, SU and DI for Test 1), and they are incapable of properly repurposing drugs for a given protein (refer to be0000215 for obesity). As the discovery of the target-defined antineoplastic compounds is considered a more effective way for drug development (e.g. in cancers [76]), it drives researchers to develop novel repurposing methods to fill these needs. The proposed benchmark will provide a fine assessment of the effectiveness of the drug repurposing methods. Secondly, both DeepPurpose [66] and NeoDTI [65] perform worse when the training and testing nodes share more connections (e.g. the performance of CC > TA > TC > TT; SU > SS > DI). Although it was expected for a network-based method, such as NeoDTI, to be affected by the connectivity of the drugs and targets, it is a novel bias found for DeepPurpose. It, in a sense, demonstrated an indirect connection between topology-, structure- and sequence-based features for drugs and proteins in drug development contexts [28, 77, 78]. Although we addressed the bias caused by characteristics of data (e.g. topological structure), some biases in the practice of drug development (e.g. investigative bias) were not addressed. More evaluation Tasks are needed to complete the investigation of the biases.

The evaluation Tasks in the benchmark have relied on the drug–target associations provided in Drugbank, which is considered as a ground truth. The drug–target associations in Drugbank were collected, curated and validated with multiple sources (e.g. PubMed, Therapeutic Target Database (TTD), Food and Drug Administration (FDA) labels, RxList, PharmGKB, textbooks) [26] and were used widely in drug–target prediction tasks. In our study, we simply trusted the existing knowledge bases to build our benchmark. Although we consider our contribution as the development of the benchmark, it is also important to keep updated on the associations that are trustworthy and supported by the experimental screen. On the other hand, the proposed benchmark was designed to provide heterogeneous biomedical information despite recognizing that it is a challenge for such information to be processed. To represent the knowledge graph, we used an undirected multidimensional network (i.e. edge types can be various but multiple and directed edges between two nodes are not allowed). We only kept one theme for each repository (e.g. only one type of association between the nodes), and used owl:sameAs to map similar concepts between the different repositories such that they can be linked with differing associations (e.g. reversed or conflict associations). For example, through the mappings between the disease entity ‘myocardial infarction’ (Diseasome: 3281) in Diseasome to the Omim (Omim:608446), and the side effect entity in Sider (Sider: C0027051) to the Omim (Omim:608446), drug–disease associations and drug–side effects associations can be utilized simultaneously. In our study, apart from the drug–target associations, the evaluation only covers the usage of partial associations (e.g. drug–drug and gene–gene similarity, drug–side effect, drug–drug and protein–protein interaction, drug–disease and protein–disease used in NeoDTI). 
The current benchmark is intended to assess the performance of drug–target prediction, and the heterogeneous datasets are designed to facilitate the computation but are not requisite. The usage of the other associations needs to be considered in the construction of the benchmark (e.g. the effect of removal of associations for network-based methods) or in other prediction tasks (e.g. prediction of protein–protein interaction).

Our study only incorporated two types of methods: structure- and sequence-based and network-based methods, in our evaluation, as they utilized two distinct forms of input data—graphical structures of biomedical knowledge (including drug-target associations) and chemical structure and gene sequences. The two types were considered ideal representatives to demonstrate how our benchmark can provide input data to facilitate the computation. Although other types (e.g. matrix factorization, similarity/distance-based, feature-based and hybrid methods [14]) could not be tested in this study, they can be swapped as needed for individual evaluation use cases so long as the data to be evaluated on can be made into one of the two evaluated forms. The selection of six methods evaluated in our benchmark is based on three criteria: (i) methods of drug–target prediction published in the top-tier journals for computational biology from 2017; (ii) methods with open access source code and (iii) methods that are feasible for processing large training data. Although there are recently published excellent works (e.g. CoVex [79], a Bayesian ANalysis to determine Drug Interaction Targets (BANDIT) [80] and network-based proximity [54]), they cannot be implemented/included in the experiments due to the limited resources. Regarding the evaluation metrics, both AUCROC and PRAUC were not cut-off line sensitive (i.e. a threshold score was used to separate the positive and negative pairs for a prediction), whereas the F1 measure was. A debate exists on how a prediction should be given in real practice (e.g. probability estimation versus binary decision): Although the probability estimation may give a researcher more flexibility to set up a cut-off line, it is also challenging to determine a suitable cut-off line in practice [14]. Depending on the users’ preferences, further customization of the benchmarking evaluation may be necessary. 
Although the proposed benchmark is only designed for evaluating the effectiveness of prediction, we recognize that efficiency is also critical for the selection of a suitable method in the real-world screening of a large number of drug–target pairs. In general, the evaluation Tasks are scalable for the evaluated methods since the running time scales linearly with the number of Tasks for each main Test (Supplementary Figure 7, see Supplementary Data available online at https://dbpia.nl.go.kr/bib). As each Task is independent, in practice, a parallel computation can be applied for each of the evaluated methods, allowing horizontal scaling as far as the resources permit. On the other hand, although we set 500 drugs or targets in our search tasks (Test 4), a larger search space is required in practice. Therefore, stress testing [81] is needed to assess how stably a method can complete screening on a full combination of a large number of drug and target sets.

Key Points

We have proposed BETA, a large-scale benchmark that enables a comprehensive evaluation of drug-target predictive models to facilitate a selection of computational strategies for drug and target prescreening.
BETA provides an extensive multipartite network consisting of 0.97 million biomedical concepts and 8.5 million associations, in addition to 62 million drug–drug and protein–protein similarities.
BETA provides evaluation strategies that reflect five purposes with a total of seven Tests with 344 Tasks across multiple sampling and validation strategies.
Six state-of-the-art methods covering two broad method types (chemical structure- and gene sequence-based and network-based) were evaluated with the developed Tasks across multiple Jobs (screening with different levels of connectivity, target/drug screening when drugs/targets are within/beyond category and drug repurposing for a specific disease).

Availability

The proposed method is available, along with the data and source code, at the following URL: https://github.com/bioIKEA/IKEA_BETA_Benchmark.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability

The code for data integration and the benchmark generation is available at https://github.com/bioIKEA/IKEA_BETA_Benchmark.

Funding

National Institutes of Health (NIH) NIGMS (K99GM135488).

Author Biographies

Nansu Zong is an assistant professor at the Department of Artificial Intelligence and Informatics Research, Mayo Clinic. He works on computational drug development based on knowledge base and deep learning algorithms.

Ning Li is a research fellow at the Center for Structural Biology (CSB) of the National Cancer Institute, NIH. He works on the structural and functional study of protein kinase A involving the method of X-ray crystallography and cryo-EM.

Andrew Wen is a bioinformatician from Mayo Clinic. He is interested in utilizing informatics tools to build diverse applications in the health care area. He is an expert on natural language processing (NLP).

Victoria Ngo is a postdoctoral research fellow at VA Palo Alto Health System and Stanford Health Policy. Ngo is a health informaticist, and her research focuses on health equity and the optimization of information technology to improve the delivery and coordination of care in the community.

Yue Yu is a bioinformatician from Mayo Clinic, who is mainly working in the medical data standardization field. Yu is also interested in using artificial intelligence methods to solve biomedical problems.

Ming Huang is an assistant professor in the Department of AI and Informatics at Mayo Clinic. He is an expert in topic modeling and deep learning.

Shaika Chowdhury is a research fellow with the Mayo Clinic AI & Informatics, who studies deep learning-based precision medicine. Chowdhury is interested in utilizing knowledge graphs to improve the performance of deep learning models.

Chao Jiang is a PhD student at Auburn University. He works on diverse deep learning models and is particularly focused on graph neural networks.

Sunyang Fu is a senior data science analyst and biomedical informatics researcher at the Mayo Clinic. His research focuses on (i) designing and validating NLP techniques for clinical information extraction, (ii) developing informatics frameworks and processes to accelerate the secondary use of electronic health records (EHRs) for clinical research and (iii) discovering EHR heterogeneity and information quality through quantitative and qualitative methods.

Richard Weinshilboum is a professor at the Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic. He studies pharmacogenomics—the role of inheritance and individual variation in DNA sequence or structure in drug response.

Guoqian Jiang is a professor at the Department of Artificial Intelligence and Informatics Research, Mayo Clinic. He researches biomedical terminologies and ontologies, data standards, common data elements and common data models for clinical studies.

Lawrence Hunter is a professor of pharmacology and computer science at the University of Colorado. He focuses on the knowledge-driven extraction of information from the primary biomedical literature, the semantic integration of knowledge resources in molecular biology and the use of knowledge in the analysis of high-throughput data.

Hongfang Liu is a professor at the Department of Artificial Intelligence and Informatics Research, Mayo Clinic. The primary research focus of Hongfang Liu is to facilitate the secondary use of clinical data for clinical and translational science research and health care delivery improvement using data science, artificial intelligence and informatics approaches.

Mayo Clinic is a charitable, nonprofit academic medical center that provides comprehensive patient care and education in clinical medicine and medical sciences as well as extensive programs in research. Mayo Clinic includes Mayo Medical School, Mayo Graduate School, Mayo School of Graduate Medical Education, Mayo School of Continuous Professional Development and Mayo School of Health Sciences.

References

Santos, Ursu, Gaulton, et al. A comprehensive map of molecular drug targets. Nat Rev Drug Discov 2017.
