Evidential meta-model for molecular property prediction

Table 1.

Summary of datasets.

Dataset	Tox21	SIDER	MUV
Molecules	7831	1427	93087
Tasks	12	27	17
Meta-training tasks	9	21	12
Meta-testing tasks	3	6	5
Positive/negative label ratio	0.08	0.76	0.002

The three datasets represent molecules as SMILES strings. We converted the SMILES strings into molecular graphs using the RDKit Chem package (Landrum et al. 2013) to generate the input. For all our experiments, the Tox21 dataset was divided into nine training and three test tasks, the SIDER dataset into 21 training and six test tasks, and the MUV dataset into 12 training and five test tasks. For each task, we randomly preselected molecules to form the support and query sets.
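The support/query selection for a single task can be illustrated with a minimal sketch. It assumes the common few-shot convention in which a K-shot support set holds K molecules per class and the query set is drawn from the remaining molecules; the function name `sample_episode` and its parameters are hypothetical and are not taken from our implementation.

```python
import random


def sample_episode(labels, k_shot, n_query, seed=0):
    """Split one task's molecule indices into a support set and a query set.

    Assumption: binary labels (1 = positive, 0 = negative) and a support set
    with k_shot molecules per class; the query set is sampled from the rest.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(pos) < k_shot or len(neg) < k_shot:
        raise ValueError("not enough molecules of each class for the support set")

    # K positives and K negatives form the support set.
    support = rng.sample(pos, k_shot) + rng.sample(neg, k_shot)

    # Query molecules are drawn from whatever was not used for support.
    chosen = set(support)
    remaining = [i for i in range(len(labels)) if i not in chosen]
    query = rng.sample(remaining, min(n_query, len(remaining)))
    return support, query


if __name__ == "__main__":
    toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    support, query = sample_episode(toy_labels, k_shot=1, n_query=4)
    print("support:", support, "query:", query)
```

Fixing the seed keeps the preselected sets identical across compared methods, which is one way to make the comparison repeatable.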

Compared Methods. We compare our EM3P2 with molecular property prediction (MPP) methods based on meta-learning or metric-learning frameworks.

MAML (Finn et al. 2017) is a model-agnostic algorithm for few-shot meta-learning, in which model parameters are trained so that they adapt to a new task from a small number of samples. We used GIN and an MLP as the base learner for MAML.
Pre-GNN (Hu et al. 2020) is the MAML model paired with a GIN that uses self-supervised learning to capture local and global information in graph data.
Pre-PAR (Wang et al. 2021) is similar to the MAML model, but adds a step that generates a relation graph between molecules from the property-aware embeddings of the support and query molecules.
SiameseNet (Koch et al. 2015) is a model based on the Siamese network, a metric-based few-shot learning algorithm. We used GIN for molecular representation and measured the similarity between two inputs using the cosine distance. The relationship is considered positive if both input molecules are positive, and negative otherwise.
Protonet (Snell et al. 2017, Crisostomi et al. 2022) is a model based on the prototypical network, which is also a metric-based few-shot learning algorithm. We used GIN for molecular representation and the cosine distance to measure the distance between the prototypes and the query sample.
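As an illustration of the metric-based baselines, the following sketch classifies a query embedding by its cosine similarity to class prototypes, where each prototype is the mean of the support embeddings of one class. The embeddings here are plain lists standing in for GIN outputs, and all function names are hypothetical; this is not the re-implementation used in our experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def class_prototypes(embeddings, labels):
    """Average the support embeddings of each class into one prototype."""
    groups = {}
    for vec, y in zip(embeddings, labels):
        groups.setdefault(y, []).append(vec)
    return {
        y: [sum(col) / len(vecs) for col in zip(*vecs)]
        for y, vecs in groups.items()
    }


def predict(query, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(prototypes, key=lambda y: cosine_similarity(query, prototypes[y]))


if __name__ == "__main__":
    support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    support_labels = [1, 1, 0, 0]
    protos = class_prototypes(support, support_labels)
    print(predict([0.8, 0.2], protos), predict([0.2, 0.8], protos))
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so the nearest prototype under either convention is the same.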

Reproducibility Setting. The experiments were performed on an NVIDIA GeForce GTX 1080 Ti GPU with the following implementation details. MAML was implemented using the learn2learn library. We used the authors' code for Pre-GNN and Pre-PAR. Both metric-based methods, i.e. SiameseNet and Protonet, were re-implemented. We trained the models using the default settings. For our EM3P2, we used the pretrained GIN of Pre-GNN and implemented the rest with reference to ENN and Pre-GNN, using the PyTorch and PyTorch Geometric libraries.

3.1 Comparison with the state-of-the-art

We compared the prediction performance of our EM3P2 (no uncertainty threshold) with three existing meta-learning-based MPP methods and two metric-based MPP methods using the area under the receiver operating characteristic curve (ROC-AUC) (Table 2). Overall, our EM3P2 performed best among the compared methods in both the 1-shot and 10-shot cases, except for the 1-shot case on the MUV dataset. We also note that the results for the SIDER dataset are consistently lower for all methods. We suspect that this is due to the categorization of side effects into 27 types, with different granularities and guidelines for each group, as well as the mix of positively and negatively biased tasks. Prediction performance on the MUV dataset was also low, which we attribute to the extremely unbalanced nature of the dataset (a positive-to-negative label ratio of 0.002). We will show later that using the uncertainty threshold improves the performance of our EM3P2.
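For readers less familiar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive molecule receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` is hypothetical, and this quadratic-time version is meant for illustration rather than for large datasets such as MUV.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth labels.
    scores: iterable of predicted scores for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because the value depends only on how positives are ranked relative to negatives, it can stay moderate even when a model rarely predicts the minority label, which is why precision is reported alongside it in the ablation study.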

Table 2.

ROC-AUC values of our EM3P2 and compared methods.

Dataset	Tox21		SIDER		MUV
Method	1-shot	10-shot	1-shot	10-shot	1-shot	10-shot
MAML	0.621	0.662	0.648	0.648	0.553	0.598
Pre-GNN	0.770	0.767	0.694	0.694	0.628	0.602
Pre-PAR	0.778	0.799	0.691	0.726	0.682	0.642
SiameseNet	0.768	0.773	0.643	0.643	0.713	0.651
Protonet	0.542	0.787	0.567	0.718	0.631	0.662
EM3P2^a	0.833	0.834	0.792	0.794	0.637	0.695

^a Our default proposed model without an uncertainty threshold.

The best performances are in bold face numbers.

3.2 Ablation study

Table 3 shows the ROC-AUC and precision values for each variant of our EM3P2. The variants are combinations of three factors: whether query balancing (QB) or random sampling is used, whether the belief regularizer (BR) is used, and whether the accuracy versus uncertainty curve regularizer (AvUC) is used. We observed that each of these factors contributes to the overall performance of our EM3P2. Overall, the use of QB improves precision by at least 23 percentage points. The combination of QB and BR improves the precision further and achieves the best precision values. In terms of ROC-AUC, the use of all factors resulted in the best performance for the Tox21 and SIDER datasets. For the MUV dataset, due to its extreme bias, the ROC-AUC values were all similar, with a difference within 0.008 when using any combination of the three constraints. This can be mitigated by training weights for BR and AvUC. However, for our current work, we have fixed the lambda values as described in section. Our EM3P2 combines all three factors in the default setting.
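The precision reported for the minority label can be sketched as follows. The helper name `minority_precision` is hypothetical, and returning 0.0 when the model never predicts the minority label is a convention assumed here for illustration; under that convention, a model that predicts only the majority label on a heavily imbalanced task scores 0%.

```python
from collections import Counter


def minority_precision(y_true, y_pred):
    """Precision for the less frequent label in y_true.

    Precision = (correct minority predictions) / (all minority predictions).
    Assumed convention: 0.0 when the minority label is never predicted.
    """
    counts = Counter(y_true)
    minority = min(counts, key=counts.get)

    # True labels of every molecule that was predicted as the minority class.
    predicted_minority = [t for t, p in zip(y_true, y_pred) if p == minority]
    if not predicted_minority:
        return 0.0
    hits = sum(1 for t in predicted_minority if t == minority)
    return hits / len(predicted_minority)


if __name__ == "__main__":
    y_true = [0] * 8 + [1] * 2
    always_negative = [0] * 10
    mixed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
    print(minority_precision(y_true, always_negative))
    print(minority_precision(y_true, mixed))
```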

Table 3.

Ablation studies of EM3P2 reporting ROC-AUC (floating point) and precision (percentage).^a,b

QB	BR	AvUC	Tox21		SIDER		MUV
			1-shot	10-shot	1-shot	10-shot	1-shot	10-shot
✗	✗	✗	0.815	0.821	0.733	0.720	0.630	0.552
			11%	8%	41%	43%	0%	0%
✓	✗	✗	0.816	0.817	0.698	0.698	0.628	0.638
			62%	54%	73%	68%	23%	30%
✓	✓	✗	0.832	0.833	0.776	0.783	0.629	0.631
			64%	60%	76%	70%	29%	40%
✓	✓	✓	0.833	0.834	0.792	0.794	0.629	0.630
			53%	50%	74%	69%	28%	35%

^a QB, query balancing in meta-training; BR, belief regularizer; AvUC, accuracy versus uncertainty loss regularizer.

^b The precision values are calculated for the minority labels.

The best performances are in bold face numbers.

3.3 Uncertainty threshold

Figure 2 shows changes in accuracy values with respect to uncertainty. We can see that, with the exception of Task 23 (T23) in SIDER, the test tasks in the Tox21, SIDER, and MUV datasets increase in accuracy as uncertainty decreases. For SIDER task 23, our learned meta-model was not able to adapt properly. We suspect two reasons. First, task 23 is the side effect category for Pregnancy, puerperium & perinatal conditions, which is a significantly different categorization from the majority of other tasks that map side effects to system organ classes. This results in overall low accuracy compared to other test tasks. Second, task 23 is positively biased (1302:125) unlike typical tasks that are usually negatively biased. In fact, accuracy increased after a steep decline with increasing uncertainty when we trained our EM3P2 with balanced or positively biased tasks and tested the model with the negatively biased tasks. Additional results for the SIDER dataset can be found in the Supplementary Materials.

Figure 2.

Accuracy according to uncertainty threshold for 10-shot cases.

With the calibrated accuracy and uncertainty values, our trained EM3P2 was able to answer "I don't know" for predictions with uncertainty values higher than the threshold. That is, we used the uncertainty value to decide whether or not to accept each prediction. Table 4 shows the average accuracy for different uncertainty thresholds (ut) in the 1- and 10-shot settings.
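The thresholding step can be sketched as below. This section does not restate how the uncertainty is computed, so the sketch assumes the subjective-logic formulation of evidential deep learning (Sensoy et al. 2018): for K classes with non-negative evidence e_k, the Dirichlet strength is S = sum(e_k + 1), the uncertainty is u = K / S, and the expected class probability is (e_k + 1) / S. The function names and the toy evidence values are hypothetical.

```python
def evidential_prediction(evidence, threshold):
    """Return (predicted class index or None, uncertainty).

    evidence: non-negative evidence per class, e.g. [negative, positive].
    A prediction is rejected ("I don't know") when uncertainty > threshold.
    """
    k = len(evidence)
    strength = sum(evidence) + k              # S = sum(e_k + 1)
    uncertainty = k / strength                # u = K / S
    probs = [(e + 1.0) / strength for e in evidence]
    if uncertainty > threshold:
        return None, uncertainty              # abstain
    return probs.index(max(probs)), uncertainty


def selective_accuracy(evidences, labels, threshold):
    """Accuracy over accepted predictions only, plus the number accepted."""
    correct = 0
    accepted = 0
    for ev, y in zip(evidences, labels):
        pred, _ = evidential_prediction(ev, threshold)
        if pred is None:
            continue
        accepted += 1
        correct += int(pred == y)
    accuracy = correct / accepted if accepted else None
    return accuracy, accepted


if __name__ == "__main__":
    toy_evidence = [[9.0, 1.0], [1.0, 9.0], [0.5, 0.5], [8.0, 0.0]]
    toy_labels = [0, 0, 1, 0]
    print(selective_accuracy(toy_evidence, toy_labels, threshold=0.3))
    print(selective_accuracy(toy_evidence, toy_labels, threshold=1.0))
```

Lowering the threshold accepts fewer predictions, so accuracy over the accepted subset should be read together with how many predictions remain.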

Table 4.

Accuracy of EM3P2 with uncertainty threshold.

Dataset	Tox21		SIDER		MUV
Method	1-shot	10-shot	1-shot	10-shot	1-shot	10-shot
EM3P2 $ut \leq 0.1$	93.5	92.1	84.5	86.8	97.4	97.7
EM3P2 $ut \leq 0.2$	90.6	89.6	80.2	82.1	92.9	92.4
EM3P2 $ut \leq 0.3$	88.7	87.4	75.3	79.0	89.8	88.7
EM3P2 $ut \leq 0.4$	87.3	86.2	72.9	75.6	87.6	85.9
EM3P2 no threshold	82.1	81.6	64.3	66.1	79.8	76.6

The best performances are in bold face numbers.

3.4 Comparison of EMLP and MLP empirical results

To show that quantified evidence provides better confidence, we looked at individual test cases in detail. Here, we compared individual results when only multi-layer perceptrons (MLPs) were used for classification and when evidential multi-layer perceptrons (EMLPs) were used, as in our default setting. Table 5 shows the prediction results for the captafol molecule in the Tox21 dataset. The true labels for the test tasks were all positive. However, the model using MLP classification predicted negative with a high class probability, while our EM3P2 using EMLP returned high uncertainty values and predicted "I don't know" (?). More empirical results for the SIDER and MUV datasets are provided in the Supplementary Materials.

Table 5.

Detailed test task result of captafol in Tox21 dataset.

	Measure	Task10	Task11	Task12
EM3P2	+ Evidence	0.382	0.283	0.255
	Uncertainty	0.672	0.717	0.745
	Prediction	?	?	?
EM3P2 (MLP)	– Class prob.	0.819	0.816	0.815
	Prediction	–	–	–

4 Conclusion

We have proposed an Evidential Meta-model for Molecular Property Prediction (EM3P2) method for evidence-aware molecular property prediction (MPP). This method is capable of learning from few and often imbalanced data. EM3P2 is a novel uncertainty-aware few-shot meta-learning model for MPP that provides uncertainty estimates for each model prediction. It adapts well to novel tasks with limited labeled data and is less sensitive to data imbalance. In addition to having prediction performance comparable to other few-shot MPP models, it allows us to use the uncertainty threshold to obtain better and more confident predictions. Overall, the proposed EM3P2 model and our analysis of uncertainty estimation on MPP datasets provide valuable insights and advances in addressing the challenges of data imbalance and reliability in MPP. Our EM3P2 shows the potential of MPP models to accelerate the identification of potential drug candidates and to improve reliability in sensitive areas.

Conflict of interest

None declared.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) [No.2022-0-00369]; and by the National Research Foundation of Korea Grant funded by the Korean government [2018R1A5A1060031, 2022R1F1A1065664].

Data availability

The data underlying this article were accessed from MoleculeNet, https://moleculenet.org/.

References

Altae-Tran, Ramsundar, Pappu et al. Low data drug discovery with one-shot learning. ACS Cent Sci 2017;283–.

Bao, Kong. Evidential deep learning for open set action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada. IEEE, 2021, 13329–13338.

Crisostomi, Antonelli, Maiorca et al. Metric based few-shot graph classification. In: Proceedings of the First Learning on Graphs Conference, Virtual. PMLR, 2022;198(33):–.

Dempster AP. A Generalization of Bayesian Inference. In: Yager, R.R., Liu, L. (eds) Classic Works of the Dempster-Shafer Theory of Belief Functions. Studies in Fuzziness and Soft Computing, vol 219. Springer, 2008, –104.

Finn, Abbeel, Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, Sydney, Australia. PMLR, 2017, 1126–11.

Guo, Pleiss, Sun. On calibration of modern neural networks. In: International Conference on Machine Learning, Sydney, Australia. PMLR, 2017, 1321–13.

Guo, Zhang et al. Few-shot graph learning for molecular property prediction. In: Proceedings of the Web Conference 2021, Ljubljana, Slovenia. ACM, 2021, 2559–25.

Ham, Yoon, Sael. Towards accurate and certain molecular properties prediction. In: The 13th International Conference on ICT Convergence, Jeju Island, Republic of Korea. IEEE, 2022, 1621–162.

Hospedales, Antoniou, Micaelli et al. Meta-learning in neural networks: a survey. IEEE Trans Pattern Anal Mach Intell 2021;5149–51.

Liu, Gomes et al. Strategies for pre-training graph neural networks. In: The International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia. IEEE, 2020.

Jiang, Zhang, Zhao et al. MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction. Bioinformatics 2022;4573–.

Jøsang. Subjective Logic. Artificial Intelligence: Foundations, Theory, and Algorithms. Cham: Springer International Publishing, 2016.

Koch, Zemel, Salakhutdinov et al. Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, Vol. 2, Lille, France: JMLR, 2015.

Krishnan, Tickoo. Improving model calibration with accuracy versus uncertainty optimization. Adv Neural Inf Process Syst 2020;18237–.

Kuhn, Letunic, Jensen et al. The SIDER database of drugs and side effects. Nucleic Acids Res 2016;D1075–.

Landrum. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum, 2013.

Pandey. Multidimensional belief quantification for label-efficient meta-learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA. IEEE, 2022, 14371–.

Sensoy, Kaplan, Kandemir. Evidential deep learning to quantify classification uncertainty. Adv Neural Inf Process Syst 2018.

Snell, Swersky, Zemel. Prototypical networks for few-shot learning. Adv Neural Inf Process Syst 2017.

Wang, Abuduweili, Yao et al. Property-aware relation networks for few-shot molecular property prediction. Adv Neural Inf Process Syst 2021;17441–.

Wang, Liu, Luo et al. Advanced graph and sequence neural networks for molecular property prediction and drug discovery. Bioinformatics 2022;2579–.

Wieder, Kohlbacher, Kuenemann et al. A compact review of molecular property prediction with graph neural networks. Drug Discov Today Technol 2020.

Ramsundar, Feinberg et al. Moleculenet: a benchmark for molecular machine learning. Chem Sci 2018;513–.

Leskovec et al. How powerful are graph neural networks? In: The International Conference on Learning Representations (ICLR), New Orleans, LA, USA. IEEE, 2019.

Zang, Zhao, Tang. Hierarchical molecular graph self-supervised learning for property prediction. Commun Chem 2023.

Zhang, Liu, Wang et al. Motif-based graph self-supervised learning for molecular property prediction. Adv Neural Inf Process Syst 2021;15870–.