Abstract

Motivation

The usefulness of supervised molecular property prediction (MPP) is well-recognized in many applications. However, the insufficiency and the imbalance of labeled data make the learning problem difficult. Moreover, the reliability of the predictions is also a huddle in the deployment of MPP models in safety-critical fields.

Results

We propose the Evidential Meta-model for Molecular Property Prediction (EM3P2) method that returns uncertainty estimates along with its predictions. Our EM3P2 trains an evidential graph isomorphism network classifier using multi-task molecular property datasets under the model-agnostic meta-learning (MAML) framework while addressing the problem of data imbalance.

Our results showed better prediction performances compared to existing meta-MPP models. Furthermore, we showed that the uncertainty estimates returned by our EM3P2 can be used to reject uncertain predictions for applications that require higher confidence.

Availability and implementation

Source code available for download at https://github.com/Ajou-DILab/EM3P2.

1 Introduction

Molecular property prediction (MPP) is a fundamental task in various industrial fields. Especially in drug discovery, an accurate and trustworthy computational method for MPP can greatly accelerate the process of drug candidate discovery. With the advent of big data and deep neural networks (DNNs), MPP models have improved dramatically. Earlier DNN models have used SMILES (Simplified Molecular Input Line Entry System) encoding (Wieder et al. 2020) as input. Today, the input is mostly graph-based, as it is more detailed compared to SMILES. Many methods use a simple graph representation of whole molecules as surveyed in Ham et al. (2022). Others use the whole molecular graph with motif graphs (Zhang et al. 2021, Zang et al. 2023) or mixtures of representations (Jiang et al. 2022, Wang et al. 2022).

Most current MPP models are based on Graph Neural Networks (GNNs). GNNs take the graph structure, i.e. the natural 2D structure of whole molecules or molecular motifs, as input to neural network models and make supervised predictions. GNNs have been shown to be strong at capturing the features of the nodes and the relationships between individual nodes, making them a powerful tool for MPP. For this reason, over 80 MPP tasks have used variants of GNN (Wieder et al. 2020). Among the most widely used GNN models on MPP are the variants of graph neural networks with a message passing scheme, as reviewed in Ham et al. (2022). More details on related works are provided in the Supplementary Materials.

However, supervised GNNs, due to their large number of parameters, inherently require a large amount of labeled data. With redundancy removed, labeled data for MPP is often insufficient and highly imbalanced in labels. Moreover, training large models such as GNNs on limited and imbalanced data requires additional measurements for the trustworthiness of the prediction.

To address the data scarcity problem, various few-shot learning-based methods have been proposed for general purposes (Hospedales et al. 2021, Crisostomi et al. 2022). The goal of few-shot learning is to train a model with only a few labeled data. Few-shot methods are often combined with various other learning methods, such as meta-learning, to improve performance. The goal of meta-learning is to learn a model of models. This model is often easily transferable to specific tasks or domains. In the optimization perspective, when few-shot and meta-learning are combined, the goal becomes learning an easily transferable model that can adapt to new data and tasks with only a few training data (Finn et al. 2017). The goal of few-shot learning-based meta-learning fits well with the MPP tasks. That is, we want a model that is trained on various molecule and MPP task pairs to work well on new molecules or new MPP tasks after a simple adaptation with only a few data.

In addition to prediction accuracy, the trustworthiness of prediction results is another critical factor in MPP. This seems even more critical in a label-imbalanced setting, where a model is overly confident in predicting the class label with more samples. Although few-shot meta-learning alleviates the need for large datasets, it does not address the gap between the confidence, i.e. class probability, and the actual accuracy. The trustworthiness of the prediction results can be estimated by the certainty of the model in making the prediction. If an MPP model can return the uncertainty of its prediction, we can better rely on the model and plan for postanalysis. Uncertainty estimation methods approximate model uncertainties to measure the confidence of models in their prediction for a given input. Note that, although uncertainty estimation models provide uncertainty values for each prediction they make, the predicted class probabilities may still be biased toward the majority class.

In this work, we propose a novel meta-MPP classification model called Evidential Meta-model for Molecular Property Prediction (EM3P2) that returns the uncertainty estimate for each model prediction, is able to adapt to new tasks with few labeled data, and is less sensitive to data imbalance. We list our contributions as follows:

  • New method: We propose a novel uncertainty-aware few-shot meta-learning model, i.e. EM3P2, for molecular property prediction that addresses the data imbalance problem and makes uncertainty adaptions.

  • Comparative analysis: We compare the performance of EM3P2 with three state-of-the-art few-shot meta-learning models and two representative metric learning models for molecular property prediction and show improved prediction performance.

  • Uncertainty analysis: We show that uncertainty estimates returned by EM3P2 can be used to determine the trustworthiness of model predictions. We also show that uncertainty estimates can be used as thresholds to improve accuracy.

2 Materials and methods

Overview. Our proposed method, i.e. Evidential Meta-model for Molecular Property Prediction (EM3P2), is illustrated in Fig. 1. Our EM3P2 uses the Model Agnostic Meta-Learning (MAML) (Finn et al. 2017, Guo et al. 2021) framework with our query balancing technique and evidential adaptations. The EM3P2 consists of a meta-training phase (top) and a meta-testing phase (bottom). In meta-training, the current meta-model is locally adapted with support sets (top bold box), and the adapted model (top dotted box) is used to update the meta-model. In meta-testing, test-task-specific local adaptation is performed with the support sets (bottom bold box), and the adapted model (bottom dotted box) is used to predict the evidence vectors. The components of the EM3P2 are the Graph Isomorphism Networks (GIN) (Xu et al. 2019) for graph embedding, the Multi-Layer Perceptron (MLP) for local classifications in the training phase, and the Evidential Multi-Layer Perceptron (EMLP) (Sensoy et al. 2018) for evidential meta-classifications. The choice of using MLP instead of EMLP for local adaptation in the training phase is to avoid learning uncertainty values that are specific to each training task.

The proposed EM3P2.
Figure 1.

The proposed EM3P2.

In the following detailed description of our method, we first describe the overall data-balanced meta-learning framework, then our graph embedding and classification models, and finally the loss functions and regularization terms.

Algorithm 1. EM3P2 Meta-Training

Require: Support dataset:St={xi,yi} and query dataset:

  Qt={xi,yi} for training task Tt for t=1,,T.

Require: Outer-loop, inner-loop learning rate β, α:

  1: Randomly initialize meta-model parameters Θ={Φ,Ψ}

  2: while not done do

  3:  for each Ttdo

  4:   sample N support data from St

  5:   fori=1,,Ndo

  6:    hgti,__=GIN(xti,Φ)

  7:    y^ti = MLP(hgti,Ψ)

  8:   end for

  9:   Lt Cross-Entropy Loss with {y^t1,,y^tN}

10:   Θt=ΘαΘLTtEq.(1)

11:   sample M/C number of query data from each C classes       from Qt for meta update

12:   fori=1,,Mdo

13:    hgti,hvti=GIN(xti,Φt)

14:    y^ti,eti = EMLP(hgti,Ψt)

15:   end for

16:   LtEq.(14) with {(y^t1,et1),,(y^tM,etM)}

17:   hvt=MEAN(hvt1,,hvtN)

18:  end for

19:  {A(T1),,A(TT)}Eq.(2) with {hv1,,hvT}

20:  Θ=ΘβΘt=1TA(Tt)·LTtEq.(3)

21: end while

2.1 Evidential meta learning framework

Our EM3P2 has two phases, i.e. meta-training and meta-testing, similar to MAML (Finn et al. 2017), with an added query balancing technique and design choices for evidential adaptations.

2.2 EM3P2 meta training

The goal of meta-training is to obtain a meta-model that is easily adaptable to new data and tasks. Algorithm 1 describes the overall meta-training process of our EM3P2. The meta-training phase iteratively selects a training task Tt from Ttrain (lines 3), makes a local task adaptation with the selected support dataset (lines 4–10), computes the evidential query loss and the average node embedding using the locally adapted model (Equation 14) with the selected query dataset (lines 11–18), and uses the query losses and average node embedding to make a task-attended update for the parameters of the evidential meta-model (lines 19–20).

Evidential task-adaption. More specifically, in the local adaptation of the meta-model, for tth task Tt, the current evidential meta-model with parameter Θ={Φ,Ψ} is adapted to the task-specific local model using N number (N-shot where N=1,10 for our test) of support data points xtiSt. First, the graph embedding vector hgti is obtained by running the GIN model (Sec.) with the current meta-GIN parameter Φ on the support data xti (line 6). Then, the graph embedding vector hgti is used to run the multi-layer perceptron (MLP) with the parameter set Ψ to obtain a prediction y^ti (line 7). After obtaining the results for the selected support data points, a support loss (Lt), which is simply a cross-entropy loss, is computed (lines 9). Note that choosing not to learn the evidential parameters, i.e. not using EMLP, for the local adaptations in the training phase is an important choice to avoid biasing the prediction confidence of the meta-model on specific training tasks. The support loss is used to update the local model parameters Θt={Φt,Ψt} (lines 10) as follows:
(1)

Query balancing (QB). The next step is to sample M number of data points from the query dataset Qt={xi,yi} for use in the evidential meta-model update. Under the default MAML setting, the queries are randomly selected from the query dataset. However, in a highly imbalanced environment, random sampling results in the training of a meta-model that always predicts in the direction of the majority class. To properly train our EM3P2 under data imbalance, when selecting M data points, we additionally ensure that the balanced number of samples are selected from each of the C numbers of classes by selecting M/C of query data points from each class (lines 11).

Evidential meta-model update. For each of the sampled query data xti, graph embedding vectors hgti and node embedding vectors hvti are computed using the task-adapted GIN parameters Φt. Graph embedding vectors hgti are used to compute class predictions y^ti and evidence vectors eti using the task-adapted classifier, i.e. evidential multi-layer perception (EMLP) (lines 12–15). The class predictions and the evidence vectors of the sampled query dataset are used to obtain the evidential query loss (line 16).

The final step of the iteration is to perform a task attended parameter update for the evidential meta-model. The average node embedding for each of the training tasks hvt (line 11) is used to compute the task attention values for each task A(Tt) for t=1,….T (line 19). The task attention values are calculated as follows (Guo et al. 2021):
(2)
where hvTt= MEAN({hvTti}i=1N) and MLP is a single layer of a multi-layer perceptron. The evidential query losses LTt and the attention values A(Tt) are then used to update the evidential meta-model (line 20) as follows:
(3)

Details of the model components, GIN, MLP, EMLP, and loss functions are described in the later sections.

Algorithm 2. EM3P2 Meta-Testing

Require:N support data st={xi,yi} and query data qt={x}     of tth testing task Tt

Require: meta-model parameters Θ={Φ,Ψ}

1: ΘtΘ

2: while not done do

3:  fori=1,,Ndo

4:   y^ti = EMLP(GIN(xti,Φt),Ψt)

5:  end for

6:  LtEq.(14) with {y^t1,,y^tN}

7:  Θt=ΘtαΘLTtEq.(4)

8: end while

9: y^,e = EMLP(GIN(x,Φ),Ψ)

2.3 EM3P2 meta testing

The meta-testing phase is similar to the meta-training phase without the meta-model update. The entire testing process is described in Algorithm 2. We assume a new query data qt={x} under a new task Tt. We also assume that there is N number of support data with labels under the same task. The learned evidence model with parameters Θ={Φ,Ψ} is adapted to the new task Tt under the evidence model setting (lines 1–4). The support loss (line 5) is used to update the local model parameters Θ={Φ,Ψ} (lines 6) as follows:
(4)

The final adapted model is used to obtain the prediction label and the evidence vector (lines 7–8) for evaluation.

2.4 Graph embedding and classification model

Our method uses the Graph Isomorphism Network (Xu et al. 2019) for molecular graph and node embedding and the Evidential Multi-layer Perceptron (Sensoy et al. 2018) for molecular property classification.

2.5 Graph isomorphism network (GIN)

Graph Isomorphism Network (GIN) (Xu et al. 2019) is a recently introduced graph neural network that has been proven to have equal power to the Weisfeiler-Lehman test under mild conditions. GIN is a type of message-passing neural network that utilizes multi-layer perceptrons (MLP) to aggregate and combine node representations from each node and its neighbor within the K-hops. More specifically, a K layered GIN, given a graph G=(V,E) and initial node vectors Xv, compute the representation of node v at kth layer as follows:
(5)
where N(v) is a set of neighbors of node v, and hv(0)=Xv. After the final layer, the final graph and node representations for classification are obtained from the readouts of the last layer l. More specifically, a node embedding is computed as hv={hv(l)|vV} and a graph embedding as hg=MEAN(hv(l)|vV), where MEAN(.) function averages all node embedding into the graph. The resulting graph embedding hg is used as input to the classification layers.

2.6 Evidential multi-layer perceptron (EMLP)

Among various uncertainty estimation methods, evidential neural networks (ENN) show comparable results to others with less number of parameters and less inference time complexity (Sensoy et al. 2018, Ham et al. 2022). ENN is based on the observation that softmax values, typical results of classification models, have the problem of inflating predicted class probabilities due mostly to the use of exponential terms. Instead of the typical softmax parameters set of a categorical distribution, the ENN for classification models (Sensoy et al. 2018) replaces the parameters with that of a Dirichlet distribution. The Dirichlet distribution for C class probabilities given parameter vector αRC can be written as follows:
(6)
where SC={p|j=1Cpj and 0p1,,pC1}B(α) are the C-dimensional unit simplex and the multinomial beta function, respectively. This allows the model to represent the predictions of the model as a distribution over possible classes instead of softmax point estimates.

For our EM3P2 model, graph embedding outputted by GIN goes through an evidential multi-layer perception (EMLP) where the last layer is an activation layer (we used ReLU), instead of a typical softmax layer. The result of the activation layer is taken as the evidence vector eR+C.

2.7 Estimating uncertainty

The class probabilities, confidence, and uncertainty of the prediction can be inferred from the evidence vector ei, for the input xi, according to the Dempster-Shafer theory of evidence (Dempster 2008) and subjective logic (Jøsang 2016). That is, the relation between the total evidence ec0 for class c and the Dirichlet parameter αc1 is αc=ec+1. Also, the prediction uncertainty u0 has the following relationship to the evidence: u+c=1Cec/(j=1Cej+1)=1. With the two equations and the prediction result of the positive evidence vector ei of size C for the input xi, we can extract the class probability as
(7)
the confidence as
(8)
and the prediction uncertainty as
(9)

If the estimated uncertainty ui is high, the prediction y^ may not be trustworthy (Sensoy et al. 2018, Ham et al. 2022).

2.8 Loss functions

There are three components to the overall loss function of our proposed EM3P2. First, we describe the evidential loss functions used. Then, we describe two regularizers used to improve uncertainty estimation and model calibration.

2.9 Evidential loss function

Our EM3P2 uses the Dirichlet cross-entropy loss (Sensoy et al. 2018) as the base loss function of EMLP. Note that for MLP, we use the regular cross-entropy loss. The Dirichlet cross-entropy loss is as follows (Sensoy et al. 2018):
(10)
where ψ(·) is a digamma function.

2.10 Belief regularizer (BR)

Belief regularization penalizes false beliefs, which are known to exist in few-shot models (Pandey and Yu 2022). If the false belief can be accurately quantified, it can be used to guide the model in training to minimize false predictions. The belief error is expressed as the sum of the evidence values for false classes for an input. The belief regularization can then be expressed as (Pandey and Yu 2022):
(11)
where ei(1yi) is an element-wise product of the evidence vector ei and the one-hot vector representation of the class label yi for input xi.

2.11 Accuracy versus uncertainty loss regularizer (AvUC)

The accuracy versus uncertainty (AvUC) measures the calibration of the model on its certainty of predictions and the actual accuracy as follows (Guo et al. 2017, Krishnan and Tickoo 2020, Bao et al. 2021):
(12)
where the nAC, nAU, nIC, and nIU are the numbers of results in the four categories: accurate and certain (AC), accurate and uncertain, inaccurate and certain (IC), and inaccurate and uncertain (IU). The AvUC regularizer puts a constrain on the cross entropy between maximum class probability pi and uncertainty ui of data xi as follows (Bao et al. 2021):
(13)
where λc is the strength of the constraint.

2.12 Overall loss function

Two losses are used for local adaptation and meta-model updates. For the local adaptation of the training phase (MLP updates), the regular cross-entropy loss is used, i.e. iNjCyijlogpij. For the rest of the model updates (EMLP updates), the following loss is used:
(14)
where λb,λc are annealing factors that increase monotonically with the training epoch increases. In our training, each annealing factor is defined as λb=λ0t/T and λc=0.1*λ0t/T. Each annealing factor increases from λ0 to 1.0 and 0.1, respectively, we set a smaller scale for λb with the observation that the model becomes more sensitive to calibration when the training dataset is small. In the early stages of training, the model focuses more on learning the data and acquiring evidence, but as training progresses, the model gradually penalizes false beliefs and false predictions with respect to accuracy.

3 Experiments

Dataset. We evaluated our experiment on the Tox21, SIDER, MUV datasets provided by MoleculeNet (Wu et al. 2018). A summary of the datasets is provided in the Table 1. Tox21 is a public database that measures the toxicity of molecules used in the 2014 Tox21 Data Challenge http://tripod.nih.gov/tox21/challenge/. The dataset contains information on whether or not each of the 7831 molecules binds to the 12 different targets. Tox21 is a highly imbalanced dataset, with more than twelve times more negative labels than positive labels (Table 1 last row). We considered predicting binding to each of the 12 targets as an independent task, resulting in 12 different tasks for the 7831 molecules. The Side Effect Resources (SIDER) is a database of 1427 marketed drugs and their side effects, categorized into 27 types (Kuhn et al. 2016, Altae-Tran et al. 2017). The SIDER dataset is somewhat imbalanced with a mix of positively biased, negatively biased, and relatively balanced tasks. The overall ratio of positive to negative labels is 0.76 overall. The 27 side effects were again considered independent tasks, resulting in 27 different tasks for the 1427 molecules. Maximum Unbiased Validation (MUV) dataset, which contains information about whether a molecule binds to a target protein. Each target protein can be considered as an independent task. The MUV contains 93087 molecules and 17 protein targets or tasks. Each task is highly imbalanced with a positive to negative label ratio of 0.002.

Table 1.

Summary of datasets.

DatasetTox21SIDERMUV
Molecules7831142793087
Tasks122717
Meta-training tasks92112
Meta-testing tasks365
Positive/negative label ratio0.080.760.002
DatasetTox21SIDERMUV
Molecules7831142793087
Tasks122717
Meta-training tasks92112
Meta-testing tasks365
Positive/negative label ratio0.080.760.002
Table 1.

Summary of datasets.

DatasetTox21SIDERMUV
Molecules7831142793087
Tasks122717
Meta-training tasks92112
Meta-testing tasks365
Positive/negative label ratio0.080.760.002
DatasetTox21SIDERMUV
Molecules7831142793087
Tasks122717
Meta-training tasks92112
Meta-testing tasks365
Positive/negative label ratio0.080.760.002

The three datasets have molecules written in SMILES strings. We converted the SMILES strings into molecular plots using the Rdkit.Chem (Landrum et al. 2013) package to generate the input. For all our experiments, the Tox21 dataset was divided into nine training and three test tasks, the SIDER dataset was divided into 21 training and six test tasks, and the MUV dataset was divided into 12 training and five test tasks. For each task, we randomly preselected support and query molecules as support and query sets, respectively.

Compared Methods. We compare our EM3P2 with molecular property prediction (MPP) methods based on meta-learning or metric learning framework.

  • MAML (Finn et al. 2017) is a task-agnostic algorithm for few-shot meta-learning, where model parameters are trained using a small number of samples. We used the GIN and MLP as the base learner for MAML.

  • Pre-GNN (Hu et al. 2020) is the MAML model with associated GIN, which uses self-supervised learning to capture local and global information of graph data.

  • The Pre-PAR (Wang et al. 2021) is similar to the MAML model, but with the added step of generating a relationship graph between molecules with the property-aware embedding of support and query molecules.

  • SiameseNet (Koch et al. 2015) is a model based on the Siamese network, which is a metric-based few-shot learning algorithm. We used the GIN for molecular representation and measured the similarity between two inputs using the cosine distance. The relationship is considered positive if both input molecules are positive, otherwise negative.

  • Protonet (Snell et al. 2017, Crisostomi et al. 2022) is a model based on a prototypical network, which is also a metric-based few-shot learning algorithm. We used the GIN for molecular representation and cosine distance is used to measure the distance between the prototypes and the query sample.

Reproducibility Setting. The experiments were performed on an NVIDIA GeForce GTX 1080 Ti GPU with the following implementation details. MAML was implemented using the learn2learn library. We used author codes for Pre-GNN and Pre-PAR. Both metric-based methods, i.e. SiameseNet and Protonet, were re-implemented. We trained the models using the default settings. For our EM3P2, we used the pretrained GIN of Pre-GNN and implemented the rest by referencing to ENN and Pre-GNN using the Pytorch and Pytorch-Geometric libraries.

3.1 Comparison with the state-of-the-art

We compared the prediction performance of our EM3P2 (no uncertainty threshold) with the three existing meta-based MPP methods and two metric-based MPP methods using the area under the receiver operating characteristic curve (ROC-AUC) (Table 2). Overall, our EM3P2 performed the best compared to other methods for both 1-shot and 10-shot cases, except for the 1-shot case on the MUV dataset. We also note that the results for the SIDER dataset are consistently lower for all methods. We suspect that this is due to the categorization of side effects into 27 types, with different granularities and guidelines for each group as well as the mixed positive and negative bias tasks. Results for the MUV dataset also had low prediction performance. This is due to the extremely unbalanced nature of the dataset (with a 0.002 positive-to-negative label ratio). We will show later that using the uncertainty threshold improves the performance of our EM3P2.

Table 2.

ROC-AUC values of our EM3P2 and compared methods.

DatasetMethodTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
MAML0.6210.6620.6480.6480.5530.598
Pre-GNN0.7700.7670.6940.6940.6280.602
Pre-PAR0.7780.7990.6910.7260.6820.642
SiameseNet0.7680.7730.6430.6430.7130.651
Protonet0.5420.7870.5670.7180.6310.662

EM3P2a0.8330.8340.7920.7940.6370.695
DatasetMethodTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
MAML0.6210.6620.6480.6480.5530.598
Pre-GNN0.7700.7670.6940.6940.6280.602
Pre-PAR0.7780.7990.6910.7260.6820.642
SiameseNet0.7680.7730.6430.6430.7130.651
Protonet0.5420.7870.5670.7180.6310.662

EM3P2a0.8330.8340.7920.7940.6370.695
a

Our default proposed model without a uncertainty threshold.

The best performances are in bold face numbers.

Table 2.

ROC-AUC values of our EM3P2 and compared methods.

DatasetMethodTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
MAML0.6210.6620.6480.6480.5530.598
Pre-GNN0.7700.7670.6940.6940.6280.602
Pre-PAR0.7780.7990.6910.7260.6820.642
SiameseNet0.7680.7730.6430.6430.7130.651
Protonet0.5420.7870.5670.7180.6310.662

EM3P2a0.8330.8340.7920.7940.6370.695
DatasetMethodTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
MAML0.6210.6620.6480.6480.5530.598
Pre-GNN0.7700.7670.6940.6940.6280.602
Pre-PAR0.7780.7990.6910.7260.6820.642
SiameseNet0.7680.7730.6430.6430.7130.651
Protonet0.5420.7870.5670.7180.6310.662

EM3P2a0.8330.8340.7920.7940.6370.695
a

Our default proposed model without a uncertainty threshold.

The best performances are in bold face numbers.

3.2 Ablation study

Table 3 shows the ROC-AUC and precision values for each variant of our EM3P2. The variants are based on combinations of factors: whether query balancing (QB) or random sampling is used, whether belief regularizer (BR) is used, and whether accuracy versus uncertainty curve regularizer (AvUC) is used. We observed that each of these factors contributes to the overall accuracy of our EM3P2. Overall, the use of QB improves accuracy by at least 23%. The combination of QB and BR improves the precision even more and achieves the best precision values. In terms of prediction ROC-AUC values, the use of all factors resulted in the best performance for the Tox21 and SIDER datasets. For the MUV dataset, due to its extreme bias, the ROC-AUC values were all similar with a difference within 0.008 when using any combination of the three constraints. This can be mitigated by training weights for BR and AvUC. However, for our current work, we have fixed the lambda values as described in section. Our EM3P2 combines all three factors in the default setting.

Table 3.

Ablation studies of EM3P2 reporting ROC-AUC (floating point) and precision (percentage).a,b

QBBRAvUCTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
0.8150.8210.7330.7200.6300.552
11%8%41%43%0%0%
0.8160.8170.6980.6980.6280.638
62%54%73%68%23%30%
0.8320.8330.7760.7830.6290.631
64%60%76%70%29%40%
0.8330.8340.7920.7940.6290.630
53%50%74%69%28%35%
QBBRAvUCTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
0.8150.8210.7330.7200.6300.552
11%8%41%43%0%0%
0.8160.8170.6980.6980.6280.638
62%54%73%68%23%30%
0.8320.8330.7760.7830.6290.631
64%60%76%70%29%40%
0.8330.8340.7920.7940.6290.630
53%50%74%69%28%35%
a

QB, query balancing in meta training; BR, Belief Regularizer; AvUC, Accuracy Versus Uncertainty Loss Regularizer.

b

The precision values are calculated for the minority labels.

The best performances are in bold face numbers.

Table 3.

Ablation studies of EM3P2 reporting ROC-AUC (floating point) and precision (percentage).a,b

QBBRAvUCTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
0.8150.8210.7330.7200.6300.552
11%8%41%43%0%0%
0.8160.8170.6980.6980.6280.638
62%54%73%68%23%30%
0.8320.8330.7760.7830.6290.631
64%60%76%70%29%40%
0.8330.8340.7920.7940.6290.630
53%50%74%69%28%35%
QBBRAvUCTox21
SIDER
MUV
1-shot10-shot1-shot10-shot1-shot10-shot
0.8150.8210.7330.7200.6300.552
11%8%41%43%0%0%
0.8160.8170.6980.6980.6280.638
62%54%73%68%23%30%
0.8320.8330.7760.7830.6290.631
64%60%76%70%29%40%
0.8330.8340.7920.7940.6290.630
53%50%74%69%28%35%
a

QB, query balancing in meta training; BR, Belief Regularizer; AvUC, Accuracy Versus Uncertainty Loss Regularizer.

b

The precision values are calculated for the minority labels.

The best performances are in bold face numbers.

3.3 Uncertainty threshold

Figure 2 shows changes in accuracy values with respect to uncertainty. We can see that, with the exception of Task 23 (T23) in SIDER, the test tasks in the Tox21, SIDER, and MUV datasets increase in accuracy as uncertainty decreases. For SIDER task 23, our learned meta-model was not able to adapt properly. We suspect two reasons. First, task 23 is the side effect category for Pregnancy, puerperium & perinatal conditions, which is a significantly different categorization from the majority of other tasks that map side effects to system organ classes. This results in overall low accuracy compared to other test tasks. Second, task 23 is positively biased (1302:125) unlike typical tasks that are usually negatively biased. In fact, accuracy increased after a steep decline with increasing uncertainty when we trained our EM3P2 with balanced or positively biased tasks and tested the model with the negatively biased tasks. Additional results for the SIDER dataset can be found in the Supplementary Materials.

Accuracy according to uncertainty threshold for 10-shot cases.
Figure 2.

Accuracy according to uncertainty threshold for 10-shot cases.

With the calibrated accuracy and uncertainty values, our trained EM3P2 was able to predict I don’t know for predictions with uncertainty values higher than the threshold. We used the uncertainty values as thresholds for whether or not to accept the prediction. Table 4 shows the average accuracy according to the uncertainty thresholds (ut) in 1- and 10-shot settings.

Table 4.

Accuracy of EM3P2 with uncertainty threshold.

DatasetTox21
SIDER
MUV
Method1-shot10-shot1-shot10-shot1-shot10-shot
EM3P2 ut0.193.592.184.586.897.497.7
EM3P2 ut0.290.689.680.282.192.992.4
EM3P2 ut0.388.787.475.379.089.888.7
EM3P2 ut0.487.386.272.975.687.685.9
EM3P2 no threshold82.181.664.366.179.876.6
DatasetTox21
SIDER
MUV
Method1-shot10-shot1-shot10-shot1-shot10-shot
EM3P2 ut0.193.592.184.586.897.497.7
EM3P2 ut0.290.689.680.282.192.992.4
EM3P2 ut0.388.787.475.379.089.888.7
EM3P2 ut0.487.386.272.975.687.685.9
EM3P2 no threshold82.181.664.366.179.876.6

The best performances are in bold face numbers.

Table 4.

Accuracy of EM3P2 with uncertainty threshold.

DatasetTox21
SIDER
MUV
Method1-shot10-shot1-shot10-shot1-shot10-shot
EM3P2 ut0.193.592.184.586.897.497.7
EM3P2 ut0.290.689.680.282.192.992.4
EM3P2 ut0.388.787.475.379.089.888.7
EM3P2 ut0.487.386.272.975.687.685.9
EM3P2 no threshold82.181.664.366.179.876.6
DatasetTox21
SIDER
MUV
Method1-shot10-shot1-shot10-shot1-shot10-shot
EM3P2 ut0.193.592.184.586.897.497.7
EM3P2 ut0.290.689.680.282.192.992.4
EM3P2 ut0.388.787.475.379.089.888.7
EM3P2 ut0.487.386.272.975.687.685.9
EM3P2 no threshold82.181.664.366.179.876.6

The best performances are in bold face numbers.

3.4 Comparison of EMLP and MLP empirical results

To show that quantified evidence provides better confidence, we looked at individual test cases in detail. Here, we compared individual results when only multi-layer perceptrons (MLPs) were used for classification and when evidential multi-layer perceptrons (EMLPs) as in our default setting. Table 5 shows the prediction results for captafol molecule in the Tox21 dataset. The true labels for the test tasks were all positive. However, the model using MLP classification predicted negative with a high class probability, while our EM3P2 using EMLP returned high uncertainty values and predicted I don’t know (?). More empirical results for SIDER and MUV datasets are provided in the Supplementary Materials.

Table 5.

Detailed test task result of captafol in Tox21 dataset.

MeasureTask10Task11Task12
EM3P2+ Evidence0.3820.2830.255
Uncertainty0.6720.7170.745
Prediction???
EM3P2 (MLP)– Class prob.0.8190.8160.815
Prediction
MeasureTask10Task11Task12
EM3P2+ Evidence0.3820.2830.255
Uncertainty0.6720.7170.745
Prediction???
EM3P2 (MLP)– Class prob.0.8190.8160.815
Prediction
Table 5.

Detailed test task result of captafol in Tox21 dataset.

MeasureTask10Task11Task12
EM3P2+ Evidence0.3820.2830.255
Uncertainty0.6720.7170.745
Prediction???
EM3P2 (MLP)– Class prob.0.8190.8160.815
Prediction
MeasureTask10Task11Task12
EM3P2+ Evidence0.3820.2830.255
Uncertainty0.6720.7170.745
Prediction???
EM3P2 (MLP)– Class prob.0.8190.8160.815
Prediction

4 Conclusion

We have proposed a Evidential Meta-model for Molecular Property Prediction (EM3P2) method for evidence-aware molecular property prediction (MPP). This method is capable of learning from few and often imbalanced data. The EM3P2 is a novel uncertainty-aware few-shot meta-learning model for MPP that provides uncertainty estimates for each model prediction. It adapts well to novel tasks with limited labeled data and is less sensitive to data imbalance. In addition to having comparable prediction performance to other few shot MPP models, we can use the uncertainty threshold to obtain better and more confident predictions. Overall, the proposed EM3P2 model and our analysis of the uncertainty estimation to the MPP dataset provide valuable insights and advances in addressing the challenges of data imbalance and reliability in MPP. Our EM3P2 shows the potential of MPP models to accelerate the identification of potential drug candidates and to improve reliability in sensitive areas.

Conflict of interest

None declared.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) [No.2022-0-00369]; and by the National Research Foundation of Korea Grant funded by the Korean government [2018R1A5A1060031, 2022R1F1A1065664].

Data availability

The data underlying this article were accessed from MoleculeNet, https://moleculenet.org/.

References

Altae-Tran
H
,
Ramsundar
B
,
Pappu
AS
et al.
Low data drug discovery with one-shot learning
.
ACS Cent Sci
2017
;
3
:
283
93
.

Bao
W
,
Yu
Q
,
Kong
Y.
Evidential deep learning for open set action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada. IEEE,
2021
,
13329
13338
.

Crisostomi
D
,
Antonelli
S
,
Maiorca
V
et al. Metric based few-shot graph classification. In: Proceedings of the First Learning on Graphs Conference, Virtual. PMLR,
2022
;198(33):
1
33
.

Dempster
AP.
A Generalization of Bayesian Inference. In: Yager, R.R., Liu, L. (eds) Classic Works of the Dempster-Shafer Theory of Belief Functions. Studies in Fuzziness and Soft Computing, vol 219. Springer,
2008
,
73
104
.

Finn
C
,
Abbeel
P
,
Levine
S.
Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, Sydney, Australia. PMLR,
2017
,
1126
–11
35
.

Guo
C
,
Pleiss
G
,
Sun
Y
On calibration of modern neural networks. In: International Conference on Machine Learning, Sydney, Australia. PMLR,
2017
,
1321
–13
30
.

Guo
Z
,
Zhang
C
,
Yu
W
et al. Few-shot graph learning for molecular property prediction. In: Proceedings of the Web Conference 2021, Ljubljana, Slovenia. ACM,
2021
,
2559
–25
67
.

Ham
K
,
Yoon
J
,
Sael
L.
Towards accurate and certain molecular properties prediction. In: The 13th International Conference on ICT Convergence,
Jeju Island, Republic of Korea
. IEEE,
2022
,
1621
–162
4
.

Hospedales
T
,
Antoniou
A
,
Micaelli
P
et al.
Meta-learning in neural networks: a survey
.
IEEE Trans Pattern Anal Mach Intell
2021
;
44
:
5149
–51
69
.

Hu
W
,
Liu
B
,
Gomes
J
et al. Strategies for pre-training graph neural networks. In: The International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia. IEEE,
2020
.

Jiang
J
,
Zhang
R
,
Zhao
Z
et al.
MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction
.
Bioinformatics
2022
;
38
:
4573
80
.

Jøsang
A.
Subjective Logic. Artificial Intelligence: Foundations, Theory, and Algorithms
.
Cham
:
Springer International Publishing
,
2016
.

Koch
G
,
Zemel
R
,
Salakhutdinov
R
et al. Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, Vol. 2, Lille, France: JMLR,
2015
.

Krishnan
R
,
Tickoo
O.
Improving model calibration with accuracy versus uncertainty optimization
.
Adv Neural Inf Process Syst
2020
;
33
:
18237
48
.

Kuhn
M
,
Letunic
I
,
Jensen
LJ
et al.
The sider database of drugs and side effects
.
Nucleic Acids Res
2016
;
44
:
D1075
9
.

Landrum
G.
RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling
.
Greg Landrum,
2013
;
8
:
31
.

Pandey
DS
,
Yu
Q.
Multidimensional belief quantification for label-efficient meta-learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA. IEEE,
2022
,
14371
80
.

Sensoy
M
,
Kaplan
L
,
Kandemir
M.
Evidential deep learning to quantify classification uncertainty
.
Adv Neural Inf Process Syst
2018
;
31
:
1
11
.

Snell
J
,
Swersky
K
,
Zemel
R.
Prototypical networks for few-shot learning
.
Adv Neural Inf Process Syst
2017
;
30
:
1
11
.

Wang
Y
,
Abuduweili
A
,
Yao
Q
et al.
Property-aware relation networks for few-shot molecular property prediction
.
Adv Neural Inf Process Syst
2021
;
34
:
17441
54
.

Wang
Z
,
Liu
M
,
Luo
Y
et al.
Advanced graph and sequence neural networks for molecular property prediction and drug discovery
.
Bioinformatics
2022
;
38
:
2579
86
.

Wieder
O
,
Kohlbacher
S
,
Kuenemann
M
et al.
A compact review of molecular property prediction with graph neural networks
.
Drug Discov Today Technol
2020
;
37
:
1
12
.

Wu
Z
,
Ramsundar
B
,
Feinberg
EN
et al.
Moleculenet: a benchmark for molecular machine learning
.
Chem Sci
2018
;
9
:
513
30
.

Xu
K
,
Hu
W
,
Leskovec
J
et al. How powerful are graph neural networks? In: The International Conference on Learning Representations (ICLR), New Orleans, LA, USA. IEEE,
2019
.

Zang
X
,
Zhao
X
,
Tang
B.
Hierarchical molecular graph self-supervised learning for property prediction
.
Commun Chem
2023
;
6
:
34
.

Zhang
Z
,
Liu
Q
,
Wang
H
et al.
Motif-based graph self-supervised learning for molecular property prediction
.
Adv Neural Inf Process Syst
2021
;
34
:
15870
82
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Jonathan Wren
Jonathan Wren
Associate Editor
Search for other works by this author on:

Supplementary data