Abstract

Motivation

Predicting the properties of molecules is a fundamental problem in drug design and discovery, and learning effective feature representations lies at the core of modern deep-learning-based prediction methods. Recent progress demonstrates the expressive power of graph neural networks (GNNs) in capturing structural information of molecular graphs. However, we find that most molecular graphs exhibit low clustering along with dominating chains. Such topological characteristics can induce feature squashing during message passing and thus impair the expressivity of conventional GNNs.

Results

Aiming at improving the expressiveness of node features, we develop a novel chain-aware graph neural network model, wherein chain structures are captured by learning the representation of a center node along the shortest paths starting from it, and the redundancy between layers is mitigated via initial residual difference connection (IRDC). The molecular graph is then represented by attentive pooling of all node representations. Compared to standard graph convolution, our chain-aware learning scheme offers more direct feature interaction between distant nodes and is thus able to capture long-range dependencies. We provide extensive empirical analysis on real-world datasets to show that the proposed method outperforms strong baselines.

Availability and implementation

The MolPath code is publicly available at https://github.com/Assassinswhh/Molpath.

1 Introduction

Precise property prediction plays a crucial role in selecting chemical compounds with the desired attributes for subsequent tasks (David et al. 2020) in drug design and discovery, making it a pivotal task in this field. With the advent of deep learning techniques, molecular property prediction has achieved remarkable success (Wang et al. 2023b, Zhou et al. 2023). By encoding molecules as strings with notation tools such as SMILES (Weininger 1988) and SELFIES (Krenn et al. 2022), sequential models (Edwards et al. 2022, Irwin et al. 2022) can be employed to extract higher-order features from the string for downstream property classification and/or regression.

However, the one-dimensional string representation does not contain information about atom-to-atom interactions, resulting in sub-optimal performance. To leverage the irregular structure of atom-to-atom interactions, recent work (Li et al. 2022a, Stärk et al. 2022, Zhu et al. 2022) resorts to graph neural networks, which model molecules as graphs by representing atoms as nodes and chemical bonds as edges. In this way, the nonlinear interactions between atoms can be captured. In particular, more recent work (Zang et al. 2023) has noticed that molecules contain unique structural patterns, e.g. rings and functional groups, suggesting that specialized graph neural network models are required to obtain better feature representations. HiMol (Zang et al. 2023) shows that a hierarchical architecture is effective for graph neural networks to capture different levels of chemical semantic information.

Beyond the local patterns (i.e. motifs) implied in molecular graphs at the mesoscopic level, global properties also play a critical role in graph-based representation learning. Since most current GNNs intrinsically perform local smoothing (Huang et al. 2023a, 2023b, 2024), restricting graph convolutions to the local regions around center nodes, it is hard for this scheme to capture long-range dependencies between nodes. When applying GNNs to molecular graphs, this drawback becomes more prominent, as the backbones of molecular graphs are generally chain-like. To gain deeper insight into the structural features of molecular graphs, we empirically study local connectivity with the clustering coefficient (Hamilton et al. 2017), which measures how likely the neighbors of a node are to be connected. Figure 1 demonstrates that only a very small number of nodes have well-connected neighbors, while for most nodes there is a lack of shortcuts between them, as shown in the inset of Fig. 1. This characteristic implies that information can only be transmitted along long paths between the majority of nodes, which is considered the root cause of over-squashing (Giraldo et al. 2023), a phenomenon whereby the features of distant neighbors are squashed during message passing. Clearly, the particular structural properties of molecular graphs call for rethinking the architecture of graph neural network models.
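To make this concrete, the clustering statistics of a molecular graph can be checked directly from its SMILES string. Below is a minimal sketch, assuming RDKit and NetworkX are available; the example SMILES (aspirin) is arbitrary and merely illustrative:

```python
import networkx as nx
from rdkit import Chem

def smiles_to_graph(smiles: str) -> nx.Graph:
    """Convert a SMILES string into an undirected atom/bond graph."""
    mol = Chem.MolFromSmiles(smiles)
    g = nx.Graph()
    g.add_nodes_from(atom.GetIdx() for atom in mol.GetAtoms())
    g.add_edges_from(
        (b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()
    )
    return g

# Aspirin as an arbitrary example: chain-like with a single ring.
g = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")
print(nx.average_clustering(g))  # near zero: almost no triangles
print(nx.degree_histogram(g))    # dominated by degree-1 and degree-2 atoms
```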

Figure 1. Average clustering coefficient of datasets. A sample molecule from the BACE dataset is shown in the left-side inset and the clustering coefficient distribution of BACE is shown in the right-side inset.

Ideally, an appropriate graph neural network should exploit the influence of long-range neighbors to enrich feature representations while avoiding the side effects of message passing. Toward this end, we introduce a path-based graph learning framework, which we term MolPath. Specifically, our model accounts for the effects of chain-like structure by integrating the shortest paths between a center node and its high-order neighbors at different levels (i.e. lengths of the shortest paths). To distinguish between shortest paths of different orders, we adopt an attention mechanism to learn path importance. Besides, considering that there may be information redundancy between different orders, e.g. a shortest path of length k includes shortest paths of length i<k, we leverage the initial residual difference connection trick to update the representation of higher-order shortest paths. The representation of the entire graph is then obtained by pooling all weighted shortest paths of different orders.

In summary, the contributions of this article are as follows:

  • We develop a novel chain-aware graph neural network model, which imposes the message passing along the shortest paths between nodes, endowing the model with the ability to capture the chain-like backbone structure of molecular graphs.

  • By employing the initial residual difference connection (IRDC) trick on the shortest path convolution, MolPath alleviates the information redundancy between layers, thus improving the feature representation of the nodes. In aggregating all paths of different lengths initiated at a center node, MolPath opts for an attention mechanism for weighted node fusion.

  • We evaluate the proposed method on molecular property prediction (i.e. classification and regression) on several benchmark biological datasets. The experimental results demonstrate the superiority of our method over strong baseline models on both tasks.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 details the experimental datasets. Section 4 presents our new model. Section 5 presents experimental results. Finally, Section 6 concludes the paper.

2 Related work

Representations for molecules can be learned in several ways, which fall roughly into two categories: GNN-based and SMILES string-based machine learning models.

2.1 SMILES string-based model

The advent of the Transformer (Vaswani et al. 2017, Devlin et al. 2019) sparked a surge of interest in exploring its application to encoding molecular data, as in GROVER (Rong et al. 2020) and ChemCrow (Bran et al. 2024). The inputs to these models are SMILES strings. Besides, CLAPS (Wang et al. 2023a) performs data augmentation on encoded SMILES strings through different masking methods combined with the attention mechanism. However, SMILES strings are one-dimensional and therefore cannot capture the complex structural information of molecules, which may pose limitations for certain tasks.

2.2 GNN-based model

These methods translate a SMILES string into a graph and leverage the power of GNNs to learn a global representation of the graph. For example, GraphMVP (Liu et al. 2021), MolCLR (Wang et al. 2022), and 3D Infomax (Stärk et al. 2022) use different GNN methods to encode molecular data and then fuse the encoded features under a contrastive learning framework. HiMol (Zang et al. 2023) adopts a molecular decomposition strategy that splits the molecular graph into motifs according to chemical rules, such as the BRICS algorithm (Degen et al. 2008), to better capture the topology and unique structures of the molecular graph; its distinctive feature is that it can analyze the structure of a molecule at multiple levels. However, these models have some drawbacks. For instance, contrastive learning-based models usually perform data augmentation by randomly deleting and adding nodes or edges, which may change the chemical structure of the molecule and thereby destroy the chemical semantics within it. On the other hand, for data whose clustering coefficient is relatively low, simply capturing unique structures, such as rings, may not fully reflect the structural information of the molecular graph. To resolve this issue, we propose a chain-aware GNN model for long-distance dependency learning.

3 Materials

To evaluate the effectiveness of the proposed MolPath model against existing molecular representation learning methods, we predict molecular properties on eight benchmark datasets sourced from MoleculeNet (Wu et al. 2018) across multiple domains: physicochemical datasets (i.e. ESOL, FreeSolv, and Lipop) for regression tasks, and physiological datasets (i.e. BBBP, Tox21, SIDER, ClinTox) and the biophysical dataset BACE for classification tasks.

For regression tasks, the output of MolPath corresponds to the molecular property value; for classification, the model's output consists of categorical property labels. Furthermore, each dataset is partitioned into training, validation, and test subsets at an 8:1:1 ratio. The training subset is used to train the model, the validation subset serves to tune hyperparameters, and the test subset is employed to evaluate the model's overall performance.
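The 8:1:1 partition can be realized with a seeded random split; a minimal sketch, where `dataset` is a placeholder for any indexable collection of molecular graphs:

```python
import torch
from torch.utils.data import random_split

def split_811(dataset, seed: int = 0):
    """Split a dataset into train/validation/test subsets at an 8:1:1 ratio."""
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_val  # remainder goes to the test set
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```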

A comprehensive overview of the primary dataset statistics is presented in Table 1.

Table 1. Summary information of the MoleculeNet benchmark datasets.

Category | Dataset | Task type | # Tasks | # Compounds | Metric
Physical chemistry | ESOL | Regression | 1 | 1128 | RMSE
Physical chemistry | FreeSolv | Regression | 1 | 642 | RMSE
Physical chemistry | Lipop | Regression | 1 | 4200 | RMSE
Physiology | BBBP | Classification | 1 | 2039 | ROC-AUC
Physiology | Tox21 | Classification | 12 | 7831 | ROC-AUC
Physiology | SIDER | Classification | 27 | 1427 | ROC-AUC
Physiology | ClinTox | Classification | 2 | 1478 | ROC-AUC
Biophysics | BACE | Classification | 1 | 1513 | ROC-AUC

4 Methods

4.1 Framework

The architectural framework of our approach, referred to as MolPath, is depicted in Fig. 2. The core of our proposed model is to learn the feature representation of the entire molecular graph via the designed MolPath module, which first embeds node features and then aggregates them with attention scores. Finally, the node features are compressed into a row vector for downstream prediction.

Figure 2. The framework of our method MolPath. Left: the overall architecture. Middle: details of the MolPath module. Right: details of the Path Convolution block.

4.2 Notation

Let $G=(V,E)$ be an undirected graph consisting of a set of nodes $V$ and a set of edges $E \subseteq V \times V$, with $|V|=n$ nodes and $|E|=m$ edges. The set $N(v)$ denotes the neighbors of node $v$. For attributed graphs, each node $v \in V$ is endowed with an initial feature vector $X_v \in \mathbb{R}^d$, which can contain categorical or real-valued properties of node $v$. Moreover, we denote the set of shortest paths of length $k$ by $\delta^{(k)} = \{\delta_1^{(k)}, \delta_2^{(k)}, \ldots, \delta_{m_k}^{(k)}\}$, where $m_k$ is the number of shortest paths of length $k$. Note that paths contain only distinct nodes. A path is a sequence of nodes $\delta_i^{(k)} = [v_1, v_2, \ldots, v_{k+1}]$ in which each node is linked by an edge to its successor, i.e. $E_{v_j, v_{j+1}} = 1$ for any two consecutive nodes $v_j$ and $v_{j+1}$, and $v_j \neq v_s$ for any two different positions $j$ and $s$. When $k=0$, a path contains only the source node, so $\delta^{(0)}$ represents the nodes themselves and $H^{(0)} \in \mathbb{R}^{n \times d}$ is the initial feature matrix of the nodes.

4.3 Path convolution

In the following, we describe each component in detail. We first compute the shortest paths $\delta^{(k)}$ of a given length $k$. Our path convolution then operates along the shortest paths originating at the central node, which differs from vanilla graph convolution (Kipf and Welling 2017) involving recursive message passing. In particular, for each node representation at the $k$th convolutional layer, to reduce the redundancy between layers, we introduce the initial residual difference connection (IRDC) module to learn the representation difference for each node between its shortest paths of different lengths.

4.3.1 Initial residual difference connection

As shown in Fig. 3, IRDC treats the difference between the initial node features and the representations learned at previous convolution layers as the information gain at the current layer. This operation attempts to capture as many novel features as possible at each layer. Since our convolution is performed along shortest paths, and the shortest paths of length $k-1$ are included in the shortest paths of length $k$, convolution on paths of length $k$ inevitably contains information about paths of length $k-1$. This necessitates feature redundancy reduction via IRDC. Therefore, at layer $k$, before applying the LSTM on the node sequences, we compute the IRDC for the nodes:
$$\tilde{H}_{\delta_i^{(k)}}^{(k)} = \lambda H_{\delta_i^{(k)}}^{(0)} + (1-\lambda)\Big(H_{\delta_i^{(k)}}^{(0)} - \sum_{j=1}^{k-1} H_{\delta_i^{(k)}}^{(j)}\Big), \tag{1}$$
where $k$ denotes the length of the shortest path and $\lambda \in (0,1]$ is a hyperparameter. The term $\sum_{j=1}^{k-1} H_{\delta_i^{(k)}}^{(j)}$ in Equation (1) can be seen as the total information extracted by the previous $k-1$ layers. This mechanism enables the model to fully exploit both the distinguishing features embodied in the original attributes and the features acquired through convolution. To address potential issues related to value scaling, we perform batch normalization on the node representations.
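A minimal PyTorch sketch of this update follows. The exact mixing of the initial features with the difference term reflects our reading of Equation (1) and the role of $\lambda$ described above, so the precise combination should be treated as an assumption rather than a definitive implementation:

```python
import torch

def irdc(h0: torch.Tensor, prev: list, lam: float) -> torch.Tensor:
    """Initial residual difference connection (one reading of Equation (1)).

    h0   -- initial features of the nodes on a path, shape (k + 1, d)
    prev -- outputs of the previous k - 1 layers for the same nodes
    lam  -- hyperparameter in (0, 1] weighing the initial features
    """
    # Information gain: what the initial features contain beyond the
    # total information already extracted by the previous layers.
    gain = h0 - torch.stack(prev).sum(dim=0) if prev else h0
    return lam * h0 + (1.0 - lam) * gain
```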
Figure 3. An illustrative example of the initial residual difference connection (IRDC) when the path length is $k-1$. $\tilde{H}^{(k)}$ is the input for the convolution with path length $k$.

Next, we learn the sequential correlation among the nodes in a path by employing a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997):
$$H_{\delta_i^{(k)}} = \mathrm{LSTM}\big(\tilde{H}_{\delta_i^{(k)}}^{(k)}\big), \tag{2}$$
where $H_{\delta_i^{(k)}}$ are the updated representations of the nodes on the $i$th path of length $k$, $\delta_i^{(k)}$, after sequence learning.

We note that the LSTM operates on reversed paths, as we hypothesize that the most important information in the sequence is contained in the representation of the starting node.
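A sketch of this sequence-learning step is given below, assuming the shortest paths at layer $k$ are stored as a dense tensor (they all contain exactly $k+1$ nodes, so no padding is needed); the class name and interface are illustrative:

```python
import torch
import torch.nn as nn

class PathLSTM(nn.Module):
    """Sequence learning over shortest paths of a fixed length k."""

    def __init__(self, d: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

    def forward(self, paths: torch.Tensor) -> torch.Tensor:
        # paths: (num_paths, k + 1, d), nodes ordered from the source node
        # outward; flip so the source node is read last and its output
        # absorbs the rest of the sequence.
        reversed_paths = torch.flip(paths, dims=[1])
        out, _ = self.lstm(reversed_paths)
        return out[:, -1, :]  # final state: representation of the source node
```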

The final node representation at layer $k$ is then the aggregation of all the shortest paths of length $k$ starting at the central node. In particular, we employ path pooling on the individual paths; after normalization and a 2-layer MLP, we acquire the node embeddings at convolution layer $k$, i.e. $H^{(k)} \in \mathbb{R}^{n \times d}$:
$$H^{(k)} = \mathrm{MLP}\Big(\mathrm{BN}\Big(\sum_{i=1}^{m_k} H_{\delta_i^{(k)}}\Big)\Big), \tag{3}$$
where the sum pools, for each central node, the paths of length $k$ starting at it. Accordingly, performing $L$ layers of path convolution captures information about the $L$th-order neighbors (i.e. nodes $L$ hops away from the central node).
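Path pooling can be implemented as a scatter-style sum over source-node indices; a short sketch with illustrative names (sum pooling is our assumption for the pooling operator):

```python
import torch

def path_pool(path_repr: torch.Tensor, src_index: torch.Tensor,
              num_nodes: int) -> torch.Tensor:
    """Sum-pool path representations back onto their source nodes.

    path_repr -- (num_paths, d) outputs of the path LSTM
    src_index -- (num_paths,) index of the central node each path starts at
    """
    out = torch.zeros(num_nodes, path_repr.size(1), device=path_repr.device)
    out.index_add_(0, src_index, path_repr)
    return out  # followed by batch norm and a 2-layer MLP as in Equation (3)
```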

4.4 Path attention and graph pooling

On the conjecture that different shortest paths have different influence on node representation learning, we apply an attention mechanism over the shortest paths of different lengths to weigh each type of shortest path. We use the initial feature as the query for the attention score computation:
$$\omega_i = \frac{\exp\big(a(H^{(0)}, H^{(i)})\big)}{\sum_{j=1}^{K}\exp\big(a(H^{(0)}, H^{(j)})\big)}, \quad i = 1, \ldots, K, \tag{4}$$
where $a(\cdot,\cdot)$ is a learnable scoring function with $H^{(0)}$ as the query. Accordingly, by aggregating the initial feature and the weighted sum of all the shortest paths of different lengths originating from the central node, we obtain the node representations $H^{(0)} + \sum_{i=1}^{K} \omega_i H^{(i)}$. The graph representation is then obtained by pooling all nodes:
$$z_G = \mathrm{POOL}\Big(H^{(0)} + \sum_{i=1}^{K} \omega_i H^{(i)}\Big). \tag{5}$$
For the downstream property prediction task, we use a two-layer MLP to generate the prediction:
$$\hat{y} = \mathrm{MLP}(z_G). \tag{6}$$
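A sketch of the whole readout follows. Dot-product attention with $H^{(0)}$ as the query is our assumption for the scoring function $a(\cdot,\cdot)$ in Equation (4), and mean pooling stands in for the graph readout; the class and its interface are illustrative:

```python
import torch
import torch.nn as nn

class PathAttentionReadout(nn.Module):
    """Fuse per-layer node embeddings H(1)..H(K) with H(0) as the query,
    mean-pool nodes into a graph vector, and predict with a 2-layer MLP."""

    def __init__(self, d: int, out_dim: int):
        super().__init__()
        self.query = nn.Linear(d, d)
        self.key = nn.Linear(d, d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, out_dim))

    def forward(self, h0: torch.Tensor, layers: list) -> torch.Tensor:
        # h0: (n, d); layers: K tensors of shape (n, d)
        hs = torch.stack(layers, dim=1)                        # (n, K, d)
        scores = (self.key(hs) * self.query(h0).unsqueeze(1)).sum(-1)
        w = torch.softmax(scores / hs.size(-1) ** 0.5, dim=1)  # (n, K)
        fused = h0 + (w.unsqueeze(-1) * hs).sum(dim=1)         # Eq. (4) fusion
        z = fused.mean(dim=0)                                  # graph readout
        return self.mlp(z)                                     # Eq. (6)
```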

4.5 Property prediction loss

We choose different loss functions according to the molecular prediction task. Specifically, for classification tasks, the binary cross-entropy is used:
$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{n}\sum_{i=1}^{n}\big[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\big], \tag{7}$$
where $y$ is the target label, $\hat{y}$ is the predicted label after the sigmoid function, and $n$ indicates the total number of labels.
As for regression tasks, we utilize the L1 loss:
$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{n}\sum_{i=1}^{n}\big|y_i - \hat{y}_i\big|, \tag{8}$$
where $y$ is the real value and $\hat{y}$ is the predicted value.
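In PyTorch these two objectives correspond directly to built-in criteria; a minimal sketch:

```python
import torch.nn as nn

# Classification (Equation 7): BCEWithLogitsLoss fuses the sigmoid with the
# binary cross-entropy for numerical stability.
cls_criterion = nn.BCEWithLogitsLoss()

# Regression (Equation 8): mean absolute error.
reg_criterion = nn.L1Loss()
```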

4.6 Performance metrics

As suggested by MoleculeNet (Wu et al. 2018), ROC-AUC is used as the performance metric for classification on the BBBP, Tox21, SIDER, ClinTox, and BACE datasets. On the three regression datasets, ESOL, FreeSolv, and Lipop, we use RMSE to evaluate performance.
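Both metrics are available in scikit-learn; a self-contained sketch with toy arrays standing in for real predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

y_true = np.array([0, 1, 1, 0])          # toy classification labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4])  # sigmoid outputs
print(roc_auc_score(y_true, y_prob))     # ROC-AUC for classification

y_val = np.array([1.2, -0.5])            # toy regression targets
y_hat = np.array([1.0, -0.3])
print(np.sqrt(mean_squared_error(y_val, y_hat)))  # RMSE for regression
```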

4.7 Time complexity

To enumerate the shortest paths of length at most $K$ from a source node to all other nodes, we use the depth-first search (DFS) algorithm. This takes at most $O(b^K)$ time, where $b$ is the branching factor, bounded by the maximum node degree. For all nodes in the graph, the time complexity is $O(n b^K)$.

Molecular graphs are often small and/or sparse, so $n$ is often small and/or $b \ll n$. Thus, for bounded $K$, the time complexity of enumerating the paths is not prohibitive.
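One way to realize this enumeration (a sketch, not necessarily the exact procedure used in our implementation) is to compute hop distances with a BFS and then run a DFS that only extends along distance-increasing edges, so every emitted path is guaranteed to be a shortest path:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src in an unweighted graph (plain BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shortest_paths_up_to_k(adj, src, K):
    """Enumerate all shortest paths of length 1..K starting at src."""
    dist = bfs_dist(adj, src)
    paths = []

    def dfs(path):
        if len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == K:
            return
        u = path[-1]
        for v in adj[u]:
            # Extending only along edges that move one hop farther from
            # src guarantees that every emitted path is a shortest path.
            if dist.get(v, -1) == len(path):
                dfs(path + [v])

    dfs([src])
    return paths

# Toy graph: chain 0-1-2-3 with a branch 1-4.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
print(shortest_paths_up_to_k(adj, 0, K=3))
# [[0, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 4]]
```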

Our model obtains path representations through an LSTM, whose time cost is $O(t d^2)$, where $t$ is the path length at the current layer and $d$ is the hidden dimension. Moreover, the time complexity of the path attention mechanism is $O(n d^2)$, where $n$ is the number of nodes. Consequently, the overall time complexity of the model is $O(n b^K + K t d^2 + n d^2)$, where $K$ is the number of path convolution layers.

5 Results

5.1 Baselines and experimental settings

We comprehensively evaluate our method and compare it against ten methods:

  • GCN (Kipf and Welling 2017) and GIN (Xu et al. 2019), which are commonly used GNNs for graph representation learning.

  • GROVER (Rong et al. 2020): It learns rich structural and semantic information from enormous amounts of unlabeled molecules. To encode such complex information, GROVER integrates message passing networks into a transformer-style architecture to deliver a class of more expressive molecular encoders.

  • GraphMVP (Liu et al. 2021): A molecular property prediction model that employs a contrastive learning framework exploiting the consistency between 2D topological structures and 3D geometric views to enhance molecular representations, ultimately improving the predictive performance of downstream models on MoleculeNet.

  • GEM (Fang et al. 2022): A geometry-enhanced molecular representation learning method that captures both the geometric and topological information of molecules.

  • GeomGCL (Li et al. 2022b): A graph contrastive learning method that utilizes the geometry of the molecule across 2D and 3D views.

  • HiMol (Zang et al. 2023): Learns molecular representations and predicts molecular properties by encoding motif structures and extracting hierarchical motif-level graph representations, coupled with multi-level self-supervised pre-training (MSP).

  • 3D PGT (Wang et al. 2023b): A 3D pre-training framework that pre-trains the model on a multi-task objective covering three attributes (bond length, bond angle, and dihedral angle) and then fine-tunes it on molecular graphs without 3D structure.

  • Uni-Mol (Zhou et al. 2023): A universal 3D molecular representation learning pre-training framework based on Graphormer, whose pre-trained models can be applied to various downstream tasks through several different fine-tuning strategies.

  • 3DGCL (Moon et al. 2023): A small-scale 3D molecular contrastive learning framework that selects different conformers of molecules as positive samples.

The parameter settings differ across datasets. In our experiments, each dataset is split into three parts: training, validation, and test sets. The optimal hyperparameters are obtained on the validation set by grid search, a commonly used hyperparameter tuning approach. To reduce the influence of randomness on the results, we run each method four times on the test set of each dataset and report the averaged results. All experiments use the Adam optimizer and are performed on an NVIDIA RTX 3090 GPU. A summary of the main parameter settings in the training phase is shown in Table 2.

Table 2. Parameter settings in MolPath.

Parameter | Values
Batch size | 16, 32, 64, 128
Dropout | 0, 0.01, 0.02, 0.03
Learning rate | 10^-5, 10^-4, 10^-3
Hidden size | [128, 700]
λ | [0, 0.6]
K | [4, 12]
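Grid search itself reduces to iterating over the Cartesian product of the candidate values in Table 2; a sketch, where `train_eval` is a placeholder for training MolPath with a given configuration and returning its validation score:

```python
import itertools

# Hypothetical search space mirroring part of Table 2.
grid = {
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.0, 0.01, 0.02, 0.03],
    "lr": [1e-5, 1e-4, 1e-3],
}

def grid_search(train_eval):
    """Return the configuration with the best validation score."""
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```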

5.2 Performance evaluation and comparison

Table 3 presents the performance of MolPath on molecular property regression and classification datasets. For regression tasks, our method consistently outperforms all baseline methods across the three benchmark datasets and achieves a new state of the art on average, demonstrating its effectiveness and superiority in predicting molecular properties. We note that the improvement of MolPath on the Lipop dataset is marginal. This is because Lipop has a higher clustering coefficient than the other datasets, which poses a greater structural complexity challenge to our chain-oriented method.

Table 3. Performance comparison on classification (measured by ROC-AUC) and regression (measured by RMSE) tasks.a

Method | ESOL | FreeSolv | Lipop | Average | BACE | BBBP | ClinTox | Tox21 | SIDER | Average
GCN | 0.778 | 1.582 | 0.899 | 1.086 | 0.829 | 0.895 | 0.615 | 0.788 | 0.624 | 0.750
GIN | 0.619 | 1.136 | 0.756 | 0.837 | 0.850 | 0.890 | 0.753 | 0.824 | 0.618 | 0.787
GROVER | 0.983 | 2.176 | 0.817 | 1.325 | 0.826 | 0.700 | 0.812 | 0.743 | 0.648 | 0.746
GraphMVP | 1.092 | –b | 0.681 | –b | 0.812 | 0.724 | 0.790 | 0.631 | 0.639 | 0.719
GEM | 0.798 | 1.877 | 0.660 | 1.109 | 0.856 | 0.724 | 0.906 | 0.781 | 0.672 | 0.788
GeomGCL | 0.764 | 0.877 | 0.544 | 0.728 | –b | –b | 0.917 | 0.851 | 0.647 | –b
HiMol | 0.833 | 2.283 | 0.708 | 1.275 | 0.846 | 0.732 | 0.808 | 0.762 | 0.625 | 0.755
3D PGT | 1.061 | –b | 0.687 | –b | 0.809 | 0.721 | 0.794 | 0.738 | 0.606 | 0.734
Uni-Mol | 0.788 | 1.480 | 0.603 | 0.957 | 0.857 | 0.729 | 0.919 | 0.796 | 0.659 | 0.792
3DGCL | 0.778 | 1.441 | –b | –b | 0.792 | 0.855 | –b | –b | –b | –b
MolPath (ours) | 0.266 | 0.200 | 0.524 | 0.330 | 0.898 | 0.913 | 0.863 | 0.868 | 0.674 | 0.843

a The SOTA results are shown in bold; the underlined are the second best.

b The result is unavailable in the original paper.

For classification tasks, MolPath surpasses all competing methods on four out of the five datasets. On the whole, MolPath is clearly superior to the baselines in terms of average performance. Another observation is that, by leveraging 3D structural information, Uni-Mol performs well on graph classification, especially on ClinTox and BACE, which have higher clustering coefficients. These results suggest that our method is able to learn the chain-like backbone of molecular graphs but is limited on cycle-rich topology.

However, compared to HiMol, the model specialized in discerning specific structural components (i.e. motifs) within molecules, our method shows consistent superiority across different tasks. In particular, on regression tasks the chain-aware MolPath outperforms the motif-oriented HiMol by a large margin. This phenomenon may suggest, from another angle, that a majority of molecular graphs are low-clustering, which coincides with our empirical findings shown in Fig. 1. Accordingly, it is practically useful to learn chain structures rather than cycle-involved motifs.

5.3 Visualization of molecular representations

To explore the effectiveness of MolPath visually, we first embed the test sets of BBBP and ESOL into a latent space and then use the classic dimensionality reduction method t-SNE (t-distributed Stochastic Neighbor Embedding; Van der Maaten and Hinton 2008) for visualization. The results are reported in Fig. 4, where Fig. 4a–c correspond to MolPath, MolPath without IRDC, and MolPath without attention, respectively. A clear observation in Fig. 4a is that the separation of data points is significantly improved when all components are retained, surpassing the other two plots. In contrast, in both Fig. 4b and c, the two classes of data points become intermingled, providing compelling evidence for the superior classification performance of the full MolPath. Likewise, in Fig. 4d–f, without removing any components, the trend of MolPath is more obvious, and points with similar regression values are more clustered.
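The visualization pipeline is straightforward; a sketch, with random placeholders standing in for MolPath's learned test-set embeddings and the BBBP labels:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: `emb` stands in for learned graph embeddings,
# `labels` for the binary BBBP targets.
emb = np.random.rand(200, 128)
labels = np.random.randint(0, 2, size=200)

xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=8)
plt.savefig("tsne_bbbp.png")
```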

Figure 4. t-distributed Stochastic Neighbor Embedding (t-SNE) visualization of BBBP and ESOL. (a)–(c) demonstrate the effect of MolPath on classification tasks and confirm the effectiveness of each component; (d)–(f) demonstrate the effect of MolPath on regression tasks and confirm the effectiveness of each component.

5.4 Ablation study

In this section, we perform ablation experiments on the aforementioned classification and regression datasets to investigate how various components of the model and different values of λ impact experimental performance.

Figure 5a and b show the performance of the models after sequentially removing various components on the classification and regression datasets, respectively. Notably, when all components are retained, MolPath consistently achieves excellent results on both tasks. This emphasizes the model's ability to obtain high-quality molecular representations and highlights its effectiveness in learning long-range node representations. In addition, the results demonstrate the ability of MolPath to capture and enhance chain graph representations through effective node pooling.

Figure 5. Ablation effect of MolPath on the classification and regression datasets. (a) Performance of MolPath with different ablations on the classification datasets. (b) Performance of MolPath with different ablations on the regression datasets.

Next, we examine the influence of the parameter λ in the initial residual difference connection (IRDC) on experimental performance. As discussed above, IRDC allows the model to utilize additional information that is not initially harnessed within the node features during convolution. The magnitude of λ controls the ratio between the initial node representation and the shortest path representations of different lengths.

As depicted in Fig. 6a, an increase in λ leads to a decline in performance on three classification datasets, BBBP, ClinTox, and Tox21. However, an intriguing deviation is observed on the BACE dataset, where performance improves with higher values of λ, consistent with the trends illustrated in Fig. 1. This improvement can be attributed to BACE's intrinsic characteristics, namely a higher clustering coefficient and a greater prevalence of ring structures, which make the initial representation more informative. Similarly, Fig. 6b shows that, for regression tasks, performance also deteriorates with increasing λ. These observations suggest that modulating λ allows the convolutional layers to capture the relevant information and thereby optimize overall model performance.

Figure 6. Performance of the model under different λ. (a) Classification datasets; (b) regression datasets.

6 Conclusion

In this paper, we introduce a novel chain-aware graph neural network model, MolPath, to capture the backbone of molecular graphs. In particular, MolPath uses an IRDC module to reduce the information redundancy between the shortest paths of different lengths. To differentiate the impact of different nodes on the graph, we adopt an attention mechanism for information synthesis. Extensive experiments show the consistent outperformance of MolPath over state-of-the-art methods on both prediction tasks.

It is interesting to note that the outperformance of MolPath on most molecular graphs, together with its slightly inferior performance on an individual dataset (i.e. ClinTox), compared with the motif-based method HiMol, can reflect the universality of low clustering in molecular graphs. It also suggests that combining the two types of models might be more powerful for molecular graph representation learning. Besides, the performance of the model is related to the initial representation of the molecules; exploring the influence of spatial structure on the ultimate prediction accuracy is a promising direction for future work.

Conflict of interest

The authors have no conflict of interest to declare.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC 62276099) and the Natural Science Foundation of Sichuan Province (2023NSFSC0501).

References

Bran MA, Cox S, Schilter O et al. Augmenting large language models with chemistry tools. Nat Mach Intell 2024;6:525–35.

David L, Thakkar A, Mercado R et al. Molecular representations in AI-driven drug discovery: a review and practical guide. J Cheminform 2020;12:56.

Degen J, Wegscheid-Gerlach C, Zaliani A et al. On the art of compiling and using 'drug-like' chemical fragment spaces. ChemMedChem 2008;3:1503–7.

Devlin J, Chang M-W, Lee K et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: The North American Chapter of the Association for Computational Linguistics, Minneapolis, 2019.

Edwards C, Lai T, Ros K et al. Translation between molecules and natural language. In: Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, 2022.

Fang X, Liu L, Lei J et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 2022;4:127–34.

Giraldo JH, Skianis K, Bouwmans T et al. On the trade-off between over-smoothing and over-squashing in deep graph neural networks. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 566–576, Birmingham, 2023.

Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 2017.

Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.

Huang J, Du L, Chen X et al. Robust mid-pass filtering graph convolutional networks. In: Proceedings of the ACM Web Conference 2023, pp. 328–338, Austin, TX, 2023a.

Huang J, Li P, Huang R et al. Revisiting the role of heterophily in graph representation learning: an edge classification perspective. ACM Trans Knowl Discov Data 2023b;18:1–17.

Huang J, Shen J, Shi X et al. On which nodes does GCN fail? Enhancing GCN from the node perspective. In: Forty-first International Conference on Machine Learning, Vienna, Austria, 2024.

Irwin R, Dimitriadis S, He J et al. Chemformer: a pre-trained transformer for computational chemistry. Mach Learn: Sci Technol 2022;3:015022.

Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations, Toulon, France, 2017.

Krenn M, Ai Q, Barthel S et al. SELFIES and the future of molecular string representation. Patterns 2022;3:100588.

Li H, Zhao D, Zeng J. KPGT: knowledge-guided pre-training of graph transformer for molecular property prediction. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 857–67, Washington, DC, 2022a.

Li S, Zhou J, Xu T et al. GeomGCL: geometric graph contrastive learning for molecular property prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 4541–4549, 2022b.

Liu S, Wang H, Liu W et al. Pre-training molecular graph representation with 3D geometry. In: 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

Moon K, Im H-J, Kwon S. 3D graph contrastive learning for molecular property prediction. Bioinformatics 2023;39:btad371.

Rong Y, Bian Y, Xu T et al. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst 2020;33:12559–71.

Stärk H, Beaini D, Corso G et al. 3D Infomax improves GNNs for molecular property prediction. In: International Conference on Machine Learning, pp. 20479–20502, PMLR, Baltimore, 2022.

Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–2605.

Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Vol. 30, Long Beach, CA, USA, 2017.

Wang J, Guan J, Zhou S. Molecular property prediction by contrastive learning with attention-guided positive sample selection. Bioinformatics 2023a;39:btad258.

Wang X, Zhao H, Tu W-W et al. Automated 3D pre-training for molecular property prediction. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2419–30, Long Beach, CA, 2023b.

Wang Y, Wang J, Cao Z et al. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 2022;4:279–87.

Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6.

Wu Z, Ramsundar B, Feinberg EN et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci 2018;9:513–30.

Xu K, Hu W, Leskovec J et al. How powerful are graph neural networks? In: International Conference on Learning Representations, New Orleans, 2019.

Zang X, Zhao X, Tang B. Hierarchical molecular graph self-supervised learning for property prediction. Commun Chem 2023;6:34.

Zhou G, Gao Z, Ding Q et al. Uni-Mol: a universal 3D molecular representation learning framework. In: The 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.

Zhu J, Xia Y, Wu L et al. Unified 2D and 3D pre-training of molecular representations. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2626–2636, Washington, DC, 2022.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.