Yang Yang, Zixuan Zheng, Yuyang Xu, Huifang Wei, Wenying Yan. BioGSF: a graph-driven semantic feature integration framework for biomedical relation extraction. Briefings in Bioinformatics, Volume 26, Issue 1, January 2025, bbaf025. https://doi.org/10.1093/bib/bbaf025
Abstract
The automatic and accurate extraction of diverse biomedical relations from literature constitutes the core of medical knowledge graphs, which are indispensable for healthcare artificial intelligence. Currently, fine-tuning through stacking various neural networks on pre-trained language models (PLMs) represents a common framework for end-to-end resolution of the biomedical relation extraction (RE) problem. Nevertheless, sequence-based PLMs, to a certain extent, fail to fully exploit the connections between semantics and the topological features formed by these connections. In this study, we present a graph-driven framework named BioGSF for RE from the literature, integrating shortest dependency paths (SDP) with an entity-pair graph through a graph neural network model. Initially, we leverage dependency relationships to obtain the SDP between entities and incorporate this information into the entity-pair graph. Subsequently, a graph attention network is used to acquire the topological information of the entity-pair graph. Ultimately, the obtained topological information is combined with the semantic features of the contextual information for relation classification. Our method was evaluated on two distinct datasets, S4 and BioRED. The outcomes reveal that BioGSF not only attains superior performance over previous models, with micro-F1 scores of 96.68% (S4) and 96.03% (BioRED), but also requires the shortest running time. BioGSF emerges as an efficient framework for biomedical RE.
Introduction
Amidst the exponential growth of biomedical literature, efficiently tapping into this extensive knowledge base has escalated into a formidable task. Knowledge graph (KG) technologies, with their ability to decipher the intricate web of connections among biomedical concepts, have surfaced as potential solutions to this predicament, thereby garnering significant research focus [1, 2]. Currently, a multiplicity of knowledge graphs have emerged, such as NetMe 2.0 [3], SPOKE [4], MKG-GC [5], etc. Knowledge graphs are making significant contributions in various domains, including facilitating drug repurposing initiatives [6], enhancing drug discovery efforts [7], elucidating molecular regulatory mechanisms [8], and informing clinical decision-making processes [9], among others. The knowledge extraction phase in KG development comprises two key components: entity extraction, involving the identification and retrieval of pertinent entities from the literature, and relation extraction (RE), which determines the connections among these extracted entities.
Relation extraction, a pivotal component of knowledge extraction, has garnered considerable attention, driving extensive research by various scholars and collaborative teams. Early attempts at RE predominantly relied on rule-based methods, focusing on analyzing textual syntactic and semantic patterns [10, 11]. However, such methods require manual updates of expressions, and the constructed rules may be tailored only to specific tasks. As a result, researchers transitioned towards machine-learning techniques for automated relation determination. Singhal et al. [12], for instance, introduced a decision tree method to accurately identify disease-related point mutations from biomedical literature and enhanced model performance by incorporating features such as statistical, distance, and affective elements.
The aforementioned methods require manual feature engineering, with distinct features needing to be designed for different text types. However, deep learning-based methods elegantly address this challenge by automatically learning feature representations from text, capturing various levels of information. These methods exhibit a superior understanding of contextual information and semantic relations. Currently, deep learning models for the RE task can be broadly categorized into two types: (i) Pre-trained language models that utilize large-scale text datasets for pre-training, enabling them to learn the language’s deep semantic and syntactic nuances; (ii) Task-specific deep learning models, which enrich relation information by integrating sentence features, entity information features, dependency features, and other relevant features.
In the realm of pre-trained language models, Lee and colleagues [13] introduced BioBERT, a model based on the BERT architecture. It was trained on data from Wikipedia, BooksCorpus, and PubMed article abstracts, emerging as the most widely used medical pre-trained language model. Meanwhile, the Generative Pre-trained Transformer (GPT), a recently popular large language model, has gained significant attention and increasing application within the biomedical sphere. For instance, Luo et al. [14] presented BioGPT, a GPT variant obtained by pre-training a GPT-2 architecture on a wealth of domain-specific biological data, emphasizing comprehensive coverage of biomedical knowledge. Recently, Jin and colleagues [15] proposed a GPT model tailored for the genetics domain, innovatively utilizing the NCBI Web API to answer genomic inquiries.
On the other hand, to extract relational information more accurately, researchers have integrated more effective sentence feature information into deep learning models. Features based on dependency paths have contributed to improving the accuracy of RE. Miwa et al. proposed an end-to-end LSTM-RNN model based on dependency trees [16]. Later, Peng et al. introduced graph LSTM, which enables bidirectional information transfer between child and parent nodes through dependency relationships [17]. Lai et al. incorporated a neighbor attention mechanism and considered the dependency items of words [18]. Tian et al. designed a dependency-driven approach with attentive graph convolutional networks, which obtains head words and child nodes from dependency trees to build local connections and uses the shortest dependency paths (SDP) to build global connections [19]. Chen et al. not only utilized the dependency relationships and types between words but also distinguished between reliable and noisy dependency information through weighting [20].
Although external information such as dependency relationships improves the performance of relation recognition, the connections between entity pairs themselves also provide rich feature information. Based on this, we propose a novel graph-driven framework, BioGSF, for biomedical RE built on the entity-pair graph and the SDP. We use dependency relationships to obtain the SDP between entities and incorporate this information into the entity-pair graph. We then employ a graph attention network to obtain topological information of the entity-pair graph. Finally, we combine the acquired topological information with the semantic features of contextual information for relation classification. This framework harnesses the complementary strengths of diverse methodologies to overcome the constraints of current approaches and to enhance the model's efficacy and generalization capabilities in RE.
Methods
Datasets
In this study, we opted for two extensive datasets, namely S4 and BioRED [21], for the training and testing of our model. The S4 dataset was curated by carefully selecting four distinct relationship types: Protein–Protein Interaction (PPI), Drug–Drug Interaction (DDI), Chemical–Protein Interaction (CPI), and Chemical–Disease Interaction (CDI). These relationships were extracted from Dataset-11 in our prior research [5], which amalgamated eight different resources, including BC5CDR [22], BC6ChemProt [23], BC7DrugProt [24], LLL [25], AIMed, HPRD-50, BioInfer, and IEPA [26]. On the other hand, BioRED is a document-level biomedical relation dataset encompassing 600 PubMed abstracts [21]. Apart from the aforementioned four relationship types, BioRED also features Protein–Disease Interaction (PDI), Disease–Variant Interaction (DVI), and Chemical–Variant Interaction (CVI). Both datasets provide pre-annotated entity information, and the sentences used in our study are drawn from them with entity tags already in place; entity recognition is therefore outside the scope of this work. Details of the two datasets are summarized in Table 1.
Table 1. Relation type statistics of the S4 and BioRED datasets

| Relation type | S4: Train set | S4: Test set | BioRED: Train set | BioRED: Test set |
|---|---|---|---|---|
| PPI | 21,090 | 1114 | 3744 | 614 |
| DDI | 11,785 | 629 | 2407 | 690 |
| CPI | 60,850 | 3204 | 3984 | 899 |
| CDI | 5659 | 298 | 4303 | 1166 |
| PDI | – | – | 7216 | 1715 |
| DVI | – | – | 1819 | 686 |
| CVI | – | – | 260 | 26 |
| Total | 99,384 | 5236 | 23,733 | 5796 |
Overall architecture
Our model BioGSF is primarily divided into three modules: the embedding layer, the feature fusion layer, and the classification layer. BioGSF takes into account not only the dependencies between entities but also the topological information of the entity-pair graph. The overall framework is illustrated in Fig. 1. We chose BioBERT [13] as the pre-trained language model to obtain the embedding information of sentences. The feature fusion layer integrates both sentence semantic features and graph topological features. For sentence semantics, we use a fully connected layer to output contextual information, entity information, and SDP information. To capture graph information, we combine entity-pair information and SDP information, which are then input into a graph attention network (GAT) to obtain topological information. Finally, semantic and topological information are integrated and passed to the classification layer for categorization.

Embedding layer and entity feature representation
Given an input sentence S, we derive its embedding via BioBERT, which has been trained on a corpus comprising 4.5 billion words from PubMed abstracts and 13.5 billion words from PubMed Central full-text articles. Within the sentence, we designate the positions of two entities, denoted as Entity 1 and Entity 2. The start and end of Entity 1 are marked by ‘[s1]’ and ‘[e1]’, respectively, while ‘[s2]’ and ‘[e2]’ similarly delineate the boundaries of Entity 2. This annotation process aids in the extraction of entity features during subsequent steps. Initially, the sentence undergoes tokenization to yield individual tokens. BioBERT then generates embeddings for each token, encoding positional, semantic, and contextual information. Furthermore, BioBERT produces a ‘[CLS]’ output that encapsulates the overall semantic content of the text sequence. We employ this as a sentence feature, which will be integrated with the forthcoming fusion features.
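To make this marking scheme concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that wraps two entity mentions with the boundary markers:

```python
# Hypothetical helper: wrap two entity mentions with [s1]/[e1] and [s2]/[e2]
# markers, given character offsets (end-exclusive) with entity 1 before entity 2.
def mark_entities(text: str, e1: tuple, e2: tuple) -> str:
    (s1, t1), (s2, t2) = e1, e2
    return (text[:s1] + "[s1] " + text[s1:t1] + " [e1]"
            + text[t1:s2] + "[s2] " + text[s2:t2] + " [e2]" + text[t2:])

print(mark_entities("TREM2 is associated with AD.", (0, 5), (25, 27)))
# -> [s1] TREM2 [e1] is associated with [s2] AD [e2].
```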
However, it is important to note that an entity can be tokenized into multiple tokens. To capture more precise semantic features of the entity, we encode the positions of the classified entities by labeling their locations with a mask, and each entity is represented by the average embedding of its constituent tokens. Let $e_{start}$ and $e_{end}$ denote the start and end positions of the entity, respectively, and let $H$ represent the output embedding from BioBERT. The representation of the entity is given by Equation (1):

$$E = \mathrm{FCLayer}\left(\frac{1}{e_{end}-e_{start}+1}\sum_{t=e_{start}}^{e_{end}} H_t\right) \tag{1}$$

where FCLayer signifies a linear transformation layer coupled with a dropout layer, whose purposes are to regulate the output dimensionality of the entity embedding and to mitigate overfitting, respectively.
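As an illustration of Equation (1), the following PyTorch sketch implements the average pooling and the FCLayer; the hidden size (768 for BioBERT-base), output dimension, and dropout rate are assumed values:

```python
import torch
import torch.nn as nn

class FCLayer(nn.Module):
    """Linear transformation plus dropout, as described for Equation (1)."""
    def __init__(self, d_in: int = 768, d_out: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.dropout(x))

def entity_representation(h: torch.Tensor, e_start: int, e_end: int,
                          fc: FCLayer) -> torch.Tensor:
    """h: (seq_len, hidden) BioBERT token embeddings; positions are inclusive."""
    span = h[e_start : e_end + 1]       # embeddings of the entity's tokens
    return fc(span.mean(dim=0))         # average pooling, then FC + dropout

fc = FCLayer()
h = torch.randn(128, 768)               # stand-in for a BioBERT output
e1 = entity_representation(h, 5, 7, fc) # entity spanning tokens 5..7
```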
Dependency feature representation
Although pre-trained language encoding achieves good performance, it often captures contextual information while ignoring the grammatical role of words in sentences. For instance, a word can function as a subject, predicate, attribute, adverbial, or complement in different sentences. To address this issue, researchers have proposed using dependency relationships to obtain richer entity relationship information. In this study, we employed the NLP tool ScispaCy [27] to obtain dependency relationships from sentences. ScispaCy is a natural language processing library built on spaCy and trained on biomedical corpora, yielding superior outcomes for semantic analysis of medical data. The part-of-speech tagging and dependency parsing modules within ScispaCy use the GENIA 1.0 corpus [28], PubMed abstracts, and the OntoNotes 5.0 corpus [29, 30] as their training corpora, while the named entity recognition module relies on the MedMentions dataset [31]. The parsing results of ScispaCy include the head word, token, category, and child nodes for each token. The category is used to distinguish the annotated entity pairs. When an entity is parsed into multiple tokens, this facilitates marking all entity positions, their head words, and child nodes.
Leveraging the ScispaCy toolkit, we can tokenize sentences and delve into their syntactic structure to extract part-of-speech tags and establish dependency relationships among words, as depicted in Fig. 2A. The resultant dependency relationships are organized into a tree structure that interlinks the entities. Drawing upon this dependency tree, we implemented the Breadth-First Search algorithm to identify the SDP between pairs of entities, as depicted in Fig. 2B. In cases where entity pairs span across sentences, the ROOT node may not be singular, potentially hindering the identification of an SDP. To mitigate this, we utilize the head word of the entity, as identified by ScispaCy, as the SDP. The head word typically serves as the central term within a subtree, capturing the primary semantic relationship within the sentence or clause. This approach ensures that the model maintains robust comprehension and inferential capabilities even within intricate syntactic frameworks by constructing pathways that connect semantic cores. The detailed algorithm is outlined in Algorithm 1.

Figure 2. Shortest dependency path acquisition and processing module. A. ScispaCy parsing results. B. Example of a shortest dependency path. C. Shortest dependency path representation acquisition module.
Algorithm 1. Shortest dependency path extraction

Input: marked sentence S1, marked sentence S2
Output: shortest dependency path
Tools: ScispaCy

1. doc1 = [{h1, w1, t1, c1}, ...] ← ScispaCy(S1)
2. doc2 ← ScispaCy(S2)
3. Candidate ROOT nodes are found from the parsing results, where {h, w, t, c} = (head word, word, type, children)
4. for each ROOT in the candidate ROOTs do
       path1 ← ScispaCy(ROOT, Entity1)
       path2 ← ScispaCy(ROOT, Entity2)
   end
5. if path1 and path2 share a common path then
       Path ← shortest dependency path
   else
       Path ← Entity1's head word + Entity2's head word
   end
6. return Path
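As a concrete illustration of the core step of Algorithm 1, the sketch below parses a sentence with ScispaCy, builds an undirected graph over the dependency arcs, and runs a shortest-path (breadth-first) search; `en_core_sci_sm` is a standard ScispaCy model, while matching each entity by a single surface token is a simplifying assumption of this sketch:

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_sci_sm")  # ScispaCy biomedical model

def shortest_dependency_path(sentence: str, ent1: str, ent2: str) -> list:
    doc = nlp(sentence)
    graph = nx.Graph()
    for tok in doc:                          # undirected head -> child arcs
        for child in tok.children:
            graph.add_edge(tok.i, child.i)
    src = next(t.i for t in doc if t.text == ent1)
    dst = next(t.i for t in doc if t.text == ent2)
    try:
        path = nx.shortest_path(graph, source=src, target=dst)  # BFS (unweighted)
        return [doc[i].text for i in path]
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # Fallback from Algorithm 1: the two entities' head words.
        return [doc[src].head.text, doc[dst].head.text]

print(shortest_dependency_path("TREM2 variants increase the risk of AD.",
                               "TREM2", "AD"))
```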
To extract more effective information from the SDP, we designed a dedicated feature extraction module, as shown in Fig. 2C. First, the obtained SDP is encoded by BioBERT, denoted $H_D$; richer and more effective embedded information is then obtained through a multi-head attention mechanism; finally, a one-dimensional convolutional network (1D-CNN) performs dimensionality reduction to obtain the SDP representation. The 1D-CNN effectively extracts local features from sequential data and captures local patterns and characteristics at different positions in the sequence. The process is represented in Equation (2):

$$D = \mathrm{CNN_{1D}}\left(\mathrm{MultiHead}\left(H_D\right)\right) \tag{2}$$
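A minimal PyTorch sketch of this module follows; the head count, kernel size, and the final max-pooling over path positions are our assumptions:

```python
import torch
import torch.nn as nn

class SDPFeature(nn.Module):
    """Fig. 2C / Equation (2): multi-head self-attention, then a 1D convolution."""
    def __init__(self, d_model: int = 768, d_out: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_out, kernel_size=3, padding=1)

    def forward(self, h_d: torch.Tensor) -> torch.Tensor:
        """h_d: (batch, sdp_len, d_model) BioBERT embeddings of the SDP tokens."""
        a, _ = self.attn(h_d, h_d, h_d)      # multi-head self-attention
        c = self.conv(a.transpose(1, 2))     # (batch, d_out, sdp_len)
        return c.max(dim=-1).values          # pooled SDP representation D (assumed)

d = SDPFeature()(torch.randn(4, 16, 768))   # -> shape (4, 256)
```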
Graph topology information representation
Mainstream graph neural networks primarily treat the words of a single sentence as nodes in the graph. Here, we instead utilize the entity-pair graph [32] to obtain graph topology information. In this graph, the node set represents entity pairs, and edges represent connections between entity pairs: if two entity pairs share a common entity, they are connected by an edge, represented as 1 in the adjacency matrix. As illustrated in Fig. 3, sentence A contains the entity pair (e1: TREM2; e2: AD), sentence B contains (e1: R47H; e2: AD), and sentence C contains (e1: R47H; e2: TREM2). Each pair of sentences A, B, and C shares a common entity, so all three nodes are connected by edges.
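The adjacency construction can be illustrated with the three pairs from Fig. 3; the self-loops are an assumption commonly used with attention-based graph models:

```python
import numpy as np

pairs = [("TREM2", "AD"), ("R47H", "AD"), ("R47H", "TREM2")]  # sentences A, B, C

n = len(pairs)
adj = np.eye(n, dtype=int)                 # self-loops (assumed)
for i in range(n):
    for j in range(i + 1, n):
        if set(pairs[i]) & set(pairs[j]):  # shared entity => edge = 1
            adj[i, j] = adj[j, i] = 1

print(adj)  # every pair here shares an entity, so the graph is fully connected
```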

GAT is a network architecture for graph-structured data that assigns weights to neighbors through an attention mechanism and dynamically adjusts the relationship weights between different nodes during learning. We therefore combine the information of entity pairs with the SDP information and input them into the GAT. The combined information is denoted as the fused embedding $H = [E_1, E_2, D]$, where $E_1$ and $E_2$ represent the feature information of the two entities and $D$ represents the dependency feature information. Given the input embeddings of the graph as $H = \{h_1, h_2, \dots, h_n\}$, the attention score is calculated as $a(Wh_i, Wh_j)$, which represents the importance of node $j$ to node $i$. Let $N_i$ denote the neighbors of node $i$ in the graph. We normalize over all neighbors using softmax and apply LeakyReLU activation to obtain the attention coefficient $\alpha_{ij}$, as shown in Equation (3):

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a\left(Wh_i, Wh_j\right)\right)\right)}{\sum_{k \in N_i}\exp\left(\mathrm{LeakyReLU}\left(a\left(Wh_i, Wh_k\right)\right)\right)} \tag{3}$$

Once the normalized attention scores have been obtained, we extract the output features corresponding to each node. To guarantee the robustness of the self-attention learning process, a multi-head attention mechanism aggregates $K$ distinct attention results, yielding the output $h_i^{(0)}$ (Equation (4)). A further GAT layer is appended at the end, functioning as the output layer (Equation (5)):

$$h_i^{(0)} = \Big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{(0)} h_j\Big) \tag{4}$$

$$F_g = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W^{(1)} h_j^{(0)}\Big) \tag{5}$$

where $F_g$ is the topological information of the final graph, $W^{(0)}$ and $W^{(1)}$ are the weight matrices of the linear transformations, which reduce the input embeddings to dimensions $d_m$ and $d_g$ respectively, and $\sigma$ is the activation function.
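A hedged sketch of such a two-layer GAT, using PyTorch Geometric's `GATConv`; the layer sizes and the ELU activation are assumptions:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class PairGAT(torch.nn.Module):
    """Two-layer GAT over the entity-pair graph (Equations (3)-(5))."""
    def __init__(self, d_in: int, d_m: int = 128, d_g: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(d_in, d_m, heads=heads, concat=True)      # K heads
        self.gat2 = GATConv(d_m * heads, d_g, heads=1, concat=False)  # output layer

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """h: (num_pairs, d_in) fused [E1, E2, D] node features."""
        h0 = F.elu(self.gat1(h, edge_index))  # multi-head layer, Equation (4)
        return self.gat2(h0, edge_index)      # F_g, Equation (5)
```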
Relation classification
We utilize the aforementioned information to classify relationships. Initially, we integrate the semantic features of the sentence, the entity features, and the shortest dependency path information to generate $F_S$, as shown in Equation (6):

$$F_S = \mathrm{FCLayer}\left(H_{cls} \oplus E_1 \oplus E_2 \oplus D\right) \tag{6}$$

where $H_{cls}$ denotes the sentence feature, $\oplus$ represents the concatenation operation, and the FCLayer reduces the dimensionality of the four types of information to $d_s$.

The topological information $F_g$ is then incorporated, and the outcome is classified using a softmax classifier. Let $r$ be the number of label categories, $W_r$ a weight matrix of size $r \times (d_s + d_g)$, and $b_r$ the bias. The resulting probability is shown in Equation (7):

$$p\left(y \mid t\right) = \mathrm{softmax}\left(W_r\left(F_S \oplus F_g\right) + b_r\right) \tag{7}$$

To quantify the model's loss, we employ cross-entropy, as defined in Equation (8):

$$\mathcal{L} = -\sum_{i} y_i \log p\left(y_i \mid t_i\right) \tag{8}$$
For tag prediction, we apply the argmax function to determine the most probable label from p(y|t).
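The classification layer and loss of Equations (6)-(8) reduce to a few lines of PyTorch; the batch size and the feature dimensions $d_s$, $d_g$ below are illustrative:

```python
import torch
import torch.nn as nn

d_s, d_g, r = 256, 64, 8                   # semantic dim, graph dim, label count
classifier = nn.Linear(d_s + d_g, r)       # W_r of size r x (d_s + d_g), bias b_r
loss_fn = nn.CrossEntropyLoss()            # applies log-softmax internally

f_s = torch.randn(32, d_s)                 # F_S: fused semantic features
f_g = torch.randn(32, d_g)                 # F_g: GAT topological features
labels = torch.randint(0, r, (32,))

logits = classifier(torch.cat([f_s, f_g], dim=-1))
loss = loss_fn(logits, labels)             # cross-entropy, Equation (8)
pred = logits.argmax(dim=-1)               # most probable label from p(y|t)
```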
The detailed model parameters are listed in Supplementary Table S1.
Evaluation metrics
To evaluate the performance of our method and compare it with previous methods, we utilized the micro-average and macro-average F1-score metrics. Both scores are determined from precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The micro-average aggregates true positives, false positives, and false negatives across all relation classes before computing F1, whereas the macro-average is the unweighted mean of the per-class F1 scores.
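For instance, both averages can be computed with scikit-learn on toy labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

micro = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over classes
macro = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```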
Results
Ablation studies
To validate the efficacy of our model, we performed ablation studies on each of the proposed components: the choice of graph neural network, the SDP feature network layer, the maximum length L of the SDP, and the learning rate.
Initially, to validate the effectiveness of the graph neural network, we compared three variants: one that directly outputs the fused information obtained by combining the SDP with entity pairs to the classification layer, and others that feed this information into a graph neural network (GCN or GAT) before classification. As shown in Table 2, the combination of GAT, the multi-head attention mechanism, and the 1D-CNN yields the best performance, with micro-F1 scores of 96.47 on S4 and 96.03 on BioRED, significantly outperforming the other methods. Furthermore, we compared our method with the approach that utilizes the average embedding of entities in RBERT [33]. The results showed that the combined use of entity information and the SDP outperforms the method that relies solely on entity information, indicating that the SDP has a positive effect on relation classification.
Table 2. Ablation results (micro-F1) of different layer combinations on S4 and BioRED

| Layer | S4 | BioRED |
|---|---|---|
| AVG | 94.79 | 95.48 |
| MA + 1D-CNN | 94.71 | 95.67 |
| GCN + AVG | 95.03 | 95.60 |
| GCN + MA + 1D-CNN | 94.65 | 95.69 |
| GAT + AVG | 95.02 | 95.26 |
| GAT + MA + 1D-CNN (ours) | 96.47 | 96.03 |

AVG: entity average embedding method; MA: multi-head self-attention mechanism.
Next, we investigated the maximum length L of the SDP. Here, the length of the SDP refers to the length of the result obtained after tokenizing the SDP with BioBERT. When sentences are excessively long, the SDP between two entities may also become long, potentially introducing significant noise into the obtained shortest dependency. We analyzed the distribution of SDP lengths in the two datasets. As shown in Fig. 4A, over 95% of sentences have an SDP length of at most 16, and all sentences have an SDP length below 64. To test the effectiveness and generalization of our model, we set the maximum SDP length to 16, 32, and 64, truncating longer paths so that the dataset size remains the same across the three settings.

Figure 4. SDP length statistics and their effect. A. Proportion of different SDP lengths. B. Comparison of different SDP lengths on model performance.
The results are shown in Fig. 4B. When L = 32, the model performs best on the S4 dataset but poorly on the BioRED dataset, achieving only 94.6% accuracy. When L = 64, although it performs slightly worse on the S4 dataset compared to L = 32, it significantly outperforms other settings on BioRED. L = 64 covers the SDP of all sentences, ensuring the completeness and generalization of the model.
Moreover, we performed an ablation study on the learning rate, testing settings of 1e-5, 2e-5, and 3e-5 to analyze its impact on model performance. We observed that with a learning rate of 1e-5 the model performed best, converging more effectively and avoiding issues such as oscillation or overfitting. Based on this result, we selected 1e-5 as the optimal learning rate and conducted all subsequent experiments on this basis. Detailed hyper-parameter information is listed in Table S1 and Fig. S1.
Cross-validation
To rigorously validate the performance and stability of our proposed model, we conducted five-fold cross-validation on the S4 and BioRED datasets. As shown in Table 3, on the S4 training set the average micro-F1 and macro-F1 are 95.93 ± 0.29 and 90.42 ± 0.61, respectively. Similarly, on the BioRED training set, the micro-F1 and macro-F1 averages are 94.83 ± 0.37 and 93.53 ± 0.41. These results indicate that our model is remarkably stable on both datasets, confirming its consistent performance and robust generalization capabilities.
Table 3. Cross-validation performance of our model on training sets from S4 and BioRED

| Datasets | Micro-F1 | Macro-F1 |
|---|---|---|
| S4 | 95.93 ± 0.29 | 90.42 ± 0.61 |
| BioRED | 94.83 ± 0.37 | 93.53 ± 0.41 |
Performance comparison with different models
To assess the efficacy of our model, we conducted a comparative analysis of BioGSF's performance against six previous methods (BioBERT [13], RBERT [33], EPGNN [32], AGCN [19], BioEGRE [34], and GPT-4o [35]) using the S4 and BioRED datasets.
The comparison results in Table 4 show that our BioGSF model exhibits the best performance on both datasets. This achievement is primarily due to the efficient integration of the SDP with entity-pair information, a strategy that considerably boosts classification performance. Although AGCN demonstrates commendable performance on the BioRED dataset, it demands extensive computational resources and a lengthy validation phase: our model generates results in 75 s under similar conditions, whereas AGCN requires 311 s. The EPGNN model, which applies a GCN centered on entity pairs, also shows robust performance on both datasets; however, it disregards the SDP, relying exclusively on internal entity-pair information. Furthermore, we evaluated GPT-4o using zero-shot and one-shot prompting. Specifically, we designed appropriate prompts to guide the model's responses without any additional training or tuning, directly leveraging its general language understanding to test its performance on the RE task. As shown in Table 4, its performance was not satisfactory: although GPT-4o is strong at general language tasks, its results on this domain-specific task are poor, and it fails to fully exploit the contextual information.
Table 4. Comparative performance evaluation of our model against other state-of-the-art models on the test sets from the S4 and BioRED datasets

| Model | S4 Micro-F1 | S4 Macro-F1 | S4 Time | BioRED Micro-F1 | BioRED Macro-F1 | BioRED Time |
|---|---|---|---|---|---|---|
| BioBERT | 93.85 | 86.20 | 78 s | 94.38 | 89.71 | 75 s |
| RBERT | 94.89 | 88.09 | 75 s | 94.81 | 94.27 | 75 s |
| EPGNN | 94.98 | 90.62 | 75 s | 94.63 | 92.17 | 75 s |
| AGCN | 92.22 | 79.11 | 311 s | 96.01 | 94.42 | 362 s |
| BioEGRE | 89.57 | 75.41 | 1038 s | 94.17 | 93.69 | 1366 s |
| GPT-4o (zero-shot) | 57.79 | 52.33 | 1 h 30 m | 76.45 | 80.65 | 1 h 30 m |
| GPT-4o (one-shot) | 49.33 | 50.80 | 1 h 30 m | 89.33 | 80.41 | 1 h 30 m |
| BioGSF | 96.68 | 92.46 | 75 s | 96.03 | 94.65 | 73 s |
Finally, to further validate the efficacy of our model, we constructed a new data set from the most recently published papers. Specifically, the data set was built from abstracts of papers indexed in PubMed during October 2024 (search terms: protein–protein interaction[tiab] cancer[tiab]; filters: from 2024/10/1 to 2024/10/31) and manually annotated (Table S2). Re-evaluated on this data set, our model achieved the best performance, with micro-F1 and macro-F1 scores of 97.44 and 78.77, respectively (Table S3).
Our study underscores the profound benefits of merging SDP information with entity pair data, thereby greatly facilitating information acquisition by graph neural networks. This observation aligns seamlessly with the findings derived from ablation studies, further validating the efficacy of our approach.
Case study
To further investigate the effectiveness of our model for entity relationship extraction in medical literature, we randomly selected recently published research articles from PubMed as the basis for our case study. The results, presented in Table 5, aim to demonstrate the performance differences among various models in predicting relationship types. A notable observation from the table is that advanced models like BioBERT accurately identify the relationship type as CDI between widely known drug types, such as 5-Fluorouracil, and specific diseases. This indicates that the models perform well when dealing with drug entities that have extensive literature support.
Table 5. Case study: relation types predicted by different models on recently published sentences

| Sentence | Gold Standard | BioBERT | RBERT | EPGNN | AGCN | BioEGRE | BioGSF |
|---|---|---|---|---|---|---|---|
| Finally, according to the results of drug susceptibility analysis, docetaxel, 5-Fluorouracil, [s1] gemcitabin [e1], and paclitaxel were found to be more sensitive to [s2] gastric cancer [e2]. | CDI | PDI | PDI | PDI | CDI | CDI | CDI |
| Finally, according to the results of drug susceptibility analysis, docetaxel, [s1] 5-Fluorouracil [e1], gemcitabin, and paclitaxel were found to be more sensitive to [s2] gastric cancer [e2]. | CDI | CDI | CDI | PDI | CDI | CDI | CDI |
| Here, we constructed [s1] ACZ2 [e1] and investigated its efficacy and potential mechanism for [s2] gastric cancer [e2] in vitro and in vivo. | CDI | PDI | PDI | PDI | PDI | CDI | CDI |
| Changes in gene expression, chemokine and cytokine secretion, plasma [s1] IgE [e1], and lung histology were quantified using RT-qPCR, ELISA, and immunohistochemistry, respectively. Arg1 elimination [s2] OVA [e2] also decreased number and tightness of correlations between adaptive changes in lung function and inflammatory parameters in OVA/OVA-treated female mice. | PPI | DDI | DDI | DDI | CVI | CPI | DDI |
However, the situation changes when encountering new or less-reported drugs, such as gemcitabin and ACZ2, especially when they appear in abbreviated forms. Because entity information for these newer drugs is scarce, relying solely on entity information to determine relationship types becomes more challenging. It is worth noting that the EPGNN model struggles in such cases, primarily because it relies heavily on direct information between entity pairs, neglecting the richer and crucial contextual information in the text. This design limitation often leads it to mistakenly predict relationships as PDI when faced with the aforementioned challenges. In contrast, our model and BioEGRE adopt different strategies, focusing on extracting and utilizing rich entity and contextual information from dependency information. This approach enables accurate determination of relationship types between new drugs and other entities, even when direct entity information is scarce. Furthermore, our model propagates entity information and dependency information through the entity-pair graph, allowing other related entities to learn relationship type information.
Discussion
In this study, we present a model that integrates the SDPs and the entity-pair graph for multi-type biomedical entity relation extraction from the literature. For diverse medical entity types, we can proficiently handle semantic information and acquire the relationships among entities, thereby contributing significantly to the construction of medical knowledge graphs.
First, we utilize the multi-head attention mechanism to capture the relationships and features of different input parts of the SDPs, thereby better understanding complex patterns and relationships. The SDPs contain the words and dependencies most directly related to the relationship between the target entity pair; they thereby simplify the sentence structure, eliminate noise, and represent the semantic relationship between the entities in the syntactic structure more intuitively. By combining local and global features, we can extract the useful parts of each SDP word based on local features. Second, different attention 'heads' learn varied representations, and these representations guide the subsequent 1D-CNN in screening features. The 1D-CNN captures local dependency relationships and temporal features, extracting local information from the sequence to reduce the influence of local noise and to select the most beneficial words from the SDP. Regarding SDP length, L = 64 covers all the words on the SDPs while achieving good results on both datasets; adjustments may be needed for other datasets, which also reflects the good generalization of our model.
When selecting the graph neural network model, we compared the effects of GCN and GAT. Our results demonstrated that the combination of GAT and MA + 1D-CNN achieved the best outcome. Meanwhile, we also found that the combination of GAT and AVG did not perform well, faring even worse than GCN with average pooling. This might be because GAT relies on a more complex structure when learning node representations and cannot capture node features through simple average pooling alone. The combination of multi-head attention and the one-dimensional CNN provides GAT with richer feature expressions and enhances feature integration.
Although our method achieves good results by integrating the SDPs with the entity-pair graph, some shortcomings remain. First, the model obtains entity representations the way most models do, by average pooling over the entity positions of the embeddings produced by the pre-trained language model. The entity information extracted in this way is not rich enough, especially when sentences carry limited entity information, as in the cases in Table 5. In future work, we will integrate more methods and incorporate multiple modalities, such as text, image, and voice, to acquire richer entity representations. Second, our model currently supports only binary RE and has not yet achieved n-ary RE. N-ary RE involves complex relationships among multiple entities, and its information processing is more demanding. To accurately capture and understand these relationships, especially when the connection between two entities is not sufficiently clear (such as IgE and OVA in Table 5), the model requires stronger context understanding and greater computing power for effective reasoning and relation identification among multiple entities. In future research, we will refine the dependency relationships between words and construct different weight coefficients for different dependency types, giving the model different sensitivities to dependency relationship types and dependent words and reducing the influence of noise on word representations. Furthermore, the application of large language models (such as Llama3 [36]) in RE tasks will provide new strategies for capturing complex context information and multi-entity relationships. Finally, it is worth noting that our approach focuses solely on relation identification and does not perform entity recognition.
In this paper, we innovatively put forward the idea of combining the SDPs with the entity-pair graph, and utilized the graph neural network model based on the two to extract the relationships of medical entity pairs. This model not only significantly reduced the time but also performed more outstandingly in terms of efficacy, opening up a new approach for extracting knowledge from biomedical texts and demonstrating broad application prospects.
Conclusion
In this paper, we innovatively proposed the concept of integrating the SDP with the entity-pair graph and employed the graph neural network model based on the two to extract the relationships of medical entity pairs. This model not only shortened the running time but also achieved excellent performance, opening up a new avenue for extracting knowledge from biomedical texts and demonstrating broad application prospects.
- We introduced BioGSF, a novel graph-driven framework that integrates shortest dependency paths (SDP) with entity-pair graphs, enhancing the extraction of biomedical relations from literature.
- We obtained rich SDP information through the combination of multi-head attention and a one-dimensional convolutional neural network, which captures global features as well as local patterns and characteristics at different positions.
- BioGSF demonstrates exceptional performance in biomedical relation extraction, achieving micro-F1 scores of 96.68% on the S4 dataset and 96.03% on the BioRED dataset with faster processing times than existing models.
Conflict of interest: The authors declare no conflicts of interest.
Funding
This research was funded by the start-up fund from Suzhou City University; Medical and Health Science and Technology Innovation Project of Suzhou (SKY2022010, SKYD2022097); Foundation of Suzhou Medical College of Soochow University (MP13405423, MX13401423); the Priority Academic Program Development of Jiangsu Higher Education Institutions; the open research fund of Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare.
Data availability
The data and source codes used in this paper are available at https://github.com/serien-zzx/BioGSF
References
Miranda-Escalada A, Mehryary F, Luoma J. et al. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations. Database 2023; 2023:baad080.
OpenAI.
Llama Team, AI@Meta.
Author notes
Yang Yang and Zixuan Zheng contributed equally to this work.