Yang Yang, Zixuan Zheng, Yuyang Xu, Huifang Wei, Wenying Yan. BioGSF: a graph-driven semantic feature integration framework for biomedical relation extraction. Briefings in Bioinformatics, Volume 26, Issue 1, January 2025, bbaf025. https://doi.org/10.1093/bib/bbaf025
Abstract
The automatic and accurate extraction of diverse biomedical relations from literature constitutes the core of medical knowledge graphs, which are indispensable for healthcare artificial intelligence. Currently, fine-tuning through stacking various neural networks on pre-trained language models (PLMs) represents a common framework for end-to-end resolution of the biomedical relation extraction (RE) problem. Nevertheless, sequence-based PLMs, to a certain extent, fail to fully exploit the connections between semantics and the topological features formed by these connections. In this study, we present a graph-driven framework named BioGSF for RE from the literature, integrating shortest dependency paths (SDP) with an entity-pair graph through a graph neural network model. Initially, we leverage dependency relationships to obtain the SDP between entities and incorporate this information into the entity-pair graph. Subsequently, a graph attention network is used to acquire the topological information of the entity-pair graph. Ultimately, the obtained topological information is combined with the semantic features of the contextual information for relation classification. Our method was evaluated on two distinct datasets, S4 and BioRED. The outcomes reveal that BioGSF not only attains superior performance over previous models, with micro-F1 scores of 96.68% (S4) and 96.03% (BioRED), but also requires the shortest running time. BioGSF emerges as an efficient framework for biomedical RE.
Introduction
Amidst the exponential growth of biomedical literature, efficiently tapping into this extensive knowledge base has escalated into a formidable task. Knowledge graph (KG) technologies, with their ability to decipher the intricate web of connections among biomedical concepts, have surfaced as potential solutions to this predicament, thereby garnering significant research focus [1, 2]. Currently, a multiplicity of knowledge graphs have emerged, such as NetMe 2.0 [3], SPOKE [4], MKG-GC [5], etc. Knowledge graphs are making significant contributions in various domains, including facilitating drug repurposing initiatives [6], enhancing drug discovery efforts [7], elucidating molecular regulatory mechanisms [8], and informing clinical decision-making processes [9], among others. The knowledge extraction phase in KG development comprises two key components: entity extraction, involving the identification and retrieval of pertinent entities from the literature, and relation extraction (RE), which determines the connections among these extracted entities.
Relation extraction, a pivotal component of knowledge extraction, has garnered considerable attention, driving extensive research by various scholars and collaborative teams. Early attempts at RE predominantly relied on rule-based methods, focusing on analyzing textual syntactic and semantic patterns [10, 11]. However, such methods require manual updates of expressions, and the constructed rules may be tailored only to specific tasks. As a result, researchers transitioned towards machine-learning techniques for automated relation determination. Singhal et al. [12], for instance, introduced a decision tree method to accurately identify disease-related point mutations from biomedical literature and enhanced model performance by incorporating features such as statistical, distance, and affective elements.
The aforementioned methods require manual feature engineering, with distinct features needing to be designed for different text types. However, deep learning-based methods elegantly address this challenge by automatically learning feature representations from text, capturing various levels of information. These methods exhibit a superior understanding of contextual information and semantic relations. Currently, deep learning models for the RE task can be broadly categorized into two types: (i) Pre-trained language models that utilize large-scale text datasets for pre-training, enabling them to learn the language’s deep semantic and syntactic nuances; (ii) Task-specific deep learning models, which enrich relation information by integrating sentence features, entity information features, dependency features, and other relevant features.
In the realm of pre-trained language models, Lee and colleagues [13] introduced BioBERT, a model based on the BERT architecture. It was trained on data from Wikipedia, BooksCorpus, and PubMed article abstracts, emerging as the most widely used medical pre-trained language model. Meanwhile, the Generative Pre-trained Transformer (GPT), a recently popular large language model, has gained significant attention and increasing application within the biomedical sphere. For instance, Luo et al. [14] presented BioGPT, a GPT variant obtained by pre-training a GPT-2 architecture on a wealth of domain-specific biological data, emphasizing comprehensive coverage of biomedical knowledge. Recently, Jin and colleagues [15] proposed a GPT model tailored for the genetics domain, innovatively utilizing the NCBI Web API to answer genomic inquiries.
On the other hand, to extract relational information more accurately, researchers have integrated more effective sentence feature information into deep learning models. Features based on dependency paths have contributed to improving the accuracy of RE. Miwa et al. proposed an end-to-end LSTM-RNN model based on dependency trees [16]. Later, Peng et al. introduced graph LSTM, which enables bidirectional information transfer between child and parent nodes through dependency relationships [17]. Lai et al. incorporated a neighbor attention mechanism and considered the dependency items of words [18]. Tian et al. designed a dependency-driven approach with attentive graph convolutional networks, which obtains head words and child nodes from dependency trees to build local connections and uses the shortest dependency paths (SDP) to build global connections [19]. Chen et al. not only utilized the dependency relationships and types between words but also distinguished between reliable and noisy dependency information through weighting [20].
Although external information such as dependency relationships improves the performance of relation recognition, the connections between entity pairs themselves also provide rich feature information. Based on this, we propose a novel graph-driven framework, BioGSF, for biomedical RE built on the entity-pair graph and the SDP. We use dependency relationships to obtain the SDP between entities and incorporate this information into the entity-pair graph. We then employ a graph attention network to obtain topological information of the entity-pair graph. Finally, we combine the acquired topological information with the semantic features of contextual information for relation classification. This framework harnesses the complementary strengths of diverse methodologies to overcome the constraints of current approaches and to enhance the model's efficacy and generalization capabilities in RE.
Methods
Datasets
In this study, we opted for two extensive datasets, namely S4 and BioRED [21], for the training and testing of our model. The S4 dataset was curated by carefully selecting four distinct relationship types: Protein–Protein Interaction (PPI), Drug–Drug Interaction (DDI), Chemical–Protein Interaction (CPI), and Chemical–Disease Interaction (CDI). These relationships were extracted from Dataset-11 in our prior research [5], which amalgamated eight different resources, including BC5CDR [22], BC6ChemProt [23], BC7DrugProt [24], LLL [25], AIMed, HPRD-50, BioInfer, and IEPA [26]. On the other hand, BioRED is a document-level biomedical relation dataset encompassing 600 PubMed abstracts [21]. Apart from the aforementioned four relationship types, BioRED also features Protein–Disease Interaction (PDI), Disease–Variant Interaction (DVI), and Chemical–Variant Interaction (CVI). Both datasets provide pre-annotated entity information, and the sentences used in our study are drawn from them with entity tags already in place; entity recognition is therefore outside the scope of this work. Details of the two datasets are summarized in Table 1.
Table 1. Relation type statistics of the S4 and BioRED datasets

| Relation type | S4: Train set | S4: Test set | BioRED: Train set | BioRED: Test set |
|---|---|---|---|---|
| PPI | 21,090 | 1114 | 3744 | 614 |
| DDI | 11,785 | 629 | 2407 | 690 |
| CPI | 60,850 | 3204 | 3984 | 899 |
| CDI | 5659 | 298 | 4303 | 1166 |
| PDI | – | – | 7216 | 1715 |
| DVI | – | – | 1819 | 686 |
| CVI | – | – | 260 | 26 |
| Total | 99,384 | 5236 | 23,733 | 5796 |
Overall architecture
Our model BioGSF is primarily divided into three modules: the embedding layer, the feature fusion layer, and the classification layer. BioGSF takes into account not only the dependencies between entities but also the topological information of the entity-pair graph. The overall framework is illustrated in Fig. 1. We chose BioBERT [13] as the pre-trained language model to obtain the embedding information of sentences. The feature fusion layer integrates both sentence semantic features and graph topological features. For sentence semantics, we use a fully connected layer to output contextual information, entity information, and SDP information. To capture graph information, we combine entity-pair information and SDP information, which are then input into a graph attention network (GAT) to obtain topological information. Finally, semantic and topological information are integrated and passed to the classification layer for categorization.

Embedding layer and entity feature representation
Given an input sentence S, we derive its embedding via BioBERT, which has been trained on a corpus comprising 4.5 billion words from PubMed abstracts and 13.5 billion words from PubMed Central full-text articles. Within the sentence, we designate the positions of two entities, denoted as Entity 1 and Entity 2. The start and end of Entity 1 are marked by ‘[s1]’ and ‘[e1]’, respectively, while ‘[s2]’ and ‘[e2]’ similarly delineate the boundaries of Entity 2. This annotation process aids in the extraction of entity features during subsequent steps. Initially, the sentence undergoes tokenization to yield individual tokens. BioBERT then generates embeddings for each token, encoding positional, semantic, and contextual information. Furthermore, BioBERT produces a ‘[CLS]’ output that encapsulates the overall semantic content of the text sequence. We employ this as a sentence feature, which will be integrated with the forthcoming fusion features.
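To make this marking scheme concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that wraps two entity mentions with the boundary markers:

```python
# Hypothetical helper: wrap two entity mentions with [s1]/[e1] and [s2]/[e2]
# markers, given character offsets (end-exclusive) with entity 1 before entity 2.
def mark_entities(text: str, e1: tuple, e2: tuple) -> str:
    (s1, t1), (s2, t2) = e1, e2
    return (text[:s1] + "[s1] " + text[s1:t1] + " [e1]"
            + text[t1:s2] + "[s2] " + text[s2:t2] + " [e2]" + text[t2:])

print(mark_entities("TREM2 is associated with AD.", (0, 5), (25, 27)))
# -> [s1] TREM2 [e1] is associated with [s2] AD [e2].
```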
However, it is important to note that an entity can be tokenized into multiple tokens. To capture more precise semantic features of the entity, we encode the positions of the classified entities by labeling their locations with a mask, and each entity is represented by the average embedding of its constituent tokens. Let $e_{start}$ and $e_{end}$ denote the start and end positions of the entity, respectively, and let $H$ represent the output embedding from BioBERT. The representation of the entity is given by Equation (1):

$$E = \mathrm{FCLayer}\left(\frac{1}{e_{end}-e_{start}+1}\sum_{t=e_{start}}^{e_{end}} H_t\right) \tag{1}$$

where FCLayer signifies a linear transformation layer coupled with a dropout layer, whose purposes are to regulate the output dimensionality of the entity embedding and to mitigate overfitting, respectively.
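As an illustration of Equation (1), the following PyTorch sketch implements the average pooling and the FCLayer; the hidden size (768 for BioBERT-base), output dimension, and dropout rate are assumed values:

```python
import torch
import torch.nn as nn

class FCLayer(nn.Module):
    """Linear transformation plus dropout, as described for Equation (1)."""
    def __init__(self, d_in: int = 768, d_out: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.linear = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.dropout(x))

def entity_representation(h: torch.Tensor, e_start: int, e_end: int,
                          fc: FCLayer) -> torch.Tensor:
    """h: (seq_len, hidden) BioBERT token embeddings; positions are inclusive."""
    span = h[e_start : e_end + 1]       # embeddings of the entity's tokens
    return fc(span.mean(dim=0))         # average pooling, then FC + dropout

fc = FCLayer()
h = torch.randn(128, 768)               # stand-in for a BioBERT output
e1 = entity_representation(h, 5, 7, fc) # entity spanning tokens 5..7
```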
Dependency feature representation
Although pre-trained language encoding achieves good performance, it often captures contextual information while ignoring the grammatical role of words in sentences. For instance, a word can function as a subject, predicate, attribute, adverbial, or complement in different sentences. To address this issue, researchers have proposed using dependency relationships to obtain richer entity relationship information. In this study, we employed the NLP tool ScispaCy [27] to obtain dependency relationships from sentences. ScispaCy is a natural language processing library built on spaCy and trained on biomedical corpora, yielding superior outcomes for semantic analysis of medical data. The part-of-speech tagging and dependency parsing modules within ScispaCy use the GENIA 1.0 corpus [28], PubMed abstracts, and the OntoNotes 5.0 corpus [29, 30] as their training corpora, while the named entity recognition module relies on the MedMentions dataset [31]. The parsing results of ScispaCy include the head word, token, category, and child nodes for each token. The category is used to distinguish the annotated entity pairs. When an entity is parsed into multiple tokens, this facilitates marking all entity positions, their head words, and child nodes.
Leveraging the ScispaCy toolkit, we can tokenize sentences and delve into their syntactic structure to extract part-of-speech tags and establish dependency relationships among words, as depicted in Fig. 2A. The resultant dependency relationships are organized into a tree structure that interlinks the entities. Drawing upon this dependency tree, we implemented the Breadth-First Search algorithm to identify the SDP between pairs of entities, as depicted in Fig. 2B. In cases where entity pairs span across sentences, the ROOT node may not be singular, potentially hindering the identification of an SDP. To mitigate this, we utilize the head word of the entity, as identified by ScispaCy, as the SDP. The head word typically serves as the central term within a subtree, capturing the primary semantic relationship within the sentence or clause. This approach ensures that the model maintains robust comprehension and inferential capabilities even within intricate syntactic frameworks by constructing pathways that connect semantic cores. The detailed algorithm is outlined in Algorithm 1.

Figure 2. Shortest dependency path acquisition and processing module. A. ScispaCy parsing results. B. Example of a shortest dependency path. C. Shortest dependency path representation acquisition module.
Algorithm 1. Shortest dependency path extraction

Input: marked sentence S1, marked sentence S2
Output: shortest dependency path
Tools: ScispaCy

1. doc1 = [{h1, w1, t1, c1}, ...] ← ScispaCy(S1)
2. doc2 ← ScispaCy(S2)
3. Candidate ROOT nodes are found from the parsing results, where {h, w, t, c} = (head word, word, type, children)
4. for each ROOT in the candidate ROOTs do
       path1 ← ScispaCy(ROOT, Entity1)
       path2 ← ScispaCy(ROOT, Entity2)
   end
5. if path1 and path2 share a common path then
       Path ← shortest dependency path
   else
       Path ← Entity1's head word + Entity2's head word
   end
6. return Path
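As a concrete illustration of the core step of Algorithm 1, the sketch below parses a sentence with ScispaCy, builds an undirected graph over the dependency arcs, and runs a shortest-path (breadth-first) search; `en_core_sci_sm` is a standard ScispaCy model, while matching each entity by a single surface token is a simplifying assumption of this sketch:

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_sci_sm")  # ScispaCy biomedical model

def shortest_dependency_path(sentence: str, ent1: str, ent2: str) -> list:
    doc = nlp(sentence)
    graph = nx.Graph()
    for tok in doc:                          # undirected head -> child arcs
        for child in tok.children:
            graph.add_edge(tok.i, child.i)
    src = next(t.i for t in doc if t.text == ent1)
    dst = next(t.i for t in doc if t.text == ent2)
    try:
        path = nx.shortest_path(graph, source=src, target=dst)  # BFS (unweighted)
        return [doc[i].text for i in path]
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # Fallback from Algorithm 1: the two entities' head words.
        return [doc[src].head.text, doc[dst].head.text]

print(shortest_dependency_path("TREM2 variants increase the risk of AD.",
                               "TREM2", "AD"))
```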
To extract more effective information from the SDP, we designed a dedicated feature extraction module, as shown in Fig. 2C. First, the obtained SDP is encoded by BioBERT, denoted $H_D$; richer and more effective embedded information is then obtained through a multi-head attention mechanism; finally, a one-dimensional convolutional network (1D-CNN) performs dimensionality reduction to obtain the SDP representation. The 1D-CNN effectively extracts local features from sequential data and captures local patterns and characteristics at different positions in the sequence. The process is represented in Equation (2):

$$D = \mathrm{CNN_{1D}}\left(\mathrm{MultiHead}\left(H_D\right)\right) \tag{2}$$
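A minimal PyTorch sketch of this module follows; the head count, kernel size, and the final max-pooling over path positions are our assumptions:

```python
import torch
import torch.nn as nn

class SDPFeature(nn.Module):
    """Fig. 2C / Equation (2): multi-head self-attention, then a 1D convolution."""
    def __init__(self, d_model: int = 768, d_out: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_out, kernel_size=3, padding=1)

    def forward(self, h_d: torch.Tensor) -> torch.Tensor:
        """h_d: (batch, sdp_len, d_model) BioBERT embeddings of the SDP tokens."""
        a, _ = self.attn(h_d, h_d, h_d)      # multi-head self-attention
        c = self.conv(a.transpose(1, 2))     # (batch, d_out, sdp_len)
        return c.max(dim=-1).values          # pooled SDP representation D (assumed)

d = SDPFeature()(torch.randn(4, 16, 768))   # -> shape (4, 256)
```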
Graph topology information representation
Mainstream graph neural networks primarily treat the words of a single sentence as nodes in the graph. Here, we instead utilize the entity-pair graph [32] to obtain graph topology information. In this graph, the node set represents entity pairs, and edges represent connections between entity pairs: if two entity pairs share a common entity, they are connected by an edge, represented as 1 in the adjacency matrix. As illustrated in Fig. 3, sentence A contains the entity pair (e1: TREM2; e2: AD), sentence B contains (e1: R47H; e2: AD), and sentence C contains (e1: R47H; e2: TREM2). Each pair of sentences A, B, and C shares a common entity, so all three nodes are connected by edges.
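The adjacency construction can be illustrated with the three pairs from Fig. 3; the self-loops are an assumption commonly used with attention-based graph models:

```python
import numpy as np

pairs = [("TREM2", "AD"), ("R47H", "AD"), ("R47H", "TREM2")]  # sentences A, B, C

n = len(pairs)
adj = np.eye(n, dtype=int)                 # self-loops (assumed)
for i in range(n):
    for j in range(i + 1, n):
        if set(pairs[i]) & set(pairs[j]):  # shared entity => edge = 1
            adj[i, j] = adj[j, i] = 1

print(adj)  # every pair here shares an entity, so the graph is fully connected
```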

GAT is a network architecture for graph-structured data that assigns weights to neighbors through an attention mechanism and dynamically adjusts the relationship weights between different nodes during learning. We therefore combine the information of entity pairs with the SDP information and input them into the GAT. The combined information is denoted as the fused embedding $H = [E_1, E_2, D]$, where $E_1$ and $E_2$ represent the feature information of the two entities and $D$ represents the dependency feature information. Given the input embeddings of the graph as $H = \{h_1, h_2, \dots, h_n\}$, the attention score is calculated as $a(Wh_i, Wh_j)$, which represents the importance of node $j$ to node $i$. Let $N_i$ denote the neighbors of node $i$ in the graph. We normalize over all neighbors using softmax and apply LeakyReLU activation to obtain the attention coefficient $\alpha_{ij}$, as shown in Equation (3):

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a\left(Wh_i, Wh_j\right)\right)\right)}{\sum_{k \in N_i}\exp\left(\mathrm{LeakyReLU}\left(a\left(Wh_i, Wh_k\right)\right)\right)} \tag{3}$$

Once the normalized attention scores have been obtained, we extract the output features corresponding to each node. To guarantee the robustness of the self-attention learning process, a multi-head attention mechanism aggregates $K$ distinct attention results, yielding the output $h_i^{(0)}$ (Equation (4)). A further GAT layer is appended at the end, functioning as the output layer (Equation (5)):

$$h_i^{(0)} = \Big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{k} W^{(0)} h_j\Big) \tag{4}$$

$$F_g = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W^{(1)} h_j^{(0)}\Big) \tag{5}$$

where $F_g$ is the topological information of the final graph, $W^{(0)}$ and $W^{(1)}$ are the weight matrices of the linear transformations, which reduce the input embeddings to dimensions $d_m$ and $d_g$ respectively, and $\sigma$ is the activation function.
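A hedged sketch of such a two-layer GAT, using PyTorch Geometric's `GATConv`; the layer sizes and the ELU activation are assumptions:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class PairGAT(torch.nn.Module):
    """Two-layer GAT over the entity-pair graph (Equations (3)-(5))."""
    def __init__(self, d_in: int, d_m: int = 128, d_g: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(d_in, d_m, heads=heads, concat=True)      # K heads
        self.gat2 = GATConv(d_m * heads, d_g, heads=1, concat=False)  # output layer

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """h: (num_pairs, d_in) fused [E1, E2, D] node features."""
        h0 = F.elu(self.gat1(h, edge_index))  # multi-head layer, Equation (4)
        return self.gat2(h0, edge_index)      # F_g, Equation (5)
```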
Relation classification
We utilize the aforementioned information to classify relationships. Initially, we integrate the semantic features of the sentence, the entity features, and the shortest dependency path information to generate $F_S$, as shown in Equation (6):

$$F_S = \mathrm{FCLayer}\left(H_{cls} \oplus E_1 \oplus E_2 \oplus D\right) \tag{6}$$

where $H_{cls}$ denotes the sentence feature, $\oplus$ represents the concatenation operation, and the FCLayer reduces the dimensionality of the four types of information to $d_s$.

The topological information $F_g$ is then incorporated, and the outcome is classified using a softmax classifier. Let $r$ be the number of label categories, $W_r$ a weight matrix of size $r \times (d_s + d_g)$, and $b_r$ the bias. The resulting probability is shown in Equation (7):

$$p\left(y \mid t\right) = \mathrm{softmax}\left(W_r\left(F_S \oplus F_g\right) + b_r\right) \tag{7}$$

To quantify the model's loss, we employ cross-entropy, as defined in Equation (8):

$$\mathcal{L} = -\sum_{i} y_i \log p\left(y_i \mid t_i\right) \tag{8}$$
For tag prediction, we apply the argmax function to determine the most probable label from p(y|t).
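The classification layer and loss of Equations (6)-(8) reduce to a few lines of PyTorch; the batch size and the feature dimensions $d_s$, $d_g$ below are illustrative:

```python
import torch
import torch.nn as nn

d_s, d_g, r = 256, 64, 8                   # semantic dim, graph dim, label count
classifier = nn.Linear(d_s + d_g, r)       # W_r of size r x (d_s + d_g), bias b_r
loss_fn = nn.CrossEntropyLoss()            # applies log-softmax internally

f_s = torch.randn(32, d_s)                 # F_S: fused semantic features
f_g = torch.randn(32, d_g)                 # F_g: GAT topological features
labels = torch.randint(0, r, (32,))

logits = classifier(torch.cat([f_s, f_g], dim=-1))
loss = loss_fn(logits, labels)             # cross-entropy, Equation (8)
pred = logits.argmax(dim=-1)               # most probable label from p(y|t)
```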
The detailed model parameters are listed in Supplementary Table S1.
Evaluation metrics
To evaluate the performance of our method and compare it with previous methods, we utilized the micro-average and macro-average F1-score metrics. Both scores are determined from precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The micro-average aggregates true positives, false positives, and false negatives across all relation classes before computing F1, whereas the macro-average is the unweighted mean of the per-class F1 scores.
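For instance, both averages can be computed with scikit-learn on toy labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

micro = f1_score(y_true, y_pred, average="micro")  # pools TP/FP/FN over classes
macro = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```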
Results
Ablation studies
To validate the efficacy of our model, we performed ablation studies on each of the proposed components: the choice of graph neural network, the SDP feature network layer, the maximum length L of the SDP, and the learning rate.
Initially, to validate the effectiveness of the graph neural network, we compared three variants: one that directly outputs the fused information obtained by combining the SDP with entity pairs to the classification layer, and others that feed this information into a graph neural network (GCN or GAT) before classification. As shown in Table 2, the combination of GAT, the multi-head attention mechanism, and the 1D-CNN yields the best performance, with micro-F1 scores of 96.47 on S4 and 96.03 on BioRED, significantly outperforming the other methods. Furthermore, we compared our method with the approach that utilizes the average embedding of entities in RBERT [33]. The results showed that the combined use of entity information and the SDP outperforms the method that relies solely on entity information, indicating that the SDP has a positive effect on relation classification.
Table 2. Ablation results (micro-F1) of different layer combinations on S4 and BioRED

| Layer | S4 | BioRED |
|---|---|---|
| AVG | 94.79 | 95.48 |
| MA + 1D-CNN | 94.71 | 95.67 |
| GCN + AVG | 95.03 | 95.60 |
| GCN + MA + 1D-CNN | 94.65 | 95.69 |
| GAT + AVG | 95.02 | 95.26 |
| GAT + MA + 1D-CNN (ours) | 96.47 | 96.03 |

AVG: entity average embedding method; MA: multi-head self-attention mechanism.
Next, we investigated the maximum length L of the SDP. Here, the length of the SDP refers to the length of the result obtained after tokenizing the SDP with BioBERT. When sentences are excessively long, the SDP between two entities may also become long, potentially introducing significant noise into the obtained shortest dependency. We analyzed the distribution of SDP lengths in the two datasets. As shown in Fig. 4A, over 95% of sentences have an SDP length of at most 16, and all sentences have an SDP length below 64. To test the effectiveness and generalization of our model, we set the maximum SDP length to 16, 32, and 64, truncating longer paths so that the dataset size remains the same across the three settings.

Figure 4. SDP length statistics and their effect. A. Proportion of different SDP lengths. B. Comparison of different SDP lengths on model performance.
The results are shown in Fig. 4B. When L = 32, the model performs best on the S4 dataset but poorly on the BioRED dataset, achieving only 94.6% accuracy. When L = 64, although it performs slightly worse on the S4 dataset compared to L = 32, it significantly outperforms other settings on BioRED. L = 64 covers the SDP of all sentences, ensuring the completeness and generalization of the model.
Moreover, we performed an ablation study on the learning rate, testing settings of 1e-5, 2e-5, and 3e-5 to analyze its impact on model performance. We observed that with a learning rate of 1e-5 the model performed best, converging more effectively and avoiding issues such as oscillation or overfitting. Based on this result, we selected 1e-5 as the optimal learning rate and conducted all subsequent experiments on this basis. Detailed hyper-parameter information is listed in Table S1 and Fig. S1.
Cross-validation
To rigorously validate the performance and stability of our proposed model, we conducted five-fold cross-validation on the S4 and BioRED datasets. As shown in Table 3, on the S4 training set the average micro-F1 and macro-F1 are 95.93 ± 0.29 and 90.42 ± 0.61, respectively. Similarly, on the BioRED training set, the micro-F1 and macro-F1 averages are 94.83 ± 0.37 and 93.53 ± 0.41. These results indicate that our model is remarkably stable on both datasets, confirming its consistent performance and robust generalization capabilities.
Table 3. Cross-validation performance of our model on training sets from S4 and BioRED

| Datasets | Micro-F1 | Macro-F1 |
|---|---|---|
| S4 | 95.93 ± 0.29 | 90.42 ± 0.61 |
| BioRED | 94.83 ± 0.37 | 93.53 ± 0.41 |
Performance comparison with different models
To assess the efficacy of our model, we conducted a comparative analysis of BioGSF's performance against six previous methods (BioBERT [13], RBERT [33], EPGNN [32], AGCN [19], BioEGRE [34], and GPT-4o [35]) using the S4 and BioRED datasets.
The comparison results in Table 4 show that our BioGSF model exhibits the best performance on both datasets. This achievement is primarily due to the efficient integration of the SDP with entity-pair information, a strategy that considerably boosts classification performance. Although AGCN demonstrates commendable performance on the BioRED dataset, it demands extensive computational resources and a lengthy validation phase: our model generates results in 75 s under similar conditions, whereas AGCN requires 311 s. The EPGNN model, which applies a GCN centered on entity pairs, also shows robust performance on both datasets; however, it disregards the SDP, relying exclusively on internal entity-pair information. Furthermore, we evaluated GPT-4o using zero-shot and one-shot prompting. Specifically, we designed appropriate prompts to guide the model's responses without any additional training or tuning, directly leveraging its general language understanding to test its performance on the RE task. As shown in Table 4, its performance was not satisfactory: although GPT-4o is strong at general language tasks, its results on this domain-specific task are poor, and it fails to fully exploit the contextual information.
Table 4. Comparative performance evaluation of our model against other state-of-the-art models on the test sets from the S4 and BioRED datasets

| Model | S4 Micro-F1 | S4 Macro-F1 | S4 Time | BioRED Micro-F1 | BioRED Macro-F1 | BioRED Time |
|---|---|---|---|---|---|---|
| BioBERT | 93.85 | 86.20 | 78 s | 94.38 | 89.71 | 75 s |
| RBERT | 94.89 | 88.09 | 75 s | 94.81 | 94.27 | 75 s |
| EPGNN | 94.98 | 90.62 | 75 s | 94.63 | 92.17 | 75 s |
| AGCN | 92.22 | 79.11 | 311 s | 96.01 | 94.42 | 362 s |
| BioEGRE | 89.57 | 75.41 | 1038 s | 94.17 | 93.69 | 1366 s |
| GPT-4o (zero-shot) | 57.79 | 52.33 | 1 h 30 m | 76.45 | 80.65 | 1 h 30 m |
| GPT-4o (one-shot) | 49.33 | 50.80 | 1 h 30 m | 89.33 | 80.41 | 1 h 30 m |
| BioGSF | 96.68 | 92.46 | 75 s | 96.03 | 94.65 | 73 s |
Finally, to further validate the efficacy of our model, we constructed a new data set from the most recently published papers. Specifically, the data set was built from abstracts of papers indexed in PubMed during October 2024 (search terms: protein–protein interaction[tiab] cancer[tiab]; filters: from 2024/10/1 to 2024/10/31) and manually annotated (Table S2). Re-evaluated on this data set, our model achieved the best performance, with micro-F1 and macro-F1 scores of 97.44 and 78.77, respectively (Table S3).
Our study underscores the profound benefits of merging SDP information with entity pair data, thereby greatly facilitating information acquisition by graph neural networks. This observation aligns seamlessly with the findings derived from ablation studies, further validating the efficacy of our approach.
Case study
To further investigate the effectiveness of our model for entity relationship extraction in medical literature, we randomly selected recently published research articles from PubMed as the basis for our case study. The results, presented in Table 5, aim to demonstrate the performance differences among various models in predicting relationship types. A notable observation from the table is that advanced models like BioBERT accurately identify the relationship type as CDI between widely known drug types, such as 5-Fluorouracil, and specific diseases. This indicates that the models perform well when dealing with drug entities that have extensive literature support.
Table 5. Case study: relation types predicted by different models on recently published sentences

| Sentence | Gold Standard | BioBERT | RBERT | EPGNN | AGCN | BioEGRE | BioGSF |
|---|---|---|---|---|---|---|---|
| Finally, according to the results of drug susceptibility analysis, docetaxel, 5-Fluorouracil, [s1] gemcitabin [e1], and paclitaxel were found to be more sensitive to [s2] gastric cancer [e2]. | CDI | PDI | PDI | PDI | CDI | CDI | CDI |
| Finally, according to the results of drug susceptibility analysis, docetaxel, [s1] 5-Fluorouracil [e1], gemcitabin, and paclitaxel were found to be more sensitive to [s2] gastric cancer [e2]. | CDI | CDI | CDI | PDI | CDI | CDI | CDI |
| Here, we constructed [s1] ACZ2 [e1] and investigated its efficacy and potential mechanism for [s2] gastric cancer [e2] in vitro and in vivo. | CDI | PDI | PDI | PDI | PDI | CDI | CDI |
| Changes in gene expression, chemokine and cytokine secretion, plasma [s1] IgE [e1], and lung histology were quantified using RT-qPCR, ELISA, and immunohistochemistry, respectively. Arg1 elimination [s2] OVA [e2] also decreased number and tightness of correlations between adaptive changes in lung function and inflammatory parameters in OVA/OVA-treated female mice. | PPI | DDI | DDI | DDI | CVI | CPI | DDI |
However, the situation changes when encountering new or less-reported drugs, such as gemcitabin and ACZ2, especially when they appear in abbreviated forms. Because entity information for these newer drugs is scarce, relying solely on entity information to determine relationship types becomes more challenging. It is worth noting that the EPGNN model struggles in such cases, primarily because it relies heavily on direct information between entity pairs, neglecting the richer and crucial contextual information in the text. This design limitation often leads it to mistakenly predict relationships as PDI when faced with the aforementioned challenges. In contrast, our model and BioEGRE adopt different strategies, focusing on extracting and utilizing rich entity and contextual information from dependency information. This approach enables accurate determination of relationship types between new drugs and other entities, even when direct entity information is scarce. Furthermore, our model propagates entity information and dependency information through the entity-pair graph, allowing other related entities to learn relationship type information.
Discussion
In this study, we present a model that integrates the SDPs and the entity-pair graph for multi-type biomedical entity relation extraction from the literature. For diverse medical entity types, we can proficiently handle semantic information and acquire the relationships among entities, thereby contributing significantly to the construction of medical knowledge graphs.
First, we utilize the multi-head attention mechanism to capture the relationships and features of different input parts of the SDPs, thereby better understanding complex patterns and relationships. The SDPs contain the words and dependencies most directly related to the relationship between the target entity pair; they thereby simplify the sentence structure, eliminate noise, and represent the semantic relationship between the entities in the syntactic structure more intuitively. By combining local and global features, we can extract the useful parts of each SDP word based on local features. Second, different attention 'heads' learn varied representations, and these representations guide the subsequent 1D-CNN in screening features. The 1D-CNN captures local dependency relationships and temporal features, extracting local information from the sequence to reduce the influence of local noise and to select the most beneficial words from the SDP. Regarding SDP length, L = 64 covers all the words on the SDPs while achieving good results on both datasets; adjustments may be needed for other datasets, which also reflects the good generalization of our model.
When selecting the graph neural network model, we compared the effects of GCN and GAT. Our results demonstrated that the combination of GAT and MA + 1D-CNN achieved the best outcome. Meanwhile, we also found that the combination of GAT and AVG did not perform well, faring even worse than GCN with average pooling. This might be because GAT relies on a more complex structure when learning node representations and cannot capture node features through simple average pooling alone. The combination of multi-head attention and the one-dimensional CNN provides GAT with richer feature expressions and enhances feature integration.
Although our method achieves good results by integrating the SDPs with the entity-pair graph, some shortcomings remain. First, the model obtains entity representations the way most models do, by average pooling over the entity positions of the embeddings produced by the pre-trained language model. The entity information extracted in this way is not rich enough, especially when sentences carry limited entity information, as in the cases in Table 5. In future work, we will integrate more methods and incorporate multiple modalities, such as text, image, and voice, to acquire richer entity representations. Second, our model currently supports only binary RE and has not yet achieved n-ary RE. N-ary RE involves complex relationships among multiple entities, and its information processing is more demanding. To accurately capture and understand these relationships, especially when the connection between two entities is not sufficiently clear (such as IgE and OVA in Table 5), the model requires stronger context understanding and greater computing power for effective reasoning and relation identification among multiple entities. In future research, we will refine the dependency relationships between words and construct different weight coefficients for different dependency types, giving the model different sensitivities to dependency relationship types and dependent words and reducing the influence of noise on word representations. Furthermore, the application of large language models (such as Llama3 [36]) in RE tasks will provide new strategies for capturing complex context information and multi-entity relationships. Finally, it is worth noting that our approach focuses solely on relation identification and does not perform entity recognition.
In this paper, we innovatively put forward the idea of combining the SDPs with the entity-pair graph, and utilized the graph neural network model based on the two to extract the relationships of medical entity pairs. This model not only significantly reduced the time but also performed more outstandingly in terms of efficacy, opening up a new approach for extracting knowledge from biomedical texts and demonstrating broad application prospects.
Conclusion
In this paper, we innovatively proposed the concept of integrating the SDP with the entity-pair graph and employed the graph neural network model based on the two to extract the relationships of medical entity pairs. This model not only shortened the running time but also achieved excellent performance, opening up a new avenue for extracting knowledge from biomedical texts and demonstrating broad application prospects.
- We introduced BioGSF, a novel graph-driven framework that integrates shortest dependency paths (SDP) with entity-pair graphs, enhancing the extraction of biomedical relations from literature.
- We obtained rich SDP information through the combination of multi-head attention and a one-dimensional convolutional neural network, which captures global features as well as local patterns and characteristics at different positions.
- BioGSF demonstrates exceptional performance in biomedical relation extraction, achieving micro-F1 scores of 96.68% on the S4 dataset and 96.03% on the BioRED dataset with faster processing times than existing models.
Conflict of interest: The authors declare no conflicts of interest.
Funding
This research was funded by the start-up fund from Suzhou City University; Medical and Health Science and Technology Innovation Project of Suzhou (SKY2022010, SKYD2022097); Foundation of Suzhou Medical College of Soochow University (MP13405423, MX13401423); the Priority Academic Program Development of Jiangsu Higher Education Institutions; the open research fund of Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare.
Data availability
The data and source codes used in this paper are available at https://github.com/serien-zzx/BioGSF
References
Miranda-Escalada A, Mehryary F, Luoma J. et al. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations. Database 2023; 2023:baad080.
OpenAI.
Llama Team, AI@Meta.
Author notes
Yang Yang and Zixuan Zheng contributed equally to this work.