Abstract

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined, and automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods typically expand the relatively small number of experimentally determined functions to large collections of proteins using various clues, including sequence homology, protein–protein interaction and gene co-expression. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold-predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improved coverage arises both because AlphaFold greatly increases the number of available structures and because PredGO can extensively use non-structural information for function prediction. Moreover, we show that over 205 000 (~100%) human entries in UniProt are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.

INTRODUCTION

Recent advances in genome sequencing and structural genomics are rapidly increasing the number of experimentally determined sequences and three-dimensional structures that need functional characterization. The gap between the large number of identified proteins and the completeness of their annotations is continually widening. For example, as of November 2022, nearly 484 700 (~85%) protein sequences deposited in the UniProtKB/Swiss-Prot database were without manually curated function; the same holds for many structures in the Protein Data Bank (PDB), where nearly 156 300 (~74%) structures lacked manually assigned function, and even after putative automated annotation nearly 111 900 (~53%) structures remain listed as unannotated in the Gene Ontology Annotation (GOA) database [1]. Experimental identification and manual annotation of protein function remain labor-intensive and expensive. Consequently, researchers have focused on developing accurate computational methods for protein function prediction, which can help to narrow the gap.

With the ever-increasing accumulation of sequence and structure information, coupled with massive high-throughput experimental data, a number of computational methods have been proposed to exploit these heterogeneous data, including function prediction from the amino acid sequence [2–4], structure data [5], gene expression [6], protein–protein interaction [7, 8], genomic context [9] and integrated methods [10–16].

The most common approach to protein function prediction is to use homology or sequence similarity to transfer functional annotations to newly identified proteins. Global and local sequence alignment tools, such as FASTA [17], BLAST and PSI-BLAST [18], are used to query sequence databases for homologs of a target protein, and the known functions of the top hits are transferred to the query. A more elaborate homology-based strategy is PFP [3], which uses three rounds of PSI-BLAST and a so-called 'function association matrix' to include the annotations of even remote homologs. The extended similarity group (ESG) method [4], which performs iterative sequence database searches, promises to be more sensitive and accurate than conventional PSI-BLAST and its predecessor PFP. In recent years, a series of deep-learning methods have been proposed; these typically derive sequence features from sequence similarity, multiple sequence alignments and sequence motifs, and use neural networks to build the prediction model [19–24]. In addition, sequence-based protein language models have yielded very promising results in protein bioinformatics and in protein science more broadly [25–28]. Although sequence-based transfer is a principled way of inferring function in the light of evolution, in practice a precise function is conserved only above roughly 40% sequence identity [29], and most newly identified proteins show no significant sequence similarity to experimentally annotated proteins. Additionally, sequence similarity does not always imply functional equivalence, so sequence-based functional transfer can be erroneous: proteins arising from gene duplication, for example, may share high sequence similarity yet diverge in function. Misannotations can also propagate when homology-based approaches feed into manually curated databases.

When sequence-based methods fail, functional clues can be inferred from the protein's three-dimensional structure, since structure is often more conserved than sequence [30]. Methods for predicting function from structure rely on detecting structural similarity, global or local, between the target protein and a structure of known function. Global structure-comparison algorithms (e.g. DALI [31], TM-align [32] and MADOKA [33]) can be used to exploit general structural similarity, while other methods, such as pvSOAR [34], eF-site [35] and PDBSiteScan [36], identify local surface regions that may be associated with highly specific functions. However, structural information has not been widely used for protein function annotation, largely because there is a significant gap between the number of proteins with known sequences and those with experimentally determined structures. Recently, AlphaFold2 [37] made a major breakthrough in the protein folding problem by predicting the 3D structure of proteins with atomic-level precision from amino acid sequences alone [38], and models trained only on structures predicted by AlphaFold2 can perform comparably to models trained on experimentally solved structures. These advances provide favorable conditions for improving the performance of structure-based protein function prediction.

In this paper, we propose a computational protein function prediction approach termed PredGO, which achieves high-performance protein function prediction by using the protein language model ESM-1b [39], pre-trained on a large corpus of protein sequences, to extract sequence features; a graph neural network with geometric vector perceptrons (GVP–GNN) [40] to extract features from protein structures predicted by AlphaFold2; and a multi-head attention mechanism to fuse protein–protein interaction (PPI) features. Comparison experiments show that our model can extract the information in the predicted protein structures and achieve high-performance prediction of protein functions even when only sequence information is available. In addition, we experimentally verify the performance of the model on specific species and examine how its predictions differ from those of competing methods on specific proteins. We finally apply PredGO to human genome-wide protein function prediction and obtain promising results.

MATERIALS AND METHODS

Datasets

We used two datasets, CAFA3 [41–43] and UniGOA16, to evaluate methods. We downloaded the CAFA3 dataset from DeepGOPlus (https://deepgo.cbrc.kaust.edu.sa/data/) [23]. This dataset includes the CAFA3 challenge training sequences and experimental annotations released in September 2016, as well as the test benchmarks released on November 15, 2017. We removed proteins with sequence lengths greater than 1000 and proteins with 'fuzzy' (ambiguous or unknown) amino acid codes. After processing, there are 3039 proteins in the test set. We obtained the protein structures from AlphaFoldDB (https://alphafold.ebi.ac.uk/download) [44]. For proteins in the test set whose structures could not be downloaded from AlphaFoldDB, we used the local version of AlphaFold2 to make predictions with the 'full_dbs' preset. We extracted the PPI information from the STRING database [45] and filtered the interactions at medium confidence ($\geq$ 0.4). To ensure the fairness of the evaluation, we downloaded STRING version 10.0 (http://version10.string-db.org/), released on April 16, 2016, before the release of the CAFA3 training set.

We also built a dataset termed UniGOA16 from the UniProt-GOA [1] according to the standard CAFA protocol. We used the same GO version and STRING database as the CAFA3 dataset. We downloaded protein sequences from UniProt (https://www.uniprot.org/downloads) [46], predicted protein structures from AlphaFoldDB and protein functional annotations from GOA (http://www.ebi.ac.uk/GOA). We extracted all experimental annotations and removed unpredictable proteins and GO terms. We used the proteins experimentally annotated before June 24, 2016 as the training set, the no-knowledge proteins experimentally annotated from June 24, 2016 to June 24, 2019 as the validation set, and the no-knowledge proteins experimentally annotated from June 24, 2019 to January 1, 2022 as the test set. Finally, we used MMseqs2 [47] to cluster all sequences in the dataset with 60% sequence identity, removing sequences in the validation and test sets that appear in the same cluster as the training set sequences. Table 1 shows the statistics for both datasets.
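
As an illustration of the redundancy-removal step, the sketch below wraps the MMseqs2 easy-cluster workflow; the file names and the Python wrapper are our own assumptions, and only the 60% identity threshold comes from the text.

```python
# A minimal sketch of sequence clustering with MMseqs2 at 60% identity,
# as used to remove validation/test sequences that share a cluster with
# training sequences. Paths and prefix names are illustrative.
import subprocess

def cluster_sequences(fasta_path, out_prefix="clusterRes", tmp_dir="tmp"):
    # Groups sequences at >= 60% identity; MMseqs2 writes cluster membership
    # to <out_prefix>_cluster.tsv as (representative, member) pairs.
    subprocess.run(
        ["mmseqs", "easy-cluster", fasta_path, out_prefix, tmp_dir,
         "--min-seq-id", "0.6"],
        check=True)
```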

Table 1

Summary of the CAFA3 and UniGOA16 datasets

Dataset    Ontology   Training   Validation   Test   Terms
CAFA3      MFO        28 679     3228         1035   677
           BPO        42 250     4748         2185   3992
           CCO        39 893     4510         1117   551
UniGOA16   MFO        38 585     1315         1058   613
           BPO        53 056     2068         1305   3531
           CCO        49 605     2506         1956   534

PredGO

Overview

As illustrated in Figure 1, PredGO comprises five key steps: data search and generation, protein sequence feature extraction, PPI feature fusion, structure feature extraction, feature concatenation and GO term prediction.

Figure 1

An overview of PredGO. The input is a protein sequence. PredGO first searches for interacting proteins in the STRING database and predicts the protein structure using AlphaFold2. Then the protein sequence features are extracted by ESM-1b, and the proteins with interactions are fused using a PPI feature fusion module with protein fusion layers. The predicted structures are represented in the structure feature extraction module as graphs with scalar and vector features and processed by GVP–GNN. Finally, the sequence-based and PPI-based features are concatenated with structure-based features to obtain the scores of GO terms via a multilayer perceptron and sigmoid function.

The data search and generation step serves two main purposes. Firstly, it utilizes AlphaFold2 to generate 3D structural data for proteins. Secondly, it searches the database to retrieve other proteins that interact with the target protein. This step provides crucial inputs for the subsequent feature extraction process.

The sequence feature extraction module, PPI feature fusion module and structure feature extraction module are responsible for extracting relevant features from the obtained data. These modules play an essential role in capturing important characteristics of proteins.

Finally, the feature concatenation and term prediction steps integrate the extracted features to predict protein functions based on GO terms.

Protein sequence feature extraction

The protein language model ESM-1b, pre-trained on massive data, can effectively represent a protein sequence as a vector. For a protein sequence of length $n$, the output of this model is a matrix $p \in \mathbb{R}^{n \times 1280}$. We average $p = (a_1, a_2, \ldots, a_n)^{\mathrm{T}}$ over its first dimension to get a fixed-length vector $p^{\prime} \in \mathbb{R}^{1280}$, and use this vector to represent the protein:

$p^{\prime} = \frac{1}{n}\sum_{i=1}^{n} a_i$    (1)

Assuming that $m$ proteins interacting with the target protein $t$ can be found in the PPI database, $m+1$ protein sequences are input into the model. After the above processing, a matrix $H^E_t = (p^{\prime}_0, p^{\prime}_1, p^{\prime}_2, \ldots, p^{\prime}_m)^{\mathrm{T}} \in \mathbb{R}^{(m+1) \times 1280}$ is obtained, where $p^{\prime}_0$ is the vector of the target protein and the others are the vectors of the interacting proteins.
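
As a concrete illustration, the sketch below mean-pools per-residue ESM-1b representations into per-protein vectors using the fair-esm package; the helper name `embed` and the batching details are our own assumptions.

```python
# A minimal sketch of the sequence feature extraction step (Eq. 1),
# assuming the fair-esm package (pip install fair-esm).
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(sequences):
    """Return one 1280-d mean-pooled ESM-1b embedding per sequence."""
    data = [(f"protein{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]           # (batch, L+2, 1280)
    vecs = []
    for i, (_, seq) in enumerate(data):
        # drop BOS/EOS tokens, then average over residues (Eq. 1)
        vecs.append(reps[i, 1:len(seq) + 1].mean(dim=0))
    return torch.stack(vecs)                    # (m+1, 1280): target + partners
```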

PPI feature fusion module

Following the encoder module of the Transformer model [48], we designed the PPI feature fusion module. This module consists of a stack of protein fusion layers followed by a residual connection layer.

The protein fusion layer consists of a multi-head attention mechanism and a feed-forward neural network. It treats interacting proteins as different words in a sentence, and each protein can fuse useful information from the other proteins through multi-head attention. In the multi-head attention mechanism, the vector of each protein is linearly transformed to generate multiple sets of queries $Q$, keys $K$ and values $V$, where $Q$ and $K$ determine the fusion weights and $V$ is the value being fused. After the multi-head attention layer and the feed-forward neural network, we use residual connections [49], dropout [50] and layer normalization [51] to avoid vanishing gradients and overfitting. To prevent interacting proteins from losing their original information during fusion, we merge the unprocessed vector with the processed vector through a residual connection to obtain the final feature, which contains both the sequence information and the PPI information of the protein. The multi-head attention mechanism and feed-forward neural network are computed as follows:

$MultiHead(H^E_t) = Concat(head_1, \ldots, head_h)W^O$    (2)

where $head_i = Attention(H^E_t W^Q_i,\, H^E_t W^K_i,\, H^E_t W^V_i)$.

$Attention(Q, K, V) = softmax\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V$    (3)

$FFN(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$    (4)

where $W^Q_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{model} \times d_v}$, $W^O \in \mathbb{R}^{hd_v \times d_{model}}$, $W_1 \in \mathbb{R}^{d_{model} \times 4d_{model}}$ and $W_2 \in \mathbb{R}^{4d_{model} \times d_{model}}$ are the projection parameter matrices, $h$ is the number of heads, $d_k = d_v = d_{model}/h = 1280/h$, and $b_1$ and $b_2$ are bias terms.

Taking the matrix $H^E_t$ input to the protein fusion layers as an example, a new matrix ${H^E_t}^{\prime} = (p^{\prime\prime}_0, p^{\prime\prime}_1, p^{\prime\prime}_2, \ldots, p^{\prime\prime}_m)^{\mathrm{T}} \in \mathbb{R}^{(m+1) \times 1280}$ is obtained after the protein fusion layers. After further processing by the residual connection $v_{sp} = LayerNorm(Dropout(p^{\prime}_0 + p^{\prime\prime}_0))$, a vector $v_{sp} \in \mathbb{R}^{1280}$ containing both sequence and PPI information is obtained to represent the target protein.
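
The sketch below is one plausible realization of this module, assuming it mirrors a standard Transformer encoder layer; the class names, head count and layer sizes are our own assumptions, while the residual/dropout/layer-norm ordering follows the description above.

```python
# A minimal sketch of the PPI feature fusion module (Eqs 2-4 plus the
# outer residual connection producing v_sp).
import torch
import torch.nn as nn

class ProteinFusionLayer(nn.Module):
    def __init__(self, d_model=1280, n_heads=8, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, h):                      # h: (batch, m+1, 1280)
        a, _ = self.attn(h, h, h)              # each protein attends to partners
        h = self.norm1(h + self.drop(a))       # residual + layer norm
        return self.norm2(h + self.drop(self.ffn(h)))

class PPIFusion(nn.Module):
    def __init__(self, n_layers=2, d_model=1280, dropout=0.2):
        super().__init__()
        self.layers = nn.ModuleList(ProteinFusionLayer(d_model)
                                    for _ in range(n_layers))
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, h):
        p0 = h[:, 0]                           # unprocessed target vector p'_0
        for layer in self.layers:
            h = layer(h)
        # outer residual: v_sp = LayerNorm(Dropout(p'_0 + p''_0))
        return self.norm(self.drop(p0 + h[:, 0]))
```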

Structure feature extraction module

We use GVP–GNN [40] to extract the information contained in the 3D structure. A GVP can be viewed as an extension of a linear transformation that operates on both scalars and vectors. All nodes and edges in this GNN are represented by tuples of scalars and vectors, enabling efficient representation of the 3D structures of large biomolecules, including proteins, through geometric and relational reasoning. A GVP takes a pair of scalar features $s \in \mathbb{R}^{n}$ and vector features $V \in \mathbb{R}^{3 \times v}$ as input and outputs new scalar features $s^{\prime} \in \mathbb{R}^{m}$ and vector features $V^{\prime} \in \mathbb{R}^{3 \times \mu}$. The calculation proceeds as follows:

$s^{\prime} = \sigma\left(Concat\left(s,\, \lVert V_h \rVert_2\right)W_m + b\right)$    (5)

$V^{\prime} = \sigma^{+}\left(\lVert V_h W_\mu \rVert_2\right) \odot \left(V_h W_\mu\right)$    (6)

where $V_h = VW_h$ and $\lVert \cdot \rVert_2$ denotes the column-wise L2 norm. $W_h \in \mathbb{R}^{v \times h}$, $W_m \in \mathbb{R}^{(n+h) \times m}$ and $W_\mu \in \mathbb{R}^{h \times \mu}$ are the projection parameter matrices, $b$ is the bias term, and $\sigma$ and $\sigma^{+}$ are activation functions.
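
The sketch below is a simplified take on a geometric vector perceptron in the spirit of Eqs (5)-(6) and [40]; it stores vector features as (v, 3) rather than (3, v), and the choices of ReLU for $\sigma$ and sigmoid for $\sigma^{+}$ are our assumptions.

```python
# A minimal sketch of a geometric vector perceptron (GVP).
import torch
import torch.nn as nn

class GVP(nn.Module):
    def __init__(self, n, v, m, mu, h=None):
        super().__init__()
        h = h or max(v, mu)
        self.W_h = nn.Linear(v, h, bias=False)   # vector channel mixing (W_h)
        self.W_mu = nn.Linear(h, mu, bias=False) # vector output projection (W_mu)
        self.W_m = nn.Linear(n + h, m)           # scalar update, bias plays b

    def forward(self, s, V):
        # s: (..., n) scalar features; V: (..., v, 3) vector features
        V_h = self.W_h(V.transpose(-1, -2)).transpose(-1, -2)      # (..., h, 3)
        V_mu = self.W_mu(V_h.transpose(-1, -2)).transpose(-1, -2)  # (..., mu, 3)
        norm_h = V_h.norm(dim=-1)                # (..., h), rotation-invariant
        s_out = torch.relu(self.W_m(torch.cat([s, norm_h], dim=-1)))  # Eq. (5)
        gate = torch.sigmoid(V_mu.norm(dim=-1)).unsqueeze(-1)      # sigma^+ gate
        return s_out, gate * V_mu                # Eq. (6), equivariant output
```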

GVP–GNN uses the message passing mechanism [52] to update node embeddings with messages from adjacent nodes and edges. Let $h^{(j)}_v$ and $h^{(j\rightarrow i)}_e$ denote the embeddings of node $j$ and edge $(j\rightarrow i)$; the message passed from node $j$ to node $i$ is $h^{(j\rightarrow i)}_m = g(Concat(h^{(j)}_v, h^{(j\rightarrow i)}_e))$, where $g$ is a function composed of GVPs. A graph propagation step is as follows:

$h^{(i)}_v \leftarrow LayerNorm\left(h^{(i)}_v + \frac{1}{k^{\prime}}\, Dropout\left(\sum_{j:\, j\rightarrow i} h^{(j\rightarrow i)}_m\right)\right)$    (7)

where $k^{\prime}$ is the number of incoming messages. Between graph propagation steps, the network uses GVPs to update the node embeddings, both scalar and vector features, at all nodes:

$h^{(i)}_v \leftarrow LayerNorm\left(h^{(i)}_v + Dropout\left(g\left(h^{(i)}_v\right)\right)\right)$    (8)

In this module, we create a KNN graph by connecting adjacent nodes based on the $C\alpha$ atom coordinates, and then construct scalar and vector features for both nodes and edges.

For instance, consider the $i$th node, and let $C_i$ denote its C atom. The node's scalar features comprise the sines and cosines of dihedral angles computed from $N_i$, $C\alpha_i$, $C_i$, $C_{i-1}$ and $N_{i+1}$. The node's vector features consist of the forward and reverse unit vectors in the directions $C\alpha_{i+1}-C\alpha_i$ and $C\alpha_{i-1}-C\alpha_i$, respectively. We also estimate the unit vector in the direction $C\beta_i-C\alpha_i$ by assuming tetrahedral geometry and normalizing [40].

To encode edges, we use Gaussian radial basis functions and sinusoidal encodings of relative position [48] for the scalar features, while the vector feature of an edge is the unit vector between the $C\alpha$ atoms at its ends.

For a protein containing $n$ residues, the structure feature extraction module produces a matrix $v_s \in \mathbb{R}^{n \times d_s}$, where the first dimension indexes residues and $d_s$ is the per-residue output dimension of GVP–GNN.
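
As an illustration of the graph construction, the sketch below builds the residue-level KNN graph with the torch_cluster helper used by PyTorch Geometric; the neighbor count k and the RBF parameters are our own assumptions.

```python
# A minimal sketch of building the residue KNN graph with RBF-encoded
# edge distances and unit-vector edge features.
import torch
from torch_cluster import knn_graph

def build_graph(ca_coords, k=30, num_rbf=16, d_max=20.0):
    # ca_coords: (n, 3) C-alpha coordinates of one protein
    edge_index = knn_graph(ca_coords, k=k)            # (2, n*k), j -> i edges
    src, dst = edge_index
    offsets = ca_coords[src] - ca_coords[dst]         # (E, 3)
    dist = offsets.norm(dim=-1, keepdim=True)
    # edge vector feature: unit vector between the two C-alpha atoms
    edge_vec = offsets / (dist + 1e-8)
    # edge scalar feature: Gaussian RBF expansion of the distance
    centers = torch.linspace(0.0, d_max, num_rbf)
    edge_rbf = torch.exp(-((dist - centers) / (d_max / num_rbf)) ** 2)
    return edge_index, edge_rbf, edge_vec
```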

Feature concatenation and term prediction

We combine the outputs of the PPI feature fusion module and the structure feature extraction module for GO term prediction. Since the output $v_s$ of the structure feature extraction module is at the residue level, it is first converted into a protein-level vector $v_s^{\prime} \in \mathbb{R}^{d_s}$ using a global mean pooling (GMP) layer. Then, $v_s^{\prime}$ is concatenated with the vector $v_{sp}$ from the PPI feature fusion module to obtain the final representation $v_{target}$ of the target protein. Next, a multilayer perceptron (MLP) maps this embedding to the dimensionality of the number of GO terms. Finally, a sigmoid function converts the model outputs into confidence scores $s_p$ between 0 and 1:

$s_p = Sigmoid\left(MLP\left(Concat\left(GMP(v_s),\, v_{sp}\right)\right)\right)$    (9)
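
The sketch below is one plausible reading of Eq. (9); the layer widths follow the three-layer MLP described under Training, but the activation inside the MLP is our own assumption.

```python
# A minimal sketch of the prediction head: global mean pooling over residues,
# concatenation with the PPI-fused vector, an MLP and a sigmoid.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, d_s, d_sp, n_terms):
        super().__init__()
        d_in = d_s + d_sp                      # layer 1: concatenated feature size
        self.mlp = nn.Sequential(
            nn.Linear(d_in, 4 * d_in),         # layer 2: four times layer 1
            nn.ReLU(),
            nn.Linear(4 * d_in, n_terms))      # layer 3: number of GO terms

    def forward(self, v_s, v_sp):
        # v_s: (batch, n_res, d_s) residue-level structure features
        v_s_prot = v_s.mean(dim=1)             # global mean pooling (GMP)
        x = torch.cat([v_s_prot, v_sp], dim=-1)
        return torch.sigmoid(self.mlp(x))      # confidence score per GO term
```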

Training

We implemented the model with the PyTorch and PyTorch Geometric libraries [53, 54] and trained it with binary cross-entropy as the loss function and the AdamW optimizer [55] with a learning rate of 1e-3. We set the dropout rate to 0.2 and use two GVP–GNN layers, two protein fusion layers and a three-layer multilayer perceptron, where the dimension of MLP layer 1 is the sum of the output feature dimensions of the structure feature extraction module and the PPI feature fusion module, the dimension of layer 2 is four times that of layer 1, and the dimension of layer 3 is the number of predicted GO terms. We trained six models, one for each of Molecular Function Ontology (MFO), Biological Process Ontology (BPO) and Cellular Component Ontology (CCO) on each of the CAFA3 and UniGOA16 datasets. During training, the model with the highest $F_{max}$ on the validation set is retained as the final model. On the CAFA3 dataset, the MFO, BPO and CCO models were trained for 15, 20 and 15 epochs with batch sizes of 24, 24 and 36, respectively; on the UniGOA16 dataset, for 10, 15 and 10 epochs with batch sizes of 36, 36 and 24.
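
The sketch below summarizes this training setup; the dataset plumbing (the loader and the `batch.labels` field) is our own assumption.

```python
# A minimal sketch of the training loop: BCE loss on sigmoid scores,
# AdamW with lr = 1e-3, validation-based model selection omitted for brevity.
import torch
import torch.nn as nn

def train(model, loader, n_epochs=15, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                 # binary cross-entropy over GO terms
    for epoch in range(n_epochs):
        model.train()
        for batch in loader:
            opt.zero_grad()
            scores = model(batch)          # (batch, n_terms) in [0, 1]
            loss = loss_fn(scores, batch.labels.float())
            loss.backward()
            opt.step()
        # in the paper, the checkpoint with the best validation F_max is kept
```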

Competing methods

Naive approach

Because of the hierarchical structure of GO, high-level GO terms accumulate more annotations. Predictions are obtained by simply calculating the GO term frequencies in the training set and assigning them to all proteins. This method, called 'naive' in CAFA, is used as a baseline for comparing methods [43]:

$S(p_i, G_j) = \frac{N_j}{N_{total}}$    (10)

where $p_i$ is the target protein, $G_j$ is the GO term to be predicted, $N_j$ is the number of occurrences of $G_j$ in the training set and $N_{total}$ is the number of proteins in the training set.
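
A minimal sketch of this baseline follows; the function name is our own.

```python
# The naive frequency baseline of Eq. (10): every target protein receives
# the same score for a given GO term, its training-set frequency.
from collections import Counter

def naive_predictor(train_annotations):
    """train_annotations: list of GO-term sets, one per training protein."""
    counts = Counter(term for terms in train_annotations for term in terms)
    n_total = len(train_annotations)
    return {term: n / n_total for term, n in counts.items()}
```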

DiamondBLAST and DiamondScore

Proteins with similar sequences tend to have similar functions. A basic approach to function prediction is to use BLAST to find proteins in the training set with similar sequences and assign their functions to the target protein [18]. We use Diamond to find similar sequences in the training set and obtain a bitscore for each pair of query and hit [56]. DiamondBLAST uses the maximum normalized bitscore over the hits as the predicted value for the target protein $p_i$:

$S(p_i, G_j) = \max_{p_s \in S(p_i)} \frac{bitscore(p_i, p_s)\, T(G_j, p_s)}{\max_{p_t \in S(p_i)} bitscore(p_i, p_t)}$    (11)

where $S(p_i)$ is the set of proteins with sequences similar to protein $p_i$, filtered at an e-value of 0.001; $T(G_j, p_s)$ returns 1 if $G_j$ is a true annotation of protein $p_s$ and 0 otherwise; and $bitscore(p_i, p_s)$ is the similarity score of proteins $p_i$ and $p_s$ computed by Diamond.

DiamondBLAST considers only the most similar sequences, while DiamondScore uses all the similar sequences returned by Diamond to predict the function. The calculation method is as follows:

$S(p_i, G_j) = \frac{\sum_{p_s \in S(p_i)} bitscore(p_i, p_s)\, T(G_j, p_s)}{\sum_{p_s \in S(p_i)} bitscore(p_i, p_s)}$    (12)
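
The sketch below implements the bitscore-weighted transfer of Eq. (12); running Diamond itself is assumed to have already produced the (hit, bitscore) pairs, and the function signature is our own.

```python
# DiamondScore-style annotation transfer (Eq. 12).
from collections import defaultdict

def diamond_score(hits, train_annotations):
    """hits: list of (protein_id, bitscore); train_annotations: id -> GO set."""
    if not hits:
        return {}
    total = sum(score for _, score in hits)
    scores = defaultdict(float)
    for pid, bitscore in hits:
        for term in train_annotations.get(pid, ()):  # T(G_j, p_s) = 1
            scores[term] += bitscore
    return {term: s / total for term, s in scores.items()}
```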

DeepGOCNN and DeepGOPlus

DeepGOCNN and DeepGOPlus are well-known deep learning methods that predict from sequence alone. DeepGOCNN scans protein sequences with a 1D CNN and uses a flat classification layer to predict protein functions [23]; it can predict protein functions efficiently and at scale. DeepGOPlus combines the output of DeepGOCNN with that of DiamondScore to further improve prediction performance. We downloaded the source code of these methods and, following the description in the paper, trained models with different CNN filter sizes on the training set, kept the model that performed best on the validation set, and finally selected filters of sizes {8, 16, 24, ..., 128}. For DeepGOPlus, we selected the alpha parameter for combining DiamondScore that performed best on the validation set.

LR-ESM

Large-scale protein language models have achieved surprising performance in protein-related fields. ESM-1b is a state-of-the-art protein language model that can be used to predict structure, function and other protein properties directly from individual sequences [39]. We convert the residue-level embeddings output by the model into a protein-level embedding by averaging over residues, and then use logistic regression (LR) to predict GO terms.
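
A minimal sketch of this baseline, assuming a scikit-learn one-vs-rest logistic regression and the `embed()` helper sketched earlier; the wrapper choice and hyperparameters are our own.

```python
# LR-ESM: mean-pooled ESM-1b embeddings fed to per-term logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

def train_lr_esm(train_seqs, train_labels):
    """train_labels: (n_proteins, n_terms) binary matrix of GO annotations."""
    X = embed(train_seqs).numpy()          # (n_proteins, 1280)
    clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
    return clf.fit(X, train_labels)        # predict_proba gives per-term scores
```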

Evaluation metrics

We used three evaluation metrics to measure performance: $F_{max}$, $S_{min}$ and the area under the precision–recall curve (AUPR). $F_{max}$ and $S_{min}$ are the main evaluation metrics in CAFA [41, 43]; AUPR is widely used in the evaluation of multi-label classification tasks, including protein function prediction [57, 58]. $F_{max}$ is a protein-centric evaluation metric, defined as follows:

$pr_i(t) = \frac{\sum_{j} 1\left(G_j \in P_i(t) \wedge G_j \in T_i\right)}{\sum_{j} 1\left(G_j \in P_i(t)\right)}$    (13)

$rc_i(t) = \frac{\sum_{j} 1\left(G_j \in P_i(t) \wedge G_j \in T_i\right)}{\sum_{j} 1\left(G_j \in T_i\right)}$    (14)

$AvgPr(t) = \frac{1}{k(t)} \sum_{i=1}^{k(t)} pr_i(t)$    (15)

$AvgRc(t) = \frac{1}{n} \sum_{i=1}^{n} rc_i(t)$    (16)

$F_{max} = \max_{t}\left\{\frac{2 \cdot AvgPr(t) \cdot AvgRc(t)}{AvgPr(t) + AvgRc(t)}\right\}$    (17)

where $t$ is a prediction threshold stepped by 0.01 between 0 and 1, $P_i(t)$ is the set of GO terms predicted for protein $i$ with score at least $t$, $T_i$ is the set of its true annotations, $k(t)$ is the number of proteins with at least one GO term scored no less than $t$, $n$ is the total number of proteins, and $1(\cdot)$ is 1 if its argument is true and 0 otherwise.
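
The sketch below computes $F_{max}$ over dense label matrices following Eqs (13)-(17); it assumes, as CAFA does, that every benchmark protein has at least one true term.

```python
# Protein-centric F_max (Eqs 13-17).
import numpy as np

def f_max(y_true, y_score):
    """y_true, y_score: (n_proteins, n_terms) binary labels and scores."""
    best = 0.0
    for t in np.arange(0.01, 1.0, 0.01):
        pred = y_score >= t
        has_pred = pred.sum(axis=1) > 0        # the k(t) proteins
        if not has_pred.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)
        pr = (tp[has_pred] / pred[has_pred].sum(axis=1)).mean()   # Eq. (15)
        rc = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()      # Eq. (16)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))             # Eq. (17)
    return best
```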

$S_{min}$ is a term-centric evaluation metric that measures the semantic distance between true and predicted annotations. It is calculated as follows:

$S_{min} = \min_{t} \sqrt{ru(t)^2 + mi(t)^2}$    (18)

$ru(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j} IC(G_j)\, 1\left(G_j \notin P_i(t) \wedge G_j \in T_i\right)$    (19)

$mi(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j} IC(G_j)\, 1\left(G_j \in P_i(t) \wedge G_j \notin T_i\right)$    (20)

$IC(G_j) = -\log Pr\left(G_j \mid Parent(G_j)\right)$    (21)

where $ru(t)$ is the remaining uncertainty and $mi(t)$ is the misinformation, $IC(G_j)$ is the information content of term $G_j$, $Pr$ denotes conditional probability and $Parent(G_j)$ denotes the parent node of $G_j$ in the GO hierarchy.
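
The sketch below follows Eqs (18)-(21), assuming the per-term information contents have already been computed from the training annotations into a vector `ic`.

```python
# S_min over dense label matrices (Eqs 18-21).
import numpy as np

def s_min(y_true, y_score, ic):
    """ic: (n_terms,) information content of each GO term."""
    best = np.inf
    for t in np.arange(0.01, 1.0, 0.01):
        pred = y_score >= t
        miss = (~pred) & (y_true > 0)          # true terms left unpredicted
        wrong = pred & (y_true == 0)           # predicted terms that are wrong
        ru = (miss * ic).sum(axis=1).mean()    # remaining uncertainty, Eq. (19)
        mi = (wrong * ic).sum(axis=1).mean()   # misinformation, Eq. (20)
        best = min(best, np.sqrt(ru ** 2 + mi ** 2))
    return best
```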

RESULTS

To evaluate the performance of our model, we conducted several experiments. The 'Performance comparison of different feature combinations' experiment was carried out on the UniGOA16 dataset, while the datasets for 'Performance on proteins with and without PPI information' and 'Performance comparison on different species' were obtained by splitting the UniGOA16 test set. The statistics are shown in Table 2.

Table 2

Statistics on the number of sub-datasets used in the experiment

Sub-dataset               MFO    BPO    CCO
UniGOA16                  1058   1305   1956
Proteins with PPI         953    1081   1740
Proteins without PPI      104    224    216
Human proteins            145    119    163
Mouse proteins            64     124    109
Fission yeast proteins    11     25     20

Performance on the test data set

As shown in Table 3, PredGO performs best in all three ontology domains on both datasets. On the CAFA3 dataset, PredGO achieves $F_{max}$ of 0.674, 0.585 and 0.699, $S_{min}$ of 6.194, 18.067 and 6.717, and AUPR of 0.642, 0.512 and 0.678 on MFO, BPO and CCO, respectively. On the UniGOA16 dataset, the corresponding $F_{max}$ values are 0.687, 0.405 and 0.694, the $S_{min}$ values are 3.719, 17.947 and 5.442, and the AUPR values are 0.603, 0.312 and 0.734. The performance differences between ontologies on the two datasets stem from the structure and complexity of the ontologies and the available annotations. Comparing the results in the table, we can draw the following conclusions:

Table 3

Performance comparison on the CAFA3 and UniGOA16 datasets

                     F_max                  S_min                    AUPR
Method               MFO    BPO    CCO     MFO     BPO     CCO      MFO    BPO    CCO
CAFA3
  Naive              0.444  0.406  0.613   8.989   21.859  7.872    0.328  0.285  0.572
  DiamondBLAST       0.549  0.483  0.600   9.186   29.500  9.108    0.302  0.217  0.354
  DiamondScore       0.610  0.529  0.436   6.723   19.464  7.354    0.634  0.322  0.473
  DeepGOCNN          0.499  0.401  0.645   8.380   21.212  7.540    0.436  0.321  0.570
  DeepGOPlus         0.599  0.520  0.664   7.605   20.117  7.345    0.507  0.368  0.587
  LR-ESM             0.649  0.556  0.682   6.483   19.081  6.951    0.631  0.455  0.670
  PredGO             0.674  0.585  0.699   6.194   18.067  6.717    0.642  0.512  0.678
UniGOA16
  Naive              0.452  0.304  0.638   5.335   19.946  6.674    0.288  0.184  0.595
  DiamondBLAST       0.507  0.318  0.475   5.903   26.360  7.506    0.273  0.090  0.269
  DiamondScore       0.556  0.360  0.510   4.054   18.620  6.629    0.400  0.146  0.349
  DeepGOCNN          0.528  0.329  0.646   5.185   19.735  6.553    0.362  0.193  0.603
  DeepGOPlus         0.576  0.378  0.637   4.597   19.352  6.485    0.443  0.192  0.589
  LR-ESM             0.660  0.385  0.680   4.031   18.400  5.706    0.572  0.292  0.705
  PredGO             0.687  0.405  0.694   3.719   17.947  5.442    0.603  0.312  0.734

Note: Best performance in bold.

  • (i) On both datasets, compared with LR-ESM, PredGO improves $F_{max}$ by 4.0%, 5.2% and 2.3%, decreases $S_{min}$ by 6.1%, 3.9% and 4.0%, and improves AUPR by 3.5%, 9.7% and 2.7% on average for MFO, BPO and CCO. Combining additional information thus improves performance over using the protein language model alone.

  • (ii) DeepGOCNN uses only sequence information from the training set, whereas PredGO and LR-ESM use protein language models pre-trained on a large amount of protein sequence data. Averaged over both datasets, PredGO improves $F_{max}$ by about 32.6%, 34.5% and 7.9%, decreases $S_{min}$ by 27.2%, 11.9% and 13.9%, and improves AUPR by 56.9%, 60.6% and 20.3% on MFO, BPO and CCO. Language models pre-trained on large numbers of sequences thus effectively improve model performance.

  • (iii) The traditional sequence-similarity methods, DiamondBLAST and DiamondScore, outperform the naive method on MFO and BPO but perform poorly on CCO. DeepGOPlus, which combines DeepGOCNN and DiamondScore, improves the $F_{max}$ of MFO and BPO by more than 9.1% on both datasets but decreases the $F_{max}$ of CCO by 1.3% on the UniGOA16 dataset. This may reflect that sequence homology carries more information about molecular functions and biological processes than about cellular components. PredGO improves on these methods by at least about 3.6% and by as much as about 61.7% per metric, showing that deep learning models can learn deeper information than sequence homology from large numbers of sequences and other features.

Comparison with CAFA3 methods

We used the official CAFA evaluation tool to assess the performance of our model trained on the CAFA3 dataset [41, 43]. Note that our training set contains only ~88% of the proteins in the provided CAFA3 training dataset, because we excluded proteins longer than 1000 residues and those containing 'fuzzy' amino acids, for which structure and sequence information extraction was impractical. To ensure complete predictions on the benchmark data, we dropped the components requiring structural information in cases where structure prediction was not feasible.

The evaluation results are displayed in Figure 2. PredGO attains $F_{max}$ scores of 0.56, 0.38 and 0.65 in the BPO, MFO and CCO evaluations, respectively, ranking first in CCO, second in MFO and third in BPO. These rankings differ slightly from our internal evaluation, which can be attributed to the fact that our model is trained primarily on proteins shorter than 1000 residues; beyond this length, sequence truncation can significantly degrade the predicted structure, losing critical information and hurting performance.

Figure 2

Comparison of PredGO with the CAFA3 top 10 methods.

Moreover, the top-ranked method in the MFO and BPO evaluations is Zhu lab's method, which integrates multiple classifier components via ensemble learning, while the second-ranked method in the BPO evaluation, INGA-Tosatto, employs a more diverse set of features [21, 59]. In contrast, PredGO could itself be integrated as a component of Zhu lab's method and does not utilize 'dark' proteomic information, which may explain why PredGO is not the best-performing model in the MFO and BPO evaluations.

Performance on proteins with and without PPI information

PredGO searches the database for PPI information, but not all proteins yield results, especially newly discovered proteins for which little relevant information exists. To measure the performance of the model when only sequence information is available, we divided the UniGOA16 test set according to whether PPI information could be found in the database. The performance of PredGO versus competing methods is shown in Table 4.

Table 4

Performance on proteins with and without PPI information

                     F_max                  S_min                    AUPR
Method               MFO    BPO    CCO     MFO     BPO     CCO      MFO    BPO    CCO
Proteins with PPI information
  Naive              0.457  0.308  0.653   5.227   19.792  6.664    0.295  0.182  0.610
  DiamondBLAST       0.509  0.321  0.482   5.924   26.609  7.570    0.270  0.092  0.277
  DiamondScore       0.556  0.366  0.521   4.033   18.374  6.654    0.401  0.153  0.363
  DeepGOCNN          0.530  0.329  0.659   5.115   19.555  6.524    0.364  0.193  0.626
  DeepGOPlus         0.577  0.383  0.644   4.542   19.225  6.486    0.440  0.195  0.609
  LR-ESM             0.658  0.382  0.691   3.980   18.285  5.692    0.573  0.285  0.717
  PredGO             0.686  0.407  0.705   3.664   17.772  5.412    0.603  0.306  0.746
Proteins without PPI information
  Naive              0.471  0.307  0.552   6.024   20.686  6.757    0.228  0.199  0.482
  DiamondBLAST       0.526  0.305  0.417   5.717   25.311  6.983    0.295  0.082  0.204
  DiamondScore       0.586  0.329  0.421   4.219   19.754  6.383    0.387  0.116  0.243
  DeepGOCNN          0.532  0.331  0.542   5.732   20.594  6.760    0.353  0.192  0.456
  DeepGOPlus         0.586  0.367  0.577   4.945   19.970  6.483    0.468  0.197  0.450
  LR-ESM             0.689  0.401  0.590   4.392   18.960  5.789    0.574  0.331  0.604
  PredGO             0.704  0.399  0.602   4.207   18.791  5.689    0.607  0.341  0.627

Note: Best performance in bold.

As the table shows, PredGO comprehensively outperforms the competing methods, including the naive and sequence-based methods, on MFO and CCO, improving on the second-best method by between 1.7% and 7.9% in each metric.

Comparing BPO performance, PredGO outperforms the competing methods on proteins with PPI information. For proteins without PPI information, PredGO leads the second-place LR-ESM on the $S_{min}$ and AUPR metrics by between 1.7% and 5.7%; only on the protein-centric metric $F_{max}$ does it trail second place, by 0.002 (~0.5%). Taking the three metrics together, PredGO outperforms all competing methods except LR-ESM and is no weaker than LR-ESM. Overall, the additional structure information that PredGO uses improves its MFO and CCO performance, and even the absence of PPI information does not affect its BPO performance much. This is because many biological processes are carried out by multiple proteins together, so predicted 3D structural information is of limited help in predicting biological processes.

Performance comparison on different species

We divided the UniGOA16 test set by species. Table 5 shows the performance of PredGO versus competing methods on human, mouse and fission yeast. Performance varies somewhat across species, probably because mouse has fewer annotations than human and yeast functions are somewhat simpler. PredGO achieves the best performance in 25 of the 27 results. Its BPO performance is generally better than that of the other methods, which may be attributed to the fusion module that integrates PPI information.

Table 5

Performance on different species

                     F_max                  S_min                    AUPR
Method               MFO    BPO    CCO     MFO     BPO     CCO      MFO    BPO    CCO
HUMAN (9606)
  Naive              0.475  0.309  0.623   5.244   27.744  6.934    0.330  0.235  0.612
  DiamondBLAST       0.493  0.303  0.473   6.158   30.145  8.118    0.239  0.083  0.239
  DiamondScore       0.547  0.349  0.500   4.492   26.173  7.118    0.364  0.130  0.309
  DeepGOCNN          0.589  0.367  0.645   5.137   26.728  6.795    0.438  0.265  0.608
  DeepGOPlus         0.593  0.400  0.631   4.684   26.809  6.896    0.479  0.231  0.577
  LR-ESM             0.720  0.407  0.680   3.790   24.892  6.158    0.660  0.320  0.697
  PredGO             0.733  0.435  0.686   3.722   24.620  5.990    0.672  0.341  0.715
MOUSE (10 090)
  Naive              0.541  0.307  0.611   6.042   33.266  9.87     0.254  0.238  0.557
  DiamondBLAST       0.421  0.311  0.518   5.502   33.686  10.409   0.208  0.100  0.267
  DiamondScore       0.440  0.325  0.544   4.842   30.651  9.321    0.304  0.160  0.336
  DeepGOCNN          0.595  0.340  0.626   5.950   32.473  9.777    0.303  0.240  0.553
  DeepGOPlus         0.647  0.387  0.631   5.599   31.693  9.393    0.389  0.237  0.559
  LR-ESM             0.673  0.384  0.634   4.584   30.157  9.424    0.516  0.339  0.612
  PredGO             0.697  0.403  0.646   4.679   29.653  9.194    0.516  0.341  0.633
FISSION YEAST (284 812)
  Naive              0.525  0.324  0.728   7.536   22.706  9.493    0.272  0.240  0.657
  DiamondBLAST       0.593  0.464  0.617   6.587   24.762  11.955   0.345  0.257  0.324
  DiamondScore       0.626  0.459  0.657   4.012   18.392  9.569    0.489  0.290  0.402
  DeepGOCNN          0.502  0.341  0.726   7.647   23.189  9.556    0.283  0.196  0.597
  DeepGOPlus         0.625  0.499  0.717   6.905   20.017  8.460    0.499  0.345  0.629
  LR-ESM             0.772  0.503  0.800   5.231   18.339  8.922    0.599  0.408  0.731
  PredGO             0.789  0.521  0.782   4.731   18.183  8.771    0.642  0.436  0.766

Note: Best performance in bold.

Across the metrics, PredGO outperformed all competing methods on the human dataset, with improvements over the second-best method ranging from 1.1% to 8.8%. Only the $S_{min}$ metric for mouse MFO and the $F_{max}$ metric for fission yeast CCO decreased, by about 2% relative to LR-ESM, while remaining much better than the other competing methods.

Case study

We use protein Q47319 as an example to illustrate the difference in performance between PredGO and the competing methods. Q47319 is a tRNA-uridine aminocarboxypropyltransferase that catalyzes the formation of 3-(3-amino-3-carboxypropyl)uridine (acp3U) at position 47 of tRNAs; acp3U47 confers thermal stability on tRNA [60–62]. Figure 3 shows a DAG of this protein's BPO terms and the methods that correctly predict the corresponding GO terms; Table 6 lists each method's predictions and F1 scores.

Figure 3

DAG diagram of correctly predicted BPO terms of Q47319 using different methods.

Table 6

Predicted GO terms (GO terms in the UniGOA16 dataset) of Q47319 in BPO by PredGO and competing methods

Naive (F1 = 0.378): GO:0044260, GO:0008152, GO:0043170, GO:0044763, GO:0071704, GO:0044237, GO:0050794, GO:0050789, GO:0065007, GO:0044699, GO:0050896, GO:0009987, GO:0044238

DiamondBLAST (F1 = 0.000): no predictions

DiamondScore (F1 = 0.000): no predictions

DeepGOCNN (F1 = 0.350): GO:0050794*, GO:0044699, GO:0044763, GO:0050789, GO:0043170, GO:0050896, GO:0065007, GO:0044267, GO:0071704, GO:0044238, GO:0019222*, GO:0008152, GO:0009987, GO:0019538*, GO:0044237, GO:0044260

DeepGOPlus (F1 = 0.258): GO:0044699, GO:0050789, GO:0065007, GO:0071704, GO:0008152, GO:0009987, GO:0044237

LR-ESM (F1 = 0.500): GO:0008152, GO:0044763, GO:0044699, GO:0044710*, GO:0044237, GO:0009987, GO:0034641*, GO:0006807*, GO:0043170, GO:0044238, GO:0044260, GO:0071704

PredGO (F1 = 0.864): GO:0043412*, GO:0046483*, GO:0044699, GO:0009451*, GO:0044238, GO:0008152, GO:0044260, GO:0006396*, GO:0034470*, GO:0034641*, GO:0071704, GO:0006139*, GO:0006725*, GO:0090304*, GO:1901360*, GO:0043170, GO:0044237, GO:0006807*, GO:0009987, GO:0008033*

Experimental annotation: GO:0043412*, GO:0046483*, GO:0009451*, GO:0044238, GO:0008152, GO:0044260, GO:0006396*, GO:0034470*, GO:0034641*, GO:0071704, GO:0006139*, GO:0006725*, GO:0090304*, GO:0006399*, GO:1901360*, GO:0010467*, GO:0043170, GO:0044237, GO:0016070*, GO:0006807*, GO:0009987, GO:0006400*, GO:0008033*, GO:0034660*

Note: The correctly predicted GO terms are in bold. Terms that do not appear in the naive prediction are marked with *.

As can be seen, the Q47319 protein has 24 experimentally annotated BPO terms. PredGO predicted 20 terms, of which 19 were correct, missing only five annotated terms. Compared with the competing methods, PredGO predicted a much higher proportion of terms correctly and made far fewer incorrect predictions. It also predicted more specific functions, successfully identifying the protein's tRNA processing function. The correct predictions of DeepGOCNN and DeepGOPlus were essentially the same as those of the frequency-based naive method, and the other protein language model-based method, LR-ESM, predicted less specific GO terms. DiamondBLAST and DiamondScore, based on sequence similarity, could not predict the function of Q47319 because no homolog of Q47319 was found in the training set.

Ablation study

To investigate the effect of the predicted structure and PPI information on PredGO's prediction performance, we designed an ablation study comparing four models: pre-trained embedding only; pre-trained embedding plus predicted structure; pre-trained embedding plus PPI information; and the full PredGO model using all information. All of these models use an MLP to map protein embeddings to GO term scores; they differ only in whether they omit the structure feature extraction module, the PPI feature fusion module, or both. Their performance is shown in Table 7.

Table 7

Performance comparison of different feature combinations

                                                     F_max                  S_min                    AUPR
Features                                             MFO    BPO    CCO     MFO     BPO     CCO      MFO    BPO    CCO
pre-trained embedding                                0.651  0.379  0.679   4.123   18.708  5.609    0.559  0.276  0.706
pre-trained embedding & predicted structure          0.672  0.381  0.686   3.879   18.515  5.573    0.596  0.278  0.722
pre-trained embedding & PPI                          0.667  0.400  0.683   4.018   18.217  5.665    0.576  0.307  0.710
pre-trained embedding & predicted structure & PPI    0.687  0.405  0.694   3.719   17.947  5.442    0.603  0.312  0.734

Note: Best performance in bold.

As the table shows, adding the predicted structure to the pre-trained embedding yields clear improvements in MFO and CCO, with a maximum gain of about 6.6% (MFO AUPR), while the improvement in BPO is marginal. Adding PPI information instead improves BPO the most, with a gain of over 11% in BPO AUPR, but brings much smaller improvements in CCO. The model combining the pre-trained embedding, predicted structure and PPI information has the most comprehensive performance, outperforming the other three models on MFO, BPO and CCO alike. A plausible explanation is that molecular function and cellular component are more closely related to protein structure, whereas biological processes mostly arise from interactions among multiple proteins and are therefore more associated with PPI. The interaction information contained in a predicted structure is very limited, which is why adding the predicted structure barely improves BPO performance; we therefore incorporate PPI information to compensate.
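
As a concrete, hedged illustration of these four variants, the sketch below concatenates whichever per-protein feature vectors are available and scores all GO terms with a shared MLP; the dimensions, names and plain concatenation are assumptions for illustration, not the authors' implementation (PredGO itself fuses PPI information with attention, as described in the key points below).

```python
import torch
import torch.nn as nn

class AblationHead(nn.Module):
    """Hypothetical ablation variant: concatenate the available per-protein
    features and score every GO term with a shared MLP."""

    def __init__(self, d_embed, d_struct, d_ppi, n_go_terms,
                 use_structure=True, use_ppi=True):
        super().__init__()
        self.use_structure, self.use_ppi = use_structure, use_ppi
        d_in = (d_embed + (d_struct if use_structure else 0)
                + (d_ppi if use_ppi else 0))
        self.mlp = nn.Sequential(
            nn.Linear(d_in, 1024), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(1024, n_go_terms),
        )

    def forward(self, h_embed, h_struct=None, h_ppi=None):
        feats = [h_embed]
        if self.use_structure:
            feats.append(h_struct)
        if self.use_ppi:
            feats.append(h_ppi)
        # multi-label prediction: an independent probability per GO term
        return torch.sigmoid(self.mlp(torch.cat(feats, dim=-1)))

# The 'pre-trained embedding only' row of Table 7 would correspond to:
model = AblationHead(d_embed=1280, d_struct=0, d_ppi=0, n_go_terms=5000,
                     use_structure=False, use_ppi=False)
probs = model(torch.randn(4, 1280))  # (batch, n_go_terms)
```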

Large-scale function prediction

We applied PredGO to predict GO functions for all 205 788 human sequences in UniProtKB, using only sequence and PPI information for proteins whose structures could not be predicted by AlphaFold. Following the CAFA rules, we extracted these proteins' experimental and non-experimental annotations from the GOA database and compared them with the predictions of PredGO. As shown in Figure 4A, only 12.21% of the protein sequences have experimental annotations in the GOA database, and 62.40% have non-experimental annotations. By contrast, PredGO annotates ~100% of the entries with a confidence score greater than 0.5 (PredGO_All), ~90% of which are based on predicted structure (PredGO_Structure). This demonstrates that PredGO achieves extremely high coverage.

Figure 4: Protein function prediction results for human. (A) The percentage of human proteins in UniProtKB with annotated functions. (B) The number of functions per protein predicted by PredGO (y-axis) compared with the number of electronic annotations per protein in GOA (x-axis).

In addition, we visualized the number of annotations per protein. Figure 4B compares the number of non-experimental annotations in the GOA database with the number predicted by PredGO. As the figure shows, PredGO predicts more GO terms than GOA contains for most proteins, suggesting that PredGO can uncover additional functions.

WEB SERVER

We built a flexible, interactive web server that provides prediction and browsing services. Users can submit protein sequences or structures to the PredGO web server for function prediction (Figure 5A). The prediction results are presented in a table by default and can be downloaded as a text file (Figure 5B). Users can also browse the PredGO-predicted functions of all human proteins in UniProtKB (Figure 5C) and batch-download all annotations. In addition, users can interactively view the 3D structure of proteins and the GO annotations as a tree (Figure 5D).

Figure 5: PredGO web server. (A) Enter query protein sequences; (B) prediction results; (C) database search, statistics and download; (D) visualization of GO annotations.

CONCLUSION

We designed a method called PredGO for protein function prediction, a meaningful but complex multi-label classification problem. PredGO exploits the massive amount of sequence information through a pre-trained protein language model, incorporates the information contained in protein structures predicted with atomic-level accuracy by AlphaFold2, and fuses the representations of interacting proteins with attention mechanisms. The experimental results clearly show that our method enables high-performance protein function prediction. Even when only sequences are available, high-performance prediction of MFO and CCO can be achieved without degrading BPO performance.

Our method faces certain challenges in predicting the function of long protein sequences. In the future, alternative structure prediction methods could be considered to address this issue. Furthermore, integrating more diverse and heterogeneous data sources, along with pre-training protein structure models on both experimentally determined and predicted structures, has the potential to further improve predictive performance.

Key Points
  • We propose a protein function prediction method using heterogeneous features, including sequence, predicted protein structure and protein–protein interaction relationships, that outperforms other state-of-the-art methods in terms of coverage and accuracy.

  • A graph neural network combined with a geometric vector perceptron is applied to extract protein structure information.

  • Proteins with interactions are viewed as different words in a sentence, and sequence information and interaction information are fused by a PPI feature fusion module (a minimal sketch of this idea follows the list).
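
As a minimal sketch of the last key point (with the per-protein inputs assumed to come from the language-model and GVP-based modules of the first two points), interacting proteins can be fused with standard multi-head self-attention; the layer sizes and the residual/normalization layout here are illustrative assumptions, not the exact PredGO architecture.

```python
import torch
import torch.nn as nn

class PPIFusion(nn.Module):
    """Treat a protein and its interaction partners as the words of a
    sentence and mix them with multi-head self-attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, partners):
        # target: (B, d) embedding of the query protein;
        # partners: (B, P, d) embeddings of its interacting proteins
        tokens = torch.cat([target.unsqueeze(1), partners], dim=1)  # (B, 1+P, d)
        mixed, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + mixed)  # residual connection + layer norm
        return tokens[:, 0]                 # fused representation of the target

fusion = PPIFusion()
fused = fusion(torch.randn(2, 512), torch.randn(2, 7, 512))  # shape (2, 512)
```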

ACKNOWLEDGEMENTS

The work was carried out at the National Supercomputer Center in Tianjin, and the calculations were performed on the Tianhe new-generation supercomputer.

FUNDING

This work was supported by the National Natural Science Foundation of China under grants Nos 62272490 and 61972422.

AUTHORS’ CONTRIBUTION

R.Z., Z.H. and L.D. conceived and designed the algorithm; R.Z. and Z.H. conducted the experiments; R.Z. analyzed the results; R.Z. and L.D. wrote and reviewed the manuscript.

DATA AVAILABILITY

The webserver and database are available at http://predgo.denglab.org/.

Author Biographies

Rongtao Zheng is a graduate student at the School of Computer Science and Engineering, Central South University, Changsha, China. His research interests include data mining and protein function prediction.

Zhijian Huang is a graduate student at the School of Computer Science and Engineering, Central South University, Changsha, China. His research interests include data mining and protein function prediction.

Lei Deng is a professor at the School of Computer Science and Engineering, Central South University, Changsha, China. His research interests include data mining, bioinformatics and systems biology.
