Abstract

As one of the most important tasks in protein structure prediction, protein fold recognition has attracted increasing attention, and several computational predictors have been proposed with the development of machine learning and artificial intelligence techniques. However, these existing methods still suffer from several shortcomings. Here we propose a new network-based predictor called ProtFold-DFG for protein fold recognition. We propose the Directed Fusion Graph (DFG) to fuse the ranking lists generated by different methods; it employs the transitive closure to incorporate more relationships among proteins and uses the KL divergence to calculate the relationship between two proteins so as to improve its generalization ability. Finally, the PageRank algorithm is performed on the DFG to accurately recognize the protein folds by considering the global interactions among proteins in the DFG. Tested on a widely used and rigorous benchmark data set, the LINDAHL data set, ProtFold-DFG outperforms the other 35 competing methods, indicating that it will be a useful method for protein fold recognition. The source code and data of ProtFold-DFG can be downloaded from http://bliulab.net/ProtFold-DFG/download

Introduction

Different tertiary structures of proteins lead to various protein functions. A fundamental step in studying protein structures is to identify the protein folds [1]. In this regard, different computational methods have been proposed to solve the protein fold recognition problem. They can be divided into three categories: (i) alignment methods; (ii) machine-learning-based methods; (iii) network-based methods.

The alignment methods directly calculate the similarity between two proteins, based on which the folds can be predicted. For example, CLUSTAL W [2] employs position-specific gap penalties, HHsearch [3] is based on Hidden Markov Model-Hidden Markov Model (HMM-HMM) alignments, and PSI-BLAST [4] considers evolutionary information via Position-Specific Score Matrices (PSSMs). ACCFold [5] combines the Autocross-covariance (ACC) transformation with PSSMs. Because the residue–residue contacts contain predicted structure information [6], they have been incorporated into computational methods to further improve the predictive performance [7]. The local and nonlocal interactions among residues or motifs have been incorporated into predictors as well [8–10]. For example, MotifCNN-fold [11] combines motif-based kernels with a Convolutional Neural Network (CNN) and a Deep Convolutional Neural Network (DCNN) to extract fold-specific features.

Figure 1

The flowchart of the ProtFold-DFG predictor for protein fold recognition.

Figure 2

The process of generating the DFG. (A) The basic ranking lists are generated by four methods. (B) The four ranking lists are merged. (C) The Directed Fusion Graph is constructed based on the combined ranking list. (D) The PageRank is performed on the Directed Fusion Graph to recognize the protein folds.

Based on the aforementioned features, various machine-learning-based methods have been proposed [12]. For example, FOLDpro [13] combines different alignment methods and structural information with Support Vector Machines (SVMs) [14]. The ensemble classifier of [15] fuses nine basic OET-KNN classifiers into a more accurate classifier. Deep learning techniques [16, 17] have also been applied to this field. For example, DeepFRpro [10] employs a CNN to generate fold-specific features, based on which a random forest model is constructed to predict the folds. MV-fold [18] combines different features via multiview modeling.

The relationships among proteins can be considered as a protein similarity network, and some network-based methods extract the global interactions among proteins from this network to recognize the protein folds. For example, the enrichment of network topological similarity is performed on the protein similarity network to detect the protein folds [19]. CMsearch [20] learns the sequence–structure correlation based on cross-modal learning. Fold-LTR-TCP [21] is a re-ranking model combining the Triadic Closure Principle (TCP) algorithm with Learning to Rank (LTR) [22].

As discussed above, the network-based methods achieve the state-of-the-art performance. However, two bottleneck problems still exist in the network-based methods, which should be further explored: (i) how to incorporate the relationships among different proteins into the protein similarity network? (ii) how to accurately recognize the proteins sharing the same fold with the query protein based on the network? To answer these two questions, in this study we propose the Directed Fusion Graph (DFG) to incorporate the relationships among proteins detected by different methods. The PageRank algorithm [23] is then performed on the DFG for protein fold recognition by considering the global relationships among proteins. The new computational predictor is called ProtFold-DFG.

Materials and methods

Benchmark data set

In order to facilitate fair performance comparison with other competing methods, ProtFold-DFG is evaluated on a widely used and rigorous benchmark data set, the LINDAHL data set [24], which contains 321 proteins covering 38 protein folds extracted from the SCOP database. In order to rigorously simulate the protein fold recognition task, these proteins are divided into two subsets sharing no proteins from the same family or superfamily. Based on these two subsets, 2-fold cross-validation is used to evaluate the performance of various methods. For more information on the benchmark data set, please refer to [24].

The framework of ProtFold-DFG

Previous studies show that different methods for protein fold recognition are complementary [18, 25]; therefore, it is possible to further improve the predictive performance by fusing these complementary methods. The protein fold recognition task can be treated as an information retrieval problem: each query protein is searched against the data set by a computational method to generate a corresponding ranking list, and the query protein is more likely to share the same fold with the top hits in the ranking list. Across multiple ranking lists, the common top hits are more likely to be in the same fold as the query protein. In this regard, we propose the DFG to incorporate these characteristics.

In this study, we fuse the ranking lists generated by six computational methods. The DFG is constructed to reflect the relationships among proteins in the network based on the ranking lists generated by the different methods. Finally, the PageRank algorithm [23] is performed on the DFG to re-rank the proteins so as to generate the final ranking list. The proposed method is called ProtFold-DFG, whose overall flowchart is shown in Figure 1. The following sections describe each step in detail.

Figure 3

The pseudo-code of ProtFold-DFG predictor by combining PageRank and DFG.

Generate the basic ranking lists

As shown in Figure 2A, the basic ranking lists are generated by different computational methods to reflect the relationships between the query protein and the template proteins, including DeepFR [10], DeepSVM-fold (CCM) [7], DeepSVM-fold (PSFM) [7], MotifCNN-fold (CCM) [11], MotifCNN-fold (PSFM) [11] and Fold-LTR-TCP [21]. Among these methods, DeepFR, DeepSVM-fold (CCM), DeepSVM-fold (PSFM), MotifCNN-fold (CCM) and MotifCNN-fold (PSFM) are based on deep learning techniques that extract fold-specific features to improve the predictive performance, whereas Fold-LTR-TCP is based on the LTR [22].

DeepFR [10] extracts the fold-specific features by using DCNN, based on which the proteins are converted into feature vectors. The cosine similarity between two feature vectors is used to measure the similarity between two proteins.
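For illustration, the cosine similarity between two such feature vectors can be computed as follows (a minimal sketch; the function name is ours, not part of DeepFR):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two protein feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Parallel vectors score 1 and orthogonal vectors score 0, so the measure depends only on the direction of the learned feature vectors, not their magnitude.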

DeepSVM-fold [7] calculates the feature vectors of proteins based on the fold-specific features generated by a convolutional neural network–bidirectional long short-term memory network (CNN-BLSTM), and the similarity between two proteins is then calculated as the cosine similarity between their corresponding feature vectors. DeepSVM-fold (CCM), based on residue–residue contacts (CCM), and DeepSVM-fold (PSFM), based on the Position-Specific Frequency Matrix (PSFM), are employed in this study.

MotifCNN-fold [11] is based on structural motif kernels to construct motif-based CNNs. In this study, MotifCNN-fold (CCM) and MotifCNN-fold (PSFM) are used, which are constructed based on the CCM and PSFM, respectively.

Fold-LTR-TCP [21] searches the query protein against the template proteins so as to find, in a supervised manner, the template proteins in the same fold as the query protein. The ranking list generated by Fold-LTR-TCP without the TCP step is used in this study.

Merge multiple ranking lists

After generating the basic ranking lists, we merge these ranking lists into a digraph G by:
(1)
where $\mathbf{M}_l$ represents the ranking list generated by method $l$ and $k$ represents the total number of ranking lists. The proteins in each ranking list are sorted in ascending order according to their similarities with the query protein. An edge $\langle u,v\rangle$ indicates that protein $v$ is more likely to be in the same fold as the query protein than protein $u$ is. This process is shown in Figure 2B. The protein similarity network can be constructed by using the aforementioned approach, but it fails to measure the subtle relationships among proteins, which is addressed in the following section.
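The merging step can be sketched as follows (our reading of Equation 1: each adjacent pair in a ranking list contributes one directed edge; the function name and set-of-edges representation are ours):

```python
def merge_ranking_lists(ranking_lists):
    """Merge basic ranking lists into one set of directed edges.

    Each list is assumed sorted in ascending order of similarity to the
    query, so for every adjacent pair (u, v) the edge <u, v> records
    that v is the better candidate for the query's fold.
    """
    edges = set()
    for ranked in ranking_lists:
        for u, v in zip(ranked, ranked[1:]):
            edges.add((u, v))
    return edges
```

Because the edges from all lists land in one set, agreements between methods collapse into a single edge while disagreements (e.g. opposite orderings of the same pair) are both retained in the digraph.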

Construct the DFG

The digraph G only considers the relationships between adjacent proteins in the basic ranking lists and is therefore not yet suitable for protein fold recognition. In order to incorporate more interactions among proteins in the ranking lists, the transitive closure [26] is used to update the digraph G (Figure 2C). A ranking list $\mathbf{M}_l$ is first normalized by:
(2)
where $\Vert\mathbf{M}_l\Vert_1$, $\Vert\mathbf{M}_l\Vert_2$ and $\Vert\mathbf{M}_l\Vert_{\infty}$ are the 1-norm, 2-norm and infinity norm of the ranking list generated by method $l$, respectively. The depth-first transitive closure algorithm [26] is then employed to update the digraph G:
(3)

This updated digraph G represents the reachability among vertices, incorporating the relationships among proteins in the basic ranking lists.
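A depth-first construction of the (unweighted) closure can be sketched as follows; the set-of-edges representation and function name are ours, and the edge weights of Equation 2 are omitted for clarity:

```python
def transitive_closure(edges):
    """Depth-first transitive closure of a set of directed edges: an
    edge (u, w) is added whenever w is reachable from u, so proteins
    that are not adjacent in a ranking list still become related."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())
    closure = set()
    for start in adj:
        stack, seen = list(adj[start]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(adj[node])
    return closure
```

For example, the chain a → b → c gains the extra edge a → c, which is exactly the kind of non-adjacent relationship the DFG needs.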

Table 1

Performance of the ProtFold-DFG predictors based on different ranking lists on the LINDAHL data set evaluated by using 2-fold cross-validation

Method           Accuracy (%)
ProtFold-DFG^a   77.88
ProtFold-DFG^b   78.82
ProtFold-DFG^c   71.03
ProtFold-DFG^d   76.65
ProtFold-DFG^e   76.65
ProtFold-DFG^f   81.93
ProtFold-DFG^g   80.69

^a Fusing the rank lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM) and Fold-LTR-TCP.

^b Fusing the rank lists generated by MotifCNN-fold (CCM), MotifCNN-fold (PSFM) and Fold-LTR-TCP.

^c Fusing the rank lists generated by DeepFR and Fold-LTR-TCP.

^d Fusing the rank lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), DeepFR and Fold-LTR-TCP.

^e Fusing the rank lists generated by MotifCNN-fold (CCM), MotifCNN-fold (PSFM), DeepFR and Fold-LTR-TCP.

^f Fusing the rank lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), MotifCNN-fold (CCM), MotifCNN-fold (PSFM) and Fold-LTR-TCP.

^g Fusing the rank lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), MotifCNN-fold (CCM), MotifCNN-fold (PSFM), DeepFR and Fold-LTR-TCP.


Table 2

Performance of the ProtFold-DFG predictors with and without the transitive closure on the LINDAHL data set evaluated by using 2-fold cross-validation

Method           Accuracy (%)
ProtFold-DFG^a   54.20
ProtFold-DFG^b   81.93

^a DFG $\overline{\mathbf{\mathcal{F}}}$ without the transitive closure [26].

^b DFG $\overline{\mathbf{\mathcal{F}}}$ with the transitive closure [26].


Based on the KL divergence [27], the DFG $\overline{\mathbf{\mathcal{F}}}$ is constructed by assigning a weight to each edge in the digraph G, which can be represented as:
(4)
where $\overline{\mathbf{M}}_{l,u}$ represents the similarity between the template protein $u$ and the query protein calculated by method $l$, and $k$ represents the total number of ranking lists. The KL divergence [27] measures the relationship between two proteins, which is then mapped to the edge weight. As a result, the DFG can accurately represent the probability of mutual recommendation among proteins. Because we cannot ensure that $\overline{\mathbf{\mathcal{F}}}$ is stochastic and irreducible [28], $\overline{\mathbf{\mathcal{F}}}$ is further converted following the Google matrix [23] by:
(5)
where $n$ is the number of training samples. This conversion ensures that $\overline{\mathbf{\mathcal{F}}}$ is both stochastic and irreducible [28].
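The standard Google-matrix construction referenced here can be sketched as follows. The damping factor d = 0.85 is a common default and an assumption of ours, since its value is not stated in the surrounding text:

```python
def to_google_matrix(F, d=0.85):
    """Convert a weighted adjacency matrix into a stochastic and
    irreducible one, following the Google-matrix construction: each
    row is normalized to sum to 1, then mixed with a uniform teleport
    term so every entry is strictly positive."""
    n = len(F)
    G = []
    for row in F:
        s = sum(row)
        # a dangling row (no out-edges) becomes the uniform distribution
        stochastic = [w / s if s > 0 else 1.0 / n for w in row]
        G.append([d * w + (1.0 - d) / n for w in stochastic])
    return G
```

Row-normalization makes the matrix stochastic; the uniform teleport term makes every state reachable from every other, which is what irreducibility requires.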

Perform PageRank on DFG

The PageRank algorithm is able to find the most cited literature from a global perspective [29], and it has also been successfully applied to protein remote homology detection [30]. The PageRank algorithm assumes that the accuracy of a ranking is positively correlated with the relationships among proteins in the graph. Therefore, as shown in Figure 2D, the proteins in this study are treated as the literature, and the proteins in the same fold can be considered as the most relevant literature. Based on this analogy, PageRank is performed on the DFG for protein fold recognition.

Table 3

Performance of the ProtFold-DFG predictors using KL divergence and out-degree on the LINDAHL data set evaluated by using 2-fold cross-validation

Method                           Accuracy (%)
ProtFold-DFG (out-degree)^a      80.06
ProtFold-DFG (out-degree)^b      7.47
ProtFold-DFG (KL divergence)^c   81.93
ProtFold-DFG (KL divergence)^d   81.93

^a The evaluation score between two proteins is the reciprocal of the vertex's out-degree, and the initial value is the average of the basic methods.

^b The evaluation score between two proteins is the reciprocal of the vertex's out-degree, and the initial value is $\mathbf{R}^{(0)}$ (Equation 8).

^c The evaluation score between two proteins is calculated by the KL divergence, and the initial value is the average of the basic methods.

^d The evaluation score between two proteins is calculated by the KL divergence, and the initial value is $\mathbf{R}^{(0)}$ (Equation 8).


Table 4

Performance of the ProtFold-DFG predictors based on three link analysis algorithms on the LINDAHL data set evaluated by using 2-fold cross-validation

Method           Link analysis algorithm          Accuracy (%)
ProtFold-DFG^a   Hyperlink Induced Topic Search   80.37
ProtFold-DFG^b   Triadic Closure Principle        79.43
ProtFold-DFG^c   PageRank                         81.93

^a The Hyperlink Induced Topic Search (HITS) algorithm is performed on the DFG.

^b The Triadic Closure Principle (TCP) algorithm is performed on the DFG.

^c The PageRank algorithm is performed on the DFG.


Figure 4

The performance of four ProtFold-DFG predictors based on the out-degree and KL divergence with different iterations. For the meanings of a, b, c and d, please refer to the footnote of Table 3.

Table 5

Performance of 36 computational methods for protein fold recognition on LINDAHL data set via 2-fold cross-validation

Methods                  Category^a         Accuracy (%)   Source
PSI-BLAST                Alignment          4              [4]
HMMER                    Alignment          4.4            [33]
SAM-T98                  Alignment          3.4            [34]
BLASTLINK                Alignment          6.9            [24]
SSEARCH                  Alignment          5.6            [35]
SSHMM                    Alignment          6.9            [35]
THREADER                 Alignment          14.6           [36]
FUGUE                    Alignment          12.5           [37]
RAPTOR                   Alignment          25.4           [38]
SPARKS                   Alignment          24.3           [39]
SP3                      Alignment          28.7           [51]
FOLDpro                  Machine learning   26.5           [13]
HHpred                   Alignment          25.2           [40]
SP4                      Alignment          30.8           [41]
SP5                      Alignment          37.9           [42]
BoostThreader            Machine learning   42.6           [43]
SPARKS-X                 Alignment          45.2           [44]
FFAS-3D                  Machine learning   35.8           [45]
RF-Fold                  Machine learning   40.8           [46]
DN-Fold                  Machine learning   33.6           [47]
RFDN-Fold                Machine learning   37.7           [47]
DN-FoldS                 Machine learning   33.3           [47]
DN-FoldR                 Machine learning   27.4           [47]
HH-fold                  Machine learning   42.1           [48]
TA-fold                  Machine learning   53.9           [48]
dRHP-PseRA               Machine learning   34.9           [49]
MT-fold                  Machine learning   59.1           [18]
DeepFR (strategy 1)      Machine learning   44.5           [10]
DeepFR (strategy 2)      Machine learning   56.1           [10]
DeepFRpro (strategy 1)   Machine learning   57.6           [10]
DeepFRpro (strategy 2)   Machine learning   66.0           [10]
DeepSVM-fold             Machine learning   67.3           [7]
MotifCNN-fold            Machine learning   72.6           [11]
Fold-LTR-TCP             Network            73.2           [21]
FoldRec-C2C              Network            77.88          [50]
ProtFold-DFG^b           Network            81.93          This study

^a ‘Alignment’, ‘Machine learning’ and ‘Network’ mean that the corresponding predictors are alignment methods, machine-learning-based methods and network-based methods, respectively, according to the three categories introduced in the Introduction section.

^b The ProtFold-DFG predictor is based on the KL divergence, PageRank and the DFG generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), MotifCNN-fold (CCM), MotifCNN-fold (PSFM) and Fold-LTR-TCP.


The final ranking list $\mathbf{R}$ generated by the ProtFold-DFG predictor based on PageRank and the DFG $\overline{\mathbf{\mathcal{F}}}$ can be represented as [31]:
(6)
or [31]:
(7)
where $\mathbf{I}$ is the identity matrix. The power method [28] is employed to solve this problem by:
(8)
where $\mathbf{R}^{(t)}$ is the result of multiplying the initial vector $\mathbf{R}^{(0)}$ by the matrix $\overline{\mathbf{\mathcal{F}}}$ $t$ times, $\mathbf{R}^{(0)}$ is set as $\frac{2+n}{3n}\mathbf{1}$ at the start and $n$ is the number of training samples. $\mathbf{R}^{(t)}$ is normalized after each matrix multiplication by using Equation 2. The pseudocode of ProtFold-DFG is shown in Figure 3.
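The power method described above can be sketched as follows. The initial vector (2 + n)/(3n) · 1 follows the text; renormalizing with the 1-norm after every multiplication is our simplification, since the paper's Equation 2 involves several norms:

```python
def pagerank_power_method(G, t=50):
    """Power iteration on a row-stochastic matrix G (a list of rows):
    repeatedly propagate rank mass along the directed edges, then
    renormalize, until the vector settles."""
    n = len(G)
    r = [(2.0 + n) / (3.0 * n)] * n  # initial vector from the paper
    for _ in range(t):
        # r <- G^T r: each node collects mass from its in-neighbours
        r = [sum(G[j][i] * r[j] for j in range(n)) for i in range(n)]
        s = sum(r)
        r = [x / s for x in r]
    return r
```

For a stochastic, irreducible G this iteration converges to the dominant eigenvector, i.e. the final ranking scores of the template proteins.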

Evaluation methodology

The overall accuracy is used to measure the performance of the different methods, which can be calculated by:

(9) $\mathrm{Accuracy} = \frac{CN}{N} \times 100\%$

where $CN$ represents the number of correctly predicted query proteins and $N$ is the total number of query proteins to be predicted. Note that a query protein is correctly predicted if its fold is the same as that of the top-ranked template protein in the training set; otherwise, it is incorrectly predicted.
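This top-1 evaluation can be sketched as follows (the function name and the mapping-based inputs are ours):

```python
def overall_accuracy(top_hit, fold_of):
    """Overall accuracy as in Equation 9: a query protein counts as
    correct when the fold of its top-ranked template matches its own
    fold.  `top_hit` maps each query to its top-ranked template;
    `fold_of` maps any protein to its fold label."""
    correct = sum(1 for query, template in top_hit.items()
                  if fold_of[template] == fold_of[query])
    return 100.0 * correct / len(top_hit)
```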

Results and discussion

The influence of the different combinations of the basic ranking lists on the performance of the ProtFold-DFG predictors

In this section, we investigate the influence of the basic predictors on the predictive performance. As introduced in the method section, six computational methods are selected to generate the basic ranking lists. There are two reasons for selecting these predictors: (i) they are state-of-the-art methods in the field of protein fold recognition; (ii) they are based on different theories and techniques and are therefore complementary.

The results of the ProtFold-DFG predictors fusing different ranking lists are shown in Table 1, from which we can see the following: (i) the performance of ProtFold-DFG can be improved by adding more complementary ranking lists. (ii) The ProtFold-DFG predictor based on the five ranking lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), MotifCNN-fold (CCM), MotifCNN-fold (PSFM) and Fold-LTR-TCP achieves the best performance. (iii) The ProtFold-DFG predictor achieves lower performance when the ranking list generated by DeepFR is added. The reason is that Fold-LTR-TCP is also based on DeepFR, so their predictive results are not complementary, and noise is introduced when incorporating the predictive results of DeepFR into the ProtFold-DFG predictor. This conclusion is further supported by the fact that the ProtFold-DFG predictor based only on DeepFR and Fold-LTR-TCP achieves an accuracy of just 71.03%.

Figure 5

The ProtFold-DFG prediction visualization. The best ProtFold-DFG predictor, based on the five basic ranking lists, is analyzed. The ranking lists generated by DeepSVM-fold (CCM), DeepSVM-fold (PSFM), Fold-LTR-TCP, MotifCNN-fold (CCM) and MotifCNN-fold (PSFM) are visualized in (A–E), respectively. Based on the basic ranking lists, the corresponding DFG is generated (F), and then PageRank is performed on the DFG to detect the template proteins sharing the same protein fold with the query protein (G). In these figures, the blue and gray lines represent the relationships among proteins: the relationship between two proteins linked by a blue line is closer than that between two proteins linked by a gray line. The red lines indicate the final prediction results: the query proteins and the template proteins are predicted to be in the same protein folds. The figures were plotted with the software tool Gephi [52].

The impact of the transitive closure on the performance of ProtFold-DFG

The main contribution of this study is the DFG, which fuses different ranking lists so as to model the global interactions among proteins. Different from the Google matrix [23], the DFG incorporates more interactions among proteins in the ranking lists by using the transitive closure [26]. In this section, we explore the influence of the transitive closure on the performance of ProtFold-DFG. The results, listed in Table 2, show that the transitive closure clearly improves the performance of ProtFold-DFG. The reason is that with the transitive closure the DFG not only considers the relationships between adjacent proteins in the ranking lists but also incorporates the local and global interactions among proteins, which provides valuable information for PageRank to re-rank the proteins so as to accurately recognize the protein folds.

ProtFold-DFG based on KL divergence can improve its generalization ability

In this study, we employ the KL divergence [27] to assign weights to the edges in the DFG $\overline{\mathbf{\mathcal{F}}}$. Besides the KL divergence, the out-degree [29] can also be used to calculate the weight of each edge. The performance of ProtFold-DFG predictors based on these two methods with different node initial values is shown in Table 3 and Figure 4, from which we can see the following: (i) the two ProtFold-DFG predictors based on the KL divergence achieve stable performance, indicating that the DFG constructed by the KL divergence is insensitive to the node initial values and the number of iterations $t$ (Equation 8). (ii) In contrast, the two ProtFold-DFG predictors based on the out-degree achieve lower performance, and the initial value and the number of iterations obviously impact their performance.

As discussed above, PageRank updates the ranking list during each iteration; therefore, a suitable node initial value helps PageRank to find an optimal ranking list, although the initial value has little impact on the results as the number of iterations grows. Compared with the out-degree, the KL divergence considers the subtle interactions among proteins in the ranking list. Therefore, the two ProtFold-DFG predictors based on the KL divergence achieve higher and more stable performance than the two ProtFold-DFG predictors based on the out-degree.

The performance of three link analysis algorithms performed on the DFG

Inspired by the Google matrix, we propose the DFG to fuse different ranking lists and incorporate the local and global interactions among proteins. A link analysis algorithm is then performed on the DFG for protein fold recognition. In this section, the performance of three link analysis algorithms is explored, including PageRank [23], Hyperlink-Induced Topic Search (HITS) [32] and the Triadic Closure Principle (TCP) [21]. The predictive results of the three ProtFold-DFG predictors based on these algorithms are shown in Table 4. The ProtFold-DFG predictor based on PageRank achieves the best performance. The reason is that PageRank searches the related proteins in the DFG globally, making full use of the mutual recommendation relationships among all the proteins in the DFG. In contrast, HITS only models the local interactions among proteins, and TCP only considers the pairwise interactions between two proteins.
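As an illustration of how PageRank globally propagates scores over a weighted directed graph such as the DFG, the following sketch implements a basic weighted PageRank with uniform handling of dangling nodes; the node names, edge weights and damping factor are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of weighted PageRank, analogous to re-ranking proteins on
# the DFG. Node names and weights below are hypothetical.

def pagerank(nodes, edges, d=0.85, iters=50, init=None):
    """edges: dict (u, v) -> weight. Returns a score per node."""
    n = len(nodes)
    scores = dict(init) if init else {v: 1.0 / n for v in nodes}
    out_w = {u: 0.0 for u in nodes}           # total outgoing weight per node
    for (u, v), w in edges.items():
        out_w[u] += w
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}  # teleportation term
        for (u, v), w in edges.items():
            new[v] += d * scores[u] * w / out_w[u]
        # dangling nodes (no outgoing edges): spread their mass uniformly
        dangling = sum(scores[u] for u in nodes if out_w[u] == 0)
        for v in nodes:
            new[v] += d * dangling / n
        scores = new
    return scores

nodes = ["q", "p1", "p2"]
edges = {("q", "p1"): 0.7, ("q", "p2"): 0.3, ("p1", "p2"): 1.0}
s = pagerank(nodes, edges)
ranking = sorted(nodes, key=s.get, reverse=True)
print(ranking[0])  # the top-ranked node after score propagation
```

Because every node's score depends on the scores of all nodes pointing to it, the final ranking reflects the whole graph rather than any single pairwise similarity, which is the property exploited here.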

Performance comparison with 35 other related methods

In order to evaluate the performance of ProtFold-DFG, it is compared with 35 other computational methods for protein fold recognition, including PSI-BLAST [4], HMMER [33], SAM-T98 [34], BLASTLINK [24], SSEARCH [35], SSHMM [35], THREADER [36], FUGUE [37], RAPTOR [38], SPARKS [39], SP3 [39], FOLDpro [13], HHpred [40], SP4 [41], SP5 [42], BoostThreader [43], SPARKS-X [44], FFAS-3D [45], RF-Fold [46], DN-Fold [47], RFDN-Fold [47], DN-FoldS [47], DN-FoldR [47], HH-fold [48], TA-fold [48], dRHP-PseRA [49], MT-fold [18], DeepFR (strategy 1) [10], DeepFR (strategy 2) [10], DeepFRpro (strategy 1) [10], DeepFRpro (strategy 2) [10], DeepSVM-fold [7], MotifCNN-fold [11], Fold-LTR-TCP [21] and FoldRec-C2C [50].

The results of the 36 predictors are listed in Table 5, from which we can see the following: (i) ProtFold-DFG outperforms the computational methods used for generating the basic ranking lists, including DeepSVM-fold, Fold-LTR-TCP and MotifCNN-fold. These results further confirm that fusing the different basic ranking lists with the DFG improves the predictive performance. (ii) The three network-based methods (Fold-LTR-TCP, FoldRec-C2C and ProtFold-DFG) show better performance than the alignment methods and the machine-learning-based methods. The reason is that the network-based methods incorporate the global relationships among proteins into the prediction, which is more effective than considering only the pairwise protein similarities or the local relationships among proteins. (iii) ProtFold-DFG achieves the best performance among the 36 predictors. The reason is that ProtFold-DFG more accurately measures the relationships among proteins by fusing complementary ranking lists, and the PageRank link analysis algorithm globally analyzes the relationships among proteins in the DFG to accurately detect the protein folds.

Feature analysis

In order to further explore why the ProtFold-DFG predictor is able to accurately detect protein folds, a query protein 1bdo-d1bdo from protein fold 2_59 (SCOP ID) is selected as an example, and the prediction results of ProtFold-DFG are visualized in Figure 5, from which we can see the following: (i) according to the top hits in the five basic ranking lists shown in Figure 5A–E, the fold type of the query protein is incorrectly predicted by all five computational predictors. (ii) The DFG is able to fuse the five ranking lists by considering the local and global relationships among proteins (Figure 5F). (iii) Performed on the DFG, the PageRank algorithm re-ranks the proteins in the DFG and correctly predicts the fold of the query protein 1bdo-d1bdo as 2_59 (Figure 5G). These results are not surprising: although the top hits in the basic ranking lists of the five predictors are incorrect, these errors can be corrected by fusing the complementary ranking lists and comprehensively analyzing their relationships. ProtFold-DFG treats proteins like documents in a citation network and finds the most closely linked proteins by combining PageRank with the DFG; these proteins can be considered to belong to the same protein fold.

Conclusion

Protein fold recognition is one of the key tasks in bioinformatics. In this regard, some computational methods have been proposed. Among these methods, the network-based methods achieve the state-of-the-art performance, because they can more effectively model the relationships among proteins with the help of the protein similarity network [50]. In this study, we propose a new network-based computational predictor called ProtFold-DFG. Compared with other existing methods, it has the following advantages: (i) the proposed DFG is able to fuse the complementary ranking lists generated by different methods, and more interactions can be incorporated via the transitive closure. (ii) The KL divergence more accurately calculates the weights of the edges in the DFG and is insensitive to the initial values and the number of iterations. As a result, the performance and generalization ability of ProtFold-DFG are clearly improved. (iii) The PageRank algorithm can correct the prediction errors in the basic ranking lists by comprehensively considering the protein interactions in the DFG. Because the DFG is able to fuse the results of different ranking methods, other applications in bioinformatics can be anticipated, such as protein remote homology detection [53].

Key Points
  • Protein fold recognition is a fundamental task for detecting protein structures and determining protein functions. More accurate computational predictors are highly desirable.

  • We propose the ProtFold-DFG predictor for protein fold recognition. It is based on the DFG generated by different basic ranking methods. In order to globally consider the interactions among proteins in the DFG, the PageRank is performed on DFG to re-rank the ranking lists so as to improve the predictive performance.

  • Experimental results show that the ProtFold-DFG predictor outperforms the other 35 state-of-the-art computational methods. The source code and data of the ProtFold-DFG can be accessed from http://bliulab.net/ProtFold-DFG/download/

Acknowledgements

The authors are much indebted to the three anonymous reviewers, whose constructive comments were very helpful in strengthening the presentation of this article.

Funding

The National Key R&D Program of China (No. 2018AAA0100100); National Natural Science Foundation of China (No. 61672184, 61822306, 61702134, 61732012, 61861146002); the Beijing Natural Science Foundation (No. JQ19019); Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063).

Jiangyi Shao is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, China. His expertise is in bioinformatics.

Bin Liu is a professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

References

1. Chothia C, Finkelstein AV. The classification and origins of protein folding patterns. Annu Rev Biochem 1990;59:1007–39.
2. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994;22:4673–80.
3. Söding J. Protein homology detection by HMM–HMM comparison. Bioinformatics 2005;21:951–60.
4. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
5. Dong Q, Zhou S, Guan J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009;25:2655–62.
6. Gromiha MM, Selvaraj S. Inter-residue interactions in protein folding and stability. Prog Biophys Mol Biol 2004;86:235–77.
7. Liu B, Li C, Yan K. DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks. Brief Bioinform. doi: .
8. Han KF, Baker D. Recurring local sequence motifs in proteins. J Mol Biol 1995;251:176–87.
9. Simons KT, Kooperberg C, Huang E, et al. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997;268:209–25.
10. Zhu J, Zhang H, Li SC, et al. Improving protein fold recognition by extracting fold-specific features from predicted residue-residue contacts. Bioinformatics 2017;33:3749–57.
11. Li C-C, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform. doi: .
12. Wei L, Zou Q. Recent progress in machine learning-based methods for protein fold recognition. Int J Mol Sci 2016;17:2118.
13. Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006;22:1456–63.
14. Li D, Ju Y, Zou Q. Protein folds prediction with hierarchical structured SVM. Curr Proteomics 2016;13:79–85.
15. Shen H-B, Chou K-C. Ensemble classifier for protein fold pattern recognition. Bioinformatics 2006;22:1717–22.
16. Peng L, Peng MM, Liao B, et al. The advances and challenges of deep learning application in biological big data processing. Curr Bioinform 2018;13:352–9.
17. Lv Z, Ao C, Zou Q. Protein function prediction: from traditional classifier to deep learning. Proteomics 2019;19:1900119.
18. Yan K, Fang X, Xu Y, et al. Protein fold recognition based on multi-view modeling. Bioinformatics 2019;35:2982–90.
19. Lhota J, Hauptman R, Hart T, et al. A new method to improve network topological similarity search: applied to fold recognition. Bioinformatics 2015;31:2106–14.
20. Cui X, Lu Z, Wang S, et al. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction. Bioinformatics 2016;32:i332–40.
21. Liu B, Zhu Y, Yan K. Fold-LTR-TCP: protein fold recognition based on triadic closure principle. Brief Bioinform. doi: .
22. Burges CJC. From RankNet to LambdaRank to LambdaMART: an overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
23. Haveliwala T, Kamvar S. The second eigenvalue of the Google matrix. Technical Report, Stanford University, 2003. http://ilpubs.stanford.edu:8090/582/
24. Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J Mol Biol 2000;295:613–25.
25. Patil K, Chouhan U. Relevance of machine learning techniques and various protein features in protein fold classification: a review. Curr Bioinform 2019;14:688–97.
26. Ioannidis YE, Ramakrishnan R. Efficient transitive closure algorithms. In: Proceedings of the 14th VLDB Conference. 1988, p. 382–94.
27. Goldberger J, Gordon S, Greenspan H. An efficient image similarity measure based on approximations of KL-divergence between two gaussian mixtures. In: Proceedings Ninth IEEE International Conference on Computer Vision. France, 2003, p. 487–93.
28. Langville AN, Meyer CD. Deeper inside PageRank. Internet Math 2004;1:335–80.
29. Page L, Brin S, Motwani R, et al. The PageRank citation ranking: bringing order to the web. Technical Report, 1999. http://ilpubs.stanford.edu:8090/422/
30. Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief Bioinform 2020;21:298–308.
31. Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 1998;30:107–17.
32. Kleinberg JM. Authoritative sources in a hyperlinked environment. J ACM 1999;46:604–32.
33. McClure MA, Smith C, Elton P. Parameterization studies for the SAM and HMMER methods of hidden Markov model generation. Proc Int Conf Intell Syst Mol Biol 1996;4:155–64.
34. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998;14:846–56.
35. Hargbo J, Elofsson A. Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 1999;36:68–76.
36. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358:86–9.
37. Shi J, Blundell TL, Mizuguchi K. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 2001;310:243–57.
38. Xu J, Li M, Kim D, et al. RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol 2003;1:95–117.
39. Zhou H, Zhou Y. Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 2004;55:1005–13.
40. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005;33:W244–8.
41. Liu S, Zhang C, Liang S, et al. Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins 2007;68:636–45.
42. Zhang W, Liu S, Zhou Y. SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model. PLoS One 2008;3:e2325.
43. Peng J, Xu J. Boosting protein threading accuracy. Res Comput Mol Biol 2009;5541:31–45.
44. Yang Y, Faraggi E, Zhao H, et al. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 2011;27:2076–82.
45. Xu D, Jaroszewski L, Li Z, et al. FFAS-3D: improving fold recognition by including optimized structural features and template re-ranking. Bioinformatics 2014;30:660–7.
46. Jo T, Cheng J. Improving protein fold recognition by random forest. BMC Bioinform 2014;15(Suppl 11):S14.
47. Jo T, Hou J, Eickholt J, et al. Improving protein fold recognition by deep learning networks. Sci Rep 2015;5:17573.
48. Xia J, Peng Z, Qi D, et al. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier. Bioinformatics 2017;33:863–70.
49. Chen J, Long R, Wang XL, et al. dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Sci Rep 2016;6:32333.
50. Shao J, Yan K, Liu B. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network. Brief Bioinform. doi: .
51. Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 2005;58:321–8.
52. Bastian M, Heymann S, Jacomy M. Gephi: an open source software for exploring and manipulating networks. In: Third International AAAI Conference on Weblogs and Social Media. 2009.
53. Liu B, Li S. ProtDet-CCH: protein remote homology detection by combining Long Short-Term Memory and ranking methods. IEEE/ACM Trans Comput Biol Bioinform 2019;16:1203–10.
