Abstract

As an important task in protein structure and function studies, protein fold recognition has attracted increasing attention. The existing computational predictors in this field treat this task as a multi-classification problem, ignoring the relationship among proteins in the dataset. However, previous studies showed that this relationship is critical for protein homology analysis. In this study, protein fold recognition is treated as an information retrieval task. The Learning to Rank (LTR) model was employed to search the query protein against the template proteins in a supervised manner, so as to find the template proteins in the same fold as the query protein. The triadic closure principle (TCP) was then applied to the ranking list generated by the LTR to improve its accuracy by considering the relationship among the query protein and the template proteins in the ranking list. Finally, a predictor called Fold-LTR-TCP was proposed. A rigorous test on the LE benchmark dataset showed that the Fold-LTR-TCP predictor achieved an accuracy of 73.2%, outperforming all the other competing methods.

Introduction

The protein folds reflect the different patterns of protein structures, including the 3D structures of protein molecules and the trend of protein polypeptide chains [1, 2]. It is believed that different folds lead to different protein functions [2]. Therefore, accurate recognition of protein folds is helpful for predicting the protein structures and functions, and protein fold recognition based only on the protein sequences is critical for protein sequence analysis [3].

Because of its importance, several computational predictors have been proposed to detect protein folds from their sequences. They fall into two categories: sequence alignment methods and discriminant-based methods [3, 4].

Sequence alignment methods calculate the similarity between a query protein sequence and a template protein sequence. The higher the score between the query protein and the template protein is, the more likely the query protein belongs to the same fold as this template protein. Alignment methods applied to protein fold recognition include the Needleman–Wunsch algorithm [5] with global optimum and the Smith–Waterman algorithm with local optimum [1, 6]. Later, in order to improve the speed of traditional sequence alignment methods, researchers proposed the BLAST [7] algorithm, the FASTA [8] algorithm and HAlign [9, 10] at the cost of reduced accuracy. These methods cannot accurately detect the fold relationships among proteins with sequence similarity less than 25% [2]. Because proteins in the same fold usually share very low sequence identity, the simple sequence-profile alignment methods failed to detect protein folds [11]. In order to increase the sensitivity of sequence alignments, the structure features of proteins were added to the sequence alignment methods, such as HHpred [12], FFAS-3D [13], SPARKS-X [14], etc. Later, researchers improved the HMM alignments [15] by sequence profiles, such as Hmmer [16] and HHblits [17]. Machine learning techniques have been widely used in protein fold recognition based on the sequence alignment methods. For example, RF-Fold [18] combines the features extracted by various sequence alignment methods with the random forest algorithm [19]. Ding et al. [20] employed support vector machines (SVMs) and neural networks (NNs) to solve the protein fold recognition problem. Polat et al. [21] constructed a new NN called GAL for the protein fold recognition problem. The multi-view model was employed to combine various sequence-based features [22]. Because profiles contain evolutionary information, some predictors improve the predictive performance by incorporating them [23–25].
Recently, some predictors were constructed based on deep learning techniques, which have been widely employed in the field of bioinformatics [26–33]. For example, DeepFRpro [34] extracted fold-specific features by using deep learning and markedly improved the predictive performance. Later, this method was improved by combining the profile-based features extracted by the LSTM model with SVMs [35].

All these aforementioned methods have played important roles in the development of this field. However, some problems remain to be addressed: (i) the existing sequence alignment methods search for proteins in the same fold in an unsupervised manner, failing to accurately detect proteins sharing low sequence similarities; and (ii) the current predictors ignore the relationship among the proteins in the dataset, although, as reported in a previous study [36], this information is critical for protein homology analysis.

In this regard, we propose a novel predictor called Fold-LTR-TCP for protein fold recognition based on the Learning to Rank (LTR) algorithm [37] and the triadic closure principle (TCP) [38] so as to solve the aforementioned problems. The LTR algorithm retrieves the proteins in the same fold in a supervised manner so as to use the protein fold information together with the sequence-based and profile-based features. A previous study [36] has shown that a protein similarity network considering the relationship of proteins in the dataset is useful for protein remote homology detection. Inspired by this study, the TCP [38] was performed on the protein similarity network constructed by the LTR algorithm [37] so as to improve the accuracy of the ranking list. When tested on a widely used benchmark dataset, the proposed Fold-LTR-TCP predictor outperformed other competing methods for protein fold recognition.

Materials and method

Benchmark dataset

In this study, the LE benchmark dataset [39] was used to evaluate the performance of various methods. The LE dataset was proposed by Lindahl and Elofsson [39] and was constructed based on the SCOP database [40]. The dataset contains 976 sequences and can be downloaded from http://yanglab.nankai.edu.cn/TA-fold/benchmark/.

In order to rigorously simulate the protein fold recognition task, the LE benchmark dataset was divided into two subsets, sharing no proteins belonging to the same superfamily. For more detailed information, please refer to [39]. A 2-fold cross-validation was employed to evaluate the performance of the proposed method.

Framework of Fold-LTR-TCP

The LTR algorithm [37] is a powerful supervised machine learning method, which has been successfully applied to information retrieval [1] and protein remote homology detection [1, 41, 42]. A previous study [36] has shown that the ranking list can be further improved by considering the relationship of proteins in the protein similarity network. In this regard, we constructed a protein similarity network based on the LTR algorithm [37] for protein fold recognition, and then the TCP [38] was performed on this network so as to refine the ranking list. The flowchart of the resulting Fold-LTR-TCP predictor is shown in Figure 1.

Figure 1. The flowchart showing how Fold-LTR-TCP works.

Basic ranking method

In this study, a ranking method is used to identify protein folds. HHsearch [17] was employed as the basic ranking method because the ranking list it detects is accurate, with high coverage of positive samples.

Given a query protein q, a set |$\mathbb{C}(q)$| of feedback proteins of q can be detected by the basic ranking method HHsearch [17], which can be expressed as:
|$\mathbb{C}(q)=\left\{{p}_1,{p}_2,\dots, {p}_l\right\}$| (1)
where feedback is a term derived from information retrieval, representing the retrieval results associated with the query. In this study, the feedback proteins are the proteins that a ranking method detects as associated with the query protein. |${p}_i$| represents the ith feedback protein for query protein q, and |$l$| is the total number of feedback proteins.

Feature extraction

The proteins in the same folds usually share sequence similarity lower than 25% [39]. Because the similarity of proteins in the same superfamily is generally higher than that of proteins in the same fold, the traditional sequence alignments for homology detection generally failed for fold recognition. Therefore, in order to improve the predictive performance, two features were extracted to capture the characteristics of different protein folds, and then three feature mapping strategies were performed on these features to measure the relationship between any two proteins. The two feature extraction methods are introduced as follows:

(1) Profile-based feature

As shown in previous studies [43–45], profile-based features are more discriminative than the sequence-based features, because they contain evolutionary information extracted from the multiple sequence alignments [46]. In this regard, Top-n-gram [47] was used to extract these features.

(2) Alignment score feature

In sequence analysis, sequence alignment is a commonly used method. Although a single sequence alignment method is often inaccurate, better performance can be achieved by combining the complementary sequence alignment methods. Therefore, in this study, the 84-dimension features based on the alignment scores reported in [18] were employed. For more details about the 84-dimension features, please refer to [18].
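
The profile-based feature in item (1) can be illustrated with a minimal sketch. It assumes, following our reading of the Top-n-gram idea [47], that each position of a frequency profile is turned into a word formed by its n most frequent amino acids (most frequent first), and that the feature vector holds the frequency of every possible word. The function name, the column order of the profile and the normalization are illustrative assumptions, not the implementation used in this study.

```python
from collections import Counter
from itertools import product

# Assumed column order of the frequency profile (hypothetical convention).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram_features(profile, n=2):
    """Return the frequency of each of the 20**n Top-n-gram words.

    profile: list of rows, one per sequence position; each row holds
    20 amino acid frequencies ordered as in AMINO_ACIDS.
    """
    words = []
    for row in profile:
        # Indices of the n largest frequencies, most frequent first.
        order = sorted(range(len(AMINO_ACIDS)), key=lambda i: -row[i])[:n]
        words.append("".join(AMINO_ACIDS[i] for i in order))

    counts = Counter(words)
    vocabulary = ["".join(w) for w in product(AMINO_ACIDS, repeat=n)]
    total = len(words)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]
```

With n = 2 the vector has 400 entries, one per ordered pair of amino acids; words that never occur in the profile simply get a frequency of zero.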

Feature mapping strategies

Profile-based features are widely used in bioinformatics. However, none of these features can directly reflect the relationship between query proteins and feedback proteins. To solve this problem, three feature mapping strategies were used to measure the relationship between the two proteins:

(1) Phase subtraction strategy

The profile-based features Top-n-grams [47] are employed to represent the protein sequences, and the similarity of the query feedback protein pair |$\Big(q,p\Big)$| can be calculated by [42]:

(2)
where q is the query protein and |${p}_i$| represents the ith feedback protein.

(2) Chebyshev distance strategy

Chebyshev distance is able to measure the similarity between two vectors [48], and performs well for protein sequence similarity measurement. Therefore, in this study, we apply the Chebyshev distance [49] to measure the similarity of vectors based on Top-n-gram.

Each protein sequence Q can be represented by Top-n-gram features as
(3)
where n is equal to the value of n in Top-n-gram, and |${q}_i$| is the frequency of the |$i$|th Top-n-gram occurring in the sequence.

The similarity of the query feedback protein pair |$\Big(q,p\Big)$| based on Top-n-grams is calculated by

(4)
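
As an illustration of this strategy, the standard Chebyshev distance between two feature vectors is the largest absolute difference between their corresponding components. The sketch below uses that textbook definition on two Top-n-gram frequency vectors; the function name is hypothetical, and any further transformation applied in this study to turn the distance into a similarity score is not shown here.

```python
def chebyshev_distance(q_vec, p_vec):
    """Largest absolute component-wise difference between two vectors.

    q_vec, p_vec: equal-length sequences of Top-n-gram frequencies for
    the query protein and the feedback protein (illustrative names).
    """
    if len(q_vec) != len(p_vec):
        raise ValueError("feature vectors must have the same length")
    # A smaller distance means the two proteins have more similar features.
    return max((abs(a - b) for a, b in zip(q_vec, p_vec)), default=0.0)
```

Because only the single largest difference counts, two proteins are judged dissimilar as soon as one Top-n-gram frequency differs strongly, however close the remaining components are.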

(3) Min’s distance strategy

Min’s distance [48] is widely used in measuring protein sequence similarities. For a ranking list, a perfect result is that all the positive samples of the query are ranked higher than other proteins. Because extreme scores in the ranking list can distort the predictive results, the logarithmic function [50] was employed to reduce their impact.

The similarity of the query feedback protein pair |$\Big(q,p\Big)$| based on Top-n-gram is calculated by

(5)
where the value of |$\upalpha$| is 3 and |${\Theta}_{\mathrm{M}}\Big({\mathbf{Q}}^p,{\mathbf{Q}}^q\Big)$| represents the similarity between query protein |$q$| and feedback protein |$p$|.

Construction of LTR model

LTR [37] is a supervised machine learning method for ranking tasks, which has been applied to many fields [1]. Recently, it has also been used in protein sequence analysis [1, 41]. In this study, the LambdaMART [37] method is used. A good ranking list detects all the positive samples and ranks all of them higher than the negative ones. Because NDCG considers the influence of positions in the ranking list, it was used as the training metric. The LambdaMART algorithm in the RankLib2.7 package was used as the implementation of LTR, with its parameters set as ‘-ranker 6 -metric2t NDCG@10 -norm linear -shrinkage 0.05’.
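
To make the role of position in NDCG concrete, the following minimal sketch computes NDCG@k for one ranking list. It assumes one common formulation (gain 2^rel − 1 with a log2 position discount) and binary relevance labels, where 1 marks a feedback protein in the same fold as the query; the function names are illustrative and the code is not taken from RankLib.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranking list."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (perfectly sorted) list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no positive sample in the list
    return dcg_at_k(relevances, k) / ideal
```

A list that places every positive sample ahead of every negative one scores 1.0, and each positive sample pushed further down the list lowers the score, which is the behaviour a ranking-based fold predictor is trained to avoid.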

In order to incorporate the Top-n-gram features into the LTR algorithm, we constructed a feature matrix |${\boldsymbol{\Pi}}_{\mathbf{1}}$| by using the features of the feedback proteins in |$\mathbb{C}(q)$|, which can be represented as

(6)
where the value of |$\Omega$| is |${\sum}_{i=1}^n{20}^i$|, and |${\Theta}_u\Big(q,p\Big)$| can be calculated by:
(7)
All the alignment score features can be represented as
(8)
where |${\Theta}_1\Big(q,p\Big)$|, |${\Theta}_2\Big(q,p\Big)$|, |${\Theta}_3\Big(q,p\Big)$|, |${\Theta}_4\Big(q,p\Big)$| and |${\Theta}_5\Big(q,p\Big)$| are Prob, E-value, P-value, Score and the reciprocal of the position calculated by HHsearch [17], respectively; |${\Theta}_i\Big(q,p\Big)$|, |$6\le i\le 89$|, represent the 84-dimension features and |${\Theta}_{90}\Big(q,p\Big)$| represents the protein similarity score calculated by DeepFRpro [34].
Finally, the LTR model was trained with the combination of matrix |${\boldsymbol{\Pi}}_{\mathbf{1}}$| and matrix |${\boldsymbol{\Pi}}_{\mathbf{2}}$| as
(9)
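
As a practical illustration of how one row of the combined feature matrix could be passed to an LTR tool, the sketch below serializes a query feedback protein pair in the LETOR/SVMlight-style text format read by RankLib ("label qid:ID index:value ... # comment"). The label encoding (1 for the same fold, 0 otherwise), the identifiers and the helper name are assumptions made for illustration only.

```python
def to_letor_line(label, query_id, features, comment=""):
    """Serialize one query-feedback protein pair as a LETOR-style line.

    label: relevance label (assumed: 1 = same fold as the query, 0 = other).
    query_id: integer identifying the query protein.
    features: similarity features of the pair, e.g. the Top-n-gram based
        features followed by the alignment score features.
    comment: optional free text such as the feedback protein identifier.
    """
    parts = [str(label), f"qid:{query_id}"]
    # Feature indices are 1-based in this format.
    parts += [f"{i}:{value:.6f}" for i, value in enumerate(features, start=1)]
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line
```

All rows sharing the same qid form one ranking list, so the learner compares feedback proteins only within the list of a single query protein.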

Construction of protein similarity network based on TCP

Protein fold recognition is a multi-classification problem with low protein sequence similarity, which will lead to different data distribution between the training dataset and the test dataset. The traditional methods failed to accurately calculate the similarity between the template protein and the query protein, and therefore, their performance for protein fold recognition is generally low. In order to overcome this disadvantage, in this study we extended the one-to-one relationship between query protein and template protein to the one-to-multiple relationship among query protein and multiple template proteins. A protein similarity network was constructed based on LTR, and then the TCP [38] was performed on the network to improve the ranking list by considering the one-to-multiple relationship among query protein and multiple template proteins.

The TCP has been successfully applied to the study of social networks [38], as shown in Figure 2a and b. The TCP states the following: (i) the closer the relationship between two persons and their common friends, the more likely they are to become friends; and (ii) the more common friends they have, the more likely they are to become friends [38]. Inspired by these rules, we proposed two rules for protein fold recognition: (i) the higher the scores between two proteins and their shared feedback proteins are, the more likely they are to belong to the same fold; and (ii) the more feedback proteins the two proteins share, the more likely they are to belong to the same fold. These rules are shown in Figure 2c and d. Based on these two rules, we proposed a protocol for protein fold recognition based on TCP, whose process is shown in Figure 3; the detailed steps are introduced in the following section.
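
The two rules can be illustrated with a deliberately simplified scoring scheme. In this sketch, which is an assumption for illustration and not the update rule defined by the equations of this study, the score between a query protein and a template protein is the sum, over the feedback proteins they share, of the product of the two edge scores. A higher edge score (rule i) and a larger number of shared feedback proteins (rule ii) both increase the result.

```python
def triadic_closure_score(query_edges, template_edges):
    """Score two proteins by the feedback proteins they share.

    query_edges / template_edges: dicts mapping a feedback protein
    identifier to the edge score between it and the query / template.
    """
    shared = query_edges.keys() & template_edges.keys()
    return sum(query_edges[p] * template_edges[p] for p in shared)


def rerank_templates(query_edges, network):
    """Rank template proteins by their triadic closure score, best first.

    network: dict mapping each template protein to its own edge dict.
    """
    scores = {
        template: triadic_closure_score(query_edges, edges)
        for template, edges in network.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a template that shares two strongly connected feedback proteins with the query is ranked above one that shares a single feedback protein, and a template with no shared feedback protein receives a score of zero.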

Figure 2. The similarity between social network and protein fold recognition. Based on their similarity, the TCP can be applied to protein fold recognition.

Figure 3. The flowchart of how TCP works on the protein similarity network.

The ranking list of query feedback proteins generated by LTR is formulated as follows:
(10)
 
(11)
where |${p}_i$| is the ith feedback protein of query q, |${s}_j$| is the score of the jth feedback protein |${p}_j$|, indicating the importance of the edge between the query protein and the feedback protein (Figure 3), and n is the number of feedback proteins. |${p}_j^{p_i}$| is the jth feedback protein for the template protein |${p}_i$|. Both |$\mathbb{R}(q)$| and |$\mathbb{R}\Big({p}_i\Big)$| are sorted in descending order of the corresponding feedback protein scores.
Following a previous study [36], the update rule of the TCP algorithm for connecting the query protein to the protein similarity network is
(12)
where N is the total number of protein sequences in the protein similarity network and |$\Theta \Big({p}_k,{p}_i\Big)$| is the similarity value between |${p}_k$| and |${p}_i$| (cf. Equation 11). Please refer to [36] for more information on the parameters used to construct the network.
According to the extended rules described above, the similarity value between the query protein and a template protein is calculated by
(13)
where |$q$| is the query protein, |${p}_i$| represents the ith template protein and |${p}_n^m$| represents the protein at the nth position in the feedback ranking list of m proteins.
Finally, we ranked all template proteins against the query protein directly and obtained the following results:
(14)
where |$\mathbbm{v}\Big(q,{p}_i\Big)$| represents the common similarity score between query protein |$q$| and template protein |${p}_i$|.

Evaluation methodology

In this study, protein fold recognition was treated as a ranking task. The similarity scores among query protein and all the corresponding template proteins were calculated, and then the template proteins were sorted in descending order according to the similarity scores. The first template protein with the highest similarity score is considered as the hit of the query protein.

The overall accuracy was employed to measure the performance of various methods. It is the ratio of the number of correctly predicted proteins to the total number of proteins to be predicted [45]:

|$\mathrm{Accuracy}=\frac{n}{N}\times 100\%$| (15)
where n is the number of test samples whose folds are correctly predicted, and N is the total number of query samples in the test dataset.
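
The evaluation protocol can be sketched as follows: for each query protein the template with the highest similarity score is taken as the hit, and the prediction counts as correct when the hit has the same fold as the query. The data layout (dictionaries keyed by protein identifiers) and the function name are hypothetical choices made for this example.

```python
def top1_accuracy(similarity, template_fold, query_fold):
    """Overall accuracy (in percent) of top-1 fold assignment.

    similarity: dict mapping each query to a dict {template: score}.
    template_fold / query_fold: dicts mapping proteins to fold labels.
    """
    if not similarity:
        return 0.0
    correct = 0
    for query, scores in similarity.items():
        hit = max(scores, key=scores.get)  # template with the highest score
        if template_fold[hit] == query_fold[query]:
            correct += 1
    return 100.0 * correct / len(similarity)
```

Only the first position of each ranking list affects this measure, so improving the order of lower-ranked templates changes the accuracy only when it changes which template ends up on top.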

Results and discussion

TCP improves the accuracy of protein fold recognition

The TCP was compared with two other similar methods for protein fold recognition, PageRank [36] and Hyperlink-Induced Topic Search (HITS) [36]. Besides the protein similarity network constructed by the LTR model, these three methods were also performed on four other protein similarity networks constructed by four state-of-the-art protein fold recognition methods, including DeepFRpro (DF) [34], cosine (CS) [18], correlation (CL) [18] and Gaussian kernel functions (GK) [18]. The corresponding results are listed in Table 1, from which we can see the following: (i) among the five protein similarity networks, the ones constructed by LTR and DF clearly outperformed the other three (accuracies of at least 69.4% versus at most 20.9%), and the one based on LTR achieved the top performance (73.2% with TCP). These results are not surprising because the LTR model retrieves the template proteins in a supervised manner using both the sequence label information and the features extracted from protein sequences, and DeepFRpro [34] combines a deep learning algorithm with multiple alignment scores, leading to a more accurate protein similarity network; and (ii) TCP outperformed HITS and PageRank on all five protein similarity networks. The reason is that TCP considers the relationship among the query protein and the template proteins in the ranking list, which improves the accuracy of the ranking list.

Table 1

Performance of TCP, PageRank and HITS performed on five different protein similarity networks on LE benchmark dataset via 2-fold cross-validation

Methods             Accuracy
PageRank (LTR) a    72.5%
HITS (LTR) a        70.1%
TCP (LTR) a         73.2%
PageRank (DF) b     71.3%
HITS (DF) b         69.4%
TCP (DF) b          72.6%
PageRank (CS) c     4.9%
HITS (CS) c         10.1%
TCP (CS) c          20.9%
PageRank (CL) d     9.3%
HITS (CL) d         12.5%
TCP (CL) d          19.3%
PageRank (GK) e     6.5%
HITS (GK) e         10.3%
TCP (GK) e          11.2%

a The network based on LTR, calculated by LTR.
b The network based on DF, calculated by DeepFRpro [34].
c The network based on CS, calculated by cosine function [18].
d The network based on CL, calculated by correlation function [18].
e The network based on GK, calculated by Gaussian kernel function [18].

In order to illustrate the computational efficiency of the Fold-LTR-TCP predictor, its time complexity is analyzed. There are two algorithms in Fold-LTR-TCP: the LambdaMART algorithm in the LTR model, and the TCP. The time complexity of LambdaMART is |$O(TNM)$|, where |$T$| is the maximum number of iterations, |$N$| is the number of query samples in each iteration and |$M$| is the number of feedback proteins of each query sample. In the TCP, the total number of query proteins in the protein similarity network is |$N$| (cf. Equation 12), and the time complexity of the TCP is |$O\Big({N}^2\Big)$|. Therefore, the time complexity of the Fold-LTR-TCP predictor is |$O\Big( TNM+{N}^2\Big)$|. In this study, the LE dataset was split into two subsets with 159 sequences and 162 sequences, and the 2-fold cross-validation strategy was used to evaluate the performance of Fold-LTR-TCP. On a computer with a 20-core 2.4 GHz CPU and 128 GB of memory, the total training time was 17 340 s and the total test time was only 170 s, indicating that the Fold-LTR-TCP predictor is efficient as well as accurate.

Comparison with other competing methods

The performance of the proposed Fold-LTR-TCP predictor was compared with that of other predictors, including PSI-Blast [7], HMMER [16], SAM-T98 [15], BLASTLINK [39], SSEARCH [51], SSHMM [52], THREADER [53], Fugue [54], RAPTOR [55], SPARKS [56], SPARKS-X [14], SP3 [57], SP4 [58], SP5 [59], BoostThreader [60], HH-fold [61], RFDN-Fold [62], DN-FoldS [62], DN-FoldR [62], MT-fold [63], HHpred [12], FFAS-3D [13], TA-fold [61], FOLDpro [64], DN-Fold [62], RNDN-Fold [62], RF-Fold [18], dRHP-PseRA [65], DeepFRpro [34] and DeepSVM-fold [35]. Table 2 shows the performance of these methods, from which we conclude that the Fold-LTR-TCP predictor achieves the best performance (73.2% versus 67.3% for the best competing method, DeepSVM-fold). Except for Fold-LTR-TCP, all the methods consider only the pairwise similarity between the query protein and the template protein. Fold-LTR-TCP is the first predictor to consider the global relationships among the query proteins and the template proteins based on the protein similarity network (Figure 3), which is the main reason for its better performance. These results further confirm that the Fold-LTR-TCP predictor is effective for protein fold recognition and will facilitate the study of protein structures and functions.

Table 2

Performance comparison of the Fold-LTR-TCP with other state-of-the-art methods on LE benchmark dataset via 2-fold cross-validation

Methods          Accuracy    Source
PSI-Blast        4.0%        [39]
HMMER            4.4%        [39]
SAM-T98          3.4%        [39]
BLASTLINK        6.9%        [39]
SSEARCH          5.6%        [39]
SSHMM            6.9%        [39]
THREADER         14.6%       [39]
Fugue            12.5%       [64]
RAPTOR           25.4%       [64]
SPARKS           28.7%       [64]
SP3              30.8%       [64]
FOLDpro          26.5%       [64]
HHpred           25.2%       [62]
SP4              30.8%       [62]
SP5              37.9%       [62]
BoostThreader    42.6%       [62]
SPARKS-X         45.2%       [62]
RF-Fold          40.8%       [62]
DN-Fold          33.6%       [62]
RFDN-Fold        37.7%       [62]
DN-FoldS         33.3%       [62]
DN-FoldR         27.4%       [62]
FFAS-3D          35.8%       [61]
HH-fold          42.1%       [61]
TA-fold          53.9%       [61]
dRHP-PseRA       34.9%       [65]
MT-fold          59.1%       [65]
DeepFRpro        66.0%       [34]
DeepSVM-fold     67.3%       [35]
Fold-LTR-TCP     73.2%       This study

The bold values represent the proposed method achieving the top performance.

Besides the aforementioned methods, DeepSF [49] and SVM-fold [61] are two state-of-the-art methods. However, these two methods cannot be directly compared with our method and other related methods on the LE dataset for the following reasons: (i) they were not evaluated on the LE dataset; and (ii) they contain several hyper-parameters, which should be optimized for different datasets, and their source code is unavailable. Our method can nevertheless be compared with them indirectly. Evaluated on two benchmark datasets derived from the SCOP database, DeepSF achieved accuracies of 75.3% and 73% as reported in [49]. Fold-LTR-TCP achieved an accuracy of 73.2% on the LE benchmark dataset, which was also derived from the SCOP database. Therefore, we conclude that Fold-LTR-TCP is better than or at least comparable with DeepSF. The Fold-LTR-TCP predictor was directly compared with TA-fold, and the performance is shown in Table 2. TA-fold combined SVM-fold and HH-fold and outperformed SVM-fold as reported in [61]. Because Fold-LTR-TCP outperformed TA-fold (Table 2), we conclude that Fold-LTR-TCP is better than SVM-fold.

Conclusion

In this study, we proposed a new computational predictor called Fold-LTR-TCP for protein fold recognition by combining the LTR and TCP. The Fold-LTR-TCP predictor is a general method for detecting different protein fold types. Because protein folds with more proteins provide more training samples for Fold-LTR-TCP, it achieves better performance for these folds. Experimental results showed that Fold-LTR-TCP outperformed other competing methods. Fold-LTR-TCP has the following advantages: (i) it incorporates various features into the framework of the LTR model in a supervised manner, treating protein fold recognition as a ranking task; and (ii) the ranking list generated by the LTR model is further improved by the TCP, which considers the one-to-multiple relationship among the query protein and multiple template proteins. To the best of our knowledge, Fold-LTR-TCP is the first approach to use the global relationships among the query proteins and all template proteins for protein fold recognition. It can be anticipated that the proposed framework would have many potential applications in which the global interactions among biological sequences are required, such as protein complex identification [66], circRNA–disease association prediction [67], microRNA–disease identification [68], etc.

Key Points
  • Protein fold recognition is a very important problem in the field of protein structure and function studies. Although the existing computational predictors contribute to the development of this field, they fail to accurately detect protein folds because proteins in the same fold share low sequence similarities.

  • This study presents a new predictor called Fold-LTR-TCP for protein fold recognition. The protein similarity network describing the relationship among proteins was constructed based on the LTR model, and then TCP was performed on this network by considering the one-to-multiple relationship among the query protein and multiple template proteins for accurate protein fold recognition.

  • Experimental results on the LE benchmark dataset showed that the proposed Fold-LTR-TCP outperformed 29 existing state-of-the-art methods for protein fold recognition. To the best of our knowledge, Fold-LTR-TCP is the first predictor considering the relationships among proteins in the dataset for protein fold recognition, which is the main reason for its better performance.

Acknowledgements

The authors are very much indebted to the four anonymous reviewers, whose constructive comments are very helpful in strengthening the presentation of this article.

Funding

This work was supported by the National Natural Science Foundation of China (61672184, 61822306), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063) and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

Yulin Zhu is a master's student at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. His expertise is in bioinformatics.

Ke Yan is a PhD candidate at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. His expertise is in bioinformatics.

References

1. Chen JJ, Guo MY, Li SM, et al. ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised learning to rank. Bioinformatics 2017;33:3473–6.

2. Stroud RM. Introduction to protein-structure. Branden, C, Tooze, J. Science 1991;253:685–6.

3. Sander C, Marks D. Solutions to the computational protein folding problem. FASEB J 2018;32.

4. Wei L, Zou Q. Recent progress in machine learning-based methods for protein fold recognition. Int J Mol Sci 2016;17:2118.

5. Weston J, Kuang R, Leslie C, et al. Protein ranking by semi-supervised network propagation. BMC Bioinformatics 2006;7.

6. O'Driscoll A, Belogrudov V, Carroll J, et al. HBLAST: parallelised sequence similarity – a Hadoop MapReducable basic local alignment search tool. J Biomed Inform 2015;54:58–64.

7. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.

8. Pearson WR. Searching protein-sequence libraries: comparison of the sensitivity and selectivity of the Smith–Waterman and Fasta algorithms. Genomics 1991;11:635–50.

9. Zou Q, Hu Q, Guo M, et al. HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 2015;31:2475–81.

10. Wan S, Zou Q. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol Biol 2017;12:25.

11. Baldi P, Chauvin Y, Hunkapiller T, et al. Hidden Markov-models of biological primary sequence information. Proc Natl Acad Sci U S A 1994;91:1059–63.

12. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005;33:W244–8.

13. Xu D, Jaroszewski L, Li ZW, et al. FFAS-3D: improving fold recognition by including optimized structural features and template re-ranking. Bioinformatics 2014;30:660–7.

14. Carlson BE, Ostgaard N, Kochkin P, et al. Meter-scale spark X-ray spectrum statistics. J Geophys Res Atmos 2015;120:11191–202.

15. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998;14:846–56.

16. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.

17. Remmert M, Biegert A, Hauser A, et al. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 2012;9:173–5.

18. Jo T, Cheng JL. Improving protein fold recognition by random forest. BMC Bioinformatics 2014;15:S14.

19. Liu XQ, Wu QL, Pan WT. Sentiment classification of micro-blog comments based on Randomforest algorithm. Concurr Comput 2019;31.

20.

Ding
 
CH
,
Dubchak
 
I
.
Multi-class protein fold recognition using support vector machines and neural networks
.
Bioinformatics
 
2001
;
17
:
349
58
.

21.

Polat
 
O
,
Dokur
 
Z
.
Protein fold classification with grow-and-learn network
.
Turk J Electrical Eng Comp Sci
 
2017
;
25
:
1184
96
.

22.

Yan
 
K
,
Fang
 
X
,
Xu
 
Y
, et al.  
Protein fold recognition based on multi-view Modeling
.
Bioinformatics
 
2019
;
35
:
2982
2990
.

23.

Liu
 
B
,
Chen
 
J
,
Guo
 
M
, et al.  
Protein remote homology detection and fold recognition based on sequence-order frequency matrix
.
IEEE/ACM Trans Comput Biol Bioinform
 
2019
;
16
:
292
300
.

24.

Liu
 
B
,
Wang
 
X
,
Lin
 
L
, et al.  
A discriminative method for protein remote homology detection and fold recognition combining top-n-grams and latent semantic analysis
.
BMC Bioinformatics
 
2008
;
9
:
510
.

25.

Yan
 
K
,
Xu
 
Y
,
Fang
 
X
, et al.  
Protein fold recognition based on sparse representation based classification
.
Artif Intell Med
 
2017
;
79
:
1
8
.

26.

Zou
 
Q
,
Xing
 
P
,
Wei
 
L
, et al.  
Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA
.
RNA
 
2019
;
25
:
205
18
.

27.

Zhang
 
Z
,
Zhao
 
Y
,
Liao
 
X
, et al.  
Deep learning in omics: a survey and guideline
.
Brief Funct Genomics
 
2019
;
18
:
41
57
.

28.

Yu
 
L
,
Sun
 
X
,
Tian
 
SW
, et al.  
Drug and nondrug classification based on deep learning with various feature selection strategies
.
Curr Bioinform
 
2018
;
13
:
253
9
.

29.

Lv
 
ZB
,
Ao
 
CY
,
Zou
 
Q
.
Protein function prediction: from traditional classifier to deep learning
.
Proteomics
 
2019
;
19
:
2
.

30.

Peng
 
L
,
Peng
 
MM
,
Liao
 
B
, et al.  
The advances and challenges of deep learning application in biological big data processing
.
Curr Bioinform
 
2018
;
13
:
352
9
.

31.

Wei
 
L
,
Su
 
R
,
Wang
 
B
, et al.  
Integration of deep feature representations and handcrafted features to improve the prediction of N 6-methyladenosine sites
.
Neurocomputing
 
2019
;
324
:
3
9
.

32.

Liu
 
B
,
Li
 
S
.
ProtDet-CCH: protein remote homology detection by combining long short-term memory and ranking methods
.
IEEE/ACM Trans Comput Biol Bioinform
 
2019
;
16
:
1203
10
.

33.

Li
 
C-C
,
Liu
 
B
.
MotifCNN-fold: Protein Fold Recognition based on Fold-specific Features Extracted by Motif-Based Convolutional Neural Networks
.
Briefings in Bioinformatics
; doi: .

34.

Zhu
 
JW
,
Zhang
 
HC
,
Li
 
SC
, et al.  
Improving protein fold recognition by extracting fold-specific features from predicted residue–residue contacts
.
Bioinformatics
 
2017
;
33
:
3749
57
.

35.

Liu
 
B
,
Li
 
C
,
Yan
 
K
.
DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks
.
Brief Bioinform
. doi: .

36.

Liu
 
B
,
Jiang
 
S
,
Zou
 
Q
.
HITS-PR-HHblits: Protein Remote Homology Detection by Combining PageRank and Hyperlink-Induced Topic Search
.
Briefings in Bioinformatics
; doi: .

37.

Trotman
 
A
.
Learning to rank
.
Inform Retrieval
 
2005
;
8
:
359
81
.

38.

Kovacs
 
IA
,
Luck
 
K
,
Spirohn
 
K
, et al.  
Network-based prediction of protein interactions
.
Nat Commun
 
2019
;
10
:
1240
.

39.

Lindahl
 
E
,
Elofsson
 
A
.
Identification of related proteins on family, superfamily and fold level
.
J Mol Biol
 
2000
;
295
:
613
25
.

40.

Chandonia
 
JM
,
Hon
 
G
,
Walker
 
NS
, et al.  
The ASTRAL compendium in 2004
.
Nucleic Acids Res
 
2004
;
32
:
D189
92
.

41.

Liu
 
B
,
Chen
 
J
,
Wang
 
X
.
Application of learning to rank to protein remote homology detection
.
Bioinformatics
 
2015
;
31
:
3492
8
.

42.

Liu
 
B
,
Zhu
 
Y
.
ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank
.
IEEE Access
 
2019
;
7
:
102499
507
.

43.

Liu
 
B
,
Chen
 
JJ
,
Wang
 
XL
.
Protein remote homology detection by combining Chou's distance-pair pseudo amino acid composition and principal component analysis
.
Briefings in Bioinformatics
; doi: .

44.

Liu
 
B
.
BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches
.
Briefings in Bioinformatics
; doi: .

45.

Dong
 
QW
,
Zhou
 
SG
,
Guan
 
JH
.
A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation
.
Bioinformatics
 
2009
;
25
:
2655
62
.

46.

Liu
 
B
,
Gao
 
X
,
Zhang
 
H
.
BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA, and protein sequences at sequence level and residue level based on machine learning approaches
.
Nucleic Acids Res
. doi: .

47.

Liu
 
B
,
Wang
 
XL
,
Lin
 
L
, et al.  
A discriminative method for protein remote homology detection and fold recognition combining top-n-grams and latent semantic analysis
.
BMC Bioinformatics
 
2008
;
9
.

48.

Mulekar
 
MS
,
Brown
 
CS
,
Mulekar
 
MS
, et al.  
Distance and Similarity Measures
,
2014
.

49.

Hou
 
J
,
Adhikari
 
B
,
Cheng
 
JL
.
DeepSF: deep convolutional neural network for mapping protein sequences to folds
.
Bioinformatics
 
2018
;
34
:
1295
303
.

50.

Drago
 
F
,
Myszkowski
 
K
,
Annen
 
T
, et al.  
Adaptive logarithmic mapping for displaying high contrast scenes
.
Comput Graph Forum
 
2003
;
22
:
419
26
.

51.

Pearson
 
WR
.
Comparison of methods for searching protein-sequence databases
.
Protein Sci
 
1995
;
4
:
1145
60
.

52.

Hargbo
 
J
,
Elofsson
 
A
.
Hidden Markov models that use predicted secondary structures for fold recognition
.
Proteins
 
1999
;
36
:
68
76
.

53.

Jones
 
DT
,
Taylor
 
WR
,
Thornton
 
JM
.
A new approach to protein fold recognition
.
Nature
 
1992
;
358
:
86
9
.

54.

Shi
 
JY
,
Blundell
 
TL
,
Mizuguchi
 
K
.
FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties
.
J Mol Biol
 
2001
;
310
:
243
57
.

55.

Xu
 
J
,
Li
 
M
,
Kim
 
D
, et al.  
RAPTOR: optimal protein threading by linear programming
.
J Bioinform Comput Biol
 
2003
;
1
:
95
117
.

56.

Zhou
 
H
,
Zhou
 
Y
.
Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition
.
Proteins
 
2004
;
55
:
1005
13
.

57.

Zhou
 
HY
,
Zhou
 
YQ
.
Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments
.
Proteins
 
2005
;
58
:
321
8
.

58.

Liu
 
S
,
Zhang
 
C
,
Liang
 
SD
, et al.  
Fold recognition by concurrent use of solvent accessibility and residue depth
.
Proteins
 
2007
;
68
:
636
45
.

59.

Zhang
 
W
,
Liu
 
S
,
Zhou
 
YQ
.
SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model
.
PLoS One
 
2008
;
3
:
e2325
.

60.

Peng
 
J
,
Xu
 
JB
.
Boosting protein threading accuracy
.
Res Comput Mol Biol Proc
 
2009
;
5541
:
31+
.

61.

Xia
 
JQ
,
Peng
 
ZL
,
Qi
 
DW
, et al.  
An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier
.
Bioinformatics
 
2017
;
33
:
863
70
.

62.

Jo
 
T
,
Hou
 
J
,
Eickholt
 
J
, et al.  
Improving protein fold recognition by deep learning networks
.
Sci Rep
 
2015
;
5
:
17573
.

64.

Cheng
 
JL
,
Baldi
 
P
.
A machine learning information retrieval approach to protein fold recognition
.
Bioinformatics
 
2006
;
22
:
1456
63
.

65.

Chen
 
JJ
,
Long
 
R
,
Wang
 
XL
, et al.  
dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation
.
Sci Rep
 
2016
;
6
.

66.

Wu
 
Z
,
Liao
 
Q
,
Liu
 
B
.
A comprehensive review and evaluation of computational methods for identifying protein complexes from protein–protein interaction networks
.
Brief Bioinform
. doi: .

67.

Wei
 
H
,
Liu
 
B
.
iCircDA-MF: identification of CircRNA–disease associations based on matrix factorization
.
Brief Bioinform
. doi: .

68.

Zou
 
Q
,
Li
 
J
,
Song
 
L
, et al.  
Similarity computation strategies in the microRNA–disease network: a survey
.
Brief Funct Genomics
 
2016
;
15
:
55
64
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)