Abstract

Protein remote homology detection is a fundamental task in protein structure and function analysis. Several search methods have been proposed to improve the detection of remote homologues and the accuracy of ranking lists. The position-specific scoring matrix (PSSM) profile and hidden Markov model (HMM) profile contribute to the performance of the state-of-the-art search methods. In this paper, we improve the profile-link (PL) information used for constructing PSSM or HMM profiles and propose a PL-based search method (PL-search). In PL-search, more robust PLs are constructed through the double-link and iterative extending strategies, and an accurate similarity score of sequence pairs is calculated from the two-level Jaccard distance for remote homologues. We tested our method on two widely used benchmark datasets. The results show that, whether HHblits, JackHMMER or position-specific iterated-BLAST is used, PL-search improves the search performance in terms of both ranking quality and the number of detected remote homologues. For ease of use, both a stand-alone tool and a web server for PL-search are available at http://bliulab.net/PL-search/.

Introduction

Protein remote homology detection is a key element of protein structure and function analysis. The available protein sequences are increasing rapidly; however, the high costs of experimental protein structure determination are driving the development of computational prediction methods based on sequence information for protein remote homology detection [1, 2].

Sequence-based alignment methods such as Smith–Waterman [3], basic local alignment search tool (BLAST) [4], HAlign [5] and FASTA [6] can distinguish close homology unambiguously. However, these techniques are not effective for protein remote homology detection. To increase the chance of detecting remote homologous sequences, many position-specific scoring matrix (PSSM)-based alignment methods, such as position-specific iterated-BLAST (PSI-BLAST) [4] and comparison of multiple protein sequence alignments [7], have been proposed. By considering the emission and state transition probabilities for each position of a protein sequence, the hidden Markov model (HMM) profile-based alignment methods HMMER [8] and HHsearch [9] achieve better performance than PSSM-based alignment methods. The state-of-the-art performance of these profile-based methods rests on the evolutionary information about the query’s family contained in the PSSM or HMM profile. Whether PSSM- or HMM-based, these approaches can detect more remote homologous protein sequences by incorporating an iterative strategy, as in HHblits [10], PSI-BLAST [4] and JackHMMER [11].

Alignment methods form the foundation of many methods for protein remote homology detection, and some predictors are further improved by incorporating the search results from the original alignment methods. For example, SCOOP [12] compares common sequences between two outputs of HMM-based search methods to increase the accuracy of protein remote homology detection, and RankProp [13, 14] proposes a network-based ranking algorithm with edge weights constructed by PSI-BLAST. HITS-PR-HHblits [15] constructs a protein similarity network based on the true labels of proteins from the Structural Classification of Proteins (SCOP) database [16]. Given the complementarity of different alignment methods, some state-of-the-art fusion algorithms such as CHASE [17] and ProtDec-LTR3.0 [18] re-rank the search results to create an accurate ranking list. Although the above-mentioned methods contribute to the development of protein remote homology detection, some shortcomings remain: (i) re-ranking methods can effectively improve the ranking quality of the original search results, but the number of detected homologous sequences is still limited by the original search methods; (ii) the accuracy of the improved methods can be affected by the search results of the original alignment methods when non-homologous sequences are not filtered out, and many studies [19, 20] have shown that the input of non-homologous sequences into the PSSM has disastrous effects on later iterations in iterative search methods; (iii) null search results from the original alignment methods are inevitable and lead to null results in the improved methods, because they provide no useful information.

In light of the above, we propose the profile-link-based search method (PL-search) to achieve improved performance for protein remote homology detection. The framework of PL-search has two major parts (Figure 1): PL construction and similarity calculation. To obtain more useful link information, the PL is constructed by using a double-link strategy and an iterative extending-link strategy; to obtain more accurate search results and detect more homologous protein sequences, the two-level Jaccard distance is used to calculate the similarity score of sequence pairs. The two parts complete the retrieval process by relying on PL information rather than constructing a profile. To summarize, the main contributions of this paper are: (i) the presentation of the PL-search based on HHblits (PL-HHblits) approach, which outperforms other state-of-the-art original search methods in terms of ranking quality and the ability to detect remote homologous protein sequences; (ii) the demonstration that PL-search is a general search method that can construct PL information from the search results of HHblits [10], PSI-BLAST [4] or JackHMMER [11], all of which are widely used by biologists [21]; (iii) the identification of a solution for null results in HHblits and JackHMMER through the use of PL-HHblits and PL-HMMER, respectively.

Figure 1

Flowchart of PL-based search. First, the out-link of the query sequence is obtained from an original search method with E-value of 0.001. Second, for a more robust PL, the out-link is updated to double-link and extended double-link. All sequences in the database are represented with extended double-link in the same strategy. Third, the two-level Jaccard distance is used to calculate the similarity of sequence pairs. The final ranking list is then constructed based on the double-link list and ranking list calculated by using the two-level Jaccard distance.

Materials and method

Benchmark datasets

To evaluate the performance of PL-search, we use two widely used benchmark datasets derived from the SCOP database [16] with less than 95% sequence identity: the SCOP1.59 benchmark dataset and the SCOPe2.06 benchmark dataset. The SCOP1.59 benchmark dataset contains 7329 sequences. The SCOPe2.06 benchmark dataset contains 28 010 sequences and is an updated version with some new hierarchy and the classification of new protein structures from the Protein Data Bank (PDB) [22].

PL construction

In iterative search methods, profile construction selects homologous sequences from the search results according to their E-values in order to construct a profile. Previous studies [23, 24] have shown that true homologous sequences in the PSSM provide true evolutionary information, leading to increased detection of homologous sequences and more accurate ranking. However, when non-homologous sequences occur in the profile, more non-homologous sequences are retrieved in later iterations. In the present study, non-homologous sequences in the profile provide incorrect link information; eliminating these sequences while retaining enough homology information in the PL is therefore important. We propose two strategies to address this: (i) the double-link strategy to filter non-homologous sequences and (ii) the iterative extending-link strategy for more robust PLs.

Double-link strategy

In information retrieval, in-links provide search information that is as important as that provided by out-links. Link-based search methods, such as the PageRank algorithm [25] and the Hypertext Induced Topic Selection algorithm [26], utilize the relationship between in-links and out-links to improve search results. In the present study, we construct a double-link strategy based on the relationship between in-link and out-link.

The out-links of the query protein sequence can be easily extracted from the search results of the original search method. The inclusion threshold [10] used for constructing the profile in the original search method is applied to extract high-quality out-links. The out-link can be represented as:
(1)
where |$q$| represents the query protein sequence; |${p}_i$| represents a feedback protein sequence in the results of the original search method; |$e(q,{p}_i)$| represents the E-value of |$q$| and |${p}_i$|; |$inclusion\_E$| represents the inclusion E-value threshold for constructing the profile (|$inclusion\_E$| is set to 0.001 in PL-search); and |$M$| represents the number of proteins that satisfy the E-value threshold.
The in-links of the query protein sequence consist of the database sequences whose out-links contain the query protein sequence. The in-link can be represented as:
(2)
where |$I$| represents the number of proteins in the in-links of the query protein sequence.
The double-link of the query protein sequence then consists of the non-null intersection of its in-link and out-link. The links in the double-link of the query sequence are sorted by the alignment scores in its out-link. Two special situations are considered when constructing the query’s double-link: the double-link is set to the out-link when the intersection of the in-link and out-link is null, and it is set to the in-link when the out-link is empty. The double-link can be represented as:
(3)
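
The three link sets described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PL-search implementation: the `hits` dictionary, the function names and the use of `<=` for the inclusion threshold are hypothetical choices made for the example. Only the threshold value of 0.001, the ordering by out-link and the two fallback rules are taken from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: for each sequence id, its search hits as
# (hit_id, e_value) pairs in the order returned by the original search
# method (best hit first).
Hits = Dict[str, List[Tuple[str, float]]]

INCLUSION_E = 0.001  # inclusion E-value threshold stated in the text


def out_links(hits: Hits, q: str, inclusion_e: float = INCLUSION_E) -> List[str]:
    """Hits of q that pass the inclusion threshold, kept in search order."""
    return [p for p, e in hits.get(q, []) if e <= inclusion_e]


def in_links(hits: Hits, q: str, inclusion_e: float = INCLUSION_E) -> List[str]:
    """Database sequences whose own out-links contain q."""
    return [p for p in hits if q in out_links(hits, p, inclusion_e)]


def double_links(hits: Hits, q: str, inclusion_e: float = INCLUSION_E) -> List[str]:
    """Intersection of in-link and out-link, with the two fallbacks from the text."""
    out = out_links(hits, q, inclusion_e)
    inn = in_links(hits, q, inclusion_e)
    if not out:
        return inn                      # empty out-link: use the in-link
    inn_set = set(inn)
    common = [p for p in out if p in inn_set]  # keeps the out-link order
    return common if common else out    # empty intersection: use the out-link
```

In this sketch a self-hit of the query is treated like any other hit; whether PL-search keeps or removes self-hits is not stated in this section.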

Iterative extending strategy

The double-link strategy improves the quality of the PL, but it inevitably reduces the number of links. To overcome this problem, we propose an iterative extending-link strategy. A related study [27] has demonstrated that extending a link set with the common links between two link sets can improve the robustness of the link set.

Algorithm 1

The iterative extending process of PL.

1: Parameters: |${\beta}_1$| adjusts the extending link in the first situation; |${\beta}_2$| adjusts the extending link in the second situation
2: Input: double-link set of the query sequence, double-link sets of the database
3: Output: extended PL |$\mathit{\operatorname{Re}}(q)$|
4: |$\mathit{\operatorname{Re}}(q)$| = |${R}_{double}(q)$|
5: For |${p}_i$| in |${R}_{double}(q)$| do
6:   if |$\Big|{R}_{double}(q)\cap{R}_{double}\Big({p}_i\Big)\Big|\ge{\beta}_1\Big|{R}_{double}\Big({p}_i\Big)\Big|$| then
7:     |$\mathit{\operatorname{Re}}(q)={R}_{double}(q)\cup{R}_{double}\Big({p}_i\Big)$|
8:   else if |$\Big|{R}_{double}(q)\cap{R}_{double}\Big({p}_i\Big)\Big|=\Big|{R}_{double}(q)\Big|$| then
9:     if |${\beta}_2\Big|{R}_{double}(q)\Big|<\Big|{R}_{double}\Big({p}_i\Big)\Big|$| then
10:       |$\mathit{\operatorname{Re}}(q)={R}_{double}(q)\cup \Big({R}_{double}\Big({p}_i\Big)\Big[0:{\beta}_2\Big|{R}_{double}(q)\Big|\Big]\Big)$|
11:     else |$\mathit{\operatorname{Re}}(q)={R}_{double}(q)\cup \Big({R}_{double}\Big({p}_i\Big)\Big[0:\Big|{R}_{double}\Big({p}_i\Big)\Big|\Big]\Big)$|
12: End for
13: Return |$\mathit{\operatorname{Re}}(q)$|

A more robust PL set is constructed according to the common links between two double-link sets. Two main situations are considered. In the first, all double links of |${p}_i$| are included in the query’s double-link set when most double links of |${p}_i$| already exist in the query’s double-link set. In the second, which is complementary to the first and applies only when the first does not hold, some high-quality double links of |${p}_i$| are included in the query’s double-link set when all double links of the query are contained in the double-link set of |${p}_i$|. The double-link set of the query sequence is extended with all feedback sequences in the double-link set, and the iteration starts from the feedback sequence with the lowest E-value, which has the highest probability of being homologous to the query sequence. The pseudocode of this iterative extending-link strategy is shown in Algorithm 1.
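
The two situations can be illustrated with the following Python sketch. It is one possible reading of Algorithm 1 rather than the authors' code: the function name is hypothetical, the default values of `beta1` and `beta2` are placeholders (the published settings are not given in this section), and the extension is accumulated over all |${p}_i$|, which is an assumption about how step 7 and steps 10 and 11 are meant to combine across iterations.

```python
from typing import Dict, List


def extend_links(q: str,
                 double: Dict[str, List[str]],
                 beta1: float = 0.5,
                 beta2: float = 2.0) -> List[str]:
    """Extend the double-link list of q following the two situations above.

    `double` maps each sequence id to its ordered double-link list
    (best link first). beta1 and beta2 are placeholder values.
    """
    base = double.get(q, [])
    base_set = set(base)
    extended = list(base)

    def add(candidates: List[str]) -> None:
        # Append new links while preserving order and avoiding duplicates.
        for p in candidates:
            if p not in extended:
                extended.append(p)

    for p_i in base:  # assumed to be ordered from lowest E-value upwards
        links = double.get(p_i, [])
        if not links:
            continue
        common = len(base_set & set(links))
        if common >= beta1 * len(links):
            # First situation: most links of p_i are already links of q.
            add(links)
        elif common == len(base):
            # Second situation: all links of q are links of p_i.
            # Slicing covers both branches: it takes the top beta2*|R(q)|
            # links, or all of them when p_i has fewer than that.
            add(links[:int(beta2 * len(base))])
    return extended
```

The comparisons always use the original double-link set of the query, as in the pseudocode, while new links are appended to a separate list so that the order of the original links is preserved.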

Table 1

Comparison of the performance of three search methods and their corresponding PL-based search methods on two benchmark datasets

SCOP 1.59 benchmark dataset

| Method | ROC1^d | ROC50^d | Coverage^e | TP^d | TP^f | TP^g |
| --- | --- | --- | --- | --- | --- | --- |
| HHblits | 0.8429 | 0.8834 | 0.7142 | 48.9877 | 32.4237 | 16.5641 |
| PSI-BLAST | 0.7674 | 0.8016 | 0.4986 | 37.7994 | 29.9679 | 7.8315 |
| JackHMMER | 0.8029 | 0.8071 | 0.4981 | 36.1603 | 30.1036 | 6.0568 |
| PL-HHblits^a | 0.8690 | 0.8976 | 0.7290 | 55.7637 | 32.3890 | 23.3747 |
| PL-BLAST^b | 0.8158 | 0.8268 | 0.5093 | 46.2541 | 30.0705 | 16.1835 |
| PL-HMMER^c | 0.8119 | 0.8182 | 0.5311 | 50.8473 | 30.8277 | 19.9914 |

SCOPe 2.06 benchmark dataset

| Method | ROC1^d | ROC50^d | Coverage^e | TP^d | TP^f | TP^g |
| --- | --- | --- | --- | --- | --- | --- |
| HHblits | 0.8918 | 0.9284 | 0.6817 | 95.9875 | 42.1521 | 53.8355 |
| PSI-BLAST | 0.8513 | 0.8941 | 0.5134 | 74.8642 | 38.3208 | 36.5434 |
| JackHMMER | 0.8919 | 0.9059 | 0.5288 | 139.4054 | 64.8227 | 74.5828 |
| PL-HHblits^a | 0.9207 | 0.9426 | 0.7184 | 106.1307 | 44.5174 | 61.6133 |
| PL-BLAST^b | 0.9047 | 0.9155 | 0.5327 | 76.8930 | 39.2832 | 37.6097 |
| PL-HMMER^c | 0.9043 | 0.9188 | 0.5642 | 174.4203 | 70.4275 | 103.9928 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with default parameters except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

^d Performance at homology level that contains close homology and remote homology (belonging to the same SCOP superfamily).

^e Percentage of homologous sequences (belonging to the same SCOP superfamily) searched.

^f TP number at close homology level (belonging to the same SCOP family).

^g TP number at remote homology level (belonging to the same SCOP superfamily, but not to the same family).

Calculation of the similarity of sequence pairs by Jaccard distance

In previous studies involving PSI-BLAST [4] or JackHMMER [28], the similarity of sequence pairs was calculated by using profile alignment algorithms. In the present study, we measure the similarity between two PLs by using the Jaccard distance [29, 30]. The Jaccard distance is a useful measure for calculating similarity scores between two sets. It has been successfully used in studies on genotyping data/next-generation sequencing data analysis [31], drug-drug interactions [32], gene interaction [33] and influenza vaccination responses [34].

In the present study, the similarity distance between the query protein sequence and each sequence in the database is calculated from the Jaccard distance by [30]:
(4)
We use a two-level Jaccard distance based on double-links in order to obtain a more robust similarity score between two sequences. The first level is the similarity of query sequence and sequences in the database. The second level is the similarity of those sequences in the double link of the query sequence and sequences in the database. For sequences in the double link of the query sequence, their double link can provide useful similarity information because they share high similarity with the query sequence. Therefore, the final similarity of two sequences can be represented as:
(5)
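
As a rough illustration of the set comparison underlying both levels, the following Python sketch computes the standard Jaccard similarity and distance between two link sets and collects the first-level and second-level terms for one sequence pair. The function names and the `links` and `double` dictionaries are hypothetical, and the way Equation (5) combines the two levels into the final Jscore is deliberately not reproduced here.

```python
from typing import Dict, Iterable, List, Tuple


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """Standard Jaccard distance, 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


def two_level_terms(q: str,
                    p: str,
                    links: Dict[str, List[str]],
                    double: Dict[str, List[str]]) -> Tuple[float, List[float]]:
    """Return the first-level term and the list of second-level terms.

    First level: link set of q against link set of p.
    Second level: link set of each double-link neighbour of q against p.
    """
    first = jaccard_similarity(links.get(q, []), links.get(p, []))
    second = [jaccard_similarity(links.get(d, []), links.get(p, []))
              for d in double.get(q, []) if d != q]
    return first, second
```

For example, the link sets `{q, a, b}` and `{a, b, c}` share two of four distinct members, giving a similarity of 0.5 and a distance of 0.5.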

Final ranking list

The final ranking list of PL-search comprises the double-link ranking list and the ranking list generated by the two-level Jaccard distance. The double links are retained because the probability of including non-homologous protein sequences in double links is very low, and this can also reduce resource consumption. The final ranking list is calculated by:
(6)

The maximum score of |$Jscore(q,{p}_i)$| is set to 0.99 to reject low-quality search results. The final ranking list is equal to the double-link set when the Jscore is set to zero.

Evaluation of the method

We assessed the ranking quality and the number of detected homologues to evaluate the performance of PL-search and the original search methods. For ranking quality, the receiver operating characteristic (ROC) scores ROC1 and ROC50 [35–37] were used. ROC1 and ROC50 are the areas under the ROC curve up to the first and the fiftieth false positive (non-homologous sequence), respectively. A higher ROC score indicates a more accurate ranking of search results. The statistical significance of the differences between the original search methods and their corresponding PL-based search methods in terms of ROC was calculated by t-test [38]. For the number of detected homologues, the true positive (TP) number [9, 11, 39] and coverage [40] were evaluated. The TP number is the number of detected homologous protein sequences. Coverage, or true positive rate, is the percentage of homologous protein sequences detected at a given Jscore threshold [41, 42]. A larger TP number and higher coverage indicate that more homologous protein sequences exist in the search results. To evaluate the coverage and TP number more objectively and comprehensively, errors per query (EPQ) is used as a threshold score. EPQ is the number of non-homologous sequences divided by the number of non-homologous sequences plus the number of homologous sequences.
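
The ranking metrics can be made concrete with a short Python sketch. It uses the common ROCn formulation (the number of true positives ranked above each of the first n false positives, normalized by n times the number of positives); the function names are hypothetical, and the handling of rankings with fewer than n false positives is an assumption, since the text does not state which convention was used. The EPQ function follows the definition given above.

```python
from typing import List, Optional


def roc_n(labels: List[bool], n: int, total_positives: Optional[int] = None) -> float:
    """ROCn score of a ranked list (True = homologous, best hit first).

    If the list contains fewer than n false positives, the missing ones are
    assumed to rank below every retrieved true positive (an assumption).
    """
    if total_positives is None:
        total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    tp = fp = area = 0
    for is_homologous in labels:
        if is_homologous:
            tp += 1
        else:
            fp += 1
            area += tp          # true positives ranked above this false positive
            if fp == n:
                break
    area += (n - fp) * tp       # pad when fewer than n false positives were seen
    return area / (n * total_positives)


def errors_per_query(n_false: int, n_true: int) -> float:
    """EPQ as defined in the text: false hits over all hits."""
    total = n_false + n_true
    return n_false / total if total else 0.0
```

With the ranking `[True, True, False, True, False]`, two of the three homologues precede the first false positive, so ROC1 is 2/3, while a ranking with no false positives scores 1.0 for any n.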

Results and discussion

PL-search improves the performance of search methods

To explore the search ability of PL-search, its PL was constructed based on three original search methods: PSI-BLAST [4], JackHMMER [28] and HHblits [10]. The performance of PL-search with each of these methods was tested on the two SCOP benchmark datasets. The results show that PL-search significantly improves the ranking quality as well as the number of remote homologous protein sequences identified (Tables 1 and 2).

Table 2

Statistical significance differences between three search methods and their corresponding PL-based search methods on the two benchmark datasets

| Method | SCOP 1.59: P-value of ROC1^d | SCOP 1.59: P-value of ROC50^d | SCOPe 2.06: P-value of ROC1^d | SCOPe 2.06: P-value of ROC50^d |
| --- | --- | --- | --- | --- |
| PL-HHblits^a versus HHblits | 6.5606e-07 | 0.0021 | 2.0240e-44 | 8.1488e-17 |
| PL-BLAST^b versus PSI-BLAST | 2.0208e-14 | 3.5188e-05 | 4.5773e-108 | 3.4125e-22 |
| PL-HMMER^c versus JackHMMER | 0.1597 | 0.0810 | 1.4187e-07 | 8.1410e-09 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with default parameters except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

^d P-value was calculated by t-test [44].

In terms of ranking quality, the best performance of PL-search was achieved when its PL was constructed based on the results from HHblits (Table 1, Figure 2A–D), indicating that a better-performing original search method leads to better performance of PL-search. It is worth noting that when the ROC score is at its maximum, the ranking qualities of PL-search and the original search methods show the largest discrepancy (Figure 2A–B, E–F and I–J). This large difference indicates that, in many cases, PL-search can improve the search results to the best ranking.

Figure 2

PL-based search improves the performance of three original search methods on two benchmark datasets. PL-HHblits, PL-BLAST and PL-HMMER represent PL-based search methods based on HHblits, PSI-BLAST and JackHMMER, respectively. (A), (B), (E), (F), (I) and (J) plot the percentage of sequences for which those methods exceed a given ROC1 or ROC50 score. (C), (D), (G), (H), (K) and (L) plot the coverage and TP number for which those methods exceed a given EPQ score. Higher curves mean better performance for (A)–(L).

The highest number of detected remote homologous protein sequences by PL-search was achieved with HHblits (Table 1, Figure 2A–D). From Figure 2C–D, G–H and K–L and Table 1, we can see that the number of detected homologous protein sequences increases with PL-search compared with most search results of the original search methods. The most significant improvement was observed for remote homologous protein sequences, suggesting that PL-search is appropriate for protein remote homology detection. When the EPQ score equals zero, that is, when the search results contain no non-homologous sequences, PL-search returns clearly more results than the original search methods, indicating that PL-search provides highly accurate search results. Furthermore, the gap in coverage between the original search methods and their corresponding PL-search methods was smaller than the gap in TP number, indicating that larger superfamilies tend to gain more TPs.

Owing to PL construction, many null results of original search methods can be resolved by PL-search when JackHMMER and HHblits (Table 3) are employed. A high-quality ranking list can then be achieved by PL-search for the resolved sequences (Table 4).

Table 3

Comparison of null results of three search methods and their corresponding PL-search methods on the two benchmark datasets

| Method | Number of null results: SCOP 1.59 benchmark dataset | Number of null results: SCOPe 2.06 benchmark dataset |
| --- | --- | --- |
| HHblits | 7 | 135 |
| PL-HHblits^a | 2 | 25 |
| PSI-BLAST | 15 | 28 |
| PL-BLAST^b | 15 | 28 |
| JackHMMER | 1207 | 1365 |
| PL-HMMER^c | 1141 | 1261 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with default parameters except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

Table 4

Performance of the PL-search methods on the queries whose null search results were resolved, on the two benchmark datasets

| Method | SCOP 1.59: ROC1^c | SCOP 1.59: ROC50^c | SCOPe 2.06: ROC1^c | SCOPe 2.06: ROC50^c |
| --- | --- | --- | --- | --- |
| PL-HHblits^a | 1.0000 | 1.0000 | 0.9736 | 0.9745 |
| PL-HMMER^b | 0.8086 | 0.8405 | 0.8702 | 0.9100 |

^a Five and one hundred and ten query proteins cannot be detected by HHblits but can be detected by PL-HHblits on the SCOP1.59 benchmark dataset and the SCOPe2.06 benchmark dataset, respectively.

^b Sixty-six and one hundred and four query proteins cannot be detected by JackHMMER but can be detected by PL-HMMER on the SCOP1.59 benchmark dataset and the SCOPe2.06 benchmark dataset, respectively.

^c Performance at homology level that contains close homology and remote homology (belonging to the same SCOP superfamily).

Impact of Jscore threshold for PL-based search

A lower E-value threshold can improve the ranking quality of many search methods at the cost of losing homologous sequences from the search results. This raises the question of whether the Jscore threshold (cf. Equation (5)) of PL-search affects the search results in a similar way.

To address this, we studied the effect of the Jscore threshold on the performance of PL-search in terms of ranking quality and TP number. From Figure 3 and Tables 5 and 6, we can see that the coverage decreases with decreasing Jscore threshold. An obvious decrease in coverage is observed when the Jscore threshold decreases from 0.99 to 0.8 for all PL-search methods, indicating that many homologous sequences share relatively few common PLs. Higher ROC1 scores were generally obtained with lower Jscore thresholds, although the relation is not linear. Therefore, the balance between the accuracy of the ranking list and the number of homologous sequences can be tuned by adjusting the Jscore threshold.
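To make the role of the threshold concrete, the following Python sketch filters a hypothetical hit list. It assumes that Jscore behaves as a distance in [0, 1] (smaller means more similar) and that a hit is retained when its Jscore does not exceed the threshold; the function name, hit identifiers and scores are illustrative and are not taken from the PL-search implementation.

```python
from typing import Dict, List, Tuple


def filter_by_jscore(hits: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Keep hits whose Jscore is at most `threshold`, ranked by ascending Jscore.

    Assumption: Jscore is a distance, so smaller values mean closer relatives.
    """
    kept = [(name, score) for name, score in hits.items() if score <= threshold]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical hits: identifier -> Jscore (made-up values for illustration).
example_hits = {"hit_A": 0.10, "hit_B": 0.45, "hit_C": 0.85, "hit_D": 1.00}

for threshold in (0.99, 0.8, 0.4):
    kept = filter_by_jscore(example_hits, threshold)
    # A stricter (lower) threshold keeps fewer, more similar hits.
    print(threshold, [name for name, _ in kept])
```

In this toy list, lowering the threshold shortens the result list, which mirrors the loss of coverage reported in Tables 5 and 6.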

Figure 3

A lower Jscore threshold (cf. Equation (5)) reduces the number of detected homologous sequences (●, right y-axis) and leads to more accurate ranking lists (■ and ▲, left y-axis) in most situations for PL-based search. PL-HHblits (A and D), PL-BLAST (B and E) and PL-HMMER (C and F) represent PL-based search methods based on HHblits, PSI-BLAST and JackHMMER, respectively. The results on the SCOP 1.59 benchmark dataset are plotted in subfigures (A)–(C), and the results on the SCOPe 2.06 benchmark dataset are plotted in subfigures (D)–(F).

Table 5

Impact of the Jscore threshold (cf. Equation (6)) on the three PL-based search methods on the SCOP 1.59 benchmark dataset

| Jscore threshold | PL-HHblits^a ROC1^d | PL-HHblits^a ROC50^d | PL-HHblits^a Coverage^d | PL-BLAST^b ROC1^d | PL-BLAST^b ROC50^d | PL-BLAST^b Coverage^d | PL-HMMER^c ROC1^d | PL-HMMER^c ROC50^d | PL-HMMER^c Coverage^d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.99 | 0.8690 | 0.8976 | 0.7290 | 0.8158 | 0.8268 | 0.5093 | 0.8119 | 0.8182 | 0.5311 |
| 0.9 | 0.8736 | 0.8942 | 0.7041 | 0.8192 | 0.8208 | 0.4845 | 0.8139 | 0.8191 | 0.5157 |
| 0.8 | 0.8782 | 0.8929 | 0.6770 | 0.8201 | 0.8213 | 0.4748 | 0.8151 | 0.8182 | 0.5054 |
| 0.7 | 0.8824 | 0.8921 | 0.6649 | 0.8197 | 0.8207 | 0.4690 | 0.8162 | 0.8186 | 0.4996 |
| 0.6 | 0.8854 | 0.8942 | 0.6555 | 0.8197 | 0.8203 | 0.4634 | 0.8166 | 0.8184 | 0.4940 |
| 0.5 | 0.8876 | 0.8958 | 0.6493 | 0.8194 | 0.8198 | 0.4585 | 0.8174 | 0.8189 | 0.4891 |
| 0.4 | 0.8881 | 0.8961 | 0.6429 | 0.8183 | 0.8185 | 0.4540 | 0.8183 | 0.8195 | 0.4837 |
| 0.3 | 0.8871 | 0.8946 | 0.6376 | 0.8176 | 0.8177 | 0.4512 | 0.8191 | 0.8201 | 0.4795 |
| 0.2 | 0.8865 | 0.8936 | 0.6310 | 0.8162 | 0.8163 | 0.4483 | 0.8192 | 0.8198 | 0.4757 |
| 0.1 | 0.8854 | 0.8925 | 0.6251 | 0.8130 | 0.8131 | 0.4451 | 0.8189 | 0.8195 | 0.4718 |
| 0.0 | 0.8769 | 0.8831 | 0.6022 | 0.8011 | 0.8011 | 0.4313 | 0.8103 | 0.8108 | 0.4633 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with the default parameters, except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

^d Performance at homology level, which contains close homology and remote homology (belonging to the same SCOP superfamily).

Table 6

Impact of the Jscore threshold (cf. Equation (6)) on the three PL-based search methods on the SCOPe 2.06 benchmark dataset

| Jscore threshold | PL-HHblits^a ROC1^d | PL-HHblits^a ROC50^d | PL-HHblits^a Coverage^d | PL-BLAST^b ROC1^d | PL-BLAST^b ROC50^d | PL-BLAST^b Coverage^d | PL-HMMER^c ROC1^d | PL-HMMER^c ROC50^d | PL-HMMER^c Coverage^d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.99 | 0.9207 | 0.9426 | 0.7184 | 0.9047 | 0.9155 | 0.5327 | 0.9043 | 0.9188 | 0.5642 |
| 0.9 | 0.9268 | 0.9438 | 0.6952 | 0.9150 | 0.9190 | 0.5107 | 0.9152 | 0.9197 | 0.5416 |
| 0.8 | 0.9305 | 0.9451 | 0.6795 | 0.9168 | 0.9203 | 0.5017 | 0.9175 | 0.9213 | 0.5321 |
| 0.7 | 0.9341 | 0.9446 | 0.6651 | 0.9170 | 0.9200 | 0.4947 | 0.9194 | 0.9229 | 0.5240 |
| 0.6 | 0.9375 | 0.9464 | 0.6517 | 0.9175 | 0.9200 | 0.4889 | 0.9206 | 0.9239 | 0.5176 |
| 0.5 | 0.9411 | 0.9488 | 0.6393 | 0.9179 | 0.9198 | 0.4834 | 0.9214 | 0.9245 | 0.5128 |
| 0.4 | 0.9426 | 0.9502 | 0.6288 | 0.9181 | 0.9196 | 0.4789 | 0.9222 | 0.9248 | 0.5068 |
| 0.3 | 0.9436 | 0.9509 | 0.6201 | 0.9180 | 0.9193 | 0.4741 | 0.9226 | 0.9250 | 0.5026 |
| 0.2 | 0.9442 | 0.9514 | 0.6122 | 0.9175 | 0.9187 | 0.4703 | 0.9226 | 0.9248 | 0.4987 |
| 0.1 | 0.9448 | 0.9516 | 0.6026 | 0.9164 | 0.9174 | 0.4658 | 0.9216 | 0.9237 | 0.4935 |
| 0.0 | 0.9419 | 0.9484 | 0.5738 | 0.9117 | 0.9125 | 0.4504 | 0.9164 | 0.9185 | 0.4800 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with the default parameters, except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

^d Performance at homology level, which contains close homology and remote homology (belonging to the same SCOP superfamily).

Impact of parameters for PL-based search

To investigate the effects of parameters of PL-search on the search performance, we analysed the performance of PL-search based on different original search methods with different parameters on the SCOP1.59 benchmark dataset. There are two parameters (cf. Algorithm 1) in our method, which contribute to better performance by adjusting the quality of the PL. The accuracy of ranking is less sensitive to these parameters, but parameter |${\beta}_1$| significantly impacts the TP number for PL-search (Tables 7 and 8).

Table 7

Impact of parameter |${\beta}_1$| on the three PL-search methods with parameter |${\beta}_2$| set to 0 (cf. Algorithm 1) on the SCOP 1.59 benchmark dataset

| β1 | PL-HHblits^a ROC1^d | PL-HHblits^a ROC50^d | PL-HHblits^a TP^d | PL-HHblits^a EPQ^d | PL-BLAST^b ROC1^d | PL-BLAST^b ROC50^d | PL-BLAST^b TP^d | PL-BLAST^b EPQ^d | PL-HMMER^c ROC1^d | PL-HMMER^c ROC50^d | PL-HMMER^c TP^d | PL-HMMER^c EPQ^d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 0.8665 | 0.8979 | 59.8753 | 0.1622 | 0.8144 | 0.8266 | 49.8214 | 0.1937 | 0.8090 | 0.8175 | 50.9113 | 0.1940 |
| 0.55 | 0.8660 | 0.8977 | 59.8083 | 0.1620 | 0.8144 | 0.8265 | 49.8127 | 0.1937 | 0.8101 | 0.8179 | 50.8361 | 0.1936 |
| 0.6 | 0.8661 | 0.8975 | 59.8041 | 0.1619 | 0.8144 | 0.8265 | 49.8094 | 0.1936 | 0.8106 | 0.8181 | 50.7664 | 0.1937 |
| 0.65 | 0.8681 | 0.8981 | 59.6346 | 0.1616 | 0.8139 | 0.8260 | 49.6661 | 0.1936 | 0.8104 | 0.8180 | 50.7631 | 0.1937 |
| 0.7 | 0.8680 | 0.8974 | 59.5931 | 0.1611 | 0.8138 | 0.8260 | 49.6592 | 0.1936 | 0.8096 | 0.8176 | 50.7622 | 0.1935 |
| 0.75 | 0.8680 | 0.8980 | 59.5616 | 0.1607 | 0.8138 | 0.8259 | 49.6050 | 0.1936 | 0.8107 | 0.8179 | 50.7414 | 0.1931 |
| 0.8 | 0.8684 | 0.8966 | 59.4196 | 0.1593 | 0.8138 | 0.8258 | 49.5833 | 0.1936 | 0.8106 | 0.8183 | 50.7027 | 0.1928 |
| 0.85 | 0.8682 | 0.8969 | 59.3253 | 0.1587 | 0.8148 | 0.8258 | 49.4494 | 0.1939 | 0.8106 | 0.8181 | 50.6633 | 0.1917 |
| 0.9 | 0.8676 | 0.8966 | 59.1802 | 0.1569 | 0.8151 | 0.8258 | 49.3557 | 0.1939 | 0.8109 | 0.8180 | 50.6579 | 0.1903 |
| 0.95 | 0.8679 | 0.8985 | 58.9995 | 0.1532 | 0.8157 | 0.8261 | 49.2459 | 0.1932 | 0.8111 | 0.8179 | 50.6132 | 0.1886 |
| 1.0 | 0.8679 | 0.8976 | 48.2747 | 0.1491 | 0.8158 | 0.8262 | 40.8559 | 0.1930 | 0.8125 | 0.8184 | 46.3268 | 0.1880 |

^a PL-based search based on HHblits with the default parameters.

^b PL-based search based on PSI-BLAST with the default parameters, except that the number of iterations was set to five.

^c PL-based search based on JackHMMER with the default parameters.

^d Performance at homology level, which contains close homology and remote homology (belonging to the same SCOP superfamily).

Table 8

Impact of parameter |${\beta}_2$| on the three PL-search methods on the SCOP 1.59 benchmark dataset

| β2 | PL-HHblits^a ROC1^d | PL-HHblits^a ROC50^d | PL-HHblits^a TP^d | PL-HHblits^a EPQ^d | PL-BLAST^b ROC1^d | PL-BLAST^b ROC50^d | PL-BLAST^b TP^d | PL-BLAST^b EPQ^d | PL-HMMER^c ROC1^d | PL-HMMER^c ROC50^d | PL-HMMER^c TP^d | PL-HMMER^c EPQ^d |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.8684 | 0.8966 | 59.4196 | 0.1593 | 0.8157 | 0.8261 | 49.2459 | 0.1932 | 0.8109 | 0.8180 | 50.6132 | 0.1903 |
| 0.5 | 0.8685 | 0.8968 | 59.4602 | 0.1596 | 0.8153 | 0.8260 | 49.4026 | 0.1940 | 0.8111 | 0.8178 | 50.6957 | 0.1886 |
| 0.6 | 0.8685 | 0.8969 | 59.4658 | 0.1596 | 0.8153 | 0.8260 | 49.4271 | 0.1940 | 0.8111 | 0.8178 | 50.7016 | 0.1886 |
| 0.7 | 0.8688 | 0.8970 | 59.4765 | 0.1600 | 0.8153 | 0.8260 | 49.5291 | 0.1940 | 0.8111 | 0.8179 | 50.7594 | 0.1886 |
| 0.8 | 0.8690 | 0.8973 | 59.6349 | 0.1601 | 0.8153 | 0.8263 | 49.5385 | 0.1941 | 0.8112 | 0.8181 | 50.7648 | 0.1888 |
| 0.9 | 0.8689 | 0.8973 | 59.6375 | 0.1602 | 0.8154 | 0.8262 | 49.5561 | 0.1936 | 0.8112 | 0.8182 | 50.7772 | 0.1890 |
| 1.0 | 0.8689 | 0.8972 | 59.8296 | 0.1607 | 0.8158 | 0.8268 | 49.8555 | 0.1942 | 0.8112 | 0.8182 | 50.8469 | 0.1901 |
| 1.1 | 0.8690 | 0.8976 | 59.8305 | 0.1608 | 0.8158 | 0.8268 | 49.8559 | 0.1942 | 0.8114 | 0.8181 | 50.8510 | 0.1899 |
| 1.2 | 0.8688 | 0.8982 | 59.8465 | 0.1611 | 0.8157 | 0.8269 | 49.8670 | 0.1943 | 0.8114 | 0.8181 | 50.8533 | 0.1903 |
| 1.3 | 0.8688 | 0.8987 | 59.8570 | 0.1612 | 0.8158 | 0.8269 | 49.8679 | 0.1942 | 0.8114 | 0.8181 | 50.8436 | 0.1905 |
| 1.4 | 0.8687 | 0.8988 | 59.9074 | 0.1613 | 0.8158 | 0.8268 | 49.8742 | 0.1942 | 0.8119 | 0.8182 | 50.8473 | 0.1904 |
| 1.5 | 0.8687 | 0.8988 | 59.9074 | 0.1613 | 0.8158 | 0.8268 | 49.8742 | 0.1942 | 0.8119 | 0.8182 | 50.8473 | 0.1904 |

^a Represents PL-HHblits with a β1 value of 0.8 (cf. Algorithm 1).

^b Represents PL-BLAST with a β1 value of 0.95 (cf. Algorithm 1).

^c Represents PL-HMMER with a β1 value of 0.95 (cf. Algorithm 1).

^d Performance at homology level, which contains close homology and remote homology (belonging to the same SCOP superfamily).

Parameter |${\beta}_1$| is the most important parameter for increasing the number of detected homologous protein sequences. For the reliability of the search results in this study, the ROC1 score was given higher priority than the TP number when selecting |${\beta}_1$|. Therefore, |${\beta}_1$| was set to 0.8, 0.95 and 0.95 for PL-HHblits, PL-BLAST and PL-HMMER, respectively. From Table 7, we can see that there is an obvious increase in the number of homologous protein sequences when |${\beta}_1$| is reduced from 1.0 to 0.95 for all original search methods, indicating that constructing the PL with the iterative extending-link strategy significantly improves the search for homologous protein sequences. Further reduction of |${\beta}_1$| resulted in only marginal gains in TP number and erratic fluctuation of the ROC score, while higher ranking quality can be obtained by fixing parameters |${\beta}_1=1$| and |${\beta}_2=0$| when there is no requirement to obtain more homologous sequences.
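The selection of |${\beta}_1$| can be illustrated with a minimal Python sketch. It assumes, as our reading of the text rather than a rule stated in Algorithm 1, that |${\beta}_1=1$| is excluded as a candidate because it corresponds to not seeking additional homologous sequences, and that among the remaining values the one with the highest ROC1 is chosen, with the TP number as tie-breaker. The ROC1 and TP values are copied from Table 7; the name `select_beta1` is hypothetical.

```python
from typing import Dict, List, Tuple

# (beta1, ROC1, TP) per method, copied from Table 7 (beta2 = 0, SCOP 1.59).
TABLE7: Dict[str, List[Tuple[float, float, float]]] = {
    "PL-HHblits": [
        (0.50, 0.8665, 59.8753), (0.55, 0.8660, 59.8083), (0.60, 0.8661, 59.8041),
        (0.65, 0.8681, 59.6346), (0.70, 0.8680, 59.5931), (0.75, 0.8680, 59.5616),
        (0.80, 0.8684, 59.4196), (0.85, 0.8682, 59.3253), (0.90, 0.8676, 59.1802),
        (0.95, 0.8679, 58.9995), (1.00, 0.8679, 48.2747),
    ],
    "PL-BLAST": [
        (0.50, 0.8144, 49.8214), (0.55, 0.8144, 49.8127), (0.60, 0.8144, 49.8094),
        (0.65, 0.8139, 49.6661), (0.70, 0.8138, 49.6592), (0.75, 0.8138, 49.6050),
        (0.80, 0.8138, 49.5833), (0.85, 0.8148, 49.4494), (0.90, 0.8151, 49.3557),
        (0.95, 0.8157, 49.2459), (1.00, 0.8158, 40.8559),
    ],
    "PL-HMMER": [
        (0.50, 0.8090, 50.9113), (0.55, 0.8101, 50.8361), (0.60, 0.8106, 50.7664),
        (0.65, 0.8104, 50.7631), (0.70, 0.8096, 50.7622), (0.75, 0.8107, 50.7414),
        (0.80, 0.8106, 50.7027), (0.85, 0.8106, 50.6633), (0.90, 0.8109, 50.6579),
        (0.95, 0.8111, 50.6132), (1.00, 0.8125, 46.3268),
    ],
}


def select_beta1(rows: List[Tuple[float, float, float]]) -> float:
    """Pick the beta1 < 1 with the highest ROC1, breaking ties by TP number.

    Assumption of this sketch: beta1 = 1 is not a candidate.
    """
    candidates = [row for row in rows if row[0] < 1.0]
    best = max(candidates, key=lambda row: (row[1], row[2]))
    return best[0]


for method, rows in TABLE7.items():
    print(method, select_beta1(rows))
```

Under this assumption the sketch returns 0.8, 0.95 and 0.95 for PL-HHblits, PL-BLAST and PL-HMMER, which agrees with the values used in this study.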

Parameter |${\beta}_2$| can further improve the TP number while keeping the ranking quality stable (Table 8). Compared with |${\beta}_1$|, the improvement it brings to PL-search is smaller because |${\beta}_2$| is complementary to |${\beta}_1$|; |${\beta}_2$| was set to 1.10, 1.50 and 1.50 for PL-HHblits, PL-BLAST and PL-HMMER, respectively, to achieve an improved ranking list.

An example to show how PL-HHblits works

A worked example can help users understand the process of PL-search in depth [43]. Therefore, an example of how PL-HHblits improves the predictive performance of HHblits is shown in Figure 4. Compared with the search results obtained from HHblits (Figure 4A), PL-HHblits (Figure 4D) detects more homologous protein sequences and fewer non-homologous protein sequences. There are three reasons for the improved performance of PL-HHblits: (i) the double-link strategy filters out some non-homologous protein sequences (Figure 4B and Equations (2) and (3)), and the sequences in the double-link provide high-quality feedback protein sequences for PL-search (see Figure 4D and Equation (6)); (ii) there are some common links between the PL of the query protein (Figure 4C) and the PLs of the homologous proteins (Figure 4F), which helps to detect more homologous protein sequences; (iii) there are no common links between the PL of the query protein (Figure 4C) and the PLs of the non-homologous proteins (Figure 4E), resulting in a more accurate ranking list.
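The effect of common links in reasons (ii) and (iii) can be sketched with the plain Jaccard distance between two sets of links. This is only an illustration: the two-level Jaccard distance actually used by PL-search is defined by the equations of the method and is not reproduced here, and the link identifiers below are made up.

```python
from typing import Set


def jaccard_distance(links_a: Set[str], links_b: Set[str]) -> float:
    """Plain Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    union = links_a | links_b
    if not union:
        # Choice made for this sketch: two empty link sets count as unrelated.
        return 1.0
    return 1.0 - len(links_a & links_b) / len(union)


# Hypothetical profile-links (sets of linked sequence identifiers).
query_pl = {"p1", "p2", "p3", "p4"}
homolog_pl = {"p3", "p4", "p5"}      # shares two links with the query
non_homolog_pl = {"p8", "p9"}        # shares no link with the query

d_homolog = jaccard_distance(query_pl, homolog_pl)
d_non_homolog = jaccard_distance(query_pl, non_homolog_pl)

# Shared links pull the distance below 1; no shared link gives exactly 1.
print(d_homolog < 1.0, d_non_homolog == 1.0)
```

With no common link the distance is exactly 1, so such a sequence is excluded by any threshold below 1, whereas a sequence sharing links with the query receives a smaller distance and can be retained and ranked.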

Figure 4

The search results of the query protein (SCOP ID: d1e9yb1; family: b.92.1.1) generated by HHblits and PL-HHblits, and the link information of PL-HHblits. (A) and (B) show the search results output by HHblits and PL-HHblits, respectively. (C) represents the double-link list of the query protein (cf. Equation (3)). (D) represents the PLs of the query protein (cf. Algorithm 1). (E) shows the PLs of a non-homologous protein (SCOP ID: d2z26a) of the query protein, which is in the ranking list of HHblits but not in the ranking list of PL-HHblits. (F) represents the PLs of a homologous protein (SCOP ID: d3be7a1) of the query protein detected by PL-HHblits.

Figure 5

Screenshot of the web server of PL-search.

Stand-alone tool and web server for PL-based search

Because user-friendly web servers and stand-alone tools are useful for biologists, we constructed a web server and a stand-alone tool for PL-search, which can be accessed at http://bliulab.net/PL-search/. The web server provides three PL databases. Furthermore, parallel speed-up is implemented in both the web server and the stand-alone tool. The steps for using the web server are as follows.

Step 1. Access the web server at http://bliulab.net/PL-search/server. The interface of PL-search is shown in Figure 5.

Step 2. First, query protein sequences in FASTA format should be written/copied or uploaded into the input box of the server interface. An example can be displayed by clicking the Example button. Second, one of the three search methods can be selected to construct the profile-link for PL-search. Third, the database version should be set. Fourth, the Jscore threshold can be set to filter the search results.

Step 3. Click the Submit button. Search results will be shown on the screen with three-dimensional structure information.

Step 4. The related data can be freely downloaded from http://bliulab.net/PL-search/download/.
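Because Step 2 requires query sequences in FASTA format, it can be convenient to check the input locally before pasting or uploading it. The following Python sketch is a generic FASTA parser written for illustration; it is not part of PL-search, the server may apply its own validation rules, and the record names and sequences shown are placeholders.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs and run basic checks."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    for name, seq in records:
        if not seq:
            raise ValueError(f"record '{name}' has no sequence")
    return records


# Placeholder input: names and residues are made up for illustration.
example = ">query_1 example\nMKTAYIAKQR\nQISFVKSHFS\n\n>query_2\nacdefghik\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Multi-line sequences are joined and converted to upper case, and an error is raised for a record without residues or for sequence lines that precede the first header.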

Conclusion

In this study, we proposed the PL-search method, which extracts PL information from existing search methods to recalculate the similarity scores of sequence pairs and thereby addresses problems associated with protein remote homology detection. Experimental results show that PL-search obviously improves the accuracy of ranking lists and returns more homologous protein sequences than the corresponding original search method. The generality of PL-search is demonstrated by its three versions, which are based on three different search methods: PSI-BLAST [4], JackHMMER [28] and HHblits [10]. Because PL-search can more accurately measure the similarity between proteins, it is able to improve the performance of these three search methods. Furthermore, PL-search can detect remote homologous proteins when the original search methods fail to detect any homologues.

Key points
  • Because protein remote homology detection is an important task for protein structure and function analysis, it is imperative to develop more accurate computational methods for this task.

  • A new search method called PL-search is proposed, which constructs more robust PLs using the double-link strategy and the iterative extending strategy, and uses a two-level Jaccard distance to obtain more accurate similarity scores.

  • Experimental results on two benchmark datasets show that PL-search based on HHblits, JackHMMER and PSI-BLAST can obviously improve the search performance in terms of ranking quality and the number of detected remote homologues.

Acknowledgements

The authors are very much indebted to the three anonymous reviewers, whose constructive comments were very helpful in strengthening the presentation of this article.

Funding

This work was supported by the National Key R&D Program of China (No. 2018AAA0100100), the National Natural Science Foundation of China (Nos 61822306, 61672184, 61732012 and 61861146002), Beijing Natural Science Foundation (No. JQ19019), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063) and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

Xiaopeng Jin is a PhD candidate at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. His research areas include bioinformatics and machine learning.

Qing Liao (PhD) is an Associate Professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China. Her research areas include bioinformatics and machine learning.

Bin Liu (PhD) is a Professor at the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China, and School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

References

1.

Soding
J
.
Big-data approaches to protein structure prediction
.
Science
2017
;
355
:
248
9
.

2.

Chen
J
,
Guo
M
,
Wang
X
, et al.
A comprehensive review and comparison of different computational methods for protein remote homology detection
.
Brief Bioinform
2018
;
19
:
231
44
.

3.

Smith
TF
,
Waterman
MS
.
Identification of common molecular subsequences
.
J Mol Biol
1981
;
147
:
195
7
.

4.

Altschul
SF
,
Madden
TL
,
Schaffer
AA
, et al.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
.
Nucleic Acids Res
1997
;
25
:
3389
402
.

5.

Wan
S
,
Zou
Q
.
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing
.
Algorithms Mol Biol
2017
;
12
:
25
.

6.

Pearson
WR
.
Rapid and sensitive sequence comparison with FASTP and FASTA
.
Methods Enzymol
1990
;
183
:
63
98
.

7.

Sadreyev
R
,
Grishin
N
.
COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance
.
J Mol Biol
2003
;
326
:
317
36
.

8.

Eddy
SR
.
Accelerated profile HMM searches
.
PLoS Comput Biol
2011
;
7
:
e1002195
.

9.

Soding
J
.
Protein homology detection by HMM-HMM comparison
.
Bioinformatics
2005
;
21
:
951
60
.

10.

Remmert
M
,
Biegert
A
,
Hauser
A
, et al.
HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment
.
Nat Methods
2011
;
9
:
173
5
.

11.

Johnson
LS
,
Eddy
SR
,
Portugaly
E
.
Hidden Markov model speed heuristic and iterative HMM search procedure
.
BMC Bioinf
2010
;
11
:
431
.

12.

Bateman
A
,
Finn
RD
.
SCOOP: a simple method for identification of novel protein superfamily relationships
.
Bioinformatics
2007
;
23
:
809
14
.

13.

Weston
J
,
Elisseeff
A
,
Zhou
DY
, et al.
Protein ranking: from local to global structure in the protein similarity network
.
Proc Natl Acad Sci USA
2004
;
101
:
6559
63
.

14.

Melvin
I
,
Weston
J
,
Leslie
C
, et al.
RANKPROP: a web server for protein remote homology detection
.
Bioinformatics
2009
;
25
:
121
2
.

15.

Liu
B
,
Jiang
S
,
Zou
Q
.
HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search
.
Brief Bioinform
2018
;
21
:
298
308
.

16.

Chandonia
JM
,
Fox
NK
,
Brenner
SE
.
SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database
.
Nucleic Acids Res
2019
;
47
:
D475
81
.

17.

Alam
I
,
Dress
A
,
Rehmsmeier
M
, et al.
Comparative homology agreement search: an effective combination of homology-search methods
.
Proc Natl Acad Sci USA
2004
;
101
:
13814
9
.

18.

Liu
B
,
Zhu
Y
.
ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank
.
IEEE Access
2019
;
7
:
102499
507
.

19.

Gonzalez
MW
,
Pearson
WR
.
Homologous over-extension: a challenge for iterative similarity searches
.
Nucleic Acids Res
2010
;
38
:
2177
89
.

20.

Pearson
WR
,
Li
W
,
Lopez
R
.
Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold
.
Nucleic Acids Res
2017
;
45
:
e46
.

21.

Alva
V
,
Nam
S-Z
,
Söding
J
, et al.
The MPI bioinformatics toolkit as an integrative platform for advanced protein sequence and structure analysis
.
Nucleic Acids Res
2016
;
44
:
W410
5
.

22. Berman HM, Battistuz T, Bhat TN, et al. The Protein Data Bank. Acta Crystallogr Sect D Biol Crystallogr 2002;58:899–907.

23. Pearson WR, Sierk ML. The limits of protein sequence comparison? Curr Opin Struct Biol 2005;15:254–60.

24. Schaffer AA, Aravind L, Madden TL, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001;29:2994–3005.

25. Franceschet M. PageRank: standing on the shoulders of giants. Commun ACM 2011;54:92–101.

26. Kleinberg JM. Authoritative sources in a hyperlinked environment. J ACM 1999;46:604–32.

27. Zhong Z, Zheng L, Cao DL, et al. Re-ranking person re-identification with k-reciprocal encoding. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) 2017:3652–61.

28. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.

29. Jaccard P. Lois de distribution florale dans la zone alpine. Bull Soc Vaud Sci Nat 1902;38:69–130.

30. Levandowsky M, Winter D. Distance between sets. Nature 1971;234:34–5.

31. Prokopenko D, Hecker J, Silverman EK, et al. Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 genomes project. Bioinformatics 2016;32:1366–72.

32. Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 2018;34:457–66.

33. Wallace ZS, Rosenthal SB, Fisch KM, et al. On entropy and information in gene interaction networks. Bioinformatics 2019;35:815–22.

34. Avey S, Mohanty S, Wilson J, et al. Multiple network-constrained regressions expand insights into influenza vaccination responses. Bioinformatics 2017;33:I208–16.

35. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.

36. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996;20:25–33.

37. Lv H, Zhang ZM, Li SH, et al. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform 2019. doi: .

38. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat 2016;70:129–33.

39. Yang H, Yang W, Dao FY, et al. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief Bioinform 2019. doi: .

40. Reid AJ, Yeats C, Orengo CA. Methods of remote homology detection can be combined to increase coverage by 10% in the midnight zone. Bioinformatics 2007;23:2353–60.

41. Brenner SE, Chothia C, Hubbard TJP. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci USA 1998;95:6073–8.

42. Schaffer AA, Wolf YI, Ponting CP, et al. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 1999;15:1000–11.

43. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302:205–17.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)