Abstract

Protein remote homology detection is essential for structure prediction, function prediction and the understanding of disease mechanisms. The remote homology relationship depends on multiple protein properties, such as structural information and local sequence patterns. Previous studies have shown the challenge of predicting remote homology relationships from sequence-level protein features alone (e.g. the position-specific score matrix). Protein motifs have been used in structure and function analysis because of their unique sequence patterns and implied structural information. Therefore, a usable architecture that fuses multiple motif-based protein properties is urgently needed to improve protein remote homology detection. To make full use of the characteristics of motifs, we employed the protein cubic language model (PCLM), which combines multiple properties by constructing a motif-based neural network. Based on the PCLM, we propose a predictor called PreHom-PCLM that extracts and fuses multiple motif features for protein remote homology detection. PreHom-PCLM outperforms the other state-of-the-art methods on the test set and the independent test set. Experimental results further prove the effectiveness of the multiple features fused by PreHom-PCLM for remote homology detection. Furthermore, the protein features derived from PreHom-PCLM show strong discriminative power for proteins from different structural classes in the high-dimensional space. Availability and Implementation: http://bliulab.net/PreHom-PCLM.

INTRODUCTION

Protein remote homology detection is an essential step in many protein structure and function predictors [1, 2], revealing the evolutionary relationships among different organisms [3, 4]. Recent studies have demonstrated that it is possible to predict the 3D structures of proteins by using the evolutionary relationships inferred from large-scale biological data with sequence analysis methods. For example, the HH-suite [5] has been employed by AlphaFold2 [6] and RoseTTAFold [7]. Therefore, more accurate remote homology detection methods are useful for studying protein structures and functions so as to understand disease mechanisms [7–9].

Remote homology detection is usually formalized as a classification or a retrieval problem (i.e. ranking problem). Several computational methods have been proposed, which can be divided into four groups: alignment methods, protein similarity network methods, ensemble learning methods and deep learning methods.

PSI-BLAST [10, 11] is a fundamental alignment method, which iteratively searches for homologous sequences based on the position-specific score matrix (PSSM) [12]. Motivated by this approach, methods based on different sequence profiles have been proposed, such as HMMER [13, 14] and HHblits [5, 15], which iteratively search for homologous sequences based on Hidden Markov Model (HMM) profiles. MMseqs2 combines short word (k-mer) matching with the Smith–Waterman alignment algorithm for large-scale metagenomic data analysis [16–18]; its performance is comparable to that of PSI-BLAST at a lower computational cost. However, traditional alignment methods that rely solely on sequence-level information struggle to accurately predict remote homology relationships, which can lead to nonhomologous proteins in their predictions [19, 20].

The protein similarity network provides structural information by constructing a structural-similarity-based weighted graph. The Enrichment of Network Topological Similarity method [21] is applied to sequence analysis and achieves better performance by building a structure similarity network. HITS-PR-HHblits [22] performs PageRank and Hyperlink-Induced Topic Search on the similarity network to re-rank the results of PSI-BLAST. PL-search [23] constructs double-link and profile-link networks, and further calculates the Jaccard score to filter nonhomologous sequences introduced by PSI-BLAST.

Ensemble learning is another way to improve predictive performance by learning multiple protein properties. ProtDec-LTR [24–26] integrates multiple ranking methods based on the Learning to Rank (LTR) [27] framework and obtains more accurate results. The LTR framework is a supervised framework for capturing complementary information among the basic ranking methods. Inspired by this strategy, SMI-BLAST [19] proposes an improved PSSM with lower false-positive rates by combining LTR with PSI-BLAST.

Deep neural networks (DNNs) are suitable for learning characteristics from protein sequences, and the derived protein features are applicable to different downstream tasks [28]. UDSMProt [29] is a DNN-based language model combined with a pretraining strategy, detecting remote homology relationships by building a classification model. ProtTrans [30] selects six well-known language models (T5 [31], Electra [32], BERT [33], Albert [34], Transformer-XL [35] and XLNet [36]) and applies them to various classification tasks, such as subcellular localization [37]. The experimental results also prove that these language models capture the global constraints of protein structures after model tuning.

Although language models based on deep neural networks with a pretraining strategy can effectively learn protein features, these features still lack biological information for remote homology detection. The reason is that the protein features derived from a pretrained language model may encode information, such as structural information, that is too entangled to be directly exploited for remote homology relationships. For example, the performance of UDSMProt (with a pretrained language model) is lower than that of ProtDec-LSTM [38] (without a pretrained language model), because the protein features extracted by ProtDec-LSTM contain more useful information for detecting proteins with remote homology. As a result, we should study a novel language model that learns protein features covering multiple protein properties.

The similarities between protein sequences and sentences can be summarized at four levels: (i) at the symbol level, both protein sequences and sentences are composed of a series of symbols; (ii) both follow certain grammatical rules; (iii) both have hierarchical structure; (iv) both convey meaning, such as function or semantics. To achieve accurate protein remote homology detection, a new language model must utilize multi-level information to learn high-level semantic relationships.

The protein cubic language model (PCLM) is a powerful method for learning protein representations from multiple protein properties and has been successfully applied to identify protein disorder molecular functions [39]. Central to PCLM is the notion of representing the general information associated with proteins as a cube, where each face or slice embodies a distinct type of characteristic information. This concept is particularly useful for remote homology detection, as it accounts for various factors influencing the relationships among proteins, such as structural information.

In our proposed computational predictor, PreHom-PCLM, we employ motif-based CNNs (MotifCNNs) to extract features from structural motifs and assemble them into a motif feature cube, which is then input into the PCLM. The PCLM, which combines a 2D convolutional neural network (2D CNN) and transformer layers (TRM), captures latent remote homology relationships among proteins more effectively than traditional methods by learning from the multi-level information provided by the motif feature cube. Our experimental results demonstrate that PreHom-PCLM outperforms competing methods, highlighting the effectiveness of the PCLM. Moreover, the manifold embedding of the protein features generated by PreHom-PCLM exhibits robust discriminative capacity for proteins from different structural classes.

METHODS

Detecting remote homology with PreHom-PCLM includes two steps: building the motif feature cube and feeding this cube into the PCLM to predict probabilities over protein superfamily types (Figure 1). The first step extracts motif features covering multi-level information with motif-based neural networks; in the second step, these motif features are fed into the PCLM to improve the accuracy of detecting remote homology relationships.

Figure 1

The flowchart of PreHom-PCLM. In the two-step process of PreHom-PCLM, we first convert the query protein into six types of motif features using motif-based CNNs built from three groups of motifs. These motif features are then assembled into a motif feature cube. Subsequently, the motif feature cube is input into the PCLM, which consists of the 2D CNN and TRM. The PCLM learns latent remote homology relationships among protein sequences and outputs a series of probabilities over superfamily types using an FC-based classifier. Finally, we assign the superfamily type with the highest probability to the query protein as the retrieval result.

Motif convolutional kernels

Proteins fold from linear sequences of amino acids, some of which contain unique motifs that determine their structures and functions. We introduce protein motifs and build a motif feature cube to improve the accuracy of predicting remote homology relationships from protein sequences. Protein sequence information is usually provided by the PSSM, whose normalized version is the position-specific frequency matrix (PSFM). We construct the PSFM by searching against the nrdb90 data set [40] with PSI-BLAST (num_alignments = 1000, evalue = 1e−3, num_iterations = 3). The PSFM can be represented as

$$\mathrm{PSFM}=\begin{bmatrix}p_{1,1}&p_{1,2}&\cdots&p_{1,20}\\ \vdots&\vdots&\ddots&\vdots\\ p_{S,1}&p_{S,2}&\cdots&p_{S,20}\end{bmatrix} \tag{1}$$

where $\mathrm{PSFM}\in\mathbb{R}^{S\times 20}$, $S$ is the sequence length and each entry $p_{i,j}$ denotes the frequency with which the $j$th amino acid appears at position $i$ of a protein sequence. We design three groups of motifs to extract multi-level information from PSFMs: the varied-length motifs, the standard motifs and the 1D motifs, where each group includes 128 motifs selected from the MegaMotifBase [41, 42]. They are formalized as

$$\mathrm{Motifs}=\left\{M_1,M_2,\dots,M_{128}\right\},\quad M_j\in\mathbb{R}^{l\times 20} \tag{2}$$

where $l$ denotes the size of each motif. The varied-length motifs keep the same sizes as the original structural motifs. The standard motifs share a fixed size of 7, which is suitable for learning local sequence patterns. The 1D motifs share a fixed size of 1, designed for learning local position information. These motifs serve as the convolutional kernels of the 1D convolutional neural network (1D CNN) layers in the MotifCNN, and their weights are initialized randomly rather than with the original values of the structural motifs. We design three styles of MotifCNNs for the varied-length, standard and 1D motif groups. Each MotifCNN generates the motif features as

$$\mathrm{F}_{motif}=\mathrm{MotifCNN}\left(\mathrm{PSFM}\right)={\left[f_{i,j}\right]}_{S\times k} \tag{3}$$

where $\mathrm{F}_{motif}\in\mathbb{R}^{S\times k}$, $k=128$ denotes the number of motifs in each group, and each element $f_{i,j}$ describes the relationship between a motif and the protein sequence.

In addition, we add 1D maximum-pooling and average-pooling layers after each MotifCNN to obtain two different motif features, $\mathrm{F}_{2i-1},\mathrm{F}_{2i}\in\mathbb{R}^{H_p\times k}, i\in\{1,2,3\}$. We finally obtain six styles of motif features from the three groups of motifs.
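As an illustration, a minimal Keras-style sketch of one MotifCNN branch is given below. The fixed input length, the pooling size and the single kernel size of 11 standing in for the varied-length group are assumptions made for brevity; as described above, the kernel weights are randomly initialized.

```python
import tensorflow as tf
from tensorflow.keras import layers

def motif_cnn(seq_len=400, k=128, motif_size=7, pool_size=4):
    """One MotifCNN branch: k motif-shaped 1D convolution kernels slide over
    the PSFM; max- and average-pooling then yield two of the six motif features."""
    psfm = tf.keras.Input(shape=(seq_len, 20))                     # PSFM in R^{S x 20}
    f_motif = layers.Conv1D(k, motif_size, padding="same")(psfm)   # F_motif in R^{S x k}, Eq. (3)
    f_max = layers.MaxPooling1D(pool_size)(f_motif)                # F_{2i-1} in R^{H_p x k}
    f_avg = layers.AveragePooling1D(pool_size)(f_motif)            # F_{2i}   in R^{H_p x k}
    return tf.keras.Model(psfm, [f_max, f_avg])

# Three branches for the three motif groups; the varied-length group actually
# mixes kernel sizes, approximated here by one assumed size of 11.
branches = [motif_cnn(motif_size=s) for s in (11, 7, 1)]
```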

The architecture of PCLM

The protein language model (PLM) is an effective approach to protein sequence analysis [43, 44]. Recent research applied the protein cubic language model (PCLM) to the identification of protein disordered molecular functions [39]. The core concept of the PCLM is to fuse multiple characteristics for a specific task, which makes it suitable for protein remote homology detection.

In order to capture the patterns of the superfamilies, we extend the PCLM [39] by incorporating structural motifs and propose a predictor called PreHom-PCLM for protein remote homology detection. The details of the PCLM are introduced in the following.

A motif feature cube $\mathrm{F}_{cube}\in\mathbb{R}^{6\times H_p\times k}$ based on structural motifs is formalized as

$$\mathrm{F}_{cube}=\left[\mathrm{F}_1;\mathrm{F}_2;\mathrm{F}_3;\mathrm{F}_4;\mathrm{F}_5;\mathrm{F}_6\right] \tag{4}$$

where $\mathrm{F}_j\in\mathbb{R}^{H_p\times k}, j\in\{1,2,\dots,6\}$ denotes the $j$th style of feature extracted by the MotifCNNs. This motif feature cube contains information from different levels. For example, the motif features $\mathrm{F}_1$ and $\mathrm{F}_2$ extracted by the MotifCNN with varied-length motifs may include structural information. The motif feature cube is then fed into PreHom-PCLM for protein remote homology detection, which is formalized as

$$\mathrm{PreHom\text{-}PCLM}\left(\mathrm{F}_{cube}\right)=g_L\circ\dots\circ g_l\circ\dots\circ g_2\circ g_1\left(\left[\mathrm{F}_1;\mathrm{F}_2;\dots;\mathrm{F}_6\right]\right) \tag{5}$$

where $\left[\mathrm{F}_1;\mathrm{F}_2;\dots;\mathrm{F}_j;\dots\right]$ denotes the feature cube composed of multi-level features, and $g_L\circ\dots\circ g_l\circ\dots\circ g_2\circ g_1$ represents the composition of $L$ feature mapping functions, such as linear or non-linear mappings.
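For illustration, the cube assembly of Equation (4) and the transposition used in the next subsection can be sketched as follows, with random stand-ins for the six motif feature maps:

```python
import numpy as np

H_p, k = 100, 128                                    # pooled length and motif count (assumed values)
F1, F2, F3, F4, F5, F6 = (np.random.rand(H_p, k) for _ in range(6))  # stand-ins for real features

F_cube = np.stack([F1, F2, F3, F4, F5, F6], axis=0)  # F_cube in R^{6 x H_p x k}, Eq. (4)
D_a = np.transpose(F_cube, (1, 2, 0))                # (H_p, k, 6): channels-last input for the 2D CNN
```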

2D convolutional layers for motif feature combination

We first transpose the motif feature cube $\mathrm{F}_{cube}$ into $\mathrm{D}_a\in\mathbb{R}^{H_p\times k\times 6}$ so that the first module of PreHom-PCLM fully learns the multi-level information. The first module is a 2D CNN that captures the relationships among different motif features during convolution. The 2D CNN includes two convolution stages (stage 1: 128 filters, $3\times 3$ kernels, $2\times 2$ stride; stage 2: 64 filters, $3\times 3$ kernels, $1\times 1$ stride), each combined with the batch normalization (BN) technique [45] and the Gaussian error linear unit (GELU) [46] activation function. At the tail of the 2D CNN, we add a 2D average-pooling layer ($5\times 5$ kernel, $2\times 2$ stride) to downsample the output. After these steps, we generate a new motif feature cube, $\mathrm{D}_b\in\mathbb{R}^{H_m\times W_m\times C_m}$, where each value measures the relationships among multiple motif features.
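A minimal sketch of this module is shown below. It uses modern tf.keras layers rather than the TensorFlow 1.10 API of the original implementation, and the 'same' padding is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_gelu(x, filters, stride):
    # 3x3 convolution followed by batch normalization and GELU activation
    x = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("gelu")(x)

def pclm_2d_cnn(h_p=100, k=128):
    d_a = tf.keras.Input(shape=(h_p, k, 6))      # transposed motif feature cube D_a
    x = conv_bn_gelu(d_a, 128, stride=2)         # stage 1: 128 filters, 2x2 stride
    x = conv_bn_gelu(x, 64, stride=1)            # stage 2: 64 filters, 1x1 stride
    x = layers.AveragePooling2D(pool_size=5, strides=2, padding="same")(x)  # downsampling
    return tf.keras.Model(d_a, x)                # D_b in R^{H_m x W_m x C_m}
```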

TRM for cubic motif feature embedding

The motif feature cube obtained by the 2D CNN in the previous step represents the relationships among motif features. We add a TRM to capture the global associations within it. We first reshape the motif feature cube into a 1D embedding $\mathrm{T}_a\in\mathbb{R}^{C_m\times E_m}$, $E_m=H_m\times W_m$. Then, we introduce a 1D CNN (64 filters, kernel size 1) with the BN technique and GELU activation function to process this embedding, yielding $\mathrm{T}_b\in\mathbb{R}^{C_t\times E_t}$, with $C_t=C_m$ and $E_t=E_m$. After that, we feed $\mathrm{T}_b$ into the TRM (0.3 dropout rate, self-attention hidden size of 512), which outputs an embedding $\mathrm{T}_c\in\mathbb{R}^{C_t\times E_t}$ encoding the global associations learned by the TRM.
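The following sketch shows one transformer encoder layer consistent with the stated settings (feed-forward hidden size 512, dropout 0.3); the number of attention heads, the post-norm residual layout and the example input shape are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def trm_block(x, num_heads=4, hidden=512, dropout=0.3):
    dim = x.shape[-1]
    # Self-attention sub-block with residual connection and layer normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=dim // num_heads,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward sub-block (hidden size 512 per the text)
    ff = layers.Dense(hidden, activation="gelu")(x)
    ff = layers.Dropout(dropout)(layers.Dense(dim)(ff))
    return layers.LayerNormalization()(x + ff)

t_b = tf.keras.Input(shape=(64, 256))   # T_b: C_t tokens of dimension E_t (assumed sizes)
t_c = trm_block(t_b)                    # T_c: same shape, with global associations encoded
```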

Superfamily prediction

We build a multi-class classification model for protein remote homology detection, assigning a specific superfamily type to a query protein. We add a classifier with three fully connected (FC) feed-forward layers (16 384, 4096 and 2048 neurons), where each layer combines the BN technique and the GELU activation function (no bias). We first flatten the embedding $\mathrm{T}_c$ into $x\in\mathbb{R}^{C_f}$, $C_f=C_t\times E_t$, and then feed it into this classifier. A SoftMax function converts the classifier's output scores into a series of probabilities $p_k$ over the superfamily types. To train this model, we use cross-entropy as the loss function:

$$\mathcal{L}=-\sum_{k=1}^{K}y_k\log p_k \tag{6}$$

where $y_k$ denotes the real label of superfamily $k$, $p_k$ is its predicted probability and $K$ is the number of superfamilies. For the other detailed model parameter settings, please refer to Table S1.
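A Keras-style sketch of this classifier head and the loss of Equation (6):

```python
import tensorflow as tf
from tensorflow.keras import layers

def superfamily_classifier(t_c, num_superfamilies):
    """Three FC layers (no bias) with BN + GELU, then a SoftMax over the K
    superfamily types."""
    x = layers.Flatten()(t_c)                    # T_c -> x in R^{C_f}
    for units in (16384, 4096, 2048):
        x = layers.Dense(units, use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("gelu")(x)
    return layers.Dense(num_superfamilies, activation="softmax")(x)

# Cross-entropy over one-hot superfamily labels, matching Eq. (6)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
```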

Training details

The TensorFlow framework (v1.10.0) is employed to implement PreHom-PCLM. In this study, we build PreHom-PCLM with a supervised training strategy and design a two-stage training procedure to obtain an optimal model. In detail, we initialize the model parameters in the second stage with those trained in the first stage, and we select the Adam [47] algorithm (default parameters) to optimize the PreHom-PCLM parameters during training. For the detailed training parameters in different experiments, please refer to Table S2.
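The two-stage strategy can be sketched as follows; the epoch counts, file name and data sets are placeholders, and tf.keras is used here instead of the TensorFlow 1.10 API of the original implementation:

```python
import tensorflow as tf

def two_stage_training(model, stage1_ds, stage2_ds, epochs=(10, 5)):
    """Stage 2 is initialized from the trained stage-1 parameters; Adam with
    default parameters optimizes the model in both stages."""
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.CategoricalCrossentropy())
    model.fit(stage1_ds, epochs=epochs[0])       # stage 1: train the base model
    model.save_weights("stage1.weights.h5")
    model.load_weights("stage1.weights.h5")      # stage 2 starts from stage-1 weights
    model.fit(stage2_ds, epochs=epochs[1])       # stage 2: tune on the target split
    return model
```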

DISCUSSION

Data sets

We evaluated the performance of our PreHom-PCLM model for remote homology detection on the Structural Classification of Proteins database (SCOP1.75, 0.95 sequence identity) [48] and the Structural Classification of Proteins extended database (SCOPe2.07, 0.95 sequence identity) [3]. SCOP1.75 is used to construct the training and validation sets, and SCOPe2.07 is used to build the test and independent test sets. For SCOP1.75, we first divided it into 10 parts of equal size, one as the validation set and the remainder as the training set. We also used the following tools to reduce the redundancy between the two sets: CD-HIT (0.95 sequence identity) [28, 49], MMseqs (0.95 sequence identity and 10e−4 evalue) [16, 28] and PSI-BLAST (10e−4 evalue) [10, 28]. Finally, the training set contains 14 328 proteins covering 1854 superfamilies, and the validation set contains 1186 proteins covering 253 superfamilies.

For the test set, we first removed the proteins in the SCOP1.75 data set from the SCOPe2.07 data set and reduced the sequence identity to 0.95 using CD-HIT [28]. We divided the test set into five parts with equal numbers of samples to perform 5-fold cross-validation. The resulting test set contains 13 557 proteins covering 1126 superfamilies.

For the independent test set, we retrieved the proteins that were added to the SCOPe database between 2020–07 and 2021–12 and reduced redundancy by MMseqs (0.95 sequence identity and 10e−4 evalue) [28]. As a result, the independent test set contains 5990 proteins covering 839 superfamilies, none of which overlap either the test set or the training set [50].
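For reference, a redundancy-reduction step of this kind can be invoked from Python as below; -i/-o/-c are standard CD-HIT options, and the FASTA file names are hypothetical:

```python
import subprocess

# Cluster sequences at 0.95 identity with CD-HIT to remove redundancy
subprocess.run(
    ["cd-hit", "-i", "proteins.fasta", "-o", "proteins_nr95.fasta", "-c", "0.95"],
    check=True,
)
```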

Experiments and evaluation metrics

Two experiments were conducted to demonstrate the effectiveness of PreHom-PCLM: 5-fold cross-validation on the test set and an independent test. We trained a base model on the training set, which was then tuned for the 5-fold cross-validation and the independent test, respectively. In the 5-fold cross-validation, the test set is divided into five equal parts; one part is used for testing while the remaining four parts are used to tune the base model. In the independent test, the base model is fine-tuned on the entire test set, and the performance is evaluated on the independent test set.
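The splitting protocol can be sketched as follows; the arrays are random stand-ins and the tuning/evaluation step is elided:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(200, 2048)            # stand-in protein feature matrix
y = np.random.randint(0, 10, 200)        # stand-in superfamily labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tune_idx, test_idx) in enumerate(kf.split(X)):
    # Four folds tune the base model; the held-out fold is used for testing.
    print(f"fold {fold}: tune on {len(tune_idx)} samples, test on {len(test_idx)}")
```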

PreHom-PCLM detects the remote homology relationship by predicting a probability distribution over superfamily types and assigns the highest-probability type to the query protein. We used two metrics to measure the performance of the different methods: ROC1 and ROC50 [51–53].
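ROCn is commonly computed as the normalized count of true positives ranked above each of the first n false positives; a sketch of this common formulation follows (the exact implementation used in the cited works may differ):

```python
import numpy as np

def roc_n(scores, labels, n=50):
    """ROCn score: ROC1 for n=1 and ROC50 for n=50.
    scores: prediction scores; labels: 1 = homolog, 0 = non-homolog."""
    order = np.argsort(scores)[::-1]     # rank candidates by decreasing score
    tp, fp, area = 0, 0, 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
            area += tp                   # true positives ranked above this false positive
            if fp == n:
                break
    n_pos = int(np.sum(labels))
    return area / (n * n_pos) if n_pos else 0.0
```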

Ablation study

There are two essential components in PreHom-PCLM that utilize the motif features for detecting protein remote homology relationships: the 2D CNN and the TRM. We performed an ablation experiment to study the roles of the different components in PreHom-PCLM (Table 1). PreHom-PCLM without the two components achieves the lowest performance, indicating that both components are essential. This is because the two components contribute to capturing the relationships among the motif features: the 2D CNN learns the local relationships of different motif features, and the TRM strengthens these relationships through self-attention. Furthermore, the 2D CNN is more important than the TRM, because the performance of PreHom-PCLM decreases more markedly after removing the 2D CNN than after removing the TRM. The reason is that the protein features learned by the 2D CNN are the foundation on which the TRM learns more complicated information. The result in Table 1 is not surprising, because the two components capture different relationships among the motif features, both contributing to the accuracy of remote homology detection.

Table 1

The performance of PreHom-PCLM predictors based on different components and their combinations

Method               2D CNN   TRM   ROC1     ROC50
PreHom-PCLM          ✓        ✓     0.9857   0.9937
w/o TRM              ✓              0.9820   0.9926
w/o 2D CNN                    ✓     0.9538   0.9797
w/o 2D CNN and TRM                  0.9453   0.9770

Five-fold cross-validation

We benchmarked PreHom-PCLM against three alignment methods (HHblits [5, 15], HMMER [13, 14] and PSI-BLAST [10, 11]), ProtDec-LTR 3.0 [26] and the PL-search variants PL-HHblits, PL-HMMER and PL-BLAST [23] (Table 2).

Table 2

The comparison results on the test set

Method              ROC1     ROC50
PreHom-PCLM         0.9356   0.9618
PL-HHblits^a        0.9207   0.9426
PL-HMMER^a          0.9043   0.9188
PL-BLAST^a          0.9047   0.9155
ProtDec-LTR 3.0^b   0.9002   0.9016

^a Cited from the PL-search study [23], since PL-search relies on its purpose-built double-link and profile-link databases.

^b Reproduced in this study [52].

These alignment methods are based on profile features: they iteratively search the query sequence against the protein database using the extracted sequence profiles. A profile feature, such as the HMM profile, is a probability matrix describing the probability of each amino acid occurring at different positions of a sequence. However, the profile feature contains nonhomologous errors, leading to prediction biases.

ProtDec-LTR 3.0 applies ensemble and protein network methods to fuse alignment- and profile-feature-based methods, which reduces some of the nonhomologous errors. The PL-search method builds a double-link database to keep nonhomologous proteins out of the alignment results. It also utilizes the profile-link database and the Jaccard measurement strategy to expand the number of remote homologous sequences.

However, PL-search cannot completely prevent nonhomologous errors, since it relies on alignment methods. Experimental evidence shows a decrease in the performance of PL-HHblits (ROC1: 0.8277, ROC50: 0.8849) when the extent of its sequence retrieval matches the size of PreHom-PCLM's training data. This is attributable to the limited capacity of the sequence alignment method, which measures only the similarity between sequences without considering the structural similarity of proteins. Such limitations can introduce nonhomologous errors, misleading the ranking optimization algorithm of PL-search when retrieval data are insufficient. As a solution for enhancing protein remote homology detection, we introduce PreHom-PCLM, a method that fuses different features. Our strategy includes three types of motif-based neural networks, designed to extract structural information as well as other pertinent information, such as local sequence patterns and positional information. PreHom-PCLM surpasses PL-HHblits in performance, demonstrating its effectiveness in integrating multiple motif features and addressing more nonhomologous errors than PL-search.

Independent test

Despite the improved performance of PreHom-PCLM, its generalization ability has yet to be evaluated. To evaluate the generalization ability of PreHom-PCLM and the PL-search methods, we conducted an independent test (Table 3). PL-search performs better on the independent test set than on the test set because it eliminates nonhomologous proteins. However, further improvement of PL-search is difficult because of the limited quality of the underlying alignment methods. PreHom-PCLM avoids this problem by automatically fusing multi-level information with the three styles of motif-based neural networks.

Table 3

PreHom-PCLM outperforms the PL-search methods on the independent test

Method         ROC1     ROC50
PreHom-PCLM    0.9857   0.9937
PL-HHblits^a   0.9749   0.9777
PL-HMMER^a     0.9340   0.9370
PL-BLAST^a     0.9405   0.9707

^a The PL-search methods were reproduced for the independent test.

To assess predictive performance on proteins with low sequence identity, we created a subset of 23 protein sequences by removing from the original independent test set those with a sequence identity > 0.2, using the MMseqs tool. We then re-evaluated PreHom-PCLM and PL-search on this subset and observed that PreHom-PCLM achieved higher accuracy than PL-search. For detailed results, please refer to Table S6.

Furthermore, we evaluated the performance of our method for 3D protein structure prediction. To this end, we used Clustal Omega [54, 55] to compute multiple sequence alignment (MSA) results for the 23 test sequences with <20% sequence identity to the training data set. We compared the TM-scores (Template Modeling score) and LDDT scores (Local Distance Difference Test score) of the structure predictions generated by trRosetta [56–59] from MSAs produced by PreHom-PCLM, PL-HHblits and HMMER (Figure 2A–C). The comparison demonstrates that PreHom-PCLM yields more accurate predictions than PL-HHblits and HMMER. Moreover, compared with trRosettaX-single [59], a language-model-based method that takes a single sequence as input, PreHom-PCLM performs comparably, and both surpass PL-HHblits and HMMER. Notably, on the LDDT metric, which emphasizes accuracy at the protein residue level, PreHom-PCLM's advantage becomes more pronounced, even surpassing trRosettaX-single (Figure 2, Part II).

Figure 2

PreHom-PCLM accurately predicts structures using trRosetta. The x- and y-axes denote the predicted TM-score (Part I) or the predicted LDDT score (Part II) of different methods based on trRosetta. Sample points located to the right of the diagonal line indicate higher precision for the x-axis method, whereas points to the left indicate higher accuracy for the y-axis method. No PDB template information was used in this protein structure prediction experiment.

Based on these results, we identify two main reasons for the effectiveness of the proposed PCLM: (i) we extract features from structural motifs to capture both sequence and structural characteristics of proteins; (ii) we employ the 2D CNN and Transformer components to construct the PCLM, facilitating the learning of high-level semantic relationships among proteins that are relevant for remote homology detection.

The improvements that the proposed remote homology detection method brings to protein structure prediction can be summarized as follows: (i) for proteins that are difficult to align with traditional sequence alignment methods, our method provides more accurate structure predictions by generating MSAs from proteins with remote homology relationships; (ii) our method achieves comparable or better predictions than PL-HHblits without relying on manually constructed sequence retrieval data sets; (iii) PreHom-PCLM performs comparably to trRosettaX-single, which is based on a protein language model, indicating that determining remote homologous relationships is useful for predicting protein structure.

Protein structure analysis by AlphaFold2

AlphaFold2, developed by DeepMind, is a cutting-edge deep learning method that has revolutionized the analysis of protein structures. In this study, we utilize the local version of ColabFold (v1.5.1) [60], which allows custom MSAs as input. ColabFold combines the fast homology search of MMseqs2 with either AlphaFold2 or RoseTTAFold to accelerate the prediction of protein structures and complexes. Here, ColabFold uses the official AlphaFold2 weights so that its predictions match those of the official AlphaFold2; for simplicity and clarity, we use the term 'AlphaFold2' to refer to ColabFold with the official AlphaFold2 weights. The experimental samples are the same 23 proteins from the independent test set described in the previous section. We compare the LDDT scores of the structures predicted by AlphaFold2 using MSAs generated by PreHom-PCLM, PL-HHblits and HMMER (Figure 3, Part I: A–C). In addition, we compare these results with trRosettaX-single (Figure 3, Part I: D–F). Finally, we compare the structure predictions of AlphaFold2 and trRosetta using the same MSAs as input, in terms of LDDT score (Figure 3, Part II: A–C) and TM-score (Figure 3, Part II: D–F).

Figure 3

PreHom-PCLM accurately predicts structures using AlphaFold2. The x- and y-axes denote the predicted LDDT score (Part I and Part II: A–C) or TM-score (Part II: D–F) of different methods based on AlphaFold2. Additional details are consistent with Figure 2.

Based on Figure 3, the following conclusions can be drawn: (i) AlphaFold2 attained superior accuracy when employing MSAs generated by PreHom-PCLM, surpassing the performance of trRosettaX-single; (ii) AlphaFold2 demonstrated superior performance to trRosetta, especially on the LDDT score, a metric quantifying accuracy at the protein residue level.

The PCLM extracts more usable information from motif features

Remote homology relationships depend on multiple protein properties, such as structural information and local sequence patterns. In this study, three motif-based neural networks were constructed to obtain different types of motif features covering various protein properties. To confirm the effectiveness of the PCLM, we compare PreHom-PCLM with three other methods (1D CNN, AvgPool and MaxPool), in which the motif features are linearly concatenated and fed into a 1D CNN, an average-pooling layer and a maximum-pooling layer, respectively. In contrast, PreHom-PCLM builds a motif feature cube and feeds it into the PCLM composed of the 2D CNN and TRM, gaining deeper insight into the potential remote homology relationships.

Table 4 demonstrates that PreHom-PCLM outperforms the other methods, indicating the following: (i) constructing a motif feature cube and feeding it into the PCLM is more suitable than the other feature-fusion strategies; (ii) the CNN architecture helps to learn the differences among the local patterns of motif features for remote homology detection; (iii) the 2D CNN employed in the PCLM is better than the 1D CNN because it learns more useful correlation patterns among motif features for detecting protein remote homology relationships.

Table 4

The results of different approaches for using the motif features on the independent test set

Method        ROC1     ROC50
PreHom-PCLM   0.9857   0.9937
1D CNN^a      0.9712   0.9890
AvgPool^a     0.9453   0.9770
MaxPool^a     0.9372   0.9722

^a The motif features are linearly concatenated.

PreHom-PCLM captures the global structural constraint in proteins

Remote homology relationships are related to structure-level information [3]. The structural class is the top level of the SCOP/SCOPe hierarchy. PreHom-PCLM learns remote homology relationships by integrating multiple motif features covering multi-level information, such as structural information. To show that PreHom-PCLM captures structural information, we plot the protein distributions with a manifold learning algorithm, comparing the raw sequence and learned feature distributions (Figure 4). As shown in Figure 4A, the structural classes are hard to distinguish from the raw sequence distributions. Figure 4B shows that although the untrained PreHom-PCLM maps the proteins into the same high-dimensional feature space, the features lack discriminative power at the intra-class and inter-class levels. Figure 4C shows that the features extracted by the trained PreHom-PCLM are highly discriminative, aggregating proteins of the same structural class into the same cluster and separating the clusters from each other.

Figure 4

The feature distributions of proteins in the independent test set. Each protein is plotted as a point marked by its SCOPe structural class. (A) shows the original distributions based on the Edit Distance [61]. (B) visualizes the random representation from the untrained PreHom-PCLM. (C) visualizes the final representation of the trained PreHom-PCLM. We employ the t-SNE algorithm (perplexity = 100, learning_rate = 'auto', init = 'pca', random_state = 0; metric = 'precomputed' in (A) and 'euclidean' in (B) and (C)) to convert the high-dimensional protein representations into 2D distributions.
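The projections in Figure 4B and C can be reproduced along these lines with scikit-learn, using the settings listed in the caption; the feature matrix below is a random stand-in:

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 2048)    # stand-in for the (n_proteins, dim) feature matrix
tsne = TSNE(n_components=2, perplexity=100, learning_rate="auto",
            init="pca", random_state=0, metric="euclidean")
coords = tsne.fit_transform(features)   # 2D coordinates for plotting
```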

CONCLUSION

We employed the protein cubic language model (PCLM) to learn protein representations based on multiple motif features and proposed a predictor called PreHom-PCLM for protein remote homology detection. Experimental results show that PreHom-PCLM outperforms the other state-of-the-art predictors, demonstrating that it is a promising approach for extracting and fusing different patterns of proteins (e.g. structure, sequence pattern). Moreover, PreHom-PCLM does not rely on a large-scale MSA, which reduces computational cost and avoids potential errors introduced by alignment.

The principles and techniques of the PCLM, inspired by natural language processing, offer potential advances in protein structure and function analysis. Specifically, remote homology relationships among proteins underpin the diversity of protein structures across organisms. By learning sequence-, structure- and function-level characteristics from PreHom-PCLM outputs, the accuracy of identifying intrinsically disordered regions can be improved [39]. Moreover, PreHom-PCLM can predict remote homology relationships for small proteins (<50 amino acids), such as peptides that lack alignment results [62], addressing the low accuracy of peptide type identification caused by insufficient evolutionary information. In future research, we plan to further explore protein language models like the PCLM to advance the prediction of protein function, such as Gene Ontology terms.

Key Points
  • PreHom-PCLM employs a novel language model (the protein cubic language model, PCLM) for protein remote homology detection. It treats different protein properties related to distant homology as different faces of a cube and incorporates them through the PCLM.

  • PreHom-PCLM develops different feature extractors based on protein motifs to construct the PCLM, covering multiple types of characteristic information related to remote homology relationships.

  • The benchmark and independent test results suggest that PreHom-PCLM outperforms the other competing methods. The free-access web server is available at http://bliulab.net/PreHom-PCLM.

FUNDING

The National Natural Science Foundation of China (Nos 62271049, U22A2039, 62250028, 62102030), Beijing Institute of Technology Research and Innovation Promoting Project (Grant No. 2022YCXZ008).

Author Biographies

Jiangyi Shao is a PhD candidate at the School of Computer Science and Technology, Beijing Institute of Technology, China. His expertise is in bioinformatics.

Qi Zhang is a master student at the School of Computer Science and Technology, Beijing Institute of Technology, China. His expertise is in bioinformatics.

Ke Yan, PhD, is an assistant professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

Bin Liu, PhD, is a professor at the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. His expertise is in bioinformatics, natural language processing and machine learning.

References

1. Rao RM, Liu J, Verkuil R, et al. MSA Transformer. In: Proceedings of the 38th International Conference on Machine Learning. PMLR, 2021, 8844–56.

2. Torres M, Yang H, Romero AE, et al. Protein function prediction for newly sequenced organisms. Nat Mach Intell 2021;3:1050–60.

3. Chandonia JM, Fox NK, Brenner SE. SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res 2019;47:D475–81.

4. Yan K, Lv H, Guo Y, et al. sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 2022;39:btac715.

5. Steinegger M, Meier M, Mirdita M, et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 2019;20:473.

6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.

7. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6.

8. Gligorijevic V, Renfrew PD, Kosciolek T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168.

9. Pejaver V, Urresti J, Lugo-Martinez J, et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 2020;11:5918.

10. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.

11. Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics 2009;10:421.

12. Wang Y, Zhai Y, Ding Y, et al. SBSM-Pro: Support Bio-sequence Machine for Proteins. arXiv preprint arXiv:2308.10275, 2023.

13. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.

14. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol 2011;7:e1002195.

15. Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 2011;9:173–5.

16. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017;35:1026–8.

17. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun 2018;9:2542.

18. Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics 2019;35:2856–8.

19. Jin X, Liao Q, Wei H, et al. SMI-BLAST: a novel supervised search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021;37:913–20.

20. Yan K, Fang X, Xu Y, Liu B. Protein fold recognition based on multi-view modeling. Bioinformatics 2019;35:2982–90.

21. Lhota J, Hauptman R, Hart T, et al. A new method to improve network topological similarity search: applied to fold recognition. Bioinformatics 2015;31:2106–14.

22. Liu B, Jiang S, Zou Q. HITS-PR-HHblits: protein remote homology detection by combining PageRank and hyperlink-induced topic search. Brief Bioinform 2018;21:298–308.

23. Jin X, Liao Q, Liu B. PL-search: a profile-link-based search method for protein remote homology detection. Brief Bioinform 2021;22. https://doi.org/10.1093/bib/bbaa051.

24. Liu B, Chen J, Wang X. Application of learning to rank to protein remote homology detection. Bioinformatics 2015;31:3492–8.

25. Chen J, Guo M, Li S, et al. ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised learning to rank. Bioinformatics 2017;33:3473–6.

26. Liu B, Zhu YL. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank. IEEE Access 2019;7:102499–507.

27. Burges CJC. From RankNet to LambdaRank to LambdaMART: an overview. Microsoft Research Technical Report, 2010.

28. Zhu J, Zhang H, Li SC, et al. Improving protein fold recognition by extracting fold-specific features from predicted residue-residue contacts. Bioinformatics 2017;33:3749–57.

29. Strodthoff N, Wagner P, Wenzel M, Samek W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 2020;36:2401–9.

30. Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell 2021;44(10):7112–27.

31. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020;21:67.

32. Clark K, Luong M-T, Le QV, et al. ELECTRA: pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

33. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, Minnesota. Association for Computational Linguistics, 2019, 4171–86.

34. Lan Z, Chen M, Goodman S, et al. ALBERT: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.

35. Dai Z, Yang Z, Yang Y, et al. Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics, 2019, 2978–88.

36. Yang Z, Dai Z, Yang Y, et al. XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, Canada. Curran Associates, Inc., 2019.

37. Stärk H, Dallago C, Heinzinger M, Rost B. Light attention predicts protein location from the language of life. Bioinform Adv 2021;1(1):vbab0.

38. Li SM, Chen JJ, Liu B. Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinformatics 2017;18:8.

39. Pang Y, Liu B. DMFpred: predicting protein disorder molecular functions based on protein cubic language model. PLoS Comput Biol 2022;18:e1010668.

40. Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998;14:423–9.

41. Chakrabarti S, Venkatramanan K, Sowdhamini R. SMoS: a database of structural motifs of protein superfamilies. Protein Eng 2003;16:791–3.

42. Li CC, Liu B. MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks. Brief Bioinform 2020;21:2133–41.

43. Unsal S, Atas H, Albayrak M, et al. Learning functional properties of proteins with language models. Nat Mach Intell 2022;4:227–45.

44. Li H-L, Pang Y-H, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Res 2021;49:e129.

45. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France. PMLR, 2015;37:448–56.

46. Hendrycks D, Gimpel K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

47. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, Conference Track Proceedings, 2015.

48. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–40.

49. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.

50. Shao J, Chen J, Liu B. ProtRe-CN: protein remote homology detection by combining classification methods and network methods via learning to rank. IEEE/ACM Trans Comput Biol Bioinform 2022;19:3655–62.

51. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 1975;12:387–415.

52. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996;20:25–33.

53. Jin XP, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021;37:4321–7.

54. Goujon M, McWilliam H, Li W, et al. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res 2010;38:W695–9.

55. Sievers F, Wilm A, Dineen D, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011;7:539.

56. Du Z, Su H, Wang W, et al. The trRosetta server for fast and accurate protein structure prediction. Nat Protoc 2021;16:5634–51.

57. Su H, Wang W, Du Z, et al. Improved protein structure prediction using a new multi-scale network and homologous templates. Adv Sci 2021;8:e2102592.

58. Yang J, Anishchenko I, Park H, et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A 2020;117:1496–503.

59. Wang W, Peng Z, Yang J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat Comput Sci 2022;2:804–14.

60. Mirdita M, Schütze K, Moriwaki Y, et al. ColabFold: making protein folding accessible to all. Nat Methods 2022;19:679–82.

61. Bepler T, Berger B. Learning the protein language: evolution, structure, and function. Cell Systems 2021;12:654–669.e3.

62. Yan K, Lv H, Guo Y, et al. TPpred-ATMV: therapeutic peptide prediction by adaptive multi-view tensor learning model. Bioinformatics 2022;38:2712–8.
