Abstract

Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most deep-learning-based protein sequence design methods focus on optimizing the network architecture while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether the representation of a structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on a structural sequence profile obtained using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, the profile is fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms state-of-the-art methods such as ProteinMPNN, Pifold and LM-Design, achieving accuracy gains of 21.89%, 15.54% and 11.4%, respectively, in sequence recovery rate on the CATH 4.2 benchmark. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted with the PDBench tool suggests that SPDesign performs well on subdivided structures. More interestingly, we found that SPDesign can accurately reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, a structural modeling verification experiment indicates that the sequences designed by SPDesign fold into the native structures more accurately.

INTRODUCTION

Proteins with distinct structures exhibit diverse functions, and this functional diversity allows them to play crucial roles in biological processes [1]. Protein design is the rational design of new active and functional protein molecules, which aids drug development and vaccine design and helps reveal the basic principles of protein function [2]. As an important part of protein design, sequence design (also known as fixed-backbone design, FBB) focuses on predicting the amino acid sequence that will fold into a specific protein structure [3–5].

With the continuous advancement of deep learning, a series of different network architectures have been applied to protein structure prediction, protein–protein interaction (PPI) prediction, protein binding site prediction and protein sequence design, resulting in many excellent methods [6–9]. These diverse frameworks have led to the development of innovative approaches, driving advancements in protein sequence design [10–15]. Early deep-learning methods for sequence design were mostly based on Multi-Layer Perceptron (MLP) frameworks, such as SPIN and SPIN2 [16, 17]. They were the first methods to use sequence profiles generated from structural templates, integrating certain protein structural features (such as torsion angles, backbone angles and neighborhood distances) on the basis of fragment-derived sequence profiles and structure-derived energy profiles, which brought about a meaningful breakthrough at the time. CNN-based frameworks were the next to be widely adopted. CNN models, like ProDCoNN and DenseCPD [18, 19], can extract protein features in a higher dimension, such as distance matrices and atomic distributions in three-dimensional space, improving the performance of these methods. However, such methods run relatively slowly due to the need for individual preprocessing and prediction for each residue [20]. Recently, graph neural networks have stood out among these frameworks due to their excellent performance and natural compatibility with molecular structures. As a result, many strong methods have been proposed. GraphTrans [21] introduces a graph attention and autoregressive decoding mechanism to improve the performance of the network. GVP [22] proposes a geometric vector perceptron to ensure the global equivariance of features. ProteinMPNN [9] calculates the ideal virtual atom |${C}_{\beta }$| as an additional atom and employs a message-passing neural network to achieve a performance leap.
Pifold [20] enables the network to learn valuable atomic information independently and proposes the PIGNN module, which achieves a high sequence recovery rate and greatly reduces the running time. SPIN-CGNN [23] utilizes nearest-neighbor protein contact maps, combined with edge updates and selective kernels, to achieve significant improvements in recovery, perplexity, low-complexity regions (LCRs), conservation of hydrophobic positions and deviation from the amino-acid compositions of native sequences. In addition, with the continuous development of natural language processing technology, pre-trained language models are also used in this field. ProteinBERT [24] focuses on capturing the local and global representation of proteins in a more natural way, achieving satisfactory performance on benchmarks covering a variety of protein properties. ESM-IF [25] breaks through the limitation of the number of protein structures that can be determined experimentally, using the structures of 12 million sequences predicted by AlphaFold2 [6] as additional training data. LM-Design [26] adjusts the architecture of the language model and introduces a lightweight structural adapter, making it a strong protein designer. However, it is worth noting that most of these methods focus on optimizing the network to improve performance while making little progress on the feature side, suggesting that they may overlook the impact of protein-specific features on accuracy.

Protein sequence design and structure prediction are inverse applications of each other, sharing certain similarities. Throughout the development of protein structure prediction, structure templates have played an undeniably important role. Most structure prediction methods [6, 27–30] leverage templates as prior structural knowledge to enhance the accuracy of prediction. Similarly, we can apply analogous prior sequence knowledge to protein sequence design. Following the principle of template construction in protein structure prediction, we search the structure database for proteins with structures similar to the input backbone (structural analogs) and extract the sequences corresponding to the matched parts of the analogs to create the structural sequence profile.

In this work, we propose SPDesign, a method utilizing structure alignment to obtain the structural sequence profile (referred to as the sequence profile) to enhance the reliability of sequence design. To reduce the search space of our in-house PAcluster80 structure database [31], SPDesign employs an enhanced ultrafast shape recognition (USR) algorithm [32], which encodes both the input backbone structure and the cluster centers in PAcluster80 into USR vectors (USR-V), to initially screen the clusters in the structure database. This transforms the complex shape comparison process into a similarity assessment between two vectors, significantly accelerating the search. The sequence profile contains sequence patterns from multiple structurally similar proteins, providing reliable guidance. These patterns are then condensed through a statistical algorithm and a pre-trained sequence model, respectively, and ultimately guide the design of the enhanced network together with several structural features (e.g. distances of backbone atoms, pre-trained knowledge of structure). Experimental results demonstrate that, in terms of sequence recovery rate, SPDesign outperforms the state-of-the-art methods on widely used benchmark sets (CATH 4.2 test set: 67.05%, TS50: 68.64%, TS500: 71.63%). The PDBench [33] tool is used to analyze the performance of methods more comprehensively, and the results show that SPDesign performs well on a variety of structures. Moreover, the case study shows that our proposed method has satisfactory performance on conserved residues of some proteins. Finally, compared with other methods, the sequences designed by SPDesign showed a better ability to fold into native structures when modeled with mainstream structure prediction methods.

METHODS

In this section, we provide a detailed explanation of SPDesign, as shown in Figure 1. This includes the structure database; the features utilized by our method and their preparation; the acquisition of the structural sequence profile and its involvement in the network design process; the construction and working principles of the network; and the training process of the network model.

Figure 1

The workflow of SPDesign. (A) The input backbone structure uses USR-V to search for structural analogs in PAcluster80, and the sequences of the matched part are extracted to create the structural sequence profile. The consensus sequence, which is converted to sequence-pattern embedding by the pre-trained model, and positional residue probability are extracted from the profile. (B) Distance map of the backbone atoms. (C) Pre-trained knowledge of structure. (D) Features are aggregated and encoded by the network and finally decoded into the predicted sequence.

Structure database

We used our in-house PAcluster80 [31] as the structure database to generate the structural sequence profile. PAcluster80 removes structures with 100% sequence identity in the PDB [34] and structures with pLDDT <90 in AlphaFold DB [35] (as of March 2022), and clusters the PDB and AlphaFold DB into 56,805 protein clusters based on an 80% structural similarity (TM-score) cutoff, for a total of 207,187 proteins. To ensure fair evaluation, sequence redundancy between all test sets and PAcluster80 is removed at a 40% identity cutoff.

Features

SPDesign employs a set of four features, including sequence-pattern embedding and positional residue probability derived from the structural sequence profile, as well as the distance of backbone atoms and pre-trained knowledge of structure. The following describes the acquisition process of all features in detail.

Features from the structural sequence profile

To obtain the proposed structural sequence profile, it is necessary to search for structural analogs of the input backbone in the PAcluster80 structure database and then use the high-precision structure alignment tool TM-align [36] to perform fine alignment and obtain the relevant sequences. Next, additional operations are applied to condense the features within the sequence profile, ultimately extracting the sequence-pattern embedding and positional residue probability features. The detailed process is shown in Figure 1A.

However, we observed that using TM-align directly to search for structural analogs is extremely time-consuming (one target requires more than 5 h). To address this challenge, SPDesign divides the search process into two steps: first, an enhanced ultrafast shape recognition algorithm matches protein shapes to roughly and quickly screen out clusters in the PAcluster80 database that may contain structural analogs; then, the TM-align tool performs fine structure alignment against the proteins in these clusters.

In the first step, the USR-V of the input backbone structure is calculated and then its similarity with all cluster centers in PAcluster80 is compared to screen the most similar |$k$| clusters. For the |$i$|-th residue (⁠|${r}_i$|⁠) in the backbone, we find the nearest residue (⁠|${g}_i$|⁠) and the farthest residue (⁠|${b}_i$|⁠) of the current residue in Euclidean space. Additionally, we find the residue (⁠|${f}_i$|⁠) farthest from residue (⁠|${b}_i$|⁠):

$$D_{r_i}=\left\{\left\|\overrightarrow{r_i}-\overrightarrow{r_j}\right\|_2\ \middle|\ 1\le j\le L,\ j\ne i\right\}\tag{1}$$
$$g_i= Ind\!\left(\min D_{r_i}\right)\tag{2}$$
$$b_i= Ind\!\left(\max D_{r_i}\right)\tag{3}$$
$$f_i= Ind\!\left(\max D_{b_i}\right)\tag{4}$$

where |$\overrightarrow{r_i},\overrightarrow{r_j},\overrightarrow{b_i}$| represent the |${C}_{\alpha }$| coordinate vectors of residues; |${b}_i,{g}_i,{f}_i$| correspond to the residues described above; |${D}_{r_i}$| and |${D}_{b_i}$| denote the sets of distances from residues |${r}_i$| and |${b}_i$| to all other residues; |$L$| represents the total number of residues; and |$Ind$| signifies a function returning the index of the residue at the specified distance. After obtaining the four residues (|${r}_i$|, |${g}_i$|, |${b}_i$| and |${f}_i$|), we proceed to calculate the average distance between each of these four residues and all residues within the protein.

$$d^{x_i}=\frac{1}{L}\sum_{j=1}^{L}\left\|\overrightarrow{x_i}-\overrightarrow{r_j}\right\|_2,\qquad x\in\{r,g,b,f\}\tag{5}$$

For each individual residue, four distances can be obtained through the process described above. Generalizing the procedure to all residues in the protein, a distance matrix (|${R}^{L\times 4}$|) can be derived. Next, the distance matrix is transposed so that each of its rows can be viewed as a distribution. Consequently, four distance distributions |${\left\{{d}^{r_i}\right\}}_{i=1}^L,{\left\{{d}^{g_i}\right\}}_{i=1}^L,{\left\{{d}^{b_i}\right\}}_{i=1}^L,{\left\{{d}^{f_i}\right\}}_{i=1}^L$| can be obtained, corresponding to the four types of reference residues (|${r}_i$|, |${g}_i$|, |${b}_i$|, |${f}_i$|). Since a distribution is completely determined by its moments, for simplicity in subsequent calculations, we compute the first three moments of each distribution to characterize the protein shape as a vector (USR-V |$\in{R}^{1\times 12}$|), as shown in Supplementary Figure S1.
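The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the enhanced-USR idea, assuming |${C}_{\alpha }$| coordinates as input; the function name and implementation details are ours, not the authors':

```python
import numpy as np

def usr_vector(ca):
    """Sketch of an enhanced-USR shape vector from an (L, 3) array of
    C-alpha coordinates (illustrative, not the authors' implementation)."""
    L = len(ca)
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)  # (L, L)
    off = dist.copy()
    np.fill_diagonal(off, np.inf)   # exclude self when picking the nearest
    g = off.argmin(axis=1)          # nearest residue to each r_i
    b = dist.argmax(axis=1)         # farthest residue from each r_i
    f = dist[b].argmax(axis=1)      # residue farthest from each b_i
    feats = []
    for idx in (np.arange(L), g, b, f):      # the r, g, b, f reference residues
        d = dist[idx].mean(axis=1)           # average distance to all residues
        mu = d.mean()
        feats.extend([mu, d.std(), ((d - mu) ** 3).mean()])  # first 3 moments
    return np.asarray(feats)                 # USR-V, shape (12,)
```

Each of the four distance distributions contributes its mean, standard deviation and third central moment, giving the 12-dimensional shape vector.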

By calculating the Mahalanobis distance between the USR-V of two proteins, the shape similarity of the two proteins can be obtained, illustrated as

$$Sim\left(u_1,u_2\right)=\sqrt{\left(u_1-u_2\right)^{\top}\,\Sigma^{-1}\left(u_1-u_2\right)}\tag{6}$$

where |${u}_1$| and |${u}_2$| represent the USR-V of the two proteins and |$\Sigma$| is the covariance matrix between the vectors.
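As a concrete illustration, the comparison can be written as follows; how |$\Sigma$| is estimated is not specified in the text, so the regularized sample covariance over a reference set of vectors below is our assumption:

```python
import numpy as np

def usr_similarity(u1, u2, cov):
    """Mahalanobis distance between two 12-dim USR-Vs. `cov` is a covariance
    matrix estimated over a reference set of vectors (an assumption here --
    the text does not specify how Sigma is obtained)."""
    diff = np.asarray(u1, dtype=float) - np.asarray(u2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy usage: estimate the covariance from stand-in cluster-centre vectors.
rng = np.random.default_rng(1)
centres = rng.normal(size=(200, 12))                     # hypothetical USR-Vs
cov = np.cov(centres, rowvar=False) + 1e-6 * np.eye(12)  # regularised
```

A smaller distance indicates a more similar protein shape, so the clusters with the smallest distances to the query are retained.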

In the second step, to obtain more accurate matching information for sequence extraction, SPDesign utilizes the TM-align tool to perform a comprehensive alignment between the input backbone and all structures within the chosen |$k$| clusters. According to the TM-score obtained during the matching process, the aligned sequences of matched parts from the top |$n$| proteins are extracted as the structural sequence profile. We present the average time taken to generate sequence profiles of proteins of different lengths in Supplementary Table S1.
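The two-step search can be summarized in pseudocode-like Python. The `shape_dist` and `tm_align` callables are hypothetical stand-ins for the USR-V comparison and the TM-align tool; only the control flow (coarse screen over |$k$| clusters, fine alignment, top-|$n$| selection by TM-score) follows the text:

```python
import heapq

def build_sequence_profile(query_usrv, query_struct, clusters, k=5, n=10,
                           shape_dist=None, tm_align=None):
    """Two-step analog search (sketch). `shape_dist` (e.g. the Mahalanobis
    USR-V distance) and `tm_align` (returning a (tm_score, aligned_seq) pair)
    are hypothetical stand-ins for the real tools."""
    # Step 1: coarse screen -- keep the k clusters whose centre USR-V is
    # most similar to the query's.
    top = heapq.nsmallest(
        k, clusters, key=lambda c: shape_dist(query_usrv, c["center_usrv"]))
    # Step 2: fine TM-align alignment against every member of those clusters.
    hits = []
    for cluster in top:
        for member in cluster["members"]:
            hits.append(tm_align(query_struct, member))
    hits.sort(key=lambda h: h[0], reverse=True)      # rank by TM-score
    return [seq for _, seq in hits[:n]]              # the sequence profile
```

The returned aligned sequences constitute the structural sequence profile used downstream.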

On the basis of the sequence profile, the types and occurrence probabilities of amino acids appearing at each position are counted and used as the positional residue probability feature, which is a detailed description of the sequence profile. At the same time, to obtain an overall summary of the sequence profile, we take the most frequently occurring residue at each position to form a consensus sequence and use the pre-trained sequence model ESM2 to convert it into the sequence-pattern embedding.
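The two profile-derived features can be sketched as follows, assuming the aligned analog sequences use '-' for unmatched positions (an assumption about the alignment format):

```python
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def profile_features(aligned_seqs):
    """Positional residue probabilities and consensus sequence from aligned
    profile sequences; '-' marks unmatched positions (format assumed)."""
    length = len(aligned_seqs[0])
    probs, consensus = [], []
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] != "-")
        total = sum(counts.values())
        if total == 0:                        # no analog covers this position
            probs.append({aa: 1 / len(AA) for aa in AA})
            consensus.append("X")
            continue
        probs.append({aa: c / total for aa, c in counts.items()})
        consensus.append(counts.most_common(1)[0][0])
    return probs, "".join(consensus)
```

Per the text, the consensus string would then be embedded with ESM2, while the per-position probabilities directly form the positional residue probability feature.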

Distance of backbone atoms and pre-trained knowledge of structure

The atomic coordinates of the input protein backbone residues, which have a length of |$L$|, can be represented as |$\left\{{C}_{\alpha_i},{C}_i,{N}_i,{O}_i,{C}_{\beta_i}\ |\ 1\le i\le L\right\}$|, where |${C}_{\alpha_i},{C}_i,{N}_i,{O}_i$| are backbone atoms and |${C}_{\beta_i}$| is an ideal atom computed from the other atoms within residue |$i$| [9], as shown below:

$$\vec{b}=C_{\alpha_i}-N_i,\qquad \vec{c}=C_i-C_{\alpha_i},\qquad \vec{a}=\vec{b}\times\vec{c}\tag{7}$$
$$C_{\beta_i}=\lambda_1\vec{a}+\lambda_2\vec{b}+\lambda_3\vec{c}+C_{\alpha_i}\tag{8}$$

where |${C}_{\alpha_i},{C}_i,{N}_i$| and |${O}_i$| represent the coordinates of the corresponding atoms in residue |$i$| and |${\lambda}_1,{\lambda}_2,{\lambda}_3$| represent three constant values. Next, the distances between these five atoms of all residues in the protein are calculated to obtain the backbone atomic distance map. Finally, to smooth the overall network and map the distance information into a high-dimensional feature space for easier processing and analysis, Gaussian radial basis functions (RBFs) are used to further process the inter-atomic distances. We fit the distance distribution between N, |${C}_{\alpha }$|, C, O and |${C}_{\beta }$| atoms using 16 Gaussian RBFs equally spaced from 2 to 22 Å [9]. The mathematical expression of the RBF function is as follows:

$$RBF\left(x\right)=\exp\!\left(-\frac{\left(x-d_i\right)^2}{2\sigma^2}\right)\tag{9}$$

where |$x$| represents a distance from the map obtained above and the values of |${d}_i$| are uniformly spaced at intervals of 1.25 from 2 to 22 Å.
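The virtual |${C}_{\beta }$| construction and the RBF lifting can be sketched together. The |$\lambda$| constants below are the values popularized by ProteinMPNN-style implementations, and the RBF width is a common heuristic; both are assumptions, since the text does not state them:

```python
import numpy as np

# Assumed constants (ProteinMPNN-style); the paper only says lambdas are fixed.
L1, L2, L3 = -0.58273431, 0.56802827, -0.54067466

def virtual_cbeta(n, ca, c):
    """Ideal C-beta position from backbone N, C-alpha and C coordinates."""
    b = ca - n
    cv = c - ca
    a = np.cross(b, cv)
    return L1 * a + L2 * b + L3 * cv + ca

def rbf_encode(dist, d_min=2.0, d_max=22.0, step=1.25):
    """Lift raw distances onto 16 Gaussian RBFs with centres spaced 1.25
    angstroms apart on [2, 22]; the width choice is an assumed heuristic."""
    centers = np.arange(d_min, d_max, step)      # 16 centres: 2.0 ... 20.75
    sigma = step
    d = np.asarray(dist, dtype=float)[..., None]
    return np.exp(-(((d - centers) / sigma) ** 2) / 2)
```

Each scalar distance thus becomes a 16-dimensional feature, smoothing the network's view of the distance map.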

In addition, the structural pre-trained model ESM-IF achieved great performance in tasks related to protein design. It can extract the structural information of the input backbone and convert it into high-dimensional embedding (the pre-trained knowledge of structure, |$p\in{R}^{L\times 512}$|⁠). The embedding can be applied in downstream tasks to provide them with richer structural information.

Network

As shown in Figure 1D, the architecture of SPDesign utilizes a message-passing neural network, which can capture the characteristics of protein molecules directly from molecular graphs [37]. In the architecture, the backbone is represented as a K-nearest neighbors (KNN) graph. The backbone graph |$G\left(V,S,E\right)$| consists of the node features |$V\in{R}^{L\times N\times 128}$|, the edge features |$S\in{R}^{L\times N\times 128}$| and the set of edges between residues and their neighbors |$E\in{R}^{L\times N}$|, where N represents the number of neighbors.
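Building the neighbor set |$E$| from |${C}_{\alpha }$| coordinates is straightforward; a minimal sketch (function name ours):

```python
import numpy as np

def knn_graph(ca, k=30):
    """Neighbour indices for the backbone KNN graph, built from (L, 3)
    C-alpha coordinates; returns an (L, k) index array E."""
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)     # a residue is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]
```

Row |$i$| of the result lists the |$k$| residues closest to residue |$i$|, matching the |$L\times N$| neighbour layout described above.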

Encoding and decoding module

The encoder consists of a stack of network layers with a hidden dimension of 128. For residue |$i$|, the node embeddings of its neighbors are fused with the corresponding edge features to update the current node embedding and edge information in each propagation step [9], as shown below:

$$v_i^{(0)}=f\left(p_i^{init}\,\Vert\, t_i^{init}\right)\tag{10}$$
$$e_{ij}^{(0)}=f\left(s_{ij}^{init}\right)\tag{11}$$
$$m_{ij}=g\left(v_i\,\Vert\, v_j\,\Vert\, e_{ij}\right),\qquad j\to i\tag{12}$$
$$v_i\leftarrow Norm\!\left(v_i+ Dropout\!\left(\sum_{j\to i}m_{ij}\right)\right)\tag{13}$$
$$v_i\leftarrow Norm\!\left(v_i+ Dropout\!\left(g\left(v_i\right)\right)\right)\tag{14}$$

where |$f$| represents a linear transformation, |$g$| denotes a feed-forward layer consisting of linear layers alternating with activation functions, |$Norm$| denotes layer normalization, |$Dropout$| indicates the dropout operation, |$j\to i$| means the neighbors of residue |$i$|, |${p}^{init}\in{R}^{L\times 512}$| stands for the pre-trained knowledge of structure, |${t}^{init}\in{R}^{L\times 1280}$| represents the sequence-pattern embedding, |${s}^{init}$| stands for the distance of backbone atoms and |$\Vert$| represents the concatenation operation.
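One node-update step of this propagation can be sketched in NumPy. The random matrices stand in for learned weights, and the exact gating/dropout placement is our assumption; this illustrates only the fuse-aggregate-residual-normalize pattern described above:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 128  # hidden dimension, as in the text

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Random matrices stand in for the learned weights of g (linear / ReLU / linear).
W1 = rng.normal(scale=0.05, size=(3 * H, H))
W2 = rng.normal(scale=0.05, size=(H, H))

def propagate(v, s, nbr):
    """One node-update step: fuse each residue's embedding with its
    neighbours' embeddings and edge features, sum the messages, then apply
    a residual connection and layer normalisation."""
    L, N = nbr.shape
    fused = np.concatenate(
        [np.repeat(v[:, None, :], N, axis=1), v[nbr], s], axis=-1)  # (L, N, 3H)
    msg = np.maximum(fused @ W1, 0.0) @ W2                          # g(...)
    agg = msg.sum(axis=1)                                           # sum over j->i
    return layer_norm(v + agg)                                      # Norm(v + ...)

L_res, N = 20, 5
v = rng.normal(size=(L_res, H))
s = rng.normal(size=(L_res, N, H))
nbr = rng.integers(0, L_res, size=(L_res, N))
v_new = propagate(v, s, nbr)
```

Stacking such layers (with dropout during training) yields the encoder; the real model learns W1 and W2 end to end.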

Since general-purpose graph transformers cannot update edge features, critical information between residues may be ignored, affecting the performance of the network. To solve this problem, SPDesign uses an edge update mechanism [9], as shown below:

$$e_{ij}\leftarrow Norm\!\left(e_{ij}+ Dropout\!\left(g\left(v_i\,\Vert\, v_j\,\Vert\, e_{ij}\right)\right)\right)\tag{15}$$

The decoder transforms the encoded representation into residue probabilities, following a principle similar to that of the encoder, as shown in Equations (10)–(14). During decoding, the node feature is integrated with the positional residue probability, ultimately yielding the residue probability corresponding to each position:

$$P_i= Softmax\!\left(f\left(v_i\,\Vert\, prp_i\right)\right)\tag{16}$$

where |$prp\in{R}^{L\times 20}$| is the positional residue probability and |${P}_i$| is the final probability of the |$i$|-th residue.

Model training

The network model is trained using a cross-entropy loss function to assess the deviation between the predicted and native sequences, as shown below:

$$\mathcal{L}=-\sum_{i=1}^{C}y_i\log p_i\tag{17}$$

where |${p}_i$| represents the predicted probability of the |$i$|-th amino acid type, |${y}_i$| is the one-hot representation of the real type and |$C$| is the number of amino acid types. The network weights are initialized using the same method as the Transformer [38]. The Adam optimizer is employed with an initial learning rate of 0.0001. Other parameters include label smoothing and a dropout rate of 10%, and a batch size of 6000 tokens. For the KNN backbone graph, the 30 nearest neighbors are selected using |${C}_{\alpha }$|–|${C}_{\alpha }$| distances. With the features prepared in advance, SPDesign converges in about 3 h (30 epochs) of training on a single NVIDIA A100 GPU. Regenerating features for the dataset (19,746 proteins) takes approximately 12 h.
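The loss with 10% label smoothing can be made concrete as follows; the exact smoothing variant used by the authors is not specified, so the uniform-redistribution scheme below is an assumption:

```python
import numpy as np

def smoothed_cross_entropy(logits, targets, smoothing=0.1):
    """Cross-entropy against label-smoothed one-hot targets (a sketch of the
    training objective with 10% label smoothing described in the text)."""
    n, c = logits.shape
    # Smoothed targets: the native class keeps 1 - smoothing, the remaining
    # mass is spread uniformly over the other classes.
    y = np.full((n, c), smoothing / (c - 1))
    y[np.arange(n), targets] = 1.0 - smoothing
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(y * log_p).sum(axis=1).mean())
```

Smoothing prevents the model from becoming over-confident on single residue identities, which is useful given that multiple residues are often compatible with a backbone position.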

RESULTS AND DISCUSSION

Dataset

Using the same dataset as GraphTrans, GVP and Pifold [20–22], 19,746 proteins from the CATH 4.2 non-redundant S40 dataset [39] were divided into three parts: 17,782 proteins for training, 844 for validation and 1120 for testing. In addition to the CATH 4.2 test set, we selected additional test sets to evaluate the scalability of our method in practical application scenarios, such as the commonly used TS50 and TS500 datasets proposed by SPIN [16], as well as the orphan and de novo designed proteins provided by RGN2 [28]. Finally, we conducted a more objective analysis of each method's performance on the benchmark set provided by PDBench [33]. For strict testing, we removed sequences with more than 40% identity to the input sequence from the sequence profile (the searched sequences of similar structures).

Results on benchmark sets

To evaluate the performance of SPDesign, we tested its effectiveness on several common benchmark sets (CATH 4.2 test set, TS50 and TS500) and compared it with SOTA methods (SPIN-CGNN, Pifold, ProteinMPNN, etc.). Table 1 shows the results for perplexity and sequence recovery rate. The sequence recovery rate measures a method's ability to recover natural sequences, while perplexity measures the certainty around the native amino acid residues [23]. Our method achieves strong performance on all test sets. Compared with other methods, such as SPIN-CGNN, Pifold and ProteinMPNN, SPDesign shows significant improvements on the CATH 4.2 test set, with recovery rates 12.24%, 15.54% and 21.89% higher, respectively. On the TS50 and TS500 benchmark sets, SPDesign again achieved the highest performance, with recovery rates of 68.64% and 71.63%, respectively. This may be because, compared with the majority of other methods constrained by the inherent characteristics of the target backbone, the sequence profile offers our network a more comprehensive range of information. Additionally, we evaluated the methods on LCRs, conservation of hydrophobic positions and amino-acid substitution matrices [23]. The results, presented in Supplementary Figure S2 and Table S2, suggest that our method may have a stronger ability to capture evolutionary information than other methods.
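For reference, the two reported metrics are simple to compute; a minimal sketch (function names ours, with perplexity taken as the exponential of the mean negative log-likelihood of the native residues):

```python
import math

def recovery_rate(designed, native):
    """Percentage of positions where the designed residue equals the native one."""
    assert len(designed) == len(native)
    return 100.0 * sum(a == b for a, b in zip(designed, native)) / len(native)

def perplexity(native_log_probs):
    """Perplexity from the model's per-residue log-probabilities of the
    native amino acids (lower = more certain)."""
    return math.exp(-sum(native_log_probs) / len(native_log_probs))
```

A perplexity of 2.43 (SPDesign on CATH 4.2) thus corresponds to an average native-residue probability of roughly 1/2.43 ≈ 0.41.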

Table 1

Results on widely used benchmark sets.

Method        CATH 4.2              TS50                  TS500
              Perplexity  Recovery  Perplexity  Recovery  Perplexity  Recovery
SPDesign      2.43        67.05     2.72        68.64     2.54        71.63
SPIN-CGNN     4.05        54.81     3.59        58.73     2.98        62.87
LM-Design     4.52        55.65     3.50        57.89     3.19        67.78
Pifold        4.55        51.51     3.86        58.72     3.44        60.42
ProteinMPNN   5.25        45.16     3.93        54.43     3.53        58.08
ESM-IF        /           42.39     /           47.17     /           50.02
GVP           5.36        39.47     4.71        44.14     4.20        49.14
GraphTrans    6.63        35.82     5.40        43.89     4.98        45.69

The best results are marked in bold.


In addition, we compared the identity of the consensus sequence with the recovery rate of the designed sequence, and the results are shown in Figure 2A. The sequence identity between the consensus sequence and the native sequence is less than 10%, indicating that there are significant differences between them. This suggests that the performance of our method does not depend on homologous sequences.

Figure 2

(A) The left axis corresponds to the average recovery rate of designed sequences, and the right axis corresponds to the average sequence identity between the consensus sequence and native sequence. (B) Results comparison on orphan and de novo (designed) protein datasets.

We further analyzed the impact of sequence redundancy on performance by reducing the CATH 4.2 dataset (training/validation/test) to 30% and 20% sequence identity and retraining SPDesign (Supplementary Table S3). The results show that our method achieves sequence recovery rates of 67.05%, 67.73% and 66.55% on datasets with 40%, 30% and 20% sequence identity, respectively, which indicates that SPDesign is highly robust to data with very low sequence similarity. More studies on sequence profiles and structural similarity are presented in Supplementary Tables S4–S8.

Furthermore, the methods were tested on the orphan (75 proteins) and de novo (149 proteins) datasets from RGN2 [28]. These proteins have few homologous sequences and are close to real protein design scenarios. The results presented in Figure 2B show that SPDesign is second only to the latest method, SPIN-CGNN, on the orphan protein dataset and outperforms all other methods on the de novo dataset. This may indicate that SPDesign captures, to a certain extent, the mapping from protein tertiary structure to primary sequence.

Results on PDBench performance analysis

PDBench [33] is a tool for analyzing the performance of protein sequence design methods, offering a more comprehensive and objective analysis. It provides a benchmark dataset comprising 595 proteins spanning 40 protein architectures. These proteins can be clustered into four classes: Mainly Alpha, Mainly Beta, Alpha Beta and Few Structures/Special, ensuring that the evaluation results are not biased toward the most common protein structures. Each class is divided into multiple subdivision structures according to its characteristics. We use the PDBench tool to analyze the benchmark methods and display the similarity results in Figure 3. More results (such as Accuracy and Macro Recall) are shown in Supplementary Figures S3–S7.

Figure 3

Comparison of the similarity on the PDBench benchmark dataset. The X-axis represents various types of subdivision structures. The method has a higher similarity value on a certain type of structure, indicating a better performance on this type of structure.

It is easily observed that SPDesign performs better than the other methods on the four classes of proteins. Specifically, it performs comparably across the Mainly Alpha, Mainly Beta and Alpha Beta classes, while its performance on Few Structures/Special targets is not as good as on the other three classes. The reason may be that these proteins have special structures and are relatively rare in nature, which makes them more challenging to model. In addition, SPDesign also shows stable performance across structural subdivisions, such as Alpha Barrel and Up-down Bundle. This provides evidence that SPDesign improves the sequence recovery rate at the overall level, rather than merely restoring the residues in one particular part of the structure very well. These results suggest that SPDesign has wide applicability.

Measuring whether the designed sequences adhere to the principles of natural evolution is also a crucial consideration when evaluating sequence design methods. We counted the relevant amino acid distributions on the benchmark set provided by PDBench. Figure 4A compares the amino acid type distribution produced by SPDesign with that of the native sequences, and Figure 4B shows the bias of the amino acids used by each method when designing sequences. A non-zero value in Figure 4B signifies a deviation between the distribution of the amino acids designed by the method and the natural distribution; the larger the value, the more obvious the disparity. The amino acid distribution of ProteinMPNN shows a large deviation from the natural distribution, with a very high preference for glutamic acid (E) and lysine (K). SPIN-CGNN also shows very high biases toward alanine (A), glutamine (Q), threonine (T) and valine (V), as does Pifold. As for SPDesign, its amino acid prediction bias is closer to zero as a whole, which suggests that the sequences designed by our method are more in line with the principles of natural evolution.

Figure 4

(A) Comparison of the amino acid type distribution calculated by SPDesign with the native sequence amino acid type distribution. (B) Prediction bias of methods at each residue compared to the native sequence. A positive value represents that the proportion of the amino acid in the designed sequence is higher than in the native state, while a negative value indicates a proportion lower than in the native state.

Structural modeling verification experiment

Having verified that SPDesign can design sequences with high accuracy, to gain a more intuitive understanding of whether the designed sequences can accurately fold into the desired structures, the sequences were modeled with the mainstream protein structure prediction method ESMFold [29]. The similarity between the predicted structures and the native structures is evaluated by the structural similarity metrics TM-score, RMSD and LDDT.

Considering the inherent biases of ESMFold in the protein modeling process, which may impact the accuracy of the experiment, we conducted a screening process on proteins from the CATH 4.2 test set. First, we employed ESMFold to predict the structures of the native sequences within the test set and then calculated the TM-score between the predicted structures and the native structures. Finally, we used two different TM-score thresholds (0.8 and 0.9) to filter proteins from the test set, resulting in two sub-test sets containing 141 and 87 proteins, respectively, as shown in Table 2.

Table 2

Result of structural modeling verification experiment.

Method           TM-score >0.9 (87)                 TM-score >0.8 (141)
                 TM-score  RMSD   LDDT   Recovery   TM-score  RMSD   LDDT   Recovery
SPDesign         0.940     1.216  0.771  69.42      0.893     1.545  0.700  75.77
SPIN-CGNN        0.943     1.127  0.721  55.72      0.889     1.411  0.671  58.10
Pifold           0.919     2.102  0.666  53.56      0.863     2.988  0.612  50.81
ProteinMPNN      0.937     1.631  0.697  50.61      0.884     2.475  0.639  45.11
ESM-IF           0.929     1.361  0.673  49.40      0.870     1.713  0.604  43.99
native-sequence  0.960     1.156  1.000  100        0.918     2.014  1.000  100

The 'native-sequence' row signifies the reference performance using the native sequence as input. The best results except 'native-sequence' are marked in bold.


The results show that SPDesign achieves the highest sequence recovery and LDDT, while its TM-score and RMSD are second only to SPIN-CGNN. Furthermore, we found that improvements in sequence recovery have a much greater impact on LDDT than on TM-score. We therefore speculate that higher sequence recovery helps establish more stable local spatial structures. Examples of targets predicted by ESMFold are shown in Supplementary Figure S8.

Finally, we tested the method on hallucination proteins and on proteins generated by a recent diffusion model [40] that can produce backbones of different lengths. Results in Supplementary Table S9 show that our method also performs competitively on entirely new proteins.

Ablation study

To evaluate the impact of individual features on SPDesign, we conducted ablation experiments on the CATH 4.2 test set, including backbone atoms distance, pre-trained knowledge of structure and structural sequence profile. The results are presented in Table 3 and Supplementary Table S10.

Table 3

The impact of features on perplexity and sequence recovery rate on the CATH 4.2 test set.

Feature                             | SPDesign1 | SPDesign2 | SPDesign3 | SPDesign
Backbone atoms distance             |     ✓     |     ✓     |     ✓     |    ✓
Pre-trained knowledge of structure  |           |     ✓     |           |    ✓
Structural sequence profile         |           |           |     ✓     |    ✓
Perplexity                          |   5.36    |   4.96    |   3.51    |   2.43
Recovery (%)                        |   44.60   |   48.81   |   58.76   |   67.05

SPDesign1 only contains backbone atoms distance feature, while SPDesign2 adds pre-trained knowledge of structure feature based on SPDesign1, resulting in a 4.21% increase in recovery rate. This result is expected because the pre-trained knowledge of structure feature provides a large amount of latent knowledge about the target geometry. Compared with SPDesign1, SPDesign3 adds the structural sequence profile feature. The result shows that this feature significantly improves the performance of our method by 14.16%. The possible reason for this encouraging improvement is that the introduction of sequence profile provides rich sequence information of similar structures.

Upon further analysis of the improvement brought by each feature, an interesting phenomenon emerges. The combination of the structural sequence profile and the pre-trained knowledge of structure improves the recovery rate by 22.45 percentage points (from 44.60% to 67.05%), which is more than the sum of their individual improvements (4.21 + 14.16 = 18.37). This favorable result may be due to the fact that the two features are complementary, adequately characterizing the backbone structure and providing rich geometric information.
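This synergy can be checked directly from the ablation numbers in Table 3:

```python
# Recovery rates (%) from Table 3, CATH 4.2 test set
recovery = {
    "SPDesign1": 44.60,  # backbone atoms distance only
    "SPDesign2": 48.81,  # + pre-trained knowledge of structure
    "SPDesign3": 58.76,  # + structural sequence profile
    "SPDesign": 67.05,   # all three features
}

gain_pretrain = recovery["SPDesign2"] - recovery["SPDesign1"]  # ~4.21 points
gain_profile = recovery["SPDesign3"] - recovery["SPDesign1"]   # ~14.16 points
gain_combined = recovery["SPDesign"] - recovery["SPDesign1"]

# The joint gain exceeds the sum of the individual gains, i.e. the
# two features are complementary rather than redundant.
assert gain_combined > gain_pretrain + gain_profile
```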

Case study

Some studies have shown that different protein sequences may fold into similar structures yet perform distinct biological functions. To investigate whether SPDesign tends to generate diverse protein sequences when the target proteins are structurally similar, we selected four proteins (PDB 3ney, 1kgd, 3wp0, 1ex6) that are not included in the training set for detailed analysis, as shown in Figure 5A. The average pairwise TM-score between these four proteins is 0.82, and the average sequence identity is 33.1%. Specifically, 3ney is a membrane protein primarily engaged in forming ternary complexes crucial for preserving the normal shape of red blood cells. 1kgd is a structural protein that participates in intracellular and transcriptional regulation. 3wp0 acts as a peptide binding protein, making it a potential starting point for designing specific Dlg inhibitors targeting Scribble complex formation. 1ex6 functions as a transferase, catalyzing the reversible phosphoryl transfer from ATP to GMP [41–44]. Furthermore, we use the recovery of conserved amino acids as a metric to assess the method's ability, as these conserved residues are crucial for protein function.

Case study. (A) Structure of four proteins (PDB 3ney, 1kgd, 3wp0, 1ex6). The TM-score and sequence identity between pairs of these four proteins are as follows: 3ney-1kgd: 0.88, 55.2%; 3ney-3wp0: 0.77, 27.5%; 3ney-1ex6: 0.76, 31.2%; 1kgd-3wp0: 0.86, 26.5%; 1kgd-1ex6: 0.83, 26.7%; 3wp0-1ex6: 0.82, 31.5%. (B) Sequence alignment results between sequences designed by SPDesign (Designed seq) and the native sequences (marked by protein name). Conserved amino acids on the fragment sequence are underlined. Fragment-recovery represents the sequence recovery rate of SPDesign on the fragment sequence and conserved-recovery indicates the recovery rate on the conserved amino acids of the fragment. The average sequence identity between pairs of these four fragment sequences designed by SPDesign is 32.7%. (C) Alignment results between the structures predicted by AlphaFold2 using the designed sequences and the native structures.
Figure 5


We further extracted conserved structural fragments of these proteins and display the alignments between the fragment sequences designed by SPDesign and the native fragment sequences in Figure 5B. SPDesign performs very well on the overall fragment sequences, with an average recovery of 80%, while the average pairwise sequence identity between the four designed fragment sequences is 32.7%. These results suggest that SPDesign can design diverse sequences for proteins with similar structures, which may be attributed to the fact that the sequence profile is derived from proteins with similar structures but different sequences; in other words, SPDesign captures the intrinsic sequence-structure mapping. In addition, SPDesign shows encouraging performance in recovering conserved residues, suggesting that our method captures the conserved features crucial to protein structure and function, which may help maintain the biological activity of the designed proteins. The results of these four proteins designed by other methods are shown in Supplementary Figures S9 and S10.
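The two recovery metrics used above can be computed straightforwardly; the sequences below are toy illustrations, not the actual fragments from Figure 5:

```python
def recovery_rate(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

def conserved_recovery(designed: str, native: str, conserved_idx) -> float:
    """Recovery restricted to conserved positions (0-based indices)."""
    hits = sum(designed[i] == native[i] for i in conserved_idx)
    return hits / len(conserved_idx)

# Toy 10-residue example: mismatches at positions 3 and 9 only
native = "MKTAYIAKQR"
designed = "MKTGYIAKQL"
print(recovery_rate(designed, native))                  # 0.8
print(conserved_recovery(designed, native, [0, 1, 5]))  # 1.0
```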

Finally, we employed AlphaFold2 to model the sequences designed by SPDesign and present the structure alignments between the native and predicted structures in Figure 5C. The structural recovery is good for 1kgd, 3wp0 and 3ney, but relatively poorer for 1ex6. Although the conserved structural fragment of 1ex6 is similar to the native structure, the orientation of the global structure is biased, which may be attributed to two factors: current modeling techniques are not fully accurate in predicting protein structures, and some conserved residues in the loop regions were not accurately designed.

CONCLUSION

In this work, we develop SPDesign, a method that combines a structural sequence profile and pre-trained language models for protein sequence design. SPDesign uses an enhanced ultrafast shape recognition algorithm to speed up the search for structural analogs and then extracts sequence pattern information from those analogs to help the network design reasonable sequences. Experiments show that SPDesign significantly outperforms other methods on well-known benchmarks (CATH 4.2 test set: 67.05%, TS50: 68.64%, TS500: 71.63%). The PDBench tool was used to analyze SPDesign more comprehensively; the results show that SPDesign performs well across a variety of major structural classes and that the designed sequences are more in line with the principles of natural evolution. Moreover, we used a mainstream structure modeling method to model the sequences designed by our method, and the experiments suggest that their folded structures are more reasonable.

In the case study, we found that for some target proteins with similar structures but different sequences, SPDesign can design diverse sequences while preserving the conserved sites of these proteins. In the subsequent modeling experiment, we speculated that failure to correctly recover some key conserved residues may affect how accurately the designed sequences fold. Improving the recovery rate of conserved amino acids is therefore an important direction for future research.

Key Points
  • In this work, we propose a novel concept: extracting sequences from matched regions in structural analogs to create the structural sequence profile.

  • We propose SPDesign, a protein sequence design method that combines structural sequence profile and pre-trained language models.

  • We propose an enhanced ultrafast shape recognition algorithm that encodes protein shapes into USR-V vectors, assisting the search process for structural analogs.

  • Experiments suggest that SPDesign achieves state-of-the-art performance across various benchmarks and has wide applicability.
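The shape-encoding idea behind ultrafast shape recognition can be illustrated with a minimal sketch of the classic USR descriptor (four reference points, three distance-distribution moments each); note this is a generic illustration of the underlying technique, not the enhanced USR-V algorithm proposed in this work:

```python
import math

def usr_descriptor(coords):
    """Classic USR: 12 shape descriptors from the distance distributions to
    four reference points: the centroid (ctd), the atom closest to it (cst),
    the atom farthest from it (fct) and the atom farthest from fct (ftf)."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))
    cst = min(coords, key=lambda c: math.dist(c, ctd))
    fct = max(coords, key=lambda c: math.dist(c, ctd))
    ftf = max(coords, key=lambda c: math.dist(c, fct))
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = [math.dist(c, ref) for c in coords]
        mean = sum(d) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in d) / n)
        skew = sum((x - mean) ** 3 for x in d) / n  # third central moment
        skew = math.copysign(abs(skew) ** (1 / 3), skew)  # back to length units
        desc.extend([mean, std, skew])
    return desc

def usr_similarity(a, b):
    """Similarity in (0, 1]: inverse of the mean Manhattan deviation."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))

# Comparing a shape with itself gives maximal similarity.
tetra = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
vec = usr_descriptor(tetra)
print(len(vec), usr_similarity(vec, vec))  # 12 1.0
```

Because such fixed-length vectors can be compared with a cheap elementwise distance, SPDesign can use them to rank candidate analogs in PAcluster80 before the more expensive structure alignment step.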

FUNDING

This work is supported by the National Science and Technology Major Project (2022ZD0115103), the National Nature Science Foundation of China (62173304) and the Key Project of Zhejiang Provincial Natural Science Foundation of China (LZ20F030002).

DATA AVAILABILITY

All data needed to evaluate the conclusions are present in the paper and the Supplementary Materials. The additional data and code related to this paper can be downloaded from our web server (http://zhanglab-bioinf.com/SPDesign/).

Author Biographies

Hui Wang is a master candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and the development of machine learning models and algorithms.

Dong Liu is a PhD candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and the development of machine learning models and algorithms.

Kailong Zhao is a PhD candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and optimization theory.

Yajun Wang is a professor in the College of Pharmaceutical Science, Zhejiang University of Technology. His research interests include biological pharmaceutical, bio catalysis, enzyme engineering and microbial fermentation.

Guijun Zhang is a professor in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and optimization theory.

References

1. Defresne M, Barbe S, Schiex T. Protein design with deep learning. Int J Mol Sci 2021;22:11741.
2. Marshall LR, Zozulia O, Lengyel-Zhand Z, Korendovych IV. Minimalist de novo design of protein catalysts. ACS Catal 2019;9:9265–75.
3. Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022;23:23.
4. Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021;65:136–44.
5. Pearce R, Zhang Y. Deep learning techniques have significantly impacted protein structure prediction and protein design. Curr Opin Struct Biol 2021;68:194–207.
6. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
7. Sun H, Wang J, Wu H, et al. A multimodal deep learning framework for predicting PPI-modulator interactions. J Chem Inf Model 2023;63:7363–72.
8. Fang Y, Jiang Y, Wei L, et al. DeepProSite: structure-aware protein binding site prediction using ESMFold and pretrained language model. Bioinformatics 2023;39:btad718.
9. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022;378:49–56.
10. Chen S, Sun Z, Lin L, et al. To improve protein sequence profile prediction through image captioning on pairwise residue distance map. J Chem Inf Model 2020;60:391–9.
11. Huang B, Fan T, Wang K, et al. Accurate and efficient protein sequence design through learning concise local environment of residues. Bioinformatics 2022;39:btad122.
12. Karimi M, Zhu S, Cao Y, Shen Y. De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J Chem Inf Model 2020;60:5667–81.
13. Strokach A, Kim PM. Deep generative modeling for protein design. Curr Opin Struct Biol 2022;72:226–36.
14. Tan C, Gao Z, Xia J, et al. Generative de novo protein design with global context. In: IEEE International Conference on Acoustics, Speech and Signal Processing 2023;1:1–5.
15. Wang J, Cao H, Zhang JZH, Qi Y. Computational protein design with deep learning neural networks. Sci Rep 2018;8:6349.
16. Li Z, Yang Y, Faraggi E, et al. Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins 2014;82:2565–73.
17. O'Connell J, Li Z, Hanson J, et al. SPIN2: predicting sequence profiles from protein structures using deep neural networks. Proteins 2018;86:629–33.
18. Qi YF, Zhang JZH. DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. J Chem Inf Model 2020;60:1245–52.
19. Zhang Y, Chen Y, Wang C, et al. ProDCoNN: protein design using a convolutional neural network. Proteins 2020;88:819–29.
20. Gao Z, Tan C, Li SZ. PiFold: toward effective and efficient protein inverse folding. ArXiv 2022;abs/2209.12643.
21. Wu Z, Jain P, Wright MA, et al. Representing long-range context for graph neural networks with global attention. ArXiv 2022;abs/2201.08821.
22. Jing B, Eismann S, Suriana P, et al. Learning from protein structure with geometric vector perceptrons. ArXiv 2020;abs/2009.01411.
23. Zhang X, Yin H, Ling F, et al. SPIN-CGNN: improved fixed backbone protein design with contact map-based graph construction and contact graph neural network. PLoS Comput Biol 2023;19:1–25.
24. Brandes N, Ofer D, Peleg Y, et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10.
25. Hsu C, Verkuil R, Liu J, et al. Learning inverse folding from millions of predicted structures. In: International Conference on Machine Learning 2022;162:8946–70.
26. Zheng Z, Deng Y, Xue D, et al. Structure-informed language models are protein designers. In: International Conference on Machine Learning 2023;1781:42317–38.
27. Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021;373:871–6.
28. Chowdhury R, Bouatta N, Biswas S, et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol 2022;40:1617–23.
29. Zeming L, Halil A, Roshan R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022;2022.07.20.500902.
30. Zhao KL, Liu J, Zhou XG, et al. MMpred: a distance-assisted multimodal conformation sampling for de novo protein structure prediction. Bioinformatics 2021;37:4350–6.
31. Zhao K, Xia Y, Zhang F, et al. Protein structure and folding pathway prediction based on remote homologs recognition using PAthreader. Commun Biol 2023;6:243.
32. Guo SS, Liu J, Zhou XG, Zhang GJ. DeepUMQA: ultrafast shape recognition-based protein model quality assessment using deep learning. Bioinformatics 2022;38:1895–903.
33. Castorina LV, Petrenas R, Subr K, Wood CW. PDBench: evaluating computational methods for protein-sequence design. Bioinformatics 2023;39:btad027.
34. Burley SK, Berman HM, Kleywegt GJ, et al. Protein Data Bank (PDB): the single global macromolecular structure archive. Methods Mol Biol 2017;1607:627–41.
35. Varadi M, Anyango S, Deshpande M, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50:D439–44.
36. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33:2302–9.
37. Gilmer J, Schoenholz SS, Riley PF, et al. Neural message passing for quantum chemistry. In: International Conference on Machine Learning 2017;70:1263–72.
38. Vaswani A, Shazeer NM, Parmar N, et al. Attention is all you need. In: Neural Information Processing Systems 2017;11:6010.
39. Sillitoe I, Bordin N, Dawson N, et al. CATH: increased structural coverage of functional space. Nucleic Acids Res 2021;49:D266–73.
40. Yim J, Trippe BL, Bortoli VD, et al. SE(3) diffusion model with application to protein backbone generation. ArXiv 2023;abs/2302.02277.
41. Blaszczyk J, Li Y, Yan H, Ji X. Crystal structure of unligated guanylate kinase from yeast reveals GMP-induced conformational changes. J Mol Biol 2001;307:247–57.
42. Chytla A, Gajdzik-Nowak W, Biernatowska A, et al. High-level expression of palmitoylated MPP1 recombinant protein in mammalian cells. Membranes (Basel) 2021;11:715.
43. Li Y, Spangenberg O, Paarmann I, et al. Structural basis for nucleotide-dependent regulation of membrane-associated guanylate kinase-like domains. J Biol Chem 2002;277:4159–65.
44. Zhu J, Shang Y, Xia C, et al. Guanylate kinase domains of the MAGUK family scaffold proteins as specific phospho-protein-binding modules. EMBO J 2011;30:4986–97.

Author notes

Hui Wang and Dong Liu contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]