Hui Wang, Dong Liu, Kailong Zhao, Yajun Wang, Guijun Zhang, SPDesign: protein sequence designer based on structural sequence profile using ultrafast shape recognition, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae146, https://doi.org/10.1093/bib/bbae146
Abstract
Protein sequence design can provide valuable insights into biopharmaceuticals and disease treatments. Currently, most protein sequence design methods based on deep learning focus on network architecture optimization, while ignoring protein-specific physicochemical features. Inspired by the successful application of structure templates and pre-trained models in protein structure prediction, we explored whether the representation of a structural sequence profile can be used for protein sequence design. In this work, we propose SPDesign, a method for protein sequence design based on a structural sequence profile using ultrafast shape recognition. Given an input backbone structure, SPDesign utilizes ultrafast shape recognition vectors to accelerate the search for similar protein structures in our in-house PAcluster80 structure database and then extracts the sequence profile through structure alignment. Combined with structural pre-trained knowledge and geometric features, these are further fed into an enhanced graph neural network for sequence prediction. The results show that SPDesign significantly outperforms state-of-the-art methods, such as ProteinMPNN, Pifold and LM-Design, leading to 21.89%, 15.54% and 11.4% accuracy gains in sequence recovery rate on the CATH 4.2 benchmark, respectively. Encouraging results have also been achieved on orphan and de novo (designed) benchmarks with few homologous sequences. Furthermore, analysis conducted by the PDBench tool suggests that SPDesign performs well across subdivided structure classes. More interestingly, we found that SPDesign can accurately reconstruct the sequences of some proteins that have similar structures but different sequences. Finally, the structural modeling verification experiment indicates that the sequences designed by SPDesign can fold into the native structures more accurately.
INTRODUCTION
Proteins with distinct structures exhibit diverse functions, and this functional diversity enables them to play crucial roles in biological processes [1]. Protein design is the rational design of new active and functional protein molecules, which aids drug development and vaccine design and reveals basic principles of protein function [2]. As an important part of protein design, sequence design (known as fixed backbone design, FBB) focuses on predicting the amino acid sequence that will fold into a specific protein structure [3–5].
With the continuous advancement of deep learning, a series of different network architectures have been applied to protein structure prediction, protein–protein interaction (PPI) prediction, protein binding site prediction and protein sequence design, resulting in many excellent methods [6–9]. These diverse frameworks have led to the development of innovative approaches, driving advancements in protein sequence design [10–15]. Early methods using deep learning for sequence design were mostly based on Multi-Layer Perceptron (MLP) network frameworks, such as SPIN and SPIN2 [16, 17]. They were the first methods to use sequence profiles generated from structural templates, integrating certain protein structural features (such as torsion angles, backbone angles and neighborhood distances) on the basis of fragment-derived sequence profiles and structure-derived energy profiles, which brought a meaningful breakthrough at the time. The next most widely used was the CNN framework. CNN-based models, like ProDCoNN and DenseCPD [18, 19], can extract protein features in higher dimensions, such as distance matrices and atomic distributions in three-dimensional space, improving performance. However, these methods run relatively slowly due to the need for individual preprocessing and prediction for each residue [20]. Recently, the graph neural network has stood out among these frameworks due to its excellent performance and natural compatibility with molecular structure, and many strong methods have been proposed. GraphTrans [21] introduces a graph attention and autoregressive decoding mechanism to improve the performance of the network. GVP [22] proposes a geometric vector perceptron to ensure the global equivariance of features. ProteinMPNN [9] computes an ideal virtual $C_\beta$ as an additional atom and employs a message-passing neural network to achieve a performance leap.
Pifold [20] enables the network to learn valuable atomic information independently and proposes the PIGNN module, which achieves a high sequence recovery rate and greatly reduces running time. SPIN-CGNN [23] utilizes nearest-neighbor protein contact maps, combined with edge updates and selective kernels, to achieve significant improvements in recovery, perplexity, low-complexity regions (LCRs), conservation of hydrophobic positions and deviation from the amino-acid compositions of native sequences. In addition, with the continuous development of natural language processing technology, pre-trained language models are also used in this field. ProteinBERT [24] focuses on capturing the local and global representation of proteins in a more natural way, achieving satisfactory performance on benchmarks covering a variety of protein properties. ESM-IF [25] breaks through the limitation on the number of protein structures that can be determined experimentally, using the structures of 12 million sequences predicted by AlphaFold2 [6] as additional training data. LM-Design [26] adjusts the architecture of the language model and introduces a lightweight structural adapter, making it a strong protein designer. However, it is worth noting that most of these methods focus mainly on optimizing the network to improve performance, with little progress in terms of features. This suggests that these methods may overlook the impact of protein-specific features on accuracy.
Protein sequence design and structure prediction are inverse applications of each other, sharing certain similarities. Throughout the development of protein structure prediction, structure templates have played an undeniably important role. Most structure prediction methods [6, 27–30] leverage templates as prior structural knowledge to enhance the accuracy of protein structure prediction. Similarly, we can apply analogous sequential prior knowledge to protein sequence design. Following the principle of template construction in protein structure prediction, we search the structure database for proteins structurally similar to the input backbone (structural analogs) and extract the sequences corresponding to the matched parts of the analogs to create the structural sequence profile.
In this work, we propose SPDesign, a method utilizing structure alignment to obtain the structural sequence profile (referred to as the sequence profile) to enhance the reliability of sequence design. In order to reduce the search space of our in-house PAcluster80 structure database [31], SPDesign uses an enhanced ultrafast shape recognition (USR) algorithm [32], which encodes both the input backbone structure and the cluster centers in PAcluster80 into USR vectors (USR-V), to initially screen the clusters in the structure database. This transforms the complex shape comparison process into a similarity assessment between two vectors, significantly speeding up the search. The sequence profile contains sequence patterns from multiple structurally similar proteins, providing reliable guidance. These patterns are then condensed through a statistical algorithm and a sequence pre-trained model, respectively, and ultimately guide the design of the enhanced network together with several structural features (e.g. distances of backbone atoms, pre-trained knowledge of structure). Experimental results demonstrate that, in terms of sequence recovery rate, SPDesign outperforms the state-of-the-art methods on widely used benchmark sets (CATH 4.2 test set: 67.05%, TS50: 68.64%, TS500: 71.63%). The PDBench [33] tool is used to analyze the performance of methods more comprehensively, and the results show that SPDesign performs well on a variety of structures. Moreover, the case study shows that our proposed method has satisfactory performance on conserved residues of some proteins. Finally, compared with other methods, the sequences designed by SPDesign showed a better ability to fold into native structures using mainstream structure prediction methods.
METHODS
In this section, we provide a detailed explanation of SPDesign, as shown in Figure 1. This includes the structure database; the features utilized by our method and their preparation; the acquisition of the structural sequence profile and its role in the design process; the construction and working principles of the network; and the training process of the network model.

The workflow of SPDesign. (A) The input backbone structure uses USR-V to search for structural analogs in PAcluster80, and the sequences of the matched part are extracted to create the structural sequence profile. The consensus sequence, which is converted to sequence-pattern embedding by the pre-trained model, and positional residue probability are extracted from the profile. (B) Distance map of the backbone atoms. (C) Pre-trained knowledge of structure. (D) Features are aggregated and encoded by the network and finally decoded into the predicted sequence.
Structure database
We used our in-house PAcluster80 [31] as the structure database to generate the structural sequence profile. PAcluster80 removes structures with 100% sequence identity in PDB [34] and structures with pLDDT <90 in AlphaFold DB [35] (as of March 2022) and clusters PDB and AlphaFold DB into 56,805 protein clusters based on an 80% structural similarity cutoff (TM-score), for a total of 207,187 proteins. To ensure the experiment's fairness, sequence redundancy is removed at a 40% identity cutoff between all test sets and PAcluster80.
Features
SPDesign employs a set of four features, including sequence-pattern embedding and positional residue probability derived from the structural sequence profile, as well as the distance of backbone atoms and pre-trained knowledge of structure. The following describes the acquisition process of all features in detail.
Features from the structural sequence profile
In order to obtain the proposed structural sequence profile, it is necessary to search for structural analogs of the input backbone in the PAcluster80 structure database and then use the high-precision structure alignment tool TM-align [36] to perform fine alignment and obtain the relevant sequences. Next, additional operations are applied to condense the features within the sequence profile, finally extracting the sequence-pattern embedding and positional residue probability features. The detailed process is shown in Figure 1A.
However, we observed that using TM-align during the search for structural analogs is extremely time-consuming (one target requires more than 5 h). To address this challenge, SPDesign divides the search process into two steps: first, an enhanced ultrafast shape recognition algorithm matches the shapes between proteins to roughly and quickly screen out clusters in the PAcluster80 database that may contain structural analogs; then, the TM-align tool performs fine structure alignment against the proteins in these clusters.
In the first step, the USR-V of the input backbone structure is calculated, and its similarity with all cluster centers in PAcluster80 is compared to screen out the $k$ most similar clusters. For the $i$-th residue ($r_i$) in the backbone, we find the nearest residue ($g_i$) and the farthest residue ($b_i$) from the current residue in Euclidean space. Additionally, we find the residue ($f_i$) farthest from residue $b_i$:

$$g_i=Ind\left(\min D_{r_i}\right),\qquad b_i=Ind\left(\max D_{r_i}\right),\qquad f_i=Ind\left(\max D_{b_i}\right),\qquad D_{r_i}=\left\{\left\|\vec{r}_i-\vec{r}_j\right\|\,\middle|\,1\le j\le L\right\}$$

where $\vec{r}_i$, $\vec{r}_j$ and $\vec{b}_i$ represent the $C_\alpha$ coordinate vectors of residues; $b_i$, $g_i$ and $f_i$ correspond to the residues described above; $D_{r_i}$ and $D_{b_i}$ denote the sets of distances from residues $r_i$ and $b_i$ to all other residues; $L$ represents the total number of residues; and $Ind$ is a function that returns the index of the residue at the specified distance. After obtaining the four residues ($r_i$, $g_i$, $b_i$ and $f_i$), we calculate the average distance between each of these four residues and all residues within the protein.
For each individual residue, four distances can be obtained through the process described above. Generalizing the procedure to all residues in the protein, a distance matrix ($\in R^{L\times 4}$) can be derived. Next, the distance matrix is transposed so that each row can be viewed as a distribution. Consequently, four distance distributions $\{d^{r_i}\}_{i=1}^L$, $\{d^{g_i}\}_{i=1}^L$, $\{d^{b_i}\}_{i=1}^L$ and $\{d^{f_i}\}_{i=1}^L$ can be obtained, corresponding to the four reference residues ($r_i$, $g_i$, $b_i$, $f_i$). Since a distribution is completely determined by its moments, for simplicity in subsequent calculations we compute the first three moments of each distribution in order to characterize the shape of the protein as a vector (USR-V $\in R^{1\times 12}$), as shown in Supplementary Figure S1.
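As a concrete illustration, the reference-residue selection and moment computation described above can be sketched in pure Python from $C_\alpha$ coordinates. This is a minimal sketch, not SPDesign's implementation; in particular, the cube-root form of the third moment follows the original USR convention and is an assumption here.

```python
import math

def usr_vector(coords):
    """Compute a 12-dimensional USR-V descriptor from C-alpha coordinates.

    coords: list of (x, y, z) tuples, one per residue. For each residue r_i,
    locate its nearest residue g_i, its farthest residue b_i, and the residue
    f_i farthest from b_i, then record the mean distance from each of the
    four reference residues to all residues. The four resulting length-L
    distributions are summarised by their first three moments.
    """
    L = len(coords)
    rows = []  # L x 4 matrix of mean distances
    for i in range(L):
        d_r = [math.dist(coords[i], coords[j]) for j in range(L)]
        g = min((j for j in range(L) if j != i), key=lambda j: d_r[j])
        b = max(range(L), key=lambda j: d_r[j])
        d_b = [math.dist(coords[b], coords[j]) for j in range(L)]
        f = max(range(L), key=lambda j: d_b[j])
        refs = [coords[i], coords[g], coords[b], coords[f]]
        rows.append([sum(math.dist(p, q) for q in coords) / L for p in refs])

    # transpose: four distributions of length L, then first three moments each
    vec = []
    for col in zip(*rows):
        mean = sum(col) / L
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / L)
        m3 = sum((x - mean) ** 3 for x in col) / L
        skew = math.copysign(abs(m3) ** (1 / 3), m3)  # signed cube root
        vec.extend([mean, std, skew])
    return vec  # flattened to R^(1x12)
```

The output vector is what gets compared against the precomputed USR-V of each cluster center.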
By calculating the Mahalanobis distance between the USR-V of two proteins, the shape similarity of the two proteins can be obtained, illustrated as

$$d\left(u_1,u_2\right)=\sqrt{\left(u_1-u_2\right)^{\top}\Sigma^{-1}\left(u_1-u_2\right)}$$

where $u_1$ and $u_2$ represent the USR-V of the two proteins and $\Sigma$ is the covariance matrix between the vectors.
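In practice, cluster centers can simply be ranked by this distance. A minimal sketch, assuming a diagonal covariance for simplicity (the full $\Sigma^{-1}$ quadratic form is analogous but needs a matrix inverse):

```python
import math

def mahalanobis_diag(u1, u2, var):
    """Mahalanobis distance between two USR-V descriptors under a diagonal
    covariance assumption: var[i] is the variance of component i.
    With a full covariance matrix Sigma, the quadratic form
    (u1 - u2)^T Sigma^{-1} (u1 - u2) would be used instead."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(u1, u2, var)))
```

With unit variances this reduces to the Euclidean distance between the two vectors.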
In the second step, to obtain more accurate matching information for sequence extraction, SPDesign utilizes the TM-align tool to perform a comprehensive alignment between the input backbone and all structures within the chosen $k$ clusters. According to the TM-score obtained during the matching process, the aligned sequences of matched parts from the top $n$ proteins are extracted as the structural sequence profile. We present the average time taken to generate sequence profiles for proteins of different lengths in Supplementary Table S1.
On the basis of the sequence profile, the type and occurrence probability of amino acids appearing at each position are counted and used as the positional residue probability feature, which is a detailed description of the sequence profile. At the same time, in order to have an overall summary of the sequence profile, we take the most frequently occurring residue at each position to form a consensus sequence and use the sequence pre-trained model ESM2 to convert it into the sequence-pattern embedding.
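Both profile-derived features can be computed from the aligned sequences with simple counting. A minimal sketch; the handling of gap characters ('-') and unknown positions ('X') here is an assumption, not necessarily SPDesign's exact treatment:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue types

def profile_features(aligned_seqs):
    """From a structural sequence profile (aligned sequences of equal length),
    derive the positional residue probability (L x 20 row-stochastic matrix)
    and the consensus sequence of most frequent residues per position."""
    L = len(aligned_seqs[0])
    probs, consensus = [], []
    for pos in range(L):
        counts = Counter(s[pos] for s in aligned_seqs if s[pos] != '-')
        total = sum(counts.values()) or 1
        probs.append([counts.get(a, 0) / total for a in AMINO_ACIDS])
        consensus.append(counts.most_common(1)[0][0] if counts else 'X')
    return probs, "".join(consensus)
```

The consensus string would then be fed to the sequence pre-trained model, while the probability matrix is used directly as a feature.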
Distance of backbone atoms and pre-trained knowledge of structure
The atomic coordinates of the input protein backbone residues, which have a length of $L$, can be represented as $\{C_{\alpha_i}, C_i, N_i, O_i, C_{\beta_i} \mid 1\le i\le L\}$, where $C_{\alpha_i}$, $C_i$, $N_i$ and $O_i$ are backbone atoms and $C_{\beta_i}$ is an ideal atom computed from the other atoms within residue $i$ [9], as shown below:

$$C_{\beta_i}=\lambda_1\left(u_i\times v_i\right)+\lambda_2 u_i+\lambda_3 v_i+C_{\alpha_i},\qquad u_i=C_{\alpha_i}-N_i,\quad v_i=C_i-C_{\alpha_i}$$

where $C_{\alpha_i}$, $C_i$, $N_i$ and $O_i$ represent the coordinates of the corresponding atoms in residue $i$ and $\lambda_1$, $\lambda_2$, $\lambda_3$ are three constant values. Next, the distances between these five atoms of all residues in the protein are calculated to obtain the backbone atomic distance map. Finally, in order to improve the smoothness of the overall network and map the distance information to a high-dimensional feature space for easier processing and analysis, Gaussian radial basis functions (RBFs) are used to further process the inter-atomic distances. We fit the distance distribution between the N, $C_\alpha$, C, O and $C_\beta$ atoms using 16 Gaussian RBFs equally spaced from 2 to 22 Å [9]:

$$RBF_i\left(x\right)=\exp\left(-\frac{\left(x-d_i\right)^2}{2\sigma^2}\right)$$

where $x$ represents the distance map obtained above, the centers $d_i$ are uniformly sampled at intervals of 1.25 Å from 2 to 22 Å and $\sigma$ is a bandwidth constant.
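The RBF featurization of a single distance can be sketched as follows. Setting the bandwidth $\sigma$ equal to the center spacing is an assumption for illustration; the exact value used by SPDesign is not stated here:

```python
import math

def rbf_encode(distance, d_min=2.0, d_max=22.0, n=16):
    """Encode one inter-atomic distance (in Angstroms) with n Gaussian radial
    basis functions whose centers are evenly spaced on [d_min, d_max].
    Returns an n-dimensional feature vector with values in (0, 1]."""
    spacing = (d_max - d_min) / n  # 1.25 A for the defaults above
    centers = [d_min + i * spacing for i in range(n)]
    return [math.exp(-((distance - c) ** 2) / (2 * spacing ** 2))
            for c in centers]
```

Applying this to every entry of the 5x5 per-residue-pair distance map yields the smooth high-dimensional distance features described above.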
In addition, the structural pre-trained model ESM-IF achieved great performance in tasks related to protein design. It can extract the structural information of the input backbone and convert it into a high-dimensional embedding (the pre-trained knowledge of structure, $p\in R^{L\times 512}$). The embedding can be applied in downstream tasks to provide them with richer structural information.
Network
As shown in Figure 1D, the architecture of SPDesign utilizes a message-passing neural network, which can capture the characteristics of protein molecules directly from molecular graphs [37]. In the architecture, the backbone is represented as a K-nearest-neighbors (KNN) graph. The backbone graph $G(V,S,E)$ consists of the node features $V\in R^{L\times N\times 128}$, the edge features $S\in R^{L\times N\times 128}$ and the set of edges between residues and their neighbors $E\in R^{L\times N}$, where $N$ represents the number of neighbors.
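Constructing the KNN edge set from $C_\alpha$ coordinates is straightforward; a minimal sketch (a brute-force $O(L^2)$ search, not an optimized implementation):

```python
import math

def knn_graph(ca_coords, k=30):
    """Build the K-nearest-neighbor edge set over residues from C-alpha
    coordinates (SPDesign uses k = 30, see Model training). Returns, for
    each residue i, the indices of its k nearest neighbors, closest first."""
    L = len(ca_coords)
    edges = []
    for i in range(L):
        order = sorted((j for j in range(L) if j != i),
                       key=lambda j: math.dist(ca_coords[i], ca_coords[j]))
        edges.append(order[:k])
    return edges
```

Each edge $(i, j)$ then carries the RBF-encoded distance features described above, while each node carries the pre-trained and profile-derived embeddings.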
Encoding and decoding module
The encoder consists of a stack of network layers with a hidden dimension of 128. For residue $i$, the node embeddings of its neighbors are fused with its edge features to update the current node embedding and edge information in each propagation step, following the message-passing scheme of [9]:

$$h_i^{(0)}=f\left(p_i^{init}\,\Vert\,t_i^{init}\right),\qquad e_{ij}^{(0)}=f\left(s_{ij}^{init}\right)$$

$$m_{ij}=g\left(h_i\,\Vert\,h_j\,\Vert\,e_{ij}\right)$$

$$h_i\leftarrow Norm\left(h_i+Dropout\left(\sum_{j\to i}m_{ij}\right)\right)$$

$$h_i\leftarrow Norm\left(h_i+Dropout\left(g\left(h_i\right)\right)\right)$$

where $f$ represents the operation of linear transformation; $g$ denotes layers consisting of a linear layer alternating with the activation function; $Norm$ denotes layer normalization; $Dropout$ indicates the dropout operation; $j\to i$ ranges over the neighbors of residue $i$; $p^{init}\in R^{L\times 512}$ stands for the pre-trained knowledge of structure; $t^{init}\in R^{L\times 1280}$ represents the sequence-pattern embedding; $s^{init}$ stands for the distance of backbone atoms; and $\Vert$ represents the concatenation operation.
Since general-purpose graph transformers cannot update edge features, critical information between residues may be ignored, affecting the performance of the network. To solve this problem, SPDesign uses an edge update mechanism [9]:

$$e_{ij}\leftarrow Norm\left(e_{ij}+Dropout\left(g\left(h_i\,\Vert\,h_j\,\Vert\,e_{ij}\right)\right)\right)$$
The decoder transforms the encoded representation into residue probabilities, following a principle similar to that of the encoder, as shown in Equations (10)–(14). In the decoding process, the node feature is integrated with the positional residue probability, ultimately yielding the residue probability corresponding to each position:

$$P_i=Softmax\left(f\left(h_i\,\Vert\,prp_i\right)\right)$$

where $prp\in R^{L\times 20}$ is the positional residue probability and $P_i$ is the final probability of the $i$-th residue.
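One simple way to realize this integration at a single position is to add the (log-)profile as a bias to the network logits before the softmax. This additive-bias form and the `weight` parameter are illustrative assumptions, not SPDesign's exact scheme (which concatenates and passes through learned layers):

```python
import math

def fuse_with_profile(logits, prp, weight=1.0):
    """Fuse per-position network logits with the positional residue
    probability (prp, length-20 distribution) and return a normalized
    probability distribution over residue types."""
    eps = 1e-8  # guard against log(0) for residues absent from the profile
    biased = [l + weight * math.log(p + eps) for l, p in zip(logits, prp)]
    z = max(biased)                       # numerically stable softmax
    exp = [math.exp(b - z) for b in biased]
    s = sum(exp)
    return [e / s for e in exp]
```

With a uniform profile the bias cancels out and the result is just the softmax of the logits.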
Model training
The network model is trained using a cross-entropy loss function to assess the deviation between the predicted and native sequences, as shown below:

$$Loss=-\sum_{i=1}^{C}y_i\log p_i$$

where $p_i$ represents the probability of the amino acid belonging to the $i$-th amino acid type, $y_i$ is the one-hot representation of the real type and $C$ is the number of amino acid types. The network weights are initialized using the same method as the transformer [38]. The Adam optimizer is then employed with an initial learning rate of 0.0001. Other parameters include label smoothing and a dropout rate of 10% and a batch size of 6000 tokens. For the KNN backbone graph, 30 nearest neighbors are selected using $C_\alpha$–$C_\alpha$ distances. With the features prepared in advance, SPDesign can converge in about 3 h (30 epochs) of training on a single NVIDIA A100 GPU. Regenerating features for the dataset (19,746 proteins) takes approximately 12 h.
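The per-residue loss with label smoothing can be sketched as follows; assuming the quoted 10% as the smoothing factor:

```python
import math

def smoothed_cross_entropy(probs, target, n_classes=20, smoothing=0.1):
    """Cross-entropy with label smoothing for one residue.
    probs: predicted probabilities over the n_classes residue types.
    target: index of the native amino-acid type.
    The one-hot label is replaced by (1 - smoothing) on the target class
    plus a uniform smoothing/n_classes mass on every class."""
    smooth_val = smoothing / n_classes
    loss = 0.0
    for i, p in enumerate(probs):
        y = (1.0 - smoothing) + smooth_val if i == target else smooth_val
        loss -= y * math.log(p + 1e-12)
    return loss
```

Averaging this over all residues in a batch gives the training objective; with smoothing set to 0 it reduces to the plain cross-entropy above.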
RESULTS AND DISCUSSION
Dataset
Using the same dataset as GraphTrans, GVP and Pifold [20–22], 19,746 proteins from the CATH 4.2 non-redundant S40 dataset [39] were divided into three parts: 17,782 proteins for training, 844 for validation and 1120 for testing. In addition to the CATH 4.2 test set, we selected additional test sets to assess the scalability of our method in practical application scenarios, such as the commonly used TS50 and TS500 datasets proposed by SPIN [16], as well as the orphan and de novo designed proteins provided by RGN2 [28]. Finally, we conducted a more objective analysis of the method's performance on the benchmark set provided by PDBench [33]. For strict testing, we removed redundancy at 40% sequence identity between the sequence profile (the searched sequences of similar structures) and the input sequence.
Results on benchmark sets
To evaluate the performance of SPDesign, we tested it on common benchmark sets (CATH 4.2 test set, TS50 and TS500) and compared it with SOTA methods (SPIN-CGNN, Pifold, ProteinMPNN, etc.). Table 1 shows the results for perplexity and sequence recovery rate. The sequence recovery rate measures a method's ability to recover natural sequences, while perplexity accounts for the certainty around the native amino acid residues [23]. Our method achieves strong performance on all test sets. Compared with other methods, such as SPIN-CGNN, Pifold and ProteinMPNN, SPDesign shows significant improvements on the CATH 4.2 test set, with recovery rates 12.24%, 15.54% and 21.89% higher, respectively. On the TS50 and TS500 benchmark sets, SPDesign also achieved the highest performance, with recovery rates of 68.64% and 71.63%, respectively. This may be because, in comparison to the majority of other methods constrained by the inherent characteristics of the target backbone, the sequence profile offers our network a more comprehensive range of information. Additionally, we also evaluated the performance of methods on LCRs, conservation of hydrophobic positions and amino-acid substitution matrices [23]. The results, presented in Supplementary Figure S2 and Table S2, suggest that our method may have a stronger ability to capture evolutionary information than other methods.
Table 1. Perplexity and sequence recovery rate (%) on the CATH 4.2, TS50 and TS500 test sets

| Method | Perplexity (CATH 4.2) | Recovery (CATH 4.2) | Perplexity (TS50) | Recovery (TS50) | Perplexity (TS500) | Recovery (TS500) |
|---|---|---|---|---|---|---|
| SPDesign† | **2.43** | **67.05** | **2.72** | **68.64** | **2.54** | **71.63** |
| SPIN-CGNN | 4.05 | 54.81 | 3.59 | 58.73 | 2.98 | 62.87 |
| LM-Design | 4.52 | 55.65 | 3.50 | 57.89 | 3.19 | 67.78 |
| Pifold | 4.55 | 51.51 | 3.86 | 58.72 | 3.44 | 60.42 |
| ProteinMPNN | 5.25 | 45.16 | 3.93 | 54.43 | 3.53 | 58.08 |
| ESM-IF | / | 42.39 | / | 47.17 | / | 50.02 |
| GVP | 5.36 | 39.47 | 4.71 | 44.14 | 4.20 | 49.14 |
| GraphTrans | 6.63 | 35.82 | 5.40 | 43.89 | 4.98 | 45.69 |
The best results are marked in bold, and the best-performing method is marked†.
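The two metrics reported above can be computed directly from the model outputs. A minimal sketch, where `per_residue_probs` denotes the probabilities the model assigned to the native residue at each position:

```python
import math

def recovery_rate(designed, native):
    """Per-residue sequence recovery rate (%) between a designed sequence
    and the native sequence of equal length."""
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)

def perplexity(per_residue_probs):
    """Perplexity from the probabilities assigned to the native residue at
    each position: the exponential of the average negative log-likelihood.
    Lower values indicate higher certainty around the native sequence."""
    nll = -sum(math.log(p) for p in per_residue_probs) / len(per_residue_probs)
    return math.exp(nll)
```

A model that assigns the native residue probability 0.5 everywhere has perplexity 2, while a uniform 20-class model has perplexity 20.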
In addition, we compared the identity of the consensus sequence with the recovery rate of the designed sequence, and the results are shown in Figure 2A. It is obvious that the sequence identity of the consensus sequence is less than 10%, indicating that there are significant differences between the consensus sequence and the native sequence. This suggests that the performance of our method does not depend on homologous sequences.

(A) The left axis corresponds to the average recovery rate of designed sequences, and the right axis corresponds to the average sequence identity between the consensus sequence and native sequence. (B) Results comparison on orphan and de novo (designed) protein datasets.
We further analyzed the impact of sequence redundancy on method performance by reducing the sequence identity of the CATH 4.2 dataset (training/validation/test) to 30% and 20% and retraining SPDesign for protein sequence design (Supplementary Table S3). The results show that our method achieves sequence recovery rates of 67.05%, 67.73% and 66.55% on datasets with 40%, 30% and 20% sequence identity, respectively, which means that SPDesign is highly robust to data with very low sequence similarity. More studies on sequence profiles and structural similarity are presented in Supplementary Tables S4–S8.
Furthermore, the methods were tested on the orphan (75 proteins) and de novo (149 proteins) data sets from RGN2 [28]. These proteins have fewer homologous sequences and are close to real protein design scenarios. The results presented in Figure 2B show that SPDesign is second only to the latest method SPIN-CGNN on the orphan protein dataset and outperforms all other methods on the de novo dataset. This may indicate that SPDesign can capture the mapping of protein tertiary structure to primary sequence to a certain extent.
Results on PDBench performance analysis
PDBench [33] is a tool for analyzing the performance of protein sequence design methods, offering a more comprehensive and objective analysis. It provides a benchmark dataset comprising 595 proteins spanning 40 protein architectures. These proteins are clustered into four classes: Mainly Alpha, Mainly Beta, Alpha Beta and Few Structures/Special, ensuring that the evaluation results are not biased toward the most common protein structures. Each class of structure is divided into multiple subdivision structures according to its characteristics. We used the PDBench tool to analyze the benchmark methods and display the similarity results in Figure 3. More results (such as Accuracy and Macro Recall) are shown in Supplementary Figures S3–S7.

Comparison of similarity on the PDBench benchmark dataset. The X-axis represents the various types of subdivision structures. A higher similarity value for a given structure type indicates better performance on that type of structure.
It is easily observed that SPDesign performs better than the other methods across the four classes of proteins. Specifically, it performs comparably on the Mainly Alpha, Mainly Beta and Alpha Beta classes, while its performance on Few Structures/Special targets is not as good as on the other three classes. The reason may be that these proteins have special structures and are relatively rare in nature, which makes them more challenging to model. In addition, SPDesign also shows stable performance across structural subdivisions, such as Alpha Barrel and Up-down Bundle. This provides evidence that SPDesign improves the sequence recovery rate at the overall level, rather than recovering only the residues in certain parts of the structure very well. These results suggest that SPDesign has wide applicability.
Measuring whether the designed sequences adhere to the principles of natural evolution is also a crucial consideration for evaluating the performance of sequence design methods. We counted the relevant amino acid distributions on the benchmark set provided by PDBench. Figure 4A shows the comparison of the amino acid type distribution produced by SPDesign with the native sequence amino acid type distribution, and Figure 4B shows the bias of the amino acids used by each method when designing sequences. A non-zero value in Figure 4B signifies a deviation between the distribution of the amino acids designed by the method and the natural distribution; the larger the value, the more obvious the disparity. The amino acid distribution of ProteinMPNN shows a large deviation from the natural distribution, with a very high preference for glutamic acid (E) and lysine (K). SPIN-CGNN also shows very high biases for alanine (A), glutamine (Q), threonine (T) and valine (V), as does Pifold. As for SPDesign, its prediction bias distribution is closer to zero as a whole, which suggests that the sequences designed by our method are more in line with the principles of natural evolution.

(A) Comparison of the amino acid type distribution calculated by SPDesign with the native sequence amino acid type distribution. (B) Prediction bias of methods at each residue compared to the native sequence. A positive value represents that the proportion of the amino acid in the designed sequence is higher than in the native state, while a negative value indicates a proportion lower than in the native state.
Structural modeling verification experiment
Having verified that SPDesign can design sequences with high accuracy, to gain a more intuitive understanding of whether the designed sequences can accurately fold into the desired structures, we modeled the sequences with the mainstream protein structure prediction method ESMFold [29]. The similarity between the predicted structures and the native structures is evaluated by the structural similarity metrics TM-score, RMSD and LDDT.
Considering the inherent biases of ESMFold in the protein modeling process, which may impact the accuracy of the experiment, we conducted a screening process on proteins from the CATH 4.2 test set. First, we employed ESMFold to predict structures from the native sequences in the test set and calculated the TM-score between the predicted and native structures. Then we used two TM-score thresholds (0.8 and 0.9) to filter proteins from the test set, resulting in two sub-test sets containing 141 and 87 proteins, respectively, as shown in Table 2.
Table 2. Structural modeling results on the two sub-test sets: the left block reports the TM-score > 0.9 set (87 proteins) and the right block the TM-score > 0.8 set (141 proteins)

| Method | TM-score (>0.9) | RMSD (>0.9) | LDDT (>0.9) | Recovery (>0.9) | TM-score (>0.8) | RMSD (>0.8) | LDDT (>0.8) | Recovery (>0.8) |
|---|---|---|---|---|---|---|---|---|
| SPDesign | 0.940 | 1.216 | **0.771** | **69.42** | **0.893** | 1.545 | **0.700** | **75.77** |
| SPIN-CGNN | **0.943** | **1.127** | 0.721 | 55.72 | 0.889 | **1.411** | 0.671 | 58.10 |
| Pifold | 0.919 | 2.102 | 0.666 | 53.56 | 0.863 | 2.988 | 0.612 | 50.81 |
| ProteinMPNN | 0.937 | 1.631 | 0.697 | 50.61 | 0.884 | 2.475 | 0.639 | 45.11 |
| ESM-IF | 0.929 | 1.361 | 0.673 | 49.40 | 0.870 | 1.713 | 0.604 | 43.99 |
| native-sequence | 0.960 | 1.156 | 1.000 | 100 | 0.918 | 2.014 | 1.000 | 100 |
Method . | TM-score >0.9(87) . | TM-score >0.8(141) . | ||||||
---|---|---|---|---|---|---|---|---|
TM-score . | RMSD . | LDDT . | Recovery . | TM-score . | RMSD . | LDDT . | Recovery . | |
SPDesign | 0.940 | 1.216 | 0.771 | 69.42 | 0.893 | 1.545 | 0.700 | 75.77 |
SPIN-CGNN | 0.943 | 1.127 | 0.721 | 55.72 | 0.889 | 1.411 | 0.671 | 58.10 |
Pifold | 0.919 | 2.102 | 0.666 | 53.56 | 0.863 | 2.988 | 0.612 | 50.81 |
ProteinMPNN | 0.937 | 1.631 | 0.697 | 50.61 | 0.884 | 2.475 | 0.639 | 45.11 |
ESM-IF | 0.929 | 1.361 | 0.673 | 49.40 | 0.870 | 1.713 | 0.604 | 43.99 |
native-sequence | 0.960 | 1.156 | 1.000 | 100 | 0.918 | 2.014 | 1.000 | 100 |
The ‘native-sequence’ signifies the reference performance using native sequence input. The best results except ‘native-sequence’ are marked in bold.
Method . | TM-score >0.9(87) . | TM-score >0.8(141) . | ||||||
---|---|---|---|---|---|---|---|---|
TM-score . | RMSD . | LDDT . | Recovery . | TM-score . | RMSD . | LDDT . | Recovery . | |
SPDesign | 0.940 | 1.216 | 0.771 | 69.42 | 0.893 | 1.545 | 0.700 | 75.77 |
SPIN-CGNN | 0.943 | 1.127 | 0.721 | 55.72 | 0.889 | 1.411 | 0.671 | 58.10 |
Pifold | 0.919 | 2.102 | 0.666 | 53.56 | 0.863 | 2.988 | 0.612 | 50.81 |
ProteinMPNN | 0.937 | 1.631 | 0.697 | 50.61 | 0.884 | 2.475 | 0.639 | 45.11 |
ESM-IF | 0.929 | 1.361 | 0.673 | 49.40 | 0.870 | 1.713 | 0.604 | 43.99 |
native-sequence | 0.960 | 1.156 | 1.000 | 100 | 0.918 | 2.014 | 1.000 | 100 |
Method . | TM-score >0.9(87) . | TM-score >0.8(141) . | ||||||
---|---|---|---|---|---|---|---|---|
TM-score . | RMSD . | LDDT . | Recovery . | TM-score . | RMSD . | LDDT . | Recovery . | |
SPDesign | 0.940 | 1.216 | 0.771 | 69.42 | 0.893 | 1.545 | 0.700 | 75.77 |
SPIN-CGNN | 0.943 | 1.127 | 0.721 | 55.72 | 0.889 | 1.411 | 0.671 | 58.10 |
Pifold | 0.919 | 2.102 | 0.666 | 53.56 | 0.863 | 2.988 | 0.612 | 50.81 |
ProteinMPNN | 0.937 | 1.631 | 0.697 | 50.61 | 0.884 | 2.475 | 0.639 | 45.11 |
ESM-IF | 0.929 | 1.361 | 0.673 | 49.40 | 0.870 | 1.713 | 0.604 | 43.99 |
native-sequence | 0.960 | 1.156 | 1.000 | 100 | 0.918 | 2.014 | 1.000 | 100 |
The ‘native-sequence’ signifies the reference performance using native sequence input. The best results except ‘native-sequence’ are marked in bold.
The results show that SPDesign achieves the best sequence recovery and LDDT, while its TM-score and RMSD are second only to SPIN-CGNN. Furthermore, we found that improving the sequence recovery rate has a much greater impact on LDDT than on TM-score; we therefore conjecture that higher sequence recovery rates help to establish more stable local spatial structures. Examples of targets predicted by ESMFold are shown in Supplementary Figure S8.
Finally, we tested the performance of the method on hallucinated proteins and on proteins generated by a recent diffusion model [40] that can generate proteins of different lengths. Results in Supplementary Table S9 show that our method also performs competitively on these entirely new proteins.
Ablation study
To evaluate the impact of individual features on SPDesign, we conducted ablation experiments on the CATH 4.2 test set covering three features: backbone atoms distance, pre-trained knowledge of structure and the structural sequence profile. The results are presented in Table 3 and Supplementary Table S10.
Table 3: The impact of features on perplexity and sequence recovery rate on the CATH 4.2 test set.

| Feature | SPDesign1 | SPDesign2 | SPDesign3 | SPDesign |
|---|---|---|---|---|
| Backbone atoms distance | √ | √ | √ | √ |
| Pre-trained knowledge of structure | ✕ | √ | ✕ | √ |
| Structural sequence profile | ✕ | ✕ | √ | √ |
| Perplexity | 5.36 | 4.96 | 3.51 | 2.43 |
| Recovery (%) | 44.60 | 48.81 | 58.76 | 67.05 |
SPDesign1 contains only the backbone atoms distance feature, while SPDesign2 adds the pre-trained knowledge of structure feature on top of SPDesign1, resulting in a 4.21% increase in recovery rate. This result is expected, because the pre-trained knowledge of structure provides a large amount of latent knowledge about the target geometry. Compared with SPDesign1, SPDesign3 adds the structural sequence profile feature, which significantly improves the performance of our method by 14.16%. A likely reason for this encouraging improvement is that the sequence profile provides rich sequence information from similar structures.
Further analysis of the improvement brought by each feature reveals an interesting phenomenon: combining the structural sequence profile with the pre-trained knowledge of structure improves prediction performance by 22.25%, which is more than the sum of their individual improvements. This favorable result may be because the two features are complementary, adequately characterizing the backbone structure and providing rich geometric information.
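For reference, the perplexity reported in Table 3 can be computed, under its standard definition for sequence design, as the exponential of the average negative log-probability that the model assigns to the native residue at each position (the paper's exact computation may differ in detail):

```python
import math

def perplexity(per_residue_probs):
    """Perplexity = exp(mean negative log-probability of native residues).

    per_residue_probs: probability the model assigns to the native amino
    acid at each position of the target sequence.
    """
    nll = [-math.log(p) for p in per_residue_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is uniform over the 20 amino acids assigns p = 0.05
# everywhere, giving a perplexity of 20 — the uninformative baseline.
print(perplexity([0.05] * 10))  # -> 20.0
```

Lower perplexity means the model concentrates probability on the native residues, which is why it tracks recovery rate in Table 3.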
Case study
Some studies have shown that different protein sequences may fold into similar structures and perform distinct biological functions. To investigate whether SPDesign tends to generate diverse protein sequences when the target proteins are structurally similar, we selected four proteins (PDB 3ney, 1kgd, 3wp0, 1ex6) that are not included in the training set for detailed analysis, as shown in Figure 5A. The average pairwise TM-score between these four proteins is 0.82, and the average sequence identity is 33.1%. Specifically, 3ney is a membrane protein primarily engaged in forming ternary complexes crucial for preserving the normal shape of red blood cells; 1kgd is a structural protein that participates in intracellular and transcriptional regulation; 3wp0 acts as a peptide-binding protein, making it a potential starting point for designing specific Dlg inhibitors targeting Scribble complex formation; and 1ex6 functions as a transferase, catalyzing the reversible phosphoryl transfer from ATP to GMP [41–44]. Furthermore, we use the recovery of conserved amino acids as a metric to assess the method's ability, as these conserved residues are crucial for the function of proteins.

Case study. (A) Structure of four proteins (PDB 3ney, 1kgd, 3wp0, 1ex6). The TM-score and sequence identity between pairs of these four proteins are as follows: 3ney-1kgd: 0.88, 55.2%; 3ney-3wp0: 0.77, 27.5%; 3ney-1ex6: 0.76, 31.2%; 1kgd-3wp0: 0.86, 26.5%; 1kgd-1ex6: 0.83, 26.7%; 3wp0-1ex6: 0.82, 31.5%. (B) Sequence alignment results between sequences designed by SPDesign (Designed seq) and the native sequences (marked by protein name). Conserved amino acids on the fragment sequence are underlined. Fragment-recovery represents the sequence recovery rate of SPDesign on the fragment sequence and conserved-recovery indicates the recovery rate on the conserved amino acids of the fragment. The average sequence identity between pairs of these four fragment sequences designed by SPDesign is 32.7%. (C) Alignment results between the structures predicted by AlphaFold2 using the designed sequences and the native structures.
We further extracted conserved structural fragments of these proteins; Figure 5B shows the alignments between the fragment sequences designed by SPDesign and the corresponding native sequences. SPDesign performs very well on the overall fragment sequences, with an average recovery of 80%, while the average pairwise sequence identity among the four designed fragment sequences is 32.7%. These results suggest that SPDesign can design diverse sequences for proteins with similar structures, which may be attributed to the sequence profile being derived from proteins with similar structures but different sequences; in other words, SPDesign captures the intrinsic sequence-structure mapping. In addition, SPDesign shows encouraging performance in recovering conserved residues, which may indicate that our method captures the crucial conserved features of protein structure and function and is thus beneficial for maintaining the biological activity of proteins. The results of these four proteins designed by other methods are shown in Supplementary Figures S9 and S10.
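The fragment-recovery and conserved-recovery metrics above are both instances of one computation: the match fraction over either all positions or a chosen subset. A minimal sketch (function and example sequences are illustrative):

```python
def recovery(designed, native, positions=None):
    """Fraction of positions where the designed residue matches the native one.

    positions: optional list of indices (e.g. conserved sites); if omitted,
    the recovery is computed over the full aligned sequence.
    """
    idx = list(range(len(native))) if positions is None else list(positions)
    hits = sum(designed[i] == native[i] for i in idx)
    return hits / len(idx)

# Toy example: match at positions 0 and 1, mismatch at position 2.
full = recovery("MKV", "MKI")            # fragment-recovery: 2/3
conserved = recovery("MKV", "MKI", [0, 1])  # conserved-recovery: 1.0
```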
Finally, we employed AlphaFold2 to model the sequences designed by SPDesign and present the alignments between the native and predicted structures in Figure 5C. The structural recovery is better for 1kgd, 3wp0 and 3ney, and relatively poorer for 1ex6. For 1ex6, the conserved structural fragment is close to the native structure, but the orientation of the global structure is biased, which may be attributed to two factors: current modeling techniques are not fully accurate in predicting protein structures, and the conserved residues in the loop regions were not accurately designed.
CONCLUSION
In this work, we develop SPDesign, a method that combines a structural sequence profile with pre-trained language models for protein sequence design. SPDesign uses an enhanced ultrafast shape recognition algorithm to speed up the search for structural analogs and then extracts the sequence pattern information of those analogs to help the network design reasonable sequences. Experiments show that SPDesign significantly outperforms other methods on well-known benchmarks (CATH 4.2 test set: 67.05%, TS50: 68.64%, TS500: 71.63%). The PDBench tool was used to analyze SPDesign more comprehensively; the results show that SPDesign performs well across a variety of major structural classes and that the designed sequences are more consistent with the principles of natural evolution. Moreover, we modeled the designed sequences with a mainstream structure prediction method, and the experiments showed that the sequences designed by SPDesign fold into more reasonable structures.
In the case study, we found that for some target proteins with similar structures but different sequences, SPDesign can design diverse sequences while preserving the conserved sites of these proteins. In the subsequent modeling experiment, we speculated that failure to correctly recover some key conserved residues may affect how accurately the designed sequences fold; improving the recovery rate of conserved amino acids is therefore an important topic for future research.
Key Points

- We propose a novel conception for extracting sequences from matched regions in structural analogs to create the structural sequence profile.
- We propose SPDesign, a protein sequence design method that combines structural sequence profile and pre-trained language models.
- We propose an enhanced ultrafast shape recognition algorithm that encodes protein shapes into USR-V vectors, assisting the search process for structural analogs.
- Experiments suggest that SPDesign achieves state-of-the-art performance across various benchmarks and has wide applicability.
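The USR-V vectors mentioned above build on the classic ultrafast shape recognition descriptor of Ballester and Richards, which summarizes a point cloud by the first three moments of its distance distributions to four reference points. The sketch below implements only that classic descriptor; SPDesign's enhanced USR-V variant differs in details not reproduced here.

```python
import numpy as np

def usr_descriptor(coords):
    """Classic 12-dimensional USR shape descriptor.

    For four reference points — the centroid (ctd), the atom closest to it
    (cst), the atom farthest from it (fct), and the atom farthest from fct
    (ftf) — record the mean, standard deviation and cube-root skewness of
    the distances from every atom to that reference point.
    """
    coords = np.asarray(coords, dtype=float)
    ctd = coords.mean(axis=0)
    d_ctd = np.linalg.norm(coords - ctd, axis=1)
    cst = coords[d_ctd.argmin()]
    fct = coords[d_ctd.argmax()]
    ftf = coords[np.linalg.norm(coords - fct, axis=1).argmax()]
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(coords - ref, axis=1)
        mu, sd = d.mean(), d.std()
        desc += [mu, sd, np.cbrt(((d - mu) ** 3).mean())]
    return np.array(desc)

def usr_similarity(a, b):
    """USR similarity in (0, 1]; 1 means identical descriptors."""
    return 1.0 / (1.0 + np.abs(a - b).mean())
```

Because the descriptor depends only on intramolecular distances, it is invariant to rotation and translation, which is what makes comparing the fixed-length vectors so much faster than structural alignment.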
FUNDING
This work is supported by the National Science and Technology Major Project (2022ZD0115103), the National Natural Science Foundation of China (62173304) and the Key Project of Zhejiang Provincial Natural Science Foundation of China (LZ20F030002).
DATA AVAILABILITY
All data needed to evaluate the conclusions are present in the paper and the Supplementary Materials. The additional data and code related to this paper can be downloaded from our web server (http://zhanglab-bioinf.com/SPDesign/).
Author Biographies
Hui Wang is a master candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and the development of machine learning models and algorithms.
Dong Liu is a PhD candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and the development of machine learning models and algorithms.
Kailong Zhao is a PhD candidate in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and optimization theory.
Yajun Wang is a professor in the College of Pharmaceutical Science, Zhejiang University of Technology. His research interests include biopharmaceuticals, biocatalysis, enzyme engineering and microbial fermentation.
Guijun Zhang is a professor in the College of Information Engineering, Zhejiang University of Technology. His research interests include bioinformatics, intelligent information processing and optimization theory.
References
Author notes
Hui Wang and Dong Liu contributed equally to this work.