Jiancheng Zhong, Zhiwei Zou, Jie Qiu, Shaokai Wang, ScFold: a GNN-based model for efficient inverse folding of short-chain proteins via spatial reduction, Briefings in Bioinformatics, Volume 26, Issue 2, March 2025, bbaf156, https://doi.org/10.1093/bib/bbaf156
Abstract
In the realm of protein design, the efficient construction of protein sequences that accurately fold into predefined structures has become an important area of research. Although advancements have been made in the study of long-chain proteins, the design of short-chain proteins deserves equal attention. The structural information inherent in short and single chains is typically less comprehensive than that of full-length chains, which can degrade model performance. To address this challenge, we introduce ScFold, a novel model that incorporates an innovative node module. This module utilizes spatial dimensionality reduction and positional encoding mechanisms to enhance the extraction of structural features. Experimental results indicate that ScFold achieves a recovery rate of 52.22% on the CATH4.2 dataset and is notably effective for short-chain proteins, with a recovery rate of 41.6%. ScFold further exhibits enhanced recovery rates of 59.32% and 61.59% on the TS50 and TS500 datasets, respectively, demonstrating its effectiveness across diverse protein types. Additionally, we performed protein length stratification on the TS500 and CATH4.2 datasets and tested ScFold on the length-specific sub-datasets; the results confirm the model's superiority in handling short-chain proteins. Finally, we selected several protein sequence groups from the CATH4.2 dataset for structural visualization analysis and compared the model-generated sequences with the target sequences.
Introduction
Proteins play crucial roles in numerous biological functions. They participate in regulating chemical reactions within the body, aiding in cellular repair and tissue regeneration, and contributing to immune defenses against pathogens. Understanding the relationship between protein structure and sequence is crucial for elucidating biological processes, developing disease therapies, and advancing drug discovery.
In recent years, the rapid advancement of artificial intelligence has led to significant progress in many fields [1, 2], and protein design has likewise seen revolutionary progress. Driven in particular by deep learning, researchers have begun to explore how machine learning models can solve the protein inverse folding problem, i.e. predicting the amino acid sequence from the known three-dimensional structure of a protein. For instance, Liu et al. [3] introduced a rotamer-free approach for protein sequence design based on deep learning and self-consistency. McPartlon et al. [4] developed a deep SE(3)-equivariant model to learn inverse protein folding. Additionally, Huang et al. [5] achieved accurate and efficient protein sequence design by learning the local environment of residues, while Dumortier et al. [6] presented Petribert, which augments the BERT model with tridimensional encoding for inverse folding and design. In the realm of structure-based protein design, Li et al. [7] proposed neural network-derived Potts models using backbone atomic coordinates and tertiary motifs. Maguire et al. [8] introduced Xenet, employing a new graph convolution approach to accelerate protein design on quantum computers. Furthermore, Li et al. [9] developed Terminator, a neural framework for structure-based protein design that utilizes tertiary repeating motifs. Song and Li [10] applied importance-weighted expectation-maximization to protein sequence design. In the broader application of deep learning, Madani et al. [11] demonstrated the capabilities of large language models in generating functional protein sequences, while Watson et al. [12] integrated structure prediction networks and diffusion generative models to improve the accuracy and applicability of protein design. Wu et al. [13] discussed the use of deep generative models for protein sequence design, and Pearce and Zhang [14] highlighted the significant impact of deep learning techniques on protein structure prediction and design. Ovchinnikov and Huang [15] examined the integration of structure-based protein design with deep learning approaches, while Ding et al. [16] provided an overview of deep learning methods for protein design. Gao et al. [17] delved into the application of deep learning in protein structural modeling and design. As research on deep generative models continues to progress, Strokach and Kim [18] demonstrated how deep generative modeling can facilitate protein design. Additionally, Li and Koehl [19] introduced three-dimensional representations of amino acids for protein sequence comparison and classification, and Greener et al. [20] employed variational autoencoders for the design of metalloproteins and novel protein folds. Among newer methodologies, Karimi et al. [21] proposed a guided conditional Wasserstein generative adversarial network, while Anishchenko et al. [22] achieved de novo protein design through deep network hallucination. Cao et al. [23] developed Fold2seq, a generative model based on joint sequence (1D) and fold (3D) embeddings for protein design. Among these approaches, GNN models have made significant progress, but there is still room for improvement. Natural proteins are composed of 20 types of amino acids, and most protein design models are built on one of the following three architectures:
Models based on multilayer perceptron (MLP): the MLP comprises interconnected nodes, or neurons, organized into multiple layers. In the context of protein design, an MLP is employed to predict the amino acid type for each residue within a protein sequence. The model takes all residues as input and outputs the probability of each residue corresponding to one of the 20 amino acids. The primary distinction among MLP-based methods lies in their approaches to constructing effective input features. For instance, SPIN [24] integrates torsion angles, fragment-derived sequence profiles, and structure-derived energy profiles to enhance prediction accuracy by considering various structural features. Building on SPIN, SPIN2 [25] introduced additional features, including backbone angles, local contact numbers, and neighborhood distances, improving the recovery rate from 30% to 34% [26]. Wang et al. [26] incorporated backbone dihedrals, backbone atoms ($C\alpha$, N, C, and O), secondary structure types (helices, sheets, loops), $C\alpha$-$C\alpha$ distances, and unit directional vectors of $C\alpha$-N and $C\alpha$-C as input features, achieving a recovery rate of 33.0% across 50 test proteins. The MLP method is characterized by its rapid inference speed, allowing for swift predictions of amino acid types in protein sequences. This capability is particularly advantageous for protein engineering, which often necessitates the rapid screening of extensive sequence datasets. However, despite their efficiency, MLP models in protein design exhibit limited prediction accuracy, primarily because they cannot fully capture protein structural information.
Models based on convolutional neural network (CNN): CNNs are a specialized deep learning architecture that employs convolutional layers to autonomously and effectively extract local features from various data types, including images. In the realm of protein design, CNN is utilized to derive residue features from protein structures, aiding in the prediction of amino acid types. The SPROF method [27] analyzes a protein's distance matrix to extract structural features, achieving a recovery rate of 39.8%. ProDCoNN [28] incorporates a nine-layer 3D CNN with multi-scale convolutional kernels for residue prediction, resulting in a recovery rate of 42.2%. Additionally, DenseCPD [29] leverages the DenseNet architecture, enhancing feature utilization through densely connected convolutional networks and achieving a recovery rate of 55.53%. While CNN-based methods [30] have shown improvements in recovery rates compared with MLP models, the complexity of the convolutional processes often leads to slower inference speeds. Consequently, each approach has distinct advantages and limitations. Furthermore, the lack of open-source availability for many CNN-based models presents challenges [31] for researchers aiming to replicate or extend these methodologies.
Models based on graph neural network (GNN): GNN models [32, 33] have emerged as a powerful tool in protein design, offering distinct advantages. These models represent the three-dimensional structures of proteins as K-NN graphs, effectively capturing the spatial relationships between residues. In these graphs, each residue is encoded as a node vector, with interactions between adjacent residues represented through edges and edge features. In recent years, GNN models for protein design have seen notable improvements in three key areas: data feature representation, model complexity and performance, and dataset scalability. For example, GraphTrans [34] employs a graph attention encoder alongside an autoregressive decoder, facilitating both sequence generation and structure prediction in protein design; its graph-based message passing achieves a recovery rate of 35.82%. GCA [35] enhances the model's ability to capture comprehensive structural information by integrating global attention mechanisms with local ones, achieving a recovery rate of 37.64%, with a performance of 31.1% for single-chain predictions. Similarly, GVP [36] introduces a geometric vector perceptron that accounts for both spatial and logical relationships, resulting in a recovery rate of 39.47%. However, its performance on single chains declined, which may be because GCA more effectively leverages residue features in single-chain contexts. ESM-IF [37] enhanced the model's learning capacity by incorporating additional training data, resulting in a recovery rate of 38.5% for single chains, while also improving performance on full chains. ProteinSolver [38] is specifically designed for protein design when partial sequences are known, although its performance on standard datasets remains unvalidated. Recent models, including AlphaDesign [39], ProteinMPNN [40], PiFold [31], and LM-design [41], have made significant strides by introducing new features and enhancements over previous iterations, leading to marked improvements in recovery rates. Compared with earlier methods, GNN [42] applications in protein design excel in managing complex biomolecular structures. By adeptly capturing and processing the spatial relationships and residue features within proteins, GNN [43] models offer precise predictions for protein structure and design. This balance of strengths positions GNN as a compelling choice in the field of protein design.
In GNNs, to effectively extract features from the three-dimensional (3D) structure of proteins, the first step is to select an appropriate method to convert the protein 3D structure into a graph structure. Common protein graph representation methods include geometry-based graph construction [44], topology-based graph construction [45], subgraph-based graph construction [46], and K-NN based graph construction. GraphTrans employs a graph construction method that combines the K-NN approach with geometric methods. By constructing graphs using K-NN while simultaneously extracting geometric information, this strategy has been widely adopted in the field of protein design to this day.
The importance of full-length proteins is evident; however, short-chain proteins and single-chain proteins also play crucial roles in various biological processes, serving as hormones, antibodies, enzymes, and signaling molecules [47]. Nonetheless, the structural information of short and single chains is inherently more limited compared with full-length chains, which may degrade model performance on these shorter chains. For instance, PiFold's recovery rate on full-length chains is 11.76% higher than on short chains. Therefore, it is essential to explore strategies that enhance performance on single-chain and short-chain proteins while also improving performance on full-length chains.
We present the ScFold model, a novel architecture designed to enhance protein structure prediction through the integration of a node module, an edge module, and a global module. The node module employs a parallel attention mechanism that utilizes two distinct attention strategies to learn node features concurrently while incorporating residual networks to preserve original feature representations. Additionally, spatial reduction techniques minimize the spatial dimensions of features, allowing the model to focus on critical local areas within the protein structure. This approach is particularly beneficial for short and single chains, which often possess limited structural information. Positional encoding further enriches the model's understanding by providing contextual information and capturing long-range dependencies within the sequence. The edge and global modules, based on MLPs, complement the node module by facilitating a more comprehensive learning of residue features, thereby enhancing overall model performance. Experimental results demonstrate that ScFold outperforms existing graph models across multiple datasets, achieving an accuracy of 52.22% on the CATH4.2 dataset, with both short and single chains reaching 40% accuracy for the first time. Furthermore, ScFold leads in performance on single chains, with a recovery rate exceeding that of PiFold by over 3%. On the TS50 and TS500 datasets, the model also attained accuracies of 59.32% and 61.59%, respectively. Ablation studies reveal the critical role of the innovative node module in enhancing the model's ability to learn node features, which directly correlates with improvements in recovery rates. Overall, the ScFold model represents an advancement in protein design, particularly for short and single-chain proteins, providing valuable insights for future research and applications.
Methods
Overall framework
Figure 1 illustrates the overall architecture of the ScFold framework. Panel (a) shows the feature extraction process, where the input is the three-dimensional protein structure. A graph is constructed using the K-NN method, and distances, angles, and directional information are extracted from the protein's 3D structure to serve as features for nodes and edges. Panel (b) depicts the model architecture, with the node module employing a parallel attention mechanism. Results from the two attention mechanisms are concatenated and fed into a GNN composed of edge and global modules for further processing. Panel (c) details the implementation of the Overlapping Spatial Reduction (OSR) attention mechanism, which is based on the QKV attention mechanism. The query (Q) is generated from the input through linear layers, while the key (K) and value (V) are derived from the input via spatial dimensionality reduction. The attention matrix is constructed from Q and K, positional encoding information is integrated into it, and the result is passed through an activation function before being combined with V to produce the output of the node module. Through this approach, ScFold enables multi-scale learning of residue features and effectively generates the corresponding amino acid sequences from the provided structural information.

ScFold method
Graph structure: we represent the protein's three-dimensional structure as a K-NN graph over residues, facilitating the consideration of residue spatial relationships and the calculation of inter-residue distance, direction, and angle features. Here, k is set to a default value of 30. The K-NN graph enables computation of node and edge features. Each residue consists of four main-chain atoms: $C\alpha$, C, N, and O. A local coordinate system is established from these main-chain atoms, leveraging rotation- and translation-invariant features of individual or paired residues to compute inter-residue distance, direction, and angle features.
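To make the graph construction concrete, the following is a minimal sketch (not the authors' code) of k-NN graph construction from $C\alpha$ coordinates, with k = 30 as in the paper; the function name and the numpy-only implementation are our own illustration:

```python
import numpy as np

def knn_graph(ca_coords: np.ndarray, k: int = 30):
    """Build a k-NN edge list from C-alpha coordinates of shape (n, 3).

    Returns (src, dst) arrays: for each edge, node dst attends to neighbour src.
    """
    n = ca_coords.shape[0]
    # Pairwise Euclidean distances between residues
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)      # exclude self-loops
    k_eff = min(k, n - 1)               # short chains may have fewer than k neighbours
    nbrs = np.argsort(dist, axis=1)[:, :k_eff]
    dst = np.repeat(np.arange(n), k_eff)
    src = nbrs.reshape(-1)
    return src, dst
```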
Distance feature: we employ Gaussian radial basis functions (RBFs) to compute distances between pairs of atoms. For instance, for an atom pair $i$ and $j$, the distance between them is represented as $\mathrm{RBF}(\lVert i - j \rVert)$, where $i \in \{C\alpha, N, C, O\}$ and $j \in \{C\alpha, N, C, O\}$.
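A minimal sketch of the RBF encoding described above; the number of basis functions (16) and the distance range (0-20 Å) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def rbf_encode(d, d_min: float = 0.0, d_max: float = 20.0, n_bins: int = 16):
    """Expand a distance (or array of distances, in Angstroms) into Gaussian RBFs."""
    centers = np.linspace(d_min, d_max, n_bins)
    sigma = (d_max - d_min) / n_bins
    # Each distance becomes an n_bins-dimensional soft one-hot vector
    return np.exp(-((np.asarray(d)[..., None] - centers) / sigma) ** 2)
```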
Angle feature: considering the sequential backbone and local structure between adjacent residues, we utilize dihedral angles $(\alpha_{i}, \beta_{i}, \gamma_{i})$ and torsion angles $(\phi_{i}, \psi_{i}, \omega_{i})$ as node features. For an atom pair $(i, j)$, the angle feature is the quaternion encoding of the relative rotation between their local coordinate systems, expressed as $q_{ij} = q(Q_{i}^{T} Q_{j})$, where $q(\cdot)$ converts a rotation matrix to a quaternion.
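For illustration, a standard rotation-matrix-to-quaternion conversion that could implement $q(\cdot)$; this is textbook numerics, not the authors' implementation:

```python
import numpy as np

def quaternion_from_rotation(R: np.ndarray) -> np.ndarray:
    """Encode a 3x3 rotation matrix (e.g. R = Q_i^T Q_j) as a quaternion (w, x, y, z)."""
    w = np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2])) / 2.0
    x = np.sqrt(max(0.0, 1.0 + R[0, 0] - R[1, 1] - R[2, 2])) / 2.0
    y = np.sqrt(max(0.0, 1.0 - R[0, 0] + R[1, 1] - R[2, 2])) / 2.0
    z = np.sqrt(max(0.0, 1.0 - R[0, 0] - R[1, 1] + R[2, 2])) / 2.0
    # Recover the signs of the vector part from off-diagonal differences
    x = np.copysign(x, R[2, 1] - R[1, 2])
    y = np.copysign(y, R[0, 2] - R[2, 0])
    z = np.copysign(z, R[1, 0] - R[0, 1])
    return np.array([w, x, y, z])
```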
Directional feature: for residue i, we use the direction of each backbone atom relative to $C_{\alpha_{i}}$, expressed in the local coordinate frame, as a node feature. For instance, the direction feature of $N_{i}$ is represented as $Q_{i}^{T}\frac{N_{i} - C_{\alpha_{i}}}{\lVert N_{i} - C_{\alpha_{i}} \rVert}$. Similar computations apply to atom pairs.
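The direction feature reduces to projecting a unit displacement vector into the residue's local frame; a minimal sketch under the same notation ($Q_i$ is the 3x3 local frame of residue i):

```python
import numpy as np

def direction_feature(Q_i: np.ndarray, atom: np.ndarray, ca: np.ndarray) -> np.ndarray:
    """Unit direction from C-alpha to a backbone atom, expressed in local frame Q_i."""
    v = atom - ca
    return Q_i.T @ (v / np.linalg.norm(v))
```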
Node module: we employ two parallel attention mechanisms to learn the input node information separately, add their outputs back to the original features through residual connections to preserve the original information, and then aggregate the outputs of the two attention mechanisms to produce the updated node representations.
Neighborhood node module: we use a simplified graph transformer [31] to learn node information, employing an MLP to learn multi-head attention weights. Let $h_{i}^{l}$ denote the feature of node i at layer l and $e_{ji}^{l}$ the feature of the edge from node j to node i at layer l; an MLP is used to learn from these features. The attention weight $a_{ji}$ is calculated as

$$w_{ji} = \mathrm{AttMLP}\left(h_{j}^{l} \parallel e_{ji}^{l} \parallel h_{i}^{l}\right), \qquad a_{ji} = \frac{\exp\left(w_{ji}\right)}{\sum_{k \in \mathcal{N}_{i}} \exp\left(w_{ki}\right)}$$
Here, $\mathcal{N}_{i}$ denotes the neighborhood of node i, and $\parallel$ represents the concatenation operation. Node features are updated as follows:

$$h_{i}^{l+1} = \sum_{j \in \mathcal{N}_{i}} a_{ji}\, \mathrm{NodeMLP}\left(e_{ji}^{l} \parallel h_{j}^{l}\right)$$
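As a hedged illustration of this neighborhood update, the following PyTorch sketch implements the equations above in single-head form (the multi-head case splits the channels); the class and helper names are ours, and it assumes every node has at least one incoming edge, as in a k-NN graph:

```python
import torch
import torch.nn as nn

class NeighborAttention(nn.Module):
    """Single-head sketch of the MLP-based neighborhood attention (cf. PiFold [31])."""
    def __init__(self, d: int):
        super().__init__()
        self.att_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, 1))
        self.node_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, e, src, dst):
        # h: (N, d) node features; e: (E, d) edge features; src/dst: (E,) edge index
        w = self.att_mlp(torch.cat([h[src], e, h[dst]], dim=-1)).squeeze(-1)
        num = torch.exp(w - w.max())                      # stabilized exp(w_ji)
        den = torch.zeros(h.size(0), device=h.device).index_add_(0, dst, num)
        a = num / den[dst]                                # softmax over each neighborhood
        v = self.node_mlp(torch.cat([e, h[src]], dim=-1)) # per-edge messages
        out = torch.zeros_like(h).index_add_(0, dst, a.unsqueeze(-1) * v)
        return h + out                                    # residual connection
```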
OSR node module: to effectively capture both local and global information within protein three-dimensional structures, we integrate spatial reduction and positional encoding into our attention mechanism. Spatial Reduction Attention [48] has been widely employed in prior studies to efficiently extract global information by leveraging sparse token-region relationships. However, nonoverlapping spatial reduction may disrupt the three-dimensional spatial structure of proteins, thereby compromising feature quality. To mitigate this issue, we introduce OSR, which employs larger overlapping blocks to more accurately characterize spatial relationships near block boundaries. In our framework, spatial reduction is combined with multi-head attention: OSR first reduces the spatial dimensions of features, then incorporates the original features through residual connections [49], utilizing batch normalization to enhance feature quality. Multi-head attention captures feature relationships across different subspaces, thereby enhancing the model’s representational capacity. This mechanism directs model attention toward critical local regions of protein structures, effectively capturing local three-dimensional spatial relationships, which are particularly important for short-chain and single-chain structures. Additionally, positional encoding is integrated into the attention mechanism, allowing the model to account for spatial positions when calculating attention weights. This positional encoding offers essential contextual information, improving the model’s understanding of node relative positions within the graph. As a result, the model can better capture the orderliness of sequences and long-range dependencies, which are vital for understanding the overall structure of proteins, especially in short-chain and single-chain contexts. The enhancement of global information is particularly critical for improving prediction accuracy.
First, the input $\mathbf{X} = h_{v} \in \mathbb{R}^{T \times W}$ is processed through the OSR module to obtain spatially reduced features $\mathbf{Y}$. A local refinement (LR) module, implemented by a $3 \times 3$ depthwise convolution, then splits $\mathbf{Y}$ into the key $\mathbf{K}$ and value $\mathbf{V}$, while the query $\mathbf{Q}$ is derived from the input through a simple linear transformation. Finally, softmax attention is performed on $(\mathbf{Q}, \mathbf{K}, \mathbf{V})$. Here, $\mathbf{Z}$ denotes the attention output and $\mathbf{B}$ the relative position bias, which encodes the spatial correlation of the attention map. Node features are updated as

$$\mathbf{Z} = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}} + \mathbf{B}\right)\mathbf{V}$$
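A simplified PyTorch sketch of this OSR attention for a single chain; it is illustrative only: the overlapping reduction is modeled as a strided 1D convolution whose kernel exceeds its stride, the $3 \times 3$ depthwise LR module is replaced by a linear projection, and the relative position bias is simplified to a learned bias table of assumed maximum size:

```python
import torch
import torch.nn as nn

class OSRAttention(nn.Module):
    """Illustrative overlapping spatial-reduction attention (single head, single chain)."""
    def __init__(self, d: int, reduction: int = 2, max_len: int = 512):
        super().__init__()
        self.q = nn.Linear(d, d)
        # Overlapping reduction: kernel_size > stride, so adjacent blocks overlap
        self.osr = nn.Conv1d(d, d, kernel_size=2 * reduction + 1,
                             stride=reduction, padding=reduction)
        self.norm = nn.BatchNorm1d(d)
        self.kv = nn.Linear(d, 2 * d)                 # stand-in for the depthwise-conv LR split
        self.bias = nn.Parameter(torch.zeros(max_len, max_len))  # simplified position bias B
        self.scale = d ** -0.5

    def forward(self, x):                             # x: (T, d) node features, T <= max_len
        q = self.q(x)                                 # queries at full resolution
        y = self.osr(x.t().unsqueeze(0))              # (1, d, ~T/r): spatially reduced features
        y = self.norm(y).squeeze(0).t()               # (~T/r, d)
        k, v = self.kv(y).chunk(2, dim=-1)
        logits = q @ k.t() * self.scale               # (T, ~T/r) attention map
        logits = logits + self.bias[: x.size(0), : k.size(0)]
        z = torch.softmax(logits, dim=-1) @ v         # Z = Softmax(QK^T/sqrt(d) + B) V
        return x + z                                  # residual connection
```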
Edge module: the edge module [31] is an MLP-based component whose input consists of node features and edge features. Through linear transformations and nonlinear activation functions, edge features are iteratively updated based on the features of adjacent nodes. We found that the edge module effectively enhances the model's learning capability:

$$e_{ji}^{l+1} = \mathrm{EdgeMLP}\left(h_{j}^{l} \parallel e_{ji}^{l} \parallel h_{i}^{l}\right)$$
Here, $h_{j}^{l}$ represents the feature of node j at layer l, and $e_{ji}^{l+1}$ represents the updated feature of the edge from node j to node i at layer l+1. EdgeMLP is an MLP used to compute the edge features.
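A minimal sketch of this edge update under the same notation (names ours, residual connection assumed):

```python
import torch
import torch.nn as nn

class EdgeUpdate(nn.Module):
    """Edge update e_ji <- EdgeMLP(h_j || e_ji || h_i), with a residual connection."""
    def __init__(self, d: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, e, src, dst):
        # h: (N, d) node features; e: (E, d) edge features; src/dst: (E,) edge index
        return e + self.edge_mlp(torch.cat([h[src], e, h[dst]], dim=-1))
```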
Global module: while learning local structural information is essential for accurately characterizing residue features [34, 36], capturing global information is equally critical [35]. However, the time complexity of global attention is proportional to the square of the protein length, which necessitates longer training times. Thus, it is imperative to strike a balance between recovery rate and computational efficiency. To address this challenge, the global module introduces a global vector for each protein. The specific implementation is as follows:

$$c_{i} = \mathrm{Mean}\left(\left\{ h_{j}^{l} : j \in \mathcal{B}_{i} \right\}\right), \qquad h_{i}^{l+1} = h_{i}^{l} \odot \sigma\left(\mathrm{GateMLP}\left(c_{i}\right)\right)$$
where $\mathcal{B}_{i}$ is the index set of residues belonging to the same protein as residue i, $\odot$ is the element-wise product, and $\sigma(\cdot)$ is the sigmoid function. GateMLP consists of four layers: three hidden layers with ReLU activations and one output layer with a sigmoid activation. The computational cost of this global context attention is linear in the number of residues.
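A sketch of the gated global context under the stated four-layer GateMLP structure; the batch-index convention and hidden sizes are our assumptions:

```python
import torch
import torch.nn as nn

class GlobalGate(nn.Module):
    """Per-protein global vector gating each residue; cost is linear in residue count."""
    def __init__(self, d: int):
        super().__init__()
        self.gate_mlp = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),     # three hidden layers with ReLU
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.Sigmoid())  # sigmoid output layer

    def forward(self, h, batch):
        # h: (N, d) residue features; batch: (N,) protein id of each residue
        n_prot = int(batch.max()) + 1
        sums = torch.zeros(n_prot, h.size(1), device=h.device).index_add_(0, batch, h)
        ones = torch.ones(batch.size(0), device=h.device)
        counts = torch.zeros(n_prot, device=h.device).index_add_(0, batch, ones)
        c = sums / counts.unsqueeze(-1)     # per-protein mean: the global vector c_i
        return h * self.gate_mlp(c)[batch]  # element-wise gating h_i * sigma(GateMLP(c_i))
```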
Experiment
Dataset
We utilized the CATH4.2 dataset [50] to assess our method's performance on the protein inverse folding task. This dataset consists of complete chains with lengths up to 500 residues. Following the data partitioning methodology employed in GraphTrans, GVP, and PiFold, we divided the entire CATH4.2 dataset into training, validation, and test sets. Specifically, the training set comprises 18 024 proteins, the validation set includes 608 proteins, and the test set consists of 1120 proteins. Additionally, we evaluated the performance of various methods on short chains and single chains, where short chains are defined as those with lengths of 100 residues or fewer, and single chains refer to proteins identified as single-chain entities in the protein database. To further examine the model's generalization capabilities, we included the TS50 and TS500 datasets in our evaluation. TS50 contains 50 proteins, while TS500 extends this to 500 proteins, enabling researchers to assess the algorithm's applicability and accuracy across a diverse array of protein sequences. To maintain the integrity of our evaluation, we ensured that there was no overlap between the training and validation sets and the test set, thereby safeguarding the robustness of our performance metrics.
Measurement
Perplexity [35]: to evaluate the quality of the predicted protein sequences from a natural language processing perspective, we employed perplexity as a metric. In this context, perplexity is calculated from the probability distribution over amino acids in the predicted sequences, effectively capturing the model's uncertainty about its predictions. A lower perplexity score indicates a more coherent sequence, suggesting that the predicted sequences are more plausible. Perplexity thus provides insight into the quality and reliability of our predictions, allowing for a comprehensive evaluation of the model's performance in generating biologically relevant protein sequences.
$$\mathrm{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(\mathcal{S}_{i}^{N} \mid \mathcal{X}_{i}^{N}\right)\right)$$

where $(\mathcal{S}^{N}, \mathcal{X}^{N})$ is the sequence-structure pair of a protein with N amino acids, $\mathcal{S}_{i}^{N}$ and $\mathcal{X}_{i}^{N}$ denote the $i$th amino acid in the sequence and structure, respectively, and $p\left(\mathcal{S}_{i}^{N} \mid \mathcal{X}_{i}^{N}\right)$ is the output probability from the model.
Recovery rate [35]: to assess the prediction accuracy of protein sequences at the residue level, we considered the recovery rate. This metric quantifies the model's ability to correctly predict individual amino acid residues and is calculated from the correspondence between the predicted residues and the corresponding residues in the true sequences. A higher recovery rate indicates that the model performs well in identifying and predicting the amino acids in protein sequences, thereby enhancing our understanding of protein structure. By analyzing the recovery rate, we can comprehensively evaluate the model's performance and provide a basis for further optimization of predictions.
$$\mathrm{Recovery} = \frac{1}{|\mathcal{D}|}\sum_{(\mathcal{S}^{N}, \mathcal{X}^{N}) \in \mathcal{D}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\mathcal{S}_{i}^{N} = \arg\max_{s}\, p\left(s \mid \mathcal{X}_{i}^{N}\right)\right]$$

where $\mathcal{D}$ denotes the whole dataset.
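Both metrics follow directly from the model's per-residue output distribution; a minimal sketch for a single protein (function name ours):

```python
import torch
import torch.nn.functional as F

def perplexity_and_recovery(logits: torch.Tensor, target: torch.Tensor):
    """logits: (N, 20) per-residue amino-acid scores; target: (N,) true residue labels."""
    nll = F.cross_entropy(logits, target)      # mean negative log-likelihood
    perplexity = torch.exp(nll).item()         # exp of the average NLL
    recovery = (logits.argmax(dim=-1) == target).float().mean().item()
    return perplexity, recovery
```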
Baseline
We constructed a ScFold model with a hidden dimension of 128, trained for 100 epochs by default. The Adam optimizer was employed on an NVIDIA A800, with a batch size and learning rate set to 8 and 0.001, respectively. The dimensions for spatial dimensionality reduction and relative positional encoding were set to 2 and 128, respectively, with a dropout rate of 0.1. We compared ScFold with recent graph models on the CATH4.2 dataset, including StructGNN, StructTrans, GCA, GVP, GVP-large, AlphaDesign, ESM-IF, ProteinMPNN, LM-design, and PiFold, most of which are open-source. For a fair comparison, we applied the same data splits used for StructGNN, StructTrans, GCA, GVP, AlphaDesign, LM-design, and ProteinMPNN on the CATH4.2 dataset. The results obtained were largely consistent with those reported in their respective papers. It is noteworthy that ESM-IF was trained on CATH4.3, which may lead to incomparable recovery rates and perplexity due to variations in the dataset. Models based on MLP and CNN were excluded from the comparison as they utilized datasets other than CATH4.2.
As shown in Table 1, among these graph models, ScFold has demonstrated exceptional performance on both short chains and single chains, achieving three optimal and two suboptimal scores across six metrics. Several methods exhibited significant decreases in recovery rates for short chains and single chains due to their limited structural features. In contrast, ScFold's performance remained stable. While LM-design achieves the highest recovery rate in single-chain scenarios, it exhibits significant declines in the other three metrics. Notably, ScFold is the only graph model to surpass a 40% recovery rate while maintaining perplexity below 6 in both short-chain and single-chain contexts. This exceptional performance is attributed to ScFold's innovative node module, which effectively leverages limited structural features. The introduced positional encoding mechanism enhances contextual information, assisting the model in learning the relative positions and structures of nodes within the graph. The node modules proposed by ScFold effectively explore both local and global features, compensating for the limitations inherent in short-chain and single-chain structures. This comprehensive approach results in optimal recovery rates and perplexity for single chains and short chains.
Table 1. Perplexity and recovery (%) of each model on the CATH4.2 dataset, reported separately for short chains, single chains, and all chains.

| Model | Perplexity (Short) | Perplexity (Single) | Perplexity (All) | Recovery % (Short) | Recovery % (Single) | Recovery % (All) |
|---|---|---|---|---|---|---|
| StructGNN [35] | 8.29 | 8.74 | 6.40 | 29.44 | 28.26 | 35.91 |
| GraphTrans [34] | 8.39 | 8.83 | 6.63 | 28.14 | 28.46 | 35.82 |
| GCA [35] | 7.09 | 7.49 | 6.05 | 32.62 | 31.10 | 37.64 |
| GVP [36] | 7.23 | 7.84 | 5.36 | 30.60 | 28.95 | 39.47 |
| GVP-large [37] | 7.68 | 6.12 | 6.17 | 32.6 | 39.4 | 39.2 |
| AlphaDesign [39] | 7.32 | 7.63 | 6.30 | 34.16 | 32.66 | 41.31 |
| ESM-IF [37] | 8.18 | 6.33 | 6.44 | 31.3 | 38.5 | 38.3 |
| ProteinMPNN [40] | 6.21 | 6.68 | 4.61 | 36.35 | 34.43 | 45.96 |
| PiFold [31] | 6.04 | 6.31 | 4.55 | 39.84 | 38.53 | 51.66 |
| LM-design [41] | 6.77 | 6.46 | 4.52 | 37.88 | 42.47 | 55.65 |
| ScFold (ours) | 5.80 | 5.99 | 4.61 | 41.60 | 40.10 | 52.22 |
Feature type ablation
We conducted systematic experiments to evaluate the effectiveness of the node module. We compared the node module with GCN [51], GAT, QKV-based attention layers, and the AttMLP layer from PiFold. As shown in Table 2, we found that our proposed node module outperformed PiFold by 0.6%, with improvements of 1.48% and 0.74% observed for the edge module and global module, respectively.
Based on the ablation study results presented in Table 2, we conducted a detailed comparison of various node, edge, and global modules to determine their impact on model performance, measured through perplexity and recovery scores. Each module was tested individually and in combination with others to examine their contributions.
The node module was evaluated using several established architectures: GCN, GAT, QKV, and AttMLP. ScFold, which integrates the OSR layer, yielded a baseline recovery score of 52.22% with a low perplexity of 4.61. When using the GCN-based node module in Model1, recovery dropped slightly to 49.65%, accompanied by a slight increase in perplexity to 4.74. Model2, which integrated the GAT layer, further decreased recovery to 45.78% and increased perplexity to 4.96. These results suggest that while GCN and GAT have strengths in other domains, they may not fully capture the dependencies required for our specific task. On the other hand, the AttMLP attention layer implemented in Model4 provided an improvement in recovery and reduced perplexity to 4.55, approaching the effectiveness of ScFold.
The edge and global modules also contributed substantially to overall performance. The Edgeupdate layer improved recovery scores across several configurations. For instance, Model1, which combined the GCN node module with Edgeupdate, yielded a recovery rate closer to ScFold than models without Edgeupdate. Models equipped with the Context-based global module, such as Model4 and Model6, saw slight improvements in recovery over simpler configurations. These findings indicate that the edge and global modules help reinforce inter-node relationships, enhancing both interpretability and predictive power.
Through comparative analyses of various combinations, our experiments further validate the effectiveness of modular design. In the field of protein inverse folding, due to the high complexity of data and the presence of substantial heterogeneous information, selecting appropriate module combinations to enhance the information capture ability at different levels is particularly crucial. For instance, the combination of OSR attention layers and the Edge Update module demonstrates a strong complementarity in capturing global information and edge relationships, thereby improving overall recovery rates. Additionally, the Context layer of the Global module aids the model in constructing information aggregation on a global scale, which is essential for identifying sequence patterns.
Generalization
Protein design presents significant challenges, and the TS50 and TS500 datasets are widely recognized for evaluating the capabilities of protein design algorithms in managing diverse protein structures. These datasets specifically test the algorithms’ ability to translate protein backbone structures into corresponding amino acid sequences. TS50 comprises 50 proteins, while TS500 expands this to 500 proteins, enabling researchers to assess the generalizability and accuracy of their algorithms across a broader range of protein types. These datasets include both sequences and structural information from natural proteins, serving as critical benchmarks for protein design algorithms. This ensures that the generated protein sequences can effectively fold into the intended three-dimensional structures. Consequently, we compare the performance of each model on the TS50 and TS500 datasets to validate their generalization across multiple contexts.
As shown in Table 3, we evaluated the model using the TS50 dataset, revealing that ScFold improves upon previous graph-based models. The LM-design approach utilizes a structure-aware protein language model that effectively combines structural and sequential information, demonstrating superior performance on longer chains. However, it experiences a decline in recovery rates for proteins shorter than 200 residues in the TS50 dataset. Data from multiple datasets suggest that ScFold excels with shorter chains while ranking just behind LM-design for longer chains. Specifically, ScFold achieved recovery rates of 59.32% and 61.59% on the TS50 and TS500 datasets, respectively, surpassing other GNN models on the TS50 dataset.
Performance on short chains
To evaluate the advantages of ScFold for short-chain proteins, we analyzed the length distribution of proteins in the TS500 dataset and categorized them into four subsets based on their lengths. Specifically, there are 177 protein sequences with lengths ranging from 100 to 200 residues, 128 sequences from 200 to 300 residues, 91 sequences from 300 to 400 residues, and 66 sequences exceeding 400 residues. This categorization facilitates a clear comparison of ScFold’s performance across different length ranges, allowing for a precise assessment of various methods’ effectiveness within specific intervals. As shown in Fig. 2, ScFold exhibits higher recovery rates than PiFold in the 100–200 and 200–300 residue length ranges; however, their performances are similar in the 300+ residue length range, as reflected in the perplexity results.

We report the recovery rates and perplexity of our method and PiFold on the four subsets of TS500 with different sequence lengths.
We also analyzed the length distribution of proteins in the CATH4.2 dataset, dividing them into four subsets based on length. The dataset includes 7125 protein sequences with lengths between 100 and 200 residues, 5402 sequences between 200 and 300 residues, 4220 sequences between 300 and 400 residues, and 2225 sequences between 400 and 500 residues. Due to the small number of sequences longer than 500 residues and the large variation in results for sequences shorter than 100 residues, these categories were not included in our analysis. We trained PiFold and ScFold for 100 epochs and evaluated five independently trained models on the test set, averaging the final results across the five runs to obtain more reliable predictions for protein sequences of different lengths. As shown in Fig. 3, ScFold outperforms PiFold in most length ranges, except for the 400-500 residue range, where their recovery rates are nearly identical. The perplexity results show a consistent trend. This indicates that ScFold demonstrates superior recovery performance for short to medium-length protein sequences, underscoring its effectiveness in predicting short-chain structures. These findings suggest that ScFold is particularly adept at capturing the inherent complexity and structural features of short-chain proteins. Furthermore, ScFold's performance on the TS500 dataset aligns with the results from the CATH4.2 dataset, further confirming its efficacy in the inverse folding of short-chain proteins.

We divided the CATH4.2 dataset into four subsets based on protein sequence length, and we report the recovery rates and perplexity of our method and PiFold on these four subsets.
Case study
We selected a complete and representative protein sequence from the CATH4.2 dataset for visualization and rendered its three-dimensional structure using PyMOL. Additionally, we chose partial sequences of three short-chain proteins and predicted their structures using both PiFold and ScFold. As shown in Fig. 4, the prediction results for the three sequences indicate that ScFold consistently achieves higher prediction accuracy than PiFold, whether a sequence's overall prediction quality is good or poor. This demonstrates the effectiveness of ScFold in the design of short-chain proteins.

A case study was conducted and the sequences predicted by ScFold and PiFold were compared with the target sequence.
Discussion
Each protein graph construction method has its own advantages and disadvantages, and the choice depends on the specific application scenario and data characteristics. Geometry-based graph construction methods capture spatial and structural information by considering the geometric features between residues, making them suitable for obtaining precise spatial information of proteins. However, accurately calculating geometric features comes with a high computational cost. Topology-based graph construction methods focus on the connectivity and structure of the graph rather than precise geometric distances. By abstracting protein structures into topological graphs, these methods simplify the analysis of complex three-dimensional structures but may lose detailed geometric information. Subgraph-based graph construction methods are useful for identifying repetitive structural patterns or motifs within proteins, which is beneficial for discovering new structural motifs and their functional roles. However, identifying and comparing subgraphs in large protein datasets pose significant computational challenges. K-NN-based graph construction methods preserve local structure and are simple, efficient, and flexible. By connecting each node to its K nearest neighbors, these methods effectively capture local structural information, which is crucial for understanding protein function and stability. However, the choice of K is critical; an inappropriate K value may lead to either too many or too few connections, thereby affecting the graph representation. Therefore, we adopted the graph construction method used in models such as GraphTrans, GCA, and PiFold, which integrates the advantages of K-NN and geometry-based graph construction methods. This approach is highly compatible with our method of establishing a local coordinate system based on the backbone atoms. We first construct the graph using the K-NN method and simultaneously calculate geometric features for each node. This strategy not only preserves local structural information but also provides geometric details that are beneficial for subsequent model training.
In the field of protein design, topology-based and subgraph-based graph construction methods have not yet been considered. However, these methods have been employed by several researchers in other fields. For example, TO-GCN [45] proposed a topology-optimized graph convolutional network that jointly optimizes network topology and fully connected network parameters, leveraging network topology, node features, and label information to improve classification performance. Subgraph-based [46] graph construction methods have unique advantages by considering different node roles. For instance, each node in a graph structure may play a unique role in different subgraphs, but most prediction methods overlook these distinct node roles. Therefore, other graph construction methods hold great potential in the field of protein design, providing a new direction for future research.
Conclusion
Based on considerations of local and global information, we propose ScFold to enhance recovery rates in structure-based protein design. We introduce a novel node module to learn expressive residue features, incorporating spatial reduction mechanisms and positional encoding. Spatial reduction focuses on critical local information, while positional encoding provides additional contextual relationships. ScFold achieves recovery scores of 52.22%, 59.32%, and 61.59% on the CATH4.2, TS50, and TS500 datasets, respectively. Furthermore, ScFold demonstrates excellent performance on single-chain and short-chain proteins, improving overall performance and highlighting its potential for advancing structure-based protein design.
Key points
We introduce a GNN-based model for short-chain protein inverse folding that utilizes spatial dimensionality reduction and positional encoding mechanisms to enhance the extraction of structural features.
The spatial reduction mechanism directs the model’s attention to critical local regions of protein structures, effectively capturing local three-dimensional spatial relationships. The positional encoding provides essential contextual information, enhancing the model’s ability to capture the orderliness of sequences and long-range dependencies, which is particularly vital for understanding the overall structure of proteins.
Based on the benchmark CATH4.2 dataset, ScFold demonstrated better performance on short-chain proteins across multiple evaluation metrics and showed robust generalization on the TS50 and TS500 datasets.
Competing interests
No competing interest is declared.
Funding
This work has been supported by the National Natural Science Foundation of China, grant no. 62372171, the Hunan Provincial Natural Science Foundation of China, grant no. 2023JJ30414, and the Scientific Research Fund of Hunan Provincial Education Department, grant no. 23A0100.
Data availability
ScFold is freely available at https://github.com/jczhongcs/ScFold.