Abstract

Determining intrinsically disordered regions of proteins is essential for elucidating protein biological functions and the mechanisms of their associated diseases. As the gap between the number of experimentally determined protein structures and the number of known protein sequences continues to grow exponentially, there is a need for an accurate and computationally efficient disorder predictor. However, current single-sequence-based methods are of low accuracy, while evolutionary profile-based methods are computationally intensive. Here, we propose LMDisorder, a fast and accurate protein disorder predictor that employs embeddings generated by unsupervised pretrained language models as features. We show that LMDisorder outperforms all other single-sequence-based methods and is comparable to or better than another language-model-based technique on four independent test sets. Furthermore, LMDisorder showed equivalent or even better performance than the state-of-the-art profile-based technique SPOT-Disorder2. In addition, the high computational efficiency of LMDisorder enabled proteome-scale analysis of the human proteome, showing that proteins with high predicted disorder content were associated with specific biological functions. The datasets, the source codes, and the trained model are available at https://github.com/biomed-AI/LMDisorder.

Introduction

Proteins were long thought to require a stable tertiary structure to perform their functions. However, Romero et al. [1] suggested that about 15 000 proteins or long protein regions in SwissProt [2] do not have a well-defined 3D structure. These intrinsically disordered proteins or protein regions (IDPs/IDRs) have enriched our understanding of the sequence–structure–function relationship of proteins. For example, lacking a rigid structure, IDPs/IDRs can interconvert among a group of transient, mutually transforming structural states [3]. Some functions directly require structural flexibility, such as linkers between structured domains, whereas other IDPs can undergo disorder-to-order transitions when interacting with specific bioactive molecules and serve as specific chaperones for certain proteins. These binding regions or molecular recognition features were found in modular proteins participating in signaling and regulation [4, 5]. The plasticity and flexibility of these functional modules allow IDPs/IDRs to adopt varying conformations to interact with different partners in signal transmission, assembly and regulation, resulting in their involvement in many human diseases, including cardiovascular disease, cancer, various genetic diseases and neurodegenerative diseases [6]. The multi-structure states of IDPs [3, 7] may be evolutionarily advantageous, as suggested by the higher degree of intrinsic disorder in natural proteins than in random sequences [8]. Thus, identifying IDPs/IDRs is crucial to comprehending and resolving the causes and impacts of these unstructured states [9].

Several experimental techniques have been employed to identify IDPs/IDRs, including X-ray crystallography, nuclear magnetic resonance (NMR) and circular dichroism (CD) [7, 10]. However, because these experiments are costly, laborious and time-consuming, computational tools for predicting IDRs/IDPs from protein sequences have been an active area of research for the past 20 years.

IDP prediction was initially based on small-scale machine learning models [11] such as neural networks and support vector machines. As more data and more powerful tools emerged, deep learning models came to the forefront in methods such as ESpritz [12], SPOT-Disorder [13], NetSurfP-2.0 [14], AUCpreD [15], SPINE-D [16] and SPOT-Disorder2 [17]. Depending on the input, methods for protein disorder prediction can be divided into two categories: those based on a single sequence and those based on sequence profiles from multiple sequence alignments. Single-sequence-based methods leverage the composition and connectivity of the protein sequence along with derived information such as statistical potentials and amino acid propensities [18–20]. These methods allow fast computation but are less accurate than those using evolutionary sequence profiles [21, 22], because sequence conservation is an important indicator of structure and function. Examples of profile-based methods are SPINE-D [16], MFDp2 [23], AUCpreD [15], SPOT-Disorder [13] and SPOT-Disorder2 [17]. However, the exponential growth of protein sequence libraries makes profile generation ever more computationally intensive. Moreover, most proteins (>90%) do not belong to large sequence clusters [24] and, as a result, their sequence profiles may not be that informative.

Recently, unsupervised pretrained language models have been applied to extract features from protein sequences [25–28] and have shown very promising results in downstream prediction tasks such as tertiary contacts, ontology-based protein function, secondary structure, contact maps and mutational effects [28–33]. These breakthroughs inspired us to exploit language models for disorder prediction. A recent study [30] evaluated and discussed the performance of different models for protein representation, indicating that ProtTrans achieved the best performance on most tasks. However, ProtTrans has not yet been used for protein disorder prediction. In this work, we employed the ProtTrans language model to predict intrinsically disordered regions of proteins.

Here, we present LMDisorder, an alignment-free, sequence-based predictor of protein intrinsic disorder. Using a recently released pretrained language model [34], LMDisorder can quickly generate informative sequence representations and make accurate predictions. Specifically, we first leveraged the pretrained language model ProtTrans to produce the sequence embedding. Then, transformer networks were employed to capture sequence patterns, including long-range dependencies between residues, followed by a fully connected layer for predicting protein disorder probabilities. LMDisorder was shown to be superior to or as good as profile-based methods, including SPOT-Disorder2, on different datasets, while remaining as computationally efficient as single-sequence-based techniques. This high computational efficiency enabled proteome-scale analysis of the human proteome, showing that proteins with high predicted disorder content were associated with specific biological functions.

MATERIALS AND METHODS

Datasets

In our experiments, the datasets were obtained from previous studies [17, 34], as shown in Table 1. Briefly, we obtained 4229 protein sequences (DM4229), comprising 72 fully disordered chains from DisProt v5.0 [35] and 4157 high-resolution X-ray crystallography structures from the PDB (prior to 5 August 2003). These proteins were randomly divided into a training set (2700 proteins), a validation set (300 proteins) and a test set (1229 proteins, DM1229). According to BLASTClust [36], the sequence similarity among these proteins is <25%. Furthermore, we employed four independent test datasets: SL329, DisProt228, Mobi9230 and DisProt452. The SL329 dataset was selected from the SL477 protein set released by Sirota et al. [37]; the overlap between DM4229 and SL477 was removed to obtain an independent test set containing 329 non-redundant sequences. DisProt228 is a subset of the DisProt Complement [22] and consists of proteins in DisProt database v7.0. Mobi9230 was obtained from MobiDB [38] with experimental annotations derived from manually curated databases and primary data (e.g. PDB structures). We also constructed a novel independent test dataset, DisProt452, consisting of proteins deposited to the DisProt database from June 2020 to December 2022; after removing long proteins (>700 residues) and proteins homologous to our training and validation sets (25% sequence identity cutoff with BLASTClust [36]), 452 proteins remained. Meanwhile, we created subsets of SL329 and DM1229, named SL250 and Test1185, respectively, which are the same datasets employed previously by SPOT-Disorder and SPOT-Disorder2 [17]. Table 1 details the number of proteins, the numbers of ordered and disordered residues and the percentage of disorder in each dataset.

Table 1

The number of proteins, ordered and disordered residues as well as the percentage of disorder in each dataset

Dataset      No. of proteins   No. of ordered residues   No. of disordered residues   Percentage of disorder
Validation   300               61 231                    6083                         9.04%
DM1229       1229              276 748                   29 082                       9.51%
SL329        329               51 292                    39 544                       43.53%
DisProt228   228               30 772                    18 811                       37.94%
Test1185     1185              246 616                   26 515                       9.71%
SL250        250               32 261                    21 173                       39.62%
Mobi9230     9230              2 011 126                 828 642                      29.18%

Protein sequence features

Language model representation. LMDisorder utilizes the recent protein language model ProtT5-XL-U50 [34] (denoted as ProtTrans) for feature extraction. This is a transformer-based encoder–decoder named T5 [39], which was pretrained in a self-supervised manner on UniRef50 [40], learning to predict masked amino acids. We extracted the hidden states from the last layer of the ProtTrans encoder as sequence features, giving an |$n\times 1024$| matrix (⁠|$n$| is the sequence length). We also examined another pretrained model, ESM-1b [29] (denoted as ESM), which was likewise pretrained on UniRef50 with a transformer; its sequence features form an |$n\times 1280$| matrix. The computational cost of inference for both ProtTrans and ESM is low: feature extraction for all our benchmark datasets (~4800 sequences) can be completed within 10 minutes on an Nvidia GeForce RTX 3090 GPU.

Evolutionary information (for ablation study). We also tested evolution-derived features (position-specific scoring matrix, PSSM, and hidden Markov model, HMM, profiles) for feature ablation studies. PSSM was obtained by performing a PSI-BLAST [36] search against UniRef90 [40] with three iterations and an E-value of 0.001. HHblits was employed to generate the HMM profile [41] by aligning a query sequence against UniClust30 [42] with default parameters. Each residue was encoded into a 20-dimensional feature vector in each of PSSM and HMM.

Predicted structural properties (for ablation study). Putative structural properties were generated using SPIDER3 [43], whose inputs include the protein sequence, PSSM profiles and HMM profiles. We extracted four types of structural features from the outputs of SPIDER3: (1) solvent accessible surface area; (2) the sine and cosine values of the four protein backbone torsion angles (θ, ϕ, ψ and τ); (3) half-sphere exposures, which are the numbers of spatially neighboring residues in the top and bottom halves of the contacting sphere, with the boundary between the two hemispheres determined by the Cα-Cα and Cα-Cβ direction vectors; and (4) predicted probabilities of the three secondary structure states (α-helix, β-sheet and random coil). Each residue was thus encoded into a 14-dimensional vector using SPIDER3. Values in the PSSM, HMM and SPIDER3 features were further normalized to scores between 0 and 1 using min–max normalization:

|$v^{\prime}=\frac{v-\mathrm{Min}}{\mathrm{Max}-\mathrm{Min}}$| (1)

where v is the original feature value, and Min and Max are the minimum and maximum values of this feature type observed in the training set.
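As a concrete illustration, the normalization in Equation (1) can be sketched in a few lines of Python; the function names here are illustrative, not from the LMDisorder codebase:

```python
def fit_min_max(train_column):
    """Learn Min and Max for one feature type from the training set only."""
    return min(train_column), max(train_column)

def min_max_normalize(v, vmin, vmax):
    """Scale a feature value into [0, 1] with training-set statistics (Eq. 1)."""
    if vmax == vmin:  # degenerate (constant) feature: avoid division by zero
        return 0.0
    return (v - vmin) / (vmax - vmin)

# Statistics come from the training data and are reused unchanged at test time.
vmin, vmax = fit_min_max([-3.0, 0.0, 5.0])
print(min_max_normalize(1.0, vmin, vmax))  # 0.5
```

Note that test-time values outside the training range would fall outside [0, 1]; the paper does not specify how such values are handled.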

The architecture of LMDisorder

The overall architecture of LMDisorder is shown in Figure 1. First, the protein sequence is fed into the pretrained language model ProtTrans to yield the sequence embedding, which is augmented with Gaussian noise during training to avoid overfitting. Then, transformer networks are employed to capture sequence patterns, including long-range dependencies between residues. This is followed by a fully connected layer that predicts protein disorder probabilities.

Figure 1

The architecture of LMDisorder. A protein sequence is input into ProtTrans to produce the sequence embedding. The embedding is augmented by Gaussian noise and further encoded through the transformer networks to capture the long-range sequence patterns. The encoded representation is input to a fully connected layer for predicting disorder probabilities.

Transformer networks

We stack N identical transformer [44] layers to learn the sequence representations. Each transformer layer has two sub-layers: a multi-head self-attention module and a position-wise feed-forward network. A residual connection [45] is employed around each of the two sub-layers, followed by layer normalization [46]. Let |$\boldsymbol{H}\in{\mathbb{R}}^{n\times d}$| denote the input of the self-attention module, where n is the sequence length and d is the hidden dimension. The input of the |${l}^{\text{th}}$| layer |$\boldsymbol{{H}}^{(l)}$| is projected by three matrices |$\boldsymbol{W}_Q\in{\mathbb{R}}^{d\times{d}_K}$|⁠, |$\boldsymbol{W}_K\in{\mathbb{R}}^{d\times{d}_K}$| and |$\boldsymbol{W}_V\in{\mathbb{R}}^{d\times{d}_V}$| to obtain the corresponding query, key and value representations |$\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}$|:

|$\boldsymbol{Q}={\boldsymbol{H}}^{(l)}\boldsymbol{W}_Q,\qquad \boldsymbol{K}={\boldsymbol{H}}^{(l)}\boldsymbol{W}_K,\qquad \boldsymbol{V}={\boldsymbol{H}}^{(l)}\boldsymbol{W}_V$| (2)

The scaled dot-product self-attention is then calculated according to the following equation:

|$\boldsymbol{A}=\frac{\boldsymbol{Q}{\boldsymbol{K}}^{\top}}{\sqrt{d_K}}$| (3)
|$\operatorname{Attention}\left(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\right)=\operatorname{softmax}\left(\boldsymbol{A}\right)\boldsymbol{V}$| (4)

where |$\sqrt{d_K}$| is a scaling factor and |$\boldsymbol{A}$| is a matrix capturing the similarities between queries and keys. Instead of performing a single attention function, multi-head attention is employed to jointly attend to information from different representation subspaces at different positions. We linearly project the queries, keys and values h times, perform the attention function in parallel and finally concatenate the outputs. In this work, |${d}_K={d}_V=d/h$|⁠.
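The attention computation above can be sketched in NumPy; the random projection matrices below merely stand in for the learned weights, so this is an illustrative sketch rather than the LMDisorder implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention, Equations (2)-(4)."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V        # Eq. (2): query/key/value
    A = Q @ K.T / np.sqrt(W_K.shape[1])        # Eq. (3): scaled similarities
    return softmax(A) @ V                      # Eq. (4): weighted values

def multi_head_attention(H, heads):
    """Run h heads in parallel and concatenate along the feature axis."""
    return np.concatenate([self_attention(H, *w) for w in heads], axis=-1)

rng = np.random.default_rng(0)
n, d, h = 6, 128, 4                            # length, hidden size, heads
d_k = d // h                                   # d_K = d_V = d / h
H = rng.standard_normal((n, d))
heads = [tuple(rng.standard_normal((d, d_k)) for _ in range(3)) for _ in range(h)]
out = multi_head_attention(H, heads)
print(out.shape)                               # (6, 128)
```

In the full transformer layer, this output would additionally pass through a linear projection, a residual connection and layer normalization, omitted here for brevity.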

Fully connected layer and noise augmentation

The output of the last transformer layer is input to a fully connected layer to predict the protein disorder probabilities for all n amino acid residues of the protein:

|$\boldsymbol{Y}^{\prime}=\operatorname{sigmoid}\left({\boldsymbol{H}}^{\left(N+1\right)}\boldsymbol{W}+b\right)$| (5)

where |$\boldsymbol{H}^{\left(N+1\right)}\in{\mathbb{R}}^{n\times d}$| is the output of the |${N}^{\text{th}}$| transformer layer; |$\boldsymbol{W}\in{\mathbb{R}}^{d\times 1}$| is a learnable weight matrix; |$b\in \mathbb{R}$| is a bias term, and |$\boldsymbol{Y}^{\prime}\in{\mathbb{R}}^{n\times 1}$| contains the predictions for the n residues. The sigmoid function normalizes the output of the fully connected layer into disorder probabilities between 0 and 1. In addition, to avoid overfitting, the sequence features from ProtTrans are augmented with Gaussian noise before being fed to the transformer networks during training:

|${\boldsymbol{H}}^{(0)}\leftarrow{\boldsymbol{H}}^{(0)}+\varepsilon \boldsymbol{X}$| (6)

where |$\boldsymbol{H}^{(0)}$| is the sequence features from ProtTrans, |$\boldsymbol{X}$| is a matrix of random values from the standard normal distribution with the same size as |$\boldsymbol{H}^{(0)}$|⁠, and ε is a hyperparameter to regulate the augmentation.
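A minimal sketch of this augmentation (Equation 6), assuming a NumPy array in place of the real ProtTrans embedding:

```python
import numpy as np

def augment_with_noise(H0, eps=0.05, rng=None):
    """Training-time Gaussian noise augmentation (Eq. 6).

    H0  : n x 1024 ProtTrans embedding (a toy stand-in here)
    eps : hyperparameter regulating the augmentation (0.05 in the final model)
    """
    rng = rng or np.random.default_rng()
    X = rng.standard_normal(H0.shape)  # standard-normal noise, same shape as H0
    return H0 + eps * X

H0 = np.zeros((10, 1024))              # toy embedding for a 10-residue sequence
H_aug = augment_with_noise(H0, eps=0.05, rng=np.random.default_rng(1))
print(H_aug.shape)                     # (10, 1024)
```

At inference time the noise is switched off, so predictions remain deterministic.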

Implementation details

We used the training set to train LMDisorder and the validation set to evaluate its generalization to unseen proteins. Training lasted at most 50 epochs, and we chose the model from the epoch with the best performance on the validation set as the final model. The validation performance was optimized by choosing the best feature combination and searching all hyperparameters through a grid search. Specifically, we employed a 2-layer transformer network with 128 hidden units and the following hyperparameters: h = 4, ε = 0.05 and a batch size of 12. The dropout rate was set to 0.3 to avoid overfitting. We used the Adam optimizer [47] with a learning rate of 3 × 10−4 to minimize the binary cross-entropy loss. We implemented the proposed model with PyTorch 1.7.1 [48].

Evaluation metrics

Similar to previous studies [17, 34], we employed the area under the receiver operating characteristic curve (⁠|${\text{AUC}}_{\text{ROC}}$|⁠), precision (Pr), sensitivity (Se), specificity (Sp), the area under the precision-recall curve (⁠|${\text{AUC}}_{\text{PR}}$|⁠), Matthews correlation coefficient (MCC), and the weighted score Sw (Sw = sensitivity + specificity – 1) to evaluate the performance of the method developed:

|$\text{Pr}=\frac{\text{TP}}{\text{TP}+\text{FP}}$| (7)
|$\text{Se}=\frac{\text{TP}}{\text{TP}+\text{FN}}$| (8)
|$\text{Sp}=\frac{\text{TN}}{\text{TN}+\text{FP}}$| (9)
|$\text{MCC}=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{\sqrt{\left(\text{TP}+\text{FP}\right)\left(\text{TP}+\text{FN}\right)\left(\text{TN}+\text{FP}\right)\left(\text{TN}+\text{FN}\right)}}$| (10)
|${S}_w=\text{Se}+\text{Sp}-1$| (11)

where TN, TP, FN and FP denote true negatives, true positives, false negatives and false positives, respectively.
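The threshold-based metrics in Equations (7)-(11) can be computed directly from the confusion-matrix counts; this short Python sketch (with illustrative names) shows the arithmetic:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Precision, sensitivity, specificity, MCC and Sw from confusion counts."""
    pr = tp / (tp + fp)                              # Eq. (7)
    se = tp / (tp + fn)                              # Eq. (8)
    sp = tn / (tn + fp)                              # Eq. (9)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom                # Eq. (10)
    sw = se + sp - 1                                 # Eq. (11)
    return {"Pr": pr, "Se": se, "Sp": sp, "MCC": mcc, "Sw": sw}

m = binary_metrics(tp=60, tn=80, fp=20, fn=40)
# Pr = 0.75, Se = 0.60, Sp = 0.80, MCC ~ 0.408, Sw = 0.40
```

AUCROC and AUCPR, by contrast, are threshold-free and are integrated over the full range of decision thresholds.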

RESULTS AND DISCUSSION

Feature importance and ablation analysis

The results of LMDisorder on the validation set and two independent test sets are shown in Table 2. The fractions of disordered residues differ substantially between datasets: disordered residues make up only 9.0% of all residues in the validation set and 9.5% in DM1229, compared to 43.5% in SL329. As |${\text{AUC}}_{\text{ROC}}$| is a balanced metric, similar performance on datasets that differ significantly in disorder content indicates the robustness of our method. The |${\text{AUC}}_{\text{ROC}}$| values of LMDisorder on the validation set and DM1229 are 0.940 and 0.920, and the |${\text{AUC}}_{\text{PR}}$| values are 0.732 and 0.705, respectively. Although the SL329 set contains a much higher fraction of disordered residues, LMDisorder still achieved similar performance, with an |${\text{AUC}}_{\text{ROC}}$| of 0.911 and an |${\text{AUC}}_{\text{PR}}$| of 0.909.

Table 2

The performance of LMDisorder on the validation and two independent test sets according to various measures, including AUCROC, AUCPR, MCC, Pr, Se, Sp, along with the statistics of the number of disordered and ordered residues

Dataset      AUCROC   AUCPR   MCC     Pr      Se      Sp      # Order   # Disord
Validation   0.940    0.732   0.646   0.746   0.598   0.980   61 231    6083
DM1229       0.920    0.705   0.621   0.756   0.560   0.981   276 748   29 082
SL329        0.911    0.909   0.655   0.947   0.625   0.973   51 292    39 544

To examine the importance of various features for disorder prediction, we performed feature ablation experiments. As shown in Figure 2 and Supplementary Table S1, the models using either the ProtTrans or the ESM language model as sequence features perform significantly better than the profile-based models (PSSM+HMM and PSSM+HMM+SPIDER). The |${\text{AUC}}_{\text{ROC}}$| values of the model using ProtTrans on the validation set, DM1229 and SL250 are 0.940, 0.920 and 0.905, respectively, better than those of the model using ESM (0.922, 0.912 and 0.885). By comparison, the models using PSSM+HMM and PSSM+HMM+SPIDER features yield |${\text{AUC}}_{\text{ROC}}$| values of only 0.847, 0.840, 0.806 and 0.893, 0.886, 0.855, respectively. We noted that combining ProtTrans with PSSM+HMM+SPIDER features did not improve the model (|${\text{AUC}}_{\text{ROC}}$| values of 0.938, 0.918 and 0.905 on the validation set, DM1229 and SL250, respectively), indicating that the pretrained model ProtTrans may already capture the evolutionary and structural information of proteins. Furthermore, combining ProtTrans with ESM led to only minor changes, with |${\text{AUC}}_{\text{ROC}}$| values of 0.936, 0.922 and 0.906 on the three datasets. Finally, to examine the effect of Gaussian noise, we tested the model without it. As shown in Supplementary Table S1, removing the Gaussian noise decreased |${\text{AUC}}_{\text{ROC}}$| and |${\text{AUC}}_{\text{PR}}$| to 0.917 and 0.692, drops of 2.4 and 5.5%, respectively. These results demonstrate the usefulness of the Gaussian noise augmentation.

Figure 2

Comparison of different feature combinations as labeled on the Validation and two independent test sets (DM1229 and SL250). Method performance is measured by Area under ROC and PR curves (AUCROC and AUCPR).

Comparison with single-sequence-based methods

We compared LMDisorder with six single-sequence-based methods (ESpritz-N, ESpritz-D, ESpritz-X, IUPred2A-short [49], IUPred2A-long and SPOT-Disorder-Single [34]) on DM1229 and SL329. In addition, we compared with NetSurfP-3.0 [50], which is based on the ESM-1b language model. Single-sequence-based methods typically employ sequence-derived features such as one-hot encodings; in this work, we instead obtained rich protein representations from the ProtTrans language model, which proved effective for protein disorder prediction. Figure 3 compares the ROC curves and Supplementary Figures S1 and S2 compare the precision–recall curves produced by these methods on DM1229 and SL329, with performance indicators listed in Table 3. Among all methods compared, LMDisorder achieved the highest |${\text{AUC}}_{\text{ROC}}$| and MCC values on both datasets, with NetSurfP-3.0 second best. LMDisorder achieved the best results on SL329, with an |${\text{AUC}}_{\text{ROC}}$| of 0.911, an AUCPR of 0.910 and an MCC of 0.655, which are 1.0, 1.4 and 5.5% higher than NetSurfP-3.0, respectively. LMDisorder and NetSurfP-3.0 performed comparably on DM1229, where the |${\text{AUC}}_{\text{ROC}}$|⁠, AUCPR and MCC of LMDisorder are 0.920, 0.705 and 0.621, respectively, versus 0.918, 0.732 and 0.620 for NetSurfP-3.0. It is worth noting that SPOT-Disorder-Single achieved the third-best results on both datasets. On DM1229, its |${\text{AUC}}_{\text{ROC}}$|⁠, AUCPR and MCC are 0.868, 0.599 and 0.518, which are 5.6, 15 and 17% lower than LMDisorder, respectively; on SL329, they are 2.6, 2.6 and 7.8% lower than LMDisorder.
For a more comprehensive comparison, we also evaluated LMDisorder against the other single-sequence-based methods on Mobi9230 and found that our model consistently performed best (Supplementary Table S3): the |${\text{AUC}}_{\text{ROC}}$| and AUCPR of LMDisorder are 0.863 and 0.8, respectively, while those of the second-best method (NetSurfP-3.0) are 0.846 and 0.8. Additionally, on our novel independent dataset DisProt452 (Supplementary Table S4), LMDisorder again performed best, with an |${\text{AUC}}_{\text{ROC}}$| of 0.875 and an AUCPR of 0.854, versus 0.840 and 0.817 for the second-best method (SPOT-Disorder-Single). The consistent performance across multiple datasets further confirms the robustness of our model.

Figure 3

The receiver operating characteristic curves given by LMDisorder and other single-sequence-based methods on the (a) DM1229 and (b) SL329 sets.

Table 3

The performance comparison of LMDisorder with several single-sequence-based methods on independent test sets DM1229 and SL329 according to AUCROC, AUCPR, MCC, Pr, Se and Sp. Bold fonts indicate the best results

Dataset   Method                 AUCROC   AUCPR   MCC     Pr      Se      Sp
DM1229    ESpritz-X              0.852    0.565   0.503   0.640   0.459   0.973
          ESpritz-N              0.802    0.480   0.431   0.619   0.358   0.977
          ESpritz-D              0.764    0.238   0.277   0.240   0.557   0.814
          IUPred2A-short         0.804    0.475   0.439   0.578   0.403   0.969
          IUPred2A-long          0.742    0.372   0.321   0.542   0.243   0.978
          SPOT-Disorder-Single   0.868    0.599   0.518   0.707   0.432   0.981
          NetSurfP-3.0           0.918    0.732   0.620   0.704   0.644   0.971
          LMDisorder             0.920    0.705   0.621   0.756   0.560   0.981
SL329     ESpritz-X              0.842    0.831   0.543   0.843   0.588   0.916
          ESpritz-N              0.826    0.816   0.473   0.809   0.613   0.889
          ESpritz-D              0.863    0.829   0.608   0.873   0.620   0.931
          IUPred2A-short         0.830    0.811   0.506   0.799   0.649   0.874
          IUPred2A-long          0.838    0.833   0.552   0.812   0.648   0.884
          SPOT-Disorder-Single   0.887    0.886   0.604   0.939   0.563   0.972
          NetSurfP-3.0           0.902    0.897   0.621   0.864   0.773   0.912
          LMDisorder             0.911    0.910   0.655   0.947   0.625   0.973

Comparison with state-of-the-art profile-based methods

We compared LMDisorder with 12 profile-based methods [51] on DisProt228: s2D [34], AUCpreD, JRONN [52], ESpritz-D, MFDp2 [53], DISOPRED [54], MobiDB-lite [55], ESpritz-N, SPINE-D, MFDp [23], SPOT-Disorder and SPOT-Disorder2. Among these, SPOT-Disorder2 is a state-of-the-art profile-based method that performed best in the CAID assessment [56]. Profile-based methods [16, 17] typically use evolutionary sequence profiles obtained from multiple sequence alignments [21, 22], mostly created with HHblits [41] and PSI-BLAST [36]. In general, profile-based methods are more accurate than single-sequence-based ones, although they take more time to calculate the features. As shown in Table 4, LMDisorder improved over 11 profile-based methods while achieving performance similar to SPOT-Disorder2: the |${\text{AUC}}_{\text{ROC}}$|⁠, |${\text{AUC}}_{\text{PR}}$|⁠, MCC and Sw of LMDisorder are 0.800, 0.700, 0.471 and 0.484, respectively, compared to 0.809, 0.716, 0.499 and 0.507 for SPOT-Disorder2.

Table 4

The performance comparison of LMDisorder with some profile-based methods on the DisProt228 dataset based on AUCROC, AUCPR, MCC and Sw. Bold fonts and underlined fonts indicate the best and second-best results, respectively.

Method name        AUCROC   AUCPR   MCC     Sw
s2D                0.727    0.625   0.267   0.272
AUCpreD            0.748    0.312   0.434   0.436
JRONN              0.753    0.654   0.379   0.388
ESpritz-D (prof)   0.759    0.645   0.379   0.364
MFDp2              0.768    0.594   0.371   0.375
DISOPRED           0.771    0.608   0.406   0.387
MobiDB-lite        0.772    0.596   0.422   0.360
IUPred2A-long      0.772    0.694   0.418   0.429
IUPred2A-short     0.774    0.674   0.425   0.437
ESpritz-N (prof)   0.776    0.674   0.432   0.440
MFDp               0.776    0.603   0.357   0.366
SPINE-D            0.786    0.684   0.423   0.436
SPOT-Disorder      0.792    0.666   0.462   0.465
LMDisorder         0.800    0.700   0.471   0.484
SPOT-Disorder2     0.809    0.716   0.499   0.507

We further compared the top three methods in Table 4 (SPOT-Disorder, LMDisorder and SPOT-Disorder2) on two additional, larger datasets. Table 5 shows the comparison on Test1185 and SL250, where LMDisorder achieves the best performance. On Test1185, the AUC_ROC and AUC_PR of LMDisorder are 0.920 and 0.709, respectively, which are 3.3 and 9.1% higher than those of SPOT-Disorder, and 0.6 and 2% higher than those of SPOT-Disorder2. Similarly, the AUC_ROC and AUC_PR of LMDisorder on SL250 are 0.905 and 0.890, which are 1.3 and 1.7% higher than those of SPOT-Disorder, and 0.4 and 0.1% higher than those of SPOT-Disorder2. Thus, LMDisorder matches the performance of the current state-of-the-art profile-based method at a fraction of the computational time.
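The percentage gains quoted above are relative differences between the metric values; for example, for AUC_PR on Test1185:

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# AUC_PR on Test1185: LMDisorder (0.709) vs SPOT-Disorder (0.650)
print(f"{relative_gain(0.709, 0.650):.1f}%")  # prints "9.1%"
```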

Table 5

The performance comparison of the top-3 methods in Table 4 on independent test sets Test1185 and SL250. Bold fonts indicate the best results.

Dataset   Method          AUC_ROC  AUC_PR  MCC    Sw
Test1185  SPOT-Disorder   0.894    0.650   0.567  0.477
          SPOT-Disorder2  0.914    0.698   0.607  0.676
          LMDisorder      0.920    0.709   0.624  0.692
SL250     SPOT-Disorder   0.893    0.875   0.629  0.567
          SPOT-Disorder2  0.901    0.889   0.679  0.625
          LMDisorder      0.905    0.890   0.679  0.634

IDPs are associated with specific biological functions

Since LMDisorder is much faster than profile-based methods, we used it to predict IDRs/IDPs in all human proteins in UniProt [57]. Previous studies have shown that IDRs/IDPs play essential roles in many biological processes, so we investigated whether the proteins with high disorder content predicted by LMDisorder were associated with these functions. Specifically, we selected the proteins with a disorder content above 90% and conducted Gene Ontology (GO) [58] enrichment analysis using g:Profiler [59]. We then randomly selected the same number of proteins for 200 enrichment analyses and took the lowest FDR (P-value adjusted by the Benjamini–Hochberg false discovery rate) value (1.0 × 10⁻²⁵) as the threshold. We found that these genes significantly enriched 5, 62 and 5 GO terms in the molecular function (MF), biological process (BP) and cellular component (CC) categories, respectively. Figure 4 shows the top-5 GO terms of the three categories; details are in Supplementary Table S6. The proteins with high disorder content were enriched in many important biological functions, many of which are supported by the literature: for instance, 12 pathways related to transcription [51, 60, 61], three related to protein binding [62] and three related to regulation [6]. This indicates that LMDisorder provides biologically meaningful predictions, which may facilitate investigating the functions of proteins without rich functional annotations. The enrichment of some functions is consistent with previous analyses, including cell differentiation, development and regulation [63, 64]. We also discovered some new functions associated with strong intrinsic disorder (Supplementary Table S6), including RNA biosynthetic process (FDR = 1.6 × 10⁻⁵³), DNA-templated (FDR = 5.2 × 10⁻⁵⁴) and RNA polymerase (FDR = 5.2 × 10⁻⁵⁴).
Further studies are required to confirm these findings. Meanwhile, we added a proteome-level analysis for some proteins in Supplementary Table S5. First, we selected proteins with less than 25% sequence similarity (using BLASTClust [36]) to the training set. We then predicted the disorder ratio for each protein, ranked the proteins by the predictions, and picked the top ones. The proteins with high disorder content have specific functions, such as DNA binding, transcription regulation and cell division, consistent with the Gene Ontology enrichment analysis above.
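The empirical FDR threshold described above (the lowest FDR seen across 200 enrichment runs on random protein sets of the same size) can be sketched as follows. Here `run_enrichment` is a hypothetical stand-in for the g:Profiler query, returning a list of Benjamini–Hochberg-adjusted P-values for one protein set; the BH adjustment itself is the standard procedure:

```python
import random

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg-adjusted p-values (FDR)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank, i in reversed(list(enumerate(order, start=1))):
        prev = min(prev, pvals[i] * n / rank)
        adjusted[i] = prev
    return adjusted

def empirical_fdr_threshold(run_enrichment, proteins, sample_size,
                            n_draws=200, seed=0):
    """Lowest FDR observed across enrichment runs on random protein
    sets of size `sample_size`, used as the significance threshold."""
    rng = random.Random(seed)
    lowest = 1.0
    for _ in range(n_draws):
        sample = rng.sample(proteins, sample_size)
        lowest = min(lowest, min(run_enrichment(sample)))
    return lowest
```

Any GO term whose FDR in the real (highly disordered) protein set falls below this empirical threshold is then unlikely to be a random-sampling artifact.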

Figure 4

The top-5 GO items in three GO categories.

DISCUSSION

In this paper, we developed a language-model-based method, LMDisorder, to predict IDRs/IDPs from protein sequences. IDPs and IDRs are abundant in all species [65, 66] and relevant in many diseases [67, 68]. Existing profile-based methods provide highly accurate predictions, but they are computationally too intensive for genome-scale prediction because of the profile calculations. Here, we showed that using language model embeddings as input features achieves speed and accuracy at the same time. For a protein with 500 residues, SPOT-Disorder2 takes about 15 minutes on an Nvidia GeForce RTX 3090 GPU, whereas LMDisorder takes only one second on the same computer with essentially the same performance. Most other single-sequence-based methods we tested also finished within 1 to 5 seconds, as they likewise avoid the time-consuming generation of sequence profiles. Compared with NetSurfP-3.0, the state-of-the-art language-model-based method, LMDisorder reached equivalent or better performance, further demonstrating its effectiveness. In addition, the accuracy of our method remained essentially unchanged across protein lengths (Supplementary Figure S3), further indicating its robustness.

One interesting observation is that proteins with high disorder content are associated with specific biological functions. Through enrichment analysis, we found that these proteins significantly enriched 72 GO terms, many of which are supported by previous literature [6, 51, 60, 61, 69, 70]. In addition, we discovered some new functions strongly associated with intrinsic disorder, which may be of interest for further studies.

Another observation is that combining ProtTrans and ESM did not improve over ProtTrans alone. This is somewhat surprising because a recent study indicated that their combination improved the prediction of protein secondary structure and other tertiary structural properties [31]. Perhaps this is because disordered regions, unlike structured regions, are less conserved, so while different language models may capture different aspects of structural characteristics, disorder predictions are less sensitive to small intrinsic differences between the language models. In the future, we will consider using structural information predicted by AlphaFold2 [71, 72] or ESMFold [73] and building models with graph networks [74] to further improve performance.
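The combination tested above amounts to per-residue concatenation of the two embeddings before the downstream network. A minimal NumPy sketch, where the dimensions are illustrative of the published model sizes (ProtT5 produces 1024-dimensional and ESM-1b 1280-dimensional per-residue vectors) and the zero arrays stand in for real embeddings:

```python
import numpy as np

L = 500  # protein length in residues
# Placeholder per-residue embeddings; dimensions are illustrative
prottrans_emb = np.zeros((L, 1024), dtype=np.float32)  # ProtTrans (ProtT5)
esm_emb = np.zeros((L, 1280), dtype=np.float32)        # ESM-1b

# Concatenate along the feature axis to form the combined input
combined = np.concatenate([prottrans_emb, esm_emb], axis=-1)
print(combined.shape)  # (500, 2304)
```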

In summary, this work developed a sequence-based, alignment-free method, LMDisorder, for protein disorder prediction. LMDisorder combines the advantages of current sequence-based and profile-based methods, greatly improving prediction speed while maintaining high prediction performance. The technique could be useful for highly accurate proteome-scale analysis.

Key Points
  • LMDisorder is an alignment-free single-sequence-based protein disorder predictor that employs embeddings generated by unsupervised pretrained language models as features, thus bypassing time-consuming database searches.

  • LMDisorder is accurate and robust, with better performance than state-of-the-art single-sequence-based and even profile-based approaches.

  • The high computational efficiency of LMDisorder enables proteome-scale analysis of human proteins, showing that proteins with high predicted disorder content are associated with specific biological functions.

  • LMDisorder employs transformer networks to capture the sequence patterns, especially the long-range dependencies between the residues, which is useful for protein disorder prediction.

DATA AVAILABILITY

The datasets, the source codes and the trained model are available on https://github.com/biomed-AI/LMDisorder.

FUNDING

This study has been supported by the National Key R&D Program of China [2022YFF1203100], National Natural Science Foundation of China [12126610], and by the Supercomputing facilities of Shenzhen Bay Laboratory.

Author Biographies

Yidong Song is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network, protein disorder prediction and protein function prediction.

Qianmu Yuan is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein disorder prediction.

Sheng Chen is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein structure prediction.

Ken Chen is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein function prediction.

Yaoqi Zhou is a senior principal investigator in the Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China. His research interests include protein disorder prediction, protein design and protein/RNA structure prediction.

Yuedong Yang is a professor in the School of Computer Science and Engineering at Sun Yat-sen University. Currently he focuses on integrating HPC and AI techniques for multi-scale biomedical research.

REFERENCES

1. Romero P, Obradovic Z, Kissinger CR, et al. Thousands of proteins likely to have long disordered regions. Pac Symp Biocomput 1998;3:437–48.

2. Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1991;19:2247.

3. Uversky VN. Functions of short lifetime biological structures at large: the case of intrinsically disordered proteins. Brief Funct Genomics 2020;19:60–8.

4. Mészáros B, Tompa P, Simon I, et al. Molecular principles of the interactions of disordered proteins. J Mol Biol 2007;372:549–61.

5. Vacic V, Oldfield CJ, Mohan A, et al. Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res 2007;6:2351–66.

6. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol 2005;6:197–208.

7. Receveur-Bréchot V, Bourhis JM, Uversky VN, et al. Assessing protein disorder and induced folding. Proteins: Structure, Function, and Bioinformatics 2006;62:24–45.

8. Yu J-F, Cao Z, Yang Y, et al. Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 2016;73:2949–57.

9. Uversky VN. Intrinsic disorder here, there, and everywhere, and nowhere to escape from it. Cell Mol Life Sci 2017;74:3065–7.

10. Konrat R. NMR contributions to structural dynamics studies of intrinsically disordered proteins. J Magn Reson 2014;241:74–85.

11. Romero P, Obradovic Z, Li X, et al. Sequence complexity of disordered protein. Proteins: Structure, Function, and Bioinformatics 2001;42:38–48.

12. Walsh I, Martin AJ, Di Domenico T, et al. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics 2012;28:503–9.

13. Hanson J, Yang Y, Paliwal K, et al. Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics 2017;33:685–92.

14. Klausen MS, Jespersen MC, Nielsen H, et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinformatics 2019;87:520–7.

15. Wang S, Ma J, Xu J. AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. Bioinformatics 2016;32:i672–9.

16. Zhang T, Faraggi E, Xue B, et al. SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. Journal of Biomolecular Structure and Dynamics 2012;29:799–813.

17. Hanson J, Paliwal KK, Litfin T, et al. SPOT-Disorder2: improved protein intrinsic disorder prediction by ensembled deep learning. Genomics Proteomics Bioinformatics 2019;17:645–56.

18. Dosztányi Z, Csizmok V, Tompa P, et al. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005;21:3433–4.

19. Linding R, Russell RB, Neduva V, et al. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 2003;31:3701–8.

20. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, et al. FoldIndex©: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 2005;21:3435–8.

21. Liu Y, Wang X, Liu B. A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Brief Bioinform 2019;20:330–46.

22. Necci M, Piovesan D, Dosztányi Z, et al. A comprehensive assessment of long intrinsic protein disorder from the DisProt database. Bioinformatics 2018;34:445–52.

23. Mizianty MJ, Stach W, Chen K, et al. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 2010;26:i489–96.

24. Ovchinnikov S, Park H, Varghese N, et al. Protein structure determination using metagenome sequence data. Science 2017;355:294–8.

25. Lyu H, Sha N, Qin S, et al. Manifold denoising by nonlinear robust principal component analysis. Advances in Neural Information Processing Systems 2019;32:13390–400.

26. Heinzinger M, Elnaggar A, Wang Y, et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019;20:1–17.

27. Rao R, Meier J, Sercu T, et al. Transformer protein language models are unsupervised structure learners. bioRxiv 2020:2020.12.15.422761.

28. Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27.

29. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci 2021;118:e2016239118.

30. Unsal S, Atas H, Albayrak M, et al. Learning functional properties of proteins with language models. Nature Machine Intelligence 2022;4:227–45.

31. Singh J, Paliwal K, Litfin T, et al. Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. Sci Rep 2022;12:1–9.

32. Singh J, Litfin T, Singh J, et al. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Bioinformatics 2022;38:1888–94.

33. Yuan Q, Chen S, Wang Y, et al. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning. Brief Bioinform 2022;23:bbac444.

34. Hanson J, Paliwal K, Zhou Y. Accurate single-sequence prediction of protein intrinsic disorder by an ensemble of deep recurrent and convolutional architectures. J Chem Inf Model 2018;58:2369–76.

35. Vucetic S, Obradovic Z, Vacic V, et al. DisProt: a database of protein disorder. Bioinformatics 2005;21:137–40.

36. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.

37. Sirota FL, Ooi H-S, Gattermayer T, et al. Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics 2010;11:1–17.

38. Piovesan D, Tabaro F, Paladin L, et al. MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Res 2018;46:D471–6.

39. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 2020;21:1–67.

40. Suzek BE, Huang H, McGarvey P, et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007;23:1282–8.

41. Remmert M, Biegert A, Hauser A, et al. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 2012;9:173–5.

42. Mirdita M, von den Driesch L, Galiez C, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 2017;45:D170–6.

43. Heffernan R, Yang Y, Paliwal K, et al. Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 2017;33:2842–9.

44. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems 2017;30:5998–6008.

45. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, p. 770–8.

46. Ba JL, Kiros JR, Hinton GE. Layer normalization. Stat 2016;1050:21.

47. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.

48. Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 2019;32:8026–37.

49. Mészáros B, Erdős G, Dosztányi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res 2018;46:W329–37.

50. Høie MH, Kiehl EN, Petersen B, et al. NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res 2022;50:W510–5.

51. Sigler PB. Acid blobs and negative noodles. Nature 1988;333:210–2.

52. Yang ZR, Thomson R, McNeil P, et al. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics 2005;21:3369–76.

53. Mizianty MJ, Peng Z, Kurgan L. MFDp2: accurate predictor of disorder in proteins by fusion of disorder probabilities, content and profiles. Intrinsically Disordered Proteins 2013;1:e24428.

54. Jones DT, Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 2015;31:857–63.

55. Necci M, Piovesan D, Dosztányi Z, et al. MobiDB-lite: fast and highly specific consensus prediction of intrinsic disorder in proteins. Bioinformatics 2017;33:1402–4.

56. Necci M, Piovesan D, Tosatto SC. Critical assessment of protein intrinsic disorder prediction. Nat Methods 2021;18:472–81.

57. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res 2015;43:D204–12.

58. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.

59. Raudvere U, Kolberg L, Kuzmin I, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res 2019;47:W191–8.

60. Radhakrishnan I, Pérez-Alvarado GC, Parker D, et al. Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell 1997;91:741–52.

61. Wright PE, Dyson HJ. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol 1999;293:321–31.

62. Tompa P, Fuxreiter M. Fuzzy complexes: polymorphism and structural disorder in protein–protein interactions. Trends Biochem Sci 2008;33:2–8.

63. Bellay J, Han S, Michaut M, et al. Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol 2011;12:1–15.

64. Colak R, Kim T, Michaut M, et al. Distinct types of disorder in the human proteome: functional implications for alternative splicing. PLoS Comput Biol 2013;9:e1003030.

65. Xue B, Dunker AK, Uversky VN. Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. Journal of Biomolecular Structure and Dynamics 2012;30:137–49.

66. Peng Z, Yan J, Fan X, et al. Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci 2015;72:137–51.

67. Uversky VN, Oldfield CJ, Dunker AK. Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys 2008;37:215–46.

68. Shigemitsu Y, Hiroaki H. Common molecular pathogenesis of disease-related intrinsically disordered proteins revealed by NMR analysis. The Journal of Biochemistry 2018;163:11–8.

69. Uversky VN, Oldfield CJ, Dunker AK. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. Journal of Molecular Recognition 2005;18:343–84.

70. Iakoucheva LM, Brown CJ, Lawson JD, et al. Intrinsic disorder in cell-signaling and cancer-associated proteins. J Mol Biol 2002;323:573–84.

71. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.

72. Yuan Q, Chen S, Rao J, et al. AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Brief Bioinform 2022;23:bbab564.

73. Lin Z, Akin H, Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022.

74. Yuan Q, Chen J, Zhao H, et al. Structure-aware protein–protein interaction site prediction using deep graph convolutional network. Bioinformatics 2021;38:125–32.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights)