Yidong Song, Qianmu Yuan, Sheng Chen, Ken Chen, Yaoqi Zhou, Yuedong Yang, Fast and accurate protein intrinsic disorder prediction by using a pretrained language model, Briefings in Bioinformatics, Volume 24, Issue 4, July 2023, bbad173, https://doi.org/10.1093/bib/bbad173
Abstract
Determining intrinsically disordered regions of proteins is essential for elucidating protein biological functions and the mechanisms of their associated diseases. As the gap between the number of experimentally determined protein structures and the number of protein sequences continues to grow exponentially, there is a need for an accurate and computationally efficient disorder predictor. However, current single-sequence-based methods are of low accuracy, while evolutionary profile-based methods are computationally intensive. Here, we propose LMDisorder, a fast and accurate protein disorder predictor that employs embeddings generated by an unsupervised pretrained language model as features. We show that LMDisorder performs best among single-sequence-based methods and is comparable to or better than another language-model-based technique on four independent test sets. Furthermore, LMDisorder shows equivalent or even better performance than the state-of-the-art profile-based technique SPOT-Disorder2. In addition, the high computational efficiency of LMDisorder enabled a proteome-scale analysis of human proteins, showing that proteins with high predicted disorder content are associated with specific biological functions. The datasets, source codes and trained model are available at https://github.com/biomed-AI/LMDisorder.
Introduction
Proteins were long thought to require a stable tertiary structure to perform their functions. However, Romero et al. [1] suggested that about 15 000 proteins or long protein regions in SwissProt [2] do not have a well-defined 3D structure. These intrinsically disordered proteins or protein regions (IDPs/IDRs) have enriched our understanding of the sequence–structure–function relationship of proteins. For example, lacking a rigid structure, IDPs/IDRs can interconvert among a group of transient structural states [3]. Some functions directly require structural flexibility, such as linkers between structured domains, whereas other IDPs undergo disorder-to-order transitions when interacting with specific bioactive molecules and serve as specific chaperones for certain proteins. Such binding regions, or molecular recognition features, are found in modular proteins participating in signaling and regulation [4, 5]. The plasticity and flexibility of these functional modules allow IDPs/IDRs to adopt varying conformations to interact with different partners in signal transmission, assembly and regulation, resulting in their involvement in many human diseases, including cardiovascular disease, cancer, various genetic diseases and neurodegenerative diseases [6]. The multi-structure states of IDPs [3, 7] may be evolutionarily advantageous, as suggested by the higher degree of intrinsic disorder in natural proteins than in random sequences [8]. Thus, identifying IDPs/IDRs is crucial for understanding and resolving the causes and impacts of these unstructured states [9].
Several experimental techniques have been employed to identify IDPs/IDRs, including X-ray crystallography, nuclear magnetic resonance (NMR) and circular dichroism (CD) [7, 10]. However, because these experiments are costly, laborious and time-consuming, computational tools for predicting IDRs/IDPs from protein sequences have been an active area of research for the past 20 years.
IDP prediction was initially based on small-scale machine learning models [11] such as neural networks and support vector machines. As more data and more powerful tools emerged, deep learning models came to the forefront in methods such as ESpritz [12], SPOT-Disorder [13], NetSurfP-2.0 [14], AUCpreD [15], SPINE-D [16] and SPOT-Disorder2 [17]. Depending on the input, methods for protein disorder prediction can be divided into two categories: those based on a single sequence and those based on sequence profiles derived from multiple sequence alignments. Single-sequence-based methods leverage the composition and connectivity of the protein sequence along with derived information such as statistical potentials and amino acid propensities [18–20]. These methods allow fast computation but are less accurate than those using evolutionary sequence profiles [21, 22], because sequence conservation is an important indicator of structure and function. Examples of profile-based methods are SPINE-D [16], MFDp2 [23], AUCpreD [15], SPOT-Disorder [13] and SPOT-Disorder2 [17]. However, the exponential increase in the size of protein sequence libraries makes profile generation increasingly computationally intensive. Moreover, most proteins (>90%) do not belong to large sequence clusters [24], so their sequence profiles may not be very informative.
Recently, unsupervised pretrained language models have been applied to extract features from protein sequences [25–28], showing very promising results in downstream prediction tasks such as tertiary contact, ontology-based protein function, secondary structure, contact map and mutational effect prediction [28–33]. These breakthroughs inspired us to exploit language models for disorder prediction. A recent study [30] evaluated the performance of different models for protein representation, indicating that ProtTrans achieved the best performance in most tasks. However, ProtTrans had not yet been used for protein disorder prediction. In this work, we employed the ProtTrans language model to predict the intrinsically disordered regions of proteins.
Here, we present LMDisorder, an alignment-free, sequence-based predictor of protein intrinsic disorder. Using a recently released pretrained language model [34], LMDisorder generates informative sequence representations quickly and makes accurate predictions. Specifically, we first leverage the pretrained language model ProtTrans to produce the sequence embedding. Transformer networks are then employed to capture sequence patterns, including long-range dependencies between residues, followed by a fully connected layer that predicts per-residue disorder probabilities. LMDisorder was shown to be superior or comparable to profile-based methods, including SPOT-Disorder2, on different datasets, while being as computationally efficient as single-sequence-based techniques. This high computational efficiency enabled a proteome-scale analysis of human proteins, showing that proteins with high predicted disorder content are associated with specific biological functions.
MATERIALS AND METHODS
Datasets
In our experiments, the datasets were obtained from previous studies [17, 34], as shown in Table 1. Briefly, we obtained 4229 protein sequences (DM4229), comprising 72 fully disordered chains from DisProt v5.0 [35] and 4157 high-resolution X-ray crystallography structures from the PDB (prior to 5 August 2003). These proteins were randomly divided into a training set (2700 proteins), a validation set (300 proteins) and a test set (1229 proteins, DM1229). According to BLASTClust [36], the sequence similarity among these proteins is <25%. We further employed four independent test datasets: SL329, DisProt228, Mobi9230 and DisProt452. The SL329 dataset was selected from the SL477 protein set released by Sirota et al. [37]; removing the overlap between DM4229 and SL477 yielded an independent test set of 329 non-redundant sequences (SL329). DisProt228 is a subset of DisProt Complement [22] and consists of proteins in DisProt database v7.0. Mobi9230 was obtained from MobiDB [38] with experimental annotations, which derive from manually curated databases and primary data (e.g. PDB structures). We also constructed a novel independent test dataset, DisProt452, consisting of proteins deposited in the DisProt database from June 2020 to December 2022; after removing long proteins (>700 residues) and proteins homologous to our training and validation sets (25% sequence identity cutoff with BLASTClust [36]), 452 proteins remained. In addition, we created subsets of SL329 and DM1229, named SL250 and Test1185 respectively, which are the same datasets employed previously by SPOT-Disorder and SPOT-Disorder2 [17]. Table 1 details the number of proteins, the numbers of ordered and disordered residues, and the percentage of disordered residues in each dataset.
Table 1. The number of proteins, ordered and disordered residues, and the percentage of disorder in each dataset

| Dataset | No. of proteins | No. of ordered residues | No. of disordered residues | Percentage of disorder |
| --- | --- | --- | --- | --- |
| Validation | 300 | 61 231 | 6083 | 9.04% |
| DM1229 | 1229 | 276 748 | 29 082 | 9.51% |
| SL329 | 329 | 51 292 | 39 544 | 43.53% |
| DisProt228 | 228 | 30 772 | 18 811 | 37.94% |
| Test1185 | 1185 | 246 616 | 26 515 | 9.71% |
| SL250 | 250 | 32 261 | 21 173 | 39.62% |
| Mobi9230 | 9230 | 2 011 126 | 828 642 | 29.18% |
Protein sequence features
Language model representation. LMDisorder utilizes the recent protein language model ProtT5-XL-U50 [34] (denoted ProtTrans) for feature extraction. ProtTrans is a transformer-based encoder–decoder model, T5 [39], pretrained in a self-supervised manner on UniRef50 [40] by learning to predict masked amino acids. We extracted the hidden states of the last layer of the ProtTrans encoder as sequence features, giving an $n\times 1024$ matrix ($n$ is the sequence length). We also examined another pretrained model, ESM-1b [29] (denoted ESM), a transformer likewise pretrained on UniRef50; the ESM features form an $n\times 1280$ matrix. The inference cost of both ProtTrans and ESM is low, and feature extraction for our whole benchmark (~4800 sequences) completes within 10 minutes on an Nvidia GeForce RTX 3090 GPU.
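For readers who want to reproduce this step, the following is a minimal sketch of extracting per-residue ProtT5 embeddings with the HuggingFace transformers library. The checkpoint name (Rostlab/prot_t5_xl_uniref50) and the preprocessing (space-separated residues, rare amino acids mapped to X) follow the public ProtTrans release; this is an illustration, not necessarily the authors' exact pipeline.

```python
# Minimal sketch: per-residue ProtT5 embeddings via HuggingFace transformers.
# Assumes the public Rostlab checkpoint; not necessarily the authors' pipeline.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").to(device).eval()

def embed(sequence: str) -> torch.Tensor:
    """Return an (n, 1024) embedding matrix for one protein sequence."""
    # ProtT5 expects space-separated residues; rare amino acids map to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))
    inputs = tokenizer(seq, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n+1, 1024)
    return hidden[0, : len(sequence)]  # drop the trailing special token

emb = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(emb.shape)  # torch.Size([33, 1024])
```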
Evolutionary information (for ablation study). We also tested evolution-derived features, namely the position-specific scoring matrix (PSSM) and hidden Markov model (HMM) profile, in feature ablation studies. The PSSM was obtained by running PSI-BLAST [36] against UniRef90 [40] with three iterations and an E-value of 0.001. HHblits [41] was employed to generate the HMM profile by aligning the query sequence against UniClust30 [42] with default parameters. Each residue is encoded as a 20-dimensional feature vector in both PSSM and HMM.
Predicted structural properties (for ablation study). Putative structural properties were generated using SPIDER3 [43], whose inputs include the protein sequence, PSSM profiles and HMM profiles. We extracted four types of structural features from the outputs of SPIDER3: (1) solvent accessible surface area; (2) the sine and cosine values of the four protein backbone torsion angles (θ, ϕ, ψ and τ); (3) half-sphere exposures, i.e. the numbers of spatially neighboring residues in the top and bottom halves of the contacting sphere, where the boundary between the two hemispheres is determined by the Cα–Cα and Cα–Cβ direction vectors; and (4) predicted probabilities of the three secondary structure classes (α-helix, β-sheet and random coil). Each residue was thus encoded into a 14-dimensional vector by SPIDER3. Values in PSSM, HMM and SPIDER features were further normalized to scores between 0 and 1 using min–max normalization:

$$v^{\prime}=\frac{v-\mathrm{Min}}{\mathrm{Max}-\mathrm{Min}},$$

where $v$ is the original feature value, and Min and Max are the smallest and largest values of this feature type observed in the training set.
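As a small illustration (not the authors' code), the normalization can be implemented so that Min and Max are estimated on the training set only and then reused for the validation and test features:

```python
# Min-max normalization with statistics taken from the training set only
# (as the text specifies), then applied to any split.
import numpy as np

def fit_min_max(train_feats: np.ndarray):
    # Per-feature min/max over all training residues, each of shape (d,)
    return train_feats.min(axis=0), train_feats.max(axis=0)

def apply_min_max(feats: np.ndarray, fmin, fmax) -> np.ndarray:
    return (feats - fmin) / (fmax - fmin + 1e-8)  # epsilon guards constant features

train = np.random.rand(1000, 40)            # e.g. concatenated PSSM+HMM features
fmin, fmax = fit_min_max(train)
test_norm = apply_min_max(np.random.rand(50, 40), fmin, fmax)
```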
The architecture of LMDisorder
The overall architecture of LMDisorder is shown in Figure 1. First, the protein sequence is input into the pretrained language model ProtTrans to yield the sequence embedding, which is augmented with Gaussian noise during training to avoid overfitting. Transformer networks are then employed to capture the sequence patterns, including long-range dependencies between residues. Finally, a fully connected layer predicts the per-residue disorder probabilities.

The architecture of LMDisorder. A protein sequence is input into ProtTrans to produce the sequence embedding. The embedding is augmented by Gaussian noise and further encoded through the transformer networks to capture the long-range sequence patterns. The encoded representation is input to a fully connected layer for predicting disorder probabilities.
Transformer networks
We stack $N$ identical transformer [44] layers to learn the sequence representations. Each transformer layer has two sub-layers: a multi-head self-attention module and a position-wise feed-forward network. A residual connection [45] is employed around each of the two sub-layers, followed by layer normalization [46]. Let $\boldsymbol{H}\in\mathbb{R}^{n\times d}$ denote the input of the self-attention module, where $n$ is the sequence length and $d$ is the hidden dimension. The input of the $l$-th layer, $\boldsymbol{H}^{(l)}$, is projected by three matrices $\boldsymbol{W}_Q\in\mathbb{R}^{d\times d_K}$, $\boldsymbol{W}_K\in\mathbb{R}^{d\times d_K}$ and $\boldsymbol{W}_V\in\mathbb{R}^{d\times d_V}$ to obtain the corresponding query, key and value representations $\boldsymbol{Q}$, $\boldsymbol{K}$ and $\boldsymbol{V}$:

$$\boldsymbol{Q}=\boldsymbol{H}^{(l)}\boldsymbol{W}_Q,\qquad \boldsymbol{K}=\boldsymbol{H}^{(l)}\boldsymbol{W}_K,\qquad \boldsymbol{V}=\boldsymbol{H}^{(l)}\boldsymbol{W}_V.$$

The scaled dot-product self-attention is then calculated as:

$$\boldsymbol{A}=\operatorname{softmax}\!\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d_K}}\right),\qquad \operatorname{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\boldsymbol{A}\boldsymbol{V},$$

where $\sqrt{d_K}$ is a scaling factor and $\boldsymbol{A}$ is a matrix capturing the similarities between queries and keys. Instead of performing a single attention function, multi-head attention is employed to jointly attend to information from different representation subspaces at different positions: the queries, keys and values are linearly projected $h$ times, the attention function is performed in parallel, and the outputs are concatenated. In this work, $d_K=d_V=d/h$.
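For concreteness, here's a compact PyTorch sketch of the multi-head scaled dot-product self-attention described above. It implements the standard operation from [44]; the dimensions d = 128 and h = 4 follow the hyperparameters reported later, but the code itself is illustrative rather than the authors' implementation.

```python
# Sketch of multi-head scaled dot-product self-attention (standard form,
# not the authors' implementation). d must be divisible by h.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0
        self.h, self.dk = h, d // h            # d_K = d_V = d / h
        self.wq = nn.Linear(d, d, bias=False)  # stacks the h query projections
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.wo = nn.Linear(d, d, bias=False)  # output projection after concat

    def forward(self, H: torch.Tensor) -> torch.Tensor:  # H: (batch, n, d)
        b, n, d = H.shape
        split = lambda x: x.view(b, n, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(self.wq(H)), split(self.wk(H)), split(self.wv(H))
        A = torch.softmax(Q @ K.transpose(-2, -1) / self.dk**0.5, dim=-1)
        out = (A @ V).transpose(1, 2).reshape(b, n, d)    # concatenate heads
        return self.wo(out)

attn = MultiHeadSelfAttention(d=128, h=4)  # hyperparameters used in this work
print(attn(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 50, 128])
```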
Fully connected layer and noise augmentation
The output of the last transformer layer is input to a fully connected layer to predict the disorder probabilities for all $n$ amino acid residues of the protein:

$$\boldsymbol{Y}^{\prime}=\sigma\!\left(\boldsymbol{H}^{(N+1)}\boldsymbol{W}+b\right),$$

where $\boldsymbol{H}^{(N+1)}\in\mathbb{R}^{n\times d}$ is the output of the $N$-th transformer layer, $\boldsymbol{W}\in\mathbb{R}^{d\times 1}$ is a learnable weight matrix, $b\in\mathbb{R}$ is a bias term and $\boldsymbol{Y}^{\prime}\in\mathbb{R}^{n\times 1}$ contains the predictions for the $n$ residues. The sigmoid function $\sigma$ normalizes the output of the fully connected layer into disorder probabilities between 0 and 1. In addition, to avoid overfitting, the sequence features from ProtTrans are augmented with Gaussian noise before being fed to the transformer networks during training:

$$\boldsymbol{H}^{(0)}\leftarrow \boldsymbol{H}^{(0)}+\varepsilon\boldsymbol{X},$$

where $\boldsymbol{H}^{(0)}$ is the sequence feature matrix from ProtTrans, $\boldsymbol{X}$ is a matrix of random values drawn from the standard normal distribution with the same size as $\boldsymbol{H}^{(0)}$, and $\varepsilon$ is a hyperparameter regulating the strength of the augmentation.
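Putting the pieces together, the following is a minimal PyTorch sketch of an LMDisorder-style network: Gaussian-noise augmentation applied during training, a small transformer encoder and a sigmoid-activated fully connected head. The input projection from 1024 to 128 dimensions is our assumption (some such mapping is implied by the 128 hidden units reported below); this is an illustrative reconstruction, not the released code.

```python
import torch
import torch.nn as nn

class DisorderNet(nn.Module):
    """Illustrative LMDisorder-style model: ProtTrans features -> transformer -> FC."""
    def __init__(self, d_in=1024, d=128, heads=4, layers=2, dropout=0.3, eps=0.05):
        super().__init__()
        self.eps = eps
        self.proj = nn.Linear(d_in, d)  # assumed mapping of 1024-dim features to d
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.fc = nn.Linear(d, 1)

    def forward(self, H0: torch.Tensor) -> torch.Tensor:  # H0: (batch, n, 1024)
        if self.training:  # Gaussian-noise augmentation, training only
            H0 = H0 + self.eps * torch.randn_like(H0)
        H = self.encoder(self.proj(H0))
        return torch.sigmoid(self.fc(H)).squeeze(-1)  # (batch, n) disorder probs

model = DisorderNet()
probs = model(torch.randn(2, 100, 1024))
```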
Implementation details
We used the training set to train LMDisorder and the validation set to evaluate its generalization to unseen proteins. Training lasted at most 50 epochs, and the model from the epoch with the best validation performance was chosen as the final model. The validation performance was optimized by choosing the best feature combination and searching all hyperparameters via grid search. Specifically, we employed a 2-layer transformer network with 128 hidden units and the following hyperparameters: h = 4, ε = 0.05 and a batch size of 12. The dropout rate was set to 0.3 to avoid overfitting. We used the Adam optimizer [47] with a learning rate of 3 × 10−4 to minimize the binary cross-entropy loss. The model was implemented in PyTorch 1.7.1 [48].
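A condensed sketch of this training setup (Adam, binary cross-entropy, model selection by validation performance) is shown below; the data loaders and the validation-AUC helper are hypothetical placeholders.

```python
# Condensed training-loop sketch matching the reported setup; the data
# loaders and any padding/masking logic are hypothetical placeholders.
import torch
import torch.nn as nn

model = DisorderNet()  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.BCELoss()

best_auc, best_state = 0.0, None
for epoch in range(50):                      # at most 50 epochs
    model.train()
    for feats, labels in train_loader:       # hypothetical loader, batch size 12
        optimizer.zero_grad()
        loss = criterion(model(feats), labels.float())
        loss.backward()
        optimizer.step()
    auc = evaluate_auc(model, val_loader)    # hypothetical validation-AUC helper
    if auc > best_auc:                       # keep epoch with best validation AUC
        best_auc, best_state = auc, model.state_dict()
model.load_state_dict(best_state)
```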
Evaluation metrics
Similar to previous studies [17, 34], we employed the area under the receiver operating characteristic curve ($\text{AUC}_{\text{ROC}}$), precision (Pr), sensitivity (Se), specificity (Sp), the area under the precision–recall curve ($\text{AUC}_{\text{PR}}$), Matthews correlation coefficient (MCC) and the weighted score Sw (Sw = sensitivity + specificity − 1) to evaluate the performance of the developed method:

$$\text{Pr}=\frac{\text{TP}}{\text{TP}+\text{FP}},\qquad \text{Se}=\frac{\text{TP}}{\text{TP}+\text{FN}},\qquad \text{Sp}=\frac{\text{TN}}{\text{TN}+\text{FP}},$$

$$\text{MCC}=\frac{\text{TP}\times\text{TN}-\text{FP}\times\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}},$$

where TN, TP, FN and FP denote the numbers of true negatives, true positives, false negatives and false positives, respectively.
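All of these metrics are readily computed with scikit-learn; the following illustrative snippet (not the authors' evaluation script) shows one way to obtain them from per-residue labels and predicted probabilities. Note that average_precision_score is used as the usual estimator of AUCPR.

```python
# Illustrative metric computation with scikit-learn; y_true holds per-residue
# disorder labels (1 = disordered) and y_prob the predicted probabilities.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             matthews_corrcoef, recall_score)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.2, 0.9])
y_pred = (y_prob >= 0.5).astype(int)            # threshold for binary metrics

auc_roc = roc_auc_score(y_true, y_prob)
auc_pr = average_precision_score(y_true, y_prob)
mcc = matthews_corrcoef(y_true, y_pred)
se = recall_score(y_true, y_pred)               # sensitivity
sp = recall_score(y_true, y_pred, pos_label=0)  # specificity
sw = se + sp - 1                                # weighted score Sw
print(f"AUCROC={auc_roc:.3f} AUCPR={auc_pr:.3f} MCC={mcc:.3f} Sw={sw:.3f}")
```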
RESULTS AND DISCUSSION
Feature importance and ablation analysis
The results of LMDisorder on the validation set and two independent test sets are shown in Table 2. The fractions of disordered residues differ considerably across datasets: disordered residues make up only 9.0% of all residues in the validation set and 9.5% in DM1229, compared with 43.5% in SL329 (Table 1). As $\text{AUC}_{\text{ROC}}$ is a balanced metric, similar performance on two datasets that differ significantly in disorder content indicates the robustness of our method. The $\text{AUC}_{\text{ROC}}$ values of LMDisorder on the validation set and DM1229 are 0.940 and 0.920, and the $\text{AUC}_{\text{PR}}$ values are 0.732 and 0.705, respectively. Although disordered residues are much more abundant in the SL329 set, LMDisorder still achieved similar performance, with an $\text{AUC}_{\text{ROC}}$ of 0.911 and an $\text{AUC}_{\text{PR}}$ of 0.909.
Table 2. The performance of LMDisorder on the validation set and two independent test sets according to AUCROC, AUCPR, MCC, Pr, Se and Sp, along with the numbers of ordered and disordered residues

| Dataset | AUCROC | AUCPR | MCC | Pr | Se | Sp | # Ordered | # Disordered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Validation | 0.940 | 0.732 | 0.646 | 0.746 | 0.598 | 0.980 | 61 231 | 6083 |
| DM1229 | 0.920 | 0.705 | 0.621 | 0.756 | 0.560 | 0.981 | 276 748 | 29 082 |
| SL329 | 0.911 | 0.909 | 0.655 | 0.947 | 0.625 | 0.973 | 51 292 | 39 544 |
To examine the importance of various features for disorder prediction, we performed feature ablation experiments. As shown in Figure 2 and Supplementary Table S1, the model using either the ProtTrans or the ESM language model as sequence features performs significantly better than the profile-based models, including PSSM+HMM and PSSM+HMM+SPIDER. The $\text{AUC}_{\text{ROC}}$ values of the model using ProtTrans on the validation set, DM1229 and SL250 are 0.940, 0.920 and 0.905, respectively, better than the model using ESM (0.922, 0.912 and 0.885). By comparison, the models using PSSM+HMM and PSSM+HMM+SPIDER features yield $\text{AUC}_{\text{ROC}}$ values of only 0.847, 0.840 and 0.806, and 0.893, 0.886 and 0.855, respectively. We noted that combining ProtTrans with the PSSM+HMM+SPIDER features did not improve the model ($\text{AUC}_{\text{ROC}}$ values of 0.938, 0.918 and 0.905 on the validation, DM1229 and SL250 sets, respectively), indicating that the pretrained ProtTrans model may already capture the evolutionary and structural information of proteins. Similarly, combining ProtTrans with ESM led only to minor changes, with $\text{AUC}_{\text{ROC}}$ values of 0.936, 0.922 and 0.906 on the three datasets. Finally, to examine the effect of the Gaussian noise, we tested the model without it: as shown in Supplementary Table S1, the $\text{AUC}_{\text{ROC}}$ and $\text{AUC}_{\text{PR}}$ dropped to 0.917 and 0.692, decreases of 2.4 and 5.5%, respectively, demonstrating the usefulness of the Gaussian-noise augmentation.

Comparison of different feature combinations as labeled on the Validation and two independent test sets (DM1229 and SL250). Method performance is measured by Area under ROC and PR curves (AUCROC and AUCPR).
Comparison with single-sequence-based methods
We compared LMDisorder with six single-sequence-based methods (ESpritz-N, ESpritz-D, ESpritz-X, IUPred2A-short [49], IUPred2A-long and SPOT-Disorder-Single [34]) on DM1229 and SL329. In addition, we compared with NetSurfP-3.0 [50], which is based on the ESM-1b language model. Single-sequence-based methods employ sequence-derived features such as one-hot encodings, whereas in this work we obtained a rich protein representation from the ProtTrans language model, which proves effective for protein disorder prediction. Figure 3 compares the ROC curves, and Supplementary Figures S1 and S2 compare the precision–recall curves, produced by these methods on DM1229 and SL329, with performance indicators listed in Table 3. Among all methods compared, LMDisorder achieved the highest $\text{AUC}_{\text{ROC}}$ and MCC values on both datasets, with NetSurfP-3.0 second-best. On SL329, LMDisorder achieved the best results, with an $\text{AUC}_{\text{ROC}}$ of 0.911, $\text{AUC}_{\text{PR}}$ of 0.910 and MCC of 0.655, which are 1.0, 1.4 and 5.5% higher than NetSurfP-3.0, respectively. On DM1229, LMDisorder and NetSurfP-3.0 performed comparably: the $\text{AUC}_{\text{ROC}}$, $\text{AUC}_{\text{PR}}$ and MCC of LMDisorder are 0.920, 0.705 and 0.621, respectively, versus 0.918, 0.732 and 0.620 for NetSurfP-3.0. It is worth noting that SPOT-Disorder-Single achieved the third-best results on both datasets. On DM1229, its $\text{AUC}_{\text{ROC}}$, $\text{AUC}_{\text{PR}}$ and MCC are 0.868, 0.599 and 0.518, which are 5.6, 15 and 17% lower than LMDisorder, respectively; on SL329 they are 2.6, 2.6 and 7.8% lower. For a more comprehensive comparison, we also evaluated these single-sequence-based methods on Mobi9230, where our model again performed best (Supplementary Table S3): the $\text{AUC}_{\text{ROC}}$ and $\text{AUC}_{\text{PR}}$ of LMDisorder are 0.863 and 0.800, while those of the second-best method (NetSurfP-3.0) are 0.846 and 0.800. Additionally, on the novel independent dataset DisProt452 (Supplementary Table S4), LMDisorder again has the best performance, with an $\text{AUC}_{\text{ROC}}$ of 0.875 and $\text{AUC}_{\text{PR}}$ of 0.854, compared with 0.840 and 0.817 for the second-best method (SPOT-Disorder-Single). The consistent performance across multiple datasets further confirms the robustness of our model.

The receiver operating characteristic curves given by LMDisorder and other single-sequence-based methods on the (a) DM1229 and (b) SL329 sets.
Table 3. Performance comparison of LMDisorder with single-sequence-based methods on the independent test sets DM1229 and SL329 according to AUCROC, AUCPR, MCC, Pr, Se and Sp. Bold fonts indicate the best results

| Dataset | Method | AUCROC | AUCPR | MCC | Pr | Se | Sp |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DM1229 | ESpritz-X | 0.852 | 0.565 | 0.503 | 0.640 | 0.459 | 0.973 |
| | ESpritz-N | 0.802 | 0.480 | 0.431 | 0.619 | 0.358 | 0.977 |
| | ESpritz-D | 0.764 | 0.238 | 0.277 | 0.240 | 0.557 | 0.814 |
| | IUPred2A-short | 0.804 | 0.475 | 0.439 | 0.578 | 0.403 | 0.969 |
| | IUPred2A-long | 0.742 | 0.372 | 0.321 | 0.542 | 0.243 | 0.978 |
| | SPOT-Disorder-Single | 0.868 | 0.599 | 0.518 | 0.707 | 0.432 | **0.981** |
| | NetSurfP-3.0 | 0.918 | **0.732** | 0.620 | 0.704 | **0.644** | 0.971 |
| | LMDisorder | **0.920** | 0.705 | **0.621** | **0.756** | 0.560 | **0.981** |
| SL329 | ESpritz-X | 0.842 | 0.831 | 0.543 | 0.843 | 0.588 | 0.916 |
| | ESpritz-N | 0.826 | 0.816 | 0.473 | 0.809 | 0.613 | 0.889 |
| | ESpritz-D | 0.863 | 0.829 | 0.608 | 0.873 | 0.620 | 0.931 |
| | IUPred2A-short | 0.830 | 0.811 | 0.506 | 0.799 | 0.649 | 0.874 |
| | IUPred2A-long | 0.838 | 0.833 | 0.552 | 0.812 | 0.648 | 0.884 |
| | SPOT-Disorder-Single | 0.887 | 0.886 | 0.604 | 0.939 | 0.563 | 0.972 |
| | NetSurfP-3.0 | 0.902 | 0.897 | 0.621 | 0.864 | **0.773** | 0.912 |
| | LMDisorder | **0.911** | **0.910** | **0.655** | **0.947** | 0.625 | **0.973** |
Comparison with state-of-the-art profile-based methods
We compared LMDisorder with 12 profile-based methods [51] on DisProt228, including s2D [34], AUCpreD, JRONN [52], ESpritz-D, MFDp2 [53], DISOPRED [54], MobiDB-lite [55], ESpritz-N, SPINE-D, MFDp [23], SPOT-Disorder and SPOT-Disorder2. Among these, SPOT-Disorder2 is a state-of-the-art profile-based method with the best performance in the CAID assessment [56]. Profile-based methods [16, 17] typically use evolutionary sequence profiles obtained from multiple sequence alignments [21, 22], mostly generated by HHblits [41] and PSI-BLAST [36]. In general, profile-based methods are more accurate than single-sequence-based ones, although they take much more time to compute their input features. As shown in Table 4, LMDisorder improved over all the compared profile-based methods except SPOT-Disorder2, with which it achieved similar performance: the $\text{AUC}_{\text{ROC}}$, $\text{AUC}_{\text{PR}}$, MCC and Sw of LMDisorder are 0.800, 0.700, 0.471 and 0.484, respectively, compared with 0.809, 0.716, 0.499 and 0.507 for SPOT-Disorder2.
Table 4. Performance comparison of LMDisorder with profile-based methods on the DisProt228 dataset based on AUCROC, AUCPR, MCC and Sw. Bold and italic fonts indicate the best and second-best results, respectively

| Method | AUCROC | AUCPR | MCC | Sw |
| --- | --- | --- | --- | --- |
| s2D | 0.727 | 0.625 | 0.267 | 0.272 |
| AUCpreD | 0.748 | 0.312 | 0.434 | 0.436 |
| JRONN | 0.753 | 0.654 | 0.379 | 0.388 |
| ESpritz-D (prof) | 0.759 | 0.645 | 0.379 | 0.364 |
| MFDp2 | 0.768 | 0.594 | 0.371 | 0.375 |
| DISOPRED | 0.771 | 0.608 | 0.406 | 0.387 |
| MobiDB-lite | 0.772 | 0.596 | 0.422 | 0.360 |
| IUPred2A-long | 0.772 | 0.694 | 0.418 | 0.429 |
| IUPred2A-short | 0.774 | 0.674 | 0.425 | 0.437 |
| ESpritz-N (prof) | 0.776 | 0.674 | 0.432 | 0.440 |
| MFDp | 0.776 | 0.603 | 0.357 | 0.366 |
| SPINE-D | 0.786 | 0.684 | 0.423 | 0.436 |
| SPOT-Disorder | 0.792 | 0.666 | 0.462 | 0.465 |
| LMDisorder | *0.800* | *0.700* | *0.471* | *0.484* |
| SPOT-Disorder2 | **0.809** | **0.716** | **0.499** | **0.507** |
We further compared the top three methods from Table 4 (SPOT-Disorder, LMDisorder and SPOT-Disorder2) on two additional, larger datasets. Table 5 shows the comparison on Test1185 and SL250, where LMDisorder performs best on both test sets. On Test1185, the $\text{AUC}_{\text{ROC}}$ and $\text{AUC}_{\text{PR}}$ of LMDisorder are 0.920 and 0.709, respectively, which are 3.3 and 9.1% higher than those of SPOT-Disorder and 0.6 and 2% higher than those of SPOT-Disorder2. Similarly, the $\text{AUC}_{\text{ROC}}$ and $\text{AUC}_{\text{PR}}$ of LMDisorder on SL250 are 0.905 and 0.890, which are 1.3 and 1.7% higher than those of SPOT-Disorder and 0.4 and 0.1% higher than those of SPOT-Disorder2. Thus, LMDisorder matches the performance of the current state-of-the-art profile-based method at a fraction of the computational time.
Table 5. Performance comparison of the top three methods from Table 4 on the independent test sets Test1185 and SL250. Bold fonts indicate the best results

| Dataset | Method | AUCROC | AUCPR | MCC | Sw |
| --- | --- | --- | --- | --- | --- |
| Test1185 | SPOT-Disorder | 0.894 | 0.650 | 0.567 | 0.477 |
| | SPOT-Disorder2 | 0.914 | 0.698 | 0.607 | 0.676 |
| | LMDisorder | **0.920** | **0.709** | **0.624** | **0.692** |
| SL250 | SPOT-Disorder | 0.893 | 0.875 | 0.629 | 0.567 |
| | SPOT-Disorder2 | 0.901 | 0.889 | **0.679** | 0.625 |
| | LMDisorder | **0.905** | **0.890** | **0.679** | **0.634** |
IDPs are associated with specific biological functions
Since LMDisorder is much faster than profile-based methods, we used it to predict IDRs/IDPs in all human proteins in UniProt [57]. Previous studies have shown that IDRs/IDPs play essential roles in many biological processes; we therefore investigated whether the proteins with high predicted disorder content were associated with such functions. Specifically, we selected the proteins with a disorder content above 90% and conducted Gene Ontology (GO) [58] enrichment analysis using g:Profiler [59]. To set a significance threshold, we randomly selected the same number of proteins 200 times, performed the enrichment analysis on each random set, and took the lowest resulting FDR (P-value adjusted by the Benjamini–Hochberg false discovery rate), $1.0\times{10}^{-25}$, as the threshold. We found that the highly disordered proteins significantly enriched 5, 62 and 5 GO terms in the molecular function (MF), biological process (BP) and cellular component (CC) categories, respectively. Figure 4 shows the top-5 GO terms of the three categories; details are given in Supplementary Table S6. The proteins with high disorder content enriched many important biological functions, many of them supported by the literature: for instance, 12 terms related to transcription [51, 60, 61], three related to protein binding [62] and three related to regulation [6]. This indicates that LMDisorder provides biologically meaningful predictions, which may facilitate investigating the functions of proteins that lack rich functional annotations. The enrichment of several functions is consistent with previous analyses, including cell differentiation, development and regulation [63, 64]. We also discovered some new functions associated with strong intrinsic disorder (Supplementary Table S6), including RNA biosynthetic process ($\text{FDR}=1.6\times{10}^{-53}$), DNA-templated transcription ($\text{FDR}=5.2\times{10}^{-54}$) and RNA polymerase ($\text{FDR}=5.2\times{10}^{-54}$); further studies are required to confirm these findings. In addition, we performed a proteome-level analysis for selected proteins (Supplementary Table S5): we first selected proteins with less than 25% sequence similarity (using BLASTClust [36]) to the training set, predicted the disorder ratio of each protein, and then ranked the proteins by predicted disorder and examined the top-ranked ones. The proteins with high disorder content have characteristic functions, such as DNA binding, transcription regulation and cell division, consistent with the GO enrichment analysis.

DISCUSSION
In this paper, we developed a language-model-based method, LMDisorder, to predict IDRs/IDPs in protein sequences. IDPs and IDRs are abundant in all species [65, 66] and relevant to many diseases [67, 68]. Existing profile-based methods provide highly accurate predictions but are computationally too intensive for genome-scale prediction because of the cost of profile calculation. Here, we showed that using language model embeddings as input features achieves speed and accuracy at the same time. For a protein of 500 residues, SPOT-Disorder2 takes about 15 minutes on an Nvidia GeForce RTX 3090 GPU, whereas LMDisorder takes only one second on the same machine with essentially the same performance. Most other single-sequence-based methods we tested finished within 1 to 5 seconds, as they do not require the time-consuming generation of sequence profiles. Compared with NetSurfP-3.0, the state-of-the-art language-model-based method, LMDisorder achieved equivalent or better performance, further demonstrating its effectiveness. In addition, the accuracy of our method remained essentially unchanged across protein lengths (Supplementary Figure S3), further indicating its robustness.
One interesting observation is that proteins with high disorder content are associated with specific biological functions. Through the enrichment analysis, we found that these proteins significantly enriched 72 GO terms, many of them supported by previous literature [6, 51, 60, 61, 69, 70]. In addition, we discovered some new functions associated with strong intrinsic disorder, which may be of interest for further studies.
Another observation is that combining ProtTrans and ESM did not improve over ProtTrans alone. This is somewhat surprising, because a recent study indicated that their combination improved the prediction of protein secondary structure and other tertiary structural properties [31]. A possible explanation is that disordered regions, unlike structured regions, are less conserved, so the evolutionary information captured by different language models reflects different aspects of structural characteristics, whereas disorder prediction is less sensitive to small intrinsic differences between the language models. In the future, we plan to use the structural information predicted by AlphaFold2 [71, 72] or ESMFold [73] and to build models with graph networks [74] to further improve performance.
In summary, this work developed LMDisorder, a sequence-based, alignment-free method for protein disorder prediction. LMDisorder combines the advantages of current sequence-based and profile-based methods, greatly improving prediction speed while maintaining high prediction accuracy. The technique should be useful for accurate proteome-scale analyses.
Key Points

- LMDisorder is an alignment-free, single-sequence-based protein disorder predictor that employs embeddings generated by unsupervised pretrained language models as features, thus bypassing time-consuming database searches.
- LMDisorder is accurate and robust, performing better than state-of-the-art single-sequence-based and even profile-based approaches.
- The high computational efficiency of LMDisorder enables proteome-scale analysis of human proteins, showing that proteins with high predicted disorder content are associated with specific biological functions.
- LMDisorder employs transformer networks to capture sequence patterns, especially long-range dependencies between residues, which is useful for protein disorder prediction.
DATA AVAILABILITY
The datasets, the source codes and the trained model are available at https://github.com/biomed-AI/LMDisorder.
FUNDING
This study was supported by the National Key R&D Program of China [2022YFF1203100], the National Natural Science Foundation of China [12126610] and the supercomputing facilities of Shenzhen Bay Laboratory.
Author Biographies
Yidong Song is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network, protein disorder prediction and protein function prediction.
Qianmu Yuan is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein disorder prediction.
Sheng Chen is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein structure prediction.
Ken Chen is a Ph.D. student in the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural network and protein function prediction.
Yaoqi Zhou is a senior principal investigator in the Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China. His research interests include protein disorder prediction, protein design and protein/RNA structure prediction.
Yuedong Yang is a professor in the School of Computer Science and Engineering at Sun Yat-sen University. Currently he focuses on integrating HPC and AI techniques for multi-scale biomedical research.
REFERENCES
Lyu H, Sha N, Qin S, et al. Manifold denoising by nonlinear robust principal component analysis. Advances in Neural Information Processing Systems 2019.
Rao R, Meier J, Sercu T, et al. Transformer protein language models are unsupervised structure learners.
Yuan Q, Chen S, Wang Y, et al. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need.
Kingma DP, Ba J. Adam: a method for stochastic optimization.
Høie MH, Kiehl EN, Petersen B, et al. NetSurfP-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning.
Lin Z, Akin H, Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction.