Abstract

Protein phosphorylation is a reversible and ubiquitous post-translational modification that occurs primarily at serine, threonine and tyrosine residues and regulates a variety of biological processes. In this paper, we first briefly summarize current progress in the computational prediction of eukaryotic protein phosphorylation sites, which has focused mainly on animals and plants, especially human, and to a lesser extent on fungi. Since the number of identified fungi phosphorylation sites has greatly increased across a wide variety of organisms while their roles in pathophysiology remain largely unknown, the identification of fungi-specific phosphorylation has attracted increasing attention. Here, experimental fungi phosphorylation site data were collected, and most of the sites were classified into different types, encoded with various features and trained via a two-step feature optimization method. We developed PreSSFP, a novel method for the prediction of species-specific fungi phosphorylation, which can identify phosphorylation at serine, threonine and tyrosine residues in seven fungal species (http://computbiol.ncu.edu.cn/PreSSFP). We also critically evaluated the performance of PreSSFP and compared it with other existing tools. The results showed that PreSSFP is a robust predictor. Feature analyses revealed significant differences among the seven species. Species-specific prediction, with the two-step feature optimization method used to mine important features for training, considerably improved the prediction performance. We anticipate that our study provides a new lead for future computational analysis of fungi phosphorylation.

Introduction

Reversible phosphorylation [1], one of the most important post-translational modifications (PTMs), plays a key role in diverse cellular processes including cell division [2], apoptosis [3], immune response [4] and signal transduction [5]. Protein phosphorylation occurs primarily at serine (S), threonine (T) and tyrosine (Y) residues in substrate proteins. However, not every S, T or Y residue is phosphorylated. Thus, the efficient identification of phosphorylation sites is important for understanding intracellular mechanisms. To date, various experimental methods have been used to identify phosphorylation sites. In particular, the advent of high-throughput mass spectrometry (MS) [6] has led to the rapid accumulation of fungi phosphorylation data. Furthermore, there is much evidence that fungi phosphorylation also occurs in more complex organisms [7–9], but its abundance and role remain largely unknown, primarily owing to the lack of adequate analytical tools [10]. Indeed, since the discovery of phosphorylation in 1906 by Phoebus A. Levene [11], most research has analysed the pattern of phosphorylation sites in plants [12, 13], animals [14] and bacteria [15]; only limited advances have been made in the detection of fungi phosphorylation and the clarification of its biological function.

Figure 1. The overall framework of PreSSFP.

Compared with conventional experimental approaches, in silico prediction of phosphorylation sites has become a promising strategy that can serve as a preliminary analysis to screen potential targets for further confirmation. Although there are excellent methods for analysing protein phosphorylation at S/T/Y residues, few are suitable for the study of fungi phosphorylation. Previous research also indicates that understanding the molecular mechanisms of phosphorylated proteins in different fungal species is essential for understanding the regulation of fungal pathogenicity and fungal biology. For example, in Cryptococcus neoformans (C. neoformans), one of the essential factors for virulence is activation of the cyclic adenosine 5′-monophosphate (cAMP)-dependent protein kinase A (PKA) pathway. PKA is known to phosphorylate various downstream targets, including transcription factors such as Nrg1 and Rim101, through which the cAMP/PKA pathway modulates capsule size and melanin formation in C. neoformans. Identification of phosphoproteins in C. neoformans could serve as a discovery platform for establishing the molecular network that confers virulence in this species [16].

Here we used the Fungi Phosphorylation Database (FPD) provided by Bai et al. [17], which comprises 62 272 non-redundant phosphorylation sites in 11 222 proteins across eight fungal species: Aspergillus flavus (A. flavus) [18], Aspergillus nidulans (A. nidulans) [19], C. neoformans [16], Fusarium graminearum (F. graminearum) [20], Magnaporthe oryzae (M. oryzae) [21], Neurospora crassa (N. crassa) [22], Saccharomyces cerevisiae (S. cerevisiae) [23] and Schizosaccharomyces pombe (S. pombe) [23]. After preprocessing these data, we developed a computational tool, PreSSFP, which uses a two-step feature optimization method to predict species-specific fungi phosphorylation sites. First, five types of feature extraction strategies were used to encode the peptide fragments on the basis of sequence information, evolutionary information and physicochemical properties. Then we performed a two-step feature selection to remove redundant features and constructed the optimized model on the basis of 10-fold cross-validation. In comparison with other existing tools, PreSSFP exhibited competitive performance in predicting species-specific fungi phosphorylation sites. Classifying the fungi phosphorylation sites by species and by PTM site type for training could further improve the prediction performance. Taken together, PreSSFP is provided as an open-source tool, implemented in Matlab and Java, at http://computbiol.ncu.edu.cn/PreSSFP.

Materials and methods

The PreSSFP system was constructed and evaluated in four main steps (Figure 1): (i) construct valid benchmark and independent datasets to train and test PreSSFP for seven organisms separately; (ii) formulate the peptide samples in the datasets with an effective mathematical expression by extracting sequence, evolutionary and physicochemical property features; (iii) perform a two-step feature optimization algorithm on the benchmark datasets to obtain the optimal feature subset for each organism in a cross-validation manner; and (iv) objectively evaluate the optimal model for each organism on the independent test datasets and develop the PreSSFP webserver.

Data collection and preprocessing

In this work, experimentally verified fungi phosphorylation data for eight model organisms (A. flavus, A. nidulans, C. neoformans, F. graminearum, M. oryzae, N. crassa, S. cerevisiae and S. pombe) were collected from the publicly available FPD [17] and the UniProt database [24]. Because A. flavus and A. nidulans both belong to the genus Aspergillus, we consolidated them into a single Aspergillus category, which gave seven categories of fungal data. Homologous protein sequences were removed with a 30% identity cutoff using CD-HIT [25]. For each of the seven categories, an independent dataset was constructed by randomly selecting 15% of all non-redundant protein entries, and the remaining non-redundant proteins were used to construct the training dataset. Experimentally verified phosphorylated serine (S), threonine (T) and tyrosine (Y) residues were regarded as positive samples. All remaining S/T/Y residues in these proteins that had not been verified as phosphorylation sites were considered negative samples. Each site was represented as a peptide segment of length 15 with S/T/Y in the center. On this basis, 94 proteins with 152 serine phosphorylation sites and 6762 non-phosphorylation sites were obtained as the independent dataset for Aspergillus; the remaining 528 proteins, containing 930 serine phosphorylation sites and 37420 non-phosphorylation sites, were used as training samples (the other species-specific serine, threonine and tyrosine phosphorylation datasets are shown in Table 1). Table 1 shows that the number of tyrosine phosphorylation sites in every species except S. cerevisiae is too small to be statistically meaningful. Therefore, we used the tyrosine phosphorylation datasets of the other six organisms as a second independent test dataset for cross-species prediction of tyrosine phosphorylation sites. Furthermore, because the amount of serine phosphorylation data in S. cerevisiae is very large, we divided it into 10 equal subsets for training and adopted the average of the 10 training results. In addition, because negative samples greatly outnumber positive samples, negative samples were randomly drawn to give a positive-to-negative ratio of 1:1 in both the training dataset and the independent test dataset.
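The window extraction and 1:1 balancing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 0-based site indices and the padding of protein termini with the virtual residue 'O' are assumptions made for the example.

```python
import random


def extract_windows(sequence, phospho_positions, residue="S", flank=7, pad="O"):
    """Split every `residue` in `sequence` into positive and negative 15-mers.

    phospho_positions: set of 0-based indices of experimentally verified sites.
    Assumption: windows that run past a terminus are padded with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    positives, negatives = [], []
    for i, aa in enumerate(sequence):
        if aa != residue:
            continue
        # Sequence index i sits at padded index i + flank, i.e. the window centre.
        window = padded[i:i + 2 * flank + 1]
        (positives if i in phospho_positions else negatives).append(window)
    return positives, negatives


def balance(positives, negatives, seed=0):
    """Randomly down-sample the negatives to a 1:1 positive-to-negative ratio."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return positives, rng.sample(negatives, k)


# Toy protein with serines at indices 2, 11 and 20; only index 11 is "verified".
toy_sequence = "AASAAAAAAAASAAAAAAAAS"
pos, neg = extract_windows(toy_sequence, {11})
pos_balanced, neg_balanced = balance(pos, neg)
```

Fixing the random seed keeps the sampled negatives reproducible between runs; in practice the sampling would be repeated per species and per residue type.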

Table 1

The statistics of fungi phosphorylation S/T/Y datasets in different organisms

| Organism | Residue type | Eliminated homology (sites/proteins) | Training dataset (sites/proteins) | Testing dataset (sites/proteins) |
|---|---|---|---|---|
| Aspergillus | Serine | 1082/622 | 930/528 | 152/94 |
| | Threonine | 309/239 | 263/203 | 46/36 |
| | Tyrosine | 36/36 | | |
| C. neoformans | Serine | 931/599 | 785/509 | 146/90 |
| | Threonine | 186/134 | 163/118 | 23/16 |
| | Tyrosine | 20/19 | | |
| F. graminearum | Serine | 2384/1090 | 1999/926 | 385/164 |
| | Threonine | 625/468 | 526/397 | 99/71 |
| | Tyrosine | 64/60 | | |
| M. oryzae | Serine | 3757/1383 | 3144/1175 | 613/208 |
| | Threonine | 1033/591 | 869/502 | 164/89 |
| | Tyrosine | 23/21 | | |
| N. crassa | Serine | 3347/1513 | 2835/1286 | 512/227 |
| | Threonine | 1183/800 | 1003/680 | 180/120 |
| | Tyrosine | 139/134 | | |
| S. cerevisiae | Serine | 25497/3327 | 21613/2827 | 3884/500 |
| | Threonine | 8567/2441 | 7197/2074 | 1370/367 |
| | Tyrosine | 1810/1194 | 1532/1014 | 278/180 |
| S. pombe | Serine | 3035/1224 | 2581/1040 | 454/184 |
| | Threonine | 561/385 | 481/327 | 80/58 |
| | Tyrosine | 81/75 | | |

Features extraction and optimization

Sequence information feature

Amino acid composition (AAC) and di-AAC (DAAC), which reflect the frequency of occurrence of amino acids and amino acid pairs, have been widely applied in various fields of bioinformatics [26, 27]. The AAC feature can be formulated as follows:
|$\mathrm{AAC}=\Big({f}_1,{f}_2,\cdots, {f}_{21}\Big)$| (1)
where |${f}_i$| represents the frequency of occurrence of the i-th amino acid in |$\Big\{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,O\Big\}$|. The symbol ‘O’ represents the virtual amino acid or another non-standard amino acid (e.g. B, Z and X), and the residue S, T or Y at the central position is omitted. The DAAC feature is the composition of two adjacent amino acids, which can be formulated as follows:
|$\mathrm{DAAC}=\Big({f}_1,{f}_2,\cdots, {f}_N\Big)$| (2)

where |${f}_i$| represents the frequency of the i-th amino acid pair in |$\Big\{ AA, AC, AD,\cdots, YY\Big\}$| and |$N$| is the number of such pairs.
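A minimal sketch of the AAC and DAAC encodings for a single 15-residue window is given below. The function names are hypothetical, and two choices are assumptions rather than details stated in the text: the pair alphabet is built from all 21 symbols (441 pairs; restricting it to the 20 standard residues would give 400), and pairs are counted on the flanking residues after the central site has been removed.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"  # 20 standard residues plus the virtual residue 'O'
# Assumption: pairs are formed over all 21 symbols, giving 21 x 21 = 441 pairs.
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def flanks(window):
    """Drop the central S/T/Y and map non-standard residues (B, Z, X, ...) to 'O'."""
    mid = len(window) // 2
    kept = window[:mid] + window[mid + 1:]
    return "".join(aa if aa in ALPHABET else "O" for aa in kept)


def aac(window):
    """Frequency of each of the 21 symbols among the flanking residues."""
    seq = flanks(window)
    return [seq.count(aa) / len(seq) for aa in ALPHABET]


def daac(window):
    """Frequency of each adjacent residue pair among the flanking residues."""
    seq = flanks(window)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    return [counts[p] / (len(seq) - 1) for p in PAIRS]


aac_vector = aac("XAAAAAASAAAAAAA")    # 21 values that sum to 1
daac_vector = daac("XAAAAAASAAAAAAA")  # 441 values that sum to 1
```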

The Bi-profile Bayes (BPB) method was first proposed by Shao et al. in 2009 [28] and has been widely applied to the prediction of PTM sites [29, 30]. Combining peptide sequence features from both the positive and the negative feature space is more informative than using a single feature space or a fixed binary encoding scheme. Let |$\mathrm{S}={s}_1{s}_2\cdots {s}_n$| represent one of the peptide sequences in the datasets, where |${s}_j\Big(j=1,2,\cdots, n\Big)$| is one of the 21 amino acids |$\Big\{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,O\Big\}$|, and n is the length of the peptide segment, omitting the residue S, T or Y at the central position (i.e. |$\mathit{n}=14$| in this work). The probability vector
|$P=\Big({p}_1,{p}_2,\cdots, {p}_n,{p}_{n+1},\cdots, {p}_{2n}\Big)$| (3)

is used to encode peptide |$\mathrm{S}$|, where |${p}_j\Big(j=1,2,\cdots, n\Big)$| denotes the posterior probability of each amino acid at the j-th position in the positive training samples, and |${p}_j\Big(j=n+1,n+2,\cdots, 2n\Big)$| denotes the posterior probability of each amino acid at the corresponding position in the negative training samples.
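The BPB encoding can be illustrated with the short sketch below. It assumes that the posterior probability at each position is estimated as the plain frequency of the residue at that position in the training windows, without pseudocounts; the function names are hypothetical.

```python
from collections import Counter


def position_profile(windows):
    """Per-position residue frequencies over the flanking positions.

    Assumption: the posterior probability is estimated as the observed
    frequency of each residue at each position (no pseudocounts).
    """
    length = len(windows[0])
    mid = length // 2
    profile = []
    for j in range(length):
        if j == mid:  # the central S/T/Y is omitted
            continue
        counts = Counter(w[j] for w in windows)
        profile.append({aa: c / len(windows) for aa, c in counts.items()})
    return profile


def bpb_encode(window, pos_profile, neg_profile):
    """Return the 2n-dimensional vector (p_1..p_n from positives, p_{n+1}..p_2n from negatives)."""
    mid = len(window) // 2
    flank = window[:mid] + window[mid + 1:]
    from_pos = [pos_profile[j].get(aa, 0.0) for j, aa in enumerate(flank)]
    from_neg = [neg_profile[j].get(aa, 0.0) for j, aa in enumerate(flank)]
    return from_pos + from_neg


positive_windows = ["AAAAAAASAAAAAAA", "CAAAAAASAAAAAAA"]
negative_windows = ["CCCCCCCSCCCCCCC"]
vector = bpb_encode(
    "AAAAAAASAAAAAAA",
    position_profile(positive_windows),
    position_profile(negative_windows),
)
```

For a 15-residue window the vector has 28 entries: 14 from the positive profile followed by 14 from the negative profile.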

Evolutionary information feature

The K Nearest Neighbors (KNN) algorithm, which can detect local sequence similarity, has shown good predictive performance in various classification systems [31]. To make full use of the cluster information of local sequence fragments for predicting phosphorylation sites, we adopted the KNN score to extract features from both the positive and the negative datasets. For two local sequence fragments |${s}_1$| and |${s}_2$|, the distance |$D\Big({s}_1,{s}_2\Big)$| between |${s}_1$| and |${s}_2$| is defined as
|$D\Big({s}_1,{s}_2\Big)=1-\frac{\sum_{i=-L}^L Sim\Big({s}_1(i),{s}_2(i)\Big)}{2L+1}$| (4)
|$Sim\Big(a,b\Big)=\frac{M\Big(a,b\Big)-\mathit{\min}(M)}{\mathit{\max}(M)-\mathit{\min}(M)}$| (5)

where |${s}_1(i)$| and |${s}_2(i)$| are the residues at position |$i$| relative to the central S/T/Y; |$a$| and |$b$| are two amino acids; |$M$| is the BLOSUM62 substitution matrix; |$Sim$| is the similarity derived from the normalized amino acid substitution matrix; |$L$| denotes the number of upstream or downstream amino acids flanking each side of the target S/T/Y; and |$\mathit{\max}(M)$| and |$\mathit{\min}(M)$| represent the largest and the smallest values in the matrix, respectively.
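The sketch below illustrates a KNN score of this kind. It assumes that the distance between two windows is one minus the mean min-max-normalized substitution score over all window positions, and that the KNN score is the fraction of positive windows among the k nearest training windows, which is a common definition but is not spelled out in the text. The substitution matrix is passed in as a nested dictionary; the two-letter matrix in the usage lines is a toy stand-in, not the full BLOSUM62 matrix.

```python
def make_similarity(matrix):
    """Min-max normalise a substitution matrix (dict of dicts) to the range [0, 1]."""
    values = [v for row in matrix.values() for v in row.values()]
    lo, hi = min(values), max(values)
    return lambda a, b: (matrix[a][b] - lo) / (hi - lo)


def distance(s1, s2, sim):
    """One minus the mean per-position similarity of two equal-length windows."""
    return 1.0 - sum(sim(a, b) for a, b in zip(s1, s2)) / len(s1)


def knn_score(query, positives, negatives, sim, k=3):
    """Fraction of positive windows among the k nearest neighbours of `query`.

    Assumption: this definition of the KNN score is illustrative only.
    """
    labelled = [(distance(query, w, sim), 1) for w in positives]
    labelled += [(distance(query, w, sim), 0) for w in negatives]
    labelled.sort(key=lambda item: item[0])
    nearest = labelled[:k]
    return sum(label for _, label in nearest) / len(nearest)


# Toy two-letter matrix (hypothetical stand-in for BLOSUM62).
toy_matrix = {"A": {"A": 4, "C": 0}, "C": {"A": 0, "C": 9}}
sim = make_similarity(toy_matrix)
score = knn_score("CC", positives=["CC", "CA"], negatives=["AA"], sim=sim, k=2)
```

Note that with this distance two identical windows are at distance zero only if their residues carry the maximum matrix score, because each self-substitution score is normalized against the global maximum.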

Physicochemical properties feature

For biochemical reactions, physicochemical properties (PSP) are among the most intuitive features and have been successfully applied in bioinformatics [32, 33]. The 20 amino acids have diverse properties, which account for the diversity and specificity of protein function and structure. A large number of experimental and theoretical studies have characterized the properties of individual amino acids and represented them as numerical indices. The Amino Acid index (AAindex) database (Version 9.1) contains 544 amino acid indices that specify the physicochemical properties of amino acids. According to the values associated with each PSP, the amino acids around the phosphorylation sites can be encoded. All 544 physicochemical properties were ranked in descending order of information gain (IG) for each of the seven species separately. The larger the IG value, the greater the impact of the corresponding amino acid residue on the phosphorylation site. Thus, in our work, the nine PSP with the highest IG values were selected as informative features for the prediction model.
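As an illustration of IG-based ranking, the sketch below computes the information gain of a discrete feature with respect to the class label and returns the top-ranked property names. How the continuous AAindex values were discretized is not stated in the text, so the example takes already discretized values as input; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum over x of P(x) * H(Y | X = x) for a discrete feature X."""
    total = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_by_information_gain(feature_table, labels, top=9):
    """feature_table maps a property name to one discretised value per sample."""
    scores = {name: information_gain(vals, labels) for name, vals in feature_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data: 1 = phosphorylated, 0 = not phosphorylated.
labels = [1, 1, 0, 0]
table = {
    "useless_property": ["high", "low", "high", "low"],
    "perfect_property": ["high", "high", "low", "low"],
}
ranking = rank_by_information_gain(table, labels)
```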

Two-step feature optimization

A two-step feature optimization was employed to select prominent features and then to construct the optimal feature subset. In the first step, we calculated the importance of each feature with a random forest (RF) classifier in Python [34–36], which ranked each input feature according to the mean accuracy on the given test data and labels. For the RF classifier, the parameter n_estimators was optimized first. A larger n_estimators generally gives better results but takes longer to compute, and the results stop improving significantly beyond a critical number of trees. We therefore searched the range from 50 to 250 via grid search with cross-validation and chose the n_estimators value with the best accuracy (the n_estimators values for the different models are shown in Table S1). We then used the best parameter to run the classification and took the top 200 features as the optimal feature candidates after this step. The second step was step-wise feature elimination. First, the 10 top-ranked features from the first step were merged into a training model, and its performance was calculated with a support vector machine (SVM). At each round, the next feature in the first-step ranking was added to the model and assessed. If the SVM accuracy of the model is improved by its addition, or the SVM accuracy of the model is over 80%, the feature is accepted into the final feature subset used to build the prediction model; otherwise, it is eliminated (details are given in Algorithm Two-step feature optimization).

Algorithm Two-step feature optimization

Input:

Benchmark dataset, S; |$\mathrm{S}\!={\!\Big({X}_1,{X}_2,\dots, {X}_n\!\Big)}^T$|⁠,|${X}_i\!=\!\Big({x}_{i1},{x}_{i2},\dots, {x}_{ip}\Big),\\ i=1,2,\dots, n$|

Response values|$, y=\Big({y}_1,{y}_2,\dots, {y}_n\Big)$|

Testing dataset, T; |$\mathrm{T}\!=\!{\Big({X_1}^{\prime },{X_2}^{\prime },\dots, {X_m}^{\prime}\!\Big)}^T$|⁠,|${X_i}^{\prime}\!=\!\Big({x_{i1}}^{\prime },{x_{i2}}^{\prime },\dots, {x_{ip}}^{\prime}\Big),\\ i=1,2,\dots, m$|

Feature selection method, Two-step feature optimization;

Output:

Predicted label |${y}^{\prime }=\Big({y}_1^{\prime },{y}_2^{\prime },\dots, {y}_m^{\prime}\Big)$| for testing dataset.

for each |$i\in \Big[1,n\Big]$| do

    S’ = FiveFeatureCode(S, |$i$|);

end for

for each |$n\_ estimator\in \Big[50,250\Big]$| do

    best_n_estimator = GridSearchCV(S’, |$y$|, n_estimator);

end for

[ranking, index] = RFclassifier(S’, |$y$|, best_n_estimator);

Model = Extract(S’, index(1:10));

Acc = SVM(Model);

for each |$i\in \Big[11,200\Big]$| do

    Model = Combine(Model, Extract(S’, index(i)));

    NewAcc = SVM(Model);

    if NewAcc >= Acc or NewAcc >= 0.8

        Acc = NewAcc;

    else

        Model = Remove(Model, index(i));

    end if;

end for;

for each |$i\in \Big[1,m\Big]$| do

    |${\mathrm{T}}^{\prime }$| = FiveFeatureCode(|$\mathrm{T}$|, |$i$|);

end for

F = getTopfeatures(|${\mathrm{T}}^{\prime }$|, Model);

|${y}^{\prime }$| = SVMfromCrossValidation(F);

return |${y}^{\prime }$|;
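The second, step-wise stage of the algorithm can be sketched in a few lines. In this illustration the ranked feature list is taken as given (in the text it comes from the RF importance ranking), and `evaluate` is a hypothetical callable standing in for the cross-validated SVM accuracy of a feature subset; the toy scoring function exists only to make the example runnable.

```python
def stepwise_select(ranked, evaluate, seed_size=10, pool_size=200, floor=0.80):
    """Greedy step-wise selection over a ranked feature list.

    ranked:   feature identifiers, most important first.
    evaluate: hypothetical callable returning the accuracy of a feature subset.
    A candidate feature is kept if the accuracy does not drop, or if the
    accuracy with it is still at least `floor`; otherwise it is discarded.
    """
    selected = list(ranked[:seed_size])
    current = evaluate(selected)
    for feature in ranked[seed_size:pool_size]:
        candidate = selected + [feature]
        accuracy = evaluate(candidate)
        if accuracy >= current or accuracy >= floor:
            selected, current = candidate, accuracy
    return selected, current


# Toy scoring function: features 0 and 3 help, every other feature hurts slightly.
USEFUL = {0, 3}


def toy_accuracy(features):
    chosen = set(features)
    return 0.5 + 0.1 * len(chosen & USEFUL) - 0.01 * len(chosen - USEFUL)


subset, accuracy = stepwise_select([0, 1, 2, 3, 4], toy_accuracy, seed_size=2, pool_size=5)
```

As in the algorithm listing, the reference accuracy is updated to the new value whenever a feature is accepted, including the case where the new accuracy is lower than before but still at or above the 0.8 floor.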

Performance evaluation

In this paper, we used SVM to assess the contributions of the different types of features [37]. The radial basis function kernel was used to map the input samples into a higher dimensional space. PreSSFP distinguishes phosphorylation samples from non-phosphorylation samples via SVM, with parameters chosen by the grid search strategy in LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm). Four measurements, accuracy (Acc), sensitivity (Sn), specificity (Sp) and Matthews correlation coefficient (MCC), were adopted to evaluate the prediction performance. They are defined as follows:
|$\mathrm{Acc}=\frac{TP+ TN}{TP+ TN+ FP+ FN}$| (6)
|$\mathrm{Sn}=\frac{TP}{TP+ FN}$| (7)
|$\mathrm{Sp}=\frac{TN}{TN+ FP}$| (8)
|$\mathrm{MCC}=\frac{TP\times TN- FP\times FN}{\sqrt{\Big( TP+ FP\Big)\Big( TP+ FN\Big)\Big( TN+ FP\Big)\Big( TN+ FN\Big)}}$| (9)

where TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives, respectively. The receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values were also calculated.
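For reference, the four measurements and the AUC can be computed from raw counts and prediction scores as sketched below, using the standard definitions of Acc, Sn, Sp and MCC. The AUC is computed with the rank-based (Mann-Whitney) formulation, which is one standard way to obtain it and is not necessarily the routine used in this work; the function names are hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; 0.0 is returned by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


def auc(scores, labels):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
area = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a sort-based implementation would be preferable for large test sets.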

Results

Current progresses in the prediction of eukaryotic phosphorylation sites

It is believed that up to one-half of eukaryotic proteins undergo phosphorylation [39]. Accordingly, most previous studies have focused on the phosphorylation of eukaryotic proteins, including those of animals, plants and fungi. Table 2 provides a brief overview of currently available eukaryotic protein phosphorylation site prediction methods, including the type of species, tool name, kinase or kinase family-specific prediction, feature information, prediction algorithm, reference and website. Table 2 lists 47 phosphorylation site prediction tools. We have broadly divided them into two categories: tools for general phosphorylation site prediction and tools for species-specific phosphorylation site prediction. However, no tool for fungi-specific phosphorylation site prediction is available, owing to the lack of data. The only fungal tools are those for predicting yeast phosphorylation sites, namely NetPhosYeast [38] and PhosphoPICK [76]. The biggest differences among these computational methods lie in the feature information and the prediction algorithm. As shown in Table 2, most of the prediction algorithms for identifying protein phosphorylation sites are machine learning methods, including artificial neural networks (ANNs), SVMs, RFs, logistic regression and Bayesian networks. For example, the first tool, NetPhos 3.1, was built by Blom et al. in 1999 to predict eukaryotic protein phosphorylation sites based on an ANN [40]. Kim et al. implemented PredPhospho 1.0 with the SVM algorithm to predict sites for four PK groups and four PK families, and its enhanced version, PredPhospho 2.0, can predict sites for 7 PK groups and 18 PK families [70, 71]. As Table 2 shows, SVM is the most widely used prediction algorithm for identifying phosphorylation sites, owing to its strong generalization ability and versatility.

Table 2

Summary of currently available prediction tools of eukaryotic phosphorylation sites

| Type of species | Tool name | PK-spec | Feature information | Prediction algorithm | Reference | Website (http://) |
|---|---|---|---|---|---|---|
| General | NetPhos | No | SI | ANN | [40] | www.cbs.dtu.dk/services/NetPhos |
| | Scansite 2.0 | Yes | SP | PSSM | [41] | scansite.mit.edu |
| | NetPhosK | Yes | SI | ANN | [42] | www.cbs.dtu.dk/services/NetPhosK |
| | DISPHOS 1.3 | No | AAF, Disorder | LR | [43] | www.dabi.temple.edu/disphos (N/A) |
| | PHOSITE | Yes | PSSM | PA | [44] | www.phosite.com (N/A) |
| | PPSP 1.0 | Yes | BDT | BP | [45] | ppsp.biocuckoo.org |
| | pkaPS | Yes | Motif | SA | [46] | mendel.imp.univie.ac.at/sat/pkaPS |
| | NetworKIN | Yes | Motif, PSSM | ANN | [47] | networkin.info/index.shtml |
| | KinasePhos | Yes | SP, CP | SVM | [48, 49] | KinasePhos2.mbc.nctu.edu.tw |
| | GANNPhos | No | BE | GANN | [50] | |
| | AMS | Yes | BE | SVM | [51, 52] | ams2.bioinfo.pl (N/A) |
| | MetaPredPS | Yes | SP | WV | [53] | MetaPred.umn.edu/MetaPredPS |
| | SMALI | Yes | PSSM | Match | [54, 55] | lilab.uwo.ca/SMALI.htm (N/A) |
| | Predikin | Yes | PSSM, SA | HMM | [56, 57] | predikin.biosci.uq.edu.au. |
| | CRPhos 0.8 | Yes | SI | CRF | [58] | www.ptools.ua.ac.be/CRPhos |
| | PostMod | Yes | SP, EI | NR | [59] | biodb.kaist.ac.kr/PTM |
| | iGPS 1.0 | Yes | SI | GPS | [60] | gps.biocuckoo.org |
| | PhosK3D | Yes | SP, 3D | SVM | [61] | csb.cse.yzu.edu.tw/PhosK3D |
| | PTMPred | Yes | PSPM | SVM | [62] | doc.aporc.org/wiki/PTMPred |
| | ModPred | No | SP | MoRFs | [63] | www.modpred.org (N/A) |
| | JUPred_MLP | No | PSSM | MLP | [64] | |
| Animal, General | PhosphoSVM | No | SP | SVM | [14] | sysbio.unl.edu/PhosphoSVM. |
| | PhosContext2vec | Both | SP | Context2vec | [65] | phoscontext2vec.erc.monash.edu |
| Animal, Homo sapiens | CRP | No | SP | SPR | [66] | fasta.bioch.virginia.edu/crp |
| | GPS 2.0 | Yes | SI | GPS | [67] | gps.biocuckoo.org |
| | PhoScan | Yes | SI, flexibility | LOR | [68] | bioinfo.au.tsinghua.edu.cn/phoscan |
| | Musite | Yes | KNN, Disorder, AAF | SVM | [69] | musite.sourceforge.net |
| | PredPhospho | Yes | AAPC | SVMs | [70, 71] | phosphovariant.ngri.go.kr |
| | Phos_pred | Yes | FI | RF | [72] | bioinformatics.ustc.edu.cn/phos_pred |
| | PSEA | Yes | SP | PSEA | [73] | bioinfo.ncu.edu.cn/PKPred_Home.aspx |
| | iPhos-PseEn | No | PseEn | RF | [74] | www.jci-bioinfo.cn/iPhos-PseEn |
| | PhosphoPICK | Yes | PSAAF | BN | [75, 76] | bioinf.scmb.uq.edu.au/phosphopick |
| | MusiteDeep | Both | BE | DL | [77] | github.com/duolinwang/MusiteDeep |
| | PhosphoPredict | Yes | SP, FI | mRMR | [78] | phosphopredict.erc.monash.edu/webserver.html |
| | KSRPred | Yes | SP, PPI | KRR | [79] | |
| | iPhos-PseEvo | No | EI, PseAAC | GST | [80] | www.jci-bioinfo.cn/iPhos-PseEvo |
| | Multi-iPPseEvo | No | EI, PseAAC | GST | [81] | www.jci-bioinfo.cn/Multi-iPPseEvo |
| | PTM-ssMP | No | ssMP | SVM | [82] | bioinformatics.ustc.edu.cn/PTM-ssMP/index |
| | Quokka | Yes | NNS, AAF, WLS, BSI, KNN | LR | [83] | quokka.erc.monash.edu/ |
| Animal, Mouse | PhosphoPICK | Yes | PSAAF | BN | [76] | bioinf.scmb.uq.edu.au/phosphopick |
| Plant, General | | Yes | SI | HMM, GPS | [84] | ekpd.biocuckoo.org/protocol.php |
| Plant, Arabidopsis thaliana | PhosPhAt | No | Structure | SVMs | [12] | phosphat.uni-hohenheim.de |
| | PlantPhos | No | MDD | HMM | [13] | csb.cse.yzu.edu.tw/PlantPhos |
| | Musite | Yes | KNN, Disorder, AAF | SVM | [85] | musite.sourceforge.net |
| | KMPhos | No | KMs | SVM | [86] | |
| Plant, Rice | PhosphoRice | No | SI | WVS | [87] | bioinformatics.fafu.edu.cn/PhosphoRice |
| | Rice_Phospho | No | AF-CKSAAP | SVM | [88] | bioinformatics.fafu.edu.cn/rice_phospho1.0 |
| Plant, Soybean | PHOSFER | No | AAIndex | RF | [89] | saphire.usask.ca |
| Fungi, Yeast | NetPhosYeast | No | SI | ANN | [38] | www.cbs.dtu.dk/services/NetPhosYeast |
| | PhosphoPICK | Yes | PSAAF | BN | [76] | bioinf.scmb.uq.edu.au/phosphopick |

On the other hand, as Table 2 shows, many types of feature information and encoding methods have been used to represent protein or peptide samples, including sequence information, structural information, physicochemical properties, evolutionary information and functional information. Among them, sequence information is the most widely used. Some predictors, like NetPhos, fed the whole protein sequence into the training model, whereas others, such as GANNPhos and AMS, used AAC or binary encoding to encode amino acid sequence fragments for model training. Phosphorylation sites have been observed to be preferentially located in disordered regions [90]. In 2004 and 2010, predicted disorder scores for phosphorylation sites were used as features in the phosphorylation predictors DISPHOS [43] and Musite [69], respectively. Experimentally determined disorder information is rarely available for verified phosphorylation sites, so disorder scores usually have to be obtained from prediction tools, which can bias the resulting prediction model. Meanwhile, physicochemical properties can also contribute to phosphorylation site prediction. For example, the predictor PHOSFER used 24 AAIndex indices to encode protein sequences [89]. It adopted a consensus fuzzy clustering method to group 544 amino acid properties into 8 clusters and then selected a set of 24 indices consisting of three individual AAIndex indices from each cluster. The results suggested that it is necessary to extract the important information from the 544 AAIndex properties when encoding phosphorylation sequences. Evolutionary information, such as the KNN score, was a very effective feature for predicting phosphorylation sites [69, 83]. In 2018, Li et al. designed Quokka with KNN and sequence information features, which provided high-quality predictions [83].
Furthermore, to improve predictive performance, some tools, including PostMod [59] and Musite [69], integrated multiple types of information to predict phosphorylation sites. Combining features can bring a certain improvement, but these studies did not discuss the importance and contribution of the different features to phosphorylation site prediction. Many of the derived features may contain redundant information, and not all features play an equally important role in the prediction. Feature selection is therefore necessary to select the most informative features, and their contribution to phosphorylation site prediction should be discussed in detail.
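To make the two simplest sequence encodings mentioned above concrete, the following minimal sketch computes the amino acid composition (AAC) and a binary (one-hot) encoding of a peptide fragment. The fragment, its window length and the function names are hypothetical illustrations and are not taken from any of the cited tools.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(fragment):
    """Amino acid composition: frequency of each standard residue (20 values)."""
    counts = Counter(fragment)
    # Count only standard residues so that padding symbols do not dilute the frequencies.
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]

def binary_encoding(fragment):
    """One-hot encoding: 20 bits per position, all zeros for non-standard symbols."""
    vector = []
    for residue in fragment:
        vector.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return vector

# Hypothetical 11-residue window with the candidate serine at the centre (index 5).
fragment = "KRPADSLDPEK"
aac_vector = aac(fragment)
onehot_vector = binary_encoding(fragment)
```

AAC discards positional information and yields a fixed 20-dimensional vector, whereas binary encoding keeps the position of every residue at the cost of 20 dimensions per position.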

Fortunately, many fungi phosphorylation sites have been identified in the past 6 years, which provides a great opportunity to design a tool for fungi phosphorylation analysis. In view of the above, we proposed PreSSFP for predicting species-specific fungi phosphorylation sites. It incorporates sequence information, physicochemical properties and evolutionary information, which have shown good performance in predicting PTM sites in previous tools, together with a two-step feature optimization strategy.

Feature optimization results via two-step feature optimization

In previous research [26, 91, 92], we found that models encoded with a single feature type performed worse than a model combining all five types of features (AAC, DAAC, BPB, PSP and KNN). We therefore combined all five types of features to build the model. However, the prediction performance of this model was not fully satisfactory. One possible reason is that not all features are equally essential for model performance; another is that some feature vectors may be noise, which easily biases the prediction performance. Thus, feature optimization is generally indispensable for retaining the important features. Principal component analysis (PCA) [93] and maximum relevance minimum redundancy (mRMR) [78] are commonly used for feature selection. PCA transforms the original data and retains the components with the highest cumulative contribution rate. The mRMR method uses mutual information to quantify relevance and redundancy. Taking serine phosphorylation in Aspergillus, tyrosine phosphorylation in S. cerevisiae and threonine phosphorylation in C. neoformans as examples, we carried out feature selection with PCA and mRMR separately and compared their results with those of two-step feature optimization (Figure 2). For feature vectors of the same dimension, the model built with two-step feature optimization outperformed those built with PCA and mRMR. For example, for serine phosphorylation in Aspergillus the accuracy improved by 5% and 4% in comparison with mRMR and PCA, respectively, and for threonine phosphorylation in C. neoformans it increased by 17% and 15%, respectively.
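The idea behind mRMR can be illustrated with a small sketch. It assumes discrete feature values and uses the common "relevance minus mean redundancy" criterion; the function names and the toy data are hypothetical, and this is not the implementation used in the comparison above.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def mrmr(columns, labels, k):
    """Greedy mRMR: pick k column indices maximising relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(k, len(columns)):
        def score(j):
            redundancy = sum(mutual_information(columns[j], columns[s]) for s in selected)
            return relevance[j] - redundancy / len(selected)
        remaining = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Toy data: the label is 1 only when features 0 and 2 are both 1.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
columns = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # feature 0: relevant
    [1, 1, 1, 1, 0, 0, 0, 0],  # feature 1: exact copy of feature 0 (fully redundant)
    [1, 1, 0, 0, 1, 1, 0, 0],  # feature 2: relevant and independent of feature 0
    [1, 0, 1, 0, 1, 0, 1, 0],  # feature 3: independent of the label
]
chosen = mrmr(columns, labels, k=2)
```

In this toy case the two selected features are features 0 and 2: the duplicate feature 1 is penalised for redundancy and the uninformative feature 3 has no relevance.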

Figure 2

Accuracies of our method compared with mRMR and PCA for serine phosphorylation in Aspergillus, tyrosine phosphorylation in S. cerevisiae and threonine phosphorylation in C. neoformans.

Here we describe how two-step feature optimization was applied to mine the principal feature vectors when building the predictive model for serine phosphorylation sites in Aspergillus. To start with, combining all five types of features gave a 580-dimensional feature vector. We then searched the parameter n_estimator over the range 50 to 250; using GridSearchCV, the best n_estimator for predicting Aspergillus serine phosphorylation was 190. Next, we ranked the features by importance with an RF classifier using this value of 190. The nearer a feature vector is to the top of the ranking, the greater its impact on serine phosphorylation site prediction. Accordingly, we took the top 200 feature vectors as the candidates for the second optimization step. In that step, we selected the top 10 of the 200 feature vectors to construct an initial model and calculated its performance. The remaining 190 feature vectors were then added to this model one by one. If the SVM accuracy of the model improved, the feature vector was kept and the enlarged model was used for the next round; otherwise, it was eliminated. As a result, with the dimension decreasing from 580 to 146, the accuracy increased from 77.90% to 81.18% and the AUC rose from 77.41% to 80.65% for serine phosphorylation site prediction in Aspergillus. The chosen feature vectors thus give better prediction performance than the combination of all five types of features. In total, 15 models were built: 7 serine phosphorylation models, 7 threonine phosphorylation models and 1 tyrosine phosphorylation model. The prediction performance of the combined five feature types and of the feature vectors selected by two-step feature optimization for the other species-specific fungi phosphorylation sites is shown in Table 3.
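The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PreSSFP code: `two_step_select`, `toy_accuracy` and the toy importance scores are hypothetical names and values. In practice the importance scores could come from the `feature_importances_` attribute of a scikit-learn `RandomForestClassifier` (whose tree-count parameter is called `n_estimators`), and the evaluation callback could wrap a cross-validated SVM accuracy.

```python
def two_step_select(importances, evaluate, top_k=200, seed_k=10):
    """Importance ranking (step one) followed by greedy forward selection (step two).

    importances: one importance score per feature, e.g. from a random forest.
    evaluate:    callable mapping a list of feature indices to an accuracy.
    """
    # Step one: rank features by importance (most important first) and keep the top_k.
    ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    candidates = ranked[:top_k]

    # Step two: start from the seed_k best candidates ...
    selected = candidates[:seed_k]
    best = evaluate(selected)
    # ... then try the remaining candidates one by one, keeping only those that improve accuracy.
    for j in candidates[seed_k:]:
        score = evaluate(selected + [j])
        if score > best:
            selected, best = selected + [j], score
    return selected, best

# Toy stand-in for a cross-validated SVM accuracy (an assumption for illustration):
# features 0-3 are informative, every other feature slightly lowers the score.
INFORMATIVE = {0, 1, 2, 3}

def toy_accuracy(indices):
    hits = len(INFORMATIVE.intersection(indices))
    noise = len(indices) - hits
    return 0.5 + 0.1 * hits - 0.01 * noise

toy_importances = [0.30, 0.05, 0.25, 0.20, 0.04, 0.03, 0.02, 0.01]
selected, accuracy = two_step_select(toy_importances, toy_accuracy, top_k=6, seed_k=2)
```

With the defaults `top_k=200` and `seed_k=10` the function mirrors the numbers quoted in the text (top 200 candidates, initial model of 10, remaining 190 tested one by one). Because each candidate is tested only once and in ranking order, the result depends on that order; this is a property of the greedy scheme rather than of the sketch.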

Analysis of feature importance and contribution

We then analysed which feature vectors are valuable to the prediction model, based on the features selected by two-step feature optimization. Figure 3 takes serine phosphorylation in Aspergillus, tyrosine phosphorylation in S. cerevisiae and threonine phosphorylation in C. neoformans as examples. For phosphoserine in Aspergillus, we assembled a new 146-dimensional feature vector from the above five types of features, and the proportions of selected dimensions within the five feature types are 0.04 (15/400), 1 (5/5), 0.72 (91/126), 0.75 (21/28) and 0.67 (14/21). For phosphothreonine in C. neoformans, a new 168-dimensional feature vector was selected from the five types of features, and the corresponding proportions are 0.07 (27/400), 1 (5/5), 0.80 (101/126), 0.82 (23/28) and 0.57 (12/21). In view of these results, we observed that the ratio of the KNN feature was notably higher than those of the other four features, indicating that KNN features had a vital effect on model performance and contributed to identifying phosphorylated S/T/Y sites. Meanwhile, in the RF classifier results, the importance ranking of the KNN feature in the seven species is also higher than those of the other features. Correspondingly, KNN features exhibited the best performance in these models, which could be related to conserved residues and the local sequence similarity that they detect. Phosphorylation-related clusters may exist in local sequences around phosphorylation sites [69], and KNN can capture this cluster information and distinguish it from the background. Therefore, the KNN score is a suitable feature for predicting phosphorylation sites.
Figure 3 also shows that the ratios of the PSP and BPB features were relatively high, implying that the PSP and the frequency of each amino acid at each position were also comparatively important for predicting phosphorylated S/T/Y sites. In fact, the physicochemical properties of amino acids surrounding phosphorylation sites have an impact on phosphorylation events [94]. In contrast, the ratios of DAAC feature vectors are relatively low. Figure 3 further shows that the composition of the final feature vectors varies between species models. This indicates that the impact of each feature differs between species, and the performance of the prediction models differs accordingly. Furthermore, the feature optimization results for serine and threonine phosphorylation in different species also show that the two-step feature optimization method, which fully considers the importance and contribution of each feature dimension, can achieve higher prediction accuracy.
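As a quick arithmetic check of the proportions quoted above, the following sketch recomputes each ratio from the (selected, total) pairs given in the text and sums the dimensions. The variable and function names are illustrative only.

```python
# (selected, total) dimensions per feature type, in the order the ratios are quoted in the text.
aspergillus_serine = [(15, 400), (5, 5), (91, 126), (21, 28), (14, 21)]
c_neoformans_threonine = [(27, 400), (5, 5), (101, 126), (23, 28), (12, 21)]

def summarise(groups):
    """Return the per-type selection ratios, the selected total and the original total."""
    ratios = [selected / total for selected, total in groups]
    return ratios, sum(s for s, _ in groups), sum(t for _, t in groups)

asp_ratios, asp_selected, asp_total = summarise(aspergillus_serine)
cn_ratios, cn_selected, cn_total = summarise(c_neoformans_threonine)
```

The selected dimensions sum to 146 for serine sites in Aspergillus and to 168 for threonine sites in C. neoformans, in line with the dimensions listed in Table 3, and the five totals sum to the original 580 dimensions.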

Table 3

Prediction performance before and after feature optimization for fungi phosphorylation S/T/Y sites in seven organisms

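| Site | Organism | Before: dim | Before: Acc (%) | Before: Sn (%) | Before: Sp (%) | Before: AUC (%) | After: dim | After: Acc (%) | After: Sn (%) | After: Sp (%) | After: AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S | Aspergillus | 580 | 77.90 | 77.89 | 78.30 | 77.41 | 146 | 81.18 | 82.22 | 80.21 | 80.65 |
| S | C. neoformans | 580 | 76.06 | 77.16 | 75.37 | 75.80 | 77 | 81.01 | 83.56 | 78.82 | 80.81 |
| S | F. graminearum | 580 | 75.71 | 75.10 | 76.52 | 75.43 | 41 | 79.40 | 80.63 | 78.62 | 80.16 |
| S | M. oryzae | 580 | 77.08 | 76.67 | 77.57 | 77.30 | 45 | 80.10 | 78.38 | 82.03 | 80.80 |
| S | N. crassa | 580 | 78.12 | 77.25 | 79.13 | 78.02 | 151 | 80.29 | 76.66 | 85.06 | 80.78 |
| S | S. cerevisiae | 580 | 74.04 | 73.17 | 75.01 | 74.44 | 95 | 76.62 | 76.26 | 77.00 | 78.44 |
| S | S. pombe | 580 | 76.69 | 76.60 | 76.86 | 76.58 | 44 | 78.68 | 77.21 | 80.33 | 77.50 |
| T | Aspergillus | 580 | 69.88 | 70.74 | 69.58 | 70.65 | 123 | 74.18 | 74.03 | 75.06 | 75.34 |
| T | C. neoformans | 580 | 76.60 | 78.65 | 75.66 | 74.46 | 168 | 82.26 | 83.33 | 81.25 | 81.13 |
| T | F. graminearum | 580 | 72.55 | 72.70 | 72.66 | 72.14 | 90 | 81.13 | 78.95 | 83.67 | 81.21 |
| T | M. oryzae | 580 | 76.98 | 75.87 | 78.42 | 76.73 | 111 | 81.03 | 77.55 | 85.53 | 82.41 |
| T | N. crassa | 580 | 78.25 | 78.96 | 78.20 | 78.00 | 151 | 78.96 | 78.41 | 79.96 | 78.81 |
| T | S. cerevisiae | 580 | 69.28 | 71.30 | 67.27 | 73.16 | 94 | 75.35 | 72.39 | 79.20 | 75.08 |
| T | S. pombe | 580 | 71.61 | 73.19 | 70.98 | 71.68 | 94 | 73.48 | 73.64 | 73.85 | 73.40 |
| Y | S. cerevisiae | 580 | 62.96 | 63.11 | 62.88 | 62.54 | 51 | 69.93 | 71.63 | 68.48 | 70.00 |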
Figure 3

Comparison of the ratio of each type of feature for serine phosphorylation in Aspergillus, tyrosine phosphorylation in S. cerevisiae and threonine phosphorylation in C. neoformans. The horizontal axis represents the ratio of the dimensions of the chosen feature vectors to the total dimensions of each of the five types of features. The vertical axis represents the five types of features (AAC, DAAC, BPB, PSP and KNN).

Feature analysis of phosphorylation of different species

After that, we examined and visualized statistically significant differences in the compositional biases of sequences between phosphorylation and non-phosphorylation sites. Figure 4a displays the frequency of amino acid occurrence around serine phosphorylation sites. For Aspergillus, aspartic acid, proline, alanine, glutamic acid and glycine are enriched in the positive fragments, while polar amino acids including cysteine, threonine and serine are depleted. For S. pombe, the charged residues lysine, arginine, aspartic acid and glutamic acid are enriched in the positive fragments, while leucine, asparagine and tryptophan are depleted. This shows that there are significant differences among species. The result is confirmed by the two-sample logos (Figure 4c–d) [95]. Similarly, Figure S1 shows significant differences in threonine phosphorylation among the seven species. Then, based on IG, we chose the top 9 of the 544 PSP features to construct our model (Figure 4b). The analysis of feature importance shows that the selected nine PSP features indeed make a great contribution to our model.
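The quantity plotted in Figure 4a, the log2 ratio of the average amino acid frequencies in phosphorylation and non-phosphorylation fragments, can be sketched as below. The fragments are hypothetical, and the small pseudo-count is an assumption added here only to avoid taking the logarithm of zero.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mean_frequencies(fragments):
    """Average, over fragments, of each amino acid's frequency within a fragment."""
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for fragment in fragments:
        counts = Counter(fragment)
        for a in AMINO_ACIDS:
            totals[a] += counts[a] / len(fragment)
    return {a: totals[a] / len(fragments) for a in AMINO_ACIDS}

def log2_enrichment(positives, negatives, pseudo=1e-6):
    """log2(mean frequency in positives / mean frequency in negatives) per amino acid."""
    pos, neg = mean_frequencies(positives), mean_frequencies(negatives)
    # The pseudo-count is an assumption of this sketch, not part of the published method.
    return {a: math.log2((pos[a] + pseudo) / (neg[a] + pseudo)) for a in AMINO_ACIDS}

positives = ["DEPSPEA", "GDASPEE"]  # hypothetical fragments around phosphoserines
negatives = ["LCTSLTC", "WLCSTNL"]  # hypothetical fragments around unmodified serines
enrichment = log2_enrichment(positives, negatives)
```

Positive values indicate residues enriched around phosphorylation sites, negative values indicate depleted residues, and a value near zero (as for the central serine here) indicates no compositional bias.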

Figure 4

(a) Comparison of AAC between serine phosphorylation sites and non-phosphorylation sites for seven organisms. The vertical axis denotes the log2 ratio of the averages of amino acid frequencies for phosphorylation and non-phosphorylation sites. (b) Difference in 544 physicochemical properties based on IG scores. Two-sample Logos of composition biases around serine phosphorylation sites compared to the non-phosphorylation sites for (c) Aspergillus and (d) S. pombe.

Figure 5a and b show the frequency of amino acid pair occurrence. Taking serine phosphorylation in Aspergillus as an example, after two-step feature optimization we extracted the DAAC features from the final optimized feature set and plotted a radar map (Figure 5a). SP and SS have the highest values among the 15 amino acid pairs and thus represent the two most important pair types. Figure 5b shows the frequencies of the 400 amino acid pairs for serine phosphorylation in Aspergillus; SP and SS again have higher frequencies than the other amino acid pairs. The motif scores in Figure S2 are in accordance with these results [96]. We made the same analysis for threonine phosphorylation in C. neoformans (Figure S3). There, the frequency of TP is far higher than those of the other 26 amino acid pairs, and the DAAC heat map also shows that the pair TP is significant, which is consistent with the motif result. Subsequently, in Figure 5c and Figure S4a, we analysed the posterior probabilities of each amino acid at each position in the positive and negative training datasets for serine phosphorylation in Aspergillus. Figure 5c shows that the positive samples are enriched at positions −3, +1 and +3, which agrees with the two-sample logo results, while the negative samples have almost the same value at each position, indicating significant differences between positive and negative samples. The previous analysis of feature importance showed that the proportion of selected BPB features reaches 80%. Similarly, in Figure S4a the average frequency values of the positive samples at positions −4, −3, −2, +2, +3 and +4 are higher than those of the negative samples. Finally, we compared the average KNN scores of phosphorylation sites with those of non-phosphorylation sites in Figure 5d and Figure S4b.
Owing to differences in the available data, we selected different k values and comparison sets (details in Table S2). The KNN score at the different k values showed the biggest difference between positive and negative samples and exhibited the best performance. For example, for Aspergillus, the average KNN scores of serine phosphorylation sites are within 0.70–0.88, while those of non-phosphorylation sites are within 0.46–0.53. From the above feature optimization results, the Acc value for Aspergillus is 0.812, which is higher than that of other modifications. Taken together, these results show that there are significant differences among species for the same modified residue, which indicates the necessity of species-specific identification of fungi phosphorylation sites.
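A KNN score of the kind compared above can be sketched as the fraction of phosphorylated fragments among the k nearest neighbours of a query fragment. The sketch below uses a plain Hamming distance purely for brevity, which is an assumption of this example; published KNN features typically use a substitution-matrix-based distance. All fragments and function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length fragments."""
    return sum(x != y for x, y in zip(a, b))

def knn_score(query, positives, negatives, k):
    """Fraction of positive fragments among the k nearest neighbours of the query."""
    labelled = [(hamming(query, p), 1) for p in positives]
    labelled += [(hamming(query, n), 0) for n in negatives]
    # Sort by distance only; the sort is stable, so ties keep their input order.
    nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
    return sum(label for _, label in nearest) / len(nearest)

positives = ["RRASPEE", "KRPSPDE", "RKDSPEA"]  # hypothetical phosphoserine fragments
negatives = ["LLGSVCA", "ACFSLLG", "GVLSCAL"]  # hypothetical non-phosphorylated fragments
score_pos_like = knn_score("RRPSPEE", positives, negatives, k=3)
score_neg_like = knn_score("LLGSLCA", positives, negatives, k=3)
```

A query that resembles known phosphorylation fragments receives a score close to 1, whereas a query that resembles the background receives a lower score, which is the separation between positive and negative samples described in the text.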

Figure 5

(a) Radar map for the 15 most important pairs of DAAC on the basis of two-step feature optimization. The primary axis refers to the number of occurrences in the positive and negative samples with the unit of 100. (b) Heat map for the DAAC scores of amino acid pair composition. (c) Difference in the average frequency value of each amino acid in the positive training datasets for serine phosphorylation in Aspergillus on different positions. (d) Comparison of mean KNN scores between threonine phosphorylation sites and non-phosphorylation sites. The vertical axis represents the average KNN scores. The horizontal axis represents numbers of nearest neighbors in positive and negative samples.

Species-specific fungi phosphorylation site prediction of PreSSFP

In this work, 15 models for fungi phosphorylation site prediction were constructed. Using 10-fold cross-validation, we evaluated the performance of each model on its training dataset; the ROC curves are shown in Figure 6. For serine phosphorylation in seven species, the AUC values for Aspergillus, C. neoformans, F. graminearum, M. oryzae, N. crassa, S. cerevisiae and S. pombe were 0.807, 0.808, 0.802, 0.808, 0.807, 0.784 and 0.775, respectively. For threonine phosphorylation in the same seven species, the AUC values were 0.754, 0.811, 0.812, 0.824, 0.788, 0.751 and 0.734. For tyrosine phosphorylation in S. cerevisiae, the AUC value was 0.70. The ROC curves indicate that these models make relatively reliable predictions. In contrast, when we used the tyrosine phosphorylation datasets of each species other than S. cerevisiae as testing sets for the S. cerevisiae tyrosine phosphorylation model, the predictions were not fully satisfactory, with lower AUC scores and specificities (details in Table S3). In general, these results justify the necessity of species-specific prediction and show that PreSSFP is a robust predictor.
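For reference, the evaluation measures reported in this work (Acc, Sn, Sp, MCC and AUC) can be computed as in the sketch below, which uses their standard definitions; it is assumed here that the paper follows these definitions. The confusion-matrix counts and prediction scores are made-up illustrative values.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

acc, sn, sp, mcc = confusion_metrics(tp=80, tn=75, fp=25, fn=20)
example_auc = auc([0.9, 0.8, 0.4, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0, 0])
```

In a 10-fold cross-validation these measures would be computed on each held-out fold, or on the pooled held-out predictions, rather than on the data used for training.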

Comparison with other existing tools

To evaluate the prediction performance of PreSSFP objectively, general and species-specific predictors were selected for comparison. Taking into account the availability and representativeness of tools, we employed the general tool NetPhos 3.1, the yeast-specific predictor NetPhosYeast, the soybean-specific predictor PHOSFER and the human-specific predictor iPhos-PseEn. Because NetPhosYeast can only predict S/T phosphorylation in yeast proteins, we used it only on the S. cerevisiae, S. pombe and C. neoformans data. NetPhos 3.1, PHOSFER and iPhos-PseEn were used on all seven species. We directly submitted the testing dataset of each species to each tool and compared the results with those of PreSSFP. As shown in Table 4, for serine phosphorylation prediction in S. pombe, the AUC value of PreSSFP is 0.784, which is 19.3, 13.69 and 14.27 percentage points higher than those of NetPhos 3.1, PHOSFER and iPhos-PseEn, respectively; the AUC value of NetPhosYeast is 0.676. The comparison of PreSSFP with other tools for threonine and tyrosine phosphorylation sites is listed in Table S4. For threonine phosphorylation in C. neoformans, the AUC value of our model is higher than those of NetPhos 3.1, PHOSFER and iPhos-PseEn by 34.4%, 16.33% and 10.38%, respectively, and higher than that of NetPhosYeast by 24.18%. This suggests that PreSSFP performs better than the general tools, and that there are some differences among serine, threonine and tyrosine phosphorylation sites in the same species.
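The improvements quoted for S. pombe serine sites are absolute differences between AUC values, which can be verified directly from the Table 4 entries. The dictionary and function names below are illustrative only.

```python
# Serine-site AUC values (%) for S. pombe, copied from Table 4.
auc_s_pombe = {
    "NetPhos 3.1": 59.10,
    "NetPhosYeast": 67.58,
    "PHOSFER": 64.69,
    "iPhos-PseEn": 64.11,
    "PreSSFP": 78.38,
}

def point_gain(table, reference="PreSSFP"):
    """Absolute AUC gain of the reference tool over each other tool, in percentage points."""
    ref = table[reference]
    return {tool: round(ref - value, 2) for tool, value in table.items() if tool != reference}

gains = point_gain(auc_s_pombe)
```

The resulting gains are 19.28, 13.69 and 14.27 points over NetPhos 3.1, PHOSFER and iPhos-PseEn, matching the figures in the text, and 10.80 points over NetPhosYeast.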

Figure 6

Performance evaluation of PreSSFP. The ROC curves and AUC values for 10-fold cross-validations of the training dataset for seven organisms. (a) Serine phosphorylation. (b) Threonine phosphorylation.

Table 4

Comparison of prediction performance of PreSSFP with other tools in serine phosphorylation sites

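| Organism (serine sites) | Predictor | Acc (%) | Sn (%) | Sp (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|---|
| Aspergillus | NetPhos 3.1 | 54.28 | 53.89 | 54.74 | 8.38 | 54.27 |
| Aspergillus | PHOSFER | 64.47 | 81.43 | 59.40 | 34.38 | 69.08 |
| Aspergillus | iPhos-PseEn | 69.44 | 85.00 | 63.46 | 43.41 | 75.10 |
| Aspergillus | Our work | 83.88 | 85.03 | 82.80 | 67.80 | 85.20 |
| C. neoformans | NetPhos 3.1 | 57.53 | 56.88 | 58.33 | 15.14 | 57.96 |
| C. neoformans | NetPhosYeast | 55.48 | 52.99 | 83.33 | 19.95 | 68.46 |
| C. neoformans | PHOSFER | 58.56 | 74.51 | 55.19 | 22.55 | 58.41 |
| C. neoformans | iPhos-PseEn | 70.00 | 78.57 | 65.38 | 41.93 | 68.68 |
| C. neoformans | Our work | 76.03 | 76.39 | 75.68 | 52.06 | 79.66 |
| F. graminearum | NetPhos 3.1 | 56.62 | 56.14 | 57.18 | 13.29 | 58.05 |
| F. graminearum | PHOSFER | 62.47 | 88.10 | 57.45 | 33.70 | 72.24 |
| F. graminearum | iPhos-PseEn | 64.84 | 75.68 | 60.44 | 32.74 | 68.73 |
| F. graminearum | Our work | 76.23 | 75.25 | 77.30 | 52.51 | 76.67 |
| M. oryzae | NetPhos 3.1 | 58.16 | 57.46 | 58.99 | 16.38 | 57.70 |
| M. oryzae | PHOSFER | 56.44 | 70.90 | 53.81 | 17.84 | 63.04 |
| M. oryzae | iPhos-PseEn | 58.70 | 61.11 | 57.14 | 17.82 | 63.10 |
| M. oryzae | Our work | 73.65 | 73.31 | 74.01 | 47.31 | 75.07 |
| N. crassa | NetPhos 3.1 | 54.79 | 53.88 | 56.23 | 9.84 | 54.36 |
| N. crassa | PHOSFER | 57.03 | 69.57 | 54.29 | 18.31 | 60.38 |
| N. crassa | iPhos-PseEn | 69.42 | 75.00 | 65.88 | 39.85 | 69.96 |
| N. crassa | Our work | 78.52 | 77.14 | 80.04 | 57.10 | 78.04 |
| S. cerevisiae | NetPhos 3.1 | 54.39 | 54.12 | 54.70 | 8.80 | 54.71 |
| S. cerevisiae | NetPhosYeast | 56.46 | 53.70 | 75.72 | 19.50 | 64.65 |
| S. cerevisiae | PHOSFER | 52.78 | 67.82 | 51.51 | 10.37 | 58.99 |
| S. cerevisiae | iPhos-PseEn | 67.45 | 71.18 | 64.83 | 35.45 | 66.81 |
| S. cerevisiae | Our work | 71.06 | 68.55 | 74.36 | 42.51 | 70.13 |
| S. pombe | NetPhos 3.1 | 57.49 | 56.64 | 58.59 | 15.10 | 59.10 |
| S. pombe | NetPhosYeast | 56.83 | 53.85 | 80.39 | 21.62 | 67.58 |
| S. pombe | PHOSFER | 56.39 | 76.36 | 53.63 | 19.58 | 64.69 |
| S. pombe | iPhos-PseEn | 65.43 | 72.73 | 61.68 | 32.59 | 64.11 |
| S. pombe | Our work | 77.75 | 77.63 | 77.88 | 55.51 | 78.38 |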

Several reasons may account for such a big difference. First, with the large-scale identification of fungi phosphorylation sites using MS, the number of known fungi phosphorylation sites has increased markedly, and a model trained on limited data may be biased in its classification. For example, NetPhosYeast considered only 953 phosphoserine sites and 192 phosphothreonine sites from 675 yeast proteins, whereas PreSSFP integrated all experimental fungi phosphorylation sites from FPD. Second, different organisms may differ significantly in the sequence or structural patterns around phosphorylation sites. For instance, a comparative analysis of serine phosphorylation events between different eukaryotic species performed by Frades et al. revealed that the animal-specific motifs are mainly basic amino acids, whereas the plant-specific top discriminative n-grams contain many acidic amino acids [93]. The general tool NetPhos 3.1, the yeast-specific predictor NetPhosYeast, the soybean-specific predictor PHOSFER and the human-specific predictor iPhos-PseEn are therefore not suitable for predicting all species. The comparison of PreSSFP with the other tools highlights the necessity of developing systematic species-specific models to improve the prediction of phosphorylation sites. Third, NetPhos 3.1 and NetPhosYeast concentrated only on sequence information, PHOSFER merely combined several amino acid indices to encode protein sequences and iPhos-PseEn considered only evolutionary information, so none of them could fully extract the feature information. In contrast, PreSSFP integrated five feature strategies to capture sequence-derived information, evolutionary information and physicochemical property information more completely.

Conclusions

In this study, we developed PreSSFP, an online tool that uses a two-step feature optimization method to identify potential fungi phosphorylation sites from primary protein sequences. The analyses and the comparison with existing tools demonstrated that PreSSFP performs stably and satisfactorily and considerably improves the prediction of fungi phosphorylation sites. Feature analysis showed that phosphorylation sites from different fungal species differ significantly in sequence-derived information, indicating that distinguishing between species is important for improving the quality of fungi phosphorylation site prediction. Feature optimization further showed that the KNN feature was significant and strongly influenced the prediction of fungi phosphorylation sites. In future, newly identified fungi phosphorylation sites will be continuously collected and integrated into the computational models as they become available, to improve prediction further. In conclusion, we anticipate that PreSSFP can complement existing approaches to the identification of fungi phosphorylation sites. Additionally, combining computational analyses with experimental verification could provide useful information for understanding the modification mechanism.

Key Points

  • There are significant differences in phosphorylation among fungal species.

  • Fungi phosphorylation sites need to be classified by species before prediction.

  • We provide a tool for the computational prediction of fungi phosphorylation in seven species.

  • Extracting important features through two-step feature optimization is efficient and considerably improves the prediction performance of PreSSFP.

Funding

This work was supported by the National Natural Science Foundation of China (21665016 and 21305062) and the Natural Science Foundation of Jiangxi Province (20151BAB203022).

Cao Man is a graduate student at the School of Sciences, Nanchang University. Her research focuses on integrating tyrosine post-translational modification data for analysis and validation.

Guodong Chen is a graduate student at the School of Sciences, Nanchang University. His current research focuses on developing novel data analysis algorithms and software for prokaryotic lysine acetylation.

Jialin Yu is a graduate student at the School of Sciences, Nanchang University. His research focuses on deep learning and the prediction of protein structure and function.

Shaoping Shi is an associate professor at the School of Sciences, Nanchang University. Her research focuses on the development of novel data analysis algorithms and bioinformatics tools for the prediction of protein structure and function.

References

1. Pawson T. Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell 2004;116(2):191–203.

2. Bu YH, He YL, Zhou HD, et al. Insulin receptor substrate 1 regulates the cellular differentiation and the matrix metallopeptidase expression of preosteoblastic cells. J Endocrinology 2010;206(3):271.

3. Zhang J, Johnson GVW. Tau protein is hyperphosphorylated in a site-specific manner in apoptotic neuronal PC12 cells. J Neurochem 2010;75(6):2346–57.

4. Kim SH, Lee CE. Counter-regulation mechanism of IL-4 and IFN-α signal transduction through cytosolic retention of the pY-STAT6:pY-STAT2:p48 complex. Eur J Immunol 2011;41(2):461–72.

5. Uddin S, Lekmine F, Sassano A, et al. Role of Stat5 in type I interferon-signaling and transcriptional regulation. Biochem Bioph Res Co 2003;308(2):325–30.

6. Fuhrer T, Heer D, Begemann B, et al. High-throughput, accurate mass metabolome profiling of cellular extracts by flow injection-time-of-flight mass spectrometry. Anal Chem 2011;83(18):7074.

7. Studer RA, Rodriguez-Mias RA, Haas KM, et al. Evolution of protein phosphorylation across 18 fungal species. Science 2016;354(6309):229.

8. Eia R, Varbanets LD. Investigation of oxidative phosphorylation in mitochondrial fractions of fungi of the genus Fusarium. Mikrobiol Zh 1968;30(1):13–5.

9. Fehér Z, Szirák K. Signal transduction in fungi - the role of protein phosphorylation. Acta Microbiol Imm H 1999;46(2–3):269.

10. Potel CM, Lin MH, Ajr H, et al. Widespread bacterial protein histidine phosphorylation revealed by mass spectrometry-based proteomics. Nat Methods 2018;15(3):187.

11. Sacco F, Perfetto L, Cesareni G. Combining phosphoproteomics datasets and literature information to reveal the functional connections in a cell phosphorylation network. Proteomics 2018;18(5–6):e1700311.

12. Heazlewood JL, Durek P, Hummel J, et al. PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res 2008;36(Database issue):1015–21.

13. Lee TY, Bretana NA, Lu CT. PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics 2011;12:261.

14. Dou Y, Yao B, Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids 2014;46(6):1459–69.

15. Miller ML, Soufi B, Jers C, et al. NetPhosBac - a predictor for Ser/Thr phosphorylation sites in bacterial proteins. Proteomics 2009;9(1):116–25.

16. Bai Y, Chen B, Li M, et al. FPD: a comprehensive phosphorylation database in fungi. Fungal Biology 2017;121(10):869.

17. Ge F, Zhang F, Yang G, et al. Global phosphoproteomic analysis reveals the involvement of phosphorylation in aflatoxins biosynthesis in the pathogenic fungus Aspergillus flavus. Sci Rep 2016;6:34078.

18. Ramsubramaniam N, Harris SD, Marten MR. The phosphoproteome of Aspergillus nidulans reveals functional association with cellular processes involved in morphology and secretion. Proteomics 2015;14(21–22):2454–9.

19. Selvan LDN, Renuse S, Kaviyil JE, et al. Phosphoproteome of Cryptococcus neoformans. J Proteomics 2014;97:287–95.

20. Rampitsch C, Tinker NA, Subramaniam R, et al. Phosphoproteome profile of Fusarium graminearum grown in vitro under nonlimiting conditions. Proteomics 2012;12(7):1002–5.

21. Franck WL, Gokce E, Randall SM, et al. Phosphoproteome analysis links protein phosphorylation to cellular remodeling and metabolic adaptation during Magnaporthe oryzae appressorium development. J Proteome Res 2015;14(6):2408–24.

22. Xiong Y, Coradetti ST, Li X, et al. The proteome and phosphoproteome of Neurospora crassa in response to cellulose, sucrose and carbon starvation. Fungal Genet Biol 2014;72:21–33.

23. Shahid U, Lin S, Xu Y, et al. dbPAF: an integrative database of protein phosphorylation in animals and fungi. Sci Rep 2016;6:23534.

24. UniProt CT. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2018;46(5):2699.

25. Huang Y, Niu B, Gao Y, et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010;26(5):680–2.

26. Cao M, Chen GD, Wang LN, et al. Computational prediction and analysis for tyrosine post-translational modifications via elastic net. J Chem Inf Model 2018;58(6):1272–81.

27. Le Q, Sievers F, Higgins DG. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 2017;33(9):1331–7.

28. Shao J, Xu D, Tsai SN, et al. Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS One 2009;4(3):e4920.

29. Song J, Tan H, Shen H, et al. Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics 2010;26(6):752.

30. Jia C, Zuo Y, Zou Q, et al. O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique. Bioinformatics 2018;34(12):2029–36.

31. Wang LN, Shi SP, Wen PP, et al. Computing prediction and functional analysis of prokaryotic propionylation. J Chem Inf Model 2017;57(11):2896–904.

32. Suo SB, Qiu JD, Shi SP, et al. Position-specific analysis and prediction for protein lysine acetylation based on multiple features. PLoS One 2012;7(11):e49108.

33. Wen PP, Shi SP, Xu HD, et al. Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization. Bioinformatics 2016;32(20):3107–15.

34. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine Learning 2006;63(1):3–42.

35. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 2014;15(1):1625–51.

36. Zhao X, Ning Q, Ai M, et al. PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. J Theor Biol 2015;380(3):524–9.

37. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. 2011;2(3):1–27.

38. Olsen JV, Vermeulen M, Santamaria A, et al. Quantitative phosphoproteomics reveals widespread full phosphorylation site occupancy during mitosis. Sci Signal 2010;3(104):ra3.

39. Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 1999;294(5):1351–62.

40. Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 2003;31(13):3635–41.

41. Blom N, Sicheritz-Ponten T, Gupta R, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004;4(6):1633–49.

42. Iakoucheva LM, Radivojac P, Brown CJ, et al. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res 2004;32(3):1037–49.

43. Koenig M, Grabe N. Highly specific prediction of phosphorylation sites in proteins. Bioinformatics 2004;20(18):3620–7.

44. Xue Y, Li A, Wang L, et al. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006;7:163.

45. Neuberger G, Schneider G, Eisenhaber F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model. Biol Direct 2007;2:1.

46. Linding R, Jensen LJ, Ostheimer GJ, et al. Systematic discovery of in vivo phosphorylation networks. Cell 2007;129(7):1415–26.

47. Wong YH, Lee TY, Liang HK, et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res 2007;35:W588–94.

48. Huang HD, Lee TY, Tzeng SW, et al. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res 2005;33:W226–9.

49. Tang YR, Chen YZ, Canchaya CA, et al. GANNPhos: a new phosphorylation site predictor based on a genetic algorithm integrated neural network. Protein Eng Des Sel 2007;20(8):405–12.

50. Plewczynski D, Tkacz A, Wyrwicz LS, et al. AutoMotif Server for prediction of phosphorylation sites in proteins using support vector machine: 2007 update. J Mol Model 2008;14(1):69–76.

51. Plewczynski D, Basu S, Saha I. AMS 4.0: consensus prediction of post-translational modifications in protein sequences. Amino Acids 2012;43(2):573–82.

52. Wan J, Kang S, Tang C, et al. Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucleic Acids Res 2008;36(4):e22.

53. Huang H, Li L, Wu C, et al. Defining the specificity space of the human SRC homology 2 domain. Mol Cell Proteomics 2008;7(4):768–84.

54. Li L, Wu C, Huang H, et al. Prediction of phosphotyrosine signaling networks using a scoring matrix-assisted ligand identification approach. Nucleic Acids Res 2008;36(10):3263–73.

55. Saunders NF, Brinkworth RI, Huber T, et al. Predikin and PredikinDB: a computational framework for the prediction of protein kinase peptide specificity and an associated database of phosphorylation sites. BMC Bioinformatics 2008;9:245.

56. Brinkworth RI, Breinl RA, Kobe B. Structural basis and prediction of substrate specificity in protein serine/threonine kinases. Proc Natl Acad Sci U S A 2003;100(1):74–9.

57. Dang TH, Van Leemput K, Verschoren A, et al. Prediction of kinase-specific phosphorylation sites using conditional random fields. Bioinformatics 2008;24(24):2857–64.

58. Jung I, Matsuyama A, Yoshida M, et al. PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship. BMC Bioinformatics 2010;11:S10.

59. Song C, Ye M, Liu Z, et al. Systematic analysis of protein phosphorylation networks from phosphoproteomic data. Mol Cell Proteomics 2012;11(10):1070–83.

60. Su MG, Lee TY. Incorporating substrate sequence motifs and spatial amino acid composition to identify kinase-specific phosphorylation sites on protein three-dimensional structures. BMC Bioinformatics 2013;14:S2.

61. Xu Y, Wang X, Wang Y, et al. Prediction of posttranslational modification sites from amino acid sequences with kernel methods. J Theor Biol 2014;344:78–87.

62. Pejaver V, Hsu WL, Xin F, et al. The structural and functional signatures of proteins that undergo multiple events of post-translational modification. Protein Sci 2014;23(8):1077–93.

63. Banerjee S, Ghosh D, Basu S, et al. JUPred_MLP: Prediction of Phosphorylation Sites Using a Consensus of MLP Classifiers. India: Springer, 2016.

64. Xu Y, Song J, Wilson C, et al. PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction. Sci Rep 2018;8(1):8240.

65. Mackey AJ, Haystead TA, Pearson WR. CRP: cleavage of radiolabeled phosphoproteins. Nucleic Acids Res 2003;31(13):3859–61.

66. Xue Y, Ren J, Gao X, et al. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics 2008;7(9):1598–608.

67. Li T, Li F, Zhang X. Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach. Proteins 2008;70(2):404–14.

68. Gao J, Thelen JJ, Dunker AK, et al. Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol Cell Proteomics 2010;9(12):2586–600.

69. Ryu GM, Song P, Kim KW, et al. Genome-wide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases. Nucleic Acids Res 2009;37(4):1297–307.

70. Kim JH, Lee J, Oh B, et al. Prediction of phosphorylation sites using SVMs. Bioinformatics 2004;20(17):3179–84.

71. Fan W, Xu X, Shen Y, et al. Prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional information and random forest. Amino Acids 2014;46(4):1069–78.

72. Suo SB, Qiu JD, Shi SP, et al. PSEA: kinase-specific prediction and analysis of human phosphorylation substrates. Sci Rep 2014;4:4524.

73. Qiu WR, Xiao X, Xu ZC, et al. iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget 2016;7(32):51270–83.

74. Patrick R, Le Cao KA, Kobe B, et al. PhosphoPICK: modelling cellular context to map kinase-substrate phosphorylation events. Bioinformatics 2015;31(3):382–9.

75. Patrick R, Horin C, Kobe B, et al. Prediction of kinase-specific phosphorylation sites through an integrative model of protein context and sequence. Biochim Biophys Acta 2016;1864(11):1599–608.

76. Wang D, Zeng S, Xu C, et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 2017;33(24):3909–16.

77. Song J, Wang H, Wang J, et al. PhosphoPredict: a bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep 2017;7(1):6862.

78. Wang M, Wang T, Wang B, et al. A novel phosphorylation site-kinase network-based method for the accurate prediction of kinase-substrate relationships. Biomed Res Int 2017;2017:1826496.

79. Qiu WR, Sun BQ, Xiao X, et al. iPhos-PseEvo: identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory. Mol Inform 2017;36(5–6):1600010.

80. Qiu WR, Zheng QS, Sun BQ, et al. Multi-iPPseEvo: a multi-label classifier for identifying human phosphorylated proteins by incorporating evolutionary information into Chou's general PseAAC via grey system theory. Mol Inform 2017;36(3):1600085.

81. Liu Y, Wang M, Xi J, et al. PTM-ssMP: a web server for predicting different types of post-translational modification sites using novel site-specific modification profile. Int J Biol Sci 2018;14(8):946–56.

82. Li F, Li C, Marquez-Lago TT, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 2018; DOI: 10.1093/bioinformatics/bty522.

83. Puntervoll P, Linding R, Gemund C, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003;31(13):3625–30.

84. Yao Q, Gao J, Bollinger C, et al. Predicting and analyzing protein phosphorylation sites in plants using musite. Front Plant Sci 2012;3:186.

85. Wang X, Xu ML, Li BQ, et al. Prediction of phosphorylation sites based on Krawtchouk image moments. Proteins 2017;85(12):2231–38.

86. Que S, Li K, Chen M, et al. PhosphoRice: a meta-predictor of rice-specific phosphorylation sites. Plant Methods 2012;8:5.

87. Lin S, Song Q, Tao H, et al. Rice_Phospho 1.0: a new rice-specific SVM predictor for protein phosphorylation sites. Sci Rep 2015;5:11940.

88. Trost B, Kusalik A. Computational phosphorylation site prediction in plants using random forests and organism-specific instance weights. Bioinformatics 2013;29(6):686–94.

89. Ingrell CR, Miller ML, Jensen ON, et al. NetPhosYeast: prediction of protein phosphorylation sites in yeast. Bioinformatics 2007;23(7):895–7.

90. Trost B, Kusalik A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 2011;27(21):2927–35.

91. Shi SP, Xu HD, Wen PP, et al. Progress and challenges in predicting protein methylation sites. Mol BioSyst 2015;11(10):2610–9.

92. Chen GD, Cao M, Luo K, et al. ProAcePred: prokaryote lysine acetylation sites prediction based on elastic net feature optimization. Bioinformatics 2018; DOI: .

93. Bao W, You ZH, Huang DS. CIPPN: computational identification of protein pupylation sites by using neural network. Oncotarget 2017;8(65):108867–108879.

94. Frades I, Resjö S, Andreasson E. Comparison of phosphorylation patterns across eukaryotes by discriminative N-gram analysis. BMC Bioinformatics 2015;16:239.

95. Vacic V, Iakoucheva LM, Radivojac P. Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 2006;22(12):1536–7.

96. Chou MF, Schwartz D. Biological sequence motif discovery using motif-x. Curr Protoc Bioinformatics 2011;13(Unit 13):15–24.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data