Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods

Li, Fuyi; Wang, Yanan; Li, Chen; Marquez-Lago, Tatiana T; Leier, André; Rawlings, Neil D; Haffari, Gholamreza; Revote, Jerico; Akutsu, Tatsuya; Chou, Kuo-Chen; Purcell, Anthony W; Pike, Robert N; Webb, Geoffrey I; Ian Smith, A; Lithgow, Trevor; Daly, Roger J; Whisstock, James C; Song, Jiangning

doi:10.1093/bib/bby077

Abstract

The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. 
We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.

protease, substrate specificity, substrate cleavage, bioinformatics, sequence analysis, machine learning, prediction model

Introduction

Proteolysis, mediated by a variety of proteases, is a ubiquitous and irreversible biological process (BP) in cells, which breaks down proteins to peptides by cleaving the bonds between amino acids [1]. Moreover, proteases also play a major role in enzyme activation by releasing propeptides and targeting signals [2]. It has been increasingly recognised that proteolysis plays crucial roles in the cell cycle [3], cell death [4], pathway regulation [5, 6], protein degradation [7–9] and the inflammatory response [10–13]. Dysregulated proteolysis is involved in a number of human diseases, especially in cancers [14–21]. To date, hundreds of proteases have been identified based on systematic profiling using mammalian tissues and samples [22, 23]. Many proteases are highly specific to substrates that have appropriate structural and sequence features [24–27]. Therefore, a better understanding of protease specificity not only contributes to our knowledge of proteolysis mechanisms but also benefits the computational characterisation of novel protease-specific cleavage sites. Current experimental techniques for characterising the substrate specificity of proteases mainly include the N-terminal peptide identification method [28], one- and two-dimensional gel-based methods [28] and high-throughput mass spectrometry [29]. A number of attempts have also been made to explore the sequence and structural differences between cleavage and non-cleavage sites. Kazanov et al. [30] studied the structural preferences of cleavage sites by mapping 200 proteolytic events to the CutDB database [31]. The extracted cleavage sites were found to be more exposed than non-cleavage sites, and a considerable proportion of such cleavage sites were located in alpha-helices or beta-sheets, indicating substrate specificity at the secondary structure (SS) level. Previous studies show that, at least for certain proteases (e.g. caspase-3 and GluC) [32, 33], substrate specificity exhibits an SS preference. Taken together, these findings suggest that, to better distinguish cleavage sites from non-cleavage sites, a wide range of features should be integrated to build reliable protease-specific computational models.

The success of experimental identification of protease cleavage sites has led to widespread availability of protease-specific cleavage repertoires. Among these online resources, MEROPS is a very valuable knowledge database containing detailed biological annotation of protease-specific cleavage sites collected from a huge amount of research literature and large-scale proteomic studies. Another database, named ‘Degradome’, contains the approximately 600 mammalian proteases discovered to date [22]. CutDB [31] is a well-established online portal for annotating proteolytic machinery and events. Given the advances of data curation, the past two decades have witnessed a proliferation of computational tools developed to accurately identify protease substrates and cleavage sites, thereby complementing and guiding experimental studies, which are usually time-consuming and labour intensive. We roughly categorised these approaches into two types according to their methodologies: (i) methods based on sequence-scoring functions, including PeptideCutter [34], PEPS [35], PoPS [36], GraBCas [37], CaSPredictor [38], SitePrediction [39], GPS-CCD [40] and CAT3 [41]; and (ii) methods based on machine learning techniques, including CASVM [42], PCSS [43], Pripper [44], Cascleave [45], CaMPDB [46], LabCaS [47], PROSPER [48], Cascleave 2.0 [49], ScreenCap3 [50], Transfer-learner [51], PROSPERous [52] and iProt-Sub [53]. Generally, approaches based on sequence-scoring functions are able to deliver the prediction outcomes promptly, while machine learning-based frameworks often achieve better prediction performance.

Given the growing number of studies on computational characterisation of protease-specific cleavage sites, several reviews have been published to investigate these methods, in terms of model development [54–56]. However, the limitations of these reviews are mainly 2-fold: (i) most of the existing reviews are relatively out-of-date as current state-of-the-art computational methods have not been covered and (ii) except for a most recent review, which exclusively focused on caspase cleavage site prediction [57], none of these reviews systematically evaluated the prediction performance of surveyed predictors for protease-specific cleavage site prediction. To overcome these issues, in this article, we provide a comprehensive survey of the most up-to-date progress of large-scale computational studies on protease-specific cleavage site prediction. In total, 19 computational methods published to date (including 8 scoring function-based methods and 11 machine learning-based methods) were critically assessed, systematically benchmarked and thoroughly discussed in terms of algorithm construction, heterogeneous features extracted, performance evaluation strategy and software utility. Most importantly, we performed extensive independent tests to objectively assess the prediction performance of reviewed computational approaches based on a newly constructed independent test data set with cleavage sites specific to 10 proteases. Based on our review, we then point out the limitations of current methods, followed by some suggestions to further improve the prediction performance. We anticipate that our review will aid future development of computational methods for efficient and accurate protease-specific cleavage site predictions.

Materials and Methods

Construction of independent test data sets

To evaluate the prediction performance objectively, we constructed an independent test data set that minimises overlap with the training data sets. Note that in this survey analysis, we are more interested in predicted cleavage sites in proteins (not synthetic substrates) and that predictive methods are mainly aimed at identifying substrates for endopeptidases and not exopeptidases or omega peptidases. The independent test data sets were constructed as follows. First, we extracted all the protease-specific protein substrates and their cleavage sites from the latest version of the MEROPS database (release 12.0) [58]. It should be noted that we excluded synthetic substrates and peptides derived from phage display and similar techniques. Subsequently, the CD-HIT program was employed to remove sequence redundancy with an identity threshold of 70% between any two protein sequences. We then removed all the sequences that were present in MEROPS release 9.9 or any earlier release, as these releases were used to train most of the reviewed predictors, including the recently published PROSPERous [52] and iProt-Sub [53] methods. As a result, the final independent test data set contains 2536 substrates and 3641 cleavage sites, specific to 10 proteases. These cleavage sites were used as positive samples, while the negative samples were defined as sites in the substrate proteins that are not known to function as cleavage sites. Consequently, the numbers of negative and positive samples were highly imbalanced. To evaluate the existing methods using relatively balanced data sets, we randomly selected the same number of negative sites as there were positive samples. A statistical summary of the curated data sets in this study is shown in Table S1 in the Supplementary file.
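The balancing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `sample_balanced_negatives`, the 1-based position convention, the per-protein sampling and the fixed random seed are all hypothetical choices made for the example.

```python
import random

def sample_balanced_negatives(seq_length, positive_sites, n_negatives, seed=0):
    """Randomly pick non-cleavage positions from one substrate protein.

    positive_sites: 1-based P1 positions of known cleavage sites.
    Candidate negatives are all other P1 positions that still have a P1'
    residue, i.e. positions 1 .. seq_length - 1 (an assumption of this sketch).
    """
    positives = set(positive_sites)
    candidates = [p for p in range(1, seq_length) if p not in positives]
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    k = min(n_negatives, len(candidates))
    return sorted(rng.sample(candidates, k))

# Toy example: a 20-residue protein with two known cleavage sites.
positives = [5, 12]
negatives = sample_balanced_negatives(20, positives, n_negatives=len(positives))
```

Sampling as many negatives as positives gives a 1:1 class ratio for this toy protein; whether sampling is done per protein or over the pooled data set is not specified in the text and is left open here.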

Existing approaches for protease cleavage site prediction

Table 1 summarises the two types of existing methods used to predict cleavage sites, covering a wide range of aspects, including algorithm employed, calculated features, evaluation strategy and software availability. Figure 1 visualises the methodologies of these two types of approaches to provide a better understanding of their implemented workflows.

Table 1

A comprehensive list of the reviewed methods/tools for prediction of cleavage sites

Method classification	Tool^a	Year	Software availability	Webserver availability^b	Features	Scoring function/algorithm^c	Evaluation strategy	Protease specificity
Scoring function-based	PeptideCutter [34]	1999	No	Yes		AAO	-	38 proteases
	PEPS [35]	2003	Decommissioned	Decommissioned		CSSM	-	caspase-3; cathepsin-B; cathepsin-L
	PoPS [36]	2005	Yes	Yes		PSSM, SS, SA, 'PEST' region	-	36 proteases
	GraBCas [37]	2005	Decommissioned	Decommissioned		PSSM	-	caspases 1-9; granzyme B
	CaSPredictor [38]	2005	Decommissioned	Decommissioned		CCSearcher	-	caspases
	SitePrediction [39]	2009	No	Yes		BSI	Independent test	calpains; caspases; cathepsins; lysins; MMPs
	GPS-CCD [40]	2011	No	Yes		BSI	4, 6, 8, 10-fold CV and leave-one-out	calpains
	CAT3 [41]	2012	Yes	No		PSSM	10-fold CV and independent test	caspase-3
Machine learning-based	CASVM [42, 59]	2007	Decommissioned	Decommissioned	Binary	SVM	10-fold CV and independent test	caspases
	PCSS [43]	2010	No	Yes	Binary, AAF, SS, PSS	SVM	leave-one-out	caspases; granzyme B
	Pripper [44]	2010	Yes	No	Binary, SP	SVM, RF, C4.5	leave-one-out	caspases
	Cascleave [45]	2010	No	Yes	Binary, SP, PSA, BPB, PSS	SVR	5-fold CV and independent test	caspases
	CaMPDB [46]	2011	No	Yes	Binary, PSS, PSA	SVM	10-fold CV	calpains
	PROSPER [48]	2012	No	Yes	Binary, SP, PSS, PSA, PND	SVR	5-fold CV and independent test	24 proteases
	LabCaS [47]	2013	No	Yes	AAF, PSA, BSI, PSS	CRF	leave-one-out	calpains
	Cascleave 2.0 [49]	2014	Decommissioned	Decommissioned	Binary, PND, AAP, PFF, PP, PSA, PSSM, RCS	SVR	5-fold CV and independent test	caspases; granzyme B
	ScreenCap3 [50]	2014	No	Yes	Binary, PSSM	SVM	5-fold CV and independent test	caspase-3
	PROSPERous [52]	2017	No	Yes	NNS, AAF, WLS, BSI	LR	5-fold CV and independent test	90 proteases
	iProt-Sub [53]	2018	No	Yes	Binary, CKSAAP, PSSM, BSI, KNN, PP, PSS, PSA	SVR	5-fold CV and independent test	38 proteases

^aThe URL addresses for the listed tools are as follows:

PeptideCutter, http://web.expasy.org/peptide_cutter/;

PoPS, http://pops.csse.monash.edu.au/;

SitePrediction, http://www.dmbr.ugent.be/prx/bioit2-public/SitePrediction/;

CAT3, https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-13-14/MediaObjects/12859_2011_5127_MOESM9_ESM.RAR;

PCSS, https://modbase.compbio.ucsf.edu/peptide/;

Pripper, http://users.utu.fi/mijopi/Pripper/;

Cascleave, http://sunflower.kuicr.kyoto-u.ac.jp/sjn/Cascleave/webserver.html;

CaMPDB, http://calpain.org/predict.rb?cls=substrate;

GPS-CCD, http://ccd.biocuckoo.org/;

LabCaS, http://www.csbio.sjtu.edu.cn/bioinf/LabCaS/;

PROSPER, https://prosper.erc.monash.edu.au;

Cascleave 2.0, http://www.structbioinfor.org/cascleave2/index.html;

ScreenCap3, http://scap.cbrc.jp/ScreenCap3/index.php;

PROSPERous, http://prosperous.erc.monash.edu/;

iProt-Sub, http://iProt-Sub.erc.monash.edu.au/.

^bYes: The publication is accompanied with a webserver/tool and it is still functional; Decommissioned: The webserver/tool is no longer available; No: The publication has no webserver or tool.

^cAbbreviations: AAO, amino acid occurrence; CSSM, cleavage site scoring matrices; PSSM, position-specific scoring matrices; CCSearcher, caspase cleavage site searcher; BSI, BLOSUM62 substitution index; CV, cross validation; MMPs, matrix metallopeptidases; Binary, binary features; SVM, support vector machine; AAF, amino acid frequency; SS, secondary structure; PSS, predicted secondary structure; SP, sequence profile; RF, Random Forest; PSA, predicted solvent accessibility; BPB, bi-profile Bayesian signatures; SVR, support vector regression; CRF, conditional random fields; PND, predicted natively disorder; AAP, amino acid properties; PFF, protein functional features; PP, physicochemical properties; RCS, residue conservation score; NNS, nearest neighbour similarity; WLS, WebLogo-based sequence conservation; LR, logistic regression; CKSAAP, composition of k-spaced amino acid pairs; KNN, k-nearest neighbours features.

Scoring function-based methods are built on the assumption that similar sequences share similar biological functions [34]. They are constructed from a data set of experimentally verified cleavage sites and perform cleavage site prediction by measuring the sequence similarity between the query segment/protein and all the segments/proteins with known cleavage sites, using different sequence scoring functions. The calculated similarity scores are subsequently ranked, and the predicted cleavage sites are taken from the segments/proteins most similar to the query segment/sequence.

To construct a reliable and accurate predictive model for machine learning-based methods, well-assembled training data sets and appropriate selection of the core algorithm are required. In general, four main steps need to be considered to train a machine learning-based cleavage site prediction model: (i) construction of training data sets containing well-curated experimental cleavage sites, (ii) feature encoding using protein sequences from the training set, (iii) machine learning model selection and construction and (iv) model evaluation and performance optimisation.

According to the nomenclature of protease substrate specificity proposed in [59], amino acid residues in the substrate sequence are numbered as ‘…-P4-P3-P2-P1-P1|$ ^{\prime} $|-P2|$ ^{\prime} $|-P3|$ ^{\prime} $|-P4|$ ^{\prime} $|-…’, where the cleavage site is located between the P1 and P1|$ ^{\prime} $| positions.
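To make the position labels concrete, the following sketch maps the residues around a scissile bond to the P4 to P4′ labels. The helper name `cleavage_window`, the 0-based indexing, the default window size and the `-` padding character are assumptions introduced for illustration; the sequence is a made-up example.

```python
def cleavage_window(sequence, p1_index, n_up=4, n_down=4, pad="-"):
    """Label the residues around a cleavage site as P{n_up}..P1, P1'..P{n_down}'.

    p1_index is the 0-based index of the P1 residue; the scissile bond lies
    between sequence[p1_index] and sequence[p1_index + 1].
    Positions that fall outside the sequence are filled with `pad`.
    """
    labels, residues = [], []
    for offset in range(-n_up + 1, n_down + 1):
        idx = p1_index + offset
        # Non-prime side counts down to P1; prime side counts up from P1'.
        label = f"P{1 - offset}" if offset <= 0 else f"P{offset}'"
        labels.append(label)
        residues.append(sequence[idx] if 0 <= idx < len(sequence) else pad)
    return dict(zip(labels, residues))

# Toy sequence with a DEVD|G motif; P1 is the second Asp (index 5).
window = cleavage_window("MKDEVDGALT", p1_index=5)
```

Here `window["P1"]` is the residue immediately before the bond and `window["P1'"]` the residue immediately after it.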

Scoring function-based approaches

These methods differ in the sequence scoring functions they employ. Among them, PeptideCutter [34] is the simplest tool for cleavage site prediction, as it relies solely on the occurrence of consensus cleavage-site amino acids in the sequences. All other methods listed in Table 1 employ more sophisticated statistics to determine potential cleavage sites at the corresponding positions.

PEPS [35] uses individual rule-based endopeptidase Cleavage Site Scoring Matrices (CSSMs) for predicting caspase-3 cleavage sites. In [35], a CSSM was constructed on the principle that a position with high amino acid variability should receive a lower score, so that the score reflects the sequence conservation of the cleavage site. The position-specific score for an amino acid |$A$| at position i is defined as |$\textrm{Score}={f}_{A,i}/{n}_{dif \, A}$|, where |${f}_{A,i}={n}_{A,i}/N$| is the observed relative frequency of amino acid |$A$| at position i (|${n}_{A,i}$| is the observed incidence of amino acid A at position i and N is the number of input sequences), while |${n}_{dif \, A}$| is the number of different amino acids occurring at position i.
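The CSSM score defined above can be illustrated with a short Python sketch. It is not the PEPS implementation: the function name `build_cssm` and the four aligned tetrapeptides are invented for the example, and residues never observed at a position are simply absent from the table.

```python
from collections import Counter

def build_cssm(aligned_segments):
    """Per-position score table with Score = f_{A,i} / n_dif(i).

    f_{A,i} = n_{A,i} / N is the relative frequency of residue A at position i,
    and n_dif(i) is the number of different residues observed at position i.
    """
    n_seqs = len(aligned_segments)
    length = len(aligned_segments[0])
    cssm = []
    for i in range(length):
        counts = Counter(seg[i] for seg in aligned_segments)
        n_dif = len(counts)  # variability of position i
        cssm.append({aa: (n / n_seqs) / n_dif for aa, n in counts.items()})
    return cssm

# Made-up aligned cleavage segments (P4..P1), for illustration only.
segments = ["DEVD", "DEVD", "DQTD", "DEAD"]
cssm = build_cssm(segments)
```

A fully conserved position (the first and last columns here) keeps its full frequency, whereas a variable position is penalised by the number of different residues seen there.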

PoPS [36] applies a ‘specificity profile’ to each position in the cleavage region, which assigns a value to each individual amino acid. The value represents the relative contribution of the amino acid at the specific position to the overall substrate specificity of the protease. The ‘specificity profile’ is represented by a |$20\times N$| position-specific scoring matrix (PSSM), in which each entry |${p}_{i,j}$| gives the relative contribution of amino acid i at position j, and N is the length of the ‘sequence profile’ (SP). In addition, PoPS assigns a weight vector |$\left({w}_1,\dots, {w}_N\right)$| to represent the weight of each position in the cleavage region. For a sequence segment of N amino acids |${A}_1\dots {A}_N$|, let |${A}_k$| and |${A}_{k+1}\ \left(1\le k\le N-1\right)$| represent the P1 and P1|$ ^{\prime} $| residues, respectively; the cleavage score for the bond between |${A}_k$| and |${A}_{k+1}$| is then calculated as |${\sum}_{i=1}^N{w}_i{p}_{{A}_i,i}$|. The higher the score, the more favourable the cleavage.
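As a rough illustration of this weighted sum, the sketch below scores one candidate segment against a toy specificity profile. The function name `pops_score`, the dictionary-based profile layout, the profile values and the choice of 0.0 for residues missing from the profile are assumptions of this example rather than details of PoPS itself.

```python
def pops_score(segment, profile, weights):
    """Weighted sum of specificity-profile entries for one candidate segment.

    profile[i] maps each amino acid to its contribution at position i;
    weights[i] is the weight of position i. Residues absent from the
    profile contribute 0.0 (an assumption of this sketch).
    """
    if not (len(segment) == len(profile) == len(weights)):
        raise ValueError("segment, profile and weights must have equal length")
    return sum(w * column.get(aa, 0.0)
               for aa, column, w in zip(segment, profile, weights))

# Toy four-position profile and weights, invented for illustration.
profile = [{"D": 5.0, "E": 1.0}, {"E": 3.0}, {"V": 2.0}, {"D": 5.0}]
weights = [1.0, 0.5, 0.5, 2.0]
score = pops_score("DEVD", profile, weights)
```

Sliding this function along a protein and ranking the resulting scores would give the most favourable candidate bonds under the chosen profile.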

GraBCas [37] is another score-based tool for predicting cleavage sites for caspases-1–9 and granzyme B. For each protease, GraBCas develops a 3 × 20 PSSM based on the experimentally determined substrate segments. The three rows of the PSSM correspond to the positions P4, P3 and P2 of a sequence motif with a potential cleavage site. To calculate the cleavage score, GraBCas screens all the tetra-peptides |${\textrm{A}}_4{\textrm{A}}_3{\textrm{A}}_2\textrm{D}$| with Asp (D) at P1. The scoring method is defined as follows:

$$ Score\left({\textrm{A}}_4{\textrm{A}}_3{\textrm{A}}_2\textrm{D}\right)=100\times \frac{{Score}_{\textrm{P}4}\left({\textrm{A}}_4\right)\times {Score}_{\textrm{P}3}\left({\textrm{A}}_3\right)\times {Score}_{\textrm{P}2}\left({\textrm{A}}_2\right)}{1000^3}, $$

(1)

where |${Score}_{\textrm{P}4}\left({\textrm{A}}_4\right)$|⁠, |${Score}_{\textrm{P}3}\left({\textrm{A}}_3\right)$| and |${Score}_{\textrm{P}2}\left({\textrm{A}}_2\right)$| denote the corresponding matrix entries of |${\textrm{A}}_4$| at P4, |${\textrm{A}}_3$| at P3 and |${\textrm{A}}_2$| at P2, respectively.
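A minimal sketch of this tetrapeptide scan is given below, using the P4, P3 and P2 matrix entries just defined. The function name `grabcas_score`, the dictionary layout of the PSSM and the toy entries are hypothetical; the assumption that entries lie on a 0 to 1000 scale is inferred from the |$1000^3$| normalisation and is not stated in the text.

```python
def grabcas_score(sequence, pssm):
    """Score every tetrapeptide A4-A3-A2-D in a sequence (Asp required at P1).

    pssm maps "P4", "P3" and "P2" to {amino acid: matrix entry}.
    Returns a list of (0-based index of the P1 residue, score) pairs.
    Residues missing from the matrix score 0 (an assumption of this sketch).
    """
    hits = []
    for p1 in range(3, len(sequence)):
        if sequence[p1] != "D":
            continue  # only tetrapeptides ending in Asp are screened
        a4, a3, a2 = sequence[p1 - 3], sequence[p1 - 2], sequence[p1 - 1]
        product = (pssm["P4"].get(a4, 0)
                   * pssm["P3"].get(a3, 0)
                   * pssm["P2"].get(a2, 0))
        hits.append((p1, 100 * product / 1000**3))
    return hits

# Toy matrix entries, invented for illustration.
toy_pssm = {"P4": {"D": 1000}, "P3": {"E": 500}, "P2": {"V": 800}}
hits = grabcas_score("MDEVDG", toy_pssm)
```

With entries bounded by 1000, the score is bounded by 100, which makes it read like a percentage of the best possible motif.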

CaSPredictor [38] was developed based on the Caspase Cleavage Site searcher (CCSearcher) algorithm using the following three parameters: the BLOSUM 62 matrix index, Amino Acid Frequency (AAF) index and PEST index [61, 62]. The BLOSUM 62 matrix is applied to each amino acid in the cleavage segment P4-P1|$ ^{\prime} $|⁠, and the associated index is defined as follows:

$$ {I}_s=\frac{\sum_{P1^\prime}^{P4} MS\left(\textrm{Ps}\right)}{\sum_{P1^\prime}^{P4} MS\left(\textrm{As}\right)}, $$

(2)

where ‘As’ denotes the experimentally annotated cleavage site, and ‘Ps’ indicates the potential cleavage site. |${I}_s$| represents the sequence similarity between the potential site and the true cleavage sites, and ranges from 0 to 1. The highest |${I}_s$| is then used in the CCSearcher algorithm. The AAF index of a potential cleavage segment is calculated as the mean of the normalised relative frequencies over its positions, |${I}_F={\overline{M}}_{Ps}\left({NF}_{P4},{NF}_{P3},{NF}_{P2},{NF}_{P1},{NF}_{P1^\prime}\right)$|, where |${NF}_i$| is the normalised relative frequency, defined as |${NF}_i={f}_i/{f}_{i\ \mathit{\max}}$| with |${f}_i={n}_i/N$|. This is similar to the method used in PEPS, but here the relative frequency of the amino acid at position i is divided by that of the most frequent amino acid at the same position. The PEST index |${I}_{PEST}$| is calculated by assigning a value of 1 to amino acids characteristic of PEST regions [61, 62], including Ser, Thr, Pro, Glu or Asp, Asn and Gln. Finally, the CCSearcher algorithm combines these three indices into the CCScore:

$$\textrm{CCScore}={I}_s\frac{\left({I}_F+{I}_{PEST}\right)}{2}. $$

(3)
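The AAF index and the final combination in Equation (3) can be sketched as follows. This is an illustration only: the function names, the per-position count tables and the numeric inputs are invented, and the similarity and PEST indices are passed in as precomputed numbers because the text does not fully specify how they are aggregated.

```python
def aaf_index(segment, position_counts):
    """Mean over positions of NF_i = f_i / f_i_max for the segment's residues.

    position_counts[i] maps amino acids to their counts n_i at position i of
    the known cleavage segments. Because f_i = n_i / N, the N cancels and
    NF_i reduces to n_i / n_i_max.
    """
    nf_values = [counts.get(aa, 0) / max(counts.values())
                 for aa, counts in zip(segment, position_counts)]
    return sum(nf_values) / len(nf_values)

def cc_score(i_s, i_f, i_pest):
    """Combine the three indices: CCScore = I_s * (I_F + I_PEST) / 2."""
    return i_s * (i_f + i_pest) / 2

# Toy counts for positions P4, P3, P2, P1 and P1' (made-up numbers).
counts = [{"D": 8, "E": 2}, {"E": 5, "Q": 5}, {"V": 10}, {"D": 10}, {"G": 4, "S": 6}]
i_f = aaf_index("DEVDG", counts)
score = cc_score(i_s=0.8, i_f=i_f, i_pest=0.4)
```

Because the similarity index multiplies the averaged term, a candidate with low similarity to every known site scores low regardless of its frequency and PEST contributions.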

Similar to CaSPredictor, SitePrediction [39] adopts the BLOSUM 62 matrix to calculate the similarity between the potential cleavage sites and the known cleavage sites. Some extra features were also integrated into SitePrediction, such as PEST sequence occurrence, solvent accessibility and SS prediction information.

GPS-CCD [40] was built to predict calpain cleavage sites based on the BLOSUM 62 matrix. In GPS-CCD, a potential calpain cleavage segment is compared with every experimentally validated cleavage segment to calculate its similarity score, which is defined as follows:

$$ S\left(A,B\right)=\sum \limits_{-m\le i\le n} Score\!\left(A\!\left[i\right],B\!\left[i\right]\right), $$

(4)

where |$Score\!\left(A\!\left[i\right],B\!\left[i\right]\right)$| is the BLOSUM 62 substitution score between the amino acids of segments A and B at position i, and m and n are the numbers of residues upstream and downstream of the cleavage site in the cleavage segment, respectively. Finally, the average of these similarity scores over all known cleavage segments is taken as the final score of the queried potential cleavage segment.
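The similarity score in Equation (4) and the final averaging step can be sketched as below. To keep the example self-contained, a toy match/mismatch function stands in for the BLOSUM 62 matrix; that stand-in, the function names and the example segments are all assumptions, and a real substitution matrix would be supplied in practice.

```python
def segment_similarity(seg_a, seg_b, sub_score):
    """Sum of per-position substitution scores between two aligned segments."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(sub_score(x, y) for x, y in zip(seg_a, seg_b))

def average_similarity(query, known_segments, sub_score):
    """Average similarity of a query segment to all known cleavage segments."""
    sims = [segment_similarity(query, known, sub_score) for known in known_segments]
    return sum(sims) / len(sims)

def toy_sub_score(x, y):
    """Stand-in for a BLOSUM 62 lookup: +4 for identity, -1 otherwise."""
    return 4 if x == y else -1

# Made-up calpain-like segments, for illustration only.
known = ["LKTPSA", "LRTPGA"]
final_score = average_similarity("LKTPSA", known, toy_sub_score)
```

Passing the substitution function as an argument keeps the scoring logic independent of how the matrix is stored.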

Figure 1

Flow charts of (A) scoring function-based methods and (B) machine learning-based methods. For each type of method, the key steps are summarised and visualised. Scoring function-based methods predict potential cleavage sites. The statistical scoring functions are based on the sequence alignments containing known cleavable sequence segments. Machine learning-based methods generate the prediction outcomes using trained models based on a variety of heterogeneous features.

CAT3 [41] is another tool that uses the PSSM approach for caspase-3 cleavage site prediction. First, position-specific frequency matrices (14 × 20) are generated from the multiple sequence alignments of the relevant segments. Each matrix has 14 rows, corresponding to the positions P9P8P7P6P5P4P3P2P1P1|$ ^{\prime} $|P2|$ ^{\prime} $|P3|$ ^{\prime} $|P4|$ ^{\prime} $|P5|$ ^{\prime} $|, with an Asp residue at position P1, while the 20 columns represent the frequencies of the 20 amino acids. Then, four scoring matrices are built to calculate the final prediction score of CAT3. These four scoring matrices, |$\textbf{A}$|, |$\textbf{B}$|, |$\textbf{C}\textbf{1}$| and |$\textbf{C}\textbf{2}$|, are defined as follows:

$$\begin{align}\textbf{A}=&\,{\log}_2\frac{\textrm{FM}{1}^{+}}{\Omega},\kern1em \textbf{B}={\log}_2\frac{\textrm{FM}{1}^{+}-\textrm{FM}{1}^{-}}{\Omega},\nonumber\\ {}\textbf{C}1=&\,{\log}_2\frac{\left[\textrm{FM}{1}_{\textrm{P}4=\textrm{D}}^{+}\right]-\left[\textrm{FM}{1}_{\textrm{P}4=\textrm{D}}^{-}\right]}{\Omega},\\ {}\textbf{C}2=&\,{\log}_2\frac{\left[\textrm{FM}{1}_{\textrm{P}4\ne \textrm{D}}^{+}\right]-\left[\textrm{FM}{1}_{\textrm{P}4\ne \textrm{D}}^{-}\right]}{\Omega}\nonumber\end{align}$$

(5)

where |$ \textrm{FM}{1}^{+} $| denotes the frequency matrix of positive segments, |$ \textrm{FM}{1}^{-} $| represents the frequency matrix of negative segments and |$\Omega$| is the AAF. |$\textrm{FM}{1}_{\textrm{P}4=\textrm{D}}^{+} $| and |$\textrm{FM}{1}_{\textrm{P}4\ne \textrm{D}}^{+}$| denote the frequency matrices built from the subsets of positive segments with |$\textrm{P}4=\textrm{D} $| and |$\textrm{P}4\ne \textrm{D}$|, respectively. |$\textrm{FM}{1}_{\textrm{P}4=\textrm{D}}^{-}$| and |$\textrm{FM}{1}_{\textrm{P}4\ne \textrm{D}}^{-}$| represent the frequency matrices built from the subsets of negative segments with |$\textrm{P}4=\textrm{D} $| and |$\textrm{P}4\ne \textrm{D}$|, respectively. Finally, the prediction score |$\textbf{S}$| of CAT3 is calculated as follows:

$$\textbf{S}=\frac{\textbf{A}+\textbf{B}+\textbf{C}}{3},\kern2em \textbf{C}=\begin{cases}\mathbf{C1},&\!\!\textrm{if}\ \textrm{P4 = D}\\ \mathbf{C2},& \!\!\textrm{if}\ \textrm{P4}\ne\textrm{D}\end{cases} $$

(6)

Machine learning-based approaches

Sequence scoring function-based methods are fairly straightforward because they usually employ simplified statistical methods based on a variety of scoring functions, such as sequence similarity, sequence patterns and PSSMs. However, such functions generally do not perform well for cleavage site prediction. Compared to scoring function-based methods, machine learning-based methods generally achieve superior prediction performance as a result of using more sophisticated algorithms and calculating descriptive sequence and structural features. Among the machine learning-based approaches, support vector machine (SVM) and support vector regression (SVR) are the most widely applied algorithms (Table 1). In the rest of this section, we introduce these and other machine learning techniques employed in protein cleavage site prediction, together with their usage in the corresponding approaches.

Sequence-based features

Machine learning-based methods usually need to calculate sequence/sequence-based features for model training. A variety of sequence features and sequence-based features have been used in the reviewed cleavage site prediction methods, including binary features, sequence features (AAF, SP, amino acid properties, composition of k-spaced amino acid pairs, among others), PSSMs, predicted structural features (PSS, predicted solvent accessibility (PSA), predicted disorder, among others), protein functional features and physicochemical properties. Note that the majority of these features can be calculated and extracted using the recently developed iFeature toolkit [63].

Binary features are among the most widely used feature types in machine learning-based methods. Nine methods, namely CASVM, PCSS, Pripper, Cascleave, CaMPDB, PROSPER, Cascleave 2.0, ScreenCap3 and iProt-Sub, were built incorporating binary features. PSS is another frequently used feature, which was used in six methods, including PCSS, Cascleave, CaMPDB, PROSPER, LabCaS and iProt-Sub. PSA is another feature encoding scheme for the cleavage substrate sequences, and has been used in Cascleave, CaMPDB, PROSPER, LabCaS, Cascleave 2.0 and iProt-Sub.
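As a minimal sketch of the binary encoding scheme, the snippet below maps each residue of a cleavage window to a 20-dimensional 0/1 vector and concatenates the vectors. The amino acid ordering, the example window `DEVDGSAT` and the function name `binary_encode` are illustrative assumptions and are not taken from any of the reviewed tools.

```python
# The 20 standard amino acids in a fixed order (the order itself is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(segment):
    """Return a flat 0/1 vector of length 20 * len(segment)."""
    vector = []
    for residue in segment:
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # non-standard residues (e.g. 'X') stay all-zero
            one_hot[idx] = 1
        vector.extend(one_hot)
    return vector


# Hypothetical P4-P4' window with Asp at P1 (fourth residue).
window = "DEVDGSAT"
features = binary_encode(window)
print(len(features), sum(features))
```

An eight-residue window therefore yields a 160-dimensional vector with exactly one non-zero entry per position, which explains why binary features quickly dominate the dimensionality of the initial feature sets.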

Feature selection methods

From the reviewed methods shown in Table 1, one can observe that the number of features employed for model training has increased over time, and with it the dimensionality of the feature sets. Further, the initial feature sets usually contained redundant and noisy features. For this reason, three methods (PROSPER, Cascleave 2.0 and iProt-Sub) applied feature selection algorithms to characterise feature importance and subsequently used the optimal feature sets for training the final machine learning models. In this section, we briefly discuss the feature selection strategies used by these three methods.

PROSPER initially calculated five types of sequence-based features (448 features in total), including binary, SP, PSS, PSA and predicted native disorder features. The Mean Decrease Gini Index (MDGI) calculated by random forest (RF) was used to measure feature importance. To identify the more informative features in the initial feature set, an MDGI score is calculated for each feature |$f$| as follows:

$$ {MDGI}_f=\frac{Gini_f-\overline{Gini}}{\sigma }, $$

(7)

where |${Gini}_f$| is the Gini score for the feature f, |$\overline{Gini}$| is the average Gini score of all the features in the initial feature set and |$\sigma$| is the standard deviation. In PROSPER, the features with MDGI score greater than 1.0 were used as the optimal features to train the classifier.
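The standardisation in Equation (7) can be sketched as follows. The Gini scores below are invented for illustration, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def mdgi_scores(gini):
    """Standardise raw Gini scores as in Equation (7): (Gini_f - mean) / sigma."""
    mu = mean(gini.values())
    sigma = pstdev(gini.values())  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all Gini scores are identical; MDGI is undefined")
    return {name: (g - mu) / sigma for name, g in gini.items()}


# Hypothetical raw Gini scores for five features.
gini = {"f1": 0.9, "f2": 0.1, "f3": 0.2, "f4": 0.15, "f5": 0.15}
scores = mdgi_scores(gini)

# Keep the features whose MDGI score exceeds the 1.0 threshold used by PROSPER.
selected = [name for name, s in scores.items() if s > 1.0]
print(selected)
```

Because the scores are standardised, the threshold of 1.0 keeps only the features whose Gini score lies more than one standard deviation above the mean.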

The initial feature sets of Cascleave 2.0 and iProt-Sub both contain over 4000 features. In general, machine learning algorithms struggle to learn a high-quality prediction model from such a high-dimensional feature set. These two methods therefore applied a two-step feature selection strategy, which combined the minimum Redundancy Maximum Relevance (mRMR) algorithm [64] with forward feature selection (FFS), to evaluate feature importance. The 1st step of this strategy is to use the mRMR algorithm to rank all the features in the initial feature set according to their redundancy and their relevance to the class labels. Then, both Cascleave 2.0 and iProt-Sub used the top 100 features as the optimal feature candidates (OFCs). The 2nd step is to use the FFS algorithm to select the final optimal feature set. FFS builds up to 100 feature subsets by adding one feature from the OFCs at a time. This process continues until no more features can be added to improve the prediction performance.
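The 2nd step can be sketched as a greedy loop. This is a simplified illustration and not the implementation used by Cascleave 2.0 or iProt-Sub: `toy_evaluate` is a hypothetical stand-in for a cross-validated performance estimate, and the feature names and gains are invented.

```python
def forward_feature_selection(candidates, evaluate):
    """Greedily add the candidate that most improves evaluate(subset)."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every subset obtained by adding one remaining candidate.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the performance any further
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature contributions to a performance measure.
GAIN = {"bin_P1": 0.20, "pss_P2": 0.10, "psa_P1": 0.03, "noise": -0.02}


def toy_evaluate(subset):
    # Stand-in for training and cross-validating a model on `subset`.
    return 0.5 + sum(GAIN[f] for f in subset)


chosen, score = forward_feature_selection(GAIN, toy_evaluate)
print(chosen, round(score, 2))
```

In practice the candidates would be the mRMR-ranked OFCs, and each call to the evaluation function would involve retraining the model, which is why the candidate list is first cut down to a manageable size.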

Support vector machine

Among the machine learning-based approaches listed in Table 1, CASVM [42, 60] and CaMPDB [46] employed SVM as their core algorithm. CASVM is the 1st SVM-based method that uses the RBF kernel and integrates binary-encoding features for cleavage site prediction. Another SVM-based method, CaMPDB [46], was built for calpain cleavage prediction using a multiple kernel learning strategy that incorporates the binary features. In addition, SVM is one of the core algorithms in Pripper [44], together with RF and C4.5. Pripper uses the RBF kernel as the core kernel of its SVM. The benchmarking test showed that the SVM model of Pripper outperformed the RF and C4.5 models in terms of prediction accuracy.

SVMs are trained by the generation of hyperplanes based on the training data, to optimise the margin separating cleavage and non-cleavage sites, by seeking a linear discriminative function:

$$f \!\left(\textbf{x}\right)={\textbf{w}}^T\Phi \!\left(\textbf{x}\right)+{\textrm{w}}_0,$$

(8)

where |$\textbf{w}$| is the weight vector, |${\textrm{w}}_0$| is the bias and |$\Phi$| is a basis function that maps the feature vectors from the training data to a higher dimensional space. Note that although f(x) is linear in |$\Phi \!\left(\textrm{x}\right)$|, it can be a non-linear function of x when |$\Phi$| is non-linear.

Support vector regression

SVR is a variation of SVM that can be used in regression tasks. SVR has been applied in a series of works, such as Cascleave [45], PROSPER [48], Cascleave 2.0 [49] and the most recently published iProt-Sub. SVR retains all the key features of SVM; one of the key differences between SVM and SVR is the optimisation function. For SVMs, to optimise the hyperplane, the width of the margin, which is inversely proportional to |$\left|\left|\textbf{w}\right|\right|$|, needs to be maximised. The optimisation function to be minimised for SVMs is

$$ \textrm{C}\sum \limits_{n=1}^N{\xi}_n+\frac{1}{2}{\left|\left|\textbf{w}\right|\right|}^2; $$

(9)

while the optimisation function for SVRs is

$$ \textrm{C}\sum \limits_{n=1}^N\left({\xi}_n+\widehat{\xi_n}\right)+\frac{1}{2}{\left|\left|\textbf{w}\right|\right|}^2. $$

(10)

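Here, C > 0 is a regularisation parameter that balances the slack penalty against the margin width. A slack variable |${\xi}_n$| is assigned to each training sample in SVMs. In contrast, there are two slack variables, |${\xi}_n$| and |$\widehat{\xi_n}$|, for each training sample in SVRs.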
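To make the difference between Equations (9) and (10) concrete, the sketch below evaluates both objectives for a fixed linear model. It assumes the identity basis function, the standard hinge-type slack for SVMs and the standard ε-insensitive slacks for SVRs; the tube half-width `eps` and all numerical values are illustrative assumptions that do not appear in the text.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def svm_objective(w, w0, X, y, C):
    """Equation (9): one slack per sample, with labels y in {-1, +1}."""
    slack = [max(0.0, 1.0 - t * (dot(w, x) + w0)) for x, t in zip(X, y)]
    return C * sum(slack) + 0.5 * dot(w, w)


def svr_objective(w, w0, X, y, C, eps):
    """Equation (10): two slacks per sample, with real-valued targets y."""
    total = 0.0
    for x, t in zip(X, y):
        f = dot(w, x) + w0
        xi = max(0.0, t - f - eps)      # target lies above the eps-tube
        xi_hat = max(0.0, f - t - eps)  # target lies below the eps-tube
        total += xi + xi_hat
    return C * total + 0.5 * dot(w, w)


w, w0 = [1.0], 0.0
X = [[2.0], [0.5], [-1.0]]
print(svm_objective(w, w0, X, [1, 1, -1], C=1.0))
print(svr_objective(w, w0, X, [2.0, 1.0, -1.05], C=1.0, eps=0.1))
```

At most one of the two SVR slacks is non-zero for any sample, since a target cannot lie above and below the tube at the same time.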

C4.5

Pripper [44] is the 1st and the only predictor for cleavage site prediction that employed the C4.5 algorithm [65], which is a heuristic algorithm for inducing decision trees. C4.5 uses the gain ratio, based on information entropy theory, to determine the most discriminative feature for each split. The split information is defined as follows:

$$SplitInfo(S)=-\sum \limits_{i\in S}\frac{\left|{D}_i\right|}{\left|D\right|}{\mathit{\log}}_2\left(\frac{\left|{D}_i\right|}{\left|D\right|}\right), $$

(11)

where S is the splitting rule, D denotes the training data set and |${D}_i$| is a subset of D. This value represents the potential information generated by splitting the training data set D under the splitting rule S. The gain ratio is therefore defined as

$$\Delta {F}_{gainRatio}(S)=\frac{\Delta \kern0.1em {F}_{infoGain}(S)}{SplitInfo(S)}.$$

(12)

where |$\Delta \kern0.1em {F}_{infoGain}(S)$| is the information gain under the splitting rule S. The feature with the maximum gain ratio is then selected as the splitting attribute. One advantage of using a decision tree algorithm is the interpretability of the generated rules. Essentially, the rules generated by C4.5 follow the ‘if…then…’ pattern, which can be easily interpreted by biologists and experimentalists, thereby opening opportunities and providing potentially novel observations of the biological scenarios.
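Equations (11) and (12) can be illustrated with a short sketch. It assumes the usual entropy-based definition of information gain; the label lists are invented, with "c" and "n" standing for cleavage and non-cleavage samples.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def gain_ratio(groups):
    """groups: one list of class labels per subset D_i produced by the split."""
    groups = [g for g in groups if g]  # ignore empty subsets
    all_labels = [label for g in groups for label in g]
    n = len(all_labels)
    info_gain = entropy(all_labels) - sum(len(g) / n * entropy(g) for g in groups)
    split_info = sum(len(g) / n * log2(n / len(g)) for g in groups)
    return info_gain / split_info if split_info > 0 else 0.0


# A split that separates the classes perfectly versus an uninformative one.
print(gain_ratio([["c", "c", "c", "c"], ["n", "n", "n", "n"]]))
print(gain_ratio([["c", "n"], ["c", "n"]]))
```

Dividing by the split information penalises rules that fragment the data into many small subsets, which plain information gain would otherwise favour.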

Random forest

Besides C4.5 and SVM, Pripper [44] was also trained with RF [66–70]. Consisting of a number of decision trees, RF has proven to be simple but very powerful in practice. An RF usually contains multiple classification and regression trees [71] constructed using bootstrapped subsamples of the training data, and delivers prediction results via the majority voting strategy. Two random processes are involved in the construction of an RF: bootstrapped subsampling and random selection of the features from which the splitting feature for each node is determined based on the Gini index. After an RF is constructed, given a test feature vector x, the classification label y can be estimated as

$$H \!\left(\textbf{x}\right)={\textrm{argmax}}_{y \in \mathfrak{y}} {\sum}_{t=1}^TI\left({h}_t \! \left(\textbf{x}\right)=y\right),$$

(13)

where |$\mathfrak{y}$| is the class label set, T denotes the number of classification trees, |$I(.)$| represents the indicator function and |${h}_t \! \left(\textbf{x}\right)$| indicates the classification function of tree t.
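The voting rule of Equation (13) is sketched below. The three "trees" are hand-written decision stumps that merely stand in for trained classification trees; their rules, the feature names and the example inputs are hypothetical.

```python
from collections import Counter


def majority_vote(trees, x):
    """Equation (13): return the label predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stumps h_t(x); x maps position names to residues.
trees = [
    lambda x: "cleavage" if x["P1"] == "D" else "non-cleavage",
    lambda x: "cleavage" if x["P4"] == "D" else "non-cleavage",
    lambda x: "cleavage" if x["P1p"] in "GSA" else "non-cleavage",
]

print(majority_vote(trees, {"P1": "D", "P4": "D", "P1p": "G"}))
print(majority_vote(trees, {"P1": "D", "P4": "A", "P1p": "W"}))
```

With an even number of trees a tie is possible; in this sketch a tie is resolved in favour of the label that was encountered first.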

Conditional random fields

LabCaS [47] is a method based on the Conditional Random Fields (CRFs) algorithm [72] for calpain cleavage site prediction from protein sequences. CRFs are undirected graphical models, which were first introduced in the context of segmentation and labelling of text sequences. To date, they have proven to be effective in many applications, including information extraction [73] and image processing and parsing [74]. In CRFs, a single exponential model is used for the conditional probability of the entire label sequence given a query sequence. The nodes of CRFs can be divided into two disjoint sets: the observed variables |$X$| and the output variables Y. The main idea behind CRFs is to define a conditional probability distribution |$p\left(y|x\right)$| over label sequences Y given an observation sequence X. The probability distribution |$p\left(y|x\right)$| has the following form:

$$ p\left({y}_1,{y}_2,\dots, {y}_n|{x}_1,{x}_2,\dots, {x}_n\right)=\frac{1}{Z(X)}\prod \limits_{i=1}^n\exp \left(\sum \limits_{k=1}^K{\lambda}_k\,{f}_k\left({y}_i,{y}_{i-1},{x}_i\right)\right),$$

(14)

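where K is the number of feature functions, |${\lambda}_k$| is the weight of the k-th feature function |${f}_k$|, which is defined over |$\left\{{y}_i,{y}_{i-1},{x}_i\right\}$|, and |$Z(X)$| is the normalisation factor that ensures that the probabilities of all possible label sequences sum to one. For two-class classification, each label |${y}_i$| takes one of two values.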
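For a very short sequence, Equation (14) can be evaluated by brute force, which makes the role of the normalisation factor Z(X) explicit. The two feature functions and their weights below are hypothetical and are not those learnt by LabCaS; "C" and "N" denote cleavage and non-cleavage labels.

```python
from itertools import product
from math import exp

LABELS = ("N", "C")


def f_asp_cleaved(y, y_prev, x):
    # Fires when an Asp residue is labelled as a cleavage site.
    return 1.0 if (x == "D" and y == "C") else 0.0


def f_adjacent_cleaved(y, y_prev, x):
    # Fires when two neighbouring residues are both labelled as cleavage sites.
    return 1.0 if (y == "C" and y_prev == "C") else 0.0


FEATURES = (f_asp_cleaved, f_adjacent_cleaved)
WEIGHTS = (2.0, -3.0)  # hypothetical lambda_k values


def unnormalised(labels, residues):
    score = 1.0
    y_prev = None  # the first position has no predecessor
    for y, x in zip(labels, residues):
        score *= exp(sum(w * f(y, y_prev, x) for w, f in zip(WEIGHTS, FEATURES)))
        y_prev = y
    return score


def crf_probability(labels, residues):
    # Z(X): sum of unnormalised scores over every possible label sequence.
    z = sum(unnormalised(cand, residues)
            for cand in product(LABELS, repeat=len(residues)))
    return unnormalised(labels, residues) / z


print(crf_probability(("N", "C", "N"), "ADG"))
```

Enumerating all label sequences is only feasible for toy inputs; practical CRF implementations compute Z(X) with dynamic programming instead.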

Logistic regression

PROSPERous [52] employed the Logistic Regression (LR) algorithm [75] for high-throughput prediction of protein cleavage sites by 90 proteases. To the best of our knowledge, PROSPERous is the framework of cleavage site prediction covering the widest range of proteases. The LR algorithm is designed to estimate the distribution |$P\!\left(Y|X\right)$| from the training data, where Y is a discrete value and |$X= \ <{X}_1,\dots, {X}_d>$| represents any feature vector containing discrete and/or continuous variables. The LR model can be defined as follows:

$$ P \!\left(Y=1|X\right)= LR ({\theta}^TX)=\frac{1}{1+{e}^{-{\theta}^TX}} $$

(15)

where |$LR(z)=\frac{1}{1+{e}^{-z}}$| is the logistic function and |${\theta}^TX={\theta}_0+\sum \nolimits_{i=1}^d{\theta}_i{X}_i$|. The values of |$LR ({\theta}^TX)$| and |$P \! \left(Y|X\right)$| range from 0 to 1, and |$P \! \left(Y=0|X\right)$| can be calculated as

$$P \! \left(Y=0|X\right)=1-LR\left({\theta}^TX\right)=\frac{e^{-{\theta}^TX}}{1+{e}^{-{\theta}^TX}}. $$

(16)
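Equations (15) and (16) translate directly into a few lines of code. The parameter vector and the feature values below are invented for illustration and are unrelated to the fitted PROSPERous models.

```python
from math import exp


def lr_probability(theta, x):
    """P(Y=1|X) as in Equation (15); theta[0] is the intercept theta_0."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1.0 / (1.0 + exp(-z))


theta = [-1.0, 2.0, 0.5]  # hypothetical fitted parameters
x = [1.0, 0.0]            # hypothetical feature vector with d = 2

p_cleaved = lr_probability(theta, x)
p_not_cleaved = 1.0 - p_cleaved  # Equation (16)
print(round(p_cleaved, 4), round(p_not_cleaved, 4))
```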

Webserver/Software functionality

Two types of computational products, online servers and local executables, are usually made available along with each method publication, thereby facilitating high-throughput prediction. Based on our survey, most predictors for cleavage site prediction have been made publicly available in the form of a webserver. These include PeptideCutter, PoPS, SitePrediction, PCSS, Cascleave, CaMPDB, LabCaS, PROSPER, ScreenCap3 and iProt-Sub. It is also worth noting that the webservers of some predictors in Table 1 have been discontinued or decommissioned.

On the webservers for protease-specific cleavage site prediction, users generally need to select the protease of interest and provide the protein sequence(s). When providing the protein sequences for prediction, the servers of PoPS, PCSS, Pripper, SitePrediction, GPS-CCD, LabCaS, PROSPERous and iProt-Sub enable users to upload a file with multiple protein sequences in FASTA format. It is worth noting that GPS-CCD, LabCaS, PROSPERous and iProt-Sub limit the maximum number of sequences allowed (i.e. ≤2000 sequences for GPS-CCD, ≤3 sequences for LabCaS, ≤1000 sequences for PROSPERous and ≤50 sequences for iProt-Sub). On the other hand, other tools such as PeptideCutter, Cascleave, CaMPDB, PROSPER and ScreenCap3 can only process one protein sequence in FASTA format at a time, and the submission of multiple sequences is not permitted. Another important point to note is the selection of the specific protease, given that most of the predictors were built using data sets of cleavage sites regulated by specific proteases (Table 1). Notably, PoPS and PCSS allow users to build new models based on their own data sets, which is a customisable and useful function, especially for non-expert users.

A well-designed output format is important for users to easily interpret the prediction results. Among the surveyed predictors with available webservers, PoPS, SitePrediction, PCSS, Cascleave, CaMPDB, PROSPER and iProt-Sub send notification emails and prediction results to users with job IDs, which is convenient for users and allows them to revisit the prediction results in the future. Most available servers display the prediction results on their webpages. While most tools show the results in plain text, PROSPER, LabCaS, SitePrediction and iProt-Sub use a residue score distribution figure to visualise the prediction results, which enhances interpretability. In addition, the possibility to download prediction results strengthens the usability of the webservers. SitePrediction and LabCaS allow users to download their prediction results in text format, while PROSPERous allows users to download results in CSV, Excel and PDF formats, facilitating further investigation and easy comparison of prediction results.

PoPS, Pripper and GPS-CCD also provide locally runnable software for biologists and interested users to run on their own machines. Detailed instructions for these tools, including the installation procedure, dependencies and prediction process, are also available.

Performance evaluation measurements and strategies

Based on our review (Table 1), k-fold cross validation (CV) and independent testing are two widely applied strategies for prediction performance assessment. In k-fold CV, the data set is randomly split into k equally sized subsets, |${D}_1,{D}_2,\dots, {D}_k$|. Each of the subsets is used once to evaluate the trained model, while the remaining subsets are combined to train the model. The training and testing procedure is therefore conducted k times, and the average performance over the k runs is usually reported. For independent testing, an independent test data set is usually assembled, ensuring that there is no overlap with the training data set. The independent test is therefore an effective and objective way to assess the robustness and scalability of constructed models.
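The splitting procedure described above can be sketched with the standard library alone. The function name `k_fold_indices` and the fixed random seed are illustrative choices; when the number of samples is not divisible by k, the fold sizes in this sketch differ by at most one.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets D_1..D_k
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Setting k equal to the number of samples gives the leave-one-out CV test, in which every sample serves as the test set exactly once.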

The prediction performance of most of the reviewed methods was evaluated by both k-fold cross-validation and independent tests, including CAT3, CASVM, Cascleave, PROSPER, Cascleave 2.0, ScreenCap3, PROSPERous and iProt-Sub. The performance of GPS-CCD was evaluated by 4-, 6-, 8-, 10-fold CV and the leave-one-out CV test (a special case of k-fold CV, where k is set to the number of samples in the data set, and only one sample is used as the test set at a time). However, SitePrediction only performed an independent test for performance evaluation, and CaMPDB only performed 10-fold CV to evaluate prediction performance. Also, the performance of PCSS, Pripper and LabCaS was only evaluated by leave-one-out CV test.

Based on our review, we set out to comprehensively assess the performance of all these methods based on independent data sets and five measures. The latter are AUC (Area Under the Curve), Sn (sensitivity), Sp (specificity), Acc (accuracy) and MCC (Matthews Correlation Coefficient), which are commonly used to evaluate the prediction performance of algorithms [76–79], including those for protease-specific cleavage sites. These measures are defined as follows:

$$\begin{align}\textrm{Sn}&=1-\frac{N_{-}^{+}}{N^{+}},\kern2em 0\le \textrm{Sn}\le 1,\nonumber\\[6pt] \textrm{Sp}&=1-\frac{N_{+}^{-}}{N^{-}},\kern2em 0\le \textrm{Sp}\le 1,\nonumber\\[6pt] \textrm{Acc}&=\Lambda =1-\frac{N_{-}^{+}+{N}_{+}^{-}}{N^{+}+{N}^{-}},\kern2em 0\le \textrm{Acc}\le 1,\\[6pt] \textrm{MCC}&=\frac{1-\left(\frac{N_{-}^{+}}{N^{+}}+\frac{N_{+}^{-}}{N^{-}}\right)}{\sqrt{\left(1+\frac{N_{+}^{-}-{N}_{-}^{+}}{N^{+}}\right)\left(1+\frac{N_{-}^{+}-{N}_{+}^{-}}{N^{-}}\right)}},\kern2em -1\le \textrm{MCC}\le 1,\nonumber\end{align}$$

(17)

where |${N}^{+},{N}_{-}^{+},{N}^{-}$| and |${N}_{+}^{-}$| represent the numbers of positives, false negatives, negatives and false positives, respectively.
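The four threshold-dependent measures of Equation (17) can be computed as sketched below, where `fn` and `fp` denote the numbers of false negatives and false positives. The counts in the example are invented; the function does not handle the degenerate case in which the MCC denominator is zero.

```python
from math import sqrt


def evaluate(n_pos, n_neg, fn, fp):
    """Return (Sn, Sp, Acc, MCC) following Equation (17)."""
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    denom = sqrt((1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    mcc = (1 - (fn / n_pos + fp / n_neg)) / denom
    return sn, sp, acc, mcc


# Hypothetical test set: 100 cleavage sites and 100 non-cleavage sites.
sn, sp, acc, mcc = evaluate(n_pos=100, n_neg=100, fn=10, fp=20)
print(round(sn, 3), round(sp, 3), round(acc, 3), round(mcc, 3))
```

This formulation is algebraically equivalent to the more familiar confusion-matrix form of the MCC, with TP = N⁺ minus the false negatives and TN = N⁻ minus the false positives.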

Enrichment analysis on Gene Ontology terms

Based on our independent test data set, we performed a systematic analysis of the distribution of functional annotations of the substrates, including cellular component (CC), BP, molecular function (MF) and key functional pathways (KEGG Pathway), based on the annotations extracted from the UniProt database. For enrichment analysis of KEGG pathways, we searched the KEGG pathways using the UniProt IDs of the corresponding substrate proteins of each protease. To find the statistically enriched GO terms, we performed enrichment analysis with two-sided hypergeometric tests, and terms with a P-value less than 0.05 were considered significant. The P-value of a given term t is defined as follows:

$$ P\textrm{-value}_t={F}_{\textrm{hypergeom}}\left(m,n,N,M\right), $$

(18)

where m is the total number of proteins in the data set annotated by the term t, n is the total number of human proteins annotated by the term t, M is the total number of substrate proteins annotated by at least one term and N is the total number of human proteins annotated by at least one term.
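As a simplified illustration of the hypergeometric model behind Equation (18), the sketch below computes the one-sided (over-representation) tail probability of observing at least m annotated substrates. The two-sided variant used in the analysis is not reproduced here, and the counts in the example are invented.

```python
from math import comb


def hypergeom_upper_tail(m, n, N, M):
    """P(X >= m) when M proteins are drawn from N, of which n carry the term."""
    upper = min(n, M)
    total = comb(N, M)
    favourable = sum(comb(n, k) * comb(N - n, M - k) for k in range(m, upper + 1))
    return favourable / total


# Hypothetical counts: 20 annotated human proteins, 5 of them carry term t,
# and 3 of the 4 annotated substrates carry term t.
p_value = hypergeom_upper_tail(m=3, n=5, N=20, M=4)
print(round(p_value, 4))
```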

Results and Discussion

Conservation analysis of motifs with protease-specific cleavage sites

Based on our independent test data set, we analysed the distribution of amino acid preferences surrounding the cleavage sites of 10 proteases. The sequence logos generated using pLogo [80] are provided in Figure 2. For both the caspases (caspase-1, -3 and -6) and granzyme B (human and mouse), Asp is always the cleavage site residue at the P1 position. However, a closer look reveals subtle differences between the cleavage sites of these proteases, with caspase-6 and granzyme B (human) having complex amino acid preferences at surrounding positions. For instance, caspase-6 requires a Glu residue at the P3 position, and Ala, Ser or Gly at the P1|$ ^{\prime} $| position. The substrate cleavage site of thrombin exhibits Arg at the P1 position, while in contrast to caspases and granzyme B, matrix metallopeptidases and calpains have distinctive amino acid preferences (Figure 2G–J). For calpains, LXSXPP and LXAXPP are the common patterns observed at positions P2 to P4|$ ^{\prime} $|.

Figure 2

Residue specificity and enrichment of sequons of 10 protease-specific substrates, including (A) caspase-1, (B) caspase-3, (C) caspase-6, (D) granzyme B (Homo sapiens), (E) granzyme B (Mus musculus), (F) thrombin, (G) matrix metallopeptidase-2, (H) matrix metallopeptidase-3, (I) calpain-1 and (J) calpain-2.


Figure 3

Cleavage entropy heatmap of 36 proteases.


Motif-based analysis of sequence preferences around different types of protease substrate cleavage sites

Previous studies [81] have shown that the cleavage entropy of proteins is an effective quantitative measure of their conservation. Cleavage entropy is calculated from the frequencies of amino acids at a specific position in a specific protein sequence data set: the smaller the entropy value, the stronger the conservation of the amino acids at this position. To better understand the conservation of the motif flanking a cleavage site, we calculated the cleavage entropy of 38 proteases using a motif with eight amino acids (P4-P4|$ ^{\prime} $|). As shown in Figure 3 and Table S2 in the Supplementary file (apart from the last column and the last row in the files), each row represents the entropy values of a specific protease cleavage site from the upstream P4 to the downstream P4|$ ^{\prime} $| positions. The column ‘Avg’ represents the average of the eight entropy values, and the last line (i.e. the row ‘Overall’) in the heatmap denotes the average entropy of the 38 protease cleavage sites. In general, the cleavage entropy of different protease substrates varies greatly. Some proteases show strong conservation (e.g. caspase-3, caspase-7, caspase-6 and caspase-8). On the other hand, the motifs contained in the cleavage sites of some proteases are non-conserved (e.g. cathepsin L, calpain-1, calpain-2, neprilysin, granzyme M, lactocepin I and lactocepin-3). In addition, proteases of the same catalytic type share similar degrees of sequence conservation. For the cysteine and most of the serine protease catalytic types, the P1 locus was highly conserved, while the conservation of the aspartic and MMP catalytic types, to a lesser extent, was mainly reflected in the P3 locus.
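A per-position entropy of this kind can be sketched as follows. The snippet assumes a Shannon entropy with logarithm base 20, so that values fall between 0 (fully conserved) and 1 (all 20 amino acids equally frequent); this normalisation and the four example segments are assumptions made for illustration.

```python
from collections import Counter
from math import log


def positional_entropy(segments, base=20):
    """Return one entropy value per position of equally long segments."""
    entropies = []
    for pos in range(len(segments[0])):
        counts = Counter(seg[pos] for seg in segments)
        total = sum(counts.values())
        h = sum((c / total) * log(total / c, base) for c in counts.values())
        entropies.append(h)
    return entropies


# Hypothetical P4-P4' segments with Asp conserved at P1 (fourth residue).
segments = ["DEVDGSAT", "DEVDSAGK", "DQTDGLPR", "LEHDGIMW"]
entropies = positional_entropy(segments)
average = sum(entropies) / len(entropies)  # analogous to the 'Avg' column
print([round(h, 3) for h in entropies], round(average, 3))
```

The fully conserved P1 position receives an entropy of zero, whereas positions at which every segment carries a different residue receive the highest values in this toy data set.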

Performance assessment of different tools based on the independent test data sets

We used the independent test data sets generated in this study to conduct a performance comparison among the predictors listed in Table 1. We do note that, because of the limited availability of the training data sets for the prediction methods in Table 1, it is challenging to ensure that our independent test data sets have no overlap with their training data sets. In addition, some predictors regularly update their models using more up-to-date data sets. Despite these difficulties, we assembled the independent test data sets by removing all the cleavage sites from earlier versions of the MEROPS database, thereby minimising the overlap between our data sets and the training data sets of the compared predictors. We then submitted the protein sequences in FASTA format from our independent test data sets to the webservers/local tools of the methods listed in Table 1 to obtain the corresponding prediction results. In terms of parameter configuration, we adopted the recommended parameters following the instructions in the publication associated with each method or left them at default values if no instructions were available. To present the evaluation results, we plotted Receiver Operating Characteristic (ROC) curves and reported AUC values for each method, as shown in Figure 4 and Figure S1 and Table S3 in the Supplementary file.

Figure 4

ROC curves and the corresponding AUC values of reviewed predictors for cleavage site prediction specific to (A) caspase-1, (B) caspase-3, (C) caspase-6, (D) granzyme B (Homo sapiens), (E) granzyme B (Mus musculus), (F) thrombin, (G) matrix metallopeptidase-2 and (H) matrix metallopeptidase-3.


Across all the proteases, machine learning-based methods in general achieved better prediction performance than sequence scoring function-based methods. However, we did not find a single best predictor for all the proteases. Among the machine learning-based predictors, PROSPERous performed best for cleavage sites of seven proteases, i.e. caspase-1, caspase-6, granzyme B (Homo sapiens), granzyme B (Mus musculus), thrombin, MMP-2 and MMP-3. We assume that one reason why PROSPERous performed best is that it was developed using newer, more comprehensive data sets, which covered a substantially larger set of experimentally verified cleavage sites than previous methods. Moreover, PROSPERous is based on a two-level prediction framework, in which the 1st level is the sequence-scoring function while the 2nd level is the LR model. This robust two-level prediction system resulted in PROSPERous’ improved predictive power.

Among the scoring function-based methods, SitePrediction achieved the overall best performance for proteases caspase-1, caspase-6 and MMP-3. It is worth mentioning that some predictors have achieved outstanding prediction performance for the cleavage sites specific to the proteases they were particularly designed for. For instance, the two methods designed specifically to predict cleavage sites of caspase-3 are ScreenCap3 and CAT3: ScreenCap3 outperformed all the other machine learning and scoring function-based methods (AUC = 0.993; Figure 4B), while CAT3 was the best scoring function-based method (AUC = 0.955; Figure 4B). Similarly, for calpain-1 and calpain-2, the specific approaches GPS-CCD and LabCaS outperformed all the other tools for calpain cleavage site prediction (refer to Figure S1 in the Supplementary file). In terms of the selected machine learning algorithms, our results indicate that SVM and SVR are the most powerful algorithms based on the ROCs. The hybrid machine learning-based and caspase-specific approach, Pripper, integrates four different classification algorithms, including J48 (the Weka implementation of C4.5) [65], RF, SVM and Vote [82]. Based on Figure 4, SVM (Pripper-SVM) achieved the best performance for caspase-3 and caspase-6, while for caspase-1, Pripper-RF outperformed the other three algorithms. This suggests that there is no universal machine learning algorithm suitable for all proteases. The algorithm selection is therefore at the designers’ discretion and should be tested extensively prior to the final model construction. Additionally, the performance comparison among existing methods clearly demonstrates that the accuracy of protease-specific cleavage site prediction varies between different proteases. Based on the independent tests, the cleavage sites regulated by MMP-2, MMP-3, calpain-1, calpain-2 and thrombin are the most difficult to predict, posing a challenge to the research community. Almost all the tested methods achieved relatively unsatisfactory prediction performance for these proteases. In comparison, the prediction performance for caspases and granzyme B is better. These conclusions are consistent with previously published studies [46, 48, 51–53] for protease-specific cleavage site prediction.

Improvement of prediction performance and outlook

Despite outstanding progress in computational studies and tools for cleavage site prediction, we believe there is still wide room for further improvement, especially in terms of algorithm learning strategies and prediction performance. To this effect, we provide several insights and suggestions.

Our 1st suggestion is to employ ensemble learning techniques in protease-specific cleavage site prediction. Ensemble methods, first introduced in [83], have been widely applied in algorithm development and ‘real-world’/bioinformatics applications [84–89]. The past three decades have witnessed the power of ensemble methods in significantly improving prediction performance. For this reason, ensemble methods have become the ‘method-of-choice’ for performance improvement. To date, a variety of ensemble techniques have been developed, but the basic principle remains the same: the final prediction output for a test sample is determined jointly by all the algorithms in the ensemble committee. For example, ‘majority voting’ is the most straightforward ensemble strategy, where the final prediction output for a test sample is the class that obtains the most votes [84]. By applying ensemble techniques, not only can performance be improved, but other common issues, such as overfitting, can also be significantly reduced [90]. In addition, ensemble techniques can help improve the robustness and scalability of developed algorithms when testing with other independent test data sets [91].

Deep learning [92] is another promising technique for further improving the performance of cleavage site prediction by generating internal feature representations of the cleavage sites, from protein sequences, within the hidden layers. To date, deep learning algorithms have been applied to a variety of bioinformatics applications, such as drug target interaction prediction [93], protein subcellular localisation prediction [94], phosphorylation site prediction [95], protein solubility prediction [96] and protein homology detection [97]. Some representative deep learning techniques involved in these studies include Deep Neural Networks [98], Deep Belief Networks [99], Long Short-Term Memory [100] and Convolutional Neural Networks [92]. Compared to conventional artificial neural networks, deep learning techniques have overcome some key issues, such as overfitting and slow convergence [92]. Current machine learning studies have demonstrated that deep learning algorithms achieve competitive prediction results when compared with other state-of-the-art machine learning algorithms, such as SVM and RF. In spite of the time-consuming training process, the classification process of deep learning is generally faster than that of traditional machine learning-based methods, as deep learning techniques do not need to extract features from protein sequences or convert them to feature vectors for prediction. Thus, deep learning techniques are attractive for high-throughput prediction tasks that also demand competitive prediction performance.

In addition to supervised data mining techniques, such as the aforementioned deep learning techniques, semi-supervised methods have also been applied in bioinformatics applications because of limited data availability and/or limited reliability of sample labelling. One of the semi-supervised techniques, named Positive and Unlabelled (PU) learning, has attracted significant attention recently and has been used in disease gene identification [101], kinase substrate prediction [102], glycosylation site prediction [103] and inference of drug interactions [104]. PU learning is a machine learning scheme that distinguishes itself from others by its ability to learn from only positive and unlabelled data [105, 106]. As such, PU learning only requires some high-quality positive samples and a number of unlabelled samples to build reliable predictive models, which is a significant advantage in many areas, such as protein posttranslational modification prediction, drug-target interaction prediction and cleavage site prediction, where some negative samples might be falsely labelled because of the limitations of experimental techniques. Taking cleavage site prediction as an example, most machine learning-based studies built their predictive models using experimentally verified cleavage sites (i.e. positive samples) and non-cleavage sites (i.e. negative samples). However, some negative samples (i.e. the non-cleavage sites) might be incorrectly labelled and are subject to updates as experimental techniques advance. Predictive models built on such data sets can be unreliable and their performance can be impaired, whereas PU learning can mitigate such problems.

Regarding feature selection, in addition to the techniques introduced in the 'Feature selection methods' section, other novel techniques have been published. Some of these, such as F-score- and binomial distribution-based feature selection techniques [107, 108], may help improve the prediction performance. We therefore suggest that users compare different feature selection approaches to determine the optimal feature set prior to model construction.
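For readers unfamiliar with F-score-based selection, the sketch below ranks features by one commonly used definition of the F-score: the squared deviations of the positive-class and negative-class means from the overall mean, divided by the sum of the two within-class sample variances. Whether the cited studies use exactly this definition is an assumption on our part, and the function names and toy data are illustrative only.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of a single feature; larger values indicate better separation.

    Each class needs at least two values because sample variances are used.
    """
    pos_values, neg_values = list(pos_values), list(neg_values)
    overall_mean = mean(pos_values + neg_values)
    between = ((mean(pos_values) - overall_mean) ** 2
               + (mean(neg_values) - overall_mean) ** 2)
    within = variance(pos_values) + variance(neg_values)
    if within == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    # Toy data: rows are samples, columns are features.
    cleaved = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
    uncleaved = [[7.0, 1.0], [8.0, 0.0], [9.0, 1.0]]
    print(rank_features(cleaved, uncleaved))
```

In practice, the top-ranked features would be added incrementally and the resulting models compared by cross-validation to choose the final feature set.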

Last but not least, because of the advances in high-throughput sequencing [109, 110], the fast accumulation of biological data has become a significant computational burden, which can exceed the processing capacity of local machines. The same issue applies to the accumulation of protein cleavage site data. Cloud computing has been frequently applied to deal with such ‘Big Data’ problems, and commonly used solutions are distributed and parallel computing frameworks, including Hadoop [111–113], MapReduce [114] and Spark [115, 116]. We anticipate that applying these techniques and frameworks will facilitate large-scale prediction on a variety of biological data, including protein cleavage site prediction, in the future.

Conclusion

Proteolysis is an important and irreversible post-translational modification that plays key roles in many physiological processes. Accurate prediction of protease-specific cleavage sites can short-list reliable candidates for experimental validation and thereby help to elucidate the functions and mechanisms of proteolysis. Given the wide availability of experimentally verified protease-specific cleavage sites, a number of computational methods (including machine learning-based methods and sequence scoring function-based methods) have been developed during the past two decades. Building on the significant achievements in this area, in this article we revisited and evaluated 19 state-of-the-art computational approaches for predicting protease-specific substrates and cleavage sites. We summarised a wide range of aspects in detail, including methodology, algorithm, calculated features, evaluation strategy and software usability. Using our independent test data sets, we performed extensive benchmarking and demonstrated that machine learning-based methods generally outperform sequence scoring function-based methods for predicting cleavage sites across all 10 proteases tested. Among the 11 machine learning-based methods evaluated, PROSPERous is the most accurate generic tool for predicting cleavage sites of eight proteases, while GPS-CCD and LabCaS achieved the best overall performance for predicting calpain-specific cleavage sites. Following the independent tests, we provided some insights into further performance improvement, including applications of ensemble learning, deep learning and positive unlabelled learning techniques. We also suggested several computational frameworks for parallel and distributed computing, with the goal of handling the fast-accumulating data on protease-specific cleavage sites.
We believe this review, together with our suggestions, will inform and guide future developments of computational methods for protease-specific cleavage site prediction, in turn advancing our understanding of biological functions and mechanisms of proteolysis.

Key Points

We provide a comprehensive survey of 19 computational methods (including 8 scoring function-based and 11 machine learning-based methods) developed for predicting protease-specific substrates and cleavage sites in the past two decades. Our survey analyses these 19 computational methods in terms of underlying algorithms, extracted heterogeneous features, predictive capability, high-throughput capacity and practical utility.
Based on the cleavage entropy measure, we delineate the substrate specificity profiles of 36 different proteases with >50 experimentally validated substrate cleavage sites and provide functional insights into the specificity of these proteases.
We perform comprehensive independent tests to assess the performance of existing tools for predicting protease-specific cleavage sites across 10 different proteases, using up-to-date independent data sets.
Extensive independent tests show that PROSPERous is the most accurate generic tool for predicting cleavage sites of multiple proteases; on the other hand, GPS-CCD and LabCaS achieved an overall best performance for predicting calpain cleavage sites.
This review will serve as a practically useful guide for interested readers to develop next-generation bioinformatics tools for protease-specific cleavage site prediction.

Funding

This work was supported by grants from the National Health and Medical Research Council of Australia (NHMRC; 1092262), the Australian Research Council (ARC; LP110200333 and DP120104460), the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965), a Major Inter-Disciplinary Research (IDR) project awarded by Monash University, the Collaborative Research Program of Institute for Chemical Research, Kyoto University (2018-28), NHMRC CJ Martin Early Career Research Fellowship (1143366 to C.L.), ARC Discovery Outstanding Research Award (DORA; to G.I.W) and partially (for T.T.M.L. and A.L.) by the Informatics Institute of the School of Medicine at University of Alabama at Birmingham.

Fuyi Li received his BEng and DEng degrees in Software Engineering from Northwest A&F University, China. He is currently a PhD candidate in the Department of Biochemistry and Molecular Biology and the Infection and Immunity Program, Biomedicine Discovery Institute, Monash University, Australia. His research interests are bioinformatics, computational biology, machine learning and data mining.

Yanan Wang received his MEng degree from Shanghai Jiao Tong University, China. He will commence his PhD study in microbial bioinformatics, later in 2018, in the Biomedicine Discovery Institute, Monash University. His research interests are bioinformatics, machine learning, data mining and pattern recognition.

Chen Li received his PhD degree in Bioinformatics from Monash University, Australia. He is currently an NHMRC CJ Martin Early Career Research Fellow at the Institute of Molecular Systems Biology, ETH Zürich, Switzerland, and the Monash Biomedicine Discovery Institute, Monash University, Australia. His research interests include systems immunology, proteomics, immunopeptidomics, bioinformatics and data mining.

Tatiana T. Marquez-Lago is an associate professor in the Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham School of Medicine, USA. Her research interests include multiscale modelling and simulations, artificial intelligence, bioengineering and systems biomedicine. Her interdisciplinary laboratory studies stochastic gene expression, chromatin organisation, antibiotic resistance in bacteria and host-microbiota interactions in complex diseases.

André Leier is currently an assistant professor in the Department of Genetics, University of Alabama at Birmingham School of Medicine, USA. He is also an associate scientist in the UAB Comprehensive Cancer Center. He received his PhD in Computer Science (Dr. rer. nat.), University of Dortmund, Germany. He conducted postdoctoral research at Memorial University of Newfoundland, Canada, The University of Queensland, Australia and ETH Zürich, Switzerland. His research interests are in biomedical informatics and computational and systems biomedicine.

Neil D. Rawlings received his bachelor degree in Biological Sciences from Aston University, Birmingham, UK and his PhD from the Open University, Milton Keynes, UK. Since 1996, he and Alan J. Barrett have been the originators, developers and curators of the MEROPS database for proteolytic enzymes, their inhibitors and substrates, with the goal to provide the globally accepted classification of proteolytic enzymes and their inhibitors. He is one of the editors of the Handbook of Proteolytic Enzymes (Elsevier, 2013) and is currently a curator for the InterPro database at the European Bioinformatics Institute, Cambridge, UK.

Gholamreza Haffari received his PhD in Computer Science in 2010 from Simon Fraser University, Canada. He is a senior lecturer in the Faculty of Information Technology, Monash University. His research interests are natural language processing, machine learning and bioinformatics.

Jerico Revote received his bachelor degree in Computer Science from RMIT University, Melbourne, Australia. He is working as a research engineer in the Monash eResearch Centre at Monash University, Australia. He is also currently a part-time PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University. His research interests are machine learning, data mining, artificial intelligence, software development and scalable applications.

Tatsuya Akutsu received his DEng degree in Information Engineering in 1989 from University of Tokyo, Japan. Since 2001, he has been a professor in the Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. His research interests include bioinformatics and discrete algorithms.

Kuo-Chen Chou received his D.Sc. degree in 1984 from Kyoto University, Japan. He is the founder and chief scientist of Gordon Life Science Institute. He is also a Distinguished High Impact Professor and advisory professor of several universities. His research interests are computational biology and biomedicine, protein structure prediction, low-frequency internal motion of protein/DNA molecules and its biological functions, diffusion-controlled reactions of enzymes, as well as graphic rules in enzyme kinetics and other biological systems.

Anthony W. Purcell is a professor in the Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia. He specialises in targeted and global quantitative proteomics of complex biological samples, with a specific focus on identifying targets of the immune response and host-pathogen interactions.

Robert N. Pike obtained his PhD in Biochemistry from the University of Natal in South Africa in 1991. He conducted post-doctoral fellowship work at the University of Georgia, USA, and the University of Cambridge, UK. He is currently the director of La Trobe Institute of Molecular Science and the Head of School of Molecular Science, La Trobe University, Australia. His research focuses on the study of enzymes involved in innate immunity and host defense against bacteria and other pathogens. He is also strongly interested in the enzymes used by pathogens to attack the host.

Geoffrey I. Webb received his PhD degree in 1987 from La Trobe University. He is the director of the Monash Centre for Data Science and Professor in Faculty of Information Technology at Monash University, Australia. He is a leading data scientist and has been the Program Committee Chair of two leading data mining conferences, ACM SIGKDD and IEEE ICDM. His research interests include machine learning, data mining, computational biology and user modelling.

A. Ian Smith completed his PhD at Prince Henry's Institute Melbourne and Monash University, Australia. He is currently the Vice-Provost (Research & Research Infrastructure) at Monash University. He is also a Professorial Fellow in the Department of Biochemistry and Molecular Biology at Monash University, where he runs his research group. His research applies proteomic technologies to study the proteases involved in the generation and metabolism of peptide regulators involved in both brain and cardiovascular function.

Trevor Lithgow received his PhD degree in 1992 from La Trobe University, Australia. He is an ARC Australian Laureate Fellow (FL130100038) in the Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia. His research interests particularly focus on molecular biology, cellular microbiology and bioinformatics. His laboratory develops and deploys multidisciplinary approaches to identify new protein transport machines in bacteria, understand the assembly of protein transport machines, and their role in secretion of hydrolases, including proteases, as host-targeted effectors.

Roger J. Daly obtained his PhD from the University of Liverpool, UK. He is the Head of the Department of Biochemistry and Molecular Biology and the Cancer Program within the Biomedicine Discovery Institute at Monash University. His research applies cutting-edge technology platforms in mass spectrometry-based proteomics and kinomics to inform and refine the subclassification of particular cancers, identify novel therapeutic targets and new applications for existing therapies, and identify companion biomarkers that help stratify patients for appropriate therapy.

James C. Whisstock is a professor in the Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia. He is an NHMRC Senior Principal Research Fellow and the director of the ARC Centre of Excellence for Advanced Molecular Imaging. He studied for his bachelor degree and D.Phil. at The University of Cambridge, UK. His research interests include structural biology and drug development around immune-related complexes, including proteases and protease inhibitors and pore forming proteins such as perforin.

Jiangning Song is currently a Senior Research Fellow and group leader in the Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia. He is also affiliated with the Monash Centre for Data Science, Faculty of Information Technology, Monash University. His research interests include bioinformatics, computational biology, machine learning, data mining and pattern recognition.

References

1. Rogers LD, Overall CM. Proteolytic post-translational modification of proteins: proteomic tools and methodology. Mol Cell Proteomics 2013;12:3532–42.

2. Zhou A, Webb G, Zhu X, et al. Proteolytic processing in the secretory pathway. J Biol Chem 1999;274:20745–8.

3. Clarke DJ. Proteolysis and the cell cycle. Cell Cycle 2002;1:233–4.

4. Bruck WM, Gibson GR, Bruck TB. The effect of proteolysis on the induction of cell death by monomeric alpha-lactalbumin. Biochimie 2014;97:138–43.

5. Lal M, Caplan M. Regulated intramembrane proteolysis: signaling pathways and biological functions. Physiology (Bethesda) 2011;26:34–44.

6. Varshavsky A. The N-end rule pathway and regulation by proteolysis. Protein Sci 2011;20:1298–345.

7. Lecker SH, Goldberg AL, Mitch WE. Protein degradation by the ubiquitin-proteasome pathway in normal and disease states. J Am Soc Nephrol 2006;17:1807–19.

8. Lebraud H, Wright DJ, Johnson CN, et al. Protein degradation by in-cell self-assembly of proteolysis targeting chimeras. ACS Cent Sci 2016;2:927–34.

9. Ottis P, Crews CM. Proteolysis-targeting chimeras: induced protein degradation as a therapeutic strategy. ACS Chem Biol 2017;12:892–8.

10. Shah PK. Inflammation, metalloproteinases, and increased proteolysis: an emerging pathophysiological paradigm in aortic aneurysm. Circulation 1997;96:2115–7.

11. Cowan FM, Broomfield CA, Lenz DE, et al. Putative role of proteolysis and inflammatory response in the toxicity of nerve and blister chemical warfare agents: implications for multi-threat medical countermeasures. J Appl Toxicol 2003;23:177–86.

12. Ionescu AA, Nixon LS, Shale DJ. Cellular proteolysis and systemic inflammation during exacerbation in cystic fibrosis. J Cyst Fibros 2004;3:253–8.

13. auf dem Keller U, Prudova A, Eckhard U, et al. Systems-level analysis of proteolytic events in increased vascular permeability and complement activation in skin inflammation. Sci Signal 2013;6:rs2.

14. Kato GJ. Human genetic diseases of proteolysis. Hum Mutat 1999;13:87–98.

15. De Strooper B. Proteases and proteolysis in Alzheimer disease: a multifactorial view on the disease process. Physiol Rev 2010;90:465–94.

16. Bingol B, Sheng M. Deconstruction for reconstruction: the role of proteolysis in neural plasticity and disease. Neuron 2011;69:22–32.

17. Ehrnhoefer DE, Martin DDO, Schmidt ME, et al. Preventing mutant huntingtin proteolysis and intermittent fasting promote autophagy in models of Huntington disease. Acta Neuropathol Commun 2018;6:16.

18. Yamasaki L, Pagano M. Cell cycle, proteolysis and cancer. Curr Opin Cell Biol 2004;16:623–8.

19. Mason SD, Joyce JA. Proteolytic networks in cancer. Trends Cell Biol 2011;21:228–37.

20. Sevenich L, Joyce JA. Pericellular proteolysis in cancer. Genes Dev 2014;28:2331–47.

21. Hillebrand LE, Bengsch F, Hochrein J, et al. Proteolysis-a characteristic of tumor-initiating cells in murine metastatic breast cancer. Oncotarget 2016;7:58244–60.

22. Quesada V, Ordonez GR, Sanchez LM, et al. The Degradome database: mammalian proteases and diseases of proteolysis. Nucleic Acids Res 2009;37:D239–43.

23. Kappelhoff R, Puente XS, Wilson CH, et al. Overview of transcriptomic analysis of all human proteases, non-proteolytic homologs and inhibitors: organ, tissue and ovarian cancer cell line expression profiling of the human protease degradome by the CLIP-CHIP (TM) DNA microarray. Biochim Biophys Acta 2017;1864:2210–9.

24. Schauperl M, Fuchs JE, Waldner BJ, et al. Characterizing protease specificity: how many substrates do we need? PLoS One 2015;10:e0142658.

25. Diamond SL. Methods for mapping protease specificity. Curr Opin Chem Biol 2007;11:46–51.

26. Boulware KT, Daugherty PS. Protease specificity determination by using cellular libraries of peptide substrates (CLiPS). Proc Natl Acad Sci USA 2006;103:7583–8.

27. Harris JL, Backes BJ, Leonetti F, et al. Rapid and general profiling of protease specificity by using combinatorial fluorogenic substrate libraries. Proc Natl Acad Sci USA 2000;97:7754–9.

28. Agard NJ, Wells JA. Methods for the proteomic identification of protease substrates. Curr Opin Chem Biol 2009;13:503–9.

29. Dix MM, Simon GM, Cravatt BF. Global mapping of the topography and magnitude of proteolytic events in apoptosis. Cell 2008;134:679–91.

30. Kazanov MD, Igarashi Y, Eroshkin AM, et al. Structural determinants of limited proteolysis. J Proteome Res 2011;10:3642–51.

31. Igarashi Y, Eroshkin A, Gramatikova S, et al. CutDB: a proteolytic event database. Nucleic Acids Res 2007;35:D546–9.

32. Belushkin AA, Vinogradov DV, Gelfand MS, et al. Sequence-derived structural features driving proteolytic processing. Proteomics 2014;14:42–50.

33. Timmer JC, Zhu W, Pop C, et al. Structural and kinetic determinants of protease substrates. Nat Struct Mol Biol 2009;16:1101.

34. Wilkins MR, Gasteiger E, Bairoch A, et al. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol 1999;112:531–52.

35. Lohmuller T, Wenzler D, Hagemann S, et al. Toward computer-based cleavage site prediction of cysteine endopeptidases. Biol Chem 2003;384:899–909.

36. Boyd SE, Pike RN, Rudy GB, et al. PoPS: a computational tool for modeling and predicting protease specificity. J Bioinform Comput Biol 2005;3:551–85.

37. Backes C, Kuentzer J, Lenhof HP, et al. GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res 2005;33:W208–13.

38. Garay-Malpartida HM, Occhiucci JM, Alves J, et al. CaSPredictor: a new computer-based tool for caspase substrate prediction. Bioinformatics 2005;21(Suppl 1):i169–76.

39. Verspurten J, Gevaert K, Declercq W, et al. SitePredicting the cleavage of proteinase substrates. Trends Biochem Sci 2009;34:319–23.

40. Liu Z, Cao J, Gao X, et al. GPS-CCD: a novel computational program for the prediction of calpain cleavage sites. PLoS One 2011;6:e19001.

41. Ayyash M, Tamimi H, Ashhab Y. Developing a powerful in silico tool for the discovery of novel caspase-3 substrates: a preliminary screening of the human proteome. BMC Bioinformatics 2012;13:14.

42. Wee LJ, Tan TW, Ranganathan S. CASVM: web server for SVM-based prediction of caspase substrates cleavage sites. Bioinformatics 2007;23:3241–3.

43. Barkan DT, Hostetter DR, Mahrus S, et al. Prediction of protease substrates using sequence and structure features. Bioinformatics 2010;26:1714–22.

44. Piippo M, Lietzen N, Nevalainen OS, et al. Pripper: prediction of caspase cleavage sites from whole proteomes. BMC Bioinformatics 2010;11:320.

45. Song J, Tan H, Shen H, et al. Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics 2010;26:752–60.

46. DuVerle DA, Ono Y, Sorimachi H, et al. Calpain cleavage prediction using multiple kernel learning. PLoS One 2011;6:e19035.

47. Fan YX, Zhang Y, Shen HB. LabCaS: labeling calpain substrate cleavage sites from amino acid sequence using conditional random fields. Proteins 2013;81:622–34.

48. Song J, Tan H, Perry AJ, et al. PROSPER: an integrated feature-based tool for predicting protease substrate cleavage sites. PLoS One 2012;7:e50300.

49. Wang M, Zhao XM, Tan H, et al. Cascleave 2.0, a new approach for predicting caspase and granzyme cleavage targets. Bioinformatics 2014;30:71–80.

50. Fu SC, Imai K, Sawasaki T, et al. ScreenCap3: improving prediction of caspase-3 cleavage sites using experimentally verified noncleavage sites. Proteomics 2014;14:2042–6.

51. Wang Y, Song J, Marquez-Lago TT, et al. Knowledge-transfer learning for prediction of matrix metalloprotease substrate-cleavage sites. Sci Rep 2017;7:5755.

52. Song J, Li F, Leier A, et al. PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy. Bioinformatics 2018;34:684–7.

53. Song J, Wang Y, Li F, et al. iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites. Brief Bioinform 2018;bby028.

54. Song J, Tan H, Boyd SE, et al. Bioinformatic approaches for predicting substrates of proteases. J Bioinform Comput Biol 2011;9:149–78.

55. duVerle DA, Mamitsuka H. A review of statistical methods for prediction of proteolytic cleavage. Brief Bioinform 2012;13:337–49.

56. duVerle DA, Mamitsuka H. Machine learning sequence classification techniques: application to cysteine protease cleavage prediction. Curr Bioinform 2012;7:415–423.

57. Bao Y, Marini S, Tamura T, et al. Toward more accurate prediction of caspase cleavage sites: a comprehensive review of current methods, tools and features. Brief Bioinform 2018;bby041.

58. Rawlings ND, Barrett AJ, Thomas PD, et al. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res 2018;46:D624–32.

59. Schechter I, Berger A. On the size of the active site in proteases. I. Papain. Biochem Biophys Res Commun 1967;27:157–62.

60. Wee LJ, Tan TW, Ranganathan S. SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics 2006;7(Suppl 5):S14.

61. Rogers S, Wells R, Rechsteiner M. Amino acid sequences common to rapidly degraded proteins: the PEST hypothesis. Science 1986;234:364–8.

62. Rechsteiner M, Rogers SW. PEST sequences and regulation by proteolysis. Trends Biochem Sci 1996;21:267–71.

63. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–2502.

64. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005;27:1226–38.

65. Quinlan JR. C4.5: programs for machine learning. United States: Elsevier, 2014.

66. Breiman L. Random forests. Mach Learn 2001;45:5–32.

67. Li F, Li C, Wang M, et al. GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics 2015;31:1411–9.

68. Li F, Li C, Revote J, et al. GlycoMine(struct): a new bioinformatics tool for highly accurate mapping of the human N-linked and O-linked glycoproteomes by incorporating structural features. Sci Rep 2016;6:34595.

69. Song J, Li F, Takemoto K, et al. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 2018;443:125–37.

70. Wei LY, Xing PW, Tang JJ, et al. PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Trans Nanobioscience 2017;16:240–7.

71. Breiman L. Classification and Regression Trees. United Kingdom: Routledge, 2017.

72. Lafferty J, McCallum A, Pereira FC. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01), 2001. Morgan Kaufmann Publishers Inc., United States.

73. Sarawagi S, Cohen WW. Semi-markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems. 2005, 1185–92. MIT Press, United States.

74. Sutton C, McCallum A. An introduction to conditional random fields. Foundations and Trends® in Machine Learning 2012;4:267–373.

75. Li F, Li C, Marquez-Lago TT, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 2018;bty522.

76. Chen W, Feng P, Yang H, et al. iRNA-3typeA: identifying three types of modification at RNA’s adenosine sites. Mol Ther Nucleic Acids 2018;11:468–74.

77. Feng P, Yang H, Ding H, et al. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2018. doi:10.1016/j.ygeno.2018.01.005.

78. Yang H, Qiu W-R, Liu G, et al. iRSpot-Pse6NC: identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. Int J Biol Sci 2018;14:883–891.

79. Chen W, Yang H, Feng P, et al. iDNA4mC: identifying DNA N4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 2017;33:3518–23.

80. O'Shea JP, Chou MF, Quader SA, et al. pLogo: a probabilistic approach to visualizing sequence motifs. Nat Methods 2013;10:1211–2.

81. Fuchs JE, von Grafenstein S, Huber RG, et al. Cleavage entropy as quantitative measure of protease specificity. PLoS Comput Biol 2013;9:e1003007.

82. Ruta D, Gabrys B. Classifier selection for majority voting. Inf Fusion 2005;6:63–81.

83. Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell 1990;12:993–1001.

84. Zhou Z-H. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

85. Wan S, Duan Y, Zou Q. HPSLPred: an ensemble multi‐label classifier for human protein subcellular location prediction with imbalanced source. Proteomics 2017;17:1–12.

86. Liu B, Yang F, Chou KC. 2L-piRNA: a two-layer ensemble classifier for identifying Piwi-interacting RNAs and their function. Mol Ther Nucleic Acids 2017;7:267–77.

87. Liu B, Wang S, Long R, et al. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics 2017;33:35–41.

88. Wang CY, Hu LL, Guo MZ, et al. imDC: an ensemble learning method for imbalanced classification with miRNA data. Genet Mol Res 2015;14:123–33.

89. Chen W, Xing P, Zou Q. Detecting N6-methyladenosine sites from RNA transcriptomes using ensemble Support Vector Machines. Sci Rep 2017;7:40242.

90. Opitz D, Maclin R. Popular ensemble methods: an empirical study. J Artif Intell Res 1999;11:169–98.

91. Seni G, Elder JF. Ensemble methods in data mining: improving accuracy through combining predictions. Synth Lect Data Mining Knowledge Discov 2010;2:1–126.

92. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 2017;18:851–69.

93. Tian K, Shao M, Wang Y, et al. Boosting compound-protein interaction prediction by deep learning. Methods 2016;110:64–72.

94. Wei L, Ding Y, Su R, et al. Prediction of human protein subcellular localization using deep learning. J Parallel Distributed Comput 2017;117:212–217.

95. Wang D, Zeng S, Xu C, et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 2017;33:3909–16.

96. Khurana S, Rawi R, Kunji K, et al. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 2018;34:2605–2613.

97. Liu B, Chen J, Guo M, et al. Protein remote homology detection and fold recognition based on Sequence-Order Frequency Matrix. IEEE/ACM Trans Comput Biol Bioinform 2017. doi:10.1109/TCBB.2017.2765331.

98. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85–117.

99. Hinton GE. Deep belief networks. Scholarpedia 2009;4:5947.

100. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.

101. Yang P, Li XL, Mei JP, et al. Positive-unlabeled learning for disease gene identification. Bioinformatics 2012;28:2640–7.

102. Yang P, Humphrey SJ, James DE, et al. Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics 2016;32:252–9.

103. Li F, Song J, Li C, et al. PAnDE: averaged n-dependence estimators for positive unlabeled learning. ICIC Expr Lett, Part B, Appl: An Int J Res Surv 2017;8:1287–97.

104. Hameed PN, Verspoor K, Kusljic S, et al. Positive-unlabeled learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics 2017;18:140.

105. Elkan C, Noto K. Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 213–20. ACM, United States.

106. Zhang B, Zuo W. Learning from positive and unlabeled examples: a survey. In: Information Processing (ISIP), 2008 International Symposiums on. 2008, 650–54. IEEE.

107. Su ZD, Huang Y, Zhang ZY, et al. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics 2018. doi:10.1093/bioinformatics/bty508.

108. Lin H, Liang ZY, Tang H, et al. Identifying sigma70 promoters with novel pseudo nucleotide composition. IEEE/ACM Trans Comput Biol Bioinform 2017.

109. O'Driscoll A, Daugelaite J, Sleator RD. ‘Big data’, Hadoop and cloud computing in genomics. J Biomed Inform 2013;46:774–81.

110. Wang X, Williams C, Liu ZH, et al. Big data management challenges in health research—a literature review. Brief Bioinform 2017;bbx086.

111. Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform 2017;18:530–6.

112. Zou Q, Wan S, Zeng X. HPTree: reconstructing phylogenetic trees for ultra-large unaligned DNA sequences via NJ model and Hadoop. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. 2016, 53–58. IEEE, China.

113. Zou Q. Multiple sequence alignment and reconstructing phylogenetic trees with Hadoop. In: Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. 2016, 1438. IEEE.

114. Zou Q, Li XB, Jiang WR, et al. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2014;15:637–47.

115. Karim MR, Michel A, Zappa A, et al. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Brief Bioinform 2017;bbx039.

116. Su W, Liao X, Lu Y, et al. Multiple sequence alignment based on a suffix tree and center-star strategy: a linear method for multiple nucleotide sequence alignment on spark parallel framework. J Comput Biol 2017;24:1230–42.

Author notes

These authors contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
