Abstract

Bacterial effector proteins secreted by various protein secretion systems play crucial roles in host–pathogen interactions. In this context, computational tools capable of accurately predicting effector proteins of the various types of bacterial secretion systems are highly desirable. Existing computational approaches use different machine learning (ML) techniques and heterogeneous features derived from protein sequences and/or structural information. These predictors differ not only in the ML methods they use but also in their curated data sets, feature selection and prediction performance. Here, we provide a comprehensive survey and benchmarking of currently available tools for the prediction of effector proteins of bacterial types III, IV and VI secretion systems (T3SS, T4SS and T6SS, respectively). We review core algorithms, feature selection techniques, tool availability and applicability and evaluate the prediction performance based on carefully curated independent test data sets. In an effort to improve predictive performance, we constructed three ensemble models based on ML algorithms by integrating the output of all individual predictors reviewed. Our benchmarks demonstrate that these ensemble models outperform all the reviewed tools for the prediction of effector proteins of T3SS and T4SS. The webserver of the proposed ensemble methods for T3SS and T4SS effector protein prediction is freely available at http://tbooster.erc.monash.edu/index.jsp. We anticipate that this survey will serve as a useful guide for interested users and that the new ensemble predictors will stimulate research into host–pathogen relationships and inspire the development of new bioinformatics tools for predicting effector proteins of T3SS, T4SS and T6SS.

Introduction

Bacteria can form mutualistic or pathogenic associations with hosts such as humans through the regulation of their specialized protein secretion systems [1–3]. The process of protein secretion by bacteria requires induction of protein synthesis and then protein translocation from the bacterial cytoplasm into host cells [4]. A secreted protein may either remain associated with the outer membrane, or be injected into eukaryotic (host) cells or into neighbouring bacterial cells [5]. To date, nine distinct types of protein secretion systems, referred to as type I to type IX, have been experimentally characterized in gram-negative bacteria [2, 3, 6–10]. Various enzymes are exported to the environment by the type I, type II or type V secretion systems [5]. In contrast, the type III secretion system (T3SS), type IV secretion system (T4SS) and type VI secretion system (T6SS) [11–18] transport ‘effector’ proteins into host cells. By definition, effector proteins mimic the function of host proteins and can thereby dysregulate host cell biology to the benefit of the bacterium. Effector proteins secreted by the T3SS, T4SS and T6SS are named T3SEs, T4SEs and T6SEs, respectively. The numbers of experimentally validated effectors vary across bacterial species, with respect to different hosts and according to various survival strategies [11, 19, 20].

In light of the biological significance of bacterial effector proteins, a number of computational approaches were developed to predict secreted effector proteins based on protein-sequence information [21–23]. An important consensus from previous studies was that simplified statistical methods based on individual features alone, such as sequence similarity, sequence patterns and gene-adjacent sequence features, did not perform well for effector protein prediction [24–26]. Therefore, since 2009, machine learning algorithms have been increasingly used to address this difficult task by formulating effector protein prediction as a classification problem. The machine learning algorithms used to date include support vector machines (SVMs) [19, 27–32], artificial neural networks (ANNs) [27], Markov or hidden Markov models [33], Naïve Bayes [34] and Random Forest (RF) [4]. Among these machine learning techniques, SVMs are the most widely used algorithms for prediction of effector proteins. A variety of features, such as compositions of amino acids and amino acid pairs, position-specific scoring matrices (PSSMs), physicochemical properties and protein secondary structures (SS), were commonly extracted and used as an input to train the machine learning models. Cross-validation tests including leave-one-out and k-fold cross-validation are widely applied to assess the performance of the developed methods.

The currently available methods for secretion effector prediction differ significantly from one another in terms of learning algorithms, data sets (divided into training and test data sets), features used, prediction performance, availability via designated web servers and/or stand-alone software and applicability. In this article, we aim to provide a comprehensive survey and performance evaluation of currently available methods and tools for the prediction of three major types of secretion effector proteins, namely, T3SEs, T4SEs and T6SEs. To the best of our knowledge, this is the first in-depth comparison of its kind. It is particularly notable that, while there have been a number of machine learning-based methods for the prediction of T3SEs and T4SEs, little work has been done for prediction of the effectors of the more recently discovered T6SS [35, 36]. Experimental studies have proposed several motifs for identifying T6SEs [37, 38], and here we evaluate the performance of motif pattern-based approaches for predicting T6SEs by using the independent test data set extracted from the previous studies of Salomon et al. and Altindis et al. [37, 38].

Based on the performance evaluation of current methods for effector protein prediction, we developed three ensemble classifiers by integrating the output of all reviewed methods in this study. Three machine learning algorithms, i.e. SVM [39], RF [40] and Logistic Regression (LR) [41, 42], were used to train the ensemble classifiers. The three classifiers took the output of all individual predictors as input. The performance was then evaluated using 5-fold cross-validation. Our results indicated that the three ensemble models outperformed all individual tools for both T3SEs and T4SEs. We anticipate that these ensemble models will complement existing methods and provide new insights into the roles of secreted effectors of T3SS and T4SS.
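The stacking idea described above, in which the outputs of the individual predictors become the input features of a second-level classifier, can be illustrated with a minimal sketch. The code below is not the implementation used for the ensemble models; it assumes three hypothetical base predictors that each return a score between 0 and 1, uses made-up toy scores, and fits a plain logistic regression by gradient descent (the learning rate and number of epochs are arbitrary choices).

```python
import math

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Probability that a protein is an effector, given base-predictor scores."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (assumed, for illustration only): each row holds the scores that
# three hypothetical base predictors assigned to one protein.
base_scores = [
    [0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6],   # effectors
    [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.1],   # non-effectors
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(base_scores, labels)
```

A new protein would then be scored with `predict_proba(weights, bias, scores)`, where `scores` holds the outputs of the same base predictors in the same order.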

Materials and methods

Construction of the independent test data sets

We searched through several publicly available databases to extract data associated with T3SE, T4SE and T6SE and construct the independent test data sets. Figure 1 depicts the flowchart of our data-curation procedures for the creation of independent test data sets.

Figure 1. Flowchart of the independent test data set collection for T3SEs, T4SEs and T6SEs.

Initially, we searched through the UniProt database [43] using various keywords describing different types of bacterial secreted effector proteins. Such keywords included ‘effector protein’, ‘bacterial secretion effector’ and ‘translocated into the host cell’ and were used in combination with ‘type III secretion system’ (‘T3SS’), ‘type IV secretion system’ (‘T4SS’) or ‘type VI secretion system’ (‘T6SS’), or their associated effector acronyms ‘T3SE’, ‘T4SE’ and ‘T6SE’, respectively. This search strategy resulted in a large number of redundant entries for the same effectors. These were then manually checked and filtered to ensure the quality of extracted entries. Subsequently, proteins that were not genuinely T3SEs, T4SEs or T6SEs were removed. All retained entries were required to have unambiguous and explicit annotations, as well as evidence for their classification (in the form of statements such as ‘secreted by T3SS’, or ‘translocated into the host cell via the type IV secretion system’).

Secondly, a number of additional effector proteins were collected from curated data sets in previous studies. Although many of these proteins can be found in the NCBI protein database (http://www.ncbi.nlm.nih.gov/protein/), they are not necessarily annotated as such. For example, only the 100 N-terminal amino acids of non-redundant T3SEs are used in BPBAac [32] (three pieces of information are provided for each entry: gene name, bacterial species and PMID number). This information was then used to extract full protein sequence entries from NCBI; full-length protein-sequence information is mandatory for our study, as the complete N- and C-terminal residue information is required for feature extraction and calculation. Wherever necessary, we extracted the complete amino acid sequences of these entries by searching their corresponding protein names provided in the literature.

Thirdly, we mined the relevant literature by searching the abstract in PubMed to obtain the most recent secreted effector proteins not currently included in public sequence databases. We then used their protein and/or gene names to search in the NCBI protein database to validate and retrieve their sequences in FASTA format.

After these steps, all extracted effector proteins of T3SS, T4SS and T6SS constituted the positive data sets, which are referred to as T3_P, T4_P and T6_P, respectively. As a final procedure, to objectively evaluate and compare the performance of all reviewed methods/tools, we downloaded, whenever possible, the original training data sets used for developing these approaches and removed from T3_P, T4_P and T6_P all proteins that were duplicated in these training data sets. To generate the negative data sets of non-effectors for each of the bacterial secretion systems, we randomly selected proteins from the positive data sets representing the other two secretion systems. For example, when constructing the negative data set for T3SS, we randomly chose effector proteins from the independent test data sets for T4SS and T6SS. Similar to the construction procedure for positive data sets, we removed all duplicate non-effector sequences from the negative data sets for all three secretion systems.

To avoid potential overestimation of the prediction performance, the CD-HIT program (available at http://weizhong-lab.ucsd.edu/cd-hit/) was used to remove sequence redundancy from both positive and negative data sets for the three secretion systems. CD-HIT is a widely used bioinformatics tool for clustering protein sequences according to a specified sequence identity threshold, which was set at 40% for this study [44]. As a result, 44 T3SEs, 40 T4SEs and 237 T6SEs were retained following removal of sequence redundancy. We randomly selected the same numbers of negative samples based on CD-HIT clustered negative sequences for each secretion system. In summary, three independent test data sets were constructed, with each of these including effector proteins and non-effector proteins for each of the bacterial secretion systems, i.e. III (44 T3SEs versus 44 non-T3SEs), IV (40 T4SEs versus 40 non-T4SEs) and VI (237 T6SEs versus 237 non-T6SEs), respectively.
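The redundancy-removal and balancing steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact commands used in this study: the file names are hypothetical, the command is only constructed and not executed, and the word-size choice follows the ranges recommended in the CD-HIT user guide (word size 2 for identity thresholds of 0.4 to 0.5).

```python
import random

def cdhit_command(fasta_in, fasta_out, identity=0.4):
    """Build (but do not run) a CD-HIT command line for an identity threshold."""
    # Word size (-n) recommended by the CD-HIT user guide for each range of
    # the identity threshold (-c).
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    else:
        word_size = 2
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", str(word_size)]

def sample_balanced_negatives(negatives, n_positives, seed=0):
    """Randomly draw as many negatives as there are positives (no replacement)."""
    rng = random.Random(seed)
    return rng.sample(negatives, n_positives)

# Hypothetical file names; 40% identity as in the text.
command = cdhit_command("T3_P.fasta", "T3_P_nr40.fasta", identity=0.4)

# Hypothetical pool of clustered negatives; 44 matches the number of T3SEs.
negative_pool = ["neg_%d" % i for i in range(100)]
t3_negatives = sample_balanced_negatives(negative_pool, 44)
```

The returned list could be passed to `subprocess.run` on a machine where CD-HIT is installed.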

To explore potential amino acid enrichment or depletion in either N- or C-terminal residue positions for secreted effector proteins, sequence-logo representations were generated for the 50 N-terminal and 50 C-terminal residue positions based on the curated data sets by using pLogo [45], a probabilistic approach for the identification and visualization of sequence motifs. The background data set for this motif-visualization analysis included the protein sequences obtained by searching the UniProt database.
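Extracting the terminal windows that serve as input for the sequence-logo analysis is a simple slicing operation. The sketch below is illustrative only; the function name and the example sequence are hypothetical.

```python
def terminal_windows(sequence, window=50):
    """Return the N-terminal and C-terminal windows of a protein sequence.

    Sequences shorter than `window` are returned whole for both termini.
    """
    n_term = sequence[:window]
    c_term = sequence[-window:]
    return n_term, c_term

# Made-up sequence of 111 residues, for illustration only.
example = "M" + "A" * 60 + "K" * 50
n_term, c_term = terminal_windows(example)
```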

Existing approaches for effector protein prediction

Tables 1 and 2 summarize the currently available prediction methods/tools for T3SEs and T4SEs, respectively. Notably, for T3SE predictors, SVMs were adopted as the predominant machine learning algorithm by multiple tools, including ANN [27], SIEVE [31], BEAN [28], BEAN 2.0 [29] and BPBAac [30]. Apart from SVMs, several methods used other machine learning algorithms, including the RF model [4], EffectiveT3 [34], T3SEdb [46] and T3_MM [33]. For T4SEs, we evaluated two currently available tools, namely, T4EffPred [19] and T4SEpre [32]. For T6SEs, no tools are currently available aside from motif-based search methods. Therefore, to evaluate the performance of T6SE prediction, we used specific motifs previously proposed, including MIX (marker for type six effectors) [37] and the motifs from Altindis et al. [38]. These approaches will be described in detail in subsequent sections.

Table 1

A comprehensive list of the reviewed methods/tools for the prediction of T3SEs for the bacterial type III secretion system

| Tool^a (year) | Software availability | Webserver availability | Feature representation | Algorithm | Performance evaluation strategy | Training data set: #Effectors | Training data set: #Non-effectors | Test data set | Reference |
|---|---|---|---|---|---|---|---|---|---|
| ANN (2009) | No | Yes | SEQ | ANN & SVM | 10-fold cross-validation (leave 50% out) | 575 | 685 | n/a | [24] |
| SIEVE (2009) | No | Yes | AAC; GC; PHYL; CON; SEQ | SVM | Independent test | n/a | n/a | n/a | [28] |
| EffectiveT3 (2009) | Yes | Yes | SS | Naïve Bayes | 10-fold cross-validation | 167 | | n/a | [30] |
| T3SEdb (2010) | No | Yes | Hydrophobicity; polarity; β-turns | Naïve Bayes | 10-fold cross-validation and independent test | 100 | 100 | Effectors: 68; Non-effectors: 68 | [41] |
| T3_MM (2013) | Yes | Yes | AAC | Markov model | 5-fold cross-validation and independent test | 154 | 308 | 35 | [42] |
| RF model (2013) | Yes | No | AAC; SS; RSA; PP | RF model | 5-fold cross-validation and independent test | 191 | 213 | 121 | [4] |
| BEAN (2013) | Yes | No | HH-CKSAAP | SVM | 5-fold cross-validation and independent test | 154 | 308 | 323 | [25] |
| BEAN 2.0 (2013) | No | Yes | HH-CKSAAP | SVM | 5-fold cross-validation | 243 | 486 | n/a | [26] |

n/a, not applicable; RSA, relative solvent accessibility; PP, physicochemical properties; GC, G + C nucleotide compositions of the primary DNA sequence; PHYL, phylogenetic profile; CON, sequence conservation; SEQ, N-terminal sequence of protein; DPC, dipeptide composition; PSSM_AC, auto covariance transformation of PSSM.

^a The URL addresses for accessing the listed tools are provided as follows:

Table 2

A comprehensive list of the reviewed methods/tools for prediction of T4SEs of the bacterial type IV secretion system^a

| Tool^b (year) | Software availability | Webserver availability | Feature representation | Algorithm | Performance evaluation strategy | Training data set: #Effectors | Training data set: #Non-effectors | Test data set | Reference |
|---|---|---|---|---|---|---|---|---|---|
| T4EffPred (2013) | Yes | Yes | AAC; DPC; PSSM; PSSM_AC | SVM | Leave-one-out | 340 | 1132 | n/a | [19] |
| T4SEpre (2014) | Yes | No | AAC; SA; SS | SVM | 5-fold cross-validation | 347 | 694 | n/a | [29] |
^a Refer to the abbreviations in Table 1 for full descriptions of the feature representation and algorithms.

^b The URL addresses for accessing the listed tools are provided as follows:

Algorithms used by existing approaches

An SVM classifier is a powerful algorithm widely applied to solve many classification tasks in the field of computational biology [47–55]. It can be used to build linear or non-linear classification models by transforming input vectors into a high-dimensional space and constructing an optimal separation hyperplane between the positive and negative samples [56]. SVMs often achieve better or competitive performances compared with other machine learning techniques. Consequently, SVMs are also used for effector protein prediction of T3SEs [SIEVE, BPBAac, BEAN and BEAN 2.0 (Table 1)] and T4SEs [T4EffPred and T4SEpre (Table 2)].

The SIEVE model was the first SVM-based approach used to predict T3SEs [31]. It uses both protein- and DNA-sequence information and was developed using the Gist software package [57], an online SVM classification tool. The radial basis function was chosen as the kernel of the SVM, with a width of 0.5 and an optimized ratio of negative-to-positive examples to perform the classification [31].

BPBAac is also an SVM-based approach for predicting T3SEs; it trains the prediction models on amino acid composition (AAC) features extracted using the bi-profile Bayesian (BPB) feature-extraction scheme [58, 59]. The radial basis function $K(s_i, s_j) = \exp(-\gamma \lVert s_i - s_j \rVert^2)$ was selected as the kernel of the SVM model. Its parameter γ and the penalty parameter C were then optimized via a grid search based on 10-fold cross-validation.
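For concreteness, the radial basis function kernel and a logarithmically spaced parameter grid of the kind typically used for such a grid search can be written as below. This is a minimal sketch: the exponent ranges are assumptions chosen for illustration and are not the values reported for BPBAac, and the cross-validated evaluation of each (C, γ) pair is omitted.

```python
import math

def rbf_kernel(s_i, s_j, gamma):
    """Radial basis function kernel exp(-gamma * ||s_i - s_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s_i, s_j))
    return math.exp(-gamma * sq_dist)

def log2_grid(low_exp, high_exp, step=2):
    """Candidate values 2^low_exp, 2^(low_exp + step), ..., up to 2^high_exp."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1, step)]

# Assumed search ranges (illustrative, not taken from the reviewed tools).
c_values = log2_grid(-5, 15)
gamma_values = log2_grid(-15, 3)
parameter_grid = [(c, g) for c in c_values for g in gamma_values]
```

Each pair in `parameter_grid` would be scored by cross-validation, and the best-scoring pair retained for the final model.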

BEAN is a sophisticated approach for identifying T3SEs. It combines a hidden Markov model-based search method called HHblits with a profile-based composition of k-spaced amino acid pairs (CKSAAP) to extract a feature vector called HH-CKSAAP, which is used to train a linear kernel SVM model [28]. The SVM model was trained with the cost parameter C = 1 and a tolerance of the termination criterion of $e = 1 \times 10^{-4}$. BEAN 2.0 is an advanced version of BEAN [29] that exploits more informative features and trains the model on a larger data set than BEAN.
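To make the k-spaced pair encoding concrete, the sketch below computes a plain sequence-based CKSAAP vector. It is a simplification: BEAN derives HH-CKSAAP from HHblits profiles, whereas this example counts residue pairs directly in the sequence, and the maximum gap `k_max` is an assumed parameter.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(sequence, k_max=2):
    """Composition of k-spaced amino acid pairs for k = 0 .. k_max.

    For each gap k, counts pairs (sequence[i], sequence[i + k + 1]) and
    normalises by the number of counted pairs, giving 400 values per gap.
    """
    features = []
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_pairs = 0
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:          # skip non-standard residues
                counts[pair] += 1
                n_pairs += 1
        features.extend(counts[p] / n_pairs if n_pairs else 0.0 for p in counts)
    return features

toy_features = cksaap("ACAC", k_max=2)
```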

T4EffPred is an SVM-based tool for predicting T4SEs. It uses the LIBSVM (library for SVMs) toolbox in the MATLAB workspace to build a prediction model based on different types of sequence-derived features, including AAC, dipeptide composition, PSSM and PSSM autocovariance transformation. Here, too, the SVM kernel is the radial basis function with parameters γ and C optimized using a grid search based on 10-fold cross-validation.
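Two of these feature types can be sketched briefly. The code below is illustrative and not taken from T4EffPred: the AAC function is the usual residue-frequency vector, and the auto covariance function follows the commonly used definition (the average product of mean-centred PSSM values at positions separated by a given lag), which is assumed here. The PSSM is taken as a precomputed matrix with one row per residue, and `max_lag` is a hypothetical parameter.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Amino acid composition: frequency of each of the 20 standard residues."""
    return [sequence.count(a) / len(sequence) for a in AMINO_ACIDS]

def pssm_auto_covariance(pssm, max_lag=2):
    """Auto covariance transformation of a PSSM (rows: positions, columns: residues).

    For each lag and column j, averages (P[i][j] - mean_j) * (P[i+lag][j] - mean_j)
    over all positions i. Requires len(pssm) > max_lag.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                        for i in range(length - lag))
            features.append(total / (length - lag))
    return features

# Toy two-column matrix, for illustration only.
toy_pssm = [[1, 0], [3, 0], [1, 0], [3, 0]]
toy_ac = pssm_auto_covariance(toy_pssm, max_lag=2)
```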

T4SEpre is yet another SVM-based tool for predicting T4SEs. It takes into account a number of different features and their combinations, including sequential AAC features, single-profile Bayesian (SPB) AAC features, BPB AAC features and joint position-specific features of AAC, SS and solvent accessibility (SA). The optimal parameters were the same as those used by T4EffPred.

Another popular machine learning technique is ANN, as it is able to deal with non-linear and high-dimensional data [60, 61]. The ANN tool was developed by combining both ANN (feed-forward-type architecture with a single hidden neuron layer) and SVM algorithms to train the optimal model using the signal sequence located within the first 30 amino acids at the N-terminus [27]. This method used a gradient-descent back-propagation learning scheme, with momentum at an adaptive learning rate. The output of the ANN was converted into a binary decision using a cut-off threshold value of θ = 0.5. For the SVM classifier, the complexity parameter C and the parameter γ of the radial basis function were optimized using a grid search in the logarithmic space.

A Markov model [62] has also been used for the prediction of secretion effector proteins. T3_MM adopted a straightforward Markov model based on the AAC of the 100 N-terminal amino acid residues to achieve a more stable classification performance [33]. Based on the Markov model, a sequential likelihood-ratio variable, R, was created to measure the overall difference in the conditional probability profiles of position-adjacent AAC between T3SEs and non-T3SEs. The R-values were calculated and statistically analysed for T3SEs and non-T3SEs.
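The idea of a likelihood ratio over position-adjacent residues can be illustrated with a first-order Markov chain. The sketch below is an assumption-laden simplification and not the exact formulation of T3_MM: transition probabilities are estimated separately from effectors and non-effectors with an assumed pseudocount of 1, and the score is the sum of log-ratios over adjacent residue pairs in the N-terminal window. The toy sequences are made up.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def transition_probs(sequences, n_term=100, pseudocount=1.0):
    """Estimate P(next residue | current residue) from N-terminal windows."""
    counts = {a: {b: pseudocount for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for seq in sequences:
        seq = seq[:n_term]
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: c / total for b, c in row.items()}
    return probs

def likelihood_ratio(seq, pos_probs, neg_probs, n_term=100):
    """Sum of log-ratios of transition probabilities (effector vs non-effector)."""
    seq = seq[:n_term]
    score = 0.0
    for a, b in zip(seq, seq[1:]):
        if a in pos_probs and b in pos_probs[a]:
            score += math.log(pos_probs[a][b] / neg_probs[a][b])
    return score

# Made-up training sequences, for illustration only.
pos_model = transition_probs(["MSSSSISSS", "MISSSSSS"])
neg_model = transition_probs(["MLLLLKLLL", "MKLLLLL"])
```

Under this sketch, a positive score indicates that the sequence resembles the effector model more closely than the non-effector model.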

A Naïve Bayes classifier is a machine learning algorithm used mainly for solving supervised classification tasks and provides a simple approach by assuming that numeric attributes follow a single Gaussian distribution [63]. Given its attractive features, including its simple structure and ease of implementation, Naïve Bayes classifiers perform well in many real-world applications [64]. EffectiveT3 is a Naïve Bayes-based tool for predicting T3SEs that integrates a variety of N-terminal sequence features such as amino acid frequencies, short peptides and residues with certain physicochemical properties [34]. Notably, when using EffectiveT3 [34] to predict potential T3SEs, the choice of an appropriate probability threshold for the ‘secreted’ class (used to adjust the selectivity and sensitivity of the predictor) is left to the user's discretion. T3SEdb is another Naïve Bayes classifier for T3SE prediction and was constructed using physicochemical properties, such as hydrophobicity, polarity and β-turns, along with N-terminal motifs (100 amino acids). T3SEdb was implemented using WEKA [46], which is a well-established and widely used data-mining platform.

In recent years, RF has emerged as a powerful machine learning algorithm and has been increasingly applied to solve many classification/regression problems [65–69]. It is especially efficient at dealing with data sets with high-dimensional features [45]. The ensemble of decision trees built by RF can reduce the variance of single decision trees, thereby improving overall prediction accuracy. The RF model developed by Yang et al. [4] predicts T3SEs and uses protein-sequence information, including AAC, SA, SS and six physicochemical properties, as well as the sequence fragment of 52 position-specific residues, to train the RF model [4]. The model has two parameters: ntree, the number of trees to build, and mtry, the number of variables randomly selected as candidates for each node. Both parameters are optimized using a grid-search approach. For this study, ntree took on values between 500 and 2500, in steps of 500, and mtry was set to integer values between 1 and 40. The RF algorithm was implemented using the RF package written in R [70].
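The parameter grid described above is small enough to enumerate explicitly. The following sketch lists all (ntree, mtry) combinations and selects the best one using a scoring function; `score_fn` is a hypothetical placeholder for the cross-validated performance of an RF model trained with the given parameters.

```python
from itertools import product

NTREE_VALUES = list(range(500, 2501, 500))   # 500, 1000, ..., 2500
MTRY_VALUES = list(range(1, 41))             # 1, 2, ..., 40
GRID = list(product(NTREE_VALUES, MTRY_VALUES))

def grid_search(grid, score_fn):
    """Return the (ntree, mtry) pair with the highest score."""
    return max(grid, key=lambda params: score_fn(*params))
```

With five values of ntree and 40 values of mtry, the grid contains 200 candidate models.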

Feature selection

The purpose of feature selection is to identify the features that are most informative and contribute most to model performance, and to remove noisy and redundant features, thereby optimizing prediction performance [71–73]. Given that initial features often contain noisy and redundant information, an increasing number of studies use feature-selection techniques to characterize feature importance before training the final optimized models. In this section, we briefly discuss the application of feature selection by different tools and summarize their results.

Among the reviewed tools, BPBAac, SIEVE, RF, EffectiveT3 and T3SEdb used feature-selection techniques to filter irrelevant features and characterize feature contributions to the performance of their methods. For the remaining predictors, it was unclear whether feature-selection strategies were used.

In SIEVE [31], the most important features were selected via an iterative process called recursive feature elimination. This process successively eliminates features exhibiting low impact on overall model performance. In comparison, RF adopted permutation importance analysis to facilitate optimal feature selection, resulting in 62 optimal features [4]. To identify the most informative features, EffectiveT3 used two feature-selection strategies provided by WEKA, including a greedy hill-climb search [74] (the BestFirst algorithm using a look-up-cache size of one and five iterations) and correlated feature selection [75] (locally predictive = true, missing values = false). For T3SEdb, a greedy stepwise algorithm [76] was used to select a reduced feature set consisting of individual physicochemical properties. After feature selection, 92 individual features, including hydrophobicity, polarity and β-turns, were reduced to 63 combined features. BPBAac adopted both the BPB and SPB methods for feature extraction. The two methods are similar except that BPB also takes the features of negative-training data into consideration. Additionally, Löwer et al. found that the effector proteins of T3SS share common sequence-based features at the N-terminus (the 30 N-terminal residues). These sequence-based features were shown to contribute to accurate predictions of T3SEs [27].
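Recursive feature elimination can be summarised in a few lines. The sketch below is generic and is not SIEVE's implementation: `importance_fn` is a hypothetical callback that would, in practice, retrain the model on the remaining features and return an importance value for each of them, and `n_keep` is an assumed stopping criterion.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain.

    importance_fn(remaining) must return a dict mapping each remaining
    feature name to its importance under a model refitted on those features.
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)
        weakest = min(remaining, key=lambda f: scores[f])
        remaining.remove(weakest)
    return remaining

# Toy importances (fixed, for illustration only; a real callback would refit).
toy_importance = {"AAC": 0.9, "GC": 0.1, "PHYL": 0.4, "CON": 0.6, "SEQ": 0.8}
selected = recursive_feature_elimination(
    list(toy_importance), lambda remaining: toy_importance, n_keep=3)
```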

Software functionality

In this section, we discuss the user-friendliness of graphical interfaces and functionalities of existing tools. Tools such as BEAN 2.0, EffectiveT3 and ANN enable users to submit multiple protein sequences in the FASTA format, although they limit the maximum number of sequences allowed (for BEAN 2.0 and EffectiveT3, ≤200 protein sequences are permitted; for ANN, ≤50 protein sequences are allowed). However, T3_MM and T4EffPred only accept a single sequence in the FASTA format per submission. Additionally, SIEVE allows users to upload files containing FASTA-formatted protein sequences. SIEVE and EffectiveT3 return the prediction outcome by sending an email to users after the submitted task is completed, instead of redirecting the output to a webpage. Depending on the task at hand, this might be a limitation, owing to the indirect retrieval of the prediction outcome.

Four tools, EffectiveT3, BPBAac, T3_MM and T4SEpre, also provide stand-alone software written in R, Perl and other programming languages to enable users to perform prediction analyses on local computers. Detailed instructions providing useful guidance and help for troubleshooting during installation and use are found on the corresponding websites. Furthermore, T4EffPred provides several different predictors implemented in MATLAB, based on different feature combinations and methods [19].

Additionally, detailed on-site help documents and examples of job submissions, if available, can facilitate users' understanding of prediction procedures and requirements. In this regard, BEAN 2.0, T3_MM and EffectiveT3 provide example sequences, allowing users to quickly become familiar with the format of sequence submissions. Descriptions of sequence-length limitations, the maximum allowable number of sequences per submission, the prediction algorithms and methods, and the interpretation of results are available for all tools. These help documents promote users' understanding of tool methodologies, requirements and limitations.

Performance evaluation measurements

Cross-validation (including k-fold cross-validation and leave-one-out cross-validation) and independent tests are often used to assess prediction performance. To perform k-fold cross-validation, the entire data set is divided into k subsets. Subsequently, at each cross-validation step, one subset constitutes the validation set, while the remaining k − 1 subsets are combined to form the training data set. This procedure is repeated k times, so that each subset serves exactly once as the validation set. The average performance across all k trials is then computed and reported. Leave-one-out cross-validation can be regarded as an extreme case of k-fold cross-validation, with k = N, where N is the total number of samples in the data set. Here, each instance in the data set is used once as the validation sample, whereas the remaining N − 1 samples form the training data set used to train the prediction model. The average performance of the N models is reported as the final prediction performance of leave-one-out cross-validation. In contrast, the independent test provides a more objective performance evaluation. The independent test is conducted on a separate test data set whose data distribution presumably differs from that of the training data set. For an independent test to be valid, there must be no overlapping data points between the training data set and the independent test data set. Ideally, all sequence entries in the independent test data set should also have minimal sequence similarity with those included in the training data set.
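The splitting scheme can be sketched as follows. This minimal example assigns samples to folds in an interleaved fashion without shuffling, which is an assumption made for brevity; in practice, samples are usually shuffled (and often stratified by class) before splitting. Leave-one-out cross-validation is obtained by setting k to the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

five_fold = list(k_fold_indices(10, 5))      # 5-fold on 10 samples
leave_one_out = list(k_fold_indices(4, 4))   # k = N gives leave-one-out
```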

The prediction performance of all the reviewed tools, except SIEVE, was evaluated by performing k-fold cross-validation tests in their original studies (i.e. 10-fold cross-validation for ANN and EffectiveT3, 5-fold cross-validation for T3_MM, RF, BEAN, BEAN 2.0 and T4SEpre, and leave-one-out cross-validation for BPBAac and T4EffPred). The performance of SIEVE, BPBAac, T3_MM and RF was also evaluated using independent tests in their original studies. Here, we comprehensively assessed the performance of all reviewed tools by performing tests based on independent data sets.

To evaluate the predictive performance of the reviewed approaches, six measures were used in this study, namely, Accuracy (ACC), Specificity (Sp), Sensitivity (Sn), F1 score, area under the curve (AUC) and Matthews correlation coefficient (MCC) [77]. Receiver operating characteristic (ROC) curves were plotted to represent Sn versus (1 − Sp) by shifting prediction cut-off thresholds. MCC is calculated based on the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) and is usually considered as a balanced measure, especially for skewed or unbalanced data sets. These performance measures are calculated as follows:
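The threshold-dependent measures can be sketched in Python using their conventional confusion-matrix definitions. This is an assumption on our part in the sense that the code follows the textbook formulas (shown in the comments) rather than any implementation distributed with the reviewed tools; the function name `performance_measures` and the example counts are purely illustrative and are not taken from the benchmark.

```python
import math


def performance_measures(tp, fp, tn, fn):
    """Compute Sn, Sp, ACC, F1 and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # Sn  = TP / (TP + FN)
    sp = tn / (tn + fp)                    # Sp  = TN / (TN + FP)
    acc = (tp + tn) / (tp + fp + tn + fn)  # ACC = (TP + TN) / all samples
    f1 = 2 * tp / (2 * tp + fp + fn)       # F1  = 2TP / (2TP + FP + FN)
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "F1": f1, "MCC": mcc}


# Illustrative counts only (not values from this study).
example = performance_measures(tp=40, fp=5, tn=45, fn=10)
```

The sketch assumes that both classes are present in the test set; MCC is set to 0 when its denominator vanishes, which is a common convention rather than part of the definition.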

Results and discussion

Analysis of sequence motifs of known effector proteins

For each type of effector protein, N- and C-terminal sequences were extracted using a window size of 50 amino acids based on previous studies [19, 30]. The generated sequence logos for each type of effector protein are displayed in Figure 2.
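The window extraction described above can be sketched as follows. The helper names `terminal_windows` and `position_frequencies` are hypothetical, and the per-position frequencies computed here are only the raw input to a sequence logo; the enrichment and depletion shown in the logos additionally depend on a background distribution, which this sketch does not model.

```python
from collections import Counter


def terminal_windows(sequence, window=50):
    """Return the (N-terminal, C-terminal) windows of a protein sequence."""
    seq = sequence.strip().upper()
    if len(seq) < window:
        raise ValueError("sequence is shorter than the requested window")
    return seq[:window], seq[-window:]


def position_frequencies(windows):
    """Per-position residue frequencies across a list of equal-length windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    frequencies = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        frequencies.append({aa: c / len(windows) for aa, c in counts.items()})
    return frequencies
```

For example, `terminal_windows(seq)` applied to every T3SE sequence yields the two window sets from which the N- and C-terminal logos would be drawn.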

Figure 2. Sequence-logo representations illustrating the amino acid preferences of both N- and C-terminal sequence motifs of the three different types of secreted effector proteins, (A) T3SEs, (B) T4SEs, (C) T6SEs and (D) the control (i.e. cytoplasmic proteins). Amino acids located above the X-axis are favourable, while those underneath the X-axis are unfavourable at the corresponding positions.

Ignoring the methionine at position 1, which is responsible for translation initiation, several notable preferences of amino acid residues are observed in Figure 2. While there is an overall lack of conservation in the C-terminal sequences of T3SEs, except for a preference for glutamine residues at position 4 and, to a lesser extent, at positions 1, 3, 6, 21, 32, 33 and 39 (Figure 2A), there is somewhat more striking conservation in the N-terminal region of the T3SE sequences. The N-terminal sequence motifs of T3SEs exhibit an enrichment of serine residues across multiple positions, including positions 6 to 10, 12, 13, 17, 18, 20, 21 and 31 to 34, and an enrichment of isoleucine residues at positions 3 and 4, while leucine residues are depleted (Figure 2A). These observations are consistent with a number of experimental studies on individual T3SEs. For example, isoleucine residues contribute to the secretion of YopD, a T3SE of Yersinia pseudotuberculosis [78], and isoleucine and serine residues in YopE promote its secretion by the T3SS in Yersinia [79, 80]. Predictive analysis of residue preference in T3SEs from Salmonella and Pseudomonas shows a prevalence of isoleucine and serine in the N-terminal region [79], and broader analyses of T3SEs also highlight the over-representation of these amino acids in the N-terminal region of T3SEs [30, 34].

In the case of T4SEs, several studies have suggested that C-terminal residues appear to provide the targeting information for protein translocation [81, 82]. Other recent studies showed that targeting information can be encoded in the N-terminal region of at least some T4SEs [83–85]. The sequence logos associated with the N- and C-terminal motifs of T4SEs are displayed in Figure 2B. In particular, we found that lysine and asparagine residues are favoured in the N-terminal sequences (Figure 2B). For C-terminal motifs, we observed a preponderance of glutamate at positions 35–41 and serine at positions 42–47 for the T4SEs. The enrichment with glutamate and serine is consistent with a previous computational study of T4SE proteins [32]. The motif analysis also makes clear that the final three positions at the C-terminus favour hydrophobic or positively charged residues, particularly asparagine, lysine and leucine. Experimental investigations of specific T4SEs in Legionella pneumophila and Agrobacterium tumefaciens have suggested that such hydrophobic or positively charged residues are essential for functional translocation signals that assist protein secretion [13, 81, 82], and the motif analysis presented here suggests this to be a general rule.

For T6SE N-terminal sequences, there was no striking conservation of residues that would suggest a targeting signal. At most, serine was frequently observed at position 2, and lysine was favoured at the final four positions at the C-terminus (Figure 2C). A previous case study of Hcp (haemolysin co-regulated protein) secretion by the T6SS of Edwardsiella tarda indicated that positively charged residues such as lysine are important for translocation by the T6SS [11, 86]. While this is consistent with positively charged residues close to the C-terminus contributing to a recognition sequence in T6SEs, this simple feature alone would not discriminate T6SEs from many other (non-secreted) proteins in the bacterial cytoplasm.

In terms of the N-terminal sequences of the control (i.e. cytoplasmic proteins), serine was favoured at position 2, while enrichment of lysine at positions 3 and 4 and of isoleucine at positions 5, 6 and 7 was also observed. For the C-terminal sequences of the control, we observed an over-representation of lysine residues at the final six positions (45–50).

Analysis of characteristic sequence lengths and amino acid frequencies for different types of effector proteins

By definition, effector proteins contain one or more domains that mimic functions important to host cell biology. As a result, variation in effector protein-sequence length reflects the diversity and/or complexity of their specific functional roles [87]. To elucidate the distribution of sequence lengths for T3SEs, T4SEs and T6SEs, we calculated their respective protein-sequence lengths (Figure 3). The resulting histograms showed that there are a large number of sequences with a similar length of 300–500 amino acid residues. The three classes of effector proteins exhibited similar sequence-length distributions, despite the fact that the T3SS, T4SS and T6SS protein translocase machinery is quite distinct in its architecture and therefore in the physical constraints that might be expected to be placed on the substrate (i.e. effector) proteins.

Figure 3. Distribution of sequence lengths for the complete sets of T3SEs, T4SEs and T6SEs.

Recently, it has been observed that overall AAC, as well as structural elements, tend to distinguish secreted proteins from cytoplasmic proteins [88]. Analysis of the AAC in T3SEs, T4SEs and T6SEs showed similarities in the frequency distributions between the three types of effector proteins (Figure 4). For example, leucine and serine were frequently found across the three classes of effector proteins. Leucine was identified as being important for protein binding and transport [89, 90] and, in at least one example, the effector protein SlrP secreted by the Salmonella T3SS has leucine-rich repeats with several conserved leucine residues present in a region shown to be important for translocation by the T3SS [91–93]. The three classes of effector proteins exhibited some specificities in regard to amino acid frequency, for example in that glutamate, alanine and lysine occurred more frequently in T4SEs than in T3SEs and T6SEs.
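For readers who wish to reproduce this kind of comparison on their own sequences, AAC can be computed as the relative frequency of each of the 20 standard amino acids in a sequence. The sketch below is a minimal illustration; the name `amino_acid_composition` is hypothetical, and the choice to ignore non-standard symbols (such as X) is our assumption rather than a documented step of this study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid in a sequence."""
    # Non-standard symbols are skipped (an assumption of this sketch).
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}
```

Computing this 20-dimensional vector for every protein in a class gives, for each residue, one sample of frequencies per class, which is the form of data compared between classes in Figure 4.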

Figure 4. Variations in the frequencies of the 20 amino acids between T3SEs, T4SEs, T6SEs and the control (i.e. cytoplasmic proteins).

To address the significance of these perceived differences, statistical tests including the Mann–Whitney U-test and the permutation test on amino acid frequencies were conducted (Table 3). The Mann–Whitney U-test was performed using the default implementation in R [94], while the permutation test was executed through the R package DAAG [95]. The results of the Mann–Whitney U-test showed that the most differentially distributed amino acids between T3SEs and T4SEs were alanine, glutamate, phenylalanine, isoleucine, lysine and tyrosine. Serine and valine exhibited differential rates of occurrence between T3SEs and T6SEs, while the frequencies of alanine, glycine, lysine, asparagine and valine were significantly different between T4SEs and T6SEs. Notably, alanine and lysine occurred at significantly higher rates in T4SEs than in the other two classes (T3SEs and T6SEs), while the frequency of valine in T6SEs differed significantly from that in the other two classes (T3SEs and T4SEs). Serine appeared to be the most significantly different amino acid type between each of the three effector classes and the control. In addition, glycine, asparagine and valine were found to be significantly different between T3SEs and the control, while arginine was significantly different between T6SEs and the control. The frequencies of alanine, phenylalanine, glycine and isoleucine were significantly different between T4SEs and the control. Results from the permutation test indicated a differential preference for proline between T3SEs and T4SEs, while glycine and asparagine were distributed significantly differently between T3SEs and T6SEs, and serine occurred at significantly different percentages between T4SEs and T6SEs. Glutamine, threonine and isoleucine occurred at significantly different frequencies between the control and T3SEs, T4SEs and T6SEs, respectively.
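To make the two tests concrete, the sketch below gives a pure-Python stand-in for comparing the per-protein frequencies of one residue between two classes. It is not the R or DAAG code used in this study: `mann_whitney_u` only computes the U statistic (without the P-value), and `permutation_p_value` is a simple two-sided permutation test on the difference of group means. Both function names and the example frequencies are hypothetical.

```python
import random


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs (a, b) with a > b count 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted_x, permuted_y = pooled[:len(x)], pooled[len(x):]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(permuted_x) - mean(permuted_y)) >= observed - 1e-12:
            hits += 1
    return hits / n_permutations


# Hypothetical per-protein frequencies of one residue in two classes.
class_a = [0.10, 0.12, 0.11, 0.13, 0.12]
class_b = [0.05, 0.06, 0.04, 0.05, 0.07]
u_stat = mann_whitney_u(class_a, class_b)
p_perm = permutation_p_value(class_a, class_b)
```

An estimate of this kind can be exactly 0 when none of the sampled permutations reaches the observed difference, so a reported value of 0 is best read as "smaller than the resolution of the permutation scheme" rather than as a true zero.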

Table 3

Statistical analysis of residue frequencies in T3SEs, T4SEs, T6SEs and the control

Mann–Whitney U-test

| Residue | T3SE versus T4SE | T3SE versus T6SE | T4SE versus T6SE | T3SE versus control | T4SE versus control | T6SE versus control |
|---|---|---|---|---|---|---|
| Ala | < 2.2e-16 | 5.574e-06 | < 2.2e-16 | 0.5382 | < 2.2e-16 | 1.065e-11 |
| Cys | 0.01706 | 0.6646 | 0.04139 | 0.0006099 | 0.06653 | 0.001861 |
| Asp | 0.8355 | 0.181 | 0.1099 | 4.437e-06 | 5.298e-12 | 0.01268 |
| Glu | < 2.2e-16 | 0.05352 | 2.481e-12 | 1.634e-08 | 5.354e-16 | 0.007167 |
| Phe | < 2.2e-16 | 9.65e-08 | 0.0002624 | 0.09596 | < 2.2e-16 | 3.224e-10 |
| Gly | 1.334e-13 | 2.035e-06 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | 0.03622 |
| His | 0.04773 | 0.01399 | 0.137 | 0.6091 | 0.01994 | 0.003644 |
| Ile | < 2.2e-16 | 5.365e-08 | 0.0007238 | 1.032e-05 | < 2.2e-16 | 0.0001144 |
| Lys | < 2.2e-16 | 0.2072 | < 2.2e-16 | 0.1926 | < 2.2e-16 | 0.0002296 |
| Leu | 8.791e-07 | 0.3577 | 0.0006466 | 0.2076 | 2.253e-09 | 0.9158 |
| Met | 9.062e-11 | 0.7491 | 3.065e-10 | 0.06599 | < 2.2e-16 | 0.1951 |
| Asn | 0.01269 | 7.135e-07 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | 4.977e-12 |
| Pro | 1.278e-05 | 1.425e-05 | 0.2677 | 0.1533 | 1.411e-08 | 1.25e-06 |
| Gln | 0.0003606 | 3.412e-05 | 0.04733 | 3.856e-08 | 0.000133 | 0.8279 |
| Arg | 1.345e-06 | 0.05484 | 0.003534 | 1.142e-13 | < 2.2e-16 | < 2.2e-16 |
| Ser | 3.524e-11 | < 2.2e-16 | 2.33e-07 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 |
| Thr | 0.04236 | 0.255 | 0.1792 | 6.617e-06 | 0.0002581 | 0.0001045 |
| Val | 1.175e-05 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | 0.1471 |
| Trp | 0.0127 | 5.185e-14 | 5.124e-14 | 6.679e-09 | 1.294e-08 | 8.21e-05 |
| Tyr | < 2.2e-16 | 5.698e-11 | 1.509e-05 | 3.888e-05 | < 2.2e-16 | 5.072e-08 |

Permutation test

| Residue | T3SE versus T4SE | T3SE versus T6SE | T4SE versus T6SE | T3SE versus control | T4SE versus control | T6SE versus control |
|---|---|---|---|---|---|---|
| Ala | 0 | 0 | 0 | 0.0218 | 0 | 0 |
| Cys | 0.157 | 0.421 | 0.59 | 0.00255 | 0.0761 | 0.048 |
| Asp | 0.908 | 0.308 | 0.211 | 2.2e-05 | 0 | 0.00323 |
| Glu | 0 | 0.363 | 0 | 0 | 0 | 0.00033 |
| Phe | 0 | 0 | 0.00202 | 0.137 | 0 | 0 |
| Gly | 0 | 2e-06 | 0 | 0 | 0 | 0.157 |
| His | 0.028 | 0.0691 | 0.728 | 0.618 | 0.00677 | 0.0356 |
| Ile | 0 | 4e-06 | 0.017 | 0.00203 | 0 | 8e-06 |
| Lys | 0 | 0.18 | 0 | 0.805 | 0 | 0.0623 |
| Leu | 2.8e-05 | 0.369 | 0.00368 | 0.472 | 0 | 0.634 |
| Met | 0 | 0.542 | 0 | 0.0877 | 0 | 0.415 |
| Asn | 0.000702 | 2e-06 | 0 | 0 | 0 | 0 |
| Pro | 6e-06 | 0 | 0.0214 | 7.2e-05 | 0.00101 | 0 |
| Gln | 0.000194 | 0.000188 | 0.122 | 2e-06 | 0.0525 | 0.83 |
| Arg | 0 | 7e-04 | 0.0283 | 0 | 0 | 0 |
| Ser | 0 | 0 | 3.6e-05 | 0 | 0 | 0 |
| Thr | 0.113 | 0.359 | 0.627 | 0 | 1e-05 | 0.000408 |
| Val | 0.000182 | 0 | 0 | 0 | 0 | 0.308 |
| Trp | 0.0537 | 0 | 0 | 0 | 0 | 0.000558 |
| Tyr | 0 | 0 | 8.8e-05 | 0.00385 | 0 | 0 |

Performance assessment of different tools for effector protein prediction based on the independent test data sets

Tables 4–6 show the performance of different methods for the prediction of T3SEs, T4SEs and T6SEs, respectively, using our curated independent test data sets. Five measures, namely Sn, Sp, ACC, F1 and MCC, were used to compare the performance of the different methods. For T3SE prediction, we observed that BEAN 2.0 and ANN were the two best-performing tools (Table 4), with BEAN 2.0 outperforming all other tools in terms of Sn and the F1 measure, and ANN achieving the highest ACC and MCC values. Although SIEVE and EffectiveT3 achieved a Sp of 1.000, their Sn values were considerably lower than those obtained by the other tools. Overall, BPBAac performed the worst, with a Sn of 0.205, ACC of 59.1% and MCC of 0.287.

Table 4

T3SE-Prediction performance using the independent test data set

| Model | Sn | Sp | ACC (%) | F1 | MCC |
|---|---|---|---|---|---|
| BEAN 2.0 | **0.659** | 0.864 | 76.1 | **0.707** | 0.534 |
| ANN | 0.568 | 0.977 | **77.3** | 0.655 | **0.598** |
| T3_MM | 0.500 | 0.909 | 70.5 | 0.585 | 0.448 |
| BPBAac | 0.205 | 0.977 | 59.1 | 0.304 | 0.287 |
| SIEVE | 0.205 | **1.000** | 60.2 | 0.305 | 0.338 |
| EffectiveT3 | 0.250 | **1.000** | 62.5 | 0.357 | 0.378 |

Values in bold indicate the best value achieved for the corresponding measure.

Table 5

T4SE-Prediction performance using the independent test data set

| Model | Sn | Sp | ACC (%) | F1 | MCC |
|---|---|---|---|---|---|
| T4Effpred | **0.925** | 0.850 | **88.8** | **0.906** | **0.777** |
| T4SEpre_bpbAac | 0.575 | **0.975** | 77.5 | 0.660 | 0.600 |
| T4SEpre_psAac | 0.525 | **0.975** | 75.0 | 0.618 | 0.560 |
| T4SEpre_joint | 0.050 | **0.975** | 51.2 | 0.09 | 0.066 |

Values in bold indicate the best value achieved for the corresponding measure.

Table 6

T6SE-Prediction performance using the independent test data set

| Model | Sn | Sp | ACC (%) | F1 | MCC |
|---|---|---|---|---|---|
| MIX | **0.333** | 0.668 | 49.9 | **0.400** | 0.002 |
| Altindis et al. [38] | 0.122 | **0.892** | **50.3** | 0.197 | **0.023** |

Values in bold indicate the best value achieved for the corresponding measure.

For the prediction of T4SEs (Table 5), T4Effpred outperformed the other models and achieved the overall best performance with an ACC of 88.8%, F1 of 0.906 and MCC of 0.777. This is not surprising given that the T4Effpred prediction model was trained using a larger training data set than those used by the other tools and took four types of informative features into consideration, including AAC, amino acid pairs and autocovariance-transformed PSSM profiles. Surprisingly, T4SEpre_joint, which was evaluated as the strongest classifier of T4SEpre in the original work [22], exhibited extremely poor performance. One reason may lie in its feature set, which included SS and SA but not the PSSM profile, a powerful component of T4SE prediction [19]. Another potential explanation is that T4SEpre_joint considered features extracted from the C-terminus only, while features of the N-terminus might contain additional contributing information for each sample.

There are currently no computational models specifically developed for T6SE prediction. However, there are two simple sequence motif-based methods for T6SE identification. These use conserved motifs of a T6SE hydrolase (in Altindis et al. [38]) and conserved motifs of Vibrio cholerae VCA0105 homologues (in MIX). These two methods were used as benchmarks for the performance evaluation of T6SE prediction (Table 6). For example, using the motifs in Altindis et al. [38], a motif pattern ‘F[Y|W]P[D]DY[T]’ can be formulated based on regular expressions to search for protein sequences that contain such motifs. The prediction performance of the motif pattern-search methods was unsatisfactory, with an ACC of between 49.9% and 50.3% and F1 < 0.500. These results suggest that motif-based methods alone are not accurate enough to identify T6SEs. This is most likely owing to the high diversity of T6SE sequences and the poor coverage of the motifs. More advanced computational work on T6SE prediction awaits further experimental discoveries of sufficient T6SEs to build suitable training sets.
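A motif search of this kind can be sketched with Python's standard `re` module. The pattern string below is copied verbatim from the text; the function names and the example sequences are hypothetical and serve only to show the mechanics. One detail worth noting is that, inside a character class, `|` is a literal character rather than an alternation operator, so `[Y|W]` matches Y, W or `|`; because `|` never occurs in a protein sequence, this behaves like `[YW]` in practice.

```python
import re

# Motif pattern quoted from the text, compiled as a regular expression.
MOTIF = re.compile(r"F[Y|W]P[D]DY[T]")


def has_motif(sequence):
    """Return True if the motif occurs anywhere in the protein sequence."""
    return MOTIF.search(sequence.upper()) is not None


def predict_effectors(proteins):
    """Return the identifiers of all proteins whose sequence contains the motif.

    `proteins` maps an identifier to its amino acid sequence.
    """
    return [name for name, seq in proteins.items() if has_motif(seq)]


# Made-up sequences for illustration; these are not real T6SEs.
candidates = {
    "protein_1": "MKKLAFYPDDYTGSA",
    "protein_2": "MKKLAFWPDDYTGSA",
    "protein_3": "MKKLAFAPDDYTGSA",
}
hits = predict_effectors(candidates)
```

A classifier of this form labels a protein as an effector only when the exact motif is present, which is consistent with the low sensitivity reported for the motif-based methods in Table 6.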

Ensemble-learning models enhance the prediction of both T3SEs and T4SEs

We examined whether the performance of predicting T3SEs and T4SEs could be further improved by developing ensemble-learning classifiers that integrate the outputs of all predictors. The primary purpose of this investigation was to demonstrate the usefulness of ensemble learning for improving the performance of effector prediction.

Three machine learning algorithms, namely SVMs [56], RF [40] and LR [96], were applied to construct the ensemble models. For SVM, we used the radial basis kernel and a grid search to optimize the cost parameter over {1, 2, …, 10}. For RF, the R package randomForest [70] was used to train the RF model with the optimized mtry parameter and with ntree set to 100. For LR, the model was trained using the R statistical package [94]. Additionally, LR was transformed from linear regression using the following function:

$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

where p(x) indicates the probability of the dependent variable, x refers to an independent variable and β0 and β1 are constants. The above ensemble-learning classifiers used the output of different individual T3SE and T4SE predictors as input features, with their respective performance evaluated via the 5-fold cross-validation test.
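The idea of using the outputs of individual predictors as input features can be illustrated with a small logistic regression model fitted by gradient descent. This is a minimal pure-Python sketch and not the R implementation used for the ensemble models: the functions `fit_logistic` and `predict_proba`, the learning-rate settings and the toy scores of three imaginary base predictors are all assumptions made for illustration, and the 5-fold cross-validation step is omitted.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, row):
    """Probability of the positive class for one row of base-predictor scores."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], row)))


def fit_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit [b0, b1, ..., bm] by batch gradient descent on the log-loss."""
    n, m = len(features), len(features[0])
    weights = [0.0] * (m + 1)
    for _ in range(epochs):
        gradient = [0.0] * (m + 1)
        for row, y in zip(features, labels):
            error = predict_proba(weights, row) - y
            gradient[0] += error
            for j, value in enumerate(row):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Toy data: each row holds the scores of three hypothetical base predictors
# for one protein; labels mark effectors (1) and non-effectors (0).
base_scores = [
    [0.9, 0.8, 0.7], [0.8, 0.4, 0.9], [0.7, 0.9, 0.6], [0.6, 0.7, 0.8],
    [0.2, 0.3, 0.1], [0.3, 0.1, 0.4], [0.1, 0.2, 0.3], [0.4, 0.2, 0.2],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
ensemble = fit_logistic(base_scores, labels)
```

With several inputs, the single-variable function above generalizes to a weighted sum of all base-predictor scores, so the fitted weights indicate how strongly each individual tool contributes to the combined prediction.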

We performed ROC-curve analysis to compare the prediction performance for T3SEs and T4SEs between the three ensemble models and all individual predictors (Figure 5). The three ensemble classifiers consistently outperformed all the individual tools for the prediction of both T3SEs (Figure 5A) and T4SEs (Figure 5B) as measured by the AUC score. Among the three ensemble classifiers, the RF classifier achieved the best performance for both T3SE prediction (with an AUC value of 0.805) and T4SE prediction (with an AUC value of 0.943). Thus, the ensemble predictors exploit the strengths of each of the individual predictors to considerably enhance prediction performance. Integration of individual predictors can serve as a useful strategy for providing stable and accurate predictive performance for the two types of effector proteins. Lastly, the source code associated with these ensemble-learning models can be freely downloaded at http://tbooster.erc.monash.edu/.
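For reference, the AUC can be computed directly from prediction scores without plotting. The sketch below uses the rank-based interpretation of the AUC (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative) and also lists the ROC points obtained by shifting the cut-off threshold. The function names are hypothetical, labels are assumed to be coded as 1 (effector) and 0 (non-effector), and the example scores are made up.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def roc_points(scores, labels):
    """(1 - Sp, Sn) pairs obtained by lowering the cut-off over all observed scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # A protein is predicted positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up scores: two positives ranked above two negatives.
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for test sets of the size used here but would be replaced by a sort-based computation for large data sets.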

Figure 5. ROC-curve analysis of the predictive performance of the three ensemble-learning models as compared with all other individual predictors. (A) Performance comparison between different methods for T3SE prediction using the independent test data set; (B) performance comparison between different methods for T4SE prediction using the independent test data set.

Case study

To examine the scalability and robustness of the reviewed predictors, we performed a case study using experimentally verified examples that were included in neither the training nor the testing data sets. The case studies for T3SEs and T4SEs were conducted separately by submitting the protein sequences to the corresponding web servers or by using stand-alone software. The detailed prediction output from each tool can be found in the Supplementary Data.

The first case study proteins were the E3 ubiquitin-protein ligase SlrP (NCBI ID: 81853756; UniProt ID: Q8ZQQ2) and the T3SS cytotoxic effector BteA (NCBI ID: 633380306). SlrP is a Salmonella T3SE that mimics host cell factors in the ubiquitination pathway, thereby resulting in host-cell death [97]. Most of the existing T3SE predictors, as well as the ensemble-learning models, correctly predicted SlrP as a T3SE. Only EffectiveT3 failed to predict its identity. BteA (Bordetella type 3 secretion system effector A) is a Bordetella T3SE that is a non-apoptotic cytotoxic effector for a wide range of mammalian cells [98]. BPBAac, EffectiveT3 and SIEVE failed to predict BteA as a T3SE, while ANN, BEAN 2.0, T3_MM and the ensemble-learning models correctly predicted BteA as a T3SE.

The second case study proteins were the E3 ubiquitin-protein ligase LubX (UniProt ID: Q5ZRQ0) and the product of the gene Lwal_1306 (UniProt ID: A0A0W1AD05), which is a T4SE secreted by the Dot/Icm T4SS of Legionella waltersii but of unknown cellular function [20]. LubX is a Legionella T4SE that interferes with the host cell ubiquitination pathway, thereby resulting in host-cell death [99]. All existing tools except T4SEpre_joint, as well as the ensemble-learning models, correctly predicted LubX as a T4SE. In the case of Lwal_1306, only T4Effpred and the ensemble-learning models successfully predicted its identity as a T4SE.

These results highlight the inconsistencies in existing prediction tools, and the importance and value of integrating the prediction outputs of individual tools into the ensemble-learning models to obtain reliable T3SE and T4SE predictions.

Conclusion

The biological significance of effector proteins has motivated the development of computational tools that facilitate accurate predictions of T3SEs, T4SEs and T6SEs. The development of such tools enables comprehensive study of host–pathogen interactions as well as characterization of the arsenal of specific effectors delivered in any given scenario of bacterial infection and virulence. In this study, we performed a comprehensive survey, benchmarking the performance of available methods and tools for the prediction of three major types of bacterial effector proteins: T3SEs, T4SEs and T6SEs. Additionally, we reviewed, discussed and assessed all methods in terms of their learning algorithms, feature extraction and selection methods, predictive performance, user-friendliness, applicability and availability as either a web server or stand-alone software. To provide an objective evaluation of the performance, we curated independent test data sets for the three types of effector proteins. According to the independent tests, BEAN 2.0 and ANN were the best-performing tools for T3SE prediction, while T4Effpred was the best-performing tool for T4SE prediction. Our analysis also showed that T6SE prediction remains a challenging task that is still to be addressed; there remains a strong case for the development of specialized models for T6SE prediction. We showed that, by integrating the output of individual predictors, ensemble-learning models using SVM, RF and LR methods outperformed all individual tools as measured by the AUC score. These ensemble methods are now available to the research community and are intended to provide reliable and robust predictive performance for both T3SEs and T4SEs. This study serves as a useful guide for researchers who are particularly interested in using existing tools and in developing new computational methods for effector prediction.
We expect that our proposed methods, along with the increasing availability of experimentally verified data and the advancement of probabilistic learning techniques, will greatly improve the prediction of bacterial effector proteins. The latter will prove invaluable for further investigations of T3SS- and T4SS-mediated pathogenesis and their roles in pathogen–host interactions.

Key Points
  • This work provides a comprehensive review and assessment of currently available bioinformatics tools for the prediction of secreted effectors of bacteria with secretion system types III, IV and VI. We focus on prediction algorithms, prediction performance, feature selection and software utilities.

  • We use extracted motif patterns to assess the performance of simplified predictors for secreted effector proteins of the recently identified type VI secretion system. Our assessment was based on a curated, independent test data set.

  • Performance benchmarks indicate that current tools achieve a relatively satisfying performance for predicting effector proteins of the type III secretion system, while that for the type IV secretion system requires improvement.

  • We propose and built new ensemble models based on support vector machines, random forest and logistic regression to further improve the prediction performance of effector proteins of both the type III and type IV secretion systems. This required the integration of outputs from all individual models. Five-fold cross-validation and independent tests demonstrate that the ensemble models outperform all reviewed predictors of types III and IV secretion systems. Specific test cases are presented.

Yi An is currently a master’s student in the College of Information Engineering, Northwest A&F University, China. As a current visiting student at the Biomedicine Discovery Institute and Department of Microbiology at Monash University, she is undertaking a bioinformatics project focused on computational analysis of bacterial secreted effector proteins. Her research interests include bioinformatics, data mining and web-based information systems.

Jiawei Wang received his master’s degree in School of Electronic and Computer Engineering from Peking University, China. His research interests are bioinformatics, machine learning and data mining.

Chen Li received his PhD degree in Bioinformatics in 2016 from Monash University, Australia. He is currently a postdoctoral research fellow at the Department of Microbiology and Biomedicine Discovery Institute, Monash University, Australia. His research interests focus on systems pharmacology, bioinformatics, systems biology, machine learning and data mining.

André Leier received his PhD in Computer Science (Dr. rer. nat.) from the University of Dortmund, Germany. He conducted postdoctoral research at The Memorial University of Newfoundland, Canada, The University of Queensland, Australia, and ETH Zürich, Switzerland. He is a senior research fellow and independent research scientist at the Okinawa Institute of Science and Technology, Japan. His research interests include computational and systems biology, biomedical informatics and computational medicine.

Tatiana Marquez-Lago received her PhD in Mathematics with distinction from the University of New Mexico in 2006. She conducted postdoctoral research at The University of Queensland, Australia, and ETH Zürich, Switzerland. She is an Assistant Professor and Head of the Integrative Systems Biology Unit at the Okinawa Institute of Science and Technology, Japan. Her research interests include stochastic and multi-scaled models, systems biology, synthetic biology and biomedical informatics.

Jonathan Wilksch received his PhD degree in 2012 from The University of Melbourne, Australia. He is currently a Research Fellow in the Department of Microbiology at Monash University, Australia. His research background and current interests include the mechanisms of bacterial pathogenesis, biofilm formation, gene regulation and host–pathogen interactions.

Yang Zhang received his PhD degree in Computer Science and Engineering in 2015 from Northwestern Polytechnical University, China. He is currently a professor in the College of Information Engineering, Northwest A&F University, China. His research interests are big data analysis, machine learning and data mining.

Geoffrey I. Webb received his PhD degree in 1987 from La Trobe University. He is a professor in the Faculty of Information Technology and director of the Monash Institute for Data Science at Monash University. His research interests include machine learning, data mining, computational biology and user modelling.

Jiangning Song is a senior research fellow in the Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Australia. He is also a Principal Investigator at the Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences. He received his PhD degree in 2005 from Jiangnan University, China and conducted his postdoctoral research at The University of Queensland, Australia and Kyoto University, Japan. His research interests include bioinformatics, systems biology, machine learning, systems pharmacology and enzyme engineering.

Trevor Lithgow received his PhD degree in 1992 from La Trobe University. He is an ARC Australian Laureate Fellow in the Biomedicine Discovery Institute and the Department of Microbiology at Monash University, Australia. His research interests particularly focus on molecular biology, cellular microbiology and bioinformatics. His laboratory develops and deploys multidisciplinary approaches to identify new protein transport machines in bacteria, understand the assembly of protein transport machines and dissect the effects of antimicrobial peptides on antibiotic resistant ‘superbugs’.

Acknowledgements

T.M.L. and A.L. would like to thank the Isaac Newton Institute for Mathematical Sciences.

Funding

This work was supported by the National Health and Medical Research Council of Australia (NHMRC) (1092262) and the Australian Research Council (ARC). G.I.W. is a recipient of the Discovery Outstanding Research Award (DORA) of the Australian Research Council (ARC). T.L. is an ARC Australian Laureate Fellow.

Author notes

Yi An, Jiawei Wang and Chen Li contributed equally to this work.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/about_us/legal/notices)