Abstract

Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial to understand beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach that incorporates sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening of all reviewed bacterial protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.

Introduction

Since the discovery of the beta-lactam penicillin (Fleming, 1928) and its subsequent clinical usage, a growing set of beta-lactam antibiotics have been discovered, developed and prescribed regularly in clinical practice [1, 2]. Unfortunately, the drastic increase in antibiotic usage has led to the rapid spread of antibiotic resistance, which poses a significant global public health threat. Beta-lactamases (BLs) are enzymes expressed by bacteria that hydrolyze beta-lactam antibiotics, giving rise to beta-lactam resistance [3, 4]. The first BL, named penicillinase, was discovered and isolated from Escherichia coli in 1940 by Abraham and Chain, even before penicillin was used clinically [5]. Since then, the number of identified BLs has increased dramatically, with 4326 BLs currently included in the beta-lactamase database (BLDB) [6]. Many of these BLs are plasmid encoded and can be shared across bacterial species by lateral transfer, contributing to the rapid spread of drug resistance [7]. BLs are the main propagators of acquired antibiotic resistance, with multidrug resistance in clinical strains being reported frequently [8]. Increasingly, infections caused by organisms that express extended-spectrum beta-lactamases (ESBLs) [9] are difficult to treat, owing both to challenges in detecting ESBLs and to reporting inconsistencies [10]. The same applies to bacteria that express carbapenemases, a category of BLs that inactivate carbapenems, which are often considered last-resort beta-lactam antibiotics in clinical treatment [10–12]. To make matters worse, a growing number of Enterobacteriaceae, species highly adapted for growth in human tissues, have been shown to express multiple BLs, often of various classes [12, 13]. Accurately classifying BLs is an essential step to guide BL inhibitor design and to deepen our understanding of the mechanisms involved in beta-lactam antibiotic resistance. To that end, two classification schemes are currently used: (i) molecular classification into classes A–D, based on sequence homology [14], or (ii) functional classification into groups 1–4, based on substrate and inhibitor profiles [15]. In this study, we adopt the molecular classification scheme to develop a far-reaching detection tool.

To classify and identify BLs from various organisms after antibiotic susceptibility testing, a variety of BL-specific screening tests can be conducted [13, 16]. However, these methods are resource costly and time consuming. Thus, an alternative solution for identifying BL classes is to use computational approaches [17], which can rapidly identify potential BLs and narrow down BL candidates for experimental validation. Such an approach is feasible due to the rapid accumulation of well-annotated genome sequences. Highly reliable prediction models can be used not only to perform high-throughput screening but also to discover potential BLs that are challenging to identify through sequence similarity searches. Along these lines, it is worth noting that several approaches have been implemented to identify BLs. These approaches can be divided into three major categories:

(i) Knowledge databases: Several web-based BLDBs are available, which provide information resources for BLs. These include BL-specific databases [6, 18, 19], as well as antibiotic multidrug resistance-related databases [20–23], which also include annotations of BLs.

Figure 1. The architecture of the DeepBL methodology. The development of DeepBL involves four major stages: (A) data curation, where all ‘BL’ sequences annotated in the NCBI RefSeq database are extracted and used as the positive samples, whereas 0.015% of all ‘Not BL’ sequences are randomly chosen as the negative samples to constitute the ‘Not-BL’ subset; (B) feature encoding, where the CKSAAP sequence encoding scheme is applied to encode the proteins in the benchmark dataset; (C) model training, where the model architecture is built, model hyperparameters are optimized and training strategies are compared; and (D) performance evaluation, where the performance of DeepBL models is assessed by performing 10-fold cross-validation and independent tests. (E) shows the detailed architecture of the Small VGGNet deep learning framework.

(ii) Sequence similarity-based methods: These methods rest on the general hypothesis that proteins sharing similar sequences usually perform similar biochemical functions. Among sequence similarity-based methods, BLAST [24, 25] and HMMER [26] can be used to identify BL classes by querying a sequence against profiles constructed from known BLs. Subsequently, class- or family-specific conserved motifs can be identified. For example, Srivastava et al. proposed a motif-based BL family prediction method using the MEME/MAST suite to identify family-specific motifs or patterns [23, 27, 28].

(iii) Machine learning-based methods: Several machine learning algorithms have been used to construct prediction models for BL classification, including Bayesian methods [29], the support vector machine (SVM) [30, 31] and the convolutional neural network (CNN) [32].

Despite the progress achieved, there are intrinsic drawbacks and limitations associated with the current methods. These include: (i) sequence similarity-based methods can only annotate a given sequence based on its similarity to already known BL sequences and, as such, cannot identify novel BLs; (ii) the majority of machine learning-based methods [30–32] were developed on datasets containing <4000 BL sequences and <800 negative sequences, which are limited in size and scale; (iii) despite the relatively high performance achieved by these methods, their predictive performance and generalization ability remain to be validated on large-scale independent test datasets; and (iv) a related issue is the selection of a reliable negative dataset. In the reference sequence (RefSeq) database, BL sequences account for only a small proportion (39 495 out of 105 194 720 bacterial sequences, curated in March 2019). As such, it is critical to build a reliable and representative negative BL dataset and develop a robust classification model that identifies BL sequences with a low false-positive rate.

To address these issues, we develop DeepBL, a high-throughput deep learning-based approach for identifying BLs and their classes from protein sequences. DeepBL is developed based on a well-annotated large dataset containing 39 495 BLs extracted from the National Center for Biotechnology Information (NCBI) RefSeq database (downloaded in March 2019) [33, 34]. Here, we formulate the BL prediction task as a multiclass classification problem. We carefully curate a reliable negative BL dataset by extending the number of sequences and assess the model performance by performing undersampling multiple times. We then build the prediction model of DeepBL using a simplified version of the VGGNet deep learning architecture [35], termed the Small VGGNet. We demonstrate that this framework outperforms four other commonly used machine learning algorithms when evaluated on an independent test dataset containing >10 000 sequences. Moreover, we examine the effect of varying sequence redundancy levels on the model performance. We further apply the optimized model of DeepBL to the entire set of reviewed bacterial sequences, identify potential BL sequences and make this computational compendium publicly accessible at the DeepBL website.

Materials and methods

DeepBL is a high-throughput, multiclass classification approach, built on deep learning, for the identification of BLs and their classes from protein sequence data. The architecture of the DeepBL methodology is illustrated in Figure 1. Its development involves four major steps: data curation, feature encoding, model training and performance evaluation. In the first step, data curation, a high-quality dataset was collected from the NCBI RefSeq database [33]. In the second step, the classical feature encoding scheme, composition of k-spaced amino acid pairs (CKSAAP) [36, 37], was employed to extract sequence features. In the third step, a customized neural network, the Small VGGNet [35], was applied to train the prediction model of DeepBL. In the final step, 10-fold cross-validation and independent tests were performed to rigorously evaluate the performance of the DeepBL models in terms of several performance metrics.

Benchmark dataset

To construct a benchmark dataset for training and validating the model of DeepBL, a nonredundant bacterial protein sequence dataset was downloaded from the NCBI RefSeq database [33], yielding >100 million bacterial protein sequences. The RefSeq database [33] provides a comprehensive, nonredundant and well-annotated collection of reference sequences and is periodically augmented with newly published sequence data. The BL sequences were extracted by searching the database with the keyword combination of ‘BL’ and the respective class of A, B, C and D. As BL sequences account for only a small portion of the whole-sequence dataset, we performed a random selection of the not-beta-lactamase (Not-BL) sequences, defined as those without the annotation of ‘BL’. To obtain a reliable negative dataset, we selected sequences from RefSeq without the annotation of ‘BL’ such that the resulting negative datasets had a reasonable scale (i.e. no more than 10 times larger than the smallest BL class) compared with the positive BL sequence dataset. To enable statistical significance tests, we repeated the random selection five times, generating five negative datasets for model training and performance evaluation. The data distribution of the datasets is shown in Figure 2 and Table S1. The sequence redundancy in the datasets was removed with CD-HIT [38] using sequence identity cutoff thresholds ranging from 1.0 to 0.7 with a step size of 0.05. The resulting datasets were used to assess the model performance and compare it with four other machine learning algorithms. In total, 35 datasets clustered at different sequence identity thresholds were obtained. All these datasets can be downloaded from the DeepBL webserver at http://deepbl.erc.monash.edu.au/download.html/.
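As an illustration of this clustering step, the sketch below drives CD-HIT over the seven identity thresholds from Python. It assumes the cd-hit binary [38] is on the PATH, and the FASTA file names are placeholders, not the files used in the study.

```python
# Minimal sketch of the redundancy-removal step, assuming the cd-hit binary
# (ref [38]) is installed and on the PATH; file names are placeholders.
import subprocess

thresholds = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7]

for c in thresholds:
    # CD-HIT requires a word size (-n) matched to the identity cutoff;
    # -n 5 is valid for cutoffs in the 0.7-1.0 range.
    subprocess.run(
        ["cd-hit",
         "-i", "benchmark_full.fasta",        # input dataset (placeholder)
         "-o", f"benchmark_c{c:.2f}.fasta",   # clustered output at cutoff c
         "-c", str(c),                        # sequence identity cutoff
         "-n", "5"],
        check=True,
    )
```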

Figure 2. Number of sequences remaining in the BL datasets after clustering with CD-HIT at sequence identity cutoff thresholds ranging from 0.7 to 1.0.

Composition of k-spaced amino acid pair

In this study, the full-length protein sequences were used as the input to train the deep learning models. To encode sequences of different lengths into numerical matrices of identical size, we used CKSAAP [36, 37, 39, 40] as the sequence encoding scheme. CKSAAP describes the local characteristics of amino acid pairs within a given spacing k along the protein sequence, where k denotes the sequence distance between two amino acids. For the 20 amino acid types, 400 distinct k-spaced amino acid pairs (i.e. AA, AC, AD, …, YY) can be generated. Taking k = 0 as an example, a vector of 400 elements can be calculated from all pairs of adjacent amino acids in the protein sequence. The feature vector can thus be defined as

$$\left(\frac{N_{AA}}{N_{\mathrm{total}}}, \frac{N_{AC}}{N_{\mathrm{total}}}, \frac{N_{AD}}{N_{\mathrm{total}}}, \ldots, \frac{N_{YY}}{N_{\mathrm{total}}}\right)_{400},$$

where $N_{AA}$, $N_{AC}$, $N_{AD}$ and $N_{YY}$ denote the numbers of amino acid pairs AA, AC, AD and YY that appear in the protein sequence, respectively, and $N_{\mathrm{total}}$ equals the sum of $N_{AA}, N_{AC}, N_{AD}, \ldots$ and $N_{YY}$. In this study, we used the iLearn software package to calculate and extract the CKSAAP features [41]. The CKSAAP encoding was performed with k = 0, 1, 2, 3, 4 and 5; accordingly, the dimension of the CKSAAP feature vector is 20 × 20 × 6 = 2400.
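The paper computes these features with the iLearn package [41]; purely for illustration, a simplified re-implementation of the CKSAAP encoding (not the authors' code) might look as follows.

```python
# Simplified re-implementation of the CKSAAP encoding for illustration;
# the study itself uses the iLearn package [41] for this step.
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]  # the 400 amino acid pairs

def cksaap(seq, k_max=5):
    """Return the 2400-dim CKSAAP vector for k = 0..k_max."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
        for i in range(n_total):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:       # skip non-standard residues
                counts[pair] += 1
        features.extend(counts[p] / n_total for p in PAIRS)
    return features

vec = cksaap("MKTIIALSYIFCLVFADYKDDDDK")  # toy sequence
print(len(vec))  # 2400 = 20 x 20 x 6
```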
Figure 3. ROC curves and PR curves on the 10-fold cross-validation test. The AUC values of the ROC curves and PR curves are calculated with average and SD values.

Figure 4. Boxplots of performance results on the 10-fold cross-validation tests. Performance metrics monitored during model training include (A) the loss value and (B) the categorical accuracy, F1-Score and MCC.

Figure 5. ROC and PR curves for the prediction of Classes A, B, C, D and Not-BLs on the independent test datasets.

Figure 6. Confusion matrix of predicted results on the independent test. The matrix represents the distribution of the outputs for each of the five BL classes. Correctly predicted numbers are highlighted and shown on the diagonal.

Figure 7. Relationship between the resulting AUC values and varying sequence identity thresholds on the benchmark datasets.

Figure 8. Boxplots of F1-Score, MCC and ACC on the datasets with varying sequence identity thresholds.

Figure 9. Statistics of proteome-wide prediction of BLs and their respective classes by applying the optimized DeepBL model.

Figure 10. Prediction outputs of the two novel BLs, BKC-1 and PAD-1, in the case study.

Figure 11. Screenshots of the DeepBL webserver. (A) The input page of DeepBL; (B) the submission page of DeepBL and (C) the prediction output interface of DeepBL.

Architecture of the deep learning networks

In view of the dimension and scale of the dataset, we employed the Small VGGNet deep learning framework to train the prediction model. The Small VGGNet is a compact, VGGNet-like neural network derived from the VGGNet architecture [42]. VGGNet was proposed in the Large Scale Visual Recognition Challenge 2014, where it won the subtask of image classification and localization. Since then, many well-established variations of VGGNet have been implemented for specific applications. A VGGNet-like neural network has the following features: (i) 3 × 3 convolutional layers stacked on top of one another with increasing network depth; (ii) max-pooling layers to reduce the volume size and (iii) a fully connected layer preceding the SoftMax or sigmoid layer at the end of the network. For our network architecture, five convolution layers were adopted along with ReLU activations, max-pooling layers, batch-normalization layers and dropout layers, as shown in Figure 1E. Furthermore, the Small VGGNet was customized to fit our input feature dimensions and the five-class classification problem. Lastly, the model architecture was implemented in Python using the Keras library [43] with the TensorFlow framework [44] as the backend.
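To make the architecture concrete, below is a minimal Keras sketch of a Small VGGNet-style classifier for this task. It is illustrative rather than the authors' exact implementation: the reshaping of the 2400-dimensional CKSAAP vector into a 20 × 20 × 6 input, the filter counts and the dense-layer width are assumptions.

```python
# Minimal Small VGGNet-style sketch, assuming the 2400-dim CKSAAP vector is
# reshaped to a 20 x 20 x 6 "image"; filter counts are illustrative.
from tensorflow.keras import layers, models

def build_small_vggnet(input_shape=(20, 20, 6), n_classes=5):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Block 1: a single 3x3 conv, VGG-style
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        # Block 2: two stacked 3x3 convs with increased depth
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        # Block 3: two more stacked convs (five conv layers in total)
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.25),
        # Fully connected head followed by a softmax over the five classes
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",  # Adam's default learning rate is 1e-3
                  loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])
    return model
```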

Model training

Two different model training strategies were adopted with Keras [43] based on the curated benchmark dataset: (i) model training for 10-fold cross-validation and (ii) model training for the independent test. With 10-fold cross-validation, the model performance could be thoroughly evaluated on the given benchmark dataset, mitigating the bias caused by any single data split. With the independent test, the larger independent test dataset could be used to evaluate the generalization of the model trained on the training dataset. To avoid potential bias, we repeated the model training 10 times and calculated the mean and SD of the resulting performance. These two training strategies shared the same hyperparameters, including the number of epochs, batch size, learning rate, dropout rate and validation ratio (with corresponding values 100, 32, 1e−3, 0.5 and 0.2). To avoid model overfitting, an early stopping setting and dropout layers were used during model training. With these hyperparameters and settings, the model loss values and accuracies on both the training and validation datasets gradually converged and stabilized, as shown in Figure S1. The final optimal hyperparameters were selected based on the F1-score and accuracy on the validation dataset and used for performance evaluation and comparison.
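The following is a hedged sketch of this training setup; the early-stopping monitor and patience are assumptions not stated in the text, and random arrays stand in for the real CKSAAP features and one-hot class labels.

```python
# Hedged sketch of the training configuration; monitor/patience values are
# assumptions, and the data below are random placeholders.
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(1000, 20, 20, 6)           # placeholder CKSAAP "images"
y = np.eye(5)[np.random.randint(0, 5, 1000)]  # placeholder one-hot labels

model = build_small_vggnet()                  # from the previous sketch
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
model.fit(X, y,
          epochs=100,            # hyperparameters reported in the text:
          batch_size=32,         # 100 epochs, batch size 32, learning rate
          validation_split=0.2,  # 1e-3, dropout 0.5, validation ratio 0.2
          callbacks=[early_stop])
```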

Performance evaluation metrics

To objectively evaluate the model performance, several commonly used metrics are employed, including the overall accuracy (ACC), Matthews correlation coefficient (MCC), F1-score, the receiver-operating characteristic (ROC) curve and the precision–recall (PR) curve, with the corresponding area under the ROC curve (AUC) and area under the PR curve (AUPRC) values. These performance metrics are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$\mathrm{F1} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

where TP denotes true positives; TN, true negatives; FP, false positives; and FN, false negatives. For all the metrics defined above, higher values indicate better performance. As the dataset is imbalanced, MCC and F1-Score are considered more informative performance measures than ACC.

In particular, the ROC and PR curves are plotted for both the 10-fold cross-validation and the independent test. The corresponding AUC values are calculated as an essential metric to evaluate the performance of the trained models and to compare different methods.
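For illustration, per-class metrics of this kind can be computed with scikit-learn [46] as sketched below; the labels and probability scores here are random placeholders rather than DeepBL outputs.

```python
# Illustrative computation of the reported metrics with scikit-learn [46];
# y_true/y_score are random placeholders for one-hot labels and predicted
# class probabilities over the five classes.
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, f1_score,
                             roc_auc_score, average_precision_score)

y_true = np.eye(5)[np.random.randint(0, 5, 200)]   # placeholder labels
y_score = np.random.dirichlet(np.ones(5), 200)     # placeholder probabilities
labels = y_true.argmax(axis=1)
y_pred = y_score.argmax(axis=1)

print("ACC:", accuracy_score(labels, y_pred))
print("MCC:", matthews_corrcoef(labels, y_pred))
print("F1 :", f1_score(labels, y_pred, average="macro"))
# Per-class AUC and AUPRC, one-vs-rest, as plotted in the ROC/PR figures
for c in range(5):
    print(f"class {c}: AUC={roc_auc_score(y_true[:, c], y_score[:, c]):.3f}",
          f"AUPRC={average_precision_score(y_true[:, c], y_score[:, c]):.3f}")
```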

Performance evaluation at different sequence identity levels

When using conventional machine learning algorithms, sequence redundancy removal plays an essential role in sequence analysis, and the sequence identity threshold chosen to remove redundancy influences the model performance. After highly similar sequences are removed from the benchmark datasets, classifiers are trained on the remaining representative sequences, which can enhance the robustness and generalization of the trained model and reduce the risk of overfitting. Traditional machine learning algorithms can often achieve relatively good performance even on relatively small training datasets. Deep learning algorithms, by contrast, typically rely on large amounts of data to achieve good performance; training deep learning models on limited data may lead to overfitting, and in some scenarios data augmentation is employed to generate additional similar training samples to enhance the model performance [45]. Moreover, a series of mechanisms are in place to mitigate overfitting in deep learning, including the dropout and pooling layers. In this study, to evaluate the effect of sequence redundancy removal on the performance of deep learning models, we benchmarked the model performance on datasets clustered at varying sequence identity thresholds with the CD-HIT program.

Performance comparison with different machine learning algorithms

To illustrate the effectiveness of DeepBL based on the Small VGGNet architecture, we compared the performance of DeepBL with that of four other popular machine learning algorithms, namely Naïve Bayes (NB), Logistic Regression (LR), SVM and Random Forest (RF), using the same dataset split (training dataset: 80%; independent test dataset: 20%) and the same feature encoding scheme (CKSAAP). In particular, the ‘one versus the rest’ strategy was used to train the multiclass classification models for the four conventional machine learning algorithms, which were implemented using the Scikit-learn package [46] in Python.
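A minimal sketch of this comparison protocol, under the ‘one versus the rest’ strategy, is given below. The hyperparameters are scikit-learn defaults rather than values tuned in the study, and the feature matrix and labels are random placeholders.

```python
# Sketch of the baseline comparison under the 'one versus the rest' strategy;
# hyperparameters are scikit-learn defaults, and X/y are placeholders for the
# CKSAAP features and the five class labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, matthews_corrcoef

X = np.random.rand(500, 2400)        # placeholder CKSAAP feature matrix
y = np.random.randint(0, 5, 500)     # placeholder class labels (0-4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

baselines = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100),
}
for name, clf in baselines.items():
    pred = OneVsRestClassifier(clf).fit(X_tr, y_tr).predict(X_te)
    print(name,
          "F1:", f1_score(y_te, pred, average="macro"),
          "MCC:", matthews_corrcoef(y_te, pred))
```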

Results and discussion

Performance evaluation on 10-fold cross-validation

We evaluated the performance of DeepBL using 10-fold cross-validation on the benchmark datasets. The mean ROC and PR curves on the 10-fold cross-validation test are displayed in Figure 3. As can be seen, DeepBL achieved remarkable performance, with AUC scores of >0.99 for all five classes. Among the five classes, DeepBL performed best when predicting Class B, with an AUC-ROC score of 0.999 ± 0.0 and an AUPRC score of 0.997 ± 0.001.

In Figure 4A, we also show the categorical cross-entropy loss values of model training on the 10-fold cross-validation test; the loss values range from 0.14 to 0.26. The performance results in terms of ACC, MCC and F1-Score are shown as boxplots in Figure 4B. Altogether, these results demonstrate that DeepBL achieved high performance on the 10-fold cross-validation test when evaluated on the whole curated benchmark dataset.

Performance comparison between DeepBL and the other algorithms

To illustrate the effectiveness and robustness of DeepBL, we further evaluated the predictive performance of the DeepBL models trained with the optimally tuned hyperparameters on the independent test dataset. The corresponding ROC and PR curves of DeepBL on the independent test are shown in Figure 5. Again, these results clearly show that DeepBL achieved remarkable performance. Figure 5 also shows that Class B and Class D obtained the best performance, each with an AUC value of 1.0 and an AUPRC value of 0.999, followed by the Not-BL class with an AUC value of 0.996 and an AUPRC value of 0.995. In addition, Class A and Class C achieved AUC values of 0.991 and 0.989 and AUPRC values of 0.986 and 0.932, respectively.

Figure 6 shows the normalized confusion matrix for the prediction of the five classes, which further demonstrates the outstanding performance achieved by DeepBL. Moreover, we also benchmarked the performance of DeepBL against four other commonly used machine learning algorithms (i.e. SVM, RF, NB and LR). The ROC curves of these four algorithms are provided in Figures S2–S5. Altogether, the performance comparison shows that DeepBL clearly outperformed these conventional machine learning algorithms.

Taken together, we attribute the remarkable performance of DeepBL to two primary reasons: (i) compared with conventional machine learning algorithms, the CNN employed by DeepBL offers the advantage of automatically extracting useful features from the input data, without the need for complex feature engineering and extraction, and (ii) the class imbalance and sequence redundancy in the benchmark dataset can significantly degrade the performance of conventional machine learning algorithms, whereas in the deep learning framework of DeepBL, multiple strategies are in place during model training to avoid overfitting, including the dropout layers, the max-pooling layers and the early stopping strategy. Furthermore, the use of large-scale training data also helps to improve the generalization and robustness of the deep learning model of DeepBL.

Effect of sequence identity level on the predictive performance

To evaluate the potential effect of sequence redundancy removal on the model performance, we generated benchmark datasets at seven different sequence identity thresholds (ranging from 0.7 to 1.0). A global test dataset was extracted from the dataset after sequence redundancy removal at the threshold of 0.7. To reduce bias when comparing model performance across different sequence identities, we further removed the sequence redundancy between each training dataset and the global test dataset at the threshold of 0.7. In total, 35 training datasets and one global test dataset were prepared. We then trained models on the 35 training datasets and tested their performance on the global test dataset. During this procedure, we trained the models 10 times and calculated the average AUC, F1-Score, MCC and accuracy; the results are shown in Figures 7 and 8 and Figures S6–S13. As shown in Figures 7 and 8, the model performance, as expected, tends to increase with larger sequence identity thresholds. In Figure 7, the models trained on datasets without any sequence redundancy removal achieved the overall best performance across all five classes, with AUCs ranging from 0.985 to 1.0 depending on the particular BL class. By comparison, the models trained on datasets with the sequence identity threshold of 0.7 achieved the lowest performance; the largest performance difference was >0.02, observed for the prediction of Class C. We also observed that nearly all models achieved stable AUCs. In addition, only 40% of the Class-C sequences were retained after sequence redundancy removal at the identity cutoff of 0.95 (Figure 2). This indicates that the Class-C BL sequences had the highest homology levels, which may explain the relatively larger variation in AUC for Class-C prediction compared with the other classes.

As shown in Figure 8, the DeepBL models trained on the datasets without any sequence redundancy removal achieved the highest F1-Score, MCC and ACC, with mean values from 0.93 to 0.98, whereas the models trained on the datasets with sequence redundancy removed at the identity threshold of 0.7 attained the worst performance, with mean values of 0.88–0.96. The model performance was relatively stable with respect to the three metrics. Altogether, these results indicate that researchers should exercise caution regarding the necessity of sequence redundancy removal, especially when training deep learning-based models. Before carrying out sequence redundancy removal, the need to do so should be carefully evaluated in light of the research task itself.

In an effort to computationally characterize a complete repertoire of novel putative BLs, we further applied the optimized DeepBL model to perform proteome-wide screening of bacterial proteome data. We downloaded all reviewed bacterial protein sequences from the UniProt database, which contained 334 009 sequences in the release dated 16 May 2019. As a result, DeepBL identified 2876 Class-A, 665 Class-B, 335 Class-C and 231 Class-D BLs. These results are visualized in Figure 9. A full list of all predicted BL sequences is available at the DeepBL webserver (http://deepbl.erc.monash.edu.au/download/proteomics_scan/screening_results.zip).
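As an illustration of how such a proteome-wide scan can be assembled, the sketch below encodes each UniProt sequence with CKSAAP and keeps those not predicted as Not-BL. The file name, output class ordering and input reshaping convention are assumptions, and cksaap/model refer to the earlier sketches.

```python
# Hypothetical sketch of the proteome-wide scan; the FASTA file name, class
# ordering and reshape convention are assumptions, and cksaap()/model come
# from the earlier sketches in this article.
import numpy as np

def read_fasta(path):
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header:
            yield header, "".join(seq)

classes = ["Class-A", "Class-B", "Class-C", "Class-D", "Not-BL"]  # assumed order
for header, seq in read_fasta("uniprot_reviewed_bacteria.fasta"):
    x = np.asarray(cksaap(seq)).reshape(1, 20, 20, 6)  # assumed reshape
    probs = model.predict(x, verbose=0)[0]
    if classes[probs.argmax()] != "Not-BL":
        print(header, classes[probs.argmax()], probs.max())
```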

Case study

To further examine the capability of DeepBL in identifying novel BLs, we performed an independent case study by predicting two rarely reported BLs that were excluded from the training dataset of our DeepBL model. The first BL, BKC-1 (accession number: AKD43328.1), was reported as a novel BL from a Klebsiella pneumoniae clinical isolate and characterized as a carbapenemase [47]. Purified BKC-1 hydrolyzes not only carbapenems but also penicillins, cephalosporins and monobactams. DeepBL successfully predicted BKC-1 as a Class-A BL with a predicted probability score of 0.9996. The second BL, PAD-1 (accession number: KXF74838.1), was isolated from the desert soil bacterium Paramesorhizobium deserti and classified as a novel Class-A environmental carbapenemase with an unusual substrate profile [48]. DeepBL correctly predicted this BL and its class with a predicted probability score of 0.9997. The prediction outputs for these two case-study BLs are provided in Figure 10. Together, these results highlight the capability of DeepBL to identify and discover novel BLs from sequence information alone.

Webserver and user guide

To facilitate experimental researchers’ work, the DeepBL webserver has been developed and is freely available at http://deepbl.erc.monash.edu.au/. The user interface of DeepBL is shown in Figure 11. The DeepBL server was mainly implemented in Python on top of Apache and is deployed on Ubuntu on a 4-core server with 32 GB memory and a 1 TB hard disk. The server uses the optimal model to identify BL sequences in the submitted tasks. According to our benchmarking test, DeepBL can complete the prediction of 100 protein sequences in approximately 10 s, presumably owing to the efficient feature encoding scheme and the compact Small VGGNet deep learning architecture. Moreover, the curated benchmark datasets and the bacterial proteome-wide prediction results can also be downloaded from the DeepBL webserver.

Researchers can submit bacterial protein sequences as the input. Upon job submission, the DeepBL webserver processes the submitted tasks and returns the predicted BL protein sequences and their class information using the optimized deep learning model. The prediction results can be visualized directly within the webserver, and the results with all predicted scores can be downloaded for users’ follow-up analyses. The DeepBL webserver will be maintained and made publicly available for at least 5 years, and we plan to regularly update the model and webserver as more sequence data become available. We anticipate that DeepBL will serve as an indispensable tool for the in silico discovery of novel BLs.

Conclusion

In this study, we have proposed DeepBL, a powerful deep learning-based approach for identifying BLs and their corresponding classes from sequence data. Specifically, DeepBL was developed using the Small VGGNet architecture, based on a large-scale, reliably curated dataset extracted from the NCBI RefSeq database. Through extensive benchmarking experiments, we demonstrated the generalization and robustness of DeepBL via strict and comprehensive performance assessments on both 10-fold cross-validation and independent tests. Built on a deep CNN architecture, the performance of DeepBL benefitted from the use of large-scale datasets and was less influenced by data imbalance and sequence redundancy. By comparing models trained on datasets with sequence redundancy removed at different identity thresholds, we obtained the best-performing DeepBL model on the dataset without any sequence redundancy removal. In terms of feature encoding, we used only CKSAAP, as it converts sequences of any length into features of the same size. We further applied the optimized model to perform in silico screening and construct a whole-proteome catalogue of BLs. DeepBL is a generalized and useful tool for BL identification, presumably owing to the use of significantly enlarged benchmark datasets. We anticipate that DeepBL will be a valuable tool for the wider research community and will shed light on the identification and functional annotation of putative BLs and their classes in the future.

Key Points
  • Accurate identification of beta-lactamases (BLs) and their specific subclasses is an essential step to guide BL inhibitor design and improve our understanding of the mechanisms involved in beta-lactam antibiotic resistance.

  • We present DeepBL, a deep learning-based approach that incorporates sequence-derived features to enable high-throughput prediction of BLs.

  • Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library.

  • The performance of DeepBL models is assessed with respect to the sequence redundancy level and negative sample selection in the benchmark datasets.

  • Using the optimized DeepBL model, we perform proteome-wide screening of all reviewed bacterial protein sequences available from UniProt and accordingly construct a whole-proteome catalogue of putative BLs and their subclasses.

  • These proteome-wide prediction results and the DeepBL webserver are freely available at http://deepbl.erc.monash.edu.au/.

Conflict of Interest

The authors declare no conflict of interest.

Funding

National Health and Medical Research Council of Australia (NHMRC) (APP1127948 and APP1144652); the Australian Research Council (ARC) (DP120104460); the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965); a Postgraduate Scholarship from the Monash–Newcastle Alliance; a Major Inter-Disciplinary Research (IDR) project awarded by Monash University and the Collaborative Research Program of Institute for Chemical Research, Kyoto University; Informatics Institute of the School of Medicine at UAB (to T.T.M.L. and A.L.). T.L. is an ARC Laureate Fellow.

Availability and implementation

To facilitate rapid identification of potential beta-lactamases from protein or genome sequences and widespread use by the research community, a user-friendly webserver and the curated datasets used in this study have been made publicly available at http://DeepBL.erc.monash.edu.au/.

Yanan Wang is currently a PhD candidate in the Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology at Monash University, Australia. He received his Bachelor degree in communication engineering from the University of Shanghai for Science and Technology and his Master degree in control science and technology from Shanghai Jiao Tong University, China. His research interests are bioinformatics, machine learning and data mining.

Fuyi Li received his PhD in Bioinformatics from Monash University, Australia. He is currently a research fellow in the Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia. His research interests are bioinformatics, computational biology, machine learning and data mining.

Manasa Bharathwaj is currently a PhD candidate in the Department of Microbiology at the Biomedicine Discovery Institute, Monash University, Australia. She received her Bachelor degree in Biotechnology (Engineering) from SASTRA University, India, and her Masters by Research degree in Infectious diseases from The University of Edinburgh, UK. Her research interests include antibiotic resistance, bacterial protein transport and transcriptional regulation.

Natalia C. Rosas is currently a PhD candidate in the Department of Microbiology at the Biomedicine Discovery Institute, Monash University, Australia. She gained her Bachelor degree in bacteriology and clinical laboratory from the Universidad del Valle, Colombia, and her Master’s degree in Biotechnology from the University of Melbourne, Australia. Her experience and interests are in microbiology, molecular biology and antimicrobial resistance.

André Leier is currently an assistant professor in the Department of Genetics and the Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham (UAB) School of Medicine, USA. He is also an associate scientist in the UAB Comprehensive Cancer Center. He received his PhD in Computer Science (Dr. rer. nat.), University of Dortmund, Germany. He conducted postdoctoral research at Memorial University of Newfoundland, Canada, The University of Queensland, Australia and ETH Zürich, Switzerland. His research interests are in biomedical informatics and computational and systems biomedicine.

Tatsuya Akutsu received his DEng degree in Information Engineering in 1989 from University of Tokyo, Japan. Since 2001, he has been a professor in the Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan. His research interests include bioinformatics and discrete algorithms.

Geoffrey I. Webb received his PhD degree in 1987 from La Trobe University, Australia. He is a professor in the Faculty of Information Technology and director of the Monash Centre for Data Science at Monash University. His research interests include machine learning, data mining, computational biology and user modeling.

Tatiana T. Marquez-Lago is an associate professor in the Department of Genetics and the Department of Cell, Developmental and Integrative Biology, UAB School of Medicine, USA. Her research interests include multiscale modeling and simulations, artificial intelligence, bioengineering and systems biomedicine. Her interdisciplinary lab studies stochastic gene expression, chromatin organization, antibiotic resistance in bacteria and host–microbiota interactions in complex diseases.

Jian Li is a professor and group leader in the Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Australia. He is a Web of Science 2015–2017 Highly Cited Researcher in Pharmacology & Toxicology. He is currently an NHMRC Principal Research Fellow. His research interests include the pharmacology of polymyxins and the discovery of novel, safer polymyxins.

Trevor Lithgow is a professor in the Department of Microbiology at Monash University, Australia, and Director of the Centre to Impact AMR. He received his PhD degree in 1992 from La Trobe University. His research interests particularly focus on bacterial molecular cell biology and bioinformatics. His lab develops and deploys multidisciplinary approaches including comparative genomics to understand and image the assembly of proteins into bacterial outer membranes, and to discover and characterize the activity of bacteriophages that kill antibiotic-resistant ‘superbugs’.

Jiangning Song is an associate professor and group leader in the Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia. He is a member of the Monash Centre for Data Science, Faculty of Information Technology and an associate investigator of the ARC Centre of Excellence in Advanced Molecular Imaging, Monash University. His research interests include bioinformatics, computational biology, machine learning, data analytics and pattern recognition.

References

1. Drawz SM, Bonomo RA. Three decades of β-lactamase inhibitors. Clin Microbiol Rev 2010;23:160–201.
2. Demain AL, Blander RP. The β-lactam antibiotics: past, present, and future. Antonie van Leeuwenhoek (Int J Gen Mol Microbiol) 1999;75:5–19.
3. Fisher JF, Knowles JR. Bacterial resistance to β-lactams: the β-lactamases. Annu Rep Med Chem 1978;13:239–48.
4. Bush K. Past and present perspectives on β-lactamases. Antimicrob Agents Chemother 2018;62:e01076-18.
5. Bush K, Bradford PA. β-Lactams and β-lactamase inhibitors: an overview. Cold Spring Harb Perspect Med 2016;6:a025247.
6. Naas T, Oueslati S, Bonnin RA, et al. Beta-lactamase database (BLDB) – structure and function. J Enzyme Inhib Med Chem 2017;32:917–9.
7. Saunders JR, Hart CA, Saunders VA. Plasmid-mediated resistance to β-lactam antibiotics in gram-negative bacteria: the role of in-vivo recyclization reactions in plasmid evolution. J Antimicrob Chemother 1986;18:57–66.
8. Alekshun MN, Levy SB. Molecular mechanisms of antibacterial multidrug resistance. Cell 2007;128:1037–50.
9. Paterson DL, Bonomo RA. Extended-spectrum β-lactamases: a clinical update. Clin Microbiol Rev 2005;18:657–86.
10. Steward CD, Wallace D, Hubert SK, et al. Ability of laboratories to detect emerging antimicrobial resistance in nosocomial pathogens: a survey of Project ICARE laboratories. Diagn Microbiol Infect Dis 2000;38:59–67.
11. Queenan AM, Bush K. Carbapenemases: the versatile β-lactamases. Clin Microbiol Rev 2007;20:440–58.
12. Meletis G. Carbapenem resistance: overview of the problem and future perspectives. Ther Adv Infect Dis 2016;3:15–21.
13. Livermore DM, Winstanley TG, Shannon KP. Interpretative reading: recognizing the unusual and inferring resistance mechanisms from resistance phenotypes. J Antimicrob Chemother 2001;48:87–102.
14. Ambler RP, Coulson AFW, Frere JM, et al. A standard numbering scheme for the class A β-lactamases. Biochem J 1991;276:269–70.
15. Bulychev A, Bellettini JR, O'Brien M, et al. N-Sulfonyloxy-β-lactam inhibitors for β-lactamases. Tetrahedron 2000;56:5719–28.
16. Sharma S, Ramnani P, Virdi JS. Detection and assay of β-lactamases in clinical and non-clinical strains of Yersinia enterocolitica biovar 1A. J Antimicrob Chemother 2004;54:401–5.
17. Moradigaravand D, Palm M, Farewell A, et al. Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. PLoS Comput Biol 2018;14:1–17.
18. Danishuddin M, Baig MH, Kaushal L, et al. BLAD: a comprehensive database of widely circulated beta-lactamases. Bioinformatics 2013;29:2515–6.
19. Thai QK, Bös F, Pleiss J. The lactamase engineering database: a critical survey of TEM sequences in public databases. BMC Genomics 2009;10:390.
20. Liu B, Pop M. ARDB – antibiotic resistance genes database. Nucleic Acids Res 2009;37:D443–7.
21. McArthur AG, Waglechner N, Nizam F, et al. The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 2013;57:3348–57.
22. Jia B, Raphenya AR, Alcock B, et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 2017;45:D566–73.
23. Srivastava A, Singhal N, Goel M, et al. CBMAR: a comprehensive β-lactamase molecular annotation resource. Database (Oxford) 2014;bau111.
24. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
25. Camacho C, Coulouris G, Avagyan V, et al. BLAST+: architecture and applications. BMC Bioinformatics 2009;10:421.
26. Potter SC, Luciani A, Eddy SR, et al. HMMER web server: 2018 update. Nucleic Acids Res 2018;46:W200–4.
27. Srivastava A, Singhal N, Goel M, et al. Identification of family specific fingerprints in β-lactamase families. Sci World J 2014;2014:980572.
28. Singh R, Saxena A, Singh H. Identification of group specific motifs in beta-lactamase family of proteins. J Biomed Sci 2009;16:109.
29. Nath A, Karthikeyan S. Enhanced identification of β-lactamases and its classes using sequence, physicochemical and evolutionary information with sequence feature characterization of the classes. Comput Biol Chem 2017;68:29–38.
30. Kumar R, Srivastava A, Kumari B, et al. Prediction of β-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine. J Theor Biol 2015;365:96–103.
31. Srivastava A, Kumar R, Kumar M. BlaPred: predicting and classifying β-lactamase using a 3-tier prediction system via Chou's general PseAAC. J Theor Biol 2018;457:29–36.
32. White C, Ismail HD, Saigo H, et al. CNN-BLPred: a convolutional neural network based predictor for β-lactamases (BL) and their classes. BMC Bioinformatics 2017;18:577.
33. O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.
34. Sayers EW, Agarwala R, Bolton EE, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2019;47:D23–8.
35. Cheng C, Parhi KK. Fast 2D convolution algorithms for convolutional neural networks. IEEE Trans Circuits Syst I Regul Pap 2020;67:1678–91.
36. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502.
37. Chen Z, Zhou Y, Song J, et al. HCKSAAP-UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Biochim Biophys Acta 2013;1834:1461–7.
38. Li W, Godzik A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
39. Chen Z, Zhou Y, Zhang Z, et al. Towards more accurate prediction of ubiquitination sites: a comprehensive review of current methods, tools and features. Brief Bioinform 2014;16:640–57.
40. Chen Z, Liu X, Li F, et al. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites. Brief Bioinform 2019;20:2267–90.
41. Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21:1047–57.
42. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings, 2015, 1–14.
43. Briggs FH, Bell JF, Kesteven MJ. Removing radio interference from contaminated astronomical spectra using an independent reference signal and closure relations. Astron J 2000;120:3351–61.
44. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint 2016;arXiv:1603.04467.
45. Taylor L, Nitschke G. Improving deep learning using generic data augmentation. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 2018, 1542–7.
46. Chakraborty A, Ghosh S, Mukhopadhyay P, et al. Trapping effect analysis of AlGaN/InGaN/GaN heterostructure by conductance frequency measurement. MRS Proc 2014;XXXIII 2:81–7.
47. Nicoletti AG, Marcondes MFM, Martins WMBS, et al. Characterization of BKC-1 class A carbapenemase from Klebsiella pneumoniae clinical isolates in Brazil. Antimicrob Agents Chemother 2015;59:5159–64.
48. Lv R, Guo J, Yan YF, et al. Characterization of a novel class A carbapenemase PAD-1 from Paramesorhizobium deserti A-3-ET, a strain highly resistant to β-lactam antibiotics. Sci Rep 2017;7:1–9.
