Abstract

Metagenomic analyses facilitate the exploration of the microbial world, advancing our understanding of microbial roles in ecological and biological processes. A pivotal aspect of metagenomic analysis involves assessing the quality of metagenome-assembled genomes (MAGs), crucial for accurate biological insights. Current machine learning–based methods often treat completeness and contamination prediction as separate tasks, overlooking their inherent relationship and limiting models’ generalization. In this study, we present DeepCheck, a multitasking deep learning framework for simultaneous prediction of MAG completeness and contamination. DeepCheck consistently outperforms existing tools in accuracy across various experimental settings and demonstrates comparable speed while maintaining high predictive accuracy even for new lineages. Additionally, we employ interpretable machine learning techniques to identify specific genes and pathways that drive the model’s predictions, enabling independent investigation and assessment of these biological elements for deeper insights.

Introduction

Since the recovery of the first metagenome-assembled genome (MAG) in 2004 [1], thousands of MAGs have been recovered, unveiling vital functions of bacteria and archaea within global ecosystems [1–3]. Analyzing metagenomic data spanning various microbial species reveals microbial diversity and functional potential, aiding in the comprehension of ecosystem complexity, addressing environmental challenges, and shedding light on disease mechanisms [4]. Moreover, this approach can be extensively applied in the biotechnology sector, including wastewater treatment, food safety monitoring, and the development of novel pharmaceuticals, thereby accelerating the exploration and utilization of microbial resources [5, 6].

Evaluating the quality of MAGs is pivotal within metagenome analysis [7]. The main goal of this evaluation is to furnish completeness and contamination annotations for each queried genome [2]. With the development of sequencing technologies as well as assembly tools, hundreds of thousands of highly diverse MAGs from metagenomic data have been recovered, rendering manual evaluation of genome quality unfeasible. In this context, the urgency of developing fast and accurate tools for assessing assembly quality has become paramount.

Current tools for evaluating the quality of MAGs can be broadly categorized into two types based on their computational strategies: (i) single-copy gene methods, for example, CheckM version 1 (hereafter CheckM1) [8] and Benchmarking Universal Single-Copy Orthologs (BUSCO) [9]. This strategy relies on comparative genomics to identify lineage-specific marker gene sets and predicts the completeness and contamination of a recovered MAG by analyzing the presence, absence, and copy number of these markers; (ii) machine learning–based methods, for example, CheckM version 2 (hereafter CheckM2) [10]. This strategy predicts the quality of MAGs by learning relationships between genomic features and the completeness and contamination of MAGs.

The reported literature indicates that machine learning–based methods are more accurate and faster than single-copy marker gene-based methods [10]. Single-copy marker gene-based methods are limited by their over-reliance on single-copy genes. Specifically, these methods excel in predicting the quality of lineages endowed with high-quality marker genes; they tend to be less effective when dealing with lineages possessing incomplete marker genes, especially when assessing the quality of new or less-studied lineages. In contrast, machine learning–based approaches do not rely on predefined sets of marker genes. They can incorporate additional genomic information for model learning, such as multicopy genes, biological pathways, modules, and other features like amino acid counts. In addition, machine learning–based methods offer the advantage of quickly updating their training databases with new high-quality reference genomes, even those representing taxa with only a single genome available.

To our knowledge, CheckM2 is currently the only machine learning–based algorithm for assessing the quality of MAGs. Compared to its predecessor, CheckM1, CheckM2 shows significant improvements in prediction accuracy and speed, and it has been widely used in the analysis of metagenomic data since its release. Despite its success, CheckM2 has two primary limitations: first, it employs two separate models to predict the completeness and contamination of MAGs, thus overlooking the inherent correlation between these two tasks; second, it lacks more sophisticated feature extraction techniques capable of discerning features specific to completeness and contamination from the genomic data provided [10].

In recent years, the rapid development of deep learning has led to a wide range of applications in biological research, including drug discovery [11–15], drug–drug interactions [16, 17], and protein structure prediction [18, 19]. Here, we introduce DeepCheck, a multitask deep learning [20] framework for simultaneous prediction of the completeness and contamination of MAGs. Specifically, in the DeepCheck framework, we use the residual network structure [21] and the attention mechanism [22, 23] to construct a feature extractor that extracts task-specific features from the input genomic information. For model training, we use both the completeness and contamination levels of the simulated genomes for parameter learning. In addition, we construct an interpreter [24, 25] that detects the specific genes and pathways on which the model bases its predictions, so that these elements can subsequently be investigated and assessed independently. We conducted extensive comparisons of DeepCheck and benchmark methods in various experimental scenarios. Our experimental studies show that DeepCheck outperforms the benchmark methods in the accuracy of predicting MAG completeness and contamination. Therefore, we anticipate that DeepCheck can emerge as a prominent tool for reliably assessing the quality of microbial genomes, offering heightened confidence when drawing biological inferences from MAGs.

In summary, DeepCheck offers significant contributions to the field of MAG analysis: (i) it introduces a novel deep learning framework designed to simultaneously predict microbial genome quality in terms of completeness and contamination. This framework leverages the correlation between completeness and contamination to enhance prediction performance. (ii) It constructs an interpreter that identifies specific genes and pathways upon which the model bases its predictions, enabling these elements to be independently investigated and assessed.

Materials and methods

Benchmark datasets

Simulated training dataset

In this study, we collected raw genomes from RefSeq release 202 [26] annotated as “complete” or “chromosome” to simulate the benchmark datasets with a known level of completeness and contamination. The specific calculations are detailed below:

  • Step 1: We employed the fastANI software package [27] to identify genomes that shared 99% average nucleotide identity with the raw genomes.

  • Step 2: Subsequently, we utilized Prodigal v.2.6.3 [28] to predict genes and proteins from the identified genomes.

  • Step 3: To simulate a range of completeness values from 5% to 100% at 5% intervals, we conducted protein sampling on the predicted proteins using the BBMap suite release 38.18 [29]. Furthermore, we conducted multiple samplings of the same predicted proteins to generate a contamination range spanning from 0% to 35%. True completeness and true contamination percentage were calculated as follows:
    (1)
    (2)
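The sampling idea in Step 3 can be illustrated with a minimal sketch. It assumes that completeness is the percentage of a genome's predicted proteins that are retained and that contamination is the number of proteins re-sampled from the same genome, relative to the genome's protein count. These definitions, and the function name `simulate_genome`, are assumptions made for illustration; they are not a restatement of Equations (1) and (2), and the sketch does not call BBMap.

```python
import random

def simulate_genome(proteins, completeness, contamination, seed=0):
    """Sample a simulated genome from a list of predicted proteins.

    Assumed definitions (illustrative only):
      completeness  = retained proteins / original proteins * 100
      contamination = re-sampled proteins / original proteins * 100
    """
    rng = random.Random(seed)
    n_total = len(proteins)
    n_keep = round(n_total * completeness / 100)
    n_extra = round(n_total * contamination / 100)
    kept = rng.sample(proteins, n_keep)    # incomplete copy of the genome
    extra = rng.sample(proteins, n_extra)  # second sampling acts as contamination
    true_completeness = 100 * n_keep / n_total
    true_contamination = 100 * n_extra / n_total
    return kept + extra, true_completeness, true_contamination

# Completeness grid stated in the text: 5% to 100% at 5% intervals.
completeness_grid = list(range(5, 101, 5))

proteins = [f"protein_{i}" for i in range(200)]
genome, comp, cont = simulate_genome(proteins, completeness=50, contamination=10)
print(len(genome), comp, cont)
```

Because the labels are recomputed from the rounded protein counts, the recorded targets always match what was actually sampled, even when the requested percentage does not divide the genome evenly.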

Constructing test data

To comprehensively evaluate the performance of MAG algorithms, we simulated test data using methods different from those used to construct the training dataset. Specifically, the test genome underwent fragmentation into ~20 kb fragments utilizing the “shred.sh” option within the BBMap suite, with a median length of 20 000 and a variance of 2000. Subsequently, these fragments were sampled using the “samplereadstarget” option from the BBMap suite. This process was designed to generate a range of completeness and contamination values as detailed in the Simulated Training Dataset section.

In total, the training dataset, validation dataset (used to select the optimal neural network parameters), and test dataset included 13 375 new complete microbial isolate genomes representing 2 domains, 59 novel phyla, 128 classes, 290 orders, 633 families, 2283 genera, and 7723 novel species according to their Genome Taxonomy Database (GTDB) classifications [30, 31] (more details about the distribution of the benchmark data are shown in Supplementary Fig. S1).

Benchmark methods

As far as we know, CheckM2 is currently the only machine learning method for predicting the quality of MAGs, and it performs better than BUSCO in multiple experimental scenarios. Two models are embedded in CheckM2, artificial neural networks [32] (NNs) and gradient-boosted (GB) decision trees [33], to predict the completeness and contamination of the assembly results, respectively. In this study, we trained NN and GB separately for a more comprehensive comparison, denoted by CheckM2(NN) and CheckM2(GB). In addition, we include CheckM1 (v1.2.3), a single-copy marker gene method, in our comparisons. Note that since CheckM1 is not a machine learning–based approach, we ensured that the query dataset used for CheckM1 was consistent with those used for CheckM2(GB), CheckM2(NN), and DeepCheck, and we utilized the default reference dataset. Additionally, CheckM1 was executed with the “lineage_wf” workflow to enable automatic lineage selection.

Extracting the input genomic information

The predicted genes of each simulated genome were annotated with Kyoto Encyclopedia of Genes and Genomes (KEGG) [34] IDs through the “blastp” command in Diamond v.2.0.4 [35]. The reference database utilized for this annotation process was uniref100 (3 June 2018) [36], which contained KEGG Orthology (KO) annotations. To filter the annotations, we applied specific criteria, including a query_cover of 80%, a subject_cover of 80%, an e-value of $1\times 10^{-5}$, and a percent_id of 30%. We considered only the top hit for each gene during this process. Subsequently, we transformed the annotations into a frequency matrix in which all existing KEGG IDs were represented. In this matrix, rows corresponded to individual simulated genomes, while columns represented counts of detected annotations. Furthermore, we computed additional information for each genome, including the frequency counts of each amino acid, the number of coding sequences, and the total amino acid length. These supplementary details were incorporated alongside the protein annotations to enhance the feature vector.
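As a rough illustration of how such a feature vector could be assembled, the sketch below filters hits with the thresholds stated above, counts KO annotations over a fixed vocabulary, and appends the amino acid counts, the number of coding sequences, and the total amino acid length. The dictionary keys, the function names, and the ordering of the features are hypothetical choices for this example and do not reflect Diamond's output format or the DeepCheck code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def passes_filter(hit):
    """Apply the thresholds from the text to one hit (hypothetical dict keys)."""
    return (hit["query_cover"] >= 80 and hit["subject_cover"] >= 80
            and hit["evalue"] <= 1e-5 and hit["percent_id"] >= 30)

def build_feature_vector(proteins, ko_hits, ko_vocabulary):
    """Concatenate KO counts, amino acid counts, CDS count, and total length.

    proteins:      list of amino acid sequences predicted for one genome
    ko_hits:       list with the top-hit KO ID of each annotated gene
    ko_vocabulary: ordered list of all KO IDs (defines the column order)
    """
    ko_counts = Counter(ko_hits)
    aa_counts = Counter("".join(proteins))
    features = [ko_counts.get(ko, 0) for ko in ko_vocabulary]
    features += [aa_counts.get(aa, 0) for aa in AMINO_ACIDS]
    features.append(len(proteins))                  # number of coding sequences
    features.append(sum(len(p) for p in proteins))  # total amino acid length
    return features

vector = build_feature_vector(
    proteins=["MKV", "MA"],
    ko_hits=["K00001", "K00001"],
    ko_vocabulary=["K00001", "K00002"],
)
print(len(vector), vector[:2], vector[-2:])
```

Fixing the KO vocabulary up front keeps the columns aligned across genomes, which is what allows the per-genome vectors to be stacked into the frequency matrix described above.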

Constructing the framework of DeepCheck

As the complexity of processing tasks increases, multitask learning research has gained more and more attention and applications in computer vision [37], natural language processing [38, 39], and bioinformatics [40–42]. When the two tasks have a large correlation, a more robust feature representation can be extracted using a common feature extractor, and better performance can be obtained by constructing a simple regressor or classifier for the task after the feature extractor.

Predicting the completeness of MAGs can be viewed as a special case of predicting contamination, and thus, the two tasks are strongly correlated. Inspired by multitask learning, we propose an NN-based multitask learning framework, named DeepCheck, for predicting the completeness and contamination of MAGs in a unified model. Specifically, DeepCheck consists of two parts: the feature extractor (DeepCheck-FE) and two regressors (regressor 1 and regressor 2), as shown in Fig. 1.

Figure 1

The computational process of DeepCheck. (A) DeepCheck simulation training dataset and (B) the training process of DeepCheck.

The framework of DeepCheck-FE

In the section Extracting the Input Genomic Information, we calculated the frequency counts of each amino acid, the number of coding sequences, the total amino acid length, and the KEGG annotation of MAGs. However, these features are highly sparse and may include features that are irrelevant to completeness and contamination. Inspired by the powerful feature extraction capability of convolutional neural networks, we propose a feature extractor based on residual convolutional neural networks with an attention mechanism to extract feature representations more relevant to completeness and contamination from the genomic information. To obtain a larger receptive field while helping the network capture local information, we first zero-pad the genomic feature vector to 20 164 dimensions and then reshape it into a $142\times 142$ matrix, denoted as M. The matrix M is then fed sequentially into a convolutional layer, a batch normalization layer, a ReLU layer, a max pooling layer, three residual blocks, an attention block, an average pooling layer, a flatten layer, and a fully connected layer. The residual blocks facilitate the training of deeper neural network architectures, mitigating the vanishing gradient problem and improving learning by allowing gradients to flow directly. For the attention block, assuming that the third residual block outputs a feature map X, we use three $1\times 1$ convolutional layers to obtain three outputs, Q, K, and V. The attention scores, denoted as attention feature maps, are then computed by taking the dot product of Q with K. After that, we multiply the attention feature maps with V and feed the result into a $1\times 1$ convolutional layer to obtain the self-attention feature maps (more detailed network parameters are provided in Supplementary Table S1).
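Two steps of this pipeline are easy to illustrate without a deep learning library: padding the feature vector into a 142 × 142 matrix, and the dot-product attention itself. The sketch below is a plain-Python illustration under stated assumptions: the projections that produce Q, K, and V (the 1 × 1 convolutions) are omitted, no score normalization is applied because the text does not specify one, and the value matrix V is used in the final product, as in standard self-attention. The function names are hypothetical.

```python
SIDE = 142  # 142 * 142 = 20164 input dimensions

def to_matrix(features, side=SIDE):
    """Zero-pad a feature vector to side*side entries and reshape it row by row."""
    size = side * side
    if len(features) > size:
        raise ValueError("feature vector is longer than the target matrix")
    padded = list(features) + [0.0] * (size - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(q, k, v):
    """Unnormalized dot-product attention over a set of positions.

    q, k, v: lists of per-position vectors (positions x channels).
    """
    scores = matmul(q, transpose(k))  # positions x positions attention map
    return matmul(scores, v)          # attention-weighted combination of values

matrix = to_matrix([1.0, 2.0, 3.0])
print(len(matrix), len(matrix[0]), matrix[0][:4])

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(self_attention(q, k, v))
```

In the toy example, Q and K are orthonormal, so the attention map is the identity and the output reproduces V; real feature maps produce dense attention maps that mix information across positions.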

Regressors for predicting the completeness and contamination of metagenome-assembled genomes

Assuming that the feature representation extracted using DeepCheck-FE is denoted as f, we input f into each of the two regressors to predict the completeness and contamination of the MAGs separately, where each of the two regressors consists of two fully connected layers containing 100 neurons. For the parameters of DeepCheck, we used the following equation to jointly optimize the two regressors as well as DeepCheck-FE:

(3)

where $y_{true\_completeness}$ and $y_{true\_contamination}$ are the simulated true values of completeness and contamination, respectively; $y_{pred\_completeness}$ and $y_{pred\_contamination}$ are the predicted values of completeness and contamination, respectively; $n$ is the number of training samples; and $\gamma$ is the balance parameter. Detailed ablation experiments explaining the use of the joint loss function are presented in Supplementary Text S1, as well as in Supplementary Tables S2 and S3.
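To make the role of the balance parameter concrete, the sketch below shows one plausible joint objective: a weighted sum of two mean squared error terms, with weight γ on completeness and 1 − γ on contamination. This particular weighting is an assumption chosen for illustration and may differ from Equation (3); only the symbols (true and predicted values, n, γ) are taken from the text.

```python
def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def joint_loss(comp_true, comp_pred, cont_true, cont_pred, gamma=0.5):
    """Assumed joint objective: gamma * MSE(completeness) + (1 - gamma) * MSE(contamination)."""
    return gamma * mse(comp_true, comp_pred) + (1 - gamma) * mse(cont_true, cont_pred)

loss = joint_loss(
    comp_true=[1.0, 0.5], comp_pred=[0.9, 0.5],
    cont_true=[0.0, 0.1], cont_pred=[0.0, 0.3],
    gamma=0.5,
)
print(loss)
```

Because both terms are computed from the same shared features, a single backward pass through such a loss updates the feature extractor with gradients from both tasks, which is the mechanism by which multitask training can exploit their correlation.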

The model interpreter

Our experiments demonstrate that deep learning can enhance the accuracy of predictions of MAG completeness and contamination. However, users often cannot tell which specific genomic features contribute to the model’s final decision, which makes the predictions harder to trust. To address this issue, our study employs an interpretable machine learning technique (i.e. integrated gradients in Captum [43]) to equip the prediction model with a model interpreter. This interpreter quantifies the contribution of each genomic feature to the predicted completeness and contamination of each genome, thereby clarifying which features are pivotal for the model’s predictions. This approach reduces the uncertainty associated with the use of deep learning models in genomic analysis, making the results more accessible and understandable for users.
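The idea behind integrated gradients can be shown on a toy model without any deep learning library. The sketch below approximates the path integral of the gradient from a baseline to the input with a midpoint Riemann sum; it is an illustration of the technique, not the Captum API, and the toy model, its weights, and the function names are hypothetical.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn:  function returning the gradient of the model output at a point
    x:        input feature vector
    baseline: reference input (for example, all zeros)
    """
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

# Toy model: f(x) = sum_i w_i * x_i^2, with gradient 2 * w_i * x_i.
weights = [1.0, 2.0, 0.0]

def model(x):
    return sum(w * xi ** 2 for w, xi in zip(weights, x))

def model_grad(x):
    return [2 * w * xi for w, xi in zip(weights, x)]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)
print(attributions, sum(attributions), model(x) - model(baseline))
```

The attributions sum to the difference between the model output at the input and at the baseline, and the third feature, which has zero weight, receives zero attribution. This is the property that makes the scores usable for ranking genomic features by their contribution to a prediction.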

The implementation of DeepCheck

In the training process of DeepCheck, the batch size, the learning rate, and the loss balance parameter $\gamma$ are set to 256, $1\times 10^{-4}$, and 0.5, respectively. To obtain the optimal parameters, we further divide the training genomes into two parts (training dataset and validation dataset) at a ratio of 9:1, which are used for training and for selecting the optimal parameters, respectively. Specifically, after each epoch, we save the parameters if the validation loss is the lowest observed so far. The test genome data are not involved in model training. Once training is finished, we fix the model parameters and then input the test samples to obtain the prediction results for completeness and contamination.
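The 9:1 split and the checkpoint selection rule can be sketched as follows. The validation losses are made-up numbers used only to show the selection logic, and the function names are hypothetical; a real training loop would save the model parameters at the marked line.

```python
import random

def split_train_validation(items, validation_fraction=0.1, seed=0):
    """Shuffle and split items into training and validation parts (9:1 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def select_best_epoch(validation_losses):
    """Return the epoch with the lowest validation loss and that loss."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss  # parameters would be saved here
    return best_epoch, best_loss

train, validation = split_train_validation(range(100))
print(len(train), len(validation))

# Illustrative validation losses for five epochs (not real results).
print(select_best_epoch([0.9, 0.4, 0.5, 0.3, 0.35]))
```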

Performance metrics

In this study, we used six evaluation metrics to assess the performance of the model: the commonly used regression metrics MSE (mean squared error) and MAE (mean absolute error), yield rate (YR), conformance rate (CR), nonconformance rate (NCR), and R2 (coefficient of determination). A more detailed description of these metrics can be found in Supplementary Text S2.
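For reference, the three standard regression metrics can be computed as in the sketch below, which uses their textbook definitions. YR, CR, and NCR are defined in the supplementary material and are therefore not reproduced here.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mean_squared_error(y_true, y_pred),
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```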

Results

In this section, each simulated genome is categorized as high quality, medium quality, low quality, or high contamination based on its simulated completeness and contamination levels and the Minimum Information about a Metagenome-Assembled Genome (MIMAG) standard [2]. Specifically, the categorization scheme is structured as follows: high-quality assemblies exhibit completeness exceeding 90%, with contamination values <5%; medium-quality assemblies have a completeness value ≥50% and contamination rates <10%; low-quality assemblies have a completeness value below 50% and contamination rates within the range of 0%–10%; and assemblies categorized as having high contamination have contamination levels exceeding 10%.
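The categorization can be written as a small decision function. The sketch assumes that the high-contamination class means contamination above 10%, that the categories are checked from most to least restrictive (since every high-quality genome also satisfies the medium-quality thresholds), and that a contamination of exactly 10% is assigned to the medium- or low-quality class; the last point is a boundary choice made for this example because the thresholds above leave it open.

```python
def categorize(completeness, contamination):
    """Assign a quality category from completeness and contamination (both in %)."""
    if contamination > 10:
        return "high contamination"
    if completeness > 90 and contamination < 5:
        return "high quality"
    if completeness >= 50:
        return "medium quality"
    return "low quality"

for comp, cont in [(95, 2), (95, 7), (60, 8), (30, 3), (80, 15)]:
    print(comp, cont, categorize(comp, cont))
```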

Comparison of DeepCheck and benchmark methods under 5-fold cross-validation

In this section, we used five-fold cross-validation to assess the performance of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck across six performance metrics (MSE, MAE, YR, CR, NCR, R2), utilizing a dataset comprising 13 375 complete microbial isolate genomes from RefSeq version 202. We used the strategy described in the Simulated Training Dataset section to process these 13 375 genomes and obtain true completeness and contamination values as training targets. Ultimately, 1 720 000 genomes were simulated to evaluate the performance of the metagenome quality assessment algorithms.

As shown in Fig. 2A–D and Supplementary Fig. S2, DeepCheck outperforms the other three methods in all prediction tasks across genome quality levels (low quality, medium quality, high quality, and high contamination), except when predicting the contamination of high-quality and high-contamination genome assemblies. Specifically, for completeness prediction, the MSE and MAE of DeepCheck are lower than those of CheckM2(GB) and CheckM2(NN). In particular, for high-quality genomes, DeepCheck yielded an MSE of $1.9\times 10^{-2}$, which is lower than the MSE of $2.1\times 10^{-2}$ obtained by CheckM2(NN), a 10.5% improvement [$(2.1\times 10^{-2}-1.9\times 10^{-2})/(1.9\times 10^{-2})$]. For contamination prediction, the benchmark methods generally achieve lower MSE and MAE values, which suggests that machine learning–based methods mainly improve the accuracy of completeness prediction. Nevertheless, for genomes of low to medium quality, DeepCheck still shows measurable improvements in both MSE and MAE, underscoring its efficacy across a broader range of genomic conditions.

Figure 2

Comparison of DeepCheck with CheckM1, CheckM2(NN), and CheckM2(GB) for the quality of MAGs on the collected genomes. (A–D) Mean MAE and MSE of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness at different genome qualities. (G, H) The boxplot of mean YR, CR, and NCR of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness and contamination at different genome qualities.

From Fig. 2G and H, DeepCheck demonstrates superior performance over the other three methods in predicting both completeness and contamination, particularly in terms of the YR and NCR metrics. Specifically, DeepCheck attains the highest YR values, as shown in Fig. 2E and F, which indicates that the discrepancies between DeepCheck’s predictions and the true values tend to fall within a narrower margin of ~1%. Regarding the NCR metric, DeepCheck achieves lower values than CheckM1, CheckM2(NN), and CheckM2(GB), suggesting that fewer of its predictions deviate from the true values by more than ~5%. According to Supplementary Fig. S2E, DeepCheck, the best-performing method, achieved the highest R2 scores of 0.9840 for completeness and 0.9813 for contamination. The second-best method, CheckM2(NN), recorded slightly lower R2 scores of 0.9801 for completeness and 0.9720 for contamination. These results suggest that DeepCheck offers the most accurate model fit for both completeness and contamination. Collectively, these findings underscore DeepCheck’s enhanced efficacy over CheckM1, CheckM2(GB), and CheckM2(NN) across commonly studied lineages.

Comparison of DeepCheck and CheckM2 under new lineages

Due to the presence of many unidentified microbial communities in MAGs from novel lineages, the analysis of these MAGs could become a valuable resource for studying the ecological functions and metabolic pathways of microbial communities. Furthermore, due to the lack of reference genomes, it is difficult to obtain a sufficient quantity and diversity of datasets to train predictive models for novel lineages. This deficiency can constrain the capacity of a model to generalize to new, unseen lineages. Therefore, the ability of the prediction model to maintain high accuracy for unobserved lineages becomes a pivotal factor influencing its utility.

In this section, we selected the genomes from phyla with <50 occurrences as the test set (529 genomes) and the remaining genomes as the training and validation sets (12 846 genomes). This ensures that the test set contains lineages that differ from those used in the training and validation sets. We used the strategy described in the Simulated Training Dataset section to process the training and validation sets and obtain true completeness and contamination values as training targets. For the test set, we used the strategy described in the Constructing Test Data section to obtain the completeness and contamination values. After completing the simulation, we generated 1 650 000 genomes, each annotated with completeness and contamination values, to serve as the training and validation sets for parameter optimization in DeepCheck. Additionally, 68 000 genomes, similarly annotated, were used as the test set to evaluate the algorithm’s performance.
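The lineage hold-out described above amounts to grouping genomes by phylum and sending every genome from a rare phylum to the test set. A minimal sketch follows, in which the genome identifiers, phylum names, and function name are hypothetical and a small threshold is used so the toy example stays readable (the study uses 50).

```python
from collections import Counter

def split_by_rare_phylum(genome_to_phylum, max_count=50):
    """Hold out genomes whose phylum has fewer than max_count genomes."""
    counts = Counter(genome_to_phylum.values())
    test = [g for g, p in genome_to_phylum.items() if counts[p] < max_count]
    train = [g for g, p in genome_to_phylum.items() if counts[p] >= max_count]
    return train, test

genomes = {
    "genome_1": "phylum_A", "genome_2": "phylum_A", "genome_3": "phylum_A",
    "genome_4": "phylum_B",
}
train, test = split_by_rare_phylum(genomes, max_count=3)
print(train, test)
```

Splitting at the phylum level, rather than at the genome level, guarantees that no phylum contributes genomes to both sides, which is what makes the test set a measure of generalization to unseen lineages.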

As illustrated in Fig. 3A–D and Supplementary Fig. S3, DeepCheck demonstrates superior or comparable performance to the other two benchmark methods in most cases. In particular, DeepCheck exhibits an improvement of $3\times 10^{-3}$ over the second-place method, CheckM2(NN), and an improvement of $7\times 10^{-3}$ over CheckM2(GB) in predicting genomic completeness for low-quality assemblies. From Fig. 3E and F, we can see that, in predicting both the completeness and contamination of novel lineages, DeepCheck slightly surpasses CheckM2(GB) overall, while its NCR slightly trails that of CheckM2(GB). From Supplementary Fig. S3, DeepCheck exhibits the highest performance, achieving R2 scores of 0.9179 for completeness and 0.9134 for contamination. The second-best method, CheckM2(NN), registers slightly lower R2 scores, with 0.9092 for completeness and 0.8991 for contamination.

Figure 3

Comparison of DeepCheck with CheckM1, CheckM2(NN), and CheckM2(GB) for the quality of MAGs on the genomes with new lineages. (A–D) Mean MAE and MSE of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness at different genome qualities. (G, H) The boxplot of mean YR, CR, and NCR of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness and contamination at different genome qualities.

These results demonstrate that, by leveraging the robust feature extraction capability of DeepCheck-FE, DeepCheck improves the accuracy of quality assessment for metagenome assemblies compared to CheckM1, CheckM2(GB), and CheckM2(NN). However, the absence of these lineages from the training set leads to lower accuracy in the predicted results, particularly in cases where the difference between the true and predicted values exceeds 5%, which may affect the correctness of subsequent biological conclusions.

Benchmarking DeepCheck performance on Superphylum

The previous experiments demonstrated the progress made by DeepCheck in predicting the quality of genomes from both conventional and novel phylogenetic lineages. However, in practical applications, DeepCheck will encounter superphyla with relatively small genome sizes. These are important members of microbial communities in ecological environments, participating in nutrient cycling and influencing the dynamics of microbial communities through interactions with other microbes. Predicting the genomic quality of these superphyla is particularly challenging due to their high diversity, unique biological characteristics, lack of key genes, and typically smaller genome sizes.

In this section, we compiled 27 genomes identified as Patescibacteria (a group within the superphylum) according to the GTDB, sourced from RefSeq version 202. Additionally, 30 Patescibacteria genomes derived from wastewater were acquired [44]. A total of 57 Patescibacteria genomes were employed as a test set to evaluate the predictive capability of DeepCheck in assessing the quality of genomes from these microorganisms. We used the strategy described in the Constructing Test Data section to obtain 6000 genomes with completeness and contamination values for evaluating the performance of the algorithm. The training and validation sets comprise the ~1 700 000 genomes that remain in the simulated dataset from the section Comparison of DeepCheck and Benchmark Methods under 5-fold Cross-validation after excluding the Patescibacteria genomes.

As illustrated in Fig. 4A–D and Supplementary Fig. S4, DeepCheck exhibits superior or comparable performance (lower MSE and MAE values) compared to the other three benchmark methods, except when predicting the contamination of high-quality and high-contamination MAGs. Specifically, when predicting the contamination of low-quality MAGs, the MAE value reached 0.015, which is a 16% improvement over the second-place method, CheckM2(GB) (Supplementary Fig. S4C). In addition, from Fig. 4E, our method achieved results comparable to those of the other three methods, registering a lower YR and a higher NCR. These outcomes suggest that predicting the completeness of Patescibacteria genomes presents a significant challenge. However, for contamination, as illustrated in Fig. 4F for the NCR metric, DeepCheck achieves the lowest value. This indicates that, even for the challenging Patescibacteria, the discrepancy between DeepCheck’s predictions and the true values predominantly falls within the range of [−5%, 5%]. This increases confidence in the reliability of subsequent analyses compared to those based on CheckM1, CheckM2(GB), and CheckM2(NN). For the R2 metric, Supplementary Fig. S4E displays the R2 scores of the various methods in assessing completeness and contamination. DeepCheck demonstrates superior performance, recording the highest R2 scores of 0.6120 for completeness and 0.3221 for contamination. The second-best method, CheckM2(NN), posts lower R2 scores, with 0.6110 for completeness and 0.2523 for contamination.

Figure 4

Comparison of DeepCheck with CheckM1, CheckM2(NN), and CheckM2(GB) for the quality of MAGs on the genomes with Patescibacteria. (A–D) Mean MAE and MSE of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness at different genome qualities. (E, F) The boxplot of mean YR, CR, and NCR of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness and contamination at different genome qualities.

Taken together, although predicting the completeness of Patescibacteria is a challenging task (each method yielded higher MAE and MSE values and lower YR and R2 values), DeepCheck improves on the predictive performance of the existing methods, which shows the potential of deep residual convolutional NNs for addressing this problem.

Benchmarking DeepCheck cross-contamination performance

Contamination in MAGs may result from the misclassification of closely related strains or species. Additionally, contamination may include disparate sequences from other lineages or even different domains, a phenomenon we refer to as cross-contamination. In this subsection, we evaluate the performance of DeepCheck relative to three other methods in assessing the quality of cross-contaminated MAGs. We collected 16 genomes from RefSeq version 202 and employed the strategy described in [10] to simulate 1500 genomes as the test set. Additionally, we utilized the 1 720 000 genomes simulated in the Comparison of DeepCheck and Benchmark Methods under 5-fold Cross-Validation section as both training and validation sets.

As illustrated in Fig. 5A–F and Supplementary Fig. S5, DeepCheck demonstrates distinct strengths compared to the other three benchmark methods. For instance, our method exhibits a clear advantage in predicting the completeness of low-quality genome sequences (Fig. 5C). In contrast, CheckM2(GB) performs better with high-quality genomes (Fig. 5A). The results for high- and medium-quality genomes are comparable; DeepCheck shows a lower MAE but a higher MSE when dealing with high contamination levels (Fig. 5D). Regarding the YR and NCR metrics, DeepCheck recorded the highest YR and the lowest NCR values (Fig. 5E), which indicates that the predictions made by our method align more closely with the true values. When predicting contamination, DeepCheck generally achieves superior performance, except for genomes of medium quality (Supplementary Fig. S5). For the corresponding YR and NCR values, DeepCheck achieved the highest YR and the second-lowest NCR, only slightly higher than that of CheckM2(NN). From Supplementary Fig. S5E, DeepCheck achieves R2 scores close to 0.95 for both completeness and contamination. CheckM2(NN) also performs well, though with slightly lower R2 values, particularly for contamination. The other methods, CheckM2(GB) and CheckM1, show a noticeable decline in performance, especially in contamination modeling, where their R2 scores are significantly lower. This suggests that while DeepCheck and CheckM2(NN) are robust in their predictive accuracy, there is a discernible gap in the efficacy of the other methods, particularly when handling contamination data.

Comparison of DeepCheck with CheckM1, CheckM2(NN), and CheckM2(GB) for the quality of metagenome-assembled genomes on cross-contamination. (A–D) Mean MAE and MSE of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness at different genome qualities. (G, H) The boxplot of mean YR, CR, and NCR of CheckM1, CheckM2(NN), CheckM2(GB), and DeepCheck in predicting completeness and contamination at different genome qualities.
Figure 5

Taken together, these results demonstrate that DeepCheck can effectively assess the quality of cross-contaminated genomes.
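The error metrics reported in this comparison (MAE, MSE, and R²) can be illustrated with a minimal sketch. The completeness values below are hypothetical and are not taken from the benchmark; the function names are ours and do not correspond to any DeepCheck code. YR and NCR are not reproduced here.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical completeness values (%), for illustration only.
true_completeness = [90.0, 70.0, 50.0, 30.0]
pred_completeness = [88.0, 73.0, 50.0, 35.0]

scores = {
    "MAE": mae(true_completeness, pred_completeness),
    "MSE": mse(true_completeness, pred_completeness),
    "R2": r2(true_completeness, pred_completeness),
}
```

Because MSE squares each error, a method that makes a few large errors can have a higher MSE than a competitor even when its MAE is lower, which is one possible reading of a lower MAE paired with a higher MSE.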

Model interpreters bring new biological insights

Determining the contribution of specific genomic features to the NN model employed by DeepCheck is essential for understanding the model's decision-making process and improving its interpretability. However, validating the accuracy of outputs from this interpretable module poses a significant challenge. This difficulty arises primarily from the complex operations inherent in neural networks and the intricate relationships between pathway features and genomic quality. In this study, using the genomes from the Comparison of DeepCheck and Benchmark Methods under 5-fold Cross-Validation section, we employed DeepCheck’s Interpretation Module to identify the top 300 contributing genomic features for each genome. We then determined the 300 genomic features that appeared most frequently among these top contributors for genomes within the same phylum, designating them as key pathways of the phylum (Supplementary Table S4).
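The two-step selection described above (top contributors per genome, then the most frequent contributors per phylum) can be sketched as follows. This is an illustrative reconstruction, not the DeepCheck implementation: the function names, the toy feature labels, and the choice to rank by absolute attribution are all assumptions, and `k` is set to 2 instead of 300 to keep the example small.

```python
from collections import Counter, defaultdict


def top_features(attributions, k):
    """Return the k feature names with the largest absolute attribution
    (ranking by absolute value is an assumption of this sketch)."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]), reverse=True)
    return ranked[:k]


def key_features_by_phylum(genomes, k):
    """For each phylum, count how often a feature is in a genome's top k,
    then keep the k most frequently occurring features."""
    counts = defaultdict(Counter)
    for phylum, attributions in genomes:
        counts[phylum].update(top_features(attributions, k))
    return {
        phylum: [name for name, _ in counter.most_common(k)]
        for phylum, counter in counts.items()
    }


# Toy input: (phylum, {feature: attribution score}); all values are made up.
genomes = [
    ("Chloroflexota", {"feature_a": 0.9, "feature_b": -0.7, "feature_c": 0.1}),
    ("Chloroflexota", {"feature_a": 0.8, "feature_b": 0.2, "feature_c": -0.6}),
    ("Chloroflexota", {"feature_a": 0.6, "feature_b": 0.5, "feature_c": 0.1}),
]

key = key_features_by_phylum(genomes, k=2)
```

In this toy input, `feature_a` is a top contributor in all three genomes and `feature_b` in two, so they form the key features for the phylum.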

As shown in Supplementary Table S4, the key pathways that most strongly influence lineage completeness and contamination predictions include ribosomal proteins, which appear among the pivotal genomic features in each category. DNA processing, genes involved in the transfer RNA (tRNA) biosynthesis pathway, and the counts of individual nucleotides are also major contributors to these predictions. In addition, some pathways have higher predictive value only for certain lineages. For example, K17753 (isocitrate--homoisocitrate dehydrogenase) contributes more strongly to the prediction of Chloroflexota completeness and contamination. Chloroflexota includes bacteria that produce energy through photosynthesis [45, 46], and K17753 has been shown to be closely related to the metabolism and biosynthesis of secondary metabolites [47]. These results bolster confidence in the ability of DeepCheck to capture nuances of the underlying biological reality.

Taken together, we propose using the interpretable module in two key ways. (i) Enhancing genome assembly: the output of the interpreter, which highlights pathways that significantly affect genome completeness and contamination predictions, serves as a diagnostic tool. It allows researchers to pinpoint biological processes that warrant further examination during genome assembly or that may signal contamination, which could lead to more focused strategies for improving assembly quality. (ii) Facilitating cross-species metagenomic analyses: given the extensive coverage of KEGG pathways across different organisms, the interpretable module has the potential to uncover both commonalities and differences in the factors affecting genome quality across taxa. Such cross-species applicability deepens our understanding of genome conservation and divergence, potentially highlighting universal challenges or lineage-specific issues in genome sequencing and assembly.

Discussing the time and memory consumption of DeepCheck

Like CheckM2, DeepCheck relies on the GTDB version, and most of the computation time is spent on KEGG pathway annotation of the genome using DIAMOND. The interpretable module and the completeness and contamination predictions run very fast on the graphics processing unit (GPU). We measured runtime on a server with 24 threads (CPU: AMD EPYC 7H12 64-Core Processor; GPU: NVIDIA GeForce RTX 4090), running inference on 1000 genomes and interpreting the inference results. The normalized runtime per thread was calculated as $\frac{\text{total run time}}{\text{number of genomes}} / \text{number of threads}$. No other programs were running during the measurement. DeepCheck processes ~0.51 genomes per minute. In comparison, CheckM2(GB) and CheckM2(NN) process ~0.52 and ~0.53 genomes per minute, respectively, while CheckM1 handles ~1.38 genomes per minute. CheckM2(GB) and CheckM2(NN) are thus slightly faster than DeepCheck, mainly because DeepCheck has more parameters and additionally runs an interpretable module for the user. We believe that this increase in time is worthwhile because DeepCheck predicts more accurately and gives users the reasoning behind the NN's predictions, which increases confidence in them.
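As a worked example of the normalization above, the sketch below evaluates the expression literally: total run time divided by the number of genomes, then by the number of threads. The genome and thread counts match the setup described in the text, while the 600-minute total is a hypothetical value chosen for illustration and is not a measured result; the function name is ours.

```python
def normalized_runtime(total_minutes, n_genomes, n_threads):
    """Total run time divided by genome count, then by thread count."""
    if n_genomes <= 0 or n_threads <= 0:
        raise ValueError("n_genomes and n_threads must be positive")
    return total_minutes / n_genomes / n_threads


# Hypothetical run: 1000 genomes on 24 threads taking 600 minutes in total.
value = normalized_runtime(total_minutes=600.0, n_genomes=1000, n_threads=24)
```

With these assumed inputs, the run averages 0.6 minutes per genome, and dividing by the 24 threads gives 0.025.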

Conclusion and discussion

Here, we introduce DeepCheck, a multitasking deep learning framework designed for simultaneously predicting the completeness and contamination of MAGs. DeepCheck surpasses existing tools in accuracy across various experimental scenarios while maintaining impressive prediction speed. Notably, even when encountering novel lineages, DeepCheck consistently achieves high prediction accuracy. Moreover, we employ interpretable machine learning techniques to uncover the specific genes and pathways underlying the model’s predictions. This enables subsequent investigation and independent assessment of the predicted outcomes.

While DeepCheck achieved high accuracy for both common and novel genomes, its performance in predicting genome completeness and contamination for Superphylum remains suboptimal and requires further improvement. In particular, for genomes from the Patescibacteria phylum, the completeness predictions made by DeepCheck differ from the true values by 10% on average. Although this deviation is the smallest among the three methods currently evaluated, caution is still advised when using these methods to predict the completeness of genomes within the Patescibacteria phylum. We attribute this suboptimal performance to two main factors. First, the scarcity of genomic data available for the Superphylum, coupled with its high species diversity, limits the model’s ability to capture relevant genomic features. Second, the significant variation in genome size, structure, and gene content among Superphylum species complicates model generalization. To address these challenges, we propose two avenues for improvement: (i) augmenting the training data by simulating additional Superphylum genomes using generative large language models and then using transfer learning to integrate knowledge from these simulated data and (ii) enhancing model capacity by incorporating more advanced model structures, such as graph neural networks, to better capture the intricate relationships and structural properties inherent in genomic data.

Key Points
  • DeepCheck framework: Existing machine learning–based methods often treat completeness and contamination prediction as separate tasks. The proposed DeepCheck framework addresses this limitation by using a multitasking deep learning approach that simultaneously predicts both completeness and contamination. This integration enhances the accuracy and generalization capability of the model.

  • Performance advantages: DeepCheck demonstrates superior accuracy compared to existing tools at a comparable prediction speed. It also maintains high predictive accuracy for novel microbial lineages.

  • Interpretable machine learning: DeepCheck incorporates interpretable machine learning techniques to identify specific genes and pathways that influence its predictions. This feature allows for further independent exploration and understanding of these biological elements.

Conflict of interest: The authors have no conflicts of interest to report.

Funding

This work was funded by the National Natural Science Foundation of China (32170327 and 32270664) and the National Key Research and Development Program of China (2023YFD2200102 and 2023YFD2200104), and the AI & AI for Science Project of Nanjing University.

Data availability

Benchmarking data can be accessed from Zenodo (https://zenodo.org/records/11299463).

Code availability

The source code of DeepCheck is available on GitHub (https://github.com/Guowei-nju/DeepCheck).

References

1. Bickhart DM, Kolmogorov M, Tseng E. et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat Biotechnol 2022;40:711–9.

2. Bowers RM, Kyrpides NC, Stepanauskas R. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 2017;35:725–31.

3. Hugerth LW, Larsson J, Alneberg J. et al. Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biol 2015;16:1–18.

4. Ke S, Weiss ST, Liu Y-Y. Dissecting the role of the human microbiome in COVID-19 via metagenome-assembled genomes. Nat Commun 2022;13:5235.

5. Chivian D, Jungbluth SP, Dehal PS. et al. Metagenome-assembled genome extraction and analysis from microbiomes using KBase. Nat Protoc 2023;18:208–38.

6. Gwak H-J, Lee SJ, Rho M. Application of computational approaches to analyze metagenomic data. J Microbiol 2021;59:233–41.

7. Lu J, Rincon N, Wood DE. et al. Metagenome analysis using the kraken software suite. Nat Protoc 2022;17:2815–39.

8. Parks DH, Imelfort M, Skennerton CT. et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25:1043–55.

9. AlQuraishi M. AlphaFold at CASP13. Bioinformatics 2019;35:4862–5.

10. Chklovski A, Parks DH, Woodcroft BJ. et al. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 2023;20:1203–12.

11. Lv Q, Chen G, Zhao L. et al. Mol2Context-vec: learning molecular representation from context awareness for drug discovery. Brief Bioinform 2021;22:bbab317.

12. Lv Q, Chen G, Yang Z. et al. Meta-molnet: a cross-domain benchmark for few examples drug discovery. IEEE Trans Neural Netw Learn Syst 2024.

13. Lv Q, Chen G, Yang Z. et al. Meta learning with graph attention networks for low-data drug discovery. IEEE Trans Neural Netw Learn Syst 2023;35:11218–30.

14. Lv Q, Chen G, He H. et al. TCMBank: bridges between the largest herbal medicines, chemical ingredients, target proteins, and associated diseases with intelligence text mining. Chem Sci 2023;14:10684–701.

15. Askr H, Elgeldawi E, Aboul Ella H. et al. Deep learning in drug discovery: an integrative review and future challenges. Artif Intell Rev 2023;56:5975–6037.

16. Lv Q, Zhou J, Yang Z. et al. 3D graph neural network with few-shot learning for predicting drug–drug interactions in scaffold-based cold start scenario. Neural Netw 2023;165:94–105.

17. Lin X, Dai L, Zhou Y. et al. Comprehensive evaluation of deep and graph learning on drug–drug interactions prediction. Brief Bioinform 2023;24:bbad235.

18. Senior AW, Evans R, Jumper J. et al. Improved protein structure prediction using potentials from deep learning. Nature 2020;577:706–10.

19. Hamamsy T, Morton JT, Blackwell R. et al. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024;42:975–85.

20. Zhang Y, Yang Q. A survey on multi-task learning. IEEE Trans Knowl Data Eng 2021;34:5586–609.

21. He K, Zhang X, Ren S. et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–8, 2016.

22. Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing 2021;452:48–62.

23. Guo M-H, Xu T-X, Liu J-J. et al. Attention mechanisms in computer vision: a survey. Comput Vis Media 2022;8:331–68.

24. Amorim JP, Abreu PH, Fernández A. et al. Interpreting deep machine learning models: an easy guide for oncologists. IEEE Rev Biomed Eng 2021;16:192–207.

25. Murdoch WJ, Singh C, Kumbier K. et al. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci 2019;116:22071–80.

26. O'Leary NA, Wright MW, Brister JR. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.

27. Jain C, Rodriguez-R LM, Phillippy AM. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 2018;9:5114.

28. Hyatt D, Chen G-L, LoCascio PF. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010;11:1–11.

29. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. 2014.

30. Parks DH, Chuvochina M, Rinke C. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res 2022;50:D785–94.

31. Chaumeil P-A, Mussig AJ, Hugenholtz P. et al. GTDB-Tk: A Toolkit to Classify Genomes with the Genome Taxonomy Database. Oxford University Press, 2020.

32. Yegnanarayana B. Artificial Neural Networks. PHI Learning Pvt. Ltd., 2009.

33. Ke G, Meng Q, Finley T. et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst 2017;30.

34. Kanehisa M, Sato Y, Kawashima M. et al. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 2016;44:D457–62.

35. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 2021;18:366–8.

36. Suzek BE, Huang H, McGarvey P. et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007;23:1282–8.

37. Wong A, Wu Y, Abbasi S. et al. Fast GraspNeXt: A fast self-attention neural network architecture for multi-task learning in computer vision tasks for robotic grasping on the edge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2292–6, 2023.

38. Yang E, Pan J, Wang X. et al. Adatask: A task-aware adaptive learning rate approach to multi-task learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10745–53, 2023.

39. Bhat A, Modi A. Multi-task learning framework for extracting emotion cause span and entailment in conversations. In: Transfer Learning for Natural Language Processing Workshop. PMLR, 2023, 33–51.

40. Xu Q, Yang Q. A survey of transfer and multitask learning in bioinformatics. J Comput Sci Eng 2011;5:257–68.

41. He Q, Qiao W, Fang H. et al. Improving the identification of miRNA–disease associations with multi-task learning on gene–disease networks. Brief Bioinform 2023;24:bbad203.

42. Tang X, Zhang J, He Y. et al. Explainable multi-task learning for multi-modality biological data analysis. Nat Commun 2023;14:2546.

43. Kokhlikyan N, Miglani V, Martin M. et al. Captum: a unified and generic model interpretability library for pytorch. arXiv preprint arXiv:2009.07896 2020.

44. Singleton CM, Petriglieri F, Kristensen JM. et al. Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nat Commun 2021;12:2009.

45. Freches A, Fradinho JC. The biotechnological potential of the Chloroflexota phylum. Appl Environ Microbiol 2024;e01756–23.

46. Garritano AN, Song W, Thomas T. Carbon fixation pathways across the bacterial and archaeal tree of life. PNAS Nexus 2022;1:pgac226.

Author notes

Guo Wei and Nannan Wu contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]