Shenjie Wang, Xiaoyan Zhu, Xuwen Wang, Yuqian Liu, Minchao Zhao, Zhili Chang, Xiaonan Wang, Yang Shao, Jiayin Wang, TMBstable: a variant caller controls performance variation across heterogeneous sequencing samples, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae159, https://doi.org/10.1093/bib/bbae159
Abstract
In cancer genomics, variant calling has advanced, but traditional mean accuracy evaluations are inadequate for biomarkers like tumor mutation burden, which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers with uniform sample-wide strategies. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations and ensuring consistent performance across various samples. We evaluated TMBstable using both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, focusing on stability measures, such as the variance and coefficient of variation in false positive rate, false negative rate, precision and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable’s superior stability with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing the counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic usage only.
INTRODUCTION
Variant calling from genomic sequencing data is a basic computational problem in bioinformatics [1–4]. Hundreds of variant callers have been developed during the past two decades. To the best of our knowledge, all existing callers are evaluated by the MEAN value of each assessment indicator (e.g. accuracy, precision, F1-score) over a set of samples. This has been reasonable for a long time. First, if the heterogeneity of genomic variants across samples is limited, then a caller's performance will not differ significantly across those samples. Furthermore, for heterogeneous samples, e.g. cancer samples, either the molecular pathology for cancer diagnosis or the targeted therapies for cancer treatment focuses on only a few specific loci/biomarkers [2, 5]. Thus, for each sample, a caller is evaluated by calling those biomarkers, and the mean value over a set of samples is sufficient to evaluate its performance for routine work [6–13].
However, the mean value assessment is no longer sufficient for evaluating or guiding the design of a variant caller for counting-based biomarkers. Tumor mutation burden (TMB), defined as the number of somatic mutations per megabase (mut/Mb), is a typical counting-based biomarker. Similarly, tumor homologous recombination deficiency (HRD) is often evaluated by the number of mutational events implying genomic scars. Counting-based biomarkers also include copy number variation burden (CNV burden), micro-satellite instability and chromothripsis. For these counting-based biomarkers, taking TMB as an example, we observed significant fluctuations in calling performance across samples (detailed results are shown in the Results section). Such fluctuations introduce significant errors when optimizing the TMB positive threshold [14–16]. In addition, for a pre-set positive threshold, if the performance on a patient's data differs from the performance on the cohort used to optimize the threshold, a patient who could truly benefit from immunotherapy may be excluded, or conversely [17]. Thus, it is imperative for variant callers to maintain a 'stable' performance across heterogeneous samples.
Why do the existing callers always present performance fluctuations on counting-based biomarkers? First, the assessment indicators are coupled for counting-based biomarkers. For example, in previous scenarios, we often use a series of false-positive filters to control calling performance. For TMB, however, this may be out of control, because removing a false positive sometimes introduces a false negative simultaneously: when a variant caller fails to identify a variant correctly, this does not imply that no variant exists at that locus. Second, different variant callers adopt various calling strategies, e.g. the split-read strategy, read-pair or read-depth strategy, assembly-based strategy and their combinations [18–23]. Many review papers and comparison studies have already pointed out that different calling strategies are usually more advantageous on datasets with specific patterns of data signals [13, 19]. Thus, as it is impossible for all samples to share the same patterns, the existing callers cannot guarantee a 'stable' performance across heterogeneous samples.
Therefore, in this paper, we aim to develop a novel variant caller that is able to maintain stable performance across heterogeneous samples. The existing callers cannot address this issue because their calling strategies, whether one of the strategies above or a combination of them, are pre-set and applied indiscriminately. If the pre-set strategy is mismatched with the pattern of a region, either a false positive or a false negative occurs, and performance fluctuations arise because the number of such mismatches varies across samples [17]. Thus, a potential solution is to develop a caller that not only ensembles the existing strategies but also, more importantly, automatically selects the optimal strategies according to the patterns in genomic regions.
We present TMBstable, which is among the first variant callers designed to control performance variation across samples. In short, given a sequencing sample, TMBstable divides it into a set of regions. For each region, it dynamically selects the optimal calling strategy via a meta-learning framework. Thus, TMBstable differs from existing callers, which adopt the same strategy across an entire sample. By effectively solving the mismatch issue, this design ensures more stable performance. We conducted large-scale validation experiments on simulated and real sequencing data. The results strongly demonstrate the advantage of the 'stable' design and its benefits for counting-based biomarkers.
METHOD
Building on the premise of addressing the limitations of existing variant callers, the subsequent section of this paper delves into the methodological intricacies of TMBstable. We aim to elucidate how TMBstable uniquely employs meta-learning to dynamically adapt its calling strategies, ensuring reliable performance across diverse genomic samples. This Method section will comprehensively detail the algorithm’s design, including the steps involved in segmenting sequencing data, extracting meta-features and implementing the meta-learning framework for optimal strategy selection. By providing these methodological insights, we endeavor to offer a clear understanding of TMBstable’s innovative approach in addressing the challenges of variant calling in heterogeneous samples.
Data
In this study, we employed a total of 300 simulated samples and 106 real tumor samples to train, test and validate the TMBstable model. The simulated samples were generated using GSDcreator software [24], with variant information primarily sourced from The Cancer Genome Atlas (TCGA) database, accessible at http://www.cbioportal.org/. The authentic tumor samples were graciously provided by Nanjing Geneseeq Technology Inc., and the experiment was carried out by the Sun Yat-sen University Cancer Center.
Specifically, the NSCLC patient group was selected for a study spanning from December 2015 to August 2017, which included 95 individuals. Out of these, 60 were eligible for the final phase of analysis, with data collection continuing until January 2019. This cohort primarily consisted of patients diagnosed with adenocarcinoma (60%) and squamous cell carcinoma (32%), with 99% being stage IV at the time of diagnosis. The median age at the start of treatment was 55 years. Notably, 49% of these patients had a history of smoking, and the group showed a higher prevalence of male patients, with a 70% male to 30% female ratio. The observed objective response rate (ORR) for the NSCLC cohort was 19%, and the median progression-free survival (mPFS) was 91 days. The choice of immunotherapeutic agents did not significantly affect the progression-free survival (PFS) outcomes.
Concurrently, the NPC patient group, enrolled from March 2016 to January 2018, included 128 patients who participated in two single-arm phase I trials (NCT02721589 and NCT02593786), with 46 meeting the criteria for final evaluation. The patients underwent extensive analysis of tumor samples using both Whole Exome Sequencing (WES) and panel sequencing, with a preference for formalin-fixed paraffin-embedded (FFPE) samples when available. In this group, the median age at the beginning of treatment was 46 years, and 25% of the patients had a smoking history. There was a higher ratio of male patients, with 80% being male. The NPC cohort had an ORR of 12% and a median PFS of 67.5 days. As with the NSCLC cohort, the choice of immunotherapy did not result in significant variation in PFS.
Both studies involved stringent inclusion and exclusion criteria, focusing on patients aged between 18 and 70 years with minimal disease impact (the Eastern Cooperative Oncology Group (ECOG) performance status score of 0 or 1), confirmed diagnosis of advanced NSCLC or NPC, failure of prior systemic therapy and at least one measurable lesion as per RECIST 1.1 guidelines. Patients with certain medical histories or conditions were excluded. Radiological tumor evaluations were conducted at the outset and every 6 weeks to monitor treatment efficacy and disease progression. The continuation of treatment depended on the absence of disease progression or unacceptable drug toxicity, with PFS calculated from the initial treatment to disease progression or death. Overall survival (OS) was defined from treatment initiation to the patient’s death. The ethical review committee of Sun Yat-sen University Cancer Center endorsed the study, ensuring high ethical standards and adherence to the Declaration of Helsinki. All participants provided written informed consent [14].
TMBstable
TMBstable is a specialized tool designed for the detection of variations using short reads and a meta-learning approach. It comprises six distinct steps. Firstly, the bam file is divided into sub-region windows that encompass specific variation signatures. Secondly, meta-features are extracted from these windows to capture the characteristics associated with caller performance. These features enable discrimination between different sub-regions. Thirdly, all the windows are clustered based on their window features. Fourthly, meta-targets are identified for each cluster, and a meta-model is constructed to establish the relationship between the meta-features and meta-targets. Fifthly, suitable variation callers are assigned to each cluster based on the meta-model and used to detect variants within each window cluster. Sixthly, the calling results of all clusters are merged. The architecture of TMBstable is illustrated in Figure 1. The architecture diagram illustrates the processing workflow where each rectangular box represents a step in the process, with arrows indicating the direction of the workflow. The sections enclosed by dashed lines and annotated in red font emphasize their critical role within the entire processing sequence.

Region segmentation for input file
In this step, TMBstable incorporates a sliding window approach to scan the bam file and partition it into sub-regions that contain variation signatures. Specifically, TMBstable scans each region and computes variation scores for every site within the region. Within each region, a sliding window of length win_range is set, and the mismatch rate of each site s within the window is sequentially calculated. Different sites are assigned varying weights W based on their proximity to the central site. The variant score of the central site is then the weighted sum of the per-site mismatch rates over the window [25]:

$\mathrm{score} = \sum_{s \in \mathrm{window}} W_s \cdot \dfrac{\mathrm{unmapped\ reads}_s}{\mathrm{total\ reads}_s} \qquad (1)$
In Formula 1, unmapped reads represent the count of reads that do not align with the bases of the reference genome among all the reads covering the specific site, while total reads indicate the count of reads that cover the site.
Once the variation score for each site is obtained, we can identify sub-regions containing variation signatures based on this score. To determine whether a region contains a variation signature, we set a minimum threshold M for the variant score. If the variant scores within a sub-region (begin, end) satisfy this threshold condition, we consider it a sub-region that contains variation signatures and extract it from begin to end.
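To make the sliding-window scoring concrete, here is a minimal Python sketch of the weighted mismatch-rate computation and threshold-based sub-region extraction described above; the triangular weighting scheme, the win_range default and the per-site thresholding rule are illustrative assumptions rather than the exact published formula.

```python
import numpy as np

def variant_scores(mismatch_counts, depths, win_range=11):
    """Sliding-window variant score for each site in a region.

    mismatch_counts[i] -- reads at site i that disagree with the reference
    depths[i]          -- total reads covering site i
    Weights decay with distance from the window centre (illustrative choice).
    """
    mismatch_counts = np.asarray(mismatch_counts, dtype=float)
    depths = np.asarray(depths, dtype=float)
    half = win_range // 2
    # per-site mismatch rate; avoid division by zero at uncovered sites
    rates = np.divide(mismatch_counts, depths,
                      out=np.zeros_like(mismatch_counts),
                      where=depths > 0)
    # triangular weights centred on the focal site (assumption)
    weights = 1.0 - np.abs(np.arange(-half, half + 1)) / (half + 1)
    scores = np.zeros_like(rates)
    for i in range(len(rates)):
        lo, hi = max(0, i - half), min(len(rates), i + half + 1)
        w = weights[half - (i - lo): half + (hi - i)]
        scores[i] = np.sum(w * rates[lo:hi])
    return scores

def signature_subregions(scores, min_score):
    """Return (begin, end) intervals whose sites all reach the threshold M."""
    regions, start = [], None
    for i, s in enumerate(scores):
        if s >= min_score and start is None:
            start = i
        elif s < min_score and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores) - 1))
    return regions
```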
Meta-feature extraction for sub-regions
After acquiring the sub-region that contains variations, we will extract meta-features that can discern the performance of the variant caller within the sub-region (a meta-feature is a higher-level attribute derived from base data features, often used to capture more abstract properties of the data for advanced analytical models or meta-learning). Distinct sets of meta-features are extracted for single nucleotide variants (SNVs) and structural variants (SVs). A multitude of review studies have indicated that certain features in sequencing data, such as the size and type of variations, along with the ratio of variations present in tandem repeat regions, are reliable for differentiating the effectiveness of various variant calling methods [13, 22, 23]. Based on a combination of experimental findings and literature reviews, this research has chosen eight meta-features for characterizing sub-regions of SNVs and an additional eight meta-features for the sub-regions of SVs.
We selected meta-features that effectively differentiate the performance of various caller algorithms while minimizing redundancy and noise. The meta-features we ultimately incorporated were specifically chosen for their proven efficacy in distinguishing between the performances of different algorithmic strategies across various genomic regions.
For example, assembly-based strategies adeptly generate longer contig segments through methods such as seed and extend, making them particularly effective in identifying longer variations. Nevertheless, these strategies often rely heavily on the assembly process, which can lead to reduced performance in low-depth sequencing regions where achieving accurate assembly is challenging. Conversely, the split-read approach, which aligns read segments to the reference genome for identifying split reads, is capable of detecting variants in regions of lower sequencing depth. However, this method’s lack of prior assembly might result in less optimal performance when it comes to identifying larger variants.
Under these circumstances, features like sequencing depth and regional context play a pivotal role in determining the model’s effectiveness. They enable the model to tailor its approach, leveraging the strengths and mitigating the weaknesses of various strategies. This adaptability ensures more precise variant detection in diverse scenarios, enhancing the overall robustness and applicability of our model. Through this strategic incorporation of key features, our model not only achieves high accuracy in variant detection but also remains efficient and streamlined, avoiding the pitfalls of unnecessary complexity.
These meta-features span both the variation signature-level and the sequencing data-level. For an SNVs sub-region, the variation signature-level meta-features include mapping quality, region context, insertion ratio, deletion ratio and substitution ratio. The sequencing data level meta-features consist of read length, sequencing depth (refers to the number of times a particular region of the genome has been sequenced, indicating the amount of coverage and data reliability for that region) and sequencing quality (refers to the accuracy and reliability of the nucleotide base calls in sequencing data, often assessed using quality scores that predict the probability of errors in the sequence). In the case of an SVs sub-region, the variation signature-level meta-features encompass sub-region length, insert size (refers to the length of the DNA fragment inserted between sequencing adapters, crucial for understanding the spacing between paired-end reads in genome sequencing), region context, insertion ratio, deletion ratio and substitution ratio. The sequencing data level meta-features comprise read length and sequencing depth. Tables 1 and 2 provide summaries of the meta-features for SNVs sub-regions and SVs sub-regions, respectively.
Table 1. Meta-features for SNVs sub-regions

| Level | Meta-features |
| --- | --- |
| Sequencing data level | Read length, sequencing depth, sequencing quality |
| Variation signature level | Mapping quality, region context, insertion ratio, deletion ratio, substitution ratio |

Table 2. Meta-features for SVs sub-regions

| Level | Meta-features |
| --- | --- |
| Sequencing data level | Read length, sequencing depth |
| Variation signature level | Sub-region length, insert size, region context, insertion ratio, deletion ratio, substitution ratio |
The tables above organize the information in two columns to clearly display the features at different levels of the sub-regions. The first column indicates the level to which the meta-features belong, while the second column lists the specific meta-features included at that level.
Algorithm 1 outlines the pseudocode for extracting meta-features from sub-regions. The algorithm takes a SAM/BAM file, I, and a tandem repeat regions file, TR, as input and generates the sub-region meta-feature vector, O, as output.
Require: I is the SAM/BAM file; TR is the tandem repeat regions file
Ensure: O
1: f_arr = []
2: tr_arr = []
3: softclips = []
4: breakpoints = []
5: readlen_sum = 0
6: f_arr ← I
7: tr_arr ← TR
8: depth = calculateDepth(f_arr)
9: for i = 1 to n do
10: readlen = f_arr[i].len()
11: readlen_sum = readlen_sum + readlen
12: if f_arr[i].include('S') then
13: softclips.add(f_arr[i])
14: else
15: continue
16: end if
17: end for
18: ave_readlen = readlen_sum / n
19: for j = 1 to m do
20: breakpoint = calculateBP(softclips[j])
21: breakpoints.add(breakpoint)
22: end for
23: region_size = calculateRegionSize(breakpoints)
24: regionContext = calculateRegionContext(tr_arr, breakpoints)
25: sequencingQuality = calculateSequencingQuality(f_arr)
26: mappingQuality = calculateMappingQuality(f_arr)
27: insertionRatio = calculateInsertionRatio(f_arr)
28: deletionRatio = calculateDeletionRatio(f_arr)
29: substitutionRatio = calculateSubstitutionRatio(f_arr)
30: insertSize = calculateInsertSize(f_arr)
31: O.add(depth, ave_readlen, region_size, regionContext, insertionRatio, deletionRatio, substitutionRatio, insertSize)
32: Return O
Lines 1–7 in Algorithm 1 initialize f_arr, tr_arr, softclips, breakpoints and readlen_sum. The f_arr array stores the reads from the SAM/BAM file, while tr_arr stores the regions from the tandem repeat regions file. The softclips array is used to store soft-clipped reads, and the breakpoints array saves the breakpoints of these reads. The readlen_sum variable accumulates the total length of the reads. In lines 6–7, the SAM/BAM file I and the tandem repeat regions file TR are read and stored in the arrays f_arr and tr_arr, respectively.
In line 8, the calculateDepth function computes the sequencing depth, which is then assigned to the variable depth. Lines 9–18 iterate over the reads, updating the readlen_sum value, saving soft-clipped reads in the softclips array and finally computing the average read length. Lines 19–22 iterate over each soft-clipped read independently, calculate its breakpoint and store it in the breakpoints array.
Lines 23–30 compute the remaining meta-features, namely sub-region length, region context, sequencing quality, mapping quality, insertion ratio, deletion ratio, substitution ratio and insert size, using the corresponding calculation functions. Finally, the meta-feature vector O is assembled in line 31 and returned in line 32.
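As a rough illustration of what Algorithm 1 formalizes, the sketch below collects a handful of the listed meta-features for one sub-region with pysam; it is a simplified stand-in for TMBstable's own extraction code, and the depth approximation and ratio definitions are our assumptions.

```python
import numpy as np
import pysam

def subregion_meta_features(bam_path, contig, begin, end):
    """Collect a few SV-level meta-features for one sub-region (simplified)."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    read_lengths, map_quals, insert_sizes = [], [], []
    n_ins = n_del = n_soft = n_reads = 0
    for read in bam.fetch(contig, begin, end):
        if read.is_unmapped or read.is_secondary:
            continue
        n_reads += 1
        read_lengths.append(read.query_length)
        map_quals.append(read.mapping_quality)
        if read.is_proper_pair and read.template_length != 0:
            insert_sizes.append(abs(read.template_length))
        # CIGAR operations: 1 = insertion, 2 = deletion, 4 = soft clip
        for op, length in (read.cigartuples or []):
            if op == 1:
                n_ins += 1
            elif op == 2:
                n_del += 1
            elif op == 4:
                n_soft += 1
    # approximate depth as aligned bases divided by sub-region length (assumption)
    depth = sum(read_lengths) / max(end - begin, 1)
    return {
        "read_length": float(np.mean(read_lengths)) if read_lengths else 0.0,
        "sequencing_depth": depth,
        "sub_region_length": end - begin,
        "insert_size": float(np.median(insert_sizes)) if insert_sizes else 0.0,
        "mapping_quality": float(np.mean(map_quals)) if map_quals else 0.0,
        "insertion_ratio": n_ins / max(n_reads, 1),
        "deletion_ratio": n_del / max(n_reads, 1),
        "softclip_ratio": n_soft / max(n_reads, 1),
    }
```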
Sub-regions clustering
After completing the previous steps, we now have the cut windows and their corresponding window features. In this next step, we will cluster all the windows based on these window features. We utilize spectral clustering [26] to construct a weighted, undirected graph comprising all the windows, subsequently segmenting it into several optimal subgraphs. The aim here is to ensure that each subgraph is homogeneous internally, yet distinctly separated from others in terms of distance. In this graph, the weight of each edge signifies the spatial proximity between any two windows. Edges connecting nearer windows are assigned greater weights, implying closer association, whereas those linking more distant windows bear lesser weights, indicating a weaker connection. Figure 2 illustrates the abstract model of window clustering. The specific steps involved are as follows, with a minimal code sketch after the list:

(i) The affinity matrix is constructed by leveraging the similarity measures between the windows, which are based on the various window meta-features. Each element of the matrix quantifies the similarity between a pair of windows, with higher values indicating greater similarity.
(ii) From this affinity matrix, a Laplacian matrix is then formulated. This matrix is a mathematical representation that is used to describe the graph structure of the data. The Laplacian matrix is essential for understanding the global structure of the graph and is key to the spectral clustering algorithm.
(iii) The eigenvectors associated with the smallest non-zero eigenvalues of the Laplacian matrix are calculated. The eigenvector for the second-smallest eigenvalue is known as the Fiedler vector and plays a pivotal role in identifying the clusters. The Fiedler vector can be thought of as providing a view of the graph that highlights the parts that can be separated with minimal cuts.
(iv) These eigenvectors serve as the new feature vectors for the sample windows, enabling the delineation of window clusters. Essentially, the new features allow the algorithm to identify which windows should be grouped together into the same cluster based on their similarities, ultimately enabling the division of the graph into multiple optimal subgraphs.
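A compact way to realize the four steps above is scikit-learn's SpectralClustering on a precomputed affinity matrix; the RBF similarity, the gamma value and the cluster count below are placeholder choices rather than the settings used by TMBstable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

def cluster_windows(meta_features, n_clusters=5, gamma=0.5):
    """Cluster sub-region windows from their meta-feature vectors.

    meta_features -- array of shape (n_windows, n_features)
    Returns an integer cluster label per window.
    """
    X = StandardScaler().fit_transform(np.asarray(meta_features, dtype=float))
    # affinity matrix: windows that are close in feature space get larger edge weights
    affinity = rbf_kernel(X, gamma=gamma)
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed",
                               assign_labels="kmeans",
                               random_state=0)
    return model.fit_predict(affinity)
```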
Model training and testing
In this step, we selected seven commonly used algorithms, namely, bcftools, freebayes, gatk, Delly, Manta, Pindel and SvABA, to construct our model. These algorithms are utilized for detecting different types of variations. Specifically, bcftools, freebayes and gatk are employed for SNV detection, while Delly, Manta, Pindel and SvABA are used for SV detection. Each caller is executed on all datasets individually. Subsequently, all combinations of callers are considered to determine the minimum false positive rate (FPR) callers set and the minimum false negative rate (FNR) callers set for each dataset. These sets serve as the labels or meta-targets in the meta-learning approach [27–30].
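To make the meta-target construction concrete, the following sketch enumerates SNV-caller/SV-caller combinations for one dataset and records the combination with the lowest FPR and the one with the lowest FNR; the evaluate_combination helper is hypothetical and stands in for running the callers and scoring their output against the truth set.

```python
from itertools import product

SNV_CALLERS = ["bcftools", "freebayes", "gatk"]
SV_CALLERS = ["Delly", "Manta", "Pindel", "SvABA"]

def meta_targets_for_dataset(dataset, evaluate_combination):
    """Return the (SNV caller, SV caller) pairs with minimum FPR and minimum FNR.

    evaluate_combination(dataset, snv, sv) -> (fpr, fnr)  # hypothetical helper
    """
    best_fpr, best_fnr = None, None
    min_fpr, min_fnr = float("inf"), float("inf")
    for snv, sv in product(SNV_CALLERS, SV_CALLERS):
        fpr, fnr = evaluate_combination(dataset, snv, sv)
        if fpr < min_fpr:
            min_fpr, best_fpr = fpr, (snv, sv)
        if fnr < min_fnr:
            min_fnr, best_fnr = fnr, (snv, sv)
    return {"min_fpr_set": best_fpr, "min_fnr_set": best_fnr}
```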
Once the meta-targets are identified, a 10-fold cross-validation technique is employed to divide the data into training, testing and validation sets. This allows us to train the meta-model and establish the relationship between the meta-features and the meta-targets (a meta-target in machine learning and data analysis refers to a higher-level goal or outcome derived from underlying data, used to guide the learning process in a meta-learning framework).
For the meta-model training process, we adopt a multi-instance multi-label learning approach. To fully capitalize on the complementary relationships between different callers during the multi-label learning training, we employ the Label Powerset approach to construct the model [31]. Finally, we evaluate the performance of the model using four widely accepted metrics in multi-label learning: f1_score, hamming_loss, accuracy_score and zero_one_loss [32–35].
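The Label Powerset transformation can be sketched with scikit-learn alone by encoding each observed label combination as a single class; the random-forest base learner below is an arbitrary stand-in for the base classifier actually used, and the metric choices follow the four listed above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, hamming_loss, accuracy_score, zero_one_loss

def label_powerset_cv(X, Y, n_splits=10, seed=0):
    """Cross-validate a Label Powerset meta-model.

    X -- meta-features, shape (n_windows, n_features)
    Y -- binary meta-targets, shape (n_windows, n_labels)
    """
    X, Y = np.asarray(X), np.asarray(Y)
    runs = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        # encode each observed label combination (a row of Y) as one class id
        combos = {tuple(row): i for i, row in enumerate(np.unique(Y[train], axis=0))}
        y_train = np.array([combos[tuple(row)] for row in Y[train]])
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        clf.fit(X[train], y_train)
        # decode predicted class ids back into label vectors
        decode = {i: np.array(combo) for combo, i in combos.items()}
        Y_pred = np.vstack([decode[c] for c in clf.predict(X[test])])
        runs.append({
            "f1_score": f1_score(Y[test], Y_pred, average="micro"),
            "hamming_loss": hamming_loss(Y[test], Y_pred),
            "accuracy_score": accuracy_score(Y[test], Y_pred),
            "zero_one_loss": zero_one_loss(Y[test], Y_pred),
        })
    return runs
```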
Variation callers assigning and results merging
In this step, we feed the window clusters and their meta-features obtained in the preceding stages into the trained meta-model to determine the most suitable caller(s) for each window cluster. This allows us to allocate an appropriate caller to each cluster and run variant detection cluster by cluster. Once the detection process is concluded, the calls from all window clusters are merged, and the merged results constitute the final output of the methodology, providing a holistic view of the variant detection.
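A minimal sketch of this dispatch-and-merge step is shown below; the extract_features and run_caller helpers are hypothetical placeholders for the feature extraction and the wrapped caller invocations.

```python
def call_and_merge(clusters, meta_model, extract_features, run_caller):
    """Assign callers to each window cluster and merge their calls.

    clusters         -- {cluster_id: [window, ...]}
    meta_model       -- trained meta-model exposing a .predict() method
    extract_features -- hypothetical helper: windows -> feature vector
    run_caller       -- hypothetical helper: (caller_name, windows) -> list of variant dicts
    """
    merged = []
    for cluster_id, windows in clusters.items():
        features = extract_features(windows)
        callers = meta_model.predict([features])[0]   # e.g. ("freebayes", "Manta")
        for caller in callers:
            merged.extend(run_caller(caller, windows))
    # de-duplicate variants reported by more than one caller (simplified)
    unique = {(v["chrom"], v["pos"], v["ref"], v["alt"]): v for v in merged}
    return sorted(unique.values(), key=lambda v: (v["chrom"], v["pos"]))
```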
For a more detailed description and explanation of TMBstable’s core modules, please refer to the Supplementary Algorithm section in the supplementary materials.
RESULTS
After detailing the methodology in the previous sections, we now transition to presenting the results of our study. We conducted an extensive evaluation of seven advanced variant callers (bcftools, freebayes, gatk, Delly, Manta, Pindel and SvABA) using 300 simulated samples and 106 real tumor samples from NSCLC and NPC patients. The raw reads (FASTQ) of the real samples were aligned to the hg19 reference genome using the Burrows–Wheeler Aligner (BWA) [36]. The samples were then analyzed for SNV and SV detection to measure key metrics, namely FPR, FNR, precision and recall, allowing us to assess the stability and accuracy of TMBstable against the other callers. Our testing, using both simulated and real-world data, demonstrated TMBstable's superior performance in variant detection, particularly in the complex genomic scenarios of cancer research. The validation datasets were distinct from the training datasets to ensure unbiased and robust results. For specific model evaluation indicators and detailed descriptions of the methods used, please refer to the final section of the supplementary materials.
Variation detection on simulated data
To evaluate the FPR, FNR, precision and recall of TMBstable, we utilized various simulated datasets with different data features. Specifically, we generated a total of 300 simulated short-read sequencing datasets using the GSDcreator simulator [24]. The variants present in these simulated datasets were primarily derived from the TCGA Database (http://www.cbioportal.org/).
To create samples with diverse variation distributions, we introduced multiple types of variations and generated reads of varying lengths with different sequencing depths. Additionally, we manipulated the density of variations by adjusting the distance between them. Moreover, we controlled the number of variations within the tandem repeat region of the genome to alter the proportion of variations in that specific region. We defined true called variations as those that significantly overlapped with the reference variations by a minimum proportion of 80%.
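The 80% overlap criterion can be checked, for interval-type variants, with a reciprocal-overlap test such as the one below (SNVs reduce to exact position and allele matching); this reflects our reading of the criterion rather than the exact evaluation script.

```python
def reciprocal_overlap(called, reference, min_fraction=0.8):
    """True if two intervals overlap by >= min_fraction of each interval's length."""
    start = max(called[0], reference[0])
    end = min(called[1], reference[1])
    overlap = max(0, end - start)
    called_len = called[1] - called[0]
    ref_len = reference[1] - reference[0]
    if called_len == 0 or ref_len == 0:
        return False
    return (overlap / called_len >= min_fraction and
            overlap / ref_len >= min_fraction)

# example: a 1000 bp deletion called 150 bp off target still counts as true (0.85 overlap)
print(reciprocal_overlap((10_150, 11_150), (10_000, 11_000)))  # True
```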
The benchmark results presented in Figure 3 clearly illustrate that TMBstable, represented by the deep blue region, forms the smallest area in the radar chart, indicating significantly lower coefficients of variation and variances for the FPR, FNR, precision and recall.

The benchmark results depict the performance of various variation callers on different simulated datasets. The performance metrics include the coefficients of variation for the FPR, FNR, precision and recall, as well as the variances for the FPR, FNR, precision and recall. It should be noted that, contrary to the conventional indicator evaluation criteria, lower values for the variance and coefficient of variation metrics signify better performance.
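The stability measures reported in the radar charts are plain per-metric statistics computed across samples; a small helper like the following (the naming is ours) derives them from per-sample results.

```python
import numpy as np

def stability_measures(per_sample_metrics):
    """Variance and coefficient of variation of each metric across samples.

    per_sample_metrics -- {"FPR": [...], "FNR": [...], "precision": [...], "recall": [...]}
    """
    out = {}
    for name, values in per_sample_metrics.items():
        v = np.asarray(values, dtype=float)
        mean = v.mean()
        out[name] = {
            "variance": v.var(ddof=1),                              # sample variance
            "cov": v.std(ddof=1) / mean if mean else float("nan"),  # coefficient of variation
        }
    return out
```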
To further explain the advantages of TMBstable over existing software, we calculated its relative performance improvement in numerical terms. Our comparative analysis reveals that even against the most stable software on this batch of simulated data, freebayes+Delly, TMBstable still demonstrates a 1.55% lower coefficient of variation of the FNR (FNRcov). This finding not only confirms the exceptional stability of TMBstable but also showcases its significant advantage in relative comparisons.
This superiority stems from the meta-learning technology employed by TMBstable. Under the meta-learning framework, TMBstable utilizes a pre-trained meta-model to dynamically allocate the most suitable variant detection algorithm for different sub-regions. Specifically, in this optimization process, the objective function L is defined as the performance loss in scenarios where the algorithm and data are mismatched:

$L(A) = \dfrac{1}{n}\sum_{i=1}^{n}\mathrm{loss}\left(y_i, \hat{y}_i\right)$

Here, n represents the total number of data points, $y_i$ is the actual label of the ith data point and $\hat{y}_i$ is the model's prediction for the ith data point under the algorithm set A. The goal of meta-learning is to find a set of algorithms A that minimizes L. This is achieved by evaluating the loss function across different combinations of algorithms and selecting the combination that minimizes it:

$A^{*} = \arg\min_{A} L(A)$
In TMBstable, this process involves choosing among various combinations of SNV detection software and SV detection software. This approach enables TMBstable to effectively adapt to various data characteristics and maintain performance stability across different scenarios. This method not only reduces performance fluctuations due to mismatches between algorithms and data but also enhances the overall stability of performance, thus demonstrating significant advantages, particularly in complex and variable bioinformatics data analysis.
In summary, the experimental results clearly demonstrate that TMBstable exhibits a high level of stability as a variation caller, with minimal fluctuations in FPR, FNR, precision and recall across diverse and distinct samples. This characteristic is particularly valuable for setting accurate thresholds in the context of immunotherapy applications.
Furthermore, we computed the TMB values for each simulated sample and compared them with the actual TMB values of the samples. We used the root mean square error (RMSE) to quantify the disparity between the predicted values and the actual values. A smaller RMSE indicates a closer match between the predicted values and the actual values, signifying better model performance.
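For reference, TMB in mut/Mb and the RMSE between predicted and actual values can be computed as below; the panel size is an assay-specific constant supplied by the user.

```python
import numpy as np

def tmb(n_somatic_mutations, panel_size_bp):
    """Tumor mutation burden in mutations per megabase."""
    return n_somatic_mutations / (panel_size_bp / 1e6)

def rmse(predicted, actual):
    """Root mean square error between predicted and actual TMB values."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# example: 15 somatic mutations detected over a 1.5 Mb panel -> 10 mut/Mb
print(tmb(15, 1_500_000))
```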
In the benchmark results shown in Figure 4, it is evident that the TMB values calculated using TMBstable detection results closely approximate the actual TMB values.

The benchmark result depicts the RMSE of various variation callers on different simulated datasets. The figure is divided into 12 panels, labeled as (A–L), respectively. Each panel represents a combination of SNV detection software and SV detection software. These combinations include bcftools+Delly, bcftools+Manta, bcftools+Pindel, bcftools+SvABA, freebayes+Delly, freebayes+Manta, freebayes+Pindel, freebayes+SvABA, gatk+Delly, gatk+Manta, gatk+Pindel and gatk+SvABA. Panel (I) specifically corresponds to TMBstable. It should be noted that, contrary to the conventional indicator evaluation criteria, lower values for the RMSE metrics signify better performance.
To further explain the advantages between TMBstable and existing software, we calculated the RMSE decrease of TMBstable compared to existing software on numerical values. We found that the RMSE was reduced by at least 3.44 and up to 7.25 when compared to existing software. This clearly shows that TMBstable is significantly closer to the actual values than existing software in terms of TMB value prediction.
This is attributable to the meta-learning technology employed by TMBstable, which enables it to dynamically select the most suitable variant detection algorithm based on the distinct characteristics of individual samples. This means that TMBstable can more effectively adapt to the heterogeneity and complexity of the samples, thereby reducing erroneous detections caused by mismatches between the algorithm and the sample characteristics.
In summary, the experimental results clearly demonstrate that the TMB values calculated based on TMBstable’s detection results are the closest to the actual TMB values when compared to other callers.
Variation detection on real data
Furthermore, we benchmarked TMBstable, bcftools, freebayes, gatk, Delly, Manta, Pindel and SvABA with the real tumor sequencing data. We defined true called variations as the called variations that significantly overlap with the reference variations by proportions (≧80%). These tumor samples encompassed a total of 106 cases, including both NSCLC and NPC samples. A detailed description of this dataset is provided in the Method section. The experimental results are shown in Figure 5.

The benchmark results depict the performance of various variation callers on different real tumor datasets. The performance metrics include the coefficients of variation for the FPR, FNR, precision and recall, as well as the variances for the FPR, FNR, precision and recall. The boxes, each representing a different combination of SNV detection caller and SV detection caller, are marked from a to l to indicate bcftools+Delly, bcftools+Manta, bcftools+Pindel, bcftools+SvABA, freebayes+Delly, freebayes+Manta, freebayes+Pindel, freebayes+SvABA, gatk+Delly, gatk+Manta, gatk+Pindel and gatk+SvABA, respectively. The box labeled m represents TMBstable. It should be noted that, contrary to the conventional indicator evaluation criteria, lower values for the variance and coefficient of variation metrics signify better performance.
From Figure 5, it is evident that TMBstable, indicated by the deep blue region, also forms the smallest area on the radar chart. It not only demonstrates variance values that are competitive with other software but also achieves the minimum coefficient of variation. Notably, TMBstable exhibits significantly lower coefficients of variation for both FPR and FNR compared to all other software, making it highly suitable for practical immunotherapy applications.
Similarly, we calculated the relative performance improvement of TMBstable over existing software in numerical terms on the real data. We find that even when compared to bcftools+SvABA, which performed most stably on this batch of real data, TMBstable still demonstrated a 1.12% decrease in the coefficient of variation of precision (PRECISIONcov). Compared to bcftools+Manta, which exhibited the most stable average performance across all metrics on this batch of real data, the corresponding reduction was 16.32%.
Continuing from the analysis of real data samples, it is noteworthy that although the decrease in the performance metric for TMBstable is less pronounced in real data compared to simulated data, it still indicates a significant performance enhancement. This slightly lesser decline in real data as opposed to simulated data can be attributed to the inherent heterogeneity of tumor samples. Tumor samples often exhibit a high degree of genetic diversity and complexity. This heterogeneity can introduce additional challenges in variant detection, as it may affect the consistency and accuracy of the algorithms. In the case of real tumor samples, the effectiveness of a variant calling algorithm can be influenced by a range of factors, such as the presence of subclonal mutations, varying tumor purity and the complex interplay of genomic alterations. These factors can cause more variability in the performance of variant detection algorithms on real data compared to more controlled and uniform simulated data. TMBstable, with its application of meta-learning techniques, attempts to address these challenges by dynamically allocating suitable algorithms for different data characteristics. However, the increased complexity and variability inherent in real tumor samples might limit the extent to which performance can be optimized, leading to a slightly smaller reduction compared to simulated data.
Specifically, let us consider the performance measure of an algorithm as a function P(d, A), where d represents a data sample and A represents an algorithm. In an ideal scenario with simulated data, this function might have a smoother landscape, implying a more predictable and uniform response to optimization techniques. The optimization goal can be represented as:

$A_{opt} = \arg\min_{A} P\left(d_{sim}, A\right)$

Here, $d_{sim}$ represents simulated data samples, and $A_{opt}$ is the algorithm configuration that minimizes the performance measure P.

However, for real tumor samples, the function P(d, A) becomes more complex due to the inherent variability and heterogeneity. This complexity can be thought of as adding a 'noise' component to the performance measure, thereby making the optimization landscape more rugged. The performance measure for real data might look like:

$P_{real}\left(d_{real}, A\right) = P\left(d_{real}, A\right) + \varepsilon\left(d_{real}, A\right)$

where $\varepsilon\left(d_{real}, A\right)$ represents the additional variability introduced by the complexity of real tumor samples. This additional term can lead to local minima or less predictable behavior in the optimization process; hence, the optimization in real data can be represented as:

$A_{opt\text{-}real} = \arg\min_{A}\left[P\left(d_{real}, A\right) + \varepsilon\left(d_{real}, A\right)\right]$

The optimization process in real data, represented by $A_{opt\text{-}real}$, may not achieve the same level of performance improvement as in simulated data due to this added complexity.
Despite this, the observed superiority in TMBstable’s performance in real data samples is notable. It showcases the software’s ability to adapt to the complex and diverse nature of real tumor samples, providing more accurate and consistent results. This adaptability is crucial in the context of precision medicine, where stable variant detection plays a pivotal role in personalized treatment strategies.
Additionally, we also calculated the TMB values for each real sample based on the variant detection results and compared them with the actual TMB values of the samples. As shown in Figure 6, we can draw conclusions consistent with those on simulated data.

The benchmark results depict the RMSE of various variation callers on different real datasets. The figure is divided into 12 panels, labeled as (A–L). Each panel represents a combination of SNV detection software and SV detection software. These combinations include bcftools+Delly, bcftools+Manta, bcftools+Pindel, bcftools+SvABA, freebayes+Delly, freebayes+Manta, freebayes+Pindel, freebayes+SvABA, gatk+Delly, gatk+Manta, gatk+Pindel and gatk+SvABA. Panel (I) specifically corresponds to TMBstable. It should be noted that, contrary to the conventional indicator evaluation criteria, lower values for the RMSE metrics signify better performance.
Similarly, we conducted a calculation of the reduction in RMSE for TMBstable as compared to existing software, focusing on numerical values derived from real data. Our analysis revealed that TMBstable attained a reduction in RMSE ranging from a minimum of 2.55 to a maximum of 6.42, compared to existing software’s performance. However, the degree of performance enhancement in TMBstable with real data, relative to simulated studies, was slightly less pronounced. This can be primarily attributed to the high complexity and unpredictability inherent in real-world datasets. Contrasting with the idealized and controlled conditions of simulated data, the plethora of unknown variables and complex scenarios in real data intensify the challenges in algorithmic optimization. TMBstable, utilizing meta-learning techniques, strives to navigate these complexities. Yet, the intrinsic variability and irregularities in real data continue to impede the full maximization of algorithmic efficiency, leading to more subdued improvements compared to those seen in simulated settings. Despite this, the observed advancement in TMBstable’s performance within real-world scenarios remains significant and commendable.
In conclusion, for both the real-data and the simulated-data experiments, it is evident that the TMB values calculated from TMBstable's variant detection results are the closest to the actual values.
Finally, we conducted a more extensive comparison of TMB classifications for categorizing patients into TMB-negative and TMB-positive groups between TMBstable and other variant callers. Additionally, we created survival curves to provide a visual representation of patient groupings. As shown in Figure 7, the figure displays the results for TMBstable and bcftools+Delly, while the results for other callers can be found in Supplementary Figure S1.

Comparison of survival curves for TMB values calculated under different callers. In the figure, the horizontal axis represents months, and the vertical axis represents survival rate. The survival curves distinctly separate the patients into TMB-negative and TMB-positive groups. Within these curves, high TMB (TMB-positive) indicates that these individuals have a number of tumor mutations exceeding a predefined threshold. High TMB values are generally considered to be potentially related to the response rate of cancer immunotherapy. Conversely, low TMB (TMB-negative) represents those who have fewer tumor mutations. Crosses indicate censored data points where patients were lost to follow-up or the study ended before their death.
From Figure 7, we can observe that TMB values calculated from TMBstable detection results effectively differentiate patients, while TMB values calculated from bcftools+Delly detection results show a less effective differentiation. In the Supplementary, we provide survival curve comparisons for several other callers, which yield similar results. Additionally, we calculated the P-values and found that TMB values calculated from TMBstable effectively differentiate patients, with P-values of 0.01286 in the NSCLC cohort and 0.01971 in the NPC cohort. This indicates a significant difference in the survival curves between TMB-positive and TMB-negative patients. In contrast, the P-values for survival curves obtained from bcftools+Delly are all above 0.05, indicating no significant differences.
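The TMB-positive versus TMB-negative comparison amounts to splitting patients at a TMB threshold, drawing Kaplan–Meier curves and applying a log-rank test; the sketch below uses the lifelines package, with the threshold left as a user-chosen parameter.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def compare_tmb_groups(tmb_values, pfs_months, event_observed, threshold):
    """Split patients at a TMB threshold and test for a survival difference."""
    tmb_values = np.asarray(tmb_values, dtype=float)
    pfs_months = np.asarray(pfs_months, dtype=float)
    event_observed = np.asarray(event_observed, dtype=int)  # 1 = progression/death, 0 = censored
    positive = tmb_values >= threshold

    km_pos, km_neg = KaplanMeierFitter(), KaplanMeierFitter()
    km_pos.fit(pfs_months[positive], event_observed[positive], label="TMB-positive")
    km_neg.fit(pfs_months[~positive], event_observed[~positive], label="TMB-negative")
    # km_pos.plot_survival_function(); km_neg.plot_survival_function()  # draws the curves

    result = logrank_test(pfs_months[positive], pfs_months[~positive],
                          event_observed_A=event_observed[positive],
                          event_observed_B=event_observed[~positive])
    return result.p_value
```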
Additionally, we conducted further evaluations on computational efficiency, as detailed in the supplementary file. Upon scrutinizing the experimental outcomes, we observed that despite the incorporation of multiple procedures such as sample segmentation, window clustering and meta-feature extraction into our variant detection process, there was no significant increase in time and memory consumption compared to widely employed callers. In summary, the experimental results indicate that TMBstable demonstrates considerable utility in practical application scenarios.
DISCUSSION
Following the presentation of our results, where we assessed the performance of TMBstable, we now shift our focus to a detailed discussion of the findings. In the proposed approach, several options exist for the multi-label learning algorithm, and we discuss these considerations in this section. We evaluated the classification accuracy of our method using four commonly used multi-label learning evaluation metrics: F1-score, Hamming_loss, Accuracy_score and Zero-one_loss. To ensure comprehensive analysis, we employed a 10-fold cross-validation method for our experiments.
We compared three well-established multi-label learning approaches, the Label Powerset approach (LP), the Classifier Chains approach (CC) and the Binary Relevance approach (BR), across two distinct meta-targets: Precision and Recall. The results of this comparison are illustrated in Figure 8.

(A–D) present the F1-score, Hamming_loss, Accuracy_score and Zero-one_loss of the model in relation to the meta-target Precision. Similarly, (E–H) showcase the F1-score, Hamming_loss, Accuracy_score and Zero-one_loss of the model with respect to the meta-target Recall. For each metric, the boxes represent the interquartile range, with the central line indicating the median value. The horizontal lines positioned above and below the boxes represent the maximum and minimum values within the data, respectively. Discrete black dots are used to denote any outliers that may exist.
F1-score is a metric for gauging the accuracy of a model, taking into account both precision and recall. A higher F1-score suggests better performance. In the box plot, the line in the middle of the box represents the median F1-score, while the top and bottom edges of the box depict the upper and lower quartiles, respectively. The ‘whiskers’ extending from the box illustrate the range of the data, and any individual black dots denote outliers, which are data points that deviate from the typical range. Hamming_loss measures the proportion of labels that are incorrectly predicted; thus, a lower value is indicative of better performance. The layout of the box plot is similar to the previous, with the median, quartiles and potential outliers clearly displayed. Accuracy_score is the proportion of instances correctly predicted over the total instances. A higher Accuracy_score denotes improved performance. The median line and quartiles in the box plot reflect the central tendency and dispersion of accuracy scores across various experiments. Zero-one_loss is a measure of the proportion of samples that have at least one label incorrectly predicted. A lower Zero-one_loss is preferable as it indicates fewer instances with incorrect predictions. The box plot follows the same structure as the previous charts to demonstrate the distribution of Zero-one_loss scores. Across all metrics, the LP displays a general trend of superior performance, with the central median lines often residing toward more favorable values (higher for F1-score and Accuracy_score, lower for Hamming_loss and Zero-one_loss). The box plots reveal the variability in performance within each approach across different runs or dataset partitions, with the range indicated by the whiskers and the presence of outliers showcasing the stability or volatility of the method.
Based on the observations from Figure 8, it is evident that the choice of multi-label learning approach significantly impacts the performance of the constructed meta-models, with the Label Powerset approach yielding more robust and higher-quality predictions in this evaluation. These findings provide empirical support for the theoretical advantages outlined in the method section for the Label Powerset approach.
Furthermore, to assess the impact of meta-features on model performance, we conducted feature ablation experiments. Meta-features at both the variant signature level and sequencing data level were removed, and the changes in performance of the multi-label classification model were observed. The experimental results are presented in Figure 9.

Feature ablation comparison figure. (A) represents the comparison of model performance after removing features at the sequencing data level with model performance under the original features. (B) represents the comparison of model performance after removing features at the variant signature level with model performance under the original features.
These two radar charts are a visual representation of the model’s robustness and the contribution of different levels of meta-features. The closer the colored line is to the outer edge of the radar chart, the better the performance is for the corresponding metric. Conversely, a line closer to the center indicates worse performance. The presence of the colored lines inside the area defined by the black line in both charts clearly demonstrates that the model relies on the full set of meta-features to achieve optimal performance.
From Figure 9, it can be observed that the removal of either meta-features at the variant signature level or at the sequencing data level results in a slight performance decrease. Only by using all meta-features can the best model performance be achieved.
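The ablation procedure amounts to retraining the meta-model with one meta-feature group masked and re-scoring it; the sketch below follows that recipe, reusing a cross-validation routine such as the one sketched in the Method section, with illustrative column indices for the feature groups.

```python
import numpy as np

def ablate_feature_group(X, Y, group_columns, cv_fn):
    """Compare meta-model performance with and without one meta-feature group.

    X, Y          -- meta-features and binary meta-targets
    group_columns -- column indices of the group to remove (e.g. sequencing-data level)
    cv_fn         -- cross-validation routine returning a list of metric dicts
    """
    X = np.asarray(X)
    keep = [c for c in range(X.shape[1]) if c not in set(group_columns)]
    full_runs = cv_fn(X, Y)
    ablated_runs = cv_fn(X[:, keep], Y)
    mean_of = lambda runs, key: float(np.mean([r[key] for r in runs]))
    return {key: {"full": mean_of(full_runs, key), "ablated": mean_of(ablated_runs, key)}
            for key in full_runs[0]}
```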
In conclusion, the experimental results above indicate that the current selection of meta-features and meta-models can yield relatively optimal multi-label classification performance, achieving the best model effectiveness.
Additionally, we further explored the generalizability and limitations of our algorithm. We emphasized the capability of TMBstable to handle a diverse range of samples, achieved through a series of steps including window-based segmentation of samples, extraction of meta-features from these windows and clustering based on these features. While this approach offers robust performance in most scenarios, it can face challenges when dealing with extreme cases.
This is particularly evident in the context of atypical cancer samples, such as those with very high heterogeneity or a minimal number of mutations, which may result in a very low number of clusters in our window-based clustering method. Such scenarios are likely to occur in certain highly aggressive or advanced-stage cancers, where the genetic landscape is exceptionally complex, or in early-stage cancers with sparse mutations. In these instances, TMBstable might not only lack a significant performance advantage compared to existing software fitted to these specific clusters but may also incur additional memory and time consumption due to the processes of window segmentation, meta-feature extraction and clustering.
We recognize these limitations as inherent challenges in the field of bioinformatics, particularly in the diverse and complex realm of cancer genomics. Although such extreme cases are relatively rare and the majority of samples do not exhibit these issues, we are still continually working to enhance TMBstable to better handle these extreme scenarios and to reduce the computational burden.
CONCLUSION
In this study, we present TMBstable, an innovative algorithm designed for the stable detection of variants, with a particular focus on improving TMB calculations. Employing a meta-learning framework, TMBstable begins by segmenting sequencing data into windows, clustering similar regions and dynamically selecting the most appropriate algorithms for variant detection across these clusters. Our comprehensive evaluation, utilizing both simulated and real samples, reveals that TMBstable achieves superior stability in performance metrics compared to advanced existing callers. This stability is critical for determining immunotherapy thresholds and thus has a profound impact on patient selection. The implications of our findings suggest that TMBstable has the potential to significantly enhance the accuracy and reliability of biomarker quantification in clinical settings. As we look to the future, we plan to extend this approach to a wider range of biomarkers, such as HRD, CNV burden, micro-satellite instability and chromothripsis. By extracting specific data features and training meta-models tailored to these diverse biomarkers, we aim to adapt the TMBstable approach for a broader application spectrum, effectively tackling the unique challenges each biomarker presents. This future work will focus on optimizing computational efficiency and user-friendliness, building upon the robust foundation laid by the current study, with the ultimate goal of enabling more precise and personalized treatment strategies in oncology.
KEY POINTS
- TMBstable is introduced as a novel variant caller specifically designed to ensure stable performance across heterogeneous sequencing samples, addressing the limitations of traditional variant callers.
- TMBstable introduces a novel meta-learning approach, which dynamically selects the optimal calling strategy for each genomic region, addressing performance variations in the counting-based biomarker TMB and offering a more flexible and adaptive solution.
- TMBstable stands out by dynamically selecting the optimal calling strategy for each region within sequencing data using a meta-learning framework, in contrast to existing callers with uniform strategies.
- TMB estimates often exhibit significant performance variations across samples, which can impact immunotherapy eligibility. TMBstable effectively addresses these performance variations.
- TMBstable's benefits are validated through extensive experiments with simulated and real data. It outperforms existing callers in terms of stability and is especially valuable for determining threshold values in immunotherapy applications.
ACKNOWLEDGEMENTS
We thank all of the faculty members and graduate students who discussed the mathematical and statistical issues in seminars.
FUNDING
This work was funded by the National Natural Science Foundation of China, grant numbers 72293581, 72293580, 72274152 and 71872146. This work was also supported by the Natural Science Basic Research Program of Shaanxi, grant number 2020JC-01. The article processing charge was funded by the Natural Science Basic Research Program of Shaanxi, grant number 2020JC-01.
AUTHOR CONTRIBUTIONS
J.W. and X.Z. conceived this research; S.W., X.W. and Y.L. designed the algorithm and the method; S.W. implemented the code and designed the software; S.W., X.W., M.Z., Z.C., X.W. and Y.S. collected the sequencing data; S.W. performed the experiments and analyzed the data; S.W. and J.W. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.
DATA AVAILABILITY
All data are available from the corresponding author upon reasonable request.
Author Biographies
Shenjie Wang is studying for a Ph.D. in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His research interests include machine learning and software engineering.
Xiaoyan Zhu is an associate professor at the School of Computer Science and Technology, Xi’an Jiaotong University. Her research interests include machine learning and data mining.
Xuwen Wang is a Ph.D. in the School of Computer Science and Technology, Xi’an Jiaotong University, China. His research interests include bioinformatics and machine learning.
Yuqian Liu is an assistant professor at the School of Computer Science and Technology, Xi’an Jiaotong University. Her research interests include bioinformatics and machine learning.
Minchao Zhao is the Chief Operating Officer of Nanjing Geneseeq Technology Inc. His research interest mainly focuses on bioinformatics and cancer diagnosis.
Zhili Chang is the head of the Bioinformatics Department at Nanjing Geneseeq Technology Inc. His research interest mainly focuses on bioinformatics and cancer diagnosis.
Xiaonan Wang is the Chief Technology Officer of Nanjing Geneseeq Technology Inc. Her research interest mainly focuses on bioinformatics and cancer diagnosis.
Yang Shao is the chairman of Nanjing Geneseeq Technology Inc. His research interest mainly focuses on bioinformatics and cancer diagnosis.
Jiayin Wang is a faculty member of the School of Computer Science and Technology, Xi’an Jiaotong University. His research interest mainly focuses on the management issues in bioinformatics, computational biology, cancer diagnosis and treatment.