Simon Gene Gottlieb, Knut Reinert, SeArcH schemes for Approximate stRing mAtching, NAR Genomics and Bioinformatics, Volume 7, Issue 1, March 2025, lqaf025, https://doi.org/10.1093/nargab/lqaf025
Abstract
Finding approximate occurrences of a query in a text using a full-text index is a central problem in stringology with many applications, especially in bioinformatics. Recent work has shown significant speed-ups by combining bidirectional indices and employing variations of search schemes. Search schemes partition a query and describe how to search the resulting parts with a given error bound. The performance of search schemes can be approximated by the node count, which represents an upper bound on the number of search steps. Finding optimum search schemes is a difficult combinatorial optimization problem that becomes hard to solve for four and more errors. This paper improves on several topics important to search-scheme-based searches: First, we show how search schemes can be used to model previously published approximate search strategies such as suffix filters, 01*0-seeds, or the pigeonhole principle. This unifies these strategies in the search scheme framework, makes them easily comparable, and results in novel search schemes that allow for any number of errors. Second, we present a search scheme construction heuristic that is on par with optimum search schemes and has a better node count than any known search scheme for four or more errors. Finally, using the different search schemes, we show that the node count is not an ideal performance metric and therefore propose an improved metric, the weighted node count, which approximates a search algorithm’s run time much more accurately.
Introduction
Finding approximate occurrences of a string in a large text is a fundamental problem in computer science with numerous applications, especially in bioinformatics. In this paper, we will address the problem for Hamming distance as well as for edit distance.
More formally, the approximate string matching (ASM) problem considered in this paper is defined as follows: Given an error number k, a string referred to as the query Q, and a text T, both composed of characters from an alphabet of size |Σ|, find all substrings of the text whose distance to the query is at most k.
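As a point of reference, the problem statement above can be solved by a brute-force scan. The sketch below (our illustration for Hamming distance, not one of the cited methods) makes the definition concrete; the index-based approaches discussed in this paper exist precisely to avoid this full scan:

```python
def asm_hamming(text, query, k):
    """Brute-force ASM for Hamming distance: return all start positions i
    where text[i:i+len(query)] differs from the query in at most k places."""
    m = len(query)
    hits = []
    for i in range(len(text) - m + 1):
        dist = sum(a != b for a, b in zip(text[i:i + m], query))
        if dist <= k:
            hits.append(i)
    return hits
```

This runs in O(|T| · |Q|) time, which is exactly what full-text indices let us beat.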
Solving the ASM problem has become especially important in bioinformatics due to the advances in sequencing technology in recent years. Mainstream second-generation sequencing techniques like Illumina produce reads of length 150–250 with an error rate of about 1%, mostly substitutions caused by the sequencing technology.
Other sequencing technologies, e.g. Pacific Bioscience or Oxford Nanopore, produce much longer reads but with a higher error rate (in the range of up to 15%) containing substitutions, insertions, and deletions. A standard problem is to map the reads back to a reference genome while taking into account the errors introduced by the sequencing technology as well as those caused by biological variations, such as SNPs or small structural variations. Such a problem is almost always modeled as the ASM problem for Hamming or edit distance.
There are two main algorithmic strategies to address the ASM problem for large inputs, including a long reference text and/or a high number of reads: filtering and indexing.
In this work, we focus on the indexing approach. Here, the main idea is to preprocess the reference text, the set of reads, or both into an index data structure that supports fast queries.
Such preprocessing into full-text string indices has the benefit that we do not have to scan the whole reference but can conduct queries much faster at the expense of larger memory consumption. String indices currently used are the suffix array [1], enhanced suffix array [2], and affix arrays [3, 4], as well as the FM-index [5], a data structure based on the Burrows–Wheeler Transform (BWT) [6] and some auxiliary tables. For an in-depth discussion see [7]. Such indices are used to compute exact matches between a query and a text as a subroutine in backtracking approaches. For the ASM problem for Hamming or edit distance, the existing algorithms all have exponential complexity in k (e.g. [8, 9]), and are hence only suited for small k.
Lam et al. [10] introduced an idea based on bidirectional FM indices to speed up ASM for Hamming distance. For the cases k = 1 and 2, they partitioned the read into k + 1 equal parts and argued that performing approximate matching on a certain combination of these parts in a bidirectional index amounts to faster approximate matching of the whole read. This combination is such that all possible error configurations, i.e., all possible distributions of k mismatches among the parts, are covered. The main idea behind improved speed is that a bidirectional index does not have to start the search from the beginning or end of the read but from any of the parts. Therefore, we can start the search from a middle part and then expand it to the left and right into adjacent parts in any order we like. By choosing multiple appropriate orderings of parts for this purpose, we can perform a much faster ASM compared to a unidirectional search. By enforcing exact or near-exact searches on the first parts of the partition, we significantly reduce the number of backtrackings or spurious results. By using different orderings of parts, we ensure that all possible error configurations are covered.
Kucherov et al. [11] formalized and generalized this idea by defining the concept of search schemes. Assume a read can be partitioned into a given number of parts, denoted by p, indexed from left to right. A search scheme then specifies a set of searches over these parts, each with a search order and error bounds; a formal definition follows in the Nomenclature section.
Kianfar and colleagues [12], in cooperation with our group, proposed a method to compute optimal search schemes for Hamming distance under certain constraints by minimizing the overall number of edges in the backtracking trees of the search schemes. Their method is based on solving a Mixed-Integer Linear Program (MILP), which is costly. Hence, they can only compute solutions for up to four errors in reasonable time, and optimality is only guaranteed for one and two errors. Their paper also mentions that the 01*0 seeds of Vroland et al. [9] can be reformulated as search schemes.
Renders et al. [13] implemented improvements to the applied search schemes in their tool Columba. Instead of backtracking single error configurations, they use a banded matrix to track multiple error configurations; by leaving out well-defined paths through the matrix, the number of search steps is significantly decreased. To our knowledge, this is currently the fastest implementation for executing search schemes. Additionally, they benchmark search schemes based on the pigeonhole principle and 01*0 seeds for up to four errors.
In their latest work [14], Renders et al. explore the possibility of automated design of efficient search schemes. They implement a new ILP formulation that is capable of generating search schemes for up to k = 7 errors. They show that search schemes generated with their tool hato and executed with Columba outperform existing tools for lossless approximate search.
Searching in a bidirectional index using any given, valid search scheme is already implemented in our library SeqAn [15] for Hamming and edit distance. Hence, it is attractive to investigate methods to construct search schemes for larger k. This fact and the shortcomings of the method of Kianfar et al. inspired the work we present in this paper.
We show how to formulate suffix filters [8], 01*0 seeds [9], and pigeonhole filtering in terms of search schemes and compare the performance of all methods using the Columba tool [13]. We propose a novel metric, the weighted node count, to evaluate search schemes, which correlates much better with their actual performance than the node count measure used in [12].
Finally, we underpin these claims by collecting benchmarks for all search schemes with Columba [13] and comparing the run times to the node count and our newly introduced weighted node count, supporting that the weighted node count is a good performance indicator for search schemes.
Material and methods
Nomenclature
In this section, we introduce fundamental notations and definitions. We use 0-based indices, which should be interpreted as offsets from the first entry. A string is denoted as A = a0a1…a|A| − 1, where |A| denotes the length and ai refers to a single character of the string. An infix is denoted as Ai, j = aiai + 1…aj − 1. The approximate search of a query Q = q0q1…q|Q| − 1 inside a text T = t0t1…tn − 1 is defined as finding the set of substrings of T whose distance to Q is at most k.
A bidirectional index such as the 2FM-Index allows us to search for a query Q by starting at any position in the query and extending it to the left or right until the full query is covered. In a broad sense, a search scheme describes how to extend a query step by step and how many errors are allowed at each step.
Assume the query Q is divided into p parts P = (P0, …, Pp − 1) such that the concatenation of the parts equals Q. A search scheme is defined as a set of searches S = {s0, s1, …, s|S| − 1}. Each search is defined as a triplet si = (πi, Li, Ui). The search order of parts is defined by a permutation π = (π0, …, πp − 1) of {0, …, p − 1}. To work with bidirectional indices, e.g. a 2FM-Index, it is also required that π fulfills the connectivity property, which states that each πi has to be one larger than the maximum or one smaller than the minimum of the previous π0, …, πi − 1. Furthermore, let Li = (Li, 0, …, Li, p − 1) denote the minimal and Ui = (Ui, 0, …, Ui, p − 1) the maximal accumulated error while searching the parts in the order of π. The values in Li and Ui must be monotonically non-decreasing, and the upper bound must be at least the lower bound, Ui, j ≥ Li, j for all j. Search schemes that fulfill these two requirements for all searches can be used for searching in a bidirectional index and are hence called valid. In the following, we will drop the search index i from Li and Ui and write L and U if the context is clear. The expression L = (x, y, z, …) is shortened to L = xyz to increase readability.
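The two validity requirements can be checked mechanically. The following sketch (our illustration, not part of any cited implementation) verifies the connectivity of π and the monotonicity of the bounds for a single search:

```python
def is_valid_search(pi, L, U):
    """Check the validity requirements of a single search (pi, L, U)."""
    p = len(pi)
    # Connectivity: every part must extend the already-covered range by one,
    # either to the right or to the left (needed for bidirectional indices).
    lo = hi = pi[0]
    for part in pi[1:]:
        if part == hi + 1:
            hi = part
        elif part == lo - 1:
            lo = part
        else:
            return False
    # Bounds: monotonically non-decreasing, and U[j] >= L[j] for all j.
    return all(U[j] >= L[j] for j in range(p)) and \
           all(L[j] >= L[j - 1] and U[j] >= U[j - 1] for j in range(1, p))
```

For example, the search (102, 000, 022) from Fig. 2 passes this check, while the part order 021 would fail the connectivity test.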
An error configuration is a distribution of the k errors over the parts and is denoted by (e0, …, ep − 1) with ei ≥ 0 and e0 + ⋯ + ep − 1 ≤ k.
An example of a search scheme can be seen in Fig. 2, where the standard pigeonhole principle for two errors is expressed as a valid and complete search scheme with three searches. It can be seen that all error configurations are covered, albeit some by several searches. Hence, the search scheme is redundant. This opens the door for devising an improved version of the pigeonhole search scheme that we will introduce in the Subsection Formulating search strategies as search schemes.
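Coverage of an error configuration can be tested by accumulating errors in search order against the bounds. The sketch below (our illustration) uses this criterion to verify that the pigeonhole scheme of Fig. 2 is complete for k = 2:

```python
from itertools import product

def covers(pi, L, U, config):
    """A search covers an error configuration if the error count,
    accumulated in search order, stays within [L_j, U_j] after each part."""
    acc = 0
    for j, part in enumerate(pi):
        acc += config[part]
        if not (L[j] <= acc <= U[j]):
            return False
    return True

# Pigeonhole scheme for k = 2 from Fig. 2.
k, p = 2, 3
scheme = [((0, 1, 2), (0, 0, 0), (0, 2, 2)),
          ((1, 0, 2), (0, 0, 0), (0, 2, 2)),
          ((2, 1, 0), (0, 0, 0), (0, 2, 2))]
configs = [c for c in product(range(k + 1), repeat=p) if sum(c) <= k]
# Completeness: every error configuration is covered by at least one search.
assert all(any(covers(*s, c) for s in scheme) for c in configs)
```

The redundancy mentioned above also shows up here: several configurations are covered by more than one search.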

Conditions: a search s = (π, L, U) covers an error configuration (e0, …, ep − 1) if and only if Lj ≤ eπ0 + ⋯ + eπj ≤ Uj for all 0 ≤ j ≤ p − 1. (1)

A search scheme is called complete if every error configuration with at most k errors is covered by at least one of its searches.

Pigeonhole principle expressed as a search scheme S = {s0 = (012, 000, 022), s1 = (102, 000, 022), s2 = (210, 000, 022)} with k = 2, query length |Q| = 8 and an alphabet of |Σ| = 2 for distham. Following an edge down symbolizes a match between query and text. Branching to the right indicates a mismatch. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 2), (0, 2, 0)}, the second search covers {(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 0, 2), (2, 0, 0)} and the third search covers {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 2), (0, 2, 0)}.
Node count
The quality of a search scheme can be measured by counting the number of extensions applied to a pattern, which equals the number of search steps in the 2FM-Index. This measure is referred to as the node count and was introduced by Kianfar et al. [12]. To compute the node count of a search scheme, we have to assume a distance metric, a pattern length, and an alphabet size; from the pattern length, we derive the lengths of the individual parts.
Weighted node count
However, the node count has a few shortcomings as a performance indicator for search schemes. This becomes especially noticeable when comparing the run time of plain backtracking with that of the other search schemes at higher error rates.
We introduce an improved measure, the weighted node count, which takes into consideration that not every child of every node can be visited. Depending on the size of the indexed text T, it is guaranteed that beyond a certain depth the search will not visit every node: at depth l, the number of backtracking branches that can actually be taken is limited by the number of distinct substrings of T of length l, and is hence bounded by min(|Σ|^l, |T|).
The weighted node count for edit distance looks similar but has additional branches. We included the equation in the appendix.
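A minimal sketch of the weighting idea, assuming the bound min(|Σ|^l, |T|) on the number of non-empty nodes at depth l; the paper's exact formula, and its edit-distance variant in the appendix, refine this:

```python
def depth_weight(l, sigma, n):
    """Fraction of the sigma**l possible nodes at depth l that can be
    non-empty when the indexed text has n characters: the text contains
    at most n distinct substrings of length l."""
    return min(1.0, n / sigma**l)

# For a DNA-sized text (n = 3e9, sigma = 4), branching saturates quickly:
# 4**16 is about 4.3e9 > 3e9, so from depth 16 on the weights drop below 1.
```

Multiplying each node of the backtracking tree by its depth weight yields the weighted node count, whereas the plain node count implicitly uses a weight of 1 everywhere.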
Formulating search strategies as search schemes
To formulate published search strategies as search schemes, we need to determine, for each strategy, a method to generate the searches, i.e. the part order π and the bounds L and U. We derive the following search schemes:

- Pigeonhole based with k errors.
- Optimized pigeonhole based with k errors.
- Suffix filter based with k errors.
- 01*0 seeds based with k errors.
- Optimized 01*0 seeds based with k errors.
- Our custom heuristic to construct search schemes with k errors and p parts.
Later on, we will compare these to additional search schemes:
- Backtracking scheme with a single search.
- Optimum search schemes provided by Kianfar et al. [12].
- Optimum search schemes that are split into k + 1 parts, also described by Kianfar et al. [12].
- Splitting the search into k + 1 parts, for k ∈ {1, 2, 3, 4}, described by Kucherov et al. [11].
- Splitting the search into k + 2 parts, for k ∈ {1, 2, 3, 4}, described by Kucherov et al. [11].
- Search schemes provided by Renders et al. [14].
All used search schemes with concrete numbers are listed in the supplementary file in the Search Schemes section.
Pigeonhole strategy
From the pigeonhole principle, we can derive that if a query with k errors is divided into k + 1 parts, at least one of the parts must contain zero errors. By performing k + 1 searches, where each search starts with a different part searched with zero errors, we obtain a valid and complete search scheme. An example of this search scheme is shown in Fig. 2.
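This construction generalizes to any k. The following sketch builds the k + 1 searches; extending left before right is one valid choice of extension order, picked so that the k = 2 case reproduces Fig. 2:

```python
def pigeonhole_scheme(k):
    """Pigeonhole search scheme: k + 1 searches over k + 1 parts.
    Search i matches part i with zero errors, then extends to the left
    down to part 0 and finally to the right up to part k."""
    p = k + 1
    searches = []
    for i in range(p):
        pi = list(range(i, -1, -1)) + list(range(i + 1, p))
        L = [0] * p              # no lower bounds
        U = [0] + [k] * (p - 1)  # starting part exact, then up to k errors
        searches.append((pi, L, U))
    return searches
```

For k = 2 this yields exactly the scheme S = {(012, 000, 022), (102, 000, 022), (210, 000, 022)} of Fig. 2.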
The above search scheme for the pigeonhole principle can be improved with tighter lower and upper bounds. For example, the first search already covers every error configuration in which its starting part is error-free; later searches can therefore raise their lower bounds to exclude configurations that are already covered. Regarding the upper bounds, we can avoid searching certain error counts in the parts to the left of i. This is achieved by decreasing the upper bound to k − i + 1 for the (i − 1)th part and allowing one additional error for each further part to the left until k is reached.
The resulting search scheme, depicted in Fig. 3, has a lower node count, and the error configurations covered by its searches are disjoint, in contrast to the standard pigeonhole principle depicted in Fig. 2. For higher error numbers k > 2, the optimized pigeonhole-based search schemes still contain redundant searches.
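A sketch of the optimized construction; the bound formulas below are inferred from the k = 2 example in Fig. 3 and the rules above, so they should be read as an illustration rather than the paper's general definition:

```python
def optimized_pigeonhole_scheme(k):
    """Optimized pigeonhole scheme (bounds inferred from Fig. 3, k = 2)."""
    p = k + 1
    searches = []
    for i in range(p):
        # Same part order as the plain pigeonhole scheme.
        pi = list(range(i, -1, -1)) + list(range(i + 1, p))
        # Lower bounds (inferred): require at least min(j, i) accumulated
        # errors after j parts, excluding configurations already covered
        # by the earlier searches.
        L = [min(j, i) for j in range(p)]
        # Upper bounds: 0 for the starting part, then k - i + 1 for the
        # next part, growing by one per part and capped at k.
        U = [0] + [min(k, k - i + j) for j in range(1, p)]
        searches.append((pi, L, U))
    return searches
```

For k = 2 this reproduces the scheme S = {(012, 000, 022), (102, 011, 022), (210, 012, 012)} of Fig. 3.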

An optimized version of the pigeonhole principle expressed as a search scheme S = {(012, 000, 022), (102, 011, 022), and (210, 012, 012)} with k = 2, query length |Q| = 8 and an alphabet of σ = 2. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 2), and (0, 2, 0)}, the second search covers {(1, 0, 0), (1, 0, 1), and (2, 0, 0)} and finally the third search covers {(1, 1, 0)}.
Suffix filter
The suffix filter can be seen as an extension of the pigeonhole principle. Assume we have a factorization of the query into k + 1 factors. Then, according to Kärkkäinen and Na [8], there must exist a strong match of a factor suffix, which means that for some suffix consisting of s factors, the sum of the errors over these s factors is smaller than s. We can model this as a search scheme by first searching a part i with zero errors, then extending the search to the right as long as the strong match criterion is met, and finally extending over the parts to the left.
For our small example, the suffix filter search scheme is depicted in Fig. 4. Note the difference in the error configurations when comparing to the optimized pigeonhole scheme depicted in Fig. 3. For these parameters, the suffix filter search scheme is redundant.

Suffix filter expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of σ = 2. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), and (0, 0, 2)}, the second search covers {(1, 0, 0), (1, 0, 1), (2, 0, 0), and (0, 1, 0)} and finally the third search covers {(1, 1, 0), (0, 2, 0), and (0, 1, 0)}.
01*0 seeds
The 01*0 seeds strategy of Vroland et al. [9] follows a formulation that is stronger than the pigeonhole principle. It states that if a read has k errors and is divided into k + 2 parts, then there exists at least one group of adjacent parts in which the first and last part have zero errors and all parts in between have exactly one error. In the original formulation, 01*0 seeds are found using a unidirectional FM-Index and extended and verified in the text. In contrast to this strategy, we formulate the complete search of the query as a search scheme. Let j be the number of parts that match with exactly one error. Then we can formulate the strategy with an ordering that matches part i with 0 errors, then the next j parts with exactly 1 error each, then part i + j + 1 again with 0 errors. The remaining errors are distributed over the remaining parts. The resulting search scheme is shown in Fig. 5, and an optimized variant is shown in Fig. 6.

01*0 seeds expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of σ = 2.

Optimized 01*0 seeds expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of σ = 2.
We evaluate the proposed search schemes in Subsection Evaluation. We compare the theoretical performance by computing the node count and measure the actual run time.
Our search scheme heuristic
In this subsection, we describe how to construct a search scheme with our heuristic.
For our construction algorithm, we analyzed the known optimum search schemes and noticed that they seemingly have an internal structure. We describe a parameterized construction that reproduces these search schemes. The construction algorithm has two user-defined parameters, k and p. The parameter k indicates the number of errors our scheme should cover. The parameter p controls the number of parts and must be chosen such that p > k (this is a limitation of our construction that could likely be removed). To find a search scheme with a low node count, one must iterate over different values of p. In all our test cases, p = k + 2 results in the lowest node count, with the exception of k ≤ 2, where p = k + 1 has the lowest node count.
From these two parameters, we need to derive the number of searches and for each search the order of parts as well as the lower and upper bound of each part. Computing the number of searches, order of parts and the lower bound is much easier than the upper bound. Starting with the number of searches, we fix these to k + 1.
The order of the parts for the search starting at part i ∈ {0, …, k} is defined as follows: πi = (i, i + 1, …, p − 1, i − 1, i − 2, …, 0), i.e. the search starts at part i, extends to the right up to part p − 1, and then extends to the left down to part 0.
The lower bounds of the parts, given in search order, are defined as follows: Li, j = k − i for j ≥ p − 1 − i and Li, j = 0 otherwise.
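The part orders and lower bounds can be generated programmatically. The formulas in this sketch are reconstructed from the k = 3, p = 5 step-by-step example below, so they are an inference on our part rather than the verbatim definition:

```python
def heuristic_pi_L(k, p):
    """Part orders and lower bounds of the construction heuristic,
    reconstructed from the paper's k = 3, p = 5 worked example."""
    searches = []
    for i in range(k + 1):  # the heuristic always uses k + 1 searches
        # Start at part i, extend right to part p - 1, then left to part 0.
        pi = list(range(i, p)) + list(range(i - 1, -1, -1))
        # In search order: zeros, then k - i errors required from position
        # p - 1 - i onwards (all zeros for the last search, i = k).
        L = [0 if j < p - 1 - i else k - i for j in range(p)]
        searches.append((pi, L))
    return searches
```

For k = 2, p = 4 this reproduces the orders and lower bounds of the searches shown in Fig. 7.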
The upper bound Ui requires an additional matrix M. The matrix M defines the maximum number of errors that are allowed in a part. The rows correspond to the searches, and each column corresponds to a part of the query:
The matrix M is split into three submatrices called Mk, Mb, and Mt. Each of these matrices has to satisfy additional constraints:
The first three conditions can be used to generate the submatrices Mk, Mb, and Mt, respectively. The last two conditions are not automatically fulfilled after generating M. In an adjustment step, the first column i with an element mj, i violating these conditions has to be adjusted by permuting its elements: the elements m0, i to mk − 1, i, excluding mi, i, are permuted until a valid combination is found. (We do not know whether this step always terminates; in our tests, it worked for k ≤ 1000.) This adjustment step is repeated until all columns fulfill the conditions. In a last step, the matrix M is reordered to Mord according to π:
The upper bound of the parts can now be defined as follows, limiting the values to a maximum of k:
Fig. 7 shows a search scheme with two allowed errors and split into four parts. The node count is identical to the known optimum search schemes.

Search scheme generated by our construction algorithm. S = {s0 = (2310, 0000, 0022), s1 = (1230, 0011, 0112), and s2 = (0123, 0002, 0122)} with k = 2, query length |Q| = 8, 4-part-split and an alphabet of |Σ| = 2 for distham.
Step by step example
The following section goes through an example of creating a search scheme with k = 3 errors and p = 5 parts.
Next, we generate the lower bound L of every search:
To generate the upper bound, we first need to compute Mord. For this, we create an initial matrix M0:
The values m0, 1 = 2 and m0, 2 = 1 violate the property m0, 1 ≤ m0, 2. This triggers a permutation of column 2, resulting in a new matrix M1:
This matrix M1 fulfills all required properties, so we can continue reordering the values according to π to receive Mord:
To compute U, we combine L and Mord:
This results in four searches s0 = (01234, 00003, 02233), s1 = (12340, 00022, 01223), s2 = (23410, 00111, 01123), and s3 = (34210, 00000, 00333).
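These four searches can be checked for completeness by brute force over all error configurations with at most k = 3 errors; the verification sketch below (our illustration) tests coverage by accumulating errors in search order against the bounds:

```python
from itertools import product

def covers(pi, L, U, config):
    """A search covers a configuration if the error count accumulated
    in search order stays within [L_j, U_j] after every part."""
    acc = 0
    for j, part in enumerate(pi):
        acc += config[part]
        if not (L[j] <= acc <= U[j]):
            return False
    return True

# The four searches produced by the step-by-step example (k = 3, p = 5).
scheme = [((0, 1, 2, 3, 4), (0, 0, 0, 0, 3), (0, 2, 2, 3, 3)),
          ((1, 2, 3, 4, 0), (0, 0, 0, 2, 2), (0, 1, 2, 2, 3)),
          ((2, 3, 4, 1, 0), (0, 0, 1, 1, 1), (0, 1, 1, 2, 3)),
          ((3, 4, 2, 1, 0), (0, 0, 0, 0, 0), (0, 0, 3, 3, 3))]

k, p = 3, 5
configs = [c for c in product(range(k + 1), repeat=p) if sum(c) <= k]
# Completeness: every configuration with at most k errors is covered.
assert all(any(covers(pi, L, U, c) for pi, L, U in scheme) for c in configs)
```

The same brute-force check can be applied to any of the search schemes discussed in this paper.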
Results and discussion
Evaluation
In this section, we evaluate the node count, weighted node count, and run time of all search schemes. In our benchmarks, we use a string generated over the alphabet Σ = {A, C, G, T} of length 3 billion as our reference text, roughly the size of the human genome. As queries, we created a fixed set of simulated strings of length 50 with an exact number of edit distance errors, depending on the benchmark. The length of 50 mimics methods that use the FM-Index as a seeding strategy [16, 17]. For run time measurements, we used the Columba tool [13] in version v1.2. We activated uniform partitioning and did not use in-text verification to avoid query-specific optimization, but we used Columba's "editopt" search distance, knowing this will lead to an overestimation by our weighted node count. To generate simulated genomes, simulated reads, and search schemes, we use our tool Sahara [18]. We also use it to simplify the preprocessing steps for Columba. While Sahara is capable of executing searches, we decided to use the superior implementation in Columba, which also maintains comparability with the previously published papers by Renders et al.
Using the node count as a performance predictor (Table 1), we expect the search schemes with the lowest node counts to perform best.
Showing the node count of different search schemes, assuming |P| = 50, |Σ| = 4, and edit distance. The lowest node counts are highlighted. Empty cells denote non-existing search schemes.

| errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| | 1 025 · 10^1 | 1 343 · 10^3 | 1 293 · 10^5 | 975 · 10^7 | 600 · 10^9 | 309 · 10^11 |
| | 530 · 10^1 | 747 · 10^3 | 732 · 10^5 | - | - | - |
| | 585 · 10^1 | 836 · 10^3 | 1 004 · 10^5 | 949 · 10^7 | 703 · 10^9 | 424 · 10^11 |
| | 581 · 10^1 | 1 014 · 10^3 | 1 356 · 10^5 | 1 279 · 10^7 | 951 · 10^9 | 567 · 10^11 |
| | 530 · 10^1 | 1 199 · 10^3 | 1 621 · 10^5 | 1 561 · 10^7 | 1 166 · 10^9 | 703 · 10^11 |
| | 530 · 10^1 | 900 · 10^3 | 1 062 · 10^5 | 938 · 10^7 | 649 · 10^9 | 368 · 10^11 |
| | 530 · 10^1 | 780 · 10^3 | 889 · 10^5 | 753 · 10^7 | 557 · 10^9 | 315 · 10^11 |
| | 530 · 10^1 | 779 · 10^3 | 790 · 10^5 | 603 · 10^7 | 389 · 10^9 | 205 · 10^11 |
| | 581 · 10^1 | 737 · 10^3 | 732 · 10^5 | 568 · 10^7 | 359 · 10^9 | 189 · 10^11 |
| | 645 · 10^1 | 790 · 10^3 | 753 · 10^5 | 579 · 10^7 | 362 · 10^9 | 189 · 10^11 |
| | 530 · 10^1 | 779 · 10^3 | 1 194 · 10^5 | - | - | - |
| | 530 · 10^1 | 807 · 10^3 | 983 · 10^5 | 1 122 · 10^7 | - | - |
| | 581 · 10^1 | 857 · 10^3 | 804 · 10^5 | 1 055 · 10^7 | - | - |
| | 530 · 10^1 | 762 · 10^3 | 834 · 10^5 | 659 · 10^7 | 483 · 10^9 | 235 · 10^11 |
When measuring the run times for 100 000 queries using the different search schemes, Table 2 tells a different story: despite our prediction derived from the node count, our heuristic search schemes are among the fastest in every error category.
Showing the run time in seconds of search schemes run by Columba, searching for 100 000 queries. Using |P| = 50, |Σ| = 4, |T| = 3 000 000 000, with simulated reads using edit distance. The lowest run times are highlighted. For k = 5 and k = 6, only 1000 queries were executed and the run time scaled by 100. Empty cells denote non-existing search schemes.

| Errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| | 28.05 | 725.68 | timeout | timeout | timeout | timeout |
| | 1.54 | 5.17 | 48.13 | - | - | - |
| | 1.80 | 5.92 | 21.54 | 70.81 | 263.00 | 1084.00 |
| | 1.80 | 5.15 | 16.44 | 58.08 | 725.00 | 9432.00 |
| | 1.52 | 2.16 | 54.60 | 1672.09 | 29 789.00 | 239 756.00 |
| | 1.52 | 2.07 | 38.11 | 860.12 | 14 862.00 | 120 531.00 |
| | 1.54 | 2.02 | 13.80 | 357.49 | 2225.00 | 12 194.00 |
| | 1.53 | 2.01 | 20.41 | 534.57 | 5937.00 | 44 025.00 |
| | 1.78 | 5.04 | 48.50 | 928.63 | 9784.00 | 78 642.00 |
| | 3.75 | 11.28 | 145.14 | 1885.38 | 19 851.00 | failed |
| | 1.52 | 2.01 | 107.31 | - | - | - |
| | 1.53 | 2.06 | 8.40 | 62.47 | - | - |
| | 1.78 | 5.23 | 16.19 | 50.56 | - | - |
| | 1.55 | 2.05 | 8.36 | 57.93 | 68.00 | 426.00 |
As an alternative run time estimation, we compare our previously introduced weighted node count with the run time. Table 3 lists the different weighted node counts. Here, we see a clear correlation between the predicted fastest search schemes and the actual run times. Only for k = 4 are our predictions of the fastest and second-fastest search scheme swapped; in the remaining error categories, the weighted node count correctly predicts the fastest search scheme.
Showing the weighted node count of different search schemes, assuming |P| = 50, |Σ| = 4, |T| = 3 000 000 000, and edit distance. The lowest weighted node counts are highlighted. Empty cells denote non-existing search schemes.

| Errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| | 1 082.60 | 434 · 10^2 | 122 · 10^4 | 255 · 10^5 | 414 · 10^6 | 530 · 10^9 |
| | 31.65 | 198.46 | 2 669.57 | - | - | - |
| | 47.52 | 229.60 | 1 122.48 | 2 650.15 | 15 086.29 | 123 820.56 |
| | 31.70 | 137.24 | 545.41 | 1 255.75 | 46 111.73 | 2 879 754.09 |
| | 31.65 | 48.99 | 8 531.56 | 530 041.98 | 15 683 373.87 | 253 620 952.60 |
| | 31.65 | 48.63 | 2 678.92 | 233 968.19 | 3 128 519.06 | 94 408 630.87 |
| | 31.65 | 48.98 | 3 534.11 | 106 714.51 | 3 313 394.15 | 40 034 994.27 |
| | 31.65 | 48.98 | 3 782.52 | 128 136.32 | 3 617 506.01 | 44 548 566.52 |
| | 31.70 | 137.24 | 2 669.57 | 42 121.71 | 684 323.68 | 9 127 067.20 |
| | 76.53 | 368.88 | 4 184.41 | 76 653.18 | 1 224 984.33 | 18 787 288.71 |
| | 31.65 | 48.98 | 6 323.76 | - | - | - |
| | 31.65 | 48.63 | 304.05 | 3 375.69 | - | - |
| | 31.70 | 183.68 | 545.41 | 1 497.72 | - | - |
| | 31.65 | 48.63 | 304.05 | 3 006.80 | 1 996.37 | 31 105.20 |
If we compare across error categories, e.g.
The weighted node count is not a perfect predictor of a search scheme's performance, but it is a clearly better estimator than the plain node count. It makes it possible to identify a well-performing search scheme given the number of allowed errors, the query length, and the length of the reference text.
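The weighting idea can be sketched in a few lines of Python. The following is our illustrative model, not the formula from the Appendix: each node of the search tree at depth d is weighted by the probability that an arbitrary string of length d occurs in a random text of length |T| over an alphabet of size |Σ|, i.e. min(1, |T|/|Σ|^d). For simplicity, the node counts below are those of naive backtracking with up to k Hamming errors, so the resulting totals will not reproduce the table values for actual search schemes; the sketch only shows why nodes deeper than log_|Σ||T| contribute almost nothing.

```python
from math import comb

def node_weight(depth, text_len=3_000_000_000, sigma=4):
    # Probability-style weight: chance that an arbitrary string of
    # length `depth` occurs in a random text of length `text_len`
    # over an alphabet of size `sigma`, capped at 1.
    return min(1.0, text_len / sigma**depth)

def weighted_node_count(n, k, text_len=3_000_000_000, sigma=4):
    # Naive backtracking over a pattern of length n with at most k
    # Hamming errors: a node at depth d carries e <= k mismatches,
    # chosen among d positions with (sigma - 1) wrong letters each.
    total = 0.0
    for d in range(1, n + 1):
        nodes = sum(comb(d, e) * (sigma - 1) ** e
                    for e in range(min(k, d) + 1))
        total += nodes * node_weight(d, text_len, sigma)
    return total
```

With |P| = 50, |Σ| = 4, and |T| = 3·10⁹, weights stay at 1 up to depth 15 (4¹⁵ ≈ 1.1·10⁹ < 3·10⁹) and then decay geometrically, so even the exact-match traversal costs only about 16 weighted nodes despite visiting 50 levels.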
We ran the same benchmark on a fixed set of simulated strings of length 150, which is a typical size of reads produced by Illumina sequencing machines. The results are less informative and can be found in Supplementary Tables S3–S5. The run times within an error category are so close that meaningful conclusions are harder to draw, so we decided to focus on reads of length 50 instead. Additionally, we ran the benchmark using the human genome with read lengths of 50 and 150. The results can be seen in Supplementary Tables S6 and S7. While these results look similar, they would require the weighted node count to take the self-similarity of the reference genome into account. We decided to adhere to the randomly generated reference because it is a simpler model.
In summary, we see that
Conclusion
We formulated different search strategies as search schemes and introduced our own search scheme construction algorithm. This enables the creation of search schemes for any number of errors. We showed that our search scheme achieves the lowest node count in every error category, but the measured run times reveal that the node count does not correlate with the actual run time. We used this insight to introduce an improved metric, the weighted node count, which estimates run times with reasonable accuracy. This makes it possible to choose the best search scheme for a given set of queries, which confirms the findings of Renders et al. [14]. Although there is room to explore even better predictors, the weighted node count can already guide the development of new search scheme constructions with better performance.
Supplementary data
Supplementary data is available at NAR Genomics & Bioinformatics Online.
Conflict of interest
None declared.
Data availability
Data are available at https://github.com/seqan/sahara (https://doi.org/10.5281/zenodo.14640235) and https://github.com/SGSSGene/sahara_benchmarks (https://doi.org/10.5281/zenodo.14640239).
Appendix
Weighted node count
Weighted node count for edit distance. In most practical applications, this formula results in an overestimation, because some recursive calls can be optimized.
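The edit-distance formula itself did not survive the extraction of this appendix. As an illustration of the underlying idea only (our sketch of the simpler Hamming-distance case, consistent with the model used in the main text, not the paper's edit-distance formula), each search-tree node v at depth d(v) can be weighted by the probability that a random string of that length occurs in the text T:

$$W = \sum_{v \in \text{search tree}} \min\!\left(1, \frac{|T|}{|\Sigma|^{d(v)}}\right)$$

The edit-distance version additionally has to account for insertion and deletion branches, which is where the overestimation mentioned above arises.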