Abstract

Finding approximate occurrences of a query in a text using a full-text index is a central problem in stringology with many applications, especially in bioinformatics. Recent work has shown significant speed-ups by combining bidirectional indices with variations of search schemes. Search schemes partition a query and describe how to search the resulting parts within a given error bound. The performance of a search scheme can be approximated by its node count, an upper bound on the number of search steps. Finding optimum search schemes is a difficult combinatorial optimization problem that becomes hard to solve for four and more errors. This paper improves on several topics important to search-scheme-based searches: First, we show how search schemes can be used to model previously published approximate search strategies such as suffix filters, 01*0 seeds, or the pigeonhole principle. This unifies these strategies in the search scheme framework, makes them easily comparable, and results in novel search schemes that allow for any number of errors. Second, we present a search scheme construction heuristic that is on par with optimum search schemes and has a better node count than any known search scheme for four or more errors. Finally, using the different search schemes, we show that the node count measure is not an ideal performance metric and therefore propose an improved metric called the weighted node count, which approximates a search algorithm's run time much more accurately.

Introduction

Finding approximate occurrences of a string in a large text is a fundamental problem in computer science with numerous applications, especially in bioinformatics. In this paper, we will address the problem for Hamming distance as well as for edit distance.

More formally, the approximate string matching (ASM) problem considered in this paper is defined as follows: Given an error number k, a string, here referred to as a query Q, and a text T, composed of characters from an alphabet of size |Σ|, find substrings of the text whose distance to the query is at most k.

Solving the ASM problem has become especially important in bioinformatics due to the advances in sequencing technology in recent years. Mainstream second-generation sequencing techniques like Illumina produce reads of length 150–250 with an error rate of about 1%, mostly substitutions caused by the sequencing technology.

Other sequencing technologies, e.g. Pacific Bioscience or Oxford Nanopore, produce much longer reads but with a higher error rate (in the range of up to 15%) containing substitutions, insertions, and deletions. A standard problem is to map the reads back to a reference genome while taking into account the errors introduced by the sequencing technology as well as those caused by biological variations, such as SNPs or small structural variations. Such a problem is almost always modeled as the ASM problem for Hamming or edit distance.

There are two main algorithmic strategies to address the ASM problem for large inputs, including a long reference text and/or a high number of reads: filtering and indexing.

In this work, we focus on the indexing approach. Here, the main idea is to preprocess the reference text, the set of reads, or both, more intricately.

Such preprocessing into full-text string indices has the benefit that we do not have to scan the whole reference but can conduct queries much faster at the expense of larger memory consumption. String indices currently used are the suffix array [1], enhanced suffix array [2], and affix arrays [3, 4], as well as the FM-index [5], a data structure based on the Burrows–Wheeler Transform (BWT) [6] and some auxiliary tables. For an in-depth discussion see [7]. Such indices are used to compute exact matches between a query and a text as a subroutine in backtracking approaches. For the ASM problem for Hamming or edit distance, the existing algorithms all have exponential complexity in k (e.g. [8, 9]), and are hence only suited for small k.

Lam et al. [10] introduced an idea based on bidirectional FM indices to speed up ASM for Hamming distance. For the cases k = 1 and 2, they partitioned the read into k + 1 equal parts and argued that performing approximate matching on a certain combination of these parts in a bidirectional index amounts to faster approximate matching of the whole read. This combination is such that all possible error configurations, i.e., all possible distributions of k mismatches among the parts, are covered. The main idea behind improved speed is that a bidirectional index does not have to start the search from the beginning or end of the read but from any of the parts. Therefore, we can start the search from a middle part and then expand it to the left and right into adjacent parts in any order we like. By choosing multiple appropriate orderings of parts for this purpose, we can perform a much faster ASM compared to a unidirectional search. By enforcing exact or near-exact searches on the first parts of the partition, we significantly reduce the number of backtrackings or spurious results. By using different orderings of parts, we ensure that all possible error configurations are covered.

Kucherov et al. [11] formalized and generalized this idea by defining the concept of search schemes. Assume a read is partitioned into a given number of parts, denoted by p, indexed from left to right. A search scheme $S = \{(\pi_s, L_s, U_s) \mid s = 1, \ldots, |S|\}$ is a collection of |S| searches, where each search s is designated by a triplet $(\pi_s, L_s, U_s)$. Here $\pi_s$ is a permutation of $\{1, \ldots, p\}$ and denotes the order in which the parts of the partition are searched in search s, while $L_s$ and $U_s$ denote the accumulated lower and upper bounds of allowed errors for each part.

Kianfar and colleagues [12], in cooperation with our group, proposed a method to compute optimal search schemes for Hamming distance under certain constraints by minimizing the overall number of edges in the backtracking trees of the search schemes. Their method is based on solving a mixed-integer linear program, which is costly. Hence, they can only compute solutions for up to four errors in reasonable time and can guarantee optimality only for one and two errors. Their paper also mentions that the 01*0 seeds of Vroland et al. [9] can be reformulated as search schemes.

Renders et al. [13] implemented improvements to the applied search schemes in their tool Columba. Instead of backtracking single error configurations, they use a banded matrix to track multiple error configurations at once. By leaving out well-defined paths through the matrix, the number of search steps is decreased significantly. To the best of our knowledge and experience, this is currently the fastest implementation for executing search schemes. Additionally, they benchmark search schemes based on the pigeonhole principle and 01*0 seeds for up to four errors.

In their latest work [14], Renders et al. explore the possibility of automated design of efficient search schemes. They implement a new ILP formulation that is capable of generating search schemes for up to k = 7 errors. They show that search schemes generated with their tool hato and executed with Columba outperform existing tools for lossless approximate search.

Searching in a bidirectional index using any given, valid search scheme is already implemented in our library SeqAn [15] for Hamming and edit distance. Hence, it is attractive to investigate methods to construct search schemes for larger k. This fact and the shortcomings of the method of Kianfar et al. inspired the work we present in this paper.

We show how to formulate suffix filters [8], 01*0 seeds [9], and pigeonhole filtering in terms of search schemes and compare the performance of all methods using the Columba tool [13]. We propose a novel metric, the weighted node count, to evaluate search schemes; it correlates much better with their actual performance than the node count measure used in [12].

Finally, we underpin these claims by collecting benchmarks for all search schemes with Columba [13] and comparing the run times to the node count and to our newly introduced weighted node count, supporting that the weighted node count is a good performance indicator for search schemes.

Material and methods

Nomenclature

In this section, we introduce fundamental notations and definitions. We use 0-based indices, which should be interpreted as offsets from the first entry. A string is denoted as $A = a_0 a_1 \ldots a_{|A|-1}$, where |A| denotes the length and $a_i$ refers to a single character of the string. An infix is denoted as $A_{i,j} = a_i a_{i+1} \ldots a_{j-1}$. The approximate search of a query $Q = q_0 q_1 \ldots q_{|Q|-1}$ inside a text $T = t_0 t_1 \ldots t_{n-1}$ is defined as finding a set of substrings $\{T_{s_0,e_0}, T_{s_1,e_1}, \ldots\}$ that are similar under some metric $\mathrm{dist}(\cdot, \cdot)$, such that $\mathrm{dist}(T_{s_i,e_i}, Q) \le k$, where k is the number of allowed errors for the distance measure. For metrics, we exclusively use either the Hamming distance $\mathrm{dist}_{\mathrm{ham}}$ or the edit distance $\mathrm{dist}_{\mathrm{edit}}$.
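To make the two metrics concrete, here is a minimal Python sketch (function names are our own) implementing $\mathrm{dist}_{\mathrm{ham}}$ and a textbook dynamic-programming formulation of $\mathrm{dist}_{\mathrm{edit}}$:

```python
def dist_ham(a: str, b: str) -> int:
    """Hamming distance: number of mismatching positions (equal lengths)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def dist_edit(a: str, b: str) -> int:
    """Edit distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (x != y)))    # substitution/match
        prev = cur
    return prev[-1]

assert dist_ham("karolin", "kathrin") == 3
assert dist_edit("kitten", "sitting") == 3
```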

A bidirectional index such as the 2FM-Index allows us to search for a query Q by starting at any point in the query and extending it to the left or right until the full query is covered. In a broad sense, a search scheme describes how to extend a query step by step and how many errors are allowed at each step.

Assume the query Q is divided into p parts $P = (P_0, \ldots, P_{p-1})$ such that the concatenation of the parts equals Q. A search scheme is defined as a set of searches $S = \{s_0, s_1, \ldots, s_{|S|-1}\}$. Each search is defined as a triplet $s_i = (\pi_i, L_i, U_i)$. The search order of the parts is given by a permutation $\pi = (\pi_0, \ldots, \pi_{p-1})$ of $\{0, \ldots, p-1\}$. To work with bidirectional indices, e.g. a 2FM-Index, it is also required that π fulfills the connectivity property, which states that each $\pi_i$ has to be either one larger than the maximum or one smaller than the minimum of the preceding $\pi_0, \ldots, \pi_{i-1}$. Further, let $L_i = (L_{i,0}, \ldots, L_{i,p-1})$ denote the minimal accumulated error and $U_i = (U_{i,0}, \ldots, U_{i,p-1})$ the maximal accumulated error while searching the parts in the order of π. The values in $U_i$ and $L_i$ must then be monotonically non-decreasing, and the upper bound must be at least the lower bound, $U_{i,j} \ge L_{i,j}$ for all j. Search schemes whose searches all fulfill these two requirements can be used for searching in a bidirectional index and are hence called valid. In the following, we drop the search index i from $L_i$ and $U_i$ and write L and U if the context is clear. The expression $L = (x, y, z, \ldots)$ is shortened to $L = xyz\ldots$ to increase readability.
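Both requirements can be checked mechanically. A minimal Python sketch (our own helper, not part of SeqAn or Columba) that tests whether a single search (π, L, U) is valid in the above sense:

```python
def is_valid_search(pi, L, U):
    """Check the connectivity property of pi and the bound conditions."""
    lo = hi = pi[0]
    for part in pi[1:]:
        # connectivity: each step extends the searched range by one part
        if part == hi + 1:
            hi = part
        elif part == lo - 1:
            lo = part
        else:
            return False
    for j in range(len(pi)):
        if L[j] > U[j]:                   # lower bound must not exceed upper
            return False
        if j > 0 and (L[j] < L[j - 1] or U[j] < U[j - 1]):
            return False                  # accumulated bounds must not decrease
    return True

# The three searches of the pigeonhole scheme in Fig. 2 are all valid:
for pi in ([0, 1, 2], [1, 0, 2], [2, 1, 0]):
    assert is_valid_search(pi, [0, 0, 0], [0, 2, 2])
```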

An error configuration is a distribution of the k errors over the parts and is denoted by $(e_0, \ldots, e_{p-1})$ with $\sum_{i=0}^{p-1} e_i = k$. If a search scheme covers all possible error configurations, it is called complete. Additionally, if each error configuration is covered by exactly one search, it is called non-redundant. In the case of Hamming distance, a complete and non-redundant search scheme leads to each match being found exactly once. In contrast, edit-distance-based searches can find the same position with different alignments.
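Completeness and redundancy can likewise be verified by brute force over all error configurations. A minimal sketch for Hamming distance, using the definitions above (the encoding of searches as triplets of lists is our own):

```python
from itertools import product

def covers(search, config):
    """A search covers a configuration if, walking the parts in the order
    pi, the accumulated errors stay within [L_j, U_j] after every part."""
    pi, L, U = search
    acc = 0
    for j, part in enumerate(pi):
        acc += config[part]
        if not (L[j] <= acc <= U[j]):
            return False
    return True

def coverage(scheme, k, p):
    """How many searches cover each configuration of exactly k errors?
    A zero anywhere means the scheme is not complete; values above one
    mean it is redundant."""
    configs = (c for c in product(range(k + 1), repeat=p) if sum(c) == k)
    return {c: sum(covers(s, c) for s in scheme) for c in configs}

# Pigeonhole scheme from Fig. 2: complete but redundant, e.g. (0, 0, 2)
# is covered by both the first and the second search.
scheme = [([0, 1, 2], [0, 0, 0], [0, 2, 2]),
          ([1, 0, 2], [0, 0, 0], [0, 2, 2]),
          ([2, 1, 0], [0, 0, 0], [0, 2, 2])]
print(coverage(scheme, k=2, p=3))
```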

An example of a search scheme can be seen in Fig. 2, where the standard pigeonhole principle for two errors is expressed as a valid and complete search scheme with three searches. It can be seen that all error configurations are covered, albeit some by several searches. Hence, the search scheme is redundant. This opens the door for devising an improved version of the pigeonhole search scheme that we will introduce in the Subsection Formulating search strategies as search schemes.

Figure 1. The weighted node count for Hamming distance. Conditions: (1) $\mathcal{L}_l \le d \le \mathcal{U}_l \wedge 1 \le l \le \lceil \log_{\sigma} |T| \rceil$; (2) $\mathcal{L}_l \le d \le \mathcal{U}_l \wedge l > \lceil \log_{\sigma} |T| \rceil$. The depiction of the weighted node count for edit distance can be found in the appendix.

Figure 2. Pigeonhole principle expressed as a search scheme S = {s0 = (012, 000, 022), s1 = (102, 000, 022), s2 = (210, 000, 022)} with k = 2, query length |Q| = 8, and an alphabet of |Σ| = 2 for distham. Following an edge down symbolizes a match between query and text; branching to the right indicates a mismatch. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 2), (0, 2, 0)}, the second search covers {(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 0, 2), (2, 0, 0)}, and the third search covers {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 0, 0), (0, 2, 0)}.

Node count

The quality of a search scheme can be measured by counting the number of extensions applied to a pattern, which equals the number of search steps in the 2FM-Index. This measure is also referred to as the node count and was introduced by Kianfar et al. [12]. To compute the node count of a search scheme, we have to assume a distance metric, a pattern length, and an alphabet size. From the pattern length, we can compute vectors $\mathcal{L}$ and $\mathcal{U}$, which are similar to L and U but are inflated to the length of the pattern. For search $s_0$ in Fig. 2, the bounds L = 000, U = 022 are inflated to $\mathcal{L}^{\mathrm{edit}} = 00000000$, $\mathcal{U}^{\mathrm{edit}} = 00022222$, or $\mathcal{L}^{\mathrm{ham}} = 00000000$, $\mathcal{U}^{\mathrm{ham}} = 00012222$.
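The inflation step can be sketched as follows; the part lengths (3 + 3 + 2 for |Q| = 8) and the capping rule reflect our reading of [12]: each position inherits the accumulated bound of its part, and for Hamming distance the upper bound is additionally capped by the number of positions seen so far at which an error may occur, since every Hamming error consumes one position:

```python
def inflate(L, U, part_lens, metric="edit"):
    """Expand accumulated per-part bounds (L, U) to per-position bounds."""
    L_pos, U_pos = [], []
    err_positions = 0   # positions seen so far at which an error may occur
    prev_l = 0
    for length, l, u in zip(part_lens, L, U):
        for i in range(length):
            if u > 0:
                err_positions += 1
            # edit distance: insertions can realize u errors immediately;
            # Hamming distance: every error consumes one of the seen positions
            U_pos.append(u if metric == "edit" else min(u, err_positions))
            # the accumulated lower bound only has to hold at part ends
            L_pos.append(l if i == length - 1 else prev_l)
        prev_l = l
    return L_pos, U_pos

# Search s0 of Fig. 2 with |Q| = 8 split into parts of lengths 3, 3, 2:
print(inflate([0, 0, 0], [0, 2, 2], [3, 3, 2], "edit")[1])  # [0,0,0,2,2,2,2,2]
print(inflate([0, 0, 0], [0, 2, 2], [3, 3, 2], "ham")[1])   # [0,0,0,1,2,2,2,2]
```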

Weighted node count

However, the node count has a few shortcomings as a performance indicator for search schemes. This becomes especially noticeable when comparing the run time of plain backtracking to that of other search schemes at higher error rates.

We introduce an improved measure, the weighted node count, which takes into consideration that not all children of every node can be visited. Depending on the size of the indexed text T, it is guaranteed that at a certain depth the search will no longer visit every node. At depth l, the number of possible backtracking branches is $|\Sigma|^l$, but a concrete reference T can contain at most $|T| - l + 1$ different l-mers. Hence, below level $l = \lceil \log_{|\Sigma|} |T| \rceil$, not all branches of the search scheme will be visited. Using this reasoning, we change the computation of the node count: from level l on, the maximal number of nodes being visited is capped below the total number of nodes. The weighted node count for Hamming distance can then be computed as seen in Fig. 1.
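For illustration, the following simplified sketch computes a weighted node count for a single search under Hamming distance. It is not the exact formula of Fig. 1; it merely demonstrates the capping idea by bounding each level's node count by the number of distinct l-mers the text can contain:

```python
import math

def weighted_node_count(L_pos, U_pos, sigma, text_len):
    """Level-by-level node count of one search under Hamming distance,
    with each level capped by the number of distinct l-mers in T."""
    cutoff = math.ceil(math.log(text_len, sigma))
    states = {0: 1}                 # error count -> number of nodes (the root)
    total = 0
    for l, (lo, hi) in enumerate(zip(L_pos, U_pos), start=1):
        nxt = {}
        for d, cnt in states.items():
            nxt[d] = nxt.get(d, 0) + cnt                          # match edge
            if d + 1 <= hi:
                nxt[d + 1] = nxt.get(d + 1, 0) + cnt * (sigma - 1)  # mismatches
        states = {d: c for d, c in nxt.items() if d <= hi}
        level = sum(c for d, c in states.items() if lo <= d <= hi)
        if l > cutoff:
            level = min(level, max(text_len - l + 1, 0))
        total += level
    return total

# Example: a 50-mer with a 12-character exact prefix and up to 2 errors after.
print(weighted_node_count([0] * 50, [0] * 12 + [2] * 38, 4, 3_000_000_000))
```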

The weighted node count for edit distance looks similar but has additional branches. We included the equation in the appendix.

Formulating search strategies as search schemes

To formulate published search strategies as search schemes, we need a method to generate $S = \{(\pi_s, L_s, U_s) \mid s = 1, \ldots, |S|\}$ for an allowed number of errors, ensuring that the resulting search scheme is valid and complete. All our illustrations are for the Hamming distance, but the schemes are equally valid for the edit distance. In the following, we describe several search schemes based on different strategies:

  • Sph,k: Pigeonhole based with k errors.

  • Sphopt,k: Optimized pigeonhole based with k errors.

  • Ssf,k: Suffix filter based with k errors.

  • S010,k: 01*0 seeds based with k errors.

  • S010opt,k: Optimized 01*0 seeds based with k errors.

  • SH,k,p: Our custom heuristic to construct search schemes with k errors and p parts.

Later on, we will compare these to additional search schemes:

  • Sbt,k: Backtracking scheme with a single search.

  • SOSS,k: Optimum Search Schemes provided by Kianfar et al. [12].

  • Skianfar,k: Optimum search schemes that are split into k + 1 parts, also described by Kianfar et al. [12].

  • Skuch,k,k+1: Splitting the search into k + 1 parts. For k ∈ {1, 2, 3, 4} described by Kucherov et al. [11].

  • Skuch,k,k+2: Splitting the search into k + 2 parts. For k ∈ {1, 2, 3, 4} described by Kucherov et al. [11].

  • Shato,k: Search schemes provided by Renders et al. [14].

All used search schemes with concrete numbers are listed in the supplementary file in the Search Schemes section.

Pigeonhole strategy

From the pigeonhole principle, we can derive that if a query with at most k errors is divided into k + 1 parts, at least one of the parts must contain zero errors. By performing k + 1 searches, each starting with a different part that is matched with zero errors, we can create a valid and complete search scheme. An example of this search scheme is shown in Fig. 2.
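One way to encode this construction (the part order is inferred from Fig. 2: the starting part, then all parts to its left, then all parts to its right; the triplet encoding is our own):

```python
def pigeonhole_scheme(k):
    """S_ph,k: k+1 searches over p = k+1 parts; search i starts exact at part i."""
    p = k + 1
    scheme = []
    for i in range(p):
        pi = [i] + list(range(i - 1, -1, -1)) + list(range(i + 1, p))
        L = [0] * p
        U = [0] + [k] * (p - 1)   # bounds are accumulated in search order pi
        scheme.append((pi, L, U))
    return scheme

# k = 2 reproduces the scheme of Fig. 2:
# [([0,1,2],[0,0,0],[0,2,2]), ([1,0,2],[0,0,0],[0,2,2]), ([2,1,0],[0,0,0],[0,2,2])]
print(pigeonhole_scheme(2))
```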

The above search scheme for the pigeonhole principle can be improved with tighter lower and upper bounds. For example, the first search of Sph,k finds all substrings with zero errors in the first part, so the second and further searches can avoid searching for those results. Careful inspection shows that for search i, we can require at least one error on each part to the left of part i, since configurations without such errors have already been found by previous searches. To achieve this, we increase the lower bounds when searching to the left of part i (i.e. parts i − 1 to 0): the lower bound is set to 1 for part i − 1 and increased until value i is reached for part 0.

Regarding the upper bounds, we can avoid the search for certain values for the parts to the left of i. This is achieved by decreasing the upper bound to k − i + 1 for the (i − 1)th part and allowing one additional error for every next part to the left until k is reached.
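A sketch of the optimized construction, which for k = 2 reproduces the scheme of Fig. 3 (again our own encoding; bounds accumulated in search order):

```python
def pigeonhole_opt_scheme(k):
    """S_phopt,k: pigeonhole scheme with tightened lower and upper bounds."""
    p = k + 1
    scheme = []
    for i in range(p):
        pi = [i] + list(range(i - 1, -1, -1)) + list(range(i + 1, p))
        # left parts i-1..0 require at least one new error each (lower 1..i);
        # their upper bound grows from k-i+1 up to (at most) k
        L = [0] + list(range(1, i + 1)) + [i] * (p - 1 - i)
        U = [0] + [min(k - i + 1 + j, k) for j in range(i)] + [k] * (p - 1 - i)
        scheme.append((pi, L, U))
    return scheme

# k = 2 reproduces Fig. 3:
# [([0,1,2],[0,0,0],[0,2,2]), ([1,0,2],[0,1,1],[0,2,2]), ([2,1,0],[0,1,2],[0,1,2])]
print(pigeonhole_opt_scheme(2))
```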

The resulting search scheme, depicted in Fig. 3, has a lower node count, and the error configurations covered by its searches are disjoint, in contrast to the standard pigeonhole scheme depicted in Fig. 2. For higher numbers of errors, k > 2, the optimized pigeonhole-based search schemes have redundant searches.

Figure 3. An optimized version of the pigeonhole principle expressed as a search scheme S = {(012, 000, 022), (102, 011, 022), (210, 012, 012)} with k = 2, query length |Q| = 8, and an alphabet of |Σ| = 2. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 2), (0, 2, 0)}, the second search covers {(1, 0, 0), (1, 0, 1), (2, 0, 0)}, and the third search covers {(1, 1, 0)}.

Suffix filter

The suffix filter can be seen as an extension of the pigeonhole principle. Assume we have a factorization of the query into k + 1 factors. Then, according to Kärkkäinen and Na [8], there must exist a strong match of a factor suffix, meaning that for this suffix the accumulated number of errors over its first s factors is smaller than s, for every s. We can model this as a search scheme by first searching a part i with zero errors, then extending the search to the right as long as the strong match criterion is met, and finally extending to the parts on the left.
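Under our reading of the strong-match criterion (for some factor i, every window of s consecutive factors starting at i contains fewer than s errors), the guarantee can be sanity-checked by brute force:

```python
from itertools import product

def has_strong_suffix_match(config):
    """Is there a factor i such that every window config[i : i+s]
    contains fewer than s errors?"""
    p = len(config)
    return any(all(sum(config[i:i + s]) < s for s in range(1, p - i + 1))
               for i in range(p))

# Brute-force check of the guarantee for k = 3 errors over k + 1 factors:
k = 3
assert all(has_strong_suffix_match(c)
           for c in product(range(k + 1), repeat=k + 1) if sum(c) <= k)
```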

For our small example, the suffix filter search scheme is depicted in Fig. 4. Note the difference in the error configurations when comparing to the optimized pigeonhole scheme depicted in Fig. 3. For these parameters, the suffix filter search scheme is redundant.

Figure 4. Suffix filter expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of |Σ| = 2. The first search covers the error configurations {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 2)}, the second search covers {(1, 0, 0), (1, 0, 1), (2, 0, 0), (0, 1, 0)}, and the third search covers {(1, 1, 0), (0, 2, 0), (0, 1, 0)}.

01*0 seeds

The 01*0 seeds strategy of Vroland et al. [9] follows from a statement that is stronger than the pigeonhole principle: if a read has k errors and is divided into k + 2 parts, then there exists at least one group of adjacent parts in which the first and last part have zero errors and all parts in between have exactly one error. In the original formulation, 01*0 seeds are found using a unidirectional FM-Index and then extended and verified in the text. In contrast to this strategy, we formulate the complete search of the query as a search scheme. Let j be the number of parts that match with exactly one error. Each search then matches part i with 0 errors, the next j parts with exactly 1 error each, and part i + j + 1 again with 0 errors; the remaining errors are distributed over the remaining parts.
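The underlying combinatorial guarantee can again be verified by brute force; a small sketch (our own encoding of error configurations):

```python
from itertools import product

def has_010_seed(config):
    """Adjacent parts matching the pattern 0 1* 0 (zero errors, then any
    number of exactly-one-error parts, closed by zero errors)?"""
    p = len(config)
    for i in range(p):
        if config[i] != 0:
            continue
        j = i + 1
        while j < p and config[j] == 1:
            j += 1
        if j < p and config[j] == 0:
            return True
    return False

# Every distribution of k errors over k + 2 parts contains a 01*0 seed:
k = 3
assert all(has_010_seed(c)
           for c in product(range(k + 1), repeat=k + 2) if sum(c) == k)
```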

The search scheme S010,k produces searches that share the same order of parts. This enables merging searches, reducing the total number of searches. We consider this the optimized 01*0 seeds strategy, which contains exactly k searches. A disadvantage of merging searches is that it creates additional search paths that did not exist in the original S010,k search scheme. In Fig. 5, an S010,k search scheme is depicted. It consists of five searches, of which several search the parts in exactly the same order but with different error bounds. The optimized scheme S010opt,k with fewer searches but more nodes is depicted in Fig. 6.

Figure 5. 01*0 seeds expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of |Σ| = 2.

Figure 6. Optimized 01*0 seeds expressed as a search scheme with k = 2, query length |Q| = 8, and an alphabet of |Σ| = 2.

We evaluate the proposed search schemes in Subsection Evaluation. We compare the theoretical performance by computing the node count and measure the actual run time.

Our search scheme heuristic

In this subsection, we describe how to construct a search scheme $S_{H,k,p} = \{(\pi_0, L_0, U_0), \ldots\}$. Our goal is to construct search schemes with a low node count that can be used for errors k ≥ 4, for which optimum search schemes are unknown. The algorithm neither guarantees to create optimum search schemes, nor do we know in general whether the produced schemes are complete or valid. Because of these limitations, we consider our method a heuristic for search scheme construction. However, we checked that the resulting search schemes are valid and complete for up to 15 errors. Additionally, our heuristic has the same node count as the known optimum search schemes and the lowest node count of all known search schemes for higher errors.

For our construction algorithm, we analyzed the known optimum search schemes and noticed that they seemingly have an internal structure. We describe a parameterized construction that reproduces these search schemes. The construction algorithm has two user-defined parameters, k and p. The parameter k indicates the allowed number of errors our scheme should cover. The parameter p controls the number of parts and must be chosen such that p > k (this is a limitation of our construction and could likely be removed). To find a search scheme with a low node count, one must iterate over different values of p. In all our test cases, p = k + 2 results in the lowest node count, with the exception of k ≤ 2, where p = k + 1 has the lowest node count.

From these two parameters, we need to derive the number of searches and, for each search, the order of parts as well as the lower and upper bound of each part. Computing the number of searches, the order of parts, and the lower bounds is much easier than computing the upper bounds. Starting with the number of searches, we fix it to k + 1.

The order of the parts for a search i is defined as follows:

The lower bound of the parts are defined as follows:

The upper bound Ui requires an additional matrix M. The matrix M defines the maximum number of errors that are allowed in a part. The rows correspond to the searches, and each column corresponds to a part of the query:

The matrix M is split into three sub matrices called Mk, Mb, and Mt. Each of these matrices have to satisfy additional constraints:

The first three conditions can be used to generate the submatrices Mk, Mb, and Mt, respectively. The last two conditions are not automatically fulfilled after generating M. In an adjustment step, the first column i with an element $m_{j,i}$ violating these conditions has to be adjusted by permuting its elements: the elements $m_{0,i}$ through $m_{k-1,i}$, excluding the element $m_{i,i}$, are permuted until a valid combination is found. (We do not know whether this step always terminates; in our tests it worked for k ≤ 1000.) This adjustment step is repeated until all columns fulfill the conditions. In a last step, the matrix M is reordered to Mord according to π:

The upper bound of the parts can now be defined as follows, limiting the values to a maximum of k:

Fig. 7 shows a search scheme with two allowed errors and split into four parts. The node count is identical to the known optimum search schemes.

Figure 7. Search scheme generated by our construction algorithm: S = {s0 = (2310, 0000, 0022), s1 = (1230, 0011, 0112), s2 = (0123, 0002, 0122)} with k = 2, query length |Q| = 8, a 4-part split, and an alphabet of |Σ| = 2 for distham.

Step by step example

The following section goes through an example creating $S_{H,3,5}$, which, in this case, is an optimum search scheme. It allows searches for up to k = 3 errors and splits a query into p = k + 2 = 5 parts. We start by generating the order of parts for each search:

Next, we generate the lower bound L of every search:

To generate the upper bound, we first need to compute Mord. For this, we create an initial matrix M0:

The values $m_{0,1} = 2$ and $m_{0,2} = 1$ violate the property $m_{0,1} \le m_{0,2}$. This triggers a permutation of column 2, resulting in a new matrix M1:

This matrix M1 fulfills all required properties, so we can continue by reordering the values according to π to obtain Mord:

To compute U, we need to combine L and Mord:

This results in four searches s0 = (01234, 00003, 02233), s1 = (12340, 00022, 01223), s2 = (23410, 00111, 01123), and s3 = (34210, 00000, 00333).
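As a consistency check, the brute-force coverage test from the Nomenclature subsection can be applied to these four searches. The following self-contained sketch (our own encoding) confirms that every configuration of up to three errors is covered:

```python
from itertools import product

def covers(search, config):
    """Accumulated errors along pi must stay within [L_j, U_j] at every step."""
    pi, L, U = search
    acc = 0
    for j, part in enumerate(pi):
        acc += config[part]
        if not (L[j] <= acc <= U[j]):
            return False
    return True

scheme = [([0, 1, 2, 3, 4], [0, 0, 0, 0, 3], [0, 2, 2, 3, 3]),
          ([1, 2, 3, 4, 0], [0, 0, 0, 2, 2], [0, 1, 2, 2, 3]),
          ([2, 3, 4, 1, 0], [0, 0, 1, 1, 1], [0, 1, 1, 2, 3]),
          ([3, 4, 2, 1, 0], [0, 0, 0, 0, 0], [0, 0, 3, 3, 3])]

k, p = 3, 5
uncovered = [c for c in product(range(k + 1), repeat=p)
             if sum(c) <= k and not any(covers(s, c) for s in scheme)]
print("complete" if not uncovered else f"uncovered: {uncovered}")
```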

Results and discussion

Evaluation

In this section, we evaluate the node count, weighted node count, and run time of all search schemes. In our benchmarks, we use a string generated over the alphabet Σ = {A, C, G, T} of length 3 billion as our reference text (roughly the size of the human genome). As queries, we created a fixed set of simulated strings of length 50 with an exact number of edit distance errors depending on the benchmark. The length of 50 mimics methods that use the FM-Index as a seeding strategy [16, 17]. For run time measurements, we used the Columba tool [13] in version v1.2. We activated uniform partitioning and did not use in-text verification to avoid query-specific optimization, but we used Columba's "editopt" search distance, knowing this will lead to an overestimation by our weighted node count. To generate simulated genomes, simulated reads, and search schemes, we use our tool Sahara [18]. We also use it to simplify the preprocessing steps for Columba. While Sahara is capable of executing searches, we decided to use the superior implementation in Columba, which also maintains comparability with previously published papers by Renders et al.

Using the node count as a performance predictor (Table 1), we expect that SH,k,k+2 and SOSS,k outperform all other search strategies. Interestingly, it even predicts that SH,k,k+2 outperforms the optimum search scheme SOSS,k for two errors. We attribute this to the partitioning of the query. Since the query is 50 bases long and divided into four parts, the parts will have either 12 or 13 bases. Depending on the partition, the node count prefers one or the other search scheme.

Table 1. Node counts of different search schemes, assuming |P| = 50, |Σ| = 4, and edit distance. The lowest node counts per column are shown in bold. Empty cells denote non-existing search schemes.

| Errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Sbt,k | 1 025 · 10^1 | 1 343 · 10^3 | 1 293 · 10^5 | 975 · 10^7 | 600 · 10^9 | 309 · 10^11 |
| SOSS,k | **530 · 10^1** | 747 · 10^3 | **732 · 10^5** | - | - | - |
| S010,k | 585 · 10^1 | 836 · 10^3 | 1 004 · 10^5 | 949 · 10^7 | 703 · 10^9 | 424 · 10^11 |
| S010opt,k | 581 · 10^1 | 1 014 · 10^3 | 1 356 · 10^5 | 1 279 · 10^7 | 951 · 10^9 | 567 · 10^11 |
| Sph,k | **530 · 10^1** | 1 199 · 10^3 | 1 621 · 10^5 | 1 561 · 10^7 | 1 166 · 10^9 | 703 · 10^11 |
| Sphopt,k | **530 · 10^1** | 900 · 10^3 | 1 062 · 10^5 | 938 · 10^7 | 649 · 10^9 | 368 · 10^11 |
| Ssf,k | **530 · 10^1** | 780 · 10^3 | 889 · 10^5 | 753 · 10^7 | 557 · 10^9 | 315 · 10^11 |
| SH,k,k+1 | **530 · 10^1** | 779 · 10^3 | 790 · 10^5 | 603 · 10^7 | 389 · 10^9 | 205 · 10^11 |
| SH,k,k+2 | 581 · 10^1 | **737 · 10^3** | **732 · 10^5** | **568 · 10^7** | **359 · 10^9** | **189 · 10^11** |
| SH,k,k+3 | 645 · 10^1 | 790 · 10^3 | 753 · 10^5 | 579 · 10^7 | 362 · 10^9 | **189 · 10^11** |
| Skianfar,k | **530 · 10^1** | 779 · 10^3 | 1 194 · 10^5 | - | - | - |
| Skuch,k,k+1 | **530 · 10^1** | 807 · 10^3 | 983 · 10^5 | 1 122 · 10^7 | - | - |
| Skuch,k,k+2 | 581 · 10^1 | 857 · 10^3 | 804 · 10^5 | 1 055 · 10^7 | - | - |
| Shato,k | **530 · 10^1** | 762 · 10^3 | 834 · 10^5 | 659 · 10^7 | 483 · 10^9 | 235 · 10^11 |

Measuring the run times for 100 000 queries with the different search schemes, Table 2 tells a different story. Despite our prediction derived from the node count, our heuristic search schemes SH,k,k+c and SOSS,k were outperformed by other schemes in every error category. A similar observation was made by Renders et al. [13]. Moreover, the node count predicts that all search schemes have run times of similar magnitude; in particular, Sbt,k should be no more than about 2× slower than the fastest search schemes. Contrary to this prediction, the actual run times in Table 2 show that Sbt,k is multiple orders of magnitude slower than every other search scheme. Lastly, from the node count one would estimate that every additional error increases the run time by about 100×; for S010,k this predicts an increase of about 10^10× when going from k = 1 to k = 6. Our benchmarks show it is less than 10^3× slower, and the same holds for all other search schemes. From these observations we conclude that the node count is not a useful predictor of the performance of a search scheme.

Table 2. Run time in seconds of search schemes run by Columba searching for 100 000 queries, using |P| = 50, |Σ| = 4, |T| = 3 000 000 000, with simulated reads and edit distance. The lowest run times per column are shown in bold. For k = 5 and k = 6, only 1000 queries were executed and the run time scaled by 100. Empty cells denote non-existing search schemes. The search scheme SH,k,k+3 for k = 6 failed due to a Columba fallback to a normal bidirectional search, and Sbt,k timed out after 1 h.

| Errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Sbt,k | 28.05 | 725.68 | timeout | timeout | timeout | timeout |
| SOSS,k | 1.54 | 5.17 | 48.13 | - | - | - |
| S010,k | 1.80 | 5.92 | 21.54 | 70.81 | 263.00 | 1084.00 |
| S010opt,k | 1.80 | 5.15 | 16.44 | 58.08 | 725.00 | 9432.00 |
| Sph,k | **1.52** | 2.16 | 54.60 | 1672.09 | 29 789.00 | 239 756.00 |
| Sphopt,k | **1.52** | 2.07 | 38.11 | 860.12 | 14 862.00 | 120 531.00 |
| Ssf,k | 1.54 | 2.02 | 13.80 | 357.49 | 2225.00 | 12 194.00 |
| SH,k,k+1 | 1.53 | **2.01** | 20.41 | 534.57 | 5937.00 | 44 025.00 |
| SH,k,k+2 | 1.78 | 5.04 | 48.50 | 928.63 | 9784.00 | 78 642.00 |
| SH,k,k+3 | 3.75 | 11.28 | 145.14 | 1885.38 | 19 851.00 | failed |
| Skianfar,k | **1.52** | **2.01** | 107.31 | - | - | - |
| Skuch,k,k+1 | 1.53 | 2.06 | 8.40 | 62.47 | - | - |
| Skuch,k,k+2 | 1.78 | 5.23 | 16.19 | **50.56** | - | - |
| Shato,k | 1.55 | 2.05 | **8.36** | 57.93 | **68.00** | **426.00** |

As an alternative run time estimate, we compare our previously introduced weighted node count with the run time. Table 3 lists the weighted node counts. Here we see a correlation between the predicted fastest search schemes and the actual run times. Only for k = 4 are our predictions of the fastest and second-fastest search schemes swapped. The weighted node count also correctly predicts that Sbt,k is orders of magnitude slower than the other search schemes.

Table 3. Weighted node counts of different search schemes, assuming |P| = 50, |Σ| = 4, |T| = 3 000 000 000, and edit distance. The lowest weighted node counts per column are shown in bold. Empty cells denote non-existing search schemes.

| Errors | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Sbt,k | 1 082.60 | 434 · 10^2 | 122 · 10^4 | 255 · 10^5 | 414 · 10^6 | 530 · 10^9 |
| SOSS,k | **31.65** | 198.46 | 2 669.57 | - | - | - |
| S010,k | 47.52 | 229.60 | 1 122.48 | 2 650.15 | 15 086.29 | 123 820.56 |
| S010opt,k | 31.70 | 137.24 | 545.41 | **1 255.75** | 46 111.73 | 2 879 754.09 |
| Sph,k | **31.65** | 48.99 | 8 531.56 | 530 041.98 | 15 683 373.87 | 253 620 952.60 |
| Sphopt,k | **31.65** | **48.63** | 2 678.92 | 233 968.19 | 3 128 519.06 | 94 408 630.87 |
| Ssf,k | **31.65** | 48.98 | 3 534.11 | 106 714.51 | 3 313 394.15 | 40 034 994.27 |
| SH,k,k+1 | **31.65** | 48.98 | 3 782.52 | 128 136.32 | 3 617 506.01 | 44 548 566.52 |
| SH,k,k+2 | 31.70 | 137.24 | 2 669.57 | 42 121.71 | 684 323.68 | 9 127 067.20 |
| SH,k,k+3 | 76.53 | 368.88 | 4 184.41 | 76 653.18 | 1 224 984.33 | 18 787 288.71 |
| Skianfar,k | **31.65** | 48.98 | 6 323.76 | - | - | - |
| Skuch,k,k+1 | **31.65** | **48.63** | **304.05** | 3 375.69 | - | - |
| Skuch,k,k+2 | 31.70 | 183.68 | 545.41 | 1 497.72 | - | - |
| Shato,k | **31.65** | **48.63** | **304.05** | 3 006.80 | **1 996.37** | **31 105.20** |

If we compare across error categories, e.g. for S010,k, we predict an increase of 2600× going from k = 1 to k = 6, while our measured data show an actual increase of 600×. Similar correlations can be observed across all search schemes. We assume the deviations originate from the optimizations of the Columba tool, which only follows a reduced set of edges. We normalized the (weighted) node count by a factor such that SOSS,k for k = 1 is perfectly predicted. (We chose SOSS,k for no specific reason; the tables look similar if a different search scheme is chosen for normalization.) The value is then divided by the actual run time, giving us a ratio. The resulting numbers are shown in Supplementary Tables S1 and S2. The weighted node count is often a very good predictor of the run time (mostly within a factor of 2). Its maximal overestimation is in one case 160×, whereas the node count overestimates by up to 16 · 10^6×.

The weighted node count is not a perfect predictor of search scheme performance, but it is a superior estimator compared to the node count. It allows identifying a well-performing search scheme for a given number of allowed errors, query length, and reference text length.

We ran the same benchmark on a fixed set of simulated strings of length 150, a typical size of reads produced by Illumina sequencing machines. The results are less informative and can be found in Supplementary Tables S3–S5. The run times within an error category are so close that meaningful conclusions are harder to draw, so we decided to focus on reads of length 50 instead. Additionally, we ran the benchmark on the human genome with read lengths of 50 and 150; the results are shown in Supplementary Tables S6 and S7. While these results look similar, they would require the weighted node count to take the self-similarity of the reference genome into account. We decided to adhere to the randomly generated reference because it is a simpler model.

In summary, we see that Shato,k has the lowest run time (with the exception of k = 4) even though it does not have the lowest node count. Our weighted node count is closer to the real run time than the node count. The node count wrongly predicts that SH,k,k+2 is the fastest for all k ≥ 2. In contrast, the weighted node count predicts the fastest search scheme correctly for every error category, missing the real run time only by a small factor.

Conclusion

We formulated different search strategies as search schemes and introduced our own search scheme construction algorithm. This enables the creation of search schemes for any number of errors. We showed that our search scheme achieves the lowest node count in every error category, but the measured run times expose that there is no correlation between the run time and the node count. We used this insight to introduce an improved measure, the weighted node count, which can be used to estimate run times with a certain degree of accuracy. This allows choosing the best search scheme for a given set of queries, confirming the findings of Renders et al. [14]. Although there is room to explore even better predictors, the weighted node count can already guide the future development of new search scheme constructions that enable better performance.

Supplementary data

Supplementary data is available at NAR Genomics & Bioinformatics Online.

Conflict of interest

None declared.

Data availability

Data are available at https://github.com/seqan/sahara (https://doi.org/10.5281/zenodo.14640235) and https://github.com/SGSSGene/sahara_benchmarks (https://doi.org/10.5281/zenodo.14640239).

Appendix

Weighted node count

Weighted node count for edit distance. In most practical applications, this formula results in an overestimation, because some recursive calls can be optimized away.

References

1. Manber U, Myers G. Suffix arrays: a new method for on-line string searches. SIAM J Comput. 1993;22:935–48.

2. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms. 2004;2:53–86.

3. Maaß MG. Linear bidirectional on-line construction of affix trees. Algorithmica. 2003;37:43–74.

4. Strothmann D. The affix array data structure and its applications to RNA secondary structure analysis. Theor Comput Sci. 2007;389:278–94.

5. Ferragina P, Manzini G. Opportunistic data structures with applications. FOCS ’00. New York: IEEE; 2000. p. 390–8.

6. Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Research Report 124. Palo Alto, CA: Digital SRC; 1994.

7. Reinert K, Langmead B, Weese D, et al. Alignment of next-generation sequencing reads. Ann Rev Genom Hum Genet. 2015;16:133–51.

8. Kärkkäinen J, Na JC. Faster filters for approximate string matching. ALENEX ’07. Philadelphia: SIAM; 2007. p. 84–90.

9. Vroland C, Salson M, Bini S, et al. Approximate search of short patterns with high error rates using the 01*0 lossless seeds. J Discrete Algorithms. 2016;37:3–16.

10. Lam TW, Li R, Tam A, et al. High throughput short read alignment via bi-directional BWT. BIBM ’09. New York: IEEE; 2009. p. 31–6.

11. Kucherov G, Salikhov K, Tsur D. Approximate string matching using a bidirectional index. Theor Comput Sci. 2016;638:145–58.

12. Kianfar K, Pockrandt C, Torkamandi B, et al. Optimum search schemes for approximate string matching using bidirectional FM-index. arXiv, 5 March 2018, preprint: not peer reviewed.

13. Renders L, Marchal K, Fostier J. Dynamic partitioning of search patterns for approximate pattern matching using search schemes. iScience. 2021;24:102687.

14. Renders L, Depuydt L, Rahmann S, et al. Lossless approximate pattern matching: automated design of efficient search schemes. J Comput Biol. 2024;31:975–89.

15. Reinert K, Dadi TH, Ehrhardt M, et al. The SeqAn C++ template library for efficient sequence analysis: a resource for programmers. J Biotechnol. 2017;261:157–68.

16. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–60.

17. Ahmed N, Bertels K, Al-Ars Z. A comparison of seed-and-extend techniques in modern DNA read alignment algorithms. BIBM ’16. New York: IEEE; 2016. p. 1421–8.

18. Gottlieb SG. seqan/sahara: Sahara v1.3.1 (v1.3.1). Zenodo; 2025.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
