BWT construction and search at the terabase scale

Notations and naming convention.

Notation	Description
$Σ$	Alphabet of symbols; $\{A, C, G, T, N\}$ for DNA
$Σ'$	Augmented alphabet: $Σ' ≜ Σ \cup \{\$\}$
$a, b, c$	Symbols in $Σ'$
$T$	Concatenated reference string including sentinels
$S(i)$	Suffix array: offset of the $i$th smallest suffix
$B$	BWT string: $B[i] ≜ T[S(i) - 1]$
$m$	Number of sentinels: $m ≜ |\{i : B[i] = \$\}|$
$n$	Total length: $n ≜ |T| = |B|$
$r$	Number of runs: $r ≜ |\{i : B[i] \neq B[i + 1]\}| + 1$
$C_{B}(a)$	Accumulative count: $C_{B}(a) ≜ |\{i : B[i] < a\}|$
${rank}_{B}(a, k)$	Rank: ${rank}_{B}(a, k) ≜ |\{i < k : B[i] = a\}|$
$π_{B}(a, k)$	$π_{B}(a, k) ≜ C_{B}(a) + {rank}_{B}(a, k)$
$π(k)$	LF-mapping: $π(k) ≜ S^{-1}(S(k) - 1) = π_{B}(B[k], k)$


Table 1.


Suppose $\mathcal{T} = (P_{0}, P_{1}, \dots, P_{m - 1})$ is an ordered list of $m$ strings over $Σ$. $T ≜ P_{0} \$_{0} P_{1} \$_{1} \cdots P_{m - 1} \$_{m - 1}$ is the concatenation of the strings in $\mathcal{T}$ with sentinels ordered by $\$_{0} < \$_{1} < \dots < \$_{m - 1}$. For other ways to define string concatenation on ordered string lists or unordered string sets, see Cenzato and Lipták (2024).

For convenience, let $n ≜ |T|$ and $T[-1] = T[n - 1]$. The suffix array of $T$ is an integer array $S$ such that $S(i)$, $0 \leq i < n$, is the start position of the $i$th smallest suffix among all suffixes of $T$ (Fig. 1a). The Burrows–Wheeler Transform (BWT) of $T$ is a string $B$ computed by $B[i] = T[S(i) - 1]$. All sentinels in $B$ are represented by the same symbol “$\$$” and are not distinguished from each other. $Σ' ≜ Σ \cup \{\$\}$ denotes the alphabet including the sentinel.

Figure 1.

Examples of BWT and related data structures. (a) The BWT B, suffix array S and LF-mapping π of string T. Subscripts are equal to the ranks of symbols in B, which are the same as the ranks among the suffixes (indicated by arrows). (b) The prefix trie simulated with BWT B. In each node, the pair of integers gives the suffix array interval of the string represented by the path from the node to the root. Double circles indicate nodes that can reach the beginning of T. (c) The prefix directed acyclic word graph (DAWG) of T obtained by merging nodes with identical suffix array intervals.

For $a \in Σ'$, let $C_{B}(a) ≜ |\{i : B[i] < a\}|$ be the number of symbols smaller than $a$ and ${rank}_{B}(a, k) ≜ |\{i < k : B[i] = a\}|$ be the number of occurrences of $a$ before offset $k$ in $B$. We may omit the subscript $B$ when we are describing one string only. The last-to-first mapping (LF-mapping) $π$ is defined by $π(i) ≜ S^{-1}(S(i) - 1)$, where $S^{-1}$ is the inverse function of suffix array $S$. It can be calculated as $π(i) = π_{B}(B[i], i)$, where $π_{B}(a, i) ≜ C_{B}(a) + {rank}_{B}(a, i)$. As $B[π(i)]$ immediately precedes $B[i]$ on $T$, we can use $π$ to decode the $i$th sequence in $B$.
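The definitions above can be illustrated with a minimal Python sketch. It builds the suffix array by naive sorting, which is only viable for toy inputs and is not how ropebwt3 works. The helper names (`build`, `pi_B`, `lf`, `decode`) and the tuple encoding of sentinels are assumptions made for this illustration, and input strings are assumed to be non-empty.

```python
ALPHA = "$ACGTN"  # sort order used here: $ < A < C < G < T < N

def build(strings):
    """Return (S, B) for the concatenation of `strings` with ordered sentinels."""
    T = []
    for i, p in enumerate(strings):
        T += [(ALPHA.index(c), 0) for c in p]
        T.append((0, i))                 # sentinel $_i, smaller than any base
    S = sorted(range(len(T)), key=lambda j: T[j:])   # naive suffix array
    B = "".join(ALPHA[T[j - 1][0]] for j in S)       # T[-1] wraps to T[n-1]
    return S, B

def C(B, a):
    """Number of symbols in B that are smaller than a."""
    return sum(1 for b in B if ALPHA.index(b) < ALPHA.index(a))

def rank(B, a, k):
    """Number of occurrences of a in B[0:k]."""
    return B[:k].count(a)

def pi_B(B, a, k):
    return C(B, a) + rank(B, a, k)

def lf(B, k):
    """LF-mapping of row k."""
    return pi_B(B, B[k], k)

def decode(B, i):
    """Recover the i-th input string by walking LF from row i (the suffix starting at $_i)."""
    out, k = [], i
    while B[k] != "$":
        out.append(B[k])
        k = lf(B, k)
    return "".join(reversed(out))

strings = ["GATTA", "ACA"]
S, B = build(strings)
print(B, [decode(B, i) for i in range(len(strings))])
```

Row $i < m$ corresponds to the suffix that starts at sentinel $\$_i$, which is why decoding the $i$th string starts from row $i$ and stops at the first sentinel seen in $B$.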

2.2 BWT construction

Algorithm 1. Insert BWT B₂ into BWT B₁

1: procedure AppendBWT($B_{1}$, $B_{2}$)
2:     $m_{1} \leftarrow |\{k : B_{1}[k] = \$\}|$
3:     $m_{2} \leftarrow |\{k : B_{2}[k] = \$\}|$
4:     for $k \leftarrow 0$ to $|B_{2}| - 1$ do
5:         $a \leftarrow B_{2}[k]$
6:         $R(k) \leftarrow (a, π_{B_{2}}(a, k))$
7:     end for
8:     for $i \leftarrow 0$ to $m_{2} - 1$ do
9:         $k \leftarrow i$; $l \leftarrow m_{1}$
10:        repeat
11:            $(a, k') \leftarrow R(k)$
12:            $R(k) \leftarrow (a, k + l)$ ▹ position in the merged BWT
13:            $k \leftarrow k'$; $l \leftarrow π_{B_{1}}(a, l)$
14:        until $a = \$$
15:    end for
16:    for $(a, k) \in R$ do ▹ N.B. k is sorted in array R
17:        ${insert}_{B_{1}}(a, k)$ ▹ insert a after k symbols in B₁
18:    end for
19: end procedure

Libsais is an efficient library for computing the suffix array of a single string. It does not directly support a list of strings. Nonetheless, we note that $T$ is a string over alphabet $Σ'' = \{\$_{0}, \$_{1}, \dots, \$_{m - 1}, A, C, G, T, N\}$ with lexicographical order $\$_{0} < \dots < \$_{m - 1} < A < C < G < T < N$. We can use $m + 5$ non-negative integers to encode $T$ and apply libsais. The suffix array derived this way will be identical to the suffix array of $T$.

For $m$ that can be represented by a 32-bit integer and $n$ represented by a 64-bit integer, libsais will need at least $12n$ bytes to construct the suffix array. This is impractical for $n$ larger than tens of billions. To construct the BWT for large $n$, ropebwt3 uses libsais to build the BWT for a batch of input (up to 7 Gb by default) and merges it into the BWT of the already processed batches (Algorithm 1). The basic idea behind the algorithm is well known (Ferragina et al. 2010) but implementations vary (Sirén 2016, Oliva et al. 2021). In ropebwt3, we encode the BWT with a B+-tree (Fig. 2a). This yields $O(\log r)$ rank queries (line 13) and insertions (line 17), where $r$ is the number of runs in the merged BWT. The bottleneck of the algorithm lies in rank calculation inside the loop starting at line 8, whose iterations can be parallelized if $m_{2} > 1$.
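To make the $12n$ bound concrete, the short calculation below assumes that the 12 bytes per position are 4 bytes for a 32-bit input symbol plus 8 bytes for a 64-bit suffix array entry. That split is our reading of the sentence above, not a statement about libsais internals, and the function name is hypothetical. The human100 size is taken from Table 2 and doubled for both strands, ignoring sentinels.

```python
def libsais_min_bytes(n, text_bytes=4, sa_bytes=8):
    """Lower bound on working memory: one input symbol plus one suffix array entry per position."""
    return n * (text_bytes + sa_bytes)

human100_both_strands = 2 * 301_600_000_000   # bases on both strands (Table 2)
tens_of_billions = 20_000_000_000             # an illustrative "tens of billions" input

for label, n in [("human100 in a single pass", human100_both_strands),
                 ("20 billion symbols", tens_of_billions)]:
    print(f"{label}: at least {libsais_min_bytes(n) / 1e12:.2f} TB")
```

Under these assumptions a single-pass suffix array for human100 would need several terabytes of memory, which is the motivation for building the BWT in batches and merging.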

Figure 2.

Examples of binary BWT encoding. The alphabet is $\{A, C, G\}$. (a) Encoded as a B+-tree. An internal node keeps the marginal counts of symbols in its descendants. An external node keeps the run-length encoded substring in BWT. A run length may be encoded in one, two, four, or eight bytes in a scheme inspired by UTF-8. A B+-tree organized this way resembles the rope data structure. (b) Encoded as a bit-packed array. The first two rows store run-length encoded BWT interleaved with marginal counts in each block. The Elias delta encoding is used to represent run lengths. The last row is an index into BWT for fast access.


This online BWT construction algorithm does not use temporary disk space. The overall time complexity is $O(n \log r)$. The BWT takes $O(r \log r)$ in space. The memory required for partial BWT construction with libsais depends on the batch size and the longest string. The B+-tree is dynamic. Ropebwt3 optionally converts the B+-tree to the fermi binary format (Fig. 2b; Li 2012), which is static but is faster to query and can be memory-mapped.
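The role of $r$ in these bounds can be seen in a small sketch that run-length encodes a BWT string and counts its runs. This is plain Python for illustration only; it does not reproduce the byte-level encodings of Fig. 2, and the function names are hypothetical.

```python
from itertools import groupby

def run_length_encode(B):
    """Collapse B into (symbol, run length) pairs."""
    return [(a, len(list(g))) for a, g in groupby(B)]

def count_runs(B):
    """r = |{i : B[i] != B[i+1]}| + 1, as defined in Table 1."""
    return sum(1 for i in range(len(B) - 1) if B[i] != B[i + 1]) + 1

B = "AAACCGGGGA"
runs = run_length_encode(B)
print(runs, count_runs(B), len(B) / count_runs(B))
```

The ratio $n / r$ printed last is the average run length, the quantity reported per dataset in Table 2.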

Ropebwt2 (Li 2014) uses the same B+-tree to encode BWT and has the same time complexity. However, it inserts sequences, not partial BWTs, into existing BWT. It cannot be efficiently parallelized for long strings. Note that independent of our earlier work, Ohno et al. (2018) also used a B+-tree to encode BWT. Its implementation (Bannai et al. 2020) is several times slower than ropebwt2 on 152 bacterial genomes from Li et al. (2024), possibly because ropebwt2 is optimized for the small DNA alphabet.

2.3 Suffix array interval and backward search

For a string $P \in Σ^{*}$ (ie not including sentinels), let $occ(P)$ be the number of occurrences of $P$ in $T$. Define $lo(P)$ to be the number of suffixes that are lexicographically smaller than $P$ and $hi(P) ≜ lo(P) + occ(P)$. $[lo(P), hi(P))$ is called the suffix array interval of $P$, or SA interval in brief. An SA interval is important if there exists $P$ such that it is the SA interval of $P$. The SA interval of the empty string is $[0, n)$ where $n = |T|$.

If we know the SA interval of $P$, we can calculate the SA interval of $aP$ with $lo(aP) = π_{B}(a, lo(P))$ and $hi(aP) = π_{B}(a, hi(P))$. To calculate $occ(P)$, we start with $[0, n)$ and repeatedly apply the equation above from the last symbol in $P$ to the first. This procedure is called backward search.
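Backward search as described above can be sketched in a few lines of Python. The BWT is built here by naive suffix sorting and ranks are computed by scanning, so this is only meant for toy inputs. The helper names and the sentinel encoding are assumptions of this sketch.

```python
ALPHA = "$ACGTN"  # sort order: $ < A < C < G < T < N

def bwt(strings):
    """BWT of the sentinel-terminated concatenation of `strings` (naive construction)."""
    T = []
    for i, p in enumerate(strings):
        T += [(ALPHA.index(c), 0) for c in p] + [(0, i)]   # (0, i) is sentinel $_i
    S = sorted(range(len(T)), key=lambda j: T[j:])
    return "".join(ALPHA[T[j - 1][0]] for j in S)

def pi_B(B, a, k):
    """C_B(a) + rank_B(a, k)."""
    c = sum(1 for b in B if ALPHA.index(b) < ALPHA.index(a))
    return c + B[:k].count(a)

def backward_search(B, P):
    """Return the SA interval [lo, hi) of P; hi - lo is occ(P)."""
    lo, hi = 0, len(B)            # SA interval of the empty string
    for a in reversed(P):         # from the last symbol of P to the first
        lo, hi = pi_B(B, a, lo), pi_B(B, a, hi)
    return lo, hi

B = bwt(["GACCTCCG", "ACCT"])
for P in ["CC", "ACCT", "GG"]:
    lo, hi = backward_search(B, P)
    print(P, (lo, hi), hi - lo)
```

Once the interval becomes empty it stays empty, so no early exit is needed for correctness, although a real implementation would stop as soon as `lo == hi`.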

2.4 Locating with FM-index

By definition of BWT, if $i \in [lo(P), hi(P))$, then $P = T[S(i), S(i) + |P|)$. We can thus locate an occurrence of $P$ in the original string $T$. However, suffix array $S$ takes $O(n \log n)$ in space and may be too large to store explicitly. With an FM-index, we store $S(i)$ only if $i$ is a multiple of $s$, where $s$ is a positive integer controlling the sample rate. We calculate the rest of $S(i)$ using LF-mapping $π(\cdot)$.

A complication with multi-string BWT is that $π(i)$ does not point to the preceding symbol when $B[i] = \$$. This is because sentinels are ordered during suffix sorting but are not distinguished from each other in BWT. We will have to store the rank of each string in $T$ (array $R$ in Algorithm 2). For convenience, we also keep the index of the string and the offset on the string instead of the offset on the concatenated string $T$.

To find the position of $i \in [lo(P), hi(P))$ in the original string, we repeatedly apply $π(\cdot)$ $k$ times until the position of $π^{k}(i)$ is stored. With $U$, $V$ and $R$ precalculated by IndexSSA in Algorithm 2, we can locate $i$ with

$$\begin{cases} (R(π^{k}(i)), k) & \text{if } B[π^{k}(i)] = \$ \\ (U(π^{k}(i)/s), V(π^{k}(i)/s) + k) & \text{otherwise, with } π^{k}(i) \bmod s = 0 \end{cases}$$

where $k$ is the smallest non-negative integer such that $B[π^{k}(i)] = \$$ or $π^{k}(i) \bmod s = 0$. On average, $k \approx s$, which means each locate operation triggers about $s$ rank queries.

Function Locate1 in Algorithm 2 provides a faster way to find one position in SA interval $[l, h)$. A key observation is that if the interval contains a sentinel or there exists $k$ such that $l \leq ks < h$, we can immediately locate one occurrence; if not, we can apply backward search repeatedly until an SA interval $[l', h')$ brackets a stored suffix array value. If $T$ consists of $m'$ identical strings, we apply $s / \min(h - l, m')$ rounds of backward searches on average, usually much faster than the naive algorithm. Ropebwt3 implements a generalized Locate1 function that finds multiple occurrences.

Algorithm 2. Locate one hit given suffix array samples

1: procedure IndexSSA($B$, $s$)
2:     $m \leftarrow |\{i : B[i] = \$\}|$ ▹ Number of sequences
3:     for $t \leftarrow 0$ to $m - 1$ do ▹ Traverse all sequences
4:         $k \leftarrow t$; $l \leftarrow 0$; $A \leftarrow \emptyset$ ▹ $l$ will be the length of the $t$th sequence
5:         repeat ▹ Iterate from the end of the sequence
6:             $k \leftarrow π(k)$; $l \leftarrow l + 1$
7:             if $B[k] = \$$ then
8:                 $R(k) \leftarrow t$ ▹ The rank of the $t$th sequence is $k$
9:             else if $k \bmod s = 0$ then
10:                $A \leftarrow A \cup \{(k / s, l)\}$
11:            end if
12:        until $B[k] = \$$
13:        for $(j, l') \in A$ do
14:            $U(j) \leftarrow t$; $V(j) \leftarrow l - l'$ ▹ offset on the $t$th sequence
15:        end for
16:    end for
17:    return $(U, V, R)$
18: end procedure
19: procedure Locate1($B, U, V, R, lo, hi, s$)
20:    $I \leftarrow \{(lo, hi, 0)\}$
21:    while $I \neq \emptyset$ do
22:        $(l, h, o) \leftarrow$ largest interval in $I$
23:        $I \leftarrow I \setminus \{(l, h, o)\}$
24:        if $\exists k$ such that $k \cdot s \in [l, h)$ then
25:            return $(U(k), V(k) + o)$
26:        else if $π_{B}(\$, l) < π_{B}(\$, h)$ then
27:            return $(R(π_{B}(\$, l)), o)$
28:        end if
29:        for $a \in Σ$ do
30:            $I \leftarrow I \cup \{(π_{B}(a, l), π_{B}(a, h), o + 1)\}$
31:        end for
32:    end while
33: end procedure
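The sampling and the per-row locate rule can be sketched as follows. This is a toy illustration under stated assumptions rather than the ropebwt3 implementation. It uses naive suffix sorting and linear-time ranks, stores $U$, $V$ and $R$ in dictionaries, indexes $R$ by BWT row, assumes non-empty input strings, and records the step count at which each sample is taken so that the offset is the sequence length minus that count. It covers the simple locate procedure only, not Locate1.

```python
ALPHA = "$ACGTN"  # sort order: $ < A < C < G < T < N

def build(strings):
    """Return (S, B) by naive suffix sorting; sentinel $_i is encoded as (0, i)."""
    T = []
    for i, p in enumerate(strings):
        T += [(ALPHA.index(c), 0) for c in p] + [(0, i)]
    S = sorted(range(len(T)), key=lambda j: T[j:])
    return S, "".join(ALPHA[T[j - 1][0]] for j in S)

def lf(B, k):
    """LF-mapping: C(B[k]) + rank(B[k], k)."""
    a = B[k]
    c = sum(1 for b in B if ALPHA.index(b) < ALPHA.index(a))
    return c + B[:k].count(a)

def index_ssa(B, s):
    """Sample every s-th row: U = string index, V = offset; R maps sentinel rows to strings."""
    U, V, R = {}, {}, {}
    for t in range(B.count("$")):
        k, l, A = t, 0, []
        while True:                      # walk the t-th string from its end
            k, l = lf(B, k), l + 1
            if B[k] == "$":              # reached the first symbol of the string
                R[k] = t
                break
            if k % s == 0:
                A.append((k // s, l))    # remember the step count of this sample
        for j, l_at in A:
            U[j], V[j] = t, l - l_at     # l is now the length of the t-th string
    return U, V, R

def locate(B, U, V, R, s, i):
    """Return (string index, offset) of the suffix at row i, for rows i >= m."""
    k = 0
    while True:
        if B[i] == "$":
            return R[i], k
        if i % s == 0:
            return U[i // s], V[i // s] + k
        i, k = lf(B, i), k + 1

strings = ["GACCTCCG", "ACCT", "TTACCA"]
S, B = build(strings)
U, V, R = index_ssa(B, 4)
print([locate(B, U, V, R, 4, i) for i in range(len(strings), len(B))])
```

A larger $s$ shrinks $U$ and $V$ but lengthens the walk in `locate`, which is the space and time trade-off described in the text.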

2.5 Double-strand BWT

The definitions above are applicable to generic strings. With one BWT, we can only achieve backward search; forward search additionally requires the BWT of the reverse strings (Lam et al. 2009). Nonetheless, due to the strand symmetry of DNA strings, it is possible to achieve both forward and backward search with one BWT provided that the BWT contains both strands of DNA strings (Li 2012).

Formally, a DNA alphabet is $Σ = \{A, C, G, T, N\}$. $\bar{a}$ denotes the Watson-Crick complement of symbol $a \in Σ$. The complements of $\$, A, C, G, T$, and $N$ are $\$, T, G, C, A$, and $N$, respectively.

For string $P$, $\bar{P}$ is its reverse complement string. The double-strand concatenation of a DNA string list $\mathcal{T} = (P_{0}, P_{1}, \dots, P_{m - 1})$ is $\tilde{T} = P_{0} \$_{0} {\bar{P}}_{0} \$_{1} P_{1} \$_{2} {\bar{P}}_{1} \$_{3} \cdots P_{m - 1} \$_{2m - 2} {\bar{P}}_{m - 1} \$_{2m - 1}$. The double-strand BWT (DS-BWT) of $\mathcal{T}$ is the BWT of $\tilde{T}$. We note that if $P$ is a substring of $\tilde{T}$, $\bar{P}$ must be a substring and $occ(P) = occ(\bar{P})$. The suffix array bidirectional interval (SA bi-interval) of $P$ is a 3-tuple defined as $(lo(P), lo(\bar{P}), occ(P))$.

Algorithm 3. Backward and forward extensions with DS-BWT

1: procedure BackwardExt($B, (k, k', s), a$)
2:     for all $b < \bar{a}$ do ▹ $b$ can be $\$$
3:         $k' \leftarrow k' + [π_{B}(\bar{b}, k + s) - π_{B}(\bar{b}, k)]$
4:     end for
5:     $s \leftarrow π_{B}(a, k + s) - π_{B}(a, k)$
6:     $k \leftarrow π_{B}(a, k)$
7:     return $(k, k', s)$
8: end procedure
9: procedure ForwardExt($B, (k, k', s), a$)
10:    $(k', k, s) \leftarrow$ BackwardExt$(B, (k', k, s), \bar{a})$
11:    return $(k, k', s)$
12: end procedure

An SA bi-interval can be extended in both backward and forward directions. To calculate the SA bi-interval of $aP$, we can use the standard backward search to compute $lo(aP)$ and $occ(aP)$ from $lo(P)$ and $occ(P)$. As to $lo(\overline{aP})$, we note that $[lo(\overline{aP}), hi(\overline{aP})) \subset [lo(\bar{P}), hi(\bar{P}))$ because $\overline{aP} = \bar{P} \circ \bar{a}$ is prefixed with $\bar{P}$. We can thus calculate

$$lo(\overline{aP}) = lo(\bar{P}) + \sum_{b < \bar{a}} occ(\bar{P} b) = lo(\bar{P}) + \sum_{b < \bar{a}} occ(\bar{b} P).$$

It is easy to see that if $(k, k', s)$ is the SA bi-interval of $P$, $(k', k, s)$ is the SA bi-interval of $\bar{P}$, and vice versa. Therefore, we can use the backward extension of $\bar{P}$ to achieve the forward extension of $P$. Algorithm 3 gives the details. It simplifies the original formulation (Li 2012).
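The two extensions can be tried out with a toy double-strand BWT. The sketch below follows Algorithm 3 step by step but builds the BWT by naive suffix sorting and computes ranks by scanning. For brevity it omits `N` from the alphabet; that omission, the helper names and the tuple encoding of sentinels are assumptions of this illustration.

```python
ALPHA = "$ACGT"   # sort order: $ < A < C < G < T (N omitted in this sketch)
COMP = {"$": "$", "A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(p):
    return "".join(COMP[c] for c in reversed(p))

def ds_bwt(strings):
    """BWT of P_0 $ rc(P_0) $ P_1 $ rc(P_1) $ ... with ordered sentinels."""
    both = [q for p in strings for q in (p, revcomp(p))]
    T = []
    for i, p in enumerate(both):
        T += [(ALPHA.index(c), 0) for c in p] + [(0, i)]
    S = sorted(range(len(T)), key=lambda j: T[j:])
    return "".join(ALPHA[T[j - 1][0]] for j in S)

def pi_B(B, a, k):
    c = sum(1 for b in B if ALPHA.index(b) < ALPHA.index(a))
    return c + B[:k].count(a)

def backward_ext(B, bi, a):
    """Bi-interval of aP from the bi-interval (k, k2, s) of P."""
    k, k2, s = bi
    for b in ALPHA:
        if ALPHA.index(b) < ALPHA.index(COMP[a]):      # all b smaller than the complement of a
            k2 += pi_B(B, COMP[b], k + s) - pi_B(B, COMP[b], k)
    s_new = pi_B(B, a, k + s) - pi_B(B, a, k)
    return pi_B(B, a, k), k2, s_new

def forward_ext(B, bi, a):
    """Bi-interval of Pa, via a backward extension on the reverse complement."""
    k, k2, s = bi
    k2, k, s = backward_ext(B, (k2, k, s), COMP[a])
    return k, k2, s

B = ds_bwt(["GACCTCCG", "ACCT"])
bi = (0, 0, len(B))                 # bi-interval of the empty string
for a in reversed("CCT"):
    bi = backward_ext(B, bi, a)
print(bi)
```

Building the same bi-interval with `forward_ext` over the symbols of `"CCT"` in left-to-right order should give an identical tuple, which is a convenient self-check when experimenting.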

2.6 Finding supermaximal exact matches

An exact match between strings $T$ and $P$ is a 3-tuple $(t, p, l)$ such that $T[t, t + l) = P[p, p + l)$. A maximal exact match (MEM) is an exact match that cannot be extended in either direction. A MEM may be contained in another MEM on the pattern string $P$. For example, suppose $T = GACCTCCG$ and $P = ACCT$. MEM (5, 1, 2) is contained in MEM (1, 0, 4) on the pattern string. A supermaximal exact match (SMEM) is a MEM that is not contained in other MEMs on the pattern string. In the example above, only (1, 0, 4) is a SMEM. There are usually far fewer SMEMs than MEMs. Gagie (2024) recently proposed a new algorithm to find long SMEMs (Algorithm 4) that is faster than our earlier algorithm (Li 2012), especially when there are many short SMEMs to skip. Both algorithms can also find SMEMs occurring at least $s_{\min}$ times (Tatarnikov et al. 2023).
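The example above can be checked with a brute-force enumeration. The sketch below is quadratic and only meant to make the definitions concrete. It treats a MEM as contained in another when its interval on $P$ is a proper sub-interval, which is our reading of the definition, and the function names are hypothetical.

```python
def find_mems(T, P):
    """All maximal exact matches (t, p, l) between T and P, by brute force."""
    mems = []
    for t in range(len(T)):
        for p in range(len(P)):
            if T[t] != P[p]:
                continue
            if t > 0 and p > 0 and T[t - 1] == P[p - 1]:
                continue                      # extendable to the left, not maximal
            l = 0
            while t + l < len(T) and p + l < len(P) and T[t + l] == P[p + l]:
                l += 1                        # extend to the right as far as possible
            mems.append((t, p, l))
    return mems

def find_smems(mems):
    """MEMs whose interval on P is not properly contained in that of another MEM."""
    def inside(m, o):
        return o[1] <= m[1] and m[1] + m[2] <= o[1] + o[2] and o[2] > m[2]
    return [m for m in mems if not any(inside(m, o) for o in mems)]

mems = find_mems("GACCTCCG", "ACCT")
print(mems)
print(find_smems(mems))
```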

Algorithm 4. Finding SMEMs no shorter than $ℓ$ (Gagie 2024)

1: procedure FindSMEM1($ℓ, s_{\min}, B, P, i$)
2:     if $i + ℓ > |P|$ then return $|P|$ ▹ Reaching the end of $P$
3:     $(k, k', s) \leftarrow (0, 0, |B|)$ ▹ SA bi-interval of empty string
4:     for $j \leftarrow i + ℓ - 1$ down to $i$ do ▹ backward
5:         $(k, k', s) \leftarrow$ BackwardExt$(B, (k, k', s), P[j])$
6:         if $s < s_{\min}$ then return $j + 1$
7:     end for
8:     for $j \leftarrow i + ℓ$ to $|P| - 1$ do ▹ forward
9:         $(k, k', s) \leftarrow$ ForwardExt$(B, (k, k', s), P[j])$
10:        if $s < s_{\min}$ then break
11:    end for
12:    $e \leftarrow j$ and output $[i, e)$ ▹ SMEM found
13:    $(k, k', s) \leftarrow (0, 0, |B|)$
14:    for $j \leftarrow e$ down to $i + 1$ do ▹ backward again
15:        $(k, k', s) \leftarrow$ BackwardExt$(B, (k, k', s), P[j])$
16:        if $s < s_{\min}$ then return $j + 1$
17:    end for
18: end procedure
19: procedure FindSMEM($ℓ, s_{\min}, B, P$)
20:    $i \leftarrow 0$
21:    repeat
22:        $i \leftarrow$ FindSMEM1$(ℓ, s_{\min}, B, P, i)$
23:    until $i \geq |P|$
24: end procedure
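The control flow of Algorithm 4 can be followed in the sketch below, which replaces the bi-interval extensions with a naive substring counter so that it runs without an index. That substitution, the explicit handling of the case $e = |P|$ and the fallback return of $i + 1$ are assumptions made to keep the toy version self-contained; the function names are hypothetical.

```python
def occ(T, q):
    """Number of (possibly overlapping) occurrences of q in T; stands in for the bi-interval size."""
    return sum(T.startswith(q, i) for i in range(len(T) - len(q) + 1))

def find_smem1(T, P, i, min_len, min_occ, out):
    if i + min_len > len(P):
        return len(P)                                  # reaching the end of P
    for j in range(i + min_len - 1, i - 1, -1):        # backward over the first window
        if occ(T, P[j:i + min_len]) < min_occ:
            return j + 1                               # no long SMEM can start at or before j
    e = i + min_len
    while e < len(P) and occ(T, P[i:e + 1]) >= min_occ:   # forward
        e += 1
    out.append((i, e))                                 # SMEM P[i:e) found
    if e == len(P):
        return len(P)
    for j in range(e, i, -1):                          # backward again, from P[e]
        if occ(T, P[j:e + 1]) < min_occ:
            return j + 1
    return i + 1

def find_smems(T, P, min_len=1, min_occ=1):
    out, i = [], 0
    while i < len(P):
        i = find_smem1(T, P, i, min_len, min_occ, out)
    return out

print(find_smems("GACCTCCG", "ACCT"))
print(find_smems("GACCTCCG", "TCCGACC", min_len=2))
```

Intervals are reported as half-open ranges on $P$. The second call shows the skip in action: after reporting the first SMEM, the backward pass from its end jumps directly to the start of the next one.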

2.7 Finding inexact matches with BWA-SW

The prefix trie of T is a tree that encodes all the prefixes of T (Fig. 1b). Each edge in the tree is labeled with a symbol in Σ. A path from a node to the root spells a substring of T. We can label a node with the SA interval of the string from the node to the root. When we know the label of a node, we can find the label of its child using backward search. We can thus simulate the top-down traversal of the prefix trie (Lam et al. 2008).

As is shown in Fig. 1b, different nodes in the prefix trie may have the same label. If we merge nodes with the same label (Fig. 1c), we will get a prefix DAWG (Blumer et al. 1983). For $|T| > 1$, the DAWG has at most $2|T| - 1$ nodes (Blumer et al. 1984). Each node uniquely corresponds to an important SA interval of $T$.

Let $G_{T}$ denote the prefix DAWG of $T$ and $V(G_{T})$ be the set of vertices in $G_{T}$. Given a reference string $T$ and a pattern string $P$, we can align $G_{T}$ and $G_{P}$ under affine-gap penalty with

$$\begin{array}{lll} E_{uv} & = & \max_{u' \in pre(u)} \{\max\{H_{u'v} - q, E_{u'v}\}\} - e \\ F_{uv} & = & \max_{v' \in pre(v)} \{\max\{H_{uv'} - q, F_{uv'}\}\} - e \\ G_{uv} & = & \max_{u' \in pre(u), v' \in pre(v)} \{H_{u'v'} + s(u', u; v', v)\} \\ H_{uv} & = & \max\{G_{uv}, E_{uv}, F_{uv}\} \end{array}$$

where $u \in V(G_{P})$, $v \in V(G_{T})$, $pre(u)$ and $pre(v)$ are the sets of predecessors in $G_{P}$ and $G_{T}$ respectively, $q$ is the gap open penalty, $e$ is the gap extension penalty, and $s(u', u; v', v)$ is the match/mismatch score between the symbol labeled on edge $(u', u)$ in $G_{P}$ and the symbol on $(v', v)$ in $G_{T}$. This equation is similar to but not the same as our earlier result (Li and Durbin 2010).
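When both graphs are simple chains, every node has a single predecessor and the recurrence reduces to ordinary affine-gap alignment of two linear sequences. The sketch below implements that special case only, not the DAWG alignment. The floor at zero (local alignment) and the scoring values are assumptions chosen for illustration, and the function name is hypothetical.

```python
NEG = float("-inf")

def affine_local_score(P, T, match=2, mismatch=-4, q=2, e=1):
    """Best local alignment score; a gap of length g costs q + g * e."""
    n, m = len(P), len(T)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            # E: gap that consumes a symbol of P only; the predecessor of u is u - 1
            E[u][v] = max(H[u - 1][v] - q, E[u - 1][v]) - e
            # F: gap that consumes a symbol of T only; the predecessor of v is v - 1
            F[u][v] = max(H[u][v - 1] - q, F[u][v - 1]) - e
            # G: match or mismatch along both edges
            G = H[u - 1][v - 1] + (match if P[u - 1] == T[v - 1] else mismatch)
            H[u][v] = max(0, G, E[u][v], F[u][v])
            best = max(best, H[u][v])
    return best

print(affine_local_score("AAAACCCC", "AAAAGCCCC"))
```

In the graph version, each `u - 1` and `v - 1` becomes a maximum over all predecessors, which is what the nested loops of Algorithm 5 enumerate.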

On real data, $G_{T}$ may be too large to store explicitly. Ropebwt3 instead explicitly stores $G_{P}$ only and traverses it in the topological order (Algorithm 5). At a node $u \in V(G_{P})$, we use a hash table to keep $\{v \in V(G_{T}) : H_{uv} > 0\}$. This algorithm is exact in that it is guaranteed to find the best alignment. In practice, however, a large number of $v \in V(G_{T})$ may be aligned to $u$ with $H_{uv} > 0$. It is slow and memory demanding to keep track of all cells $(u, v)$ with positive scores when $P$ is long. Similar to our earlier work, we only store the top $W$ cells (25 by default) at each $u$. This heuristic is akin to dynamic banding for linear sequences (Suzuki and Kasahara 2018).

Algorithm 5. The revised BWA-SW algorithm

1: procedure BwaSW($G_{P}$, $G_{T}$)
2:     for $u \in V(G_{P})$ in topological order do
3:         for $u' \in pre(u)$ do ▹ predecessors of $u$
4:             for $v' \in V(G_{T})$ s.t. $H_{u'v'} > 0$ do ▹ match
5:                 for $v \in child(v')$ do ▹ children of $v'$
6:                     $H_{uv} \leftarrow \max\{H_{uv}, H_{u'v'} + s(u', u; v', v)\}$
7:                 end for
8:             end for
9:             for $v \in V(G_{T})$ s.t. $H_{u'v} > 0$ do ▹ insertion
10:                $E_{uv} \leftarrow \max\{E_{uv}, \max\{H_{u'v} - q, E_{u'v}\} - e\}$
11:                $H_{uv} \leftarrow \max\{H_{uv}, E_{uv}\}$
12:            end for
13:        end for
14:        for $v' \in V(G_{T})$ s.t. $H_{uv'} > 0$ do ▹ deletion
15:            for $v \in child(v')$ do ▹ children of $v'$
16:                $F_{uv} \leftarrow \max\{F_{uv}, \max\{H_{uv'} - q, F_{uv'}\} - e\}$
17:                $H_{uv} \leftarrow \max\{H_{uv}, F_{uv}\}$
18:            end for
19:        end for
20:    end for
21: end procedure

2.8 Estimating local haplotype diversity

BWA-SW with the banding heuristic may miss the best matching haplotype especially given an index consisting of similar haplotypes. When the suffix on the best full-length alignment has a lot more mismatches than the suffix on suboptimal alignments, the best alignment may have moved out of the band early in the iteration and thus get missed. Nevertheless, a long read sequenced from a new sample may be the recombinant of two genomes in the index. We often do not seek the best alignment of the long read to a single haplotype. We are instead more interested in the collection of haplotypes a query sequence can be aligned to even if they do not lead to the best alignment. This now becomes possible as BWA-SW explores suboptimal alignments to multiple haplotypes.

More exactly, we perform semi-global sequence-to-DAWG alignment (ie requiring the query sequence to be aligned from end to end) by applying BWA-SW to the graph representing the linear query sequence $P$, from the end to the start. We can find the set of matching haplotypes $M ≜ \{v \in V(G_{T}) : H_{u_{0} v} > 0\}$ where $u_{0}$ represents the start of $P$. $M$ may include suboptimal haplotypes caused by small variants as well as the optimal one.

Importantly, $P$ may be aligned to similar positions on $T$ with slightly different gap placements. For example, given $T = CAAGCAG$ and $P = AGCA$, the algorithm above may find the following two alignments around the same position but in different SA intervals:

$$\begin{array}{llcl} T: & \texttt{CAAGCAG} & & \texttt{CAAGCAG} \\ & \texttt{~~||||~} & \text{or} & \texttt{~|~|||~} \\ P: & \texttt{~~AGCA~} & & \texttt{~A-GCA~} \end{array}$$

When counting hits, we want to ignore the second suboptimal alignment. In theory, we can identify this case by comparing the position of each alignment. This procedure is slow as the locate operation is costly. We instead leverage the bi-directionality to identify redundancy heuristically. Suppose $P$ can be aligned to SA bi-intervals $(k_{1}, k'_{1}, s_{1})$ and $(k_{2}, k'_{2}, s_{2})$ and the alignment score to the first interval is higher. We filter out the second interval if $[k_{2}, k_{2} + s_{2}) \subset [k_{1}, k_{1} + s_{1})$ or $[k'_{2}, k'_{2} + s_{2}) \subset [k'_{1}, k'_{1} + s_{1})$. This strategy does not avoid all overcounting but it works well on multiple real examples we have closely inspected.
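The containment test can be written down directly. The sketch below keeps hits in descending score order and drops a hit when either of its intervals lies inside the corresponding interval of a higher-scoring hit that was kept. Comparing only against kept hits, treating equal intervals as contained, and the function names are assumptions of this illustration rather than details stated above.

```python
def contained(a_start, a_len, b_start, b_len):
    """True if [a_start, a_start + a_len) lies within [b_start, b_start + b_len)."""
    return b_start <= a_start and a_start + a_len <= b_start + b_len

def filter_redundant(hits):
    """hits: list of (score, (k, k2, s)) with SA bi-intervals; returns the retained hits."""
    kept = []
    for score, (k, k2, s) in sorted(hits, key=lambda h: -h[0]):
        redundant = any(contained(k, s, K, S) or contained(k2, s, K2, S)
                        for _, (K, K2, S) in kept)
        if not redundant:
            kept.append((score, (k, k2, s)))
    return kept

# Toy bi-intervals, made up for illustration only
hits = [(10, (100, 500, 4)), (7, (101, 900, 2)), (6, (300, 501, 2)), (5, (400, 700, 3))]
print(filter_redundant(hits))
```

In this made-up input, the second hit is dropped because of its forward interval and the third because of its reverse-strand interval, while the last hit overlaps neither and is kept.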

In practice, we may apply this algorithm to the flanking sequence of a variant or to sliding k-mers of a long query sequence to enumerate possible local haplotypes and estimate their frequencies in the index. It is a new query type that is biologically meaningful.

3 Results

3.1 Performance on index construction

We evaluated the performance of BWT construction on 100 haplotype-resolved human assemblies collected in Li et al. (2024). As we included both strands (Section 2.5), each BWT construction algorithm took about 600 billion bases as input (Table 2). grlBWT (commit 5b6d26a; Díaz-Domínguez and Navarro 2023) is the fastest algorithm (Table 3) at the cost of ∼2 terabytes of working disk space including decompressed sequences. Ropebwt3 took 21 h from input sequences in gzip’d FASTA to the final index, of which 7.7 h was spent on libsais. It does not use working disk space and can append new sequences to an existing BWT. However, hardcoded for the DNA alphabet, ropebwt3 does not work with more general alphabets. pfp-thresholds (Rossi et al. 2022) used more memory than the input sequences. It may be impractical with increased sample size.

Table 2. Datasets.

Name	No. of bases^a	No. of sequences	Avg run length^b
human100^c	301.6 Gb	38.6 k	141.6
human320^d	963.0 Gb	27.1 k	395.6
CommonBacteria^e	7326.6 Gb	278.4 M	828.6

^a Number of bases in the input sequences on one strand.
^b Average run length in BWT constructed from both strands.
^c 100 long-read human assemblies; N50 string length: 44.4 Mb.
^d 320 long-read human assemblies; N50 string length: 135.3 Mb.
^e AllTheBacteria (Hunt et al. 2024) excluding “dustbin” and “unknown.”


Table 3. Indexing performance.^a

Dataset	Algorithm	Elapsed^b	CPU^c	RAM
human100	grlBWT	8.3 h	$\times 3.6$	84.8 GB
	pfp-thres^d	<51.7 h	$\times 1.0$	788.1 GB
	ropebwt3	20.5 h	$\times 22.9$	83.0 GB
	metagraph^e	16.9 h	$\times 18.6$	251.0 GB
	fulgor^f	1.2 h	$\times 27.2$	165.2 GB
human320	grlBWT^g	23.3 h	$\times 4.2$	270.4 GB
	ropebwt3	81.2 h	$\times 16.5$	99.2 GB
	ropebwt3^h	64.9 h	$\times 23.7$	170.5 GB
CommonBacteria	ropebwt3	25.6 d	$\times 32.4$	67.3 GB

^a Up to 64 threads specified if multi-threading is supported.
^b Excluding time for format conversion; “h” for hours; “d” for days.
^c Number of CPU threads used on average.
^d pfp-thresholds was run on a slower machine with more RAM.
^e k-mer coordinates in the “row_diff_brwt_coord” encoding.
^f Without “--meta --diff” as the basic index is smaller; lossy index.
^g Using two bytes per run with option “-b 2,” which is faster.
^h BWT merge and partial BWT construction with libsais are run together.


Both MONI (v0.2.1; Rossi et al. 2022) and Movi (v1.1.0; Zakeri et al. 2024) use pfp-thresholds for building BWT. They generate additional data structures on top of BWT. Time spent on these additional steps was not counted. The tested version of Movi used more than one terabyte of memory to construct the final index. The Movi developer kindly provided the Movi index for the evaluation of query performance in the next section.

We also indexed the same dataset with k-mer based tools. In the lossless mode, metagraph (v0.3.6; Karasikov et al. 2024) indexed the 100 human genomes in 17 h (Table 3). Fulgor (v3.0.0; Fan et al. 2024) is by far the fastest. However, lossy in nature, it is not directly comparable to the rest of the tools in the table.

To test scalability, we indexed a larger dataset consisting of 320 human genomes recently released by the Human Pangenome Reference Consortium (Liao et al. 2023). Because these assemblies are more contiguous than human100 (Table 2), $m_{2}$ in Algorithm 1 is smaller, which reduces the multi-threading efficiency of ropebwt3. We can alleviate this by executing BWT merge and partial BWT construction at the same time at the cost of a higher memory footprint. On human320, grlBWT is 2–4 times as fast but uses more memory (Table 3). On the largest CommonBacteria dataset (Table 2), ropebwt3 took less than a month to construct the double-strand BWT with 14.66 trillion symbols. The final index in the fermi format is 27.6 GB in size. The “dustbin” and “unknown” categories in AllTheBacteria consist of many unique sequences which are not compressed well. Indexing the entire AllTheBacteria with ropebwt3 will probably take 100–200 GB of memory at the peak.

3.2 Query performance

We queried 100–200 Mb human short reads (SR+), human long reads (LR+) and non-human long reads (LR–) against the human pangenome indices constructed above (Table 4). It is important to note that no two tools support exactly the same type of query, but the comparison can still give a hint about the relative performance.

Table 4. Query performance.

Data	Algorithm	Type^d	Speed^e (kb/s)	RAM (GB)
SR+^a	ropebwt3^f	MEM31	1758.5	10.6
		MEM51	1907.5	10.6
		SW10	84.1	15.2
	MONI^g	MEM–	453.2	68.4
		extend	348.2	68.4
	Movi	PML	5894.0	47.6
	metagraph	PA+	<0.1	65.3
	fulgor	PA	2717.5	5.1
LR+^b	ropebwt3	MEM31	1695.9	10.5
		MEM51	1793.9	10.5
		SW25	82.7	15.6
	MONI	MEM–	413.6	68.4
	Movi	PML	16204.9	47.6
	metagraph	PA+	<0.1	65.3
	fulgor	PA	2491.6	5.1
LR−^c	ropebwt3	MEM31	1365.0	10.4
		MEM51	3051.6	10.4
		SW25	58.2	17.9
	MONI	MEM–	186.8	68.4
	Movi	PML	8490.9	47.6
	metagraph	PA+	1119.3	65.3
	fulgor	PA	4240.8	5.1

^a First 1 million 125 bp human short reads from SRR3099549.
^b First 10 000 human PacBio HiFi reads from SRR26545347.
^c First 10 000 metagenomic PacBio HiFi reads from DRR290133.
^d MEMx: super-maximal exact matches (SMEMs) of x bp or longer with counts; MEM–: SMEM without counts; extend: Smith-Waterman extension from the longest SMEM; PML: pseudo-matching length; PA: pseudo-alignment; PA+: pseudo-alignment with contig names; SWy: BWA-SW with bandwidth y.
^e Kilobases processed per CPU second. Index loading time excluded.
^f Index in the binary fermi format.
^g The MONI index includes both strands. We modified MONI such that extension is performed on the forward query strand only.

Among the three BWT-based tools, ropebwt3 is slower than Movi but faster than MONI at finding exact matches. Movi computes pseudo-matching lengths (PMLs), which are not intended to be the longest exact matches. The PML corresponds to the longest MEM for only 195 out of one million short reads, and for none of the long reads. The longest PML of each read is on average 27% shorter than the longest exact match for LR+ and 22% shorter for SR+. Nonetheless, we believe it is possible to implement SMEM finding on top of the Movi data structure with minor performance overhead.
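To make the MEMx query type concrete, the following brute-force sketch enumerates SMEMs of a query against a plain string and reports their occurrence counts. It is only an illustration on toy inputs: it relies on Python substring search rather than a BWT, it considers the forward strand only, and the names `naive_smems` and `count_occurrences` are hypothetical and not part of ropebwt3.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def naive_smems(query, text, min_len=1):
    """Return (start, end, count) for each SMEM of query in text.

    For every query start we find the longest match. The right ends of these
    longest matches never decrease, so a match is super-maximal exactly when
    its right end moves past the right end reached from the previous start.
    """
    smems = []
    prev_end = 0
    for start in range(len(query)):
        # query[start:prev_end] is already known to occur in text.
        end = max(prev_end, start)
        while end < len(query) and query[start:end + 1] in text:
            end += 1
        if end > prev_end and end - start >= max(min_len, 1):
            smems.append((start, end, count_occurrences(text, query[start:end])))
        prev_end = end
    return smems


# Toy example: "X" never occurs in the reference, so it splits the query.
hits = naive_smems("CGTAXACG", "ACGTACGT")
```

Raising `min_len` mimics the difference between MEM31 and MEM51 in Table 4: shorter SMEMs are simply not reported.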

MONI and ropebwt3 can also find inexact matches. Because it implements the r-index, MONI can locate one SMEM relatively cheaply. It leverages this property to extract the genomic sequence around the longest SMEM and perform Smith-Waterman extension. MONI extension is faster than BWA-SW on SR+ because it does not need to inspect suboptimal hits. This feature appears to target short reads only: on LR+, MONI extension fails to extend to the ends of reads and generates incorrect output for the majority of reads.
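For readers unfamiliar with the alignment step mentioned above, here is a minimal sketch of the textbook Smith-Waterman local alignment score with a linear gap penalty. It is neither MONI's extension code nor BWA-SW, it is not anchored at a seed, and the scoring parameters are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The full dynamic programming matrix costs time proportional to the product of the two lengths, which is why practical tools restrict it to a window around a seed or to a band.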

As to k-mer indices, fulgor outputs the labels of the genomes to which each read has enough 31-mer matches. In its current form, such output is not useful for pangenome analysis because most reads can be mapped to all genomes. Metagraph can additionally output the contig name of each match. However, when most k-mers are present in the index, metagraph is impractically slow (<0.1 kb/s on SR+ and LR+; Table 4).

3.3 Identifying novel sequences

As a biological application, we used the pangenome index to identify novel sequences in reads that are absent from other human genomes. For this, we downloaded the PacBio HiFi reads for tumor sample COLO829 (https://downloads.pacbcloud.com/public/revio/2023Q2/COLO829/COLO829/), mapped them to human100 (Table 2), which includes T2T-CHM13 (Nurk et al. 2022), and extracted regions of $\geq$ 1 kb on reads that are not covered by SMEMs of 51 bp or longer. We found 95 kb of such sequence in 43 reads. These sequences could not be assembled. NCBI BLAST suggested multiple weak hits to cow genomes. We could not identify the source of these sequences, but they are few in any case.
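The extraction step described above can be sketched as an interval sweep over one read. This is an assumed reconstruction of the logic, not the script used for the analysis: `uncovered_regions` and its parameters are hypothetical names, and matches are taken to be half-open `(start, end)` intervals on the read.

```python
def uncovered_regions(read_len, matches, min_match=51, min_gap=1000):
    """Regions of a read of at least min_gap bp not covered by long matches.

    matches: iterable of (start, end) half-open intervals on the read.
    Matches shorter than min_match are ignored before computing coverage.
    """
    kept = sorted((s, e) for s, e in matches if e - s >= min_match)
    regions, cursor = [], 0
    for s, e in kept:
        if s - cursor >= min_gap:
            regions.append((cursor, s))
        cursor = max(cursor, e)  # matches may overlap or nest
    if read_len - cursor >= min_gap:
        regions.append((cursor, read_len))
    return regions


# A 5 kb read with one 30 bp match that is too short to count as coverage.
novel = uncovered_regions(5000, [(0, 1500), (1400, 2000), (3200, 3230), (3300, 5000)])
```

In this toy case the only reported region is the 1.3 kb stretch between the second and the last match.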

When we mapped the COLO829 reads to T2T-CHM13 only and applied the same procedure, we found 55.9 Mb of “novel” sequences in 25 365 reads. The much larger number is caused by regions with dense SNPs that prevented long exact matches. Counterintuitively, mapping these reads to T2T-CHM13 with minimap2 (Li 2018) resulted in more “novel” sequences at 78.6 Mb, probably because minimap2 ignores seeds occurring thousands of times in T2T-CHM13 and may miss more alignments. Capable of finding SMEMs at the pangenome scale, ropebwt3 is more effective for identifying known sequences. It is also 16% faster and uses less memory than full minimap2 alignment against a single genome.

3.4 Haplotype diversity around C4 genes

The reference human genome GRCh38 has two paralogous C4 genes, C4A and C4B, and they may have copy-number changes (Sekar et al. 2016). In both genes, exon 26 harbors the functional domain. We extracted exon 26 of both genes from GRCh38. The two sequences differ at six positions over 157 bp. We aligned both 157 bp segments, one from each gene, to the human pangenome. 105 haplotypes have the same C4A exon 26 sequence, one haplotype has a mismatch at offset 128 and four haplotypes have a mismatch at offset 54. In the case of C4B, 60 haplotypes have the same reference C4B sequence and 23 also have a mismatch at offset 54. This means that among the 100 human haplotypes, there are 110 C4A gene copies and 83 C4B copies. If we increase the bandwidth from the default 25 to 200, BWA-SW will be able to align C4A exon 26 to C4B genes and output all five hits for each sequence.
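The copy numbers quoted above follow directly from the per-allele hit counts. The short tally below only restates that arithmetic; the dictionary keys are descriptive labels introduced here for illustration.

```python
# Exon 26 hit counts reported in the text, grouped by gene and allele.
c4a_hits = {"identical to GRCh38": 105, "mismatch at offset 128": 1, "mismatch at offset 54": 4}
c4b_hits = {"identical to GRCh38": 60, "mismatch at offset 54": 23}

c4a_copies = sum(c4a_hits.values())      # 110
c4b_copies = sum(c4b_hits.values())      # 83
total_copies = c4a_copies + c4b_copies   # 193
```

The total of 193 copies across both paralogs is the same number that appears as the typical copy count along the C4A gene.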

Figure 3 shows the local haplotype diversity across the entire C4A gene spanning ∼20.6 kb on GRCh38. We can see most regions have 193 copies, except a ∼6.4 kb HERV insertion that separates long and short forms (Sekar et al. 2016). The dip around exon 26 is caused by the C4A–C4B difference. We could only see these alternative haplotypes with BWA-SW which reports suboptimal hits.

Figure 3.

Haplotype diversity around the C4A gene. 101-mers with 50 bp overlaps are extracted from the C4A genomic sequence on GRCh38 and aligned to the human pangenome. The Y axis shows the counts of 101-mer matches under different edit-distance thresholds.
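The tiling described in the caption can be sketched as follows, assuming that "50 bp overlaps" means consecutive 101-mers start 51 bp apart. The function name `tile_kmers` is hypothetical and the sequence is a placeholder, not the actual C4A sequence.

```python
def tile_kmers(seq, k=101, overlap=50):
    """Return (offset, k-mer) pairs where adjacent k-mers share `overlap` bases."""
    step = k - overlap
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


# Placeholder sequence of 203 bp: yields windows starting at 0, 51 and 102.
toy = "ACGT" * 50 + "ACG"
tiles = tile_kmers(toy)
```

Each tile would then be aligned to the pangenome and its hits counted under the chosen edit-distance threshold.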

4 Discussions

Ropebwt3 is a fast tool for BWT construction and sequence search for redundant DNA sequences. Generating BWT purely in memory and supporting incremental build, ropebwt3 is convenient and practical for BWT construction at large scale. It is the only algorithm that can construct the BWT of 320 human genomes from a 2.5 GB AGC archive (Deorowicz et al. 2023) without staging all decompressed data in memory or on disk. It provides the fastest algorithm so far for finding supermaximal exact matches and can report inexact hits as well. Ropebwt3 demonstrates that BWT-based data structures are scalable to terabases of pangenome data.

Ropebwt3 implements an FM-index to locate SMEMs or local hits. Although the standard r-index is faster than an FM-index of the same size, it imposes a fixed sampling rate: two suffix array values per run. The BWT of CommonBacteria has 14.6 trillion bases and 17.6 billion runs. An r-index is likely to take >200 GB, while an FM-index sampled at one position per 8192 bp takes 17.5 GB in ropebwt3. The subsampled r-index (sr-index; Cobas et al. 2024) is probably a better solution, which we will explore in the future.
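A back-of-envelope calculation shows why the fixed sampling rate matters. The sketch below counts suffix array samples only and assumes each value is bit-packed into the smallest width that can address the text; both the packing and the function name `sa_sample_bytes` are assumptions made here, so the results are not expected to match the figures above exactly, which cover complete indices.

```python
def sa_sample_bytes(num_samples, text_len, bits=None):
    """Bytes needed to store num_samples suffix array values.

    By default each value uses the minimum number of bits that can
    address any position in a text of length text_len.
    """
    if bits is None:
        bits = (text_len - 1).bit_length()
    return num_samples * bits / 8


n = 14_600_000_000_000   # bases in the CommonBacteria BWT
r = 17_600_000_000       # runs in the CommonBacteria BWT

r_index_samples = sa_sample_bytes(2 * r, n)      # two values per run
fm_index_samples = sa_sample_bytes(n // 8192, n)  # one value per 8192 positions
```

Under these assumptions each value takes 44 bits, giving roughly 194 GB of samples for the r-index layout versus roughly 10 GB for one sample per 8192 positions. Storing values in 64-bit words instead (`bits=64`) raises both numbers proportionally.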

This article has not evaluated several recent BWT-based tools including r-index-f (Brown et al. 2022), block_RLBWT (Díaz-Domínguez et al. 2023), Move-r (Bertram et al. 2024) and b-move (Depuydt et al. 2024). Although they have not been tested on large datasets such as the human pangenome and they do not report SMEMs, some of their data structures are probably more efficient than the ones we use. We may integrate these methods into ropebwt3 in the future as well.

We often use pangenome graphs to analyze multiple similar genomes (Liao et al. 2023). These graphs are built from multiple sequence alignments through complex procedures involving many parameters (Li et al. 2020, Garrison et al. 2024, Hickey et al. 2024). It is challenging to understand whether the graph topology is biologically meaningful, especially given that we do not know the correct alignment between two genomes, let alone multiple ones. Complementary to graph-based data structures, BWT-based algorithms are often exact with no heuristics or parameters, but they tend to support limited query types. For example, we cannot project the alignment to a designated reference genome. Which additional query types can be supported will be of great interest for comprehensive pangenome analysis in the future.

Acknowledgements

We are grateful to Travis Gagie for pointing us to his long MEM finding algorithm, to Ilya Grebnov for adding the support of 16-bit alphabet which helps to accelerate ropebwt3, to Mohsen Zakeri for providing the Movi index, to Massimiliano Rossi for explaining the MONI algorithm, and to Giulio Pibiri and Rob Patro for trouble-shooting compilation issues with fulgor.

Conflict of interest

None declared.

Funding

This work was supported by the National Institutes of Health [grants R01HG010040 and U01HG010961 to H.L.].

Data availability

The ropebwt3 source code is available at https://github.com/lh3/ropebwt3. Prebuilt ropebwt3 indices can be obtained from https://doi.org/10.5281/zenodo.11533210 and https://doi.org/10.5281/zenodo.13955431.

References

Ahmed, Rossi, Kovaka et al. Pan-genomic matching statistics for targeted nanopore sequencing. iScience 2021; 102696.

Bannai, Gagie. Refining the r-index. Theor Comput Sci 2020; 812: –108.

Bertram, Fischer, Nalbach. Move-r: optimizing the r-index. In: Liberti L (ed.), 22nd International Symposium on Experimental Algorithms, SEA 2024, July 23–26, 2024, Vienna, Austria, volume 301 of LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024, 1:1–.

Blumer, Ehrenfeucht et al. Linear size finite automata for the set of all subwords of a word - an outline of results. Bull EATCS 1983.

Blumer, Ehrenfeucht et al. Building the minimal DFA for the set of all subwords of a word on-line in linear time. In: Paredaens J (ed.) Automata, Languages and Programming, 11th Colloquium, Antwerp, Belgium, July 16–20, 1984, Proceedings, volume 172 of Lecture Notes in Computer Science. Springer, 1984, 109–.

Boucher, Gagie, Kuhnle et al. Prefix-free parsing for building big BWTs. Algorithms Mol Biol 2019.

Bray, Pimentel, Melsted et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 2016; 525–.

Břinda, Lima, Pignotti et al. Efficient and robust search of microbial genomes via phylogenetic compression. bioRxiv, 10.1101/2023.04.15.536996, 2024, preprint: not peer reviewed.

Brown, Gagie, Rossi. RLBWT tricks. In: Schulz C, Uçar B (eds.), 20th International Symposium on Experimental Algorithms, SEA 2022, July 25–27, 2022, Heidelberg, Germany, volume 233 of LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022, pages 16:1–.

Burrows, Wheeler DJ. A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation, 1994.

Cenzato, Lipták. A survey of BWT variants for string collections. Bioinformatics 2024; btae333.

Chang, Lawler EL. Sublinear approximate string matching and biological applications. Algorithmica 1994; 327–.

Cobas, Gagie, Navarro. Fast and small subsampled r-indexes. CoRR, abs/2409.14654, 10.48550/arXiv.2409.14654, 2024, preprint: not peer reviewed.

Cox, Bauer, Jakobi et al. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics 2012; 1415–.

Deorowicz, Danek. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics 2023; btad097.

Depuydt, Renders, de Vyver et al. b-move: faster bidirectional character extensions in a run-length compressed index. In: Pissis SP, Sung W (eds.), 24th International Workshop on Algorithms in Bioinformatics, WABI 2024, September 2–4, 2024, Royal Holloway, London, United Kingdom, volume 312 of LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024, 10:1–.

Díaz-Domínguez, Dönges, Puglisi et al. Simple runs-bounded FM-Index designs are fast. In: Georgiadis L (ed.), 21st International Symposium on Experimental Algorithms, SEA 2023, July 24–26, 2023, Barcelona, Spain, volume 265 of LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023, 7:1–.

Díaz-Domínguez, Navarro. Efficient construction of the BWT for repetitive text using string compression. Inf Comput 2023; 294: 105088.

Edgar, Taylor, Lin et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 2022; 602: 142–.

Fan, Khan, Singh et al. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024.

Ferragina, Gagie, Manzini. Lightweight data indexing and compression in external memory. Algorithmica 2012; 707–.

Ferragina, Manzini. Opportunistic data structures with applications. In: FOCS. IEEE Computer Society; 2000, 390–.

Gagie. How to find long maximal exact matches and ignore short ones. In: Day JD, Manea F (eds.), Developments in Language Theory – 28th International Conference, DLT 2024, Göttingen, Germany, August 12–16, 2024, Proceedings, Vol. 14791 of Lecture Notes in Computer Science. Springer, 2024, 131–.

Gagie, Navarro, Prezza. Optimal-time text indexing in BWT-runs bounded space. In: Czumaj A (ed.), Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7–10, 2018. SIAM, 2018, 1459–.

Gagie, Navarro, Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J ACM 2020.

Garrison, Guarracino, Heumos et al. Building pangenome graphs. Nat Methods 2024; 2008–.

Hickey, Monlong, Ebler et al.; Human Pangenome Reference Consortium. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol 2024; 663–.

Hunt, Lima, Shen et al. AllTheBacteria - all bacterial genomes assembled, available and searchable. bioRxiv, 10.1101/2024.03.08.584059, 2024, preprint: not peer reviewed.

Karasikov, Mustafa, Danciu et al. Indexing all life’s known biological sequences. bioRxiv, 10.1101/2020.10.01.322164, 2024, preprint: not peer reviewed.

Karasikov, Mustafa, Joudaki et al. Sparse binary relation representations for genome graph annotation. J Comput Biol 2020; 626–.

Kim, Song, Breitwieser et al. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 2016; 1721–.

Kucherov, Salikhov, Tsur. Approximate string matching using a bidirectional index. Theor Comput Sci 2016; 638: 145–.

Lam, Tam et al. High throughput short read alignment via bi-directional BWT. In: 2009 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2009, Washington, DC, USA, November 1–4, 2009, Proceedings. IEEE Computer Society, 2009.

Lam, Sung, Tam et al. Compressed indexing and local alignment of DNA. Bioinformatics 2008; 791–.

Langmead, Trapnell, Pop et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009; R25.

Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 2012; 1838–.

Li. Fast construction of FM-index for long sequence reads. Bioinformatics 2014; 3274–.

Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018; 3094–100.

Li, Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009; 1754–.

Li, Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 2010; 589–.

Li, Feng, Chu et al. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020; 265.

Li, Marin, Farhat MR. Exploring gene content with pangene graphs. Bioinformatics 2024; btae456.

Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009; 1966–.

Liao W-W, Asri, Ebler et al. A draft human pangenome reference. Nature 2023; 617: 312–.

Marchet, Boucher, Puglisi et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021.

Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput Surv 2022.

Nishimoto, Tabei. Optimal-time queries on BWT-runs compressed indexes. In: Bansal N, Merelli E, Worrell J (eds.), 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12–16, 2021, Glasgow, Scotland (Virtual Conference), Vol. 198 of LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021, 101:1–.

Nurk, Koren, Rhie et al. The complete sequence of a human genome. Science 2022; 376.

Ohno

Sakai

Takabatake

et al.

A faster implementation of online RLBWT and its application to LZ77 parsing

J Discrete Algorithms

2018

;

52-53

–