Abstract

Motivation: Several dynamic programming algorithms for predicting RNA structures with pseudoknots have been proposed that differ dramatically from one another in the classes of structures considered.

Results: Here, we use the natural topological classification of RNA structures in terms of irreducible components that are embeddable in the surfaces of fixed genus. We add to the conventional secondary structures four building blocks of genus one in order to construct certain structures of arbitrarily high genus. A corresponding unambiguous multiple context-free grammar provides an efficient dynamic programming approach for energy minimization, partition function and stochastic sampling. It admits a topology-dependent parametrization of pseudoknot penalties that increases the sensitivity and positive predictive value of predicted base pairs by 10–20% compared with earlier approaches. More general models based on building blocks of higher genus are also discussed.

Availability: The source code of gfold is freely available at http://www.combinatorics.cn/cbpc/gfold.tar.gz.

Contact:  [email protected]

Supplementary information:  Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The global conformation of RNA molecules is to a large extent determined by topological constraints encoded at the level of secondary structure, i.e. by the mutual arrangements of the base paired helices (Bailor et al., 2010). In this context, secondary structure is understood in a wider sense that includes pseudoknots. Although the vast majority of RNAs has simple, i.e. pseudoknot-free, secondary structure, PseudoBase (Taufer et al., 2009) lists more than 250 records of pseudoknots determined by a variety of experimental and computational techniques including crystallography, nuclear magnetic resonance, mutational experiments and comparative sequence analysis. In many cases, they are crucial for molecular function. Examples include the catalytic cores of several ribozymes (Doudna and Cech, 2002), programmed frameshifting (Namy et al., 2006) and telomerase activity (Theimer et al., 2005), reviewed in Giedroc and Cornish (2009); Staple and Butcher (2005).

Secondary structures can been interpreted as matchings in a graph of permissible base pairs (Tabaska et al., 1998). The energy of RNA folding is dominated by the stacking of adjacent base pairs, not by the hydrogen bonds of the individual base pairs (Mathews et al., 1999). In contrast to maximum weighted matching, the general RNA folding problem with a stacking-based energy function is NP-complete (Akutsu, 2000; Lyngsø and Pedersen, 2000). The most commonly used RNA secondary structure prediction tools, including mfold (Zuker, 1989) and the Vienna RNA Package (Hofacker et al., 1994), therefore exclude pseudoknots.

Polynomial-time dynamic programming (DP) algorithms can be devised, however, for certain restricted classes of pseudoknots. In contrast to the O(N2) space and O(N3) time solution for simple secondary structures (Nussinov et al., 1978; Waterman, 1978; Zuker and Stiegler, 1981), however, most of these approaches are computationally much more demanding. The design of pseudoknot folding algorithms thus has been governed more by the need to limit computational cost and achieve a manageable complexity of the recursion than the conscious choice of a particularly natural search space of RNA structures. As a case in point, the class of structures underlying the algorithm by Rivas and Eddy (1999) (R&E-structures, pknot-R&E) was characterized only in a subsequent publication (Rivas and Eddy, 2000). The following references provide a certainly incomplete list of DP approaches to RNA structure prediction using different structure classes characterized in terms of recursion equations and/or stochastic grammars: Akutsu (2000); Cai et al. (2003); Chen et al. (2009); Deogun et al. (2004); Dirks and Pierce (2003); Kato et al. (2006); Li and Zhu (2005); Lyngsø and Pedersen (2000); Matsui et al. (2005); Reeder and Giegerich (2004); Rivas and Eddy (1999); Uemura Y. et al. (1999). The interrelationships of some of these classes of RNA structures have been clarified in part by Condon et al. (2004) and Rødland (2006). In addition to these exact algorithms, a plethora of heuristic approaches to pseudoknot prediction have been proposed in the literature; see e.g. (Chen, 2008; Metzler and Nebel, 2008) and the references therein.

At least three distinct classification schemes of RNA contact structures have been proposed: Haslinger and Stadler (1999) suggested using book-embeddings, Jin et al. (2008) focused on the maximal set of pairwise crossing base pairs and Bon et al. (2008) based the classification on topological embeddings. While these classifications have in common that simple secondary structure forms the most primitive class of structures, they differ already in the construction of the first non-trivial class of pseudoknots. Despite their mathematical appeal, however, no efficient (polynomial-time) algorithms are available for predicting pseudoknotted structures even in the simplest case of three non-crossing RNA structures. A practically workable approach to three non-crossing structures requires the enumeration of an exponentially growing number of diagrams which are then ‘filled in’ by the means of DP (Huang et al., 2009); a Monte-Carlo approach utilizing the topological approach with a very simple matching-like energy model was explored by (Vernizzi and Orland, 2005).

In this contribution, we show that the topological classification of RNA structures can be translated into efficient DP algorithms. To this end, we introduce γ-structures and prove that they can be derived from a finite family of abstract shapes called shadows. In Theorem 2.3, we enumerate these four shadows for γ = 1, which can be cast as explicit construction rules for a unique multiple context-free grammar (Section 2.3). Corresponding DP algorithms for energy minimization, partition function and Boltzmann-sampling functionalities are implemented in the software package gfold. An important feature is that γ-structures can be treated algorithmically like pseudoknot-free secondary structures in the sense that there are finitely many motifs, i.e. shadows, for fixed γ, each of which is assigned a specific energy. Because of the multiplicity of motifs, which rapidly increases with γ, this allows for a more detailed energy model of pseudoknotted structures based on their topological complexity.

2 RESULTS

2.1 Topology of RNA structures

Diagram representation: RNA molecules are linear biopolymers consisting of the four nucleotides A, U, C and G characterized by a sequence endowed with a unique orientation (5′ to 3′). Each nucleotide can interact (base pair) with at most one other nucleotide by means of specific hydrogen bonds. Only the Watson–Crick pairs GC and AU as well as the wobble GU are admissible. These base pairs determine the secondary structure. Note that we have neglected here base triples and other types of more complex interactions. Secondary structures can thus be represented as graphs where nucleotides are represented by vertices, the backbone of the molecule as well as the hydrogen bonds are represented by edges; see Figure 1a. More conveniently, we use the convention to represent the backbone of the polymer by a horizontally drawn chain. As before, this chain consists of vertices and arcs, respectively, representing the nucleotides and covalent bonds. However, the edges representing the base pairs now are depicted as arcs in the upper half plane; see Figure 1b. We call this representation the diagram of the molecule.

RNA structure as planar graph represented (a) as ball-and-stick figure with short edges for hydrogen bonds and (b) with linear backbone and semi-circles for hydrogen bonds.
Fig. 1.

RNA structure as planar graph represented (a) as ball-and-stick figure with short edges for hydrogen bonds and (b) with linear backbone and semi-circles for hydrogen bonds.

Thus, we shall identify a structure with a labelled graph over the vertex set [N] = {1, 2,…, N} represented by drawing the vertices 1, 2,…, N on a horizontal line in the natural order and the arcs (i, j), where i < j, in the upper half plane.

Fatgraph representation: in order to understand the topological properties of RNA molecules, we need to pass from the picture of RNA as diagrams or contact graphs to that of topological surfaces. Only the associated surface carries the important invariants leading to a meaningful filtration of RNA structures. Formally, we will view an RNA molecule as a topological surface (Andersen,J.E. et al., submitted for publication). The main idea is to ‘thicken’ the edges into (untwisted) bands or ribbons and to expand each vertex to a disk as shown in Figure 2. This inflation of edges leads to a fatgraph 𝔻 (Loebl and Moffatt, 2008; Penner et al., 2010).

Inflation of edges and vertices to ribbons and disks. Here we have four vertices, five edges and one boundary component . The corresponding surface has Euler characteristic χ=v−e+r=0 and genus g=1, see Equations (2.1) and (2.2).
Fig. 2.

Inflation of edges and vertices to ribbons and disks. Here we have four vertices, five edges and one boundary component formula. The corresponding surface has Euler characteristic χ=ve+r=0 and genus g=1, see Equations (2.1) and (2.2).

A fatgraph, sometimes also called ‘ribbon graph’ or ‘map’, is a graph equipped with a cyclic ordering of the incident half edges at each vertex. Thus, 𝔻 refines its underlying graph D insofar as it encodes the ordering of the ribbons incident on its disks. In the following, we will deal with orientable ribbon graphs.1 Each ribbon has two boundaries. The first one in counterclockwise order is labeled by an arrowhead, (Fig. 2). A 𝔻-cycle or 𝔻-boundary component is then constructed by following these directed boundaries from disk to disk, thereby alternating between base pair ribbons and backbone, with the exception of the segment of the boundary component that travels along the bottom of the backbone using only backbone bonds, as shown in Figures 2 and 3. We give a brief tutorial on how to compute boundary components in the Supplementary Figure S6. Topological invariants such as the number of boundary components of the fatgraph 𝔻 can thus be computed directly from the underlying diagram D. Furthermore, fatgraphs can be succinctly stored and conveniently manipulated on the computer as pairs of permutations (Penner et al., 2010).

Computing the number of boundary components. The diagram contains 5+9 edges and 10 vertices. We follow the alternating paths described in the text and observe that there are exactly two boundary components (bold and thin). According to Equation (2.1), the genus of the diagram is given by , see Supplementary Figure S6 for details.
Fig. 3.

Computing the number of boundary components. The diagram contains 5+9 edges and 10 vertices. We follow the alternating paths described in the text and observe that there are exactly two boundary components (bold and thin). According to Equation (2.1), the genus of the diagram is given by formula, see Supplementary Figure S6 for details.

The fatgraph 𝔻 gives rise to a unique surface X𝔻, and each 𝔻-cycle corresponds to a boundary component of X𝔻, whose Euler characteristic and genus are given by
(2.1)
(2.2)
where v, e, r denotes the number of discs, ribbons and boundary components in 𝔻 (Massey, 1967). The graph D can readily be obtained by continuously contracting the ribbons and discs of 𝔻.

We next make use of an additional feature of RNA structures, namely that the backbone forms a unique oriented chain determined by the covalent bonds. Thus, the backbone can be collapsed to a single disk since the surface is orientable: in the absence of twisted ribbons, there is no particular information in the backbone itself. Indeed, the procedure can be undone by reinflating the disk and rebuilding the backbone. The contraction of the N vertices to a single one and the removal of the (N − 1) covalent bonds therefore preserves the Euler characteristic and genus, (Fig. 4).

Reduction to fatgraphs with a single vertex. Contracting the backbone of a diagram into a single vertex decreases the length of the boundary components and preserves the genus. The contracted fatgraph is equivalent to the labeled directed cycle. The backbone of the polymer can be recovered by reinflating the disk into the backbone. The polygon (r.h.s.) represents the standard 2D model of a surface as discussed in Massey (1967).
Fig. 4.

Reduction to fatgraphs with a single vertex. Contracting the backbone of a diagram into a single vertex decreases the length of the boundary components and preserves the genus. The contracted fatgraph is equivalent to the labeled directed cycle. The backbone of the polymer can be recovered by reinflating the disk into the backbone. The polygon (r.h.s.) represents the standard 2D model of a surface as discussed in Massey (1967).

Using the collapsed fatgraph,2 we see that the relation between the genus of the surface and the number of boundary components is determined by the number of arcs in the upper half plane, namely,
(2.3)
where n is number of base pairs and r the number of boundary components. The latter can be computed easily and therefore controls the genus of the molecules. Equation (2.3) follows from Equations (2.2) and (2.1), which together yield 2 − 2gr = ve, and the observation that the contracted graph has e = n arcs and a single (v = 1) vertex.

2.2 γ-structures

The shadow of a diagram (RNA structure) is obtained by removing all non-crossing arcs, collapsing all isolated vertices and replacing all remaining stacks (i.e. adjacent parallel arcs) by single arcs (Fig. 5). Shadows can be seen as a generalization of shape abstractions (Giegerich et al., 2004) to pseudoknotted structures (Reidys and Wang, 2010). Similar to the process of contracting the backbone into a single vertex, the projection into a shadow changes neither genus nor the number of boundary components (Andersen,J.E. et al., submitted for publication). However, all information on stack lengths and non-crossing components of the structure is lost in the process. We shall see that the set of structures with shadow formula can nevertheless be reconstructed efficiently. To this end we will show that, for fixed genus g, there are only finitely many distinct shadows Sg, which will play a central role in constructing folding algorithms.

The shadow of a diagram is obtained by removing all non-crossing arcs and isolated vertices and collapsing all resulting stacks into single arcs. While taking shadows is a significant reduction, the key topological invariants of genus and number of boundary components remain invariant.
Fig. 5.

The shadow of a diagram is obtained by removing all non-crossing arcs and isolated vertices and collapsing all resulting stacks into single arcs. While taking shadows is a significant reduction, the key topological invariants of genus and number of boundary components remain invariant.

A diagram is irreducible (or connected) (Kleitman, 1970) if for any two arcs there is a sequence of arcs so that consecutive arcs cross one other. A shadow is not necessarily irreducible but may be composed of multiple irreducible components or blocks, see (1) of Figure 6. Any shadow (and in general, any diagram) can be decomposed iteratively by removing irreducible components from bottom to top, i.e. so that that there is no component ‘inside’ the one just removed. Note that the set formula of irreducible components of the set of shadows, formula, equals the set of shadows of the irreducible components of the diagram S. Furthermore, the genus of formula is the sum of the genera of its irreducible components, i.e.
(2.4)
It seems natural, therefore, to determine the complexity of a structure by the maximal genus of the components of its shadows. More precisely, we say that S is a γ-structure if formula holds for all irreducible components of the shadows formula. By definition, a γ-structure can thus be constructed from the set Sγ of shadows of genus at most γ by inserting certain non-crossing arcs, (Fig. 6). The simplest class of structures are of course 0-structures, obtained by placing non-crossing arcs over the empty structure.
γ-structures: we display the shadow of a 1-structure (1) having topological genus 2 and the shadow of the HDV-structure (2) (Ferré-D'Amaré et al., 1998), a 2-structure having also genus two. Although both shadows have genus two, the HDV-structure cannot be generated iteratively via successive removals of S1-elements and stacked arcs. The structure displayed on the left is derived via two S1-substructures.
Fig. 6.

γ-structures: we display the shadow of a 1-structure (1) having topological genus 2 and the shadow of the HDV-structure (2) (Ferré-D'Amaré et al., 1998), a 2-structure having also genus two. Although both shadows have genus two, the HDV-structure cannot be generated iteratively via successive removals of S1-elements and stacked arcs. The structure displayed on the left is derived via two S1-substructures.

Lemma 2.1.

 

An RNA structure is a 0-structure if and only if it is a simple secondary structure. In particular, a 0-structure always has genus g = 0.

Proof.

 

We first observe that a diagram of genus zero contains no crossing arcs. This follows from the fact that genus is a monotone non-decreasing function of the number of arcs [see Equation (2.3)] and that the genus of the matching (H) consisting of two mutually crossing arcs has only one boundary component and hence genus one; (Fig. 2). Secondly, we observe by induction on the number of arcs that each new non-crossing arc contributes a new boundary component and 2 − 2g − (r + 1)=1 − (n + 1) shows that the genus remains zero. Structures consisting only of non-crossing arcs therefore have genus zero.

Next, we consider structures of arbitrary genus. For their analysis, diagrams without isolated points, i.e. matchings, play a central role. Let 𝒞g(n) be the set of matchings of genus g with n arcs, and let cg(n)≔|𝒞g(n)| denote its cardinality. As shown by (Andersen,J.E. et al., submitted for publication), the generating function Cg(z)=∑n≥0cg(n)zn is given by
(2.5)
where Pg(z) is an integral polynomial of degree (3g − 1) such that Pg(1/4)≠0. The number of genus zero matchings are well known to be given by the Catalan numbers, and Equation (2.5) allows the derivation of explicit formulas for higher genera, for instance,
Furthermore, the number cg(2g) of matchings of genus g having exactly 2g arcs, i.e. matchings having exactly one boundary component, is the coefficient of z2g in Pg(z) and is given by
(2.6)
Explicitly, we have c1(2) = 1, c2(4) = 21 and c3(6) = 1485 for example. These particular matchings will serve as ‘seeds’ for our folding algorithm. More precisely, we shall use the following:

Theorem 2.2.

 

For arbitrary genus g, the set Sgof shadows is finite. Every shadow in Sgcontains at least 2g and at most (6g − 2) arcs.

The special case g = 1, on which we focus in the algorithmic part of this contribution, is explicated in the Supplementary Material.

Proof.

 

First note that if there is more than one boundary component, then there must be an arc with different boundary components on its two sides, and removing this arc decreases r by exactly one while preserving g since the number of arcs is given by n = 2g + r − 1. Furthermore, if there are ν boundary components of length ℓ in the polygonal model, then 2n = ∑ ℓν since each side of each arc is traversed once by the boundary. For a shadow, ν1 = 0 by definition, and ν2 ≤ 1 as one sees directly. It therefore follows that 2n = ∑ ℓν ≥ 3(r − 1)+2, so 2n = 4g+2r − 2 ≥ 3r−1, i.e. 4g − 1 ≥ r. Thus, we have n = 2g+(4g − 1) − 1 = 6g − 2, i.e. any shadow can contain at most 6g − 2 arcs. The lower bound 2g follows directly from n = 2g + r − 1 by observing r = 1.

Many Sg-shadows are in fact γ-structures for some γ < g, that is, they can be constructed from elements of Sγ. One key result of this contribution is the following characterization of 1-structures:

Theorem 2.3.

 
An RNA structure is a 1-structure if and only if its shadow can be decomposed by iteratively removing one of the four shadows
In particular, 1-structures can have arbitrarily large topological genus.

Proof.

 

We only give a sketch here and refer to the Supplementary Material for a full proof. First, we observe that taking the shadow preserves genus. Since (H) is the unique matching with two arcs of genus g = 1, it is contained in every matching of genus g = 1. An arc crossing into (H) preserves the genus and leads to either (K) or (L). While every arc added to (K) increases the genus, there is one possibility to preserve the genus when adding an arc to (L), namely, the addition leading to (M). It remains to observe that no further arc can be added to (M).

Before proceeding to algorithmic considerations, we briefly compare the class of γ-structures with other classes of pseudoknots. Condon et al. (2004) investigated the structure classes L&P (Lyngsø and Pedersen, 2000), D&P (Dirks and Pierce, 2003), A&U (Akutsu, 2000) and R&E-class (Rivas and Eddy, 1999). The L&P- and D&P-class are based on the H-type shadow depicted in Theorem 2.3 and hence are proper subsets of the 1-structures. The A&U-class does not cover shadow M but on the other hand contains some configurations that are not 1-structures, and even the 2-structures do not completely contain the A&U-class. Nevertheless, the A&U-class is small: there are more 1-structures than A&U-structures for any given sequence length (Nebel,M.E. and Weinberg,F., submitted for publication).

The R&E class does not impose a limit on the genus of the shadow and hence contains γ-structure with arbitrarily large γ. Conversely, Figure 3 shows a 2-structure that is not contained in the R&E class. This example is minimal, i.e. all 1-structures are contained in R&E. Similarly, the set of k-non-crossing structures (Huang et al., 2009; Jin et al., 2008) has infinitely many shadows for any fixed k ≥ 3 (Reidys and Wang, 2010), and hence like R&E, contains γ-structure with arbitrarily large γ. We note that every 1-structure is 4-non-crossing; more precisely, shadows (H) and (K) are 3-non-crossing, while shadow (L) and (M) each contain three mutually crossing arcs. (Fig. 7).

Venn diagram of important classes of structures with pseudoknots. The mutual relationships of pseudoknot-free secondary structure (SS), the two H-shadow classes D&P and L&P, and the classes A&U and R&E, respectively, were already described by Condon et al. (2004). 1-structures and 4-non-crossing structures are added here.
Fig. 7.

Venn diagram of important classes of structures with pseudoknots. The mutual relationships of pseudoknot-free secondary structure (SS), the two H-shadow classes D&P and L&P, and the classes A&U and R&E, respectively, were already described by Condon et al. (2004). 1-structures and 4-non-crossing structures are added here.

2.3 Minimum free energy folding of γ-structures

0-structures: We have shown in the previous section that 0-structures are simple RNA secondary structures. Their minimum free energy (MFE) configuration can be obtained by DP recursions (Waterman, 1978; Zuker and Stiegler, 1981) derived from a decomposition into suitable substructures. This decomposition can be expressed in terms of a context-free grammar (Dowell and Eddy, 2004; Steffen and Giegerich, 2005). In the simplest case, which corresponds to evaluating base pairs only, we consider a single non-terminal symbol S representing an arbitrary diagram over a segment and three terminal symbols to represent isolated vertices (symbol :), openings (symbol () and closings (symbol )) of base pairs. We only need the three production rules
(2.7)
to generate the corresponding language 𝒮.

1-structures: We shall use that (i) any 1-structure can be inductively generated from genus one structures and (ii) that every genus one structure has shadow (H), (K), (L) or (M), to specify a multiple context-free grammar (MCFG) (Seki et al., 1991). In contrast to context-free grammars, the non-terminal symbols of MCFGs may consist of multiple components which must be expanded3 in parallel. In this way, it becomes possible to couple separated parts of a derivation and thus to generate crossings. In the case of 1-structures, the language 𝒮 is built upon sequences of intervals (fragment-pairs) [i, r], [s, j], where (i, j), (r, s) are nested arcs. Arcs having endpoints in the different fragments are assumed to be non-crossing; (Fig. 8). For the MCFG, the fragments of a pair are associated with two different (coupled) components of a two-dimensional non-terminal symbol.

Fragment-pairs in RNA structures: the rule I → IA1IB1IA2IB2S induces the fragment-pairs [i1, r1], [s1, j1] and [i2, r2], [s2, j2]. Arcs connecting the two fragments of a pair are noncrossing, while arcs with both endpoints within the same fragment may be crossing such as those within [s2, j2].
Fig. 8.

Fragment-pairs in RNA structures: the rule IIA1IB1IA2IB2S induces the fragment-pairs [i1, r1], [s1, j1] and [i2, r2], [s2, j2]. Arcs connecting the two fragments of a pair are noncrossing, while arcs with both endpoints within the same fragment may be crossing such as those within [s2, j2].

Accordingly, we (re)introduce the following symbols: Different brackets as well as the different non-terminals of pattern formula are used to distinguish nestings of the various kinds of shadows. Finally, we specify the production rules of our unambiguous MCFG ℛ1:
where X ∈ {H, K, L, M} distinguishes the four types of pseudoknots.
  • non-terminal S, representing secondary structure elements (i.e. diagrams without crossing arcs) according to the rules given above;

  • non-terminals I and T, representing an arbitrary 1-structure;

  • non-terminals formula with two components used to represent a fragment-pair with nested arcs, X ∈ {H, K, L, M}; and

  • terminals (X,)X denoting the opening and closing of a base pair, respectively, where X is one of the types H, K, L or M.

Theorem 2.4.

 
Any RNA 1-structure can beuniquelydecomposed via1, and any diagram generated via1is a 1-structure (Fig. 9).
Illustration of the grammar ℛ1.
Fig. 9.

Illustration of the grammar ℛ1.

Proof.

 
We proceed by induction on the number of shadows. Induction basis: in a 1-structure formula that contains no genus 1-shadow, there are no crossings and hence the structure can be decomposed uniquely via the context-free grammar of secondary structures. Induction step: suppose we are given a 1-structure containing r ≥ 1 shadows of genus one. We decompose from right to left. Everything is clear until we encounter a substructure containing a genus 1 shadow. For an arc α = (i, j), we distinguish two cases: (I) α is not crossed, or (II) α is crossed by another arc. In case of (I), there exists a 1-structure nested in α. In case of (II), we consider the partial order ≤, where (i, j) ≤(r, s) if and only if r < i and j < s. Since crossing arcs in a 1-structure are contained in one of the four base types, we distinguish the following scenarios
  • (H): then there exist maximal base pairs β=(r, s), where r < i < s < j,

  • (K): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < v < i < s < j,

  • (L): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < i < v < s < j,

  • (M): then there exist maximal base pairs β=(r, s), θ=(u, v) and δ=(p, q), where p < u < r < q < i < v < s < j.

Consider the set C(α) of arcs that are crossed by α and the minimal arc α* that crosses any element of C(α). Here, minimality is considered with respect to the partial order ≤, where (i, j) ≤(r, s) if and only if r < i and j < s. It follows that α=(i, j) and α*=(i*, j*) induce the fragment pair [i, i*] and [j*, j]. We similarly obtain the corresponding arcs β*, θ* or δ*, which induce at most four fragment pairs and correspond to a unique shadow of type (H), (K), (L) or (M) (Fig. 10). By construction, the number of genus 1 shadows of any substructure contained in such a fragment-pair is reduced at least by one, and can by induction hypothesis be uniquely decomposes via ℛ1. Finally, any structure generated via ℛ1 is constructed from top-to-bottom by iteratively building configurations of arcs having shadow (H), (K), (L) or (M). Thus, any structure obtained via ℛ1 is indeed a 1-structure completing the proof of the theorem.
Fragmentation: the four cases corresponding to the four shadows (H), (K), (L) and (M). In (1), there are two maximal arcs: α=(i, j) and β=(r, s), where r < i < s < j, whence the diagram has shadow (H). Here, α*=(i*, j*) is the minimal arc crossing C(α) and β*=(r*, s*) is the minimal arc crossing C(β). We have B1 =[i, i*], B2 =[j, j*], A1 =[r, r*], A2 =[s, s*]. Cases (2), (3) and (4) are analyzed similarly.
Fig. 10.

Fragmentation: the four cases corresponding to the four shadows (H), (K), (L) and (M). In (1), there are two maximal arcs: α=(i, j) and β=(r, s), where r < i < s < j, whence the diagram has shadow (H). Here, α*=(i*, j*) is the minimal arc crossing C(α) and β*=(r*, s*) is the minimal arc crossing C(β). We have B1 =[i, i*], B2 =[j, j*], A1 =[r, r*], A2 =[s, s*]. Cases (2), (3) and (4) are analyzed similarly.

2-structures: a folding algorithm for 2-structures requires an analogous enumeration of all (irreducible) shadows of genus 2. From Equation (2.6), it is straightforward to explicitly derive the 21 shadows of genus 2 with 4 arcs, see Supplementary Figure 10. As in the case of genus 1, arc insertions into these 21 configurations leads to the complete set of 3472 shadows of genus 2. This large number makes it infeasible to build a practically useful folding algorithms for all 2-structures. It may be useful, however, to deal with a (small) subset of shadows. The complexity of such an algorithm is determined by the complexity of decomposing the individual shadows by means of MCFG production rules reminiscent of those for ℛ1. For instance, the shadow of the HDV-structure displayed in (2) of Figure 6 is contained in the R&E class and can therefore be computed in O(N6) time and O(N4) space. However, when resorting to our approach its time complexity is at least O(N8): the shadow presented in Figure 11 requires an DP algorithm with O(N8) time-complexity and O(N6) space-complexity. It is an ongoing work to devise a sensible folding algorithm for 2-structures.

Folding of 2-structures: the shadow shown here is not contained in the R&E class of structures and cannot be generated by gap matrices. It can be decomposed, however, using the eight indexes i, j, k, l, m, n, p and q, thus implying a O(N8) time-complexity. This makes use of a six-dimensional gap matrix Gj,k,l,m,n,p, which implies O(N6) space-complexity.
Fig. 11.

Folding of 2-structures: the shadow shown here is not contained in the R&E class of structures and cannot be generated by gap matrices. It can be decomposed, however, using the eight indexes i, j, k, l, m, n, p and q, thus implying a O(N8) time-complexity. This makes use of a six-dimensional gap matrix Gj,k,l,m,n,p, which implies O(N6) space-complexity.

MFE folding of 1-structures: if we make use of a naïve table-based parsing scheme, checking for each subword s of the input and for each rule f whether f can produce s, a rule like f = IIA1IB2IC1IA2ID1IB2IC2ID2S introduces a complexity O(N18): first, we must process O(N2) different with subwords s induced by an input of size n. Secondly, each non-terminal but the first on the right-hand side of the production introduces an additional split point, which specifies the part of s to be generated by the corresponding non-terminal. Since its location may freely be chosen within s, each split point gives rise to another loop variable, and hence contributes a factor O(N) to the runtime.

Even if there are much more sophisticated parsing algorithms, it is useful to consider this simple scheme since it directly translates into a recursion for a DP algorithm typically used to compute structures of minimum free energy. Furthermore, it is possible to introduce intermediate steps in the derivation of our language by making use of additional non-terminals and production rules such that the time complexity can be reduced to O(N6). For that purpose, let the non-terminal I represent 1-structures in which no structures with shadow (H), (K), (L) or (M) are nested and the last vertex is paired. We introduce the non-terminal symbols formula, formula and formula assumed to represent intermediate fragment pairs and the production rules
where (U1, U2) is a marked copy of (U1, U2) used to identify the components which must later be expanded in a coupled way. Accordingly, we replace the derivations of T in ℛ1 as follows:
Note that syntactically, i.e. considered as dot-bracket representations, the 1-structures can be generated by an MCFG, parsable in time O(N5). However, in that case, corresponding brackets are not generated in a coupled way making the grammar inappropriate for algorithmic purposes.

As typical for DP and in analogy to our parsing scheme, we use two-dimensional matrices to store the optimal structure over a fragment. The matrix is indexed by the sequence coordinates of the endpoints. It can be a simple secondary structure 𝒮 or a substructure of higher genus. For the fragment-pairs, i.e. for the non-terminals of dimension two, four-dimensional matrices indexed by the endpoints of both linked fragments are required to store the optimal structure over them. Suppose the pair of fragments is [i, r] and [s, j], and let Gu(i, j; r, s) be the fragment-pair (associated with) [U1, U2], Gv(i, j; r, s) be the fragment-pair [V1, V2], Gw(i, j; r, s) be the fragment-pair [W1, W2] and G(i, j; r, s) be the fragment-pair [X1, X2]. The recursions for these matrices, summarized in graphical form in Figure 12, are determined directly by the grammar.

The decomposition for 4-dimensional matrices G, Gu, Gv, and Gw.
Fig. 12.

The decomposition for 4-dimensional matrices G, Gu, Gv, and Gw.

We can conclude from the rewriting rules that the computation of the two-dimensional matrices requires at most three loop variables, and there are O(N2) many of them. Accordingly, O(N5) operations are required to fill the associated two-dimensional matrices. For the four-dimensional matrices, two loop variables are needed for each of the corresponding rewriting rules (those with a left-hand side of dimension two) for there are in each case two split points introduced by the right-hand sides of the corresponding productions. Since we need to compute O(N4) matrix entries, the total run time is in O(N6). Obviously, O(N4) space is required to store these tables. Accordingly, the algorithm can generate all 1-structures in O(N6) time and O(N4) space, i.e. with the same complexity as pknotsRE (Rivas and Eddy, 1999) (for the larger R&E class). The advantage of 1-structures is that structurally different shadows can be parametrized in different ways, and that the search space is restricted to moderately complex shadows. In contrast, the language of R&E-structures is based on crossings and can neither identify blocks of arcs nor restrict the genus of the shadows. For more structure classes restricted to H-structures, NUPACK (Dirks and Pierce, 2003) requires O(N5) time and O(N4) space.

This is substantially more demanding, of course, than the O(N4) time and O(N2) memory complexity of pknotsRGReeder and Giegerich (2004), which, however, deals with a very restricted subset of H-shadow structures, demanding that helices are maximally extended and perfect in the sense that they are not interrupted by bulge- or interior-loops. pknotsRG thus is not guaranteed to find the minimum energy structure within the class H-shadow structures. A related fast heuristic treats the (K)-shadow as a superposition of the two H-shadows (Theis et al., 2010).

2.4 Partition function and sampling

We have shown that the MCFG ℛ1 uniquely generates all 1-structures, i.e. it is unambiguous. Consequently, ℛ1 can be employed to count 1-structures over a given sequence x and to compute the corresponding partition function
where R is the universal gas constant, T is the temperature, G(s) is energy of structure s over sequence x and formula is the set of 1-structures in which all base pairs (i, j) satisfy the base pairing rules for RNA, i.e. xixj ∈ {AU, UA, GC, CG, GU,UG}. Let Ni,j denote the substructure represented by the non-terminal symbol N in ℛ1 over the fragment [i, j], and let formula denote the fragment-pair formula, where X1 =[i, r] and X2 =[s, j] in the recursions for energy minimization. For each of these symbols, we introduce corresponding partial partition functions QNi,j and formula. Since the MCFG is unambiguous, the recursions for the partial partition functions are derived by replacing minima by sums and addition of energy contribution by multiplication of partial partition functions, see e.g. (Voß et al., 2006). For instance, the recursion for the partition functions corresponding to the non-terminal symbol T reads
where E[h, ℓ] denotes the energy of the loop closed by the base pair (h, ℓ).
The probabilities ℙNi,j of partial structures of type N over the fragment [i, j] and the probabilities formula of partial structures of type formula over the fragment pair [i, j], [r, s] are readily calculated from the partial partition functions. These ‘backward recursions’ are analogous to those derived by McCaskill (1990) for crossing free structures: let ΛNi,j be the set of 1-structures containing Ni,j and let formula be the set of 1-structures containing the fragment-pair formula. It follows that we have
Suppose Ni,j or formula are obtained by decomposing θs. The conditional probabilities ℙNi,js and formula are then given by Qθs(Ni,j)/Qθs and formula, respectively. Here Qθs represents the partition function of θs, and Qθs(Ni,j) and formula represent the partition functions for those θs-configurations that contain Ni,j and formula, respectively. Taking the sum over all possible θs, we obtain
From this backward recursion, one immediately derives a stochastic backtracing recursion from the probabilities of partial structures that generates a Boltzmann sample of 1-structures, see Ding and Lawrence (2003); Huang et al. (2010); Tacker et al. (1996) for analogous constructions.
The basic data structure for this sampling is a stack A which stores blocks of the form (i, j, N) [or formula], presenting substructures of non-terminal symbols N over [i, j] (or formula over [X1, X2] where X1 = [i, r] and X2 =[s, j]). L is a set of base pairs storing those removed by the decomposition step in the grammar. We initialize with the block (1, n, I) in A, and L = ∅. In each step, we pick up one element in A and decompose it via the grammar with probability QM/QN, where QN is the partition function of the block which is picked up from A, and QM is the partition function of the target block which is decomposed by the rewriting rule. The base pairs which are removed in the decomposition step are moved to L. For instance, according to the rewriting rule TI(T)S, the block (i, j, T) is decomposed into the three blocks: (i, h − 1, I), (h + 1, ℓ − 1, T), (ℓ + 1, j, S) and one base pair (h, ℓ) which is to be removed. For fixed indices h, ℓ, where ih < ℓ ≤ j, the probability of decomposing (i, j, T) reads
The sampling step is iterated until A is empty. The resulting 1-structure is the given by the list L of base pairs.

2.5 Software

Implementation: MFE folding, partition function including a computation of base pairing probabilities and stochastic backtracing are implemented in gfold. The program is written in C.

Energy model: although the presentation above uses a simplified grammar that does not explicitly distinguish the usual loop types, gfold implements the Mathews–Turner energy model without dangles (Mathews et al., 1999, 2004) for secondary structure elements. For pseudoknots, we use here an extended version of the Dirks–Pierce (DP) model (Dirks and Pierce, 2003) that allows different penalties βX for the four topologically distinct pseudoknot types X = H, K, L, M. We have observed that the values of βX have a substantial influence on the accuracy of the predicted structures. In both NUPACK and pknotsRE, a common pseudoknot penalty β1 is assigned whenever two gap matrices cross. Since the number of such crossings depends on the type of the pseudoknot, this algorithmic design would imply βA = β1, βB = βC = 2β1 and βD = 3β1. In gfold, these parameters are independent and can be adjusted to improve the performance. Since most experimentally known pseudoknots are of types (H) and (K), we focused in particular on the ratio of βA and βB and found that both sensitivity (the ratio of correctly predicted base pairs to the total number of base pairs in the reference structure) and positive predictive value reach a maximum for βB = 1.3βA. The pseudoknot penalty of type (H) coincides with that of the DP model, i.e. βA = β1 = 9.6 (kcal/mol). The other penalties are set to βB = 12.6, βC = 14.6 and βD = 17.6; see Supplementary Material for details. An alternative set of pseudoknot parameters described by Andronescu et al. (2010) can easily be incorporated but would require a readjustment of these four topological penalties.

Performance: the current implementation of gfold is applicable to sequences with a length up to N ∼ 150 nt on current PC hardware. Figure 13 summarizes the resource requirements.

Run time (A) and peak memory (B) of gfold. Timing information is given for MFE-only (triangles) and partition function with sampling 10 000 structures from the Boltzmann ensemble. To compute error bars, we folded between 10 (N > 100) and 100 (N < 70) randomly generated sequences on a Xeon E5410, 2.33 GHz, 48 GB memory. Memory allocation is independent of the sequence. For N ≥ 100, double precision floats are necessary to avoid overflows. This leads to the jump in memory consumption by a factor of 2. Dotted lines indicate the theoretical behavior of O(N6) (time) and O(N4) (space). The slope for CPU time is slightly steeper than the theory since constraints among the six indices introduced by the minimum size of the complex pseudoknot elements lead to an additional speedup for small N.
Fig. 13.

Run time (A) and peak memory (B) of gfold. Timing information is given for MFE-only (triangles) and partition function with sampling 10 000 structures from the Boltzmann ensemble. To compute error bars, we folded between 10 (N > 100) and 100 (N < 70) randomly generated sequences on a Xeon E5410, 2.33 GHz, 48 GB memory. Memory allocation is independent of the sequence. For N ≥ 100, double precision floats are necessary to avoid overflows. This leads to the jump in memory consumption by a factor of 2. Dotted lines indicate the theoretical behavior of O(N6) (time) and O(N4) (space). The slope for CPU time is slightly steeper than the theory since constraints among the six indices introduced by the minimum size of the complex pseudoknot elements lead to an additional speedup for small N.

We have observed that gfold provides a substantial increase in both sensitivity and a positive predictive value (PPV, ratio of correctly predicted base pairs to the total number of base pairs in the predicted structure) compared with the alternative DP approaches pknotsRE (Rivas and Eddy, 1999), NUPACK (Dirks and Pierce, 2003) and pknotsRG-mfe (Reeder and Giegerich, 2004), and that gfold provides a substantial increase in accuracy, cf. Figure 14. In an evaluation on the entire Pseudobase (van Batenburg et al., 2001), gfold achieves a sensitivity of 0.762 and PPV of 0.761. However, as detailed in Supplementary Table S3, the performance varies substantially between different classes of sequences. Interestingly, the more complex pseudoknots of type (K) are predicted with even higher accuracy (sensitivity 0.889, PPV 0.899) than the simpler, much more frequent type H.

Performance of gfold. Comparison of the average sensitivity (A) and PPV (B) of different prediction algorithms on a sample of 32 structures from Pseudobase. All details of this sample are given in the Supplementary Table S2. (C) The PPV increases significantly if only base pairs with larger pairing probabilities as predicted by the partition function version of gfold are included in the predicted structure.
Fig. 14.

Performance of gfold. Comparison of the average sensitivity (A) and PPV (B) of different prediction algorithms on a sample of 32 structures from Pseudobase. All details of this sample are given in the Supplementary Table S2. (C) The PPV increases significantly if only base pairs with larger pairing probabilities as predicted by the partition function version of gfold are included in the predicted structure.

The PPV of gfold predictions can be increased by filtering the base pairs of the MFE structure by their probability P of formation, which is computed by the partition function version of gfold. Accepting only base pairs with a predicted base pairing probability P > 0.95 increases the PPV from 0.76 to more than 0.9, (Fig. 14C). In order to evaluate the false positive rate, we folded 100 tRNA sequences from Sprinzl's tRNA database (Jühling et al., 2009). gfold correctly identifies 94% of them as pseudoknot free. In comparison, NUPACK correctly identifies 86% and pknotsRG−mfe 89% of this sample set.

3 DISCUSSION

Combinatorial models of pseudoknotted RNA structures are limited in two ways: on the one hand, exact algorithmic folding can be constructed only for certain types of structures; on the other hand, the larger the structure sets are, the more base pairing patterns are contained in them that cannot be realized in nature due to steric constraints. Algorithm design so far has been mostly driven by the desire to reduce computational complexity. The idea behind gfold, in contrast, is to define a more suitable class of structures that can be generated by nesting and concatenating a small number of elementary building blocks. This recursive structure is captured by a fairly simple unambiguous multiple context-free grammar that translates in a canonical way to DP algorithms for computing the minimum energy structure and the partition function in O(N6) time and O(N4) space. In addition to MFE folding, we have implemented the computation of base pairing probabilities and a stochastic backtracing recursion, thus providing the major functionalities of RNA secondary structure prediction software for a very natural class of pseudoknotted structures.

The 1-structures considered here strike a balance between the generality necessary to cover almost all known pseudoknotted structures and the restriction to topologically elementary structures that have a good chance to actually correspond to a feasible spatial structure. From a mathematical point of view, the characterization of structures in terms of irreducible components with given topological genus appears particularly natural and promises to reflect closely the ease with which a structure can be embedded in three dimensions. In addition, the grammar underlying gfold naturally distinguishes different types of pseudoknots and admits different energy parameters for them. We observe that this additional freedom of the parametrization leads to a substantial increase of sensitivity of type (K) pseudoknots, (0.63 → 0.889) and PPV (0.73 → 0.899) compared wit the usage of a common penalty for each crossing of gap matrices. In terms of prediction accuracy, gfold thus compares favorably also with the leading alternative DP approaches to pseudoknotted structures.

Funding: This work was supported by the 973 Project of the Ministry of Science and Technology; the PCSIRT Project of the Ministry of Education; National Science Foundation of China to C.M.R. and his lab; Deutsche Forschungsgemeinschaft (projects STA 850/2-1 & STA 850/7-1); the European Union FP-7 project QUANTOMICS (no. 222664) to P.F.S. and his lab. J.E.A. and R.C.P. are supported by QGM, the Centre for Quantum Geometry of Moduli Spaces, funded by the Danish National Research Foundation.

Conflict of Interest: none declared.

1Ribbons may also be allowed to twist giving rise to possibly non-orientable surfaces (Massey, 1967).

2In order to relate this to the standard 2D models of surfaces derived from triangulations: from the collapsed fatgraph we can derive the polygonal model of the surface X𝔻, i.e. a 2n-gon in which edges are identified in pairs; (Fig. 4).

3This coupling is only required for components that were generated by the same production step. Components, even if of the same kind, derived in different steps are independent of each other.

REFERENCES

Akutsu
T.
,
Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots
Discr. Appl. Math.
,
2000
, vol.
104
(pg.
45
-
62
)
Andersen
J.E.
et al.
,
Enumeration of linear chord diagrams
J. Alg. Comb.
,
2010
 
submitted
Andronescu
M.S.
et al.
,
Improved free energy parameters for RNA pseudoknotted secondary structure prediction
RNA
,
2010
, vol.
16
(pg.
26
-
42
)
Bailor
M.H.
et al.
,
Topology links RNA secondary structure with global conformation, dynamics, and adaptation
Science
,
2010
, vol.
327
(pg.
202
-
206
)
Bon
M.
et al.
,
Topological classification of RNA structures
J. Mol. Biol.
,
2008
, vol.
379
(pg.
900
-
911
)
Cai
L.
et al.
,
Stochastic modeling of RNA pseudoknotted structures: a grammatical approach
Bioinformatics
,
2003
, vol.
19
Suppl. 1
(pg.
i66
-
i73
)
Chen
S.J.
,
RNA folding: conformational statistics, folding kinetics, and ion electrostatics
Annu. Rev. Biophys.
,
2008
, vol.
37
(pg.
197
-
214
)
Chen
H.-L.
et al.
,
An O(n5) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids
J. Comput. Biol.
,
2009
, vol.
16
(pg.
803
-
815
)
Condon
A.
et al.
,
Classifying RNA pseudoknotted structures
Theor. Comput. Sci.
,
2004
, vol.
320
(pg.
35
-
50
)
Deogun
J.S.
et al.
,
RNA secondary structure prediction with simple pseudoknots
Proceedings of the Second Conference on Asia-Pacific Bioinformatics (APBC 2004).
,
2004
Sydney, Australia
Australian Computer Society
(pg.
239
-
246
)
Ding
Y.
Lawrence
C.E.
,
A statistical sampling algorithm for rna secondary structure prediction
Nucleic Acids Res.
,
2003
, vol.
31
(pg.
7280
-
7301
)
Dirks
R.M.
Pierce
N.A.
,
A partition function algorithm for nucleic acid secondary structure including pseudoknots
J. Comput. Chem.
,
2003
, vol.
24
(pg.
1664
-
1677
)
Doudna
J.A.
Cech
T.R.
,
The chemical repertoire of natural ribozymes
Nature
,
2002
, vol.
418
(pg.
222
-
228
)
Dowell
R.D.
Eddy
S.R.
,
Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction
BMC Bioinformatics
,
2004
, vol.
5
pg.
71
Ferré-D'Amaré
A.R.
et al.
,
Crystal structure of a hepatitis delta virus ribozyme
Nature
,
1998
, vol.
395
(pg.
567
-
574
)
Giedroc
D.P.
Cornish
P.V.
,
Frameshifting RNA pseudoknots: structure and mechanism
Virus Res.
,
2009
, vol.
139
(pg.
193
-
208
)
Giegerich
R.
et al.
,
Abstract shapes of RNA
Nucleic Acids Res.
,
2004
, vol.
32
(pg.
4843
-
4851
)
Haslinger
C.
Stadler
P.F.
,
RNA structures with pseudo-knots: graph-theoretical and combinatorial properties
Bull. Math. Biol.
,
1999
, vol.
61
(pg.
437
-
467
)
Hofacker
I.L.
et al.
,
Fast folding and comparison of RNA secondary structures
Monatsh. Chem.
,
1994
, vol.
125
(pg.
167
-
188
)
Huang
F.W.
et al.
,
Folding 3-noncrossing RNA pseudoknot structures
J. Comput. Biol.
,
2009
, vol.
16
(pg.
1549
-
1575
)
Huang
F.W.D.
et al.
,
Target prediction and a statistical sampling algorithm for RNA-RNA interaction
Bioinformatics
,
2010
, vol.
26
(pg.
175
-
181
)
Jin
E.Y.
et al.
,
Combinatorics of RNA structures with pseudoknots
Bull. Math. Biol.
,
2008
, vol.
70
(pg.
45
-
67
)
Jühling
F.
et al.
,
tRNAdb 2009: compilation of tRNA sequences and tRNA genes
Nucleic Acids Res.
,
2009
, vol.
37
(pg.
D159
-
D162
)
Kato
Y.
et al.
,
RNA pseudoknotted structure prediction using stochastic multiple context-free grammar
IPSJ Digit. Cour.
,
2006
, vol.
2
(pg.
655
-
664
)
Kleitman
D.
,
Proportions of irreducible diagrams
Stud. Appl. Math.
,
1970
, vol.
49
(pg.
297
-
299
)
Li
H.
Zhu
D.
Wang
L.
,
A new pseudoknots folding algorithm for RNA structure prediction
COCOON 2005
,
2005
, vol.
3595
Berlin
Springer
(pg.
94
-
103
)
Loebl
M.
Moffatt
I.
,
The chromatic polynomial of fatgraphs and its categorification
Adv. Math.
,
2008
, vol.
217
(pg.
1558
-
1587
)
Lyngsø
R.B.
Pedersen
C.N.
,
RNA pseudoknot prediction in energy-based models
J. Comput. Biol.
,
2000
, vol.
7
(pg.
409
-
427
)
Massey
W.S.
Algebraic Topology: An Introduction.
,
1967
New York
Springer
Mathews
D.
et al.
,
Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure
J. Mol. Biol.
,
1999
, vol.
288
(pg.
911
-
940
)
Mathews
D.
et al.
,
Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure
Proc. Natl Acad. Sci.USA
,
2004
, vol.
101
(pg.
7287
-
7292
)
Matsui
H.
et al.
,
Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures
Bioinformatics
,
2005
, vol.
21
(pg.
2611
-
2617
)
McCaskill
J.S.
,
The equilibrium partition function and base pair binding probabilities for RNA secondary structure
Biopolymers
,
1990
, vol.
29
(pg.
1105
-
1119
)
Metzler
D.
Nebel
M.E.
,
Predicting RNA secondary structures with pseudoknots by MCMC sampling
J. Math. Biol.
,
2008
, vol.
56
(pg.
161
-
181
)
Namy
O.
et al.
,
A mechanical explanation of RNA pseudoknot function in programmed ribosomal frameshifting
Nature
,
2006
, vol.
441
(pg.
244
-
247
)
Nebel
M.E.
Weinberg
F.
An algebraic approach to rna pseudoknotted structures.
,
2011
 
submitted
Nussinov
R.
et al.
,
Algorithms for loop matching
SIAM J. Appl. Math.
,
1978
, vol.
35
(pg.
68
-
82
)
Penner
R.C.
et al.
,
Fatgraph models of proteins
Comm. Pure Appl. Math.
,
2010
, vol.
63
(pg.
1249
-
1297
)
Reeder
J.
Giegerich
R.
,
Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics
BMC Bioinformatics
,
2004
, vol.
5
pg.
104
Reidys
C.M.
Wang
R.
,
Shapes of RNA pseudoknot structures
J. Comput. Biol.
,
2010
, vol.
17
(pg.
1575
-
1590
)
Rivas
E.
Eddy
S.R.
,
A dynamic programming algorithm for RNA structure prediction including pseudoknots
J. Mol. Biol.
,
1999
, vol.
285
(pg.
2053
-
2068
)
Rivas
E.
Eddy
S.R.
,
The language of RNA: a formal grammar that includes pseudoknots
Bioinformatics
,
2000
, vol.
16
(pg.
334
-
340
)
Rødland
E.A.
,
Pseudoknots in RNA secondary structures: representation, enumeration, and prevalence
J. Comput. Biol.
,
2006
, vol.
13
(pg.
1197
-
1213
)
Seki
H.
et al.
,
On multiple context free grammars
Theor. Comput. Sci.
,
1991
, vol.
88
(pg.
191
-
229
)
Staple
D.W.
Butcher
S.E.
,
Pseudoknots: RNA structures with diverse functions
PLoS Biol.
,
2005
, vol.
3
pg.
e213
Steffen
P.
Giegerich
R.
,
Versatile and declarative dynamic programming using pair algebras
BMC Bioinformatics
,
2005
, vol.
6
pg.
224
Tabaska
J.E.
et al.
,
An RNA folding method capable of identifying pseudoknots and base triples
Bioinformatics
,
1998
, vol.
14
(pg.
691
-
699
)
Tacker
M.
et al.
,
Algorithm independent properties of RNA structure prediction
Eur. Biophy. J.
,
1996
, vol.
25
(pg.
115
-
130
)
Taufer
M.
et al.
,
PseudoBase++: an extension of PseudoBase for easy searching, formatting and visualization of pseudoknots
Nucleic Acids Res.
,
2009
, vol.
37
(pg.
D127
-
D135
)
Theimer
C.A.
et al.
,
Structure of the human telomerase RNA pseudoknot reveals conserved tertiary interactions essential for function
Mol. Cell
,
2005
, vol.
17
(pg.
671
-
682
)
Theis
C.
et al.
,
Prediction of rna secondary structure including kissing hairpin motifs
Algorithms Bioinformatics
,
2010
, vol.
6293
(pg.
52
-
64
)
Uemura
Y.
et al.
,
Tree adjoining grammars for RNA structure prediction
Theor. Comput. Sci.
,
1999
, vol.
210
(pg.
277
-
303
)
van Batenburg
F.H.D.
et al.
,
PseudoBase: structural information on RNA pseudoknots
Nucleic Acids Res.
,
2001
, vol.
29
(pg.
194
-
195
)
Vernizzi
G.
Orland
H.
,
Large-N random matrices for RNA folding
Acta Phys. Polon.
,
2005
, vol.
36
(pg.
2821
-
2827
)
Voß
B.
et al.
,
Complete probabilistic analysis of RNA shapes
BMC Biol.
,
2006
, vol.
4
pg.
5
Waterman
M.S.
,
Secondary structure of single-stranded nucleic acids
Adv. Math.
,
1978
, vol.
1
(pg.
167
-
212
)
Zuker
M.
,
On finding all suboptimal foldings of an RNA molecule
Science
,
1989
, vol.
244
(pg.
48
-
52
)
Zuker
M.
Stiegler
P.
,
Optimal computer folding of larger RNA sequences using thermodynamics and auxiliary information
Nucleic Acids Res.
,
1981
, vol.
9
(pg.
133
-
148
)

Author notes

Associate Editor: Anna Tramontano

Supplementary data