-
PDF
- Split View
-
Views
-
Cite
Cite
Christian M. Reidys, Fenix W. D. Huang, Jørgen E. Andersen, Robert C. Penner, Peter F. Stadler, Markus E. Nebel, Topology and prediction of RNA pseudoknots, Bioinformatics, Volume 27, Issue 8, April 2011, Pages 1076–1085, https://doi.org/10.1093/bioinformatics/btr090
- Share Icon Share
Abstract
Motivation: Several dynamic programming algorithms for predicting RNA structures with pseudoknots have been proposed that differ dramatically from one another in the classes of structures considered.
Results: Here, we use the natural topological classification of RNA structures in terms of irreducible components that are embeddable in the surfaces of fixed genus. We add to the conventional secondary structures four building blocks of genus one in order to construct certain structures of arbitrarily high genus. A corresponding unambiguous multiple context-free grammar provides an efficient dynamic programming approach for energy minimization, partition function and stochastic sampling. It admits a topology-dependent parametrization of pseudoknot penalties that increases the sensitivity and positive predictive value of predicted base pairs by 10–20% compared with earlier approaches. More general models based on building blocks of higher genus are also discussed.
Availability: The source code of gfold is freely available at http://www.combinatorics.cn/cbpc/gfold.tar.gz.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
The global conformation of RNA molecules is to a large extent determined by topological constraints encoded at the level of secondary structure, i.e. by the mutual arrangements of the base paired helices (Bailor et al., 2010). In this context, secondary structure is understood in a wider sense that includes pseudoknots. Although the vast majority of RNAs has simple, i.e. pseudoknot-free, secondary structure, PseudoBase (Taufer et al., 2009) lists more than 250 records of pseudoknots determined by a variety of experimental and computational techniques including crystallography, nuclear magnetic resonance, mutational experiments and comparative sequence analysis. In many cases, they are crucial for molecular function. Examples include the catalytic cores of several ribozymes (Doudna and Cech, 2002), programmed frameshifting (Namy et al., 2006) and telomerase activity (Theimer et al., 2005), reviewed in Giedroc and Cornish (2009); Staple and Butcher (2005).
Secondary structures can been interpreted as matchings in a graph of permissible base pairs (Tabaska et al., 1998). The energy of RNA folding is dominated by the stacking of adjacent base pairs, not by the hydrogen bonds of the individual base pairs (Mathews et al., 1999). In contrast to maximum weighted matching, the general RNA folding problem with a stacking-based energy function is NP-complete (Akutsu, 2000; Lyngsø and Pedersen, 2000). The most commonly used RNA secondary structure prediction tools, including mfold (Zuker, 1989) and the Vienna RNA Package (Hofacker et al., 1994), therefore exclude pseudoknots.
Polynomial-time dynamic programming (DP) algorithms can be devised, however, for certain restricted classes of pseudoknots. In contrast to the O(N2) space and O(N3) time solution for simple secondary structures (Nussinov et al., 1978; Waterman, 1978; Zuker and Stiegler, 1981), however, most of these approaches are computationally much more demanding. The design of pseudoknot folding algorithms thus has been governed more by the need to limit computational cost and achieve a manageable complexity of the recursion than the conscious choice of a particularly natural search space of RNA structures. As a case in point, the class of structures underlying the algorithm by Rivas and Eddy (1999) (R&E-structures, pknot-R&E) was characterized only in a subsequent publication (Rivas and Eddy, 2000). The following references provide a certainly incomplete list of DP approaches to RNA structure prediction using different structure classes characterized in terms of recursion equations and/or stochastic grammars: Akutsu (2000); Cai et al. (2003); Chen et al. (2009); Deogun et al. (2004); Dirks and Pierce (2003); Kato et al. (2006); Li and Zhu (2005); Lyngsø and Pedersen (2000); Matsui et al. (2005); Reeder and Giegerich (2004); Rivas and Eddy (1999); Uemura Y. et al. (1999). The interrelationships of some of these classes of RNA structures have been clarified in part by Condon et al. (2004) and Rødland (2006). In addition to these exact algorithms, a plethora of heuristic approaches to pseudoknot prediction have been proposed in the literature; see e.g. (Chen, 2008; Metzler and Nebel, 2008) and the references therein.
At least three distinct classification schemes of RNA contact structures have been proposed: Haslinger and Stadler (1999) suggested using book-embeddings, Jin et al. (2008) focused on the maximal set of pairwise crossing base pairs and Bon et al. (2008) based the classification on topological embeddings. While these classifications have in common that simple secondary structure forms the most primitive class of structures, they differ already in the construction of the first non-trivial class of pseudoknots. Despite their mathematical appeal, however, no efficient (polynomial-time) algorithms are available for predicting pseudoknotted structures even in the simplest case of three non-crossing RNA structures. A practically workable approach to three non-crossing structures requires the enumeration of an exponentially growing number of diagrams which are then ‘filled in’ by the means of DP (Huang et al., 2009); a Monte-Carlo approach utilizing the topological approach with a very simple matching-like energy model was explored by (Vernizzi and Orland, 2005).
In this contribution, we show that the topological classification of RNA structures can be translated into efficient DP algorithms. To this end, we introduce γ-structures and prove that they can be derived from a finite family of abstract shapes called shadows. In Theorem 2.3, we enumerate these four shadows for γ = 1, which can be cast as explicit construction rules for a unique multiple context-free grammar (Section 2.3). Corresponding DP algorithms for energy minimization, partition function and Boltzmann-sampling functionalities are implemented in the software package gfold. An important feature is that γ-structures can be treated algorithmically like pseudoknot-free secondary structures in the sense that there are finitely many motifs, i.e. shadows, for fixed γ, each of which is assigned a specific energy. Because of the multiplicity of motifs, which rapidly increases with γ, this allows for a more detailed energy model of pseudoknotted structures based on their topological complexity.
2 RESULTS
2.1 Topology of RNA structures
Diagram representation: RNA molecules are linear biopolymers consisting of the four nucleotides A, U, C and G characterized by a sequence endowed with a unique orientation (5′ to 3′). Each nucleotide can interact (base pair) with at most one other nucleotide by means of specific hydrogen bonds. Only the Watson–Crick pairs GC and AU as well as the wobble GU are admissible. These base pairs determine the secondary structure. Note that we have neglected here base triples and other types of more complex interactions. Secondary structures can thus be represented as graphs where nucleotides are represented by vertices, the backbone of the molecule as well as the hydrogen bonds are represented by edges; see Figure 1a. More conveniently, we use the convention to represent the backbone of the polymer by a horizontally drawn chain. As before, this chain consists of vertices and arcs, respectively, representing the nucleotides and covalent bonds. However, the edges representing the base pairs now are depicted as arcs in the upper half plane; see Figure 1b. We call this representation the diagram of the molecule.

RNA structure as planar graph represented (a) as ball-and-stick figure with short edges for hydrogen bonds and (b) with linear backbone and semi-circles for hydrogen bonds.
Thus, we shall identify a structure with a labelled graph over the vertex set [N] = {1, 2,…, N} represented by drawing the vertices 1, 2,…, N on a horizontal line in the natural order and the arcs (i, j), where i < j, in the upper half plane.
Fatgraph representation: in order to understand the topological properties of RNA molecules, we need to pass from the picture of RNA as diagrams or contact graphs to that of topological surfaces. Only the associated surface carries the important invariants leading to a meaningful filtration of RNA structures. Formally, we will view an RNA molecule as a topological surface (Andersen,J.E. et al., submitted for publication). The main idea is to ‘thicken’ the edges into (untwisted) bands or ribbons and to expand each vertex to a disk as shown in Figure 2. This inflation of edges leads to a fatgraph 𝔻 (Loebl and Moffatt, 2008; Penner et al., 2010).
A fatgraph, sometimes also called ‘ribbon graph’ or ‘map’, is a graph equipped with a cyclic ordering of the incident half edges at each vertex. Thus, 𝔻 refines its underlying graph D insofar as it encodes the ordering of the ribbons incident on its disks. In the following, we will deal with orientable ribbon graphs.1 Each ribbon has two boundaries. The first one in counterclockwise order is labeled by an arrowhead, (Fig. 2). A 𝔻-cycle or 𝔻-boundary component is then constructed by following these directed boundaries from disk to disk, thereby alternating between base pair ribbons and backbone, with the exception of the segment of the boundary component that travels along the bottom of the backbone using only backbone bonds, as shown in Figures 2 and 3. We give a brief tutorial on how to compute boundary components in the Supplementary Figure S6. Topological invariants such as the number of boundary components of the fatgraph 𝔻 can thus be computed directly from the underlying diagram D. Furthermore, fatgraphs can be succinctly stored and conveniently manipulated on the computer as pairs of permutations (Penner et al., 2010).

Computing the number of boundary components. The diagram contains 5+9 edges and 10 vertices. We follow the alternating paths described in the text and observe that there are exactly two boundary components (bold and thin). According to Equation (2.1), the genus of the diagram is given by , see Supplementary Figure S6 for details.


We next make use of an additional feature of RNA structures, namely that the backbone forms a unique oriented chain determined by the covalent bonds. Thus, the backbone can be collapsed to a single disk since the surface is orientable: in the absence of twisted ribbons, there is no particular information in the backbone itself. Indeed, the procedure can be undone by reinflating the disk and rebuilding the backbone. The contraction of the N vertices to a single one and the removal of the (N − 1) covalent bonds therefore preserves the Euler characteristic and genus, (Fig. 4).

Reduction to fatgraphs with a single vertex. Contracting the backbone of a diagram into a single vertex decreases the length of the boundary components and preserves the genus. The contracted fatgraph is equivalent to the labeled directed cycle. The backbone of the polymer can be recovered by reinflating the disk into the backbone. The polygon (r.h.s.) represents the standard 2D model of a surface as discussed in Massey (1967).

2.2 γ-structures
The shadow of a diagram (RNA structure) is obtained by removing all non-crossing arcs, collapsing all isolated vertices and replacing all remaining stacks (i.e. adjacent parallel arcs) by single arcs (Fig. 5). Shadows can be seen as a generalization of shape abstractions (Giegerich et al., 2004) to pseudoknotted structures (Reidys and Wang, 2010). Similar to the process of contracting the backbone into a single vertex, the projection into a shadow changes neither genus nor the number of boundary components (Andersen,J.E. et al., submitted for publication). However, all information on stack lengths and non-crossing components of the structure is lost in the process. We shall see that the set of structures with shadow can nevertheless be reconstructed efficiently. To this end we will show that, for fixed genus g, there are only finitely many distinct shadows Sg, which will play a central role in constructing folding algorithms.

The shadow of a diagram is obtained by removing all non-crossing arcs and isolated vertices and collapsing all resulting stacks into single arcs. While taking shadows is a significant reduction, the key topological invariants of genus and number of boundary components remain invariant.







γ-structures: we display the shadow of a 1-structure (1) having topological genus 2 and the shadow of the HDV-structure (2) (Ferré-D'Amaré et al., 1998), a 2-structure having also genus two. Although both shadows have genus two, the HDV-structure cannot be generated iteratively via successive removals of S1-elements and stacked arcs. The structure displayed on the left is derived via two S1-substructures.
Lemma 2.1.
An RNA structure is a 0-structure if and only if it is a simple secondary structure. In particular, a 0-structure always has genus g = 0.
Proof.
We first observe that a diagram of genus zero contains no crossing arcs. This follows from the fact that genus is a monotone non-decreasing function of the number of arcs [see Equation (2.3)] and that the genus of the matching (H) consisting of two mutually crossing arcs has only one boundary component and hence genus one; (Fig. 2). Secondly, we observe by induction on the number of arcs that each new non-crossing arc contributes a new boundary component and 2 − 2g − (r + 1)=1 − (n + 1) shows that the genus remains zero. Structures consisting only of non-crossing arcs therefore have genus zero.



Theorem 2.2.
For arbitrary genus g, the set Sgof shadows is finite. Every shadow in Sgcontains at least 2g and at most (6g − 2) arcs.
The special case g = 1, on which we focus in the algorithmic part of this contribution, is explicated in the Supplementary Material.
Proof.
First note that if there is more than one boundary component, then there must be an arc with different boundary components on its two sides, and removing this arc decreases r by exactly one while preserving g since the number of arcs is given by n = 2g + r − 1. Furthermore, if there are νℓ boundary components of length ℓ in the polygonal model, then 2n = ∑ℓ ℓνℓ since each side of each arc is traversed once by the boundary. For a shadow, ν1 = 0 by definition, and ν2 ≤ 1 as one sees directly. It therefore follows that 2n = ∑ℓ ℓνℓ ≥ 3(r − 1)+2, so 2n = 4g+2r − 2 ≥ 3r−1, i.e. 4g − 1 ≥ r. Thus, we have n = 2g+(4g − 1) − 1 = 6g − 2, i.e. any shadow can contain at most 6g − 2 arcs. The lower bound 2g follows directly from n = 2g + r − 1 by observing r = 1.
Many Sg-shadows are in fact γ-structures for some γ < g, that is, they can be constructed from elements of Sγ. One key result of this contribution is the following characterization of 1-structures:
Theorem 2.3.

Proof.
We only give a sketch here and refer to the Supplementary Material for a full proof. First, we observe that taking the shadow preserves genus. Since (H) is the unique matching with two arcs of genus g = 1, it is contained in every matching of genus g = 1. An arc crossing into (H) preserves the genus and leads to either (K) or (L). While every arc added to (K) increases the genus, there is one possibility to preserve the genus when adding an arc to (L), namely, the addition leading to (M). It remains to observe that no further arc can be added to (M).
Before proceeding to algorithmic considerations, we briefly compare the class of γ-structures with other classes of pseudoknots. Condon et al. (2004) investigated the structure classes L&P (Lyngsø and Pedersen, 2000), D&P (Dirks and Pierce, 2003), A&U (Akutsu, 2000) and R&E-class (Rivas and Eddy, 1999). The L&P- and D&P-class are based on the H-type shadow depicted in Theorem 2.3 and hence are proper subsets of the 1-structures. The A&U-class does not cover shadow M but on the other hand contains some configurations that are not 1-structures, and even the 2-structures do not completely contain the A&U-class. Nevertheless, the A&U-class is small: there are more 1-structures than A&U-structures for any given sequence length (Nebel,M.E. and Weinberg,F., submitted for publication).
The R&E class does not impose a limit on the genus of the shadow and hence contains γ-structure with arbitrarily large γ. Conversely, Figure 3 shows a 2-structure that is not contained in the R&E class. This example is minimal, i.e. all 1-structures are contained in R&E. Similarly, the set of k-non-crossing structures (Huang et al., 2009; Jin et al., 2008) has infinitely many shadows for any fixed k ≥ 3 (Reidys and Wang, 2010), and hence like R&E, contains γ-structure with arbitrarily large γ. We note that every 1-structure is 4-non-crossing; more precisely, shadows (H) and (K) are 3-non-crossing, while shadow (L) and (M) each contain three mutually crossing arcs. (Fig. 7).

Venn diagram of important classes of structures with pseudoknots. The mutual relationships of pseudoknot-free secondary structure (SS), the two H-shadow classes D&P and L&P, and the classes A&U and R&E, respectively, were already described by Condon et al. (2004). 1-structures and 4-non-crossing structures are added here.
2.3 Minimum free energy folding of γ-structures

1-structures: We shall use that (i) any 1-structure can be inductively generated from genus one structures and (ii) that every genus one structure has shadow (H), (K), (L) or (M), to specify a multiple context-free grammar (MCFG) (Seki et al., 1991). In contrast to context-free grammars, the non-terminal symbols of MCFGs may consist of multiple components which must be expanded3 in parallel. In this way, it becomes possible to couple separated parts of a derivation and thus to generate crossings. In the case of 1-structures, the language 𝒮 is built upon sequences of intervals (fragment-pairs) [i, r], [s, j], where (i, j), (r, s) are nested arcs. Arcs having endpoints in the different fragments are assumed to be non-crossing; (Fig. 8). For the MCFG, the fragments of a pair are associated with two different (coupled) components of a two-dimensional non-terminal symbol.
![Fragment-pairs in RNA structures: the rule I → IA1IB1IA2IB2S induces the fragment-pairs [i1, r1], [s1, j1] and [i2, r2], [s2, j2]. Arcs connecting the two fragments of a pair are noncrossing, while arcs with both endpoints within the same fragment may be crossing such as those within [s2, j2].](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/27/8/10.1093_bioinformatics_btr090/3/m_bioinformatics_27_8_1076_f14.jpeg?Expires=1749036872&Signature=Hi~5GNS4q2H4qewZb0~103uX506-Y7v7822NnkHHhgc9aiI9E8qNC67ACtbZ~f85u5xdWM2tm06Z~Zz4QAVIRidPyHAmu9eDACiJfvMTF8yTidzh1sPd258IcR1R~5rn54dhU65NFAkKtnJI6otjBUMilueaFDiTrVi-25BHwZe6TO1ndymUPOMWl2jtPTK6cmhGZPvhC5ET6ICFQ5xGufSOmAeH7C2YQkoYVodXfUuJT7q~ARVx2o-3Ln98YZa0rggDthTplYeGWqlR-eCGoT41uiVdJmpbokG8Reh2QgZFUoHNv7VKucISiPHKhmVs66hwPrwFtk5o0m7eLzoEFw__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Fragment-pairs in RNA structures: the rule I → IA1IB1IA2IB2S induces the fragment-pairs [i1, r1], [s1, j1] and [i2, r2], [s2, j2]. Arcs connecting the two fragments of a pair are noncrossing, while arcs with both endpoints within the same fragment may be crossing such as those within [s2, j2].


non-terminal S, representing secondary structure elements (i.e. diagrams without crossing arcs) according to the rules given above;
non-terminals I and T, representing an arbitrary 1-structure;
non-terminals
with two components used to represent a fragment-pair with nested arcs, X ∈ {H, K, L, M}; and
terminals (X,)X denoting the opening and closing of a base pair, respectively, where X is one of the types H, K, L or M.
Theorem 2.4.

Proof.

(H): then there exist maximal base pairs β=(r, s), where r < i < s < j,
(K): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < v < i < s < j,
(L): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < i < v < s < j,
(M): then there exist maximal base pairs β=(r, s), θ=(u, v) and δ=(p, q), where p < u < r < q < i < v < s < j.
![Fragmentation: the four cases corresponding to the four shadows (H), (K), (L) and (M). In (1), there are two maximal arcs: α=(i, j) and β=(r, s), where r < i < s < j, whence the diagram has shadow (H). Here, α*=(i*, j*) is the minimal arc crossing C(α) and β*=(r*, s*) is the minimal arc crossing C(β). We have B1 =[i, i*], B2 =[j, j*], A1 =[r, r*], A2 =[s, s*]. Cases (2), (3) and (4) are analyzed similarly.](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/27/8/10.1093_bioinformatics_btr090/3/m_bioinformatics_27_8_1076_f3.jpeg?Expires=1749036873&Signature=pLq-pHQ4AZKGuE-ixNt70OYPSYpsb-aB1MCdI4dIvkGdWVBLxNzMWZcnrFeaw-VyG9EtkOlLZWlkNAbA5T5xYZgJrCkXAcdQy9nxizXoo5koHsCqxPH-YURsMtNlUp~Wb72--adJELgzN1~YdPOoOoHbvACPXLUZ77NnimAgG2~Rd8Jl8JcHNO-jKDAEQFA-Y6WJxZt1-yckgaaLeY0rknjrFirq3LXhWRqoh8AGgBX9jlnLVNDRulyqh7E~UHLvrAIhB4sLsIVRYZaj7NmiIaA6-pcDqYd4gupnU6afoIc0ehXg2Sl~jpCoP-zyvZ3DHYpruToJjK3Ot4ImE0PrHA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA)
Fragmentation: the four cases corresponding to the four shadows (H), (K), (L) and (M). In (1), there are two maximal arcs: α=(i, j) and β=(r, s), where r < i < s < j, whence the diagram has shadow (H). Here, α*=(i*, j*) is the minimal arc crossing C(α) and β*=(r*, s*) is the minimal arc crossing C(β). We have B1 =[i, i*], B2 =[j, j*], A1 =[r, r*], A2 =[s, s*]. Cases (2), (3) and (4) are analyzed similarly.
2-structures: a folding algorithm for 2-structures requires an analogous enumeration of all (irreducible) shadows of genus 2. From Equation (2.6), it is straightforward to explicitly derive the 21 shadows of genus 2 with 4 arcs, see Supplementary Figure 10. As in the case of genus 1, arc insertions into these 21 configurations leads to the complete set of 3472 shadows of genus 2. This large number makes it infeasible to build a practically useful folding algorithms for all 2-structures. It may be useful, however, to deal with a (small) subset of shadows. The complexity of such an algorithm is determined by the complexity of decomposing the individual shadows by means of MCFG production rules reminiscent of those for ℛ1. For instance, the shadow of the HDV-structure displayed in (2) of Figure 6 is contained in the R&E class and can therefore be computed in O(N6) time and O(N4) space. However, when resorting to our approach its time complexity is at least O(N8): the shadow presented in Figure 11 requires an DP algorithm with O(N8) time-complexity and O(N6) space-complexity. It is an ongoing work to devise a sensible folding algorithm for 2-structures.

Folding of 2-structures: the shadow shown here is not contained in the R&E class of structures and cannot be generated by gap matrices. It can be decomposed, however, using the eight indexes i, j, k, l, m, n, p and q, thus implying a O(N8) time-complexity. This makes use of a six-dimensional gap matrix Gj,k,l,m,n,p, which implies O(N6) space-complexity.
MFE folding of 1-structures: if we make use of a naïve table-based parsing scheme, checking for each subword s of the input and for each rule f whether f can produce s, a rule like f = I → IA1IB2IC1IA2ID1IB2IC2ID2S introduces a complexity O(N18): first, we must process O(N2) different with subwords s induced by an input of size n. Secondly, each non-terminal but the first on the right-hand side of the production introduces an additional split point, which specifies the part of s to be generated by the corresponding non-terminal. Since its location may freely be chosen within s, each split point gives rise to another loop variable, and hence contributes a factor O(N) to the runtime.





As typical for DP and in analogy to our parsing scheme, we use two-dimensional matrices to store the optimal structure over a fragment. The matrix is indexed by the sequence coordinates of the endpoints. It can be a simple secondary structure 𝒮 or a substructure of higher genus. For the fragment-pairs, i.e. for the non-terminals of dimension two, four-dimensional matrices indexed by the endpoints of both linked fragments are required to store the optimal structure over them. Suppose the pair of fragments is [i, r] and [s, j], and let Gu(i, j; r, s) be the fragment-pair (associated with) [U1, U2], Gv(i, j; r, s) be the fragment-pair [V1, V2], Gw(i, j; r, s) be the fragment-pair [W1, W2] and G(i, j; r, s) be the fragment-pair [X1, X2]. The recursions for these matrices, summarized in graphical form in Figure 12, are determined directly by the grammar.

The decomposition for 4-dimensional matrices G, Gu, Gv, and Gw.
We can conclude from the rewriting rules that the computation of the two-dimensional matrices requires at most three loop variables, and there are O(N2) many of them. Accordingly, O(N5) operations are required to fill the associated two-dimensional matrices. For the four-dimensional matrices, two loop variables are needed for each of the corresponding rewriting rules (those with a left-hand side of dimension two) for there are in each case two split points introduced by the right-hand sides of the corresponding productions. Since we need to compute O(N4) matrix entries, the total run time is in O(N6). Obviously, O(N4) space is required to store these tables. Accordingly, the algorithm can generate all 1-structures in O(N6) time and O(N4) space, i.e. with the same complexity as pknotsRE (Rivas and Eddy, 1999) (for the larger R&E class). The advantage of 1-structures is that structurally different shadows can be parametrized in different ways, and that the search space is restricted to moderately complex shadows. In contrast, the language of R&E-structures is based on crossings and can neither identify blocks of arcs nor restrict the genus of the shadows. For more structure classes restricted to H-structures, NUPACK (Dirks and Pierce, 2003) requires O(N5) time and O(N4) space.
This is substantially more demanding, of course, than the O(N4) time and O(N2) memory complexity of pknotsRGReeder and Giegerich (2004), which, however, deals with a very restricted subset of H-shadow structures, demanding that helices are maximally extended and perfect in the sense that they are not interrupted by bulge- or interior-loops. pknotsRG thus is not guaranteed to find the minimum energy structure within the class H-shadow structures. A related fast heuristic treats the (K)-shadow as a superposition of the two H-shadows (Theis et al., 2010).
2.4 Partition function and sampling




















2.5 Software
Implementation: MFE folding, partition function including a computation of base pairing probabilities and stochastic backtracing are implemented in gfold. The program is written in C.
Energy model: although the presentation above uses a simplified grammar that does not explicitly distinguish the usual loop types, gfold implements the Mathews–Turner energy model without dangles (Mathews et al., 1999, 2004) for secondary structure elements. For pseudoknots, we use here an extended version of the Dirks–Pierce (DP) model (Dirks and Pierce, 2003) that allows different penalties βX for the four topologically distinct pseudoknot types X = H, K, L, M. We have observed that the values of βX have a substantial influence on the accuracy of the predicted structures. In both NUPACK and pknotsRE, a common pseudoknot penalty β1 is assigned whenever two gap matrices cross. Since the number of such crossings depends on the type of the pseudoknot, this algorithmic design would imply βA = β1, βB = βC = 2β1 and βD = 3β1. In gfold, these parameters are independent and can be adjusted to improve the performance. Since most experimentally known pseudoknots are of types (H) and (K), we focused in particular on the ratio of βA and βB and found that both sensitivity (the ratio of correctly predicted base pairs to the total number of base pairs in the reference structure) and positive predictive value reach a maximum for βB = 1.3βA. The pseudoknot penalty of type (H) coincides with that of the DP model, i.e. βA = β1 = 9.6 (kcal/mol). The other penalties are set to βB = 12.6, βC = 14.6 and βD = 17.6; see Supplementary Material for details. An alternative set of pseudoknot parameters described by Andronescu et al. (2010) can easily be incorporated but would require a readjustment of these four topological penalties.
Performance: the current implementation of gfold is applicable to sequences with a length up to N ∼ 150 nt on current PC hardware. Figure 13 summarizes the resource requirements.

Run time (A) and peak memory (B) of gfold. Timing information is given for MFE-only (triangles) and partition function with sampling 10 000 structures from the Boltzmann ensemble. To compute error bars, we folded between 10 (N > 100) and 100 (N < 70) randomly generated sequences on a Xeon E5410, 2.33 GHz, 48 GB memory. Memory allocation is independent of the sequence. For N ≥ 100, double precision floats are necessary to avoid overflows. This leads to the jump in memory consumption by a factor of 2. Dotted lines indicate the theoretical behavior of O(N6) (time) and O(N4) (space). The slope for CPU time is slightly steeper than the theory since constraints among the six indices introduced by the minimum size of the complex pseudoknot elements lead to an additional speedup for small N.
We have observed that gfold provides a substantial increase in both sensitivity and a positive predictive value (PPV, ratio of correctly predicted base pairs to the total number of base pairs in the predicted structure) compared with the alternative DP approaches pknotsRE (Rivas and Eddy, 1999), NUPACK (Dirks and Pierce, 2003) and pknotsRG-mfe (Reeder and Giegerich, 2004), and that gfold provides a substantial increase in accuracy, cf. Figure 14. In an evaluation on the entire Pseudobase (van Batenburg et al., 2001), gfold achieves a sensitivity of 0.762 and PPV of 0.761. However, as detailed in Supplementary Table S3, the performance varies substantially between different classes of sequences. Interestingly, the more complex pseudoknots of type (K) are predicted with even higher accuracy (sensitivity 0.889, PPV 0.899) than the simpler, much more frequent type H.

Performance of gfold. Comparison of the average sensitivity (A) and PPV (B) of different prediction algorithms on a sample of 32 structures from Pseudobase. All details of this sample are given in the Supplementary Table S2. (C) The PPV increases significantly if only base pairs with larger pairing probabilities as predicted by the partition function version of gfold are included in the predicted structure.
The PPV of gfold predictions can be increased by filtering the base pairs of the MFE structure by their probability P of formation, which is computed by the partition function version of gfold. Accepting only base pairs with a predicted base pairing probability P > 0.95 increases the PPV from 0.76 to more than 0.9, (Fig. 14C). In order to evaluate the false positive rate, we folded 100 tRNA sequences from Sprinzl's tRNA database (Jühling et al., 2009). gfold correctly identifies 94% of them as pseudoknot free. In comparison, NUPACK correctly identifies 86% and pknotsRG−mfe 89% of this sample set.
3 DISCUSSION
Combinatorial models of pseudoknotted RNA structures are limited in two ways: on the one hand, exact algorithmic folding can be constructed only for certain types of structures; on the other hand, the larger the structure sets are, the more base pairing patterns are contained in them that cannot be realized in nature due to steric constraints. Algorithm design so far has been mostly driven by the desire to reduce computational complexity. The idea behind gfold, in contrast, is to define a more suitable class of structures that can be generated by nesting and concatenating a small number of elementary building blocks. This recursive structure is captured by a fairly simple unambiguous multiple context-free grammar that translates in a canonical way to DP algorithms for computing the minimum energy structure and the partition function in O(N6) time and O(N4) space. In addition to MFE folding, we have implemented the computation of base pairing probabilities and a stochastic backtracing recursion, thus providing the major functionalities of RNA secondary structure prediction software for a very natural class of pseudoknotted structures.
The 1-structures considered here strike a balance between the generality necessary to cover almost all known pseudoknotted structures and the restriction to topologically elementary structures that have a good chance to actually correspond to a feasible spatial structure. From a mathematical point of view, the characterization of structures in terms of irreducible components with given topological genus appears particularly natural and promises to reflect closely the ease with which a structure can be embedded in three dimensions. In addition, the grammar underlying gfold naturally distinguishes different types of pseudoknots and admits different energy parameters for them. We observe that this additional freedom of the parametrization leads to a substantial increase of sensitivity of type (K) pseudoknots, (0.63 → 0.889) and PPV (0.73 → 0.899) compared wit the usage of a common penalty for each crossing of gap matrices. In terms of prediction accuracy, gfold thus compares favorably also with the leading alternative DP approaches to pseudoknotted structures.
Funding: This work was supported by the 973 Project of the Ministry of Science and Technology; the PCSIRT Project of the Ministry of Education; National Science Foundation of China to C.M.R. and his lab; Deutsche Forschungsgemeinschaft (projects STA 850/2-1 & STA 850/7-1); the European Union FP-7 project QUANTOMICS (no. 222664) to P.F.S. and his lab. J.E.A. and R.C.P. are supported by QGM, the Centre for Quantum Geometry of Moduli Spaces, funded by the Danish National Research Foundation.
Conflict of Interest: none declared.
1Ribbons may also be allowed to twist giving rise to possibly non-orientable surfaces (Massey, 1967).
2In order to relate this to the standard 2D models of surfaces derived from triangulations: from the collapsed fatgraph we can derive the polygonal model of the surface X𝔻, i.e. a 2n-gon in which edges are identified in pairs; (Fig. 4).
3This coupling is only required for components that were generated by the same production step. Components, even if of the same kind, derived in different steps are independent of each other.
REFERENCES
Author notes
Associate Editor: Anna Tramontano