Topology and prediction of RNA pseudoknots

Abstract

Motivation: Several dynamic programming algorithms for predicting RNA structures with pseudoknots have been proposed that differ dramatically from one another in the classes of structures considered.

Results: Here, we use the natural topological classification of RNA structures in terms of irreducible components that are embeddable in the surfaces of fixed genus. We add to the conventional secondary structures four building blocks of genus one in order to construct certain structures of arbitrarily high genus. A corresponding unambiguous multiple context-free grammar provides an efficient dynamic programming approach for energy minimization, partition function and stochastic sampling. It admits a topology-dependent parametrization of pseudoknot penalties that increases the sensitivity and positive predictive value of predicted base pairs by 10–20% compared with earlier approaches. More general models based on building blocks of higher genus are also discussed.

Availability: The source code of gfold is freely available at http://www.combinatorics.cn/cbpc/gfold.tar.gz.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The global conformation of RNA molecules is to a large extent determined by topological constraints encoded at the level of secondary structure, i.e. by the mutual arrangements of the base paired helices (Bailor et al., 2010). In this context, secondary structure is understood in a wider sense that includes pseudoknots. Although the vast majority of RNAs have simple, i.e. pseudoknot-free, secondary structure, PseudoBase (Taufer et al., 2009) lists more than 250 records of pseudoknots determined by a variety of experimental and computational techniques including crystallography, nuclear magnetic resonance, mutational experiments and comparative sequence analysis. In many cases, they are crucial for molecular function. Examples include the catalytic cores of several ribozymes (Doudna and Cech, 2002), programmed frameshifting (Namy et al., 2006) and telomerase activity (Theimer et al., 2005), reviewed in Giedroc and Cornish (2009) and Staple and Butcher (2005).

Secondary structures can be interpreted as matchings in a graph of permissible base pairs (Tabaska et al., 1998). The energy of RNA folding is dominated by the stacking of adjacent base pairs, not by the hydrogen bonds of the individual base pairs (Mathews et al., 1999). In contrast to maximum weighted matching, the general RNA folding problem with a stacking-based energy function is NP-complete (Akutsu, 2000; Lyngsø and Pedersen, 2000). The most commonly used RNA secondary structure prediction tools, including mfold (Zuker, 1989) and the Vienna RNA Package (Hofacker et al., 1994), therefore exclude pseudoknots.

Polynomial-time dynamic programming (DP) algorithms can be devised, however, for certain restricted classes of pseudoknots. In contrast to the O(N²) space and O(N³) time solution for simple secondary structures (Nussinov et al., 1978; Waterman, 1978; Zuker and Stiegler, 1981), most of these approaches are computationally much more demanding. The design of pseudoknot folding algorithms has thus been governed more by the need to limit computational cost and achieve a manageable complexity of the recursion than by the conscious choice of a particularly natural search space of RNA structures. As a case in point, the class of structures underlying the algorithm by Rivas and Eddy (1999) (R&E-structures, pknot-R&E) was characterized only in a subsequent publication (Rivas and Eddy, 2000). The following references provide a certainly incomplete list of DP approaches to RNA structure prediction using different structure classes characterized in terms of recursion equations and/or stochastic grammars: Akutsu (2000); Cai et al. (2003); Chen et al. (2009); Deogun et al. (2004); Dirks and Pierce (2003); Kato et al. (2006); Li and Zhu (2005); Lyngsø and Pedersen (2000); Matsui et al. (2005); Reeder and Giegerich (2004); Rivas and Eddy (1999); Uemura et al. (1999). The interrelationships of some of these classes of RNA structures have been clarified in part by Condon et al. (2004) and Rødland (2006). In addition to these exact algorithms, a plethora of heuristic approaches to pseudoknot prediction have been proposed in the literature; see e.g. Chen (2008), Metzler and Nebel (2008) and the references therein.

At least three distinct classification schemes of RNA contact structures have been proposed: Haslinger and Stadler (1999) suggested using book-embeddings, Jin et al. (2008) focused on the maximal set of pairwise crossing base pairs and Bon et al. (2008) based the classification on topological embeddings. While these classifications have in common that simple secondary structure forms the most primitive class of structures, they differ already in the construction of the first non-trivial class of pseudoknots. Despite their mathematical appeal, however, no efficient (polynomial-time) algorithms are available for predicting pseudoknotted structures even in the simplest case of three non-crossing RNA structures. A practically workable approach to three non-crossing structures requires the enumeration of an exponentially growing number of diagrams, which are then ‘filled in’ by means of DP (Huang et al., 2009); a Monte-Carlo approach utilizing the topological approach with a very simple matching-like energy model was explored by Vernizzi and Orland (2005).

In this contribution, we show that the topological classification of RNA structures can be translated into efficient DP algorithms. To this end, we introduce γ-structures and prove that they can be derived from a finite family of abstract shapes called shadows. In Theorem 2.3, we enumerate these four shadows for γ = 1, which can be cast as explicit construction rules for a unique multiple context-free grammar (Section 2.3). Corresponding DP algorithms for energy minimization, partition function and Boltzmann-sampling functionalities are implemented in the software package gfold. An important feature is that γ-structures can be treated algorithmically like pseudoknot-free secondary structures in the sense that there are finitely many motifs, i.e. shadows, for fixed γ, each of which is assigned a specific energy. Because of the multiplicity of motifs, which rapidly increases with γ, this allows for a more detailed energy model of pseudoknotted structures based on their topological complexity.

2 RESULTS

2.1 Topology of RNA structures

Diagram representation: RNA molecules are linear biopolymers consisting of the four nucleotides A, U, C and G characterized by a sequence endowed with a unique orientation (5′ to 3′). Each nucleotide can interact (base pair) with at most one other nucleotide by means of specific hydrogen bonds. Only the Watson–Crick pairs GC and AU as well as the wobble GU are admissible. These base pairs determine the secondary structure. Note that we have neglected here base triples and other types of more complex interactions. Secondary structures can thus be represented as graphs in which nucleotides are represented by vertices, while the backbone of the molecule and the hydrogen bonds are represented by edges; see Figure 1a. More conveniently, we use the convention to represent the backbone of the polymer by a horizontally drawn chain. As before, this chain consists of vertices and arcs, respectively, representing the nucleotides and covalent bonds. However, the edges representing the base pairs now are depicted as arcs in the upper half plane; see Figure 1b. We call this representation the diagram of the molecule.

Fig. 1.

RNA structure as planar graph represented (a) as ball-and-stick figure with short edges for hydrogen bonds and (b) with linear backbone and semi-circles for hydrogen bonds.

Thus, we shall identify a structure with a labelled graph over the vertex set [N] = {1, 2,…, N} represented by drawing the vertices 1, 2,…, N on a horizontal line in the natural order and the arcs (i, j), where i < j, in the upper half plane.

Fatgraph representation: in order to understand the topological properties of RNA molecules, we need to pass from the picture of RNA as diagrams or contact graphs to that of topological surfaces. Only the associated surface carries the important invariants leading to a meaningful filtration of RNA structures. Formally, we will view an RNA molecule as a topological surface (Andersen,J.E. et al., submitted for publication). The main idea is to ‘thicken’ the edges into (untwisted) bands or ribbons and to expand each vertex to a disk as shown in Figure 2. This inflation of edges leads to a fatgraph 𝔻 (Loebl and Moffatt, 2008; Penner et al., 2010).

Fig. 2.

Inflation of edges and vertices to ribbons and disks. Here we have four vertices, five edges and one boundary component. The corresponding surface has Euler characteristic χ=v−e+r=0 and genus g=1, see Equations (2.1) and (2.2).

A fatgraph, sometimes also called ‘ribbon graph’ or ‘map’, is a graph equipped with a cyclic ordering of the incident half edges at each vertex. Thus, 𝔻 refines its underlying graph D insofar as it encodes the ordering of the ribbons incident on its disks. In the following, we will deal with orientable ribbon graphs.¹ Each ribbon has two boundaries. The first one in counterclockwise order is labeled by an arrowhead (Fig. 2). A 𝔻-cycle or 𝔻-boundary component is then constructed by following these directed boundaries from disk to disk, thereby alternating between base pair ribbons and backbone, with the exception of the segment of the boundary component that travels along the bottom of the backbone using only backbone bonds, as shown in Figures 2 and 3. We give a brief tutorial on how to compute boundary components in Supplementary Figure S6. Topological invariants such as the number of boundary components of the fatgraph 𝔻 can thus be computed directly from the underlying diagram D. Furthermore, fatgraphs can be succinctly stored and conveniently manipulated on the computer as pairs of permutations (Penner et al., 2010).

Fig. 3.

Computing the number of boundary components. The diagram contains 5+9 edges and 10 vertices. We follow the alternating paths described in the text and observe that there are exactly two boundary components (bold and thin). According to Equations (2.1) and (2.2), the genus of the diagram is given by g = 1 − (10 − 14 + 2)/2 = 2, see Supplementary Figure S6 for details.

The fatgraph 𝔻 gives rise to a unique surface X_𝔻, and each 𝔻-cycle corresponds to a boundary component of X_𝔻, whose Euler characteristic and genus are given by

$$\chi(X_{\mathbb{D}}) = v - e + r, \tag{2.1}$$

$$g(X_{\mathbb{D}}) = 1 - \tfrac{1}{2}\,\chi(X_{\mathbb{D}}), \tag{2.2}$$

where v, e and r denote the numbers of discs, ribbons and boundary components in 𝔻 (Massey, 1967). The graph D can readily be obtained by continuously contracting the ribbons and discs of 𝔻.
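As a rough illustration of how the boundary components can be traced on a computer, the following Python sketch stores the fatgraph as two permutations of half-edges and counts the cycles of their composition. The function names, the 0-based vertex labels and the half-edge labels 'R', 'U', 'L' are choices made for this sketch only; they are not taken from gfold or from the Supplementary Material.

```python
def boundary_components(n_vertices, arcs):
    """Count the boundary components r of the fatgraph of a diagram.

    Vertices 0..n_vertices-1 sit on the backbone; arcs are pairs (i, j), i < j.
    Each vertex carries up to three half-edges, listed counterclockwise:
    'R' (backbone to the right), 'U' (arc in the upper half plane),
    'L' (backbone to the left).
    """
    partner = {}
    for i, j in arcs:
        partner[i], partner[j] = j, i

    # sigma: next half-edge counterclockwise around the same vertex
    sigma = {}
    for v in range(n_vertices):
        darts = []
        if v < n_vertices - 1:
            darts.append((v, "R"))
        if v in partner:
            darts.append((v, "U"))
        if v > 0:
            darts.append((v, "L"))
        for k, d in enumerate(darts):
            sigma[d] = darts[(k + 1) % len(darts)]

    def alpha(d):
        """Opposite half-edge on the same edge (backbone bond or arc)."""
        v, kind = d
        if kind == "R":
            return (v + 1, "L")
        if kind == "L":
            return (v - 1, "R")
        return (partner[v], "U")

    if not sigma:          # a single vertex without edges: one boundary
        return 1

    # boundary components are the cycles of d -> sigma(alpha(d))
    seen, r = set(), 0
    for start in sigma:
        if start in seen:
            continue
        r += 1
        d = start
        while d not in seen:
            seen.add(d)
            d = sigma[alpha(d)]
    return r


def euler_and_genus(n_vertices, arcs):
    """Return (chi, g) with chi = v - e + r and g = 1 - chi/2."""
    v = n_vertices
    e = (n_vertices - 1) + len(arcs)   # backbone bonds plus base pairs
    r = boundary_components(n_vertices, arcs)
    chi = v - e + r
    return chi, (2 - chi) // 2


# Four vertices with two crossing arcs, as in Fig. 2
print(euler_and_genus(4, [(0, 2), (1, 3)]))
```

For the four-vertex diagram with two crossing arcs, the trace visits all ten half-edges in a single cycle, so r = 1, χ = 4 − 5 + 1 = 0 and g = 1, in agreement with the caption of Figure 2.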

We next make use of an additional feature of RNA structures, namely that the backbone forms a unique oriented chain determined by the covalent bonds. Thus, the backbone can be collapsed to a single disk since the surface is orientable: in the absence of twisted ribbons, there is no particular information in the backbone itself. Indeed, the procedure can be undone by reinflating the disk and rebuilding the backbone. The contraction of the N vertices to a single one and the removal of the (N − 1) covalent bonds therefore preserve the Euler characteristic and genus (Fig. 4).

Fig. 4.

Reduction to fatgraphs with a single vertex. Contracting the backbone of a diagram into a single vertex decreases the length of the boundary components and preserves the genus. The contracted fatgraph is equivalent to the labeled directed cycle. The backbone of the polymer can be recovered by reinflating the disk into the backbone. The polygon (r.h.s.) represents the standard 2D model of a surface as discussed in Massey (1967).

Using the collapsed fatgraph,² we see that the relation between the genus of the surface and the number of boundary components is determined by the number of arcs in the upper half plane, namely,

$$2 - 2g - r = 1 - n, \tag{2.3}$$

where n is the number of base pairs and r the number of boundary components. The latter can be computed easily and therefore controls the genus of the molecule. Equation (2.3) follows from Equations (2.1) and (2.2), which together yield 2 − 2g − r = v − e, and the observation that the contracted graph has e = n arcs and a single (v = 1) vertex.
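On the collapsed fatgraph the computation becomes very short. The sketch below is one possible way to do it, not the gfold implementation: the 2n arc endpoints are relabelled 0, …, 2n − 1 around the single vertex, r is counted as the number of cycles of the map x ↦ partner(x) + 1 (mod 2n), and Equation (2.3) is solved for g. The function name `genus` is hypothetical.

```python
def genus(arcs):
    """Genus of a diagram from its arcs alone, using 2 - 2g - r = 1 - n."""
    ends = sorted(p for arc in arcs for p in arc)
    m = len(ends)                      # m = 2n arc endpoints
    if m == 0:
        return 0                       # no base pairs: genus zero
    pos = {p: k for k, p in enumerate(ends)}   # drop unpaired vertices
    partner = [0] * m
    for i, j in arcs:
        partner[pos[i]], partner[pos[j]] = pos[j], pos[i]

    # boundary components = cycles of x -> partner(x) + 1 (mod 2n)
    seen, r = [False] * m, 0
    for start in range(m):
        if seen[start]:
            continue
        r += 1
        x = start
        while not seen[x]:
            seen[x] = True
            x = (partner[x] + 1) % m

    n = len(arcs)
    return (n + 1 - r) // 2


print(genus([(1, 3), (2, 4)]))                   # two crossing arcs
print(genus([(1, 4), (2, 3)]))                   # two nested arcs
print(genus([(1, 3), (2, 4), (5, 7), (6, 8)]))   # two crossings side by side
```

Unpaired vertices are simply skipped in the relabelling, which reflects the fact that contracting the backbone does not change the genus.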

2.2 γ-structures

The shadow of a diagram (RNA structure) is obtained by removing all non-crossing arcs, collapsing all isolated vertices and replacing all remaining stacks (i.e. adjacent parallel arcs) by single arcs (Fig. 5). Shadows can be seen as a generalization of shape abstractions (Giegerich et al., 2004) to pseudoknotted structures (Reidys and Wang, 2010). Similar to the process of contracting the backbone into a single vertex, the projection into a shadow changes neither genus nor the number of boundary components (Andersen,J.E. et al., submitted for publication). However, all information on stack lengths and non-crossing components of the structure is lost in the process. We shall see that the set of structures with a given shadow can nevertheless be reconstructed efficiently. To this end we will show that, for fixed genus g, the set S_g of distinct shadows is finite; these shadows will play a central role in constructing folding algorithms.
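The three steps of the shadow projection can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that a structure is given as a list of arcs (i, j) with i < j; the helper names `crosses`, `relabel` and `shadow` are introduced here and do not come from the paper or from gfold.

```python
def crosses(a, b):
    """True if arcs a and b cross in the upper half plane."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def relabel(arcs):
    """Collapse isolated vertices: renumber the arc endpoints 0..2n-1."""
    ends = sorted(p for a in arcs for p in a)
    pos = {p: k for k, p in enumerate(ends)}
    return sorted((pos[i], pos[j]) for i, j in arcs)


def shadow(arcs):
    """Shadow of a diagram: no uncrossed arcs, no isolated vertices, no stacks."""
    arcs = sorted(arcs)
    # 1. remove every arc that is crossed by no other arc
    arcs = [a for a in arcs if any(crosses(a, b) for b in arcs)]
    while True:
        # 2. collapse isolated vertices
        arcs = relabel(arcs)
        # 3. replace stacks by single arcs: drop (i, j) if (i-1, j+1) is present
        present = set(arcs)
        reduced = [(i, j) for (i, j) in arcs if (i - 1, j + 1) not in present]
        if len(reduced) == len(arcs):
            return arcs
        arcs = reduced


# An H-type pseudoknot built from two stacks of length two plus a hairpin
structure = [(1, 10), (2, 9), (5, 14), (6, 13), (16, 20)]
print(shadow(structure))
```

In this example the hairpin (16, 20) is removed as a non-crossing arc and both stacks collapse, leaving the two crossing arcs (0, 2) and (1, 3).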

Fig. 5.

The shadow of a diagram is obtained by removing all non-crossing arcs and isolated vertices and collapsing all resulting stacks into single arcs. While taking shadows is a significant reduction, the key topological invariants of genus and number of boundary components remain invariant.

A diagram is irreducible (or connected) (Kleitman, 1970) if for any two arcs there is a sequence of arcs so that consecutive arcs cross one another. A shadow is not necessarily irreducible but may be composed of multiple irreducible components or blocks, see (1) of Figure 6. Any shadow (and in general, any diagram) can be decomposed iteratively by removing irreducible components from bottom to top, i.e. so that there is no component ‘inside’ the one just removed. Note that the set of irreducible components of the shadow of a diagram S equals the set of shadows of the irreducible components of S. Furthermore, the genus of S is the sum of the genera of its irreducible components S^{(1)}, …, S^{(k)}, i.e.

$$g(S) = \sum_{i=1}^{k} g\bigl(S^{(i)}\bigr). \tag{2.4}$$

It seems natural, therefore, to determine the complexity of a structure by the maximal genus of the components of its shadows. More precisely, we say that S is a γ-structure if g ≤ γ holds for all irreducible components of its shadow. By definition, a γ-structure can thus be constructed from the set S_γ of shadows of genus at most γ by inserting certain non-crossing arcs (Fig. 6). The simplest class of structures are of course 0-structures, obtained by placing non-crossing arcs over the empty structure.

Fig. 6.

γ-structures: we display the shadow of a 1-structure (1) having topological genus 2 and the shadow of the HDV-structure (2) (Ferré-D'Amaré et al., 1998), a 2-structure having also genus two. Although both shadows have genus two, the HDV-structure cannot be generated iteratively via successive removals of S₁-elements and stacked arcs. The structure displayed on the left is derived via two S₁-substructures.
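The distinction between the topological genus of a structure and the smallest γ for which it is a γ-structure can be made concrete with a small Python sketch. It assumes that irreducible components are the connected components of the ‘crossing’ relation between arcs and reuses the cycle-counting form of Equation (2.3); all function names are hypothetical.

```python
def crosses(a, b):
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def genus(arcs):
    """Genus via 2 - 2g - r = 1 - n on the collapsed fatgraph."""
    ends = sorted(p for a in arcs for p in a)
    m = len(ends)
    if m == 0:
        return 0
    pos = {p: k for k, p in enumerate(ends)}
    partner = [0] * m
    for i, j in arcs:
        partner[pos[i]], partner[pos[j]] = pos[j], pos[i]
    seen, r = [False] * m, 0
    for start in range(m):
        if not seen[start]:
            r += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = (partner[x] + 1) % m
    return (len(arcs) + 1 - r) // 2


def irreducible_components(arcs):
    """Group arcs into blocks connected by chains of mutually crossing arcs."""
    arcs = list(arcs)
    unvisited = set(range(len(arcs)))
    blocks = []
    while unvisited:
        stack, block = [unvisited.pop()], []
        while stack:
            a = stack.pop()
            block.append(arcs[a])
            linked = {b for b in unvisited if crosses(arcs[a], arcs[b])}
            unvisited -= linked
            stack.extend(linked)
        blocks.append(sorted(block))
    return blocks


def gamma(arcs):
    """Smallest gamma such that the diagram is a gamma-structure."""
    return max((genus(b) for b in irreducible_components(arcs)), default=0)


two_h = [(1, 3), (2, 4), (5, 7), (6, 8)]        # two H-type crossings in a row
four_crossing = [(1, 5), (2, 6), (3, 7), (4, 8)]  # four mutually crossing arcs
print(genus(two_h), gamma(two_h))
print(genus(four_crossing), gamma(four_crossing))
```

Both example diagrams have genus two, but only the first is a 1-structure: it splits into two irreducible components of genus one, whereas the four mutually crossing arcs form a single irreducible component of genus two.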

Lemma 2.1.

An RNA structure is a 0-structure if and only if it is a simple secondary structure. In particular, a 0-structure always has genus g = 0.

Proof.

We first observe that a diagram of genus zero contains no crossing arcs. This follows from the fact that genus is a monotone non-decreasing function of the number of arcs [see Equation (2.3)] and that the matching (H) consisting of two mutually crossing arcs has only one boundary component and hence genus one (Fig. 2). Secondly, we observe by induction on the number of arcs that each new non-crossing arc contributes a new boundary component, and 2 − 2g − (r + 1) = 1 − (n + 1) shows that the genus remains zero. Structures consisting only of non-crossing arcs therefore have genus zero.

Next, we consider structures of arbitrary genus. For their analysis, diagrams without isolated points, i.e. matchings, play a central role. Let 𝒞_g(n) be the set of matchings of genus g with n arcs, and let c_g(n)≔|𝒞_g(n)| denote its cardinality. As shown by Andersen,J.E. et al. (submitted for publication), the generating function C_g(z)=∑_n≥0 c_g(n)zⁿ is, for g ≥ 1, given by

$$C_g(z) = P_g(z)\,\frac{\sqrt{1-4z}}{(1-4z)^{3g}}, \tag{2.5}$$

where P_g(z) is an integral polynomial of degree (3g − 1) such that P_g(1/4)≠0. The numbers of genus zero matchings are well known to be given by the Catalan numbers, and Equation (2.5) allows the derivation of explicit formulas for higher genera, for instance,

$$c_1(n) = \frac{2^{n-2}\,(2n-1)!!}{3\,(n-2)!}, \qquad c_2(n) = \frac{2^{n-4}\,(5n-2)\,(2n-1)!!}{90\,(n-4)!}.$$

Furthermore, the number c_g(2g) of matchings of genus g having exactly 2g arcs, i.e. matchings having exactly one boundary component, is the coefficient of z^2g in P_g(z) and is given by

$$c_g(2g) = \frac{(4g)!}{4^{g}\,(2g+1)!}. \tag{2.6}$$

Explicitly, we have c₁(2) = 1, c₂(4) = 21 and c₃(6) = 1485 for example. These particular matchings will serve as ‘seeds’ for our folding algorithm. More precisely, we shall use the following:
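The quoted seed counts are easy to check numerically. The following Python sketch evaluates the closed form c_g(2g) = (4g)!/(4^g (2g+1)!), which reproduces the three values stated above, and compares it with a brute-force classification of all matchings on 2n points by genus. The brute-force part is only meant for very small n, and the function names are made up for this sketch.

```python
from collections import Counter
from math import factorial


def seeds(g):
    """c_g(2g): matchings of genus g with one boundary component."""
    return factorial(4 * g) // (4 ** g * factorial(2 * g + 1))


def matchings(points):
    """Yield all perfect matchings of a sorted list of points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for k, other in enumerate(rest):
        for m in matchings(rest[:k] + rest[k + 1:]):
            yield [(first, other)] + m


def genus(arcs, m):
    """Genus of a matching on points 0..m-1 via 2 - 2g - r = 1 - n."""
    partner = [0] * m
    for i, j in arcs:
        partner[i], partner[j] = j, i
    seen, r = [False] * m, 0
    for start in range(m):
        if not seen[start]:
            r += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = (partner[x] + 1) % m
    return (len(arcs) + 1 - r) // 2


def count_by_genus(n):
    """Number of matchings with n arcs for each genus (brute force)."""
    return Counter(genus(m, 2 * n) for m in matchings(list(range(2 * n))))


print([seeds(g) for g in (1, 2, 3)])
print(dict(count_by_genus(4)))
```

With n = 4 arcs the enumeration covers all 105 matchings; the genus-zero count is the Catalan number 14 and the genus-two count is the seed number c₂(4).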

Theorem 2.2.

For arbitrary genus g, the set S_g of shadows is finite. Every shadow in S_g contains at least 2g and at most (6g − 2) arcs.

The special case g = 1, on which we focus in the algorithmic part of this contribution, is explicated in the Supplementary Material.

Proof.

First note that if there is more than one boundary component, then there must be an arc with different boundary components on its two sides, and removing this arc decreases r by exactly one while preserving g since the number of arcs is given by n = 2g + r − 1. Furthermore, if there are ν_ℓ boundary components of length ℓ in the polygonal model, then 2n = ∑_ℓ ℓν_ℓ since each side of each arc is traversed once by the boundary. For a shadow, ν₁ = 0 by definition, and ν₂ ≤ 1 as one sees directly. It therefore follows that 2n = ∑_ℓ ℓν_ℓ ≥ 3(r − 1) + 2, so 2n = 4g + 2r − 2 ≥ 3r − 1, i.e. 4g − 1 ≥ r. Thus, we have n = 2g + r − 1 ≤ 2g + (4g − 1) − 1 = 6g − 2, i.e. any shadow can contain at most 6g − 2 arcs. The lower bound 2g follows directly from n = 2g + r − 1 by observing r ≥ 1.

Many S_g-shadows are in fact γ-structures for some γ < g, that is, they can be constructed from elements of S_γ. One key result of this contribution is the following characterization of 1-structures:

Theorem 2.3.

An RNA structure is a 1-structure if and only if its shadow can be decomposed by iteratively removing one of the four shadows (H), (K), (L) or (M).

In particular, 1-structures can have arbitrarily large topological genus.

Proof.

We only give a sketch here and refer to the Supplementary Material for a full proof. First, we observe that taking the shadow preserves genus. Since (H) is the unique matching with two arcs of genus g = 1, it is contained in every matching of genus g = 1. An arc crossing into (H) preserves the genus and leads to either (K) or (L). While every arc added to (K) increases the genus, there is one possibility to preserve the genus when adding an arc to (L), namely, the addition leading to (M). It remains to observe that no further arc can be added to (M).
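For genus one the statement can also be checked by exhaustive search, since Theorem 2.2 bounds the number of arcs of a shadow by 2 ≤ n ≤ 4. The Python sketch below is such a brute-force check, not the proof from the Supplementary Material: it enumerates all matchings in that range and keeps those of genus one in which every arc is crossed and no two arcs are stacked. The function names are hypothetical, and the search is only practical for g = 1.

```python
def crosses(a, b):
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def matchings(points):
    """Yield all perfect matchings of a sorted list of points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for k, other in enumerate(rest):
        for m in matchings(rest[:k] + rest[k + 1:]):
            yield [(first, other)] + m


def genus(arcs, m):
    """Genus of a matching on points 0..m-1 via 2 - 2g - r = 1 - n."""
    partner = [0] * m
    for i, j in arcs:
        partner[i], partner[j] = j, i
    seen, r = [False] * m, 0
    for start in range(m):
        if not seen[start]:
            r += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = (partner[x] + 1) % m
    return (len(arcs) + 1 - r) // 2


def is_shadow(arcs):
    """Every arc is crossed and no two arcs form a stack."""
    present = set(arcs)
    all_crossed = all(any(crosses(a, b) for b in arcs) for a in arcs)
    no_stack = all((i + 1, j - 1) not in present for i, j in arcs)
    return all_crossed and no_stack


def shadows_of_genus_one():
    found = []
    for n in range(2, 5):              # 2g <= n <= 6g - 2 with g = 1
        for m in matchings(list(range(2 * n))):
            if genus(m, 2 * n) == 1 and is_shadow(m):
                found.append(sorted(m))
    return found


for s in shadows_of_genus_one():
    print(s)
```

The search returns exactly four matchings. Read as words over arc labels they are ABAB, ABACBC, ABCABC and ABCADBCD, which match the endpoint orders given for (H), (K), (L) and (M) in the proof of Theorem 2.4.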

Before proceeding to algorithmic considerations, we briefly compare the class of γ-structures with other classes of pseudoknots. Condon et al. (2004) investigated the structure classes L&P (Lyngsø and Pedersen, 2000), D&P (Dirks and Pierce, 2003), A&U (Akutsu, 2000) and R&E-class (Rivas and Eddy, 1999). The L&P- and D&P-class are based on the H-type shadow depicted in Theorem 2.3 and hence are proper subsets of the 1-structures. The A&U-class does not cover shadow M but on the other hand contains some configurations that are not 1-structures, and even the 2-structures do not completely contain the A&U-class. Nevertheless, the A&U-class is small: there are more 1-structures than A&U-structures for any given sequence length (Nebel,M.E. and Weinberg,F., submitted for publication).

The R&E class does not impose a limit on the genus of the shadow and hence contains γ-structures with arbitrarily large γ. Conversely, Figure 3 shows a 2-structure that is not contained in the R&E class. This example is minimal, i.e. all 1-structures are contained in R&E. Similarly, the set of k-non-crossing structures (Huang et al., 2009; Jin et al., 2008) has infinitely many shadows for any fixed k ≥ 3 (Reidys and Wang, 2010), and hence, like R&E, contains γ-structures with arbitrarily large γ. We note that every 1-structure is 4-non-crossing; more precisely, shadows (H) and (K) are 3-non-crossing, while shadows (L) and (M) each contain three mutually crossing arcs (Fig. 7).

Fig. 7.

Venn diagram of important classes of structures with pseudoknots. The mutual relationships of pseudoknot-free secondary structure (SS), the two H-shadow classes D&P and L&P, and the classes A&U and R&E, respectively, were already described by Condon et al. (2004). 1-structures and 4-non-crossing structures are added here.

2.3 Minimum free energy folding of γ-structures

0-structures: We have shown in the previous section that 0-structures are simple RNA secondary structures. Their minimum free energy (MFE) configuration can be obtained by DP recursions (Waterman, 1978; Zuker and Stiegler, 1981) derived from a decomposition into suitable substructures. This decomposition can be expressed in terms of a context-free grammar (Dowell and Eddy, 2004; Steffen and Giegerich, 2005). In the simplest case, which corresponds to evaluating base pairs only, we consider a single non-terminal symbol S representing an arbitrary diagram over a segment and three terminal symbols to represent isolated vertices (symbol :), openings (symbol () and closings (symbol )) of base pairs. We only need the three production rules

$$S \to (S)\,S \;\mid\; {:}\,S \;\mid\; \varepsilon \tag{2.7}$$

to generate the corresponding language 𝒮.
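To show how such a grammar turns into a DP recursion, here is a minimal Python sketch for the base-pair-only case. It assumes the common unambiguous rule set S → :S | (S)S | ε, scores +1 per admissible base pair and, as a simplification, imposes no minimum hairpin loop size. The function name and the scoring are illustrative assumptions and are not the energy model used by gfold.

```python
from functools import lru_cache

# admissible base pairs: Watson-Crick and wobble
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def max_base_pairs(seq):
    """Maximum number of non-crossing base pairs on seq."""

    @lru_cache(maxsize=None)
    def best(i, j):
        """Optimal score on the half-open interval seq[i:j]."""
        if i >= j:
            return 0                       # S -> epsilon
        score = best(i + 1, j)             # S -> :S   (i is unpaired)
        for k in range(i + 1, j):          # S -> (S)S (i pairs with k)
            if seq[i] + seq[k] in PAIRS:
                score = max(score, 1 + best(i + 1, k) + best(k + 1, j))
        return score

    return best(0, len(seq))


print(max_base_pairs("GGGAAACCC"))
```

Each production rule corresponds to one case of the recursion, and the single split point k in the rule (S)S is what leads to the O(N³) time and O(N²) space of secondary structure folding.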

1-structures: We shall use that (i) any 1-structure can be inductively generated from genus one structures and (ii) that every genus one structure has shadow (H), (K), (L) or (M), to specify a multiple context-free grammar (MCFG) (Seki et al., 1991). In contrast to context-free grammars, the non-terminal symbols of MCFGs may consist of multiple components which must be expanded³ in parallel. In this way, it becomes possible to couple separated parts of a derivation and thus to generate crossings. In the case of 1-structures, the language 𝒮 is built upon sequences of intervals (fragment-pairs) [i, r], [s, j], where (i, j), (r, s) are nested arcs. Arcs having endpoints in the different fragments are assumed to be non-crossing; (Fig. 8). For the MCFG, the fragments of a pair are associated with two different (coupled) components of a two-dimensional non-terminal symbol.

Fig. 8.

Fragment-pairs in RNA structures: the rule I → IA₁IB₁IA₂IB₂S induces the fragment-pairs [i₁, r₁], [s₁, j₁] and [i₂, r₂], [s₂, j₂]. Arcs connecting the two fragments of a pair are noncrossing, while arcs with both endpoints within the same fragment may be crossing such as those within [s₂, j₂].

Accordingly, we (re)introduce the following symbols:

non-terminal S, representing secondary structure elements (i.e. diagrams without crossing arcs) according to the rules given above;
non-terminals I and T, representing an arbitrary 1-structure;
non-terminals with two components used to represent a fragment-pair with nested arcs, X ∈ {H, K, L, M}; and
terminals (_X, )_X denoting the opening and closing of a base pair, respectively, where X is one of the types H, K, L or M.

Different brackets as well as the different two-component non-terminals are used to distinguish nestings of the various kinds of shadows. Finally, we specify the production rules of our unambiguous MCFG ℛ₁:

where X ∈ {H, K, L, M} distinguishes the four types of pseudoknots.

Theorem 2.4.

Any RNA 1-structure can be uniquely decomposed via ℛ₁, and any diagram generated via ℛ₁ is a 1-structure (Fig. 9).

Fig. 9.

Illustration of the grammar ℛ₁.

Proof.

We proceed by induction on the number of shadows. Induction basis: in a 1-structure that contains no genus 1 shadow, there are no crossings and hence the structure can be decomposed uniquely via the context-free grammar of secondary structures. Induction step: suppose we are given a 1-structure containing r ≥ 1 shadows of genus one. We decompose from right to left. Everything is clear until we encounter a substructure containing a genus 1 shadow. For an arc α = (i, j), we distinguish two cases: (I) α is not crossed, or (II) α is crossed by another arc. In case of (I), there exists a 1-structure nested in α. In case of (II), we consider the partial order ≤, where (i, j) ≤ (r, s) if and only if r < i and j < s. Since crossing arcs in a 1-structure are contained in one of the four base types, we distinguish the following scenarios:

(H): then there exists a maximal base pair β=(r, s), where r < i < s < j,
(K): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < v < i < s < j,
(L): then there exist maximal base pairs β=(r, s) and θ=(u, v), where u < r < i < v < s < j,
(M): then there exist maximal base pairs β=(r, s), θ=(u, v) and δ=(p, q), where p < u < r < q < i < v < s < j.

Consider the set C(α) of arcs that are crossed by α and the minimal arc α_* that crosses any element of C(α). Here, minimality is considered with respect to the partial order ≤, where (i, j) ≤ (r, s) if and only if r < i and j < s. It follows that α=(i, j) and α_*=(i_*, j_*) induce the fragment pair [i, i_*] and [j_*, j]. We similarly obtain the corresponding arcs β_*, θ_* or δ_*, which induce at most four fragment pairs and correspond to a unique shadow of type (H), (K), (L) or (M) (Fig. 10). By construction, the number of genus 1 shadows of any substructure contained in such a fragment-pair is reduced by at least one, and the substructure can by the induction hypothesis be uniquely decomposed via ℛ₁. Finally, any structure generated via ℛ₁ is constructed from top to bottom by iteratively building configurations of arcs having shadow (H), (K), (L) or (M). Thus, any structure obtained via ℛ₁ is indeed a 1-structure, completing the proof of the theorem.

Fig. 10.

Fragmentation: the four cases corresponding to the four shadows (H), (K), (L) and (M). In (1), there are two maximal arcs: α=(i, j) and β=(r, s), where r < i < s < j, whence the diagram has shadow (H). Here, α_*=(i_*, j_*) is the minimal arc crossing C(α) and β_*=(r_*, s_*) is the minimal arc crossing C(β). We have B₁ =[i, i_*], B₂ =[j_*, j], A₁ =[r, r_*], A₂ =[s_*, s]. Cases (2), (3) and (4) are analyzed similarly.

2-structures: a folding algorithm for 2-structures requires an analogous enumeration of all (irreducible) shadows of genus 2. From Equation (2.6), it is straightforward to explicitly derive the 21 shadows of genus 2 with 4 arcs, see Supplementary Figure 10. As in the case of genus 1, arc insertions into these 21 configurations lead to the complete set of 3472 shadows of genus 2. This large number makes it infeasible to build a practically useful folding algorithm for all 2-structures. It may be useful, however, to deal with a (small) subset of shadows. The complexity of such an algorithm is determined by the complexity of decomposing the individual shadows by means of MCFG production rules reminiscent of those for ℛ₁. For instance, the shadow of the HDV-structure displayed in (2) of Figure 6 is contained in the R&E class and can therefore be computed in O(N⁶) time and O(N⁴) space. However, when resorting to our approach, the time complexity is at least O(N⁸): the shadow presented in Figure 11 requires a DP algorithm with O(N⁸) time complexity and O(N⁶) space complexity. Devising a sensible folding algorithm for 2-structures is ongoing work.

Fig. 11.

Folding of 2-structures: the shadow shown here is not contained in the R&E class of structures and cannot be generated by gap matrices. It can be decomposed, however, using the eight indexes i, j, k, l, m, n, p and q, thus implying an O(N⁸) time complexity. This makes use of a six-dimensional gap matrix G_j,k,l,m,n,p, which implies O(N⁶) space complexity.

MFE folding of 1-structures: if we make use of a naïve table-based parsing scheme, checking for each subword s of the input and for each rule f whether f can produce s, a rule like f = I → IA₁IB₁IC₁IA₂ID₁IB₂IC₂ID₂S introduces a complexity of O(N¹⁸): first, we must process the O(N²) different subwords s induced by an input of size N. Secondly, each non-terminal but the first on the right-hand side of the production introduces an additional split point, which specifies the part of s to be generated by the corresponding non-terminal. Since its location may freely be chosen within s, each split point gives rise to another loop variable, and hence contributes a factor O(N) to the runtime.
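The exponent in this estimate is simple bookkeeping, which the following Python sketch spells out. It assumes that the right-hand side of a rule is given as a list of non-terminal names (the spelling A1, B1, … is a plain-text stand-in for A₁, B₁, … chosen here); the function name is hypothetical.

```python
def naive_parse_exponent(rhs):
    """Exponent k of the O(N^k) cost of the naive table-based scheme.

    Two loop variables fix the subword s; every right-hand-side
    non-terminal after the first adds one split point, i.e. one more loop.
    """
    return 2 + (len(rhs) - 1)


four_block_rule = "I A1 I B1 I C1 I A2 I D1 I B2 I C2 I D2 S".split()
h_type_rule = "I A1 I B1 I A2 I B2 S".split()

print(naive_parse_exponent(four_block_rule))
print(naive_parse_exponent(h_type_rule))
```

The first rule has 17 symbols on its right-hand side and therefore 16 split points, which together with the two subword boundaries gives the exponent 18 quoted in the text.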

Even if there are much more sophisticated parsing algorithms, it is useful to consider this simple scheme since it directly translates into a recursion for a DP algorithm typically used to compute structures of minimum free energy. Furthermore, it is possible to introduce intermediate steps in the derivation of our language by making use of additional non-terminals and production rules such that the time complexity can be reduced to O(N⁶). For that purpose, let the non-terminal I^′ represent 1-structures in which no structures with shadow (H), (K), (L) or (M) are nested and the last vertex is paired. We introduce the non-terminal symbols (U₁, U₂), (V₁, V₂) and (W₁, W₂), assumed to represent intermediate fragment pairs, and the production rules

where (U^′₁, U^′₂) is a marked copy of (U₁, U₂) used to identify the components which must later be expanded in a coupled way. Accordingly, we replace the derivations of T in ℛ₁ as follows:

Note that syntactically, i.e. considered as dot-bracket representations, the 1-structures can be generated by an MCFG, parsable in time O(N⁵). However, in that case, corresponding brackets are not generated in a coupled way making the grammar inappropriate for algorithmic purposes.

As typical for DP and in analogy to our parsing scheme, we use two-dimensional matrices to store the optimal structure over a fragment. The matrix is indexed by the sequence coordinates of the endpoints. It can be a simple secondary structure 𝒮 or a substructure of higher genus. For the fragment-pairs, i.e. for the non-terminals of dimension two, four-dimensional matrices indexed by the endpoints of both linked fragments are required to store the optimal structure over them. Suppose the pair of fragments is [i, r] and [s, j], and let Gu(i, j; r, s) be the fragment-pair (associated with) [U₁, U₂], Gv(i, j; r, s) be the fragment-pair [V₁, V₂], Gw(i, j; r, s) be the fragment-pair [W₁, W₂] and G(i, j; r, s) be the fragment-pair [X₁, X₂]. The recursions for these matrices, summarized in graphical form in Figure 12, are determined directly by the grammar.

Fig. 12.

The decomposition for 4-dimensional matrices G, Gu, Gv, and Gw.

We can conclude from the rewriting rules that the computation of the two-dimensional matrices requires at most three loop variables, and there are O(N²) many of them. Accordingly, O(N⁵) operations are required to fill the associated two-dimensional matrices. For the four-dimensional matrices, two loop variables are needed for each of the corresponding rewriting rules (those with a left-hand side of dimension two), since in each case two split points are introduced by the right-hand sides of the corresponding productions. Since we need to compute O(N⁴) matrix entries, the total run time is in O(N⁶). Obviously, O(N⁴) space is required to store these tables. Accordingly, the algorithm can generate all 1-structures in O(N⁶) time and O(N⁴) space, i.e. with the same complexity as pknotsRE (Rivas and Eddy, 1999) (for the larger R&E class). The advantage of 1-structures is that structurally different shadows can be parametrized in different ways, and that the search space is restricted to moderately complex shadows. In contrast, the language of R&E-structures is based on crossings and can neither identify blocks of arcs nor restrict the genus of the shadows. For the more restricted class of H-type structures, NUPACK (Dirks and Pierce, 2003) requires O(N⁵) time and O(N⁴) space.

This is, of course, substantially more demanding than the O(N⁴) time and O(N²) memory complexity of pknotsRG (Reeder and Giegerich, 2004), which, however, deals with a very restricted subset of H-shadow structures, demanding that helices are maximally extended and perfect in the sense that they are not interrupted by bulge- or interior-loops. pknotsRG is thus not guaranteed to find the minimum energy structure within the class of H-shadow structures. A related fast heuristic treats the (K)-shadow as a superposition of two H-shadows (Theis et al., 2010).
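To make the O(N²) versus O(N⁴) table sizes concrete, the following minimal sketch counts the index tuples for fragments [i, j] and for fragment-pairs [i, r], [s, j]. The function names, the assumption that only valid index tuples are stored, the four tables (G, Gu, Gv, Gw) and the 8 bytes per entry are illustrative assumptions, not a description of how gfold allocates memory.

```python
from itertools import product
from math import comb


def two_d_entries(n: int) -> int:
    """Number of fragments [i, j] with 1 <= i <= j <= n."""
    return comb(n + 1, 2)


def four_d_entries(n: int) -> int:
    """Number of fragment-pairs [i, r], [s, j] with 1 <= i <= r < s <= j <= n."""
    return comb(n + 2, 4)


def four_d_entries_brute_force(n: int) -> int:
    """Same count by explicit enumeration; only sensible for small n."""
    idx = range(1, n + 1)
    return sum(1 for i, r, s, j in product(idx, repeat=4) if i <= r < s <= j)


def estimated_table_bytes(n: int, tables: int = 4, bytes_per_entry: int = 8) -> int:
    """Rough size of the four-dimensional tables (all parameters are assumptions)."""
    return tables * four_d_entries(n) * bytes_per_entry


if __name__ == "__main__":
    for n in (50, 100, 150):
        gib = estimated_table_bytes(n) / 2**30
        print(n, two_d_entries(n), four_d_entries(n), f"{gib:.3f} GiB")
```

The closed forms grow as N²/2 and N⁴/24, so the four-dimensional tables dominate memory long before the two-dimensional ones matter.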

2.4 Partition function and sampling

We have shown that the MCFG ℛ₁ uniquely generates all 1-structures, i.e. it is unambiguous. Consequently, ℛ₁ can be employed to count 1-structures over a given sequence x and to compute the corresponding partition function

$$Q = \sum_{s \in \mathcal{S}_1(x)} e^{-G(s)/RT},$$

where R is the universal gas constant, T is the temperature, G(s) is the energy of structure s over sequence x and 𝒮₁(x) is the set of 1-structures in which all base pairs (i, j) satisfy the base pairing rules for RNA, i.e. x_i x_j ∈ {AU, UA, GC, CG, GU, UG}. Let N_i,j denote the substructure represented by the non-terminal symbol N in ℛ₁ over the fragment [i, j], and let X_{i,j;r,s} denote the fragment-pair [X₁, X₂], where X₁ = [i, r] and X₂ = [s, j] in the recursions for energy minimization. For each of these symbols, we introduce corresponding partial partition functions Q_{N_i,j} and Q_{X_{i,j;r,s}}. Since the MCFG is unambiguous, the recursions for the partial partition functions are derived by replacing minima by sums and addition of energy contributions by multiplication of partial partition functions, see e.g. Voß et al. (2006). For instance, the recursion for the partition functions corresponding to the non-terminal symbol T reads

where E[h, ℓ] denotes the energy of the loop closed by the base pair (h, ℓ).
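The replacement of minima by sums and of addition by multiplication can be illustrated on a much smaller grammar than ℛ₁. The sketch below assumes the standard unambiguous grammar S → .S | (S)S | ε for pseudoknot-free structures, a hypothetical energy of −1 per base pair and RT = 1; the function names are illustrative and none of this reproduces the gfold recursions. The same recursion is evaluated three times, once for the minimum energy, once for the partition function and once for the number of structures.

```python
import math
import operator
from functools import lru_cache

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def evaluate(seq, plus, times, one, pair_weight):
    """Evaluate the toy grammar S -> .S | (S)S | empty over seq in a given algebra."""

    @lru_cache(maxsize=None)
    def value(i, j):
        if i > j:                      # empty fragment
            return one
        total = value(i + 1, j)        # S -> .S : position i is unpaired
        for k in range(i + 1, j + 1):  # S -> (S)S : position i pairs with k
            if seq[i] + seq[k] in PAIRS:
                inner = value(i + 1, k - 1)
                outer = value(k + 1, j)
                total = plus(total, times(times(pair_weight, inner), outer))
        return total

    return value(0, len(seq) - 1)


def mfe(seq, pair_energy=-1.0):
    """Minimum energy: 'plus' is min, 'times' is addition of energies."""
    return evaluate(seq, min, operator.add, 0.0, pair_energy)


def partition_function(seq, pair_energy=-1.0, rt=1.0):
    """Partition function: 'plus' is a sum, 'times' multiplies Boltzmann factors."""
    return evaluate(seq, operator.add, operator.mul, 1.0, math.exp(-pair_energy / rt))


def count_structures(seq):
    """Counting: every structure gets weight 1."""
    return evaluate(seq, operator.add, operator.mul, 1, 1)
```

Because the toy grammar is unambiguous, each structure is generated exactly once, which is why the sum-product evaluation counts structures correctly; with an ambiguous grammar the minimum would still be correct but the sum would overcount.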

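The probabilities ℙ_{N_i,j} of partial structures of type N over the fragment [i, j] and the probabilities ℙ_{X_{i,j;r,s}} of partial structures of type [X₁, X₂] over the fragment pair [i, r], [s, j] are readily calculated from the partial partition functions. These ‘backward recursions’ are analogous to those derived by McCaskill (1990) for crossing-free structures: let Λ_{N_i,j} be the set of 1-structures containing N_i,j and let Λ_{X_{i,j;r,s}} be the set of 1-structures containing the fragment-pair X_{i,j;r,s}. It follows that we have

$$\mathbb{P}_{N_{i,j}} = \sum_{s \in \Lambda_{N_{i,j}}} \frac{e^{-G(s)/RT}}{Q}, \qquad \mathbb{P}_{X_{i,j;r,s}} = \sum_{s \in \Lambda_{X_{i,j;r,s}}} \frac{e^{-G(s)/RT}}{Q}.$$

Suppose N_i,j or X_{i,j;r,s} is obtained by decomposing θ_s. The conditional probabilities ℙ_{N_i,j|θ_s} and ℙ_{X_{i,j;r,s}|θ_s} are then given by Q_{θ_s}(N_i,j)/Q_{θ_s} and Q_{θ_s}(X_{i,j;r,s})/Q_{θ_s}, respectively. Here Q_{θ_s} represents the partition function of θ_s, and Q_{θ_s}(N_i,j) and Q_{θ_s}(X_{i,j;r,s}) represent the partition functions for those θ_s-configurations that contain N_i,j and X_{i,j;r,s}, respectively. Taking the sum over all possible θ_s, we obtain

$$\mathbb{P}_{N_{i,j}} = \sum_{\theta_s} \mathbb{P}_{\theta_s}\, \frac{Q_{\theta_s}(N_{i,j})}{Q_{\theta_s}}, \qquad \mathbb{P}_{X_{i,j;r,s}} = \sum_{\theta_s} \mathbb{P}_{\theta_s}\, \frac{Q_{\theta_s}(X_{i,j;r,s})}{Q_{\theta_s}}.$$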

From this backward recursion, one immediately derives a stochastic backtracing recursion from the probabilities of partial structures that generates a Boltzmann sample of 1-structures, see Ding and Lawrence (2003); Huang et al. (2010); Tacker et al. (1996) for analogous constructions.

The basic data structure for this sampling is a stack A which stores blocks of the form (i, j, N) [or (i, j; r, s, X)], representing substructures of non-terminal symbols N over [i, j] (or of fragment-pairs X over [X₁, X₂], where X₁ = [i, r] and X₂ = [s, j]). L is a set of base pairs storing those removed by the decomposition steps in the grammar. We initialize with the block (1, n, I) in A, and L = ∅. In each step, we take one element from A and decompose it via the grammar with probability Q^M/Q^N, where Q^N is the partition function of the block taken from A, and Q^M is the partition function of the target block produced by the rewriting rule. The base pairs removed in the decomposition step are moved to L. For instance, according to the rewriting rule T → I(T)S, the block (i, j, T) is decomposed into the three blocks (i, h − 1, I), (h + 1, ℓ − 1, T), (ℓ + 1, j, S) and one base pair (h, ℓ), which is removed. For fixed indices h, ℓ, where i ≤ h < ℓ ≤ j, the probability of decomposing (i, j, T) reads

$$\mathbb{P} = \frac{Q_{I_{i,h-1}}\; e^{-E[h,\ell]/RT}\; Q_{T_{h+1,\ell-1}}\; Q_{S_{\ell+1,j}}}{Q_{T_{i,j}}}.$$

The sampling step is iterated until A is empty. The resulting 1-structure is then given by the list L of base pairs.
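The stack-based procedure can be sketched on a toy model. The code below is not the gfold sampler: it assumes the pseudoknot-free grammar S → .S | (S)S | ε, a hypothetical energy of −1 per base pair and RT = 1, and the names `partition_table` and `sample_structure` are illustrative. It keeps the two data structures described above, a stack of blocks still to be decomposed and a list of base pairs removed so far, and chooses each decomposition with probability proportional to its share of the block's partition function.

```python
import math
import random

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def partition_table(seq, rt=1.0, pair_energy=-1.0):
    """Return q(i, j), the partition function of the fragment [i, j] (0-based)."""
    n = len(seq)
    boltz = math.exp(-pair_energy / rt)
    table = {}

    def q(i, j):
        return 1.0 if i > j else table[(i, j)]

    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            total = q(i + 1, j)                      # S -> .S
            for k in range(i + 1, j + 1):            # S -> (S)S
                if seq[i] + seq[k] in PAIRS:
                    total += boltz * q(i + 1, k - 1) * q(k + 1, j)
            table[(i, j)] = total
    return q


def sample_structure(seq, rng, rt=1.0, pair_energy=-1.0):
    """Draw one structure from the Boltzmann ensemble of the toy model."""
    q = partition_table(seq, rt, pair_energy)
    boltz = math.exp(-pair_energy / rt)
    stack = [(0, len(seq) - 1)]   # plays the role of the stack A
    pairs = []                    # plays the role of the list L
    while stack:
        i, j = stack.pop()
        if i > j:
            continue
        threshold = rng.random() * q(i, j)
        acc = q(i + 1, j)
        if threshold < acc:       # rule S -> .S chosen
            stack.append((i + 1, j))
            continue
        chosen = None
        for k in range(i + 1, j + 1):
            if seq[i] + seq[k] in PAIRS:
                acc += boltz * q(i + 1, k - 1) * q(k + 1, j)
                chosen = k
                if threshold < acc:
                    break
        pairs.append((i, chosen))          # base pair removed by the rule
        stack.append((i + 1, chosen - 1))  # inner block
        stack.append((chosen + 1, j))      # trailing block
    return sorted(pairs)
```

Calling `sample_structure("GCGC", random.Random(0))` repeatedly yields structures with frequencies approaching their Boltzmann weights; the order in which blocks leave the stack does not affect the distribution, since blocks are independent once created.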

2.5 Software

Implementation: MFE folding, partition function including a computation of base pairing probabilities and stochastic backtracing are implemented in gfold. The program is written in C.

Energy model: although the presentation above uses a simplified grammar that does not explicitly distinguish the usual loop types, gfold implements the Mathews–Turner energy model without dangles (Mathews et al., 1999, 2004) for secondary structure elements. For pseudoknots, we use here an extended version of the Dirks–Pierce (DP) model (Dirks and Pierce, 2003) that allows different penalties β_X for the four topologically distinct pseudoknot types X = H, K, L, M. We have observed that the values of β_X have a substantial influence on the accuracy of the predicted structures. In both NUPACK and pknotsRE, a common pseudoknot penalty β₁ is assigned whenever two gap matrices cross. Since the number of such crossings depends on the type of the pseudoknot, this algorithmic design would imply β_A = β₁, β_B = β_C = 2β₁ and β_D = 3β₁. In gfold, these parameters are independent and can be adjusted to improve the performance. Since most experimentally known pseudoknots are of types (H) and (K), we focused in particular on the ratio of β_A and β_B and found that both sensitivity (the ratio of correctly predicted base pairs to the total number of base pairs in the reference structure) and positive predictive value reach a maximum for β_B = 1.3β_A. The pseudoknot penalty of type (H) coincides with that of the DP model, i.e. β_A = β₁ = 9.6 (kcal/mol). The other penalties are set to β_B = 12.6, β_C = 14.6 and β_D = 17.6; see Supplementary Material for details. An alternative set of pseudoknot parameters described by Andronescu et al. (2010) can easily be incorporated but would require a readjustment of these four topological penalties.
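The difference between a common per-crossing penalty and the independent penalties can be worked out directly from the numbers quoted above. In this small sketch the labels A to D follow the subscripts used in the text (A and B correspond to types H and K); the dictionary and function names are illustrative assumptions and not part of gfold.

```python
BETA_1 = 9.6  # kcal/mol, common Dirks-Pierce pseudoknot penalty quoted in the text

# Number of gap-matrix crossings per pseudoknot type, as implied by
# beta_A = beta_1, beta_B = beta_C = 2*beta_1, beta_D = 3*beta_1.
CROSSINGS = {"A": 1, "B": 2, "C": 2, "D": 3}

# Independent penalties quoted for gfold (kcal/mol).
GFOLD_PENALTIES = {"A": 9.6, "B": 12.6, "C": 14.6, "D": 17.6}


def common_penalty(knot_type: str) -> float:
    """Penalty implied by charging beta_1 once per crossing of gap matrices."""
    return CROSSINGS[knot_type] * BETA_1


def penalty_table():
    """Rows of (type, per-crossing penalty, gfold penalty, difference)."""
    return [
        (t, common_penalty(t), GFOLD_PENALTIES[t], common_penalty(t) - GFOLD_PENALTIES[t])
        for t in sorted(GFOLD_PENALTIES)
    ]


if __name__ == "__main__":
    for knot_type, common, gfold, diff in penalty_table():
        print(f"{knot_type}: per-crossing {common:5.1f}  gfold {gfold:5.1f}  difference {diff:5.1f}")
    print("beta_B / beta_A =", GFOLD_PENALTIES["B"] / GFOLD_PENALTIES["A"])
```

Under these numbers every type other than A is penalized less in gfold than a per-crossing charge would imply, and the quoted values give a ratio β_B/β_A close to the reported optimum of 1.3.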

Performance: the current implementation of gfold is applicable to sequences with a length up to N ∼ 150 nt on current PC hardware. Figure 13 summarizes the resource requirements.

Fig. 13.

Run time (A) and peak memory (B) of gfold. Timing information is given for MFE-only (triangles) and partition function with sampling 10 000 structures from the Boltzmann ensemble. To compute error bars, we folded between 10 (N > 100) and 100 (N < 70) randomly generated sequences on a Xeon E5410, 2.33 GHz, 48 GB memory. Memory allocation is independent of the sequence. For N ≥ 100, double precision floats are necessary to avoid overflows. This leads to the jump in memory consumption by a factor of 2. Dotted lines indicate the theoretical behavior of O(N⁶) (time) and O(N⁴) (space). The slope for CPU time is slightly steeper than the theory since constraints among the six indices introduced by the minimum size of the complex pseudoknot elements lead to an additional speedup for small N.


We have observed that gfold provides a substantial increase in both sensitivity and positive predictive value (PPV, the ratio of correctly predicted base pairs to the total number of base pairs in the predicted structure) compared with the alternative DP approaches pknotsRE (Rivas and Eddy, 1999), NUPACK (Dirks and Pierce, 2003) and pknotsRG-mfe (Reeder and Giegerich, 2004), cf. Figure 14. In an evaluation on the entire Pseudobase (van Batenburg et al., 2001), gfold achieves a sensitivity of 0.762 and a PPV of 0.761. However, as detailed in Supplementary Table S3, the performance varies substantially between different classes of sequences. Interestingly, the more complex pseudoknots of type (K) are predicted with even higher accuracy (sensitivity 0.889, PPV 0.899) than the simpler, much more frequent type (H).

Fig. 14.

Performance of gfold. Comparison of the average sensitivity (A) and PPV (B) of different prediction algorithms on a sample of 32 structures from Pseudobase. All details of this sample are given in the Supplementary Table S2. (C) The PPV increases significantly if only base pairs with larger pairing probabilities as predicted by the partition function version of gfold are included in the predicted structure.


The PPV of gfold predictions can be increased by filtering the base pairs of the MFE structure by their probability P of formation, which is computed by the partition function version of gfold. Accepting only base pairs with a predicted base pairing probability P > 0.95 increases the PPV from 0.76 to more than 0.9 (Fig. 14C). In order to evaluate the false positive rate, we folded 100 tRNA sequences from Sprinzl's tRNA database (Jühling et al., 2009). gfold correctly identifies 94% of them as pseudoknot free. In comparison, NUPACK correctly identifies 86% and pknotsRG-mfe 89% of this sample set.
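The two accuracy measures and the probability filter are straightforward to compute once structures are represented as sets of base pairs. The following sketch uses hypothetical function names and entirely made-up toy data; it only illustrates the definitions of sensitivity and PPV given above and the effect of a threshold such as P > 0.95.

```python
def sensitivity_ppv(predicted, reference):
    """Return (sensitivity, PPV) for two collections of base pairs (i, j)."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    sensitivity = correct / len(reference) if reference else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    return sensitivity, ppv


def filter_by_probability(pairs, pair_probability, threshold=0.95):
    """Keep only base pairs whose predicted pairing probability exceeds the threshold."""
    return {p for p in pairs if pair_probability.get(p, 0.0) > threshold}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    reference = {(1, 10), (2, 9), (3, 8), (15, 25)}
    predicted = {(1, 10), (2, 9), (4, 7), (15, 25)}
    probability = {(1, 10): 0.99, (2, 9): 0.97, (4, 7): 0.60, (15, 25): 0.90}

    print(sensitivity_ppv(predicted, reference))
    confident = filter_by_probability(predicted, probability)
    print(sensitivity_ppv(confident, reference))
```

Filtering can only remove predicted pairs, so it trades sensitivity for PPV: in the toy data the wrong pair is dropped, but so is one correct pair whose probability falls below the threshold.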

3 DISCUSSION

Combinatorial models of pseudoknotted RNA structures are limited in two ways: on the one hand, exact algorithmic folding can be constructed only for certain types of structures; on the other hand, the larger the structure sets are, the more base pairing patterns are contained in them that cannot be realized in nature due to steric constraints. Algorithm design so far has been mostly driven by the desire to reduce computational complexity. The idea behind gfold, in contrast, is to define a more suitable class of structures that can be generated by nesting and concatenating a small number of elementary building blocks. This recursive structure is captured by a fairly simple unambiguous multiple context-free grammar that translates in a canonical way to DP algorithms for computing the minimum energy structure and the partition function in O(N⁶) time and O(N⁴) space. In addition to MFE folding, we have implemented the computation of base pairing probabilities and a stochastic backtracing recursion, thus providing the major functionalities of RNA secondary structure prediction software for a very natural class of pseudoknotted structures.

The 1-structures considered here strike a balance between the generality necessary to cover almost all known pseudoknotted structures and the restriction to topologically elementary structures that have a good chance to actually correspond to a feasible spatial structure. From a mathematical point of view, the characterization of structures in terms of irreducible components with given topological genus appears particularly natural and promises to reflect closely the ease with which a structure can be embedded in three dimensions. In addition, the grammar underlying gfold naturally distinguishes different types of pseudoknots and admits different energy parameters for them. We observe that this additional freedom of the parametrization leads to a substantial increase in sensitivity (0.63 → 0.889) and PPV (0.73 → 0.899) for type (K) pseudoknots compared with the usage of a common penalty for each crossing of gap matrices. In terms of prediction accuracy, gfold thus compares favorably also with the leading alternative DP approaches to pseudoknotted structures.

Funding: This work was supported by the 973 Project of the Ministry of Science and Technology; the PCSIRT Project of the Ministry of Education; National Science Foundation of China to C.M.R. and his lab; Deutsche Forschungsgemeinschaft (projects STA 850/2-1 & STA 850/7-1); the European Union FP-7 project QUANTOMICS (no. 222664) to P.F.S. and his lab. J.E.A. and R.C.P. are supported by QGM, the Centre for Quantum Geometry of Moduli Spaces, funded by the Danish National Research Foundation.

Conflict of Interest: none declared.

¹Ribbons may also be allowed to twist giving rise to possibly non-orientable surfaces (Massey, 1967).

²In order to relate this to the standard 2D models of surfaces derived from triangulations: from the collapsed fatgraph we can derive the polygonal model of the surface X_𝔻, i.e. a 2n-gon in which edges are identified in pairs (Fig. 4).

³This coupling is only required for components that were generated by the same production step. Components, even if of the same kind, derived in different steps are independent of each other.

REFERENCES

Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr. Appl. Math., 2000, vol. 104.
