-
PDF
- Split View
-
Views
-
Cite
Cite
Christopher Barrett, Fenix W Huang, Christian M Reidys, Sequence–structure relations of biopolymers, Bioinformatics, Volume 33, Issue 3, February 2017, Pages 382–389, https://doi.org/10.1093/bioinformatics/btw621
- Share Icon Share
Abstract
DNA data is transcribed into single-stranded RNA, which folds into specific molecular structures. In this paper we pose the question to what extent sequence- and structure-information correlate. We view this correlation as structural semantics of sequence data that allows for a different interpretation than conventional sequence alignment. Structural semantics could enable us to identify more general embedded ‘patterns’ in DNA and RNA sequences.
We compute the partition function of sequences with respect to a fixed structure and connect this computation to the mutual information of a sequence–structure pair for RNA secondary structures. We present a Boltzmann sampler and obtain the a priori probability of specific sequence patterns. We present a detailed analysis for the three PDB-structures, 2JXV (hairpin), 2N3R (3-branch multi-loop) and 1EHZ (tRNA). We localize specific sequence patterns, contrast the energy spectrum of the Boltzmann sampled sequences versus those sequences that refold into the same structure and derive a criterion to identify native structures. We illustrate that there are multiple sequences in the partition function of a fixed structure, each having nearly the same mutual information, that are nevertheless poorly aligned. This indicates the possibility of the existence of relevant patterns embedded in the sequences that are not discoverable using alignments.
The source code is freely available at http://staff.vbi.vt.edu/fenixh/Sampler.zip
Supplementary data are available at Bioinformatics online.
1 Introduction
2015 is the 25th year of the human genome project. A recent signature publication (The 1000 Genomes Project Consortium, 2015) is a comprehensive sequence alignment-based analysis of whole genome nucleotide sequence variation across global human populations. Notwithstanding the importance of this achievement, there is the possibility of information encoded as patterns in the genome that current methods cannot discover.
In this paper we study the information transfer from RNA sequences to RNA structures. This question is central to the processing of DNA data, specifically the role of DNA nucleotide sequences being transcribed into RNA, stabilized by molecular folding. In a plethora of interactions it is this specific configuration and not the particular sequence of nucleotides (aside from, say small docking areas, where specific bindings occur) that determines biological functionality. We find that here are multiple sequences in the partition function of a fixed structure, each having nearly the same mutual information with respect to the latter, that are nevertheless poorly aligned. This indicates the possibility of the existence of relevant patterns embedded in the sequences that are not discoverable using alignments.
RNA, unlike DNA, is almost always single-stranded and all RNA is folded. (There are double-stranded RNA viruses.) Here we only consider single-stranded RNA. An RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups. Attached to each sugar is one of four bases – adenine (A), uracil (U), cytosine (C), or guanine (G). There are various types of RNA: messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA) and many others. Recent transcriptomic and bioinformatic studies suggest the existence of numerous of so called non-coding RNA, ncRNAs, that is RNA that does not translate into protein (Cheng et al., 2005; Eddy, 2001).
RNA realizes folded molecular conformations consistent with the Watson–Crick base as well as the wobble base pairs. In the following we consider RNA secondary structures, presented as diagrams obtained by drawing the sequence in a straight line and placing all Watson–Crick and Wobble base pairs as arcs in the upper half-plane, without any crossing arcs, see Figure 1.

DNA information processing refers to replication, transcription and translation. Additionally, RNA information processing includes replication (Koonin et al., 1989), reverse transcription (from RNA to DNA in e.g. retroviruses; Temin and Mizutani, 1970) and a direct translation from DNA to protein (in cell-free systems, using extracts from E. coli that contains ribosomes; McCarthy and Holland, 1965; Uzawa et al., 2002).
In the following we offer an alternative view of DNA–RNA information processing. We focus on the information transfer from DNA/RNA sequences to the folded RNA (after transcription). We speculate that the sequential DNA information may transcribe into single-stranded RNA in order to allow subsequent biological processes to interpret DNA data.
DNA data are viewed as sequences of nucleotides. We currently use sequence alignment tools as a means of arranging the sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). Here we suggest that the transcription into RNA with the implied self-folding is a way of lifting DNA information to a new and different level: RNA structures provide sequence semantics.
In order to study this idea we consider the folding of RNA sequences into minimum free energy (mfe) secondary structures (Waterman, 1978). Pioneered by Waterman more than three decades ago (Smith and Waterman, 1978) and subsequently studied by Schuster et al. (1994) in the context of the RNA toy world (Schuster, 1997) there is detailed information about this folding. In particular we have fairly accurate energy values for computing loop-based mfe (Mathews et al., 1999, 2004; Turner and Mathews, 2010) that are employed by the folding algorithms (Hofacker et al., 1994; Zuker and Stiegler, 1981). More work has been done on loop-energy models in Do et al. (2006) and Mathews (2004). We plan on a more detailed analysis of the framework proposed here in the context of the MC-model (Parisien and Major, 2008).
McCaskill (1990) observed that the dynamic programming routines folding mfe structures allow one to compute the partition function of all possible structures for a given sequence. The partition function is tantamount to computing the probability space of structures that a fixed sequence is compatible with. Predictions such as base pairing probabilities are obtained in Hofacker et al. (1994) and Hofacker (2003) and are parallelized in Fekete et al.(2000). Ding and Lawrence (2003) and Tacker et al. (1996) derive a statistically valid sampling of secondary structures in the Boltzmann ensemble and calculate the sampling statistics of structural features.
The paper is organized as follows: we first recall in Section 2.1 the decomposition of secondary structures as well as the loop-based thermodynamic model. This in turn facilitates (Sections 2.3 and 2.4) the derivation of the partition function and Boltzmann sampling. In Sections 2.3 and 2.4 we compute Q(S), Boltzmann sampling and the a priori probability of sequence patterns.
2 Method
2.1 Secondary structures and loop decomposition
RNA structures can be represented as diagrams where we consider the labels of the sequence to be placed on the x-axis and the Watson–Crick as well as Wobble base pairs drawn as arcs in the upper half plane see Figure 1. That is, we have a vertex-labeled graph whose vertices are drawn on a horizontal line labeled by , presenting the nucleotides of the RNA sequence and the linear order of the vertices from left to right indicates the direction of the backbone from -end to -end. Furthermore each vertex can be paired with at most one other vertex by an arc drawn in the upper half-plane. Such an arc, (i, j), represents the base pair between the ith and jth nucleotide (here we assume to meet the minimum size requirement of a hairpin loop.).
Two arcs (i, j) and (r, s) are called crossing if and only if i < r and holds. An RNA structure is called pseudoknot-free, or secondary structure, if it does not contain any crossing arcs. Furthermore, the arcs of a secondary structure can be endowed with the partial order: if and only if .
A filtration based on the individual contributions of base pairs of RNA structures was computed via the Nussinov model (Nussinov et al., 1978). Smith and Waterman (1978) were the first bringing energy into the picture, computing the free-energy accurately via loops. A loop in a diagram consists of a sequence of intervals on the backbone , where , for all are base pairs. By construction, each base pair (i, j) is involved in exactly two loops: one where (i, j) is maximal respect to , and one where (i, j) is not. Furthermore, there is a distinguished loop, Lex, called the exterior loop, where , bk = n and is not a base pair. Depending on the number of base pairs, and unpaired bases inside a loop, a loop is categorized as hairpin-, containing exactly one base pair and one interval, helix, containing two base pairs and two empty intervals, interior-, containing two non-empty intervals and two base pairs, bulge-, containing two base pairs and two intervals, where one of them is empty and the other one is not and multi-loops, see Figures 2 and 3.

Hairpin-, helix-, bulge-, interior- and multi-loops in secondary structures

Further developments on RNA secondary structure prediction were given by Zuker and Stiegler (1981) and Hofacker et al. (1994). In particular, accurate thermodynamic energy parameters can be found in Mathews et al. (1999, 2004), Turner and Mathews (2010), Parisien and Major (2008), Deigan et al. (2009), Hajdin et al. (2013) and Lorenz et al. (2016).
A hairpin, LH is a loop having exactly one base pair with a non-empty interval containing k unpaired bases, where due to flexibility constraints imposed by the backbone of the molecule.
In case of we call L a tetra-loop, which has a particular energy that depends on the two nucleotides incident to its unique arc as well as the particular nucleotides corresponding to the unpaired bases of its unique non-empty interval .
2.2 The partition function

Remark. The routine of computing Q(S) is similar to the one for finding an optimal sequence for a given structure inBusch and Backofen (2006), Levin, A., et al. (2012) and Reinharz,V., et al. (2013). Passing to a topological model for RNA structures (Bon et al., 2008; Orland and Zee, 2002; Penner, 2004; Reidys et al., 2011), the above recursions can be extended to pseudoknotted RNA structures, i.e. RNA structures containing crossing arcs. The key here is a general bijection between maximal arcs and topological boundary components (loops).
2.3 Boltzmann sampling and patterns
Next, we compute the probability of a given sequence pattern, i.e. the subsequence of σ over being . We shall refer to a sequence containing by .
We have shown how to compute Q(S) recursively in Section 2.3. It remains to show how to compute . To do this we use the same routine as for computing Q(S), but eliminating any subsequences that are not compatible with . By construction, for any pattern, this process has the same time complexity as computing Q(S).
3 Discussion
Let us begin by discussing the mutual information of sequence–structure pairs. Then we ask to what extent does a structure determine particular sequence patterns and finally derive a criterion that differentiates native from random structures.
In Figure 5 we display three sequences sampled from Q(S) where S is the PDB-structure 2N3R (Bonneau et al., 2015), see Figure 9. All three sequences have similar mutual information and more than 50% of the nucleotides in the sequences are pairwise different.

Three sequences, having nearly the same mutual information with respect to the PDB structure 2N3R. The sequences differ pairwise by more than 50% of their nucleotides which indicates that there is information that cannot be captured by conventional sequence alignment. Accordingly BLAST outputs no significant homology between the sequences
We point out that replacing a G- C base pair in a helix by a C- G base pair does change the energy, see Figure 6. This due to the fact that the loops are traversed in a specific orientation. The isolated replacement of G-C by C-G changes this sequence and hence the energy.

Isolated replacement of G- C by C- G: (A) and , (B) replacement induces the new loops: and , which changes the free energy
Since the energy model underlying the current analysis does not take non-canonical base pairs into account, we defer a detailed analysis of the mutual information to a later study where we use the MC-model (Parisien and Major, 2008).
By construction, , where when all have the same probability, i.e. uniformly distributed, and when is completely determined, i.e. . Let be the heat of , i.e. for random sequences and if there exists only one pattern . We display the collection of as a matrix (heat-map), in which we display as black and as white. For a proof of concept, we restrict ourselves to for .
The PDB structure 2JXV (Cevec et al., 2008) represents a segment of an mRNA, having length 33. The structure exhibits a tetra-loop, an interior loop and two stacks of length 8 and 5, respectively, see Figure 7. We Boltzmann sample 104 sequences for this structure observing an AU ratio of 18.18%, while CG ratio is 81.82%. The IFR reads 95.16%, i.e. almost all sampled sequences refold into 2JXV. The heat-map of 2JXV is given in Figure 7 and we list the most frequent 10 patterns of the largest interval having in Supplementary Table S1 together with their a priori pattern probabilities. We observe that the tetra-loop determines specific patterns. This finding is not entirely straightforward as the hairpin-loops are the last to be encountered when Boltzmann sampling. I.e. they are the most correlated loop-types in the sense that structural context influences them the most.

The secondary structure of 2JXV and its heat-map. We display the most frequent sampled patterns for the largest interval having . The sample size is 104
The energy distribution of the Boltzmann sample is presented in Figure 8 and we observe that the inverse folding solution is not simply the one that minimizes the free energy w.r.t. 2JXV with the best energy. -data are not displayed here in view of the high IFR.

The energy distribution of the Boltzmann sample for 2JXV. We display the frequency of sequences having a particular energy (right bars) and the frequency of sequences that fold into 2JXV (left bars)
The PDB structure 2N3R (Bonneau et al., 2015) consist of 61 nucleotides and has a 3-branch multi-loop, two tetra-loops, interior loops and helixes, see Figure 9. The ratios of AU and CG pairs are 19.67% and 80.33%, respectively, again in a Boltzmann sample of 104 sequences. The IFR is at 0.69 quite high, despite the fact that 2N3R is much more complex than 2JXV. We illustrate the heat-map of 2N3R in Figure 9 and list the most frequent 10 patterns in the largest interval having in the Supplementary Material, Supplementary Table S2 together with their a priori pattern probabilities computed by eq. (9)

The secondary structure of 2N3R and its heat-map. We show the most frequent patterns for the largest interval having . The sample size is 104
Comparing the sequence segments and , both of which being tetra-loops with additional two nucleotides. The values of these segments are similar, approximately 0.59, however, their most frequently sampled patterns appear at different rates. For this pattern is CGGAAGGC and it occurs with a Boltzmann sampled frequency of 1.69% and pattern probability 1.44%, while for it is CGUGAGGG with sampled frequency 3.27% and pattern probability 3.24%. This makes the point that pattern frequency distributions are strongly correlated with structural context.
The energy distribution of the Boltzmann sample is given in Figure 10(A) and we display the -data in Figure 10(B) where we contrast the data with -values obtained from Boltzmann sampling 104 sequences of 5 random structures of the same length. We observe that the -values for 2N3R are distinctively lower than those for random structures.

(A) The energy distribution of Boltzmann sampled sequences. The frequency of sequences having a particular energy level (right bars), the frequency of sequences folding into 2N3R (left bars). (B) -data of 2N3R versus -data of five random structures
The PDB structure 1EHZ (Shi and Moore, 2000) is a tRNA over 76 nucleotides exhibiting a 4-branch multi-loop. We display the heat-map of 1EHZ in Figure 11 The IFR is w.r.t. our Boltzmann sample of size 104 and we display the energy distribution of the sampled sequences in Figure 12(A). Interestingly we still find many inverse fold solutions by just Boltzmann sampling Q(S) and these sequences are not concentrated at low free energy values.

The secondary structure of 1EHZ and its heat-map. We display the most frequent sampled pattern for the largest interval having . The sample size is 104

(A) The energy distribution of the Boltzmann sampled sequences. The frequency of sequences having a particular energy level (right bars), the frequency of sequences folding into 1EHZ (left bars). (B) -data of 1EHZ versus -data of five random structures
In Figure 12(B) we display the -data and contrast them with those obtained by the Boltzmann samples of five random structures. We observe a significant difference between the -distribution of the 1EHZ sample and those of the random structures.
The three above examples indicate that sequence–structure correlations can be used to locate regions where specific embedded patterns arise. Furthermore we observe that studying Q(S) has direct implications for inverse folding. This is in agreement with the findings in Busch and Backofen (2006), Levin, A., et al. (2012) and Reinharz,V., et al. (2013), but leads to deriving alternative, unbiased starting sequences for inverse folding. Although at present we can only estimate the mutual information, we can conclude that there are sequences that cannot be aligned but obtain almost identical mutual information.
We observe that biological relevant sequences exhibit a -signature distinctive different from that of random structures. Therefore, the -signature is capable of distinguishing biological relevant structures from random structures. In Miklós et al. (2005), the expected free energy and variance of the Boltzmann ensemble of a given sequence has been employed in order to distinguish biologically functional RNA sequences from random sequences. This result is in terms of the pairing , dual (the flip side of the coin, so to speak) to our approach. Our -signature characterize the naturality of a fixed structure and Miklós et al. (2005) the naturality of a fixed sequence. Accordingly, Q(S) augments the analysis of in a natural way, capturing the correlation between RNA sequences and structures.
As a result, sequences carry embedded patterns that cannot be understood by considering the sequence of nucleotides. At this point we have no concept of what these patterns are and provided in Section 2.4 a rather conventional notion of ‘embedded pattern’. However, even when considering specific nucleotide patterns in hairpin loops, we observe significant context dependence on the structure. Other loops affect the energy of the hairpin loop and thus determine this particular subsequence. We observe that the embedded patterns can, for certain structures, be quite restricted, possibly elaborate and are not entirely obvious. In any case, the analysis cannot be reduced to conventional sequence alignment. The heat-maps introduced here identify the regions for which only a few select patterns appear and computed the a priori probabilities of their occurrence.
This type of analysis will be carried out for the far more advanced MC-model (Parisien and Major, 2008), incorporating non-canonical base pairs, SHAPE-directed model for long RNAs (Deigan et al., 2009; Hajdin et al., 2013). This will in particular enable us to have a closer look at the hairpins of the tRNA structure. In addition we believe that this line of work may enable us to arrive at non-heuristic inverse foldings.
Folding of RNA secondary structures including pseudoknots is studied in Rivas and Eddy (1999) by extending the dynamic programming paradigm introducing substructures with a gap. The framework generates a particular, somewhat subtle class of pseudoknot structures, discussed in detail in Rivas and Eddy (2000). A specific, multiple context-free grammar (MCFG) for pseudoknotted structures is designed (Rivas and Eddy, 1999), employing a vector of nonterminal symbols referencing a substructure with a gap.
Our results facilitate the Boltzmann sample RNA sequences for pseudoknotted structures. Let denote a substructure with a gap where (i, j), (r, s) are base pairs and denote the partition function of , then one can compute following the MCFG given by Rivas and Eddy (1999).
A different approach was presented in Penner and Waterman (1993) and Penner (2004), where topological RNA structures have been introduced (Penner and Waterman, 1993; Penner, 2004). In difference to Rivas and Eddy (1999), which was driven by the dynamic programming paradigm, topological structures stem from the intuitive idea to just ‘draw’ their arcs on a more complex topological surface in order to resolve crossings. Random matrix theory (von Neumann and Goldstine, 1947) facilitates the classification and expansion of pseudoknotted structures in terms of topological genus (Bon et al., 2008; Orland and Zee, 2002) and in Reidys et al. (2011) a polynomial time, loop-based folding algorithm of topological RNA structures was given. The results in this paper are for representation purposes formulated in terms of loops. However they were originally developed in the topological framework, in which loops become topological boundary components. This means that we can extend our framework to pseudoknot structures. The key then is of course to be able to recursively compute the novel partition function, i.e. an unambiguous grammar. Recent results (Huang and Reidys, 2016) associate a topological RNA structure with a certain, arc-labeled secondary structure, called λ-structure. The resulting disentanglement gives rise to a context free grammar for RNA pseudoknot structures (Huang and Reidys, 2016). (More precisely, a λ-structures corresponds one-to-one to a pseudoknotted structure together with some additional information, i.e. a specific permutation of its backbone.) We illustrate this correspondence in Figure 13. This finding facilitates to extend all our results to pseudoknotted structures and offers insight in patterns and inverse folding of more general RNA structure classes as well as RNA-RNA interaction complexes.

Disentanglement: by means of permuting the backbone of a pseudoknotted structure one resolves all crossings
As mentioned above, the present analysis is just a first step and discusses embedded patterns in the sense of subsequent nucleotides. However our framework can deal with any embedded pattern. We think a deeper, conceptual analysis has to be undertaken aiming at identifying how a collection of structures provides sequence semantics. Quite possibly this can be done in the context of formal languages. We speculate that advancing this may lead to a novel class of embedded pattern recognition algorithms beyond sequence alignment.
Acknowledgments
Special thanks to Michael Waterman and Peter Stadler for their input on this manuscript. We gratefully acknowledge the help of Kevin Shinpaugh and the computational support team at VBI, Rebecca Wattam, Henning Mortveit, Madhav Marathe and Reza Rezazadegan for discussions. Many thanks for the constructive feedback of the anonymous reviewers and their suggestions.
Conflict of Interest: none declared.
References