Sequence–structure relations of biopolymers

Abstract

Motivation

DNA data is transcribed into single-stranded RNA, which folds into specific molecular structures. In this paper we pose the question to what extent sequence- and structure-information correlate. We view this correlation as structural semantics of sequence data that allows for a different interpretation than conventional sequence alignment. Structural semantics could enable us to identify more general embedded ‘patterns’ in DNA and RNA sequences.

Results

We compute the partition function of sequences with respect to a fixed structure and connect this computation to the mutual information of a sequence–structure pair for RNA secondary structures. We present a Boltzmann sampler and obtain the a priori probability of specific sequence patterns. We present a detailed analysis for the three PDB-structures, 2JXV (hairpin), 2N3R (3-branch multi-loop) and 1EHZ (tRNA). We localize specific sequence patterns, contrast the energy spectrum of the Boltzmann sampled sequences versus those sequences that refold into the same structure and derive a criterion to identify native structures. We illustrate that there are multiple sequences in the partition function of a fixed structure, each having nearly the same mutual information, that are nevertheless poorly aligned. This indicates the possibility of the existence of relevant patterns embedded in the sequences that are not discoverable using alignments.

Availability and Implementation

The source code is freely available at http://staff.vbi.vt.edu/fenixh/Sampler.zip

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

2015 is the 25th year of the human genome project. A recent signature publication (The 1000 Genomes Project Consortium, 2015) is a comprehensive sequence alignment-based analysis of whole genome nucleotide sequence variation across global human populations. Notwithstanding the importance of this achievement, there is the possibility of information encoded as patterns in the genome that current methods cannot discover.

In this paper we study the information transfer from RNA sequences to RNA structures. This question is central to the processing of DNA data, specifically the role of DNA nucleotide sequences being transcribed into RNA, stabilized by molecular folding. In a plethora of interactions it is this specific configuration and not the particular sequence of nucleotides (aside from, say small docking areas, where specific bindings occur) that determines biological functionality. We find that here are multiple sequences in the partition function of a fixed structure, each having nearly the same mutual information with respect to the latter, that are nevertheless poorly aligned. This indicates the possibility of the existence of relevant patterns embedded in the sequences that are not discoverable using alignments.

RNA, unlike DNA, is almost always single-stranded and all RNA is folded. (There are double-stranded RNA viruses.) Here we only consider single-stranded RNA. An RNA strand has a backbone made of alternating sugar (ribose) and phosphate groups. Attached to each sugar is one of four bases – adenine (A), uracil (U), cytosine (C), or guanine (G). There are various types of RNA: messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA) and many others. Recent transcriptomic and bioinformatic studies suggest the existence of numerous of so called non-coding RNA, ncRNAs, that is RNA that does not translate into protein (Cheng et al., 2005; Eddy, 2001).

RNA realizes folded molecular conformations consistent with the Watson–Crick base as well as the wobble base pairs. In the following we consider RNA secondary structures, presented as diagrams obtained by drawing the sequence in a straight line and placing all Watson–Crick and Wobble base pairs as arcs in the upper half-plane, without any crossing arcs, see Figure 1.

Fig. 1

tRNA: its secondary structure and its diagram presentation

Open in new tab Download slide

DNA information processing refers to replication, transcription and translation. Additionally, RNA information processing includes replication (Koonin et al., 1989), reverse transcription (from RNA to DNA in e.g. retroviruses; Temin and Mizutani, 1970) and a direct translation from DNA to protein (in cell-free systems, using extracts from E. coli that contains ribosomes; McCarthy and Holland, 1965; Uzawa et al., 2002).

In the following we offer an alternative view of DNA–RNA information processing. We focus on the information transfer from DNA/RNA sequences to the folded RNA (after transcription). We speculate that the sequential DNA information may transcribe into single-stranded RNA in order to allow subsequent biological processes to interpret DNA data.

DNA data are viewed as sequences of nucleotides. We currently use sequence alignment tools as a means of arranging the sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). Here we suggest that the transcription into RNA with the implied self-folding is a way of lifting DNA information to a new and different level: RNA structures provide sequence semantics.

In order to study this idea we consider the folding of RNA sequences into minimum free energy (mfe) secondary structures (Waterman, 1978). Pioneered by Waterman more than three decades ago (Smith and Waterman, 1978) and subsequently studied by Schuster et al. (1994) in the context of the RNA toy world (Schuster, 1997) there is detailed information about this folding. In particular we have fairly accurate energy values for computing loop-based mfe (Mathews et al., 1999, 2004; Turner and Mathews, 2010) that are employed by the folding algorithms (Hofacker et al., 1994; Zuker and Stiegler, 1981). More work has been done on loop-energy models in Do et al. (2006) and Mathews (2004). We plan on a more detailed analysis of the framework proposed here in the context of the MC-model (Parisien and Major, 2008).

McCaskill (1990) observed that the dynamic programming routines folding mfe structures allow one to compute the partition function of all possible structures for a given sequence. The partition function is tantamount to computing the probability space of structures that a fixed sequence is compatible with. Predictions such as base pairing probabilities are obtained in Hofacker et al. (1994) and Hofacker (2003) and are parallelized in Fekete et al.(2000). Ding and Lawrence (2003) and Tacker et al. (1996) derive a statistically valid sampling of secondary structures in the Boltzmann ensemble and calculate the sampling statistics of structural features.

In view of the above we are led to consider the ‘dual’ of McCaskill’s partition function, i.e. the partition function of all sequences that are compatible with a fixed structure. More generally we consider the pairing

ε : Q_{4}^{n} \times S_{n} \to R^{+},

(1)

where

Q_{4}^{n}

and

S_{n}

denote the space of sequences, σ, and the space of secondary structures, S, respectively and

ε (σ, S) = e^{- \frac{η (σ, S)}{R T}}

as well as the energy function

η (σ, S) \in R

are discussed in Section 2.1.

We show in Section 3 how ε allows us to capture the mutual information between sequences and structures, where the mutual information between x and y is given by

I (x, y) = P (x, y) log (\frac{P (x, y)}{P (x) P (y)}) .

Here

P (x, y)

denotes the joint probability distribution. In our case,

P (σ, S) = ϵ (σ, S) / \sum_{σ \in Q_{4}^{n}, S \in S_{n}} ϵ (σ, S), P (σ) = \sum_{S \in S_{n}} P (σ, S)

and

P (S) = \sum_{σ \in Q_{4}^{n}} P (σ, S)

⁠.

In addition, ε allows us to express folding by considering

{S | ε (σ, S) = \max_{S \in S_{n}} ε (σ, S)},

and inverse folding as to compute

{σ | ε (σ, S) = \max_{S \in S_{n}} ε (σ, S)}

⁠, for fixed S. Accordingly, the dual to folding is tantamount to computing for fixed S

{σ | ε (σ, S) = \max_{σ \in Q_{4}^{n}} ε (σ, S)} .

This has direct implications to the ‘inverse’ folding of structures. Inverse folding is by construction about the sequence constraints induced by a fixed structure while avoiding competing configurations. Point in case: it has been observed in Busch and Backofen (2006), Levin, A., et al. (2012) and Reinharz,V., et al. (2013) that starting with a sequence that is mfe w.r.t. to a fixed structure, without necessarily folding into it, constitutes a significantly better initialization than starting with a random sequence.

The paper is organized as follows: we first recall in Section 2.1 the decomposition of secondary structures as well as the loop-based thermodynamic model. This in turn facilitates (Sections 2.3 and 2.4) the derivation of the partition function and Boltzmann sampling. In Sections 2.3 and 2.4 we compute Q(S), Boltzmann sampling and the a priori probability of sequence patterns.

2 Method

2.1 Secondary structures and loop decomposition

RNA structures can be represented as diagrams where we consider the labels of the sequence to be placed on the x-axis and the Watson–Crick as well as Wobble base pairs drawn as arcs in the upper half plane see Figure 1. That is, we have a vertex-labeled graph whose vertices are drawn on a horizontal line labeled by $[n] = {1, 2, \dots, n}$ ⁠, presenting the nucleotides of the RNA sequence and the linear order of the vertices from left to right indicates the direction of the backbone from $5'$ -end to $3'$ -end. Furthermore each vertex can be paired with at most one other vertex by an arc drawn in the upper half-plane. Such an arc, (i, j), represents the base pair between the ith and jth nucleotide (here we assume $j - i > 3$ to meet the minimum size requirement of a hairpin loop.).

Two arcs (i, j) and (r, s) are called crossing if and only if i < r and $i < r < j < s$ holds. An RNA structure is called pseudoknot-free, or secondary structure, if it does not contain any crossing arcs. Furthermore, the arcs of a secondary structure can be endowed with the partial order: $(r, s) ≺ (i, j)$ if and only if $i < r < s < j$ ⁠.

A filtration based on the individual contributions of base pairs of RNA structures was computed via the Nussinov model (Nussinov et al., 1978). Smith and Waterman (1978) were the first bringing energy into the picture, computing the free-energy accurately via loops. A loop in a diagram consists of a sequence of intervals on the backbone ${([a_{i}, b_{i}])}_{i}, 1 \leq i \leq k$ ⁠, where $(a_{1}, b_{k}), (b_{i}, a_{i + 1})$ ⁠, for all $1 \leq i < k$ are base pairs. By construction, each base pair (i, j) is involved in exactly two loops: one where (i, j) is maximal respect to $≺$ ⁠, and one where (i, j) is not. Furthermore, there is a distinguished loop, L_ex, called the exterior loop, where $a_{1} = 1$ ⁠, b_k=n and $(a_{1}, b_{k})$ is not a base pair. Depending on the number of base pairs, and unpaired bases inside a loop, a loop is categorized as hairpin-, containing exactly one base pair and one interval, helix, containing two base pairs and two empty intervals, interior-, containing two non-empty intervals and two base pairs, bulge-, containing two base pairs and two intervals, where one of them is empty and the other one is not and multi-loops, see Figures 2 and 3.

Fig. 2

Hairpin-, helix-, bulge-, interior- and multi-loops in secondary structures

Open in new tab Download slide

Fig. 3

Loops and their correspondence in a diagram

Open in new tab Download slide

Further developments on RNA secondary structure prediction were given by Zuker and Stiegler (1981) and Hofacker et al. (1994). In particular, accurate thermodynamic energy parameters can be found in Mathews et al. (1999, 2004), Turner and Mathews (2010), Parisien and Major (2008), Deigan et al. (2009), Hajdin et al. (2013) and Lorenz et al. (2016).

In the following, we briefly recall the Turner energy model (Mathews et al., 1999, 2004; Turner and Mathews, 2010) for RNA secondary structures. Let

σ = (σ_{1}, σ_{2}, \dots, σ_{n})

be a sequence, where

σ_{i} \in {A, U, C, G}

for all

1 \leq i \leq n

⁠. To an arbitrary loop, L, we assign the energy

η (σ, L)

⁠, where

η (σ, L_{e x}) = 0

and

η (σ, L)

depends on two factors: its type and the underlying backbone. Specifically this is the number of bases pairs, the number of unpaired bases and the particular nucleotides involved. The energy of a structure S over an RNA sequence σ is then given by the sum of the energies of individual loops i.e.

η (σ, S) = \sum_{L \in S} η (σ, L) .

(2)

A hairpin, L_H is a loop having exactly one base pair with a non-empty interval containing k unpaired bases, where $k \geq 3$ due to flexibility constraints imposed by the backbone of the molecule.

In case of $3 \leq k \leq 4$ we call L a tetra-loop, which has a particular energy that depends on the two nucleotides incident to its unique arc $(σ_{i}, σ_{i + k + 1})$ as well as the particular nucleotides corresponding to the unpaired bases of its unique non-empty interval $(σ_{i + 1}, \dots, σ_{i + k})$ ⁠.

For any other number of unpaired bases, k, the energy calculation depends only on k and not the particular nucleotide sequence, except of

(σ_{i}, σ_{i + k + 1})

and

σ_{i + 1}

and

σ_{i + k}

⁠. We have

η (σ, L_{H}) = {\begin{matrix} η_{H} ((σ_{i}, σ_{i + k + 1}), σ_{i + 1}, \dots, σ_{i + k}) if 3 \leq k \leq 4 \\ η_{H} ((σ_{i}, σ_{i + k + 1}), σ_{i + 1}, σ_{i + k}, k) otherwise . \end{matrix}

(3)

An interior, bulge or helix loop,

L_{*}

⁠, can be represented as two intervals and two base pairs

L_{*} = {[i, r], [s, j], (i, j), (r, s)}

⁠. The energy of

L_{*}

is computed as

η (σ, L_{*}) = {\begin{array}{l} η_{*} ((σ_{i}, σ_{j}), (σ_{r}, σ_{s})) (helix) \\ η_{*} ((σ_{i}, σ_{j}), (σ_{r}, σ_{s}), σ_{i + 1}, σ_{r - 1}, \\ σ_{s + 1}, σ_{j - 1}), k_{1}) (bulge) \\ η_{*} ((σ_{i}, σ_{j}), (σ_{r}, σ_{s}), σ_{i + 1}, σ_{r - 1}, \\ σ_{s + 1}, σ_{j - 1}), k_{1}, k_{2}) (interior) \end{array}

(4)

where

k_{1} = \max {r - i - 1, j - s - 1}

and

k_{2} = \min {r - i - 1, j - s - 1}

⁠.

A multi-loop L_M contains p base pairs and p intervals, some of which being possibly empty, where

p \geq 3

⁠.

η_{M} (σ, L_{M})

is computed by

η_{M} (σ, L_{M}) = α + p \cdot β + u \cdot γ .

(5)

Here α is the constant multi-loop penalty, β and γ are constants and u is the number of all unpaired bases contained in the respective intervals.

2.2 The partition function

Definition 1 Let S be a secondary structure over n nucleotides. Then the partition function of S is given by

Q (S) = \sum_{σ \in Q_{4}^{n}} e^{- \frac{η (σ, S)}{R T}},

(6)

where

η (σ, S)

is the energy of S on σ, R is the universal gas constant and T is the temperature.

In analogy to the partition function of a fixed sequence

Q (σ)

(McCaskill, 1990), Q(S) can be computed recursively. Given the structure S, we consider an arbitrary arc (i, j), where i < j. Let

S_{i, j}

denote the substructure of S over the interval

[i, j]

⁠. Since S contains no crossing arcs all arcs of

S_{i, j}

are contained in

[i, j]

⁠, whence

S_{i, j}

is well defined. Let

Q (σ_{i}, σ_{j}) = \sum_{\begin{matrix} σ \in Q_{4}^{j - i + 1} \\ σ |_{i} = σ_{i}, σ |_{j} = σ_{j} \end{matrix}} e^{- \frac{η (σ, S_{i, j})}{R T}} .

Since S has no crossing arcs, the interval

[i, j]

is covered by the arc (i, j), i.e. (i, j) induces a loop L for which (i, j) is maximal. Suppose L consist of intervals

[i, p_{1}], [q_{1}, p_{2}], \dots, [q_{k}, j]

⁠, where

(p_{1}, q_{1}) \dots, (p_{k}, q_{k})

are L-arcs different from (i, j). Removal of L renders substructures covered by

(p_{1}, q_{1}) \dots, (p_{k}, q_{k})

⁠. Considering all combinations of the nucleotides in position p_i and q_i,

1 \leq i \leq k

⁠, we derive the following recursion, see Figure 4:

Q (σ_{i}, σ_{j}) = \sum_{σ_{p_{t}}, σ_{q_{t}} \in Q_{4}} e^{- \frac{η (σ, L)}{R T}} \prod_{t}^{k} Q (σ_{p_{t}}, σ_{q_{t}}) .

(7)

Fig. 4

The recursion for computing the partition function $Q_{T} (σ_{i}, σ_{j})$

Open in new tab Download slide

The partition function Q(S) is then obtained as the weighted sum of the terms

Q (σ_{a_{t}}, σ_{b_{t}})

⁠, where (a_t, b_t),

\forall 1 \leq t \leq k

are base pairs in the exterior loop L_ex:

Q (S) = \sum_{σ_{a_{t}}, σ_{b_{t}} \in Q_{4}} e^{- \frac{η (σ, L_{e x})}{R T}} \prod_{t}^{k} Q (σ_{a_{t}}, σ_{b_{t}}) .

(8)

Remark. The routine of computing Q(S) is similar to the one for finding an optimal sequence for a given structure inBusch and Backofen (2006), Levin, A., et al. (2012) and Reinharz,V., et al. (2013). Passing to a topological model for RNA structures (Bon et al., 2008; Orland and Zee, 2002; Penner, 2004; Reidys et al., 2011), the above recursions can be extended to pseudoknotted RNA structures, i.e. RNA structures containing crossing arcs. The key here is a general bijection between maximal arcs and topological boundary components (loops).

2.3 Boltzmann sampling and patterns

Having computed the partition function Q(S) as well as the

Q (σ_{i}, σ_{j})

terms, puts us in position to Boltzmann sample sequences for fixed secondary structure S. Here the probability of a sequence σ to be sampled is given by

P (σ | S) = \frac{e^{- \frac{η (σ, S)}{R T}}}{Q (S)} .

We build σ recursively from top to bottom, starting with the exterior loop, L_ex. Suppose (p_t, q_t) are base pairs contained in L_ex and let u denote the number of unpaired bases in L_ex. Since

η (σ, L_{e x}) = 0

⁠, the unpaired nucleotides in L_ex are sampled uniformly, i.e. with probability 1/4. Then the probability of the event σ_r being the nucleotide in position

r \in L_{e x}

⁠, is given by

P (σ_{r} | S) = \frac{e^{- \frac{η (σ, L_{e x})}{R T}} \prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t}})}{Q (S)} = \frac{{(\frac{1}{4})}^{u} \prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t}})}{Q (S)},

where the dependence on σ_r of the RHS stems from

σ |_{r} = σ_{r}

or potentially p_t =r or q_t= r. We continue the process inductively from top to bottom. Suppose we are given a loop L with the maximal base pair (i, j). Since any two arcs in S are not crossing, any arc (i, j) is contained in exactly two loops (except for the exterior loop) where (i, j) is the maximal arc for one and not for the other. As a result, the nucleotides

σ_{i}, σ_{j}

associated with (i, j) are sampled as part of the preceding loop (in which (i, j) is not maximal). It remains to sample the nucleotides other than σ_i and σ_j in L. Let σ_r be the nucleotides in L and

r \neq i, j

⁠. The probability of the event σ_r being the nucleotide in position r,

r \neq i, j

is given by

P (σ_{r} | S) = \frac{e^{- \frac{η (σ, L)}{R T}} \prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t})}}{Q (σ_{i}, σ_{j})} .

Here (p_t, q_t), for

1 \leq t \leq k, k \geq 0

are base pairs contained in L, that are different from (i, j). In particular, L is a hairpin loop in case of k = 0, an interior-, bulge- or a helix-loop in case of k = 1, and a multi-loop for

k \geq 2

⁠.

By construction, for any arc there is a unique loop for which the arc is maximal and a unique loop where the arc is not. As a result, the probability of a sequence σ to be sampled is given by

P (σ | S) = \prod_{(i, j) \in S} \frac{e^{- \frac{η (σ, L (i, j))}{R T}} \prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t}})}{Q (σ_{i}, σ_{j})} \cdot \frac{{(\frac{1}{4})}^{u} \prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t}})}{Q (S)}

In view of Eq. (2) and the fact that the term

Q (σ_{p_{t}, q_{t}})

appears exactly once for each arc (p_t, q_t), we arrive at

\prod_{(i, j) \in S} (\prod_{t = 1}^{k} Q (σ_{p_{t}}, σ_{q_{t}})) = \prod_{(i, j) \in S} Q (σ_{i}, σ_{j}) .

This in turn implies

P (σ | S) = \frac{(\prod_{(i, j) \in S} e^{- \frac{η (σ, L (i, j))}{R T}}) (\prod_{(i, j) \in S} Q (σ_{i}, σ_{j}))}{Q (S) \prod_{(i, j) \in S} Q (σ_{i}, σ_{j})} = \frac{e^{- \frac{η (σ, S)}{R T}}}{Q (S)} .

The time complexity for computing the partition function of a structure and Boltzmann sampling depends solely on the complexity of the energy function,

η (σ, L)

⁠. Clearly, there are O(n) loops in the structure and reviewing Eqs. (3), (4) and (5), at most eight nucleotides are taken into account. From this we can conclude that the time complexity is O(n), as claimed.

Next, we compute the probability of a given sequence pattern, i.e. the subsequence of σ over $[i, j]$ being $p_{i, j}$ ⁠. We shall refer to a sequence containing $p_{i, j}$ by $σ |_{p_{i, j}}$ ⁠.

The partition function of all sequences σ containing

p_{i, j}

is given by

Q (S | p_{i, j}) = \sum_{σ |_{p_{i, j}} \in Q_{4}^{n}} e^{- \frac{η (σ, S)}{R T}}

(9)

and the probability of

p_{i, j}

P (p_{i, j} | S) = \frac{Q (S | p_{i, j})}{Q (S)}

⁠.

We have shown how to compute Q(S) recursively in Section 2.3. It remains to show how to compute $Q (S | p_{i, j})$ ⁠. To do this we use the same routine as for computing Q(S), but eliminating any subsequences that are not compatible with $p_{i, j}$ ⁠. By construction, for any pattern, this process has the same time complexity as computing Q(S).

3 Discussion

Let us begin by discussing the mutual information of sequence–structure pairs. Then we ask to what extent does a structure determine particular sequence patterns and finally derive a criterion that differentiates native from random structures.

The mutual information of a sequence–structure pair can be computed by normalizing ϵ

P (σ, S) = \frac{e^{- \frac{η (σ, S)}{R T}}}{\sum_{σ \in Q_{4}^{n}} \sum_{S \in S_{n}} e^{- \frac{η (σ, S)}{R T}}},

where

U = \sum_{σ \in Q_{4}^{n}} \sum_{S \in S_{n}} e^{- \frac{η (σ, S)}{R T}}

is a constant. Then we have

I (σ, S) = (e^{- \frac{η (σ, S)}{R T}} log \frac{(e^{- \frac{η (σ, S)}{R T}})}{Q (σ) Q (S)}) / U + (e^{- \frac{η (σ, S)}{R T}} log U) / U .

Since U is a large constant, we observe that one term of

P (σ, S)

⁠, namely

e^{- \frac{η (σ, S)}{R T}} log \frac{e^{- \frac{η (σ, S)}{R T}}}{Q (S) Q (σ)}

contributes the most. Accordingly, Q(S) and

Q (σ)

allow us to quantify how a probability space of structures determines a probability space of sequences.

In Figure 5 we display three sequences sampled from Q(S) where S is the PDB-structure 2N3R (Bonneau et al., 2015), see Figure 9. All three sequences have similar mutual information and more than 50% of the nucleotides in the sequences are pairwise different.

Fig. 5

Three sequences, having nearly the same mutual information with respect to the PDB structure 2N3R. The sequences differ pairwise by more than 50% of their nucleotides which indicates that there is information that cannot be captured by conventional sequence alignment. Accordingly BLAST outputs no significant homology between the sequences

Open in new tab Download slide

We point out that replacing a G- C base pair in a helix by a C- G base pair does change the energy, see Figure 6. This due to the fact that the loops are traversed in a specific orientation. The isolated replacement of G-C by C-G changes this sequence and hence the energy.

Fig. 6

Isolated replacement of G- C by C- G: (A) $L_{1} = (U, G, C, A)$ and $L_{2} = (G, U, G, C)$ ⁠, (B) replacement induces the new loops: $L_{1}^{'} = (U, C, G, A)$ and $L_{2}^{'} = (C, U, G, G)$ ⁠, which changes the free energy

Open in new tab Download slide

Since the energy model underlying the current analysis does not take non-canonical base pairs into account, we defer a detailed analysis of the mutual information to a later study where we use the MC-model (Parisien and Major, 2008).

Let

p_{i, j}

be a subsequence on the interval

[i, j]

with concrete nucleotides, having probability

P (p_{i, j})

⁠. Its Shannon entropy

E_{i, j}

is given by

E_{i, j} = - \sum_{\forall p_{i, j}} P (p_{i, j}) {log}_{4} P (p_{i, j}) .

By construction, $0 \leq E_{i, j} \leq (j - i + 1)$ ⁠, where $E_{i, j} = (j - i + 1)$ when all $p_{i, j}$ have the same probability, i.e. uniformly distributed, and $E_{i, j} = 0$ when $p_{i, j}$ is completely determined, i.e. $P (p_{i, j}) = 1$ ⁠. Let $R_{i, j} = 1 - (E_{i, j} / (j - i + 1))$ be the heat of $[i, j]$ ⁠, i.e. $R_{i, j} = 0$ for random sequences and $R_{i, j} = 1$ if there exists only one pattern $p_{i, j}$ ⁠. We display the collection of $R_{i, j}$ as a matrix (heat-map), in which we display $R_{i, j} = 0.59$ as black and $R_{i, j} = 0$ as white. For a proof of concept, we restrict ourselves to $R_{i, j}$ for $j - i + 1 \leq 8$ ⁠.

The heat-maps presented here are obtained by Boltzmann sampling an ensemble of 10⁴ sequences from Q(S). We present the energy distribution of this ensemble in Figures 8, 10(A) and 12(A) and in addition the energy spectrum of those sequences that actually fold into S via the classic folding algorithm using the same energy functions here. The Inverse folding rate (IFR),

IFR = \frac{# of sequences folding into S}{# of sampled sequences}

measures the rate of successful re-folding from that ensemble.

Let σ be a sequence from a Boltzmann sample w.r.t. the structure S. Let

\bar{S}

denote the structure that σ folds to. We consider

Δ η (σ) = | η (σ, S) - η (σ, \bar{S}) |

and compare the

Δ G (σ, S)

of several native structures contained in PDB with those of a several random structures (obtained by uniformly sampling RNA secondary structures).

The PDB structure 2JXV (Cevec et al., 2008) represents a segment of an mRNA, having length 33. The structure exhibits a tetra-loop, an interior loop and two stacks of length 8 and 5, respectively, see Figure 7. We Boltzmann sample 10⁴ sequences for this structure observing an AU ratio of 18.18%, while CG ratio is 81.82%. The IFR reads 95.16%, i.e. almost all sampled sequences refold into 2JXV. The heat-map of 2JXV is given in Figure 7 and we list the most frequent 10 patterns of the largest interval having $R_{i, j} > 0.52$ in Supplementary Table S1 together with their a priori pattern probabilities. We observe that the tetra-loop determines specific patterns. This finding is not entirely straightforward as the hairpin-loops are the last to be encountered when Boltzmann sampling. I.e. they are the most correlated loop-types in the sense that structural context influences them the most.

Fig. 7

The secondary structure of 2JXV and its heat-map. We display the most frequent sampled patterns for the largest interval having $R_{i, j} > 0.52$ ⁠. The sample size is 10⁴

Open in new tab Download slide

The energy distribution of the Boltzmann sample is presented in Figure 8 and we observe that the inverse folding solution is not simply the one that minimizes the free energy w.r.t. 2JXV with the best energy. $Δ η (σ)$ -data are not displayed here in view of the high IFR.

Fig. 8

The energy distribution of the Boltzmann sample for 2JXV. We display the frequency of sequences having a particular energy (right bars) and the frequency of sequences that fold into 2JXV (left bars)

Open in new tab Download slide

The PDB structure 2N3R (Bonneau et al., 2015) consist of 61 nucleotides and has a 3-branch multi-loop, two tetra-loops, interior loops and helixes, see Figure 9. The ratios of AU and CG pairs are 19.67% and 80.33%, respectively, again in a Boltzmann sample of 10⁴ sequences. The IFR is at 0.69 quite high, despite the fact that 2N3R is much more complex than 2JXV. We illustrate the heat-map of 2N3R in Figure 9 and list the most frequent 10 patterns in the largest interval having $R_{i, j} > 0.52$ in the Supplementary Material, Supplementary Table S2 together with their a priori pattern probabilities computed by eq. (9)

The secondary structure of 2N3R and its heat-map. We show the most frequent patterns for the largest interval having Ri,j>0.52. The sample size is 104

Fig. 9

The secondary structure of 2N3R and its heat-map. We show the most frequent patterns for the largest interval having $R_{i, j} > 0.52$ ⁠. The sample size is 10⁴

Open in new tab Download slide

Comparing the sequence segments $[17, 24]$ and $[37, 44]$ ⁠, both of which being tetra-loops with additional two nucleotides. The $R_{i, j}$ values of these segments are similar, approximately 0.59, however, their most frequently sampled patterns appear at different rates. For $[17, 24]$ this pattern is CGGAAGGC and it occurs with a Boltzmann sampled frequency of 1.69% and pattern probability 1.44%, while for $[37, 44]$ it is CGUGAGGG with sampled frequency 3.27% and pattern probability 3.24%. This makes the point that pattern frequency distributions are strongly correlated with structural context.

The energy distribution of the Boltzmann sample is given in Figure 10(A) and we display the $Δ η (σ)$ -data in Figure 10(B) where we contrast the data with $Δ η (σ)$ -values obtained from Boltzmann sampling 10⁴ sequences of 5 random structures of the same length. We observe that the $Δ η (σ)$ -values for 2N3R are distinctively lower than those for random structures.

Fig. 10

(A) The energy distribution of Boltzmann sampled sequences. The frequency of sequences having a particular energy level (right bars), the frequency of sequences folding into 2N3R (left bars). (B) $Δ η (σ)$ -data of 2N3R versus $Δ η (σ)$ -data of five random structures

Open in new tab Download slide

The PDB structure 1EHZ (Shi and Moore, 2000) is a tRNA over 76 nucleotides exhibiting a 4-branch multi-loop. We display the heat-map of 1EHZ in Figure 11 The IFR is $1.3 \times 10^{- 3}$ w.r.t. our Boltzmann sample of size 10⁴ and we display the energy distribution of the sampled sequences in Figure 12(A). Interestingly we still find many inverse fold solutions by just Boltzmann sampling Q(S) and these sequences are not concentrated at low free energy values.

Fig. 11

The secondary structure of 1EHZ and its heat-map. We display the most frequent sampled pattern for the largest interval having $R_{i, j} > 0.52$ ⁠. The sample size is 10⁴

Open in new tab Download slide

Fig. 12

(A) The energy distribution of the Boltzmann sampled sequences. The frequency of sequences having a particular energy level (right bars), the frequency of sequences folding into 1EHZ (left bars). (B) $Δ η (σ)$ -data of 1EHZ versus $Δ η (σ)$ -data of five random structures

Open in new tab Download slide

In Figure 12(B) we display the $Δ η (σ)$ -data and contrast them with those obtained by the Boltzmann samples of five random structures. We observe a significant difference between the $Δ η (σ)$ -distribution of the 1EHZ sample and those of the random structures.

The three above examples indicate that sequence–structure correlations can be used to locate regions where specific embedded patterns arise. Furthermore we observe that studying Q(S) has direct implications for inverse folding. This is in agreement with the findings in Busch and Backofen (2006), Levin, A., et al. (2012) and Reinharz,V., et al. (2013), but leads to deriving alternative, unbiased starting sequences for inverse folding. Although at present we can only estimate the mutual information, we can conclude that there are sequences that cannot be aligned but obtain almost identical mutual information.

We observe that biological relevant sequences exhibit a $Δ η (σ)$ -signature distinctive different from that of random structures. Therefore, the $Δ η (σ)$ -signature is capable of distinguishing biological relevant structures from random structures. In Miklós et al. (2005), the expected free energy and variance of the Boltzmann ensemble of a given sequence has been employed in order to distinguish biologically functional RNA sequences from random sequences. This result is in terms of the pairing $ε : Q_{4}^{n} \times S_{n} \to R^{+}$ ⁠, dual (the flip side of the coin, so to speak) to our approach. Our $Δ η (σ)$ -signature characterize the naturality of a fixed structure and Miklós et al. (2005) the naturality of a fixed sequence. Accordingly, Q(S) augments the analysis of $Q (σ)$ in a natural way, capturing the correlation between RNA sequences and structures.

As a result, sequences carry embedded patterns that cannot be understood by considering the sequence of nucleotides. At this point we have no concept of what these patterns are and provided in Section 2.4 a rather conventional notion of ‘embedded pattern’. However, even when considering specific nucleotide patterns in hairpin loops, we observe significant context dependence on the structure. Other loops affect the energy of the hairpin loop and thus determine this particular subsequence. We observe that the embedded patterns can, for certain structures, be quite restricted, possibly elaborate and are not entirely obvious. In any case, the analysis cannot be reduced to conventional sequence alignment. The heat-maps introduced here identify the regions for which only a few select patterns appear and computed the a priori probabilities of their occurrence.

This type of analysis will be carried out for the far more advanced MC-model (Parisien and Major, 2008), incorporating non-canonical base pairs, SHAPE-directed model for long RNAs (Deigan et al., 2009; Hajdin et al., 2013). This will in particular enable us to have a closer look at the hairpins of the tRNA structure. In addition we believe that this line of work may enable us to arrive at non-heuristic inverse foldings.

Folding of RNA secondary structures including pseudoknots is studied in Rivas and Eddy (1999) by extending the dynamic programming paradigm introducing substructures with a gap. The framework generates a particular, somewhat subtle class of pseudoknot structures, discussed in detail in Rivas and Eddy (2000). A specific, multiple context-free grammar (MCFG) for pseudoknotted structures is designed (Rivas and Eddy, 1999), employing a vector of nonterminal symbols referencing a substructure with a gap.

Our results facilitate the Boltzmann sample RNA sequences for pseudoknotted structures. Let $S_{i, j; r, s}$ denote a substructure with a gap where (i, j), (r, s) are base pairs and $Q (σ_{i}, σ_{j}; σ_{r}, σ_{s})$ denote the partition function of $S_{i, j; r, s}$ ⁠, then one can compute $Q (σ_{i}, σ_{j})$ following the MCFG given by Rivas and Eddy (1999).

A different approach was presented in Penner and Waterman (1993) and Penner (2004), where topological RNA structures have been introduced (Penner and Waterman, 1993; Penner, 2004). In difference to Rivas and Eddy (1999), which was driven by the dynamic programming paradigm, topological structures stem from the intuitive idea to just ‘draw’ their arcs on a more complex topological surface in order to resolve crossings. Random matrix theory (von Neumann and Goldstine, 1947) facilitates the classification and expansion of pseudoknotted structures in terms of topological genus (Bon et al., 2008; Orland and Zee, 2002) and in Reidys et al. (2011) a polynomial time, loop-based folding algorithm of topological RNA structures was given. The results in this paper are for representation purposes formulated in terms of loops. However they were originally developed in the topological framework, in which loops become topological boundary components. This means that we can extend our framework to pseudoknot structures. The key then is of course to be able to recursively compute the novel partition function, i.e. an unambiguous grammar. Recent results (Huang and Reidys, 2016) associate a topological RNA structure with a certain, arc-labeled secondary structure, called λ-structure. The resulting disentanglement gives rise to a context free grammar for RNA pseudoknot structures (Huang and Reidys, 2016). (More precisely, a λ-structures corresponds one-to-one to a pseudoknotted structure together with some additional information, i.e. a specific permutation of its backbone.) We illustrate this correspondence in Figure 13. This finding facilitates to extend all our results to pseudoknotted structures and offers insight in patterns and inverse folding of more general RNA structure classes as well as RNA-RNA interaction complexes.

Fig. 13

Disentanglement: by means of permuting the backbone of a pseudoknotted structure one resolves all crossings

Open in new tab Download slide

As mentioned above, the present analysis is just a first step and discusses embedded patterns in the sense of subsequent nucleotides. However our framework can deal with any embedded pattern. We think a deeper, conceptual analysis has to be undertaken aiming at identifying how a collection of structures provides sequence semantics. Quite possibly this can be done in the context of formal languages. We speculate that advancing this may lead to a novel class of embedded pattern recognition algorithms beyond sequence alignment.

Acknowledgments

Special thanks to Michael Waterman and Peter Stadler for their input on this manuscript. We gratefully acknowledge the help of Kevin Shinpaugh and the computational support team at VBI, Rebecca Wattam, Henning Mortveit, Madhav Marathe and Reza Rezazadegan for discussions. Many thanks for the constructive feedback of the anonymous reviewers and their suggestions.

Conflict of Interest: none declared.

References

Bon

et al. (

2008

)

Topological classification of RNA structures

J. Mol. Biol

379

900

–

911

Month:	Total Views:
November 2016	49
December 2016	17
February 2017	46
March 2017	36
April 2017	10
May 2017	32
June 2017	16
July 2017	7
August 2017	6
September 2017	13
October 2017	11
November 2017	16
December 2017	4
January 2018	10
February 2018	33
March 2018	13
April 2018	28
May 2018	18
June 2018	31
July 2018	20
August 2018	27
September 2018	20
October 2018	12
November 2018	29
December 2018	12
January 2019	14
February 2019	9
March 2019	22
April 2019	30
May 2019	30
June 2019	8
July 2019	14
August 2019	7
September 2019	19
October 2019	22
November 2019	15
December 2019	17
January 2020	20
February 2020	23
March 2020	21

Article Contents

Sequence–structure relations of biopolymers

Abstract

1 Introduction

2 Method

2.1 Secondary structures and loop decomposition

2.2 The partition function

2.3 Boltzmann sampling and patterns

3 Discussion

Acknowledgments

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only