Abstract

Motivation: Most prokaryotic genomes are circular with a single chromosome (called circular genomes), which consist of bacteria and archaea. Orthologous genes (abbreviated as orthologs) are genes directly evolved from an ancestor gene, and can be traced through different species in evolution. Shared orthologs between bacterial genomes have been used to measure their genome evolution. Here, organization of circular genomes is analyzed via distributions of shared orthologs between genomes. However, these distributions are often asymmetric and bimodal; to date, there is no joint distribution to model such data. This motivated us to develop a family of bivariate distributions with generalized von Mises marginals (BGVM) and its statistical inference.

Results: A new measure based on circular grade correlation and the fraction of shared orthologs is proposed for association between circular genomes, and a visualization tool developed to depict genome structure similarity. The proposed procedures are applied to eight pairs of prokaryotes separated from domain down to species, and 13 mycoplasma bacteria that are mammalian pathogens belonging to the same genus. We close with remarks on further applications to many features of genomic organization, e.g. shared transcription factor binding sites, between any pair of circular genomes. Thus, the proposed procedures may be applied to identifying conserved chromosome backbones, among others, for genome construction in synthetic biology.

Availability: All codes of the BGVM procedures and 1000+ prokaryotic genomes are available at http://www.stat.sinica.edu.tw/∼gshieh/bgvm.htm.

Contact:  [email protected]

Supplementary information:  Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Most of the prokaryotic genomes (1158 out of 1194, NCBI, August 2010) are circular with a single chromosome (called circular genomes henceforth). Orthologous genes (abbreviated as orthologs) are genes directly evolved from an ancestor gene (Tatusov et al., 1997), and can be traced through different species in evolution. The fraction of shared orthologs between two circular genomes was found to be better conserved than the order of genes (Huynen and Bork, 1998), in which the fraction of shared orthologs between genomes was employed to measure genome evolution of nine prokaryotes. Here, our emphasis is on the structure of circular genomes, which, for example, plays an important role in synthetic biology. A review paper in synthetic genomics (Carrera et al., 2009) indicates that genome organization may influence gene expression, which is vital for organisms. Further, predicting or modeling the rules of genome organization via comparative genomics may provide valuable information for genome construction.

We reason that genome structure can be studied via distributions of shared orthologs between genomes, e.g. the most or least favored region in which shared orthologs between each pair of bacterial genomes are located. While the marginal distributions of shared orthologs are often found to be asymmetric and bimodal, to date there is no joint distribution with closed-formed marginals to model such data. This motivated us to develop a family of joint distribution and its related statistical inferences.

Recent studies show that gene order is extensively conserved between closely related species, but rapidly become less conserved among more distantly related species. This trend is likely to be universal in prokaryotes (Tamames, 2001; Wolf et al., 2001). However, the fraction of shared orthologs between two circular genomes is more conserved than the order of genes (see Fig. 6 of Huynen and Bork, 1998). In addition to the fraction of shared orthologs between bacterial genomes, we further incorporate the distributions of shared orthologs of paired circular genomes to infer similarity of their genome organization. By converting the locations of shared orthologs in any paired circular genomes into angles, these pairs of angles can be viewed as bivariate circular vectors.

Most of the marginal distributions of shared orthologs in circular genomes are asymmetric and/or multimodal, which can be modeled by the generalized von Mises distribution (GVM) (Maksimov, 1967; Yfantis and Borgman, 1982). Therefore, we propose a family of bivariate circular distributions with each marginal assuming a GVM distribution, and call this family the bivariate generalized von-Mises (BGVM). The inference procedures, estimation of the parameters in BGVM and testing for independence of structures of paired circular genomes, are developed. A novel correlation measure, which is based on the fraction of shared orthologs and a circular grade correlation derived here, is introduced to measure organization similarity between circular genomes. Furthermore, this similarity is visually depicted by the rose diagrams of shared orthologs.

The marginal distributions of BGVM are quite flexible since GVM can model either symmetric or asymmetric, unimodal or multimodal circular data, depending on the values of its four parameters. The probability density function (pdf) of GVM is in closed form and thus is convenient for inferences; see Yfantis and Borgman (1982) for details.

In the following, we briefly review literature on modeling bivariate circular data, e.g. shared orthologs of paired prokaryotic genomes. Thompson (1975) mentioned the wrapped bivariate normal distribution. Mardia (1975) proposed a generalized von Mises–Fisher model for paired angles (θ1, θ2). Wehrly and Johnson (1980) generated families of bivariate circular distributions with jpdf:
(1)
where 0 ≤ θ1, θ2 < 2π, g, f1 and f2 are densities on the circle, and F1 and F2 are the distribution functions of f1 and f2, respectively. The pdf in Equation (1) has f1 and f2 as its specified marginals. Shieh and Johnson (2005) studied a family of bivariate distributions, belonging to these families in Equation (1), with von Mises marginals and its inferences. Rivest (1988) investigated inferences for a bivariate generalization of the Fisher–von Mises distribution. Unfortunately, the marginals of these distributions are symmetric and unimodal, hence they cannot be employed to model distributions of shared orthologs in paired circular genomes. Bivariate circular distributions with asymmetric and/or multimodal but closed-formed marginals have not been well explored, thus we study them here.

2 METHODS

Here, we investigate inference procedures under BGVM which has the probability density function:
(2)
where 0 ≤ θ1, θ2 < 2 π, κ12 ≥ 0, 0 ≤ μ12 < 2π and I0(·) denotes the modified Bessel function of the first kind and order zero. These GVM marginals assume the forms
where Cj−1 = 2π {I0j)I0j) + 2 ∑n=1Inj)I2nj) · cos[2nj − νj)]}. These μj and νjj and λj) are location (dispersion) parameters, and we denote this GVM distribution as GVM (μj, νj, κj, λj) henceforth. The joint density BGVM involves the cumulative distributions Fj0) = ∫0θ0fj(θ)dθ, but with todays computing power this is not a serious drawback. We note that this class of bivariate circular distributions can include other marginals, e.g. t-distribution (Shimizu and Iida, 2002), von Mises distribution (Shieh and Johnson, 2005) and Jones and Pewsey's distribution (Jones and Pewsey, 2005).

2.1 Roles of some parameters in the BGVM model

The parameters κ and λ in a GVM (μ, ν, κ, λ) marginal control not only the concentration at θ = μ and θ = ν but also the graphic shape of the density. If κ = 0, the GVM is antipodally symmetric and has two modes, depending on λ, at θ = ν and θ = ν ± π. If λ = 0, the GVM reduces to the von Mises distribution, which is symmetric and unimodal, with mean direction μ and concentration κ; the larger κ is the more concentrated when the distribution is at μ. If μ12 = 0 , the BGVM is a unimodal distribution, and as κ12 increases the association between Θ1 and Θ2 increases, which indicates that the association between the given pair of circular genomes increases. From the copula representation in Section 2 of Shieh et al. (2006), we have that Φ1 = 2π F11) given θ2 is VM (μ12 + 2πF22), κ12). When μ12 = 0, Φ1 centers on 2πF22), and the dependence of Θ1 on Θ2 is controlled by the magnitude of κ12.

A BGVM generator written in R is available at http//:www.stat.sinica.edu.tw/∼gshieh/bgvm.htm.

2.2 Estimation and testing hypothesis under a BGVM distribution

2.2.1 MLEs of the parameters

Assuming that shared orthologs between a pair of circular genomes follow a BGVM distribution, given their converted angles we can immediately obtain the fitted BGVM distribution by applying the MLE algorithm in MLE_LR-test.R of the Supplementary Material. The MLE algorithm was modified from the Newton–Raphson method (Tanner, 1996), to prevent the estimates for the model in Equation (2) being trapped in local maxima, which was caused by the sinusoidal functions in the joint density. Next, we derive the limiting distribution of MLEs for the parameters in BGVM. The likelihood ratio test for independence of the organization between genomes is addressed at the end of this section.

To see how the MLE algorithm performs, we show its estimation results using data generated from BGVM with the parameters (μ1, ν1, κ1, λ1, μ2, formula, formula and the sample size 100. We first estimated parameters of the marginals and obtained formula and formula. Next, we used these estimates as initial values and applied the MLE algorithm to estimate parameters of the joint density. The estimates obtained are formula, formula = (1.33, 1.65, 0.92, 0.95, 1.62, 1.55, 1.02, 0.90, 0.92, 2.32), which are quite close to the true values.

After the MLEs been computed, estimates of their variances and covariances can be obtained from derivatives of the likelihood or by numerical integration of the squares and products of the partial derivatives given in Section 3 of Shieh et al. (2006). Let η = (μ1, ν1, κ1, λ1, μ2, ν2, κ2, λ2, μ12, κ12)T be the vector of 10 unknown parameters. The parameter space is then
for i = 1, 2. Let ln(η) denote the log-likelihood function, and Un(η) and −In(η) the first and the second derivatives of ln(η), respectively.

When the vector of parameters η belongs to the interior of the parameter set, the regularity conditions (see Section 3 of Shieh et al., 2006 for details) hold, and the consistency of the MLEs follows, which implies that the vector of MLEs multiplied by formula converges to their true parameters. Asymptotic multivariate normality of the MLEs then follows from Lemma 1 and Theorem 2 of Self and Liang (1987), which indicates that this centered vector converges to a multivariate normal distribution as stated below.

Theorem 1.
 
Let η belong to the interior of the parameter set. The vector of MLEs formula has multivariate normal limiting distribution, and
where I is the Fisher information matrix with entries
The 10 partial derivatives are straightforward and are omitted.
Remark
 

If some of the parameters are known, then asymptotic normality holds for the reduced set with the corresponding entries in the information matrix.

Numerical integration can be used to obtain each of the entries Iik's. However, this could be difficult and it is better to use the estimated, or empirical, information with terms
where each partial derivative is evaluated at the MLEs. Different asymptotics apply if we include κ12 = 0 in the parameter space. If κi and λi,i = 1, 2 are still bounded away from 0, then according to the results in Self and Liang (1987), Theorem 2 and Case 2 of page 606, formula has limiting distribution Z1I[Z1 > 0 ] , where Z1 has a normal distribution with variance determined from I−1. The cases where κ1 = 0, λ1 = 0, κ2 = 0 and/or λ2 = 0 are treated similarly.

2.2.2 Testing for independence of the organization between circular genomes

To test the hypothesis that the organization of paired prokaryotic genomes are independent or not, e.g. whether the structure organization of Thermotoga and that of Sulfolobus are independent, we consider likelihood ratio tests (LR tests), which include two cases. In both cases, under the BGVM joint distribution this test is equivalent to testing
The likelihood ratio statistic is −2lnλn, where
Δ0 = {μ1, μ2, ν1, ν2, κ1, κ2, λ1, λ2}, and Δ1 = {μ1, ν1, μ2, ν2, μ12, κ1, κ2, λ1, λ2, κ12}.

There are really two cases, and the second case is the most complicated.

Case 1. When all parameters are unknown and none of κ1, λ1, κ2, λ2 are zero. The limiting distribution of −2ln λn is
where χr2 denotes the chi-square distribution with r degrees of freedom, by Case 5 of Self and Liang (1987).
Case 2. When all parameters are unknown and only one among κ1, λ1, κ2 and λ2 is zero. In practice, this case occurs when MLEs of the four parameters are close to zero. The limiting distribution of −2ln λn is
where formula, Iij's are the (i, j) entries of the information matrix I(η) and c2 = c3 = 1/4 − c1. Furthermore,
f(x, y, α) = [(x2 + y2)1/2x sinα]/cosα and Φ(·) is the standard normal cumulative distribution function. For details, see Case 8 of Self and Liang (1987).

Codes in R to obtain MLEs and the LR test are available at http://www.stat.sinica.edu.tw/∼gshieh/bgvm.htm.

2.3 A novel circular correlation measure rpg

Huynen and Bork (1998) used the fraction of shared orthologs between genomes to study genome evolution. The ratio of shared orthologs between a pair of genomes to the maximum possible number of shared orthologs, which is estimated by the size of the smaller genome in each pair of genomes, is a reasonable index for association of organization between circular genomes. This ratio (denoted by rp) can be written as
To characterize association between a pair of circular genomes more finely, the similarity of their distributions of shared orthologs should be taken into account. This association can be measured by a circular grade correlation (rg) derived in the next subsection. Weighting both rp and rg equally gives rise to a novel circular correlation measure
Note that there is a parameter, namely protein sequence similarity, involved when choosing one-to-one orthologs between genomes. The rule of thumb is to choose a protein sequence similarity value such that rp and rg are of the same scale, e.g. 0.7 was suggested for the threshold of protein sequence similarity in a pilot study, in which the BGVM procedures were applied to the nine prokaryotic genomes in Application 1.

2.3.1 A circular grade correlation measure rg

When the joint pdf of two r.v.'s on the real line, say X1 and X2, involves the marginal cumulative distributions Fi(Xi), i = 1, 2, the grade correlation (ρH) (Hoeffding, 1948) was shown to be suitable for correlation of X1 and X2. This correlation measure is defined as
Recall that (Θ1k, Θ2k), k = 1, …, n, denote the mapped angles of n pairs of shared orthologs between prokaryotic Genomes 1 and 2. When the joint pdf of these pairs of angles involves the marginal cumulative distributions Fii), i = 1, 2, a suitable correlation may be modified from the grade correlation ρH. Similar to previous works on deriving association measures for bivariate circular data (Shieh et al., 1994; Shieh and Johnson, 2005), a new circular grade correlation coefficient for bivariate circular data, that allows for different choices of the origins in each genome, is
where Fii) is uniformly distributed on [0, 1], for i = 1, 2, and mod2 + δ) denotes Θ2 + δ (mod 2π). Via some computation in Appendix1.pdf of the Supplementary Material, we obtained that the circular grade correlation
The MLE for the circular grade correlation ρg assumes the following form.
where A = { 2π · i/n : i = 0, 1,…, n − 1}, formula is the MLE of the F2 marginal distribution and formula and formula are MLEs for κ12 and μ12.

2.3.2 The large sample (limiting) distribution of rg

For each pair of genomes, the number of shared orthologs can range from a few hundred to more than two thousand. With such large sample sizes, the limiting distribution of the circular grade correlation estimate rg in general follows a normal distribution. To provide confidence intervals for ρg to judge whether the similarity on distributions of shared orthologs of paired circular genomes is different from zero (independence) or not, we derive the limiting distribution of the centered rg.

Because rg is the MLE of ρg, the large sample distribution of rg follows a normal distribution with its covariance matrix computed from angles of shared orthologs. That is,
where the covariance matrix D2 can be computed by our code,
formula and formula are derived in Appendix2.pdf of the Supplementary Material, which provides a proof for the convergence of the centered rg to a normal distribution.

The 1 − α confidence interval of ρg1, θ2) is [rgZα/2D, rg + Zα/2D], where 0 < α < 1 and Zα/2 is the critical value of the standard normal distribution N(0, 1) at the level of α/2.

2.3.3 A visualization tool: rose diagrams

To visually depict association between paired circular genomes, we applied rose diagrams (a kind of angular histogram; Fisher, 1993) to each pair of shared orthologs. The radius of each sector with a binwidth of 10 or other angles is proportional to the relative frequency of shared orthologs, whose corresponding angles belong to the sector, e.g. 60 ∼ 70 of the rose diagram of Sulfolobus in Figure 1, where the inner dotted (outer) circle denotes the frequency being 10 (20). In general, a small binwidth results in a bumpier rose diagram than that of a large one.

The rose diagrams of shared orthologs, identified by protein sequence similarity 0.7, of three pairs of prokaryotic genomes.
Fig. 1.

The rose diagrams of shared orthologs, identified by protein sequence similarity 0.7, of three pairs of prokaryotic genomes.

3 RESULTS

In this section, we apply the proposed procedures to shared orthologs between paired prokaryotic genomes to measure similarity of their genome organization. However, we note that our procedures can be applied to many features of genomic organization, e.g. shared TFBSs, shared repeated elements (Benson and Waterman, 1994) and shared non-coding genes (e.g. rRNA and small RNA), by inputting their corresponding angles instead of those of shared orthologs to our algorithm. For instance, both Escherichia coli and Bacillus subtilis have more verified TFBSs than other bacteria, and we can analyze their shared TFBSs using procedures developed here.

3.1 Data preprocessing

To let the shared orthologs between paired circular genomes reflect their similarity, we filtered out predicted horizontal transferred genes. The filtering was performed using data from the Horizontal Gene Transfer Database (Garcia-Vallve et al., 2003), which were accessed at http://genomes.urv.es/HGT-DB/. After the filtering, only species-specific genes were left in each circular genome, and we mapped their physical positions into angles ranged in [0, 2π). As an ortholog in each circular genome has a clustered orthologous group (COG) ID (Tatusov et al., 1997), the genes having the same COG ID in any paired circular genomes were identified as orthologs.

Paralogous genes (abbreviated as paralogs) are genes related by duplication within a genome. Due to gene duplication, there may be orthologous sets of paralogs within one genome, and several paralogs may share the same COG ID. That is, there are one-to-many, many-to-one and many-to-many corresponding shared orthologs besides the one-to-one corresponding ones. Using the following minimum angle distance, we can identify one-to-one shared orthologs. Let D(A, B) denote the distance between orthologs A and B, and D(A, B) = min{|angle of A − angle of B|, 2π − |angle of A − angle of B|}. For an orthologous set (multiple paralogs with the same COG ID), we use the shortest distance ortholog pairs (A, B), where Dmin(A, B) = mini,jD(Ai, Bj) and 1 ≤ im, 1 ≤ jn, provided that m and n are orthologs in genomes A and B corresponding to one COG ID.

Let θij and θkj for 1 ≤ jn, denote the corresponding angle of the first base of the shared ortholog j in genomes i and k, which were used for statistical inferences. Note that within any given pair of genomes, the locations of distinct ortholog pairs are assumed to be independent and identically distributed (i.i.d.) samples. Because marginals of shared ortholog genomes were often found to be asymmetric and bimodal, we fitted a BGVM model to each pair of shared orthologs for the prokaryotic genomes studied.

In the following, we apply the proposed BGVM procedures and rose diagrams to two sets of prokaryotic genomes, consisting of eight pairs of prokaryotic genomes separated from domain down to species, and 13 mycoplasma bacteria which are mammalian pathogens belonging to the same genus.

3.2 Application 1: eight pairs of prokaryotic genomes separated from domain to genus

In this application, we compare the structure organizations for each of the following pairs of bacterial genomes, which are (Thermotoga, Sulfolobus), (Clostridium, Thermotoga), (Lactobacillus, Clostridium), (Staphylococcus, Lactobacillus), (Oceanobacillus, Staphylococcus), (Bacillus, Oceanobacillus), (Bacillus anthracis, Bacillus subtitlis) and (Bacillus anthracis Ames, Bacillus anthracis Sterne). These pairs of genomes belong to different domains, the same domain, phylum class, order, family, genus and the same species, respectively. Among them, we first tested whether the genome structures of each pair are independent or not. After using the threshold of 0.7 for protein similarity to identify shared orthologs between each pair of genomes, we plotted their rose diagrams, which showed that several of the marginals are bimodal. Due to a space limit, only the rose diagrams (using a binwidth of 10) of the first three pairs are presented in Figure 1; see Supplementary Figure S1 in Application1.pdf for all rose diagrams. Some marginal distributions of these pairs of bacterial genomes were asymmetric, based on the result of the test for symmetry in Pewsey (2002). For instance, some marginals of (Thermotoga, Sulfolobus), (Clostridium, Thermotoga) and (Lactobacillus, Clostridium) are asymmetric at the 95% significance level; see Table 1 for details.

Table 1.

Test for symmetry (Tn) at 97.5% significance level

Bacteria nameT  n97.5% critical valueSymmetric
Thermotoga0.661.96Yes
Sulfolobus5.231.96No
Clostridium5.751.96No
Thermotoga0.631.96Yes
Lactobacillus|−0.93|1.96Yes
Clostridium2.961.96No
Bacteria nameT  n97.5% critical valueSymmetric
Thermotoga0.661.96Yes
Sulfolobus5.231.96No
Clostridium5.751.96No
Thermotoga0.631.96Yes
Lactobacillus|−0.93|1.96Yes
Clostridium2.961.96No
Table 1.

Test for symmetry (Tn) at 97.5% significance level

Bacteria nameT  n97.5% critical valueSymmetric
Thermotoga0.661.96Yes
Sulfolobus5.231.96No
Clostridium5.751.96No
Thermotoga0.631.96Yes
Lactobacillus|−0.93|1.96Yes
Clostridium2.961.96No
Bacteria nameT  n97.5% critical valueSymmetric
Thermotoga0.661.96Yes
Sulfolobus5.231.96No
Clostridium5.751.96No
Thermotoga0.631.96Yes
Lactobacillus|−0.93|1.96Yes
Clostridium2.961.96No

For asymmetric and bimodal marginals of shared orthologs, it was reasonable to fit GVM distributions. Goodness-of-fit was checked by QQ-plots, similar to those in Shieh et al. (2006), which indicated that the fits were adequate for the marginals. For the fitted GVMs and QQ-plots, please see the file at http://www.stat.sinica.edu.tw/∼gshieh/bgvm.htm. The asymmetry and bimodality of the marginals suggested fitting a BGVM to shared orthologs for each pair of prokaryotic genomes. The MLE estimates of parameters in BGVM of (Thermotoga, Sulfolobus), (Clostridium, Thermotoga) and (Lactobacillus, Clostridium) are given in Table 2.

Table 2.

The shared orthologs of three pairs of circular genomes, identified by protein sequence similarity 0.7, fitted by BGVM models

Genome (i, j)formulaformulaformulaformulaformulaformulaformulaformulaformulaformula
(6, 7)0.320.273.141.790.310.495.783.050.680.62
(7, 8)0.280.405.393.130.130.314.872.250.286.21
(8, 9)0.200.112.190.300.580.610.241.330.185.49
Genome (i, j)formulaformulaformulaformulaformulaformulaformulaformulaformulaformula
(6, 7)0.320.273.141.790.310.495.783.050.680.62
(7, 8)0.280.405.393.130.130.314.872.250.286.21
(8, 9)0.200.112.190.300.580.610.241.330.185.49

Due to space limit, we use genome numbers 1–9 to represent the following nine prokaryotic genomes, respectively: Bacillus anthracis Ames, Bacillus anthracis Sterne, Bacillus subtilis, Oceanobacillus, Staphylococcus, Lactobacillus, Clostridium, Thermotoga and Sulfolobus.

Table 2.

The shared orthologs of three pairs of circular genomes, identified by protein sequence similarity 0.7, fitted by BGVM models

Genome (i, j)formulaformulaformulaformulaformulaformulaformulaformulaformulaformula
(6, 7)0.320.273.141.790.310.495.783.050.680.62
(7, 8)0.280.405.393.130.130.314.872.250.286.21
(8, 9)0.200.112.190.300.580.610.241.330.185.49
Genome (i, j)formulaformulaformulaformulaformulaformulaformulaformulaformulaformula
(6, 7)0.320.273.141.790.310.495.783.050.680.62
(7, 8)0.280.405.393.130.130.314.872.250.286.21
(8, 9)0.200.112.190.300.580.610.241.330.185.49

Due to space limit, we use genome numbers 1–9 to represent the following nine prokaryotic genomes, respectively: Bacillus anthracis Ames, Bacillus anthracis Sterne, Bacillus subtilis, Oceanobacillus, Staphylococcus, Lactobacillus, Clostridium, Thermotoga and Sulfolobus.

Employing the LR-test based on the distributions of shared orthologs, we test whether genome organization of each pair of these prokaryotes are independent. (Not) rejecting this hypothesis indicates that their genome organizations are (not) correlated. For each pair of prokaryotic genomes, a LR-test of shared orthologs was computed, and the result showed that except (Thermotoga, Sulfolobus), the hypothesis was rejected for the other seven pairs at a 95% level using Case 2; see Application1.pdf of the Supplementary Material for details. This result indicates that for these seven pairs, their genome organizations are correlated based on the similarity of the distributions of their shared orthologs. This association can also be seen easily from their rose diagrams, in particular those of Bacillus anthracis Ames and Bacillus anthracis Sterne.

Next, we estimated the association of genome organizations of each of the eight pairs of prokaryotes. Note that the origins of most prokaryotic genomes are unknown (and having no rules to predict) or predicted, which demonstrates the importance of rpg's being invariant to the origins of both circular genomes. The values of rp, rg and rpg of these pairs (in Table 3) are incidentally consistent with their phylogeny relationships. The rose diagrams show that the genome structures of Bacillus anthracis Ames and Bacillus anthracis Sterne are almost identical, while those of Thermotoga and Sulfolobus are not at all alike.

Table 3.

The values of correlation measures rg, rp, rpg of nine prokaryotic genomes based on similarity of the distribution of shared orthologs

(Genome i, Genome j)r  gr  pr  pg
( 1, 2 )0.840.490.67
( 2, 3 )0.440.440.44
( 3, 4 )0.480.460.47
( 4, 5 )0.360.500.43
( 5, 6 )0.240.510.38
( 6, 7 )0.200.340.27
( 7, 8 )0.080.300.19
( 8, 9 )0.050.110.08
(Genome i, Genome j)r  gr  pr  pg
( 1, 2 )0.840.490.67
( 2, 3 )0.440.440.44
( 3, 4 )0.480.460.47
( 4, 5 )0.360.500.43
( 5, 6 )0.240.510.38
( 6, 7 )0.200.340.27
( 7, 8 )0.080.300.19
( 8, 9 )0.050.110.08
Table 3.

The values of correlation measures rg, rp, rpg of nine prokaryotic genomes based on similarity of the distribution of shared orthologs

(Genome i, Genome j)r  gr  pr  pg
( 1, 2 )0.840.490.67
( 2, 3 )0.440.440.44
( 3, 4 )0.480.460.47
( 4, 5 )0.360.500.43
( 5, 6 )0.240.510.38
( 6, 7 )0.200.340.27
( 7, 8 )0.080.300.19
( 8, 9 )0.050.110.08
(Genome i, Genome j)r  gr  pr  pg
( 1, 2 )0.840.490.67
( 2, 3 )0.440.440.44
( 3, 4 )0.480.460.47
( 4, 5 )0.360.500.43
( 5, 6 )0.240.510.38
( 6, 7 )0.200.340.27
( 7, 8 )0.080.300.19
( 8, 9 )0.050.110.08

3.3 Application 2: 13 mycoplasma bacteria

Above we showed that the BGVM distribution and its inference procedures can be applied to distinguish genome organization of eight pairs of bacteria in Application 1, in which genomes of six pairs were relatively distant (from different genus or beyond). Here, we apply the proposed procedures to a set of 13 mycoplasma bacteria, which are relatively close in phylogeny, to see if rpg can distinguish their genome organization. These bacteria are mammalian pathogens with small genomes, which belong to the same genus. Specifically, their phylogeny tree based on 16S ribosomal sequence, constructed by the software MEGA4, is in Application2.pdf of the Supplementary Material. Among these bacterial genomes, three origins are unknown and nine are predicted, which demonstrates the need of a correlation measure invariant to the origin of a circular genome such as rpg.

The pairwise rose diagrams (with a binwidth of 10) of these 13 bacterial genomes showed that several of the marginals were bimodal and asymmetric; see http://www.stat.sinica.edu.tw/∼gshieh/bgvm.htm for details. Hence, we fitted a BGVM distribution to the shared orthologs for each pair of bacterial genomes. Next, the correlation measure rpg was applied to these genomes to result in the values in Table 4.

Table 4.

The values of correlation measure rpg between 13 mycoplasma bacteria based on similarity of the distribution of shared orthologs

r  pg  
No.12345678910111213
10.280.250.770.250.280.270.270.280.390.280.260.28
20.250.220.300.250.270.270.350.240.250.220.25
30.190.330.230.260.260.300.210.250.240.24
40.190.210.200.210.200.270.210.190.27
50.200.230.230.290.170.250.240.21
60.670.690.260.200.190.180.18
70.710.270.200.200.190.20
80.270.200.210.200.20
90.220.240.230.18
100.230.210.25
110.600.24
120.22
13
r  pg  
No.12345678910111213
10.280.250.770.250.280.270.270.280.390.280.260.28
20.250.220.300.250.270.270.350.240.250.220.25
30.190.330.230.260.260.300.210.250.240.24
40.190.210.200.210.200.270.210.190.27
50.200.230.230.290.170.250.240.21
60.670.690.260.200.190.180.18
70.710.270.200.200.190.20
80.270.200.210.200.20
90.220.240.230.18
100.230.210.25
110.600.24
120.22
13

Due to space limit, we use genome numbers 1–13 to represent the following 13 bacterial genomes, respectively: M. genitalium G37, M. mobile 163K, M.synoviae 53, M.pneumoniae M129, M.agalactiae PG2, M.hyopneumoniae 232, M.hyopneumoniae J, M.hyopneumoniae 7448, M.pulmonis UAB CTIP, M.gallisepticum R, M.capricolum subsp. capricolum ATCC 27343, M.mycoides subsp. mycoides SC str. PG1, and M.penetrans HF-2. Further, No. denotes genome numbers.

Table 4.

The values of correlation measure rpg between 13 mycoplasma bacteria based on similarity of the distribution of shared orthologs

r  pg  
No.12345678910111213
10.280.250.770.250.280.270.270.280.390.280.260.28
20.250.220.300.250.270.270.350.240.250.220.25
30.190.330.230.260.260.300.210.250.240.24
40.190.210.200.210.200.270.210.190.27
50.200.230.230.290.170.250.240.21
60.670.690.260.200.190.180.18
70.710.270.200.200.190.20
80.270.200.210.200.20
90.220.240.230.18
100.230.210.25
110.600.24
120.22
13
r  pg  
No.12345678910111213
10.280.250.770.250.280.270.270.280.390.280.260.28
20.250.220.300.250.270.270.350.240.250.220.25
30.190.330.230.260.260.300.210.250.240.24
40.190.210.200.210.200.270.210.190.27
50.200.230.230.290.170.250.240.21
60.670.690.260.200.190.180.18
70.710.270.200.200.190.20
80.270.200.210.200.20
90.220.240.230.18
100.230.210.25
110.600.24
120.22
13

Due to space limit, we use genome numbers 1–13 to represent the following 13 bacterial genomes, respectively: M. genitalium G37, M. mobile 163K, M.synoviae 53, M.pneumoniae M129, M.agalactiae PG2, M.hyopneumoniae 232, M.hyopneumoniae J, M.hyopneumoniae 7448, M.pulmonis UAB CTIP, M.gallisepticum R, M.capricolum subsp. capricolum ATCC 27343, M.mycoides subsp. mycoides SC str. PG1, and M.penetrans HF-2. Further, No. denotes genome numbers.

Interestingly, these associations are consistent with those measured based on similarity of their 16S rRNAs. Among these 13 bacterial genomes, the top few correlated pairs in terms of rpg are (M.genitalium G37, M.pneumoniae M129), (M.hyopneumoniae J, M.hyopneumoniae 7448), (M.hyopneumoniae 232, M.hyopneumoniae 7448) and (M.hyopneumoniae 232, M.hyopneumoniae J), whose values are equal to 0.77, 0.71, 0.69 and 0.67, respectively. The latter three pairs are hyopneumoniae, which cause chronic disease in pigs and attach to the same organ (lung cilia). From the rose diagrams of these hyopneumoniae in Figure 2, we can see that their genome structures are quite similar in terms of shared orthologs. The most favored regions of their shared orthologs are 80 − 90, (20 − 30 and 330 − 340) and (20 − 30 and 330 − 340), respectively, in which 23, (16 and 17) and (14 and 16) genes are in common, respectively. Among these genes, the largest common functional group is translation, and their gene orders are also conserved. For details, please see Supplementary Tables S1 and S2 in Application2.pdf. In particular, the rose diagrams of (M.hyopneumoniae J, M.hyopneumoniae 7448) clearly depict their similarity in genome organization, and the most and least favored regions of shared orthologs are 80 − 90 and 150 − 160, respectively. The genes located in the most favored region, encode ribosomal subunit proteins 30S and 50S, which are essential genes and are known to cluster together to form an operon structure in many bacterial genomes (Bratlie et al., 2010).

The rose diagrams of shared orthologs, identified via protein sequence similarity 0.7, of three pairs of hyopneumoniae bacteria.
Fig. 2.

The rose diagrams of shared orthologs, identified via protein sequence similarity 0.7, of three pairs of hyopneumoniae bacteria.

Moreover, the aforementioned relationships depicted by rose diagrams are consistent with those of the other analysis on genome organization (Application2.pdf of the Supplementary Material). Thus, our procedures may provide useful information on comparing the organization of circular genomes and genome construction in synthetic biology.

4 DISCUSSION

The proposed BGVM distribution, with flexible and closed-formed marginal distributions, was shown to model paired angular data well, in which each marginal of the pair can be asymmetric or/and multimodal, e.g. shared orthologs between paired bacterial genomes. Statistical inferences employing the MLEs for parameters of the BGVM distribution and the LR-test for independence of the organization between circular genomes are established. A novel circular correlation measure (rpg), invariant to the choices of origins of circular genomes which are mainly unknown or predicted, was derived and a visualization tool provided. We applied these procedures to two sets of prokaryotic genomes, consisting a set of relatively distant prokaryotes and a set of closely related bacteria. The novel correlation measure rpg was shown to summarize associations between prokaryotic genomes well. While the rose diagrams further depict their associations via distributions of the shared orthologs. Our results are consistent with those of other analysis. Thus, the BGVM procedures may provide useful information on the organization of circular genomes.

We note that the BGVM procedures can be applied to identify shared TFBSs, shared coding or non-coding genes and shared repeated elements between circular genomes. Thus, they may be applied to identifying conserved chromosome backbones, detecting conserved gene clusters such as operons, among others, which can be used for genome construction in synthetic biology.

ACKNOWLEDGEMENTS

We are indebted to Jia-Hong Wu and Su-Wei Hsu for computational assistance; Shih-Chung Chuang for statistical assistance. We thank Chuang-Hsiung Chang for discussions, and the AE and two reviewers for constructive comments that improved this article.

Funding: Shurong Zheng was supported by a postdoctoral fellowship from Academia Sinica thematic grant (23-33); the research was supported in part by National Science Council, Republic of China (96-2118-M-001-010-MY2) and National Research Program-Genome Medicine grant (NSC99-3112-B-001-015) for G.S.S.

Conflict of Interest: none declared.

REFERENCES

Benson
G.
Waterman
M.S.
,
A method for fast database search for all k-nucleotide repeats
Nucleic Acids Res.
,
1994
, vol.
22
(pg.
4828
-
4836
)
Bratlie
M.S.
et al.
,
Relationship between operon preference and functional properties of persistent genes in bacterial genomes
BMC Genomics
,
2010
, vol.
11
pg.
71
Carrera
J.
et al.
,
Model-based redesign of global transcription regulation
Nucleic Acids Res.
,
2009
, vol.
37
pg.
e38
Fisher
N.I.
Statistical Analysis of Circular Data.
,
1993
Cambridge, UK
Cambridge University Press
Garcia-Vallve
S.
et al.
,
HGT-DB: a database of putative transferred genes in prokaryotic complete genomes
Nucleic Acids Res.
,
2003
, vol.
31
(pg.
187
-
189
)
Hoeffding
W.
,
A class of statistics with asymptotically normal distribution
Ann. Math. Statist.
,
1948
, vol.
19
(pg.
293
-
325
)
Huynen
M.A.
Bork
P.
,
Measuring genome evolution
Proc. Natl Acad. Sci. USA
,
1998
, vol.
95
(pg.
5849
-
5856
)
Jones
M.C.
Pewsey
A.
,
A family of symmetric distributions on the circle
J. Am. Stat. Assoc.
,
2005
, vol.
100
(pg.
1422
-
1428
)
Maksimov
V.M.
,
Necessary and sufficient statistics for a family of shifts of probability distributions on continuous bicompact groups
Teor. Verojatnost. i Primenen
,
1967
, vol.
12
(pg.
307
-
321
)
Mardia
K.V.
Patil
G.P.
et al.
,
Characterizations of directional distributions
Statistical Distributions in Scientific Work
,
1975
, vol.
3
Dordrecht
Reidel
(pg.
365
-
385
)
Pewsey
A.
,
Testing circular symmetry
Canadian J. Stat.
,
2002
, vol.
30
(pg.
591
-
600
)
Rivest
L.-P.
,
A distribution for dependent unit vectors
Commun. Statist. Theor. Meth.
,
1988
, vol.
17
(pg.
461
-
483
)
Rukhin
A.L.
,
Some statistical decisions about distribution on a circle for large samples
Sanhkyā Ser. A
,
1972
, vol.
34
(pg.
243
-
250
)
Self
S.G.
Kung-Yee
L.
,
Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions
J. Am. Stat. Assoc.
,
1987
, vol.
82
(pg.
605
-
610
)
Shieh
G.S.
et al.
,
Testing for independence of bivariate circular data and weighted degenerate U-statistics
Stat. Sinica
,
1994
, vol.
4
(pg.
729
-
747
)
Shieh
G.S.
Johnson
R.A.
,
Inferences based on a bivariate distribution with von Mises marginals
Ann. Inst. Stat. Math.
,
2005
, vol.
57
(pg.
789
-
802
)
Shieh
G.S.
et al.
,
A bivariate generalized von Mises distribution with applications to circular genomes
Technical Report C2006-06
,
2006
Taiwan, R.O.C.
Institute of Statistical Science, Academia Sinica
Singh
H.
et al.
,
Probablistic model for two dependent circular variables
Biometrika
,
2002
, vol.
89
(pg.
719
-
723
)
Tamames
J.
,
Evolution of gene order conservation in prokaryotes
Genome Biol.
,
2001
, vol.
2
(pg.
1
-
11
)
Tanner
M.A.
Tools for Statistical Inference.
,
1996
New York
Springer
Tatusov
R.L.
et al.
,
A genomic perspective on protein families
Science
,
1997
, vol.
278
(pg.
631
-
637
)
Thompson
J.W.
,
Contribution to discussion of paper by K. V. Mardia
J. R. Stat. Soc. B
,
1975
, vol.
37
pg.
379
Wehrly
T.E.
Johnson
R.A.
,
Bivariate models for dependence of angular observations and a related Markov process
Biometrika
,
1980
, vol.
67
(pg.
255
-
256
)
Wolf
Y.I.
et al.
,
Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context
Genome Res.
,
2001
, vol.
11
(pg.
356
-
372
)
Yfantis
E.A.
Borgman
L.E.
,
An extension of the Von Mises distribution
Commun. Stat. Theor. Meth.
,
1982
, vol.
11
(pg.
1695
-
1706
)

Author notes

Present address: Shurong Zheng, KLAS and Mathematics and Statistics, Northeast Normal University, Changchun 130024, China.

Associate Editor: Alfonso Valencia

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.