Alexandra M Kasianova, Vladislav D Mityukov, Dmitry A German, Artem S Kasianov, Aleksey A Penin, Maria D Logacheva, Chromosome-Scale Assembly of Capsella orientalis, Maternal Progenitor of Cosmopolitan Allotetraploid C. bursa-pastoris, Genome Biology and Evolution, Volume 17, Issue 1, January 2025, evaf009, https://doi.org/10.1093/gbe/evaf009
Abstract
The genus Capsella serves as a model for understanding speciation, hybridization, and genome evolution in plants. Here, we present a chromosome-scale genome assembly of Capsella orientalis, the maternal progenitor of the cosmopolitan allotetraploid C. bursa-pastoris. Using nanopore sequencing and data on chromatin contacts (Hi-C), we assembled the genome into eight pseudo-chromosomes; the assembly is highly complete, with a benchmarking universal single-copy orthologs (BUSCO) completeness score of 99.3%. Comparative analysis with C. rubella and C. bursa-pastoris revealed overall synteny, except for a 2 Mb inversion on chromosome 4 of C. rubella. Comparative genome analysis highlighted the conservation of gene content and structural integrity in the C. orientalis-derived subgenome of C. bursa-pastoris, with the exception of a 1.8 Mb region that is absent from the O subgenome but present in C. orientalis. The genome annotation includes 27,675 protein-coding genes, most of which exhibit one-to-one orthology with Arabidopsis thaliana genes. Notably, 2,155 genes showed no similarity to any A. thaliana gene. These results establish a robust genomic resource for C. orientalis, facilitating future studies on polyploid evolution, gene regulation, and species divergence within Capsella.
We assembled the genome of Capsella orientalis, a diploid progenitor of C. bursa-pastoris, a cosmopolitan tetraploid of hybrid origin. The availability of a chromosome-scale genome sequence will allow C. orientalis to be included in evolutionary studies exploring the posthybridization genome modification of C. bursa-pastoris. In particular, it provides an unbiased reference for exploring the genetic and epigenetic changes that occurred in the maternal subgenome. It also promotes studies of this underexplored species itself, including its identification using DNA-based methods.
Introduction
The genus Capsella is a long-standing model for many aspects of plant biology (for review see Sicard and Lenhard 2018). These include speciation (Guo et al. 2009), hybridization and posthybridization genome modification (Douglas et al. 2015; Kasianov et al. 2017), the evolution of mating systems (Slotte et al. 2013), and morphological evolution (Hintz et al. 2006; Sicard et al. 2011; Klepikova et al. 2021). These studies benefit greatly from the availability of genomic data; correspondingly, Capsella species became the target of several genomic projects: the diploid C. rubella (Slotte et al. 2013) and the tetraploid C. bursa-pastoris (Penin et al. 2024) were assembled to chromosome scale. The subject of this study, Capsella orientalis, is a diploid; it is a poorly explored species despite its high importance. It is of special interest as the maternal progenitor of the hybrid cosmopolitan species C. bursa-pastoris. For a long time, this species was thought to be endemic to Eastern Europe (Dorofeyev 2002). However, recent floristic studies have shown that it has a much broader range, being present in Asia, in particular in the Asian regions of Russia and in China, Kazakhstan, and Mongolia (see e.g. German and Ebel 2009; German et al. 2012). Due to the close resemblance of C. orientalis and C. bursa-pastoris and the high morphological diversity of the latter, C. orientalis was frequently overlooked and misidentified as C. bursa-pastoris. The genome sequence of C. orientalis is important for evolutionary studies of C. bursa-pastoris, especially those exploring the differences in gene content and regulation between this hybrid and its progenitor species. However, the only assembly currently available is based on short-read sequencing and is highly fragmented (N50 ∼ 25 Kb) (Ågren et al. 2014).
Results and Discussion
We extracted high molecular weight (HMW) DNA from leaves of C. orientalis using a modified CTAB protocol. Using the Oxford Nanopore Technologies sequencing platform, we obtained 925,232 reads with an N50 of 22,727 bp. Of all the assembly approaches that we tried (supplementary table S1, Supplementary Material online), the best results (in terms of N50 and self-consistency of the Hi-C map) were obtained using Flye. Interestingly, some assemblers, in particular Raven, combined several chromosomes into one scaffold. This led to an inflated N50, showing that a higher N50 does not always indicate a better assembly. Analysis of synteny with C. rubella and C. bursa-pastoris chromosomes showed that the 12 largest scaffolds in the assembly corresponded to whole chromosomes or chromosome arms. Further manual joining of scaffolds based on synteny and examination of Hi-C maps resulted in eight super-scaffolds corresponding to chromosomes (Fig. 1a). A search for telomeric repeats showed their presence at the ends of most of these scaffolds. The total length of the assembly is 130.7 Mb; 94% of the total length belongs to the pseudo-chromosomes. This is similar to the size of the assembly of the closely related species C. rubella. Benchmarking universal single-copy orthologs (BUSCO) metrics indicate high completeness of the assembly: the fraction of complete genes was 99.2% and 99.8% for the eudicots and Brassicales datasets, respectively. After scaffolding and manual curation, we performed two rounds of polishing. As a result, the BUSCO score increased to 99.3% on the eudicots dataset. Whole genome alignment of C. orientalis with the congeneric species C. rubella shows that their chromosomes are mostly syntenic, with no large-scale inversions or translocations. The only exception is the ∼2 Mb inversion at the beginning of chromosome 4 (Fig. 1b). The same structure of chromosome 4 as in C. orientalis is also found in both subgenomes of C. bursa-pastoris (Penin et al. 2024).
This strongly suggests that this structure is ancestral, while the one found in C. rubella is either specific to the particular lineage to which the sequenced plant belongs or is the result of a misassembly. Polyploidy often results in genome rearrangements; to check whether this was the case for the hybrid, we made a comparison with C. bursa-pastoris. As expected for a tetraploid, each C. orientalis chromosome corresponds to two C. bursa-pastoris chromosomes. No rearrangements were found in the chromosomes of the O subgenome compared to the C. orientalis genome. Earlier it was shown that the O subgenome has a ∼1.8 Mb deletion at the beginning of chromosome 5, compared to C. rubella and the R subgenome (Penin et al. 2024). In C. orientalis, this 1.8 Mb region is present; this suggests that the deletion in the O subgenome occurred after polyploidization.
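The remark above that a higher N50 does not always indicate a better assembly can be illustrated with a minimal sketch. The scaffold lengths below are hypothetical values in Mb chosen for illustration, not the lengths of any assembly from this study, and `n50_l50` is a helper name introduced here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    length; L50 is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical lengths (Mb) of eight correctly separated chromosomes.
separate = [20, 18, 17, 16, 15, 14, 13, 12]

# The same sequence, but with four chromosomes wrongly joined into one scaffold.
merged = [20 + 18 + 17 + 16, 15, 14, 13, 12]

print(n50_l50(separate))
print(n50_l50(merged))
```

In this toy case the chimeric scaffold raises the N50 from 16 to 71 although the merged assembly is structurally wrong, which is why an independent check such as the self-consistency of the Hi-C map is useful alongside N50.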

Fig. 1. a) Chromatin contact map for the 8 pseudo-chromosomes of C. orientalis. b) Visualization of whole genome alignment with C. rubella (note that the whole chromosome 4 is inverted in C. orientalis and C. bursa-pastoris subgenomes relative to C. rubella. We keep this orientation for our assemblies because it is congruent with the orientation of the homologous chromosome of A. thaliana). c) Visualization of whole genome alignment with C. bursa-pastoris. d) BUSCO metrics for assembly and annotation.
Annotation using a combination of homology-based and RNA-seq-based approaches resulted in the prediction of 27,675 protein-coding genes in the C. orientalis genome. The vast majority of them were on chromosome scaffolds, with only 771 located outside. Most genes had one-to-one correspondence with Arabidopsis thaliana genes, as expected for a diploid species; 2,155 genes had no significant similarity with any A. thaliana gene. The average gene length in C. orientalis was 2,016 bp and the average number of exons was 5.4. The average exon length is 223 bp. The proportion of intronless genes is 19%. All the above-mentioned statistics are very similar in C. orientalis and A. thaliana (Table 1), indicating the high quality of the annotation. Additionally, we assessed the presence of BUSCO genes in the structural annotation; it demonstrated a completeness score of 96.5% (Fig. 1d).
Table 1
Assembly | All scaffolds | Pseudo-chromosomes |
---|---|---|
Total length, bp | 130,879,499 | 123,259,898 |
Number of scaffolds | 63 | 8 |
N50, bp | 16,157,996 | 16,157,996 |
L50 | 3 | 3 |
GC content, % | 35.60 | 35.57 |
BUSCO (Brassicales) | C:99.8% (S:98.7%, D:1.1%), F:0.0%, M:0.2%, n:4,596 | C:98.8% (S:97.7%, D:1.1%), F:0.0%, M:1.1%, n:4,596 |
Annotation | C. orientalis | A. thaliana |
Number of genes | 27,675 | 27,443 |
Average gene length (bp) | 2,016 | 2,385 |
Average CDS length (bp) | 1,205 | 1,218 |
Average exon length (bp) | 223 | 306 |
Average intron length (bp) | 182 | 165 |
Average number of exons per gene | 5.4 | 5.3 |
Fraction of intronless genes (%) | 19 | 20 |
BUSCO (Brassicales) | C:96.2% (S:94.5%, D:1.7%), F:0.9%, M:2.9%, n:4,596 | C:100.0% (S:99.0%, D:1.0%), F:0.0%, M:0.0%, n:4,596 |
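The gene-structure statistics compared above (gene length, exon and intron length, exons per gene, intronless fraction) can be derived from exon coordinates. The following is a minimal sketch under stated assumptions: each gene is represented by one transcript given as a list of 1-based inclusive exon coordinates, gene length is taken as the span from the first to the last exon, and the gene names and coordinates are invented. The source does not specify how its averages were computed, so this is an illustration rather than the authors' procedure.

```python
from statistics import mean


def gene_stats(genes):
    """Summarize gene structure.

    genes: dict mapping a gene ID to a list of (start, end) exon
    coordinates, 1-based and inclusive, on a single strand.
    """
    gene_lengths, exon_counts, exon_lengths, intron_lengths = [], [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        # Gene length is the span from the first exon start to the last exon end.
        gene_lengths.append(exons[-1][1] - exons[0][0] + 1)
        exon_counts.append(len(exons))
        exon_lengths.extend(end - start + 1 for start, end in exons)
        # An intron is the gap between two consecutive exons.
        intron_lengths.extend(
            nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])
        )
    return {
        "mean_gene_length": mean(gene_lengths),
        "mean_exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
        "mean_intron_length": mean(intron_lengths) if intron_lengths else 0.0,
        "intronless_fraction": sum(c == 1 for c in exon_counts) / len(exon_counts),
    }


# Hypothetical toy annotation: one three-exon gene and one intronless gene.
toy = {
    "geneA": [(1, 100), (201, 300), (401, 500)],
    "geneB": [(1000, 1299)],
}
stats = gene_stats(toy)
print(stats)
```

Whether untranslated regions are counted in gene and exon lengths changes these numbers, so comparisons between annotations are only meaningful when both are summarized the same way.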
Materials and Methods
Capsella orientalis plants were grown from seeds collected by D.A. German in Barnaul (Altai Krai, Russia). HMW DNA was extracted from 5 g of leaves using a modified CTAB protocol. It differs from the classical protocol of Doyle and Doyle (1987) by (i) a larger amount of plant material and larger extraction volumes, (ii) a longer incubation time, (iii) double extraction, and (iv) the use of proteinase K in the second extraction; 10 µg of DNA was taken for library preparation for nanopore sequencing using the ligation-based protocol. For optimal adapter ligation and elution of HMW DNA, the ligation and elution times were increased and elution was carried out at 37 °C instead of room temperature. The resulting library was sequenced on the P2 Solo instrument using a v. 10 flow cell. Basecalling was done using Guppy v. 6.5.7 (Oxford Nanopore Technologies, Oxford, United Kingdom). For Hi-C and shotgun Illumina libraries, we used methods described earlier (Penin et al. 2024). Libraries were sequenced on HiSeq 4000 and NextSeq instruments (Illumina). Nanopore reads with an average Phred score of at least 8 were assembled using NextDenovo (Hu et al. 2024), Shasta (Shafin et al. 2020), Raven (Vaser and Šikić 2021), Flye v. 2.9.4 (Kolmogorov et al. 2019), and Flye in combination with Mabs (Schelkunov 2023). The contigs were checked for contamination using BLAST against the NCBI-nr database; contigs whose best hits did not belong to plants were to be excluded, but we found no such contigs. To scaffold the draft assemblies, Hi-C reads were mapped to the contigs and preprocessed with the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline); scaffold-level assemblies were then obtained using YaHS v. 1.1 (Zhou et al. 2023). To visualize Hi-C mapping, we used Juicebox v. 2.15 (Durand et al. 2016).
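The read-quality threshold mentioned above (average Phred score of at least 8) can be sketched as follows. The source does not state how the average was computed; this sketch assumes the per-read average is taken over per-base error probabilities and then converted back to a Phred score, a convention often used for nanopore reads, and it assumes the standard FASTQ quality offset of 33. The function names are introduced here for illustration.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality from a FASTQ quality string.

    Assumption: per-base Phred scores are converted to error
    probabilities, averaged, and converted back to a Phred score.
    """
    probabilities = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probabilities) / len(probabilities))


def passes_filter(quality_string, min_q=8.0):
    """True if the read's mean quality reaches the threshold."""
    return mean_phred(quality_string) >= min_q


# '+' encodes Q10, '#' encodes Q2, and 'I' encodes Q40 at offset 33.
print(passes_filter("++++"))  # uniform Q10 read
print(passes_filter("####"))  # uniform Q2 read
print(passes_filter("I#"))    # one Q40 base and one Q2 base
```

Averaging error probabilities penalizes low-quality bases more than a plain arithmetic mean of Phred scores would: the last read averages Q21 arithmetically but only about Q5 under this convention, so it is rejected.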
Further manual curation included examination of the Hi-C map, splitting scaffolds in regions where adjacent sequences had few connections supported by Hi-C reads, and joining scaffolds where the density of connections was high. This allowed us to reconstruct the expected eight scaffolds (pseudo-chromosomes). After scaffolding, we performed two rounds of polishing using NextPolish v. 1.4.1 (Hu et al. 2020). To annotate the genome assembly of C. orientalis, we first masked repeats using RepeatModeler v. 2.0.4 (Flynn et al. 2020) with the parameter “-engine ncbi” followed by RepeatMasker v. 4.1.5 (https://www.repeatmasker.org) with the parameters “-engine ncbi -xsmall.” Then we mapped RNA-Seq data of C. orientalis and C. bursa-pastoris retrieved from the NCBI database to the genome using STAR v. 2.7.10b (Dobin et al. 2013) with default parameters. Samtools view v. 1.15.1 (Danecek et al. 2021) was used to convert read alignments from SAM to BAM format using the “-Sb” option. The BAM files were then sorted using samtools sort and indexed using samtools index.
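With the “-xsmall” option, RepeatMasker reports repeats in lowercase (soft-masking) instead of replacing them with N. A quick sanity check on a soft-masked sequence is the fraction of lowercase bases. The sketch below assumes the sequence is already available as a Python string; the helper name and the example sequence are invented for illustration.

```python
def soft_masked_fraction(sequence):
    """Fraction of called (non-N) bases that are lowercase, i.e. soft-masked."""
    called = [base for base in sequence if base.upper() != "N"]
    if not called:
        return 0.0
    return sum(base.islower() for base in called) / len(called)


# Hypothetical sequence: four unmasked bases, four soft-masked bases, two Ns.
example = "ACGTacgtNN"
print(soft_masked_fraction(example))
```

Soft-masking keeps the underlying sequence available to downstream tools, which can then decide how to treat the lowercase regions.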
The genome was annotated using BRAKER v. 3.0.8 (Hoff et al. 2016, 2019; Brůna et al. 2021). A single isoform per gene was selected, specifically the one with the longest coding DNA sequence (CDS). The corresponding amino acid sequences were then extracted. To facilitate comparative genomic studies within Capsella and with A. thaliana, we used the same approach as in Penin et al. (2024), in which genes are matched to their orthologous A. thaliana genes and this information is included in the gene name. A prefix consisting of the species name (CO for C. orientalis) and the chromosome name was added to the gene IDs. We then identified orthologs between Capsella and A. thaliana using SynGAP v. 1.2.5 (Wu et al. 2024) so that the corresponding Arabidopsis gene IDs could be included. First, structural annotation correction was carried out using the SynGAP dual module with the parameter “--datatype prot.” Then, orthologous genes were identified with the SynGAP genepair module using the same parameter “--datatype prot.” For Capsella genes that had no orthologs in A. thaliana, an additional search for homologs was performed based on amino acid sequences using blastp v. 2.5.0 (Camacho et al. 2009) with the following parameters: “-max_target_seqs 1 -outfmt 6 -evalue 1e-5.” The resulting Capsella gene IDs were given the suffix “_O” if the gene had an Arabidopsis 1-to-1 ortholog identified by SynGAP and the suffix “_B” if the gene was found by BLAST.
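The isoform selection rule described above (one isoform per gene, the one with the longest CDS) can be sketched as below. The input layout, the identifiers, and the tie-breaking rule (keep the first isoform encountered) are assumptions made for this illustration; the source does not describe how ties were handled.

```python
def select_longest_isoforms(transcripts):
    """Pick one transcript per gene: the one with the longest CDS.

    transcripts: iterable of (gene_id, transcript_id, cds_length).
    Ties are resolved in favor of the transcript seen first (an
    assumption of this sketch).
    """
    best = {}
    for gene_id, transcript_id, cds_length in transcripts:
        current = best.get(gene_id)
        if current is None or cds_length > current[1]:
            best[gene_id] = (transcript_id, cds_length)
    return {gene_id: tid for gene_id, (tid, _) in best.items()}


# Hypothetical transcript records with invented identifiers and CDS lengths (bp).
records = [
    ("g1", "g1.t1", 900),
    ("g1", "g1.t2", 1200),
    ("g2", "g2.t1", 600),
    ("g1", "g1.t3", 1200),
]
selected = select_longest_isoforms(records)
print(selected)
```

Reducing each gene to a single protein avoids counting alternative isoforms as separate genes in orthology searches and in protein-mode BUSCO assessment.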
The completeness of the resulting structural annotations was assessed using BUSCO v. 5.4.4 (Manni et al. 2021), with the Brassicales database and parameter “-m protein.” Genome synteny was visualized using Circos (Krzywinski et al. 2009).
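BUSCO results are commonly reported in a compact one-line notation, as in Table 1, where C is complete, S is complete and single-copy, D is complete and duplicated, F is fragmented, M is missing, and n is the number of BUSCO groups searched. A minimal sketch for turning such a string into numbers is given below; the regular expression and function name are written for this illustration and are not part of BUSCO itself.

```python
import re

# Matches strings such as "C:99.8% (S:98.7%, D:1.1%), F:0.0%, M:0.2%, n:4,596".
BUSCO_PATTERN = re.compile(
    r"C:\s*(?P<C>[\d.]+)%\s*\(S:\s*(?P<S>[\d.]+)%,\s*D:\s*(?P<D>[\d.]+)%\),\s*"
    r"F:\s*(?P<F>[\d.]+)%,\s*M:\s*(?P<M>[\d.]+)%,\s*n:\s*(?P<n>[\d,]+)"
)


def parse_busco(summary):
    """Parse a one-line BUSCO summary into a dict of percentages and n."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    values = {key: float(match.group(key)) for key in "CSDFM"}
    # n may be written with a thousands separator, e.g. "4,596".
    values["n"] = int(match.group("n").replace(",", ""))
    return values


assembly = parse_busco("C:99.8% (S:98.7%, D:1.1%), F:0.0%, M:0.2%, n:4,596")
annotation = parse_busco("C: 96.2% (S:94.5%, D:1.7%), F:0.9%, M:2.9%, n:4,596")
print(assembly)
print(annotation)
```

A useful consistency check on parsed values is that C equals S plus D and that C, F, and M sum to 100%, up to rounding.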
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online.
Acknowledgments
The authors thank Mikhail I. Schelkunov (Institute of Information Transmission Problems) for his assistance with assembly validation.
Funding
This study was supported by the Russian Science Foundation, project 21-74-20145.
Data Availability
Raw reads, assembly, and annotation are deposited in ENA under accession number PRJEB80551. Assembly and annotation are also available on figshare: https://doi.org/10.6084/m9.figshare.28081604.v1.