-
PDF
- Split View
-
Views
-
Cite
Cite
Lumin Zhang, Jianmin Yuan, Tianlei Pu, Wenlin Qu, Xiao Lei, Kaihua Ma, Kunjian Qian, Qiongling Zhao, Chengfei Liao, Jie Jin, Chromosome-scale genome assembly of Phyllanthus emblica L. ‘Yingyu’, DNA Research, Volume 32, Issue 2, April 2025, dsaf006, https://doi.org/10.1093/dnares/dsaf006
- Share Icon Share
Abstract
Phyllanthus emblica L. is an edible plant with medicinal properties native to the dry-hot valley of Yunnan, China. Here, we report a de novo chromosome-scale genome of P. emblica wild type ‘Yingyu’. ‘Yingyu’ is an octopoid plant with a total of 104 chromosomes. In total, we assembled and clustered 480 Mb of the genome and constructed 26 pseudochromosomes (haplotypes) of P. emblica wild type ‘Yingyu’ that encompass 97.9% of the genome and demonstrate to have relatively high integrity. We annotated 31,111 genes found in the genome of P. emblica. We screened 5 different tissues for searching the tissue-specific expression candidate genes. Four unknown function candidate genes were expressed at high levels in the flowers while genes relating to the biosynthesis of gibberellins and cellulose were specifically expressed in the fruits. The ascorbate biosynthesis-related genes were screened on P. emblica ‘Yingyu’ genome. The high expression level of 2 GDP-mannose epimerases and one L-galactono-1,4- lactone dehydrogenases in the fruit may be related to the activity of absorbate biosynthesis in the fruit. The chromosome-level genomic data for P. emblica we report will be important for the development of molecular markers to facilitate the selection of superior cultivars for processing and pharmaceuticals.
1. Introduction
Phyllanthus emblica L., belongs to family Euphorbiaceae, is an important medicinal and edible plant which has been called the ‘wonder fruit’ of the 21st century.1 As an important herb in Indian Ayurvedic and traditional Chinese medicine,2 the medicinal value of P. emblica has been recognized by Western medicine from as early as 1944 when a study in the journal Nature reported that the high content of ascorbic acid in the fruit of P. emblica was used for the treatment of scurvy patients during the Hisar famine from 1939 to 1940.3 Fruits of P. emblica are a rich source of ascorbic acids and pectins. They are also rich in a wide variety of medically significant organic compounds such as alkaloids, benzoids, diterpenes, sterols, flavonoids and furanolactones, tannins, gallic acid, quercetin, and phyllaembic compounds.1
Wild populations of P. emblica are found in India, Bangladesh, and Southeast Asia. P. emblica wild population are also distributed throughout southern China, specifically dry-hot valleys in the provinces of Yunnan, Guizhou, and Sichuan.4 In China, P. emblica is also cultivated in relatively humid regions, such as in the provinces of Guangxi, Guangdong, and Fujian. A study on wild and cultivated P. emblica populations in China demonstrated that the populations growing in drier climates had higher levels of genetic diversity than those growing in wetter climates.4 ‘Yingyu’ is a wild type of P. emblica first discovered and collected in Chuxiong prefecture, Yunnan province, which produces a fruit that is golden and clear at optimal harvest. ‘Yingyu’ is also characterized by long shedding branches and a larger fruit (weighing an average of 35–40 g at optimal harvest) than other wild and cultivated P. emblica types (Fig. 1A and 1B). P. emblica is a monoecious plant, The male flowers are borne on the lower and middle parts of the shedding branches, while the female flowers are borne on the top of the shedding branches, with a much greater number of male flowers than female flowers (Fig.1C). Different cultivars or lines have different female/male ratio. The flower sex differentiation and female/male ratio severely affect the formation of production.

The Phyllanthus emblica cultivar ‘Yingyu’. A) Long shedding branches and leaves. B) Fruits. C) Inflorescences, male, and female flowers.
Genome information can be used to enhance breeding programs and contribute to a deeper comprehension of the genetic foundations underlying diverse germplasm resources. Genome information is also crucial for fundamental research to elucidate the physiology and evolution of a species.5 Projects to map the genomes of important horticultural crops such as pear,6 apple,7 peach,8 and strawberry9 have been running for over ten years. In comparison, efforts to study the genome of P. emblica have progressed at a slower pace, as the majority of its populations exist in the wild, in tropical and subtropical regions. Recently, Mahajan et al. (2023) reported the first draft genome of P. emblica.10 Thus far, the members of the family Euphorbiaceae for which genomes have been reported include Ricinus communis,11Hevea brasiliensis,12Manihot esculenta,13 and Jatropha curcas.14
The development and use of third-generation sequencing technologies have accelerated rapidly over the past few years. In particular, the high-fidelity (HiFi) technology developed by Pacific Biosciences (PacBio) can read and generate multiple DNA sequences from the same DNA molecule.15 In the human genome sequencing project, HiFi reads achieved very low error rates and produced long length contigs.16 Another next-generation sequencing technology, high-throughput chromosomal conformation capture (Hi-C), has shown to achieve high chromosome-scale contiguity and phasing accuracy.17 In a recent study incorporating HiFi and Hi-C to map the chromosome-level genome of Phyllanthus cochinchinensis, a related species of P. emblica, a total of 13 pseudochromosomes were constructed, covering 99.87% of the assembled sequence.18 The genome size of this species was 284.88 Mb and included 20,836 coding genes.
Recent studies on P. emblica have mainly focused on its phenological traits and antioxidative, antimicrobial, and anticholinesterase properties.2,19,20 However, understanding of the active components of P. emblica and its physiology has been hindered by a lack of progress in mechanistic analysis. Li et al. (2024) reported the P. emblica cultivated type ‘Bingtian’ genome.21 ‘Bingtian’ is an octoploid plant and the chromosome number is 2n = 8x = 104. ‘Bingtian’ origin from two diploid ancestors (AA and BB). P. emblica cultivated type ‘Bingtian’ become an octoploid plant (AAAABBBB) through 2 whole-genome duplication events. Although cultivated type and wild type all belong to P. emblica, they exhibit significant differences in traits and composition such as fruit size, ascorbic acid biosynthesis and tannin accumulation, and so on. However, the microscopic images do not accurately count up to 104 chromosomes. The accurate imaging of the chromosome counts in P. emblica remains unclear. Until now a chromosome-level genome of P. emblica wild type is currently lacking.
In this study, we employed microscopy and flow cytometry to observe the chromosome counts and to analyze its ploidy. We used the next-generation and third-generation high-throughput sequencing techniques—including DNBSEQ, HiFi, and Hi-C—to comprehensively sequence the genome of P. emblica wild type ‘Yingyu’ at the chromosome level. The species divergence time and expansion and contraction of gene families were estimated. The reference genome and transcriptome data we provide will support the screening of specific expression candidate genes in different tissues especially the ascorbic acid biosynthesis pathway of P. emblica wild type ‘Yingyu’. Coupled with the use of GWAS and QTL analysis, the data should also contribute to the selection and breeding of candidate P. emblica cultivars for specific purposes such as fresh consumption, processing, and pharmaceuticals.
2. Materials and methods
2.1 Plant materials
The genome material for this study was obtained from a 15-year-old cultivar of P. emblica ‘Yingyu’ tree. This tree is cultivated at National Germplasm Resource Nursery for Characteristic Crops in Dry-Hot Areas, which is under the Ministry of Agriculture and Rural Affairs, and located in Yuanmou, China. The young leaves were sampled for genome sequencing analysis. The young leaves, shedding branches, flowers, fruits (flesh), and one-year-old branches were sampled for transcriptome analysis.
2.2 Flow cytometry and karyotype analysis
The cultivar of P. emblica ‘Yingyu’ and ‘Bingtian’ leaves were employed to perform the flow cytometry analysis. ‘Bingtian’ was a control because it reported as an octopoid plant and the chromosome number is 104.21 The 0.2 g fresh leave sample were placed in plastic culture dish with 500 μL Nuclei Extraction buffer (CyStain UV Precise P Kit, Sysmex, Germany), chopped with a sharp blade then filtered through a 50 μm Celltrics filter after 60 seconds. Followed by the addition of 2000 μL of DAPI staining buffer for 2 minutes in the dark. Nuclei suspension was analyzed by CyFlow Space Flow Cytometer (Sysmex Partec, Muenster, Germany) and the corresponding Flomax software.
The stem tips were collected as experimental materials, with a segment length of 0.5–1 cm. Collected materials underwent initial pretreatment by immersion in a solution comprising V (0.001 mol/L hydroxyquinoline): V (0.02% colchicine) in a 1:1 ratio under dark conditions at 4 °C for 4 h. Following a rinse with distilled water, the materials were subjected to a treatment in a 0.075 mol/L KCl solution at 4 °C for 30 min. Subsequently, the materials were fixed in a Carnoy fixative solution of V (methanol): V (glacial acetic acid) in a 3:1 ratio for a period ranging from 4 to 24 h at 4 °C. After a further rinse, acidification was carried out using a 1 mol/L HCl solution at room temperature for 45 min. The materials were then enzymatically digested with a mixture of V (6% cellulase): V (4% pectinase) in a 2:1 ratio at 37 °C for 5–6 h, followed by a low osmotic treatment in a 37 °C water bath for 30 min. The materials were transferred to a moist, clean glass slide, crushed, and spread using forceps to eliminate large tissue residues. Staining was performed using a carbol fuchsin solution for a total duration of 2 h. A coverslip was applied, the excess stain was absorbed using filter paper from both sides of the coverslip, and the coverslip was secured with fingers on the filter paper. Gently tapping along the diagonal with a pencil eraser head, followed by a brief gentle tapping of the coverslip for 1–2 min, ensured uniform dispersion of the material. The microscope (Nikon 80i, Japan) was used to observe the chromosomes. The analysis involved segmentation, pairing, and data interpretation of chromosomes to generate a karyotype using karyotype analysis software Ikaros (Zeiss, Germany). Jindiweilai Biotechnology (Beijing) Co., Ltd. Provides technical support for flow cytometry and karyotype analysis.
2.3 Genome DNA extraction, sequencing, and assembly
Genome DNA was extracted from the tissues of young leaves using a modified CTAB (Cetyl-trimethyl ammonium bromide) method.22 The genome survey was sequenced using DNBSEQ short-read sequencing to estimate genome size and heterozygosity. The genome was sequenced using PacBio HiFi, and DNBSEQ Hi-C technology at BGI Biotechnology (Wuhan, China) Co., Ltd by following the pipeline (Fig. 2).

De novo genome assembly pipeline of Phyllanthus emblica cultivar ‘Yingyu’.
The DNA samples were randomly fragmented with a covaris instrument and then subjected to DNBSEQ WGS PCR-free libraries with DNA fragments of 350 bp were selected. The DNA fragment end repair is performed, with an ‘A’ base added at the 3’ end and library adapters added to both ends. The ligated libraries undergo single-strand separation and circularization. The circularized libraries are subjected to Rolling Circle Amplification to generate DNA nano-balls (DNBs), which are then subjected to quality control before sequencing. The constructed libraries were sequenced with 350 bp pair-end sequencing using DNBSEQ MGI2000 platform. The raw data were filtered using SOAPnuke-v2.1.04 software (-n 0.02 -l 20 -q 0.4 -i -G 2 --polyX 50 -Q 2 --seqType 0) to remove adapter sequences and low-quality sequences, resulting in clean reads.23 Jellyfish (v2.2.6) was employed to rapidly calculate K-mer frequencies. The genome size, heterozygosity, repetitive sequences, sequencing depth, and other characteristics based on the K-mer frequency statistics were estimated by Jellyfish-v2.2.6 and GenomeScope software.23–26
A combination of PacBio Sequel II single-molecule real-time sequencing and DNBSEQ Hi-C sequencing with error correction was applied to assemble the complete genome sequence of P. emblica ‘Yingyu’. The gDNA library for single molecular real-time PacBio genome sequencing were constructed by the standard protocol of Pacific Bioscience. In short, gDNA was disrupted randomly with 15–20 kb fragments using Megaruptor. The hairpin structure adaptors were linked to two side of double-strand gDNA and then an SMRTbell library was constructed. The library was sequenced on the PacBio Sequel II platform with 2 cells using CCS mode generating about 110 coverages. The raw data were filtered by CCS (V4.0.0). The clean reads were fed into the pipeline of the Hifisam15 and HiCanu27 assembled to contigs with the default mode.
One Hi-C libraries were constructed using the standard procedure and sequenced using the DNBSEQ platform for Hi-C sequencing. The raw data were filtered the adaptors and low-quality sequences by fastp (--average_qual 15 -l 150 -w 6). The contig-level assembly of P. emblica ‘Yingyu’ genome was achieved to pseudochromosome-level genome using a combination of PacBio HiFi and DNBSEQ Hi-C sequencing. The pseudochromosome-level genome was assembled and constructed using 3D DNA28 and manually corrected with Juicer29 run at their respective default settings.
2.4 RNA extraction and transcriptome sequencing
The tissue of one-year-old branches, shedding branches, young leaves, flowers, and fruit (flesh) of P. emblica ‘Yingyu’ were collected, and total RNA was extracted using a modified CTAB method.30 Five different tissues of total RNA were mixed to perform a full-length transcriptome. Total RNA was first converted into first-strand cDNA using the BGI UMI base PCR cDNA Synthesis Kit. The first-strand cDNA was then amplified through PCR to synthesize the second-strand cDNA. After the double-stranded DNA underwent a second round of PCR amplification. The SMRTbell library was prepared according to the standard protocol and the library sequenced on the PacBio Sequel II platform using CCS mode. For regular transcriptome, 5 different tissues total RNA was added to random primers for the synthesis of single-stranded cDNA. Double-stranded cDNA was synthesized using dUTP instead of dTTP. The double-stranded cDNA underwent end repair, an ‘A’ addition, and adapter ligation. The second strand template labeled with U is digested using the UDG enzyme, followed by PCR and recovery of products. Then the library products were circularized and the circular DNA molecules underwent rolling circle amplification to form DNA nanoballs. The DNA nanoballs were sequenced on the DNBSEQ platform. The raw data of full-length transcriptome were processed to identify of Reads of insert, classification of reads, and clustering and correction of reads by SMRT analysis suite for ensuring high quality, full-length consensus sequence. The raw data of regular transcriptome were filtered by SOAPnuke with the parameter of -n 0.001 -l 20 -q 0.4 --adaMR 0.25 --polyX 50 --minReadLen 150. The clean raw data was then performed by the genome annotation.
2.5 Genome annotation
Homolog repeat annotations were filtered by RepeatMaster and ProteinMask v 4.0.7 based on the RepBase library v 21.12 (http://www.girinst.org/repbase) (Fig. 3).31 A de novo repeat sequencing library was created using RepeatMoleder and LTRharvest.32 Subsequently, annotation of the genome was performed in RepeatMaster.

Genome annotation analysis pipeline of Phyllanthus emblica cultivar ‘Yingyu’.
Predicting the structure of a gene typically involves multiple methods including homolog and de novo prediction. In this study, we predicted the gene structure of P. emblica ‘Yingyu’ using a de novo prediction method with five regular transcriptomes and a full-length transcriptome. Specifically, HiSAT 2 and gmap were used to align the regular and whole-length transcriptome data.33,34 Pasa and Stringtie were the tools that align transcript sequences to the genome to perform structural annotation of genes. The transdecoder was then used to predict ORFs in the transcriptome data and to annotate genes based on transcriptome sequencing data and transcriptome-specific genes.35 The species Arabidopsis thaliana (TAIR 10),36Hevea brasiliensis (NR),12Manihot esculenta (KU50),13Oryza sativa (AGIS-1.0),37Ricinus communis (Rco039),38 and Zea mays (B73_V3)39 were selected for homolog predictions. The gene structure prediction was analyzed using GeMoMa v1.9 (http://www.jstacs.de/index.php/GeMoMa).40 Six species homologs and transcriptome-specific genes were added to obtain the final gene set. Finally, the quality of the gene set was evaluated against the BUSCO database.41
2.6 Phylogenetic analysis
Expect for P. emblica, Ricinus communis (Rco039),38Theobroma cacao (Criollo_cocoa_genome_V2),42Arabidopsis thaliana (TAIR 10),36Vitis vinifera (PN40024.v4.1),43Beta vulgaris (Beta vulgaris Resource RefBeet-1.2.2),44Solanum lycopersicum (SL 3.0), Coffea arabica (Cara_1.0),45Helianthus annuus (HanXRQr2.0-SUNRISE),46Nelumbo nucifera (Chinese lotus 1.1),47Aquilegia coerulea,48Papaver somniferurn (ASM357369v1),49 and Oryza sativa (IRGSP-1.0)50 were selected for constructing the phylogenetic tree. The genome data were downloaded from NCBI and Phytozome.51 The gene set of each species was obtained by retaining only the longest isoform for any gene that encoded several isoforms. Single-copy orthologues among taxa were used to achieve a robust phylogenetic reconstruction with high confidence. OrthoFinder (v 2.5.5 -M msa) was used to identify the orthologues with an all-vs-all BlastP search. The single-copy genes were combined to be a supergene by Seqkit tool.52 Subsequently, the phylogenetic tree comprising all species was built using the results of the combined single-copy ortho-group supergene result in RAxML v 8.0 employing GTR + G + I models and conducting 1000 bootstrap analyses.53 The phylogenetic tree was visualized using Figtree v 1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).
2.7 Species divergence time estimation
In a phylogenetic analysis, the MCMCtree program and Maximum Likelihood package were used to infer the divergence time of each node on the phylogenetic tree. The parameters of the MCMCtree program were: nsample = 200,000; burnin = 20,000; seqtype = 0; model = 4. To calibrate divergence times, we used the published divergence times for monocots versus dicots (173–148 Mya) and for Theobroma cacao versus Arabidopsis thaliana (89.0–97.4 Mya); the data were obtained from http://timetree.org/.
2.8 Expansion and contraction of gene families
The expansion and contraction of gene families were determined by comparing the cluster sizes of each species to their ancestors using the CAFE (v 5.0.0) program.54 A random birth-and-death model was used to evaluate changes in gene families along each lineage of the phylogenetic tree. Using conditional likelihoods as the test statistics, we calculated the corresponding P value of each lineage and considered a P value of or below 0.05 to indicate statistical significance. The genes of the significant gene families were listed and the Arabidopsis homologues of those significant genes were used to perform the GO analysis in Agrigo v 2.0 (https://systemsbiology.cau.edu.cn/agriGOv2/index.php).
2.9 Comparative genome analyses and genome synteny and collinearity
The reference genome for the Ricinus communis L. (Rc039) was the most recent open-access reference genome provided for a species in the family Euphorbiaceae.38 The comparative genome analyses were performed P. emblica ‘Yingyu’ with R. communis and P. emblica ‘Yingyu’ with P. emblica ‘Yingyu’ using MCScanX.55 We used a dot and bar plot to visualize patterns of synteny and collinearity.
2.10 Tissue-specific expression genes screening and ascorbate biosynthesis related gene expression analysis
Five regular transcriptome clean raw data aligned against the annotated genes files of P. emblica ‘Yingyu’ by the bowtie 2 software running at the default parameters. Gene expression levels were calculated and the heatmap of the gene expression level showed transcripts per kilobase of exon model per million mapped reads (TPM) using the software stringrtie running at the default parameters. The gene expression level of young leaves, fruit (flesh), flowers, one-year branches, or shedding branches were 10 times higher than other tissues top rank genes were selected as the ‘tissue-specific candidate genes’. The genes associated with ascorbate biosynthesis were cited from Mahajan et al. (2023) and screened ‘Yingyu’ genome annotation files.10 The expression level of genes was presented using heatmaps by WPS excel color scale function. the values that are lower should be represented in green, those near the median in yellow, and higher values in red. The heatmaps of gene expression values, expressed in transcripts per kilobase of exon model per million mapped reads (TPM), were created.
3. Results
3.1 The genome size estimation, ploidy, and karyotype analysis
In total, 154.81 Gb DNBSEQ short reads of sequencing data were used to estimate the ploidy and genome size of P. emblica. A K-mer distribution analysis indicated that P. emblica ‘Yingyu’ genome size was 448.7 Mb (Fig. 4).

Genome size estimation for Phyllanthus emblica cultivar ‘Yingyu’ with the distribution of the number of distinct k-mers (k = 21) with the given multiplicity values.
To understand the chromosome numbers and ploidy level of ‘Yingyu’ cultivar, flow cytometry, and karyotype analysis were conducted. We compared the ploidy between ‘Yingyu’ and ‘Bingtian’ by flow cytometry. Previous research mentioned P. emblica cultivated type ‘Bingtian’ was used as a control sample.21 The result showed that ‘Yingyu’ and ‘Bingtian’ were the same octopoid plants (Fig. 5A and 5B). The karyotype analysis showed ‘Yingyu’ is an octopoid plant and the chromosomes were 104 (Fig. 5C).

The flow cytometry and karyotype analysis of Phyllanthus emblica L. A) The flow cytometry analysis of the wild type ‘Yingyu’. B) The flow cytometry analysis of the cultivated type ‘Bingtian’. C) The karyotype analysis of the wild type ‘Yingyu’.
3.2 Genome sequencing and de novo assembly
A HiFi long-read sequencing using the PacBio Sequel II platform CCS mode with 2 SMRT cells generated 48.78 Gb of clean data (equivalent to ~ 108 × genome coverage, 3,218,828 reads with an N50 of 15.3kb). The raw data from the HiFi long-read sequencing yielded a draft assembled genome with contigs and 1585 of 1614 complete BUSCOs, covering 98.2% of the genome of P. emblica (Table 1). A chromosome conformation capture (Hi-C) library was constructed and sequenced to assemble the contigs into pseudochromosomes. The Hi-C sequencing generated 285.8 Gb of raw data using the DNBseq platform. After filtering out lower-quality data using adaptors, 262 Gb of clean data were obtained for data analysis. The chromosome-level genome was assembled using the HiFi and Hi-C data. Finally, we assembled the 480 Mb genome, which was clustered into 26 pseudochromosomes (Table 2). The length of the pseudochromosome ranged from 15.8 to 22.3 Mb. The heatmap for the contact matrix with Hi-C data indicated an excellent clustering, ordering, and orientation of the contigs (Fig. 6). However, there remained gaps on both ends of the chromosomes, as well as extensive gaps present on chromosome 01. Finally, 1575 of 1614 BUSCOs covering 97.9% of the total genome were obtained and found to have relatively high integrity (Table 1).
Contig level of ‘Yingyu’ . | Chromosome level of ‘Yingyu’ . | |
---|---|---|
Complete BUSCOs | 1585 | 1575 |
Complete and single-copy BUSCOs | 1136 | 1,517 |
Complete and duplicated BUSCOs | 449 | 58 |
Fragmented BUSCOs | 7 | 11 |
Missing BUSCOs | 22 | 28 |
Total BUSCO groups searched | 1614 | 1614 |
Contig level of ‘Yingyu’ . | Chromosome level of ‘Yingyu’ . | |
---|---|---|
Complete BUSCOs | 1585 | 1575 |
Complete and single-copy BUSCOs | 1136 | 1,517 |
Complete and duplicated BUSCOs | 449 | 58 |
Fragmented BUSCOs | 7 | 11 |
Missing BUSCOs | 22 | 28 |
Total BUSCO groups searched | 1614 | 1614 |
Contig level of ‘Yingyu’ . | Chromosome level of ‘Yingyu’ . | |
---|---|---|
Complete BUSCOs | 1585 | 1575 |
Complete and single-copy BUSCOs | 1136 | 1,517 |
Complete and duplicated BUSCOs | 449 | 58 |
Fragmented BUSCOs | 7 | 11 |
Missing BUSCOs | 22 | 28 |
Total BUSCO groups searched | 1614 | 1614 |
Contig level of ‘Yingyu’ . | Chromosome level of ‘Yingyu’ . | |
---|---|---|
Complete BUSCOs | 1585 | 1575 |
Complete and single-copy BUSCOs | 1136 | 1,517 |
Complete and duplicated BUSCOs | 449 | 58 |
Fragmented BUSCOs | 7 | 11 |
Missing BUSCOs | 22 | 28 |
Total BUSCO groups searched | 1614 | 1614 |
Pseudochromosome . | Size(bp) . | GC content (%) . | Gene numbers . |
---|---|---|---|
Chr 01 | 22,278,372 | 34.41 | 1790 (5.7%) |
Chr 02 | 17,889,182 | 33.62 | 1211 (3.8%) |
Chr 03 | 20,787,299 | 33.64 | 1250 (4.0%) |
Chr 04 | 18,091,949 | 33.56 | 1076 (3.4%) |
Chr 05 | 17,571,142 | 33.66 | 1178 (3.7%) |
Chr 06 | 18,117,518 | 33.61 | 1157 (3.7%) |
Chr 07 | 15,795,701 | 33.70 | 1176 (3.7%) |
Chr 08 | 18,082,370 | 33.66 | 1129 (3.6%) |
Chr 09 | 17,185,555 | 33.66 | 1107 (3.5%) |
Chr 10 | 19,027,488 | 33.70 | 1223 (3.9%) |
Chr 11 | 17,623,804 | 33.60 | 1253 (4.0%) |
Chr 12 | 17,469,151 | 33.61 | 1249 (4.0%) |
Chr 13 | 18,930,772 | 33.53 | 1274 (4.0%) |
Chr 14 | 16,823,809 | 33.58 | 1100 (3.5%) |
Chr 15 | 17,856,992 | 33.49 | 1114 (3.5%) |
Chr 16 | 19,144,779 | 33.43 | 1157 (3.7%) |
Chr 17 | 17,455,136 | 33.11 | 1064 (3.4%) |
Chr 18 | 16,544,731 | 33.35 | 981 (3.1%) |
Chr 19 | 19,582,468 | 33.42 | 1238 (3.9%) |
Chr 20 | 18,773,231 | 33.33 | 1264 (4.0%) |
Chr 21 | 17,525,693 | 33.42 | 1195 (3.8%) |
Chr 22 | 20,670,775 | 33.40 | 1362 (4.3%) |
Chr 23 | 19,478,050 | 33.43 | 1142 (3.6%) |
Chr 24 | 20,990,950 | 33.58 | 1295 (4.1%) |
Chr 25 | 18,557,277 | 33.29 | 1110 (3.5%) |
Chr 26 | 17,833,917 | 33.51 | ,016 (3.2%) |
Total | 480,088,111 | -- | 31,111 (100%) |
Pseudochromosome . | Size(bp) . | GC content (%) . | Gene numbers . |
---|---|---|---|
Chr 01 | 22,278,372 | 34.41 | 1790 (5.7%) |
Chr 02 | 17,889,182 | 33.62 | 1211 (3.8%) |
Chr 03 | 20,787,299 | 33.64 | 1250 (4.0%) |
Chr 04 | 18,091,949 | 33.56 | 1076 (3.4%) |
Chr 05 | 17,571,142 | 33.66 | 1178 (3.7%) |
Chr 06 | 18,117,518 | 33.61 | 1157 (3.7%) |
Chr 07 | 15,795,701 | 33.70 | 1176 (3.7%) |
Chr 08 | 18,082,370 | 33.66 | 1129 (3.6%) |
Chr 09 | 17,185,555 | 33.66 | 1107 (3.5%) |
Chr 10 | 19,027,488 | 33.70 | 1223 (3.9%) |
Chr 11 | 17,623,804 | 33.60 | 1253 (4.0%) |
Chr 12 | 17,469,151 | 33.61 | 1249 (4.0%) |
Chr 13 | 18,930,772 | 33.53 | 1274 (4.0%) |
Chr 14 | 16,823,809 | 33.58 | 1100 (3.5%) |
Chr 15 | 17,856,992 | 33.49 | 1114 (3.5%) |
Chr 16 | 19,144,779 | 33.43 | 1157 (3.7%) |
Chr 17 | 17,455,136 | 33.11 | 1064 (3.4%) |
Chr 18 | 16,544,731 | 33.35 | 981 (3.1%) |
Chr 19 | 19,582,468 | 33.42 | 1238 (3.9%) |
Chr 20 | 18,773,231 | 33.33 | 1264 (4.0%) |
Chr 21 | 17,525,693 | 33.42 | 1195 (3.8%) |
Chr 22 | 20,670,775 | 33.40 | 1362 (4.3%) |
Chr 23 | 19,478,050 | 33.43 | 1142 (3.6%) |
Chr 24 | 20,990,950 | 33.58 | 1295 (4.1%) |
Chr 25 | 18,557,277 | 33.29 | 1110 (3.5%) |
Chr 26 | 17,833,917 | 33.51 | ,016 (3.2%) |
Total | 480,088,111 | -- | 31,111 (100%) |
Pseudochromosome . | Size(bp) . | GC content (%) . | Gene numbers . |
---|---|---|---|
Chr 01 | 22,278,372 | 34.41 | 1790 (5.7%) |
Chr 02 | 17,889,182 | 33.62 | 1211 (3.8%) |
Chr 03 | 20,787,299 | 33.64 | 1250 (4.0%) |
Chr 04 | 18,091,949 | 33.56 | 1076 (3.4%) |
Chr 05 | 17,571,142 | 33.66 | 1178 (3.7%) |
Chr 06 | 18,117,518 | 33.61 | 1157 (3.7%) |
Chr 07 | 15,795,701 | 33.70 | 1176 (3.7%) |
Chr 08 | 18,082,370 | 33.66 | 1129 (3.6%) |
Chr 09 | 17,185,555 | 33.66 | 1107 (3.5%) |
Chr 10 | 19,027,488 | 33.70 | 1223 (3.9%) |
Chr 11 | 17,623,804 | 33.60 | 1253 (4.0%) |
Chr 12 | 17,469,151 | 33.61 | 1249 (4.0%) |
Chr 13 | 18,930,772 | 33.53 | 1274 (4.0%) |
Chr 14 | 16,823,809 | 33.58 | 1100 (3.5%) |
Chr 15 | 17,856,992 | 33.49 | 1114 (3.5%) |
Chr 16 | 19,144,779 | 33.43 | 1157 (3.7%) |
Chr 17 | 17,455,136 | 33.11 | 1064 (3.4%) |
Chr 18 | 16,544,731 | 33.35 | 981 (3.1%) |
Chr 19 | 19,582,468 | 33.42 | 1238 (3.9%) |
Chr 20 | 18,773,231 | 33.33 | 1264 (4.0%) |
Chr 21 | 17,525,693 | 33.42 | 1195 (3.8%) |
Chr 22 | 20,670,775 | 33.40 | 1362 (4.3%) |
Chr 23 | 19,478,050 | 33.43 | 1142 (3.6%) |
Chr 24 | 20,990,950 | 33.58 | 1295 (4.1%) |
Chr 25 | 18,557,277 | 33.29 | 1110 (3.5%) |
Chr 26 | 17,833,917 | 33.51 | ,016 (3.2%) |
Total | 480,088,111 | -- | 31,111 (100%) |
Pseudochromosome . | Size(bp) . | GC content (%) . | Gene numbers . |
---|---|---|---|
Chr 01 | 22,278,372 | 34.41 | 1790 (5.7%) |
Chr 02 | 17,889,182 | 33.62 | 1211 (3.8%) |
Chr 03 | 20,787,299 | 33.64 | 1250 (4.0%) |
Chr 04 | 18,091,949 | 33.56 | 1076 (3.4%) |
Chr 05 | 17,571,142 | 33.66 | 1178 (3.7%) |
Chr 06 | 18,117,518 | 33.61 | 1157 (3.7%) |
Chr 07 | 15,795,701 | 33.70 | 1176 (3.7%) |
Chr 08 | 18,082,370 | 33.66 | 1129 (3.6%) |
Chr 09 | 17,185,555 | 33.66 | 1107 (3.5%) |
Chr 10 | 19,027,488 | 33.70 | 1223 (3.9%) |
Chr 11 | 17,623,804 | 33.60 | 1253 (4.0%) |
Chr 12 | 17,469,151 | 33.61 | 1249 (4.0%) |
Chr 13 | 18,930,772 | 33.53 | 1274 (4.0%) |
Chr 14 | 16,823,809 | 33.58 | 1100 (3.5%) |
Chr 15 | 17,856,992 | 33.49 | 1114 (3.5%) |
Chr 16 | 19,144,779 | 33.43 | 1157 (3.7%) |
Chr 17 | 17,455,136 | 33.11 | 1064 (3.4%) |
Chr 18 | 16,544,731 | 33.35 | 981 (3.1%) |
Chr 19 | 19,582,468 | 33.42 | 1238 (3.9%) |
Chr 20 | 18,773,231 | 33.33 | 1264 (4.0%) |
Chr 21 | 17,525,693 | 33.42 | 1195 (3.8%) |
Chr 22 | 20,670,775 | 33.40 | 1362 (4.3%) |
Chr 23 | 19,478,050 | 33.43 | 1142 (3.6%) |
Chr 24 | 20,990,950 | 33.58 | 1295 (4.1%) |
Chr 25 | 18,557,277 | 33.29 | 1110 (3.5%) |
Chr 26 | 17,833,917 | 33.51 | ,016 (3.2%) |
Total | 480,088,111 | -- | 31,111 (100%) |

Phyllanthus emblica L. cultivar ‘Yingyu’ genome-wide all-by-all Hi–C interaction heat map.
3.3 Gene annotation and repetitive sequence analysis
We used the reference genomes and gene data for Arabidopsis thaliana, Hevea brasiliensis, Manihot esculenta, Oryza sativa, Ricinus communis, and Zea mays to estimate protein-coding genes in P. emblica, and annotated these using a combination of homolog research and transcript evidence based on short-reads and full-length transcriptomes. The number of genes on each pseudochromosome ranged from 981 (3.1%) to 1790 (5.7%) (Table 2). The density of genes was higher at both ends of the chromosome, and decreased closer to the centromere (Fig. 7C). A total of 31,111 genes were obtained using a hybrid gene prediction protocol (Table 3). Among the predicted genes, 25,026 (80.4%) were homologs of Arabidopsis thaliana, 29,717 (95.5%) were homologs of Hevea brasiliensis, 30,239 (97.2%) were homologs of Manihot esculenta, 21,061 (67.7%) were homologs of Oryza sativa, 10,529 (33.8%) were homologs of Ricinus communis, and 21,178 were homologs of Zea mays. The high confidence gene set of Phyllanthus emblica reached 93.4% BUSCO score (Table 4).
Genes . | mRNAs . | CDSs . | |
---|---|---|---|
Arabidopsis thaliana | 25,026 | 37,348 | 227,044 |
Hevea brasiliensis | 29,717 | 35,002 | 182,635 |
Manihot esculenta | 30,239 | 41,753 | 246,881 |
Oryza sativa | 21,061 | 25,265 | 147,866 |
Ricinus communis | 10,529 | 11,066 | 22,321 |
Zea mays | 21,178 | 34,809 | 216,297 |
Phyllanthus emblica ‘Yingyu’ | 31,111 | 31,111 | 163,819 |
Genes . | mRNAs . | CDSs . | |
---|---|---|---|
Arabidopsis thaliana | 25,026 | 37,348 | 227,044 |
Hevea brasiliensis | 29,717 | 35,002 | 182,635 |
Manihot esculenta | 30,239 | 41,753 | 246,881 |
Oryza sativa | 21,061 | 25,265 | 147,866 |
Ricinus communis | 10,529 | 11,066 | 22,321 |
Zea mays | 21,178 | 34,809 | 216,297 |
Phyllanthus emblica ‘Yingyu’ | 31,111 | 31,111 | 163,819 |
Genes . | mRNAs . | CDSs . | |
---|---|---|---|
Arabidopsis thaliana | 25,026 | 37,348 | 227,044 |
Hevea brasiliensis | 29,717 | 35,002 | 182,635 |
Manihot esculenta | 30,239 | 41,753 | 246,881 |
Oryza sativa | 21,061 | 25,265 | 147,866 |
Ricinus communis | 10,529 | 11,066 | 22,321 |
Zea mays | 21,178 | 34,809 | 216,297 |
Phyllanthus emblica ‘Yingyu’ | 31,111 | 31,111 | 163,819 |
Genes . | mRNAs . | CDSs . | |
---|---|---|---|
Arabidopsis thaliana | 25,026 | 37,348 | 227,044 |
Hevea brasiliensis | 29,717 | 35,002 | 182,635 |
Manihot esculenta | 30,239 | 41,753 | 246,881 |
Oryza sativa | 21,061 | 25,265 | 147,866 |
Ricinus communis | 10,529 | 11,066 | 22,321 |
Zea mays | 21,178 | 34,809 | 216,297 |
Phyllanthus emblica ‘Yingyu’ | 31,111 | 31,111 | 163,819 |
High confidence gene set of Phyllanthus emblica ‘Yingyu’ . | |
---|---|
Complete BUSCOs | 1507 (93.4%) |
Fragmented BUSCOs | 14(0.9%) |
Missing BUSCOs | 93 (5.7%) |
Total BUSCO groups searched | 1614 (100%) |
High confidence gene set of Phyllanthus emblica ‘Yingyu’ . | |
---|---|
Complete BUSCOs | 1507 (93.4%) |
Fragmented BUSCOs | 14(0.9%) |
Missing BUSCOs | 93 (5.7%) |
Total BUSCO groups searched | 1614 (100%) |
High confidence gene set of Phyllanthus emblica ‘Yingyu’ . | |
---|---|
Complete BUSCOs | 1507 (93.4%) |
Fragmented BUSCOs | 14(0.9%) |
Missing BUSCOs | 93 (5.7%) |
Total BUSCO groups searched | 1614 (100%) |
High confidence gene set of Phyllanthus emblica ‘Yingyu’ . | |
---|---|
Complete BUSCOs | 1507 (93.4%) |
Fragmented BUSCOs | 14(0.9%) |
Missing BUSCOs | 93 (5.7%) |
Total BUSCO groups searched | 1614 (100%) |

Landscape of genomic features and genetic diversity of Phyllanthus emblica. Circles represent, from innermost to outermost, A) sequencing gaps, B) GC content, and C) gene numbers.
Repeat sequences occupied 231.6 Mb (48.23%) of the assembled genome of P. emblica (Table 5). The repeats comprised 7 major types in varying proportions. The most common types of repeat sequences were LTR retroelements (195.6 Mb, 40.74%), and DNA transposons (28.7 Mb, 5.99%), followed by unknown repeat sequences (10.0 Mb, 2.08%).
Repbase TEs . | TE protiens . | De novo . | Combined TEs . | |||||
---|---|---|---|---|---|---|---|---|
Type . | Length (bp) . | % in genome . | Length (bp) . | % in genome . | Length(bp) . | % in genome . | Length(bp) . | % in genome . |
DNA | 7,465,556 | 1.56 | 109,960 | 0.02 | 24,531,445 | 5.11 | 28,737,168 | 5.99 |
LINE | 1,146,986 | 0.24 | 0 | 0.00 | 2,375,992 | 0.49 | 3,428,114 | 0.71 |
SINE | 68,084 | 0.01 | 0 | 0.00 | 1,314 | 0.00 | 69,398 | 0.01 |
LTR | 512,348,147 | 10.90 | 51,888,400 | 10.81 | 191,180,802 | 39.82 | 195,569,656 | 40.74 |
Other | 10,618 | 0.00 | 0 | 0.00 | 0 | 0.00 | 10,618 | 0.00 |
Unknown | 0 | 0.00 | 0 | 0.00 | 10,007,138 | 2.08 | 10,007,138 | 2.08 |
Total | 60,266,377 | 12.55 | 51,998,360 | 10.83 | 224,437,086 | 46.75 | 231,557,443 | 48.23 |
Repbase TEs . | TE protiens . | De novo . | Combined TEs . | |||||
---|---|---|---|---|---|---|---|---|
Type . | Length (bp) . | % in genome . | Length (bp) . | % in genome . | Length(bp) . | % in genome . | Length(bp) . | % in genome . |
DNA | 7,465,556 | 1.56 | 109,960 | 0.02 | 24,531,445 | 5.11 | 28,737,168 | 5.99 |
LINE | 1,146,986 | 0.24 | 0 | 0.00 | 2,375,992 | 0.49 | 3,428,114 | 0.71 |
SINE | 68,084 | 0.01 | 0 | 0.00 | 1,314 | 0.00 | 69,398 | 0.01 |
LTR | 512,348,147 | 10.90 | 51,888,400 | 10.81 | 191,180,802 | 39.82 | 195,569,656 | 40.74 |
Other | 10,618 | 0.00 | 0 | 0.00 | 0 | 0.00 | 10,618 | 0.00 |
Unknown | 0 | 0.00 | 0 | 0.00 | 10,007,138 | 2.08 | 10,007,138 | 2.08 |
Total | 60,266,377 | 12.55 | 51,998,360 | 10.83 | 224,437,086 | 46.75 | 231,557,443 | 48.23 |
Repbase TEs . | TE protiens . | De novo . | Combined TEs . | |||||
---|---|---|---|---|---|---|---|---|
Type . | Length (bp) . | % in genome . | Length (bp) . | % in genome . | Length(bp) . | % in genome . | Length(bp) . | % in genome . |
DNA | 7,465,556 | 1.56 | 109,960 | 0.02 | 24,531,445 | 5.11 | 28,737,168 | 5.99 |
LINE | 1,146,986 | 0.24 | 0 | 0.00 | 2,375,992 | 0.49 | 3,428,114 | 0.71 |
SINE | 68,084 | 0.01 | 0 | 0.00 | 1,314 | 0.00 | 69,398 | 0.01 |
LTR | 512,348,147 | 10.90 | 51,888,400 | 10.81 | 191,180,802 | 39.82 | 195,569,656 | 40.74 |
Other | 10,618 | 0.00 | 0 | 0.00 | 0 | 0.00 | 10,618 | 0.00 |
Unknown | 0 | 0.00 | 0 | 0.00 | 10,007,138 | 2.08 | 10,007,138 | 2.08 |
Total | 60,266,377 | 12.55 | 51,998,360 | 10.83 | 224,437,086 | 46.75 | 231,557,443 | 48.23 |
Repbase TEs . | TE protiens . | De novo . | Combined TEs . | |||||
---|---|---|---|---|---|---|---|---|
Type . | Length (bp) . | % in genome . | Length (bp) . | % in genome . | Length(bp) . | % in genome . | Length(bp) . | % in genome . |
DNA | 7,465,556 | 1.56 | 109,960 | 0.02 | 24,531,445 | 5.11 | 28,737,168 | 5.99 |
LINE | 1,146,986 | 0.24 | 0 | 0.00 | 2,375,992 | 0.49 | 3,428,114 | 0.71 |
SINE | 68,084 | 0.01 | 0 | 0.00 | 1,314 | 0.00 | 69,398 | 0.01 |
LTR | 512,348,147 | 10.90 | 51,888,400 | 10.81 | 191,180,802 | 39.82 | 195,569,656 | 40.74 |
Other | 10,618 | 0.00 | 0 | 0.00 | 0 | 0.00 | 10,618 | 0.00 |
Unknown | 0 | 0.00 | 0 | 0.00 | 10,007,138 | 2.08 | 10,007,138 | 2.08 |
Total | 60,266,377 | 12.55 | 51,998,360 | 10.83 | 224,437,086 | 46.75 | 231,557,443 | 48.23 |
We used the Swissprot (http://www.uniprot.org/), TrEMBL (http://www.uniprot.org/), KEGG (http://www.genome.jp/kegg/), InterPro (http://www.ebi.ac.uk/interpro/), NR (https://www.ncbi.nlm.nih.gov/), KOG (https://www.hsls.pitt.edu/obrc/index.php/), and Go (https://www.geneontology.org/) databases and diamond software to annotate the functions of genes in P. emblica (Fig. 8A). In total, we annotated the functions of 29,793 of 31,111 genes. A total of 20,918 genes were common genes that were listed in all seven databases. We classified the functions of genes into three categories: cellular processes, environment information processing, and genetic information processing. We analyzed metabolism and organismal systems using the KEGG pathway (Fig. 8B). We found that the processes of transport and catabolism, signal transduction, translation, carbohydrate metabolism, biosynthesis of the secondary metabolites, and environmental adaption were regulated by 861, 1589, 1836, 2108, 1099, and 1202 genes respectively. The functions of these genes were also categorized as biological processes, cellular components, and molecular functions in a GO analysis (Fig. 8C). We found that cellular processes, metabolic processes, cellular anatomical entity, binding, and catalytic activity were regulated by 8655, 8129, 4799, 11,660, and 8536 genes respectively.

Annotation of gene function in Phyllanthus emblica L. cultivar ‘Yingyu’. A) Venn diagram depicting the annotated gene functions built using the NR, InterPro, KEGG, SwissProt, and KOG databases; B) Annotated gene functions in KEGG pathway; C) The gene annotation among different databases.
3.4 Ortholog genes and characteristics of gene families
Thirteen plant species, including Aquilegia coerulea, Arabidopsis thaliana, Beta vulgaris, Coffea arabica, Helianthus annuus, Nelumbo nucifera, Oryza sativa, Papaver somniferum, Phyllanthus emblica, Ricinus communis, Solanum lycopersicum, Theobroma cacao, and Vitis vinifera were selected 84 common single-copy orthologs were filtered to estimate species divergence times (Fig. 9A, Supplementary Material 1). Most genes (66.34% to 97.05%) were found to be multicopy orthologs and other paralogs. Helianthus annuus, Papaver somniferum, and Oryza sativa had 19,146, 7725, and 7316 unique paralogs, respectively. We found that 96.41% (8198 / 8503) of gene families in P. emblica were shared with the other plant species, while only 305 gene families were specific to P. emblica (Fig. 9B). We compared the levels of gene expansion and contraction in P. emblica to those observed in 12 other species. In total, 3799 gene families (3869 genes) were expanded in P. emblica, while 655 gene families (1606 genes) were contracted (Fig. 9C).

The ortholog genes and specific characteristics of gene families in Aquilgeia coerulea, Arabidopsis thaliana, Beta vulgaris, Coffea arabica, Helianthus annuus, Nelumbo nucifera, Oryza sativa, Papaver somniferum, Phyllanthus emblica, Ricinus communis, Solanum lycopersicum, Theobroma cacao and Vitis vinifera. A) Diagram showing the characteristics of ortholog genes. The bar diagram displays the distribution of single-copy, multiple-copy, unique, and other orthologs. B) Venn diagram depicting the number of shared and unique gene families. C) CAFE 5.0 estimates gene family expansions and contractions and the proportion of expanded/contracted/unchanged gene families in each plant species (pie charts). MRCA indicates the most recent common ancestor.
We performed a GO analysis for genes in the expanded and contracted gene families of P. emblica. We found that genes in the expanded families regulated metabolic processes, single organism processes, and protein metabolism (Supplementary Material 2). Membrane, membrane part, integral and intrinsic component of membrane terms showed significance in the cellular component category. Genes regulating catalytic, transferase, and oxidoreductase activity were significantly expanded in the molecular function category. Single-organism process, metabolic process, and catalytic activity-related genes on biological processes and molecular functions categories were contracted.
3.5 The evolutionary history, divergence time estimation, and genome collinearity and correlation of Phyllanthus emblica
To infer the phylogeny for P. emblica, we used the genomes of 12 other plant species (Aquilegia coerulea, Arabidopsis thaliana, Beta vulgaris, Coffea arabica, Helianthus annuus, Nelumbo nucifera, Oryza sativa, Papaver somniferurn, Ricinus communis, Solanum lycopersicum, Theobroma cacao and Vitis vinifera). Using the identified 83 single-copy ortholog genes, we constructed a phylogenetic tree containing all study species. The result showed that P. emblica was clustered with R. communis (Fig. 10A). The divergence time between R. communis and P. emblica was estimated at ~76.31 Mya and the CI value range from 41.99 to 103.54 Mya. The haploid genome of R. communis contained 10 pseudochromosomes while that of P. emblica contained 26 pseudochromosomes (Fig. 10B and 10C). Based on the results of the dot plot and bar diagram, the P. emblica was possibly about to undergo a genome-wide doubling event. Furthermore, comparison of P. emblica genome itself also was performed. The dot plot diagram showed P. emblica contained 26 pseudochromosomes may origin from two ancestors (Supplementary Material 3A). The two ancestors revealed a relatively high level of collinearity and correlation at Chr 01 and 02, Chr05 and 06, Chr07 and 08, Chr 19 and 20, Chr 21 and 22, Chr23 and 24 (Supplementary Material 3B).

The divergence time and evolutionary scenario of Phyllanthus emblica. A) The divergence time tree of the 13 species was inferred from the complete data set of single-copy homologous genes. The number at each node indicates the divergence time of each species (in millions of years). B) Dot plot illustrating the evolutionary scenario between Phyllanthus emblica and Ricinus communis; C) Bar diagram illustrating the evolutionary scenario between Phyllanthus emblica and Ricinus communis.
3.6 Tissue-specific genes screening and gene expression analysis of Phyllanthus emblica
We used five different tissue transcript samples to screen for candidate genes showing tissue-specific expression. A PCA analysis showed that the five tissues were divided into three groups according to gene expression level (Fig. 11). Specifically, tissue from the flowers showed a distinct gene expression level from the four tissues taken from other parts of the plant. Tissue from the young leaves and shedding branches, one-year-old branches, and fruits were clustered, suggesting they shared similar gene expression levels. We also found that the genes for heat shock proteins (PeChr17G003250.1, PeChr08G007020.2, PeChr13G001790.1 and PeChr17G002160.1), Hsp20 domain protein (PeSTRG.12157.1.p2^Chr13^), Cytochrome P450 CYP82D47 (PeChr19G000710.1), 1-aminocyclopropane-1-carbonxylate synthase 1 (PeChr19G003130.1) and Basic leucine zipper 43-like (PeChr01G006470.1) had higher expression levels in the shedding branch than in other tissues (Fig. 12A). The genes for bark storage protein A (PeSTRG.1253^Chr01^+), heat shock proteins (PeSTRG.17411^Chr18^-, and putative laccase-9 isoform X1 (PeChr25G003950.1) were specifically expressed in the one-year-old branch (Fig. 12B). Four unknown (or putative) function proteins were identified as candidates in flower tissue. PeMSTRG.19973^Chr21^- and PeSTRG.23352^Chr22^- had very high expression levels, reaching 34867.20 and 11158.56 TPM, respectively. In addition, cysteine biosynthesis-related genes (PeChr08G000700.1), cyclic nucleotide-gated ion channel 1 (PeChr02G011560.1), stress-induced protein KIN2-like (PeChr11G000190.1), spermidine hydroxycinnamoyl transferase (PeChr22G010170.1), and OPT domain-containing protein (PeSTRG22346.1.p1^Chr22^+) were also expressed at high levels (Fig. 12C). In the fruit tissue, gibberellin biosynthesis-related genes were found to be highly expressed, especially the GA-stimulated protein (PeSTRG.4700.1.p1^Chr04^-), which was expressed at 3694.40 TPM (Fig. 12D). The expression of probable xyloglucan endotransglucosylase (PeChr09G007940.1) was also found to be higher in the fruit tissue than in tissues from other parts of the plant. In the tissue from young leaves, genes for ethylene-responsive transcription factor 23 (PeChr01G017110.1), chlorophyll a-b binding protein CP29.3 (PeChr01G012550.1), and transcription factor bHLH71 (PeChr20G001610.1) were expressed at high levels (Fig. 12E).

PCA analysis of the gene expression levels in different tissues of Phyllanthus emblica L. cultivar ‘Yingyu’.

Specifically expressed genes in different tissues of Phyllanthus emblica L. A) Shedding branches; B) One-year-old branches; C) Flowers; D) Fruits (flesh); E) Young leaves.
3.7 Ascorbate biosynthesis-related gene expression level of P. emblica different tissues
The ascorbate biosynthesis-related genes were screened on P. emblica ‘Yingyu’ genome and annotation files (Fig. 13). Nine hexokinases (HK), three Glucose 6-phosphate isomerases (GPI), 2 mannose 6-phosphate isomerases (MPI), 2 phosphomannomutase (PMM), 4 GDP-mannose pyrophosphorylases, 2 GDP-mannose epimerases (GME), 3 GDP-L-galactose phosphorylases (GGP), 2 L-galactose-1-phosphate phosphatases (GPP), 1 l-galactono-1,4- lactone dehydrogenases (GalLDH), 4 guluno-1,4-lactone oxidases (GulLO), and 2 myo-inositol oxidases were screened in d-galacturonate pathway, l-galactose pathway and myoinositol pathway (Fig. 13A). In regeneration cycle pathway, 17 ascorbate oxidases (AO), 6 ascorbate peroxidases (APX), 5 monodehydroascorbate reductases (MDHAR), 3 dehydroascorbate (DHAR) were screened. l-galactose and regeneration cycle pathway-related genes were activated in all 5 tissues. Two GME genes (PeSTRG.22641.1.p1^Chr23^ and PeSTRG.23791.1.p1^Chr24^) showed higher expression levels in fruit than shedding branch, 1-year branch, flower, and leaves (Fig. 13B). PeSTRG.23791.1.p1^Chr24^ expression level of fruit were 3.6–10.0 times higher than other tissues. Two of 3 GGP genes had high expression levels in all tissues. PeSTRG23117.1.p1^Chr22^ expression level of shedding branch, leaves, and fruit reached 335.85, 340.90, and 408.44 TPM, respectively. GalLDH gene was the last step to biosynthesis of the ascorbic acid in l-galactose pathway. Only one GalLDH (PeSTRG.10930.1.p1^Chr10^) was screened. The expression level of fruit was 33.82 TPM, 1.75–2.90 times higher than other tissues. However, 4 GulLO genes were found in l-gulose pathway and the expression level was extremely low in all tissues.

The ascorbate biosynthesis-related gene expression level in different tissue of Phyllanthus emblica L. cultivar ‘Yingyu’. S, shedding branches; B, one-year branches; Fl, flowers; Fr, fruits(flesh); L, leaves.
4. Discussion
Phyllanthus comprises the largest genus in the plant subfamily Phyllanthaceae, and includes approximately 1270 species. Several species of Phyllanthus, such as P. emblica, Phyllanthus acidus, Phyllanthus niruri, Phyllanthus urinaria, and Phyllanthus muellerianus are important medicinal and edible plants.56Phyllanthus species contain an abundance of known and potential active ingredients such as tannins, minerals, vitamins, amino acids, and fatty acids. They also possess antimicrobial, antioxidant, anti-inflammatory, analgesic, antipyretic, adaptogenic, hepatoprotective, antitumor, antiulcerogenic, and immunomodulatory properties.57 Nevertheless, the biosynthesis mechanism of active ingredients in Phyllanthus species has remained poorly understood to date due to the absence of adequate reference genome data. The first chromosome-level genome of species in the genus was reported by Zhang et al. (2022) for P. cochinchinensis.18 The assembled genome of this species consisted of 284.88 Mb of genomic sequences. Using Illumina and the Hi-C technique, 13 pseudochromosomes were constructed, covering 95.41% of genome. The 98.1% of core conversed genes were identified. The 13-base chromosomes for species in the genus Phyllanthus were also confirmed through the tableting of plant meristems and microscopy.58 Recently, Mahajan et al. (2023) reported the first draft genome of P. emblica.10 The genome size was 519 Mb covering 98.40% of the genome. This draft genome did not form a genome at the chromosome level. Rahman et al. (2021) reported that somatic chromosome numbers 2n = 10x = 100 (x = 10) of P. emblica.58 Li et al. (2024) reported the P. emblica cultivated type ‘Bingtian’.21 The genome assemble size was 1.76 Gb with 104 pusdochromosomes. The complete BUSCOs of genome assembly were 97.9% and the complete BUSCOs of gene annotation were 98.1% with high assemble and annotation quality. In the present study, we first assembled the genome of P. emblica wild type ‘Yingyu’ (Table 2, Fig. 7). This chromosome-level genome was assembled using HiFi and Hi-C data. The genome sequence contained 480 Mb and was clustered into 26 pseudochromosomes. The BUSCOs of genome assembly reached 97.5% (Table 1). The BUSCOs of the ‘Yingyu’ genome assembly are nearly equivalent to those of the ‘Bingtian’ genome, indicating a high level of genome assemble quality. However, the BUSCOs of ‘Yingyu’ gene annotation lower than ‘Bingtian’ reached 93.4% (Table 4). This may be due to differences in the choice of reference species for annotation. We are also planning to improve the assembly and annotation completeness of the ‘Yingyu’ genome to enhance its utility. The high-quality chromosome-level genome and transcriptome information for P. emblica provide a fundamental resource for studies on genomics and comparative genomics, and may especially benefit research into the mechanisms regulating the physiology and active ingredients of Phyllanthus species and other members of the Euphorbiaceae.
As the data and annotation files for the first P. cochinchinensis18 genomes were not publicly accessible, we could not use their data to perform a detailed comparative genome analysis. Instead, we focused on R. communis, which is another related species to P. emblica. Our results indicate that the divergence time between R. communis and P. emblica is ~76.31Mya and the CI value ranges from 41.99 to 103.54 Mya (Fig 10A). Although the genome information available for the research of Phyllanthus emblica and related species are relatively few. The result still initiated a relatively accurate result than Timetree website http://timetree.org/ published data (median time: 82.6 Mya and CI: (36.5–117.0 Mya)). With the sequencing and publication of the genomes related to the Phyllanthus genus, the mysteries of the differentiation of Phyllanthus emblica and their related species will be uncovered in the future.
It is widely recognized that an increase in the number of chromosomes correlates with a reduction in chromosome length, while the intricate nature of ploidy adds complexity to the analysis of both chromosome count and karyotype.59 In one study, the ploidy of species in the genus Annona was determined to be triploid and tetraploid by Aceto-carmine and DAPI staining.60 Species of Saccharum are known to vary in their number of chromosomes and polyploidy.61 Using a FISH assay and chromosome painting technique, Zhang et al. (2021) showed that chromosome numbers ranged between 40 and 128 in hexaploid, octaploid, decaploid, and dodecoid plants.62Sauropus androgynus is a related species of P. emblica.63 The karyotype and genome analysis detected 2n = 4x = 52. The P. cochinchinensis is also a related species of P. emblica. It is a diploid, with a haploid chromosome number of 13.18 The P. emblica cultivated type ‘Bingtian’ was reported as an octopoid plant (AAAABBBB) and the pusdochromosomes number was 104 but the imaging resolution did not yield satisfactory chromosome observation images for comparison.21 ‘Bingtian’ originated from two diploid ancestors (AA and BB), with a haploid chromosome number of 13. In our research, we first proved the wild type ‘Yingyu’ and cultivated type ‘bingtian’ are the same octopoid plants by flow cytometry analysis (Fig. 5A). The chromosome number of P. emblica was accurately calculated 104 through microscopy images (Fig. 5B). ‘Yingyu’ is an octopoid plant and the karyotype analysis also proved the haploid chromosome number is 13. The assembled P. emblica ‘Yingyu’ genome were 26 pusdochromosomes with a relatively high level of collinearity and correlation similar pattern (Figs. 6–7 and Supplementary Material 3). Through comparative analysis of the genomes of ‘Bingtian’ and ‘Yingyu’, it was also revealed that the 26 pseudochromosomes in ‘Yingyu’ exhibit significant similarity with ancestors A and B of ‘Bingtian’ (Supplementary Material 4). Therefore, the assembled wild-type ‘Yingyu’ genome in this study is a haploid genome. Because the genome of wild type ‘Yingyu’ is a simplified reference genome devoid of heterozygosity. The research currently cannot prove the ploidy of the ancestors and the number of whole genome duplication events. This uncertainty underscores the necessity of further supplementing and refining the complete chromosome-level genome as the utility of pan-genomic resources. Furthermore, the chromosomes of P. emblica are small and numerous, future studies should make sure the karyotype of P. emblica in detail using FISH assays and chromosome painting methods.
Gibberellin acid-stimulated Arabidopsis/ Gibberellin stimulated transcript (GASA/GAST) genes play important roles in a wide range from development processes to stress response in plants, especially flower and fruit development. The GASA family genes have been identified Arabidopsis,64Solanum lycopersicum,65Malus domestica,66Populus tricocarpa,67 and Prunus mume.68 Seventeen GASA family genes were identified in tomatoes.65SlGASA1 and SlGASA3 genes followed the ripening-associated expression pattern, with low expression levels during fruit ripening. SlGASA1-knockdown fruit displayed accelerated ripening. SlGASA1-overexpression showed significantly reduced ethylene production. Moreover, more ripening-related genes were downregulated in SlGASA1-overexpression fruits. In our research, the GA-stimulated protein (PeSTRG.4700^Chr04^) initiated high expression levels in fruits (flesh) (Fig. 12D). The fruit of P. emblica was hard and can be hung on the tree until the next year flowering in color while still being emerald green. It is difficult to judge the optimal harvest stage in actual production. This may be due to GA-stimulated protein expressed and regulated by the other ripening-related genes to control fruit ripening.
The research related to plant flower sex differentiation and determination was started at about 120 years ago. However, there is diversity in the regulatory networks of sexual differentiation in different plant species.69 The determination of plant sex and the formation of floral organs are one of the most fascinating areas of research in plant science now. Kaki is a dioecious plant and a pseudogene of the Y chromosome named OGI is responsible for encoding a sex-determining factor.70 This pseudogene produces small RNAs targeting the MeGI gene located on X chromosome, leading to the separation of male and female individuals. The inflorescences of the tropical plants such as Cocos nucifera and Areca catechu have a monoecious spadix with male and female flowers arranged in an individual. The male flowers are positioned at the top of the inflorescence and the female flowers at the base. The male flowers open earlier than the female flowers, preventing self-pollination through the time gap between their blooming. Zhou et al (2022) indicated that the concentration of JA in the female flowers is approximately 10 times that of the male flowers, and the concentration of JA in the bisexual flower position is about twice that of the male flowers in the normal inflorescence.71 Moreover, high-concentration JA treatment also promotes the development of female flower organs by reducing the expression of B-class transcription factors in the flower development ABC model, such as AGL16, AP3, PIb, and PIc. P. emblica also belong to monoecious plants and the female flower only at the top of the inflorescence (Fig. 1C). Two function unknown genes (PeMSTG.19973^chr21^ and PeSTRG.23352^chr22^-) were extremely high expression level on flowers (Fig. 12C). The two high expressed hypothetical function proteins were also identified. These are likely unique regulatory genes involved in the flower sex differentiation of P. emblica. Furthermore, it is necessary to verify the function of these genes and their regulation network.
Rosa roxburghi fruit possesses various secondary metabolites such as ascorbic acid.72 Almost ascorbate biosynthesis-related genes of Rosa roxburghii were activated on l-galactose and regeneration cycle pathways. However, galacturonate, l-gulose, and myo-inositol pathways were nearly silenced. This trend is also observed in the presence of P. emblica (Fig. 13A). GME is generally considered to be a central enzyme of the major ascorbate biosynthesis pathway in higher plants.73 RNAi-silenced GME tomato line fruits initiated significantly lower ascorbate content than the control line. In our research, 2 GME genes (PeSTRG.22641.1.p1^Chr23^ and PeSTRG23791.1.p1^Chr24^) all initiated fruit-specific expression and PeSTRG23791.1.p1^Chr24^ expression level reached to 183.08 TPM (Fig. 13B). Ascorbic acid content of Rosa roxburghii fruit was significantly associated with RrGGP2 gene expression during fruit development.74RrGGP2 silencing reduced roxburghii fruit ascorbic acid content by 28.9%. RrGGP2 Overexpression increased the ascorbic acid content by 8–12 fold of tobacco leaves. The GGP genes of P. emblica showed high expression levels in different tissues (Fig. 13B). PeSTRG23117.1.p1^Chr22^ expression level from shedding branches, leaves to fruits were gradually increased. This phenomenon may be due to the preference for ascorbic acid to accumulate in the fruit than other tissues. FaGalLDH is a key enzyme in the terminal step of the ascorbate biosynthesis pathway in strawberries. Ascorbic acid content increased during fruit development and peaked at the ripening stage. Ascorbic acid concentrations in different tissues were correlated with enzyme activity and transcription level of FaGalLDH. Transient over-expression of FaGalLDH in fruit increased Ascorbic acid production significantly.75GalLDH is the last step to ascorbic acid in L-galactose pathway and four GalLDH genes were found in the cultivated type ‘Bingtian’ genome.21 The expression level of the stem was higher than leaves. The expression levels of the GalLDH gene in the four fruit developmental stages showed a gradually increasing trend. However, the pusdochromosome location and gene IDs were not mentioned in detail. Only one GalLDH was screened in wild type ‘Yingyu’ genome. PeSTRG.10930.1.p1^Chr10^ showed a higher expression level in fruit than in leaves, flowers, one-year branches, and shedding branches (Figure 13B). In future research plans, the fruit of wild and cultivated type P. emblica from different growth development stages will measure and compare the ascorbic acid content and perform transcriptome to screen and check these ascorbate biosynthesis candidate genes.
Besides P. emblica, other resource plants grown in the dry-hot valleys in Yunnan province include Persea americana,76Annona muricata,77Vitis vinifera,43Syzygium samarangense, Hylocereus undulates, Sidium guajava, Dracontomelon duperreanum, and Tamarindus indica. We urgently need to do some basic research within our ability. Reference genomes are still lacking for several of these species. To this end, the genome data for P. emblica provide an important resource for the selection of excellent cultivars for processing, medical use, and fresh consumption. Furthermore, the assembled chromosome-level genome of P. emblica provides useful data and information for subsequent genetic analysis of this and other Phyllanthaceae species. The genomic data and information we provide here may play a pivotal role in gaining a comprehensive understanding of specific metabolites, flower sex differentiation, fruit expansion, and the regulatory mechanisms governing ripening in P. emblica and other plant species.
Supplementary material
Supplementary data are available at DNARES online.
Supplementary Material 1: The multi-copy orthologs, other orthologs, single-copy orthologs, and unique paralogs in thirteen plant species.
Supplementary Material 2: The GO analysis of significant expansion and contraction genes.
Supplementary Material 3: The genome evolutionary scenario of Phyllanthus emblica ‘Yingyu’ (A) Dot plot illustrating; (B) Bar diagram illustrating.
Supplementary Material 4: The subgenome and mini-subgenome classification in Phyllanthus emblica ‘Yingyu’.
Acknowledgements
This work was supported by the talent introduction and training program of Yunnan Academy of Agricultural Sciences (Grant number: 2023RCYP-05), the regional program of National Natural Science Foundation of China (Grant number: 32360714), Yunnan Provincial Science and Technology Plan Basic Research Project (Grant number: 202401AU070064) and Yunnan Provincial Department of Science and Technology, innovation guidance and science and technology enterprise cultivation program (Grant number: 202204BP090017) for their research support.
We want to express our gratitude to Yucang Sha, Yuetang Duan, Shunlin Yang, Zixiang Yang, Xueqin Han, and Zibo Song for the selection and cultivation of ‘Yingyu’.
We sincerely appreciate the critical comments and suggestions provided by the editor, the reviewers, and Takuya Morimoto, the lecturer at Kyoto Prefectural University in Japan, regarding the revisions of the article.
Author contributions
L.Z. and J.J. designed the project; L.Z., C.L., and J.J provided the research funding; Q.Z. and K.M. provided and screened the Phyllanthus emblica resource; J.Y., X.L., and K.Q. collected the materials for sequencing; T.P. and L.Z. performed and arranged the genome assembly, annotation, and transcriptome data analysis; L.Z. wrote the draft manuscript; All authors wrote and edited the manuscript.
Conflicts of interest
The authors declare no conflicts of interest.
Data Availability
All the Raw data sets has been deposited at DDBJ under the accession number: PRJDB18024 (SAMD00787772–SAMD00787776 and SAMD00787338–SAMD00787342) and 10.6084/m9.figshare.25610388 of Figshare database.