-
PDF
- Split View
-
Views
-
Cite
Cite
Jasmine D Alqassar, Hannah E Aichelman, Isabel A Novick, Sean P Mullen, De Novo Genome Assembly and Annotation for the Synanthropic Webbing Clothes Moth (Tineola bisselliella): A Globally Distributed, Economically Important Pest, Genome Biology and Evolution, Volume 16, Issue 12, December 2024, evae266, https://doi.org/10.1093/gbe/evae266
- Share Icon Share
Abstract
Tineola bisselliella, the webbing clothes moth, is an economically important, globally distributed synanthropic pest species and member of the basal moth lineage Tineidae. These moths are facultatively keratinophagous, and their larvae can cause extensive damage, particularly to clothing, textiles, and museum specimens. Despite the economic and phylogenetic importance of T. bisselliella, there is a lack of quality genomic resources for this, or for other species within the Tineidae family. The T. bisselliella genome assembly presented here consists of 30 pseudochromosomes (29 autosomes and 1 Z chromosome) produced using synteny alignment of a preliminary contig-level assembly (256 contigs) to a closely related species, Tinea pellionella. The resulting final pseudochromosome-level assembly is 243.630 Mb and has an N50 length of 8.708 Mb. The assembly is highly contiguous and has similar or improved quality compared to other available Tineidae genomes, with 93.1% (91.8% single copy and 1.3% duplicated) of lepidopteran orthologs complete and present. Annotation of the pseudochromosome-level genome assembly with the transcriptome we produced ultimately yielded 11,259 annotated genes. Synteny alignments between the T. bisselliella genome assembly and other Tineidae genomes revealed evidence for numerous small rearrangements with high synteny conservation. In contrast, a synteny alignment performed between T. bisselliella and Melitaea cinxia, which is thought to have retained the ancestral karyotype (n = 31), revealed a fusion of the ancestral autosome 30 and Z chromosome that led to a reduction in T. bisselliella karyotype size. The reference quality annotated genome for T. bisselliella presented here will advance our understanding of the evolution of the lepidopteran karyotype by providing a chromosome-level genome for this basal moth lineage and provide future insights into the mechanisms underlying keratin digestion in T. bisselliella.
Tineola bisselliella, the webbing clothes moth, is a globally distributed synanthropic pest species that feeds on both keratin-based materials and detritus. This dietary habit can cause substantial damage to clothing, textiles, rugs, taxidermy, and museum specimens. In fact, keratinophagous organisms cause an estimated $1 billion worth of damage annually in the United States. However, the lack of a high-quality annotated reference genome for this basal moth species has thus far limited our understanding of the mechanisms underlying keratin digestion relevant to pest control efforts and has hampered efforts to investigate the broader phylogenetic relationships in the Tineidae family necessary to reveal patterns of lepidopteran chromosome evolution. Here, we present the first reference quality genome for T. bisselliella and the first functionally annotated genome for any member of the Tineidae family, providing a critical resource for the lepidopteran genomics and pest management communities.
Introduction
Synanthropic organisms are undomesticated species that live with and exploit humans (Klegarth 2017). The webbing clothes moth, Tineola bisselliella, is a synanthropic organism that is also classified as a globally distributed pest species (Plarre and Krüger-Carstensen 2011). Tineola bisselliella is a facultative keratinophagous species, because as larvae, they can feed on both keratin and detritus (Plarre and Krüger-Carstensen 2011; Querner 2016). Additionally, T. bisselliella is a member of the basal lepidopteran family Tineidae, a large and diverse group of moths (greater than 3,000 species contained within approximately 300 genera) whose ancestral diet consisted of fungus and lichen (Davis and Gaden 1999; Regier et al. 2015). However, many moths within Tineidae have also evolved the ability to digest keratin (Regier et al. 2015). Despite the estimated $1 billion USD worth of damage caused by keratinophagous organisms each year in the United States, research into the mechanism of keratin digestion in Tineidae moths has been greatly limited by the lack of quality annotated genomic resources for the family (Metcalf and Luckmann 1994; Clark et al. 2016). There are currently five chromosome-level, reference quality genomes available on NCBI GenBank for the Tineidae family (Tinea trinotella, Tinea semifulvella, Monopis laevigella, Tinea pellionella, and Nemapogon koenigi); however, none of these genomes have published annotations currently available (Clark et al. 2016; Boyes et al. 2022, 2023a, 2023b, 2024; GenBank accession no. GCA_963924495.1).
Previous efforts to generate a de novo reference genome for T. bisselliella have resulted in fragmented contig-level genomes that fall short of the reference quality ultimately needed to be able to perform analyses to understand chromosome evolution, syntenic changes, and the mechanism of keratin digestion (GenBank accession nos. GCA_026546545.1 and GCA_028551675.1; Clark et al. 2016). Additionally, the absence of high-quality transcriptomic data has prevented any attempts at genome annotation in the family. Therefore, additional efforts are needed to improve the existing genomic resources for this system and to advance our understanding of lepidopteran genome evolution more broadly.
A well-annotated genome will provide a foundation for molecular evolution studies of the mechanisms underlying diet evolution and keratin digestion in T. bisselliella. Previous research identified two Bacillus species in the gut of T. bisselliella larvae that express keratinases, which are hypothesized to be partially responsible for keratin digestion (Vilcinskas et al. 2020). However, expression of serine protease or cysteine synthase genes in the moth's genome has been proposed as an alternative mechanism due to their ability to break the complex disulfide bonds contained in keratin (Hughes and Vogler 2006; Schwabe et al. 2021). Unfortunately, to date, both avenues of investigation have been limited by the lack of a quality genome which has contributed to the limited understanding surrounding the evolution of keratinophagy in this family.
Tineola bisselliella is also significant in the context of the lepidopteran phylogeny, as it is a member of the poorly understood, basal moth family Tineidae. Generating a high-quality genome for this species may shed light on lepidopteran karyotype evolution, which is of particular interest because of the large diversity in the haploid chromosome number in the group, ranging from 5 to 223 (de Vos et al. 2020). This diversity stems from the holocentric chromosomes of Lepidoptera, which have kinetochores that span the length of the chromosomes, allowing chromosome fragments to be inherited (Mandrioli and Manicardi 2020). These karyotype changes can create prezygotic and postzygotic barriers to gene flow that can ultimately lead to speciation (de Vos et al. 2020). Therefore, assessment of synteny conservation between lepidopteran genomes can provide insight into karyotype changes acquired during the process of speciation.
Here, we present the first reference quality genome for T. bisselliella and the first annotated genome for any member of the Tineidae family to aid in these future research directions. Additionally, we performed preliminary synteny analyses between our pseudochromosome-level T. bisselliella genome assembly and available genomes of Tineidae family members, along with Bombyx mori and Melitaea cinxia (supplementary table S1, Supplementary Material online), to investigate lepidopteran karyotype evolution.
Results and Discussion
Whole-genome sequencing of two T. bisselliella adults using PacBio HiFi technology yielded 11 Gb of HiFi data with the two libraries containing 646,360 and 355,368 reads, respectively. The average read length for both libraries was ∼10 kb, which were pooled to produce the final genome assembly with an estimated base coverage across pooled reads of 43×. Improved phased assembly (IPA) produced an initial genome assembly with 256 contigs, an N50 contig length of 1.731 Mb, and a total genome size of 243.578 Mb (Table 1). Synteny alignment was performed using Satsuma to the existing T. pellionella reference genome, as it is the species most closely related to T. bisselliella (Regier et al. 2015) with a published chromosome-level genome (Boyes et al. 2024). The synteny alignment produced an assembly with a total of 89 superscaffolds, an N50 scaffold length of 5.904 Mb, and a total genome size of 243.595 Mb (Table 1). Satsuma then produced the final pseudochromosome assembly which contained 30 pseudochromosomes (29 autosomes and 1 Z chromosome). The N50 pseudochromosome length was 8.708 Mb, and the total genome size was 243.630 Mb (Table 1). The final assembly includes 99.98% of the original contig-level assembly.
Assembly and annotation summary statistics for all versions of the T. bisselliella genome assembly discussed
Contig-level genome assembly statistics | |
Total contigs | 256 |
Contig N50 (Mb) | 1.731 |
Scaffold-level genome assembly statistics | |
Total scaffolds | 89 |
Scaffold N50 (Mb) | 5.904 |
Final pseudochromosome-level genome assembly statistics | |
Total pseudochromosomes | 30 |
Pseudochromosome N50 (Mb) | 8.708 |
GC content | 32.4% |
Final assembly length (Mb) | 243.630 |
BUSCO analysis of final pseudochromosome-level genome assembly (Lepidoptera_odb10 n = 5,286) | |
Complete | 93.1% (4,923) |
Single copy | 91.8% (4,852) |
Duplicated | 1.3% (71) |
Fragmented | 0.9% (46) |
Missing | 6.0% (317) |
Gene annotation statistics | |
Protein-coding genes | 11,259 |
Genes with functional information | 11,037 |
Number of introns | 539,767 |
Number of repetitive elements | 3,627,042 |
Number of mRNA transcripts | 13,114 |
Mean exons per mRNA | 9.3 |
Mean mRNA per gene | 1.2 |
BUSCO analysis of annotated protein-coding genes (Lepidoptera_odb10 n = 5,286) | |
Complete | 77.3% (4,084) |
Single copy | 58.3% (3,082) |
Duplicated | 19.0% (1,002) |
Fragmented | 1.1% (57) |
Missing | 21.6% (1,145) |
Contig-level genome assembly statistics | |
Total contigs | 256 |
Contig N50 (Mb) | 1.731 |
Scaffold-level genome assembly statistics | |
Total scaffolds | 89 |
Scaffold N50 (Mb) | 5.904 |
Final pseudochromosome-level genome assembly statistics | |
Total pseudochromosomes | 30 |
Pseudochromosome N50 (Mb) | 8.708 |
GC content | 32.4% |
Final assembly length (Mb) | 243.630 |
BUSCO analysis of final pseudochromosome-level genome assembly (Lepidoptera_odb10 n = 5,286) | |
Complete | 93.1% (4,923) |
Single copy | 91.8% (4,852) |
Duplicated | 1.3% (71) |
Fragmented | 0.9% (46) |
Missing | 6.0% (317) |
Gene annotation statistics | |
Protein-coding genes | 11,259 |
Genes with functional information | 11,037 |
Number of introns | 539,767 |
Number of repetitive elements | 3,627,042 |
Number of mRNA transcripts | 13,114 |
Mean exons per mRNA | 9.3 |
Mean mRNA per gene | 1.2 |
BUSCO analysis of annotated protein-coding genes (Lepidoptera_odb10 n = 5,286) | |
Complete | 77.3% (4,084) |
Single copy | 58.3% (3,082) |
Duplicated | 19.0% (1,002) |
Fragmented | 1.1% (57) |
Missing | 21.6% (1,145) |
Assembly and annotation summary statistics for all versions of the T. bisselliella genome assembly discussed
Contig-level genome assembly statistics | |
Total contigs | 256 |
Contig N50 (Mb) | 1.731 |
Scaffold-level genome assembly statistics | |
Total scaffolds | 89 |
Scaffold N50 (Mb) | 5.904 |
Final pseudochromosome-level genome assembly statistics | |
Total pseudochromosomes | 30 |
Pseudochromosome N50 (Mb) | 8.708 |
GC content | 32.4% |
Final assembly length (Mb) | 243.630 |
BUSCO analysis of final pseudochromosome-level genome assembly (Lepidoptera_odb10 n = 5,286) | |
Complete | 93.1% (4,923) |
Single copy | 91.8% (4,852) |
Duplicated | 1.3% (71) |
Fragmented | 0.9% (46) |
Missing | 6.0% (317) |
Gene annotation statistics | |
Protein-coding genes | 11,259 |
Genes with functional information | 11,037 |
Number of introns | 539,767 |
Number of repetitive elements | 3,627,042 |
Number of mRNA transcripts | 13,114 |
Mean exons per mRNA | 9.3 |
Mean mRNA per gene | 1.2 |
BUSCO analysis of annotated protein-coding genes (Lepidoptera_odb10 n = 5,286) | |
Complete | 77.3% (4,084) |
Single copy | 58.3% (3,082) |
Duplicated | 19.0% (1,002) |
Fragmented | 1.1% (57) |
Missing | 21.6% (1,145) |
Contig-level genome assembly statistics | |
Total contigs | 256 |
Contig N50 (Mb) | 1.731 |
Scaffold-level genome assembly statistics | |
Total scaffolds | 89 |
Scaffold N50 (Mb) | 5.904 |
Final pseudochromosome-level genome assembly statistics | |
Total pseudochromosomes | 30 |
Pseudochromosome N50 (Mb) | 8.708 |
GC content | 32.4% |
Final assembly length (Mb) | 243.630 |
BUSCO analysis of final pseudochromosome-level genome assembly (Lepidoptera_odb10 n = 5,286) | |
Complete | 93.1% (4,923) |
Single copy | 91.8% (4,852) |
Duplicated | 1.3% (71) |
Fragmented | 0.9% (46) |
Missing | 6.0% (317) |
Gene annotation statistics | |
Protein-coding genes | 11,259 |
Genes with functional information | 11,037 |
Number of introns | 539,767 |
Number of repetitive elements | 3,627,042 |
Number of mRNA transcripts | 13,114 |
Mean exons per mRNA | 9.3 |
Mean mRNA per gene | 1.2 |
BUSCO analysis of annotated protein-coding genes (Lepidoptera_odb10 n = 5,286) | |
Complete | 77.3% (4,084) |
Single copy | 58.3% (3,082) |
Duplicated | 19.0% (1,002) |
Fragmented | 1.1% (57) |
Missing | 21.6% (1,145) |
Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis revealed the T. bisselliella pseudochromosome assembly had a total of 93.1% (91.8% single copy and 1.3% duplicated copies) of BUSCO orthologs from the Lepidoptera-odb10 lineage data set (5,286 orthologs) present as complete orthologs (Table 1). Our pseudochromosome assembly had a large reduction in the amount of duplicated orthologs from the original contig-level assembly (14.7% to 1.3%) (supplementary fig. S1 and table S2, Supplementary Material online). The genome of a closely related species, T. pellionella, has a slightly higher amount of complete orthologs present (93.2%) and a greater percent present as single copies (92.4% compared to 91.8% in our assembly) (Boyes et al. 2024; Table 1; supplementary figs. S1 and S2 and table S2, Supplementary Material online). Additionally, our T. bisselliella genome has a slightly higher amount of missing orthologs (6%) compared to the T. pellionella genome (5.8%) (Table 1; supplementary table S2 and fig. S2, Supplementary Material online). We speculate the slight discrepancies in the amount of complete orthologs present could be because the T. pellionella genome assembly was produced using Hi-C data and therefore likely has better resolved gaps (Boyes et al. 2024). However, the amount of complete orthologs present in our T. bisselliella assembly is still greater than the average quality of lepidopteran genomes available on LepBase and NCBI GenBank (81.2% BUSCO score, scaffold N50 of 1.706 Mb) reported by Ellis et al. (2021). Therefore, the quality of the assembly presented here is higher than many of the lepidopteran genomes currently available.
BUSCO analysis using the Bacteria-odb9 lineage data set (148 orthologs) identified bacterial contamination on two pseudochromosomes originating from three putative scaffolds assembled by Satsuma, which were masked from the final assembly using BEDtools. A detailed list of the masked region, their sizes, and probable source is detailed in supplementary table S3, Supplementary Material online. The bacterial contamination region found on pseudochromosome 7 was identified by BLASTn as repeated sequences of 16S rRNA from Brachybacterium paraconglomeratum, a bacterial species that has been found to inhabit the gut of B. mori (Chen et al. 2020). These sequences are repeated across the region but are also separated by unidentified spacer sequences. Therefore, we hypothesize that B. paraconglomeratum could also inhabit the T. bisselliella gut and was sequenced along with the T. bisselliella genomic material.
To aid in genome annotation, RNA sequencing data were obtained from T. bisselliella adults (328,802,057 paired-end reads), late-stage larvae (335,986,188 paired-end reads), and early-stage larvae (302,067,521 paired-end reads). After quality filtering, adapter trimming, polyG tail trimming, and assembly of contigs using Trinity, the transcriptome assembly contained 231,316 contigs (Chen et al. 2018; Grabherr et al. 2011). After discarding 131,386 contigs that were shorter than 500 bp, 99,930 contigs remained. After isoform collapsing by CD-HIT (Li and Godzik 2006), the final transcriptome assembly contained 58,509 contigs, from which BLAST identified 14,903 protein-coding genes. eggNOG-mapper v2 annotated 13,615 genes in the final transcriptome assembly. Additionally, RNA-seq reads (with adapters and polyG tails trimmed) were mapped to the final genome assembly using STAR v2.71.10b, and all three RNA-seq read pools had over 75% of uniquely mapped reads (supplementary table S4, Supplementary Material online) (Dobin et al. 2013).
Genome annotation using the MAKER pipeline identified 11,259 protein-coding genes corresponding to 13,114 mRNA transcripts, and 94% of these annotations had an annotation edit distance (AED) score of 0.5 or less after the second round of ab initio gene prediction (Table 1; supplementary fig. S4, Supplementary Material online). Of the identified protein-coding genes, 11,037 were functionally annotated by eggNOG-mapper v2 (Table 1). The annotations were evaluated by BUSCO to have 77.3% of the Lepidoptera-odb10 data set present as complete orthologs (58.3% single copy and 19% duplicated copies; Table 1).
Interestingly, our reference quality T. bisselliella genome includes annotations with transcriptomic evidence for both serine proteases and cysteine synthases. Prior to this study, only a T. bisselliella larval gut transcriptome had been generated, which led to the discovery of the presence of Bacillus bacteria in the larval gut that are postulated to aid in keratin digestion (Schwabe et al. 2021). However, previous studies found expression of serine proteases or cysteine synthases in the larval gut of T. bisselliella without a bacterial source (Hughes and Vogler 2006; Schwabe et al. 2021). As these enzymes play a potentially important role in keratin digestion, this research suggested that the ability to digest keratin is encoded within the T. bisselliella genome and not dependent on bacterial symbionts. Complicating this matter further, serine proteases are also found in the gut of all Lepidoptera and function as a primary step in plant protein digestion (Srinivasan et al. 2006); therefore, their presence in the T. bisselliella gut could be unrelated to keratin digestion. Ultimately, Hughes and Vogler (2006) concluded that serine proteases are not sufficient to have facilitated the evolutionary shift to keratin feeding and digestion, because of the close relatedness of serine proteases in the T. bisselliella gut to the serine proteases in other Lepidoptera. Nevertheless, serine proteases do have the ability to break the disulfide bonds of keratin, and an alternative hypothesis, which has yet to be investigated, is that the digestion of keratin is facilitated by a combination of symbionts in the gut microbiome, enzymes encoded for in the genome, and the low oxidation–reduction potential of the larval gut first observed by Waterhouse (1952). This annotation resource can now aid in research aimed at functionally validating these candidate enzymes and their role in keratin digestion.
Chromosomal rearrangements are expected in lepidopteran genomes because their chromosomes are holocentric and therefore able to tolerate inheritance of chromosome fission and fusion events (Mandrioli and Manicardi 2020). The holocentric nature of lepidopteran chromosomes means that microtubules can attach at any point along the chromosome, facilitating the inheritance of fragmented pieces of chromosomes (Mandrioli and Manicardi 2020). This contrasts with breakage in monocentric chromosomes, which would be discarded because there are no kinetochores on fragments without a centromere, so microtubules cannot bind during cell division (Mandrioli and Manicardi 2020). Therefore, chromosome fragments and fused chromosomes can be inherited by Lepidoptera, which has led to the great diversity of their haploid chromosome number ranging from 5 to 223 (de Vos et al. 2020).
The synteny alignment of T. bisselliella and T. pellionella genomes showed high synteny conservation between the two species (Fig. 1a; Boyes et al. 2024). However, the alignment also illustrates that small chromosomal rearrangements between the two species have occurred and were retained during the synteny-based assembly of the T. bisselliella genome into pseudochromosomes (Fig. 1a). Additional synteny alignments of the T. bisselliella pseudochromosome genome assembly with two other Tineidae family genomes, M. laevigella and T. trinotella, displayed similar patterns of high synteny conservation with small rearrangements (Boyes et al. 2022, 2023b; supplementary figs. S5 and S6, Supplementary Material online).

Circos plots of synteny alignments between T. bisselliella, T. pellionella, and M. cinxia. a) Synteny alignment between the pseudochromosome-level genome assembly of T. bisselliella and the chromosome-level genome assembly of T. pellionella (Boyes et al. 2024), which was used to assemble the scaffolds of the T. bisselliella genome into pseudochromosomes by synteny-based assembly. b) Synteny alignment between the pseudochromosome-level genome assembly of T. bisselliella and the chromosome-level genome assembly of M. cinxia (Vila et al. 2021). Inset is an image of a T. bisselliella adult.
In contrast, the synteny alignments of the T. bisselliella pseudochromosome genome assembly to the B. mori and M. cinxia genome assemblies display evidence for frequent chromosomal rearrangements of both large and small genome regions (GenBank accession no. GCA_014905235.2; Vila et al. 2021; Fig. 1b; supplementary fig. S7, Supplementary Material online). This is expected, because the Tineidae family is a basal clade in relation to Bombycoidea and Papilionoidea (Kawahara et al. 2019). Additionally, the synteny alignment of T. bisselliella to M. cinxia provides unique insight into the evolution of the lepidopteran ancestral karyotype, as M. cinxia is believed to have retained the ancestral karyotype of 31 (30 autosomes) (Ahola et al. 2014; de Vos et al. 2020). The reduction in chromosome number from the ancestral karyotype is hypothesized to be the result of whole autosomal chromosomal fusions, where two independent autosomes fuse to make one super chromosome (Ahola et al. 2014). However, our synteny alignment of the T. bisselliella pseudochromosome genome assembly to M. cinxia revealed that our assembly contained an almost whole chromosomal fusion of M. cinxia chromosome 30 and Z chromosome that formed the Z chromosome in T. bisselliella (Fig. 1b). Therefore, our findings are consistent with a scenario in which the reduction in karyotype size in T. bisselliella is due to a whole chromosomal fusion event of the ancestral chromosomes 30 and Z. This result is consistent with recent findings by Wright et al. (2024) that lepidopteran whole chromosomal fusions are more likely to occur either between short chromosomes or with the Z sex chromosome. Such karyotype variation has been observed to lead to prezygotic and postzygotic barriers to gene flow and ultimately to speciation (de Vos et al. 2020).
The pseudochromosome-level assembly generated here is the first reference quality genome for T. bisselliella and the first annotated genome in the Tineidae family. This annotated genome will serve as a resource for the scientific community and will facilitate future investigations of basal moths and the mechanisms of speciation of the Tineidae family as well as the evolution of keratin digestion.
Materials and Methods
Sample Acquisition, DNA and RNA Extraction, and Sequencing
Two T. bisselliella larvae were obtained from an infested skunk fur acquired from a Boston museum and were sent to the University of Delaware DNA Sequencing and Genotyping Center for high molecular weight DNA extraction and long-read whole-genome sequencing in March 2021. The two larvae were individually barcoded and sequenced in a PacBio SMRT cell on Sequel II at the University of Delaware Sequencing and Genotyping Center producing 11 Gb of HiFi long reads.
Ten live T. bisselliella individuals (five early-stage larvae, three late-stage larvae, and two adults) were sampled from a lab colony established in June 2022. RNA was extracted from each individual using a Qiagen RNeasy Micro Kit following the manufacturer's instructions, except samples were homogenized with lysis buffer and glass beads for 1 min prior to extraction. RNA quality and quantity were assessed by performing agarose gel electrophoresis and NanoDrop. The highest quality extractions, specifically two early-stage larvae, three late-stage larvae, and two adults, were sent to the Tufts University Genomics Core Facility for additional quality assessment (including QC via fragment analyzer) prior to library preparation and sequencing in August 2023. All samples passed the quality check and were separated into three barcoded pools based on developmental stage (i.e. early-stage larvae, late-stage larvae, or adult) and were then prepped using the Illumina Stranded mRNA Prep method. One hundred base pair paired-end sequencing of all three libraries was performed on an Illumina NovaSeq flow cell.
Genome Assembly
The University of Delaware performed demultiplexing, quality filtering, and IPA of the reads (GitHub: https://github.com/PacificBiosciences/pbipa), which yielded a contig-level assembly. To construct a more complete genome assembly, the genome assembly was mapped by synteny to a chromosome-level assembly of T. pellionella (Boyes et al. 2024), using Satsuma v.3.0 Chromosembler (Grabherr et al. 2010). Satsuma first produced an intermediate scaffold-level assembly by using synteny mapping to the T. pellionella genome assembly which then was mapped again to the T. pellionella genome assembly to produce the final pseudochromosome-level assembly. Genome assembly statistics for all assemblies were computed using BBMap's stats.sh script (Bushnell 2014). To assess the completeness of the preliminary contig-level and the final pseudochromosome-level assembled genomes, BUSCO v3.0.2 was performed using the BUSCO v4 Lepidoptera-odb10 lineage data set containing 5,286 orthologs (Simão et al. 2015; Manni et al. 2021). To provide a comparison for reference genome quality standards, reference quality genomes for B. mori (GenBank accession no. GCA_014905235.2), T. pellionella (Boyes et al. 2024), and M. laevigella (Boyes et al. 2023a) were obtained from NCBI GenBank, and each of these genomes was also assessed using BUSCO (supplementary table S1, Supplementary Material online) . Bacterial contamination was assessed by running a BUSCO analysis (v3.0.2) with the Bacteria-odb9 lineage set containing 148 orthologs (Simão et al. 2015). BLASTn was used to identify possible sources of bacterial contamination (Camacho et al. 2009), which was hard-masked using BEDtools’ maskfasta (Quinlan and Hall 2010).
Transcriptome Assembly and Annotation
The paired-end RNA-seq data were assembled and annotated following the protocol presented in Rivera and Davies (2021). The assembled transcriptome was submitted to the eggNOG-mapper v2 online platform and annotated using the eggNOG 5.0 data set (Huerta-Cepas et al. 2019; Cantalapiedra et al. 2021). Gene Ontology annotations along with gene names and their corresponding annotations were extracted from the eggNOG-mapper v2 output and used to produce an annotated FASTA file. For quality comparison, adapter and polyG tail trimmed RNA-seq reads were mapped to the final genome assembly using STAR v2.71.10b (Dobin et al. 2013).
Genome Annotation
The final pseudochromosome genome assembly was annotated following the original MAKER pipeline detailed in Cantarel et al. (2008) using the most up-to-date version of the program, MAKER v3.01.04, with the modifications detailed in supplementary fig. S3, Supplementary Material online.
All available EST evidence for T. bisselliella and Tineidae family members was obtained in November 2023 from NCBI GenBank (Clark et al. 2016) to incorporate into step 1 of the MAKER pipeline. Additionally, the annotated transcriptome detailed above was provided to the program as EST evidence. Prior to the start of the annotation pipeline, the T. bisselliella pseudochromosome-level genome assembly was input into eggNOG-mapper v2 to generate a GFF3 file with prealigned protein matches (Huerta-Cepas et al. 2019; Cantalapiedra et al. 2021). This additional GFF3 protein match file was put into step 1 of MAKER as additional prealigned evidence (Cantarel et al. 2008). Protein evidence was gathered from the November 2022 UniProt Lepidoptera release (The UniProt Consortium 2023). Repeat evidence was obtained from GIRI RepBase's invertebrate repeat release from November 2022 (Bao et al. 2015). The first step of MAKER was performed to align the aforementioned repeat, EST, and protein evidence to the T. bisselliella pseudochromosome-level genome assembly using RepeatMasker, Exonerate, and BLAST+ (Slater and Birney 2005; Camacho et al. 2009; Bao et al. 2015). This first step of MAKER produced a GFF3 file with the alignment information of the repeat, EST, and transcriptome evidence to the genome.
The second step of MAKER involves training of the ab initio gene prediction software SNAP and Augustus (Korf 2004; Stanke et al. 2006). In this step, BEDtools and BUSCO were used to produce hidden Markov models for SNAP and Augustus training from the GFF3 file produced in step 1 (Quinlan and Hall 2010; Manni et al. 2021).
The third step of MAKER included ab initio gene prediction by SNAP and Augustus. After each round of ab initio gene prediction, the number of gene models and the corresponding AED scores were recorded and plotted to evaluate the effectiveness of the prediction software (supplementary fig. S4, Supplementary Material online). Training and running of ab initio gene prediction by SNAP and Augustus (steps 2 and 3) was run for a total of 2 rounds. After the last round of ab initio gene prediction, step 4 involved functionally annotating the protein-coding genes using eggNOG-mapper v2 (Huerta-Cepas et al. 2019; Cantalapiedra et al. 2021). Summary statistics of the genome annotation were then compiled using the agat_sp_statistics.pl script included in AGAT (GitHub: https://github.com/NBISweden/AGAT). Completeness of the annotated proteins was assessed using BUSCO v3.0.2 with the BUSCO v4 Lepidoptera-odb10 lineage data set containing 5,286 orthologs (Simão et al. 2015; Manni et al. 2021).
Synteny Alignments and Visualization
The T. bisselliella pseudochromosome-level assembly was aligned to 3 available Tineidae chromosome-level genome assemblies (supplementary table S1, Supplementary Material online): T. pellionella (Boyes et al. 2024), T. trinotella (Boyes et al. 2022), and M. laevigella (Boyes et al. 2023a), in addition to a B. mori genome assembly (GenBank accession no. GCA_014905235.2) and a M. cinxia genome assembly (Vila et al. 2021). Satsuma was used to perform the synteny analysis (Grabherr et al. 2010), and the R package RIdeogram was used to visualize the output of Satsuma (Hao et al. 2020). Additionally, the synteny alignments between the T. bisselliella pseudochromosome genome assembly and T. pellionella and M. cinxia genomes, respectively, were visualized as Circos plots using the R package Circlize (Gu et al. 2014).
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online.
Acknowledgments
The authors would like to thank Dr. Brian Gregor and the Research Computing Staff at Boston University for their advice and computational support throughout this project. We would like to acknowledge that this work would not have been possible without the resources provided by Boston University's Shared Computing Cluster.
Funding
The authors would like to acknowledge the next-generation sequencing support provided by the National Institutes of Health shared instrumentation grant (1S10OD032203-01) via Tufts University Core Facility Genomics Core which performed the RNA sequencing for this project. This work was additionally supported by a National Science Foundation grant (DEB-2021181).
Data Availability
The raw reads from PacBio HiFi genome sequencing and Illumina raw transcriptome short reads are deposited at NCBI under BioProject PRJNA1102366. The pseudochromosome genome assembly is available on NCBI GenBank under the accession GCA_042254865.1. The genome annotation, transcriptome assembly, and transcriptome annotation files are all available on Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/JQ3C9. All code associated with this project can be found in the following GitHub repository: https://github.com/jasalq/Tineola_bisselliella_genome.