-
PDF
- Split View
-
Views
-
Cite
Cite
Bonnie A Fraser, James R Whiting, Josephine R Paris, Cameron J Weadick, Paul J Parsons, Deborah Charlesworth, Roberta Bergero, Felix Bemm, Margarete Hoffmann, Verena A Kottler, Chang Liu, Christine Dreyer, Detlef Weigel, Improved Reference Genome Uncovers Novel Sex-Linked Regions in the Guppy (Poecilia reticulata), Genome Biology and Evolution, Volume 12, Issue 10, October 2020, Pages 1789–1805, https://doi.org/10.1093/gbe/evaa187
- Share Icon Share
Abstract
Theory predicts that the sexes can achieve greater fitness if loci with sexually antagonistic polymorphisms become linked to the sex determining loci, and this can favor the spread of reduced recombination around sex determining regions. Given that sex-linked regions are frequently repetitive and highly heterozygous, few complete Y chromosome assemblies are available to test these ideas. The guppy system (Poecilia reticulata) has long been invoked as an example of sex chromosome formation resulting from sexual conflict. Early genetics studies revealed that male color patterning genes are mostly but not entirely Y-linked, and that X-linkage may be most common in low-predation populations. More recent population genomic studies of guppies have reached varying conclusions about the size and placement of the Y-linked region. However, this previous work used a reference genome assembled from short-read sequences from a female guppy. Here, we present a new guppy reference genome assembly from a male, using long-read PacBio single-molecule real-time sequencing and chromosome contact information. Our new assembly sequences across repeat- and GC-rich regions and thus closes gaps and corrects mis-assemblies found in the short-read female-derived guppy genome. Using this improved reference genome, we then employed broad population sampling to detect sex differences across the genome. We identified two small regions that showed consistent male-specific signals. Moreover, our results help reconcile the contradictory conclusions put forth by past population genomic studies of the guppy sex chromosome. Our results are consistent with a small Y-specific region and rare recombination in male guppies.
How do different sexes, who share most of their genome, have different phenotypes and fitness optima? Theory predicts that sex chromosome evolution can solve this problem, where sexually antagonistic loci become linked to the sex determining loci. Unfortunately, sex chromosomes are hard to sequence and assemble completely because of they can be highly repetitive and heterozygous. Here, we present a new genome assembly for the male guppy, which has long been invoked as an example of sex chromosome formation resulting from sexual conflict. We show two small regions associated with sex, one with high repeat density not previously found.
Introduction
Sexual conflict is thought to be one of the main drivers for the evolution of nonrecombining sex chromosomes regions (Bachtrog 2013; Ponnikas et al. 2018). If a sexually antagonistic polymorphism (where at least one allele is beneficial in one sex and not the other) becomes established in a region linked to the sex determining locus (SDL), closer linkage between these two loci reduces conflict, allowing each sex to approach its different fitness optimum (Rice 1987). Since whole-genome sequencing is now possible for a wide range of species we can begin to directly test these hypotheses by identifying the SDL and linked genomic regions. Additionally, by studying the sequences of young, only slightly differentiated, sex chromosomes, we can begin to distinguish causes and consequences of sex chromosome formation (Charlesworth 2019). Teleosts are particularly exciting in this regard because they have undergone frequent turnovers of their SDLs and often have young sex-linked regions (e.g., the medaka, Oryzias latipes and its congeners; Kondo et al. 2004). However, identifying sex-specific regions of a genome with short-read sequencing has been challenging, given that sex chromosomes and their sex-linked regions tend to accumulate repetitive sequences and are heterozygous for the SDL and potentially other fully sex-linked genes. Here, we present an improved genome assembly for the Trinidadian guppy (Poecilia reticulata) using long-read sequencing data and chromosome contact information in order to more fully explore the structure and evolution of their sex chromosomes. We then use the newly assembled sex chromosome and broad population sampling to detect sex differences across the genome.
The guppy has long been considered a species where sexual conflict may be important in sex chromosome evolution. This species has XY sex determination and is highly sexually dimorphic: males—the heterozygous sex (Winge 1927)—display complex color patterns across their bodies and fins, whereas females are drab. Color is used to attract mates but it also attracts the notice of predators. Guppies inhabiting rivers in the Northern Range of Trinidad are particularly well suited for comparative study of coloration because color patterning varies with predation regime. Here, the rivers are punctuated by waterfalls that define distinct predator assemblages: below waterfalls, guppies coexist with many predators (high predation, HP) whereas upstream they are relatively free of predation (low predation, LP). Guppies have evolved in response to their predators, displaying larger and more diverse color spots in LP than in HP populations across multiple rivers (Endler 1995; Magurran 2005). Furthermore, the Northern range mountains of Trinidad act as barriers between rivers, and populations in different rivers are highly differentiated (Willing et al. 2010; Fraser et al. 2015). Therefore, the separate rivers represent a natural experiment, with guppies adapting to LP regimes independently, providing the opportunity to study repeated evolution.
Early genetic work on the guppy revealed that male pattern formation genes are disproportionately Y-linked, with most color traits cosegregating with the SDL or recombining only at very low frequencies (Haskins et al. 1961; Lindholm and Breden 2002). Even more intriguingly, Haskins et al. (1961) uncovered evidence for recombination between the sex chromosomes, and a difference in sex-linkage between predation regimes: the Sb color factor was completely Y-linked in an Aripo river HP population, as testosterone-treated, sex-reversed females never displayed this trait, whereas sex-reversed females from an Aripo LP population often did. These observations demonstrated that the factor is present on the X chromosome, which suggests a history of crossing over between the Y and X in the LP population. This pattern was subsequently shown to apply across multiple populations from multiple rivers: LP females consistently displayed coloration following testosterone treatment more often than did HP females from the same river (Gordon et al. 2012). This result has been taken as evidence for a reduced rate of X–Y recombination in HP populations (Gordon et al. 2012) or, perhaps more simply, as evidence for weakened selection against coloration factors in LP sites (Charlesworth 2018).
Genome resequencing should help identify the SDL and help us understand its linkage with color pattern genes in guppies. Independent genetic mapping studies have mapped the SDL to linkage group 12 (LG12) across multiple populations of guppies (Tripathi, Hoffmann, Weigel, et al. 2009; Bergero et al. 2019) and its sister species Poecilia wingei (Lindholm and Breden 2002). Chromosome in situ hybridization studies revealed variation among populations in the size and heterochromatin content of a Y chromosome region towards the terminal end of LG12 (Nanda et al. 2014). Whole-genome resequencing across populations of guppies; however, yielded unclear conclusions about the extent of the fully sex-linked region in Trinidadian guppies. Wright et al. (2017) detected sex-specific differences in SNP density and mapping coverage on LG12 and concluded that the nonrecombining portion of the Y includes two evolutionary strata (fully sex-linked regions differing in ages) spanning 15–25 Mb on LG12 (Wright et al. 2017). These authors further concluded that the fully Y-linked region is larger in LP than HP populations, the opposite of the difference in recombination outlined above. However, another population genomics study found no evidence for such strata; instead the results suggested that the completely nonrecombining part of the Y chromosome is small, perhaps a single gene (Bergero et al. 2019). Cytogenetic studies on crossovers in males (Lisachov et al. 2015) and linkage maps derived from multiple intrapopulation crosses (Bergero et al. 2019) observed male recombination events only near the ends of the chromosomes, in autosomes, as well as in the sex chromosome. Low recombination rates in male meiosis and nonrecombining regions were also previously observed in genetic F2 crosses (Tripathi, Hoffmann, Willing, et al. 2009). This pattern of sex-specific recombination would enhance linkage between the SDL and sexually antagonistic alleles even if they are physically distant from each other in both predation regimes (Bergero et al. 2019).
Both population genomic studies were limited by the available reference genome for guppies, which was assembled from short read from a female guppy (Künstner et al. 2016). Short-read sequencing technologies provide considerable challenges to assembling whole chromosomes and this can be particularly true for sex chromosomes, which are often densely populated with repeats. Repetitive regions can cause unresolved assembled regions of the genome, particularly if identical or near-identical repeats are longer than the library insert length (Alkan et al. 2011). Repeat elements, including transposable elements (TEs), are expected to be enriched on young sex chromosomes, and enrichment has been detected in Drosophila miranda (Bachtrog 2003) and in teleosts (Chalopin et al. 2015). Moreover, TEs have been implicated in the movement of SDLs from one chromosomes to another (e.g., salmon; Faber-Hammond et al. 2015; Lubieniecki et al. 2015), and in the regulation of nearby genes, including the sex-determining genes themselves (e.g., medaka [Herpin et al. 2010], reviewed in Dechaud et al. [2019]). Long-read sequencing has the promise to resolve repetitive regions, but even these remain limited by maximum read lengths. Chromosome contact information (e.g., Hi-C) identifies long-range chromosome contact information without prior marker selection and can therefore improve contiguity of assemblies by placing (or scaffolding) contigs along chromosomes (e.g., Lieberman-Aiden et al. 2009).
Here, we report a new guppy reference genome assembly that uses a male sample, and that employs both long-read PacBio sequencing and HiC-based chromosome contact information. We show that our new reference assembly closes gaps and corrects mis-assemblies in the short-read female-derived guppy genome sequence. Importantly, for LG12, we have captured GC- and repeat-rich regions not previously assembled. Using this improved reference genome, we then employed a population genomic approach to detect differences between males and females across the sex chromosome. We identified two small regions on LG12 that show consistent evidence for sex-linkage. These advances demonstrate the importance of drawing on new techniques for improving existing genomic resources and highlight the benefits of combining high-quality assemblies with population genomic approaches. Furthermore, our results help reconcile the contradictory conclusions of past studies of guppy sex chromosome evolution that relied on a female reference genome, and suggest new directions for evolutionary genetic study of sex chromosome evolution in guppies.
Results
A Long-Read Assembly of a Male Guppy Genome
We assembled the genome of a single male from the same inbred Guanapo strain that was used for the original female reference genome (Künstner et al. 2016). Our approach integrated 50 Gb of PacBio long-read data and 40 Gb of Hi-C chromosomal contact data into a phased, diploid assembly with a total size of 745 Mb (supplementary table S1, Supplementary Material online, analysis was done on the slightly longer phase 0 if not otherwise stated, see below for a description of the two phases). First, complete chromosomes were reconstructed using three previously generated genetic linkage maps (Tripathi, Hoffmann, Weigel, et al. 2009: GM1; Künstner et al. 2016: GM2; Bergero et al. 2019: GM3, for LG12 only). Assembly misjoins in 11 contigs were corrected by splitting at breakpoints inferred in genetic maps GM1 and GM2 and interchromosomal contact domains detected from remapped Hi-C data. The final chromosome set totals 698 Mb and integrates 93% of all assembled contigs (supplementary figs. S1–S4, Supplementary Material online). BUSCO completeness was 94.9% (93.7% single-copy, 1.2% duplicated), with 1.5% fragmented and 3.6% missing, based on a total of 3,640 BUSCO groups.
This new MG assembly is both longer and more contiguous than the previously released FG (total scaffold length 745 Mb in MG vs. total contig length 663 Mb in FG and total scaffold length of 731 Mb in FG) and has a much higher N50 contig length of 7,936,040 versus 35,577 bp for the FG. Comparing the mapping statistics of short reads from natural guppy populations to both genome assemblies, we found an increase in the mean number of paired reads that mapped properly (95.4% vs 91.4%, supplementary table S2, Supplementary Material online). The increase was similar with reads from either sex and higher for reads from individuals collected in the source river, Guanapo, than from individuals in the two other rivers sampled.
Alignment of RNA-seq data to the softmasked genome (28.3% softmasked) yielded alignment rates between 86.0% and 93.8%. The BRAKER2 annotation pipeline used an initial set of 83,491 open reading frames without internal or terminal stops. An initial data set, from Xiphophorus maculatus (a closely related species) used for calculating the Ortholog hit ratio (OHR: see Materials and Methods), contained 36,613 proteins, which collapsed down to 27,791 at a ≥ 90% identity threshold after CD-HIT clustering. Orthology comparison between the P. reticulata and X. maculatus proteins using BLAST with an e-value threshold of 1e-5 and an OHR ≥95% yielded a final annotation set of 18,227 proteins, and subsequent functional annotation using EggNOG using an e-value threshold of 1e-10 resulted in a final set of 15,263 protein coding genes. Repeats were annotated with RepeatMasker (supplementary table S3, Supplementary Material online). The overall GC level was 39.7%, and 9.9% of the genome (73,979,443 bp) was masked for repeat content (supplementary fig. S5, Supplementary Material online).
Whole-Genome Alignment
The new MG assembly was mostly syntenic with the previously published FG assembly, and also with the genome of the closely related platyfish, Xiphophorus maculatus (fig. 1 and supplementary figs. S6 and S7, Supplementary Material online). However, many chromosomes are arranged differently in the new versus the previous assembly, including LG12, where the new assembly is inverted across the region from 11 to 20 Mb (fig. 1). As the new orientation is supported by multiple genetic maps and largely corresponds to the platyfish assembly, this region was probably mis-assembled in the previous guppy genome sequence. Similar improvements of synteny with the platyfish assembly are found across the genome (supplementary fig. S7, Supplementary Material online).

Whole-genome alignments of assemblies. (A) Whole-genome alignment between new MG assembly and a previous Illumina-based FG assembly. (B) Whole-genome alignment between new MG assembly and X. maculatus genome assembly (Xm). Links are colored by chromosome, with between-chromosome and unplaced scaffold links colored gray. (C) Genome alignment of LG12 between the two guppy assemblies and chromosome 8 on X. maculatus. Color denotes percent similarity of links (see legend).
Examination of LG12
The motivation for a new male-based assembly was to investigate sex chromosome evolution. We therefore closely examined LG12’s gene content. In total, 14 contigs were placed onto LG12 (fig. 2). Placement was based on 29 SNP markers, 283 RAD-seq markers, and an additional 12 microsatellite markers from (Bergero et al. 2019; fig. 2), as well as Hi-C information. A few contigs showed evidence of being misplaced in the release genome assembly according to the Hi-C data and therefore were manually re-ordered (see supplementary methods, Supplementary Material online for description and supplementary fig. S8, Supplementary Material online). Also, of note, contig IV (at ∼5 Mb) has only 2 SNP markers, one which places the contig at 5 Mb and one marker that places it at the distal end of LG12, yet, contig IV has high sequence similarity to contig XII (see below), implying that it could be assigned equally well to either region and therefore not informative for placement. We next examined LG12 for gene and repeat features (fig. 2). Similar to most other chromosomes (supplementary fig. S5, Supplementary Material online), LG12 showed an increase in GC content at the assembly end, most notably at 24.5–25.5 Mb (contig XIII; fig. 2B). Particularly high repeat density was also apparent around 5 Mb (contig IV; fig. 2C).

Assembly of LG12 from the MG, with each contig being delineated with dashed lines. (A) Hi-C contact, (B) percentage GC content, (C) percentage repeat content, (D) percentage of CDS (B–D in 50-kb windows), (E) Genetic linkage maps along LG12, with each marker’s position shown on the MG genome in centimorgans (cM) for GM1, GM2, and GM3.
The genome assembly was phased using Falcon unzip, although each phased contig is arbitrarily placed into one chromosome scaffold or the other, so that the assigned phases do not correspond to the actual haplotypes across the entire chromosome in the sequenced individual. The phases should not be interpreted as X–Y haplotypes; the phases are a data structure without biological meaning beyond the contigs on each phase being mutually exclusive. We can, however, examine differences between each pair of phased contigs (supplementary table S4, Supplementary Material online). A first observation is that most contigs are of similar size in both members of the pair, with only contig IV differing in size by >100 kb (its phase 0 contig includes a region named “C2” not present in phase 1, see below and figure 4A for a full description of this region). Second, SNP variation and SVs are evenly distributed across LG12 (supplementary fig. S9, Supplementary Material online). Notably however, a large inversion exists between the phases (size = 973,876 bp, beginning at 25.84 Mb).
Population Genomics Genome Scans
To identify the male-specific region of the Y chromosome, we took a population genomics approach, sampling males and females from six populations representing all three major drainages in Northern Trinidad (supplementary table S5, Supplementary Material online and supplementary fig. S10, Supplementary Material online). We tested each population with three approaches that can detect sex differences: comparing read coverage and SNP diversity (FST and Da = net DXY [Nei 1987]) between the sexes, and searching for Y-specific k-mers (“y-mers”).
We first scanned the genome for consistent differences in coverage between the sexes in 10-kb windows, using our short reads from individuals from natural populations: this can detect sex-linked regions in which males are hemizygous, or sequences of Y- and X-linked regions that are highly diverged. Note that we expect Guanapo populations to be best suited for the coverage test as the reference genome sample was derived from a Guanapo site. Consistent with previous results (Bergero et al. 2019), coverage was generally uniform across the genome, and mean male/female coverage difference was not consistently higher on LG12 compared with other chromosomes (supplementary fig. S11, Supplementary Material online). Female-biased coverage was variable across populations (numbers of female-biased 10-kb windows in different populations samples were as follows: GH = 1,109, GL = 240, MH = 14, ML = 24, OH = 59, and OL = 43). No window showed consistent female bias in all six populations, but one window was shared by four populations, and two were shared among three; however, none of these was on LG12 (supplementary table S6, Supplementary Material online).
There were considerably more male-biased windows across the genome (total numbers of male-biased 10-kb windows: GH = 2,306, GL = 1,230, MH = 377, ML = 269, OH = 626, and OL = 517), and there was also more overlap among the populations: 57 windows were found in all 6 populations, 151 in 5 populations, and 154 in 4 populations (supplementary table S6, Supplementary Material online). Of the windows shared by all six populations, most (45 windows) were on unplaced scaffolds and only three were on LG12, clustered on contig IV (between 4.9 and 5.3 Mb, fig. 3). An additional eleven 10-kb windows were detected within this region that exhibited male-biased coverage in any five populations. The remainder were on LG18 (n = 6) and LG5 (n = 3). Supplementary fig. S12, Supplementary Material online shows male and female coverage across LG12 separately. Overall, coverage patterns were more similar within river drainage than the predation regime.

Population genomic analysis for sex-specificity of variants on LG12 in six natural populations. Each population is shown in a panel, with HP populations on the left and LP populations on the right. Da values are shown as gray dots, with outliers in red. Coverage differences between the sexes are fitted with a spline, shown in blue. Finally, the positions of male contigs assembled from male-specific assembled k-mers (y-mers) are indicated along the top in green triangles.
To look for sex differences in SNP frequencies we first estimated FST between the sexes in 10-kb windows along the genome. A value of 0.3 is expected for a fully sex-linked XY SNP site or region (heterozygous in all males and homozygous in females). Genome-wide FST between the sexes was low within all populations (GH = 0.022, GL = 0.014, MH = 0.019, ML = 0.017, OH = 0.020, and OL = 0.017). No chromosomes showed elevated FST, except LG12 in GL (FST = 0.11, see supplementary fig. S13, Supplementary Material online). Elevated FST was evident across the whole chromosome in the GL population and was due to greater nucleotide diversity in males than females, as expected for sites or regions in linkage disequilibrium with the SDL (supplementary fig. S14, Supplementary Material online). Values of FST above 0.3 varied across the six populations, and were highest in GL and low in OH and OL (the numbers of windows with FST > 0.3 were: GH = 174, GL = 661, MH = 135, ML = 189, OH = 23, and OL = 38). Windows with elevated FST values were shared between at most three populations, and involved one window on LG12 and three windows on unplaced scaffolds (supplementary table S7, Supplementary Material online). We also scanned LG12 for FIS outliers. Although FST and FIS are closely related, FIS has the advantage that it can detect outliers that are due to males being more heterozygotic than expected based on the variant frequencies, which in turn supports sex linkage (complete or partial). However, like our FST results, there were no consistent FIS outlier windows across LG12 (supplementary fig. S15, Supplementary Material online).
We also estimated net sequence divergence between the sexes in the same 10-kb windows. FST between the sexes depends strongly on diversity within the two sexes (Charlesworth 1998), we therefore used a net divergence measure, corrected by female diversity, Da, which can detect regions with higher diversity in males than females. Importantly, this measure is sensitive to regions enriched for low-frequency, male-specific variants, due for example to diverse Y-linked haplotypes segregating within populations (Bergero et al. 2019), which would produce low FST values. For each population, we identified candidate windows with Da values greater than the genome-wide mean plus 3 SD (akin to a 99% confidence interval). LG12 had the highest average Da in five populations, and had the second highest average in the ML population (supplementary fig. S16, Supplementary Material online). Six windows had outlier Da windows in all six populations: four were located at the distal end of LG12 (between 24.5 and 24.9 Mb) and the other two were found between 6.94 and 7.01 Mb of LG12 (supplementary table S8, Supplementary Material online and fig. 3).
Given the difficulty of detecting SNP differences when mapping is low in some samples, we also applied a k-mer assembly approach. First, we identified all 12-bp k-mers that could be found in all males, but were missing in all females in all six populations. The reads containing such y-specific k-mers or “y-mers” were then reassembled for individual populations to assemble “male-specific contigs,” and finally realigned to our genome assembly. In total, we assembled a mean of 132 male-specific contigs per population, with a mean total length of 64,528 bp (supplementary table S9, Supplementary Material online). These contigs were distributed unevenly across our reference genome assembly (supplementary fig. S17, Supplementary Material online and supplementary table S9, Supplementary Material online) but LG12 was consistently over-represented, with >5% of total contig length recruited to LG12 in all populations (supplementary table S9, Supplementary Material online). These contigs are mostly clustered in two regions, at 5 Mb and 24–25 Mb (fig. 3). Autosomal regions also recruited male contigs, but less consistently across populations (supplementary fig. S17, Supplementary Material online). Overall, we identified only ten 100-kb windows where all six populations had male-specific contigs (LG12 24–25 Mb, LG18 27–28 Mb, and LG5 32–33 Mb and six unplaced scaffolds [000173F, 000175F, 000189F, 000195F, 000246F, 000275F, and 000281F], supplementary fig. S17, Supplementary Material online). Not surprisingly, most of these windows were also male-biased candidates based on coverage (supplementary table S6, Supplementary Material online). The male-biased coverage regions on LG12 (4–6 Mb) also recruited contigs in all populations, but not in directly overlapping windows (fig. 3).
Taken together, two regions appear most likely to be candidates for full sex-linkage: contigs IV and XIII of LG12. Both regions contained y-mer contigs in all six populations, and contig IV had consistently high male-biased coverage, whereas contig XIII of LG12 had windows with consistently high Da values. The same criteria (intersection of male biased coverage and recruitment of male contigs or consistent Da outlier) also pointed to four unplaced scaffolds (000173F, 000175F, 000189F, and 000246F) and one autosomal region (LG5 32–33 Mb) as apparently partially sex-linked; gene annotations for these regions are provided in supplementary table S10, Supplementary Material online.
Further Investigation of Candidate Regions
Contig IV (LG12 4.8–5.2 Mb)
Closer inspection prompted us to divide this contig into three distinct regions; a repeat-rich region (C1), a gene island (G1), and another repeat-rich region (C2; fig. 4A). C1 and C2 show evidence for higher coverage in males than females and presence of male-specific contigs, suggesting complete sex-linkage. As expected for regions with variable mapping rates due to the male-biased coverage, C1 and C2 lack SNPs after filtering, and FST or Da could not be estimated here.

Detailed investigation of candidate regions on LG12 that are fully sex-linked. (A) Region 4.8–5.2 Mb, (B) region 24.5–25.4 Mb. Gray represents each population’s male/female coverage ratio, red the mean coverage ratio. Gene annotations are shown above as blocks above, male-specific contigs for each population are shown in green triangles below.
Given its repetitive content, we annotated this contig manually. First, using the reference transcriptome from Sharma et al. (2014), we retrieved a preliminary set of 273 blastn matches that we then reduced to 71 nonredundant matches via an all-against-all blastn search, retaining only the longest transcript within each cluster of hits. Querying these transcripts in Ensembl resulted in annotations for 30 transcripts. Sequences with matches tended to be considerably longer (median = 975 bp; interquartile range = 424–1,532 bp) than those without (median = 216; interquartile range = 183–297 bp). In total, 19 of the 30 annotated transcripts matched known repetitive elements (transposons, retrotransposons, helitrons, or retroviruses); these mostly mapped to the ends of contig IV. The remaining 11 were considered putatively “nonrepetitive” protein-coding transcripts (fig. 4A). One transcript, which encoded a predicted hAT transposon C-terminal dimerization domain, mapped to 27 unique positions throughout the highly repetitive regions; separate analyses showed that XIR and XIR-like repetitive elements also repeatedly mapped to these stretches. XIR repeats are particularly interesting because they have been shown to be Y specific in platyfish (Nanda et al. 2000; Volff and Schartl 2001).
The gene annotation of contig IV revealed similarity with contig XII, which is assembled near the 24-Mb point of LG12 (fig. 2). A NUCmer alignment (with flags “–maxmatch -l 75”) detected a region of similarity of roughly 0.15 Mb between the two contigs, encompassing the gene-rich region (G1) of contig IV (supplementary fig. S18, Supplementary Material online). A few repetitive elements also mapped within G1, most notably a MuLV retrovirus and two copies of an REX3 non-LTR retrotransposon; both mapped to a region near the upstream boundary of the duplicated stretch where synteny was clearly disrupted between contigs IV and XII (supplementary fig. S18, Supplementary Material online). The heavily disrupted region between the contigs that contains the MuLV and REX3 sequences corresponds to a large intron of a RABEPK-like gene, within which we also found two copies of a short TRIM35-like gene (supplementary fig. S18, Supplementary Material online). The corresponding intronic region appears to be shorter in contig XII, lacking MuLV and both copies of REX3, and having only a single TRIM35-like gene (with two more copies roughly 0.1 Mb upstream of the start of contig XII). The RABEPK-like gene is followed by genes for an MFS transporter, an MADF-domain containing protein (a putative transcription factor), Shroom3, a Sosondowah ankyrin repeat domain family (SOWAH)-like protein, Cyclin I (which is partially triplicated in contig XII), and Cyclin G2.
Many of these genes have previously been noted as being present on the guppy and platyfish sex chromosomes. Tripathi, Hoffmann, Willing, et al. (2009) found a sex-linked SNP within a Cyclin G2 gene. Cyclin G2 was also found to have male-biased expression by Sharma et al. (2014). The most terminal marker in GM1 (marker 229) was used to isolate BAC 36_H23 (Tripathi, Hoffmann, Willing, et al. 2009). This BAC labeled the distal ends of the X and Y chromosomes in several guppy populations (Nanda et al. 2014). Blasting this BAC to the MG sequence shows that it maps to an unplaced scaffold (00250F) and to the first Cyclin I gene on contig XII, potentially placing this contig on LG12. Notably, this unplaced scaffold also has Da outliers in five of six populations (supplementary table S8, Supplementary Material online). The potential role of cyclins in multiplication of germ cell precursors might be worth investigation (Schulz et al. 2010). Morris et al. (2018) assembled a male-specific k-mer contig that also contained a TRIM39-like gene. TRIM35 and TRIM39 have the same domain architecture (C-IV-1; Hatakeyama 2017). However, our reference assembly includes many copies of TRIM genes (we found 126 homologs distributed across the genome based on a blastn search). Both the TRIM and cyclin genes are interesting because it is at these genes where the duplicated regions differ (supplementary fig. S18, Supplementary Material online). Using a PCR presence/absence test the TRIM35-like gene was found to be associated with sex in only some of our populations (supplementary table S11, Supplementary Material online).
It is notable that only the repetitive regions C1 and C2 show Y-like male: female coverage differences, but not the gene-rich region G1 (fig. 4A). These repetitive regions were also very similar to other regions across the genome, particularly to unplaced scaffolds. When unplaced scaffolds were aligned individually to all chromosome assemblies using minimap2, and after alignment blocks shorter than 10,000 bp were filtered out, LG12 was left with more alignments (n = 151) than any of the other 22 chromosomes (4 ≤ n ≤ 90). Most of these LG12 alignments (107/151) were clustered on contig IV, aligning to 51 unique scaffolds. Visualization of the alignments highlighted two contig IV regions with overlapping high-quality alignments (above the median Smith-Waterman alignment score for contig IV alignments of 21,584). These clusters aligned to a 25-kb region (LG12:4,942,488–4,967,664 bp [C1]) and a 48-kb region about 200 kb away (LG12:5,175,028–5,213,274 bp [C2]; fig. 4A). Searching the full genome with these sequences identified additional autosomal sequences on LG18 (C1 + C2), LG15 (C1), and LG2 (C1) with similarity to these two LG12 regions. The scaffolds that make up these clusters were extracted and realigned to contig IV. Inspection of the alignments for structural variation using MUMmer 4 (Kurtz et al. 2004; supplementary fig. S20, Supplementary Material online) detected minimal variation, except for an inversion in the C2-related sequence on LG18 (27,568,621–27,588,735 bp). Many of these aligned regions also exhibited Y-like male: female coverage ratios and account for at least one of our “other candidates” (000173F) and to a coverage candidate (LG18 27–28 Mb, supplementary table S6, Supplementary Material online).
Both the C1 and C2 sequences were annotated as including helitron transposons and a SGNH hydrolase domain-containing gene (fig. 4A). Further examination of these genes by searching the NR database revealed similarity to the texim/SGNH genes from platyfish (Tomaszkiewicz et al. 2014). The relevant regions within C1 best matched type 2/3/4 texim, whereas C2 best matched type 1/Y texim genes (Tomaszkiewicz et al. 2014). The platyfish teximY and texim1 sequences, which are both ∼1,300 bp long, differ by only a few bp (Tomaszkiewicz et al. 2014). The platyfish Y chromosome (Xm LG21, corresponding the guppy LG20) has a higher helitron density than the X chromosome (Zhou et al. 2006). Y-chromosome located texim/SGNH hydrolase genes involved in spermatogenesis have apparently been captured by helitrons (teximY; Tomaszkiewicz et al. 2014). Using quantitative PCR (qPCR), we found that the copy number of these texim-type/SGNH hydrolase genes was higher in males than females in five of the six populations sampled (supplementary table S12, Supplementary Material online), largely consistent with our coverage estimates (supplementary fig. S12, Supplementary Material online). We were unable to find any teximY-specific sequences unique to contig IV that could be used as a sex-linked marker.
The repeats in the C1 and C2 regions do not have any good matches in the previous FG assembly (Künstner et al. 2016). The closest partial matches were observed on unplaced scaffolds 853 (with C1) and 221 (with C2). The latter included a good blast hit (E = 0) for the texim1 region of C2.
Contig XIII (24.5–25.4 Mb)
Although the SDL has been mapped to the distal end of LG12 previously (Tripathi, Hoffmann, Weigel, et al. 2009), the evidence for Y-linkage was less clear, since we found no consistent read coverage or FST signal differentiating the sexes. However, all populations show male-specific y-mer contigs that map to around 24.82 Mb, regions with elevated Da from 24.59–24.61, 24.78–24.79, and 24.82–24.83 Mb, and a marginal dip in male coverage at 24.8 (fig. 4B). Taken together, these results suggest a recent increase in differentiation between the X and Y in this region. In terms of gene content, this region consists of an actin gene followed by multiple copies of NLRP1-related genes. NLRP1 is part of the inflammasome, an important element of vertebrate immunity (Mitchell et al. 2019). NLRP1-related genes were also found on two of our candidate unplaced scaffolds (000173F and 000189F, supplementary table S10, Supplementary Material online).
Multiple lines of evidence from previous work had identified this region as the one most likely to include the guppy SDL. Tripathi, Hoffmann, Weigel, et al. (2009) located the SDL distal to the marker 229, which is at 24.4 Mb in this new assembly. We find few and inconsistent FIS or FST signals in this region in our results (supplementary fig. S15, Supplementary Material online), and others reported similar findings for a sample from a captive population (Bergero et al. 2019). Dor et al. (2019) proposed that the SDL is within the distal 5.4 Mb of LG12, based on an association between sequence variants and sex in ornamental guppies. This region is rearranged in the male assembly compared with the previous female assembly (supplementary fig. S21, Supplementary Material online). Of particular interest are the candidate genes proposed by Dor et al. (2019); stoml2 (at 24,680,972–24,686,396 bp), which was not annotated by our pipeline but found through an Ensembl comparison, and gadd45g (25,035,744–25,039,913 bp; fig. 4B). Neither gene shows a consistent signal of sex differentiation, but the region between the two genes includes our candidate region around 24.82 Mb. Therefore, these candidate genes are probably within a region associated with sex in the ornamental population studied by (Dor et al. 2019), but not in wild populations.
Discussion
The Trinidadian guppy, P. reticulata, is a promising system for study of the early stages of sex chromosome evolution, but this work has been hampered by the lack of a highly contiguous genome that includes the male-specific Y chromosome. We present a new chromosome-level reference genome assembly for the Trinidadian guppy (P. reticulata). PacBio single-molecule real-time sequencing reads and Hi-C chromosome contact data supported the assembly of repetitive and GC-regions that were not previously resolved. After correcting a number of apparent mis-assemblies, the assembled chromosomes are now substantially more syntenic with those of the related platyfish, X. maculatus (Schartl et al. 2013). This new assembly has been particularly useful for examining the sex chromosome (LG12), and we highlight two small regions that are consistently associated with sex in samples of paired HP and LP natural populations from three river basins. Within the river basins, we find no evidence that HP and LP populations differ in the size of the sex-linked region, contrary to predictions and past observations. Neither do we find evidence for extensive evolutionary strata of complete sex linkage. We conclude, in agreement with Bergero et al. (2019), that any nonrecombining fully Y-linked region must be small. Our detailed analysis provides new SDL candidates for this region, which were missed in previous analyses that relied upon the female reference genome.
Our most promising region—situated at ∼5 Mb on LG12—contains candidate loci that have been previously mapped to the Y chromosome in guppy and platyfish. The region consists of two highly repetitive stretches that surround a genic region that itself is homologous to a region at 24 Mb on LG12. The two repeat-rich regions show similarity to many unplaced scaffolds, and to regions within LG18 and LG4. Our candidate region is not present in the previous FG assembly, and thus has not been detected in genome scans before. Reads originating from this region in previous genome scans were probably unmapped or even mis-mapped elsewhere in the female assembly, either to the duplicated genic region on LG12 near 24 Mb or to related sequences elsewhere in the genome. The most distal repeat-rich region includes our strongest evidence for increased male coverage; it is not present in the alternate phased assembly, and contains teximY genes, which have previously been found on the platyfish sex chromosome (Tomaszkiewicz et al. 2014). Notably, the platyfish sex-linked region is located on Xma21 (Schartl et al. 2013); it corresponds to guppy LG20, where we find no evidence of sex-bias.
Locating the SDL on a contig at 5 Mb of LG12 is contrary to genetic and cytogenetic studies that suggested it to be near the distal end of LG12. Our evidence for placing this contig at 5 Mb is admittedly mixed. The Hi-C signal is low in this region, likely because of low mappability due to the high number of repeats (Lajoie et al. 2015). Only two genetic markers were mapped here and one shows linkage to the other end of the chromosome (at 27.17 cM, fig. 2). Longer sequencing reads, or BAC sequences, could further resolve the placement of this region further, as well as that of the many unmapped scaffolds that show sequence similarity to contig IV.
Our second candidate region is at the distal end of LG12, where the SDL has been mapped previously. Here, our evidence for sex-linkage is higher net divergence between the sexes than elsewhere on LG12, and recruitment of contigs that were assembled from y-mers, male-specific k-mers. However, we detected no consistent coverage differences between the sexes and no high FST between sexes in this region. Together, our results support the previous conclusion (Bergero et al. 2019) that divergence between the X- and Y-linked region is small, and that no obvious evidence for genetic degeneration along the Y chromosomes. An alternative possibility is that the sex-determining region includes additional repetitive sequences and highly divergent regions that were not assembled here. Although there is no obvious nearby candidate sex-determining gene in our candidate region (∼24.82 Mb, contig XIII), this region shows some overlap with the region identified by Dor et al. (2019; supplementary fig. S21, Supplementary Material online). The candidate contig is followed by a GC-rich region, corresponding to the end of the chromosome where recombination is known to be high and shows no evidence for sex-linkage (Bergero et al. 2019; Charlesworth et al. 2020).
Given that the SDL has been mapped to the terminal part of LG12, the absence of clear signs of male-specific variants (i.e., high FST or biased male:female coverage ratio) in this region is surprising. Y-linked regions often show low diversity (Cotter et al. 2016), which would lead to high male–female FST values. One possibility is that the region includes many different Y haplotypes, which would lower male–female FST values. By using a net divergence measure (Da) we detected signals of high male diversity relative to female diversity, resulting from many low-frequency male-specific variants. Balancing selection for Y-linked color variants could be responsible for maintaining many Y haplotypes, with evidence for negative-frequency dependent selection for guppy color patterns coming from manipulating color pattern frequencies in natural populations. In such experiments, Olendorf et al. (2006) found that males with a rare color pattern had a survival advantage compared with common males. Similarly, Hughes et al. (2013) found that rare males had reproductive advantage. The hypothesis of multiple coexisting Y haplotypes is testable using long-read sequencing that can determine the phase of variants in the region, or phasing using sequencing of either mother–father–offspring trios or of individual sperm.
We next examine the controversy over the size and evolution of the Y chromosome in guppies. Wright et al. (2017) presented evidence for two evolutionary strata spanning the distal 10 Mb of LG12, whereas Bergero et al. (2019) did not find a clear region with sex-specific signals. Our new assembly updates these two strata substantially and can also explain inconsistent signals. First, the putatively younger strata (15–22 Mb in the previous female assembly) identified by Wright et al. (2017) was based on higher SNP density in males than females, although Bergero et al. (2019) found no consistently elevated FST in the population they examined. Our new assembly rearranges the region containing the putative stratum into nonconsecutive regions (fig. 1C). We too find no consistent signal of high male diversity across these regions, nor do we find these regions differ between HP and LP populations. The putatively older stratum (22–25 Mb in the female assembly) was identified through coverage differences between the sexes by Wright et al. (2017) but not by Bergero et al. (2019). In our new assembly, these sequences remain largely in place, albeit with some rearrangements (supplementary fig. S21, Supplementary Material online). Our candidate region (∼24.8 Mb), is considerably smaller than that reported by Wright et al. (2017). The differences among these studies most likely stem from our increased sequence length and different bioinformatic filtering pipelines, as well as from the fact that read mapping to our male reference genome provides a more direct view of Y chromosome evolution than analyses based on the X chromosome. For example, our candidate region at 5 Mb shows homology with contig XII at 24 Mb, which corresponds to roughly 23 Mb in the female assembly. It is therefore possible that small differences in mapping filters could account for the differences in signals reported among studies.
Our approach of using a multiple population genomics scan was critical for excluding regions from further consideration, and for narrowing down candidate regions. Had we analyzed only one population, we would have come to very different conclusions. For example, the GL population showed differences between the sexes across most of LG12, which suggests that this up-river population was founded by a small number of females. Such a bottleneck would have reduced diversity, which can explain this population’s high FST values between the sexes across LG12, as FST estimates the times back to a common ancestor, relative to that within the subpopulations being compared (here, the two sexes). A bottleneck is consistent with females being able to store sperm from multiple males (Kobayashi and Iwamatsu 2002; Kurtz et al. 2004), and thus of a few or even just one female potentially founding a new populations.
Our genome scan analysis also detected sex-biased regions on autosomes, which could potentially indicate sexual antagonistic loci (e.g., a region on LG5 using a similar intersection of measures used to highlight our candidate regions). However, caution is advised, as individual-based simulations (Kasimatis et al. 2019) suggest that spurious signals of high FST between the sexes can result from sampling variance and low sample sizes. These simulations confirmed that antagonistic selection must be remarkably strong (e.g., be associated with high sex-specific mortality or fertility differences) to maintain differences between the sexes at autosomal loci, which seems unlikely for most species (Fry 2010).
Autosomal and pseudoautosomal regions with sex-biases in coverage can also be due to mis-mapping of reads. Bissegger et al. (2020) found that recent duplications and translocation onto the Y chromosome in stickleback are often responsible for incorrect inferences of candidate sexually antagonistic loci on autosomes when using genome reference assemblies that lack Y chromosome sequences. Indeed, multiply mapping reads could be at least partially responsible for the sex-biased signals that we found here, for example, a coverage outlier region found on LG18, and for the many unplaced scaffolds with high sequence similarity with our candidate region at 5 Mb on LG12.
Why are the two sex-biased candidate regions on LG12 not physically close in the MG assembly? As previously discussed, the apparent distance between these two regions could be artifactual and disappear with additional sequencing and linkage information. That said, this arrangement is biologically plausible given the guppy’s recombination landscape. Specifically, Bergero et al. (2019) argued that associations between specific DNA sequences and sex could be maintained in separate regions of LG12 due to recombination in males being largely, if not entirely, restricted to the chromosome’s distal tips. Consequently, sexually antagonistic loci need not be physically linked to the SDL on LG12. We conclude that the Y-specific region on LG12 is small, likely to be within one of our candidate regions. Additionally, our findings suggest that the SDL may not be physically close to sexually antagonistic loci situated on the Y chromosome, but may instead be genetically closely linked due to the rarity of crossovers in male meiosis (Bergero et al. 2019).
Materials and Methods
Genome Assembly
Guppies for the reference genome assembly were descendants of those collected from an HP population in the Guanapo River (Twin bridges) from Northern Trinidad (PS 91100 77800). Guppies were inbred through brother–sister mating for 15 generations (Künstner et al. 2016).
From a pool of three full-sib males after 15 inbreeding generations, we generated 50.23-Gb-long reads on the Pacific Biosciences RSII platform (read N50 13 kb). Previously available Illumina data (Künstner et al. 2016) was used for polishing. To create a Hi-C library, sperm cells (1.5 × 107) were isolated from a single inbred Guanapo male. Haploid sperm nuclei were ∼90% as monitored by DAPI staining after lysis. The Hi-C library was prepared as described by Liu et al. (2017).
Raw long-read data were assembled into contigs with Falcon (version falcon-kit 1.1.1). The resulting primary and alternative contigs were internally phased based on heterozygous SNPs using Falcon-Unzip (version falcon-unzip 1.1.1; parameters default). Phased primaries and alternatives were polished using Quiver (ConsensusCore version 1.02; parameters default). The assembly was phased using Falcon-Phase (commit c86f05e; parameters min_aln_len = 5000, iter = 100000000, others default). Phase 0 was considered as the primary phase and used for all downstream analysis unless otherwise stated. Completeness of the assembly was conducted by investigating BUSCO scores (Seppey et al. 2019) with BUSCO v.4.0.2, using Actinopterygii as the core ortholog group.
We used slightly updated versions of previous genetic linkage maps (Tripathi, Hoffmann, Weigel, et al. 2009, GM1; Künstner et al. 2016, GM2). Both maps used the same cross; GM1 included 776 SNP markers in 896 individuals and GM2 7,509 RAD-seq markers in 184 individuals. We used the maps iteratively, using the more robust GM1 to place most of the contigs, followed by GM2. Finally, we used the genetic map in Bergero et al. (2019), which contained 12 markers to refine the assembly of LG12 (supplementary figs. S3 and S4, Supplementary Material online).
Our raw assembly consisted of two phases (phases 0 and 1). Phase 0 was 751,621,276 bp and phase 1 size was slightly shorter (751,559,625 bp), but there were no major differences in mapping efficiency between the phases (supplementary table S2, Supplementary Material online). Therefore, we used phase 0 for our analysis. We included sequences present in phase 1, but not phase 0, as additional contigs (termed decoys), in our mapping pipeline. The 38 decoys were short sequences totaling 120,900 bp. Note that assignment of contigs to phases 0 or 1 is arbitrary, and therefore there is not a simple correspondence between one of the phases and either the X or Y chromosome (see description of mapping decoys in the Supplementary Material online and supplementary table S4, Supplementary Material online). Differences between the phases were detected by aligning the phases using MUMmer 4 (Kurtz et al. 2004) and SNPs and SVs were called using NucDiff (Khelik et al. 2017).
Genome Annotation
Genome annotation proceeded using BRAKER2 v2.1.1 (Hoff et al. 2016; https://github.com/Gaius-Augustus/BRAKER), using a combination of GeneMark-EX (Lomsadze et al. 2014) and AUGUSTUS (Stanke et al. 2006, 2008) using NCBI BLAST (Altschul et al. 1990; Camacho et al. 2009). BRAKER2 was trained using two RNA-seq data sets: one data set was paired-end data from Sharma et al. (2014), with tissue from males and females, including whole embryos and multiple adult tissues. The data were accessed via NCBI, Study Accession ID: SRP033586. Illumina sequencing adaptors were removed and reads were quality trimmed to a minimum Phred base quality of 20 as well as discarding broken read pairs, using BBDUK2 (BBMap v34.86, Bushnell http://sourceforge.net/projects/bbmap/); The other data set was single-end RNA-seq data from gonad tissue of the ovary of four females, and testes of four males.
RNA-seq reads were mapped to the new reference genome using HISAT2 v2.1.0 (Kim et al. 2019). The alignments were converted to bam format and sorted using SAMtools v1.9 (Li et al. 2009). The guppy genome was softmasked for repeats using Red (Girgis 2015), a de novo approach, which uses a k-mer-based hidden Markov model, and therefore does not rely on previously annotated libraries of repeat sequences. The resulting augustus.gtf file was used downstream for annotation refinement. To refine the annotation, we assessed the Ortholog Hit Ratio (O’Neil et al. 2010). This measure is defined for each unigene with a BLAST match to a related data set, and is computed by dividing the number of bases in the matched region of the contig by the length of the best-matched sequence with OHR ratios close to 1.0 suggesting complete transcript assembly (O’Neil et al. 2010). To assess OHR, the proteome of X. maculatus was downloaded from Ensembl, release 96 (X_maculatus-5.0-male, INSDC Assembly GCA_002775205.2, December 2017). CD-HIT clustering was used to collapse down any peptides with ≥90% identity using CD-HIT v4.6.1 (Li and Godzik 2006; Fu et al. 2012), resulting in a X. maculatus peptide database of 27,791 for OHR analyses. gffread (https://github.com/gpertea/gffread) was used to extract protein sequences from the guppy augustus.gtf file and used as a query for BLAST against the X. maculatus peptides, using an e-value threshold of 0.00001. Any peptides with an OHR >95% with X. maculatus were extracted from the peptide file using samtools faidx (Li et al. 2009) and this well-supported peptide subset was functionally annotated using EggNOG v4.5.1 (Huerta-Cepas et al. 2016), using the eggNOG-mapper v1, mapping with DIAMOND (Buchfink et al. 2015). Emapper annotations were filtered at an e-value threshold of 1e-10.
We also annotated the genome using RepeatMasker ver (open-4.0.5) using the default mode with Actinopterygii as the query (Smit et al. 2019).
PCR-Free Libraries for SNP Recalibration Set
Twelve samples were sequenced, with an average of coverage of 32X, using PCR-free Illumina libraries (supplementary table S13, Supplementary Material online). One male and one female from each of six populations, from one river from each of the three drainages of the Northern range mountains of Trinidad, with HP and LP sites for each river (the same populations were sampled for the population genomic study described below using low coverage, supplementary table S5, Supplementary Material online).
Read mapping, cleaning, and SNP calling was similar to that used for our low-coverage population sample (see below), except that an additional SNP calling step using Freebayes v1.3.1 (Garrison and Marth 2012) was included, skipping regions with high coverage (-g 1,000) and using a minimum alternate allele count (-C 4). Freebayes called 12,726,641 SNPs, and GATK called 13,437,087 SNPs. The intersection of the Freebayes and GATK SNPs was used as our recalibration set (resulting in 8,886,248 SNPs).
Whole-Genome Alignment
Whole-genome alignments between our new male-based assembly (MG), the previous female-based assembly (FG), and the closely related platyfish assembly (Schartl et al. 2013) were done with minimap2 genome alignment (-x asm5, asm20). Alignments were visualized in circos (Krzywinski et al. 2009) and AliTV (Ankenbrand et al. 2017).
Population Genomics
A total of 113 samples were sequenced, including 17–20 individuals from each of six populations across Northern Trinidad (supplementary table S5, Supplementary Material online). DNA was extracted from caudal peduncle tissue using a Qiagen DNAeasy kit. Sequencing libraries were prepared with the Illumina TruSEq Nano DNA library kit, with an insert size of roughly 300 bp. Libraries were then sequenced on either an Illumina HiSeq 2000 or Illumina Hiseq 3000, with 5–10 individuals per lane. Each individual was sequenced 1–2 times to obtain ∼10× coverage.
Data were quality control (FastQC; Andrews 2010) cleaned (q20) and adaptors removed (cutadapt; Martin 2011) using Trim Galore (Krueger 2015). Reads were aligned to the new MG genome assembly using the bwa mem algorithm in bwa v0.7.17 (Li and Durbin 2009). Alignment files were converted to bam, sorted and indexed using samtools (Li et al. 2009). Validity and quality control of alignments were checked using samtools flagstat and samtools idxstats. Variant calling was conducted using GATKv4 (McKenna et al. 2010), following guidelines for best practices. In cases where more than one lane of data existed per sample, each lane was individually aligned to the genome. Addition of read group (RG) information (AddorReplaceReadGroups) and deduplicating were carried out using Picard-tools v2.17.5 (http://broadinstitute.github.io/picard/). All libraries underwent a first found of deduplicating (MarkDuplicates) for library complexity quality control purposes, before aggregating libraries across samples (MergeSAM) and conducting a second round of deduplicating (MarkDuplicates) to remove any potential library preparation biases. Merged, deduplicated alignments were recalibrated against a variant truth-set using variants generated from our recalibration call set generated from high-coverage, PCR-free sequencing data (see above). Mean and standard deviation coverage for each individual were calculated using qualimap v2.2.1 (García-Alcalde et al. 2012; Okonechnikov et al. 2016).
Variants were called within each population using the GATKv4 processing pipeline for germline variant discovery. SNPs and indels were called via local de novo assembly of haplotypes using HaplotypeCaller in genomic VCF (gVCF) mode. Multi-sample joint aggregation of gVCFs was carried out using GenotypeGVCFs, resulting in unfiltered VCF files for each individual. VCF files were filtered to include only biallelic SNPs. Further refinement was conducted using vcftools (Danecek et al. 2011): SNPs were filtered to a minimum depth of 5 and a maximum depth of 200, a maximum amount of missing data of 10% per population and a minor allele frequency of 5%. Only SNPs which were called across all six populations were included for analysis.
Principal components analysis was carried out using PLINK (ver 1.90; Purcell et al. 2007).
Coverage
Coverage was estimated from bam files using the bamCoverage option of deeptools (Ramírez et al. 2016). The first and last 100 kb of each chromosome was trimmed before estimating coverage with a binSize = 50, smoothLength = 75, and an effectiveGenomeSizeof 528,351,853 bp (genome size minus masked and trimmed regions). The ends of chromosomes were trimmed because they often showed high peaks of coverage, thereby distorting the normalized measures across the chromosome. As the end of LG12 has low genetic content and high GC content, this is unlikely to have excluded any informative sites. Coverage estimates were normalized using RPGC (reads per genomic context). Bins with normalized coverage > 4 (four times the expected median of 1) were filtered. Coverage was averaged within individuals of each sex within populations and converted to bed files with window sizes of 1 and 10 kb by taking weighted means.
Sex-biased coverage candidates were assigned on the basis of log2 (male coverage:female coverage) and defined as >0.75 (male-biased) or <−0.75 (female-biased). These cut-offs were chosen given expected coverage ratio peaks around 0 (equal coverage), 1 (double male coverage), and −1 (double female coverage).
Population Genomics Outliers
Population genetics statistics were calculated using the R package PopGenome (Pfeifer et al. 2014). We calculated FST, DXY, and differences in nucleotide diversity (π) in 10-kb windows between sexes within each population. Because DXY is associated with within-group diversity, to assess sequence divergence specifically, we calculated net divergence, Da, similar to the definition by Nei (1987). Da can detect sequence divergence between the sexes that cannot be explained in terms of differences in diversity within females (due to mutation rate or other differences). Da was calculated within each population by subtracting female π from DXY calculated between the sexes. Outliers were defined within each population as having values greater than the population mean plus 3 SD (akin to a 99% confidence interval).
Although FST and Da are both based on SNP diversity between the sexes, they will detect different signals. FST between the sexes will detect areas where all males are heterozygous and all females are homozygous, and when averaging across SNPs over large windows, FST will only detect these differences if variation between males is low (as expected for a selective sweep of one Y haplotype). Da outliers, however, will detect genomic regions where there is an abundance of low-frequency, male-specific variants (that are fixed in females), consistent with the existence of many different nonrecombining Y haplotypes.
y-mer Analysis
Given the low mapping rates to our area of interest (see Results), we also took a de novo assembly approach to identify male-specific regions. We selected five males and five females from each of our population samples. Quality of paired-end reads was assessed using FastQC (Andrews 2010) both before and after cleaning and adapter removal using cutadapt (Martin 2011), applying a quality control threshold of 20. The cleaned reads for each sex were pooled for each sample, and error-corrected using Lighter v. 1.11 (Song et al. 2014) using a k-mer size of 31, and a genome size of ∼730 Mb.
k-mers were counted canonically (-C) for each sex and sampled using Jellyfish v.1.12 (Marçais and Kingsford 2011) with no bloom filter. The “dump” command was used to output the k-mers and their counts. The k-mer abundances were formatted into a matrix using joinCounts from the DE-kupl software (Audoux et al. 2017). A master matrix was constructed containing all k-mers with abundances in each sex for all samples. A second, female-filtered matrix was constructed, containing k-mers which had a frequency ≤1 in females and ≥1 in males. This matrix was used to calculate the mean and 95% confidence limits for the k-mers in males in each sample and to define cut-off values for male-specific k-mers, or y-mers.
To identify conserved Y-linked regions, we identified y-mers that were called in all males across all low-coverage population samples. Such y-mers were identified as those with frequency above the sample-specific mean for all male samples in our set of populations, but were not found (frequency = 0) in any females. Reads containing these y-mers were selected from the respective population reads using a custom python script (available). Then contigs containing such y-mers were assembled from the paired-end reads from each population using SPAdes v. 3.12.0 (Bankevich et al. 2012) with the –only-assembler and –careful options. Contig assembly statistics were computed using Quast v 4.6.3 (Gurevich et al. 2013). Contigs for each population were aligned to the genome assembly using minimap2 with the default parameters (Li 2018).
Investigation of Regions of Interest
We manually curated gene annotations in our candidate sex determining regions by first searching both for BlastN matches within the reference transcriptome of Sharma et al. (2014) to confirm gene boundaries. The reference transcriptome was searched using BlastN, with the contig as the subject, the 74,567 reference transcripts as queries, and an e-value cutoff of 1e-50. Matching transcripts were then used as queries for BlastX searches on Ensembl 96 of X. maculatus, Danio rerio, Xenopus tropicalis, Mus musculus, Homo sapiens, and Drosophila melanogaster, using an e-value cutoff of 1e-05. These BlastX searches were complemented by domain searches using HMMERscan, with the transcripts translated assuming that each was already in the proper reading frame. The intention here was to characterize sequences in the genomic DNA, not to determine whether they are transcriptionally active.
PCR Validation Tests
A PCR presence/absence test for the TRIM-35 duplication was designed with primers at 5,015,401 and 5,045,101 bp (see supplementary table S14, Supplementary Material online for primer details). OneTaq Quick-Load DNA polymerase (New England Biolabs, Ipswich, MA) was used with 10 pmol of each primer to amplify 1 µl gDNA from individual samples, using the following cycling conditions: 94 °C for 1 min, 35 cycles of: 94 °C for 30 s, 53 °C for 30 s, and 72 °C for 60 s; 72 °C for 5 min. Experiments were performed using a VWR Uno96 thermocycler. The PCR products were visualized using agarose gel electrophoresis with a 2% TAE gel containing a 1:10,000 dilution of SYBR Safe DNA Gel Stain (Invitrogen, Carlsbad, CA).
Real-time qPCR (RT-qPCR) was performed using primers designed to detect copy number variation of the TeximY gene at 5,186,745 bp (see supplementary table S13, Supplementary Material online for primer details). iTaq Universal SYBR Green Mastermix (Bio-Rad, Hercules, CA) was used with 10 pmol of each primer in a two-step amplification protocol consisting of 30 cycles of: 95 °C for 20 s, Ta (53 °C) for 30 s. The specificity of the reaction was verified by subsequent melt curve analysis. RT-qPCR was also performed using primers for internal control gene Rpl7, under the same conditions. Rpl7 was chosen as the internal control as it was shown to be the most stable, single-copy reference gene and has a similar efficiency at the same optimal annealing temperature as our target amplicon. A threshold crossing (Ct) value in the target amplicon matching that of the Rpl7 control was therefore taken as the baseline (i.e., a single copy). The Ct value for each sample was normalized to Rpl7 and the absolute copy number calculated by comparison to the single copy control samples, using the comparative Ct method (2−ΔΔCt; Ma and Chung 2014). Experiments were performed using a Bio-Rad CFX96 Thermocycler, and data processing was performed using Bio-Rad CFX Manager software.
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.
Acknowledgments
We would like to thank Julia Hildebrandt for sample preparation, Oliver Deusch for data support, and Jenna Cocoran for lab support. We also acknowledge high performing computing (HPC) ISCA server at the University of Exeter. This work was supported by the Max Planck Society (F.B., M.H., V.K., C.L., C.D., D.W.), Royal Society Dorothy Hodgkin Fellowship(C.W.), EU Research Council Grant (GuppyCon 758382; B.F., J.W., P.P.), and NERC Grant (NE/P013074/1; B.F., J.P.).
Data Availability
Population genomics data are available on ENA: Study: PRJEB10680. PCR-free data are available on ENA: Study PRJEB36450. Genome assembly is available on ENA ID: PRJEB36704; ERP119926 Accession: GCA_904066995. All scripts and pipelines are available on github: https://github.com/bfraser-commits/guppy_genome and associated Fraser BA, Whiting JR, Paris JR, & Bemm F. (2020, September 9). guppy_genome: v1.0.0 - male guppy genome assembly (Version v1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4020899