Chromosome-level genome assemblies of the malaria vectors Anopheles coluzzii and Anopheles arabiensis

Author Notes

Abstract

Background

Anopheles coluzzii and Anopheles arabiensis belong to the Anopheles gambiae complex and are among the major malaria vectors in sub-Saharan Africa. However, chromosome-level reference genome assemblies are still lacking for these medically important mosquito species.

Findings

In this study, we produced de novo chromosome-level genome assemblies for A. coluzzii and A. arabiensis using the long-read Oxford Nanopore sequencing technology and the Hi-C scaffolding approach. We obtained 273.4 and 256.8 Mb of the total assemblies for A. coluzzii and A. arabiensis, respectively. Each assembly consists of 3 chromosome-scale scaffolds (X, 2, 3), complete mitochondrion, and unordered contigs identified as autosomal pericentromeric DNA, X pericentromeric DNA, and Y sequences. Comparison of these assemblies with the existing assemblies for these species demonstrated that we obtained improved reference-quality genomes. The new assemblies allowed us to identify genomic coordinates for the breakpoint regions of fixed and polymorphic chromosomal inversions in A. coluzzii and A. arabiensis.

Conclusion

The new chromosome-level assemblies will facilitate functional and population genomic studies in A. coluzzii and A. arabiensis. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

Introduction and Background

Malaria has a devastating global impact on public health and welfare, with the majority of the world's malaria cases occurring in tropical Africa. Anopheles mosquitoes are exclusive vectors of malaria, with species from the Anopheles gambiae complex being the deadliest African vectors. Anopheles arabiensis Patton, 1905 (NCBI:txid7173) and Anopheles coluzzii Coetzee & Wilkerson, 2013 (NCBI:txid1518534), along with A. gambiae Giles, 1902 (NCBI:txid7165), are the malaria vectors of most widespread importance in sub-Saharan Africa. A. arabiensis feeds and rests predominantly outdoors, replacing A. gambiae in some localities where there is high use of long-lasting insecticide-treated nets and indoor residual spraying [1]. Females of A. arabiensis display opportunistic feeding behavior as they seek both human and animal blood [2]. A genomic study revealed that alleles linked to the chromosomal inversions 2Rb and 3Ra of A. arabiensis may influence choice of host for blood feeding [3]. Because traits relevant to vectorial capacity have genetic determinants, genomic studies are crucial for developing novel approaches to malaria control. Interspecies crosses between A. arabiensis and other species of the A. gambiae complex produce sterile males [4–6]. Fertile hybrid females allow for gene flow between species. The discovery of pervasive genomic introgression between A. arabiensis and A. gambiae or A. coluzzii [7, 8] opened an opportunity to investigate how traits enhancing vectorial capacity can be acquired through an interspecific genetic exchange. Hybrids of both sexes between the closely related species A. coluzzii and A. gambiae are fertile [9, 10]. These species are highly anthropophilic and endophilic; they are often sympatric but differ in geographical range [11], larval ecology [12], mating behavior, [13], and strategies for surviving the dry season [14]. Genomic analyses have the power to infer how these different adaptations are determined and maintained. A recent study described a new taxon, designated Anopheles TENGRELA, that genetically is most similar to A. coluzzii [15]. Still undiscovered and misidentified cryptic taxa could seriously confound ongoing genomic studies of Anopheles ecology and evolution of insecticide resistance.

The quality of genome annotation and analyses of any organism highly depends on the completeness of the assembly [16, 17]. Draft genome assemblies of species with highly polymorphic genomes, such as mosquitoes, may have many gene annotation problems: genes can be missing entirely, have missing exons or gaps, or be split between scaffolds. As a consequence, it is difficult to estimate the total gene number or gene copy number, both of which may be linked to important phenotypic traits. Genes of particular interest with respect to vectorial capacity are especially prone to such errors because they often belong to gene families: aquaporins; ionotropic, odorant, and gustatory receptors; immunity genes; insecticide resistance genes; and reproduction gene clusters [18–25]. A genome with missing information can also cause problems for correct analyses of transcriptome, epigenome, and population genomic data. A. gambiae, because of its epidemiological importance, was the first disease vector to have its genome sequenced by the Sanger method (2002 and updated in 2007) [26, 27]. Since then, the AgamP4 assembly remains the standard chromosome-level genome reference for species of the A. gambiae complex [28–32]. Using the A. gambiae genome as a reference for functional annotation and population genomic analyses in other species comes at the expense of losing important information on species-specific genetic architectures in the sequencing data. Moreover, the AgamP4 assembly has misassembled haplotype scaffolds, large gaps, incorrect orientation of some scaffolds, and unmapped sequences [26, 27]. The 16 Anopheles mosquito species genome project included several members of the A. gambiae complex [33]. Among them, the genome of A. arabiensis was sequenced on the Illumina platform in addition to the previously sequenced genome of the A. coluzzii MALI strain by Sanger [34]. Unlike the Sanger-based chromosome-level AgamP4 genome assembly, genomes of A. arabiensis and A. coluzzii MALI are represented by numerous unmapped sequencing scaffolds. Combined bioinformatics and physical mapping approaches recently produced 20 new super-scaffolded assemblies with improved contiguities for anopheline species, including A. arabiensis and A. coluzzii [35]. In the case of A. arabiensis, the genomic scaffolds have been assigned to and ordered and oriented on 5 chromosomal arms with the help of the AgamP4 reference, thus creating the AaraD2 Illumina-based chromosome-level assembly for this species. Also, a de novo genome assembly from a single A. coluzzii Ngousso mosquito was obtained using Pacific Biosciences (PacBio) sequencing [36]. The high-quality AcolN1 assembly for the Ngousso strain was placed into chromosome context by ordering and orienting the PacBio contigs to the AgamP4 reference. Also, 40% of the unmapped sequences in AgamP4 were assigned to the appropriate chromosomal positions [36].

Although each current sequencing or scaffolding technology alone cannot provide a telomere-to-telomere genome assembly and each method has its limitations [37], a combination of complementary approaches can lead to a chromosome-scale assembly [38, 39]. The ongoing revolution in sequencing and scaffolding methods urges researchers to undertake efforts to create new genome references that satisfy the modern standards. The Oxford Nanopore sequencing technology is a single-molecule, real-time sequencing approach that uses biological membranes with extremely small holes (nanopores) and electrophoresis to measure the change in ionic current when a DNA or RNA molecule passes through the membrane [40]. This high-throughput technology generates exceptionally long reads and has been proven successful in genomic studies for a wide range of biological samples such as arboviruses [41], bacteria [42], plants [43], insects [44], and humans [45]. Hi-C is a groundbreaking technology that exploits in vivo chromatin proximity information to yield dramatically improved genome assemblies. In contrast to alternative scaffolding approaches, such as phasmid libraries or other mate-paired sequencing methods [33], the Hi-C method can produce chromosome-level genomic scaffolds [39, 46–50].

In this study, we tested the pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes using the following steps: (i) high-coverage Oxford Nanopore sequencing and assembly using high molecular weight genomic DNA from inbred individuals, (ii) gap filling and error correction using Illumina sequencing data, (iii) Hi-C scaffolding of Oxford Nanopore contigs to chromosomes, and (iv) evaluation and validation of completeness and contiguity of the assemblies. We developed new reference genome assemblies for A. coluzzii and A. arabiensis, which will facilitate studies for a deeper understanding of the biology and genetics of these major African malaria vectors. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

Data Description

Here, we describe the pipeline for obtaining de novo chromosome-level assemblies for A. coluzzii MOPTI (AcolMOP1) and A. arabiensis DONGOLA (AaraD3). To produce the highly contiguous assemblies, we adopted and modified a strategy that was recently used for these goals in fruit flies [51, 52]. Briefly, we sequenced the genomes using the Oxford Nanopore technology, assembled contigs from long Nanopore reads, polished them with short Illumina reads, and scaffolded the contigs to the chromosome level using Hi-C proximity ligation data. Evaluation and comparison of the new assemblies with the previously released versions demonstrated substantial improvements in genome completeness and contiguity. We also provide genomic coordinates within the new references for the breakpoint regions of fixed and polymorphic inversions in A. coluzzii and A. arabiensis.

Nanopore sequencing of A. coluzzii and A. arabiensis genomes

We performed Oxford Nanopore sequencing using genomic DNA isolated from sibling males after 6 generations of inbreeding to reduce heterozygosity. Our analysis and visualization of the long Nanopore reads reported 3.3 million (M) reads of a total length of 28 Gb and 5M reads of a total length of 35 Gb for A. coluzzii and A. arabiensis, respectively. The N50 read length was 19 and 21 kb (Additional File 1), the read median length was 3.8 and 2.3 kb (Additional File 2), and the read quality was 10.3 and 10.0 (Additional File 3) for A. coluzzii and A. arabiensis, respectively. We aligned Nanopore reads from A. coluzzii and A. arabiensis to the genome of the closely related species A. gambiae (AgamP4) using minimap2 [53]. For A. coluzzii, the total number of aligned and unaligned reads was 3.3M (99%) and 0.03M (1%), respectively. In the case of A. arabiensis, the total number of aligned and unaligned reads was 4.5M (89%) and 0.56M (11%). The alignment statistics reported 100× coverage for the A. coluzzii genome and 114× coverage for the A. arabiensis genome (Additional File 4). Our contamination analysis with Kraken2 [54] identified 98.4% of A. coluzzii reads as having mosquito origin and 1.6% of reads as having bacterial origin. For A. arabiensis, 89.55% of the reads had mosquito origin, 5.65% of the reads had bacterial origin, and 4.80% of the reads had unknown origin. The reads of bacterial origin were filtered out. We retained the reads of unknown origin for a downstream analysis because they may represent novel mosquito sequences.

Assembly of the A. coluzzii and A. arabiensis genomes

Genome assembly from Nanopore sequencing data is an actively developing area of research. However, no comprehensive comparisons exist of different software on diverse genomes. To assess the performance of several available assemblers on Anopheles genomes, we ran the Nanopore reads, including wtdbg2 v1.1 [55], FLYE v2.4.1 [56], Miniasm v0.3-r179 [57], and Canu v1.8 [58]. In the case of the Canu v1.8 assembler, we obtained 2 assemblies: 1 consisting of unitigs (i.e., unambiguous reconstructions of the sequence) and the other consisting of contigs. We evaluated the contiguity of the draft assemblies using QUAST-LG [59] and estimated the genome sizes for A. coluzzii and A. arabiensis by a k-mer analysis of Illumina reads using Jellyfish [60]. For the A. coluzzii genome, the peak of the 19-mer distribution was at a depth of 54, and the genome size was estimated as 301.3 Mb. The length of single-copy genomic regions was estimated as 204.1 Mb (Additional File 5). For Illumina reads of A. arabiensis, the peak of the 19-mer distribution was at a depth of 88, and the genome size was estimated as 315.6 Mb. The size of single-copy genome regions was estimated to be 249.4 Mb (Additional File 5).

We obtained the following 5 assemblies for A. coluzzii:

1,392 contigs of a total length of 267.2 Mb produced by Wtdbg2;
1,618 contigs of a total length of 279.5 Mb produced by FLYE;
634 contigs of a total length of 318.0 Mb produced by Miniasm;
1,073 unitigs of a total length of 344.0 Mb produced by Canu;
474 contigs of a total length of 314.2 Mb produced by Canu.

The total lengths for the Miniasm and Canu assemblies were closer to the A. coluzzii genome size of 301.3 Mb, estimated from Illumina reads. According to the NG50 metric (NG50 is the length for which the collection of all contigs of that length or longer covers at least half the A. gambiae AgamP4 reference genome [26, 27]), the Canu contig assembly showed better contiguity (Fig. 1A, Additional File 6). However, this assembly also featured the third-largest number of misassemblies. The Wtdbg2 assembly had the fewest misassemblies.

Figure 1:

NGx curves for the Wtdbg2, Miniasm, Flye, and Canu contig and unitig assemblies. A, A. coluzzii. B. A. arabiensis.

Open in new tab Download slide

We obtained the following 5 assemblies for A. arabiensis:

1,920 contigs of a total length of 298.2 Mb produced by Wtdbg2;
1,280 contigs of a total length of 289.7 Mb produced by FLYE;
687 contigs of a total length of 338.8 Mb produced by Miniasm;
521 unitigs of a total length of 277.2 Mb produced by Canu;
211 contigs of a total length of 298.5 Mb produced by Canu.

The total length of all the assemblies except Miniasm is underestimated compared with the A. arabiensis genome size estimated from Illumina reads (315.6 Mb). The A. arabiensis NG50 values from any assembler were substantially larger than those of A. coluzzii (Fig. 1B, Additional File 6). The number of misassemblies was again larger for the Canu assemblies than for other assemblies. Thus, initial assemblies of the long-read data obtained by the Canu v1.8 software alone yielded NG50 contig lengths of 13.8 Mb for A. coluzzii and 23.7 Mb for A. arabiensis. The total assembly sizes were 314.2 and 277.2 Mb for A. coluzzii and A. arabiensis, respectively. For the Canu contig assemblies, the area under the NGx curve (auNG metric) was 13.57 and 21.99 Mb for A. coluzzii and A. arabiensis, respectively, which is higher than for the other assemblies (Fig. 1, Additional File 7).

We assessed the completeness of the A. coluzzii and A. arabiensis draft assemblies with BUSCO v3 [61, 62]. According to the BUSCO scores for both gene datasets, Canu assemblies were the best (Additional File 8), which likely indicates that the Canu assembler has more sophisticated error correction strategies than other assemblers have. The higher rate of core gene duplications for the Canu assemblies seems to indicate some haplotype separation, which was not observed in other assemblies. Low BUSCO scores for the Miniasm assemblies can be explained by the absence of error correction or polishing steps in Miniasm. For example, when we polished the A. coluzzii Miniasm assembly with 4-round Racon [63] using Nanopore reads, the resulting assembly achieves 87.9% complete Diptera genes (Additional File 9). Based on the comparison results across all the assemblies, we decided to proceed with the Canu contig and unitig assemblies for further steps in the assembly pipeline.

Polishing of the A. coluzzii and A. arabiensis assemblies

Initial assemblies of long reads from Oxford Nanopore Technologies are prone to frequent insertion and deletion errors, which usually are corrected by polishing. While there is no gold standard for polishing of Nanopore read assemblies, there are 2 commonly recommended polishing strategies. One strategy involves running several rounds of Racon [63] using raw Nanopore reads, and then running Medaka [64]. Another strategy is to run Nanopolish [65] using signal-level data additionally provided by a Nanopore sequencer. In both cases the quality of assemblies can be further improved by running Pilon [66] several times using short Illumina reads.

For polishing the A. coluzzii Canu contig assembly, we used Racon, Medaka, and Nanopolish, as well as Pilon for error correction and gap filling with Illumina reads. The FastQC quality control of the A. coluzzii Illumina reads reported 122.3M reads of a total length of 22.8 Gb and average length of 200 bp. FastQC showed that A. coluzzii Illumina reads have high per-base sequence quality (>32 on the Phred scale) and no adapter contamination. After each step in a polishing pipeline, we queried the resulting genome with conserved single-copy Diptera and Metazoa genes using the BUSCO test (Additional File 9). For the sake of brevity, we report only the BUSCO score for the Diptera single-copy gene set here. Single-copy genes usually cover only a small portion of a genome and it remains unclear how the polishing tools perform on the repeat-rich or non-coding regions. After Nanopolish corrected 283,935 substitutions, 1.6M insertions, and 51,104 deletions, the BUSCO score increased from 77.6% to 93.6% of complete genes. We then ran a 4-round Racon, which decreased the BUSCO score from 93.6% to 88.6% of complete genes. Also, after Nanopolish, we ran Pilon on the Canu contig assembly several times using Illumina reads. After the first round of Pilon, we obtained 97.9% of complete genes. After 3 rounds of Pilon, we reached 98.5% of complete genes. Because the change was insignificant for the third round, we decided to proceed with 3 rounds of Pilon. We also ran Nanopolish for the second time after the first round of Pilon, but this decreased the BUSCO score to 95.9%.

For polishing the A. arabiensis Canu contig and unitig assemblies, we used Nanopolish and Pilon. We ran Nanopolish and 3 rounds of Pilon on the A. arabiensis Canu contig assembly with Illumina reads. FastQC reported 260.6M reads for a total length 53.2 Gb with the average length being 90 bp; further filtering out of 14% of the reads left 224.8M of the A. arabiensis Illumina reads. After Nanopolish, which corrected 143,458 substitutions, 1.1M insertions, and 40,694 deletions, the BUSCO score improved from 83.0% to 94.4% of complete Diptera single-copy genes (Additional File 9). After the 3-round Pilon, we obtained the Canu contig assembly with 98.5% of complete genes. We also polished the A. arabiensis Canu unitig assembly with Nanopolish and 3-round Pilon and obtained the BUSCO score of 98.6% of complete Diptera genes.

We remark here that the BUSCO scores for the polished Canu contig assemblies of A. coluzzii and A. arabiensis were similar to the BUSCO scores for the A. gambiae PEST (AgamP4) genome (Table 1). The BUSCO scores of complete Diptera genes were 98.5% for polished contig assemblies of A. arabiensis and A. coluzzii.

Table 1:

Open in new tab

The percentage of complete, single, duplicated, and fragmented genes computed by BUSCO from the conserved single-copy Diptera and Metazoa gene sets for the polished A. coluzzii and A. arabiensis Canu contig assemblies and the A. gambiae PEST genome

Assemblies	Complete Diptera/Metazoa (%)	Single Diptera/Metazoa (%)	Duplicated Diptera/Metazoa (%)	Fragmented Diptera/Metazoa (%)
A. coluzzii Canu	98.5/98.9	90.2/89.1	8.3/9.8	0.7/0.9
A. arabiensis Canu	98.5/98.9	93.5/91.8	5.0/7.1	0.7/0.2
A. gambiae PEST	98.3/98.7	97.7/96.8	0.6/1.9	0.8/0.3

Assemblies	Complete Diptera/Metazoa (%)	Single Diptera/Metazoa (%)	Duplicated Diptera/Metazoa (%)	Fragmented Diptera/Metazoa (%)
A. coluzzii Canu	98.5/98.9	90.2/89.1	8.3/9.8	0.7/0.9
A. arabiensis Canu	98.5/98.9	93.5/91.8	5.0/7.1	0.7/0.2
A. gambiae PEST	98.3/98.7	97.7/96.8	0.6/1.9	0.8/0.3

Table 1:

Open in new tab

Assemblies	Complete Diptera/Metazoa (%)	Single Diptera/Metazoa (%)	Duplicated Diptera/Metazoa (%)	Fragmented Diptera/Metazoa (%)
A. coluzzii Canu	98.5/98.9	90.2/89.1	8.3/9.8	0.7/0.9
A. arabiensis Canu	98.5/98.9	93.5/91.8	5.0/7.1	0.7/0.2
A. gambiae PEST	98.3/98.7	97.7/96.8	0.6/1.9	0.8/0.3

Assemblies	Complete Diptera/Metazoa (%)	Single Diptera/Metazoa (%)	Duplicated Diptera/Metazoa (%)	Fragmented Diptera/Metazoa (%)
A. coluzzii Canu	98.5/98.9	90.2/89.1	8.3/9.8	0.7/0.9
A. arabiensis Canu	98.5/98.9	93.5/91.8	5.0/7.1	0.7/0.2
A. gambiae PEST	98.3/98.7	97.7/96.8	0.6/1.9	0.8/0.3

Hi-C scaffolding of the A. coluzzii and A. arabiensis contigs

We assembled the A. coluzzii and A. arabiensis polished Canu contigs into chromosome-level scaffolds using Hi-C Illumina reads. For both mosquito genomes, FastQC showed high per-base sequence quality of Hi-C reads (>30 on the Phred scale) and detected contamination with Illumina TrueSeq adapters in 0.17% of the reads of A. coluzzii and in 1.5% of the reads for A. arabiensis. The contaminated reads were filtered out in both read sets, resulting in 231.9M and 141.9M reads for A. coluzzii and A. arabiensis, respectively. In the case of A. coluzzii, the Juicer tool [67] reported 3.7M unmapped Hi-C read pairs, 34.7M Hi-C read pairs mapped to inter contigs, and 97.5M Hi-C read pairs mapped to intra contigs. For A. arabiensis, we obtained 1.6M, 9.8M, and 66.7M Hi-C read pairs that were unmapped, mapped to inter contigs, and mapped to intra contigs, respectively (Additional File 10).

There are several Hi-C–based scaffolding tools available: DNA Triangulation [68], LACHESIS [69], GRAAL [70], HiRise [71], HiCAssembler [72], SALSA2 [73], and 3D-DNA [47]. Among these tools, only GRAAL, SALSA2, and 3D DNA code repositories are actively maintained. We chose 3D-DNA and SALSA2 for scaffolding of our draft assemblies because 3D-DNA allows manual correction while SALSA2 can use the assembly graphs produced by assemblers. Because SALSA2 was designed primarily for unitigs, we assessed the performance of both tools on the A. arabiensis unitig and contig assemblies produced by Canu. We ran SALSA2 with the corresponding assembly graph for the 2 A. arabiensis assemblies. Our experiments showed that 3D-DNA has a tendency to aggressively split contigs as the number of rounds increases. We therefore chose to run 3D-DNA with only 1 round instead of the default 3 rounds.

The QUAST-LG computation of metrics for the original and processed A. arabiensis assemblies demonstrated that SALSA2 performs better than 3D-DNA (Table 2). Strikingly, the assemblies processed by 3D-DNA had a higher number of scaffolds and lower NG50 than in the original assembly. At the same time, 3D-DNA corrected ∼700 misassemblies in the contig assembly. While SALSA2 did not show a substantial boost in contiguity for unitig assembly, it significantly improved contig assembly at the cost of introducing a small number of misassemblies (Additional File 11).

Table 2:

Open in new tab

The QUAST-LG report for the NG50, number of misassemblies, and number of contigs for the A. arabiensis original Canu assemblies and those processed by 3D-DNA and SALSA2

QUAST-LG metrics	Assemblies of the A. arabiensis genome
	Canu		Canu, 3D-DNA		Canu, SALSA2
	Contigs	Unitigs	Contigs,	Unitigs	Contigs	Unitigs
NG50 (Mb)	23.7	23.7	5.1	5.9	71.4	23.9
No. misassemblies	12,308	12,860	11,650	15,415	13,155	13,271
No. scaffolds	211	521	246	870	161	366

QUAST-LG metrics	Assemblies of the A. arabiensis genome
	Canu		Canu, 3D-DNA		Canu, SALSA2
	Contigs	Unitigs	Contigs,	Unitigs	Contigs	Unitigs
NG50 (Mb)	23.7	23.7	5.1	5.9	71.4	23.9
No. misassemblies	12,308	12,860	11,650	15,415	13,155	13,271
No. scaffolds	211	521	246	870	161	366

Table 2:

Open in new tab

The QUAST-LG report for the NG50, number of misassemblies, and number of contigs for the A. arabiensis original Canu assemblies and those processed by 3D-DNA and SALSA2

QUAST-LG metrics	Assemblies of the A. arabiensis genome
	Canu		Canu, 3D-DNA		Canu, SALSA2
	Contigs	Unitigs	Contigs,	Unitigs	Contigs	Unitigs
NG50 (Mb)	23.7	23.7	5.1	5.9	71.4	23.9
No. misassemblies	12,308	12,860	11,650	15,415	13,155	13,271
No. scaffolds	211	521	246	870	161	366

QUAST-LG metrics	Assemblies of the A. arabiensis genome
	Canu		Canu, 3D-DNA		Canu, SALSA2
	Contigs	Unitigs	Contigs,	Unitigs	Contigs	Unitigs
NG50 (Mb)	23.7	23.7	5.1	5.9	71.4	23.9
No. misassemblies	12,308	12,860	11,650	15,415	13,155	13,271
No. scaffolds	211	521	246	870	161	366

We visually inspected the initial Hi-C contact heat maps produced by the scaffolding of the A. arabiensis assemblies (Additional File 12). The best heat map was generated by SALSA2 on the A. arabiensis contig assembly. For example, SALSA2 reconstructed the correct order of the scaffolds for the X chromosome except that 1 inversion was required to correct the orientation (upper left corner in Additional File 12a and b). On the other hand, the 3D-DNA heat maps were smoother (Additional File 12c and d) because the 3D-DNA software tends to remove repetitive sequences present in the original assemblies. We conclude that SALSA2 is better suited for scaffolding of contigs obtained from long reads. However, the results from both tools can be further improved by manual correction of scaffolds based on visual inspection of the Hi-C heat maps. Juicebox Assembly Tools (JBAT) v1.11 is the only tool currently available for manual correction of genome assemblies [74]. While 3D-DNA is designed to be loadable into JBAT for manual correction, we were unable to convert SALSA2 output to a data format that could be loaded and corrected in JBAT. Therefore, despite SALSA2 producing better scaffolding results, we decided to proceed with the 3D-DNA scaffolds obtained from the Canu contig genome assemblies for A. arabiensis (Additional File 12c) and A. coluzzii (Additional File 13). Both species assemblies required manual correction by reordering, changing orientation, splitting contig sequences, and allocating scaffold borders. The main goal of such manual correction was to obtain chromosome-level scaffolds without assembly errors, haplotype sequences, and assembly artifacts. We also tried to minimize the number of contig splits by means of manual correction.

To improve our manual correction process, we used the following additional information about contigs and scaffolds in the assemblies. All contigs were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts on the basis of the read-depth analysis as follows. Read-depth histograms were produced for the A. coluzzii and A. arabiensis draft assemblies (Additional File 14). In each read-depth histogram, we chose 3 cut-offs to capture 2 peaks of the bimodal distribution that correspond to haploid and diploid levels of coverage. The first read-depth peak resulted from the duplicated regions and corresponded to a “haploid” level of coverage. The second read-depth peak resulted from regions that are haplotype-fused and corresponded to the “diploid” level of coverage. We also aligned contigs from each draft assembly to the A. gambiae PEST (AgamP4) assembly to obtain information about distribution of the contigs across the chromosomes.

Because the Hi-C signal must be stronger for adjacent sequence regions, we manually reordered and changed the orientation of contigs in each assembly to keep the Hi-C signal strong along the diagonal. We used the PurgeHaplotigs classification and the fact that haplotig sequences lead to parallel diagonal signals for moving these contigs into debris. We also moved the contigs with low Hi-C signal and the contigs classified as assembly artifacts to debris. The remaining contigs were reordered according to the Hi-C signal. After manual correction we obtained the final Hi-C contact heat maps for the chromosome-level genome assemblies of A. coluzzii AcolMOP1 and A. arabiensis AaraD3 (Fig. 2).

Figure 2:

The Hi-C contact heat maps obtained after manual correction of genome assemblies. A, A. coluzzii AcolMOP1. B, A. arabiensis AaraD3. From left to right in each heat map, chromosome X, chromosome 2 (2R+2L), chromosome 3 (3R+3L). The heat maps were produced by JBAT.

Open in new tab Download slide

Contigs in debris were further partitioned into several scaffolds. We performed chromosome quotient (CQ) analysis [76] to tentatively assign contigs in debris to the Y chromosome (CQ < 0.1), X chromosome (CQ = 2), and autosomes (CQ = 1). We grouped contigs with CQ values <0.1 into the “chrY” scaffolds. In addition, we removed sequences from the chrY scaffolds if they were classified as autosomal (Ag53C, Ag93) or X chromosomal (AgX367) in the previous studies [77–79]. These results showed that the A. coluzzii and A. arabiensis assemblies contain 126 and 4 contigs from chromosome Y, respectively. We were unable to arrange contigs inside the chrY scaffolds into a correct order or orientation because the Hi-C signal was weak for these contigs.

For better understanding of the contig distribution among chromosomes, we also aligned known tandem repeats from the pericentromeric regions of chromosome X and autosomes to our assemblies. We retrieved the sequences from the debris contigs that belong to the pericentromeric region of chromosome X based on tandem repeats 18S ribosomal DNA (rDNA) and AgX367 and to autosomal pericentromeric regions based on satellites Ag53C and Ag93 [77–79]. Because the Hi-C signal is low for these contigs due to low complexity of the corresponding genomic regions, we were unable to determine their position inside the scaffolds forming chromosomes. Therefore, we grouped these contigs into separate “X_pericentromeric” and “Autosomal_pericentromeric” scaffolds.

As a result of the manual correction, we obtained the final assemblies for A. coluzzii and A. arabiensis genomes. They include assembled chromosomes X, 2 (2R+2L), 3 (3R+3L), and complete mitochondrion (MtDNA), as well as unordered contigs of the Y chromosome and pericentromeric sequences of autosomes and the X chromosome (including sequences from the rDNA cluster). Each species assembly consists of 7 scaffolds: chrX, chr2, chr3, chrY, X_pericentromeric, Autosomal_pericentromeric, and MtDNA. The total genome assemblies for A. coluzzii and A. arabiensis have a length of 273.4 and 256.8 Mb, respectively (Table 3). The assembly sizes are in good agreement with the experimentally determined genome size of 260 Mb for A. gambiae [80]. The resulting chromosome-level genome assemblies for A. coluzzii and A. arabiensis have a length of 242.3 and 234.6 Mb, respectively (Table 3). By comparison, the aforementioned chromosome-level genome assemblies AgamP4 for A. gambiae and AaraD2 for A. arabiensis are 230.5 and 216.6 Mb, respectively.

Table 3:

Open in new tab

Genome sizes (bp) of the individual assembled chromosomes, unlocalized scaffolds, and mitochondrion for A. coluzzii AcolMOP1 and A. arabiensis AaraD3

Size (bp)	A. coluzzii	A. arabiensis
Chromosome X	27,656,523	26,913,133
Chromosome 2 (2R+2L)	114,808,232 (60,581,942 + 54,226,290)	111,988,354 (60,137,474 + 51,850,880)
Chromosome 3 (3R+3L)	99,875,506 (53,984,252 + 45,981,254)	95,710,210 (52,890,104 + 42,820,106)
Total chromosome-level assembly	242,340,261	234,611,697
Chromosome Y	21,271,033	619,478
X_pericentromeric	5,607,343	14,883,429
Autosomal_pericentromeric	4,189,909	6,693,365
Mitochondrion	15,366	15,347
Total genome assembly	273,423,912	256,823,316

Size (bp)	A. coluzzii	A. arabiensis
Chromosome X	27,656,523	26,913,133
Chromosome 2 (2R+2L)	114,808,232 (60,581,942 + 54,226,290)	111,988,354 (60,137,474 + 51,850,880)
Chromosome 3 (3R+3L)	99,875,506 (53,984,252 + 45,981,254)	95,710,210 (52,890,104 + 42,820,106)
Total chromosome-level assembly	242,340,261	234,611,697
Chromosome Y	21,271,033	619,478
X_pericentromeric	5,607,343	14,883,429
Autosomal_pericentromeric	4,189,909	6,693,365
Mitochondrion	15,366	15,347
Total genome assembly	273,423,912	256,823,316

Table 3:

Open in new tab

Genome sizes (bp) of the individual assembled chromosomes, unlocalized scaffolds, and mitochondrion for A. coluzzii AcolMOP1 and A. arabiensis AaraD3

Size (bp)	A. coluzzii	A. arabiensis
Chromosome X	27,656,523	26,913,133
Chromosome 2 (2R+2L)	114,808,232 (60,581,942 + 54,226,290)	111,988,354 (60,137,474 + 51,850,880)
Chromosome 3 (3R+3L)	99,875,506 (53,984,252 + 45,981,254)	95,710,210 (52,890,104 + 42,820,106)
Total chromosome-level assembly	242,340,261	234,611,697
Chromosome Y	21,271,033	619,478
X_pericentromeric	5,607,343	14,883,429
Autosomal_pericentromeric	4,189,909	6,693,365
Mitochondrion	15,366	15,347
Total genome assembly	273,423,912	256,823,316

Size (bp)	A. coluzzii	A. arabiensis
Chromosome X	27,656,523	26,913,133
Chromosome 2 (2R+2L)	114,808,232 (60,581,942 + 54,226,290)	111,988,354 (60,137,474 + 51,850,880)
Chromosome 3 (3R+3L)	99,875,506 (53,984,252 + 45,981,254)	95,710,210 (52,890,104 + 42,820,106)
Total chromosome-level assembly	242,340,261	234,611,697
Chromosome Y	21,271,033	619,478
X_pericentromeric	5,607,343	14,883,429
Autosomal_pericentromeric	4,189,909	6,693,365
Mitochondrion	15,366	15,347
Total genome assembly	273,423,912	256,823,316

Validation and quality evaluation of the genome assemblies

We validated the resulting assemblies by aligning the AgamP4.10 gene set from the AgamP4 assembly to AaraD3 and AcolMOP1 using NCBI BLAST v2.9.0 [81]. Overall, 13,036 (99.84%) and 13,031 (99.80%) genes from the total of 13,057 A. gambiae genes were mapped to the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 15). Moreover, 9,442 (72.31%) and 8,971 (68.71%) genes were mapped with the alignment having a zero e-value and 90–110% of the gene length to the A. coluzzii and A. arabiensis assemblies, respectively (Table 4). The relatively smaller number of genes mapped to the X chromosome of A. arabiensis agrees with its higher divergence from the A. gambiae X chromosome [7]. The gene alignments provide supporting evidence that the obtained assemblies have the correct gene content.

Table 4:

Open in new tab

Statistics of the genes from the A. gambiae PEST assembly aligned to the A. coluzzii and A. arabiensis assemblies

		AgamP4.10 gene set
	Aligned from	X	2	3	Mt	Y_unplaced	UNKN
Aligned to	total	1,063	6,603	4,897	13	2	479
chrX	584/397	564/378	5/3	4/3	0	0	11/13
chr2	5,055/4,930	9/11	4,862/4,734	5/4	0	0	180/180
chr3	3,785/3,610	6/2	10/8	3,566/3,419	0	0	203/181
MtDNA	13/13	0	0	0	13/13	0	0
chrY	4/10	0	1/3	1/0	0	2/1	0/6
Autosomal_pericentromeric	0/10	0	0/3	0/6	0	0	0/1
X_pericentromeric	1/1	0	0	1/0	0	0/1	0

		AgamP4.10 gene set
	Aligned from	X	2	3	Mt	Y_unplaced	UNKN
Aligned to	total	1,063	6,603	4,897	13	2	479
chrX	584/397	564/378	5/3	4/3	0	0	11/13
chr2	5,055/4,930	9/11	4,862/4,734	5/4	0	0	180/180
chr3	3,785/3,610	6/2	10/8	3,566/3,419	0	0	203/181
MtDNA	13/13	0	0	0	13/13	0	0
chrY	4/10	0	1/3	1/0	0	2/1	0/6
Autosomal_pericentromeric	0/10	0	0/3	0/6	0	0	0/1
X_pericentromeric	1/1	0	0	1/0	0	0/1	0

In each entry (x/y), x stands for the number of genes aligned to AcolMOP1 and y stands for the number of genes aligned to AaraD3.

Table 4:

Open in new tab

Statistics of the genes from the A. gambiae PEST assembly aligned to the A. coluzzii and A. arabiensis assemblies

		AgamP4.10 gene set
	Aligned from	X	2	3	Mt	Y_unplaced	UNKN
Aligned to	total	1,063	6,603	4,897	13	2	479
chrX	584/397	564/378	5/3	4/3	0	0	11/13
chr2	5,055/4,930	9/11	4,862/4,734	5/4	0	0	180/180
chr3	3,785/3,610	6/2	10/8	3,566/3,419	0	0	203/181
MtDNA	13/13	0	0	0	13/13	0	0
chrY	4/10	0	1/3	1/0	0	2/1	0/6
Autosomal_pericentromeric	0/10	0	0/3	0/6	0	0	0/1
X_pericentromeric	1/1	0	0	1/0	0	0/1	0

		AgamP4.10 gene set
	Aligned from	X	2	3	Mt	Y_unplaced	UNKN
Aligned to	total	1,063	6,603	4,897	13	2	479
chrX	584/397	564/378	5/3	4/3	0	0	11/13
chr2	5,055/4,930	9/11	4,862/4,734	5/4	0	0	180/180
chr3	3,785/3,610	6/2	10/8	3,566/3,419	0	0	203/181
MtDNA	13/13	0	0	0	13/13	0	0
chrY	4/10	0	1/3	1/0	0	2/1	0/6
Autosomal_pericentromeric	0/10	0	0/3	0/6	0	0	0/1
X_pericentromeric	1/1	0	0	1/0	0	0/1	0

In each entry (x/y), x stands for the number of genes aligned to AcolMOP1 and y stands for the number of genes aligned to AaraD3.

We validated the structural accuracy of the AaraD3 and AcolMOP1 assemblies by comparing them with the chromosome-scale reference genome of A. gambiae PEST. Using D-Genies v1.2.0 [82], we generated 3 whole-genome pairwise alignments for the following pairs of assemblies: AcolMOP1 and AgamP4, AaraD3 and AgamP4, and AcolMOP1 and AaraD3 (Fig. 3, Additional File 16). We observed that no alignment pair had inter-chromosomal rearrangements between the assemblies. Gaps in whole-genome pairwise alignments between AcolMOP1 and AgamP4 and between AaraD3 and AgamP4 indicate that all chromosomes in the new assemblies have more genomic information in the pericentromeric regions than the AgamP4 assembly. Overall, the pairwise alignments show high concordance of the AcolMOP1 and AaraD3 assemblies with the existing AgamP4 assembly.

Figure 3:

Whole-genome pairwise alignment dot-plots between chromosome-level assemblies. A, AcolMOP1 and AgamP4. B, AaraD3 and AgamP4. C, AcolMOP1 and AaraD3.

Open in new tab Download slide

To assess the completeness of the final assemblies, we searched for conserved, single-copy genes using BUSCO v4.1.4 with the dipteran complete gene set. The chromosomal scaffolds of the A. coluzzii and A. arabiensis final assemblies had BUSCO scores of 99.7% and 99.2%, respectively (Additional File 17). We used QUAST-LG to assess contiguity of the assemblies. The largest scaffolds are 114.8 and 112.0 Mb and scaffold N50s are 99.9 and 95.7 Mb for the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 18).

Finally, we assessed the presence of known tandem repeats in pericentromeric regions of chromosomes 2, 3, X, and in chromosome Y of the AcolMOP1, AaraD3, AgamP4, and AcolN1 [36] assemblies. In particular, we used the presence of the putative pericentromeric tandem repeat Ag93 [78, 79] to assess the completeness of autosomal arms. We observed 27,364 and 33,460 Ag93 repeat copies in A. coluzzii and A. arabiensis assemblies, respectively, which are substantially greater than the 4,446 and 2,188 Ag93 repeat copies found in AgamP4 and AcolN1, respectively (Table 5). Moreover, while all hits in AgamP4 were located in the chromosome UNKN, 409/7,003 and 1,643/735 repeats were found in scaffolds corresponding to chromosomes 2/3 in the A. coluzzii and A. arabiensis assemblies, respectively (Additional File 19). This result indicates that the AaraD3 and AcolMOP1 chromosomes are more complete than the AgamP4 chromosomes. We also analyzed the presence of the Ag53C tandem repeat and its junction with the Tsessebe III transposable element, which are known to be located in the pericentromeric regions of the autosomes [78, 83]. The A. coluzzii assembly contains the highest number (646) of these repeats and the A. arabiensis assembly contains a comparable number of the repeats to AgamP4, while the AcolN1 assembly contains just 41 repeat copies (Table 5). It should be noted that these repeats are not located in scaffolds assembled to autosomal chromosomes and can only be found in the Autosomal_pericentromeric scaffolds of the A. coluzzii and A. arabiensis assemblies or in the chromosome UNKN of the AgamP4 assembly. Overall, the AaraD3 and AcolMOP1 assemblies contain more assembled sequences from the autosomal pericentromeric regions than the AgamP4 and AcolN1 assemblies.

Table 5:

Open in new tab

The number of marker sequences present in the A. coluzzii MOPTI (AcolMOP1), A. arabiensis (AaraD3), A. gambiae (AgamP4), and A. coluzzii Ngousso (AcolN1) assemblies

Repeat sequence	Chromosome	AcolMOP1	AaraD3	AgamP4	AcolN1
Ag93	Autosomes	27,364	33,460	4,446	2,188
Ag53C	Autosomes	646	93	103	41
AgX367	X chromosome	845	1,653	2	86
18S rDNA	X chromosome	205	490	0	3
Ag113	X chromosome	3,146	0	765	97
AgY477	Y chromosome	7,685	0	4	0
AgY53B	Y chromosome	10,569	0	387	0
zanzibar	Y chromosome	100	0	0	0

Repeat sequence	Chromosome	AcolMOP1	AaraD3	AgamP4	AcolN1
Ag93	Autosomes	27,364	33,460	4,446	2,188
Ag53C	Autosomes	646	93	103	41
AgX367	X chromosome	845	1,653	2	86
18S rDNA	X chromosome	205	490	0	3
Ag113	X chromosome	3,146	0	765	97
AgY477	Y chromosome	7,685	0	4	0
AgY53B	Y chromosome	10,569	0	387	0
zanzibar	Y chromosome	100	0	0	0

Table 5:

Open in new tab

The number of marker sequences present in the A. coluzzii MOPTI (AcolMOP1), A. arabiensis (AaraD3), A. gambiae (AgamP4), and A. coluzzii Ngousso (AcolN1) assemblies

Repeat sequence	Chromosome	AcolMOP1	AaraD3	AgamP4	AcolN1
Ag93	Autosomes	27,364	33,460	4,446	2,188
Ag53C	Autosomes	646	93	103	41
AgX367	X chromosome	845	1,653	2	86
18S rDNA	X chromosome	205	490	0	3
Ag113	X chromosome	3,146	0	765	97
AgY477	Y chromosome	7,685	0	4	0
AgY53B	Y chromosome	10,569	0	387	0
zanzibar	Y chromosome	100	0	0	0

Repeat sequence	Chromosome	AcolMOP1	AaraD3	AgamP4	AcolN1
Ag93	Autosomes	27,364	33,460	4,446	2,188
Ag53C	Autosomes	646	93	103	41
AgX367	X chromosome	845	1,653	2	86
18S rDNA	X chromosome	205	490	0	3
Ag113	X chromosome	3,146	0	765	97
AgY477	Y chromosome	7,685	0	4	0
AgY53B	Y chromosome	10,569	0	387	0
zanzibar	Y chromosome	100	0	0	0

For assessing completeness of the pericentromeric regions in the X chromosome and within the Y chromosome, we used AgX367 and AgY477 repeat sequences that are known to be located in the X and Y chromosomes, respectively [77, 78]. It is important to note that AgX367 and AgY477 repeat sequences share a region of high similarity. AcolMOP1 and AaraD3 contain a much higher number of AgX367 copies than AgamP4 or AcolN1 do (Table 5). All these copies are found in the X_pericentromeric scaffold only (Additional File 19). While AcolMOP1 contains 7,685 of AgY477 repeat copies in the scaffold that corresponds to the Y chromosome, they are absent in AaraD3. We also used AgY53B repeat and zanzibar retrotransposon to assess for the presence of Y chromosome contigs in the assemblies. Similar to AgY477, these sequences are found in AcolMOP1 but absent in AaraD3. AcolMOP1 contains the highest number of AgY53B and zanzibar copies among all the studied assemblies. For validating scaffolds corresponding to the X chromosome, we further used the Ag113 and 18S rDNA sequences. Remarkably, only AcolMOP1 and AgamP4 assemblies contain full-length 18S rDNA sequences. The Ag113 sequence is widely present in AcolMOP1 and AgamP4 but absent in AaraD3 (Table 5). Importanly, AcolMOP1 contains 1,941 Ag113 copies in the X chromosome while all appearances of Ag113 in AgamP4 are located in the chromosome UNKN (Additional File 19). All these results indicate that the X and Y chromosomes are better assembled in AcolMOP1 than in AgamP4 or AcolN1.

We conclude that we produced chromosome-level genome assemblies for A. coluzzii and A. arabiensis. The AaraD3 and AcolMOP1 assemblies have 98.1% of conserved single-copy Diptera genes from BUSCO and contain pericentromeric heterochromatin sequences, as well as sequences of the Y chromosomes and the rDNA cluster. Our assessments show that the AaraD3 and AcolMOP1 assemblies are of higher quality and more continuous than AgamP4 or AcolN1.

Genome rearrangements in A. coluzzii and A. arabiensis

The new assemblies of A. coluzzii MOPTI and A. arabiensis DONGOLA allowed us to identify genomic coordinates of breakpoint regions and breakpoint-flanking genes of chromosomal rearrangements. The pairwise alignments of the A. arabiensis, A. coluzzii, and A. gambiae genomes were performed and visualized using D-Genies v1.2.0 [82] (Fig. 3, Additional File 17), genoPlotR [84] (Additional File 20), and SyRi [85] (Additional File 21). The karyotypes of the incipient species A. coluzzii and A. gambiae do not differ by any known fixed rearrangements. The karyotype of A. arabiensis is known to differ from that of A. coluzzii and A. gambiae by 5 fixed overlapping X-chromosome inversions (a, b, c, d, and g) and inversion 2La. Inversions Xag are fixed in A. coluzzii and A. gambiae [86] and inversions Xbcd are fixed in A. arabiensis [86]. Inversion 2La is fixed in A. arabiensis but polymorphic in A. coluzzii and A. gambiae [86]. Our previous studies identified the X chromosome inversion breakpoints by aligning the A. gambiae PEST assembly with the Illumina-based A. arabiensis assemblies AaraD1 and AaraD2 [7, 35]. Here, we determined genomic coordinates and breakpoint-flanking genes of the Xag and Xbcd breakpoint regions in the chromosome-level assemblies of the 3 species (Additional File 22). We identified breakpoint regions and breakpoint-flanking genes of inversion 2La, which is fixed in both the A. arabiensis DONGOLA and A. coluzzii MOPTI strains (Additional Files 20–22). Inversion 2Rb is polymorphic in A. arabiensis, A. coluzzii, and A. gambiae [86], and its breakpoint regions are shared among the 3 species, indicating a single common origin of the inversion [87]. The 2Rb inversion is fixed in the DONGOLA strain and the alignments with the A. gambiae and A. coluzzii genomes identified its breakpoints in the AaraD3 assembly (Additional Files 20–22). Localization of the 2Rb breakpoints in the context of the new reference genome will advance the investigation of how this chromosomal inversion influences choice of host by A. arabiensis [3]. The genomic coordinates of breakpoint regions and breakpoint-flanking genes for inversions 2La and 2Rb found in our assemblies are in agreement with the previously described breakpoints for these inversions [87, 88].

Pair-wise alignments of the 3 chromosome-level assemblies identified new assembly-specific rearrangements, most of which are much smaller than 1 Mb (Fig. 4, Additional Files 21–23). We only considered rearrangements assembly-specific if they had the same breakpoints when aligned to both other assemblies. We found 2 small rearrangements in A. coluzzii: 3R translocation and 3L microinversion. The 3R translocation is located between genomic coordinates 37.6 and 37.7 Mb and the 3L microinversion is located between coordinates 68.4 and 68.6 Mb in AcolMOP1. We discovered a new 2R microinversion in A. arabiensis located between genomic coordinates 9.6 and 9.8 Mb in AaraD3 (Additional File 21). Finally, we found 5 AgamP4-specific inversions on 2R (59.1–59.6 Mb, 60.5–60.9 Mb), on 2L (4.0–5.0 Mb), and on 3L (0.2–0.4 Mb, 1.2–1.9 Mb). The identified structural variations between the assemblies may represent natural genome rearrangements or misassemblies. It is worth noting that these micro-rearrangements are located in euchromatin of AcolMOP1 and AaraD3, while they are located in heterochromatin of AgamP4 [89]. This observation suggests that the AgamP4-specific microinversions are likely misassemblies in the PEST heterochromatin where genome assembly is notoriously difficult [26, 27].

Figure 4:

Whole-genome pairwise alignments between chromosome-level assemblies of AcolMOP1 and AgamP4. The alignment plots are produced by SyRi. The rearrangements are indicated with arrows.

Open in new tab Download slide

To validate the new 2R microinversion in A. arabiensis, we aligned our AaraD3 assembly with AaraD2 [35], which is a super-scaffolded, Illumina-based AaraD1 assembly [33]. We found the new 2R microinversion in the AaraD2 assembly as well, confirming its presence in the DONGOLA strain (Additional File 24). We also aligned AcolMOP1 with the reference-guided scaffolded AcolN1 assembly [36] and the super-scaffolded AcolM2 [35] assemblies. These 3 assemblies were made for genomes of 3 different A. coluzzii strains: MOPTI, Ngousso, and MALI, respectively. The alignments demonstrated that the new 3R translocation and 3L microinversion are MOPTI-specific (Additional Files 25 and 26), indicating that these rearrangements are polymorphic within A. coluzzii. Incidentally, we identified an inversion and 2 large translocations in the 2R arm (possible misassemblies) of AcolM2 in its alignment with AcolMOP1 (Additional File 26).

In addition to the rearrangements identified using these 3 species assemblies, we found small rearrangements in the pericentromeric heterochromatin of the X chromosome by aligning the A. coluzzii and A. gambiae genomes (Fig. 4, Additional File 27). The A. arabiensis genome was too divergent in this region to detect rearrangements. Breakpoint regions of 2 translocations and 4 inversions were located within genomic coordinates 20.1–24.2 Mb in AgamP4 and 21.6–26.2 Mb in AcolMOP1. These rearrangements were not detected in the alignments of AcolMOP1 with either AcolN1 or AcolM2 (Additional File 27). A previous study identified 3 of these rearrangements in the 20–22 Mb region of the X chromosome (1 translocation and 2 inversions) by aligning the A. gambiae PEST and A. coluzzii Ngousso genomes [36]. Because our AcolMOP1 assembly extends farther into the heterochromatin than the AcolN1 assembly does, we were able to detect 2 times more rearrangements in the X chromosome. These misalignments could be due to order and/or orientation errors in the PEST genome assembly. However, they may also represent novel rearrangements segregating between A. gambiae and A. coluzzii. We recently described a new type of shared cytogenetic polymorphism in the incipient species, A. gambiae and A. coluzzii—an inversion of the satellite DNA location in relation to the proximal gene-free X chromosome band [83]. The findings suggest that structural variations can be common in the sex chromosome heterochromatin of mosquitoes.

Conclusion

By combining long-read sequences generated by the Oxford Nanopore technology and long-range information produced by the Hi-C approach, we obtained high-quality reference genome assemblies for A. arabiensis and A. coluzzii. We demonstrated tremendous improvement in the completeness and contiguity of these species’ genomes. Thus, these assemblies provide a valuable resource for comparative genomics, epigenetics, functional analyses, and population studies of malaria mosquitoes. To maximize the use of the data, tools, and workflows of this study, we present a pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes based on Hi-C scaffolding of Oxford Nanopore sequencing contigs (Fig. 5). The pipeline illustrates successful approaches along with other approaches that we tried but discarded in the course of its development.

Figure 5:

The pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes based on Hi-C scaffolding of Oxford Nanopore sequencing contigs.

Open in new tab Download slide

Sequencing of the first vector genome of A. gambiae revolutionized genetics and genomics research in medical entomology. A. arabiensis and A. coluzzii are major vectors of malaria, and functional characterization of their genomes will enable identification of the genomic determinants of epidemiologically important phenotypic and behavioral traits. Eventually, these efforts will lead to better malaria control.

Materials and Methods

Mosquito colony maintenance

The A. arabiensis DONGOLA (MRA-1235) and A. coluzzii MOPTI (MRA-763) strains were initially obtained from the Biodefense and Emerging Infections Research Resources Repository. Eggs were hatched in distilled water and incubated for 10–15 days, undergoing larval and pupal developmental stages at 27°C. After emerging from pupae, the adult males and females were maintained together in an incubator at 27°C, 75% humidity, with a 12-hour cycle of light and darkness. The 5–7-day-old adult females were blood-fed on defibrinated sheep blood using artificial blood feeders. Approximately 48–72 hours after blood feeding, egg dishes were placed for oviposition.

Mosquito sample collection for Oxford Nanopore sequencing

A single well–blood-fed female mosquito was separated from the original cage. After oviposition, F1 progeny from this single female were inbred with each other for 4 days in a 1.4-L paper popcorn cup. F1 females were given bloodmeal and, after 72 hours, eggs were collected from a single F1 female. The F2 progeny were reared under normal conditions and were inbred as before. After 6 rounds of inbreeding using this procedure, all F7 pupal progeny from a single F6 female were sorted by sex, collected, flash frozen in liquid nitrogen, and stored at –80°C.

Genomic DNA isolation

Genomic DNA was isolated from 20 inbred male pupae following a modified Qiagen Genomic Tip DNA Isolation kit (Cat. No. 10,243 and 19,060, Qiagen, Hilden, Germany) protocol. Briefly, pupae were homogenized using a Dremel motorized homogenizer for ∼30 seconds at the lowest speed. Next, >300 mAU (500 μL of >600 mAU/mL solution) Proteinase K (Cat. No. 19,131, Qiagen, Hilden, Germany) was added to the sample and incubated at 55°C for 3 hours. The homogenate was then transferred into a 15-mL conical tube and centrifuged at 5000g for 15 minutes at 4°C to remove debris. DNA was extracted following the standard Qiagen Genomic Tip protocols (Qiagen, Hilden, Germany). The purity, approximate size, and concentration of the DNA were tested using a nanodrop spectrophotometer, 0.5% agarose gel electrophoresis, and Qubit dsDNA assay, respectively.

Oxford Nanopore sequencing

Approximately 1 μg of DNA was used to generate a sequencing library according to the protocol provided for the SQK-LSK109 library preparation kit from Oxford Nanopore (Oxford Nanopore Technologies, Oxford Science Park, UK). After the DNA repair, end prep, and adapter ligation steps, SPRIselect bead suspension (Cat No. B23318, Beckman Coulter Life Sciences, Indianapolis, IN, USA) was used to remove short fragments and free adapters. Qubit dsDNA assay (ThermoFisher Scientific, Waltham, MA, USA) was used to quantify DNA and ∼300–400 ng of DNA library was loaded onto a MinION flow cell (MinION, RRID:SCR_017985) (BioProject: PRJNA634549, SRA: SRX8462258, SRX8462259).

Hi-C library preparation and sequencing

Hi-C libraries for A. arabiensis were prepared with an Arima-HiC kit (Arima Genomics, San Diego, CA, USA) using protocols provided by the company (Document part No. A160126 v00) with slight modifications. Two replicas of Hi-C libraries were prepared from 1–2-day-old virgin adults with equal proportions of each sex (1 library with 20 and 1 library with 60 adults). After the fixation step using the “Crosslinking—Small Animal" protocol, 10% of the original pulverized mosquito tissue was taken out to perform the “Determining input Amount-Small Animal" protocol. The remaining 90% of tissue comprising ≥750 ng of DNA was used to produce proximally ligated DNA fragments following the Arima-HiC protocol. The quality of proximally ligated DNA was tested by taking 75 ng of DNA through the Arima-QC1 Quality Control protocol. If the Arima-Q1 value passed, the rest of the proximally ligated DNA was used to prepare the libraries using NEBNext Ultra II DNA Library Prep Kit following the “Library Preparation" and the “Library Amplification" protocols. The libraries were sent to Novogene for sequencing. A total of 17.8 and 27.7 Gb of 2 × 150 bp reads were obtained for the 2 libraries (BioProject: PRJNA634549, SRA: SRX8462257). Three biological replicates of Hi-C data for A. coluzzii embryos (BioProject: PRJNA615337, SRA: SRS6448831) were obtained from the previous study [90].

Quality control of Nanopore reads

Analysis and visualization of the long Nanopore reads were performed with the Nanostat and Nanoplot tools of the Nanopack software (de2018nanopack) [91]. Alignment of Nanopore reads to the genome of A. gambiae (AgamP4) was done using minimap2 (Minimap2, RRID:SCR_018550) [53]. A contamination analysis was performed with Kraken2 [54] using a custom database with addition of the A. gambiae [26, 27] and A. coluzzii Ngousso [36] genomes.

Nanopore sequence assembly

Genome assemblies from the Nanopore sequencing data were obtained using wtdbg2 v1.1 (WTDBG, RRID:SCR_017225) [55], FLYE v2.4.1 (Flye, RRID:SCR_017016) [56], Miniasm v0.3-r179 [57], and Canu v1.8 (Canu, RRID:SCR_015880) [58]. In the case of the Canu v1.8 assembler, we obtained 2 assemblies: 1 consisting of unitigs (i.e., unambiguous reconstructions of the sequence) and the other consisting of contigs. For wtdbg2, the polishing step using minimap2 was performed per the developers' recommendation. The completeness and quality of the assemblies were assessed with BUSCO v2 (BUSCO, RRID:SCR_015008) [61, 62] and QUAST-LG [59]. For QUAST-LG, the A. gambiae (AgamP4) genome was used as a reference. A. gambiae is evolutionary more closely related to A. coluzzii (0.061M years divergence) than to A. arabiensis (0.509M years divergence) [8] and, thus, the reference-based metrics (such as NG50 and the number of misassemblies discussed below) were considered with caution. Using BUSCO, each assembly was queried for 2,799 conserved single-copy Diptera genes, as well as for 978 conserved single-copy Metazoa genes. A gene-recognizing model was trained for the Augustus tool [92] in the BUSCO pipeline by using the Aedes aegypti genome [48].

Genome size estimation and polishing the genome assemblies

For genome size estimation and polishing assemblies obtained from Nanopore reads, Illumina short paired-end data were used. The NCBI SRX accession numbers were SRX3832577 for A. coluzzii and SRX084275, SRX111457, SRX200218 for A. arabiensis. Quality control of the Illumina reads was performed with FastQC (FastQC, RRID:SCR_014583) [93]. Based on the FastQC analysis, reads were filtered by the quality and minimum read length, and TruSeq adapters were trimmed from reads using fastp v0.20.0 (fastp, RRID:SCR_016962). The genome sizes of A. coluzzii and A. arabiensis and sizes of single-copy genomic regions were estimated by k-mer analysis for k = 19 based on the Illumina short paired-end reads using methodology described in the genome size estimation tutorial [94]. A 19-mer distribution was separated into 3 consecutive ranges that correspond to error sequences, single-copy sequences (i.e., haploid peak), and repeat sequences. The average 19-mer single-copy coverage was estimated by finding the maximum in the haploid peak. After that, the area under the curve was calculated for the whole distribution range except the sequencing error range. To obtain genome length, the obtained area was divided by the coverage calculated in the previous step. The length of the single-copy sequences was calculated in the same manner but only for the single-copy sequence range. The formulas for calculating respective lengths are the following:

$$\begin{equation*} \ {L_g} = \ \mathop \sum \nolimits_{\mathrm{freq}\ = \ \mathrm{Lb}}^{\mathrm{End}} \frac{{\mathrm{freq} \cdot K\left( {\mathrm{freq}} \right)}}{{\mathrm{Cov}}},\ \ {L_{\mathrm{sc}}} = \mathop \sum \nolimits_{\mathrm{freq} = \mathrm{Lb}}^{\mathrm{Rb}} \frac{{\mathrm{freq} \cdot K\left( {\mathrm{freq}} \right)}}{{\mathrm{Cov}}},\ \end{equation*}$$

where L_g is a genome length, L_sc is a single-copy sequence length, Cov is an average 19-mer single-copy sequence coverage, freq is 19-mer frequency, K(freq) is the number of distinct 19-mers with frequency equal to freq (Y-axis), End is the maximal frequency value, and Lb, Rb are borders of the range for haploid peak. For example, the length of the single-copy region for the A. coluzzii genome is calculated as follows:

$$\begin{eqnarray*} \ {L_{\mathrm{sc}\ \mathrm{coluz}\ }} &=& \mathop \sum \nolimits_{\mathrm{freq} = 7}^{110} \frac{{\mathrm{freq} \cdot K\left( {\mathrm{freq}} \right)}}{{54}} = \frac{{11.021 \cdot {{10}^9}}}{{54}} \nonumber \\ &=& 204.1 \cdot {10^6} = 204.1\ \mathrm{Mb}.\ \ \end{eqnarray*}$$

The frequency distribution of 19-mers in all high-quality short reads was computed by Jellyfish (Jellyfish, RRID:SCR_005491) [60]. Racon (Racon, RRID:SCR_017642) [63], Medaka [64], and Nanopolish (Nanopolish, RRID:SCR_016157) [65] were used to correct nucleotide substitutions, insertions, and deletions. Pilon (Pilon, RRID:SCR_014731) [66] was run several times using Illumina reads.

Scaffolding Nanopore contigs using Hi-C data

Hi-C Illumina short paired-end reads were used for genome scaffolding after their quality control was inspected with FastQC. BWA-MEM v0.7.17 [95] and Juicer v1.5.7 (Juicer, RRID:SCR_017226) [67] were run to assess the quality of Hi-C data with respect to the polished Canu contig assemblies of each genome. The 3D-DNA (3D-DNA, RRID:SCR_017227) [47] and SALSA2 [73] software were run to scaffold the Nanopore contigs. Metrics for the original and processed assemblies were computed by QUAST-LG [59]. Visual inspection of the Hi-C contact heat maps was performed. All contigs in the assemblies were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts on the basis of the read-depth analysis. Alignment of contigs to the A. gambiae PEST (AgamP4) assembly was used to obtain information about the distribution of the contigs across the chromosomes.

Chromosome quotient analysis

Because the AgamP4 assembly does not contain chromosome Y, CQ analysis was performed using Illumina reads from female and male mosquito genomes to detect the presence of contigs from the Y chromosome. According to the original definition [76], for a given sequence Si, CQ(Si) = F(Si)/M(Si), where F(Si) is the number of alignments from female sequence data to Si and M(Si) is the number of alignments from male sequence data to Si. Therefore, the CQ method allows for the differentiation of Y sequences from autosome and X sequences. CQ calculation was performed at a 1-kb window for contigs or scaffolds. If the number of male reads was <20, the CQ value of that particular 1-kb window was not used. Contigs or scaffolds with ≥15% of the 1-kb windows showing CQ values <0.1 were considered to be Y-derived and were grouped into a separate scaffold called “chrY.”

Gene mapping and pairwise alignments

The A. coluzzii MOPTI (AcolMOP1) and A. arabiensis DONGOLA (AaraD3) assemblies were validated by comparing them with the existing assembly A. gambiae PEST (AgamP4), representing the most complete chromosome-level anopheline genome assembly known to date. Using NCBI BLAST v2.9.0 (NCBI BLAST, RRID:SCR_004870) [81], a set of known genes (AgamP4.10 gene set) from the AgamP4 assembly was mapped to the new A. coluzzii and A. arabiensis assemblies. The assembly of the A. coluzzii Ngousso strain (AcolN1) from PacBio reads was also used, where appropriate, because AcolN1 consists of contigs rather than scaffolds [36]. Whole-genome pairwise alignment between these assemblies was generated using D-Genies v1.2.0 (D-GENIES, RRID:SCR_018967) [82], SyRi [85], and genoPlotR [84]. The D-Genies v1.2.0 dot-plots and SyRi plots were generated using sequence alignments. The genoPlotR visualization was done using alignments of orthologous genes identified in the A. coluzzii and A. arabiensis assemblies by BLAST of the AgamP4.10 genes.

Data Availability

The raw genomic sequence reads and both genome assemblies underlying this article are available in the NCBI under project accession PRJNA634549 and the GigaScience GigaDB database [96, 97].

Additional Files

Additional File 1. Analysis report for Oxford Nanopore reads produced by Nanostat.

Additional File 2. Histograms of read length after log normalization for Nanopore reads from (a) A. coluzzii and (b) A. arabiensis.

Additional File 3. Plot of the average read quality for Nanopore reads obtained from (a) A. coluzzii and (b) A. arabiensis.

Additional File 4. Alignment depth of Nanopore reads from A. coluzzii (left column) and A. arabiensis (right column) to the A. gambiae (AgamP4) genome.

Additional File 5. Distribution of 19-mers for A. coluzzii (left panel) and A. arabiensis (right panel) computed by Jellyfish.

Additional File 6. Analysis report generated by QUAST-LG for Oxford Nanopore draft assemblies.

Additional File 7. auNG metrics for A. coluzzii and A. arabiensis assemblies.

Additional File 8. Evaluation of draft assembly completeness with BUSCO.

Additional File 9. Evaluation of assembly completeness with BUSCO after polishing steps.

Additional File 10. Report produced by the Juicer tool when aligning the Hi-C data on the Canu contig assemblies.

Additional File 11. Analysis report generated by QUAST-LG for Hi-C scaffolding of Canu contigs.

Additional File 12. Initial heat maps of Hi-C contact information for the A. arabiensis genome assemblies obtained by (a) SALSA 2 from the Canu contig assembly, (b) SALSA 2 from the Canu unitig assembly, (c) 3D-DNA from the Canu contig assembly, and (d) 3D-DNA from the Canu unitig assembly. The heat maps were produced by JBAT.

Additional File 13. Hi-C contact heat map for the 3D-DNA scaffolds of the A. coluzzii assembly before manual correction. The heat map was produced by JBAT.

Additional File 14. Read depth histogram obtained by PurgeHaplotigs for the A. coluzzii (a) and A. arabiensis (b) assemblies. The cut-offs were manually selected (red arrows in the histograms): 30, 78, and 132 for A. сoluzzii and 25, 93, and 160 for A. arabiensis.

Additional File 15. Mapping the AgamP4.10 gene set to the final A. coluzzii and A. arabiensis assemblies.

Additional File 16. Whole-genome pairwise alignment dot-plots (produced by D-Genies v1.2.0) between the scaffolds corresponding to chromosomes X (a–c), 2 (d–f), and 3 (g–i). Alignments between the A. gambiae and A. coluzzii scaffolds, the A. gambiae and A. arabiensis scaffolds, and the A. coluzzii and A. arabiensis scaffolds are shown.

Additional File 17. BUSCO scores for the final assemblies.

Additional File 18. Analysis report generated by QUAST-LG for the final Hi-C–scaffolded Canu contigs.

Additional File 19. Map of the repetitive sequences for the Anopheles assemblies.

Additional File 20. Whole-genome pairwise alignments produced by genoPlotR between chromosomes of A. arabiensis, A. coluzzii,andA. gambiae. Left: AgamP4 and AcolMOP1. Middle: AgamP4 and AaraD3. Right: AcolMOP1 and AaraD3. The inversion breakpoints are shown with small letters.

Additional File 21. Whole-genome pairwise alignments produced by SyRi of A. arabiensis chromosomes (query) to A. gambiae (reference), and A. coluzzii (reference) chromosomes. Left: AgamP4 and AaraD3. Right: AcolMOP1 and AaraD3.

Additional File 22. Genomic coordinates of breakpoint regions and breakpoint-flanking genes of chromosomal rearrangements identified by aligning the AgamP4, AcolMOP1, and AaraD3 assemblies using D-Genies v1.2.0.

Additional File 23. Genomic coordinates of breakpoint regions of chromosomal rearrangements identified by aligning the AgamP4, AcolMOP1, and AaraD3 assemblies using SyRi.

Additional File 24. Pairwise dot-plot alignment between the AaraD3 and AaraD2 (AaraD1 superscaffolded) assemblies produced by D-Genies v1.2.0. Top: Whole-genome pairwise alignment. Bottom: Collinearity in the alignment of the region with the new 2R microinversion.

Additional File 25. Pairwise dot-plot alignment between the AcolMOP1 and reference-guided scaffolded AcolN1 assemblies produced by D-Genies v1.2.0. Left: Whole-genome pairwise alignment. Middle: Alignment of the region with the new 3R translocation. Right: Alignment of the region with the new 3L microinversion.

Additional File 26. Pairwise dot-plot alignment between the AcolMOP1 and AcolM2 assemblies produced by D-Genies v1.2.0. Left: Whole-genome pairwise alignment. Middle: Alignment of the region with the new 3R translocation. Right: Alignment of the region with the new 3L microinversion.

Additional File 27. Pairwise dot-plot alignment between the X chromosomes produced by D-Genies v1.2.0. Left: AcolMOP1 and AgamP4. Middle: AcolMOP1 and AcolN1. Right: AcolMOP1 and AcolM2.

Abbreviations

BLAST: Basic Local Alignment Search Tool; bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; BWA: Burrows-Wheeler Aligner; CQ: chromosome quotient; Gb: gigabase pairs; JBAT: Juicebox Assembly Tools; kb: kilobase pairs; M: million; Mb: megabase pairs; NCBI: National Center for Biotechnology Information; NAIAD: National Institute of Allergy and Infectious Diseases; PacBio: Pacific Biosciences; rDNA: ribosomal DNA; SRA: Sequence Read Archive.

Competing Interests

The authors declare that they have no competing interests.

Funding

Data production and the work of M.A.A., P.A., and I.V.S. were supported by US National Institutes of Health (NIH) grant R21AI135298. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The work of N.A. was supported by the Government of the Russian Federation through the ITMO Fellowship and Professorship Program. The work of A.Z. was supported by JetBrains Research. I.V.S. was also supported by the Fralin Life Sciences Institute and the USDA National Institute of Food and Agriculture Hatch project 223822.

Authors’ Contributions

I.V.S. and M.A.A. conceived and coordinated the project and acquired funding for this study. N.A. secured funding for A.Z. J.L. maintained mosquito colonies and performed Hi-C protocol optimization, Hi-C experiments, and Illumina library preparation for A. arabiensis. A.S., C.C., and Z.T. did genomic DNA isolation and Oxford Nanopore sequencing. Z.T. performed CQ analysis. Under the supervision of M.A.A. and I.V.S., A.Z. and P.A. performed the genome assembly, scaffolding, and validation with the contribution from V.L. N.A. and P.A. supervised A.Z. The manuscript was prepared and approved by all the authors.

ACKNOWLEDGEMENTS

The following reagents were obtained through the Biodefense and Emerging Infections Research Resources Repository, NIAID, NIH: A. arabiensis, Strain DONGOLA 2Ra, 2Rb/b, and 3R, MRA-1235, contributed by Ellen M. Dotson; A. coluzzii, Strain MOPTI, Eggs, MRA-763, contributed by Gregory C. Lanzaro. The authors thank Adam Kai Leung Wong for his invaluable help in installing and troubleshooting various software packages on the Colonial One high-performance computing cluster at the George Washington University. We thank Janet Webster and Kristi DeCourcy for editing the manuscript.

References

Russell

Govella

Azizi

, et al.

Increased proportions of outdoor feeding among residual malaria vector populations following increased use of insecticide-treated nets in rural Tanzania

Malaria J

2011

;

, doi:

10.1186/1475-2875-10-80

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

Petrarca

Beier

Intraspecific chromosomal polymorphism in the Anopheles gambiae complex as a factor affecting malaria transmission in the Kisumu area of Kenya

Am J Trop Med Hyg

1992

;

(

229

–

Main

Lee

Ferguson

, et al.

The genetic basis of host preference and resting behavior in the major African malaria vector, Anopheles arabiensis

PLoS Genet

2016

;

(

e1006303

Davidson

Paterson

Coluzzi

, et al.

The Anopheles gambiae complex

. In:

Wright

Pal

, eds.

Genetics of Insect Vectors of Disease

Amsterdam-London-New York

Elsevier

;

1967

211

–

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Slotman

Della Torre

Powell

The genetics of inviability and male sterility in hybrids between Anopheles gambiae and An. arabiensi s

Genetics

2004

;

167

(

275

–

Bernardini

Galizi

Wunderlich

, et al.

Cross-species Y chromosome function between malaria vectors of the Anopheles gambiae species complex

Genetics

2017

;

207

(

729

–

Fontaine

Pease

Steele

, et al.

Extensive introgression in a malaria vector species complex revealed by phylogenomics

Science

2015

;

347

(

6217

1258524

Thawornwattana

Dalquen

Yang

Coalescent analysis of phylogenomic data confidently resolves the species relationships in the Anopheles gambiae species complex

Mol Biol Evol

2018

;

(

2512

–

Aboagye-Antwi

Alhafez

Weedall

, et al.

Experimental swap of Anopheles gambia e's assortative mating preferences demonstrates key role of X-chromosome divergence island in incipient sympatric speciation

PLoS Genet

2015

;

(

e1005141

10.

Diabate

Dabire

Millogo

, et al.

Evaluating the effect of postmating isolation between molecular forms of Anopheles gambiae (Diptera: Culicidae)

J Med Entomol

2007

;

(

–

11.

Tene Fossog

Ayala

Acevedo

, et al.

Habitat segregation and ecological character displacement in cryptic African malaria mosquitoes

Evol Appl

2015

;

(

326

–

12.

Diabate

Dabire

Kim

, et al.

Larval development of the molecular forms of Anopheles gambiae(Diptera: Culicidae) in different habitats: a transplantation experiment

J Med Entomol

2005

;

(

548

–

13.

Gimonneau

Bouyer

Morand

, et al.

A behavioral mechanism underlying ecological divergence in the malaria mosquito Anopheles gambiae

Behav Ecol

2010

;

(

1087

–

14.

Dao

Yaro

Diallo

, et al.

Signatures of aestivation and migration in Sahelian malaria mosquito populations

Nature

2014

;

516

(

7531

387

–

15.

Tennessen

Ingham

Toe

, et al.

A population genomic unveiling of a new cryptic mosquito taxon within the malaria-transmitting Anopheles gambiae complex

Mol Ecol

2021

;

(

775

–

16.

Lewin

Larkin

Pontius

, et al.

Every genome sequence needs a good map

Genome Res

2009

;

(

1925

–

17.

Thomma

Seidl

Shi-Kunne

, et al.

Mind the gap; seven reasons to close fragmented genome assemblies

Fungal Genet Biol

2016

;

–

18.

Greppi

Laursen

Budelli

, et al.

Mosquito heat seeking is driven by an ancestral cooling receptor

Science

2020

;

367

(

6478

681

–

19.

Saveer

Pitts

Ferguson

, et al.

Characterization of chemosensory responses on the labellum of the malaria vector mosquito

Sci Rep

2018

;

(

5656

20.

Hill

Fox

Pitts

, et al.

G protein-coupled receptors in Anopheles gambiae

Science

2002

;

298

(

5591

176

–

21.

Mancini

Tammaro

Baldini

, et al.

Molecular evolution of a gene cluster of serine proteases expressed in the Anopheles gambiae female reproductive tract

BMC Evol Biol

2011

;

22.

Mancini

Baldini

Tammaro

, et al.

Molecular characterization and evolution of a gene family encoding male-specific reproductive proteins in the African malaria vectorAnopheles gambiae

BMC Evol Biol

2011

;

292

23.

Christophides

Zdobnov

Barillas-Mury

, et al.

Immunity-related genes and gene families in Anopheles gambiae

Science

2002

;

298

(

5591

159

–

24.

Adolfi

Poulton

Anthousi

, et al.

Functional genetic validation of key genes conferring insecticide resistance in the major African malaria vector, Anopheles gambiae

Proc Natl Acad Sci U S A

2019

;

116

(

25764

–

25.

Liu

Tsujimoto

Cha

, et al.

Aquaporin water channel AgAQP1 in the malaria vector mosquito Anopheles gambiae during blood feeding and humidity adaptation

Proc Natl Acad Sci U S A

2011

;

108

(

6062

–

26.

Holt

Subramanian

Halpern

, et al.

The genome sequence of the malaria mosquito Anopheles gambiae

Science

2002

;

298

(

5591

129

–

27.

Sharakhova

Hammond

Lobo

, et al.

Update of the Anopheles gambiae PEST genome assembly

Genome Biol

2007

;

(

28.

Cheng

White

Kamdem

, et al.

Ecological genomics of Anopheles gambiae along a latitudinal cline: a population-resequencing approach

Genetics

2012

;

190

(

1417

–

29.

Caputo

Pichler

Mancini

, et al.

The last bastion? X chromosome genotyping of Anopheles gambiae species pair males from a hybrid zone reveals complex recombination within the major candidate ‘genomic island of speciation’

Mol Ecol

2016

;

(

5719

–

30.

Miles

Harding

Botta

, et al.

Genetic diversity of the African malaria vector Anopheles gambiae

Nature

2017

;

552

(

7683

–

100

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

31.

Schmidt

Lee

Collier

, et al.

Transcontinental dispersal of Anopheles gambiae occurred from West African origin via serial founder events

Commun Biol

2019

;

473

32.

Deitz

Athrey

Jawara

, et al.

Genome-wide divergence in the West-African malaria vector Anopheles melas

G3 (Bethesda)

2016

;

(

2867

–

33.

Neafsey

Waterhouse

Abai

, et al.

Highly evolvable malaria vectors: The genomes of 16 Anopheles mosquitoes

Science

2015

;

347

(

6217

1258522

34.

Lawniczak

Emrich

Holloway

, et al.

Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences

Science

2010

;

330

(

6003

512

–

35.

Waterhouse

Aganezov

Anselmetti

, et al.

Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies

BMC Biol

2020

;

(

), doi:

10.1186/s12915-019-0728-3

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

36.

Kingan

Heaton

Cudini

, et al.

A high-quality de novo genome assembly from a single mosquito using PacBio sequencing

Genes (Basel)

2019

;

(

), doi:

10.3390/genes10010062

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

37.

Richards

Murali

Best practices in insect genome sequencing: what works and what doesn't

Curr Opin Insect Sci

2015

;

, doi:

10.1016/j.cois.2015.02.013

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

38.

Worley

Richards

Rogers

The value of new genome references

Exp Cell Res

2017

;

358

(

433

–

39.

Bickhart

Rosen

Koren

, et al.

Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome

Nat Genet

2017

;

(

643

–

40.

Goodwin

Gurtowski

Ethe-Sayers

, et al.

Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

Genome Res

2015

;

(

1750

–

41.

Russell

Campos

Stone

, et al.

Unbiased strain-typing of arbovirus directly from mosquitoes using Nanopore sequencing: A field-forward biosurveillance protocol

Sci Rep

2018

;

(

5417

42.

Peritz

Paoli

Chen

, et al.

Serogroup-level resolution of the “Super-7” Shiga toxin-producing Escherichia coliusing nanopore single-molecule DNA sequencing

Anal Bioanal Chem

2018

;

410

(

5439

–

43.

Michael

Jupe

Bemm

, et al.

High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell

Nat Commun

2018

;

(

541

44.

Miller

Staber

Zeitlinger

, et al.

Highly contiguous genome assemblies of 15 Drosophilaspecies generated using nanopore sequencing

G3 (Bethesda)

2018

;

(

3131

–

45.

Ebbert

MTW

Farrugia

Sens

, et al.

Long-read sequencing across the C9orf72 ‘GGGGCC’ repeat expansion: implications for clinical use and genetic discovery efforts in human disease

Mol Neurodegener

2018

;

(

46.

Peichel

Sullivan

Liachko

, et al.

Improvement of the threespine stickleback genome using a Hi-C-based proximity-guided assembly

J Hered

2017

;

108

(

693

–

700

47.

Dudchenko

Batra

Omer

, et al.

De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds

Science

2017

;

356

(

6333

–

48.

Matthews

Dudchenko

Kingan

, et al.

Improved reference genome of Aedes aegypti informs arbovirus vector control

Nature

2018

;

563

(

7732

501

–

49.

Ghurye

Koren

Small

, et al.

A chromosome-scale assembly of the major African malaria vector Anopheles funestus

Gigascience

2019

;

(

), doi:

10.1093/gigascience/giz063

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

50.

Compton

Liang

Chen

, et al.

The beginning of the end: a chromosomal assembly of the New World malaria mosquito ends with a novel telomere

G3 (Bethesda)

2020

;

(

3811

–

51.

Bracewell

Tran

Chatla

, et al.

Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

G3 (Bethesda)

2020

;

(

891

–

52.

Cao

Nanopore sequencing and Hi-C scaffolding provide insight into the evolutionary dynamics of transposable elements and piRNA production in wild strains of Drosophila melanogaster

Nucleic Acids Res

2020

;

(

290

–

303

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

53.

Minimap2: pairwise alignment for nucleotide sequences

Bioinformatics

2018

;

(

3094

–

100

54.

Davis

van Dongen

Abreu-Goodger

, et al.

Kraken: a set of tools for quality control and analysis of high-throughput sequence data

Methods

2013

;

(

–

55.

Ruan

Fast and accurate long-read assembly with wtdbg2

Nat Methods

2020

;

(

155

–

56.

Kolmogorov

Yuan

Lin

, et al.

Assembly of long, error-prone reads using repeat graphs

Nat Biotechnol

2019

;

(

540

57.

Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Bioinformatics

2016

;

(

2103

–

58.

Koren

Walenz

Berlin

, et al.

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Genome Res

2017

;

(

722

–

59.

Gurevich

Saveliev

Vyahhi

, et al.

QUAST: quality assessment tool for genome assemblies

Bioinformatics

2013

;

(

1072

–

60.

Marcais

Kingsford

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Bioinformatics

2011

;

(

764

–

61.

Simao

Waterhouse

Ioannidis

, et al.

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Bioinformatics

2015

;

(

3210

–

62.

Seppey

Manni

Zdobnov

BUSCO: assessing genome assembly and annotation completeness

Methods Mol Biol

2019

;

1962

227

–

63.

Vaser

Sovic

Nagarajan

, et al.

Fast and accurate de novo genome assembly from long uncorrected reads

Genome Res

2017

;

(

737

–

64.

medaka: Sequence correction provided by ONT Research

https://github.com/nanoporetech/medaka. Accessed 15 September 2020

65.

Jain

Koren

Miga

, et al.

Nanopore sequencing and assembly of a human genome with ultra-long reads

Nat Biotechnol

2018

;

(

338

–

66.

Walker

Abeel

Shea

, et al.

Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement

PLoS One

2014

;

(

e112963

67.

Durand

Shamim

Machol

, et al.

Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments

Cell Syst

2016

;

(

–

68.

Kaplan

Dekker

High-throughput genome scaffolding from in vivo DNA interaction frequency

Nat Biotechnol

2013

;

(

1143

–

69.

Burton

Adey

Patwardhan

, et al.

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions

Nat Biotechnol

2013

;

(

1119

–

70.

Marie-Nelly

Marbouty

Cournac

, et al.

High-quality genome (re) assembly using chromosomal contact data

Nat Commun

2014

;

, doi:

10.1038/ncomms6695

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

71.

Putnam

O'Connell

Stites

, et al.

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Genome Res

2016

;

(

342

–

72.

Renschler

Richard

Valsecchi

CIK

, et al.

Hi-C guided assemblies reveal conserved regulatory topologies on X and autosomes despite extensive genome shuffling

Genes Dev

2019

;

(

21–22

1591

–

612

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

73.

Ghurye

Rhie

Walenz

, et al.

Integrating Hi-C links with assembly graphs for chromosome-scale assembly

PLoS Comput Biol

2019

;

(

e1007273

74.

Dudchenko

Shamim

Batra

, et al.

The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000

bioRxiv

2018

https://doi.org/10.1101/254797

Google Scholar

OpenURL Placeholder Text

WorldCat

75.

Roach

Schmidt

Borneman

Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies

BMC Bioinformatics

2018

;

(

460

76.

Hall

Timoshevskiy

, et al.

Six novel Y chromosome genes in Anophelesmosquitoes discovered by independently sequencing males and females

BMC Genomics

2013

;

273

77.

Hall

Papathanos

Sharma

, et al.

Radical remodeling of the Y chromosome in a recent radiation of malaria mosquitoes

Proc Natl Acad Sci U S A

2016

;

113

(

E2114

–

78.

Krzywinski

Sangare

Besansky

Satellite DNA from the Y chromosome of the malaria vector Anopheles gambiae

Genetics

2005

;

169

(

185

–

79.

Melters

Bradnam

Young

, et al.

Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution

Genome Biol

2013

;

(

R10

80.

Besansky

Powell

Reassociation kinetics of Anopheles gambiae (Diptera: Culicidae) DNA

J Med Entomol

1992

;

(

125

–

81.

Altschul

Gish

Miller

, et al.

Basic Local Alignment Search Tool

J Mol Biol

1990

;

215

(

403

–

82.

Cabanettes

Klopp

D-GENIES: dot plot large genomes in an interactive, efficient and simple way

PeerJ

2018

;

e4958

83.

Sharma

Kinney

Timoshevskiy

, et al.

Structural variation of the X chromosome heterochromatin in the Anopheles gambiaecomplex

Genes (Basel)

2020

;

(

), doi:

10.3390/genes11030327

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

84.

Guy

Kultima

Andersson

genoPlotR: comparative gene and genome visualization in R

Bioinformatics

2010

;

(

2334

–

85.

Goel

Sun

Jiao

, et al.

SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies

Genome Biol

2019

;

(

277

86.

Coluzzi

Sabatini

della Torre

, et al.

A polytene chromosome analysis of the Anopheles gambiaespecies complex

Science

2002

;

298

(

5597

1415

–

87.

Lobo

Sangare

Regier

, et al.

Breakpoint structure of the Anopheles gambiae 2Rb chromosomal inversion

Malar J

2010

;

293

88.

Sharakhov

White

Sharakhova

, et al.

Breakpoint structure reveals the unique origin of an interspecific chromosomal inversion (2La) in the Anopheles gambiae complex

Proc Natl Acad Sci U S A

2006

;

103

(

6258

–

89.

Sharakhova

George

Brusentsova

, et al.

Genome mapping and characterization of the Anopheles gambiaeheterochromatin

BMC Genomics

2010

;

459

90.

Lukyanchikova

Nuriddinov

Belokopytova

, et al.

Anophelesmosquitoes revealed new principles of 3D genome organization in insects

bioRxiv

2020

;

https://doi.org/10.1101/2020.05.26.114017

Google Scholar

OpenURL Placeholder Text

WorldCat

91.

De Coster

D'Hert

Schultz

, et al.

NanoPack: visualizing and processing long-read sequencing data

Bioinformatics

2018

;

(

2666

–

92.

Keller

Kollmar

Stanke

, et al.

A novel hybrid gene prediction method employing protein multiple sequence alignments

Bioinformatics

2011

;

(

757

–

93.

Chen

Zhou

Chen

, et al.

fastp: an ultra-fast all-in-one FASTQ preprocessor

Bioinformatics

2018

;

(

i884

–

i90

94.

Genome Size Estimation Tutorial

https://bioinformatics.uconn.edu/genome-size-estimation-tutorial. Accessed 28 December 2020

95.

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

arXiv

2013

:1303.3997.

OpenURL Placeholder Text

WorldCat

96.

Zamyatin

Avdeyev

Liang

, et al. Supporting data for the chromosome-level genome assembly of the malaria vector Anopheles arabiensis.

GigaScience Database

2021

10.5524/100860

97.

Zamyatin

Avdeyev

Liang

, et al. Supporting data for the chromosome-level genome assembly of the malaria vectors Anopheles coluzzii.

GigaScience Database

2021

10.5524/100861

Author notes

These authors contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
March 2021	858
April 2021	289
May 2021	252
June 2021	193
July 2021	197
August 2021	157
September 2021	188
October 2021	181
November 2021	147
December 2021	127
January 2022	155
February 2022	151
March 2022	150
April 2022	140
May 2022	117
June 2022	94
July 2022	103
August 2022	102
September 2022	160
October 2022	147
November 2022	109
December 2022	81
January 2023	124
February 2023	75
March 2023	67
April 2023	79
May 2023	102
June 2023	78
July 2023	86
August 2023	70
September 2023	39
October 2023	69
November 2023	70
December 2023	59
January 2024	69
February 2024	72
March 2024	73
April 2024	142
May 2024	84
June 2024	146
July 2024	67
August 2024	46
September 2024	46
October 2024	49
November 2024	63
December 2024	41
January 2025	36
February 2025	53
March 2025	50
April 2025	49
May 2025	44

Article Contents

Chromosome-level genome assemblies of the malaria vectors Anopheles coluzzii and Anopheles arabiensis Open Access

Abstract

Introduction and Background

Data Description

Nanopore sequencing of A. coluzzii and A. arabiensis genomes

Assembly of the A. coluzzii and A. arabiensis genomes

Polishing of the A. coluzzii and A. arabiensis assemblies

Hi-C scaffolding of the A. coluzzii and A. arabiensis contigs

Validation and quality evaluation of the genome assemblies

Genome rearrangements in A. coluzzii and A. arabiensis

Conclusion

Materials and Methods

Mosquito colony maintenance

Mosquito sample collection for Oxford Nanopore sequencing

Genomic DNA isolation

Oxford Nanopore sequencing

Hi-C library preparation and sequencing

Quality control of Nanopore reads

Nanopore sequence assembly

Genome size estimation and polishing the genome assemblies

Scaffolding Nanopore contigs using Hi-C data

Chromosome quotient analysis

Gene mapping and pairwise alignments

Data Availability

Additional Files

Abbreviations

Competing Interests

Funding

Authors’ Contributions

ACKNOWLEDGEMENTS

References

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Chromosome-level genome assemblies of the malaria vectors Anopheles coluzzii and Anopheles arabiensis