-
PDF
- Split View
-
Views
-
Cite
Cite
Anthony J Aylward, Semar Petrus, Allen Mamerto, Nolan T Hartwick, Todd P Michael, PanKmer: k-mer-based and reference-free pangenome analysis, Bioinformatics, Volume 39, Issue 10, October 2023, btad621, https://doi.org/10.1093/bioinformatics/btad621
- Share Icon Share
Abstract
Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence–absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be “anchored” in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g. identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias.
PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as Gitlab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/.
1 Introduction
Pangenomes consolidate genomic sequence data from multiple individual organisms into a single data structure representing the genomes of a population, species, or clade. Pangenomics was first applied to microbes, but has grown to include a wide variety of eukaryotes (Medini et al. 2005, Golicz et al. 2020, Li et al. 2022b). The advent of plant and animal pangenomes coincided with the reduced cost of NGS technologies. Crop plants were an early subject of pangenome studies, including soybean, brassica, and wheat species (Li et al. 2014, Golicz et al. 2016, Montenegro et al. 2017). These pangenomes focused on genic sequence only, and they were built by collecting short-read whole-genome sequencing datasets and mapping to a reference, allowing relatively simple if biased assembly of multiple individual genomes. Improvements in long-read sequencing technology and assembly methods have enabled the construction of dozens or hundreds of high-quality genomes for a single plant or animal species. This has facilitated a wave of pangenome studies based on collections of entire assembled genomes (Gui et al. 2022, Li et al. 2022a, Montenegro et al. 2022, Shang et al. 2022, Tang et al. 2022, Tong et al. 2022, Yang et al. 2022).
The transition from representation and analysis of single genomes to pangenomes presents significant challenges. Most studied populations include extensive structural variation (SV), which means only a fraction of genes or intergenic sequences are present in all individuals. This fraction is referred to as the “core” genome, while the remainder is variously called “dispensable,” “variable,” or “accessory” (Golicz et al. 2020, Lei et al. 2021, Aggarwal et al. 2022). Furthermore, the core genome cannot be defined by a simple linear coordinate system. A useful pangenomic dataset must identify the core genome and facilitate analysis of the variable regions.
Currently, the dominant methodology is pangenome sequence graphs, which consist of nodes representing segments of genomic sequence connected by edges which allow any individual genome to be traced as a path through the graph (Hickey et al. 2020, Li et al. 2020, Baaijens et al. 2022, Bradbury et al. 2022, Montenegro et al. 2022). These graphs have replaced “iterative assembly” methods to advance our understanding of genomic diversity, especially of SV, and demonstrated utility for crop breeding (Ruperao et al. 2021, Gui et al. 2022, Montenegro et al. 2022, Shang et al. 2022). However, their construction is far from a solved problem and generally relies either on computationally expensive multiple genome alignment or on biased alignment to a single reference. Their application to highly complex eukaryotic genomes, such as those of plants, is limited (Bayer et al. 2020, Danilevicz et al. 2020, Khan et al. 2020). Such graphs may be limited to as few as a dozen input genomes (Zhang et al. 2021, Li et al. 2022a). Other methods gain efficiency by focusing on specific categories of variants (Li et al. 2020, Bradbury et al. 2022).
Several methods have adopted a format based on a De Bruijn graph representing overlaps of k-mers (genomic substrings of length k) rather than the sequence graph (Sheikhizadeh et al. 2016, Almodaresi et al. 2018, Holley and Melsted 2020, Jonkheer et al. 2022). One such method is PanTools, which defines the pangenome as a comprehensive representation of multiple annotated genomes and provides functions enabling gene-level analysis, sequence alignment, and phylogenomics.
k-mer decomposition is an alternative to graph-based pangenomes. In this framework, the space of genomic sequences across a population is represented by a set of k-mers. Each individual is represented by a subset: all k-mers observed in a single genome. This view does not depend on any coordinate system and therefore sidesteps the difficulties of multiple genome alignment and graph construction. Kmer-db demonstrated the use of k-mer decomposition for efficient analysis of microbial genomes (Deorowicz et al. 2019). k-mers have many applications in genomics and pangenomics, and they have recently been used as markers for GWAS (Holley et al. 2016, Sheikhizadeh et al. 2016, Aun et al. 2018, Khan et al. 2020, Voichek and Weigel 2020, Gupta 2021, Jayakodi et al. 2021, Jonkheer et al. 2022, Karikari et al. 2022, 2023).
Here we present PanKmer, a non-graphical k-mer decomposition method designed to efficiently represent and analyze many forms of variation in large pangenomic datasets, with no reliance on a reference genome and no assumption of annotation.
2 Features and implementation
2.1 k-mer index
The foundational component of PanKmer’s pangenome representation is the k-mer index. It is constructed from a set G of input genomes by decomposing them into a set K of all unique canonical k-mers and noting for each input genome which k-mers are present and which are absent (Fig. 1A). This is similar to the content of Kmer-db’s k-mer database (Deorowicz et al. 2019). Each k-mer is considered equivalent to its reverse complement and is recorded in canonical form. The k-mer index X is then a by table of binary values indicating presence/absence of each canonical k-mer in each genome. The index can integrate an arbitrary number of genomes from one or several species, requires no reference genome, and enables a range of downstream analyses.

PanKmer enables the rapid estimation of relatedness across the pangenome as well as analysis of specific loci. (A) Schematic of procedure for constructing the k-mer index. In the example, each genome G0–G2 is decomposed into canonical 3-mers. Each 3-mer is equivalent to its reverse complement, and the lexicographically first is the canonical form. Each 3-mer is assigned an integer value, and its presence/absence is recorded for each genome in the index. (B) Relatedness heatmap of Typha pangenome, ANI values shown. (C) Genome anchoring plots of representative contigs in TD01. Average k-mer conservation of 100-kb bins shown, where k-mer conservation is the fraction of TD or TL genomes that include each k-mer along the contig.
The index is constructed by scanning all input genomes sequentially, recording newly encountered k-mers, and updating presence/absence values with each new genome scanned. To make efficient use of all available CPU’s, this process is parallelized across the theoretical k-mer space. The set of all possible canonical k-mers has size , which is divided among n segments . Index construction is then divided into n subprocesses, each of which constructs a sub-index skipping k-mers not included in . To efficiently store and access the k-mer index on disk, each k-mer is converted to an integer value.
The resultant index is robust to varying contiguity in the input genomes, so chromosome-level assemblies can be directly compared to unscaffolded contigs or unaligned reads. The implementation presented here uses , chosen for three reasons. First, 31-mers are short enough to be encoded as 64-bit integers (Rahman et al. 2018). Second, they are long enough to impose a low rate of non-unique k-mers occurring by chance (Sheikhizadeh et al. 2016). Finally, 31-mers have been used successfully to define variation in previous studies (Rahman et al. 2018, Voichek and Weigel 2020).
2.2 Adjacency matrix
Once the k-mer index is constructed, each input genome is represented by binary values representing presence or absence of each k-mer in K. This provides a natural means of calculating pairwise similarity/adjacency values for the input genomes. PanKmer includes a function to calculate the number of shared k-mers between all pairs of input genomes and return them as an adjacency matrix. Subsequently, the adjacency values can be used to perform a hierarchical clustering of input genomes and plot adjacency values as a heatmap. The adjacency values may also be converted to Jaccard, QV as described in MerQury (Rhie et al. 2020), a symmetric version of QV, or average nucleotide identity (ANI) (Fig. 1B).
2.3 Genome anchoring
While the k-mer index does not rely on any specified reference genome, it can be used to contextualize individual sequences. Given a sufficient k-mer length (e.g. the default ), we can assume that each k-mer present in an individual genome occurs approximately once in . Therefore, we can quantify variation across any locus in an “anchor” genome by walking along the sequence, checking the k-mer that corresponds to each position, and calculating the fraction of genomes in G which share that k-mer. We refer to this fraction as the “k-mer conservation” value at each position. High k-mer conservation values indicate core loci which are conserved in many individuals across the pangenome, while low values indicate variable loci which are present only in and a small number of other genomes. Hence, core sequences will have high k-mer conservation levels in all target genomes, while variable sequences will have relatively lower levels in each genome that features them.
3 Results
To demonstrate the utility of PanKmer, we constructed a cross-species pangenome of cattail downloaded from NCBI SRA: 10 Typha domingensis (TD) and 3 Typha latifolia (TL) genomes (Fig. 1B) (Supplementary Table S1). Three of the TD genomes (TD01, TD22, TD23) were highly heterozygous (Supplementary Fig. S1 and Table 1). We used k-mer profiles to compute two measures of adjacency, the number of shared k-mers and ANI (Fig. 1B). The three TL genomes were highly similar to one another, with an average ANI of 99.85% (186M shared k-mers) for TL–TL comparisons, while the average ANI of TD–TD comparisons was 98.94% (151M shared k-mers) (Supplementary Tables S2 and S3). The average ANI of TL–TD comparisons was 96.14% (64M shared k-mers). We also constructed single-species pangenomes of T.domingensis and T.latifolia (Supplementary Fig. S2 and Tables S4–S7). One TD genome, TD01, was an outlier relative to other TD, with ANI of only 98.03% (127M shared k-mers) on average. Conversely, TD01 showed relatively high ANI with TL genomes, averaging 98.60% (148M shared k-mers). To inspect TD01 more closely, we calculated average k-mer conservation in 100 kb bins across its assembled contigs and observed a mixture of TD and TL sequences. A close examination of variability anchored in TD01 revealed the presence of large and small introgressions (Fig. 1C and Supplementary Fig. S5). High heterozygosity together with the presence of introgressions suggest TD01 is the descendant of a recent TD–TL hybridization event.
Name . | Predicted . | Heterozygosity . | Contig N50 . | GC content . |
---|---|---|---|---|
. | genome . | (%) . | (Mb) . | (%) . |
. | size (Mb) . | . | . | . |
TD01 | 214 | 4.12 | 1.0 | 37.7 |
TD03 | 222 | 0.13 | 11.5 | 37.6 |
TD04 | 226 | 0.45 | 4.8 | 37.7 |
TD07 | 263 | 0.14 | 7.7 | 37.6 |
TD08 | 213 | 0.05 | 13.8 | 37.7 |
TD12 | 224 | 0.22 | 6.1 | 37.5 |
TD22 | 218 | 2.60 | 1.0 | 37.8 |
TD23 | 213 | 2.60 | 0.7 | 37.8 |
TD25 | 210 | 0.12 | 8.3 | 37.7 |
TD36 | 215 | 0.15 | 8.9 | 37.6 |
TL01 | 234 | 0.07 | 14.5 | 37.9 |
TL05 | 218 | 0.07 | 14.5 | 37.8 |
TL17 | 215 | 0.07 | 13.5 | 37.7 |
Name . | Predicted . | Heterozygosity . | Contig N50 . | GC content . |
---|---|---|---|---|
. | genome . | (%) . | (Mb) . | (%) . |
. | size (Mb) . | . | . | . |
TD01 | 214 | 4.12 | 1.0 | 37.7 |
TD03 | 222 | 0.13 | 11.5 | 37.6 |
TD04 | 226 | 0.45 | 4.8 | 37.7 |
TD07 | 263 | 0.14 | 7.7 | 37.6 |
TD08 | 213 | 0.05 | 13.8 | 37.7 |
TD12 | 224 | 0.22 | 6.1 | 37.5 |
TD22 | 218 | 2.60 | 1.0 | 37.8 |
TD23 | 213 | 2.60 | 0.7 | 37.8 |
TD25 | 210 | 0.12 | 8.3 | 37.7 |
TD36 | 215 | 0.15 | 8.9 | 37.6 |
TL01 | 234 | 0.07 | 14.5 | 37.9 |
TL05 | 218 | 0.07 | 14.5 | 37.8 |
TL17 | 215 | 0.07 | 13.5 | 37.7 |
Name . | Predicted . | Heterozygosity . | Contig N50 . | GC content . |
---|---|---|---|---|
. | genome . | (%) . | (Mb) . | (%) . |
. | size (Mb) . | . | . | . |
TD01 | 214 | 4.12 | 1.0 | 37.7 |
TD03 | 222 | 0.13 | 11.5 | 37.6 |
TD04 | 226 | 0.45 | 4.8 | 37.7 |
TD07 | 263 | 0.14 | 7.7 | 37.6 |
TD08 | 213 | 0.05 | 13.8 | 37.7 |
TD12 | 224 | 0.22 | 6.1 | 37.5 |
TD22 | 218 | 2.60 | 1.0 | 37.8 |
TD23 | 213 | 2.60 | 0.7 | 37.8 |
TD25 | 210 | 0.12 | 8.3 | 37.7 |
TD36 | 215 | 0.15 | 8.9 | 37.6 |
TL01 | 234 | 0.07 | 14.5 | 37.9 |
TL05 | 218 | 0.07 | 14.5 | 37.8 |
TL17 | 215 | 0.07 | 13.5 | 37.7 |
Name . | Predicted . | Heterozygosity . | Contig N50 . | GC content . |
---|---|---|---|---|
. | genome . | (%) . | (Mb) . | (%) . |
. | size (Mb) . | . | . | . |
TD01 | 214 | 4.12 | 1.0 | 37.7 |
TD03 | 222 | 0.13 | 11.5 | 37.6 |
TD04 | 226 | 0.45 | 4.8 | 37.7 |
TD07 | 263 | 0.14 | 7.7 | 37.6 |
TD08 | 213 | 0.05 | 13.8 | 37.7 |
TD12 | 224 | 0.22 | 6.1 | 37.5 |
TD22 | 218 | 2.60 | 1.0 | 37.8 |
TD23 | 213 | 2.60 | 0.7 | 37.8 |
TD25 | 210 | 0.12 | 8.3 | 37.7 |
TD36 | 215 | 0.15 | 8.9 | 37.6 |
TL01 | 234 | 0.07 | 14.5 | 37.9 |
TL05 | 218 | 0.07 | 14.5 | 37.8 |
TL17 | 215 | 0.07 | 13.5 | 37.7 |
To explore suitability for larger eukaryote genomes and sample sizes, we benchmarked PanKmer on several published pangenomes and super-pangenomes, including Solanum, Zea, Homo sapiens, Arabidopsisthaliana (Table 2 and Supplementary Fig. S7 and Table S12) (Alonso-Blanco et al. 2016, Woodhouse et al. 2021, Montenegro et al. 2022, Liao et al. 2023). PanKmer successfully built k-mer indexes for all input pangenomes (Supplementary Fig. S8). We compared the performance of PanKmer to Kmer-db (Supplementary Fig. S9 and Table S13). We found that PanKmer had a smaller memory footprint than Kmer-db, due in part to its ability to divide k-mer decomposition into multiple rounds (Supplementary Material).
Benchmarking results of PanKmer configured for moderate memory usage and moderate runtime (wall clock time) (16 rounds of k-mer decomposition).
Clade . | No. of genomes . | Peak memory (GB) . | Time (min) . |
---|---|---|---|
Typha | 4 | 6 | 5.56 |
Typha | 13 | 8 | 18.61 |
Solanum | 4 | 14 | 16.26 |
Solanum | 16 | 20 | 62.87 |
Solanum | 46 | 20 | 215.19 |
Zea | 4 | 20 | 50.12 |
Zea | 16 | 71 | 239.38 |
Zea | 54 | 80 | 783.1 |
H.sapiens | 4 | 34 | 66.45 |
H.sapiens | 16 | 35 | 248.5 |
H.sapiens | 64 | 38 | 707.39 |
H.sapiens | 94 | 40 | 1364.33 |
A.thaliana | 4 | 6 | 2.08 |
A.thaliana | 16 | 6 | 7.98 |
A.thaliana | 64 | 7 | 35.9 |
A.thaliana | 256 | 7 | 142.93 |
A.thaliana | 1135 | 15 | 648.9 |
Clade . | No. of genomes . | Peak memory (GB) . | Time (min) . |
---|---|---|---|
Typha | 4 | 6 | 5.56 |
Typha | 13 | 8 | 18.61 |
Solanum | 4 | 14 | 16.26 |
Solanum | 16 | 20 | 62.87 |
Solanum | 46 | 20 | 215.19 |
Zea | 4 | 20 | 50.12 |
Zea | 16 | 71 | 239.38 |
Zea | 54 | 80 | 783.1 |
H.sapiens | 4 | 34 | 66.45 |
H.sapiens | 16 | 35 | 248.5 |
H.sapiens | 64 | 38 | 707.39 |
H.sapiens | 94 | 40 | 1364.33 |
A.thaliana | 4 | 6 | 2.08 |
A.thaliana | 16 | 6 | 7.98 |
A.thaliana | 64 | 7 | 35.9 |
A.thaliana | 256 | 7 | 142.93 |
A.thaliana | 1135 | 15 | 648.9 |
Benchmarking results of PanKmer configured for moderate memory usage and moderate runtime (wall clock time) (16 rounds of k-mer decomposition).
Clade . | No. of genomes . | Peak memory (GB) . | Time (min) . |
---|---|---|---|
Typha | 4 | 6 | 5.56 |
Typha | 13 | 8 | 18.61 |
Solanum | 4 | 14 | 16.26 |
Solanum | 16 | 20 | 62.87 |
Solanum | 46 | 20 | 215.19 |
Zea | 4 | 20 | 50.12 |
Zea | 16 | 71 | 239.38 |
Zea | 54 | 80 | 783.1 |
H.sapiens | 4 | 34 | 66.45 |
H.sapiens | 16 | 35 | 248.5 |
H.sapiens | 64 | 38 | 707.39 |
H.sapiens | 94 | 40 | 1364.33 |
A.thaliana | 4 | 6 | 2.08 |
A.thaliana | 16 | 6 | 7.98 |
A.thaliana | 64 | 7 | 35.9 |
A.thaliana | 256 | 7 | 142.93 |
A.thaliana | 1135 | 15 | 648.9 |
Clade . | No. of genomes . | Peak memory (GB) . | Time (min) . |
---|---|---|---|
Typha | 4 | 6 | 5.56 |
Typha | 13 | 8 | 18.61 |
Solanum | 4 | 14 | 16.26 |
Solanum | 16 | 20 | 62.87 |
Solanum | 46 | 20 | 215.19 |
Zea | 4 | 20 | 50.12 |
Zea | 16 | 71 | 239.38 |
Zea | 54 | 80 | 783.1 |
H.sapiens | 4 | 34 | 66.45 |
H.sapiens | 16 | 35 | 248.5 |
H.sapiens | 64 | 38 | 707.39 |
H.sapiens | 94 | 40 | 1364.33 |
A.thaliana | 4 | 6 | 2.08 |
A.thaliana | 16 | 6 | 7.98 |
A.thaliana | 64 | 7 | 35.9 |
A.thaliana | 256 | 7 | 142.93 |
A.thaliana | 1135 | 15 | 648.9 |
4 Conclusion
PanKmer enables k-mer-based analysis of pangenome datasets. It accepts as input a collection of genome assemblies or unaligned reads in FASTA format, and produces a k-mer index which is similar to the database of kmer-db (Deorowicz et al. 2019). The index is agnostic to the contiguity of assemblies. A powerful feature of PanKmer is the ability to anchor the index in any individual genome and identify core or variable loci. This allows users to explore the full scope of genetic variation, without incurring bias from the choice of a single reference. PanKmer includes tools for downstream processing of the index which provide data visualizations and biological insights, such as identifying a hybridization event between T.latifolia and T.domingensis.
The primary advantage of PanKmer over other pangenome analysis tools is its ability to capture all forms of presence–absence variation, including SNPs, INDELs, SVs, and any variant that adds or removes a k-mer from the genome. Our reference-free and alignment-free algorithm is also more computationally tractable than graph-based methods. On the other hand, PanKmer is limited by inability to detect copy number variants in repetitive sequences (Supplementary Fig. S6), and by the loss of spatial relationships between k-mers in the index. However, their spatial context can be rescued by projecting the index onto an anchor genome. Currently, PanKmer does not have genotyping functions, but these are planned for future releases.
Acknowledgements
We thank members of the Typha genomes team specifically Suzanne Thomas, James J. La Clair, and Justin Pacheco from the Noel lab, Bradley W. Abramson, Kelly Colt and Mariele Lensink from the Michael lab, Theryn K. Henkel, Eva Hillman, and Alexander S. Kolker. The plant collection, propagation and genome work were funded by the following awards to Joseph P. Noel, the Howard Hughes Medical Institute, the Salk Institute Arthur and Julie Woodrow Chair, the Chambers Medical Foundation, the Kitzmiller-Moran Challenge Funds, and the Joe W. and Dorothy Dorsett Brown Foundation. We also thank Edward S. Buckler, Bethany F. Econopouly, Zachary R. Miller, Peter J. Bradbury, and Matt Wiese of the United States Department of Agriculture-Agricultural Research Service and Institute for Genomic Diversity at Cornell University for helpful discussion and collaboration on software features and implementation.
Author contributions
A.J.A.: software (main developer, testing), writing (original draft, review and editing) funding acquisition. S.P.: software (original developer), writing (review and editing). A.M.: software (pipeline management, testing), writing (review and editing). N.T.H.: software (optimization). T.P.M.: supervision, funding acquisition, writing (review and editing).
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported by Bill & Melinda Gates Foundation (BMGF) (INV-040541 to T.P.M.); and the Tang Genomics Fund.
Data availability
The raw reads have been deposited in the NCBI short read archive (SRA) under BioProject PRJNA742003. Accession numbers for individual experiments are provided in the Supplementary Material. The genome assemblies are available for download from Michael lab AWS storage at: https://salk-tm-pub.s3.us-west-2.amazonaws.com/pub-supplementary/PRJNA742003-ASSEMBL.tar. Complete tutorials and documentation are available in the Supplementary Material and online: https://salk-tm.gitlab.io/pankmer/.