HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

Abstract

Motivation

In diploid organisms, phasing is the problem of assigning the alleles at heterozygous variants to one of two haplotypes. Reads from PacBio HiFi sequencing provide long, accurate observations that can be used as the basis for both calling and phasing variants. HiFi reads also excel at calling larger classes of variation, such as structural or tandem repeat variants. However, current phasing tools typically only phase small variants, leaving larger variants unphased.

Results

We developed HiPhase, a tool that jointly phases SNVs, indels, structural, and tandem repeat variants. The main benefits of HiPhase are (i) dual mode allele assignment for detecting large variants, (ii) a novel application of the A*-algorithm to phasing, and (iii) logic allowing phase blocks to span breaks caused by alignment issues around reference gaps and homozygous deletions. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Additionally, HiPhase jointly phases SNVs, indels, structural, and tandem repeat variants and includes innate multi-threading, statistics gathering, and concurrent phased alignment output generation.

Availability and implementation

HiPhase is available as source code and a pre-compiled Linux binary with a user guide at https://github.com/PacificBiosciences/HiPhase.

1 Introduction

For human genomics, phasing variants within a gene is a critical aspect of clinical testing for pharmacogenomics (Caspar et al. 2021), HLA typing (Mayor et al. 2019), and autosomal recessive conditions (Tewhey et al. 2011). Additionally, automated tertiary analysis tools can benefit from knowing whether two variants are in cis or trans within a gene, potentially reducing the time for clinical analysts to reach a molecular diagnosis (Danecek and McCarthy 2017). Beyond clinical diagnostics, there are many additional applications of phasing in genomics, many of which are summarized in a review by Browning and Browning (2011).

Haplotype phasing generally falls into three categories: statistical phasing (Browning and Browning 2011), pedigree-based phasing (Browning and Browning 2011), and read-backed phasing (Snyder et al. 2015). For read-backed phasing, each read is mapped to a reference genome and variants are called from the read mappings. The alleles of heterozygous variants are phased into a single diplotype composed of two haplotypes. Once phased, the variants can be fed into various tertiary analysis tools for downstream analyses. In practice, reads that are long and accurate, such as PacBio HiFi reads (Wenger et al. 2019), include a large number of informative bases (more than one per 1 kb of read length) overlapping heterozygous variants making them ideal for variant calling followed by read-backed phasing.

For long reads, WhatsHap (Patterson et al. 2015) is the most prominent read-backed phaser and the de facto standard in the field for phasing single-nucleotide variants (SNVs) and short indels. WhatsHap uses a “fixed parameter tractable approach,” meaning that it down-samples the sequencing reads to a fixed coverage value, to phase the heterozygous variants (Patterson et al. 2015). Despite being generally efficient and accurate for phasing a human genome, WhatsHap does have some notable limitations: (i) it down-samples the data in order to run efficiently, potentially biasing the data; (ii) multi-allelic variants (sites with two or more non-reference alleles) are not phased; (iii) it does not span gaps in coverage caused by large deletions or reference gaps; (iv) it is single-threaded, so parallel processing requires some engineering overhead; and (v) it was designed primarily for small variants like SNVs and short indels. Recently, some tools have focused on that last limitation by either overlaying structural variants (insertions and deletions $\geq$ 50 bp) on the WhatsHap phase blocks (Mahmoud et al. 2021) or jointly phasing SNVs and structural variants, but not short indels (Lin et al. 2022). To our knowledge, no tool officially supports tandem repeat variants. This creates a notable gap for a tool that jointly phases SNVs, indels, structural, and tandem repeat variants.

Here, we introduce HiPhase, a new tool that jointly phases SNVs, indels, structural, and tandem repeat variants called from PacBio HiFi sequencing on diploid organisms. HiPhase uses two novel approaches to solve the phasing problem: (i) dual mode allele assignment and (ii) a phasing algorithm based on the A* search algorithm (Hart et al. 1968). HiPhase offers additional benefits, such as: (i) no down-sampling of the data, (ii) support for multi-allelic variation, (iii) logic to span coverage gaps with supplementary alignments, (iv) innate multi-threading, (v) built-in statistics gathering, and (vi) support for assigning aligned reads to a haplotype (“haplotagging”) while phasing. In our assessment, HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes.

2 Materials and methods

HiPhase breaks the phasing problem into three major components: phase block generation, allele assignment, and diplotype solving.

2.1 Phase block generation

Phase block generation works by identifying pairs of consecutive heterozygous variant calls connected by at least one read mapping. A phase block is initialized by taking the first available variant on a chromosome and creating a single-variant block from it. Then, the next variant is loaded and HiPhase searches for mappings that span both that new variant and the current block. If no mappings span both, the algorithm will then check for supplementary mappings from the new variant into the current block. This supplementary mapping check allows HiPhase to span coverage gaps caused by homozygous deletions and reference gaps. If at least one spanning or supplementary mapping is identified, then, the current phase block is extended to include the variant. Otherwise, the current block is returned as a putative phase block and a new single-variant block is initialized from the new variant. This process repeats until all heterozygous variants on the chromosome are consumed.

Each putative phase block acts as an isolated sub-problem in the full solution. Because the putative blocks are unconnected by the read mappings at their ends, they represent a lower-bound on the number of phase blocks in the final solution. Importantly, each sub-problem can be solved independently, allowing for allele assignment and diplotype solving to be performed in parallel for each putative phase block. This forms the basis for multi-threading in HiPhase.

2.2 Allele assignment

Allele assignment converts all read mappings within a putative phase block into condensed allelic observations, which are chains of reference or alternate alleles corresponding to the observed alleles within a particular read mapping. This simplifies a long-read sequence down to a smaller set of integer values representing alleles that are present within the read. We refer to these assignments as reference or alternate alleles, but HiPhase also allows for ambiguity, unassigned alleles, and multi-allelic variation (two different alternate alleles at one position). Additionally, each observed allele is assigned a “weight” indicating the cost to alter or ignore that allele in the diplotype solving step.

HiPhase has two modes for allele assignment: local re-alignment and global re-alignment. In brief, local re-alignment assigns alleles and quality based on a small window around each variant position, which is conceptually similar to the allele assignment process of WhatsHap (Patterson et al. 2015). In contrast, global re-alignment will fully re-align the mapping against a local alt-aware reference graph using a graph-aware version of the wavefront algorithm (Marco-Sola et al. 2020). In general, local re-alignment is a faster process, but it is ill-suited for accurate allele assignment for large structural and tandem repeat variants and even some smaller indels. Global re-alignment tends to be slower but is more accurate when it comes to allele assignment, especially around structural and tandem repeat variants. Finally, HiPhase implements a “dual mode” allele assignment where if global re-alignment is too slow, it will fall back on local re-alignment for the phase block.

Once all mappings have been converted to a condensed allele representation, HiPhase collapses mappings with the same read name into a single entry. The primary purpose of this step is to create a bridge between supplementary mappings that span a gap in coverage. This allows HiPhase to cross deletion events and reference gaps bridged by split read mappings. If the mappings for one read overlap but have a conflicting allele assignment, then that allele is converted to an ambiguous allele assignment in the collapsed representation. In the end, each read is represented exactly once in the collection of condensed alleles for the phase block.

2.3 Diplotype solving

Diplotype solving distills the condensed allelic observations into two representative haplotypes that are expected to complement each other (i.e. where one is the reference allele, the other is the alternate allele). For our purposes, we define the core diplotype solving problem as a slight reformulation of the weighted minimum error correction (wMEC) problem as described previously (Patterson et al. 2015). Given a matrix, where each row corresponds to the condensed allele representation of one read in the phase block, the goal is to find two haplotypes, $h_{1}$ and $h_{2}$ ⁠, such that the cost of changing each row to exactly match either $h_{1}$ or $h_{2}$ is minimized. We refer the reader to Patterson et al. (2015) for greater technical details around this and other formulations of the phasing/wMEC problem.

HiPhase uses a version of the A* search algorithm (Hart et al. 1968) to search for the optimal diplotype. In general, A* search algorithms explore a given search space by iteratively expanding the current lowest cost option based on both an observed cost and heuristic cost estimate. In HiPhase, the search space is all possible diplotype solutions modeled as a search tree, and the heuristic is generated by chaining sub-problem solutions. While increased coverage and variant density will increase the run-time of this algorithm, in practice, we find that HiPhase can operate efficiently on the typical 30× human whole genome without down-sampling. Greater details on all HiPhase methods, heuristics, and limitations can be found in the Supplementary Material.

2.4 Data

The following sections outline public resources used in our analysis. Details on and links to each can be found in the Supplementary Material.

2.4.1 Benchmark datasets

For benchmarking phase accuracy, we used publicly available Genome in a Bottle phased variant sets v4.2.1 (GRCh38) for HG001, HG002, and HG005 (Wagner et al. 2022). We note that these files are limited to human autosomes, and do not provide phasing information for structural variants.

2.4.2 HiFi sequencing datasets

We used publicly available whole genome, HiFi sequencing datasets for evaluating HiPhase. Three replicates for the sample HG002 sequenced to ∼30-fold coverage on the Revio system were obtained from https://downloads.pacbcloud.com/public/revio/2022Q4/.

2.4.3 Pre-processing

We followed the recommended practices for pre-processing our HiFi datasets through variant calling. In brief, we aligned reads to GRCh38 using pbmm2 and then performed variant calling with DeepVariant with the PacBio model (Poplin et al. 2018), pbsv, and TRGT (Dolzhenko et al. 2023).

2.5 Phasing metrics

We measured the quantity of phasing errors with switches and flips (Patterson et al. 2015). To measure phase block length, we calculated NG50 using GRCh38 for reference length. We additionally gathered the total number of phased variants, number of phased structural and tandem repeat variants (HiPhase only), and the percentage of fully phased genes. We note that NG50 and number of phased variants are measured using tool-specific outputs. All analyses were performed across the full autosomes. Further details on how all metrics are defined, calculated, and collected are available in the Supplementary Material.

2.6 Tool descriptions

We focus our analyses on comparing the current approach for HiFi read-backed phasing, WhatsHap (Patterson et al. 2015), to our new tool, HiPhase. For WhatsHap, we provide small variants from DeepVariant and instruct the tool to phase both SNVs and indels. For HiPhase, we provide the same DeepVariant VCF file, a structural variant VCF file from pbsv, and a tandem repeat variant VCF file from TRGT. We set HiPhase to use global re-alignment (see Section 2.2), which is our recommended approach for phasing with structural variants. Additional results for other modes of running both WhatsHap and HiPhase are available in the Supplementary Material.

3 Results

On average, HiPhase generated 480 kb phase block NG50, fully phased 93.8% of genes, and phased 3.1 M variants with only 929 switchflip errors (see Fig. 1), a rate of 1 error every 3.3 K variants. Additionally, HiPhase generated over 11 K phased structural variants and 68 K phased tandem repeat variants per replicate, a feature that is not present in WhatsHap. Additional analyses with more metrics and samples are available in the Supplementary Material.

Figure 1.

Summary comparison metrics. Each line represents a phase result where the line style corresponds to a HG002 replicate and the color corresponds to the phasing approach (top three blue lines for HiPhase or bottom three red lines for WhatsHap). Each metric axis is oriented such that better results are toward the top (i.e. switches and flips are inverted). In all metrics, results from HiPhase were strictly better than those from WhatsHap

Open in new tab Download slide

To our knowledge, HiPhase is the first phasing tool to jointly phase SNVs, indels, structural, and tandem repeat variants. Compared to WhatsHap, HiPhase generated longer phase blocks (480 versus 407 kb) with fewer phasing errors (929 versus 1615 switchflips), phased more total variants (3.05 versus 2.94 million), and fully phased more genes (93.8% versus 87.0%). HiPhase also includes usability features, such as innate multi-threading, statistics gathering, and concurrent phased alignment output generation. In future versions of the tool, we will extend HiPhase into other forms of structural variants, such as inversions and copy number variants.

Acknowledgements

The authors thank the anonymous reviewers and early HiPhase testers for their valuable suggestions. They thank Justin Zook for access to an assembly-based draft HG002 benchmark that includes larger structural and tandem repeat variants. They also thank Tim Dunn for updating vcfdist to support analysis of phased structural and tandem repeat variants from the draft benchmark.

Author contributions

All authors conceived HiPhase; J.M.H. and C.T.S. developed HiPhase; J.M.H. and W.J.R. analyzed the results; and J.M.H., C.T.S., W.J.R., and M.E. wrote and reviewed the manuscript.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

All authors are employed by and hold stock in PacBio.

Funding

None declared.

Data availability

HiPhase is released as source code and a pre-compiled binary file on GitHub: https://github.com/PacificBiosciences/HiPhase. Additional instructions and use cases are available in the accompanying documentation. Details and links to all benchmark files, sequencing datasets, tool versions, and commands used in this document as well as additional dataset analyses are available in the Supplementary Material.

References

Browning

BL.

Haplotype phasing: existing methods and new developments

Nat Rev Genet

2011

;

703

–

. http://dx.doi.org/10.1038/nrg3054

Caspar

Schneider

Stoll

et al.

Potential of whole-genome sequencing-based pharmacogenetic profiling

Pharmacogenomics

2021

;

177

–

. http://dx.doi.org/10.2217/pgs-2020-0155

Danecek

McCarthy

SA.

BCFtools/csq: haplotype-aware variant consequences

Bioinformatics

2017

;

2037

–

. http://dx.doi.org/10.1093/bioinformatics/btx100

Dolzhenko

English

Dashnow

et al. Characterization and visualization of tandem repeats at scale. Nat Biotechnol

2024

–

Hart

Nilsson

Raphael

A formal basis for the heuristic determination of minimum cost paths

IEEE Trans Syst Sci Cyber

1968

;

100

–

. http://dx.doi.org/10.1109/TSSC.1968.300136

Google Scholar

Crossref

WorldCat

Lin

J-H

Chen

L-C

S-C

et al.

LongPhase: an ultra-fast chromosome-scale phasing algorithm for small and large variants

Bioinformatics

2022

;

1816

–

. http://dx.doi.org/10.1093/bioinformatics/btac058

Mahmoud

Doddapaneni

Timp

et al.

PRINCESS: comprehensive detection of haplotype resolved SNVs, SVs, and methylation

Genome Biol

2021

;

268

. http://dx.doi.org/10.1186/s13059-021-02486-w

Marco-Sola

Moure

Moreto

et al.

Fast gap-affine pairwise alignment using the wavefront algorithm

Bioinformatics

2020

;

456

–

. http://dx.doi.org/10.1093/bioinformatics/btaa777

Google Scholar

Crossref

WorldCat

Mayor

Hayhurst

Turner

et al.

Recipients receiving better HLA-matched hematopoietic cell transplantation grafts, uncovered by a novel HLA typing method, have superior survival: a retrospective study

Biol Blood Marrow Transplant

2019

;

443

–

. http://dx.doi.org/10.1016/j.bbmt.2018.12.768

Patterson

Marschall

Pisanti

et al.

WhatsHap: weighted haplotype assembly for future-generation sequencing reads

J Comput Biol

2015

;

498

–

509

. http://dx.doi.org/10.1089/cmb.2014.0157

Poplin

Chang

P-C

Alexander

et al.

A universal SNP and small-indel variant caller using deep neural networks

Nat Biotechnol

2018

;

983

–

. http://dx.doi.org/10.1038/nbt.4235

Snyder

Adey

Kitzman

et al.

Haplotype-resolved genome sequencing: experimental methods and applications

Nat Rev Genet

2015

;

344

–

. http://dx.doi.org/10.1038/nrg3903

Tewhey

Bansal

Torkamani

et al.

The importance of phase information for human genomics

Nat Rev Genet

2011

;

215

–

. http://dx.doi.org/10.1038/nrg2950

Wagner

Olson

Harris

et al.

Benchmarking challenging small variants with linked and long reads

Cell Genom

2022

;

100128

. http://dx.doi.org/10.1016/j.xgen.2022.100128

Wenger

Peluso

Rowell

et al.

Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome

Nat Biotechnol

2019

;

1155

–

. http://dx.doi.org/10.1038/s41587-019-0217-9

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
January 2024	141
February 2024	401
March 2024	251
April 2024	224
May 2024	155
June 2024	141
July 2024	165
August 2024	178
September 2024	152
October 2024	139
November 2024	162
December 2024	105
January 2025	208
February 2025	197
March 2025	188
April 2025	124
May 2025	36

Article Contents

HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

Abstract

1 Introduction

2 Materials and methods

2.1 Phase block generation

2.2 Allele assignment

2.3 Diplotype solving

2.4 Data

2.4.1 Benchmark datasets

2.4.2 HiFi sequencing datasets

2.4.3 Pre-processing

2.5 Phasing metrics

2.6 Tool descriptions

3 Results

Acknowledgements

Author contributions

Supplementary data

Conflict of interest

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing

Abstract

1 Introduction

2 Materials and methods

2.1 Phase block generation

2.2 Allele assignment

2.3 Diplotype solving

2.4 Data

2.4.1 Benchmark datasets

2.4.2 HiFi sequencing datasets

2.4.3 Pre-processing

2.5 Phasing metrics

2.6 Tool descriptions

3 Results

Acknowledgements

Author contributions

Supplementary data

Conflict of interest

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only