Our tests using real data also included two types of datasets: (i) high-coverage single-sample WGS, and (ii) low-coverage multiple-sample WGS. First, we evaluated Pamir using WGS data at 40x coverage generated with Illumina technology from a haploid cell line (CHM1, SRA ID: SRX652547) (Chaisson et al., 2015b). We identified 22,676 insertions totaling 593.5 Kb, of which 2,444 were >50bp (348 Kb in total) (Table 5). Chaisson et al. (2015b) also generated a de novo assembly of the same genome from the same cell line using a long-read sequencing technology (Pacific Biosciences), and predicted insertions from this dataset with the SMRT-SV algorithm. For our comparisons, we used an updated call set (≥50bp) mapped to human GRCh38 (Huddleston et al., 2016). Pamir showed low recall relative to the long-read-based SMRT-SV results (Chaisson et al., 2015b): it identified only 488 of the 12,998 insertions detected by SMRT-SV (about 3.8%) when only nearby matches (less than 10bp distance between breakpoint predictions) are counted.

One reason for this discrepancy is that more than half of the PacBio-predicted insertions are located within repeat regions (Table 6), and short Illumina reads are not sufficient to properly assemble such regions. The same effect was observed in the original publication (Chaisson et al., 2015b), where only a handful of insertions were also identified in another assembly of the same genome that was constructed with a reference-guided methodology using both Illumina WGS and bacterial artificial chromosome datasets (Steinberg et al., 2014). We also observed that approximately 45% of the insertions characterized by SMRT-SV have either very low (≤20%) or high (≥60%) GC content, which is known to be problematic to sequence on the Illumina platform (Benjamini and Speed, 2012; Ross et al., 2013). Additionally, we found that 14,121 of our 22,676 predicted insertions were reported in dbSNP version 147 (within 10bp breakpoint resolution).
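The breakpoint-based comparison described above can be illustrated with a minimal sketch. It is not the evaluation code used in this study: the function name `count_matched`, the `(chromosome, position)` tuple representation, and the toy coordinates are hypothetical, and treating the 10bp window as inclusive is an assumption.

```python
from bisect import bisect_left
from collections import defaultdict


def count_matched(truth, calls, max_dist=10):
    """Count truth breakpoints with a call on the same chromosome within max_dist bp.

    Both inputs are iterables of (chromosome, position) tuples.
    The window is inclusive here, which is an assumption.
    """
    # Index call positions per chromosome and sort them for binary search.
    by_chrom = defaultdict(list)
    for chrom, pos in calls:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()

    matched = 0
    for chrom, pos in truth:
        positions = by_chrom.get(chrom, [])
        # First call at or after the left edge of the window.
        i = bisect_left(positions, pos - max_dist)
        if i < len(positions) and positions[i] <= pos + max_dist:
            matched += 1
    return matched


# Toy coordinates, for illustration only.
truth = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
calls = [("chr1", 1007), ("chr1", 5020), ("chr2", 295), ("chr3", 10)]
toy_recall = count_matched(truth, calls) / len(truth)

# Recall implied by the counts reported in the text: 488 of 12,998.
reported_recall = 488 / 12998
```

Only positions are compared in this sketch; insertion lengths and contents are ignored, as in the breakpoint-only comparison of Table 6.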

Table 5

Summary of insertions predicted in CHM1

| | All | ≤50bp | >50bp |
|---|---|---|---|
| Number of insertions | 22,676 | 20,232 | 2,444 |
| Minimum length | 5 | 5 | 51 |
| Maximum length | 4,135 | 50 | 4,135 |
| Average length | 26.20 | 12.12 | 142.51 |
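A summary of this kind (count, minimum, maximum and average length, split at 50bp) could be computed from a list of predicted insertion lengths roughly as follows. This is a hedged sketch, not the script used for Table 5: the function name `summarize` and the toy lengths are hypothetical.

```python
from statistics import mean


def summarize(lengths, threshold=50):
    """Return count/min/max/mean for all lengths and for each side of the threshold."""
    groups = {
        "All": list(lengths),
        f"<={threshold}bp": [n for n in lengths if n <= threshold],
        f">{threshold}bp": [n for n in lengths if n > threshold],
    }
    summary = {}
    for name, values in groups.items():
        if not values:
            continue  # skip empty classes rather than report undefined statistics
        summary[name] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": round(mean(values), 2),
        }
    return summary


# Toy lengths, for illustration only.
toy_summary = summarize([5, 12, 50, 51, 300])
```

As a quick consistency check on the reported numbers, the two length classes in Table 5 add up to the total: 20,232 + 2,444 = 22,676.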
Table 6

Comparison of insertions in CHM1 by SMRT-SV using PacBio reads versus Pamir and PopIns using Illumina reads allowing 10bp breakpoint resolution

| Insertion length | SMRT-SV (PacBio): prediction | Pamir (Illumina): prediction | Pamir (Illumina): shared with SMRT-SV | PopIns (Illumina): prediction | PopIns (Illumina): shared with SMRT-SV |
|---|---|---|---|---|---|
| 1–50 bp | 187^a (60%, 57%) | 20,232 (56%, 38%) | 27 (63%, 14%) | 21 (71%, 24%) | 0 |
| 50–100 bp | 4,384 (54%, 53%) | 1,273 (70%, 18%) | 205 (52%, 14%) | 246 (73%, 4%) | 17 (70%, 0%) |
| 100–200 bp | 2,959 (54%, 50%) | 815 (75%, 13%) | 125 (58%, 13%) | 793 (66%, 4%) | 120 (62%, 1%) |
| 200–500 bp | 3,123 (55%, 37%) | 291 (74%, 7%) | 97 (61%, 1%) | 1,074 (65%, 3%) | 141 (58%, 1%) |
| >500 bp | 2,345 (60%, 32%) | 65 (63%, 3%) | 34 (50%, 3%) | 1,286 (59%, 3%) | 207 (51%, 1%) |
| All | 12,998 (55%, 45%) | 22,676 (58%, 36%) | 488 (56%, 10%) | 3,420 (58%, 3%) | 485 (56%, 1%) |

For each category, we report in parentheses (i) the percentage of calls that fall into repeat regions according to the RepeatMasker annotation, and (ii) the percentage of calls with biased GC content (≤20% or ≥60%), in the form (% in repeat regions, % with biased GC content).

^a All events reported have a length of 50bp.

Note that the comparisons are based only on breakpoint positions, without considering the contents of the insertions. If insertion lengths and contents are considered simultaneously, most PopIns predictions are filtered out, as shown in Supplementary Tables S5 and S6. Pamir can also call most of the PopIns predictions; however, it filters most of them out because of its stringent rules.
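The GC-bias flag reported in the table can be illustrated with a short sketch. It is an assumption-laden example rather than the procedure used here: the function names `gc_fraction` and `has_biased_gc` are hypothetical, the thresholds are taken as inclusive (20% or lower, 60% or higher), and the source does not state over which sequence (inserted sequence or flanking region) the GC content was measured.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # e.g. a sequence consisting only of N
    return (seq.count("G") + seq.count("C")) / acgt


def has_biased_gc(seq, low=0.20, high=0.60):
    """True if GC content is at or below `low`, or at or above `high` (inclusive: assumption)."""
    gc = gc_fraction(seq)
    return gc is not None and (gc <= low or gc >= high)


# Toy sequences, for illustration only.
flags = [has_biased_gc(s) for s in ("ATATATATAT", "ACGTACGTAT", "GCGCGCGCAT")]
```

The share of flagged calls in a call set would then be the number of `True` values divided by the number of calls.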
