AlignGraph2: similar genome-assisted reassembly pipeline for PacBio long reads

Table 1

Results of contig reassemblies with various de novo assembly algorithms

Assembly method	Contig set	N contigs	N50	Average length	L50	N mismatches per 100 kbp	N indels per 100 kbp	N misassemblies per Mbp
(a) Contigs of S. cerevisiae W303
Canu	Aligned	7	457 588	328 315	3	55.8	78.4	18.7
	Extendable	6	457 588	376 417	3	55.8	73.5	17.7
	Extended	3	576 696	745 986	2	48.7	69.7	16.5
MECAT	Aligned	8	445 553	291 756	2	55.4	280.2	11.1
	Extendable	6	445 553	316 855	2	52.7	290.7	11.6
	Extended	3	579 898	630 810	2	36.3	82.9	10.6
wtdbg2	Aligned	19	553 954	424 293	6	44.7	124.5	6.2
	Extendable	10	553 954	386 856	3	43.1	130.6	5.2
	Extended	5	1 047 763	777 132	2	29.8	65.4	5.7
(b) Contigs of A. thaliana Ler-0
Canu	Aligned	38	11 164 124	3 170 699	5	323.5	38.8	6.7
	Extendable	18	11 164 124	6 295 311	5	290.0	37.4	6.2
	Extended	7	22 010 905	15 898 017	3	256.0	29.1	6.3
MECAT	Aligned	53	11 152 352	2 237 314	5	937.3	157.7	6.7
	Extendable	16	549 714	325 228	1	2027.6	272.3	27.4
	Extended	6	588 000	871 541	1	1398.1	89.0	24.7
wtdbg2	Aligned	69	6 693 746	1 705 279	6	460.4	62.4	5.7
	Extendable	18	3 501 951	1 101 327	3	588.1	63.8	9.4
	Extended	8	12 241 575	2 461 217	2	334.3	30.5	9.3
(c) Contigs of D. melanogaster ISO1
Canu	Aligned	52	13 627 260	1 830 831	3	23.9	119.8	0.3
	Extendable	23	13 627 260	2 230 926	2	18.6	89.7	0.3
	Extended	11	21 445 319	4 681 384	2	15.4	52.0	0.3
MECAT	Aligned	38	8 333 886	3 185 015	4	84.4	212.3	0.2
	Extendable	10	15 931 084	7 991 736	3	34.3	208.7	0.1
	Extended	5	22 837 126	15 955 717	2	16.7	51.3	0.2
wtdbg2	Aligned	38	16 699 794	3 168 544	3	27.1	91.9	0.3
	Extendable	15	13 210 545	3 973 433	2	21.6	98.8	0.1
	Extended	7	16 224 610	8 489 209	2	6.0	61.5	0.1
(d) HiFi contigs of H. sapiens HG002
Canu	Aligned	7823	27 228 263	383 930	32	132.0	99.3	0.9
	Extendable	324	5 384 744	149 495	3	210.8	40.0	4.1
	Extended	153	13 771 028	323 127	2	202.7	47.0	4.3
wtdbg2	Aligned	763	16 389 496	3 568 627	44	102.8	29.6	0.1
	Extendable	66	19 543 112	2 914 891	4	96.2	28.9	0.1
	Extended	31	33 979 403	6 181 610	3	100.5	27.9	0.2
FALCON	Aligned	419	30 566 990	6 604 696	25	142.3	258.7	0.2
	Extendable	20	10 220 726	841 890	1	169.2	256.9	2.6
	Extended	9	13 485 583	1 862 005	1	145.7	31.6	1.7
Hifiasm	Aligned	339	98 165 977	8 900 780	12	159.9	26.6	3.8
	Extendable	43	33 233 224	11 291 120	5	199.8	26.6	6.1
	Extended	18	77 844 542	26 438 918	3	196.2	35.1	5.5

The contigs in tests (a)–(d) are from S. cerevisiae W303, A. thaliana Ler-0, D. melanogaster ISO1 and H. sapiens HG002, respectively. They were assembled with the de novo assembly algorithms Canu, MECAT and wtdbg2 in tests (a)–(c), and with Canu, wtdbg2, FALCON and Hifiasm in test (d). In each test, AlignGraph2’s extendable contigs are a subset of the aligned ones, and the extendable and extended contigs are compared. The aligned contigs cannot be directly compared to the extended contigs, so the corresponding rows are indicated in gray.
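For readers less familiar with the length columns of the table, the following is a minimal sketch of how N50, L50 and average length are commonly computed from a list of contig lengths. It assumes the usual definitions (N50 is the length of the contig at which the running sum of contigs, sorted from longest to shortest, first reaches half of the total length; L50 is the number of contigs needed to get there). The function names and the toy lengths are illustrative and do not come from AlignGraph2 or from the evaluation behind the table.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a non-empty list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    # Walk through the contigs from longest to shortest.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running sum to half of the
        # total length defines N50; the number of contigs used is L50.
        if running >= half_total:
            return length, count


def average_length(contig_lengths):
    """Arithmetic mean of the contig lengths."""
    return sum(contig_lengths) / len(contig_lengths)


# Toy contig set (hypothetical lengths, not taken from the table).
toy_contigs = [10, 8, 6, 4, 2]
print(n50_l50(toy_contigs))
print(average_length(toy_contigs))
```

Because N50 depends only on the contigs that make up the longer half of the assembly, merging several short contigs can raise the average length considerably while leaving N50 almost unchanged.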

With the corresponding target genomes, AlignGraph2 can obtain extended contigs with 15.3–261.8% larger N50 values and 109.1–232.1% larger average lengths than the extendable contigs. The N50 value improves less than the average length because long extendable contigs that approach full chromosome length cannot be extended much, whereas short ones can be extended to a large extent. The extended contigs have 15.3–70.9% fewer mismatches per 100 kbp, 12.4–83.5% fewer indels per 100 kbp and 7.7–33.3% fewer misassemblies per Mbp. The results of contig reassemblies with target genomes are listed in Table S-4.

Results of contig reassemblies with various similar genomes. For the S. cerevisiae W303 contigs, as similarity decreases from S. cerevisiae S288C to S. cerevisiae YJM1463, the number of extendable contigs decreases steadily from 6 to 2 and the number of extended contigs from 3 to 1, while the N50 value and average length of the extended contigs remain 23.3–61.2% and 96.4–98.2% larger, respectively. For accuracy, the extended contigs have 12.8–32.4% fewer mismatches per 100 kbp and 2.1–5.2% fewer indels per 100 kbp, and also 6.7–39.1% fewer misassemblies per Mbp. For the A. thaliana Ler-0 and D. melanogaster ISO1 contigs, as similarity decreases from A. thaliana Col-0 to Capsella rubella and from D. melanogaster A3 to D. sechellia, the number of extendable contigs decreases steadily from 18–27 to 2–8 and the number of extended contigs from 7–13 to 1–4, while the N50 value and average length of the extended contigs remain 49.2–184.4% and 95.3–152.5% larger, respectively. For accuracy, the extended contigs have 5.2–37.0% fewer mismatches per 100 kbp and 8.4–42.1% fewer indels per 100 kbp, while the numbers of misassemblies per Mbp are comparable to those of the extendable contigs. For the H. sapiens HG002 contigs assembled from the HiFi long reads, as similarity decreases from H. sapiens CMT-001 to Gorilla, the number of extendable contigs similarly decreases steadily from 174 to 15 and the number of extended contigs from 68 to 7, while the N50 value and average length of the extended contigs remain 5.7–73.9% and 100.4–156.0% larger, respectively. For accuracy, the numbers of mismatches per 100 kbp, indels per 100 kbp and misassemblies per Mbp of the extended contigs are comparable to those of the extendable contigs. For all the contig sets, the genome fractions of the extendable and extended contigs are quite similar, ranging from 14.4 to 88.6%. The results of contig reassemblies with various similar genomes are listed in Table 2. The running time and memory usage in this test are listed in Table S-6.
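The percentage ranges quoted above are relative changes between the extendable and the extended rows of Table 2. As a worked check, the sketch below recomputes the S. cerevisiae W303 ranges for N50 and average length from the Table 2(a) values. The helper name `percent_larger` and the tuple layout are illustrative choices, and the row labels follow the strain and species names used in the text.

```python
def percent_larger(extendable, extended):
    """Relative increase of the extended value over the extendable one, in %."""
    return (extended - extendable) / extendable * 100


# (similar genome, extendable N50, extended N50,
#  extendable average length, extended average length), from Table 2(a).
table2a = [
    ("S288C", 457_588, 576_696, 376_417, 745_986),
    ("S. jurei", 457_588, 564_253, 281_462, 554_452),
    ("S. arboricola", 358_208, 577_495, 290_137, 570_949),
    ("YJM1463", 358_208, 565_182, 287_791, 565_182),
]

n50_gain = [percent_larger(row[1], row[2]) for row in table2a]
avg_gain = [percent_larger(row[3], row[4]) for row in table2a]

print(f"N50: {min(n50_gain):.1f}-{max(n50_gain):.1f}% larger")
print(f"Average length: {min(avg_gain):.1f}-{max(avg_gain):.1f}% larger")
```

With these inputs the script prints 23.3-61.2% for N50 and 96.4-98.2% for average length, which agrees with the ranges stated for the S. cerevisiae W303 contigs.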

Table 2

Results of contig reassemblies with various similar genomes

Similar genome	Contig set	N contigs	N50	Average length	L50	N mismatches per 100 kbp	N indels per 100 kbp	N misassemblies per Mbp
(a) Contigs of S. cerevisiae W303
S. cerevisiae S288C	Aligned	7	457 588	328 315	3	55.8	78.4	18.7
	Extendable	6	457 588	376 417	3	55.8	73.5	17.7
	Extended	3	576 696	745 986	2	48.7	69.7	16.5
S. jurei	Aligned	8	457 588	358 326	3	71.8	77.9	14.0
	Extendable	6	457 588	281 462	2	62.1	86.5	17.8
	Extended	3	564 253	554 452	2	63.3	84.7	13.2
S. arboricola	Aligned	8	932 379	592 383	2	56.3	65.4	14.6
	Extendable	4	358 208	290 137	4	40.8	79.6	12.1
	Extended	2	577 495	570 949	2	42.2	81.0	7.9
S. cerevisiae YJM1463	Aligned	6	457 588	376 417	3	55.8	73.5	17.7
	Extendable	2	358 208	287 791	1	60.8	105.0	17.4
	Extended	1	565 182	565 182	1	41.1	101.1	10.6
(b) Contigs of A. thaliana Ler-0
A. thaliana Col-0	Aligned	38	11 164 124	3 170 699	5	323.5	38.8	6.7
	Extendable	18	11 164 124	6 295 311	5	290.0	37.4	6.2
	Extended	7	22 010 905	15 898 017	3	256.0	29.1	6.3
A. thaliana Col-O	Aligned	19	11 164 124	5 155 357	4	290.8	35.4	6.6
	Extendable	8	9 624 789	8 358 227	3	273.7	32.1	6.8
	Extended	4	15 239 450	16 321 374	2	238.7	29.4	5.7
A. lyrata	Aligned	3	8 571 950	7 678 082	2	271.6	27.4	8.1
	Extendable	2	9 624 789	7 231 148	1	191.8	25.3	5.7
	Extended	1	14 362 174	14 362 174	1	261.9	36.1	5.2
Capsella rubella	Aligned	3	8 571 950	7 678 082	2	271.6	27.4	8.1
	Extendable	2	9 624 789	7 231 148	1	191.8	25.3	5.7
	Extended	1	14 362 134	14 322 893	1	206.0	27.5	5.2
(c) Contigs of D. melanogaster ISO1
D. melanogaster A3	Aligned	89	20 985 587	1 497 447	3	42.1	127.6	4.0
	Extendable	27	7 828 983	1 164 232	2	46.8	69.2	3.3
	Extended	13	22 270 719	2 412 289	1	44.3	56.7	3.3
D. melanogaster sister (fors)	Aligned	52	13 627 260	1 830 831	3	23.9	119.8	0.3
	Extendable	23	13 627 260	2 230 926	2	18.6	89.7	0.3
	Extended	11	21 445 319	4 681 384	2	15.4	52.0	0.3
D. simulans	Aligned	47	13 627 260	1 543 036	2	18.1	122.2	0.2
	Extendable	8	13 627 260	3 125 893	1	11.1	65.7	0.2
	Extended	4	21 395 575	6 236 416	1	7.0	49.4	0.2
D. sechellia	Aligned	48	13 627 260	1 507 609	2	18.0	122.3	0.1
	Extendable	8	13 627 260	3 125 893	1	11.1	65.7	0.2
	Extended	4	21 406 895	6 239 169	1	7.2	49.5	0.2
(d) HiFi contigs of H. sapiens HG002
H. sapiens CMT-001	Aligned	836	16 389 496	3 244 071	44	103.1	29.5	0.2
	Extendable	174	14 743 856	1 444 972	6	122.2	33.0	0.3
	Extended	68	25 473 177	3 699 534	4	129.2	32.4	0.4
P. troglodytes	Aligned	763	16 389 496	3 568 627	44	102.8	29.6	0.1
	Extendable	66	19 543 112	2 914 891	4	96.2	28.9	0.1
	Extended	31	33 979 403	6 181 610	3	100.5	27.9	0.2
P. paniscus	Aligned	710	16 389 496	3 835 094	311	102.5	29.5	0.1
	Extendable	48	19 543 112	1 871 493	2	112.6	32.0	0.2
	Extended	24	20 651 917	3 750 382	2	126.5	30.1	0.3
Gorilla	Aligned	771	16 389 496	3 519 349	44	102.5	29.6	0.1
	Extendable	15	32 566 459	2 674 438	1	100.8	28.5	0.3
	Extended	7	36 038 322	5 731 809	1	98.5	27.8	0.4

The contigs in tests (a)–(d) are from S. cerevisiae W303, A. thaliana Ler-0, D. melanogaster ISO1 and H. sapiens HG002, respectively. In each test, the similarity of the similar genomes to the target genome decreases from the first to the last, AlignGraph2’s extendable contigs are a subset of the aligned ones, and the extendable and extended contigs are compared. The aligned contigs cannot be directly compared to the extended contigs, so the corresponding rows are indicated in gray.

4 DISCUSSION AND CONCLUSIONS

AlignGraph2 takes both preassembled contigs and a similar genome as input and aligns them with each other, so a common concern is how it handles misassemblies in the preassembled contigs and structural variations (SVs) in the similar genome. In other words, if a preassembled contig aligns to the similar genome with relatively large gaps, does AlignGraph2 treat the gaps as misassemblies in the contig or as SVs between the similar and target genomes? Generally, although AlignGraph2 refines preassembled contigs by removing indels, it is conservative and does not attempt to correct misassemblies. That is, AlignGraph2 treats the large gaps as SVs between the similar and target genomes and reassembles the whole contig end to end. This is achieved with the contig-guided read alignment technique discussed above: the corresponding long reads are first aligned to the contig and then to the similar genome, so every vertex in this region of the constructed multipositional A-Bruijn graph has alignment positions to both the contig and the genome, and two vertices with the same alignment position to the contig also have the same position to the genome. As a result, AlignGraph2 constructs a single path here and the traversal does not break. This behavior is similar to that of the initial AlignGraph algorithm, and we designed ReMILO to correct misassemblies in preassembled short read contigs [28]. It is also worth noting that although AlignGraph2 extends preassembled contigs cautiously, it cannot entirely avoid introducing additional misassemblies. Fortunately, given the increased contig lengths and accuracy per 100 kbp in our tests, the additional misassemblies should be acceptable. In fact, as shown in Tables 1 and 2, the number of misassemblies per Mbp usually drops for the extended contigs compared to the extendable ones, meaning that AlignGraph2 introduces fewer misassemblies than the de novo assembly algorithms.
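The single-path argument above can be illustrated with a toy model. The sketch below is not AlignGraph2's implementation: it reduces each graph vertex to a hypothetical (contig position, genome position) pair and checks the stated property that vertices sharing a contig position also share a genome position. When the property holds, ordering the vertices by contig position gives one unbroken path, even across a large jump in genome coordinates that stands in for an SV. All names and coordinates are invented for illustration.

```python
from collections import defaultdict


def has_single_path(vertices):
    """True if every contig position maps to exactly one genome position.

    `vertices` is an iterable of (contig_pos, genome_pos) pairs, a
    simplified stand-in for vertices that carry both alignment positions.
    """
    genome_by_contig = defaultdict(set)
    for contig_pos, genome_pos in vertices:
        genome_by_contig[contig_pos].add(genome_pos)
    return all(len(positions) == 1 for positions in genome_by_contig.values())


def traverse(vertices):
    """Order the distinct vertices by contig position (the single path)."""
    return sorted(set(vertices))


# Contig-guided case: genome coordinates jump by 5000 after contig
# position 2 (a large gap treated as an SV), yet each contig position
# still has exactly one genome position.
guided = [(0, 100), (1, 101), (2, 102), (3, 5103), (4, 5104)]
print(has_single_path(guided))
print(traverse(guided))

# Counter-example: a second genome position for contig position 3
# would fork the path at that point.
forked = guided + [(3, 103)]
print(has_single_path(forked))
```

In this toy setting the gap is never interpreted as a break in the contig; the path simply continues at the next contig position, which mirrors the conservative behavior described in the text.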

In summary, this paper introduces AlignGraph2, a similar genome-assisted reassembly pipeline for PacBio long reads. AlignGraph2 accepts either error-prone or HiFi long reads and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for aligning long reads and preassembled contigs to a similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extending and refining the preassembled contigs. AlignGraph2 was tested on both error-prone and HiFi long reads. In the test of long read alignment, AlignGraph2 aligns more long reads and bases than some current algorithms and is more efficient than or comparable to the others. In the test of contig reassemblies with various de novo assembly algorithms, AlignGraph2 extends many of the contigs and obtains extended contigs with larger N50 values and fewer indels per 100 kbp than the extendable contigs. In the test of contig reassemblies with various similar genomes, AlignGraph2 has relatively stable performance. Overall, AlignGraph2 is efficient in aligning long reads and in extending and refining preassembled contigs. AlignGraph2 currently supports only PacBio SMRT long reads; in the future, we will extend it to ONT long reads.

Key Points

AlignGraph2 is a similar genome-assisted reassembly pipeline that extends and refines preassembled contigs from PacBio long reads. It is the second version of the AlignGraph algorithm that we proposed [2], but it is completely redesigned: it accepts either error-prone or HiFi long reads and contains four novel algorithms.
In our performance tests on both error-prone and HiFi long reads, AlignGraph2 aligns 5.7–27.2% more long reads and 7.3–56.0% more bases than some current alignment algorithms and is more efficient than or comparable to the others.
For contigs assembled with various de novo algorithms, AlignGraph2 extends 8.7–94.7% of the aligned contigs and obtains extended contigs with 7.0–249.6% larger N50 values and 5.2–87.7% fewer indels per 100 kbp. With similar genomes of decreasing similarity, AlignGraph2 also has relatively stable performance.

Code availability

The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.

Funding

This work has been supported by grants from the Beijing Natural Science Foundation [4192044 to E.B.], and the Fundamental Research Funds for the Central Universities [2019JBM073 to E.B.].

Shien Huang is a PhD student in the Group of Interdisciplinary Information Sciences, School of Software Engineering, Beijing Jiaotong University. His research interest is application of algorithms and artificial intelligence in various areas, such as biology and transportation.

Xinyu He is a graduate student in the Group of Interdisciplinary Information Sciences, School of Software Engineering, Beijing Jiaotong University. Her research interest is bioinformatics.

Guohua Wang is a professor and dean of the College of Information and Computer Engineering, Northeast Forestry University. His research interests include bioinformatics, machine learning and artificial intelligence, and he has published about 100 journal papers.

Ergude Bao is an associate professor and director of the Group of Interdisciplinary Information Sciences, School of Software Engineering, Beijing Jiaotong University. His research interest is the application of algorithms and artificial intelligence in various areas, such as biology, Chinese medicine, transportation and thermal physics, and he has published about 10 papers in leading bioinformatics journals.

References

1. Mikheenko A, Prjibelski A, Saveliev V, et al. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 2018;34(13):i142–50.

2. Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics 2014;30(12):i319–28.

3. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323(5910):133–8.

4. Eisenstein M. Oxford Nanopore announcement sets sequencing sector abuzz. 2012.

5. Wenger AM, Peluso P, Rowell WJ, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 2019;37(10):1155–62.

6. Matthews BJ, Dudchenko O, Kingan SB, et al. Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature 2018;563(7732):501–7.

7

Kronenberg

ZN

,

Fiddes

IT

,

Gordon

D

, et al.

High-resolution comparative analysis of great ape genomes

.

Science

2018

;

360

(

6393

):eaar6343.

8

Shao

Y

,

Lu

N

,

Wu

Z

, et al.

Creating a functional single-chromosome yeast

.

Nature

2018

;

560

(

7718

):

331

–

5

.

9

Wang

W

,

Mauleon

R

,

Hu

Z

, et al.

Genomic variation in 3,010 diverse accessions of asian cultivated rice

.

Nature

2018

;

557

(

7703

):

43

–

9

.

10

Koren

S

,

Walenz

BP

,

Berlin

K

, et al.

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

.

Genome Res

2017

;

27

(

5

):

722

–

36

.

11

Xiao

C-L

,

Chen

Y

,

Xie

S-Q

, et al.

Mecat: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

.

Nat Methods

2017

.

12

Ruan

J

,

Li

H

.

Fast and accurate long-read assembly with wtdbg2

.

Nat Methods

2020

;

17

(

2

):

155

–

8

.

13

Chin

C-S

,

Peluso

P

,

Sedlazeck

FJ

, et al.

Phased diploid genome assembly with single molecule real-time sequencing

.

Nat Methods

2016

;

13

(

12

):

1050

.

14

Chin

C-S

,

Alexander

DH

,

Marks

P

, et al.

Nonhybrid, finished microbial genome assemblies from long-read smrt sequencing data

.

Nat Methods

2013

;

10

(

6

):

563

–

9

.

15

Chaisson

MJ

,

Tesler

G

.

Mapping single molecule sequencing reads using basic local alignment with successive refinement (blasr): application and theory

.

BMC bioinformatics

2012

;

13

(

1

):238.

16

Koren

S

,

Schatz

MC

,

Walenz

BP

, et al.

Hybrid error correction and de novo assembly of single-molecule sequencing reads

.

Nat Biotechnol

2012

;

30

(

7

):

693

–

700

.

17

Berlin

K

,

Koren

S

,

Chin

C-S

, et al.

Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

.

Nat Biotechnol

2015

;

33

(

6

):

623

–

30

.

18

Govinda M

Kamath

,

Ilan

Shomorony

,

Fei

Xia

, et al.

Hinge: long-read assembly achieves optimal repeat resolution

.

Genome Res

, pages gr 216465,

2017

.

19

Li

H

.

Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

.

Bioinformatics

2016

;

32

(

14

):

2103

–

10

.

20

Gene Myers

.

Efficient local alignment discovery amongst noisy long reads

. In:

International Workshop on Algorithms in Bioinformatics

.

Springer

,

2014

,

52

–

67

.

Google Preview

21

Li

H

.

Minimap2: pairwise alignment for nucleotide sequences

.

Bioinformatics

2018

;

34

(

18

):

3094

–

100

.

22

Cheng

H

,

Concepcion

GT

,

Feng

X

, et al.

Haplotype-resolved de novo assembly with phased assembly graphs

.

arXiv preprint arXiv:200801237

2020

.

23

Myers

G

.

A fast bit-vector algorithm for approximate string matching based on dynamic programming

.

Journal of the ACM (JACM)

1999

;

46

(

3

):

395

–

415

.

Crossref

24

Lin

Y

,

Yuan

J

,

Kolmogorov

M

, et al.

Assembly of long error-prone reads using de bruijn graphs

.

Proc Natl Acad Sci

2016

;

113

(

52

):

E8396

–

405

.

25

Pevzner

PA

,

Tang

H

,

Waterman

MS

.

An eulerian path approach to dna fragment assembly

.

Proc Natl Acad Sci

2001

;

98

(

17

):

9748

.

26

Kolmogorov

M

,

Yuan

J

,

Lin

Y

, et al.

Assembly of long, error-prone reads using repeat graphs

.

Nat Biotechnol

2019

;

37

(

5

):

540

–

6

.

27

Schneeberger

K

,

Ossowski

S

,

Ott

F

, et al.

Reference-guided assembly of four diverse arabidopsis thaliana genomes

.

Proc Natl Acad Sci

2011

;

108

(

25

):

10249

–

54

.

28

Bao

E

,

Song

C

,

Lan

L

.

Remilo: reference assisted misassembly detection algorithm using short and long reads

.

Bioinformatics

2018

;

34

(

1

):

24

–

32

.

29

Zhu

X

,

Leung

HCM

,

Wang

R

, et al.

Misfinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads

.

BMC bioinformatics

2015

;

16

(

1

):386.