A survey on the algorithm and development of multiple sequence alignment

Zhang, Yongqing; Zhang, Qiang; Zhou, Jiliu; Zou, Quan

doi:10.1093/bib/bbac069

Abstract

Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA’s knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.

multiple sequence alignment, progressive alignment, iterative algorithm, heuristic

Introduction

Sequence alignment (SA) is one of the hotspots and critical steps in bioinformatics, which analyzes the essential biological characteristics between nucleotide or protein sequences by comparing the similarity, such as functional information, structural information and evolutionary information [1]. For example, in comparative genomic analysis, SA can identify conserved sequence motifs, estimate evolutionary differences among sequences and infer the historical relationship between genes and species [2].

SA is mainly used to determine the similarity of two or more sequences. As shown in Figure 1A, given two nucleotide sequences, it is only necessary to insert one gap in the 3rd position of the 2nd sequence to maximize the number of matched nucleotide columns between two sequences. According to the range of alignment, SA can be divided into global SA (shown as Figure 1A or B) and local SA (shown as Figure 1C) [3]. On the other hand, according to the number of sequences, SA can also be divided into pairwise sequence alignment [4] (PSA, shown as Figure 1A) and multiple sequence alignment [5] (MSA, shown as Figure 1B or C). SA makes the resulting sequences all have the same length and aligns the same or similar parts of each sequence by inserting gaps. At present, dynamic programming and the trace-back process can perfectly solve the problem of PSA, such as the Needleman–Wunsch algorithm for global alignment [6] and Smith–Waterman algorithm for local alignment [7]. However, compared with PSA, MSA is more complex. It is an NP (nondeterministic polynomial)-complete problem [8], there is no comprehensive solution for MSA [9].

There are currently many algorithms for MSA that can be roughly divided into progressive alignment algorithm, iterative algorithm, heuristics, machine learning and divide-and-conquer. Progressive alignment is a simple algorithm, which mainly includes two steps: constructing the guide tree and gradually constructing the alignment according to the guide tree [10]. The progressive alignment algorithm is more straightforward and faster. It is still widely used in various methods or programs of MSA after more than 30 years [11–13]. However, due to the lack of accuracy of progressive alignment algorithm, some methods introduced iterative algorithm into progressive alignment to solve this problem [14, 15]. These methods continuously optimize the alignment in the iterative process, but it is undeniable that the iterative algorithm comes with an additional time overhead. Recently, some researchers are pinning their hopes on heuristics [16, 17]. A heuristic algorithm is an iterative algorithm in a broad sense suitable for solving the NP problem such as MSA. However, unlike the traditional iterative algorithm, the heuristic algorithm may have different results each time due to its randomness. Although the application of machine learning to MSA is currently in its initial stage, there are still some valuable works on this subject [18–20]. These works show that machine learning-based methods for MSA perform better than traditional methods on some datasets. Moreover, divide-and-conquer is also widely used in MSA in recent years [21, 22]. It operates by dividing a group of sequences into multiple groups of subsequences for processing, which significantly improves the efficiency of the methods.

Figure 1

This figure shows various types of SA. (A) shows PSA and global SA, (B) shows MSA and global SA and (C) shows MSA and local SA.

Open in new tab Download slide

With the development and extensive application of MSA, there have been many valuable works to review the research of MSA from several aspects. For example, Notredame described existing MSA algorithms and exposed the potential advantages and disadvantages of the widely used programs [23]. Furthermore, Chowdhury and Garai have shown different methods for SA and the latest trend of multi-objective genetic algorithms in MSA. Their research shows that there is excellent progress in improving the accuracy of the alignment [24]. Moreover, Chatzou et al. reviewed the development and performance of the algorithms for protein, DNA and RNA alignment. Importantly, they also explored the relationship between simulated and empirical data, which affect the development of the algorithm [8]. Bawono et al.’s work provided the background and considerations of MSA techniques, primarily focusing on the protein SA. Then, they reviewed existing MSA methods and summarized the specific strengths and limitations of these methods [5]. Xia et al. conducted the present situation of parallel local alignment, concentrating on the large-scale genomic alignment. Besides, they also surveyed the performance of typical methods and discussed the future directions [25]. However, due to the increasing sequences scale and the diversity and complexity of the approaches, these works are not systematic and comprehensive. There is an urgent demand to summarize the recent research of MSA clearly and systematically.

In this work, we provide a clear review of the research of MSA and its applications in several scenarios. Our study is mainly shown in Figure 2. We summarize the relevant knowledge of MSA from its application, database, algorithm, program and benchmark. Primarily, we introduce and classify the classical and latest algorithms for MSA in more detail. Furthermore, we also discuss the current challenges and opportunities for future directions of MSA. This work comprehensively reviews the research of MSA from several aspects, which aim to serve as a helpful guide for researchers. The contributions of this work can be summarized as follows: (1) summarize the research status of MSA systematically. (2) Classify and detail the classical and latest algorithms for MSA. (3) Discuss the current challenges and opportunities of MSA.

Figure 2

The frame diagram of the paper.

Open in new tab Download slide

Applications of MSA

MSA plays a vital role in clarifying the biological patterns of multiple related sequences. For two or more sequences with high similarity, they may show similarity in function, evolution and structure [1]. For this reason, MSA has been widely used in sequence database searching, phylogenetic analysis, genomic analysis, protein analysis and metagenomic analysis [23].

Database searching

For a newly gene sequenced or protein sequenced, a vital clue to know its function is to search the known sequences homologous of similar to that sequence [26]. However, with the sustainable development of sequencing technology, the cumulative number of sequences has broken through the scope of human retrieval. Finding similar sequences from the sequence database has become one of the difficulties.

Sequence database search tools mainly include FASTA [27] and BLAST [26]. FASTA, derived from FASTAP, is the first program widely used in DNA and protein database searching. The search process of FASTA is as follows: firstly, build a dictionary of sequence fragments of |$ktup$|-length, then match the same or similar sequence fragments in the database. After the search, FASTA expands the matched fragments to obtain the best alignment between each matched sequence and the search sequence. In addition, BLAST is the most widely used tool at present. It has more improvements and faster speed than FASTA. The search process is similar to FASTA. It needs to establish a dictionary for the sequence fragments and expand on the matched fragments. The apparent difference is that BLAST introduces a fragment matching and expansion threshold. For different application scenarios, many derivative versions of BLAST have been born, including the latest BLASTP-ACC [28] and SMI-BLASE [29].

Phylogenetic analysis

Phylogenetic analysis is a critical key of many evolutionary studies. It reveals the phylogenetic relationship between different species and analyzes the major transitions in evolution [30]. MSA is the prerequisite in many studies of phylogenetic analysis, such as the construction of the phylogenetic tree. Such as, Khan et al. compute the alignment of 16S rRNA of seven muntjac species to determine the phylogenetic similarity among the muntjac using DNASTAR. And then construct the phylogenetic tree of these species based on the alignment. Finally, by analyzing the phylogenetic tree, they summarize the phylogenetic relationship among the selected species [31]. Furthermore, while investigating the Colletotrichum acutatum sensu lato, which causes leaf curl of celery, Liu et al. [32] align the genes of a collection of isolates from celery and non-celery using ClustalX and conduct a multilocus phylogenetic analysis based on the alignment. In addition, Wei et al. [33] and Hu et al. [34] used the MSA program MAFFT to construct the alignment for phylogenetic analysis.

Genomic analysis

Genomic analysis is an essential means of obtaining biological information, and it aims to analyze the gene functional, genetic, evolutionary or structural information contained in these genomes [35].

In early 2020, with the outbreak of SARS-CoV-2 (COVID-19), researchers paid particular attention to the virus [36, 37]. They constructed the MSA and the phylogenetic trees on the representative spike proteins of SARS-CoV and SARS-CoV-2 from various host sources. They analyzed the specificity required by SARS-CoV-2 spike proteins to cause human infection [38]. The results showed that N-terminal sequence regions MESEFR’ and SYLTPG’ were specific to human SARS-CoV-2. In the receptor-binding domain, two sequence regions, VGGNY’ and EIYQAGSTPCNGV’ and a disulfide bond connecting 480C and 488C are structural determinants for recognizing human ACE-2 receptors. The analysis of virus variants also depends on the MSA results. Through the study of base substitution, deletion, variation and single nucleotide polymorphisms (SNP) in the alignment, we can understand the evolution and classification of variants, which provides essential information for vaccine research about development and improvement [39].

In addition, MSA has been applied to many research fields, including the analysis of viral diseases of different species [40], the impact of enhancers on the evolution of mammals [41] and the study of genes encoding killer-cell immunoglobulin-like receptors [42].

Metagenomic analysis

The genomic analysis mainly analyzes the genome of a single species. In contrast, metagenomic analysis primarily studies the genome of all microbial communities in a specific environment and explores their genetic composition, community function and diversity [43].

Metagenomic analysis can be combined with MSA to quickly determine the genetic relationship between different organisms. In 2020, Zhou et al. collected the metagenomes of 227 bat samples and identified the new virus called RmYN02 through MSA. According to analysis, the virus has 93.3% nucleotide homology with SARS-CoV-2 and is the closest relative to SARS-SoV-2 found so far [44]. Furthermore, MSA is also often used in metagenomic classification. For example, the early metagenomic classifier uses BLAST to match each genome with all genomes in GenBank [45]. While now, to improve accuracy and efficiency, most classifiers prefer combining other algorithms based on MSA, such as k-mers [46] and Markov model [47].

Protein analysis

Similar to genomic analysis, protein sequence analysis mainly focuses on protein function analysis, homology analysis, structure prediction and residue contact prediction [23]. Primarily, protein structure prediction is one of the current research hotspots. In Protein Data Bank (PDB), the protein structure is mainly obtained by X-ray crystal diffraction, nuclear magnetic resonance and electron microscopy [48]. Because the cost of this structure acquisition method is too high, their respective limitations are relatively large. Therefore, to reduce the cost of obtaining protein structure, some researchers have proposed MSA combined with some prediction methods to predict protein structure. The main steps are as follows: give a group of protein sequences homologous to be tested and use the SA results to select an appropriate method to predict the structure [49]. A prediction accuracy combined with deep learning is proposed to improve the prediction accuracy. The core idea of this method is to map the amino acid sequence to the continuous space of adaptive learning with the help of natural language processing to input the whole alignment into the deep neural network for prediction. The results show that this method has good performance in different protein prediction problems [50].

Databases of biomolecules

The sequence database is mainly divided into gene, protein and comprehensive databases. In 1982, with sequencing technology development, biological sequences increased sharply. European Bioinformatics Institute established the first DNA sequence database, which is called European Molecular Biology Laboratory Bank (EMBL-Bank) [51]. In the same year, GeneBank was established by the National Institutes of Health and National Science Foundation [52]. After decades of development, the above databases have become a 1st-class comprehensive database [53]. To ensure data integrity, these databases share and exchange data. Therefore, from the perspective of data content, the resources stored in these databases are almost the same [54].

There are some deficiencies in the protein data in the above database. Therefore, databases dedicated to storing proteins have emerged, such as UniProt PDB. UniProt is the free protein database with the rich information and the most comprehensive resources. It integrates EMBL bank, the Swiss Institute of Bioinformatics and Protein Information Resource [55]. At present, UniProt is mainly composed of five sub-databases: Swiss-Prot, TrEMBL, UniParc, UniRef and Proteomes. The relationship of these data is shown in Figure 3. PDB mainly records the protein structural information and other biological macromolecules, such as polysaccharides and nucleic acids [48]. The structural information in PDB is mainly determined by X-ray crystal diffraction, nuclear magnetic resonance or electron microscope. It is expensive but has higher accuracy, so it is more suitable for research. The summary of the related database is shown in Table 1.

Table 1

Open in new tab

Summary of DNA/RNA and protein databases

Database	Description	References
EMBL-Bank	Comprehensive database of Europe, top-level database.	[51]
DDBJ	Comprehensive database of Japan, top-level database.	[53]
GenBank	Comprehensive database of America, top-level database.	[52]
CNGB	Comprehensive database of China.	[56]
SRA	Comprehensive database, storing raw sequences.	[57]
RefSeq	Comprehensive database, storing non redundant sequences.	[58]
GDB	Genome database, storing human genome sequences.	[59]
MmtDB	Metazoa mtDNA variants specialized database.	[60]
MitBASE	Integrated mitochondrial DNA database.	[61]
SGD	Saccharomyces genome database.	[62]
AceDB	Caenorhabditis elegans sequence database.	[63]
dbSNP	Variant gene database.	[64]
OMIM	Online Mendelian inheritance in MA.	[65]
DGV	The database of genomic variants from healthy people.	[66]
UniProt	Protein database.	[55]
PDB	Protein structure database.	[48]

Database	Description	References
EMBL-Bank	Comprehensive database of Europe, top-level database.	[51]
DDBJ	Comprehensive database of Japan, top-level database.	[53]
GenBank	Comprehensive database of America, top-level database.	[52]
CNGB	Comprehensive database of China.	[56]
SRA	Comprehensive database, storing raw sequences.	[57]
RefSeq	Comprehensive database, storing non redundant sequences.	[58]
GDB	Genome database, storing human genome sequences.	[59]
MmtDB	Metazoa mtDNA variants specialized database.	[60]
MitBASE	Integrated mitochondrial DNA database.	[61]
SGD	Saccharomyces genome database.	[62]
AceDB	Caenorhabditis elegans sequence database.	[63]
dbSNP	Variant gene database.	[64]
OMIM	Online Mendelian inheritance in MA.	[65]
DGV	The database of genomic variants from healthy people.	[66]
UniProt	Protein database.	[55]
PDB	Protein structure database.	[48]

Table 1

Open in new tab

Summary of DNA/RNA and protein databases

Database	Description	References
EMBL-Bank	Comprehensive database of Europe, top-level database.	[51]
DDBJ	Comprehensive database of Japan, top-level database.	[53]
GenBank	Comprehensive database of America, top-level database.	[52]
CNGB	Comprehensive database of China.	[56]
SRA	Comprehensive database, storing raw sequences.	[57]
RefSeq	Comprehensive database, storing non redundant sequences.	[58]
GDB	Genome database, storing human genome sequences.	[59]
MmtDB	Metazoa mtDNA variants specialized database.	[60]
MitBASE	Integrated mitochondrial DNA database.	[61]
SGD	Saccharomyces genome database.	[62]
AceDB	Caenorhabditis elegans sequence database.	[63]
dbSNP	Variant gene database.	[64]
OMIM	Online Mendelian inheritance in MA.	[65]
DGV	The database of genomic variants from healthy people.	[66]
UniProt	Protein database.	[55]
PDB	Protein structure database.	[48]

Database	Description	References
EMBL-Bank	Comprehensive database of Europe, top-level database.	[51]
DDBJ	Comprehensive database of Japan, top-level database.	[53]
GenBank	Comprehensive database of America, top-level database.	[52]
CNGB	Comprehensive database of China.	[56]
SRA	Comprehensive database, storing raw sequences.	[57]
RefSeq	Comprehensive database, storing non redundant sequences.	[58]
GDB	Genome database, storing human genome sequences.	[59]
MmtDB	Metazoa mtDNA variants specialized database.	[60]
MitBASE	Integrated mitochondrial DNA database.	[61]
SGD	Saccharomyces genome database.	[62]
AceDB	Caenorhabditis elegans sequence database.	[63]
dbSNP	Variant gene database.	[64]
OMIM	Online Mendelian inheritance in MA.	[65]
DGV	The database of genomic variants from healthy people.	[66]
UniProt	Protein database.	[55]
PDB	Protein structure database.	[48]

Figure 3

The figure demonstrates the relationship among five sub-databases of UniProt [67].

Open in new tab Download slide

Algorithms for MSA

This section mainly introduces the widely used MSA algorithms, which are divided into the following five types according to the characteristics of the algorithm: progressive alignment algorithm, iterative algorithm, heuristic, machine learning and divide-and-conquer algorithm, as shown in Figure 4.

Figure 4

Classification of MSA algorithms.

Open in new tab Download slide

Progressive alignment algorithm

Hogeweg and Hesper [68] first proposed the progressive alignment algorithm in 1984. The progressive alignment algorithm is more straightforward, efficient and has lower memory consumption than other algorithms. However, it also has obvious defects, once a gap, always a gap’ meaning that the inserted gap cannot be changed after the alignment sequence is confirmed [69]. According to relevant research, the progressive alignment algorithm may cause information loss when calculating the distance matrix, making the algorithm unstable [70].

After more than 30 years of development, the progressive alignment algorithm has been widely used in various MSA methods, as shown in Table 2. Typically, Higgins et al. developed ClustalW, derived from ClustalV in 1994 [11], which uses the weighted sum-of-pairs (WSP) objective function and the neighbor-joining (NJ) method to build the guide tree, as shown in Figure 5. ClustalW is widely used in genome analysis, such as the study on the evolutionary relationship between banana subspecies [77], the rDNA barcoding identification of bacterial and fungal pathogens in vitreous fluids of endophthalmitis patients [78], and the study on the molecular variation of ABO in Chinese Han population [79].

Table 2

Open in new tab

Summary of progressive alignment algorithm-based methods

References	Description	Year
[11]	ClustalW, guide tree, WSP	1994
[12]	DiAlign, segment-segment, probabilistic consideration	1998
[71]	T-Coffee, tree-based consistency objective	2000
[13]	Kalign, Wu-Manber character matching algorithm	2005
[72]	GramAlign, grammar-based distance	2008
[73]	MSAIndelFR, indel flanking region	2015
[74]	TM-Aligner, Wu-Manber and dynamic string matching algorithm	2017
[75]	ProPIP, dynamic programming	2021
[76]	Regressive algorithm	2021

References	Description	Year
[11]	ClustalW, guide tree, WSP	1994
[12]	DiAlign, segment-segment, probabilistic consideration	1998
[71]	T-Coffee, tree-based consistency objective	2000
[13]	Kalign, Wu-Manber character matching algorithm	2005
[72]	GramAlign, grammar-based distance	2008
[73]	MSAIndelFR, indel flanking region	2015
[74]	TM-Aligner, Wu-Manber and dynamic string matching algorithm	2017
[75]	ProPIP, dynamic programming	2021
[76]	Regressive algorithm	2021

Table 2

Open in new tab

Summary of progressive alignment algorithm-based methods

References	Description	Year
[11]	ClustalW, guide tree, WSP	1994
[12]	DiAlign, segment-segment, probabilistic consideration	1998
[71]	T-Coffee, tree-based consistency objective	2000
[13]	Kalign, Wu-Manber character matching algorithm	2005
[72]	GramAlign, grammar-based distance	2008
[73]	MSAIndelFR, indel flanking region	2015
[74]	TM-Aligner, Wu-Manber and dynamic string matching algorithm	2017
[75]	ProPIP, dynamic programming	2021
[76]	Regressive algorithm	2021

References	Description	Year
[11]	ClustalW, guide tree, WSP	1994
[12]	DiAlign, segment-segment, probabilistic consideration	1998
[71]	T-Coffee, tree-based consistency objective	2000
[13]	Kalign, Wu-Manber character matching algorithm	2005
[72]	GramAlign, grammar-based distance	2008
[73]	MSAIndelFR, indel flanking region	2015
[74]	TM-Aligner, Wu-Manber and dynamic string matching algorithm	2017
[75]	ProPIP, dynamic programming	2021
[76]	Regressive algorithm	2021

Figure 5

Given seven protein sequences. The distance matrix of paired sequences is calculated firstly, then the NJ algorithm is used to construct the guide tree, and finally, the progressive alignment is used to align the sequences gradually [11].

Open in new tab Download slide

Recently, there has been some studies similar to ClustalW. For example, Kalign [13], developed by Lasmann and Sonnhammer in 2005, uses the Wu-Manber character matching algorithm when calculating distance matrix. GramAlign, proposed by Russell et al. [72], uses syntax based on Lempel–Ziv compression algorithm to compute distance matrix. In 2021, Maiolo et al. introduced the Poisson Indel Process (PIP) model and dynamic programming algorithm to calculate the distance matrix, which effectively reduced the complexity of the whole method [75]. In the optimization of the objective function, Al-Shatnawi adopts the objective process of variable gap penalty in the MSAIndelFR to improve the accuracy of alignment in 2015 [73].

To further improve the accuracy and avoid the severe pitfall caused by the greed, T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation) was presented by Notredame and Higginsin in 2000, which is based on progressive alignment algorithm and consistency [71]. The package of T-Coffee integrates a variety of MSA methods with different characteristics, such as low memory overhead or higher accuracy [80, 81]. For example, Garriga et al. [76] proposed an MSA method based on a T-Coffee regressive and progressive alignment algorithm. In their method, sequences are first clustered and then aligned starting from the most distant ones. Their experiments show that the technique can significantly improve alignment accuracy in large-scale datasets.

As the most classic algorithm, the progressive alignment algorithm has more clear and faster advantages. However, the progressive alignment algorithm’s accuracy is relatively unsatisfactory, and its performance is easily affected by the number of sequences. Therefore, in addition to the progressive alignment algorithm, most algorithms also use other methods to improve the accuracy of alignment, such as MUSCLE [15], MAFFT [14].

Iterative algorithm

The iterative algorithm is an algorithm that can continuously optimize the alignment to improve the accuracy until it converges. The iterative algorithm can generally be divided into deterministic and stochastic [23]. The calculation results of the deterministic iterative algorithm are the same every time. In contrast, the results of the stochastic iterative algorithm may be different due to the internal use of random values. In this section, we mainly discuss the deterministic iterative algorithm. As shown in Table 3, the table lists the MSA methods related to the deterministic iterative algorithm.

Table 3

Open in new tab

Summary of deterministic iterative algorithm-based methods

References	Description	Year
[82]	MultAlign, hierarchical clustering	1988
[14]	MAFFT, fast Fourier transform, refinement	2002
[15]	MUSCLE, unweighted pair-group method with arithmetic mean	2004
[83]	PRALINE, homology-extended, secondary structure	2005
[84]	Probalign, posterior probabilities	2006
[85]	SATé, minimizing tree-length, POY	2009
[86]	PASTA, large-scale, parallel technique	2015
[87]	VIRULIGN, reference-guide, codon-correct	2019
[88]	ViralMSA, reference-guide, real-time	2021

References	Description	Year
[82]	MultAlign, hierarchical clustering	1988
[14]	MAFFT, fast Fourier transform, refinement	2002
[15]	MUSCLE, unweighted pair-group method with arithmetic mean	2004
[83]	PRALINE, homology-extended, secondary structure	2005
[84]	Probalign, posterior probabilities	2006
[85]	SATé, minimizing tree-length, POY	2009
[86]	PASTA, large-scale, parallel technique	2015
[87]	VIRULIGN, reference-guide, codon-correct	2019
[88]	ViralMSA, reference-guide, real-time	2021

Table 3

Open in new tab

Summary of deterministic iterative algorithm-based methods

References	Description	Year
[82]	MultAlign, hierarchical clustering	1988
[14]	MAFFT, fast Fourier transform, refinement	2002
[15]	MUSCLE, unweighted pair-group method with arithmetic mean	2004
[83]	PRALINE, homology-extended, secondary structure	2005
[84]	Probalign, posterior probabilities	2006
[85]	SATé, minimizing tree-length, POY	2009
[86]	PASTA, large-scale, parallel technique	2015
[87]	VIRULIGN, reference-guide, codon-correct	2019
[88]	ViralMSA, reference-guide, real-time	2021

References	Description	Year
[82]	MultAlign, hierarchical clustering	1988
[14]	MAFFT, fast Fourier transform, refinement	2002
[15]	MUSCLE, unweighted pair-group method with arithmetic mean	2004
[83]	PRALINE, homology-extended, secondary structure	2005
[84]	Probalign, posterior probabilities	2006
[85]	SATé, minimizing tree-length, POY	2009
[86]	PASTA, large-scale, parallel technique	2015
[87]	VIRULIGN, reference-guide, codon-correct	2019
[88]	ViralMSA, reference-guide, real-time	2021

MultAlign is one of the earliest MSA methods using the iterative algorithm and progressive alignment developed by Corpet in 1988 [82]. MultAlign calculates the pairwise alignment and the distance matrix by FASTA. While constructing the guide tree, it uses a hierarchical clustering to cluster all similar sequences and generates a tree with leaf nodes. MUSCLE (MUltiple Sequence Comparison by Log-Expectation) is one of the widely used programs based on an iterative algorithm, which was developed by Edgar in 2004 [15]. The process of MUSCLE is mainly divided into the following three steps (As shown in Figure 6): (1) calculate the distance matrix D1, construct the guide tree TREE1 by UPGMA (Unweighted Pair-group Method with Arithmetic Mean) and produce an alignment MSA1 following TREE1. (2) Compute the Kimura distance matrix from MSA1 to produce a guide tree TREE2 and an accurate alignment. (3) Firstly, delete an edge near the root node on Tree2, generate two subtrees and calculate the subtree profiles [89]. Then, realign two profiles and obtain a new alignment. If the Sum-of-Pairs (SP) score of the alignment is higher than before, retain the alignment; otherwise, discard it. Finally, repeat this step until the accuracy reaches the criterion.

Figure 6

The flow chart of MUSCLE [15].

Open in new tab Download slide

To reduce the time cost, Katoh et al. developed MAFFT (Multiple Alignment based on Fast Fourier Transform) in 2002 [14, 90]. MAFFT uses Fourier transform to convert the sequence into a vector for processing, which simplifies the calculation process of the algorithm. The most significant advantage is that it can quickly identify the homologous region and shorten the overall time of the program. The results demonstrate that MAFFT in iterative refinement mode is over 100 times faster than T-Coffee when more than 60 sequences are without accuracy.

Due to the lack of accuracy in phylogeny, Liu et al. [85] proposed SATé, which first applies the affine gap penalty to POY (a phylogenetic analysis program) in 2009. Compared with ClustalW, SATé has higher efficiency while searching short alignment or tree pairs. Besides, its constructed tree is also more accurate than ClustalW. Furthermore, Liu et al. [91] proposed the 2nd version of SATé in 2012. The method adopts a different divide-and-conquer from the previous one to produce smaller related subsets. Therefore, Saté-II is superior to SATé-I in both performance and accuracy. In 2015, Mirarabet et al. [86] proposed PASTA based on SATé, and this method adopts a parallel strategy to expand the scale of SA.

To fast construct the codon-correct alignment, Libin et al. proposed VIRULIGN [87] concerning the characteristics of mutual conversion between codons and amino acids in 2019. Specifically, firstly, VIRULIGN computes pairwise alignments between each target sequence and the reference sequence. Secondly, the same as the first step, calculate the alignment between the target sequence and the reference sequence represented by amino acid. Then adjust the target sequence according to the difference between the two alignments produced in the 1st and 2nd steps. Finally, repeat from the 2nd step until no more differences are detected. Recently, Moshiri proposed ViralMSA to adopt the reference-guide strategy [88], which aims to align the whole genome in real-time. Moshiri claims that ViralMSA is orders of magnitude faster than VIRULIGN.

Compared with the progressive alignment algorithm, the most significant advantage of the iterative algorithm is that it can improve the alignment result, has good robustness and is insensitive to the number of sequences. However, the drawback is that it is easy to fall into the local optimization.

Heuristics algorithm

A heuristics algorithm is a stochastic iterative algorithm in a broad sense. The main idea is to calculate each feasible solution of the combinatorial optimization problem within the acceptable range of time or space. Heuristics have good performance in solving NP-complete problems, and it can be roughly divided into the genetic algorithm, swarm intelligence, simulated annealing, hidden Markov model (HMM) and meta-heuristics [92].

Simulated annealing

Simulated annealing originates from the principle of solid annealing. When the stable temperature rises, the internal particles become disordered, and when the temperature drops, the particles will gradually become an ordered state. Therefore, a simulated annealing algorithm is often used in combinatorial optimization problems [93].

In 1993, Ishikawa et al. [94] first proposed using simulated annealing to solve MSA. The results demonstrate that the method can get a better solution in a specific time than the traditional progressive alignment algorithm. In addition, this method lays the foundation for the subsequent process based on simulated annealing. For example, Hernndez-Gua et al. [95] described another simulated annealing-based way in 2005. They use the mapping between MSA and Directed Polymer in a Random Media [96] to allow the insertion or deletion of gaps in the process of modifying alignment and expand the space of feasible solutions.

Due to the slow convergence speed of the simulated annealing algorithm, it is often necessary to combine genetic algorithms or swarm intelligence algorithms to improve efficiency.

Genetic algorithm

A genetic algorithm is a method to simulate biological evolution in nature. It transforms the problem-solving process into gene selection, crossover, mutation and other techniques similar to natural chromosomes and obtains better optimization results through multiple iterations [97]. The relative algorithms are shown in Table 4.

Table 4

Open in new tab

Summary of genetic algorithm-based methods

References	Description	Year
[16]	Affine gap penalty, crossover, mutation	1996
[98]	String, simultaneousness, automated processing	1997
[99]	Divide-and-conquer, dynamic programming	2005
[100]	Global criterion for SA, evolutionary computation	2009
[101]	Multi-objective, structural information, non-gaps percentage totally conserved columns	2013
[102]	Multi-objective, affine gap penalty, similarity, support maximization	2014
[103]	Chaotic sequences, logistic map	2016
[104]	Chemical reaction optimization, single-point operator	2019
[105]	Optimization algorithm, gap insertion mutation, gap removal mutation	2020
[106]	Bi-objective, integer coding, Wilcoxon sign test	2020

References	Description	Year
[16]	Affine gap penalty, crossover, mutation	1996
[98]	String, simultaneousness, automated processing	1997
[99]	Divide-and-conquer, dynamic programming	2005
[100]	Global criterion for SA, evolutionary computation	2009
[101]	Multi-objective, structural information, non-gaps percentage totally conserved columns	2013
[102]	Multi-objective, affine gap penalty, similarity, support maximization	2014
[103]	Chaotic sequences, logistic map	2016
[104]	Chemical reaction optimization, single-point operator	2019
[105]	Optimization algorithm, gap insertion mutation, gap removal mutation	2020
[106]	Bi-objective, integer coding, Wilcoxon sign test	2020

Table 4

Open in new tab

Summary of genetic algorithm-based methods

References	Description	Year
[16]	Affine gap penalty, crossover, mutation	1996
[98]	String, simultaneousness, automated processing	1997
[99]	Divide-and-conquer, dynamic programming	2005
[100]	Global criterion for SA, evolutionary computation	2009
[101]	Multi-objective, structural information, non-gaps percentage totally conserved columns	2013
[102]	Multi-objective, affine gap penalty, similarity, support maximization	2014
[103]	Chaotic sequences, logistic map	2016
[104]	Chemical reaction optimization, single-point operator	2019
[105]	Optimization algorithm, gap insertion mutation, gap removal mutation	2020
[106]	Bi-objective, integer coding, Wilcoxon sign test	2020

References	Description	Year
[16]	Affine gap penalty, crossover, mutation	1996
[98]	String, simultaneousness, automated processing	1997
[99]	Divide-and-conquer, dynamic programming	2005
[100]	Global criterion for SA, evolutionary computation	2009
[101]	Multi-objective, structural information, non-gaps percentage totally conserved columns	2013
[102]	Multi-objective, affine gap penalty, similarity, support maximization	2014
[103]	Chaotic sequences, logistic map	2016
[104]	Chemical reaction optimization, single-point operator	2019
[105]	Optimization algorithm, gap insertion mutation, gap removal mutation	2020
[106]	Bi-objective, integer coding, Wilcoxon sign test	2020

Notredame and Higgins [16] first proposed and developed Sequence Alignment by Genetic Algorithm (SAGA) based on genetic algorithm in 1996. The method continuously optimizes the alignment through crossover and insertion operations until the alignment does not change, as shown in Figure 7. The result indicates that SAGA is suitable for most objective functions and can achieve better alignment. Different from SAGA, Chen et al. proposed a method combining genetic algorithm, divide-and-conquer algorithm and dynamic programming [99]. The method divides the sequences vertically into multiple subsequences by a genetic algorithm. Then compute the sub-alignments of all subsequences using dynamic programming. Finally, splice the sub-alignments into a complete alignment.

Figure 7

The figure explains how to use the existing alignment to generate a better alignment in SAGA. Given two alignments, two new alignments are combined by crossing and inserting gaps. Finally, the better one is selected to join the next generation [16].

Open in new tab Download slide

Some efforts also focus on the objective function to improve genetic algorithm. For example, Arenas-Daz and Ochoterena proposed a new objective function GLObal Criterion for Sequence Alignment (GLOCSA), in 2009. Experiments show that GLOCSA can effectively improve the performance of the alignment [100]. Later, Ortuno applied a multi-objective function to the genetic algorithm, and the result indicates that the genetic algorithm with multi-objective function has more obvious advantages than the traditional genetic algorithm [101].

Moreover, Gao et al. adopted a chaos algorithm to optimize the genetic algorithm in 2016., the premature phenomenon in the genetic algorithm is effectively solved Using the ergodicity and inherent randomness of chaotic iteration [103]. Different from Gao’s method, Chatterjee uses a chemical reaction optimization algorithm to optimize the population in the evolution process, thus improving the efficiency [104]. In addition, there also are some genetic algorithm-based methods, such as Mishra et al. [105] present progressive technique to enhance the value of the alignment, and Chowdhury et al. [106] combine with bi-objective function to optimize the selection.

The genetic algorithm can give feasible solutions in a particular time, but the most feasible solutions are local optimal solutions. Therefore, the genetic algorithm needs to be combined with other algorithms to improve stability and accuracy.

Swarm intelligence algorithm

Swarm intelligence algorithm simulates the biological behavior of groups such as insects, animals or birds in nature. By definition, any algorithm or strategy inspired by natural behavior can be regarded as a swarm intelligence algorithm. Since the advent of the swarm intelligence algorithm, dozens of algorithms have been derived, including ant colony algorithm, particle swarm optimization (PSO) and artificial bee colony. In the field of MSA, swarm intelligence optimization is also applied, as shown in Table 5.

Table 5

Open in new tab

Summary of swarm intelligence algorithm-based methods

References	Description	Year
[107]	Binary SAGA	2009
[108]	Dispersion graph, ant colony	2009
[109]	Ant colony, genetic algorithm, planar graph	2010
[110]	SAGA	2011
[111]	Artificial fish swarm	2014
[112]	Bacterial foraging optimization, genetic algorithm	2017
[113]	SAGA, simulated annealing	2018
[114]	Multi-strategy artificial bee colony	2018
[115]	Flower pollination algorithm, profile algorithm	2019
[116]	Multi-objective, artificial fish swarm	2020
[117]	WOAMSA, whale optimization algorithm	2020
[118]	Meta-heuristic, dynamic programming, dynamic simulated SAGA	2021

References	Description	Year
[107]	Binary SAGA	2009
[108]	Dispersion graph, ant colony	2009
[109]	Ant colony, genetic algorithm, planar graph	2010
[110]	SAGA	2011
[111]	Artificial fish swarm	2014
[112]	Bacterial foraging optimization, genetic algorithm	2017
[113]	SAGA, simulated annealing	2018
[114]	Multi-strategy artificial bee colony	2018
[115]	Flower pollination algorithm, profile algorithm	2019
[116]	Multi-objective, artificial fish swarm	2020
[117]	WOAMSA, whale optimization algorithm	2020
[118]	Meta-heuristic, dynamic programming, dynamic simulated SAGA	2021

Table 5

Open in new tab

Summary of swarm intelligence algorithm-based methods

References	Description	Year
[107]	Binary SAGA	2009
[108]	Dispersion graph, ant colony	2009
[109]	Ant colony, genetic algorithm, planar graph	2010
[110]	SAGA	2011
[111]	Artificial fish swarm	2014
[112]	Bacterial foraging optimization, genetic algorithm	2017
[113]	SAGA, simulated annealing	2018
[114]	Multi-strategy artificial bee colony	2018
[115]	Flower pollination algorithm, profile algorithm	2019
[116]	Multi-objective, artificial fish swarm	2020
[117]	WOAMSA, whale optimization algorithm	2020
[118]	Meta-heuristic, dynamic programming, dynamic simulated SAGA	2021

References	Description	Year
[107]	Binary SAGA	2009
[108]	Dispersion graph, ant colony	2009
[109]	Ant colony, genetic algorithm, planar graph	2010
[110]	SAGA	2011
[111]	Artificial fish swarm	2014
[112]	Bacterial foraging optimization, genetic algorithm	2017
[113]	SAGA, simulated annealing	2018
[114]	Multi-strategy artificial bee colony	2018
[115]	Flower pollination algorithm, profile algorithm	2019
[116]	Multi-objective, artificial fish swarm	2020
[117]	WOAMSA, whale optimization algorithm	2020
[118]	Meta-heuristic, dynamic programming, dynamic simulated SAGA	2021

Long et al. [107] described an MSA method based on binary SAGA in 2009. This method starts with multiple alignments and then continuously modifies each alignment until convergence. The results show that most of the alignment scores of their ways are better than ClustalW, DiAlign, SAGA and T-Coffee. Besides, Chaabane proposed Particle Swarm Optimization and Simulated Annealing (PSOSA) hybrid model in 2018 [113], which combines PSO and simulated annealing to fully use the former’s exploration ability and the latter’s development ability. The experiment shows that the performance is better than MUSCLE, MAFFT and ClustalW programs.

In addition to PSO, there are many optimization algorithms based on animal behavior in MSA. For example, in 2009, Chen et al. proposed an ant colony algorithm-based method [108]. They use a dispersion graph to represent the alignment and then an ant colony algorithm to find an optimal path. The experimental results show that this method can obtain a better solution than ClustalX. Meanwhile, Xiang et al. [109] proposed a similar approach based on genetic algorithm and planar graph to optimize the search path. Kuang et al. [114] proposed an artificial bee colony-based method that adopts a multi-strategy to balance the global exploration and the local exploitation in 2018. The strategies include the following three aspects: tent chaos initialization population, neighborhood search and tournament selection. The results show that this method has better performance and biological characteristics and has strong robustness.

Some studies also adopt less commonly used algorithms. In 2017, Manikandan and Duraisamy proposed a method combining a bacterial foraging algorithm and genetic algorithm based on the dispersion optimization principle of the individual and population of Escherichia coli[112]. They obtain individuals with higher fitness through bacterial foraging behavior to improve accuracy and speed. The result shows that this method is better than other swarm intelligence algorithms on multiple benchmarks. In 2019, Hussein et al. adopted the flower pollination algorithm to optimize the MSA [115]. Besides, they also proposed a new profile algorithm to improve alignment quality. Finally, compared with other methods, this method is superior to other ways except for the MSAProbs.

Hidden Markov model

The HMM) is a statistical analysis model, which is used to describe a Markov process with unknown parameters [119]. In short, HMM can calculate the state transition probability matrix by analyzing the object’s previous and current observable states and inferring the next state of the object through the probability matrix. Due to the Markov property of MSA, some researchers proposed several MSA methods based on HMM, as shown in Table 6.

Table 6

Open in new tab

Summary of HMM-based methods

References	Description	Year
[17]	ProbCons, probabilistic consistency-based	2005
[120]	MUMMALS, local structural information	2006
[121]	PROMALS, database search, secondary structure prediction	2007
[122]	MSAProbs, partition function	2010
[123]	Clustal Omega, mBed, profile	2011
[124]	RDPSO, SAGA	2014
[125]	ProbPFP, SAGA	2019

References	Description	Year
[17]	ProbCons, probabilistic consistency-based	2005
[120]	MUMMALS, local structural information	2006
[121]	PROMALS, database search, secondary structure prediction	2007
[122]	MSAProbs, partition function	2010
[123]	Clustal Omega, mBed, profile	2011
[124]	RDPSO, SAGA	2014
[125]	ProbPFP, SAGA	2019

Table 6

Open in new tab

Summary of HMM-based methods

References	Description	Year
[17]	ProbCons, probabilistic consistency-based	2005
[120]	MUMMALS, local structural information	2006
[121]	PROMALS, database search, secondary structure prediction	2007
[122]	MSAProbs, partition function	2010
[123]	Clustal Omega, mBed, profile	2011
[124]	RDPSO, SAGA	2014
[125]	ProbPFP, SAGA	2019

References	Description	Year
[17]	ProbCons, probabilistic consistency-based	2005
[120]	MUMMALS, local structural information	2006
[121]	PROMALS, database search, secondary structure prediction	2007
[122]	MSAProbs, partition function	2010
[123]	Clustal Omega, mBed, profile	2011
[124]	RDPSO, SAGA	2014
[125]	ProbPFP, SAGA	2019

ProCons (Probabilistic Consistency) is the first program to use HMM for MSA, developed by Do et al. in 2005 [17]. ProCons adopts a novel objective function based on probability consistency, which combines the posterior probability matrix derived from HMM and the alignment consistency to integrate the conservative information of sequences. From the results, ProCons is more accurate than Align-m, ClustalW, MUSCLE and other programs. In 2006, Roshan and Livesay made some improvements based on ProCons and proposed Probalign (Probabilistic alignment) [84]. The difference is that ProCons uses the matched partition function interested of HMM to calculate the posterior probability matrix.

Efficient computation with high accuracy is an urgent need. Therefore, to solve this problem, Liu et al. [122] developed a program, MASProbs based on ProCons and Proalign. To enhance the performance, MSAProbs calculates the weight of the sequence, which participates in the calculation of probability consistency score and progressive alignment. The result shows that MASProbs has higher accuracy than ClustalW, MAFFT, MUSCLE, ProCons and Probalign. Since then, the author proposed MSAProbs-MPI, which introduces parallel and distributed memory based on MSAProbs and can cope with larger datasets [126, 127]. In 2019, Zhan et al. [125] proposed the ProbPFP method, which mainly refers to the above techniques. However, ProbPFP is unique in that it introduces PSO to optimize the parameters of HMM model.

ClustalW has great limitations in large-scale data; therefore, Sievers et al. developed Clustal Omega in 2011. Clustal Omega is based on the mBed algorithm and HMM. Clustal Omega can process tens of thousands of sequences, and its accuracy is greatly improved. Besides, to improve the alignment efficiency, Clustal Omega can use the existing alignment in the database to assist new sequences alignment [123, 128]. In 2020, Pachetti et al. [129] applied clustal Omega to the MSA of genomes of SARS-CoV-2 and found eight new recurrent mutations of virus.

Machine learning

Machine learning is still in the initial stage in MSA. In 2016, Mircea et al. [18] first proposed an MSA method based on reinforcement learning. This method is similar to the progressive alignment algorithm to find an optimal alignment order and then align the sequences in order. They use a reinforcement learning method based on dynamic programming–Q-learning to calculate the optimal alignment order of sequences and use this order to construct the alignment gradually. In 2019, Jafari et al. [20] replaced Q-learning with the A3C (Asynchronous Advantage Actor Critic) model in deep reinforcement learning based on Mircea, which improved the convergence speed and reduced the possibility of falling into local optimum to a certain extent.

Then, Ramakrishnan et al. [19] proposed an MSA method based A3Cin 2018, called RLALIGN. Unlike Mircea’s method, RLALIGN takes the current alignment as the state and the moving direction of the nucleotide in the alignment as the action. Although this method can better align some small datasets, the process may not converge with the sequence number or length increase.

Machine learning has valuable achievements in many fields of bioinformatics, but there is more work to be done across MSA. In particular, there are three challenges: (1) lack of sufficient optimal alignment samples for learning. (2) There is no suitable loss function. (3) It is challenging to build a generalized model due to the variability of the length and quantity of sequences.

Divide-and-conquer algorithm

Divide-and-conquer is a relatively simple algorithm whose idea is to break down a whole problem into multiple independent sub-problems of the related or same type for processing. The divide-and-conquer algorithm is a standard solution for solving MSA, and the relevant methods are shown in Table 7.

Table 7

Open in new tab

Summary of divide-and-conquer-based methods

References	Description	Year
[21]	SpliVert, splitting-splicing, refinement	2020
[22]	FAME, common area, splice	2020
[130]	MAGUS, divide-and-conquer, graph clustering	2021
[131]	FMAlign, FM-index, common segments	2021

References	Description	Year
[21]	SpliVert, splitting-splicing, refinement	2020
[22]	FAME, common area, splice	2020
[130]	MAGUS, divide-and-conquer, graph clustering	2021
[131]	FMAlign, FM-index, common segments	2021

Table 7

Open in new tab

Summary of divide-and-conquer-based methods

References	Description	Year
[21]	SpliVert, splitting-splicing, refinement	2020
[22]	FAME, common area, splice	2020
[130]	MAGUS, divide-and-conquer, graph clustering	2021
[131]	FMAlign, FM-index, common segments	2021

References	Description	Year
[21]	SpliVert, splitting-splicing, refinement	2020
[22]	FAME, common area, splice	2020
[130]	MAGUS, divide-and-conquer, graph clustering	2021
[131]	FMAlign, FM-index, common segments	2021

Zhan et al. [21] proposed an optimization method for MSA based on splitting and splicing in 2020, which is called SpliVert (Splitting-splicing Vertically). The process is as shown in Figure 8. Firstly, obtain an alignment by other MSA methods. Second, divide the alignment into three parts from the vertical direction and then remove the gap in the middle part and realign the part. Finally, merge the three regions and obtain the refined alignment. The experiment shows that the alignment optimized is better than those obtained by the original MSA method or program.

Figure 8

This picture explains the optimization process of SpliVert. The general process can be summarized to four steps: compute the alignment, split the alignment, realign the middle part and merge these parts into alignment [21].

Open in new tab Download slide

To reduce the time and memory consumption of long SA, Naznooshsadat et al. [22] proposed FAME (FAst and MEmory), an MSA method based on splitting and splicing. Its process consists of the following three steps: (1) divide the set of sequences into groups of subsequences vertically and maintain the order of these subsequences on the original sequences. (2) Align the subsequence at common regions and align the unaligned subsequence with other MSA methods. (3) Merge all aligned subsequences. The results show that the FAME can process large-scale data four times faster than the original method.

The alignment method for large-scale sequences has always been a hot spot. For example, Smirnov et al. [130] proposed an MSA method based on graph clustering and divide-and-conquer algorithm—MAGUS (MSA using Graph clUStering) in 2021. This method is similar to PASTA. It also introduces divide-and-conquer to construct the alignment. The difference between PASTA and MAGUS lies in the merging stage of the subset. Smirnov uses a merging strategy of combining disjoint alignment—GMC (Graph Clustering Merge) to merge subsets. In addition, benefit from GMC, MAGUS does not need iteration to improve accuracy like PASTA. Their experiments show that the method has strong robustness, and it not only improves the accuracy and exceeds PASTA in speed. Furthermore, Shen et al. [132] introduced HMM based on Smirnov’s work to further enhance the accuracy of alignment in 2021.

What’s more, Liu et al. proposed a technique, FMAlign, which is based on divide-and-conquer and FM-index. Their method has four steps: 1. building FM-index; 2. querying anchors; 3. finding optimal chain; 4. breaking sequence and aligning. Compared with MAFFT, Halign and FAME, FMAlign has a shorter running time and higher accuracy on long sequences. Besides, their results show that FMAlign is applicable to large-scale sequences alignment [131].

Because divide-and-conquer can divide a whole problem into multiple sub-problems that can be handled directly, its performance has apparent advantages over other algorithms. In addition, the algorithm can flexibly combine with parallel techniques to further improve performance because these sub-problems have the same or similar types and are independent.

Objective functions

In MSA, an objective function is needed to judge whether the alignment results meet the expectations. SP is the most common and simple objective function [10]. Its formula is shown in Equation 1, where |$m$| is the length of the alignment, |$n$| is the number of sequences. |$c_i^j$| and |$c_i^k$| denotes the |$i^{th}$| unit in the |$j^{th}$| and |$k^{th}$| sequences, respectively. |$s(c_i^j,c_i^k )$| represents the matching score between two units that can be set according to Percent Accepted Mutation (PAM) or BLOcks SUbstitution Matrix (BLOSUM).

$$\begin{align}& SP = \sum_{i=1}^m \sum_{j=1}^{n-1} \sum_{k=j+1}^n (c_i^j, c_i^k) \end{align}$$

(1)

Another commonly used function is the WSP, which adds sequence weight based on SP function [11], and its formula is shown in Equation 2, where |$w_(j,k)$| is the product of the sequence weights of |$c_i^j$| and |$c_i^k$|⁠, and the sequence weight is calculated from the guide tree.

$$\begin{align}& WSP = \sum_{i=1}^m \sum_{j=1}^{n-1} \sum_{k=j+1}^n w_{j,k}*(c_i^j, c_i^k) \end{align}$$

(2)

In addition to the weight difference, WSP also introduces the gap penalty. The gap penalty comprises the gap opening penalty and gap extension penalty. The former represents the penalty for the appearance of anyone gap, whereas the latter is the penalty points obtained by the extension of any gap starting from the opening gap. At this time, the score of |$s(c_i^j,c_i^k)$| has the following three cases. The first is to set the score according to PAM or BLOSUM when neither |$c_i^j$| nor |$c_i^k$| are gaps. The second is to set |$s(c_i^j,c_i^k )=0$| when both are gaps. And the last is to set |$s(c_i^j,c_i^k )=G$| when either one of |$c_i^j$| and |$c_i^k$| is gap. The calculation of |$G$| is shown in Equation 3, where |$gop$| is the opening penalty for the gap, |$gep$| is the extension penalty value for the gap and |$n$| is the length of consecutive gaps.

$$\begin{align}& G = gop + gep * n \end{align}$$

(3)

Another objective function is the consistency-based objective function for alignment evaluation (COFFEE). The higher the score calculated, the higher the biological correlation of the alignment [133]. Before calculating the COFFEE score, a library of pairwise alignments needs to be established by FSSP database (Fold Classification Based on Structure-Structure Alignment of Proteins) to provide sufficient information for subsequent MSA. The formula of this objective function is shown in Equation 4, where |$score(A_{(j,k)})$| is the number of aligned pairs of residues that are shared between |$A_{(j,k)}$| and the library, |$len(A_{(j,k)})$| is the alignment length of |$A_{(j,k)}$|⁠, and |$w_{(j,k)}$| is the weight for pairwise alignments, such as the similarity of two sequences.

$$\begin{align}& COFFEE = \frac{\sum_{j=1}^{n-1}\sum_{k=j+1}^n w_{j,k} * score(A_{(j,k)})} {\sum_{j=1}^{n-1}\sum_{k=j+1}^n w_{j,k} * len(A_{(j,k)})} \end{align}$$

(4)

The total column (TC) score is a relatively simple evaluation metric indicating the similarity between two alignments. The score is the ratio of correctly aligned columns between the test alignment and the reference alignment to the total number of columns in the reference alignment [134]. Take two alignments in Figure 9 as an example. There are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, the TC score is 0.5 (3 divided by 6). The value range of the TC score is |$[0,1]$|⁠. The higher the value, the more agreement the test alignment with the reference alignment.

Figure 9

Two alignments for illustrating the calculation of TC score. In this figure, there are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, TC score is 0.5 (3 divided by 6).

Open in new tab Download slide

Programs

The commonly used programs for MSA include ClustalW, MUSCLE, MAFFT and T-Coffee, which are shown in Table 8.

Table 8

Open in new tab

Summary of programs

Programs	Year	Website
ClustalW	1994	http://www.clustal.org/
DiAlign	1998	http://dialign.gobics.de/
T-Coffee	2000	http://tcoffee.crg.cat/
MAFFT	2002	https://mafft.cbrc.jp/alignment/software/
MUSCLE	2004	http://www.drive5.com/muscle/
ProbCons	2005	http://probcons.stanford.edu/
Kalign	2005	https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign	2006	http://probalign.njit.edu/standalone.html
PRANK	2008	http://wasabiapp.org/software/prank/
MSAProbs	2010	http://msaprobs.sourceforge.net/
NX4	2019	https://www.nx4.io
FAME	2020	http://github.com/naznoosh/msa

Programs	Year	Website
ClustalW	1994	http://www.clustal.org/
DiAlign	1998	http://dialign.gobics.de/
T-Coffee	2000	http://tcoffee.crg.cat/
MAFFT	2002	https://mafft.cbrc.jp/alignment/software/
MUSCLE	2004	http://www.drive5.com/muscle/
ProbCons	2005	http://probcons.stanford.edu/
Kalign	2005	https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign	2006	http://probalign.njit.edu/standalone.html
PRANK	2008	http://wasabiapp.org/software/prank/
MSAProbs	2010	http://msaprobs.sourceforge.net/
NX4	2019	https://www.nx4.io
FAME	2020	http://github.com/naznoosh/msa

Table 8

Open in new tab

Summary of programs

Programs	Year	Website
ClustalW	1994	http://www.clustal.org/
DiAlign	1998	http://dialign.gobics.de/
T-Coffee	2000	http://tcoffee.crg.cat/
MAFFT	2002	https://mafft.cbrc.jp/alignment/software/
MUSCLE	2004	http://www.drive5.com/muscle/
ProbCons	2005	http://probcons.stanford.edu/
Kalign	2005	https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign	2006	http://probalign.njit.edu/standalone.html
PRANK	2008	http://wasabiapp.org/software/prank/
MSAProbs	2010	http://msaprobs.sourceforge.net/
NX4	2019	https://www.nx4.io
FAME	2020	http://github.com/naznoosh/msa

Programs	Year	Website
ClustalW	1994	http://www.clustal.org/
DiAlign	1998	http://dialign.gobics.de/
T-Coffee	2000	http://tcoffee.crg.cat/
MAFFT	2002	https://mafft.cbrc.jp/alignment/software/
MUSCLE	2004	http://www.drive5.com/muscle/
ProbCons	2005	http://probcons.stanford.edu/
Kalign	2005	https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign	2006	http://probalign.njit.edu/standalone.html
PRANK	2008	http://wasabiapp.org/software/prank/
MSAProbs	2010	http://msaprobs.sourceforge.net/
NX4	2019	https://www.nx4.io
FAME	2020	http://github.com/naznoosh/msa

Among the MSA programs, Clustal is the earliest program used for MSA and now, its improved version ClustalW is still commonly used. ClustalW adopts a simple progressive alignment algorithm [11], which can not guarantee the optimal alignment. MUSCLE is an MSA program based on progressive and iterative algorithm [15], which is also one of the fastest MSA programs at present. MUSCLE is better than ClustalW in speed and accuracy. For example, for 10 000 sequences with a length of 350|$bp$|⁠, MUSCLE can compute the alignment in >10 min, whereas ClustalW may take 1 year.

MAFFT is the first program to adopt fast Fourier transform in MSA. The first version of MAFFT was released in 2002 [14]. MAFFT adds various options and alignment modes, such as FFT-NS-1, FFT-NS-2 and FFT-NS-i and researchers can select appropriate options for different occasions. In alignment accuracy, MAFFT is higher than MUSCLE. T-Coffee is one of the most popular programs [71], which has high scalability and abundant functions. T-Coffee can integrate various sequences information during alignment, so its accuracy is more elevated than ClustalW, MUSCLE and MAFFT.

Benchmarks

Benchmark is the golden criterion for judging the MSA method. Table 9 [123] lists the benchmarking results of the popular MSA programs on BAliBASE (Benchmark Alignment dataBASE) [135]. BAliBASE consists of five groups of references, containing 82, 41, 30 and 16 alignments, respectively. Besides, the first group is divided into two subgroups, R1-1 and R1-2, which contain 38 and 44 alignments, respectively. In Table 9. the TC scores of each reference are in columns 2–7. The average TC score general references are given in the second last column and the total run time for all references is given in the last column. The value range of the TC score is |$[0,1]$| and the higher the value, the more agreement the result of the program with the benchmark. As can be seen from the table, MSAProbs has the highest accuracy, but its run time is higher than most programs. Kalign has the shortest run time, but the accuracy is not satisfactory. The worst program is PRANK, which runs much longer than other programs and has a lower score and run time.

Table 9

Open in new tab

Summary of the benchmarking results of several MSA program on BAliBASE 3.0 [123]

Program	R1-1	R1-2	R2	R3	R4	R5	Avg Score	Tot time(s)
MSAProbs [122]	0.441	0.865	0.464	0.607	0.622	0.608	0.607	12382.00
Probalign [84]	0.453	0.862	0.439	0.566	0.603	0.549	0.589	10095.20
MAFFT [14]	0.439	0.831	0.450	0.581	0.605	0.591	0.588	1475.40
Probcons [17]	0.417	0.855	0.406	0.544	0.532	0.573	0.558	13086.30
Clustal Omega [123]	0.358	0.789	0.450	0.575	0.579	0.533	0.554	539.91
T-Coffee [71]	0.410	0.848	0.402	0.491	0.545	0.587	0.551	81041.50
Kalign [72]	0.365	0.790	0.360	0.476	0.504	0.435	0.501	21.88
MUSCLE [15]	0.318	0.804	0.350	0.409	0.450	0.460	0.475	789.57
FSA [136]	0.270	0.818	0.187	0.259	0.474	0.398	0.419	53648.10
DiAlign [12]	0.265	0.696	0.292	0.312	0.441	0.425	0.415	3977.44
PRANK [137]	0.223	0.680	0.257	0.321	0.360	0.356	0.376	128355.00
ClustalW [11]	0.227	0.712	0.220	0.272	0.396	0.308	0.374	766.47

Program	R1-1	R1-2	R2	R3	R4	R5	Avg Score	Tot time(s)
MSAProbs [122]	0.441	0.865	0.464	0.607	0.622	0.608	0.607	12382.00
Probalign [84]	0.453	0.862	0.439	0.566	0.603	0.549	0.589	10095.20
MAFFT [14]	0.439	0.831	0.450	0.581	0.605	0.591	0.588	1475.40
Probcons [17]	0.417	0.855	0.406	0.544	0.532	0.573	0.558	13086.30
Clustal Omega [123]	0.358	0.789	0.450	0.575	0.579	0.533	0.554	539.91
T-Coffee [71]	0.410	0.848	0.402	0.491	0.545	0.587	0.551	81041.50
Kalign [72]	0.365	0.790	0.360	0.476	0.504	0.435	0.501	21.88
MUSCLE [15]	0.318	0.804	0.350	0.409	0.450	0.460	0.475	789.57
FSA [136]	0.270	0.818	0.187	0.259	0.474	0.398	0.419	53648.10
DiAlign [12]	0.265	0.696	0.292	0.312	0.441	0.425	0.415	3977.44
PRANK [137]	0.223	0.680	0.257	0.321	0.360	0.356	0.376	128355.00
ClustalW [11]	0.227	0.712	0.220	0.272	0.396	0.308	0.374	766.47

Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references, and total running times. The best results in each column are put in bold.

Table 9

Open in new tab

Summary of the benchmarking results of several MSA program on BAliBASE 3.0 [123]

Program	R1-1	R1-2	R2	R3	R4	R5	Avg Score	Tot time(s)
MSAProbs [122]	0.441	0.865	0.464	0.607	0.622	0.608	0.607	12382.00
Probalign [84]	0.453	0.862	0.439	0.566	0.603	0.549	0.589	10095.20
MAFFT [14]	0.439	0.831	0.450	0.581	0.605	0.591	0.588	1475.40
Probcons [17]	0.417	0.855	0.406	0.544	0.532	0.573	0.558	13086.30
Clustal Omega [123]	0.358	0.789	0.450	0.575	0.579	0.533	0.554	539.91
T-Coffee [71]	0.410	0.848	0.402	0.491	0.545	0.587	0.551	81041.50
Kalign [72]	0.365	0.790	0.360	0.476	0.504	0.435	0.501	21.88
MUSCLE [15]	0.318	0.804	0.350	0.409	0.450	0.460	0.475	789.57
FSA [136]	0.270	0.818	0.187	0.259	0.474	0.398	0.419	53648.10
DiAlign [12]	0.265	0.696	0.292	0.312	0.441	0.425	0.415	3977.44
PRANK [137]	0.223	0.680	0.257	0.321	0.360	0.356	0.376	128355.00
ClustalW [11]	0.227	0.712	0.220	0.272	0.396	0.308	0.374	766.47

Program	R1-1	R1-2	R2	R3	R4	R5	Avg Score	Tot time(s)
MSAProbs [122]	0.441	0.865	0.464	0.607	0.622	0.608	0.607	12382.00
Probalign [84]	0.453	0.862	0.439	0.566	0.603	0.549	0.589	10095.20
MAFFT [14]	0.439	0.831	0.450	0.581	0.605	0.591	0.588	1475.40
Probcons [17]	0.417	0.855	0.406	0.544	0.532	0.573	0.558	13086.30
Clustal Omega [123]	0.358	0.789	0.450	0.575	0.579	0.533	0.554	539.91
T-Coffee [71]	0.410	0.848	0.402	0.491	0.545	0.587	0.551	81041.50
Kalign [72]	0.365	0.790	0.360	0.476	0.504	0.435	0.501	21.88
MUSCLE [15]	0.318	0.804	0.350	0.409	0.450	0.460	0.475	789.57
FSA [136]	0.270	0.818	0.187	0.259	0.474	0.398	0.419	53648.10
DiAlign [12]	0.265	0.696	0.292	0.312	0.441	0.425	0.415	3977.44
PRANK [137]	0.223	0.680	0.257	0.321	0.360	0.356	0.376	128355.00
ClustalW [11]	0.227	0.712	0.220	0.272	0.396	0.308	0.374	766.47

Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references, and total running times. The best results in each column are put in bold.

Currently, the most used benchmark is BAliBASE [135], which provides multiple reference sets that have been optimized. Up to now, BAliBASE has three versions. The 1st version provides four reference sets, each containing multiple reference alignments. The 2nd version provides eight reference sets, and the 3rd version adds an update protocol to allow online updates of the reference set. In addition to BAliBASE, other common benchmark sets include SABMark (Sequence Alignment BenchMark) [138], OXBench (OXford Benchmark) [139], SMART (Simple Modular Architecture Research Tool) [140], PREFAB (Protein REFerence Alignment Benchmark) [15] and QuanTest [141].

Challenges and opportunities

Although the existing methods for MSA can cope with the most alignment tasks, these methods are not good enough to provide comprehensive solutions for any analysis scenario in any condition. There are still some challenges and opportunities for future directions.

Scale of sequence

Compared with small-scale and straightforward data in most bioinformatics fields, the MSA data may have a large discrepancy in the scale of each alignment. At the same time, with the rapid accumulation of sequences, MSA may face hyper-scale sequences. For example, Koyama et al. [142] collected >10 000 genomes to analyze the variation of SARS-CoV-2. However, the existing MSA methods are challenging to cope with hyper-scale sequences. Even if possible, these methods have low accuracy or high consumption of memory and time. This problem is a big challenge to develop a technique that can deal with hyper-scale sequences while ensuring accuracy and consumption.

Design of method

Most MSA methods are flexible and complicated. Various efforts are made to improve accuracy and performance, and several works adopt more complex machine learning, such as reinforcement learning and natural language processing. In real-life sequence analysis scenarios, the scale and length of sequences may differ, requiring that the method has good generalization ability and robustness to ensure accuracy. In addition, using friendly visualization technology is also essential for users. Making the MSA method more reliable and usable is a promising issue.

Correctness of alignment

The correctness-sensitive biological sequence analysis scenarios put forward higher requirements for the biological correctness of alignment. The most existing MSA methods are difficult to predict the evolution process of sequences correctly and lack explanations. For these problems, there are three main reasons as follows: the randomness of the evolutionary process, the limitations of the bioinformatics method and the lack of an accurate evolutionary model [143]. Therefore, focusing on the above points is more helpful for applying MSA to biological sequence analysis scenarios.

Conclusion

MSA is widely used in many bioinformatics scenarios because it can extract potential information from sequences. In this work, we introduced the base knowledge of MSA from several aspects. Furthermore, we described representative applications of MSA for database searching, phylogenetics, genomics, metagenomics and proteomics. We also categorize and analyze the classical and state-of-the-art algorithms for MSA, such as progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Finally, We further discussed the challenges of this field and promising directions. This review will provide valuable insight and serve as a starting point for studying MSA in future research.

Key Points

As many biomedical sequences have been accumulated, MSA is widely applied to extract knowledge from these massive sequences, such as evolution, structure and function.
We systematically introduce MSA’s knowledge from several aspects, including background, application, database, algorithm, metric, program and benchmark.
We also list and discuss the current challenges and opportunities for future directions in bioinformatics.
As a comprehensive review of current works, this paper will provide valuable insight and serve as a launching point for researchers in their studies.

Funding

National Natural Science Foundation of China (Grant No. 61922020).

Author Biographies

Yongqing Zhang He is an associate professor in the School of Computer Science at the Chengdu University of Information Technology. He is a senior member of CCF. His research interests include machine learning and bioinformatics.

Qiang Zhang He is a graduate student in the School of Computer Science at the Chengdu University of Information Technology. His research interests include deep learning and bioinformatics.

Jiliu Zhou He is a professor in the School of Computer Science at the Chengdu University of Information Technology. His research interests include intelligent computing and image processing.

Quan Zou He is a professor in the Institute of Fundamental and Frontier Science at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics and machine learning.

References

1.

Wang

T

,

Liang

C

,

Hou

Y

, et al.

Small design from big alignment: engineering proteins with multiple sequence alignment as the starting point

.

Biotechnol Lett

2020

;

42

(

8

):

1305

–

15

.

2.

Makigaki

S

,

Ishida

T

.

Sequence alignment generation using intermediate sequence search for homology modeling

.

Comput Struct Biotechnol J

2020

;

18

:

2043

–

50

.

3.

Huang

M

,

Shah

ND

,

Yao

L

.

Evaluating global and local sequence alignment methods for comparing patient medical records

.

BMC Med Inform Decis Mak

2019

;

19

(

Suppl 6

):

263

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

4.

Baharav

TZ

,

Kamath

GM

,

Tse

DN

, et al.

Spectral jaccard similarity: a new approach to estimating pairwise sequence alignments

.

Patterns (N Y)

2020

;

1

(

6

):100081.

Google Scholar

OpenURL Placeholder Text

WorldCat

5.

Bawono

P

,

Dijkstra

M

,

Pirovano

W

, et al.

Multiple sequence alignment

.

Methods Mol Biol

2017

;

1525

:

167

–

89

.

6.

Needleman

SB

,

Wunsch

CD

.

A general method applicable to the search for similarities in the amino acid sequence of two proteins

.

J Mol Biol

1970

;

48

(

3

):

443

–

53

.

7.

Smith

TF

,

Waterman

MS

.

Identification of common molecular subsequences

.

J Mol Biol

1981

;

147

(

1

):

195

–

7

.

8.

Chatzou

M

,

Magis

C

,

Chang

JM

, et al.

Multiple sequence alignment modeling: methods and applications

.

Brief Bioinform

2016

;

17

(

6

):

1009

–

23

.

9.

Warnow

T

.

Revisiting evaluation of multiple sequence alignment methods

.

Methods Mol Biol

2021

;

2231

:

299

–

317

.

10.

Altschul

SF

,

Lipman

DJ

.

Trees, stars, and multiple biological sequence alignment

.

SIAM J Appl Math

1989

;

49

(

1

):

197

–

209

.

Google Scholar

Crossref

WorldCat

11.

Thompson

JD

,

Higgins

DG

,

Gibson

TJ

.

Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

.

Nucleic Acids Res

1994

;

22

(

22

):

4673

–

80

.

12.

Morgenstern

B

,

Frech

K

,

Dress

A

, et al.

Dialign: finding local similarities by multiple sequence alignment

.

Bioinformatics

1998

;

14

(

3

):

290

–

4

.

13.

Lassmann

T

,

Sonnhammer

E

.

Kalign: an accurate and fast multiple sequence alignment algorithm

.

BMC Bioinformatics

2005

;

6

:

298

.

14.

Katoh

K

,

Misawa

K

,

Kuma

K-i

, et al.

Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform

.

Nucleic Acids Res

2002

;

30

:

3059

–

66

.

15.

Edgar

RC

.

Muscle: multiple sequence alignment with high accuracy and high throughput

.

Nucleic Acids Res

2004

;

32

(

5

):

1792

–

7

.

16.

Notredame

C

,

Higgins

DG

.

Saga: sequence alignment by genetic algorithm

.

Nucleic Acids Res

1996

;

24

(

8

):

1515

–

24

.

17.

Do

CB

,

Mahabhashyam

MS

,

Brudno

M

, et al.

Probcons: probabilistic consistency-based multiple sequence alignment

.

Genome Res

2005

;

15

(

2

):

330

–

40

.

18.

Mircea

I

,

Czibula

G

,

Bocicor

M

. A q-learning approach for aligning protein sequences. In:

2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP)

, IEEE, Piscataway,

2015

,

51

–

8

.

19.

Ramakrishnan

KR

,

Singh

J

,

Blanchette

M

. Rlalign: a reinforcement learning approach for multiple sequence alignment. In:

2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE)

, IEEE, Piscataway,

2018

,

61

–

6

.

20.

Jafari

R

,

Javidi

MM

,

Rafsanjani

MK

.

Using deep reinforcement learning approach for solving the multiple sequence alignment problem

.

SN Appl Sci

2019

;

1

(

6

):

592

.

Google Scholar

Crossref

WorldCat

21.

Zhan

Q

,

Fu

Y

,

Jiang

Q

, et al.

Splivert: a protein multiple sequence alignment refinement method based on splitting-splicing vertically

.

Protein Pept Lett

2020

;

27

(

4

):

295

–

302

.

22.

Naznooshsadat

E

,

Elham

P

,

Ali

SZ

.

Fame: fast and memory efficient multiple sequences alignment tool through compatible chain of roots

.

Bioinformatics

2020

;

36

(

12

):

3662

–

8

.

23.

Notredame

C

.

Recent progress in multiple sequence alignment: a survey

.

Pharmacogenomics

2002

;

3

:

131

–

44

.

24.

Chowdhury

B

,

Garai

G

.

A review on multiple sequence alignment from the perspective of genetic algorithm

.

Genomics

2017

;

109

(

5-6

):

419

–

31

.

25.

Xia

Z

,

Cui

Y

,

Zhang

A

, et al.

A review of parallel implementations for the smith-waterman algorithm

.

Interdiscip Sci

2021

;

3

:

1

–

14

.

Google Scholar

OpenURL Placeholder Text

WorldCat

26.

Altschul

SF

,

Gish

W

,

Miller

W

, et al.

Basic local alignment search tool

.

J Mol Biol

1990

;

215

(

3

):

403

–

10

.

27.

Pearson

WR

,

Lipman

DJ

.

Improved tools for biological sequence comparison

.

Proc Natl Acad Sci U S A

1988

;

85

(

8

):

2444

–

8

.

28.

Li

YC

,

Lu

YC

.

Blastp-acc: parallel architecture and hardware accelerator design for blast-based protein sequence alignment

.

IEEE Trans Biomed Circuits Syst

2019

;

13

(

6

):

1771

–

82

.

29.

Jin

X

,

Liao

Q

,

Wei

H

, et al.

Smi-blast: a novel supervised search framework based on psi-blast for protein remote homology detection

.

Bioinformatics

2020

.

Google Scholar

OpenURL Placeholder Text

WorldCat

30.

Kapli

P

,

Yang

Z

,

Telford

MJ

.

Phylogenetic tree building in the genomic age

.

Nat Rev Genet

2020

;

21

(

7

):

428

–

44

.

31.

Khan

BM

,

Sabir

M

,

Alyemeni

MN

, et al.

Genetic similarities and phylogenetic analysis of muntjac (muntiacus spp.) by comparing the nucleotide sequence of 16s rrna and cytochrome b genome

.

Braz J Biol

2021

;

83

:e248153.

Google Scholar

OpenURL Placeholder Text

WorldCat

32.

Liu

B

,

Pavel

JA

,

Hausbeck

MK

, et al.

Phylogenetic analysis, vegetative compatibility, virulence, and fungal filtrates of leaf curl pathogen Colletotrichum fioriniae from celery

.

Phytopathology

2021

;

111

(

4

):

751

–

60

.

33.

Wei

R

,

Zhang

XC

.

Phylogeny of diplazium (athyriaceae) revisited: resolving the backbone relationships based on plastid genomes and phylogenetic tree space analysis

.

Mol Phylogenet Evol

2020

;

143

:106699.

Google Scholar

OpenURL Placeholder Text

WorldCat

34.

Hu

Y

,

Xing

W

,

Hu

Z

, et al.

Phylogenetic analysis and substitution rate estimation of colonial volvocine algae based on mitochondrial genomes

.

Genes (Basel)

2020

;

11

(

1

).

Google Scholar

OpenURL Placeholder Text

WorldCat

35.

Fariq

A

,

Blazier

JC

,

Yasmin

A

, et al.

Whole genome sequence analysis reveals high genetic variation of newly isolated Acidithiobacillus ferrooxidans io-2c

.

Sci Rep

2019

;

9

(

1

):

13049

.

36.

Hu

B

,

Guo

H

,

Zhou

P

, et al.

Characteristics of sars-cov-2 and covid-19

.

Nat Rev Microbiol

2021

;

19

(

3

):

141

–

54

.

37.

Yin

C

.

Genotyping coronavirus sars-cov-2: methods and implications

.

Genomics

2020

;

112

(

5

):

3588

–

96

.

38.

Guruprasad

L

.

Evolutionary relationships and sequence-structure determinants in human SARS coronavirus-2 spike proteins for host receptor recognition

.

Proteins

2020

;

88

(

11

):

1387

–

93

.

39.

Chang

TJ

,

Yang

DM

,

Wang

ML

, et al.

Genomic analysis and comparative multiple sequences of SARS-cov2

.

J Chin Med Assoc

2020

;

83

(

6

):

537

–

43

.

40.

Madhavan

A

,

Venkatesan

G

,

Kumar

A

, et al.

Comparative sequence and structural analysis of the orf095 gene, a vaccinia virus a4l homolog of capripoxvirus in sheep and goats

.

Arch Virol

2020

;

165

(

6

):

1419

–

31

.

41.

Hecker

N

,

Hiller

M

.

A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers

.

Gigascience

2020

;

9

(

1

).

Google Scholar

OpenURL Placeholder Text

WorldCat

42.

Roe

D

,

Vierra-Green

C

,

Pyo

CW

, et al.

A detailed view of kir haplotype structures and gene families as provided by a new motif-based multiple sequence alignment

.

Front Immunol

2020

;

11

:585731.

Google Scholar

OpenURL Placeholder Text

WorldCat

43.

Hunter

CI

,

Mitchell

A

,

Jones

P

, et al.

Metagenomic analysis: the challenge of the data bonanza

.

Brief Bioinform

2012

;

13

:

743

–

6

,

6

.

44.

Zhou

H

,

Chen

X

,

Hu

T

, et al.

A novel bat coronavirus closely related to sars-cov-2 contains natural insertions at the s1/s2 cleavage site of the spike protein

.

Curr Biol

2020

;

30

(

11

):

2196

–

2203.e3

.

45.

Breitwieser

FP

,

Lu

J

,

Salzberg

SL

.

A review of methods and databases for metagenomic classification and assembly

.

Brief Bioinform

2019

;

20

(

4

):

1125

–

36

.

46.

Storato

D

,

Comin

M

.

K2mem: discovering discriminative k-mers from sequencing data for metagenomic reads classification

.

IEEE/ACM Trans Comput Biol Bioinform

2021

;

19

(1):220–229.

Google Scholar

OpenURL Placeholder Text

WorldCat

47.

Burks

DJ

,

Azad

RK

.

Higher-order Markov models for metagenomic sequence classification

.

Bioinformatics

2020

;

36

(

14

):

4130

–

6

.

48.

Velankar

S

,

Burley

SK

,

Kurisu

G

, et al.

The protein data bank archive

.

Methods Mol Biol

2021

;

2305

:

3

–

21

.

49.

Makigaki

S

,

Ishida

T

.

Sequence alignment using machine learning for accurate template-based protein structure prediction

.

Bioinformatics

2020

;

36

(

1

):

104

–

11

.

50.

Mirabello

C

,

Wallner

B

.

Rawmsa: end-to-end deep learning using raw multiple sequence alignments

.

PLoS One

2019

;

14

(

8

):e0220182.

Google Scholar

OpenURL Placeholder Text

WorldCat

51.

Cantelli

G

,

Bateman

A

,

Brooksbank

C

, et al.

The European Bioinformatics Institute (EMBL-EBI) in 2021

.

Nucleic Acids Res

2021

;

50

(D1):D11–D19.

Google Scholar

OpenURL Placeholder Text

WorldCat

52.

Sayers

EW

,

Cavanaugh

M

,

Clark

K

, et al.

Genbank

.

Nucleic Acids Res

2020

;

48

(

D1

):

D84

–

d86

.

53.

Ogasawara

O

,

Kodama

Y

,

Mashima

J

, et al.

DDBJ database updates and computational infrastructure enhancement

.

Nucleic Acids Res

2020

;

48

(

D1

):

D45

–

d50

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

54.

Tuli

MA

,

Flores

TP

,

Cameron

GN

.

Submission of nucleotide sequence data to EMBL/genbank/DDBJ

.

Mol Biotechnol

1996

;

6

(

1

):

47

–

51

.

55.

The UniProt Consortium

.

Uniprot: a worldwide hub of protein knowledge

.

Nucleic Acids Res

2019

;

47

(

D1

):

D506

–

15

.

Crossref

PubMed

WorldCat

56.

Chen

FZ

,

You

LJ

,

Yang

F

, et al.

Cngbdb: China national genebank database

.

Yi Chuan

2020

;

42

(

8

):

799

–

809

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

57.

Leinonen

R

,

Sugawara

H

,

Shumway

M

.

The sequence read archive

.

Nucleic Acids Res

2011

;

39

(

Database issue

):

D19

–

21

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

58.

Pruitt

KD

,

Tatusova

T

,

Brown

GR

, et al.

NCBI reference sequences (refseq): current status, new features and genome annotation policy

.

Nucleic Acids Res

2012

;

40

(

D1

):

D130

–

5

.

59.

Letovsky

SI

,

Cottingham

RW

,

Porter

CJ

, et al.

GDB: the human genome database

.

Nucleic Acids Res

1998

;

26

(

1

):

94

–

9

.

60.

Caló

D

,

De Pascali

,

Sasanelli

D

, et al.

Mmtdb: a metazoa mitochondrial DNA variants database

.

Nucleic Acids Res

1997

;

25

(

1

):

200

–

5

.

61.

Attimonelli

M

,

Altamura

N

,

Benne

R

, et al.

Mitbase: a comprehensive and integrated mitochondrial dna database. The present status

.

Nucleic Acids Res

2000

;

28

(

1

):

148

–

52

.

62.

Lang

OW

,

Nash

RS

,

Hellerstedt

ST

, et al.

An introduction to the saccharomyces genome database (SGD)

.

Methods Mol Biol

2018

;

1757

:

21

–

30

.

63.

Kelley

S

.

Getting started with acedb

.

Brief Bioinform

2000

;

1

(

2

):

131

–

7

.

64.

Sherry

ST

,

Ward

MH

,

Kholodov

M

, et al.

DBSNP: the NCBI database of genetic variation

.

Nucleic Acids Res

2001

;

29

(

1

):

308

–

11

.

65.

Amberger

JS

,

Bocchini

CA

,

Scott

AF

, et al.

Omim.org: leveraging knowledge across phenotype-gene relationships

.

Nucleic Acids Res

2019

;

47

(

D1

):

D1038

–

43

.

66.

MacDonald

JR

,

Ziman

R

,

Yuen

RK

, et al.

The database of genomic variants: a curated collection of structural variation in the human genome

.

Nucleic Acids Res

2014

;

42

(

Database issue

):

D986

–

92

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

67.

Pundir

S

,

Martin

MJ

,

O’Donovan

C

.

Uniprot protein knowledgebase

.

Methods Mol Biol

2017

;

1558

:

41

–

55

.

68.

Hogeweg

P

,

Hesper

B

.

The alignment of sets of sequences and the construction of phyletic trees: an integrated method

.

J Mol Evol

1984

;

20

:

175

–

86

.

69.

Feng

DF

,

Doolittle

RF

.

Progressive sequence alignment as a prerequisite to correct phylogenetic trees

.

J Mol Evol

1987

;

25

(

4

):

351

–

60

.

70.

Boyce

K

,

Sievers

F

,

Higgins

DG

.

Instability in progressive multiple sequence alignment algorithms

.

Algorithms Mol Biol

2015

;

10

:

26

.

71.

Notredame

C

,

Higgins

DG

,

Heringa

J

.

T-coffee: a novel method for fast and accurate multiple sequence alignment

.

J Mol Biol

2000

;

302

(

1

):

205

–

17

.

72.

Russell

DJ

,

Otu

HH

,

Sayood

K

.

Grammar-based distance in progressive multiple sequence alignment

.

BMC Bioinformatics

2008

;

9

:

306

.

73.

Al-Shatnawi

M

,

Ahmad

MO

,

Swamy

MN

.

Msaindelfr: a scheme for multiple protein sequence alignment using information on indel flanking regions

.

BMC Bioinformatics

2015

;

16

:

393

.

74.

Bhat

B

,

Ganai

NA

,

Andrabi

SM

, et al.

Tm-aligner: multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy

.

Sci Rep

2017

;

7

(

1

):

12543

.

75.

Maiolo

M

,

Gatti

L

,

Frei

D

, et al.

Propip: a tool for progressive multiple sequence alignment with Poisson indel process

.

BMC Bioinformatics

2021

;

22

(

1

):

518

.

76.

Garriga

E

,

Di Tommaso

,

Magis

C

, et al.

Multiple sequence alignment computation using the t-coffee regressive algorithm implementation

.

Methods Mol Biol

2021

;

2231

:

89

–

97

.

77.

Dhivya

S

,

Ashutosh

S

,

Gowtham

I

, et al.

Molecular identification and evolutionary relationships between the subspecies of Musa by DNA barcodes

.

BMC Genomics

2020

;

21

(

1

):

659

.

78.

Selva Pandiyan

A

,

Karthikeyan

RSG

,

Rameshkumar

G

, et al.

Identification of bacterial and fungal pathogens by rDNA gene barcoding in vitreous fluids of endophthalmitis patients

.

Semin Ophthalmol

2020

;

35

(

7-8

):

358

–

64

.

79.

Ying

YL

,

Hong

XZ

,

Xu

XG

, et al.

Molecular basis of ABO variants including identification of 16 novel abo subgroup alleles in Chinese Han population

.

Transfus Med Hemother

2020

;

47

(

2

):

160

–

6

.

80.

Lladós

J

,

Cores

F

,

Guirado

F

, et al.

Accurate consistency-based MSA reducing the memory footprint

.

Comput Methods Programs Biomed

2021

;

208

:106237.

Google Scholar

OpenURL Placeholder Text

WorldCat

81.

Chang

JM

,

Floden

EW

,

Herrero

J

, et al.

Incorporating alignment uncertainty into Felsenstein’s phylogenetic bootstrap to improve its reliability

.

Bioinformatics

2019

;

37

(

11

):

1506

–

14

.

Google Scholar

Crossref

WorldCat

82.

Corpet

F

.

Multiple sequence alignment with hierarchical clustering

.

Nucleic Acids Res

1988

;

16

(

22

):

10881

–

90

.

83.

Simossis

VA

,

Heringa

J

.

Praline: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information

.

Nucleic Acids Res

2005

;

33

(

Web Server issue

):

W289

–

94

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

84.

Roshan

U

,

Livesay

DR

.

Probalign: multiple sequence alignment using partition function posterior probabilities

.

Bioinformatics

2006

;

22

(

22

):

2715

–

21

.

85.

Liu

K

,

Nelesen

S

,

Raghavan

S

, et al.

Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy

.

IEEE/ACM Trans Comput Biol Bioinform

2009

;

6

(

1

):

7

–

21

.

86.

Mirarab

S

,

Nguyen

N

,

Guo

S

, et al.

Pasta: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences

.

J Comput Biol

2015

;

22

(

5

):

377

–

86

.

87.

Libin

PJK

,

Deforche

K

,

Abecasis

AB

, et al.

Virulign: fast codon-correct alignment and annotation of viral genomes

.

Bioinformatics

2019

;

35

(

10

):

1763

–

5

.

88.

Moshiri

N

.

Viralmsa: massively scalable reference-guided multiple sequence alignment of viral genomes

.

Bioinformatics

2021

;

37

(

5

):

714

–

6

.

89.

Rychlewski

L

,

Jaroszewski

L

,

Li

W

, et al.

Comparison of sequence profiles. Strategies for structural predictions using sequence information

.

Protein Sci

2000

;

9

(

2

):

232

–

41

.

90.

Baxevanis

AD

.

Practical aspects of multiple sequence alignment

.

Methods Biochem Anal

1998

;

39

:

172

–

88

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

91.

Liu

K

,

Warnow

TJ

,

Holder

MT

, et al.

Sate-ii: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees

.

Syst Biol

2012

;

61

(

1

):

90

–

106

.

92.

Amorim

AR

,

Zafalon

GFD

,

de Godoi Contessoto

, et al.

Metaheuristics for multiple sequence alignment: a systematic review

.

Comput Biol Chem

2021

;

94

:

107563

.

93.

Caiyang

Y

,

Heidari

AA

,

Chen

H

.

A quantum-behaved simulated annealing algorithm-based moth-flame optimization method

.

App Math Model

2020

;

87

:

1

–

19

.

Google Scholar

OpenURL Placeholder Text

WorldCat

94.

Ishikawa

M

,

Toya

T

,

Hoshida

M

, et al.

Multiple sequence alignment by parallel simulated annealing

.

Comput Appl Biosci

1993

;

9

:

267

–

73

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

95.

Hernández-Guía

M

,

Mulet

R

,

Rodríguez-Pérez

S

.

Simulated annealing algorithm for the multiple sequence alignment problem: the approach of polymers in a random medium

.

Phys Rev E

2005

;

72

:031915.

Google Scholar

OpenURL Placeholder Text

WorldCat

96.

Hwa

T

,

Lässig

M

.

Similarity detection and localization

.

Phys Rev Lett

1996

;

76

(

14

):

2591

–

4

.

97.

Mirjalili

S

.

Genetic Algorithm

.

Cham

:

Springer International Publishing

,

2019

,

43

–

55

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

98.

Zhang

C

,

Wong

AKC

.

A genetic algorithm for multiple molecular sequence alignment

.

Bioinformatics

1997

;

13

(

6

):

565

–

81

.

Google Scholar

Crossref

WorldCat

99.

Chen

S-M

,

Lin

C-H

,

Chen

S-J

.

Multiple DNA sequence alignment based on genetic algorithms and divide-and-conquer techniques

.

Int J Appl Sci Eng

2005

;

3

:

89

–

100

.

Google Scholar

OpenURL Placeholder Text

WorldCat

100.

Arenas-Díaz

E

,

Ochoterena

H

.

Multiple sequence alignment using a genetic algorithm and glocsa

.

J Artif Evol Appl

2009

;

2009

:963150.

Google Scholar

OpenURL Placeholder Text

WorldCat

101.

Ortuño

FM

,

Valenzuela

O

,

Rojas

F

, et al.

Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns

.

Bioinformatics

2013

;

29

(

17

):

2112

–

21

.

102.

Kaya

M

,

Sarhan

A

,

Alhajj

R

.

Multiple sequence alignment with affine gap by using multi-objective genetic algorithm

.

Comput Methods Programs Biomed

2014

;

114

(

1

):

38

–

49

.

103.

Gao

C

,

Wang

B

,

Zhou

CJ

, et al.

Multiple sequence alignment based on combining genetic algorithm with chaotic sequences

.

Genet Mol Res

2016

;

15

(

2

):

gmr8788

.

Google Scholar

OpenURL Placeholder Text

WorldCat

104.

Chatterjee

S

,

Barua

P

,

Hasibuzzaman

MM

, et al. A hybrid genetic algorithm with chemical reaction optimization for multiple sequence alignment. In:

2019 22nd International Conference on Computer and Information Technology (ICCIT)

, IEEE, Piscataway,

2019

,

1

–

6

.

105.

Mishra

A

,

Tripathi

BK

,

Singh Soam

S

. A genetic algorithm based approach for the optimization of multiple sequence alignment. In:

2020 International Conference on Computational Performance Evaluation (ComPE)

, IEEE, Piscataway,

2020

,

415

–

8

.

106.

Chowdhury

B

,

Garai

G

.

A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm

.

Soft Comput

2020

;

24

(

20

):

15871

–

88

.

Google Scholar

Crossref

WorldCat

107.

Long

H

,

Xu

W

,

Sun

J

, et al. Multiple sequence alignment based on a binary particle swarm optimization algorithm. In:

2009 Fifth International Conference on Natural Computation

, IEEE, Piscataway, Vol.

3

,

2009

,

265

–

9

.

108.

Chen

W

,

Liao

B

,

Zhu

W

, et al.

Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm

.

J Comput Chem

2009

;

30

(

13

):

2031

–

8

.

109.

Xuyu

X

,

Dafan

Z

,

Jiaohua

Q

, et al.

Ant colony with genetic algorithm based on planar graph for multiple sequence alignment

.

Inf Technol J

2010

;

9

:

274

–

81

.

Google Scholar

OpenURL Placeholder Text

WorldCat

110.

Jagadamba

PVSL

,

Prasad Babu

MS

,

Rao

AA

, et al. An improved algorithm for multiple sequence alignment using particle swarm optimization. In:

2011 IEEE 2nd International Conference on Software Engineering and Service Science

,

2011

,

544

–

7

.

111.

Yang

W-H

.

An improved artificial fish swarm algorithm and its application in multiple sequence alignment

.

J Comput Theor Nanosci

2014

;

11

:

888

–

92

.

Google Scholar

Crossref

WorldCat

112.

Manikandan

P

,

Ramyachitra

D

.

Bacterial foraging optimization -genetic algorithm for multiple sequence alignment with multi-objectives

.

Sci Rep

2017

;

7

(

1

):

8833

.

113.

Chaabane

L

.

A hybrid solver for protein multiple sequence alignment problem

.

J Bioinform Comput Biol

2018

;

16

(

4

):

1850015

.

114.

Kuang

FJ

,

Zhang

SY

,

Liu

CC

.

Multiple sequence alignment algorithm based on multi-strategy artificial bee colony

.

Kongzhi yu Juece/Control Decision

2018

;

33

:

1990

–

6

.

Google Scholar

OpenURL Placeholder Text

WorldCat

115.

Hussein

AM

,

Abdullah

R

,

AbdulRashid

N

. Flower pollination algorithm with profile technique for multiple sequence alignment. In:

2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)

,

2019

,

571

–

6

.

116.

Dabba

A

,

Tari

A

,

Zouache

D

.

Multiobjective artificial fish swarm algorithm for multiple sequence alignment

.

INFOR: Inf Syst Oper Res

2020

;

58

(

1

):

38

–

59

.

Google Scholar

OpenURL Placeholder Text

WorldCat

117.

Kumar

M

,

Kumar

R

,

Nidhya

R

. Woamsa: whale optimization algorithm for multiple sequence alignment of protein sequence. In:

Smys

S

,

Tavares

JMRS

,

Balas

VE

et al. (eds).

Computational Vision and Bio-Inspired Computing

.

Springer International Publishing

, Cham,

2019

,

131

–

9

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

118.

Chaabane

L

.

An enhanced cooperative method to solve multiple-sequence alignment problem

.

Int J Data Mining Modell Manage

2021

;

13

(

1-2

):

1

–

16

.

Google Scholar

OpenURL Placeholder Text

WorldCat

119.

Baum Leonard

E

,

Ted

P

,

George

S

, et al.

A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains

.

Ann Math Stat

1970

;

41

(

1

):

164

–

71

.

Google Scholar

Crossref

WorldCat

120.

Pei

J

,

Grishin

NV

.

Mummals: multiple sequence alignment improved by using hidden Markov models with local structural information

.

Nucleic Acids Res

2006

;

34

(

16

):

4364

–

74

.

121.

Pei

J

,

Grishin

NV

.

Promals: towards accurate multiple sequence alignments of distantly related proteins

.

Bioinformatics

2007

;

23

(

7

):

802

–

8

.

122.

Liu

Y

,

Schmidt

B

,

Maskell

DL

.

Msaprobs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities

.

Bioinformatics

2010

;

26

(

16

):

1958

–

64

.

123.

Sievers

F

,

Wilm

A

,

Dineen

D

, et al.

Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega

.

Mol Syst Biol

2011

;

7

:

539

.

124.

Sun

J

,

Palade

V

,

Wu

X

, et al.

Multiple sequence alignment with hidden Markov models learned by random drift particle swarm optimization

.

IEEE/ACM Trans Comput Biol Bioinform

2014

;

11

(

1

):

243

–

57

.

125.

Zhan

Q

,

Wang

N

,

Jin

S

, et al.

Probpfp: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function

.

BMC Bioinformatics

2019

;

20

(

Suppl 18

):

573

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

126.

González-Domínguez

J

,

Liu

Y

,

Touriño

J

, et al.

Msaprobs-mpi: parallel multiple sequence aligner for distributed-memory systems

.

Bioinformatics

2016

;

32

(

24

):

3826

–

8

.

127.

González-Domínguez

J

.

Fast and accurate multiple sequence alignment with msaprobs-mpi

.

Methods Mol Biol

2021

;

2231

:

39

–

47

.

128.

Sievers

F

,

Higgins

DG

.

The clustal omega multiple alignment package

.

Methods Mol Biol

2021

;

2231

:

3

–

16

.

129.

Pachetti

M

,

Marini

B

,

Benedetti

F

, et al.

Emerging sars-cov-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

.

J Transl Med

2020

;

18

(

1

):

179

.

130.

Smirnov

V

,

Warnow

T

.

Magus: multiple sequence alignment using graph clustering

.

Bioinformatics

2021

;

37

(

12

):

1666

–

72

.

131.

Liu

H

,

Zou

Q

,

Xu

Y

.

A novel fast multiple nucleotide sequence alignment method based on fm-index

.

Brief Bioinform

2021

;

23

(1):bbab519, https://doi.org/10.1093/bib/bbab519.

Google Scholar

OpenURL Placeholder Text

WorldCat

132.

Shen

C

,

Zaharias

P

,

Warnow

T

.

Magus+ehmms: improved multiple sequence alignment accuracy for fragmentary sequences

.

Bioinformatics

2021

;

38

(4):918–24.

Google Scholar

OpenURL Placeholder Text

WorldCat

133.

Notredame

C

,

Holm

L

,

Higgins

DG

.

Coffee: an objective function for multiple sequence alignments

.

Bioinformatics

1998

;

14

(

5

):

407

–

22

.

134.

Narayan Behera

MS

,

Jeevitesh

JJ

,

Kant

K

, et al.

Higher accuracy protein multiple sequence alignments by genetic algorithm

.

Proc Comput Sci

2017

;

108

:

1135

–

44

.

Google Scholar

Crossref

WorldCat

135.

Thompson

JD

,

Koehl

P

,

Ripp

R

, et al.

Balibase 3.0: latest developments of the multiple sequence alignment benchmark

.

Proteins

2005

;

61

(

1

):

127

–

36

.

136.

Bradley

RK

,

Roberts

A

,

Smoot

M

, et al.

Fast statistical alignment

.

PLoS Comput Biol

2009

;

5

(

5

):e1000392.

Google Scholar

OpenURL Placeholder Text

WorldCat

137.

Löytynoja

A

,

Goldman

N

.

Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis

.

Science

2008

;

320

(

5883

):

1632

–

5

.

138.

Van Walle

,

Lasters

I

,

Wyns

L

.

Sabmark-a benchmark for sequence alignment that covers the entire known fold space

.

Bioinformatics

2005

;

21

(

7

):

1267

–

8

.

139.

Raghava

GP

,

Searle

SM

,

Audley

PC

, et al.

Oxbench: a benchmark for evaluation of protein multiple sequence alignment accuracy

.

BMC Bioinformatics

2003

;

4

:

47

.

140.

Schultz

J

,

Copley

RR

,

Doerks

T

, et al.

Smart: a web-based tool for the study of genetically mobile domains

.

Nucleic Acids Res

2000

;

28

(

1

):

231

–

4

.

141.

Sievers

F

,

Higgins

DG

.

Quantest2: benchmarking multiple sequence alignments using secondary structure prediction

.

Bioinformatics

2020

;

36

(

1

):

90

–

5

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

142.

Koyama

T

,

Platt

D

,

Parida

L

.

Variant analysis of SARS-cov-2 genomes

.

Bull World Health Organ

2020

;

98

(

7

):

495

–

504

.

143.

Ashkenazy

H

,

Sela

I

,

Levy Karin

E

, et al.

Multiple sequence alignment averaging improves phylogeny reconstruction

.

Syst Biol

2019

;

68

(

1

):

117

–

30

.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
March 2022	256
April 2022	123
May 2022	53
June 2022	128
July 2022	77
August 2022	41
September 2022	34
October 2022	91
November 2022	62
December 2022	55
January 2023	40
February 2023	28
March 2023	62
April 2023	53
May 2023	40
June 2023	41
July 2023	31
August 2023	35
September 2023	40
October 2023	23
November 2023	33
December 2023	28
January 2024	84
February 2024	89
March 2024	143
April 2024	117
May 2024	231
June 2024	124
July 2024	124
August 2024	92
September 2024	148
October 2024	197
November 2024	208
December 2024	147
January 2025	108
February 2025	184
March 2025	231
April 2025	160
May 2025	123

Article Contents

A survey on the algorithm and development of multiple sequence alignment

Abstract

Introduction

Applications of MSA

Database searching

Phylogenetic analysis

Genomic analysis

Metagenomic analysis

Protein analysis

Databases of biomolecules

Algorithms for MSA

Progressive alignment algorithm

Iterative algorithm

Heuristics algorithm

Simulated annealing

Genetic algorithm

Swarm intelligence algorithm

Hidden Markov model

Machine learning

Divide-and-conquer algorithm

Objective functions

Programs

Benchmarks

Challenges and opportunities

Scale of sequence

Design of method

Correctness of alignment

Conclusion

Funding

Author Biographies

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

A survey on the algorithm and development of multiple sequence alignment Free

Abstract

Introduction

Applications of MSA

Database searching

Phylogenetic analysis

Genomic analysis

Metagenomic analysis

Protein analysis

Databases of biomolecules

Algorithms for MSA

Progressive alignment algorithm

Iterative algorithm

Heuristics algorithm

Simulated annealing

Genetic algorithm

Swarm intelligence algorithm

Hidden Markov model

Machine learning

Divide-and-conquer algorithm

Objective functions

Programs

Benchmarks

Challenges and opportunities

Scale of sequence

Design of method

Correctness of alignment

Conclusion

Funding

Author Biographies

References

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

A survey on the algorithm and development of multiple sequence alignment