Abstract

Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA’s knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.

Introduction

Sequence alignment (SA) is one of the hotspots and critical steps in bioinformatics, which analyzes the essential biological characteristics between nucleotide or protein sequences by comparing the similarity, such as functional information, structural information and evolutionary information [1]. For example, in comparative genomic analysis, SA can identify conserved sequence motifs, estimate evolutionary differences among sequences and infer the historical relationship between genes and species [2].

SA is mainly used to determine the similarity of two or more sequences. As shown in Figure 1A, given two nucleotide sequences, it is only necessary to insert one gap in the 3rd position of the 2nd sequence to maximize the number of matched nucleotide columns between two sequences. According to the range of alignment, SA can be divided into global SA (shown as Figure 1A or B) and local SA (shown as Figure 1C) [3]. On the other hand, according to the number of sequences, SA can also be divided into pairwise sequence alignment [4] (PSA, shown as Figure 1A) and multiple sequence alignment [5] (MSA, shown as Figure 1B or C). SA makes the resulting sequences all have the same length and aligns the same or similar parts of each sequence by inserting gaps. At present, dynamic programming and the trace-back process can perfectly solve the problem of PSA, such as the Needleman–Wunsch algorithm for global alignment [6] and Smith–Waterman algorithm for local alignment [7]. However, compared with PSA, MSA is more complex. It is an NP (nondeterministic polynomial)-complete problem [8], there is no comprehensive solution for MSA [9].

There are currently many algorithms for MSA that can be roughly divided into progressive alignment algorithm, iterative algorithm, heuristics, machine learning and divide-and-conquer. Progressive alignment is a simple algorithm, which mainly includes two steps: constructing the guide tree and gradually constructing the alignment according to the guide tree [10]. The progressive alignment algorithm is more straightforward and faster. It is still widely used in various methods or programs of MSA after more than 30 years [11–13]. However, due to the lack of accuracy of progressive alignment algorithm, some methods introduced iterative algorithm into progressive alignment to solve this problem [14, 15]. These methods continuously optimize the alignment in the iterative process, but it is undeniable that the iterative algorithm comes with an additional time overhead. Recently, some researchers are pinning their hopes on heuristics [16, 17]. A heuristic algorithm is an iterative algorithm in a broad sense suitable for solving the NP problem such as MSA. However, unlike the traditional iterative algorithm, the heuristic algorithm may have different results each time due to its randomness. Although the application of machine learning to MSA is currently in its initial stage, there are still some valuable works on this subject [18–20]. These works show that machine learning-based methods for MSA perform better than traditional methods on some datasets. Moreover, divide-and-conquer is also widely used in MSA in recent years [21, 22]. It operates by dividing a group of sequences into multiple groups of subsequences for processing, which significantly improves the efficiency of the methods.

This figure shows various types of SA. (A) shows PSA and global SA, (B) shows MSA and global SA and (C) shows MSA and local SA.
Figure 1

This figure shows various types of SA. (A) shows PSA and global SA, (B) shows MSA and global SA and (C) shows MSA and local SA.

With the development and extensive application of MSA, there have been many valuable works to review the research of MSA from several aspects. For example, Notredame described existing MSA algorithms and exposed the potential advantages and disadvantages of the widely used programs [23]. Furthermore, Chowdhury and Garai have shown different methods for SA and the latest trend of multi-objective genetic algorithms in MSA. Their research shows that there is excellent progress in improving the accuracy of the alignment [24]. Moreover, Chatzou et al. reviewed the development and performance of the algorithms for protein, DNA and RNA alignment. Importantly, they also explored the relationship between simulated and empirical data, which affect the development of the algorithm [8]. Bawono et al.’s work provided the background and considerations of MSA techniques, primarily focusing on the protein SA. Then, they reviewed existing MSA methods and summarized the specific strengths and limitations of these methods [5]. Xia et al. conducted the present situation of parallel local alignment, concentrating on the large-scale genomic alignment. Besides, they also surveyed the performance of typical methods and discussed the future directions [25]. However, due to the increasing sequences scale and the diversity and complexity of the approaches, these works are not systematic and comprehensive. There is an urgent demand to summarize the recent research of MSA clearly and systematically.

In this work, we provide a clear review of the research of MSA and its applications in several scenarios. Our study is mainly shown in Figure 2. We summarize the relevant knowledge of MSA from its application, database, algorithm, program and benchmark. Primarily, we introduce and classify the classical and latest algorithms for MSA in more detail. Furthermore, we also discuss the current challenges and opportunities for future directions of MSA. This work comprehensively reviews the research of MSA from several aspects, which aim to serve as a helpful guide for researchers. The contributions of this work can be summarized as follows: (1) summarize the research status of MSA systematically. (2) Classify and detail the classical and latest algorithms for MSA. (3) Discuss the current challenges and opportunities of MSA.

The frame diagram of the paper.
Figure 2

The frame diagram of the paper.

Applications of MSA

MSA plays a vital role in clarifying the biological patterns of multiple related sequences. For two or more sequences with high similarity, they may show similarity in function, evolution and structure [1]. For this reason, MSA has been widely used in sequence database searching, phylogenetic analysis, genomic analysis, protein analysis and metagenomic analysis [23].

Database searching

For a newly gene sequenced or protein sequenced, a vital clue to know its function is to search the known sequences homologous of similar to that sequence [26]. However, with the sustainable development of sequencing technology, the cumulative number of sequences has broken through the scope of human retrieval. Finding similar sequences from the sequence database has become one of the difficulties.

Sequence database search tools mainly include FASTA [27] and BLAST [26]. FASTA, derived from FASTAP, is the first program widely used in DNA and protein database searching. The search process of FASTA is as follows: firstly, build a dictionary of sequence fragments of |$ktup$|-length, then match the same or similar sequence fragments in the database. After the search, FASTA expands the matched fragments to obtain the best alignment between each matched sequence and the search sequence. In addition, BLAST is the most widely used tool at present. It has more improvements and faster speed than FASTA. The search process is similar to FASTA. It needs to establish a dictionary for the sequence fragments and expand on the matched fragments. The apparent difference is that BLAST introduces a fragment matching and expansion threshold. For different application scenarios, many derivative versions of BLAST have been born, including the latest BLASTP-ACC [28] and SMI-BLASE [29].

Phylogenetic analysis

Phylogenetic analysis is a critical key of many evolutionary studies. It reveals the phylogenetic relationship between different species and analyzes the major transitions in evolution [30]. MSA is the prerequisite in many studies of phylogenetic analysis, such as the construction of the phylogenetic tree. Such as, Khan et al. compute the alignment of 16S rRNA of seven muntjac species to determine the phylogenetic similarity among the muntjac using DNASTAR. And then construct the phylogenetic tree of these species based on the alignment. Finally, by analyzing the phylogenetic tree, they summarize the phylogenetic relationship among the selected species [31]. Furthermore, while investigating the Colletotrichum acutatum sensu lato, which causes leaf curl of celery, Liu et al. [32] align the genes of a collection of isolates from celery and non-celery using ClustalX and conduct a multilocus phylogenetic analysis based on the alignment. In addition, Wei et al. [33] and Hu et al. [34] used the MSA program MAFFT to construct the alignment for phylogenetic analysis.

Genomic analysis

Genomic analysis is an essential means of obtaining biological information, and it aims to analyze the gene functional, genetic, evolutionary or structural information contained in these genomes [35].

In early 2020, with the outbreak of SARS-CoV-2 (COVID-19), researchers paid particular attention to the virus [36, 37]. They constructed the MSA and the phylogenetic trees on the representative spike proteins of SARS-CoV and SARS-CoV-2 from various host sources. They analyzed the specificity required by SARS-CoV-2 spike proteins to cause human infection [38]. The results showed that N-terminal sequence regions MESEFR’ and SYLTPG’ were specific to human SARS-CoV-2. In the receptor-binding domain, two sequence regions, VGGNY’ and EIYQAGSTPCNGV’ and a disulfide bond connecting 480C and 488C are structural determinants for recognizing human ACE-2 receptors. The analysis of virus variants also depends on the MSA results. Through the study of base substitution, deletion, variation and single nucleotide polymorphisms (SNP) in the alignment, we can understand the evolution and classification of variants, which provides essential information for vaccine research about development and improvement [39].

In addition, MSA has been applied to many research fields, including the analysis of viral diseases of different species [40], the impact of enhancers on the evolution of mammals [41] and the study of genes encoding killer-cell immunoglobulin-like receptors [42].

Metagenomic analysis

The genomic analysis mainly analyzes the genome of a single species. In contrast, metagenomic analysis primarily studies the genome of all microbial communities in a specific environment and explores their genetic composition, community function and diversity [43].

Metagenomic analysis can be combined with MSA to quickly determine the genetic relationship between different organisms. In 2020, Zhou et al. collected the metagenomes of 227 bat samples and identified the new virus called RmYN02 through MSA. According to analysis, the virus has 93.3% nucleotide homology with SARS-CoV-2 and is the closest relative to SARS-SoV-2 found so far [44]. Furthermore, MSA is also often used in metagenomic classification. For example, the early metagenomic classifier uses BLAST to match each genome with all genomes in GenBank [45]. While now, to improve accuracy and efficiency, most classifiers prefer combining other algorithms based on MSA, such as k-mers [46] and Markov model [47].

Protein analysis

Similar to genomic analysis, protein sequence analysis mainly focuses on protein function analysis, homology analysis, structure prediction and residue contact prediction [23]. Primarily, protein structure prediction is one of the current research hotspots. In Protein Data Bank (PDB), the protein structure is mainly obtained by X-ray crystal diffraction, nuclear magnetic resonance and electron microscopy [48]. Because the cost of this structure acquisition method is too high, their respective limitations are relatively large. Therefore, to reduce the cost of obtaining protein structure, some researchers have proposed MSA combined with some prediction methods to predict protein structure. The main steps are as follows: give a group of protein sequences homologous to be tested and use the SA results to select an appropriate method to predict the structure [49]. A prediction accuracy combined with deep learning is proposed to improve the prediction accuracy. The core idea of this method is to map the amino acid sequence to the continuous space of adaptive learning with the help of natural language processing to input the whole alignment into the deep neural network for prediction. The results show that this method has good performance in different protein prediction problems [50].

Databases of biomolecules

The sequence database is mainly divided into gene, protein and comprehensive databases. In 1982, with sequencing technology development, biological sequences increased sharply. European Bioinformatics Institute established the first DNA sequence database, which is called European Molecular Biology Laboratory Bank (EMBL-Bank) [51]. In the same year, GeneBank was established by the National Institutes of Health and National Science Foundation [52]. After decades of development, the above databases have become a 1st-class comprehensive database [53]. To ensure data integrity, these databases share and exchange data. Therefore, from the perspective of data content, the resources stored in these databases are almost the same [54].

There are some deficiencies in the protein data in the above database. Therefore, databases dedicated to storing proteins have emerged, such as UniProt PDB. UniProt is the free protein database with the rich information and the most comprehensive resources. It integrates EMBL bank, the Swiss Institute of Bioinformatics and Protein Information Resource [55]. At present, UniProt is mainly composed of five sub-databases: Swiss-Prot, TrEMBL, UniParc, UniRef and Proteomes. The relationship of these data is shown in Figure 3. PDB mainly records the protein structural information and other biological macromolecules, such as polysaccharides and nucleic acids [48]. The structural information in PDB is mainly determined by X-ray crystal diffraction, nuclear magnetic resonance or electron microscope. It is expensive but has higher accuracy, so it is more suitable for research. The summary of the related database is shown in Table 1.

Table 1

Summary of DNA/RNA and protein databases

DatabaseDescriptionReferences
EMBL-BankComprehensive database of Europe, top-level database.[51]
DDBJComprehensive database of Japan, top-level database.[53]
GenBankComprehensive database of America, top-level database.[52]
CNGBComprehensive database of China.[56]
SRAComprehensive database, storing raw sequences.[57]
RefSeqComprehensive database, storing non redundant sequences.[58]
GDBGenome database, storing human genome sequences.[59]
MmtDBMetazoa mtDNA variants specialized database.[60]
MitBASEIntegrated mitochondrial DNA database.[61]
SGDSaccharomyces genome database.[62]
AceDBCaenorhabditis elegans sequence database.[63]
dbSNPVariant gene database.[64]
OMIMOnline Mendelian inheritance in MA.[65]
DGVThe database of genomic variants from healthy people.[66]
UniProtProtein database.[55]
PDBProtein structure database.[48]
DatabaseDescriptionReferences
EMBL-BankComprehensive database of Europe, top-level database.[51]
DDBJComprehensive database of Japan, top-level database.[53]
GenBankComprehensive database of America, top-level database.[52]
CNGBComprehensive database of China.[56]
SRAComprehensive database, storing raw sequences.[57]
RefSeqComprehensive database, storing non redundant sequences.[58]
GDBGenome database, storing human genome sequences.[59]
MmtDBMetazoa mtDNA variants specialized database.[60]
MitBASEIntegrated mitochondrial DNA database.[61]
SGDSaccharomyces genome database.[62]
AceDBCaenorhabditis elegans sequence database.[63]
dbSNPVariant gene database.[64]
OMIMOnline Mendelian inheritance in MA.[65]
DGVThe database of genomic variants from healthy people.[66]
UniProtProtein database.[55]
PDBProtein structure database.[48]
Table 1

Summary of DNA/RNA and protein databases

DatabaseDescriptionReferences
EMBL-BankComprehensive database of Europe, top-level database.[51]
DDBJComprehensive database of Japan, top-level database.[53]
GenBankComprehensive database of America, top-level database.[52]
CNGBComprehensive database of China.[56]
SRAComprehensive database, storing raw sequences.[57]
RefSeqComprehensive database, storing non redundant sequences.[58]
GDBGenome database, storing human genome sequences.[59]
MmtDBMetazoa mtDNA variants specialized database.[60]
MitBASEIntegrated mitochondrial DNA database.[61]
SGDSaccharomyces genome database.[62]
AceDBCaenorhabditis elegans sequence database.[63]
dbSNPVariant gene database.[64]
OMIMOnline Mendelian inheritance in MA.[65]
DGVThe database of genomic variants from healthy people.[66]
UniProtProtein database.[55]
PDBProtein structure database.[48]
DatabaseDescriptionReferences
EMBL-BankComprehensive database of Europe, top-level database.[51]
DDBJComprehensive database of Japan, top-level database.[53]
GenBankComprehensive database of America, top-level database.[52]
CNGBComprehensive database of China.[56]
SRAComprehensive database, storing raw sequences.[57]
RefSeqComprehensive database, storing non redundant sequences.[58]
GDBGenome database, storing human genome sequences.[59]
MmtDBMetazoa mtDNA variants specialized database.[60]
MitBASEIntegrated mitochondrial DNA database.[61]
SGDSaccharomyces genome database.[62]
AceDBCaenorhabditis elegans sequence database.[63]
dbSNPVariant gene database.[64]
OMIMOnline Mendelian inheritance in MA.[65]
DGVThe database of genomic variants from healthy people.[66]
UniProtProtein database.[55]
PDBProtein structure database.[48]
The figure demonstrates the relationship among five sub-databases of UniProt [67].
Figure 3

The figure demonstrates the relationship among five sub-databases of UniProt [67].

Algorithms for MSA

This section mainly introduces the widely used MSA algorithms, which are divided into the following five types according to the characteristics of the algorithm: progressive alignment algorithm, iterative algorithm, heuristic, machine learning and divide-and-conquer algorithm, as shown in Figure 4.

Classification of MSA algorithms.
Figure 4

Classification of MSA algorithms.

Progressive alignment algorithm

Hogeweg and Hesper [68] first proposed the progressive alignment algorithm in 1984. The progressive alignment algorithm is more straightforward, efficient and has lower memory consumption than other algorithms. However, it also has obvious defects, once a gap, always a gap’ meaning that the inserted gap cannot be changed after the alignment sequence is confirmed [69]. According to relevant research, the progressive alignment algorithm may cause information loss when calculating the distance matrix, making the algorithm unstable [70].

After more than 30 years of development, the progressive alignment algorithm has been widely used in various MSA methods, as shown in Table 2. Typically, Higgins et al. developed ClustalW, derived from ClustalV in 1994 [11], which uses the weighted sum-of-pairs (WSP) objective function and the neighbor-joining (NJ) method to build the guide tree, as shown in Figure 5. ClustalW is widely used in genome analysis, such as the study on the evolutionary relationship between banana subspecies [77], the rDNA barcoding identification of bacterial and fungal pathogens in vitreous fluids of endophthalmitis patients [78], and the study on the molecular variation of ABO in Chinese Han population [79].

Table 2

Summary of progressive alignment algorithm-based methods

ReferencesDescriptionYear
[11]ClustalW, guide tree, WSP1994
[12]DiAlign, segment-segment, probabilistic consideration1998
[71]T-Coffee, tree-based consistency objective2000
[13]Kalign, Wu-Manber character matching algorithm2005
[72]GramAlign, grammar-based distance2008
[73]MSAIndelFR, indel flanking region2015
[74]TM-Aligner, Wu-Manber and dynamic string matching algorithm2017
[75]ProPIP, dynamic programming2021
[76]Regressive algorithm2021
ReferencesDescriptionYear
[11]ClustalW, guide tree, WSP1994
[12]DiAlign, segment-segment, probabilistic consideration1998
[71]T-Coffee, tree-based consistency objective2000
[13]Kalign, Wu-Manber character matching algorithm2005
[72]GramAlign, grammar-based distance2008
[73]MSAIndelFR, indel flanking region2015
[74]TM-Aligner, Wu-Manber and dynamic string matching algorithm2017
[75]ProPIP, dynamic programming2021
[76]Regressive algorithm2021
Table 2

Summary of progressive alignment algorithm-based methods

ReferencesDescriptionYear
[11]ClustalW, guide tree, WSP1994
[12]DiAlign, segment-segment, probabilistic consideration1998
[71]T-Coffee, tree-based consistency objective2000
[13]Kalign, Wu-Manber character matching algorithm2005
[72]GramAlign, grammar-based distance2008
[73]MSAIndelFR, indel flanking region2015
[74]TM-Aligner, Wu-Manber and dynamic string matching algorithm2017
[75]ProPIP, dynamic programming2021
[76]Regressive algorithm2021
ReferencesDescriptionYear
[11]ClustalW, guide tree, WSP1994
[12]DiAlign, segment-segment, probabilistic consideration1998
[71]T-Coffee, tree-based consistency objective2000
[13]Kalign, Wu-Manber character matching algorithm2005
[72]GramAlign, grammar-based distance2008
[73]MSAIndelFR, indel flanking region2015
[74]TM-Aligner, Wu-Manber and dynamic string matching algorithm2017
[75]ProPIP, dynamic programming2021
[76]Regressive algorithm2021
Given seven protein sequences. The distance matrix of paired sequences is calculated firstly, then the NJ algorithm is used to construct the guide tree, and finally, the progressive alignment is used to align the sequences gradually [11].
Figure 5

Given seven protein sequences. The distance matrix of paired sequences is calculated firstly, then the NJ algorithm is used to construct the guide tree, and finally, the progressive alignment is used to align the sequences gradually [11].

Recently, there has been some studies similar to ClustalW. For example, Kalign [13], developed by Lasmann and Sonnhammer in 2005, uses the Wu-Manber character matching algorithm when calculating distance matrix. GramAlign, proposed by Russell et al. [72], uses syntax based on Lempel–Ziv compression algorithm to compute distance matrix. In 2021, Maiolo et al. introduced the Poisson Indel Process (PIP) model and dynamic programming algorithm to calculate the distance matrix, which effectively reduced the complexity of the whole method [75]. In the optimization of the objective function, Al-Shatnawi adopts the objective process of variable gap penalty in the MSAIndelFR to improve the accuracy of alignment in 2015 [73].

To further improve the accuracy and avoid the severe pitfall caused by the greed, T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation) was presented by Notredame and Higginsin in 2000, which is based on progressive alignment algorithm and consistency [71]. The package of T-Coffee integrates a variety of MSA methods with different characteristics, such as low memory overhead or higher accuracy [80, 81]. For example, Garriga et al. [76] proposed an MSA method based on a T-Coffee regressive and progressive alignment algorithm. In their method, sequences are first clustered and then aligned starting from the most distant ones. Their experiments show that the technique can significantly improve alignment accuracy in large-scale datasets.

As the most classic algorithm, the progressive alignment algorithm has more clear and faster advantages. However, the progressive alignment algorithm’s accuracy is relatively unsatisfactory, and its performance is easily affected by the number of sequences. Therefore, in addition to the progressive alignment algorithm, most algorithms also use other methods to improve the accuracy of alignment, such as MUSCLE [15], MAFFT [14].

Iterative algorithm

The iterative algorithm is an algorithm that can continuously optimize the alignment to improve the accuracy until it converges. The iterative algorithm can generally be divided into deterministic and stochastic [23]. The calculation results of the deterministic iterative algorithm are the same every time. In contrast, the results of the stochastic iterative algorithm may be different due to the internal use of random values. In this section, we mainly discuss the deterministic iterative algorithm. As shown in Table 3, the table lists the MSA methods related to the deterministic iterative algorithm.

Table 3

Summary of deterministic iterative algorithm-based methods

ReferencesDescriptionYear
[82]MultAlign, hierarchical clustering1988
[14]MAFFT, fast Fourier transform, refinement2002
[15]MUSCLE, unweighted pair-group method with arithmetic mean2004
[83]PRALINE, homology-extended, secondary structure2005
[84]Probalign, posterior probabilities2006
[85]SATé, minimizing tree-length, POY2009
[86]PASTA, large-scale, parallel technique2015
[87]VIRULIGN, reference-guide, codon-correct2019
[88]ViralMSA, reference-guide, real-time2021
ReferencesDescriptionYear
[82]MultAlign, hierarchical clustering1988
[14]MAFFT, fast Fourier transform, refinement2002
[15]MUSCLE, unweighted pair-group method with arithmetic mean2004
[83]PRALINE, homology-extended, secondary structure2005
[84]Probalign, posterior probabilities2006
[85]SATé, minimizing tree-length, POY2009
[86]PASTA, large-scale, parallel technique2015
[87]VIRULIGN, reference-guide, codon-correct2019
[88]ViralMSA, reference-guide, real-time2021
Table 3

Summary of deterministic iterative algorithm-based methods

ReferencesDescriptionYear
[82]MultAlign, hierarchical clustering1988
[14]MAFFT, fast Fourier transform, refinement2002
[15]MUSCLE, unweighted pair-group method with arithmetic mean2004
[83]PRALINE, homology-extended, secondary structure2005
[84]Probalign, posterior probabilities2006
[85]SATé, minimizing tree-length, POY2009
[86]PASTA, large-scale, parallel technique2015
[87]VIRULIGN, reference-guide, codon-correct2019
[88]ViralMSA, reference-guide, real-time2021
ReferencesDescriptionYear
[82]MultAlign, hierarchical clustering1988
[14]MAFFT, fast Fourier transform, refinement2002
[15]MUSCLE, unweighted pair-group method with arithmetic mean2004
[83]PRALINE, homology-extended, secondary structure2005
[84]Probalign, posterior probabilities2006
[85]SATé, minimizing tree-length, POY2009
[86]PASTA, large-scale, parallel technique2015
[87]VIRULIGN, reference-guide, codon-correct2019
[88]ViralMSA, reference-guide, real-time2021

MultAlign is one of the earliest MSA methods using the iterative algorithm and progressive alignment developed by Corpet in 1988 [82]. MultAlign calculates the pairwise alignment and the distance matrix by FASTA. While constructing the guide tree, it uses a hierarchical clustering to cluster all similar sequences and generates a tree with leaf nodes. MUSCLE (MUltiple Sequence Comparison by Log-Expectation) is one of the widely used programs based on an iterative algorithm, which was developed by Edgar in 2004 [15]. The process of MUSCLE is mainly divided into the following three steps (As shown in Figure 6): (1) calculate the distance matrix D1, construct the guide tree TREE1 by UPGMA (Unweighted Pair-group Method with Arithmetic Mean) and produce an alignment MSA1 following TREE1. (2) Compute the Kimura distance matrix from MSA1 to produce a guide tree TREE2 and an accurate alignment. (3) Firstly, delete an edge near the root node on Tree2, generate two subtrees and calculate the subtree profiles [89]. Then, realign two profiles and obtain a new alignment. If the Sum-of-Pairs (SP) score of the alignment is higher than before, retain the alignment; otherwise, discard it. Finally, repeat this step until the accuracy reaches the criterion.

The flow chart of MUSCLE [15].
Figure 6

The flow chart of MUSCLE [15].

To reduce the time cost, Katoh et al. developed MAFFT (Multiple Alignment based on Fast Fourier Transform) in 2002 [14, 90]. MAFFT uses Fourier transform to convert the sequence into a vector for processing, which simplifies the calculation process of the algorithm. The most significant advantage is that it can quickly identify the homologous region and shorten the overall time of the program. The results demonstrate that MAFFT in iterative refinement mode is over 100 times faster than T-Coffee when more than 60 sequences are without accuracy.

Due to the lack of accuracy in phylogeny, Liu et al. [85] proposed SATé, which first applies the affine gap penalty to POY (a phylogenetic analysis program) in 2009. Compared with ClustalW, SATé has higher efficiency while searching short alignment or tree pairs. Besides, its constructed tree is also more accurate than ClustalW. Furthermore, Liu et al. [91] proposed the 2nd version of SATé in 2012. The method adopts a different divide-and-conquer from the previous one to produce smaller related subsets. Therefore, Saté-II is superior to SATé-I in both performance and accuracy. In 2015, Mirarabet et al. [86] proposed PASTA based on SATé, and this method adopts a parallel strategy to expand the scale of SA.

To fast construct the codon-correct alignment, Libin et al. proposed VIRULIGN [87] concerning the characteristics of mutual conversion between codons and amino acids in 2019. Specifically, firstly, VIRULIGN computes pairwise alignments between each target sequence and the reference sequence. Secondly, the same as the first step, calculate the alignment between the target sequence and the reference sequence represented by amino acid. Then adjust the target sequence according to the difference between the two alignments produced in the 1st and 2nd steps. Finally, repeat from the 2nd step until no more differences are detected. Recently, Moshiri proposed ViralMSA to adopt the reference-guide strategy [88], which aims to align the whole genome in real-time. Moshiri claims that ViralMSA is orders of magnitude faster than VIRULIGN.

Compared with the progressive alignment algorithm, the most significant advantage of the iterative algorithm is that it can improve the alignment result, has good robustness and is insensitive to the number of sequences. However, the drawback is that it is easy to fall into the local optimization.

Heuristics algorithm

A heuristics algorithm is a stochastic iterative algorithm in a broad sense. The main idea is to calculate each feasible solution of the combinatorial optimization problem within the acceptable range of time or space. Heuristics have good performance in solving NP-complete problems, and it can be roughly divided into the genetic algorithm, swarm intelligence, simulated annealing, hidden Markov model (HMM) and meta-heuristics [92].

Simulated annealing

Simulated annealing originates from the principle of solid annealing. When the stable temperature rises, the internal particles become disordered, and when the temperature drops, the particles will gradually become an ordered state. Therefore, a simulated annealing algorithm is often used in combinatorial optimization problems [93].

In 1993, Ishikawa et al. [94] first proposed using simulated annealing to solve MSA. The results demonstrate that the method can get a better solution in a specific time than the traditional progressive alignment algorithm. In addition, this method lays the foundation for the subsequent process based on simulated annealing. For example, Hernndez-Gua et al. [95] described another simulated annealing-based way in 2005. They use the mapping between MSA and Directed Polymer in a Random Media [96] to allow the insertion or deletion of gaps in the process of modifying alignment and expand the space of feasible solutions.

Due to the slow convergence speed of the simulated annealing algorithm, it is often necessary to combine genetic algorithms or swarm intelligence algorithms to improve efficiency.

Genetic algorithm

A genetic algorithm is a method to simulate biological evolution in nature. It transforms the problem-solving process into gene selection, crossover, mutation and other techniques similar to natural chromosomes and obtains better optimization results through multiple iterations [97]. The relative algorithms are shown in Table 4.

Table 4

Summary of genetic algorithm-based methods

ReferencesDescriptionYear
[16]Affine gap penalty, crossover, mutation1996
[98]String, simultaneousness, automated processing1997
[99]Divide-and-conquer, dynamic programming2005
[100]Global criterion for SA, evolutionary computation2009
[101]Multi-objective, structural information, non-gaps percentage totally conserved columns2013
[102]Multi-objective, affine gap penalty, similarity, support maximization2014
[103]Chaotic sequences, logistic map2016
[104]Chemical reaction optimization, single-point operator2019
[105]Optimization algorithm, gap insertion mutation, gap removal mutation2020
[106]Bi-objective, integer coding, Wilcoxon sign test2020
ReferencesDescriptionYear
[16]Affine gap penalty, crossover, mutation1996
[98]String, simultaneousness, automated processing1997
[99]Divide-and-conquer, dynamic programming2005
[100]Global criterion for SA, evolutionary computation2009
[101]Multi-objective, structural information, non-gaps percentage totally conserved columns2013
[102]Multi-objective, affine gap penalty, similarity, support maximization2014
[103]Chaotic sequences, logistic map2016
[104]Chemical reaction optimization, single-point operator2019
[105]Optimization algorithm, gap insertion mutation, gap removal mutation2020
[106]Bi-objective, integer coding, Wilcoxon sign test2020
Table 4

Summary of genetic algorithm-based methods

ReferencesDescriptionYear
[16]Affine gap penalty, crossover, mutation1996
[98]String, simultaneousness, automated processing1997
[99]Divide-and-conquer, dynamic programming2005
[100]Global criterion for SA, evolutionary computation2009
[101]Multi-objective, structural information, non-gaps percentage totally conserved columns2013
[102]Multi-objective, affine gap penalty, similarity, support maximization2014
[103]Chaotic sequences, logistic map2016
[104]Chemical reaction optimization, single-point operator2019
[105]Optimization algorithm, gap insertion mutation, gap removal mutation2020
[106]Bi-objective, integer coding, Wilcoxon sign test2020
ReferencesDescriptionYear
[16]Affine gap penalty, crossover, mutation1996
[98]String, simultaneousness, automated processing1997
[99]Divide-and-conquer, dynamic programming2005
[100]Global criterion for SA, evolutionary computation2009
[101]Multi-objective, structural information, non-gaps percentage totally conserved columns2013
[102]Multi-objective, affine gap penalty, similarity, support maximization2014
[103]Chaotic sequences, logistic map2016
[104]Chemical reaction optimization, single-point operator2019
[105]Optimization algorithm, gap insertion mutation, gap removal mutation2020
[106]Bi-objective, integer coding, Wilcoxon sign test2020

Notredame and Higgins [16] first proposed and developed Sequence Alignment by Genetic Algorithm (SAGA) based on genetic algorithm in 1996. The method continuously optimizes the alignment through crossover and insertion operations until the alignment does not change, as shown in Figure 7. The result indicates that SAGA is suitable for most objective functions and can achieve better alignment. Different from SAGA, Chen et al. proposed a method combining genetic algorithm, divide-and-conquer algorithm and dynamic programming [99]. The method divides the sequences vertically into multiple subsequences by a genetic algorithm. Then compute the sub-alignments of all subsequences using dynamic programming. Finally, splice the sub-alignments into a complete alignment.

The figure explains how to use the existing alignment to generate a better alignment in SAGA. Given two alignments, two new alignments are combined by crossing and inserting gaps. Finally, the better one is selected to join the next generation [16].
Figure 7

The figure explains how to use the existing alignment to generate a better alignment in SAGA. Given two alignments, two new alignments are combined by crossing and inserting gaps. Finally, the better one is selected to join the next generation [16].

Some efforts also focus on the objective function to improve genetic algorithm. For example, Arenas-Daz and Ochoterena proposed a new objective function GLObal Criterion for Sequence Alignment (GLOCSA), in 2009. Experiments show that GLOCSA can effectively improve the performance of the alignment [100]. Later, Ortuno applied a multi-objective function to the genetic algorithm, and the result indicates that the genetic algorithm with multi-objective function has more obvious advantages than the traditional genetic algorithm [101].

Moreover, Gao et al. adopted a chaos algorithm to optimize the genetic algorithm in 2016., the premature phenomenon in the genetic algorithm is effectively solved Using the ergodicity and inherent randomness of chaotic iteration [103]. Different from Gao’s method, Chatterjee uses a chemical reaction optimization algorithm to optimize the population in the evolution process, thus improving the efficiency [104]. In addition, there also are some genetic algorithm-based methods, such as Mishra et al. [105] present progressive technique to enhance the value of the alignment, and Chowdhury et al. [106] combine with bi-objective function to optimize the selection.

The genetic algorithm can give feasible solutions in a particular time, but the most feasible solutions are local optimal solutions. Therefore, the genetic algorithm needs to be combined with other algorithms to improve stability and accuracy.

Swarm intelligence algorithm

Swarm intelligence algorithm simulates the biological behavior of groups such as insects, animals or birds in nature. By definition, any algorithm or strategy inspired by natural behavior can be regarded as a swarm intelligence algorithm. Since the advent of the swarm intelligence algorithm, dozens of algorithms have been derived, including ant colony algorithm, particle swarm optimization (PSO) and artificial bee colony. In the field of MSA, swarm intelligence optimization is also applied, as shown in Table 5.

Table 5

Summary of swarm intelligence algorithm-based methods

ReferencesDescriptionYear
[107]Binary SAGA2009
[108]Dispersion graph, ant colony2009
[109]Ant colony, genetic algorithm, planar graph2010
[110]SAGA2011
[111]Artificial fish swarm2014
[112]Bacterial foraging optimization, genetic algorithm2017
[113]SAGA, simulated annealing2018
[114]Multi-strategy artificial bee colony2018
[115]Flower pollination algorithm, profile algorithm2019
[116]Multi-objective, artificial fish swarm2020
[117]WOAMSA, whale optimization algorithm2020
[118]Meta-heuristic, dynamic programming, dynamic simulated SAGA2021
ReferencesDescriptionYear
[107]Binary SAGA2009
[108]Dispersion graph, ant colony2009
[109]Ant colony, genetic algorithm, planar graph2010
[110]SAGA2011
[111]Artificial fish swarm2014
[112]Bacterial foraging optimization, genetic algorithm2017
[113]SAGA, simulated annealing2018
[114]Multi-strategy artificial bee colony2018
[115]Flower pollination algorithm, profile algorithm2019
[116]Multi-objective, artificial fish swarm2020
[117]WOAMSA, whale optimization algorithm2020
[118]Meta-heuristic, dynamic programming, dynamic simulated SAGA2021
Table 5

Summary of swarm intelligence algorithm-based methods

ReferencesDescriptionYear
[107]Binary SAGA2009
[108]Dispersion graph, ant colony2009
[109]Ant colony, genetic algorithm, planar graph2010
[110]SAGA2011
[111]Artificial fish swarm2014
[112]Bacterial foraging optimization, genetic algorithm2017
[113]SAGA, simulated annealing2018
[114]Multi-strategy artificial bee colony2018
[115]Flower pollination algorithm, profile algorithm2019
[116]Multi-objective, artificial fish swarm2020
[117]WOAMSA, whale optimization algorithm2020
[118]Meta-heuristic, dynamic programming, dynamic simulated SAGA2021
ReferencesDescriptionYear
[107]Binary SAGA2009
[108]Dispersion graph, ant colony2009
[109]Ant colony, genetic algorithm, planar graph2010
[110]SAGA2011
[111]Artificial fish swarm2014
[112]Bacterial foraging optimization, genetic algorithm2017
[113]SAGA, simulated annealing2018
[114]Multi-strategy artificial bee colony2018
[115]Flower pollination algorithm, profile algorithm2019
[116]Multi-objective, artificial fish swarm2020
[117]WOAMSA, whale optimization algorithm2020
[118]Meta-heuristic, dynamic programming, dynamic simulated SAGA2021

Long et al. [107] described an MSA method based on binary SAGA in 2009. This method starts with multiple alignments and then continuously modifies each alignment until convergence. The results show that most of the alignment scores of their ways are better than ClustalW, DiAlign, SAGA and T-Coffee. Besides, Chaabane proposed Particle Swarm Optimization and Simulated Annealing (PSOSA) hybrid model in 2018 [113], which combines PSO and simulated annealing to fully use the former’s exploration ability and the latter’s development ability. The experiment shows that the performance is better than MUSCLE, MAFFT and ClustalW programs.

In addition to PSO, there are many optimization algorithms based on animal behavior in MSA. For example, in 2009, Chen et al. proposed an ant colony algorithm-based method [108]. They use a dispersion graph to represent the alignment and then an ant colony algorithm to find an optimal path. The experimental results show that this method can obtain a better solution than ClustalX. Meanwhile, Xiang et al. [109] proposed a similar approach based on genetic algorithm and planar graph to optimize the search path. Kuang et al. [114] proposed an artificial bee colony-based method that adopts a multi-strategy to balance the global exploration and the local exploitation in 2018. The strategies include the following three aspects: tent chaos initialization population, neighborhood search and tournament selection. The results show that this method has better performance and biological characteristics and has strong robustness.

Some studies also adopt less commonly used algorithms. In 2017, Manikandan and Duraisamy proposed a method combining a bacterial foraging algorithm and genetic algorithm based on the dispersion optimization principle of the individual and population of Escherichia coli[112]. They obtain individuals with higher fitness through bacterial foraging behavior to improve accuracy and speed. The result shows that this method is better than other swarm intelligence algorithms on multiple benchmarks. In 2019, Hussein et al. adopted the flower pollination algorithm to optimize the MSA [115]. Besides, they also proposed a new profile algorithm to improve alignment quality. Finally, compared with other methods, this method is superior to other ways except for the MSAProbs.

Hidden Markov model

The HMM) is a statistical analysis model, which is used to describe a Markov process with unknown parameters [119]. In short, HMM can calculate the state transition probability matrix by analyzing the object’s previous and current observable states and inferring the next state of the object through the probability matrix. Due to the Markov property of MSA, some researchers proposed several MSA methods based on HMM, as shown in Table 6.

Table 6

Summary of HMM-based methods

ReferencesDescriptionYear
[17]ProbCons, probabilistic consistency-based2005
[120]MUMMALS, local structural information2006
[121]PROMALS, database search, secondary structure prediction2007
[122]MSAProbs, partition function2010
[123]Clustal Omega, mBed, profile2011
[124]RDPSO, SAGA2014
[125]ProbPFP, SAGA2019
ReferencesDescriptionYear
[17]ProbCons, probabilistic consistency-based2005
[120]MUMMALS, local structural information2006
[121]PROMALS, database search, secondary structure prediction2007
[122]MSAProbs, partition function2010
[123]Clustal Omega, mBed, profile2011
[124]RDPSO, SAGA2014
[125]ProbPFP, SAGA2019
Table 6

Summary of HMM-based methods

ReferencesDescriptionYear
[17]ProbCons, probabilistic consistency-based2005
[120]MUMMALS, local structural information2006
[121]PROMALS, database search, secondary structure prediction2007
[122]MSAProbs, partition function2010
[123]Clustal Omega, mBed, profile2011
[124]RDPSO, SAGA2014
[125]ProbPFP, SAGA2019
ReferencesDescriptionYear
[17]ProbCons, probabilistic consistency-based2005
[120]MUMMALS, local structural information2006
[121]PROMALS, database search, secondary structure prediction2007
[122]MSAProbs, partition function2010
[123]Clustal Omega, mBed, profile2011
[124]RDPSO, SAGA2014
[125]ProbPFP, SAGA2019

ProCons (Probabilistic Consistency) is the first program to use HMM for MSA, developed by Do et al. in 2005 [17]. ProCons adopts a novel objective function based on probability consistency, which combines the posterior probability matrix derived from HMM and the alignment consistency to integrate the conservative information of sequences. From the results, ProCons is more accurate than Align-m, ClustalW, MUSCLE and other programs. In 2006, Roshan and Livesay made some improvements based on ProCons and proposed Probalign (Probabilistic alignment) [84]. The difference is that ProCons uses the matched partition function interested of HMM to calculate the posterior probability matrix.

Efficient computation with high accuracy is an urgent need. Therefore, to solve this problem, Liu et al. [122] developed a program, MASProbs based on ProCons and Proalign. To enhance the performance, MSAProbs calculates the weight of the sequence, which participates in the calculation of probability consistency score and progressive alignment. The result shows that MASProbs has higher accuracy than ClustalW, MAFFT, MUSCLE, ProCons and Probalign. Since then, the author proposed MSAProbs-MPI, which introduces parallel and distributed memory based on MSAProbs and can cope with larger datasets [126, 127]. In 2019, Zhan et al. [125] proposed the ProbPFP method, which mainly refers to the above techniques. However, ProbPFP is unique in that it introduces PSO to optimize the parameters of HMM model.

ClustalW has great limitations in large-scale data; therefore, Sievers et al. developed Clustal Omega in 2011. Clustal Omega is based on the mBed algorithm and HMM. Clustal Omega can process tens of thousands of sequences, and its accuracy is greatly improved. Besides, to improve the alignment efficiency, Clustal Omega can use the existing alignment in the database to assist new sequences alignment [123, 128]. In 2020, Pachetti et al. [129] applied clustal Omega to the MSA of genomes of SARS-CoV-2 and found eight new recurrent mutations of virus.

Machine learning

Machine learning is still in the initial stage in MSA. In 2016, Mircea et al. [18] first proposed an MSA method based on reinforcement learning. This method is similar to the progressive alignment algorithm to find an optimal alignment order and then align the sequences in order. They use a reinforcement learning method based on dynamic programming–Q-learning to calculate the optimal alignment order of sequences and use this order to construct the alignment gradually. In 2019, Jafari et al. [20] replaced Q-learning with the A3C (Asynchronous Advantage Actor Critic) model in deep reinforcement learning based on Mircea, which improved the convergence speed and reduced the possibility of falling into local optimum to a certain extent.

Then, Ramakrishnan et al. [19] proposed an MSA method based A3Cin 2018, called RLALIGN. Unlike Mircea’s method, RLALIGN takes the current alignment as the state and the moving direction of the nucleotide in the alignment as the action. Although this method can better align some small datasets, the process may not converge with the sequence number or length increase.

Machine learning has valuable achievements in many fields of bioinformatics, but there is more work to be done across MSA. In particular, there are three challenges: (1) lack of sufficient optimal alignment samples for learning. (2) There is no suitable loss function. (3) It is challenging to build a generalized model due to the variability of the length and quantity of sequences.

Divide-and-conquer algorithm

Divide-and-conquer is a relatively simple algorithm whose idea is to break down a whole problem into multiple independent sub-problems of the related or same type for processing. The divide-and-conquer algorithm is a standard solution for solving MSA, and the relevant methods are shown in Table 7.

Table 7

Summary of divide-and-conquer-based methods

ReferencesDescriptionYear
[21]SpliVert, splitting-splicing, refinement2020
[22]FAME, common area, splice2020
[130]MAGUS, divide-and-conquer, graph clustering2021
[131]FMAlign, FM-index, common segments2021
ReferencesDescriptionYear
[21]SpliVert, splitting-splicing, refinement2020
[22]FAME, common area, splice2020
[130]MAGUS, divide-and-conquer, graph clustering2021
[131]FMAlign, FM-index, common segments2021
Table 7

Summary of divide-and-conquer-based methods

ReferencesDescriptionYear
[21]SpliVert, splitting-splicing, refinement2020
[22]FAME, common area, splice2020
[130]MAGUS, divide-and-conquer, graph clustering2021
[131]FMAlign, FM-index, common segments2021
ReferencesDescriptionYear
[21]SpliVert, splitting-splicing, refinement2020
[22]FAME, common area, splice2020
[130]MAGUS, divide-and-conquer, graph clustering2021
[131]FMAlign, FM-index, common segments2021

Zhan et al. [21] proposed an optimization method for MSA based on splitting and splicing in 2020, which is called SpliVert (Splitting-splicing Vertically). The process is as shown in Figure 8. Firstly, obtain an alignment by other MSA methods. Second, divide the alignment into three parts from the vertical direction and then remove the gap in the middle part and realign the part. Finally, merge the three regions and obtain the refined alignment. The experiment shows that the alignment optimized is better than those obtained by the original MSA method or program.

This picture explains the optimization process of SpliVert. The general process can be summarized to four steps: compute the alignment, split the alignment, realign the middle part and merge these parts into alignment [21].
Figure 8

This picture explains the optimization process of SpliVert. The general process can be summarized to four steps: compute the alignment, split the alignment, realign the middle part and merge these parts into alignment [21].

To reduce the time and memory consumption of long SA, Naznooshsadat et al. [22] proposed FAME (FAst and MEmory), an MSA method based on splitting and splicing. Its process consists of the following three steps: (1) divide the set of sequences into groups of subsequences vertically and maintain the order of these subsequences on the original sequences. (2) Align the subsequence at common regions and align the unaligned subsequence with other MSA methods. (3) Merge all aligned subsequences. The results show that the FAME can process large-scale data four times faster than the original method.

The alignment method for large-scale sequences has always been a hot spot. For example, Smirnov et al. [130] proposed an MSA method based on graph clustering and divide-and-conquer algorithm—MAGUS (MSA using Graph clUStering) in 2021. This method is similar to PASTA. It also introduces divide-and-conquer to construct the alignment. The difference between PASTA and MAGUS lies in the merging stage of the subset. Smirnov uses a merging strategy of combining disjoint alignment—GMC (Graph Clustering Merge) to merge subsets. In addition, benefit from GMC, MAGUS does not need iteration to improve accuracy like PASTA. Their experiments show that the method has strong robustness, and it not only improves the accuracy and exceeds PASTA in speed. Furthermore, Shen et al. [132] introduced HMM based on Smirnov’s work to further enhance the accuracy of alignment in 2021.

What’s more, Liu et al. proposed a technique, FMAlign, which is based on divide-and-conquer and FM-index. Their method has four steps: 1. building FM-index; 2. querying anchors; 3. finding optimal chain; 4. breaking sequence and aligning. Compared with MAFFT, Halign and FAME, FMAlign has a shorter running time and higher accuracy on long sequences. Besides, their results show that FMAlign is applicable to large-scale sequences alignment [131].

Because divide-and-conquer can divide a whole problem into multiple sub-problems that can be handled directly, its performance has apparent advantages over other algorithms. In addition, the algorithm can flexibly combine with parallel techniques to further improve performance because these sub-problems have the same or similar types and are independent.

Objective functions

In MSA, an objective function is needed to judge whether the alignment results meet the expectations. SP is the most common and simple objective function [10]. Its formula is shown in Equation 1, where |$m$| is the length of the alignment, |$n$| is the number of sequences. |$c_i^j$| and |$c_i^k$| denotes the |$i^{th}$| unit in the |$j^{th}$| and |$k^{th}$| sequences, respectively. |$s(c_i^j,c_i^k )$| represents the matching score between two units that can be set according to Percent Accepted Mutation (PAM) or BLOcks SUbstitution Matrix (BLOSUM).
(1)
Another commonly used function is the WSP, which adds sequence weight based on SP function [11], and its formula is shown in Equation 2, where |$w_(j,k)$| is the product of the sequence weights of |$c_i^j$| and |$c_i^k$|⁠, and the sequence weight is calculated from the guide tree.
(2)
In addition to the weight difference, WSP also introduces the gap penalty. The gap penalty comprises the gap opening penalty and gap extension penalty. The former represents the penalty for the appearance of anyone gap, whereas the latter is the penalty points obtained by the extension of any gap starting from the opening gap. At this time, the score of |$s(c_i^j,c_i^k)$| has the following three cases. The first is to set the score according to PAM or BLOSUM when neither |$c_i^j$| nor |$c_i^k$| are gaps. The second is to set |$s(c_i^j,c_i^k )=0$| when both are gaps. And the last is to set |$s(c_i^j,c_i^k )=G$| when either one of |$c_i^j$| and |$c_i^k$| is gap. The calculation of |$G$| is shown in Equation 3, where |$gop$| is the opening penalty for the gap, |$gep$| is the extension penalty value for the gap and |$n$| is the length of consecutive gaps.
(3)
Another objective function is the consistency-based objective function for alignment evaluation (COFFEE). The higher the score calculated, the higher the biological correlation of the alignment [133]. Before calculating the COFFEE score, a library of pairwise alignments needs to be established by FSSP database (Fold Classification Based on Structure-Structure Alignment of Proteins) to provide sufficient information for subsequent MSA. The formula of this objective function is shown in Equation 4, where |$score(A_{(j,k)})$| is the number of aligned pairs of residues that are shared between |$A_{(j,k)}$| and the library, |$len(A_{(j,k)})$| is the alignment length of |$A_{(j,k)}$|⁠, and |$w_{(j,k)}$| is the weight for pairwise alignments, such as the similarity of two sequences.
(4)

The total column (TC) score is a relatively simple evaluation metric indicating the similarity between two alignments. The score is the ratio of correctly aligned columns between the test alignment and the reference alignment to the total number of columns in the reference alignment [134]. Take two alignments in Figure 9 as an example. There are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, the TC score is 0.5 (3 divided by 6). The value range of the TC score is |$[0,1]$|⁠. The higher the value, the more agreement the test alignment with the reference alignment.

Two alignments for illustrating the calculation of TC score. In this figure, there are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, TC score is 0.5 (3 divided by 6).
Figure 9

Two alignments for illustrating the calculation of TC score. In this figure, there are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, TC score is 0.5 (3 divided by 6).

Programs

The commonly used programs for MSA include ClustalW, MUSCLE, MAFFT and T-Coffee, which are shown in Table 8.

Among the MSA programs, Clustal is the earliest program used for MSA and now, its improved version ClustalW is still commonly used. ClustalW adopts a simple progressive alignment algorithm [11], which can not guarantee the optimal alignment. MUSCLE is an MSA program based on progressive and iterative algorithm [15], which is also one of the fastest MSA programs at present. MUSCLE is better than ClustalW in speed and accuracy. For example, for 10 000 sequences with a length of 350|$bp$|⁠, MUSCLE can compute the alignment in >10 min, whereas ClustalW may take 1 year.

MAFFT is the first program to adopt fast Fourier transform in MSA. The first version of MAFFT was released in 2002 [14]. MAFFT adds various options and alignment modes, such as FFT-NS-1, FFT-NS-2 and FFT-NS-i and researchers can select appropriate options for different occasions. In alignment accuracy, MAFFT is higher than MUSCLE. T-Coffee is one of the most popular programs [71], which has high scalability and abundant functions. T-Coffee can integrate various sequences information during alignment, so its accuracy is more elevated than ClustalW, MUSCLE and MAFFT.

Benchmarks

Benchmark is the golden criterion for judging the MSA method. Table 9 [123] lists the benchmarking results of the popular MSA programs on BAliBASE (Benchmark Alignment dataBASE) [135]. BAliBASE consists of five groups of references, containing 82, 41, 30 and 16 alignments, respectively. Besides, the first group is divided into two subgroups, R1-1 and R1-2, which contain 38 and 44 alignments, respectively. In Table 9. the TC scores of each reference are in columns 2–7. The average TC score general references are given in the second last column and the total run time for all references is given in the last column. The value range of the TC score is |$[0,1]$| and the higher the value, the more agreement the result of the program with the benchmark. As can be seen from the table, MSAProbs has the highest accuracy, but its run time is higher than most programs. Kalign has the shortest run time, but the accuracy is not satisfactory. The worst program is PRANK, which runs much longer than other programs and has a lower score and run time.

Table 9

Summary of the benchmarking results of several MSA program on BAliBASE 3.0 [123]

ProgramR1-1R1-2R2R3R4R5Avg ScoreTot time(s)
MSAProbs [122]0.4410.8650.4640.6070.6220.6080.60712382.00
Probalign [84]0.4530.8620.4390.5660.6030.5490.58910095.20
MAFFT [14]0.4390.8310.4500.5810.6050.5910.5881475.40
Probcons [17]0.4170.8550.4060.5440.5320.5730.55813086.30
Clustal Omega [123]0.3580.7890.4500.5750.5790.5330.554539.91
T-Coffee [71]0.4100.8480.4020.4910.5450.5870.55181041.50
Kalign [72]0.3650.7900.3600.4760.5040.4350.50121.88
MUSCLE [15]0.3180.8040.3500.4090.4500.4600.475789.57
FSA [136]0.2700.8180.1870.2590.4740.3980.41953648.10
DiAlign [12]0.2650.6960.2920.3120.4410.4250.4153977.44
PRANK [137]0.2230.6800.2570.3210.3600.3560.376128355.00
ClustalW [11]0.2270.7120.2200.2720.3960.3080.374766.47
ProgramR1-1R1-2R2R3R4R5Avg ScoreTot time(s)
MSAProbs [122]0.4410.8650.4640.6070.6220.6080.60712382.00
Probalign [84]0.4530.8620.4390.5660.6030.5490.58910095.20
MAFFT [14]0.4390.8310.4500.5810.6050.5910.5881475.40
Probcons [17]0.4170.8550.4060.5440.5320.5730.55813086.30
Clustal Omega [123]0.3580.7890.4500.5750.5790.5330.554539.91
T-Coffee [71]0.4100.8480.4020.4910.5450.5870.55181041.50
Kalign [72]0.3650.7900.3600.4760.5040.4350.50121.88
MUSCLE [15]0.3180.8040.3500.4090.4500.4600.475789.57
FSA [136]0.2700.8180.1870.2590.4740.3980.41953648.10
DiAlign [12]0.2650.6960.2920.3120.4410.4250.4153977.44
PRANK [137]0.2230.6800.2570.3210.3600.3560.376128355.00
ClustalW [11]0.2270.7120.2200.2720.3960.3080.374766.47

Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references, and total running times. The best results in each column are put in bold.

Table 9

Summary of the benchmarking results of several MSA program on BAliBASE 3.0 [123]

ProgramR1-1R1-2R2R3R4R5Avg ScoreTot time(s)
MSAProbs [122]0.4410.8650.4640.6070.6220.6080.60712382.00
Probalign [84]0.4530.8620.4390.5660.6030.5490.58910095.20
MAFFT [14]0.4390.8310.4500.5810.6050.5910.5881475.40
Probcons [17]0.4170.8550.4060.5440.5320.5730.55813086.30
Clustal Omega [123]0.3580.7890.4500.5750.5790.5330.554539.91
T-Coffee [71]0.4100.8480.4020.4910.5450.5870.55181041.50
Kalign [72]0.3650.7900.3600.4760.5040.4350.50121.88
MUSCLE [15]0.3180.8040.3500.4090.4500.4600.475789.57
FSA [136]0.2700.8180.1870.2590.4740.3980.41953648.10
DiAlign [12]0.2650.6960.2920.3120.4410.4250.4153977.44
PRANK [137]0.2230.6800.2570.3210.3600.3560.376128355.00
ClustalW [11]0.2270.7120.2200.2720.3960.3080.374766.47
ProgramR1-1R1-2R2R3R4R5Avg ScoreTot time(s)
MSAProbs [122]0.4410.8650.4640.6070.6220.6080.60712382.00
Probalign [84]0.4530.8620.4390.5660.6030.5490.58910095.20
MAFFT [14]0.4390.8310.4500.5810.6050.5910.5881475.40
Probcons [17]0.4170.8550.4060.5440.5320.5730.55813086.30
Clustal Omega [123]0.3580.7890.4500.5750.5790.5330.554539.91
T-Coffee [71]0.4100.8480.4020.4910.5450.5870.55181041.50
Kalign [72]0.3650.7900.3600.4760.5040.4350.50121.88
MUSCLE [15]0.3180.8040.3500.4090.4500.4600.475789.57
FSA [136]0.2700.8180.1870.2590.4740.3980.41953648.10
DiAlign [12]0.2650.6960.2920.3120.4410.4250.4153977.44
PRANK [137]0.2230.6800.2570.3210.3600.3560.376128355.00
ClustalW [11]0.2270.7120.2200.2720.3960.3080.374766.47

Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references, and total running times. The best results in each column are put in bold.

Currently, the most used benchmark is BAliBASE [135], which provides multiple reference sets that have been optimized. Up to now, BAliBASE has three versions. The 1st version provides four reference sets, each containing multiple reference alignments. The 2nd version provides eight reference sets, and the 3rd version adds an update protocol to allow online updates of the reference set. In addition to BAliBASE, other common benchmark sets include SABMark (Sequence Alignment BenchMark) [138], OXBench (OXford Benchmark) [139], SMART (Simple Modular Architecture Research Tool) [140], PREFAB (Protein REFerence Alignment Benchmark) [15] and QuanTest [141].

Challenges and opportunities

Although the existing methods for MSA can cope with the most alignment tasks, these methods are not good enough to provide comprehensive solutions for any analysis scenario in any condition. There are still some challenges and opportunities for future directions.

Scale of sequence

Compared with small-scale and straightforward data in most bioinformatics fields, the MSA data may have a large discrepancy in the scale of each alignment. At the same time, with the rapid accumulation of sequences, MSA may face hyper-scale sequences. For example, Koyama et al. [142] collected >10 000 genomes to analyze the variation of SARS-CoV-2. However, the existing MSA methods are challenging to cope with hyper-scale sequences. Even if possible, these methods have low accuracy or high consumption of memory and time. This problem is a big challenge to develop a technique that can deal with hyper-scale sequences while ensuring accuracy and consumption.

Design of method

Most MSA methods are flexible and complicated. Various efforts are made to improve accuracy and performance, and several works adopt more complex machine learning, such as reinforcement learning and natural language processing. In real-life sequence analysis scenarios, the scale and length of sequences may differ, requiring that the method has good generalization ability and robustness to ensure accuracy. In addition, using friendly visualization technology is also essential for users. Making the MSA method more reliable and usable is a promising issue.

Correctness of alignment

The correctness-sensitive biological sequence analysis scenarios put forward higher requirements for the biological correctness of alignment. The most existing MSA methods are difficult to predict the evolution process of sequences correctly and lack explanations. For these problems, there are three main reasons as follows: the randomness of the evolutionary process, the limitations of the bioinformatics method and the lack of an accurate evolutionary model [143]. Therefore, focusing on the above points is more helpful for applying MSA to biological sequence analysis scenarios.

Conclusion

MSA is widely used in many bioinformatics scenarios because it can extract potential information from sequences. In this work, we introduced the base knowledge of MSA from several aspects. Furthermore, we described representative applications of MSA for database searching, phylogenetics, genomics, metagenomics and proteomics. We also categorize and analyze the classical and state-of-the-art algorithms for MSA, such as progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Finally, We further discussed the challenges of this field and promising directions. This review will provide valuable insight and serve as a starting point for studying MSA in future research.

Key Points
  • As many biomedical sequences have been accumulated, MSA is widely applied to extract knowledge from these massive sequences, such as evolution, structure and function.

  • We systematically introduce MSA’s knowledge from several aspects, including background, application, database, algorithm, metric, program and benchmark.

  • We also list and discuss the current challenges and opportunities for future directions in bioinformatics.

  • As a comprehensive review of current works, this paper will provide valuable insight and serve as a launching point for researchers in their studies.

Funding

National Natural Science Foundation of China (Grant No. 61922020).

Author Biographies

Yongqing Zhang He is an associate professor in the School of Computer Science at the Chengdu University of Information Technology. He is a senior member of CCF. His research interests include machine learning and bioinformatics.

Qiang Zhang He is a graduate student in the School of Computer Science at the Chengdu University of Information Technology. His research interests include deep learning and bioinformatics.

Jiliu Zhou He is a professor in the School of Computer Science at the Chengdu University of Information Technology. His research interests include intelligent computing and image processing.

Quan Zou He is a professor in the Institute of Fundamental and Frontier Science at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics and machine learning.

References

1.

Wang
T
,
Liang
C
,
Hou
Y
, et al.
Small design from big alignment: engineering proteins with multiple sequence alignment as the starting point
.
Biotechnol Lett
2020
;
42
(
8
):
1305
15
.

2.

Makigaki
S
,
Ishida
T
.
Sequence alignment generation using intermediate sequence search for homology modeling
.
Comput Struct Biotechnol J
2020
;
18
:
2043
50
.

3.

Huang
M
,
Shah
ND
,
Yao
L
.
Evaluating global and local sequence alignment methods for comparing patient medical records
.
BMC Med Inform Decis Mak
2019
;
19
(
Suppl 6
):
263
.

4.

Baharav
TZ
,
Kamath
GM
,
Tse
DN
, et al.
Spectral jaccard similarity: a new approach to estimating pairwise sequence alignments
.
Patterns (N Y)
2020
;
1
(
6
):100081.

5.

Bawono
P
,
Dijkstra
M
,
Pirovano
W
, et al.
Multiple sequence alignment
.
Methods Mol Biol
2017
;
1525
:
167
89
.

6.

Needleman
SB
,
Wunsch
CD
.
A general method applicable to the search for similarities in the amino acid sequence of two proteins
.
J Mol Biol
1970
;
48
(
3
):
443
53
.

7.

Smith
TF
,
Waterman
MS
.
Identification of common molecular subsequences
.
J Mol Biol
1981
;
147
(
1
):
195
7
.

8.

Chatzou
M
,
Magis
C
,
Chang
JM
, et al.
Multiple sequence alignment modeling: methods and applications
.
Brief Bioinform
2016
;
17
(
6
):
1009
23
.

9.

Warnow
T
.
Revisiting evaluation of multiple sequence alignment methods
.
Methods Mol Biol
2021
;
2231
:
299
317
.

10.

Altschul
SF
,
Lipman
DJ
.
Trees, stars, and multiple biological sequence alignment
.
SIAM J Appl Math
1989
;
49
(
1
):
197
209
.

11.

Thompson
JD
,
Higgins
DG
,
Gibson
TJ
.
Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
.
Nucleic Acids Res
1994
;
22
(
22
):
4673
80
.

12.

Morgenstern
B
,
Frech
K
,
Dress
A
, et al.
Dialign: finding local similarities by multiple sequence alignment
.
Bioinformatics
1998
;
14
(
3
):
290
4
.

13.

Lassmann
T
,
Sonnhammer
E
.
Kalign: an accurate and fast multiple sequence alignment algorithm
.
BMC Bioinformatics
2005
;
6
:
298
.

14.

Katoh
K
,
Misawa
K
,
Kuma
K-i
, et al.
Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform
.
Nucleic Acids Res
2002
;
30
:
3059
66
.

15.

Edgar
RC
.
Muscle: multiple sequence alignment with high accuracy and high throughput
.
Nucleic Acids Res
2004
;
32
(
5
):
1792
7
.

16.

Notredame
C
,
Higgins
DG
.
Saga: sequence alignment by genetic algorithm
.
Nucleic Acids Res
1996
;
24
(
8
):
1515
24
.

17.

Do
CB
,
Mahabhashyam
MS
,
Brudno
M
, et al.
Probcons: probabilistic consistency-based multiple sequence alignment
.
Genome Res
2005
;
15
(
2
):
330
40
.

18.

Mircea
I
,
Czibula
G
,
Bocicor
M
. A q-learning approach for aligning protein sequences. In:
2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP)
, IEEE, Piscataway,
2015
,
51
8
.

19.

Ramakrishnan
KR
,
Singh
J
,
Blanchette
M
. Rlalign: a reinforcement learning approach for multiple sequence alignment. In:
2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE)
, IEEE, Piscataway,
2018
,
61
6
.

20.

Jafari
R
,
Javidi
MM
,
Rafsanjani
MK
.
Using deep reinforcement learning approach for solving the multiple sequence alignment problem
.
SN Appl Sci
2019
;
1
(
6
):
592
.

21.

Zhan
Q
,
Fu
Y
,
Jiang
Q
, et al.
Splivert: a protein multiple sequence alignment refinement method based on splitting-splicing vertically
.
Protein Pept Lett
2020
;
27
(
4
):
295
302
.

22.

Naznooshsadat
E
,
Elham
P
,
Ali
SZ
.
Fame: fast and memory efficient multiple sequences alignment tool through compatible chain of roots
.
Bioinformatics
2020
;
36
(
12
):
3662
8
.

23.

Notredame
C
.
Recent progress in multiple sequence alignment: a survey
.
Pharmacogenomics
2002
;
3
:
131
44
.

24.

Chowdhury
B
,
Garai
G
.
A review on multiple sequence alignment from the perspective of genetic algorithm
.
Genomics
2017
;
109
(
5-6
):
419
31
.

25.

Xia
Z
,
Cui
Y
,
Zhang
A
, et al.
A review of parallel implementations for the smith-waterman algorithm
.
Interdiscip Sci
2021
;
3
:
1
14
.

26.

Altschul
SF
,
Gish
W
,
Miller
W
, et al.
Basic local alignment search tool
.
J Mol Biol
1990
;
215
(
3
):
403
10
.

27.

Pearson
WR
,
Lipman
DJ
.
Improved tools for biological sequence comparison
.
Proc Natl Acad Sci U S A
1988
;
85
(
8
):
2444
8
.

28.

Li
YC
,
Lu
YC
.
Blastp-acc: parallel architecture and hardware accelerator design for blast-based protein sequence alignment
.
IEEE Trans Biomed Circuits Syst
2019
;
13
(
6
):
1771
82
.

29.

Jin
X
,
Liao
Q
,
Wei
H
, et al.
Smi-blast: a novel supervised search framework based on psi-blast for protein remote homology detection
.
Bioinformatics
2020
.

30.

Kapli
P
,
Yang
Z
,
Telford
MJ
.
Phylogenetic tree building in the genomic age
.
Nat Rev Genet
2020
;
21
(
7
):
428
44
.

31.

Khan
BM
,
Sabir
M
,
Alyemeni
MN
, et al.
Genetic similarities and phylogenetic analysis of muntjac (muntiacus spp.) by comparing the nucleotide sequence of 16s rrna and cytochrome b genome
.
Braz J Biol
2021
;
83
:e248153.

32.

Liu
B
,
Pavel
JA
,
Hausbeck
MK
, et al.
Phylogenetic analysis, vegetative compatibility, virulence, and fungal filtrates of leaf curl pathogen Colletotrichum fioriniae from celery
.
Phytopathology
2021
;
111
(
4
):
751
60
.

33.

Wei
R
,
Zhang
XC
.
Phylogeny of diplazium (athyriaceae) revisited: resolving the backbone relationships based on plastid genomes and phylogenetic tree space analysis
.
Mol Phylogenet Evol
2020
;
143
:106699.

34.

Hu
Y
,
Xing
W
,
Hu
Z
, et al.
Phylogenetic analysis and substitution rate estimation of colonial volvocine algae based on mitochondrial genomes
.
Genes (Basel)
2020
;
11
(
1
).

35.

Fariq
A
,
Blazier
JC
,
Yasmin
A
, et al.
Whole genome sequence analysis reveals high genetic variation of newly isolated Acidithiobacillus ferrooxidans io-2c
.
Sci Rep
2019
;
9
(
1
):
13049
.

36.

Hu
B
,
Guo
H
,
Zhou
P
, et al.
Characteristics of sars-cov-2 and covid-19
.
Nat Rev Microbiol
2021
;
19
(
3
):
141
54
.

37.

Yin
C
.
Genotyping coronavirus sars-cov-2: methods and implications
.
Genomics
2020
;
112
(
5
):
3588
96
.

38.

Guruprasad
L
.
Evolutionary relationships and sequence-structure determinants in human SARS coronavirus-2 spike proteins for host receptor recognition
.
Proteins
2020
;
88
(
11
):
1387
93
.

39.

Chang
TJ
,
Yang
DM
,
Wang
ML
, et al.
Genomic analysis and comparative multiple sequences of SARS-cov2
.
J Chin Med Assoc
2020
;
83
(
6
):
537
43
.

40.

Madhavan
A
,
Venkatesan
G
,
Kumar
A
, et al.
Comparative sequence and structural analysis of the orf095 gene, a vaccinia virus a4l homolog of capripoxvirus in sheep and goats
.
Arch Virol
2020
;
165
(
6
):
1419
31
.

41.

Hecker
N
,
Hiller
M
.
A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers
.
Gigascience
2020
;
9
(
1
).

42.

Roe
D
,
Vierra-Green
C
,
Pyo
CW
, et al.
A detailed view of kir haplotype structures and gene families as provided by a new motif-based multiple sequence alignment
.
Front Immunol
2020
;
11
:585731.

43.

Hunter
CI
,
Mitchell
A
,
Jones
P
, et al.
Metagenomic analysis: the challenge of the data bonanza
.
Brief Bioinform
2012
;
13
:
743
6
,
6
.

44.

Zhou
H
,
Chen
X
,
Hu
T
, et al.
A novel bat coronavirus closely related to sars-cov-2 contains natural insertions at the s1/s2 cleavage site of the spike protein
.
Curr Biol
2020
;
30
(
11
):
2196
2203.e3
.

45.

Breitwieser
FP
,
Lu
J
,
Salzberg
SL
.
A review of methods and databases for metagenomic classification and assembly
.
Brief Bioinform
2019
;
20
(
4
):
1125
36
.

46.

Storato
D
,
Comin
M
.
K2mem: discovering discriminative k-mers from sequencing data for metagenomic reads classification
.
IEEE/ACM Trans Comput Biol Bioinform
2021
;
19
(1):220–229.

47.

Burks
DJ
,
Azad
RK
.
Higher-order Markov models for metagenomic sequence classification
.
Bioinformatics
2020
;
36
(
14
):
4130
6
.

48.

Velankar
S
,
Burley
SK
,
Kurisu
G
, et al.
The protein data bank archive
.
Methods Mol Biol
2021
;
2305
:
3
21
.

49.

Makigaki
S
,
Ishida
T
.
Sequence alignment using machine learning for accurate template-based protein structure prediction
.
Bioinformatics
2020
;
36
(
1
):
104
11
.

50.

Mirabello
C
,
Wallner
B
.
Rawmsa: end-to-end deep learning using raw multiple sequence alignments
.
PLoS One
2019
;
14
(
8
):e0220182.

51.

Cantelli
G
,
Bateman
A
,
Brooksbank
C
, et al.
The European Bioinformatics Institute (EMBL-EBI) in 2021
.
Nucleic Acids Res
2021
;
50
(D1):D11–D19.

52.

Sayers
EW
,
Cavanaugh
M
,
Clark
K
, et al.
Genbank
.
Nucleic Acids Res
2020
;
48
(
D1
):
D84
d86
.

53.

Ogasawara
O
,
Kodama
Y
,
Mashima
J
, et al.
DDBJ database updates and computational infrastructure enhancement
.
Nucleic Acids Res
2020
;
48
(
D1
):
D45
d50
.

54.

Tuli
MA
,
Flores
TP
,
Cameron
GN
.
Submission of nucleotide sequence data to EMBL/genbank/DDBJ
.
Mol Biotechnol
1996
;
6
(
1
):
47
51
.

55.

The UniProt Consortium
.
Uniprot: a worldwide hub of protein knowledge
.
Nucleic Acids Res
2019
;
47
(
D1
):
D506
15
.

56.

Chen
FZ
,
You
LJ
,
Yang
F
, et al.
Cngbdb: China national genebank database
.
Yi Chuan
2020
;
42
(
8
):
799
809
.

57.

Leinonen
R
,
Sugawara
H
,
Shumway
M
.
The sequence read archive
.
Nucleic Acids Res
2011
;
39
(
Database issue
):
D19
21
.

58.

Pruitt
KD
,
Tatusova
T
,
Brown
GR
, et al.
NCBI reference sequences (refseq): current status, new features and genome annotation policy
.
Nucleic Acids Res
2012
;
40
(
D1
):
D130
5
.

59.

Letovsky
SI
,
Cottingham
RW
,
Porter
CJ
, et al.
GDB: the human genome database
.
Nucleic Acids Res
1998
;
26
(
1
):
94
9
.

60.

Caló
D
,
De Pascali
,
Sasanelli
D
, et al.
Mmtdb: a metazoa mitochondrial DNA variants database
.
Nucleic Acids Res
1997
;
25
(
1
):
200
5
.

61.

Attimonelli
M
,
Altamura
N
,
Benne
R
, et al.
Mitbase: a comprehensive and integrated mitochondrial dna database. The present status
.
Nucleic Acids Res
2000
;
28
(
1
):
148
52
.

62.

Lang
OW
,
Nash
RS
,
Hellerstedt
ST
, et al.
An introduction to the saccharomyces genome database (SGD)
.
Methods Mol Biol
2018
;
1757
:
21
30
.

63.

Kelley
S
.
Getting started with acedb
.
Brief Bioinform
2000
;
1
(
2
):
131
7
.

64.

Sherry
ST
,
Ward
MH
,
Kholodov
M
, et al.
DBSNP: the NCBI database of genetic variation
.
Nucleic Acids Res
2001
;
29
(
1
):
308
11
.

65.

Amberger
JS
,
Bocchini
CA
,
Scott
AF
, et al.
Omim.org: leveraging knowledge across phenotype-gene relationships
.
Nucleic Acids Res
2019
;
47
(
D1
):
D1038
43
.

66.

MacDonald
JR
,
Ziman
R
,
Yuen
RK
, et al.
The database of genomic variants: a curated collection of structural variation in the human genome
.
Nucleic Acids Res
2014
;
42
(
Database issue
):
D986
92
.

67.

Pundir
S
,
Martin
MJ
,
O’Donovan
C
.
Uniprot protein knowledgebase
.
Methods Mol Biol
2017
;
1558
:
41
55
.

68.

Hogeweg
P
,
Hesper
B
.
The alignment of sets of sequences and the construction of phyletic trees: an integrated method
.
J Mol Evol
1984
;
20
:
175
86
.

69.

Feng
DF
,
Doolittle
RF
.
Progressive sequence alignment as a prerequisite to correct phylogenetic trees
.
J Mol Evol
1987
;
25
(
4
):
351
60
.

70.

Boyce
K
,
Sievers
F
,
Higgins
DG
.
Instability in progressive multiple sequence alignment algorithms
.
Algorithms Mol Biol
2015
;
10
:
26
.

71.

Notredame
C
,
Higgins
DG
,
Heringa
J
.
T-coffee: a novel method for fast and accurate multiple sequence alignment
.
J Mol Biol
2000
;
302
(
1
):
205
17
.

72.

Russell
DJ
,
Otu
HH
,
Sayood
K
.
Grammar-based distance in progressive multiple sequence alignment
.
BMC Bioinformatics
2008
;
9
:
306
.

73.

Al-Shatnawi
M
,
Ahmad
MO
,
Swamy
MN
.
Msaindelfr: a scheme for multiple protein sequence alignment using information on indel flanking regions
.
BMC Bioinformatics
2015
;
16
:
393
.

74.

Bhat
B
,
Ganai
NA
,
Andrabi
SM
, et al.
Tm-aligner: multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy
.
Sci Rep
2017
;
7
(
1
):
12543
.

75.

Maiolo
M
,
Gatti
L
,
Frei
D
, et al.
Propip: a tool for progressive multiple sequence alignment with Poisson indel process
.
BMC Bioinformatics
2021
;
22
(
1
):
518
.

76.

Garriga
E
,
Di Tommaso
,
Magis
C
, et al.
Multiple sequence alignment computation using the t-coffee regressive algorithm implementation
.
Methods Mol Biol
2021
;
2231
:
89
97
.

77.

Dhivya
S
,
Ashutosh
S
,
Gowtham
I
, et al.
Molecular identification and evolutionary relationships between the subspecies of Musa by DNA barcodes
.
BMC Genomics
2020
;
21
(
1
):
659
.

78.

Selva Pandiyan
A
,
Karthikeyan
RSG
,
Rameshkumar
G
, et al.
Identification of bacterial and fungal pathogens by rDNA gene barcoding in vitreous fluids of endophthalmitis patients
.
Semin Ophthalmol
2020
;
35
(
7-8
):
358
64
.

79.

Ying
YL
,
Hong
XZ
,
Xu
XG
, et al.
Molecular basis of ABO variants including identification of 16 novel abo subgroup alleles in Chinese Han population
.
Transfus Med Hemother
2020
;
47
(
2
):
160
6
.

80.

Lladós
J
,
Cores
F
,
Guirado
F
, et al.
Accurate consistency-based MSA reducing the memory footprint
.
Comput Methods Programs Biomed
2021
;
208
:106237.

81.

Chang
JM
,
Floden
EW
,
Herrero
J
, et al.
Incorporating alignment uncertainty into Felsenstein’s phylogenetic bootstrap to improve its reliability
.
Bioinformatics
2019
;
37
(
11
):
1506
14
.

82.

Corpet
F
.
Multiple sequence alignment with hierarchical clustering
.
Nucleic Acids Res
1988
;
16
(
22
):
10881
90
.

83.

Simossis
VA
,
Heringa
J
.
Praline: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information
.
Nucleic Acids Res
2005
;
33
(
Web Server issue
):
W289
94
.

84.

Roshan
U
,
Livesay
DR
.
Probalign: multiple sequence alignment using partition function posterior probabilities
.
Bioinformatics
2006
;
22
(
22
):
2715
21
.

85.

Liu
K
,
Nelesen
S
,
Raghavan
S
, et al.
Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy
.
IEEE/ACM Trans Comput Biol Bioinform
2009
;
6
(
1
):
7
21
.

86.

Mirarab
S
,
Nguyen
N
,
Guo
S
, et al.
Pasta: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences
.
J Comput Biol
2015
;
22
(
5
):
377
86
.

87.

Libin
PJK
,
Deforche
K
,
Abecasis
AB
, et al.
Virulign: fast codon-correct alignment and annotation of viral genomes
.
Bioinformatics
2019
;
35
(
10
):
1763
5
.

88.

Moshiri
N
.
Viralmsa: massively scalable reference-guided multiple sequence alignment of viral genomes
.
Bioinformatics
2021
;
37
(
5
):
714
6
.

89.

Rychlewski
L
,
Jaroszewski
L
,
Li
W
, et al.
Comparison of sequence profiles. Strategies for structural predictions using sequence information
.
Protein Sci
2000
;
9
(
2
):
232
41
.

90.

Baxevanis
AD
.
Practical aspects of multiple sequence alignment
.
Methods Biochem Anal
1998
;
39
:
172
88
.

91.

Liu
K
,
Warnow
TJ
,
Holder
MT
, et al.
Sate-ii: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees
.
Syst Biol
2012
;
61
(
1
):
90
106
.

92.

Amorim
AR
,
Zafalon
GFD
,
de Godoi Contessoto
, et al.
Metaheuristics for multiple sequence alignment: a systematic review
.
Comput Biol Chem
2021
;
94
:
107563
.

93.

Caiyang
Y
,
Heidari
AA
,
Chen
H
.
A quantum-behaved simulated annealing algorithm-based moth-flame optimization method
.
App Math Model
2020
;
87
:
1
19
.

94.

Ishikawa
M
,
Toya
T
,
Hoshida
M
, et al.
Multiple sequence alignment by parallel simulated annealing
.
Comput Appl Biosci
1993
;
9
:
267
73
.

95.

Hernández-Guía
M
,
Mulet
R
,
Rodríguez-Pérez
S
.
Simulated annealing algorithm for the multiple sequence alignment problem: the approach of polymers in a random medium
.
Phys Rev E
2005
;
72
:031915.

96.

Hwa
T
,
Lässig
M
.
Similarity detection and localization
.
Phys Rev Lett
1996
;
76
(
14
):
2591
4
.

97.

Mirjalili
S
.
Genetic Algorithm
.
Cham
:
Springer International Publishing
,
2019
,
43
55
.

98.

Zhang
C
,
Wong
AKC
.
A genetic algorithm for multiple molecular sequence alignment
.
Bioinformatics
1997
;
13
(
6
):
565
81
.

99.

Chen
S-M
,
Lin
C-H
,
Chen
S-J
.
Multiple DNA sequence alignment based on genetic algorithms and divide-and-conquer techniques
.
Int J Appl Sci Eng
2005
;
3
:
89
100
.

100.

Arenas-Díaz
E
,
Ochoterena
H
.
Multiple sequence alignment using a genetic algorithm and glocsa
.
J Artif Evol Appl
2009
;
2009
:963150.

101.

Ortuño
FM
,
Valenzuela
O
,
Rojas
F
, et al.
Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns
.
Bioinformatics
2013
;
29
(
17
):
2112
21
.

102.

Kaya
M
,
Sarhan
A
,
Alhajj
R
.
Multiple sequence alignment with affine gap by using multi-objective genetic algorithm
.
Comput Methods Programs Biomed
2014
;
114
(
1
):
38
49
.

103.

Gao
C
,
Wang
B
,
Zhou
CJ
, et al.
Multiple sequence alignment based on combining genetic algorithm with chaotic sequences
.
Genet Mol Res
2016
;
15
(
2
):
gmr8788
.

104.

Chatterjee
S
,
Barua
P
,
Hasibuzzaman
MM
, et al. A hybrid genetic algorithm with chemical reaction optimization for multiple sequence alignment. In:
2019 22nd International Conference on Computer and Information Technology (ICCIT)
, IEEE, Piscataway,
2019
,
1
6
.

105.

Mishra
A
,
Tripathi
BK
,
Singh Soam
S
. A genetic algorithm based approach for the optimization of multiple sequence alignment. In:
2020 International Conference on Computational Performance Evaluation (ComPE)
, IEEE, Piscataway,
2020
,
415
8
.

106.

Chowdhury
B
,
Garai
G
.
A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm
.
Soft Comput
2020
;
24
(
20
):
15871
88
.

107.

Long
H
,
Xu
W
,
Sun
J
, et al. Multiple sequence alignment based on a binary particle swarm optimization algorithm. In:
2009 Fifth International Conference on Natural Computation
, IEEE, Piscataway, Vol.
3
,
2009
,
265
9
.

108.

Chen
W
,
Liao
B
,
Zhu
W
, et al.
Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm
.
J Comput Chem
2009
;
30
(
13
):
2031
8
.

109.

Xuyu
X
,
Dafan
Z
,
Jiaohua
Q
, et al.
Ant colony with genetic algorithm based on planar graph for multiple sequence alignment
.
Inf Technol J
2010
;
9
:
274
81
.

110.

Jagadamba
PVSL
,
Prasad Babu
MS
,
Rao
AA
, et al. An improved algorithm for multiple sequence alignment using particle swarm optimization. In:
2011 IEEE 2nd International Conference on Software Engineering and Service Science
,
2011
,
544
7
.

111.

Yang
W-H
.
An improved artificial fish swarm algorithm and its application in multiple sequence alignment
.
J Comput Theor Nanosci
2014
;
11
:
888
92
.

112.

Manikandan
P
,
Ramyachitra
D
.
Bacterial foraging optimization -genetic algorithm for multiple sequence alignment with multi-objectives
.
Sci Rep
2017
;
7
(
1
):
8833
.

113.

Chaabane
L
.
A hybrid solver for protein multiple sequence alignment problem
.
J Bioinform Comput Biol
2018
;
16
(
4
):
1850015
.

114.

Kuang
FJ
,
Zhang
SY
,
Liu
CC
.
Multiple sequence alignment algorithm based on multi-strategy artificial bee colony
.
Kongzhi yu Juece/Control Decision
2018
;
33
:
1990
6
.

115.

Hussein
AM
,
Abdullah
R
,
AbdulRashid
N
. Flower pollination algorithm with profile technique for multiple sequence alignment. In:
2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)
,
2019
,
571
6
.

116.

Dabba
A
,
Tari
A
,
Zouache
D
.
Multiobjective artificial fish swarm algorithm for multiple sequence alignment
.
INFOR: Inf Syst Oper Res
2020
;
58
(
1
):
38
59
.

117.

Kumar
M
,
Kumar
R
,
Nidhya
R
. Woamsa: whale optimization algorithm for multiple sequence alignment of protein sequence. In:
Smys
S
,
Tavares
JMRS
,
Balas
VE
et al.
(eds).
Computational Vision and Bio-Inspired Computing
.
Springer International Publishing
, Cham,
2019
,
131
9
.

118.

Chaabane
L
.
An enhanced cooperative method to solve multiple-sequence alignment problem
.
Int J Data Mining Modell Manage
2021
;
13
(
1-2
):
1
16
.

119.

Baum Leonard
E
,
Ted
P
,
George
S
, et al.
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
.
Ann Math Stat
1970
;
41
(
1
):
164
71
.

120.

Pei
J
,
Grishin
NV
.
Mummals: multiple sequence alignment improved by using hidden Markov models with local structural information
.
Nucleic Acids Res
2006
;
34
(
16
):
4364
74
.

121.

Pei
J
,
Grishin
NV
.
Promals: towards accurate multiple sequence alignments of distantly related proteins
.
Bioinformatics
2007
;
23
(
7
):
802
8
.

122.

Liu
Y
,
Schmidt
B
,
Maskell
DL
.
Msaprobs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities
.
Bioinformatics
2010
;
26
(
16
):
1958
64
.

123.

Sievers
F
,
Wilm
A
,
Dineen
D
, et al.
Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega
.
Mol Syst Biol
2011
;
7
:
539
.

124.

Sun
J
,
Palade
V
,
Wu
X
, et al.
Multiple sequence alignment with hidden Markov models learned by random drift particle swarm optimization
.
IEEE/ACM Trans Comput Biol Bioinform
2014
;
11
(
1
):
243
57
.

125.

Zhan
Q
,
Wang
N
,
Jin
S
, et al.
Probpfp: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function
.
BMC Bioinformatics
2019
;
20
(
Suppl 18
):
573
.

126.

González-Domínguez
J
,
Liu
Y
,
Touriño
J
, et al.
Msaprobs-mpi: parallel multiple sequence aligner for distributed-memory systems
.
Bioinformatics
2016
;
32
(
24
):
3826
8
.

127.

González-Domínguez
J
.
Fast and accurate multiple sequence alignment with msaprobs-mpi
.
Methods Mol Biol
2021
;
2231
:
39
47
.

128.

Sievers
F
,
Higgins
DG
.
The clustal omega multiple alignment package
.
Methods Mol Biol
2021
;
2231
:
3
16
.

129.

Pachetti
M
,
Marini
B
,
Benedetti
F
, et al.
Emerging sars-cov-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
.
J Transl Med
2020
;
18
(
1
):
179
.

130.

Smirnov
V
,
Warnow
T
.
Magus: multiple sequence alignment using graph clustering
.
Bioinformatics
2021
;
37
(
12
):
1666
72
.

131.

Liu
H
,
Zou
Q
,
Xu
Y
.
A novel fast multiple nucleotide sequence alignment method based on fm-index
.
Brief Bioinform
2021
;
23
(1):bbab519, https://doi.org/10.1093/bib/bbab519.

132.

Shen
C
,
Zaharias
P
,
Warnow
T
.
Magus+ehmms: improved multiple sequence alignment accuracy for fragmentary sequences
.
Bioinformatics
2021
;
38
(4):918–24.

133.

Notredame
C
,
Holm
L
,
Higgins
DG
.
Coffee: an objective function for multiple sequence alignments
.
Bioinformatics
1998
;
14
(
5
):
407
22
.

134.

Narayan Behera
MS
,
Jeevitesh
JJ
,
Kant
K
, et al.
Higher accuracy protein multiple sequence alignments by genetic algorithm
.
Proc Comput Sci
2017
;
108
:
1135
44
.

135.

Thompson
JD
,
Koehl
P
,
Ripp
R
, et al.
Balibase 3.0: latest developments of the multiple sequence alignment benchmark
.
Proteins
2005
;
61
(
1
):
127
36
.

136.

Bradley
RK
,
Roberts
A
,
Smoot
M
, et al.
Fast statistical alignment
.
PLoS Comput Biol
2009
;
5
(
5
):e1000392.

137.

Löytynoja
A
,
Goldman
N
.
Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis
.
Science
2008
;
320
(
5883
):
1632
5
.

138.

Van Walle
,
Lasters
I
,
Wyns
L
.
Sabmark-a benchmark for sequence alignment that covers the entire known fold space
.
Bioinformatics
2005
;
21
(
7
):
1267
8
.

139.

Raghava
GP
,
Searle
SM
,
Audley
PC
, et al.
Oxbench: a benchmark for evaluation of protein multiple sequence alignment accuracy
.
BMC Bioinformatics
2003
;
4
:
47
.

140.

Schultz
J
,
Copley
RR
,
Doerks
T
, et al.
Smart: a web-based tool for the study of genetically mobile domains
.
Nucleic Acids Res
2000
;
28
(
1
):
231
4
.

141.

Sievers
F
,
Higgins
DG
.
Quantest2: benchmarking multiple sequence alignments using secondary structure prediction
.
Bioinformatics
2020
;
36
(
1
):
90
5
.

142.

Koyama
T
,
Platt
D
,
Parida
L
.
Variant analysis of SARS-cov-2 genomes
.
Bull World Health Organ
2020
;
98
(
7
):
495
504
.

143.

Ashkenazy
H
,
Sela
I
,
Levy Karin
E
, et al.
Multiple sequence alignment averaging improves phylogeny reconstruction
.
Syst Biol
2019
;
68
(
1
):
117
30
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)