Detection of tandem repeats in the Capsicum annuum genome

Abstract

In this study, we modified the multiple alignment method based on the generation of random position weight matrices (RPWMs) and used it to search for tandem repeats (TRs) in the Capsicum annuum genome. The modified (m)RPWM method, which considers the correlation of adjacent nucleotides, identified 908,072 TR regions with repeat lengths from 2 to 200 bp in the C. annuum genome, covering ~29% of it. The most common TRs were 3 and 2 bp long, followed by those of 21, 15, and 4 bp. We performed clustering analysis of TRs with repeat lengths of 2 and 21 bp and created position weight matrices (PWMs) for each group; these templates could be used to search for TRs of a given length in any nucleotide sequence. All detected TRs can be accessed through a publicly available database (http://victoria.biengi.ac.ru/capsicum_tr/). Comparison of mRPWM with other TR search methods such as Tandem Repeats Finder, T-REKS, and XSTREAM indicated that mRPWM could detect significantly more TRs at similar false discovery rates, indicating its superior performance. The developed mRPWM method can be successfully applied to the identification of highly divergent TRs, which is important for functional analysis of genomes and evolutionary studies.

repeats, genetic algorithm, sequence, Capsicum annuum

1. Introduction

In eukaryotic organisms, a considerable part of the genome is occupied by dispersed and tandem repeat (TR) sequences. In the human and plant genomes, repeats and transposable elements constitute about 43% and 80%, respectively.¹ TRs can be found in promoters, 3ʹ untranslated regions, exons, and introns, and it is suggested that repeats are involved in DNA compaction and in evolutionary events through their contribution to gene expression divergence.^2,3 Most TRs can mutate quickly, thus adding to genetic variability and promoting rapid adaptation of organisms to new environmental conditions.^4,5 TRs present in centromeric, pericentromeric, and telomeric regions^6–9 and in areas of chromosome breakage can lead to chromosomal rearrangements.

TRs are generally classified into microsatellites (or short TRs [STRs]) and minisatellites with repeat lengths of 1–6 and 7–60 bp, respectively. TRs longer than 60 bp (and even longer than 1,000 bp) are also observed.¹⁰ Repeat sequences may be species-specific, and their numbers could vary considerably owing to DNA polymerase slippage and unequal recombination.¹¹ Short TRs are detected at the ends of transposable elements: SINEs, LINEs, and retrotransposons, where they affect the possibility of transposition, thus exerting genetic regulation. It is noted that more than 25 human hereditary diseases are associated with STRs,¹² including fragile X syndrome^13,14 and autism.^15,16

There are cases of correlation between TR length and function. Thus, a 3 bp period is characteristic of DNA coding regions and is associated with the triplet nature of codons,¹⁷ whereas TRs of 10–11 bp (referred to as 10.5 bp) have been demonstrated in nucleosomal DNA.^18,19 TRs of 5–7 and 11–14 bp are characteristic of enhancers and other non-coding regions adjacent to genes, whereas long (120.9 bp) TRs are observed at the 5ʹ ends of transcripts.²⁰

The diverse regulatory roles of TRs have spurred the development of computer programs for TR search in genomes.²¹ Some of them can identify TRs with insertions or deletions (indels), provided that the degree of similarity between repeats is high (>50%). For example, Tandem Repeats Finder (TRF)²² searches for candidates based on the statistics of k-tuple matches, and the TR is then reconstructed considering possible indels when k-tuples are placed relative to each other. As the presence of indels within k-tuples complicates TR detection, the TRStalker algorithm²³ tracks gapped q-grams that allow indels, resulting in the detection of long (>10 bp), more divergent repeats. T-REKS²⁴ and XSTREAM,²⁵ which use dynamic programming to build multiple alignments, can also find TRs with indels. However, these methods cannot detect TRs with similarity < 40%.

Highly divergent (fuzzy) TRs are of particular interest because of their presence in gene regulatory regions, where they interact with transcription factors.²⁶ Fuzzy TRs can be identified with methods that use information decomposition^21,27 and the Fourier transform (FT).^20,28,29 Such approaches have revealed that short (2, 3, 4, and 6 bp) repeats are characteristic of introns and exons.³⁰ However, the disadvantage of the FT and information decomposition methods is that indels are not considered.

Thus, there is currently a need for a universal method that can identify long TRs with a large number of accumulated indels. In our previous studies, we developed the random position weight matrix (RPWM) method, which could effectively detect TRs with the average number of mutations per nucleotide (x) up to 3.2.^31,32,33 Here, we improved the RPWM method by considering the correlation between adjacent DNA bases, which allowed identification of TRs 2–200 bp long, and refined the estimation of the statistical significance of the found TRs by using the Monte Carlo method. As a result, the average number of TR-containing regions detected per 10⁶ DNA bases increased 1.5-fold.

The modified (m)RPWM method was used to search for TRs in the pepper (Capsicum annuum) genome. Although the genome of C. annuum was sequenced in 2014,³⁴ no detailed analysis of its TRs has been performed except for a study showing that 15.64% of expressed sequence tags contain STRs (2–6 bp),³⁵ which have subsequently been used to develop various polymorphic microsatellite markers.³⁶ By applying the mRPWM method, we identified all detectable TRs with lengths of 2–200 bp in the C. annuum genome and created a corresponding database (http://victoria.biengi.ac.ru/capsicum_tr/).

2. Materials and methods

2.1. RPWM algorithm

The pepper (Capsicum annuum) genome was used as a model to search for TRs with the improved RPWM algorithm. RPWM, which uses position weight matrices (PWMs),³³ was modified to consider the correlation of adjacent nucleotides in TR detection and more accurately determine the statistical significance of the identified TRs. In order to understand the nature of the improvement, let us review the main steps of the RPWM algorithm.

Step 1. A window of length L = 650 b, denoted as sequence S, was moved along a chromosome with a step of 10 b, and TRs of length n = 2–50 b were searched for in each S.
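The window scan of step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `sliding_windows` and the use of 0-based start coordinates are assumptions made here.

```python
def sliding_windows(chromosome, window=650, step=10):
    """Yield (k, S): the 0-based start k and the window sequence S.

    Only full-length windows are produced; a chromosome shorter than
    `window` yields a single, shorter window.
    """
    last_start = max(len(chromosome) - window, 0)
    for k in range(0, last_start + 1, step):
        yield k, chromosome[k:k + window]


# Example: a 700 b toy chromosome gives windows starting at 0, 10, ..., 50.
starts = [k for k, _ in sliding_windows("ACGT" * 175)]
```

Each window S is then searched independently for every repeat length n.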

Step 2. A set of 500 random PWMs with dimensions of n × 4 (TR length × 4 DNA bases) was generated and denoted as Q_n; N_q = 500 is the size of set Q_n. Each matrix Q_n(i) was transformed so that the sum $\sum_{i=1}^{n} \sum_{j=1}^{4} \mathrm{pwm}(i,j)^{2}$ was always constant. In this formula, pwm(i,j) is the element of the PWM in row i and column j. The sum $\sum_{i=1}^{n} \sum_{j=1}^{4} \mathrm{pwm}(i,j)\, p_{1}(i)\, p_{2}(j)$, where p₁(i) = 1/n and p₂(j) = N(j)/L, with N(j) equal to the number of A, T, C, or G bases in sequence S, was also kept constant for all matrices from set Q_n. This transformation was necessary to ensure that similarity function F_max (step 3 below) had a similar distribution for different n;³⁷ otherwise, it would be very difficult to find the most statistically significant period and determine the TR length in sequence S.
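The text does not spell out how the two sums are held constant, so the following is only one possible sketch. It assumes the matrix is split into a component parallel to the weights p₁(i)p₂(j) and a component orthogonal to them; the parallel part is fixed by the weighted sum and the orthogonal part is rescaled to reach the required sum of squares. The function names and the target values `r2` and `k` are hypothetical.

```python
import math
import random


def base_weights(seq, n):
    """Return w[i][j] = p1(i) * p2(j), with p1(i) = 1/n and p2(j) = N(j)/L."""
    length = len(seq)
    p2 = [seq.count(b) / length for b in "ATCG"]
    return [[p2[j] / n for j in range(4)] for _ in range(n)]


def normalize_pwm(pwm, w, r2, k):
    """Return a copy of pwm with sum(pwm^2) = r2 and sum(pwm * w) = k."""
    cells = [(i, j) for i in range(len(pwm)) for j in range(len(pwm[0]))]
    ww = sum(w[i][j] ** 2 for i, j in cells)
    mw = sum(pwm[i][j] * w[i][j] for i, j in cells)
    # Component of pwm orthogonal to w (its weighted sum is zero).
    perp = {(i, j): pwm[i][j] - mw / ww * w[i][j] for i, j in cells}
    perp2 = sum(x * x for x in perp.values())
    rest = r2 - k * k / ww  # squared norm left for the orthogonal part
    if rest < 0 or perp2 == 0:
        raise ValueError("constraints cannot be met for this matrix")
    t = math.sqrt(rest / perp2)
    return [[t * perp[i, j] + k / ww * w[i][j] for j in range(len(pwm[0]))]
            for i in range(len(pwm))]


rng = random.Random(0)
S = "".join(rng.choice("ATCG") for _ in range(650))
raw = [[rng.uniform(-10, 10) for _ in range(4)] for _ in range(5)]
Q = normalize_pwm(raw, base_weights(S, 5), r2=100.0, k=-1.0)
```

With both sums fixed, matrices of different n are compared on the same scale, which is the purpose stated in step 2.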

Step 3. For each set Q_n, we applied the genetic algorithm to find matrix $Q_{n}^{max}$ with the best local alignment to sequence S; the maximum value of similarity function F_max was used as the objective function, and the matrices from set Q_n served as organisms. The genetic algorithm was applied as follows.

A. Each matrix from set Q_n was locally aligned to sequence S as previously described,^33,38 and F_max and the local alignment coordinates in sequence S were obtained for each matrix, which allowed accurate determination of TR coordinates in sequence S; F_max values were written in vector V_n(i) (i = 1, 2,…, N_q).

B. Vector V_n(i) was ranked in descending order so that V_n(1) contained the maximum value of F_max, and random mutations were introduced into 10 randomly selected matrices from set Q_n by replacing one randomly selected cell with a random number from −10 to +10. Two matrices were randomly selected as ‘parents’ and randomly superposed to create a new ‘descendant’ matrix; then, the matrix with the smallest F_max value (contained in V_n(500)) was removed from set Q_n.

Then, we went back to point A and recalculated the V_n values for the mutated matrices and the ‘descendant’; for the matrices that did not change in point B, local alignment was not performed and V_n was not recalculated. The A–B cycle was repeated as long as V_n(1) continued to increase. As a result, we obtained matrix $Q_{n}^{max}$ with score V_n(1) and the local alignment of sequence S for a TR of length n.
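The A–B cycle can be sketched as below. This is a simplified illustration under several assumptions that are not in the text: the local alignment of refs 33 and 38 is replaced by a stand-in ungapped cyclic score, ‘superposition’ of parents is taken to be a row-wise crossover at a random cut, the current best matrix is spared from mutation, the loop stops after `max_stall` cycles without improvement, and the normalization of step 2 is omitted.

```python
import random

BASES = "ATCG"


def score(pwm, seq):
    """Stand-in for F_max: best ungapped cyclic match over all phases."""
    n = len(pwm)
    return max(
        sum(pwm[(j + phase) % n][BASES.index(b)] for j, b in enumerate(seq))
        for phase in range(n)
    )


def evolve(seq, n, pop_size=500, n_mut=10, max_stall=50, rng=None):
    """Return (best_score, best_matrix) for repeat length n >= 2."""
    rng = rng or random.Random()

    def rank(items):
        return sorted(items, key=lambda t: t[0], reverse=True)

    pop = [[[rng.uniform(-10, 10) for _ in range(4)] for _ in range(n)]
           for _ in range(pop_size)]
    scored = rank((score(m, seq), m) for m in pop)       # point A
    best, stall = scored[0][0], 0
    while stall < max_stall:
        # Point B: mutate one cell in n_mut matrices (index 0 is spared).
        for idx in rng.sample(range(1, pop_size), n_mut):
            m = [row[:] for row in scored[idx][1]]
            m[rng.randrange(n)][rng.randrange(4)] = rng.uniform(-10, 10)
            scored[idx] = (score(m, seq), m)
        # Descendant: rows before a random cut from one parent, rest from the other.
        (_, p1), (_, p2) = rng.sample(scored, 2)
        cut = rng.randrange(1, n)
        child = [row[:] for row in p1[:cut]] + [row[:] for row in p2[cut:]]
        # Re-rank, replace the worst matrix with the descendant, re-rank.
        scored = rank(scored)
        scored[-1] = (score(child, seq), child)
        scored = rank(scored)
        if scored[0][0] > best:
            best, stall = scored[0][0], 0
        else:
            stall += 1
    return scored[0]
```

Only changed matrices are rescored in each cycle, mirroring the remark that unchanged matrices are not realigned.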

Step 4. Steps 2–3 were performed for n = 2–50 bases to calculate all V_n(1) values and determine n with the maximum V_n(1) value, which was designated as n_max(k), where coordinate k is the beginning of the window (sequence S) in the chromosome (step 1). Thus, for each window, we obtained n_max(k), $V_{n_{max}(k)}(1)$, matrix $Q_{n}^{max}$, and the local alignment of matrix $Q_{n}^{max}$ and sequence S.

Step 5. Windows in sequence S were shifted by 10 bases to overlap with each other, and $V_{n_{max}(k)}(1)$ values were filtered as follows: the dependence of V(1) on the TR length (n) was excluded, resulting in $Vec(k) = V_{n_{max}(k)}(1)$, and the local maxima in vector Vec(k) were chosen so that Vec(k) > V₀ and Vec(k − i) ≤ Vec(k) ≥ Vec(k + i), where i ranges from 1 to 60. The selected local maxima represented the final result. Thus, for each local maximum, we obtained n_max(k), Vec(k), matrix $Q_{n}^{max}$, and the local alignment of matrix $Q_{n}^{max}$ and sequence S.
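The filtering rule of step 5 can be illustrated with a short sketch. It assumes that i counts consecutive entries of Vec (one entry per window) and that ties are kept, since the stated inequalities are non-strict; the function name `select_peaks` is hypothetical.

```python
def select_peaks(vec, v0, radius=60):
    """Return indices k with vec[k] > v0 and vec[k] >= all entries within `radius`."""
    peaks = []
    for k, v in enumerate(vec):
        if v <= v0:
            continue
        lo, hi = max(0, k - radius), min(len(vec), k + radius + 1)
        if all(v >= vec[m] for m in range(lo, hi)):
            peaks.append(k)
    return peaks


# Toy vector: the value at index 70 is shadowed by the higher peak at index 50.
vec = [0.0] * 200
vec[50], vec[70], vec[150] = 8.0, 7.0, 6.5
peaks = select_peaks(vec, v0=6.0)   # [50, 150]
```

The same rule applies unchanged when the scores are Z values rather than V(1) values.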

Step 6. A random sequence S was used to select threshold level V₀ for local maxima Vec(k). Steps 1–5 were repeated for this sequence, and a set of local maxima Vec_r(k) was obtained; the numbers of local maxima Vec(k) and Vec_r(k) were denoted as N_v and N_r, respectively. Then, we chose V₀ with the ratio N_r(Vec(k) > V₀)/ N_v(Vec(k) > V₀) ~ 0.027. These local maxima and the corresponding n_max(k), matrix $Q_{n}^{max}$ together with local alignment of the matrix and sequence S represented the final result of the algorithm.

2.2. Modified RPWM algorithm

2.2.1. Calculation of statistical significance Z(n)

We modified RPWM³³ to consider the statistical significance of the identified TRs. Matrices Q_n(i) transformed in section 2.1 allowed comparison of V_n(1) for different n, which could be performed if F_max(n) defined by local alignment of sequence S and matrix Q_n(i) had the same distribution function. However, it is not possible to achieve complete identity of these distributions in the RPWM method. Therefore, to reduce errors in determining n_max(k), we calculated Z(n) for each period length n using the Monte Carlo method; for this, steps 4–6 in the RPWM algorithm (section 2.1) were replaced with the new ones described below.

Step 4. For each TR of length n, we applied the Monte Carlo method to estimate statistical significance Z(n) of V(1): $Z(n) = (V(1) - \overline{V(1)}) / \sqrt{D(V(1))}$, where $\overline{V(1)}$ and D(V(1)) are the mean value and variance, respectively, of V(1) obtained by aligning random sequences S and matrix $Q_{n}^{max}$.
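A minimal sketch of this Monte Carlo estimate is given below. It assumes that the random sequences are obtained by shuffling S (as is done for the random chromosome in section 3.2), that the population variance is used, and that the score is supplied as a callable; `toy_score` is a hypothetical stand-in for the alignment score V(1).

```python
import random
import statistics


def z_score(observed, null_scores):
    """Z = (V - mean) / sqrt(variance), using scores from random sequences."""
    mean = statistics.fmean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        raise ValueError("null scores have zero variance")
    return (observed - mean) / sd


def monte_carlo_z(seq, score_fn, n_random=200, rng=None):
    """Score seq, then score n_random shuffled copies to build the null set."""
    rng = rng or random.Random()
    null_scores = []
    for _ in range(n_random):
        bases = list(seq)
        rng.shuffle(bases)
        null_scores.append(score_fn("".join(bases)))
    return z_score(score_fn(seq), null_scores)


def toy_score(s):
    """Hypothetical score: number of in-phase 'AT' dinucleotides."""
    return sum(1 for j in range(0, len(s) - 1, 2) if s[j:j + 2] == "AT")


z = monte_carlo_z("AT" * 100, toy_score, n_random=200, rng=random.Random(0))
```

Because Z is expressed in standard deviations of the null distribution, values for different n can be compared directly.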

Step 5. Steps 2–4 were performed for n from 2 to 50 to calculate all Z(n) values and determine n with the maximum Z(n) value, which was designated as n_max(k), where coordinate k is the beginning of the window (sequence S) in the chromosome (step 1). Thus, for each window, we obtained n_max(k), Z(n_max(k)), matrix $Q_{n}^{max}$, and the local alignment of matrix $Q_{n}^{max}$ and sequence S.

Step 6. Windows in sequence S were shifted by 10 bases to overlap with each other, and Z(n_max(k)) values were filtered as follows: the dependence of Z on TR length (n) was excluded, resulting in Z(n_max(k)) = Z₁(k), and the local maxima in vector Z₁(k) were chosen so that Z₁(k) > Z₀ and Z₁(k − i) ≤ Z₁(k) ≥ Z₁(k + i), where i ranges from 1 to 60. The selected local maxima represented the final result. Thus, for each local maximum, we obtained n_max(k), Z(n_max(k)), matrix $Q_{n}^{max}$, and the local alignment of matrix $Q_{n}^{max}$ and sequence S.

2.2.2. Consideration of the correlation of neighbouring bases

Matrix $Q_{n}^{max}$ contains n rows and 4 columns (n × 4) and does not take into account the correlation of neighbouring nucleotides in sequence S, because the columns of the matrix are not interconnected; as a result, a significant part of TRs may remain undetected. To address this problem, we created an improved version of the RPWM algorithm, which considers the correlation of adjacent nucleotides. For illustration, let us search for TRs with a length of 4 b (n = 4) in sequence S of length L = 1,200 b. To obtain S, we first randomly selected, with a probability of 0.25, one sequence Seq(i) (i = 1–4; Seq(1) = ATCG, Seq(2) = TAGC, Seq(3) = CCAA, and Seq(4) = GGTT) and used it to fill bases 1–4 of sequence S; this step was repeated for each successive block of four bases until all the bases in sequence S (1–1,200) were filled. Then, we determined function Z(n) using the RPWM method.³³ The results indicated that the 4 b TRs in S could hardly be seen (Fig. 1), because RPWM computed Z from matrix M(n,4), which has n rows and 4 columns, and in sequence S each of the four bases (A, T, C, or G) occurs at every position of the repeat with the same frequency of 0.25.

Figure 1.

Calculation of statistical significance Z(n) depending on TR length n for sequence S. White and black circles correspond to the results obtained by the RPWM and mRPWM methods, respectively.

In the example of such a sequence S shown in Fig. 2A, each of the bases A, T, C, and G occurs three times at the first position of the period. The same frequencies were observed for the second, third, and fourth positions of the period; the resulting matrix is shown in Fig. 2B. These data mean that there are no differences in base frequencies between positions of the repeat, which makes it difficult to detect TRs in sequence S from the base frequencies in matrix M. Therefore, low Z values were obtained using the RPWM method (Fig. 1); they were not zero only because dynamic programming arranged indels in a certain way.

Figure 2.

Illustration of the mRPWM method. (A) Two sequences are shown. The upper sequence (sequence V) contains the row numbers of matrix M(n,4) for n = 4; the lower sequence is the sequence S being studied. Matrix M(4,4) is filled by moving from left to right along sequence V. The first pair, 1A, corresponds to the first row and column A of matrix M(4,4), so 1 is added to cell (1,A). The second pair, 2T, corresponds to cell (2,T), to which 1 is added. We proceed in this way along sequence V to its end v(L), where L is the length of sequence S. (B) The resulting matrix M(4,4) is shown at the top; it is evenly filled, so there is no correlation between rows and columns. The filled matrix M(4,16) is shown at the bottom. It is filled similarly to M(4,4), but pairs of adjacent bases are counted instead of individual bases. For example, for j = 2, s(j − 1) = A, s(j) = T, and v(2) = 2, so 1 is added to cell M(2,AT). For j = 3, s(j − 1) = T, s(j) = C, and v(3) = 3, so 1 is added to cell M(3,TC), and so on until j = L. Matrix M(4,16) is unevenly filled, indicating the presence of a correlation between sequences V and S.

In contrast, after the application of the modified (m)RPWM algorithm, 4 b-long TRs were clearly visible and Z(4) had the largest value among all repeat lengths (Fig. 1), because in mRPWM, instead of matrix M(n,4) (Fig. 2B, top), we used matrix M(n,16), which had 16 columns corresponding to the 16 dinucleotides (Fig. 2B, bottom). The elements of matrix M(n,16) were calculated as m(v(j),i) = m(v(j),i) + 1 (where j ranged from 2 to L, i = s(j − 1) + 4(s(j) − 1), and v(j) was an element of sequence V), so that matrix M(n,16) captured correlations of adjacent nucleotides. Matrix M(4,16) for sequence S (Fig. 2B, bottom) was extremely heterogeneous, and many of its elements were equal to zero. Thus, the newly developed mRPWM could identify TRs of 4 b in sequence S, i.e. it appeared to be more efficient in searching for TRs than the original RPWM.
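The contrast between M(n,4) and M(n,16) can be reproduced with a short sketch that builds the toy sequence from the four units given in the text and fills both count matrices. The base order A, T, C, G used for the column codes and the function names are assumptions; indices are 0-based, so the column is s(j − 1) + 4·s(j) with codes 0–3.

```python
import random

BASES = "ATCG"
CODE = {b: i for i, b in enumerate(BASES)}


def toy_sequence(length=1200, rng=None):
    """Concatenate randomly chosen 4 b units (each with probability 0.25)."""
    rng = rng or random.Random()
    units = ["ATCG", "TAGC", "CCAA", "GGTT"]
    return "".join(rng.choice(units) for _ in range(length // 4))


def count_matrices(seq, n):
    """Return (M4, M16): single-base and adjacent-pair counts per repeat position."""
    m4 = [[0] * 4 for _ in range(n)]
    m16 = [[0] * 16 for _ in range(n)]
    for j, base in enumerate(seq):
        row = j % n                                   # v(j) - 1
        m4[row][CODE[base]] += 1
        if j > 0:
            col = CODE[seq[j - 1]] + 4 * CODE[base]   # previous base, current base
            m16[row][col] += 1
    return m4, m16


S = toy_sequence(1200, random.Random(0))
M4, M16 = count_matrices(S, 4)
nonzero_per_row = [sum(1 for c in row if c) for row in M16]
```

In this toy case every row of M4 holds roughly equal counts, whereas rows 2–4 of M16 each have only 4 non-zero cells out of 16; row 1 counts pairs that span the boundary between units and is therefore less sparse.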

In mRPWM, we considered the correlation of neighbouring bases by creating a set of matrices Q_n with dimensions n × 16, i.e. with n rows and 16 columns instead of the n rows and 4 columns described in section 2.1. In addition, in step 1, the size of the window was increased from 650 to 1,200 b and that of the step from 300 to 600 b. The length (n) of TRs to be detected was also increased from 2–50 to 2–200 b. In this analysis, we searched for TRs of all sizes, including those that were multiples of 3 b. As a result, the number of detected TRs increased significantly: RPWM identified approximately 76,000 TRs in the genome of rice (Oryza sativa), whereas mRPWM identified 908,072 TRs in that of C. annuum. Given that the sizes of the two genomes are approximately 3.75 × 10⁸ and 3 × 10⁹ b, the average numbers of identified TRs per 10⁶ b were 192 and 300, respectively, confirming the superior performance of mRPWM.

2.3. Effect of nucleotide substitutions on TR identification

Artificial sequences of 1,200 b containing TRs of 10 b were generated by tandem repetition of a random 10 b segment 120 times; then, 0, 100, 500, 1,000, 1,500, 1,800, 2,200, or 2,400 random base substitutions and 20 indels were introduced at randomly selected positions. These artificial sequences were denoted as St(x) (where x was the average number of substitutions per base between any two repeats and was equal to 0, 0.17, 0.85, 1.7, 2.5, 3.0, 3.7, and 4.0, respectively)³⁹ and used to search for TRs with RPWM and mRPWM and to determine Z(10) as a function of x.
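The construction of St(x) can be sketched as follows. Several details are assumptions made for illustration: each substitution replaces the base with a different one, each indel is a single-base insertion or deletion chosen with equal probability, and x is computed as 2 × (number of substitutions) / L. The last relation is inferred from the listed values (for example, 1,500 substitutions give x = 2.5 and 2,400 give x = 4.0) and is not stated explicitly in the text.

```python
import random

BASES = "ATCG"


def make_st(n_sub, n_indel=20, period=10, copies=120, rng=None):
    """Build a tandem array of a random unit, then add substitutions and indels."""
    rng = rng or random.Random()
    unit = [rng.choice(BASES) for _ in range(period)]
    seq = unit * copies
    for _ in range(n_sub):
        pos = rng.randrange(len(seq))
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    for _ in range(n_indel):
        pos = rng.randrange(len(seq))
        if rng.random() < 0.5:
            seq.insert(pos, rng.choice(BASES))   # 1 b insertion
        else:
            del seq[pos]                         # 1 b deletion
    return "".join(seq)


def divergence(n_sub, length=1200):
    """x between two repeat copies, each of which is mutated independently."""
    return 2 * n_sub / length


xs = [round(divergence(k), 2) for k in (0, 100, 500, 1000, 1500, 1800, 2200, 2400)]
```

Because x exceeds 1 for most of the series, many positions are hit more than once, which is why such repeats are no longer recognizable by direct comparison of copies.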

3. Results

3.1. Analysis of artificial sequences St(x)

Figure 3 shows the plot of Z(10) versus x. If we take Z₀ (section 2.1, step 6) equal to 6.0 (section 3.2), then RPWM can detect TRs with x up to 3.2, whereas mRPWM, which considers the correlation of neighbouring bases, can detect TRs with x up to 3.6, which significantly exceeds the limit for the TRF and T-REKS methods: x < 1.2.³³ Thus, the performance of mRPWM in identifying highly divergent TRs is superior to that of other currently used methods.

Figure 3.

Dependence of statistical significance Z(n) on the number of base substitutions x for artificial sequences from set St (section 2.3). The continuous curve is the RPWM method, the dotted curve is the mRPWM method.

3.2. Detection of TRs in the C. annuum genome

Next, we applied the mRPWM method to search for TRs with lengths of 2–200 bp in the 12 chromosomes of C. annuum (sequences retrieved from Ensembl Plants databank, http://plants.ensembl.org/Capsicum_annuum/Info/Index). To select the significance level Z₀, we searched for TRs in the first C. annuum chromosome and in a random sequence created by random mixing of chromosome nucleotides. The Z values for the sorted local maxima (section 2.1, steps 5 and 6) are presented in Table 1; among them, Z₀ = 6.0 was chosen as the threshold level which provided a false discovery rate (FDR) of 2.71%. The number of TRs found in each C. annuum chromosome is shown in Table 2. Overall, 908,072 TRs were detected; their density was similar in all the chromosomes and constituted ~302 TRs per 10⁶ bp. The calculations were performed on two Ryzen 9 5950X processors, and the search for TRs in the C. annuum genome took about 2 weeks.

Table 1.

Number of TRs found in the first C. annuum chromosome and random sequence

Z₀	5.0	5.5	6.0	6.5	7.0
First chromosome	161,367	129,788	97,727	72,578	54,306
Random sequence	33,033	10,494	2,651	681	198
FDR	20.47%	8.08%	2.71%	0.94%	0.36%
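The FDR row of Table 1 is the ratio of the random-sequence count to the chromosome count at each threshold, which can be checked with a few lines. The counts below are copied from Table 1; the helper names and the 3% cutoff passed to `pick_threshold` are hypothetical and serve only to show how a threshold could be chosen from such a table.

```python
Z0 = [5.0, 5.5, 6.0, 6.5, 7.0]
CHROMOSOME = [161367, 129788, 97727, 72578, 54306]   # first chromosome
SHUFFLED = [33033, 10494, 2651, 681, 198]            # random sequence


def fdr_percent(n_random, n_real):
    """False discovery rate as a percentage."""
    return 100.0 * n_random / n_real


def pick_threshold(z0, n_real, n_random, max_fdr):
    """Return the lowest threshold whose FDR does not exceed max_fdr (in %)."""
    for z, real, rand in zip(z0, n_real, n_random):
        if fdr_percent(rand, real) <= max_fdr:
            return z
    return None


fdr = [fdr_percent(r, v) for r, v in zip(SHUFFLED, CHROMOSOME)]
chosen = pick_threshold(Z0, CHROMOSOME, SHUFFLED, max_fdr=3.0)
```

The computed values agree with the table to within rounding, and a 3% cutoff selects Z₀ = 6.0.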

Table 2.

Number of TRs in C. annuum chromosomes

Chromosome number	Chromosome size, bp	Number of found TRs
1	309,102,287	97,727
2	169,555,599	53,458
3	282,780,301	82,029
4	240,120,734	75,376
5	238,597,879	76,562
6	242,241,289	76,011
7	251,293,532	80,152
8	142,366,738	43,533
9	271,082,670	85,660
10	233,321,800	74,453
11	266,870,110	83,990
12	250,929,874	79,121
Total	2,898,262,813	908,072

The statistics for the 10 most common repeat lengths indicated that TRs of 3 and 2 bp were the most frequent in the C. annuum genome (Table 3), which is consistent with the data reported for other plant genomes.^31,40 Analysis of TR distribution depending on the length revealed that in addition to TRs with n = 2 and 3 bp, there were local peaks for those with n = 21, 31–33, 91–92, 111–112, and 178–187 bp (Fig. 4); the existence of lengthy TRs may be associated with chromatin compaction. It has been reported that in proteins, the most frequently observed repeat lengths are 2 and 7 amino acids,⁴¹ which correspond to 6 and 21 bp, respectively.

Table 3.

Lengths of 10 most frequently identified TRs in the C. annuum genome

TR length, bp	Number of TRs
3	266,713
2	156,442
21	18,194
15	14,126
4	13,940
14	13,935
19	12,292
12	12,172
6	11,954
18	11,789

Figure 4.

Frequency of TRs depending on the length n.

The performance of the newly developed mRPWM method was verified by searching for highly dissimilar TRs that could not be detected by the existing algorithms. TRs were analysed in a non-coding region of the first C. annuum chromosome (15,478,656–15,479,825 bp). The Z(n) plot indicated that a pronounced maximum (Z = 9.0) was observed for n = 7 bp and lower peaks were also detected for n = 14, 21, 28, and 35 bp (Fig. 5). Table 4 shows the alignment between a TR and an artificial periodic sequence containing 34 indels with a maximum length of 3 bp; the corresponding $Q_{7}^{max}$ matrix is shown in Table 5.

Table 4.

Alignment of TRs with $Q_{7}^{max}$ shown in Fig. 5

AAGCACCACGTGTCAAATGACGTGGCATGCTAAGTCAAGAAA*TGAATCCAATAGGACCATGCCACATGTCAAAAT*GATGCAGCAGGCC
67123456712345671234567123*456712345671234567123456712345671234567123456712345671234567123

CATGAAATCAAAGTTCATAAAAGGTGTCATGTCACTCAAGTATGATTGGTCAAAGAAAGTCCATTTTCATCAT*GACTCTTCCCTTTCCA
45671234567123456712345671234567123456712345671234567***1234567123456712345671234567123456

CAACTATAAATAGGGGGCCCATAATTCAGAAAAGAAGCTCAGAACTCTAAAAGCAGCAAGAGAAAGCTCGTGGATCAAACGCCACAATTT
712345671234567123456712345671234567123456712345671234567123456712345671234567123456712345

TCCTAAAAAGCTCAAGCATTCAAGTCAAGTTCATCAAGATCCAAAATCAAGACCACAATATTCAAAAACAAGCTCAAAAGCCCTTGAATT
671234567123456712345671234567123456712345671234567123456712345671234567123456712345671234

CAAGC***ACAAGTCAAGAT*CAAGTCCCCCAAATCAACAAATCAAGTTCAA*ATTCAAGAT*CAAGCTTCAAACCCTTGAATTTATATT
567123456712345671234567123456712345671234567123456712345671234567123456712345671234567123

TGAAAAGGCGAATTAGAAGATTCATAGAGATTGTAACACTCACATATTGAAATCAATAAATTGATTGTTTAATATTTTCTTGGCTCAATT
456712345671234567123456***712345*67123456*712345671234567123456712345671234**567123456712

ATTTATTTTTTCGATCCCAAAAATTTTATTGTCCAACAAATTCTGGCACGCCCAGTAGGACAATCTCTATCTGTCATCTCAACTGCTCCA
**3456712345671234567123456712345671234567123456712345671234567123456712345671234567123456

AC*TGCAAAGTTCAACAACACTGAAATGACTTCCAAGAAGGTCAACTCTCAATCAACTACATCTAAGGCTGCTGATTCAAAGTTCTCTGG
7123456712345671234567123456712345671**234567123456712345671234567123456712345671234567123

TGAAGTAGAAAGCATCCTTGGTGTTATTTTCGAAAGCTTAGGAACTGTCACAAAGAGCAAGGAAGG**CTTGCTAGGACAACAAACACAT
456712345671234567123456712345671234567123456712345671234567123456712345671234567123456712

CCAGTGTCGTCCGAACCAACTCCAATTTTTGAATCTTCAACCCCAAAAGGAAAGAAATTCAATGCAAGTTCTTCGGAAGGAGGAAGCAGT
34567123456712345671234567123456712345671234567*123456712345671234567123456712345671234567

GTGGCGGAAT*CGCTTAAGAAGACTCTAGATTTA**CTTGAGAATTCCAGTTCCAAACACTCTGGCACAAAGAGCAATGATCGTTCGAGC
123456712345671234567123456712345671234567123456712345671234567123456712345671234567123456

AACTCATCATCTCCGACTATACCGCATAAGTTGAGCGCTTCAAAGATCAACTTGTGCGATAATCCATGCTACTTTCCGATATCTTCAGTG
712345671234567123456712345671234567123456712345671234567123456712345671234567123456712345

ATTATGCAAGTGATGGTGACTGATGCCTCGTCTATGAAGGAGCA*GCTTGAGAATTTAGCAAAGGCAATTAAGAGCCTGACCAAATATGT
671234567123*45671234567123456712345671234567123456712345671234567123456712345671234567123

TCAGAAT*CAAGATGC
4567123456712345

* indicates indels.