
Genome Res. v.18(2); Feb 2008. PMC2203628

# Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Gerton Lunter,^{1,3} Andrea Rocco,^{2} Naila Mimouni,^{2} Andreas Heger,^{1} Alexandre Caldeira,^{2} and Jotun Hein^{2}

^{1}MRC Functional Genetics Unit, University of Oxford, Department of Physiology, Anatomy, and Genetics, Oxford OX1 3QX, United Kingdom;

^{2}Department of Statistics, University of Oxford, Oxford Centre for Gene Function, Oxford, OX1 2TG, United Kingdom

^{3}Corresponding author. E-mail gerton.lunter@dpag.ox.ac.uk; fax 44-1865-282651.

## Abstract

Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.

Most, if not all, of comparative genomics relies crucially on the quality of sequence alignments. As a consequence, the sequence-alignment problem has received a great deal of attention. However, despite having been introduced over three decades ago (Needleman and Wunsch 1970), it remains an active area of research (for reviews, see Batzoglou 2005; Dewey and Pachter 2006). One reason for this continued interest is that alignments produced by existing algorithms still show considerable disagreement (Dewey et al. 2006). This disagreement is often thought to result from differences in the algorithms’ accuracy due to, e.g., inaccurate evolutionary models or suboptimal choices of parameters (Waterman et al. 1992; Gusfield et al. 1994; Elofsson 2002; Dewey et al. 2006). Here, we argue instead that alignment accuracy is more fundamentally limited. Rather than resulting from inaccurate models or parameters, differences between inferred alignments may simply reflect uncertainties resulting from the limited information available in extant sequences, from which different algorithms infer distinct but equally plausible homologies (Lassmann and Sonnhammer 2005). To the extent that this is true, attention should be directed toward quantifying this unavoidable uncertainty rather than toward optimizing the evolutionary model underpinning the algorithm. Quantifying this uncertainty will help experimentalists assess which alignment regions can be relied upon in subsequent analyses.

Several authors have considered uncertainty in alignments. Byers and Waterman looked at the problem of enumerating suboptimal alignments (Waterman 1983; Byers and Waterman 1984), but this approach proved impractical because of the sheer number of such alignments. An alternative approach focuses instead on reliable individual columns within alignments. A “conditional best score” can be computed for alignments that include any particular residue pair (Sellers 1979; Goad and Kanehisa 1982; Altschul and Erickson 1986; Zuker 1991). Given an arbitrary threshold, these scores delineate regions of homology in a dot plot rather than a single best alignment. A similar approach has been used to calculate a “reliability index” for individual pairings (Chao et al. 1993), which has been used to improve alignments (Mevissen and Vingron 1996; Schlosshauer and Ohlsson 2002; Tress et al. 2003). Despite such improvements, alignment quality remains a major issue, e.g., for homology modeling of protein structure (Tramontano et al. 2001). One difficulty is that proteins evolve under selection, which is hard to model, so that any simulation must necessarily be highly idealized. Lacking realistic simulated data, alignment algorithms must be calibrated using databases of structural alignments (Mevissen and Vingron 1996; Tress et al. 2003; Edgar and Batzoglou 2006), which are limited in size and accuracy and biased toward globular proteins. A second issue is that, with notable exceptions (Do et al. 2005; Paten and Birney 2007), alignment algorithms are almost universally based on optimizing a score rather than on a probabilistic model. Aside from parameterization issues, this makes it difficult to interpret the score and reliability indices derived from it, which has hampered the rational design of novel algorithms based on these statistics.

Here, we focus on the probabilistic alignment of mammalian genomic DNA. The importance of this problem is underlined by the increasing number of available genomes and the requirement of full-length alignments, particularly for the comparative study of conserved noncoding DNA. These elements are embedded in large amounts of neutrally evolving sequence, which, in many cases, retain sufficient sequence identity to be alignable (Waterston et al. 2002). This allows a different strategy from that used for protein alignments; rather than modeling evolution under functional constraint, neutral evolution may be modeled to optimize the alignment of the neutral majority of sequence (Chiaromonte et al. 2002). The conserved fraction, being easier to align, may be processed by the same evolutionary model. Because of the large amount of available data, the process of neutral evolution is known in great detail (Waterston et al. 2002; Hwang and Green 2004; Meunier and Duret 2004; Hellmann et al. 2005; Lunter et al. 2006), allowing the simulation of realistic sequence pairs whose homologies are known exactly, an approach that was profitably used to assess the quality of fly genome alignments (Pollard et al. 2004).

We emphasize that in this study we consider only part of the whole-genome alignment problem: the pairwise local alignment of homologous nucleotide sequences. We ignore the issues of finding anchors and of dealing with repetitive sequence, genomic inversions and duplications, and nonorthologous relationships (Blanchette et al. 2004; Bray and Pachter 2004; Brudno et al. 2004; Dewey and Pachter 2006; Sun and Buhler 2006), which are crucial, but can be separated from the nucleotide-level alignment problem. Here, we thus assume that regions of homology are reliably assigned (see Prakash and Tompa 2007 for a recent study considering this issue), and we focus on the problem of inferring nucleotide-level homology. We also do not consider the multiple-alignment problem, which is essentially more difficult than pairwise alignments. However, a detailed understanding of the issues concerning pairwise alignments will, we hope, help guide the design of probabilistic multiple-alignment algorithms.

A central observation made in this study is that alignment errors follow particular patterns and cause alignments to be biased in particular ways. Depending on the application, it is important to be aware of (and account for) the type and extent of these biases. For instance, naïve estimates of indel rates are systematically negatively biased, and explicit accounting for alignment biases greatly reduces their impact (Lunter 2007). We distinguish three types of alignment error, termed gap wander (Holmes and Durbin 1998), gap attraction, and gap annihilation. We show that, to varying degrees, all probabilistic and score-based aligners tested exhibit these biases. For the most prevalent of these, gap wander, we obtain an analytic expression of its contribution to alignment error.

Having established that alignment errors are prevalent, we turn to probabilistic alignment algorithms. A key advantage of probabilistic aligners is their ability to assign posterior probabilities to individual alignment columns (Thorne et al. 1991; Durbin et al. 1998; Metzler 2003; Lunter et al. 2005). We show that this posterior probability accurately predicts the true probability that individual columns are correct. This suggests that, rather than using a standard maximum-likelihood approach such as the Viterbi algorithm, posteriors could be profitably used to identify good alignments. Posterior decoding-alignment algorithms were proposed some time ago (Krogh 1997; Durbin et al. 1998; Holmes and Durbin 1998), and recently there has been a renewed interest in probabilistic alignment algorithms, mostly focusing on proteins (Do et al. 2005; Kall et al. 2005; Roshan and Livesay 2006; Paten and Birney 2007), and similar approaches were found to improve RNA folding (Ding et al. 2005). In contrast, the performance of posterior decoding algorithms on genomic DNA sequences has, to our knowledge, not been investigated in detail before. Here, we examine two novel posterior decoding algorithms and find that they show superior performance compared with the standard Viterbi decoding and with score-based aligners, in terms of sensitivity and the extent of alignment biases.

Although our models and algorithms improve upon earlier alignment algorithms, a key point of this study is to emphasize that errors in alignments are unavoidable. We show that this is true even when the underlying evolutionary model and parameters are known exactly. For sequences whose divergence is comparable to human and mouse, we recover 83%–88% of homologous residue pairs, depending on the model and the decoding algorithm, stressing the need to quantify the remaining uncertainty in the alignment. The best-performing model explicitly accounts for variations in GC content, and the particular form of the mammalian indel-length spectrum; surprisingly, modeling the variation in substitution and indel rates themselves had little, if any, effect on the resulting alignment. Independently of the model used, the posterior-decoding algorithms were found to be superior to the Viterbi algorithm. By comparison, of the score-based aligners used in the ENCODE project, BLASTZ (Schwartz et al. 2003; Blanchette et al. 2004) shows the best overall performance, achieving a sensitivity of 82%, similar to the sensitivity of Viterbi alignments.

Despite the advantages of a probabilistic approach, most aligners currently are score based rather than probabilistic. One reason is that probabilistic algorithms are perceived to be more complex. It is therefore important to emphasize that the algorithms we used have the same asymptotic time and memory complexity as standard score-based algorithms; for example, the Viterbi algorithm (Durbin et al. 1998) is formally identical to the Needleman–Wunsch algorithm (Needleman and Wunsch 1970). Based on the ideas presented here, we have developed a probabilistic genome aligner, GRAPe, which we used to compute human and mouse genome alignment. The software will be described more fully elsewhere and is available at http://genserv.anat.ox.ac.uk/grape. A genome browser for the human–mouse alignments is accessible through the same URL.

## Results

### Biases in alignment

Alignment algorithms, whether probabilistic or score based, compute alignments that are systematically biased. We describe the three most important biases, in order of their frequency of occurrence, and discuss their effects on alignments.

The most frequent cause of misaligned bases is due to an effect termed “gap wander” or “edge wander” (Holmes and Durbin 1998). Gap wander occurs because the mutation process creates random and spurious local sequence similarities, which compete with the sequence similarities due to homology (Fig. 1A). Long unrelated sequences are unlikely to show similarities comparable to those of homologous sequences, but short regions of similarity do frequently occur and cannot be distinguished from true homology. As a result, the most likely location of a gap often differs from its true location. Gap wander causes alignment columns near gaps to show an inflated average sequence similarity while simultaneously causing the proportion of columns that are correctly aligned (the alignment accuracy) to be lowest near to gaps.


Under a Jukes–Cantor model of evolution, it is possible to investigate the effects of gap wander analytically. Following, in part, the argument by Holmes (1998) and Holmes and Durbin (1998), we obtain an expression (1) for the proportion of misaligned bases due to gap wander, F_{w}, valid for small divergences (see Appendix A). Here, γ is the ratio of the substitution rate, σ, to the indel rate, δ. Note that the dominant term in (1) is linear in σ; in fact, F_{w} = 3σ/(5γ) + O(σ^{2}), and for this reason, we say that gap wander is a first-order effect.

The second most prevalent bias is termed “gap attraction.” This is an interaction effect between indels, and occurs when two indels hit homologous sequences at nearby positions. In this case, the most parsimonious explanation often involves one rather than two gaps, even at the cost of additional substitutions. Because this additional cost is, in expectation, proportional to the distance between the gaps, the result is an apparent “attraction” between gaps (see Fig. 1B,C). It causes a downward bias in the number of inferred indels, and further decreases the alignment accuracy near gaps. Since gap attraction is an interaction effect, the number of affected sites in alignments is of second order in the divergence.

The third bias, “gap annihilation”, is also an interaction effect between indels, but occurs at lower frequencies. When two indels have identical length but are of opposing signature (e.g., an insertion followed by a nearby deletion in the same lineage; or two deletions in separate lineages), the evolutionary history competing with the true explanation involves no indels at all (Fig. 1D). Since indels are relatively rare, this explanation is favored even when it requires a considerable number of additional inferred substitutions. The evolutionary scenarios causing this situation may sound contrived, but in fact, the probability that two indels have identical length is ~20% in human–mouse alignments, because most gaps are short (~30% are single-nucleotide indels). The results of gap annihilation are, again, a downward bias in the number of inferred indels, a reduction of the alignment accuracy, and a decrease in the apparent sequence similarity. These effects occur nearly uniformly across the alignment (see Lunter 2007 for a more detailed discussion), in contrast to the biases induced by gap wander and gap attraction, which strongly colocalize with inferred gaps.
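The order of magnitude of the ~20% figure can be checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that indel lengths follow a single geometric distribution with 30% single-nucleotide indels; the empirical human–mouse spectrum has a heavier tail, and the function names are hypothetical. For two independent indels, the probability of identical length is the sum of squared length probabilities.

```python
def same_length_probability(length_probs):
    """Probability that two independent indels drawn from the same
    length distribution have identical length: sum over k of p_k^2."""
    return sum(p * p for p in length_probs)


def geometric_spectrum(p_single, max_len=200):
    """Geometric length distribution with P(length = 1) = p_single,
    truncated at max_len (the discarded tail is negligible here)."""
    return [p_single * (1.0 - p_single) ** (k - 1) for k in range(1, max_len + 1)]


# Assumption: ~30% single-nucleotide indels, as quoted in the text,
# combined with an idealized geometric length distribution.
p_single = 0.30
numeric = same_length_probability(geometric_spectrum(p_single))
closed_form = p_single / (2.0 - p_single)  # sum_k p^2 (1-p)^(2k-2) = p / (2 - p)
print(f"numeric = {numeric:.4f}, closed form = {closed_form:.4f}")
```

Under this simplified assumption the coincidence probability comes out near 18%, of the same order as the ~20% quoted for real human–mouse alignments.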

### Simulation study of alignment biases

To show that the three types of biases influence alignments as predicted, we designed a simulation study. We generated sequences so that “true” homologies were known, after which we removed gaps and realigned the resulting sequences. We evolved sequences under the Jukes–Cantor model (Jukes and Cantor 1969) with σ = 0.375 expected substitutions per site (corresponding to an average sequence identity of 0.705), and we used a geometric indel model with substitution/indel rate ratio of γ = 7.5 (see Methods section). These parameters result in sequences that are comparable to sequence at human–mouse divergence (mean sequence identity 69%). The simulated sequences were realigned under the same model using Viterbi decoding. The use of the Jukes–Cantor model allowed comparisons with our analytical result (1). Simulations show that sequences evolved and aligned under the HKY model (Hasegawa et al. 1985) show very similar alignment biases (see Supplemental Table S1 and Supplemental Fig. S1).
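The correspondence between σ = 0.375 and an average identity of 0.705 follows from the standard Jukes–Cantor relation between expected substitutions per site and expected identity. The short sketch below evaluates that relation and the implied indel rate δ = σ/γ; the function names are illustrative.

```python
import math


def jc_identity(sigma):
    """Expected proportion of identical sites under Jukes-Cantor after
    sigma expected substitutions per site."""
    return 0.25 + 0.75 * math.exp(-4.0 * sigma / 3.0)


def jc_divergence(identity):
    """Inverse relation: expected substitutions per site for a given identity."""
    return -0.75 * math.log((4.0 * identity - 1.0) / 3.0)


sigma = 0.375   # substitutions per site, as used in the simulation
gamma = 7.5     # substitution/indel rate ratio, as used in the simulation
delta = sigma / gamma  # implied indel rate per site

print(f"identity = {jc_identity(sigma):.3f}, indel rate = {delta:.3f}")
```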

We find that the alignment accuracy is lowest for columns adjacent to gaps, as predicted, with only 56% aligned correctly (see Fig. 2A). The apparent average sequence identity for these columns is 85%, much higher than the true sequence identity, 70.5%. Both observations are compatible with the combined action of gap wander and gap attraction.

[Figure: (*A*, *left*) The proportion sequence identity (PID, blue triangles), the true PID (dashed), ...]

Moving away from gaps, the apparent sequence identity quickly drops to nearly the correct value and continues to decrease to ~68%. In contrast, the accuracy rises slowly and plateaus at around 96% far away from gaps. This again agrees with our predictions, since all alignment biases act to decrease the accuracy at medium distances from gaps, while gap attraction and gap wander have opposite effects on sequence identity. In balance, gap wander dominates near gaps, while at medium distances, their effects nearly cancel. The fact that neither sequence identity nor alignment accuracy reaches its optimal value in gap-distal regions reflects the effects of gap annihilation, which is the dominant alignment bias away from alignment gaps. Gap attraction, finally, is responsible for the scarcity of closely spaced gaps (Fig. 2B).

We next investigated the dependence of alignment accuracy on sequence divergence. The analytic prediction (1) of the false-positive fraction (FPF, see Methods section for a definition) closely agrees with the observations (σ = 0.075, FPF = F_{w} = 0.008; σ = 0.150, FPF = F_{w} = 0.022; σ = 0.225, FPF = 0.047, and F_{w} = 0.041). Since gap wander is the only bias considered in the analysis, the very good agreement indicates that gap wander is the dominant cause of alignment error in the low-divergence regime. For higher divergences, the observed FPF exceeds the predicted value (Fig. 3) because second-order indel interactions such as gap annihilation become more prevalent, as indicated by the reduction in asymptotic accuracy.
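For orientation, the leading-order term of (1), 3σ/(5γ), can be evaluated at the divergences quoted above. The sketch below assumes that γ = 7.5 was held fixed across these divergences (the text states this value only for the main simulation), and it evaluates the first-order term alone, not the full expression (1); the function name is illustrative.

```python
GAMMA = 7.5  # assumption: same substitution/indel rate ratio at every divergence


def gap_wander_leading_order(sigma, gamma=GAMMA):
    """First-order term of the gap-wander error fraction: 3*sigma / (5*gamma)."""
    return 3.0 * sigma / (5.0 * gamma)


# Values quoted in the text: sigma -> (observed FPF, full prediction F_w).
quoted = {0.075: (0.008, 0.008), 0.150: (0.022, 0.022), 0.225: (0.047, 0.041)}

for sigma, (fpf, f_w) in quoted.items():
    lead = gap_wander_leading_order(sigma)
    print(f"sigma={sigma:.3f}  leading term={lead:.3f}  F_w={f_w:.3f}  FPF={fpf:.3f}")
```

The leading term falls below the quoted F_{w} values, and the shortfall grows with σ, which suggests that the higher-order terms of (1) are not negligible even at these modest divergences.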

[Figure: ... (F_{w}, green open circles); the ...]

To investigate the effect of inaccurate parameters on alignment quality, we realigned all simulated sequences with fixed parameters rather than with those used in the simulation. This has little negative effect on alignment quality (Supplemental Fig. S2). Given the strong biases present in alignments, could it be that detuning the parameters might actually improve alignments? For instance, decreasing the gap penalty would increase the number of gaps, opposing the bias in gap density due to gap attraction. To investigate this, we again simulated sequences under a Jukes–Cantor model, and realigned them using Viterbi decoding with a model parameterized by a range of substitution and indel rates (Fig. 4). Sensitivity is maximal (84%) when parameters coincide with the simulation parameters, both for the indel and the substitution rates. However, the sensitivity stays within 1% of the maximum across a wide range of indel rates (0.02 ≤ δ ≤ 0.10) and substitution rates (0.20 ≤ σ ≤ 0.45). We conclude that alignment quality is robust against fairly large errors in the values of the evolutionary parameters.

### Model fidelity and alignment accuracy

The initial simulations show that for parameters corresponding to human–mouse alignments, only 84% of homologous bases were aligned correctly. Because of the simple model, this result may not be representative for actual human–mouse alignments. To make a more realistic assessment of the expected quality of such alignments, we developed a test set of simulated sequences that accurately approximates evolution along the human and mouse lineages. To assess the impact of model fidelity, we realigned this data set using a hierarchy of models and inferred alignments using three different decoding algorithms for each model in turn.

We simulated evolution using parameters that closely mimic human–mouse evolution. Specifically, we simulated the following aspects: large-scale variation of GC content; GC-content-dependent indel rates; an empirical indel-length spectrum; dependence of the substitution model on GC content; and GC-independent local substitution rate variation. The evolutionary parameters were obtained from BLASTZ human–mouse alignments (see Methods section for details). In all, we simulated 20,000 sequence pairs with an average length of 700 nt, with 2 × 100 nt of flanking sequence added as appropriate for local alignments. Note that this presents a realistic scenario for a whole-genome aligner when a fairly dense set of anchors has been generated.

The simulated sequences were then realigned using a hierarchy of probabilistic aligners (Table 1; Fig. 5A). The most elaborate (“Full”) model tracked all of the evolutionary-rate variation used in the simulations. In addition, this model uses a geometric mixture model to closely approximate the empirical indel-length spectrum (Fig. 5B). The other models were obtained by allowing only one parameter to vary, while other parameters were fixed to their average values (see Table 1). Finally, we considered a “Basic” model, obtained by pegging all parameters to their averages and replacing the indel-length model by a standard geometric distribution, corresponding to affine gap penalties.

[Figure: (*A*) The model is implemented as a pair HMM with a match state (center) surrounded by delete (*top*) and insert (*bottom*) states. Hash signs (#) signify emissions, dashes (–) represent no emission ...]

For each model, we compared three decoding algorithms to infer alignments from the sequence data. As a baseline method, we used the standard Viterbi decoding algorithm, which computes the single most likely alignment that is compatible with the observed sequences. In addition, we used two posterior decoding algorithms, referred to here as posterior decoding and marginalized posterior decoding (MPD; see Appendix B for details). Both algorithms compute the alignment that maximizes the cumulative log posterior probability of all contributing alignment columns. This is equivalent to maximizing the product of column posteriors (Fariselli et al. 2005) and has the advantage of removing the need for arbitrary gap weighting to account for variable lengths of alignments, which is required for standard sum-of-posteriors decoding (Durbin et al. 1998; Do et al. 2005; Kall et al. 2005; Roshan and Livesay 2006).
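To make the notion of a column posterior concrete, the following minimal sketch runs the forward–backward recursions on a textbook three-state pair HMM (in the style of Durbin et al. 1998). It is not the Full model and not the MPD algorithm of Appendix B: the parameter values `delta`, `eps`, and `identity` are placeholders, the function name is hypothetical, and summing gap posteriors per residue is an illustrative choice only.

```python
def pair_hmm_posteriors(x, y, delta=0.05, eps=0.5, identity=0.7):
    """Forward-backward on a three-state pair HMM (M = match, X = x[i]
    against a gap, Y = y[j] against a gap). Returns (match_post, gap_post_x):
    match_post[i][j] = P(x[i] aligned to y[j] | x, y),
    gap_post_x[i]    = P(x[i] aligned to a gap | x, y)."""
    n, m = len(x), len(y)
    q = 0.25  # background emission probability in the gap states

    def p(a, b):  # Jukes-Cantor-like pair emission in the match state
        return 0.25 * (identity if a == b else (1.0 - identity) / 3.0)

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward: f_S[i][j] = P(x[:i], y[:j], last state is S).
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0  # the begin state is treated like a match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward: b_S[i][j] = P(x[i:], y[j:] | state S at cell (i, j)).
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            diag = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            down = q * bX[i + 1][j] if i < n else 0.0
            right = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * diag + delta * (down + right)
            bX[i][j] = (1 - eps) * diag + eps * down
            bY[i][j] = (1 - eps) * diag + eps * right

    match_post = [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
                  for i in range(1, n + 1)]
    gap_post_x = [sum(fX[i][j] * bX[i][j] for j in range(m + 1)) / total
                  for i in range(1, n + 1)]
    return match_post, gap_post_x


match_post, gap_post_x = pair_hmm_posteriors("ACGTTGCA", "ACGTGCA")
for i, row in enumerate(match_post):
    best_j = max(range(len(row)), key=row.__getitem__)
    print(i, best_j, round(row[best_j], 3), round(gap_post_x[i], 3))
```

Because every residue of `x` is either matched to some residue of `y` or placed against a gap, each row of `match_post` plus the corresponding entry of `gap_post_x` sums to one. A decoder can then search for the alignment whose columns have the largest summed log posterior, which is the quantity described in the text.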

We summarized the results using three summary statistics: sensitivity, false-positive fraction (FPF), and nonhomologous fraction (NHF; see Methods section for definitions) (Fig. 6). Of the inference procedures, MPD achieves the best sensitivity (88.1%), with good FPF (13.3%) and NHF (1.6%). Viterbi alignments are more conservative, resulting in a notably lower sensitivity (84.9%), but slightly better performance on the FPF and NHF statistics (12.7% and 0.38%). Standard posterior decoding shows comparable sensitivity (87.9%), but high FPF and NHF scores (14.3% and 2.3%).
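The three statistics are defined in the Methods section, which is not reproduced here. The sketch below therefore works under stated assumptions: sensitivity is taken as the fraction of true homologous pairs that are recovered, FPF as the fraction of predicted pairs that are not true pairs, and NHF as the fraction of predicted pairs in which at least one nucleotide has no homolog in the true alignment. These readings and the function name are assumptions and may differ in detail from the definitions actually used.

```python
def alignment_stats(true_pairs, predicted_pairs):
    """Compare two alignments given as collections of (i, j) index pairs.
    Returns (sensitivity, fpf, nhf) under the assumed definitions above."""
    true_set, pred_set = set(true_pairs), set(predicted_pairs)
    correct = true_set & pred_set

    sensitivity = len(correct) / len(true_set)
    fpf = 1.0 - len(correct) / len(pred_set)

    # Residues that have a homolog in the other sequence.
    homologous_x = {i for i, _ in true_set}
    homologous_y = {j for _, j in true_set}
    nonhomologous = [(i, j) for i, j in pred_set
                     if i not in homologous_x or j not in homologous_y]
    nhf = len(nonhomologous) / len(pred_set)
    return sensitivity, fpf, nhf


# Toy example: (2, 3) mispairs two residues that do have homologs,
# while (5, 5) pairs residues that have no homolog at all.
truth = [(0, 0), (1, 1), (2, 2), (3, 3)]
prediction = [(0, 0), (1, 1), (2, 3), (5, 5)]
print(alignment_stats(truth, prediction))
```

Under these assumed definitions every nonhomologous pair is also a false positive, so NHF can never exceed FPF, which is consistent with the values reported in the text.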

Besides good sensitivity and FPF ratings, MPD also shows fewer alignment biases. The average PID next to gaps is only mildly elevated at 72.9%, compared with 80.1% for Viterbi, indicating a reduced impact of gap wander. Gap attraction is also less prevalent, as indicated by the distribution of distances between gaps, which is closer to the ideal geometric distribution (Fig. 7). Finally, MPD alignments show a high asymptotic accuracy (97.3% accurate at distance 15 from gaps, compared with 96.1% for Viterbi alignments), suggesting a reduced impact of gap annihilation. This reduced impact of alignment biases improves the estimate of the number of indels (0.0394 gaps per nucleotide for MPD, compared with 0.0345 for Viterbi), although a substantial bias remains (true gap density, 0.0490 gaps/nt). All statistics mentioned are for the Full model; there appears to be little interaction between the model and the inference procedure, and the conclusions remain valid across the model hierarchy.


We were surprised to find that increasing the model complexity has little effect on the performance. For the MPD alignments, the FPF varies between 13.34% and 15.34%, the NHF ranges from 1.45% to 3.19%, and the sensitivity ranges from 87.56% to 88.22%. Compared with the Basic model, and irrespective of the decoding algorithm used, the models that vary either local substitution rates (VarSubs) or indel rates (VarIndel) show little or no improvement in any of the three statistics. This is consistent with our finding that Jukes–Cantor alignments are robust to variations in evolutionary rate parameters. Tuning the substitution model to the sequence GC content improves the FPF (13.6%, from 15.2%) and the NHF (1.45%, from 2.85%), but also somewhat reduces the sensitivity (87.6%, from 87.8%). Modeling the indel-length spectrum using a geometric mixture model has the opposite effect of increasing the sensitivity (to 88.2%) at the cost of an increased NHF (3.19%), while the FPF improves (15.1%), but only slightly, compared with the Basic model (15.2%). The Full model strikes a good balance with the best FPF (13.3%) and good NHF and sensitivity scores (1.63% and 88.1%; Fig. 7).

Our test setup implicitly assumes that alignment algorithms can use the correct evolutionary parameters, which is not true in practice. For this reason, our results should be regarded as providing an upper limit to the achievable alignment accuracy for the algorithms and divergence considered (to the extent that our modeling of the neutral evolution of nucleotide sequence is appropriate). However, the accuracy of the evolutionary rate parameters appears to have little effect on accuracy, and the largest gain in accuracy and FPF is obtained from modeling the sequence content and the indel-length distribution. Parameterization of either is straightforward, so that our conclusions are relevant for practical alignment algorithms.

### Posterior probabilities are a reliable estimator of alignment accuracy

In the previous section, we showed that posterior probabilities help to improve alignments. We next investigated whether they are also directly informative of alignment accuracy. Although posteriors cannot be used to distinguish correctly aligned columns from incorrect ones (except possibly when the posterior is either 0 or 1), they do provide a quantitative indication of reliability (Fig. 5C). In this section, we investigate the accuracy and robustness of this measure.

We calculated posteriors for all columns in simulated human–mouse sequences that were subsequently realigned using Viterbi decoding. Alignment columns were divided into 10 categories by their 10% posterior probability quantile. Within each category, we aggregated two statistics: the proportion of correctly aligned nucleotides, and the average percentage sequence identity. To test for robustness against modeling errors, this procedure was applied both for the basic and the full model. For both models, the posterior probability accurately predicts the proportion of correctly aligned columns (Fig. 8).
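The calibration check described above can be sketched as follows. The example assumes that columns are grouped into ten equal-width posterior bins (0 to 0.1, ..., 0.9 to 1), which is one reading of the binning; the function name is hypothetical and the data are synthetic, generated so that each column is correct with probability equal to its posterior. No real alignment data are involved.

```python
import random


def calibration_table(posteriors, correct, n_bins=10):
    """Group alignment columns into equal-width posterior bins and return,
    for each non-empty bin, (mean posterior, fraction correct, count)."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    post_sum = [0.0] * n_bins
    for post, ok in zip(posteriors, correct):
        b = min(int(post * n_bins), n_bins - 1)  # posterior 1.0 goes in the top bin
        counts[b] += 1
        hits[b] += bool(ok)
        post_sum[b] += post
    return [(post_sum[b] / counts[b], hits[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b]]


# Synthetic, perfectly calibrated data (illustration only).
rng = random.Random(0)
posteriors = [rng.random() for _ in range(20000)]
correct = [rng.random() < post for post in posteriors]

for mean_post, frac_correct, count in calibration_table(posteriors, correct):
    print(f"mean posterior {mean_post:.2f}  fraction correct {frac_correct:.2f}  n={count}")
```

For a well-calibrated aligner, the mean posterior and the fraction of correct columns agree within sampling error in every bin, which is the pattern the text reports for both the basic and the full model.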


Similarly, the “asymptotic accuracy”, defined as the proportion of correct alignment columns at distance 15 from the nearest gap, is very nearly identical to the average posterior at that distance, across a wide range of divergences (Fig. 3). Again, this conclusion remains true, even when the evolutionary model does not accurately fit the data (Supplemental Fig. S2).

Sequence identity and posterior probability show a strong, positive correlation. This is partly caused by an increasing admixture of nonhomologous nucleotide pairs as the posterior probability decreases. However, the observed PID for the highest posterior bin (>0.9) is 74.1%, exceeding the true PID of 69%. We interpret this as the result of stochastic effects that cause local sequence similarity to fluctuate, which in turn influences the accuracy with which alignments can be inferred. The result is that locally accurate alignments are biased toward regions with high sequence identity. This suggests that it would be unwise to only use alignment columns with very high posterior probabilities to estimate substitution rates.

### Comparison with score-based aligners

To put the performance of the probabilistic aligners in context, we realigned the simulated data using five general-purpose score-based aligners: ClustalW (Higgins and Sharp 1988), Lagan (Brudno et al. 2003), DiAlign (Morgenstern 1999, 2004), Mavid (Bray and Pachter 2004), and TBA/BLASTZ (Schwartz et al. 2003; Blanchette et al. 2004). The performance of these aligners was compared using the same three statistics as before (Fig. 9).

[Figure: ... (*top left* axis), false-positive fraction (gray, *right* axis) and nonhomologous fraction (striped, *bottom left* axis), for simulated sequence based on human–mouse ...]

With the exception of DiAlign, all aligners achieve comparable sensitivities (79.4%–84.3%). BLASTZ paired this with good false-positive and nonhomologous fractions (FPF, 13.77%; NHF, 1.18%) when using the recommended score-threshold option (-K 2200). Lowering the score threshold to 2000 (following Pollard et al. 2004) increased the sensitivity from 79.4% to 82.0%, while the false-positive and nonhomology fractions increased only marginally (to 13.89% and 1.21%, respectively). The other score-based aligners were designed to perform global (or “glocal”) (Brudno et al. 2003) alignment, thus solving a different problem that resulted in high (and less meaningful) NHF and FPF statistics. DiAlign was designed for multiple alignment of divergent protein-coding sequences, and as a consequence, is conservative in inferring homology, resulting in a low false-positive fraction and a fair nonhomologous fraction (FPF, 12.8%; NHF, 3.95%), but a concomitant low sensitivity (63.1%). ClustalW was designed for protein multiple alignment, but was included because of its traditionally large user base. In our test, it shows lower sensitivity (81.9%) and higher false-positive rates (38.7%) than both Lagan and Mavid. However, despite their differences, all algorithms show qualitatively similar biases in their alignments (Supplemental Fig. S3), and uniformly do not perform as well as the MPD algorithm tested.

## Discussion

In this study, we report on a large-scale simulation study, with the twofold aim of investigating the type and extent of biases that are inherent in the inference of alignments and of assessing whether a probabilistic approach can help reduce these biases.

We have distinguished three types of alignment biases: gap wander, gap attraction, and gap annihilation. Although well known, only one of these (gap wander or “edge wander”) has, to our knowledge, been studied explicitly before (Holmes 1998). We have argued that gap wander is the dominant cause of wrongly aligned bases in maximum-likelihood alignments. This conclusion is supported by a theoretical analysis of gap wander under a Jukes–Cantor substitution model, the predictions of which agree very well with simulated data for small divergences. For higher divergences, additional biases start contributing to alignment inaccuracies, but gap wander continues to be important. For example, at a divergence of 0.375 substitutions and 0.05 indels per site, gap wander is predicted to cause 10% of homologous bases to be wrongly aligned. Simulations show the actual proportion to be 14%, the additional 4% apparently due to other biases.

These additional biases are caused by gap interactions, and their impact increases quadratically with divergence. The effects of gap attraction are apparent in the distribution of distances between successive gaps, in which small distances are strongly under-represented (Fig. 2B). Gap attraction strongly reduces the gap density in alignments, and further compounds the reduction of alignment accuracy near gaps that is caused by gap wander. A third and related bias, termed “gap annihilation”, is also of second order in the divergence but occurs less frequently. In contrast to the other two biases, gap annihilation colocalizes with alignment gaps only very weakly (Lunter 2007), and causes an increase in both the apparent divergence and the error proportion across the alignment, and a further decrease in the number of inferred indels.

Both gap-interaction biases tend to decrease the alignment gap density compared with the true indel count. Increasing the indel rate of the inference model (i.e., lowering the gap-opening penalty) increases the number of inferred gaps, reducing this bias. However, our results show that the true evolutionary parameters do maximize the proportion of correctly aligned nucleotides, despite the gap count being negatively biased. In other words, the number of gaps can be made to approximate the true indel count, but only at the expense of placing the gaps in the wrong positions and increasing the proportion of incorrectly aligned bases.

It might seem that a tighter modeling of the evolutionary process would help to discern the true evolutionary history from among the many possibilities, and so reduce the impact of alignment biases. We found that more accurate modeling resulted in only very marginal improvements of the alignment accuracy. Indeed, in our simulation study of sequences at human–mouse divergence, the modeling of indel lengths using a mixed geometric distribution resulted in the single largest improvement in sensitivity, from 85.3% to 85.6% using Viterbi decoding, and from 87.8% to 88.2% using MPD. The geometric mixture model helps to align sequences across large indels, which are relatively infrequent, explaining the relatively modest improvement. Modeling the variation in GC content reduces the false-positive fraction (from 15.2% to 13.6% using MPD), but has little effect on sensitivity. Surprisingly, accurate modeling of indel and substitution rate variation has little, if any, effect. This robustness to misparameterization is supported by our simulations under the Jukes–Cantor model, where substantial variations in the rate parameters resulted in very little difference (Fig. 4).

For the data set of simulated sequences at human–mouse divergence, all models and decoding algorithms show 12%–15% wrongly aligned columns. This seems to reflect the loss of information during evolution rather than model inaccuracies or parameterization errors, and suggests that more sophisticated improvements to evolutionary models that might be considered, such as modeling evolving GC fractions (Lipatov et al. 2006), strand biases (Green et al. 2003), or context-dependent evolution (Jensen and Pedersen 2000; Arndt et al. 2003; Hwang and Green 2004; Lunter and Hein 2004; Siepel and Haussler 2004; Christensen 2006), although extremely valuable to help understand evolution, are unlikely to result in substantial improvement of sequence alignments. One aspect not modeled by any alignment algorithm that we are aware of is that indels often occur in tandem repetitive sequence as a result of, e.g., microsatellite instability (Kroutil and Kunkel 1999). The proposed mechanism, polymerase slippage, suggests that insertions often involve sequence duplications rather than insertions of random sequence. It would be interesting to investigate the possible improvement that modeling this aspect would have on alignment quality.

Modeling of polymerase slippage aside, we expect further improvements in alignments to arise chiefly from deep sequencing of extant species. Besides obvious factors such as the shape of the phylogenetic tree and the availability and quality of data from genome-sequencing efforts, the achievable alignment quality will also, and probably crucially, depend on the quality of multiple alignment algorithms. Because the alignment problem suffers from a combinatorial explosion when the number of species increases, heuristic methods must be used. We have shown here that uncertainties in alignments are prevalent and unavoidable. Especially in multiple alignments, it is therefore essential that these uncertainties are dealt with properly. Many of the widely used multiple alignment algorithms “freeze” particular alignment choices at internal nodes, which would exacerbate alignment biases (Loytynoja and Goldman 2005), and more sophisticated methods than those currently available are required to optimally exploit the information that is available in multiple sequences.

Nevertheless, a simulation experiment showed that BLASTZ/TBA multiple alignments (Blanchette et al. 2004) do benefit from additional species (see Supplementary Information). We simulated sequences along the phylogeny of human, macaque, mouse, rat, and dog, and found that addition of in-group species consistently improved the implied human–mouse alignment (Supplemental Fig. S4). Adding all species resulted in an improvement of the sensitivity for human–mouse homology from 82% to 87.4%, similar to or slightly below the sensitivity of MPD pairwise alignments of human and mouse sequence alone. As pairwise alignments serve as input to TBA, the two approaches can conceivably be merged, and it would be interesting to investigate the improvements that the MPD algorithm can bring to TBA multiple alignments.

The score-based aligners we tested show similar kinds of biases to the probabilistic aligners, especially the Viterbi algorithm. This is not surprising, as these algorithms are formally very similar, and indeed, we found that the performance of the Viterbi aligner is similar to that of the best score-based aligner tested, BLASTZ. However, posterior decoding aligners, in particular MPD, have no score-based counterpart and perform better than both the Viterbi algorithm and BLASTZ in our simulations. Sensitivity improves from 82% for BLASTZ to 88% for MPD, reducing the number of missed alignment columns by a third (from 18% to 12%). Since our simulation procedure incorporated more aspects of human–mouse sequence evolution than any other that we are aware of, and was carefully parameterized using the best whole-genome human–mouse alignments currently available, it appears that the MPD algorithm would compute better alignments of mammalian genomic sequence than the current generation of score-based aligners is able to provide. Software to compute these alignments and a genome browser for human–mouse alignments are available at http://genserv.anat.ox.ac.uk/grape.

Comparing the score-based aligners among themselves, we observed large differences. The five score-based aligners we tested have each been designed with different purposes in mind, and the results reflect these design choices. For example, Mavid and Lagan are global aligners, and DiAlign was designed for aligning highly divergent sequence. The design aims of BLASTZ are the closest to our study, and it indeed performed the best out of the score-based aligners we tested.

Despite the performance differences between existing aligners and the scope for improvements of (particularly) multiple-alignment algorithms, our results suggest that alignment accuracy is fundamentally limited. For alignments of species at distances comparable to human and mouse, it seems likely that at least 10% of nucleotides in whole-genome alignments will remain wrongly aligned. These unavoidable errors need to be acknowledged and, where possible, accounted for when alignments are used to draw conclusions about evolution. We have shown that alignment uncertainties depend strongly on evolutionary distance, becoming less pronounced at lower divergences. However, alignment uncertainties are not spread uniformly over alignments, and sequence content (e.g., repetitive or near-repetitive sequence) also strongly influences the certainty with which alignments can be inferred, something that affects sequences at any divergence. We hope that the tools we provide will help researchers to identify and account for these local regions of uncertainty in alignments.

In conclusion, our results show that a probabilistic approach to sequence alignments has significant advantages over score-based approaches. Posterior probabilities are reliable and robust indicators of local alignment reliability. We have further shown that the MPD algorithm, and, to a lesser extent, improvements in evolutionary modeling, result in improvements in alignment quality. However, much uncertainty remains, and because of this, it seems inappropriate to continue the practice of using single most-likely alignments, essentially “point estimates without error bounds” (Zuker 1991). A probabilistic approach to sequence alignment is essential to properly account for and quantify these unavoidable uncertainties in alignments.

## Methods

### Definitions

Throughout this study, we scale units such that the divergence time between sequence pairs is 1, making the substitution rate σ equal to the expected number of substitutions per site. Because overlapping indel events are hard to deconvolute (Miklós et al. 2004), for convenience we here define the indel “rate,” δ, as one minus the survival probability of a pair of homologous nucleotides, conditional on its left neighboring pair surviving. Equivalently, δ may be defined as the gap-opening probability in the true alignment. With the chosen units, this is numerically close to the indel event rate (Lunter 2007). We define the substitution/indel rate ratio as γ = σ/δ.

To summarize the performance of aligners, we use the sensitivity (S), false-positive fraction (FPF), and nonhomologous fraction (NHF). S is defined as the ratio of correct alignment columns to all homologous columns. The FPF is defined as the proportion of wrongly aligned columns among all nongapped columns. Among incorrect alignment columns, we distinguish those involving “padding sequence,” which does not share homology with other sequence, from all others. The proportion of columns containing padding sequence among all aligned columns summarizes the ability to tell alignable sequence from nonhomologous sequence, and is referred to as the nonhomologous fraction, NHF. Note that nucleotides contributing to NHF also contribute to the FPF (i.e., NHF ≤ FPF). As a simple proxy for divergence, we use PID (proportion identity) throughout, most often as an aggregate measure (e.g., PID of nucleotides at particular distances from alignment gaps).
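As an illustration of these three summary statistics, the following minimal Python sketch computes S, FPF, and NHF for a single pairwise alignment. The representation (sets of index pairs, one per nongapped column) and the function name `alignment_metrics` are assumptions made for this example only; they are not taken from the software described here.

```python
def alignment_metrics(true_pairs, inferred_pairs, padding_a, padding_b):
    """Return (S, FPF, NHF) for one inferred pairwise alignment.

    true_pairs / inferred_pairs: sets of (i, j) index pairs, one per
    nongapped column, giving positions in sequences A and B.
    padding_a / padding_b: sets of positions that are nonhomologous padding
    (hypothetical representation, chosen for this sketch).
    """
    correct = inferred_pairs & true_pairs
    wrong = inferred_pairs - true_pairs
    with_padding = {(i, j) for (i, j) in inferred_pairs
                    if i in padding_a or j in padding_b}
    sensitivity = len(correct) / len(true_pairs)   # correct / homologous columns
    fpf = len(wrong) / len(inferred_pairs)         # wrong / nongapped columns
    nhf = len(with_padding) / len(inferred_pairs)  # padding / nongapped columns
    return sensitivity, fpf, nhf


# Toy example: four homologous columns, two recovered, one padding column.
true_pairs = {(0, 0), (1, 1), (2, 2), (3, 3)}
inferred_pairs = {(0, 0), (1, 1), (2, 3), (5, 6)}
s, fpf, nhf = alignment_metrics(true_pairs, inferred_pairs, {5}, {6})
```

Because padding positions never occur in the true homologous columns, every column counted toward NHF in this sketch is also counted toward FPF.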

### Evolutionary rates from human–mouse alignments

Substitution probabilities and other evolutionary parameters for the Full HMM model (see below and Fig. 5A) were obtained by maximum likelihood, using existing whole-genome BLASTZ human–mouse alignments as training data. Because all training data were considered to be homologous, we removed padding states from the model for training. Training data were stratified according to the GC fraction (fGC), as measured in 250-bp windows and binned into 20 equally populated bins. Training was done separately for each fGC-bin (see Supplemental Table S2). Although our interest is primarily in alignment of bulk genomic, thus mainly neutral, DNA, we did not remove the small fraction of known functional or conserved sequence. Our stratification into fGC categories ensures that most protein-coding exons are found in the highest fGC category, while all other categories are dominated by neutrally evolving sequence.

To be able to model local substitution-rate variation (which does not strongly covary with fGC [Hellmann et al. 2005]), we estimated the spectrum of substitution rate by measuring local, putatively neutral sequence divergence on human–mouse ancestral repeats within 100-kb windows, across the human genome. These values were binned into 10 equally populated bins and averaged (see Supplemental Table S3), and used as input to the simulation stage (see below).
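The stratification into equally populated bins can be sketched as follows. This is a minimal Python illustration under stated assumptions: windows are taken as consecutive and non-overlapping, ties are broken by input order, and the function names `gc_fractions` and `equal_population_bins` are hypothetical.

```python
def gc_fractions(seq, window=250):
    """GC fraction of each consecutive, non-overlapping window of `seq`.

    Non-overlapping windows are an assumption of this sketch.
    """
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions


def equal_population_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so bins are equally populated.

    Values are ranked from low to high; ties are broken by input order.
    """
    order = sorted(range(len(values)), key=lambda k: values[k])
    bins = [0] * len(values)
    for rank, k in enumerate(order):
        bins[k] = rank * n_bins // len(values)
    return bins


# Toy usage with a short window; the study used 250-bp windows and 20 bins
# for fGC, and 10 bins for the local substitution rate.
fractions = gc_fractions("GGCCATAT", window=4)
labels = equal_population_bins([0.9, 0.1, 0.5, 0.3], n_bins=2)
```

Rank-based binning is used here because it yields equal bin populations regardless of the shape of the underlying distribution.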

### Models and probabilistic alignment

The probabilistic aligner consists of a standard three-state pair HMM, which was modified in two ways. First, we duplicated the insertion and deletion states to model the mixture-geometric indel length distribution. Second, we added four “padding” states, modeling the existence of nonhomologous sequence at either end of the alignable sequence. Parameters of the model are: δ, the gap opening probability; ε_{1} and ε_{2}, the parameters determining the two geometric indel length distributions; α, their mixture coefficient; and τ, the alignment length parameter. The HMM topology is given in Figure 5A.

The parameter τ has a negligible effect on the alignments, and we fixed its value to 0.001. We defined six evolutionary models by restricting the ways in which the parameters vary with the input sequences. For the Full model, δ, ε_{1}, ε_{2}, and α all vary according to the fGC of the sequence. The substitution probabilities also depend on fGC and were additionally scaled to reflect the local sequence similarity. For the Basic model, all parameters were set to their average value, no scaling for local sequence similarity was done, and α was set to 1.0, corresponding to a standard geometric indel length distribution (i.e., affine-gap penalties). The remaining four models (see Table 1) resembled the Basic model, but each included one feature of the Full model: GCIndel, GCSubs, LocalSubs, and MixtureIndel included, respectively, fGC-dependent indel rates through δ; fGC-dependent substitution rates; local diversity-dependent substitution rates; and a mixture-geometric indel length distribution, which, however, did not depend on fGC (α fixed at 0.857, see Supplemental Table S2).
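To make the mixture-geometric indel length distribution concrete, the sketch below evaluates its probability mass function in Python. It assumes that ε_{1} and ε_{2} act as gap-extension probabilities of two geometric components mixed with weight α; this interpretation, the function names, and the ε values in the usage lines are assumptions for illustration (only α = 0.857 is taken from the text).

```python
def mixture_geometric_pmf(k, alpha, eps1, eps2):
    """P(indel length = k), k >= 1, for a two-component geometric mixture.

    Assumes eps1 and eps2 are gap-extension probabilities (an assumption
    of this sketch) and alpha is the weight of the first component.
    """
    if k < 1:
        return 0.0
    first = (1.0 - eps1) * eps1 ** (k - 1)
    second = (1.0 - eps2) * eps2 ** (k - 1)
    return alpha * first + (1.0 - alpha) * second


def mixture_geometric_mean(alpha, eps1, eps2):
    """Expected indel length under the same mixture."""
    return alpha / (1.0 - eps1) + (1.0 - alpha) / (1.0 - eps2)


# alpha = 0.857 is the value quoted for the MixtureIndel model;
# eps1 = 0.5 and eps2 = 0.9 are hypothetical illustration values.
p_long = mixture_geometric_pmf(20, 0.857, 0.5, 0.9)
# With alpha = 1.0 the mixture reduces to a single geometric distribution,
# i.e., the affine-gap case of the Basic model.
p_affine = mixture_geometric_pmf(3, 1.0, 0.5, 0.9)
```

The second component, with its larger extension probability, is what gives appreciable weight to long indels that a single geometric distribution would penalize heavily.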

Sequences were aligned using the Viterbi algorithm, and with two posterior decoding algorithms (see Appendix B). To reduce computation time, the dynamic programming tables were constrained by a banding procedure. The bandwidth was set independently of the sequences to be aligned by considering all paths corresponding to simulated alignments and computing the maximum deviation from the diagonal. To this maximum we added 15, so that all sampled paths remained well inside the band and the banding does not favorably bias the alignments.
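The banding procedure can be sketched as below. This minimal Python example assumes that the deviation from the diagonal is measured as |i − j| for a path cell (i, j); the function names `band_width` and `in_band` are hypothetical, and only the margin of 15 comes from the text.

```python
def band_width(paths, margin=15):
    """Band half-width: largest |i - j| over all path cells, plus a margin.

    `paths` is a list of alignment paths, each a list of (i, j) cells in the
    dynamic programming table (representation assumed for this sketch).
    """
    max_dev = max(abs(i - j) for path in paths for (i, j) in path)
    return max_dev + margin


def in_band(i, j, width):
    """True if DP cell (i, j) lies inside the band around the main diagonal."""
    return abs(i - j) <= width


# Two toy paths; the larger deviation from the diagonal is 2.
paths = [[(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)],
         [(0, 0), (0, 1), (1, 2)]]
width = band_width(paths)  # 2 + 15
```

During the recursion, cells for which `in_band` is false would simply be skipped, so the cost grows with sequence length times band width rather than with the product of the two lengths.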

### Data simulation

Aligned sequence data were simulated by sampling from the Full model after removing the padding states. The simulated data set consisted of 100 sequence pairs for each of the 20 fGC categories and 10 substitution rate categories (20,000 pairs). After sampling, we padded each sequence with 100 nt of nonhomologous sequence at each end, drawn from the appropriate background distribution, resulting in sequences with an average length of 893 nucleotides (1153 alignment columns), of which 693 were alignable (753 columns), comprising in total 2 × 17.87 Mb.

For the Jukes–Cantor model, we sampled 500 sequence pairs for each of 16 evolutionary distances, ranging from σ = 0.0 to 0.9 in steps of 0.075, keeping γ = 7.5 throughout. Each sequence was padded with 100 bp of nonhomologous sequence at each end, resulting in an average of 872 nucleotides per sequence (1149 alignment columns), of which 672 were alignable (749 columns), comprising a total of 2 × 6.98 Mb.
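The substitution part of such a Jukes–Cantor simulation can be sketched in a few lines of Python. The sketch below covers substitutions only (indels and padding are omitted), uses the match probability ¼ + ¾e^{−4σ/3} for homologous sites, and derives the indel rate from δ = σ/γ; the function names are hypothetical.

```python
import math
import random


def jc_match_probability(sigma):
    """Probability that homologous sites are identical after sigma subst./site."""
    return 0.25 + 0.75 * math.exp(-4.0 * sigma / 3.0)


def evolve_jc(ancestor, sigma, rng):
    """Return a descendant sequence with substitutions only (no indels)."""
    p_same = jc_match_probability(sigma)
    out = []
    for base in ancestor:
        if rng.random() < p_same:
            out.append(base)
        else:
            # Each of the three other nucleotides is equally likely.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


rng = random.Random(1)
ancestor = "".join(rng.choice("ACGT") for _ in range(672))
sigma, gamma = 0.375, 7.5
delta = sigma / gamma            # indel rate implied by the fixed ratio
descendant = evolve_jc(ancestor, sigma, rng)
```

At σ = 0.375 the expected proportion of identical homologous sites is about 0.705, and the implied indel rate is δ = 0.05.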

### Score-based alignments

We used the following score-based aligners: DiAlign 2.2 (Morgenstern 1999) with default parameters; Mavid 2.0.4 (Bray and Pachter 2004) with default parameters and a tree with total divergence 0.5; Lagan v1.21 (Brudno et al. 2003) with default parameters; and ClustalW 1.83 (Higgins and Sharp 1988) with default parameters for DNA sequence. For BLASTZ version 7 (Schwartz et al. 2003) we used two parameter settings for the score threshold: K = 2200, which is recommended for human–mouse alignments, and K = 2000, below which alignments start to become less reliable (Pollard et al. 2004).

## Appendix A: Analysis of gap wander

The maximum likelihood and true alignments need not be identical, because homology does not imply sequence identity and vice versa. Shifting a gap from its true location may thus increase the likelihood. Here we quantify this “gap wander” analytically for sequences evolving under a Jukes–Cantor model, following in part the analysis in Holmes (1998). We assume low indel rates, so that interactions between indels may be ignored.

Let $L_0$ be the log likelihood of the true alignment under the model of Figure 5A and $L_i$ the log likelihood of the alignment where a single gap is displaced rightward by $i$ nucleotides from its true location. We first consider gap displacements in one direction only, assuming $i \ge 0$. Invoking the low indel rate assumption, shifting the gap by one nucleotide to the right does not cause collisions with other gaps, and so does not change their length or number. Consequently, any change in the log likelihood of the alignment is due to the replacement of a single alignment column containing homologous nucleotides (a “homologous column”) by one containing nonhomologous nucleotides (a “nonhomologous column”). Both types of columns may contain either matching or nonmatching nucleotides. Under the Jukes–Cantor model, homologous columns contain matching and nonmatching nucleotides with probabilities $\tfrac14 + \tfrac34 e^{-4\sigma/3}$ and $\tfrac34 - \tfrac34 e^{-4\sigma/3}$, respectively. Shifting a gap by one nucleotide therefore causes the log likelihood of the alignment to increase or decrease by $S = \log\bigl[(1 + 3e^{-4\sigma/3})/(1 - e^{-4\sigma/3})\bigr]$, or remain unchanged. The sequence of random variables $L_0, L_1, L_2, \ldots$ thus defines a random walk with steps $+S$, $0$, $-S$. Denoting their probabilities by $a$, $b$, $c$ and using that the probability of finding identical nucleotides in nonhomologous columns is $\tfrac14$, we find $a = \tfrac14(\tfrac34 - \tfrac34 e^{-4\sigma/3})$, $c = \tfrac34(\tfrac14 + \tfrac34 e^{-4\sigma/3})$, and $b = 1 - a - c$. Since $a < c$, the random walk $L_0, L_1, L_2, \ldots$ has negative drift, so that $M = \max_{i \ge 0} L_i$ exists with probability 1. Let $T$ be the index for which this maximum is last reached, representing an optimal gap location (rightward of the origin). To derive the distribution $p_t = \Pr(T = t)$ of this location, we suppose a random walk $L_0, L_1, L_2, \ldots$ to be given, and construct another by adding one step in front. For the new random walk $L'_0, L'_1, L'_2, \ldots$ we have $L'_{k+1} = L_k$, and we denote the new maximum and index-of-last-maximum by $M'$ and $T'$. We have $T' = T + 1$ and $M' = M$ unless $M = L_0$ and a step in the negative direction ($-S$) was added, since only in that case $M' = L_0 + S$ is the new maximum, and $T' = 0$. This implies that

$$p_0 = c\,\Pr(M = L_0), \qquad p_{t+1} = \bigl[1 - c\,\Pr(M = L_0)\bigr]\,p_t. \tag{2}$$

In the language of random walks, the event $M = L_0$ is the escape probability of a random walk with drift and absorption and is computed as follows. Let $q_k$ be the probability that the sequence $L_0, L_1, L_2, \ldots$ takes on the value 0 at least once (“absorption”) when starting from $L_0 = kS$. For $k \ge 0$ we have $q_k = 1$ because of negative drift, while for $k < 0$ these probabilities satisfy $q_k = a q_{k+1} + b q_k + c q_{k-1}$, or $(q_{k+1} - q_k)/(q_k - q_{k-1}) = c/a$. Using the boundary conditions $q_0 = 1$ and $q_{-\infty} = 0$, this has the unique solution $q_k = (c/a)^k$, and in particular $\Pr(M = L_0) = 1 - q_{-1} = 1 - a/c$. Substitution into (2) yields $p_t = (c - a)(1 + a - c)^t = (1 - r) r^t$, where $r = 1 - \tfrac34 e^{-4\sigma/3}$. This describes the distribution of the maximum likelihood (ML) gap location rightward of the origin. The actual ML location is obtained by maximizing over locations both left and right of the true site. We approximate the deviation of the ML location away from the origin as $U = \max(T_L, T_R)$, where $T_L$, $T_R$ are the left and right ML gap distances to the origin. This is an approximation, since the likelihood need not attain its maximum at the maximum distance; however, the conditional expectation of the maximum value given the distance is an increasing function of the distance, so the error introduced in this way is small. Using $\Pr(U \le t) = \Pr(T \le t)^2$, we find $\Pr(U = t) = (1 - r)\, r^t\, (2 - r^t - r^{t+1})$. The expected value of $U$ is $E(U) = r(r + 2)/[(1 - r)(1 + r)]$, or

$$E(U) = \frac{(4 - 3e^{-4\sigma/3})(4 - e^{-4\sigma/3})}{e^{-4\sigma/3}\,(8 - 3e^{-4\sigma/3})}, \tag{3}$$

representing the expected number of wrongly aligned nucleotides per gap due to gap wander. This number is nonzero even for σ = 0 because of possible homonucleotide runs (or, more generally, tandem repeats), which ambiguate gap placement even for sequences that are identical except for gaps. The gap density as a proportion of aligned sequence is (σ/γ), again ignoring interactions between gaps. Multiplying (3) by this fraction finally yields (1).
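As a numerical check on this derivation, the short Python sketch below evaluates r = 1 − ¾e^{−4σ/3}, the expected wander E(U) = r(r + 2)/[(1 − r)(1 + r)], and the product of E(U) with the gap density σ/γ. The function names are hypothetical; the formulas are those given in this appendix.

```python
import math


def gap_wander_ratio(sigma):
    """r = 1 - (3/4) exp(-4 sigma / 3), the ratio of the geometric law p_t."""
    return 1.0 - 0.75 * math.exp(-4.0 * sigma / 3.0)


def expected_wander(sigma):
    """E(U) = r (r + 2) / ((1 - r)(1 + r)): misaligned nucleotides per gap."""
    r = gap_wander_ratio(sigma)
    return r * (r + 2.0) / ((1.0 - r) * (1.0 + r))


def predicted_error_fraction(sigma, gamma):
    """E(U) multiplied by the gap density sigma / gamma."""
    return expected_wander(sigma) * sigma / gamma


per_gap_identical = expected_wander(0.0)           # nonzero even at sigma = 0
fraction = predicted_error_fraction(0.375, 7.5)    # human-mouse-like divergence
```

For σ = 0 the expected wander evaluates to 0.6 nucleotides per gap, and for σ = 0.375 with γ = 7.5 (0.05 indels per site) the predicted fraction is close to 0.099, in line with the roughly 10% quoted in the Discussion.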

## Appendix B: Posterior decoding algorithms

We used a variant of posterior decoding which computes the alignment that maximizes the cumulative log posterior probability of all columns that contribute to the alignment. Let $a_1 \ldots a_n$ and $d_1 \ldots d_m$ be “ancestor” and “descendant” sequences, and suppose $M_{ij}$ is the posterior probability of aligning nucleotides $a_i$ and $d_j$, which is identical to the posterior probability of being in a “match” state at position $(i, j)$ in the dynamic programming table. Similarly, let $D_{ij}$ be the posterior probability that $a_i$ was involved in a deletion between the descendant’s nucleotides $d_j$ and $d_{j+1}$, and let $I_{ij}$ denote the same for an insertion of $d_j$ between the $a_i$ and $a_{i+1}$. These posteriors were calculated from the dynamic programming tables of the standard Forward and Backward algorithms. In our model, two sets of four states correspond to insertions and deletions (two in the main alignment HMM, and two padding states). Because these states are mutually exclusive, to compute the posterior probabilities $D_{ij}$ and $I_{ij}$ we aggregate the posteriors for the relevant contributing states. The maximum total product of posteriors along an alignment path is computed by dynamic programming as follows:

$$P_{00} = 1, \qquad P_{ij} = \max\bigl\{\, P_{i-1,j-1}\, M_{ij},\;\; P_{i-1,j}\, D_{ij},\;\; P_{i,j-1}\, I_{ij} \,\bigr\},$$

where all references to indices out of bounds are regarded as being 0. After populating the array, $P_{nm}$ contains the maximum total posterior. Finally, a traceback algorithm is used to find the corresponding posterior decoding path.
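A minimal Python sketch of this kind of posterior decoding is given below. It assumes a max-product recursion in which cell (i, j) is reached by a match from (i − 1, j − 1), a deletion from (i − 1, j), or an insertion from (i, j − 1), with the empty prefix given product 1; the base case, the table layout, and the function name `posterior_decode` are assumptions of this example rather than details confirmed by the text.

```python
def posterior_decode(M, D, I):
    """Return (best product, path) through posterior tables M, D, I.

    M, D, I are (n+1) x (m+1) nested lists indexed [i][j]. Moves that would
    need an out-of-range predecessor are not considered.
    The path is a list of (state, i, j) with state in {"M", "D", "I"}.
    """
    n, m = len(M) - 1, len(M[0]) - 1
    P = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    P[0][0] = 1.0  # assumed base case: the empty prefix has product 1
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                options.append((P[i - 1][j - 1] * M[i][j], "M"))
            if i > 0:
                options.append((P[i - 1][j] * D[i][j], "D"))
            if j > 0:
                options.append((P[i][j - 1] * I[i][j], "I"))
            P[i][j], back[i][j] = max(options)
    # Traceback from (n, m) to (0, 0).
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        state = back[i][j]
        path.append((state, i, j))
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "D":
            i -= 1
        else:
            j -= 1
    return P[n][m], path[::-1]


# Toy posteriors for n = 2, m = 1: a_1 is most likely deleted, a_2 matches d_1.
M = [[0, 0], [0, 0.1], [0, 0.9]]
D = [[0, 0], [0.8, 0.1], [0.1, 0.1]]
I = [[0.05, 0.05], [0.05, 0.05], [0.05, 0.05]]
best, path = posterior_decode(M, D, I)
```

In practice the products would usually be accumulated as sums of log posteriors to avoid numerical underflow on long sequences; plain products are kept here for readability.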

The MPD algorithm differs in the way gaps are treated. In the standard variant above, the posterior probabilities $M_{ij}$, $D_{ij}$, and $I_{ij}$ measure the probabilities that particular HMM states are visited, conditional on the sequence data. The posterior for a nucleotide to align to a gap character distinguishes gaps based on their location in the secondary sequence, since such gaps are represented by different states in the dynamic programming table. This results in relatively low posteriors for gapped nucleotides, exacerbating gap-interaction effects. To counter this, the MPD algorithm marginalizes over all possible gap locations within the secondary sequence, replacing $D_{ij}$ and $I_{ij}$ by their marginalized counterparts, $D'_{ij} = \sum_{k=0}^{m} D_{ik}$ and $I'_{ij} = \sum_{k=0}^{n} I_{kj}$. The resulting posterior is interpreted as the probability that a particular nucleotide is unaligned, without specifying the precise location of the gap to which it contributes. Note that each path contributes to at most one of $D_{i0}, \ldots, D_{im}$, so that $D'_{ij} \le 1$, and similarly for the insert states.
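The marginalization step can be sketched as follows. This Python example assumes that the marginalized deletion posterior for a_i is the sum of D over all descendant positions, and that the marginalized insertion posterior for d_j is the sum of I over all ancestor positions; the table layout and the function name `marginalize_gap_posteriors` are assumptions of the sketch.

```python
def marginalize_gap_posteriors(D, I):
    """Return (D_marg, I_marg) with gap posteriors summed over gap location.

    D[i][j]: posterior that a_i is deleted between d_j and d_{j+1};
    I[i][j]: posterior that d_j is inserted between a_i and a_{i+1}.
    Both are (n+1) x (m+1) nested lists (layout assumed for this sketch).
    """
    n, m = len(D) - 1, len(D[0]) - 1
    # Probability that a_i is unaligned, wherever the gap sits in d.
    row_sums = [sum(D[i]) for i in range(n + 1)]
    # Probability that d_j is unaligned, wherever the gap sits in a.
    col_sums = [sum(I[i][j] for i in range(n + 1)) for j in range(m + 1)]
    D_marg = [[row_sums[i]] * (m + 1) for i in range(n + 1)]
    I_marg = [[col_sums[j] for j in range(m + 1)] for _ in range(n + 1)]
    return D_marg, I_marg


D = [[0, 0, 0], [0.1, 0.2, 0.05], [0, 0, 0.3]]
I = [[0, 0.1, 0], [0, 0.2, 0.1], [0, 0.05, 0.4]]
D_marg, I_marg = marginalize_gap_posteriors(D, I)
```

The marginalized tables can then be used in place of the state-specific gap posteriors in a max-product decoding pass, so that an unaligned nucleotide is scored by its total probability of being unaligned rather than by the probability of one particular gap placement.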

## Acknowledgments

We thank Chris Ponting, Mikkel Schierup, and Michael Lässig for helpful discussions. G.L. and A.H. thank the MRC for financial support. A.R., N.M., and A.C. were supported by grants RAJLO (MRC), RHNIO (BBSRC), and RHJIHO (BBSRC) to J.H.

## Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6725608

## References

- Altschul S.F., Erickson B.W., Erickson B.W. Locally optimal subalignments using nonlinear similarity functions. Bull. Math. Biol. 1986;48:633–660. [PubMed]
- Arndt P.F., Burge C.B., Hwa T., Burge C.B., Hwa T., Hwa T. DNA sequence evolution with neighbor-dependent mutation. J. Comput. Biol. 2003;10:313–322. [PubMed]
- Batzoglou S. The many faces of sequence alignment. Brief Bioinform. 2005;6:6–22. [PubMed]
- Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Kent W.J., Riemer C., Elnitski L., Smit A.F., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Riemer C., Elnitski L., Smit A.F., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Elnitski L., Smit A.F., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Smit A.F., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Baertsch R., Rosenbloom K., Clawson H., Green E.D., Rosenbloom K., Clawson H., Green E.D., Clawson H., Green E.D., Green E.D., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. [PMC free article] [PubMed]
- Bray N., Pachter L., Pachter L. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. 2004;14:693–699. [PMC free article] [PubMed]
- Brudno M., Do C.B., Cooper G.M., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S., Do C.B., Cooper G.M., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S., Cooper G.M., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S., Davydov E., Green E.D., Sidow A., Batzoglou S., Green E.D., Sidow A., Batzoglou S., Sidow A., Batzoglou S., Batzoglou S. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731. [PMC free article] [PubMed]
- Brudno M., Poliakov A., Salamov A., Cooper G.M., Sidow A., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Poliakov A., Salamov A., Cooper G.M., Sidow A., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Salamov A., Cooper G.M., Sidow A., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Cooper G.M., Sidow A., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Sidow A., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Rubin E.M., Solovyev V., Batzoglou S., Dubchak I., Solovyev V., Batzoglou S., Dubchak I., Batzoglou S., Dubchak I., Dubchak I. Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res. 2004;14:685–692. [PMC free article] [PubMed]
- Byers T.M., Waterman M.S., Waterman M.S. Determining all optimal and near-optimal solutions when solving shortest path problems by dynamic programming. Oper. Res. 1984;32:1381–1384.
- Chao K.M., Hardison R.C., Miller W., Hardison R.C., Miller W., Miller W. Locating well-conserved regions within a pairwise alignment. Comput. Appl. Biosci. 1993;9:387–396. [PubMed]
- Chiaromonte F., Yap V.B., Miller W., Yap V.B., Miller W., Miller W. Scoring pairwise genomic sequence alignments. Pac. Symp. Biocomput. 2002;7:115–126. [PubMed]
- Christensen O.F. Pseudo-likelihood for non-reversible nucleotide substitution models with neighbour dependent rates. Stat. Appl. Genet. Mol. Biol. 2006;5 Article 18. http://www.bepress.com/sagmb/vol5/iss1/art18. [PubMed]
- Dewey C.N., Pachter L., Pachter L. Evolution at the nucleotide level: The problem of multiple whole-genome alignment. Hum. Mol. Genet. 2006;15 Spec No 1:R51–R56. doi: 10.1093/hmg/dd1056. [PubMed] [Cross Ref]
- Dewey C.N., Huggins P.M., Woods K., Sturmfels B., Pachter L., Huggins P.M., Woods K., Sturmfels B., Pachter L., Woods K., Sturmfels B., Pachter L., Sturmfels B., Pachter L., Pachter L. Parametric alignment of
*Drosophila*genomes. PLoS Comput. Biol. 2006;2:e73. doi: 10.1371/journal.pcbi.0020073. [PMC free article] [PubMed] [Cross Ref] - Ding Y., Chan C.Y., Lawrence C.E., Chan C.Y., Lawrence C.E., Lawrence C.E. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA. 2005;11:1157–1166. [PMC free article] [PubMed]
- Do C.B., Mahabhashyam M.S., Brudno M., Batzoglou S., Mahabhashyam M.S., Brudno M., Batzoglou S., Brudno M., Batzoglou S., Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340. [PMC free article] [PubMed]
- Durbin R., Eddy S.R., Krogh A., Mitchison G., Eddy S.R., Krogh A., Mitchison G., Krogh A., Mitchison G., Mitchison G. Biological sequence analysis. Cambridge University Press; Cambridge, UK: 1998.
- Edgar R.C., Batzoglou S., Batzoglou S. Multiple sequence alignment. Curr. Opin. Struct. Biol. 2006;16:368–373. [PubMed]
- Elofsson A. A study on how to best align protein sequences. Proteins: Struct. Funct. Genet. 2002;46:300–309. doi: 10.1002/prot.10043. [Cross Ref]
- Fariselli P., Martelli P.L., Casadio R., Martelli P.L., Casadio R., Casadio R. A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins. BMC Bioinformatics. 2005;6:S12. doi: 10.1186/1471-2105-6-S4-S12. (Suppl 4) [PMC free article] [PubMed] [Cross Ref]
- Goad W.B., Kanehisa M.I., Kanehisa M.I. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucl. Acids Res. 1982;10:247–263. doi: 10.1093/nar/10.1.247. [PMC free article] [PubMed] [Cross Ref]
- Green P., Ewing B., Miller W., Thomas P.J., Green E.D., Ewing B., Miller W., Thomas P.J., Green E.D., Miller W., Thomas P.J., Green E.D., Thomas P.J., Green E.D., Green E.D. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 2003;33:514–517. [PubMed]
- Gusfield D., Balasubramanian K., Naor D., Balasubramanian K., Naor D., Naor D. Parametric optimization of sequence alignment. Algorithmica. 1994;12:312–326.
- Hasegawa M., Kishino H., Yano T., Kishino H., Yano T., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985;22:160–174. [PubMed]
- Hellmann I., Prufer K., Ji H., Zody M.C., Paabo S., Ptak S.E., Prufer K., Ji H., Zody M.C., Paabo S., Ptak S.E., Ji H., Zody M.C., Paabo S., Ptak S.E., Zody M.C., Paabo S., Ptak S.E., Paabo S., Ptak S.E., Ptak S.E. Why do human diversity levels vary at a megabase scale? Genome Res. 2005;15:1222–1231. [PMC free article] [PubMed]
- Higgins D.G., Sharp P.M., Sharp P.M. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene. 1988;73:237–244. [PubMed]
- Holmes I. 1998.
“
*Studies in probabilistic sequence alignment and evolution*” Ph.D. thesis, University of Cambridge and The Sanger Centre, Cambridge, UK - Holmes I., Durbin R., Durbin R. Dynamic programming alignment accuracy. J. Comput. Biol. 1998;5:493–504. [PubMed]
- Hwang D.G., Green P., Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. 2004;101:13994–14001. [PMC free article] [PubMed]
- Jensen J.L., Pedersen A.-M.K., Pedersen A.-M.K. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Prob. 2000;32:499–517.
- Jukes T., Cantor C., Cantor C. Evolution of protein molecules. In: Munro H.N., editor. Mammalian protein metabolism. Academic Press; New York: 1969. pp. 21–132.
- Kall L., Krogh A., Sonnhammer E.L., Krogh A., Sonnhammer E.L., Sonnhammer E.L. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. 2005;21:i251–i257. doi: 10.1093/bioinformatics/bti1014. (Suppl 1) [PubMed] [Cross Ref]
- Krogh A. Two methods for improving performance of an HMM and their application for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1997;5:179–186. [PubMed]
- Kroutil L.C., Kunkel T.A. Deletion errors generated during replication of CAG repeats. Nucl. Acids Res. 1999;27:3481–3486. doi: 10.1093/nar/27.17.3481. [PMC free article] [PubMed] [Cross Ref]
- Lassmann T., Sonnhammer E.L. Automatic assessment of alignment quality. Nucl. Acids Res. 2005;33:7120–7128. doi: 10.1093/nar/gki1020. [PMC free article] [PubMed] [Cross Ref]
- Lipatov M., Arndt P.F., Hwa T., Petrov D.A. A novel method distinguishes between mutation rates and fixation biases in patterns of single-nucleotide substitution. J. Mol. Evol. 2006;62:168–175. [PubMed]
- Loytynoja A., Goldman N. An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. 2005;102:10557–10562. [PMC free article] [PubMed]
- Lunter G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007;23:i289–i296. doi: 10.1093/bioinformatics/btm185. [PubMed] [Cross Ref]
- Lunter G., Hein J. A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics. 2004;20:i216–i223. doi: 10.1093/bioinformatics/bth901. (Suppl 1) [PubMed] [Cross Ref]
- Lunter G., Miklós I., Drummond A., Jensen J.L., Hein J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005;6:83. [PMC free article] [PubMed]
- Lunter G., Ponting C.P., Hein J. Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput. Biol. 2006;2:e5. doi: 10.1371/journal.pcbi.0020005. [PMC free article] [PubMed] [Cross Ref]
- Metzler D. Statistical alignment based on fragment insertion and deletion models. Bioinformatics. 2003;19:490–499. [PubMed]
- Meunier J., Duret L. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 2004;21:984–990. [PubMed]
- Mevissen H.T., Vingron M. Quantifying the local reliability of a sequence alignment. Protein Eng. 1996;9:127–132. [PubMed]
- Miklós I., Lunter G.A., Holmes I. A “Long Indel” model for evolutionary sequence alignment. Mol. Biol. Evol. 2004;21:529–540. [PubMed]
- Morgenstern B. DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 1999;15:211–218. [PubMed]
- Morgenstern B. DIALIGN: Multiple DNA and protein sequence alignment at BiBiServ. Nucl. Acids Res. 2004;32:W33–W36. doi: 10.1093/nar/gkh373. [PMC free article] [PubMed] [Cross Ref]
- Needleman S.B., Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453. [PubMed]
- Paten B., Birney E. 2007. PECAN. http://www.ebi.ac.uk/~bjp/pecan.
- Pollard D.A., Bergman C.M., Stoye J., Celniker S.E., Eisen M.B. Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics. 2004;5:6. [PMC free article] [PubMed]
- Prakash A., Tompa M. Measuring the accuracy of genome-size multiple alignments. Genome Biol. 2007;8:R124. doi: 10.1186/gb-2007-8-6-r124. [PMC free article] [PubMed] [Cross Ref]
- Roshan U., Livesay D.R. Probalign: Multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006;22:2715–2721. [PubMed]
- Schlosshauer M., Ohlsson M. A novel approach to local reliability of sequence alignments. Bioinformatics. 2002;18:847–854. [PubMed]
- Schwartz S., Kent W.J., Smit A., Zhang Z., Baertsch R., Hardison R.C., Haussler D., Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003;13:103–107. [PMC free article] [PubMed]
- Sellers P.H. Pattern recognition in genetic sequences. Proc. Natl. Acad. Sci. 1979;76:3041. [PMC free article] [PubMed]
- Siepel A., Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 2004;21:468–488. [PubMed]
- Sun Y., Buhler J. Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics. 2006;7:133. [PMC free article] [PubMed]
- Thorne J.L., Kishino H., Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 1991;33:114–124. [PubMed]
- Tramontano A., Leplae R., Morea V. Analysis and assessment of comparative modeling predictions in CASP4. Proteins. 2001;5:22–38. [PubMed]
- Tress M.L., Jones D., Valencia A. Predicting reliable regions in protein alignments from sequence profiles. J. Mol. Biol. 2003;330:705–718. [PubMed]
- Waterman M.S. Sequence alignments in the neighborhood of the optimum with general application to dynamic programming. Proc. Natl. Acad. Sci. 1983;80:3123–3124. [PMC free article] [PubMed]
- Waterman M.S., Eggert M., Lander E. Parametric sequence comparisons. Proc. Natl. Acad. Sci. 1992;89:6090–6093. [PMC free article] [PubMed]
- Waterston R.H., Lindblad-Toh K., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. [PubMed]
- Zuker M. Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J. Mol. Biol. 1991;221:403–420. [PubMed]

**Cold Spring Harbor Laboratory Press**
