PLoS One. 2012; 7(4): e34572.
Published online Apr 20, 2012. doi:  10.1371/journal.pone.0034572
PMCID: PMC3335033

Accurate Reconstruction of Insertion-Deletion Histories by Statistical Phylogenetics

Art F. Y. Poon, Editor

Abstract

The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.

Introduction

The Multiple Sequence Alignment (MSA), indispensable to computational sequence analysis, represents a hypothetical claim about the homology between sequences. MSAs have many different uses, but the underlying hypothesis can often be classified as a claim either of structural homology (the 3D structures align in a particular way) or of evolutionary homology (the sequences are related by a particular history on a given phylogenetic tree). These types of hypothesis are similar, but with subtle (and important) distinctions: at the residue level, a claim of evolutionary homology (direct shared descent) is far stronger than a claim of structural homology (same approximate fold). Furthermore, both types of MSA–evolutionary and structural–typically only represent summaries of the respective homologies: some fine detail is often omitted. For example, an evolutionary MSA may–or may not–include the ancestral sequences at internal nodes of the underlying tree.

Structural and evolutionary MSAs are often conflated, but they have quite different applications. For example, a common use for a structural MSA is template-based structure prediction, where a query sequence is aligned to a target of known structure; the success of this prediction reflects the number of query-template residues correctly aligned [1]. By way of contrast, a common application for an evolutionary MSA is to identify regions or sites under selection, the success of which depends on accurate reconstruction of the evolutionary history [2], [3].

Evaluation of alignment methods is typically done with implicit regard for the structural interpretation. Many benchmarks have used metrics based on the Sum of Pairs Score (SPS) [4]. In the situation that a query-template pairwise alignment is randomly picked out of the MSA, the SPS effectively estimates the proportion of homologous residues that are correctly identified. Several alignment methods attempt to maximize the posterior expectation of SPS or similar metrics. This appears to improve accuracy, particularly when measured with reference to structural homology. However, it does not automatically confer evolutionary accuracy – a correct reconstruction of the evolutionary history of the sequences.
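
The Sum of Pairs Score can be illustrated with a short Python sketch. It assumes the common definition of SPS as the fraction of residue pairs aligned in a reference MSA that are also aligned in the test MSA; the data layout (a dict of gapped rows) and the function names are hypothetical and chosen only for this example.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that an MSA places in the same column.

    `msa` maps sequence name -> gapped row ('-' marks a gap); all rows must
    have equal length. A residue is identified by (name, ungapped index).
    """
    names = sorted(msa)
    length = len(msa[names[0]])
    assert all(len(msa[n]) == length for n in names), "rows must have equal length"
    next_index = {n: 0 for n in names}
    pairs = set()
    for col in range(length):
        present = []
        for n in names:
            if msa[n][col] != "-":
                present.append((n, next_index[n]))
                next_index[n] += 1
        # Names are visited in sorted order, so each pair has a fixed orientation.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy example: the test alignment misplaces one gap in sequence "a".
reference_msa = {"a": "AC-GT", "b": "ACGGT"}
test_msa = {"a": "ACG-T", "b": "ACGGT"}
score = sum_of_pairs_score(test_msa, reference_msa)
print(score)
```

Note that the score depends only on which leaf residues share a column. It says nothing about which branch an insertion or deletion occurred on, which is why a high SPS does not by itself imply an accurate indel history.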

Several studies suggest that multiple alignment for evolutionary purposes is still a highly uncertain procedure [5], and that errors therein may significantly bias analyses of evolutionary effects [6]–[11]. A useful component of these studies is simulation of genetic sequence evolution [6], which appears to better indicate evolutionary accuracy than benchmarks derived from protein structure alignments. Simulations can be made quite realistic given the abundance of comparative sequence data [12].

The current state-of-the-art in phylogenetic alignment software is a choice between (on the one hand) programs that lack explicit models of the underlying evolutionary process, and so are not framed as statistical inference problems [6], and (on the other hand) Bayesian Markov chain Monte Carlo (MCMC) methods, which are statistically exact but prohibitively slow [13],[14].

A telling observation is that while substitution rate is routinely measured from MSAs and used as an indicator of natural selection, there is relatively little analogous use of indel rate. As we report here, it seems highly likely that even if indel rate is a useful evolutionary signal (which is eminently plausible), the present alignment methods distort measurements of this rate so far as to make it meaningless (Figure 1 and Figure 2).

Figure 1
ProtPal's estimates of insertion and deletion rates are the most accurate of any program tested, as measured by the RMSE of [formula: pone.0034572.e001] values aggregated over all substitution/indel rate categories.
Figure 2
Rate estimation accuracy is highly dependent on the simulated indel rate.

In this paper, we frame phylogenetic sequence alignment as an approximate maximum likelihood (ML) inference. Our inference algorithm assumes that the tree is known, requiring a separate tree estimation protocol. While this is a strong assumption, it is in principle shared among all progressive aligners (e.g. PRANK [15], Muscle [16], ClustalW [17], MAFFT [18]). The alignment-marginalized likelihoods reported by our algorithm allow for statistical tests between alternative trees, and the functionality to estimate an initial alignment and guide tree from unaligned sequences exists elsewhere in the DART package. Our framing uses automata-theoretic methods from computational linguistics to unify several previously-disjoint areas of bioinformatics: Felsenstein's pruning algorithm for the phylogenetic likelihood function [19], progressive multiple sequence alignment [20], and alignment ensemble representation using partial order graphs [21]. Our algorithm may be viewed as a stochastic generalization of pruning to infinite state spaces: it retains the linear time and memory complexity of pruning ($O(LN)$ for $N$ sequences of length $L$), while moderating the biasing effect of the MSA. The algorithmic details of our method are outlined briefly in the Methods, and in more complete, mathematically precise terms (with a tutorial introduction) in a separately submitted work.

Our software implementation of this algorithm is called ProtPal. We measured the accuracy of ProtPal relative to leading non-MCMC alignment/reconstruction protocols by simulating indels and substitutions on a known phylogeny, withholding the true history and attempting to reconstruct it from the sequences at the tips of the tree. The results show that all previous approaches to the reconstruction of ancestral sequences introduce significant biases, including systematic underestimation of insertions and overestimation of deletions. This contradicts previous claims that advances in the statistical foundations of alignment tools, supported by improvements in protein-structure benchmarks, necessarily improve the accuracy of evolutionary parameter estimates like the indel rate [6], [22], [23].

ProtPal introduces less bias than any other methods we tested, including PRANK, the state-of-the-art phylogenetic progressive aligner [6]. Both PRANK and ProtPal treat insertions and deletions as phylogenetic events (Figure 3). Based on our tests, ProtPal appears to be the best choice for small to moderately-sized analyses, such as a reconstruction of the history of proteins at the inter-species level in human evolutionary history. Using ProtPal to estimate indel rates for approximately 7,500 human protein-coding gene families, we find that per-gene indel rates are approximately gamma-distributed, with 95% of genes experiencing a mean rate of less than 0.1 indel events per synonymous substitution event. We find that lengths of inserted and deleted sequences are comparably distributed, having medians 5 and 7, respectively. The human lineage appears to have experienced unusually many insertions since the human-mouse split. By mapping genes to Gene Ontology (GO) terms, we find that the 200 fastest-indel genes are enriched for regulatory and metabolic functions. Possible applications and extensions of our algorithm include phylogenetic placement, homology detection, and reconstruction of structured RNA.

Figure 3
Gap attraction, the canceling of nearby complementary indels, can affect insertion and deletion rates in various ways depending on the phylogenetic relationship of the sequences involved.

Results

Computational reconstruction of simulated histories

We undertook to determine the ability of leading bioinformatics programs, including ProtPal, to characterize mutation event histories. We simulated indel histories on a tree, then attempted to reconstruct the MAP history, $\hat{H}$, using only knowledge of the sequences $S$ and the phylogeny $T$ (but not the sequence alignment). The history $H$ is the aligned set of observed extant and predicted ancestral sequences, such that insertion, deletion, and substitution events can be pinpointed to specific tree branches (though not to specific time points on those branches).
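
To make the idea of pinpointing events to branches concrete, the following Python sketch counts insertion and deletion events per branch from a history supplied as aligned rows for every node. The data layout, the names `count_indel_events`, `history` and `branches`, and the convention that a maximal run of adjacent gap columns counts as one event are all assumptions made for illustration; they are not taken from ProtPal.

```python
def count_indel_events(parent_row, child_row):
    """Count (insertions, deletions) on one branch from two aligned rows.

    A column with a gap in the parent and a residue in the child is an
    insertion column; the reverse is a deletion column. Consecutive columns
    of the same kind are merged into a single event. Columns that are gaps
    in both rows belong to events elsewhere in the tree and are skipped.
    """
    insertions = deletions = 0
    state = "match"
    for p, c in zip(parent_row, child_row):
        if p == "-" and c == "-":
            continue
        if p == "-":
            new_state = "ins"
        elif c == "-":
            new_state = "del"
        else:
            new_state = "match"
        if new_state != state:
            if new_state == "ins":
                insertions += 1
            elif new_state == "del":
                deletions += 1
        state = new_state
    return insertions, deletions


# Toy history: aligned rows for ancestral and extant nodes (hypothetical data).
history = {
    "root":  "AC--GT",
    "anc":   "ACTTGT",
    "leaf1": "ACTTGT",
    "leaf2": "A---GT",
    "leaf3": "AC--G-",
}
branches = [("root", "anc"), ("anc", "leaf1"), ("anc", "leaf2"), ("root", "leaf3")]

per_branch = {
    (parent, child): count_indel_events(history[parent], history[child])
    for parent, child in branches
}
totals = (
    sum(ins for ins, _ in per_branch.values()),
    sum(dele for _, dele in per_branch.values()),
)
print(per_branch)
print(totals)
```

Because the ancestral rows are part of the history, the same leaf alignment can yield different event counts under different ancestral reconstructions, which is what distinguishes a history from a leaf-only MSA.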

We then characterized the reconstruction quality both directly, by comparison of the reconstructed history $\hat{H}$ to the true history $H$, and indirectly, by using $\hat{H}$ to estimate $\theta$, the evolutionary parameters:

$$\hat{\theta}_{\hat{H}} = \arg\max_{\theta} P(\theta \mid \hat{H}, S, T) = \arg\max_{\theta} P(\hat{H}, S \mid T, \theta)$$
(1)

where $S$ denotes the sequences and $T$ the phylogeny, and the latter step assumes a flat prior, $P(\theta) = \text{const}$. We then compared the history-conditioned parameter estimate $\hat{\theta}_{\hat{H}}$ to the true $\theta$.
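
The role of the flat prior can be spelled out with Bayes' rule. The derivation below is a sketch in notation assumed here: $\theta$ for the evolutionary parameters, $\hat{H}$ for the reconstructed history, $S$ for the observed sequences and $T$ for the tree.

```latex
\begin{align}
P(\theta \mid \hat{H}, S, T)
  &= \frac{P(\hat{H}, S \mid T, \theta)\, P(\theta \mid T)}{P(\hat{H}, S \mid T)} \\
\arg\max_{\theta} P(\theta \mid \hat{H}, S, T)
  &= \arg\max_{\theta} P(\hat{H}, S \mid T, \theta)
  \quad \text{when } P(\theta \mid T) \text{ is constant in } \theta
\end{align}
```

The denominator does not depend on $\theta$, so with a constant prior the posterior mode coincides with the maximum of the history likelihood.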

This statistic is not without its problems. For one thing, we use an initial guess of the parameters $\theta$ to estimate the history $\hat{H}$. Furthermore, for an unbiased estimate, we should sum over all histories, rather than conditioning on the MAP reconstructed history. This summing over histories would, however, require multiple expensive calculations of [formula: pone.0034572.e022], where conditioning on $\hat{H}$ requires only one such calculation. Furthermore, parameter estimation conditioned on a MAP-reconstructed history is the de facto method employed by large-scale genomics studies focusing on indels [24]–[27].

Simulation model parameters

The model parameters are the insertion and deletion rates, the indel length distributions and the substitution rate matrix. Here we focus on the rates.

As described in Text S1, we generated data using an external simulation tool, indel-seq-gen, varying insertion, deletion and substitution rates over a range representative of per-gene rates in Amniota evolution (Figure 4). We varied indel rates (with [formula: pone.0034572.e032]) between 0.005 and 0.08 expected indels per unit time, estimating that this range accounts for 95% of human gene families. We left the substitution model and indel length distributions fixed, employing indel-seq-gen's empirically-estimated values.

Figure 4
Insertion and deletion rates in Amniota show similar distributions, with 95% of genes having rates less than approximately 0.1 indels per synonymous substitution.

We performed simulations on mammalian, amniote and fruitfly phylogenies, using the taxa in those clades for which genomic sequence is actually available. We found generally consistent results, with common trends that were most pronounced on the largest of the three trees that we used (the twelve sequenced Drosophila species [28]). In discussing the trends, we will refer specifically to the results on this largest of the trees.

Indel rate estimates

Overall most accurate

We first set out to determine which program, when used to analyze a set of unaligned sequences, returns the indel rate estimate closest to the true rate.

We report the ratio of inferred rate to true rate for insertions and deletions in Figure 1, with each inferred rate defined as the history-conditioned parameter estimate in Equation 1. No parameter estimate derived from a computationally reconstructed history approaches the level of accuracy achieved using the true history (labeled “True simulated history” in Figure 1).

The results do not always concord with previous benchmarks that have measured accuracy using 3D structural alignments: for example, the FSA program, one of the most accurate aligners on structural benchmarks [23], performs poorly here. This discordance may be due to the fundamental differences between evolutionary and structural homology, and the metrics used to assess each. For instance, consider a region with many nearby and overlapping insertions and deletions. The spatial and temporal location of these insertion and deletion events (in particular, the pinpointing of events to branches on the tree) defines what the “perfect” evolutionary reconstruction is. In contrast, even given perfect knowledge of the insertion/deletion history, a “perfect” structural alignment depends only on the proteins at the tips of the tree, and this alignment could differ from the true evolutionary reconstruction.

Fundamentally, the difference between FSA and ProtPal is the underlying metric that is being optimized by each program: FSA attempts to maximize a metric (AMA = Alignment Metric Accuracy) which is essentially “structural” (in the sense that it predicts how many residues would be correctly aligned in a pairwise alignment of two leaf-node sequences, as might be used in structure prediction by target-template alignment), while ProtPal attempts to maximize a “phylogenetic” metric (the probability of a given evolutionary history). The metric we have used in our benchmark (which counts correct reconstruction of the number of indel events on branches of the tree) is also “phylogenetic”. When ranking the programs using the AMA metric, FSA performs well, with accuracy exceeding that of ProtPal in the highest indel rate category (Text S1). This suggests that the differences between our evolutionary benchmark and previous benchmarks are not due to the data, but rather the types of metrics that are used to measure alignment accuracy; similarly, the differences between the leading programs are primarily due to what types of benchmark they are explicitly trying to perform well at.

All programs other than ProtPal display insertion-versus-deletion biases that are, to a varying degree, asymmetric. Typically, the asymmetry is that insertions are underrepresented and deletions overrepresented. ProtPal's bias, which is generally smaller than that of the other programs, is also the most symmetric: reconstructed insertions and deletions are roughly equally reliable, with both slightly underestimated. Over the distribution of human gene rates used by this benchmark, our phylogenetic likelihood approach, ProtPal, provides the most accurate reconstructions of both insertion and deletion counts. PRANK, which also uses a tree (but no likelihood), avoids insertion-deletion biases to a certain extent, although insertion rates are slightly underestimated relative to deletion rates. Since ProtPal's MAP history estimation appears similar to the optimization algorithm of PRANK, we suspect that ProtPal's marginally better performance is due primarily to its main difference in implementation: ProtPal tracks an ensemble of possible reconstructions during progressive tree traversal (Section), whereas PRANK uses a single “current best guess.”

Effect of indel rate variation

To investigate the effect of indel rate variation on estimation accuracy, we separate each program's error distributions by indel rate (Figure 2). We find that all programs' accuracy is strongly affected by the indel rate used in simulation. As the true indel rate increases, most programs' estimates drift towards [formula: pone.0034572.e039]. This is consistent with the so-called “gap attraction” effect, where indels that are nearby in sequence can be misinterpreted as substitution events [29]. Depending on the phylogenetic orientation of the events, estimated rates can be elevated or lowered, with different biases for insertion and deletion rates (Figure 3).

Gap attraction and other biases operate simultaneously, and are sometimes opposed. MUSCLE over-estimates the deletion rate under most conditions, but (consistent with a trend where programs have lower [formula: pone.0034572.e040] at higher indel rates) gets the deletion rate roughly correct in the highest-indel-rate category of our benchmark. However, the alignments produced by MUSCLE at high indel rates are no more “accurate” by pairwise metrics (Text S1). We conjecture that multiple, contradictory types of gap attraction are at work, e.g. Figures 3B and 3C.

After ProtPal, the two most accurate reconstruction methods are PRANK and ProbCons (the latter combined with a parsimonious indel reconstruction). ProbCons produces more reliable insertion estimates than PRANK in a broad range of benchmark categories, is tied with PRANK for deletion estimates, and appears robust to indel rate variation. PRANK performs slightly better than ProbCons in the slowest indel rate category we considered. ProtPal produces the most reliable estimates overall, outperforming ProbCons in all but the fastest indel rate category, and PRANK in all but the slowest.

Sensitivity to substitution rate

Compared to variation of the simulated indel rate, variation of the simulated substitution rate appears to have little effect on the accuracy of indel reconstruction (Text S1). One notable exception is FSA, which appears to be affected by the substitution rate more than the other programs. For example, when the simulated indel and substitution rates are both low, FSA is comparable to the most accurate of the other programs (ProtPal); but when the substitution rate is increased, FSA's error is greater than that of the least accurate program (CLUSTALW). Errors in estimating the substitution rate are comparable among the programs tested, and are similarly correlated with the simulation indel rate (Text S1).

Reconstructed indel histories of human genes

We present here a comprehensive set of reconstructions accounting for the evolutionary history of individual codons in human genes. We used genes in the Orthologous and Paralogous Transcripts in Clades (OPTIC) database's Amniota set, comprising the five mammals H. sapiens, M. musculus, C. familiaris, M. domestica and O. anatinus, with G. gallus as an outgroup [30]. Considering only those families with one unique ortholog per species (approximately 7,500 families), we combined tree branch statistics across genes, using the species tree in Text S1. Our reconstructions are available at http://biowiki.org/oscar/optic_reconstruction.tar, and we provide here various graphical summaries of Amniota evolutionary history. Several negative results stand in contrast to earlier-reported trends.

Indel rates

Insertion and deletion rates are approximately gamma-distributed (Figure 4). Roughly 95% of genes have indel rates below approximately 0.1 indels per synonymous substitution.

Phylogenetic origins

In our simulations, ProtPal pinpoints residues' “branch of origin” more reliably than other tools, with a 93% accuracy rate (Text S1). Many codons appeared to have been inserted following the human-mouse split (Text S1).

Branch-specific indel rates

Using our reconstructions to estimate the rates of indel mutations along specific tree branches, we find evidence of an elevated insertion rate in the human (black) branch, as well as on the Amniota - Australophenids (pink) branch (Text S1).

Amino acid distributions

Distributions over amino acids differ significantly between inserted, deleted and non-indel sequences (Text S1). In general, small residues are over-represented in insertions, in agreement with previous studies [31].

Indel lengths

We find, contrary to a previous study in Nematode [32], that length distributions in the Amniotes are nearly identical between insertions and deletions (Text S1). The previously-reported result may be attributable to the deletion-biased nature of the methods used, particularly CLUSTALW and MUSCLE [32].

Indel position

The position of indels within genes is highly biased towards the ends of genes, presumably in large part reflecting annotation error (Text S1). The bias is strongest for deletions at the N-terminus of the gene, but both insertions and deletions are enriched in both C- and N- termini.

Evolutionary context of indel SNPs

We find no general correlation between the indel rate for a gene and the number of indel polymorphisms recorded for that gene in dbSNP [33] (Text S1).

Gene ontology indel rates

No Gene Ontology (GO) categories stand out as having significantly lowered or heightened indel rates in any of the three ontologies, contrasting with the reported results of a 2007 study using a smaller number of genes [31]. An enrichment analysis conducted with GOstat [34] showed that the 200 fastest evolving genes in our data are significantly enriched for regulatory and metabolic functions.

Discussion

We developed and analyzed a simulation benchmark that compares programs based on their reconstructions of evolutionary history, using instantaneous mutation rates representative of Amniote evolution. We tested several different tree topologies; results were similar on all trees, but most pronounced on the tree with the longest branch lengths. We find that most programs distort indel rate measurements, despite claims to the contrary. Moreover, the systematic bias varies significantly when the rates of substitutions and indels are varied within a biologically reasonable range. Many of the programs we rated have been ranked in the past, but using benchmarks that use protein structural alignments as a gold standard, rather than evolutionary simulations. Furthermore, these previous benchmarks have not directly assessed the reconstruction of evolutionary history (or summary statistics such as the indel rate), but have used other alignment accuracy metrics such as the Sum of Pairs Score. Alignment programs that perform weakly on our benchmark have apparently performed well on these previous benchmarks. We hypothesize that these benchmarks, compared to ours, are less directly predictive of a program's accuracy at historical reconstruction, although they may better reflect the program's suitability to assist in tasks relating more closely to folded structure, like prediction of a protein's 3D structure from a homologous template.

We have introduced a new notation that describes a general, hidden Markov model-structured likelihood function for indel histories on a tree, as well as the structure of the corresponding inference algorithm. We have implemented the new method in a freely-available program, ProtPal, that allows, for the first time, phylogenetic reconstruction with accuracy over a broad range of indel rates. ProtPal is written in C++ as a part of the DART package: www.biowiki.org/ProtPal. The evolutionary reconstructions ProtPal produces are, according to our simulated tests, the most accurate of any available tool, for a range of parameters typical of human genes.

We applied ProtPal to the reconstruction of human gene indel history, using families of human gene orthologs from the OPTIC database. We find some patterns that agree with previous studies, such as the non-uniform distributions over amino acids seen in [31]. Other results stand in contrast: a previous study found significantly different length distributions for insertions and deletions [32], whereas in our data they appear very similar. Another prediction of our reconstruction is an elevated rate of insertions on the human branch since the human-mouse split. This contrasts with a previous analysis [35], though the data therein were whole genomes, rather than individual protein-coding genes. In contrast to [31], we find no obvious predictive power of the Gene Ontology (GO) for indel rates; that is, the indel rate does not appear strongly correlated with the presence or absence of any particular GO term-gene association. However, enrichment analysis for GO terms using GOstat [34] showed that the 200 fastest-evolving genes are significantly enriched for regulatory and metabolic function. This apparent discrepancy might be explained by a group of regulatory and metabolic genes which have very high indel rates, but whose small number prevents them from skewing the average within their GO categories.

Many applications which use a fixed-alignment phylogenetic likelihood could potentially benefit from ProtPal's reconstruction profiles. For example, phylogenetic placement algorithms estimate taxonomic distributions by evaluating the relative likelihoods of placing sequence reads on tree branches [36]. By using sequence profiles exported from ProtPal, these reads could be placed with greater attention to indels and a more realistic accounting for alignment uncertainty. Homology detection could be done in a similar way, thereby making use of the phylogenetic relationship of the sequences within the reference family. It has been observed that the detection of positive selection is highly sensitive to the alignment used [7]. ProtPal could be modified to detect selection using entire profiles rather than single alignments, potentially eliminating the bias brought on by an inaccurate alignment.

In summary, multiple alignments are frequently constructed for use in downstream evolutionary analyses. However, except for our method and slow-performing MCMC methods, there are no software tools for reconstructing molecular evolutionary history that explicitly maximize a phylogenetic likelihood for indels. Our results strongly indicate that algorithms such as ProtPal (which use such a phylogenetic model) produce significantly more reliable estimates of evolutionary parameters, which we believe to be highly indicative of evolutionary accuracy. These results falsify previous assertions that existing, non-phylogenetic tools are well-suited to this purpose. Furthermore, we have demonstrated that it is possible to achieve such accuracy without sacrificing asymptotic guarantees on time/memory complexity, or resorting to expensive MCMC methods. ProtPal can reconstruct phylogenetic histories of entire databases on commodity hardware, enabling the large-scale study of evolutionary history in a consistent phylogenetic framework.

Methods

The details concerning generation and analysis of simulated data are contained in Text S1. A mathematically complete description of the alignment algorithm has been submitted as a separate work, and an early version has been made available online here: http://arxiv.org/abs/1103.4347.

Felsenstein's algorithm for indel models

Our algorithm may be viewed as a generalization of Felsenstein's pruning recursion [19], a widely-used algorithm in bioinformatics and molecular evolution. A few common applications of this algorithm include estimation of substitution rates [37]; reconstruction of phylogenetic trees [38]; identification of conserved (slow-evolving) or recently-adapted (fast-evolving) elements in proteins and DNA [39]; detection of different substitution matrix “signatures” (e.g. purifying vs diversifying selection at synonymous codon positions [40], hydrophobic vs hydrophilic amino acid signatures [41], CpG methylation in genomes [42], or basepair covariation in RNA structures [43]); annotation of structures in genomes [44], [45]; and placement of metagenomic reads on phylogenetic trees [36].

Felsenstein's algorithm computes $P(S|\mathcal{T},\theta)$ for a substitution model by tabulating intermediate probability functions of the form $G_n(x_n) = P(S_n|x_n)$, where $x_n$ represents the individual residue state of ancestral node $n$, and $S_n$ represents all the sequence data that is causally descended from node $n$ in the tree (i.e. the observed residues at the set of leaf nodes whose most recent common ancestor is node $n$).

The pruning recursion visits all nodes in postorder. Each $G_n$ function is computed in terms of the functions $G_l$ and $G_r$ of its immediate left and right children (assuming a binary tree):

$$G_n(x_n) = \begin{cases} \left( \sum_{x_l} M^{(l)}_{x_n x_l} G_l(x_l) \right) \left( \sum_{x_r} M^{(r)}_{x_n x_r} G_r(x_r) \right) & \text{if } n \text{ is not a leaf} \\ \delta(x_n = S_n) & \text{if } n \text{ is a leaf} \end{cases}$$

where $M^{(n)}_{ab} = P(x_n = b | x_m = a)$ is the probability that node $n$ has state $b$, given that its parent node $m$ has state $a$; and $\delta(x_n = S_n)$ is a Kronecker delta function terminating the recursion at the leaf nodes of the tree. These $G_n$ functions are often referred to as “messages” in the machine-learning literature [46].
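As an illustration of the pruning recursion described above, the following Python sketch tabulates the per-node conditional likelihoods for a single alignment column on a small binary tree. The Jukes-Cantor substitution model, the uniform root prior, the nested-tuple tree encoding and the function names (`jukes_cantor`, `prune`, `likelihood`) are assumptions made for this example only; they are not taken from the paper or from ProtPal.

```python
from math import exp

ALPHABET = "ACGT"

def jukes_cantor(t):
    """Assumed substitution model: M[a][b] = P(child = b | parent = a) on a branch of length t."""
    same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * exp(-4.0 * t / 3.0)
    return {a: {b: (same if a == b else diff) for b in ALPHABET} for a in ALPHABET}

def prune(node):
    """Return a dict x -> P(data below node | node has state x).

    A node is either a leaf (a one-letter string) or a pair
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    Children are processed before their parent, i.e. in postorder.
    """
    if isinstance(node, str):
        # Leaf: Kronecker delta on the observed residue.
        return {x: (1.0 if x == node else 0.0) for x in ALPHABET}
    (left, t_left), (right, t_right) = node
    g_left, g_right = prune(left), prune(right)
    m_left, m_right = jukes_cantor(t_left), jukes_cantor(t_right)
    return {
        x: sum(m_left[x][y] * g_left[y] for y in ALPHABET)
        * sum(m_right[x][y] * g_right[y] for y in ALPHABET)
        for x in ALPHABET
    }

def likelihood(tree):
    """Column likelihood under a uniform root distribution (an assumption of this sketch)."""
    g_root = prune(tree)
    return sum(0.25 * g_root[x] for x in ALPHABET)

# Hypothetical three-leaf tree: ((A:0.1, A:0.1):0.2, G:0.3)
cherry = (("A", 0.1), ("A", 0.1))
tree = ((cherry, 0.2), ("G", 0.3))
print(likelihood(tree))
```

Each call to `prune` returns one “message”; the product of the two child sums is the step that the transducer formulation later generalizes to whole sequences.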

Our new algorithm is algebraically equivalent to Felsenstein's algorithm, if the concept of a “substitution matrix” over a particular alphabet is extended to the countably-infinite set of all sequences over that alphabet. Our chosen class of “infinite substitution matrix” is one that has a finite representation: namely, the finite-state transducer, a probabilistic automaton that transforms an input sequence to an output sequence, and a familiar tool of statistical linguistics [47].

By generalizing the idea of matrix multiplication ($TU$) to two transducers ($T$ and $U$), and introducing a notation for feeding the same input sequence to two transducers in parallel ($T \circ U$), we are able to write Felsenstein's algorithm in a new form (see Section):

$$G_n = \begin{cases} (B_l G_l) \circ (B_r G_r) & \text{if } n \text{ is not a leaf} \\ \nabla(S_n) & \text{if } n \text{ is a leaf} \end{cases}$$

where $\nabla(S_n)$ is the transducer equivalent of the Kronecker delta $\delta(x_n = S_n)$. The function $G_n$ is now encapsulated by a transducer “profile” of node $n$.

This representation has complexity $O(L^N)$ for $N$ sequences of length $L$, which we reduce to $O(LN)$ by stochastic approximation of the $G_n$. This approximation relies on the alignment envelope [48], a data structure introduced by prior work on efficient alignment methods. The alignment envelope is a subset of all the possible histories in which most of the probability mass is concentrated. A related data structure is the partial order graph [21]. Both these data structures can be viewed as ensembles of possible histories, in contrast to a single “best-guess” reconstruction of the history. Figure 5 shows a state graph, with paths through it corresponding to histories relating the two sequences GL and GIV. The paths highlighted in blue form a partial order graph, corresponding to a subset of these histories generated by a stochastic traceback. At each progressive traversal step, we sample a high-probability subset of alignments of two sibling profiles in order to maintain a bound on the state space size. Note that if we sample only the most likely path at every internal node, we essentially recover the progressive algorithm of PRANK, and if we sample and store all solutions, we recover the machine $G_0$ with state space of size $O(L^N)$.

Figure 5
Each path through this state graph represents a possible evolutionary history relating sequences GL and GIV.

Transducer definitions and lemmas

The definitions and lemmas are presented in a condensed form here, and expanded upon in [49].

A transducer is a tuple $(\Omega_I, \Omega_O, \Phi, \phi_S, \phi_E, \tau, W)$ where $\Omega_I$ is an input alphabet, $\Omega_O$ is an output alphabet, $\Phi$ is a set of states, $\phi_S \in \Phi$ is the start state, $\phi_E \in \Phi$ is the end state, $\tau \subseteq \Phi \times (\Omega_I \cup \{\epsilon\}) \times (\Omega_O \cup \{\epsilon\}) \times \Phi$ is the transition relation, and $W: \tau \to [0, \infty)$ is the transition weight function.

Suppose that $T = (\Omega_I, \Omega_O, \Phi, \phi_S, \phi_E, \tau, W)$ and $U = (\Omega'_I, \Omega'_O, \Phi', \phi'_S, \phi'_E, \tau', W')$ are transducers.

Let $W(\pi)$ be the product of all transition weights along a state path $\pi$ and let $W(x:[T]:y)$ be the sum of such weights for all paths whose input labels, concatenated, yield the string $x$ and whose output labels yield $y$.
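To make the path-sum definition concrete, here is a small Python sketch that computes the total weight of all state paths with a given input string and output string, using dynamic programming over (input position, output position, state). The toy transducer (copy with weight 0.8, substitute with weight 0.1, delete with weight 0.1, over the alphabet {a, b}), the tuple encoding of transitions, and the requirement that states be listed in an order compatible with transitions that neither read nor write a symbol are all assumptions of this example, not part of the paper.

```python
from collections import defaultdict

def sequence_weight(states, start, end, transitions, x, y):
    """Sum, over all state paths from start to end that read x and write y,
    of the product of the transition weights along the path.

    transitions: list of (source, input_symbol, output_symbol, target, weight);
    a symbol is a single character, or "" when nothing is read or written.
    states: ordered so that every transition with two empty labels goes from
    an earlier state to a later one (an assumption of this sketch).
    """
    # partial[(i, j, s)]: weight of paths reaching s having read x[:i] and written y[:j]
    partial = defaultdict(float)
    partial[(0, 0, start)] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for s in states:
                w = partial[(i, j, s)]
                if w == 0.0:
                    continue
                for src, a, b, dst, weight in transitions:
                    if src != s:
                        continue
                    if a and x[i:i + 1] != a:
                        continue  # input label does not match the next symbol of x
                    if b and y[j:j + 1] != b:
                        continue  # output label does not match the next symbol of y
                    partial[(i + len(a), j + len(b), dst)] += w * weight
    return partial[(len(x), len(y), end)]

# Hypothetical toy transducer: each input symbol is copied, substituted or deleted.
STATES = ["S", "E"]
TRANSITIONS = [("S", "", "", "E", 1.0)]
for c, other in (("a", "b"), ("b", "a")):
    TRANSITIONS += [
        ("S", c, c, "S", 0.8),      # copy
        ("S", c, other, "S", 0.1),  # substitute
        ("S", c, "", "S", 0.1),     # delete
    ]

print(sequence_weight(STATES, "S", "E", TRANSITIONS, "ab", "a"))
```

In this toy machine two distinct paths turn the input `ab` into the output `a` (copy then delete, or delete then substitute), so the sequence weight is a genuine sum over paths rather than the weight of a single path.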

Equivalence: If $T$ and $U$ have the same input and output alphabets ($\Omega_I = \Omega'_I$ and $\Omega_O = \Omega'_O$) and the same sequence weights $W(x:[T]:y) = W(x:[U]:y)\ \forall x, y$, then we say the transducers are equivalent, $T \equiv U$. Less formally, we will write $T = U$ if $T \equiv U$.

Moore transducers: The Moore normal form for transducers, named for Moore machines [50], associates input/output with three distinct types of state: Match, Insert and Delete. Paths through Moore transducers can be associated with (gapped) pairwise alignments of input and output sequences. For any transducer $T$, there exists an equivalent Moore-normal form transducer $T'$ with $|\Phi'| = O(|\tau|)$ and $|\tau'| = O(|\tau|)$.

Composition: If $T$'s output alphabet is the same as $U$'s input alphabet ($\Omega_O = \Omega'_I$), there exists a transducer, $TU = (\Omega_I, \Omega'_O, \Phi'', \phi''_S, \phi''_E, \tau'', W'')$, that unifies the output of $T$ with the input of $U$, such that $\forall x \in \Omega_I^*, z \in (\Omega'_O)^*$:

$$W(x:[TU]:z) = \sum_{y \in \Omega_O^*} W(x:[T]:y) \, W(y:[U]:z)$$
(2)

If $T$ and $U$ are in Moore form, then $|\Phi''| \le |\Phi| |\Phi'|$ and $|\tau''| \le |\tau| |\tau'|$.

Intersection: If $T$ and $U$ have the same input alphabets ($\Omega_I = \Omega'_I$), there exists a transducer, $T \circ U = (\Omega_I, \Omega''_O, \Phi'', \phi''_S, \phi''_E, \tau'', W'')$, that unifies the input of $T$ with the input of $U$. The output alphabet is $\Omega''_O = (\Omega_O \cup \{\epsilon\}) \times (\Omega'_O \cup \{\epsilon\})$, i.e. a $T$-output symbol (or a gap) aligned with a $U$-output symbol (or a gap).

Let $\mathcal{G}(y, z)$ denote the set of all gapped pairwise alignments of sequences $y$ and $z$. Transducer $T \circ U$ has the property that $\forall x \in \Omega_I^*, y \in \Omega_O^*, z \in (\Omega'_O)^*$:

$$\sum_{a \in \mathcal{G}(y, z)} W(x:[T \circ U]:a) = W(x:[T]:y) \, W(x:[U]:z)$$
(3)

If $T$ and $U$ are in Moore form, then $|\Phi''| \le |\Phi| |\Phi'|$ and $|\tau''| \le |\tau| |\tau'|$. Paths through $T \circ U$ are associated with three-way alignments of the input sequence to the two output sequences.

Identity: Let $\mathcal{I}$ be a transducer that copies input to output unmodified, so $W(x:[\mathcal{I}]:y) = \delta(x = y)$.

Exact match: For any sequence $S$, there exists a Moore-form transducer $\nabla(S)$ with $|\Phi| = O(|S|)$ and $|\tau| = O(|S|)$, that rejects all input except $S$, such that $W(x:[\nabla(S)]:\epsilon) = 1$ if $x = S$, and $W(x:[\nabla(S)]:\epsilon) = 0$ if $x \neq S$. Note that $\nabla(S)$ outputs nothing (the empty string).
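The exact-match recognizer can be pictured as a simple chain of states, one per position in the sequence. The Python sketch below is a deliberately simplified stand-in: it uses one state per co-ordinate rather than a full Moore-form machine, and the names `exact_match` and `recognition_weight` are hypothetical. It is meant only to show the accept-exactly-one-string behaviour and the linear growth of the state space with sequence length.

```python
def exact_match(seq):
    """Build a chain recognizer for seq.

    State i means "the first i symbols of seq have been read". The only
    transition out of state i reads seq[i], has weight 1 and writes nothing.
    Returns (transitions, start_state, end_state).
    """
    transitions = {(i, symbol): (i + 1, 1.0) for i, symbol in enumerate(seq)}
    return transitions, 0, len(seq)

def recognition_weight(machine, x):
    """Weight assigned to input x: 1.0 if x equals the encoded sequence, else 0.0."""
    transitions, state, end = machine
    weight = 1.0
    for symbol in x:
        if (state, symbol) not in transitions:
            return 0.0  # no transition reads this symbol here: input rejected
        state, w = transitions[(state, symbol)]
        weight *= w
    return weight if state == end else 0.0

giv = exact_match("GIV")
print(recognition_weight(giv, "GIV"), recognition_weight(giv, "GL"))
```

Because each state records how many symbols have been consumed, every state carries a position in the sequence, which is the property the alignment envelope relies on.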

Chapman-Kolmogorov transducers: A transducer $T$ is probabilistic if $W(x:[T]:y)$ represents a probability $P(y|x;T)$: that is, for any given input string, $x \in \Omega_I^*$, it defines a probability measure on output strings, $y \in \Omega_O^*$.

Suppose $B(t)$ is a function returning a probabilistic transducer of the form $(\Omega, \Omega, \Phi, \phi_S, \phi_E, \tau, W_t)$, i.e. a transducer whose transition weight $W_t$ depends on an additional time parameter, $t$, and which satisfies the transducer equivalence $B(t) B(t') \equiv B(t + t')$.

Then $B(t)$ gives the finite-time transition probabilities of a homogeneous continuous-time Markov process on the strings $\Omega^*$, as the above transducer equivalence is a form of the Chapman-Kolmogorov equation.

If the state space of $B(t)$ is finite, then this equation describes a renormalization of the composed state space $\Phi \times \Phi$ back down to the original state space $\Phi$. So far, only one nontrivial time-dependent transducer is known that solves this equation exactly using a finite number of states: the TKF91 model [51].
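The simplest instance of the Chapman-Kolmogorov property is a finite substitution model acting on a single residue, where composition reduces to ordinary matrix multiplication. The sketch below checks numerically that Jukes-Cantor matrices for branch lengths 0.3 and 0.5 multiply to the matrix for branch length 0.8. The choice of Jukes-Cantor and the helper names are assumptions for illustration; the check concerns substitution matrices only and says nothing about indel models such as TKF91.

```python
from math import exp

def jukes_cantor(t):
    """4x4 Jukes-Cantor transition matrix for elapsed time t (assumed example model)."""
    decay = exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def matmul(p, q):
    """Plain matrix product of two square matrices stored as lists of rows."""
    size = len(p)
    return [
        [sum(p[a][k] * q[k][b] for k in range(size)) for b in range(size)]
        for a in range(size)
    ]

def max_abs_diff(p, q):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(u - v) for row_p, row_q in zip(p, q) for u, v in zip(row_p, row_q))

composed = matmul(jukes_cantor(0.3), jukes_cantor(0.5))  # evolve for 0.3, then 0.5
direct = jukes_cantor(0.8)                               # evolve for 0.8 in one step
print(max_abs_diff(composed, direct))
```

For substitution matrices the composed state space is no larger than the original, so no renormalization is needed; the difficulty discussed in the text arises only once insertions and deletions are allowed.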

The phylogenetic likelihood

We rewrite the evidence, $P(S|\mathcal{T},\theta)$, for sequences $S$, tree $\mathcal{T}$, and parameters $\theta$, in the form $P(S|R,B)$ where $S = \{S_n\}$ denotes the set of sequences observed at leaf nodes, $B = \{B_n\}$ denotes the stochastic evolutionary processes occurring on the branches, and $R$ denotes the probabilistic model for the sequence at the root node of the tree.

The root and branch transducers $(R, B)$ represent an alternative view of the tree and parameters $(\mathcal{T}, \theta)$. The root transducer $R$ outputs from the equilibrium or other initial distribution of the process. If $(m, n)$ is a parent-child pair, then $B_n$ is a time-dependent transducer parameterized by the branch length. In practice, the branch transducers need not satisfy the Chapman-Kolmogorov equation for the following constructs to be of use; for example, the $B_n$ might be approximations to true Chapman-Kolmogorov transducers [52].

Let $R$ be a transducer outputting sequences sampled from the prior at the phylogenetic root.

Let $n$ be a tree node. If $n$ is a leaf, define $F_n = \mathcal{I}$. Otherwise, let $(l, r)$ denote the left and right child nodes, and define $F_n = (B_l F_l) \circ (B_r F_r)$ where $B_n$ is a transducer modeling the evolution on the branch leading to $n$.

Diagrammatically we can write $F_n$ as (.. (.$B_l$. (.$F_l$.)) (.$B_r$. (.$F_r$.)))

The phylogenetic likelihood is then fully described by $F_0 = R F_{\mathrm{root}}$.

Like $R$, transducer $F_0$ models a probability distribution over output sequences, but accepts only the empty string as an input sequence. This empty input sequence is just a technical formality (transducers must have inputs); if we ignore it, we can think of $R$ and $F_0$ as hidden Markov models (HMMs), rather than transducers. $R$ is an HMM that generates a single sequence, $F_0$ a multi-sequence HMM that generates the whole set of leaf sequences.

Inference with HMMs often uses a dynamic programming matrix (e.g. the Forward matrix) to track the ways that a given evidential sequence can be produced by a given grammar.

For our purposes it is useful to introduce the evidence in a different way, by transforming the model to incorporate the evidence directly. We augment the state space so that the model is no longer capable of generating any sequences except the observed $S$, by composing $F_0$'s forked outputs with exact-match transducers that will only accept the observed sequences at the leaves of the tree. This yields a model, $G_0$, whose state space is of size $O(L^N)$ and, in fact, is directly analogous to the Forward matrix.

If $n$ is a leaf node, then let $G_n = \nabla(S_n)$ where $S_n$ is the sequence at $n$. Otherwise, $G_n = (B_l G_l) \circ (B_r G_r)$.

Diagrammatically we can write $G_n$ as (.. (.$B_l$..$G_l$.) (.$B_r$. .$G_r$.))

Let $G_0 = R G_{\mathrm{root}}$. The evidence is $P(S|\mathcal{T},\theta) = W(\epsilon:[G_0]:\epsilon)$.

The net output of $G_0$ is always the empty string. The sequences $S_n$ are recognized as inputs by the $\nabla(S_n)$ transducers at the tips of the tree, but are not passed on as outputs themselves.

Likewise, the input of $G_0$ is the empty string, because $R$ accepts only the empty string on its input.

We can think of $G_0$ as a Markov model, rather than an HMM. It has no input or output; rather, the sequences are encoded into its structure.

Transducer $G_0$ has $O(L^N)$ states, which is impractically many, so ProtPal uses a progressive hierarchy $H_n$ of approximations to the corresponding $G_n$, with state spaces that are bounded in size.

If $n$ is a leaf node, let $H_n = \nabla(S_n)$. Otherwise, let $H_n = (B_l E_l) \circ (B_r E_r)$ where $\Phi(E_n) \subseteq \Phi(H_n)$ is a subset defined by sampling complete paths through the Markov model $M_n = R H_n$ and adding the $H_n$-states used by those paths to $\Phi(E_n)$, until the pre-specified bound on $|\Phi(E_n)|$ is reached. Then [formula pone.0034572.e230].
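The sampling step described above can be sketched on a generic weighted acyclic state graph. The Python example below samples complete start-to-end paths in proportion to their weight (using backward path sums) and accumulates the states they visit until a size bound is hit. The graph encoding, the function name `sample_state_subset`, the `max_draws` cap and the rule of always keeping the first sampled path are assumptions of this sketch, not details of ProtPal.

```python
import random

def sample_state_subset(edges, order, start, end, max_states, max_draws, rng):
    """Collect the states visited by paths sampled in proportion to path weight.

    edges: dict mapping state -> list of (next_state, weight).
    order: all states in topological order (start first, end last).
    The first sampled path is always kept; later paths are added only while
    the subset stays within max_states.
    """
    # backward[s]: total weight of all paths from s to end
    backward = {end: 1.0}
    for s in reversed(order):
        if s != end:
            backward[s] = sum(w * backward[d] for d, w in edges.get(s, []))

    kept = set()
    for _ in range(max_draws):
        path, s = [start], start
        while s != end:
            targets = [d for d, _ in edges[s]]
            # Weight each step by the total weight of the ways to finish from there.
            weights = [w * backward[d] for d, w in edges[s]]
            s = rng.choices(targets, weights=weights)[0]
            path.append(s)
        if kept and len(kept | set(path)) > max_states:
            break  # adding this path would exceed the bound
        kept |= set(path)
        if len(kept) >= max_states:
            break
    return kept

# Hypothetical diamond-shaped graph with one likely and one unlikely route.
EDGES = {"start": [("a", 0.9), ("b", 0.1)], "a": [("end", 1.0)], "b": [("end", 1.0)]}
ORDER = ["start", "a", "b", "end"]
print(sample_state_subset(EDGES, ORDER, "start", "end", 3, 1000, random.Random(0)))
```

With a bound of three states only one route through the diamond is retained, most often the high-weight one; raising the bound lets the lower-weight alternative into the retained subset as well.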

The likelihood of a given history may be calculated by summing over paths through $G_0$ consistent with that history. In the simplest cases (e.g. minimal Moore-form branch transducers), each indel history corresponds to exactly one path, so the MAP indel history corresponds to the maximum-weight state path through $G_0$.

Alignment envelopes

Let $\nabla(S)$ be defined such that it has only one nonzero-weighted path

$$W_0 \to M_1 \to W_1 \to M_2 \to W_2 \to \cdots \to W_{|S|-1} \to M_{|S|} \to W_{|S|}$$

so a $\nabla(S)$-state is either the start state ($W_0$), the end state ($W_{|S|}$), a wait state ($W_i$) or a match state ($M_i$). All these states have the form $\phi_i$ where $i$ represents the number of symbols of $S$ that have to be read in order to reach that state, i.e. a “co-ordinate” into $S$. All $\nabla(S)$-states are labeled with such co-ordinates, as are the states of any transducer that is a composition involving $\nabla(S)$, such as $G_n$ or $H_n$.

For example, in a simple case involving a root node (1) with two children (2,3) whose sequences are constrained to be $S_2, S_3$, the evidence transducer is $G_0 = R((B_2 \nabla(S_2)) \circ (B_3 \nabla(S_3)))$ = (.$R$. (.$B_2$..$\nabla(S_2)$.) (.$B_3$..$\nabla(S_3)$.))

All states of $G_0$ have the form $g = (r, b_2, \phi_2 i_2, b_3, \phi_3 i_3)$ where $\phi_2, \phi_3 \in \{W, M\}$, so $\phi_2 i_2 \in \{W_{i_2}, M_{i_2}\}$ and similarly for $\phi_3 i_3$. Thus, each state in $G_0$ is associated with a co-ordinate pair $(i_2, i_3)$ into $(S_2, S_3)$, as well as a state-type pair $(\phi_2, \phi_3)$.

Let $n$ be a node in the tree, let $\mathcal{L}_n$ be the set of indices of leaf nodes descended from $n$, and let $G_n$ be the phylogenetic transducer for the subtree rooted at $n$, defined in Section. Let $\Phi_n$ be the state space of $G_n$.

If $m \in \mathcal{L}_n$ is a leaf node descended from $n$, then $G_n$ includes, as a component, the transducer $\nabla(S_m)$. Any $G_n$-state, $g \in \Phi_n$, is a tuple, one element of which is a $\nabla(S_m)$-state, $\phi_i$, where $i$ is a co-ordinate (into sequence $S_m$) and $\phi$ is a state-type. Define $i_m(g)$ to be the co-ordinate and $\phi_m(g)$ to be the corresponding state-type.

Let An external file that holds a picture, illustration, etc.
Object name is pone.0034572.e284.jpg be the function returning the set of absorbing leaf indices for a state, such that the existence of a finite-weight transition An external file that holds a picture, illustration, etc.
Object name is pone.0034572.e285.jpg implies that An external file that holds a picture, illustration, etc.
Object name is pone.0034572.e286.jpg for all An external file that holds a picture, illustration, etc.
Object name is pone.0034572.e287.jpg.

Let $(l, r)$ be two sibling nodes. The alignment envelope is the set of sibling state-pairs from $\Phi_l$ and $\Phi_r$ that can be aligned. The function $\mathcal{E} : \Phi_l \times \Phi_r \to \{0, 1\}$ indicates membership of the envelope. For example, this basic envelope allows only sibling co-ordinates separated by a distance $s$ or less, where $i_m(f)$ denotes the co-ordinate of state $f$ into leaf sequence $S_m$:

$$\mathcal{E}_{\mathrm{basic}}(f, g) = 1 \iff \max_{m \in L_l,\, n \in L_r} \left| i_m(f) - i_n(g) \right| \le s \qquad (4)$$
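The basic envelope can be sketched in a few lines of Python. This is a minimal illustration, not the ProtPal implementation: the function name `basic_envelope` and the representation of a state as a dictionary mapping leaf index to sequence co-ordinate are assumptions made here for clarity.

```python
from itertools import product


def basic_envelope(left_coords, right_coords, s):
    """Return True if a pair of sibling states lies inside the basic envelope.

    left_coords / right_coords (hypothetical representation): dicts mapping
    each leaf index below the left / right sibling to that state's
    co-ordinate into the corresponding leaf sequence.
    s: maximum allowed separation between any left and any right co-ordinate.
    """
    return all(
        abs(i - j) <= s
        for i, j in product(left_coords.values(), right_coords.values())
    )
```

For instance, `basic_envelope({0: 5, 1: 7}, {2: 6}, 2)` accepts the pair, since every left co-ordinate is within 2 of the right co-ordinate, whereas replacing 7 with 10 would exclude it.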

An alignment envelope can be based on a guide alignment. For leaf nodes $x, y$ and $1 \le i \le |S_x|$, let $G(x, i, y)$ be the number of residues of sequence $S_y$ in the section of the guide alignment from the first column, up to and including the column containing residue $i$ of sequence $S_x$.

This envelope excludes a pair of sibling states $(f, g)$ if they include a homology between residues $i_m(f)$ and $i_n(g)$ of leaf sequences $S_m$ and $S_n$ which is more than $s$ from the homology of those characters contained in the guide alignment:

$$\mathcal{E}_{\mathrm{guide}}(f, g) = 1 \iff \max_{m \in L_l,\, n \in L_r} \max\left( \left| G(m, i_m(f), n) - i_n(g) \right|,\ \left| G(n, i_n(g), m) - i_m(f) \right| \right) \le s \qquad (5)$$
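As a rough sketch of the guide-alignment envelope, the Python below computes the residue count described above and applies the distance test in both directions. The names `guide_count` and `guide_envelope`, the use of `-` as the gap character, and the dictionary-of-gapped-strings representation of the guide alignment are all assumptions for illustration, and residue indices are taken to be 1-based.

```python
def guide_count(guide, x, i, y):
    """Count residues of row y in guide columns up to and including the
    column that holds residue i (1-based) of row x.

    guide: dict mapping leaf name -> gapped row ('-' marks a gap); all rows
    are assumed to have equal length.
    """
    if i == 0:
        return 0  # before the first residue of x, nothing of y is counted
    seen_x = 0
    seen_y = 0
    for cx, cy in zip(guide[x], guide[y]):
        if cy != '-':
            seen_y += 1
        if cx != '-':
            seen_x += 1
            if seen_x == i:
                return seen_y
    raise ValueError("row %r has fewer than %d residues" % (x, i))


def guide_envelope(guide, left_coords, right_coords, s):
    """True if every left/right co-ordinate pair stays within s residues of
    the homology implied by the guide alignment (checked in both directions)."""
    for m, i in left_coords.items():
        for n, j in right_coords.items():
            if abs(guide_count(guide, m, i, n) - j) > s:
                return False
            if abs(guide_count(guide, n, j, m) - i) > s:
                return False
    return True
```

With the toy guide alignment `{"a": "AC-GT", "b": "A-TGT"}`, the third residue of each row sits in the same column, so the pair of co-ordinates (3, 3) is accepted even with `s = 0`.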

Let $K(x, i, y, j)$ be the number of match columns (those columns of the guide alignment in which both $x$ and $y$ have a non-gap character) between the column containing residue $i$ of sequence $S_x$ and the column containing residue $j$ of sequence $S_y$. This envelope excludes a pair of sibling states $(f, g)$ if they include a homology between residues $i_m(f)$ and $i_n(g)$ of leaf sequences $S_m$ and $S_n$ which is more than $s$ matches from the homology of those characters contained in the guide alignment:

$$\mathcal{E}_{\mathrm{match}}(f, g) = 1 \iff \max_{m \in L_l,\, n \in L_r} K(m, i_m(f), n, i_n(g)) \le s$$
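The match-column distance can be illustrated with a short Python sketch. The function names and the gapped-string representation are hypothetical, and the text leaves open whether the end columns are counted; here the count covers columns strictly after the earlier column up to and including the later one, so two residues aligned in the same guide column are at distance zero.

```python
def column_of(row, i):
    """0-based column index of the i-th (1-based) residue in a gapped row."""
    seen = 0
    for col, ch in enumerate(row):
        if ch != '-':
            seen += 1
            if seen == i:
                return col
    raise ValueError("row has fewer than %d residues" % i)


def match_columns_between(guide, x, i, y, j):
    """Number of match columns (both rows non-gap) separating residue i of
    row x from residue j of row y in the guide alignment.

    Assumption: the interval is half-open, (earlier column, later column].
    """
    cx = column_of(guide[x], i)
    cy = column_of(guide[y], j)
    lo, hi = sorted((cx, cy))
    return sum(
        1
        for c in range(lo + 1, hi + 1)
        if guide[x][c] != '-' and guide[y][c] != '-'
    )
```

A pair of sibling states would then be kept in the envelope when this count is at most $s$ for every pair of leaves, one from each sibling subtree. Counting matches rather than raw residues makes the envelope less sensitive to long gaps in the guide alignment.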

OPTIC data analysis

Data

Amniote gene families were downloaded from http://genserv.anat.ox.ac.uk/downloads/clades/. We restricted our analysis to the approximately 7,500 families having simple 1:1 orthologies. The same species tree topology (downloaded from http://genserv.anat.ox.ac.uk/clades/amniota/displayPhylogeny) was used for all reconstructions, though branch lengths were estimated separately for each family as part of OPTIC. When computing branch-specific indel rates, the branch lengths of the species tree were used.

Reconstruction and rate estimation

Gene families were aligned and reconstructed using ProtPal with a 3-rate class Markov chain over amino acids, insertion and deletion rates set to 0.01, and 250 traceback samples. Averaged and per-branch indel rates were computed with ProtPal using the -pi and -pb options. The indel rates were then normalized by the synonymous substitution rate for each corresponding nucleotide alignment (taken directly from OPTIC), computed with PAML [53]. Residues' origins were determined by finding the tree node closest to the root containing a non-gap reconstructed character.
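The residue-origin rule in the last sentence can be sketched as follows. This is an illustrative Python fragment rather than the code used in the analysis: the tree is assumed to be given as a child-to-parent dictionary, and each alignment column as a dictionary from node name to reconstructed character, with `-` for a gap.

```python
def depth(node, parent):
    """Number of edges between node and the root (the root's parent is None)."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def residue_origin(column, parent):
    """Return the node closest to the root that holds a non-gap character
    in this column, or None if every node is gapped.

    If several nodes tie for minimum depth, the first one in the column's
    iteration order is returned.
    """
    present = [node for node, ch in column.items() if ch != '-']
    if not present:
        return None
    return min(present, key=lambda node: depth(node, parent))
```

For a column in which the root and one outgroup leaf are gapped but an internal ancestor and its descendants carry residues, the function returns that internal ancestor, which corresponds to placing the insertion on the branch leading to it.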

External data

Genes were mapped to Gene Ontology terms via the mapping downloaded from http://www.ebi.ac.uk/GOA/human_release.html in October 2010. Indel SNPs per gene were taken from Supplemental Table 5 of [54].

Supporting Information

Text S1

Contains technical details concerning generation of simulation data and analysis of OPTIC data, as well as figures pertaining to both simulated and OPTIC data.

(PDF)

Acknowledgments

The authors thank Erick Matsen and Lars Barquist for insightful comments on the manuscript.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: Authors OW and IHH were supported by National Institutes of Health/National Human Genome Research Institute grant R01-GM076705. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Qu X, Swanson R, Day R, Tsai J. A guide to template based structure prediction. Curr Protein Pept Sci. 2009;10:270–85. [PubMed]
2. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biology. 2004;5 [PMC free article] [PubMed]
3. Pollard KS, Salama SR, Lambert N, Lambot M, Coppens S, et al. An RNA gene expressed during cortical development evolved rapidly in humans. Nature. 2006;443:167–172. [PubMed]
4. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research. 1999;27:2682–2690. [PMC free article] [PubMed]
5. Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319:473–6. [PubMed]
6. Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–1635. [PubMed]
7. Markova-Raina P, Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Research. 2011;21:863–874. [PMC free article] [PubMed]
8. Nelesen S, Liu K, Zhao D, Linder CR, Warnow T. The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses. Pacific Symposium on Biocomputing. 2008;2008:25–36. [PubMed]
9. Liu K, Nelesen S, Raghavan S, Linder CR, Warnow T. Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy. IEEE/ACM Trans Comput Biol Bioinform. 2009;6:7–21. [PubMed]
10. ENCODE Project Consortium. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Research. 2007;17:760–774. [PMC free article] [PubMed]
11. Bradley RK, Uzilov AV, Skinner ME, Bendana YR, Barquist L, et al. Evolutionary modeling and prediction of non-coding RNAs in Drosophila. PLoS ONE. 2009;4:e6478. [PMC free article] [PubMed]
12. Strope C, Abel K, Scott S, Moriyama E. Biological sequence simulation for testing complex evolutionary hypotheses: indel-seq-gen version 2.0. Mol Biol Evol. 2009;26:2581–93. [PMC free article] [PubMed]
13. Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics. 2001;17:803–820. [PubMed]
14. Suchard MA, Redelings BD. BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–2048. [PubMed]
15. Löytynoja A, Goldman N. An algorithm for progressive multiple alignment of sequences with insertions. Proceedings of the National Academy of Sciences of the USA. 2005;102:10557–62. [PMC free article] [PubMed]
16. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113. [PMC free article] [PubMed]
17. Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. [PubMed]
18. Katoh K, Kuma K, Toh H, Miyata T. Mafft version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research. 2005;33:511–518. [PMC free article] [PubMed]
19. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376. [PubMed]
20. Higgins DG, Bleasby AJ, Fuchs R. CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences. 1992;8:189–191. [PubMed]
21. Lee C, Grasso C, Sharlow M. Multiple sequence alignment using partial order graphs. Bioinformatics. 2002;18:452–464. [PubMed]
22. Cartwright RA. DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics. 2005;21(Suppl 3):iii31–8. [PubMed]
23. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, et al. Fast statistical alignment. PLoS Computational Biology. 2009;5:e1000392. [PMC free article] [PubMed]
24. Kamneva O, Liberles A, Ward N. Genome-wide influence of indel substitutions on evolution of bacteria of the PVC superphylum, revealed using a novel computational method. Genome Biology and Evolution. 2010;2:870–886. [PMC free article] [PubMed]
25. Zhang Z, Huang J, Wang Z, Wang L, Peiji G. Impact of indels on the flanking regions in structural domains. Molecular Biology and Evolution. 2011;28:291–301. [PubMed]
26. Zhu L, Wang Q, Tang P, Araki H, Tian D. Genomewide association between insertions/deletions and the nucleotide diversity in bacteria. Molecular Biology and Evolution. 2009;26:2353–2361. [PubMed]
27. Gomez-Valero L, Latorre A, Gil R, Gadau J, Feldhaar H, et al. Patterns and rates of nucleotide substitution, insertion and deletion in the endosymbiont of ants Blochmannia floridanus. Molecular Ecology. 2008;17:4382–4392. [PubMed]
28. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203–218. [PubMed]
29. Lunter G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007;23:289–296. [PubMed]
30. Heger A, Ponting C. OPTIC: orthologous and paralogous transcripts in clades. NAR. 2008;36:267–270. [PMC free article] [PubMed]
31. de la Chaux N, Messer P, Arndt P. DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage. BMC Evolutionary Biology. 2007;7 [PMC free article] [PubMed]
32. Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, et al. Systematic analysis of insertions and deletions specific to nematode proteins and their proposed functional and evolutionary relevance. BMC Evol Biol. 2009;9 [PMC free article] [PubMed]
33. Saccone S, Quan J, Mehta G, Bolze R, Thomas P, et al. New tools and methods for direct programmatic access to the dbSNP relational database. Nucleic Acids Res 2011 [PMC free article] [PubMed]
34. Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004;20:1464–1465. [PubMed]
35. Initial sequencing and comparative analysis of the mouse genome. Nature 2002
36. Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11:538. [PMC free article] [PubMed]
37. Yang Z. Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution. 1994;39:105–111. [PubMed]
38. Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution. 1996;43:304–311. [PubMed]
39. Siepel A, Haussler D. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology. 2004;11:413–428. [PubMed]
40. Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155:432–449. [PMC free article] [PubMed]
41. Thorne JL, Goldman N, Jones DT. Combining protein evolution and secondary structure. Molecular Biology and Evolution. 1996;13:666–673. [PubMed]
42. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Molecular Biology and Evolution. 2004;21:468–488. [PubMed]
43. Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999;15:446–454. [PubMed]
44. Siepel A, Haussler D. Bourne P, Gusfield D, editors. Computational identification of evolutionarily conserved exons. 2004. pp. 177–186. Proceedings of the eighth annual international conference on research in computational molecular biology, San Diego, March 27–31 2004. ACM.
45. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, et al. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Computational Biology. 2006;2:e33. [PMC free article] [PubMed]
46. Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory. 1998;47:498–519.
47. Mohri M, Pereira F, Riley M. Weighted finite-state transducers in speech recognition. Computer Speech and Language. 2002;16:69–88.
48. Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, et al. Genomewide nucleotide-level mammalian ancestor reconstruction. Genome Research. 2008;18:1829–1843. [PMC free article] [PubMed]
49. Westesson O, Lunter G, Paten B, Holmes I. An alignment-free generalization to indels of Felsenstein's phylogenetic pruning algorithm. arXiv 2011
50. Moore EF. Gedanken-experiments on Sequential Machines. Princeton, N.J.: Princeton University Press, volume 34 of Annals of Mathematical Studies, chapter 5; 1956. pp. 129–153.
51. Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution. 1991;33:114–124. [PubMed]
52. Miklós I, Lunter G, Holmes I. A long indel model for evolutionary sequence alignment. Molecular Biology and Evolution. 2004;21:529–540. [PubMed]
53. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution. 2007;24:1586–1591. [PubMed]
54. Mills R, Luttig C, Larkins C, Beauchamp A, Tsui C, et al. An initial map of insertion and deletion (indel) variation in the human genome. Genome Research. 2006;16 [PMC free article] [PubMed]
55. Sinha S, Siggia E. Sequence turnover and tandem repeats in cis-regulatory modules in Drosophila. MBE. 2005;22 [PubMed]

Articles from PLoS ONE are provided here courtesy of Public Library of Science