Proc Natl Acad Sci U S A. Dec 9, 2003; 100(25): 14960–14965.
Published online Dec 1, 2003. doi:  10.1073/pnas.2036252100
PMCID: PMC299860
Evolution

Recursions for statistical multiple alignment

Abstract

Algorithms are presented that allow the calculation of the probability of a set of sequences related by a binary tree that have evolved according to the Thorne–Kishino–Felsenstein model for a fixed set of parameters. The algorithms are based on a Markov chain generating sequences and their alignment at nodes in a tree. Depending on whether the complete realization of this Markov chain is decomposed into the first transition and the rest of the realization or the last transition and the first part of the realization, two kinds of recursions are obtained that are computationally similar but probabilistically different. The running time of the algorithms is $O(\prod_{i=1}^{d} L_i)$, where Li is the length of the ith observed sequence and d is the number of sequences. An alternative recursion is also formulated that uses only a Markov chain involving the inner nodes of a tree.

Keywords: backward recursion, emission probability, forward recursion, hidden Markov chain, states

Proteins and DNA sequences evolve predominantly by substitutions, insertions, and deletions of single letters or strings of these elements, where a letter is either a nucleotide or an amino acid. During the last two decades, the analysis of the substitution process has improved considerably and has been based increasingly on stochastic models. The process of insertions and deletions has not received the same attention and is presently being analyzed by optimization techniques, for instance maximizing a similarity score as first used by Needleman and Wunsch (1).

Thorne, Kishino, and Felsenstein (2) proposed a well defined time-reversible Markov model for insertions and deletions [denoted more briefly as the Thorne, Kishino, and Felsenstein (TKF) model] that allowed a proper statistical analysis for two sequences. Such an analysis can be used to provide maximum likelihood sequence alignments for pairs of sequences or to estimate the evolutionary distance between two sequences. Recently, an algorithm was presented by Steel and Hein (3) that allows statistical alignment of sequences related by a star-shaped tree, a tree with one inner node. Hein (4) formulated an algorithm that calculates the probability of observing a set of sequences related by a given tree in time $O((\prod_i L_i)^2)$, where Li is the length of the ith sequence. This is also the time required by Steel and Hein's algorithm (3). Holmes and Bruno (5) used the algorithm by Hein (4) to design a Gibbs sampler that has the potential of analyzing a higher number of sequences than the exact algorithms. The present work accelerates, extends, and formalizes the algorithm in ref. 4. In particular, the time requirement for the algorithm presented here is reduced to $O(\prod_i L_i)$.

The TKF model is formulated in terms of links and associated letters. To each link is associated a letter that undergoes changes, independently of other letters, according to a reversible substitution process (identical to the site-substitution process, where insertions and deletions are not allowed). A link and its associated letter are deleted after an exponentially distributed waiting time with mean 1/μ. While a link is present, it gives rise to new links at the rate λ. A new link is placed immediately to the right of the link from which it originated, and the associated letter is chosen from the stationary distribution of the substitution process. At the left of the sequence is a so-called immortal link that never dies and gives rise to new links at the rate λ, preventing the process from becoming extinct.

For the TKF model on a tree, the defining parameters are the death rate μ and the birth rate λ, as described above, together with a time parameter τ for each edge of the tree. The time parameter τ defines how long the process runs along a given edge. When the process splits into two subprocesses at an inner node, the two subprocesses are independent.

The main probabilistic aspects of the TKF model are given by Eqs. 2–4 and 8 below. The structure of probabilities 3 and 4 allows us to write the joint probability of observed sequences at the leaves of a tree together with the alignment and the unobserved sequences at inner nodes of the tree as a Markov chain along the sequences observed, until the process reaches an absorbing state. The process of observed sequences therefore becomes a hidden Markov chain. Having obtained this identification, we can use traditional methods for obtaining a recursion for the calculation of the probability of the observed sequences. In particular, we state two recursions, one corresponding to splitting the process according to the first state of the Markov chain, and the other corresponding to splitting the process according to the last state of the Markov chain. In Approach 1, a state of the hidden Markov chain describes an element in the alignment for the whole tree, which gives a recursion with time complexity $O(\prod_i L_i)$ when implemented using dynamic programming. In Approach 2, we take a state of the hidden Markov chain to be an element in the alignment of the tree consisting of inner nodes only. This gives a recursion with time complexity $O(\prod_i L_i^2)$; however, this can be reduced to $O(\prod_i L_i)$, and actually we obtain a recursion with slightly fewer terms than that considered in Approach 1. We start in Preliminaries by defining the states of our hidden Markov chain and finding the transition probabilities of the Markov chain. This section introduces necessary notation to allow for a precise mathematical formulation.

Preliminaries

Notation. We consider a tree with d′ inner nodes and d leaves. The inner nodes are numbered from 1 to d′, with 1 being the root and where the ancestor a(i) of i is to be found in {1, 2,..., i – 1}. The leaves are numbered from d′ + 1 to d′ + d, with the descendants of inner node i being numbered before the descendants of inner node j for j > i. For a tree with two inner nodes and four leaves, the numbering can be seen in Fig. 1.
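The numbering convention can be checked mechanically. The following Python sketch is an illustration only: the function name `follows_numbering` and the child-to-ancestor map for the tree of Fig. 1 (leaves 3 and 4 attached to node 1, leaves 5 and 6 attached to node 2) are assumptions inferred from the text, and the condition on leaves is read as applying to the inner node to which each leaf is directly attached.

```python
def follows_numbering(parent, n_inner):
    """Check the node-numbering convention for a tree given as child -> ancestor.

    `parent` maps every node except the root (node 1) to its ancestor a(node);
    inner nodes are 1..n_inner and leaves are n_inner+1, n_inner+2, ...
    """
    n_nodes = len(parent) + 1
    if set(parent) | {1} != set(range(1, n_nodes + 1)):
        return False  # nodes must be numbered consecutively from 1
    # The ancestor of inner node i > 1 must lie in {1, ..., i - 1}.
    if any(not 1 <= parent[i] < i for i in range(2, n_inner + 1)):
        return False
    # Leaves of inner node i are numbered before leaves of inner node j > i,
    # so the ancestors must be non-decreasing in the leaf number.
    leaf_parents = [parent[z] for z in range(n_inner + 1, n_nodes + 1)]
    return (all(1 <= p <= n_inner for p in leaf_parents)
            and leaf_parents == sorted(leaf_parents))


# Assumed ancestor map for the tree in Fig. 1 (two inner nodes, four leaves).
FIG1_PARENT = {2: 1, 3: 1, 4: 1, 5: 2, 6: 2}
print(follows_numbering(FIG1_PARENT, n_inner=2))
```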

Fig. 1.
A tree with four leaves, where the link at root 1 survives at inner node 2 and produces a new link at node 2.

The evolutionary time distance from the ancestor a(z) of a node z to the node z is τ(z). The observed sequences are Sj for j = d′ + 1,..., d′ + d, where Sj is the observed sequence at the leaf j. The length of Sj is Lj, and the ath entry of Sj is denoted Sj(a). We write Sj(a: b) for entries from a to b with a and b included. We let S denote the collection of sequences, and for two d-dimensional vectors u, v indexed by j = d′+ 1,..., d′+ d and with integer entries S(u: v) denote the collection of subsequences Sj(uj: vj). To compare two d-dimensional vectors u, v, the notation

equation M2

is used, with similar definitions for other relations. To shorten the formulae, we write for two vectors, K, l, with l ≥ 0, S[K, l] = S((K – l + 1): K). Finally, L is the vector with entries Lj.

The notation 1(E) is used for the function that is 1 when the expression E is true and 0 otherwise. We use the symbol # for a link, and when following the fate of a link along the tree, we write # at node i if the link is present, and we write – at node i if the link died along the edge from a(i) to i.

Markov Structure of the TKF Model for Two Sequences. In this subsection, the TKF model from time zero to time τ is considered. We rewrite the probabilities for the deletion of a link and for the number of new links that appear before time τ in such a way that we recognize a Markov structure along the sequences with states

equation M3
[1]

corresponding to survival of the link, deletion of the link, and insertion of a new link.

Let V = 1 if the link survives, and let V = 0 if the link dies. Because the death rate is μ, we have

$P(V = 1) = e^{-\mu\tau}, \qquad P(V = 0) = 1 - e^{-\mu\tau}$
[2]

Let N be the number of new links after time τ. From Thorne, Kishino, and Felsenstein (2), we have

equation M5
[3]

equation M6
[4]

where

$\beta = \dfrac{1 - e^{(\lambda - \mu)\tau}}{\mu - \lambda e^{(\lambda - \mu)\tau}}$

From these formulae, we find

equation M8

and this implies

equation M9
[5]

equation M10
[6]

The important point for establishing a Markov structure along the sequences is that Eqs. 5 and 6 are equal and independent of k for k ≥ 1. The independence of k gives the Markov structure for the number of new links, that is, a new link is added with probability λβ, and we stop adding new links with probability 1 – λβ. We can thus generate V and N by a Markov chain with transition probabilities

[Transition matrix of the Markov chain generating V and N; image 14961_m01 not reproduced]
[7]
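As an illustration of the Markov structure just described, the following Python sketch computes β and the probabilities of the chain that generates V and N, and then recovers the implied distribution of N given V. It is a sketch under stated assumptions: the names `beta`, `link_chain`, and `prob_n_given_v` are hypothetical, and the probability of stopping directly after a deleted link, μβ/(1 – exp(–μτ)), is taken from the standard TKF expressions rather than read from the matrix in Eq. 7.

```python
import math


def beta(lam, mu, tau):
    """beta = (1 - exp((lam - mu) tau)) / (mu - lam exp((lam - mu) tau)), for lam != mu."""
    e = math.exp((lam - mu) * tau)
    return (1.0 - e) / (mu - lam * e)


def link_chain(lam, mu, tau):
    """Probabilities of the chain for the fate of one link over time tau."""
    b = beta(lam, mu, tau)
    surv = math.exp(-mu * tau)
    return {
        "survive": surv,                  # P(V = 1)
        "die": 1.0 - surv,                # P(V = 0)
        "new_after_survive": lam * b,     # first new link after a surviving link
        # Assumption (standard TKF result): after a deleted link the chain
        # stops at once with probability mu * beta / (1 - exp(-mu * tau)).
        "new_after_die": 1.0 - mu * b / (1.0 - surv),
        "new_after_new": lam * b,         # further new links: independent of k
    }


def prob_n_given_v(k, v, chain):
    """P(N = k | V = v) obtained by walking the chain until it stops."""
    first = chain["new_after_survive"] if v == 1 else chain["new_after_die"]
    if k == 0:
        return 1.0 - first
    more = chain["new_after_new"]
    return first * more ** (k - 1) * (1.0 - more)


chain = link_chain(lam=0.05, mu=0.052, tau=1.0)
for v in (0, 1):
    total = sum(prob_n_given_v(k, v, chain) for k in range(200))
    print(v, total)  # each total should be numerically close to 1
```

The parameter values are those of the simulated example reported later in the paper; the geometric tail beyond k = 200 is negligible for them.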

To interpret the whole alignment as a Markov chain, we note that the number K of links at stationarity has the following distribution (see Thorne, Kishino, and Felsenstein, ref. 2):

$P(K = k) = (1 - \gamma)\gamma^{k}, \quad k = 0, 1, 2, \ldots, \qquad \gamma = \lambda/\mu$
[8]

Again this corresponds to a Markov chain where we add a link with probability γ, and we stop adding more links with probability 1 – γ. Having reached the stop state in Eq. 7, we thus add a new link at time zero with probability γ and start a new round of the Markov chain in Eq. 7. We can combine this into a Markov chain on the states in Eq. 1 together with an End state as follows:

[Transition matrix of the combined Markov chain on the states in Eq. 1 and the End state; image 14961_m02 not reproduced]
[9]
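The construction of the combined chain can be sketched in Python as follows. This is a reconstruction from the verbal description (stop in Eq. 7, then either add a link at time zero with probability γ = λ/μ and restart, or go to End), not a transcription of the matrix in Eq. 9. The state labels and the function name `pair_chain` are hypothetical, and the stop probability after a deleted link is again assumed to be the standard TKF value μβ/(1 – exp(–μτ)).

```python
import math

# Hypothetical labels: ancestor symbol / descendant symbol.
MATCH, DELETE, INSERT, END = "#/#", "#/-", "-/#", "End"


def pair_chain(lam, mu, tau):
    """Transition probabilities between alignment columns for two sequences."""
    e = math.exp((lam - mu) * tau)
    b = (1.0 - e) / (mu - lam * e)      # beta
    surv = math.exp(-mu * tau)          # survival probability of a link
    gamma = lam / mu                    # probability of one more link at time zero
    # Probability that the next event is a new link, by current state.
    new = {
        MATCH: lam * b,
        DELETE: 1.0 - mu * b / (1.0 - surv),   # assumed standard TKF value
        INSERT: lam * b,
    }
    chain = {}
    for x, p_new in new.items():
        stop = 1.0 - p_new
        chain[x] = {
            MATCH: stop * gamma * surv,            # stop, add a link, link survives
            DELETE: stop * gamma * (1.0 - surv),   # stop, add a link, link dies
            INSERT: p_new,                         # one more new link
            END: stop * (1.0 - gamma),             # stop, no further links
        }
    return chain


for state, row in pair_chain(0.05, 0.052, 1.0).items():
    print(state, sum(row.values()))  # every row should sum to 1
```

Each row sums to one because the stop probability is split between restarting (γ) and ending (1 – γ).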

As an example, the (equation M12),(equation M13) entry corresponds to going to the stop in Eq. 7 from (equation M14), adding a link at time zero with probability γ, and going to (equation M15), from start in Eq. 7.

When considering the TKF model on a tree, we will need the terms in Eq. 7 for each edge of the tree. Because we number the edges by the node at the end of the edge, we introduce for each node j > 1 the terms

equation M16

where β(j) = {1 - exp((λ - μ)τ(j))}/{μ - λ exp((λ - μ)τ(j))}.

States. Because the probability of an alignment on the tree is the product of the probabilities of the pairwise alignments along the edges, we can use the hidden Markov structure presented in Markov Structure of the TKF Model for Two Sequences for pairwise alignment to obtain the hidden Markov structure for alignment on the tree. To obtain this, we need to choose states for the hidden Markov model on the tree that allow us to identify the states of the hidden Markov model for the pairwise alignments along the edges.

When translating a set of pairwise alignments between the nodes (a(j), j), 2 ≤ j ≤ (d + d′) into a sequence of states for the multiple alignment, we will use the convention that if a birth at node i and a birth at node j > i both are the result of a birth at node z < i, then the birth at node j will appear before the birth at node i in the sequence.

A state represents two things, a new event in part of the tree and a “history” in the complementary part of the tree. The two together give information on which new events are possible in the next state. A state ξ consists of some subsets of nodes together with a value ξ(z) ∈ {#, –} for the nodes z in these subsets. The new event attached to a state ξ is a birth of a link at some node t(ξ) and, if t(ξ) is an inner node, the survival (ξ(z) = #) or nonsurvival (ξ(z) = –) along the tree down from t(ξ). We let T(ξ), respectively L(ξ), be the set of inner nodes z > t(ξ), respectively leaves, where we have survival or where the link died on the edge leading to the node. The history corresponds to a birth at node 1 and the survival (ξ(z) = #) or nonsurvival (ξ(z) = –) along inner nodes z < t(ξ), with the property that the link survived at the ancestor a(t(ξ)). We let H(ξ) be the set of inner nodes z < t(ξ) where the link survived or died on the edge leading to the node. Furthermore, if t(ξ) is a leaf, the history contains an inner node h(ξ) ≤ a(t(ξ)), h(ξ) ∈ H(ξ) with ξ(h(ξ)) = # and a set HL(ξ) of leaves z < t(ξ) being descendants of h(ξ) and for which the link at h(ξ) survived or died on the edge leading to the leaf. For a state where t(ξ) is an inner node, the next state can have a birth of a new link in any of the nodes in H(ξ) ∪ {t(ξ)} ∪ T(ξ) ∪ L(ξ), and for a state with t(ξ) a leaf, a new link can be born at the nodes in H(ξ) ∪ HL(ξ) ∪ {t(ξ)}. Note that the history is defined in such a way as to respect our convention of the ordering of the births.

To exemplify the definitions above, let us consider the tree in Fig. 1 with two inner nodes. We represent the states as six-dimensional columns with values # or – in {t(ξ)} ∪ T(ξ) ∪ L(ξ), with values (#) or (–) in H(ξ) ∪ HL(ξ), and with no value in the remaining nodes. All 45 possible states are listed in Table 1. Column 1 of Table 1 gives the 16 states corresponding to a birth at node 1 that survived at inner node 2. That the birth is at node 1 leaves no room for a history. Column 4 gives the two states corresponding to a birth at leaf 4 with H(ξ) = {1, 2}, h(ξ) = 1, and HL(ξ) = {3}. There are no values at leaves 5 and 6 due to our convention of the ordering of the births. Column 9 gives the four states corresponding to a birth at inner node 2. Here there are no values at leaves 3 and 4 due to our convention. In Fig. 1, the translation between the set of pairwise alignments and the states of the multiple alignment is illustrated. Fig. 1 displays the situation where there is one link only at node 1. This link survives at inner node 2 and produces a new link at node 2. The original link does not survive at leaves 4–6, but produces a new link at leaf 5. The original link survives at leaf 3 and produces two new links at this leaf. The new link at inner node 2 survives in both leaves 5 and 6 and produces a new link at leaf 5. The set of states in the multiple alignment is shown in Fig. 1 Right. The first state is the birth of the link at node 1 together with the survival and nonsurvival of this link. State 2 is the birth at node 5 coming from the original link at node 1. States 3 and 4 are the two births at node 3. State 5 is the birth of a new link at node 2 together with the survival at nodes 5 and 6. Note that there are no values in this state at nodes 3 and 4 due to the convention that a birth at inner node 2 implies that all births at nodes 3 and 4 have been handled. Finally, the last state is the birth at node 5 originating from the new link at node 2.

Table 1.
States of the Markov chain for the tree in Fig. 1

As mentioned in the beginning of this subsection, we get a Markov chain because we can identify all the pairwise alignments along the edges from the states we use. To illustrate this, let us write down the probability of the alignments in Fig. 1 as follows:

equation M17
[10]

Here each row represents the probability of the alignment along one of the edges. The terms in a row have been spread out to align terms vertically, as explained below. The terms s(#; j) and s(–; j) are the two entries in the first row of Eq. 7 corresponding to survival and nonsurvival of a link along the edge leading to node j, b(#, #; j), b(#, –; j) are the two entries from the second and fourth row of Eq. 7, and b(–, #; j) and b(–, –; j) are the two entries from the third row of Eq. 7. The first row of Eq. 10 gives the probability of the alignment between nodes 1 and 2, which is given through the survival of the link together with the probability of a birth of a new link and the probability of no further links. The product of the terms in each column in Eq. 10 represents a transition probability in the chain with 45 states, except for the first column that has to be combined with the probability related to leaving a previous state, and the last column that has to be combined with the probability related to the next link at root 1. As can be seen, each column is a function of the corresponding consecutive set of states in Fig. 1.

When stating the transition probabilities, it is convenient to have the following notation. For a state ξ, we let γ = γ(r, ξ) be the state we enter when having a birth at the leaf r. If t(ξ) is an inner node, r ∈ L(ξ), H(γ) = H(ξ) ∪ {t(ξ)} ∪ T(ξ), h(γ) = t(ξ), HL(γ) = {z ∈ L(ξ)|z < r}, and γ(z) = ξ(z) for nodes in these sets. If t(ξ) is a leaf, r ∈ HL(ξ) ∪ {t(ξ)}, H(γ) = H(ξ), h(γ) = h(ξ), HL(γ) = {z ∈ HL(ξ)|z < r}, and γ(z) = ξ(z) for nodes in these sets. The set of all states is denoted Ξ, and the subset of states ξ with t(ξ) an inner node is denoted Ξ1.

Transition Probabilities. A transition probability p(x, y) of going from state x to state y can be written formally as p(x, y) = stop · new · survival, where stop gives the probability of no new links at certain nodes, new gives the probability of a new link at a particular node, and survival gives the probability of the fate of the new link. We thus find for a state equation M18

equation M19
[11]

For a state ξ with t(ξ) a leaf, we get with s = t(ξ)

equation M20
[12]

For two states ξ, η ∈ Ξ1, a transition from ξ to η is possible only if η corresponds to a new link at one of the inner nodes from which ξ allows the introduction of new links. This can be formulated formally as

equation M21

In this case, the transition probability is

equation M22
[13]

When t(ξ) is a leaf and t(η) is an inner node, the transition from ξ to η requires

equation M23

and the transition probability p(ξ, η) is

equation M24
[14]

For a transition to the end state, only the first line of Eqs. 13 and 14 should be used, multiplied by (1 – λ/μ), and with t(η) = 1. Finally, the transition probabilities from the immortal state I can be calculated as if I corresponds to the state ξ0 with a new link at node 1 that survives in all of the tree.

Algorithms

In this section, we present two algorithms for computing the probability of the observed sequences Sj, for j = d′ + 1,..., d′ + d, being related by the given evolutionary tree. Both algorithms are based on the hidden Markov chain described in the previous section but differ in their choice of states. In the first algorithm, the states describe the alignment for both inner nodes and leaves. The running time is equation M25, where Lmax is the maximum length of the observed sequences. In the second algorithm, the states describe the alignment for inner nodes only. The running time is now equation M26, but the algorithm can be rewritten to obtain an equation M27 running time as in the first approach. The principle for deriving the algorithms is classical and very well known: we consider what happens in either the first or last step of the Markov chain.

Approach 1: Inner Nodes and Leaves. Notation. We consider a Markov process x0, x1,..., xN that starts in the initial state I and stops at a random time N + 1 in the end state E. Thus x0 = I, xi ∈ Ξ, for i = 1,..., N, and xN+1 = E. The transition probability going from x to y is p(x, y) as described in Transition Probabilities. A state ξ ∈ Ξ1, corresponding to the birth of a link at an inner node, emits a letter in those observed sequences Sz for which z ∈ L1(ξ) = {u ∈ L(ξ)|ξ(u) = #}. A state ξ ∉ Ξ1 emits a letter in the sequence St(ξ) only. For any state x ∈ Ξ, we let

equation M28
[15]

be a vector indexed by the numbering of the observed sequences and consisting of ones in those coordinates for which x emits a letter and zeroes in the other coordinates:

equation M29
[16]

For the state xi in the hidden Markov chain, we use the shorthand notation li for the vector l(xi). The lengths Li of the sequences emitted by the first i states x1,..., xi can then be written as equation M30. With this notation, the state xi emits the letters equation M31 is the empty set if lij = 0. The probability that a state x emits the vector of letters s (with the possibility that some of the coordinates of s are equal to the empty set) is pe(s|x).

Backward recursion. For an arbitrary vector K ≥ 0 and state x0 ∈ Ξ, we define F(K|x0) = P(S(K + 1: L)|x0), that is, the probability that the sequences S((K + 1): L) are produced by a set of states x1, x2,... given that the Markov chain starts in the state x0. Clearly, P(S(1: L)|x0 = I) = F(0|I). Summing over the states of the Markov chain, F(K|x0) is given by

equation M32
[17]

When K <w L and K ≠ L, the recursion for F(K|x0), with equation M33, is

equation M34
[18]

When K = L, the recursion is

equation M35
[19]

Recursion 18 states that the probability of the sequences S((K + 1): L) produced by states x1, x2,..., given that we start in state x0, is a sum over the possible states of x1. Each term in the sum is the product of the transition probability of going from x0 to x1, the emission probability for those letters emitted by x1, and the probability of the remaining sequences S((K + l(x1) + 1): L) given that we start in x1. If in recursion 18 we replace the summation by the max operation, we obtain a recursion for finding the alignment with the highest probability. This is known as the Viterbi algorithm in the hidden Markov model literature.
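To make the structure of recursions 18 and 19 concrete, the following Python sketch applies the same backward recursion to the simplest special case: two observed sequences joined by a single edge, one treated as ancestor and one as descendant, so that there are no unobserved inner nodes and no states that emit nothing. This is not the tree algorithm itself. The names `pair_chain`, `emission`, and `backward` are hypothetical, the transition probabilities are the assumed standard TKF values, substitution follows a Jukes–Cantor model with a given rate of leaving a state, and the immortal link is treated like a surviving link when leaving the start state.

```python
import math
from functools import lru_cache

MATCH, DELETE, INSERT, END = "#/#", "#/-", "-/#", "End"
# l(x): letters emitted in (ancestor, descendant) by each state.
EMITS = {MATCH: (1, 1), DELETE: (1, 0), INSERT: (0, 1)}


def pair_chain(lam, mu, tau):
    """Transition probabilities p(x, y) of the two-sequence chain (assumed TKF values)."""
    e = math.exp((lam - mu) * tau)
    b = (1.0 - e) / (mu - lam * e)
    surv, gamma = math.exp(-mu * tau), lam / mu
    new = {MATCH: lam * b, DELETE: 1.0 - mu * b / (1.0 - surv), INSERT: lam * b}
    return {x: {MATCH: (1.0 - p) * gamma * surv,
                DELETE: (1.0 - p) * gamma * (1.0 - surv),
                INSERT: p,
                END: (1.0 - p) * (1.0 - gamma)} for x, p in new.items()}


def emission(x, a, b, tau, rate):
    """Emission probability under Jukes-Cantor with uniform stationary distribution."""
    if x != MATCH:
        return 0.25                      # single letter from the stationary law
    decay = math.exp(-4.0 * rate * tau / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return 0.25 * (same if a == b else diff)


def backward(seq_a, seq_b, lam, mu, tau, rate, combine=sum):
    """F(0 | start): combine=sum gives P(sequences), combine=max the Viterbi value."""
    chain = pair_chain(lam, mu, tau)
    end = (len(seq_a), len(seq_b))

    @lru_cache(maxsize=None)
    def f(i, j, x0):
        if (i, j) == end:                # K = L: only the end state remains
            return chain[x0][END]
        terms = []
        for x1, (da, db) in EMITS.items():
            if i + da > end[0] or j + db > end[1]:
                continue                 # boundary condition: F = 0 beyond L
            a = seq_a[i] if da else None
            b = seq_b[j] if db else None
            terms.append(chain[x0][x1] * emission(x1, a, b, tau, rate)
                         * f(i + da, j + db, x1))
        return combine(terms)

    return f(0, 0, MATCH)                # immortal link treated as a surviving link


total = backward("ACGT", "AGT", 0.05, 0.052, 1.0, 0.3)
best = backward("ACGT", "AGT", 0.05, 0.052, 1.0, 0.3, combine=max)
print(total, best, best / total)
```

Passing `combine=max` replaces the summation by maximization, which gives the probability of the single most probable alignment as described above; recovering that alignment would additionally require storing the maximizing state at each step.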

Hein, Jensen, and Pedersen (6) also derive a forward recursion by separating out the contribution from xn instead of x0. Computationally, there is no difference between the forward and backward recursions. However, the latter has an interpretation as a probability, thereby making it easier to understand.

Emission probabilities. For a full description of the TKF model, we need a model for the substitution process. We let equation M36 be the probability for the substitution of a letter a by b over a time period τ. The stationary probabilities for this transition matrix are denoted by π.

When a state corresponds to a birth of a new link in one of the leaves only, that is, t(ξ) is a leaf, the emitted vector s has a letter at the node t(ξ) only, and the emission probability is simply the stationary probability π(s(t(ξ))). For a state ξ ∈ Ξ1 corresponding to a birth of a new link at inner node t(ξ), the emitted vector s has letters at those nodes z ∈ L(ξ) for which ξ(z) = #. With a(j), the ancestor of a node j, and with

equation M37
[20]

we can write the emission probability as

equation M38
[21]
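The sum over ancestral letters in Eq. 21 can be organized by working upward from the leaves, as in the following Python sketch. It is an illustration under stated assumptions: the ancestor map `PARENT` for the tree of Fig. 1 is inferred from the text, all edge lengths are set to 1, substitution follows a Jukes–Cantor model with rate 0.3 of leaving a state, and the names `jc`, `partial`, and `emission_probability` are hypothetical.

```python
import math

ALPHABET = "ACGT"
PARENT = {2: 1, 3: 1, 4: 1, 5: 2, 6: 2}   # assumed a(z) for the tree in Fig. 1
INNER = {1, 2}
TAU = {z: 1.0 for z in PARENT}            # assumed edge lengths tau(z)


def jc(a, b, tau, rate=0.3):
    """Jukes-Cantor probability of substituting a by b over time tau."""
    decay = math.exp(-4.0 * rate * tau / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def children(node):
    return [z for z, p in PARENT.items() if p == node]


def partial(node, letter, fate, observed):
    """Probability of the letters emitted below `node`, given its own letter."""
    result = 1.0
    for z in children(node):
        if fate.get(z) != "#":
            continue                      # link died on the edge to z: no emission
        if z in INNER:
            # Sum over the unobserved letter at the surviving inner node z.
            result *= sum(jc(letter, b, TAU[z]) * partial(z, b, fate, observed)
                          for b in ALPHABET)
        else:
            result *= jc(letter, observed[z], TAU[z])
    return result


def emission_probability(t, fate, observed):
    """Emission probability of a state with a birth at inner node t."""
    # The letter at the birth node is drawn from the stationary distribution.
    return sum(0.25 * partial(t, a, fate, observed) for a in ALPHABET)


# Birth at node 1 that survives at node 2 and at leaves 3, 5, 6 but not at leaf 4.
fate = {2: "#", 3: "#", 4: "-", 5: "#", 6: "#"}
print(emission_probability(1, fate, {3: "A", 5: "A", 6: "C"}))
```

Summing the returned value over all possible letters at the surviving leaves gives one, which is a convenient sanity check on an implementation.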

Eq. 21 simply says that the probability of the emitted letters sz, z ∈ L1(ξ) is the sum of the joint probability of the ancestral and emitted letters over the possible values of the ancestral letters.

Implementation and analysis. Let us briefly discuss how to implement the recursion given by Eqs. 18 and 19. There is a complication, in that there will always be terms on the right-hand side of the equations for which K + l(z) = K or l(z) = 0. The states ξ ∈ Ξ1 for which l(ξ) = 0 are characterized by having ξ(z) = – for all z ∈ L(ξ), that is, the new link does not survive at any of the leaves. Let us denote this class of states by C. Imagine that for some K, the term F(K′|x) has been calculated for all K′ >w K, K′ ≠ K and all x ∈ Ξ. For each x ∈ C, recursion 18 gives

equation M39
[22]

with ω(x) known. Let Q be the matrix with entries p(z1, z2), z1, z2 ∈ C. Then, because the entries are nonnegative and the sum along a row is < 1, the matrix IC – Q, where IC is the identity matrix, is invertible, and the set of linear equations 22 has a unique solution. Having solved this system of equations, we can next calculate F(K|x) for x ∉ C directly from Eq. 18, or from Eq. 19 when K = L. Also in the case of the Viterbi algorithm for finding the alignment with the highest probability, we must, for a given K in the recursion, first solve for the states in C. The boundary conditions for the recursion are F(K|x) = 0 when K >w L.
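The linear system for the states in C can be solved with any standard method. The following Python sketch uses Gaussian elimination with partial pivoting on (IC – Q)F = ω; the function name `solve_silent_states` and the small numerical example are hypothetical and serve only to show the step, not the authors' implementation.

```python
def solve_silent_states(q, omega):
    """Solve (I - Q) F = omega for the states that emit no letters.

    q[i][j] is the transition probability between two such states and
    omega[i] collects the already known terms of the recursion.
    """
    n = len(omega)
    # Augmented matrix [I - Q | omega].
    a = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)] + [omega[i]]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    f = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * f[j] for j in range(i + 1, n))
        f[i] = (a[i][n] - tail) / a[i][i]
    return f


# Hypothetical two-state example with row sums of Q below 1.
Q = [[0.2, 0.3], [0.1, 0.4]]
OMEGA = [1.0, 2.0]
print(solve_silent_states(Q, OMEGA))
```

Because Q does not depend on K, an implementation could also invert IC – Q once and reuse the inverse for every K.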

To run the algorithm, we need to calculate F(K|x) for any K ≤ L and for any x ∈ Ξ. The number of steps needed is therefore of the order $N \prod_{i=1}^{d} L_i$, where N is the number of elements in the set Ξ.

For illustration purposes, we have implemented recursion 18 as well as the Viterbi algorithm and an algorithm for simulating alignments conditional on the observed sequences for the case of four observed sequences. No attempt to optimize the program has been made, and the program therefore runs only on short sequences. As an example, we use a set of simulated sequences kindly supplied by Yun Song (Oxford University, Oxford). The parameters used in the simulation are λ = 0.05, μ = 0.052, and the Jukes–Cantor model for substitution where the rate of leaving a state is 0.3. All edges of the tree have lengths 1. We use the same parameters when finding the maximal alignment. The true alignment that generated the sequences and the maximal alignment can be seen in Table 2. We have also included an alignment obtained from the CLUSTAL W program by Higgins, Thompson, and Gibson (7). The total probability of the observed sequences is $7.62 \times 10^{-41}$, as obtained from recursion 18, and the probability of the maximal alignment is $2.04 \times 10^{-43}$, contributing only 0.27% to the total probability.

Table 2.
Example of alignment

The maximal alignment and the CLUSTAL W alignment agree on aligning GAC in the middle. We have run 500 simulations of the conditional alignment given the observed sequences, and in 78% of the cases, we find that GAC is aligned. CLUSTAL W aligns the last C of the four sequences, and this is not seen in the maximal alignment. In the 500 simulations from the conditional alignments, we never encountered a case where the last C was aligned. Generally, the possibility of simulating alignments from the conditional distribution given the observed sequences allows us to make statements on the reliability of features seen in an alignment.
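Simulation from the conditional distribution can be sketched with the backward quantities F: starting from the initial state, the next state x1 is drawn with probability proportional to p(x0, x1) times the emission probability times F at the advanced position. The Python sketch below does this for the two-sequence special case only (one sequence as ancestor, one as descendant, no unobserved inner nodes). The names `pair_chain`, `emission`, and `sample_alignment` are hypothetical, and the transition probabilities are the assumed standard TKF values with Jukes–Cantor substitution.

```python
import math
import random
from functools import lru_cache

MATCH, DELETE, INSERT, END = "#/#", "#/-", "-/#", "End"
EMITS = {MATCH: (1, 1), DELETE: (1, 0), INSERT: (0, 1)}


def pair_chain(lam, mu, tau):
    """Transition probabilities of the two-sequence chain (assumed TKF values)."""
    e = math.exp((lam - mu) * tau)
    b = (1.0 - e) / (mu - lam * e)
    surv, gamma = math.exp(-mu * tau), lam / mu
    new = {MATCH: lam * b, DELETE: 1.0 - mu * b / (1.0 - surv), INSERT: lam * b}
    return {x: {MATCH: (1.0 - p) * gamma * surv,
                DELETE: (1.0 - p) * gamma * (1.0 - surv),
                INSERT: p,
                END: (1.0 - p) * (1.0 - gamma)} for x, p in new.items()}


def emission(x, a, b, tau, rate):
    """Jukes-Cantor emission probability with uniform stationary distribution."""
    if x != MATCH:
        return 0.25
    decay = math.exp(-4.0 * rate * tau / 3.0)
    return 0.25 * (0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay)


def sample_alignment(seq_a, seq_b, lam, mu, tau, rate, rng):
    """Draw one sequence of alignment columns given both observed sequences."""
    chain = pair_chain(lam, mu, tau)
    end = (len(seq_a), len(seq_b))

    def step_weight(i, j, x0, x1):
        da, db = EMITS[x1]
        if i + da > end[0] or j + db > end[1]:
            return 0.0                   # would emit past the end of a sequence
        a = seq_a[i] if da else None
        b = seq_b[j] if db else None
        return chain[x0][x1] * emission(x1, a, b, tau, rate) * f(i + da, j + db, x1)

    @lru_cache(maxsize=None)
    def f(i, j, x0):                     # backward quantity F(K | x0)
        if (i, j) == end:
            return chain[x0][END]
        return sum(step_weight(i, j, x0, x1) for x1 in EMITS)

    path, i, j, x = [], 0, 0, MATCH      # immortal link treated as a surviving link
    while (i, j) != end:
        states = list(EMITS)
        weights = [step_weight(i, j, x, x1) for x1 in states]
        x = rng.choices(states, weights=weights)[0]
        path.append(x)
        i, j = i + EMITS[x][0], j + EMITS[x][1]
    return path


print(sample_alignment("ACGT", "AGT", 0.05, 0.052, 1.0, 0.3, random.Random(1)))
```

Repeating the draw many times and counting how often a feature appears gives the kind of reliability statement described above.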

Approach 2: Inner Nodes Only. Notation. In Approach 1, a state described a column of the alignment for all of the inner nodes and leaves, and a state emitted at most one letter in each of the observed sequences. In this section, we will instead let the states describe the inner nodes only, which in turn necessitates the emission of arbitrarily long subsequences among the observed sequences. This implies an extra sum in the recursion, thus seemingly making the recursion more complicated. However, we can rewrite the recursion, ending up with a recursion of the same complexity as before and with fewer terms than in Approach 1.

More precisely, a state ξ is a birth of a new link at an inner node and is characterized by the node t(ξ) at which the link is born; the set T(ξ) of inner nodes describes the fate of the link, and the set H(ξ) gives the history of the new link. As before, L(ξ) is the set of leaves descending from t(ξ) or from the nodes in T1(ξ) (see Eq. 20). As in Eq. 15, l(x) denotes the lengths of the emitted subsequences. Contrary to Eq. 16, l(x) is no longer determined by the state x; the state determines only at which nodes it is possible to emit letters:

equation M40
[23]

We again use li = l(xi), furthermore take l0 ≥ 0 to be the length of the subsequence emitted by the immortal link, and define equation M41 to be the lengths of the sequences emitted by the first i states. The emission probability qe(K, l|x) is now both the probability of emitting subsequences of length lj, j = d′ + 1,..., d′ + d and the probability that the emitted letters are Sj((Kj – lj + 1): Kj). To state this probability, we define

equation M42

In the formula below, u denotes the subset of the leaves L1(ξ, l) at which the link survives from the ancestral inner nodes. To shorten the formula, we define q(#; z) = s(–; z)b(–, #; z), q(–; z) = s(–; z)b(–, –; z), equation M43, and equation M44, where $\pi(S_j(a{:}b)) = \prod_{i=a}^{b} \pi(S_j(i))$. Then the probability that the state ξ emits the subsequences S(m + 1: m + l) is qe(m + l, l|ξ), given by

equation M45
[24]

The function f(m, u, ξ) is the emission probability for the first letter at those leaves where we have survival of the link and is given by pe(s|ξ) from Eq. 21, with L1(ξ) replaced by u and sz replaced by Sz(mz + 1). Furthermore,

equation M46

Backward recursion. The backward recursion is obtained by defining

equation M47
[25]

which equals P(S(K + 1: L)|x0). Separating out the sum over the first state x1, we find

equation M48
[26]

for K <w L and K ≠ L, and when K = L, the recursion is

equation M49
[27]

A forward recursion can be derived in the same way. The details are described by Hein, Jensen, and Pedersen (6).

Reduction of complexity. For the recursions described in the previous subsection, we need to calculate F(K|x) for any value of K ≤ L. This takes of the order equation M50 steps. Each step here, however, involves the sum over l (see Eq. 26) and therefore requires of the order equation M51 steps. The time complexity of the algorithms is thus of the order equation M52. The algorithms are therefore inferior to those given in Approach 1. It turns out, though, that we can rewrite the algorithms in such a way that the resulting time complexity becomes equation M53 and where the constant factor hidden by the O notation is slightly smaller than for the algorithms of Approach 1. We start by inserting Eq. 24 into recursion 26.

equation M54
[28]

where, as before, u is the subset of the leaves L1(ξ, l) at which we have survival from the ancestral inner node. If we let w be the subset of the leaves L(ξ)\u at which there is no survival, but the number of new links is positive, and we introduce the subset v of u ∪ w at which lz ≥ 2, this gives

equation M55
[29]

where 1(u ∪ w) is 1 when z ∈ (u ∪ w) and 0 when z ∉ (u ∪ w). The function G is defined as

equation M56
[30]

for a nonempty subset v of L(ξ), and G(M|ξ, ∅) = F(M|ξ).

We can obtain a recursion for G by splitting the sum in Eq. 30 into $\sum_{v' \subseteq v} \sum_{m_z \ge 2,\, z \in v';\; m_z = 1,\, z \in (v \setminus v');\; m_z = 0,\, z \notin v}$, where v′ can be the empty subset. This gives

equation M57
[31]

Combining Eqs. 29 and 31, we have established a recursion involving the functions F(K|ξ) and G(K|ξ, v). For the tree in Fig. 1, the recursions of Approach 1 involve 45 terms, whereas the number of terms for the recursions in this section is 24.

Discussion

This work presents algorithms that have the same complexity as the traditional nonstatistical multiple alignment algorithm in Sankoff (8). The statistical alignment approach to sequence analysis differs from the optimization approach in focusing on obtaining the probability of the sequences under the given model, rather than obtaining an alignment. Among molecular biologists, however, it is popular to consider the actual alignment, and the one chosen is typically the alignment that contributes the most to the probability of the observed sequences. The latter can be calculated by simple modifications of the central recursions of this work, where a summation operator is substituted by a maximization operator. Several additional problems have to be solved to make the algorithm of this paper useful in real data analysis. Besides actually implementing the algorithm, it needs to be coupled to a numerical optimization method to find maximum likelihood estimates of the unspecified parameters, such as branch lengths, substitution parameters, and insertion and deletion rates. This method can then be used to analyze up to, say, four sequences of realistic lengths (hundreds of base pairs/amino acids). Elementary computational tricks can extend this to six or seven sequences; beyond this, radically different methods will have to be applied. Jensen and Hein (9) suggested a simulation technique where the basic step is the simulation of the alignment of a three-star tree. The Gibbs sampler proposed by Holmes and Bruno (5) is based on samplings that require pairwise alignments only. This is a faster operation, whereas the Gibbs sampler proposed by Jensen and Hein (9) achieves a more efficient mixing per move. From the perspective of a biologist, the underlying model for this paper can be criticized. First, the assumption that all insertions/deletions are only one nucleotide/amino acid long does not conform to the biological reality and should be relaxed. Second, the assumption that all positions in a sequence evolve according to the same rates is also unrealistic. Formulating models and ways to calculate the relevant probabilities in such models is a major challenge to the field if a statistical approach to alignment is to be of widespread use.

Notes

Abbreviation: TKF model, Thorne–Kishino–Felsenstein model.

References

1. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48, 443–453.
2. Thorne, J. L., Kishino, H. & Felsenstein, J. (1991) J. Mol. Evol. 33, 114–124.
3. Steel, M. & Hein, J. (2001) Appl. Math. Lett. 14, 679–684.
4. Hein, J. (2001) Proc. Pac. Symp. Biocomp. 179–190.
5. Holmes, I. & Bruno, W. J. (2001) Bioinformatics 17, 803–820.
6. Hein, J., Jensen, J. L. & Pedersen, C. N. S. (2002) Recursions for Multiple Statistical Alignment (Dept. of Theoretical Statistics, University of Aarhus, Aarhus, Denmark), Technical Report no. 425.
7. Higgins, D. G., Thompson, J. D. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673–4680.
8. Sankoff, D. (1975) SIAM J. Appl. Math. 28, 35–42.
9. Jensen, J. L. & Hein, J. (2002) Gibbs Sampler for Statistical Multiple Alignment (Dept. of Theoretical Statistics, University of Aarhus, Aarhus, Denmark), Technical Report no. 429.
