Mary Ann Liebert, Inc.
Journal of Computational Biology
J Comput Biol. Oct 2008; 15(8): 1007–1027.
PMCID: PMC3205822

DUPCAR: Reconstructing Contiguous Ancestral Regions with Duplications

Abstract

Accurately reconstructing the large-scale gene order in an ancestral genome is a critical step to better understand genome evolution. In this paper, we propose a heuristic algorithm, called DUPCAR, for reconstructing ancestral genomic orders with duplications. The method starts from the order of genes in modern genomes and predicts predecessor and successor relationships in the ancestor. Then a greedy algorithm is used to reconstruct the ancestral orders by connecting genes into contiguous regions based on predicted adjacencies. Computer simulation was used to validate the algorithm. We also applied the method to reconstruct the ancestral chromosome X of placental mammals and the ancestral genomes of the ciliate Paramecium tetraurelia.

Key words: contiguous ancestral region, duplication, gene-order reconstruction, genome rearrangement, isometric reconciliation

1. Introduction

The large number of genome sequences becoming available makes it feasible to computationally reconstruct ancient genomes of related species that have undergone large-scale genome rearrangements. The heart of this problem is to “undo” these rearrangements and restore the ancestral order. Previous studies mainly focused on solving the median problem, which is based either on reversal (inversion) distance or breakpoint distance. In this problem one tries to reconstruct the common ancestor of two descendant genomes using an additional “outgroup” genome. Unfortunately, the median problem does not have exact and efficient algorithms (Caprara, 1999; Pe'er and Shamir, 1998). Heuristic programs for both the breakpoint median problem and the reversal median problem have been proposed (Sankoff and Blanchette, 1998; Moret et al., 2001; Bourque and Pevzner, 2002). However, the discrepancy between computational predictions and results from cytogenetic experiments (Froenicke et al., 2006; Bourque et al., 2006) suggests a need to explore further computational methods for ancestral genome reconstruction.

We recently proposed a new approach for reconstructing the ancestral order based on the adjacencies of orthologous genomic intervals in modern species (Ma et al., 2006), which essentially avoids solving any rearrangement median problem. The critical procedure of the method is analogous to Fitch's parsimony algorithm (Fitch, 1971). Instead of inferring ancestral nucleotides, we infer the locally parsimonious predecessor and successor relationships of the orthologous conserved segments in the ancestor, in our case the ancestor of most placental mammals, known as the boreoeutherian ancestor. Another procedure then connects these segments into 29 contiguous ancestral regions (CARs). Our result agrees with the cytogenetic prediction fairly well (Rocchi et al., 2006).

However, the main drawback of the CARs method is that it does not handle duplications. Indeed, duplications (including segmental duplications and tandem duplications) have a great impact on genome evolution (Eichler and Sankoff, 2003). Some previous theoretic studies (Sankoff, 1999; Sankoff and El-Mabrouk, 2000; Marron et al., 2004) have included duplications (sometimes with loss) along with rearrangements. In this paper, we propose an efficient heuristic approach based on the CARs method to incorporate duplication events into predictions of ancestral gene orders. Our method puts rearrangements and duplications in a unified framework in order to have a reconstruction that captures additional large-scale evolutionary events. We have applied it to reconstruct the ancestral chromosome X of placental mammals and ancestral genomes of the ciliate Paramecium tetraurelia.

2. Methods

2.1. Definitions

Before applying the reconstruction algorithm, it is important to partition the genomic region under consideration into intervals so that further rearrangement analysis can be carried out. A number of approaches identify these regions, based either on gene content or on sequence similarity. Nadeau and Taylor (1984) introduced the term conserved segment to signify a genomic interval whose gene order is preserved and not disrupted by evolutionary rearrangements. In the past decade, comparative gene mapping, which uses orthologous gene loci as evolutionary markers, played an important role in testing algorithms and understanding rearrangement scenarios. Recent effort has shifted to reconstructing mammalian ancestral genomes at increasingly high resolution, and whole-genome sequence alignments play a key role in identifying conserved segments (sometimes also called synteny blocks) (Pevzner and Tesler, 2003; Ma et al., 2006). From an evolutionary point of view, a conserved segment represents homologous segments in extant species that are derived from a common ancestral interval that was never disrupted by breakpoints caused by duplications and rearrangements.

In this paper we co-opt the word gene to mean a conserved segment used in the analysis. A chromosome of a modern or ancestral genome consists of a list of genes, where each gene has a sign (orientation) that is either positive (+) or negative (−). For a gene x, −x denotes the reverse complement, and vice versa. For example, if x = GTAT, then −x = ATAC. The reverse complement of a chromosome is obtained by reversing the list of genes and replacing each gene by its reverse complement. A genome is a set of chromosomes.
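The two reverse-complement operations can be illustrated with a minimal Python sketch. The representation of a signed gene as a `(name, sign)` pair and both function names are assumptions made for illustration only; they are not part of DUPCAR.

```python
# Illustrative sketch: the (name, sign) representation is an assumption.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement_sequence(seq):
    """Reverse a nucleotide string and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def reverse_complement_chromosome(chromosome):
    """Reverse the list of signed genes and flip every orientation."""
    return [(name, -sign) for name, sign in reversed(chromosome)]

seq_rc = reverse_complement_sequence("GTAT")
chrom_rc = reverse_complement_chromosome([("a", +1), ("b", -1), ("c", +1)])
```

Applying `reverse_complement_chromosome` twice returns the original chromosome, mirroring the fact that −(−x) = x for a single gene.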

If two genes share homology, i.e., they are derived from a common ancestral gene, then they belong to the same gene family. We use the following notation to denote genes in genomes:

Notation

Suppose there are N gene families a_1, a_2, ..., a_N. The m-th member (according to a fixed but arbitrary ordering) of family a_i in genome g is denoted by g[a_i^m].

If there is only one gene from family a_i in g, we can refer to it as g[a_i]. If the genome is unambiguous from the context, g[a_i^m] can be simplified to a_i^m.

Definition

For a gene equation M5, if equation M6 immediately precedes equation M7 in the same chromosome, then we define equation M8 and call equation M9 the predecessor of equation M10. Equivalently, in the reverse complement of the chromosome, equation M11. Since equation M12 then immediately succeeds equation M13, we define equation M14 and call equation M15 the successor of equation M16. We set equation M17 if equation M18 appears first on a chromosome in g, and set equation M19 if equation M20 appears last on a chromosome in g. (Note that $ is just a special symbol for which we do not define a reverse complement. Furthermore, $ is the same for all chromosomes.)

Example

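Let g have chromosome (a −b1 c b2 d e). Then p(a) = $, p(−b1) = a, p(d) = b2, s(−c) = b1, p(−a) = b1, etc. The last two relationships are read from the reverse complement of the chromosome, (−e −d −b2 −c b1 −a).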
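The predecessor and successor maps can be computed mechanically by scanning a chromosome and its reverse complement. The sketch below is illustrative only: it assumes the example chromosome reads (a, −b1, c, b2, d, e), which is the reading consistent with p(−b1) = a, writes signed genes as strings such as `"-b1"`, and assumes every signed gene occurs once. The names `neg` and `pred_succ` are hypothetical.

```python
END = "$"  # special symbol marking a chromosome end

def neg(gene):
    """Reverse complement of a signed gene written as 'x' or '-x'."""
    return gene[1:] if gene.startswith("-") else "-" + gene

def pred_succ(chromosome):
    """Return dicts p and s covering each gene and its reverse complement."""
    p, s = {}, {}
    reverse = [neg(g) for g in reversed(chromosome)]
    for strand in (chromosome, reverse):
        for i, gene in enumerate(strand):
            p[gene] = strand[i - 1] if i > 0 else END
            s[gene] = strand[i + 1] if i + 1 < len(strand) else END
    return p, s

p, s = pred_succ(["a", "-b1", "c", "b2", "d", "e"])
```

Scanning both strands is what makes relationships such as p(−a) available without a separate rule for negative genes.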

During genome evolution, in addition to point mutations, large-scale operations can happen. These operations include insertions, deletions, chromosomal rearrangements (inversion, translocation, fusion, and fission), and duplications (tandem duplications and segmental duplications).

Suppose we have a genome (a1 b1 c1 d1, e1 f1), where the comma separates chromosomes. Examples of the large-scale operations are as follows:

  • insertion. a1 b1 c1 d1, e1 f1 ⇒ a1 b1 c1 d1, e1 h f1.
  • deletion. a1 b1 c1 d1, e1 f1 ⇒ a1 d1, e1 f1.
  • inversion. a1 b1 c1 d1, e1 f1 ⇒ a1 −c1 −b1 d1, e1 f1.
  • translocation. a1 b1 c1 d1, e1 f1 ⇒ a1 b1 f1, e1 c1 d1.
  • fusion. a1 b1 c1 d1, e1 f1 ⇒ a1 b1 c1 d1 e1 f1.
  • fission. a1 b1 c1 d1, e1 f1 ⇒ a1 b1, e1 f1, c1 d1.
  • tandem duplication. a1 b1 c1 d1, e1 f1 ⇒ a1 b1 c1 b2 c2 d1, e1 f1.
  • segmental duplication. a1 b1 c1 d1, e1 f1 ⇒ a1 b1 c1 d1, e1 b2 f1.
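A few of these operations can be sketched on a genome stored as a list of chromosomes, each a list of signed gene names. This is a hedged illustration, not the authors' code: the function names and index conventions are assumptions, and the inversion follows the usual signed convention in which the inverted segment is reversed and each gene changes orientation.

```python
def neg(gene):
    """Flip the orientation of a signed gene written as 'x' or '-x'."""
    return gene[1:] if gene.startswith("-") else "-" + gene

def inversion(genome, c, i, j):
    """Reverse genes i..j-1 of chromosome c and flip their signs."""
    chrom = genome[c]
    flipped = [neg(g) for g in reversed(chrom[i:j])]
    return genome[:c] + [chrom[:i] + flipped + chrom[j:]] + genome[c + 1:]

def fusion(genome, c1, c2):
    """Append chromosome c2 to the end of chromosome c1."""
    rest = [ch for k, ch in enumerate(genome) if k not in (c1, c2)]
    return [genome[c1] + genome[c2]] + rest

def fission(genome, c, i):
    """Split chromosome c before position i; the tail becomes a new chromosome."""
    chrom = genome[c]
    return genome[:c] + [chrom[:i]] + genome[c + 1:] + [chrom[i:]]

def tandem_duplication(genome, c, j, copies):
    """Insert the new paralogs `copies` directly after position j-1."""
    chrom = genome[c]
    return genome[:c] + [chrom[:j] + copies + chrom[j:]] + genome[c + 1:]

genome = [["a1", "b1", "c1", "d1"], ["e1", "f1"]]
inverted = inversion(genome, 0, 1, 3)
fused = fusion(genome, 0, 1)
split = fission(genome, 0, 2)
duplicated = tandem_duplication(genome, 0, 3, ["b2", "c2"])
```

Each function returns a new genome and leaves its input unchanged, so the operations can be chained to simulate a rearrangement scenario.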

As a consequence of the accumulation of the above operations, we have different numbers of genes and different gene orders in present-day genomes.

Problem

The problem we investigate in this paper is: Given (1) a set of modern genomes equation M21; (2) a species tree T that describes the phylogeny of these modern genomes; and (3) a set of gene trees, equation M22 for each gene family ai, that defines the relationships among all the genes in the family, how can we reconstruct the order and orientation of genes in the target ancestral genome? Here, we call each reconstructed chromosome a contiguous ancestral region (CAR).

2.2. The species tree and the gene trees

Species tree

A species tree is a rooted binary tree describing the phylogeny among given species. Each bifurcating ancestral node represents the genome of an ancestral species just before a speciation event, while each leaf corresponds to the genome of a modern species. Each branch in the tree, connecting genomes f and g, has a length (distance) D(f, g) representing the evolutionary distance between f and g. The distance between any two nodes in the tree is the sum of the distances along the path that connects them, i.e., the distance structure on the tree is additive. The distance D is extended in the natural way to apply to points along a branch in the tree as well, which represent intermediate genomes in the descent of the species represented at the end of the branch. These intermediate genomes are the result of evolutionary operations that occur on that branch. We assume an arbitrarily long branch leading to the root of the species tree, with intermediate genomes representing ancestral forms of the root species, e.g., forms that existed prior to particular duplications that occurred in the descent of the root species from previous ancestors.
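The additive distance structure can be illustrated with a short sketch. The toy tree, the `parent` and `length` dictionaries, and the function names are illustrative assumptions; `length[n]` is taken to be the length of the branch above node n.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(u, v, parent, length):
    """Sum of branch lengths on the path between u and v in a rooted tree."""
    ancestors_v = set(path_to_root(v, parent))
    lca = next(n for n in path_to_root(u, parent) if n in ancestors_v)

    def climb(node):
        total = 0.0
        while node != lca:
            total += length[node]
            node = parent[node]
        return total

    return climb(u) + climb(v)

# Hypothetical species tree: leaves f and g under h, and h under a root.
parent = {"f": "h", "g": "h", "h": "root"}
length = {"f": 2.0, "g": 3.0, "h": 1.5}
d_fg = tree_distance("f", "g", parent, length)
d_f_root = tree_distance("f", "root", parent, length)
```

Because f and g may sit at different distances from h, the sketch makes no molecular-clock assumption.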

Gene tree

A gene tree is an unrooted binary tree, characterizing the relationships among genes in the same gene family across different species. Each node represents a particular gene in a gene family, from a particular genome. The branch between related genes a^n and a^m has a length (distance) d(a^n, a^m), representing the evolutionary distance between a^n and a^m. Distances between arbitrary pairs of genes are defined additively, as in the species tree. If a and b are genes from different gene families, d(a, b) = ∞.

Distances of genes and genomes

We assume that genes and genomes evolve in the following way: When a duplication or speciation event occurs, any two homologous gene copies derived from this event begin at evolutionary distance zero. Then they independently accumulate increasing evolutionary distance as time goes by. Within any given genome, all between-gene distances increase at the same rate. However, different genomes are allowed to evolve at different rates, i.e., no universal molecular clock is assumed.

It follows from these assumptions that there is a mapping Φ from genes in gene trees to points in the species tree that roots each gene tree and places each gene at the position where the event that it represents (speciation or duplication) occurs. A gene from a leaf node of a gene tree is mapped to the leaf node in the species tree representing the species from which it comes. A gene from an internal node in the gene tree that represents a speciation event is mapped to the corresponding speciation node in the species tree. A gene from an internal node in the gene tree that represents a duplication event is mapped to the point along the branch in the species tree where that duplication occurs. Finally, for each gene tree, a new node is determined on one of the branches of the gene tree that serves as the root of the gene tree, and this point is also mapped to a point in the species tree. This node in the species tree is not necessarily the root node, as the root of the gene family may trace back to a duplication that occurred before the root of the species tree (i.e., above the root of the species tree, on the branch leading to the root), or all copies of the gene may be missing in some sublineages of the species tree, putting the root of the gene tree below the root of the species tree (Fig. 1).

FIG. 1.
The evolutionary scenario of six genes in gene family a. f and g are two present-day genomes, k and e are two duplication events, and h is a speciation event, which is also the last common ancestor of f and g. If h is the root of the species tree, then ...

We can use the concept of orthology and paralogy (Fitch, 2000) to distinguish the possible relationships between two genes in the same family. If two genes diverged due to a speciation event, then they are orthologous. If two genes diverged due to a duplication event, then they are paralogous. For instance, in the example in Figure 1, f[a^m] and g[a^u] are orthologous, f[a^m] and g[a^s] are orthologous, but f[a^m] and g[a^v] are paralogous, and g[a^u] and g[a^s] are paralogous.

2.3. Isometric reconciliation

We assume that the species tree and all the gene trees are given such that branch lengths in both the species tree and the gene trees reflect the exact evolutionary distances. The species are given for each leaf of each gene tree, but not the species for internal nodes. Any mapping Φ from a gene tree B to a species tree T that roots the gene tree is an isometric reconciliation if

  • (1) Every leaf of B maps to the leaf of the designated species in T.
  • (2) Each internal node of B maps to a speciation node in T or a point on a branch in T.
  • (3) The new root r of B maps to a point Φ(r) on a branch in T such that any other node x in B maps to Φ(x) below Φ(r) and D(Φ(x), Φ(r)) = d(x, r).

Internal nodes of the gene tree that map to speciation nodes in the species tree are determined to be speciation events by an isometric reconciliation Φ, and nodes that map to points along branches are determined to be duplication events. These determinations in turn define the relationships of orthology and paralogy for the gene family.
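Condition (3) and the speciation/duplication rule can be checked mechanically for a candidate mapping. The sketch below is an illustration under stated assumptions, not the authors' reconciliation procedure: a point of the species tree is encoded as `(node, offset)`, meaning `offset` above `node` on the branch to its parent; the toy tree, the mapping `phi`, and all function names are hypothetical.

```python
INF = float("inf")

def distance_up(lower, upper, parent, length):
    """D between two points when `upper` lies above `lower`; None otherwise."""
    (n, t), (m, s) = lower, upper
    if n == m:
        return s - t if s >= t else None
    total = length[n] - t
    cur = parent.get(n)
    while cur is not None and cur != m:
        total += length[cur]
        cur = parent.get(cur)
    return total + s if cur == m else None

def is_isometric(phi, root, dist_to_root, parent, length, tol=1e-9):
    """Check D(phi(x), phi(r)) = d(x, r) for every non-root gene-tree node x."""
    for x, point in phi.items():
        if x == root:
            continue
        D = distance_up(point, phi[root], parent, length)
        if D is None or abs(D - dist_to_root[x]) > tol:
            return False
    return True

def event_type(point):
    """Internal gene-tree node: on a species-tree node or inside a branch?"""
    return "speciation" if point[1] == 0 else "duplication"

# Toy species tree: root h with leaves f and g.
parent = {"f": "h", "g": "h"}
length = {"f": 2.0, "g": 3.0, "h": INF}  # arbitrarily long branch above the root
# Toy gene family: x1 and x2 in f (duplicated at k), x3 in g, root r at h.
phi = {"x1": ("f", 0.0), "x2": ("f", 0.0), "x3": ("g", 0.0),
       "k": ("f", 1.0), "r": ("h", 0.0)}
dist_to_root = {"x1": 2.0, "x2": 2.0, "x3": 3.0, "k": 1.0}
ok = is_isometric(phi, "r", dist_to_root, parent, length)
```

Changing any gene-tree distance, for example setting d(x3, r) to 2.5, makes the check fail, which is the situation in which the input is rejected as not reconcilable.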

Theorem 1.

Given a set equation M23 of unrooted gene trees with designated leaf species and a species tree T, the isometric reconciliation of equation M24 with T is unique and there is an efficient procedure to either construct it or detect that no such isometric reconciliation exists.

We describe our algorithm below and sketch the proof briefly. We leave the detailed proof in the Appendix.

The reconciliation algorithm we define works from the leaves of a gene tree inwards. When all the nodes in two of the three subtrees branching off from an internal node x have been processed, then x can be processed. Other than this, the order of processing is arbitrary. In the last step, all three subtrees of the last remaining unprocessed node x will already have been processed, and processing x completes the reconciliation. This reconciliation algorithm is less complicated than the traditional methods, e.g., Goodman et al. (1979) and Guigo et al. (1996). When the species tree is unknown and one wants to reconcile an optimal species tree to minimize the number of duplications using the given topologies of gene trees, the problem is NP-hard (Ma et al., 2000). In our case, the true species tree is known and the distances in the gene trees are exact.

Let x be an unmapped internal node of the gene tree B, with branches to nodes u, v and z. Suppose the mappings Φ(u) and Φ(v) of the nodes u and v to the species tree T have already been determined. Then the following procedure will map the node x to the species tree, or reject the input B as not reconcilable. As a side effect, if necessary the procedure will root the tree B.

MapGeneTreeNode (x)

Let d1 = d(x, u), d2 = d(x, v), λ be the last common ancestor of Φ(u) and Φ(v) in T (if Φ(u) = Φ(v) then λ = Φ(u) = Φ(v)), equation M25 and equation M26.

  • (1) If equation M27, reject and exit.
  • (2) Let equation M28.
  • (3) If equation M29 (and equation M30) then map x to a point Φ(x) at distance ε/2 above λ in the species tree T.
  • (4) Else if ε = 0 then map x to a point Φ(x) at distance d1 from Φ(u) and d2 from Φ(v) on the path connecting Φ(u) and Φ(v) in T.
  • (5) Else (i.e., if ε > 0 and equation M31) if the gene tree B is already rooted then reject and exit
  • (6) Else
    • (a) Place the root r of B at distance equation M32 from u and equation M33 from v on the path that connects u and v in B, and map r to a point Φ(r) at distance equation M34 above λ.
    • (b) If equation M35 then map x to the point Φ(x) at distance d1 above Φ(u), else (i.e. if equation M36) map x to the point Φ(x) at distance d2 above Φ(v).
  • (7) If z has already been mapped to the node Φ(z) in T, then
    • (a) If the tree is already rooted, reject and exit unless Φ(z) lies below Φ(x) in T or vice-versa and D(Φ(x), Φ(z)) = d(x, z).
    • (b) Else
      • i. Let d = d(x, z), λ be the last common ancestor of Φ(x) and Φ(z) in T, equation M37, and equation M38.
      • ii. If ε < 0, reject and exit.
      • iii. Else if ε = 0 and Φ(x) = λ, reject and exit.
      • iv. Else place the root r of B at distance equation M39 from x and equation M40 from z on the path that connects x and z in B, and map r to a point Φ(r) at distance ε/2 above λ.

We construct the reconciled tree equation M41 for a set of gene trees as follows (Fig. 2). We initialize equation M42 to be the same as species tree T. Eventually in equation M43, additional points will be labeled along branches based on the mapping from nodes in the gene trees.

FIG. 2.
(A) A species tree T with distances labeled on each branch. (B) Gene tree Ta. (C) Gene tree Tb. (D) Reconciled tree equation M44 after Ta is merged. (E) Reconciled tree equation M45 after both Ta and Tb are merged. (F) Simplified version of equation M46, where I, J, K, and H show four ...

ReconcileTrees (T, equation M49)

Input: species tree T, gene trees equation M50

Output: reconciled tree equation M51

equation M52
for all gene trees equation M53 such that 1 ≤ i ≤ N do
  M ← ∅ (the set of maximally internal mapped nodes)
  for all leaf nodes k in equation M54 do
    Φ(k) ← g, where g is the genome to which k belongs
    add k to M and set tk = {k} (mapped subtree for equation M55)
  end for
  while M has more than two tree nodes do
    if equation M56 and u, v and z are connected to an unmapped node x in equation M57 then
      MapGeneTreeNode(x)
      remove u, v and (if necessary) z from M
      if M is empty then
        set M to the set consisting of just the root node r and set tr = B
      else
        add x to M and set tx = tu ∪ tv ∪ {x}
      end if
    end if
  end while
end for

We now sketch the proof of Theorem 1; the detailed proof can be found in the Appendix.

Proof

We will show that the algorithm ReconcileTrees produces an isometric reconciliation or detects if no such reconciliation is possible. First, we claim that at every step in the process, each of the maximally internal mapped nodes equation M58 in the current gene tree equation M59 is associated with a subtree tm of mapped nodes of B that has the following properties

  1. If tm does not contain the root r then Φ(m) lies above Φ(y) for all equation M60 and d(m, y) = D(Φ(m), Φ(y)), else Φ(r) lies above Φ(y) for all equation M61 and d(r, y) = D(Φ(r), Φ(y))
  2. All leaves of tm are mapped to the genome of the species to which they belong

and that Φ is the unique mapping with these properties. This is trivially true after the initial step, in which each of the leaves of B is mapped. We verify that if this is true before a pass through the sequence of operations in the while loop, then it is still true after that sequence of operations, if the input is not rejected. Therefore it is true when the processing of B is complete, if the input is not rejected. However, at this point the set M consists of just the root node r and all nodes of B are included in tr. Thus, every node x in B will be uniquely mapped to a point Φ(x) in equation M62 that lies below Φ(r) such that d(x, r) = D(Φ(x), Φ(r)). Hence, B will be uniquely isometrically reconciled with equation M63. It follows that if the algorithm ReconcileTrees terminates without rejecting its input, it produces a unique isometric reconciliation of all its input gene trees with its input species tree.

We also show that if the procedure MapGeneTreeNode rejects its input, then the tree B is not isometrically reconcilable. It follows that if ReconcileTrees rejects its input, then its input is not isometrically reconcilable. Thus, ReconcileTrees is correct. Finally, we prove that the time complexity of ReconcileTrees is bounded by O(nm), where n is the number of tree nodes in species tree T and m is the total number of tree nodes in all gene trees. This completes the proof. ∎

Once we have the reconciled tree equation M64 and the rooted gene tree equation M65, we could easily embed equation M66 into equation M67 according to the mapping Φ for each node. However, for some nodes g in equation M68, we may not have any node x in the original equation M69 such that Φ(x) = g. This is because in a single gene family equation M70 we usually do not have explicit branching nodes for all the combined speciations and duplications that occur for the gene families in the final equation M71. Therefore, in order to facilitate the gene-order inference in the following sections, we augment equation M72 in the following way.

AugmentGeneTree (equation M73)

For each branch (u, v) in equation M74, let f = Φ(u) and h = Φ(v) in equation M75. For every node g on the path from f to h in equation M76 excluding f and h, we add a node x on branch (u, v) such that d(u,x) = D(f, g) (and d(x, v) = D(g, h)). Set Φ(x) = g. We call the resulting equation M77 an augmented gene tree, denoted as equation M78.
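The augmentation of a single branch can be sketched as follows. This is a minimal illustration under assumptions: the caller is assumed to supply, for branch (u, v), the reconciled-tree nodes strictly between Φ(u) and Φ(v) together with their distances D(f, g) from f = Φ(u); the function name and the naming scheme for the new gene nodes are hypothetical.

```python
def augment_branch(u, v, branch_len, intermediate):
    """Split gene-tree branch (u, v) at each reconciled-tree node on its path.

    intermediate: list of (g, D(f, g)) for nodes g strictly between
    f = Phi(u) and h = Phi(v). Returns the replacement edges as
    (parent, child, length) triples and the mapping Phi for the new nodes.
    """
    edges, phi = [], {}
    prev, prev_d = u, 0.0
    ordered = sorted(intermediate, key=lambda item: item[1])
    for k, (g, dist) in enumerate(ordered):
        x = f"{u}|{v}#{k}"          # hypothetical name for the added gene node
        edges.append((prev, x, dist - prev_d))
        phi[x] = g                  # the new node maps to g, so d(u, x) = D(f, g)
        prev, prev_d = x, dist
    edges.append((prev, v, branch_len - prev_d))
    return edges, phi

edges, phi_new = augment_branch("u", "v", 5.0, [("H", 3.5), ("G", 2.0)])
```

The edge lengths always sum to the original branch length, so the distances d in the gene tree are unchanged by augmentation.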

Note that the genes added along the branches of each gene tree represent intermediate forms that are inferred to have existed but do not appear in the original gene tree for that family. Also, the root of an augmented gene tree equation M79 does not always have to map to the root of equation M80.

Example

In Figure 2, we originally have a species tree and unrooted gene trees for gene families a and b. We also have the distance information, as shown in Figure 2A–C. We first reconcile Ta and add duplication nodes H and I to equation M81 (Fig. 2D). After Tb is merged, duplication nodes J and K are added (Fig. 2E). Note that based on the distance information, we would know that K corresponds to an ancient duplication before G. Figure 2F shows a simplified version of equation M82 without embedded genes. Figure 2G,H are the rooted augmented gene trees for both gene families.

Direct ancestor and direct descendant

In an augmented gene tree equation M83, for every branch (u, v), where u is the parent of v in the tree, let g = Φ(v) and h = Φ(u) (based on the result of reconciliation, (g, h) must be an edge in equation M84, where h preceded g):

  • (1) We call u the direct ancestor of v, denoted as Ah (g[v]) = u.
  • (2) We call v a direct descendant of u. The direct descendant may not be unique since u could be duplicated at h, and therefore we denote this as equation M85.

If a node v in equation M86 has no ancestor, then Ah (g[v]) = ∅. Conversely, if a node u has no descendant in g, then Dg (h[u]) = ∅.

Example

In the augmented gene trees from Figure 2G,H, equation M87

2.4. Reconstructing ancestral adjacencies

After obtaining a reconciled tree equation M88 and augmented gene trees equation M89 for all gene families, our goal is to determine lists of gene orders that closely approximate the genome structure of the target ancestral genome, α. We achieve this in two phases:

  1. For each gene x in α, we determine a set of predecessors for x that gives the minimum number of predecessor changes for x based on parsimony. We also determine such a set of successors for x. Then we transform the predecessor and successor relationships in α into potential ancestral adjacencies for each gene in α.
  2. Based on the predicted ancestral adjacencies, we connect the genes into CARs.

We discuss (1) in this section, leaving (2) to the next section. Our approach for (1) is inspired by Fitch's method (Fitch, 1971), which was originally used to infer minimum character changes in a specified tree topology. For that problem, one is given a phylogenetic tree and a letter for every position in each leaf of the tree (corresponding to the contents of orthologous sequence sites). The problem is to infer the ancestral letters (corresponding to internal nodes of the tree), so as to minimize the number of substitutions, i.e., differences between the letters at each end of an edge in the tree.

Here, we deal with sequences of genes having orientations, rather than characters of nucleotides or amino acids, and instead of keeping track of letters at a particular sequence position, we track the genes for both of the immediately adjacent positions. For example, in Figure 3, the leaves indicate that s(A[a]) = A[b], s(B[a]) = B[c], s(C[a]) = C[b], and s(D[a]) = D[c1]. Which gene following G[a] would give a minimum number of successor changes to explain the data in the leaves?

FIG. 3.
This figure shows a reconciled tree for four leaf genomes as well as the augmented gene trees for genes a, b, c, and d. Duplication event H only duplicates c. The comma separates two chromosomes in genomes A and C.

Predecessor set and successor set

For any genome g, we associate with each gene g[x] (including its reverse complement −g[x]) two sets of signed genes, denoted P(g[x]) and S(g[x]) (P(−g[x]) and S(−g[x]) for reverse complements), giving potential predecessors and successors of g[x]. If g is a modern genome, P(g[x]) = {p(g[x])} and S(g[x]) = {s(g[x])} for each g[x]. If x is not in g, then both sets are empty.

We also allow the direct ancestor and direct descendant operations to be applied to a set of genes, i.e., equation M90, and the result is also a set. Dh (P(g[x])) is defined analogously.

The inference procedure for the predecessor and successor of each ancestral gene (and its reverse complement) in the augmented gene tree consists of two stages. The first stage works in a bottom-up fashion. The general idea is as follows. For each node x in the augmented gene tree, let g be the genome containing x and let u and v be its two children in the tree. We compute its predecessor set according to the following rule: If x is a leaf, then P(x) consists of the unique predecessor. Otherwise, P(x) is the intersection of Ag(P(u)) and Ag(P(v)) if that intersection is non-empty, and their union if the two sets are disjoint. Here we need to apply the direct ancestor operation A to each gene in a predecessor set, because each gene tree may have a different evolutionary history at the point of g. The successor set is inferred similarly. The recursive procedure is described in GetPredSuccBottomUp(x). Note that x may have only one child. We define P(nil) = ∅, S(nil) = ∅, and Ag(∅) = ∅, so that the calculations from line 5 to line 14 still hold when x has only one child in the augmented gene tree.

In the second stage, we propagate the information down the tree so that P(x) and S(x) for every gene x in the target ancestor α can incorporate outgroup information. It works in a top-down fashion. For each node x in the augmented gene tree, let w be x's parent in the tree. We propagate P(w) down the tree to adjust the original P(x). We also need the direct descendant operation D here. S(x) is adjusted analogously. The whole procedure is summarized as AdjustPredSuccTopDown(w, x) in recursive form.

GetPredSuccBottomUp(x)

1: if x is a non-leaf node then
2:   GetPredSuccBottomUp(u)
3:   GetPredSuccBottomUp(v)
4:   g ← Φ(x)
5:   if Ag(P(u)) ∩ Ag(P(v)) ≠ ∅ then
6:     P(x) ← Ag(P(u)) ∩ Ag(P(v))
7:   else
8:     P(x) ← Ag(P(u)) ∪ Ag(P(v))
9:   end if
10:  if Ag(S(u)) ∩ Ag(S(v)) ≠ ∅ then
11:    S(x) ← Ag(S(u)) ∩ Ag(S(v))
12:  else
13:    S(x) ← Ag(S(u)) ∪ Ag(S(v))
14:  end if
15: end if

AdjustPredSuccTopDown(w, x)

if w is a non-leaf node then
  if w is not the root then
    h ← Φ(w), g ← Φ(x)
    if equation M97 then
      equation M98
    end if
    if equation M99 then
      equation M100
    end if
    AdjustPredSuccTopDown(x, u)
    AdjustPredSuccTopDown(x, v)
  end if
end if

The time complexity for both procedures is O(mn), where m is the number of nodes in the augmented gene tree and n is the total number of leaves in the augmented gene tree, because the tree traversal takes O(m) and there are at most n members in each set, either P or S.
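The two passes can be sketched in Python for the simplest setting. This is an illustration under loudly stated assumptions, not the authors' implementation: there are no duplications, so the direct ancestor and direct descendant operations are treated as the identity; the top-down rule (restrict a child's set to its intersection with the parent's set when that intersection is non-empty) is an assumed reading of the second stage; the tree, the leaf sets, and the function names are hypothetical.

```python
def bottom_up(node, children, sets):
    """Fitch-style pass: intersection if non-empty, otherwise union."""
    kids = children.get(node, [])
    if not kids:
        return sets[node]                      # leaf: singleton given as input
    child_sets = [bottom_up(k, children, sets) for k in kids]
    common = set.intersection(*child_sets)
    sets[node] = common if common else set.union(*child_sets)
    return sets[node]

def top_down(node, children, sets):
    """Assumed adjustment: narrow each child set using its parent's set."""
    for kid in children.get(node, []):
        shared = sets[node] & sets[kid]
        if shared:
            sets[kid] = shared
        top_down(kid, children, sets)

# Toy tree R -> (X -> (L1, L2), Y -> (L3, L4)) with successor sets at the leaves.
children = {"R": ["X", "Y"], "X": ["L1", "L2"], "Y": ["L3", "L4"]}
sets = {"L1": {"b"}, "L2": {"c"}, "L3": {"b"}, "L4": {"b"}}
bottom_up("R", children, sets)
after_bottom_up = {k: set(v) for k, v in sets.items()}
top_down("R", children, sets)
```

In this toy case the bottom-up pass leaves X ambiguous between b and c, and the top-down pass resolves it to b using the information from the other side of the tree.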

Theorem 2.

For any augmented gene tree, GetPredSuccBottomUp and AdjustPredSuccTopDown will give a set of genes for each node in the tree that are most parsimonious in terms of successor changes in the entire augmented gene tree. Similarly, it will also give a set of genes for each node that are most parsimonious in terms of predecessor changes.

Proof

We first prove that GetPredSuccBottomUp gives node x a set of genes that are most parsimonious in terms of successor changes among the descendants of x. Let k(x) denote the minimum number of successor changes in the subtree rooted at x. We prove the claim by induction. Let u and v be the two children of x. Basis: if the tree height h = 1, then u and v are leaves in the gene tree. If u and v have the same successor, then no successor change is needed; k(x) = 0. Otherwise, only one change is needed based on the bottom-up inference; k(x) = 1. Induction: we assume the claim is correct when the subtree height is h, and then prove it for height h + 1. If the intersection of the successor sets for u and v is not empty, then we can have k(x) = k(u) + k(v) by assigning any gene in the intersection to x. Otherwise, the optimal k(x) is k(u) + k(v) + 1, obtained by assigning to x any gene in the union of the successor sets of u and v.

We then prove that AdjustPredSuccTopDown gives a set of genes that will have the most parsimonious successor changes from x's ancestors to x without changing the parsimony scores k for each node. Let j(x) denote the minimum number of successor changes from the root of the augmented gene tree to x. Let w be the parent node of x. Basis: when the path length from the root to x is h = 1, i.e., w is the root, if w and x share a successor, then no successor change is needed if we assign both w and x a successor they share; j(x) = 0. Otherwise, only one change is needed from any gene of the successor set of w to any gene of the successor set of x; j(x) = 1. Induction: we assume the claim is correct when the path length from the root is h, and then prove it for length h + 1. If the intersection of the successor sets for w and x is not empty, then we can have j(x) = j(w) by assigning any gene in the intersection to x. Otherwise, the optimal j(x) is j(w) + 1. Note that the adjustment performed by AdjustPredSuccTopDown does not change the parsimony score k(x), i.e., the final genes in the successor set for x still give the most parsimonious successor changes in its subtree. This completes the proof. We can prove in the same way that the claim is also true for predecessor changes. ∎

Example

In Figure 3, after running the above algorithm, we will have S(G[a]) = {G[b], G[c]}. We will also have P(G[c]) = {$}, etc.

In the above example, it is interesting to note that even though equation M101.

Ancestral adjacency graph

As shown in the previous example, we know that equation M102 does not guarantee that equation M103, and vice versa. However, in order to get the ancestral adjacency, we need to retain consistent predecessor and successor relationships in the target ancestor. We achieve this by constructing an ancestral adjacency graph.

We first construct a directed edge set E1 such that: equation M104, for x and its reverse complement −x in the target ancestor. Here (y, x) denotes an arc directed from y, a predecessor of x, to x. We also construct directed edge set E2 such that: equation M105. Here (x, y) denotes an arc directed from x to y, a successor of x.

We then construct a digraph G = (V, E), where:

V = {x, −x : x is a gene in the target ancestor α}, E = E1 ∩ E2

Node set V(G) includes all the genes and their reverse complements, i.e., nodes exist in gene pairs as x and −x in the target ancestor α. Therefore, |V| = 2N, where N indicates the total number of genes in the target ancestor. The intersection of E1 and E2 will eliminate all the edges where one of the endpoints is $ because there are no predecessor and successor sets for $. Since there is no $ in G, the implication for the following process of finding CARs is that when a gene can be followed most parsimoniously by either certain other genes or $, we will favor the non-$ to reduce the number of CARs. (Similarly for predecessors.)

Now, we have the ancestral adjacency graph G indicating consistent predecessor and successor relationships that are supported by equation M107, all equation M108, and the leaf genomes.
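A minimal sketch of this construction, under stated assumptions: predecessor and successor sets are given as Python dictionaries, E1 holds an arc (y, x) for every y in P(x), E2 holds an arc (x, y) for every y in S(x), and the edge set of G is their intersection. Reverse complements are left out for brevity, and build_adjacency_edges is a hypothetical name.

```python
END = "$"  # end-of-chromosome marker, as in the text


def build_adjacency_edges(pred, succ):
    """Return the arcs supported by both predecessor and successor sets.

    pred -- dict: gene x -> set of candidate predecessors P(x)
    succ -- dict: gene x -> set of candidate successors S(x)
    (hypothetical input format, assumed for illustration)
    """
    e1 = {(y, x) for x, ys in pred.items() for y in ys}  # y precedes x
    e2 = {(x, y) for x, ys in succ.items() for y in ys}  # y follows x
    # "$" has no P or S set of its own, so arcs touching it never
    # appear in both E1 and E2 and drop out of the intersection.
    return e1 & e2


if __name__ == "__main__":
    succ = {"a": {"b", "c"}, "b": {END}}
    pred = {"b": {"a"}, "c": {END}}
    print(build_adjacency_edges(pred, succ))
```

In this toy input, c is a candidate successor of a, but a is not a candidate predecessor of c, so only the arc (a, b) survives.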

2.5. From ancestral adjacency to ancestral gene order

The edges of the ancestral adjacency graph G indicate consistent predecessor and successor relationships. However, they do not necessarily indicate a unique adjacency relationship for a particular gene. Three potential ambiguous cases can occur in G, as depicted for node i in Figure 4. In (A), i has several incoming edges; in (B), i has several outgoing edges; in (C), i forms a cycle with j, where each node j satisfies indegree(j) = outdegree(j) = 1. (If a more complex cycle exists, then some node falls in either case (A) or case (B)).

FIG. 4.
Three potential ambiguous cases in the ancestral adjacency graph G.

Problem

In our directed cyclic graph G(V, E), each edge (i, j) has a weight wij. If no weights are given, we set every wij to 1. Our objective is to find a set of vertex-disjoint directed paths that covers all the nodes in G and at the same time maximizes the sum of weights in the resulting paths. We call this the maximum weighted path cover problem (MWPCP). We also allow degenerate paths consisting of a node without edges. We call each resulting path a CAR.

If none of the ambiguous cases (Fig. 4) are present, then G itself forms the set of paths that covers all the nodes. In this case, the CARs can be directly defined from this graph as discussed below. When ambiguity exists, we need to resolve it and choose appropriate directed edges to form CARs.

The case where all wij are 1 is equivalent to the minimum path cover problem (MPCP), i.e., finding the minimum number of vertex-disjoint paths covering all the nodes in the digraph. In that case, the result from MPCP is also the result for MWPCP in a unit-edge-weight graph: a cover of |V| nodes by k vertex-disjoint paths contains exactly |V| − k edges, so the minimum number of paths implies the maximum number of edges. The MPCP is NP-hard (Boesch and Gimpel, 1977). When antiparallel edges are not allowed, i.e., directed edges (i, j) and (j, i) do not co-exist, approximation algorithms for MWPCP have been studied (Moran et al., 1990).

We use a greedy approach to obtain an approximate solution when antiparallel edges are allowed, given in the algorithm FindCARs below. We first sort the edges by weight. The greedy approach then always tries to add the heaviest remaining edge to the resulting path set.

FindCARs(G)

Input: ancestral adjacency graph G

Output: contiguous ancestral regions in G

1: sort edges by weight in descending order.

2: create a new graph C, V(C) ← V(G) and E(C) ← ∅

3: for all edges (i, j) ∈ E(G) do

4:  h(i) ← i and q(i) ← i

5:  h(j) ← j and q(j) ← j

6: end for

7: for all available (i, j) ∈ E(G), in order of edge weight do

8:  if outdegree(i) = 0 and indegree(j) = 0 and h(i) ≠ h(j) then

9:   add edges (i, j) and (−j, −i) to E(C)

10:   update outdegree(i) and indegree(j) in C

11:   update outdegree(−j) and indegree(−i) in C

12:   h(q(j)) ← h(i), q(q(i)) ← q(j), q(q(j)) ← q(i)

13:   h(q(−i)) ← h(−j), q(q(−i)) ← q(−j), q(q(−j)) ← q(−i)

14:  end if

15: end for

Note that h and q in the above algorithm are two auxiliary variables that are used to avoid creating cycles: h(i) indicates the path that node i is in, and q(i) is the other endpoint of that path. The requirement on line 8 (h(i) ≠ h(j)) guarantees that there will be no cycle in the resulting path cover. The paths in C at termination correspond to the desired CARs.

When adding edges into an existing path, particular care is needed to avoid putting j and −j in the same CAR. We also add both (i, j) and its symmetric version, (−j, −i). For each path found by this approach, a symmetric path in the opposite orientation is also found, since we have nodes for both i and −i. The two paths correspond to the same CAR, and eventually we choose one of them.

Letting G have n edges, the time complexity of FindCARs is O(n log n) because the sorting takes O(n log n), the initialization from lines 3–6 takes O(n), and lines 7–15 take O(n).
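The greedy procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the DUPCAR implementation: genes are signed integers (x and −x), every node's reverse complement is present in the node list, and a union-find structure is swapped in for the h and q bookkeeping to detect cycles. The check that j and −j never end up in the same path is written out explicitly. The name find_cars is hypothetical.

```python
def find_cars(nodes, weights):
    """Greedy path cover sketch.

    nodes   -- signed nonzero integers; x and -x are a gene and its
               reverse complement (both must be listed)
    weights -- dict mapping a directed edge (i, j) to its weight
    Returns one representative path per pair of mirrored paths.
    """
    nodes = list(nodes)
    parent = {v: v for v in nodes}  # union-find over path membership

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    succ, pred = {}, {}
    # Heaviest edges first; ties keep input order because sorted is stable.
    for (i, j), _w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if i in succ or j in pred:
            continue  # i already has a successor, or j a predecessor
        if find(i) == find(j) or find(i) == find(-j):
            continue  # would close a cycle, or join j and -j in one path
        for a, b in ((i, j), (-j, -i)):  # add the edge and its mirror
            succ[a], pred[b] = b, a
            parent[find(a)] = find(b)

    cars, seen = [], set()
    for v in nodes:
        if v in pred:
            continue  # v is not the first node of a path
        path = [v]
        while path[-1] in succ:
            path.append(succ[path[-1]])
        mirror = tuple(-g for g in reversed(path))
        key = max(tuple(path), mirror)  # pick one of the two orientations
        if key not in seen:
            seen.add(key)
            cars.append(list(key))
    return cars


if __name__ == "__main__":
    genes = [1, -1, 2, -2, 3, -3]
    w = {(1, 2): 1.0, (2, 3): 0.5, (1, 3): 0.4, (3, 1): 0.3}
    print(find_cars(genes, w))
```

In the toy input, edge (1, 3) is rejected because node 1 already has a successor, and edge (3, 1) is rejected because it would close a cycle, leaving the single CAR 1, 2, 3.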

Theorem 3.

FindCARs is a factor-3 approximation algorithm for MWPCP.

Proof

Line 8 in FindCARs guarantees that the result cover is a path cover of G. Let M* be the optimal path cover for MWPCP and let FindCARs generate M. We consider each arc (b, c) in M*. If (b, c) is not in M, there are three possible cases:

  • (1) Based on the greedy approach, at least one of (b, e) or (f, c) is in M (Fig. 5A), such that wbc ≤ max(wbe, wfc). But since M* is optimal, we have wbc ≥ wbe + wfc ≥ max(wbe, wfc). Therefore, this case happens only when wbc = max(wbe, wfc) and min(wbe, wfc) = 0. In other words, this case does not change the total weight between M* and M.
    FIG. 5.
    Three possible cases when the arc (b, c) is in the optimal solution but not in the CARs produced by FindCARs.
  • (2) There are (e, b) and (c, f) in M and there is a path from c to b going through e and f in M, with two additional possible edges (a, b) and (c, d) in M* but not in M (Fig. 5B). Note that e and f could be the same node. The worst case for this situation is that edges (e, b) and (c, f) can be replaced by edges (a, b), (b, c), (c, d) to have a better cover. In this case, based on the greedy approach, we have wab + wbc + wcd ≤ web + wcf + min(web, wcf) = 2 min(web, wcf) + max(web, wcf). Since M* is optimal, we also have wab + wbc + wcd ≥ web + wcf = min(web, wcf) + max(web, wcf); otherwise we would reach a contradiction, because we could obtain a cover better than M* by removing (a, b), (b, c), and (c, d) from M* and adding the path from c to b going through e and f. Therefore, equation M111.
  • (3) As shown in Figure 5C, this case is similar to (2), but there is only (c, b) in M, i.e., there is a pair of antiparallel edges. For the worst case, similar to the deduction in (2), we can bound the difference such that equation M112.

Therefore, in the worst case, every edge (c, b) in M produced by FindCARs can be replaced by three edges (a, b), (b, c), and (c, d) in M* to get a better cover. Hence, equation M113.

Finally, we describe a simple approach to determine the edge weights wij. For a directed edge (i, j), if outdegree(i) = 1 and indegree(j) = 1, we set wij = 1. Otherwise, the corresponding weight wij = wα(i, j) (α is target ancestor) is determined recursively.

Before computing wα(i, j), we modify the reconciled tree equation M114 by rerooting it if α is not the root of equation M115. We denote the rerooted binary tree as equation M116. Then we apply the following rule:

equation M117

where equation M118 and equation M119 and D(α, φ) are the branch lengths from α to the first left and first right non-duplication node in equation M120; wτ(i, j) and wφ(i, j) are the edge weights on τ and φ, respectively. On a leaf genome f of equation M121, if there is an adjacency (s, t) present in f such that s belongs to gene family ax and t belongs to gene family ay, we set wf(i, j) = 1; otherwise wf(i, j) = 0. Note that if an edge (i, j) is involved in ambiguous case (A) or (B), then wij < 1. The underlying assumption of the above equation is that rearrangement is more likely to happen on longer branches.
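The base case of the recursion, the 0/1 weight on a leaf genome, can be sketched as below. This covers only the leaf rule described in the text, not the recursive combination over branches. It assumes a leaf genome is a list of chromosomes, each a list of gene labels, with a dictionary giving each gene's family; gene orientation is ignored for simplicity. The name leaf_weight and the input format are hypothetical.

```python
def leaf_weight(chromosomes, family_of, fam_i, fam_j):
    """Return 1 if some adjacency (s, t) in the leaf genome has s in
    family fam_i and t in family fam_j, and 0 otherwise.

    chromosomes -- list of chromosomes, each a list of gene labels
    family_of   -- dict: gene label -> gene family identifier
    (hypothetical input format; orientation is ignored in this sketch)
    """
    for chrom in chromosomes:
        for s, t in zip(chrom, chrom[1:]):  # consecutive gene pairs
            if family_of[s] == fam_i and family_of[t] == fam_j:
                return 1
    return 0


if __name__ == "__main__":
    genome = [["g1", "g2", "g3"], ["g4"]]
    families = {"g1": "ax", "g2": "ay", "g3": "az", "g4": "ax"}
    print(leaf_weight(genome, families, "ax", "ay"))
```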

2.6. The summary of the DUPCAR algorithm

In outline, the whole DUPCAR algorithm can be described as follows, where α is the target ancestor.

DUPCAR (equation M122)

Input: modern genomes equation M123, species tree T, target ancestor α, gene trees equation M124

Output: contiguous ancestral regions in α

1: Reconcile Trees (equation M125)

2: initialize P(x) and S(x) for each gene x in equation M126.

3: for all gene families ai do

4:  GetPredSuccBottomUp (equation M127)

5:  AdjustPredSuccTopDown (equation M128)

6: end for

7: build the ancestral adjacency graph G for α.

8: FindCARs(G)

3. Results

3.1. Simulations

We used extensive simulations to validate our analysis. The simulator starts with a hypothetical “ancestor” genome, which evolves into the extant species through speciation and operations including inversion, translocation, fusion, fission, insertion, deletion, and duplication. When an operation is applied, breakpoints are chosen uniformly at random from the set of used or unused breakpoints on that chromosome, according to the breakpoint reuse ratio.

We tuned the weights of these operations in order to generate simulated data that closely resemble placental mammalian genomes. We simulated datasets using the phylogenetic tree: ((((human, chimp), rhesus), (mouse, rat)), dog). The ancestor genome was assigned 2,000 genes. After evolution is performed, we collapse genes where order and orientation are conserved across all species. For instance, if genes a, b, and c appear in every genome with order (a b c), then we treat these three genes as one. This strategy allows us to evaluate only the varied adjacencies, since the unbroken adjacencies will be found by essentially any procedure.
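The collapsing step can be sketched as follows. This is a simplified illustration, not the simulator's code: it assumes every gene occurs exactly once in every genome (no duplications or losses), represents genes as signed integers, and treats the adjacency (a, b) as identical to (−b, −a). The names canonical and conserved_blocks are hypothetical.

```python
def canonical(a, b):
    """Orientation-independent form of the adjacency a -> b."""
    return min((a, b), (-b, -a))


def conserved_blocks(genomes, reference):
    """Group genes of the reference genome into blocks whose internal
    adjacencies are present in every genome.

    genomes   -- dict: species name -> list of chromosomes, each a list
                 of signed integers (hypothetical input format)
    reference -- species whose gene order is used to list the blocks
    """
    adjacency_sets = []
    for chromosomes in genomes.values():
        adjacencies = set()
        for chrom in chromosomes:
            for a, b in zip(chrom, chrom[1:]):
                adjacencies.add(canonical(a, b))
        adjacency_sets.append(adjacencies)
    shared = set.intersection(*adjacency_sets)

    blocks = []
    for chrom in genomes[reference]:
        for k, gene in enumerate(chrom):
            if k > 0 and canonical(chrom[k - 1], gene) in shared:
                blocks[-1].append(gene)  # extend the current block
            else:
                blocks.append([gene])  # start a new block
    return blocks


if __name__ == "__main__":
    toy = {"A": [[1, 2, 3, 4]], "B": [[-3, -2, -1, 4]]}
    print(conserved_blocks(toy, "A"))
```

In the toy input, genes 1, 2, and 3 keep their relative order and orientation in both genomes (inverted as a unit in B), so they collapse into one block, while gene 4 stays separate.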

The simulator's parameters or weights of the large-scale operations were adjusted to guarantee that the extant species had approximately 1,000 genes after collapsing. Moreover, we arranged that approximately 10% of the genes are in multi-gene families.

We generated synthetic data in two categories: (1) data without insertion and deletion of genes and (2) data with insertion and deletion of genes, where the insertion and deletion ratio is 1:2. For each category, we simulated data by varying the breakpoint reuse ratio (0%, 10%, 20%, 30%, 40%, and 50%), with 100 datasets per parameter set. The breakpoint reuse ratio is defined as the percentage of all ancestral adjacencies that were changed more than once during the simulated evolution.

We ran our DUPCAR reconstruction program for inferring CARs on each dataset (average running time of 1.5 min) and compared the predicted adjacencies with the real (simulated) ones. We calculate the error rate, S, of adjacencies:

equation M129

where R is the set of adjacencies in the real genome, P is the set of adjacencies in the predicted genome, and |X| denotes the size of the set X. S represents the percentage of adjacencies where the real genome and the predicted genome disagree.
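A minimal sketch of an adjacency error rate consistent with this description, assuming S is the fraction of adjacencies in R ∪ P that are not shared by both sets. That normalization is an assumption made for illustration, and adjacency_error_rate is a hypothetical name.

```python
def adjacency_error_rate(real, predicted):
    """Fraction of adjacencies on which the two genomes disagree.

    real, predicted -- sets of adjacencies, e.g. tuples (a, b)
    Assumption: disagreement is measured as the symmetric difference
    of the two sets divided by the size of their union.
    """
    union = real | predicted
    if not union:
        return 0.0  # nothing to compare
    return len(real ^ predicted) / len(union)


if __name__ == "__main__":
    R = {(1, 2), (2, 3)}
    P = {(1, 2), (2, 4)}
    print(adjacency_error_rate(R, P))
```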

We compared every ancestral genome, including the human-dog ancestor, which has no outgroup information. The results are summarized in Figure 6. For every ancestor, the error rate increases when the breakpoint reuse ratio increases. Also, the error rate for category (1) (without indels) is always less than for category (2) (with indels). In the first panel of Figure 6, when we do not allow breakpoint reuse (i.e., each ancestral adjacency has been broken at most once), S is 0 for category (1) for each ancestor except the human-dog ancestor, indicating 100% accuracy. We could not attain 100% accuracy for the human-dog ancestor without an outgroup to resolve ambiguities.

FIG. 6.
The error rate, S (defined in the main text), for the reconstruction of each ancestral genome when we vary the breakpoint reuse ratio. The light gray bars represent data without insertion and deletion of genes, while darker bars represent data with indels. ...

The insertions and deletions hurt the accuracy, mostly because we only infer the presence or absence of a gene in an ancestral genome directly from the gene trees that are inferred based on what we observe from the leaves. For example, if gene family x is present in both primates and rodents but absent in dog, we will simply root the gene tree under the human-dog ancestor, and will never reconstruct x in the human-dog ancestor. However, it is possible that gene x was deleted on the dog lineage.

3.2. Application to mammalian chromosome X reconstruction

We compared the genomic sequences of the X chromosomes of four species (human, mouse, rat, and dog). Using blastz pairwise alignments and self-alignments (Schwartz et al., 2003), we partitioned the chromosome into 201 conserved segments that have not been disrupted by rearrangements or duplications, covering over 85% of the human genome. The species tree is ((hg18, (mm8, rn4)), canFam2), where leaf names indicate the assemblies we used. For each family, multiple alignments are created and a distance matrix is calculated using the Jukes-Cantor model. A gene tree, or more precisely a conserved segment tree, is inferred using a modified version of the Neighbor-Joining algorithm, which utilizes the genomic context of the genes in addition to evolutionary distances in the tree-building process. Duplications and losses are identified in the gene trees via reconciliation with the species tree. All of the duplication events are combined into a single reconciled tree, and augmented gene trees are generated accordingly.

We applied our new algorithm to the chromosome X data and successfully reconstructed the boreoeutherian chromosome X into two CARs. The two CARs are separated because of two duplications that happened in the same region, confusing the adjacency-inference procedure. Other than that, the result is quite consistent with previous reconstructions, in the sense that human chromosome X has experienced no large-scale rearrangements since the boreoeutherian ancestor, except a few in-place inversions. In the results of Murphy et al. (2005), which were computed with MGR, there was no rearrangement between human chromosome X and the boreoeutherian chromosome X. In Ma et al. (2006), there were four predicted inversions in human. With our current data and the new algorithm, we found five human inversions. Our two reconstructions agree on the three inversions shown in Table 1.

Table 1.
Comparison Between the Reconstructed boreoeutherian chrX Using DUPCAR and Data from Ma et al. (2006)

3.3. Application to a ciliate genome

It has been shown that the genome of the unicellular eukaryote Paramecium tetraurelia, a ciliate with roughly 40,000 genes, resulted from at least three whole-genome duplication (WGD) events, with additional rearrangement operations (Aury et al., 2006). Those authors reconstructed the genome architectures of four ancestral genomes, corresponding to the most recent WGD, the intermediary WGD, the old WGD, and the ancient WGD. They used Best Reciprocal Hits to construct a paralogon, which is a pair of paralogous blocks that could be recognized as derived from a common ancestral region. Then paralogons were merged into single ancestral blocks, and the process iterated until reaching the ancient WGD. However, they did not intend to determine the gene orders in each ancestral block; when a paralogon was constructed, the detailed order and orientation of genes inside the block were ignored.

The 39,642 genes form 22,635 gene families (including 11,740 single-gene families), spread over 676 scaffolds in the current genome assembly. We tested our algorithm by reconstructing all WGDs except the ancient WGD. We used the gene order in modern Paramecium tetraurelia and the gene trees from Aury et al. (2006). The reconciled tree is special in this case: it contains only one leaf genome (the modern one), together with ancestral nodes representing duplication events. We built the augmented gene trees accordingly.

Many genes do not have a paralog in the paralogons for a particular ancestral genome. If we were to include all the gene families in the reconstruction, the input data would be very noisy and the resulting CARs would be too fragmented, due to the fact that we only have one leaf genome. Therefore, when reconstructing CARs in a certain target genome, we did some preprocessing to retain only genes that have paralogs derived from ancient duplications. Additional genes were added if their paralogs (from this duplication) were retained in the leaf genome.

For all three genomes, the number of CARs we reconstructed is greater than the number of ancestral blocks reported in Aury et al. (2006) using the paralogon method (Table 2). There are two reasons for this: (1) Aury et al. (2006) ignored gene orders, whereas we take order and orientation into account when inferring CARs. (2) We used more genes in the reconstruction than just the ancestral genes with paralogs, which were essentially used as anchors when building paralogons.

Table 2.
The Number of CARs Reconstructed in Three Target Ancestral Genomes

In general, our prediction is a refinement of the result of Aury et al. (2006). In Figure 7, we show the mapping from our CARs (126 CARs containing more than two genes) for the intermediary WGD to the ancestral blocks constructed using the paralogon method. Most of them follow the pattern that several CARs correspond to one ancestral block of Aury et al. (2006), indicating an agreement in regional mapping if we ignore the gene orders. The mappings of reconstructions for the old WGD and the recent WGD have similar results (data not shown). Since Aury et al. did not reconstruct the ancestral gene adjacencies, we could not compare our prediction with theirs in detail.

FIG. 7.
This figure shows the mapping of the CARs of the intermediary WGD predicted using DUPCAR to the ancestral blocks from Aury et al. (2006). Numbers in circles correspond to the ancestral block ID predicted by Aury et al. Numbers in gray rectangles are the ...

Recent approaches to the genome halving problem (Seoighe and Wolfe, 1998; El-Mabrouk and Sankoff, 2003; Alekseyev and Pevzner, 2007; Zheng et al., 2006) might be particularly useful and interesting if applied to this ciliate genome. As more ciliate genomes become available, we plan to further investigate the changes of gene orders between different WGDs, using outgroup information from the new genomes to identify additional adjacencies, which will help determine which methods of reconstructing ancestral architecture are best, and may shed more light on Paramecium evolution.

4. Discussion

In this paper, we extend the method in Ma et al. (2006) and propose an algorithm, called DUPCAR, to reconstruct ancestral gene orders with duplications. Most previous gene-order reconstruction methods have relied on a particular distance metric with a restricted set of operations, for example, reversal distance without indels and duplications (Bourque and Pevzner, 2002), breakpoint distance without indels and duplications (Moret et al., 2001), or reversal distance with duplications but without indels and translocations (El-Mabrouk, 2002). Our approach avoids using reversal distance or breakpoint distance and relies on simple parsimony, making it easier to extend to handle different operations.

Because the critical procedure in DUPCAR is based on parsimony, the method retains a relatively small number of possibilities that are equally parsimonious in the ancestor. For the purpose of obtaining fewer CARs, a more reliable probabilistic rearrangement and duplication model might be more appropriate to reconstruct ancestral adjacencies.

We make a simplifying assumption in this paper that all the distances in the gene trees are perfect, which makes it easy to reconcile gene trees with the species tree. In reality, we usually have approximate distances. In our chromosome X experiment, since our resolution was relatively low, the gene trees were built with strong evidence of genomic distance and context. However, this remains a challenge for high-resolution reconstructions. Therefore, a more robust gene tree reconstruction and reconciliation method is needed (Chen et al., 2000; Bansal et al., 2007; Rasmussen and Kellis, 2007). This is a key area for further work.

Our future work will also focus on incorporating the ability to reconstruct evolutionary history with large-scale operations, instead of just figuring out the gene orders. Although solving the median problem is algorithmically challenging, it is completely feasible to provide a plausible history of rearrangements and duplications on each branch in the phylogeny when both the descendant genome and the ancestor genome have been predicted.

Our simulation of large-scale mammalian genome evolution looks promising. However, our reconstruction of the mammalian chromosome X is in relatively low resolution. In highly rearranged and duplicated regions, especially tandem-duplication-rich regions, special algorithms to figure out recent duplications first would be useful (Zhang et al., 2008).

A number of challenges remain before the genome structure of mammalian ancestors can be accurately predicted in terms of rearrangements and duplications, among which the most difficult would be partitioning the genomes in finer resolution and accurately dating the duplication events with noisy data.

5. Appendix

5.1. The proof of Theorem 1

Given a set equation M130 of unrooted gene trees with designated leaf species and a species tree T, the isometric reconciliation of equation M131 with T is unique and there is an efficient procedure to either construct it or detect that no such isometric reconciliation exists.

Proof

We will show that the algorithm ReconcileTrees produces an isometric reconciliation or detects if no such reconciliation is possible.

(A) We prove that if the algorithm accepts the tree then the tree is reconcilable and the reconciliation is unique.

First, we prove that at every step in the process, each of the maximally internal mapped nodes m ∈ M in the current gene tree B = Tai is associated with a subtree tm (constructed in ReconcileTrees) of mapped nodes of B that has the following properties:

  • (1) If tm does not contain the root r, then Φ(m) lies above Φ(y) for all y ∈ tm with y ≠ m, and d(m, y) = D(Φ(m), Φ(y)). Otherwise, if tm contains the root r, then Φ(r) lies above Φ(y) for all y ∈ tm with y ≠ r, and d(r, y) = D(Φ(r), Φ(y)).
  • (2) All leaves of tm are mapped to the genome of the species to which they belong

and that Φ is the unique mapping with these properties.

We prove this by structural induction. It is clear that the above claim is true after the initial step, in which each of the leaves of B is mapped. Let m be an unmapped internal node of the gene tree, with branches to nodes u, v, and z. Suppose the mappings of u and v to the species tree have already been determined and we have subtrees tu and tv that satisfy the above claim.

We first consider the case where neither tu nor tv contains the root r. If the tree has not been rooted after mapping m, then Φ(m) will be mapped to a point in T that is above Φ(u) and Φ(v). Based on the induction, Φ(m) lies above Φ(y) for all y ∈ tu and y ∈ tv. Hence, Φ(m) lies above Φ(y) for all y ∈ tm. Since we have

equation M132

therefore, for all equation M133, we have

equation M134

Similarly, for all equation M135, we have

equation M136

When the gene tree can be rooted after mapping m using u and v, without loss of generality, suppose r is on the path from m to v. Then Φ(r) will be mapped to a point in T that is above Φ(m) and Φ(v). Based on the induction, Φ(r) lies above Φ(y) for all equation M137. For all equation M138,

equation M139

For all equation M140,

equation M141

We then consider the case that one of tu and tv contains root r. Without loss of generality, suppose tv contains the root r. After the mapping of m, Φ(m) will be mapped to a point on the path from Φ(v) to Φ(u) such that Φ(m) lies under Φ(v) and Φ(u) lies under Φ(m). Based on the induction, Φ(r) lies above Φ(m) and Φ(y) for all equation M142 and equation M143. Hence, Φ(r) lies above Φ(y) for all equation M144. Again, based on MapGeneTreeNode, we have

equation M145

Therefore,

equation M146

Also, for all y in the original tu,

equation M147

Finally, we consider the case where Φ(z) has also been determined before we map m. There are two possible situations: (i) either tm or tz contains the root, and (ii) neither tm nor tz contains the root at this point. We can use logic similar to the above to show that Φ(r) lies above Φ(y) for all equation M148 and equation M149, and that for all y in tm and all y in tz

equation M150

Since the above properties (1) and (2) hold at every step in the process, they also hold when the processing of B is complete, if the input is not rejected. However, at this point the set M consists of just the root node r and all nodes of B are included in tr. Thus, every node x in B will be uniquely mapped to a point Φ(x) in T that lies below Φ(r) such that d(x, r) = D(Φ(x), Φ(r)). Hence, B will be uniquely isometrically reconciled with T. It follows that if the algorithm ReconcileTrees terminates without rejecting its input, it produces an isometric reconciliation of all its input gene trees with its input species tree.

(B) We prove that if the algorithm rejects the gene tree then the gene tree is not reconcilable.

We now verify that if the above procedure rejects B, then B is not reconcilable. In the algorithm MapGeneTreeNode, we use nodes u and v in the gene tree B to map the unmapped node m. From the above proof, we know that subtrees tu and tv have been uniquely mapped to T.

In step (5) and step (7)(a) in MapGeneTreeNode, if one tries to root the gene tree B again when B is already rooted when mapping m, then B is certainly not reconcilable because all the nodes that have already been mapped to T are uniquely mapped based on the proof in (A).

We then consider the case in step (1) in MapGeneTreeNode. Because a successful reconciliation maps the nodes of the gene tree to the nodes and edges of the species tree, there will be no shorter path between two mapped nodes in the gene tree than the distance between the two original nodes in the species tree. equation M151 and equation M152 are the distances from Φ(u) and Φ(v) to their last common ancestor λ in T. Hence equation M153 will be the distance from Φ(u) to Φ(v) in T. Also, d1 + d2 is the distance of the shortest path from u to v in the gene tree. Therefore, we must have equation M154. In step (1), if equation M155, then there is no possible way to map u and v to T that can achieve D(Φ(u), Φ(v)) = d(u, v). Therefore B is not reconcilable. Step (b)(ii) is proved similarly.

For step (b)(iii), i.e., ε = 0 and Φ(m) = λ, since tu, tv, and tz have all been uniquely reconciled based on part (A), the only way to map Φ(m) is to map it to λ such that there is a three-way split of the root of the reconciled gene tree with three descendant subtrees, tu, tv, and tz. Hence, B is not reconcilable.

Therefore, in ReconcileTrees, if the procedure MapGeneTreeNode rejects its input then the tree B is not isometrically reconcilable. It follows that if ReconcileTrees rejects its input then it is not isometrically reconcilable. Combined with part (A), this establishes that the ReconcileTrees algorithm is correct.

(C) We prove that the algorithm runs in polynomial time.

For each node in a gene tree, the running time of MapGeneTreeNode depends on the procedure of finding the last common ancestor between two points on branches in T. The crude (non-amortized) bound on the running time for this procedure is O(n), where n is the total number of tree nodes in species tree T. Therefore, the overall computational complexity of ReconcileTrees can be bounded by O(nm), where m represents the total number of tree nodes in all gene trees. ■

Acknowledgments

We thank Olivier Jaillon (Centre National de Sequencage, France) for providing data from the ciliate Paramecium tetraurelia. We also would like to thank the anonymous reviewers for critical suggestions. J.M., B.J.R., and D.H. were supported by NHGRI grant 1P41HG02371, NCI contract 22XS013A, and D.H. additionally by the Howard Hughes Medical Institute. L.Z. was supported by ARF grant 146-000-068-112. A.R. and W.M. were supported by NIH grant HG02238.

Disclosure Statement

No competing financial interests exist.

References

  • Alekseyev M.A. Pevzner P.A. Whole genome duplications and contracted breakpoint graphs. SIAM J. Comput. 2007;36:1748–1763.
  • Aury J. Jaillon O. Duret L., et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 2006;444:171–178.
  • Bansal M.S. Burleigh J.G. Eulenstein O., et al. Heuristics for the gene-duplication problem: a Θ(n) speed-up for the local search. RECOMB 2007. 2007:238–252.
  • Boesch F.T. Gimpel J.F. Covering points of a digraph with point-disjoint paths and its application to code optimization. J. ACM. 1977;24:192–198.
  • Bourque G. Pevzner P.A. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 2002;12:26–36.
  • Bourque G. Tesler G. Pevzner P.A. The convergence of cytogenetics and rearrangement-based models for ancestral genome reconstruction. Genome Res. 2006;16:311–313.
  • Caprara A. Formulations and hardness of multiple sorting by reversals. RECOMB 1999. 1999:84–94.
  • Chen K. Durand D. Farach-Colton M. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 2000;7:429–447.
  • Eichler E.E. Sankoff D. Structural dynamics of eukaryotic chromosome evolution. Science. 2003;301:793–797.
  • El-Mabrouk N. Reconstructing an ancestral genome using minimum segments duplications and reversals. J. Comput. Syst. Sci. 2002;65:442–464.
  • El-Mabrouk N. Sankoff D. The reconstruction of doubled genomes. SIAM J. Comput. 2003;32:754–792.
  • Fitch W.M. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 1971;20:406–416.
  • Fitch W.M. Homology, a personal view on some of the problems. Trends Genet. 2000;16:227–231.
  • Froenicke L. Caldes M.G. Graphodatsky A., et al. Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res. 2006;16:306–310.
  • Goodman M. Czelusniak J. Moore G.W., et al. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 1979;28:132–163.
  • Guigo R. Muchnik I. Smith T. Reconstruction of ancient molecular phylogeny. Mol. Phylogenet. Evol. 1996;6:189–213.
  • Ma B. Li M. Zhang L. From gene trees to species trees. SIAM J. Comput. 2000;30:729–752.
  • Ma J. Zhang L. Suh B.B., et al. Reconstructing contiguous regions of an ancestral genome. Genome Res. 2006;16:1557–1565.
  • Marron M. Swenson K.M. Moret B.M.E. Genomic distances under deletions and insertions. Theor. Comput. Sci. 2004;325:347–360.
  • Moran S. Newman I. Wolfstahl Y. Approximation algorithms for covering a graph by vertex-disjoint paths of maximum total weight. Networks. 1990;20:55–64.
  • Moret B.M.E. Wyman S.K. Bader D.A., et al. A new implementation and detailed study of breakpoint analysis. PSB. 2001:583–594.
  • Murphy W.J. Larkin D.M. Everts-van der Wind A., et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005;309:613–617.
  • Nadeau J.H. Taylor B.A. Lengths of chromosomal segments conserved since divergence of man and mouse. Proc. Natl. Acad. Sci. USA. 1984;81:814–818.
  • Pe'er I. Shamir R. The median problems for breakpoints are NP-complete. Electronic Colloq. Comput. Complexity. 1998;5
  • Pevzner P. Tesler G. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res. 2003;13:37–45.
  • Rasmussen M.D. Kellis M. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res. 2007;17:1932–1942.
  • Rocchi M. Archidiacono N. Stanyon R. Ancestral genomes reconstruction: an integrated, multi-disciplinary approach is needed. Genome Res. 2006;16:1441–1444.
  • Sankoff D. Genome rearrangement with gene families. Bioinformatics. 1999;15:909–917.
  • Sankoff D. Blanchette M. Multiple genome rearrangement and breakpoint phylogeny. J. Comput. Biol. 1998;5:555–570.
  • Sankoff D. El-Mabrouk N. Duplication, rearrangement and reconciliation, 537–550. In: Sankoff D., editor; Nadeau J.H., editor. Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families. Kluwer Academic Publishers; Amsterdam: 2000.
  • Schwartz S. Kent W.J. Smit A., et al. Human-mouse alignments with BLASTZ. Genome Res. 2003;13:103–107.
  • Seoighe C. Wolfe K.H. Extent of genomic rearrangement after genome duplication in yeast. Proc. Natl. Acad. Sci. USA. 1998;95:4447–4452.
  • Zhang Y. Song G.T. Vinar T., et al. Reconstructing the evolutionary history of complex human gene clusters. RECOMB 2008. 2008:29–49.
  • Zheng C. Zhu Q. Sankoff D. Genome halving with an outgroup. Evol. Bioinform. 2006;2:319–326.

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.