- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- J Comput Biol
- PMC3205801

# Reconciliation with Non-Binary Species Trees

^{1,}

^{*}Maureen Stolzer,

^{1,}

^{*}Aiton Goldman,

^{1}and Dannie Durand

^{}

^{1,,}

^{2}

^{1}Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania.

^{2}Department of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

^{}Corresponding author.

*Dr. Dannie Durand, Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213. E-mail:*Email: ude.umc@dnarud

^{*}These authors contributed equally to this work.

## Abstract

Reconciliation extracts information from the topological incongruence between gene and species trees to infer duplications and losses in the history of a gene family. The inferred duplication-loss histories provide valuable information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. While reconciliation for binary trees is a tractable and well studied problem, there are no algorithms for reconciliation with non-binary species trees. Yet a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and incomplete lineage sorting. We present the first algorithms for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Our algorithms utilize an efficient mapping from gene to species trees to infer the minimum number of duplications in *O*(|*V _{G}|* · (

*k*+

_{S}*h*)) time, where |

_{S}*V*is the number of nodes in the gene tree,

_{G}|*h*is the height of the species tree and

_{S}*k*is the size of its largest polytomy. We present a dynamic programming algorithm which also minimizes the total number of losses. Although this algorithm is exponential in the size of the largest polytomy, it performs well in practice for polytomies with outdegree of 12 or less. We also present a heuristic which estimates the minimal number of losses in polynomial time. In empirical tests, this algorithm finds an optimal loss history 99% of the time. Our algorithms have been implemented in Notung, a robust, production quality, tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.

_{S}**Key words:**deep coalescence, gene duplication, gene loss, lineage sorting, non-binary species trees, polytomy, reconciliation

## 1.Introduction

Reconciliation is the Process of constructing a mapping between a gene family tree and a species tree in order to infer the history of gene duplications and losses during the evolution of the gene family. Reconciliation is widely used for evolutionary applications in medicine, developmental biology and plant sciences. It is the most reliable approach for identifying orthologs for use in function prediction, gene annotation, planning experiments in model organisms, and identifying drug targets (Bourgon et al., 2004; Searls, 2003). Reconciliation is used to estimate times of duplication (Ermolaeva et al., 2003; Gu et al., 2002; Ruvinsky and Silver, 1997; Vandepoele et al., 2004; Hahn, 2007), to test whole genome duplication hypotheses (Vandepoele et al., 2003; Wang et al., 2005; McLysaght et al., 2002; Paterson et al., 2004) and to identify recently duplicated genes that may be sites of adaptation (Hahn et al., 2007). By correlating specific duplications with the emergence of novel cellular functions or morphological features, reconciliation provides clues to the functions of newly discovered genes (Demuth et al., 2006; Wheeler et al., 2001). Minimizing duplications and losses provides a basis for rooting an unrooted tree (Chen et al., 2000) and for selecting alternate gene or species tree topologies (Chen et al., 2000; Durand et al., 2006a; Bourgon et al., 2004; Nam and Masatoshi, 2005).

Reconciliation of binary trees is a well-studied problem (Guigó et al., 1996; Hallett and Lagergren, 2000; Ma et al., 2000; Mirkin et al., 1995; Page, 1994; Stege, 1999; Zhang, 1997; Eulenstein et al., 1998). A number of software packages for this problem are available (Page and Charleston, 1997; Page, 1998; Zmasek and Eddy, 2001b; Zmasek and Eddy, 2001a; Dufayard et al., 2005; Roth et al., 2005; Durand et al., 2006a; Berglund-Sonnhammer et al., 2006; Sennblad et al., 2007), as are high throughput reconciliation tools for automated processing of databases of molecular phylogenies (Perrière et al., 2000; Dufayard et al., 2005; Roth et al., 2005; Li et al., 2006) and analysis of genome-scale data sets (Hahn et al., 2007; Blomme et al., 2006; Demuth et al., 2006). Reconciliation is also the kernel of a related, but more complex, problem: inferring a species tree from many gene trees (Ma et al., 2000, and work cited therein). Historically, most work involving reconciliation has focused on duplications; however, losses are also an important factor in gene family evolution. Work that acknowledges this importance and incorporates loss information is emerging (Chauve et al., 2007; Chen et al., 2000; Guigó et al., 1996)

Reconciliation relies on the observation that discordance between a binary gene tree and a binary species tree is evidence that genes diverged through processes other than speciation. These processes include gene duplication and loss, incomplete lineage sorting, and horizontal gene transfer. Previous studies on binary reconciliation explain tree disagreement exclusively in terms of gene duplication and loss. A few methods that consider horizontal gene transfer have been proposed (Gorecki, 2004; Hallett and Lagergren, 2001; Hallett et al., 2004). However, to our knowledge, there are none that consider incomplete lineage sorting in binary trees. Since the probability of incomplete lineage sorting decreases as time between speciation events increases (Pamilo and Nei, 1988; Takahata and Nei, 1985; Maddison, 1997; Tajima, 1983; Hudson, 1990), ignoring incomplete lineage sorting as a cause of discordance is justified if the branch lengths in the species tree are sufficiently long.

In contrast, when the species tree is non-binary, incomplete lineage sorting is a significant phenomenon that cannot be ignored (Pollard et al., 2006). Since duplication and incomplete lineage sorting have different consequences for the interpretation of phylogenetic studies, it is essential to distinguish between discordances that can only be explained by duplication (*required* duplications) and discordances that could be due to either duplication or incomplete lineage sorting (*conditional* duplications). As we demonstrate below, standard binary reconciliation cannot make this distinction.

As the tree of life project gains momentum, it is becoming evident that reconciliation with non-binary species trees is not a minor problem relegated to a few obscure species lineages. Rather, 64% of branch points in the NCBI taxonomy (Wheeler et al., 2005), one of the most widely used databases of species phylogenies, have more than two children. A number of well-documented analyses of simultaneous divergences have been reported (Jackman et al., 1999; Poe and Chubb, 2004; Melnick et al., 1993; Hoelzer and Melnick, 1994; Salzburger et al., 2002). Considering the large number of non-binary species trees, reconciliation methods for non-binary trees are urgently needed.

**Our contributions:** To address this need, we present novel algorithms to reconcile a rooted binary gene tree, *T _{G}* = (

*V*,

_{G}*E*), with a rooted non-binary species tree,

_{G}*T*= (

_{S}*V*,

_{S}*E*). Under the assumption of duplication-loss parsimony, our algorithms infer (1) the number of duplications and losses that occurred, (2) the species in which those events occurred, and (3) the gene tree lineage in which the events occurred. From these, the total number of duplication, speciation and loss events, as well as the temporal order of those events in each lineage, can be determined. Our algorithms consider only discordance due to duplication and incomplete lineage sorting. We further assume, like previous authors, that the probability of incomplete lineage sorting is negligible when a node in the species tree is binary. Incomplete lineage sorting in binary species trees and horizontal gene transfer are reserved for future work.

_{S}We first present a mapping from nodes in *V _{G}* to sets of nodes in

*V*that allows us to test efficiently whether a discordance at a given node is a conditional or required duplication. The maximum size of the set labeling any node in

_{S}*T*is

_{G}*O*(

*k*), where

_{S}*k*is the maximum outdegree in

_{S}*T*. Using this mapping, our algorithm infers all conditional and required duplications in

_{S}*O*(|

*V*| · (

_{G}*k*+

_{S}*h*)) time, where

_{S}*h*is the height of T

_{S}_{S}. Each inferred duplication is assigned to a node in T

_{G}, indicating the species in which it occurred and the timing of the duplication relative to other events in the gene family history.

We also present parsimony algorithms to infer the minimum number of gene losses. The timing of each inferred loss is indicated by assigning a node representing the loss to an edge in *E _{G}*. This loss node is labeled with the species in which the loss occurred. For binary species trees, the edge to which each loss is assigned is unambiguously determined by the reconciliation. In contrast, for a loss associated with a polytomy in a non-binary species tree, it is not generally possible to determine the exact lineage in the gene tree in which the loss occurred. In such cases, the loss is associated with several possible edges in

*E*, corresponding to alternate hypotheses regarding when the loss occurred. Parsimony provides a principled basis for reducing this uncertainty: the number of alternate hypotheses may be reduced by assigning losses to edges in

_{G}*E*such that the total number of losses is minimized.

_{G}Two considerations influence the total number of losses associated with a particular assignment of losses to edges. First, the position of the loss relative to duplications in *T _{G}* influences the total number of losses. Assigning a loss to an edge above a duplication in

*T*implies that the loss occurred before the duplication. In this case, only one loss is inferred. Assigning the loss below that duplication implies that the duplication occurred first. In this case, two losses must be inferred, one for each duplicated copy. Second, under certain circumstances, losses in sibling species can be explained by a single loss in their common ancestor. We can reduce the number of losses by selecting assignments that maximize the opportunities to combine losses that share a parent. As we show later in the paper, these considerations are not independent of one another. While assigning a loss below a duplication usually increases the total number of losses, in some cases the duplicated losses can be combined with other losses that occurred after the duplication, resulting in fewer total losses.

_{G}We present an exact algorithm that infers a history with the fewest losses, while taking both of the above considerations into account. Despite the exponential complexity of this algorithm , in empirical tests our exact algorithm reconciled 1174 trees derived from the TreeFam (Li et al., 2006) database in about two and a half minutes. We also present a heuristic that runs in polynomial time (*O*(|*V _{G}*| · (

*k*+

_{S}*h*))). This heuristic reconciled the same 1174 trees in 48 seconds. This heuristic places losses as close to the root as possible to avoid duplicating losses. Once all losses have been assigned to edges, losses are combined where possible. In a comparison of the two methods on these 1174 trees, the heuristic found the optimal solution in more than 99% of the cases studied. Out of the seven trees in which the heuristic did not find the optimal solution, in the worst case, the number of losses was overestimated by four losses out of a total of 249.

_{S}Our algorithms have been implemented in Notung, a software package that takes trees in the widely used Newick format as input, permitting interoperability with a wide range of phylogeny reconstruction and drawing packages. In addition to reconciling binary gene trees with non-binary species trees, Notung can reconcile both binary and non-binary gene trees with binary species trees (Durand et al., 2006b). Notung can also apply duplication-loss parsimony to reduce uncertainty in a gene tree with weakly supported edges, to root a gene tree, and to resolve polytomies in a gene tree (Chen et al., 2000; Durand et al., 2006a). Our software is implemented in Java and runs on Windows, Unix and Mac OS X. It is freely available at *www.cs.cmu.edu/~durand/Notung*.

**Roadmap:** In the next section, we introduce notation and review the standard algorithm for reconciliation of binary trees. In Section 3, we review the relevant models for non-binary gene and species trees in the molecular evolution literature and give formal definitions for required and conditional duplications based on this foundation. Next we present our non-binary reconciliation algorithms. Duplications are discussed in Section 4. In Section 5, we discuss losses and present an exact algorithm and a heuristic for inferring the minimum number of losses, as well as conditional and required duplications. We discuss related work by other authors in Section 6. In Section 7, we demonstrate the utility of our methods with analyses of real data sets using our software. In the conclusion, we discuss probabilistic approaches to reconciliation and describe directions for future work.

## 2.Notation and Binary Reconciliation

In this section, we introduce notation and review the standard reconciliation algorithm for binary trees. The trees shown in Figure 1 will be used throughout this section to exemplify notation. In tree figures, the label *g*_*s* denotes a gene that is sampled from species *s*.

**a**) A binary species tree. (

**b**) A hypothetical binary gene tree with genes sampled from species in (a). (

**c**) The gene tree (b) reconciled with species tree (a). (

**d**) The gene tree (b) embedded in the species tree (a). The black squares

**...**

Let *T _{i}* = (

*V*,

_{i}*E*) be a rooted tree, where

_{i}*V*is the set of nodes in

_{i}*T*, and

_{i}*E*is the set of edges.

_{i}*L*(

*T*) is the leaf set of

_{i}*T*and

_{i}*L*(

*v*) refers to the leaf set of a subtree rooted at . The root node of

*T*is denoted as

_{i}*root*(

*T*).

_{i}*C*(

*v*) and

*p*(

*v*) refer to the children and parent of

*v*, respectively. If

*v*is binary,

*r*(

*v*) and

*l*(

*v*) denote the right and left children of

*v*. For example, in Figure 1,

*p*(

*y*) =

*x*, and

*C*(

*y*) = {

*g*1_

*A*,

*g*1_

*B*}, where

*l*(

*y*) =

*g*1_

*A*and

*r*(

*y*) =

*g*1_

*B*. A non-binary node in a tree is referred to as a

*polytomy.*A

*monophyletic*group is a set of nodes consisting of a node and all of its descendants; for example, in Figure 1a, {

*γ*,

*C*,

*D*} forms a monophyletic group. The expression

*u*≥

_{i}

*v*indicates that for , either

*u*is

*v*, or

*u*lies on the path from

*v*to

*root*(

*T*). In Figure 1c,

_{i}*root*(

*T*) =

_{G}*x*and

*y*≥

_{G}

*g*1_

*A*. We follow the computer science convention, in which the root is at the top of the tree, the leaves are at the bottom, and

*p*(

*g*) is above

*g*.

Reconciliation infers gene duplications and losses by fitting a gene tree to a species tree (Goodman et al., 1979; Page and Holmes, 1998; Ronquist, 2002). To perform this comparison for binary trees, a mapping between the gene tree and species tree is required. Let *T _{G}* be a binary gene tree and

*T*be a binary species tree such that the genes in

_{S}*L*(

*T*) were sampled from the species in

_{G}*L*(

*T*). A mapping M(·) is constructed from each node to a target node . If g is a leaf node,

_{S}*M*(

*g*) is the species from which sequence

*g*was sampled. If

*g*is an internal node,

*M*(

*g*) is the least common ancestor (LCA) of the target nodes of its children:

In our example, *M*(*g*1_*A*) = *A*, since it is a leaf; *M*(*x*) = *LCA*(*M*(*y*), *M*(*z*)) = *LCA* (α, β) = α. From this mapping both gene duplications and gene losses can be inferred. We refer to this algorithm for calculating duplications and losses as LCA reconciliation in order to distinguish it from the new reconciliation algorithms proposed for non-binary species trees in the next section.

By convention, duplications are assigned to nodes in *V _{G}* and losses to edges in

*E*. Assigning a duplication to node not only specifies its location in

_{G}*T*, but also its location in

_{G}*T*, via the mapping

_{S}*M*(·). An inferred duplication at

*g*implies that the duplication occurred between

*p*(

*M*(

*g*)) and

*M*(

*g*). The two resulting copies were present in species

*M*(

*g*), and for at least one child

*c*of M(

*g*) (if )), each copy persisted

^{1}in at least one leaf (not necessarily the same leaf) of the subtree of

*T*rooted at

_{S}*c*. If

*M*(

*g*) , then both copies persisted in

*M*(

*g*). For losses, the species,

*s*, in which the loss occurred must be inferred explicitly. Assigning a loss in

*s*to edge (

*p*(

*g*),

*g*) indicates that

*g*was present in both

*M*(

*p*(

*g*)) and

*M*(

*g*), and was lost on the path from

*s*′ to

*s*, where

*s*′ is a species on the path from

*M*(

*g*) to

*M*(

*p*(

*g*)) (i.e., .

**Gene Duplications:** A duplication is inferred at node *g* if and only if the children of *g* map to the same lineage in *T _{S}*; that is, there is some leaf such that both

*l*(

*M*(

*g*)) and

*r*(

*M*(

*g*)) are on the path from

*s*to

*root*(

*T*). This condition is true

_{S}*iff*

Every node in *T _{G}* that is not designated a duplication node is a speciation node.

When the gene tree is binary, Equation (2) is sufficient to determine whether *l*(*g*) and *r*(*g*) map to the same lineage in *T _{S}*. Note, however, that Equation (2) only explicitly tests whether two copies were present in

*M*(

*g*), but not whether each gene copy persisted in at least one leaf of a subtree descending from

*M*(

*g*). The assumption of complete lineage sorting guarantees that the latter must also be true. However, as we demonstrate in Section 3, under incomplete lineage sorting, Equation (2) is not sufficient to determine whether two or more children map to the same lineage in

*T*.

_{S}As an example of inferring gene duplication, we consider Figure 1. Figure 1d shows a duplication at node , prior to the species divergence at α. A descendant of *l*(*x*) persisted in species *B*, while a descendant of *r*(*x*) persisted in species *C* and *D*; thus, both copies are represented in at least one leaf of the subtree rooted at *β*. The gene tree embedded in the species tree in Figure 1d shows both copies of the gene on the edge (*α*, *β*). Although only one copy of the family survived in each species, discordance between the species tree in Figure 1a and the gene tree in Figure 1b provides sufficient evidence to infer a duplication at *x*. Because both *x* and one of its children (*y*) map to *α*, Equation (2) correctly identifies the duplication *x*.

**Gene Losses:** Losses can also be reconstructed from the mapping, *M*(·). For each *e* = (*p*(*g*), *g*), the comparison of *M*(*p*(*g*)) and *M*(*g*) determines the losses assigned to *e*. If *p*(*g*) is a speciation node, and no loss occurred, then *M*(*p*(*g*)) must be the parent of *M*(*g*) in the species tree. If *p*(*g*) is a duplication node and no losses occurred, then *p*(*g*) and *g* map to the same node in *T _{S}*. If either one of these conditions fail, then one or more losses must be inferred. Both situations arise in the example in Figure 1. Since

*x*is a duplication node,

*M*(

*z*) ≠

*M*(

*x*) indicates losses in

*A*and

*B*on the edge from

*x*to

*z*. Node

*p*(

*g*1_

*B*) =

*y*is a speciation node, but

*M*(

*y*) =

*α*is not the parent of

*M*(

*g*1_

*B*) =

*B*in

*T*, indicating the absence of

_{S}*g*1 from species

*C*and

*D*. Note that these two losses can be explained more parsimoniously by the loss of a single ancestral gene in the ancestral species,

*γ*.

In general, given a 1-1 mapping between a set of contemporary species in which a gene copy is absent (relevant to a single edge in *E _{G}*) and the leaves of a subtree rooted at , it is more parsimonious to infer a single loss in

*s*. Under this assumption, when

*p*(

*g*) is a speciation node, we infer

*depth*(

*M*(

*g*)) –

*depth*(

*M*(

*p*(

*g*))) – 1 losses on edge

*e*, one for each ancestral species on the path from

*M*(

*g*) to

*M*(

*p*(

*g*)) in

*T*. If

_{S}*p*(

*g*) is a duplication, the number of inferred losses is

*depth*(

*M*(

*g*)) –

*depth*(

*M*(

*p*(

*g*))).

The species associated with the losses inferred on *e* = (*p*(*g*), *g*) are determined by walking up the species tree from *M*(*g*) to *M*(*p*(*g*)). For each ancestral node between *M*(*g*) and *M*(*p*(*g*)), a loss is inferred in *l*(*s*) or *r*(*s*), whichever is not represented on the path from *g* to *p*(*g*) in the gene tree. If *p*(*g*) is a duplication node, an additional loss is inferred. For example, consider the losses on edge (*x, z*) in Figure 1c. Node *x* is a duplication node and *M*(*x*) = *α*, but *M*(*z*) = *γ*, indicating that *depth*(*γ*) – *depth*(*α*) = 2 losses occurred between *x* and *z*.

The reconciliation methods presented here fulfill the requirements for reconciliation algorithms specified in Section 1. Each duplication is assigned to a node , indicating the timing of the duplication relative to speciation and duplication events in the same gene tree lineage. The mapping *M*(*g*) indicates the species lineage in which the duplication occurred. Similarly, each loss is associated with a node and an edge , indicating that the loss occurred between *p*(*s*) and *s* in the species tree and between *p*(*g*) and *g* in the gene tree. The total number of events is determined by summing over the set of all nodes and edges for duplications and losses, respectively.

## 3.Models For Non-Binary Species Trees

LCA reconciliation is based on the assumption that disagreement between a gene tree and a species tree indicates that one or more gene duplications or losses must have occurred. In this section, we show that when the species tree is non-binary, this assumption is no longer warranted. First we review current theory concerning polytomies in gene and species trees.

A polytomy may represent the simultaneous divergence of all its children (a hard polytomy) (Maddison, 1989). It may also indicate that the true binary branching pattern is unknown (a soft polytomy) (Maddison, 1989). A soft polytomy indicates that sufficient data is not available, or there is not enough signal in the data to determine the true branching order. This may occur when a sequence of binary divisions proceeds in close succession and the time between these events is insufficient to accumulate informative variation.

In gene trees, polytomies are always soft. Since each lineage in a gene tree represents exactly one gene, the result of any divergence is exactly two descendant sequences. Thus, the true branching pattern in a gene family is always binary (Hudson, 1990), and a polytomy can only represent uncertainty in the true, underlying history. In the current work, we assume that the true binary gene tree is known and focus exclusively on algorithms for binary gene trees.

In contrast, polytomies in a species tree can be either hard or soft. A species tree represents the evolution of a population of organisms. In this context, simultaneous divergence of three or more lineages does occur. For example, simultaneous divergence can result from the isolation of subpopulations within a widespread species by sudden meteorological or geological events, or from rapid expansion of the population into open territory, resulting in reproductive isolation. Examples of well-documented, simultaneous divergences in nature include *Anolis* lizards (Jackman et al., 1999), modern birds in the order *Neoaves* (Poe and Chubb, 2004), macaque monkeys (Melnick et al., 1993; Hoelzer and Melnick, 1994), auklets (Walsh et al., 1999), and African cichlid fishes (Salzburger et al., 2002).

A binary gene tree can be consistent with a hard polytomy in the species tree (Lyons-Weiler and Milinkovitch, 1997; Maddison, 1997), as illustrated in Figure 2. Since a node in the species tree represents a population with genetic diversity, there can be multiple alleles at the locus of interest. Discordance between gene and species trees that results from allelic variation is referred to as incomplete lineage sorting. The bifurcation at time *t*_{1} shows the formation of two alleles through mutation. A second bifurcation occurs at *t*_{2} and results in three different alleles at this locus. Note that this binary branching pattern represents allelic variation at a single locus, not the formation of independent loci through duplication. The true divergence between any two genetic lineages corresponds to the point at which the differences in alleles arose, not the time of speciation. Divergence that occurs much earlier than the time of speciation is referred to as a *deep coalescence event*. Given a hard polytomy with *k* leaves, all binary branching patterns with *k* leaves are equally likely (Pamilo and Nei, 1988). Figure 3 shows all three binary branching patterns that are possible in a polytomy with three children.

In our model, disagreement between a binary gene tree and a non-binary species tree is evidence that divergence in the gene tree resulted from gene duplication or incomplete lineage sorting. When the species tree is non-binary, three cases must be considered. Where there is no incongruence, gene tree divergence is attributed to speciation. Second, some divergences can only be explained by gene duplication. Obviously, a duplication must have occurred in any gene family that has two or more members in the same species. But even when no contemporary species contains more than one family member, there are cases where topological disagreement can only be explained by a duplication. Third, in some cases, it is not possible to determine whether the disagreement is due to incomplete lineage sorting or duplication (Slowinski and Page, 1999).

These issues are illustrated in Figure 4. Disagreement between the gene tree in Figure 4b and the species tree in Figure 4a can be explained by two scenarios, depicted in Figures 4c and andd.d. In Figure 4c, the discordance at *x* is explained by a duplication, followed by losses in *B* and *β*. In Figure 4d, the divergence at *x* is explained by allelic variation in the ancestral population at *α*, followed by retention of different alleles in different lineages in *T _{S}*. Based on the information available in Figure 4, it is impossible to determine with certainty whether node

*x*represents incomplete lineage sorting or gene duplication. In contrast, the incongruence at node

*y*between the species tree in Figure 4a and the gene tree in Figure 4b can only be explained by a duplication. This can be seen in Figure 4d, which shows the gene tree embedded in the species tree. Two copies of the gene are present in the lineage from

*α*to

*β*, indicating that a duplication must have occurred at their most recent common ancestor,

*y*.

**a**) A species tree with a polytomy at α. (

**b**) A hypothetical gene tree sampled from the species in (a). (

**c**) The gene tree from (b), which has been reconciled using the LCA algorithm. (

**d**) The hypothetical gene tree embedded in species tree. (

**e**)

**...**

Since disagreements in the branching pattern of a binary gene tree and a non-binary species tree may be evidence of a duplication or incomplete lineage sorting, we need a formal basis to distinguish between *required* duplications—those disagreements that can only be explained by a duplication—and *conditional* duplications—those disagreements that can be explained by either a duplication or a deep coalescence event. LCA reconciliation can recognize nodes that did not arise through speciation but cannot distinguish between required and conditional duplications. In Figure 4, all three internal nodes map to species *α*. According to Equation (2), since *M*(*x*) = *M*(*y*) = *M*(*z*) = *α*, duplications are inferred at both *x* and *y*. Thus, LCA reconciliation recognizes disagreement at *x* and *y*, but cannot determine that *x* is a conditional duplication and *y* a required duplication. An additional test for distinguishing required and conditional duplications is needed.

Recall that all binary trees with *k* leaves are equally compatible with a species polytomy with *k* children. Therefore, we can treat a polytomy *s* as a set of hypotheses, or *binary resolutions of s*. For each polytomy , let *H*(*s*) be the set of all possible binary trees, rooted at *s*, whose leaves are the children of *s*. Formally, given the *k*-tomy , let *H*(*s*) = {*T _{i}*|

*L*(

*T*) =

_{i}*C*(

*s*)}, where

*T*is a binary tree such that the leaves of

_{i}*T*are the children of

_{i}*s*. For example, if node

*s*is the trichotomy a in Figure 4a, then

*H*(

*α*) = {(

*A*, (

*B*,

*β*)), (

*B*, (

*A*,

*β*)), (

*β*, (

*A*,

*B*))}. In addition, let

*H**(

*T*) be the set of all possible binary trees obtained by replacing each polytomy with each tree . In other words,

_{S}*H** (

*T*) is the set of all possible binary resolutions of

_{S}*T*. For Figure 4a,

_{S}*H** (

*T*) = {(

_{S}*A*, (

*B*, (

*C*,

*D*))), (

*B*, (

*A*, (

*C*,

*D*))), ((

*C*,

*D*), (

*A*,

*B*))}. Note that for a given , every node corresponds to a node in

*T*′; however,

*T*′ will also contain nodes that do not correspond to any node in

*T*. The cardinality of

_{S}*H**(

*T*) is , which is equivalent to , where and

_{S}*k*= |

_{j}*C*(

*s*)| (Li, 1997). If

_{j}*T*is binary, then

_{S}*H**(

*T*) = {

_{S}*T*}.

_{s}We now use *H** (*T _{S}*) to characterize formally the properties of the gene and species tree that determine when a duplication is required. When reconciling

*T*with every , if is a duplication in every reconciliation, then a duplication

_{G}*must*have occurred at

*g*. If at least one, but not all reconciliations indicate a duplication at

*g*, then a deep coalescence event

*may*have occurred. Formally:

#### Definition 3.1.

, *reconcile* *T _{G}*

*with*

*T*′.

*Given*,

*p*(*g*)*is a*required duplication*if*.*p*(*g*)*is a*conditional duplication*if**and p*(*g*)*is not a required duplication*.

Notice that for the trees in Figure 4, every *T*′ would infer a duplication node at node *y*; however, this is not the case for node *x*. For these trees, *M*(*y*) = *M*(*z*) for every . Therefore, under Definition 3.1, node *y* is a required duplication. On the other hand, node *x* is a conditional duplication since there is a binary resolution in *H**(*T _{S}*), namely (

*A*, (

*B*, (

*C*,

*D*))), that does not infer a duplication. Recall, however, that under LCA reconciliation, a duplication event is inferred at both nodes

*x*and

*y*, since

*M*(

*x*) =

*M*(

*y*) =

*M*(

*z*).

## 4.Identifying Duplications

Reconciliation with a non-binary species tree requires determining whether a given node is a required duplication, a conditional duplication, or a speciation. Formally, the problem of reconciliation with non-binary species trees is defined as follows:

#### Duplication Inference for Non-Binary Species Trees

**Input:** A rooted, arbitrary species tree, *T _{S}*, a rooted, binary gene tree,

*T*, and a mapping from

_{G}*L*(

*T*) to

_{S}*L*(

*T*).

_{G}**Output:** The minimum number of duplications required to reconcile of *T _{G}* and

*T*. The most parsimonious duplication histories, where a duplication history is a reconciled gene tree in which every node is designated as a required duplication, a conditional duplication, or a speciation.

_{S}Definition 3.1 provides a formal basis for classifying nodes, but cannot be the basis of an efficient algorithm, since *H** (*T _{S}*) grows superexponentially with the size of polytomies in

*T*. LCA reconciliation is not a suitable solution, since it identifies both conditional and required duplications, but cannot distinguish between them.

_{S}To see why this is true, recall that the presence of descendants of both children of *g* in the same lineage of *T _{S}* is evidence of duplication at

*g*. To infer a duplication at

*g*in the binary case, it is sufficient to determine that

*g*and a child of

*g*were both present in

*M*(

*g*), because (in the absence of loss) complete lineage sorting implies that both lineages descending from

*g*must be inherited by both species lineages descending from

*M*(

*g*). However, when incomplete lineage sorting is possible, this is no longer true. The coexistence of

*g*and one of its children in the same ancestral species (i.e., Equation (2)) is not a sufficient condition to determine whether they coexisted in the same lineage. For example, in Figure 4d, the left child of

*x*is only present in the lineage leading to

*A*and the right child of

*x*is only present in the lineages leading to

*B*and

*β*. No subtree descending from a child of

*M*(

*x*) contains descendants of both

*l*(

*x*) and

*r*(

*x*), indicating that

*x*is not a duplication node. In contrast,

*y*is a required duplication since both of its children are present on the edge from

*α*to

*β*. Equation (2) would correctly recognize the required duplication at

*y*, but incorrectly infer a required duplication at

*x*as well, since

*M*(

*x*) =

*M*(

*r*(

*x*)).

As the above example shows, Equation (2) is not sufficient to infer required duplications when the species tree is non-binary. In particular, the mapping *M*(·) is not sufficiently informative to distinguish between conditional duplications and required duplications. In order to infer required duplications, information is needed about the descendants of *M*(*g*) in which the descendants of *g* were present.

Here we propose a new mapping and show how it can be used to make this distinction. We construct a mapping from each node *g* in *T _{G}* to a set of species in

*T*.A straightforward approach would be to label each node,

_{S}*g*, in the gene tree with all nodes (both leaves and internal nodes) in the species tree in which the gene was present. Using this mapping, a required duplication is inferred at

*g*if the intersection of the sets of its children is non-empty. The size of the sets labeling the nodes in the gene tree grows with the height of the tree and can contain as many as

*O*(|

*V*|) elements. However, it is sufficient to store only the children of

_{S}*M*(

*p*(

*g*)) in which descendants of

*g*must have been present, as follows:

#### Definition 4.1.

*Define* *to be*

*where* *is the powerset of V _{S}, excluding the empty set*.

Note that is defined on every node in *T _{G}* except the root. The size of this mapping at any given node is bounded by the size of the largest polytomy in

*T*, yet this mapping is sufficiently informative to identify required and conditional duplications, as we prove in Theorem 4.1.

_{S}#### Theorem 4.1.

*A node g is a required duplication iff* .

#### Proof

There exists at least one *x* such that . Therefore, for all , *x* ≤_{S} *M*(*l*(*g*)) and *x* ≤_{S} *M*(*r*(*g*)). This requires that either *M*(*l*(*g*)) = *M*(*r*(*g*)) or one is a descendant of the other. Thus, *g* meets the duplication criterion for binary gene and binary species trees for every , and therefore is a required duplication.

It is necessary to show that whenever , there exists at least one element of *H**(*T _{S}*) that does not imply a duplication at

*g*. Any that has all members of in the left subtree of

*M*(

*g*) and all members of in the right subtree of

*M*(

*g*) will meet this criterion. At least one such tree must exist since and do not intersect.

Figure 5 shows the mapping for the gene tree in Figure 4d. This mapping correctly infers a required duplication at *y* since . It also correctly identifies a conditional duplication at node *x*, since .

The algorithm for inferring required and conditional duplications using is given in Algorithm 1. is calculated with a postorder traversal of *T _{G}*. To ensure that the set is composed only of children of

*M*(

*p*(

*g*)), Algorithm 1 executes a climbing step that replaces the set of nodes in with the child of

*M*(

*p*(

*g*)) which is ancestral to them. The

*climb*procedure prevents from growing larger than

*k*. Algorithm 1 infers a required duplication at

_{S}*g*if the intersection of the sets and is non-empty. A conditional duplication is inferred if the node is not a required duplication, but is a duplication under LCA reconciliation.

As shown in Theorem 4.2, Algorithm 1 produces the same results as LCA reconciliation when the species tree is binary.

#### Theorem 4.2 (Equivalence with LCA Reconciliation for Binary Species Trees)

*Let T _{G} be a binary gene tree reconciled with a binary species tree*.

#### Proof

This follows directly from Theorem 4.1.

Suppose that there exists , such that *M*(*h*) = *M*(*g*), but . Either and are both equal to {*M*(*g*)} or and both contain children of *M*(*g*). In the former case, and are not disjoint, leading to a contradiction. In the latter case, since *M*(*g*) has only two children and the sets are disjoint, one set must contain the right child and the other the left child of *M*(*g*). Thus, there is no , such that *M*(*h*) = *M*(*g*), leading to a contradiction.

The computational complexity of this algorithm is considered in Section 5.1, where we present an algorithm that infers losses in addition to conditional and required duplications in a single traversal of the tree.

## 5.Inferring Gene Losses

The goal of loss inference under the parsimony criterion is to identify the loss history that requires the minimum number of losses needed to explain the data. A loss history is a set of gene losses, in which each loss is assigned to an edge in the gene tree and a node in the species tree. For a given binary gene tree and binary species tree, there is exactly one most parsimonious loss history. Each loss can be unambiguously assigned to exactly one edge in *T _{G}*, and associated with one node in

*T*. Moreover, it is possible to determine the set of losses assigned to an edge (

_{S}*g*,

*p*(

*g*)) by comparing

*M*(

*g*) and

*M*(

*p*(

*g*)), without considering losses on any other edge of the gene tree. The total number of losses in the most parsimonious history can be determined by inferring losses on each edge independently and summing over all edges.

In contrast, when *T _{S}* is non-binary, a reconciliation may have more than one equally parsimonious loss history. Under specific circumstances, described in detail in the next section, a species polytomy will result in ambiguous losses, losses that may be assigned to one of several edges in the gene tree. The reconciliation does not provide enough information to fully resolve the temporal order of these losses relative to other events in the same gene tree lineage.

The problem of ambiguous losses is illustrated by the following example: Figure 6a shows a species tree, *T _{S}*, with a single polytomy,

*β*, and a gene tree,

*T*, representing the evolution of a family of genes,

_{G}*g*, drawn from the species in

*L*(

*T*). There is no member of this gene family in species

_{S}*B*, indicating the loss of a gene in

*B*. Comparison of

*T*and

_{G}*T*identifies three edges in

_{S}*T*from which this gene (denoted

_{G}*lost*_

*B*) could have diverged, but is not sufficient to determine which edge is preferred. The divergence of

*lost*_

*B*may have occurred before the separation of

*g*_

*C*from the ancestor of

*g*_

*D*and

*g*_

*E*(Fig. 6b) or after that separation (Fig. 6c). Alternatively,

*lost*_

*B*may be most closely related to

*g*_

*C*(Fig. 6d).

**a**) Hypothetical species and gene tree.

*T*has a polytomy at

_{S}*β*and a gene has been lost in species

*B*in

*T*. (

_{G}**b**), (

**c**), and (

**d**) The different hypotheses of when the lost occurred are shown as both an embedded tree and a separate

**...**

In gene families in which two or more losses occurred, interactions between losses that can be assigned to the same edge of the gene tree must be considered. Although it is not possible to determine exactly when an ambiguous loss occurred, it is possible to identify the set of *permissible edges* for a given loss. The set of permissible edges is a (not necessarily proper) subset of a set of contiguous edges, called a Polytomy Connected Component (PCC), which is defined formally in Section 5.2. Each ambiguous loss is associated with exactly one PCC. A gene tree may have several PCCs; these are always disjoint. Interactions between ambiguous losses assigned to the same PCC influence the total number of losses. Two factors contribute to the interactions between losses: Losses that occurred in sibling species and that are assigned to the same edge in *T _{G}* may be replaced by a single loss in a common ancestor, decreasing the total loss count. In addition, interaction of ambiguous losses with duplications in the gene tree affects the total number of losses inferred.

The first factor arises because a single loss in an ancestral species is more parsimonious than simultaneous, independent losses in the contemporary species descended from that ancestor. In an algorithmic context, we refer to the process of inferring such ancestral losses as *combining losses*. In the binary case, a single ancestral loss may be inferred if the same member of a gene family is absent from all leaves of some subtree of *T _{S}*. When the species tree is non-binary, losses that correspond to leaves of subtrees whose roots are the children of a polytomy may be combined. Formally, given a set, , of losses assigned to the same edge, if there is a one-to-one mapping between and the leaves of a subtree

*t*in some , a single loss in

*root*(

*t*) can be inferred. It is not necessary to enumerate all trees in

*H**(

*T*) to determine whether a given set of losses corresponds to a single, ancestral loss

_{S}*T*. If there exists a set of subtrees in

_{S}*T*such that there is a one-to-one mapping between and the leaves of the subtrees, and the roots of the subtrees are all children of the same polytomy, then the losses in may be combined. Note that a reconciled gene tree may include losses in ancestral species that are not found in

_{S}*T*. In this case, the loss is labeled with the set of children of the polytomy that are descendants of

_{S}*root*(

*t*) in

*T′*.

The particular edge within the permissible set to which a loss is assigned determines whether or not it can be combined with other losses and, hence, the total number of losses inferred. For example, the gene family in Figure 7c is not represented in species *B*. The set of permissible edges for *lost*_*B* is shown in bold. A second loss occurred in species *D*. The set of permissible edges for *lost*_*D* is not shown, but it overlaps with the set of edges for *lost*_*B*. In particular, both losses can be assigned to the edge (*w*, *x*). In that case, a single ancestral loss can be inferred in their common ancestor, as seen in Figure 7d. This is permissible because there exists a tree in *H**(*T _{S}*) in which

*B*,

*D*, and their common ancestor form a monophyletic subtree. Notice that, although

*lost*_

*B*and

*lost*_

*F*can both be assigned to the edge (

*x*,

*g*4_

*E*), they do not correspond to the leaves of any subtree in

*H** (

*T*), and cannot be combined.

_{S}**a**) A species tree with a polytomy,

*β*. (

**b**) A hypothetical gene tree that has been reconciled with the species tree in (a). This gene tree is annotated with the mappings

*M*(·), and

*N*(·). Losses are not represented in this tree.

**...**

The second factor influencing total loss count is loss duplication. If the set of permissible edges for a given ambiguous loss contains a duplication node, the position of the loss with respect to this duplication will influence the total number of losses. Assignment of the loss to an edge below the duplication implies that the gene was duplicated before the loss occurred, requiring that two losses be inferred, one in each subtree below the duplication. If the loss is assigned on an edge above the duplication, only a single loss is inferred. While assigning losses to edges above duplications usually results in fewer losses, the number of losses will decrease when a loss is assigned below a duplication if both copies can then be combined with other losses. For example, in Figure 7c, *lost*_*B* may be assigned to edges both above and below the duplication at w. If *lost*_*B* is assigned to edges below w, two losses in *B* are inferred, one on edge (*w*, *l*(*w*)) = (*w*, *x*) and one on (*w*, *r*(*w*) = (*w*, *y*)). As seen in Figure 7d, this results in two ancestral losses: *lost_D*, *B* and *lost_C*, *B*.

Due to the interactions between multiple ambiguous losses and duplications in the same PCC, and in contrast to reconciliation with binary species trees, the total number of losses in the most parsimonious history cannot be determined by considering each edge independently. Inferring the minimum number of losses requires obtaining the set of lowest cost assignments from the set of all possible edge assignments for each PCC. The problem of inferring the set of most parsimonious loss histories for a given binary gene tree and binary or non-binary species tree is stated formally as follows:

#### Non-Binary Loss Inference

**Input:** A rooted, arbitrary species tree, *T _{S}*, a rooted, binary gene tree,

*T*, and a mapping from

_{G}*L*(

*T*) to

_{G}*L*(

*T*).

_{S}**Output:** The minimum number of losses required to reconcile *T _{G}* and

*T*. The set of all most parsimonious loss histories, where a loss history is an assignment of each loss to an edge in

_{S}*T*and a species node in some tree in

_{G}*H**(

*T*).

_{S}Given these considerations, we present two algorithms for inferring losses, described in Sections 5.1 and 5.2. For a given gene tree, *T _{G}*, and species tree,

*T*, we report the total number of losses in

_{S}*T*and assign individual losses to edges in

_{G}*T*. One is an exact method, which considers all possible assignments of losses to edges in

_{G}*T*and selects the assignment that minimizes the number of inferred losses. This method runs in , and can return each additional optimal loss history in

_{G}*O*(|

*V*|). The second strategy is a heuristic, and returns a single loss history. The heuristic runs in

_{S}*O*(|

*V*| · (

_{G}*k*+

_{S}*h*)), and although not guaranteed to return an optimal history, does very well in practice, as described in Section 7. The heuristic is a simpler procedure, and is used as a framework for the exact algorithm, so we present it first.

_{S}### 5.1.Heuristic for inferring loss histories

The heuristic uses a greedy strategy that makes loss assignment decisions at each edge, without considering interactions with losses inferred on other edges. The strategy is to minimize duplicated losses by assigning each ambiguous loss to the permissible edge closest to the root. This guarantees that the loss will not be unnecessarily assigned below a duplication node, leading to the inference of two losses, instead of one. We do not attempt to optimize combined losses; however, after all losses are assigned, losses that satisfy the appropriate criteria are combined. This strategy will occasionally return a suboptimal loss history because it will fail to identify situations where assigning a loss below a duplication will allow the resulting copies to be combined with other losses, thus decreasing rather than increasing the total loss count. However, in practice, this happens rarely. For example, the heuristic will fail to find the optimal loss history for the gene tree in Figure 7b. Instead, it will return the reconciliation shown in Figure 7c, where *lost_B* is assigned to (*u*, *w*). If *lost_B* were assigned below the duplication to (*w*, *x*) and (*w*, *y*), a lower cost tree could be obtained, shown in in Figure 7d.

The heuristic traverses the tree in post order. At each edge, it (1) determines whether one or more losses should be assigned to that edge and (2) identifies the species in which the losses occurred. The first step is achieved by the application of three tests, described below. The greedy strategy, which assigns ambiguous losses as close to the root as permissible, is a natural outcome of the post order traversal. The permissible edge closest to the root is reached first, allowing the heuristic to assign the loss to the desired edge without explicitly determining the set of permissible edges for each loss.

The species in which the losses occurred are inferred in the second step, by comparing sets of species nodes assigned to each node . This is similar to duplication inference for non-binary species trees, but in the case of losses, the set comparisons are more complex. Three different sets of nodes are considered for every edge *e* = (*g*, *p*(*g*)):

*C*(*M*(*g*)): The set of all children of*M*(*g*).- : The set of children of
*M*(*p*(*g*)) that contain a descendant of*g* *N*(*g*): The set of children of*M*(*g*) that contain a descendant of*g*.

The first two sets we have encountered in the previous section. The third, *N*, is defined as follows:

#### Definition 5.1.

*Define* *to be*

Note that unlike , is defined for *root*(*T _{G}*). See Figure 7b for an example of and

*N*(

*g*).

Each of the following tests corresponds to one of the three situations that can incur a loss. The heuristic applies these tests to each edge, *e* = (*g*, *p*(*g*)), in *E _{G}* and assigns the inferred losses to

*e*. Note that these tests are also used in the exact algorithm described in Section 5.2, although in the exact algorithm a further optimization step is applied following the initial identification of losses.

**Test 1:** If *M*(*g*) ≠ *M*(*p*(*g*)) and *p*(*M*(*g*)) ≠ *M*(*p*(*g*)), traverse the path from *M*(*g*) to *M*(*p*(*g*)) in *T _{S}*, inferring a loss diverging from each intermediate species along this path (lines 29–31 in Algorithm 2).

This test is applied to all edges in *T _{G}*, whether associated with a binary node or polytomy in

*T*. If a node and its parent map to different nodes in

_{S}*T*, we expect those nodes to correspond to child and parent nodes in the species tree. Otherwise, genes in the intervening species must have been lost. The procedure to infer these skipped losses is carried out in the climb procedure and is analogous to that used in LCA reconciliation, described in Section 2. An example of this test is given in Figure 7c. The test is invoked on edge (

_{S}*x*,

*g*4_

*E*), because

*M*(

*x*) ≠

*M*(

*g*4_

*E*) and

*M*(

*x*) =

*β*is not the parent of species

*E*. The climb procedure will recognize that a gene is missing between

*E*and

*β*and will infer a loss in species

*F*.

**Test 2:** If *p*(*g*) is a required duplication, then losses are inferred in the species in at *e* (lines 18–21 in Algorithm 2).

This test is applied to all edges where *p*(*g*) is a required duplication, whether associated with a binary node or polytomy in *T _{S}*. Note that if

*M*(

*p*(

*g*)) is binary and

*M*(

*p*(

*g*)) =

*M*(

*g*), then and no losses are inferred. Thus, this test reduces to that used in LCA reconciliation for binary nodes in

*T*. When

_{S}*M*(

*p*(

*g*)) is polytomy, losses may occur even when

*M*(

*p*(

*g*)) =

*M*(

*g*), in contrast to binary reconciliation. Both the edges (

*w*,

*x*) and (

*w*,

*y*) in Figure 7c meet the criteria in this test. For example,

*w*is a required duplication, and comparing

*N*(

*w*) with indicates that a loss occurred in

*D*.

**Test 3:** If *M*(*g*) is a polytomy and *M*(*p*(*g*)) ≠ *M*(*g*), then losses are inferred in the species in *C*(*M*(*g*)) \ *N*(*g*) at *e* (lines 22–24 in Algorithm 2).

This test is only applied when *p*(*g*) is a speciation node and *M*(*g*) is a polytomy. It verifies that each child of *M*(*g*) contains a descendant of *g*. If not, one or more losses must be inferred. This test is applied to (*u*, *w*) in Figure 7c, since *M*(*w*) is a polytomy and *M*(*w*) ≠ *M*(*u*). A loss is inferred in *C*(*M*(*w*)) \ *N*(*w*) = {*B*, *C*, *D*, *γ*} \ {*C*, *D*, *γ*} = *B*. Note that although this loss could be assigned to any of the dotted edges in Figure 8, the greedy strategy assigns the loss to edge (*u*, *w*) the first time it is encountered.

The heuristic procedure to infer losses, described in Algorithm 2, calculates *N*(·) and and applies each of the three tests in a single postorder traversal of *T _{G}*. There is only one optimal solution under the assumptions of this heuristic. All losses can be inferred in a single pass. Duplications are also inferred during this postorder traversal, as described in Section 4.

#### Theorem 5.1.

*Algorithm 2 computes required and conditional duplications, as well as heuristic losses, in O(|V _{G}*| · (

*k*+

_{S}*h*))

_{S}*, where k*,

_{S}is the outdegree of the largest polytomy in T_{S}*and h*

_{S}is the height of T_{S}.#### Proof

At every internal node , *N*(*g*) is initialized with is bounded by *k _{S}*. Using a suitable data structure, this step can be achieved in

*O*(log(

*k*)) time per node. The climb routine is applied to every node in

_{S}*T*. For any given path from to

_{G}*ρ*=

*root*(

*T*), we will climb in total from

_{G}*M*(

*λ*) to

*M*(

*ρ*). Thus, the total cost of calls to climb is

*O*(|

*V*| ·

_{G}*h*).

_{S}Using fast Least Common Ancestor queries, *M*(·) can be calculated in *O*(|*V _{G}*|) time for the entire tree (Bender and Farach-Colton, 2000). Once M(·) has been calculated, testing for conditional duplications takes constant time per node. Testing for required duplication requires calculation of the intersection of and . This operation takes

*O*(

*k*) per node. Combining these costs, the total running time is

_{S}*O*(|

*V*

_{G}**|**· (k

_{S}+ h

_{S})).

### 5.2.Inferring optimal loss histories

In this section, we present an algorithm that considers all possible loss assignments to obtain the minimum number of combined losses. Unlike the greedy heuristic in Algorithm 2, this algorithm finds all optimal assignments, but with increased computational complexity: the running time is exponential in *k _{S}*. However, the implementation is fast in practice because

*k*is typically small (see Section 6). Moreover, the performance is enhanced by memoization, as explained below.

_{S}Before describing the algorithm, we introduce the basic approach to minimizing combined losses using the example in Figure 7. Three losses in Figure 7c, *lost*_*B*, *lost*_*C* and *lost*_*D*, are ambiguous losses that can be assigned to more than one edge in *T _{G}*. Recall that

*lost*_

*B*can be moved below the duplication at

*w*, resulting in two copies of

*lost*_

*B*, one in each child subtree of

*w*. While this would normally increase the number of losses, one copy of

*lost*_

*B*can be combined with

*lost*_

*C*, while the other can be combined with

*lost*_

*D*. This new assignment reduces the number of inferred losses from four to three, as shown in Figure 7d.

A PCC is a set of contiguous edges in *T _{G}*, with the property that all internal nodes map to the same polytomy in

*T*. Each ambiguous loss is associated with exactly one PCC and can be assigned to a (not necessarily proper) subset of its edges. In particular, if a loss can be assigned to some edge

_{S}*e*, it can also be assigned to any edge below

*e*in the PCC. Formally, we define a PCC as follows:

#### Definition 5.2.

*A node* *is the component root of a distinct PCC if M(r _{c}) is a polytomy and*

*. Let X be the set of nodes that contains r*

_{c}and all*such that g*<

*and*

_{G}r_{c}*M*(

*p*(

*g*)) =

*M(r*.

_{c}). Y is the set of edges*X and Y are the sets of nodes and edges in this PCC, respectively.*

The gene tree in Figure7b has a single PCC, shown in Figure 8. Edges in the PCC are drawn with dotted lines. The nodes in the PCC are circled. The component root of this PCC, which corresponds to the polytomy β in the species tree in Figure 7a, is *w*. Note that for every node , all nodes on the path from *v* to the *r _{c}* map to the associated polytomy. In our example, nodes

*M*(

*w*) =

*M*(

*x*) =

*M*(

*y*) = β.

The entire reconciliation procedure involves two passes through the tree. In the first pass, a modified version of Algorithm 2 (not shown) traverses the tree in postorder, calculating and *N*(·), inferring duplications, and inferring losses associated with binary nodes by invoking tests 1 and 2, described in Section 5.1. Test 1 is applied to any edge *e* = (*g*, *p*(*g*)) where *p*(*g*) is a required duplication and *M*(*p*(*g*)) is binary. Test 2 is applied to every edge in *T _{G}*. In the second pass, Algorithm 3 infers optimal assignments of ambiguous losses for each PCC, using a dynamic programming strategy. Algorithm 3 visits each node and determines whether

*g*is the component root of a new PCC (line 3). If so, it calls

*ProcessComponent*, which traverses the PCC in postorder to find optimal assignments of ambiguous losses to edges in that PCC. For each node,

*g*, in the PCC,

*ProcessComponent*calculates the minimum cost of any assignment of losses to the edge (

*p*(

*g*),

*g*) or to edges in the subtree rooted at

*g*. Tables Γ and store the cost and assignment information for

*g*, respectively. In addition, the total number of optimal loss histories in the subtree rooted at

*g*is determined and stored in the variable . After these cost tables have been calculated,

*AddCombinedLosses*uses a preorder traversal of the PCC to generate an optimal loss assignment from the values stored in Γ and .

The core of the algorithm is the calculation of Γ and for each internal node *g. ProcessComponent* considers all possible ways of assigning losses from the set to *g*, where *r _{c}* is the root of the current component and is

*N*(

*g*) if

*g*=

*r*but is otherwise. It calls

_{c}*CalculateCost*(

*g*,

*f*) for each element,

_{in}*f*, of the power set of . The parameter

_{in}*f*is a set of losses, represented as a set of species, that must be be assigned at or below

_{in}*g*.

*CalculateCost*considers two cases: a special case that is invoked if

*g*is a required duplication, which handles loss duplication, and a second case if

*g*is a conditional duplication or a speciation. In both cases, the cost of

*f*at

_{in}*g*is the minimum cost of assigning a subset of losses in

*f*to the edge above

_{in}*g*, plus the cost of losing the rest of

*f*below

_{in}*g*. The set of species represents losses that occur below

*g*; the set of remaining species,

*f*\

_{in}*f*, are assigned to (

_{out}*p*(

*g*),

*g*). For a given

*g*and

*f*,

_{in}*CalculateCost*determines the optimal assignment over all possible subsets

*f*. The cost is calculated for each value of

_{out}*f*and

_{out}*f*, and the lowest cost for each

_{in}*g*and

*f*is stored. Note that it is not necessary to explicitly test each possible value of

_{in}*f*for a given

_{out}*g*and

*f*. By enumerating the smallest values of

_{in}*f*first at line 21, we can store intermediate minimum costs and assignments in Γ

_{in}_{mem}and

_{mem}. By using these memoized values, we only have to explicitly test values of

*f*(on lines 31 and 32, and 65 and 66) for each value of

_{out}*g*and

*f*.

_{in}*CalculateCost* also stores the assignment of losses to edges associated with that cost. (g, *f*_{in}) records, in a tuple, the species which must be lost in the subtrees rooted at the left and right children of *g*. Each optimal loss assignment is stored in a list in . These values are used in *AddCombinedLosses* to select an optimal loss placement. After calculating Γ and , an optimal loss assignment is selected using a preorder traversal of the PCC. Although the pseudocode in Algorithm 3 only selects a single, arbitrary placement, all possible placements can be generated by selecting each permutation of optimal placements from for all nodes in the PCC. This part of the algorithm is mostly bookkeeping and is omitted for brevity.

The running time to return one solution is . The exponential term is due to the enumeration of the power set of in *ProcessComponent* followed by the nested enumeration of an additional power set in *CalculateCost.* This enumeration is carried out potentially for each vertex in *V _{G}* and the size of each of the two power sets is

*O*(2

^{ks}). For each node we also iterate over the elements of the set

*f*

_{in}to compute each

*f*

_{out}. The size of these sets is bounded by

*O*(

*k*). Thus, the total cost of calculating the tables Γ and is . Each additional optimal loss history can be generated in

_{S}*O*(|

*V*|). The running time to calculate Γ and is improved by memoization in

_{S}*CalculateCost*by an order of 2

^{ks}/

*k*, from .

_{S}## 6.Related Work

Variants of LCA reconciliation have previously been proposed by numerous authors, and several software packages for analyzing gene duplication histories, when both trees are binary, have been developed (Page and Charleston, 1997; Page, 1998; Zmasek and Eddy, 2001a, 2001b; Dufayard et al., 2005; Roth et al., 2005; Durand et al., 2006a). Recently, more generalized versions of LCA reconciliation, which also calculate horizontal gene transfers, have also been presented (Gorecki, 2004; Hallett and Lagergren, 2001; Hallett et al., 2004). A related problem, inferring the optimal species tree from multiple conflicting gene trees, has been studied extensively (Bansal et al., 2007; Guigó et al., 1996; Hallett and Lagergren, 2000; Ma et al., 2000; Mirkin et al., 1995; Page, 1994; Stege, 1999; Zhang, 1997) for various optimization criteria (Eulenstein et al., 1998) and has been shown to be NP-hard (Ma et al., 2000).

In this work, we have presented algorithms to reconcile binary gene trees with non-binary species trees. These algorithms determine whether a given node in the gene tree is a required duplication, a conditional duplication, or a speciation, and reconstruct all most parsimonious loss histories. To our knowledge, these results include the first algorithms for reconciliation with non-binary species trees that infer explicit event histories, as well as the first exact algorithm for determining the minimum number of losses when the species tree is non-binary.

In work on a similar theme, Berglund-Sonnhammer et al. (2006) proposed an algorithm to infer a rooted, binary gene tree given a rooted, possibly non-binary, species tree and an unrooted, possibly non-binary, gene tree as input. This method selects a rooted, binary resolution of the input gene tree by minimizing first duplications and then losses. This parsimony criterion is encoded in open form expressions for the minimum number of duplications and losses; however, no algorithm is given for calculating these quantities. In addition, the expression for losses does not determine the species in which these losses occurred or the possible assignment of these losses to edges in the gene tree. Rather, it provides an estimated number of losses and is not guaranteed to find the optimal loss assignments. A similarity between that work and the work presented here is that both methods use set-based mappings between the gene and species trees. In this work, we proposed two such mappings, *N* and , of size bounded above by *k _{S}*. The “

*M*-mapping” proposed in Berglund-Sonnhammer et al. (2006) is equivalent to our

*N*, but there is no equivalent to in the other work. Instead, they use a set

*Z*that is bounded by

*V*and resembles the inefficient solution proposed and rejected in Section 4.

_{S}Another related problem is the reconciliation of a *non-binary gene tree* with a *binary species tree*. Like non-binary species trees, non-binary gene trees are a common occurrence. For many data sets, the signal to noise ratio is not sufficient to fully resolve the binary branching order. As with non-binary species trees, the standard LCA algorithm will also lead to strange behavior when the gene tree is non-binary. In previous work (Durand et al., 2006b), we proposed a polynomial time algorithm to reconcile a non-binary gene tree with a binary species tree, as well as resolve the polytomies in the gene tree. These algorithms are also implemented in Notung. As in Section 3, we treat polytomies in a gene tree as a set of binary hypotheses. Polytomies are then resolved by extracting from this set of hypotheses, the set of binary gene trees which result in the minimum number of duplications and losses. We derive this solution with a dynamic programming algorithm that reconstructs this minimum cost set without comparing every binary resolution of the gene tree with the species tree. Alternative approaches to solving this problem have also been presented by Berglund-Sonnhammer et al. (2006) and Chang and Eulenstein (2006), but to our knowledge have not been implemented. The combined problem of reconciling non-binary gene trees with non-binary species trees remains open.

## 7.Empirical Results

The algorithms described here have been implemented in a new version of our software tool, Notung, which is publicly available on our website. Using this software, we tested our algorithms on several data sets.

First, we compared our new reconciliation algorithms to LCA reconciliation for gene families in three species groups with known polytomies: *Anolis* (Jackman et al., 1999), *Neoaves* (Walsh et al., 1999), and Auklets (Poe and Chubb, 2004). Species trees were transcribed directly from the source articles. We constructed gene trees for two gene families in *Neoaves* (cytochrome-b and globin), three families in Auklets (cytochrome-b, cytochrome oxidase 1 and NADH-6), and one family in *Anolis* (NADH-2). Sequences were downloaded from NCBI (Wheeler et al., 2005) and multiple sequence alignments were constructed with T-Coffee (Notredame et al., 2000). Phylogeny reconstruction was performed using the PHYLIP package from Felsenstein (version 3.6.1) and bootstrapped using the included SEQBOOT program. Branches with weak bootstrap support (<60%) in the globin tree from *Neoaves* were rearranged using models presented in Durand et al. (2006b).

Table 1 shows the number of leaves (*l*) in each tree, the size of the maximum polytomy (*k _{S}*) in each species tree, the number of duplications obtained by LCA reconciliation (B), the number of required duplications predicted by our algorithm (R), and the optimal number of losses. For each of these gene families, the exact and heuristic loss algorithms reported the same number of losses. Only one gene family, the globins, had more than one optimal loss assignment. As predicted, LCA reconciliation substantially overestimates required duplications.

Second, we tested the accuracy of our heuristic algorithm for inferring losses by comparing it with our exact algorithm. We ran Notung on all full trees in TreeFam 3.0 (Li et al., 2006) with a species tree obtained from the NCBI taxonomy (Wheeler et al., 2005). Of the 1174 gene trees in this dataset, 973 correspond to a non-binary species tree. For these 973 trees, the heuristic reported the same number of losses as the exact algorithm on all but seven trees (i.e., 99.3% of trees tested). Table 2 shows the number of losses inferred by the heuristic and the exact algorithm for those seven trees. As these results show, even when the heuristic does not obtain an optimal solution, the deviation from the minimum is slight. In the worst case, it overestimated the number of losses by four in a tree with 249 losses.

Finally, we considered the running time of the loss algorithms using the same Treefam dataset. The computational complexity of Algorithm 3 is exponential in *k _{S}*, while the heuristic in Algorithm 2 is

*O*(|

*V*| · (

_{G}*k*+

_{S}*h*)). The time required to reconcile all 1174 gene trees was 0′48″ for the heuristic and 2′25″ for the exact algorithm on a 3.2-ghz OptiPlex GX620 computer. The running time of the memoized implementation of the exact algorithm suggests that it is acceptable for species trees with

_{S}*k*≤ 15, the maximum

_{S}*k*for any tree in the Treefam dataset. (Previous tests on running time for the exact algorithm, without memoization, showed that memoization resulted in a speed-up by a factor of ~20 for this dataset.) A single tree in this dataset corresponds to

_{S}*k*= 15; the remaining 1173 trees correspond to

_{S}*k*≤ 12. The time required to reconcile these 1173 trees using the exact algorithm is comparable to the time for the heuristic (0′50″ versus 0′48″). For data sets too large for exact analysis, our accuracy tests show that the heuristic can be relied on for near-optimal results.

_{S}## 8.Discussion

In this work, we have presented novel algorithms for the reconciliation of binary gene trees with non-binary species trees. These solutions are founded on current theories of molecular evolution, especially coalescent theory (Pamilo and Nei, 1988; Takahata and Nei, 1985; Maddison, 1997; Tajima, 1983; Hudson, 1990). Our algorithms are both space and time efficient, and have practical applications. They have been implemented in a new version of our software tool, Notung, which provides a graphical user interface for the in-depth study of gene families, as well high throughput capabilities. To our knowledge, these are the first algorithms for reconciling binary gene trees with non-binary species trees to obtain the history of duplications and losses in a gene family. We also present the first exact algorithm for determining the minimum number of losses when the species tree is non-binary.

Our algorithms are of immediate use to researchers using phylogenetic analysis in a broad range of biological endeavors and are promising for further algorithmic development. For example, the loss models developed in Section 5 could also be exploited in algorithms to reconstruct a species tree from a set of gene trees using loss parsimony, as proposed by Chauve et al. (2007). Our definitions of required and conditional duplications and the set-based mappings *N* and provide a potential foundation for probabilistic models of non-binary reconciliation. Such methods require a correspondence between ancestral genes and ancestral species. *N* and provide this information.

Several problems remain for future work. Probabilistic models would complement the parsimony framework presented here. Bayesian approaches (Arvestad et al., 2003, 2004; DeBie et al., 2006), which assume homogeneous rates, are appropriate for data sets in which duplication and loss are neutral, stochastic processes. Parsimony is better suited to data sets in which duplication and loss are rare due to selective pressure. A probabilistic framework provides a natural setting for incorporating sequence data directly into the reconciliation process, but has the disadvantage that it is both computation and data intensive. A complete phylogenetic toolkit should include both approaches.

We considered two cases in this work: binary species trees with complete lineage sorting (i.e., one binary gene tree has probability one and all others probability zero) and non-binary species trees that give rise to all binary gene trees with equal probability. In fact, these represent extremes on a continuum. A more general approach would include polytomy models that deviate from a uniform distribution, as well as models that relax the assumption of complete lineage sorting in binary trees (Pollard et al., 2006; Lyons-Weiler and Milinkovitch, 1997; Poe and Chubb, 2004; McCracken and Sorenson, 2005; Hoelzer and Melnick, 1994). This can occur if the time between divergences is short enough so that allelic diversity has not been eliminated by genetic drift. In fact, the shorter the branch lengths and the larger the effective population size, the more closely a series of rapid binary divergences will approximate a hard polytomy (Pamilo and Nei, 1988; Takahata and Nei, 1985; Maddison, 1997; Tajima, 1983; Hudson, 1990). In addition, non-binary tree models that include horizontal gene transfer as well as gene duplication and loss are needed.

Finally, reconciliation of non-binary gene trees with non-binary species trees should also be investigated. The similarity of this problem to the problem of reconstructing a species tree from many gene trees suggests that it may be NP-complete, but the hardness of the problem remains open and algorithms (or approximation algorithms) are needed. The complexity of loss inference must also be investigated. The exact algorithm presented in Section 5.2 considers all possible loss assignments and requires exponential running time. However, whether an exponential time solution is actually required is an open question.

With the availability of sequences from many closely related genomes, it is increasingly apparent that the histories of individual genes differ and that discordance between gene and species trees is common. Software tools that are sufficiently flexible to handle this situation are needed (Pollard et al., 2006). The work presented here offers this flexibility.

## Footnotes

^{1}Duplication followed by loss of all descendants of one copy would not leave this pattern, but parsimony reconciliation models only consider cases where there is some remaining evidence of duplication.

## Acknowledgments

We thank M. Hahn, S. Hartmann, T. Hladish, R. Hoberman, H.B. Nicholas, Jr., M. Sanderson, T. Vision, R. Schwartz, and S. Sridhar for helpful discussions, and R. Cintron, H.B. Nicholas, Jr., and J. Nam for providing phylogenetic trees for the experimental analysis. This work was supported by the National Institutes of Health (NIH grant 1 K22 HG 02451-01), Pittsburgh Supercomputing Center Biomedical Computing Initiative Computational Facilities Access Grant (MCB000010P), and a David and Lucille Packard Foundation fellowship.

## Disclosure Statement

No conflicting financial interests exist.

## References

- Arvestad L. Berglund A. Lagergren J., et al. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics. 2003;19(Suppl 1):i7–i15. [PubMed]
- Arvestad L. Berglund A. Lagergren J., et al. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. RECOMB 2004. 2004:326–335.
- Bansal M. Burleigh J. Eulenstein O., et al. Heuristics for the gene-duplication problem: a θ(n) speed-up for the local search. Lect. Notes Bionform. 2007;4453:238–252.
- Bender M. Farach-Colton M. Least common ancestors revisited. Latin '00. 2000:88–94.
- Berglund-Sonnhammer A. Steffansson P. Betts M., et al. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J. Mol. Evol. 2006;63:240–250. [PubMed]
- Blomme T. Vandepoele K. De Bodt S., et al. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol. 2006;7:R43. [PMC free article] [PubMed]
- Bourgon R. Delorenzi M. Sargeant T., et al. The serine repeat antigen SERA gene family phylogeny in Plasmodium: the impact of GC content and reconciliation of gene and species trees. Mol. Biol. Evol. 2004;21:2161–2171. [PubMed]
- Chang W.-C. Eulenstein O. Reconciling gene trees with apparent polytomies. Lect. Notes Comput. Sci. 2006;4112:235–244.
- Chauve C. Doyon J.-P. El-Mabrouk N. Inferring a duplication, speciation and loss history from a gene tree (extended abstract) RECOMB-CG 2007. 2007:45–57.
- Chen K. Durand D. Farach-Colton M. Notung: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 2000;7:429–447. [PubMed]
- DeBie T. Cristianini N. Demuth J.P., et al. Cafe: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22:1269–1271. [PubMed]
- Demuth J.P. De Bie T. Stajich J.E., et al. The evolution of mammalian gene families. PLoS ONE. 2006;1:e85. [PMC free article] [PubMed]
- Dufayard J. Duret L. Penel S., et al. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics. 2005;21:2596–2603. [PubMed]
- Durand D. Halldorsson B. Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 2006a;13:320–335. [PubMed]
- Durand D. Halldorsson B. Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 2006b;13:320–335. [PubMed]
- Ermolaeva M. Wu M. Eisen J., et al. The age of the Arabidopsis thaliana genome duplication. Plant Mol. Biol. 2003;51:859–866. [PubMed]
- Eulenstein O. Mirkin B. Vingron M. Duplication-based measures of difference between gene and species trees. J. Comput. Biol. 1998;5:135–148. [PubMed]
- Goodman M. Czelusniak J. Moore G., et al. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 1979;28:132–163.
- Gorecki P. Reconciliation problems for duplication, loss and horizontal gene transfer. Proc. 8th Annu. Int. Conf. CMB. 2004:316–325.
- Gu X. Wang Y. Gu J. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat. Genet. 2002;31:205–209. [PubMed]
- Guigó R. Muchnik I. Smith T. Reconstruction of ancient molecular phylogeny. Mol. Phylogenet. Evol. 1996;6:189–213. [PubMed]
- Hahn M.W. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 2007;8:R141. [PMC free article] [PubMed]
- Hahn M.W. Han M.V. Han S.-G. Gene family evolution across 12 Drosophila genomes. PLoS Genet. 2007;3:e197. [PMC free article] [PubMed]
- Hallett M. Lagergren J. New algorithms for the duplication-loss model. In RECOMB 2000. 2000. pp. 138–146.
- Hallett M. Lagergren J. Efficient algorithms for lateral gene transfer problems. Proc. 5th Annu. Int. Conf. Comput. Biol. 2001:149–156.
- Hallett M. Lagergren J. Tofigh A. Simultaneous identification of duplications and lateral transfers. Proc. 8th Annu. Int. Conf. CMB. 2004:347–356.
- Hoelzer G. Melnick D. Patterns of speciation and limits to phylogenetics resolution. Trend Ecol. Evol. 1994;9:104–107. [PubMed]
- Hudson R. Gene genealogies and the coalescent process, 1–44. In: Futuyma D., editor; Antonovics J., editor. Oxford Surveys in Evolutionary Bology, Volume 7. Oxford University Press; New York: 1990.
- Jackman T. Larson A. De Queiroz K., et al. Phylogenetic relationships and tempo of early diversification in Anolis lizards. Syst. Biol. 1999;48:254–285.
- Li H. Coghlan A. Ruan J., et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:572–580. [PMC free article] [PubMed]
- Li W. Molecular Evolution. Sinauer Associates Inc.; Sunderland, MA: 1997.
- Lyons-Weiler J. Milinkovitch M. A phylogenetic approach to the problem of differential lineage sorting. Mol. Biol. Evol. 1997;14:968–975.
- Ma B. Li M. Zhang L. From gene trees to species trees. SIAM J. Comput. 2000;30:729–752.
- Maddison W. Reconstructing character evolution on polytomous cladograms. Cladistics. 1989;5:365–377.
- Maddison W. Gene trees in species trees. Syst. Biol. 1997;46:523–536.
- McCracken K. Sorenson M. Is homoplasy or lineage sorting the source of incongruent mtdna and nuclear gene trees in the stiff-tailed ducks (nomonyx-oxyura)? Syst. Biol. 2005;54:35–55. [PubMed]
- McLysaght A. Hokamp K. Wolfe K. Extensive genomic duplication during early chordate evolution. Nat. Genet. 2002;31:200–204. [PubMed]
- Melnick D. Hoelzer G. Absher R., et al. mtDNA diversity in Rhesus monkeys reveals overestimates of divergence time and paraphyly with neighboring species. Mol. Biol. Evol. 1993;10:282–295. [PubMed]
- Mirkin B. Muchnik I. Smith T. A biologically consistent model for comparing molecular phylogenies. J. Comput. Biol. 1995;2:493–507. [PubMed]
- Nam J. Masatoshi N. Evolutionary change in the numbers of homebox genes in bilateral animals. Mol. Biol. Evol. 2005;22:2386–2394. [PMC free article] [PubMed]
- Notredame C. Higgins D. Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000;302:205–217. [PubMed]
- Page R. Maps between trees and cladistic analysis of historical associations among genes, organisms and areas. Syst. Biol. 1994;43:58–77.
- Page R. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998;14:819–20. [PubMed]
- Page R. Charleston M. From gene to organismal phylogeny: reconciled trees and the gene tree/species tree problem. Mol. Phylogenet. Evol. 1997;7:231–240. [PubMed]
- Page R. Holmes E. Molecular Evolution: A Phylogenetic Approach. Blackwell Science; New York: 1998.
- Pamilo P. Nei M. Relationships between gene trees and species trees. Mol. Biol. Evol. 1988;5:568–583. [PubMed]
- Paterson A. Bowers J. Chapman B. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl. Acad. Sci. USA. 2004;101:9903–9908. [PMC free article] [PubMed]
- Perrière G. Duret L. Gouy M. HOBACGEN: database system for comparative genomics in bacteria. Genome Res. 2000;10:379–385. [PMC free article] [PubMed]
- Poe S. Chubb A. Birds in a bush: five genes indicate explosive evolution of avian orders. Evolution. 2004;58:404–415. [PubMed]
- Pollard D. Iyer V. Moses A., et al. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet. 2006;2:e173. [PMC free article] [PubMed]
- Ronquist F. Parsimony analysis of coevolving species associations, 22–64. In: Page R.D.M., editor. Tangled Trees: Phylogeny, Cospeciation and Coevolution. University of Chicago Press; Chicago: 2002.
- Roth C. Betts M. Steffansson P., et al. The adaptive evolution database (TAED): a phylogeny-based tool for comparative genomics. Nucleic Acids Res. 2005;33:D495–D497. [PMC free article] [PubMed]
- Ruvinsky I. Silver L. Newly indentified paralogous groups on mouse chromosomes 5 and 11 reveal the age of a T-box cluster duplication. Genomics. 1997;40:262–266. [PubMed]
- Salzburger W. Meyer A. Baric S., et al. Phylogeny of the Lake Tanganyika cichlid species flock and its relationship to the Central and East African haplochromine cichlid fish faunas. Syst. Biol. 2002;51:113–135. [PubMed]
- Searls D. Pharmacophylogenomics: genes, evolution and drug targets. Nat. Rev. Drug Discov. 2003;2:613–623. [PubMed]
- Sennblad B. Schreil E. Berglund Sonnhammer A., et al. Primetv: a viewer for reconciled trees. BMC Bioinform. 2007;8:148. [PMC free article] [PubMed]
- Slowinski J. Page R. How should species phylogenies be inferred from sequence data? Syst. Biol. 1999;48:814–825. [PubMed]
- Stege U. Gene trees and species trees: the gene-duplication problem is fixed-parameter tractable. Lect. Notes Comput. Sci. 1999;1663:288–293.
- Tajima F. Evolutionary relationship of dna sequences in finite populations. Genetics. 1983;105:437–460. [PMC free article] [PubMed]
- Takahata N. Nei M. Gene genealogy and variance of interpopulational nucleotide differences. Genetics. 1985;110:325–344. [PMC free article] [PubMed]
- Vandepoele K. Simillion C. Van de Peer Y. Evidence that rice and other cereals are ancient aneuploids. Plant Cell. 2003;15:2192–2202. [PMC free article] [PubMed]
- Vandepoele K. Vos W. Taylor J., et al. Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc. Natl. Acad.Sci. USA. 2004;101:1638–1643. [PMC free article] [PubMed]
- Walsh H. Kidd M. Moum T., et al. Polytomies and the power of phylogenetic inference. Evolution. 1999;53:932–937.
- Wang X. Shi X. Hao B., et al. Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol. 2005;165:937–946. [PubMed]
- Wheeler D. Hope R. Cooper S., et al. An orphaned mammalian beta-globin gene of ancient evolutionary origin. Proc. Natl. Acad. Sci. USA. 2001;98:1101–1106. [PMC free article] [PubMed]
- Wheeler D. Barrett T. Benson D.A., et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2005;33:D39–D45. [PMC free article] [PubMed]
- Zhang L. On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J. Comput. Biol. 1997;4:177–188. [PubMed]
- Zmasek C. Eddy S. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 2001a;17:383–384. [PubMed]
- Zmasek C. Eddy S. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001b;17:821–828. [PubMed]

**Mary Ann Liebert, Inc.**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (448K)

- Reconciliation with non-binary species trees.[Comput Syst Bioinformatics Conf. 2007]
*Vernot B, Stolzer M, Goldman A, Durand D.**Comput Syst Bioinformatics Conf. 2007; 6:441-52.* - Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees.[Bioinformatics. 2012]
*Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D.**Bioinformatics. 2012 Sep 15; 28(18):i409-i415.* - Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem.[BMC Bioinformatics. 2012]
*Górecki P, Eulenstein O.**BMC Bioinformatics. 2012 Jun 25; 13 Suppl 10:S14. Epub 2012 Jun 25.* - Maximum likelihood models and algorithms for gene tree evolution with duplications and losses.[BMC Bioinformatics. 2011]
*Górecki P, Burleigh GJ, Eulenstein O.**BMC Bioinformatics. 2011 Feb 15; 12 Suppl 1:S15. Epub 2011 Feb 15.* - Models, algorithms and programs for phylogeny reconciliation.[Brief Bioinform. 2011]
*Doyon JP, Ranwez V, Daubin V, Berry V.**Brief Bioinform. 2011 Sep; 12(5):392-400.*

- Medically important differences in snake venom composition are dictated by distinct postgenomic mechanisms[Proceedings of the National Academy of Scie...]
*Casewell NR, Wagstaff SC, Wüster W, Cook DA, Bolton FM, King SI, Pla D, Sanz L, Calvete JJ, Harrison RA.**Proceedings of the National Academy of Sciences of the United States of America. 2014 Jun 24; 111(25)9205-9210* - The Evolution and Appearance of C3 Duplications in Fish Originate an Exclusive Teleost c3 Gene Form with Anti-Inflammatory Activity[PLoS ONE. ]
*Forn-Cuní G, Reis ES, Dios S, Posada D, Lambris JD, Figueras A, Novoa B.**PLoS ONE. 9(6)e99673* - Inferring Gene Family Histories in Yeast Identifies Lineage Specific Expansions[PLoS ONE. ]
*Ames RM, Money D, Lovell SC.**PLoS ONE. 9(6)e99480* - Reconciliation of Gene and Species Trees[BioMed Research International. 2014]
*Rusin LY, Lyubetskaya EV, Gorbunov KY, Lyubetsky VA.**BioMed Research International. 2014; 2014642089* - Most parsimonious reconciliation in the presence of gene duplication, loss, and deep coalescence using labeled coalescent trees[Genome Research. 2014]
*Wu YC, Rasmussen MD, Bansal MS, Kellis M.**Genome Research. 2014 Mar; 24(3)475-486*

- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- Reconciliation with Non-Binary Species TreesReconciliation with Non-Binary Species TreesJournal of Computational Biology. Oct 2008; 15(8)981PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...