- Journal List
- Bioinformatics
- PMC3436813

# Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees

^{1,*}, Han Lai^{1}, Minli Xu^{2}, Deepa Sathaye^{3}, Benjamin Vernot^{4} and Dannie Durand^{1,3}

^{1}Department of Biological Sciences

^{2}Lane Center for Computational Biology

^{3}Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

^{4}Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

## Abstract

**Motivation:** Gene duplication (D), transfer (T), loss (L) and incomplete lineage sorting (I) are crucial to the evolution of gene families and the emergence of novel functions. The history of these events can be inferred via comparison of gene and species trees, a process called reconciliation, yet current reconciliation algorithms model only a subset of these evolutionary processes.

**Results:** We present an algorithm to reconcile a binary gene tree with a non-binary species tree under a DTLI parsimony criterion. This is the first reconciliation algorithm to capture all four evolutionary processes driving tree incongruence and the first to reconcile non-binary species trees with a transfer model. Our algorithm infers all optimal solutions and reports complete, temporally feasible event histories, giving the gene and species lineages in which each event occurred. It is fixed-parameter tractable, with polynomial time complexity when the maximum species outdegree is fixed. Application of our algorithms to prokaryotic and eukaryotic data shows that use of an incomplete event model has substantial impact on the events inferred and the resulting biological conclusions.

**Availability:** Our algorithms have been implemented in Notung, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung.

**Contact:** mstolzer@andrew.cmu.edu

## 1 INTRODUCTION

The phylogeny of a gene family evolving by vertical descent will agree with the associated species tree. Gene duplication, gene loss, horizontal gene transfer (HGT) or incomplete lineage sorting (ILS) can result in a gene tree that differs from the species tree (Maddison, 1997). The history of such events can be inferred through topological comparison of gene and species trees, a process called ‘reconciliation’. Reconciliation encompasses two related problems: event inference and tree inference. Given rooted gene and species trees, a mapping from extant genes to extant species, and an event model, the goal of ‘event inference’ is to infer the association between ancestral genes and species and the optimal event history with respect to a combinatorial or probabilistic optimization criterion. A complete solution must include the specific events and the gene and species lineages in which those events occurred. Given a set of gene trees, ‘tree inference’ seeks the species tree that optimizes the combined events resulting from reconciliation with each gene tree in the input set.

Here, we address the event inference problem for a model that captures all four evolutionary processes contributing to gene tree incongruence. Whole genome sequencing data are revealing an ever growing number of cases where all four processes are active (e.g., Andersson, 2009; Serres *et al.*, 2009; Zhaxybayeva and Doolittle, 2011), leading to calls for algorithms that model multiple evolutionary processes (Degnan and Rosenberg, 2009; Edwards, 2009). Algorithms lacking a model of incongruence due to ILS will overestimate the number of duplications and/or transfers. For example, a recent analysis, based on a model that did not consider ILS, reported an inexplicable but dramatic increase in duplications in recently sequenced mammalian genomes (Milinkovitch *et al.*, 2010). For large-scale analysis of multigenome phylogenetic datasets, reconciliation algorithms that allow ILS to be distinguished from other sources of incongruence are essential.

### 1.1 Related work

Gene tree incongruence has been considered from two perspectives. Multispecies coalescent models focus on ILS as a source of incongruence (reviewed in Degnan and Rosenberg, 2009). The basic assumption underlying this work is that gene tree incongruence arises from ILS due to genetic drift, although some methods also take hybridization and/or recombination into account (reviewed in Degnan and Rosenberg 2009; Edwards 2009). The multispecies coalescent explicitly relates the probability of an incongruent gene tree to the time between species divergences and the effective size of the ancestral population. In the context of tree inference, these parameters can be inferred from a collection of gene trees. Event inference, however, requires prior estimates of population parameters because only one tree is under consideration.

In contrast, reconciliation focuses on incongruence that arises from processes that change the number of loci in a gene family; i.e. duplication, loss and transfer. Most event inference algorithms consider either gene duplication or HGT (Doyon *et al.*, 2011; Nakhleh, 2010; Nakhleh *et al.*, 2009), but not both. Exact algorithms with exponential time complexity have been presented for the duplication-transfer (DT) (Tofigh *et al.*, 2011) and duplication-transfer-loss (DTL) models (David and Alm, 2011), under a parsimony criterion. Event inference with transfers is NP-complete (Hallett *et al.*, 2004), but can be solved in polynomial time under a restricted model where only transfers between contemporaneous species are considered. This model (reviewed in Doyon *et al.*, 2011; Huson and Scornavacca, 2011) requires estimates of speciation times, which are frequently not known. In addition, algorithms for this restricted model may fail to recognize transfers if they involve a taxon missing from the dataset (Huson and Scornavacca, 2011; Nakhleh, 2010).

Reconciliation implicitly assumes that inter-speciation times are sufficiently long that genetic drift and incomplete lineage sorting may be safely excluded from consideration. This assumption breaks down when the species tree contains polytomies or very short branches. In these situations, allelic variation can survive multiple speciation events, leading to gene trees with branching patterns that differ from the species tree. Such cases are increasingly common due to increased sequencing of closely related species. Methods that do not consider ILS will incorrectly interpret incongruence arising from ILS as evidence of duplication or transfer.

To avoid this problem, algorithms that can distinguish between ILS and other events are needed. In fact, one parsimony criterion that considers ILS has been proposed: minimization of the number of extra gene lineages on a species branch due to Deep Coalescence (MDC) has been used as a criterion for tree inference (Maddison, 1997; Maddison and Knowles, 2006; Maddison and Maddison, 2011; Page, 1998; Than and Nakhleh, 2009). However, the MDC criterion assumes ‘all’ incongruence is due to ILS. MDC is not a suitable basis for event inference because it cannot distinguish between extra lineages arising from ILS and those arising from duplication or transfer (Zhang, 2011). Two approaches to the event inference problem combine ILS with gene duplication and loss in a single model (DLI). In earlier work, we presented the first event inference algorithm for the DLI model under a parsimony criterion (Vernot *et al.*, 2008). An event inference algorithm for a DLI model based on the multispecies coalescent relates the probability of ILS to branch lengths and population sizes explicitly (Rasmussen and Kellis, 2012). These models have different strengths. The model based on the coalescent captures more detail, but is limited to the small number of datasets for which estimates of ancestral population sizes and speciation times are available. To our knowledge, no reconciliation algorithms that consider ILS and transfer are in existence.

### 1.2 Our contributions

We present the first reconciliation algorithm for a DTLI event model that captures all four major causes of gene tree incongruence. Our algorithm is also the first to allow transfers in reconciliation with a non-binary species tree. Our algorithm is based on a simple, elegant model that recognizes ILS as a source of incongruence, but avoids the computational overhead of a full coalescent model and does not require estimates of ancestral population sizes and speciation times.

Our parsimony-based algorithm reconciles a binary gene tree with a non-binary species tree and distinguishes between incongruence that could only arise through duplication or HGT and incongruence that can be more parsimoniously explained by ILS. Our algorithm places no restriction on speciation times and reports all optimal reconciliations that are temporally feasible. For a fixed *k**, the time complexity of our algorithm is *O*(*h*_{S}|*V*_{G}||*V*_{S}|^{2}), where *k** is the out-degree of the largest polytomy in the species tree, *h*_{S} is the height of the species tree and |*V*_{G}| and |*V*_{S}| are the number of vertices in the gene and species trees, respectively. Given a binary species tree, our algorithm infers histories under the DTL model.

Both the DTL and DTLI algorithms have been implemented in Java and integrated in Notung, a freely available software package for phylogenetic reconciliation. Our software offers a unique and comprehensive combination of functions: it includes losses in the optimization criterion, does not require estimates of speciation times and reports all optimal event histories. Reported solutions are complete, temporally feasible event histories, giving the gene and species lineages in which each event occurred.

To demonstrate the advantages of a full-DTLI model on real data, we applied our algorithm to two phylogenetic datasets that have been used in previous analyses of HGT and phylogenetic incongruence (Delsuc *et al.*, 2005; Rokas *et al.*, 2003; Zhaxybayeva *et al.*, 2009). First, if no incongruent trees have patterns that could be most parsimoniously explained as ILS, then models with and without ILS should give the same results. In fact, we observed just the opposite. The models that did not correct for ILS substantially overestimated duplications and transfers. A recent study using a quartet decomposition approach reported several highways of gene transfer between specific pairs of cyanobacterial species (Bansal *et al.*, 2011). We observed the same highways using the DTL algorithm. Only one of these highways remained when using the DTLI algorithm. Second, because many published algorithms do not include losses in the optimization criterion (e.g., Berglund *et al.*, 2006; Ma *et al.*, 2000; Tofigh *et al.*, 2011; Zmasek and Eddy, 2001), we compared models with losses (DTLI, DTL) and without losses (DTI, DT). Explicit inclusion of losses in the optimization function resulted in substantial changes to the inferred ratio of duplications to transfers, suggesting that the practice of *post hoc* inference of losses should be revisited.

Finally, when the event model includes transfers, the minimum cost event history is not, in general, unique. All algorithms cited above report only one of possibly many optimal solutions. We applied our algorithm to assess the extent to which multiple optimal solutions occur. We discovered that multiple optimal solutions are a frequent occurrence, especially in datasets where transfer is the dominant process. In the analysis reported here, 20% of 1128 cyanobacterial trees had multiple optimal solutions with inconsistent event histories. In other words, for one in five trees, the arbitrary selection of a single optimal solution could lead to conclusions that might not be supported by other optimal solutions. The results presented here demonstrate that degeneracy and the applied event model have substantial impact on the histories inferred and, hence, on the resulting biological conclusions.

### 1.3 Notation

Given a tree, *T*_{i} = (*V*_{i}, *E*_{i}), *L*(*T*_{i}) designates the leaf set of *T*_{i}, and *ρ*_{i} designates its root. We use *g* ∈ *V*_{G} and *s* ∈ *V*_{S} to represent genes and species, respectively. *T*_{i}(*v*) is the subtree of *T*_{i} rooted at *v* ∈ *V*_{i}. *C*(*v*) and *P*(*v*) denote the children and the parent of *v*, respectively, with *c*_{j} ∈ *C*(*v*) denoting the *j*th child of *v*. We adopt the notation that if (*u*, *v*) ∈ *E*_{i}, *P*(*v*) = *u*. Given nodes *u*, *v* ∈ *V*_{i}, if *u* is on the path from *v* to *ρ*_{i}, then *u* is an ancestor of *v*, designated *u* ≥_{i} *v*, and *v* is a descendant of *u*, designated *v* ≤_{i} *u*. If *v* ≱_{i} *u* and *u* ≱_{i} *v*, *u* and *v* are ‘incomparable’, designated *u* ≸_{i} *v*.

## 2 ALGORITHMS

Here, we propose a reconciliation model based on DTL parsimony that distinguishes between regions of the species tree where ILS is likely, and those where only gene duplication and transfer need be considered. These differences are specified using a non-binary species tree: at binary nodes, we assume that ILS is so rare that incongruence is always evidence of gene duplication or transfer. At polytomies, ILS is considered, and gene duplication and transfer are invoked only if topological disagreement cannot be explained by ILS. This model can be invoked for both non-binary species trees and for binary species trees with short branches where ILS is suspected: even when the binary branching order of the species tree is known, the user can collapse edges in the species tree to indicate in which lineages ILS should be considered as an alternate hypothesis.
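As an illustration of how a user might collapse an edge to mark a region where ILS should be considered, the following minimal sketch removes one internal edge from a species tree and attaches the grandchildren directly to the parent. The dictionary representation, the function name `collapse_edge` and the example tree are assumptions made for this sketch; they are not part of Notung.

```python
from typing import Dict, List


def collapse_edge(children: Dict[str, List[str]], parent: str, child: str) -> Dict[str, List[str]]:
    """Return a copy of the tree in which the edge (parent, child) is removed
    and the children of `child` hang directly from `parent`.

    `children` maps each internal node to its list of children (hypothetical
    representation chosen for this sketch); leaves have no entry.
    """
    if child not in children.get(parent, []):
        raise ValueError(f"({parent}, {child}) is not an edge of the tree")
    if not children.get(child):
        raise ValueError("cannot collapse an edge that leads to a leaf")
    # Copy every node except the one being removed.
    collapsed = {node: list(kids) for node, kids in children.items() if node != child}
    # Splice the grandchildren into the position previously held by `child`.
    idx = collapsed[parent].index(child)
    collapsed[parent][idx:idx + 1] = children[child]
    return collapsed


# Hypothetical binary species tree ((A,B)x,C)root.
species_tree = {"root": ["x", "C"], "x": ["A", "B"]}
# Collapsing the short edge (root, x) yields a polytomy of size 3 at the root.
polytomy_tree = collapse_edge(species_tree, "root", "x")
```

Here `polytomy_tree["root"]` lists `A`, `B` and `C` as siblings, so incongruence among these three lineages would be treated as a candidate for ILS under the model described above.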

A key aspect of our model is that even when ILS is allowed, it is not possible to explain all incongruence in terms of ILS, even in a uniquely labeled gene tree. Let *g* be a node in *T*_{G} and let *s* ∈ *V*_{S} be the associated node in the species tree. We wish to determine whether the divergence at *g* is consistent with a co-divergence at *s* or whether it can only be explained by events that give rise to a new locus; i.e. duplication and transfer. If the branch point at *g* arose through a co-divergence with *s*, then each species lineage descending from *s* should inherit at most one descendant of *g*. The presence of more than one descendant of *g* indicates that the divergence at *g* must be due to acquisition of an additional locus by duplication or transfer. An operational test for detecting more than one descendant on a branch results from the observation that any branching pattern that is consistent with a binary resolution of the polytomy can be explained by lineage sorting.

For example, the gene tree in Figure 1a represents a valid, binary resolution of the species tree, consistent with ILS. The embedding of the gene tree in the species tree shows that each species tree lineage inherits exactly one descendant of *x*_{1} and at most one descendant of *x*_{2}. Both *x*_{1} and *x*_{2} can be interpreted as deep coalescences. In contrast, there is no binary resolution of the species tree that corresponds to the gene tree in Figure 1b. The embedding of this gene tree requires two descendants of *y*_{2} in the lineage from *e* to *f*, a violation of model constraints. The only way to explain two descendants of *y*_{2} on the branch from *e* to *f* is by inferring a duplication (Fig. 1c) or a transfer (Fig. 1d).

**Fig. 1.** **(A)** A binary gene tree that is consistent with a binary resolution of the species tree. The divergences at *x*_{1} and *x*_{2} are consistent with ILS. **(B)** A gene tree that **...**

Before introducing our algorithm, we discuss the meaning of a polytomy in our model. A species polytomy can be considered from two perspectives: a ‘hard’ polytomy represents simultaneous divergence of three or more populations. A ‘soft’ polytomy represents a binary branching process in which the branching order is unknown. Our model assumes that a polytomy represents rapid or simultaneous species divergence. However, it also admits a useful interpretation for soft polytomies. A soft polytomy can be viewed as a set of hypotheses, namely the set of binary resolutions of the polytomy. Our model offers a conservative stance: events are only inferred when the topology of the gene tree does not correspond to any of these hypotheses. Note that in some cases, the hard and soft polytomy models are closely linked: the branching order of species that arose through multiple speciations in rapid succession (Ebersberger *et al.*, 2007; Pollard *et al.*, 2006) is often difficult to resolve.

### 2.1 The DTLI algorithm

In our DTLI model, divergence in a gene tree arises through one of four events: duplication, transfer, speciation and deep coalescence. The score of a reconciliation under this model is the weighted sum of the number of duplications (*N*_{D}), losses (*N*_{L}) and transfers (*N*_{T}):

$$\pi = \delta N_{D} + \lambda N_{L} + \tau N_{T} \qquad (1)$$

where *δ*, *λ* and *τ*, respectively, are the costs of duplication, loss and transfer. Speciation and deep coalescence represent co-divergence with binary nodes and polytomies, respectively, in the species tree and have zero cost. We refer to the cost of event *ε* as *κ*(*ε*).
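The weighted-sum score described above can be illustrated with a short sketch. The function name and argument names are hypothetical; the default weights are the values *δ* = 3, *λ* = 2 and *τ* = 2.5 used in the empirical section, and speciation and deep coalescence contribute nothing because they have zero cost.

```python
def reconciliation_score(n_dup: int, n_loss: int, n_transfer: int,
                         delta: float = 3.0, lam: float = 2.0, tau: float = 2.5) -> float:
    """Weighted sum of duplication, loss and transfer counts.

    delta, lam and tau are the per-event costs of duplication, loss and
    transfer; co-divergence events are free and therefore not counted.
    """
    return delta * n_dup + lam * n_loss + tau * n_transfer


# A hypothetical history with one duplication, two losses and one transfer.
score = reconciliation_score(n_dup=1, n_loss=2, n_transfer=1)
```

With the default weights, this hypothetical history scores 3 + 4 + 2.5 = 9.5. Raising *τ* makes histories with transfers more expensive relative to duplication-loss explanations.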

A rooted, binary gene tree *T*_{G}; a rooted, arbitrary species tree *T*_{S}; a mapping *M*_{L} : *L*(*V*_{G}) → *L*(*V*_{S}) from contemporary genes to the species from which they were sampled and a set of permitted events are given as input. The reconciliation of *T*_{G} with *T*_{S} results in an annotated tree, *R*_{GS} = (*V*_{G}, *E*_{G}), in which every internal node, *g*, is annotated with the species *s* ∈ *V*_{S} that contained gene *g*, designated *M*(*g*), and the event that caused the divergence at *g*. In addition, every *g* ∈ *V*_{G} \ {*ρ*_{G}} is annotated with the genes lost on the edge from *P*(*g*) to *g*. Each loss is labeled with the species in which the loss occurred. We say (*u*, *v*) ∈ *E*_{G} is a transfer edge if the event at *u* is a transfer and *M*(*u*) ≸_{S} *M*(*v*), and define Λ(*R*_{GS}) ⊂ *E*_{G} to be the set of transfer edges in *R*_{GS}. If (*u*, *v*) ∈ Λ(*R*_{GS}), a transfer occurred from donor species *d* = *M*(*u*) to recipient species *r* = *M*(*v*).

Here, we present the DTLI event inference problem under the constraint that a deep coalescence is inferred at *g* if each lineage descending from *M*(*g*) inherits at most one descendant of *g*:

**The DTLI event inference problem**

**Input:** A rooted non-binary species tree, *T*_{S}; a rooted, binary gene tree, *T*_{G}; the leaf mapping, *M*_{L}.

**Output:** All reconciliation histories *R*_{GS} that minimize *π* and satisfy the model constraints.

Algorithms for the DTLI event model must address several issues that do not arise when only a subset of the events is considered: (1) there may be more than one combination of duplications, transfers and losses that gives rise to the same pattern of tree incongruence (i.e. there may be more than one optimal solution, *R*_{GS}). (2) The value of *M*(*g*) is not uniquely determined by the children of *g* and multiple possible values of *M*(*g*) must be considered because transfers cause genes to jump to distant locations in the species tree. (3) An optimal reconciliation at the root may entail a suboptimal reconciliation at an internal node, *g*. Inferring a more costly event at *g* may change the values of *M*(·) in nodes ancestral to *g* such that the overall score is reduced. Therefore, the values of *M*(*g*) and required for an optimal solution cannot be determined using only local information, and more than one optimal solution may result.

To accommodate these requirements, it is necessary to enumerate all possible assignments of *M*(*g*) and , for each node *g* ∈ *V*_{G}. At each *g*, the associated information is stored in two tables, a cost table and a history table. For each candidate assignment *s* ∈ *V*_{S}, the score that minimizes the cost of reconciling *T*_{G}(*g*) with *T*_{S}(*s*) is stored in the cost table. The associated events and other information needed to reconstruct the history at *g* are stored in the history table.

Optimal reconciliations are calculated by a two-pass algorithm. The first pass (Algorithm 2.1.1) is a dynamic program that populates the cost and history tables of each node in a post-order traversal of *T*_{G}. It returns the optimal reconciliation score, the values of *M*(*ρ*_{G}) and corresponding to that score and the number of optimal histories. The second pass (Supplementary Algorithm S1.0.1) is a traceback algorithm that reads information from each history table to construct an optimal solution. Each optimal history is generated by traversing, in pre-order of *T*_{G}, each unique path that leads to the optimal label(s) in . Appropriate values of *M*(*g*) and at each node *g* are selected from the history table. Each candidate optimal history is then tested for temporal feasibility, as described in the next section. Only those histories that are temporally feasible are reported.

A key calculation in the dynamic program of firstPass is determination of the possible events at *g* for a given candidate species assignment, *M*(*g*)= *s*. These events, in turn, depend on *M*(*c*_{1}) = *s*_{1} and *M*(*c*_{2}) = *s*_{2}, where *c*_{1}, *c*_{2} ∈ *C*(*g*). The basis for determining candidate events that are consistent with *s*, *s*_{1} and *s*_{2} is the following observation: if a duplication occurred at *g*, then the species that inherit the descendants of *c*_{1} and the species that inherit the descendants of *c*_{2} will not be disjoint.
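The observation above amounts to a set-disjointness check. The sketch below is a simplified illustration, not the costCalc routine itself: it assumes the caller has already determined, for each child of *g*, which child lineages of *s* vertically inherit its descendants (with transferred leaves excluded), and the function name `indicates_duplication` is hypothetical.

```python
from typing import Set


def indicates_duplication(lineages_c1: Set[str], lineages_c2: Set[str]) -> bool:
    """Return True when the two children of g are inherited by overlapping
    sets of child lineages of s, which cannot arise from co-divergence alone."""
    return not lineages_c1.isdisjoint(lineages_c2)


# Hypothetical polytomy s with child lineages a, b and c.
# Descendants of c1 are found in a and b; descendants of c2 only in c.
co_divergence = indicates_duplication({"a", "b"}, {"c"})
# Descendants of both children are found in lineage b.
duplication = indicates_duplication({"a", "b"}, {"b", "c"})
```

In the first case the sets are disjoint, so the divergence is compatible with one of the non-duplication events; in the second, lineage `b` inherits descendants of both children, which signals an additional locus.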

We define a test, based on this observation, for distinguishing duplication from other events:

where is the set of species that vertically inherit descendants of *P*(*g*). If and are disjoint, then one of the other three events (, or ) must have occurred. These events can be distinguished from one another using , *M*(*g*), *M*(*c*_{1}) and *M*(*c*_{2}), as seen in costCalc in Algorithm 2.1.1. Note that Equation (2) is different from the standard least common ancestor (lca) test; however, when *M*(*g*) = *s* is binary, the descendants of *s* are partitioned into two sets, the left and right descendants of *s*, if there is no duplication. Therefore, Equation (2) is equivalent to lca reconciliation (Vernot *et al.*, 2008).

Because only consists of elements that were vertically inherited, we must exclude transfer edges in the calculation. For this purpose, we define

the set of leaves of *T*_{G}(*g*) that were acquired through HGT. Formally, we define to be a mapping from *V*_{G} to sets of nodes in *V*_{S}, where *V*_{S}^{+} is the powerset of *V*_{S}. is the set of children of *M*(*P*(*g*)) such that ; otherwise,

One more piece of machinery is needed: to determine , we must know the children of *M*(*P*(*g*)), but we do not have that information until we visit *P*(*g*). Therefore, we define a similar set mapping, , to aid in the calculation of . is the set of children of *M*(*g*) that vertically inherit a descendant of *g*. Formally, if *M*(*g*) ∈ *L*(*T*_{S}), ; otherwise,

Algorithm 2.1.1 traverses *T*_{G} in post-order, calling costCalc at each *g* ∈ *V*_{G}. The challenge in the DTLI model is to determine the sets of species that inherit the descendants of *c*_{1} and *c*_{2} when *M*(*g*) = *s* is a polytomy; i.e. how to calculate and . When *s* is binary, the descendants of *s* are easily partitioned into two sets; when *s* is a polytomy, all possible ways to partition the descendants must be considered. Each child of *g* can be retained in any subset of the children of *s*, ranging from size 1 to |*C*(*s*)| − 1. Our DTLI algorithm addresses this by considering all ways of partitioning *C*(*s*) into two non-empty subsets.
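To make the enumeration concrete, the following sketch lists every way of splitting the children of a polytomy into two non-empty, complementary subsets. It is a minimal illustration using the standard library; the function name `bipartitions` is hypothetical, child names are assumed to be distinct, and each split is reported once (unordered), which is a choice made for this sketch.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, Sequence, Tuple


def bipartitions(children: Sequence[str]) -> Iterator[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Yield each unordered split of `children` into two non-empty,
    disjoint subsets whose union is the full child set."""
    items = list(children)
    if len(items) < 2:
        return
    first, rest = items[0], items[1:]
    # Keeping `first` on the left side prevents each split from appearing twice.
    # `size` is the number of other children that join `first`; at least one
    # child must remain for the right side, hence range(len(rest)).
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((first, *extra))
            right = frozenset(items) - left
            yield left, right


# A hypothetical polytomy with three children has three such splits.
splits = list(bipartitions(["a", "b", "c"]))
```

A polytomy with *k* children has 2^{k−1} − 1 unordered splits, which is why the running time grows exponentially in the size of the largest polytomy but stays manageable when that size is small.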

At each internal node *g*, the algorithm assesses all possible values for *M*(*g*) and by looping through all (*s*_{1}, *s*_{2}) ∈ *V*_{S} × *V*_{S} and all . Considering all power sets corresponds to considering all the ways to partition *C*(*s*_{1}) and *C*(*s*_{2}). The optimal event and child mapping under *s* and is determined by minimizing the cost of the candidate solution at *g*:

where , the number of losses on edge (*g*, *c*_{i}), is calculated using the loss heuristic of Vernot *et al.* (2008). Note that for each *s*, the local cost and history tables are also indexed by all possible values of , which are in *C*(*s*)^{+}.

### 2.2 Temporal infeasibility

Because the donor and recipient species of any transfer must have coexisted, each transfer implies a temporal constraint. A reconciliation is temporally feasible if an ordering of species exists that satisfies the constraints of all inferred transfers. Because reconciliations inferred by Algorithm 2.1.1 are not guaranteed to be feasible, each candidate optimal solution is tested for feasibility *post hoc*.

To determine whether a reconciliation *R*_{GS} is temporally feasible, we construct a directed timing graph *G*_{t} = (*V*_{t}, *E*_{t}) that encodes all temporal constraints on species in *T*_{S}. Only species that are the donor, *d*, or recipient, *r*, of a transfer edge in Λ(*R*_{GS}) must be considered. Thus, the vertex set is defined as *V*_{t} = {*v* ∈ *V*_{S} | ∃(*g*, *h*) ∈ Λ(*R*_{GS}) ∋ *v* = *M*(*g*) ∨ *v* = *M*(*h*)}.

The edges in *E*_{t} represent three types of temporal constraints:

- If species *s*_{i} is an ancestor of species *s*_{j} in *T*_{S}, then *s*_{i} predates *s*_{j}: for every (*s*_{i}, *s*_{j}) in *V*_{t} × *V*_{t}, add (*s*_{i}, *s*_{j}) to *E*_{t} if *s*_{i} ≥_{S} *s*_{j}.
- Let (*g*, *h*) and (*g*′, *h*′) be transfers in Λ(*R*_{GS}), such that *g* ≥_{G} *g*′. Then *d* = *M*(*g*) and *r* = *M*(*h*) must have occurred no later than both *d*′ = *M*(*g*′) and *r*′ = *M*(*h*′). We add (*P*(*d*), *d*′), (*P*(*d*), *r*′), (*P*(*r*), *d*′) and (*P*(*r*), *r*′) to *E*_{t}.
- Given a transfer (*g*, *h*) ∈ Λ(*R*_{GS}), species *M*(*g*) and *M*(*h*) must be contemporaneous. Furthermore, any species that predates *M*(*g*) must also predate *M*(*h*) and vice versa. For every (*s*_{i}, *s*_{j}) ∈ *V*_{t} × *V*_{t}, add (*s*_{i}, *s*_{j}) to *E*_{t} if ∃ *s*_{k} ∈ *V*_{t} such that *s*_{i} ≥_{S} *s*_{k} and *s*_{k} and *s*_{j} are the donor and recipient, or vice versa, of some transfer (*g*, *h*) ∈ Λ(*R*_{GS}).

We test each candidate optimal history for temporal feasibility by verifying that the associated timing graph *G*_{t} is acyclic, using a modified topological sorting algorithm in Θ(|*V*_{t}| + |*E*_{t}|) (Cormen *et al.*, 1990). Temporally infeasible histories are not reported. Note that the infeasibility of one optimal history does not imply that all optimal histories are infeasible. Finding the optimal, temporally feasible reconciliation is NP-complete (Tofigh *et al.*, 2011); we leave the problem of obtaining an optimal, feasible solution when all candidate solutions have infeasible timing constraints for future work.
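The feasibility test can be sketched as follows: build the timing edges from the three constraint types and check the resulting graph for cycles. This is an illustrative sketch under stated assumptions, not the Notung implementation. Trees are given as hypothetical child-to-parent dictionaries, ancestry is taken to be strict (a species is not its own ancestor), ancestor edges are added for every species that appears in the graph (including the parents introduced by the second constraint), and acyclicity is checked with Kahn's algorithm rather than the modified topological sort used by the authors.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Parent = Dict[str, Optional[str]]  # node -> parent (None at the root)
Edge = Tuple[str, str]


def is_ancestor(parent: Parent, anc: str, node: str) -> bool:
    """True if `anc` is a proper ancestor of `node`."""
    cur = parent.get(node)
    while cur is not None:
        if cur == anc:
            return True
        cur = parent.get(cur)
    return False


def timing_edges(sp_parent: Parent, gene_parent: Parent,
                 mapping: Dict[str, str], transfers: List[Edge]) -> Set[Edge]:
    """Collect timing constraints implied by transfer edges (g, h) in the gene tree."""
    pairs = [(mapping[g], mapping[h]) for g, h in transfers]
    v_t = {s for pair in pairs for s in pair}
    edges: Set[Edge] = set()
    # Nested transfers: the earlier donor/recipient cannot postdate the later pair.
    for (g, _h), (d, r) in zip(transfers, pairs):
        for (g2, _h2), (d2, r2) in zip(transfers, pairs):
            if is_ancestor(gene_parent, g, g2):
                for older in (sp_parent.get(d), sp_parent.get(r)):
                    if older is not None:
                        edges.update({(older, d2), (older, r2)})
    # Donor and recipient are contemporaneous: whatever predates one predates the other.
    for d, r in pairs:
        for s_k, s_j in ((d, r), (r, d)):
            for s_i in v_t:
                if is_ancestor(sp_parent, s_i, s_k):
                    edges.add((s_i, s_j))
    # Ancestors predate descendants, applied to every species in the graph.
    nodes = v_t | {s for edge in edges for s in edge}
    for s_i in nodes:
        for s_j in nodes:
            if is_ancestor(sp_parent, s_i, s_j):
                edges.add((s_i, s_j))
    return edges


def is_acyclic(edges: Iterable[Edge]) -> bool:
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    succ = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in set(edges):
        nodes.update((u, v))
        succ[u].add(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(nodes)


# Hypothetical species tree ((A,B)P,(C,D)Q)root and two unrelated transfers:
# P -> C and Q -> A. P must predate A, hence Q; Q must predate C, hence P.
sp_parent = {"root": None, "P": "root", "Q": "root", "A": "P", "B": "P", "C": "Q", "D": "Q"}
gene_parent = {"r": None, "g1": "r", "g2": "r", "h1": "g1", "h2": "g2"}
mapping = {"g1": "P", "h1": "C", "g2": "Q", "h2": "A"}
feasible = is_acyclic(timing_edges(sp_parent, gene_parent, mapping, [("g1", "h1"), ("g2", "h2")]))
```

In this hypothetical example the two transfers impose contradictory orderings on `P` and `Q`, so `feasible` is `False`; either transfer on its own yields an acyclic graph.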

### 2.3 Complexity and running time

Our algorithm is fixed-parameter tractable with polynomial complexity when the size of the largest polytomy, *k**, is fixed. In practical data analyses, *k** is likely to be small. Recent genome-scale analyses of ILS have focused on species trees with *k** = 3 (Ebersberger *et al.*, 2007; Pollard *et al.*, 2006). In general, event inference will not yield informative results when the species tree is highly unresolved.

Theorem 2.1. *Given a binary gene tree* *T*_{G} *and a non-binary species tree* *T*_{S}, *firstPass takes* *O*(|*V*_{G}|(|*V*_{S}| + *n*_{k}2^{k*})^{2}(*h*_{S} + *k**)) *time*.

Proof. firstPass visits each *g* ∈ *V*_{G} in post order. At each *g*, costCalc is called once for every (*s*_{1}, *s*_{2}) ∈ *V*_{S} × *V*_{S} and , resulting in a total of *O*(|*V*_{G}|(|∪_{*s*∈*V*_{S}} *C*(*s*)^{+}|)^{2}) calls to costCalc. Because |*C*(*s*)^{+}| = 2^{|*C*(*s*)|} is *O*(1) when *s* is binary, |∪_{*s*∈*V*_{S}} *C*(*s*)^{+}| is bounded above by |*V*_{S}| − *n*_{k} + *n*_{k}2^{k*} and the number of calls to costCalc is *O*(|*V*_{G}|(|*V*_{S}| + *n*_{k}2^{k*})^{2}). We precalculate lca(*s*_{1}, *s*_{2}) and test whether *s*_{1} ≸ *s*_{2}, for all species pairs, in *O*(|*V*_{S}|^{2}) time. Therefore, the complexity of costCalc is dominated by the calculations of for *l* and *r*, and . These values can be computed in *O*(*h*_{S}), *O*(log(*k**)) and *O*(*k**) time, respectively. Thus, each call to costCalc has complexity *O*(*h*_{S} + *k**). Once the post-order traversal is completed, we extract the minimum score in , and all values of *M*(*ρ*_{G}) and corresponding to that score. Since , a linear search accomplishes this in *O*(|*V*_{S}| + *n*_{k}2^{k*}) time. Thus, the total complexity is *O*(|*V*_{G}|(|*V*_{S}| + *n*_{k}2^{k*})^{2}(*h*_{S} + *k**)).
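As a rough worked example of the bound on the number of costCalc calls derived in the proof, the sketch below evaluates |*V*_{G}|(|*V*_{S}| + *n*_{k}2^{k*})^{2} for given tree sizes. It assumes that *n*_{k} denotes the number of polytomies in the species tree, which is an interpretation made for this sketch; the function name and the example sizes are hypothetical, and the value is an upper-bound expression, not a measured call count.

```python
def costcalc_call_bound(n_gene_nodes: int, n_species_nodes: int,
                        n_polytomies: int, k_star: int) -> int:
    """Evaluate |V_G| * (|V_S| + n_k * 2**k_star) ** 2.

    n_polytomies plays the role of n_k (assumed here to be the number of
    polytomies) and k_star is the out-degree of the largest polytomy.
    """
    return n_gene_nodes * (n_species_nodes + n_polytomies * 2 ** k_star) ** 2


# Hypothetical sizes: 21 gene tree nodes and 20 species tree nodes.
binary_bound = costcalc_call_bound(21, 20, n_polytomies=0, k_star=2)
polytomy_bound = costcalc_call_bound(21, 20, n_polytomies=1, k_star=3)
```

For these hypothetical sizes the expression grows from 8400 with a fully binary species tree to 16464 with a single polytomy of size 3, illustrating why the algorithm remains practical when *k** is small.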

Theorem 2.2. *secondPass returns each optimal reconciliation in* *O*(|*V*_{G}|(*h*_{S} + *k**)).

Proof. secondPass starts from the *M*(*ρ*_{G}) and found in firstPass. It then constructs an optimal solution by visiting each subsequent *g* ∈ *V*_{G}, assigning mappings and events by looking up values in in constant time. Losses are inferred in *O*(*k** + *h*_{S}) time (see Vernot *et al.*, 2008). Thus, the complexity for returning each optimal history is *O*(|*V*_{G}|(*h*_{S} + *k**)).

When *T*_{S} is binary, firstPass is completed in *O*(*h*_{S}|*V*_{G}||*V*_{S}|^{2}) time, and secondPass reports each optimal solution in *O*(*h*_{S}|*V*_{G}|) time.

Our Notung implementation is efficient in practice. We measured the time required to reconcile 1128 cyanobacterial gene trees with a species tree of size |*V*_{S}| ≤ 21 for all the parameter settings given in Table 1. To assess the effect of polytomy size, we also collapsed edges in the species tree to create a polytomy ranging in size from 2 to 6. The maximum average running time observed on a single AMD Opteron 2.3 GHz, 64-bit processor was ~0.05 s per solution.

## 3 EMPIRICAL RESULTS

To assess the importance of a four-event model, we implemented our DTLI algorithm in Notung 2.7 and applied it to two phylogenetic datasets in which ILS, HGT and hybridization have been studied (Bansal *et al.*, 2011; Yu *et al.*, 2011). Because a number of algorithms and software packages do not include losses in the optimization criterion, we sought to assess the impact of this modeling choice. Therefore, we also implemented and applied models that exclude losses from the optimization criterion (the DT and DTI models). Except where stated, the trends reported here were observed consistently in both datasets.

The datasets analyzed contain 1128 cyanobacterial gene trees sampled from 11 species (Fig. 2 and Supplementary Fig. S1), and 106 yeast gene trees sampled from 15 species (Supplementary Fig. S2), respectively. Each gene tree has at most one gene copy per species. To assess the impact of our ILS model, for each dataset we compared the performance of our algorithm on a binary and a non-binary species tree. The non-binary species tree was created by removing one edge, resulting in a single polytomy of size 3. In each case, the selected edge was short and associated with substantial gene tree incongruence. Each polytomy was chosen as a reflection of an area of the species tree where ILS may be occurring. In both cases, the selected edge was one that is reportedly difficult to resolve (Bansal *et al.*, 2011; Schirrmeister *et al.*, 2011; Yu *et al.*, 2011).

**Fig. 2.** *δ* = 3, *τ* = 2.5 and *λ* = 2. Predicted highways with transfer counts exceeding 1.5 standard deviations above the mean are shown, with the total number of transfers labeled. **...**

We reconciled each tree using each of the four models (DT, DTI, DTL and DTLI), with *τ* ∈ {2.5,6,10}, *δ* = 3 and *λ* = 2 (when considered). We tabulated (1) the number of events of each type, (2) the gene and (3) species lineages in which they occurred, (4) the donor and recipient of each transfer and (5) the number of temporally infeasible reconciliations (Table 1 for cyanobacteria; Supplementary Table S1 for yeast). Trees that had no temporally feasible solution for at least one set of parameter values were eliminated from analysis under all models and values of *τ*. For each setting, gene trees were rooted with Notung's rooting optimization algorithm using event parsimony. If a tree had multiple optimal solutions (one or more optimal roots or reconciliations for a specified root), it was only retained if all solutions yielded the same counts for each event.
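The retention rule for trees with multiple optimal solutions can be sketched as a simple filter. The data layout below (a dictionary from tree name to a list of per-solution event counts, for a single parameter setting) and the function names are assumptions made for illustration; the counts are invented and do not come from the datasets analyzed here.

```python
from typing import Dict, List

EventCounts = Dict[str, int]  # e.g. {"D": duplications, "T": transfers, "L": losses}


def has_consistent_counts(solutions: List[EventCounts]) -> bool:
    """True if every optimal solution reports the same count for each event type."""
    return all(sol == solutions[0] for sol in solutions[1:])


def retain_trees(optimal: Dict[str, List[EventCounts]]) -> List[str]:
    """Keep trees that have at least one feasible optimal solution and whose
    optimal solutions all agree on the event counts."""
    return [tree for tree, sols in optimal.items()
            if sols and has_consistent_counts(sols)]


# Hypothetical results: tree_2 has conflicting optima, tree_4 has no feasible solution.
optimal = {
    "tree_1": [{"D": 1, "T": 2, "L": 3}],
    "tree_2": [{"D": 1, "T": 2, "L": 3}, {"D": 0, "T": 3, "L": 2}],
    "tree_3": [{"D": 0, "T": 1, "L": 1}, {"D": 0, "T": 1, "L": 1}],
    "tree_4": [],
}
retained = retain_trees(optimal)
```

Only `tree_1` and `tree_3` survive this filter. In the full analysis the same check would be repeated for each model and each value of *τ*, with a tree dropped everywhere if it fails under any setting.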

Our observations highlight the extent to which model choice and degeneracy affect biological inferences. Approximately 10% of trees were removed because they are potentially misleading due to temporal infeasibility. Hallett *et al.* (2004) reported no temporal infeasibility for the application of their DT algorithm to a simulated dataset. Our results suggest that infeasible cases can be more prevalent in real data.

In addition, ~20% of trees had conflicting optimal solutions, suggesting that inferences based on a single, randomly selected optimal solution could lead to conclusions that are not, in fact, supported by the data. This result highlights the importance of taking multiple solutions into account when performing tree reconciliation.

When models with and without ILS were compared, we observed a substantial decrease in the combined number of duplications and transfers, ranging from 15% to 18% in cyanobacteria and 11% to 14% in yeast. We also observed considerable decreases in the number of losses, as high as 20% in the case of DT versus DTI. These differences indicate the extent to which ignoring ILS can lead to overestimation of other events.

Recently, great interest has been focused on ‘highways’ of HGT, that is, pairs of species with very active genetic exchange relative to HGT in other species (e.g. Bansal *et al.*, 2011; Beiko *et al.*, 2005). We considered evidence of HGT highways in our cyanobacterial data, where a highway is an outlier in the total number of transfers, in both directions, between a pair of species. With the DTL model, we observe traffic (Fig. 2, red lines) similar to the HGT highways reported by Bansal *et al.* (2011) (dotted lines) for the same dataset. However, when events were inferred with the DTLI model, the elevated transfer rates in the Gloeobacter group disappeared, resulting in a single highway (blue line). These results demonstrate that use of a complete event model is crucial for accurate inference.
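The highway criterion can be sketched in a few lines of Python. The function `find_highways` is a hypothetical illustration, not the code used for this analysis. It assumes that the mean and standard deviation are taken over all unordered species pairs, including pairs with no transfers, and that the population standard deviation is used; the text does not specify either choice. The 1.5 standard deviation threshold follows the figure caption.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def find_highways(transfers: Iterable[Tuple[str, str]], species: List[str],
                  k: float = 1.5) -> Dict[Tuple[str, str], int]:
    """Return species pairs whose total transfer count is an outlier.

    `transfers` holds (donor, recipient) pairs; both directions are pooled.
    A pair is flagged when its total exceeds mean + k * standard deviation,
    computed over all unordered species pairs (an assumption of this sketch).
    """
    counts = Counter(frozenset(t) for t in transfers)
    all_pairs = [frozenset(p) for p in combinations(species, 2)]
    totals = [counts[p] for p in all_pairs]  # Counter returns 0 if absent
    threshold = mean(totals) + k * pstdev(totals)
    return {tuple(sorted(p)): counts[p]
            for p in all_pairs if counts[p] > threshold}


# Invented transfer list over four hypothetical species.
species = ["A", "B", "C", "D"]
transfers = [("A", "B")] * 5 + [("B", "A")] * 3 + [("C", "D"), ("A", "C")]
print(find_highways(transfers, species))
```

Because the threshold depends on the distribution of counts across all pairs, removing transfers that are better explained by ILS can change which pairs stand out, consistent with the difference between the DTL and DTLI results above.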

In general, including losses in the optimization criterion resulted in (1) a dramatic decrease in the number of losses and (2) a change in the ratio of the number of duplications to transfers. This likely occurs because duplications and losses are coupled. When losses are included in the optimization, their cost may prevent the model from over-inferring duplications. This suggests that for any application where accurate reconstruction of event histories matters, including losses in the optimization criterion is crucial.

## 4 DISCUSSION

This work presents the first reconciliation algorithm for the event inference problem under a model that captures the four major evolutionary processes driving tree incongruence: duplication, loss, transfer and ILS. Our algorithm reconciles a binary gene tree with a non-binary species tree and is, to our knowledge, the first algorithm to allow non-binary species trees with a transfer model. Our algorithm outputs detailed event histories, describing the specific events inferred and the lineages in which they occurred.

When restricted to binary species trees, our algorithm reduces to an event inference algorithm for the DTL model that can infer all optimal solutions and does not require estimates of speciation times or otherwise restrict transfers to a limited set of species pairs.

Algorithms that capture duplication, transfer and ILS in a single, integrated model are of increasing importance (Degnan and Rosenberg, 2009). New sequencing technologies are leading to rapid growth of whole genome datasets, in which there is evidence for both HGT and ILS. Our empirical analyses of two different datasets, representing both prokaryotic and eukaryotic data, indicate that use of a complete event model has substantial impact on the events inferred and, hence, the resulting biological conclusions. For example, it is possible that apparent HGT highways could be, at least in part, mis-interpretations of deep coalescence.

Our model is a compromise between current reconciliation models, which ignore ILS everywhere, and coalescent models that explicitly relate the probability of incongruence to the length and population size associated with every branch. Our model is more expressive than the former and more efficient and more widely applicable than the latter. A great strength of the multispecies coalescent is that it explicitly relates the probability of incongruence to effective population size and the time between species divergences. Estimates of these population parameters are only available for a limited set of well-studied species. However, given a sufficiently large set of gene families, population parameters can be inferred directly from the data, but this is computationally demanding. For example, species tree inference from a set of 106 genes in 8 yeast species required 800 h using Bayesian estimation on a coalescent model, whereas a parsimony method inferred the identical tree in only a ‘fraction of a second’ (Than and Nakhleh, 2009).

A parsimony model, on the other hand, does not take branch lengths into account, resulting in a potential reduction in accuracy. Future simulation studies are planned to characterize the accuracy of this approach. The benefits of this simpler model are that it can be applied to any set of taxa, not just species for which population parameters can be estimated, and it is not sensitive to overfitting. Because it is fast and general, it is highly suitable for processing large, genome-scale datasets.

The work presented here could profitably be generalized in several ways, including a model of transfers in which multiple genes are transferred in a single event; inference methods for datasets involving extinct or missing species; and ILS models that deviate from the assumption of a uniform gene tree distribution and take branch lengths and population size into account for datasets where such information is available. Another important area for future work is the selection of event costs and investigation of the robustness of results with respect to small changes in the costs used. Note that the problem of how to weight events also arises in coalescent models. For example, the coalescent-based DLI inference algorithm requires the user to supply duplication and loss rates.

## ACKNOWLEDGEMENT

We thank H. Philippe for making his yeast trees available to us.

*Funding:*
National Science Foundation (BDI0641313); Pittsburgh Supercomputing Center, Biomedical Computing Initiative and Computational Facilities Access (MCB000010P) and a David and Lucille Packard Foundation fellowship.

*Conflict of Interest:* none declared.

## REFERENCES

- Andersson J. Horizontal gene transfer between microbial eukaryotes. Methods Mol. Biol. 2009;532:473–487. [PubMed]
- Bansal M., et al. Detecting highways of horizontal gene transfer. J. Comput. Biol. 2011;18:1087–1114. [PubMed]
- Beiko R., et al. Highways of gene sharing in prokaryotes. Proc. Natl Acad. Sci. USA. 2005;102:14332–14337. [PMC free article] [PubMed]
- Berglund A., et al. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J. Mol. Evol. 2006;63:240–250. [PubMed]
- Cormen T., et al. Introduction to Algorithms. Cambridge, Mass.: MIT Press/McGraw-Hill; 1990.
- David L., Alm E. Rapid evolutionary innovation during an Archaean genetic expansion. Nature. 2011;469:93–96. [PubMed]
- Degnan J., Rosenberg N. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 2009;24:332–340. [PubMed]
- Delsuc F., et al. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 2005;6:361–375. [PubMed]
- Doyon J., et al. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011;12:392–400. [PubMed]
- Ebersberger I., et al. Mapping human genetic ancestry. Mol. Biol. Evol. 2007;24:2266–2276. [PubMed]
- Edwards S. Is a new and general theory of molecular systematics emerging? Evolution. 2009;63:1–19. [PubMed]
- Hallett M., et al. Simultaneous identification of duplications and lateral transfers. In: RECOMB 2004: Proceedings of the Eighth International Conference on Research in Computational Biology, San Diego, California, USA. New York, NY, USA: ACM Press; 2004. pp. 347–356.
- Huson D. H., Scornavacca C. A survey of combinatorial methods for phylogenetic networks. Genome Biol. Evol. 2011;3:23–35. [PMC free article] [PubMed]
- Ma B., et al. From gene trees to species trees. SIAM J. Comput. 2000;30:729–752.
- Maddison W. Gene trees in species trees. Syst. Biol. 1997;46:523–536.
- Maddison W., Knowles L. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 2006;55:21–30. [PubMed]
- Maddison W., Maddison D. 2011. Mesquite: A modular system for evolutionary analysis, version 2.75. http://mesquiteproject.org, accessed June 10, 2012.
- Milinkovitch M., et al. 2x genomes–depth does matter. Genome Biol. 2010;11:R16. [PMC free article] [PubMed]
- Nakhleh L. Evolutionary phylogenetic networks: models and issues. In: Heath L., Ramakrishnan N, editors. The Problem Solving Handbook for Computational Biology and Bioinformatics. Springer; 2010. pp. 125–158.
- Nakhleh L., et al. Gene trees, species trees, and species networks. In: Guerra R., Goldstein D., editors. Meta-analysis and Combining Information in Genetics and Genomics. Boca Raton, FL: CRC Press; 2009. pp. 275–293.
- Page R. Gene Tree: comparing gene and species phylogenies using reconciled trees. Bioinformatics. 1998;14:819–20. [PubMed]
- Pollard D., et al. Widespread discordance of gene trees with species tree in *Drosophila*: evidence for incomplete lineage sorting. PLoS Genet. 2006;2:e173. [PMC free article] [PubMed]
- Rasmussen M., Kellis M. Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 2012;4:755–765. [PMC free article] [PubMed]
- Rokas A., et al. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. [PubMed]
- Schirrmeister B., et al. The origin of multicellularity in cyanobacteria. BMC Evol. Biol. 2011;11:45. [PMC free article] [PubMed]
- Serres M. H., et al. Evolution by leaps: gene duplication in bacteria. Biol. Direct. 2009;4:46. [PMC free article] [PubMed]
- Than C., Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comput. Biol. 2009;5:e1000501. [PMC free article] [PubMed]
- Tofigh A., et al. Simultaneous identification of duplications and lateral gene transfers. TCBB. 2011;8:517–535. [PubMed]
- Vernot B., et al. Reconciliation with non-binary species trees. J. Comput. Biol. 2008;15:981–1006. [PMC free article] [PubMed]
- Yu Y., et al. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst. Biol. 2011;60:138–149. [PMC free article] [PubMed]
- Zhang L. From gene trees to species trees ii: Species tree inference by minimizing deep coalescence events. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011;6:1685–1691. [PubMed]
- Zhaxybayeva O., Doolittle W. Lateral gene transfer. Curr. Biol. 2011;21:R242–R246. [PubMed]
- Zhaxybayeva O., et al. Intertwined evolutionary histories of marine *Synechococcus* and *Prochlorococcus marinus*. Genome Biol. Evol. 2009;1:325–339. [PMC free article] [PubMed]
- Zmasek C., Eddy S. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001;17:821–8. [PubMed]

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (305K) |
- Citation

- Reconciliation with non-binary species trees.[J Comput Biol. 2008]
*Vernot B, Stolzer M, Goldman A, Durand D.**J Comput Biol. 2008 Oct; 15(8):981-1006.* - Reconciliation with non-binary species trees.[Comput Syst Bioinformatics Conf. 2007]
*Vernot B, Stolzer M, Goldman A, Durand D.**Comput Syst Bioinformatics Conf. 2007; 6:441-52.* - Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss.[Bioinformatics. 2012]
*Bansal MS, Alm EJ, Kellis M.**Bioinformatics. 2012 Jun 15; 28(12):i283-91.* - Models, algorithms and programs for phylogeny reconciliation.[Brief Bioinform. 2011]
*Doyon JP, Ranwez V, Daubin V, Berry V.**Brief Bioinform. 2011 Sep; 12(5):392-400.* - Understanding phylogenetic incongruence: lessons from phyllostomid bats.[Biol Rev Camb Philos Soc. 2012]
*Dávalos LM, Cirranello AL, Geisler JH, Simmons NB.**Biol Rev Camb Philos Soc. 2012 Nov; 87(4):991-1024. Epub 2012 Aug 14.*

- Cophylogeny Reconstruction via an Approximate Bayesian Computation[Systematic Biology. 2015]
*Baudet C, Donati B, Sinaimeri B, Crescenzi P, Gautier C, Matias C, Sagot MF.**Systematic Biology. 2015 May; 64(3)416-431* - Improved gene tree error correction in the presence of horizontal gene transfer[Bioinformatics. 2015]
*Bansal MS, Wu YC, Alm EJ, Kellis M.**Bioinformatics. 2015 Apr 15; 31(8)1211-1218* - Rock, Paper, Scissors: Harnessing Complementarity in Ortholog Detection Methods Improves Comparative Genomic Inference[G3: Genes|Genomes|Genetics. ]
*Maher MC, Hernandez RD.**G3: Genes|Genomes|Genetics. 5(4)629-638* - Fungal metabolic gene clusters—caravans traveling across genomes and environments[Frontiers in Microbiology. ]
*Wisecaver JH, Rokas A.**Frontiers in Microbiology. 6161* - The Secreted Proteins of Achlya hypogyna and Thraustotheca clavata Identify the Ancestral Oomycete Secretome and Reveal Gene Acquisitions by Horizontal Gene Transfer[Genome Biology and Evolution. ]
*Misner I, Blouin N, Leonard G, Richards TA, Lane CE.**Genome Biology and Evolution. 7(1)120-135*

- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyTaxonomy records associated with the current articles through taxonomic information on related molecular database records (Nucleotide, Protein, Gene, SNP, Structure).
- Taxonomy TreeTaxonomy Tree

Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics. 2012 Sep 15; 28(18): i409.
