PLoS Genet. Apr 2012; 8(4): e1002660.
Published online Apr 19, 2012. doi:  10.1371/journal.pgen.1002660
PMCID: PMC3330115

The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection

Joseph Felsenstein, Editor

Abstract

Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.

Author Summary

Species trees depict how species split and diverge. Within the branches of a species tree, gene trees, which depict the evolutionary histories of different genomic regions in the species, grow. Evolutionary analyses of the genomes of closely related organisms have highlighted the phenomenon that gene trees may disagree with each other as well as with the species tree that contains them due to deep coalescence. Furthermore, for several groups of organisms, hybridization plays an important role in their evolution and diversification. This evolutionary event also results in gene tree incongruence and gives rise to a species phylogeny that is a network. Thus, inferring the evolutionary histories of groups of organisms where hybridization is known, or suspected, to play an evolutionary role requires dealing simultaneously with hybridization and other sources of gene tree incongruence. Currently, no methods exist for doing this with general scenarios of hybridization. In this paper, we propose the first method for this task and demonstrate its performance. We revisit the analysis of a set of yeast species and another of Drosophila species, and show that evolutionary histories involving hybridization have higher support than the strictly diverging evolutionary histories estimated when not incorporating hybridization in the analysis.

Introduction

A molecular systematics paradigm that views molecular sequences as the characters of gene trees, and gene trees as characters of the species tree [1] is being increasingly adopted in the post-genomic era [2], [3]. Several models of evolution for the former type of characters have been devised [4], while the coalescent has been the main model of the latter type of characters [5], [6]. However, hybridization, a process that is believed to play an important role in the speciation and evolutionary innovations of several groups of plant and animal species [7], [8], results in reticulate (species) evolutionary histories that are best modeled using a phylogenetic network [9], [10]. Further, as hybridization may occur between closely related species, incongruence among gene trees may also be partly due to deep coalescence, and distinguishing between the two factors is hard under these conditions [11]. Therefore, to enable a more general application of the new paradigm, a phylogenetic network model that allows simultaneously for deep coalescence events as well as hybridization is needed [12]. This model can be devised by extending the coalescent model to allow for computing gene tree probabilities in the presence of hybridization. In this paper we focus on gene tree topologies and analyze the signal they contain for detecting hybridization in the presence of deep coalescence.

Applications of probabilities of gene tree topologies given species trees include determining statistical consistency (or inconsistency) of topology-based methods for inferring species trees [13]–[15], testing the multispecies coalescent model [13], [16], determining identifiability of species trees using linear invariants of functions of gene tree topology probabilities [17], [18], delimiting species [19], designing simulation studies for species tree inference methods [20]–[22], and inferring species trees [23], [24]. We expect that similar applications may be useful for probabilities of gene tree topologies given species networks. In particular, it will be useful to be able to evaluate the performance of methods that infer species trees in the presence of hybridization as well as the performance of methods for inferring species networks. Knowing the distribution of gene tree topologies could also be useful for estimating the probability that two gene trees have the same topology, a quantity that is used in constructing the prior which models gene tree discordance in BUCKy [25], a program that is often used to estimate species trees or concordance trees.

A method for computing the probability mass function of gene tree topologies in the absence of hybridization (i.e., when the multispecies coalescent model is assumed) is given by Degnan and Salter [26]. However, to handle hybridization and deep coalescence simultaneously, this method has to be extended to allow for reticulate species evolutionary histories.

Indeed, attempts have been made recently at this very task [27]–[30], all of which have focused on very limited special cases where the phylogenetic network topology is known and contains one or two hybridization events, and a single allele is sampled per species. However, a general formula for the probability of a gene tree topology given a general (any number of taxa, hybridizations, gene trees, and/or alleles) phylogenetic network has remained elusive.

A binary phylogenetic network topology $N$ contains two types of nodes: tree nodes, each of which has exactly one parent (except for the root, which has zero parents), and reticulation nodes, each of which has exactly two parents. The edge incident into a tree node is called a tree edge, and the edges incident into a reticulation node are called reticulation edges. In our context, we associate with a phylogenetic network $N$ a vector of branch lengths $\lambda$ (in units of $2N_e$ generations, where $N_e$ is the effective population size in that branch) and a vector of hybridization probabilities $\gamma$ (which indicates for each allele in a hybrid population its probability of inheritance from each of the two parent populations); see Text S1 for formal definition. The gene tree topology $G$ can be viewed as a random variable with probability mass function $P_{N,\lambda,\gamma}(G = g)$. In this paper, we solve the aforementioned open problem by reporting on a novel method for computing the probability of a gene tree topology given a phylogenetic network, $P_{N,\lambda,\gamma}(G = g)$.

We illustrate the use of gene tree topology probabilities to estimate the values of species network parameters using the likelihood of the gene tree topologies. This application allows for disentangling hybridization and deep coalescence when analyzing a set of incongruent gene trees, as both events can give rise to similar incongruence patterns. Given a collection $\mathcal{G}$ of gene tree topologies, one per locus, in a set of sampled loci, the likelihood function is given by

$$L(N, \lambda, \gamma \mid \mathcal{G}) = \prod_{g \in \mathcal{G}} P_{N,\lambda,\gamma}(G = g)$$
(1)

This formulation provides a framework for estimating the parameters $\lambda$ and $\gamma$ of an evolutionary history hypothesis $N$, given a collection of gene trees $\mathcal{G}$. Estimates of 0 or 1 for the entries in the $\gamma$ vector reflect the absence of evidence for hybridization based on the gene tree topology distribution.
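
As a minimal sketch of how the likelihood in Equation (1) can be evaluated, the following code sums log-probabilities over loci, which is numerically safer than multiplying many small probabilities. The function names are hypothetical, and the toy probability function is only a stand-in: it uses the well-known three-taxon probabilities under a species tree ((A,B),C) without hybridization, not the network method of this paper.

```python
import math

def log_likelihood(gene_trees, topology_probability):
    """Log of Equation (1): sum of log P(G = g) over the observed topologies."""
    total = 0.0
    for g in gene_trees:
        p = topology_probability(g)
        if p <= 0.0:
            return float("-inf")  # an impossible topology makes the likelihood zero
        total += math.log(p)
    return total

def three_taxon_probability(g, t=1.0):
    """Toy stand-in for P(G = g): species tree ((A,B),C), internal branch length t."""
    if g == "((A,B),C)":
        return 1.0 - (2.0 / 3.0) * math.exp(-t)  # topology matching the species tree
    return (1.0 / 3.0) * math.exp(-t)            # each of the two discordant topologies

# Hypothetical data: one topology per locus.
observed = ["((A,B),C)", "((A,B),C)", "((A,C),B)", "((B,C),A)"]
loglik = log_likelihood(observed, three_taxon_probability)
```

In a parameter search, `topology_probability` would be replaced by a function of the candidate branch lengths and hybridization probabilities, and the log-likelihood would be maximized over them.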

As gene tree topologies are estimated from sequence data, there is often uncertainty about them. In our method, we account for that in two ways: (1) by considering a set of gene tree topology candidates, along with their associated probabilities (produced, for example, by a Bayesian analysis), and (2) by considering for each locus the strict consensus of all optimal tree topologies computed for that locus (produced, for example, by a maximum parsimony analysis).

Finally, to account for model complexity, we employ a simple technique based on three information criteria, AIC [31], AICc [32] and BIC [33]. While these criteria have their shortcomings for model selection, the question of how to account for phylogenetic network complexity is still wide open and no methods exist for addressing it systematically [10].
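
The three criteria can be computed from a maximized log-likelihood as sketched below, using their standard textbook definitions. Treating the number of free parameters `k` as the number of estimated branch lengths and hybridization probabilities, and the sample size `n` as the number of gene trees, is an assumption made here for illustration.

```python
import math

def information_criteria(log_lik, k, n):
    """Return (AIC, AICc, BIC) for a model with k free parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_lik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction to AIC
    bic = k * math.log(n) - 2 * log_lik
    return aic, aicc, bic

# Hypothetical values: log-likelihood -100 with 3 parameters and 50 gene trees.
aic, aicc, bic = information_criteria(-100.0, 3, 50)
```

Lower values indicate a better trade-off between fit and complexity, so a network hypothesis is preferred over a tree only if the gain in likelihood outweighs the penalty for its extra parameters.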

We have implemented our method in the publicly available software package PhyloNet [34] and demonstrated its broad utility in three domains. First, we reanalyze a Saccharomyces data set and a Drosophila data set, and find support for hybridization in both data sets. Second, we show the identifiability of the parameter values of certain reticulate evolutionary histories. Third, we highlight and discuss the lack of identifiability of the parameters in other scenarios that involve extinctions.

Materials and Methods

We begin by reviewing Degnan and Salter's method for computing the probability of gene tree topologies on species trees, and then describe our novel extension to the case of species networks.

The probability of a gene tree topology within a species tree

Degnan and Salter [26] gave the probability mass function of a gene tree topology $g$ for a given species tree with topology $\psi$ and vector of branch lengths $\lambda$ as

equation image
(2)

where the sum is taken over coalescent histories $h$ from the set of all coalescent histories $H_\psi(g)$. The product is taken over all internal branches $b$ of the species tree. The term $p_{u_b(h) v_b(h)}(\lambda_b)$ is the probability that $u_b(h)$ lineages coalesce into $v_b(h)$ lineages on branch $b$ whose length is $\lambda_b$. The terms $w_b(h)$ and $d_b(h)$ together give, through the ratio $w_b(h)/d_b(h)$, the probability that the coalescent events agree with the gene tree topology. In particular, $w_b(h)$ is the number of ways that coalescent events can occur consistently with the gene tree and $d_b(h)$ is the number of sequences of coalescences that give the number of coalescent events specified by $h$. However, this equation assumes that $\psi$ is a tree and as such is inapplicable to reticulate evolutionary histories. Recently, this equation was adapted to very special cases of species phylogenies with hybridization [28]–[30]. However, none of these adaptations is general enough to allow for multiple hybridizations, multiple alleles per species, or arbitrary divergence patterns following hybridization. We present a novel approach for generalizing this equation to handle hybridization. Our approach is general in that it allows for computing gene tree probabilities on any binary phylogenetic network topology, thus overcoming limitations of recent works.
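
The probability that a given number of lineages coalesce into fewer lineages along a branch can be illustrated with the closed-form expression commonly used in the coalescent literature for the number of ancestral lineages after a given time (in coalescent units). The sketch below is an illustration under that assumption; whether [26] writes the formula in exactly this form is not confirmed by the text here, and the function name is hypothetical.

```python
import math

def coalescence_probability(u, v, t):
    """Probability that u lineages coalesce into v lineages within time t (coalescent units)."""
    if not 1 <= v <= u:
        raise ValueError("requires 1 <= v <= u")
    total = 0.0
    for k in range(v, u + 1):
        # Product over y of (v + y)(u - y) / (u + y) for y = 0, ..., k - 1.
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        coeff = ((2 * k - 1) * (-1) ** (k - v)
                 / (math.factorial(v) * math.factorial(k - v) * (v + k - 1)))
        total += math.exp(-k * (k - 1) * t / 2.0) * coeff * prod
    return total

# Two lineages coalesce into one within time t with probability 1 - exp(-t).
p21 = coalescence_probability(2, 1, 0.5)
```

For a fixed number of entering lineages, these probabilities sum to one over all possible numbers of exiting lineages, which gives a quick sanity check on any implementation.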

The probability of a gene tree topology within a species network

Our approach for computing the probability of a gene tree $g$ given a species network $N$ has three steps. First, $N$ is converted into a multilabeled (MUL) tree $T$ (a tree whose leaves are not uniquely labeled by a set of taxa; see Text S1); second, the alleles at the tips of $g$ are mapped in every valid way to the tips of $T$; and, finally, the probability of $g$ is computed as the sum, over all valid allele mappings, of probabilities of $g$ given $T$ (see Figure 1).

Figure 1
Phylogenetic networks, MUL trees, and valid allele mappings.

Step 1: Converting the phylogenetic network $N$ to MUL tree $T$

Let $N$ be a phylogenetic network on set $X$ of species, with branch lengths vector $\lambda$ and hybridization probabilities vector $\gamma$. The conversion of $N$ into a MUL tree is done as follows. Traversing the network $N$ from the leaves towards the root, every time a reticulation node $h$ is encountered, the two reticulation edges incident into it are removed, an additional copy of the subtree rooted at $h$'s child is created, one copy is attached as a child of one of $h$'s original parents, and the other is attached as a child of $h$'s other original parent. For example, in Figure 1, traversing the phylogenetic network from the leaves towards the root, the reticulation node is encountered, two copies of the subtree rooted at its child (i.e., the most recent common ancestor of the two species below it) are created, and one copy is attached as a child of each of the reticulation node's two parents, resulting in the MUL tree shown in the figure. In order to keep track of which branches in the MUL tree originated from the same branch in the phylogenetic network, we build during the conversion a mapping $\phi$ from the set of the MUL tree branches to the set of the phylogenetic network branches, such that $\phi(b') = b$ if branch $b'$ in the MUL tree corresponds to branch $b$ in the phylogenetic network. We make use of $\phi$ in two ways. The first is in transferring the branch lengths and hybridization probabilities from $N$ to the resulting MUL tree $T$, as illustrated briefly in Figure 1 and in more detail in Text S1, and the second is in computing the probabilities of gene trees, as becomes clearer below. Upon completion of this step, the phylogenetic network $N$, its branch lengths $\lambda$, and its hybridization probabilities $\gamma$ have been converted into a MUL tree $T$ along with its branch lengths $\lambda'$, hybridization probabilities $\gamma'$, and the branch mapping $\phi$. The full description of the procedure NetworkToMULTree for achieving this conversion is given in Text S1.
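
The conversion can be sketched in a few lines. The code below is a simplified illustration and not the NetworkToMULTree procedure of Text S1: it works top-down by recursive copying rather than bottom-up, it handles topology only (transfer of branch lengths and hybridization probabilities is omitted), and it assumes that the MUL tree branch replacing a reticulation edge maps to the network branch directly below the reticulation node. The node names and the example network are hypothetical.

```python
from itertools import count

def network_to_mul_tree(children, root):
    """Expand a network (dict: node -> list of children) into a MUL tree.

    Returns (mul_root, mul_children, label, phi), where label maps each MUL
    node to its network node and phi maps each MUL branch to a network branch.
    """
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    reticulations = {n for n, ps in parents.items() if len(ps) == 2}

    ids = count()
    label, mul_children, phi = {}, {}, {}

    def copy_subtree(node):
        me = next(ids)
        label[me] = node
        mul_children[me] = []
        for kid in children.get(node, []):
            source = (node, kid)
            # Skip reticulation nodes: attach a copy of the subtree below them.
            while kid in reticulations:
                source = (kid, children[kid][0])
                kid = children[kid][0]
            kid_copy = copy_subtree(kid)
            mul_children[me].append(kid_copy)
            phi[(me, kid_copy)] = source
        return me

    return copy_subtree(root), mul_children, label, phi

# Hypothetical network: reticulation node h has parents s and t and child m.
network = {"root": ["s", "t"], "s": ["A", "h"], "t": ["h", "D"],
           "h": ["m"], "m": ["B", "C"]}
mul_root, mul_children, label, phi = network_to_mul_tree(network, "root")
leaf_labels = sorted(label[n] for n, kids in mul_children.items() if not kids)
```

Because the subtree below the reticulation node is reached once through each parent, it is copied twice, so the species below it label two leaves each, and every duplicated network branch has two pre-images under `phi`.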

Step 2: Mapping the alleles to the leaves of the MUL tree

In computing the probability of a gene tree given a species phylogeny (tree or network), all the alleles sampled from species $x$ are mapped to the single leaf labeled $x$ in the species phylogeny. However, unless the species phylogeny $N$ has no reticulation nodes, the resulting MUL tree $T$ contains sets of leaves that are labeled by the same species $x$. For example, in Figure 1, the MUL tree has two leaves labeled $B$, and a second species likewise labels two leaves. In this case, it is important to map the alleles systematically to the leaves of the MUL tree so as to cover exactly all the coalescence patterns that would arise had the alleles been mapped to the phylogenetic network.

We denote by $c_T(x)$ the set of leaf nodes in $T$ that are labeled by species $x$. For example, $c_T(B)$ for the MUL tree in Figure 1 is the set of the two leaves labeled by $B$. Now, consider a locus $j$. We denote by $A_x$ (for $x \in X$) the set of alleles sampled from species $x$ for locus $j$, and by $a_x$ the size of this set (i.e., $a_x = |A_x|$). In the example of Figure 1, two alleles were sampled from species $B$; hence, $A_B$ contains two alleles and $a_B = 2$. A valid allele mapping is a function $f$ from the sampled alleles to the leaves of $T$ such that if $f(a) = d$, and $a \in A_x$, then $d \in c_T(x)$. In other words, $f$ maps an allele from species $x$ to a leaf in the MUL tree labeled by $x$. Let $F$ denote the set of all such valid allele mappings $f$; see Figure 1 for the valid allele mappings of the example.
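
Enumerating the valid allele mappings amounts to taking a Cartesian product of leaf choices, one choice per allele. The sketch below illustrates this; the function name, allele names, and leaf names are hypothetical, and the example is only loosely modeled on the situation described for Figure 1 (two alleles from species B, whose label appears on two MUL tree leaves).

```python
from itertools import product

def valid_allele_mappings(alleles_by_species, leaves_by_species):
    """List every mapping that sends each allele to a MUL tree leaf of its own species."""
    alleles, choices = [], []
    for species, sampled in alleles_by_species.items():
        for allele in sampled:
            alleles.append(allele)
            choices.append(leaves_by_species[species])  # admissible leaves for this allele
    return [dict(zip(alleles, combo)) for combo in product(*choices)]

# Hypothetical sample: one allele each from A, C, D and two alleles from B.
alleles_by_species = {"A": ["a1"], "B": ["b1", "b2"], "C": ["c1"], "D": ["d1"]}
# Hypothetical MUL tree leaves: B and C each label two leaves.
leaves_by_species = {"A": ["A_1"], "B": ["B_1", "B_2"],
                     "C": ["C_1", "C_2"], "D": ["D_1"]}
mappings = valid_allele_mappings(alleles_by_species, leaves_by_species)
```

Since each allele chooses its leaf independently, the number of mappings is the product, over species, of the number of leaves labeled by that species raised to the number of alleles sampled from it. In this hypothetical example that is 1 × 2² × 2 × 1 = 8.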

Step 3: Computing the probability of a gene tree on the MUL tree

Once the phylogenetic network $N$ is converted into MUL tree $T$ and the set of all valid allele mappings is produced (a computationally straightforward task, although the number of valid allele mappings is exponential in a combination of the number of alleles sampled and the number of reticulation nodes), the probability of observing gene tree topology $g$ is found by summing the probability of $g$ given the MUL tree over all valid allele mappings:

$$P_{N,\lambda,\gamma}(G = g) = \sum_{f \in F} P_{T,\lambda',\gamma'}(G = g \mid f)$$
(3)

In this equation, the $P_{T,\lambda',\gamma'}(G = g \mid f)$ term accounts for all coalescent histories of a given mapping, which, when combined with the summation over all valid allele mappings, accounts for all coalescent histories within the branches of a phylogenetic network. Finally, the likelihood for a collection of gene trees is the product of the individual gene tree probabilities. This formulation naturally gives rise to a likelihood setup for estimating the parameters of a reticulate evolutionary history from a collection of gene trees described by their topologies.

To complete our framework, we now provide a formula for $P_{T,\lambda',\gamma'}(G = g \mid f)$, which is the probability of a gene tree given a MUL tree and a valid allele mapping. Special attention needs to be paid to sets of branches in the MUL tree that correspond to single branches in the phylogenetic network, since coalescence events within these branches are not independent. Let us illustrate this issue using a valid allele mapping for the MUL tree $T$ in Figure 1 under which each of the two alleles sampled from species B is mapped to a different B leaf in $T$. Tracing these two alleles independently from the two B leaves implies that, when the evolution of these two alleles is traced in the phylogenetic network, no coalescence event occurs along the branch incident into leaf B in the network. Additionally, each branch in the MUL tree may have a hybridization probability associated with it that is neither 0 nor 1, and this must be accounted for in computing the probabilities. Accounting for these two cases gives rise to

equation image
(4)

where the An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e131.jpg terms are symbolic quantities, that do not individually evaluate to any value. Instead, they play a role in simultaneously computing the probability along pairs of branches in the MUL tree that share a single source branch in the phylogenetic network. More formally, let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e132.jpg be a branch in An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e133.jpg such that An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e134.jpg is a reticulation node. Given the mapping An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e135.jpg from the branches of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e136.jpg to the branches of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e137.jpg, the pre-image (or, inverse image) An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e138.jpg is the set of all branches in An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e139.jpg that map to An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e140.jpg under An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e141.jpg. That is, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e142.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e143.jpg is the set of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e144.jpg's branches. Then, we define

equation image
(5)

This equation states that the number of lineages An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e146.jpg that enters (working backward in time) branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e147.jpg in the phylogenetic network equals the sum of the numbers of lineages that enter all branches of the MUL tree that map to branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e148.jpg. The number of lineages An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e149.jpg that exists branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e150.jpg is defined similarly. In Figure 1, the number of lineages that enters branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e151.jpg in the phylogenetic network equals the sum of the number of lineages that enter branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e152.jpg and the number of lineages that enter branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e153.jpg in the MUL tree.
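The bookkeeping described by Equation (5) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the branch labels, the dictionary `branch_map` (MUL-tree branch to network branch), and the lineage counts are all hypothetical.

```python
from collections import defaultdict

def lineages_per_network_branch(mul_counts, branch_map):
    """Sum lineage counts over all MUL-tree branches that map to the same network branch.

    mul_counts: {mul_branch: number of lineages entering (or exiting) that branch}
    branch_map: {mul_branch: network_branch}  (hypothetical stand-in for the mapping)
    """
    totals = defaultdict(int)
    for mul_branch, count in mul_counts.items():
        totals[branch_map[mul_branch]] += count
    return dict(totals)

# Hypothetical example: MUL-tree branches b1 and b2 both map to network branch b.
branch_map = {"b1": "b", "b2": "b", "c1": "c"}
entering = {"b1": 2, "b2": 1, "c1": 3}
network_entering = lineages_per_network_branch(entering, branch_map)
```

The same function applies unchanged to the counts of lineages exiting each branch.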

Then, we use the following equation to evaluate the probability in Equation (4):

equation image
(6)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e155.jpg is computed using the formula in [26], with An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e156.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e157.jpg as parameters. In the example of branches An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e158.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e159.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e160.jpg that we just illustrated, Equation (6) states that An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e161.jpg evaluates to

equation image

The term An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e163.jpg gives the probability that An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e164.jpg lineages coalesce into An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e165.jpg lineages within time An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e166.jpg. The term

equation image

corresponds to the quantity An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e168.jpg in [26]. Finally, the term

equation image

is the number of restrictions for the ordering of coalescent events within branch An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e170.jpg.
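The probability that a given number of lineages coalesce into a smaller number within a fixed time has a standard closed form in coalescent theory. The sketch below implements that closed form, on the assumption that it is the quantity the text refers to; the function name `g` and the argument names `i`, `j`, `t` (time in coalescent units) are ours.

```python
from math import exp, factorial

def g(i, j, t):
    """Probability that i lineages coalesce into j lineages within time t (coalescent units)."""
    if not (1 <= j <= i):
        raise ValueError("require 1 <= j <= i")
    total = 0.0
    for k in range(j, i + 1):
        # Product over y = 0 .. k-1 of (j + y)(i - y) / (i + y)
        prod = 1.0
        for y in range(k):
            prod *= (j + y) * (i - y) / (i + y)
        coeff = (2 * k - 1) * (-1) ** (k - j) / (
            factorial(j) * factorial(k - j) * (j + k - 1)
        )
        total += exp(-k * (k - 1) * t / 2) * coeff * prod
    return total
```

For two lineages this reduces to the familiar values: `g(2, 1, t)` equals 1 - e^(-t) and `g(2, 2, t)` equals e^(-t).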

Accounting for uncertainty in gene tree topologies

Thus far, we have assumed that we have an accurate, fully resolved gene tree for each locus. However, in practice, gene tree topologies are inferred from sequence data and, as such, there is uncertainty about them. In Bayesian inference, this uncertainty is reflected by a posterior distribution of gene tree topologies. In a parsimony analysis, several equally optimal trees are computed. We propose here a way of incorporating this uncertainty into the framework above. Assume we have An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e171.jpg loci under analysis, and for each locus An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e172.jpg, a Bayesian analysis of the sequence alignment returns a set of gene trees An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e173.jpg, along with their associated posterior probabilities An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e174.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e175.jpg). Now, let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e176.jpg be the set of all distinct tree topologies computed on all An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e177.jpg loci, and for each An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e178.jpg let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e179.jpg be the sum of posterior probabilities associated with all gene trees computed over all loci whose topology is An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e180.jpg. Thus, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e181.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e182.jpg. Then, we replace Eq. (1) by

equation image
(7)

We note that if An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e184.jpg or An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e185.jpg for each An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e186.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e187.jpg, then Eq. (7) is equivalent to Eq. (1), and both are multinomial likelihoods. This multinomial approach has also been used elsewhere for both species networks under simple hybridization scenarios [28] and species trees [24]. We additionally allow the An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e188.jpg terms to be between 0 and 1 (and therefore An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e189.jpg to be non-integer values) in order to reflect uncertainty in the estimated gene trees.
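The weighting scheme just described can be sketched as follows. Because the equation itself is not reproduced here, the sketch assumes the multinomial form in which each distinct topology contributes its probability raised to its summed posterior weight; the function names, the Newick-style topology strings, and the numeric values are hypothetical.

```python
from collections import defaultdict
from math import log

def topology_weights(per_locus_posteriors):
    """Sum posterior probabilities of each distinct topology over all loci."""
    weights = defaultdict(float)
    for posterior in per_locus_posteriors:
        for topology, prob in posterior.items():
            weights[topology] += prob
    return dict(weights)

def weighted_log_likelihood(weights, topology_prob):
    """Log-likelihood with (possibly non-integer) weights acting as topology counts.

    topology_prob: {topology: probability of that topology under the candidate network}
    """
    return sum(w * log(topology_prob[g]) for g, w in weights.items() if w > 0)

# Hypothetical posteriors for two loci.
posteriors = [
    {"((A,B),C)": 0.9, "((A,C),B)": 0.1},
    {"((A,B),C)": 1.0},
]
weights = topology_weights(posteriors)
# Hypothetical topology probabilities under some candidate network.
loglik = weighted_log_likelihood(weights, {"((A,B),C)": 0.8, "((A,C),B)": 0.1})
```

When every posterior probability is 0 or 1, the weights are integer counts and the expression is the ordinary multinomial log-likelihood (up to a constant).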

In the case where a maximum parsimony analysis is conducted to infer gene trees on the individual loci, a different treatment is necessary, since for each locus, all inferred trees are equally optimal. For locus An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e190.jpg, let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e191.jpg be the strict consensus of all optimal gene tree topologies found. Then, Eq. (1) becomes

equation image
(8)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e193.jpg is the set of all binary refinements of gene tree topology An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e194.jpg.

Results

Support for hybridization in yeast

Using our method to compute the likelihood function given by Eq. (1), we reanalyzed the yeast data set of [35], which consists of 106 loci, each with a single allele sampled from seven Saccharomyces species: S. cerevisiae (Scer), S. paradoxus (Spar), S. mikatae (Smik), S. kudriavzevii (Skud), S. bayanus (Sbay), S. castellii (Scas), S. kluyveri (Sklu), and the outgroup fungus Candida albicans (Calb). Given that there is no indication of coalescences deeper than the MRCA of Scer, Spar, Smik, Skud, and Sbay [36], we focused only on the evolutionary history of these five species (see Text S1). We inferred gene trees using Bayesian inference in MrBayes [37] and using maximum parsimony in PAUP* [38] (see Text S1 for settings).

The species tree that has been reported for these five species, based on the 106 loci, is shown in Figure 2A [35]. Further, additional studies inferred the tree in Figure 2B as a very close candidate for giving rise to the 106 gene trees, under the coalescent model [36], [39]. Notice that the difference between the two trees is the placement of Skud, which flags hybridization as a possibility. Indeed, the phylogenetic network topologies in Figure 2C and 2D have been proposed as an alternative evolutionary history, under the stochastic framework of [40], as well as the parsimony framework of [30].

Figure 2
Various hypotheses for the evolutionary history of a yeast data set.

Using the 106 gene trees, we estimated the times An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e195.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e196.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e197.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e198.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e199.jpg for the six phylogenies in Figure 2 that maximize the likelihood function (we used a grid search of values between 0.05 and 4, with step length of 0.05 for branch lengths, and values between 0 and 1 with step length of 0.01 for An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e200.jpg). Table 1 lists the values of the parameters computed using Eq. (7) on the gene trees inferred by MrBayes and Table 2 lists the values of the parameters computed using Eq. (8) on the gene trees inferred by PAUP*, as well as the values of three information criteria, AIC [31], AICc [32] and BIC [33], in order to account for the number of parameters and allow for model selection.
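A grid search of this kind, followed by the three information criteria, can be sketched as below. The grids follow the ranges and step lengths stated above; the function names, the toy likelihood surface, and the choice of what counts as the sample size `n` are assumptions made for illustration.

```python
from itertools import product
from math import log

def grid(start, stop, step):
    """Evenly spaced grid from start to stop inclusive, rounded to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

def grid_search(log_lik, *grids):
    """Return the parameter tuple on the grid that maximizes log_lik."""
    return max(product(*grids), key=lambda params: log_lik(*params))

def information_criteria(log_lik, k, n):
    """AIC, AICc and BIC for a model with k free parameters and sample size n."""
    aic = 2 * k - 2 * log_lik
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    bic = k * log(n) - 2 * log_lik
    return aic, aicc, bic

branch_grid = grid(0.05, 4.0, 0.05)   # candidate branch lengths
gamma_grid = grid(0.0, 1.0, 0.01)     # candidate hybridization probabilities

# Toy log-likelihood surface (hypothetical), peaked at t = 1.0 and gamma = 0.3.
toy = lambda t, gamma: -(t - 1.0) ** 2 - (gamma - 0.3) ** 2
best = grid_search(toy, branch_grid, gamma_grid)
```

An exhaustive grid grows multiplicatively with the number of parameters, so this approach is practical only for the handful of branch lengths and hybridization probabilities considered here.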

Table 1
Analysis results for the six phylogenies in Figure 2 using gene tree topologies inferred by a Bayesian analysis (using MrBayes).
Table 2
Analysis results for the six phylogenies in Figure 2 using gene tree topologies inferred by maximum parsimony (using PAUP*).

Out of the 106 gene trees (using either of the two inference methods), roughly 100 trees placed Scer and Spar as sister taxa, which potentially reflects the lack of deep coalescence involving this clade (and is reflected by the relatively large An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e213.jpg values estimated). Roughly 25% of the gene trees did not show monophyly of the group Scer, Spar, and Smik, thus indicating a mild level of deep coalescence involving these three species (and reflected by the relatively small An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e214.jpg values estimated). However, a large proportion of the 106 gene trees indicated incongruence involving Skud. This pattern is reflected by the very low estimates of the time An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e215.jpg on the two phylogenetic trees in Figure 2. On the other hand, analysis under the phylogenetic network models of Figure 2C and 2D indicates a larger divergence time, with a substantial extent of hybridization. These latter hypotheses naturally result in a better likelihood score. When accounting for model complexity, all three information criteria indicated that these two phylogenetic network models, with extensive hybridization and a larger divergence time between Sbay and the (Smik,(Scer,Spar)) clade, provide a better fit for the data. Further, while both networks produced identical hybridization probabilities, the network in Figure 2D had much lower values of the information criteria than those of the network in Figure 2C. The networks in Figure 2E and 2F have lower support (under all measures) than the other four phylogenies. In summary, our analysis gives higher support for the hypothesis of extensive hybridization, a low degree of deep coalescence, and long branch lengths than for the hypothesis of a species tree with short branches and extensive deep coalescence. It is worth mentioning that while the three networks in Figure 2C–2E were reported as equally optimal under a parsimonious reconciliation [36], our new framework can distinguish among the three, and identifies the network in Figure 2D as best, followed by the one in Figure 2C (the network of Figure 2E is found to be a worse fit than either of the two species tree candidates).

Support for hybridization in Drosophila

We reanalyzed the three-species Drosophila data set of [41], which includes D. melanogaster (Dmel), D. yakuba (Dyak), and D. erecta (Dere).

The data set consisted of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e216.jpg loci supporting the three possible gene tree topologies as follows:

  • gene tree An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e217.jpg is supported by An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e218.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e219.jpg) loci;
  • gene tree An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e220.jpg is supported by An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e221.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e222.jpg) loci; and,
  • gene tree An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e223.jpg is supported by An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e224.jpg (An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e225.jpg) loci.

For a species tree with three species and one individual sampled per species, the multispecies coalescent predicts that the two gene trees with topologies different from that of the species tree each occur with probability $(1/3)e^{-t}$, where $t$ is the length of the one internal branch in coalescent units [42]. Two important predictions under the coalescent are therefore that the two nonmatching gene trees are expected to be tied in frequency and that both occur less than $1/3$ of the time, with the matching gene tree topology occurring more than $1/3$ of the time. This tie in the expected frequency of nonmatching gene trees is observed in some three-taxon data sets, but not in others, including the Drosophila data set.
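These predictions are easy to check numerically. The sketch below encodes the standard three-taxon result (each nonmatching topology has probability e^(-t)/3, so the matching one has 1 - (2/3)e^(-t)); the function name is ours.

```python
from math import exp

def three_taxon_probs(t):
    """Gene tree topology probabilities on a three-species tree, one allele per species.

    t is the internal branch length in coalescent units.
    Returns (matching, nonmatching_1, nonmatching_2).
    """
    nonmatching = exp(-t) / 3
    return 1 - 2 * nonmatching, nonmatching, nonmatching

match, alt1, alt2 = three_taxon_probs(1.0)
```

At t = 0 all three topologies are equally likely, which is the star-tree limit; for any t > 0 the matching topology exceeds 1/3 and the two others remain tied below it.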

Although this deviation from symmetry can be explained by a model of population subdivision, where the subdivision must occur in the internal branch as well as the population ancestral to all three species [43], the asymmetry can also be explained by the simplest hybridization network on three species with just one hybridization parameter (Figure 3).

Figure 3
Six hypotheses for the evolutionary history of a Drosophila data set.

We considered six candidates for the species phylogeny: three with no hybridization, and three with hybridizations involving different pairs of species (see Figure 3). For the three phylogenetic trees, we estimated the time An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e230.jpg that maximizes the probability of observing all An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e231.jpg gene trees, and for the three phylogenetic networks, we additionally estimated the hybridization probability An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e232.jpg.

The results in Table 3 show that of the three phylogenetic trees, the one in Figure 3A provides the best fit of the data, which is in agreement with the analysis in [41]. In fact, the value of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e233.jpg we estimated on the other two trees was the lowest value we used in the estimation procedure. Clearly, this value can be arbitrarily small for these two trees, since the unresolved phylogeny (Dmel, Dere, Dyak) fits the data better.

Table 3
Estimates of time An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e234.jpg and hybridization probability An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e235.jpg (when applicable) on the six candidate species phylogenies shown in Figure 3 for the three Drosophila species Dmel, Dere, and Dyak.

Among the three network candidates, the one in Figure 3D has the best fit of the data. This network, with a value of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e242.jpg, indicates that An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e243.jpg of the alleles sampled from Dere shared a common ancestor first with alleles from Dyak (reflecting the tree in Figure 3A), while An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e244.jpg of the alleles from Dere shared a common ancestor first with alleles from Dmel (reflecting the tree in Figure 3B). Indeed, this network is the smallest network (in terms of the number of reticulation nodes) that reconciles both trees. Further, the change in AIC for this network is An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e245.jpg, indicating a much better fit than the best tree (Figure 3A). As noted previously [43], a chi-square test will also strongly reject the hypothesis that the species relationships are tree-like with random mating.
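One simple form such a test could take is a Pearson chi-square test of whether the two minority topologies are equally frequent, as a tree-like history predicts. This is an illustrative assumption and not necessarily the test used in [43]; the function name is ours and any counts passed in are hypothetical, not the Drosophila data.

```python
from math import erfc, sqrt

def symmetry_chi_square(n_a, n_b):
    """Pearson chi-square test (1 degree of freedom) that two counts share one expected value.

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(x / 2)).
    """
    expected = (n_a + n_b) / 2
    stat = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    return stat, erfc(sqrt(stat / 2))

# Hypothetical counts of loci supporting the two minority topologies.
stat, p_value = symmetry_chi_square(60, 40)
```

A small p-value indicates that the asymmetry between the two minority topologies is larger than sampling noise alone would explain under a tree with random mating.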

This three-taxon example can be analyzed analytically. Fitting a hybridization parameter allows a perfect fit to any observed frequencies of gene tree topologies for three species for one of the three networks in Figure 3. We let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e247.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e248.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e249.jpg represent the probabilities of topologies (Dmel,(Dere, Dyak)), ((Dmel, Dere), Dyak), and ((Dmel, Dyak), Dere) under the network in Figure 3D. Then

equation image
equation image
equation image

This system has the unique solution

equation image
(9)

for An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e254.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e255.jpg (either at least one of the gene tree probabilities is less than An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e256.jpg, since they sum to 1.0; or, if they are all exactly 1/3, then a star tree with An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e257.jpg and any An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e258.jpg exactly fits the data). Thus we can estimate An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e259.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e260.jpg using the observed An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e261.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e262.jpg in equation (9), and this also maximizes the likelihood.
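A hedged numerical sketch of this kind of inversion follows. Since the equations are not reproduced here, it assumes that the topology probabilities under the network are a mixture, weighted by the hybridization probability `gamma`, of two three-taxon species trees that share a single internal branch length `t`; under that assumption the third topology arises only through deep coalescence. The function names and the symbols `p1`, `p2`, `p3` are ours and may not match the article's notation.

```python
from math import exp, log

def network_topology_probs(t, gamma):
    """Topology probabilities under the assumed gamma-weighted mixture of two trees.

    p1: (Dmel,(Dere,Dyak))   p2: ((Dmel,Dere),Dyak)   p3: ((Dmel,Dyak),Dere)
    """
    minor = exp(-t) / 3          # probability of each nonmatching topology on a tree
    major = 1 - 2 * minor        # probability of the matching topology on a tree
    p1 = gamma * major + (1 - gamma) * minor
    p2 = (1 - gamma) * major + gamma * minor
    p3 = minor
    return p1, p2, p3

def solve_t_gamma(p1, p3):
    """Invert the assumed model: recover (t, gamma) from two topology frequencies."""
    if not 0 < p3 < 1 / 3:
        raise ValueError("need 0 < p3 < 1/3 for a unique solution")
    t = -log(3 * p3)
    gamma = (p1 - p3) / (1 - 3 * p3)
    return t, gamma

# Round trip with hypothetical parameter values.
p1, p2, p3 = network_topology_probs(1.2, 0.35)
t_hat, gamma_hat = solve_t_gamma(p1, p3)
```

Because the model has two free parameters and the three frequencies sum to one, two observed frequencies determine both parameters exactly, which is why a perfect fit is always available in the three-taxon case.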

Identifiability of hybridization using gene tree topologies: A simulation study

For the simulated data, we evolved gene trees within the branches of phylogenetic networks, while varying branch lengths and hybridization probabilities, and investigated two questions: (1) how much data (gene trees) is needed to obtain accurate inference of the parameters (branch lengths and/or hybridization probabilities)? (2) are the parameters always identifiable? To answer these two questions, we investigated six different phylogenetic network topologies that involved a single-reticulation scenario, two-reticulation scenarios (dependent and independent), and cases with extinctions involving the species that hybridize (see Text S1).

Our results show that both hybridization probabilities and branch lengths can be estimated with very high accuracy provided that no extinction events were involved in the parents of hybrid populations (see Text S1). Further, this accuracy can be achieved even when using the smallest number of gene trees we used in our study, which is 10. Under these settings, estimates using our framework seemed to converge quickly to the true values.

We also investigated the performance of the method, as well as identifiability issues when phylogenetic signal from at least one of the species involved in the hybridization is completely lost. Figure 4 shows the results for one such scenario (see Text S1 for another scenario that involves the loss of phylogenetic signal from both species involved in the hybridization).

Figure 4
Identifiability in detecting hybridization.

Figure 4B–4D show that when the true values of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e280.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e281.jpg are assumed to be known in the estimation procedure (the value of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e282.jpg is irrelevant in the case when a single allele is sampled per species), the estimates of the hybridization probabilities converge to the true values. However, unlike the cases that did not involve extinctions, a larger number of gene trees is now required to obtain an accurate estimate (while there are only three possible gene tree topologies, a large number of gene trees needs to be sampled for the frequencies of the three topologies to be informative). The time intervals of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e283.jpg coalescent units amount to a large extent of deep coalescence events, which blurs the phylogenetic signal, and results in slight over- or under-estimation of the hybridization probabilities (Text S1 shows the results for the time interval with An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e284.jpg).

If the topology of the network in Figure 4A is assumed to be known, but both the branch lengths and hybridization probabilities are to be estimated, then these parameters are unidentifiable; that is, two different pairs of vectors of branch lengths and hybridization probabilities can be found to explain the observed data with exactly the same probability (see Text S1). If at least two alleles are sampled from species B, then the parameter values become identifiable; however, an extremely large, and potentially infeasible, number of gene trees needs to be sampled to uniquely identify the parameter values in practice (see Text S1).

Furthermore, in the special case where An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e285.jpg, a phylogenetic tree with appropriate branch lengths can be found that fits the data with exactly the same probability as the phylogenetic network. Let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e286.jpg be the branch lengths vector with An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e287.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e288.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e289.jpg, and let An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e290.jpg be the hybridization probabilities vector with An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e291.jpg. Now, consider the phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e292.jpg in Figure 4E. Then, if we set An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e293.jpg as a function of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e294.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e295.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e296.jpg, using An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e297.jpg, then, An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e298.jpg for any gene tree An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e299.jpg. The values of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e300.jpg are shown in Figure 4F–4H. These results show that as An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e301.jpg increases, the value of An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e302.jpg becomes unaffected by An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e303.jpg, and that increasing An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e304.jpg proportionally to the increase in An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e305.jpg always maintains identical probabilities of gene trees under both species phylogenies (see Text S1).

Our method for computing the probability of gene trees under hybridization and deep coalescence allows for analyzing data sets with evolutionary histories of arbitrary complexity in terms of the hybridization scenarios. When parameters are identifiable, our method estimates their values with high accuracy from a relatively small number of loci. Further, our method can be used to show lack of identifiability of model parameters for other cases. Our method supports a hypothesis of larger divergence time coupled with hybridization over short divergence times (with extensive deep coalescence) in a yeast data set. Finally, for a large Drosophila data set, our method indicated that the sampled loci are consistent with hybridization.

Discussion

Using coalescence times versus topologies to infer species networks

We have focused on calculating probabilities of gene tree topologies and using these probabilities to infer species networks. In addition, the joint density of the coalescence times and topology in the gene trees could be used to infer species networks. Indeed, this approach has been used for networks where reticulation nodes have one descendant which is an extant species [29], using the density for coalescence times derived by Rannala and Yang [44]. This approach is computationally faster than computing gene tree topology probabilities because it is not necessary to sum over a large number of coalescent histories. To compute this joint density, each sampled gene may have to be traced through up to An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e306.jpg possible paths through the network, where An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e307.jpg is the number of hybridization events ancestral to the sampled gene from species An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e308.jpg, and the density will take the form of a sum over possible paths through the network. (In contrast, computing the probability of a topology will require An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e309.jpg mappings of alleles to the MUL-tree, and each gene topology calculation will require summing over coalescent histories.) This joint density for the gene trees with coalescence times could then be used in either maximum likelihood or Bayesian frameworks to infer the species network.

An important advantage of using coalescence times is that certain networks might be identifiable using coalescence times when probabilities of topologies might not identify the network. In the example of Figure 4A, although the gene tree topology probabilities can be obtained by a tree, the distribution of the coalescence times between lineages sampled from B and C is a mixture of three shifted exponential distributions if An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e310.jpg, but a mixture of two shifted exponential distributions if An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e311.jpg. For example, if An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e312.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e313.jpg are known but An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e314.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e315.jpg are unknown, then the likelihood of observing a coalescence between a B and C lineage for times slightly greater than An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e316.jpg will be very low if An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e317.jpg, and much higher for An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e318.jpg, thus making it possible to test whether An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e319.jpg when coalescence times are used.

Another identifiability issue is that both population subdivision and hybridization can lead to the asymmetry in gene tree topology probabilities in the 3-taxon case such as observed in the Drosophila example discussed earlier, where the two least frequently observed topologies are not tied in frequency. Either population subdivision, with a parameter describing the probability that the two most closely related species fail to coalesce in the ancestral population due to population structure, or hybridization can fit the data for the gene tree topologies. However, the two models could imply different distributions on coalescence times, which might therefore be useful in distinguishing the models. We note that identifiability in the case of three species with one individual per species might be especially limited due to the small number of gene tree topology probabilities that can be used to estimate parameters. In the case of identifying rooted species trees from unrooted gene trees with one lineage per species, for example, identifiability is achieved only with 5 or more species [17].

We consider it desirable to develop many methods for inferring species trees and species networks so that their properties and performances can be compared. In the case of species tree inference, there are advantages and disadvantages to using topology-based methods versus methods that include branch lengths, and in using likelihood versus Bayesian methods. We expect that many of these strengths and weaknesses may carry over to the case of inferring networks. For moderately sized data sets, Bayesian methods that model branch lengths and uncertainty in the gene trees such as BEST [45] and *BEAST [46] often have the best performance [47]. However, these methods require estimating the joint posterior distribution of the species tree and gene trees and therefore are difficult to implement for large numbers of loci. Maximizing the likelihood of the gene trees and their coalescence times (but without accounting for uncertainty in the gene trees), as in STEM [48], is fast and has very good performance on known gene trees but seems to be very sensitive to the assumption that branch lengths are estimated correctly [24], [49]. Maximizing the likelihood of the species tree from gene tree topologies alone, using the program STELLS, tended to perform better than STEM on a large simulated data set, even without accounting for uncertainty in the gene trees (An external file that holds a picture, illustration, etc.
Object name is pgen.1002660.e320.jpg loci on 8 taxa) and worse performance on fewer loci [24]. Which method is optimal for inferring species trees or networks might depend on many factors such as the number of loci, the number of lineages sampled per species, the accuracy with which branch lengths can be estimated, the extent to which there are model violations, and the speciation history [49].

Recombination and population size assumptions

Two common assumptions in multispecies coalescent models are that there is no recombination within loci (and free recombination between loci) and that ancestral population sizes are constant.

Recombination can lead to different portions of a gene alignment effectively having distinct gene tree topologies. Ideally, alignments should be chosen so that recombination within genes is unlikely. This can be achieved by testing alignments beforehand for recombination using one of the many available methods [50]–[52] or, for whole genome data, by choosing the cutoffs for loci such that recombination breakpoints are unlikely to fall within them [53]. In addition, recombination may lead to greater violations of the coalescent model for branch lengths than for topologies [53], so that topology-based methods might be less sensitive to the assumption that there is no recombination within loci. Furthermore, a recent simulation study found that recombination within loci did not have much impact on species tree inference methods for a wide range of recombination rates [54].

Coalescent models often assume that ancestral populations have constant size for the duration of the population (i.e., a constant size for a given branch of the species tree, but not necessarily the same on different branches). The program *BEAST [46] allows for ancestral population sizes to change linearly with time. Nonconstant population sizes will tend to result in branch lengths that make topologies more (or less) star-like for populations that are increasing (or decreasing) in size [55]. One approach to modelling a changing population size would be to break up a branch into intervals that are relatively constant in size. Suppose, for instance, that a branch consists of an interval of $t_1$ generations with population size $N_1$, and $t_2$ generations with size $N_2$. The total time of the branch in coalescent units is $T = T_1 + T_2$, where $T_i$ is the length of interval $i$ in coalescent units (that is, $t_i$ scaled by $N_i$). Although unequal values of $N_i$ can affect the distribution of coalescence times (for example, if $t_1 = t_2$ but $N_1 < N_2$, then coalescence events might be more likely to occur in the interval with size $N_1$), the probabilities of topologies arising in this branch are not affected and can be calculated just using the total time $T$. In particular, for the functions $g_{ij}(T)$, which are the terms that depend on time in the calculations for gene tree topology probabilities, we have

$$g_{ij}(T_1 + T_2) = \sum_{k=j}^{i} g_{ik}(T_1)\, g_{kj}(T_2),$$

which is an instance of the Chapman-Kolmogorov equations because the number of lineages is a continuous time Markov chain (a death chain) [56].
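The Chapman-Kolmogorov property can be checked numerically. The sketch below assumes the standard closed-form expression from coalescent theory for the probability that `i` lineages coalesce to `j` lineages within time `T`, commonly written g_ij(T), with `T` measured in coalescent units such that each pair of lineages coalesces at rate 1. The function names `g` and `chapman_kolmogorov_rhs` are illustrative, not part of any program cited here.

```python
from math import exp, factorial

def g(i, j, T):
    """Probability that i lineages coalesce to j lineages within time T.

    T is assumed to be in coalescent units (pairwise coalescence rate 1).
    """
    if j < 1 or j > i:
        return 0.0
    total = 0.0
    for k in range(j, i + 1):
        # Product term of the closed form, over y = 0, ..., k-1.
        prod = 1.0
        for y in range(k):
            prod *= (j + y) * (i - y) / (i + y)
        coeff = ((2 * k - 1) * (-1) ** (k - j)
                 / (factorial(j) * factorial(k - j) * (j + k - 1)))
        total += exp(-k * (k - 1) * T / 2.0) * coeff * prod
    return total

def chapman_kolmogorov_rhs(i, j, T1, T2):
    """Sum over the intermediate lineage count k after the first interval."""
    return sum(g(i, k, T1) * g(k, j, T2) for k in range(j, i + 1))

# Splitting a branch of length T1 + T2 into two intervals should not change
# the probability of going from 5 lineages to 2.
T1, T2 = 0.7, 0.4
lhs = g(5, 2, T1 + T2)
rhs = chapman_kolmogorov_rhs(5, 2, T1, T2)
```

Here `lhs` and `rhs` should agree up to floating-point error, which is the sense in which only the total branch length in coalescent units matters for topology probabilities.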

We expect that topology-based methods may show more robustness to recombination and changing population sizes than approaches which explicitly model coalescence times. However, for estimating species trees and networks from gene trees, as in other areas of statistical inference, there is likely to be a tradeoff between power and robustness for methods that do and do not model branch lengths of the gene trees.

Searching for networks

A current limitation to the procedure we have outlined for estimating hybridization is that we require a set of candidate networks on which to perform model selection. In some cases, such a set of candidate networks can be obtained by considering specific hypotheses related to biogeographical information. Candidate networks can also be generated using supernetworks from gene trees [57] or other network methods [9]. Often these methods will generate very complicated networks if there are many conflicts in the data, so it might be useful to choose different random subsets of well-supported (or frequently occurring) gene tree topologies to generate candidate species networks. In the future it will be desirable to develop algorithms that directly search the space of species networks in order to automate searching for optimal species networks.
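Once a candidate set is available, the model selection step can be sketched as follows, assuming that information criteria such as AIC [31] and BIC [33] are used to compare candidates. The candidate names, log-likelihoods and parameter counts below are hypothetical placeholder values chosen only to make the example run; they are not results from this study.

```python
from math import log

def aic(log_lik, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2.0 * k - 2.0 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion with sample size n (here, loci)."""
    return k * log(n) - 2.0 * log_lik

def rank_candidates(candidates, n_loci):
    """Score (name, log_lik, k) tuples and sort by AIC, lowest first."""
    scored = [(name, aic(ll, k), bic(ll, k, n_loci))
              for name, ll, k in candidates]
    return sorted(scored, key=lambda row: row[1])

# Hypothetical candidates: (name, maximized log-likelihood, free parameters).
candidates = [
    ("tree", -120.0, 2),
    ("network_1", -112.5, 4),
    ("network_2", -112.0, 6),
]
ranking = rank_candidates(candidates, n_loci=100)
best_name = ranking[0][0]
```

In this made-up example the network with one extra reticulation improves the likelihood enough to offset its additional parameters, while the more complex network does not. An automated search over network space would need to generate such candidates rather than take them as given.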

Supporting Information

Text S1

Supporting information file that contains formal definitions and additional results on synthetic data.

(PDF)

Acknowledgments

We are grateful to J. Felsenstein and two anonymous reviewers for comments which have improved the manuscript.

Footnotes

The authors have declared that no competing interests exist.

This work was supported in part by NSF grant CCF-0622037, NSF grant DBI-1062463, grant R01LM009494 from the National Library of Medicine, and an Alfred P. Sloan Research Fellowship to LN. JHD was funded by the New Zealand Marsden Fund and the National Institute for Mathematical and Biological Synthesis, an institute sponsored by the National Science Foundation, the U.S. Department of Homeland Security, and the U.S. Department of Agriculture through NSF Award EF-0832858, with additional support from the University of Tennessee, Knoxville. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Doyle JJ. Gene trees and species trees: molecular systematics as one-character taxonomy. Syst Bot. 1992;17:144–163.
2. Maddison W. Gene trees in species trees. Syst Biol. 1997;46:523–536.
3. Edwards SV. Is a new and general theory of molecular systematics emerging? Evolution. 2009;63:1–19.
4. Swofford D, Olsen G, Waddell P, Hillis D. Phylogenetic inference. In: Hillis D, Mable B, Moritz C, editors. Molecular Systematics. Sunderland, Mass.: Sinauer Assoc; 1996. pp. 407–514.
5. Rosenberg NA. The probability of topological concordance of gene trees and species trees. Theor Popul Biol. 2002;61:225–247.
6. Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009;24:332–340.
7. Arnold ML. Natural Hybridization and Evolution. Oxford: Oxford University Press; 1997.
8. Mallet J. Hybrid speciation. Nature. 2007;446:279–283.
9. Huson D, Rupp R, Scornavacca C. Phylogenetic Networks: Concepts, Algorithms and Applications. New York: Cambridge University Press; 2010.
10. Nakhleh L. Evolutionary phylogenetic networks: models and issues. In: Heath L, Ramakrishnan N, editors. The Problem Solving Handbook for Computational Biology and Bioinformatics. New York: Springer; 2010. pp. 125–158.
11. Mallet J. Hybridization as an invasion of the genome. Trends Ecol Evol. 2005;20:229–237.
12. Linder CR, Rieseberg LH. Reconstructing patterns of reticulate evolution in plants. Am J Bot. 2004;91:1700–1708.
13. Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA. Properties of consensus methods for inferring species trees from gene trees. Syst Biol. 2009;58:35–54.
14. Than CV, Rosenberg NA. Consistency properties of species tree inference by minimizing deep coalescences. J Comput Biol. 2011;18:1–15.
15. Wang Y, Degnan JH. Performance of matrix representation with parsimony for inferring species from gene trees. Stat Appl Genet Mol Biol. 2011;10:21.
16. Ané C. Reconstructing concordance trees and testing the coalescent model from genome-wide data sets. In: Knowles LL, Kubatko LS, editors. Estimating species trees: Theoretical and practical aspects. Hoboken, NJ: Wiley-Blackwell; 2010. pp. 35–52.
17. Allman ES, Degnan JH, Rhodes JA. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J Math Biol. 2011;62:833–862.
18. Allman ES, Degnan JH, Rhodes JA. Determining species tree topologies from clade probabilities under the coalescent. J Theor Biol. 2011;289:96–106.
19. Knowles LL, Carstens BC. Delimiting species without monophyletic gene trees. Syst Biol. 2007;56:887–895.
20. Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007;56:17–24.
21. Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009;58:468–477.
22. DeGiorgio M, Degnan JH. Fast and consistent estimation of species trees using supermatrix rooted triples. Mol Biol Evol. 2010;27:552–569.
23. Carstens B, Knowles LL. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers. Syst Biol. 2007;56:400–411.
24. Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution. 2012;66:763–775.
25. Ané C, Larget B, Baum DA, Smith SD, Rokas A. Bayesian estimation of concordance factors. Mol Biol Evol. 2007;24:412–426.
26. Degnan JH, Salter LA. Gene tree distributions under the coalescent process. Evolution. 2005;59:24–37.
27. Than C, Ruths D, Innan H, Nakhleh L. Confounding factors in HGT detection: Statistical error, coalescent effects, and multiple solutions. J Comput Biol. 2007;14:517–535.
28. Meng C, Kubatko LS. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theor Popul Biol. 2009;75:35–45.
29. Kubatko LS. Identifying hybridization events in the presence of coalescence via model selection. Syst Biol. 2009;58:478–488.
30. Yu Y, Than C, Degnan JH, Nakhleh L. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst Biol. 2011;60:138–149.
31. Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr. 1974;19:716–723.
32. Burnham K, Anderson D. Model selection and multi-model inference: a practical information-theoretic approach. New York: Springer Verlag, 2nd edition; 2002.
33. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
34. Than C, Ruths D, Nakhleh L. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics. 2008;9:322.
35. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804.
36. Than C, Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comput Biol. 2009;5:e1000501. doi: 10.1371/journal.pcbi.1000501.
37. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17:754–755.
38. Swofford DL. PAUP*: Phylogenetic analysis using parsimony (and other methods). Version 4.0. Sunderland, Massachusetts: Sinauer Associates; 1996.
39. Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc Natl Acad Sci U S A. 2007;104:5936–5941.
40. Bloomquist EW, Suchard MA. Unifying vertical and nonvertical evolution: A stochastic ARG-based framework. Syst Biol. 2010;59:27–41.
41. Pollard DA, Iyer VN, Moses AM, Eisen MB. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet. 2006;2:e173. doi: 10.1371/journal.pgen.0020173.
42. Nei M. Molecular Evolutionary Genetics. New York: Columbia University Press; 1987.
43. Slatkin M. Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nature Rev Genet. 2008;9:477–485. [PubMed]
44. Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164:1645–1656.
45. Liu L, Pearl DK. Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007;56:504–514.
46. Heled J, Drummond AJ. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 2010;27:570–580.
47. Leaché AD, Rannala B. The accuracy of species tree estimation under simulation: A comparison of methods. Syst Biol. 2011;60:126–137.
48. Kubatko LS, Carstens BC, Knowles LL. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009;25:971–973.
49. Huang H, He Q, Kubatko LS, Knowles LL. Sources of error inherent in species-tree estimation: Impact of mutational and coalescent effects on accuracy and implications for choosing among different methods. Syst Biol. 2010;59:573–583.
50. Posada D, Crandall KA. Evaluation of methods for detecting recombination from DNA sequences: Computer simulations. Proc Natl Acad Sci U S A. 2001;98:13757–13762.
51. Bruen TC, Philippe H, Bryant D. A simple and robust statistical test for detecting the presence of recombination. Genetics. 2006;172:2665–2681.
52. Ruths D, Nakhleh L. RECOMP: A parsimony-based method for detecting recombination. In: Proceedings of the 4th Asia Pacific Bioinformatics Conference. 2006. pp. 59–68.
53. Ané C. Detecting phylogenetic breakpoints and discordance from genome-wide alignments for species tree reconstruction. Genome Biol Evol. 2011;3:246–258.
54. Lanier H, Knowles L. Is recombination a problem for species-tree analyses? Syst Biol. 2012; in press. doi: 10.1093/sysbio/syr128.
55. Wakeley J. Coalescent Theory. Greenwood Village, CO: Roberts & Company; 2008.
56. Ross SM. Introduction to Probability Models. New York: Academic Press, 10th edition; 2010.
57. Holland B, Benthin S, Lockhart P, Moulton V, Huber K. Using supernetworks to distinguish hybridization from lineage-sorting. BMC Evol Biol. 2008;8:202.

Articles from PLoS Genetics are provided here courtesy of Public Library of Science