- BMC Evol Biol
- v.9; 2009
- PMC2748082

# Inferring polyploid phylogenies from multiply-labeled gene trees

Andreas Spillner^{1,*}, Katharina T Huber^{1}, Anna Petri^{2}, Bengt Oxelman^{2}, and Vincent Moulton^{1}

^{1}School of Computing Sciences, University of East Anglia, Norwich, UK

^{2}Department of Plant and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden

^{*}Corresponding author.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Gene trees that arise in the context of reconstructing the evolutionary history of polyploid species are often multiply-labeled, that is, the same leaf label can occur several times in a single tree. This property considerably complicates the task of forming a consensus of a collection of such trees compared to usual phylogenetic trees.

### Results

We present a method for computing a consensus tree of multiply-labeled trees. As with the well-known greedy consensus tree approach for phylogenetic trees, our method first breaks the given collection of gene trees into a set of clusters. It then aims to insert these clusters one at a time into a tree, starting with the clusters that are supported by most of the gene trees. As the problem to decide whether a cluster can be inserted into a multiply-labeled tree is computationally hard, we have developed a heuristic method for solving this problem.

### Conclusion

We illustrate the applicability of our method using two collections of trees for plants of the genus *Silene*, that involve several allopolyploids at different levels.

## Background

Polyploidy is an important evolutionary process in plants, as well as in some animal groups (e.g. [1,2]), accounting for a significant proportion of speciation events [2]. Most eukaryotes have a life cycle that includes a haploid (one set of chromosomes) and a diploid (two sets of chromosomes) phase. A *polyploid* can arise from a sterile hybrid that has resulted from the fusion of two incompatible haploid gametes. If the hybrid doubles its chromosomes, for example due to meiotic errors, it can develop into a new, fertile lineage that is instantaneously reproductively isolated from its parents (but see e.g. [3]); this process is called *allopolyploidy*. Genome doubling within a lineage is called *autopolyploidy*.

Despite the importance of polyploidy, molecular phylogenetic studies of plants, even at shallow levels where reticulate patterns due to allopolyploidy are to be expected, have been dominated by the use of sequence regions that are unable to trace biparentally inherited evolutionary history. For example, sequences from the cytoplasmic genomes are usually maternally inherited only, and for nuclear ribosomal DNA it is thought that concerted evolution can eradicate evidence for hybridization (e.g. [4]). Moreover, most phylogenetic studies aiming at tracing polyploid histories use a single nuclear low-copy number gene tree for inference [5-9], or are restricted to relatively simple problems such as the origin of allotetraploidy [3,10]. However, to successfully distinguish polyploidization from other biological processes that may be responsible for incongruent phylogenetic patterns (e.g. homoploid hybridization, horizontal gene transfer, incomplete lineage sorting, gene duplication/loss, recombination, sampling or phylogenetic errors), it is desirable to use a large number of gene loci (e.g. [11]). Following this approach, multiple biparentally inherited genes are used in [12,13] for problems involving ploidy levels higher than 4x (tetraploidy). The collections of gene trees arising in such studies commonly have the property that the same species name can label more than one leaf in a single gene tree, due to polyploidization events [5,8-10,12-15]. More formally, we call such trees *multiply-labeled trees* or *MUL-trees*, for short (cf. [16]).

Recently, MUL-trees have been used to construct phylogenetic networks representing the evolutionary history of polyploid species [9]. Although there is now a well-developed algorithm for constructing these networks from a MUL-tree [17], construction of the MUL-tree itself has to date been performed using an *ad hoc* consensus approach [12], in which a MUL-tree was constructed intuitively from the given collection of gene trees so that the number of gene trees supporting each branch was as large as possible. Here we describe a method that generalizes this *ad hoc* approach and allows the systematic construction of one or more consensus MUL-trees from a collection of MUL-trees. This method generalizes the greedy consensus method for finding the consensus of a collection of phylogenetic trees [18] although, as we shall see, various complications arise due to computational issues concerning MUL-trees. We illustrate the applicability of our new method using two collections of MUL-trees of flowering plants of the genus *Silene*. An implementation of the method in Java (version 1.5), which is incorporated within the PADRE software package [19], is freely available for download from http://www.uea.ac.uk/cmp/research/cmpbio/PADRE.

## Results

### The main algorithm

The input to our algorithm consists of a collection of rooted MUL-trees, where the labels that occur are the same for each tree. An example of a collection of three such trees is presented in Figure 1. The leaves of every tree in this example are labeled by *a*_{1}, *a*_{2}, or *a*_{3}. Labels *a*_{1} and *a*_{3} each occur twice, whereas label *a*_{2} occurs three times. To take into account that labels may occur more than once, the leaves of the trees are labeled by a *multiset* ℳ, in which the number of occurrences of any label *a* in ℳ is called the *multiplicity* of *a*. For example, in Figure 1 the multiset ℳ is given by {*a*_{1}, *a*_{1}, *a*_{2}, *a*_{2}, *a*_{2}, *a*_{3}, *a*_{3}} and the multiplicity of *a*_{1} is 2. We will call a MUL-tree with leaves labeled by a multiset ℳ a *MUL-tree on* ℳ, for short. Note that if all of the labels in ℳ have multiplicity 1, then ℳ is just a set and a MUL-tree on ℳ is a phylogenetic tree in the usual sense [20].

**A collection of MUL-trees**. A collection of three MUL-trees, whose leaves are labeled with the elements of the multiset ℳ = {*a*_{1}, *a*_{1}, *a*_{2}, *a*_{2}, *a*_{2}, *a*_{3}, *a*_{3}}. The root of each tree is marked by *r* and in tree *T*_{1}, *e* labels a branch (see text for details).

The basic approach taken by our algorithm is to break the input MUL-trees into a collection of *clusters*, that is, sub-multisets of ℳ. In a MUL-tree *T* on ℳ each cluster arises from some branch *e* in *T*, and contains the labels *a* in ℳ with the property that we have to traverse branch *e* on the path from the root *r* to *a* in *T*. We also say that *T* *exhibits* these clusters. For example, in Figure 1 branch *e* in tree *T*_{1} gives rise to the cluster {*a*_{1}, *a*_{2}, *a*_{2}, *a*_{3}}. We then select clusters from those obtained by breaking up the MUL-trees, one at a time, starting with those that are exhibited by most of the input trees, to construct a consensus MUL-tree. At any time the clusters selected by our algorithm so far are chosen to have the property that there exists a MUL-tree that exhibits all of them simultaneously.
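The cluster definition above can be made concrete with a small Python sketch. It assumes a MUL-tree is encoded as nested tuples with leaf labels as strings; the function name `clusters` and the example tree are illustrative assumptions. The example tree is only one tree consistent with the clusters mentioned in the text, not necessarily *T*_{1} from Figure 1.

```python
from collections import Counter


def clusters(tree):
    """Return (leaf multiset, list of clusters) for a nested-tuple MUL-tree.

    Every branch (parent -> child) contributes one cluster: the multiset of
    leaf labels below the child, stored here as a sorted tuple of labels.
    Trivial single-label clusters are included.
    """
    found = []

    def leaves_below(node):
        if isinstance(node, str):          # a leaf carries exactly one label
            return Counter([node])
        total = Counter()
        for child in node:
            below = leaves_below(child)
            found.append(tuple(sorted(below.elements())))
            total += below
        return total

    labels = leaves_below(tree)
    return labels, found


# Hypothetical tree on the multiset {a1, a1, a2, a2, a2, a3, a3}.
example_tree = ((("a1", "a2", "a3"), "a2"), ("a1", "a2", "a3"))
labels, found = clusters(example_tree)
print(labels)                      # multiplicity of every label
print([c for c in found if len(c) > 1])  # non-trivial clusters only
```

Storing each cluster as a sorted tuple keeps repeated labels, so two branches that give rise to the same multiset produce two equal entries in the list.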

Note that this approach is also used in the greedy consensus method for constructing a consensus of a collection of phylogenetic trees [18]. In this method the efficient selection of the next cluster to be inserted into the consensus tree is based on the following useful property of phylogenetic trees [21]: If every pair of clusters in a collection is *compatible*, that is, there exists a phylogenetic tree that exhibits both clusters, then there is a (necessarily unique) phylogenetic tree that exhibits the whole collection. In contrast, for MUL-trees this property does not hold in general. For example, among the clusters obtained from the MUL-trees in Figure 1 every pair can be exhibited by some input tree, but it can be checked that there is no MUL-tree that exhibits them all at once. In fact, it is NP-hard to decide whether a collection of clusters of a multiset ℳ can be exhibited by some MUL-tree on ℳ [22]. Moreover, even if such a tree exists, it need not be unique (see e.g. Figure 2(b) and 2(c)).
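For ordinary phylogenetic trees, where every label has multiplicity 1, two clusters are compatible exactly when they are disjoint or one contains the other, which makes the greedy selection easy to sketch. The following Python fragment illustrates that special case only. The function names and the support counts are hypothetical, and the code does not handle MUL-trees, for which pairwise checks are not sufficient.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters (sets of labels) fit in one rooted tree iff they are
    disjoint or nested."""
    c1, c2 = set(c1), set(c2)
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def pairwise_compatible(cluster_list):
    """For sets, pairwise compatibility already guarantees a common tree."""
    return all(compatible(x, y) for x, y in combinations(cluster_list, 2))


def greedy_select(support):
    """Greedy consensus on sets: visit clusters by decreasing support and
    keep each one that is compatible with everything kept so far."""
    chosen = []
    for cluster in sorted(support, key=support.get, reverse=True):
        if all(compatible(cluster, other) for other in chosen):
            chosen.append(cluster)
    return chosen


# Hypothetical support counts (number of input trees exhibiting a cluster).
support = {frozenset("ab"): 3, frozenset("bc"): 2, frozenset("abc"): 2}
selected = greedy_select(support)
print(selected)
```

Here {a, b} is taken first, {b, c} is rejected because it overlaps {a, b} without nesting, and {a, b, c} is accepted.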

**Adding clusters**. (a) A MUL-tree *T* used to illustrate the difficulties that can occur when adding a cluster. (b)-(c) The cluster {*a*_{1}, *a*_{2}} can be added in two different ways to the MUL-tree *T*. (d) The cluster {*a*_{2}, *a*_{3}} can be added to the MUL-tree in (a), ...

To cope with these difficulties, our algorithm first greedily adds in those clusters containing at least one label *a* that has multiplicity 1 in ℳ, called *core clusters*. The key property of these clusters is that, if they can be added to a MUL-tree, then this can only be done in a unique way [22]. For example, the cluster {*a*_{2}, *a*_{3}} can be added in only one way to the MUL-tree in Figure 2(a) resulting in the MUL-tree depicted in Figure 2(d). We call the MUL-tree obtained by adding in core clusters the *backbone tree*. Note that if every element in ℳ has multiplicity 1 then every cluster is a core cluster and our algorithm works precisely like the greedy consensus method for phylogenetic trees [18]. In general, however, there will be clusters that are not core clusters, called *ambiguous clusters*, and therefore, in the second phase, we continue to greedily select these and, if possible, insert them into the backbone tree. This results in one or more MUL-trees, all of which exhibit the same collection of clusters and that contain the backbone tree as a subtree.
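The distinction between core and ambiguous clusters depends only on the multiplicities in ℳ, as the following minimal Python sketch shows. The function names are illustrative, and the multiset used in the example is an assumption chosen to be consistent with the values *m* = 6 and *d* = 4 that the text reports for the tree in Figure 2(a); the figure itself is not reproduced here.

```python
from collections import Counter


def is_core(cluster, label_multiset):
    """A cluster is a core cluster if it contains at least one label whose
    multiplicity in the label multiset is exactly 1."""
    return any(label_multiset[a] == 1 for a in cluster)


def split_clusters(cluster_list, label_multiset):
    """Partition clusters into (core, ambiguous), preserving input order."""
    core = [c for c in cluster_list if is_core(c, label_multiset)]
    ambiguous = [c for c in cluster_list if not is_core(c, label_multiset)]
    return core, ambiguous


# Assumed multiset: a3 and a4 occur once, a1 and a2 occur twice.
M = Counter({"a1": 2, "a2": 2, "a3": 1, "a4": 1})
core, ambiguous = split_clusters([("a2", "a3"), ("a1", "a2")], M)
print(core, ambiguous)
```

Under this assumption {a2, a3} is a core cluster because a3 occurs once, whereas {a1, a2} is ambiguous.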

Note that as part of the two-phase strategy outlined above we also apply a threshold *t *that determines the minimum number of input trees that must exhibit a cluster in order to be taken into account when forming a consensus MUL-tree. This threshold helps to prevent a core cluster being exhibited by only a small number of input trees blocking the addition of an ambiguous cluster that is exhibited by many input trees later on. The idea of using a threshold is similar to the approach taken by the majority rule consensus method for phylogenetic trees [18].

### A detailed description of the algorithm

We now give a full description of our new algorithm, which is presented in the form of pseudo-code in Figure 3. First it uses the procedure CLUSTERS (presented in the form of pseudo-code in Figure 4) to compute the sorted lists (*D*_{1}, ..., *D*_{p}) and (*A*_{1}, ..., *A*_{q}) of core and ambiguous clusters (Lines 1-3) on the given multiset ℳ. As mentioned in the previous section, we then select clusters, one at a time, from these lists to form a consensus of the input trees. We start (Line 4) with a tree *T* that exhibits precisely the *trivial* clusters on ℳ, that is, the clusters containing a single label in ℳ (note that these clusters are exhibited by all of the input trees). Then we construct the backbone tree using the list of core clusters (Lines 5-6), and then add ambiguous clusters to the backbone tree (Lines 7-14). The output is a MUL-tree selected from the resulting collection of MUL-trees. An output tree from this collection can either be selected by the user, or the whole collection of trees can be returned. In the next section we provide a score function to aid with the tree selection process.

**Pseudo-code for the algorithm MULTICONS**. Pseudo-code for our algorithm that computes a consensus MUL-tree of the input MUL-trees.

**Pseudo-code for the procedure CLUSTERS**. Pseudo-code for the procedure that computes the sorted list of the non-trivial core and ambiguous clusters, respectively.

To conclude the description of our algorithm we describe the details of the computation of the core and ambiguous clusters as presented in Figure 4. First, for each input MUL-tree *T*_{i}, 1 ≤ *i* ≤ *l*, we compute the collection 𝒞_{i} of non-trivial clusters that are exhibited by *T*_{i} (Lines 1-4). If an input MUL-tree exhibits a cluster *C* several times, e.g. tree *T*_{1} in Figure 1 exhibits the cluster {*a*_{1}, *a*_{2}, *a*_{3}} twice, we include in 𝒞_{i} the corresponding number of copies of *C* and distinguish them by recording the branch in *T*_{i} that gives rise to them.

Next we combine the collections 𝒞_{1}, ..., 𝒞_{l} into a set ℬ of all clusters that arise from the branches of the input MUL-trees without taking multiple copies of the same cluster into account (Lines 5-6). Then, using the threshold *t*, for each cluster *C* in ℬ the number *n*(*C*) is computed, which is the largest number of copies of *C* such that at least *t* of the collections 𝒞_{1}, ..., 𝒞_{l} contain that many copies of *C* (Lines 8-12). The clusters *C* in ℬ with *n*(*C*) > 0 are then partitioned into core clusters and ambiguous clusters (Lines 13-17).

The core clusters are collected together in the set 𝒟 (Line 15). Note that if *D* is a core cluster then *n*(*D*) ≤ 1 holds. In contrast, for an ambiguous cluster *A* one can have *n*(*A*) > 1, and so we record the numbers of copies of *A* that we might be able to accommodate in the consensus tree in the form of pairs (*A*, 1), ..., (*A*, *n*(*A*)), denoting the resulting set of pairs by 𝒜 (Line 17). The core clusters *D* in 𝒟 are then sorted decreasingly according to the number of collections among 𝒞_{1}, ..., 𝒞_{l} in which cluster *D* is contained (Lines 18-19), where ties are broken arbitrarily. This yields a sorted list (*D*_{1}, ..., *D*_{p}) of core clusters. Similarly, the pairs (*A*, *m*) in 𝒜 are sorted decreasingly according to the number of collections among 𝒞_{1}, ..., 𝒞_{l} that contain at least *m* copies of *A* (Lines 20-22). Again ties are broken arbitrarily. This yields a sorted list ((*A*_{1}, *m*_{1}), ..., (*A*_{q}, *m*_{q})) of the pairs in 𝒜 from which the sorted list (*A*_{1}, ..., *A*_{q}) of ambiguous clusters is extracted, with some clusters possibly occurring more than once in this list.
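The counting and sorting steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the CLUSTERS procedure itself: the function names are hypothetical, each cluster is assumed to come with a list giving the number of copies exhibited by each input tree, and the copy counts in the example are invented for illustration.

```python
def supported_copies(copies_per_tree, t):
    """Largest k such that at least t trees exhibit >= k copies of a cluster.

    This equals the t-th largest entry of the per-tree copy counts.
    """
    ranked = sorted(copies_per_tree, reverse=True)
    return ranked[t - 1] if 1 <= t <= len(ranked) else 0


def rank_pairs(copies, t):
    """Build the pairs (cluster, m) for m = 1..supported_copies and sort them
    by the number of trees with at least m copies (ties keep input order)."""
    pairs = []
    for cluster, per_tree in copies.items():
        for m in range(1, supported_copies(per_tree, t) + 1):
            support = sum(1 for c in per_tree if c >= m)
            pairs.append((support, cluster, m))
    pairs.sort(key=lambda p: -p[0])      # stable sort, decreasing support
    return [(cluster, m) for _, cluster, m in pairs]


# Invented copy counts for six input trees.
copies = {
    ("O", "S"): [2, 2, 2, 1, 0, 0],
    ("I", "O", "S"): [1, 1, 1, 1, 0, 0],
}
ranking = rank_pairs(copies, t=3)
print(ranking)
```

With a threshold of 3, two copies of the first cluster and one copy of the second survive, and the second copy of the first cluster is ranked last because fewer trees support it.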

The run time of our algorithm can be bounded in terms of the number *l* of input trees, the sum *m* of the multiplicity of all elements in ℳ, and the sum *d* of the multiplicity of all elements in ℳ except those that occur with multiplicity 1. For example, for the multiset labeling the tree in Figure 2(a) we have *m* = 6, which is the same as the number of leaves of the tree, whereas *d* = 4 because the multiplicity of *a*_{3} and *a*_{4} is not taken into account. Note that, since the number of branches of a tree is linear in the number of its leaves, the total number of clusters (core and ambiguous) is in *O*(*ml*). Hence a straightforward implementation of the procedure CLUSTERS in Figure 4 has a run time in *O*(*m*^{2}*l*^{2}).

Once the lists of clusters have been computed, the main task is to check, for a given tree and a cluster, whether the tree can be adapted to exhibit the additional cluster. Basically we can use the algorithm presented in [22, ch. 5] for this task, which yields a run time for MULTICONS of *O*(*m*^{2}*l*^{2} + *m*^{4}*l* + *d*^{d}*ml*). The first term in this bound comes from the run time of the procedure CLUSTERS. To give the reader some idea how the remaining two terms in the bound arise, note first that checking whether a core cluster can be added in Line 6 of the algorithm MULTICONS can be done by going through the vertices of degree higher than 3 in the tree constructed so far and checking whether they can be resolved to accommodate the additional cluster. Implementing this in a straightforward way, each core cluster can be checked in *O*(*m*^{3}) time and the resulting tree is, as mentioned above, unique. Since there are *O*(*ml*) clusters, this yields the second term.

As for the third term, checking whether an ambiguous cluster can be added to a particular tree in Lines 12-13 can be done in a similar way as outlined above for core clusters. However, the key difference is that there are now *O*(*d*) resulting trees. This implies that the number of trees in the output collection is bounded by *O*(*d*^{d}), in view of the fact that the number of non-trivial ambiguous clusters in a MUL-tree on ℳ is in *O*(*d*) (see [22] for more on this). This leads to the last term in the bound above.

### Pre- and postprocessing

It often happens that the input MUL-trees in a collection are labeled by slightly different multisets (due e.g. to sequencing difficulties or lack of sampling). Hence, we have also developed a simple preprocessing procedure that essentially restricts the input trees to a common multiset ℳ of labels. This procedure employs a majority rule, that is, for every label *a* that appears in at least one of the input trees, the multiplicity of *a* in ℳ is chosen as the largest integer *m* such that *a* appears in at least half of the input trees with multiplicity at least *m*. Note that it is possible that some labels appear in so few input trees that they have multiplicity 0 in ℳ, that is, they will not appear in the consensus tree.
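The majority rule for choosing multiplicities can be illustrated with a short Python sketch. The function name `consensus_multiset` and the three label multisets in the example are hypothetical; "at least half" is interpreted here as at least ⌈*l*/2⌉ of the *l* input trees.

```python
import math
from collections import Counter


def consensus_multiset(tree_label_multisets):
    """For each label, pick the largest m such that at least half of the
    trees contain the label with multiplicity >= m (labels with m = 0 are
    dropped)."""
    l = len(tree_label_multisets)
    need = math.ceil(l / 2)                 # number of trees required
    labels = set().union(*tree_label_multisets)
    result = Counter()
    for a in labels:
        # The need-th largest multiplicity is the largest m supported by
        # at least `need` trees.
        ranked = sorted((ms[a] for ms in tree_label_multisets), reverse=True)
        if ranked[need - 1] > 0:
            result[a] = ranked[need - 1]
    return result


# Hypothetical leaf-label multisets of three input trees (one letter per leaf).
trees = [Counter("AIIOOOZ"), Counter("AIOOZ"), Counter("AAIIOOOXZ")]
M = consensus_multiset(trees)
print(sorted(M.elements()))
```

In this example label X occurs in only one of the three trees, so it receives multiplicity 0 and is left out of the consensus multiset.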

Once we have determined this multiset ℳ it remains to restrict the input trees to ℳ and to compute a consensus of the restrictions. However, the restriction process might involve a choice of which leaves labeled by copies of a certain label should be removed. As the number of possible choices increases rapidly for different labels, potentially leading to a huge number of different restrictions, we avoid deciding which copies of a label to remove as follows. If a cluster contains more copies of a label than the multiset ℳ, these additional copies are removed from the cluster, independently of all other clusters. Note, however, that this might yield two copies *C* and *C'* of a cluster in the collection arising from some input tree *T*_{i} such that the branch *e* that gives rise to *C* lies on the path from the root of *T*_{i} to the branch *e'* that gives rise to *C'*. For example, consider the tree *T*_{1} in Figure 1. Branch *e* gives rise to the cluster {*a*_{1}, *a*_{2}, *a*_{2}, *a*_{3}} and branch *e'* to the cluster {*a*_{1}, *a*_{2}, *a*_{3}}. However, if we restrict *T*_{1} to the multiset {*a*_{1}, *a*_{1}, *a*_{2}, *a*_{3}, *a*_{3}}, say, we remove one copy of *a*_{2} from {*a*_{1}, *a*_{2}, *a*_{2}, *a*_{3}}, and after the restriction both *e* and *e'* give rise to the same cluster {*a*_{1}, *a*_{2}, *a*_{3}}. In this situation the cluster arising from *e'* is not counted as an additional copy; it can rather be viewed as an artifact of the restriction to ℳ and will, therefore, not be included in the collection of clusters computed for *T*_{1}. So, for the tree *T*_{1} in Figure 1 only the two copies of cluster {*a*_{1}, *a*_{2}, *a*_{3}} that arise from branches *e* and *e"* will be taken into account.
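The restriction of clusters to ℳ, including the rule for discarding artifact copies, can be sketched in Python using multiset intersection. The nested-tuple tree encoding, the function names, and the example tree are assumptions made for illustration; the example tree is merely consistent with the clusters the text attributes to *T*_{1} and need not match Figure 1.

```python
from collections import Counter


def leaves(node):
    """Multiset of leaf labels below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return Counter([node])
    total = Counter()
    for child in node:
        total += leaves(child)
    return total


def restricted_clusters(tree, M):
    """Clusters of the tree after restricting each one to the multiset M.

    A branch whose restricted cluster equals that of the branch directly
    above it is skipped, since it is an artifact of the restriction.
    """
    found = []

    def visit(node, above):
        if isinstance(node, str):
            return
        for child in node:
            cluster = leaves(child) & M      # drop surplus copies of labels
            if cluster and cluster != above:
                found.append(tuple(sorted(cluster.elements())))
            visit(child, cluster)

    visit(tree, None)
    return found


# Hypothetical tree and the target multiset {a1, a1, a2, a3, a3}.
example_tree = ((("a1", "a2", "a3"), "a2"), ("a1", "a2", "a3"))
M = Counter({"a1": 2, "a2": 1, "a3": 2})
result = restricted_clusters(example_tree, M)
print([c for c in result if len(c) > 1])
```

Comparing only with the branch directly above is enough here, because restricted clusters along a root-to-leaf path are nested, so equal clusters can only occur on consecutive branches.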

We also developed a postprocessing procedure that scores the MUL-trees in the collection computed by the algorithm MULTICONS, to deal with the fact that this collection could be quite large. The basic idea for the scoring is to estimate the number of allopolyploidization events that are implicitly hypothesized by a MUL-tree. To do this, we use an algorithm presented in [17], called MULTIBUILD, to compute for each MUL-tree *T* in this collection a network *N*(*T*) representing *T*. We then use the number of allopolyploidization events hypothesized by *N*(*T*) to score each MUL-tree *T*, since *N*(*T*) can be viewed as a most parsimonious representation of *T* in terms of such events. In practice, we have found it best to take all possible refinements of *T* to bifurcating trees and to calculate the minimum number of allopolyploidization events in the networks obtained for those refinements as the score of *T*. We suspect that this is because MULTIBUILD is only guaranteed to find an optimal network with respect to the number of allopolyploidization events if the MUL-tree is binary.

### Applications

We illustrate the application of our method, along with some of the complexities involved in constructing a consensus of MUL-trees, using two data sets of MUL-trees of the flowering plant genus *Silene *(Caryophyllaceae). These examples were computed using the implementation of the algorithm in the PADRE package. This takes as input a collection of trees in NEWICK format [23], and displays the resulting consensus tree, which can also be saved as a file in encapsulated postscript (eps) or NEWICK format.

The first collection of MUL-trees we apply our algorithm to is depicted in Figure 5. We chose this example because it is small enough to easily follow the workings of the algorithm. The labels represent *Silene* species, namely, the diploids *S. ajanensis* (*A*) and *S. uralensis* (*U*), the tetraploid *S. involucrata* (*I*), and the two hexaploids *S. sorensenis* (*S*) and *S. ostenfeldii* (*O*). All trees are rooted at *S. zawadskii* (*Z*). They are restrictions of the larger gene trees published in [12] to the species *A*, *U*, *I*, *S*, *O* and *Z*.

The gene trees in [12] are reconstructed using standard techniques in phylogenetic analysis from regions of the nuclear RNA polymerase (RNAP) gene family (RPB2, RPA2, RPD2a, and RPD2b, Figure 5(a)-(d)), two concatenated chloroplast regions (*rps16* intron and the *psbE*/*petG* spacer, Figure 5(e)), and one nuclear ribosomal region (ITS1 and ITS2, with the intervening 5.8S gene, Figure 5(f)). Although all 6 gene trees in Figure 5 may be viewed as MUL-trees, it should be noted that only the four trees on the RNAP genes (Figure 5(a)-(d)) are true MUL-trees. For the chloroplast regions, this is because chloroplasts are maternally inherited and harbor a haploid genome. A MUL-tree constructed for such regions is therefore a phylogenetic tree in the usual sense. For the nuclear ribosomal DNA the reason is different: although these regions constitute a very large multigene family, its members are kept identical or very similar by concerted evolution. Therefore, traces of hybridization events are quickly eradicated (e.g. [4]). As a consequence, nuclear ribosomal DNA can behave similarly to a haploid, uniparental locus.

When we apply our algorithm to the input MUL-trees in Figure 5, we first apply the preprocessing procedure to compute a multiset ℳ to which we restrict the input trees, yielding ℳ = {*A, I, I, O, O, O, S, S, S, U, Z*}. Using this multiset and the default value of ⌈*l*/2⌉ = 3 for the threshold *t* we obtain the 4 non-trivial core clusters

and the non-trivial ambiguous clusters {*O, S*} (two copies) and {*I, O, S*} (one copy) to build a consensus tree. Note that, although the cluster {*A, I, O, O, S, S*} is generated twice when breaking *T*_{2 }into clusters (since it is exhibited by *T*_{2 }and also results from restricting the exhibited cluster {*A, A, I, O, O, S, S*} to ℳ by removing one copy of label *A*), it is taken into account only once. Also note that the choice of the threshold *t *implies, for example, that even though {*A, I, O, S*} is a core cluster it is not taken into consideration for constructing the backbone tree as this cluster is only exhibited by a single input tree, namely the tree in Figure 5(f).

The backbone tree constructed from the 4 selected core clusters is depicted in Figure 6(a). Adding in the ambiguous clusters results in 3 semiresolved consensus MUL-trees, one of which we depict in Figure 6(b). When applying the scoring procedure, by constructing a reticulate network with MULTIBUILD for all 32 distinct refinements of these 3 trees to bifurcating trees, we find a single refined MUL-tree with minimum score, which is depicted in Figure 6(c). Note that this tree was also constructed by the *ad hoc* method mentioned above in [12]. The reticulate network computed for this MUL-tree is depicted in Figure 6(d). It postulates 2 consecutive allopolyploidization events, the first one resulting in the tetraploid *S. involucrata* and the second one leading to the two hexaploids *S. sorensenis* and *S. ostenfeldii*.

**Output for first example**. (a) Backbone tree using the default value for threshold *t*. (b) One of the three MUL-trees obtained by adding ambiguous clusters to the backbone tree. (c) A possible resolution of the tree in (b) to a bifurcating tree. (d) The ...

The second collection of MUL-trees we applied our algorithm to is depicted in Figure 7. This collection is more complex than the previous one since it involves more species and the trees are much more unresolved. The additional *Silene* species appearing in the trees (again represented by their label) are: *S. linnaeana* (*L*), *S. uralensis* (Mongolia) (*U M*), *S. samojedora* (*SAM*), and *S. villosula* (*V*), which are all diploid, and *S. sachalinensis* (*SAC*) and *S. tolmatchevii* (*T*), whose chromosome numbers are unknown but, in view of the number of RNAP gene copies found, are likely to be tetraploids. It should be noted that in contrast to the trees in Figure 5, the four MUL-trees in Figure 7 were reconstructed solely from RNAP gene families (i.e. RPB2 (a), RPA2 (b), RPD2a (c), and RPD2b (d)). As before, all MUL-trees are rooted at *S. zawadskii* (*Z*).

**Input trees for second example**. The second collection of MUL-trees we apply our algorithm to, involving additional *Silene* species.

The multiset of labels constructed by the preprocessing procedure is ℳ = {*A, I, I, L, O, O, O, S, S, S, SAC, SAC, SAM, T, T, U, U M, V, Z*}. Using this multiset a collection of 15 non-trivial clusters is derived from the input trees, of which 12 are core clusters and 3 are ambiguous clusters. We employed a threshold of *t *= 1, as the input trees are very unresolved and larger thresholds yield only a small number of non-trivial clusters to form a consensus tree. In Figure 8(a) we depict the unique backbone tree constructed from 10 of the non-trivial core clusters. Adding ambiguous clusters to this tree results in 6 semiresolved consensus MUL-trees one of which we depict in Figure 8(b). By exhaustively searching through the set of all 885 refinements of these 6 trees, we find that only 9 of them give rise to a reticulate network with the minimum number of 4 hypothesized allopolyploidization events. In Figure 8(c), we depict one of them and in Figure 8(d) we depict the corresponding reticulate network. Note that this network agrees with the network presented in Figure 6(d) when restricted to the *Silene *species in the first collection. In addition two further allopolyploidization events are hypothesized, suggesting that *S. sachalinensis *and *S. tolmatchevii *are tetraploids.

## Conclusion

In this paper, we have presented a new algorithm for constructing a consensus MUL-tree(s) from a collection of MUL-trees, and illustrated its applicability using two examples. Both consisted of collections of gene trees that were constructed from sequence data of polyploid plants, including biparentally informative sequences. In both cases, we have also obtained networks that provide scenarios for how the plants evolved.

As a preprocessing procedure we provide a way to deal with the situation that some input trees might have missing or additional leaf labels. A key task in this context is to determine the multiset of labels that should appear in the consensus tree. The simplest possible approach would be to just take the union of the multisets over all input trees, that is, every label has the maximum multiplicity with which it occurs in an input tree. However, in practice we found that this tended to lead to an overestimation of the multiplicity of some labels, hence our use of a majority rule procedure. Even so, our approach is still rather simple in that it is only likely to work well in case the number of additional or missing leaf labels is small since otherwise too much information is lost. To circumvent this problem one might try to develop supertree methods for MUL-trees, although we expect that this task would be quite challenging in the light of the fact that many versions of the supertree problem are hard even for collections of phylogenetic trees (see e.g. [24]). In this vein, it might also be of interest to explore the possibility of constructing consensus- or super-networks [25].

The basic idea for our algorithm, that is, breaking the input trees into clusters and then combining some of these clusters to form a consensus tree, seems to yield good results if the input trees are not too unresolved and there are enough clusters that are exhibited by many input trees. However, in some circumstances, the greedy construction involves a random choice of which clusters, exhibited by the same number of input trees, should be added next. In view of this, a more canonical approach to selecting clusters could be desirable. This might be achieved by generalizing, for example, the majority rule consensus approach [18] to MUL-trees. Even so, the results in [22] imply that as the total multiplicity *d *of those labels that appear with multiplicity greater than 1 grows, the majority rule consensus tree will resemble more and more the strict consensus tree, which tends to be very unresolved. Our algorithm tries to address this issue by allowing the user to explore how being strict (large threshold *t*) or generous (small threshold *t*) affects the resulting consensus trees. In addition, there is also the option to explore whether further clusters exhibited by less than *t *input trees could still be added into the consensus tree at the end. In future work, it could also be interesting to try and generalize non-cluster-based approaches for computing a consensus of phylogenetic trees as described in [18] (e.g. by recoding the MUL-trees in some way).

We employ a postprocessing procedure to score the resulting trees, but as this potentially involves refining trees to binary trees, it has a worst case run time that is exponential in the number of leaves of the resulting trees (although the score for a single refined tree can be computed by the algorithm MULTIBUILD in polynomial time). Therefore, despite working quite well for the examples we have considered, it is likely to be limited to rather small problem instances. Moreover, the number of trees with optimal score can be quite high, especially when the input trees are very unresolved. Even though multiple optimal solutions are not uncommon in phylogenetics (e.g. there can be several most parsimonious trees [26]), it could still be of interest to develop ways to systematically select specific optimal trees. For example, alternative score functions could be developed that take into account how the clusters are arranged in the input MUL-trees or, if available, branch length information.

The parameter that seems to have the biggest impact on the run time is the total multiplicity *d *of those labels that appear with multiplicity greater than 1 in many input trees. Even though the theoretical worst case run time of our algorithm increases exponentially with this parameter (which is to be expected due to the inherent computational complexity involved in computing a consensus of MUL-trees), for both examples presented above the run time was only a few seconds on a modern desktop computer.

In view of recent advances in DNA sequencing technologies (e.g. [27]), we anticipate that many more data sets will soon become available giving rise to collections of MUL-trees. The algorithm proposed in this paper will hopefully provide a useful new tool for analyzing such collections.

## Authors' contributions

All authors contributed to the ideas and the development of the algorithm which was implemented by ML. AP and BO provided the biological data set and ensured the biological relevance of the paper. Every author contributed to the writing of the paper.

## Acknowledgements

The authors would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, UK for hosting them in the context of the Phylogenetics Program where part of the work presented in this paper was carried out. AS was supported by the Engineering and Physical Sciences Research Council [grant number EP/D068800/1]. KTH, ML, VM and AS were supported by the British Council and the DAAD within the ARC Programme. BO was supported by a grant from the Swedish Research Council. ML, KTH and VM would like to thank Oxelman for inviting them to Gothenburg/Uppsala.

## References

1. Sexton OJ. Polyploidy in animal evolution: summary. Basic Life Sci. 1979;13:379–381.
2. Otto SP, Whitton J. Polyploid incidence and evolution. Annu Rev Genet. 2000;34:401–437. doi: 10.1146/annurev.genet.34.1.401.
3. Slotte T, Huang H, Lascoux M, Ceplitis A. Polyploid speciation did not confer instant reproductive isolation in Capsella (Brassicaceae). Mol Biol Evol. 2008;25:1472–1481. doi: 10.1093/molbev/msn092.
4. Alvarez I, Wendel JF. Ribosomal ITS sequences and plant phylogenetic inference. Mol Phylogenet Evol. 2003;29:417–434. doi: 10.1016/S1055-7903(03)00208-2.
5. Smedmark JEE, Eriksson T, Evans RC, Campbell CS. Ancient allopolyploid speciation in Geinae (Rosaceae): evidence from nuclear granule-bound starch synthase (GBSSI) gene sequences. Syst Biol. 2003;52:374–385.
6. Grundt HH, Popp M, Brochmann C, Oxelman B. Phylogenetic relationships among the lineages leading to the allopolyploid Draba lactea Adams (Brassicaceae). Mol Phylogenet Evol. 2004;32:695–710. doi: 10.1016/j.ympev.2004.04.006.
7. Pfeil BE, Brubaker CL, Craven LA, Crisp MD. Paralogy and orthology in the Malvaceae rpb2 gene family: investigation of gene duplication in Hibiscus. Mol Biol Evol. 2004;21:1428–1437. doi: 10.1093/molbev/msh144.
8. Smedmark JEE, Eriksson T, Bremer B. Allopolyploid evolution in Geinae (Colurieae: Rosaceae) - building reticulate species trees from bifurcating gene trees. Org Divers Evol. 2005;5:275–283. doi: 10.1016/j.ode.2004.12.003.
9. Brysting AK, Oxelman B, Huber KT, Moulton V, Brochmann C. Untangling complex histories of genome mergings in high polyploids. Syst Biol. 2007;56:467–476. doi: 10.1080/10635150701424553.
10. Popp M, Oxelman B. Inferring the history of the polyploid Silene aegaea (Caryophyllaceae) using plastid and homoeologous nuclear DNA sequences. Mol Phylogenet Evol. 2001;20:474–481. doi: 10.1006/mpev.2001.0977.
11. Linder C, Rieseberg LH. Reconstructing patterns of reticulate evolution in plants. Am J Bot. 2004;91:1700–1708. doi: 10.3732/ajb.91.10.1700.
12. Popp M, Erixon P, Eggens F, Oxelman B. Origin and evolution of a circumpolar polyploid species complex in Silene (Caryophyllaceae) inferred from low copy nuclear RNA polymerase introns, rDNA, and chloroplast DNA. Syst Bot. 2005;30:302–313. doi: 10.1600/0363644054223648.
13. Popp M, Oxelman B. Origin and evolution of North American polyploid Silene (Caryophyllaceae). Am J Bot. 2007;94:330–349. doi: 10.3732/ajb.94.3.330.
14. Cronn RC, Small RL, Haselkorn T, Wendel JF. Rapid diversification of the cotton genus (Gossypium: Malvaceae) revealed by analysis of sixteen nuclear and chloroplast genes. Am J Bot. 2002;89:707–725. doi: 10.3732/ajb.89.4.707.
15. Doyle JJ, Doyle JL, Rauscher JT, Brown AH. Evolution of the perennial soybean polyploid complex (Glycine subgenus Glycine): a study of contrasts. Biol J Linn Soc. 2004;82:583–597. doi: 10.1111/j.1095-8312.2004.00343.x.
16. Huber KT, Moulton V. Phylogenetic networks from multi-labelled trees. J Math Biol. 2006;52:613–632. doi: 10.1007/s00285-005-0365-z.
17. Huber KT, Oxelman B, Lott M, Moulton V. Reconstructing the evolutionary history of polyploids from multilabeled trees. Mol Biol Evol. 2006;23:1784–1791. doi: 10.1093/molbev/msl045.
18. Bryant D. A classification of consensus methods for phylogenetics. In: Janowitz MF, Lapointe FJ, McMorris FR, Mirkin B, Roberts FS, editors. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. Vol. 61. American Mathematical Society; 2003. pp. 163–184.
19. Lott M, Spillner A, Huber KT, Moulton V. PADRE: a package for analyzing and displaying reticulate evolution. Bioinformatics. 2009;25:1199–1200. doi: 10.1093/bioinformatics/btp133.
20. Semple C, Steel M. Phylogenetics. Oxford: Oxford University Press; 2003.
21. Buneman P. The recovery of trees from measures of dissimilarity. In: Hodson F, et al., editors. Mathematics in the Archaeological and Historical Sciences. Edinburgh University Press; 1971. pp. 387–395.
22. Huber KT, Lott M, Moulton V, Spillner A. The complexity of deriving multi-labeled trees from bipartitions. J Comput Biol. 2008;15:639–651. doi: 10.1089/cmb.2008.0088.
23. Archie J, Day WHE, Felsenstein J, Maddison W, Meacham C, Rohlf FJ, Swofford D. The Newick tree format. 1986. http://evolution.genetics.washington.edu/phylip/newicktree.html
24. Bininda-Emonds O, editor. Phylogenetic supertrees: combining information to reveal the Tree of Life. Dordrecht: Kluwer Academic Publishers; 2004.
25. Huson D, Dezulian T, Klöpper T, Steel M. Phylogenetic super-networks from partial trees. IEEE ACM Trans Comput Biol Bioinformatics. 2004;1:151–158. doi: 10.1109/TCBB.2004.44.
26. Felsenstein J. Inferring phylogenies. Sinauer Associates; 2003.
27. Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet. 2008;24:142–149.
