# The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation

^{†}Present address: Section of Integrative Biology PAT CO930, University of Texas, Austin, TX 78712.

^{‡}To whom reprint requests should be addressed. E-mail: eb.ca.blu@knilimcm.

## Abstract

Large phylogeny estimation is a combinatorial optimization problem that no future computer will ever be able to solve exactly in practical computing time. The difficulty of the problem is amplified by the need to use complex evolutionary models and large taxon samplings. Hence, many heuristic approaches have been developed, with varying degrees of success. Here, we report on a heuristic approach, the metapopulation genetic algorithm, involving several populations of trees that are forced to cooperate in the search for the optimal tree. Within each population, trees are subjected to evaluation, selection, and mutation events, which are directed by using inter-population consensus information. The method proves to be both very accurate and vastly faster than existing heuristics, such that data sets comprised of hundreds of taxa can be analyzed in practical computing times under complex models of maximum-likelihood evolution. Branch support values produced by the metapopulation genetic algorithm might closely approximate the posterior probabilities of the corresponding branches.

**Keywords:**heuristics, evolutionary computation, maximum likelihood

Optimality criterion-based phylogeny inference is a notoriously difficult endeavor because the number of solutions increases explosively with the number of taxa. Indeed, the total number of possible unrooted, bifurcating tree topologies among T-terminal taxa is

corresponding to nearly 32 billion different trees for 14 taxa and 3 × 10^{84} trees (i.e., more than the number of atoms in the known universe) for 55 taxa. Phylogeny inference is a combinatorial optimization problem of nondeterministic polynomial (NP)-complete type (1, 2) because (*i*) no known algorithm can solve it in polynomial time, and (*ii*) demonstrating the existence of such an algorithm would imply that all NP problems have a polynomial time solution. As most mathematicians expect that no such algorithm exists, one is forced to admit that no future civilization will ever build a computer capable of solving the problem while guaranteeing that the optimal solution has been found.

Computation time has only recently become a central and widespread practical problem in phylogeny inference for two main reasons. First, major advances in molecular biology and biotechnology have caused a dramatic increase in the number of DNA sequences available, stimulating researchers to increase their taxon sampling when performing tree reconstruction. Incidentally, several authors (e.g., ref. 3) have suggested that the accuracy of phylogeny inference increases with increased taxon sampling. Second, many simulation studies in the last 10 years (e.g., ref. 4) have identified the maximum-likelihood (ML) criterion (5) as one of the best for phylogeny inference. Reasons for this include (*i*) statistical consistency (ML estimators tend to converge to true parameters values when the number of characters is increased), (*ii*) robustness (violations of the ML model assumptions have only a moderate impact on the tree inference accuracy), (*iii*) the ability to compare different trees within a statistical framework, and (*iv*) the ability of ML to make full use of the original character matrix. Unfortunately, the analytical power of ML has a cost: computation time. Hence, application of this model-based criterion makes the use of exact phylogeny inference methods impractical for more than a trivial number of taxa (about 12). Note that (*i*) by ML we mean its most widely implemented version, i.e., maximum average likelihood (which is itself a form of the more general maximum relative likelihood approach; ref. 6), and (*ii*) in addition to optimality criterion-based phylogeny inference being proven NP-complete because of the number of trees, it remains an open question whether maximum average likelihood can be calculated efficiently (i.e., in polynomial time) on a single tree.

As with many NP-complete problems, the only practical approach to large phylogeny inference is the use of heuristics, i.e., methods that find one or several solution(s) faster than exact searches while sacrificing a guarantee that the optimal solution will be found. Given the tremendous number of new questions in evolutionary biology that could be investigated through the use of larger taxon samplings, most researchers are ready to give up the quest for the absolute optimal tree provided that alternative heuristic methods yield optimal or near-optimal solutions with high probability. In response to this trend, much of the current research in computational phylogenetics concentrates on the development of more efficient heuristic approaches.

Many existing heuristic methods are handicapped with the need to optimize model parameters (such as branch lengths) on each tree examined, a procedure that significantly slows computation time. One such heuristic, which is implemented in the most widely used inference packages [paup* (7), phylip (8), etc.], is based on hill climbing: an initial tree [most commonly obtained by using the stepwise addition (StepAdd) algorithm (9)] is subjected to a topological rearrangement known as branch swapping (9), followed by an optimization of branch lengths and other model parameters. If the new tree has a better score than the starting tree, it is kept and used as the new starting tree; otherwise the proposed tree is rejected. The procedure is repeated in a systematic fashion, until no swapping on the current best tree can improve the score. Because of the need to optimize parameters at every step (intra-step optimization), simple branch-swapping methods remain computationally impractical for more than 25–50 taxa, depending on the complexity of the ML model.

Two classes of solutions have been developed in response to the problems associated with optimizing parameters on large trees. The first of these classes includes solutions that partition the large problem of tree reconstruction into many small subproblems whose solutions are then combined into a consensus global solution. For example, the quartet puzzling method (10) works by reconstructing the best ML tree of each possible quartet of taxa, then combining (i.e., puzzling) the quartet trees to construct an overall T-taxon tree. Because the puzzling procedure depends on the order with which the taxa are sequentially inserted, it is typically repeated *n* times (with randomization of the T-taxa input order), and a majority-rule consensus tree is computed from the *n* T-taxa trees. The advantage of quartet puzzling is that it avoids numerous optimizations on large trees, optimizing instead numerous small trees, which is computationally much simpler. Despite the initial appeal, the method has proven to be only moderately more accurate (10) than neighbor joining (11), a pairwise-distance based method that requires significantly less computation time.

The second class is comprised of stochastic heuristics that avoid optimization of numerous trees entirely. Instead, they incorporate methods that allow branch lengths and other model parameters to be optimized as the search proceeds, taking an inter-step optimization strategy. Stochastic simulated annealing (SSA; ref. 12), for example, is based on the simple branch-swapping algorithm described above, but incorporates a method of perturbing branch lengths at each iteration instead of requiring optimization of each potential solution. SSA avoids local optima by accepting changes that decrease the likelihood of the tree with a probability inversely proportional to the reduction in likelihood. Another increasingly popular approach is the Bayesian method (13) based on the Markov chain Monte Carlo (MCMC) algorithm. MCMC-based methods also benefit from avoiding intra-step optimization, although they have a slightly different aim: sampling the distribution of tree space instead of only finding optimal trees.

Finally, Matsuda (14), Lewis (15), and Katoh *et al.* (16) have recently applied the genetic algorithm (GA), a type of evolutionary computation method (17), to phylogeny inference. GAs implement a set of operators that mimic processes of biological evolution such as mutation, recombination, selection, and reproduction. After an initial step of generating a population, the individuals (specific solutions) within that population are (*i*) subjected to mutation and/or recombination and (*ii*) allowed to reproduce with a probability that is a function of their relative fitness value. In the case of a phylogenetic inference problem, individuals are typically composed of topologies and model parameters (e.g., branch lengths, the transition/transversion ratio, rate heterogeneity parameters, etc.) that need to be optimized. A mutation is a stochastic alteration of one component of the individual (e.g., a topological rearrangement, a change in one branch length, or a random modification of a model parameter), and the fitness of an individual is a function of the score for that tree. Because selection retains the changes that improve the value of the optimality function, the mean score of the population of trees improves over time, i.e., across generations.

We report here on a GA, nicknamed the metapopulation GA (metaGA), that vastly improves the speed and efficiency with which ML trees are found (such that nucleotide sequence data sets incorporating hundreds of taxa can be analyzed in practical computing times) and yields a probability index for each branch. We incorporated the metaGA procedure into a computer program, metapiga (phylogeny inference using the metaGA), whose advantages are: (*i*) rapid exploration of search space and identification of optimal trees, (*ii*) identification of multiple optima within a single search, (*iii*) flexible GA structure that allows fine control over both speed and accuracy, (*iv*) production of branch probability indices, and (*v*) a user-friendly interface. Using simulated and real data sets, we investigated the effect of major search parameters on the speed and accuracy of the metaGA procedure. Data sets of 500 taxa and 3,000 nt can be analyzed on a regular computer (1.7-Gh Pentium IV) in 10 h under the Hasegawa-Kishino-Yano (HKY) + rate heterogeneity model. metapiga (version 1.0) is written in the cross-platform-compatible (Windows, Macintosh, Unix/Linux) Java programming language and implements a graphical interface for importing/exporting data, specifying a ML model, monitoring search progress, visualizing trees, and performing statistical analyses of trees generated by the GA. metapiga (and a supporting user's manual) is distributed at http://dbm.ulb.ac.be/ueg.

## Materials and Methods

### ML Models and the Basic GA Cycle of Evaluation-Selection-Mutation.

metapiga implements the HKY nucleotide substitution model (18), as well as the nested Jukes-Cantor (JC) (19), K2P (20), and F81 (5) models. More general models of nucleotide substitution (e.g., general time-reversible) will be incorporated into future versions of metapiga. The transition/transversion ratio is optimized periodically during the search with a frequency specified by the user. Equilibrium base frequencies remain constant throughout the search and are either specified by the user or estimated from the data set. Rate heterogeneity is incorporated through a method similar to that described by Van de Peer *et al.* (21), which is based on corrected distance estimation. Use of this method allows estimation of rate heterogeneity parameters once, before the start of the search. The GA cycle of evaluation-selection-mutation implemented in metapiga is described in the supplemental *Materials and Methods* and *Results and Discussion*, which are published as supporting information on the PNAS web site, www.pnas.org.

### The MetaGA.

#### The essence of the procedure.

We designed a family of heuristic search strategies, the metaGAs. This approach relies on the coexistence of two or more populations interacting in a metapopulation setting. Both the number of populations and the number of individuals per population are specified by the user. As the metaGA involves several parallel searches, a large amount of inter-population variation can be maintained, even when each population is subjected to strong selection pressures. However, the spirit of the metaGA is that the populations are not fully independent as they are forced to cooperate in the search for the optimal tree. Within each population, trees are subjected to evaluation, selection, and mutation events as they would be in a single-population GA (see supporting *Materials and Methods*), but all topological operators are guided through inter-population comparisons. The metaGA is based on the assumption that these comparisons allow the identification of partitions that are correct (and should not be modified) and regions that still need to be resolved. Communication among the populations is defined and controlled through a principle we named consensus pruning (CP), i.e., consensus information gathered from comparison of the best trees across populations is used to identify the portions of each tree that can be subjected to mutations (Fig. (Fig.1).1). The concept of CP allows the elaboration of many specific inter-population communication procedures of which we chose to implement six: random, ring, alternate ring, strict group consensus, majority group consensus, and probability group consensus. When CP is set to random during a metaGA search, each of the P populations is picked in turn and randomly paired with one of the remaining P-1 populations. The consensus branches between the two trees define the partitions that cannot be affected by topological mutations. For example, if a consensus branch defines a partition between groups I and II, no taxa swap between a taxon in group I and a taxon in group II is allowed, whereas topological changes within group I and within group II are allowed (Fig. (Fig.1).1). Under the ring option, populations are ordered in a circle and each population *p* is paired with the next population in line (*p* + 1), with the *P*th population paired with population 1. Alternate ring is the same as ring except that the direction of pairing switches every *G* generations (*G* defined by the user). Under strict group consensus and majority group consensus, the partitions that cannot be affected by topological mutations are defined as the branches exhibited by the strict consensus and the 50% majority consensus, respectively, among the best trees from all populations. Finally, under probability group consensus, each partition is enforced with a probability proportional to the percentage of populations that agree on that partition. Variations of the metapopulation concept, such as local extinction and recolonization, may also prove to be effective, although we have yet to explore such approaches. To our knowledge, the single heuristic that could be considered related to the metaGA concept is the recently developed hitchhiking strategy (21) where a set of hitcher trees are allowed (when possible) to experience the same perturbation as a driver tree. Hitchhiking is based on the assumption that identical perturbations could gradually drive different temporary solutions toward the same global optimum, whereas the metaGA assumes that topological consensus among different temporary solutions allows the identification of unacceptable perturbations. In hitchhiking, one solution partially drives the others, whereas in the metaGA all solutions constantly check each other.

#### Stopping rule.

As the overall number of consensus partitions increases with search time, the metaGA procedure provides a convenient stopping rule: the search stops when (*i*) the best trees of all populations have identical topologies, or (*ii*) when all topological changes allowed by the latest consensus information have been attempted on the best trees of all populations and these changes did not improve their scores. Given the high percentage of partitions that are fixed toward the end of the search, swapping to completion is swift even on very large trees. This stopping condition option is called GA decides. Other stopping rules available in metapiga are: stop when the user hits the stop button (user decides option), stop when the number of generations predefined by the user has been reached (fixed-generation option), stop when no topological change has been accepted by the GA in the last *n* generations (topol improvement option; *n* defined by the user), or stop when a score is obtained that is better than that specified by the user (target score option).

After the stopping case has been reached, branch lengths are optimized by the Newton–Rhapson method (23). All parameters specific to the chosen nucleotide substitution model are also optimized at this point. Once optimized, the best tree from each population can be stored, viewed, and exported. We also provide in metapiga the option to keep in memory the *n* trees (*n* specified by the user) with the best scores from each population's history.

## Results and Discussion

We first defined a set of default search settings (supporting *Materials and Methods*): one population of four individuals; selection option = improve, probability of branch length mutation = 0.04; probability of each branch swapping operator [subtree pruning-regrafting (SPR), nearest-neighbor interchange (NNI), taxa swap, subtree swap) = 0.24; probability of recombination = 0; dynamic operators disabled; CP disabled; starting trees = random; stopping rule = GA decides; JC model (i.e., the simplest ML model). We chose to use the JC model for testing the dynamics of the metaGA because it would allow for the most rapid searches, allowing us to test more parameters of the metaGA. All results reported below aim at systematically testing the effect, on both computing time and accuracy of topology inference, of varying one or several of these search parameters. These results are compared with the performances of algorithms implemented in two widely used software packages (7, 24): MRBAYES 2.01 and PAUP*4.b8. We are fully aware of the difficulties associated with comparing runtimes of algorithms implemented in different software packages. However, the reader cannot know the relative efficiency of the methods described above unless some comparisons are made. Two things are done to assure that the most appropriate/useful comparisons are made. First, all analyses (METAPIGA 1.0, MRBAYES 2.01, and PAUP*4.b8) were performed on the same computer running WINDOWS 2000 (single 1.7-Gh Pentium IV processor with 1 Gb of RAM). Second, we compare the efficiency of our method with methods that are most commonly used; hence, we chose to use default search settings. Analyses were performed on simulated data sets comprised of 1,000 nt and either 20, 40, 80, 160, or 320 taxa. All data sets were generated by using dna-sim, a program written by A.R.L. and based on a birth-and-death process (6, 25) (JC; speciation rate = 10^{−4}; extinction rate = 10^{−5}; mutation rates = 9.0 × 10^{−3}, 6.0 × 10^{−3}, 3.5 × 10^{−3}, 2.5 × 10^{−3}, 2.0 × 10^{−3} for 20, 40, 80, 160, and 320 taxa, respectively). We also tested the efficiency of our metaGA by using a 55-taxa *rbc*L data set of 1,314 nt (15).

### GA (i.e., CP Disabled), MetaGA (i.e., CP Enabled), and Other Heuristics.

Fig. Fig.22 indicates, both for the JC and HKY + rate heterogeneity (four categories + invariable sites) models, that the time required by a GA search (one population run to completion) is much less than that required by StepAdd. The use of the metaGA (i.e., CP with strict group consensus and four populations) reduces run time even further. The metaGA is up to seven times faster than StepAdd when using a simple model (JC) while it is over 800 times faster when considering a more realistic model of evolution (here, HKY + rate heterogeneity). Note that (*i*) much of the improvement seen under the complex models comes from taking an inter-step optimization strategy, (*ii*) the striking improvement in speed of the metaGA and GA over more classical search strategies is not at the expense of lower accuracy (see below), and (*iii*) the ratios of StepAdd to GA and StepAdd to metaGA run times increase with increasing number of taxa (JC, Fig. Fig.22*b*). As the StepAdd algorithm yields a tree that is typically used as the starting point of a branch swapping search, we also show in Fig. Fig.22*a* the times required for single rounds of NNI, SPR, and tree bisection-reconnection (TBR) branch swapping (9), i.e., the time required to swap to completion the single best ML tree. A typical heuristic run would comprise several rounds of SPR or TBR swapping on a starting tree obtained by StepAdd. Even if a reasonable starting tree could be obtained instantaneously, the minimum time required to TBR- or SPR-swap that tree would vastly exceed the time required by a full GA or metaGA search.

The efficiency of CP can be attributed to two factors. First, it allows the stopping rule to be reached more quickly (asterisks in Fig. Fig.33*a*): near the end of the search, CP-constrained topological mutations affect a greatly reduced number of branches, hence swapping to completion is much faster than it would be on an unconstrained tree. Second, the consensus information shared among parallel populations allows them to increase their scores and the number of consensus partitions faster than if they were each searching in isolation (Fig. (Fig.33 *b* and *c*). Hence, despite the fact that a four-population metaGA search requires evaluation of four times more trees each generation than in a single GA search, the former completes the search much faster than the latter (Fig. (Fig.33*a*).

### Accuracy of the MetaGA and Effect of the Number of Populations.

Fig. Fig.44 indicates that computing time increases with the number of populations involved in CP. As many as 8–12 populations are required to slow down the metaGA to computing times similar to those of single-population GA runs. However, as Fig. Fig.44*b* indicates, two-population metaGA searches are not optimal as they yield trees with 2–4% error (in comparison with the ML score) for these idealized data sets. In this case, uninterrupted interaction between the same two populations does not allow CP to differentiate between informative and random consensus, such that the populations rapidly converge toward each other, terminating the search prematurely. This negative effect decreases dramatically with three populations and virtually disappears for runs using more than three populations (Fig. (Fig.44*b*): the likelihood that all populations evolve the same suboptimal partitions decreases very rapidly with the number of populations. In all analyses performed so far (data not shown), four-population metapiga runs seem to give the best compromise between computing speed and accuracy of topology inference. Fig. Fig.44 indicates that the majority group consensus option of CP yields the worse scores without being particularly fast whereas the strict group consensus option is the most conservative, i.e., it yields the best scores but is also the slowest. The optimal combination of maximum speed and accuracy is probability consensus with four populations.

### MetaGA Branch Support Values.

Under CP, increasing the number of populations decreases the interdependence among the populations within a single run. Moreover, we would expect there to be a direct relationship between the proportion of populations that agree on a particular branch and the strength of support for that branch given the data set. Hence, it is conceivable that (*i*) a metaGA search with an infinite number of populations would produce the posterior probability distribution of possible trees, and (*ii*) a metaGA search with a finite number of populations would provide an estimate of this distribution. If both assumptions are correct, a set of multiple metaGA searches would produce trees and clades with frequencies that closely approximate their posterior probabilities, making the results of such a metaGA comparable to those provided by Bayesian approaches (13, 24). We tested this conclusion by comparing metaGA branch support values to bootstrap values as well as to posterior probabilities of clades produced by the program mrbayes. Using an *rbc*L data set (15) comprised of 55 taxa and 1,314 bp, we performed 10 independent metaGA searches (strict CP among 10 populations) and computed the majority-rule consensus tree among the resulting 100 best trees (one tree per population). Remarkably, plotting metaGA support values against bootstrap proportions (100 replicates, with SPR branch swapping on a StepAdd starting tree) uncovers the same bias of bootstrap values (Fig. (Fig.55*a*) as that described with laboratory-generated phylogenies and simulations (26). This result suggests that every metaGA branch support value might efficiently approximate the probability of the corresponding clade being correct. Furthermore, Fig. Fig.55*b* indicates that, under the specific conditions tested at least, there is a direct correspondence between frequencies of clades obtained under the metaGA and their posterior probabilities estimated with mrbayes (*n* chains = 4, temperature = 0.05, sample frequency = 100, burn in = 25,000 steps, 5,000 samples taken). These preliminary results suggest that the frequencies with which trees and clades are sampled by using the metaGA might correspond to unbiased estimates of their posterior probabilities. Furthermore, given the results outlined in Fig. Fig.4,4, the precision of the posterior probability estimator might be a function of metapopulation size.

## Acknowledgments

We are grateful to Patrick Mardulyn and two anonymous reviewers for helpful comments on a previous version of this manuscript. This work was supported by the National Fund for Scientific Research Belgium (FNRS), the Free University of Brussels (ULB), the “Communauté Française de Belgique” (Grant ARC 98/03-223), and the Van Buuren, Brachet, and Defay Funds (Belgium).

## Abbreviations

- ML, maximum likelihood
- GA, genetic algorithm
- metaGA, metapopulation GA
- HKY, Hasegawa-Kishino-Yano
- JC, Jukes-Cantor
- CP, consensus pruning
- SPR, subtree pruning-regrafting
- NNI, nearest-neighbor interchange
- TBR, tree bisection-reconnection
- StepAdd, stepwise addition

## Notes

This paper was submitted directly (Track II) to the PNAS office.

## References

**,**133-142.

**,**130-131. [PubMed]

**,**17-48.

**,**368-376. [PubMed]

**,**839-850. [PubMed]

**,**964-969.

**,**406-425. [PubMed]

**,**7-17. [PubMed]

**,**2310-2314. [PubMed]

**,**277-283. [PubMed]

**,**477-484. [PubMed]

**,**428-436. [PubMed]

**,**160-174. [PubMed]

**,**111-120. [PubMed]

**,**221-232. [PubMed]

**,**79-91. [PubMed]

**,**93-104. [PubMed]

*Biometrics*, in press.

**,**1-15.

**,**182-192.

**National Academy of Sciences**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (510K) |
- Citation

- MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics.[BMC Bioinformatics. 2010]
*Helaers R, Milinkovitch MC.**BMC Bioinformatics. 2010 Jul 15; 11:379. Epub 2010 Jul 15.* - A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data.[Mol Biol Evol. 1998]
*Lewis PO.**Mol Biol Evol. 1998 Mar; 15(3):277-83.* - NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods.[Mol Biol Evol. 2000]
*Ota S, Li WH.**Mol Biol Evol. 2000 Sep; 17(9):1401-9.* - Phylogenetic analyses of parasites in the new millennium.[Adv Parasitol. 2006]
*Morrison DA.**Adv Parasitol. 2006; 63:1-124.* - Network random keys - a tree representation scheme for genetic and evolutionary algorithms.[Evol Comput. 2002]
*Rothlauf F, Goldberg DE, Heinzl A.**Evol Comput. 2002 Spring; 10(1):75-97.*

- The evolution of the actin binding NET superfamily[Frontiers in Plant Science. ]
*Hawkins TJ, Deeks MJ, Wang P, Hussey PJ.**Frontiers in Plant Science. 5254* - Spatiotemporal Characterizations of Dengue Virus in Mainland China: Insights into the Whole Genome from 1978 to 2011[PLoS ONE. ]
*Zhang H, Zhang Y, Hamoudi R, Yan G, Chen X, Zhou Y.**PLoS ONE. 9(2)e87630* - Mathematics and evolutionary biology make bioinformatics education comprehensible[Briefings in Bioinformatics. 2013]
*Jungck JR, Weisstein AE.**Briefings in Bioinformatics. 2013 Sep; 14(5)599-609* - Contrasted Evolution of the Vomeronasal Receptor Repertoires in Mammals and Squamate Reptiles[Genome Biology and Evolution. 2013]
*Brykczynska U, Tzika AC, Rodriguez I, Milinkovitch MC.**Genome Biology and Evolution. 2013; 5(2)389-401* - Reptilian-transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles[EvoDevo. ]
*Tzika AC, Helaers R, Schramm G, Milinkovitch MC.**EvoDevo. 219*

- Cited in BooksCited in BooksPubMed Central articles cited in books
- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles

- The metapopulation genetic algorithm: An efficient solution for the problem of l...The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimationProceedings of the National Academy of Sciences of the United States of America. 2002 Aug 6; 99(16)10516

Your browsing activity is empty.

Activity recording is turned off.

See more...