- Journal List
- Bioinformatics
- PMC2718627

# Efficient inference of bacterial strain trees from genome-scale multilocus data

^{1}Department of Computer Science, Rice University, Houston, TX 77005, USA and

^{2}Department of Evolutionary Studies of Biosystems, Graduate University for Advanced Studies, School of Advanced Sciences, Hayama, Kanagawa 240-0193, Japan

## Abstract

**Motivation:** In bacterial evolution, inferring a *strain tree*, which is the evolutionary history of different strains of the same bacterium, plays a major role in analyzing and understanding the evolution of strongly isolated populations, population divergence and various evolutionary events, such as horizontal gene transfer and homologous recombination. Inferring a strain tree from multilocus data of these strains is exceptionally hard since, at this scale of evolution, processes such as homologous recombination result in a very high degree of gene tree incongruence.

**Results:** In this article we present a novel computational method for inferring the strain tree despite massive gene tree incongruence caused by homologous recombination. Our method operates in three phases, where in phase I a set of candidate strain-tree topologies is computed using the *maximal cliques* concept, in phase II divergence times for each of the topologies are estimated using *mixed integer linear programming* (MILP) and in phase III the optimal tree (or trees) is selected based on an optimality criterion. We have analyzed 1898 genes from nine strains of the *Staphylococcus aureus* bacteria, and identified a fully resolved (binary) strain tree with estimated divergence times, despite the high degrees of sequence identity at the nucleotide level and gene tree incongruence. Our method's efficiency makes it particularly suitable for analysis of genome-scale datasets, including those of strongly isolated populations which are usually very challenging to analyze.

**Availability:** We have implemented the algorithms in the PhyloNet software package, which is available publicly at http://bioinfo.cs.rice.edu/phylonet/

**Contact:** ude.ecir.sc@helhkan

## 1 INTRODUCTION

Genome sequencing technologies are amassing large amounts of data from various organisms that span the Tree of Life, and in the case of bacteria, genomes of several strains of the same bacterium are becoming available (e.g. see the Microbial Genome Project of the US Department of Energy at http://microbialgenomics.energy.gov/). These data are enabling biologists to analyze the relationships among populations and species, as well as understand speciation and population divergence. To elucidate these relationships and understand these processes among different strains of the same bacterium, an accurate reconstruction of the evolutionary history of these strains—the *strain tree*—is essential, since it serves as the backbone against which events such as horizontal gene transfer and homologous recombination can be identified and assessed. In a sequence of papers, Roger Milkman and co-workers pioneered some of the work in this area, mainly focusing on mapping the ‘clonal ancestry’ in several strains of *Escherichia coli* (e.g. Milkman and Stoltzfus, 1988; Stoltzfus *et al.*, 1988).

In this article, we focus on the problem of inferring the strain tree from a genome-scale set of gene trees whose incongruence is mainly due to *homologous recombination*. In bacteria, homologous recombination through transformation or conjugation allows for the integration of homologous alien DNA into a host genome (Errington *et al.*, 2001). This process plays an important role in DNA repair as well as bacterial genome diversification.

From an evolutionary perspective, and barring any recombination, the evolutionary history of a set of genomes would be depicted by a tree that is the same tree that models the evolution of each gene in these genomes. However, homologous recombination among bacteria decouples the evolution of different genes in their genomes, thus resulting in incongruent (or, discordant) gene trees—a scenario that is illustrated in Figure 1.

**a**) Coalescent times coincide with divergence times, and strain/gene tree topologies are concordant. (

**b**) Coalescent times do not coincide with divergence times, and strain/gene-tree topologies

**...**

For example, in Figure 1c, looking backwards in time, the gene lineage from strain *A* and the gene lineage from *B* persist deep enough into the past that they have not coalesced by the time of the ancestral strain to *A*, *B* and *C*. Thus, the lineage from *B* may coalesce with the lineage from *C* more recently than with the lineage from *A*. As the ancestries of different parts of the genome may take different paths through the phylogeny, e.g. due to homologous recombination, gene trees may differ in topology from the strain tree topology, and an individual gene history might not reflect the shape of the strain tree. Even if this gene history is correctly estimated, the strain-tree estimate based on a single locus may be incorrect. As genome-scale sequence data from thousands of loci in different strains of bacteria become available, it is now critical that appropriate methods and tools be developed for understanding and overcoming the problem of gene-tree discordance in strain-tree inference.

A few methods have been introduced recently for analyzing gene trees, reconciling their incongruities and inferring species trees despite these incongruities. To the best of our knowledge, none of these methods have been applied to bacterial genomes, particularly different strains of the same bacterium, with massive gene tree incongruence due to homologous recombination. Generally speaking, each of these methods follows one of two approaches: the *combined analysis* approach or the *separate analysis* approach. In the combined analysis approach, the sequences from multiple loci are concatenated, and the resulting ‘supergene’ dataset is analyzed using traditional phylogenetic methods, such as maximum parsimony and maximum likelihood (e.g. Rokas *et al.*, 2003.) In the separate analysis approach, the sequence data from each locus is first analyzed individually, and a reconciliation of the gene trees is then sought so as to optimize certain criterion (e.g. Edwards *et al.*, 2007). Shortcomings of both approaches have been recently reported by various researchers (e.g. Degnan and Rosenberg, 2006; Kubatko and Degnan, 2007). A particular challenge that was not addressed in these recent studies concerns the analysis of very closely related groups of genomes (strains of the same bacterium, for example). In this case, sequence identity at the nucleotide level is very high, which gives rise to gene trees with low resolution, a fact that further complicates the task of inferring the strain tree. Last but not least, the genomic scale of the available data necessitates the development of efficient tools for tackling the task of strain tree inference.

In this article, we address the problem of strain-tree inference from genome-scale multilocus data, where gene-tree incongruence is due to homologous recombination. Our proposed model of the optimal strain tree (topology and divergence times) is one that minimizes the amount of deep coalescent events, which is similar to that used in Maddison and Knowles (2006), and our proposed method to infer the optimal tree under this model is based on two widely encountered optimization problems: maximal cliques and mixed integer linear programming. Our method operates in three phases, where in phase I a set of candidate tree topologies is computed using the maximal cliques concept, in phase II divergence times for each of the topologies are estimated using mixed integer linear programming (MILP) and in phase III one tree, or a set of trees, is selected based on an optimality criterion. To assess our method's performance, we have analyzed 1898 genes from nine strains of the *Staphylococcus aureus* bacteria. A compatibility graph of all different clusters in 1898 corresponding gene trees was built, whose maximal cliques were then computed to reconstruct candidate tree topologies. The compatibility graph has 36 vertices and 304 edges, which correspond to 304 pairs of compatible clusters, and all its maximal cliques were identified in about 0.046 s. For the 24 trees that corresponded to the maximal cliques, we computed divergence times using a novel MILP formulation, which we solved using the CPLEX tool from ILOG. It took CPLEX approximately 1 h to compute the optimal divergence time assignment of a strain-tree topology, given a set of 1898 gene trees, on a 3.2 GHz Intel Pentium 4 machine, running Linux, with 1GB of RAM. The optimal strain tree that our method identified is fully resolved (binary) despite the high degree of sequence identity at the nucleotide level, which further affirms the suitability of the method to analysis of very closely related organisms.

## 2 METHODS

### 2.1 Definitions and notations

Let *T*=(*V*,*E*) be a tree, where *V*(*T*) and *E*(*T*) are the *tree nodes* and *tree edges* (or, *tree branches*), respectively, and let (*T*) denote its leaf set. Further, let be a set of taxa. Then, *T* is a phylogenetic tree over if there is a bijection between and (*T*). A tree *T* is said to be *rooted* if the edges in *E* are directed and there is a single internal node *x* with in-degree 0. In this article, we assume only rooted trees, unless stated otherwise. Let *T*=(*V*,*E*) be a rooted tree, and *u* be a node in *V*. Given a tree *T*=(*V*,*E*) leaf-labeled by set of taxa, a node *v**V*, an edge *e*=(*u*,*v*) and a set *X*, we use the following notations: *p*^{T}(*v*)=*u*; *T*[*v*] is the *clade*, or subtree, rooted at node *v*; is the cluster, i.e. the set of leaves of *T*[*v*]; and, *MRCA*_{T}(*X*) is the *most recent common ancestor* of *X*—i.e. the node *v**V*(*T*) such that where *e*=(*u*,*v*)*E*(*T*). Tree *T* induces the set of clusters. The topology of the tree *T* naturally defines a partial order _{T} on *C*. In this article, we assume that any strain tree *T* always has a special node *r* with a special edge *re*=(*r*,*x*), where *x* is the MRCA of all leaves in the tree (e.g. see the strain tree in Fig. 4a).

**a**) and a gene tree (

**b**) on six taxa, used for illustrating the strain-tree branch labeling.

Let τ :*V*(*T*)→(^{+}{0}) be a function assigning each node a time such that (1) τ(*u*)=τ(*v*) for *u*,*v*(*T*) and (2) τ(*u*)>τ(*v*) for (*u*,*v*)*E*(*V*).

### 2.2 Strain-tree inference and gene-tree reconciliation

As indicated above, our proposed model of the optimal strain tree (topology and divergence times) is one that minimizes the amount of deep coalescent events. The input to our problem is a set of gene trees (topologies and coalescent times), and the output is a strain tree (topology and divergence times) that minimizes the amount of deep coalescent events and incongruence of the gene trees when reconciled within the branches of the inferred strain tree. The strain tree is built in three phases. First, a set of topology candidates is computed based on the set of clades in the input gene trees. Second, the times for nodes in each of the candidate trees are inferred based on the coalescent times of the input gene trees. Third, the gene trees are reconciled within the branches of each of the tree candidates, and the tree (topology and times) that optimizes a certain criterion (a weighted sum of deep coalescent events, gene/strain-tree incongruence and shallow coalescent events) is selected as the strain tree.

#### 2.2.1 Phase I: inferring strain-tree topology candidates

Given a set of gene-tree topologies {*T*_{1},…,*T*_{k}}, it may be that the tree topology that represents the most frequent coalescent history does not reflect the true divergence patterns (Degnan and Rosenberg, 2006). Further, the tree built from the concatenated ‘supergene’ may also not reflect the true speciation patterns (Kubatko and Degnan, 2007). Our working hypothesis is that the strain-tree topology is most probably formed from a set of clusters, each of which appears in at least one of the gene trees. For a set of clusters to define a (rooted) tree, they have to be *pairwise compatible*. Two clusters (sets of taxa) *c*_{1}, *c*_{2} are compatible if at least one of the three intersections *c*_{1}∩*c*_{2}, and is empty ( denotes the set −*c*). A classical result in phylogenetics states that a set of pairwise compatible clusters defines a unique tree (Semple and Steel, 2003). Based on our working hypothesis and the relationship between clusters and trees, we formulate our heuristic algorithm for finding candidate strain-tree topologies from the set of gene-tree topologies, as outlined in Figure 2. The algorithm first computes *C*, the set of all clusters appearing in any of the gene trees. It then builds the compatibility graph *H*=(*V*_{H},*E*_{H}), where *V*_{H}=*C*, and *E*_{H}*V*_{H}×*V*_{H} where *E*_{H}={(*c*_{i},*c*_{j}) :*c*_{i} is compatible with *c*_{j}}. Based on the aforementioned relationship between clusters and trees, our next task entails computing all maximal sets of pairwise compatible clusters, which amounts to computing the set *K* of all maximal cliques in the compatibility graph *H*. Finally, strain tree topology candidates are constructed in a straightforward manner from the set *K*, where each maximal clique corresponds to a unique tree. Figure 3 illustrates the algorithm on three input gene trees. The set *C* contains seven distinct clusters, and the compatibility graph *H* is shown. There are six maximal cliques in *H*, which implies that the clusters of the input gene trees give rise to six different strain tree topology candidates.

*C*of all clusters that appear in any of the gene trees is computed. In Step 2, the compatibility graph

**...**

#### 2.2.2 Phase II: estimating strain-tree divergence times

Our next task entails estimating the divergence times at internal nodes of each of the strain-tree topology candidates that we computed so as to optimize the weighted sum criterion, as described above. We present a novel optimization based on solving an MILP formulation. The MILP formulation involves a special labeling of the strain-tree topology branches, formulation of temporal constraints based on information from the gene trees, linking coalescence and temporal information and finally putting together all steps into one MILP program. We now describe in details each of these four steps.

(1) *Labeling the strain-tree branches.* In order to model the coalescent of genes on the strain-tree branches, we need to label these branches. As we seek to minimize deep coalescent events (genes that coalesce deeper than their MRCA), we seek a labeling that reflects the ‘depth’ of the coalescent event, i.e. how far the coalescent event of a set *X* occurred away from the MRCA of *X*.

For each internal node *x* in the strain tree *ST*, let *P*(*x*)=*x*_{1},*x*_{2},…,*x*_{p} be the sequence of nodes where: (1) *x*_{1}=*x*, (2) *x*_{p}=*r*(*ST*) and (3) (*x*_{i},*x*_{i−1})*E*(*ST*), for all 2≤*i*≤*p*. Further, *E*_{P(x)} denotes the *list* of edges defined by *P*(*x*); i.e. *E*_{P(x)}=(*x*_{i},*x*_{i−1}):2≤*i*≤*p*. For example, we have *P*(*x*_{2})=*x*_{2},*x*_{3},*x*_{5},*r* and in the strain tree in Figure 4a.

Given these sequences, a clade rooted at node *y* in a gene tree may coalesce only on any edge in *E*_{P(x)}, where *x*=*MRCA*_{ST}(*y*). For example, the clade (*C*,*D*) in the gene tree in Figure 4b may coalesce only on one of the edges in , where *x*_{2} is the node in the strain tree in Figure 4a.

Given *E*_{P(x)}=(*x*_{2},*x*_{1}),(*x*_{3},*x*_{2}),…,(*x*_{p},*x*_{p−1}) for some node *x* in a strain-tree topology, we label the edges in *E*_{P(x)} by the numbers 1,2,…,*p* such that ((*x*_{s},*x*_{s−1}))=*s*−2, for 2≤*s*≤*p*. For example, for , where *x*_{2} is the node in the strain tree in Figure 4a, we have the labels: ((*x*_{3},*x*_{2}))=1, ((*x*_{5},*x*_{3}))=2 and ((*r*,*x*_{5}))=3. This labeling is essential for our MILP formulation, since it will be used to reflect the ‘depth’ of the coalesce events. For example, if clade (*C*,*D*) from the gene tree in Figure 4b coalesces on branch (*r*,*x*_{5}) in the strain tree, then the depth of that coalescence event is ((*r*,*x*_{5}))−1, which is 2 (the reason we choose a label that is larger by 1 than the actual depth value is to accommodate shallow coalescence events, as we discuss below). Indeed, in this scenario, (*C*,*D*) coalesced two branches deeper than it could have coalesced [which is branch (*x*_{3},*x*_{2})]. We denote by LABELTREE the procedure that computes the lists *P*(*x*) and *E*_{P(x)}, as well as the labeling of each edge in *E*_{P(x)}.

(2) *Temporal constraints.* The topology of the strain tree defines a partial order on the times of the internal nodes. This can be represented using linear constraints as τ_{u}>τ_{v} for every branch (*u*,*v*) in the strain tree. For example, in the strain tree in Figure 4a, we have the constraint .

Further, each clade in a gene tree may coalesce on any branch in the strain tree on the path from the MRCA of the clade to the branch *re*. Temporally, this imposes the (linear) constraint τ_{x}≤τ_{y}≤τ_{r}, where *y* is a clade (equivalently in this case, the set of leaves in that clade) in a gene tree, *x*=*MRCA*_{ST}(*y*), and *r* is the special root of the strain tree. For example, in Figure 4, we have the constraint . However, since the coalescence times may be underestimated or gene transfer may have occurred after divergence of the strains, we relax this constraint, by allowing the coalescence time of certain clades to be smaller than the time of their MRCA in the strain tree. Nonetheless, we wish to minimize such events. We achieve via the two constraints

for every clade *y* in a gene tree and its MRCA *x* in the strain tree. The binary variable *g*_{y} here takes the value 1 when the coalescence time of *y* is lower than that of its MRCA in the strain tree and 0 otherwise. Defining *T*^{max} to be the maximum time of the root of any of the gene trees in , we write these as linear constraints, as follows:

In this case, we add a small value (e.g. =1×10^{−8}) to emulate the<relation.

(3) *Associating times with branches through their labels.* Let *y* be a node in the gene tree, *x*=*MRCA*_{ST}(*y*), and (*u*,*v*)*E*_{P(x)} such that ((*u*,*v*))=*m*. If node *y* coalesces on branch (*u*,*v*) in the strain tree, this introduces a constraint of the form τ_{u}≥τ_{y}≥τ_{v}, which translates into the constraint

Notice that if *f*_{y}≠*m*, then τ_{y} is not constrained, which we emulate by constraining τ_{y} from above by *T*^{max} and from below by 0. In other words, we have the constraint

Let *M*_{y}={1,…,κ(*y*)}, where κ(*y*)=|*P*(*x*)|−1 for *x*=*MRCA*_{ST}(*y*). For branch *e*=(*u*,*v*)*E*_{P(x)}, where (*e*)=*m*, we denote *s*^{y}(*m*)=*u* and *t*^{y}(*m*)=*v*.

For each clade *y* in a gene tree, we convert the conjunction of constraints (1) and (2) into linear constraints by introducing κ(*y*) binary variables α_{i} for 1≤*i*≤κ(*y*), and then writing the following constraints:

Constraints (A) and (B) connect the branch assignment with the times of that branch, as they guarantee that α_{i}=1 if and α_{i}=0 otherwise. Constraint (C) guarantees that either *g*_{y}=1 and all the α values are 0, thus resulting in *f*_{y}=0 based on constraint (D), which corresponds to the case where the coalescence times of clade *y* in the gene tree is lower than that of its MRCA in the strain tree, or *g*_{y}=0 and exactly one of the α values is 1, which corresponds to the case where *y* coalesces, under the time assignment to the strain tree, on a unique branch on the path from the MRCA of *y* to the root. Constraint (D) guarantees that the unique value is chosen from the set *M*_{y}. Constraint (E) states that all the α variables are binary.

(4) *Putting it all together: the MILP formulation.* Now that we have described the constraints and how to write them as linear constraints for CPLEX, we are in a position to introduce the complete MILP formulation for solving the problem of estimating divergence times in a strain tree *ST*, given a set of gene trees with coalescence times at internal nodes. We denote *I*(*T*) by the set of all internal nodes of tree *T*, and by the set_{GT}*I*(*GT*).

We seek τ_{x}, for every internal node *x* in the strain tree, and *f*_{y} for every internal node *y* in all gene trees so as to minimize the amount of deep coalescence events and the amount of shallow coalescence. A MILP formulation of this problem, which we refer to as ESTIMATESTTIMES, is given in Figure 5.

*ST*given a set of gene trees with times at internal nodes. Solving this MILP yields the divergence time τ

_{u}, for every

**...**

Notice that since *g*_{y}=1 (which indicates ‘shallow coalescence’) if and only if *f*_{y}=0, the objective function correctly captures the amount of deep coalescence events, and chooses the solution that minimizes it.

#### 2.2.3 Phase III: strain/gene-tree reconciliation and optimality.

Given a strain tree (*ST*,τ_{ST}) and a gene tree (*GT*,τ_{GT}), we seek the coalescence history of the gene, given its tree, on the branches of the strain tree. Because both the strain tree and gene trees have times at internal nodes at this stage of the method, this problem is trivial: the coalescence event of a set *c* of taxa at time *t* in gene tree *GT* must occur at time *t* on the path between the root of *ST* and the MRCA of *c* in *ST*. There is exactly one such point in *ST*, so this mapping is unique for each cluster in a gene tree.

Considering trees with times at internal nodes is very important since temporal constrains implied by divergence and coalescent times render certain coalescent histories invalid. Therefore, whenever such temporal information is available, it must be used, not only for accuracy reasons, but also to achieve further reductions in the size of the space of strain/gene-tree reconciliations, which in turn affects the computational efficiency of existing and newly developed reconciliation methods.

Let () be the set of all internal nodes in the gene trees in (we use an internal node and the subtree rooted at it interchangeably here) and denote by *c**ST* that *c* is a clade in *ST*. Our optimality criterion, η(*ST*,), is defined as the sum of (1) weighted number of missing clades *w*_{il}(∑_{c(): c ST}} 1), (2) weighted number of deep coalescence events *w*_{dc}(∑_{c(): cST, fc>0}} (*f*_{c}−1)) and (3) weighted number of shallow coalescence events *w*_{sc}(∑_{{c(): cST, fc=0}} *g*_{c}). The first term is the number of clades in the gene trees that do not occur in the strain tree. For a clade *c* in gene tree *GT* and which also appears in the strain tree *ST*, the quantity *f*_{c} captures how far (in terms of the number of branches) *c* coalesced away from its MRCA in *ST* and *g*_{c} captures if it may not coalesce, given the time assignment in the strain tree. Note that if all gene trees have the same topology as the strain tree, and each cluster coalesces on the branch immediately above its MRCA, then η(*ST*,)=0. The weights *w*_{il}, *w*_{dc} and *w*_{sc} can be set in a way to reflect the significance given to each of the three terms in the criterion. For example, if only topological difference among the gene trees and strain tree matters, *w*_{dc} and *w*_{sc} can be set to 0.

### 2.3 The algorithm

Now that we have defined our optimality criterion, the complete algorithm for inferring an optimal strain tree (topology and times) is described in Figure 6.

## 3 MATERIALS AND ANALYSIS

### 3.1 Sequence data

In our experimental study, we used the *S.aureus* bacteria, which infect humans in the community and hospitals and cause a variety of diseases. We obtained all the sequence data from the site ftp://ftp.ncbi.nih.gov/genomes/. Table 1 summarizes the nine strains we used. {"type":"entrez-nucleotide","attrs":{"text":"NC_002745","term_id":"29165615","term_text":"NC_002745"}}NC_002745 is *S.aureus* subsp. *aureus* N315, which is a prototype of methicillin-resitant *S.aureus* (MRSA; Kuroda *et al.*, 2001). {"type":"entrez-nucleotide","attrs":{"text":"NC_002758","term_id":"57634611","term_text":"NC_002758"}}NC_002758 is *S.aureus* subsp. *aureus* Mu50, which has a moderate resistance to vancomycin by the thickened cell wall. {"type":"entrez-nucleotide","attrs":{"text":"NC_002951","term_id":"57650036","term_text":"NC_002951"}}NC_002951 is *S.aureus* subsp. *aureus* COL,which is an early methicillin-resistant isolate. The first isolation was found in a British hospital in 1961 (Gill *et al.*, 2005). {"type":"entrez-nucleotide","attrs":{"text":"NC_002952","term_id":"49482253","term_text":"NC_002952"}}NC_002952 is *S.aureus* subsp. *aureus* MRSA252. {"type":"entrez-nucleotide","attrs":{"text":"NC_002953","term_id":"49484912","term_text":"NC_002953"}}NC_002953 is *S.aureus* subsp. *aureus* MSSA476. These strains were isolated from hospital and community (Holden *et al.*, 2004). MRSA252 belongs to the clinically important EMRSA-16 clone that is responsible for half of the MRSA infections in the United Kingdom and is one of the major MRSA clones found in the USA (USA200). MSSA476 causes severe invasive diseases in immunocompetent children in the community and belongs to a major clone associated with community-acquired disease. {"type":"entrez-nucleotide","attrs":{"text":"NC_003923","term_id":"21281729","term_text":"NC_003923"}}NC_003923 is *S.aureus* subsp. *aureus* MW2 (Baba *et al.*, 2002). This strain was isolated from the community, and caused fatal septicaemia. This strain was reported in mid-west USA. {"type":"entrez-nucleotide","attrs":{"text":"NC_007622","term_id":"82749777","term_text":"NC_007622"}}NC_007622 is *S.aureus* subsp. *aureus* RF122 (Herron-Olson *et al.*, 2007). {"type":"entrez-nucleotide","attrs":{"text":"NC_007793","term_id":"87159884","term_text":"NC_007793"}}NC_007793 is *S.aureus* subsp. *aureus* USA300 (Diep *et al.*, 2006). USA300 is one of the major strains in the USA, Canada and Europe. {"type":"entrez-nucleotide","attrs":{"text":"NC_007795","term_id":"88193823","term_text":"NC_007795"}}NC_007795 is *S.aureus* subsp. *aureus* NCTC 8325 (Gillaspy *et al.*, 2006).

### 3.2 Identifying orthologous genes

To identify orthologous genes, we used the information of both DNA sequence identity and synteny (gene order) as follows. All-against-all BLASTN search with default parameters (Altschul *et al.*, 1997) was performed for the genes in {"type":"entrez-nucleotide","attrs":{"text":"NC_002745","term_id":"29165615","term_text":"NC_002745"}}NC_002745 versus all others. Then, we produced a list of BLASTN hits of the 2669 genes in {"type":"entrez-nucleotide","attrs":{"text":"NC_002745","term_id":"29165615","term_text":"NC_002745"}}NC_002745 for each of the other strains. The lists include genes that have at least 90% sequence identity to the reference gene in {"type":"entrez-nucleotide","attrs":{"text":"NC_002745","term_id":"29165615","term_text":"NC_002745"}}NC_002745 and the length of the BLASTN hit region covers>50% of the entire gene. We excluded BLASTN hits when there are more than one hit for each reference gene. As there were not many such cases, this restriction did not result in much loss of data.

In order to identify orthologous genes conservatively, we considered that orthologous genes should be in a large block of a region in which the gene order is well conserved for all investigated strains. A block is defined such that genes from all strains are continuously located on their genomes with less than three gene skips, which could be created by small indels and annotation errors. To detect such blocks, we performed a synteny survey from the first gene in {"type":"entrez-nucleotide","attrs":{"text":"NC_002745","term_id":"29165615","term_text":"NC_002745"}}NC_002745 (NC_002745_1) to downstream genes. Then, we identified 222 such blocks, which covered in total 1898×9 genes.

### 3.3 Gene- and strain-tree analysis

For each gene, we built a maximum parsimony (MP) tree from its DNA sequences by using PAUP* 4.0 (Swofford, 2003), and rooted the tree using the midpoint method. When the MP heuristic identified more than one tree for a given gene, we used the strict consensus of these trees. We inferred coalescence times at internal nodes in the gene trees using the formula

for coalescence time of node *y* in a gene tree, where *B*(*y*)={(*a*,*b*) :*MRCA*(*a*,*b*)=*y*}, *d*_{s} is the number of synonymous substitutions per synonymous sites and *r*_{s} is the rate of synonymous substitutions. In other words, τ_{y} is the average of all coalescence times of every pair of genes whose MRCA is node *y*. Given that the rate of synonymous substitutions is similar across genes (Nei and Kumar, 2000), this allowed us to compare the coalescence times across gene trees and use them to infer divergence times in the strain tree. We used *r*_{s}=10^{−8}, following the findings of Ochman and Wilson (1987).

It has been suggested that *d*_{s} may not be constant across the genome due to different codon bias among genes (Retchless and Lawrence, 2007). We found that *d*_{s} and the codon adaptation index (CAI) are in a negative correlation, therefore, we used a linear regression method to correct *d*_{s} for bias caused by non-random usage of codons. The correction is made such that a corrected *d*_{s} corresponds to that with the mean CAI. However, the corrected *d*_{s} measure did not change the relative times we obtained for the strain trees and results are not shown.

To get the strain-tree candidates, we used the algorithm ESTIMATESTTOPOLOGY described in Figure 2. The compatibility graph *H* contained 36 nodes and 304 edges. We used the MaxClique tool of Kevin O'Neill to compute the maximal cliques. The tool identified all maximal cliques in 0.046 s. Additionally, we considered five other candidate tree topologies: (1) *T*_{conc}: the tree topology obtained by the maximum parsimony heuristic, as implemented in PAUP*, on the concatenation of all 1898 gene data sets; (2) *T*_{hf} : the topology of the gene tree that is compatible with the largest number of other gene trees (this tree, shown in Figure 11, is compatible with 1645 of the gene trees); (3) *T*_{avgds}: a tree topology built using the neighbor joining (Saitou and Nei, 1987) method from the average *d*_{s} distances among nine strains; (4) *T*_{avghd}: a tree topology built using the neighbor joining method from the average Hamming distances among nine strains and (5) *T*_{majcons}: the topology of the majority consensus tree of all 1898 gene trees. In total, we have 29 candidate strain-tree topologies.

*T*

_{mc}is the best maximal clique tree and the rest are as defined in Section 3.3. The lengths of the ‘shortened’ branches were divided by 10

^{5}, so that the resolution of the trees can be shown

**...**

We then estimated the divergence times of each of the strain tree topology candidates, using the CPLEX tool (from ILOG) to solve the algorithm (MILP program) described in Figure 5. We have implemented a software tool for generating the MILP program from a set of gene trees with coalescence times, following the formulation in Figure 5, in the PhyloNet software package, which is available publicly at http://bioinfo.cs.rice.edu/phylonet/. In the nine-genome dataset that we considered in this study, each MILP program had ~4000 variables and 30000 constraints. Nonetheless, CPLEX solved each program in about 1h.

## 4 RESULTS AND DISCUSSION

Our first task was to measure the ‘heterogeneity’ in the data, which consisted of the 9×1898 gene sequences and 1898 gene trees. In this task, we considered two measures of heterogeneity: topological differences among the gene trees, and distributions of coalescence times of each cluster of genes across all gene trees. Figure 7 shows the topological differences between every pair of the 1898 gene trees, as computed by the Robinson–Foulds (RF; Robinson and Foulds, 1981) distance measure. The RF measure quantifies, for a given pair of trees, the average number of clades that appears in one, but not both, of the trees. Hence, if two trees are identical, the RF distance between them is 0; if they do not share any clades, then the RF distance is 1; and, trees with varying degrees of shared clades have RF distance values between 0 and 1.

As shown in Figure 7, while blue (low RF values) is the dominating color, there are many pairs of trees that have RF distance of at least 0.3. In fact, among the 1898 gene trees, there were over 400 different topologies. Given our conservative selection of the orthology groups, which almost eliminates the possibility of gene tree discordance due to events such as horizontal gene transfer and gene duplication/loss, this result indicates massive gene-tree discordance due to stochastic effects of the coalescent (incomplete lineage sorting).

Furthermore, it is important to point out that the majority of the gene trees were not binary, since the percent identity among the orthologous sequences was very high. This lack of resolution of the gene tree topologies may give a false indication of high concordance (low RF values) among the gene trees, even though this may not be the case in reality. Alternatively, one may quantify the ‘compatibility’, rather than ‘similarity’ (as measured by the RF distance), among gene trees. However, this suffers from the fact that compatibility measures are not true metrics, and in particular do not satisfy the triangle inequality property, which may distort the picture emerging from such an analysis.

As indicated in the Section 1 and illustrated in Figure 1, it may be the case the gene trees have the same topology, yet they disagree in their coalescence times (times at their internal nodes). Therefore, what we studied next was the distribution of coalescence times of each cluster of taxa across all gene trees in which the cluster occurs (recall that a cluster occurs in a tree if the tree contains a clade whose leaves are the only members of that cluster); the results are shown in Figure 8.

The figure shows that, even with the exclusion of possible outliers, each cluster of taxa has a wide distribution of coalescence times across all gene trees in which it occurs. Further, what makes the computational analysis of such a dataset particularly challenging is that large extent of overlap of distributions of the different clusters. Dealing with this overalp is where most of the computational time of solving our MILP formulation is spent.

After we characterized the heterogeneity in the data, we turned to the main issue, namely estimating the strain-tree topology and divergence times from the set of 1898 gene trees. As described in the previous section, we considered 29 strain-tree topology candidates. For each of these 29 topology candidates, we solved the MILP formulation as outlined in Figure 5, once with *w*_{dc}=*w*_{sc}=1, and another with *w*_{sc}=5*w*_{dc}. In both cases, the same tree topology candidate of all 24 maximal cliques emerged as the optimal one, yet with differing times. Therefore, we report the results of only the optimal solution under *w*_{dc}=*w*_{sc}=1.

For a clearer presentation, we show each of the three terms in the optimality criterion described in Section 2.2.3 individually, with Figure 9 showing the number of missing (or, discordant) clades, and the stacked bars in Figure 10 showing the sum of the depths of deep coalescence events (the blue bars) and the number of shallow coalescence events (the red bars).

*T*

_{conc},

*T*

_{hf},

*T*

_{avgds},

*T*

_{avghd}and

*T*

_{majcons}, respectively.

*sf*=∑

_{{c(), fc>0}[

*c*

*ST*]×(

*f*

_{c}−1), and the number of shallow coalescences,

*sg*=∑

_{{c (), fc=0}[

*c*

**...**

Figure 9 shows that the first tree out of the 24 maximal clique trees has the least disagreements with the set of 1898 gene trees, with trees 8 and 9 differing from it by about 70 clades. The other 21 maximal clique trees are much less optimal in this context, with the best of them disagreeing with the gene trees in at least 400 more clades. We denote by *T*_{mc} the first tree, which is the best in this context among all 24 maximal clique trees. Out of the additional five trees, *T*_{hf} is clearly the best in this context, and the only one that is better than *T*_{mc}. Both trees, *T*_{mc} and *T*_{hf} are shown in Figure 11. The tree *T*_{mc} is a *refinement* of the tree *T*_{hf}; that is, *T*_{mc} contains all the clades in *T*_{hf}, plus additional ones. In this case, *T*_{hf} has the clade (*USA*300, *NCTC*8325, *COL*) unresolved, while *T*_{mc} has it resolved as (*NCTC*8325, (*USA*300, *COL*)).

When considering the optimality of both trees, *T*_{mc} and *T*_{hf}, as measured by the amount of deep coalescence and shallow coalescence events, as shown in Figure 10, they are identical. The significance of this result comes from the fact that, while the unresolved clade (*USA*300, *NCTC*8325, *COL*) has three possible refinements (1) (*NCTC*8325, (*USA*300, *COL*)), (2) ((*NCTC*8325, *USA*300), *COL*) and (3) ((*NCTC*8325, *COL*) *USA*300), the MILP formulation led to a fully binary strain tree that has exactly the same coalescence scenarios among all gene trees. Notice that the majority consensus tree *T*_{majcons} is the optimal among all 29 trees in terms of the coalescence scenarios. However, this tree has two problems. First, in terms of missing clades, it is one of the least optimal, as shown in Figure 9. Further, it is highly unresolved, containing only two internal branches, as shown in Figure 11.

The concatenation tree, *T*_{conc} is the best of all trees in terms of minimizing the number of shallow coalescence events, yet is the worst in terms of the sum of the depth of all deep coalescence events. Further, it is the only tree that had the wrong outgroup. This indicates that concatenation of gene sequences and reconstructing a strain tree from the resulting ‘supergene’ may result in very inaccurate trees, particularly when there is a massive extent of discordance among gene trees, a fact that has already been established through extensive experimental studies (Kubatko and Degnan, 2007). While it seems from Figure 11 that *T*_{conc} indicates very large divergence time between N315 and Mu50, this is but a reflection of time estimation given that these two strains did not form a single clade in the concatenation tree. To solve this problem, we will consider in future development of our tool all possible refinements of any non-binary strain-tree topology candidate.

The other two trees, *T*_{avgds} and *T*_{avghd} are very similar in terms of topology, as shown in Figure 11, and both fall ‘in the middle’ in terms of optimality, as shown in Figures 9 and and10.10. Therefore, our proposed evolutionary history of all nine strains of *S.aureus* is the tree *T*_{mc}, shown in Figure 11.

## 5 CONCLUSIONS AND FUTURE WORK

In this article, we introduced a three-phase method for efficient inference of an optimal strain tree from genome-scale multilocus data. We have implemented all phases of our method and analyzed nine strains of *S.aureus*. Our hypothesis for the ‘vertical’ evolutionary history of these nine strains is the tree *T*_{mc}, shown in Figure 11. It is very important to note that even though the closely related set of strains has a very high degree of sequence identity at the nucleotide level, our method was able to infer a fully resolved evolutionary tree for them. Further, the method computed and evaluated each of 24 possible strain trees within an hour, which is efficient, considering that we used about 1900 loci from nine strains in this analysis.

Two immediate future directions that we will pursue are (1) studying the performance of our method in extensive simulations and (2) investigating the evolutionary diameter of a dataset within which the method reliably returns good strain, or even species, trees. It is worth mentioning that our method can be adapted in a straightforward manner to handle multiallelic loci in the data.

## ACKNOWLEDGEMENTS

The authors wish to thank the three anonymous reviewers for very helpful comments on the manuscript.

*Funding*: L.N. and C.T. were supported in part by DOE grant DE-FG02-06ER25734 and NSF grant CCF-0622037. H.I. and R.S. were supported in part by NSF grant CCF-0622037.

*Conflict of Interest*: none declared.

## REFERENCES

- Altschul S, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
- Baba T, et al. Genome and virulence determinants of high virulence community-acquired MRSA. Lancet. 2002;359:1819–1827. [PubMed]
- Degnan J, Rosenberg N. Discordance of species trees with their most likely gene trees. PLoS Genetics. 2006;2:762–768. [PMC free article] [PubMed]
- Diep B, et al. Complete genome sequence of USA300, an epidemic clone of community-acquired meticillin-resistant Staphylococcus aureus. Lancet. 2006;367:731–739. [PubMed]
- Edwards S, et al. High-resolution species trees without concatenation. Proc. Natl Acad. Sci. USA. 2007;104:5936–5941. [PMC free article] [PubMed]
- Errington J, et al. DNA transport in bacteria. Nat. Rev. Mol. Cell Biol. 2001;2:538–544. [PubMed]
- Gill S, et al. Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J.n Bacteriol. 2005;187:2426–2438. [PMC free article] [PubMed]
- Herron-Olson L, et al. Molecular correlates of host specialization in staphylococcus aureus. PLoS ONE. 2007;2:e1120. [PMC free article] [PubMed]
- Holden M, et al. Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance. Proc. Natl Acad. Sci. USA. 2004;101:9786–9791. [PMC free article] [PubMed]
- Kubatko L, Degnan J. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. 2007;56:17–24. [PubMed]
- Kuroda M, et al. Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet. 2001;357:1225–1240. [PubMed]
- Maddison W, Knowles L. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 2006;55:21–30. [PubMed]
- Milkman R, Stoltzfus A. Molecular evolution of the Eschericia coli chromosome. II. Clonal segments. Genetics. 1988;120:359–266. [PMC free article] [PubMed]
- Nei M, Kumar S. Molecular Evolution and Phylogenetics. Oxford: Oxford University Press; 2000.
- Ochman H, Wilson A. Evolution in bacteria: evidence for a universal substitution rate in cellular genomes. J. Mol. Evol. 1987;26:74–86. [PubMed]
- Retchless A, Lawrence J. Temporal fragmentation of speciation in bacteria. Science. 2007;317:1093–1096. [PubMed]
- Robinson D, Foulds L. Comparison of phylogenetic trees. Math. Biosciences. 1981;53:131–147.
- Rokas A, et al. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. [PubMed]
- Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425. [PubMed]
- Semple C, Steel MA. Phylogenetics. Vol. 24. Oxford: Oxford University Press; 2003.
- Stoltzfus A, et al. Molecular evolution of the Escherichia coli chromosome. I. Analysis of structure and natural variation in a previously uncharacterized region between trp and tonB. Genetics. 1988;120:345–358. [PMC free article] [PubMed]
- Swofford DL. PAUP*: Phylogenetic Analysis using Parsimony. Sunderland, MA: Sinauer Associates; 2003. (* and others methods), Version 4.

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (544K) |
- Citation

- Species tree inference by minimizing deep coalescences.[PLoS Comput Biol. 2009]
*Than C, Nakhleh L.**PLoS Comput Biol. 2009 Sep; 5(9):e1000501. Epub 2009 Sep 11.* - A novel computational method identifies intra- and inter-species recombination events in Staphylococcus aureus and Streptococcus pneumoniae.[PLoS Comput Biol. 2012]
*Sanguinetti L, Toti S, Reguzzi V, Bagnoli F, Donati C.**PLoS Comput Biol. 2012; 8(9):e1002668. Epub 2012 Sep 6.* - DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony.[Bioinformatics. 2008]
*Wehe A, Bansal MS, Burleigh JG, Eulenstein O.**Bioinformatics. 2008 Jul 1; 24(13):1540-1. Epub 2008 May 12.* - Tree disagreement: measuring and testing incongruence in phylogenies.[J Biomed Inform. 2006]
*Planet PJ.**J Biomed Inform. 2006 Feb; 39(1):86-102. Epub 2005 Sep 28.* - Impact of recombination on bacterial evolution.[Trends Microbiol. 2010]
*Didelot X, Maiden MC.**Trends Microbiol. 2010 Jul; 18(7):315-22. Epub 2010 May 6.*

- Fast algorithms and heuristics for phylogenomics under ILS and hybridization[BMC Bioinformatics. ]
*Yu Y, Ristic N, Nakhleh L.**BMC Bioinformatics. 14(Suppl 15)S6* - Parsimonious Inference of Hybridization in the Presence of Incomplete Lineage Sorting[Systematic Biology. 2013]
*Yu Y, Barnett RM, Nakhleh L.**Systematic Biology. 2013 Sep; 62(5)738-751* - Inference of reticulate evolutionary histories by maximum likelihood: the performance of information criteria[BMC Bioinformatics. ]
*Park HJ, Nakhleh L.**BMC Bioinformatics. 13(Suppl 19)S12* - Species Delimitation and Global Biosecurity[Evolutionary Bioinformatics Online. ]
*Boykin LM, Armstrong KF, Kubatko L, De Barro P.**Evolutionary Bioinformatics Online. 81-37* - Algorithms for MDC-Based Multi-Locus Phylogeny Inference: Beyond Rooted Binary Gene Trees on Single Alleles[Journal of Computational Biology. 2011]
*Yu Y, Warnow T, Nakhleh L.**Journal of Computational Biology. 2011 Nov; 18(11)1543-1559*

- Gene (nucleotide)Gene (nucleotide)Records in Gene identified from shared sequence links
- NucleotideNucleotidePublished Nucleotide sequences
- PubMedPubMedPubMed citations for these articles

- Efficient inference of bacterial strain trees from genome-scale multilocus dataEfficient inference of bacterial strain trees from genome-scale multilocus dataBioinformatics. Jul 1, 2008; 24(13)i123

Your browsing activity is empty.

Activity recording is turned off.

See more...