Copyright © 2009 by Cold Spring Harbor Laboratory Press

Inferring genomic flux in bacteria

1 Department of Statistics, University of Warwick, Coventry CV4 7AL, United Kingdom; 2 Institute for Molecular Bioscience, University of Queensland, St. Lucia QLD 4072, Australia; 3 Environmental Research Institute, Department of Microbiology, University College Cork, Cork, Ireland
4 Present address: Genome Center, University of California, Davis, CA 95616, USA.
5 Corresponding author. E-mail X.Didelot/at/warwick.ac.uk; fax 44-02476-524532.

Received June 18, 2008; Accepted October 29, 2008.

Abstract

Acquisition and loss of genetic material are essential forces in bacterial microevolution. They have been repeatedly linked with adaptation of lineages to new lifestyles, and in particular, pathogenicity. Comparative genomics has the potential to elucidate this genetic flux, but there are many methodological challenges involved in inferring evolutionary events from collections of genome sequences. Here we describe a model-based method for using whole-genome sequences to infer the patterns of genome content evolution. A fundamental property of our model is that it allows the rates at which genetic elements are gained or lost to vary in time and from one lineage to another. Our approach is purely sequence based, and does not rely on gene identification. We show how inference can be performed under our model and illustrate its use on three datasets from Francisella tularensis, Streptococcus pyogenes, and Escherichia coli. In all three examples, we found interesting variations in the rates of genetic material gain and loss, which strongly correlate with their lifestyle. The algorithms we describe are implemented in a computer software named GenoPlast.

Bacteria adapt to new environmental niches by remodeling their genomes.
Genome sequencing has revealed a prominent role for gene gain and loss in the processes of niche adaptation, specialization, host-switching, and other lifestyle changes. Diverse bacterial species exhibit such genetic flux, which plays a crucial role in bacterial evolution (Ochman et al. 2000; Wren 2000; Dobrindt and Hacker 2001). Previous studies of genomic flux have used annotated genes as the units of gain and loss. In the standard inference protocol, genes annotated in sequenced genomes are first assigned to orthologous groups on the basis of sequence homology. Paralogs in multicopy gene families are either disambiguated or discarded, and genes exhibiting only partial homology are usually subjected to a conservation threshold to be considered orthologous (e.g., 70% of the amino acid length must be conserved). The resulting one-to-one mapping of orthologous genes is then subjected to a gene flux analysis. Gene gains and losses are typically presumed to be equally likely to occur in all lineages, which enables parsimonious mapping of gene gains/losses to branches of a phylogenetic tree relating the organisms under study. Finally such studies usually investigate the relationship between gain, loss, and ecological niche. However, some molecular processes underlying genomic flux operate without regard to gene boundaries. Short segments within genes, such as protein domains, are often gained or lost (Spratt 1988; Riley and Labedan 1997), and intergenic regulatory regions may also be subject to such pressures. Clusters of neighboring genes and operons may be gained or lost in a single event (Lawrence and Roth 1996). A complete evolutionary account would annotate individual events while also detecting variations in rate over time in particular lineages. The discovery of reductive genome evolution (Silva et al. 2001; Hershberg et al. 2007) has clearly demonstrated that in many cases, the process of genomic gain and loss is asymmetric in some lineages. 
Parsimony criteria are known to be unreliable when branch lengths are unequal (Felsenstein 1978; Pol and Siddall 2001; Swofford et al. 2001), meaning that statistical modeling of unequal rates is necessary for accurate evolutionary inference. In the present work, we introduce a new method to reconstruct genomic flux based on raw genomic sequence (without annotated coding sequences) that can also infer lineage-specific changes in the rates of gain and loss. Our method takes whole-genome multiple alignments as input, and outputs a mapping of changes in genomic content to branches of a phylogenetic tree, along with confidence estimates. The method utilizes a stochastic model of genomic evolution by gain and loss, incorporating a compound Poisson process model (Huelsenbeck et al. 2000) to allow the rates of gain and loss to vary in time and between lineages. Therefore, our model does not assume that evolution proceeds according to a constant molecular clock (Linz et al. 2007). The importance of modeling the changes in the rate of gene flux has been recognized before (Hao and Golding 2004, 2008; Marri et al. 2006), but our method is the first to be able to infer from the data where such changes may have happened instead of relying on the user's prior knowledge. Our method processes a whole-genome multiple alignment to identify the parts that are present in all genomes (the core genome) and the parts present in some, but not all of them (the dispensable genome). The core genome is used to robustly infer a phylogenetic tree. Since the parts in the dispensable genome are not found in all genomes, they must have been gained or lost at least once along the branches of the phylogeny. In order to model the overall rate of genetic material being gained and lost, the dispensable genome is broken up into small “features” of constant size. 
We encode the presence or absence of these features in a particular genome as a binary character, and model the evolution of these binary characters along the phylogenetic tree. Thus, the rates of gain r+ and loss r− incorporated in our model reflect the total number of nucleotides gained and lost during the evolution of a population among sequences found in at least one of the genomes.

Figure 1
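The feature encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GenoPlast implementation: the function name `build_feature_matrix`, the `(length, genomes)` block representation, and the rule of ignoring any remainder shorter than one feature are all hypothetical. The default feature size of 100 bp is the value given in the Methods section.

```python
def build_feature_matrix(blocks, n_genomes, feature_size=100):
    """Return a binary matrix with one row per genome and one column per feature.

    blocks: iterable of (length_in_bp, set_of_genome_indices_containing_block).
    This input format is a hypothetical stand-in for a whole-genome alignment.
    """
    columns = []
    for length, present_in in blocks:
        # Blocks found in every genome belong to the core genome and are skipped.
        if len(present_in) == 0 or len(present_in) == n_genomes:
            continue
        pattern = [1 if i in present_in else 0 for i in range(n_genomes)]
        # Assumption: one feature per full feature_size bp of the block;
        # a remainder shorter than one feature is ignored.
        for _ in range(length // feature_size):
            columns.append(pattern)
    # Transpose the list of columns into one row per genome.
    return [[column[i] for column in columns] for i in range(n_genomes)]


# Three genomes: a 250-bp block in genomes 0 and 2, a core block, a 100-bp block in genome 1.
blocks = [(250, {0, 2}), (1000, {0, 1, 2}), (100, {1})]
D = build_feature_matrix(blocks, n_genomes=3)
```

Each column of the resulting matrix is one binary character whose gains and losses are then modeled along the tree.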
Inference is performed under this model using a reversible-jump Markov chain Monte Carlo (MCMC) algorithm (Green 1995, 2003). Our prior model favors simple explanations for the observed patterns of feature presence and absence (i.e., low rates for gain and loss, and few changes in the rates). Thus, a change in the rate of gain and loss in a particular lineage must be supported by the data to be inferred by our method. We assess the power of our method using a simulation study and illustrate its use for two groups of γ-Proteobacteria and one group of Firmicutes. In doing so, we demonstrate the ability of our approach to infer genomic flux that involves regulatory regions and fragments of genes. We further demonstrate that, using genome sequence alone, we are able to identify changes in the rate of genomic flux. The rate changes identified by our method are associated with microbial lifestyle changes such as transitions from generalist to host-restricted pathogen lifestyles. We have made a software implementation of the algorithm with a graphical interface freely available from http://go.warwick.ac.uk/genoplast/.

Results

Simulation study

We simulated a genealogy from the coalescent model (Kingman 1982a,b) for a sample of 15 individuals. Feature gain and loss were simulated on this genealogy under the assumption that r− is constant throughout, and that r+ contained a single changepoint, as shown in Figure 2A.
Inference was performed for each of these simulated datasets by running our MCMC algorithm for 20,000 iterations.

Figure 2B

The conditions shown on Figure 2A

Application to Francisella tularensis
The γ-proteobacterium Francisella tularensis is composed of several phenotypically diverse subspecies. The most virulent one is ssp. tularensis, which causes lethal pulmonary infections in humans and animals (Ellis et al. 2002). A first strain from subspecies tularensis was sequenced by Larsson et al. (2005), and two subsequent sequencing projects showed that it exhibits little genomic diversity (Beckstrom-Sternberg et al. 2007; Chaudhuri et al. 2007). Subspecies holarctica is a highly infectious but rarely fatal lineage (Ellis et al. 2002), of which three strains have been sequenced (Petrosino et al. 2006; P. Chain, F. Larimer, M. Land, S. Stilwagen, P. Larsson, S. Bearden, M. Chu, P. Oyston, M. Forsman, S. Andersson, et al., unpubl.; S. Godbole, L. Zhou, D. Bruce, R. Crawford, C. Detter, M. Dempsey, C. Lion, C. Munk, J. Noronha, R. Scheuermann, et al., unpubl.). A sequence from subspecies novicida, which is rarely associated with human disease, has also been determined (Rohmer et al. 2007).
Table 1 summarizes the seven F. tularensis genomes. We aligned the genomes and determined their phylogeny based on the core genome as described in the Methods section. The average length of each genome is around 1.9 Mbp, of which 1.6 Mbp is found in all genomes (cf. Fig. 3A).
We reconstructed the history of gain and loss of these features using the algorithm described in the Methods section. Figure 4A
We found that a large amount of genetic material was lost on the branch above the ancestor of tularensis and holarctica, with an average of 172 kbp lost (last row on Fig. 4C).

Application to Streptococcus pyogenes
Streptococcus pyogenes is a Gram-positive bacterium responsible for a wide range of human diseases such as bacteremia, tonsillitis, scarlet fever, or acute rheumatic fever (Cunningham 2000). The species is traditionally subdivided according to serologic differences in the M protein, which are strongly correlated with the frequency and type of infection caused. A total of 12 S. pyogenes genomes have been sequenced, spanning nine different M types, and we included all of them in this study (cf. Table 2). Previous genome comparisons revealed that the most noticeable difference between those genomes lies in the presence or absence of integrated prophages (Ferretti et al. 2001; Beres et al. 2002; Nakagawa et al. 2003; Banks et al. 2004; Holden et al. 2007). Those prophages contain a number of genes associated with virulence, so that the history of prophage gain and loss is likely to be pivotal to explaining the different types of infection caused by different lineages.
The average length of the S. pyogenes genomes is around 1.9 Mbp, of which 1.6 Mbp is found in all 12 genomes (cf. Fig. 3B).

Our method avoids a common problem arising in analyses of prophages. The difficulty is that almost all prophages display some homology to one another (Banks et al. 2004; Holden et al. 2007). Since prophages usually deteriorate faster than the core genome, it is difficult to say definitively whether two homologous prophages were both vertically inherited or were acquired via lateral transfer. The ambiguous orthology relationship in turn creates ambiguity for inference of prophage gain and loss. Furthermore, intragenomic recombination amongst resident prophages has been described (Nakagawa et al. 2003), which makes the reconstruction of flux even more tedious. Here we used Mauve as described in the Methods section to determine the orthologous regions of prophages. As Mauve is a synteny-based method, this results in a parsimonious evaluation of the gain and loss of prophage features, where all events can be safely assumed to have occurred, but one cannot exclude a more complex history obscured by the homology of different phages.

Our reconstruction of the genetic flux in S. pyogenes is shown in Figure 5
These results therefore suggest that prophage integration has accelerated in recent times for the genomes of type M1 and M12, but also possibly for several other lineages of S. pyogenes. This is consistent with a previous study which found that maximum likelihood estimates for the rate of genomic flux were higher on the external branches than on the internal branches of a phylogeny of Streptococcus (Marri et al. 2006). Another possibility is that prophage integration occurred at a constant rate, but is balanced by prophage excision or deletion. That is, most of the older phage insertions are not visible, as they have been removed, and only recent insertions are detected. A similar observation has been noted in Salmonella enterica (Vernikos et al. 2007). The sequencing of additional genomes sharing the same M types should shed more light on these hypotheses, which could reflect a more recent adaptation of lineages to specific niches than previously thought (Marri et al. 2006).

Application to Escherichia coli and Shigella
E. coli has long been considered an organism of choice for the study of bacterial pathogenicity due to the coexistence of various pathogenic and commensal lineages. A total of 10 genomes have so far been completely sequenced: three laboratory and commensal strains (MG1655, W3110, and HS), one avian pathogenic strain (APEC O1), two enterohemorrhagic strains (EHEC Sakai and EDL933), one enterotoxigenic strain (ETEC 24377A), and three uropathogenic strains (UPEC CFT073, 536, and UTI89). We also included in our analysis the six sequenced genomes of the closely related genus Shigella: three from species S. flexneri (8401, 301, and 2457T) and one from each of the other three species (197, 227, and 046). Table 3 contains the list of these 16 genomes, with references to their original publications. All six strains of Shigella are causative agents of bacillary dysentery; hence, they have historically been classified in a separate genus, despite the fact that the Shigella phenotype has evolved multiple times from different clones of E. coli (Pupo et al. 2000; Jin et al. 2002; Wei et al. 2003). In agreement with this, the phylogeny we inferred for the 16 genomes shows the six strains of Shigella split into different phylogenetic groups.
The mean length of those 16 genomes is approximately 5 Mbp, of which ~3 Mbp are found in all 16 genomes (Fig. 3C).
All of the branches above the six Shigella genomes show important gains of genomic material (with an average of 977 kbp gained by each genome), comparable with the ones observed for pathogenic strains of E. coli such as 24377A or EDL933 and Sakai. However, the Shigella genomes have lost many more features than any of the E. coli genomes, with an average of 569 kbp lost by each genome. This genomic reduction can be traced back to a higher presence of insertion sequences (IS) in the Shigella genomes (Yang et al. 2005). Furthermore, a larger number of pseudogenes is found in the genomes of Shigella than in those of E. coli (Nie et al. 2006). The fact that the pathogenic E. coli have not undergone such genome degradation and reduction (except for APEC O1, cf. below) may be a reflection of their larger host range (Cunningham 2000). The APEC O1 genome is the only avian pathogenic (APEC) strain in our data set (Johnson et al. 2007). The phylogeny we inferred from the core genome indicates that it is a close relative of the three strains of uropathogenic E. coli (UPEC) in our data set, and especially of UTI89. This close relationship, as well as a comparison of the genome sequences and annotation for these four strains, suggest that E. coli strains from animals might be the source of uropathogenic E. coli infections (Johnson et al. 2007). However, our analysis has found a clear increase in the rate of gain and loss on the branch directly above strain APEC O1, resulting in a gain of 691 kbp on this branch. The increase in the rate of gain is comparable to that found for other branches of pathogenic E. coli, but the increased loss is unique to APEC O1 amongst all studied genomes of E. coli, and similar to the high rates described above for Shigella. This result hints that in spite of the close relationship of APEC O1 with the three UPEC strains, it may have already started to adapt to the avian host. 
This hypothesis may imply that the natural reservoir of human urinary tract pathogenic E. coli is not animals, and will require validation through the sequencing of additional APEC and UPEC strains. The analysis above uses features of constant size 100 bp, as the unit of genomic flux as described in the Methods section. Our model and algorithm can, however, be applied for any other unit such as the gene, which has been the unit traditionally used in studies of genomic flux (Hao and Golding 2004, 2008; Marri et al. 2006). We therefore reanalyzed the E. coli and Shigella data set using gene presence/absence data in order to compare the two approaches. We found a total of 14,752 genes to be present in one, but not all of the 16 genomes. Supplemental Figure 3 shows the result of our analysis of genomic flux based on gene data. The overall inferred history of gene flux is the same as the one described above based on these features: The commensals and laboratory strains have endured little flux, pathogenic strains of E. coli have gained some material, and Shigella lineages have gained and lost a large amount of genes.
Table 4 contrasts the number of features and genes found to be gained and lost, on average, by both analyses on all the branches of the phylogeny. The gene-based and feature-based analyses are in good agreement, which is not surprising, since the regions of the genomes identified as having been gained and lost are roughly the same in both analyses. For this reason, features and genes are gained and lost in approximately the same proportion on the branches, as illustrated in Table 4. Small differences between the two analyses could be caused, for example, by variation in the density of coding genes from one region of the dispensable genome to another, or by genome degradation causing a loss of genes (turned into pseudogenes), but not features. The largest difference between the two analyses is found for the amount of loss on the branch above APEC and UPEC, but because this branch lies directly under the root of the tree, its uncertainty is large, with a 95% credibility interval of [1.8;12.1] for the feature-based analysis and [1.1;6.8] for the gene-based analysis (cf. Supplemental Table 1). All the credibility intervals for the amount gained or lost on the different branches are in good agreement when using features and genes.
Discussion

We have presented a novel method to reconstruct genome content evolution based on whole-genome alignments. Our method is based on a model of genomic evolution that has the essential property of allowing deviations from a molecular clock in acquisition and loss of genetic material. Our use of a relaxed clock is important for two reasons. First, when a lot of material is gained or lost in a single event (e.g., during phage integration), then we expect high variance in the amount of material flux on branches of the phylogeny, even if the events themselves follow a molecular clock. Second, accumulating evidence suggests that adaptation of an organism to a new niche is accompanied by increased rates of lateral gene transfer (Reid et al. 2000; Marri et al. 2006; Didelot et al. 2007) and/or gene loss (Maurelli et al. 1998; Cole et al. 2001; Welch et al. 2002; Cummings et al. 2004), so that the events themselves do not necessarily occur according to a molecular clock. As such, inferred changes in the rate of gene flux can provide a general means to capture changes in population dynamics or microbial lifestyle.

The methodology we described in order to perform inference under this model of genomic evolution makes use of Bayesian statistics, allowing for a complete quantification of the uncertainty in the reconstruction of material flux. This uncertainty is often large, especially on the branches directly under the genealogy root for which the data at the leaves is not very informative (e.g., Supplemental Table 1). The results can be graphically summarized as illustrated in Figures 4

One innovation in the approach taken here is that we do not rely on gene identification. For this reason, the basic unit of our method is the feature (i.e., a sequence fragment of small size) rather than the gene.
Clearly, this presents a number of advantages: Gene identification is a laborious process, the quality of existing annotations varies between genomes, and genes are not an indivisible unit of flux. Furthermore, it is always possible, after having found the list of features gained by a genome, to look into its annotation to find the genes (or gene fragments) affected, so that we do not lose the ability to identify gene gains or losses. However, our method can also be applied to gene presence/absence data. The choice of whether to use features or genes depends ultimately on which question is being asked: A gene-based view makes sense if one is interested in differences in functionality, whereas a feature-based view should be favored if one wants to study the mechanism of genomic flux. Breaking down alignment blocks into features or genes as we do in the present work is useful in order to deal with rates of gain and loss in absolute terms (i.e., proportionally with the number of sites being gained or lost). However, we still fall short of a fully event-based reconstruction of history. Since each alignment block is found either in a contiguous region or not at all in each of the genomes, it is likely that each one was gained or lost in a single evolutionary event. By using alignment blocks as the unit of gain and loss instead of genes or features, one might therefore hope to reconstruct events. Unfortunately, alignment blocks often do not correspond to evolutionary units because of events occurring in different parts of the tree. For example, a region that was gained as a single unit in one or more branches of the phylogeny, but is broken up elsewhere, will appear to be two blocks throughout the phylogeny, wherever it is gained. The division into alignment blocks is highly dependent on exactly which genomes are in the sample, with poorly sampled lineages having larger blocks, which can, in turn, mislead inferences based on the rate of gain or loss of blocks. 
Reconstructing the full history of evolutionary events that gave rise to the patterns of mosaicity in an observed sample of genomes would therefore require a model of genome evolution that includes the possibility for a genome to gain a sequence of an arbitrary size at any position, to lose any subset of its sequence, and to move any subset to a different point (with the possibility of inversion and/or duplication). The inherent complexity of such a model would pose a serious challenge in trying to use it in an inferential setup. The approach we took in the present work avoids those difficulties, at the cost of being less evolutionarily oriented.

Methods

Alignment

We start with a sample of n genome sequences from a single bacterial species or a few closely related species. We first produce a multiple alignment of those genomes using the Progressive Mauve algorithm (Darling et al. 2004; http://gel.ahabs.wisc.edu/mauve/index.php). The Progressive Mauve alignment algorithm identifies and aligns all conserved orthologous segments and all positionally conserved repeat elements. The resulting alignments represent a mosaic of rearranged segments conserved among all genomes, segments conserved among subsets of genomes, and segments unique to a particular genome.

The gaps in a multiple genome alignment can be removed to define the “core genome” of a group of organisms. Gaps in the alignment occur when one or more genomes contain a subsequence not present in the remaining genomes. Small alignment gaps are typically caused by mutational processes such as slipped-strand mispairing, whereas large gaps typically result from recombination processes involving gene gain and loss. By excluding alignment columns that participate in gaps larger than some fixed size threshold, for example 20 nt, we can precisely define a set of alignment columns participating in the “core genome.” The core genome can then be used to robustly infer the phylogeny T of the sample.
Here, we used the UPGMA algorithm to do so, but Supplemental Figure 1 shows that neighbor joining, maximum parsimony, and minimum evolution all agree with the UPGMA algorithm, except for one branching order in the E. coli data set. Using the other branching order does not affect the results of our genetic flux analysis, though.

The remainder of the alignment represents the blocks that have been lost or gained at least once during the evolution of the sample from a common ancestor (also known as the dispensable genome). We consider that each such block is made of small genetic regions of fixed length (i.e., 100 bp) called features. Let f denote the number of features thus defined. The dispensable genome can thus be summarized by the binary matrix D = {d_{i,j}}, i ∈ [1..n], j ∈ [1..f], where d_{i,j} = 1 if, and only if, individual i has the feature j in its genome.

The reason for choosing a feature size of 100 bp is as follows. Choosing a very small value (e.g., 10 bp) would increase the risk that some of the features do not represent real homologous material in all genomes. On the other hand, choosing a very high value (e.g., 10,000 bp) reduces our power to infer rate changes, since smaller elements are not taken into account, and, under the influence of rearrangements, even a large import can often be split into small subfragments. We consider that a value of 100 bp represents a good middle ground between these two potential issues, but our results are robust to slightly different choices.

Model of genome evolution

Our model assumes that feature gain follows a compound Poisson process (Huelsenbeck et al. 2000). This means that acquisition of features follows a Poisson process, whose rate r+ can vary along the branches of the phylogeny T. A number c+ of changes in r+ are uniformly distributed on T, and the different values taken by r+ are independent from one another.
The loss of genetic features follows a similar (but fully independent) compound Poisson process with compound rate r− containing c− changes. All symbolic notations are summarized in Table 5 and an illustration of the model is given in Figure 1
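The likelihood of the compound rates r+ and r− of our model can be decomposed feature-by-feature, where L_j denotes the likelihood component of feature j:

$$P(D \mid r^{+}, r^{-}) = \prod_{j=1}^{f} L_j \qquad (1)$$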
To calculate this likelihood, let us first consider the probability g(u|v,r+,r−,l) of observing state u ∈ {0, 1} at the bottom of a branch of length l when state v ∈ {0, 1} is at the top, and r+ and r− are constant throughout the branch (i.e., there is no changepoint on the branch). This can be calculated by considering a two-state continuous time Markov chain with the transition matrix A = [1 − r+, r+; r−, 1 − r−]. Solving the Chapman-Kolmogorov equations for this process (Dynkin 1989) yields:
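The closed form is not reproduced here. As a minimal sketch, the function below uses the standard solution for a two-state continuous-time Markov chain, assuming that r+ is the instantaneous rate of moving from state 0 (absent) to state 1 (present) and r− the rate of the reverse move. The argument names and the handling of the degenerate case r+ = r− = 0 are illustrative choices.

```python
import math


def g(u, v, r_gain, r_loss, length):
    """Probability of state u at the bottom of a branch given state v at the top.

    Assumes constant rates on the branch: r_gain for 0 -> 1 and r_loss for 1 -> 0.
    """
    total = r_gain + r_loss
    if total == 0.0:
        # No gain and no loss: the state cannot change.
        return 1.0 if u == v else 0.0
    stationary_present = r_gain / total
    decay = math.exp(-total * length)
    # The probability of presence relaxes from v towards the stationary value.
    p_present = stationary_present + (v - stationary_present) * decay
    return p_present if u == 1 else 1.0 - p_present
```

Under this form, the probability of presence starts at v for a branch of length zero and tends to r+/(r+ + r−) as the branch grows, and transition probabilities over consecutive stretches of a branch compose as the Chapman-Kolmogorov equations require.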
Let us now consider the probability h(u|v,r+,r−,i) that a feature is in state u at node i in T, given that it is in state v at the parent node, and the values of r+ and r−. If there is no changepoint on the branch above the node i, then h(u|v,r+,r−,i) = g(u|v,r+,r−,l), where l is the length of the branch above node i. Otherwise, let c+i and c−i denote the number of changepoints on that branch for r+ and r−, respectively. This branch can then be decomposed into 1 + c+i + c−i successive segments of lengths {l_k}, k ∈ [1..1+c+i+c−i], on each of which both r+ and r− are constant. h(u|v,r+,r−,i) can therefore be calculated using the following dynamic programming procedure:
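One way to realize such a procedure is to propagate the distribution of the feature state down the branch, one constant-rate segment at a time. The sketch below is an assumption about how this can be done rather than the procedure as published: the `segments` representation (a list of `(length, r_gain, r_loss)` tuples ordered from the top of the branch to the bottom) is hypothetical, and `g` is the standard two-state solution with r+ taken as the 0 to 1 rate.

```python
import math


def g(u, v, r_gain, r_loss, length):
    """Two-state transition probability on a stretch with constant rates."""
    total = r_gain + r_loss
    if total == 0.0:
        return 1.0 if u == v else 0.0
    stationary_present = r_gain / total
    p_present = stationary_present + (v - stationary_present) * math.exp(-total * length)
    return p_present if u == 1 else 1.0 - p_present


def h(u, v, segments):
    """Probability of state u at the bottom of a branch given state v at the top.

    segments: list of (length, r_gain, r_loss), ordered from top to bottom,
    with both rates constant inside each segment.
    """
    # state_probs[w] = probability of state w at the bottom of the segments seen so far.
    state_probs = [0.0, 0.0]
    state_probs[v] = 1.0
    for length, r_gain, r_loss in segments:
        state_probs = [
            sum(state_probs[w] * g(x, w, r_gain, r_loss, length) for w in (0, 1))
            for x in (0, 1)
        ]
    return state_probs[u]
```

With a single segment this reduces to g, and cutting a segment in two without changing its rates leaves the result unchanged, which is a convenient sanity check.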
Given this method to calculate h(u|v,r+,r−,i), it is now possible to apply Felsenstein's pruning (Felsenstein 1973, 1981) to calculate the likelihood component Lj:
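The pruning recursion for a single feature can be sketched as follows. Several choices here are assumptions made for illustration: the tree is a nested dictionary in which each internal node lists `(child, segments)` pairs, leaves carry an observed `state`, and the root state is given a uniform prior, which may differ from the treatment of the root in GenoPlast.

```python
import math
from itertools import product


def g(u, v, r_gain, r_loss, length):
    """Two-state transition probability on a stretch with constant rates."""
    total = r_gain + r_loss
    if total == 0.0:
        return 1.0 if u == v else 0.0
    stationary_present = r_gain / total
    p_present = stationary_present + (v - stationary_present) * math.exp(-total * length)
    return p_present if u == 1 else 1.0 - p_present


def h(u, v, segments):
    """Transition probability along a branch made of constant-rate segments."""
    state_probs = [0.0, 0.0]
    state_probs[v] = 1.0
    for length, r_gain, r_loss in segments:
        state_probs = [
            sum(state_probs[w] * g(x, w, r_gain, r_loss, length) for w in (0, 1))
            for x in (0, 1)
        ]
    return state_probs[u]


def partial(node):
    """Return [P(leaves below | node absent), P(leaves below | node present)]."""
    if "state" in node:  # leaf: the observed state has probability one
        return [1.0 if node["state"] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, segments in node["children"]:
        below = partial(child)
        for v in (0, 1):
            result[v] *= sum(h(u, v, segments) * below[u] for u in (0, 1))
    return result


def feature_likelihood(root, root_prior=(0.5, 0.5)):
    """Likelihood of one feature; the uniform root prior is an assumption."""
    at_root = partial(root)
    return root_prior[0] * at_root[0] + root_prior[1] * at_root[1]


def example_tree(a, b, c):
    """Three-leaf tree with one changepoint in r+ on the branch above leaf b."""
    inner = {"children": [({"state": a}, [(0.2, 1.0, 0.5)]),
                          ({"state": b}, [(0.1, 1.0, 0.5), (0.1, 3.0, 0.5)])]}
    return {"children": [(inner, [(0.3, 1.0, 0.5)]), ({"state": c}, [(0.5, 1.0, 0.5)])]}


# The likelihoods of the eight possible leaf patterns sum to one.
total = sum(feature_likelihood(example_tree(*states)) for states in product((0, 1), repeat=3))
```

The full likelihood is then the product of `feature_likelihood` over all features, in practice accumulated as a sum of logarithms to avoid numerical underflow.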
Bayesian inference

We perform Bayesian inference under the model of genome evolution described above. This requires introduction of a prior πr for each of the different values taken by either r+ or r−, and a prior πc for the numbers c+ and c− of changepoints in r+ and r−. Using Bayes' theorem, the posterior distribution P(r+,r−|D) can then be decomposed as follows:
We use an MCMC algorithm in order to sample from the posterior distribution (Metropolis et al. 1953; Hastings 1970). However, because the dimensionality of r+ and r− depends on the number of changepoints, the dimensionality of the parameter space is not constant. We therefore use a reversible-jump MCMC (Green 1995, 2003). Our updating scheme uses two transdimensional jumps that propose to add and remove a changepoint to either r+ or r−. We also use a move to update the location of a changepoint on a branch, and a move to update the value associated with a changepoint in either r+ or r−. These moves are described in further detail in the Appendix.

Different uninformative priors for πr and πc were tested and found to have little effect on the posterior distributions for all three datasets. The results shown used πr = Exp(1) and πc = Poisson(1). For each data set, five occurrences of the MCMC were started at different points on the parameter space, chosen according to the prior distribution. Each MCMC was run for 200,000 iterations, the first half of which was discarded to avoid the influence of the starting point. Each iteration consists of an attempt at each of the moves described in the Appendix. Convergence of the MCMC was judged satisfactory in each case by manual comparison for the five runs of the trajectories of the likelihood, c+ and c−, as well as application of the Gelman-Rubin test (Gelman and Rubin 1992) for c+, c−, and the values taken by r+ and r− at the top, middle, and bottom of each branch in the phylogeny. The results presented below for each data set are based on a concatenation of the five instances of the MCMC for maximum robustness.

Sampling internal states

The location at which features are gained or lost is not explicitly included in our parametrization of the model in order to improve convergence and mixing rates of the MCMC. It is, however, often interesting to know which features have been gained or lost at different points on the phylogeny, and with which posterior probability. Here we show how this can be done by adding a few steps to the dynamic programming algorithm described above for the calculation of the likelihood. Note that this does not interfere in any way with the likelihood calculation, and does not represent a change of parametrization. In summary, after using a pruning algorithm in a first pass from bottom to top of T to calculate the likelihood as described above, it is possible to pass again through T from top to bottom in order to sample the state of each internal node (Hein 1989). This procedure is similar to the forward–backward algorithm of hidden Markov models (Rabiner 1989). For each node x of T, let cx,j be equal to one if node x has feature j and to zero otherwise. The following steps 4 and 5 are added in order to sample cx,j for all nodes:
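The two-pass idea (pruning from the leaves up, then sampling from the root down) can be illustrated for one feature with the sketch below. It is not a transcription of the steps of the original algorithm. For brevity it assumes constant rates over the whole tree (no changepoints), a uniform prior on the root state, and a hypothetical tree format in which each node is a dictionary with an optional `children` list, a `length` for the branch above it, and, for leaves, an observed `state`.

```python
import math
import random


def g(u, v, r_gain, r_loss, length):
    """Two-state transition probability on a branch with constant rates."""
    total = r_gain + r_loss
    if total == 0.0:
        return 1.0 if u == v else 0.0
    stationary_present = r_gain / total
    p_present = stationary_present + (v - stationary_present) * math.exp(-total * length)
    return p_present if u == 1 else 1.0 - p_present


def upward(node, r_gain, r_loss):
    """Bottom-up pass: store the likelihood of the leaves below for each node state."""
    if not node.get("children"):
        node["partial"] = [1.0 if node["state"] == s else 0.0 for s in (0, 1)]
        return
    node["partial"] = [1.0, 1.0]
    for child in node["children"]:
        upward(child, r_gain, r_loss)
        for v in (0, 1):
            node["partial"][v] *= sum(
                g(u, v, r_gain, r_loss, child["length"]) * child["partial"][u]
                for u in (0, 1)
            )


def downward(node, rng, r_gain, r_loss, parent_state=None, root_prior=(0.5, 0.5)):
    """Top-down pass: sample each internal state given its already sampled parent."""
    if not node.get("children"):
        return  # leaf states are observed, not sampled
    if parent_state is None:
        weights = [root_prior[s] * node["partial"][s] for s in (0, 1)]
    else:
        weights = [
            g(s, parent_state, r_gain, r_loss, node["length"]) * node["partial"][s]
            for s in (0, 1)
        ]
    node["state"] = rng.choices((0, 1), weights=weights)[0]
    for child in node["children"]:
        downward(child, rng, r_gain, r_loss, node["state"], root_prior)


# Usage: a feature present in two of three genomes.
tree = {"children": [
    {"length": 0.3, "children": [{"length": 0.2, "state": 1}, {"length": 0.2, "state": 1}]},
    {"length": 0.5, "state": 0},
]}
upward(tree, r_gain=1.0, r_loss=0.5)
downward(tree, random.Random(0), r_gain=1.0, r_loss=0.5)
```

Repeating the downward pass for every retained MCMC sample yields, for each branch, the posterior probability that a given feature was gained or lost on it.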
Acknowledgments

We thank Bob Mau and Nicole T. Perna for key insights that inspired this work. We also thank three anonymous reviewers for useful comments, ideas, and discussion. This work was funded in part by Wellcome Trust Grant WT082930MA. X.D. was supported by a research fellowship from the Centre for Research in Statistical Methodology (CRiSM). A.E.D. was supported by NSF grant DBI-063075. D.F. was supported by the Science Foundation of Ireland, grant no. 05/FE1/B882.

Appendix

Markov chain Monte Carlo moves

The moves presented below are accepted according to the Metropolis-Hastings-Green ratio, that is, with probability equal to the product of a likelihood ratio, a prior ratio, a proposal ratio, and a Jacobian, capped at one:

$$\alpha = \min\left(1,\; L_R \times P_R \times Q_R \times J\right)$$
LR is equal to the ratio of likelihoods after and before the proposed move and can be calculated using Equation 1. The values of PR, QR, and J are given in each of the move descriptions below.

Move an existing changepoint in r+ along a branch of T

In this move, one of the c+ changepoints of r+ is uniformly chosen. We propose to update the age t of the changepoint to t′, which is drawn uniformly on the branch to which the changepoint belongs. This proposal distribution ensures that proposing to move the age of the changepoint from t to t′ is as likely as proposing to move it from t′ to t, so that QR = 1. Furthermore, the model assumes a uniform distribution of the changepoints on T, so that PR = 1. Finally, since this jump does not change the dimensionality of the parameter, we have J = 1.

Update a value in r+

In this move, one of the (c+ + 1) values taken by r+ is uniformly chosen and proposed to be updated by adding u to it, where u ~ Unif([− Z; ]). If the new value is out of the domain of definition of r+, the move is automatically rejected. Proposing to move from the old to the new value is as likely as proposing to move from the new to the old value, so that QR = 1. Furthermore, PR = πr(r + u)/πr(r) and J = 1.

Add/remove a changepoint in r+

This move first decides to add or remove a changepoint, each with probability one half. To add a changepoint, a point x is chosen uniformly on the branches of T, and the value t of r+ associated with the new changepoint is drawn from πr. To remove a changepoint, one of the c+ existing changepoints is chosen uniformly and removed. If no changepoint exists, the removing update is always rejected. Since the age of a new changepoint and its associated value are drawn from a proposal distribution, the Jacobian J is equal to one, even though this move is transdimensional (Troughton and Godsill 1998; Dellaportas et al. 2002; Lopes and West 2004). If the move proposes to add a new changepoint at x with associated value t, we have:
If the move proposes to remove an existing changepoint x, we have:
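As an illustration, the two fixed-dimension moves described in this appendix (moving a changepoint along its branch and updating a value of r+) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the GenoPlast implementation: the function names, the `log_lik` and `log_prior` callables (standing in for Equation 1 and for log πr), the representation of a branch as an interval of ages, the `half_width` parameter for the uniform perturbation, and the default assumption that rates must be positive are all hypothetical.

```python
import math
import random


def mh_accept(log_lr, log_pr=0.0, log_qr=0.0, log_j=0.0, rng=random):
    """Accept with probability min(1, LR * PR * QR * J), evaluated in log space."""
    log_ratio = log_lr + log_pr + log_qr + log_j
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


def move_changepoint(ages, branch_bounds, values, log_lik, rng=random):
    """Propose a new age for one changepoint, drawn uniformly on its branch.

    ages[i] is the age of changepoint i and branch_bounds[i] = (lo, hi) is the
    age interval of the branch carrying it (hypothetical representation).
    For this move PR = QR = J = 1, so only the likelihood ratio matters.
    """
    if not ages:
        return ages  # nothing to move
    i = rng.randrange(len(ages))
    lo, hi = branch_bounds[i]
    proposal = list(ages)
    proposal[i] = rng.uniform(lo, hi)
    log_lr = log_lik(proposal, values) - log_lik(ages, values)
    return proposal if mh_accept(log_lr, rng=rng) else ages


def update_value(ages, values, log_lik, log_prior, half_width,
                 in_domain=lambda v: v > 0, rng=random):
    """Propose adding u ~ Unif(-half_width, half_width) to one value of the rate.

    The proposal is symmetric, so QR = 1 and J = 1; PR is the prior ratio.
    The positivity check in `in_domain` is an assumption of this sketch.
    """
    i = rng.randrange(len(values))
    new_value = values[i] + rng.uniform(-half_width, half_width)
    if not in_domain(new_value):
        return values  # out-of-domain proposals are rejected outright
    proposal = list(values)
    proposal[i] = new_value
    log_lr = log_lik(ages, proposal) - log_lik(ages, values)
    log_pr = log_prior(new_value) - log_prior(values[i])
    return proposal if mh_accept(log_lr, log_pr, rng=rng) else values
```

Working with log ratios avoids numerical underflow when likelihoods are very small. The transdimensional add/remove move is left out of the sketch because its PR and QR terms depend on the prior on the number of changepoints.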
Footnotes

[Supplemental material is available online at www.genome.org. The GenoPlast software is freely available from http://go.warwick.ac.uk/genoplast/.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.082263.108.