PLoS Comput Biol. Aug 2009; 5(8): e1000455.
Published online Aug 7, 2009. doi:  10.1371/journal.pcbi.1000455
PMCID: PMC2713424

Identifying Currents in the Gene Pool for Bacterial Populations Using an Integrative Approach

Philip E. Bourne, Editor

Abstract

The evolution of bacterial populations has recently become considerably better understood due to large-scale sequencing of population samples. It has become clear that DNA sequences from a multitude of genes, as well as a broad sample coverage of a target population, are needed to obtain a relatively unbiased view of its genetic structure and the patterns of ancestry connected to the strains. However, the traditional statistical methods for evolutionary inference, such as phylogenetic analysis, are associated with several difficulties under such an extensive sampling scenario, in particular when a considerable amount of recombination is anticipated to have taken place. To meet the needs of large-scale analyses of population structure for bacteria, we introduce here several statistical tools for the detection and representation of recombination between populations. Also, we introduce a model-based description of the shape of a population in sequence space, in terms of its molecular variability and affinity towards other populations. Extensive real data from the genus Neisseria are utilized to demonstrate the potential of an approach where these population genetic tools are combined with a phylogenetic analysis. The statistical tools introduced here are freely available in the BAPS 5.2 software, which can be downloaded from http://web.abo.fi/fak/mnf/mate/jc/software/baps.html.

Author Summary

The study of bacterial population biology is complicated by the fact that, although bacteria are largely asexual, they can also exchange genetic material through homologous recombination. Unlike in eukaryotes, recombination in bacteria is not an obligatory process. Furthermore, the recombination mechanisms are subject to many biological and ecological factors that can vary even within different populations of the same species. Although increasing evidence for homologous recombination has been found in many bacterial species, determining the frequency of recombination and understanding the influence that it exerts upon the evolution of bacterial populations remains challenging. In this article, we provide a dynamic picture of recombination within and between closely related bacterial species. Through an integration of several Bayesian statistical models, our method highlights the importance of a quantitative estimation of recombination. Our analyses of a challenging multi-locus sequence typing (MLST) database demonstrate that combined analyses using traditional phylogenetic methods, explorative MLST tools and Bayesian population genetic models can together yield interesting biological insights that cannot easily be reached by any of the approaches alone.

Introduction

It has become increasingly evident that recombination plays a major role in shaping the genetic structure of bacterial populations. Whether or not certain populations (as defined by allele frequencies) are more likely than others to undergo recombination, either as donors or recipients of DNA, is not well understood, though there are several biological reasons why this might be the case. Such preferential recombination, which we may intuitively describe as currents in the gene pool [1], should lead to a greater degree of admixture between the populations in question, and this should be detectable using DNA sequence data. Conceptually related investigation of highways of gene sharing among bacterial species at a general level was done by [2], who found evidence for uneven distribution of transfer intensity among groups of prokaryotes.

Discovery of such gene flow currents is scientifically interesting in its own right, as a means for characterizing populations and reflecting upon accumulated taxonomic understanding of their heterogeneity. However, there are other potential uses for detailed knowledge concerning the genetic structure of a bacterial population, e.g. when it can be connected to patterns of virulence and antibiotic resistance.

Statistical analysis of molecular variation and reproductive isolation in natural populations is in many cases far more challenging for bacteria than for eukaryotic organisms, due to difficulties in acquiring broad-coverage samples and the putatively complex admixture events [3]. Traditional population genetic tools for inferring genetic barriers within a population, such as $F_{ST}$ measures [4], are not usually applicable to bacterial molecular data given the lack of relevant populations to condition the calculations on, although some exceptions exist (see, e.g. [5]). Standard phylogenetic analyses, on the other hand, may provide a distorted view of the ancestral relationships among bacteria when recombination events are sufficiently common in a population. Moreover, they do not yield a detailed and easily interpretable picture of the patterns of admixture and eventual genetic barriers, as such constructs are not present in the standard phylogenetic models that can be routinely applied to large data sets. However, an algorithmic approach to phylogenetic analysis which can build networks for hundreds of taxa and can be useful for data sets harbouring recombination was introduced by [6]. A model-based phylogenetic method (ClonalFrame) that deals explicitly with recombination was introduced by [7]; however, it does not easily scale up to the level of population complexity we are interested in here, due to the extreme computational intensity of the model fitting for large databases.

With the above-mentioned difficulties, it is hardly surprising that a Bayesian statistical approach based on explicit admixture models has recently gained popularity in studies of bacterial populations [8],[9]. Such models are anchored in the general idea of a probabilistic partition, where an unknown origin of an arbitrary quantity (for example, the membership of an individual) is inferred through the conditional probability of the origin over the range of putative alternatives (commonly referred to as clusters), given the observed features of the quantity. Application of such partition models has been made possible by a class of generic Markov chain Monte Carlo (MCMC) algorithms [10], that can be used for fitting the models to molecular data.

Despite the success of the standard MCMC approach in a variety of studies of bacterial populations (see e.g. [11]), it is clear, both theoretically and practically, that the performance of the standard MCMC computation decreases rapidly as the complexity of the estimated population structure and the size of the investigated data set increase [10],[12]. To address this, an array of methods has been introduced and implemented in the software BAPS [13]–[16]. Here we introduce a graphical characterization of recombination patterns from MLST data using a weighted network with statistically identified populations as cluster nodes and estimated average levels of DNA transition as relative gene flow weights. Also, we introduce a model-based representation of the molecular variability of populations and their affinities towards each other. We refer to this as the genetic shape of an identified population. However, it is important to notice that a population identified by BAPS may have a different interpretation in different evolutionary contexts. The BAPS models aim to identify molecular evidence that links a particular group of strains together in terms of sufficiently similar nucleotide frequencies. Thus, such a population may for instance arise in the analysis due to common ancestry within a clonal complex. In contrast, a population can also be identified from the traces left by recombination events which have imposed considerable gene flow between separate lineages of strains. Also, under certain circumstances a more heterogeneous population may arise analogously to long branch attraction in phylogenetics, in particular when very limited numbers of strains from the corresponding lineages are present among the analyzed samples. All three cases are illustrated in our analyses.
As BAPS is capable of capturing a variety of distinct biological signals hidden in molecular data, interpretation of the identified populations must be done with care, using preferably both complementary phylogenetic methods and auxiliary knowledge about the strains under investigation.

To illustrate the levels of complexity at which our methods can operate, we consider a population sample of 5086 strains that have been identified as Neisseria meningitidis and Neisseria lactamica species. We also present analyses of simulated data to demonstrate the potential of our Bayesian approach to handle large databases and complex genetic population structures. Our analyses illustrate that biological insights into complex data are best gained by combining several complementary methods of analysis.

Materials and Methods

A stochastic model of gene flow in bacterial populations

Assume that the target population consists of $k$ genetically distinct populations, among which the extent of gene flow is to be modeled. Usually, $k$ and the genetic population structure associated with it are a priori unknown. In our statistical approach presented later we consider in detail the inference of these from molecular data. Here we aim to estimate the strength of gene flow via a stochastic characterization of the rates of admixture between the $k$ identified populations.

Let $c, c' = 1, \ldots, k$, index the $k$ populations and let $p_{cc'}$ represent for the population $c$ the probability of a strain acquiring DNA from bacteria present in the population $c'$. DNA acquisition could be understood as an aggregated result of the currently known mechanisms (conjugation, transduction and transformation). Conditional on the probability $p_{cc'}$, it is possible to consider a sample of $n_c$ unrelated strains from population $c$ to represent $n_c$ Bernoulli trials, where the binary outcome $y_{icc'}$, $i = 1, \ldots, n_c$, refers to the success/failure of DNA acquisition from this particular source. These are obviously considerably simplifying assumptions, but they allow us to characterize patterns of admixture. Were the outcomes $y_{icc'}$, $i = 1, \ldots, n_c$, known, the relative admixture could simply be characterized by $\sum_{i=1}^{n_c} y_{icc'} / n_c$. However, we note that $y_{icc'}$ in reality represents intrinsically unobservable latent events during some interval of the evolutionary time scale under consideration.

Assuming that a particular strain within population $c$ has acquired DNA from the population $c'$ (i.e. $y_{icc'} = 1$), we may attempt to quantify the intensity with which such events have occurred over the analysed sequence. A multitude of statistical break-point models designed to capture such recombination traces have been introduced in the literature, e.g. [17]–[19]. For such models the focus has typically been on a small number of short viral genomes, to identify the locations where putative recombination events have taken place. In the most basic form, recombination may be represented by a homogeneous spatial Poisson process $R_{icc'}(t)$, where the events correspond to the number of recombinations within the genome of a strain $i$, such that the DNA is acquired from the population $c'$. It follows for such a process that the stochastic variable $R_{icc'}(L)$, with $L$ equal to the total length of the considered sequence, has the Poisson distribution

$$P\{R_{icc'}(L) = r\} = \frac{(\lambda_{cc'} L)^{r}}{r!}\, e^{-\lambda_{cc'} L}, \quad r = 0, 1, 2, \ldots \tag{1}$$

where $\lambda_{cc'}$ represents the average rate of events in which DNA is imported from population $c'$ to $c$. Again, if the outcomes $R_{icc'}(L)$ were observed, the average rate could be statistically quantified, e.g. as $\hat{\lambda}_{cc'} = \sum_{i} R_{icc'}(L) / (L \sum_{i} y_{icc'})$, by using the maximum likelihood estimate.
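
The two simple estimators mentioned above (the observed fraction of strains that acquired DNA, and the maximum likelihood estimate of a Poisson intensity) can be sketched in Python as follows. This is a minimal illustration under the idealized assumption that the latent outcomes were observed, which the text stresses is not the case in practice. The function names and the toy data are hypothetical and are not part of BAPS.

```python
import math

def poisson_pmf(r, rate, length):
    """P(count = r) for a homogeneous Poisson process with intensity `rate`
    observed over a sequence of total length `length`."""
    mean = rate * length
    return math.exp(-mean) * mean ** r / math.factorial(r)

def acquisition_fraction(y):
    """Relative admixture: fraction of the Bernoulli trials (strains) with outcome 1."""
    return sum(y) / len(y)

def poisson_rate_mle(counts, length):
    """Maximum likelihood estimate of the intensity: total number of events
    divided by the total sequence length over which they were observed."""
    return sum(counts) / (len(counts) * length)

# Hypothetical toy data: 10 strains of one population, 3 of which acquired
# DNA from a given source population (y = 1).
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
# Hypothetical recombination counts for the 3 strains with y = 1.
counts = [2, 1, 3]
seq_length = 3000  # hypothetical concatenated sequence length in bp

p_hat = acquisition_fraction(y)
lam_hat = poisson_rate_mle(counts, seq_length)
```

With these toy numbers the estimated intensity corresponds to an expected two events per sequence, and `poisson_pmf(r, lam_hat, seq_length)` gives the implied probability of observing `r` events.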

To arrive at a statistical characterization of the rates of admixture among the populations under the above framework, let $P$ denote a $k \times k$ matrix of probabilities, such that the element $(c, c')$ equals $p_{cc'}$. Further, let the $k \times k$ matrix $\Lambda$, with the elements $\lambda_{cc'}$, represent collectively the Poisson intensities. Let $G = (V, E)$ be a directed graph with the $k$ populations as the node set $V$, and $E \subseteq V \times V$ as the arc set. Each arrow $(c', c)$ in $E$ can now be associated with a weight $w_{cc'}$ depicting the rate of admixture from $c'$ to $c$. For instance, a gene flow weight matrix $W$ can be defined in terms of the elementwise matrix product $P \circ \Lambda$, with the convention that the diagonal elements $w_{cc}$ are normalized by the other elements on the $c$th row of $W$. When an element $w_{cc'}$ equals zero, it is natural to set $(c', c) \notin E$, i.e. the corresponding arrow is absent in $G$.
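
The construction of the weighted directed graph described above can be illustrated with a short Python sketch. It assumes the acquisition probabilities and the Poisson intensities are given as nested lists, forms the off-diagonal weights as elementwise products, and lists an arc only where the weight is positive. The normalization convention for the diagonal is not implemented here, since the text does not spell it out in enough detail. All names and numbers are hypothetical.

```python
def gene_flow_weights(P, Lam):
    """Off-diagonal elementwise product of the probability matrix P and the
    intensity matrix Lam. Row index = recipient population, column index =
    source population. Diagonal entries are left at zero (assumption: the
    diagonal normalization convention is not reproduced in this sketch)."""
    k = len(P)
    W = [[0.0] * k for _ in range(k)]
    for c in range(k):
        for d in range(k):
            if c != d:
                W[c][d] = P[c][d] * Lam[c][d]
    return W

def arcs(W):
    """Arcs (source, recipient, weight) of the directed graph; an arc is
    absent whenever the corresponding weight is zero."""
    k = len(W)
    return [(d, c, W[c][d]) for c in range(k) for d in range(k)
            if c != d and W[c][d] > 0]

# Hypothetical toy values for three populations.
P = [[0.0, 0.2, 0.0],
     [0.1, 0.0, 0.5],
     [0.0, 0.0, 0.0]]
Lam = [[0.0, 3.0, 1.0],
       [2.0, 0.0, 4.0],
       [1.0, 1.0, 0.0]]

W = gene_flow_weights(P, Lam)
edge_list = arcs(W)
```

In this toy configuration the third population receives no DNA from the others, so no arrow points into it, while it still acts as a donor to the second population.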

It follows from the definition of $W$ that these parameters remain unidentifiable when the events $y_{icc'}$ are unobserved, as a suitable rescaling of the model configuration can yield identical likelihoods. The statistical challenge related to this context is further accentuated by the fact that the underlying genetic structure, i.e. the number of underlying populations $k$ as well as their molecular characteristics, is unknown a priori. A modern Bayesian statistical framework utilizing state-of-the-art MCMC computation can in principle be thought to provide a suitable setting for fitting such models to MLST sequence data. However, the computational complexity associated with the models suggests that formal posterior inferences would remain beyond the bounds of computational tractability even for only moderately sized population samples. This is crucial, as to study such problems we require large samples with a broad coverage of the genetic variation in the underlying population. Therefore, we consider here an approximate inference strategy to estimate $W$, which is computationally manageable for large samples, while still providing a reasonable statistical characterization of parameters that can be interpreted in terms of $P$ and $\Lambda$ in the above model formulation.

A Bayesian mixture model for the genetic structure of a population

Assume we have a sampled set of $n \times b$ aligned DNA sequences $x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, b$, from $b$ genomic regions in $n$ bacterial strains. A concatenated sequence for a strain $i$ is denoted by $x_i$ and $x$ refers jointly to all the DNA sequence data from the $n$ strains. For any subset $s$ of strains from $N = \{1, \ldots, n\}$, the notation $x^{(s)}$ will be used for the DNA data observed for these strains.

Let $S = (s_1, \ldots, s_k)$ be a partition of the $n$ strains representing an underlying genetic structure (i.e. a representation of a genetic mixture model), with the clusters $s_c$, $c = 1, \ldots, k$, corresponding to genetically distinct populations. Hereafter we will use the terms ‘cluster’ and ‘population’ interchangeably. Mathematically, $S$ ($1 \le k \le n$) is a collection of subsets of $N$, such that $\bigcup_{c=1}^{k} s_c = N$ and $s_c \cap s_{c'} = \emptyset$, for all $c, c' = 1, \ldots, k$, $c \neq c'$. Symbol $\mathcal{S}$ defines the space of all such partitions for a given $N$. For any partition $S$, cardinalities of the populations are denoted by $|s_c|$, $c = 1, \ldots, k$.

In a series of earlier works [12],[13], various stochastic partition models have been introduced for Bayesian inference about genetic population structure based on different types of molecular information. The mathematical motivation of the stochastic partition approach was recently derived by [20]. Under these models, the biological hypothesis corresponding to any particular partition $S$ states that the strains allocated to the same cluster represent a sample from a genetically distinct population, and thus, the partition provides a qualitative representation of the underlying genetic population structure.

Let $p(S)$ denote the a priori uncertainty about the underlying genetic structure in terms of a probability distribution over the space $\mathcal{S}$. Then, we may specify the probability measure

$$p(x, S) = p(x \mid S)\, p(S), \tag{2}$$

where $p(x \mid S)$ is the marginal likelihood of the sequence data given the structure. The posterior distribution of $S$ given the sequence data is determined by Bayes' rule according to

$$p(S \mid x) = \frac{p(x \mid S)\, p(S)}{\sum_{S' \in \mathcal{S}} p(x \mid S')\, p(S')}. \tag{3}$$

Here we use a Bayesian estimate of the genetic structure provided by the posterior mode

$$\hat{S} = \arg\max_{S \in \mathcal{S}} p(S \mid x), \tag{4}$$

or possibly separately for a range of such estimates identified by stochastic optimization, if the molecular data do not decisively support a single structure. Methods to numerically obtain such estimates have been introduced by [12],[20].

The marginal likelihood for the observed sequence data given any structure $S$ has under the stochastic partition framework the following product form

$$p(x \mid S) = \prod_{c=1}^{k} p(x^{(s_c)}), \tag{5}$$

which encapsulates a symmetry among the underlying populations $s_1, \ldots, s_k$, as any specific labeling of them without further auxiliary information would not be possible.

However, to explicitly specify the terms $p(x^{(s_c)})$ a number of assumptions are required. Here we exploit the genetic linkage model developed by [15] to provide an explicit characterization of the terms in (5) for MLST type sequence data. The linkage model captures dependencies in the sequence data in terms of a Markovian model for each gene, such that each population has its own nucleotide frequency parameters, the joint prior distribution of which factorizes according to the Markovian model. Utilizing the standard results for so-called hyper-Markov probability laws for multinomial-Dirichlet distributions, it is possible to calculate the marginal likelihood analytically given any value of $S$. This result is of importance, because it enables the development of an efficient learning algorithm which avoids Monte Carlo errors associated with the nucleotide frequency parameters in the populations specified by the genetic structure model. However, it should be noted that because the genetic mixture model operates at the level of sequence data, it is vulnerable to misalignments of the sequences, as are other comparable statistical methods.
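
To make the product-form marginal likelihood and the posterior mode estimate concrete, the following Python sketch scores every partition of a tiny toy alignment and returns the best one. It is a deliberately simplified stand-in, not the BAPS algorithm: it assumes independent alignment columns with a symmetric multinomial-Dirichlet prior (ignoring the Markovian linkage structure), a uniform prior over partitions, and exhaustive enumeration instead of stochastic optimization, which is only feasible for a handful of strains. The hyperparameter value, function names and data are hypothetical.

```python
import math

ALPHABET = "ACGT"

def log_marginal_cluster(seqs, alpha=0.25):
    """Analytic log marginal likelihood of the sequences in one cluster,
    assuming independent sites, each with a symmetric Dirichlet(alpha) prior
    on the nucleotide frequencies (a simplification of the linkage model)."""
    a0 = alpha * len(ALPHABET)
    total = 0.0
    for site in zip(*seqs):  # iterate over alignment columns
        counts = [site.count(base) for base in ALPHABET]
        total += math.lgamma(a0) - math.lgamma(a0 + sum(counts))
        total += sum(math.lgamma(alpha + n) - math.lgamma(alpha) for n in counts)
    return total

def set_partitions(items):
    """Generate all partitions of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):  # add `first` to an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part      # or open a new block for it

def posterior_mode(seqs):
    """Partition maximizing the product of cluster marginal likelihoods;
    under a uniform prior on partitions this is the posterior mode."""
    best_partition, best_score = None, -math.inf
    for partition in set_partitions(list(range(len(seqs)))):
        score = sum(log_marginal_cluster([seqs[i] for i in block])
                    for block in partition)
        if score > best_score:
            best_partition, best_score = partition, score
    return best_partition, best_score

# Hypothetical toy alignment: two pairs of identical strains.
toy = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "CCCCCCCCCC"]
mode, log_score = posterior_mode(toy)
```

Because the number of partitions grows super-exponentially with the number of strains, exhaustive enumeration of this kind is usable only for illustration, which is one reason stochastic optimization is needed for realistic sample sizes.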

Statistical characterization of admixture

Given a plausible representation of the underlying genetic population structure based on (4), our aim is to obtain a model-based characterization of the rates of admixture between the populations, such that an estimate may be derived for the gene flow weight matrix $W$. This sequential estimation strategy is motivated by the observation that joint estimation of $S$ and the extent of admixture leads to problems with statistical identifiability and over-fitting discussed by [14]. In particular, they described a property of the admixture models which enables an increase in the number of populations without necessarily increasing the effective number of parameters (allele frequencies) in the model. This is in contrast with genetic mixture models, where such an increase always occurs as a function of the number of populations, thus resolving the problem with weak identifiability and/or high dependence of the inferences on the particular prior distribution used in the analysis.

The most recent version of the BAPS software (5.2) contains an implementation of the admixture estimation algorithm introduced by [16] under the linkage model of [15]. Here we use this procedure to estimate the extent of admixture among the populations.

Let $\hat{S}$ be an estimate of the genetic structure underlying the sample according to (4), and let $q_i = (q_{i1}, \ldots, q_{ik})$ be a vector of admixture coefficients representing the proportion of the genome of strain $i$ having ancestry in the corresponding populations, respectively. Let $p_c(x_{ij})$ be the joint probability of the data from the $j$th region for strain $i$ under population $c$. Then, the admixture model likelihood for the data in $x_i$ is determined by

$$p(x_i \mid q_i, \hat{S}) = \prod_{j=1}^{b} \sum_{c=1}^{k} q_{ic}\, p_c(x_{ij}). \tag{6}$$
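
As a rough illustration of an admixture likelihood of this product-of-mixtures form, the Python sketch below evaluates the log-likelihood of one strain for given admixture coefficients and finds maximizing coefficients with a plain EM iteration for mixture proportions. EM is a swapped-in, generic technique chosen for brevity: it yields a maximum likelihood estimate and does not reproduce the posterior mode estimation with simulation used in BAPS. The per-region probabilities are hypothetical inputs that would in practice come from the fitted population models.

```python
import math

def admixture_log_likelihood(q, region_probs):
    """Log-likelihood of one strain: sum over regions of the log of the
    q-weighted mixture of per-population probabilities.
    region_probs[j][c] = probability of region j's data under population c."""
    return sum(math.log(sum(qc * pc for qc, pc in zip(q, probs)))
               for probs in region_probs)

def em_admixture(region_probs, iterations=200):
    """EM updates for the mixture proportions q with fixed component
    probabilities; each update cannot decrease the likelihood."""
    k = len(region_probs[0])
    q = [1.0 / k] * k  # start from equal ancestry proportions
    for _ in range(iterations):
        resp_sums = [0.0] * k
        for probs in region_probs:
            denom = sum(qc * pc for qc, pc in zip(q, probs))
            for c in range(k):
                # posterior probability that this region derives from population c
                resp_sums[c] += q[c] * probs[c] / denom
        q = [r / len(region_probs) for r in resp_sums]
    return q

# Hypothetical example: 7 regions, 2 populations; 5 regions fit population 0
# far better, 2 regions fit population 1 far better.
region_probs = [[1e-3, 1e-9]] * 5 + [[1e-9, 1e-3]] * 2
q_hat = em_admixture(region_probs)
```

In this toy case the estimated coefficients are close to the fraction of regions best explained by each population, which matches the interpretation of the coefficients as genome proportions.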

The marginal posterior mode estimates of $q_i$ are obtained by numerical maximization combined with a simulation, to account for the uncertainty about the population-specific nucleotide frequencies given the partition estimate $\hat{S}$. As illustrated by [14], the posterior distribution of $q_i$ does not entirely plausibly represent the statistical uncertainty about $q_i$, as the strain coefficients $q_{ic}$ may in some cases have a mode in the interval from ~0.1 to ~0.2, while still reflecting only random fluctuations in the nucleotide frequencies in the populations, in contrast to real ancestry in a particular population, say $c$. The issue was solved in [14] by calculating simulation-based $p$-values for $q_i$ under the null hypothesis of no admixed ancestry. In the sequel, let $p_i$ denote such a $p$-value for a strain $i$.

We now combine the statistical tools from [14] and [15] to obtain an estimate $\hat{W}$ of $W$. Firstly, populations are estimated using the posterior mode partition $\hat{S}$. Then, for each identified population $c$, the extent of admixture events is estimated for $c' = 1, \ldots, k$, $c' \neq c$, using a significance level $\alpha$, such that

$$\hat{w}_{cc'} = \frac{1}{|\hat{s}_c|} \sum_{i \in \hat{s}_c} \hat{q}_{ic'}\, I(p_i \le \alpha), \tag{7}$$

where [formula e128] is an indicator function equal to one when the argument is true, and zero otherwise. For the population [formula e129], the estimate (7) is the average relative amount of (significant) DNA acquired from population [formula e130], thus representing a combination of an average recombination intensity and the propensity of recombination events taking place between these populations. It should be noted that the admixture model ignores possible contiguity of genes or genome regions. However, the genes present in MLST analyses tend to represent quite distant genome regions, which motivates the assumption of independence. If the genes are taken from a more contiguous region, it is possible to treat them as a single linked region in the model by concatenating the sequences prior to the genetic mixture analysis. Notice also that the above gene flow estimates can be complemented by investigating separately the rate and size of the exchange events. The rate of exchange events is represented by the proportion of strains in a population showing significant admixture from a particular source. In turn, the size of the exchange events is revealed by characterizing the values of the corresponding estimated admixture coefficients.
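
As an illustration of how an estimate of the kind described for (7) could be computed from per-strain admixture output, the following minimal Python sketch averages the significant admixture coefficients over the strains of each recipient population. The function name, the argument layout and the choice to leave the diagonal at zero are assumptions made for this example; the sketch does not reproduce the BAPS 5.2 implementation.

```python
def gene_flow_weights(membership, admixture, p_values, alpha=0.05):
    """Average significant admixture per (source, recipient) population pair.

    membership[i]   : index of the population to which strain i is assigned
    admixture[i][a] : estimated fraction of strain i's DNA from population a
    p_values[i]     : p-value of strain i under the no-admixture null
    Returns weights[source][recipient]; the diagonal is left at zero
    (an assumption of this sketch).
    """
    n_pops = len(admixture[0])
    totals = [[0.0] * n_pops for _ in range(n_pops)]
    counts = [0] * n_pops
    for recipient, coeffs, p in zip(membership, admixture, p_values):
        counts[recipient] += 1
        if p <= alpha:  # indicator: only significant strains contribute
            for source, q in enumerate(coeffs):
                if source != recipient:
                    totals[source][recipient] += q
    return [
        [totals[s][r] / counts[r] if counts[r] else 0.0 for r in range(n_pops)]
        for s in range(n_pops)
    ]
```

Dividing by the full population size, rather than by the number of significant strains, keeps both the rate and the size of the exchange events in a single number, in line with the interpretation given above.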

The latest version of BAPS (5.2) contains an implementation of the estimation procedure leading to (7), such that high-resolution images of the directed graph [formula e131] with the associated weight matrix [formula e132] can be produced directly with the software. As illustrated in the RESULTS section, this facilitates the analysis of large data sets for which the numerical estimates of admixture can be very tedious to examine.

Genetic shapes of the populations

The statistical models and tools presented above provide means for assessing the number of genetically isolated populations and the extent of recombination among them. However, this leaves open questions related to the underlying genetic population structure. In particular, the model summary estimates do not provide any information on the area occupied by the population in sequence space, which we term its genetic shape. By a genetic shape we refer both to the molecular heterogeneity present in a population and to the genetic affinities of its members towards other identified populations. We will illustrate that an investigation of the genetic shapes in this sense can yield useful characterizations of the population, pinpoint interesting subgroups of strains, and eventually provide clues to relate the genetic structure to some auxiliary information.

Let [formula e133] be the estimated genetic population structure and [formula e134] the structure where the strain [formula e135] has been moved to the population [formula e136]. The relative genetic affinity of the strain [formula e137] towards population [formula e138] can be quantified in terms of the change in the log-predictive likelihoods

equation image
(8)

which is always non-negative given that we have identified the true posterior optimum (4). However, even when [formula e140] does not equal the global posterior optimum, our estimation algorithm is designed in such a way that negative values of (8) cannot be obtained, as any parameter configurations in the neighborhood of [formula e141] leading to an improvement of the posterior probability will be detected.

The difference (8) can be interpreted as the amount of information we lose in the prediction of the molecular characteristics in [formula e142] when the strain is assigned to another population, given that the remaining population structure is kept fixed. Thus, at the boundary, when (8) is equal to zero, no information will be lost. From the genetics perspective, the distribution of the values [formula e143] reflects the genetic shape of the population [formula e144] towards the population [formula e145]. It is clear that this shape does not necessarily have an easily interpretable geometric configuration in low enough dimensions (1–3), such that it can be visualized. However, the shape of the distribution of [formula e146] can still be used to reveal patterns of interest, which is illustrated in the RESULTS section.

To numerically characterize the genetic shape of a population using the values of (8) for [formula e147], we use a kernel density estimate of the underlying distribution of the affinity measures. This is implemented in BAPS 5.2, which outputs graphical displays of the density curves. These are based on the standard Gaussian kernel with the Gaussian optimal bandwidth [formula e148] (see, e.g. [21]) according to

equation image
(9)

where [formula e150], and further [formula e151] is the maximum likelihood estimate of the standard deviation of the [formula e152] values. Such density curves will provide useful information concerning both the within- and between-population molecular variation as well as affinity.
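
A Gaussian kernel density estimate of this type can be sketched in a few lines of Python. The sketch below assumes that the "Gaussian optimal bandwidth" is the usual normal-reference rule h = (4/(3n))^(1/5) σ, with σ the maximum likelihood (divide-by-n) standard deviation; the exact expression is not recoverable from the text here, so this is an assumption, and the function name is hypothetical.

```python
import math


def gaussian_kde(values):
    """Return (density, h) for a Gaussian-kernel density estimate.

    Assumes the normal-reference bandwidth h = (4 / (3 n))**(1/5) * sigma,
    where sigma is the maximum likelihood estimate of the standard deviation.
    """
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # ML estimate
    if sigma == 0.0:
        raise ValueError("all values are identical; bandwidth is undefined")
    h = (4.0 / (3.0 * n)) ** 0.2 * sigma

    def density(x):
        # Average of Gaussian kernels centred at the observed values.
        kernel_sum = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        return kernel_sum / (n * h * math.sqrt(2.0 * math.pi))

    return density, h
```

Evaluating `density` on a grid of non-negative affinity values gives a curve of the kind used to display genetic shapes; mass concentrated near zero corresponds to strains with little loss in predictive likelihood when moved to the other population.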

Simulated data

The simulated MLST data sets were generated by assuming a tentative gene flow graph [formula e153] whose weight matrix [formula e154] varies randomly. The gene flow graph estimated by BAPS 5.2, denoted as [formula e155], was then compared with [formula e156] to evaluate the prediction accuracy. The characteristics of the genetic shapes for the identified populations were also investigated for a wide range of scenarios given by [formula e157].

The assumptions for the data generation are based on a simplified, yet reasonable evolutionary model of bacterial populations. We assumed that each population has a common local ancestor, and that further back in time these local ancestors originated from a common ancestor of the whole population, termed the global ancestor. This assumption enables a tree representation of the evolutionary relationships among the populations (Figure 1). It is important to note that we do not explicitly model the time at which these ancestral events occurred, and therefore the edges in Figure 1 have arbitrary lengths.

Figure 1
Graphical representation of the evolutionary model for a sample of two bacterial populations.

When ignoring recombination, the strains in a population will differ from each other only through the accumulation of point mutations. The mutations may have accumulated in two consecutive stages. In what follows, we refer to a mutation that occurred prior to the local ancestors as a stage-1 mutation, and a mutation that occurred afterwards as a stage-2 mutation. We assumed the infinite-site model of mutation, which implies that at most one mutation per site can occur in the DNA sequences [22]. This would imply that stage-1 mutations provide heterogeneity that leads to population diversification, while stage-2 mutations generate variations within a population. We further assumed that these two types of mutations occur independently of each other and result in a number of segregating sites. Let [formula e158] denote the total sequence length; then the expected number of segregating sites is [formula e159], where [formula e160] and [formula e161] are the mutation rates for the two stages.

To simplify the problem, we considered recombinations that always lead to changes of DNA, so that they can be detected as admixture events. This corresponds to assuming that the DNA introduced by admixture needs to be distinctive compared to the homologous sites that have been observed within the population. Given the tree representation of the population evolution, such recombinations are restricted to occur at stage-1 mutation sites only.

Following the assumptions described above, the expected numbers of stage-1 and stage-2 mutation sites can be obtained by

equation image

and

equation image

where [formula e164] is the expected number of segregating sites and [formula e165] is the ratio of the two mutation rates. We assumed that stage-1 spans a longer time than stage-2, and therefore set [formula e166]. For data simulation we used [formula e167] and [formula e168], and set an equal population size [formula e169] for all the populations. A simulated population data set thus contains [formula e170] sites and [formula e171] strains, where [formula e172] is the number of populations.

We specified a putative gene flow graph [formula e173] that consists of [formula e174] populations, with the arrow set specified in Figure 2. The rates of admixture between populations are characterized in the matrix [formula e175], which is by definition a product of [formula e176] and [formula e177]. Therefore, by simulating [formula e178] and [formula e179] we can generate a parameter set in [formula e180] that conforms to the graph structure in Figure 2. We chose a consistent sampling scheme for [formula e181] such that the diagonal elements [formula e182] for [formula e183], and the non-diagonal elements [formula e184] are uniformly distributed. [formula e185] is also sampled from the uniform distribution [formula e186], but with the row constraints [formula e187], since [formula e188] refers to the fraction of DNA sequence acquired from a particular source population.

Figure 2
Tentative gene flow graph in six populations.

Sampling a data set according to the putative population structure consists of three steps. First, a global ancestor of 500 segregating nucleotides was randomly simulated, and for each of the six populations a local ancestor was generated by randomly altering each nucleotide of the global ancestor with probability 0.8. Second, the sample strains for each population were generated by randomly mutating the local ancestor with probability 0.2. Third, the strains that have been recombined were randomly selected according to the parameter [formula e192], and for each recombined strain the actual amount of recombination was determined by [formula e193].
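
The first two sampling steps can be sketched as follows in Python. This is a minimal illustration, not the simulation code used in the study: the function names, the default number of strains per population and the rule that an altered site always receives a different nucleotide are assumptions. The recombination step is omitted because its parameters are not recoverable from the text here.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(sequence, probability, rng):
    """Alter each site independently with the given probability.

    Assumption: an altered site is replaced by a different nucleotide.
    """
    out = []
    for base in sequence:
        if rng.random() < probability:
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_populations(n_pops=6, n_sites=500, n_strains=10,
                         p_local=0.8, p_strain=0.2, seed=0):
    """Simulate a global ancestor, local ancestors and sample strains.

    n_strains is a hypothetical default; the other defaults follow the text.
    """
    rng = random.Random(seed)
    global_ancestor = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
    populations = []
    for _ in range(n_pops):
        # Step 1: local ancestor derived from the global ancestor.
        local = mutate(global_ancestor, p_local, rng)
        # Step 2: sample strains derived from the local ancestor.
        strains = [mutate(local, p_strain, rng) for _ in range(n_strains)]
        populations.append((local, strains))
    return global_ancestor, populations
```

Fixing the seed makes a simulated data set reproducible, which is convenient when the same population structure is to be analysed under several recombination settings.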

Using the procedure described above, a population data set can be simulated for each selection of [formula e194] and [formula e195]. The population structure analysis was done with our Bayesian framework implemented in BAPS 5.2. We report the accuracy of the BAPS partition for different values of [formula e196] and [formula e197]. When the true partition was identified correctly, we further assessed the accuracy of the predicted gene flow structure, i.e. the similarity of graph topology between [formula e198] and [formula e199]. Note that the non-diagonal elements [formula e200] in [formula e201] determine the propensity of acquiring DNA through recombination from [formula e202] to [formula e203]; a larger [formula e204] therefore implies that the recombination would affect a higher proportion of strains in [formula e205]. An increasing admixture propensity would make the recombination unidentifiable, since our Bayesian framework tends to favor the alternative hypothesis that the allelic frequency at the recombination site is more likely an effect of the stage-2 mutations, rather than a consequence of admixture. We therefore expected a negative correlation between [formula e206] and the gene flow structure accuracy. We also expected that in order to obtain a reliable partition estimate, the non-diagonal elements [formula e207] in [formula e208] should be near zero, since a small rate of recombination along the DNA sequences might not perturb the population structure on a large scale. A large [formula e209], however, implies frequent recombination that might blur the population boundaries, so that the original population structure could no longer be identified.

Real data

To illustrate the presented methods with a real data set, we applied BAPS 5.2 to MLST bacterial data. MLST (multi-locus sequence typing) is an approach to the unambiguous characterization of bacterial strains. The internal sequences of seven housekeeping genes (abcZ, adk, aroE, fumC, gdh, pdhC and pgm) are obtained and unambiguously characterize the strain. The strain sequences are generally reported to the publicly accessible MLST strain databases (see, e.g. http://www.mlst.net), which currently host a fast-growing number of bacterial genera and also a few eukaryotic organisms.

We chose the Neisseria species for validation, as homologous recombination is known to be frequent in both N. meningitidis and N. lactamica [23],[24]. Furthermore, occasional horizontal gene flow over the species boundary has also been observed [25]. However, it is unclear to what extent the gene flow occurs and what its consequences are for population structure. To investigate this, we applied BAPS 5.2 to an MLST sample which contains 4823 strains of N. meningitidis and 263 strains of N. lactamica. The data were accessed for analysis from the Neisseria MLST database on 17/3/2006 [26].

For such MLST-type sequence data, we utilized the genetic linkage model [15] to account for dependency between neighboring nucleotide bases. To assess the ability of our methods to recover, exactly or approximately, the populations hidden in the data, we considered a simulation scenario that generates a bootstrap sample containing strains randomly selected from a sub-collection of the identified Neisseria populations, using the following procedure:

  1. Decide the number of clusters [formula e210] in the bootstrap data. We consider five scenarios, where in each scenario [formula e211] is one of 5, 10, 15, 20 and 25.
  2. Select randomly [formula e212] clusters without replacement from the identified population structure based on the complete data.
  3. For each chosen cluster, sample with replacement a random number of strains. The number of sampled strains follows the uniform distribution in the range of (30, 80).
  4. Cluster the data generated in step 3 using BAPS 5.2.
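
Steps 1 to 3 of this procedure can be sketched in Python as follows. The function name and the dictionary layout of the clusters are hypothetical, and the sketch assumes that the number of sampled strains is an integer drawn uniformly with both endpoints (30 and 80) included. Step 4, the clustering itself, is performed in BAPS 5.2 and is not reproduced here.

```python
import random


def bootstrap_sample(clusters, n_clusters, rng, min_size=30, max_size=80):
    """Draw one bootstrap data set from an identified population structure.

    clusters   : dict mapping a cluster label to its list of strain identifiers
    n_clusters : number of clusters to include (step 1)
    Returns a list of (cluster label, strain identifier) pairs.
    """
    # Step 2: choose clusters without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for label in chosen:
        # Step 3: sample a uniformly distributed number of strains,
        # with replacement, from the chosen cluster.
        size = rng.randint(min_size, max_size)
        sample.extend((label, s) for s in rng.choices(clusters[label], k=size))
    return sample
```

Keeping the source cluster label alongside each sampled strain makes it straightforward to compare the partition obtained in step 4 with the partition from the complete data.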

By repeating the scenario multiple times (we use 5 repeats) for each [formula e213], we can check how well the resulting partition agrees with the chosen [formula e214] and how close the partition is to the general setting. This approach allows us to investigate the statistical power to correctly detect populations when the number of available strains is quite limited per population.

Conditional on the identified population structure, comparative rates of admixture between the populations can be further estimated and summarized in a gene flow graph. We plotted the genetic shapes of several populations in N. meningitidis which show significant gene flow towards the N. lactamica species, and also compared their similarity in terms of admixture tendency.

To investigate whether the signals of admixture varied considerably over the seven genes, we performed a bootstrap analysis where a single gene at a time was excluded when inferring the rates of admixture. The analysis was performed conditional on the clusters identified using the original complete data set. The relative importance of each gene could then be tentatively summarized by calculating for each cluster the average changes in incoming and outgoing gene flow while treating the estimates from the complete data as a baseline.

Phylogenetic analysis of the Neisseria data was performed using MEGA v.4.0.2 [27]. A Neighbor-Joining (NJ) tree was constructed with the maximum composite likelihood model assuming rate uniformity and pattern homogeneity. eBURST analysis of the Neisseria data was performed using the default options in the online version 3 available at http://eburst.mlst.net [28].

Results

Simulated data

We report the partition accuracy with respect to different choices of [formula e215] and [formula e216] under a constant population size [formula e217] in one scenario and [formula e218] in another. The partition accuracy measured by the Rand Index (RI) (see e.g. [29]) is summarized as a grey-scale map (Figure 3). In the presence of a small amount of admixture, i.e. [formula e219], the tentative population structure can be identified with high accuracy. As the recombination rate increases over a critical threshold, e.g. as [formula e220] for the current setting, the partition accuracy drops quickly. Therefore, a higher recombination rate, indicated by a lower [formula e221], would imply a lower partition stability. Such an observation matches our expectation that an excessive amount of admixture tends to obscure the putative population structure.
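
For reference, the Rand Index between two partitions of the same strains is the fraction of strain pairs on which the partitions agree, i.e. pairs placed together in both or apart in both. A minimal Python sketch follows; the function name is hypothetical, and the quadratic pair enumeration is chosen for clarity rather than speed.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand Index between two partitions given as per-strain label lists."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # pair treated alike by both partitions
        total += 1
    return agree / total
```

The index equals one when the two partitions are identical up to a relabelling of the clusters, which is why it is suitable for comparing an estimated partition with the simulated one.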

Figure 3
Testing partition accuracy for different choices of gene flow weights for a small population size [formula e222] (upper panel) and a large population size [formula e223] (lower panel).

We may look further into the gene flow graph prediction only if the genetic structure (i.e. the true partition) is correctly identified. We used the Hamming distance as a measure of gene flow structure accuracy, and the result is shown in Figure 4. The gene flow graph structure can be satisfactorily discovered when [formula e230] and [formula e231]. However, a negative correlation between [formula e232] and [formula e233] was also noticeable. This result suggests that if admixture affects a population through a small proportion of strains, then the chances of its correct estimation by BAPS 5.2 are high. In contrast, admixture that affected most of the strains is more likely to be ascribed to variation arising within the population by mutation. These observations are in harmony with the investigation of the effect of recombination intensity on the emergence of distinct populations for a bacterial species in [3]. Extensive levels of recombination will act as a cohesive force keeping populations together as a large gene pool, which consequently prevents the statistical detection of the recombination in terms of such a population genetic model as investigated here. This is entirely reasonable, because any substantial genetic population boundaries will not exist under such circumstances, and consequently, recombinations over population boundaries are not meaningfully defined, let alone detectable by a statistical model. Moreover, if certain parts of the data are too weak for reliable admixture inferences due to very small population cardinalities in the genetic mixture estimate, it is possible to leave the admixture coefficients undetermined for them using the option available in BAPS, as discussed in [14]. The extensive simulation study performed by [30] showed that the BAPS inferences about the genetic structure were generally sensible from a phylogenetic perspective, even in the presence of recombination events, provided that the data are at least reasonably informative. With very weakly informative molecular data, it cannot be expected that any detailed statistical population genetic model would provide highly accurate estimates of the population characteristics.
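
One way to compute a Hamming distance between a true and an estimated gene flow graph is to count the ordered population pairs for which an arrow is present in one graph but absent in the other. The Python sketch below takes this reading; the function name, the use of weight matrices as input and the presence threshold are assumptions made for illustration.

```python
def graph_hamming_distance(weights_true, weights_est, threshold=0.0):
    """Count the arrows present in exactly one of two gene flow graphs.

    weights_*[i][j] is the weight of the arrow from population i to j;
    an arrow counts as present when its weight exceeds the threshold.
    Diagonal entries are ignored.
    """
    k = len(weights_true)
    distance = 0
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            present_true = weights_true[i][j] > threshold
            present_est = weights_est[i][j] > threshold
            if present_true != present_est:
                distance += 1
    return distance
```

A distance of zero means the two graphs have identical arrow sets, regardless of how closely the individual weights agree.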

Figure 4
Testing gene flow structure accuracy for [formula e234] and [formula e235].

We used a simulated data set to illustrate genetic shapes, represented by kernel density estimates (9) of the affinity values (8). The data set was generated with [formula e236] and [formula e237]. Figure 5 shows the estimated genetic shapes using population 2 as the reference, as compared to the other five populations. It can be seen from Figure 5 that the influence of admixture between the populations is also reflected in the genetic shapes. For example, the density curves for population 1 (red) and for population 3 (blue) are more shifted towards zero than those of the other populations, and hence imply a closer relationship to population 2. This is not surprising since population 2 is a common donor of DNA to populations 1 and 3 (Figure 2). On the other hand, the density curve for population 3 appears to have two modes, which is a feature exhibited by neither population 1 nor any other population. Note that population 3 is the only population which donates DNA to population 2. We might use the bi-modality of a density curve as a potential indicator of gene flow to the reference population.

Figure 5
Genetic shapes of five populations relative to population 2.

The Neisseria data

In total 32 BAPS populations are identified, where three populations (numbered as 8, 29 and 32) belong to the N. lactamica species and the remaining 29 populations are labeled as N. meningitidis species. To assess the robustness of the identified population structure, the partition determined using the whole data set was compared with the partition using bootstrap data generated according to the simulation scenarios. Figure 6 shows the adjusted Rand Index as a result of the comparison. Our partition method is able to identify the population structure with good accuracy, even though the performance may decrease as the complexity level of the data increases and when the number of available strains per population is quite low. It should be noted that the number of strains in the bootstrap samples was typically much smaller than the number of strains assigned to a particular population in the analysis of the original data. This illustrates that the population identification becomes highly stable when the sample sizes are sufficiently large.

Figure 6
Bootstrap mixture analyses of the Neisseria data.

The results of admixture analysis for the Neisseria data set are summarized in Figure 7. The graph was obtained by fixing the admixture significance threshold [formula e243] at 0.05 and then pruning the arrows with gene flow strength below 0.03. It can be seen from the grey box highlighted in Figure 7 that two admixture arrows that imply inter-species gene flow remain significant, where two of the N. meningitidis populations (11 and 19) are constantly influencing the genetic makeup of one population of N. lactamica (population 29). The admixture arrows are uniformly directed from N. meningitidis towards N. lactamica, implying that N. meningitidis might donate genetic material to N. lactamica, while gene flow in the reverse direction is not supported by the analysis.

Figure 7
Gene flow network identified in the N. meningitidis and N. lactamica populations.

The admixture estimates for the 32 clusters obtained under the bootstrap analysis over the genes are summarized in Table 1. To see how much the exclusion of a particular gene changes the estimates, we may look at the overall consistency of the inferred average outgoing and incoming gene flow using the complete data case as reference. The exclusion of gene gdh seems to affect the admixture consistency most, as in this case the changes of outgoing (incoming) gene flow reach the maximum value in 20 (13) of the 32 clusters. In contrast, none of the clusters experiences its largest gene flow change when either the gene adk or aroE is excluded, suggesting that recombination signals on these two genes are more marginal.

Table 1
Bootstrap admixture analyses of the Neisseria data.

We plotted in Figure 8 the estimated densities of [formula e249] for the three N. lactamica populations (8, 29, 32), relative to N. meningitidis populations 11 and 19 separately. The densities for population 29 have a tendency towards zero, suggesting a close genetic affinity with populations 11 and 19. In contrast, the densities of populations 8 and 32 are much further away from zero, implying a distinctive difference in their genetic makeup compared to N. meningitidis populations 11 and 19. This result is consistent with the admixture pattern presented in Figure 7.

Figure 8
Genetic shapes of N. lactamica populations 8, 29 and 32 relative to N. meningitidis populations 11 and 19.

The eBURST analysis of the Neisseria database resulted in 253 groups and 1165 singleton strains. The biggest group consists of 795 strains and there are additionally four groups containing more than 200 strains. Table 2 shows the degree of concordance between the eBURST groups and BAPS populations. Due to the very large number of eBURST groups, only groups containing at least 20 strains were included in this comparison. Table 2 shows that the largest eBURST groups harbour many strains from multiple BAPS populations, whereas the vast majority of strains in smaller groups are typically found only in a single BAPS population (in some cases in two populations). As these cases represent single-locus variants of one another from eBURST analysis being clustered into different populations by BAPS, it means that there must exist a very large amount of anomalous variation at the nucleotide level within the other locus to allow the model to identify such subgroups. It should also be kept in mind that the BAPS model used for the identification of these populations is not a phylogenetic method in contrast to eBURST, which is an important distinction particularly in the presence of highly recombinogenic data. Out of the three BAPS populations of N. lactamica strains (populations 8, 29 and 32), only one forms a group in the eBURST analysis (group 14, Table 2). Strains in the other populations are primarily assigned to singleton groups. This difference is further explored below using a phylogenetic analysis at the nucleotide level.

Table 2
The number of strains that are jointly assigned to a BAPS population and an eBURST group.
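
A cross-tabulation of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function name `concordance_table`, the input format (one `(eBURST group, BAPS population)` pair per strain) and the toy labels are hypothetical and are not taken from the Neisseria database or from the BAPS or eBURST software. Only the size threshold of 20 strains follows the text.

```python
from collections import Counter, defaultdict

def concordance_table(assignments, min_group_size=20):
    """Cross-tabulate eBURST groups against BAPS populations.

    assignments: iterable of (eburst_group, baps_population) pairs,
                 one pair per strain (hypothetical input format).
    Returns {eburst_group: {baps_population: strain_count}}, restricted
    to eBURST groups containing at least `min_group_size` strains.
    """
    assignments = list(assignments)
    # Size of every eBURST group, used for the inclusion threshold.
    group_sizes = Counter(group for group, _ in assignments)
    table = defaultdict(Counter)
    for group, population in assignments:
        if group_sizes[group] >= min_group_size:
            table[group][population] += 1
    return {group: dict(counts) for group, counts in table.items()}

# Toy data with made-up labels: G1 spans two populations, G2 sits in one,
# and G3 is too small to be included in the comparison.
toy = [("G1", 3)] * 15 + [("G1", 7)] * 10 + [("G2", 8)] * 22 + [("G3", 5)] * 4
table = concordance_table(toy)
for group, counts in sorted(table.items()):
    print(group, counts)
```

In such a table, a row with counts spread over several columns corresponds to an eBURST group that harbours strains from multiple BAPS populations, while a row with a single non-zero entry corresponds to a group contained in one population.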

To facilitate comparison of the phylogenetic analysis with the partition yielded by BAPS, we labelled strains with colors indicating population memberships. However, given the large number of strains included in the analysis and the large number of populations inferred by BAPS, it would be very challenging to visually extract information from a single NJ tree harbouring all the populations simultaneously. Therefore, four separate NJ trees are displayed in Figures 9–12, each of which shows a subset of the BAPS populations indicated with distinct colors. The strains remaining outside this particular subset are indicated by white circles. Since it is difficult to specify more than approximately 20 colors which remain clearly distinguishable from each other, independent coloring schemes were used for each tree to show the phylogenetic composition of the populations. Thus, it is not possible to compare the color codes directly with those in the gene flow network in Figure 7. The color coding scheme for the populations is shown in Figure 13 to enable comparison of the phylogenetic analysis and the gene flow network.

Figure 9
The first NJ tree.
Figure 10
The second NJ tree.
Figure 11
The third NJ tree.
Figure 12
The fourth NJ tree.
Figure 13
Color coding scheme used in the BAPS populations and the NJ trees.

The assignment of the populations to the NJ trees reveals that while a considerable number of them form relatively tight groups of lineages, there are also many populations in which the strains are spread over several separate lineages in the tree. This illustrates the dilution of phylogenetic signals in the presence of considerable levels of recombination between populations of strains. Population 14, which according to the inferred gene flow network is the most prominent recipient of genetic material from a multitude of sources, is seen (Figure 10) to include some dense and relatively large groups of strains that are found in separate parts of the tree. In addition, this population harbors a number of tiny groups of strains scattered over the tree.

In the BAPS analysis, strains identified as N. lactamica fell into three populations: 8, 29 and 32. Figure 7 indicates that we found no evidence for significant admixture involving populations 8 and 32. Population 29, however, was found to be associated with variation characteristic of populations 11 and 19, which were composed of meningococcal strains. The positions of the STs composing these five BAPS populations and one other (8, 29, 32, 19, 11 and 20) are shown in Figure 7. The isolated status of population 8 is apparent as a well-resolved group, whereas the recombinant status of population 29 is clear from the way its STs are scattered around the tree, with long branches apparently originating separately from the main N. lactamica population. The role of the meningococcal strains in populations 11 and 19 is evident, in that the recombinant N. lactamica strains (population 29, shown in red) apparently originate close to these populations in the main meningococcal radiation.

Population 32 is intermediate on the tree between the majority of N. lactamica strains and the main meningococcal radiation. Hence these STs may be considered as examples of the so-called fuzzy fringes which have been proposed for recombinogenic species [25]. As noted, however, they were not associated with significant admixture in the estimated gene flow network (Figure 7). Close examination of population 32 shows that 4 of the 22 STs in the population exhibited significant admixture with population 20 (shown in blue), receiving on average 12.3% of their ancestry from this population (which is composed of strains identified as meningococcus). It is interesting to note that populations 32 and 20 adjoin each other in the tree.
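
A per-population summary such as "4 of the 22 STs, receiving on average 12.3%" can be derived from per-ST admixture estimates as sketched below. This is a minimal illustration, not the BAPS implementation: the record layout (keys `st`, `p_value`, `proportions`), the significance threshold `alpha=0.05` and the toy values are all assumptions introduced for the example.

```python
from statistics import mean

def summarize_admixture(records, source, alpha=0.05):
    """Summarize significant admixture from one source population.

    records: list of dicts with hypothetical keys
        "st"          sequence type identifier
        "p_value"     p-value of the admixture test for that ST
        "proportions" {source_population: estimated ancestry proportion}
    alpha: significance threshold (an assumption, not taken from the paper).
    Returns (n_significant, n_total, mean_proportion), where mean_proportion
    is the average share received from `source` among the significant STs,
    or None if no ST qualifies.
    """
    significant = [
        r for r in records
        if r["p_value"] <= alpha and r["proportions"].get(source, 0.0) > 0.0
    ]
    shares = [r["proportions"][source] for r in significant]
    return len(significant), len(records), (mean(shares) if shares else None)

# Toy records with made-up values; population 20 is the putative source.
toy_records = [
    {"st": "ST-a", "p_value": 0.01, "proportions": {20: 0.25}},
    {"st": "ST-b", "p_value": 0.03, "proportions": {20: 0.125}},
    {"st": "ST-c", "p_value": 0.40, "proportions": {20: 0.05}},
    {"st": "ST-d", "p_value": 0.90, "proportions": {}},
]
n_sig, n_total, avg_share = summarize_admixture(toy_records, source=20)
print(f"{n_sig} of {n_total} STs significantly admixed; mean share {avg_share:.1%}")
```

Averaging only over the significantly admixed STs, as done here, explains how a population can contain admixed strains without appearing as admixed at the population level in a gene flow network.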

Discussion

In the present work we have introduced statistical tools, implemented in the BAPS software, that enable analyses of bacterial population structures on an unprecedented scale; the computational complexity of the earlier standard Bayesian methods prevents their application to large databases associated with complex patterns of admixture. This is particularly important when at least a moderate number of recombination events have plausibly taken place in a population, as a statistically valid characterization of the population structure then requires fairly extensive sampling of strains. It was also noted that a standard MCMC-based approach is not expected to yield a viable strategy for such analyses in practice, owing to both time constraints and the statistical accuracy of the resulting estimates. The BAPS analysis (inference about the populations and the levels of admixture) of the Neisseria database was completed within roughly 95 CPU hours on a standard PC with a 2.8 GHz Pentium 4 processor. As a comparison, our initial experiments with the STRUCTURE software [8] suggest conservatively that a comparable analysis would have taken at least several thousand CPU hours on the same machine. Moreover, the convergence problems associated with the Gibbs sampler algorithm when applied to mixture models (e.g. [10]) suggest that statistically reliable estimates of the population structure are likely not accessible for data sets of such a high degree of complexity.

The presented methods can be effectively utilized in a variety of contexts, where the genetic population structure is relevant, e.g. for the investigation of epidemiological questions and experimentally derived features of bacterial strains. For instance, outlying groups with specific characteristics with respect to virulence or antibiotic resistance may be detected from large population samples.

The concept of a genetic shape, which was introduced here to represent the molecular variability of an identified population and its affinity towards other populations as a whole, is an intriguing characteristic associated with considerable potential for further theoretical research. Namely, the average change of log predictive likelihoods between populations can also be interpreted as the change in ‘free energy’ associated with a gene flow event. The larger this quantity, the more likely it is that the ‘reaction’ of gene flow could occur spontaneously. However, it is not trivial to determine the minimal energy level that triggers such events. From the analysis of simulated data (results not shown) we expect that such an energy threshold depends on the identified population sizes. In particular, the analogy with physics-based characterizations of molecular interaction systems could yield mathematical ways to predict horizontal transfer events.
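
The idea of a change in log predictive likelihood can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration rather than the BAPS formulation: strains are tuples of allele labels, loci are treated as independent, each locus has a symmetric Dirichlet prior with pseudo-count `alpha`, and the sign convention (log predictive under the other population minus log predictive under the strain's own population) is chosen so that values near zero indicate affinity and strongly negative values indicate distinctness.

```python
import math
from collections import Counter

def log_predictive(strain, reference, n_alleles, alpha=1.0):
    """Log predictive probability of `strain` given the `reference` strains.

    Assumes independent loci and a symmetric Dirichlet(alpha) prior over
    `n_alleles` possible alleles per locus (illustrative assumptions).
    """
    total = 0.0
    for locus, allele in enumerate(strain):
        counts = Counter(ref[locus] for ref in reference)
        total += math.log(
            (counts[allele] + alpha) / (len(reference) + alpha * n_alleles)
        )
    return total

def shape_values(population, other, n_alleles, alpha=1.0):
    """Per-strain change in log predictive likelihood when a strain is
    scored under `other` instead of its own population (leave-one-out)."""
    values = []
    for i, strain in enumerate(population):
        rest = population[:i] + population[i + 1:]  # own population minus strain
        own = log_predictive(strain, rest, n_alleles, alpha)
        moved = log_predictive(strain, other, n_alleles, alpha)
        values.append(moved - own)
    return values

# Toy two-locus data with made-up allele labels.
pop_a = [(1, 1), (1, 1), (1, 2)]
close = [(1, 1), (1, 2)]      # shares alleles with pop_a
distant = [(3, 3), (3, 4)]    # shares no alleles with pop_a

mean_close = sum(shape_values(pop_a, close, n_alleles=4)) / len(pop_a)
mean_distant = sum(shape_values(pop_a, distant, n_alleles=4)) / len(pop_a)
print(mean_close, mean_distant)
```

Under these assumptions the per-strain values for the genetically similar population cluster near zero, whereas those for the dissimilar population are clearly negative. A kernel density estimate of such values (e.g. following [21]) would then give a curve playing the role of a shape density.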

Although the integrated approach advocated here provides a feasible means of handling data from thousands of strains and a multitude of genes, several issues remain. Firstly, if a very large number of genes are considered, it is likely that not all of them will be present or functional throughout a heterogeneous population at the genus level. Under such circumstances it would be necessary to develop further the Bayesian model for population structure and admixture to take into account that not all molecular information is shared by the sampled strains. Secondly, the scalability of the stochastic learning algorithms should be improved to ensure that models can still be fitted to data without access to supercomputing facilities. Given the present rate of improvement in sequencing facilities, it is likely that the need for such large-scale analyses will be a reality within a relatively short time-span. In order to meet these needs in the future, we are currently investigating several theoretical approaches to develop further the statistical population genetic tools available in BAPS.

The findings from the combined phylogenetic and population genetic analyses suggest possible events of convergence between N. lactamica and N. meningitidis that have arisen on multiple occasions and have occurred for clearly separate lineages of the two species. As the former is a non-pathogen and N. meningitidis represents a pathogen of considerable importance in human health, exchanges of genetic material between them might have consequences for our understanding of their evolution. Moreover, the diversity and the extent of recombination indicated among the N. meningitidis populations highlight that it is necessary to consider these pathogens as a heterogeneous population, and that multiple pathways of evolution may arise among them as a response to treatment strategies, including antibiotics and vaccines, as also recently discussed in [31]. For details concerning the currently available meningococcal vaccines, see, e.g. [32] and [33]. In contrast to some of the previous studies of recombination and population structure in meningococci (e.g. [34]–[36]), where only very limited sample sizes were considered, we have here focused on the detailed exploration of a more extensive database using multiple model-based statistical tools. In summary, our combined results clearly illustrate the possibility of using large-scale MLST sequence data to draw attention to currents in the gene pool, i.e. specific populations that seem more likely to undergo recombination, including recombination with different species. More detailed exploration of such groups of strains could then shed new light on the mechanisms that shape the joint evolution of pathogens and non-pathogens sharing ecological niches.

Footnotes

The authors have declared that no competing interests exist.

This work was financially supported by the ComBi graduate school and grant no. 121301 from Academy of Finland. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Feil EJ, Spratt BG. Recombination and the population structures of bacterial pathogens. Annu Rev Microbiol. 2001;55:561–590.
2. Beiko RG, Harlow TJ, Ragan MA. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A. 2005;102:14332–14337.
3. Fraser C, Hanage WP, Spratt BG. Recombination and the nature of bacterial speciation. Science. 2007;315:476–480.
4. Hartl D, Clark AG. Principles of Population Genetics, Fourth edition. Sunderland, MA: Sinauer Associates; 2007.
5. Whitaker RJ, Grogan DW, Taylor JW. Geographic barriers isolate endemic populations of hyperthermophilic archaea. Science. 2003;301:976–978.
6. Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. 2004;21:255–265.
7. Didelot X, Falush D. Inference of bacterial microevolution using multilocus sequence data. Genetics. 2007;175:1251–1266.
8. Falush D, Stephens M, Pritchard J. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587.
9. Falush D, Wirth T, Linz B, Achtman M. Traces of human migrations in Helicobacter pylori populations. Science. 2003;299:1582–1585.
10. Robert C, Casella G. Monte Carlo Statistical Methods, Second edition. New York: Springer; 2005.
11. Sheppard SK, McCarthy ND, Falush D, Maiden MCJ. Convergence of Campylobacter species: implications for bacterial evolution. Science. 2008;320:237–239.
12. Corander J, Gyllenberg M, Koski T. Bayesian model learning based on a parallel MCMC strategy. Stat Comput. 2006;16:355–362.
13. Corander J, Waldmann P, Marttinen P, Sillanpaa M. BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004;20:2363–2369.
14. Corander J, Marttinen P. Bayesian identification of admixture events using multi-locus molecular markers. Mol Ecol. 2006;15:2833–2843.
15. Corander J, Tang J. Bayesian analysis of population structure based on linked molecular information. Math Biosci. 2007;205:19–31.
16. Corander J, Marttinen P, Sirén J, Tang J. Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics. 2008;9:539.
17. Suchard M, Weiss R, Dorman K, Sinsheimer J. Inferring spatial phylogenetic variation along nucleotide sequence: a multiple changepoint model. J Am Stat Assoc. 2003;98:427–437.
18. Minin VN, Dorman KS, Fang F, Suchard MA. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics. 2005;21:3034–3042.
19. Chan C, Beiko R, Ragan M. Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics. 2006;7:412.
20. Corander J, Gyllenberg M, Koski T. Random partition models and exchangeability for Bayesian identification of population structure. Bull Math Biol. 2007;69:797–815.
21. Silverman B. Density Estimation for Statistics and Data Analysis. London: Chapman and Hall; 1986.
22. Hudson RR. Gene genealogies and the coalescent process, volume 7. New York: Oxford University Press; 1990. pp. 1–44.
23. Holmes EC, Urwin R, Maiden MCJ. The influence of recombination on the population structure and evolution of the human pathogen Neisseria meningitidis. Mol Biol Evol. 1999;16:741–749.
24. Alber D, Oberkotter M, Suerbaum S, Claus H. Genetic diversity of Neisseria lactamica strains from epidemiologically defined carriers. J Clin Microbiol. 2001;39:1710–1715.
25. Hanage WP, Fraser C, Spratt BG. Fuzzy species among recombinogenic bacteria. BMC Biol. 2005;3:6.
26. Jolley KA, Chan MS, Maiden MCJ. mlstdbnet - distributed multi-locus sequence typing (MLST) databases. BMC Bioinformatics. 2004;5:86.
27. Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007;24:1596–1599.
28. Feil EJ, Li B, Aanensen DM, Hanage WP, Spratt BG. eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J Bacteriol. 2004;186:1518–1530.
29. Yeung KY, Ruzzo WL. Principal component analysis for clustering gene expression data. Bioinformatics. 2001;17:763–774.
30. Marttinen P, Baldwin A, Hanage WP, Dowson C, Mahenthiralingam E, et al. Bayesian modeling of recombination events in bacterial populations. BMC Bioinformatics. 2008;9:421.
31. Maiden MCJ. Population genomics: diversity and virulence in the Neisseria. Curr Opin Microbiol. 2008;11:467–471.
32. Vu DM, Welsch JA, Zuno-Mitchell P, Cruz JVD, Granoff DM. Antibody persistence 3 years after immunization of adolescents with quadrivalent meningococcal conjugate vaccine. J Infect Dis. 2006;193:821–828.
33. Mascioni A, Bentley BE, Camarda R. Structural basis for the immunogenic properties of the meningococcal vaccine candidate LP2086. J Biol Chem. 2009;284:8738–8746.
34. Jolley KA, Kalmusova J, Feil EJ, Gupta S, Musilek M, et al. Carried meningococci in the Czech Republic: a diverse recombining population. J Clin Microbiol. 2000;38:4492–4498.
35. Jolley KA, Kalmusova J, Feil EJ, Gupta S, Musilek M, et al. Carried meningococci in the Czech Republic: a diverse recombining population. J Clin Microbiol. 2002;40:3549–3550.
36. Jolley KA, Wilson DJ, Kriz P, Mcvean G, Maiden MCJ. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol. 2005;22:562–569.

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science