# Detection of recombination events in bacterial genomes from large population samples

^{1,}

^{*}William P. Hanage,

^{2}Nicholas J. Croucher,

^{3}Thomas R. Connor,

^{3}Simon R. Harris,

^{3}Stephen D. Bentley,

^{3}and Jukka Corander

^{4,}

^{5}

^{1}Department of Biomedical Engineering and Computational Science (BECS), Aalto University, P.O. Box 12200, FI-00076 AALTO, Finland,

^{2}Center for Communicable Disease Dynamics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA,

^{3}The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK,

^{4}Department of Mathematics, Abo Akademi University, Piispankatu 8, FI-20500 Turku, Finland and

^{5}Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FI-00014 University of Helsinki, Finland

## Abstract

Analysis of important human pathogen populations is currently under transition toward whole-genome sequencing of growing numbers of samples collected on a global scale. Since recombination in bacteria is often an important factor shaping their evolution by enabling resistance elements and virulence traits to rapidly transfer from one evolutionary lineage to another, it is highly beneficial to have access to tools that can detect recombination events. Multiple advanced statistical methods exist for such purposes; however, they are typically limited either to only a few samples or to data from relatively short regions of a total genome. By harnessing the power of recent advances in Bayesian modeling techniques, we introduce here a method for detecting homologous recombination events from whole-genome sequence data for bacterial population samples on a large scale. Our statistical approach can efficiently handle hundreds of whole genome sequenced population samples and identify separate origins of the recombinant sequence, offering an enhanced insight into the diversification of bacterial clones at the level of the whole genome. A data set of 241 whole genome sequences from an important pandemic lineage of *Streptococcus pneumoniae* is used together with multiple simulated data sets to demonstrate the potential of our approach.

## INTRODUCTION

Recombination is recognized in bacteria as an important evolutionary force by which segments of DNA in the recipient organism are replaced by foreign segments from a donor cell. Recombination most commonly takes place between closely related bacterial strains and its frequency decreases with decreasing sequence similarity between the donor and recipient (1,2). Since the ability to recombine does not entirely vanish even between highly distinct bacteria, horizontal gene transfer is occasionally observed between distinct species (3). Consequently, the DNA sequence of a bacterium may be a mosaic of DNA from different ancestors, complicating the definition of the species concept in bacteria (4–6). Our focus here is on detecting imported homologous recombinant segments among a group of relatively closely related bacteria. Recombination has the potential to rapidly alter phenotypes of bacteria, for instance as a response to selective pressure from antibiotics. Hence, there is interest in the potential of recombination to disseminate antibiotic resistance and virulence determinants (7,8), as a result of the consequences for public health. Tools to detect and analyze recombination in whole genomes are therefore urgently required.

Statistical tests for detecting recombination in DNA started to appear more than two decades ago e.g. (9–11). Combination of recombination modeling with phylogenetic methods appeared in (12), who described a set of DNA sequences of length *l* in terms of a state sequence (*S*_{1},…, *S*_{l}), where each state *S*_{j} represents a particular tree topology. In this approach the optimal sequence of states is estimated using maximum parsimony. Early likelihood-based phylogenetic approaches (13,14) were based on estimating a phylogeny within a sliding window, which was moved across the genome. The purpose of this was to identify locations where the underlying phylogeny changes abruptly.

Within the last decade, Bayesian approaches have become popular within the phylogenetic framework for modeling recombination. In these approaches, the genome is divided into non-overlapping segments, each with a specific phylogeny describing the ancestral structure of that part of the genome (15,16), possibly coupled with another change-point process to account for heterogeneity in evolutionary parameters such as variations in the substitution rate (17–19). The computational burden caused by the super-exponential growth in the number of possible phylogenies as a function of the number of strains limits the use of the phylogenetic approaches to relatively small data sets. Even with the algorithmic refinements, the largest data sets analyzed in (19) consist of 15 strains only, which is an extremely limited number compared to data sets generally available today, see e.g. (7). Imposing further restrictions on the space of possible tree topologies facilitates the use of phylogenetic approaches to larger data sets, as discussed e.g. in (20). For example, one may use a fixed topology for all but a single strain, which may be added as a leaf node in different branches.

Besides the mentioned phylogenetic approaches, a few alternatives for modeling recombination have appeared in the recent years. A Bayesian approach in which a single phylogeny is estimated for the whole genome sequence as opposed to estimating several separate phylogenies for different genomic areas is presented in (21). The recombination as well as mutation events are then assumed to take place along the branches of the inferred phylogeny. The origins of the recombination events are not modeled, but it is assumed that recombination introduces novel substitutions at an unknown uniform rate. This approach was extended in (22) by considering locations in the estimated phylogeny as potential origins for the recombination events. However, these investigations were still limited to very few sequenced genomes due to the computational complexity of the adopted Monte Carlo approach.

Another line of Bayesian modeling for recombination uses a non-parametric clustering model (23). In this approach, the genome is divided into segments and clusterings of the strains within the segments are inferred, where a clustering corresponds to an assignment of strains into different ancestral chromosomes. Recently we introduced a method (24) (BRAT) which can efficiently handle several thousand samples containing multiple gene sequences. The core idea in BRAT is to use Bayesian clustering to detect genetically distinct clusters representing separate evolutionary lineages. These are then used as putative ancestral origins when probabilities of recombination events are estimated for each sample in the data set.

Some non-Bayesian methods for detecting recombination have been introduced in the recent years as well, and (25) carries out a simulation study for comparing the Bayesian phylogenetic-based approaches with some alternatives (see the article for a complete list of references). They conclude that the Bayesian approaches yield highly accurate results, however, at the expense of relatively heavy computational burden. Due to this, their simulation study is restricted to data sets consisting of four sequences only.

Given the large sizes of many bacterial data sets currently available, some applications have inferred recombination indirectly. For example, an association between recombination tendency and antibiotic resistance was recently demonstrated in (7), where the amount of recombination within a certain group of bacteria was estimated using population genetic, so-called admixture models (26,27). A typical outcome from such an analysis consists of the relative contributions of different inferred ancestral groups to the genetic composition of every sample, but without pinpointing the exact genomic locations of the segments assigned to the different origins.

To handle computational issues arising from the size and complexity of emerging population genomic data sets, we introduce here a next generation version of BRAT (BratNextGen). Our method is based on a Bayesian change-point clustering model such that the clusters can change over the genome. One of the clusters is given a special interpretation as comprising the non-recombining sequences, whereas sequences assigned to some other cluster within some genomic segment are considered as recombinant. This model formulation allows us to identify separate recombination origins at any given genome region facilitating the detection of evolutionarily related segments acquired independently in different lineages. A detailed analysis of the *cps* locus of *Streptococcus pneumoniae* illustrates how our method can further the understanding of the evolution of highly recombinogenic bacteria (28).

We initialize our algorithm by dividing the total genome into 5-kb regions and performing a clustering analysis separately in each of these segments. For summarizing the results from the clustering analyses we propose using a novel tree representation. This tree, termed a ‘proportion of shared ancestry’ (PSA) tree, is based on estimated values of the proportion of the genome where two sequences are clustered together by the clustering model, thus highligting groups of sequences having many recombinations in common.

Some comparisons between our method and the mentioned alternative approaches can be made. Our model is formulated for detecting imports from origins that are external to the current population of samples, being similar in this respect to the method presented in (21). However, unlike in (21), our method does not assume a single recombination origin differing from the observed samples at constant rate over the genome. On the other hand, our approach is different from methods presented in (22) and (24) in that it does not model recombination events between the observed samples.

Our approach may readily be applied to the challenge of the data sets produced by next-generation sequencing technologies, which are already being utilized to produce hundreds, and even thousands, of whole genomes from samples of bacterial communities. We provide the methods embedded in a user-friendly software package BratNextGen, which can be downloaded from http://www.helsinki.fi/bsg/software/BRAT-NextGen/ and used freely for academic purposes. The methods are implemented to facilitate straightforward parallel computation enabling a user to carry out recombination analyses of large whole genome bacterial data sets within a few hours.

## MATERIALS AND METHODS

In this section, we formulate the Bayesian model for detecting recombinations and outline the algorithm that is utilized for optimizing the model. Supplementary Text contains the following additional three sections: ‘Details of model learning’ gives further details of our algorithm, ‘Blockwise clustering for initialization’ describes how the the search algorithm is initialized and the PSA tree estimated, and illustrates the PSA tree with simulated data (including illustrative Supplementary Figures S1–S4), and ‘Simulated test data sets’ provides details about how the simulated data sets were generated.

### Bayesian model for locating recombination events

Let *Y* = {*Y*_{ij}}_{i=1, … , N, j=1, … , L} denote the matrix of observed values at polymorphic sites in the aligned genome sequences available for a set of *N* sampled taxa, and the distances between the polymorphic sites. *D* will be treated as fixed, and all distributions are interpreted to be conditional on *D*. For clarity, *D* is suppressed in the subsequent notation. In order to specify recombinations within sequences, we formulate our model in terms of the structural configuration *M* = {*M*_{ij}}_{i=1, … , N, j=1, … , L}, where the elements *M*_{ij} specify the origins of the observations in *Y*. For example, if *M*_{15,3} = 2, the observed nucleotide at the third SNP in the sequence of the 15th taxon is assigned to an origin (or cluster) labeled by the integer 2. At each SNP *j*, *M* determines a clustering *S*_{j} = {*s*_{jh}}_{h=1, … , K}, where clusters are representing taxa having the same origin at SNP *j*. Here, *s*_{jh} are disjoint sets whose union equals the set of all considered taxa. The integer-valued parameter *K* defines an upper bound for the number of distinct origins. In particular, we do not require all *K* clusters to be non-empty at an arbitrary site. Furthermore, the first cluster is always assigned a specific interpretation, representing taxa that are non-recombining at site *j*. Conversely, the taxa assigned to any other cluster are considered to be recombined at site *j*. The probability of the data *Y* given *M* and model parameters Θ is specified as:

where *p*_{hjk} is the relative frequency parameter for nucleotide *k* at the *j*-th SNP in cluster *s*_{jh}, and *n*_{hjk} is the corresponding observed count of the nucleotide in this cluster. The likelihood is derived by assuming that each observation *Y*_{ij} at SNP *j* assigned to cluster *s*_{jh} follows a Multinomial(*p*_{hj1},…, *p*_{hj4})-distribution. In the likelihood (1) Θ represents jointly all the probability parameters *p*_{hjk}, which are assigned the prior

where α is a hyperparameter. Specification of this hyperparameter is discussed in detail below. To prune unnecessary parameters from the model, we use an empirical approach where the number of distinct alleles present in the whole set of taxa for any particular SNP is first determined, such that when less than four alleles are observed, cardinality of the parameter vector in the likelihood and prior expressions is reduced accordingly.

Let *M*_{i·} = (*M*_{i1}, *M*_{i2},…, *M*_{iL}) denote the sequence of origins of nucleotides observed at the *L* SNPs for the *i*-th taxon. We specify the prior distribution for *M* by modeling each *M*_{i·} independently as a Markov chain, with the following three parameters defining its transition matrix: ρ_{0}, ρ, *a*. In detail, we set the joint probability of *M*_{i·} equal to

where for simplicity we use *p*(*M*_{i1}) = 1/*K*. Assuming that the physical distance between the two SNPs indexed by *j* − 1 and *j* is given by *d*_{j−1,j}, the transition matrix that determines the conditional distribution *p*(*M*_{ij}|*M*_{i, j−1}, ρ_{0}, ρ, *a*) is defined as

which is obtained by raising the transition matrix between two consecutive sites to the power *d*_{j−1,j}. Figure 1 presents a graphical illustration of the transition probabilities. The parameters of this transition matrix can be interpreted as follows: ρ_{0} denotes the conditional probability that the allele observed at a particular site corresponds to a non-recombining state, i.e. it is assigned to the non-recombining group (indexed by 1), if also the allele observed at the previous site had this state. The remaining probability mass (1 − ρ_{0}) is evenly divided between all the other putative origins (clusters), which correspond to recombinogenic states. Parameter ρ denotes the probability analogous to ρ_{0} for the other putative origins of an allele transferred horizontally through recombination. In the chain of states determining the origins of alleles within a genome, the probability of transition from a recombinant state back to the non-recombining state equals (1 − ρ)*a*, where *a* is a probability that should be close to unity. The remaining probability mass (1 − ρ)(1 − *a*) is then evenly divided among the remaining *K* − 2 origins.

*K*, is equal to 4, and

**...**

We assign prior distributions for the parameters ρ_{0}, ρ and *a* governing the transition matrix as follows. First, parameter *a* is assigned a distribution favoring values close to unity:

where α_{a} = 10 and β_{a} = 1. The prior for ρ is chosen by considering the expected lengths of recombinant segments. Conditionally on ρ, the length *l* of a recombinant segment has a geometric distribution with mean μ_{l} = 1/(1 − ρ). A prior is implied for ρ by defining a uniform distribution for the mean over a sensible interval

In practice, we have used the boundaries *a*_{μ} = 10 and *b*_{μ} = 20 000 in all our analyses. The prior for ρ_{0} is selected by considering the probability that not a single recombination event exists in the whole sequence observed for any particular taxon. Let ϕ denote the probability that after starting the Markov chain from state labeled 1, the state does not change at all during all *L** bases in the complete genomes. We then get

We now impose a prior on ρ_{0} by setting the distribution of ϕ according to

As opposed to (19) and other earlier approaches utilizing Markov chains for modeling recombination events, we prefer to parameterize the transition distribution in terms of ϕ as opposed to setting a distribution directly on ρ_{0}, since a more intuitive interpretation can be assigned to ϕ and the resulting prior is more conservative in terms of false positive rate.

Finally, we describe the strategy for specifying the hyperparameter α, which is used in the prior (2) for the nucleotide frequency parameters. The information represented by this prior can be interpreted to correspond to α *a priori* observations that are evenly distributed among alternative nucleotides. Typical non-informative choices used in clustering include e.g. α = 1 (29) or α = 4 (27). Parameter α represents the amount of variation that we expect to observe within a particular cluster. For example, α = 4 leads to a uniform distribution for the nucleotide frequencies, whereas the choice α = 1 favors clusters in which the frequencies are concentrated closer to the extremes (0 or 1). Consequently, the parameter α influences the number of clusters that will be learned by a clustering procedure. By setting α very close to zero, all taxa will be separated by the clustering unless they have identical data. Conversely, by setting α to a very large value, all taxa will be clustered to a single cluster. In the current method, we consider the sampled taxa in general to be representable by a single cluster corresponding to a set of closely related bacteria, such that some segments in the observed sequences are imported from a source outside the clonally related group. For such data, it is appropriate to use a relatively large value of α favoring *a priori* a single cluster at an arbitrary position in the genome. To derive a suitable value of α for any given data set, we developed the following empirical strategy which was found to produce reasonable results for a wide range of different data sets: (1) under the condition that all taxa are assigned to a single cluster, calculate the maximum *a posteriori* (MAP) estimate of the parameter α_{j} separately at each variable site *j* (MAP values can be obtained in a straightforward manner using routine numerical optimization under a non-informative uniform prior for α_{j}), (2) take the 90th percentile of the MAP estimates and use this value as the hyperparameter α.

In the current context of detecting recombination events, the parameter α can be interpreted from two different perspectives. First, it can be interpreted from an operational point of view merely as a parameter controlling the trade off between sensitivity and false positive rate. A lower value means that some genomic segment is more often considered as not belonging to the non-recombining origin, leading to higher sensitivity, but also to an increased false positive rate. The alternative interpretation of α is to consider its relation to the mutation rate. A higher value of α favors fewer clusters at any particular site, and consequently, because in our method the non-recombining cluster corresponds to a set of clonally related sequences, a larger fraction of the molecular variation among the sequences is to be explained by mutation. Despite this apparent relation between α and a mutation rate parameter, an exact mapping between these two parameters does not seem to be accessible within our method.

### Model learning

The major prerequisites we set up when developing our learning algorithm are scalability to large data sets and low false positive rate. These goals are achieved by using an approach outlined as follows (the exact details of these steps are explained in section ‘Details of model learning’ in the Supplementary Data):

- Use fast heuristics for learning an initial model
*M*. - Use MCMC-like iterative approach for refining the initial model
*M*, where the learning procedure alternates between the two operators:- Update each
*M*_{i·},*i*= 1, … ,*N*, using*p*(*M*_{i·}|*M*_{−i}, ρ_{0}, ρ,*a*,*Y*), where*M*_{−i}denotes the model with*M*_{i·}excluded. The conditional distribution is obtained by using the standard forward–backward algorithm for hidden Markov models. - Update ρ
_{0}, ρ,*a*by analytically maximizing*p*(ρ_{0}, ρ,*a*|*M*,*Y*).

- When convergence is reached in Step 2, inferences are summarized by reporting all genome segments for which
*p*(*M*_{ij}= 1|*M*_{−i}, ρ_{0}, ρ,*a*,*Y*) < 0.01 as recombinant. This condition implies that the conditional probability of the non-recombining state is <0.01 over all such intervals. - (Optional) For a given number of permutation cycles, permute the columns of
*Y*and repeat the analysis (Steps 1–3) for the permuted data. Compare the results from these permutation-based analyses with the results for the original data, which enables empirical calculation of the statistical significance of detected recombinant segments.

The Step 4 of our algorithm can be considered as optional, however, it is highly recommended. The rationale behind permutation of the columns in the data matrix is that if the observed difference between a pair of taxa is caused by recombination, then the divergent SNPs are expected to be densely clustered within the genomic area affected by a recombination event. In contrast, if the difference is due to variation caused by mutations, the diverging SNPs are expected to be relatively randomly distributed across the genome. Thus, by permuting the columns of the data matrix, one obtains the null distribution of the differences under a mutation-driven process of taxon evolution. The additional step is particularly important when only very limited or no recombination has taken place among the taxa. Under such circumstances differences caused by population structure may otherwise be falsely identified as recombinations, as pointed out in the ‘Discussion’ section.

## RESULTS

### Analysis of simulated data sets

Given the wide spectrum of recombination rates occurring in natural bacterial populations, we confirmed the accuracy of BratNextGen inferences using a range of different recombination scenarios. To investigate the statistical power and false positive rate for recombination detection, we analyzed simulated data sets generated by varying the following factors: the lengths of the recombinant segments, the level of phylogenetic variation in data (i.e. the distance from the root of the tree to any one of the tips), the number of recombination events and their origins. For exact details of the synthetic data generation, see the ‘Simulated test data sets’ section in the Supplementary Data.

The tree shown in Figure 2 was used to generate a population of sequences of length 1 Mb. Distances between sequences belonging to different colored clusters in Figure 2 are shown in Table 1. Only the cyan-colored taxa were included in the analysis as samples from which sequences were observed, such that recombination events from other unobserved branches in the tree had been inserted to the observed sequences. Depending on the distance of the donor branch from the branch of cyan-colored taxa in the tree, the origin of the recombination was considered either as ‘close’ (red), ‘distant’ (green) or ‘medium distant’ (blue) from the sampled taxa. A recombination event was flagged as detected if the true segment overlapped some segment deemed significantly recombinant by our method. Correspondingly, all segments claimed significantly recombinant were flagged as false positives when they did not overlap any true recombinant segments.

*d*, was selected to be either 0.01 or 0.03 in units of substitutions per site (see

**...**

The results of our simulation study are summarized in Table 2. In addition, Figure 3 provides graphical illustration of the true and detected significant recombinations for one of the data sets corresponding to row 3 in Table 2. The following general conclusions can be drawn from the simulation results: (1) the false positive rate remains low under all simulation settings. In Figure 3 the data set with the highest rate of false positives is shown. Even here the rate is low (<10%) relative to the true positives; (2) sensitivity to detect recombination events from a distant origin was high in all data sets (70-100%). Conversely, sensitivity to detect recombination events from a close origin was near zero in all data sets. For example, in Figure 3(a) recombination events from the red (closely related) origin are evident while they are normally not seen in Figure 3(b). Sensitivity to detect recombinations from a medium distant branch was between these two extremes. This is entirely reasonable, as there is no statistical basis to identify a recombination event when the (unobserved) donor sequence is highly similar to the recipient sequence, whereas the power increases with increasing evolutionary distance; (3) sensitivity appears highest when only a limited number of recombination events is present in data, which can be explained by differences in the level of variation over genomes. In a nutshell, a particular genome segment can be detected as significantly recombinant if it is more distinct from the comparable non-recombining sequences than on average over the genome. If a genome sequence contains traces of a very large number of recombination events, then the average over-genome distance to non-recombining sequences is elevated, leading to a decreased sensitivity; (4) increasing height of the generating tree seems to lead to a minor decrease both in the sensitivity and the false positive rate, which may be due to the fact that for tall trees the level of variation within the sampled taxa will be higher, thus making it generally harder to detect significantly recombinant segments. The effect of this does not seem to be fully compensated by the increasing molecular distance between the putative origins of a recombinant genome segment.

### Streptococcus pneumoniae

To test the potential of BratNextGen with a large set of taxa for which whole genome data are available, we analyzed a recently published set of 241 complete genomes of *S. pneumoniae* isolates from the PMEN1 lineage (28). For these data a total of 39 107 polymorphic sites were identified in the 2 221 315 bp alignment. Due to a highly non-uniform distribution of the polymorphic sites over the genome (28), a standard phylogenetic tree is expected to be mostly determined by the relatively short genomic segments harboring the majority of the variable sites. The pneumococcus is known to be relatively recombinogenic, and this data set has already been analyzed in an attempt to detect genetic material imported from other lineages (28). It is hence particularly useful both for testing our new method and for a comparison with the results from earlier analyses. A single run of BratNextGen with the *S. pneumoniae* data took less than 2 h on a single processor of a standard desktop PC. An additional assessment of statistical significance based on permutation re-sampling of the polymorphic sites (see ‘Materials and Methods’ section) was performed by executing our algorithm 100 times on a cluster computer. Since all these runs were done in parallel, the total running time of the analysis was approximately the same as that of a single run. This can be compared with the analysis of 4 whole genomes of *Salmonella enterica* using the method presented in (21) taking 72 h.

The detected recombination events are displayed graphically in Figure 4 and the exact positions of these segments are written in Supplementary Tables S1 and S2. For comparison, Supplementary Figure S5 shows the recombination events detected in (28) for these same data using a different method. It is immediately evident that the pattern of recombinations detected in the two separate analyses is highly similar. The tree representing the proportion of shared ancestry and the clusters inferred from it are also shown in Figure 4. Table 3 breaks down these clusters with respect to countries of origin for the isolates, and Supplementary Table S3 contains more detailed information about the taxa cluster-wise.

*S. pneumoniae*data. The figure shows the results of our analysis of 241

*S. pneumoniae*isolates. On the left, the PSA tree is shown. The tree is cut at threshold level 0.25 to produce 6 clusters, colored as blue, green,

**...**

Maximum likelihood trees for the *S. pneumoniae* data set computed before and after recombination was removed from the data set, are presented in (28). These trees are shown in Figure 5(a) and (b), respectively. Comparison of clusters obtained from BratNextGen output with these trees shows the results are entirely congruent with those outlined in (28) also in this respect. Clusters evident from both phylogenies, such as a clade associated with transmission to Vietnam (magenta), a group of serotype 6A strains (cyan) and serotype 19A vaccine-escape isolates from the USA (yellow), were identified by BratNextGen. The algorithm was also able to distinguish the internationally distributed subpopulation (dark blue) from the ancestral European population (green), despite little distinction being found between these groups in the tree shown in in Figure 5(a). Furthermore, the cluster into which most of the South African isolates fall (red) corresponds exactly with that found in the tree in Figure 5(b), despite one of these strains being distant from the other taxa in the tree in Figure 5(a). Hence BratNextGen is capable of accurately detecting relationships among even highly recombinogenic bacteria.

*S. pneumoniae*data. (

**a**) and (

**b**) show the maximum likelihood phylogenetic trees for

*S. pneumoniae*data, constructed using either all data or polymorphisms outside any inferred recombination events, respectively

**...**

The pneumococcal capsule, crucial for the bacteriums virulence, has over 90 different forms (serotypes). Almost all of these are encoded by alternative gene cassettes at the *cps* locus, which vary between ~10 kb and ~30 kb in length (30). *Streptococcus pneumoniae* is able to switch serotype by replacing this region of the chromosome via horizontal acquisition of a different gene cassette. Unlike the previous analysis (28), BratNextGen allows the identification of separate imports from the same ancestral origin. A detailed view of the BratNextGen analysis at the *cps* locus presented in Figure 6 indicates the isolates that have undergone serotype switching as having high levels of admixture in this region; furthermore, isolates that have independently switched to the same serotype show similar patterns of admixture, allowing convergent evolution to be easily identified. This indicates that the separately acquired *cps* loci of the same serotype are very similar.

*cps*locus. The panel on top shows results from our analysis of the

*S. pneumoniae*data set zoomed in to sequence positions 0.29–0.33 Mb comprising the

*cps*locus. The isolates which have undergone serotype switching have

**...**

However, there is also evidence for *cps* locus diversity between isolates of the same serotype. Strain UR428, while remaining serotype 23F, appears to have horizontally acquired sequence within the *wchA* gene that, according to BratNextGen, shows similarity with *wchA* from PMEN1 strains that have acquired a 6A type capsule. To study the provenance of this sequence in more detail, all the PMEN1 sequences with >50% coverage of this region were aligned to *wchA* reference sequences for all available serotypes, plus *wchA* genes from a collection of serogroup 6 sequences (30,31). A maximum likelihood phylogeny in Figure 6 shows that the serogroup 6 sequences (marked in blue) lie in a handful of clusters, interspersed among those of the other serotypes. In agreement with the BratNextGen analysis, *wchA* from UR428 is more closely related to the alleles in the PMEN1 serotype 6A switched isolates, themselves identical to a version of the gene in an unrelated Korean 6A ST2756 strain (31), than the alleles in the ancestral serotype 23F *cps* locus.

Despite this instance of mosaicism within the locus, there is evidence for a surprising level of homogeneity within serotypes, given the length of time these *cps* loci have spent in the highly recombinogenic pneumococcal chromosome. The *wchA* sequences from both the PMEN1 lineage and the unrelated strain BM4200 are very similar to the reference version, and the three independent acquisitions of the 19A locus within the PMEN1 lineage all cluster with the 19A reference *wchA*. This suggests either the high levels of divergence distinguishing the different gene clusters is enough to prevent mosaicism arising, hence the observed recombination arising in this data set is rare, or that selection maintains a conserved set of efficiently co-operating genes. Only further sequencing will distinguish these two possibilities.

## DISCUSSION

We have introduced a novel approach that enables rapid exploratory analysis of recombination events in large-scale whole genome bacterial data sets. Two particularly important goals from the application perspective are scalability to large collections of taxa and low false positive rate, which are both met by our method. Furthermore, our method is able to identify overlapping independent imports from the same ancestral origin, facilitating a more detailed understanding of bacterial evolution as illustrated by our analysis of the *S. pneumoniae* data set.

Our approach can be summarized as follows. First, the core of our method is a Bayesian change-point clustering model, which is used to detect taxa that are, within some genomic areas, more distinct from other taxa than would be expected by mutation-driven evolutionary processes alone. Second, initialization of our recombination model configuration is done by using separate clusterings of taxa over fixed-width genomic intervals and then combining these results in a parsimonious manner such that the cluster label of a taxon changes as seldom as possible over the whole genome. The clustering analyses are summarized by the proportion of shared ancestry tree, which reflects the proportion of genome where any two particular taxa are clustered together. Using this tree a set of clusters can be specified simultaneously for all taxa. Analyses with simulated data (see Supplementary Text) and *S. pneumoniae* data set demonstrate that such a tree is able to highlight sequences which share many recombinations and, consequently, are likely to belong to the same branch in the underlying phylogenetic tree. Finally, an evaluation of the statistical significance of the detected recombinant segments in frequentist terms can be obtained by replicating the analysis with data obtained through random permutation of the columns of the original data matrix. This permutation re-sampling procedure yields empirical *P*-values for the identified recombinant segments, see e.g. (32).

A distinctive feature of our approach is the permutation re-sampling procedure that is utilized on top of the Bayesian model learning machinery. While such strategies are commonly used to obtain reasonable multiple-testing corrections in association studies (33), one might still argue that if all information about the problem is included in a Bayesian model, the permutation procedure would be inconsequential. However, to see beyond this simple theoretical argument, one needs to consider the general limitations related to statistical modeling and detection of recombinations in large-scale data. It is known from both experimental observations (1) and more recent whole genome analyses (22) that homologous recombination is more frequent between more closely related strains, and specifically sequences. This gives rise however, to concerns over the identifiability of recombination events between very close relatives. In the most extreme case the sequences are identical, and, consequently, within the Bayesian inferential paradigm detection of recombination events is then mainly determined by the prior distributions, which results in considerable statistical uncertainty as mentioned also in (22). In our approach this problem becomes relevant when analysing data sets harboring only very limited traces of recombination events. Under such circumstances, depending on the choice of model hyperparameters, the model may falsely consider mutation-based differences caused by population structure as evidence for recombination. A particularly advantageous feature of the permutation re-sampling procedure is that it allows us to detect and discard such false positives.

The second statistical non-identifiability issue concerns statements about which sequence is recombining and which represents the original clonal ancestry. For example, suppose that a group of related taxa can be divided into two subgroups *s1* and *s2*, that are highly distinct at some locus *A*, but highly similar elsewhere. Then, either *s1* or *s2* can be recombinant at locus *A* but, based on data, it may be impossible to decide which of the two hypotheses is correct. If one of the sub-groups is very small compared to the other, then it may be reasonable to consider the smaller group as recombining. On the other hand, if the groups are approximately of the same size, then the issue would remain unresolved. An alternative way of considering this problem is to examine a phylogeny with two branches, such that leaf nodes in the two branches correspond to the taxa in *s1* and *s2*, respectively. Only when the tree is rooted and the root is assumed non-recombinant, it becomes possible to determine which of the two branches is recombining. It should be noticed that the biological conclusions drawn under the two alternative hypotheses may be considerably different. In this respect, our model follows the principle favoring the simplest explanation, meaning that the cluster containing the majority of taxa is considered as non-recombining. For this reason, in our simulation study, if a recombination event affected a branch containing the majority of taxa, the sequences representing the complement of this group were considered as recombinant while evaluating the results.

The model above is developed for the detection of recombinant material introduced into a closely related set of bacteria over the process of their diversification. The design is motivated by data sets of individual virulent or drug resistant lineages, and it should be noted that the above model is likely to be sub-optimal for the analysis of data from highly structured populations (as observed in our simulations where both false and true positive rates decreased when overall genetic diversity increased). We are currently developing the proposed model and the inference algorithm to handle such issues and generalize the method to alternative data sets.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–3, Supplementary Text, Supplementary Figures 1–5 and Supplementary References [34–41].

## FUNDING

Academy of Finland to the Pubgensens project (to P.M.); ERC (239784, to J.C.); Sigrid Juselius Foundation (to J.C.); National Institute of General Medical Sciences (U54GM088558, to W.P.H.). Funding for open access charge: Academy of Finland to the Pubgensens project.

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute Of General Medical Sciences or the National Institutes of Health.

## REFERENCES

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (6.7M) |
- Citation

- A novel computational method identifies intra- and inter-species recombination events in Staphylococcus aureus and Streptococcus pneumoniae.[PLoS Comput Biol. 2012]
*Sanguinetti L, Toti S, Reguzzi V, Bagnoli F, Donati C.**PLoS Comput Biol. 2012; 8(9):e1002668. Epub 2012 Sep 6.* - Inference of homologous recombination in bacteria using whole-genome sequences.[Genetics. 2010]
*Didelot X, Lawson D, Darling A, Falush D.**Genetics. 2010 Dec; 186(4):1435-49. Epub 2010 Oct 5.* - Detection of homologous recombination events in bacterial genomes.[PLoS One. 2013]
*Wang WB, Jiang T, Gardner S.**PLoS One. 2013; 8(10):e75230. Epub 2013 Oct 7.* - Population genetics and evolution of the pan-genome of Streptococcus pneumoniae.[Int J Med Microbiol. 2011]
*Muzzi A, Donati C.**Int J Med Microbiol. 2011 Dec; 301(8):619-22. Epub 2011 Oct 13.* - Phylogenetic understanding of clonal populations in an era of whole genome sequencing.[Infect Genet Evol. 2009]
*Pearson T, Okinaka RT, Foster JT, Keim P.**Infect Genet Evol. 2009 Sep; 9(5):1010-9. Epub 2009 May 27.*

- Mechanisms and impact of genetic recombination in the evolution of Streptococcus pneumoniae[Computational and Structural Biotechnology ...]
*Chaguza C, Cornick JE, Everett DB.**Computational and Structural Biotechnology Journal. 13241-247* - Genomic signatures of human and animal disease in the zoonotic pathogen Streptococcus suis[Nature Communications. ]
*Weinert LA, Chaudhuri RR, Wang J, Peters SE, Corander J, Jombart T, Baig A, Howell KJ, Vehkala M, Välimäki N, Harris D, Chieu TT, Van Vinh Chau N, Campbell J, Schultsz C, Parkhill J, Bentley SD, Langford PR, Rycroft AN, Wren BW, Farrar J, Baker S, Hoa NT, Holden MT, Tucker AW, Maskell DJ, BRaDP1T ConsortiumBosséJanine T.JT11LiYanwenY11MaglennonGareth A.GA12MatthewsDominicD12CuccuiJonJ13TerraVanessaV11.**Nature Communications. 66740* - Population Structure and Antimicrobial Resistance of Invasive Serotype IV Group B Streptococcus, Toronto, Ontario, Canada[Emerging Infectious Diseases. 2015]
*Teatero S, McGeer A, Li A, Gomes J, Seah C, Demczuk W, Martin I, Wasserscheid J, Dewar K, Melano RG, Fittipaldi N.**Emerging Infectious Diseases. 2015 Apr; 21(4)585-591* - Concordance and discordance of sequence survey methods for molecular epidemiology[PeerJ. ]
*Castro-Nallar E, Hasan NA, Cebula TA, Colwell RR, Robison RA, Johnson WE, Crandall KA.**PeerJ. 3e761* - Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins[Nucleic Acids Research. 2015]
*Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, Parkhill J, Harris SR.**Nucleic Acids Research. 2015 Feb 18; 43(3)e15*

- Cited in BooksCited in BooksNCBI Bookshelf books that cite the current articles.
- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyTaxonomy records associated with the current articles through taxonomic information on related molecular database records (Nucleotide, Protein, Gene, SNP, Structure).
- Taxonomy TreeTaxonomy Tree

- Detection of recombination events in bacterial genomes from large population sam...Detection of recombination events in bacterial genomes from large population samplesNucleic Acids Research. 2012 Jan; 40(1)e6

Your browsing activity is empty.

Activity recording is turned off.

See more...