• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of ploscompComputational BiologyView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
PLoS Comput Biol. Nov 2009; 5(11): e1000581.
Published online Nov 26, 2009. doi:  10.1371/journal.pcbi.1000581
PMCID: PMC2776870

An Evolutionary Model-Based Algorithm for Accurate Phylogenetic Breakpoint Mapping and Subtype Prediction in HIV-1

Christophe Fraser, Editor

Abstract

Genetically diverse pathogens (such as Human Immunodeficiency virus type 1, HIV-1) are frequently stratified into phylogenetically or immunologically defined subtypes for classification purposes. Computational identification of such subtypes is helpful in surveillance, epidemiological analysis and detection of novel variants, e.g., circulating recombinant forms in HIV-1. A number of conceptually and technically different techniques have been proposed for determining the subtype of a query sequence, but there is not a universally optimal approach. We present a model-based phylogenetic method for automatically subtyping an HIV-1 (or other viral or bacterial) sequence, mapping the location of breakpoints and assigning parental sequences in recombinant strains as well as computing confidence levels for the inferred quantities. Our Subtype Classification Using Evolutionary ALgorithms (SCUEAL) procedure is shown to perform very well in a variety of simulation scenarios, runs in parallel when multiple sequences are being screened, and matches or exceeds the performance of existing approaches on typical empirical cases. We applied SCUEAL to all available polymerase (pol) sequences from two large databases, the Stanford Drug Resistance database and the UK HIV Drug Resistance Database. Comparing with subtypes which had previously been assigned revealed that a minor but substantial (≈5%) fraction of pure subtype sequences may in fact be within- or inter-subtype recombinants. A free implementation of SCUEAL is provided as a module for the HyPhy package and the Datamonkey web server. Our method is especially useful when an accurate automatic classification of an unknown strain is desired, and is positioned to complement and extend faster but less accurate methods. Given the increasingly frequent use of HIV subtype information in studies focusing on the effect of subtype on treatment, clinical outcome, pathogenicity and vaccine design, the importance of accurate, robust and extensible subtyping procedures is clear.

Author Summary

There are nine different subtypes of the main group of HIV-1, each originating as a distinct subepidemic of HIV-1. The distribution of subtypes is often unique to a given geographic region of the world and constitutes a useful epidemiological and surveillance resource. The effects of viral subtype on disease progression, treatment outcome and vaccine design are being actively researched, and the importance of accurate subtyping procedures is clear. In HIV-1, subtype assignment is complicated by frequent recombination among co-circulating strains, creating new genetic mosaics or recombinant forms: 43 have been characterized to date, and many more likely exist. We present an automated phylogenetic method (SCUEAL) to accurately characterize both simple and complex HIV-1 mosaics. Using computer simulations and biological data we demonstrate that SCUEAL performs very well under various conditions, especially when some of the existing classification procedures fail. Furthermore, we show that a small, but noticeable proportion of subtype characterization stored in public databases may be incomplete or incorrect. The computational technique introduced here should provide a much more accurate characterization of HIV-1 strains, especially novel recombinants, and lead to new insights into molecular history, epidemiology and geographical distribution of the virus.

Introduction

Many RNA viruses have evolutionary rates that hover near the mutational speed limit [1] permitting them to generate incredible sequence variability among circulating strains in a relatively short time [2]. Bottleneck events, such as viral introduction to new populations or species of hosts, followed by diversification in the new environments, create easily discernible substructures within individual viral species. For HIV-1, this substructure consists of 3 groups (M, N and O), 9 “pure” subtypes (A–D, F, G, H, J and K) of group M, and sub-subtypes (e.g. A1, A2, F1 and F2), defined entirely on the basis of phylogenetic clustering and monophyly of sequences from a given subtype in relation to all other subtypes [3]. The geographic distribution of HIV-1 subtypes is decidedly non-random [4]; for example An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e001.jpg of HIV-1 circulating in North America is classified as subtype B, whereas the same subtype accounts for only An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e002.jpg of infections in Southern Africa. This observation immediately suggests that reliable determination of viral subtypes is highly informative for epidemiological surveillance. HIV-1 diversity is sufficiently high to permit further stratification of subtypes by the geographic region of origin, yielding further clues to epidemiological history of modern epidemics [5]. However, because several established subtypes often circulate concurrently in one host population [6], and because HIV has exceptionally high recombination rates [7], novel recombinant forms are frequently generated. If at least three epidemiologically unrelated viral isolates show an identical novel recombination structure in terms of the pure subtype reference strains, a new circulating recombinant form (CRF) is added to the compendium maintained by the Los Alamos National Laboratory (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html). There are currently 43 described CRFs, differing widely in their prevalence, range and the complexity of the recombinant structure. However, the relationship between CRFs and their parental strains is not always clear cut; for example CRF02, originally thought to have been the product of recombination between subtype A and subtype G strains could in fact be ancestral to subtype G strains [8].

A number of computational approaches have been proposed to classify viral strains into subtypes or to describe recombinant strains as mosaics of subtypes. Unlike with methods geared towards a more general problem of detecting recombination from sequence alignments [9],[10], there are no comprehensive comparative benchmarking studies for subtyping methods in the literature. The methods can be conceptually categorized by whether or not they explicitly use a phylogeny to assign subtypes, whether or not they require a multiple sequence alignment and by the degree of automation that they afford: full, partial or none. The de facto standard for accurately describing novel recombinant forms has changed little since its introduction in [11]. It consists of an initial sliding-window phylogenetic bootstrap (bootscanning) analysis of the query sequence aligned against the set reference strain used to generate the set of apparent breakpoints which are then confirmed by detailed phylogenetic analysis of putative non-recombinant fragments. This is a powerful and intuitively attractive, but laborious method–the entire process frequently lacks automation (e.g. [12],[13], but see [14]), has many user-adjustable parameters, such as the alignment procedure, reference sequences, sliding window size and stride, precise location of breakpoints, phylogenetic bootstrap values that are selected subjectively, and can lead to ambiguous or not fully resolved results (e.g. [15] vs [16], [17]). Perhaps the single greatest criticism of the bootscan/phylogeny approach may be that two alternative characterizations of the same query sequence are not assigned a statistically meaningful goodness-of-fit score, and hence cannot be objectively compared.

On the other end of the spectrum are fully automated techniques, including a sophisticated phylogeny and alignment based REGA v2.0 tool [18], henceforth referred to as REGA, and several phylogeny and/or alignment free tools: a classification method based on subtype-specific distributions of short nucleotide strings [19]; a sliding window analysis based on BLAST scores of the query and each of the subtype reference sequences [20]; a phylogeny free position/subtype specific amino-acid subtype analyzer (STAR) which assigns each residue in a multiple sequence alignment a subtype discriminating score [21]; and a probabilistic jumping alignment approach jpHMM [22] that uses a hidden Markov model to align the query to the locally most similar reference sequence.

Alignment and/or phylogeny free techniques are fundamentally approximate in nature, because the definition of a subtype is rooted in the concept of a clade and hence is intrinsically phylogenetic in nature. Approximate approaches have been developed to address the very practical issues of automation, speed and the fact that a phylogenetic definition of a subtype becomes complicated when reference strains are permitted to have recombined themselves. On the other hand, these methods often produce conflicting or indeterminate results, may be unable to classify novel or rare mosaics, and frequently disagree with manually performed phylogenetic analyses, causing considerable consternation among practitioners and clinicians (e.g. [23][25]). A recent comparative study of three automated subtyping tools on 10537 partial polymerase sequences from the UK [26] found that methods agreed poorly (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e003.jpg) for subtypes other than B,C and H, failed to classify An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e004.jpg of sequences and returned discordant results in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e005.jpg cases of divergent sequences, which were revealed to be unusual recombinant forms by a laborious follow-up analysis.

Hence, we are convinced that it is necessary to adopt a phylogeny-based method for accurate subtyping. Statistical evidence of phylogenetic incongruence, i.e. instances when different regions of an alignment support discordant phylogenies, is a hallmark of recombination [27]. A statistically robust phylogenetic approach to detecting phylogenetic incongruence in a multiple sequence alignment has been proposed in the Bayesian framework by [28] and in the information theory framework by [29]. These methods are powerful but too slow to be practical for large reference phylogenies needed to describe extant HIV diversity–for example our HIV-1 polymerase reference alignment contains nearly 300 sequences. Because subtyping is a particular case of more general recombination analyses, we devised an algorithm whose run time is effectively constant in the size of the reference alignment. Importantly, this is achieved without collapsing the alignment into a collection of attributes, such as substring frequencies or position-specific alignment scoring matrices, as is frequently done by phylogeny-free methods.

Our design objectives for SCUEAL included: (i) a completely automatic method, which returns a predicted subtype, existing CRF or a recombinant form mapped in terms of the former; (ii) every estimated quantity including the recombinant structure, the location of each breakpoint and the assignment of a parental/sister lineage should be estimated with statistical confidence/support values to allow an objective evaluation of how robust the estimates are; (iii) the algorithm runs sufficiently quickly (2–3 CPU minutes to screen a simple sequence, and up to a CPU hour for highly complex mosaics) to permit the screening of thousands of sequences on a computer cluster. We implemented an easy-to-use web interface to SCUEAL running on the datamonkey.org [30] platform); (iv) accepts large reference sequence alignments which can be easily updated when new references (e.g. CRF) become available. SCUEAL is conceptually based on the more general method (GARD) for detecting recombination in multiple sequence alignments presented in [29], but is an entirely new algorithm and software implementation. Whereas GARD is primarily concerned with detecting the number and location of breakpoints in an alignment, and not in identifying recombinant lineages and clades (which is critically important for subtyping), SCUEAL explicitly searches for both using a significantly modified and improved genetic algorithm. Also, by screening a single sequence against a fixed reference alignment, SCUEAL gains significant power and an order of magnitude speed-up over GARD, which assumes that any sequence can be a recombinant.

We assessed various performance metrics of SCUEAL using an extensive set of simulations and biological data; to our knowledge no other method has been subjected to a comparably exhaustive benchmarking study.

Methods

Consider an alignment of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e006.jpg reference sequences on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e007.jpg bases, each labeled with its subtype. We require that none of the reference sequences have undergone detectable recombination, hence their evolutionary history can be accurately described with a single phylogenetic tree, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e008.jpg; note that this framework can be used to handle recombinant reference sequences represented as multiple partial sequences (see below). In this manuscript, the evolution of extant sequences from their most recent common ancestor along the phylogenetic tree is described by the general time reversible model of nucleotide substitution [31] and site-to-site rate variation is accommodated via a 3-bin general discrete distribution (e.g. [32]). Substitution models for codon and protein evolution can be easily accommodated by the testing framework; however because they incur considerable additional computational expense they are not considered here.

Phylogenetic mosaics

The objective of our methodology is to enable automatic identification of the number (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e009.jpg) and location of any recombination breakpoints in a query sequence, that is assumed to be homologous and alignable to the reference sequences, together with the identities of sister lineages in each non-recombinant fragment. An example of such an assignment can be found in Figure 0: the query sequence (labeled Q) has two recombination breakpoints, at nucleotide positions 750 and 1250. Over the first 750 nucleotides, the query sequence shares a common ancestor with reference sequence 1, over the next 500 nucleotides - with reference sequence 7, and over the last 750 nucleotides - with sequence 1 again. Such an arrangement might arise if the query is the result of a recombination event between the ancestors of sequences 1 and 7.

The term ‘mosaic’ has come to encompass the combination of breakpoint placements and lineage assignments in HIV-1 subtyping literature. The number of possible mosaics with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e010.jpg breakpoints is proportional to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e011.jpg, hence it is not practical to undertake an exhaustive search of all possible mosaics, unless An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e012.jpg is small (i.e. B = 1 or B = 2).

Model fitting and fitness evaluation

In order to select credible mosaics from the set of all possible models we must be able to compute a goodness-of-fit value for each proposed mosaic.

We begin by computing the maximum likelihood based score for each model. First, we fit the reference tree to the reference alignment using standard phylogenetic maximum likelihood. Assuming unrooted bifurcating trees, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e013.jpg branch length estimates and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e014.jpg substitution model estimates, such as relative nucleotide substitution rates, base frequencies and site-to-site rate variation parameters will be obtained. These baseline parameters are estimated once for a reference alignment, and can be reused if multiple query sequences are run against the same reference.

For computational efficiency we fix all substitution model parameters at their baseline values instead of re-estimating them for each mosaic. If the reference alignment is sufficiently large, the effect of one additional sequence on substitution model parameters will be insignificant. Furthermore, we posit that grafting the query sequence onto a branch in the reference tree will only affect three branch lengths for each non-recombinant fragment. For instance, for the mosaic shown in Figure 1 the algorithm will estimate three branch lengths for the first segment (those leading to 1 and Q as well as the branch leading to their MRCA), three branch lengths of the second segment (Q,7 and the MRCA of Q and 7) and three branch lengths for the third segment (1,Q and the MRCA of 1 and Q). All other branch lengths are maintained at the values derived from the reference tree. Similar approximations are routinely made in phylogenetic inference (e.g. [33],[34]). The fitness of mosaic An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e015.jpg is evaluated using Schwartz's Bayesian Information Criterion (BIC, [35]), with the number of model parameters for a mosaic with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e016.jpg breakpoints given by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e017.jpg:

equation image
(1)

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e019.jpg is the likelihood of the data under the mosaic model maximized over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e020.jpg parameters and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e021.jpg is the number of sites in the alignment, used to approximate the number of independent observations. A lower BIC score indicates a better fit to the data. BIC was selected because it had the best power/accuracy performance in our initial simulation studies, comparing AIC [36], AIC-c [37] and BIC (results not shown).

Figure 1
An example to illustrate the concepts of a mosaic and its binary encoding upon which the genetic algorithm operates.

The immediate benefit of allowing only three branch lengths to vary per segment is that the computational cost for fitting individual mosaics no longer depends on the size of the reference alignment, at least when time-reversible models of substitutions are used. This observation has been exploited in many phylogenetic applications and is discussed in detail for example in [38]. Briefly, as a part of standard phylogenetic likelihood evaluation [39], each node An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e024.jpg (both tips and internal nodes) of the phylogenetic tree is populated with a vector of partial probabilities An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e025.jpg that containts the probability of observing the subtree rooted at An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e026.jpg if the character (i.e. a nucleotide in our case) at An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e027.jpg is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e028.jpg. To evaluate the likelihood of the entire tree (for a single site), the following expression is computed at the root:

equation image

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e030.jpg iterates over the children of the root node, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e031.jpg gives the stationary frequency of nucleotide An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e032.jpg (estimated by counts from the data) and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e033.jpg denotes the probability of substituting nucleotide An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e034.jpg with nucleotide An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e035.jpg along the branch that ends in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e036.jpg. The critical observation to be made here is that if nothing but the lengths of branch emanating from the root node change during optimization (i.e. only An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e037.jpg changes), then An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e038.jpg do not have to be recomputed, reducing the complexity optimization problem to that on a star tree with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e039.jpg( = 3 for standard phylogenetic applications) tips.

For time-reversible models, the root can be arbitrarily placed on any branch of the phylogenetic tree. Hence, we can reroot the tree at the point where the query sequence is grafted and reduce the computational complexity as explained above. To do this, in addition to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e040.jpg, we also precompute (for every node except the root and only once per analysis) the collection of vectors An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e041.jpg, that contain conditional probabilities of the parent node of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e042.jpg, when An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e043.jpg is considered as the root node. For every non-root node An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e044.jpg the likelihood of the bifurcating reference tree can be equivalently expressed as:

equation image

The last expression is simply the likelihood of the tree rerooted exactly at node An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e046.jpg. Grafting the query sequence An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e047.jpg onto the branch leading to node An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e048.jpg will create three branches: the branch leading to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e049.jpg, the branch leading to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e050.jpg and the branch leading from the ancestor of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e051.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e052.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e053.jpg) to the parent of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e054.jpg For the first partition in Figure 1, for example, the single branch of the reference tree leading to tip 1, was transformed into three branches by grafting Q–the branch leading to tip An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e055.jpg, the branch leading to query An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e056.jpg and the branch leading to the parent of 1 and Q. Consequently, the likelihood of the tree with the query sequence An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e057.jpg grafted onto the branch leading to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e058.jpg can be computed as:

equation image

This expression is the likelihood of a three-taxon star tree with the root at node An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e060.jpg (sum over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e061.jpg) and three children: An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e062.jpg (sum over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e063.jpg), An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e064.jpg (sum over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e065.jpg) and the parent of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e066.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e067.jpg (sum over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e068.jpg). Note that because An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e069.jpg is always a tip, the conditional probabilities in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e070.jpg are trivial to compute, and it follows that the cost of evaluating the likelihood of the reference tree with a grafted tip (given precomputed quantities, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e071.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e072.jpg–done only once for the reference alignment, independent of the query sequence) is equivalent to the three-taxon case.

Mosaic selection using a genetic algorithm

We use an aggressive genetic algorithm (GA) with elitist selection that is based on the CHC procedure [40] to rapidly search a combinatorially large space of possible mosaics for a fixed number of breakpoints. The algorithm operates on a population of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e073.jpg binary strings (individuals), each representing an encoded mosaic with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e074.jpg breakpoints. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e075.jpg fragments (‘genes’) are needed to encode the mosaic - An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e076.jpg for the location of breakpoints, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e077.jpg for lineage assignments on each non-recombinant fragment (see Figure 1). We restrict breakpoints to only occur at variable alignment sites as was done previously in our GARD method [29]. In addition, the breakpoints must be a minimum distance (denoted as a tunable parameter An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e078.jpg) away from each other or from the ends of the sequence; this simply reflects the fact that a minimum number of sites is necessary to resolve the phylogenetic placement of a sequence.

The placement of the query sequence in the reference tree is represented by the binary-encoded position of the branch in post-order traversal (cf. Figure 1). Breakpoint positions are represented using Gray binary coding, to ensure any two consecutive locations differ by a single bit, and hence can be reached by a single mutation [41]. For example, to change the position of a breakpoint from 7 (traditional binary 0111, Gray code 0100) to 8 (1000; 1100) it would be necessary to mutate all four bits in the traditional binary code, but only one bit in the Gray code. Breakpoints are always maintained in left-to-right ordering and any operations that disrupt this order are followed by resorting of breakpoints left to right (equivalent to gene order rearrangement).

Starting with the initial population of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e079.jpg mosaics, the algorithm proceeds as follows (refer to Figures 2 and and33 for a graphical description of the procedure). First, fitness of each mosaic An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e080.jpg (Eq. 1) is computed and the mosiac is assigned a mating probability inversely proportional to its fitness rank An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e081.jpg. The most fit mosaic reproduces becomes a parent for an offspring with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e082.jpg, while the least fit mosaic–with probability An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e083.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e084.jpg. The algorithm maintains a global lookup table (implemented as an AVL tree keyed on the bit string of the mosaic) to ensure that the maximum likelihood fitting of any given mosaic is carried out only once. Second, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e085.jpg pairs of parents are selected based on their mating probabilities to generate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e086.jpg offspring. The mating operator uses free recombination, where every bit of the child has a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e087.jpg probability of coming from either parent; this ensures rapid mixing of mosaic features. With probability An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e088.jpg the algorithm also induces genomic rearrangement in the offspring mosaic, by swapping adjacent fragments around a randomly selected breakpoint. Third, the existing population is augmented with the offspring, resulting in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e089.jpg mosaics, ranked according to BIC and filtered to include An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e090.jpg top-scoring mosaics in the next generation; this induces a strong selective pressure to remove mosaics with low fitness scores.

Figure 2
Algorithmic flowchart of SCUEAL.
Figure 3
Algorithmic flowchart of the genetic algorithm in SCUEAL.

Mutational processes are available to re-introduce genetic variability into inbred populations. First, hypermutation is triggered if the diversity of the population, measured as the relative difference between in BIC between the best and worst fitting mosaic (i.e. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e091.jpg), drops below a fixed threshold, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e092.jpg. All mosaics in the population, except the best fitting one, have their bits toggled with fixed probability An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e093.jpg. Second, if no generation-to-generation BIC improvement was observed for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e094.jpg consecutive generations, local mutation is carried out. The bottom two thirds of the population are replaced by mutated versions of the best fitting mosaic, generated by selecting a fragment to mutate at random and providing local coverage for that fragment. Local coverage is introduced by first drawing a random branch if the gene encodes a lineage, or a random position within An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e095.jpg bp of the current position for a breakpoint location gene, and then generating An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e096.jpg consecutive values for the gene. For example, if the new random position for the breakpoint is drawn as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e097.jpg, then mosaics with the breakpoint at An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e098.jpg will be placed in the population.

The algorithm terminates if no BIC improvement has been obtained for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e099.jpg consecutive generations. The number of breakpoints is increased from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e100.jpg until no BIC improvement has been found for two consecutive values of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e101.jpg. The case of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e102.jpg is solved exhaustively; the initial population for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e103.jpg is generated randomly; the initial population for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e104.jpg is seeded by the best mosaic from the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e105.jpg run, with a randomly placed additional breakpoint and lineage assignment. For An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e106.jpg, we also add a step down procedure to confirm that the improvement in score obtained by incrementing An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e107.jpg was due to a genuine additional breakpoint and not due to premature termination at the previous step (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e108.jpg); to do so, we generate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e109.jpg mosaics by randomly removing a breakpoint from the best-fitting mosaic with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e110.jpg breakpoints (randomly assigning the query sequence to one of the two parental lineages, and introducing mutations at rate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e111.jpg) and run an iteration of the GA with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e112.jpg points using the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e113.jpg mosaics as a starting population. If the follow-up GA with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e114.jpg breakpoints matches or improves upon the score with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e115.jpg breakpoints, then the next phase of the GA is run on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e116.jpg breakpoints, otherwise, the next phase operates on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e117.jpg breakpoints. To further enhance algorithm robustness, we evolve three independent populations (from completely random starting mosaics) to convergence, compose the mixed population by taking the top-scoring third of each population and evolve the combined population until convergence.

While it is possible to use the GA to also search for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e118.jpg directly (e.g. by duplicating or removing fragments), we found that the incremental search for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e119.jpg with the step-down verification stage has better convergence properties and runs considerably faster.

Algorithm parameter values selected for the analyses in this paper are as follows: An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e120.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e121.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e122.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e123.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e124.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e125.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e126.jpg. parameter values were selected based on our previous experience with GARD [29], and further adjusted based on how well the algorithm performed on simulated data and run time.

Result processing

After a GA run, BIC scores and mosaics from a large (typically An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e127.jpg) number (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e128.jpg) of fitted mosaics is available for processing. Instead of basing inference on the single best fitting mosaic, we adopt a multi-model inference procedure, whereby the contribution of each fitted mosaic is weighted based on its goodness-of-fit. Given the BIC score (fitness) of the best mosaic from the run, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e129.jpg, for every mosaic An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e130.jpg, we compute its Akaike weight, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e131.jpg defined in terms of its BIC score An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e132.jpg as

equation image

The constant An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e134.jpg is chosen so that An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e135.jpg. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e136.jpg can be interpreted as the probability that the i-th mosaic provides the best fit to the data [42].

We report the following quantities for each GA mosaic screen

  1. The structure of the best fitting mosaic, represented as the location of inferred breakpoints and lineage assignments, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e137.jpg.
  2. The model averaged support for the mosaic structure of the best model. This is defined as the sum of Akaike weights of all those models which agree with the best fitting model in everything except the coordinates of the breakpoints. E.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e138.jpg is consistent with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e139.jpg, but An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e140.jpg is not. High values (e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e141.jpg) of the model averaged support indicate that there are no other discordant mosaic structures that explain the evolutionary history of the query sequence.
  3. Model-averaged support for the locations of the breakpoints, that is computed by tabulating the model-averaged probability of observing a breakpoint at a given site over all sites in the alignment, based on the normalized Akaike weights of the models whose mosaics are consistent with the best fitting model. For instance the model An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e142.jpg will contribute its Akaike weight to sites An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e143.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e144.jpg. To determine An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e145.jpg confidence intervals for each breakpoint from the best fitting model, we build symmetric intervals around each breakpoint that contain at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e146.jpg the support. Note that confidence intervals are not uniquely defined in this setting (for example, we could extend the interval in the direction where the site immediately outside the current interval has greater model averaged support of a breakpoint), and we adopt symmetric intervals for simplicity.

Automated sequence alignment

The genetic algorithm requires the alignment of reference sequences with the query sequence as input that can be generated by any of the multiple sequence alignment programs. However because the reference alignment does not depend on the query sequence, it does not need to be re-aligned every time a new sequence is queried against it and the following simple heuristic can be employed. We preprocess the reference alignment by fitting an evolutionary model (nucleotide or codon for coding alignments) using the reference tree and inferring the root sequence for the reference tree using the joint maximum likelihood of [43]. Gaps in the alignment are treated as missing data from the purposes of root sequence reconstruction. In particular, the root sequence will not contain any gaps when reconstructed under standard nucleotide evolutionary models, because no sites in the reference alignment consist solely of gaps. This inferred root sequence can then be directly aligned with the query sequence using the Needleman-Wunsch dynamic programming algorithm [44], with affine gap costs and zero prefix and suffix gap costs on nucleotide or translated amino-acid data, and then up-converted into a multiple sequence alignment with all reference sequences consistent with the reference alignment. When aligning HIV or other viral sequences, organism specific scoring matrices [45] can be used to improve alignment quality. In addition to being very fast, this alignment heuristic is unlikely to introduce difficult-to-quantify biases common in progressive alignment approaches (e.g. [46]).

Reference alignment generation

We adopted a step-wise procedure of HIV-1 reference alignment construction. Beginning with a seed alignment of three sequences (e.g. one each from A, B and C for HIV-1), screened by GARD to ensure that the seed sequences are not recombinant, we augment the seed alignment from a collection of potential subtype reference sequences downloaded from the LANL HIV database. If a database sequence is labelled as pure subtype in LANL, is at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e147.jpg distant (Tamura-Nei 93 [47] genetic distance) from every sequence in the seed alignment, and is reported as being non-recombinant by SCUEAL, then it is added to the reference alignment. The process repeats until the collection of potential reference sequences has been exhausted.

Reference sequences for circulating recombinant forms (CRFs) are processed in a similar way, except that if the CRF sequence has An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e148.jpg breakpoints in the region for which the reference alignment is being built (e.g. the pol gene), then it is represented by up to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e149.jpg sequences in the final alignment. For instance, a 1000 bp sequence with the mosaic structure An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e150.jpg, will be represented by a sequence that clusters with the A clade and contains bases from 1–199 and 700–1000 and gaps between positions 200 and 699, and a complementary sequence (bases between 200 and 699, gaps elsewhere) that clusters with clade B. This is necessary to correctly place a recombinant sequence on the single reference tree. The GA disallows mosaic structures in which a query sequence would cluster with artificially introduced gaps in CRF component sequences. SCUEAL will correctly interpret clustering with the constituent sequences as clustering with the single CRF for the purposes of subsequent inference.

The resulting full length HIV-1 reference polymerase alignment comprised An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e151.jpg sequences encompassing the “pure” subtypes (including An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e152.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e153.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e154.jpg clades), SIVcpz and a reference strain from each of the CRFs (except CRF26,CRF38 and CRF41–43) for which no full length pol reference sequences were present in the database) listed in the Los Alamos HIV CRF compendium (http://www.hiv.lanl.gov/content/sequence/HIV/CRFs/CRFs.html accessed December 17th, 2008). We note that this procedure is not guaranteed to avoid mislabeling recombinant sequences as pure subtypes. Indeed if the recombinant strain is added to the reference before the parental strains, the latter will be incorrectly described as recombinants. For example, the original classification of subtype G sequences as a “pure” subtype is likely an artifact of the order in which A,G and CRF02 sequences were added to public databases [8]. Nonetheless, our procedure is undoubtedly an improvement over simply taking a collection of database sequences as a reference and assuming that they can be adequately described by a single tree; this practice should be avoided.

Simulated data

Each of the simulation scenarios summarized in Table 1 comprised An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e155.jpg parametrically generated alignments, using the general time reversible model of nucleotide substitution [31], equilibrium base frequencies of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e156.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e157.jpg, substitution rate parameters of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e158.jpg, and site-to-site rate heterogeneity modeled a G+I distribution with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e159.jpg of invariant sites and the shape parameter of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e160.jpg; all these parameters were selected to resemble values found in biological alignments of HIV-1. Recombination was introduced by generating alignments of fixed lengths along different tree topologies and then concatenating them; the spacing between breakpoints, tree topologies used and recombinant lineages are shown in the middle pane of each each figure; simulated data are available at http://www.hyphy.org/pubs/SCUEAL/. The trees were constrained to conform to the assumptions of the model–only one sequence (the query) was permitted to migrate from lineage to lineage. The correct tree and reference sequences were used for screening. The evolutionary scenarios used for simulation were designed to cover a range of recombination patterns with respect to the distribution of breakpoints, the level of sequence divergence and how far in the tree the recombinant sequence moved (close, medium or divergent). A subset of scenarios dealt with ‘ancient’ recombination events, i.e. lineage assignments to internal tree branches (for HIV-1 this would be equivalent to the recombination event predating the proliferation of the subtype). Several examples were specifically selected to mimic different divergence levels of HIV-1. An example of one recombination scenario is given in Figure 4 and the number and location of breakpoints can be found in Table 1. The collection of analogous figures for every simulation scenario can be found in Protocol S1.

Figure 4
A simulation scenario example.
Table 1
SCUEAL performance on simulated data.

Because mosaic analyses are frequently used in HIV-1 research, we also generated An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e173.jpg sequences by concatenating fragments from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e174.jpg sequences of partial HIV-1 polymerase genes, spanning all of protease up to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e175.jpg nucleotides of reverse transcriptase obtained from the Los Alamos HIV sequence database (http://hiv.lanl.gov). Each sequence was pre-screened using SCUEAL to ensure that only pure subtypes formed the base of this simulation. The number of fragments for each simulated sequence was drawn from a +1-shifted Poisson distribution with the mean of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e176.jpg breakpoints/alignment; this guaranteed at least one breakpoint per alignment. The length of each fragment as a proportion of the total alignment length of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e177.jpg was determined using the stick-breaking process with beta distribution parameters An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e178.jpg. A An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e179.jpg value was drawn from the beta distribution and the longest remaining fragment was split in that proportion to introduce each consecutive breakpoint into a sequence; if the shorter of the two resulting fragments was not at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e180.jpg long, the proportion was rejected and the process was repeated with a new beta-distributed proportion. Simulated sequences were screened against an alignment of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e181.jpg pure subtype reference sequences culled from our pol reference set (no CRFs were included in the reference). Note that because reference sequences were not identical to those used to generate the mosaics, this scenario simulated both recombination and mutational divergence found in HIV-1.

A Surveillance Study

A bread-and-butter application of HIV subtyping algorithms is to characterize the subtype distribution in a cohort of patients or a geographic region and make inferences about the history and dynamics of HIV infection. We selected one of such recently published studies [48] that subtyped An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e182.jpg partial pol sequences from Bulgaria using REGA and found a diverse composition of subtypes, including three unassigned sequences.

Database sequences

We downloaded all An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e183.jpg available reverse transcriptase sequences from the Stanford HIV drug resistance database, an ad hoc global sequence collection, that were (http://hivdb.stanford.edu/) annotated with one of the nine pure subtypes (or sub-subtypes e.g. A1), CRF01 (AE), CRF02 (AG) and applied SCUEAL to estimate what proportion of sequences may be unclassified inter-subtype recombinants, and the frequency of within-subtype recombination. The algorithm that currently performs database sequences annotation uses a neighbor joining phylogeny of the query sequence aligned to 100 reference sequences (spanning all group M subtypes and CRF01-CRF19) to assign the query sequence the subtype of the enclosing or nearest clade (R. Shafer, personal communication; also see [49]).

A total of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e184.jpg partial polymerase sequences from HIV infected individuals in the UK were available through the UK HIV Drug Resistance Database (www.hivrdb.org). This database is a central repository for HIV sequence data obtained in the course of routine clinical care and was established as a collaboration of 14 clinical centers and virology laboratories and 3 academic departments. The database acts as a resource for clinical, virological and epidemiological studies for the collaborating centres. The sequences released for analysis with SCUEAL had been fully anonymized and delinked and previously processed using REGA and Stanford [49] subtyping algorithms (Hughes GJ, Fearnhill E, Dunn D, et al. Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom. PLoS Pathog. in process). We sought to compare the performance of SCUEAL to the other tools on a real-world task of automatic subtype classification of this complex sequence dataset assembled for population surveillance of a national HIV epidemic of significant subtype complexity.

Implementation

The algorithms presented in this paper have been implemented as a collection of HyPhy [50] batch language scripts and can be dowloaded from http://www.hyphy.org/pubs/SCUEAL/. A README file explaining code usage and providing examples is included with the download. Simulated, biological and reference alignments and SCUEAL results can be downloaded from the same URL. An easy to use implementation of SCUEAL to screen up to 500 (this limit will be increased over time) sequences using a computer cluster maintained by the authors is available as a part of the Datamonkey http://www.datamonkey.org/ web server. Run times of SCUEAL on HIV-1 pol sequences depend on the complexity of the inferred mosaic type and take anywhere from 1–2 minutes for a pure subtype to up to an hour for a complex mosaic subtype on a desktop computer. Multiple query sequences can be screened in parallel if an MPI distributed environment is available. The screen of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e185.jpg partial pol sequences from the UK drug resistance database took approximately 18 hours using 200 processors of an MPI cluster, translating to an average of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e186.jpg CPU/minutes per sequence.

Results

Simulation results

Parametric simulations

Parametric simulations tend to generate copious amounts of raw data (e.g. see Protocol S1) that are difficult to interpret directly, hence we generated a compact representation of simulation scenarios and results in Table 1 using a several descriptive metrics.

First and foremost one is interested how often is the correct mosaic (the order and identity of lineage assignments, e.g. 1-3-1-3-1 for the scenario in Figure 4) is recovered; this metric does not evaluate the accuracy of breakpoint placement. When an incorrect mosaic is reported, three types of classification errors are possible.

  • A subset of the correct mosaic is recovered, i.e. some of the breakpoints are missed. For instance 1-3-1 would be a subset of the 1-3-1-3-1 mosaic. The method behaves conservatively in this case.
  • A superset of the correct mosaic is recovered, i.e. in addition to all of the correct breakpoints spurious ones are inferred. For instance 1-3-1-3-1-1 would be a superset of the 1-3-1-3-1 mosaic. The method is overly liberal in this case.
  • When the recovered mosaic is neither the subset nor the superset of the correct one, a mismatch has occurred. For example, 1-4-1-3-1 would be mismatched with 1-3-1-3-1. The method is inconsistent in this situation.

When a classification error occurs, it can either be because the GA failed to find the optimal solution, or because there is insufficient signal (due to small fragment length, low divergence etc) to infer the correct mosaic using the BIC criterion. The error due to the GA is an undesirable outcome, and we categorize each of the misclassified replicates into those which had worse fitness than the correct model (GA error) and those which had better fitness that the correct models (insufficient signal).

Second, we tabulated how often each of the correct breakpoints was recovered, and collected descriptive statistics about where the inferred locations were placed. A breakpoint was inferred ‘recovered’ if SCUEAL inferred at least one breakpoint, and the nearest inferred breakpoint to the simulated position involved correct parental lineages. For instance, a simulated A to B breakpoint at nucleotide 1000 would be counted as recovered in the inferred mosaic An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e187.jpg, but not in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e188.jpg.

The method has a very low rate of false positives correctly classifying 100/100 cases in Scenario 1 (no recombination). SCUEAL shows excellent operating characteristics when sequence divergence between parental strains and/or non-recombinant fragment length is sufficiently high; these two parameters approximate information content in the sequence. In scenarios 2,3,5 (except ancient recombination), 7 and notably, 11 (designed to simulate a typical HIV-1 CRF situation), SCUEAL assigned An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e189.jpg or more of replicates to the correct mosaic type; each of the breakpoints was also mapped very accurately with the standard deviation on the order of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e190.jpg. A very short non-recombinant fragment in scenario 4 (100 bp) made it difficult to detect recombination reliably; increasing the distance between parental strains dramatically increased the power, however from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e191.jpg for close parents to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e192.jpg for distant parents.

Ancestral recombination involving interior branches in the tree (e.g. see Protocol S1) also complicated mosaic classification because of weaker phylogenetic signal. In all three scenarios with the ancient option (2,3 and 5), the proportion of correctly identified mosaics was lower than for extant parental lineage situation, but in most missed (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e193.jpg) cases the assigned mosaic had a better BIC score - suggesting lack of phylogenetic signal as the main source of error. Overall, the ability of SCUEAL to accurately describe over 50% of mosaics due to ancient recombination is encouraging as many HIV-1 CRFs appear to be the result of ancient recombination, i.e. they fail to unambiguously cluster with any of the reference “pure” subtypes.

The complex pattern in scenario 6, where 10 non-recombinant fragments of length An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e194.jpg each, made concurrent detection of all 9 breakpoints difficult (30% for close parents and 64% for divergent strains). However, this was mostly due to one or two missed breakpoints–the average accuracy of mapping each individual breakpoint was high (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e195.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e196.jpg), with no single breakpoint (e.g. in the middle of the alignment vs close to one of the ends) missed at an abnormally high frequency. Scenario 6/close, is the only scenario (many short fragments with relatively close parental strains) where the majority An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e197.jpg of classification errors were due to premature GA termination; this could be improved by adjusting GA parameters at the expense of longer run times.

SCUEAL could not detect recombination in sequences with very low (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e198.jpg) parental strain divergence (scenarios 8/close and 9/close), and had low (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e199.jpg) power in breakpoint detection for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e200.jpg divergent strains in scenarios 8/divergent, 9/divergent and 10, overwhelmingly due to lack of phylogenetic signal and not to premature GA convergence.

Each individual detected breakpoint was on average very close (standard deviations in the range of 5–20 bp) to a true breakpoint, confirming that SCUEAL produces a high-resolution breakpoint map.

HIV pol simulations

Using the classification defined in the previous section, SCUEAL performance on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e201.jpg simulated data sets can be summarized thus: An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e202.jpg correct sequence mosaics (i.e. each breakpoint and correct lineage) were recovered, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e203.jpg recovered mosaics were supersets (extra breakpoints) of the correct type, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e204.jpg - subsets (missed breakpoints) of the correct type, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e205.jpg - mismatched. Overall, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e206.jpg of the replicates were correctly identified as recombinant strains. Of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e207.jpg simulated breakpoints, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e208.jpg were recovered correctly, with a median distance between the simulated and the inferred breakpoint of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e209.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e210.jpg for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e211.jpg range). Median level of model averaged support for the inferred mosaic was estimated at An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e212.jpg.

However, these numbers alone do not present the complete picture of how the method performed - the power to detect recombination is significantly dependant upon the length of recombinant strains (e.g. a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e213.jpg fragment is easier to detect than a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e214.jpg one on average), and the relative level of divergence between parental strains (e.g. inter-subtype recombination is easier to detect than within-subtype). To capture these dependancies, we binned all breakpoints in simulated strains by the length of the shorter of the flanking fragments and the pairwise genetic distances between parental strains over that fragment. For example the breakpoint in a mosaic of type A–B with the 500 bp coming from subtype A and 200 bp coming from subtype B, would contribute to the bin with 200 bp length and the genetic distance between strain A and strain B over the last 200 bp of the sequences. We next plotted detection power, i.e. the proportion of times a correct-type breakpoint (e.g. A–B for the previous example) was inferred within 100 bp of the simulated breakpoint (see Fig. 5). The power of the method to detect a breakpoint grows with the length of the flanking recombinant fragments and the genetic distance of the two parental strains. For example, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e215.jpg of all breakpoints generated from sequences more than An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e216.jpg divergent and involving fragments of at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e217.jpg on either side were correctly identified. The fraction increased to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e218.jpg for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e219.jpg or greater divergence and at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e220.jpg-long fragments–values encountered with commonly annotated inter-subtype recombinant mosaics in HIV.

Figure 5
Power and accuracy in the sequence shuffling simulation.

Surveillance study

The results of SCUEAL and REGA screening of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e221.jpg partial polymerase sequences isolated from patients in Bulgaria [48] were quite similar, yet revealingly different in some cases. The methods concurred on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e222.jpg sequences, reporting An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e223.jpg subtype A sequences, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e224.jpg–subtype B, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e225.jpg–CRF01 (AE), An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e226.jpg each of C and G, and one of subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e227.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e228.jpg. Figure 6 depicts a query sequence on which the methods agreed well. Both the neighbor joining tree and the bootscan plot based on the automatic alignment produced by REGA indicate strong clustering with the B clade and lack of evidence for recombination, yielding an assignment confidence of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e229.jpg. Concordantly, SCUEAL reports a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e230.jpg model averaged support for clustering with a clade B sequence, although there is a bit of uncertainty which exact lineage the query should be grafted on.

Figure 6
An example of a good agreement between SCUEAL and REGA in classifying a partial pol subtype B sequence.

There are several kinds of disagreement between REGA and SCUEAL classification results.

Unassigned sequences

REGA did not assign a subtype to six sequences in the sample. This happens either when there is insufficient phylogenetic bootstrap support for clustering with a pure subtype or CRF reference, or when bootscan detects a recombinant form that is not well explained by an existing CRF. Because SCUEAL uses a much larger reference alignment than REGA (e.g. there are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e231.jpg sequences in the greater B clade, including a number of CRF fragments that cover parts of the pol gene, vs An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e232.jpg in the default REGA) alignment, it was able to assign An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e233.jpg of the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e234.jpg sequences to subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e235.jpg with high (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e236.jpg) confidence. Interestingly, these sequences were grafted onto interior branches of the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e237.jpg clade, highlighting the intrinsic power of SCUEAL of being able to make full use of the fixed reference topology. The remaining two sequences were classified as novel recombinant forms, in congruence with the bootscan profile. For example, in Figure 7, a novel A–J recombinant is reported by both methods, but REGA's conservative assignment scheme would still report this case as unassigned. SCUEAL proposes several A–J type recombinant forms, with A-A1-J-A2 being the best supported one; overall there is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e238.jpg model-averaged support for presence of recombination in this sequence. Due to a much larger set of subtype A reference sequences, our approach is capable of a more precise characterization of the mosaic, whose breakpoints are mapped very accurately (to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e239.jpg base pair). The sliding window nature of phylogenetic bootscanning (REGA uses a 400 bp window with a 50 bp stride by default) does not naturally permit precise breakpoint mapping. Splitting the sequence along the A–J boundary and building traditional neighbor joining trees using the REGA reference alignment, confirms the structure predicted by SCUEAL.

Figure 7
An instance when a sequence unclassified by REGA is inferred to be a novel recombinant form by SCUEAL; the A–J mosaic structure is also confirmed by trees and bootscan plots based on the REGA reference alignment.

Within-subtype recombination

Seven of the discordant results occurred when a sequence classified as pure subtype by REGA was identified as within-subtype recombinant by SCUEAL. An example of this is shown in Figure 8, where the putative parental strains are approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e240.jpg divergent on the tree.

Figure 8
An example of within-subtype (B) recombination detected by SCUEAL, but not by REGA. A partial reference tree with placed query is shown; color coding is consistent between the similarity plot and the tree.

Missed recombinants

The remaining An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e241.jpg mismatches arose when a pure subtype sequence (according to REGA) was instead reported as an inter-subtype recombinants with very strong (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e242.jpg) model-averaged support for recombination. The obvious explanation for why REGA may be missing these recombinants is that the size of the sliding window used for bootscanning (400 bp) limits how short individual mosaic fragments can be. This limitation becomes relevant for single gene recombination analysis, when the total length of the sequence is on the order of 500–1000 bp. The A-B-A mosaic example in Figure 9 was classified as subtype A by REGA. However, adjusting the sliding window parameters from to use window size of 200 bp instead of 400 bp and stride 25 bp instead of 50 bp revealed that subtype B sequences from the REGA reference alignment were genetically closer to the query than subtype A sequences over the segment predicted by SCUEAL to cluster with subtype B. Furthermore, a maximum likelihood tree (exhaustive search) on that segment supports the same clustering.

Figure 9
An instance when a sequence assigned to subtype A by REGA is deduced to be an A-B-A mosaic by SCUEAL.

Stanford database sequences

SCUEAL analyses indicate that while a majority of sequences annotated as pure subtype in the Stanford drug resistance database are assigned to a correct subtype, a substantial proportion (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e243.jpg depending on subtype) are better explained as circulating or unique recombinant forms (CRF/URF) and a similar proportion appear to be within-subtype recombinants (Table 2). Importantly, there are only a few cases when SCUEAL infers a pure subtype sequence which is annotated with a different pure subtype in the database. For instance, out of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e244.jpg subtype B sequences there were An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e245.jpg subtype D sequences, two–subtype J and two–subtype A, hence the vast majority of potentially misclassified subtypes in the database are due to recently characterized CRFs and URFs which are partially derived from the database subtype. When SCUEAL infers recombination, model averaged support for at least one breakpoint is very strong (median An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e246.jpg, mean An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e247.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e248.jpg for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e249.jpg range), but the inference of the exact mosaic type is less certain on average (median An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e250.jpg, mean An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e251.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e252.jpg for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e253.jpg range), which is not surprising given that many of the sequences are quite short.

Table 2
SCUEAL screening results on partial HIV-1 reverse transcriptase sequences from the Stanford Drug Resistance database.

Agreement for subtypes H and K is unusually poor, however there are only a few sequences assigned to this subtype, and a small number of existing reference samples to base inference upon. In particular, many sequences annotated as subtype K appear to have been partly derived from CRF30 and CRF32 strains. Over An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e298.jpg of sequences annotated as subtype F are classed as B,F (or partial CRFs) recombinants by SCUEAL, but this can be expected as there are at least seven known CRFs (17, 28, 29, 38–40, 42) that are comprised of B and F mosaics with one or more breakpoints in the pol gene. For CRF02-annotated samples, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e299.jpg of the sequences that were classified differently by SCUEAL as A,G recombinants appear to support breakpoints that are different from those included in the reference CRF02 strains. This could indicate that a larger sample of CRF02-like reference strains may be necessary to accurately capture the diversity of these viral strains.

HIV evolution in the era of Highly Active Antiretroviral Therapy (HAART), especially in the developed world, is significantly influenced by selective forces that favor viral strains with mutations that confer drug resistance in the presence of a corresponding drug. This is especially true of subtype B viruses, circulating in North America and Western Europe, where HAART has been exerting well-characterized selective pressure on the virus for over a decade [51], leading to increasing prevalence of HIV strains that harbor drug resistant associated mutations (DRAM, e.g. [52],[53]). Convergent evolution to acquire DRAM can have a confounding effect on phylogenetic subtyping methods, by making regions rich in DRAM appear closely related in evolutionarily distant strains and potentially leading to a false signal of within- (or inter-) subtype recombination. To assess this effect, we identified subtype B RT sequences (as annotated in the database) that harbored at least one known DRAM [51] (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e300.jpg) and reran SCUEAL on these sequences after replacing all DRAM with missing data (3 in-frame gaps for each DRAM codon, e.g. any codon at position 215 in reverse transcriptase that encodes an An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e301.jpg or a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e302.jpg). Between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e303.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e304.jpg positions (median An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e305.jpg) per sequence were masked by this procedure. DRAM masking substantially reduced the number of sequences that were classified as within-subtype recombinants, taking the number down from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e306.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e307.jpg. For other subtypes, where the frequency is of DRAMs is lower than in subtype B sequences, the effect of masking DRAMs on the proportion of inferred intra-subtype recombinants (and other recombinant forms) is much more muted (Table 2). Consequently, convergent evolution to acquire drug resistant mutations appears to be a significant factor contributing to the within-subtype recombination signal, although the reduction in phylogenetic signal due to fewer informative sites in masked sequences is also a possible cause of this effect.

Large scale subtype classification in a surveillance and epidemiological linkage study

The comparison between SCUEAL and REGA on this data set (see Table 3), is similar to what was observed for the Stanford dataset. For well sampled subtypes (A,B,C,D,F,G,AE) the agreement between the methods was good to excellent (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e308.jpg), with a noticeable proportion (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e309.jpg) of within-subtype recombinants. Note that the proportion of within-subtype recombinants was not as significantly affected by masking out DRAMs as discussed in the previous section; for example the proportion was reduced from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e310.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e311.jpg for subtype B sequences, and actually increased for subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e312.jpg sequences. This could be because the UK sequences are longer than (both protease and reverse transcriptase) than the Stanford sample (reverse transcriptase only). Also, because SCUEAL is a stochastic algorithm, some variation (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e313.jpg in our simulation experiments, results not shown) between runs due to the indeterministic nature of the algorithm, especially between “borderline” sequences (those sequences that have a weak support for a the inferred mosaic), is to be expected. Small proportions of inter-subtype recombinants called by SCUEAL were not not identified by REGA. For CRF02 and CRF06, the agreement was quite poor, however the discord is easy to explain. For CRF02 SCUEAL identified many A,G recombinants but with breakpoints differing from those mapped for CRF02 (note that it is likely that G is the recombinant strain, but we refer to CRF02 as the recombinant to maintain compatibility with the current nomenclature); other CRF strains that include CRF02 - like fragments in pol (CRF30, CRF36) account for most of the other discrepancies. For sequences typed as CRF06 by REGA, the majority of SCUEAL classification involve CRFs derived from CRF06 (e.g. CRF30, CRF32).

Table 3
SCUEAL screening results on partial HIV-1 polymerase sequences from the UK.

Of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e358.jpg sequences, a non-trivial proportion An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e359.jpg were not classified by REGA, with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e360.jpg of those also not classified by the HIVdb subtyping algorithm. According to SCUEAL An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e361.jpg were URFs, and the remainder–pure subtypes of CRFs. Among An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e362.jpg sequences, SCUEAL identified An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e363.jpg complex recombinant forms (more than 3 constituent sub- or subsubtypes) and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e364.jpg URFs with at least An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e365.jpg sequences each, including:

  1. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e366.jpg G,A and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e367.jpg G,CRF02 recombinants. Given the degree of uncertainty about mapping the breakpoints in CRF02 (the nomenclature here is confusing, because recent evidence suggests that G was derived as a recombinant of A and CRF02 sequences) reference sequences, these sequences can be thought of as a A,G, CRF02 recombinants sequences. The finding also indicates that the diversity of this clade is quite significant, and reference data sets may need to be enriched for A, G and CRF02 sequences to enable more accurate subtype assignment. REGA assigned An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e368.jpg of these sequences to subtype CRF02, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e369.jpg to subtype A(A1), An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e370.jpg to G, one to subtype An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e371.jpg and did not classify An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e372.jpg sequences.
  2. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e373.jpg A1,D recombinants. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e374.jpg of those were not definitively classified by REGA, with the remainder assigned to A1 (12), D (1) and CRF10_CD (1). There are several precedents for this type of mosaic structure, including CRF16, CRF19 and CRF35 which all have a mosaic AD structure in pol.
  3. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e375.jpg B,CRF39 recombinants. Because most of pol sequence in CRF39, which is found circulating in Brazil, is mapped to subtype B [54], this form can be reported as a B intra-subtype recombinant. This finding also illustrates the capacity of SCUEAL to map within-subtype diversity with high resolution. Almost all (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e376.jpg) of those are identified as subtype B sequences by REGA, which is correct if within-subtype recombination is discounted.
  4. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e377.jpg B,D recombinants, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e378.jpg of which were classified as subtype B by REGA. Because B and D subtypes are closely related (compared to other between-subtypes comparisons), this mosaic type is difficult to detect.

Discussion

We present a new phylogenetic method (SCUEAL) to automatically determine a subtype and map the recombinant structure in HIV-1 sequences. Our method uses a statistically robust maximum likelihood multi-model inference approach to examine tens of thousands of potential mosaic structures in a single run guided by an evolutionary algorithm, identify those well supported by the data and quantify the reliability of all estimated quantities. SCUEAL is designed to handle the inclusion of recombinant strains in reference alignments, operate on large reference alignments with minimal loss of speed and permit easy expansion of existing reference alignments as new subtypes or circulating recombinant forms.

Using an extensive collection of simulated sequence alignments, covering a wide range of evolutionary parameters and including biological HIV-1 sequences, we determined that the method was capable of accurate detection of the number and location of recombination breakpoints as well as appropriate parental lineages, given sufficient sequence divergence. For non-parametrically generated HIV-1 pol mosaics, the recovery rate of breakpoints was An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e379.jpg for 5% or greater divergence between parental strains and 200 bp or longer sequence fragments. On average, individual breakpoints were inferred within 10 bp of the simulated locations. SCUEAL had a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e380.jpg rate of false positives on parametrically simulated data.

A comparison with a popular phylogeny based rapid subtyping tool REGA [18] on an HIV-1 pol surveillance dataset [48] illustrated that SCUEAL was able to automatically detect recombinant sequences with short mosaic fragments, classify and map unknown mosaic types and resolve cases that confounded REGA. A large scale screen of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e381.jpg database sequences revealed that approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e382.jpg of pure subtype reverse transcriptase sequences show evidence of within-subtype recombination and a further An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e383.jpg are likely novel or known circulating recombinant forms, highlighting the need for more precise determination of subtype information for public databases. Because up to 10% of HIV-1 infections occur with Unique Recombination Forms (URFs) when superinfection with divergent strains is relatively common (e.g. [55],[56]), the ability of SCUEAL to automatically annotate such forms is of critical importance. Furthermore, many evolutionary analyses, such as dating and selection screens, can be biased by the inclusion of recombinant sequences without necessary corrections [57],[58]. Studies that seek to identify clinical and evolutionary differences between different HIV subtypes (e.g. [59],[60]) also rely on the accurate classification of subtypes for all input sequences. To our knowledge, none of the existing subtype classifiers are designed to detect within-subtype recombination, which is in all likelihood much more frequent than inter-subtype recombination because sufficiently divergent strains of the same subtype routinely co-circulate in host populations (e.g. [61]) and within-host sequences often present phylogenetic evidence of extensive recombination [62]. We note that convergent evolution to acquire drug resistance associated mutations appears to have a strong confounding effect on detecting within-subtype recombination and should be accounted for if the focus of the analysis is to identify within-subtype recombination in regions of HIV that include many such mutations.

SCUEAL provides an automatically determined mosaic structure for any input sequence, including the cases when existing methods fail to derive such a structure. While this feature is a qualitative advance over existing approaches, it may also invite over-interpretation of computational results, and we emphasize that this should be avoided. Consider for example, the strain presented in Figure 7. SCUEAL results allow us to deduce that the strain is an inter-subtype recombinant with a high degree of confidence (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e384.jpg). The analysis also strongly implies that An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e385.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000581.e386.jpg strains or their ancestors contributed segments of the pol gene to the query sequencer, but also reports several credible mosaic forms that could be assigned to the strain, counter-indicating a definitive (e.g. A-A1-J-A2) mosaic determination. We would like to stress that SCUEAL determination of a novel recombinant form should not lead the users to automatically declare the sequence as such, but rather as an invitation to perform further examination of the data, perhaps with a specialized reference alignment, enriched for the subtypes detected by SCUEAL. Continuing with the example, the combination of A and J subtypes in one sequence is not uncommon (e.g. CRF06, CRF11, CRF13, CRF27) and extensive mosaicism in the pol gene has also been reported previously [63]. Moreover, the “J” clade in the SCUEAL reference alignment also contains J-like segments from several CRFs that circulate more widely that pure subtype J strains confined primarily to Central and West Africa [4]. Whether or not the segment assigned to clade J may instead belong to an unsampled clade of HIV-1 cannot ultimately be determined with the currently available estimate of HIV-1 diversity. Subtype classification is extensively used as a tool in molecular epidemiology and in surveillance studies of HIV because of their association with different populations. Multiple subtypes were detected in the UK in 1995 [64] and by 2007, non-B subtypes comprised the majority of new diagnoses in the UK [65],[66]. In addition, subtype classification is important for clinical reasons in HIV because of biological differences that have been observed with respect to rate of progression to disease [67], and patterns of drug resistance mutations [68],[69]. For that reason, sequences in the UK HIV Drug Resistance Database are routinely subtyped before analysis. The rapid increase in scale of the task (the current database release contains over 50,000 sequences) and the range of diversity of the subtypes and recombinants now present in the UK epidemic highlights an urgent need for an automated, informative, reliable and rapid method for classification on the sequence data collected that will scale to hundreds of thousands of sequences on commodity distributed computing platforms.

Empirical datasets in this study were limited to the partial polymerase gene of HIV, partly because this genetic region routinely sequenced for surveillance and diagnostic purposes, has few easily aligned indels–thus avoiding potential biases due to unreliable automatic multiple sequence alignment (e.g. [46]), and contains many of the breakpoints mapped for known CRFs. However, SCUEAL can use any reference alignment, including full length HIV-1, Hepatitis C virus, Influenza A virus genomes and non-viral sequences, and we plan to implement this functionality in future versions of SCUEAL.

Finally, we would be remiss to overlook some of the limitations of our approach. SCUEAL is a fairly computationally demanding method, and consequently is considerably slower that some other screening tools. Parallel execution on a computer cluster can mitigate this issue and permit one to process thousands of sequences per hour. As any method that is based on a reference alignment, SCUEAL is susceptible to biased inference if the reference alignment is inaccurate or if reference sequences are themselves misclassified. We took a number of precautions to ensure that the reference alignment was accurate by focusing on an easily alignable genomic region, a conservative automatic alignment procedure for the query sequence and an incremental algorithm for adding and accurately labeling reference sequences. SCUEAL uses a nucleotide evolutionary model to fit phylogenetic likelihood models for the sake of computational efficiency and this could lead to difficult to quantify biases in mosaic structure mapping; more realistic models (e.g. codon models) can be “plugged-in” without any alteration to the methodological framework if desired.

Supporting Information

Protocol S1

Settings and results for each of the parametric simulation scenarios

(4.08 MB PDF)

Acknowledgments

We thank the editors and three anonymous reviewers and the associate editor (Dr Christophe Fraser) for their comments on a previous version of this manuscript.

The views expressed in the publication are those of the authors and not necessarily those of the Department of Health.

Footnotes

The authors have declared that no competing interests exist.

This research was supported by the Joint DMS/NIGMS Mathematical Biology Initiative through Grant NSF-0714991, the National Institutes of Health (AI43638, AI47745, and AI57167), the University of California Universitywide AIDS Research Program (grant number IS02-SD-701), a University of California, San Diego Center for AIDS Research/NIAID Developmental Award to SDWF and SLKP (AI36214), and grant BIO2007-61411 (Spanish Ministry of Science) to DP. SDWF is supported in part by a Royal Society Wolfson Research Merit Award. This work was facilitated by IBM Deep Computing. GH was supported by the Medical Research Council. EF and the UK HIV Drug Resistance Database are partly funded by the UK Department of Health. Additional support is provided by Boehringer Ingelheim, Bristol-Myers Squibb, Gilead, Tibotec (a division of Janssen-Cilag Ltd) and Roche. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Zeldovich KB, Chen P, Shakhnovich EI. Protein stability imposes limits on organism complexity and speed of molecular evolution. Proc Natl Acad Sci U S A. 2007;104:16152–16157. [PMC free article] [PubMed]
2. Gaschen B, Taylor J, Yusim K, Foley B, Gao F, et al. Diversity considerations in HIV-1 vaccine selection. Science. 2002;296:2354–2360. [PubMed]
3. Robertson DL, Anderson JP, Bradac JA, Carr JK, Foley B, et al. HIV-1 nomenclature proposal. Science. 2000;288:55–56. [PubMed]
4. Hemelaar J, Gouws E, Ghys PD, Osmanov S. Global and regional distribution of HIV-1 genetic subtypes and recombinants in 2004. AIDS. 2006;20:W13–23. [PubMed]
5. Gifford RJ, de Oliveira T, Rambaut A, Pybus OG, Dunn D, et al. Phylogenetic surveillance of viral genetic diversity and the evolving molecular epidemiology of human immunodeficiency virus type 1. J Virol. 2007;81:13050–13056. [PMC free article] [PubMed]
6. Papathanasopoulos MA, Hunt GM, Tiemessen CT. Evolution and diversity of HIV-1 in africa–a review. Virus Genes. 2003;26:151–163. [PubMed]
7. Rhodes T, Wargo H, Hu WS. High rates of human immunodeficiency virus type 1 recombination: near-random segregation of markers one kilobase apart in one round of viral replication. J Virol. 2003;77:11193–11200. [PMC free article] [PubMed]
8. Abecasis AB, Lemey P, Vidal N, de Oliveira T, Peeters M, et al. Recombination confounds the early evolutionary history of human immunodeficiency virus type 1: subtype G is a circulating recombinant form. J Virol. 2007;81:8543–8551. [PMC free article] [PubMed]
9. Posada D, Crandall KA. Evaluation of methods for detecting recombination from DNA sequences: Computer simulations. Proc Natl Acad Sci U S A. 2001;98:13757–13762. [PMC free article] [PubMed]
10. Posada D. Evaluation of methods for detecting recombination from dna sequences: empirical data. Mol Biol Evol. 2002;19:708–717. [PubMed]
11. Salminen M, Carr J, Burke D, McCutchan F. Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res Hum Retroviruses. 1995;11:1423–1425. [PubMed]
12. Triques K, Bourgeois A, Vidal N, Mpoudi-Ngole E, Mulanga-Kabeya C, et al. Near-full-length genome sequencing of divergent African HIV type 1 subtype F viruses leads to the identification of a new HIV type 1 subtype designated K. AIDS Res Hum Retroviruses. 2000;16:139–151. [PubMed]
13. Gomez-Carrillo M, Quarleri JF, Rubio AE, Carobene MG, Dilernia D, et al. Drug resistance testing provides evidence of the globalization of HIV type 1: a new circulating recombinant form. AIDS Res Hum Retroviruses. 2004;20:885–888. [PubMed]
14. Martin DP, Posada D, Crandall KA, Williamson C. A modified bootscan algorithm for automated identification of recombinant sequences and recombination breakpoints. AIDS Res Hum Retroviruses. 2005;21:98–102. [PubMed]
15. Gao F, Robertson DL, Carruthers CD, Li Y, Bailes E, et al. An isolate of human immunodeficiency virus type 1 originally classified as subtype I represents a complex mosaic comprising three different group M subtypes (A, G, and I). J Virol. 1998;72:10234–10241. [PMC free article] [PubMed]
16. Paraskevis D, Magiorkinis M, Vandamme AM, Kostrikis LG, Hatzakis A. Re-analysis of human immunodeficiency virus type 1 isolates from Cyprus and Greece, initially designated ‘subtype I’, reveals a unique complex A/G/H/K/? mosaic pattern. J Gen Virol. 2001;82:575–580. [PubMed]
17. Casado G, Thomson MM, Sierra M, Najera R. Identification of a novel HIV-1 circulating ADG intersubtype recombinant form (CRF19_cpx) in Cuba. J Acquir Immune Defic Syndr. 2005;40:532–537. [PubMed]
18. de Oliveira T, Deforche K, Cassol S, Salminen M, Paraskevis D, et al. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics. 2005;21:3797–3800. [PubMed]
19. Wu X, Cai Z, Wan XF, Hoang T, Goebel R, et al. Nucleotide composition string selection in HIV-1 subtyping using whole genomes. Bioinformatics. 2007;23:1744–1752. [PubMed]
20. Rozanov M, Plikat U, Chappey C, Kochergin A, Tatusova T. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 2004;32:W654–9. [PMC free article] [PubMed]
21. Myers RE, Gale CV, Harrison A, Takeuchi Y, Kellam P. A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics. 2005;21:3535–3540. [PubMed]
22. Schultz AK, Zhang M, Leitner T, Kuiken C, Korber B, et al. A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics. 2006;7:265. [PMC free article] [PubMed]
23. Loveday C, MacRae E. Limitations in using online tools to determine HIV-1 subtype in clinical patients: A comparison of 5 tools. 15th International HIV Drug Resistance Workshop. 2006.
24. Ntemgwa M, Gill MJ, Brenner BG, Moisi D, Wainberg MA. Discrepancies in assignment of subtype/recombinant forms by genotyping programs for HIV type 1 drug resistance testing may falsely predict superinfection. AIDS Res Hum Retroviruses. 2008;24:995–1002. [PubMed]
25. Holguin A, Lopez M, Soriano V. Reliability of rapid subtyping tools compared to that of phylogenetic analysis for characterization of human immunodeficiency virus type 1 non-B subtypes and recombinant forms. J Clin Microbiol. 2008;46:3896–3899. [PMC free article] [PubMed]
26. Gifford R, de Oliveira T, Rambaut A, Myers RE, Gale CV, et al. Assessment of automated genotyping protocols as tools for surveillance of HIV-1 genetic diversity. AIDS. 2006;20:1521–1529. [PubMed]
27. Posada D, Crandall KA. The effect of recombination on the accuracy of phylogeny estimation. J Mol Evol. 2002;54:396–402. [PubMed]
28. Minin VN, Dorman KS, Fang F, Suchard MA. Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics. 2005;21:3034–3042. [PubMed]
29. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD. Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol. 2006;23:1891–1901. [PubMed]
30. Kosakovsky Pond SL, Frost SDW. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics. 2005;21:2531–2533. [PubMed]
31. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences. 1986;17:57–86.
32. Kosakovsky Pond S, Frost S. A simple hierarchical approach to modeling distributions of substitution rates. Mol Biol Evol. 2005;22:223–234. [PubMed]
33. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704. [PubMed]
34. Stamatakis A, Ludwig T, Meier H. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics. 2005;21:456–463. [PubMed]
35. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
36. Akaike H. A new look at the statistical model identification. IEEE Tran Automatic Control. 1974;119:716–723.
37. Sugiura N. Further analysis of the data by Akaike's information criterion and the finite corrections. Communications In Statistics-Theory And Methods. 1978;A7:13–26.
38. Yang ZH. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J Mol Evol. 2000;51:423–432. [PubMed]
39. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376. [PubMed]
40. Eshelman L. FOGA-1, Los Atlos, CA: Morgan Kaufmann, chapter The CHC adaptive search algorithm. How to have safe search when engaging in nontraditional genetic recombination. 1991;265-283
41. Goldberg DE. Reading, Mass.: Addison-Wesley Pub. Co; 1989. Genetic algorithms in search, optimization, and machine learning.
42. Burnham K, Anderson D. New York: Springer, 2nd ed. edition; 2003. Model Selection and Multimodel Inference.
43. Pupko T, Pe'er I, Shamir R, Graur D. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol. 2000;17:890–896. [PubMed]
44. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. [PubMed]
45. Nickle DC, Heath L, Jensen MA, Gilbert PB, Mullins JI, et al. HIV-specific probabilistic models of protein evolution. PLoS ONE. 2007;2:e503. [PMC free article] [PubMed]
46. Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319:473–476. [PubMed]
47. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial-DNA in humans and chimpanzees. Mol Biol Evol. 1993;10:512–526. [PubMed]
48. Salemi M, Goodenow MM, Montieri S, de Oliveira T, Santoro MM, et al. The HIV type 1 epidemic in Bulgaria involves multiple subtypes and is sustained by continuous viral inflow from West and East European countries. AIDS Res Hum Retroviruses. 2008;24:771–779. [PMC free article] [PubMed]
49. Bennett DE, Camacho RJ, Otelea D, Kuritzkes DR, Fleury H, et al. Drug resistance mutations for surveillance of transmitted HIV-1 drug-resistance: 2009 update. PLoS One. 2009;4:e4724. [PMC free article] [PubMed]
50. Kosakovsky Pond SL, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21:676–9. [PubMed]
51. Johnson VA, Brun-Vezinet F, Clotet B, Gunthard HF, Kuritzkes DR, et al. Update of the drug resistance mutations in HIV-1. Top HIV Med. 2008;16:138–145. [PubMed]
52. Little SJ, Holte S, Routy JP, Daar ES, Markowitz M, et al. Antiretroviral-drug resistance among patients recently infected with HIV. N Eng J Med. 2002;347:385–394. [PubMed]
53. Shet A, Berry L, Mohri H, Mehandru S, Chung C, et al. Tracking the prevalence of transmitted antiretroviral drug-resistant HIV-1: a decade of experience. J Acquir Immune Defic Syndr. 2006;41:439–446. [PubMed]
54. Guimaraes ML, Eyer-Silva WA, Couto-Fernandez JC, Morgado MG. Identification of two new CRF_BF in Rio de Janeiro State, Brazil. AIDS. 2008;22:433–435. [PubMed]
55. Harris ME, Serwadda D, Sewankambo N, Kim B, Kigozi G, et al. Among 46 near full length HIV type 1 genome sequences from Rakai District, Uganda, subtype D and AD recombinants predominate. AIDS Res Hum Retroviruses. 2002;18:1281–1290. [PubMed]
56. McCutchan FE, Hoelscher M, Tovanabutra S, Piyasirisilp S, Sanders-Buell E, et al. In-depth analysis of a heterosexually acquired human immunodeficiency virus type 1 superinfection: evolution, temporal fluctuation, and intercompartment dynamics from the seronegative window period through 30 months postinfection. J Virol. 2005;79:11693–11704. [PMC free article] [PubMed]
57. Posada D. Unveiling the molecular clock in the presence of recombination. Mol Biol Evol. 2001;18:1976–1978. [PubMed]
58. Scheffler K, Martin DP, Seoighe C. Robust inference of positive selection from recombining coding sequences. Bioinformatics. 2006;22:2493–2499. [PubMed]
59. Baker CAR, McEvers K, Byaruhanga R, Mulindwa R, Atwine D, et al. HIV subtypes induce distinct profiles of HIV-specific CD8(+) T cell responses. AIDS Res Hum Retroviruses. 2008;24:283–287. [PubMed]
60. Penn O, Stern A, Rubinstein ND, Dutheil J, Bacharach E, et al. Evolutionary modeling of rate shifts reveals specificity determinants in HIV-1 subtypes. PLoS Comput Biol. 2008;4:e1000214. [PMC free article] [PubMed]
61. Rousseau CM, Learn GH, Bhattacharya T, Nickle DC, Heckerman D, et al. Extensive intrasubtype recombination in South African human immunodeficiency virus type 1 subtype C infections. J Virol. 2007;81:4492–4500. [PMC free article] [PubMed]
62. Salemi M, Gray RR, Goodenow MM. An exploratory algorithm to identify intra-host recombinant viral sequences. Mol Phylogenet Evol. 2008;49:618–628. [PMC free article] [PubMed]
63. Novitsky VA, Gaolekwe S, McLane MF, Ndung'u TP, Foley BT, et al. HIV type 1 A/J recombinant with a pronounced pol gene mosaicism. AIDS Res Hum Retroviruses. 2000;16:1015–1020. [PubMed]
64. Arnold C, Barlow KL, Parry JV, Clewley JP. At least five HIV-1 sequence subtypes (A, B, C, D, A/E) occur in England. AIDS Res Hum Retroviruses. 1995;11:427–429. [PubMed]
65. HIV and AIDS in the United Kingdom. Health Protection Report 2007. http://www.hpa.org.uk/hpr/archives/2007/hpr1707.pdf.
66. Hue S, Pillay D, Clewley JP, Pybus OG. Genetic analysis reveals the complex structure of HIV-1 transmission within defined risk groups. Proc Natl Acad Sci U S A. 2005;102:4425–4429. [PMC free article] [PubMed]
67. Kiwanuka N, Laeyendecker O, Robb M, Kigozi G, Arroyo M, et al. Effect of human immunodeficiency virus Type 1 (HIV-1) subtype on disease progression in persons from Rakai, Uganda, with incident HIV-1 infection. J Infect Dis. 2008;197:707–713. [PubMed]
68. Geretti AM. HIV-1 subtypes: epidemiology and significance for HIV management. Curr Opin Infect Dis. 2006;19:1–7. [PubMed]
69. Geretti AM, Harrison L, Green H, Sabin C, Hill T, et al. Effect of HIV-1 subtype on virologic and immunologic response to starting highly active antiretroviral therapy. Clin Infect Dis. 2009;48:1296–1305. [PubMed]

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...