- We are sorry, but NCBI web applications do not support your browser and may not function properly. More information

- Journal List
- NIHPA Author Manuscripts
- PMC2760689

# Modeling Sequence Evolution in Acute HIV-1 Infection

^{1,}

^{2,}

^{*}Elena E. Giorgi,

^{1,}

^{3,}

^{*}Brandon F. Keele,

^{4}Brian Gaschen,

^{1}Gayathri S. Athreya,

^{1}Jesus F. Salazar-Gonzalez,

^{4}Kimmy T. Pham,

^{4}Paul A. Goepfert,

^{4}J. Michael Kilby,

^{4}Michael S. Saag,

^{4}Eric L. Delwart,

^{5}Michael P. Busch,

^{5}Beatrice H. Hahn,

^{4}George M. Shaw,

^{4}Bette T. Korber,

^{1,}

^{6}Tanmoy Bhattacharya,

^{1,}

^{6}and Alan S. Perelson

^{1,}

^{§}

^{1}Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM 87545, USA

^{2}Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA

^{3}University of Massachusetts, Amherst MA 01002

^{4}University of Alabama at Birmingham, Birmingham, AL 35223, USA

^{5}Blood Systems Research Institute, San Francisco, CA 94118, USA

^{6}Santa Fe Institute, Santa Fe, NM 87501, USA

^{*}Contributed equally

^{§}Corresponding Author: asp/at/lanl.gov, MS K710, Los Álamos National Laboratory, Los Álamos NM 87544, Phone: +1 505 667 6829 FAX: +1 505 665 3493

**This article has been corrected.**See the correction in volume 297 on page 187.

## Abstract

We describe a mathematical model and Monte-Carlo (MC) simulation of viral evolution during acute infection. We consider both synchronous and asynchronous processes of viral infection of new target cells. The model enables an assessment of the expected sequence diversity in new HIV-1 infections originating from a single transmitted viral strain, estimation of the most recent common ancestor (MRCA) of the transmitted viral lineage, and estimation of the time to coalesce back to the MRCA. We also calculate the probability of the MRCA being the transmitted virus or an evolved variant. Excluding insertions and deletions, we assume HIV-1 evolves by base substitution without selection pressure during the earliest phase of HIV-1 infection prior to the immune response. Unlike phylogenetic methods that follow a lineage backwards to coalescence, we compare the observed data to a model of the diversification of a viral population forward in time. To illustrate the application of these methods, we provide detailed comparisons of the model and simulations results to 306 envelope sequences obtained from 8 newly infected subjects at a single time point. The data from 6/8 patients were in good agreement with model predictions, and hence compatible with a single-strain infection evolving under no selection pressure. The diversity of the samples from the other two patients was too great to be explained by the model, suggesting multiple HIV-1-strains were transmitted. The model can also be applied to longitudinal patient data to estimate within-host viral evolutionary parameters.

**Keywords:**HIV-1, population dynamics, viral evolution

## Introduction

The HIV-1 population in a chronically infected individual is subject to continuous immune selection (Richman et al., 2003; Wei et al., 2003), and evolves to become a complex set of related viruses, often referred to as a quasispecies, through the course of an infection (Lee et al., 2009; Shankarappa et al., 1999; Wolinsky et al., 1996). A reduction in viral complexity at transmission was originally noted in the context of mother to infant transmission (Wolinsky et al., 1992), and has been extensively studied in recent years (Derdeyn et al., 2004; Dickover et al., 2006; Edwards et al., 2006; Painter et al., 2003). During sexual transmission of HIV-1, a genetic bottleneck usually occurs since a limited number of viral strains are transmitted from the complex quasispecies typically found in a donor (Delwart et al., 2001; Derdeyn et al., 2004; Zhang et al., 1993), although other studies have found multiple transmitted variants at a relatively high frequency (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Infection with multiple genetic variants has been associated with genital tract ulcers and use of hormonal contraceptives (Sagar et al., 2004). Recent studies have identified patients in the earliest weeks of infection, many prior to the selective pressure imposed by the newly infected host’s initial immune response (Abrahams et al., 2009; Keele et al., 2008; Salazar-Gonzalez et al., 2008). Sequence data from the HIV-1 *env* gene collected during acute HIV-1 infection in 102 subjects by Keele et al. (2008) show a wide range of diversity, with the average number of bases differing between pairs of sequences from the same patient varying between 0.01% and 2.18%, suggesting that both single and multiple viral strain transmissions may have occurred in this cohort.

In this paper, we develop simple models of HIV-1 evolution early in infection with the aim of quantitatively assessing whether infections were established by single or multiple viral strains. Further, in the case of single strain, i.e., homogeneous, infections we aim to identify the most-likely initiating strain or a close descendent that gave rise to the observed lineage.

We derive analytical results and approximate formulas. We used Monte Carlo (MC) simulations to capture the randomness in early HIV-1 evolution and to compare with our analytical results. The analytical results were derived from idealized models whereas the MC simulations allowed for more accurate models that incorporate unequal base composition and an evolutionary based substitution matrix that defines the frequency at which base *i* is replaced by base *j* during a mutation event. Previously, Monte Carlo methods have been used to study the within-host dynamics of HIV-1 infection (Heffernan and Wahl, 2005; Kamina et al., 2001; Ribeiro and Bonhoeffer, 1999; Ruskin, 2002; Tan and Wu, 1998; Tuckwell and Le Corfec, 1998), but here our focus is on sequence evolution and not viral and T cell dynamics as in these earlier works. Keele and colleagues applied a variant of this model to the analysis of 102 B-clade infected patients (Keele et al., 2008) and of 18 experimentally SIV-infected rhesus macaques (Keele, 2009). Abrahams et al. (2009) have used the same techniques to analyze a set of 69 C-clade infected subjects.

In this study we provide a complete mathematical description of the model, and explore the implications of varying the baseline assumptions, the input parameters, as well as purely analytical versus computational outcomes. We use just 8 of the patients described in Keele *et al*. (Keele et al., 2008) to illustrate nuances in the application of our model. These 8 subjects were chosen to be representative of the full set of 102 patients with 80% being characterized as having homogenous infection and 20% heterogeneous infections.

The main problem we focus on here is developing a systematic, reasoned way to determine whether a single strain or multiple stains of HIV-1 infected an individual. This is not straight-forward; even if an individual is infected by a single strain, with time this transmitted virus will diversify. Thus, given a set of sequences one needs to compute how much diversity would be expected by a given time from infection. For sequences collected early after infection, if the diversity is much greater than what is expected from a homogeneous infection, then multiple strain infection is a likely explanation. Other signatures of multiple strain infection might also be present, e.g., the sequences cluster into groups with very different most recent common ancestors. Our model also provides an estimate of the time from the origin of the most recent common ancestor (MRCA) of the sampled variants. If the sampled variants are representative of a lineage that was initiated by the infecting strain, the time to the MRCA should correspond to the time of infection; if the lineage arose in the donor, the time to the MRCA would be longer than the infection; if the lineage arose post-infection in the newly infected individual as a consequence of selection, the time estimated to the MRCA would be less than the time from infection.

To characterize HIV-1 evolution in samples in the earliest weeks of infection, we have modeled the events that occur between virus transmission and peak replication, approximately 21-35 days later. Our model assumes random drift prior to the initial immune response and exponential viral expansion prior to peak viremia. We ignored the effects of selection—based on the premise that the sequences we analyze were obtained sufficiently early in infection that immune responses would not yet provide substantial selective pressure. Although other sources of selection may be present, these also are ignored in our model. The effects of recombination were not modeled either. Comparing the expectations based on the mathematical model and the simulations to observed acute infection sequence data enables us to explore the validity of these assumptions in real infections. We show that under the hypothesis of exponential growth in the infected cell population the frequency distribution of the genetic distances between pairs of HIV-1 sequences follow an approximate Poisson distribution and star-phylogeny topology. This was first shown by Slatkin and Hudson in the context of the evolution of mitochondrial DNA (Slatkin and Hudson, 1991).

Since the precise time of initial HIV-1 infection cannot usually be known with certainty, the status of acute/early subjects can be classified using the “Fiebig” staging system (Table 1), which is based on an orderly appearance of viral RNA, antigen and antibodies in plasma during early infection (Fiebig et al., 2005; Fiebig et al., 2003). Prior to stage I, where plasma viral RNA first becomes detectable, is the eclipse phase. The length of this period can be roughly estimated based on clinical histories from a series of studies (Clark et al., 1991; Gaines et al., 1988; Lindback et al., 2000a; Lindback et al., 2000b; Schacker et al., 1996) and suggests an average length of about 10 days (range 7-21 days, see Table 1). For each patient in our dataset the Fiebig stage is known. This provides a rough estimate of the time since infection, which can be compared with the estimated time since the MRCA computed with our models from the sampled sequences.

*et al*. (Keele et al., 2008). We obtained 95% CIs, shown in parenthesis, by recalculating

**...**

Coalescent theory can also be used to find common ancestors and estimate the time to coalescence. Further, results have been obtained using coalescent theory for the type of linear birth death process that underlies the model described in this paper (Kuhner et al., 1998; Rannala, 1997). When the infection is indeed homogeneous and it meets our model assumptions (exponential viral growth, no selection, etc.), we are under a very particular evolutionary scenario that allows us to use more direct and computationally efficient methods to derive the approximate time to the MRCA. Even though appropriate, Bayesian estimation methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007), as the ones provided by the software BEAST by (Drummond and Rambaut, 2007), are computationally intensive even when the number of sequences is small as they perform a whole suite of additional tests that are redundant for our particular evolutionary scenario. The methods presented below are computationally efficient, and our model results can be obtained in minutes for the entire dataset, while in our hands BEAST runs for a single patient took over 3 hours. Further, as we shall show, the estimated time to the MRCA using BEAST and using our method are very similar. Thus, we believe the methods we present below enable a rapid exploration of the implications of a forward evolution model, provide consistent results with a coalescence model, and allow us to compare the model to the data to infer biologically interesting information from the sample.

## Results

### Mathematical Models and Monte-Carlo Simulations

HIV-1 can be transmitted from one individual to another either through the transmission of virus particles or infected cells. Since transmission of virus will quickly generate infected cells, we describe the transmission as if it occurred through infected cells**.**

In the case of sexually transmitted HIV-1, most often only a single viral sequence initiates the new infection, or only a single sequence grows out to yield a detectable level of viremia {Delwart, 2001 #61; Derdeyn, 2004 #20; Zhang, 1993 #59; Abrahams, 2009 #185}. Thus, we shall model the case in which infection is initiated by a single HIV-1 sequence. The simplest implementation of this hypothesis is to assume that a single cell, carrying this sequence as a provirus, starts the infection. We shall also examine the case in which multiple cells carrying identical sequences start the infection. When the predictions of these models fail to explain the extent of sequence heterogeneity observed, this is evidence that either multiple sequences have been transmitted, or that immune or other selective pressures are driving the observed level of diversification (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999).

We assume that the initial infected cell produces virus; most are cleared, but some successfully infect a new generation of cells. The number of secondary infections caused by one infected cell placed in the population of cells of an uninfected individual and fully susceptible to infection is called the basic reproductive ratio, *R _{0}*. While data on early viral kinetics in humans is limited, the available data in humans infected with HIV-1 and in monkeys infected with SIV and SHIV show that the virus grows exponentially until a viral load peak is attained a few weeks after infection (Little et al., 1999; Nowak et al., 1997; Stafford et al., 2000). Following the peak, viral levels decline and establish a set-point. At the set-point each infected cell, on average, successfully infects one other cell during its lifetime. If it infected more than one other cell viral levels would increase, whereas if it infected less than one cell viral levels would fall. Thus, once the set-point is established we can assume that the number of infected cells remains constant.

In our model we assume a homogeneous infection in which the virus grows exponentially with no selection pressure, no recombination, no occurrence of back mutations and a constant mutation rate across positions and across lineages. We also assume the virus grows with a fixed generation length and that each infected cell produces the same number of progeny. Clearly, the approximation of exponential growth must break down as target cells become limited and the number of cells infected per generation must decrease from *R _{0}*, stabilizing at 1 when the viral set point is established. However, we present evidence indicating that what is important in determining viral diversity is the number of reverse transcription events that have occurred along a lineage, i.e., the number of generations that separate the initial infection from the time at which sequences were obtained, and not the number of cells infected at each generation. In our first, simplest model, we also assume that the infection is synchronous, and thus can be characterized by discrete generations, with the virus infecting exactly

*R*new cells at each generation.

_{0}Let all genomes be of same length *N _{B}*, let

*ε*be the reverse transcriptase point mutation rate, which we assume to be constant throughout the genome, and let

*s*denote the infecting strain. Then, after the first replication cycle, the

_{0}*R*daughter cells will each differ from the infecting strain at exactly

_{0}*n*bases (positions) with a probability given by

After the second replication cycle, the total number of mutations in the new generation of infected cells will follow the probability distribution given by

In other words, *n* mutations in the second generation are the result of all possible combinations of *k* mutations occurring in the first replication, and *n-k* occurring in the second, for all integers *k* between 0 and *n*. Mathematically, this is given by the convolution of two Binomial distributions with the same probability parameter (ε, in our case), which is in turn a Binomial distribution (Casella and Berger, 1990). One can see that after *a* replication cycles,

Let *HD _{0}* denote the Hamming distance, i.e., the number of base differences from the infecting strain

*s*. In general, given

_{0}*n*mutations, we have

*HD*

_{0}≤

*n*, as some of the

*n*mutations may occur at the same site. However, after

*a*replication cycles, the probability that one site has mutated at most once is

Therefore, the probability that at least one site across the entire genome has mutated more than once is *Q* = 1-*P*^{NB} , which is approximately $\frac{1}{2}{N}_{B}a(a-1){\epsilon}^{2}$ when *ε* is small.

For the HIV-1 base substitution rate ε, we use the value of 2.16 × 10^{-5} base substitutions per replication cycle. The base substitution rate we use is derived from the results of (Mansky and Temin, 1995), where after a single round of HIV-1 replication they found an overall mutation rate of 3.4 × 10^{-5}, but this included insertions and deletions. Restricting their results to only base substitutions, we calculated from their data a base substitution rate of 2.16 × 10^{-5} per base per replication. The sequences that we analyze later are envelope gene sequences with N_{B} ~2600 bases. Substituting these values into *Q*, we find that the probability of two or more mutations at the same site increases with the number of replication cycles, yet even for *a*=*50* one finds *Q* < 0.0015. Hence we expect that we can ignore back mutations when we analyze patient sequence data obtained early in infection.

Under this assumption, *HD _{0}* coincides with the number of mutations and hence it follows the same Binomial distribution:

Notice that this formula assigns a nonzero probability to the Hamming distance being larger than the total number of bases *N _{B}*, and this is an artifact of our approximations. In fact, using Chebyshev’s inequality (Feller, 1957), one can show (see Materials and Methods) that $P(\text{mutations}>{N}_{B}\mid a)\le O\left(\frac{{a}^{2}{\epsilon}^{2}}{{N}_{B}}\right)<<1$, and thus indeed, within our approximation,

*P*(

*mutations*>

*N*) ≈ 0.

_{B}### Synchronous Infection, Mathematical Model

In the synchronous model, we assume that each cell infects *R*_{0} other cells at the end of its life cycle. At generation *a* from the infection event, there will be ${R}_{0}^{a}$ infected cells, each one carrying one HIV-1 genome. Further, at generation *a*

In order to analyze single time point data, it is useful to compute the intersequence HD distribution rather than HD_{0} (since for the latter we would need to assume that we know the infecting sequence *s _{0}*). In other words, rather than comparing any given sequence to the founder sequence, we take all possible sequence pairs in the sample and compute their relative HD. Given two sequences,

*s*and

_{1}*s*, picked at random at time

_{2}*t*, that have evolved independently from the common ancestor s

_{0}, the distribution of HD[s

_{1}, s

_{2}] is

The assumption of independent evolution of sequences *s _{1}* and

*s*from the common ancestor

_{2}*s*is equivalent to assuming that the phylogeny of these sequences is star-like with

_{0}*s*being the only common ancestor. However, it is also conceivable that the common ancestor of

_{0}*s*and

_{1}*s*is not the transmitted sequence

_{2}*s*but rather a sequence that evolved from

_{0}*s*This is an important issue, as the identity of the transmitted sequence is generally not known. As illustrated in Figure 1, the probability of

_{0}.*s*and

_{1}*s*coalescing

_{2}*m*generations from the ancestor, with

*m*<

*a,*is the number of all sequences coalescing

*m*generations from the ancestor (minus the one we have already picked), divided by the total number of sequences at generation

*a*(again, minus the one we picked already)

As shown in Material and Methods, this probability approaches zero rapidly as *a* and *m* get large.

We can use Eq. (6) to determine the probability that two sequences from a given individual coalesce at a time point other than at the transmitted sequence. For example, for samples obtained a few weeks into the infection and under no selective pressure, e.g. *a*=5 and *R _{0}*=6 (see below) the probability of coalescence later than generation 0, from Eq. (6) is 0.167 for

*m*=

*1*, 0.028 for

*m*=

*2*and 0.0045 for

*m*=

*3*. For greater values of

*a*, the probability of coalescing at generation 3 or higher is < 0.005. Thus, we expect that use of Eq. (5) and the assumption of a star-like phylogeny to hold for all patient samples in which a single viral strain was transmitted. However, even for large

*a*, the probability of coalescing 1 or 2 generations away from the founder strain still remains non-negligible: even at generation 40, the probability of coalescence later than generation 0, is still 0.167 for

*m*=

*1*, and 0.028 for

*m*=

*2*. In a set of sequences from an individual, these estimates of less than a 0.5% chance of being 3 generations away from the actual founder, a 3% chance of being 2 generations away, and a 17% chance of being 1 generation away hold for each independent pair of comparisons.

Samples that diverge from a star phylogeny often show patterns of shared mutations across sequences. The above calculation shows that coalescence not at the founder strain is most likely one generation away from the actual founder. Therefore, a mutation observed very early in the infection, could possibly carry on and develop as a second lineage. Also, we can not rule out that the founders of these two lineages were simply transmitted from the donor and hence coalesced in the donor. For infections transmitted during the acute phase before much diversification occurred in the donor this may even be likely. Lastly, we observe that since the base substitution rate ε is so small it is very likely that the sequences at a=0 and a=1 are the same. In fact, with probability (1-*ε*)^{RoNB} all *R*_{0} proviruses at generation 1 are identical to the founder; e.g., with *R _{0}*=

*6*and

*N*=

_{B}*2600*this probability is 0.71.

For ease of notation, let the intersequence Hamming distance distribution be denoted *HD _{I}* . Thus,

We also define some basic quantities based on the calculated Hamming distance distribution from the initial strain and the intersequence Hamming distance distribution between all possible sequence pairs. The divergence at generation *a* is defined as the average Hamming distance per base from the initial founder strain. From Eq. (4), this can be estimated as the mean of the binomial distribution divided by the number of bases, and is *εa* . The diversity at generation *a* is the average intersequence Hamming distance per base between sequence pairs and from Eq. (7), diversity = 2*εa*. The variance of the intersequence HD distribution at generation *a* divided by the number of bases is *2εa(1-ε)*. We also introduce the % sequence identity, defined as the proportion of sequences identical to the sequence, *s _{0,}* of the infecting strain. As the virus evolves, the population diversifies from the founder strain and the proportion of the total population that is identical to the MRCA decays exponentially. By setting

*d*= 0 in Eq. (4), we have % sequence identity =

*Binom*(0;

*aN*) = (1-

_{B},ε*ε*)

^{aNB}.

Finally, we wish to compute the expected maximum *HD _{I}*. This is motivated by the observation that more extensive diversity is seen in some transmission cases, presumably the result of infections established by two or more distinct HIV-1 strains (Long et al., 2002; Ritola et al., 2004; Sagar et al., 2004; Vernazza et al., 1999). Thus, a strategy for systematic classification of homogeneous, single strain infections versus heterogeneous infections is needed. To address this, we ask what is the maximum diversity that one would expect in a single strain infection within a time frame compatible with the subject’s Fiebig stage at the time of sampling. The maximum diversity expected depends on the sample size: the larger the sample, the higher the expected maximum, as outliers become more likely to be sampled.,The expected maximum

*HD*at generation

*a*is computed in the Materials and Methods.

### Monte Carlo Simulation of the Synchronous Infection Model

The mathematical model derived above assumes that all mutations have equal probability. However, in HIV-1 the four bases are not equally frequent and the rates of base substitution are not all equal. Further, we neglected more than one mutation occurring at the same site, although this potentially can occur. To examine the influence of these biological details on the predictions of HIV-1 sequence evolution, and to capture stochastic effects, we developed a simple Monte Carlo simulation and compared its predictions with those of the mathematical model.

The simulation starts at “generation 0” with one cell infected with a single HIV-1 sequence *N _{B}* bases long. The initial sequence is generated randomly with base frequencies of 0.19 (T and C), 0.37 (A) and 0.24 (G), based on typical frequencies measured in the full-length envelope gene (see Materials and Methods). We assume that the initially infected cell infects

*R*other cells synchronously. We chose

_{0}*R*=

_{0}*6*, representing the average value of

*R*across 10 different patients studied by (Stafford et al., 2000). During the infection of these cells mutations can occur. In the simulation, the sites for the occurrence of the base substitutions are chosen randomly from the

_{0}*N*bases. The probability that base

_{B}*i*is replaced by

*j*is given by a 4×4 transition matrix deduced from a maximum likelihood General Time Reversible (GTR) model of substitutions that occur in the full length HIV-1 envelope gene (for the matrix and the details see Materials and Methods). Unlike in our analytical model, we allow the same site to mutate more than once. For a given sample size

*N*, the MC simulation produces an HD frequency distribution for each generation step within a certain time frame chosen by the user. At each generation exactly

_{S}*N*sequences are sampled (with replacement) and the intersequence HD distribution is thus constructed.

_{S}We implemented the MC code so that the user has the option of running either one or multiple iterations. When multiple iterations are done, the HD frequencies computed at each generation step are averaged over all iterations, thus retrieving the mean field description described by the analytical method. On the other hand, when not averaging over all iterations, one captures the stochastic effects due to rare mutations, e.g. the run-to-run variance, which, in real life, corresponds to the host-to-host variation. This is completely neglected in the analytical method, which is a mean field description.

In the synchronous model the population of infected cells grows exponentially as ${R}_{0}^{a}$. Once the number of infected cells reaches *10 ^{4}*, which is close to some estimates of the effective population size

*N*of the virus in a mature infection (Achaz et al., 2004; Leigh Brown, 1997; Wakeley, 2008), there is no need to expand the population further in order to study the diversity of genetic sequences as long as we allow the proviral sequences contained in these 10

_{e}^{4}to continue to evolve. We thus cap the total population of infected cells at 10

^{4}and let those cells produce

*10*progeny. We then randomly sample

^{4}R_{0}*10*cells out of the

^{4}*10*, and continue the simulation. In this way, we are able to maintain the population of infected cells constant over the rest of the simulation, which is equivalent to what occurs when the viral set-point is reached. We confirmed that our simulation results (i.e., means and 95% CIs) did not depend on the cutoff value of infected cells as long as it is above 10

^{4}R_{0}^{3}(data not shown).

Using the MC simulation, we computed divergence, diversity, maximum HD and % sequence identity (defined in the previous section) as functions of time, and confirmed that their dependence on the number of generation steps matched the dependency found analytically. Figure 2 shows this comparison as well as 95% confidence intervals computed from 1,000 MC simulations using a typical number of sampled sequences, *N _{S}*=

*30*.

### Asynchronous Infection

The synchronous infection model is a simplification. In reality, cell infections occur at random times by the viruses released from an infected cell. Viral production starts on average about 24 hours after a cell is initially infected (Perelson et al., 1996), and most likely continues until cell death. In order to understand the dependence of our results on the synchronous approximation, we devised a second model in which infections are asynchronous. Again we will assume each cell infects *R _{0}* others during its lifetime. While each of the R

_{0}infections could occur at different times, here we take a first step in assessing the role of asynchrony by assuming the infections occurs at two times. Each infected cell infects

*α*cells at an intermediate time, and the rest at the end of its life-time. Even though this scenario is again a simplification, it is useful to analyze the differences from the previous model, and as we shall show below it preserves the same basic mathematical structure as before. However, because an infected cell contributes progeny to the next generation before the end of its life there is now a faster growth rate of the infected cell population.

Let *τ _{a}* and 2

*τ*be the times at which a newly infected cell infects other cells, and suppose it infects α cells at

_{a}*τ*, and γ at 2

_{a}*τ*, with

_{a}*α+γ*=

*R*, the total number of cells one cell infected during its lifetime. Notice that if γ=0, then infections only occur at one time,

_{0}*α*=

*R*, and τ

_{0}_{a}is the generation time. Hence this model reduces to the synchronous model with generation time τ

_{s}= τ

_{a}, where τ

_{s}denotes the duration of a single generation step in the synchronous model.

Denote by *I(a,t)* the number of infected cells of age *a* at time *t*. Here, by age, we mean the age of the infecting genome, measured in terms of the number of times the viral genome has undergone reverse transcription, rather than the cell’s physical age. Thus, *a* is equivalent to the generation variable we used in the synchronous model. As a consequence, at any given time *t,* under the synchronous model all cells belong to the same age group, whereas in the asynchronous model, the infecting genomes have undergone a minimum of $\lceil \frac{t}{2{\tau}_{a}}\rceil $ replication cycles, and a maximum of $\lfloor \frac{t}{{\tau}_{a}}\rfloor $ (see Figure 3), where and are the ceiling and floor functions defined as *x* = max {*n* *Z* | *n* ≤ *x*} and *x* = min {*n* *Z* | *n* ≤ *x*}, where *Z* is the set of integers.

Let *I _{0}* denote the initial number of infected cells and let $n=\lfloor \frac{t}{{\tau}_{a}}\rfloor $, i.e.,

*n*represents time measured in units of

*τ*. In the following we shall use

_{a}*n*as the time variable. A general expression for

*I(a,n)*is derived in Materials and Methods, and is given by

As illustrated in Figure 3, at any given time *n*, the HIV-1 strains in an individual have undergone a minimum of $\lceil \frac{n}{2}\rceil $ and a maximum of *n* replication cycles. Then, as a function of time, the random variable *HD _{0}* follows the probability distribution

In Materials and Methods we show the following relationship

where $\phi =\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}$ and ${F}_{n}=\frac{{(1+\phi )}^{n+1}-{(1-\phi )}^{n+1}}{{2}^{n+1}\phi}$. By substituting Eq. (10) into Eq. (9) we obtain

with mean *μ _{n}* =

*λ*and variance ${\sigma}_{n}^{2}={\lambda}_{n}\epsilon (1-\epsilon ){N}_{B}$, where

_{n}εN_{B}Notice, as one would expect, that in neither the synchronous nor the asynchronous model, does the probability distribution depend on the initial number of infected cells, *I _{0}*.

We show in Materials and Methods that up to terms in *O*(*ε*^{2}) *P*(*HD*_{0} =*d* | *n)* is a Poisson distribution with mean *μ _{n}* =

*λ*, in other words

_{n}εN_{B}As a consequence, given that the intersequence HD distribution, *HD _{I}*, is to a good approximation the self-convolution of the

*HD*distribution, we get that

_{0}*HD*also follows a Poisson distribution with parameter 2

_{I}*λ*. Therefore, analogously to the synchronous model, divergence =

_{n}εN_{B}*λ*, diversity = 2

_{n}εN_{B}*λ*= variance and % sequence identity =

_{n}εN_{B}*Exp*(-2

*λ*) . The expected maximum

_{n}εN_{B}*HD*is calculated as in the synchronous infection model, Eq. (8) and (9), except we now insert ${P}_{k}^{\left(a\right)}=P(H{D}_{0}=k\mid a)$ from Eq. (11).

In the synchronous model, the intersequence Hamming distance, Eq. (5), follows a binomial distribution, which is also well approximated by a Poisson distribution with equal mean since *2aN _{B}* is large,

*ε*is small and

*λ*=

*2aN*is of reasonable size (Casella and Berger, 1990). Hence in both scenarios the HD follows essentially the same probability distribution, but with different divergence and diversity, implying a different rate of evolution due to the asynchrony in infection.

_{B}εLet *N(n)* be the total number of newly infected cells at time *n*. For *I _{0}*=

*1*and

*R*

_{0}>> 1 , we show in Material and Methods that $N\left(n\right)\approx {\left(\frac{\alpha}{2}(1+\phi )\right)}^{n}={\left(\frac{\alpha}{2}(1+\phi )\right)}^{\lfloor \frac{t}{{\tau}_{a}}\rfloor}$, whereas for the same time

*t*, in the synchronous model, $N\left(n\right)\approx {\left({R}_{0}\right)}^{\lfloor \frac{t}{{\tau}_{s}}\rfloor}$, where τ

_{s}denotes the generation time in the synchronous model. Notice that

*τ*≠

_{a}*τ*see Materials and Methods), hence the two scenarios both provide descriptions of an exponentially growing infected cell population but with different exponents.

_{s}We also devised a partially randomized model, in which the α, instead of being a fixed parameter, is a random number uniformly distributed with mean *R*_{0}/2, and then γ is always chosen to be *R*_{0}-*α*. This introduces an additional variability the effect of which is inversely proportional to the total population size. Therefore, large fluctuations from the above description will occur within the first 2-3 generation steps, but will be negligible after that, thus recovering the above mean field description.

An important difference between the asynchronous and synchronous models is in the dependence on *R _{0}*. In both models, divergence and diversity depend linearly on the number of generation steps. In the synchronous model, the slope of the diversity is 2ε does not depend on

*R*and it. In the asynchronous model, the slope has the additional factor

_{0}with $\phi =\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}$ and *γ*+*α*=*R _{0}*. One sees that in the asynchronous model there is a dependence on

*R*. Let

_{0}*α*=

*ηR*for some 0<η≤1, and

_{0}*γ*=

*R*. Then, one gets the expression of the slope in diversity as

_{0}(1-η)Notice that the when *η*=1, we retrieve as a special case the synchronous model, and indeed Eq. (15) yields exactly 2ε. However, when *η*≠*1*, the above expression is strictly smaller than 2ε, and approaches 2ε as *R _{0}* increases. Figure 4 illustrates the dependence of the slope of the diversity as a function of the number of generation steps on

*R*

_{0}.

Table 2 summarizes the above formulas for both the synchronous and the asynchronous models. In both models, divergence, diversity, and variance increase linearly as a function of time. Diversity and variance are expressed by the same mathematical formula, which is a consequence of the fact that the intersequence HDs follow a Poisson distribution. Divergence, diversity, and *HD _{I}* variance do not depend on the number of bases per sequence. In contrast, the maximum HD and the % sequence identity depend on the sequence length

*N*. In the synchronous infection model, all the quantities are independent of the basic reproductive ratio, while in the asynchronous model, they do depend on the

_{B}*R*

_{0}, as discussed above. Quantitatively, however, the asynchronous infection model provides minor corrections compared to the synchronous infection model.

### Monte Carlo simulation of the asynchronous infection model

We keep the basic framework used for the synchronous MC simulation, and choose parameters *R _{0}*=

*6, ε*=

*2.16×10*

^{-5}, and

*τ*=

_{a}*1.5*days (see Materials and Methods). Furthermore, we simulate the situation where

*I*=

_{0}*1*and each infected cells infects

*R*cells at

_{0}/2*τ*, and the remaining

_{a}*R*at time 2

_{0}/2*τ*. We also devised a second scenario under which each infected cells infects

_{a}*α*cells at time

*τ*, where α is drawn from a random uniform distribution with mean

_{a}*R*, and

_{0}/2*R*

_{0}-

*α*at time 2

*τ*. We confirmed that both types of simulations, after generation step 4, are in excellent agreement, and also confirmed our analytical results described in the previous section. Furthermore, both % divergence and % diversity grow linearly with time (in units of

_{a}*τ*) at rates of 1.79 × 10

_{a}^{-5}and 3.57 × 10

^{-5}, respectively (i.e., 1.2 ×10

^{-5}and 2.4 ×10

^{-5}per day, respectively).

At times *τ _{a}, 2τ_{a}*, 3

*τ*, and so on, we sample

_{a}*N*sequences, construct the HD distribution and compute the divergence, diversity,

_{S}*HD*variance, % sequence identity and the expected maximum HD. As for the synchronous model, we find that the MC results agree well with the analytical results given in Table 2 (Figure 5).

_{I}We examined the dependency of our results on the values chosen for the parameters *τ, R _{0}, I*

_{0}and ε (Figure 6). In agreement with our calculations (Table 2), the diversity increases linearly with time (not shown) at a rate proportional to

*ε*, inversely proportional to

*τ*(Figures 6A and 6B), dependent on

_{a}*R*(Figure 6C) and independent of

_{0}*I*(not shown). The mutation rate controls the rate of increase of the divergence and the diversity. The larger the mutation rate, the faster the genomes mutate, hence the steeper the growth of the diversity (Figure 6A). The greater the generation time, the slower the genomes diversify, hence the smaller the growth in diversity (Figure 6B). The computed diversity depends on

_{0}*R*(Figure 6C). At low values there is a strong dependence, e.g., from

_{0}*R*=

_{0}*2*to

*R*=

_{0}*6*there is 15.9% increase in the slope of the diversity. On the other hand, for larger values of

*R*, the dependence is much weaker (Figure 6C).

_{0}### Sample Size Dependence

When patients are sampled, only a limited number of sequences are obtained. The Monte Carlo simulations provide a means of assessing the variability introduced by limited sequencing and can be used to choose the number of samples to be sequenced based on the trade-offs between cost and accuracy. Figure 7 shows the variation in the HD frequency distribution as the sample size *N _{S}* varies. Clearly, the larger the sample size, the smaller the variation.

We also asked what is the probability of erroneously classifying an infection as homogeneous due to poor sampling. In other words, what is the chance of missing a second lineage that represents a small percentage of the entire population when *N _{S}* sequences are sampled. We assumed a range of hypothetical low population prevalence, from 0.1% to 20%, and computed the probability that, out of a sample size

*N*, all sequences come from the dominant lineage. This is given by a binomial probability of getting zero “successes” out of

_{S}*N*trials, where a “success” is defined as a sequence belonging to the minor lineage, and the probability

_{S}*p*of success is given by its population prevalence

Thus, for example, for *N _{S}* ≥30, we can be 95% confident that any missed variant would comprise less than 10% of the total viral population (see Fig. pS9 in (Keele et al., 2008)). The number of sequences sampled from our 8 patients ranged from 16 to 50. When

*N*=16, we are 95% certain that we have sampled all lineages that represent at least 20% of the population. On the other hand, when

_{S}*N*=50, we are 95% confident that any missed lineage would have to represent 5% or less of the total population.

_{S}### Analysis of Sequence Data from 8 Acute Patients

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. Laboratory testing showed that 7 subjects were in Fiebig stage II, both viral RNA and p24 antigen were detected, and one subject was in Fiebig stage III, HIV-1 IgM in addition to viral RNA and p24 was detected (see Table 1). Detailed information of each subject is summarized in Table 3. Furthermore, samples were analyzed using the Hypermut tool (LANL) to test for overall enrichment of G-to-A hypermutation patterns with APOBEC3G/F signatures (Bourara et al., 2007; Harris and Liddament, 2004; Simon et al., 2005). Because these mutational patterns occur at a faster rate than ordinary mutations, they violate one of our model assumptions (constant mutation rate across sites and lineages). Only one sample, SUMA, presented mutations (three) that were found to carry the APOBEC3G/F signature. When all mutations were kept, the sample did not fit the Poisson model (χ^{2} goodness of fit p-value = 0.04, where a p<0.05 represents significant departure from the Poisson distribution), whereas a good fit was restored upon removing the APOBEC signatures (χ^{2} goodness of fit p-value = 0.243) (Keele et al., 2008).

For each of the 8 samples we computed the *HD _{0}* and

*HD*frequencies, as well as the divergence, diversity,

_{I}*HD*variance, % sequence identity and maximum

_{I}*HD*. All quantities are summarized in Table 4. In order to compute the

_{I}*HD*

_{0}distribution, we used the LANL tool

*Consensus Maker*(LANL) to compute a consensus sequence for each of the homogeneous samples, and then assumed that the consensus sequence was the founder sequence

*s*. We then proceeded to analyze the data in two steps: first, we used our models to infer what the expected maximum HD would be under a homogeneous infection, and used this to distinguish homogeneous infections from heterogeneous ones. Then, we applied our models to the homogeneous samples in order to infer a plausible time since the infection.

_{0}**...**

The eight early infection samples presented two distinct behaviors: either overall low diversity (max HD below 0.5%) and unimodal intersequence HD distribution, or overall high diversity (max HD above 2.5%; patients 1051 and BORI) and multimodal intersequence HD distribution (see Table 4 and Figure 8). We hypothesized that the underlying difference between the two patterns was whether a single HIV-1 strain entered the host or if multiple distinct strains entered. In order to exclude single strain infection in the high diversity samples, we computed the maximum HD one could reasonably expect to achieve at a given point in the infection and compared this to predictions based on the Fiebig stage. To do this, we used the maximum HD probability distribution computed in the asynchronous model, Eq. (22) and (23), with *P* taken from equation (13), using the patient specific base length *N _{B}* and sample size

*N*given in Table 3. To be conservative, we then computed the combined weight in the upper 2.5% tail that would be achieved at each generation step

_{S}*a*. The maximum HD at generation step

*a*is less than or equal to this value 97.5% of the time. For patient 1051 (Fiebig stage III), with probability 0.975 a minimum of 241 generation steps would be needed to achieve the observed maximum HD of 71 if the infection were homogeneous. This is not compatible with Fiebig stage III, which corresponds to a maximum of 37 days since infection (Table 1). Similarly, patient BORI (Fiebig stage II) yielded 261 generation steps for a maximum HD of 77, again incompatible with the Fiebig stage. We thus conclude that the high diversity samples were from multiple strain infections, whereas the low diversity samples were compatible with a single strain infection.

An alternative way for distinguishing between homogeneous and heterogeneous infections can be obtained by looking at both the mean diversity and the variance of each sample. According to our analytical results, when the infection is truly homogeneous, the measured average diversity and its variance should be approximately equal, which is one of the basic properties of a Poisson distribution. However, since measured mean HD and variance are affected by sample size and stochastic effects in early evolution, they may differ. Thus, for a homogeneous infection we can require that the mean (diversity) and variance be located between the upper and lower 95% confidence limits computed from MC simulations based on the number of sequences sampled per patient (Figure 9). All of the six low diversity subjects (WEAU, WITO, TRJO, SUMA, 1054, 1056) satisfy this condition, while subjects 1051 and BORI violate it. Thus we classify subjects 1051 and BORI as ”heterogeneous” infections (i.e., infections initiated by two or more founder strains) and the other 6 subjects as “homogeneous” infections. The limitation of this classification is the assumption that the sequence population diversifies without any selection, e.g. under neutral evolution. This assumption does not hold for Fiebig stage III or higher, and this classification becomes problematic. The area outside the cone in Figure 9 indicates deviations from a Poisson distribution. In early Fiebig stages this suggests a heterogeneous infection, whereas in later stages it indicates either immune selection or a heterogeneous infection or both. Also due to purifying selection, samples from a heterogeneous infection could be located inside the cone in later Fiebig stages. However, to be classified as homogenous we also require that the average % diversity in these samples be less than that expected based on the upper limit of the cumulative duration for the Fiebig stage, i.e., day 34 for Fiebig stage II (Figure 9, vertical dashed line)

The above classification was also confirmed by visual inspection through the LANL tool Highlighter (LANL). The visualization tool represents each sequence in the sample with a horizontal line, and places a mark whenever the sequence presents a mutation from the consensus. Figure 9 shows two exemplary behaviors found in acute infection samples: the homogeneous sample SUMA has few randomly distributed mutations, whereas the heterogeneous sample 1051 presents a majority of sequences that are similar to the consensus and a second group of sequences that not only have many more mutations relative to the consensus, but these mutations are shared, indicative of a second lineage.

### Examining Neutral Evolution: Star-like Phylogeny

During rapid exponential growth in the absence of selection, small samples of sequences are likely to have evolved from the founder sequence following a star-like phylogeny, i.e., they all coalesce at the founder (Wakeley, 2008). When this is the case, the intersequence *HD* frequency distribution, *HD _{I}*, coincides with the self-convolution of the

*HD*frequency distribution. To test whether this is the case and hence if the observed samples evolved following a star-like phylogeny, we compared the observed

_{0}*HD*frequencies from those computed assuming a star-like phylogeny. Given

_{I}*HD*

_{0}frequencies

*X*,

_{0}*X*, ...

_{1}*X*, where

_{m}*X*is the number of sequences with

_{i}*HD*=

_{0}*i,*we constructed the theoretical

*HD*frequencies by computing the

_{I}*HD*self-convolution frequencies

_{0},

_{1},...,

*, where*

_{n}*n*=

*2m*, as follows

The above formula is derived from the arithmetic convolution of the frequencies *X _{k}* with themselves. For example, ${\stackrel{\u2012}{Y}}_{0}={X}_{0}^{2}$,

_{1}=

*X*

_{0}

*X*

_{1}, ${\stackrel{\u2012}{Y}}_{2}={X}_{0}{X}_{2}+{X}_{1}^{2}$, and so on.

Figure 10 compares the theoretical frequencies _{0},_{1},...,* _{n}* (red lines), with the observed ones (the histograms). For 4 out of the 6 low diversity samples there was a perfect agreement between the calculated frequencies and the actual intersequence HD frequency. For the other two (WEAU and TRJO) a 5% and 15% difference was found, respectively. These very low differences, if any, in observed and computed frequencies suggest that all 6 samples evolved following an approximate star-like phylogeny.

### Estimating Time Since the MRCA

We applied the synchronous and asynchronous models to characterize early homogeneous infections and estimate the time since the MRCA. We did this using both the theoretical Poisson model and the output from the MC simulations.

In a homogeneous infection we expect the intersequence HD to follow a Poisson distribution, as in Eq. (13). We used a maximum likelihood method to fit a Poisson distribution to the observed *HD _{I}* frequencies and estimate the Poisson parameter λ, which is proportional to the time or the number of generations since the most recent common ancestor. Given the vector of observed frequencies

*Y*= (

*Y*

_{0},...,

*Y*), where

_{n}*Y*is the number of sequence pairs with

_{i}*HD=i*, the log likelihood function is defined as

By minimizing the log likelihood function, we obtain

the mean of the intersequence Hamming distance distribution.

We then used λ to estimate the number of generation steps, *n*, since the most recent common ancestor (MRCA), using the following:

where *n*=*a* in the synchronous model, and $n=\lfloor \frac{t}{{\tau}_{a}}\rfloor $ in the asynchronous model. Then *t*, the time since the MRCA, is given by *t*=*nτ _{s}* for the synchronous model, and

*t*=

*nτ*for the asynchronous model. As illustrated in Materials and Methods, we chose values of 2 and 1.5 days for

_{a}*τ*and

_{s}*τ*, respectively.

_{a}We assessed the goodness of the fit using a χ^{2} goodness-of-fit test statistic calculated from a singular value decomposition of the covariance matrix (see Materials and Methods). In a goodness of fit test the null hypothesis is that the two distributions being tested are statistically the same, hence a low p-value rejects the null hypothesis. All 6 low-diversity samples yielded a goodness-of-fit p-value of 0.1 or higher, suggesting that the observed distribution is consistent with a Poisson.

We fixed the values of base length *N _{B}* and sample size

*N*appropriate for each patient and then, under either the synchronous or the asynchronous model, we generated the HD intersequence distribution for generation steps

_{S}*n*through

_{1}*n*. For example, a typical output could be generated by choosing

_{2}*N*=2,600,

_{B}*N*=30 and generation steps 1 through 100. At each generation step,

_{S}*N*sequences were randomly sampled from it and the intersequence HD distribution of the sample was calculated. From the set of

_{S}*n*+1 intersequence HD distributions generated, we used a Kolmogorov-Smirnov statistic (Casella and Berger, 1990) to pick the HD distribution that best fit the HD distribution computed from the observed sequence data. We repeated this 10

_{2}-n_{1}^{3}times. From

*n**, the average of generation step of the 10

^{3}best-fitting distributions, we estimate the time since the MRCA, as done above for the Poisson model.

For all 6 low-diversity samples the time of infection could be estimated based on patients’ Fiebig stage. Table 4 summarizes the estimates both from fitting the Poisson distribution and the MC output using the asynchronous model. The computed estimated time since infection estimated by each patient’s Fiebig stage were within the 95% CIs predicted by the MC simulations. In contrast, the Poisson fit yielded slightly lower estimates (with respect to the given Fiebig stage) for subjects SUMA and 1056. All Poisson estimates were systematically lower than the MC ones. This is due to the fact that run-to-run variation, while accounted for in the MC fitting, is neglected when fitting the Poisson distribution. To understand this, consider for example a base length of *N _{B}*=2600. From Eq. (12), under the asynchronous model, the average Hamming distance from the founder strain, after one generation step will be 0.05. In this time, only 3 infected cells are produced, and, on average there will correspondingly be 0.15 mutations. However, in any given run, we can only have a discrete number of mutations. In particular, the first three cells will be identical 85% of the time (hence a mean HD of 0), 14% of the time there will be exactly one mutation across all three sequences (hence a mean HD of 0.333), and so on. Thus, though averaging over enough runs will lead to the correct mean value, most of the runs (85%) will have less than the expected number of mutations (after the first reverse transcription), and the rest will have more. This initial difference will persist through the generations and appear as an offset in a plot of mean HD versus generations (not shown). Averaging over 1,000 MC runs one finds that a sample of sequences with mean HD from the founder of 0.8 coalesce at a MRCA that is always 17 generation steps back (or 25.5 days) when fitting the Poisson distribution, but in the MC method 85% of the time the inferred number of generation steps will be higher (19 or 20 instead of 17) leading to a mean of 18.2 generation steps (or 27.3 days) Thus explaining the roughly 2 day discrepancy in the time to MRCA between these methods as shown in Table 4. Because the variation in the MC is equivalent to host-to-host variation, the MC 95% CIs are more likely to capture the true time estimates than the Poisson fits.

## Discussion

Early HIV-1 infection tends to be characterized by a viral population with limited sequence diversity in many but not all individuals. A recent study examined sequence data obtained from 102 individuals (Keele et al., 2008), and using the methods developed in this paper characterized the infections as being consistent with a single transmitted strain in 78 cases and multiple strains in 21 cases with 3 being borderline. In order to do this classification the authors used the model of early HIV-1 infection developed in this paper. Even though the biological conclusion that some people are infected by multiple strains and others by one strain only has already been published (Delwart et al., 2001; Derdeyn et al., 2004; Long et al., 2002; Sagar et al., 2004; Zhang et al., 1993), our goal here was to provide, in a self-contained manner, the mathematical underpinnings for the analysis of sequences obtained by single genome amplification methods.

Our goals were not novel in the sense that coalescent and Bayesian inference methods (Drummond et al., 2005; Drummond et al., 2006; Kuhner and Smith, 2007; Kuhner et al., 1998; Rannala, 1997), when restricted to homogeneous infections, can provide the same results. In fact, comparing the estimated time to the MRCA using BEAST (Drummond and Rambaut, 2007) and using our method on the 53 patients that had a homogeneous infection and no overt enrichment for APOBEC3 mutations from Keele et al. (Keele et al., 2008), we found the results to be similar (correlation coefficient = 0.973, best fitting slope = 1.036, see Figure 11). However, the main difference between these methods and ours is that whereas the former simulate genealogies, under similar assumptions, our simulation models follow the entire population. While for most of the problems studied in the literature simulating genealogies is the correct approach, what one gives up in that approach is simplicity when other processes like recombination and selection need to be modeled. A forward simulation like ours, on the other hand, is usually not feasible because of the problem size, for example for studying HIV-1 at the population level, but is readily applied to the important problem of modeling the evolutionary events in early infection and characterization of viral transmission. Because of the small number of generations since the extreme bottleneck, we can follow the entire population in a simulation as well as provide analytical calculations for various quantities summarizing viral evolution. In this paper, we set up a basic framework in which we neglect recombination and selection, but it is easy to see that with our methods these pose no problems of principle, whereas the analysis involving ancestral recombination graphs (Minichiello and Durbin, 2006), or ancestral selection graphs (Sharma, 1977) or both combined would be prohibitive for this data.

For cases of a single strain infection, our model suggests that the consensus sequence, which corresponds to our best estimate of the MRCA given the Poisson model described here, is either the transmitted strain or a strain one to two mutations away from the transmitted strain (Keele et al., 2008). Identifying the transmitted/founder strain has great biological significance as one can then determine if it has particular properties that allowed it to be transmitted. Even if our identification is only accurate to within one or two base accuracy this will still provide an enormous advance as it is the virus that selectively expands, and we can use this information to search for particular attributes of the virus that allowed it to expand. For example, preliminary examination of putative transmitted viruses suggests that they may generally be more uniformly refractive to neutralization than viruses obtained at the chronic stage of infection, which show heterogeneity in susceptibility to antibodies directed at their receptor binding surfaces (Keele et al., 2008). Also there are subtype specific attributes in terms of variable loop lengths and the number of glycosylation sites (the loops appear to be shorter and less glycosylated in the transmitted viruses of A and C clade infections (Chohan et al., 2005; Derdeyn et al., 2004; Sagar et al., 2006), but not B clade transmitted viruses (Frost et al., 2005)). Data is just beginning to accrue that will enable further exploration of potential viral transmission signatures and phenotypic traits, and the models described in this paper allow one to easily determine whether or not the MRCA or consensus sequence of an early sample is indeed a reasonable estimate of the transmitted virus.

Another important issue that our model addressed is the number of viruses from a single patient that one must sequence to determine if the infection is homogeneous within some confidence band and the number of sequences needed to obtain a reasonable estimate of the time to the MRCA. In order to achieve this, we computed pairwise Hamming distance frequencies of each given sample. Other types of statistical analyses that involve pairwise Hamming distances have been devised elsewhere (Gilbert et al., 2005), however the focus there was to simply compare two different populations, not to infer time estimates. Furthermore, we showed that under the homogeneous infection hypothesis, the HD frequencies follow a star-phylogeny and Poisson distribution. This was also shown in (Slatkin and Hudson, 1991), where the Poisson was used to calculate a population growth rate. Here we used it to test the homogeneity of the sample, and then to get time estimates since the MRCA. The probability of misclassifying an infection as homogeneous is easily calculable from our model. For example, we showed that with a sample of 30 sequences the probability of mistakenly classifying an infection as homogeneous due to not detecting a second minor lineage (less than 10% of the population) is at most 5%. However, now with deep sequencing methods, such as 454 technology (Margulies et al., 2005), 100,000 or more sequences of limited length can be obtained. Such technology may indeed show examples of what we have called homogeneous infections that are not truly homogenous, and could enable us to discern if minor lineages exist at very low frequencies.

The model assumes that the viral population grows exponentially with no selection pressure and no recombination. These are reasonable assumptions to make early in the infection, before the host’s immune response begins and while the infected cell population is still much smaller than the population of target cells. Further, early in infection derived from a single strain viral diversity is low and recombination may have only a small effect — as recombination between identical sequences leaves the sequences unchanged. Later in the infection, the exponential growth phase stops and selection pressure from the host environment gets established, both of which break the star-phylogeny evolution and hence our model is no longer valid.

In our mathematical models we neglect the occurrence of a mutation occurring more than once at the same site since it occurs with probability *O*(*a ^{2}ε^{2}N_{B}*). For our values of

*ε*= 2.16×10

^{-5}and

*N*~ 2,600, this probability remains below 0.006 throughout the first 100 replication cycles. After that, the model assumptions may fail because of immune pressure.

_{B}We assumed that the viral diversity evolves under a star-like phylogeny, and that all genomes coalesce at the founder strain. We used this assumption to construct the intersequence HD distribution and compute sample diversity. The probability of not coalescing at the founder strain is $O\left({R}_{0}^{-m}\right)$, where *R _{0}* is the basic reproductive ratio and

*m*is the number of generations away from the founder strain. So, for example, if we sample after 10 generation steps, the probability of any two randomly chosen sequences to coalesce one generation step back is ~

*R*. However, the probability of coalescing 1 or 2 generations earlier than the actual founder strain is not negligible. Even though we assumed that any two sequences coalesce at the founder strain, there is in fact less than a 0.5% chance of being 3 generations away from the actual founder, a 3% chance of being 2 generations away, and a 17% chance of being 1 generation away. For a set of 20 to 50 sequences, the probability of at least one pair coalescing after the actual founder is virtually certain, but the exponential growth of the infected cell population will severely limit the violation of star phylogeny.

_{0}^{-9}It has been estimated that a free virus has a half-life of less than an hour (Ramratnam et al., 1999), whereas an infected cell has an average lifespan of ~2 days (Markowitz et al., 2003; Perelson et al., 1996). Therefore, the virus present in plasma in a given sample is a reflection of the viral genomes contained in HIV-1 producing cells. Thus, as a simplification, we chose in our models to follow the infected cell population and not the virus. At each generation step, we assumed that an infected cell successfully infects *R _{0}* other cells. In effect, we are following the integrated HIV-1 provirus rather than viral particles. The experimental data we analyzed was obtained by sequencing virus and not provirus. On average, each cell produces as many as 5 × 10

^{4}virions (Chen et al., 2007), which makes the viral population at any given time point much larger than the infected cell population. Sampling with replacement of the provirus pool is therefore a good approximation, introducing a fractional error on the estimated probabilities of

*O(10*, where

^{-4}N_{S})*N*is the number of sequences sampled. Because we never sample over 60 sequences, this is a negligible correction.

_{S}Finally, we assumed the mutation rate is constant across different lineages and along the entire genome. This assumption is violated if APOBEC3G/F causes G-to-A mutation at an enhanced rate even in sequences that are not overtly hypermutated. Keele et al. (Keele et al., 2008) showed that overall APOBEC enriched samples follow our evolutionary model upon removing the G-to-A mutations with APOBEC3G/F signature. For the point mutation rate, we used the value 2.16×10^{-5} per base per replication cycle. (Mansky and Temin, 1995) estimated a mutation rate of 3.14×10^{-5} per base per replication cycle, but their estimate counted all types of mutations, including gaps. Because we exclude gaps throughout our analyses, we removed their contribution from the Mansky calculation and obtained the value 2.16×10^{-5}. The value from Mansky et al. was obtained *in vitro* from HeLa cells and reproduced in two different cell lines (Mansky, 1996; Mansky and Temin, 1995). More recent *in vitro* studies found a slightly smaller point mutation rate of 2.2 × 10^{-5} (including deletions and insertions) in one case (Huang and Wooley, 2005), and a larger rate of 5.4 × 10^{-5} in another (Gao et al., 2004). The use of a base substitution rate different from 2.16×10^{-5} would result in larger estimates for the number of days to the MRCA than obtained in Table 4, if the rate were smaller, and shorter if it were larger.

Within both the analytical and the MC model descriptions, we have envisioned two scenarios, one in which all infection events are synchronous, and one in which they are asynchronous and occur at two distinct times. Again, the advantage of the former is its simplicity. It yields a straightforward mathematical description, a population that grows exponentially in time with base *R _{0}*. However, it is biologically unrealistic that each cell would infect all other cells in at a single time. To explore how the synchronicity affects the model, we broke this assumption in the simplest possible way, e.g. by allowing infections to happen at two times. Even though this second scenario is still biologically unrealistic, nonetheless, it shows that the intersequence HD distributions do not change nature: they are still essentially Poisson distributions, and the increase in diversity is again driven by the number of RT cycles that have occurred. However, whereas in the synchronous model the increase in diversity depends on the base substitution rate ε and the generation time expressed in days, in the asynchronous model it also depends on

*R*. This poses an additional difficulty in estimating time since MRCA, as studies have shown the wide variability in estimates of

_{0}*R*across patients (Little et al., 1999; Stafford et al., 2000).

_{0}For the six patients that we classified as having been infected by a single viral strain, we estimated time since the MRCA by computing the divergence (i.e. the mean Hamming distance per base pair) from the founder strain. Because we do not know the actual founder strain, we constructed the sample consensus sequence and assumed it to be identical to the MRCA. In doing this, we are implicitly assuming no bottleneck after the virus has entered the host. Should a sublineage prevail over the rest of the population after entering the host, this would cause the consensus to represent the dominant sublineage rather than the actual infecting strain; an estimate to that MRCA that is statistically less than the minimal time from infection based on Fiebig stage would point to such a scenario, and suggest selection. All these caveats factor into our time since MRCA estimates, and would similarly impact the use coalescent methods in such estimates.

We have presented a simple description of an early homogeneous infection. This restricts the validity of our model to early infections, before selection pressure can be detected and while the infected cell population is still small with respect to the target cell population. Furthermore, in some cases there may be evidence of purifying selection even early in the infection, which could explain why for some of the samples we get too early time estimates, suggesting a bottleneck in viral evolution posterior to infection. Using the SNAP program (LANL), we looked at the dS/dN ratios of all 6 low diversity patients. Four out 6 (1054, SUMA, TRJO, WEAU) had ratios greater than 1, suggesting purifying selection (to be published elsewhere). Future developments of our model could allow correction for such a scenario, for example by using a modified base substitution rate.

Despite its limitations, our model successfully distinguishes homogenous from heterogeneous infections, and predicts the time evolution of divergence, diversity, maximum diversity and % sequence identity in single-strain HIV-1-infections.

## Materials and Methods

### Sequence Data Analysis

Plasma samples were obtained from 8 subjects with acute or very recent HIV-1 subtype B infection. All subjects gave informed consent, and plasma collections were performed with institutional review board and other regulatory approvals. Blood specimens were generally collected in acid citrate dextrose and plasma separated and stored at -20 to -70°C. To determine how the far into the acute phase the of infection the samples were taken, they were tested for HIV-1 RNA, p24 antigen, quantitative Chiron bDNA 3.0 or Roche Amplicor viral RNA assays; Coulter or Roche p24 Ag assays; Genetic Systems Anti-HIV-1/2 Plus O and Abbott AntiHIV-1/2 3^{rd} Generation EIAs; and Genetic Systems HIV-1 Western Blot Kit. Based on these test results, subjects were staged according to the Fiebig laboratory classification system for acute and early HIV-1 infection (see Table 1).

We employed single genome amplification (SGA) of plasma HIV-1 RNA followed by direct sequence analysis of uncloned *env* amplicons (Keele et al., 2008) as described in Salazar *et al*. (Salazar-Gonzalez et al., 2008). The number of *env* sequences, ranging from 16 to 50, analyzed per subject, their GenBank accession numbers are given in Keele *et al*. (Keele et al., 2008), and the subjects’ Fiebig stages are shown in Table 1. Sequences were aligned with GeneCutter (LANL) and the alignment was hand-checked. We removed insertions and deletions using the “Gap Strip Tool” (http://www.hiv.lanl.gov/content/sequence/GAPSTREEZE/strip_ready.html), which removes all columns that contain one or more gaps from a nucleotide alignment.

## Mathematical Derivations

### Number of mutations

We show that $P(\text{mutations}>{N}_{B}\mid a)\le O\left(\frac{{a}^{2}{\epsilon}^{2}}{{N}_{B}}\right)<<1$. Chebyshev’s inequality (Feller, 1957) states that, given a random variable *X* with mean μ and variance *σ ^{2}*, for any positive integer

*k*one has

Let *X* be the number of mutations. Then *μ* = *aεN _{B}* and

*σ*

^{2}=

*aεN*(1-

_{B}*ε*). We pick

*k*such that

*kσ*+

*μ*=

*N*+ 1, from which it follows that:

_{B}### Coalescence

To see that the probability of coalescence before generation 0 approaches zero exponentially as *a* and *m* get large, i.e. the expression in Eq. (6) is $O\left({R}_{0}^{-m}\right)$, recall that

Therefore

### The Expected Maximum HD_{I}

Given a sample of *N*_{S} sequences drawn from the viral population at generation *a* derived from a single strain infection, and a positive integer *M*, we examine how the *N*_{S} *HD*_{0} values should be distributed so that the maximum intersequence Hamming distance is exactly *M*. Notice that *HD _{I}* =

*M*if, for any given

*k*≥

*0*, there is at least one sequence such that

*HD*

_{0}=

*k*and another one such that

*HD*

_{0}=

*M*-

*k*. Now, if $k>\lfloor \frac{M}{2}\rfloor $, where

*x*is the floor functions defined as the maximum integer less than or equal to

*x*, and there is more than one sequence with

*HD*

_{0}=

*k*, then for two such sequences their relative Hamming distance will be

*HD*= 2

_{I}*k*>

*M*, assuming that for a single strain infection the MRCA is

*s*. Therefore, for the maximum to be exactly

_{0}*M*, there can be only one sequence with $H{D}_{0}=k,k>\lfloor \frac{M}{2}\rfloor $, and at least one with

*HD*=

_{0}*M -k*, and all the others with

*HD*. Using the abbreviated notation ${P}_{k}^{\left(a\right)}=P(H{D}_{0}=k\mid a)=\text{Binom}(k;a{N}_{B},\epsilon )$, the probability that the maximum

_{0}≤ M -k*HD*=

_{I}*M*is thus ${P}_{k}^{\left(a\right)}$ multiplied by the probability of having all the other

*N*-1 sequences with

_{S}*HD*

_{0}≤

*M -k*and at least one with

*HD*

_{0}=

*M -k ,*yielding:

In the special case when *M* is even and $k=\frac{M}{2}$, we need at least two sequences such that $H{D}_{0}=\frac{M}{2}$, which is expressed in the following

Putting everything together, we obtain

when *M* is odd, and

when *M* is even. Here we are assuming that *a*>0 and that any two sequences coalesce at generation 0. Therefore, the expected maximum *HD* at generation *a* is given by

### Asynchronous Model, Eq. (8)

Of all infected cells *I(a*+*1*,*n*+*1)* at time *n*+*1*, α *I(a,n)* were infected by cells of age *a* that were “born” at time *n*, whereas *γI(a,n-1)* were infected by cells of age *a* that were born at time *n-1*. In other words, we have the following relationship

Using the binomial expansion $\left(\begin{array}{c}\hfill x+1\hfill \\ \hfill y-x\hfill \end{array}\right)=\left(\begin{array}{c}\hfill x\hfill \\ \hfill y-x\hfill \end{array}\right)+\left(\begin{array}{c}\hfill x\hfill \\ \hfill y-x-1\hfill \end{array}\right)$, one can verify that for $\lceil \frac{n}{2}\rceil \le a\le n$ the expression $I(a,n)={I}_{0}{\alpha}^{a}{\left(\frac{\gamma}{\alpha}\right)}^{n-a}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right)$ satisfies Eq. (24) as shown below:

### Asynchronous Model, Eq. (10)

Let

By taking the summations on both side of Eq. (24), one obtains that *N(n)* satisfies *N*(*n*+1) = *αN*(*n*)+*γN*(*n*-1). Let ${\lambda}_{1}=\frac{\alpha}{2}(1+\phi )$, and ${\lambda}_{2}=\frac{\alpha}{2}(1-\phi )$, where $\phi =\sqrt{1+4\frac{\gamma}{{\alpha}^{2}}}$. Then, for any integers *a* and *b*, the function so defined: ${N}_{a,b}\left(n\right)=a{\lambda}_{1}^{n}+b{\lambda}_{2}^{n}$ is such that for every *n* it satisfies the identity *N _{a,b}*(

*n*+1) =

*αN*(

_{a,b}*n*)+

*γN*(

_{a,b}*n*-1). In fact, because λ

_{1}and λ

_{2}are the roots of the second degree equation

*x*

^{2}=

*αx*+

*γ*, it follows that for

*i*=

*1,2*we have ${\lambda}_{i}^{2}=\alpha {\lambda}_{i}+\gamma $. In particular, for every

*n*> 1, ${\lambda}_{i}^{n+1}=\alpha {\lambda}_{i}^{n}+\gamma {\lambda}_{i}^{n-1}$, and as a consequence

Therefore, we need to find *a* and *b* such that *N(0)*=*I _{0}* and

*N(1)*=

*αI*, which are the boundary conditions. This yields a linear system with solutions: $a={I}_{0}\frac{1+\phi}{2\phi}$ and $b=-{I}_{0}\frac{1-\phi}{2\phi}$. Hence, substituting these values into the expression of

_{0}*N*, we get $N\left(n\right)={I}_{0}{\left(\frac{\alpha}{2}\right)}^{n}\frac{{(1+\phi )}^{n+1}-{(1-\phi )}^{n+1}}{2\phi}$. As a side result, notice that if we let ${S}_{n}={\sum}_{a=\lceil \frac{n}{2}\rceil}^{n}\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\gamma \right)}^{n-a}$ then it follows that ${S}_{n}=\frac{{(1+\phi )}^{n+1}-{(1-\phi )}^{n+1}}{{2}^{n+1}\phi}$. Finally, notice that for large values of $n=\frac{{(1+\phi )}^{n+1}-{(1-\phi )}^{n+1}}{2\phi}\approx {(1+\phi )}^{n}$, and therefore $N\left(n\right)\approx {I}_{0}{\left(\frac{\alpha}{2}(1+\phi )\right)}^{n}$.

_{a,b}(n)### Asynchronous Model, Eq. (13)

In order to show that the *HD _{0}* probability distribution is such that

*P*(

*d,n*) ≈

*Pois*(

*d,λ*) +

*O*(

*ε*

^{2}) for some suitable parameter

*λ*, we show that up to terms in ε

^{2}, all cumulants of this distribution are equal, which is a property that uniquely characterizes a Poisson distribution. The cumulant generating function is given by

where ${M}_{HD}^{\left(n\right)}\left(\xi \right)$ is the moment generating function

In Eq. (26), *p _{j}*(

*ε*) is a polynomial in ε of degree

*j-1*such that

*p*(0) = 1 for all

_{j}*j*’s. Let $X\left(n\right)={\sum}_{a=\lceil \frac{n}{2}\rceil}^{n}a\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\frac{\gamma}{{\alpha}^{2}}\right)}^{n-a}$. One can show that for sufficiently large

*n*, the following holds $\frac{X\left(n\right)}{{F}_{n}}\approx n\frac{1+\phi}{2\phi}+\frac{1-\phi}{2{\phi}^{2}}$. This can be shown by first noticing that $\frac{{F}_{n+1}}{{F}_{n}}\approx \frac{1+\phi}{2}+O\left({e}^{-n}\right)$ because of the identity

And since $\theta =\frac{1-\phi}{1+\phi}<1$, it follows that $\frac{{F}_{n}}{{F}_{n-1}}\approx \frac{1+\phi}{2}+{\theta}^{n}\left(\frac{1+\phi}{2}\right)+O\left({\theta}^{2n}\right)$. Given $X\left(n\right)={\sum}_{a=\lceil \frac{n}{2}\rceil}^{n}a\left(\begin{array}{c}\hfill a\hfill \\ \hfill n-a\hfill \end{array}\right){\left(\frac{\gamma}{{\alpha}^{2}}\right)}^{n-a}$, for *n* >> 1, we have $\frac{X\left(n\right)}{{F}_{n}}\approx n\frac{1+\phi}{2\phi}+\frac{1-\phi}{2{\phi}^{2}}$. In fact, let $x=\frac{\gamma}{{\alpha}^{2}}=\frac{{\phi}^{2}-1}{4}$. Then

As a consequence, $X\left(n\right)=n{F}_{n}-n\frac{{\phi}^{2}-1}{4\phi}{F}_{n-1}-\frac{{\phi}^{2}-1}{4{\phi}^{2}}{F}_{n-1}$. Dividing both side by *F _{n}* completes the proof.

Given this notation, we can write ${K}_{HD}^{\left(n\right)}\left(\xi \right)=\epsilon {N}_{B}\left({\sum}_{j\ge 1}\frac{{\xi}^{j}}{j!}{p}_{j}\left(\epsilon \right)\right)\frac{X\left(n\right)}{{F}_{n}}$. As a consequence, for large *n*, we can approximate the *m*-th cumulant as

and up to terms in *O*(*ε*^{2}*N _{B}*), all cumulants are equal (the m-th cumulant does not depend on

*m*but only on

*n*), which proves that the distribution is approximately Poisson.

### Generation times *τ*_{s} and *τ*_{a}

_{s}

_{a}

The times at which cells are infected differ in the synchronous and asynchronous models. However, the growth rate of the infected cell population in the two models should both be chosen equal to the best estimate of the actual growth rate. Setting these growth rates equal will lead to a relationship between *τ _{s}* and

*τ*. Current estimates for the generation time are roughly two days (Markowitz et al., 2003). Thus, lets assume τ

_{a}_{s}=2 days. In order to derive the duration of τ

_{a}, notice that

*Log*

_{R0}(

*N*(

*n*)) is linear in

*n*for sufficiently large

*n*, where

*N(n)*is the quantity defined in Eq. (25). In particular, this quantity grows linearly with a rate per

*τ*given by

_{a}In the synchronous model, at any time *t*=*a*, the total number of newly infected cells is $N\left(n\right)={I}_{0}{R}_{0}^{n}$. Therefore, the synchronous growth rate *ρ _{s}* in the logarithmic scale is 1, i.e.,

If the mean generation time is *2* days, it follows that the average growth rate is 0.5/day. Comparing the rate of increase per day between the two models, we get

where *τ _{s}*=2 days. Hence ${\tau}_{a}=2{\rho}_{a}\approx 2{\mathit{Log}}_{{R}_{0}}\left(\alpha \frac{1+\phi}{2}\right)$. In particular, when $\alpha =\gamma =\frac{{R}_{0}}{2}$,

*τ*=

_{a}*1.5*days and the total number of newly infected cells at time

*t*=

*nτ*is given by

_{a}where $\phi =\sqrt{1+\frac{8}{{R}_{0}}}$. We could also assume given the data that the average generation time in an asynchronous model is 2 days and then use the procedure given above to deduce a corresponding τ_{s}. In using a *τ _{a}* of 1.5 days half the infected cells are implicitly assumed to live for 3 days. While the average measured lifetime is 2 days the variance around the mean is not known and hence it is unclear if this value of

*τ*is realistic.

_{a}### Chi square test for dependent data cells

Define

where *Y* is the vector of intersequence HD frequencies of length *n*=max(*HD*), E denotes the expectation operator *E*(*Y*)=(*E*(*Y*_{0}),...*E*(*Y _{n}*)), and Σ

^{-1}is calculated by inverting the covariance matrix Σ

_{i,j}=

*σ*=

_{i,j}*E*(

*Y*)-

_{i}Y_{j}*E*(

*Y*)

_{i}*E*(

*Y*) on its non-singular space using Singular Value Decomposition. This is motivated by the non-independence of our observations, since they are drawn from a convolution of two identical Poisson distributions with mean λ. In particular $E\left({Y}_{k}\right)={c}^{2}{2}^{k-1}{e}^{-2\lambda}\frac{{\lambda}^{k}}{k!}$, where

_{j}*c*is a normalization constant and

*k*=

*0*, ... ,

*n*. Since the convolution operation imposes $\lfloor \frac{n}{2}\rfloor +1$ linear constraints on Σ’s columns, the degrees of freedom of the

*χ*

^{2}statistic are $df\le \lfloor \frac{n}{2}\rfloor -1$.

### Monte Carlo Simulations

The number of bases that undergo mutation is determined by the mutation rate ε. However, once a base is chosen to be mutated a matrix of base transitions is then used to determine which base it is mutated into. This same method is used in both the synchronous and asynchronous models. The base frequencies and substitution matrix employed in the MC simulations were obtained from a general time reversible (GTR) evolutionary model (Yang, 1994) optimized in conjunction with a maximum likelihood phylogenetic tree. Relative base substitution frequencies are listed in Table 5; these were based on the GTR probability matrix (Korber et al., 2000) and a maximum likelihood estimate of rate variation per site, and indicate the probability of change of one nucleotide to another per unit time, assuming a branch length of 1 (there is extensive change with a branch length of 1). In the MC simulation the equilibrium base frequencies for this model were A=0.373, G=0.241, C=0.193, and T=0.192. The transition matrix was computed from Table 5 by setting the diagonal elements to zero and renormalizing each row.

The maximum likelihood tree used to derive this matrix in Table 5 included 445 HIV-1 clade B Envelope sequences from the Los Alamos database (LANL) combined with sequences from Keele *et al.* (Keele et al., 2008) (code described in (Korber et al., 2000)). While this model was used for the simulations presented in this paper, the use of the HIV-1 specific GTR model turned out to have little impact on the simulation results; a neutral model with an even base distribution and equal substitution rates yielded essentially the same results for the simulations (data not shown).

## Acknowledgements

Portions of this work were done under the auspices of the U. S. Department of Energy under contract DE-AC52-06NA25396 and supported in part by the Center for HIV/AIDS Vaccine Immunology (AI67854), the Bill & Melinda Gates Foundation Grand Challenges program (37874), the University of Alabama at Birmingham Center for AIDS Research (AI27767), the University of Rochester Developmental Center for AIDS Research (P30-AI078498) and NIH grants AI083115, AI028433, and RR0655. We thank Marcus Daniels for technical assistance.

## Footnotes

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

## References

- Abrahams MR, Anderson JA, Giorgi EE, Seoighe C, Mlisana K, Ping LH, Athreya GS, Treurnicht FK, Keele BF, Wood N, Salazar-Gonzalez JF, Bhattacharya T, Chu H, Hoffman I, Galvin S, Mapanje C, Kazembe P, Thebus R, Fiscus S, Hide W, Cohen MS, Karim SA, Haynes BF, Shaw GM, Hahn BH, Korber BT, Swanstrom R, Williamson C. Quantitating the multiplicity of infection with human immunodeficiency virus type 1 subtype C reveals a non-poisson distribution of transmitted variants. J Virol. 2009;83:3556–67. [PMC free article] [PubMed]
- Achaz G, Palmer S, Kearney M, Maldarelli F, Mellors JW, Coffin JM, Wakeley J. A robust measure of HIV-1 population turnover within chronically infected individuals. Mol Biol Evol. 2004;21:1902–12. [PubMed]
- Bourara K, Liegler TJ, Grant RM. Target cell APOBEC3C can induce limited G-to-A mutation in HIV-1. PLoS Pathog. 2007;3:1477–1485. [PMC free article] [PubMed]
- Casella G, Berger RL. Statistical inference. Brooks/Cole Pub. Co.; Pacific Grove, Calif.: 1990.
- Chen HY, Di Mascio M, Perelson AS, Ho DD, Zhang L. Determination of virus burst size in vivo using a single-cycle SIV in rhesus macaques. Proc Natl Acad Sci U S A. 2007;104:19079–84. [PMC free article] [PubMed]
- Chohan B, Lang D, Sagar M, Korber B, Lavreys L, Richardson B, Overbaugh J. Selection for human immunodeficiency virus type 1 envelope glycosylation variants with shorter V1-V2 loop sequences occurs during transmission of certain genetic subtypes and may impact viral RNA levels. J. Virol. 2005;79:6528–6531. [PMC free article] [PubMed]
- Clark SJ, Saag MS, Decker WD, Campbell-Hill S, Roberson JL, Veldkamp PJ, Kappes JC, Hahn BH, Shaw GM. High titers of cytopathic virus in plasma of patients with symptomatic primary HIV-1 infection. N. Engl. J. Med. 1991;324:954–960. [PubMed]
- Delwart EL, Magierowska M, Royz M, Foley B, Peddada L, Smith R, Heldebrand C, Conrad A, Busch MP. Homogeneous quasispecies in 16 out of 17 individuals during very early HIV-1 primary infection. AIDS. 2001;15:1–7. [PubMed]
- Derdeyn CA, Decker JM, Bibollet-Ruche F, Mokili JL, Muldoon M, Denham SA, Heil ML, Kasolo F, Musonda R, Hahn BH, Shaw GM, Korber BT, Allen S, Hunter E. Envelope-constrained neutralization-sensitive HIV-1 after heterosexual transmission. Science. 2004;303:2019–2022. [PubMed]
- Dickover R, Garratty E, Yusim K, Miller C, Korber B, Bryson Y. Role of maternal autologous neutralizing antibody in selective perinatal transmission of human immunodeficiency virus type 1 escape variants. J. Virol. 2006;80:6525–33. [PMC free article] [PubMed]
- Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007;7:214. [PMC free article] [PubMed]
- Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 2005;22:1185–1192. [PubMed]
- Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4:e88. [PMC free article] [PubMed]
- Edwards CT, Holmes EC, Wilson DJ, Viscidi RP, Abrams EJ, Phillips RE, Drummond AJ. Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. BMC Evol Biol. 2006;6:28. [PMC free article] [PubMed]
- Feller W. An introduction to probability theory and its applications. John Wiley and Sons, Inc.; 1957.
- Fiebig EW, Heldebrant CM, Smith RI, Conrad AJ, Delwart EL, Busch MP. Intermittent low-level viremia in very early primary HIV-1 infection. J Acquir Immune Defic Syndr. 2005;39:133–7. [PubMed]
- Fiebig EW, Wright DJ, Rawal BD, Garrett PE, Schumacher RT, Peddada L, Heldebrant C, Smith R, Conrad A, Kleinman SH, Busch MP. Dynamics of HIV viremia and antibody seroconversion in plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS. 2003;17:1871–1879. [PubMed]
- Frost SD, Liu Y, Pond SL, Chappey C, Wrin T, Petropoulos CJ, Little SJ, Richman DD. Characterization of human immunodeficiency virus type 1 (HIV-1) envelope variation and neutralizing antibody responses during transmission of HIV-1 subtype B. J. Virol. 2005;79:6523–7. [PMC free article] [PubMed]
- Gaines H, von Sydow M, Pehrson PO, Lundbegh P. Clinical picture of primary HIV infection presenting as a glandular-fever-like illness. BMJ. 1988;297:1363–8. [PMC free article] [PubMed]
- Gao F, Chen Y, Levy DN, Conway JA, Kepler TB, Hui H. Unselected mutations in the human immunodeficiency virus type 1 genome are mostly nonsynonymous and often deleterious. J Virol. 2004;78:2426–33. [PMC free article] [PubMed]
- Gilbert PB, Rossini AJ, Shankarappa R. Two-sample tests for comparing intra-individual genetic sequence diversity between populations. Biometrics. 2005;61:106–17. [PubMed]
- Harris RS, Liddament MT. Retroviral restriction by APOBEC proteins. Nat. Rev. Immunol. 2004;4:868–877. [PubMed]
- Heffernan JM, Wahl LM. Monte Carlo estimates of natural variation in HIV infection. J Theor Biol. 2005;236:137–53. [PubMed]
- Huang KJ, Wooley DP. A new cell-based assay for measuring the forward mutation rate of HIV-1. J Virol Methods. 2005;124:95–104. [PubMed]
- Kamina A, Makuch RW, Zhao H. A stochastic modeling of early HIV-1 population dynamics. Math Biosci. 2001;170:187–98. [PubMed]
- Keele B, Li H, Learn GH, Hraber PT, Giorgi EE, Grayson T, Sun C, Chen Y, Yeh WW, Letvin NL, Mascola JR, Nabel GJ, Hayens BF, Bhattacharya T, Perelson AS, Korber BT, Hahn BH, Shaw GS. Low Dose Rectal Inoculation of Rhesus Macaques by SIVsmE660 or SIVmac251 Recapitulates Human Mucosal Infection by HIV-1. J Exp Med To appear. 2009 [PMC free article] [PubMed]
- Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, Salazar MG, Sun C, Grayson T, Wang S, Li H, Wei X, Jiang C, Kirchherr JL, Gao F, Anderson JA, Ping LH, Swanstrom R, Tomaras GD, Blattner WA, Goepfert PA, Kilby JM, Saag MS, Delwart EL, Busch MP, Cohen MS, Montefiori DC, Haynes BF, Gaschen B, Athreya GS, Lee HY, Wood N, Seoighe C, Perelson AS, Bhattacharya T, Korber BT, Hahn BH, Shaw GM. Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A. 2008 [PMC free article] [PubMed]
- Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Timing the ancestor of the HIV-1 pandemic strains. Science. 2000;288:1789–96. [PubMed]
- Kuhner MK, Smith LP. Comparing likelihood and Bayesian coalescent estimation of population parameters. Genetics. 2007;175:155–65. [PMC free article] [PubMed]
- Kuhner MK, Yamato J, Felsenstein J. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics. 1998;149:429–34. [PMC free article] [PubMed]
- LANL Hypermut. http://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html.
- LANL Consensus Maker. http://www.hiv.lanl.gov/content/sequence/CONSENSUS/consensus.html.
- LANL Highlighter. http://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/highlighter.html.
- LANL. SNAP http://www.hiv.lanl.gov/content/sequence/SNAP/SNAP.html.
- LANL GeneCutter. http://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html.
- LANL Los Alamos Database. http://www.hiv.lanl.gov/content/index.
- Lee HY, Topham DJ, Park SY, Hollenbaugh J, Treanor J, Mosmann TR, Jin X, Ward BM, Miao H, Holden-Wiltse J, Perelson AS, Zand M, Wu H. Simulation and prediction of the adaptive immune response to influenza a virus infection. J Virol. 2009;83:7151–65. [PMC free article] [PubMed]
- Leigh Brown AJ. Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc Natl Acad Sci U S A. 1997;94:1862–5. [PMC free article] [PubMed]
- Lindback S, Karlsson AC, Mittler J, Blaxhult A, Carlsson M, Briheim G, Sonnerborg A, Gaines H, Karolinska Institutet Primary HIV Infection Study Group Viral dynamics in primary HIV-1 infection. AIDS. 2000a;14:2283–2291. [PubMed]
- Lindback S, Thorstensson R, Karlsson AC, von Sydow M, Flamholc L, Blaxhult A, Sonnerborg A, Biberfeld G, Gaines H, Karolinska Institute Primary HIV Infection Study Group Diagnosis of primary HIV-1 infection and duration of follow-up after HIV exposure. AIDS. 2000b;14:2333–2339. [PubMed]
- Little SJ, McLean AR, Spina CA, Richman DD, Havlir DV. Viral dynamics of acute HIV-1 infection. J. Exp. Med. 1999;190:841–850. [PMC free article] [PubMed]
- Long EM, Rainwater SM, Lavreys L, Mandaliya K, Overbaugh J. HIV type 1 variants transmitted to women in Kenya require the CCR5 coreceptor for entry, regardless of the genetic complexity of the infecting virus. AIDS Res. Hum. Retroviruses. 2002;18:567–576. [PubMed]
- Mansky LM. Forward mutation rate of human immunodeficiency virus type 1 in a T lymphoid cell line. AIDS Res Hum Retroviruses. 1996;12:307–14. [PubMed]
- Mansky LM, Temin HM. Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virol. 1995;69:5087–5094. [PMC free article] [PubMed]
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–80. [PMC free article] [PubMed]
- Markowitz M, Louie M, Hurley A, Sun E, Di Mascio M, Perelson AS, Ho DD. A novel antiviral intervention results in more accurate assessment of human immunodeficiency virus type 1 replication dynamics and T-cell decay in vivo. J. Virol. 2003;77:5037–5038. [PMC free article] [PubMed]
- Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79:910–22. [PMC free article] [PubMed]
- Nowak MA, Lloyd AL, Vasquez GM, Wiltrout TA, Wahl LM, Bischofberger N, Williams J, Kinter A, Fauci AS, Hirsch VM, Lifson JD. Viral dynamics of primary viremia and antiretroviral therapy in simian immunodeficiency virus infection. J Virol. 1997;71:7518–25. [PMC free article] [PubMed]
- Painter SL, Biek R, Holley DC, Poss M. Envelope variants from women recently infected with clade A human immunodeficiency virus type 1 confer distinct phenotypes that are discerned by competition and neutralization experiments. J Virol. 2003;77:8448–61. [PMC free article] [PubMed]
- Perelson AS, Neumann AU, Markowitz M, Leonard JM, Ho DD. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time. Science. 1996;271:1582–6. [PubMed]
- Ramratnam B, Bonhoeffer S, Binley J, Hurley A, Zhang L, Mittler JE, Markowitz M, Moore JP, Perelson AS, Ho DD. Rapid production and clearance of HIV-1 and hepatitis C virus assessed by large volume plasma apheresis. Lancet. 1999;354:1782–5. [PubMed]
- Rannala B. Gene genealogy in a population of variable size. Heredity. 1997;78(Pt 4):417–23. [PubMed]
- Ribeiro RM, Bonhoeffer S. A stochastic model for primary HIV infection: optimal timing of therapy. AIDS. 1999;13:351–7. [PubMed]
- Richman DD, Wrin T, Little SJ, Petropoulos CJ. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc Natl Acad Sci U S A. 2003;100:4144–9. [PMC free article] [PubMed]
- Ritola K, Pilcher CD, Fiscus SA, Hoffman NG, Nelson JA, Kitrinos KM, Hicks CB, Eron JJ, Jr., Swanstrom R. Multiple V1/V2 env variants are frequently present during primary infection with human immunodeficiency virus type 1. J. Virol. 2004;78:11208–11218. [PMC free article] [PubMed]
- Ruskin HJ, Pandey RB, Liu Y. Viral load and stochastic mutation in a Monte Carlo simulation of HIV. Physica A. 2002;293:315–323.
- Sagar M, Wu X, Lee S, Overbaugh J. Human immunodeficiency virus type 1 V1-V2 envelope loop sequences expand and add glycosylation sites over the course of infection, and these modifications affect antibody neutralization sensitivity. J Virol. 2006;80:9586–98. [PMC free article] [PubMed]
- Sagar M, Kirkegaard E, Long EM, Celum C, Buchbinder S, Daar ES, Overbaugh J. Human immunodeficiency virus type 1 (HIV-1) diversity at time of infection is not restricted to certain risk groups or specific HIV-1 subtypes. J Virol. 2004;78:7279–83. [PMC free article] [PubMed]
- Salazar-Gonzalez JF, Bailes E, Pham KT, Salazar MG, Guffey MB, Keele BF, Derdeyn CA, Farmer P, Hunter E, Allen S, Manigart O, Mulenga J, Anderson JA, Swanstrom R, Haynes BF, Athreya GS, Korber BT, Sharp PM, Shaw GM, Hahn BH. Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification by single-genome amplification and sequencing. J Virol. 2008;82:3952–70. [PMC free article] [PubMed]
- Schacker T, Collier AC, Hughes J, Shea T, Corey L. Clinical and epidemiologic features of primary HIV infection. Ann. Intern. Med. 1996;125:257–264. [PubMed]
- Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL, Mullins JI. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 1999;73:10489–502. [PMC free article] [PubMed]
- Sharma JC. Inheritance of defective extremities. Indian J Med Res. 1977;66:120–6. [PubMed]
- Simon V, Zennou V, Murray D, Huang Y, Ho DD, Bieniasz PD. Natural variation in Vif: differential impact on APOBEC3G/3F and a potential role in HIV-1 diversification. PLoS Pathog. 2005;1:e6. [PMC free article] [PubMed]
- Slatkin M, Hudson RR. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics. 1991;129:555–62. [PMC free article] [PubMed]
- Stafford MA, Corey L, Cao Y, Daar ES, Ho DD, Perelson AS. Modeling plasma virus concentration during primary HIV infection. J. Theor. Biol. 2000;203:285–301. [PubMed]
- Tan WY, Wu H. Stochastic modeling of the dynamics of CD4+ T-cell infection by HIV and some Monte Carlo studies. Math Biosci. 1998;147:173–205. [PubMed]
- Tuckwell HC, Le Corfec E. A stochastic model for early HIV-1 population dynamics. J Theor Biol. 1998;195:451–63. [PubMed]
- Vernazza PL, Eron JJ, Fiscus SA, Cohen MS. Sexual transmission of HIV: infectiousness and prevention. AIDS. 1999;13:155–66. [PubMed]
- Wakeley J. Coalescent Theory, an Introduction. Roberts & Co.; 2008.
- Wei X, Decker JM, Wang S, Hui H, Kappes JC, Wu X, Salazar-Gonzalez JF, Salazar MG, Kilby JM, Saag MS, Komarova NL, Nowak MA, Hahn BH, Kwong PD, Shaw GM. Antibody neutralization and escape by HIV-1. Nature. 2003;422:307–312. [PubMed]
- Wolinsky SM, Wike CM, Korber BT, Hutto C, Parks WP, Rosenblum LL, Kunstman KJ, Furtado MR, Munoz JL. Selective transmission of human immunodeficiency virus type-1 variants from mothers to infants. Science. 1992;255:1134–7. [PubMed]
- Wolinsky SM, Korber BT, Neumann AU, Daniels M, Kunstman KJ, Whetsell AJ, Furtado MR, Cao Y, Ho DD, Safrit JT. Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science. 1996;272:537–42. [PubMed]
- Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol. 1994;39:105–11. [PubMed]
- Zhang LQ, MacKenzie P, Cleland A, Holmes EC, Brown AJ, Simmonds P. Selection for specific sequences in the external envelope protein of human immunodeficiency virus type 1 upon primary infection. J. Virol. 1993;67:3345–56. [PMC free article] [PubMed]

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (2.2M)

- Nef gene evolution from a single transmitted strain in acute SIV infection.[Retrovirology. 2009]
*Bimber BN, Chugh P, Giorgi EE, Kim B, Almudevar AL, Dewhurst S, O'Connor DH, Lee HY.**Retrovirology. 2009 Jun 8; 6:57. Epub 2009 Jun 8.* - Estimating time since infection in early homogeneous HIV-1 samples using a poisson model.[BMC Bioinformatics. 2010]
*Giorgi EE, Funkhouser B, Athreya G, Perelson AS, Korber BT, Bhattacharya T.**BMC Bioinformatics. 2010 Oct 25; 11:532. Epub 2010 Oct 25.* - A new algorithm for analysis of within-host HIV-1 evolution.[Pac Symp Biocomput. 2001]
*Ren F, Ogishima S, Tanaka H.**Pac Symp Biocomput. 2001; :595-605.* - Viral evolution and escape during acute HIV-1 infection.[J Infect Dis. 2010]
*Boutwell CL, Rolland MM, Herbeck JT, Mullins JI, Allen TM.**J Infect Dis. 2010 Oct 15; 202 Suppl 2:S309-14.* - Genetic and antigenic features of the transmitted virus.[Curr Opin HIV AIDS. 2009]
*Keele BF, Derdeyn CA.**Curr Opin HIV AIDS. 2009 Sep; 4(5):352-7.*

- Estimating the Timing of Mother-to-Child Transmission of the Human Immunodeficiency Virus Type 1 Using a Viral Molecular Evolution Model[PLoS ONE. ]
*Chaillon A, Samleerat T, Zoveda F, Ballesteros S, Moreau A, Ngo-Giang-Huong N, Jourdain G, Gianella S, Lallemant M, Depaulis F, Barin F.**PLoS ONE. 9(4)e90421* - Phylogenetic inferences on HIV-1 transmission: implications for the design of prevention and treatment interventions[AIDS (London, England). 2013]
*Brenner B, Wainberg MA, Roger M.**AIDS (London, England). 2013 Apr 24; 27(7)1045-1057* - Acute HIV-1 Infection[The New England journal of medicine. 2011]
*Cohen MS, Shaw GM, McMichael AJ, Haynes BF.**The New England journal of medicine. 2011 May 19; 364(20)1943-1954* - Modeling the within-host dynamics of HIV infection[BMC Biology. ]
*Perelson AS, Ribeiro RM.**BMC Biology. 1196* - Should We Treat Acute HIV Infection?[Current HIV/AIDS reports. 2012]
*O’Brien M, Markowitz M.**Current HIV/AIDS reports. 2012 Jun; 9(2)101-110*

- MedGenMedGenRelated information in MedGen
- NucleotideNucleotidePublished Nucleotide sequences
- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- Modeling Sequence Evolution in Acute HIV-1 InfectionModeling Sequence Evolution in Acute HIV-1 InfectionNIHPA Author Manuscripts. Nov 21, 2009; 261(2)341PMC

Your browsing activity is empty.

Activity recording is turned off.

See more...