HIV Proviral Burden, Genetic Diversity, and Dynamics in Viremic Controllers Who Subsequently Initiated Suppressive Antiretroviral Therapy

ABSTRACT Curing HIV will require eliminating the reservoir of integrated, replication-competent proviruses that persist despite antiretroviral therapy (ART). Understanding the burden, genetic diversity, and longevity of persisting proviruses in diverse individuals with HIV is critical to this goal, but these characteristics remain understudied in some groups. Among them are viremic controllers—individuals who naturally suppress HIV to low levels but for whom therapy is nevertheless recommended. We reconstructed within-host HIV evolutionary histories from longitudinal single-genome amplified viral sequences in four viremic controllers who eventually initiated ART and used this information to characterize the age and diversity of proviruses persisting on therapy. We further leveraged these within-host proviral age distributions to estimate rates of proviral turnover prior to ART. This is an important yet understudied metric, since pre-ART proviral turnover dictates reservoir composition at ART initiation (and thereafter), which is when curative interventions, once developed, would be administered. Despite natural viremic control, all participants displayed significant within-host HIV evolution pretherapy, where overall on-ART proviral burden and diversity broadly reflected the extent of viral replication and diversity pre-ART. Consistent with recent studies of noncontrollers, the proviral pools of two participants were skewed toward sequences that integrated near ART initiation, suggesting dynamic proviral turnover during untreated infection. In contrast, proviruses recovered from the other two participants dated to time points that were more evenly spread throughout infection, suggesting slow or negligible proviral decay following deposition. HIV cure strategies will need to overcome within-host proviral diversity, even in individuals who naturally controlled HIV replication before therapy.

proviruses persisting during therapy that broadly reflected HIV's within-host evolutionary history, where the estimated half-lives of the persistent proviral pool during untreated infection ranged from ,1 year to negligible. Cure strategies will need to contend with proviral diversity and between-host heterogeneity, even in individuals who naturally control HIV.
KEYWORDS genetic diversity, proviral burden and dynamics, proviral half-life, viremic controllers, human immunodeficiency virus L ike all retroviruses, HIV integrates its genome into that of its host cell. Most infected cells die-or are eliminated by the immune system-within a day or two of infection, but a small number persist long-term, even during suppressive antiretroviral therapy (ART) (1)(2)(3). While most of these long-lived cells harbor HIV proviruses with large deletions or other genetic defects (4-6), a minority harbor replication-competent proviruses that could reactivate at any time. It is for this reason that ART needs to be maintained for life. Achieving ART-free HIV remission ("functional cure") or a complete cure (eradication) (7) will therefore require permanently inactivating (8,9) or eliminating (10,11) these HIV reservoirs, respectively. Characterizing the burden, genetic diversity, and longevity of the proviruses that persist within the host can advance us toward these goals.
HIV elite controllers-rare individuals who spontaneously control viremia to undetectable levels without ART-are being intensely studied as natural models of HIV remission (12). They may also represent a group in which a complete cure might be more easily achieved (13,14), in part because their HIV reservoirs are smaller and less genetically diverse (8,12,15). Proviral burden and diversity, however, are less well characterized in viremic controllers-individuals who naturally suppress HIV to low levels (normally defined as ,2,000 HIV RNA copies/ml in plasma) (16,17), but who are nevertheless recommended to initiate ART under current guidelines (18). In particular, this group could yield insights into persisting proviral composition in the context of natural yet incomplete HIV control.
More broadly, our understanding of within-host proviral dynamics (i.e., the rates of proviral deposition and subsequent decay) during untreated infection remains incomplete for HIV controllers in general. This is in part because longitudinal studies dating back to infection are rare in this group, and because it is challenging to isolate HIV from samples with low or undetectable viral loads. Understanding these dynamics is nevertheless important because proviral longevity during untreated infection dictates reservoir composition at ART initiation (and thereafter): if the persisting proviral pool turned over slowly pre-ART, then HIV sequences seeded into it during early infection would have a high likelihood of persisting for long periods, but if the pool turned over rapidly, its composition would shift toward recently circulating HIV sequences. As cure or remission interventions would only be administered during suppressive ART, it is critical to understand the dynamics that shape the proviral pool up to this point.
Our understanding of within-host proviral dynamics has recently been enriched by the development of phylogenetic approaches to analyze persisting proviral diversity in context of HIV's within-host evolution prior to ART (19)(20)(21)(22)(23). A recent study that recovered replication-competent HIV sequences during therapy revealed that the majority represented viral variants that circulated close to therapy initiation, though more ancestral sequences, some dating back to transmission, were also recovered (21). Studies employing similar approaches to "date" the overall pool of proviruses persisting on ART, which comprise both genetically intact and defective sequences, have confirmed that these often span the individual's whole pre-ART history, where individuals differ in the extent to which their proviral pool is skewed toward sequences that integrated around ART initiation (19,20,22,23). Because viremic controllers naturally limit viral replication to low levels, one could hypothesize that their persisting proviral pools may be smaller and less diverse than those of noncontrollers, which could potentially make them more responsive to remission or cure interventions, but on-ART proviral diversity has not been investigated in this group in the context of HIV's full within-host evolutionary history.
The age distributions of proviruses sampled shortly after ART can also be used to calculate within-host rates of proviral turnover during untreated infection (19,23): this is because, at the time of sampling, the majority of proviral turnover would have already occurred prior to therapy. Recent data from noncontrollers suggest that proviral half-lives during untreated infection average less than a year (19,24), although another study estimated an average of 25 months (23). Regardless, these estimates are far shorter than the published rates of proviral decay on ART, which are estimated as ;4 years for the replication-competent reservoir (25) and .10 years for the proviral pool at large (26). This has led to the hypothesis that ART dramatically slows the rate of proviral turnover in vivo, thereby stabilizing the proviral pool for long-term persistence (21,27,28). Pre-ART proviral dynamics, however, have not been investigated in HIV controllers. To address these knowledge gaps, we combined single-genome sequencing, proviral quantification, phylogenetic analysis, and mathematical modeling to characterize proviral burden, diversity, and dynamics in four viremic controllers from HIV infection to therapy-mediated suppression.

RESULTS
Participant characteristics and sampling. We studied four viremic controllers. Three broadly maintained plasma viral loads (pVL) of ,2,000 copies of HIV RNA/ml during untreated infection (participants 1, 2, and 3), while one maintained pVL of ,2,000 copies of HIV RNA/ml for 3 years before losing control (participant 4). Participants initiated ART a mean of 6.1 years (range, 3.8 to 8.4 years) following their estimated date of infection (EDI). An average of 13 pre-ART plasma samples per participant (range, 10 to 17 samples), which spanned a mean period of 4.5 years (range, 2.5 to 6.7 years) prior to therapy, were available for HIV RNA sequencing, where the first was sampled a mean of 20.5 months (range, 15 to 26 months) following the EDI ( Fig. 1; Table 1). In addition, peripheral blood mononuclear cells (PBMC) were available at two time points on suppressive ART (10 million cells per time point). Due to limited cell numbers, the first PBMC sample, taken an average of 18.7 months (range, 9 to 38 months) after ART, was FIG 1 Participant (p1 to p4) sampling timeline and reservoir quantification. The timeline is depicted as years since ART initiation. Shading represents ART. Inverted black triangles denote the clinically estimated date of infection. Colored circles denote pre-ART plasma samples from which HIV RNA sequences were isolated. Red diamonds denote cell samples on ART from which proviral sequences were isolated. Black diamonds denote proviral quantification dates using the intact proviral DNA assay (IPDA). IPDA results are shown as pie charts, where the pie size denotes the total proviral burden and the colored slices denote intact and defective HIV genome proportions (actual values shown in Table 1). used to quantify total and genomically intact proviral burden in CD4 1 T cells (29). The second, taken an average of 28.8 months (range, 16 to 54 months) after ART, was used for proviral sequencing and integration date inference (20). Quantifying total and genetically intact proviral burden. We quantified total and genomically intact proviral burden in CD4 1 T cells using the intact proviral DNA assay (IPDA). This is a duplexed droplet digital PCR (ddPCR) assay that targets two HIV regions, the packaging signal (W) near the 59 end of the viral genome and the Rev responsive element (RRE) within envelope (env), which together have high predictive power to distinguish genomically intact proviruses from those with large deletions and/or extensive hypermutation, which typically dominate in vivo (29). Proviruses yielding double (W and env) positive signals are inferred to be genetically intact, while those yielding only single positive signal are classified as defective.
We observed marked interindividual differences in total proviral burden. Participant 3 harbored the fewest proviruses (44 HIV DNA copies/10 6 CD4 1 T cells), while participant 4, the individual who eventually lost control, harbored the most (778 HIV DNA copies/10 6 CD4 1 T cells) ( Fig. 1; Table 1). The percentages of genomically intact proviruses also differed markedly, with participant 2 harboring the lowest (10% intact proviruses) and participant 4 the highest (94% intact). The latter is remarkable though not without precedent (30). Participant 4's large overall and intact proviral burden allowed us to isolate one near-full-length intact provirus from this time point (Fig. 1) for integration date inference.
Single-genome plasma HIV RNA and proviral sequencing. We characterized HIV RNA nef sequences from longitudinal pre-ART plasma and proviral sequences from the second on-ART PBMC sample, by single-genome amplification. HIV nef was sequenced because of its richness in phylogenetic signal despite its relatively short length (a shorter amplicon was critical as viral loads were low for most samples), because nef is representative of within-host HIV diversity and evolution relative to the rest of the viral genome (20), and because it is the gene most likely to be intact in proviruses persisting on ART (29,31). Despite low viral loads, we isolated 546 HIV RNA nef sequences from pre-ART plasma (range, 67 to 245 sequences/participant), where an average of 66.2% of these (range, 50.0 to 89.6%) were distinct within each participant (Table 1). We further isolated 267 intact proviral nef sequences during suppressive ART (range, 24 to 129 sequences/participant), where an average of 69.8% (range, 50.0 to 91.5%) were distinct. All participants had HIV subtype B, and their sequences were monophyletic, with no evidence of coinfection or superinfection (Fig. 2). There were multiple instances where we recovered a proviral sequence on ART that was identical to a pre-ART plasma sequence, consistent with the long-term persistence of HIV sequences in vivo.
Within-host HIV evolutionary reconstruction and proviral dating. We next characterized the diversity of proviruses persisting on ART and inferred their ages phylogenetically (20). For each participant, we inferred a maximum likelihood phylogeny relating their plasma HIV RNA and proviral sequences, where the root represented the inferred transmitted founder event. We then fit a linear model relating the root-to-tip distances of distinct pre-ART plasma HIV RNA sequences to their collection dates, where the slope of this line represents the within-host HIV nef evolutionary rate under a strict molecular clock assumption, and the x-intercept represents the phylogenetically estimated infection date. We observed statistically significant within-host HIV evolution in plasma in all participants pre-ART, as defined by increasing divergence from the root, where the 95% confidence interval (CI) of the phylogenetically estimated infection date captured the participant's clinically estimated one in all cases ( Table 2). We next describe each participant's results in detail.
Participant 1 (EDI, April 2006) maintained viremic control, except for two plasma viral load measurements of 8,075 (3.9 log 10 ) and 2,685 (3.4 log 10 ) copies/ml in February and October 2008, respectively, prior to initiating ART in August 2010 (Fig. 3A). Their proviral load on suppressive ART was 215 HIV copies/10 6 CD4 1 T cells, with 51% genetically intact ( Fig. 1; Table 1). We isolated 104 intact HIV RNA nef sequences from 8 pre-ART plasma time points spanning 2.4 years, of which 52 (50%) were distinct, and 48 proviral nef sequences sampled approximately 2 years post-ART, of which 40 (83%) were distinct ( Table 1). The recovery of identical proviral sequences is consistent with clonal expansion of CD4 1 T cells harboring integrated HIV (32-34), though we cannot definitively identify these as clonal since we only performed subgenomic HIV sequencing. Within-host phylogenetic analysis revealed increasing plasma HIV RNA divergence from the inferred root over time, along with nonsynonymous substitutions that typify within-host HIV evolution (Fig. 3B). Proviruses sampled on ART were interspersed throughout the phylogeny, except within the clade closest to the root, which comprised the earliest sampled plasma HIV RNA sequences. The linear model relating the root-to-tip distances of distinct pre-ART plasma HIV sequences to their collection dates yielded a pre-ART nef evolutionary rate of 3.22 Â 10 25 substitutions per nucleotide site per day ( Fig. 3C; Table 2). Based on this, the oldest sampled provirus was estimated to have integrated in April 2008, 4 years prior to sampling (Fig. 3D). Though the 95% CIs around the proviral date estimates are wide, these data nevertheless suggest that proviruses seeded throughout infection persisted during ART. Participant 2 (EDI, December 2005) maintained pVL of ,2,000 (3.3 log 10 ) copies/ml, except for four measurements slightly above this: one in June 2010 and three in mid/late 2013. The participant initiated ART in May 2014 (Fig. 4A). Their total proviral load on ART was 580 HIV copies/10 6 CD4 1 T cells, with only 10% intact, the smallest intact proportion of all participants ( Fig. 1; Table 1). We recovered 67 HIV RNA nef sequences from 9 pre-ART plasma time points spanning more than 4 years, of which 60 (90%) were distinct, and 66 intact proviral nef sequences after 1.2 years on ART, of which 36 (55%) were distinct. The sequences closest to the root of this participant's phylogeny were proviruses sampled on ART, consistent with these having integrated shortly after transmission (Fig. 4B). Proviruses also interspersed throughout almost all descendant clades. Based on this individual's calculated pre-ART nef evolutionary rate of 1.17 Â 10 25 substitutions per nucleotide site per day ( Fig. 4C; Table 2), proviruses persisting on ART were inferred to have integrated on dates that spanned the individual's entire pre-ART history (Fig. 4D). One particular proviral sequence, recovered 26 times, was estimated to have integrated 6 years prior to sampling.
Participant 3 (EDI, March 2008) continuously maintained pVL of ,1,000 copies/ml before initiating ART in January 2012. Thereafter, pVL remained suppressed, except during a treatment interruption when a pVL of 1,610 (3.2 log 10 ) copies/ml was recorded (November 2013), but unfortunately no plasma was available from that event from which to isolate HIV sequences. This participant had the smallest proviral burden of all (44 HIV copies/10 6 CD4 1 T cells), with an estimated 31% genetically intact ( Fig. 1; Table 1). We recovered 130 pre-ART plasma HIV nef sequences (55% distinct) from 7 pre-ART time points, where early plasma time points more frequently yielded identical HIV sequences, consistent with limited viral diversity following infection. We recovered only 24 proviral sequences (50% distinct) on ART, consistent with the small proviral burden. Despite maintenance of pVL of ,1,000 copies/ml pre-ART, the within-host phylogeny displayed significant molecular clock signal (1.16 Â 10 25 substitutions per nucleotide site per day), where similar to participant 2, proviruses sampled on ART interspersed throughout the tree and represented those closest to the root ( Fig. 5B and C). The earliest proviruses dated to around transmission, while the latest dated to the viremic episode that occurred during the treatment interruption; the narrower 95% confidence intervals around these point estimates allow us to say with more certainty that this individual's proviral pool spans a wide age range (Fig. 5D). One sequence, recovered 9 times on ART and representing 38% of recovered proviruses in this individual, dated to around ART initiation.
Participant 4 (EDI, March 2005) initially controlled pVL to ,1,000 log 10 copies/ml for 3 years, but afterwards lost control, where pVL reached a maximum of 65,500 (4.8 log 10 ) copies/ml before initiating ART in February 2013 (Fig. 6A). Two linear models were calibrated, with one each for the "control" and "post-control" periods, where the former was used to phylogenetically verify the clinically estimated infection date. We recovered 245 plasma HIV nef sequences from 12 pre-ART time points (71% distinct), likelihood within-host phylogeny and corresponding amino acid highlighter plot. In the phylogeny, colored circles denote distinct pre-ART plasma HIV RNA sequences and red diamonds indicate distinct proviral sequences sampled during suppressive ART. Sequences that were recovered repeatedly are shown as open gray circles (for plasma HIV RNA) and diamonds (for proviruses) adjacent to the relevant tip. The root represents the inferred most recent common ancestor of the data set, representing the phylogenetically inferred transmitted founder virus event. The highlighter plot is ordered according to the phylogeny and depicts amino acid sequences. The top sequence serves as the reference, where colored ticks in sequences beneath it denote nonsynonymous substitutions with respect to the reference. (C) HIV sequence divergence-versus-time plot. The blue dashed line represents the linear model relating the root-to-tip distances of distinct pre-ART plasma HIV RNA sequences (colored circles) to their sampling times. This model is then used to convert the root-to-tip distances of distinct proviral sequences sampled during ART (red diamonds) to their original integration dates. The slope of the regression line, which represents the inferred within-host evolutionary rate (ER) in estimated substitutions per nucleotide site per day, is shown at the bottom right. Faint gray lines denote the ancestral relationships between HIV sequences. (D) Integration date point estimates (and 95% confidence intervals) for distinct proviral sequences recovered from this participant.

®
where, like participant 3, identical sequences were most frequently recovered from early time points (Fig. 6B). We also recovered 129 proviral sequences during ART, a remarkable 92% of which were unique, indicating substantial proviral diversity in this individual. The linear model inferred from the viremic control period yielded strong molecular clock signal despite low viremia during this time, but no proviral sequences were recovered that interspersed with these very early sequences ( Fig. 6B and C). All recovered proviruses were therefore "dated" using the linear model representing the post-control period, yielding integration dates spanning this whole period (Fig. 6C). Of note, the near-full-length genetically intact provirus isolated during ART dated to 2011.
Correlates of proviral burden and diversity during ART. Early ART limits HIV reservoir size (35)(36)(37), and positive correlations have been observed between the duration of uncontrolled viremia and both HIV reservoir size and diversity in noncontrollers (35,(37)(38)(39)(40). We therefore investigated the relationship between proviral burden, HIV diversity, and cumulative viral load in our cohort. We observed a strong correlation between overall HIV genetic diversity in plasma pre-ART and proviral diversity on ART (Spearman's r = 1; P = 0.08) (Fig. 7A). Furthermore, cumulative viral load, measured as log 10 viremia copy days during untreated infection, correlated strongly with both total proviral burden as measured by the IPDA and proviral diversity on ART (both Spearman's r = 1; P = 0.08) ( Fig. 7B and C).
Inferring proviral turnover pre-ART. (i) Mathematical modeling approach. The decay rates (half-lives) of the proviral pool during untreated infection can be inferred from the age distributions of proviruses sampled on ART; this is because at the time of proviral sampling, the bulk of the proviral turnover had already occurred prior to therapy (19,23,24). We estimated pre-ART proviral half-lives in two ways: by adapting an existing mathematical model and by estimating these rates directly from the data.
To do the former, we modified a dynamical mathematical model of HIV infection that describes within-host cell and virus concentrations over time within active and latent compartments (see Materials and Methods and reference 23). The model assumes that HIV sequences enter the reservoir at a rate proportional to their abundance in plasma, allowing us to model within-host proviral deposition in a personalized way. As our viral   Table S1 in the supplemental material) to model their proviral deposition. In a subsequent step, we then allowed these proviruses to decay at various constant, exponential rates up to the participant's sampling date, yielding predictions of what the proviral age distribution would be, at time of sampling, for each decay rate tested. . (E to H) Phylogenetically determined proviral ages and predicted proviral age distributions under different rates of proviral decay for participants 1 to 4. Blue histograms denote the proportion of each participant's distinct proviruses that dated to each year prior to ART initiation, data that are derived from the integration date point estimates in Fig. 3D, 4D, 5D, and 6D, respectively. The yellow line indicates each participant's model-predicted proviral deposition. The gray and dashed black lines predict what the proviral age distributions would be at the time of sampling had proviral decay subsequently occurred under half-lives of 140 or 44 months following deposition (these half-lives represent published on-ART estimates of total DNA [26] and replication-competent reservoir [25] decay, respectively). The green line represents the model-predicted proviral age distributions under the pre-ART proviral decay rate that best fit each participant's observed data. (I to L) Best-fit half-lives estimated directly from each participant's observed proviral age distributions using a Poisson generalized linear model (red line) along with 95% confidence intervals (dotted lines).
The participants' reconstructed viral load dynamics are shown in Fig. 8A to D, while the model-predicted proviral compositions under different decay rates, depicted as the proportion of remaining proviruses that date to each year prior to ART (or in the case of participant 3, to the viremic episode that occurred during treatment interruption), are shown in Fig. 8E to H. For participants 1 to 3, the model predicts that the bulk of the reservoir would have been deposited in the first year of infection ( Fig. 8E to G, yellow lines); this is because the estimated acute-phase peak viral load is nearly 2 log 10 higher than the subsequent setpoint. For participant 4, the model predicts that the bulk of the reservoir would have been deposited during the post-control period, because viral load was sustained at high levels during that time (Fig. 8H, yellow line).
If we allow these deposited proviruses to decay at half-lives of 140 months (26) and 44 months (25), where these represent published estimates of DNA and replicationcompetent reservoir decay on ART, respectively, the model predicts that, at the time of sampling, participant proviral distributions would be skewed toward slightly younger ages (gray and black dotted lines). These predicted proviral distributions fit the participants' observed data (shown as the blue bars) poorly, and participants 1 and 3 particularly so. As a last step, we identified the pre-ART proviral half-life that best fit each participant's proviral distribution (green line). Participant 1's was the shortest of all, only 0.41 years (95% CI, 0 to 1.03) (Fig. 8E). Estimated proviral half-lives for participants 3 and 4 were intermediate, while participant 2's was nearly 9 years, with an upper 95% CI extending to 77 years.
While this model clearly illustrates how faster decay rates shift proviral distributions toward younger ages, it assumes that within-host cell and viral dynamics behave as parameterized by the model. The model-produced best-fit half-lives were also somewhat sensitive to imputed acute-phase viremia dynamics, with higher, longer peaks yielding the shortest half-lives (Table S1). The half-life estimates for participant 3, who controlled viremia the most successfully, were the most sensitive to these imputations: using a low, short peak for example yielded a half-life estimate of 3.04 years (95% CI, 0 to 18.08), nearly three times longer than that estimated in the primary analysis.
(ii) Direct approach. Recognizing the limitations of the dynamical model and our lack of capture of the participants' actual acute-phase dynamics, we also estimated pre-ART proviral half-lives directly from observed proviral age distributions using a Poisson generalized linear model, an approach that does not incorporate clinical history information. The 95% CI around the half-life estimates produced by this method overlapped those produced by the dynamical model in all cases (Fig. 8I to L). The point estimates for participants 1 and 2 were also very consistent, with the former again exhibiting the shortest proviral half-life of 0.63 years (95% CI, 0.46 to 0.97) and the latter exhibiting the longest of 23.74 years (95% CI, 4.44 to 1). Participant 4's estimated pre-ART proviral half-life was 0.78 years (95% CI, 0.65 to 0.96), slightly shorter than that estimated with the dynamical model, while participant 3's was 4.79 years (95% CI, 1.44 to 1), which was comparable to the dynamical model-derived estimate when using a low, short inferred acute-phase peak viral load (Table S1). Broadly, however, the proviral pools that were skewed toward ART initiation produced short (,1 year) half-life estimates (participants 1 and 4), while those that dated more uniformly throughout infection produced substantially longer half-lives (participants 2 and 3).
Sensitivity analyses with alternate within-host phylogenies. We now present sensitivity analyses that test the robustness of our findings. In our primary analysis, we estimated proviral ages using a phylogeny inferred using the best-fit nucleotide substitution model. Each data set however yielded models with similar predictive power (see Table S2 in the supplemental material), so we recomputed proviral ages and pre-ART half-lives (using the Poisson generalized linear model) from a phylogeny inferred using each participant's "next-best-fit" model. Results were broadly consistent with the primary findings: participants 1 and 4 again showed proviral distributions that were skewed to dates near ART, yielding pre-ART half-lives of ,1 year (see Fig. S1 in the supplemental material). Participant 2's alternative phylogeny dated a slightly larger proportion of proviruses to around ART initiation, yielding a shorter half-life than that estimated from the primary analysis. Nevertheless, both participant 2's and 3's proviral distributions maintained an overall "flat" age distribution, yielding the longest estimated half-lives in the cohort. This suggests that our findings are not unduly influenced by the phylogenetic inference step.
Inferring proviral age distributions and pre-ART half-lives without a molecular clock. In our primary analysis, we estimated proviral ages from root-to-tip phylogenetic distances by linear regression (20,22). This assumes a strict molecular clock, which may not hold over long time periods (44), so we reanalyzed our data using two approaches that do not rely on this assumption. The first was "nearest-neighbor" dating (21), where each provirus is assigned to the sampling date of the pre-ART plasma sequence closest to it in the tree (for consistency, the original tree was used). A limitation of this approach is that proviruses can only be assigned to dates when plasma HIV sequences were sampled. Nevertheless, the results were broadly comparable to the original ones (see Fig. S2, left side, in the supplemental material): participants 2 and 3 showed "flat" proviral age distributions and long estimated proviral half-lives, while participants 1 and 4 showed skewed proviral age distributions and proviral half-lives under 1 year.
We also used least-squares dating (LSD) (45), an approach that aims to minimize the variance between the tree branch lengths and sample dates, that has also recently been used to infer proviral sequence ages (46). We used the original rooted tree; here it may be useful to reiterate that the root represents the location that maximizes the (Spearman's) correlation between plasma HIV root-to-tip distances and sampling times and thus does not rely on a strict clock. LSD also incorporates tree topology information (as opposed to just root-to-tip distances) and allows variable evolutionary rates over the tree's edges. Overall, the LSD-inferred 95% CI were generally narrower and the within-host evolutionary rates slightly slower than those inferred using linear regression, but the estimated proviral ages and half-lives were again comparable to the original observations (Fig. S2, right side). Taken together, these observations indicate that our findings are robust to a strict clock assumption.
Accounting for identical sequences in proviral half-life estimates. When estimating rates of pre-ART proviral turnover, we excluded identical proviral sequences under the assumption that these arose through clonal expansion rather than through independent integration events. Subgenomic HIV sequencing, however, cannot conclusively classify identical sequences as clonal across the whole HIV genome (47), so we reanalyzed our data using all sequences collected (Table 1). Results were again highly consistent with the original data (see Fig. S3 in the supplemental material). Participant 1's estimated proviral half-life when including 8 identical sequences was 0.61 years (95% CI, 0.46 to 0.9), which was very similar to the original estimate of 0.63 years (95% CI, 0.46 to 0.97). Similarly, participant 2's estimated half-life after including 30 identical sequences, the majority of which dated to a lineage that circulated 6 years prior to ART, was zero (i.e., no decay), which is comparable to the original estimate of 23.74 years (95% CI, 4.44 to 1). Our findings are therefore robust to the inclusion or exclusion of identical sequences.

DISCUSSION
Though modest in size, our longitudinal cohort of ART-treated viremic controllers allowed us to characterize proviral burden, diversity, and dynamics in this understudied group. Our results indicate that the persisting proviral pools in these individuals can share considerable similarities in terms of size, diversity, and turnover with those of noncontrollers on ART.
Within-host HIV evolutionary studies have revealed that most infections are initiated by a single transmitted founder virus that gives rise to increasingly diverse and divergent descendants (44,(48)(49)(50)(51)(52)(53). The dynamics of within-host HIV evolution in controllers, however, remain somewhat unclear: while some studies performed in shorter time frames reported increases in plasma HIV diversity and/or immune escape over time in elite (54)(55)(56) and viremic (57) controllers, others found limited or no evidence of evolution (12,15,58). Our first notable observation was that significant within-host HIV evolution occurred in all participants, as evidenced by ongoing divergence from a phylogenetically inferred root, whose date was consistent with the clinically estimated infection date in all cases. Also indicative of within-host evolution, nonsynonymous substitutions consistent with escape from human leukocyte antigen (HLA) class I-restricted cellular immune responses occurred even in participant 3, for whom pre-ART plasma viral loads never reached above 1,000 HIV RNA copies/ml. In this participant, who expresses HLA-A*03:01 and HLA-B*57:01, 25/25 sequences sampled in 2009 harbored the predicted HLA-B*57:01-restricted epitope AGNNAACAW at Nef codons 49 to 57 (59). By 2011, 8/27 sequences harbored AENNAACAW (the changed residue is underlined), which has a predicted .10-foldreduced binding affinity to B*57:01, a shift in frequency that was statistically significant (Fisher's exact test, P = 0.0044). Similarly, 25/25 sequences sampled in 2009 harbored the predicted A*03:01-restricted epitope KLVPVPEK at Nef codons 144 to 152, but by late 2011, 6/16 sequences harbored KLVPVPEE, which has a predicted .160-fold-reduced binding affinity to A*03:01 (P = 0.0069). Participants' estimated within-host nef evolutionary rates, which ranged from 1.16 Â 10 25 to 5.35 Â 10 25 substitutions per nucleotide site per day, assuming a strict clock, were also comparable to those reported in noncontrollers using similar methods (20). Therefore, despite natural viremia control, significant withinhost HIV evolution nevertheless occurs in this group in the absence of therapy.
Our results also revealed that total and genomically intact proviral DNA burdens in ART-treated viremic controllers can be substantial. While few studies have estimated reservoir size in controllers using the IPDA, a recent study reported 20-fold lower frequencies of total and intact proviral DNA in elite controllers (medians of 30 and 1.6 copies/10 6 CD4 1 T cells, respectively) compared to noncontrollers on ART (medians of 603 and 37 copies/10 6 CD4 1 T cells, respectively) (60). The median total and intact proviral burdens in the present study were 398 and 84 copies/10 6 CD4 1 T cells, respectively (or 215 and 58 copies/10 6 CD4 1 T cells, respectively, if we exclude participant 4, who lost viremic control), values that were more in line with those in noncontrollers (30,60,61). Proviral loads in our study, however, were sampled only ;1.5 years after ART on average. Longitudinal studies of treated viremic controllers on ART will therefore be required to confirm whether, similar to noncontrollers, intact proviruses gradually decay on ART while total proviral DNA remains stable (30,61,62).
Our results also yield insights into on-ART proviral diversity and the length of time these sequences had persisted within the host. Consistent with a prior study that included viremic controllers (63), proviruses recovered in the present study were genetically diverse and frequently included archival lineages. Similar to studies of noncontrollers (19-23), on-ART proviral age distributions varied: whereas participants 1's and 4's proviral pools were skewed toward sequences that integrated in the years prior to ART, participants 2's and 3's proviral pools spanned the infection course more evenly and included proviruses that dated to shortly following infection. Despite these differences, the overall diversity of proviruses persisting on ART reflected the overall plasma HIV diversity generated prior to ART (Fig. 7A). Moreover, the overall level of viral replication pre-ART correlated positively with both proviral burden and diversity on ART ( Fig. 7B and C). Our recovery of various numbers of identical sequences also suggests that HIV controllers, like noncontrollers, exhibit differential levels of clonal expansion (31)(32)(33)(34)64). Half of participant 3's proviruses, for example, were identical to at least one other sequence, consistent with a report describing high identical proviral burdens in viremic controllers (63), while 83% of participant 1's proviruses were distinct, a phenomenon that has also been reported in viremic controllers (57). Despite this heterogeneity, our observations confirm that diverse proviruses persist in those who control HIV prior to therapy, underscoring the importance of early ART even in this group.
Our study also extends our understanding of proviral turnover during untreated infection (19,21,23). While the half-lives of the replication-competent and overall proviral pools on ART are on the order of ;4 and .10 years, respectively (25,26,30), recent data suggest that pre-ART proviral half-lives are shorter, with studies returning estimates of 8 months (19, 24) and 25 months (23). Dynamic turnover during untreated infection helps explain why proviral pools are often enriched in sequences that integrated in the years immediately prior to ART (19,21,23) and has led to the hypothesis that ART dramatically slows this rate of turnover, thereby "stabilizing" the proviral pool in its immediate pretherapy state (21,27). Indeed, participants 1's and 4's skewed proviral distributions and short estimated pre-ART proviral half-lives resemble those of "typical" noncontrollers (19,23,24), though participants 2's and 3's flatter proviral age distributions and markedly longer estimated decay rates emphasize that proviral clearance is not rapid in all persons. While longer pre-ART half-lives have been observed in noncontrollers (20), a recent study by our group identified an inverse relationship between set point pVL and rate of pre-ART proviral turnover (24). Taken together with our observation that 2/4 participants exhibited relatively long pre-ART half-lives, this suggests that pre-ART proviral turnover may be slower in HIV controllers on average, though the data clearly underscore the heterogeneous nature of each individual's persistent HIV pool.
Our study has some limitations. Our cohort is small due to the challenges of recruiting viremic controllers with known infection dates for longitudinal studies, and all participants were male. Though some aspects of the HIV reservoir likely differ between the sexes (65, 66), a recent study of noncontrollers by our group revealed no significant differences in estimated pre-ART proviral half-lives between men and women, where intriguingly the participant with the "flattest" proviral distribution and longest pre-ART half-life was a female who nearly met the criteria for viremic control (24). This suggests that female controllers may not differ markedly from males in terms of proviral composition and pre-ART turnover, but additional studies are needed to confirm this. As samples were limited and viral loads were low, near-full-genome HIV amplification was not feasible, nor was proviral integration site characterization, so we were not able to investigate genomic location and associated heterochromatin features recently characterized in elite controllers (8) in our cohort. Though nef is appropriate for within-host HIV evolutionary studies (20,64,67), sequencing of a subgenomic region does not allow us to classify recovered sequences as intact or defective or definitively classify identical sequences as clonally expanded. Our use of the IPDA to quantify intact (versus defective) proviral burden mitigates this only partially. Our methods also assume that every provirus has an equal probability of being sampled, regardless of whether it is unique or a member of a clonally expanded population (where we crudely estimate this probability to be ,0.2%, as the blood collected represented ;0.2% of average total body blood volume, but subsequent DNA extraction and PCR efficiency is ,100%). Sample availability also prevented us from investigating the mechanisms underlying participant 4's loss of viremia control, though our observations that this individual's (largely genetically intact) proviral pool dated exclusively to their post-control period suggests that this event triggered major increases in reservoir size and diversity, further underscoring the importance of timely ART, even in viremic controllers. Finally, the peak viremia kinetics that we drew from the literature (12,41,55) to use in the dynamical model may not have reflected actual in vivo kinetics, leading to uncertainty in our model-predicted proviral deposition and turnover rate estimates.
In conclusion, despite their natural ability to control HIV to low levels, significant within-host HIV evolution occurred in all participants pre-ART, giving rise to withinhost proviral pools whose size and genetic diversity reflected this evolution. Pre-ART proviral dynamics were also heterogeneous: though two participants' proviral pools were skewed toward recent integration dates, consistent with rapid pre-ART turnover (19,23,24). Those of two others spanned a wide age range, consistent with much slower pre-ART proviral turnover. HIV remission and cure strategies will need to overcome within-host proviral diversity, even in individuals who naturally control HIV prior to therapy.
Single-genome HIV RNA and proviral amplification. Phylogenetic inference of proviral ages requires the longitudinal isolation of HIV RNA sequences pre-ART, alongside proviral DNA sequences sampled on ART, ideally using single-genome approaches. Because plasma viral loads were low and biological material was limited (only 10 million PBMC per time point), full-length HIV amplification was not feasible, so nef was amplified for the reasons described in the Results section. Total nucleic acids were extracted from plasma using the bioMérieux NucliSENS EasyMag system (bioMérieux, Marcy-l' Etoile, France), while genomic DNA was extracted from PBMC using the Purelink genomic DNA kit (Invitrogen). For pre-ART plasma samples destined for HIV RNA nef amplification, 1 ml of sample was extracted, eluted in 60 ml, and subjected to DNase I digestion (New England Biolabs, Ltd., catalog number M0303S) to minimize the risk of amplifying HIV DNA in these low-pVL samples. For on-ART PBMC samples destined for proviral nef amplification, cells were split into aliquots of 5 million cells before extraction and elution into 60 ml.
HIV nef was amplified by limiting-dilution nested reverse transcription (RT)-PCR (for HIV RNA) or nested PCR (for proviral DNA) using high-fidelity enzymes and sequence-specific primers optimized for amplification of multiple HIV group M subtypes (64,67). For HIV RNA extracts, cDNA was generated using NxtScript reverse transcriptase (Roche). Next, cDNA as well as genomic DNA extracts were endpoint diluted such that ;25 to 30% of the resulting nested PCRs, performed using the Expand high-fidelity PCR system (Roche), would yield an amplicon. The primers (59!39) used for cDNA generation/1st round PCR were Nef8683F_pan (forward; TAGCAGTAGCTGRGKGRACAGATAG) and Nef9536R_pan (reverse; TACAGGCAAAAAGCAGCTGCTTATATGYAG). The primers used for 2nd round PCR were Nef8746F_pan (forward; TCCACATACCTASAAGAATMAGACARG) and Nef9474R_pan (reverse; CAGGCCACRCCTCCCTGGAAASKCCC). Negative amplification controls were included in every run, and we also confirmed that plasma HIV RNA amplification did not occur in the absence of reverse transcription, indicating that DNase treatment was effective.
Amplicons were sequenced on an ABI 3730xl automated DNA analyzer. Chromatograms were base called using Sequencher v5.0 (Gene Codes) or the custom software RECall (68). nef sequences that contained nucleotide mixtures, hypermutations (identified using Hypermut 2.0 [69]), suspected within-host recombinant sequences (identified using rdp4 Beta 95 [70]), or other defects were excluded from phylogenetic analysis, as were duplicate sequences (instead, the latter were added back at the data visualization stage). Within-host plasma and proviral nef sequences were codon aligned using MAFFT v7 (71) implemented in HIV Align (https://www.hiv.lanl.gov/content/sequence/VIRALIGN/viralign.html) and manually edited in AliView v1.18 (72). Maximum likelihood phylogenies were inferred from aligned, gapstripped within-host sequence data sets using W-IQ-TREE (73) following automated model selection with ModelFinder (74) using an Akaike information criterion (AIC) selection criterion (75). Best-fit models are reported in Table S2. Our primary analysis used a phylogeny inferred using each participant's top best-fit model, but a sensitivity analysis was performed using each participant's next-best-fit model (Fig. S1). Patristic (tip-to-tip phylogenetic) distances were extracted using the cophenetic.phylo function from the R package ape (v5.3) (76).
HIV proviral full-genome amplification and sequencing. Single-template, near-full-length proviral amplification was performed on DNA extracted from CD4 1 T cells by nested PCR using Platinum Taq highfidelity DNA polymerase (Invitrogen) such that ;25% of the resulting PCRs yielded an amplicon. The 1st round primers were AAATCTCTAGCAGTGGCGCCCGAACAG (forward) and TGAGGGATCTCTAGTTACCAGAGTC (reverse). The 2nd round primers were GCGCCCGAACAGGGACYTGAAARCGAAAG (forward) and GCACTCAA GGCAAGCTTTATTGAGGCTTA (reverse). Reactions were cycled as follows: 92°C for 2 min, followed by 10 cycles of 92°C for 10 s, 60°C for 30 s, and 68°C for 10 min, followed by 20 cycles of 92°C for 10 s, 55°C for 30 s, and 68°C for 10 min, and then 68°C for 10 min (32,77). Amplicons were sequenced using Illumina MiSeq technology and de novo assembled using the custom software MiCall (https://github.com/cfe-lab/MiCall), which features an in-house modification of the Iterative Virus Assembler (IVA) (78).
Within-host phylogenetic inference and proviral age reconstruction. We used a published withinhost phylogenetic approach to infer proviral sequence ages (20). Briefly, for each participant, we inferred a maximum likelihood phylogeny relating within-host longitudinal pre-ART plasma HIV RNA and proviral sequences sampled on ART. We then exhaustively rerooted each tree to identify the root location that maximizes the (Spearman's) correlation between the root-to-tip distances and collection dates of the pre-ART plasma HIV RNA sequences. Here, the root represents the inferred most recent common ancestor (MRCA) of the within-host data set (i.e., the inferred transmitted founder virus). We then fit a linear model relating the collection dates of the plasma HIV RNA sequences to their divergence from the root, where the slope represents the average host-specific rate of HIV evolution prior to ART and the x-intercept represents the phylogenetically estimated MRCA date. For participant 4, two regression lines were fit: one each for their "viremic control" and "post-control" eras.
The integration dates of the proviral sequences sampled during ART, along with their 95% confidence intervals (CI), were then estimated from their divergence from the root using the linear regression. We assessed molecular clock and model fit using a delta Akaike information criterion (DAIC) (75), computed as the difference between the AIC of the null model (no evidence of within-host evolution-i.e., a zero slope) and the AIC of the linear regression. A within-host phylogeny was deemed to have a sufficient molecular clock signal if the linear regression produced a DAIC of $10, (which corresponds to a P value of 0.00053 when using a log-likelihood ratio test), and the 95% confidence interval of the estimated root date contained or preceded the first sampling point. The proviral dating framework is available as a web service at https://bblab-hivresearchtools.ca/django/tools/phylodating, with open source code available at https://www.github.com/cfe-lab/phylodating. As described in the Results, sensitivity analyses were also performed where proviruses were phylogenetically "dated" using two published methods that do not rely on a strict molecular clock assumption (Fig. S2).
Intact proviral DNA assay. Each participant's first PBMC sample on ART was used to estimate total and intact proviral DNA burdens using the intact proviral DNA assay (IPDA) (29). As this method reports proviral burdens in terms of HIV copies per million CD4 1 T cells, we first isolated CD4 1 T cells from 10 million PBMC by negative selection using the EasySep Human The published HIV primers/probes, however, can sometimes fail to detect autologous HIV sequences due to naturally occurring polymorphism (79,80), thus requiring autologous primers/probes. For this reason, targeted HIV sequencing of the IPDA W and env regions was performed for all participants prior to IPDA measurement, and autologous primers/probes were substituted where necessary. The substituted primer and probe sequences (59!39) were as follows: for participant 2, W probe, FAM-TTTCAGCGTACTCACCAGT; for participant 3, W probe, FAM-ATATGGCGTACTCACCGGT; for participant 1, W probes, FAM-AATTGGCGTACTCACCAGT and FAM-AATTGGCGTACTCACCAGC; for participant 1, env forward primer, GGTGGTGCAGAGAGAAAAAAGAGC, and env reverse primer, GCTGACGGCACAGGCCAGGC; and for participant 3, env reverse primer, GCTGACGGTACAGGCCAGAT. Droplets were prepared using the Automated Droplet Generator and cycled as previously described (29). Droplets were analyzed on a QX200 Droplet Reader (Bio-Rad) using QuantaSoft software (Bio-Rad, version 1.7.4), where the results of a minimum of 8 technical replicates, comprising a median of 589,820 cells (IQR, 501,539 to 721,871 cells) assayed in total, were merged prior to analysis. Intact HIV-1 copies (W and env double-positive droplets) were corrected for DNA shearing based on the frequency of RPP30 and RPP30-shear double-positive droplets. The median DNA shearing index, which measures the proportion of sheared DNA in a sample, was 0.40 (IQR, 0.39 to 0.43), which is in line with acceptable levels in this assay (29,79).
Inference of pre-ART proviral half-lives. (i) Dynamical mathematical model. In order to infer pre-ART proviral half-lives, each participant's sampled proviruses were grouped into "bins" by their estimated year of integration (one "bin" per year preceding ART initiation). For simplicity, proviruses whose integration date point estimate fell after the ART initiation date (or in the case of participant 3, the date of the last viremic episode), were assigned to the ART initiation (or viremic episode) date (all proviruses in this category had 95% confidence intervals that overlapped this date). In the primary analysis, proviral half-life inference was only performed on distinct proviral sequences under the assumption that "replicate" sequences arose through clonal expansion and not through individual integration events, but a sensitivity analysis was performed, including all proviral sequences (Fig. S3).
We estimated host-specific pre-ART proviral half-lives two ways: using a dynamical mathematical model (23) and directly from the data. To do the former, we implemented a published mathematical model that describes each individual's untreated HIV infection via a set of ordinary differential equations that model the dynamics of the concentrations of cells and virus in each individual over time (23). The model includes susceptible target cells, actively and latently infected cells that can produce viable virus, actively and latently infected cells that cannot produce viable virus, the virus itself, and an immune response. The model assumes that HIV sequences enter the latent proviral pool at a rate proportional to their abundance in plasma at the time. In a subsequent independent step, proviruses are then allowed to "decay" out of this pool at various exponential rates, to produce predictions of what the proviral age distribution would be at ART initiation if decay had occurred at this rate. We reproduce the model here.
Let S represent the susceptible compartment. Let A P and L P represent actively and latently infected cells that are productively infected and produce viable virus; likewise let A U and L U represent actively and latently infected cells that are unproductively infected and cannot produce viable virus. Let V represent viremia, and let E represent the adaptive immune response. The following dynamical system describes within-host HIV infection kinetics: The following parameters were used: susceptible cell creation rate, a S ¼ 70 cells ml 21 day 21 ; susceptible cell death rate, d S = 0.2 day 21 ; viral infectivity, b = 10 24 ml viral RNA copies -1 day 21 ; probability of productively infectious virions, t ¼ 0:05; probability of latency, l ¼ 10 24 ; actively infected cell death rate, d I = 0.8 day 21 ; viral burst size, p = 50,000 viral RNA copies cell -1 day 21 ; viral clearance rate, g ¼ 23 day 21 ; initial adaptive precursor frequency, a E = 10 24 cells ml 21 day 21 ; adaptive immune killing rate, k = 0.3 ml cells 21 day 21 ; adaptive immune recruitment rate, v = 1.6 day 21 ; adaptive immune clearance rate, d E = 0.002 day 21 ; and adaptive immune 50% saturation constant (i.e., the number of infected cells required for a half-maximal cytolytic expansion rate [81]), E 50 = 250 cells ml 21 .
The following initial conditions were used as per the original publications (23,81). The initial viral load value was V(0) = 0.03 viral RNA copies per ml, which is the detectable plasma viral load limit of a typical assay when converted from the conventional 30 HIV RNA copies/ml to viral RNA copies per ml. Also let I 0 = V(0)g/p = 1.38 Â 10 25 ; this represents a quasistatic approximation for the infected cells.
Note that Sð0Þ and Eð0Þ are the equilibrium values for the system in the absence of virus. As described above, reservoir creation is assumed to be proportional to plasma viral load over time, with the probability of a virus entering the latent pool (defined by l above) remaining constant over time. Unfortunately, peak viremia was not captured in our participants' measured viral load data. Therefore, to recreate each participant's in vivo plasma viral dynamics as realistically as possible, we merged each participant's available pVL data to acute-phase dynamics observed in viremic controllers (41), and we also performed sensitivity analyses using peak viremia kinetics observed in elite controllers (42,43) (Table S1). Using each participant's reconstructed viral load history and the above equations, we modeled proviral deposition into their latent pool. We then grouped the resulting proviruses by their year of creation.
In a separate step, we then allowed each group of latent proviruses to decay exponentially, up until each participant's proviral sampling date, under half-lives ranging from 30 to 6,000 days in increments of 30 days. This produced a series of 200 predicted proviral distributions per participant that represented what proportion of proviruses would still remain from each creation year, assuming decay at the stated rate. For context, we also applied decay rates of 44 and 140 months, which represent the half-lives of the replication-competent reservoir (25) and total proviral pool (26) during suppressive ART, respectively. Note that the model assumes that, after ART, the overall size of the proviral pool decreases but the relative proportions of sequences in each group remain the same; as such, model predictions would be the same regardless of when the proviral pool is sampled on ART. Finally, we identified the best-fit proviral decay rate to each participant's observed proviral age distribution by maximum likelihood and used standard theory to identify 95% confidence bounds.
(ii) Poisson linear model. To estimate the reservoir half-life directly from the data, we again grouped each participant's observed proviral sequences by their age, in years, relative to ART initiation (or for participant 3, the date of their last viremic episode). We will refer to the bin containing proviruses from t21 to t years old as "bin t," where we make the simplifying assumptions that all proviruses in bin t are exactly t years old. We then applied a Poisson generalized linear model with the canonical natural logarithm link function to the binned counts, using the age of the bin t as the predictor. This choice can be justified as follows: Let t 1=2 be the proviral half-life. Assuming the participant's proviral reservoir decays at an exponential rate, we would expect that the size of bin t would be approximately bC exp ð2u tÞc, where C represents the initial size of the age bin as if there was no decay, and u = ln(2)/t 1/2 . Now, we assume that every provirus has the same small independent probability p of being sampled. If so, we would then expect the number of observed proviruses from bin t to be binomially distributed with bC expð2u tÞc trials and success probability p. Assuming that p is small, we can approximate the distribution with a Poisson distribution with parameter where D ¼ ln Cp ð Þ, or in other words, which is precisely the setup required for a Poisson generalized linear model with natural logarithm link function.
Statistical analyses. Spearman's correlation was used to assess the relationship between continuous parameters. Statistical analysis was performed in Prism (version 9) software.
Data availability. GenBank accession numbers for sequences reported in this study are MW781931 to MW782197 (proviral DNA) and MW782198 to MW782743 (plasma HIV RNA).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
We gratefully thank the study participants, without whom this research would not be possible. We also thank Daniel Reeves, the creator of the dynamical mathematical model, for helpful correspondence. We are also grateful to Daniel Worrall for assistance with clinical data access. This