# Geno2pheno: estimating phenotypic drug resistance from HIV-1 genotypes

Niko Beerenwinkel^{*}, Martin Däumer^{1}, Mark Oette^{2}, Klaus Korn^{3}, Daniel Hoffmann^{4}, Rolf Kaiser^{1}, Thomas Lengauer, Joachim Selbig^{5} and Hauke Walter^{3}

^{1}Institute of Virology, University of Cologne, Fürst-Pückler-Str. 56, D-50935 Köln, Germany

^{2}Clinic for Gastroenterology, Hepatology and Infectious Diseases, University of Düsseldorf, Moorenstr. 5, D-40225 Düsseldorf, Germany

^{3}Institute of Clinical and Molecular Virology, German National Reference Center for Retroviruses, University of Erlangen-Nürnberg, Schlossgarten 4, D-91054 Erlangen, Germany

^{4}Center of Advanced European Studies and Research, Friedensplatz 16, D-53111 Bonn, Germany

^{5}Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Golm, Germany

^{*}To whom correspondence should be addressed. Tel: +49 6819325304; Fax: +49 6819325399; Email: beerenwinkel@mpi-sb.mpg.de

## Abstract

Therapeutic success of anti-HIV therapies is limited by the development of drug resistant viruses. These genetic variants display complex mutational patterns in their pol gene, which codes for protease and reverse transcriptase, the molecular targets of current antiretroviral therapy. Genotypic resistance testing depends on the ability to interpret such sequence data, whereas phenotypic resistance testing directly measures relative *in vitro* susceptibility to a drug. From a set of 650 matched genotype–phenotype pairs we construct regression models for the prediction of phenotypic drug resistance from genotypes. Since the range of resistance factors varies considerably between different drugs, two scoring functions are derived from different sets of predicted phenotypes. Firstly, we compare predicted values to those of samples derived from 178 treatment-naive patients and report the relative deviance. Secondly, estimation of the probability density of 2000 predicted phenotypes gives rise to an intrinsic definition of a susceptible and a resistant subpopulation. Thus, for a predicted phenotype, we calculate the probability of membership in the resistant subpopulation. Both scores provide standardized measures of resistance that can be calculated from the genotype and are comparable between drugs. The geno2pheno system makes these genotype interpretations available via the Internet (http://www.genafor.org/).

## INTRODUCTION

A panel of 17 approved antiretroviral agents is currently available for treating infections with human immunodeficiency virus type 1 (HIV-1). Each of these drugs targets one of the two viral enzymes protease or reverse transcriptase (RT). Despite the introduction of combination therapies, treatment success is limited due to the evolution of drug resistant variants (1). Thus, resistance testing has become an important diagnostic tool in the management of HIV infections (2,3).

Resistance testing can be performed either by measuring viral activity in the presence and absence of a drug [phenotypic resistance testing (4)] or by scanning the viral genome for resistance-associated mutations (genotypic resistance testing). Direct sequencing of the HIV pol gene, which codes for protease and RT, produces genomic data of ~1200 bp, while phenotypic test results are usually reported as resistance factors, defined as the fold-change in susceptibility to the drug relative to a susceptible reference virus. It has been shown that patients can benefit from both genotypic and phenotypic testing (5), but genotyping is faster and cheaper, whereas phenotypic results, represented by a single number for each drug, are easier to handle. In principle, the DNA sequence should determine the resistance phenotype. However, it is challenging to retrieve phenotypic information from the genotype due to complex mutational patterns.

Several expert groups have approached this problem by extracting classification rules from the scientific literature. Links between genetic variations and resistance have been established by site directed mutagenesis experiments, by observing genetic changes under continuous drug pressure in cell culture or by analysis of clinical samples derived from patients after failing (mono-)therapy (6). These rule sets classify genotypes into two or more categories ranging from ‘susceptible’ to ‘resistant’. Some of them aim at predicting not only phenotypic resistance, but also therapy response by considering data on clinical outcomes. Besides these knowledge-based systems, statistical and machine learning approaches have been applied successfully to matched genotype–phenotype pairs in order to solve this classification problem (7–9). After defining certain phenotypic cut-off values, classification models are learned from labelled sequences. In some cases these data-driven approaches lead to parsimonious models, but in general they produce models that are harder to interpret. However, unlike with rules-based systems, model construction and update can be automated and model parameters such as sensitivity or specificity can be controlled explicitly.

In the geno2pheno system two machine learning approaches, decision trees and support vector machines (SVMs), have been implemented for a range of different cut-offs (8,9). On submitting an HIV-1 pol gene sequence, users of this web service can obtain classification results for each of the 17 drugs and a selected cut-off value. Because of the difficulty of finding appropriate cut-off values, we here extend the data analysis approach to quantitative phenotype predictions by using SVM regression. This machine learning technique appears appropriate for a regression problem with many free variables (sequence positions) and a target variable (resistance factor) subject to considerable noise. We present SVM regression models that can predict the fold-change in susceptibility from the genotype. These predicted resistance factors are then compared with predictions obtained from genotypes from untreated patients and with the distribution of predicted resistance factors over a large set of clinical samples. The resulting scores provide continuous measures of resistance that are comparable between different drugs. In particular, we will derive definitions of susceptibility and resistance based on the statistics of all predictions and derive a probability score that allows for discriminating between these two classes.

## METHODS

### Arevir database

The Arevir database is a multi-center clinical database containing patient data, therapies, clinical and virological markers, as well as genotypic and phenotypic resistance test results. The experimental setup for genotyping and the phenotypic recombinant virus assay have been described elsewhere (4,9). Subtypes have been determined as the most significant hits in a BLAST search against the 93 pol gene reference sequences provided by the Los Alamos HIV Sequence Database (http://hiv-web.lanl.gov/content/hiv-db/SUBTYPE_REF/Table1.html). For the present study we use three different sets of sequences: the first set consists of 652 genotypes, including 604 subtype B and 48 (7.4%) non-B sequences, that have also been phenotyped. The majority of these sequences have been deposited in GenBank (accession numbers AF347117 to AF347605). The second set comprises 184 sequences, which have been identified from patients that have not been treated with any antiretroviral drug before (therapy-naive patients). Six sequences with obvious indications of transmission of a drug resistant virus have been removed from this set. The remaining 178 sequences [124 subtype B, 54 (30.3%) non-B] are used for assessing the natural variation of predicted phenotypes among therapy naive patients. The third set consists of 2000 sequences [1695 B, 305 (15.3%) non-B], including samples from the first two sets, that have been selected randomly from the database in order to estimate the unconditional probability density of predicted phenotypes found in clinical isolates. All calculations involving resistance factors are performed on logarithmized values to base 10 and are reported as such.

### Support vector regression

For developing a regression model, sequences have been aligned to the reference strain HXB2 and each sequence position gives rise to 20 indicator variables, one for each amino acid. We use all 99 sequence positions of the protease and the first 220 positions of the RT. A further attribute indicating the presence of the 69SS insertion complex is added for the RT. Thus, the input space dimensions are 1980 and 4401 for protease and RT, respectively. These high-dimensional regression problems are solved with a linear support vector machine (SVM) with an epsilon-insensitive loss function (10). For all drugs, epsilon is fixed at 0.1 such that prediction errors of <0.1 log_{10} units in the resistance factor are not penalized in the training phase. The regularization parameter *C* that controls the trade-off between minimizing training error and model complexity is determined by cross-validation for each drug separately. We use the LIBSVM software library for solving the SVM optimization problem (Chang, C.-C. and Lin, C.-J., 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm). For each drug, a linear regression function represented as a weighted sum over all sequence positions is learned from the data.
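The encoding and the loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ordering of the amino acid alphabet and the placeholder sequences are assumptions made for illustration, and residues outside the 20-letter alphabet (for example ambiguous calls) simply yield all-zero indicators here.

```python
# Illustrative sketch only; names and alphabet ordering are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def encode_positions(residues):
    """Map an aligned amino acid string to 20 indicator variables per position."""
    features = []
    for aa in residues:
        features.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return features


def encode_rt(residues, has_69_insertion):
    """RT encoding: position indicators plus one flag for the 69SS insertion."""
    return encode_positions(residues) + [1.0 if has_69_insertion else 0.0]


def epsilon_insensitive_loss(predicted, observed, epsilon=0.1):
    """Prediction errors below epsilon (in log10 units) are not penalized."""
    return max(0.0, abs(predicted - observed) - epsilon)


# Placeholder sequences of the right length (not real HIV-1 sequences).
protease = encode_positions("P" * 99)
rt = encode_rt("P" * 220, has_69_insertion=False)
print(len(protease), len(rt))
```

The two vector lengths reproduce the input dimensions stated in the text (99 × 20 for protease, 220 × 20 + 1 for RT).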

### Density estimation

Standard procedures are used for fitting the parameters of a normal distribution. Bimodal distributions of resistance factors (RF) are fitted to a two-component mixture model. The density of *x* = log_{10}*RF* is modelled as α·φ(*x*; μ_{1}, σ_{1}) + (1−α)·φ(*x*; μ_{2}, σ_{2}), where φ(*x*; μ, σ) denotes the density of the normal distribution with mean μ and standard deviation σ, and α is the mixing parameter. Parameters are estimated from the data by applying the EM algorithm (11). The solutions obtained from this iterative fitting procedure are virtually independent of variations in the starting solution.
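As a rough illustration of this fitting step, the following sketch runs EM for a one-dimensional two-component Gaussian mixture on synthetic data. The starting solution (splitting the sorted sample in half), the fixed iteration count and the synthetic parameters are assumptions for the example and are not taken from the paper.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(xs, n_iter=200):
    """EM for alpha*phi(x; mu1, s1) + (1 - alpha)*phi(x; mu2, s2)."""
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    # Crude starting solution (assumption): lower and upper half of the sample.
    mu1 = sum(xs[:half]) / half
    mu2 = sum(xs[half:]) / (n - half)
    mean = sum(xs) / n
    s1 = s2 = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    alpha = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation.
        resp = []
        for x in xs:
            p1 = alpha * normal_pdf(x, mu1, s1)
            p2 = (1 - alpha) * normal_pdf(x, mu2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means, deviations and mixing weight.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, xs)) / n2
        s1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1)
        s2 = math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / n2)
        alpha = n1 / n
    return alpha, mu1, s1, mu2, s2


# Synthetic log10 resistance factors: 60% "susceptible", 40% "resistant".
rng = random.Random(0)
sample = [rng.gauss(0.0, 0.3) for _ in range(600)] + [rng.gauss(2.0, 0.5) for _ in range(400)]
alpha, mu1, s1, mu2, s2 = fit_two_gaussians(sample)
print(alpha, mu1, s1, mu2, s2)
```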

### Class probabilities

In the generative two-component mixture model we can assume μ_{1}<μ_{2} and refer to samples originating from the left Gaussian bump as ‘susceptible’ and to the others as ‘resistant’. Within this model we consider the log-likelihood function that decides whether a given resistance factor is more likely to belong to the resistant than to the susceptible group. This quantity is approximated by a linear function *L* in order to obtain the monotonic function (1+e^{−L(x)})^{−1}, which can be shown to approximate the conditional class probability prob(resistant|*x*) of membership in the resistant subpopulation.
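The relation between the mixture model and the logistic score can be written out as follows, writing φ(*x*; μ, σ) for the normal density and α for the weight of the susceptible component. The first two lines follow directly from the mixture model; how the linear function *L* is obtained from the exact log-odds is not specified in the text, so the coefficients *a* and *b* below are left generic.

```latex
\begin{align*}
\ell(x) &= \log\frac{(1-\alpha)\,\varphi(x;\mu_2,\sigma_2)}{\alpha\,\varphi(x;\mu_1,\sigma_1)} \\
        &= \log\frac{1-\alpha}{\alpha} + \log\frac{\sigma_1}{\sigma_2}
           + \frac{(x-\mu_1)^2}{2\sigma_1^2} - \frac{(x-\mu_2)^2}{2\sigma_2^2}, \\
\mathrm{prob}(\text{resistant}\mid x) &= \frac{1}{1+e^{-\ell(x)}}
           \;\approx\; \frac{1}{1+e^{-L(x)}}, \qquad L(x) = a + b\,x .
\end{align*}
```

The exact log-odds ℓ is quadratic in *x* unless σ_{1} = σ_{2}, in which case it is already linear with *b* = (μ_{2} − μ_{1})/σ² and *a* = log((1−α)/α) − (μ_{2}² − μ_{1}²)/(2σ²). For the logistic function of *L*, the inflection point lies at *x* = −*a*/*b* and the gradient there equals *b*/4.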

Although the mixture model has five free parameters, the probability scoring function has only two due to the linear approximation of the log-likelihood. Thus, as a measure of confidence in the fitted function, we report confidence intervals for the location of the inflection point and the gradient in this point estimated from 1000 bootstrap samples.
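A percentile bootstrap is one common way to obtain such intervals; the text does not say which bootstrap variant was used, so the sketch below is an assumption. It shows the two reported quantities for a logistic score with linear term *a* + *b*·*x* and a generic resampling routine. In the setting described above, the statistic would refit the mixture and the linear approximation on each resample; the demonstration uses the sample mean only to keep the example short.

```python
import random


def inflection_and_gradient(a, b):
    """For p(x) = 1 / (1 + exp(-(a + b*x))): inflection point and slope there."""
    return -a / b, b / 4.0


def percentile_bootstrap_ci(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper


x0, grad = inflection_and_gradient(-2.0, 4.0)

# Toy demonstration with the sample mean as the statistic.
toy = list(range(100))
lower, upper = percentile_bootstrap_ci(toy, lambda d: sum(d) / len(d))
print(x0, grad, lower, upper)
```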

## RESULTS

### Regression analysis

For each drug, a regression model is generated from matched genotype–phenotype pairs. The ability of these models to generalize from the training data was assessed by 10-fold cross-validation and is reported as the mean squared error and as the squared correlation coefficient between predicted and observed log_{10} resistance factors (Table 1). Since the range of observed resistance factors differs substantially among drugs, only the latter measure of performance allows for comparisons between drugs. Estimated squared correlation coefficients vary between 0.3 and 0.79 with an average of 0.6 (±0.14) indicating that the models account for 30–79% of phenotypic variance.
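The two performance measures and a 10-fold split can be sketched as below. The function names and the toy numbers are illustrative assumptions. The toy example also shows why only the squared correlation is comparable across drugs: a constant offset changes the mean squared error but leaves the correlation untouched.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def mean_squared_error(pred, obs):
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)


def squared_correlation(pred, obs):
    """Squared Pearson correlation between predicted and observed values."""
    n = len(obs)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    vp = sum((p - mp) ** 2 for p in pred)
    vo = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (vp * vo)


folds = kfold_indices(652, k=10)

# Toy log10 resistance factors: predictions are offset by a constant 0.5.
observed = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.5, 2.5, 3.5]
mse = mean_squared_error(predicted, observed)
r2 = squared_correlation(predicted, observed)
print(mse, r2)
```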

### Variation among drug-naive patients

In order to quantify natural variation of predicted resistance factors among patients that have not received any antiretroviral medication before, we predict phenotypes from a set of genotypes derived from untreated patients. We observe significant differences in predictions between subtype B and non-B sequences for zalcitabine, nevirapine, delavirdine and nelfinavir (rank sum tests, adjusted for multiple testing at a false discovery rate of 1%). However, since the prediction models have been trained on a set of matched genotype–phenotype pairs containing <8% of non-B sequences, we cannot rule out the possibility that this finding is an artifact of the regression function. Therefore, we restrict the analysis of samples from treatment-naive patients to the 124 subtype B sequences. For all drugs, the resistance factors predicted from this set follow a normal distribution. Table 2 reports estimates for the mean and the standard deviation. We observe considerable differences between drugs for both parameters. In Figure 1, four representative examples are displayed (see Supplementary Material for all drugs). Once we know these distributions, we can report for each predicted phenotype how many standard deviations it is away from the mean among drug-naive patients. This z-score provides a standardized and comparable measure of deviation from the expected value for the untreated subtype B subpopulation.
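The z-score computation amounts to a single standardization step, sketched here with made-up numbers. The values are not taken from Table 2, and the use of the sample standard deviation is an assumption, since the text only refers to standard fitting procedures.

```python
import statistics


def z_score(log10_rf, naive_mean, naive_sd):
    """Number of standard deviations a prediction lies from the naive mean."""
    return (log10_rf - naive_mean) / naive_sd


# Hypothetical predicted log10 resistance factors from drug-naive samples.
naive_predictions = [0.10, 0.25, 0.05, 0.30, 0.20]
naive_mean = statistics.mean(naive_predictions)
naive_sd = statistics.stdev(naive_predictions)

# Hypothetical prediction for a new sample.
z = z_score(1.2, naive_mean, naive_sd)
print(naive_mean, naive_sd, z)
```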

[Figure 1 caption, truncated in the source: ... *x*-axes refer to log_{10} resistance factors (RF), whereas the top *x*-axes denote z-scores (numbers of ...]

### Density estimation

We can gain more information on the meaning of a predicted resistance factor by studying the distribution of predictions over all genotypes. Analysis of a random sample of 2000 sequences shows large differences in range, location and deviation of modes, but also reveals the bimodal nature of the distribution common to all drugs (Fig. 2 and Supplementary Material). Thus, the probability density is approximated with a two-component Gaussian mixture model. Table 2 shows the parameter estimates of this model for all drugs and Figure 2 displays the fitted density curves for the four drugs from Figure 1 (see Supplementary Material for all drugs). Bimodality is less pronounced for zalcitabine and didanosine, for which the two components overlap more strongly. Restriction to subtype B sequences does not lead to significantly different estimates (data not shown).


The two Gaussian components give rise to an intrinsic definition of susceptibility and resistance. Thus, we can calculate the probability of belonging to the resistant subpopulation given a predicted resistance factor. In Figure 2 the cumulative density of this probability is plotted as a function of the predicted phenotype (cf. also Supplementary Material). Table 2 summarizes the parameter estimates. Different curve shapes and locations reflect differences in the transition from susceptibility to resistance. The probability score provides a normalized and comparable measure of resistance for all antiretroviral drugs.
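For a fitted mixture, the exact membership probability follows from Bayes' rule, as in this sketch. The mixture parameters are hypothetical, not estimates from Table 2. Note that the geno2pheno score as described in the Methods uses a linear approximation of the log-odds rather than this exact posterior.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def prob_resistant(x, alpha, mu1, s1, mu2, s2):
    """Exact posterior probability that x came from the resistant component."""
    p_susceptible = alpha * normal_pdf(x, mu1, s1)
    p_resistant = (1 - alpha) * normal_pdf(x, mu2, s2)
    return p_resistant / (p_susceptible + p_resistant)


# Hypothetical mixture: 60% susceptible around 0, 40% resistant around 2.
params = dict(alpha=0.6, mu1=0.0, s1=0.3, mu2=2.0, s2=0.5)
for x in (-5.0, 0.0, 1.0, 2.0):
    print(x, prob_resistant(x, **params))
```

With unequal standard deviations the exact posterior is not monotonic: far to the left of both modes, the wider resistant component dominates again (the value at −5.0 above). A logistic function of a linear term avoids this, which fits the monotonic scoring function described in the Methods.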

### Geno2pheno

For a submitted pol gene sequence the geno2pheno system returns an alignment to the reference strain HXB2, classification results according to the preset cutoffs, the predicted resistance factors, z-scores relative to treatment-naives and probability scores for all drugs. Figure 3 shows a sample output of these measures of resistance. The geno2pheno web service is freely available at http://www.genafor.org/.

## DISCUSSION

Genotypic resistance testing has become part of routine diagnostics in the treatment of HIV infected patients. However, its clinical benefit is limited in practice by the complex relationship between genotypic variations on the one hand and phenotypic resistance *in vitro* and treatment response *in vivo* on the other hand. The geno2pheno system has been designed to support the interpretation of sequence data resulting from genotypic resistance tests. Here we have presented regression models that can predict the fold-change in susceptibility to a drug from the genotype. These models translate complex mutational patterns into a single resistance factor for each drug. Since the range of this quantity differs considerably between the various antiretroviral agents, we propose two transformations.

Firstly, we report the deviation of a predicted resistance factor from the mean value for samples from treatment-naive patients. Similar to results for experimentally determined phenotypes in drug-naive patients (12), the distribution of predicted phenotypes shows substantial variation between drugs, but follows a normal distribution in every case. Thus, the z-score, which denotes how many standard deviations the predicted resistance factor for a given sample and drug lies from the mean for treatment-naive patients, provides a measure of drug resistance that is more comparable between drugs than the absolute predicted resistance factors. We have excluded non-B subtype sequences from this analysis, because the small number of phenotyped non-B sequences does not allow for a definite conclusion about non-B baseline resistance profiles.

Secondly, we propose a score that quantifies the probability of a sample to originate from the resistant rather than the susceptible subpopulation given the predicted resistance factor. Thus, the notion of resistance arises only from the distribution of predicted phenotypes that were estimated from a large random clinical sample. The bimodal nature of these distributions suggests a ‘two-state model’ of the virus: a susceptible (wild type) state that is attained preferably in the absence of drug and a resistant state that is advantageous and hence more frequently observed under drug pressure. Unlike z-scores with respect to drug-naive patients, the probability score exploits information on location and variance of both the susceptible and the resistant subpopulation. As a probability, this score is normalized to the interval [0; 1] and interpretable without predetermined cut-off values.

Both scores are based on test statistics that are derived from predicted phenotypes. We fit the distribution parameters to predicted rather than experimentally determined phenotypes because prediction introduces an additional source of noise; in this way, that noise and any systematic biases of the regression models are accounted for.

The ultimate goal of genotype interpretation is to provide a direct estimate of expected treatment response. This task is much more difficult than drug-wise resistance predictions, because complex clinical data have to be included and the *in vivo* effect of a therapy depends on additional factors such as patient adherence and drug pharmacokinetics. Moreover, mono-therapies are obsolete and there is a large number of possible combination therapies. Nevertheless, the problem could be approached in a similar fashion, albeit based on substantially larger datasets (13). Another promising approach is to use the individual phenotype predictions as building blocks for a scoring function that is defined on any drug combination (14). Towards this end, it has been shown that the SVM based phenotype predictions can be integrated into a scoring scheme that is predictive of virological response (15). We plan to integrate such services into the geno2pheno system in the future when they have reached an adequate level of quality and after careful statistical validation and practical testing.

## ACKNOWLEDGEMENTS

We thank Jörg Rahnenführer and Chih-Jen Lin for helpful discussions. Parts of this research have been funded by Deutsche Forschungsgemeinschaft under grant no. HO 1582/1–2 and by grants to the National Reference Center for Retroviruses from the Robert-Koch-Institute, Berlin and from the Bavarian Ministry of Science, Research and Art, Munich.

## REFERENCES

*et al*. (2000) The relation between baseline HIV drug resistance and response to antiretroviral therapy: re-analysis of retrospective and prospective studies using a standardized data analysis plan. Antivir. Ther., 5, 41–48.

*et al*. (2002) Construction, training and clinical validation of an inferential interpretation system for genotypic HIV-1 drug resistance based on fuzzy rules learning from virological outcomes. Antivir. Ther., 7, S71.

**Oxford University Press**

Nucleic Acids Research. Jul 1, 2003; 31(13): 3850.
