Impact of pathogen genetics on clinical phenotypes in a population of Talaromyces marneffei from Vietnam

Talaromycosis, a severe and invasive fungal infection caused by Talaromyces marneffei, is difficult to treat and impacts those living in endemic regions of southeast Asia, India, and China. While 30% of infections result in mortality, our understanding of the genetic basis of pathogenesis for this fungus is limited. To address this, we apply population genomics and genome wide association study approaches to a cohort of 336 T. marneffei isolates collected from patients who enrolled in the Itraconazole versus Amphotericin B for Talaromycosis (IVAP) trial in Vietnam. We find that isolates from northern and southern Vietnam form two distinct geographical clades, with isolates from southern Vietnam associated with increased disease severity. Leveraging longitudinal isolates, we identify multiple instances of disease relapse linked to unrelated strains, highlighting the potential for multi-strain infections. In more frequent cases of persistent talaromycosis caused by the same strain, we identify variants arising over the course of patient infections that impact genes predicted to function in the regulation of gene expression and secondary metabolite production. By combining genetic variant data with patient metadata for all 336 isolates, we identify pathogen variants significantly associated with multiple clinical phenotypes. In addition, we identify genes and genomic regions under selection across both clades, highlighting loci undergoing rapid evolution, potentially in response to external pressures. With this combination of approaches, we identify links between pathogen genetics and patient outcomes and identify genomic regions that are altered during T. marneffei infection, providing an initial view of how pathogen genetics affects disease outcomes.


Introduction
Genomic DNA Isolation and sequencing 126 DNA was purified using the yeast genomic DNA purification kit (VWR). Briefly, cells were 127 disrupted with sodium dodecyl sulphate (SDS) (VWR) at 65 °C. The pellet was then treated with 128 ammonium acetate and RNase A (VWR). DNA was precipitated with isopropanol and stored in 129 within these populations, PopGenome 2.7.5 was used to perform composite likelihood ratio 155 analysis, assess nucleotide diversity and divergence, and calculate Tajima's D and  scores (per chromosome, by 5kb windows) (42). Copy number variation was calculated with 157 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint CNVnator v0.3 (43). Association analysis between clinical data, in vitro phenotypes, and 158 variants was carried out using PLINK v1.08p formatted files and Gemma version 0.94.1 (44) 159 (options: centered relatedness matrix gk 1, linear mixed model), as previously described (15). 160

Centromere identification 161
To identify centromere locations, the GC% was calculated across individual chromosomes with 162 a sliding window approach in R 3.6.0. Repeat motifs across the proposed centromere regions 163 were identified with RepeatModeler 2.0 (45). Recombination rates across individual 164 chromosomes were estimated with a random subset of 50 isolates spanning north and southern 165 clades, via Ldhelmet 1.10 per the best practices workflow (46). 166 167

Results 168
Characterization of genome structure 169 For this study, we leveraged a complete T. marneffei genome generated from isolate  130. This assembly consists of 8 chromosomes, with telomeric sequences present at both ends 171 of all chromosomes excepting one ending in ribosomal DNA repeats (14). This haploid genome 172 encodes 10,025 genes, including rRNA content on chromosome 3, with a BUSCO 173 completeness score of 97.7% (14). To determine the location of the centromeric regions of 174 these chromosomes, we calculated GC% across all chromosomes and detected single regions 175 displaying notable reductions in GC% for each chromosome, corresponding to likely 176 centromeric positions (Figure 1a). To determine the composition of these proposed centromeric 177 regions, we identified repetitive sequences based both on similarity to known elements and de 178 novo self-alignments, identifying long interspersed nuclear elements (LINE) and DNA/  Fot1 transposons throughout these putative centromeric regions (Figure 1b). The repetitive 180 content across these regions is consistent with the repetitive nature of centromere regions 181 observed across other species of filamentous fungi (47). These proposed regions range from 182 15.5-28 kb in length, with an average centromere length for all chromosomes of 23 kb. To 183 assess whether rates of recombination across these proposed centromeric regions are lower 184 when compared to the non-centromeric regions, we calculated recombination rates per 185 chromosome, with variant information from a subset of 50 isolates sequenced in this study. We 186 saw similar rates of recombination across chromosomes (Figure S1), and while we did not 187 observe any characteristic dips in recombination rate that might be associated with centromeric 188 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made sequencing on all isolates collected, requiring a minimum genome coverage of 10X. We then 199 called variants for these isolates, using a reference genome assembly (14) of isolate  130, also collected from southern Vietnam as part of the IVAP trial. 201

202
To assess population structure, we generated a maximum likelihood phylogeny from 203 segregating SNP site information (Figure 2). Isolates clearly separate into two clades 204 representing the northern and southern regions of Vietnam from which they were isolated. This 205 pattern is consistent with prior geographical substructure with clade separation based on region 206 of collection (20). Clinical and in vitro susceptibility phenotypes appear well distributed across 207 the phylogeny for these isolates (Figure 2, Figure S2). Northern isolates show slightly higher 208 levels of diversity (Table 1), harboring alleles with a higher number of SNPs, and displaying 209 significantly longer terminal branch lengths than southern clade isolates. 210

211
To assess recombination levels between and across these two clades, we calculated linkage 212 disequilibrium (LD) decay for the northern clade, the southern clade, and all isolates combined 213 ( Figure S3, Table 1). We see similar rates of decay, and identical LD50 values (12kb) across all 214 groups, indicating that neither clade is recombining exclusively within their group. 215

Clinical outcomes and longitudinal samples 217
To assess whether clinical phenotypes or in vitro antifungal susceptibility values significantly 218 differed by clade, we performed a series of statistical analyses to assess associations between 219 clade and phenotype. We did not observe statistically significant differences between clades in 220 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint considering 24-week mortality, rates of relapse, IRIS, or MICs of itraconazole and amphotericin 221 B. However, some striking differences were detected. While isolates from the southern clade 222 have comparable rates of fungal clearance to isolates from the northern clade (Figure S4a), the 223 southern isolates exhibit significantly higher baseline fungal burden ( Figure S4b) (Wilcoxon 224 rank sum test, p=0.024), significantly higher prevalence of fever (Fisher exact probability test, 225 p=0.0002), and significantly decreased incidence rates of respiratory failure (Fisher exact  226 probability test, 0.0021) ( Table 2). Given the increased presence of fever and baseline fungal 227 burden across isolates from the southern clade, we sought to assess whether additional factors 228 including patient age, antiretroviral (ARV) treatment status or CD4 counts were confounders in 229 these analyses. For patients included in this analysis, the median age was 33, the median CD4 230 count was 10, and 44% of patients had previously received ARV treatment. Neither age, ARV 231 treatment status, or CD4 counts were significantly correlated with the presence of fever or 232 baseline fungal burden, indicating that isolate clade is a contributing factor to some aspects of 233 disease severity in patients from Vietnam. 234

235
To assess the genetic relationships between isolates sampled from different body sites (skin vs 236 blood) of a patient at the same time point, and isolates sampled at different time points from the 237 same patient (pre vs post-treatment), we calculated the number of SNP differences between 238 each isolate. Over 80% of isolates sampled at longitudinal time points show a small number of 239 SNP differences (≤10) when compared to their baseline counterparts ( Table 3) We also observed that a subset of longitudinal isolates from 4 patients appear unrelated to the 249 primary infecting isolate. In total, we observed 4 longitudinal blood isolates and 2 skin isolates 250 ( Table 4) that appeared unrelated to the initial blood isolate, or matched time point blood 251 isolate, respectively. With SNP differences between these isolates and their initial or blood 252 counterparts ranging from 1,072 to 2,810 (Table 4), it is likely that these cases represent novel 253 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint strain introductions or pre-existing multi-strain infections, as opposed to a persistent infection of 254 the same isolate over time. 255

Genomic variants associated with clinical outcomes 256
Given the range of clinical phenotypes and outcomes associated with isolates from both clades, 257 we leveraged this variability to assess significant associations between clinical phenotypes and 258 pathogen genetics. We took a genome wide association study (GWAS) approach to test for 259 significance between loss-of-function (LoF) mutations and the clinical metadata suggestive of 260 severe infections. Four patients with multiple infecting strains were excluded from this analysis. 261 We saw LoF variants associate significantly (p<5x10 -6 , gemma score test) with poor clinical 262 outcomes including mortality, relapse, IRIS, poor clinical response, and high baseline fungal 263 burden (CFU/ml) ( Table 5) pressures. The variants identified here may give some advantage to survival within the 275 environment of the human host and may be working in tandem with other forms of genetic 276 variation, such as copy number, to elicit more severe clinical phenotypes. 277

Population diversity and regions under selection 278
To investigate larger scale genetic variation, we performed copy number variation analysis. We 279 observed no evidence of chromosomal aneuploidy across these isolates but identified multiple 280 genomic regions that appeared duplicated and deleted within this population. Many of these 281 regional duplications and deletions appeared unique to individual isolates ( Figure 3); however, 282 common deletions were also detected. The most commonly deleted region encodes a predicted 283 afadin-and alpha -actinin-binding region that is deleted across 141 isolates from the northern 284 clade and 125 isolates from the southern clade. The largest number of isolates sharing a 285 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint duplication across a single locus is 17, duplicated in isolates from the southern clade, this region 286 encodes two reverse transcriptases, and a GAG-pre-integrase domain, likely corresponding to a 287 retrotransposon integration. Of those deleted regions shared across more than 50 isolates, 288 these encode genes predicted to function as transcription factors (6 genes), REDOX proteins (5 289 genes), protein kinases (4 genes), methyltransferases (3 genes), aspartic proteases (2 genes), 290 dehydrogenases (2 genes), a cytochrome P450, a glycosyl hydrolase, a glycosyl transferase, a 291 phospholipase, a phosphorylase, an adenylosuccinate lyase, an actin binding protein, a 292 decarboxylase, an aminotransferase, an apoptosis regulating protein, an autophagy regulating 293 protein, an arrestin, and a meiotic nuclear division protein. The range in functionality of genes 294 lost across more than 50 isolates highlights the diversity within this clinical dataset. Of those 295 duplicated regions shared across more than 2 isolates, these can be characterized by their 296 predicted functions as heat shock and stress response genes, including heat shock protein 70, 297 cell membrane and wall modulating genes, including a chitin synthase and a phosphoylcholine 298 transferase, metabolic regulatory genes, including a cytochrome P450, a thiolase, a glycosyl 299 hydrolase, an aspartyl protease, a citrate synthase, and gene expression regulatory genes, 300 including a transcription factor and an mRNA capping protein. These results suggest that genes 301 with roles in heat shock, cell wall structure, and metabolic response could provide an advantage 302 to T. marneffei in this setting. We did not observe any duplications of the CYP51 ortholog within 303 these isolates, suggesting alternate mechanisms may be responsible for the range in MIC 304 values seen to itraconazole. 305

306
To assess regions that may be under selection within these populations, we performed a 307 population genetics analysis to identify rapidly evolving regions across these isolates. We noted 308 strong signals present across many telomeric and sub-telomeric regions (Figure 4a), in tests for 309 selective pressure (composite likelihood ratio, CLR) (48), nucleotide diversity (Pi), and 310 population divergence (Dxy). To determine if these regions encode common functions, we 311 performed enrichment analysis on all genes within the sub-telomeric regions. The sub-telomeric 312 regions displaying strong signals for selective pressure were enriched (hypergeometric test, 313 p<0.05) for genes predicted to play roles in thiamine diphosphate metabolism and energy 314 derivation by oxidation of organic compounds (Table S1). When northern and southern clades 315 were assessed independently via CLR analysis, non sub-telomeric regions exhibiting strong 316 signals for selective pressure across both populations were enriched (hypergeometric test, 317 p<0.05) for genes predicted to function in ion and organic anion transport, respiration, regulation 318 of gene expression, and energy derivation by oxidation of organic compounds (Table S1). Non 319 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint sub-telomeric regions displaying signals of population divergence between clades were 320 enriched (hypergeometric test, p<0.05) for genes predicted to function in anaerobic respiration 321 (Table S1) adaptations in these regions. Region 2 encodes genes predicted to function as a cytochrome 340 P450 and a glycosyl hydrolase. Region 3 encodes genes predicted to function as a transcription 341 factor, a protein kinase, an aspartyl protease, and a pyridoxal-dependent decarboxylase. To 342 determine whether any of these signals may be due to a recent selective sweep, or balancing 343 selection, we assessed scores for Tajima's D. We identified one region that shared a strong 344 signal in both our CLR analysis, and our Tajima's D analysis (Figure 4a, region 4), indicating 345 that this region may have recently undergone a selective sweep. We note that this region also 346 shows an elevated Dxy score, but no increase in Fst between the two populations, indicating 347 increased levels of divergence between the two populations at this region, accompanied by high 348 levels of within-population diversity, as reflected in our measurements of nucleotide diversity (Pi) 349 for each clade at this region. This region encodes 14 genes and multiple ankyrin repeats 350 (EYB25_006920-006938) in total. The predicted functions of these genes include a transcription 351 factor, an endoplasmic reticulum membrane protein, an arrestin, a meiotic nuclear division 352 protein, a glycosyl transferase, a cell surface protein, an alcohol dehydrogenase, and an 353 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Leveraging the clinical metadata available for these isolates, we investigated the possibility of 368 clade-based differences with regards to clinical phenotypes and outcomes. We did not observe 369 significant differences between the northern and southern clades with respect to 24-week 370 mortality, rates of relapse, IRIS, clinical response to treatment, or MICs of itraconazole and 371 amphotericin B. However, we detected significant associations between the southern clade and 372 aspects of disease severity, including higher baseline fungal burden and presence of fever. In 373 addition, we identified multiple loss-of-function variants significantly associated with mortality, 374 relapse, IRIS, poor clinical response, and high baseline fungal burden. We noted that many of 375 these clinical phenotypes may occur at low rates for the group sizes, likely impacting our ability 376 to detect both significant differences between clades, and significant associations between 377 variants and phenotypes. In addition, there are many confounding factors that contribute to 378 these clinical outcomes, including host immune status, host response, timeliness of treatment, 379 and antifungal treatment regimen. While we did not find that age, CD4 count, or ARV treatment 380 status confounded associations between the southern clade and higher baseline fungal burden 381 or presence of fever, larger studies of talaromycosis outcomes should be completed to provide 382 increased power for estimations of clinical differences between clades to confirm these findings. 383 Further work using larger group sizes for genotype-phenotype association studies will also be 384 essential for building on our GWAS findings. 385 386 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint Applying a range of population genetics approaches with the variant data generated for these 387 clinical isolates, we interrogated signals of selective pressure, increased nucleotide diversity, 388 and population divergence for these clades. We observed shared signals of rapidly evolving 389 regions present in northern and southern clades, with the strongest shared signals spanning was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted April 1, 2023. ; https://doi.org/10.1101/2023.03.30.534926 doi: bioRxiv preprint  was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made    The colored circles correspond to clade and sample type. Metadata for the clinical isolates 690 includes initial fungal burden for a patient, indicated by the Log 10 CFU/ml heatmap, and patient 691 outcomes that are indicated by colored squares in the outer perimeter of the phylogeny. 692  was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  -0  3  -0  5  2   1  1  C  N  -0  3  -0  2  2   1  1  C  N  -0  3  -0  1  2   1  1  C  N  -0  3  -0  4  8   1  1  C  N  -0  3  -0  0  6   1  1  C  N  -0  3  -1  2  3   1  1  C  N  -0  3  -1  2  3  _  s  _  d  0   1  1  C  N  -0  3  -0  0  7 1 1 C N -0 3 -0 8 5 1 1 C N -0 3 -0 8 5 -S 1 1 C N -0 3 -0 2 3 1 1 C N -0 3 -1 6 5 _ s _ d 0 1 1 C N -0 3 -0 1 8 1 1 C N -0 3 -0 9 6 1 1 C N -0 3 -0 1 8 _ b _ w 4 1 1 C N -0 3 -1 4 5 1 1 C N -0 3 -0 6 4 -S 1 C N -0 3 -1 6 2 C N -0 3 -1 6 2 -S -W