Inferring the composition of a mixed culture of natural microbial isolates by deep sequencing

Next generation sequencing has unlocked a wealth of genotype information for microbial populations, but phenotyping remains a bottleneck for exploiting this information, particularly for pathogens that are difficult to manipulate. Here, we establish a method for high-throughput phenotyping of mixed cultures, in which the pattern of naturally occurring single-nucleotide polymorphisms in each isolate is used as intrinsic barcodes which can be read out by sequencing. We demonstrate that our method can correctly deconvolute strain proportions in simulated mixed-strain pools. As an experimental test of our method, we perform whole genome sequencing of 66 natural isolates of the thermally dimorphic pathogenic fungus Coccidioides posadasii and infer the strain compositions for large mixed pools of these strains after competition at 37°C and room temperature. We validate the results of these selection experiments by recapitulating the temperature-specific enrichment results in smaller pools. Additionally, we demonstrate that strain fitness estimated by our method can be used as a quantitative trait for genome-wide association studies. We anticipate that our method will be broadly applicable to natural populations of microbes and allow high-throughput phenotyping to match the rate of genomic data acquisition.


Introduction
The rules by which genotype dictates phenotype are encoded in the genetic and phenotypic variation of natural populations.These rules can be decoded by statistical-genetic scans for polymorphisms that are co-inherited with, and potentially causal for, traits of interest among the progeny from controlled matings or among members of an outbred population.In many organismal systems, such efforts have been accelerated by pooled genotyping methods.This approach, originally called bulk segregant analysis in laboratory crosses [1] [2], has become an industry standard for invertebrate animals, eukaryotic microbes, and plants.The modern incarnation is to mix genetically distinct individuals, subject the resulting pool to selection for a phenotype of interest, and isolate DNA en masse from the subsets of the pool that pass or fail the selection [3].From the resulting sequencing data, allele counts at each locus in turn then serve as input into statistical-genetic tests [4].Against a backdrop of years of success, the pooled phenotyping-by-sequencing framework does have a key limitation: it does not quantify phenotypes of the individuals of the initial population.As a consequence, pooled methods preclude advanced statistical-genetic analyses at the haplotype and chromosomal level, including scans for genetic interactions between loci, calculations of polygenic risk scores, and population admixture control [5].
Strategies to infer individual strain abundances from sequencing of a complex pool have been developed in a related literature, that of microbial metagenomics.Can these tools be brought to bear on pooled statistical-genetic experiments?In metagenomics, a typical application requires simultaneous inference of each strain's genotype and abundance in an ecological sample.To cut down on the resulting large search space, current methods make strong assumptions about pool membership (e.g.reference strains likely to be in the sample [6] [7], and/or small numbers of strains likely to dominate [8] [9] [10] [11]) whose validity in many cases may be unknown.But these caveats are not relevant in a statistical-genetic application using a pool of individuals whose genotype is known a priori.In such a scenario, inferring the prevalence of pool August 2, 2024 2/20 members from phenotyping-by-sequencing may be expected to work particularly well.
With this motivation, we set out to develop a method to fit the strain composition from full genome sequencing of a pooled sample, given individual genome sequencing of each member of that pool.As a test case for application of the approach, we focused on the fungal pathogen Coccidioides posadasii.Coccidioides species are the causative agents of Valley Fever and are endemic to California, Arizona, and other desert regions in the Americas [12].Coccidioides grows as saprophytic hyphae in the environment [13].
The hyphae produce asexual spores, called arthroconidia, which, upon inhalation by mammalian hosts, convert to a unique pathogenic spherule morphology.The spherule undergoes internal segmentation to generate cells called endospores, which are released from spherules and presumably disseminate in the host.The genetic basis of these behaviors remains largely unknown.Many screening approaches that could fill the knowledge gap are out of reach because Coccidioides experiments can only be done under biosafety-level 3 containment laboratory conditions.Given the need for cheap and efficient experimental design in this system, forward genetics with phenotyping-by-sequencing methods is of particularly keen interest.We thus pioneered a scheme of pooling naturally varying strains of C. posadasii for growth and trait mapping.We implemented our model-fitting approach to infer strain abundance from the resulting data, and we then used relative abundances measured by this method as quantitative traits for association mapping, including control of population structure.

Sequencing genetically distinct, clinical isolates of Coccidioides posadasii
With the ultimate goal of pooled growth assays and phenotyping-by-sequencing in C.
posadasii, we sequenced 77 C. posadasii clinical isolates from Pima County, Arizona, of which 11 have been independently sequenced (Table S1), and we also resequenced the type strain of C. posadasii, Silveira [14].The full set of sequencing data was combined with genomes of previously published strains to generate a phylogeny (Fig 1 ) which revealed that the Pima isolates in our pool corresponded to at least three previously identified populations [15].
As a model trait for pooled phenotyping assays, we chose a phenotype that could vary across our strains and be mapped by a genome-wide association study (GWAS) in a pooled format.Given that mammalian body temperature is both a cue for the fungal morphological transition and a stress that must be overcome to persist in the host, we chose differential growth at environmental (room temperature, RT) and host (37°C) temperatures as a useful test case for this purpose.As described in the following sections, we developed a method for fitting strain abundances to sequencing of a mixed culture seeded with a pool of C. posadasii strains, validated the method on simulated data, and we applied the method to real strain pools grown at 37°C or RT.

Mixed pools can be deconvoluted by fitting a binomial model
For a pool of M strains that differ at N biallelic single-nucleotide polymorphisms (SNPs), we modeled the observed major allele read counts, c, out of total counts, n, at each variant position, i, based on a binomial distribution: August 2, 2024 3/20 where the probability of observing the major allele, p i , is the total proportion of strains, M , harboring that allele: where f j is the proportion of strain j and δ ij = 1 for strains, j, with the major allele at position i and 0 otherwise.The total likelihood of the observed read counts over all biallelic positions is then: The unknown strain proportions can be estimated by finding the value of the vector, F , of strain proportions, f j , that maximizes the likelihood function.We found the maximum likelihood solution by minimizing a related objective function: using simulated annealing [16,17] (Fig S1).
An extended derivation of Eq 4 is given in the Materials and methods.

Validation of strain abundance inference method by simulation
In order to validate our fitting method, we first simulated sequencing results from Given that we sampled from real sequencing runs, the simulations should incorporate the same positional and sequence biases, sequencing errors, etc., as in a real experiment.
In particular, this simulation method is independent of the assumptions of our model and assumes only that there are no strain-specific biases in isolation of the DNA of the pool.We simulated mock pools at a depth of 30 million reads, a lower limit of the sequencing depth of our true pooled experiment below.

Inferring strain abundances from experimental pooled sequencing
To apply our approach to real data, we performed a first set of competition experiments for 54 well-germinating strains of C. posadasii at host (37°C) and environmental (RT) temperatures.
The procedure was repeated in two batches, with three pools grown at each temperature for each batch, for a total of 12 pools.In each case we inoculated arthroconidia into liquid culture and incubated to allow germination and hyphal growth.We then isolated DNA from each, carried out sequencing, quantified alleles at SNPs, and inferred strain abundance with our fitting method.In each fit, simulated annealing converged (Fig S3).Inferred strain abundances varied more among the pools cultured at 37°C than those at RT (Fig 3A), but significance testing was still highly powered to resolve temperature differences in abundance for eight strains (blue and red points,   A, B, and C).In the top row of plots, each point reports the true abundance of one strain in the respective simulated pool, as a proportion of total biomass.In the bottom row of plots, each point reports the abundance of one strain fit by the model (y-axis) and the true simulated abundance (x-axis) for the respective simulated pool.Clonal pair (strains 3284 and 3291) indicated in green.Strains absent from the simulated pools indicated in magenta.
In order to validate the strain-specific temperature biases measured by our inference approach, we carried out a second, smaller pooled competition experiment consisting of the two C. posadasii strains that had dominated cultures in the first round (3301 and 3457), two of the 37°C enriched strains from the first round (3224 and 3326), and one of the RT enriched strains from the first round (3292).We grew the smaller pools in triplicate at 37°C and RT and again isolated and sequenced DNA from each replicate.Running our inference method on the resulting sequencing data correctly identified the five strains present in these pools (Fig S4).Furthermore, the temperature-dependent abundance patterns in this experiment recapitulated the trends we had seen in the first round, with strains 3224 and 3292 again exhibiting 37°C and RT biases, respectively (insets, Fig 3B).As in the larger pools, strains 3301 and 3457 did not show a temperature bias in abundance.They did not, however, show evidence for jackpotting in the smaller pools.Instead, in the latter, strain 3292 was the most abundant strain at either temperature (insets, Fig 3B ; Fig S4).Taken together, these results show that despite variation in jackpotting effects across experimental designs, strain differences in growth rate between conditions are reproducible across different C. posadasii pool compositions and can be inferred robustly with our fitting approach.In principle, strain variation in temperature tolerance could derive from differences in the ability to cope with thermal stress or to differential response to the cue of host temperature.

Application to GWAS
We reasoned that the inferred abundances from our larger pooled growth experiment could be used as the basis for genetic dissection of variation in temperature-dependent growth, via GWAS.Given the population structure in our sampled C. posadasii strain August 2, 2024 6/20 phenotyping-by-sequencing in pools, and inferred strain abundance as a highly-powered strategy for statistical genetics.

Discussion
Pooling approaches for statistical genetics, though powerful, typically preclude the use of chromosome-or haplotype-level advanced mapping methods.Here, we adapt mathematical techniques originally developed for deconvolution of strain data from metagenomic sequencing to infer the relative proportions of strains of a single species in a mixed culture.We show that this approach can be used to score quantitative phenotypes for differential growth between two conditions in a pooled format and that these phenotypes can be applied to GWAS in Coccidioides.
Progress in the molecular understanding of microbial pathogens is often hampered by the difficulty of genetic screening tools.Indeed, our work represents a first-ever whole-genome functional-genomic screen in Coccidioides.[28].Forward genetic screens are likely to be of great utility in advancing the field, and a recent small-scale screen of 24 Coccidioides insertional mutants for virulence in Galleria serves as an additional foundation for this principle [29].Given the success of GWAS in fungal model systems, plant pathogens and commensals, and opportunistic animal pathogens, we expect the natural variation-based approach to be equally powerful in human pathogens, especially with the pooled growth paradigm we establish here.
As temperature is both an important developmental cue and a stress inherent to the host environment, we chose growth of Coccidioides at 37°C as an easily controlled model trait to test our pooled GWAS method.Our inference of variation in temperature-dependent growth across C. posadasii strains from the pooled experimental format is consistent with a previous survey of growth across temperatures in a strain-by-strain setting [30], and our discovery of D8B26_001557 as a candidate determinant of these differences serves as an additional proof of concept for our approach.Regulation of this gene by the Ryp1 transcription factor is consistent with the control of its ortholog in Histoplasma ohiense G217B, I7I48_06129 1 , by Histoplasma Ryp1 [31].Ryp1 is a master regulator of the temperature-dependent transition of Histoplasma from hyphae to yeast [32] [19] and is likewise required for spherule formation in Coccidioides [19].Given the role of Ryp1 in temperature-dependent transcriptional regulation, it is appealing to discover a Ryp1 target associated with temperature-dependent growth.Of additional interest is that fact that the Ryp1 regulon is associated with pathogenesis in many fungi [33] [34] [35] [36] [37] [38] [39] [40].Many of the known virulence factors of Histoplasma are direct Ryp1 targets [31], and the Coccidioides virulence factors SOWgp, MEP1, and urease are Ryp1 regulated [19].Thus it is tempting to speculate that our GWAS hit locus, D8B26_001557, may ultimately prove to have relevance for virulence behaviors.We observe that C. posadasii Silveira has the haplotype that favors growth at RT, consistent with the previous observation of compromised growth at high temperature for this strain [30].
Our GWAS of temperature-dependent growth in C. posadasii from pooled experimental measurements opens a window onto the study of natural variation in other virulence-relevant traits in this pathogen, including development of infectious spherules.More generally, our strategy should be applicable to phenotyping of pools of genetically 1 HISTO_ZT.Contig1089.Fgenesh_histo.56.final_new in [31] August 2, 2024 9/20 distinct strain sets in many organismal systems.Our current likelihood function assumes a haploid genome but could readily be extended to the diploid case.The approach is expected to have particular impact in statistical-genetic scans that use population structure correction, as we have implemented it here for C. posadasii, and other multi-locus mapping tools.Our method will also be of use in applications beyond statistical genetics per se -for example, correlation analysis of multiple phenotypes or rapid surveys of a set of isolates to select strains with extreme phenotypes for comparative 'omics.
In summary, the pooled phenotyping format has become a linchpin of the field, especially for non-model systems like pathogens; and with adaptations like ours, powerful genomic methods, even those originally developed for classical phenotyping on one individual at a time, come within reach for the cheap, expedient pooled experimental design.

Strains and growth conditions
All strains were received from J.G. and collected from a single hospital site in Tucson (Southern Arizona Veterans Administration Health Care Service).
Arthroconidia from 54 isolates of C. posadasii used in this study were generated by cultivation on solid glucose yeast extract medium (GYE 2X: glucose 20 g/L, yeast extract 10 g/L, agar 15 g/L) in 125 ml vented suspension flasks for 4-6 weeks at 30°C.Arthroconidia were harvested using phosphate buffer (PBS) and a cell scraper to dislodge arthroconidia from grown mycelium.The hyphal/spore mixture was filtered through miracloth to isolate the arthroconidia from hyphal mass.The spores were washed twice with PBS before being stored at 4°C for long term storage.The spore solutions were quantified with a hemocytometer and adjusted to working concentrations of 105 arthroconidia/µL.An additional isolate, C. posadasii RMSCC 1043, did not germinate well and did not yield sufficient material for sequencing.We eliminated RMSCC 1043 from simulation studies and analyses of real experimental data (see below) under the assumption that it would not contribute to mycelial pools.

Pooled competition experiments
Arthroconidia from C. posadasii Silveira and 53 clinical isolates of C. posadasii (plus RMSCC 1043) were grown in competition under conditions associated with the host or the environment.For each clinical isolate, 12.74 µL of the 105 arthroconidia/µL stocks was added to 6.3 mL of PBS in a 15mL conical, resulting in a total spore concentration of 107 spores/ml.Six flasks containing 25mL of GYE 2X liquid media were inoculated with 550 µL of the spore mixture, representing 105 spores per isolate in each culture.
The cultures were incubated for 14 days on an orbital shaker at 120 RPM either at room temperature or at 37°C in the presence of 5% CO2.
The mycelium from each culture was collected in miracloth filters, washed with PBS to remove carryover media, then pat dried on paper towel to remove excess moisture.

NGS-library preparation and sequencing
Genomic DNA from each sample was prepared for next generation sequencing using the Nextera DNA Flex Library Prep kit (Illumina).In brief, 500 ng of gDNA from each pooled competition experiment were used as input and tagmented (process to fragment and tag the DNA with adapter sequences) with Bead-Linked Transposomes.After washing, the tagmented DNA was amplified with unique combinations of i7 and i5 index adapters.Library qualities were assessed using an Agilent High Sensitivity DNA chip run on a 2100 Bioanalyzer Instrument.Equal molar amounts of each library were multiplexed to a final concentration of 5-10 nM.
Individual strains were multiplexed in four batches and sequenced on the Illumina HiSeq 4000 platform at the UCB QB3 core or at the UCSF Center for Advanced Technology (CAT).
Pooled cultures were multiplexed in three batches and sequenced on the Illumina HiSeq 4000 platform at the UCSF CAT or the Illumina NovaSeq S2 platform at the Chan Zuckerberg Biohub.

Simulations
Mock pools were simulated by sampling read pairs from the individual isolates in the required proportions by reservoir sampling ("Algorithm R" of Vitter [41]).

Phylogenetics
FASTQ files for previously published C. posadasii strains [42] [15] were downloaded from SRA:PRJNA274372 and SRA:PRJNA438145.The C. posadasii reference genome and annotations [43] were downloaded from BioProject PRJNA664774.We used RepeatMasker 4.1.2[44] to identify regions of the C. posadasii reference genome with low complexity or simple repeats, or mapping to a library of known Coccidioides transposable elements [45].
For each isolate and GWAS pool, we aligned reads from the FASTQ file to the C. posadasii reference genome using BWA MEM 0.7.17 [46], then sorted and indexed the aligned reads using SAMTOOLS 1.8 [47].We used PILON 1.23 [48] with the -variant flag to create VCF (Variant Call Format [49]) files from the aligned reads.Variant sites in each VCF were filtered as follows: variants annotated as "LowCoverage" by pilon or with depth greater than three times the average depth of the sample were removed.We also removed variants within the repetitive regions identified by RepeatMasker.
Vcf2phylip.py (https://doi.org/10.5281/zenodo.2540861)was used to convert from a VCF containing all biallelic sites with minor allele frequency of at least 5% and no missing data to phylip format.We used this file as input to the ModelFinderPlus model of IQ-TREE 1.6.12[50] [51] [52], which identified the TVM+F+ASC+R5 model as optimal using Bayesian information criterion.We bootstrapped this model with 1000 replicates using the SH-like approximate likelihood ratio test and ultrafast bootstrap August 2, 2024 11/20 approximation methods.The consensus tree from the ultrafast bootstrap approximation was visualized with iTOL [53].The 12 resequenced strains were included in the tree inference, were correctly identified as redundant by IQ-TREE, and are not shown in Fig 1.

Read alignment
Paired-end reads from individual isolates and simulated and real pools were aligned to the Coccidioides posadasii var.Silveira genome assembly using BWA MEM [46] with default parameters.

SNP calling from individual isolates
For the purpose of fitting proportions in simulated and real pools, SNPs were defined from the individual isolate read alignments as nucleotide positions covered by at least 10 reads from each isolate, with at least 85% of the reads from each isolate supporting a single allele, and with exactly two total alleles over all isolates.

Model and fitting procedure
We give here an extended derivation of the objective function from Eq 4 from the Results section.
Given major allele counts, c i , for each biallelic position, i, out of N total biallelic positions over M strains, we would like to find the strain frequencies F that best fit the allele counts under the constraints that the frequencies are positive, 0 ≤ f j ≤ 1∀j, and sum to 1, The probability that a count at i is due to strain j is its frequency: Therefore, the probability of observing c i counts at a position where strain j is the only strain with the major allele is given by the binomial distribution: For positions where more than one strain has the major allele we need to account for all of the different ways that the c i counts could be partitioned among the strains.As this is already built into Eq 6 via the binomial coefficient, it is easiest to first sum the individual strain probabilities and then insert this total major allele frequency into that equation: where δ ij is 1 if strain j has the major allele at i and 0 otherwise.
The total likelihood of the observations, C, given a candidate solution, F , is: August 2, 2024 12/20 which can be maximized by minimizing the negative log likelihood: Note that the binomial coefficient is a fixed term that depends only on the observed allele ratio, so it can be dropped for the minimization: which is Eq 4 from the Results section.

GWAS
For GWAS, we considered all strains in the pool except for RMSCC 1043 (not sequenced), C. posadasii (not a Pima isolate), and the two jackpotting strains (3301 and 3457), leaving 51 strains.We selected a total of 8552 SNPs outside of LTR transposon-rich regions (defined as in [54]) and in coding sequence with at least 25% minor allele frequency (MAF) relative to the 51 analyzed strains.
Per strain phenotypes were calculated as the difference in median log 2 proportions between 37°C and RT.
GWAS analysis was then carried out using GEMMA [18] 0.98.4 to infer the kinship matrix from the SNP matrix:

Fig 1 .
Fig 1. Phylogenetic distribution of Coccidioides poasadasii isolates.IQ-TREE inferred tree of the newly C. posadasii isolates relative to previously published strains.Strains included in the mixed culture experiments are highlighted in orange (55 strain pool only) or blue (55 strain and 5 strain pools).
2800 steps of simulated annealing were sufficient for the fit to converge in each case (Fig S2), and the fit strain proportions recapitulated ground truth to within 1% (Fig 2, bottom row).As expected, the exception to the rule was a pair of genetically indistinguishable clonal strains present in the true strain set (green circles in Fig 2B and see Methods).Similar results were obtained when we used only a transposon-free region between the KU70 and HSF1 genes (coordinates: CP075070.1:4382963..5065814), representing about 5% of the genome (Fig S2).
Fig3B).Two other strains exhibited evidence for jackpotting in the cultures, with very high abundance independent of temperature (purple points, Fig3B).

Fig 2 .
Fig 2. Validation of model with simulated data.Shown are results from simulations of three pools with different strain proportions (A, B, and C).In the top row of plots, each point reports the true abundance of one strain in the respective simulated pool, as a proportion of total biomass.In the bottom row of plots, each point reports the abundance of one strain fit by the model (y-axis) and the true simulated abundance (x-axis) for the respective simulated pool.Clonal pair (strains 3284 and 3291) indicated in green.Strains absent from the simulated pools indicated in magenta.

Fig 3 .Fig 4 .
Fig 3. Strain proportions fit from real pool sequencing.(A) Heatmap showing fit proportions for each of 54 strains (columns) in each of 12 pooled liquid cultures (rows) grown for 14 days at 37°C or RT.(B) In the main plot, each point reports abundance as a median across replicate pools of the indicated temperatures from (A).Strains are colored based on whether they were: highly abundant at both temperatures (purple), enriched at 37°C (red), enriched at RT (blue), or not detectably temperature-dependent (green).Insets show proportions for strains 3326, 3224 and 3292 when retested in a 5 strain pool (p-values from Wilcoxon tests).
Fig S1.Fitting strain proportions with simulated annealing.(Top) Objective function for each simulated annealing step for a simulated pool (blue line).Values of the objective function are shown as dashed lines for a uniform strain distribution (red) and for the known correct distribution (green).(Bottom) Estimated proportions of the top ten strains or the remaining strains ("rest") plotted as stacked relative proportions for each simulated annealing step.For both the top and bottom plots, the true solution is plotted to the right of the vertial dashed line.

Fig S3 .
Fig S3.Simulated annealing converges for real pools.For each of 12 pools, the objective function is plotted as a function of the simulated annealing step as a solid blue line with the minimum value as a dashed blue line and the objective function evaluated for a uniform strain distribution plotted as a solid red line.
Fig S4.Strain proportions fit from retest pool sequencing.(A) Heatmap showing fit proportions for each of 54 strains (columns) in each of 6 pooled liquid cultures of 5 strains (rows) grown for 14 days at 37°C or RT.(B) Scatter plot of median proportion for each strain in the 37°C vs. RT pools from (A).Strains are colored as in Fig 3B.Dashed line indicates a slope of 1.