Metagenomics reveals novel microbial signatures of farm exposures in house dust

Indoor home dust microbial communities, important contributors to human health outcomes, are shaped by environmental factors, including farm-related exposures. Detection and characterization of microbiota are influenced by sequencing methodology; however, it is unknown if advanced metagenomic whole genome shotgun sequencing (WGS) can detect novel associations between environmental exposures and the indoor built-environment dust microbiome, compared to conventional 16S rRNA amplicon sequencing (16S). This study aimed to better depict indoor dust microbial communities using WGS to investigate novel associations with environmental risk factors from the homes of 781 farmers and farm spouses enrolled in the Agricultural Lung Health Study. We examined various farm-related exposures, including living on a farm, crop versus animal production, and type of animal production, as well as non-farm exposures, including home cleanliness and indoor pets. We assessed the association of the exposures on within-sample alpha diversity and between-sample beta diversity, and the differential abundance of specific microbes by exposure. Results were compared to previous findings using 16S. We found most farm exposures were significantly positively associated with both alpha and beta diversity. Many microbes exhibited differential abundance related to farm exposures, mainly in the phyla Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria. The identification of novel differential taxa associated with farming at the genera level, including Rhodococcus, Bifidobacterium, Corynebacterium, and Pseudomonas, was a benefit of WGS compared to 16S. Our findings indicate that characterization of dust microbiota, an important component of the indoor environment relevant to human health, is heavily influenced by sequencing techniques. WGS is a powerful tool to survey the microbial community that provides novel insights on the impact of environmental exposures on indoor dust microbiota, and should be an important consideration in designing future studies in environmental health.


1
Introduction 50 Humans spend 90% of their lives indoors (1), with much of this time spent in the home, where they 51 both contribute to and are exposed to environmental microbiota. Home dust microbiota are 52 commonly captured by vacuuming living spaces, including bedrooms. Exposure to bacterial and 53 fungal communities inside the home has been associated with allergic, atopic, and respiratory 54 conditions in children and adults (2-5). These associations could reflect the direct impacts of 55 environmental microbial exposure on inhabitants' health, as well as through indirect effects of dust 56 microbiota on the human gut, skin, oral, and respiratory microbiomes (6-8). Housing characteristics 57 and other environmental exposures have been shown to influence indoor microbial communities, 58 including farm-related exposures (8)(9)(10)(11). Living in or near a farm environment entails unique 59 microbial exposures and subsequent health concerns. Farm exposures have been associated with 60 altered microbial composition in home dust, which in turn have been associated with allergic 61 outcomes in adults and children (4,(12)(13)(14). Identifying environmental factors that influence home 62 dust microbiota is a critical first step in determining exposure pathways relevant to health outcomes. 63 The emergence and optimization of high-throughput sequencing have enabled new approaches to 64 assessing the composition of bacterial communities present in home dust samples, which have a 65 complex matrix and low microbial biomass compared to host-associated microbiome samples such as 66 stool. 16S rRNA amplicon sequencing (16S) is a traditional next-generation technique in which all 67 amplified products are sequenced from a single gene (i.e., the 16S rRNA gene). The technique is 68 limited, however, because annotation is based on putative associations of the 16S rRNA gene with 69 bacterial taxa defined computationally as operational taxonomic units (OTUs). Thus, specific 70 bacterial entities are not directly sequenced, but rather predicted based on OTUs, and consequently 71 have more uncertainty at the lower taxonomy ranks of genus and species (15)(16)(17)(18). Metagenomic 72 whole genome shotgun sequencing (WGS), in which random fragments of the genome are 73 sequenced, is an alternative approach and offers a major advantage in that taxa can be more 74 accurately defined at the genus/species level (16,19). However, WGS is more expensive and requires 75 more extensive data processing and analysis (15,20). Most of the published data on associations of 76 home dust microbiota with environmental exposures or health outcomes have relied on the older 16S 77 methodology. 78 Higher taxonomic classification resolution with WGS provides a more comprehensive description of 79 the microbial community, and may improve the ability to detect novel associations with 80 environmental risk factors, which is important when considering environmental health pathways. In 81 human microbial communities, especially the gut microbiome, WGS generally identifies a larger 82 number of unique phyla and higher overall microbial diversity within samples compared to 16S (16, 83 19-26). However, results are mixed for environmental samples in water and soil (27,28). At present,84 no research has evaluated sequencing methodology on microbial community characterization in 85 indoor home dust samples, and how this will impact the upstream associations with farm and non-86 farm environmental exposures. 87 In the present study, we analyzed samples from 781 participant homes in the Agricultural Lung 88 Health Study (ALHS), a study of farmers and their spouses in North Carolina and Iowa, using 89 advanced WGS methods, and evaluated associations with farm and nonfarm exposures found to be 90 important in previous work based on 16S, in this cohort and others (4,8,29). We considered both 91 microbial community diversity levels and specific bacterial taxa, in order to determine whether WGS 92 can provide novel insights into farming environmental exposure pathways, the results of which are 93 relevant to the design of future research integrating environmental health and microbiology. 94

2
Materials and methods 95

Study population and design 96
ALHS is a case-control study of adult asthma study nested within the Agricultural Health Study 97 (AHS), a prospective cohort of licensed pesticide applicators, mostly farmers and their spouses, 98 enrolled between 1993 and 1997 (30). ALHS participants were selected from among AHS 99 participants who were either farmers or farm spouses in North Carolina (NC) and Iowa (IA) and 100 completed an AHS telephone follow-up conducted from 2005-2010. ALHS enrolled individuals with 101 asthma diagnosis and current asthma symptoms or medication use along with individuals with 102 symptoms and medication use suggesting likely asthma (n = 1,223). The comparison group was a 103 random sample of AHS participants without these criteria (n = 2,078 This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

DNA extraction 123
A random selection (n=879, including 333 asthma cases) of dust samples were sent for WGS analysis 124 ( Figure 1). DNA extraction in described elsewhere (4) Tables S2 and S3 summarize the overall read sequence statistics and  145 proportion of host genome contaminants across samples. Additionally, we accounted for the potential 146 introduction of contaminant DNA sequences during sample collection or laboratory processing by 147 incorporating negative 'blank' sequencing controls of sterile water, with contaminants identified and 148 removed with the decontam R package (v1.10.0) (39). A total of 168 taxa were filtered out 149 (Supplementary Table S4). Because dust samples have low microbial biomass (fewer microbes), we 150 performed two sequencing runs, each with separate quality control processes, and then performed 151 abundance pooling across the two runs. At the sample level, we excluded low-quality samples 152 defined by sequencing depths less than 1000 (Supplementary Figure S2). Rare taxa were filtered out 153 if they did not appear in at least 10 samples (Supplementary Figure S2). This quality control pipeline 154 left 781 samples and 6,528 taxa for downstream analysis. A taxonomy chart was created that 155 assigned all taxa to a taxonomic classification across the seven phylogenetic levels -kingdom, 156 phylum, class, order, family, genus, and species. The Supplemental Methods provides details of the 157 bioinformatic procedures. 158

Statistical analysis 159
We performed all statistical analyses and visualization in R (v4.0.3) (40). We rarefied data to the 160 minimum library size (1,003) across all samples before calculating alpha and beta diversities using 161 the phyloseq R package (v1.34.0) (41). We considered both non-farming exposures, including state 162 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; https://doi.org/10.1101/2023.04.07.23288301 doi: medRxiv preprint of residence, sex, presence of indoor pets, home condition, and season of dust collection, and farming 163 exposures in the past 12 months, including living on a farm, crop farming, and animal farming. All 164 exposures were treated as binary variables. For season of dust collection, we compared one season to 165 all other seasons combined. We included asthma as a covariate in all models due to the nested case-166 control design. 167 To evaluate intra-group alpha diversity and its association with farming and non-farming exposures 168 we used the Shannon index, exponentially transformed for normality, as the outcome in linear 169 models. We first fitted a baseline univariable regression model for each exposure to identify 170 exposures associated with alpha diversity. We also considered whether associations differed by state 171 of residence (IA or NC) by using product terms. Our final multivariable model included any exposure 172 with significant association to alpha diversity from the baseline univariable model, along with any 173 significant product terms for the individual interactions of each exposure with state of residence. 174 Detailed analytical formula were described in Supplemental methods (SM3). We set p<0.05 as the 175 statistical significance threshold for all analyses. 176 To explore beta diversity, we calculated unweighted and weighted UniFrac distance metrics. We 177 conducted permutational multivariate analysis of variance (PERMANOVA) analysis to test the 178 differences in microbial community structure across exposure levels using the adonis method in the 179 R vegan package (v2.5.7) (42, 43). We used the R 2 value to quantify the percentage of variance 180 explained. We did similar analysis as alpha diversity to evaluate differences in associations by state. 181 We conducted non-metric multidimensional scaling (NMDS) analysis to visualize the separation 182 between samples by exposure levels in a two-dimensional space using the phyloseq (v1.34.0) (41) 183 and R ggplot2 (v3.3.6) (44) packages. 184 To identify differentially abundant taxa for each exposure, we used analysis of composition of 185 microbiomes with bias correction (ANCOM-BC, v1.0.5) models (45), which is based on a linear 186 regression framework on the log transformed taxa counts, with exposures as dependent variables and 187 sampling fraction as an offset term. To account for variation in sequencing depth, we performed 188 normalization by estimating the sampling fraction using the ANCOM-BC built-in algorithm. We 189 tested taxa at the OTU level and summarized the results by genus and phylum rank. We also 190 calculated the log2 fold-difference which is the ratio of the mean abundance after normalized by 191 ANCOM-BC across exposure levels. We controlled the false discovery rate (FDR) at 0.05 with the 192 Benjamini-Hochberg (BH) method (46). We determined a taxon to be significantly differentially 193 abundant if it had both p<0.05 after FDR correction and had log2 fold-difference larger than 1 or 194 smaller than -1. We performed sensitivity analyses to evaluate differences in associations by state of 195 residence. 196 Lee et al. (47) analyzed samples for the same population with 16S rRNA amplicon sequencing. To 197 examine differences of house dust microbial profile between these two methods, we compared the 198 taxonomic chart from our WGS data to the previous 16S data to determine the number of unique and 199 overlapping microbial organisms, at the phyla rank, detected by each sequencing method. We note 200 how common or rare the uniquely identified phyla were based on the frequency of assigned taxa and 201 the relative abundance across samples. In addition, we evaluated the differences between alpha 202 diversities (richness and Shannon index) generated by the two sequencing methods by calculating the 203 Spearman's correlation coefficient. 204

3
Results 205 for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; https://doi.org/10.1101/2023.04.07.23288301 doi: medRxiv preprint 3.1 Summary statistics for the study population and metagenomics characteristics 206  Table S6). 226 Figure 3 shows the association between alpha diversity and each exposure. The presence of indoor 228 pets and farming status (living on a farm, crop farming, animal farming with beef cattle, hogs, and 229 poultry) were positively associated with alpha diversity, while good/higher home cleanliness was 230 negatively associated with alpha diversity (p<0.050). State of residence had a suggestive significant 231 association with alpha diversity with p=0.057. In our multivariable primary model including all 232 statistically significant exposures and all significant interaction terms with state of residence, living 233 on a farm and animal farming remained significantly positively related to alpha diversity 234 (Supplementary Table S7). 235

House dust microbial community diversity analysis 227
For beta-diversity, PERMANOVA analysis revealed significant differences in beta diversity for all 236 demographic characteristics and exposure levels based on unweighted UniFrac distance although the

Differential abundance analysis of individual taxa 245
There were 372 unique taxa belonging to 175 genera within 16 unique phyla, that were differentially 246 abundant in relation to at least one exposure (Supplementary Table S8, Supplementary Table S9,  247 Supplementary Figure S5). Animal farming and living on a farm were associated with more 248 for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; https://doi.org/10.1101/2023.04.07.23288301 doi: medRxiv preprint differentially abundant taxa than non-farming exposures. Figure  Living on a farm was associated with differential abundance of 101 taxa (increased abundance for 255 100 taxa and decreased abundance for one taxon in genus Dickeya), which were mainly in phylum 256 Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria (Figure 5b). Among the top 10 taxa, 257 two were in genus Bifidobacterium. The 26 differentially abundant taxa all had increased abundance 258 related to crop farming were mainly in phyla Actinobacteria, Firmicutes, and Proteobacteria ( Figure  259 5c). The most significant taxa were genus Methanobrevibacter and Jeotgalibaca. Animal farming 260 was associated with increased abundance for 191 taxa and decreased abundance for one taxon in 261 phylum Firmicutes (Figure 5d). Genera Methanobrevibacter, Jeotgalibaca, Corynebacterium, 262 Chryseobacterium, Glutamicibacter, Pseudomonas, and Rhodococcus were among the top 10 taxa. 263 Forty-nine taxa were differentially abundant for the presence of indoor pets, mostly in phylum 264 Actinobacteria, Bacteroidetes, Firmicutes, Fusobacteria, and Proteobacteria ( Figure 5e). This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; https://doi.org/10.1101/2023.04.07.23288301 doi: medRxiv preprint Supplementary Table S11, Supplementary Table S12). Therefore, we did not carry interaction 293 products into the differential abundance analysis. When stratifying by state of residence, several 294 exposures, including the presence of indoor pets, living on a farm, and general animal farming, were 295 significantly associated with either alpha or beta diversity in Iowa, where about 2/3 of participants 296 resided but not in North Carolina which has a much smaller sample size (Supplementary Table S13,  297  Supplementary Table S14). Fourteen phyla were consistent for both states with differentially 298 abundant taxa by at least one exposure (Supplementary Table S8, Supplementary Figure S6, 299 Supplementary Figure S7). 300 3.5 Additional findings with WGS from 16S rRNA sequencing results 301 WGS data identified many more taxa and phyla than 16S rRNA. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. individuals (54-56), and their abundance has been shown to be elevated in dust from children with 369 asthma and atopy (57). Pseudomonas was also found to be increased using WGS in classroom dust 370 samples in rural regions near farms compared to suburban areas in China (53). Interestingly, 371 Rhodococcus, Pseudomonas, and Methylobacterium (another microbe positively associated with 372 farm exposures in our data) have been previously identified in agricultural settings, where they can 373 be bioremediation agents and degrade certain pesticides (58). Bifidobacterium is ubiquitous in the 374 human and animal gastrointestinal tract and is associated with positive gut homeostasis, inhibition of 375 pathogen colonization, and modulation of the local and systemic immune system (59, 60). We 376 observed that Methanobrevibacter and Jeotgalibaca, both previously associated with cattle rumen 377 and manure fermentation (61, 62), were increased with crop and animal farming, and unique to dairy 378 and beef cattle farming, which is consistent with previous studies evaluating farm exposures in 379 human microbial communities (12, 63, 64). Two taxa unique to crop farming, Tatumella citrea and 380 Fusarium graminearum, are pathogens associated with grain production (65, 66). Reassuringly, we 381 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; noted an increased abundance of microbes specific to farm and companion animals associated with 382 concurrent exposure to those animals, such as Streptococcus suis with hog farming exposure (67) and 383 Staphylococcus pseudintermedius and felis with dog and cat exposure (68, 69). 384 Our findings suggest that the home dust microbial diversity levels differ between participants 385 exposed to farming activities, as well as pets, both for alpha and beta diversity levels. Overall, the 386 findings from this study were generally similar to those preformed previously using 16S (4). For 387 microbial composition beta diversity, we found distinct microbial community structure based on farm 388 and non-farm exposures, which was significant for all explored variables, similar to results from 16S. 389 The coefficient-of-determination R-squared (R2) statistic was greater using 16S, which supports the 390 hypothesis that WGS resulted in more diverse microbial community identification with greater 391 heterogeneity, so the same exposure would account for less of the variability. Both WGS and 16S 392 findings had low R2 explained variance, consistent with previous research (70). Both analyses 393 showed positive associations between alpha diversity and crop and animal farming. Living on a farm 394 was a significant factor using WGS but not 16S. In addition, there were differences based on the type 395 of animal production, with hog production having a positive association using WGS but not 16S, and 396 dairy cattle production having a positive association using 16S but not WGS (although there was a 397 positive trend). 398 The differences in associations between exposures and Shannon alpha diversity in the WGS 399 compared to our previous 16S data are to be expected given differences between the methods and 400 batch effects when comparing two different methods run three years apart in different laboratories. 401 Alpha diversity was slightly higher in WGS than 16S samples with moderate correlation (Spearman's 402 rho=0.36); unsurprisingly, as a greater number of unique microbes were identified with WGS and is 403 similar to previous research on environmental samples (19). The discrepancies in measurements and 404 effect sizes between WGS and 16S can lead to altered interpretations regarding risk factors for 405 dysbiosis in home dust microbial composition and highlights the importance of how the processing of 406 microbiome samples can impact downstream analyzes. 407 The positive associations with farm exposures and alpha diversity reinforce trends observed in other 408 literature (3,10,12,14,53), in addition to our prior 16S analyses (4). In a study of 203 homes in 409 Finland and Germany, homes located on farms had significantly higher indoor microbial richness and 410 diversity compared to rural non-farm home indoor dust, which was associated with decreased asthma 411 risk in child inhabitants (12) This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Supplementary Material 466
See Supplemental Materials Table of Contents document for a list of the supplementary methods,  467 tables, and figures referenced. 468

Data Availability Statement 469
The raw sequencing data presented in this study are stored in online platform Qiita. Contact the 470 corresponding author for permission for the raw sequence data and metadata. 471 Farm-like indoor microbiota in non-farm homes protects children from asthma development. Nature 503

References
Medicine. 2019;25 (7):1089-95. 504 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023  This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ;https://doi.org/10.1101https://doi.org/10. /2023  This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. each taxon. a Benjamini-Hochberg method is used for FDR correction. lfd: log2 fold-difference. 707 Vertical and horizontal dash lines indicate the threshold of p value after FDR correction and lfd for 708 filtering DA taxa. Sig: DA taxa with p<0.05 after FDR correction (i.e., log10 p<0.5) and lfd>1 (or 709 ldf<-1); NS: non-DA taxa. 710 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

712
Commonly identified differentially abundant taxa shared by farming animal types were aligned by 713 lines (orange), while differential taxa unique to farm animal type is identified by a single dot (blue). 714  715  716  717  718  719  720  721  722  723  724 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 12, 2023. ; https://doi.org/10.1101/2023.04.07.23288301 doi: medRxiv preprint 725 Figure 7. Venn diagram of the number of phyla identified in WGS (blue) and 16S (orange). 17 phyla 726 were identified by both methods (Supplementary Table S14). 727 for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.