A microbial causal mediation analytic tool for health disparity and applications in body mass index

Background: Emerging evidence suggests the potential mediating role of microbiome in health disparities. However, no analytic framework is available to analyze microbiome as a mediator between health disparity and clinical outcome, due to the unique structure of microbiome data, including high dimensionality, sparsity, and compositionality. Methods: Considering the modifiable and quantitative features of microbiome, we propose a microbial causal mediation model framework, SparseMCMM_HD, to uncover the mediating role of microbiome in health disparities, by depicting a plausible path from a non-manipulable exposure (e.g. race or region) to a continuous outcome through microbiome. The proposed SparseMCMM_HD rigorously defines and quantifies the manipulable disparity measure that would be eliminated by equalizing microbiome profiles between comparison and reference groups. Moreover, two tests checking the impact of microbiome on health disparity are proposed. Results: Through three body mass index (BMI) studies selected from the curatedMetagenomicData 3.4.2 package and the American gut project: China vs. USA, China vs. UK, and Asian or Pacific Islander (API) vs. Caucasian, we exhibit the utility of the proposed SparseMCMM_HD framework for investigating microbiome’s contributions in health disparities. Specifically, BMI exhibits disparities and microbial community diversities are significantly distinctive between the reference and comparison groups in all three applications. By employing SparseMCMM_HD, we illustrate that microbiome plays a crucial role in explaining the disparities in BMI between races or regions. 11.99%, 12.90%, and 7.4% of the overall disparity in BMI in China-USA, China-UK, and API-Caucasian comparisons, respectively, would be eliminated if the between-group microbiome profiles were equalized; and 15, 21, and 12 species are identified to play the mediating role respectively. 
Conclusions: The proposed SparseMCMM_HD is an effective and validated tool to elucidate the mediating role of microbiome in health disparity. Three BMI applications shed light on the utility of microbiome in reducing BMI disparity by manipulating microbial profiles.

Specifically, we assume that $M \mid (R, X) \sim \mathrm{Dirichlet}(\gamma_1(R, X), \ldots, \gamma_p(R, X))$, and their microbial relative means are linked with the non-manipulable exposure and covariates $(R, X)$ in the generalized linear model fashion with a log link:

$$\log \gamma_j(R, X) = \alpha_{0j} + \alpha_{Rj} R + \alpha_{Xj}^{\top} X. \tag{2}$$

$\alpha_{0j}$ is the intercept, and $\alpha_{Rj}$ and $\alpha_{Xj}$ are the coefficients of the non-manipulable exposure and covariates for the $j$th taxon, respectively.
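To make the Dirichlet model with a log link concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes that each taxon has a Dirichlet concentration parameter whose logarithm is linear in the exposure and covariates. The function name `simulate_microbiome`, the parameter names `alpha0`, `alpha_R`, `alpha_X`, and all numeric values are illustrative assumptions.

```python
import numpy as np

def simulate_microbiome(R, X, alpha0, alpha_R, alpha_X, rng):
    """Draw compositions M | (R, X) ~ Dirichlet(gamma_1, ..., gamma_p), where
    log(gamma_j) = alpha0[j] + alpha_R[j] * R + X @ alpha_X[:, j] (assumed form)."""
    log_gamma = alpha0 + np.outer(R, alpha_R) + X @ alpha_X  # shape (n, p)
    gamma = np.exp(log_gamma)                                # strictly positive
    M = np.vstack([rng.dirichlet(g) for g in gamma])         # one draw per subject
    expected = gamma / gamma.sum(axis=1, keepdims=True)      # relative means
    return M, expected

# Hypothetical toy setting: 200 subjects, 5 taxa, 2 covariates.
rng = np.random.default_rng(0)
n, p, q = 200, 5, 2
R = rng.integers(0, 2, size=n)                    # binary non-manipulable exposure
X = rng.normal(size=(n, q))                       # covariates
alpha0 = np.full(p, 1.0)                          # intercepts
alpha_R = np.array([0.8, 0.0, -0.5, 0.0, 0.0])    # exposure effects per taxon
alpha_X = rng.normal(scale=0.1, size=(q, p))      # covariate effects per taxon
M, expected = simulate_microbiome(R, X, alpha0, alpha_R, alpha_X, rng)
```

Each row of `M` is a composition that sums to one, and a taxon with a zero entry in `alpha_R` has a concentration parameter that does not depend on the exposure.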

Definition of disparity measures in the counterfactual framework. As discussed in the Background, we propose to conceptualize an overall disparity measure (ODM) on the outcome that can be decomposed into a manipulable disparity measure (MDM) and a residual disparity measure (RDM). MDM represents the portion of disparity that would be eliminated by equalizing microbiome profiles between comparison and reference groups, and RDM represents the portion that would remain even after microbiome profiles between comparison and reference groups were equalized. With the counterfactual notation, …

With these sufficient identifiability assumptions and the models (1)-(2) proposed in the SparseMCMM_HD framework, disparity measures MDM, RDM, and ODM can be further expressed, respectively, as follows (see Section S1 for the detailed derivations): …

… thus $\mathrm{MDM}_j$ is non-zero only when both the $j$th microbial effect on the outcome and the exposure effect on the $j$th taxon are not zero. Therefore, SparseMCMM_HD illuminates the mediating role of microbiome in the health disparity of outcome, and quantifies the manipulable disparity for the overall microbiome community and for each specific taxon, respectively.

Parameter estimation. Note that in [12], we have demonstrated the excellent performance of SparseMCMM in terms of estimation by extensive simulations and real data analysis in various scenarios. Thus for SparseMCMM_HD, we directly employ the same two-step procedure to estimate the regression parameters in models (1)-(2) to obtain the estimated RDM, MDM, $\mathrm{MDM}_j$ for each taxon, and ODM. Furthermore, SparseMCMM_HD has the full capability to perform variable selection with regularization strategies, selecting the signature causal microbes that play mediating roles in the disparity of the continuous outcome. Specifically, L1 norm and group-lasso penalties are incorporated for variable selection while addressing the heredity condition.

Hypothesis tests for manipulable disparity. Similarly, we employ the hypothesis tests for mediation effects in SparseMCMM to examine whether microbiome has any mediation effect on the disparity in an outcome, at both community and taxon levels. Specifically, regarding the null hypothesis of no manipulable disparity, $H_0: \mathrm{MDM} = 0$, the first test statistic is defined as $\mathrm{OMD} = \widehat{\mathrm{MDM}}$, the estimator of the manipulable disparity. Meanwhile, we consider another null hypothesis, $H_0: \mathrm{MDM}_j = 0, \forall j \in \{1, \cdots, p\}$, and define the second test statistic as $\mathrm{CMD} = \sum_{j=1}^{p} \widehat{\mathrm{MDM}}_j^{2}$, the summation of the squared estimators of individual mediation effects across all taxa. A permutation procedure is employed to assess the significance of these two test statistics. This provides a mechanism to check whether microbiome has any impact on health disparity that could be potentially eliminated through microbiome.

Implementation. The simulation evaluation results regarding the estimation and testing of SparseMCMM [12] are applicable to the SparseMCMM_HD framework. Therefore, the proposed SparseMCMM_HD is a validated analytic tool that illuminates the mediating role of microbiome in the disparity of outcome, and quantifies the manipulable disparity for the overall microbiome community and for each specific taxon, respectively. In practice, we perform both parameter estimation and hypothesis testing using the analytical procedures in the SparseMCMM package and illustrate the proposed SparseMCMM_HD pipeline through an interactive web app (https://chanw0.shinyapps.io/sparsemcmm_hd/).
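The permutation logic behind the two test statistics can be sketched as follows. This is an illustration under stated assumptions, not the SparseMCMM procedure: `toy_taxon_effects` is a hypothetical stand-in estimator (a simple product of regression coefficients), the overall statistic is assumed to be the sum of the per-taxon effects, the overall test is assumed two-sided, and permuting the rows of the microbiome matrix is only one possible permutation scheme.

```python
import numpy as np

def toy_taxon_effects(R, M, Y):
    """Hypothetical stand-in estimator (NOT the SparseMCMM two-step procedure).
    For each taxon j: (group difference in mean abundance) times
    (slope of Y on M_j, adjusted for R)."""
    a = M[R == 1].mean(axis=0) - M[R == 0].mean(axis=0)
    b = np.empty(M.shape[1])
    for j in range(M.shape[1]):
        design = np.column_stack([np.ones(len(Y)), R, M[:, j]])
        coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
        b[j] = coef[2]
    return a * b

def permutation_pvalues(R, M, Y, estimator, n_perm=200, seed=0):
    """Permutation p-values for an overall (OMD-like) and a
    sum-of-squares (CMD-like) statistic."""
    rng = np.random.default_rng(seed)
    effects = estimator(R, M, Y)
    omd_obs = abs(effects.sum())        # assumption: overall effect = sum over taxa
    cmd_obs = np.sum(effects ** 2)      # sum of squared per-taxon effects
    omd_hits = cmd_hits = 0
    for _ in range(n_perm):
        M_perm = M[rng.permutation(len(Y))]   # break the link between M and (R, Y)
        e = estimator(R, M_perm, Y)
        omd_hits += int(abs(e.sum()) >= omd_obs)
        cmd_hits += int(np.sum(e ** 2) >= cmd_obs)
    # Add-one correction keeps the p-values strictly positive.
    return (omd_hits + 1) / (n_perm + 1), (cmd_hits + 1) / (n_perm + 1)

# Hypothetical toy data with a mediated effect through the first taxon.
rng = np.random.default_rng(1)
n, p = 120, 4
R = rng.integers(0, 2, size=n)
conc = np.ones((n, p)) * 5.0
conc[:, 0] += 5.0 * R                                  # exposure shifts taxon 1
M = np.vstack([rng.dirichlet(c) for c in conc])
Y = 25.0 + 1.0 * R + 10.0 * M[:, 0] + rng.normal(size=n)
p_omd, p_cmd = permutation_pvalues(R, M, Y, toy_taxon_effects)
```

Any estimator that returns one effect per taxon can be passed in place of `toy_taxon_effects`; the permutation wrapper itself does not depend on how the effects are estimated.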

Control for confounding covariates
Due to the non-manipulable nature of the exposure in health disparity research, it is in principle impossible to design a randomized trial on the exposure of interest to eliminate the potential confounding effect on the causal pathway of interest. Many studies on health disparity are observational and usually include significant degrees of confounding, due to factors such as lifestyle, health status, and disease history. We want to emphasize that controlling for confounding covariates is a necessary step when utilizing the proposed SparseMCMM_HD to estimate RDM, MDM, and ODM in a typical observational study. Specifically, we propose to perform propensity score matching (PSM) [31], a commonly used method in biomedical research to create a balanced covariate distribution between two groups, to control confounding covariates in our applications (see Section S2). The standardized mean difference (SMD) is used to evaluate the balance of the covariate distributions between groups. An SMD of less than 0.1 indicates a balanced distribution [32]. The matched data will then be used to quantify RDM, MDM, and ODM, and to examine whether the microbiome could reduce the health disparity between two non-manipulable exposure groups. This confounding control procedure has been included as a preprocessing step in the proposed SparseMCMM_HD analytic pipeline.
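As a small illustration of the balance check, the sketch below computes the SMD for one continuous covariate and applies the 0.1 threshold. It assumes the common pooled-standard-deviation definition of the SMD; the exact variant used in the pipeline is not stated here, and the function names and example ages are hypothetical.

```python
import numpy as np

def standardized_mean_difference(x_comparison, x_reference):
    """Absolute SMD for a continuous covariate, using the pooled
    standard deviation of the two groups (assumed definition)."""
    x1 = np.asarray(x_comparison, dtype=float)
    x0 = np.asarray(x_reference, dtype=float)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2.0)
    return abs(x1.mean() - x0.mean()) / pooled_sd

def is_balanced(smd, threshold=0.1):
    """A covariate is treated as balanced when its SMD is below the threshold."""
    return smd < threshold

# Hypothetical ages in two groups before matching.
age_comparison = [31, 45, 52, 38, 60]
age_reference = [33, 47, 50, 41, 58]
smd_age = standardized_mean_difference(age_comparison, age_reference)
balanced = is_balanced(smd_age)
```

In a matched analysis this check would be repeated for every covariate, before and after matching, to confirm that matching moved each SMD below the threshold.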

curatedMetagenomicData V3.4.2
The curatedMetagenomicData 3.4.2 package [28] provides a curated human microbiome meta dataset aggregated from 86 shotgun sequencing cohorts in 6 body sites. The raw sequencing data were processed using the same bioinformatics protocol and pipelines. Each sample has 6 types of data available: gene family, marker abundance, marker presence, pathway abundance, pathway coverage, and taxonomic (relative) abundance. The taxonomic abundance was calculated with MetaPhlAn3, and metabolic functional potential was calculated with HUMAnN3. The manually curated clinical and phenotypic metadata are available as well. More details can be found in the curatedMetagenomicData package documentation [28]. Here we focus on healthy subjects to explore the relationship among region, microbiome, and BMI. Specifically, we chose subjects from all cohorts based on the following inclusion criteria: 1) healthy status; 2) no missing values in BMI, gender, and age; 3) age ≥ 18; 4) not pregnant; 5) currently no antibiotic use; 6) currently no alcohol consumption; 7) no smoking; and 8) fecal sample with more than 1,250 sample reads. In addition, when multiple samples were available for a subject, we randomly selected one sample. Overall, we identified 4,868 healthy adults from different regions. Here we further focus on three regional groups which have large sample sizes: China (n=570), United States (USA; n=350), and United Kingdom (UK; n=1019) for the analysis in the main text. Specifically, we conducted two comparison studies, the China-USA and China-UK comparisons, to investigate the regional difference of BMI in the China group compared to the USA and UK groups, respectively.
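The read-depth criterion and the random choice of one sample per subject can be sketched as below. This is only an illustration: the record fields `subject_id`, `sample_id`, and `reads` are hypothetical names, and the remaining inclusion criteria are omitted for brevity.

```python
import random
from collections import defaultdict

def select_one_sample_per_subject(samples, min_reads=1250, seed=0):
    """Keep samples with more than `min_reads` reads, then randomly pick
    one remaining sample for each subject."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    by_subject = defaultdict(list)
    for sample in samples:
        if sample["reads"] > min_reads:
            by_subject[sample["subject_id"]].append(sample)
    # Sort subjects so the result does not depend on input order of subjects.
    return {sid: rng.choice(group) for sid, group in sorted(by_subject.items())}

# Hypothetical sample records.
samples = [
    {"subject_id": "A", "sample_id": "A1", "reads": 5000},
    {"subject_id": "A", "sample_id": "A2", "reads": 8000},
    {"subject_id": "A", "sample_id": "A3", "reads": 900},    # too few reads
    {"subject_id": "B", "sample_id": "B1", "reads": 1000},   # too few reads
    {"subject_id": "C", "sample_id": "C1", "reads": 20000},
]
selected = select_one_sample_per_subject(samples)
```

Subjects whose samples all fall at or below the read threshold drop out entirely, which is why subject `B` is absent from `selected`.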

The American Gut Project (AGP) is a crowd-sourced citizen science cohort that aims to comprehensively characterize the human gut microbiota and to identify factors linked to it. The AGP includes 16S rRNA V4 gene sequences from more than 8,000 fecal samples obtained using standard pipelines, together with host metadata. Detailed descriptions can be found in Liu et al. and Hu et al. [1,33]. Our primary investigation is on the disparity of BMI between Asian or Pacific Islander (API) and non-Hispanic Caucasian adults. We selected a subset of the AGP data based on the following inclusion criteria: 1) USA resident; 2) Asian or Pacific Islander or Caucasian race; 3) no missing values in gender, age, and BMI; 4) age ≥ 18; 5) BMI ≤ 80; 6) 210 cm ≥ height ≥ 80 cm; 7) 200 kg ≥ weight ≥ 35 kg; 8) fecal sample with more than 1,250 sample reads; 9) not a duplicate sample; and 10) no self-reported history of inflammatory bowel disease, diabetes, or antibiotic use in the past year. Subjects were filtered out when the reported BMI was not consistent with the BMI calculated from the reported height and weight, i.e., $|\mathrm{BMI}_{\text{reported}} - \mathrm{BMI}_{\text{calculated}}| / \mathrm{BMI}_{\text{calculated}} > 5\%$. A dataset with 130 API and 2,263 Caucasian adults is then used in this paper (Figure S1a).
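The BMI consistency filter can be illustrated with a short sketch. It assumes the standard BMI definition, weight in kilograms divided by squared height in metres; the function names and the example values are hypothetical.

```python
def bmi_from_height_weight(height_cm, weight_kg):
    """Standard BMI: weight (kg) divided by squared height (m)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def bmi_is_consistent(bmi_reported, height_cm, weight_kg, tolerance=0.05):
    """True when the reported BMI is within `tolerance` (relative difference)
    of the BMI calculated from the reported height and weight."""
    bmi_calculated = bmi_from_height_weight(height_cm, weight_kg)
    return abs(bmi_reported - bmi_calculated) / bmi_calculated <= tolerance

# Hypothetical subject: 170 cm and 65 kg give a calculated BMI of about 22.5.
keep_a = bmi_is_consistent(22.5, height_cm=170, weight_kg=65)   # consistent
keep_b = bmi_is_consistent(25.0, height_cm=170, weight_kg=65)   # off by about 11%
```

Subjects for whom the check returns `False` would be excluded, matching the 5% relative-difference rule described above.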

Data pre-processing and PSM were conducted in the three BMI studies. Specifically, for the China-USA and China-UK comparisons, we performed PSM with the parameters described in Section S2 to control for age and gender. For the API-Caucasian comparison, as the AGP includes more than 400 covariates that were collected through self-reported surveys, we first implemented several pre-processing steps to prepare the self-reported covariates for the subsequent analysis, including cleaning up inconsistent variable definitions and collapsing sparse categorical variables into fewer and less sparse categories. Details are provided in Section S3. Forty-four covariates were retained for PSM. We performed univariate linear regressions to identify the potential confounding variables for the relationship among race, microbiome, and BMI. Twenty-three covariates (p-value ≤ 0.05; Figure S1b) were identified as confounders that needed to be controlled further through PSM.

With the matched data, alpha diversity (Observed, Shannon, and Simpson indices) and beta diversity (Bray-Curtis dissimilarity and Jensen-Shannon divergence) were used to estimate microbial community-level diversity. t-tests were used for group comparisons of BMI and alpha diversity. Permutational multivariate analysis of variance (PERMANOVA) [34] was used to assess group differences in beta diversity. We applied the proposed SparseMCMM_HD framework at the species rank (Section S4) to quantify RDM, MDM, and ODM, and to examine whether the microbiome could explain the health disparity between two non-manipulable exposure groups. The proposed SparseMCMM_HD pipeline was implemented through an interactive web app (https://chanw0.shinyapps.io/sparsemcmm_hd/) for easy exploration.

… Figure 2a and 2d).
… CMD show that the overall and component-wise MDMs through microbiome are significant in both datasets for regional differences in BMI (all p-values < 0.001 based on 1,000 permutations). Figure 3a shows that the ODMs of BMI are 3.17 and 2.79, respectively, for the matched Chinese and USA subjects, and the matched Chinese and UK subjects; the corresponding MDMs due to microbiome are 0.38 and 0.36. These results suggest that 11.99% and 12.90% of the disparity in BMI between the Chinese and matched USA and UK groups, respectively, would be eliminated if the between-group microbiome profiles were equalized.
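The reported percentages follow directly from the ratio of MDM to ODM, as the short check below shows. It uses only the rounded values quoted above; the residual term is computed by reading the decomposition of ODM into MDM and RDM as additive, and the function name is a hypothetical helper.

```python
def percent_manipulable(mdm, odm):
    """Percentage of the overall disparity (ODM) accounted for by MDM."""
    if odm == 0:
        raise ValueError("ODM is zero, so the proportion is undefined")
    return 100.0 * mdm / odm

# (MDM, ODM) pairs as reported for the two regional comparisons.
reported = {"China-USA": (0.38, 3.17), "China-UK": (0.36, 2.79)}
for name, (mdm, odm) in reported.items():
    rdm = odm - mdm  # residual disparity under an additive decomposition
    print(f"{name}: MDM/ODM = {percent_manipulable(mdm, odm):.2f}%, RDM = {rdm:.2f}")
```

Because the inputs are themselves rounded, ratios recomputed this way can differ slightly from percentages derived from unrounded estimates.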

Results
Significant CMD testing results show that there is at least one species playing a mediating role in the disparity of BMI between Chinese and USA subjects, and between Chinese and UK subjects. Figure 3b reports the 15 species and 21 species further identified by SparseMCMM_HD, with the point and 95% confidence interval (CI) estimates for their mediation effects on the regional differences of BMI between China and USA, and between China and UK, respectively. Among the twelve overlapping species identified in both matched datasets (Figure 3b and 3c), five species (Anaerostipes hadrus, Bacteroides plebeius, Bacteroides thetaiotaomicron, Bacteroides uniformis, and Escherichia coli) play positive mediating roles in the regional disparity in BMI for Chinese compared to USA subjects, and for Chinese compared to UK subjects. For these five species, the relative abundances (Figure 4a) and the associations with BMI (Figure 4b) are quite similar between the two independent studies, the China-USA comparison and the China-UK comparison, which supports their mediating roles in the regional disparity in BMI. Consistent with published studies, B. plebeius, B. thetaiotaomicron, and B. uniformis belong to the same genus, Bacteroides, and all play important roles in human metabolism and have been linked with diet-induced obesity, by improving whole-body glucose disposal, promoting lipid digestion and absorption, and degrading host-derived carbohydrates [35][36][37][38]. B. thetaiotaomicron also possesses a glycine lipid biosynthesis pathway (Figure S4). A. hadrus and E. coli have also been reported by multiple studies to contribute to, or be associated with, BMI or obesity [39][40][41]. On the other hand, 12 species play mediating roles in BMI but with opposite directions between the China-USA comparison and the China-UK comparison, which reflects the distinguishing characteristics between USA and UK (Figure S5). This is not surprising considering that the microbial profile is inherently dynamic and racially or geographically specific. Moreover, there are three and nine unique species identified in the China-USA and China-UK comparisons, respectively (Figures S6 and S7). Most of these study-specific species have been reported as being associated with BMI, obesity, or metabolic disorders [41][42][43][44][45][46][47][48][49][50]. Notably, Anaerostipes hadrus, Fusicatenibacter saccharivorans, Lachnospira pectinoschiza, and Roseburia inulinivorans belong to the family Lachnospiraceae (Figure 5d), which is related to metabolic syndrome and obesity and whose controversial role has been discussed across different studies [51].

… Observed, Shannon, and Simpson diversities (p-value = $3.1 \times 10^{-5}$, $1.5 \times 10^{-4}$, and $3.9 \times 10^{-3}$, respectively; Figure S10a). For beta diversity, Bray-Curtis dissimilarity and Jensen-Shannon divergence both show that Caucasian samples have different community structures compared to API samples (PERMANOVA p-value = 0.0036 and 0.0012, respectively; Figure S10b).

Results for AGP
Taxon-level analysis. The above community-level results indicate that the microbiome may play a mediating role in the racial disparity of BMI. To investigate this hypothesis, we applied the proposed SparseMCMM_HD to this matched dataset. With the filtering criteria described in Section S4, 28 species are included in the following taxon-level analysis.

We found that the ODM of BMI between Caucasians and APIs is 1.63 (Figure 5b). The microbiome plays a significant role in mediating the racial disparity of BMI, as indicated by the test results of both OMD (p-value = 0.038) and CMD (p-value = 0.048). The microbial manipulable disparity measure MDM is 0.12. This suggests that the difference in microbiome profiles contributes 7.4% of the ODM, which would be eliminated if the microbiome profiles between the Caucasians and APIs were identical.

We further identified 12 species playing mediating roles in the racial disparity of BMI between the Caucasians and APIs (Figure 5c) … China-USA and China-UK illustrated in the previous subsection (Figure 5d). The literature reveals that all identified species are associated with BMI or obesity [41][42][43][44][45][46][47][48][49].

Collectively, the findings in the matched China vs. USA, China vs. UK, and API vs. Caucasian datasets show that the microbiome is an important mediator in the regional or racial disparity of BMI, and they shed light on how to reduce the disparity of BMI. The identified microbial agents could serve as potential therapeutic targets for future treatments based on microbiota modulation.

Emerging evidence highlights the potential of the microbiome in understanding health disparity. In this paper, we proposed a mediation analytical framework, SparseMCMM_HD, to investigate the microbiome's role in health disparity. Considering a health disparity framework with three components, namely a non-manipulable exposure (e.g. race or region), the microbiome as mediator, and an outcome, the proposed SparseMCMM_HD decomposes the overall health disparity of the non-manipulable exposure on the outcome into two components: MDM, which would be eliminated by equalizing microbiome profiles, and RDM, which would remain and could not be explained through the microbiome. Notably, MDM paves a viable path towards reduction of health disparity with microbial modulation. Similar to SparseMCMM, SparseMCMM_HD can be used to identify the signature causal microbes and examine whether the overall or component-wise MDM is significantly non-zero.

It is vital to control confounding effects beforehand in real data analysis to satisfy the identifiability assumptions of the proposed SparseMCMM_HD. In the three BMI applications, we first employed PSM to remove the confounding effects by selecting matched subsets in which the distributions of confounders were comparable between the two exposure groups, and then applied the proposed SparseMCMM_HD framework. The utilization of SparseMCMM_HD in two datasets, the curatedMetagenomicData 3.4.2 package and the AGP dataset, depicts an explicit causal path among region or race, microbiome, and BMI. These findings confirm not only that the microbiome is differentially distributed across races or regions, but also that the differential microbiome profile contributes to the disparities in BMI across races or regions. The identified microbial signatures potentially aid in developing personalized medication or nutrition to reduce obesity disparity.
It is not surprising that the proportion of disparities in BMI explained by the microbiome profiles is not large (~10%) in all three applications, due to the heritable and polygenic nature of BMI [54,55]. Further investigations to integrate the microbiome profile and genetic factors are necessary to better understand disparity in BMI. However, we here emphasize that the proposed SparseMCMM_HD is a rigorous and validated causal mediation framework and has preeminent potential to identify the microbiome's roles in … consistently and accurately better than others in all circumstances. However, since the assumptions for model identification in health disparity are weaker than those for the causal mediation effects in the manipulable exposure-mediator-outcome framework [23], it is expected that the idea of how the proposed SparseMCMM_HD framework rigorously defines, quantifies, and tests health disparity measures as an extension of SparseMCMM [12] can provide insight into extending these available mediation models to investigate the microbiome's role in health disparity. Then, a useful path forward will be to jointly employ these multiple and complementary methods to better characterize the microbiome's role in health disparity by capitalizing on their distinct assumptions and strengths.

Figure S1. Flowcharts for data pre-processing in the AGP dataset. a Pre-processing for all covariates. b The sample breakdown for the disparity analysis.

Figure S2. Plots of standardized mean differences before and after propensity score matching for the datasets from the curatedMetagenomicData package [28]. a Comparison between Chinese and USA subjects. b Comparison between Chinese and UK subjects.

Figure S3. Histogram plots of propensity score before and after propensity score matching for the datasets from the curatedMetagenomicData package [28]. a Comparison between Chinese and USA subjects. b Comparison between Chinese and UK subjects.

… Scatterplots of BMI and the relative abundances of these identified species in the matched Chinese and USA samples, and the matched Chinese and UK samples, respectively.

Figure S6. The species playing mediating roles in the disparity of BMI in the comparison between Chinese and USA subjects only. a Violin plots illustrating the relative abundances of these identified species in the matched Chinese and USA samples. b Scatterplots of BMI and the relative abundances of these identified species in the matched Chinese and USA samples.

Figure S7. The species playing mediating roles in the disparity of BMI in the comparison between Chinese and UK subjects only. a Violin plots illustrating the relative abundances of these identified species in the matched Chinese and UK samples. b Scatterplots of BMI and the relative abundances of these identified species in the matched Chinese and UK samples.

Figure S8. Plots of standardized mean differences before and after propensity score matching for the comparison between the API and Caucasian samples from the AGP dataset. API: Asian or Pacific Islander.

Figure S9. Histogram plots of propensity score before and after propensity score matching for the comparison between the API and Caucasian samples from the AGP dataset. API: Asian or Pacific Islander.

Figure S10. Association analyses in the AGP dataset. a Violin plots of alpha diversities including Observed, Shannon, and Simpson indices in the matched API and Caucasian samples. b PCoA plots using Bray-Curtis dissimilarity and Jensen-Shannon divergence in the matched API and Caucasian samples. API: Asian or Pacific Islander.

Figure 1
Microbiome (M) may play a mediating role in the health disparity of the continuous outcome (Y) between two categories of a non-manipulable exposure group (e.g. race or region) (R). We aim to investigate how much disparity of the outcome Y can be reduced by manipulating microbiome profiles.

Figure 3
Health disparity analyses in two matched datasets from the curatedMetagenomicData package [28]. a Manipulable disparity measure (MDM) and residual disparity measure (RDM) of BMI in the China-USA comparison and China-UK comparison, respectively. b Component-wise point and 95% CI estimates of $\mathrm{MDM}_j$ for the identified species that have mediation effects on the differences of BMI between matched Chinese vs. USA subjects and between matched Chinese vs. UK subjects, respectively. The 95% CI estimates of $\mathrm{MDM}_j$ were calculated by a bootstrapping procedure with 50 bootstrap replicates. c Venn diagram showing the relationship of the species having mediation effects in the disparity of BMI among the China-USA, China-UK, and API-Caucasian comparisons. API: Asian or Pacific Islander.

Figure 4
Five species that play positive mediation roles in the disparity of BMI in both China-USA and China-UK comparisons. a Violin plots illustrating the relative abundances of these five identified species in the matched Chinese and USA samples, and the matched Chinese and UK samples, respectively. b Scatterplots of BMI and the relative abundances of these five identified species in the matched Chinese and USA subjects, and the matched Chinese and UK subjects, respectively.

Figure 5
Health disparity analyses in the matched APIs and Caucasians from the AGP dataset. a Violin plots of BMI in the matched APIs and Caucasians from the AGP dataset. b MDM and RDM of BMI in the API-Caucasian comparison. c Component-wise point and 95% CI estimates of $\mathrm{MDM}_j$ for the identified species that have mediation effects on the differences of BMI between matched APIs and Caucasians from the AGP dataset. The 95% CI estimates of $\mathrm{MDM}_j$ were calculated by a bootstrapping procedure with 50 bootstrap replicates. d The taxonomic relationship of the species having mediation effects in the disparity of BMI among the China-USA, China-UK, and API-Caucasian comparisons. The tree figure was generated by Metacoder [65]. From the outer to the center, taxonomic ranks are species, genus, family, order, class, phylum, and kingdom (Bacteria), respectively. For each species, color represents the number of comparisons that identify it among the China-USA, China-UK, and API-Caucasian comparisons.
APIs: Asian or Pacific Islanders.
