Interaction molecular QTL mapping discovers cellular and environmental modifiers of genetic regulatory effects

Bulk tissue molecular quantitative trait loci (QTLs) have been the starting point for interpreting disease-associated variants, while context-specific QTLs show particular relevance for disease. Here, we present the results of mapping interaction QTLs (iQTLs) for cell type, age, and other phenotypic variables in multi-omic, longitudinal data from blood of individuals of diverse ancestries. By modeling the interaction between genotype and estimated cell type proportions, we demonstrate that cell type iQTLs could be considered as proxies for cell type-specific QTL effects. The interpretation of age iQTLs, however, warrants caution as the moderation effect of age on the genotype and molecular phenotype association may be mediated by changes in cell type composition. Finally, we show that cell type iQTLs contribute to cell type-specific enrichment of diseases that, in combination with additional functional data, may guide future functional studies. Overall, this study highlights iQTLs to gain insights into the context-specificity of regulatory effects.

. Correlation between estimated and measured eosinophils proportions by selfreported race/ethnicity. A)-B) Scatter plot of estimated and measured eosinophils proportions with the CIBERSORT method from PBMC RNA-seq (A) and with the Houseman method from whole blood DNA methylation (B) using Exam 5 data. Measured eosinophils proportions are calculated by dividing the given relative count by the total WBC count.    ( 1) and nominally significant GxCell type effects are used as the measure of reproducibility. On average, excluding monocyte imeQTLs, 27% of cell type iQTLs with uncertain direction compared to 81.2% of cell type iQTLs with positive/negative direction showed nominally significant interaction P-value in the validation exam. For calculating 1, lambda was set to 0.85, if more than 100 ieGenes or imeSites were found from the validation set, or lambda was set to 0.5, if more than 20 ieGenes or imeSites were found from the validation set. Figure S8. Population-specificity of cell type iQTLs. A) Comparison of allele frequencies of the alternative allele calculated from the whole set of individuals used for monocyte ieQTL mapping and a specific group in MESA (based on self-reported race/ethnicity) separately for unique monocyte ieQTLs with positive direction discovered in exam 1 (top row) and exam 1 (bottom row). Note that expression data is available for individuals of Chinese ancestries only in exam 1. B) Venn diagram of monocyte ieGenes that have been fine-mapped to likely causal eQTLs in monocytes by self-reported race/ethnicity. C) Comparison of monocyte ieQTLs by direction of effect and fine-mapped monocyte eQTLs from MESA using all the individuals with expression data from purified monocytes or by self-reported race/ethnicity. For each gene with a monocyte ieQTL (ieGene) with FDR < 0.05 in exam 1 or exam 5, we assessed whether 1) the ieGene has been fine-mapped to putative causal variants in monocytes, 2) if yes, then whether the maximum LD between the lead ieQTL and fine-mapped monocyte eQTLs from 95% credible set is above or below a specified threshold. Figure S9. Pairwise sharing of cell type iQTLs. Sharing of A) cell type ieQTLs with positive direction, B) cell type ieQTLs with negative direction, C) sentinel cell type imeQTLs with positive direction, and D) sentinel cell type imeQTLs with negative direction among the same type of i-molQTLs quantified as the proportion of lead cell type iVariants in LD (r 2 > 0.5). To calculate the proportion of iQTLs in LD, the denominator is set to the minimum of the number of iQTLs in the query and validation set. Figure S10. Sharing between cell type ieQTLs and cell type imeQTLs. A) Sharing between cell type ieQTLs with positive direction and sentinel cell type imeQTLs (FDR < 0.05 in exam 1 or exam 5) and B) sentinel cell type imeQTLs with positive direction and cell type ieQTLs (FDR < 0.05 in exam 1 or exam 5) quantified as the proportion of lead cell type iQTLs in LD (r 2 > 0.5). Pvalue shows, for example, the significance of the odds for Monocyte ieQTLs with positive direction to overlap with Neutrophil imeQTLs with positive direction than for Monocyte ieQTL with negative direction. To estimate the odds ratio if any cell is equal to zero in the 2x2 table, Haldane-Anscombe correction was used by adding a fixed value of 0.5 to all cells. Figure S11. Replication of cell type ieQTLs in the eQTL Catalogue. Replication of cell type ieQTLs with FDR < 0.05 in exam 1 or exam 5 data using the eQTL data from various blood cell types in the eQTL Catalogue, 45 eQTL datasets in total. Replication is quantified as the proportion of true positives ( 1). Clustering of the rows and columns is based on the Euclidean distance and using the complete-linkage method. Figure S12. Functional enrichment analysis of cell type iQTLs. Functional enrichment analysis with GoShifter showing the delta-overlap, which is the difference between the observed proportion of loci overlapping a cCRE and the mean of the proportion of loci overlapping the cCRE under the null derived by local shifting, for A) cell type ieQTLs overlapping different cCRE from ENCODE and B) sentinel cell type imeQTLs overlapping different cCRE from ENCODE. Negative delta-overlap indicates cases where observed proportion of loci overlapping an annotation was less than under the null. *** -significant association after correcting for the number of target cell types with cCRE data, number of cell types tested for interaction effect, and the number of groups of direction of effect, applied separately for cell type ieQTLs and cell type imeQTLs. Figure S13. Reproducibility of age, sex, and smoking. Reproducibility of trait iQTLs by direction of effect using one of the exams as discovery and the other as validation, and vice versa, for A) age ieQTLs and sentinel age imeQTLs, B) sex ieQTLs and sentinel sex imeQTLs, C) smoking-current ieQTLs and sentinel smoking-current imeQTLs, D) smoking-status ieQTLs and sentinel smoking-status imeQTLs, E) smoking-cotinine ieQTLs and sentinel smokingcotinine imeQTLs. The proportion of true positives ( 1) is used as the measure of reproducibility. Figure S14. Association between estimated cell type proportions and various traits in MESA. Measuring the effect of various traits on estimated cell type proportions in MESA by time point (exam 1 or exam 5) shown as A) -log10(P-value) and B) effect size. The models were adjusted for age, sex, self-reported race/ethnicity, educational attainment, site, and season of the exam. If genotype PCs were the traits of interest, then self-reported race/ethnicity was not included as a covariate. For this analysis, cell type proportions were inverse normalized, and numeric traits were scaled by dividing by two times its standard deviations to make the effect size comparable between untransformed binary traits and numeric traits. ** -denotes Bonferroni significant associations, where P-value < 0.05 / (# of traits groups × # cell type groups), where the count of cell type group is equal to 5 (B cells, T cells/CD4 T cells/CD8 T cells, NK cells, monocytes, neutrophils), * -denotes nominally significant association. Figure S15. Enrichment of trait iQTLs as cell type iQTLs. QQ-plots showing the inflation of GxCell type interaction P-value for A, F) age ieQTLs and sentinel age imeQTLs, B, G) sex ieQTLs and sentinel sex imeQTLs, C, H) smoking-current ieQTLs and sentinel smoking-current imeQTLs, D, I) smoking-status ieQTLs and sentinel smoking-status imeQTLs, E, J) smokingcotinine ieQTLs and sentinel smoking-cotinine imeQTLs based on exam 1 and exam 5 data, respectively. Trait iQTLs are classified into two groups based on the direction of effect -positive or negative and uncertain, if trait is continuous; or no effect in one group and other, if trait is binary. Lambda (λ) is the inflation factor to measure evidence for global inflation of GxCell type effects among trait iQTLs. Median inflation of GxMonocyte effect across sex and smoking ieQTLs (leaving out uncertain direction) was λ = 1.28 in exam 1 and λ = 2.64 in exam 5, respectively. Median inflation of GxNeutrophil effect across sentinel sex and smoking imeQTLs (leaving out uncertain direction) was λ = 1.66 in exam 1 and λ = 1.52 in exam 5, respectively. We used the lavaan R package for SEM and the mediate R package for mediation analysis. We simulated data for n = 1000 individuals mimicking the moderation effect of age on DNAm mediated by a neutrophil proportion. Genotype vector G was simulated from binomial distribution -G ~ B(2, 0.35), moderator W from normal distribution -W ~ N(0, 9), mediator M as a function of moderator -M ~ 0.0015*W + N(0, 0.1). All the variables were mean-centered, and W was also scaled. Molecular phenotype Y was simulated as a function of G, M, and W including interactions -Y ~ -0.25*G + 1.5*M -3*GxM -0.03*GxW + N(0, 0.5). C) The effect of age on estimated neutrophil proportion in MESA exam 5. The adjusted model includes sex, self-reported race/ethnicity, educational attainment, site, and season as covariates. D) Results of mediation analysis assessing the mediation of GxAge effect on DNA methylation via GxNeutrophil effect. Barplots show the P-value distribution of average direct effect (ADE), total effect, and proportion of mediated effect. Figure S17. Age imeQTLs and cell type composition probes. A) Evidence for age imeSites from exam 5 to be associated with cell type composition by the direction of imeQTL effect. Group "not significant" includes imeQTL with adjusted P-value > 0.1. Dashed line denotes nominal significance (P-value of cell type composition < 0.05) and dotted line denotes significant association after Bonferroni correction (P-value of cell type composition < 0.05/456,655). Only autosomal probes present on the 450K array are included. Using Wilcoxon rank sum test to compare "not significant" group with others. B) Comparison of support for the moderation of effect of age to be mediated by neutrophil % and cell type composition effects of probes. Included are age imeQTLs from exam 5 with positive or negative direction that are also present on the 450K array. Dashed line denotes nominal significance (P-value of cell type composition < 0.05 or P-value of ACME < 0.05) and dotted line denotes significant association after Bonferroni correction (P-value of cell type composition < 0.05/456,655).

Figure S18. Colocalization between cell type iQTLs and selected GWAS traits.
Summary of colocalization analysis between cell type iQTLs and selected GWAS traits by the direction of the cell type iQTL effect. We included cell type ieQTLs with FDR < 0.25 in exam 1 or exam 5 and sentinel cell type imeQTLs with FDR < 0.05 in exam 1 or exam 5 to this analysis. Support for colocalization is defined as PP4 > 0.5.

Figure S19. Cell type iQTLs for SYNGR1 and association with rheumatoid arthritis. A)
Colocalization between GWAS for rheumatoid arthritis (RA) and cell type iQTLs other than NK cell iQTLs for SYNGR1 or a nearby CpG site cg19713460 shown as regional association plots. Highlighted region is depicted as in Fig. 5B and shows the location of the lead GWAS variant for RA, rs909685, and the CpG site relative to the SYNGR1 gene. Posterior probability for colocalization (PP4) is set to NA, if the cell type iQTL was not included to the colocalization analysis (FDR < 0.25 for ieQTLs and FDR < 0.05 for imeQTLs). For cell type iQTLs included to the colocalization analysis, showing data from the datapoint with the lowest interaction P-value. Otherwise, exam 1 is used, where we observed the association with NK cell iQTLs. B) Association plot for monocyte ieQTL and T cell ieQTL for SYNGR1. Dots are colored based on the genotype of rs909685. C) Overview of the candidate cis-regulatory element, where the lead iQTL rs909685 is located. This cCRE is characterized by high DNase and H3K27ac in B cells, NK cells, and CD8 T cells.

Supplemental Tables
Supplemental tables are provided as separate Excel files. Table S1. Cell type iQTLs affecting both expression levels of a gene and DNA methylation levels of a nearby CpG site. Pairs of significant cell type ieQTLs and sentinel cell type imeQTLs in LD (r 2 ≥ 0.5 in MESA) with FDR < 0.05 in exam 1 and exam 5 are listed. Table S2. Replication of cell type ieQTLs in the eQTL Catalogue. Based on cell type ieQTLs with FDR < 0.05 in exam 1 or exam 5. Replication is quantified as the proportion of true positives, absolute median effect size in eQTL data, and proportion of variants showing concordance in allelic direction in eQTL effect. Table S3. Functional enrichment analysis of cell type iQTLs. Based on cell type ieQTLs and sentinel cell type imeQTLs with FDR < 0.05 in exam 1 or exam 5. Table S4. Results of the mediation analysis of age imeQTLs by neutrophil imQTL effect based on exam 5 data. Table S5. Results of the colocalization analysis between cell type iQTLs and immune or cardiometabolic traits. We tested for colocalization with (A) cell type ieQTLs with FDR < 0.25 in exam 1 or exam 5 and (B) sentinel cell type imeQTLs with FDR < 0.05 in exam 1 or exam 5. Support for colocalization is defined as PP4 > 0.5.

Supplemental Material and Methods
Genotypes from Whole Genome Sequencing DNA samples from MESA participants were sequenced from whole blood to 38x average coverage on Illumina HiSeq X-Ten with 2x151 paired-end reads using a PCR-free library preparation protocol. Sequence data were mapped to the 1000 Genomes GRCh38 human genome reference sequence with decoy contigs using bwa-mem 0.7.15. Duplicates were marked and base call quality scores were recalibrated and binned to 4 bins using bamUtils dedup_LowMem. SNP and short indel variants were discovered and genotyped with vt_discover2. Genotyping used an adjusted genotype likelihood to allow for estimated persample DNA sample contamination. After merging genotypes across all TOPMed studies, variant sites were filtered using an SVM classifier to assign "PASS" and "FAIL" labels. The classifier was trained using a positive set of variants that are polymorphic in Illumina Omni 2.5 genotyping of the 2,504 1000 Genomes samples and a negative set of variants that showed multiple discordant genotypes and/or Mendelian inconsistencies in a set of 7,000+ TOPMed duplicate samples and trio nuclear families. Genotyping and variant filtering were performed jointly across all TOPMed studies. The final phased genotype calls for the 1,319 MESA participants included in this study were obtained from the TOPMed freeze 8 genotypes from dbGaP (accession phs001416).

Methylation
DNA methylation was quantified in bisulfite-converted genomic DNA from whole blood using the Illumina EPIC array for 1,894 samples (including replicate samples per individual and related individuals). For quality control and normalization, we used the meffil R package (Min et al. 2018). First, IDAT files were parsed and probe intensities were dye-bias and background corrected using the noob method (Triche et al. 2013). Second, sample quality control was performed and 4 samples were excluded due to: 1) low quality, i.e., samples with >5% of CpGs with detection P-value > 0.01 or methylated/unmethylated ratio outliers; 2) sex mismatches; or 3) genotype mismatches, based on the explicit SNP probes on the array (concordance threshold = 0.8). Third, methylation samples that passed quality filters were normalized using the functional normalization method (Fortin et al. 2014) as implemented in meffil. To reduce unwanted technical variation in the data, the top 11 principal components of the control probes were used, which were determined to minimize the residual variance under a 10-fold crossvalidation scheme. Fourth, for analyses of the data, we kept the higher-quality sample among replicates, and for a pair of related individuals, we kept the sample of the individual with the most data types in MESA. As a result, the final sample size was n = 900 and n = 899 for exam 1 and exam 5, respectively. Additionally, we performed filtering of CpG sites as follows: 1) probes with >10% of samples with detection P-value > 0.01; 2) masking of probes as recommended in (Zhou et al. 2017), including excluding probes with mapping problems to GRCh38, crossreactive probes, Type I probes with putative CCS SNPs; 3) non-CpG targeting probes; 4) probes on Y chromosome; and 5) polymorphic probes (i.e., probes with SNVs within 5bp from the 3'-end of the probe (including the extension base for type II probes)) of MAF over 1% across all unrelated individuals in MESA Freeze 8. The final number of CpG sites retained for analyses was n = 740,291 and n = 747,771 for exam 1 and exam 5, respectively, with n = 747,868 CpG sites present in at least one of the exams. Fifth, for meQTL analyses, beta-values of CpG sites with detection P-value > 0.01 were set to missing. Then, missing values were imputed to the mean followed by inverse normal transformation. Among technical replicates (i.e., the same aliquot sequenced multiple times for QC purposes), the sample with the highest number of reads was retained for inclusion in the analysis freeze set. After applying these filters and excluding samples from related participants, 931 and 864 PBMC samples from exams 1 and 5, respectively, and 355 monocyte and 362 T cell samples from exam 5 remained and were used for all downstream analyses. eQTL phenotypes were quantified as previously described (GTEx Consortium 2020). molQTL mapping molQTLs were mapped using TensorQTL (Taylor-Weiner et al. 2019). For methylation, the ciswindow was ±500kb from the CpG; for RNA, the cis-window was ±1Mb from the gene's TSS. For all molecular phenotypes, the covariates included PEER factors (see below), sex, and the top 11 genotype PCs (Taliun et al. 2021) to account for population structure. For cis-molQTLs, 13,294,020 common variants with MAF ≥ 0.01 across all 1,319 samples in the cohort were used for mapping all molecular phenotypes. For trans-QTLs, to minimize potential false discoveries at low allele frequencies, an in-sample threshold of MAF ≥ 0.05 was used.

PEER factor selection
The optimal number of PEER factors to include was selected for each molecular phenotype based on the number of features identified with at least one molQTL (e.g., eGenes for gene expression), as described previously (

Interaction QTLs and confounders
Commonly in GWAS analysis of cell type composition, cell type estimates are adjusted for relevant confounders, and the inverse normalized residuals are used in the association analysis as the response variable (for example, see Vuckovic et al. 1 ).
We investigated whether for interaction molecular QTL (iQTL) mapping estimated cell type proportion should be adjusted for relevant confounders, using age as an example of a confounder.

1) Age as confounder
Age affects cell type proportion and molecular phenotype.
We fit 3 different models: The effect of age on cell type abundance -beta = -0.5 The effect of age on cell type abundance -beta = -0.25

Conclusion
For detecting strong iQTL effects between genotype and cell type abundance, adjusting cell type proportions for a confounder and using inverse normalized residuals in the model (adjusted model) is less powerful as compared to using unadjusted cell type abundance (full model). The extent of loss of power depends on the correlation between the confounder and cell type abundance. Inclusion of the confounder to the iQTL model as a covariate is sufficient when the confounder also has an effect on the molecular phenotype (outcome variable).

1.2) No iQTL effect
Just molQTL effect, but no interaction effect. The effect of age on cell type abundance -beta = -0.25 The effect of age on cell type abundance -beta = 0 The effect of age on cell type abundance -beta = 0.25 The effect of age on cell type abundance -beta = 0.5

Conclusion
If there are no significant iQTL effects, we observe variability between P-values from the adjusted model and full model. The extent of variability depends on the correlation between the confounder and cell type abundance. Neither of the models show an increase in false positive rate.

2) Age as confounder and moderator
In addition to the effect of age on cell type proportions and molecular phenotype, age also influences the association between G and Y (there is a strong GxAge effect).
We fit 3 different models:

Conclusion
For detecting strong iQTL effects between genotype and cell type abundance in the presence of a strong GxConfounder effect, adjusted model is less powerful compared to full model. However, the current implementation of tensorQTL does not support the inclusion of additional interactions. The naive model, which is currently used for iQTL mapping assuming that relevant confounders are captured by the PEER factors, either overestimates or underestimates the iQTL effect depending on the correlation between the cell type abundance and confounder.

2.2) No iQTL effect
Just molQTL effect, but no interaction effect. The effect of age on cell type abundance -beta = -0.25 The effect of age on cell type abundance -beta = 0 The effect of age on cell type abundance -beta = 0.25 The effect of age on cell type abundance -beta = 0.5

Conclusion
If there are no significant iQTL effects in the presence of a strong GxConfounder effect, we observe variability in P-values of the between the GxCell type effect from the full model and adjusted model. Results from the naive model are, in large, overestimated and leading to many false positives, when there is a very strong effect of a confounder on cell type abundance, which in practice is not typical.

Summary
In practice, cell type abundance is the probably the factor with the strongest influence on the association between genotype and molecular phenotype (i.e., strongest iQTLs). Thus, scenarios in this simulation leading to large scale over-or underestimation in the presence of other GxConfounder effects, where the confounder also affects cell type abundance, are not typical. Correcting cell type abundances for confounders and using normalized residuals in the model rather seems to reduce the study power in most typical scenarios.