MRBEE: A novel bias-corrected multivariable Mendelian Randomization method

Mendelian randomization (MR) is an instrumental variable approach used to infer causal relationships between exposures and outcomes and can apply to summary data from genome-wide association studies (GWAS). Since GWAS summary statistics are subject to estimation errors, most existing MR approaches suffer from measurement error bias, whose scale and direction are influenced by weak instrumental variables and GWAS sample overlap, respectively. We introduce MRBEE (MR using Bias-corrected Estimating Equation), a novel multivariable MR method capable of simultaneously removing measurement error bias and identifying horizontal pleiotropy. In simulations, we showed that MRBEE is capable of effectively removing measurement error bias in the presence of weak instrumental variables and sample overlap. In two independent real data analyses, we discovered that the causal effect of BMI on coronary artery disease risk is entirely mediated by blood pressure, and that existing MR methods may underestimate the causal effect of cannabis use disorder on schizophrenia risk compared to MRBEE. MRBEE possesses significant potential for advancing genetic research by providing a valuable tool to study causality between multiple risk factors and disease outcomes, particularly as a large number of GWAS summary statistics become publicly available.


Bias in IVW estimators
We begin by showing the bias in the IVW estimating equation S IVW (θ) given the measurement error models stated in the Methods section. As a reminder, let B be a m × p matrix of true associations of m IVs on p exposures, α = Bθ (assuming no horizontal pleiotropy) be correspondingly the same for a single outcome, and (B,α) denote their unbiased estimates from GWAS. Let j denote the jth IV, where (e.g.)β j is the jth row ofB. We re-state the following measurement error models:β where E(w βj ) = 0, E(w αj ) = 0, Cov(w βj ) = Σ w β j , Cov(w αj ) = σ 2 where Σ w β j and σ 2 wα j may differ between IVs because of different genotyping rates between SNPs in GWAS. This may be especially true when GWAS data used in MR are from meta-analyses of GWAS data from multiple cohorts. For simplicity, now assume that all Σ w β j are equal and all σ 2 wα j are equal. In practice, IVW uses the following score function: which is set to equal 0 then solved to produce the IVW causal estimates (denote asθ IVW ). However, we will now demonstrate that Equation 5 does not have expectation 0 and therefore introduces bias inθ IVW . The following relations hold: S IVW (θ) = (B + w β ) ⊤ (α + w α −Bθ − W β θ) which has expectation equal to m(σ w β wα − Σ w β w β θ) (7) when there is nonzero sample overlap (provided the vector of true causal effects does not equal 0 and σ w β wα ̸ = Σ w β w β θ; see Methods). Larger exposure GWAS sample sizes can make the bias in Equation 7 from the term Σ w β w β very small (see [11]), but will only eliminate it as exposure GWAS sample sizes approach infinity.
In the case of one exposure and outcome, we can more easily see how these sources of bias in S IVW (θ) affect the bias in causal estimation for each IV. Now we have only (β j , w βj ) and (α j , w αj ), whereθ IVW represents the IVW causal estimate which is equal toθ IV W =α j /β j . Assume all phenotypes and genotypes have been standardized to have mean 0 and variance 1 and Equation 7 becomes where ρ XY is the total correlation between exposure and outcome, (n X , n Y , n 0 ) respectively are the exposure, outcome, and overlapping GWAS sample sizes.

Traditional weak instrument bias
As demonstrated in the Results section, the total bias in the IVW causal estimates is a product of biases from weak instruments, sample overlap, and measurement error. Consider still a single exposure. Even as the variance explained in the exposure by the IVs increases, weak instrument bias is likely to become worse under certain models of genetic architecture. For example, assume that the true effects of M genetic variants on a phenotype with heritability h 2 are random with distribution N (0, h 2 /M ) as in [18]. This implies that in order to explain a sufficiently large proportion of variance in the phenotype with a set of m ≤ M IVs, many genetic variants with very small effect sizes must be included in MR. This of course will increase weak instrument bias. Figure 1S demonstrates in simulation how this type of bias -low variance in the phenotype explained by the IVs used in MR -affects the bias of causal estimation of IVW but not MRBEE. This simulation selectively chose truly causal genetic variants in such a way that was designed to mimic GWAS. In these simulations, we set the heritability of the exposure and outcome respectively to 0.1 and 0.3 and fixed them across all simulations with different number of IVs. The genetic correlation between the exposure and outcome was set to be 75% of the total correlation which was 0.4, making the causal effect equal to 0.52. Exposure and outcome GWAS sample sizes did not overlap and were each of size 500k. There was therefore no pleiotropy, confounding, or (effectively) measurement error biases and no sample overlap. In each of the 500 repetitions of the simulation, we drew a specified number of instrumental variables from the true set of 2,000 independent causal variants. The true effect sizes for these variants were normally distributed with mean 0 and variance 0.1/2,000. After drawing true instrument effect sizes, we randomly drew estimated instrument effect sizes using the corresponding measurement error models (Equations 1, 2). Variances of measurement errors were equal to the inverse of GWAS sample size and there was no covariance (sample overlap) between measurement errors in the exposure and outcome GWAS for each instrument. The probability of randomly selecting IV j for use in MR was inversely proportional to the size of its true causal effect on the exposure. We finally estimated the causal effect using IVW and MRBEE and recorded the F-statistics and R-squared values using a bias-corrected estimator (see section 1.2).

Inference from F-statistics for instrument strength
The F-statistic for instrument strength [1] uses an estimate of variance in the exposure explained by the IVs to measure the degree to which weak instrument bias is present in the IV set. In theory, this measure of variance explained (Rsquared) should converge to the SNP heritability of the exposure. However, in practice, R-squared can be greater than SNP heritability. Figure 2S demonstrates this phenomenon, and its impact on the F-statistic for instrument strength, in simulation. The traditional estimate of the exposure variance explained by the genetic instruments is biased upwards (red points) by a factor of m/n where m is the number of instruments and n is the exposure GWAS sample size. We make a bias correction called the bias-corrected R 2 that is asymptotically unbiased as the number of IVs increases (blue points). The horizontal line in this panel is the true SNP heritability of the exposure (0.4), to which we can see the bias-corrected R 2 converging but not R 2 . (C): Since R 2 is an estimator that is biased upward, the F-statistic that uses R 2 is also biased upward. Red points correspond to the biased F-statistics and blue points correspond to the F-statistics that use the bias-corrected R 2 .
This inconsistency is introduced because of measurement error. Let n represent the exposure GWAS sample size, m be the number of IVs used in MR, and R 2 be the estimated variance explained in the exposure by the IVs. The F-statistic used in practice is equal to In practice, researchers often standardize m IV effect size estimates (denote as the m × 1 vectorβ (m) ) such that they represent marginal correlations of the IV with the exposure. If a set of m IVs is used for an exposure with M truly causal SNPs, the theoretical maximum of ∥β (m) ∥ 2 2 should be the SNP heritability of the exposure, h 2 , which is equal to ∥β (M ) ∥ 2 2 . However, we can see that lim m→M ∥β (m) ∥ 2 2 > h 2 . Using the measurement error models from above (see Equations 1 and 2), it becomes clear that The second term on the right side in Equation 10 can be seen to only vanish as exposure GWAS sample sizes approach infinity. This bias has implications for how the F-statistic is interpreted in practice, since it can be artificially inflated when exposure GWAS sample sizes are relatively small (see Figure 2S). There have also been recent propositions of conditional F-statistics intended to measure the degree of weak IV bias in multivariable MR. In MR with multiple exposures, the information provided by the instrumental variables may not be enough for all exposures included conditional on the others. The genetic variants we use as instrumental variables have observations that are estimates of total genetic effect on a trait. Many of these effects will be mediated by other risk factors, some of which may be other exposures included in MR. We aim to investigate whether an instrumental variable is associated with a particular exposure conditional on the other exposures, and whether the set of instrumental variables is collectively associated with each exposure conditional on the others to a sufficient degree. The Sanderson et al. [2] test intends to accomplish this task but is vulnerable to measurement error and confounding biases from nonzero sample overlap and so may be unreliable. It may also underestimate a variance term. Future work is intended on building more robust measures of instrument strength.

Fig. 3S
It was mentioned in Methods and above that in practice we must estimate the matrix for m IVs in order to make the bias-correction in MRBEE. An estimate of Π is generally made after estimating its corresponding correlation matrix R. (see [11]). This correlation matrix is often estimated using upwards of 1 million nonsignificant SNPs in GWAS using the methods of [10]. This can be accomplished by loading the full GWAS summary statistics for all exposures and outcomes into (for example) R, and calculating them manually, then usingR and GWAS standard errors to retrieve each Π j . As an example, let π j be the standard errors for the p + 1 phenotypes in MR, p of which are exposures and where the first element π j is the standard error for the jth IV's association with the outcome. Given that the researcher hasR = R, Π j = D j RD j where D j = diag(π j ). We alternatively provide a command-line tool that can estimate R without manually loading all GWAS summary statistics into R. This tool simply requires specifying the file locations and relevant column names to calculate R. An example of how to use this simple tool is provided at https: //github.com/noahlorinczcomi/MRBEE.

Proof of horizontal pleiotropy statistics S pleio and Q pleio
Assume we have p exposures and 1 outcome in MR, where all notation from above applies here. Let θ be the p × 1 vector of causal effects. The true causal model for the IV associations on the exposures (B) and outcomes is where γ represents the m × 1 vector horizontal pleiotropy. We first aim to test whether there is evidence of unbalanced horizontal pleiotropy or, in other words, if any of the γ j ̸ = 0 for j = 1, ..., m. We can perform this test by including an intercept term in causal estimation. We assume the causal model α = Bθ and one intercept term, denoted as θ 0 to produce the new assumed causal model: where θ 0 = m −1 m j=1 E(γ j ). It is easily shown that the null hypothesis in is the null hypothesis that there is no unbalanced (i.e., nonzero mean) global horizontal pleiotropy where 1 m is an m-length vector of 1s. We now aim to develop a statistic that can be used to test H 0 . We proposed the following in Methods: where s is the (p + 1) × 1 vector (1, 0, ..., 0) used only to extract the interceptrelevant term from θ * = (θ 0 ,θ ⊤ ) ⊤ , andΛ is any consistent estimate of the asymptotic variance ofθ * . Using the asymptotic properties ofθ * stated in Methods and [11], and the result in Methods follows directly. The test in Equation 14 is simply a test that θ 0 ̸ = 0, which is analogous to the MR-Egger intercept test.
We also introduced S pleio which could be used in MR to detect horizontally pleiotropic IVs and more generally genome-wide to detect horizontally pleiotropy SNPs. The complete proof of the convergence of S pleio to a chisquare distribution is found in [11]. Briefly, it is derived in [11] that the distribution ofα j =θ ⊤β j is the same as that ofα j = θ ⊤β j which allows the result to follow directly.
As mentioned in Results, the success with which removing IVs in MR using evidence from S pleio depends on the threshold ξ beyond which S (j) pleio for the jth IV is large enough to warrant its exclusion in causal estimation. We aim to investigate the extent to which different ξ values remove all existing unbalanced horizontal pleiotropy in MR when other sources of bias are absent. Figure 4S demonstrates that S pleio more robustly removes all IVs causing global imbalance when GWAS sample sizes are larger, especially when ξ is relatively small [e.g., F −1 χ 2 (0.05) vs F −1 χ 2 (0.05/m) where F χ 2 is the cdf of the χ 2 (1) distribution]. When there is unbalanced global horizontal pleiotropy and the GWAS sample sizes are 10k, using a larger (i.e., P<0.05) IV-specific S pleio Pvalue threshold more consistently removes unbalanced global pleiotropy than a smaller (i.e., P<0.05/m) threshold. As GWAS sample sizes increase, the instrument-specific pleiotropy test performs better at identifying horizontally pleiotropic instruments whether a small or larger P-value threshold is used. This is because the instrument-specific pleiotropy test has more power as GWAS sample sizes increase. The distributions of global pleiotropy tests Q pleio are displayed before (red) and after (yellow, blue) removing specific IVs (of which there are 250) with horizontal pleiotropy evidence using S pleio . The IV-specific null hypothesis of no horizontal pleiotropy can be rejected at a Bonferroni-corrected or un-corrected significance threshold (0.05/250 or 0.05, respectively). When there is unbalanced horizontal pleiotropy and the GWAS sample sizes are relatively small (10k), rejection at a Bonferroni-corrected threshold may not be strict enough and so a more conservative, higher threshold should be used (e.g., reject when the S pleio P-value is less than 0.05).
We now aim to investigate the power of S pleio in its application to the study of coronary artery disease in East Asian and European populations as in Real Data Analysis 1 in the main text. Assume now for simplicity we only have one exposure and one outcome. For the jth genetic variant in GWAS, we assume the following causal model: where γ j represents effects of the genetic variant on the outcome that are not completely mediated by the exposure. In MR using MRBEE, we can identify genetic variants for which the evidence suggests γ j ̸ = 0 using S pleio , which requires an estimate of the variance ofα j −β jθ . Assuming all GWAS estimates have been standardized, the true variance of this term can be expressed as where n 1 and n 2 respectively are the exposure and outcome GWAS sample sizes, for which there are n 0 overlapping subjects, and ρ is the phenotypic correlation between the exposure and outcome. A consistent estimator of η is available in Methods. In power calculations below, we fix the proportion of overlap at 0.50 to be consistent with the real data we used in Results.
Let F 0 and F 1 respectively represent the cumulative density functions of chisquare distributions with one degree of freedom under the null and alternative hypotheses. F 1 has non-centrality parameter γ 2 j /η j and F 0 is central. We compute power using quantiles of the chi-square distribution as follows: To make this calculation for the data used in Real Data Analysis 1, we set outcome GWAS sample sizes for 220k for EAS and 184k for EUR and exposure GWAS sample sizes to 107k for EAS and 755k for EUR. Outcome GWAS sample sizes are the CAD sample sizes we observed in the real data application; exposure GWAS sample sizes are the mean of the MR exposure sample sizes we observed in the real data application. The phenotypic correlated was 0.30, the SNP heritability of the outcome was 0.15, and we assumed 1000 causal variants for the outcome. We first identified the 0st through 99th quantiles of the N (0, 0.15/1000) distribution that true outcome SNP effect sizes were assumed to follow (α j ), let β j = α j , and selected a range of θ values from 0 to 1. This ensured that the proportions of total SNP effects on the outcome (α j ) ranged from 0 to 1. We assume all phenotypes and SNP effect sizes are standardized. Figure 5S displays the results of these power calculations and the R code used to produce them. These results indicate that power is generally largest when the SNP effect on the outcome and the proportion of that effect that is mediated by the exposure is low.

Fig. 5S
3 Additional simulation results Figure 6SA displays an example of UHP and CHP for a single exposure and 250 IVs. R code used to generate these data are is available in Figure 6SB. We pushed UHP and CHP away from the α = βθ line by a factor of 5SD(βθ), which is consistent with the patterns of UHP and CHP we observed in univariable MR in Real Data Analaysis I (see Figure 7S).

A) Example of horizontal pleiotropy used in univariable MR simulations B) R code used to generate data in univariable simulations
maf=runif(m,0.1,0.9) # m=number of Ivs # n1,n0,n01=exposure GWAS n, outcome GWAS n, overlapping n N1=1:n1; N0=(n1-n01+1):(n1-n01+n0); totaln=length(union(N1,N0)) G=matrix(0,nr=totaln,nc=m) for(snp in 1: adjfactor2=h2/ss # h2=exposure heritability explained by IVs bx=sqrt(adjfactor2)*bx U=rnorm(totaln,sd=sqrt((1-h2)*0.15)) epsx=rnorm(totaln,sd=sqrt(1-h2-var(U))) x=G%*%bx+U+epsx epsy=rnorm(totaln,sd=1-sd(x*theta+U)) # theta=causal effect size uhpix=1:nUHP; chpix=m-c(1:nCHP)+1 # nUHP, nCHP=number of UHP, CHP IVs sgns=c();k=0;while(length(sgns)!=m){k=k+1; sgns=c(sgns,ifelse(k%%2==0,1,-1))} gammaU=sd(5*bx*theta)*sgns   [14] and weak IV bias in simulation for MRBEE and IVW using exposure and outcome GWAS sample sizes of 20k and true causal effect 0.29. We drew true SNP associations with the exposure (denoted βj for the jth SNP) for changing numbers of causal SNPs M (i.e., 50, 100, 250, 500) from normal distributions with mean 0 and variance 0.25/M , where 0.25 was the exposure SNP heritability. We then calculated true SNP associations with the outcome as αj = 0.29×βj and GWAS estimates from a multivariate normal distribution with mean (0, 0) ⊤ and variance-covariance matrix S, where the top-left element of S corresponded to the outcome and was 20k −1 , the bottom-right corresponded to the exposure and was 20k −1 , and off-diagonal elements were 0.25p where p was either 0 or 1 to represent no and complete sample overlap, respectively. This produced the GWAS estimates (αj ,βj ). We then drew estimated standard errors forβj from the distributionŝ 2 j ≈ 1/χ 2 (20k − 1) and used them to calculate the statistics T 2 j = (βj /ŝ 2 j ). For different P-value thresholds α k (as displayed on the x-axes in the figure), we only included SNPs with T 2 j greater than the 1 − α k quantile as IVs in MR. For decreasing α k , this should result in more substantial winner's curse bias. As the number of truly causal exposure SNPs increases, there should be more weak IV bias. The results of these simulations suggest that MRBEE and IVW are both subject to winner's curse bias, but that MRBEE is less biased, especially when there is no sample overlap, because it corrects weak IV bias. IVW may be less biased when there is complete sample overlap because weak IVs bias the IVW estimate downward but sample overlap leads to upward bias and these two may effectively cancel.       We also compared the standard errors of MRBEE to IVW. Since MRBEE makes a bias-correction to IVW, we should expect that the standard errors of MRBEE are generally larger than those for IVW. Figure 12S compares the standard errors of MRBEE to IVW in simulation. In this simulation with a single exposure, we fixed GWAS sample sizes at 10k, the GWAS sample overlap proportion at 0, the true causal effect size (θ) at 0.25, the number of truly causal SNPs for the exposure at 500, and measurement error variances at 1/10k. We randomly drew true IV-exposure associations (β) from a normal distribution with mean 0 and variance h 2 /m, where h 2 is the heritability of the exposure and m is the number of causal SNPs. These quantites changed as the simulation scenarios we considered changed (see Figure 12S). We then set α j = θβ j for j = 1, ..., m and drew measurement errors from independent normal distributions for α j and β j with means 0 and variances 1/10k. The independence of these distributions represented 0% sample overlap between exposure and outcome GWAS. We then estimated θ using MRBEE and IVW and recorded their causal estimates. Displayed in Figure 12S are the true (standard deviation of all causal estimates across simulations) for MRBEE and IVW for a range of (h 2 , m) pairs. These results indicate that the standard errors of MRBEE are more heavily influenced by the heritability of the exposure explained by the IVs than the total number of IVs for fixed heritability. Conversely, the standard errors of the IVW estimator decrease rapidly as the number of IVs increases and only gradually as the heritability explained by those IVs increases.

Univariable MR
We selected instrumental variables in MR using either population-specific GWAS or within-phenotype, between-population fixed effects meta-analysis of GWAS (see Methods). We report here results that supplement those in the main text and include causal estimates for (i) univariable MR with only population-specific instrumental variables, (ii) univariable MR with metaanalysis instrumental variables, (iii) multivariable MR with population-specific instrumental variables. In all cases, we present the numbers of instrumental variables used, heritability explained for each exposure (equal to the sum of squared standardized instrument effect sizes), SNP heritability estimated using LD score regression [18], and F-statistics for instrument strength [1]. In univariable analyses, we recommend interpreting IMRBEE instead of MRBEE because there is often strong unbalanced horizontal pleiotropy in univariable analyses (since many exposures are not considered). IMRBEE in all results used a Benjamini-Hochberg [19] FDR correction of a 5% Type I error rate. Figure 13S shows univariable causal effect estimates of each of the 12 available exposures on CAD in EAS and EUR separately. Using instrumental variables from between-population GWAS meta-analysis, univariable IMR-BEE results indicate causal relationships between all exposures and CAD with the exceptions of red blood cell count in EAS (OR=1.05, 95% CI: 0.98-1.12) and haemoglobin in EUR (OR=1.02, 95% CI: 0.95-1.11). Using only population-specific instrumental variables, haematocrit in EUR also becomes only marginally significant (OR=1.07, 95% CI:1.00-1.15), and this may not More information about the sample characteristics and the cohorts from which many exposure GWAS summary statistics were produced are available in [15]. Using GWAS height data from the GIANT consortium [16] would have changed the SNP heritability explained by MR population-specific instruments by -0.06 (from 0.20 to 0.14) and +0.03 (from 0.31 to 0.34), respectively in EAS and EUR. Therefore, we did not use GIANT GWAS data since those results would not have added any substantial additional heritability in EUR and in fact removed some explained heritability in EAS (because a smaller number of SNPs were used in BBJ vs GIANT; 13.3M vs 1.4M). UKBB CAD data for EUR were not used because more precise phenotyping of CARDIoGRAMplusC4D study participants was preferred over the larger sample size, but more heterogeneous phenotyping of UKBB. Serum lipids for EAS were from BBJ and not GLGC [17] because the GLGC metaanalysis contains more ancestrally heterogeneous subjects for EAS.
simply be explained by a reduction in power (m=280 meta-analysis vs m=364 using population-specific instruments). In EAS, haematocrit becomes insignificant, which is likely explained by a reduction in power (OR=1.06, 95% CI: 0.91-1.23, m=303 vs m=48). F-statistics used to indicate the presence of weak instrument bias are each >10 for all exposures in EAS and EUR whether population-specific or meta-analysis instruments are used. F-statistics are generally greater when using population-specific instruments vs metaanalysis instruments because the heritabilities explained by the instruments are generally consistent but the numbers of instrumental variables are smaller when using population-specific instruments. There is also some evidence that using instrumental variables from between-population meta-analysis introduces some additional horizontal pleiotropy, suggesting differences in genetic architecture for some exposures between EAS and EUR populations.

Fig. 13S
Causal estimates from univariable MR applied to Real Data Analysis I.

Characteristics of multivariable MR
The results of using only population-specific instrumental variables in multivariable MR with IMRBEE are generally consistent with those results produced by using population meta-analysis instruments (see 14S), but there are stronger pleiotropic effects in MR with instruments selected from metaanalysis. Still triglycerides in EUR is not significant using population-specific instruments, which may be explained by the relatively large variance of MRBEE in this setting (e.g., compare estimate and 95% CI to IMRBEE). Fstatistics are <10 in this case for all exposures except height and uric acid in EAS. In EUR, F-statistics for HbA1c, triglycerides, and haemoglobin are each <10. Using meta-analysis instruments in multivariable MR, F-statistics for EAS are all <10 but heritabilities are larger than when using populationspecific instruments (e.g., m=810 and 3,767 in EAS, EUR respectively using population-specific instruments; m=3,097 and 2,821 in EAS, EUR respectively using population-specific instruments). The F-statistic has less meaning for MRBEE than alternative estimators because MRBEE adjusts for weak instrument bias while others do not.  Displayed estimates are from MRBEE including all 9 exposures simultaneously using only population-specific instruments (i.e., no within-phenotype, betweenpopulation meta-analysis was performed to identify these instruments). Displayed estimates are from MRBEE including all 9 exposures simultaneously using within-phenotype, between-population meta-analysis. . h 2 M R is the sum of squared standardized instrument effect size estimates, which is an estimate of SNP heritability.ĥ 2 LDSC is an estimate of SNP heritability using LD score regression (LDSC; [18]) which uses all HapMap3 [20] SNPs, not just those chosen for MR. 'Conditional F' is bias-corrected. .
To support of drawing IV effect sizes from multivariate normal distributions in simulation, we present the distributions of IV associations for each candidate exposure used in Real Data Analysis 1 with CAD in the EUR population in Figure 16S. These results indicate that IV associations are almost exactly normally distributed for each candidate exposure. The IV set is more fully described in the Methods section of the main text, but simply is the union set of exposure-specific IV sets filtered to only independent IVs (r 2 <0.01).
In Real Data Analysis 1 with MRBEE, we additionally estimated causal effects using IMRBEE which identifies and removes specific instruments with horizontal pleiotropy evidence. We used a Benjamini-Hochberg [19] FDR correction of a 5% Type I error rate to identify horizontal pleiotropic instruments. Figure 15S demonstrates that there the horizontal pleiotropy has is approximately balanced (i.e., mean approximately equal to 0) in EAS and EUR, supporting our use of MRBEE which did not remove horizontally pleiotropic instruments. SNPs rs2227901, rs10818580, rs3752728, rs4842266, rs1401982, rs10492024, and rs1043413 were horizontally pleiotropic in both the EAS and EUR MR analysis with IMRBEE. Fig. 17S The solid black line is the least squares line fitted to the mediation instruments (blue points). The slope of the solid line is an indication of overall model fit with more positive slopes indicating better model fit. Dotted lines are reference lines (diagonal dotted line is y=x). Since the instrument-specific IMRBEE pleiotropy test with S pleio is a test that observed CAD effect sizes (x-axis) minus predicted CAD effect sizes (y-axis) are 0, points that are further from the dotted y=x line will have a smaller pleiotropy test P-value. These results indicate that there was balanced global horizontal pleiotropy in multivariable MR analyses -i.e., an approximately equal number of non-blue points are left and right of the dotted y=x line. The model that is estimated isα =Bθ + error. 'Observed CAD effect sizes' refers toα and 'predicted CAD effect sizes' refers to the linear predictorBθ.   [22]. LD scores are from European subjects of 1000G Phase 3 [23]. Loci are defined using the following parameters: 1Mb, r 2 <0.01, P<5x10 -8 . Replicated genes are presented in Results.   Figure 4 of the the main text. Where the y-axis representsαj , the x-axis representsβ ⊤ jθ , the linear predictor ofαj whereθ was estimated using MRBEE.

Additional genome-wide pleiotropy testing results
The pleiotropy test is a test for nonzero association of a SNP with a phenotype conditional on a set of other phenotypes. If a lead SNP is associated with CAD but the same SNP is not associated with any of the exposures used in MR with CAD, the pleiotropy test statistic may be very large. These instances are observed in the lower right quadrants of Figure 22S. Lead SNPs that are in the upper-right quadrants of Figure 22S are genome-wide significant in both the CAD and exposures (considered collectively) GWAS. That is, letβ j represent the vector of standardized GWAS estimates for the 9 exposures for the ith SNP and Σ w β j w β j be its variance-covariance matrix. The y-axis is the -log10 P-value from the omnibus test statistic δ j = ∥Σ −1/2 w β j w β jβ j ∥ 2 2 , which follows a central chi-square distribution with 9 degrees of freedom under H 0j : β j = 0 (i.e., SNP j is not associated with any of the 9 exposures). In EAS, a greater proportion of lead SNPs are only associated with CAD and not the 9 exposures than in EUR (36/60 in EAS vs 10/38 in EUR), which may simply be explained by relatively small sample sizes in the exposure GWAS in EAS vs EUR.

Mediation analyses
In Results, we were interested in understanding why the direct causal effect of BMI on CAD was significant in EAS but not EUR. Here, we will demonstrate how we performed explicit tests of mediation hypotheses in both populations. LetB M be the general matrix of instrument effect sizes for any or all of the candidate mediators. Correspondingly, letB X be the column ofB representing the estimated instrument effect sizes on the exposure of interest (e.g., BMI in the CAD analyses) andα be the IV effect size estimates for CAD. We estimated mediation proportions for a select set of candidate mediators individually and all collectively. We estimate all parameters in the directed acyclic graph (DAG) shown in Figure 23S using MRBEE and the full set of genetic IVs used to perform multivariable MR in EAS in the main text (i.e., 3,069 in EAS and 2,821 in EUR). Using the notation in Figure 23S, we have the following 3 models:α =β X β + e 1 (20) α =B M α 2 + e 2 (21) where λ = β − α ⊤ 1 α 2 . The null hypothesis that the exposure of interest (e.g., BMI) has no effects on CAD that are not through any or all of the chosen mediators represented byB X is the following: which we tested with the statistic where η = Var(β −α ⊤ 1α2 ) := Var(λ). In testing H 0 in Equation 23, we used where we can estimate the final term in Equation 25 by where S(ξ) is the MRBEE estimator for a general parameter ξ as defined above and H ξ = ∂S(ξ)/∂ξ. Table 5 displays mediation results for the 5 selected exposures (LDL, HDL, SBP, uric acid, height) in EAS and EUR. These results indicate the total effect of BMI on CAD in EAS (OR=1.44, 95% CI: 1.34-1.54) is mediated by SBP, HDL, and uric acid (P=0.923) but not EUR (P=2.8x10-7; BMI total effect OR=1.49, 95% CI: 1.40-1.59). The mediation of BMI by SBP is different in EAS and EUR most likely because the SBP EAS GWAS did not adjust for BMI as a covariate [15] while the SBP EUR GWAS did [24]. Therefore the causal estimate of BMI on SBP in EUR is extremely small (OR=1.00, P=0.610, Global Unbalanced UHP P=0.987) because the BMI and SBP GWAS estimates used in this MR analysis have a correlation which is effectively 0 (Pearson's r=-0.003. P=0.877) because of the covariate adjustment used in [24]. We therefore do not recommend interpreting the mediation test results for BMI in EUR with SBP alone or SBP together with HDL, uric acid, haemoglobin, and height. The method used to perform mediation testing is appropriate for EAS but may not be for EUR. Nonetheless, EUR results are included for continuity and to draw attention to this important data artefact, to which little attention has been paid in MR research and practice.

BMI
Candidate mediators CAD

Including exposures as covariates in GWAS for other exposures
Here, we intend to better understand why BMI was significant in multivariable MR in EUR but not EAS. The SBP and DBP GWAS in EUR included BMI as a covariate (in addition to sex, age, functions of sex and age, assessment center, and the first 20 genetic principal components, [24]). In this section, we aim to better understand the implications this may have had for our causal estimation using simulations with MRBEE, which otherwise should provide unbiased causal estimates. Consider the DAG in Figure 24S where two exposures x 1 and x 2 each have total causal effects on the outcome y respectively equal to 1/2 and 1/4. In one scenario, the GWAS for x 2 included x 1 as a covariate and in another x 1 was not included as a covariate (true values respectively denoted as α 22 and α 21 respectively are the effects of exposures x 1 and x 2 that are not mediated by each other (i.e., direct effects). In this simulation, x1 has no causal effect on y that is not mediated through x 2 (i.e., θ (direct) 1 = 0; done to match the MR results we observed with BMI and SBP in EAS), on which x 1 has a causal effect of 1/2 (represented by λ in Figure 24S). Therefore, x 1 has a total causal effect on y equal to 1/4. x 1 can be considered to play the role of BMI in our real data analysis, where x 2 and y respectively play the roles of SBP and CAD. In our real data analyses in EUR (but not EAS), we only had the GWAS estimates analogous toα 22 for SBP, which were conditional on BMI.
We performed simulation to verify the inconsistency of the direct causal effect estimate for x 1 in the presence of x 2 when it was included as a covariate in the GWAS for x 2 . The three models can be represented as where we have the GWAS estimates (α 1 ,α 21 ,α 22 ,β) and set the simulation such that x 2 explained 50% of the variance in y. We conducted a simulation in which invididual-level data were generated for (x 1 , x 2 , y) and MR analysis completed using MRBEE with GWAS summary statistics. We used the following data generating model for 10k individuals and 50 IVs: for the ith individual whereafter g i was standardized to have mean 0 and variance 1 and all genetic correlations were equal to phenotypic correlations (this was done to fix the true causal effect of x 2 on y at exactly 1/2). After drawing individual-level data, we performed GWAS for the following phenotypes: (i) x 1 , (ii) x 2 , (iii) x 2 conditional on x 1 , and (iv) y. We used GWAS estimates to perform MR with MRBEE with x 1 and x 2 included as two simultaneous exposures (termed 'multivariable' MR). The true direct/unmediated causal effects in this case are 0 for x 1 and 1/2 for x 2 . The total causal effects of x 1 and x 2 on y are 1/4 and 1/2, respectively. The total effect of x 1 on y is completely mediated by x 2 . In multivariable MR, we observed that the mean (across simulations) direct causal estimate of x 1 on y was equal to its total (i.e., 1/4), and not direct (i.e., 0), effect when x 1 was included as a covariate in the GWAS for x 2 (see 25S). This happens becauseα 1 andα 22 are uncorrelated and so any effect of x 1 on y that is mediated by x 2 cannot be captured byα 22 . Ifα 21 were used instead ofα 22 to perform multivariable MR, we can estimate direct (i.e., unmediated) causal effects without bias. It is not precisely true that the direct causal estimate for x 1 using (α 1 ,α 22 ) is biased since it is an unbiased estimate for the total effect. It is better to conclude that we cannot estimate the direct causal effect of x 1 on y in the presence of x 2 when we only have access to (α 1 ,α 22 ) and not (α 1 ,α 21 ).

Fig. 24S
Example DAG demonstrating the relationship between total and direct (i.e., unmediated) causal effects using a single IV g.

GWAS data used
The studies from which we retrieved GWAS summary statistics for all exposures and outcomes are presented in Table 6.

Univariable MR
Univariable causal estimates for all 7 exposures included in multivariable MR (intelligence, neuroticism [SESA], cannabis use disorder, educational attainment, sleep duration, left handedness, Attention-Deficit/Hyperactivity Disorder) and those 7 not included because of no causal evidence (insomnia, gestational duration, cerebral volume, birth weight, body mass index, diastolic blood pressure, systolic blood pressure) are presented in Figure 25S.  [26]. IVs selected for univariable MR are the same IVs used for multivariable MR but filtered, for each exposure, to only those multivariable IVs with P<5x10 -6 in GWAS.

Joint testing of causal effects
In Real Data Analysis II, we performed joint and contrast testing of causal effects for schizophrenia (SCZ) and bipolar disorder (BP) in the following way. DenoteΘ (2) as the (7 + 1) × 2 matrix of causal effect estimates produced by MRBEE (+1 for intercept terms), where the first column ofΘ (2) corresponds to the exposure effects on SCZ and the second on BP. LetΛ (2) denote the corresponding variance-covariance matrix of vec(Θ (2) ). Since the same IVs were used for SCZ and BP, elements of Θ (2) and Λ (2) were estimated using a single multivariate MRBEE model. We wished to test the general linear hypotheses of the following form: where C and L were respectively g × h and s × l fixed design matrices with elements chosen for specific hypotheses that are indicated in the Results section. Let H = (L ⊗ C) ⊤ andθ = vec(Θ (2) ) and we tested different H 0 in Equation 33 using the statistic (Hθ) ⊤ (HΛH ⊤ ) −1 (Hθ) D − → χ 2 (hl).

Genome-wide pleiotropy test sensitivity analysis
We stated in Section 5.1 of the main text that we did not include left handedness as an exposure in genome-wide horizontal pleiotropy testing with S pleio because it was not found to be causally related to either SCZ or BP risk in MR with MRBEE. It was also not found to be causally related to either SCZ or BP in a 2-degree of freedom joint test as described in Section 4.5. To confirm that this had no meaningful impact on the inferences we made about genome-wide testing with S pleio , we also performed this analysis with left handedness and its MRBEE causal estimates included in genome-wide horizontal pleiotropy testing. The results of these tests indicate that the same novel loci for SCZ and BP are discovered using S pleio , and that the counts of direct, mediation, and pleiotropy determinations of SCZ and BP loci are also the same as those presented in Figure 7 in the main text. These results suggest that including or excluding left handedness from genome-wide horizontal pleiotropy testing with S pleio does not affect our classifications of SCZ and BP loci as direct, mediated, pleiotropy, or novel. 'Estimated proportion mediated' representsα ⊤ 1α2 /β where α ⊤ 1 and α2 depend on the select candidate mediator(s). 'P(complete mediation)' represents the P-value from testing the null hypothesis stated in the text for the select candidate mediator(s). 'Multivariable mediation' refers the situation in which all select phenotypes act as mediators