Selecting Invalid Instruments to Improve Mendelian Randomization with Two-Sample Summary Data

Mendelian randomization (MR) is a widely-used method to estimate the causal relationship between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Genome-wide association studies often reveal that hundreds of genetic variants may be robustly associated with a risk factor, but in some situations investigators may have greater confidence in the instrument validity of only a smaller subset of variants. Nevertheless, the use of additional instruments may be optimal from the perspective of mean squared error even if they are slightly invalid; a small bias in estimation may be a price worth paying for a larger reduction in variance. For this purpose, we consider a method for “focused” instrument selection whereby genetic variants are selected to minimise the estimated asymptotic mean squared error of causal effect estimates. In a setting of many weak and locally invalid instruments, we propose a novel strategy to construct confidence intervals for post-selection focused estimators that guards against the worst case loss in asymptotic coverage. In empirical applications to: (i) validate lipid drug targets; and (ii) investigate vitamin D effects on a wide range of outcomes, our findings suggest that the optimal selection of instruments does not involve only a small number of biologically-justified instruments, but also many potentially invalid instruments.


Introduction
Mendelian randomization (MR) uses genetic variants as instrumental variables to estimate the causal effect of a risk factor on an outcome in the presence of unobserved confounding. By Mendel's second law, genetic variants sort independently of other traits. Thus, genetic variants, which are fixed at conception, can provide a source of exogenous variation in a risk factor of interest, allowing analyses that are less vulnerable to reverse causality and confounding (Davey Smith and Ebrahim, 2003).
Large-scale consortia genome-wide association studies (meta-GWASs) have identified large numbers of genetic variants that are robustly associated with a wide range of traits. Due in part to privacy issues, often only summary statistics of these genetic associations are made publicly available. Since these results are easily accessible, MR investigations increasingly rely on inferential methods that require only two-sample summary data (Burgess et al., 2015). In such applications, genetic variant associations with the risk factor are obtained from a sample that is representative of, but does not overlap with, the sample used to measure genetic variant associations with the outcome.
As in usual instrumental variable analyses, identifying causal effects through MR requires some key assumptions. For a genetic variant to be a valid instrument, it must be associated with the risk factor; β_Xj ≠ 0 in Figure 1 (relevance). Second, the variant must be uncorrelated with unobserved confounders. Third, any effect that the variant has on the outcome must be mediated by its effect on the risk factor; τ_j = 0 in Figure 1 (exclusion). Violations of the exclusion condition are common in MR studies (Lawlor et al., 2008), due in part to the widespread phenomenon of pleiotropy, where a single genetic variant may influence several traits (Solovieff et al., 2013; Hemani et al., 2018). While the biological mechanism of certain genetic effects may be well understood (for example, when using protein risk factors for drug target validation; see Schmidt et al., 2020), it is generally difficult to rule out the possibility that many genetic variants have a direct effect on the outcome, and thus violate the exclusion restriction (Verbanck et al., 2018).
In some MR applications, investigators may have more confidence in the validity of a particular subset of all candidate instruments. In particular, the causal mechanisms linking specific genes to risk factors may be known, which may better justify the use of genetic variants from those genes as instruments. Examples of MR studies that have prioritised the use of variants from biologically-plausible genes include investigations into the effects of alcohol consumption (Millwood et al., 2019), C-reactive protein level (Swerdlow et al., 2016), smoking behaviour (Lassi et al., 2016), vitamin D supplementation (Mokry et al., 2015), and perturbing drug targets (Gill et al., 2021).
Along with biologically-justified instruments, investigators could consider using additional variants that are plausibly valid instruments in order to improve the precision of an analysis. Even if these additional instruments are slightly invalid, their use may provide slightly biased but more precise estimates that are optimal from the perspective of mean squared error.
Hence, to guide this instrument choice, we propose a method for "focused" instrument selection whereby genetic variants are selected to minimise the estimated asymptotic mean squared error of causal effect estimates. The strategy allows a tiering in the assumptions on instrument validity, and prioritises the evidence suggested by biologically-justified instruments.
We work with the popular two-sample summary data design, and in a setting of many weak and locally invalid instruments. In this local misspecification setting, the collective direct instrument effects on the outcome are decreasing with the sample size at a rate that ensures a meaningful bias-variance trade off, and thus enables a mean squared error comparison in finite samples.
The theoretical contribution of our work has two elements. First, we extend DiTraglia (2016)'s results to allow for many weak instruments. We consider an asymptotic framework in which the number of instruments can grow at the same rate as the sample size, as long as their collective effects on the risk factor are bounded (Zhao et al., 2020). Second, we consider the problem of post-selection inference for focused estimators, which is particularly challenging because we do not have consistent model selection; the uncertainty in model selection directly impacts the asymptotic distribution of focused estimators.
Given that focused use of additional instruments may lead to improved estimation compared with using only a smaller set of biologically-justified instruments, it is natural to consider whether a similar advantage may hold for inference. One desirable property for confidence intervals is that they are uniformly valid; that is, they achieve nominal coverage asymptotically over the space of potential direct instrument effects on the outcome. Unfortunately, the impossibility results discussed by Leeb and Pötscher (2005) suggest we cannot construct uniformly valid confidence intervals for focused estimators that exactly achieve nominal coverage; such intervals will generally be conservative. For example, in simulation we find that the "2-step" uniformly valid confidence intervals of DiTraglia (2016) can be over 30% longer than standard confidence intervals based on using only a core set of valid instruments.
In the same way that focused estimation is willing to trade a small bias in estimation for a larger reduction in variance, we develop a strategy for constructing "focused" confidence intervals that accepts a small loss in coverage probability over a subspace of direct instrument effects on the outcome in exchange for shorter confidence intervals. Our focused confidence intervals account for uncertainty in model selection, and they achieve nominal coverage for plausible values of direct instrument effects, while providing a statistical guarantee on the worst case size distortion away from those values.
Compared with using only a core set of biologically-justified instruments, our simulation evidence illustrates how focused estimators and confidence intervals may provide improved estimation and inference when additional instruments are valid or slightly invalid, while buying insurance against poor performance when additional instruments are very invalid.
The utility of our methods is demonstrated in two empirical applications. The first considers genetic validation of lipid drug targets when investigators are uncertain about defining the width of a cis-gene window from which to select instruments. The second application investigates the genetically predicted effect of vitamin D supplementation on a range of outcomes when investigators want to prioritise the use of genetic variants from specific genes through biological considerations.
The proofs of theoretical results are given in the Appendix, and R code to perform our empirical investigation is available on GitHub at github.com/ash-res/focused-MR/.
Model and assumptions

Two-sample summary data
We first outline our assumptions on genetic association summary data, which are motivated through a simple linear model with invalid instruments and homoscedastic errors.
Let Z = (Z_1, ..., Z_p)' denote a p-vector of uncorrelated genetic variants. The parameter of interest is the causal effect θ_0 of the risk factor X on the outcome Y, which is described by

X = Z'β_X + ω_X U + E_X,  Y = θ_0 X + Z'τ + ω_Y U + E_Y,  (1)

where U denotes unobserved confounding, (E_X, E_Y) are mean-zero errors, and (ω_X, ω_Y, β_X, τ, θ_0) are unknown parameters. If τ is non-zero, then at least one genetic variant fails the exclusion restriction and directly affects the outcome.
Combining the equations in (1), the genetic associations with the outcome satisfy

β_Y = β_X θ_0 + τ,  (3)

where β_Y = (β_{Y1}, ..., β_{Yp})' is the p-vector of coefficients from a population regression of Y on Z. We aim to estimate the model in (3) using two-sample summary data on genetic associations.
Assumption 1 (two-sample summary data). For each variant j, we observe genetic associations β̂_Xj and β̂_Yj, which satisfy β̂_Xj ~ N(β_Xj, σ²_Xj) and β̂_Yj ~ N(β_Yj, σ²_Yj), where the standard deviations {σ_Xj, σ_Yj}_{j=1}^p are assumed to be known. Moreover, the set of 2p genetic associations {β̂_Xj, β̂_Yj}_{j=1}^p are mutually uncorrelated.

Assumption 1 is taken from Zhao et al. (2020) and states a normal approximation of estimated genetic associations, which is typically justified by the large random sampling expected in genetic association studies. Specifically, for each variant j, we have access to estimates and standard errors from univariable linear regressions of X on Z_j in an n_X-sized sample, and, from a non-overlapping n_Y-sized sample, we observe measured associations from univariable linear regressions of Y on Z_j. Both random samples are drawn from the joint distribution of (Y, X, Z).
The assumption that the population standard deviations {σ X j , σ Y j } p j=1 are known is common in MR, and a formal justification for this is given by Ye et al. (2021).To simplify notation, the dependence of the standard errors on the sample size is not made explicit, but they are assumed to decrease at the usual parametric rate.

Core instruments
From (3), the parameter of interest θ_0 is not identified unless there are some restrictions on τ = (τ_1, ..., τ_p)'. When genetic variants from several gene regions are used to instrument the risk factor, a popular identification strategy is to assume that τ is a mean-zero random effect; see, for example, Zhao et al. (2020). Another commonly-used assumption is that most genetic variants are valid instruments; τ_j = 0 for j ∈ S_M, where S_M is some unknown set of variants such that p^{-1}|S_M| > 0.5. In this setting the median-based estimators of Bowden et al. (2016) are consistent. Based on variations of these assumptions, many summary data MR methods have proposed ways to obtain unbiased estimates of θ_0; a recent review is given in Sanderson et al. (2022).
In this work we do not propose a novel identification strategy, but instead address the simple instrument selection choice faced by investigators when they believe a core set of genetic variants S_0 consists of only valid instruments, but are less confident in the validity of any additional candidate instruments.
Assumption 2 (core instruments). For some known set of instruments S_0 such that 1 ≤ |S_0| < p, and where |S_0| grows proportionately with p, we have τ_j = 0 for all j ∈ S_0.
Assumption 2 allows us to measure the bias which may result from the inclusion of additional instruments. In some MR studies, there is good reason for believing that a smaller subset of all available genetic variants are more likely to be valid instruments. We mention two examples that we explore in more detail in Section 6.
Example 1 (choosing cis windows in drug target MR). In drug target MR studies, only those genetic variants from a single gene region that encodes the protein target of a drug are used to instrument the risk factor. Such studies have gained popularity for providing supporting genetic evidence to validate drug targets and to study side effects (Gill et al., 2021). A key decision in drug target MR is to choose the width of a "cis window", which dictates a gene region from which to select instruments. Tools such as GeneCards (Stelzer et al., 2016) offer practical guidance on defining appropriate gene regions, but since there are often relatively few uncorrelated genetic signals from narrow cis windows, researchers often resort to widening the cis window in order to boost power through the use of a larger number of instruments. This leaves the possibility of a type of publication bias where researchers may report findings only from a cis window which gives their preferred result. We may be less confident in the instrument validity of additional genetic variants that are included only from widening a cis window.
Example 2 (estimating vitamin D effects). GWASs have identified strong genetic associations with vitamin D in biologically plausible genes; GC, DHCR7, CYP2R1, and CYP24A1. Each of these genes is known to influence vitamin D level through a different mechanism. In order to investigate the effect of vitamin D supplementation on a range of traits and diseases, MR studies have often used genetic variants located in neighborhoods of those genes to instrument vitamin D, as they are considered more likely to satisfy the exclusion restriction than other genome-wide significant variants (Mokry et al., 2015; Revez et al., 2020). Genetic variants from other gene regions may be strongly associated with vitamin D, but they may be more likely to have direct effects on the outcome through their effects on traits other than vitamin D.

Many weak variant associations
In typical applications, we may expect many genetic variants to have weak effects on the risk factor. This can cause difficulties for identifying and estimating the causal effect. We study a setting of many weak instruments where the number of genetic variants is permitted to grow at the same rate as the sample size, but their collective explanatory power is bounded.
Assumption 3 (many weak instruments). Let β_{X,0} denote the |S_0|-vector with its elements given by β_Xj for j ∈ S_0. Then, ‖β_X‖₂ = O(1), ‖β_X‖₃/‖β_{X,0}‖₂ → 0, and p/(n‖β_{X,0}‖₂²) = O(1) as n, p → ∞.

Given (3) and Assumptions 2 and 3, the causal effect is point identified. Assumption 3 is similar to Zhao et al. (2020), and it implies that all variants are relevant instruments, but the explanatory power of any individual variant is decreasing as p → ∞. Moreover, the skewness of β_X is restricted, which rules out very sparse variant effect settings. Finally, the rate restriction p/(n‖β_{X,0}‖₂²) = O(1) as n, p → ∞ ensures asymptotic normality in estimation, and the condition is plausible given the large sample sizes of typical GWASs.

Additional locally invalid instruments
To meaningfully study a bias-variance trade off from including additional instruments, we need to ensure that the bias and variance terms have the same order of magnitude. Following DiTraglia (2016)'s framework of locally invalid instruments, we work with local misspecification in which the direct effects τ are collectively "local-to-zero", decreasing with n at the same rate as the sampling errors. This is not a substantive biological assumption, but rather a technical device that allows us to study the instrument selection problem from a mean squared error perspective.
Under the rate restriction in Assumption 4, we can consistently estimate θ_0 using any set of instruments, but making valid inferences using invalid instruments will require us to account for an asymptotic bias. The rate restriction is slightly different from that of DiTraglia (2016) in that we limit only the collective direct effects over all variants.
Assumption 4 is otherwise quite general.For example, the direct effects τ may be correlated with β X , and thus may violate the so-called InSIDE (Instrument Strength Independent of Direct Effect) assumption which some existing MR methods that make use of invalid instruments rely on (Bowden et al., 2015).

Focused instrument selection and estimation
Using only the core set of instruments S_0 should lead to asymptotically unbiased estimation of θ_0, and this forms a basis from which to measure the potential bias from including additional instruments. For ease of exposition we focus on the simple choice between using either: (i) the "Core" estimator that uses only the set of core instruments S_0; or (ii) the "Full" estimator that uses the full set of p instruments. The inclusion of S, the set of the p − |S_0| additional genetic variants, may result in improved estimation if there is a large reduction in variance, and at most only a small increase in bias. Thus we use a "focused" instrument selection strategy (DiTraglia, 2016) where the Full estimator is selected only if it has a lower estimated asymptotic mean squared error than the Core estimator.
The results presented here can be easily extended for more fine-tuned instrument selection at the expense of extra notation; instead of simply choosing between the Core and Full estimator, we could also select from subsets of the additional instruments. Such subsets could be data-driven, chosen on biological reasons, and have overlapping variants. For example, in our empirical application to study vitamin D effects, we chose from all possible subsets of 3 partitions of the additional instruments, where the partitions were formed by k-means clustering on the Wald ratio estimate β̂_Yj/β̂_Xj of each variant j ∈ S.
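As an illustration, the partitioning step can be sketched with a simple one-dimensional k-means on the Wald ratios. The paper's replication code is in R; this Python sketch and its toy variant data are assumptions for illustration only.

```python
import statistics

def wald_ratio_clusters(beta_Y_hat, beta_X_hat, k=3, iters=50):
    # Cluster per-variant Wald ratios beta_Y_hat/beta_X_hat into k groups
    # using one-dimensional k-means (Lloyd's algorithm), initialised at
    # evenly spaced order statistics so the result is deterministic.
    ratios = [by / bx for by, bx in zip(beta_Y_hat, beta_X_hat)]
    srt = sorted(ratios)
    centers = [srt[round(i * (len(srt) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(r - centers[c])) for r in ratios]
        for c in range(k):
            members = [r for r, lab in zip(ratios, labels) if lab == c]
            if members:
                centers[c] = statistics.mean(members)
    return labels

# toy example: 9 additional variants whose Wald ratios form three clear groups
bx = [1.0] * 9
by = [0.20, 0.21, 0.19, 1.00, 1.05, 0.95, -0.50, -0.48, -0.52]
labels = wald_ratio_clusters(by, bx, k=3)
```

Variants sharing a label form one block of the 3-partition; candidate instrument subsets are then built from unions of these blocks.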
We consider limited information maximum likelihood (LIML) estimation of θ_0. The Core and Full estimators, θ̂_C and θ̂_F, are given by

θ̂_C = arg min_θ Σ_{j∈S_0} (β̂_Yj − θβ̂_Xj)² / (σ²_Yj + θ²σ²_Xj)  and  θ̂_F = arg min_θ Σ_{j=1}^p (β̂_Yj − θβ̂_Xj)² / (σ²_Yj + θ²σ²_Xj).

Under Assumptions 1-3, Theorem 3.1 of Zhao et al. (2020) gives the asymptotic distribution of θ̂_C. Using similar arguments, we can derive the asymptotic distribution of the Full estimator.
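Assuming the profile-likelihood form of the LIML-type objective above, the minimisation can be sketched with a one-dimensional search. This is a Python illustration, not the paper's R implementation; the bounded search interval and the noiseless toy inputs are assumptions.

```python
def liml_estimate(bx_hat, by_hat, sx2, sy2, lo=-5.0, hi=5.0, tol=1e-9):
    # Profile-likelihood (LIML-type) estimate: minimise over theta
    #   sum_j (by_j - theta*bx_j)^2 / (sy2_j + theta^2 * sx2_j)
    # via golden-section search; the objective is smooth and, in
    # well-behaved cases, unimodal on a bounded interval.
    def objective(theta):
        return sum((by - theta * bx) ** 2 / (sy + theta ** 2 * sx)
                   for bx, by, sx, sy in zip(bx_hat, by_hat, sx2, sy2))
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if objective(c) < objective(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

# noiseless check: with by = 0.2 * bx exactly, the minimiser is theta = 0.2
bx = [0.10, 0.20, 0.15, 0.05]
by = [0.2 * v for v in bx]
theta_hat = liml_estimate(bx, by, [0.01] * 4, [0.01] * 4)
```

The θ²σ²_Xj term in the denominator is what distinguishes this from a simple inverse-variance weighted regression, and is why a numerical search rather than a closed-form ratio is used.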
Theorem 1 (Full estimator). Under Assumptions 1-4, the Full estimator θ̂_F is consistent and asymptotically normal.

The variance terms η_C and η_S are of order O(n‖β_X‖₂²), and the terms ς_C and ς_S are of order O(p). Therefore, (η_C + η_S)^{-1} would be the asymptotic variance of θ̂_F in a fixed p setting with strong instruments, and (η_C + η_S)^{-2}(ς_C + ς_S) is an additional variance component to account for the extra uncertainty due to many weak instruments (Zhao et al., 2020). This is particularly important for our instrument selection problem since under-estimated variances based on 'fixed p' asymptotics could cause a mean squared error-based selection criterion to falsely recommend the inclusion of additional instruments. These two variance components are of the same order of magnitude when p/(n‖β_X‖₂²) = Θ(1). As discussed by Newey and Windmeijer (2009), despite the knife-edge condition required to balance these variance components, it may be advisable to use the weak instrument variance correction (η_C + η_S)^{-2}(ς_C + ς_S) in general scenarios when n is considerably larger than p, and when instruments are strong. For example, the simulation study of Davies et al. (2015, pp. 457-9), which mimicked an MR design, showed that standard errors that did not correct for many weak instrument effects led to inflated type I error rates even for the case where n = 3000 and p = 9.
We also note that compared with the model of Zhao et al. (2020), here we consider τ to be a fixed effect rather than a random variable; this direct variant effect on the outcome induces a bias in estimation rather than an increase in variance. The asymptotic variance of θ̂_F is of order O((n‖β_X‖₂²)^{-1}), and its asymptotic bias is of order O((n‖β_X‖₂²)^{-1/2}). Thus, there is a meaningful bias-variance trade off if p/(n‖β_X‖₂²) = O(1), since the square of the asymptotic bias is of the same order of magnitude as the asymptotic variance.
To carry out focused instrument selection, we need to estimate and compare the asymptotic mean squared error (AMSE) of θ̂_C and θ̂_F. The Core estimator is asymptotically unbiased, and it is straightforward to consistently estimate its asymptotic variance ∆_C. Under local misspecification, the asymptotic bias b of the Full estimator cannot be consistently estimated. However, we can use θ̂_C to construct an asymptotically unbiased estimate b̂ of b.
Theorem 2 (Asymptotic bias estimator). Under Assumptions 1-4, b̂ is asymptotically normal with mean b and asymptotic variance ∆_B.

Although we cannot consistently estimate b, Theorem 2 shows that b̂ is an asymptotically unbiased estimator of b. Similar to before, we can write the asymptotic variance ∆_B as the sum of two terms: the first term is the asymptotic variance of b̂ in a fixed p setting with strong instruments, and the second term represents the extra uncertainty in estimation due to many weak instruments.
In order to estimate the AMSE of θ̂_C and θ̂_F, we need to construct consistent estimators ∆̂_C, ∆̂_F, and ∆̂_B of the asymptotic variances. Since θ̂_C is asymptotically unbiased, a consistent estimator for its AMSE is ∆̂_C. For θ̂_F, an asymptotically unbiased estimator of its AMSE is (b̂² − ∆̂_B) + ∆̂_F. Following DiTraglia (2016), since the square of the asymptotic bias cannot be negative, we use max(b̂² − ∆̂_B, 0) instead of b̂² − ∆̂_B when estimating the AMSE of θ̂_F.
Define Ŵ = max(b̂² − ∆̂_B, 0) + ∆̂_F − ∆̂_C as the estimated AMSE of θ̂_F minus the estimated AMSE of θ̂_C. The event that we select the Full estimator is then given by {Ŵ ≤ 0}. The "Focused" estimator is

θ̂ = θ̂_F · 1{Ŵ ≤ 0} + θ̂_C · 1{Ŵ > 0}.

Since both θ̂_F and θ̂_C are consistent estimators of θ_0, so is the Focused estimator θ̂.
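The selection rule can be sketched directly from the definition of Ŵ. This is a Python illustration with hypothetical numeric inputs; the paper's replication code is in R.

```python
def focused_estimate(theta_C, theta_F, b_hat, var_B, var_C, var_F):
    # Estimated AMSE of the Core estimator: just its variance, since it
    # is asymptotically unbiased. Estimated AMSE of the Full estimator:
    # truncated squared-bias estimate plus its variance.
    amse_C = var_C
    amse_F = max(b_hat ** 2 - var_B, 0.0) + var_F
    W = amse_F - amse_C
    # Select the Full estimator only when its estimated AMSE is no larger.
    return theta_F if W <= 0 else theta_C

# small bias estimate with a big variance reduction: select Full
est1 = focused_estimate(theta_C=0.25, theta_F=0.20, b_hat=0.02,
                        var_B=0.010, var_C=0.004, var_F=0.001)
# large bias estimate dominating the variance reduction: select Core
est2 = focused_estimate(theta_C=0.25, theta_F=0.20, b_hat=0.50,
                        var_B=0.010, var_C=0.004, var_F=0.001)
```

In the first call Ŵ = 0 + 0.001 − 0.004 < 0 so the Full estimate is returned; in the second the truncated squared bias (0.24) dominates and the Core estimate is returned.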

Post-selection inference
In this section we discuss the problem of constructing confidence intervals for the Focused estimator θ̂, which is non-standard because model selection is based on an AMSE criterion that is not consistently estimated.
We start by deriving the asymptotic distribution of θ̂, and discuss why naive confidence intervals that ignore sampling uncertainty in instrument selection are likely to perform poorly. We then note the relative merits of two related inference procedures proposed in DiTraglia (2016), before introducing a new "Focused" approach that combines their strengths. Just as the Focused estimator aims to achieve a good balance between bias and variance, Focused intervals aim to achieve a good balance between the length of confidence intervals and the potential worst case asymptotic coverage over the likely space of direct instrument effects on the outcome.

Asymptotic distribution of the Focused estimator
A feature of the local misspecification framework is that the sampling uncertainty from asymptotic bias estimation directly affects the asymptotic distribution of the post-selection Focused estimator θ̂. In particular, there is non-ignorable uncertainty in model selection: even if the use of only the core instruments is optimal from an AMSE perspective, the uncertainty in asymptotic bias estimation can cause the Focused estimator to erroneously select the full set of instruments.
In contrast, under consistent model selection, the Focused estimator would choose either θ̂_C or θ̂_F with probability approaching 1 as n, p → ∞. While this would seem to simplify the task of inference, it would still not be possible to consistently estimate the distribution of post-selection estimators uniformly over the space of direct effects τ (Leeb and Pötscher, 2005, Section 2.3, pp. 38-40). Moreover, consistent model selection does not suit our goal of improved estimation in terms of low risk, since the worst case risk of post-selection estimators would be unbounded (Leeb and Pötscher, 2008).
Theorem 3 (Focused estimator). Under Assumptions 1-4, the Focused estimator θ̂ is consistent, and its asymptotic distribution Λ(b) is a weighted average of the asymptotic distributions of θ̂_C and θ̂_F, where the weights are random even as n, p → ∞.

As a result, "naive" confidence intervals for θ̂ that ignore sampling uncertainty in instrument selection should not be reported. Such intervals can perform extremely poorly in practice, with coverage arbitrarily far below their nominal level.
Under the condition p/(n‖β_X‖₂²) = O(1) as n, p → ∞ from Assumption 3, the variance components ∆ can be consistently estimated, and therefore to simplify notation we henceforth assume that ∆ is known.

Focused confidence intervals with coverage constraints
There are two related problems that a valid inference procedure must solve.First, confidence intervals need to be widened to account for the model selection uncertainty introduced by focused instrument selection.Second, confidence intervals need to be re-centered if the focused estimator is asymptotically biased.
If the true asymptotic bias component b were known, inference would be straightforward: Theorem 3 could be used directly to simulate the distribution of Λ(b). However, b cannot be consistently estimated in our locally invalid instruments framework. For this setting, DiTraglia (2016) proposes two feasible inference procedures that consider the distribution of Λ(b) at values of b that are plausible given the observed data.
The "1-step" interval is based on the distribution of Λ(b) evaluated at b = b̂, and it effectively assumes that the true value of b is equal to its asymptotically unbiased estimate b̂. This is intuitive since b̂ is in some sense the most plausible value of b given the data. It follows that if b̂ is close to the true value of b, then the 1-step interval will have coverage that is close to its nominal level. Moreover, when the direct effects τ are small, simulation evidence from DiTraglia (2016) suggests that the 1-step interval has competitive coverage and can be shorter in length than the "Core" interval, which is the standard confidence interval of θ̂_C that uses only the core instruments. More generally, however, the 1-step interval comes with no theoretical guarantees; it may substantially under-cover, although its performance is much better than that of a naive interval that ignores instrument selection.
The reason why the 1-step interval may under-cover is that it fails to account for the uncertainty in the estimate b̂ of b, thus potentially leading to intervals that are too short and centered incorrectly. The "2-step" interval allows for such uncertainty by first constructing a confidence region ϕ for b. It then simulates the distribution of Λ(b′) at every value b′ in ϕ, constructing a collection of confidence intervals each based on the assumption that the true value of b is b′. To obtain a uniform coverage guarantee, the 2-step interval takes the outer envelope of all of the resulting intervals. This makes the 2-step interval extremely conservative: in general there is no value of b for which the actual coverage equals the nominal coverage, and hence it will always over-cover. Our simulation evidence suggests that this over-coverage problem makes the 2-step intervals too wide to be useful in practice.
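The contrast between the two constructions can be illustrated by simulation in a deliberately stylised model. This is not the paper's exact Λ(b): treating the Core and Full errors as independent normals, and forming b̂ as their difference, are simplifying assumptions made only to show why the envelope interval is at least as wide as the 1-step interval.

```python
import random

def focused_error_draws(b, v_C, v_F, n_sim=20000, seed=1):
    # Stylised sketch of the post-selection error distribution when the
    # Full estimator carries asymptotic bias b: independent normal errors
    # for Core and Full, bias estimated by their difference, and the
    # AMSE-based rule deciding which error is realised.
    rng = random.Random(seed)
    var_B = v_C + v_F
    draws = []
    for _ in range(n_sim):
        eC = rng.gauss(0.0, v_C ** 0.5)   # theta_C - theta_0
        eF = rng.gauss(b, v_F ** 0.5)     # theta_F - theta_0
        b_hat = eF - eC                   # unbiased for b in this sketch
        W = max(b_hat ** 2 - var_B, 0.0) + v_F - v_C
        draws.append(eF if W <= 0 else eC)
    return draws

def error_quantile_interval(b, v_C, v_F, alpha=0.05):
    draws = sorted(focused_error_draws(b, v_C, v_F))
    lo = draws[int(alpha / 2 * len(draws))]
    hi = draws[int((1 - alpha / 2) * len(draws)) - 1]
    return lo, hi

v_C, v_F, b_obs = 0.010, 0.004, 0.10
one_step = error_quantile_interval(b_obs, v_C, v_F)   # simulate only at b = b_hat
# 2-step style: outer envelope over a grid spanning a confidence region for b
grid = [b_obs + s * (v_C + v_F) ** 0.5 for s in (-2, -1, 0, 1, 2)]
two_step = (min(error_quantile_interval(g, v_C, v_F)[0] for g in grid),
            max(error_quantile_interval(g, v_C, v_F)[1] for g in grid))
```

Because the grid contains b_obs, the envelope necessarily contains the 1-step interval, which is exactly the source of the 2-step interval's conservativeness.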
We consider a way forward for improved inference with "Focused" intervals. These intervals aim to combine the strengths of the 1-step and 2-step intervals while avoiding their drawbacks. Like the 1-step interval, the Focused interval is constructed by simulating Λ(b) at a single value of b rather than taking an outer envelope over many values. This means that it can yield shorter confidence intervals. Like the 2-step interval, however, it comes with theoretical guarantees. The key is to choose an appropriate value of b.
The Focused interval considers only values of b that are contained in a (1 − α_1) × 100% confidence interval ϕ. This is the same confidence interval for b used in the 2-step approach. For some values of b in ϕ the distribution of Λ(b) will be highly dispersed. Suppose that b′ is such a value, so that a (1 − α_2) × 100% confidence interval CI(b′) for θ_0 computed under the assumption that b = b′ will be relatively wide. By construction, CI(b′) will achieve nominal coverage probability 1 − α_2 when b = b′. The key insight is as follows: if b′′ is a value in ϕ for which Λ(b′′) is relatively less dispersed, then CI(b′) may also be a nearly valid confidence interval for θ_0 when b = b′′. Using this idea, the construction of the Focused interval proceeds as follows, based on a user-specified tolerance γ and nominal coverage 1 − α.
Algorithm 1 (Focused interval with a minimum coverage constraint). 5. Notice that b′ depends on γ, α_1, and α_2. Repeat steps 1-4 for a range of choices of α_1 subject to the constraint α_1 + α_2 = α, and choose the value of α_1 that yields the shortest interval for θ_0.
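Step 5 can be sketched as a grid search over the split α_1 + α_2 = α. This Python sketch treats the per-split interval construction as a black box; the `toy_builder` below is a hypothetical Bonferroni-style stand-in for illustration, not the paper's construction from steps 1-4.

```python
from statistics import NormalDist

def shortest_over_alpha_split(build_interval, alpha=0.05, n_grid=9):
    # Step 5 (sketch): search over splits alpha_1 + alpha_2 = alpha and
    # keep the shortest interval; build_interval(a1, a2) is any rule
    # returning (lower, upper) with the appropriate coverage guarantees.
    best = None
    for i in range(1, n_grid):
        a1 = alpha * i / n_grid
        lo, hi = build_interval(a1, alpha - a1)
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best

# hypothetical builder for illustration only: its half-length trades off the
# bias-region level a1 against the estimation quantile level a2
z = NormalDist().inv_cdf
def toy_builder(a1, a2, est=0.2, sd_bias=0.05, sd_est=0.10):
    half = z(1 - a2 / 2) * sd_est + z(1 - a1 / 2) * sd_bias
    return est - half, est + half

ci = shortest_over_alpha_split(toy_builder, alpha=0.05)
```

The search makes the trade-off concrete: a tighter bias region (larger α_1) leaves less of the α budget for the estimation quantiles, so neither extreme split need be optimal.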
Like the 1-step interval, the Focused interval is always shorter than the 2-step interval because it is one of the intervals contained in the outer envelope that forms the 2-step interval.Unlike the 1-step interval, its asymptotic coverage probability can never fall below 1 − α − γ.
Theorem 4 (Worst case asymptotic coverage of Focused intervals). Under Assumptions 1-4, the Focused interval defined in Algorithm 1 has asymptotic coverage probability no less than 1 − α − γ.

The Focused interval is designed to achieve nominal coverage 1 − α at a plausible value of the asymptotic bias, while also controlling the worst case asymptotic coverage according to a maximum allowable size distortion γ. The choice of γ dictates the trade off between the worst case coverage level and the length of the interval, with a lower tolerance γ more likely to provide conservative inference. The feasibility of constructing the Focused interval relies on the existence of a sufficiently dispersed distribution Λ(b′) at a plausible value b′, which may not exist for extremely low levels of γ. However, selecting an extremely low level of γ would defeat the purpose of the Focused interval, which aims to be competitive in terms of both coverage and length.
Although Algorithm 1 is novel, the concept of an inference procedure that depends on a user-specified allowable size distortion has precursors in the econometrics literature. For example, Andrews (2018) proposes an inference strategy that controls a worst case coverage distortion under weak instruments, and where the decision to report a conventional or weak instrument robust confidence set depends on the level of under-coverage that an investigator is willing to accept.

Simulation study
In this section we illustrate how the finite sample performance of the Focused estimator and Focused confidence interval depends on the strength of instruments and the magnitude of direct variant effects on the outcome.
First, we consider estimation performance in terms of root mean squared error (RMSE). Second, we show how the Focused interval may be able to achieve a favourable balance of length and coverage, and discuss where it may lead to improved inference compared with the Core interval, which is the conventional confidence interval of θ̂_C based on using only the core instruments. Third, we discuss the sensitivity of the Focused interval to the choice of γ, which controls the worst case coverage loss. Finally, we consider the performance of the Focused estimator when the core instruments S_0 are in fact invalid.

Design
We simulated two-sample summary data on p = 110 variants according to Assumptions 1-4. The sample sizes were set at n = n_X = n_Y = 1000. Of the 110 variants, 10 were set to be valid instruments, and they formed the core set S_0. The remaining p − |S_0| = 100 variants formed the set of additional instruments S.
We generated estimated associations β̂_Xj ~ N(β_Xj, σ²_Xj) and β̂_Yj ~ N(β_Yj, σ²_Yj), where the true genetic variant associations with the risk factor were set as β_Xj = β̃_C/√|S_0| for j ∈ S_0, and β_Xj = β̃_S/√(p − |S_0|) for j ∈ S, where β̃_C and β̃_S were chosen to maintain a particular level of the concentration parameters λ_C = Σ_{j∈S_0} β²_Xj/(|S_0|σ²_Xj) and λ_S = Σ_{j∈S} β²_Xj/(|S|σ²_Xj), which are measures of the average instrument strength of S_0 and S. The variances σ²_Xj and σ²_Yj were set equal to 1/n for all variants.
The true variant-outcome associations were set to be β_Yj = β_Xj θ_0 for j ∈ S_0, and β_Yj = β_Xj θ_0 + τ_j for j ∈ S, where the true causal effect was θ_0 = 0.2, and the direct effects are fixed effects generated as τ_j ∼ U[0, τ/√(np)]. For inference, we set the nominal coverage probability at 1 − α = 0.95, and unless otherwise stated, the allowable worst case size distortion for the Focused interval was set at γ = 0.2.
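The data-generating process above can be sketched in a few lines. This is a minimal illustration under our reading of the design; variable names, the random seed, and the exact scaling of β_Xj (set here so the implied concentration parameter matches a target) are our assumptions where the text is ambiguous:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000          # n = n_X = n_Y
p = 110           # total number of variants
n_core = 10       # |S_0|, the valid core instruments
theta0 = 0.2      # true causal effect
tau_bar = 2.0     # invalidness scale of S (varied in the study)

sigma2 = 1.0 / n  # sigma^2_Xj = sigma^2_Yj = 1/n for all variants

# Choose common true effects so the concentration parameters hit
# target average instrument strengths lambda_C and lambda_S.
lam_C, lam_S = 40.0, 40.0
beta_X = np.empty(p)
beta_X[:n_core] = np.sqrt(lam_C * sigma2)  # core set S_0
beta_X[n_core:] = np.sqrt(lam_S * sigma2)  # additional set S

# Direct effects: zero for the core set, fixed uniform draws for S
tau = np.zeros(p)
tau[n_core:] = rng.uniform(0, tau_bar / np.sqrt(n * p), size=p - n_core)

beta_Y = beta_X * theta0 + tau

# Two-sample summary data: estimated associations with sampling noise
beta_X_hat = beta_X + rng.normal(0, np.sqrt(sigma2), size=p)
beta_Y_hat = beta_Y + rng.normal(0, np.sqrt(sigma2), size=p)

# Implied concentration parameter of the core set:
# lambda_C = sum_{j in S_0} beta_Xj^2 / (|S_0| sigma_Xj^2)
lam_C_implied = np.sum(beta_X[:n_core] ** 2 / (n_core * sigma2))
```

With equal per-variant effects, the implied concentration parameter of the core set equals the target `lam_C` exactly, which is the sense in which β̄_C is "chosen to maintain a particular level" of instrument strength.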
Along with the Focused, 1-step, 2-step, and Core intervals described above, we also note the performance of the "Naive" confidence interval, which is the standard confidence interval for the selected estimator that ignores sampling uncertainty in model selection. The extent of the improvement in RMSE that is possible appears to depend on the strength of instruments. For example, when τ = 2, the Focused estimator offered a 32.8% reduction in RMSE when all instruments are relatively weak (λ_C = λ_S = 40), a 30.7% reduction when λ_C = λ_S = 120, and a 27.8% reduction when all instruments are relatively strong.

Estimation
The relative strengths of the core and additional instrument sets affect the values of τ over which focused instrument selection is able to improve estimation. When the additional instruments were strong and the core instruments were relatively weak (λ_C = 40, λ_S = 200), the Focused estimator had a lower RMSE than the Core estimator over the range 0 ≤ τ < 10.
In contrast, when λ_C = 200 and λ_S = 40, the Focused estimator had a lower RMSE only over the range 0 ≤ τ ≤ 2.
In summary, these estimation results intuitively suggest that focused instrument selection is more likely to improve estimation when the additional instruments are not too invalid, the core instruments are quite weak, and the additional instruments are strong. This is practically relevant for MR analyses of polygenic traits where several genes may be causally related.

Confidence intervals
For inference, Figure 3 shows that the coverage probability of the Naive interval dropped to as low as 0.4 when all instruments are equally strong (τ = 8), and lower than 0.2 when the additional instruments are much stronger than the core instruments (λ_C = 40, λ_S = 200, τ = 10). This underscores the importance of accounting for sampling uncertainty in model selection when constructing confidence intervals for θ. On the other hand, the 2-step intervals are conservative; the intervals exceeded nominal coverage probability, and Figure 4 shows that they were generally over 30% longer than the Core intervals.
The 1-step intervals are a useful compromise between the 2-step and Naive intervals. When the additional instruments were not too invalid (i.e. for small enough values of τ), the 1-step intervals were shorter than the Core intervals, while also achieving nominal coverage probability. The performance of the 1-step interval appears to be quite sensitive to the relative strengths of the core and additional instruments. When the additional instruments were relatively strong (for example, λ_C = 40, λ_S ≥ 160), the 1-step intervals were shorter than the Core intervals, but they under-covered for large enough values of τ. Conversely, when the core instruments are strong enough, the 1-step intervals showed no real advantages compared with the Core interval. For the Focused intervals, we selected the allowable worst case size distortion as γ = 0.2, so that for a 1 − α = 0.95 level confidence interval, coverage probability should not be lower than 0.75. Figure 3 shows that the coverage of the Focused interval was higher than 0.8 for all parameter values τ, whereas for some values the coverage probability of the 1-step interval dropped below 0.7. Moreover, the coverage of the Focused interval was generally more competitive than that of the 1-step interval, especially when the additional instruments were at least as strong as the core instruments.
When the core instruments were not much weaker than the additional instruments, the Focused intervals were shorter than the 1-step intervals. Interestingly, the Focused intervals were also up to 8% shorter than the Core intervals unless the additional instruments were very invalid (τ ≥ 12). The Focused interval can therefore improve the power of an analysis by incorporating information from additional relevant instruments if they are not too invalid. At the same time, the Focused interval retains good size control and buys insurance against serious under-coverage when additional instruments are very invalid.

Sensitivity to the choice of γ
The Focused intervals require investigators to choose an acceptable level of the worst case size distortion γ. Figure 5 verifies that the Focused interval is able to control the worst case coverage over all parameter values τ. The cost of allowing only a small size distortion is a longer interval. The Focused intervals were over 15% longer than the Core intervals when γ = 0.05, although they were also much shorter than the 2-step intervals.
Conversely, for larger values of γ, the Focused intervals become shorter, but only to a certain degree: since the Focused intervals account for uncertainty in model selection, they are not as short as the Naive intervals. Accordingly, the coverage probability of the Focused intervals will also not change for large enough values of γ.

Estimation when S_0 contains invalid instruments
Focused instrument selection aims to prioritise the evidence suggested by a core set of genetic variants S_0 that are believed to be valid instruments. In practice, investigators may not always be correct in this belief, and S_0 may actually consist of invalid instruments. We consider this setting in simulation by slightly altering the design in Section 5.1 so that the true variant-outcome associations are set to be β_Yj = β_Xj θ_0 + τ_j, where τ_j ∼ U[0, τ_C/√(np)] for j ∈ S_0. Figure 6 presents the estimation results for the case where all instruments are equally strong (λ_C = λ_S = 40). Our results show that when S_0 contained only slightly invalid instruments, the Focused estimator was able to improve on the Core estimator in terms of RMSE. As the instruments in S_0 become more invalid, but not as invalid as those in S, the performance of the Focused estimator worsens because the bias from including S is under-estimated. We note that the Focused estimator also performed relatively well when the instruments in S_0 were at least as invalid as those in S; for example, the RMSE of the Focused estimator was lower than that of the Core estimator when τ_C = τ = 12.

Empirical Examples
In this section, we demonstrate how focused instrument selection can be applied in MR studies. First, we consider the problem of instrument selection in drug target MR, where variation in a gene that encodes the protein target of a drug is used to proxy drug target perturbation. Such MR investigations have the potential to provide genetic evidence on drug efficacy and to study potential side effects (Gill et al., 2021).
An important decision in drug target MR is to specify a "cis window" that defines a gene region from which instruments are selected. In practice, investigators often use cis windows that are wider than gene regions defined by tools such as GeneCards (Stelzer et al., 2016) in order to boost the power of an analysis through the use of multiple instruments. This leaves open the possibility of a type of publication bias where researchers may report findings only from a cis window that gives their preferred result. In Section 6.1, we apply the focused instrument selection method to two lipid drug targets, where genetic variants that may be included only by widening a cis window are additional instruments.
Second, we investigate the effect of vitamin D supplementation on a range of outcomes. GWASs have identified strong genetic associations in biologically plausible genes known to have a functional role in the transport, synthesis, or metabolism of vitamin D. Some previous MR studies aiming to study vitamin D effects have used genetic variants located in neighborhoods of those genes to instrument vitamin D, as they are considered more likely to satisfy the exclusion restriction than other genome-wide significant variants (Mokry et al., 2015; Revez et al., 2020). However, the role of many other genes which are robustly associated with vitamin D may not yet be fully understood; for example, Jiang et al. (2021) selected genetic variants from 69 independent loci to instrument vitamin D. In Section 6.2, we use genetic variants from biologically plausible genes as the core set of instruments, and apply focused instrument selection to select from many additional genetic variants which may be considered more likely to have a direct effect on the outcome through their effects on traits other than vitamin D.
For Focused confidence intervals, we selected γ = 0.2 as the maximum allowable size distortion. The data used for our analyses are publicly available through the MR-Base platform (Hemani et al., 2018), and R code to perform our empirical investigation is available on GitHub at github.com/ash-res/focused-MR/.

CETP and PCSK9 inhibitors
Cholesteryl ester transfer protein (CETP) inhibitors are a class of drugs that raise high density lipoprotein cholesterol levels and lower low density lipoprotein cholesterol (LDL-C) levels. At least three CETP inhibitors have failed in clinical trials to show a protective effect against coronary heart disease (CHD), but the successful trial of Anacetrapib showed a modest benefit when used with statins (Bowman et al., 2017). A recent drug target MR analysis by Schmidt et al. (2021) offers genetic evidence that CETP inhibition may be an effective approach for preventing CHD. Here we investigate the robustness of a similar drug target MR study to the choice of cis window used to select instruments.
We study the genetically predicted LDL-C lowering effect of CETP inhibition on a range of outcomes by using genetic variants located in a neighborhood of the CETP gene. We may consider instruments drawn from the "narrow" cis window Chr 16: bp 56,985,862-57,027,757 (the region stated in GeneCards ±10,000 bp) to more accurately represent the genetically predicted effects of CETP inhibition. At the same time, it is also common for drug target MR studies to use a "wider" cis window for instrument selection. We may consider potential additional instruments from the wider window Chr 16: bp 56,895,862-57,117,757 (the region stated in GeneCards ±100,000 bp). The use of these additional instruments may be considered more likely to lead to biased estimation compared with using only those variants from the narrow cis window.
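The window arithmetic above is simple to reproduce. A minimal sketch, where the gene region is the CETP region implied by the narrow window stated in the text (narrow = region ±10,000 bp, wider = region ±100,000 bp):

```python
# CETP gene region on chromosome 16 implied by the text:
# the narrow cis window is this region +/- 10,000 bp,
# the wider cis window is this region +/- 100,000 bp.
gene_start, gene_end = 56_995_862, 57_017_757

def cis_window(start: int, end: int, flank: int) -> tuple[int, int]:
    """Extend a gene region by `flank` base pairs on each side."""
    return start - flank, end + flank

narrow = cis_window(gene_start, gene_end, 10_000)    # (56985862, 57027757)
wider = cis_window(gene_start, gene_end, 100_000)    # (56895862, 57117757)
```

Instruments falling inside `wider` but outside `narrow` would then form the set of additional instruments S.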
Therefore the Core estimator used only the 3 uncorrelated genetic variants located in the narrow window as instruments, while the number of additional variants that were available varied between 4 and 10 depending on the outcome of interest. The Focused estimator selected the full set of instruments for 12 of the 16 outcomes we studied. The concentration parameter for the core instruments was 91.10, while the concentration parameter for the additional instruments was between 6.52 and 8.12 depending on how many additional instruments were available. From Figure 7, we find that the genetically predicted LDL-C lowering effect of CETP inhibition is associated with a lower risk of CHD. The results also suggest that genetically predicted CETP inhibition may have a protective effect on other cardiovascular disease outcomes, specifically atrial fibrillation and heart failure. We do not find evidence for a protective effect on stroke outcomes, nor do we find evidence for any adverse effects on various non-cardiovascular outcomes.
Compared with CETP inhibitors, PCSK9 inhibitors (PCSK9i) are a more established class of drugs that lower LDL-C levels. A drug target MR analysis can genetically proxy the effect of taking PCSK9i by instrumenting LDL-C using genetic variants located in a neighborhood of the PCSK9 gene. Similar to the CETP gene analysis above, we consider variants located in a narrow cis window of PCSK9 (Chr 1: bp 55,505,221-55,530,525; exactly equal to the window stated in GeneCards) as core instruments, while additional variants taken from a wider window ±100,000 bp are additional instruments.
Under these criteria, there were 4 core instruments for the same outcomes as before, apart from lung cancer, for which there were 3. The number of additional instruments available ranged from 2 to 13. The Focused estimator again selected the full set of instruments for 12 of the 16 outcomes. The set of core instruments was very strong compared with the additional instruments; the concentration parameter for the core instruments was 618.71, compared with a range of 8.72 to 26.63 for the full set of additional variants available. The results in Figure 8 suggest that the genetically predicted LDL-C lowering effect of PCSK9i is associated with a lower risk of coronary artery disease, heart failure, and stroke incidence; in addition, the Focused intervals suggest an association specifically with large artery stroke incidence. We also find genetic evidence that PCSK9 inhibition may adversely affect the risk of developing Alzheimer's disease, which supports the findings of Williams et al. (2020) and Schmidt et al. (2021).

Vitamin D supplementation
Finally, we apply focused instrument selection to estimate the genetically predicted effect of vitamin D supplementation on a range of outcomes. Previous MR studies instrumenting vitamin D have used variants from genes implicated in the modulation of 25OHD levels through known mechanisms. In particular, the GC, DHCR7, CYP2R1 and CYP24A1 genes have known functions in vitamin D transport, synthesis, or metabolism (Berry et al., 2012; Mokry et al., 2015). Therefore we take genetic variants from neighborhoods of these genes (±500,000 bp around the regions stated in GeneCards) as our set of core instruments.
Moreover, GWASs also provide data on many other genetic variants robustly associated with vitamin D (Jiang et al., 2021). We use other variants passing a genome-wide significance threshold (p-value for association with vitamin D less than 5 × 10⁻⁸) to form additional instrument sets. Instead of choosing between the Core and Full estimators, in our analysis of vitamin D effects we allowed the Focused estimator to also choose from subsets of the additional instruments. We partitioned the additional instruments into 3 groups by k-means clustering based on the ratio estimate of each variant (β̂_Yj/β̂_Xj, j ∈ S) and then considered all possible combinations of these 3 partitions, thus creating 2³ − 1 = 7 sets of additional instruments.
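A sketch of this partitioning step follows. The ratio estimates are hypothetical draws, and a minimal 1-d Lloyd's k-means is written out by hand so the snippet is self-contained (an analysis would typically use a library routine such as `kmeans` in R or scikit-learn's `KMeans`):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-variant ratio estimates beta_Y_hat / beta_X_hat, j in S
ratios = rng.normal(0.2, 0.5, size=60)

def kmeans_1d(x: np.ndarray, k: int, iters: int = 100) -> np.ndarray:
    """Minimal 1-d Lloyd's k-means; returns a cluster label per point."""
    centers = np.quantile(x, np.linspace(0, 1, k))  # spread-out initial centers
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new_centers = np.array(
            [x[labels == j].mean() if np.any(labels == j) else centers[j]
             for j in range(k)]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

labels = kmeans_1d(ratios, k=3)
clusters = [np.flatnonzero(labels == j) for j in range(3)]

# All non-empty unions of the 3 clusters: 2^3 - 1 = 7 candidate sets
candidate_sets = [
    np.sort(np.concatenate([clusters[j] for j in combo]))
    for r in (1, 2, 3)
    for combo in combinations(range(3), r)
]
```

The Focused estimator would then be evaluated over each of the 7 candidate sets (plus the core-only option), rather than only over the Core/Full choice.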
For nearly all the outcomes we considered, there were 10 or 11 genome-wide significant variants from the GC, DHCR7, CYP2R1 and CYP24A1 gene regions to use as core instruments for vitamin D. For the outcomes primary biliary cirrhosis and asthma there were only 4 variants available to use as core instruments. Many additional instruments (≥ 50) were selected by the Focused estimator in all cases. The core instruments were again stronger than the additional instruments; for the core instruments the concentration parameter ranged from 625.52 to 723.21, and for the additional instruments it ranged from 49.73 to 99.80.
Evidence from observational studies suggests that low serum vitamin D levels are associated with an increased risk of cardiovascular disease (Dobnig et al., 2008). These reported associations may be due to unmeasured confounding, as evidence from a meta-analysis of 21 randomized clinical trials suggests no causal link (Barbarawi et al., 2019). From Figure 9, we find no genetically predicted effect of vitamin D on cardiovascular outcomes.
Our analysis highlights that using the full set of available instruments can sometimes lead to very different estimates than when the core instruments are prioritised. In particular, the standard confidence intervals of the Full estimator suggest a non-null association of vitamin D with coronary artery disease, heart failure, eczema, primary biliary cirrhosis, and type 2 diabetes, but these results are not supported by the Focused intervals.
Our results suggest that higher vitamin D levels may have a protective effect on the incidence of multiple sclerosis, a finding which has previously been discussed in other MR studies (Mokry et al., 2015). Through the Core and Focused intervals, we also find that genetically predicted vitamin D level may be associated with anorexia. The 1-step and Focused intervals of the Focused estimator suggest that higher vitamin D levels may have a protective effect on the risk of developing Alzheimer's disease, which supports the findings of Jiang et al. (2021), but interestingly this is not supported by the Core and Full intervals. The Core estimator used 11 instruments, the Full estimator used 91 additional instruments, and the Focused estimator selected a subset of 65 of the additional instruments. This illustrates the ability of the focused instrument selection method to carefully select additional instruments that can potentially improve the power of an analysis.

Conclusion
Publicly available GWAS summary data have revealed that hundreds of genetic variants are robustly associated with a wide range of traits and diseases. However, MR studies in practice do not always make use of all genetic variants that are strongly associated with risk factors of interest; in some applications, investigators may have greater confidence in the instrument validity of only a smaller subset of many genetic variants. For this setting, we propose a way forward for improved estimation through focused use of many weak and potentially invalid instruments.
Whether focused use of invalid instruments can in turn improve inference is an open question. While a uniform improvement is not possible, we propose a new strategy for post-selection inference through Focused intervals, which are shown to achieve a good balance between precision and coverage probability while also guarding against a user-specified worst case size distortion. Our empirical applications highlight the potential of focused instrument selection to uncover new causal relationships in MR studies.

Proof. The first-order condition is given by Σ_{j∈S_0∪S} ψ_j(θ̂_S) = 0.
Proof. We can decompose ψ_j(θ_0) into a fixed bias term b_j and stochastic terms J_1j and J_2j, where the last equality follows from n‖τ‖₂² = O(1) by Assumption 4. Therefore, by CH and E[J_2j] = 0 for all j, Σ_{j∈S_0∪S} J_2j = O_P(1).

Then, using similar arguments for the remaining terms of Σ_{j∈S_0∪S} H_1j(θ), and for Σ_{j∈S_0∪S} H_lj(θ), l = 2, …, 8, and given consistency of θ̂_S, a second-order Taylor expansion of the first-order condition Σ_{j∈S_0∪S} ψ_j(θ̂_S) = 0 around θ̂_S = θ_0 implies that there exists θ̄ on the line segment joining θ̂_S and θ_0 such that the result follows by Slutsky's lemma and Lemmas S.2-S.4, which show the required convergences as n, p → ∞.

Proof of Theorem 2 (Asymptotic distribution of the bias estimate b̂_S).
We show that the leading term is o_P(1) by CS, CH, and consistency of θ̂_C for θ_0. Similarly, the last five terms on the right hand side are negligible, where η_S + ς_S = Θ(n‖β_X‖₂² + p) and ξ_S = Θ(p). Therefore, the following Lyapunov condition holds by similar arguments to those used in the proof of Theorem 2. By T, the last line follows from p/(n²‖β_X‖₂⁴) → 0 as n, p → ∞, which is implied by Assumption 3.

Similarly, using the above results, the second equality follows by (i) and T.

Proof of Theorem 3, Part I (Convergence in distribution of effect and bias estimates).
As shown in the proofs of Theorems 1 and 2, for any S, ignoring o_P(1/(√n‖β_X‖₂)) and o_P(√p/(n‖β_X‖₂²)) terms, we have J_1j = Ω_j⁻¹ β_Xj(e_Yj − θ_0 e_Xj) + Ω_j⁻²(e_Yj − θ_0 e_Xj)(σ²_Yj e_Xj + θ_0 σ²_Xj e_Yj), and B̂_j = Ω_j⁻¹ β_Xj(e_Yj − θ_0 e_Xj) + Ω_j⁻¹ e_Xj e_Yj − θ_0 Ω_j⁻¹(e²_Xj − σ²_Xj).
We can partition the K additional instrument sets into L ≤ 2^K − 1 distinct sets which span the additional instrument sets S_1, …, S_K. For example, for K = 3, each instrument must belong to one, and only one, of the 7 sets formed by intersecting each of S_1, S_2, S_3 with either itself or its complement (excluding the all-complement case), e.g. M_7 = S_1^C ∩ S_2^C ∩ S_3. Then, for each j ∈ [3], we can construct selection indicators α_{jℓ} ∈ {0, 1}, ℓ ∈ [7], such that S_j = ∪_{ℓ=1}^{7} α_{jℓ} M_ℓ.
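The K = 3 construction can be checked numerically. The instrument sets below are hypothetical toy sets; this is a sanity check of the partition-and-recover argument, not part of the proof:

```python
# Hypothetical additional instrument sets S_1, S_2, S_3 (variant indices)
S = [{1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 7}]
universe = set().union(*S)

# The 2^3 - 1 = 7 disjoint cells M_l: intersect each S_i or its complement,
# excluding the all-complement cell (variants belonging to no set).
cells = []
for bits in range(1, 8):  # bit i of `bits` encodes membership in S_{i+1}
    cell = set(universe)
    for i in range(3):
        cell &= S[i] if (bits >> i) & 1 else universe - S[i]
    cells.append(cell)

# The cells partition the instruments, and each S_j is recovered as the
# union of the cells it contains (selection indicator alpha_{jl} = 1
# exactly when M_l is a subset of S_j).
assert sum(len(c) for c in cells) == len(universe)
for j in range(3):
    recovered = set().union(*(c for c in cells if c <= S[j]))
    assert recovered == S[j]
```

Because each variant falls in exactly one cell, sums over any S_j decompose into sums over disjoint cells, which is what allows the joint limiting distribution over all candidate sets to be built from cell-level terms.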
For L ≤ 2^K − 1, let M_1, …, M_L be distinct sets of the additional instruments which span the additional instrument sets S_1, …, S_K.
and H_j = Ω_j⁻² θ_0(σ²_Xj e²_Yj − σ²_Yj e²_Xj − 2θ_0 σ²_Xj e_Xj e_Yj) + θ_0 Ω_j⁻¹(e²_Xj − σ²_Xj). For any set M_ℓ, we will show the corresponding limit result. Then, since (i) the random components in J_1j and µ_j are functions of the error terms e_Xj and e_Yj, and (ii) for any j ≠ k, e_Xj and e_Xk are jointly normal and uncorrelated, and hence mutually independent (likewise for e_Yj and e_Xj), we have that Σ_{j∈S_0} J_1j, Σ_{j∈M_1} µ_j, …, Σ_{j∈M_L} µ_j are mutually independent (k, l = 1, …, K).
The expression for the covariance matrix in Theorem 3 is then ∆ = R + W .
Proof of Equation 4.
If |M_ℓ| = o(p), then the asymptotic distribution result for Σ_{j∈M_ℓ} µ_j still applies, but the variance components ς_{M_ℓ} and ξ_{M_ℓ} from Equation 4 would be negligible. Therefore, we focus on

Figure 1. The effect of genetic variant Z_j on the risk factor X and outcome Y, where U is an unobserved confounder.

'→_P' denotes 'converges in probability to'; '→_D' 'converges in distribution to'; '∼_a' 'is asymptotically distributed as'. For any sequences a_n and b_n, if a_n = O(b_n), then there exists a positive constant M and a positive integer N such that for all n ≥ N, b_n > 0 and |a_n| ≤ M b_n.

Figure 2. RMSE of the Focused estimator relative to the RMSE of the Core estimator, varying with the average instrument strength of S_0 (λ_C) and S (λ_S), and the invalidness of S (τ).

Figure 2 highlights that when the direct variant effects on the outcome are sufficiently small, the RMSE of the Focused estimator is lower than that of the Core estimator. However, this improvement is not uniform across larger values of τ. The performance of the Focused estimator worsens for intermediate values of τ. Then, as τ becomes large, the Focused estimator
Figure 5 illustrates how the performance of the Focused interval varies according to the choice of γ for the case where all instruments are equally strong (λ_C = λ_S = 40). The results for the other confidence intervals discussed in this paper are also shown for comparison, though of course their performance does not vary with γ.

Figure 5. The dashed line in the first row is the allowable size distortion 1 − α − γ (nominal coverage is 1 − α = 0.95). The second row plots the length of confidence intervals relative to the Core interval.

Figure 6. RMSE of the Focused estimator relative to the RMSE of the Core estimator, varying with the invalidness of S_0 (τ_C) and the invalidness of S (τ).

Figure 7. CETP gene analysis. Point estimates and 95% confidence intervals of the change in log odds ratio of various outcomes due to a 1 standard deviation increase in instrumented LDL-C.

Figure 8. PCSK9 gene analysis. Point estimates and 95% confidence intervals of the change in log odds ratio of various outcomes due to a 1 standard deviation increase in instrumented LDL-C.

Figure 9. Vitamin D effects. Point estimates and 95% confidence intervals of the change in log odds ratio of various outcomes due to a 1 standard deviation increase in instrumented vitamin D levels.
SB was supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (204623/Z/16/Z). VZ was supported by the United Kingdom Research and Innovation Medical Research Council (MR/W029790/1). This research was funded by the United Kingdom Research and Innovation Medical Research Council (MC-UU-00002/7), and supported by the National Institute for Health Research Cambridge Biomedical Research Centre: BRC-1215-20014.

Figure S1. RMSE varying with the average instrument strength of S_0 (λ_C) and S (λ_S), and the invalidness of S (τ).