Risk factors for eight common cancers revealed from a phenome-wide Mendelian randomisation analysis of 378,142 cases and 485,715 controls

For many cancers there are few well-established risk factors. Summary data from genome-wide association studies (GWAS) can be used in a Mendelian randomisation (MR) phenome-wide association study (PheWAS) to identify causal relationships. We performed an MR-PheWAS of breast, prostate, colorectal, lung, endometrial, oesophageal, renal, and ovarian cancers, comprising 378,142 cases and 485,715 controls. To derive a more comprehensive insight into disease aetiology we systematically mined the literature space for supporting evidence. We evaluated causal relationships for over 3,000 potential risk factors. In addition to identifying well-established risk factors (smoking, alcohol, obesity, lack of physical activity), we provide evidence for specific factors, including dietary intake, sex steroid hormones, plasma lipids and telomere length as determinants of cancer risk. We also implicate molecular factors including plasma levels of IL-18, LAG-3, IGF-1, CT-1, and PRDX1 as risk factors. Our analyses highlight the importance of risk factors that are common to many cancer types but also reveal aetiological differences. A number of the molecular factors we identify have the potential to be biomarkers. Our findings should aid public health prevention strategies to reduce cancer burden. We provide an R/Shiny app (https://mrcancer.shinyapps.io/mrcan/) to visualise findings.


INTRODUCTION
Cancer is currently the third major cause of death with an estimated 18.1 million new cases and nearly 10 million cancer deaths in 2020 1 . Conversely, case-control studies can be complicated by biases such as reverse causation and confounding. Mendelian randomisation (MR) is an analytical strategy that uses germline genetic variants as instrumental variables (IVs) to infer causal relationships (Fig. 1A) 3 . The random assortment of these genetic variants at conception mitigates reverse causation bias.
Moreover, in the absence of pleiotropy (i.e. the presence of an association between variants and disease through additional pathways), MR can provide unconfounded disease risk estimates.
Elucidating disease causality using MR is gaining popularity especially given the availability of data from large genome-wide association studies (GWAS) and well-developed analytical frameworks 3 .
Most MR studies of cancer have been predicated on assumptions about disease aetiology or have sought to evaluate purported associations from conventional observational epidemiology 3,4 . A recently proposed agnostic strategy, termed MR-PheWAS, integrates the phenome-wide association study (PheWAS) with MR methodology to identify causal relationships using hitherto unconsidered traits 5 .
To identify causal relationships for eight common cancers (breast, prostate, colorectal, lung, endometrial, oesophageal, renal and ovarian), and reveal intermediates of risk, we conducted an MR-PheWAS. We integrated findings with a systematic mining of the literature space to provide supporting evidence and derive a more comprehensive description of disease aetiology (Fig. 1B) 6 .

METHODS
Ethics approval
The analysis was undertaken using published GWAS data, hence ethical approval was not required.

Study design
Our study had four elements: firstly, the identification of genetic variants serving as instruments for the exposure traits under investigation; secondly, the acquisition of GWAS data for the eight cancers; thirdly, MR analysis; and fourthly, triangulation through literature mining to provide supporting evidence for causal relationships (Fig. 1B).

Genetic variants serving as instruments
Single nucleotide polymorphisms (SNPs) serving as genetic instruments were identified from published studies or MR-Base (Supplementary Table 1). For each SNP, the corresponding effect estimate on a trait, expressed in per standard deviation (SD) units (assuming a per allele effect), and its standard error (SE) were obtained. Only SNPs with a minor allele frequency >0.01 and a trait association P-value <5 × 10⁻⁸ in a European population GWAS were considered as instruments.
We excluded correlated SNPs at a linkage disequilibrium threshold of r² >0.01, retaining SNPs with the strongest effect. For binary traits we restricted our analyses to traits with a medical diagnosis, excluding cancer. We removed duplicate exposure traits based on manual curation.
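The selection rules above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the `Snp` record, the `select_instruments` function, the rsIDs and the `r2_lookup` linkage disequilibrium function are all hypothetical, and the greedy pruning (rank by absolute effect size, drop any SNP in LD with one already kept) is only one plausible reading of "retaining SNPs with the strongest effect".

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rsid: str    # hypothetical identifier
    maf: float   # minor allele frequency
    beta: float  # per-allele effect on the trait, in SD units
    se: float    # standard error of beta
    pval: float  # trait association P-value


def select_instruments(snps, r2, maf_min=0.01, p_max=5e-8, r2_max=0.01):
    """Filter on MAF and P-value, then greedily prune correlated SNPs.

    `r2` is an assumed callable r2(rsid_a, rsid_b) -> float giving pairwise LD.
    """
    candidates = [s for s in snps if s.maf > maf_min and s.pval < p_max]
    # Visit SNPs from strongest to weakest absolute effect.
    candidates.sort(key=lambda s: abs(s.beta), reverse=True)
    kept = []
    for s in candidates:
        # Keep a SNP only if it is not in LD with any SNP already retained.
        if all(r2(s.rsid, k.rsid) <= r2_max for k in kept):
            kept.append(s)
    return kept


# Toy data: rsB is in LD with rsA, rsC is too rare, rsD is not genome-wide significant.
snps = [
    Snp("rsA", 0.20, 0.10, 0.010, 1e-20),
    Snp("rsB", 0.30, 0.05, 0.008, 1e-9),
    Snp("rsC", 0.005, 0.40, 0.050, 1e-12),
    Snp("rsD", 0.40, 0.03, 0.010, 1e-3),
    Snp("rsE", 0.25, 0.04, 0.006, 1e-10),
]
ld_pairs = {frozenset(("rsA", "rsB")): 0.8}


def r2_lookup(a, b):
    # Pairs not listed are treated as independent (r2 = 0).
    return ld_pairs.get(frozenset((a, b)), 0.0)


selected = select_instruments(snps, r2_lookup)
print([s.rsid for s in selected])
```

Standard clumping tools usually rank SNPs by P-value rather than by effect size; ranking by effect size here simply follows the wording of the text.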

Cancer GWAS summary statistics
To examine the association of each genetic instrument with cancer risk, we used summary GWAS effect estimates from:  15 .

Statistical analysis
For each SNP, causal effects were estimated for cancer as an odds ratio (OR) per SD unit increase in the putative risk factor (ORSD), with 95% confidence intervals (CIs), using the Wald ratio 16 22 . A leave-one-out strategy under the IVW-RE model was employed to assess the potential impact of outlying and pleiotropic SNPs (Supplementary Table 11) 23 . Because two-sample MR of a binary risk factor and a binary outcome can be biased, we primarily considered whether there exists a significant non-zero effect, and only report ORs for consistency 24 . Statistical analyses were performed using the TwoSampleMR package v0.5.6 (https://github.com/MRCIEU/TwoSampleMR) in R (v3.4.0) 15 .
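As a worked illustration of the estimators named above, the following Python sketch computes per-SNP Wald ratios, combines them by inverse-variance weighting and repeats the estimate leaving each SNP out in turn. It is not the TwoSampleMR code. It assumes that IVW-RE denotes inverse-variance weighting with multiplicative random effects, uses a first-order approximation for the Wald ratio standard error, and all function names and input values are hypothetical.

```python
import math


def wald_ratio(beta_exp, beta_out, se_out):
    """Per-SNP causal estimate (log OR per SD) with a first-order SE."""
    return beta_out / beta_exp, se_out / abs(beta_exp)


def ivw_random_effects(beta_exp, beta_out, se_out):
    """Inverse-variance weighted estimate with multiplicative random effects."""
    ratios = [wald_ratio(bx, by, sy) for bx, by, sy in zip(beta_exp, beta_out, se_out)]
    weights = [1.0 / se ** 2 for _, se in ratios]
    est = sum(w * r for w, (r, _) in zip(weights, ratios)) / sum(weights)
    se_fixed = math.sqrt(1.0 / sum(weights))
    k = len(ratios)
    if k > 1:
        # Cochran's Q measures heterogeneity between the per-SNP estimates.
        q = sum(w * (r - est) ** 2 for w, (r, _) in zip(weights, ratios))
        phi = max(1.0, q / (k - 1))  # never shrink the SE below the fixed-effect SE
    else:
        phi = 1.0
    return est, se_fixed * math.sqrt(phi)


def odds_ratio_ci(est, se, z=1.96):
    """Convert a log OR and its SE to an OR with a 95% confidence interval."""
    return math.exp(est), math.exp(est - z * se), math.exp(est + z * se)


def leave_one_out(beta_exp, beta_out, se_out):
    """IVW estimate recomputed with each SNP removed in turn."""
    estimates = []
    for i in range(len(beta_exp)):
        bx = beta_exp[:i] + beta_exp[i + 1:]
        by = beta_out[:i] + beta_out[i + 1:]
        sy = se_out[:i] + se_out[i + 1:]
        estimates.append(ivw_random_effects(bx, by, sy)[0])
    return estimates


# Toy summary statistics: SNP-exposure effects (SD units) and SNP-cancer log ORs.
beta_exp = [0.10, 0.20, 0.05]
beta_out = [0.02, 0.04, 0.01]
se_out = [0.01, 0.01, 0.01]

est, se = ivw_random_effects(beta_exp, beta_out, se_out)
or_sd, ci_low, ci_high = odds_ratio_ci(est, se)
loo = leave_one_out(beta_exp, beta_out, se_out)
print(round(or_sd, 3), round(ci_low, 3), round(ci_high, 3))
```

In this toy example every SNP gives the same ratio, so the leave-one-out estimates coincide with the full estimate; a leave-one-out estimate that shifts markedly would flag an outlying or pleiotropic SNP.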

Estimation of study power
The power of MR to demonstrate a causal relationship depends on the proportion of variance explained (PVE) by the instrument 25 . We excluded instruments with an F-statistic <10 since this is considered indicative of weak instrument bias 26 . We estimated study power a priori, stipulating a P-value of 0.05 for each target across a range of effect sizes, as per Brion et al. (Supplementary Table 1) 27 . Since power estimates for binary exposure traits and binary outcomes in a two-sample setting are unreliable, we did not estimate study power for binary traits 24 .
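The dependence of instrument strength and power on PVE can be sketched as follows. This is an illustration under stated assumptions, not the calculation used in the study: the F-statistic uses the standard expression in terms of PVE, exposure sample size and number of SNPs, and the power function uses a widely used normal approximation for a binary outcome, which may differ from the exact method of Brion et al. Function names and input values are hypothetical.

```python
import math
from statistics import NormalDist


def f_statistic(pve, n_exposure, k=1):
    """F-statistic for k instruments explaining a fraction `pve` of trait variance."""
    return (pve / (1.0 - pve)) * ((n_exposure - k - 1) / k)


def mr_power_binary(pve, n_outcome, case_fraction, odds_ratio, alpha=0.05):
    """Approximate two-sided power to detect a causal OR per SD of exposure.

    Assumed approximation: non-centrality = N * PVE * K(1-K) * ln(OR)^2,
    where K is the proportion of cases in the outcome sample.
    """
    ncp = n_outcome * pve * case_fraction * (1.0 - case_fraction) * math.log(odds_ratio) ** 2
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(math.sqrt(ncp) - z_crit)


# A single SNP explaining 0.05% of variance in 10,000 individuals is a weak instrument.
weak_f = f_statistic(0.0005, 10_000)
strong_f = f_statistic(0.01, 10_000)

# Power for PVE = 3.4% (the median reported in the Results) in a hypothetical
# outcome sample of 100,000 with 50% cases, for a true OR of 1.1 per SD.
power = mr_power_binary(0.034, 100_000, 0.5, 1.1)
print(round(weak_f, 1), round(strong_f, 1), round(power, 2))
```

Under this approximation power increases with PVE, outcome sample size and effect size, consistent with power for each cancer reflecting the size of the respective GWAS dataset.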

RESULTS
Phenotypes and genetic instruments
After filtering we analysed 3,661 traits, proxied by 336,191 genetic variants, in conjunction with summary genetic data from published GWAS of breast, prostate, colorectal, lung, endometrial, oesophageal, renal, and ovarian cancers (Table 1; Supplementary Table 17). The number of SNPs used as genetic instruments for each trait ranged from one to 1,335. Figure 2 shows the power of our MR study to identify causal relationships between each of the genetically defined traits and each cancer type. The median PVE by SNPs used as IVs for each of the 3,661 traits evaluated as risk factors was 3.4% (0.01-84%). Our a priori power to demonstrate causal relationships for each cancer type inevitably reflects, in part, the size of the respective GWAS datasets (Supplementary Table 1).

Causal associations identified by MR
To aid interpretation we grouped traits related to established cancer risk factors (i.e. smoking, obesity and alcohol) and those for which current evidence is inconclusive into the following categories: cardiometabolic; dietary intake; anthropometrics; immune and inflammatory factors; fatty acid (FA) and lipoprotein metabolism; lifestyle, reproduction, education and behaviour; metabolomics and proteomics; miscellaneous.
Out of the 27,066 graded associations, MR analyses provided robust evidence for a causal relationship with 123 phenotypes (0.5% of total MR analyses), probable evidence for 174 (0.6% of total) and suggestive evidence for 1,652 (6% of total). Across the eight cancer types, the largest number of robust associations was observed for endometrial cancer (n = 37), followed by renal cancer (n = 32), colorectal cancer (CRC; n = 21), lung (n = 20), breast (n = 10), oesophageal (n = 3) and prostate cancer (n = 1). No robust MR associations were observed for ovarian cancer (Supplementary Table 3).
Across all of the cancer types, anthropometric traits showed the highest number of robust MR-defined causal relationships (N=32; 0.1%), followed by lifestyle, reproduction, education and behaviour (N=17; 0.06%). No robust associations were observed for the dietary intake or cardiometabolic categories (Supplementary Table 3).
To visualise the strength and direction of effect of the causal relationship between each of the traits examined and risk of each cancer type and, where appropriate, their respective subtypes, we provide an R/Shiny app (https://mrcancer.shinyapps.io/mrcan/). Fig. 3 shows a screenshot of the app for selected traits across the eight different types of cancer.
Many of the identified causal relationships, especially those that were statistically robust or probable, have been reported in previous MR studies and are related to established risk factor categories 4,32,33 . Notably: (i) the relationship of metrics of increased body mass index (BMI) with an increased risk of colorectal, lung, renal, endometrial and ovarian cancers 34 ; (ii) cigarette smoking with an increased risk of lung cancer 35 ; (iii) higher alcohol consumption with an increased risk of oesophageal, colorectal, lung, renal, endometrial and ovarian cancers 36 ; (iv) reduced physical activity and sedentary behaviour with an increased risk of multiple cancers, including breast, lung, colorectal and endometrial 37 . As anticipated, exposure traits pertaining to cigarette smoking were not causally related to lung cancer in never smokers. Paradoxically, but as reported in previous MR analyses, increased BMI was associated with reduced risk of prostate and breast cancer, and an inverse relationship between smoking and prostate cancer risk was observed 34,38 . Our analysis also supports the reported relationship between higher levels of sex hormone-binding globulin and reduced endometrial cancer risk, and a relationship of testosterone with risk of endometrial and breast cancers 39,40 . Notably, exposure traits related to testosterone levels were only causally associated with luminal-A and luminal-B breast cancer subtypes.
With respect to dietary intake, our analysis demonstrated probable associations between genetically predicted higher levels of coffee, oily fish, and cheese intake and reduced CRC risk, and suggestive associations between genetically predicted beef and poultry intake and elevated CRC risk. We found suggestive associations between genetically predicted high serum vitamin B12 and colorectal and prostate cancer, serum calcium and 25-hydroxyvitamin-D and renal cell carcinoma (RCC), low blood selenium and colorectal and oesophageal cancers, and methionine and zinc and reduced CRC risk. We observed no association between genetically predicted blood levels of circulating carotenoids or vitamins B6 and E and any of the cancers.
In terms of glucose homeostasis, no relationship between genetically predicted blood glucose or glycated haemoglobin and any of the eight cancers was shown. However, genetically predicted higher levels of fasting insulin and insulin-like growth factor 1 (IGF-1) and lower levels of proinsulin showed associations with CRC. Additionally, suggestive associations were observed between proinsulin and renal cancer, fasting insulin and lung and endometrial cancers, and IGF-1 levels and breast cancer.
Amongst genetically predicted higher levels of lipoproteins, the only associations were a probable association between high density lipoprotein cholesterol (HDL-C) and breast cancer and suggestive associations between low density lipoprotein cholesterol (LDL-C) with CRC, and total cholesterol and ovarian cancer.
Genetically predicted levels of plasma FAs showed an association with reduced cancer risk.
A relationship between longer lymphocyte telomere length (LTL) and an increased risk of six of the eight cancer types was identified: robust with respect to renal and lung cancers, probable for breast and prostate cancers and suggestive for colorectal and ovarian cancers.
In addition to a robust association between higher HLA-DR dendritic plasmacytoid levels and risk of prostate cancer, 26 probable associations between genetically predicted levels of other circulating immune and inflammatory factors were shown across the cancers studied. These included higher levels of IL-18 with reduced risk of lung cancer, with specificity for lung cancer in never smokers.
Our MR analysis provides support for the known relationships between colonic polyps and CRC 41 , benign breast disease and breast cancer 42 , and oesophageal reflux and risk of oesophageal cancer (Supplementary Table 13) 43 . Other associations included possible relationships between pulmonary fibrosis and lung cancer 44 , as well as the relationship between a diagnosis of schizophrenia and lung cancer, which has been observed in conventional epidemiological studies 45 . It was noteworthy, however, that we did not find evidence to support the purported relationship between hypertension and risk of developing RCC. Similarly, our analysis did not provide evidence to support a causal relationship between either type 1 or type 2 diabetes and an increased cancer risk.

Literature-mined support for MR causal relationships
To provide support for the associations and to gain molecular insights into the underlying biological inhibition is an attractive therapeutic target in restoring T-cell function, we demonstrate genetically elevated LAG-3 levels as being associated with reduced risk of CRC, endometrial and lung cancer. In all three of these cancers, the association appears to be at least partly mediated through IL-10, and the seemingly paradoxical relationship between LAG-3 levels and tumourigenesis suggests potentiation of T-cell function by serum LAG-3 rather than cell membrane expressed LAG-3 48 . We identify genetically predicted IL-18 levels as being associated with an increased risk of lung cancer. Our literature mining also supports a role for the decoy inhibitory protein, IL-18BP, as being a mediator of lung cancer risk, as well as IL-10, IL-12, IL-4 and TNF 49 . Finally, PRDX1, a member of the peroxiredoxin family of antioxidant enzymes, interacts with the androgen receptor to enhance its transactivation, resulting in increased EGFR-mediated signalling and an increased prostate cancer risk 46 .

DISCUSSION
By performing an MR-PheWAS we have been able to agnostically examine the relationship between multiple traits and the risk of eight different cancer types, restricted only by the availability of suitable genetic instruments. Importantly, many of the traits we examined have not previously been the subject of conventional epidemiological studies or been assessed by MR. Even for risk factors that were examined in many previous analyses, the number of cases and controls in our study has afforded greater power to identify potential causal associations. This has allowed us to exclude large causal effects on cancer risk for the majority of exposure traits examined in our study.
In addition to identifying causal relationships for the well-established lifestyle traits, which validates our approach, we implicate other lifestyle factors that have been putatively associated by observational epidemiology as contributing to cancer risk, for example the protective effects of physical activity and of coffee, oily fish and fresh/dried fruit intake on CRC risk. A number of the causal relationships we identify have been the subject of studies of individual traits and include the association between longer LTL and increased risk of RCC and lung cancers; sex steroid hormones and risk of breast and endometrial cancer; and circulating lipids with CRC and breast cancer. Using genetic instruments for plasma proteome constituents has allowed us to identify hitherto unexplored potential risk factors for a number of the cancers, including: the cytokine-like molecule FAM3D, which plays a role in host defence against inflammation-associated carcinogenesis, with lung cancer 50 ; the autophagy-associated cytokine cardiotrophin-1 with lung, endometrial, prostate and breast cancer; and the tumour progression-associated antigen CD63 with endometrial cancer 51,52 .
Levels of these and other plasma proteins potentially represent biomarkers worthy of future prospective studies. Clustering of MR causal effect sizes for each trait-cancer relationship highlights the importance of risk factors common to many cancers but also reveals differences in their impact, in part likely to be reflective of underlying biology (Fig. 5).
A principal assumption in MR is that variants used as IVs are associated with the exposure trait under investigation. We therefore used SNPs associated with exposure traits at genome-wide significance. Furthermore, only IVs from European populations were used to limit bias from population stratification. Our MR analysis does, however, have limitations. Firstly, we were limited to studying phenotypes with genetic instruments available. Moreover, traits such as food intake or television watching can be highly correlated with other exposures, making deconvolution of the causal risk factor problematic 53,54 . A major concern articulated regarding any MR-PheWAS is the need to provide supporting evidence from alternative sources. Herein we have sought to address this by conducting a systematic interrogation of the literature space to potentially identify intermediates that explain relationships.
Although literature mined data is inevitably noisy and driven by publication bias, we have been able to provide a narrative of the causal relationships for a number of risk factors, which are attractive candidates for molecular validation.
Complementary studies are required to delineate the exact biological mechanisms underpinning associations. Our analysis does, however, highlight important targets for primary prevention of cancer in the population. The limited power to robustly characterise relationships between exposure traits and cancer in this study provides an impetus for larger MR studies. Finally, we recognise that MR is not infallible, and replication and triangulation of findings using different data sources and, if possible, benchmarking against RCTs are highly desirable. Such efforts could identify additional factors as targets to reduce the overall burden of cancer.

CONFLICT-OF-INTEREST DISCLOSURE
The authors declare no competing financial interests.

AVAILABILITY OF DATA AND MATERIAL
Genetic instruments can be obtained through MR-Base or from published work (Supplementary