Interpreting population and family-based genome-wide association studies in the presence of confounding

A central aim of genome-wide association studies (GWASs) is to estimate direct genetic effects: the causal effects on an individual’s phenotype of the alleles that they carry. However, estimates of direct effects can be subject to genetic and environmental confounding, and can also absorb the ‘indirect’ genetic effects of relatives’ genotypes. Recently, an important development in controlling for these confounds has been the use of within-family GWASs, which, because of the randomness of Mendelian segregation within pedigrees, are often interpreted as producing unbiased estimates of direct effects. Here, we present a general theoretical analysis of the influence of confounding in standard population-based and within-family GWASs. We show that, contrary to common interpretation, family-based estimates of direct effects can be biased by genetic confounding. In humans, such biases will often be small per-locus, but can be compounded when effect size estimates are used in polygenic scores. We illustrate the influence of genetic confounding on population- and family-based estimates of direct effects using models of assortative mating, population stratification, and stabilizing selection on GWAS traits. We further show how family-based estimates of indirect genetic effects, based on comparisons of parentally transmitted and untransmitted alleles, can suffer substantial genetic confounding. In addition to known biases that can arise in family-based GWASs when interactions between family members are ignored, we show that biases can also arise from gene-by-environment (G×E) interactions when parental genotypes are not distributed identically across interacting environmental and genetic backgrounds. We conclude that, while family-based studies have placed GWAS estimation on a more rigorous footing, they carry subtle issues of interpretation that arise from confounding and interactions.

The focal allele at an association study locus (solid red square) is in positive cis-LD with trait-increasing alleles at other loci (solid black squares) if it is disproportionately likely to be found alongside them on an individual's maternally or paternally inherited genome. (B) The focal allele at the study locus is in positive trans-LD with trait-increasing alleles at other loci if it is disproportionately likely to be found across from them on the maternally and paternally inherited genomes. (C,D) In a population association study, both positive cisand trans-LD between the focal allele at the study locus and trait-increasing alleles anywhere else in the genomeeither on the same chromosome as the study locus or on different chromosomes-generate a spuriously high effect size estimate at the study locus. (E,F) In a sibling association study, a trait-increasing allele causes a spuriously increased effect size estimate at the study locus if the parent is a coupling double heterozygote for the focal and trait-increasing alleles, having inherited them from the same parent (E), but a spuriously decreased estimate if the parent is a repulsion double heterozygote, having inherited them from different parents (F). These biases arise only if the trait-affecting locus is on the same chromosome as the focal study locus. The net bias depends on the relative frequencies of coupling and repulsion double heterozygotes in the parents, which depends on the difference in the degrees of cis-and trans-LD. 144 where p λ is the frequency of the focal allele at λ, and D λl is the degree of cis-LD between the focal allele 145 at λ and a causal allele at a nearby locus l ∈ L local . It is reasonable to think of this quantity as the where c λl is the sex-averaged recombination rate between λ and l. The cis-and trans-LD terms D ′ λl and 198D ′ λl are measured in the parents. 199 Similarly, an estimate of the direct effect can be obtained from pairs of siblings by regressing the 200 differences in their phenotypes on the differences in their genotypes at the focal locus λ. In the presence 201 of genetic confounds, this procedure yields a similar estimate to Eq. (6): (and Eq. 6) can be rewritten in terms of the relative frequencies of the two kinds of double-heterozygotes: 249 In a species with many chromosomes, such as humans, for a given locus, there will be many more 250 unlinked loci than linked loci. Therefore, the set of loci that can confound a family-based association 251 study at a given locus will be much smaller than the set of loci that can confound a population association 252 study at the locus. It will often be the case, therefore, that biases in the estimation of direct genetic effects 253 will be smaller in family-based studies than in population studies, a point that we explore below when 254 we consider sources of genetic confounding. 255 Estimates of indirect genetic effects. We now return to the regression of the trait on the untrans- In a sibling-based study (the results and intuition below will be the same for a parent-offspring study), 298 the difference between siblings' trait-1 PGSs, ∆PGS 1 , is regressed on the difference in their values for 299 trait 2, ∆Y 2 (note that trait 2 could be the same as trait 1). If L is the set of loci that causally underlie : Assortative mating systematically biases effect size estimation in population and within-family GWASs, although the bias in within-family GWASs is expected usually to be small. Here, cross-trait assortative mating between traits 1 and 2 occurs for the first 19 generations, after which mating is random. Assortative mating is sex-asymmetric, with strength ρ = 0.2. Distinct sets of loci underlie variation in trait 1 and 2, with effect sizes at causal loci normalized to 1. Plotted are average estimated effects of the focal alleles at loci causal for trait 1 in population and within-family GWASs on trait 2, for a hypothetical genome with one chromosome of length 1 Morgan (A) and for humans (B). Since the traits have distinct genetic bases, the true effects on trait 2 of the alleles at trait-1 loci are zero. The horizontal lines at 0.1 are a theoretical 'first-order' approximation of the asymptotic bias in a population GWAS (Appendix A3.1). Profiles are averages across 10,000 replicate simulation trials. Simulation details can be found in the Methods.
non-causal loci is approximately h 2 ρ/2 times the effect at causal loci, assuming the traits to have the 391 same heritabilities and genetic architectures (horizontal dahsed line in Fig. 2). If cross-trait assortative 392 mating is bi-directional/symmetric with respect to sex, then, in equilibrium, the average spurious effect 393 size estimate at non-causal loci is approximately h 2 ρ times the effect at causal loci. Upward biases in 394 effect size estimates at causal loci are also expected under cross-trait assortative mating, but these are 395 second-order relative to the biases at non-causal loci (Fig. S1).

396
The systematic over-and under-estimation of effect sizes that assortative mating induces in population 397 and family-based GWASs, respectively, will also affect our second measure of interest, the heterozygosity-398 weighted average squared effect size estimate E 2p λ (1 − p λ )α 2 λ (and therefore also downstream quantities 399 such as SNP heritabilities). In a population GWAS, the presence of trans-LD and the gradual creation of 400 cis-LD under assortative mating will increase the biases in effect size estimates over time (Fig. 2), which 401 will concomitantly increase the average value ofα 2 λ ( Fig. 3 Figure 3: The impact of assortative mating on the average squared effect size estimate in population and withinfamily GWASs. Same-trait assortative mating of strength ρ = 0.2 occurs in generations 0-19; mating is random before and after this period. Under random mating, the average squared effect size estimates exceed the true average squared effect size (yellow line) because random drift generates chance local LD with causal alleles that inflates the variance of effect size estimation (e.g. Bulik-Sullivan et al. 2015b). The magnitude of this variance inflation depends on the GWAS design and sample size, and the effect of assortative mating and its cessation should be judged in reference to it. To guide the eye in this judgment, the faint horizontal lines show the average squared effect size estimate in the last 20 generations of the initial burn-in period of random mating. Profiles are averages across 10,000 replicate simulation trials. Simulation details can be found in the Methods. humans (Fig. 3B).
As shown by Border et al. (2022a,b), the effects of assortative mating on estimates of heritability and 409 genetic correlations described above are not well controlled for by LD Score regression (Bulik-Sullivan 410 et al. 2015a,b). The LD score of a variant proxies the amount of local causal variation the SNP tags, but 411 because assortative mating generates long-range signed LD among causal variants, it causes local causal 412 variants to be in long-range signed LD with other causal variants throughout the genome. Therefore, 413 the slope of the LD score regression absorbs the effects of assortative mating, causing its estimates of 414 heritability and of the degree of pleiotropy to be inflated. 415 Historical assortative mating. If, at some point in time, assortative mating for traits ceases and mat-416 ing becomes random with respect to those traits, the positive trans-LD that was present under assortative 417 mating will immediately disappear, leaving only the positive cis-LD that had built up; this cis-LD will 418 then be gradually eroded by recombination. If equilibrium had been attained under assortative mating, 419 the cis-LD would have grown to match the per-generation trans-LD. Therefore, in the first generation 420 after assortative mating ceases, the upward bias in population GWAS effect size estimates would halve 421 as the trans-LD disappears (Eq. 3); the bias would then shrink gradually to zero as the cis-LD erodes 422 (Fig. 2). A similar pattern will be observed for the heterozygosity-weighted average value ofα 2 λ in the 423 population GWAS, which eventually returns to its equilibrium level under random mating (Fig. 3).

424
In contrast, with the disappearance of the positive trans-LD but the persistence of positive cis-LD, 425 the bias in family-based effect size estimates will suddenly become positive once assortative mating ceases 426 (having temporarily been negative under assortative mating before equilibrium was attained); this bias too 427 will then gradually shrink to zero as recombination erodes the remaining cis-LD (Fig. 2). Concomitantly, 428 the average squared effect size estimate in the family GWAS will suddenly increase when assortative 429 mating ceases, after which it too will gradually return to its equilibrium value under random mating 430 (Fig. 3 focal alleles at the association study locus λ and a causal locus l is in a sample that weights the two populations equally, with the superscript (S) denoting that this LD is  values? The answer depends on the source of allele frequency differences between the two populations. If 486 the differences are due to neutral genetic drift, they will be independent of each other (assuming causal 487 loci are sufficiently widely spaced) and independent of the direction and size of effects at individual 488 loci. Therefore, the LD induced by these allele frequency differences will, on average, not bias effect size 489 estimates in a population GWAS: (2) k = 0 at any locus k.

492
However, the LD induced by population structure will inflate the average squared effect size estimate, 493 and by extension the variance of effect size estimates (Fig. 5) In contrast, because effect size estimates from within-family GWASs are not confounded in this model 499 of isolated populations, the average squared effect size estimate will not differ substantially from its 500 expectation in an unstructured population (Fig. 5).

501
While we have focused on a simple model of two isolated populations, the result that within-family 502 association studies are not confounded holds for other kinds of population structure as well. Specifically, 503 we may be concerned that a population GWAS suffers from genetic confounding along some given axis 504 of population stratification. However, the family-based estimates will be unbiased by confounding along 505 such an axis if the maternal and paternal genotypes at each locus are exchangeable with respect to each 506 16 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint other along this axis (Appendix A3.2). This requirement will be met in expectation under many models 507 of local genetic drift in discrete populations or along geographic gradients. However, as we will shortly 508 argue, migration and admixture introduce further complications.

509
Allele frequency divergence due to selection or phenotype-biased migration. Selection and 510 phenotype-biased migration can also generate allele frequency differences among populations (for a re- the phenotype tend to migrate from population 2 to population 1, can also create a positive association 521 between effect sizes and allele frequency differences (Eq. 14).

522
Unlike the case of neutral genetic drift in the two populations, where the sign of the LD between 523 two alleles is independent of their effect sizes, the effect-size-correlated associations driven by selection  Here, two populations are isolated until generation 0, at which point they mix in equal proportions. Initial allele frequencies are chosen independently for the two populations, such that allele frequency differences between the populations resemble those that would accumulate over time via random drift. As in Fig. 3, the equilibrium value of the mean squared effect size estimate under random mating is greater than the true mean squared effect size, in both the population and within-family GWAS, owing to linkage disequilibria among causal alleles that arise due to drift. This explains why, in the insets, the blue (population) and red (withinfamily) profiles do not shrink all the way down to the yellow (true) line after admixture, when mating is random. Note too the difference in scale of the y-axes in the insets: the return to equilibrium is much more rapid under the human genetic map than for a hypothetical genome of one chromosome of length 1 Morgan, since, with more recombination, the ancestry-based linkage disequilibria are broken down more rapidly. Profiles are averages across 10,000 replicate simulation trials. Simulation details can be found in the Methods.

17
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint or phenotype-biased migration can add up across loci, and thus lead to substantial, systematic biases in 525 estimates of allelic effect sizes. This systematic genetic confounding would also substantially inflate the 526 average squared effect size estimate and thus measures of the genetic variance tagged by SNPs.

527
In addition, these systematic sources of genetic confounding can generate genetic correlations between 528 traits with no overlap in their sets of casual loci-i.e., with no pleiotropic relationship. This will occur if 529 two traits have both experienced selection or biased migration along the same axis. To take a concrete 530 example, if people tend to migrate to cities in part based on traits 1 and 2, then these traits will become 531 genetically correlated. If this axis is explicitly included as a covariate in the GWAS, then its influence 532 on estimates of heritability and genetic correlations will be removed. However, its influence will not be 533 removed by inclusion of genetic principal components or the relatedness matrix, if this axis (here, city 534 vs. non-city) is not a major determinant of genome-wide relatedness at non-causal loci (Vilhjálmsson and 535 Nordborg 2013). Nor will LD score regression control for this influence, as the selection-or migration-536 driven differentiation of a variant along the axis will be correlated with the extent to which it tags 537 long-range causal variants involved in either trait. This effect on LD score regression is similar to that   GWASs performed in the admixed population. More generally, long range LD will be an issue when there 553 is genetic stratification and ongoing migration between somewhat genetically distinct groups.

554
For concreteness, we again consider a simple model where two populations have been separated for 555 some time, allowing allele frequencies to diverge between them. The populations then come into contact 556 and admix in the proportions A and 1 − A. We assume that mating is random with respect to ancestry 557 in the admixed population.

558
Suppose that, just before admixture, the frequencies of the focal allele at a given locus k were p

565
18 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made λl,t = 0. Note that the decay of cis-LD in an admixed population will be 567 slowed if individuals mate assortatively by ancestry, because the trans-LD generated by assortative mating  Allele frequency divergence due to drift. How do these patterns of LD affect a population GWAS?

571
If allele frequency differences between populations arose from neutral drift, they will be independent 572 of effect sizes at causal loci and across loci, and therefore will not contribute, on average, a systematic 573 directional bias to effect size estimates. However, they will inflate the average squared effect size estimate,  Although within-family GWASs were not genetically confounded when the populations were separate 580 (because cis-and trans-LDs were equal, as discussed above), they become genetically confounded in the 581 admixed population, as all trans-LD is eliminated by random mating in the admixed population, leaving 582 an excess of cis-LD relative to trans-LD that biases effect size estimates (Eqs. 6 and 7). As in the case 583 of the population GWAS, these biases will be zero on average if allele frequency differences between the 584 ancestral populations were due to drift. However, after admixture, they will still inflate the average 585 squared effect size estimate (and thus the variance of effect size estimates), which will thereafter decline 586 in subsequent generations as the cis-LD is gradually broken down by recombination (Eq. 16; Fig. 5).

587
In comparing the average squared effect size estimate in a population and a family-based GWAS, we 588 observe that the value in the population GWAS rapidly declines to approximately the same level as the 589 value in the within-family GWAS, despite the former having started at a much higher level in the initial 590 admixed population (Fig. 5). The explanation is that LD between unlinked loci confounds effect size 591 estimation in the population GWAS but not the within-family GWAS, such that (i) the average squared 592 effect size estimate from the population GWAS is initially much higher than that from a within-family 593 GWAS, because it is inflated by LD between many more pairs of loci, and (ii) the average squared effect 594 size estimate from the population GWAS declines more rapidly, because LD between unlinked loci is 595 broken down more rapidly than LD between linked loci.

596
Allele frequency divergence due to selection or phenotype-biased migration. In addition to 597 drift, and as discussed above, selection and phenotype-biased migration can generate systematic, signed 598 (effect-size correlated) LD, which would lead to systematic cis-LD in the descendent admixed population.

599
These would lead to larger inflations of genetic variance and genetic correlations than would be expected 600 had allele frequency divergence between the ancestral populations been due to drift alone, and would 601 complicate interpretations of genetic correlations as being due to pleiotropy. Moreoever, if the admixed 602 population is more than a few generations old such that LD between unlinked loci but not linked loci 603 has largely been broken down, then population-and family-based estimates of these quantities might be   Under these same assumptions, we calculate in Appendix A3.4 the average per-locus attenuation 645 bias in effect size estimates induced by stabilizing selection, (α l −α l )/α l . In a population GWAS, this 646 attenuation bias is approximately

20
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made i.e., smaller than in a population GWAS by a factor of 1 − 2c h .  phenotype is β, so that the phenotypes of two siblings i and j can be written

706
Taking their difference and rearranging, we find that Therefore, in the absence of genetic confounding and G×E interactions, a sibling-based association study 709 would return an effect size estimate of

22
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 719 720 where Y * * = (1 + β)Y * . Therefore, if we were to randomly choose one sibling from each sibship and 721 estimate the effect size at the locus using a population association study across families, we would obtain  while if sibling indirect effects are antagonistic (β < 0), the population GWAS underestimates the direct 738 genetic effect. The reason for this difference is that a sibling GWAS is based on siblings whose genotypes 739 differ at the focal locus, and whose genotypic values are therefore anticorrelated. If sibling indirect effects 740 are synergistic (β > 0), they will tend to attenuate the phenotypic differences between such siblings, and 741 therefore attenuate effect size estimates. In contrast, because siblings' genotypes are positively correlated 742 across the entire population, synergistic sibling indirect effects (β > 0) will tend to exacerbate phenotypic 743 differences across families, leading a population GWAS to overestimate effect sizes.
752 23 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint then the average causal effect of the allele were it randomized across individuals from different families in 754 our sample. α f is the deviation of this effect in family f due to their environment, and α i is an individual 755 deviation which we assume to be independent of i's genotype; both α f and α i can be thought of as random 756 slopes in a mixed model. Note that α f can reflect the interaction of alleles with the family's external 757 environment (as we have framed it here) or with the family's genetic background (a G×G interaction).

758
If we perform a sibling GWAS by taking pairs of full siblings i and j in family f and regressing 759 the difference in their phenotypes ∆Y f = Y i − Y j on the difference in their genotypes at the focal locus 760 ∆g f = g i − g j , we obtain an effect size estimate

768
We can compare this estimate from a sibling GWAS to one from a population GWAS, again under 769 the assumption of no genetic confounding or indirect effects from siblings: where p is the frequency of the focal variant and F is the inbreeding coefficient at the locus (Appendix family-based studies. Following the the same logic above, interactions between parents' and offsprings' 796 24 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint conditional on the genetic background of a heterozygous parent. Thus, again, a non-random distribution 798 of genetic backgrounds in heterozygous parents is a potential source of bias.

799
One way that heterozygous parents might exhibit a non-random distribution of genetic backgrounds 800 is via trait-based assortative mating, which could therefore modify the way that epistasis and parent-801 offspring interactions influence effect size estimation in a family-based GWAS relative to a population 802 GWAS and relative to the true average population effect.  confounding, the effect of a causal allele of interest will depend on a set of weights: its LD to many other 917 causal alleles. In estimating the direct effect of the allele, family-based approaches weight these LD terms 918 differently to population-based approaches, which, we argue, can complicate the interpretation of these 919 estimates. For example, when previously isolated populations admix, same-ancestry alleles will be held 920 together in long genomic blocks until these are broken up by recombination, which will happen very 921 quickly for alleles on different chromosomes but more slowly for alleles on the same chromosome. A 922 few generations after admixture, therefore, cross-chromosome ancestry LD will largely have dissipated, 923 but contiguous ancestry tracts will still span substantial portions of chromosome lengths. Since both 924 population and within-family GWASs are similarly confounded by the same-chromosome LD, their mean 925 squared effect sizes will be similar in this case (Fig. 5). Bearing in mind that the LD resulting from  CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made If this (maximal) correlation is smaller than the upper bound of our target window (ρ 0 < ρ + ε, which 1018 very seldom occurred in our simulations), then females and males mate precisely according to their PGS 1019 rankings and we move on to the next generation. If, instead, ρ 0 exceeds ρ + ε, then we follow the following 1020 iterative procedure until we have found a mating structure under which the correlation of PGSs falls 1021 within ε of the target value ρ.

1022
First, we choose initial 'perturbation sizes' ξ 0 and ξ 1 = 2ξ 0 . Suppose that, in iteration k of the 1023 procedure, the perturbation size is ξ k and the chosen mating structure leads to a correlation among mates 1024 of ρ k . If |ρ k − ρ| < ε, we accept the mating structure and move on to the next generation. Otherwise, 1025 we choose a new perturbation size ξ k+1 : (i) if ρ k−1 , ρ k > ρ, then ξ k+1 = 2ξ k ; (ii) if ρ k−1 > ρ > ρ k or 1026 ρ k−1 < ρ < ρ k , then ξ k+1 = (ξ k−1 + ξ k )/2; (iii) if ρ k−1 , ρ k < ρ, then ξ k+1 = ξ k /2. Once we have chosen 1027 ξ k+1 , for each individual we perturb their PGS (trait 1 for females; trait 2 for males) by a value chosen 1028 from a normal distribution with mean 0 and standard deviation ξ k+1 , independently across individuals. 1029 We then rank females and males according to their perturbed PGSs, and calculate the correlation ρ k+1 of 1030 their true PGSs if they mate according to this ranking. (Since, in our experience, there can be substantial 1031 variance in the ρ k+1 values that result from this procedure, we repeat it 5 times and choose the mating 1032 30 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint iteration-i.e., another perturbation size ξ k+2 -is required. Fig. 2. Cross-trait assortative mating for traits with the same genetic architecture. In the 1035 simulations displayed in Fig. 2, ρ = 0.2, and traits 1 and 2 have identical but non-overlapping genetic 1036 architectures: L 1 and L 2 are distinct sets of 500 loci each, with α l = 1 and β l = 0 for l ∈ L 1 , and 1037 α l = 0 and β l = 1 for l ∈ L 2 . Loci in L 1 and L 2 alternate in an even spacing along the physical (bp) 1038 genome. Fig. 2A shows results for the 'single chromosome' case where the recombination fraction between 1039 adjacent loci is c = 1/999 in both sexes (such that the single-chromosome genome receives, on average, 1040 one crossover per transmission). Fig. 2B shows results for the case where recombination fractions between (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint density estimator ksdensity, specifying that the support of the distributions be positive.  B.

1-chrom map human map
average effect size estimate (true effect size = 1) generations Figure S1: Cross-trait assortative mating influences effect size estimates at loci that affect the study trait, although this influence is second-order relative to that on effect size estimates at loci that do not affect the study trait but do affect the other trait involved in assortative mating (note the scale of the y-axis). Simulations are the same as in Fig. 2. 40 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made where g l is the number of focal alleles at locus l carried by the individual, α d l is the direct genetic effect on the trait value of the focal allele at l (which we assume to be positive, without loss of generality), Cov g m λ , g f l /2 and Cov g pat λ , g m In the second line of Eq. (A.6), we have used the fact that covariances across parents translate to co-1404 variances across maternal and paternal genomes in the offspring. Note, however, that Cov g m λ , g f l and 1405 Cov g f λ , g m l need not, in general be equal-e.g., they will not be so under sex-based cross-trait assortative 1406 mating-which is why we could not apply a similar simplification to Eq. (A.5).
where D ′ λl andD ′ λl are the cis-and trans-LD between the focal/traitincreasing alleles at λ and l among parents.. Similarly, where c λl is the sex-averaged recombination fraction between λ and l. Since Cov(∆g l , ∆ϵ) = 0, and 1445 recognizing that, for l ∈ L local , c λl ≈ 0 and |D λl | ≪ |D λl | in expectation, we recover Eq. The regressions of the trait value on the transmitted and untransmitted genotypes are where we have used the fact that, since transmission at λ is random, Var g T λ = Var g U λ = Var (g λ ).

1459
The estimate of the direct effect of the focal variant at λ is then . 1461 We have l l + g f,pat Cov g matT Similarly, where c λl is the sex-averaged recombination fraction between λ and l. Therefore, the transmitted- Estimating indirect effects

1492
The coefficient in the regression of the trait value Y on the untransmitted genotype g U λ at locus λ,α U λ , 1493 has sometimes been considered to provide an estimate of the indirect 'family' effect of the focal variant 1494 at λ:α i λ =α U λ . From Eq. (A.10) and its analog for the paternally untransmitted allele, where c λl is the sex-averaged recombination fraction. In this expression, If we assume that indirect effects via the maternal and paternal families are equal (α i,m then Eq. (A.12) simplifies further to In this case, the estimate of the indirect effect of the focal allele at λ is .
(A.14) The effect size estimates from the population GWAS for trait 1 are . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint In a family-based study, we might instead be interested in the covariance between siblings' differences 1537 in the trait-1 population PGS and their differences in trait 2. We can write this covariance in our model where c λl is the sex-averaged recombination fraction between λ and l. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

1573
The effect size estimates from a population GWAS are in fact

1575
D λl ′ andD λl ′ are measured in the sample. We assume these to be equal to the values in parents in the  (1 − 2c λl ) D 2 λl −D 2 λl α l β l /V λ covariance from LD absorbed by PGS because effect size estimates absorb LD (1 − 2c λl )D 2 λl α 2 l /V λ variance from LD absorbed by PGS because effect size estimates absorb LD

1606
The value of D * ll ′ will, in general, depend in a complicated way on the strength of effects of l and l ′ on 1607 the traits upon which assortative mating is based and on the linkage relations of these loci to one another 1608 and to other causal loci. However, while it is therefore difficult to calculate the individual equilibrium LD 1609 terms D * ll ′ , we can in some cases calculate weighted sums of these terms across locus pairs.

1610
Let the set of loci that influence one or both traits be L, and let α l be the effect size of the focal variant 1611 at locus l on trait 1 and β l its effect on trait 2 (the analyses below also apply to same-trait assortative 1612 mating, setting α l = β l ). Recall the notation g m,mat l and g m,pat A3.1.1 Same-trait assortative mating, or cross-trait assortative mating that is symmetric 1630 with respect to sex 1631 We first consider the case where the strength of assortative mating between two traits, as measured by 1632 their correlation coefficient across mating pairs, is equal in the female-male and male-female directions.

1633
Notice that this scenario covers same-trait assortative mating. In the case of cross-trait assortative 1634 mating, it could occur if assortative mating arises by mechanisms other than direct female (or male) 1635 mating preferences. 1636 We assume that there is a constant correlation ρ among mating pairs for their phenotypic values of 1637 traits 1 and 2. In equilibrium, this will translate to a constant correlation ρ G between their breeding 1638 values as well (e.g., Felsenstein 1981). To calculate ρ G , we first note that, because assortative mating is 1639 based on phenotypic values and not breeding values per se, if we know the phenotypes of a pair of mates, 1640 we obtain no further information about the similarity of their breeding values; that is, For the same reason, if we know the phenotypic values of two mates, then the trait-2 value of the male 1643 does not offer any information on the female's trait-1 breeding value beyond that already offered by the where h 2 1 and h 2 2 are the heritabilities of traits 1 and 2, respectively.  likewise each amount to l∈L l ′ ∈LD ll ′ α l β l ′ , and so Noting that the trans-covariance at a given locusD ll = p l (1−p l )r ll , wherer ll is the within-locus correlation Similarly, the second term is 1699 Var G m,pat Putting these together in Eq. (A.34), Similarly, where V 1 g and V 2 g are the genic variances of traits 1 and 2, and the approximations come from the fact 1715 that, under assortative mating for a polygenic trait, the sum of the ∼|L| 2 cross-locus trans-LD termsD * ll ′

1716
dominates the sum of the |L| within-locus trans-LD termsD * ll = p l (1 − p l )r * ll (Crow and Kimura 1970Kimura , 1717 Ch. 4). Eq. (A.29) in equilibrium is therefore Since, in equilibrium, D ll ′ =D ll ′ , this expression can also be written Because the additive genetic variance which is a classic result (e.g., Wright 1921; Crow and Kimura 1970, Ch. 4).

1733
If we make the further assumption that effect sizes are the same across loci (α l = α for all l ∈ L), then Eq. (A.41) becomes In a population association study at locus l, assuming no indirect effects and no sources of genetic confounding other than assortative mating, the effect size estimate is so that the proportionate bias in the effect size estimate at l is across loci is then Because assortative mating is cross-trait, the LDs that assortative mating induces across L 1 and L 2 1761 will dominate the second-order LDs induced within L 1 and within L 2 . Therefore, The effect size estimate at a locus l ∈ L 1 in a population GWAS on trait 2 is while the true effect size β l is zero, since l / ∈ L 2 . In equilibrium, the average effect size estimate, and thus the average deviation of these estimates from the true values, is therefore

56
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint heterozygosity in L 1 ). If we further assume that effect sizes at causal loci are equal for each trait (α l = α for all l ∈ L 1 and β l ′ = β for all l ′ ∈ L 2 ), then Eq. (A.50) can be written In the further special case where both the genetic and the phenotypic variances of the two traits are 1777 equal, then so are the heritabilities of the two traits. In this case, Eq. (A.51) simplifies to underlying trait 1-i.e., if trait 1 has a more concentrated genetic architecture L 1 < L 2 -then the effect size estimates at non-causal loci will be closer to those at causal loci. Indeed, if trait 1 has a sufficiently To study the genetic consequences of this assortment, we need to know the average bi-directional corre-1826 lation among mates for traits 1 and 2 (Eq. A.29). Since traits 1 and 2 will come into a positive genetic 1827 correlation via assortative mating of female trait 1 and male trait 2, there will also be a positive covariance 1828 between mothers' breeding values for trait 2 and fathers' breeding values for trait 1, which we can express 1829 using the law of total covariance: where ρ m1,m2 = Corr (G m 1 , G m 2 ) is the genetic correlation between traits 1 and 2 in mothers, and where we 1839 have assumed that the two traits have equal variance. Similarly, if G f 1 and G f 2 are bivariate normal, then But, in our case, ρ m1,m2 = ρ f1,f2 , the common value of which we shall call ρ 12 , and so the average 1844 bi-directional correlation is Given this value, the calculations of the effect of assortative mating on the weighted sums of cis-and trans-1847 covariances, and thus on the additive genetic variance, proceed as for the case of symmetric assortative 1848 mating above.

1849
Assuming the genetic bases of the two traits to be distinct, we may substitute the average bi-directional 1850 correlation, ρ G 1 + ρ 2 12 /2, into Eq. (A.48) to find and so Eq. (A.57) can be written as the quadratic equation ρ G (1 + ρ 2 12 ) = 2ρ 12 , the relevant solution 1855 to which is ρ 12 = 1 − 1 − ρ 2 G /ρ G . If ρ G is small, we use the first-order Taylor approximation In the particular scenario we have simulated in Fig. 2, V 1 g = V 2 g , α l = 1 for all l ∈ L 1 , and β l = 1 for 1859 all l ∈ L 2 , so Eq. (A.58) further simplifies to In a population association study for trait 2 performed at a locus l ∈ L 1 (so that β l = 0), Across loci in L 1 , the average estimate is Under this configuration, The trait we simulated is genetic, with heritability 1, and so ρ G = ρ, the phenotypic correlation among 60 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made tween two markers λ and l can be written generally as where ∆g m i,k and ∆g p i,k are the deviations of individual i's maternal and paternal focal allele count at locus 1928 k from their mean frequencies. The trans-LD between λ and l is These cis-and trans-LD terms are equal only if i.e., if the maternal and paternal alleles at the one locus are exchangeable with respect to deviations of 1933 the allelic state at the other locus. 1934 We might often be concerned with stratification along some specific axis of variation in our sample. Call When t generations have elapsed since admixture, this cis-LD will have been eroded by recombination to 1956 62 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The phenotypic variance V P can be written together with the definition of heritability h 2 = V G /V P , define a quadratic equation in d: To see which of these roots is the relevant one, we first note that the roots are both real, since the 2025 requirement for this is and

2036
i.e., V g + d − < 0. But then if the relevant root were d = d − , 0 ≤ V G = V g + d − < 0, a contradiction. So 2037 the relevant root is in fact where X = V S +V E Vg . Since, in the absence of selection, V G = V g , Eq. (A.78) gives the proportionate 2042 reduction in the additive genetic variance due to selection.  (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint The proportionate bias in a population GWAS, given by Eq. (A.82), can similarly be estimated from h 2 , 2091 V P , V S , andc h , by first observing that so that Eq. (A.82) can be written into distinct sets K 1 and K 2 (e.g., K 1 could be the set of odd numbered chromosomes and K 2 the 2119 even). Let L (1) and L (2) be the sets of causal loci on the chromosomes in K 1 and K 2 respectively (i.e., 2120 L (i) = ∪ k∈K i L k ).

2121
Suppose that we have accurately estimated effect sizes at all loci l ∈ L. For each individual, we then 2122 calculate a polygenic score for K 1 and for K 2 :

2124
We are interested in the correlation in the population between P 1 and P 2 , and in particular, how this Since, to make progress in the case of stabilizing selection, we will assume effect sizes to be equal across 2134 loci, we make that assumption now, so that 2135 Cov(P 1 , P 2 ) = 2α 2 l∈L (1) l ′ ∈L (2) D ll ′ +D ll ′ . (A.90) moreover, cis-and trans-LDs are equal (see above). Therefore, to calculate D * (=D * ), we simply 2144 apportion the total LD given by Eq. (A.40) among individual locus pairs: where we have dropped the equilibrium ' * ' markers. This expression does not easily decompose into 2156 terms from individual locus pairs. However, if we assume that stabilizing selection is relatively weak 2157 (V S /V * P ≫ 1) and that the recombination process is such that the harmonic mean recombination rate 2158c h ∼ 1/2 (as is the case in humans), Eq. (A.95) can be approximated by from which we infer that, in expectation, 2161 2α 2 D ll ′ ≈ − h 2 V g 1 + V S /V P · 1/c ll ′ |L|(|L| − 1) .

2185
Sibling GWAS. Let i and j be siblings in family f , and define ∆Y f = Y i − Y j , ∆g f = g i − g j , and 2186 ∆ϵ f = ϵ i − ϵ j . A sibling association study returns an effect size estimate 2187α sib = Cov (∆Y f , ∆g f ) Var (∆g f ) = Cov (α + α f ) ∆g f + (α i g i − α j g j ) + ∆ϵ f , ∆g f Var (∆g f ) Population GWAS. Under the same one-locus model, a population association study returns an effect 2208 size estimate of We can immediately see from Eq. (A.102) that if the family environments are randomized across genotypes, 2213 such that α f and g i are independent (implying Cov (α f g i , g i ) = 0), then the population estimate will 2214 coincide with α.

2215
To calculate the deviation of the population estimate from α in the general case, let F be the inbreeding Cov

2226
The deviation of the population-based estimate from α is therefore

72
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted February 27, 2023. ; https://doi.org/10.1101/2023.02.26.530052 doi: bioRxiv preprint